Welcome to Flupy

flupy is a lightweight library and CLI for building Python data pipelines with a fluent interface.

Under the hood, flupy is built on generators. Its pipelines therefore evaluate lazily, and steps that stream one item at a time use a roughly constant amount of memory no matter how much data is processed. This lets flupy handle petabyte-scale data manipulation as easily as it operates on a small list.
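To make the lazy-evaluation point concrete, here is a minimal sketch that uses only plain Python generators rather than flupy itself. The names numbers, squares, and even_squares are made up for illustration; the idea is that each stage pulls one item at a time from the stage before it, so even an infinite source can be processed without holding it in memory.

```python
import itertools


def numbers():
    """Hypothetical infinite source: yields 0, 1, 2, ... one value at a time."""
    n = 0
    while True:
        yield n
        n += 1


# Each stage wraps the previous generator; nothing runs until values are requested.
squares = (n * n for n in numbers())
even_squares = (s for s in squares if s % 2 == 0)

# Pull only the first five results; the infinite source is never materialized.
first_five = list(itertools.islice(even_squares, 5))
print(first_five)
# [0, 4, 16, 36, 64]
```

A fluent pipeline built on generators chains stages in the same way, which is why memory use depends on the individual steps rather than on the size of the input.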

API

import json
from flupy import flu

# Each line of logs.jl is assumed to hold one JSON document (JSON Lines)
with open('logs.jl', 'r') as logs:
    error_count = (
        flu(logs)
        .map(json.loads)
        .filter(lambda x: x['level'] == 'ERROR')
        .count()
    )

print(error_count)
# 14

CLI

The flupy library and the Python runtime are also accessible from the flu command line utility:

$ cat logs.txt | flu "_.filter(lambda x: x.startswith('ERROR'))"

For more information about the flu command, see the command line documentation.
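As a rough illustration of what the command above computes, the following plain-Python sketch filters a stream of lines the same way. It assumes that _ in the flu expression stands for the lines of standard input; the error_lines helper and the sample data are hypothetical, and an in-memory buffer stands in for the piped file.

```python
import io


def error_lines(lines):
    """Lazily keep only the lines that start with 'ERROR'."""
    return (line for line in lines if line.startswith('ERROR'))


# Hypothetical sample input standing in for `cat logs.txt`.
sample = io.StringIO("ERROR disk full\nINFO started\nERROR timeout\n")

result = list(error_lines(sample))
print(result)
# ['ERROR disk full\n', 'ERROR timeout\n']
```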

Getting Started

Requirements

Python 3.6+

Installation

$ pip install flupy

Example

Which email domains do customers who signed up after 2008 come from?

from flupy import flu

customers = [
    {'name': 'Jane', 'signup_year': 2018, 'email': 'jane@ibm.com'},
    {'name': 'Fred', 'signup_year': 2011, 'email': 'fred@google.com'},
    {'name': 'Lisa', 'signup_year': 2014, 'email': 'jane@ibm.com'},
    {'name': 'Jack', 'signup_year': 2007, 'email': 'jane@apple.com'},
]

pipeline = (
    flu(customers)
    .filter(lambda x: x['signup_year'] > 2008)
    .map_item('email')
    .map(lambda x: x.partition('@')[2])
    .group_by()  # key defaults to identity; yields (key, group) pairs
    .map(lambda x: (x[0], x[1].count()))  # (domain, number of customers)
    .collect()
)

print(pipeline)
# [('google.com', 1), ('ibm.com', 2)]
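For readers who want to check the result by hand, the same computation can be sketched with the standard library alone. This is not how flupy is implemented; it assumes that group_by sorts its input before grouping, as the sorted output above suggests, and it reuses the customers list from the example.

```python
from itertools import groupby

customers = [
    {'name': 'Jane', 'signup_year': 2018, 'email': 'jane@ibm.com'},
    {'name': 'Fred', 'signup_year': 2011, 'email': 'fred@google.com'},
    {'name': 'Lisa', 'signup_year': 2014, 'email': 'jane@ibm.com'},
    {'name': 'Jack', 'signup_year': 2007, 'email': 'jane@apple.com'},
]

# Keep customers who signed up after 2008, then extract the email domain.
recent = (c for c in customers if c['signup_year'] > 2008)
domains = (c['email'].partition('@')[2] for c in recent)

# groupby only merges adjacent equal items, so sort first, then count each group.
counts = [
    (domain, sum(1 for _ in group))
    for domain, group in groupby(sorted(domains))
]

print(counts)
# [('google.com', 1), ('ibm.com', 2)]
```

The fluent version expresses the same four steps (filter, extract, group, count) as one chain that reads top to bottom.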