Welcome to Flupy
flupy is a lightweight library and CLI for implementing Python data pipelines with a fluent interface.
Under the hood, flupy is built on generators. That means its pipelines evaluate lazily and use a constant amount of memory no matter how much data is being processed. This allows flupy to handle petabyte-scale data manipulation as easily as it operates on a small list.
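To illustrate what lazy, generator-based evaluation means, here is a minimal sketch using only plain Python generators rather than flupy itself. The names `numbers`, `squares`, and `pulled` are hypothetical and exist only for this illustration; the point is that an item is produced only when a downstream step asks for it.

```python
from itertools import islice

pulled = []  # records which source items were actually consumed


def numbers(limit):
    """Hypothetical source: yields 0, 1, 2, ... below limit, one at a time."""
    for n in range(limit):
        pulled.append(n)
        yield n


def squares(items):
    """Hypothetical stage: transforms items lazily, holding one at a time."""
    for item in items:
        yield item * item


# Building the pipeline does no work yet; nothing has been pulled.
pipeline = squares(numbers(10**12))
assert pulled == []

# Taking three results pulls only three items from the very large source.
first_three = list(islice(pipeline, 3))
print(first_three)
# [0, 1, 4]
print(pulled)
# [0, 1, 2]
```

Because each stage holds a single item at a time, memory use in this sketch does not grow with the size of the source.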
API
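import json
from flupy import flu

# Open the JSON Lines file in a context manager so the handle is closed afterwards
with open('logs.jl', 'r') as logs:
    error_count = (
        flu(logs)
        .map(json.loads)                           # parse each line as JSON
        .filter(lambda x: x['level'] == 'ERROR')   # keep only error records
        .count()                                   # consume the pipeline and count
    )

print(error_count)
# 14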
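As a rough guide to what each step does, the same count can be sketched with standard-library generator expressions. This is an illustration of the idea, not flupy's implementation, and the in-memory `sample` data is a hypothetical stand-in for the contents of `logs.jl`.

```python
import io
import json

# Hypothetical sample data standing in for the contents of 'logs.jl'.
sample = io.StringIO(
    '{"level": "INFO", "msg": "started"}\n'
    '{"level": "ERROR", "msg": "disk full"}\n'
    '{"level": "ERROR", "msg": "timeout"}\n'
)

records = (json.loads(line) for line in sample)          # comparable to .map(...)
errors = (r for r in records if r["level"] == "ERROR")   # comparable to .filter(...)
error_count = sum(1 for _ in errors)                     # comparable to .count()

print(error_count)
# 2
```

Nothing is read from `sample` until `sum` starts consuming `errors`, which is the same lazy behaviour the fluent pipeline relies on.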
CLI
The flupy library and the Python runtime are also accessible from the flu command-line utility:
$ cat logs.txt | flu "_.filter(lambda x: x.startswith('ERROR'))"
For more information about the flu command, see the command line documentation.
Example
Since 2008, what domains have our customers been coming from?
from flupy import flu

customers = [
    {'name': 'Jane', 'signup_year': 2018, 'email': 'jane@ibm.com'},
    {'name': 'Fred', 'signup_year': 2011, 'email': 'fred@google.com'},
    {'name': 'Lisa', 'signup_year': 2014, 'email': 'jane@ibm.com'},
    {'name': 'Jack', 'signup_year': 2007, 'email': 'jane@apple.com'},
]

pipeline = (
    flu(customers)
    .filter(lambda x: x['signup_year'] > 2008)
    .map_item('email')
    .map(lambda x: x.partition('@')[2])
    .group_by()  # defaults to identity
    .map(lambda x: (x[0], x[1].count()))
    .collect()
)

print(pipeline)
# [('google.com', 1), ('ibm.com', 2)]
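To make the grouping step concrete, the sketch below reproduces the same result with `itertools.groupby` from the standard library. It assumes that grouping works on keys in sorted order, which matches the sorted output shown above; it is not a statement about how flupy implements `group_by` internally.

```python
from itertools import groupby

customers = [
    {'name': 'Jane', 'signup_year': 2018, 'email': 'jane@ibm.com'},
    {'name': 'Fred', 'signup_year': 2011, 'email': 'fred@google.com'},
    {'name': 'Lisa', 'signup_year': 2014, 'email': 'jane@ibm.com'},
    {'name': 'Jack', 'signup_year': 2007, 'email': 'jane@apple.com'},
]

# Filter by signup year, then keep the part of each email after the '@'.
domains = [
    c['email'].partition('@')[2]
    for c in customers
    if c['signup_year'] > 2008
]

# itertools.groupby only merges adjacent equal keys, so sort first.
domain_counts = [
    (domain, sum(1 for _ in members))
    for domain, members in groupby(sorted(domains))
]

print(domain_counts)
# [('google.com', 1), ('ibm.com', 2)]
```

The sort is the one step here that has to see all of its input before producing anything, so it is worth keeping in mind when grouping very large streams.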