A Beginner's Guide to Building Data Pipelines with Luigi



Where should I focus my outbound sales and marketing efforts to yield the highest possible ROI?

[Diagram: UK Limited Companies data + Customer CRM Data feed into a Predictive Model]

With big data comes big responsibility

Hard to maintain, extend, and… look at.

Script Soup


The Old Way

Define a command line interface for every task?

if __name__ == '__main__':
    today = datetime.now().isoformat()[:10]  # <- custom date handling

    arg_parser = argparse.ArgumentParser(prog='COMPANIES HOUSE PARSER',
                                         description='Process arguments')
    arg_parser.add_argument('--files', nargs='?', dest='file_names',
                            help='CSV files to read (supports globstar wildcards)',
                            required=True)
    arg_parser.add_argument('--batch', nargs='?', dest='batch_size', type=int,
                            help='Number of rows to save to DB at once')
    arg_parser.add_argument('--date', nargs='?', dest='table_date',
                            help='Date that these data were released')
    arg_parser.add_argument('--log_level', dest='log_level',
                            help='Log level to screen', default='INFO')
    args = arg_parser.parse_args()

    log = GLoggingFactory().getLoggerFromPath(
        '/var/log/companies-house-load-csv.log-{}'.format(today))
    log.setLogLevel(args.log_level, 'screen')  # <- custom logging

    table_date = parse_date(args.table_date, datetime.now())

    log.info('Starting Companies House data loader...')
    ch_loader = CompaniesHouseLoader(logger=log, col_mapping=col_mapping,
                                     table_date=table_date)
    ch_loader.go(args.file_names)  # <- what to do if this fails?

    log.info('Loader complete. Starting Companies House updater')
    ch_updater = CompaniesHouseUpdater(logger=log, table_date=table_date,
                                       company_status_params=company_status_params)
    ch_updater.go()  # <- need to clean up if this fails

The Old Way

Long, processor-intensive tasks stacked together

● Open-sourced & maintained by Spotify data team

● Created by Erik Bernhardsson and Elias Freider. Maintained by Arash Rouhani.

● Abstracts batch processing jobs

● Makes it easy to write modular code and create dependencies between tasks.

Luigi to the rescue!

● Task templating

● Dependency graphs

● Resumption of data flows after intermediate failure

● Command line integration

● Error emails

Luigi

Luigi 101 - Counting the number of companies in the UK

[Diagram: companies.csv → Count companies → count.txt, wired up via input() and output()]

class CompanyCount(luigi.Task):

    def output(self):
        return luigi.LocalTarget("count.csv")

    def run(self):
        count = count_unique_entries("companies.csv")
        with self.output().open("w") as out_file:
            out_file.write(str(count))  # Target files take strings, not ints

Company Count Job in Luigi code

Luigi 101 - Keeping our count up to date

[Diagram: Companies Data Server → Companies Download → output() → companies.csv → input() → Count companies → output() → count.txt; CompanyCount requires() CompanyDownload]

class CompanyCount(luigi.Task):

    def requires(self):
        return CompanyDownload()

    def output(self):
        return luigi.LocalTarget("count.csv")

    def run(self):
        count = count_unique_entries(self.input())
        with self.output().open("w") as out_file:
            out_file.write(str(count))  # Target files take strings, not ints

Company count with download dependency

the output of the required task

this task must complete before CompanyCount runs

Download task

class CompanyDownload(luigi.Task):

    def output(self):
        return luigi.LocalTarget("companies.csv")

    def run(self):
        data = get_company_download()
        with self.output().open('w') as out_file:
            out_file.write(data)

local output to be picked up by previous task

download the data and write it to the output Target

$ python company_flow.py CompanyCount --local-scheduler
DEBUG: Checking if CompanyCount() is complete
DEBUG: Checking if CompanyDownload() is complete
INFO: Scheduled CompanyCount() (PENDING)
INFO: Scheduled CompanyDownload() (PENDING)
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 10076] Worker Worker(...) running CompanyDownload()
INFO: [pid 10076] Worker Worker(...) done CompanyDownload()
DEBUG: 1 running tasks, waiting for next task to finish
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 10076] Worker Worker(...) running CompanyCount()
INFO: [pid 10076] Worker Worker(...) done CompanyCount()
DEBUG: 1 running tasks, waiting for next task to finish
DEBUG: Asking scheduler for work...
INFO: Done

Time-dependent tasks - change in companies

[Diagram: Companies Count Task (Date 1), (Date 2), and (Date 3) each provide input() to a Companies Delta task, which output()s company_count_delta.txt]

class AnnualCompanyCountDelta(luigi.Task):

    year = luigi.Parameter()

    def requires(self):
        tasks = []
        for month in range(1, 13):
            tasks.append(CompanyCount(dt.datetime.strptime(
                "{}-{}-01".format(self.year, month), "%Y-%m-%d").date()))
        return tasks

    # not shown: output(), run()

Parameterising Luigi tasks

define parameter

generate dependencies
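The omitted run() would read the twelve monthly counts and write their month-on-month changes. The core arithmetic might look like this sketch (monthly_deltas is our name for it, not something from the talk):

```python
def monthly_deltas(counts):
    """Month-on-month change: each count minus the previous month's count.

    Hypothetical core of the omitted run(); takes the counts read from the
    twelve CompanyCount targets, returns eleven deltas.
    """
    return [curr - prev for prev, curr in zip(counts, counts[1:])]
```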

class CompanyCount(luigi.Task):

    date = luigi.DateParameter(default=datetime.date.today())

    def requires(self):
        return CompanyDownload(self.date)

    def output(self):
        return luigi.LocalTarget("count.csv")

    def run(self):
        count = count_unique_entries(self.input())
        with self.output().open("w") as out_file:
            out_file.write(str(count))  # Target files take strings, not ints

Adding the date dependency to Company Count


The central scheduler

$ luigid &  # start central scheduler in background
$ python company_flow.py AnnualCompanyCountDelta --year 2014

The scheduler's web UI runs at localhost:8082 by default.

Persisting our data

[Diagram: Companies Data Server → Companies Download(Date) → output() → companies.csv; Count companies(Date) requires(Date) the download and writes count.txt via output(); CompaniesToMySQL(Date) also requires(Date) the download and writes to a SQL database via output()]

class CompaniesToMySQL(luigi.contrib.sqla.CopyToTable):

    date = luigi.DateParameter()
    columns = [(["name", String(100)], {}), ...]
    connection_string = "mysql://localhost/test"  # or something
    table = "companies"  # name of the table to store data

    def requires(self):
        return CompanyDownload(self.date)

    def rows(self):
        for row in self.get_unique_rows():  # uses self.input()
            yield row

Persisting our data

My pipes broke

# ./client.cfg
[core]
error-email: dylan@growthintel.com, stuart@growthintel.com

Things we missed out

There are lots of task types available that we haven’t mentioned:

● Hadoop
● Spark
● ssh
● Elasticsearch
● Hive
● Pig
● etc.

Check out the luigi.contrib package

class CompanyCount(luigi.contrib.hadoop.JobTask):

    chunks = luigi.Parameter()

    def requires(self):
        return [CompanyDownload(chunk) for chunk in self.chunks]

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget("companies_count.tsv")

    def mapper(self, line):
        yield "count", 1

    def reducer(self, key, values):
        yield key, sum(values)

Counting the companies using Hadoop

split input into chunks

HDFS target

map and reduce methods instead of run()
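The mapper/reducer pair above can be sanity-checked without a cluster by simulating the map → shuffle → reduce phases locally (run_local is our helper for illustration, not part of the Luigi API):

```python
from collections import defaultdict

def mapper(line):
    # one ("count", 1) pair per input line, as in the JobTask above
    yield "count", 1

def reducer(key, values):
    yield key, sum(values)

def run_local(lines):
    """Simulate map -> shuffle (group values by key) -> reduce in-process."""
    groups = defaultdict(list)
    for line in lines:
        for k, v in mapper(line):
            groups[k].append(v)
    results = {}
    for k, vals in groups.items():
        for key, total in reducer(k, vals):
            results[key] = total
    return results
```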

● Doesn’t provide a way to trigger flows

● Doesn’t support distributed execution

Luigi Limitations

Onwards

● The docs: http://luigi.readthedocs.org/

● The mailing list: https://groups.google.com/forum/#!forum/luigi-user/

● The source: https://github.com/spotify/luigi

● The maintainers are really helpful, responsive, and open to any and all PRs!

Stuart Coleman: @stubacca81 / stuart@growthintel.com

Dylan Barth: @dylan_barth / dylan@growthintel.com

Thanks!

We’re hiring Python data scientists & engineers!
http://www.growthintel.com/careers/