A Beginner's Guide to Building Data Pipelines with Luigi

Page 1: A Beginner's Guide to Building Data Pipelines with Luigi

A Beginner's Guide to Building Data Pipelines with Luigi

Page 2: A Beginner's Guide to Building Data Pipelines with Luigi

Where should I focus my outbound sales and marketing efforts to yield the highest possible ROI?

UK Limited Companies

Customer CRM Data

Predictive Model

Page 3: A Beginner's Guide to Building Data Pipelines with Luigi

With big data comes big responsibility

Page 4: A Beginner's Guide to Building Data Pipelines with Luigi

Hard to maintain, extend, and… look at.

Script Soup

omg moar codez

Code

More Codes

Page 5: A Beginner's Guide to Building Data Pipelines with Luigi

if __name__ == '__main__':
    today = datetime.now().isoformat()[:10]  # <- Custom date handling

    arg_parser = argparse.ArgumentParser(prog='COMPANIES HOUSE PARSER',
                                         description='Process arguments')
    arg_parser.add_argument('--files', nargs='?', dest='file_names',
                            help='CSV files to read (supports globstar wildcards)',
                            required=True)
    arg_parser.add_argument('--batch', nargs='?', dest='batch_size', type=int,
                            help='Number of rows to save to DB at once')
    arg_parser.add_argument('--date', nargs='?', dest='table_date',
                            help='Date that these data were released')
    arg_parser.add_argument('--log_level', dest='log_level',
                            help='Log level to screen', default='INFO')
    args = arg_parser.parse_args()

The Old Way

Define a command line interface for every task?

Page 6: A Beginner's Guide to Building Data Pipelines with Luigi

log = GLoggingFactory().getLoggerFromPath(
    '/var/log/companies-house-load-csv.log-{}'.format(today))
log.setLogLevel(args.log_level, 'screen')  # <- Custom logging

table_date = parse_date(args.table_date, datetime.now())

log.info('Starting Companies House data loader...')
ch_loader = CompaniesHouseLoader(logger=log, col_mapping=col_mapping,
                                 table_date=table_date)
ch_loader.go(args.file_names)  # <- What to do if this fails?

log.info('Loader complete. Starting Companies House updater')
ch_updater = CompaniesHouseUpdater(logger=log, table_date=table_date,
                                   company_status_params=company_status_params)
ch_updater.go()  # <- Need to clean up if this fails

The Old Way

Long, processor-intensive tasks stacked together

Page 7: A Beginner's Guide to Building Data Pipelines with Luigi

● Open-sourced & maintained by Spotify data team

● Created by Erik Bernhardsson and Elias Freider. Maintained by Arash Rouhani.

● Abstracts batch processing jobs

● Makes it easy to write modular code and create dependencies between tasks.

Luigi to the rescue!

Page 8: A Beginner's Guide to Building Data Pipelines with Luigi

● Task templating

● Dependency graphs

● Resumption of data flows after intermediate failure

● Command line integration

● Error emails

Luigi

Page 9: A Beginner's Guide to Building Data Pipelines with Luigi

Luigi 101 - Counting the number of companies in the UK

companies.csv --input()--> Count companies --output()--> count.txt

Page 10: A Beginner's Guide to Building Data Pipelines with Luigi

class CompanyCount(luigi.Task):

    def output(self):
        return luigi.LocalTarget("count.csv")

    def run(self):
        count = count_unique_entries("companies.csv")
        with self.output().open("w") as out_file:
            out_file.write(count)

Company Count Job in Luigi code
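The helper count_unique_entries is never shown in the slides. A minimal sketch of what it might look like, assuming the CSV has a header row with a CompanyName column (the real Companies House dump uses its own header names):

```python
import csv

def count_unique_entries(path, column="CompanyName"):
    """Count distinct values in one CSV column, returned as text.

    The column name is an assumption; this helper is hypothetical.
    """
    seen = set()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            seen.add(row[column])
    return str(len(seen))  # Target.open("w") expects a string, not an int
```

Returning a string rather than an int matters here, since the task writes the result straight to a file-like Target.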

Page 11: A Beginner's Guide to Building Data Pipelines with Luigi

Luigi 101 - Keeping our count up to date

Companies Data Server --> Companies Download --output()--> companies.csv --input()--> Count companies --output()--> count.txt

Count companies requires() Companies Download

Page 12: A Beginner's Guide to Building Data Pipelines with Luigi

class CompanyCount(luigi.Task):

    def requires(self):
        return CompanyDownload()

    def output(self):
        return luigi.LocalTarget("count.csv")

    def run(self):
        count = count_unique_entries(self.input())
        with self.output().open("w") as out_file:
            out_file.write(count)

Company count with download dependency

the output of the required task

this task must complete before CompanyCount runs

Page 13: A Beginner's Guide to Building Data Pipelines with Luigi

Download task

class CompanyDownload(luigi.Task):

    def output(self):
        return luigi.LocalTarget("companies.csv")

    def run(self):
        data = get_company_download()
        with self.output().open('w') as out_file:
            out_file.write(data)

local output to be picked up by previous task

download the data and write it to the output Target
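Like count_unique_entries, get_company_download is left undefined in the slides. A standard-library sketch, with a clearly hypothetical placeholder URL (the real Companies House endpoint is not given in the talk):

```python
from urllib.request import urlopen

# Placeholder only: the real download URL is not shown in the slides.
COMPANIES_URL = "http://example.com/companies.csv"

def get_company_download(url=COMPANIES_URL):
    """Fetch the company snapshot and return it as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")
```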

Page 14: A Beginner's Guide to Building Data Pipelines with Luigi

$ python company_flow.py CompanyCount --local-scheduler
DEBUG: Checking if CompanyCount() is complete
DEBUG: Checking if CompanyDownload() is complete
INFO: Scheduled CompanyCount() (PENDING)
INFO: Scheduled CompanyDownload() (PENDING)
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 10076] Worker Worker(...) running CompanyDownload()
INFO: [pid 10076] Worker Worker(...) done CompanyDownload()
DEBUG: 1 running tasks, waiting for next task to finish
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 10076] Worker Worker(...) running CompanyCount()
INFO: [pid 10076] Worker Worker(...) done CompanyCount()
DEBUG: 1 running tasks, waiting for next task to finish
DEBUG: Asking scheduler for work...
INFO: Done

Page 15: A Beginner's Guide to Building Data Pipelines with Luigi

Time dependent tasks - change in companies

Companies Count Task (Date 1) --input()-->
Companies Count Task (Date 2) --input()--> Companies Delta --output()--> company_count_delta.txt
Companies Count Task (Date 3) --input()-->

Page 16: A Beginner's Guide to Building Data Pipelines with Luigi

class AnnualCompanyCountDelta(luigi.Task):

    year = luigi.Parameter()

    def requires(self):
        tasks = []
        for month in range(1, 13):
            tasks.append(CompanyCount(dt.datetime.strptime(
                "{}-{}-01".format(self.year, month), "%Y-%m-%d")))
        return tasks

    # not shown: output(), run()

Parameterising Luigi tasks

define parameter

generate dependencies
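The dependency-generation loop is plain Python and can be checked on its own, without Luigi installed. A sketch that mirrors the loop in requires():

```python
import datetime as dt

def month_starts(year):
    """The twelve first-of-month datetimes a year's delta task depends on."""
    return [dt.datetime.strptime("{}-{}-01".format(year, month), "%Y-%m-%d")
            for month in range(1, 13)]
```

Each of those datetimes becomes the parameter of one CompanyCount task, so scheduling the delta for a year fans out into twelve monthly counts.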

Page 17: A Beginner's Guide to Building Data Pipelines with Luigi

class CompanyCount(luigi.Task):

    date = luigi.DateParameter(default=datetime.date.today())

    def requires(self):
        return CompanyDownload(self.date)

    def output(self):
        return luigi.LocalTarget("count.csv")

    def run(self):
        count = count_unique_entries(self.input())
        with self.output().open("w") as out_file:
            out_file.write(count)

Adding the date dependency to Company Count

added date dependency to company count

Page 18: A Beginner's Guide to Building Data Pipelines with Luigi

The central scheduler

$ luigid &  # start central scheduler in background
$ python company_flow.py AnnualCompanyCountDelta --year 2014

by default, localhost:8082

Page 19: A Beginner's Guide to Building Data Pipelines with Luigi

Persisting our data

Companies Data Server --> Companies Download (Date) --output()--> companies.csv

companies.csv --input()--> Count companies (Date) --output()--> count.txt
companies.csv --input()--> CompaniesToMySQL (Date) --output()--> SQL Database

Count companies and CompaniesToMySQL each requires(Date) Companies Download

Page 20: A Beginner's Guide to Building Data Pipelines with Luigi

class CompaniesToMySQL(luigi.contrib.sqla.CopyToTable):

    date = luigi.DateParameter()

    columns = [(["name", String(100)], {}), ...]
    connection_string = "mysql://localhost/test"  # or something
    table = "companies"  # name of the table to store data

    def requires(self):
        return CompanyDownload(self.date)

    def rows(self):
        for row in self.get_unique_rows():  # uses self.input()
            yield row

Persisting our data
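The slides name get_unique_rows but never show it. A self-contained sketch of what such a helper might do, reading from a plain path for simplicity (in the task it would read from self.input().open('r')):

```python
import csv

def get_unique_rows(path):
    """Yield each distinct data row once, preserving first-seen order.

    Hypothetical body: only the helper's name appears in the slides.
    """
    seen = set()
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        for row in reader:
            key = tuple(row)  # lists aren't hashable; tuples are
            if key not in seen:
                seen.add(key)
                yield row
```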

Page 21: A Beginner's Guide to Building Data Pipelines with Luigi

My pipes broke

# ./client.cfg
[core]
error-email: [email protected], [email protected]

Page 22: A Beginner's Guide to Building Data Pipelines with Luigi

Things we missed out

There are many more task types available than we have mentioned:

● Hadoop
● Spark
● ssh
● Elasticsearch
● Hive
● Pig
● etc.

Check out the luigi.contrib package

Page 23: A Beginner's Guide to Building Data Pipelines with Luigi

class CompanyCount(luigi.contrib.hadoop.JobTask):

    chunks = luigi.Parameter()

    def requires(self):
        return [CompanyDownload(chunk) for chunk in self.chunks]

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget("companies_count.tsv")

    def mapper(self, line):
        yield "count", 1

    def reducer(self, key, values):
        yield key, sum(values)

Counting the companies using Hadoop

split input in chunks

HDFS target

map and reduce methods instead of run()
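The mapper/reducer pair is ordinary Python, so its logic can be sanity-checked without a cluster. A toy driver (an assumption for illustration, not part of Luigi) that mimics Hadoop's shuffle phase between the two:

```python
from collections import defaultdict

def mapper(line):
    # One ("count", 1) pair per input line, as in the JobTask above
    yield "count", 1

def reducer(key, values):
    yield key, sum(values)

def run_local(lines):
    """Toy local driver: group mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    results = {}
    for key, values in groups.items():
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results
```

On a real cluster Luigi serialises these methods into Hadoop streaming jobs; the grouping step here stands in for the shuffle-and-sort that Hadoop performs between map and reduce.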

Page 24: A Beginner's Guide to Building Data Pipelines with Luigi

● Doesn’t provide a way to trigger flows

● Doesn’t support distributed execution

Luigi Limitations

Page 25: A Beginner's Guide to Building Data Pipelines with Luigi

Onwards

● The docs: http://luigi.readthedocs.org/

● The mailing list: https://groups.google.com/forum/#!forum/luigi-user/

● The source: https://github.com/spotify/luigi

● The maintainers are really helpful, responsive, and open to any and all PRs!

Page 26: A Beginner's Guide to Building Data Pipelines with Luigi

Stuart Coleman@stubacca81 / [email protected]

Dylan Barth@dylan_barth / [email protected]

Thanks!

We’re hiring Python data scientists & engineers!http://www.growthintel.com/careers/