A Beginner’s Guide to Building Data Pipelines
with
Where should I focus my outbound sales and marketing efforts to yield the highest possible ROI?
UK Limited Companies
Customer CRM Data
Predictive Model
With big data, comes big responsibility
Hard to maintain, extend, and… look at.
Script Soup
omg moar codez
Code
More Codes
if __name__ == '__main__':
    today = datetime.now().isoformat()[:10]  # <- Custom date handling

    arg_parser = argparse.ArgumentParser(prog='COMPANIES HOUSE PARSER',
                                         description='Process arguments')
    arg_parser.add_argument('--files', nargs='?', dest='file_names',
                            help='CSV files to read (supports globstar wildcards)',
                            required=True)
    arg_parser.add_argument('--batch', nargs='?', dest='batch_size', type=int,
                            help='Number of rows to save to DB at once')
    arg_parser.add_argument('--date', nargs='?', dest='table_date',
                            help='Date that these data were released')
    arg_parser.add_argument('--log_level', dest='log_level',
                            help='Log level to screen', default='INFO')
    args = arg_parser.parse_args()
The Old Way
Define a command line interface for every task?
log = GLoggingFactory().getLoggerFromPath(
    '/var/log/companies-house-load-csv.log-{}'.format(today))
log.setLogLevel(args.log_level, 'screen')  # <- Custom logging

table_date = parse_date(args.table_date, datetime.now())

log.info('Starting Companies House data loader...')
ch_loader = CompaniesHouseLoader(logger=log, col_mapping=col_mapping,
                                 table_date=table_date)
ch_loader.go(args.file_names)  # <- What to do if this fails?

log.info('Loader complete. Starting Companies House updater')
ch_updater = CompaniesHouseUpdater(logger=log, table_date=table_date,
                                   company_status_params=company_status_params)
ch_updater.go()  # <- Need to clean up if this fails
The Old Way
Long, processor intensive tasks stacked together
● Open-sourced & maintained by Spotify data team
● Created by Erik Bernhardsson and Elias Freider. Maintained by Arash Rouhani.
● Abstracts batch processing jobs
● Makes it easy to write modular code and create dependencies between tasks.
Luigi to the rescue!
● Task templating
● Dependency graphs
● Resumption of data flows after intermediate failure
● Command line integration
● Error emails
Luigi
Luigi 101 - Counting the number of companies in the UK
companies.csv --input()--> [Count companies] --output()--> count.txt
class CompanyCount(luigi.Task):
    def output(self):
        return luigi.LocalTarget("count.csv")

    def run(self):
        count = count_unique_entries("companies.csv")
        with self.output().open("w") as out_file:
            out_file.write(count)
Company Count Job in Luigi code
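The `count_unique_entries` helper is assumed rather than shown in the deck. A minimal sketch of what it might look like; the `CompanyNumber` column name is an assumption, not something from the slides:

```python
import csv

def count_unique_entries(path):
    """Count distinct company numbers in a Companies House-style CSV.

    Hypothetical helper: returns a string so it can be written
    straight to the task's output Target.
    """
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        unique = {row["CompanyNumber"] for row in reader}
    return str(len(unique))
```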
Luigi 101 - Keeping our count up to date
Companies Data Server --> [Companies Download] --output()--> companies.csv
companies.csv --input()--> [Count companies] --output()--> count.txt
(CompanyCount requires() the CompanyDownload task)
class CompanyCount(luigi.Task):
    def requires(self):
        return CompanyDownload()

    def output(self):
        return luigi.LocalTarget("count.csv")

    def run(self):
        count = count_unique_entries(self.input())
        with self.output().open("w") as out_file:
            out_file.write(count)
Company count with download dependency
the output of the required task
this task must complete before CompanyCount runs
Download task
class CompanyDownload(luigi.Task):
    def output(self):
        return luigi.LocalTarget("companies.csv")

    def run(self):
        data = get_company_download()
        with self.output().open('w') as out_file:
            out_file.write(data)
local output to be picked up by previous task
download the data and write it to the output Target
$ python company_flow.py CompanyCount --local-scheduler
DEBUG: Checking if CompanyCount() is complete
DEBUG: Checking if CompanyDownload() is complete
INFO: Scheduled CompanyCount() (PENDING)
INFO: Scheduled CompanyDownload() (PENDING)
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 10076] Worker Worker(...) running CompanyDownload()
INFO: [pid 10076] Worker Worker(...) done CompanyDownload()
DEBUG: 1 running tasks, waiting for next task to finish
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 10076] Worker Worker(...) running CompanyCount()
INFO: [pid 10076] Worker Worker(...) done CompanyCount()
DEBUG: 1 running tasks, waiting for next task to finish
DEBUG: Asking scheduler for work...
INFO: Done
Time dependent tasks - change in companies
Companies Count Task (Date 1) --input()--\
Companies Count Task (Date 2) --input()---> [Companies Delta] --output()--> company_count_delta.txt
Companies Count Task (Date 3) --input()--/
class AnnualCompanyCountDelta(luigi.Task):
    year = luigi.Parameter()

    def requires(self):
        tasks = []
        for month in range(1, 13):
            tasks.append(CompanyCount(dt.datetime.strptime(
                "{}-{}-01".format(self.year, month), "%Y-%m-%d")))
        return tasks

    # not shown: output(), run()
Parameterising Luigi tasks
define parameter
generate dependencies
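The omitted `run()` presumably reads the twelve monthly counts from `self.input()` and diffs them. The core delta logic might look like this; a sketch for illustration, not the talk's actual code:

```python
def month_on_month_deltas(counts):
    """Change in company count between consecutive months.

    Hypothetical helper: `counts` is the list of integers read from
    the twelve CompanyCount output Targets, in month order.
    """
    return [after - before for before, after in zip(counts, counts[1:])]
```

For example, `month_on_month_deltas([100, 120, 115])` returns `[20, -5]`.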
class CompanyCount(luigi.Task):
    date = luigi.DateParameter(default=datetime.date.today())

    def requires(self):
        return CompanyDownload(self.date)

    def output(self):
        return luigi.LocalTarget("count.csv")

    def run(self):
        count = count_unique_entries(self.input())
        with self.output().open("w") as out_file:
            out_file.write(count)
Adding the date dependency to Company Count
added date dependency to company count
The central scheduler
$ luigid &  # start central scheduler in background
$ python company_flow.py AnnualCompanyCountDelta --year 2014
by default, localhost:8082
Persisting our data
Companies Data Server --> [Companies Download (Date)] --output()--> companies.csv
companies.csv --input()--> [Count companies (Date)] --output()--> count.txt
companies.csv --input()--> [CompaniesToMySQL (Date)] --output()--> SQL Database
(both downstream tasks requires(Date) the Companies Download task)
class CompaniesToMySQL(luigi.contrib.sqla.CopyToTable):
    date = luigi.DateParameter()
    columns = [(["name", String(100)], {}), ...]
    connection_string = "mysql://localhost/test"  # or something
    table = "companies"  # name of the table to store data

    def requires(self):
        return CompanyDownload(self.date)

    def rows(self):
        for row in self.get_unique_rows():  # uses self.input()
            yield row
Persisting our data
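The `get_unique_rows()` method is not shown in the deck. One plausible sketch, streaming the upstream Target via `self.input()` and dropping duplicates; the naive comma split is an assumption (real Companies House data would need proper CSV parsing):

```python
def get_unique_rows(self):
    """Hypothetical sketch of get_unique_rows(): read the downloaded
    CSV from the upstream task's Target and yield each distinct row
    exactly once, preserving first-seen order."""
    seen = set()
    with self.input().open("r") as in_file:
        for line in in_file:
            row = tuple(line.rstrip("\n").split(","))
            if row not in seen:
                seen.add(row)
                yield row
```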
My pipes broke
# ./client.cfg
[core]
error-email: [email protected], [email protected]
Things we missed out
There are lots of other task types that we haven't covered:
● Hadoop
● Spark
● ssh
● Elasticsearch
● Hive
● Pig
● etc.
Check out the luigi.contrib package
class CompanyCount(luigi.contrib.hadoop.JobTask):
    chunks = luigi.Parameter()

    def requires(self):
        return [CompanyDownload(chunk) for chunk in self.chunks]

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget("companies_count.tsv")

    def mapper(self, line):
        yield "count", 1

    def reducer(self, key, values):
        yield key, sum(values)
Counting the companies using Hadoop
split input in chunks
HDFS target
map and reduce methods instead of run()
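The mapper/reducer pair follows classic MapReduce semantics: map each input line to key/value pairs, group by key, then reduce each group. You can simulate the same behaviour in plain Python to see what the Hadoop task computes; a local sketch for illustration, not part of Luigi:

```python
from collections import defaultdict

def mapper(line):
    # emit one ("count", 1) pair per input line
    yield "count", 1

def reducer(key, values):
    # sum all the 1s for a key
    yield key, sum(values)

def run_job(lines):
    """Tiny local stand-in for the Hadoop shuffle: group mapper
    output by key, then feed each group through the reducer."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    results = {}
    for key, values in groups.items():
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results
```

Running `run_job(["a", "b", "c"])` returns `{"count": 3}`, the same total the Hadoop job would write to HDFS.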
● Doesn’t provide a way to trigger flows
● Doesn’t support distributed execution
Luigi Limitations
Onwards
● The docs: http://luigi.readthedocs.org/
● The mailing list: https://groups.google.com/forum/#!forum/luigi-user/
● The source: https://github.com/spotify/luigi
● The maintainers are really helpful, responsive, and open to any and all PRs!
Stuart Coleman@stubacca81 / [email protected]
Dylan Barth@dylan_barth / [email protected]
Thanks!
We’re hiring Python data scientists & engineers!http://www.growthintel.com/careers/