December 15, 2014 Luigi NYC Data Science meetup

Luigi presentation NYC Data Science


Page 1: Luigi presentation NYC Data Science

December 15, 2014

Luigi

NYC Data Science meetup

Page 2: Luigi presentation NYC Data Science

What is Luigi

Luigi is a workflow engine
If you run 10,000+ Hadoop jobs every day, you need one
If you play around with batch processing just for fun, you want one
It doesn't help you with the code; that's what Scalding, Pig, or anything else is good at
It helps you with the plumbing of connecting lots of tasks into complicated pipelines,

especially if those tasks run on Hadoop

2

Page 3: Luigi presentation NYC Data Science

What do we use it for?

Music recommendations
A/B testing
Top lists
Ad targeting
Label reporting
Dashboards
… and a million other things!

3

Page 4: Luigi presentation NYC Data Science

Currently running 10,000+ Hadoop jobs every day

On average, a Hadoop job is launched every 10 seconds
There are 2,000+ Luigi tasks in production

4

Page 5: Luigi presentation NYC Data Science

Some history

… let’s go back to 2008!

5

Page 6: Luigi presentation NYC Data Science

The year was 2008

I was writing my master’s thesis about music recommendations

Had to run hundreds of long-running tasks to compute the output

6

Page 7: Luigi presentation NYC Data Science

Toy example: classify skipped tracks

$ python subsample_extract_features.py /log/endsongcleaned/2011-01-?? /tmp/subsampled
$ python train_model.py /tmp/subsampled model.pickle
$ python inspect_model.py model.pickle

7

[Pipeline diagram: Log d, Log d+1, …, Log d+k-1 → Subsample and extract features → Subsampled features → Train classifier → Classifier → Look at the output]

Page 8: Luigi presentation NYC Data Science

Reproducibility matters

…and automation. The previous code is really hard to run again

8

Page 9: Luigi presentation NYC Data Science

Let's make it into a big workflow

9

$ python run_everything.py

Page 10: Luigi presentation NYC Data Science

Reality: crashes will happen

How do you resume this?

10

Page 11: Luigi presentation NYC Data Science

Ability to resume matters

When you are developing something interactively, you will try and fail a lot
Failures will happen, and you want to resume once you have fixed the problem
You want the system to figure out exactly what it has to re-run and nothing else
Atomic file operations are crucial for the ability to resume
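The atomic-output idea above can be sketched in plain Python: write to a temporary file and rename it into place, so a crash mid-write never leaves a partial output that would fool the resume logic. (The file path and contents here are illustrative, not from the deck.)

```python
import os
import tempfile

def write_atomically(path, data):
    """Write data to path so the file appears either complete or not at all."""
    # Write to a temporary file in the same directory, since a rename is
    # only atomic within a single filesystem.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        # os.replace is an atomic rename on both POSIX and Windows.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise

write_atomically("/tmp/subsampled_features.txt", "feature data\n")
print(open("/tmp/subsampled_features.txt").read())  # prints the data just written
```

If the process dies before the rename, the target path simply does not exist, so a "does the output exist?" completeness check will correctly re-run the task.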

11

Page 12: Luigi presentation NYC Data Science

So let’s make it possible to resume

12

Page 13: Luigi presentation NYC Data Science

13

But there are still annoying parts

Hardcoded junk

Page 14: Luigi presentation NYC Data Science

Generalization matters

You should be able to re-run your entire pipeline with a new value for a parameter
Command line integration means you can run interactive experiments

14

Page 15: Luigi presentation NYC Data Science

… now we’re getting something

15

$ python run_everything.py --date-first 2014-01-01 --date-last 2014-01-31 --n-trees 200

Page 16: Luigi presentation NYC Data Science

16

… but it’s hardly readable

BOILERPLATE

Page 17: Luigi presentation NYC Data Science

Boilerplate matters!

We keep re-implementing the same functionality
Let's factor it out into a framework

17

Page 18: Luigi presentation NYC Data Science

Many real-world data pipelines are a lot more complex

The ideal framework should make it trivial to build up big data pipelines where the dependencies are non-trivial (e.g. depend on date algebra)

18

Page 19: Luigi presentation NYC Data Science

So I started thinking

Wanted to build something like GNU Make

19

Page 20: Luigi presentation NYC Data Science

What is Make and why is it pretty cool?

Build reusable rules
Specify what you want to build, and then backtrack to find out what you need in order to get there

Reproducible runs

20

# the compiler: gcc for C program, define as g++ for C++
CC = gcc

# compiler flags:
#  -g     adds debugging information to the executable file
#  -Wall  turns on most, but not all, compiler warnings
CFLAGS = -g -Wall

# the build target executable:
TARGET = myprog

all: $(TARGET)

$(TARGET): $(TARGET).c
	$(CC) $(CFLAGS) -o $(TARGET) $(TARGET).c

clean:
	$(RM) $(TARGET)

Page 21: Luigi presentation NYC Data Science

We want something that works for a wide range of systems

We need to support lots of systems
"80% of data science is data munging"

21

Page 22: Luigi presentation NYC Data Science

Data processing needs to interact with lots of systems

Need to support practically any type of task:
Hadoop jobs
Database dumps
Ingest into Cassandra
Send email
SCP file somewhere else

22

Page 23: Luigi presentation NYC Data Science

My first attempt: builder

Use XML config to build up the dependency graph!

23

Page 24: Luigi presentation NYC Data Science

Don’t use XML

… seriously, don’t use it

24

Page 25: Luigi presentation NYC Data Science

Dependencies need code

Pipelines deployed in production often have non-trivial ways of defining dependencies between tasks

… and many other cases

25

Recursion (and date algebra):
  BloomFilter(date=2014-05-01)
  BloomFilter(date=2014-04-30)
  Log(date=2014-04-30)
  Log(date=2014-04-29)
  ...

Date algebra:
  Toplist(date_interval=2014-01)
  Log(date=2014-01-01)
  Log(date=2014-01-02)
  ...
  Log(date=2014-01-31)

Enum types:
  IdMap(type=artist)
  IdMap(type=track)
  IdToIdMap(from_type=artist, to_type=track)

Page 26: Luigi presentation NYC Data Science

Don’t ever invent your own DSL

“It’s better to write domain specific code in a general purpose language, than writing general purpose code in a domain specific language” – unknown author

Oozie is a good example of how messy it gets

26

Page 27: Luigi presentation NYC Data Science

2009: builder2

Solved all the things I just mentioned:
- Dependency graph specified in Python
- Support for arbitrary tasks
- Error emails
- Support for lots of common data plumbing stuff: Hadoop jobs, Postgres, etc.
- Lots of other things :)

27

Page 28: Luigi presentation NYC Data Science

Graphs!

28

Page 29: Luigi presentation NYC Data Science

More graphs!

29

Page 30: Luigi presentation NYC Data Science

Even more graphs!

30

Page 31: Luigi presentation NYC Data Science

What were the good bits?
Build up dependency graphs and visualize them
Going from development to deployment was a non-event
Built-in HDFS integration, but decoupled from the core library

What went wrong?
Still too much boilerplate
Pretty bad command line integration

31

Page 32: Luigi presentation NYC Data Science

32

Page 33: Luigi presentation NYC Data Science

Introducing Luigi

A workflow engine in Python

33

Page 34: Luigi presentation NYC Data Science

Luigi – History at Spotify

Late 2011: Elias Freider and I built it and released it into the wild at Spotify; people started using it

"The Python era"

Late 2012: Open sourced it
Early 2013: First known company outside of Spotify: Foursquare

34

Page 35: Luigi presentation NYC Data Science

Luigi is your friendly plumber

Simple dependency definitions
Emphasis on Hadoop/HDFS integration
Atomic file operations
Data flow visualization
Command line integration

35

Page 36: Luigi presentation NYC Data Science

Luigi Task

36

Page 37: Luigi presentation NYC Data Science

Luigi Task – breakdown

The business logic of the task
Where it writes output
What other tasks it depends on
Parameters for this task

37
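The code on this slide is an image in the original deck. As a stand-in, here is a self-contained sketch of that four-part anatomy, mimicking Luigi's Task/Target interface with minimal stubs so it runs without the library (real Luigi code subclasses luigi.Task and uses luigi.Parameter and luigi.LocalTarget). The task name, parameter, output path, and contents mirror the MyTask run shown on the next slide.

```python
import os

# Minimal stand-ins for luigi.LocalTarget / luigi.Task, just to make the
# anatomy runnable here. Real Luigi provides richer versions of both.
class LocalTarget:
    def __init__(self, path):
        self.path = path
    def exists(self):
        return os.path.exists(self.path)

class Task:
    def requires(self):   # what other tasks this task depends on
        return []
    def output(self):     # where it writes output
        raise NotImplementedError
    def run(self):        # the business logic of the task
        raise NotImplementedError
    def complete(self):   # Luigi considers a task done when its output exists
        return self.output().exists()

class MyTask(Task):
    def __init__(self, param=42):
        self.param = param            # parameter for this task

    def output(self):
        return LocalTarget("/tmp/foo/bar-%d.txt" % self.param)

    def run(self):
        os.makedirs("/tmp/foo", exist_ok=True)
        with open(self.output().path, "w") as f:
            f.write("hello, world\n")

task = MyTask(param=43)
if not task.complete():
    task.run()
print(open(task.output().path).read())
```

Note how completeness is derived from the output target rather than stored anywhere: that is what makes re-running a crashed pipeline safe, since only tasks whose outputs are missing will run again.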

Page 38: Luigi presentation NYC Data Science

Easy command line integration

So easy that you want to use Luigi for it

38

$ python my_task.py MyTask --param 43
INFO: Scheduled MyTask(param=43)
INFO: Scheduled SomeOtherTask(param=43)
INFO: Done scheduling tasks
INFO: [pid 20235] Running SomeOtherTask(param=43)
INFO: [pid 20235] Done SomeOtherTask(param=43)
INFO: [pid 20235] Running MyTask(param=43)
INFO: [pid 20235] Done MyTask(param=43)
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker was stopped. Shutting down Keep-Alive thread
$ cat /tmp/foo/bar-43.txt
hello, world
$

Page 39: Luigi presentation NYC Data Science

Let’s go back to the example

39

[Pipeline diagram: Log d, Log d+1, …, Log d+k-1 → Subsample and extract features → Subsampled features → Train classifier → Classifier → Look at the output]

Page 40: Luigi presentation NYC Data Science

Code in Luigi

40

Page 41: Luigi presentation NYC Data Science

Extract the features

41

Page 42: Luigi presentation NYC Data Science

$ python demo.py SubsampleFeatures --date-interval 2013-11-01
DEBUG: Checking if SubsampleFeatures(test=False, date_interval=2013-11-01) is complete
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2013-11-01)
DEBUG: Checking if EndSongCleaned(date_interval=2013-11-01) is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 24345] Running SubsampleFeatures(test=False, date_interval=2013-11-01)
...
INFO: 13/11/08 02:15:11 INFO streaming.StreamJob: Tracking URL: http://lon2-hadoopmaster-a1.c.lon.spotify.net:50030/jobdetails.jsp?jobid=job_201310180017_157113
INFO: 13/11/08 02:15:12 INFO streaming.StreamJob: map 0% reduce 0%
INFO: 13/11/08 02:15:27 INFO streaming.StreamJob: map 2% reduce 0%
INFO: 13/11/08 02:15:30 INFO streaming.StreamJob: map 7% reduce 0%
...
INFO: 13/11/08 02:16:10 INFO streaming.StreamJob: map 100% reduce 87%
INFO: 13/11/08 02:16:13 INFO streaming.StreamJob: map 100% reduce 100%
INFO: [pid 24345] Done SubsampleFeatures(test=False, date_interval=2013-11-01)
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker was stopped. Shutting down Keep-Alive thread
$

Run on the command line

42

Page 43: Luigi presentation NYC Data Science

Step 2: Train a machine learning model

43

Page 44: Luigi presentation NYC Data Science

Let's run everything on the command line from scratch

$ python luigi_workflow_full.py InspectModel --date-interval 2011-01-03
DEBUG: Checking if InspectModel(date_interval=2011-01-03, n_trees=10) is complete
INFO: Scheduled InspectModel(date_interval=2011-01-03, n_trees=10) (PENDING)
INFO: Scheduled TrainClassifier(date_interval=2011-01-03, n_trees=10) (PENDING)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-03) (PENDING)
INFO: Scheduled EndSongCleaned(date=2011-01-03) (DONE)
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) running SubsampleFeatures(test=False, date_interval=2011-01-03)
INFO: 14/12/17 02:07:20 INFO mapreduce.Job: Running job: job_1418160358293_86477
INFO: 14/12/17 02:07:31 INFO mapreduce.Job: Job job_1418160358293_86477 running in uber mode : false
INFO: 14/12/17 02:07:31 INFO mapreduce.Job: map 0% reduce 0%
INFO: 14/12/17 02:08:34 INFO mapreduce.Job: map 2% reduce 0%
INFO: 14/12/17 02:08:36 INFO mapreduce.Job: map 3% reduce 0%
INFO: 14/12/17 02:08:38 INFO mapreduce.Job: map 5% reduce 0%
INFO: 14/12/17 02:08:39 INFO mapreduce.Job: map 10% reduce 0%
INFO: 14/12/17 02:08:40 INFO mapreduce.Job: map 17% reduce 0%
INFO: 14/12/17 02:16:30 INFO mapreduce.Job: map 100% reduce 100%
INFO: 14/12/17 02:16:32 INFO mapreduce.Job: Job job_1418160358293_86477 completed successfully
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) done SubsampleFeatures(test=False, date_interval=2011-01-03)
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) running TrainClassifier(date_interval=2011-01-03, n_trees=10)
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) done TrainClassifier(date_interval=2011-01-03, n_trees=10)
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) running InspectModel(date_interval=2011-01-03, n_trees=10)
time         0.1335%
ms_played   96.9351%
shuffle      0.0728%
local_track  0.0000%
bitrate      2.8586%
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) done InspectModel(date_interval=2011-01-03, n_trees=10)

44

Page 45: Luigi presentation NYC Data Science

Let’s make it more complicated – cross validation

45

[Pipeline diagram: Log d, Log d+1, …, Log d+k-1 → Subsample and extract features → Subsampled features → Train classifier → Classifier; Log e, Log e+1, …, Log e+k-1 → Subsample and extract features → Subsampled features; both branches feed into Cross validation]

Page 46: Luigi presentation NYC Data Science

Cross validation implementation

$ python xv.py CrossValidation --date-interval-a 2012-11-01 --date-interval-b 2012-11-02

46

Page 47: Luigi presentation NYC Data Science

Run on the command line

$ python cross_validation.py CrossValidation --date-interval-a 2011-01-01 --date-interval-b 2011-01-02
INFO: Scheduled CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02) (PENDING)
INFO: Scheduled TrainClassifier(date_interval=2011-01-01, n_trees=10) (DONE)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-02) (DONE)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-01) (DONE)
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
INFO: [pid 18533] Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=18533) running CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02)
2011-01-01 (train) AUC: 0.9040
2011-01-02 ( test) AUC: 0.9040
INFO: [pid 18533] Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=18533) done CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02)
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=18533) was stopped. Shutting down Keep-Alive thread

47

… no overfitting!
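The AUC figures printed by CrossValidation measure how well the classifier ranks skipped vs. non-skipped tracks. As background, here is a stdlib-only sketch of the pairwise definition of AUC; the labels and scores below are made up for illustration, not taken from the deck.

```python
def auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    positive example is scored higher than a randomly chosen negative one
    (ties count as half a win)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    pairs = 0
    wins = 0.0
    for p in positives:
        for n in negatives:
            pairs += 1
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / pairs

# Made-up labels (1 = skipped track) and classifier scores:
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(round(auc(labels, scores), 4))  # 0.8889: one of the nine pairs is misranked
```

An AUC of 0.5 means the ranking is no better than chance, and 1.0 means every positive outranks every negative; the gap between train and test AUC on the next slide is what signals overfitting.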

Page 48: Luigi presentation NYC Data Science

More trees!

$ python cross_validation.py CrossValidation --date-interval-a 2011-01-01 --date-interval-b 2011-01-02 --n-trees 100
INFO: Scheduled CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02, n_trees=100) (PENDING)
INFO: Scheduled TrainClassifier(date_interval=2011-01-01, n_trees=100) (PENDING)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-02) (DONE)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-01) (DONE)
INFO: Done scheduling tasks
INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=27835) running TrainClassifier(date_interval=2011-01-01, n_trees=100)
INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=27835) done TrainClassifier(date_interval=2011-01-01, n_trees=100)
INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=27835) running CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02, n_trees=100)
2011-01-01 (train) AUC: 0.9074
2011-01-02 ( test) AUC: 0.8896
INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=27835) done CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02, n_trees=100)
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=27835) was stopped. Shutting down Keep-Alive thread

48

… overfitting!

Page 49: Luigi presentation NYC Data Science
Page 50: Luigi presentation NYC Data Science

The nice things about Luigi

50

Page 51: Luigi presentation NYC Data Science

Minimal boilerplate

Overhead for a task is about 5 lines (class def + requires + output + run)
Easy command line integration

51

Page 52: Luigi presentation NYC Data Science

Everything is a directed acyclic graph

Makefile style
Tasks specify what they depend on, not what other things depend on them
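The Makefile-style backtracking described above can be sketched in a few lines of plain Python: start from the target task, recurse through requires(), and run only what is not yet complete. This is a toy stand-in for Luigi's scheduling logic, not its real API; the task names are borrowed from the earlier example.

```python
class Task:
    done = False                      # stand-in for "output exists"
    def requires(self):
        return []
    def run(self):
        print("running", type(self).__name__)
        self.done = True

def build(task):
    """Backtrack from the target, running incomplete dependencies first."""
    for dep in task.requires():
        build(dep)
    if not task.done:
        task.run()

class Logs(Task):
    done = True                       # already produced upstream, so skipped

class SubsampleFeatures(Task):
    def __init__(self):
        self.logs = Logs()
    def requires(self):
        return [self.logs]

class TrainClassifier(Task):
    def __init__(self):
        self.features = SubsampleFeatures()
    def requires(self):
        return [self.features]

target = TrainClassifier()
build(target)   # runs SubsampleFeatures, then TrainClassifier; skips Logs
```

Because each task only names its upstream dependencies, you can ask for any node in the graph and the engine figures out the minimal set of tasks to run, exactly like `make target`.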

52

Page 53: Luigi presentation NYC Data Science

Luigi’s visualizer

53

Page 54: Luigi presentation NYC Data Science

Dive into any task

54

Page 55: Luigi presentation NYC Data Science

Run with multiple workers

$ python dataflow.py --workers 3 AggregateArtists --date-interval 2013-W08

55

Page 56: Luigi presentation NYC Data Science

Error notifications

56

Page 57: Luigi presentation NYC Data Science

Process synchronization

[Diagram: Luigi worker 1 running tasks A, B, C and Luigi worker 2 running tasks A, C, F, both coordinated by the Luigi central planner]

It prevents the same task from being run simultaneously, but all execution is done by the workers.

57

Page 58: Luigi presentation NYC Data Science

Luigi is a way of coordinating lots of different tasks

… but you still have to figure out how to implement and scale them!

58

Page 59: Luigi presentation NYC Data Science

Do general-purpose stuff

Don't focus on a specific platform
… but it comes "batteries included"

59

Page 60: Luigi presentation NYC Data Science

Built-in support for HDFS & Hadoop

At Spotify we're abandoning Python for batch processing tasks, replacing it with Crunch and Scalding. Luigi is a great glue!

Our team, the Lambda team: 15 engineers, running 1,000+ Hadoop jobs daily, with 400+ Luigi tasks in production.

Our recommendation pipeline is a good example: Python M/R jobs, ML algos in C++, Java M/R jobs, Scalding, ML stuff in Python using scikit-learn, importing stuff into Cassandra, importing stuff into Postgres, sending email reports, etc.

60

Page 61: Luigi presentation NYC Data Science

The one time we accidentally deleted 50TB of data

We didn't have to write a single line of code to fix it: Luigi rescheduled thousands of tasks and ran them over 3 days

61

Page 62: Luigi presentation NYC Data Science

Some things are still not perfect

62

Page 63: Luigi presentation NYC Data Science

The missing parts

Execution is tied to scheduling: you can't schedule something to run "in the cloud"
Visualization could be a lot more useful
There's no built-in scheduling; you have to rely on crontab
These are all things we have in the backlog

63

Page 64: Luigi presentation NYC Data Science

What are some ideas for the future?

64

Page 65: Luigi presentation NYC Data Science

Separate scheduling and execution

65

[Diagram: Luigi central scheduler dispatching work to many slaves]

Page 66: Luigi presentation NYC Data Science

Luigi in Scala?

66

Page 67: Luigi presentation NYC Data Science

Luigi implements some core beliefs

The #1 focus is on removing all boilerplate
The #2 focus is to be as general as possible
The #3 focus is to make it easy to go from test to production

67

Page 68: Luigi presentation NYC Data Science

Join the club!

Page 69: Luigi presentation NYC Data Science

Questions?

69