Upload
pydata
View
2.079
Download
3
Embed Size (px)
DESCRIPTION
Slides for a talk given by Elias Freider at PyData Silicon Valley 2013
Citation preview
Background
Billions of log messages (several TBs) every dayUsage and backend stats, debug informationWhat we want to do
AB-testingMusic recommendationsMonthly/daily/hourly reportingBusiness metric dashboards
We experiment a lot – need quick development cycles
We have a lot of data
Why did we build Luigi?
How to do it?
HadoopData storageClean, filter, join and aggregate data
PostgresSemi-aggregated data for dashboards
CassandraTime series data
A bunch of data processing tasks with inter-dependencies
Defining a data flow
Input is a list of Timestamp, Track, Artist tuples
Example: Artist Toplist
Streams Artist Aggregation
Top 10 Database
Naive (non-luigi) approach
Example: Artist Toplist
If a failed step leaves behind broken data, we want to clean that up
Errors will occur
The second step fails, you fix it, then you want to resume
Avoid duplicate work
To use data flows as command line tools
Parametrization
You want to run the dataflow for a set of similar inputs
Repeat!
Dead-simple dependency definitions (think GNU make)Emphasis on Hadoop/HDFS integrationAtomic file operationsPowerful task templating using OOPData flow visualizationCommand line integration
Main features
A Python framework for data flow definition and execution
Introducing Luigi
Luigi - Aggregate Artists
Luigi - Aggregate ArtistsRun on the command line:$ python dataflow.py AggregateArtists
DEBUG: Checking if AggregateArtists() is completeINFO: Scheduled AggregateArtists()DEBUG: Checking if Streams() is completeINFO: Done scheduling tasksDEBUG: Asking scheduler for work...DEBUG: Pending tasks: 1INFO: [pid 74375] Running AggregateArtists()INFO: [pid 74375] Done AggregateArtists()DEBUG: Asking scheduler for work...INFO: DoneINFO: There are no more tasks to run at this time
Running Hadoop MapReduce utilizing Hadoop Streaming or custom jar-filesRunning Hive and (soon) Pig queriesInserting data sets into Postgres
Luigi comes with pre-implemented Tasks for...
Reduces code-repetition by utilizing Luigi’s object oriented data model
Task templates and targets
Writing new ones are as easy as defining an interface and implementing run()
Built-in Hadoop Streaming Python framework
Hadoop MapReduce
Very slim interfaceFetches error logs from Hadoop cluster and displays them to the userClass instance variables can be referenced in MapReduce code, which makes it easy to supply extra data in dictionaries etc. for map side joinsEasy to send along Python modules that might not be installed on the clusterBasic support for secondary sortRuns on CPython so you can use your favorite libs (numpy, pandas etc.)
Features
Built-in Hadoop Streaming Python framework
Hadoop MapReduce
So you can use local modules remotely on the cluster
Attach Python modules
Set up local variables for map-side joins etc.
Local state is transferred
Top 10 artists - Wrapped arbitrary Python code
Completing the top list
Basic functionality for exporting to Postgres. Cassandra support is in the works
Database support
Running it all...DEBUG: Checking if ArtistToplistToDatabase() is completeINFO: Scheduled ArtistToplistToDatabase()DEBUG: Checking if Top10Artists() is completeINFO: Scheduled Top10Artists()DEBUG: Checking if AggregateArtists() is completeINFO: Scheduled AggregateArtists()DEBUG: Checking if Streams() is completeINFO: Done scheduling tasksDEBUG: Asking scheduler for work...DEBUG: Pending tasks: 3INFO: [pid 74811] Running AggregateArtists()INFO: [pid 74811] Done AggregateArtists()DEBUG: Asking scheduler for work...DEBUG: Pending tasks: 2INFO: [pid 74811] Running Top10Artists()INFO: [pid 74811] Done Top10Artists()DEBUG: Asking scheduler for work...DEBUG: Pending tasks: 1INFO: [pid 74811] Running ArtistToplistToDatabase()INFO: Done writing, importing at 2013-03-13 15:41:09.407138INFO: [pid 74811] Done ArtistToplistToDatabase()DEBUG: Asking scheduler for work...INFO: DoneINFO: There are no more tasks to run at this time
Light-weight scheduler daemon with a web interface
Data flow visualization
Imagine how cool this would be with real data...
The results
Tasks have implicit __init__
Task Parameters
Generates command line interface with typing and documentation
Class variables with some magic
$ python dataflow.py AggregateArtists --date 2013-03-05
Combined usage example
Task Parameters
Makes it really easy to create aggregates over time
Task Parameters
Basic multi-processing
Multiple workers
$ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W11
Large data flows
3/16/13 Luigi overview
localhost:8081 1/2
AggregateTracks
(test=False,
date=2013-03-16,
test_users=False)
JoinArtistGids
(test=False,
date=2013-03-16,
nshards=12,
test_users=False)
AggregateUserMatrices
(test=False,
date=2013-03-16,
index_version=1362433077,
test_users=False)
AggregateByArtists
(test=False,
date=2013-03-16,
test_users=False)
UserRecs
(test=False,
date=2013-03-17,
rec_days=5,
exp_days=10,
test_users=False,
force_updates=False,
build_from_scratch=True,
index_path=/spotify/discover/index,
index_version=None,
FOLLOWS_SCORE=5.0)
AccumulateUserMatrices
(test=False,
date=2013-03-16,
index_version=1362433077,
test_users=False,
decay_factor=0.99,
days=5,
build_from_scratch=True)
AccumulateByArtists
(test=False,
date=2013-03-16,
test_users=False,
max_days=90,
decay_factor=0.99,
days=10,
build_from_scratch=True)
IngestUserRecs
(date=2013-03-17,
test_users=False,
ttl=604800)
AggregateByArtists
(test=False,
date=2013-03-10,
test_users=False)
AccumulateByArtists
(test=False,
date=2013-03-15,
test_users=False,
max_days=90,
decay_factor=0.99,
days=10,
build_from_scratch=True)
UserRecs
(test=False,
date=2013-03-16,
rec_days=5,
exp_days=10,
test_users=False,
force_updates=False,
build_from_scratch=True,
index_path=/spotify/discover/index,
index_version=None,
FOLLOWS_SCORE=5.0)
IngestUserRecs
(date=2013-03-16,
test_users=False,
ttl=604800)
AggregateByArtists
(test=False,
date=2013-03-13,
test_users=False)
JoinArtistGids
(test=False,
date=2013-03-11,
nshards=12,
test_users=False)
MasterMetadata
(date=2013-03-15)
JoinArtistGids
(test=False,
date=2013-03-15,
nshards=12,
test_users=False)
AggregateByArtists
(test=False,
date=2013-03-15,
test_users=False)
AggregateUserMatrices
(test=False,
date=2013-03-15,
index_version=1362433077,
test_users=False)
AccumulateUserMatrices
(test=False,
date=2013-03-15,
index_version=1362433077,
test_users=False,
decay_factor=0.99,
days=5,
build_from_scratch=True)
JoinArtistGids
(test=False,
date=2013-03-13,
nshards=12,
test_users=False)
EndSongCleaned
(date=2013-03-15)
AggregateTracks
(test=False,
date=2013-03-15,
test_users=False)
UserLocationDay
(test=False,
date=2013-03-15,
test_users=False)
UserLocationPeriod
(test=False,
period=10,
date=2013-03-16,
test_users=False)
UserLocationPeriod
(test=False,
period=10,
date=2013-03-17,
test_users=False)
AggregateUserMatrices
(test=False,
date=2013-03-12,
index_version=1362433077,
test_users=False)
ArtistFollows
(date=2013-03-14)
EndSongCleaned
(date=2013-03-16)
UserLocationDay
(test=False,
date=2013-03-16,
test_users=False)
UserLocationDay
(test=False,
date=2013-03-06,
test_users=False)
UserLocationDay
(test=False,
date=2013-03-14,
test_users=False)
AggregateByArtists
(test=False,
date=2013-03-14,
test_users=False)
AggregateByArtists
(test=False,
date=2013-03-09,
test_users=False)
AggregateByArtists
(test=False,
date=2013-03-08,
test_users=False)
UserLocationDay
(test=False,
date=2013-03-13,
test_users=False)
AggregateUserMatrices
(test=False,
date=2013-03-14,
index_version=1362433077,
test_users=False)
UserLocationDay
(test=False,
date=2013-03-08,
test_users=False)
AggregateUserMatrices
(test=False,
date=2013-03-13,
index_version=1362433077,
test_users=False)
RelatedArtistsTC
()
AggregateByArtists
(test=False,
date=2013-03-12,
test_users=False)
AggregateByArtists
(test=False,
date=2013-03-11,
test_users=False)
AggregateByArtists
(test=False,
date=2013-03-06,
test_users=False)
UserLocationDay
(test=False,
date=2013-03-12,
test_users=False)
JoinArtistGids
(test=False,
date=2013-03-12,
nshards=12,
test_users=False)
JoinArtistGids
(test=False,
date=2013-03-14,
nshards=12,
test_users=False)
IngestUserRecs
(date=2013-03-15,
test_users=False,
ttl=604800)
UserLocationDay
(test=False,
date=2013-03-07,
test_users=False)
AggregateByArtists
(test=False,
date=2013-03-07,
test_users=False)
UserLocationDay
(test=False,
date=2013-03-11,
test_users=False)
UserLocationDay
(test=False,
date=2013-03-10,
test_users=False)
UserLocationDay
(test=False,
date=2013-03-09,
test_users=False)
AggregateUserMatrices
(test=False,
date=2013-03-11,
index_version=1362433077,
test_users=False)
PlaylistVectors
(test=False,
date=2013-03-14,
index_version=1362433077)
(Screenshot from web interface)
Visualization needs some work
Really large...
Great for automated execution
Error notifications
Prevents two identical tasks from running simultaneously
Process Synchronization
luigidSimple task synchronization
Data flow 1 Data flow 2
Common dependency Task
Not just another Hadoop streaming framework!
What we use Luigi forHadoop StreamingJava Hadoop MapReduceHivePigLocal (non-hadoop) data processingImport/Export data to/from PostgresInsert data into Cassandrascp/rsync/ftp data files and reportsDump and load databases
Others using it with Scala MapReduce and MRJob as well
Originated at SpotifyBased on many years of experience with data processingRecent contributions by Foursquare and BitlyOpen source since September 2012
http://github.com/spotify/luigi
Luigi is open source
For more information feel free to reach out at http://github.com/spotify/luigi
Thank you!Oh, and we’re hiring – http://spotify.com/jobs
Elias [email protected]