Slides for the NYC Storm user group meetup @spotify, Mar 25, 2014
About Me
•@Spotify since 2011
•Recommendation Team
•Data & Backend
•Storm, Scalding, Spark, Scala…
March 25, 2014
Spotify in numbers
Started in 2006, available in 55 markets
20+ million songs, 20,000 added per day
24+ million active users, 6+ million subscribers
1.5 billion playlists
Big Data @spotify
600 node cluster
Every day
•400GB service logs
•4.5TB user data
•5,000 Hadoop jobs
•61TB generated
What is Storm?
In data-layman’s terms
•Real time stream processing
•Like Hadoop without HDFS
•Like Map/Reduce with many reducer steps
•Fault tolerant & guaranteed message processing
Photo © Blaine Courts http://www.flickr.com/photos/blainecourts/8417266909/
Storm @spotify
•storm-0.8.0
•22 node cluster
•15+ topologies
•200,000+ tuples per second
•recommendation, ads, monitoring, analytics, etc.
“Never Gonna Give You Up”
Rick Astley Map
First Storm Application @Spotify
RT Market Launch Stats
Other Uses
•Trending tracks
•Email campaign
•App performance tracking
•UX tracking
Anatomy of A Storm Topology
From play to recommendation
Social Listening Take 1
•PUB/SUB
•Almost real-time
•Spammy
•Hard to scale
All characters appearing in this work are fictitious. Any resemblance to real persons, living or dead, is purely coincidental.
this guy again!
Social Listening Take 2
•Hadoop daily batch
•High latency
•M/R aggregation
•Easier to scale
Social Listening Revamped
•Kafka → Storm → Backend
•Soft real-time
•Aggregate & trigger bolt
•Easy to scale
Getting Data
[Diagram labels: accesspoint; playlist, search, storage; social; kafka]
What are we transferring?
•TSV logs with version & type (moving to Avro)
•Centralized Schema Repository
•Parsers in Python & Java
•Log parsing & splitting by topic in Kafka
EndSong 21
username:Str timestamp:Int trackId:Str msPlayed:Int reasonStart:Str reasonEnd:Str …
ClientEvent 15
username:Str platform:Str timestamp:Int jsonData:Str …
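A minimal sketch of parsing one such log line. It assumes, hypothetically, that each line carries the message type and schema version as its first two tab-separated columns, followed by the fields in schema order; the slide does not show the actual line layout. The slide also elides the rest of each schema ("…"), so `SCHEMAS` lists only the named fields and any further columns are passed through unparsed.

```python
from typing import Callable, Dict, List, Tuple

Field = Tuple[str, Callable[[str], object]]

# Partial schemas keyed by (type, version). Only the fields named on the
# slide are listed; the real schemas continue past this point.
SCHEMAS: Dict[Tuple[str, int], List[Field]] = {
    ("EndSong", 21): [
        ("username", str), ("timestamp", int), ("trackId", str),
        ("msPlayed", int), ("reasonStart", str), ("reasonEnd", str),
    ],
    ("ClientEvent", 15): [
        ("username", str), ("platform", str),
        ("timestamp", int), ("jsonData", str),
    ],
}


def parse_line(line: str) -> dict:
    """Parse one TSV line (assumed layout: type, version, then fields)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 2:
        raise ValueError("line has no type/version columns")
    msg_type, version = cols[0], int(cols[1])
    schema = SCHEMAS.get((msg_type, version))
    if schema is None:
        raise ValueError(f"unknown schema {msg_type} v{version}")
    values = cols[2:]
    if len(values) < len(schema):
        raise ValueError("fewer columns than schema fields")
    record = {"type": msg_type, "version": version}
    for (name, cast), raw in zip(schema, values):
        record[name] = cast(raw)
    # Columns beyond the known fields are kept as raw strings.
    record["extra"] = values[len(schema):]
    return record
```

Keying the schema table on both type and version lets old and new versions of the same message coexist while producers are upgraded.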
Getting Data Across the Globe
Photo © Riley Kaminer http://www.flickr.com/photos/rwkphotography/3282192071/
[Diagram labels: Ashburn, London, Stockholm, San Jose; Hadoop, Storm; big kafka; consumer; LONDON]
Topology
[Diagram: kafka spout, EndSong filter, metadata decorator, listening trigger, privacy filter, ZMTP publisher; other labels: metadata, prefs, feed, SUB, GET, GET]
EndSong Filter Bolt
•Discard some tuples
–Skipped
–Too short
•Keep some fields
–Context
–Reasons
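The filter step might look roughly like the sketch below, written as a plain Python function rather than a real bolt. The threshold, the `reasonEnd` value that marks a skip, and the `context` field name are all hypothetical; the slide only says that skipped and too-short plays are dropped and that context and reason fields are kept.

```python
from typing import Optional

MIN_MS_PLAYED = 30_000       # hypothetical "too short" threshold
SKIP_REASONS = {"skipped"}   # hypothetical reasonEnd value(s) marking a skip
KEEP_FIELDS = (              # "context" is an assumed field name
    "username", "timestamp", "trackId",
    "context", "reasonStart", "reasonEnd",
)


def filter_end_song(event: dict) -> Optional[dict]:
    """Return a trimmed event, or None if the tuple should be discarded."""
    if event.get("reasonEnd") in SKIP_REASONS:
        return None
    if event.get("msPlayed", 0) < MIN_MS_PLAYED:
        return None
    # Project onto the fields that downstream bolts need.
    return {k: event[k] for k in KEEP_FIELDS if k in event}
```

Dropping tuples this early keeps the more expensive metadata and privacy lookups off events that can never trigger anything.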
Metadata Decoration Bolt
•tuple.getStringByField("trackId")
•Append fields in output tuple
•[<input fields>…, "artistId", "albumId"]
•Input fields as constructor argument
•Reuse!
monadic!
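A rough illustration of the decoration pattern in plain Python (not the Storm API): the input field names are passed to the constructor, so the same class can sit behind any upstream component, and the output is the input fields followed by `artistId` and `albumId`. The `lookup` callable is a hypothetical stand-in for the metadata service call.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class MetadataDecorator:
    """Appends artistId and albumId to each input tuple."""

    def __init__(self, input_fields: Sequence[str],
                 lookup: Callable[[str], Tuple[str, str]]):
        self.input_fields = list(input_fields)
        self.lookup = lookup  # hypothetical: trackId -> (artistId, albumId)

    def output_fields(self) -> List[str]:
        # [<input fields>..., "artistId", "albumId"]
        return self.input_fields + ["artistId", "albumId"]

    def decorate(self, tup: Dict[str, object]) -> List[object]:
        artist_id, album_id = self.lookup(str(tup["trackId"]))
        return [tup[f] for f in self.input_fields] + [artist_id, album_id]
```

For example, `MetadataDecorator(["username", "trackId"], lookup)` emits four-element tuples whose field names come from `output_fields()`.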
Async & Batch RPC
[Diagram labels: upstream, queue, tuple batch, REQ/REP, metadata, callback, schedule, emit, ack; bolt thread, network thread]
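One way to read the async-and-batch idea is sketched below with only the standard library. The bolt thread enqueues tuples without blocking, a separate network thread drains the queue into batches, issues one request per batch, and invokes a callback per tuple where a real bolt would emit and ack. `BatchingLookup`, `fetch_batch`, `on_result`, and the batch limits are hypothetical names and values. As a simplification the callback runs on the network thread; the slide does not say which thread emits and acks.

```python
import queue
import threading
import time
from typing import Callable, Dict, List, Optional

_STOP = object()  # sentinel that tells the network thread to exit


class BatchingLookup:
    def __init__(self,
                 fetch_batch: Callable[[List[str]], Dict[str, object]],
                 on_result: Callable[[dict, Optional[object]], None],
                 max_batch: int = 50, max_wait_s: float = 0.05):
        self._fetch_batch = fetch_batch  # one request for many trackIds
        self._on_result = on_result      # emit + ack would happen here
        self._max_batch = max_batch
        self._max_wait_s = max_wait_s
        self._queue: queue.Queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, tup: dict) -> None:
        """Called on the bolt thread; never blocks on the network."""
        self._queue.put(tup)

    def close(self) -> None:
        self._queue.put(_STOP)
        self._worker.join()

    def _run(self) -> None:
        while True:
            first = self._queue.get()
            if first is _STOP:
                return
            batch, stop = [first], False
            deadline = time.monotonic() + self._max_wait_s
            # Fill the batch until it is full or the wait budget runs out.
            while len(batch) < self._max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    item = self._queue.get(timeout=remaining)
                except queue.Empty:
                    break
                if item is _STOP:
                    stop = True
                    break
                batch.append(item)
            results = self._fetch_batch([t["trackId"] for t in batch])
            for tup in batch:
                self._on_result(tup, results.get(tup["trackId"]))
            if stop:
                return
```

The two limits trade latency for throughput: a larger `max_batch` means fewer round trips, while `max_wait_s` bounds how long a lone tuple waits for company.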
Listening Trigger Bolt
•Rule based triggers
–High intent
–Repeats
•In heap LRU cache
–Repeat counter
–Rate limiting
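As an illustration of the repeat counter and rate limiting, here is a small sketch built on an in-heap LRU cache. Only the repeats rule is shown; the thresholds, the reset-after-firing behaviour, and the class names are assumptions rather than anything stated on the slide.

```python
from collections import OrderedDict
from typing import Hashable, Optional


class LRUCache:
    """Bounded in-heap cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        self._capacity = capacity
        self._data: OrderedDict = OrderedDict()

    def get(self, key: Hashable, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: Hashable, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._capacity:
            self._data.popitem(last=False)  # drop the oldest entry


class ListeningTrigger:
    def __init__(self, repeat_threshold: int = 3,      # hypothetical
                 min_interval_s: int = 3600,           # hypothetical
                 capacity: int = 100_000):
        self._repeat_threshold = repeat_threshold
        self._min_interval_s = min_interval_s
        self._repeats = LRUCache(capacity)     # (user, track) -> play count
        self._last_fired = LRUCache(capacity)  # user -> last trigger time

    def on_play(self, username: str, track_id: str, timestamp: int) -> bool:
        """Return True if this play should trigger a notification."""
        key = (username, track_id)
        count = self._repeats.get(key, 0) + 1
        self._repeats.put(key, count)
        if count < self._repeat_threshold:
            return False
        last: Optional[int] = self._last_fired.get(username)
        if last is not None and timestamp - last < self._min_interval_s:
            return False  # rate limited: this user fired too recently
        self._last_fired.put(username, timestamp)
        self._repeats.put(key, 0)  # start counting again after firing
        return True
```

Because both maps are bounded, memory stays flat no matter how many users pass through; an evicted counter simply restarts from zero.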
Privacy Bolt
•Similar to metadata bolt
•Async lookup
•Ack all, emit some
•In heap LRU cache
•Cache private cases only
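The "ack all, emit some" and "cache private cases only" points can be sketched as follows. This is a synchronous simplification in plain Python; `is_private` is a hypothetical stand-in for the preferences lookup, and `emit` and `ack` stand in for the collector calls.

```python
from collections import OrderedDict
from typing import Callable


class PrivacyFilter:
    def __init__(self, is_private: Callable[[str], bool],
                 capacity: int = 10_000):
        self._is_private = is_private  # hypothetical prefs lookup
        self._capacity = capacity
        self._private: OrderedDict = OrderedDict()  # LRU set of private users

    def process(self, tup: dict,
                emit: Callable[[dict], None],
                ack: Callable[[dict], None]) -> None:
        user = tup["username"]
        if user in self._private:
            self._private.move_to_end(user)  # refresh LRU position
            private = True
        else:
            private = self._is_private(user)
            if private:
                # Only positive (private) answers are cached.
                self._private[user] = True
                if len(self._private) > self._capacity:
                    self._private.popitem(last=False)
        if not private:
            emit(tup)
        # Every tuple is acked, whether or not it was emitted.
        ack(tup)
```

Acking filtered tuples matters: an unacked tuple would eventually time out and be replayed by the spout, even though it was dropped on purpose.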
ZMTP Publisher
[Diagram labels: bolt, upstream, service discovery, feed service; register, lookup, subscribe]
•[uri, username, payload]
•DNS SRV for discovery
•ZMQ PUB socket on bolt
•1+ subscribers
•1+ redundant bolts
•Tools for testing
Lessons Learned So Far
Development Process
One git repository
One storm-shared subproject → jar → Artifactory
Many storm-<team/application> subprojects
Sampled log for local development
Tunable params in config files
Problem Factory
•lein uberjar (or mvn)
•maven-shade-plugin
•relocate com.google.*
[Diagram labels: topology, storm; guava 10, guava 14; JVM]
<relocation>
  <pattern>com.google.common</pattern>
  <shadedPattern>shaded.com.google.common</shadedPattern>
</relocation>
Language Choices
Java for boring stuff - Cassandra, memcached, RPC, etc.
Clojure for fun stuff - algorithm heavy
Scala - summingbird?
Deployment
Puppet
Shared cluster right now
One per team in the future
YARN?
Monitoring from service side
Photo © Ian Koppenbadger http://www.flickr.com/photos/runnerone/3391661946/
Thank You
Want to join the band?
Check out http://www.spotify.com/jobs or @Spotifyjobs for more information.
Neville Li [email protected] @sinisa_lyh