September 26, 2013
Wouter de Bie, Team Lead Data Infrastructure
Big Data Infrastructure at Spotify
Thursday, September 26, 13
Who am I?
According to ZDNet: "The work they have done to improve the Apache Hive data warehouse system also aligns well with our needs, as we use Hive extensively for ad-hoc queries and for the analysis of large datasets," de Bie said in a statement.
Why data?
We play music, right?
Why data?
We were the first company to do free music streaming. But now everybody can do it, so we need to learn quickly and iterate faster!
Agenda
Let's talk about Data Infrastructure: how we did it, what we learned and how we've failed.
• Some Context
• Use Cases
• Our Infrastructure
• Lessons learned
Some Context
• Spotify started in 2006 (in Sweden)
• Now 1200+ employees, 450+ engineers
• 26 million monthly active users
• 20+ million tracks available
• 4 data centers across the globe
• 17 data engineers building a platform for easy access to data
Use Cases
We're a data-driven company, so data is used almost everywhere:
• Reporting
• Business Analytics
• Operational Analytics
• Product features
Reporting
• Reporting to labels, licensors, partners and advertisers
• We support our partners
Business Analytics
• Analyzing growth, user behavior, sign-up funnels, etc.
• Company KPIs
• A/B testing
• NPS analysis
• Segmentation analysis
• Ad performance
[Chart: Listening behavior in Sweden — share of listening per hour across the week (Mon 00:00 to Sun 22:00), broken down by age group: 0-17, 18-22, 23-27, 28-34, 35-44, 45-59, 60+]
[Chart: Listening behavior in Spain — same hourly breakdown by age group as the Swedish chart]
Impact of hurricane Sandy (29 October 2012)
Impact of hurricane Sandy (30 October 2012)
Operational metrics
• Root cause analysis
• Latency analysis
• Better capacity planning (servers, people, bandwidth)
Product features
• Radio
• Top lists
• Recommendations (better than external parties, because of the amount of data)
Everybody should be able to use data!
So why Data Infrastructure?
[Diagram: View → Backend → Data pipe → Analysis]
Our data infrastructure
Some geeky numbers
• 1 TB of compressed data from users per day
• 400 GB of data from services per day
• 61 TB of data generated in Hadoop each day
• 328-node Hadoop cluster
• 6,500 jobs/day (192,000/month)
• Soon 690 nodes (11,040 cores)
• 10 PB of storage capacity (soon 28 PB)
Spotify’s data infrastructure
[Diagram: backend services feed data into HDFS; Map/Reduce jobs, coordinated by the Luigi scheduler and Hive, load results into operational and analytical databases that power reporting, dashboards and product features]
The three pillars of our Data Infrastructure
• Kafka: Collection
• Hadoop: Processing
• Databases: Analytics/Visualization
Data collection
Data collection
Kafka: High volume pub-sub system
• Started with a store-and-forward system
• Evaluated Apache Flume
• Currently backend-to-HDFS, but in the future backend-to-backend
• It was almost a good fit, but...
Guaranteed delivery
Kafka doesn't provide message acknowledgements...
• ...at least, not in 0.7 (stable)
• 0.8 has ACKs, but no end-to-end acknowledgements
• A track streamed ≈ a monetary transaction
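Because a streamed track is effectively a monetary transaction, the missing end-to-end acknowledgement has to be layered on top: the producer keeps every message until the final consumer confirms durable storage, and redelivers anything unconfirmed. A minimal pure-Python sketch of that at-least-once pattern (the class, the `flaky_sink` transport and the log payload are illustrative, not Spotify's actual code):

```python
import uuid

class AckingProducer:
    """Retains each message until the end consumer acks it (at-least-once)."""
    def __init__(self, transport):
        self.transport = transport  # e.g. a Kafka client; here any callable sink
        self.unacked = {}           # msg_id -> payload, kept for redelivery

    def send(self, payload):
        msg_id = str(uuid.uuid4())
        self.unacked[msg_id] = payload
        self.transport(msg_id, payload)
        return msg_id

    def ack(self, msg_id):
        """Called once the final consumer (e.g. the HDFS writer) confirms storage."""
        self.unacked.pop(msg_id, None)

    def resend_unacked(self):
        """On timeout, redeliver everything not yet acked."""
        for msg_id, payload in list(self.unacked.items()):
            self.transport(msg_id, payload)

# Usage: a sink that "loses" the first message, then an ack-and-retry cycle.
delivered = []
drop_first = [True]

def flaky_sink(msg_id, payload):
    if drop_first[0]:
        drop_first[0] = False
        return                      # message lost in transit
    delivered.append((msg_id, payload))

producer = AckingProducer(flaky_sink)
m1 = producer.send("track_played:abc123")
producer.resend_unacked()           # no ack arrived, so redeliver
producer.ack(m1)                    # HDFS writer confirms durable storage
```

Note that at-least-once delivery means duplicates are possible on retry, so downstream jobs must deduplicate, e.g. by message id.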
Hacking Kafka
[Diagram: Backend service → Kafka broker → Kafka client → HDFS / S3 / tape, with acknowledgements flowing back to the backend service]
Dumping databases
Not only do we have log files, but also production databases
• Using Sqoop for dumping PostgreSQL
• Map/Reduce job (table scans) for dumping Cassandra
• hdfs2cass for uploading SSTables into Cassandra
• For large DBs we parse application logs and only dump deltas
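The delta approach for large databases boils down to scanning application logs for row ids touched since the last dump, then exporting only those rows. A small sketch, assuming a hypothetical `ts|op|row_id` log format (the real log layout and operations will differ):

```python
def extract_deltas(log_lines, since):
    """Return the set of row ids modified at or after `since`,
    parsed from application-log lines of the form 'ts|op|row_id'."""
    changed = set()
    for line in log_lines:
        ts, op, row_id = line.strip().split("|")
        if int(ts) >= since and op in ("INSERT", "UPDATE", "DELETE"):
            changed.add(row_id)
    return changed

# Usage: only rows touched since timestamp 150 need to be re-dumped.
logs = [
    "100|INSERT|user_1",
    "150|UPDATE|user_2",
    "200|UPDATE|user_1",
    "250|DELETE|user_3",
]
deltas = extract_deltas(logs, since=150)
```

The resulting id set drives a targeted export, avoiding a full table scan of the production database.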
Failures (or the hard stuff)
• We started with a store-and-forward system that didn't scale
• Kafka has multiple components (client, broker, ZooKeeper)
• Internet weather
• As with many large Java systems: garbage collection
Hadoop: our trusted elephant
Scheduling
We wrote and open-sourced our own scheduler: Luigi
• Nothing suitable out there... (unless you really, really like the XML hell of Oozie)
• Written in Python
• Generic scheduler and dependency system that supports Python and Java M/R, Pig, Hive and Sqoop
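Luigi's core idea fits in a few lines of plain Python: every task declares what it depends on and whether its output already exists, and the scheduler recursively runs missing dependencies first. A toy model of that dependency resolution (not Luigi's actual code; in real Luigi you subclass `luigi.Task` and implement `requires()`, `output()` and `run()`):

```python
class Task:
    """Toy model of a Luigi-style task."""
    def requires(self):
        return []           # upstream tasks this one depends on
    def complete(self):
        return False        # real Luigi checks whether output() exists
    def run(self):
        pass

def build(task, executed):
    """Run `task` after all its dependencies, skipping completed work."""
    for dep in task.requires():
        build(dep, executed)
    if not task.complete():
        task.run()
        executed.append(type(task).__name__)

class ExtractLogs(Task):
    def run(self):
        pass                # e.g. pull raw logs out of HDFS

class AggregatePlays(Task):
    def requires(self):
        return [ExtractLogs()]
    def run(self):
        pass                # e.g. aggregate play counts per track

# Usage: asking for the final task runs the whole dependency chain in order.
order = []
build(AggregatePlays(), order)
```

Because completed tasks are skipped, a failed pipeline can simply be re-run and only the missing pieces execute, which is what makes this model pleasant to operate.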
Map/Reduce languages
Python with Hadoop Streaming
• Pros: fast development, many Spotify libraries available
• Cons: slower than Java, no access to the Hadoop API

Java
• Pros: fast, access to the Hadoop API
• Cons: verbose language, not many Spotify libraries available

Pig
• Pros: very small scripts, faster than streaming
• Cons: yet another language to learn, not many Spotify libs available

Hive
• Pros: SQL-like syntax (easy for non-programmers) and relational data model
• Cons: more moving parts (not well suited for a whole pipeline)
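The reason Python streaming development is so fast is that a Hadoop Streaming job is just a mapper and a reducer reading lines and emitting key/value pairs; in production they read `sys.stdin` and print tab-separated lines, but the logic is testable as plain functions. A minimal pair counting plays per track (the `user<TAB>track_id` log format is illustrative):

```python
def mapper(lines):
    """Map phase: emit (track_id, 1) for each play-event line."""
    for line in lines:
        user, track_id = line.strip().split("\t")
        yield track_id, 1

def reducer(pairs):
    """Reduce phase: sum the counts per track. In real Hadoop the
    framework groups pairs by key between map and reduce; here we
    just accumulate into a dict."""
    counts = {}
    for track_id, n in pairs:
        counts[track_id] = counts.get(track_id, 0) + n
    return counts

# Usage: three play events, two distinct tracks.
events = ["u1\ttrackA", "u2\ttrackA", "u1\ttrackB"]
play_counts = reducer(mapper(events))
```

The same two functions, wrapped in `sys.stdin`/`print` loops, would be shipped to `hadoop jar hadoop-streaming.jar` as the `-mapper` and `-reducer` scripts.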
Scaling Hadoop at Spotify
Our Journey
• Started with a small (scrap metal) cluster of 37 servers
• Moved to Amazon Elastic Map/Reduce (EMR) and S3 to quickly scale
• Built an in-house cluster of 60 nodes because of EMR costs
• Capacity planning every 6 months, grown to 327 nodes today
• 363 more waiting to be provisioned
• Put in place a data-retention policy and data archive
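A data-retention policy like the one above amounts to classifying dated partitions: keep recent ones on the cluster, move older ones to the archive, delete anything past the retention horizon. A sketch under assumed thresholds (the 90/365-day cutoffs and the date-partitioned layout are hypothetical, not Spotify's actual policy):

```python
from datetime import date, timedelta

def classify_partitions(partitions, today, hot_days=90, archive_days=365):
    """Split date-named partitions into keep / archive / delete buckets
    based on their age relative to `today`."""
    keep, archive, delete = [], [], []
    for day in partitions:
        age = (today - day).days
        if age <= hot_days:
            keep.append(day)       # stays on the Hadoop cluster
        elif age <= archive_days:
            archive.append(day)    # moved to cheaper archive storage
        else:
            delete.append(day)     # past the retention horizon
    return keep, archive, delete

# Usage: three partitions of different ages on the talk's date.
today = date(2013, 9, 26)
parts = [today - timedelta(days=d) for d in (10, 200, 400)]
keep, archive, delete = classify_partitions(parts, today)
```

Running such a classifier as a scheduled job keeps cluster storage growth bounded even as daily log volume increases.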
Hadoop failures
• "We just need good developers" - No, we need Hadoop experts
• We underestimated the complexity of Hadoop
• You can throw money at the problem of scaling, but at our scale it pays off to optimize
• Give people easy tools early on
Lessons learned
4+ years of Hadoop taught us
• Hadoop has brought us very far. We would never be able to handle the current volume with a "cheap" RDBMS
• "Commodity hardware" doesn't mean cheap hardware
• Hadoop isn't a silver bullet
• Hadoop is a complex system that needs love and care
• You will have to extend Hadoop (and ecosystem components) to tailor it to your needs
Databases and visualization
Databases: used for aggregates
• Aggregates from Hadoop are put into PostgreSQL or Cassandra
• PostgreSQL powering dashboards and empowering analysts
• Cassandra for columnar sets (Spotify Analytics for labels)
• Databases are used for low-latency access by systems or analysts
Spotify Data Warehouse
Spotify Analytics Dashboards
Spotify Analytics
Spotify Analytics: Daft Punk
Spotify Analytics: Whitney Houston
Open Source at Spotify
• Luigi (http://github.com/spotify/luigi)
• Snakebite (http://github.com/spotify/snakebite)
• hdfs2cass (http://github.com/spotify/hdfs2cass)
Questions?
Lessons learned
• Data is crucial for our business
• Hadoop and other cutting-edge technology work pretty well for us
• Hiring technical analysts was a good idea!
• There is no "one-size-fits-all" data product
• Spotify has the "build it ourselves" mindset. Sometimes it's better to buy than build
Thursday, September 26, 13
September 26, 2013
Want to join the band?
Check out http://www.spotify.com/jobs or @Spotifyjobs for more information.
Or mail: wouter@spotify.com
Or Twitter: @xinit