September 26, 2013
Wouter de Bie, Team Lead Data Infrastructure
Big Data Infrastructure at Spotify
Thursday, September 26, 13
Who am I?
According to ZDNet: "The work they have done to improve the Apache Hive data warehouse system also aligns well with our needs, as we use Hive extensively for ad-hoc queries and for the analysis of large datasets," de Bie said in a statement.
Why data?
We play music, right?
Why data?
We were the first company to do free music streaming. But now everybody can do it, so we need to learn quickly and iterate faster!
Agenda
Let's talk about Data Infrastructure: how we did it, what we learned and how we've failed.
• Some Context
• Use Cases
• Our Infrastructure
• Lessons learned
Some Context
• Spotify started in 2006 (in Sweden)
• Now 1200+ employees, 450+ engineers
• 26 million monthly active users
• 20+ million tracks available
• 4 data centers across the globe
• 17 data engineers building a platform for easy access to data
Use Cases
We're a data-driven company, so data is used almost everywhere:
• Reporting
• Business Analytics
• Operational Analytics
• Product features
Reporting
• Reporting to labels, licensors, partners and advertisers
• We support our partners
Business Analytics
• Analyzing growth, user behavior, sign-up funnels, etc.
• Company KPIs
• A/B testing
• NPS analysis
• Segmentation analysis
• Ad performance
[Chart: Listening behavior in Sweden — share of listening per hour across the week (Mon 00:00 to Sun 22:00), broken down by age group: 0-17, 18-22, 23-27, 28-34, 35-44, 45-59, 60+]
[Chart: Listening behavior in Spain — same hourly breakdown by age group as the Swedish chart]
Impact of hurricane Sandy (29 October 2012)
Impact of hurricane Sandy (30 October 2012)
Operational metrics
• Root cause analysis
• Latency analysis
• Better capacity planning (servers, people, bandwidth)
Product features
• Radio
• Top lists
• Recommendations (better than external parties, because of the amount of data)
Everybody should be able to use data!
So why Data Infrastructure?
[Diagram: View → Backend → Data pipe → Analysis]
Our data infrastructure
Some geeky numbers
• 1 TB of compressed data from users per day
• 400 GB of data from services per day
• 61 TB of data generated in Hadoop each day
• 328-node Hadoop cluster
• 6,500 jobs/day (192,000/month)
• Soon 690 nodes (11,040 cores)
• 10 PB of storage capacity (soon 28 PB)
Spotify’s data infrastructure
[Diagram: backend services feed data into HDFS; Map/Reduce jobs, coordinated by the Luigi scheduler and Hive, load results into operational and analytical databases that power reporting, dashboards and product features]
The three pillars of our Data Infrastructure
• Kafka: Collection
• Hadoop: Processing
• Databases: Analytics/Visualization
Data collection
Data collection
Kafka: High volume pub-sub system
• Started with a store-and-forward system
• Evaluated Apache Flume
• Currently backend-to-HDFS, but in the future backend-to-backend
• It was almost a good fit, but...
Guaranteed delivery
Kafka doesn't provide message acknowledgements...
• ...at least, not in 0.7 (stable)
• 0.8 has ACKs, but no end-to-end acknowledgements
• A track streamed ≈ a monetary transaction
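Because a streamed track is effectively a monetary transaction, the missing end-to-end acknowledgement has to be layered on top: the producer keeps every message until the final consumer confirms durable storage, and redelivers anything unconfirmed. A minimal pure-Python sketch of that at-least-once pattern (the class, the `flaky_sink` transport and the log payload are illustrative, not Spotify's actual code):

```python
import uuid

class AckingProducer:
    """Retains each message until the end consumer acks it (at-least-once)."""
    def __init__(self, transport):
        self.transport = transport  # e.g. a Kafka client; here any callable sink
        self.unacked = {}           # msg_id -> payload, kept for redelivery

    def send(self, payload):
        msg_id = str(uuid.uuid4())
        self.unacked[msg_id] = payload
        self.transport(msg_id, payload)
        return msg_id

    def ack(self, msg_id):
        """Called once the final consumer (e.g. the HDFS writer) confirms storage."""
        self.unacked.pop(msg_id, None)

    def resend_unacked(self):
        """On timeout, redeliver everything not yet acked."""
        for msg_id, payload in list(self.unacked.items()):
            self.transport(msg_id, payload)

# Usage: a sink that "loses" the first message, then an ack-and-retry cycle.
delivered = []
drop_first = [True]

def flaky_sink(msg_id, payload):
    if drop_first[0]:
        drop_first[0] = False
        return                      # message lost in transit
    delivered.append((msg_id, payload))

producer = AckingProducer(flaky_sink)
m1 = producer.send("track_played:abc123")
producer.resend_unacked()           # no ack arrived, so redeliver
producer.ack(m1)                    # HDFS writer confirms durable storage
```

Note that at-least-once delivery means duplicates are possible on retry, so downstream jobs must deduplicate, e.g. by message id.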
Hacking Kafka
[Diagram: Backend service → Kafka broker → Kafka client → HDFS / S3 / tape, with acknowledgements flowing back to the backend service]
Dumping databases
Not only do we have log files, but also production databases
• Using Sqoop for dumping PostgreSQL
• Map/Reduce job (table scans) for dumping Cassandra
• hdfs2cass for uploading SSTables into Cassandra
• For large DBs we parse application logs and only dump deltas
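The delta approach for large databases boils down to scanning application logs for row ids touched since the last dump, then exporting only those rows. A small sketch, assuming a hypothetical `ts|op|row_id` log format (the real log layout and operations will differ):

```python
def extract_deltas(log_lines, since):
    """Return the set of row ids modified at or after `since`,
    parsed from application-log lines of the form 'ts|op|row_id'."""
    changed = set()
    for line in log_lines:
        ts, op, row_id = line.strip().split("|")
        if int(ts) >= since and op in ("INSERT", "UPDATE", "DELETE"):
            changed.add(row_id)
    return changed

# Usage: only rows touched since timestamp 150 need to be re-dumped.
logs = [
    "100|INSERT|user_1",
    "150|UPDATE|user_2",
    "200|UPDATE|user_1",
    "250|DELETE|user_3",
]
deltas = extract_deltas(logs, since=150)
```

The resulting id set drives a targeted export, avoiding a full table scan of the production database.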
Failures (or the hard stuff)
• We started with a store-and-forward system that didn't scale
• Kafka has multiple components (client, broker, ZooKeeper)
• Internet weather
• As with many large Java systems: garbage collection
Hadoop: our trusted elephant
Scheduling
We wrote and open-sourced our own scheduler: Luigi
• Nothing suitable out there... (unless you really, really like the XML hell of Oozie)
• Written in Python
• Generic scheduler and dependency system that supports Python and Java M/R, Pig, Hive and Sqoop
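Luigi's core idea fits in a few lines of plain Python: every task declares what it depends on and whether its output already exists, and the scheduler recursively runs missing dependencies first. A toy model of that dependency resolution (not Luigi's actual code; in real Luigi you subclass `luigi.Task` and implement `requires()`, `output()` and `run()`):

```python
class Task:
    """Toy model of a Luigi-style task."""
    def requires(self):
        return []           # upstream tasks this one depends on
    def complete(self):
        return False        # real Luigi checks whether output() exists
    def run(self):
        pass

def build(task, executed):
    """Run `task` after all its dependencies, skipping completed work."""
    for dep in task.requires():
        build(dep, executed)
    if not task.complete():
        task.run()
        executed.append(type(task).__name__)

class ExtractLogs(Task):
    def run(self):
        pass                # e.g. pull raw logs out of HDFS

class AggregatePlays(Task):
    def requires(self):
        return [ExtractLogs()]
    def run(self):
        pass                # e.g. aggregate play counts per track

# Usage: asking for the final task runs the whole dependency chain in order.
order = []
build(AggregatePlays(), order)
```

Because completed tasks are skipped, a failed pipeline can simply be re-run and only the missing pieces execute, which is what makes this model pleasant to operate.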
Map/Reduce languages
Python with Hadoop Streaming
• Pros: fast development, many Spotify libraries available
• Cons: slower than Java, no access to the Hadoop API

Java
• Pros: fast, access to the Hadoop API
• Cons: verbose language, not many Spotify libraries available

Pig
• Pros: very small scripts, faster than streaming
• Cons: yet another language to learn, not many Spotify libs available

Hive
• Pros: SQL-like syntax (easy for non-programmers) and relational data model
• Cons: more moving parts (not well suited for a whole pipeline)
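The reason Python streaming development is so fast is that a Hadoop Streaming job is just a mapper and a reducer reading lines and emitting key/value pairs; in production they read `sys.stdin` and print tab-separated lines, but the logic is testable as plain functions. A minimal pair counting plays per track (the `user<TAB>track_id` log format is illustrative):

```python
def mapper(lines):
    """Map phase: emit (track_id, 1) for each play-event line."""
    for line in lines:
        user, track_id = line.strip().split("\t")
        yield track_id, 1

def reducer(pairs):
    """Reduce phase: sum the counts per track. In real Hadoop the
    framework groups pairs by key between map and reduce; here we
    just accumulate into a dict."""
    counts = {}
    for track_id, n in pairs:
        counts[track_id] = counts.get(track_id, 0) + n
    return counts

# Usage: three play events, two distinct tracks.
events = ["u1\ttrackA", "u2\ttrackA", "u1\ttrackB"]
play_counts = reducer(mapper(events))
```

The same two functions, wrapped in `sys.stdin`/`print` loops, would be shipped to `hadoop jar hadoop-streaming.jar` as the `-mapper` and `-reducer` scripts.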
Scaling Hadoop at Spotify
Our Journey
• Started with a small (scrap metal) cluster of 37 servers
• Moved to Amazon Elastic Map/Reduce (EMR) and S3 to quickly scale
• Built an in-house cluster of 60 nodes because of EMR costs
• Capacity planning every 6 months, grown to 327 nodes today
• 363 more waiting to be provisioned
• Put in place a data-retention policy and data archive
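A data-retention policy like the one above amounts to classifying dated partitions: keep recent ones on the cluster, move older ones to the archive, delete anything past the retention horizon. A sketch under assumed thresholds (the 90/365-day cutoffs and the date-partitioned layout are hypothetical, not Spotify's actual policy):

```python
from datetime import date, timedelta

def classify_partitions(partitions, today, hot_days=90, archive_days=365):
    """Split date-named partitions into keep / archive / delete buckets
    based on their age relative to `today`."""
    keep, archive, delete = [], [], []
    for day in partitions:
        age = (today - day).days
        if age <= hot_days:
            keep.append(day)       # stays on the Hadoop cluster
        elif age <= archive_days:
            archive.append(day)    # moved to cheaper archive storage
        else:
            delete.append(day)     # past the retention horizon
    return keep, archive, delete

# Usage: three partitions of different ages on the talk's date.
today = date(2013, 9, 26)
parts = [today - timedelta(days=d) for d in (10, 200, 400)]
keep, archive, delete = classify_partitions(parts, today)
```

Running such a classifier as a scheduled job keeps cluster storage growth bounded even as daily log volume increases.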
Hadoop failures
• "We just need good developers" - No, we need Hadoop experts
• We underestimated the complexity of Hadoop
• You can throw money at the problem of scaling, but at our scale it pays off to optimize
• Give people easy tools early on
Lessons learned
4+ years of Hadoop taught us
• Hadoop has brought us very far. We would never be able to handle the current volume with a "cheap" RDBMS
• "Commodity hardware" doesn't mean cheap hardware
• Hadoop isn't a silver bullet
• Hadoop is a complex system that needs love and care
• You will have to extend Hadoop (and ecosystem components) to tailor it to your needs
Databases and visualization
Databases: used for aggregates
• Aggregates from Hadoop are put into PostgreSQL or Cassandra
• PostgreSQL powering dashboards and empowering analysts
• Cassandra for columnar sets (Spotify Analytics for labels)
• Databases are used for low-latency access by systems or analysts
Spotify Data Warehouse
Spotify Analytics Dashboards
Spotify Analytics
Spotify Analytics: Daft Punk
Spotify Analytics: Whitney Houston
Open Source at Spotify
• Luigi (http://github.com/spotify/luigi)
• Snakebite (http://github.com/spotify/snakebite)
• hdfs2cass (http://github.com/spotify/hdfs2cass)
Questions?
Lessons learned
• Data is crucial for our business
• Hadoop and other cutting-edge technology work pretty well for us
• Hiring technical analysts was a good idea!
• There is no "one-size-fits-all" data product
• Spotify has the "build it ourselves" mindset. Sometimes it's better to buy than build
Thursday, September 26, 13
September 26, 2013
Want to join the band?
Check out http://www.spotify.com/jobs or @Spotifyjobs for more information.
Or mail: wouter@spotify.com
Or Twitter: @xinit