Big data at Etsy began in early 2010 and has since grown to power applications as diverse as ETL, A/B testing, recommender systems, and search indexing. Join us for an amusing tour through the history of big data at Etsy, going back to the roots of our mission-critical A/B testing approach, followed by a dive into a selection of the technologies that power such applications today.
Patchwork Data at Etsy
Matt Walker
Timeline (2005 to 2013): Etsy, June
What happened?
We don’t like to talk about it
Okay, we do
• http://codeascraft.etsy.com
• https://www.etsy.com/codeascraft/talks
• http://kongscreenprinting.com
Catch Phrases
• Continuous deployment
• Blameless postmortems
• Measure everything
• Continuous experimentation
Metrics-Driven Development
• Ganglia
• StatsD/Graphite
• Splunk
Scaling a Traditional RDBMS
• Sharded MySQL
• memcached
• Object-relational mapping in PHP
Timeline (2005 to 2013): Adtuitive, December
Adtuitive
• Online advertising network
• Match forum posts with rich product advertisements
• Unafraid of scaling across Etsy sellers
Adtuitive
• Amazon Web Services
• JRuby
• Rails
LAMP Stack for Big Data
• HDFS
• MapReduce
• HBase
• Hive
• Flume
• JDBC/ODBC
• Hue
• Pig
• Oozie
• Avro
• Zookeeper
http://gigaom.com/2010/08/01/meet-big-data-equivalent-of-the-lamp-stack/
LAMP Stack for Big Data
• HDFS → S3
• MapReduce (Elastic)
• HBase
• Hive
• Flume
• JDBC/ODBC
• Hue
• Pig → Cascading
• Oozie
• Avro → TupleSerialization
• Zookeeper
Powered by MapReduce
• ETL
• Analytics
• A/B testing
• Recommenders
• Search
Applications
• Log ETL
• Database snapshotter
• TasteTest
• Facebook Gift Recommender
• Complementary/similar listings
• Funnel Cake
• Feature Funnel
• A/B Analyzer
• Catapult
• Distributed search indexing
• Fast Game (search index)
• Search autosuggest
• SearchAds
• SCRAM ETL (fraud detection)
Catapult
• End-to-end success story
• Extremely valuable for a web shop
Timeline (2005 to 2013): Relevancy Thursdays, January
Relevancy Thursdays
• Switch default sort order to relevance
• Each Thursday in January
Relevancy Thursdays
• Default search order was recency
• Relisting was our equivalent of advertising
• $0.20 updated your listing’s timestamp
Relevancy Thursdays
• Recency was meant to support “freshness” in search results
• Search originated as a PostgreSQL query
• Converted to Solr to scale
What happens if we switch to relevance?
Relevancy Thursdays
• No A/B testing framework
• No event logs
• Limping along with Google Analytics
Timeline (2005 to 2013): First Log Analysis, February
First Log Analysis
• Raw web access logs
• URL- and ref tag-based
• Regex parser
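A regex parser over raw access logs, as listed above, might look roughly like the following Python sketch. The log layout (Apache "combined" format), the `ref` query parameter name, the sample line, and the function name `parse_line` are all assumptions made for illustration; the slides do not show Etsy's actual parser.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumed layout: Apache "combined" access log format.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) \S+'
)

def parse_line(line):
    """Return a dict with path, ref tag and status, or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    parsed = urlparse(m.group("url"))
    # Treating "ref" as the query parameter that carries the ref tag is an assumption.
    ref = parse_qs(parsed.query).get("ref", [None])[0]
    return {
        "ip": m.group("ip"),
        "path": parsed.path,
        "ref": ref,
        "status": int(m.group("status")),
    }

# Hypothetical sample line, not taken from real logs.
sample = ('203.0.113.7 - - [03/Feb/2011:10:15:32 +0000] '
          '"GET /listing/123?ref=sr_gallery_1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"')
event = parse_line(sample)
```

Lines that do not match return `None` instead of raising, so a batch job can count and skip malformed records.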
Heyday of Tooling
• A/B framework
• Front end event logger
• Database snapshotter
• Barnum and Bailey
• Custom operator library
• Loaders
LAMP Stack for Big Data
• HDFS → S3
• MapReduce (Elastic)
• HBase
• Hive
• Flume → Akamai
• JDBC/ODBC → snapshotter/loaders
• Hue
• Pig → Cascading
• Oozie → Barnum
• Avro → TupleSerialization
• Zookeeper
A/B Framework
• Ramp-ups + A/B testing
• Feature flag development
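One common way to combine ramp-ups with A/B testing behind a feature flag is deterministic hash bucketing. The Python sketch below illustrates that general technique only; the function names, the experiment name, and the choice of SHA-256 are hypothetical and are not taken from Etsy's framework.

```python
import hashlib

def bucket(user_id, experiment, buckets=100):
    """Map a user deterministically to a bucket in [0, buckets)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def variant(user_id, experiment, enabled_pct):
    """Return 'treatment' for the first enabled_pct buckets, else 'control'."""
    return "treatment" if bucket(user_id, experiment) < enabled_pct else "control"

# Ramping a hypothetical flag from 10% to 50% only ever adds users to treatment.
at_10 = {u for u in range(1000) if variant(u, "new_search", 10) == "treatment"}
at_50 = {u for u in range(1000) if variant(u, "new_search", 50) == "treatment"}
```

Because the bucket depends only on the user and the experiment name, a user sees the same variant on every request, and raising the percentage during a ramp-up never moves anyone out of the treatment group.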
Self-service analytics for any A/B test on the site
Timeline (2005 to 2013): A/B Framework, June
Timeline (2005 to 2013): A/B Analyzer, November
Why did it take so long?
• Non-web developers learning the PHP stack
• Failed experiments with “easier to use” MapReduce tools
• Realizing self-service analytics was what Etsy needed
Timeline (2005 to 2013): Catapult, February
Catapult
• A/B Analyzer + Launch Calendar
• Full product lifecycle
LAMP Stack for Big Data
• HDFS
• MapReduce
• HBase
• Hive → Vertica
• Flume → logrotate
• JDBC/ODBC → snapshotter/loaders
• Hue
• Pig → Cascading
• Oozie
• Avro → TupleSerialization
• Zookeeper
Computation Models
• Batch
• Interactive
• Streaming
Batch
Cascading
RDBMS                      Cascading
SQL                        cascading.jruby
Query Planner/Optimizer    Cascading
Execution Engine           MapReduce
Storage                    HDFS
cascading.jruby
• Productivity: no compile
• Reuse: factor out structure
• Efficiency: no JRuby runtime
• Optimization: move aggregations map-side
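The last bullet, moving aggregations map-side, can be illustrated with a small local simulation. This is plain Python rather than cascading.jruby or the Cascading API, and the function names and sample events are hypothetical; it only shows why pre-aggregating inside each map task shrinks the data sent through the shuffle.

```python
from collections import Counter, defaultdict

def map_naive(records):
    """Emit one (key, 1) pair per record; every pair crosses the shuffle."""
    return [(key, 1) for key in records]

def map_side_aggregate(records):
    """Pre-aggregate within one map task; emit one partial count per key."""
    return list(Counter(records).items())

def reduce_counts(pairs):
    """Sum the (possibly partial) counts for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Two hypothetical map-task input splits of event names.
splits = [["view", "view", "cart", "view"], ["cart", "view", "purchase"]]

naive = [pair for split in splits for pair in map_naive(split)]
combined = [pair for split in splits for pair in map_side_aggregate(split)]
```

Both paths reduce to the same totals, but the map-side version emits one pair per distinct key per split instead of one pair per record.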
A nice constructor
cascading.jruby
Productivity
• Job templates
• Reloader
• Cascading local mode
• Sampled data
Reuse
Field Names
Efficiency
• Just a constructor
• Calls into Cascading API
• No JRuby runtime on cluster
Optimization
Tuple Data Model
UDFs
Scalding
• Distributed collections
• Function literals replace UDFs
Interactive
Vertica
Sharded MySQL
• Borrowed from Flickr
• Works
Thou Shalt Not Join
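With data spread across shards, a join typically has to happen in application code instead of in SQL. The sketch below uses in-memory dictionaries as stand-ins for MySQL shards; the modulo placement rule, the table layout, and all names are assumptions for illustration, not a description of Etsy's or Flickr's scheme.

```python
# Hypothetical in-memory stand-ins for two shards; real shards would be MySQL databases.
SHARDS = [
    {"users": {2: "ann", 4: "dee"},
     "orders": [{"buyer": 2, "seller": 3, "item": "mug"}]},
    {"users": {1: "bob", 3: "cy"},
     "orders": [{"buyer": 1, "seller": 4, "item": "scarf"}]},
]

def shard_for(user_id):
    """Assumed placement rule: user_id modulo the number of shards."""
    return SHARDS[user_id % len(SHARDS)]

def orders_with_seller_names(buyer_id):
    """Join orders to seller names in application code, one shard lookup per seller."""
    orders = [o for o in shard_for(buyer_id)["orders"] if o["buyer"] == buyer_id]
    return [(o["item"], shard_for(o["seller"])["users"][o["seller"]]) for o in orders]

result = orders_with_seller_names(2)
```

Each order needs a second lookup on whichever shard holds the seller, which is the kind of work a single SQL join would otherwise do in one query.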
Timeline (2005 to 2013): Hive, January
Timeline (2005 to 2013): Hive Turned Off, April
Hive
• Slow
• Sensitive
• Operational burden
• Educational burden
Vertica
• Offline copy of shards, master, auxiliary databases
• Joins are easy
• Reasonable latency
Timeline (2005 to 2013): Vertica, November
Vertica
• Game changer at Etsy
• High demand for joins
• Rapid prototyping data pipelines
Back to MapReduce
• Event logs
• Schedule
• Load data in prod
• Scale
Vertica
• Not Hive, Impala, Shark, etc.
• May change our minds
Streaming
Not Powered by MapReduce
• Activity Feed
• Shop Stats
Etsyweb
• memcached
• Gearman
• Sharded MySQL
Use cases
• Trending
• Fraud detection
• ?
Turns out people don’t make product decisions in real time
http://mcfunley.com/whom-the-gods-would-destroy-they-first-give-real-time-analytics
Summing Up
• Be glad you’re living in the future
• Automated tools for the common case
• Don’t be afraid to experiment
Image Credits
• http://kongscreenprinting.com/what-we-do-showcase
• http://animal.discovery.com
• http://www.rallyrace.com/turning-over-the-stone-event-production-basics/
• http://www.flickr.com/photos/bbalaji/2443820505/
• http://www.madeyoulaugh.com/funny_photos/caveman_harley/caveman_harley.jpg
• http://theundercoverrecruiter.com/6-ways-catapult-your-job-search-after-layoff/
• http://www.globaltimes.cn/SPECIALCOVERAGE/Top10Peopleof2011.aspx
• http://www.theculturemap.com/scream-time-edvard-munch-museum/
• http://www.repentamerica.com/webelieve.html
• https://soundcloud.com/tearland/tl-hive
• http://pocketnow.com/2012/08/02/wifi-vs-data-speed-vs-battery-life/bush-scratching-head
Contact / Reference
• Matt Walker
• @data_daddy
• http://codeascraft.etsy.com/
• http://www.etsy.com/codeascraft/talks