Scalable Data Management @ facebook
Srinivas Narayanan, 11/13/09
Scale
▪ #2 site on the Internet (time on site)
▪ >200 billion monthly page views
▪ Over 1 million developers in 180 countries
▪ Over 300 million active users
▪ More than 232 photos…
▪ 100 million search queries per day
▪ >3.9 trillion feed actions processed per day
▪ 2 billion pieces of content per week
▪ 6 billion minutes per day
[Chart: growth rate, reaching 300M active users in 2009]
Social Networks
The social graph links everything

Scaling Social Networks
▪ Much harder than typical websites, where...
  ▪ Typically 1-2% of users are online: easy to cache the data
  ▪ Partitioning & scaling are relatively easy
▪ What do you do when everything is interconnected?
[Diagram: a social graph of interconnected nodes, each carrying name, status, privacy, and a profile photo or video thumbnail]
System Architecture

Architecture
▪ Load Balancer (assigns a web server)
▪ Web Server (PHP assembles data)
▪ Memcache (fast, simple)
▪ Database (slow, persistent)
Memcache
▪ Simple in-memory hash table
▪ Supports get/set, delete, multiget, multiset
▪ Not a write-through cache
▪ Pros and Cons
  ▪ The Database Shield!
  ▪ Low latency, very high request rates
  ▪ Can be easy to corrupt, inefficient for very small items
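Because memcache is not a write-through cache, the application owns the consistency protocol: read-through on a miss, delete (not update) on a write. A minimal sketch of that cache-aside pattern, with plain dicts standing in for a real memcache client and MySQL (class and key names are illustrative, not Facebook's actual code):

```python
class CacheAsideStore:
    """Cache-aside: memcache shields the database but is not write-through."""

    def __init__(self):
        self.cache = {}   # stands in for memcache: fast, volatile
        self.db = {}      # stands in for MySQL: slow, persistent

    def get(self, key):
        if key in self.cache:          # cache hit: database never touched
            return self.cache[key]
        value = self.db.get(key)       # miss: fall through to the database...
        if value is not None:
            self.cache[key] = value    # ...and repopulate the cache
        return value

    def set(self, key, value):
        self.db[key] = value           # the write goes to the database
        self.cache.pop(key, None)      # invalidate rather than update the cache
```

Invalidating instead of updating on write keeps a stale in-flight read from re-inserting an old value over a newer one.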
Memcache Optimization
▪ Multithreading and efficient protocol code - 50k req/s
▪ Polling network drivers - 150k req/s
▪ Breaking up stats lock - 200k req/s
▪ Batching packet handling - 250k req/s
▪ Breaking up cache lock - future
Network Incast
[Diagram, repeated over several slides: a PHP client fans many small get requests out through a switch to a row of memcache servers; the responses come back as many big data packets at the same instant, overwhelming the switch]
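The slides name the problem but not a specific fix. One common application-level mitigation (an assumption here, not something the deck prescribes) is to cap how many servers' replies can be in flight at once, so the burst arriving at the switch stays bounded. A sketch, where `fetch` is a hypothetical per-server multiget call:

```python
def windowed_multiget(fetch, keys_by_server, window=2):
    """Issue per-server multigets at most `window` servers at a time, so at
    most `window` bursts of reply packets converge on the client's switch
    port simultaneously. fetch(server, keys) -> {key: value} is assumed."""
    results = {}
    servers = list(keys_by_server)
    for i in range(0, len(servers), window):
        for server in servers[i:i + window]:        # one bounded burst
            results.update(fetch(server, keys_by_server[server]))
    return results
```

A real client would issue each window's requests concurrently; the sequential loop here just makes the bounded-burst idea explicit.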
Memcache Clustering
▪ Many small objects per server vs. many servers per large object
[Diagrams: 10 objects clustered on one server cost the PHP client 1 round trip; 5+5 objects on two servers cost 2 round trips total (1 per server); 3+4+3 objects spread over three servers cost 3 round trips total (1 per server)]
Memcache Pool Optimization
▪ Currently a manual process
▪ Replication for obvious hot data sets
▪ Interesting problem: optimize the allocation based on access patterns

Vertical Partitioning of Object Types
[Diagram: a general pool with wide fanout across shards 1..n, plus specialized replicas 1 and 2, each holding only shards 1-2]
[Diagram: tiers of Scribe log-collection servers]
MySQL Usage
▪ Thousands of MySQL servers in two datacenters
▪ MySQL has played a role from the beginning
▪ Pretty solid transactional persistent store
▪ Logical migration of data is difficult
  ▪ Logical-physical db mapping
▪ Rarely use advanced query features
  ▪ Performance
  ▪ Database resources are precious
  ▪ Web tier CPU is relatively cheap
  ▪ Distributed data - no joins!
▪ Sound administrative model
MySQL is better because it is Open Source
▪ We can enhance or extend the database...
  ▪ ...as we see fit
  ▪ ...when we see fit
▪ Facebook extended MySQL to support distributed cache invalidation for memcache:
INSERT table_foo (a,b,c) VALUES (1,2,3) MEMCACHE_DIRTY key1,key2,...
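The annotated statement flows through MySQL replication, so a consumer at the remote datacenter can tail the stream and delete the dirty keys from its local memcache tier. A sketch of that consumer side (the annotation format is taken from the slide; the parsing and the surrounding names are illustrative, not Facebook's actual implementation):

```python
import re

def dirty_keys(statement):
    """Extract the memcache keys from a replicated SQL statement carrying the
    MEMCACHE_DIRTY annotation shown on the slide. Returns [] if absent."""
    m = re.search(r'MEMCACHE_DIRTY\s+(\S+)', statement)
    return m.group(1).split(',') if m else []

def apply_invalidations(statement, cache):
    """Delete each dirty key so the next read repopulates from the database."""
    for key in dirty_keys(statement):
        cache.pop(key, None)
```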
Scaling across datacenters
[Diagram: West Coast - SF Web + SF Memcache, and SC Web + SC Memcache + SC MySQL; East Coast - VA Web + VA Memcache + VA MySQL; MySQL replication flows from SC to VA, with memcache proxies propagating invalidations between the tiers]
Other Interesting Issues
▪ Application-level batching and parallelization
▪ Super hot data items
▪ Cache key versioning with continuous availability
Photos
Photos + Social Graph = Awesome!

Photos: Scale
▪ 20 billion photos x4 = 80 billion
▪ Would wrap around the world more than 10 times!
▪ Over 40M new photos per day
▪ 600K photos / second
Photos Scaling - The easy wins
▪ Upload tier - handles uploads, scales images, stores on NFS
▪ Serving tier - images served from NFS via HTTP
▪ However...
  ▪ File systems are not good at supporting large numbers of files
  ▪ Metadata too large to fit in memory, causing too many IOs for each file read
  ▪ Limited by I/O, not storage density
▪ Easy wins
  ▪ CDN
  ▪ Cachr (http server + caching)
  ▪ NFS file handle cache
Photos: Haystack
▪ Overlay file system
▪ Index in memory
▪ One IO per read
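The three bullets above combine into a simple design: append every photo to one large file and keep a small in-memory index of (offset, size), so a read is exactly one seek+read with no per-photo filesystem metadata lookups. A toy sketch of that idea (the on-disk format and class name are illustrative; real Haystack also stores needle headers and a recovery index on disk):

```python
class HaystackSketch:
    """Toy haystack store: one big append-only file plus an in-RAM index."""

    def __init__(self, path):
        self.path = path
        self.index = {}                  # photo_id -> (offset, size), in memory
        open(path, 'ab').close()         # create the store file if missing

    def put(self, photo_id, data):
        with open(self.path, 'ab') as f:
            offset = f.tell()
            f.write(data)                # append-only write
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id):
        offset, size = self.index[photo_id]   # RAM lookup, no metadata IO
        with open(self.path, 'rb') as f:
            f.seek(offset)
            return f.read(size)          # the single IO per read
```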
Data Warehousing

Data: How much?
▪ 200GB per day in March 2008
▪ 2+ TB (compressed) raw data per day in April 2009
▪ 4+ TB (compressed) raw data per day today

The Data Age
▪ Free or low cost of user services
▪ Consumer behavior hard to predict
▪ Data and analysis are critical
▪ More data beats better algorithms

Deficiencies of existing technologies
▪ Analysis/storage on proprietary systems too expensive
▪ Closed systems are hard to extend
Hadoop & Hive

Hadoop
▪ Superior availability/scalability/manageability despite lower single-node performance
▪ Open system
▪ Scalable costs
▪ Cons: programmability and metadata
  ▪ Map-reduce hard to program (users know sql/bash/python/perl)
  ▪ Need to publish data in well-known schemas

Hive
▪ A system for managing and querying structured data built on top of Hadoop
▪ Components
  ▪ Map-Reduce for execution
  ▪ HDFS for storage
  ▪ Metadata in an RDBMS
Hive: New Technology, Familiar Interface

hive> select key, count(1) from kv1 where key > 100 group by key;

vs.

$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
Hive: Sample Applications
▪ Reporting
  ▪ E.g., daily/weekly aggregations of impression/click counts
  ▪ Measures of user engagement
▪ Ad hoc analysis
  ▪ E.g., how many group admins, broken down by state/country
▪ Machine learning (assembling training data)
  ▪ Ad optimization
  ▪ E.g., user engagement as a function of user attributes
▪ Lots more
Hive: Server Infrastructure
▪ 4800 cores, storage capacity of 5.5 petabytes, 12 TB per node
▪ Two-level network topology
  ▪ 1 Gbit/sec from node to rack switch
  ▪ 4 Gbit/sec to top-level rack switch

Hive & Hadoop: Usage Stats
▪ 4 TB of compressed new data added per day
▪ 135 TB of compressed data scanned per day
▪ 7500+ Hive jobs per day
▪ 80K compute hours per day
▪ 200 people run jobs on Hadoop/Hive
▪ Analysts (non-engineers) use Hadoop through Hive
▪ 95% of jobs are Hive jobs
Hive: Technical Overview

Hive: Open and Extensible
▪ Query your own formats and types with your own Serializer/Deserializers
▪ Extend the SQL functionality through User Defined Functions
▪ Do any non-SQL transformations through the TRANSFORM operator, which sends data from Hive to any user program/script
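With TRANSFORM, Hive streams rows to a user script as tab-separated lines on stdin and reads transformed rows back from stdout. A sketch of the row-filtering logic such a script would run (the column layout here is hypothetical; when wired into Hive, `transform` would be fed `sys.stdin` and its output printed):

```python
def transform(lines):
    """Filter tab-separated rows, keeping those whose first column is a
    number greater than 100 - the same predicate as the earlier kv1 example.
    Non-numeric rows are skipped rather than crashing the streaming job."""
    for line in lines:
        cols = line.rstrip('\n').split('\t')
        try:
            if float(cols[0]) > 100:
                yield '\t'.join(cols)
        except ValueError:
            continue
```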
Hive: Smarter Execution Plans
▪ Map-side joins
▪ Predicate pushdown
▪ Partition pruning
▪ Hash-based aggregations
▪ Parallel execution of operator trees
▪ Intelligent scheduling

Hive: Possible Future Optimizations
▪ Pipelining?
▪ Finer operator control (controlling sorts)
▪ Cost-based optimizations?
▪ HBase
Spikes: The Username Launch

System Design
▪ Database tier cannot handle the load
▪ Dedicated memcache tier for assigned usernames
  ▪ Miss => available
  ▪ Avoid database hits altogether
▪ Blacklists: bucketize, local tier cache, timeout
Username Memcache Tier
▪ Parallel pool in each data center
▪ Writes replicated to all nodes
▪ 8 nodes per pool
▪ Reads can go to any node (hashed by uid)
[Diagram: PHP client in front of username memcache nodes UN0, UN1, ... UN7]
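The bullets above describe a replicate-on-write, read-one pool: every assignment is written to all nodes, so any single node can answer a read, and hashing by uid spreads the read load. A minimal sketch under those assumptions (node count and the uid-modulo routing are illustrative):

```python
class ReplicatedPool:
    """Username tier sketch: writes fan out to every node; each read hits
    exactly one node chosen by uid, so read capacity scales with pool size."""

    def __init__(self, n_nodes=8):
        self.nodes = [dict() for _ in range(n_nodes)]

    def set(self, username, uid):
        for node in self.nodes:                    # replicate write to all nodes
            node[username] = uid

    def get(self, username, reader_uid):
        node = self.nodes[reader_uid % len(self.nodes)]  # spread reads by uid
        return node.get(username)                  # miss => username available
```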
Write Optimization
▪ Hashout store
  ▪ Distributed key-value store (MySQL backed)
  ▪ Lockless (optimistic) concurrency control
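Optimistic concurrency control avoids locks by versioning each record: a writer reads the current version, does its work, and commits only if the version is unchanged, retrying otherwise. A sketch of that compare-and-set discipline (a dict stands in for the MySQL backing; the API is illustrative, not Hashout's actual interface):

```python
class OptimisticKV:
    """Lockless writes via versioned compare-and-set."""

    def __init__(self):
        self.store = {}   # key -> (version, value)

    def read(self, key):
        """Return (version, value); version 0 / value None for absent keys."""
        return self.store.get(key, (0, None))

    def compare_and_set(self, key, expected_version, value):
        """Commit only if nobody wrote since expected_version was read."""
        version, _ = self.store.get(key, (0, None))
        if version != expected_version:
            return False                       # lost the race; caller retries
        self.store[key] = (version + 1, value)
        return True
```

For a username claim this makes the common, uncontended case a single write, while two concurrent claimants race and exactly one wins.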
Fault Tolerance
▪ Memcache nodes can go down
  ▪ Always check another node on miss
  ▪ Replay from a log file (scribe)
▪ Memcache sets are not guaranteed to succeed
  ▪ Self-correcting code: write again to mc if we detect it during db writes
Nuclear Options
▪ Newsfeed
  ▪ Reduce number of stories
  ▪ Turn off scrolling, highlights
▪ Profile
  ▪ Reduce number of stories
  ▪ Make info tab the default
▪ Chat
  ▪ Reduce buddy list refresh rate
  ▪ Turn it off!

How much load?
▪ 200k in 3 min
▪ 1M in 1 hour
▪ 50M in first month
▪ Prepared for over 10x!
Some interesting problems
▪ Graph models and languages
  ▪ Low-latency fast access
  ▪ Slightly more expressive queries
  ▪ Consistency, staleness can be a bit loose
▪ Analysis over large data sets
▪ Privacy as part of the model
▪ Fat data pipes
  ▪ Push enormous volumes of data to several third-party applications (e.g., entire newsfeed to search partners)
  ▪ Controllable QoS

Some interesting problems (contd.)
▪ Search relevance
▪ Storage systems
▪ Middle tier (cache) optimization
▪ Application data access language
Questions?