CASSANDRA: A DECENTRALIZED STRUCTURED STORAGE SYSTEM
Avinash Lakshman, Prashant Malik - Facebook
Presented by Dhruv Patel | CS 886 | Spring 2016
1
Overview
Background
Data Model
Architecture
Implementation
Facebook Search Index
Conclusion
Discussion
2
History
Developed by Facebook
Designed to fulfill the storage needs of Facebook
Index Search
Billions of writes/day
High write throughput
Scale with number of users
Deployed as the backend storage system for
multiple services within Facebook
3
Motivation
High Scalability
Read & Write throughput increases linearly with
number of nodes
High Availability
Treats failures as norms rather than exceptions
Fault Tolerance
4
Data Model
Table: Distributed multi-dimensional map, indexed
by a key
Value: Object which is highly structured
Row Key: String – no size restriction
Normally 16-36 bytes long
Operations are atomic on each row per replica
5
Data Model (contd.)
Column Families (CF) – columns grouped together
Number of CFs not limited per table
Types of Column Families:
Simple CF
Super CF – CF within CF
Column Sort Order – application specific
Time
Facebook Inbox Search – results displayed in time sorted order
Name
6
Data Model (contd.)
Each Column has
Name
Value
Timestamp
Column Access
Simple column
column_family: column
Super column
column_family:super_column:column
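The access paths above can be sketched with an in-memory stand-in for the data model; the nested-dict layout, the `get` helper, and the example keys are all illustrative, not Cassandra's actual storage engine:

```python
# Hypothetical in-memory sketch of the data model: a table maps row keys
# to column families; a simple CF maps column name -> (value, timestamp),
# while a super CF adds one more level of nesting.
table = {
    "user:1": {
        # Simple column family
        "profile": {"name": ("John", 1271527811)},
        # Super column family: super column -> column -> (value, timestamp)
        "inbox": {"bob": {"msg-42": ("hello", 1271527900)}},
    }
}

def get(table, row_key, path):
    """Resolve 'cf:column' or 'cf:super_column:column' access paths."""
    node = table[row_key]
    for part in path.split(":"):
        node = node[part]
    return node[0]  # drop the timestamp, return the value

print(get(table, "user:1", "profile:name"))      # simple column access
print(get(table, "user:1", "inbox:bob:msg-42"))  # super column access
```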
7
Data Model (contd.)
Key                  User_Id   User_Name   Date
6576e768-8r73-78df   786780    John        2010-04-17T18:10:11
2456e124-6y78-12ef   218745    Bob         2010-02-12T14:12:16
(each row key maps column names to column values)
8
Data Model (contd.)
[Figure: a Simple Column Family maps each row key to a set of columns (Column 1 → Value 1, Column 2 → Value 2, ...); a Super Column Family adds one more level, mapping each row key to super columns that each contain their own columns. Figure adapted from [3]]
9
API
Methods:
insert (table, key, rowMutation)
get (table, key, columnName)
delete (table, key, columnName)
columnName:
Column within column family
Column family
Super column family
Column within a super column
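The shape of this three-call API can be sketched as follows; the `CassandraLike` class and its dict-backed store are illustrative stand-ins, not the real client interface:

```python
# Hypothetical wrapper showing the shape of the paper's API; a dict
# stands in for the cluster, and a rowMutation is a set of column writes.
class CassandraLike:
    def __init__(self):
        self.store = {}

    def insert(self, table, key, row_mutation):
        row = self.store.setdefault((table, key), {})
        row.update(row_mutation)  # apply all column writes atomically per row

    def get(self, table, key, column_name):
        return self.store[(table, key)][column_name]

    def delete(self, table, key, column_name):
        del self.store[(table, key)][column_name]

db = CassandraLike()
db.insert("Users", "u1", {"name": "John"})
print(db.get("Users", "u1", "name"))
db.delete("Users", "u1", "name")
```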
10
System Architecture
Partitioning – high scalability
Replication – high availability and durability
Cluster membership – how nodes are
added/deleted
Bootstrapping – how nodes start for the first time
Scaling the Cluster
Local Persistence
Implementation Details
11
Partitioning
Scale incrementally
Nodes: logically structured in Ring Topology
Each node assigned a random value – position
Hashing on data-item’s key, assign to node
Walk the ring clockwise
This node – coordinator of the key
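The clockwise ring walk can be sketched as below; the `Ring` class and the 32-bit MD5-based hash are illustrative assumptions, not Cassandra's actual partitioner:

```python
import bisect
import hashlib

def h(s):
    # Illustrative hash onto a 32-bit ring; the real partitioner differs.
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, nodes):
        # Each node gets a position (token) on the ring via the hash.
        self.tokens = sorted((h(n), n) for n in nodes)
        self.positions = [t for t, _ in self.tokens]

    def coordinator(self, key):
        # Walk clockwise from h(key) to the first node at or after it,
        # wrapping around to the start of the ring.
        i = bisect.bisect_left(self.positions, h(key)) % len(self.tokens)
        return self.tokens[i][1]

ring = Ring(["A", "B", "C", "D", "E"])
print(ring.coordinator("key1"))
```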
12
Partitioning (contd.)
[Figure: ring of nodes A, B, C, D, E; h(key1) and h(key2) place each key on the ring, and the next node clockwise becomes its coordinator]
13
Partitioning (contd.)
Consistent Hashing
Advantages
• Departure or arrival of the node only affects its immediate neighbours
Challenges
• Non-uniform data & load distribution
• Unaware of node performance heterogeneity
Solution
• Assign nodes to multiple positions on the ring
• Analyze load information
• Load Distribution
• Move lightly loaded nodes to alleviate heavily loaded nodes
Cassandra uses the 2nd approach
14
Replication
Data items are replicated at N nodes
[Figure: h(key1) lands on the ring of nodes A–E; with N = 3, the coordinator and the next two nodes clockwise hold the replicas]
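Replica placement on the ring can be sketched as below, assuming the "rack unaware" policy (coordinator plus the next N-1 nodes clockwise); the hash and node names are illustrative:

```python
import bisect
import hashlib

def h(s):
    # Illustrative 32-bit hash; not Cassandra's actual partitioner.
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

def replicas(nodes, key, n=3):
    """Rack-unaware sketch: the coordinator and the next n-1 distinct
    nodes clockwise on the ring store the key's replicas."""
    tokens = sorted((h(node), node) for node in nodes)
    positions = [t for t, _ in tokens]
    start = bisect.bisect_left(positions, h(key)) % len(tokens)
    return [tokens[(start + i) % len(tokens)][1] for i in range(n)]

print(replicas(["A", "B", "C", "D", "E"], "key1", n=3))
```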
15
Replication (contd.)
Cassandra Replication Policies
Metadata about ranges a node is responsible for
Cached locally at each node and inside Zookeeper
Fault tolerant : Node crash and comes back – knows its range
• Rack Unaware: replicate data on the N-1 nodes succeeding the coordinator node
• Rack Aware: Zookeeper elects a leader, which tells nodes the ranges they are replicas for
• Datacenter Aware: the leader is chosen at the datacenter level instead of the rack level
16
Cluster Membership
Scuttlebutt
Gossip based protocol
Inspired from real life rumor spreading
Periodic, Pairwise & Inter-node communication
Failure Detection
Determine which node is up and down
Avoid attempts to communicate with unreachable nodes
17
Cluster Membership (contd.)
Failure Detection
Φ Accrual Failure Detection
Emits a suspicion value Φ, instead of a binary value (up/down)
Φ = suspicion level for each monitored node
Adapts to network and load conditions
Each node maintains a sliding window of inter-arrival times of
gossip messages from other nodes, from which Φ is calculated
If the node is faulty: its suspicion level increases with time
If the node is correct: Φ stays around a low constant set by the
application, generally close to 0
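The suspicion value can be sketched from the sliding window as below; the exponential inter-arrival model and the specific numbers are assumptions for illustration:

```python
import math
from statistics import mean

def phi(now, arrivals):
    """Sketch of accrual failure detection: suspicion Phi grows with the
    time elapsed since the last gossip message. Assumes (illustratively)
    exponentially distributed inter-arrival times with the observed mean."""
    intervals = [b - a for a, b in zip(arrivals, arrivals[1:])]
    mean_interval = mean(intervals)
    t = now - arrivals[-1]                   # silence since last message
    p_later = math.exp(-t / mean_interval)   # P(next message arrives after now)
    return -math.log10(p_later)              # Phi: higher = more suspicious

beats = [0.0, 1.0, 2.0, 3.0, 4.0]  # gossip heard every second
print(phi(4.5, beats))   # recently heard from: low suspicion
print(phi(14.0, beats))  # silent for 10 s: high suspicion
```

A correct node's Φ stays small because gossip keeps arriving on schedule; a faulty node's Φ climbs without bound as the silence lengthens.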
18
Bootstrapping
Node starts for the first time
Reads configuration file – list of few contact points
Zookeeper
Receives a random token for its position
Persists mapping locally and in Zookeeper
Token information is gossiped around the cluster
Or manually by administrator via command line tool
or Cassandra web interface
19
Scaling
Joining node assigned a token
such that it can alleviate a heavily loaded node
Splitting the range
Bootstrap algorithm initiated by a command-line tool or the web
interface
Overloaded node streams its data to the new node
using kernel-kernel copy techniques
~40 MB/s
20
Local Persistence
Relies on the local file system for data persistence
Write Operation
Write to commit log in local disk of the node
Durability and recoverability
Update in-memory data structure
Only after successful write in commit log
When in-memory data structure crosses certain limit, it dumps itself to disk
Merge process runs in background to merge such files into one file
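The write path above can be sketched as a toy node; the dict-based commit log, memtable, and flush threshold are illustrative stand-ins for the on-disk structures:

```python
class Node:
    """Toy sketch of the local write path: append to the commit log
    first (durability), then update the in-memory structure; dump to an
    immutable on-disk file once the memtable crosses a size limit."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # stand-in for the on-disk commit log
        self.memtable = {}     # in-memory data structure
        self.sstables = []     # immutable "on-disk" files, oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # durable write first
        self.memtable[key] = value            # then the in-memory update
        if len(self.memtable) >= self.flush_threshold:
            self.sstables.append(dict(self.memtable))  # dump to "disk"
            self.memtable = {}

node = Node()
for i in range(4):
    node.write(f"k{i}", i)
print(len(node.sstables), node.memtable)
```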
21
Local Persistence (contd.)
Write Implementation
Figure adapted from [6]
22
Local Persistence (contd.)
Compaction: merges data files that are close to each other (in size) into one file
Figure adapted from [6]
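The merge step can be sketched as follows; the dict-per-file representation is an assumption for illustration, and the files are taken oldest to newest so newer values win:

```python
def compact(sstables):
    """Sketch of compaction: merge several on-disk key->value files
    into a single file, keeping only the newest value for each key.
    `sstables` is ordered oldest to newest."""
    merged = {}
    for table in sstables:   # newer files overwrite older entries
        merged.update(table)
    return dict(sorted(merged.items()))  # one sorted output file

print(compact([{"a": 1, "b": 2}, {"b": 3, "c": 4}]))
```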
23
Local Persistence (contd.)
Read Operation
Looks up in-memory data structure : newest to oldest
If not found, look into files on disk
Bloom Filter
Summarizing keys in one file
Avoid looking up files which do not contain the key
Stored in memory and on each data file
Column Index
Jump to right chunk on disk for column retrieval
Index entry at every 256 KB chunk boundary
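The read path with per-file bloom filters can be sketched as below; the bit-array filter, hash choices, and `read` helper are illustrative, not the actual implementation:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter sketch: summarizes the keys in one data file
    so reads can skip files that definitely do not contain the key."""

    def __init__(self, size=256, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # No false negatives; occasional false positives are possible.
        return all(self.bits >> p & 1 for p in self._positions(key))

def read(key, memtable, files):
    """Check the in-memory structure first, then on-disk files newest to
    oldest, consulting each file's bloom filter before touching it."""
    if key in memtable:
        return memtable[key]
    for data, bloom in reversed(files):  # files ordered oldest to newest
        if bloom.might_contain(key) and key in data:
            return data[key]
    return None

bf = BloomFilter()
bf.add("a")
mem, files = {}, [({"a": 1}, bf)]
print(read("a", mem, files))
print(read("zzz", mem, files))
```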
24
Implementation Details
Cassandra process on a single machine
Partitioning module
Cluster membership and failure detection module
Storage engine module
Architecture is based on SEDA
Staged Event Driven Architecture
Operations transit from one stage to the next
Each stage can be handled by different thread pool
Gives high performance
Figure adapted from [8]
25
Implementation Details (contd.)
Commit Log Maintenance
Commit log is rolled over after it reaches a certain
threshold (128 MB)
Each commit log contains header
Bit vector
Shows if column family has successfully persisted to disk
This header will be checked before purging the commit log
Make sure all the data is persisted to disk
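As a sketch, the purge check reduces to testing that every column family's bit is set in the segment header; the list-of-booleans representation is an assumption for illustration:

```python
def can_purge(header_bits):
    """Sketch of the commit-log purge check: the header carries one bit
    per column family, set once that CF's in-memory data has been
    flushed to disk; the log may be deleted only when all bits are set."""
    return all(header_bits)

print(can_purge([True, True, True]))   # all CFs flushed: safe to purge
print(can_purge([True, False, True]))  # one CF not yet flushed: keep log
```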
26
Facebook Index Search
Two kinds of search
Term Search
Key: user_id
Super Column: words that make up the message
Interaction Search
Key: user_id
Super Column: recipients’ ids
For each of these super columns, individual message identifiers are the columns
Term Search – not currently available!
27
Facebook Index Search (contd.)
Faster search technique
User clicks on search bar
Asynchronous message is sent to Cassandra cluster
Prime the buffer cache with that user’s index
Search results likely to be in memory when query is
executed
50+TB data on 150 node cluster
28
Conclusion
High write throughput
No single points of failure
Linear scalability
Durability
Clever integration of Bigtable and Dynamo
29
Questions and Discussion
Security
Need to use trusted environments
Development of Security Layer externally
Comparison with other such systems would have
produced more persuasive results
Dense reading
No figures
30
References
[1] A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2): 35-40, 2010.
[2] T. Rabl, S. Gómez-Villamor, M. Sadoghi, V. Muntés-Mulero, H.-A. Jacobsen, and S. Mankovskii, “Solving Big Data Challenges for Enterprise Application Performance Management,” Proc. VLDB Endow., vol. 5, no. 12, pp. 1724–1735, Aug. 2012.
[3] E. Hewitt, Cassandra: The Definitive Guide, 1 edition. Sebastopol, CA; Koln u.a.: O’Reilly Media, 2010.
[4] http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012/6.1-Cassandra.ppt
[5] N. Hayashibara, X. Défago, R. Yared, and T. Katayama, “The φ Accrual Failure Detector,” in Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, Washington, DC, USA, 2004, pp. 66–78.
[6] http://www.odbms.org/wp-content/uploads/2013/11/cassandra.pdf
[7] http://vanets.vuse.vanderbilt.edu/dokuwiki/lib/exe/fetch.php?media=teaching:cassandra_presentation_final.pptx
[8] http://www.eecs.harvard.edu/~mdw/talks/seda-sosp01-talk.pdf
[9] http://www.datastax.com/documentation/articles/cassandra/cassandrathenandnow.html
[10] https://cs.uwaterloo.ca/~tozsu/courses/CS848/W15/presentations/Cassandra.pdf
[11] http://www.slideshare.net/VaradMeru/cassandra-a-decentralized-structured-storage-system
[12] http://www.powershow.com/view4/56cc1c-NTgwM/Cassandra_-_A_Decentralized_Structured_Storage_System_powerpoint_ppt_presentation
[13] https://prezi.com/lzzsbrotjjcn/cassandra-a-decentralized-structured-storage-system/
[14] https://stephenholiday.com/notes/cassandra/
[15] https://huu.la/blog/review-of-cassandra--a-decentralized-structured-storage-system
[16] http://daizuozhuo.github.io/cassandra-review/
[17] https://xduan7.com/2016/02/09/paper-review-cassandra-a-decentralized-structured-storage-system/
[18] http://www.ibm.com/developerworks/library/os-apache-cassandra/
[19] http://www.computerweekly.com/tip/Securing-NoSQL-applications-Best-practises-for-big-data-security
31