CASSANDRA: A DECENTRALIZED STRUCTURED STORAGE SYSTEM
Avinash Lakshman, Prashant Malik - Facebook
Presented by Dhruv Patel | CS 886 | Spring 2016
1
Overview
Background
Data Model
Architecture
Implementation
Facebook Search Index
Conclusion
Discussion
2
History
Developed by Facebook
Designed to fulfill the storage needs of Facebook
Index Search
Billions of writes/day
High write throughput
Scale with number of users
Deployed as the backend storage system for
multiple services within Facebook
3
Motivation
High Scalability
Read & Write throughput increases linearly with
number of nodes
High Availability
Treats failures as norms rather than exceptions
Fault Tolerance
4
Data Model
Table: Distributed multi-dimensional map, indexed
by a key
Value: Object which is highly structured
Row Key: String – no size restriction
Normally 16-36 bytes long
Operations are atomic on each row per replica
5
Data Model (contd.)
Column Families (CF) – columns grouped together
Number of CFs not limited per table
Types of Column Families:
Simple CF
Super CF – CF within CF
Column Sort Order – application specific
Time
Facebook Inbox Search – results displayed in time sorted order
Name
6
Data Model (contd.)
Each Column has
Name
Value
Timestamp
Column Access
Simple column
column_family: column
Super column
column_family:super_column:column
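The access paths above can be sketched with an in-memory stand-in for the data model; the nested-dict layout, the `get` helper, and the example keys are all illustrative, not Cassandra's actual storage engine:

```python
# Hypothetical in-memory sketch of the data model: a table maps row keys
# to column families; a simple CF maps column name -> (value, timestamp),
# while a super CF adds one more level of nesting.
table = {
    "user:1": {
        # Simple column family
        "profile": {"name": ("John", 1271527811)},
        # Super column family: super column -> column -> (value, timestamp)
        "inbox": {"bob": {"msg-42": ("hello", 1271527900)}},
    }
}

def get(table, row_key, path):
    """Resolve 'cf:column' or 'cf:super_column:column' access paths."""
    node = table[row_key]
    for part in path.split(":"):
        node = node[part]
    return node[0]  # drop the timestamp, return the value

print(get(table, "user:1", "profile:name"))      # simple column access
print(get(table, "user:1", "inbox:bob:msg-42"))  # super column access
```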
7
Data Model (contd.)
Key                  User_Id   User_Name   Date
6576e768-8r73-78df   786780    John        2010-04-17T18:10:11
2456e124-6y78-12ef   218745    Bob         2010-02-12T14:12:16
(each row key maps column names to column values)
8
Data Model (contd.)
[Figure: a Simple Column Family maps each row key to a set of columns (Column 1 → Value 1, Column 2 → Value 2, ...); a Super Column Family adds one more level, mapping each row key to super columns that each contain their own columns. Figure adapted from [3]]
9
API
Methods:
insert (table, key, rowMutation)
get (table, key, columnName)
delete (table, key, columnName)
columnName:
Column within column family
Column family
Super column family
Column within a super column
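The shape of this three-call API can be sketched as follows; the `CassandraLike` class and its dict-backed store are illustrative stand-ins, not the real client interface:

```python
# Hypothetical wrapper showing the shape of the paper's API; a dict
# stands in for the cluster, and a rowMutation is a set of column writes.
class CassandraLike:
    def __init__(self):
        self.store = {}

    def insert(self, table, key, row_mutation):
        row = self.store.setdefault((table, key), {})
        row.update(row_mutation)  # apply all column writes atomically per row

    def get(self, table, key, column_name):
        return self.store[(table, key)][column_name]

    def delete(self, table, key, column_name):
        del self.store[(table, key)][column_name]

db = CassandraLike()
db.insert("Users", "u1", {"name": "John"})
print(db.get("Users", "u1", "name"))
db.delete("Users", "u1", "name")
```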
10
System Architecture
Partitioning – high scalability
Replication – high availability and durability
Cluster membership – how nodes are
added/deleted
Bootstrapping – how nodes start for the first time
Scaling the Cluster
Local Persistence
Implementation Details
11
Partitioning
Scale incrementally
Nodes: logically structured in Ring Topology
Each node assigned a random value – position
Hashing on data-item’s key, assign to node
Walk the ring clockwise
This node – coordinator of the key
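The clockwise ring walk can be sketched as below; the `Ring` class and the 32-bit MD5-based hash are illustrative assumptions, not Cassandra's actual partitioner:

```python
import bisect
import hashlib

def h(s):
    # Illustrative hash onto a 32-bit ring; the real partitioner differs.
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, nodes):
        # Each node gets a position (token) on the ring via the hash.
        self.tokens = sorted((h(n), n) for n in nodes)
        self.positions = [t for t, _ in self.tokens]

    def coordinator(self, key):
        # Walk clockwise from h(key) to the first node at or after it,
        # wrapping around to the start of the ring.
        i = bisect.bisect_left(self.positions, h(key)) % len(self.tokens)
        return self.tokens[i][1]

ring = Ring(["A", "B", "C", "D", "E"])
print(ring.coordinator("key1"))
```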
12
Partitioning (contd.)
[Figure: ring of nodes A, B, C, D, E; h(key1) and h(key2) place each key on the ring, and the next node clockwise becomes its coordinator]
13
Partitioning (contd.)
Consistent Hashing
Advantages
• Departure or arrival of the node only affects its immediate neighbours
Challenges
• Non-uniform data & load distribution
• Unaware of node performance heterogeneity
Solution
• Assign nodes to multiple positions on the ring
• Analyze load information
• Load Distribution
• Move lightly loaded nodes to alleviate heavily loaded nodes
Cassandra uses the 2nd approach
14
Replication
Data items are replicated at N nodes
[Figure: h(key1) lands on the ring of nodes A–E; with N = 3, the coordinator and the next two nodes clockwise hold the replicas]
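Replica placement on the ring can be sketched as below, assuming the "rack unaware" policy (coordinator plus the next N-1 nodes clockwise); the hash and node names are illustrative:

```python
import bisect
import hashlib

def h(s):
    # Illustrative 32-bit hash; not Cassandra's actual partitioner.
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

def replicas(nodes, key, n=3):
    """Rack-unaware sketch: the coordinator and the next n-1 distinct
    nodes clockwise on the ring store the key's replicas."""
    tokens = sorted((h(node), node) for node in nodes)
    positions = [t for t, _ in tokens]
    start = bisect.bisect_left(positions, h(key)) % len(tokens)
    return [tokens[(start + i) % len(tokens)][1] for i in range(n)]

print(replicas(["A", "B", "C", "D", "E"], "key1", n=3))
```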
15
Replication (contd.)
Cassandra Replication Policies
Metadata about ranges a node is responsible for
Cached locally at each node and inside Zookeeper
Fault tolerant : Node crash and comes back – knows its range
• Rack Unaware: replicate data on the N-1 nodes succeeding the coordinator node
• Rack Aware: Zookeeper elects a leader, which tells nodes the ranges they are replicas for
• Datacenter Aware: the leader is chosen at the datacenter level instead of the rack level
16
Cluster Membership
Scuttlebutt
Gossip based protocol
Inspired from real life rumor spreading
Periodic, Pairwise & Inter-node communication
Failure Detection
Determine which node is up and down
Avoid attempts to communicate with unreachable nodes
17
Cluster Membership (contd.)
Failure Detection
Φ Accrual Failure Detection
Emits a suspicion value Φ, instead of a binary value (up/down)
Φ = suspicion level for each monitored node
Adapts to network and load conditions
Each node maintains a sliding window of inter-arrival times of
gossip messages from other nodes, from which Φ is calculated
If the node is faulty: its suspicion level increases with time
If the node is correct: Φ stays around a low constant set by the
application, generally close to 0
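The suspicion value can be sketched from the sliding window as below; the exponential inter-arrival model and the specific numbers are assumptions for illustration:

```python
import math
from statistics import mean

def phi(now, arrivals):
    """Sketch of accrual failure detection: suspicion Phi grows with the
    time elapsed since the last gossip message. Assumes (illustratively)
    exponentially distributed inter-arrival times with the observed mean."""
    intervals = [b - a for a, b in zip(arrivals, arrivals[1:])]
    mean_interval = mean(intervals)
    t = now - arrivals[-1]                   # silence since last message
    p_later = math.exp(-t / mean_interval)   # P(next message arrives after now)
    return -math.log10(p_later)              # Phi: higher = more suspicious

beats = [0.0, 1.0, 2.0, 3.0, 4.0]  # gossip heard every second
print(phi(4.5, beats))   # recently heard from: low suspicion
print(phi(14.0, beats))  # silent for 10 s: high suspicion
```

A correct node's Φ stays small because gossip keeps arriving on schedule; a faulty node's Φ climbs without bound as the silence lengthens.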
18
Bootstrapping
Node starts for the first time
Reads configuration file – list of few contact points
Zookeeper
Receives a random token for its position
Persists mapping locally and in Zookeeper
Token information is gossiped around the cluster
Or manually by administrator via command line tool
or Cassandra web interface
19
Scaling
Joining node assigned a token
such that it can alleviate a heavily loaded node
Splitting the range
Bootstrap algorithm initiated by a command-line tool or the web
interface
Overloaded node streams its data to the new node
using kernel-kernel copy techniques
~40 MB/s
20
Local Persistence
Relies on the local file system for data persistence
Write Operation
Write to commit log in local disk of the node
Durability and recoverability
Update in-memory data structure
Only after successful write in commit log
When in-memory data structure crosses certain limit, it dumps itself to disk
Merge process runs in background to merge such files into one file
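The write path above can be sketched as a toy node; the dict-based commit log, memtable, and flush threshold are illustrative stand-ins for the on-disk structures:

```python
class Node:
    """Toy sketch of the local write path: append to the commit log
    first (durability), then update the in-memory structure; dump to an
    immutable on-disk file once the memtable crosses a size limit."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # stand-in for the on-disk commit log
        self.memtable = {}     # in-memory data structure
        self.sstables = []     # immutable "on-disk" files, oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # durable write first
        self.memtable[key] = value            # then the in-memory update
        if len(self.memtable) >= self.flush_threshold:
            self.sstables.append(dict(self.memtable))  # dump to "disk"
            self.memtable = {}

node = Node()
for i in range(4):
    node.write(f"k{i}", i)
print(len(node.sstables), node.memtable)
```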
21
Local Persistence (contd.)
Write Implementation
Figure adapted from [6]
22
Local Persistence (contd.)
Compaction: merges data files that are close to each other (in size) into one file
Figure adapted from [6]
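The merge step can be sketched as follows; the dict-per-file representation is an assumption for illustration, and the files are taken oldest to newest so newer values win:

```python
def compact(sstables):
    """Sketch of compaction: merge several on-disk key->value files
    into a single file, keeping only the newest value for each key.
    `sstables` is ordered oldest to newest."""
    merged = {}
    for table in sstables:   # newer files overwrite older entries
        merged.update(table)
    return dict(sorted(merged.items()))  # one sorted output file

print(compact([{"a": 1, "b": 2}, {"b": 3, "c": 4}]))
```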
23
Local Persistence (contd.)
Read Operation
Looks up in-memory data structure : newest to oldest
If not found, look into files on disk
Bloom Filter
Summarizing keys in one file
Avoid looking up files which do not contain the key
Stored in memory and on each data file
Column Index
Jump to right chunk on disk for column retrieval
Index entry at every 256 KB chunk boundary
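The read path with per-file bloom filters can be sketched as below; the bit-array filter, hash choices, and `read` helper are illustrative, not the actual implementation:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter sketch: summarizes the keys in one data file
    so reads can skip files that definitely do not contain the key."""

    def __init__(self, size=256, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # No false negatives; occasional false positives are possible.
        return all(self.bits >> p & 1 for p in self._positions(key))

def read(key, memtable, files):
    """Check the in-memory structure first, then on-disk files newest to
    oldest, consulting each file's bloom filter before touching it."""
    if key in memtable:
        return memtable[key]
    for data, bloom in reversed(files):  # files ordered oldest to newest
        if bloom.might_contain(key) and key in data:
            return data[key]
    return None

bf = BloomFilter()
bf.add("a")
mem, files = {}, [({"a": 1}, bf)]
print(read("a", mem, files))
print(read("zzz", mem, files))
```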
24
Implementation Details
Cassandra process on a single machine
Partitioning module
Cluster membership and failure detection module
Storage engine module
Architecture is based on SEDA
Staged Event Driven Architecture
Operations transit from one stage to the next
Each stage can be handled by different thread pool
Gives high performance
Figure adapted from [8]
25
Implementation Details (contd.)
Commit Log Maintenance
Commit log is rolled over after it reaches a certain
threshold (128 MB)
Each commit log contains header
Bit vector
Shows if column family has successfully persisted to disk
This header will be checked before purging the commit log
Make sure all the data is persisted to disk
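As a sketch, the purge check reduces to testing that every column family's bit is set in the segment header; the list-of-booleans representation is an assumption for illustration:

```python
def can_purge(header_bits):
    """Sketch of the commit-log purge check: the header carries one bit
    per column family, set once that CF's in-memory data has been
    flushed to disk; the log may be deleted only when all bits are set."""
    return all(header_bits)

print(can_purge([True, True, True]))   # all CFs flushed: safe to purge
print(can_purge([True, False, True]))  # one CF not yet flushed: keep log
```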
26
Facebook Index Search
Two kinds of search
Term Search
Key: user_id
Super Column: words that make up the message
Interaction Search
Key: user_id
Super Column: recipients’ ids
For each of these super columns, individual message identifiers are the columns
Term Search – not currently available!
27
Facebook Index Search (contd.)
Faster search technique
User clicks on search bar
Asynchronous message is sent to Cassandra cluster
Prime the buffer cache with that user’s index
Search results likely to be in memory when query is
executed
50+TB data on 150 node cluster
28
Conclusion
High write throughput
No single points of failure
Linear scalability
Durability
Clever integration of Bigtable and Dynamo
29
Questions and Discussion
Security
Need to use trusted environments
Development of Security Layer externally
Comparison with other such systems would have
produced more persuasive results
Dense reading
No figures
30
References
[1] A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2): 35-40, 2010.
[2] T. Rabl, S. Gómez-Villamor, M. Sadoghi, V. Muntés-Mulero, H.-A. Jacobsen, and S. Mankovskii, “Solving Big Data Challenges for Enterprise Application Performance Management,” Proc. VLDB Endow., vol. 5, no. 12, pp. 1724–1735, Aug. 2012.
[3] E. Hewitt, Cassandra: The Definitive Guide, 1 edition. Sebastopol, CA; Koln u.a.: O’Reilly Media, 2010.
[4] http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012/6.1-Cassandra.ppt
[5] N. Hayashibara, X. Défago, R. Yared, and T. Katayama, “The φ Accrual Failure Detector,” in Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, Washington, DC, USA, 2004, pp. 66–78.
[6] http://www.odbms.org/wp-content/uploads/2013/11/cassandra.pdf
[7] http://vanets.vuse.vanderbilt.edu/dokuwiki/lib/exe/fetch.php?media=teaching:cassandra_presentation_final.pptx
[8] http://www.eecs.harvard.edu/~mdw/talks/seda-sosp01-talk.pdf
[9] http://www.datastax.com/documentation/articles/cassandra/cassandrathenandnow.html
[10] https://cs.uwaterloo.ca/~tozsu/courses/CS848/W15/presentations/Cassandra.pdf
[11] http://www.slideshare.net/VaradMeru/cassandra-a-decentralized-structured-storage-system
[12] http://www.powershow.com/view4/56cc1c-NTgwM/Cassandra_-_A_Decentralized_Structured_Storage_System_powerpoint_ppt_presentation
[13] https://prezi.com/lzzsbrotjjcn/cassandra-a-decentralized-structured-storage-system/
[14] https://stephenholiday.com/notes/cassandra/
[15] https://huu.la/blog/review-of-cassandra--a-decentralized-structured-storage-system
[16] http://daizuozhuo.github.io/cassandra-review/
[17] https://xduan7.com/2016/02/09/paper-review-cassandra-a-decentralized-structured-storage-system/
[18] http://www.ibm.com/developerworks/library/os-apache-cassandra/
[19] http://www.computerweekly.com/tip/Securing-NoSQL-applications-Best-practises-for-big-data-security
31