A discussion of the recent work to transition Cassandra from its naive 1-partition-per-node distribution to a proper virtual nodes implementation.
#Cassandra13
Rethinking Topology in Cassandra
Cassandra Summit, June 11, 2013
Eric Evans
eevans@opennms.com
@jericevans
DHT 101
DHT 101: partitioning
[Figure: ring of nodes labeled A, B, C, Y, Z, built up over several slides; the final build places Key = Aaa on the ring]
DHT 101: replica placement
[Figure: the same ring (A, B, C, Y, Z) with Key = Aaa, illustrating replica placement]
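A minimal sketch of the partitioning and replica placement pictured above. It assumes one token per node, string tokens that sort lexically (chosen only to match the A..Z labels), and the convention that a node owns the range ending at its own token. The `Ring` class and its method names are hypothetical and are not Cassandra APIs.

```python
import bisect


class Ring:
    """Toy ring: each node owns the range (previous token, own token]."""

    def __init__(self, node_tokens):
        # node_tokens: dict mapping node name -> token (one token per node)
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]

    def primary_index(self, key_token):
        # First token >= key_token; wrap to index 0 past the last token.
        i = bisect.bisect_left(self._tokens, key_token)
        return i % len(self._ring)

    def replicas(self, key_token, rf):
        # Walk clockwise from the primary, taking rf consecutive nodes.
        start = self.primary_index(key_token)
        count = min(rf, len(self._ring))
        return [self._ring[(start + k) % len(self._ring)][1] for k in range(count)]


ring = Ring({"A": "A", "B": "B", "C": "C", "Y": "Y", "Z": "Z"})
print(ring.replicas("Aaa", rf=3))  # ['B', 'C', 'Y']
```

With these assumptions the key Aaa sorts between tokens A and B, so B is the primary and the next nodes clockwise hold the remaining replicas.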
DHT 101: consistency
Consistency
Availability
Partition tolerance
DHT 101 scenario: consistency level = one
[Figure: write (W) against replicas A, ?, ?]
DHT 101 scenario: consistency level = all
[Figure: read (R) against replicas A, ?, ?]
DHT 101 scenario: quorum write
[Figure: write (W) against replicas A, B, ?]
DHT 101 scenario: quorum read
[Figure: read (R) against replicas A, B, ?]
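The scenarios above can be checked with a little arithmetic. The sketch below assumes a replication factor of 3 (matching the three replicas drawn on the slides) and uses the standard overlap rule that a read is guaranteed to touch at least one replica holding the latest write when W + R > RF. The function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_sees_latest_write(rf: int, write_replicas: int, read_replicas: int) -> bool:
    """True when every read set must overlap every write set (W + R > RF)."""
    return write_replicas + read_replicas > rf


RF = 3  # assumed replication factor
for name, w, r in [
    ("write ONE, read ALL", 1, RF),
    ("write QUORUM, read QUORUM", quorum(RF), quorum(RF)),
    ("write ONE, read ONE", 1, 1),
]:
    print(name, read_sees_latest_write(RF, w, r))
```

The first two combinations correspond to the slide pairs (one/all and quorum/quorum); the third is included only as a contrast where no overlap is guaranteed.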
Awesome, yes?
Well...
Problem: Poor request/stream distribution
Distribution
[Figure: ring diagram with nodes A, B, C, M, Y, Z shown over several builds; in the later builds a node labeled A1 appears next to Z]
Problem: Poor data distribution
Distribution
[Figure: ring diagrams over several builds: nodes A, B, C, D; then a node E is added; then nodes A through H]
Virtual Nodes
In a nutshell...
Benefits
● Operationally simpler (no token management)
● Better distribution of load
● Concurrent streaming (all hosts)
● Smaller partitions mean greater reliability
● Better supports heterogeneous hardware
Strategies
● Automatic sharding
● Fixed partition assignment
● Random token assignment
Strategy: automatic sharding
● Partitions are split when data exceeds a threshold
● Newly created partitions are relocated to a host with less data
● Similar to Bigtable, or Mongo auto-sharding
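A rough sketch of the split-and-relocate idea in the bullets above, not a description of how Bigtable or Mongo implement it. The threshold is counted in keys rather than bytes, and `Partition`, `least_loaded`, and `add_key` are hypothetical names introduced for illustration.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Partition:
    host: str
    keys: list = field(default_factory=list)  # kept sorted


def least_loaded(partitions, hosts):
    """Return the host currently holding the fewest keys."""
    load = {h: 0 for h in hosts}
    for p in partitions:
        load[p.host] += len(p.keys)
    return min(hosts, key=lambda h: load[h])


def add_key(partitions, hosts, key, threshold):
    """Insert a key; split the partition if it grows past the threshold."""
    # Partitions are ordered by key range: pick the last one whose lowest key <= key.
    idx = 0
    for i, p in enumerate(partitions):
        if p.keys and p.keys[0] <= key:
            idx = i
    target = partitions[idx]
    bisect.insort(target.keys, key)
    if len(target.keys) > threshold:
        mid = len(target.keys) // 2
        upper_keys = target.keys[mid:]
        target.keys = target.keys[:mid]
        # Relocate the newly created partition to the host with the least data.
        new_host = least_loaded(partitions, hosts)
        partitions.insert(idx + 1, Partition(host=new_host, keys=upper_keys))


hosts = ["h1", "h2"]
partitions = [Partition(host="h1")]
for k in range(10):
    add_key(partitions, hosts, k, threshold=4)
for p in partitions:
    print(p.host, p.keys)
```

Partition size stays bounded by the threshold, while the number of partitions grows as more keys arrive.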
Strategy: fixed partition assignment
● Namespace divided into Q evenly-sized partitions
● Q/N partitions assigned per host (where N is number of hosts)
● Joining hosts “steal” partitions evenly from existing hosts
● Used by Dynamo/Voldemort (“strategy 3” in Dynamo paper)
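A small sketch of fixed partition assignment under simple assumptions: partitions are numbered 0 to Q-1, the initial layout is round-robin, and a joining host takes its Q/N share one partition at a time from whichever host currently owns the most. The function names are hypothetical, and this is not the exact procedure from the Dynamo paper.

```python
def initial_assignment(q, hosts):
    """Spread Q fixed partitions round-robin over the initial hosts."""
    return {p: hosts[p % len(hosts)] for p in range(q)}


def join(assignment, new_host):
    """Give new_host its Q/N share, taking one partition at a time from
    whichever existing host currently owns the most."""
    owned = {}
    for partition, host in assignment.items():
        owned.setdefault(host, []).append(partition)
    share = len(assignment) // (len(owned) + 1)
    for _ in range(share):
        donor = max(owned, key=lambda h: len(owned[h]))
        assignment[owned[donor].pop()] = new_host
    return assignment


Q = 12
assignment = initial_assignment(Q, ["a", "b", "c"])
join(assignment, "d")
counts = {}
for host in assignment.values():
    counts[host] = counts.get(host, 0) + 1
print(counts)  # {'a': 3, 'b': 3, 'c': 3, 'd': 3}
```

The number of partitions never changes; only their owners do, which is why partition size grows with the total data volume.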
Strategy: random token assignment
● Each host assigned T random tokens
● T random tokens generated for joining hosts; new tokens divide existing ranges
● Similar to libketama; identical to Classic Cassandra when T=1
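A minimal sketch of random token assignment, assuming a toy unsigned token space of 2^64 values and the convention that a token owns the range from the previous token up to itself. The helper names (`assign_tokens`, `join`, `ownership`) are hypothetical, and the seed is arbitrary.

```python
import random

RING_SIZE = 2 ** 64  # toy unsigned token space; an assumption for illustration


def assign_tokens(hosts, t, rng):
    """Give every host T random tokens; return the ring sorted by token."""
    return sorted((rng.randrange(RING_SIZE), h) for h in hosts for _ in range(t))


def join(ring, host, t, rng):
    """A joining host picks T random tokens, each splitting an existing range."""
    return sorted(ring + [(rng.randrange(RING_SIZE), host) for _ in range(t)])


def ownership(ring):
    """Fraction of the ring owned per host; a token owns (previous token, token]."""
    owned = {}
    prev = ring[-1][0] - RING_SIZE  # the first range wraps around from the last token
    for token, host in ring:
        owned[host] = owned.get(host, 0) + (token - prev)
        prev = token
    return {h: width / RING_SIZE for h, width in owned.items()}


rng = random.Random(13)
ring = assign_tokens(["n1", "n2", "n3", "n4"], t=256, rng=rng)
ring = join(ring, "n5", t=256, rng=rng)
for host, share in sorted(ownership(ring).items()):
    print(host, f"{share:.3f}")
```

Because the joining host's tokens land in ranges spread across the whole ring, it takes a little data from every existing host rather than splitting a single neighbor's range.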
Considerations
1. Number of partitions
2. Partition size
3. How 1 changes with more nodes and data
4. How 2 changes with more nodes and data
Evaluating
Strategy        No. Partitions   Partition size
Random          O(N)             O(B/N)
Fixed           O(1)             O(B)
Auto-sharding   O(B)             O(1)
Evaluating
Automatic sharding
● Partition size is constant (great)
● Number of partitions scales linearly with data size (bad)
Evaluating
Fixed partition assignment
● Number of partitions is constant (good)
● Partition size scales linearly with data size (bad)
● Greater operational complexity (bad)
Evaluating
Random token assignment
● Number of partitions scales linearly with number of hosts (OK)
● Partition size increases with more data; decreases with more hosts (good)
Evaluating
● Automatic sharding
● Fixed partition assignment
● Random token assignment
Cassandra
Configuration: conf/cassandra.yaml
# Comma separated list of tokens, (new
# installs only).
initial_token: <token>,<token>,<token>
or
# Number of tokens to generate.
num_tokens: 256
Configuration: nodetool info
Token            : (invoke with -T/--tokens to see all 256 tokens)
ID               : 6a8dc22c-1f37-473f-8f7e-47742f4b83a5
Gossip active    : true
Thrift active    : true
Load             : 42.92 MB
Generation No    : 1370016307
Uptime (seconds) : 221
Heap Memory (MB) : 998.72 / 1886.00
Data Center      : datacenter1
Rack             : rack1
Exceptions       : 0
Key Cache        : size 1128 (bytes), capacity 98566144 (bytes), 42 hits, 54 re...
Row Cache        : size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN ...
Configuration: nodetool ring
Datacenter: datacenter1
==========
Replicas: 0
Address    Rack   Status  State   Load      Owns    Token
                                                    3074457345618258602
127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  -9223372036854775808
127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3098476543630901247
127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3122495741643543892
127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3146514939656186537
127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3170534137668829183
127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3194553335681471828
127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  321857253369411447
127.0.0.1  rack1  Up      Normal  42.92 MB  33.33%  3242591731706757118
...
Configuration: nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns   Host ID                               Rack
UN  127.0.0.1  42.92 MB  256     33.3%  6a8dc22c-1f37-473f-8f7e-47742f4b83a5  rack1
UN  127.0.0.2  60.17 MB  256     33.3%  26263a2b-768e-4a79-8d41-3624a14b13a8  rack1
UN  127.0.0.3  56.85 MB  256     33.3%  5b3e208f-6d36-4c7b-b2bb-b7c476a1af66  rack1
Migration
[Figure: ring diagram with node labels A, B, D]
Migration: edit conf/cassandra.yaml and restart
# Number of tokens to generate.
num_tokens: 256
Migration: convert to T contiguous tokens in existing ranges
[Figure: ring diagram in which the range owned by A is divided into many contiguous tokens, each labeled A; labels B and C also appear]
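One way to picture this conversion step: a node's single token is replaced by T tokens that together cover exactly the range it already owned, so no data moves yet. The sketch below assumes a toy unsigned token space and evenly spaced sub-tokens; both are illustrative assumptions, and `split_range` is a hypothetical helper rather than Cassandra code.

```python
RING_SIZE = 2 ** 64  # toy unsigned token space; an assumption for illustration


def split_range(prev_token, token, t, ring_size=RING_SIZE):
    """Replace one token with T evenly spaced tokens covering the same range
    (prev_token, token], so ownership is unchanged."""
    # A width of 0 means a lone node that owns the whole ring.
    width = (token - prev_token) % ring_size or ring_size
    return [(prev_token + width * k // t) % ring_size for k in range(1, t + 1)]


print(split_range(0, 100, t=4))                 # [25, 50, 75, 100]
print(split_range(90, 10, t=4, ring_size=100))  # [95, 0, 5, 10]
```

The last generated token always equals the node's original token, which is what keeps the outer boundary of its range fixed.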
Migration: shuffle
[Figure: the same ring diagram as the previous slide, illustrating the shuffle step]
Shuffle
● Range transfers are queued on each host
● Hosts initiate transfer to self
● Pay attention to the logs!
Shuffle
Usage: shuffle [options] <sub-command>

Sub-commands:
 create           Initialize a new shuffle operation
 ls               List pending relocations
 clear            Clear pending relocations
 en[able]         Enable shuffling
 dis[able]        Disable shuffling

Options:
 -dc, --only-dc        Apply only to named DC (create only)
 -u,  --username       JMX username
 -tp, --thrift-port    Thrift port number (Default: 9160)
 -p,  --port           JMX port number (Default: 7199)
 -tf, --thrift-framed  Enable framed transport for Thrift (Default: false)
 -en, --and-enable     Immediately enable shuffling (create only)
 -pw, --password       JMX password
 -H,  --help           Print help information
 -h,  --host           JMX hostname or IP address (Default: localhost)
 -th, --thrift-host    Thrift hostname or IP address (Default: JMX host)
Performance
removenode
[Figure: chart comparing Cassandra 1.2 and Cassandra 1.1; axis ticks from 0 to 450]
bootstrap
[Figure: chart comparing Cassandra 1.2 and Cassandra 1.1; axis ticks from 0 to 600]
The End
● DeCandia, Giuseppe, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. “Dynamo: Amazon’s Highly Available Key-value Store.” Web.
● Low, Richard. “Improving Cassandra's uptime with virtual nodes.” Web.
● Overton, Sam. “Virtual Nodes Strategies.” Web.
● Overton, Sam. “Virtual Nodes: Performance Results.” Web.
● Jones, Richard. “libketama - a consistent hashing algo for memcache clients.” Web.