Nzpug welly-cassandra-02-12-2010

Introduction to Cassandra

Wellington Python User GroupAaron Morton @aaronmorton

1/12/2010

Disclaimer.This is an introduction not

a reference.

I may, from time to time and for the best possible

reasons, bullshit you.

What do you already know about Cassandra?

Get ready.

The next slide has a lot on it.

Cassandra is a distributed, fault tolerant, scalable, column oriented data

store.

A word about “column oriented”.

Relax.

It’s different to a row oriented DB like MySQL.

So...

For now, think about keys and values. Where each

value is a hash / dict.

{‘foo’ : {‘bar’ : ‘baz’,},}

{key : {col_name : col_value,},}

Cassandra’s data model and on disk storage are based on the Google Bigtable

paper from 2006.

The distributed cluster design is based on the

Amazon Dynamo paper from 2007.

Easy.Lets store ‘foo’ somewhere.

'foo'

But I want to be able to read it back if one machine

fails.

Lets distribute it on 3 of the 5 nodes I have.

This is the Replication Factor.

Called RF or N.

Each node has a token that identifies the upper value of

the key range it is responsible for.

#1<= E

#2<= J

#3<= O

#4<= T

#5<= Z

Client connects to a random node and asks it to coordinate storing the ‘foo’

key.

Each node knows about all other nodes in the cluster,

including their tokens.

This is achieved using a Gossip protocol. Every

second each node shares it’s full view of the cluster with 1 to 3 other nodes.

Our coordinator is node 5. It knows node 2 is

responsible for the ‘foo’ key.

#1<= E

#2'foo'

#3<= O

#4<= T

#5<= Z

Client

But there is a problem...

What if we have lots of values between F and J?

We end up with a “hot” section in our ring of

nodes.

That’s bad mmmkay?

You shouldn't have a hot section in your ring.

mmmkay?

A Partitioner is used to apply a transform to the

key. The transformed values are also used to define the

range for a node.

The Random Partitioner applies a MD5 transform. The range of all possible

keys values is changed to a 128 bit number.

There are other Partitioners, such as the

Order Preserving Partitioner. But start with the Random Partitioner.

Let’s pretend all keys are now transformed to an

integer between 0 and 9.

Our 5 node cluster now looks like.

#1<= 2

#2<= 4

#3<= 6

#4<= 8

#5<= 0

Pretend our ‘foo’ key transforms to 3.

#1<= 2

#2"3"

#3<= 6

#4<= 8

#5<= 0

Client

Good start.

But where are the replicas?We want to replicate the

‘foo’ key 3 times.

A Replication Strategy is used to determine which

nodes should store replicas.

It’s also used to work out which nodes should have a

value when reading.

Simple Strategy orders the nodes by their token and

places the replicas clockwise around the ring.

Network Topology Strategy is aware of the racks and

Data Centres your servers are in. Can split replicas

between DC’s.

Simple Strategy will do in most cases.

Our coordinator will send the write to all 3 nodes at

once.

#1<= 2

#2"3"

#3"3"

#4"3"

#5<= 0

Client

Once the 3 replicas tell the coordinator they have

finished, it will tell the client the write completed.

Done.Let’s go home.

Hang on.What about fault tolerant?What if node #4 is down?

#1<= 2

#2"3"

#3"3"

#4"3"

#5<= 0

Client

After all, two out of three ain’t bad.

The client must specify a Consistency Level for each

operation.

Consistency Level specifies how many nodes must

agree before the operation is a success.

For reads is known as R.For writes is known as W.

Here are the simple ones (there are a few more)...

One.The coordinator will only

wait for one node to acknowledge the write.

Quorum.N/2 + 1

All.

The cluster will work to eventually make all copies

of the data consistent.

To get consistent behaviour make sure that R + W > N.

You can do this by...

Always using Quorum for read and writes.

Or...

Use All for writes and One for reads.

Or...

Use All for reads and One for writes.

Try our write again, using Quorum consistency level.

Coordinator will wait for 2 nodes to complete the write before telling the client it has completed.

#1<= 2

#2"3"

#3"3"

#4"3"

#5<= 0

Client

What about when node 4 comes online?

It will not have our “foo” key.

Won’t somebody please think of the “foo” key!?

During our write the coordinator will send a

Hinted Handoff to one of the online replicas.

Hinted Handoff tells the node that one of the

replicas was down and needs to be updated later.

#1<= 2

#2"3"

#3"3"

#4"3"

#5<= 0

Client

send "3" to #4

When node 4 comes back up, node 3 will eventually

process the Hinted Handoffs and send the

“foo” key to it.

#1<= 2

#2"3"

#3"3"

#4"3"

#5<= 0

Client

What if the “foo” key is read before the Hinted Handoff is processed?

#1<= 2

#2"3"

#3"3"

#4""

#5<= 0

Client

send "3" to #4

At our Quorum CL the coordinator asks all nodes that should have replicas to

perform the read.

The Replication Strategy is used by the coordinator to

find out which nodes should have replicas.

Once CL nodes have returned, their values are

compared.

If the do not match a Read Repair process is kicked off.

A timestamp provided by the client during the write is used to determine the

“latest” value.

The “foo” key is written to node 4, and consistency

achieved, before the coordinator returns to the

client.

At lower CL’s the Read Repair happens in the

background and is probabilistic.

We can force Cassandra to repair everything using the

Anti Entropy feature.

Anti Entropy is the main feature for achieving

consistency. RR and HH are optimisations.

Anti Entropy is started manually via command line

or Java JMX.

Great so far.

But ratemylolcats.com is going to be huge.

How do I store 100 Million pictures of cats?

Add more nodes.

More disk capacity, disk IO, memory, CPU, network IO.

More everything.

Except.

Less “work” per node.

Linear scaling.

Clusters of 100+ TB.

And now for the data model.

From the outside in.

A Keyspace is the container for everything in your

application.

Keyspaces can be thought of as Databases.

A Column Family is a container for ordered and

indexed Columns.

Columns have a name, value, and timestamp

provided by the client.

The CF indexes the columns by name and

supports get operations by name.

CF’s do not define which columns can be stored in

them.

Column Families have a large memory overhead.

You typically have few (<10) CF’s in your Keyspace. But

there is no limit.

We have Rows. Rows have a key.

Rows store columns in one or more Column Families.

Different rows can store different columns in the same Column Family.

User CF

username => fredd_o_b => 04/03

username => bobcity => wellington

key => fred

key => bob

A key can store different columns in different Column Families.

User CF

username => fredd_o_b => 04/03

09:01 => tweet_6009:02 => tweet_70

key => fred

key => fred

Timeline CF

Here comes the Super Column Family to ruin it all.

Arrgggghhhhh.

A Super Column Family is a container for ordered and indexes Super Columns.

A Super Column has a name and an ordered and indexed list of Columns.

So the Super Column Family just gives another

level to our hash.

Social Super CF

following => {bob => 01/01/2010, tom => 01/02/2010}

followers => {bob => 01/01/2010}

key => fred

How about some code?

Documents

Nzpug welly-cassandra-02-12-2010