Data Modeling Examples Matthew F. Dennis // @mdennis

Cassandra NYC 2011 Data Modeling


Page 1: Cassandra NYC 2011 Data Modeling

Data Modeling Examples
Matthew F. Dennis // @mdennis

Page 2: Cassandra NYC 2011 Data Modeling

Overview

● general guiding goals for Cassandra data models

● Interesting and/or common examples/questions to get us started

● Should be plenty of time at the end for questions, so bring them up if you have them!

Page 3: Cassandra NYC 2011 Data Modeling

Data Modeling Goals

● Keep data queried together on disk together

● In a more general sense, think about the efficiency of querying your data and work backward from there to a model in Cassandra

● Usually, you shouldn't try to normalize your data (contrary to many use cases in relational databases)

● Usually better to keep a record that something happened as opposed to changing a value (not always the best approach though)

Page 4: Cassandra NYC 2011 Data Modeling

Time Series Data

● Easily the most common use of Cassandra
  – Financial tick data
  – Click streams
  – Sensor data
  – Performance metrics
  – GPS data
  – Event logs
  – etc, etc, etc ...

● All of the above are essentially the same as far as C* is concerned

Page 5: Cassandra NYC 2011 Data Modeling

Time Series Thought Model

● Things happen in some timestamp ordered stream and consist of values associated with the given timestamp (i.e. “data points”)

  – Every 30 seconds record location, speed, heading and engine temp

  – Every 5 minutes record CPU, IO and Memory usage

● We are interested in recreating, aggregating and/or analyzing arbitrary time slices of the stream

  – Where was agent:007 and what was he doing between 11:21am and 2:38pm yesterday?

  – What are the last N actions foo did on my site?

Page 6: Cassandra NYC 2011 Data Modeling

Data Points Defined

● Each data point has 1-N values

● Each data point corresponds to a specific point in time or an interval/bucket (e.g. 5th minute of the 17th hour on some date)

Page 7: Cassandra NYC 2011 Data Modeling

Data Points Mapped to Cassandra

● Row Key is id of the data point stream bucketed by time
  – e.g. plane01:jan_2011 or plane01:jan_01_2011 for month or day buckets respectively

● Column Name is TimeUUID(timestamp of data point)

● Column Value is serialized data point
  – JSON, XML, pickle, msgpack, thrift, protobuf, avro, BSON, WTFe

● Bucketing
  – Avoids always requiring multiple seeks when only small slices of the stream are requested (e.g. stream is 5 years old but I'm only interested in Jan 5th 3 years ago and/or yesterday between 2pm and 3pm)

  – Makes it easy to lazily aggregate old stream activity

  – Reduces compaction overhead since old rows will never have to be merged again (until you “back fill” and/or delete something)
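As a rough illustration, a minimal pycassa sketch of this write path (the keyspace, column family, and JSON payload are assumptions for the example, not something prescribed by the deck):

    import json
    from datetime import datetime

    import pycassa
    from pycassa.util import convert_time_to_uuid

    # Assumed keyspace/CF names; the CF would be defined with a TimeUUID comparator.
    pool = pycassa.ConnectionPool('Metrics', ['localhost:9160'])
    stream_cf = pycassa.ColumnFamily(pool, 'EventStream')

    def record_data_point(stream_id, ts, values, bucket_fmt='%b_%Y'):
        # Row key = stream id + time bucket, e.g. plane01:jan_2011
        row_key = '%s:%s' % (stream_id, ts.strftime(bucket_fmt).lower())
        # Column name = TimeUUID(timestamp of data point)
        col_name = convert_time_to_uuid(ts, randomize=True)
        # Column value = serialized data point (JSON here, but anything works)
        stream_cf.insert(row_key, {col_name: json.dumps(values)})
        return row_key, col_name

    record_data_point('plane01', datetime.utcnow(),
                      {'lat': 28.90, 'lon': 124.30, 'alt_ft': 45000, 'wine_pct': 70})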

Page 8: Cassandra NYC 2011 Data Modeling

A Slightly More Concrete Example

● Sensor data from airplanes

● Every 30 seconds each plane sends latitude+longitude, altitude and wine remaining in mdennis' glass.

Page 9: Cassandra NYC 2011 Data Modeling

The Visual

● Row Key is the id of stream being recorded (e.g. plane5:jan_2011)

● Column Name is timestamp (or TimeUUID) associated with the data point

● Column Value is the value of the event (e.g. protobuf serialized lat/long+alt+wine_level)

Row key: plane5:jan_2011 (abbreviated p5:j11)

  TimeUUID0 → 28.90, 124.30 / 45K feet / 70%
  TimeUUID1 → 28.85, 124.25 / 44K feet / 50%   ← middle of the ocean and half a glass of wine at 44K feet
  TimeUUID2 → 28.81, 124.22 / 44K feet / 95%

Page 10: Cassandra NYC 2011 Data Modeling

Querying

● When querying, construct TimeUUIDs for the min/max of the time range in question and use them as the start/end in your get_slice call

● Or use an empty start and/or end along with a count
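A sketch of both query styles with pycassa, whose get() wraps get_slice; the stream id and CF name carry over from the earlier illustrative write sketch:

    from datetime import datetime

    import pycassa
    from pycassa.util import convert_time_to_uuid

    pool = pycassa.ConnectionPool('Metrics', ['localhost:9160'])
    stream_cf = pycassa.ColumnFamily(pool, 'EventStream')

    # TimeUUIDs bounding the slice: lowest possible UUID for the start of the
    # range, highest possible UUID for the end of the range.
    start = convert_time_to_uuid(datetime(2011, 1, 5, 14, 0), lowest_val=True)
    finish = convert_time_to_uuid(datetime(2011, 1, 5, 15, 0), lowest_val=False)
    afternoon = stream_cf.get('plane01:jan_2011',
                              column_start=start, column_finish=finish,
                              column_count=1000)

    # Or: the most recent 20 data points, using empty start/end plus a count
    latest = stream_cf.get('plane01:jan_2011', column_reversed=True, column_count=20)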

Page 11: Cassandra NYC 2011 Data Modeling

Bucket Sizes?

● Depends greatly on
  – Average size of time slice queried
  – Average data point size
  – Write rate of data points to a stream
  – IO capacity of the nodes

Page 12: Cassandra NYC 2011 Data Modeling

So... Bucket Sizes?

● No bigger than a few GB per row
  – bucket_size * write_rate * sizeof(avg_data_point)

● Bucket size >= average size of time slice queried
● No more than maybe 10M entries per row
● No more than a month if you have lots of different streams
● NB: there are exceptions to all of the above, which are really nothing more than guidelines
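For example, plugging made-up but plausible numbers into that formula:

    # One ~200 byte data point every 30 seconds, bucketed by month (assumed numbers):
    write_rate = 1 / 30.0                # data points per second
    sizeof_avg_data_point = 200          # bytes, serialized
    bucket_size = 30 * 24 * 3600         # seconds in a ~30 day bucket

    row_size = bucket_size * write_rate * sizeof_avg_data_point
    print(row_size / 2 ** 20)            # ~16 MB and ~86K columns per row: well within the guidelines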

Page 13: Cassandra NYC 2011 Data Modeling

Ordering

● In cases where the most recent data is the most interesting (e.g. last N events for entity foo or last hour of events for entity bar), you can reverse the comparator (i.e. sort descending instead of ascending)

● http://thelastpickle.com/2011/10/03/Reverse-Comparators/
● https://issues.apache.org/jira/browse/CASSANDRA-2355
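For reference, a small pycassa sketch of declaring a descending (reverse-comparator) column family; the reversed=true comparator string follows the linked post, while the keyspace and CF names are just placeholders:

    from pycassa.system_manager import SystemManager

    sys_mgr = SystemManager('localhost:9160')
    # Columns sort newest-first on disk, so "last N events" reads come from the
    # front of the row without needing a reversed slice on every query.
    sys_mgr.create_column_family('Metrics', 'EventStreamDesc',
                                 comparator_type='TimeUUIDType(reversed=true)')
    sys_mgr.close()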

Page 14: Cassandra NYC 2011 Data Modeling

Spanning Buckets

● If your time slice spans buckets, you'll need to construct all the row keys in question (i.e. number of unique row keys = spans+1)

● If you want all the results between the dates, pass all the row keys to multiget_slice with the start and end of the desired time slice

● If you only want the first N results within your time slice, lowest latency comes from multiget_slice as above but best efficiency comes from serially paging one row key at a time until your desired count is reached
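A sketch of the spanning case with pycassa's multiget (which issues multiget_slice under the hood), reusing the illustrative month-bucketed keys from earlier:

    from datetime import datetime

    import pycassa
    from pycassa.util import convert_time_to_uuid

    pool = pycassa.ConnectionPool('Metrics', ['localhost:9160'])
    stream_cf = pycassa.ColumnFamily(pool, 'EventStream')

    # Dec 20th 2010 through Jan 10th 2011 spans two month buckets, so build
    # both row keys (spans + 1 = 2) and hand them all to multiget.
    row_keys = ['plane01:dec_2010', 'plane01:jan_2011']
    start = convert_time_to_uuid(datetime(2010, 12, 20), lowest_val=True)
    finish = convert_time_to_uuid(datetime(2011, 1, 10), lowest_val=False)

    rows = stream_cf.multiget(row_keys, column_start=start, column_finish=finish)
    for key in row_keys:                      # stitch results back in bucket order
        for col_name, value in rows.get(key, {}).items():
            pass                              # process each data point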

Page 15: Cassandra NYC 2011 Data Modeling

Expiring Streams (e.g. “I only care about the past year”)

● Just set the TTL to the age you want to keep
● yeah, that's pretty much it ...
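With pycassa that is literally one extra argument on the write (reusing the names from the earlier write sketch):

    # Keep only the past year of data points:
    ONE_YEAR = 365 * 24 * 3600
    stream_cf.insert(row_key, {col_name: json.dumps(values)}, ttl=ONE_YEAR)
    # The column silently expires (and is eventually purged) a year after the write.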

Page 16: Cassandra NYC 2011 Data Modeling

Counters

● Sometimes you're only interested in counting things that happened within some time slice

● Minor adaptation to the previous content to use counters (be aware they are not idempotent)
  – Column names become buckets
  – Values become counters

Page 17: Cassandra NYC 2011 Data Modeling

Example: Counting User Logins

Row key: U3:S5:L:D (user3:system5:logins:by_day)

  20110107 → 2    (2 logins on Jan 7th 2011 for user 3 on system 5)
  ...
  20110523 → 7    (7 logins on May 23rd 2011 for user 3 on system 5)

Row key: U3:S5:L:H (user3:system5:logins:by_hour)

  2011010710 → 1    (one login for user 3 on system 5 on Jan 7th 2011 in the 10th hour)
  ...
  2011052316 → 2    (2 logins for user 3 on system 5 on May 23rd 2011 in the 16th hour)
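A sketch of bumping both counters on a login, with pycassa's add() doing the counter increment; the CF names and exact key scheme are assumptions that mirror the example above:

    from datetime import datetime

    import pycassa

    pool = pycassa.ConnectionPool('Metrics', ['localhost:9160'])
    # Both CFs would be defined with default_validation_class=CounterColumnType
    daily_cf = pycassa.ColumnFamily(pool, 'LoginsByDay')
    hourly_cf = pycassa.ColumnFamily(pool, 'LoginsByHour')

    def record_login(user, system, when):
        base = '%s:%s:L' % (user, system)                        # e.g. U3:S5:L
        daily_cf.add('%s:D' % base, when.strftime('%Y%m%d'))     # column name = day bucket
        hourly_cf.add('%s:H' % base, when.strftime('%Y%m%d%H'))  # column name = hour bucket

    record_login('U3', 'S5', datetime(2011, 5, 23, 16, 42))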

Page 18: Cassandra NYC 2011 Data Modeling

Eventually Atomic

● In a legacy RDBMS atomicity is “easy”

● Attempting full ACID compliance in distributed systems is a bad idea (and actually impossible in the strictest sense)

● However, consistency is important and can certainly be achieved in C*

● Many approaches / alternatives

● I like a transaction log approach, especially in the context of C*

Page 19: Cassandra NYC 2011 Data Modeling

Transaction Logs(in this context)

● Records what is going to be performed before it is actually performed

● Performs the actions that need to be atomic (in the indivisible sense, not the all at once sense which is usually what people mean when they say isolation)

● Marks that the actions were performed

Page 20: Cassandra NYC 2011 Data Modeling

In Cassandra

● Serialize all actions that need to be performed in a single column
  – JSON, XML, YAML (yuck!), pickle, msgpack, protobuf, et cetera

● Row Key = randomly chosen C* node token
● Column Name = TimeUUID(nowish)
● Perform actions
● Delete Column
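Putting that together, a hedged pycassa sketch (the XACT_LOG name comes from the next slide; the token list, JSON action format, and helper names are assumptions):

    import json
    import random
    from datetime import datetime

    import pycassa
    from pycassa import ConsistencyLevel
    from pycassa.util import convert_time_to_uuid

    pool = pycassa.ConnectionPool('MyApp', ['localhost:9160'])
    xact_log = pycassa.ColumnFamily(pool, 'XACT_LOG',
                                    write_consistency_level=ConsistencyLevel.QUORUM)

    # One token per node in the ring (placeholder values for a 3-node ring)
    node_tokens = ['0', '56713727820156410577229101238628035242',
                   '113427455640312821154458202477256070484']

    def atomic_batch(actions, perform):
        row_key = random.choice(node_tokens)                 # row key = a node token
        col_name = convert_time_to_uuid(datetime.utcnow())   # column name = TimeUUID(nowish)
        xact_log.insert(row_key, {col_name: json.dumps(actions)})  # 1. record intent
        perform(actions)                                            # 2. do the idempotent writes
        xact_log.remove(row_key, columns=[col_name])                # 3. mark the batch done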

Page 21: Cassandra NYC 2011 Data Modeling

Configuration Details

● Short gc_grace_seconds on the XACT_LOG Column Family (e.g. 5 minutes)

● Write to XACT_LOG at CL.QUORUM or CL.LOCAL_QUORUM for durability
  – if it fails with an unavailable exception, pick a different node token and/or node and try again (gives same semantics as a relational DB in terms of knowing the state of your transaction)
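A small sketch of that retry rule around the intent write from the previous sketch; the exception handling uses pycassa-style names and the attempt count is arbitrary:

    import json
    import random
    from datetime import datetime

    from pycassa import UnavailableException
    from pycassa.util import convert_time_to_uuid

    def write_intent(actions, attempts=3):
        last_err = None
        for _ in range(attempts):
            row_key = random.choice(node_tokens)              # pick a (different) node token
            col_name = convert_time_to_uuid(datetime.utcnow())
            try:
                xact_log.insert(row_key, {col_name: json.dumps(actions)})  # QUORUM (set on the CF)
                return row_key, col_name   # durably logged, so the transaction state is known
            except UnavailableException as e:
                last_err = e               # not enough replicas: try another token / coordinator
        raise last_err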

Page 22: Cassandra NYC 2011 Data Modeling

Failures

● Before insert into the XACT_LOG
● After insert, before actions
● After insert, in middle of actions
● After insert, after actions, before delete
● After insert, after actions, after delete

Page 23: Cassandra NYC 2011 Data Modeling

Recovery

● Each C* node has a cron job offset from every other by some time period

● Each job runs the same code: multiget_slice for all node tokens for all columns older than some time period (the “recovery period”)

● Any columns found need to be replayed in their entirety and are deleted after replay (normally there are no columns because normally things are working)
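A sketch of that sweep, reusing xact_log and node_tokens from the earlier sketches; the two-minute recovery period and the replay callback are assumptions:

    import json
    from datetime import datetime, timedelta

    from pycassa.util import convert_time_to_uuid

    RECOVERY_PERIOD = timedelta(minutes=2)   # comfortably shorter than gc_grace_seconds

    def recover(replay):
        # Fetch every intent column older than the recovery period across all node tokens.
        cutoff = convert_time_to_uuid(datetime.utcnow() - RECOVERY_PERIOD, lowest_val=False)
        stale = xact_log.multiget(node_tokens, column_finish=cutoff)   # normally empty
        for row_key, cols in stale.items():
            for col_name, payload in cols.items():
                replay(json.loads(payload))                  # actions are idempotent: safe to re-run
                xact_log.remove(row_key, columns=[col_name])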

Page 24: Cassandra NYC 2011 Data Modeling

XACT_LOG Comments

● Idempotent writes are awesome (that's why this works so well)

● Doesn't work so well for counters (they're not idempotent)

● Clients must be able to deal with temporarily inconsistent data (they have to do this anyway)

Page 25: Cassandra NYC 2011 Data Modeling

Cassandra Data Modeling Examples
Matthew F. Dennis // @mdennis

Q?