67
The C* Developer Training Chuck Droukas, Systems Engineer – Datastax

Apache Cassandra Developer Training Slide Deck

Embed Size (px)

DESCRIPTION

This course is designed to be a “fast start” on the basics of data modeling with Cassandra. We will cover some basic Administration information upfront that is important to understand as you choose your data model. It is still important to take a proper Admin class if you are responsible for production instance. This course focuses on CQL3, but thrift shall not be ignored.

Citation preview

Page 1: Apache Cassandra Developer Training Slide Deck

The C* Developer TrainingChuck Droukas, Systems Engineer – Datastax

Page 2: Apache Cassandra Developer Training Slide Deck

Disclaimers

•This course is designed to be a “fast start” on the basics of data modeling with Cassandra.•We will cover some basic Administration information upfront that is important to understand as you choose your data model•It is still important to take a proper Admin class if you are responsible for production instance•This course focuses on CQL3, but thrift shall not be ignored•Please ask questions and interrupt me. It makes the day go faster for both of us.

Page 3: Apache Cassandra Developer Training Slide Deck

Agenda

•Architecture Overview-Ring Topology-Write Path-Read Path-Updates and Deletes

•Break•Columns and their components•Column Families

•Lunch•Keyspaces•Complex Queries•Break•Timeseries Example•User Activity Example•Shopping Cart Example•Logging Example

Page 4: Apache Cassandra Developer Training Slide Deck

The Cassandra Schema

Consists of:•Column•Column Family (aka Table)•Keyspace (aka Database)•Cluster

Page 5: Apache Cassandra Developer Training Slide Deck

High Level Overview

Keyspace

Column Family /Table

Rows

Columns

Page 6: Apache Cassandra Developer Training Slide Deck

Components of the Column

The column is the fundamental data type in Cassandra and includes:• Column name• Column value• Timestamp• TTL (Optional)

Page 7: Apache Cassandra Developer Training Slide Deck

The Column

Name

Value

Timestamp

(Name: “firstName”, Value: “Engelbert”, Timestamp: 1363106500)

Page 8: Apache Cassandra Developer Training Slide Deck

Column Name

• Can be any value• Can be any type• Not optional• Must be unique• Stored with every value

Page 9: Apache Cassandra Developer Training Slide Deck

Column Value

• Any value• Any type• Can be empty – but is required

Page 10: Apache Cassandra Developer Training Slide Deck

Column Names and Values

•the data type for a column (or row key) value is called a validator. •The data type for a column name is called a comparator. •Cassandra validates that data type of the keys of rows. •Columns are sorted, and stored in sorted order on disk, so you have to specify a comparator for columns. This can be reversed… more on this later

Page 11: Apache Cassandra Developer Training Slide Deck

Data Types

Page 12: Apache Cassandra Developer Training Slide Deck

Column TimeStamp

• 64-bit integer• Best Practice

– Should be created in a consistent manner by all your clients

• Required

Page 13: Apache Cassandra Developer Training Slide Deck

Column TTL

• Defined on INSERT• Positive delay (in seconds)• After time expires it is marked for deletion

Page 14: Apache Cassandra Developer Training Slide Deck

Special Types of Columns

• Super• Counter• Collections

Page 15: Apache Cassandra Developer Training Slide Deck

Counters

• Allows for addition / subtraction• 64-bit value• No timestamp• Deletion does not require a

timestamp

Page 16: Apache Cassandra Developer Training Slide Deck

Collections

•New in 1.2!•Set, Map, List

Page 17: Apache Cassandra Developer Training Slide Deck

SET Example

Page 18: Apache Cassandra Developer Training Slide Deck

The Cassandra Schema

Consists of:•Column•Column Family•Keyspace•Cluster

Page 19: Apache Cassandra Developer Training Slide Deck

Column Families / Tables

•Same as tables-Groupings of Rows- AcID-Eventual Consistency

•De-Normalization-To avoid I/O-Simplify the Read Path

•Static or Dynamic

Page 20: Apache Cassandra Developer Training Slide Deck

Static Column Families

•Are the most similar to a relational table•Most rows have the same column names•Columns in rows can be different

jbellisName Email Address State

Jonathan [email protected]

123 main TX

dhutchName Email Address State

Daria [email protected]

45 2nd St. CA

egilmoreName Email

eric [email protected]

Row Key Columns

Page 21: Apache Cassandra Developer Training Slide Deck

Dynamic Column Families

•Also called “wide rows”•Structured so a query into the row will answer a question

jbellisdhutch egilmore datastax mzcassie

dhutchegilmore

egilmoredatastax mzcassie

Row Key Columns

Subscribers

Page 22: Apache Cassandra Developer Training Slide Deck

Dynamic Table CQL3 Example

CREATE TABLE timeline (

user_id varchar,

tweet_id uuid,

author varchar,

body varchar,

PRIMARY KEY (user_id, tweet_id)

)

Page 23: Apache Cassandra Developer Training Slide Deck

Clustering Order

•Sorts columns on disk by default•Can change the order

Page 24: Apache Cassandra Developer Training Slide Deck

The Cassandra Schema

Consists of:•Column•Column Family•Keyspace•Cluster

Page 25: Apache Cassandra Developer Training Slide Deck

Keyspaces

•Are groupings of Column Families•Replication strategies•Replication factor

CREATE KEYSPACE videodb WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }

In production you would use NetworkTopologyStrategy for multiple DCs.

CREATE KEYSPACE "Excalibur“ WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' :2};

Page 26: Apache Cassandra Developer Training Slide Deck

Complex QueriesPartitioning and Indexing

Page 27: Apache Cassandra Developer Training Slide Deck

Partitioners

•Partitioner Types- RandomPartitioner / Murmur3Partitioner- ByteOrderedPartioner

•Random means that your tokens are random your ordering is Random•Ordered means your K T is a no-op and ordering is lexical

- For each node- And for the ring

Page 28: Apache Cassandra Developer Training Slide Deck

Partitioners (cont’d)

•SELECT * FROM test WHERE token(k) > token(42);

Page 29: Apache Cassandra Developer Training Slide Deck

Primary Index Overview

•Index for all of your row keys•Per-node index•Partitioner + placement manages which node•Keys are just kept in ordered buckets•Partitioner determines how K Token

Page 30: Apache Cassandra Developer Training Slide Deck

Natural Keys

•Examples:-An email address-A user id

•Easy to make the relationship•Less de-normalization•More risk of an ‘UPSERT’•Changing the key requires a bulk copy operation

Page 31: Apache Cassandra Developer Training Slide Deck

Surrogate Keys

•Example:-UUID

•Independently generated•Allows you to store multiple versions of a user•Relationship is now indirect•Changing the key requires the creation of a new row, or column

Page 32: Apache Cassandra Developer Training Slide Deck

Compound (Composite) Primary Keys

Page 33: Apache Cassandra Developer Training Slide Deck

Sorting

•It’s Free!•Like Open Source is free•ONLY on the second column in compound Primary Key

Page 34: Apache Cassandra Developer Training Slide Deck

Secondary Indexes

•Need for an easy way to do limited ad-hoc queries•Supports multiple per row•Single clause can support multiple selectors•Implemented as a hash map, not B-Tree•Low cardinality ONLY

Page 35: Apache Cassandra Developer Training Slide Deck

Secondary Indexes

Page 36: Apache Cassandra Developer Training Slide Deck

Conditional Operators

Page 37: Apache Cassandra Developer Training Slide Deck

Data Modeling

Page 38: Apache Cassandra Developer Training Slide Deck

The Basics of C* Modeling

•Work backwards-What does your application do?-What are the access patterns?

•Now design your data model

Page 39: Apache Cassandra Developer Training Slide Deck

Procedures

Consider use case requirements•What data?•Ordering?•Filtering?•Grouping?•Events in chronological order?•Does the data expire?

Page 40: Apache Cassandra Developer Training Slide Deck

De-Normalization

•The New Black: De-Normalization-Forget everything you’ve learned about normalization…then forget it again!!!

•The Ugly:-Resource contention-Latency-Client-side joins

•Avoid them in your C* code

Page 41: Apache Cassandra Developer Training Slide Deck

Foreign Keys

•There are no foreign keys•No server-side joins

Page 42: Apache Cassandra Developer Training Slide Deck

What now?

•Ideally each query will be one row-Compared to other resources, disk space is cheap

•Reduce disk seeks•Reduce network traffic

Page 43: Apache Cassandra Developer Training Slide Deck

Workload Preference

•High level of de-normalization means you may have to write the same data many times•Cassandra handles large numbers of writes well

Page 44: Apache Cassandra Developer Training Slide Deck

Concurrent Writes

•A row is always referenced by a Key•Keys are just bytes•They must be unique within a CF•Primary keys are unique

-But Cassandra will not enforce uniqueness

-If you are not careful you will accidentally [UPSERT] the whole thing

Page 45: Apache Cassandra Developer Training Slide Deck

Let’s Review Some Examples…

Page 46: Apache Cassandra Developer Training Slide Deck

Relational Concept - De-normalization

• To combine relations into a single row• Used in relational modeling to avoid

complex joins

Employees

Department

SELECT e.First, e.Last, d.Dept FROM Department d, Employees e WHERE 1 = e.idAND e.id = d.id

Take this and then...

13

Thursday, May 2, 13

id First Last

1 Edgar Codd

2 Raymond Boyce

id Dept

1 Engineering

2 Math

Page 47: Apache Cassandra Developer Training Slide Deck

Relational Concept - De-normalization

• Combine table columns into a single view• No joins• All in how you set the data for fast reads

Employees

SELECT First, Last, Dept FROM employees

WHERE id = ‘1’

14

Thursday, May 2, 13

id First Last Dept

1 Edgar Codd Engineering

2 Raymond Boyce Math

Page 48: Apache Cassandra Developer Training Slide Deck

Cassandra Concept - One-to-Many

• Relationship without being relational

• Users have many videos• Wait? Where is the foreign key?

Users

Videos

15

Thursday, May 2, 13

username firstname lastname email

tcodd Edgar Codd [email protected]

rboyce Raymond Boyce [email protected]

videoid videoname username description tags

99051fe9 My funny cat tcodd My cat plays the piano cats,piano,lol

b3a76c6b Math tcodd Now my dog plays dogs,piano,lol

Page 49: Apache Cassandra Developer Training Slide Deck

Cassandra Concept - One-to-many

• Static table to store videos• UUID for unique video id• Add username to

denormalize

CREATE TABLE videos ( videoid uuid, videoname varchar, username varchar, description varchar, tags varchar, upload_date timestamp, PRIMARY KEY(videoid)

);

16

Thursday, May 2, 13

Page 50: Apache Cassandra Developer Training Slide Deck

Cassandra Concept - One-to-Many

• Lookup video by username• Write in two tables at once for fast lookups

CREATE TABLE username_video_index ( username varchar,

videoid uuid, upload_date timestamp, video_name varchar,

PRIMARY KEY (username, videoid)

);

SELECT video_nameFROM username_video_index WHERE username = ‘ctodd’ AND videoid = ‘99051fe9’

Creates a wide row!

17

Thursday, May 2, 13

Page 51: Apache Cassandra Developer Training Slide Deck

Cassandra concept - Many-to-many• Users and videos have many comments

Videos

Comments

18

Thursday, May 2, 13

username firstname lastname email

tcodd Edgar Codd [email protected]

rboyce Raymond Boyce [email protected]

videoid videoname username description tags

99051fe9 My funny cat tcodd My cat plays the piano cats,piano,lol

b3a76c6b Math tcodd Now my dog plays dogs,piano,lol

username videoid comment

tcodd 99051fe9 Sweet!

rboyce b3a76c6b Boring :(

Users

Page 52: Apache Cassandra Developer Training Slide Deck

Cassandra concept - Many-to-many

• Model both sides of the view• Insert both when comment is

created• View from either side

CREATE TABLE comments_by_user ( username varchar,

videoid uuid, comment_ts timestamp,

comment varchar,PRIMARY KEY

username,videoid));

19

Thursday, May 2, 13

CREATE TABLE comments_by_video ( videoid uuid,username varchar, comment_ts timestamp,comment varchar,PRIMARY KEY (videoid,username));

Page 53: Apache Cassandra Developer Training Slide Deck

Time Series Data

•Sensors- CPU- Network Card- Wave-Form- Resource Utilization

•Clickstream data•Historical trends•Anything that varies on a temporal basis

Page 54: Apache Cassandra Developer Training Slide Deck

Timeseries Example

WHITEBOARD TIME!!

Page 55: Apache Cassandra Developer Training Slide Deck

Single Device Per Row

Single device per row - Time Series Pattern 1• The simplest model for storing time series data is creating a wide

row of data for each source. • The timestamp of the reading will be the column name and the

temperature the column value• Since each column is dynamic, our row will grow as needed to

accommodate the data. • We will also get the built-in sorting of Cassandra to keep everything

in order.

http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling#!pc

Page 56: Apache Cassandra Developer Training Slide Deck

Single Device Per Row

CREATE TABLE temperature (   weatherstation_id text,   event_time timestamp,   temperature text,   PRIMARY KEY (weatherstation_id,event_time));

Page 57: Apache Cassandra Developer Training Slide Deck

Slice Query

SELECT temperatureFROM temperatureWHERE weatherstation_id=’1234ABCD’AND event_time > ’2013-04-03 07:01:00′AND event_time < ’2013-04-03 07:04:00′;

Page 58: Apache Cassandra Developer Training Slide Deck

Partitioning to limit row size 

Partitioning to limit row size – Time Series Pattern 2• Cassandra can store up to 2 billion columns per row, but if we're

storing data every millisecond you wouldn't even get a month’s worth of data.

• The solution is to use a pattern called row partitioning by adding data to the row key to limit the amount of columns you get per device.

• Using data already available in the event, we can use the date portion of the timestamp and add that to the weather station id.

• This will give us a row per day, per weather station, and an easy way to find the data.

 

Page 59: Apache Cassandra Developer Training Slide Deck

Partitioning to limit row size 

CREATE TABLE temperature_by_day (   weatherstation_id text,   date text,   event_time timestamp,   temperature text,   PRIMARY KEY ((weatherstation_id,date),event_time));

Page 60: Apache Cassandra Developer Training Slide Deck

Get all the weather data for a single day..

SELECT *FROM temperature_by_dayWHERE weatherstation_id=’1234ABCD’AND date=’2013-04-03′; 

Page 61: Apache Cassandra Developer Training Slide Deck

Reverse Order Time Series/Expiring Columns

Reverse order timeseries with expiring columns – Time Series Pattern 3• Imagine we are using this data for a dashboard application and

we only want to show the last 10 temperature readings.• Older data is no longer useful, so can be purged eventually. • We can take advantage of a feature called expiring columns to

have our data quietly disappear after a set amount of seconds.

Page 62: Apache Cassandra Developer Training Slide Deck

Partitioning to limit row size 

CREATE TABLE latest_temperatures (   weatherstation_id text,   event_time timestamp,   temperature text,   PRIMARY KEY (weatherstation_id,event_time),) WITH CLUSTERING ORDER BY (event_time DESC);

Page 63: Apache Cassandra Developer Training Slide Deck

Insert Data With TTLs

INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature)VALUES (’1234ABCD’,’2013-04-03 07:03:00′,’72F’) USING TTL 20; INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature)VALUES (’1234ABCD’,’2013-04-03 07:02:00′,’73F’) USING TTL 20; INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature)VALUES (’1234ABCD’,’2013-04-03 07:01:00′,’73F’) USING TTL 20; INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature)VALUES (’1234ABCD’,’2013-04-03 07:04:00′,’74F’) USING TTL 20;

Page 64: Apache Cassandra Developer Training Slide Deck

Shopping cart use case

*Store shopping cart data reliably

*Minimize (or eliminate) downtime. Multi-dc

*Scale for the “Cyber Monday” problem

The bad

*Every minute off-line is lost $$

*Online shoppers want speed!

Page 65: Apache Cassandra Developer Training Slide Deck

Shopping Cart Example

* Un-ashamedly ripped off from Patrick McFaddin’s Cassandra Summit 2013 presentation

Page 66: Apache Cassandra Developer Training Slide Deck

The 5 C* Commandments for Developers

1. Start with queries. Don’t data model for data modeling sake. That is sooo turn of the century.

2. It’s ok to duplicate data. Really. Get over it.3. C* is designed to read and write sequentially.

Great for rotational disk, awesome for SSDs, awful for NAS. So don’t do it. Ever.

4. Secondary indexes are not a band-aid for a poor data model.

5. Embrace wide rows and de-normalization

Page 67: Apache Cassandra Developer Training Slide Deck

…and Cassandra will not ask if your “wallet is open.”