DataStax: Rigorous Cassandra Data Modeling for the Relational Data Architect

Artem Chebotko

1 Cassandra Data and Query Models

2 Rigorous Data Modeling

3 Data Modeling Example

4 From Relational to Cassandra

5 Conclusions

© 2015. All Rights Reserved.

Tables with Single-Row Partitions

users

username   age   address
Alice      28    Santa Clara, CA
Alex       37    Austin, TX

sensors

id   type         settings                       owner
1    phone        {gps ⇒ on, pedometer ⇒ on}     Alice
2    wristband    {heart rate ⇒ on, …}           Alice
3    thermostat   {temp ⇒ 75, …}                 Alice
4    security     {…}                            Alex
5    phone        {…}                            Alex

CREATE TABLE users (
  username TEXT,
  age INT,
  address TEXT,
  PRIMARY KEY(username)
);

SELECT * FROM users
WHERE username = ?;

CREATE TABLE sensors (
  id INT,
  type TEXT,
  settings MAP<TEXT,TEXT>,
  owner TEXT,
  PRIMARY KEY(id)
);

SELECT * FROM sensors
WHERE id = ?;

Tables with Multi-Row Partitions

sensors_by_user (rows within each partition are ordered by id ASC)

username   id   type         settings                 age   address
Alice      1    phone        {gps ⇒ on, …}            28    Santa Clara, CA
Alice      2    wristband    {heart rate ⇒ on, …}     28    Santa Clara, CA
Alice      3    thermostat   {temp ⇒ 75, …}           28    Santa Clara, CA
Alex       4    security     …                        37    Austin, TX
Alex       5    phone        …                        37    Austin, TX

CREATE TABLE sensors_by_user (
  username TEXT, age INT STATIC, address TEXT STATIC,
  id INT, type TEXT, settings MAP<TEXT,TEXT>,
  PRIMARY KEY(username, id)
) WITH CLUSTERING ORDER BY (id ASC);

SELECT * FROM sensors_by_user WHERE username = ?;

SELECT * FROM sensors_by_user WHERE username = ? AND id = ?;

SELECT * FROM sensors_by_user WHERE username = ? AND id > ?
ORDER BY id DESC;
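The three sensors_by_user queries can be pictured with a small in-memory model. The Python sketch below is only an illustration of the idea of a multi-row partition: rows are grouped by partition key and kept sorted by clustering key, so a lookup by partition key, a range restriction on the clustering key, and a reversed read are all cheap. The class name PartitionedTable and its methods are hypothetical and are not part of Cassandra or any driver.

```python
from bisect import bisect_right
from collections import defaultdict


class PartitionedTable:
    """Toy model of PRIMARY KEY(partition_key, clustering_key); not a Cassandra API."""

    def __init__(self):
        # partition key -> list of (clustering key, row), kept in ascending order
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, row):
        rows = self._partitions[partition_key]
        keys = [k for k, _ in rows]
        pos = bisect_right(keys, clustering_key)
        if pos > 0 and keys[pos - 1] == clustering_key:
            rows[pos - 1] = (clustering_key, row)  # same primary key: overwrite
        else:
            rows.insert(pos, (clustering_key, row))

    def select(self, partition_key, greater_than=None, descending=False):
        # A partition key is always required, as in the queries above.
        rows = self._partitions.get(partition_key, [])
        if greater_than is not None:
            keys = [k for k, _ in rows]
            rows = rows[bisect_right(keys, greater_than):]
        result = [row for _, row in rows]
        return result[::-1] if descending else result


table = PartitionedTable()
for username, sensor_id, sensor_type in [
    ("Alice", 3, "thermostat"), ("Alice", 1, "phone"), ("Alice", 2, "wristband"),
    ("Alex", 5, "phone"), ("Alex", 4, "security"),
]:
    table.insert(username, sensor_id, {"id": sensor_id, "type": sensor_type})

# WHERE username = 'Alice'
all_alice = table.select("Alice")
# WHERE username = 'Alice' AND id > 1 ORDER BY id DESC
recent_alice = table.select("Alice", greater_than=1, descending=True)
```

Rows come back in clustering order regardless of insertion order, which is the property the range and ORDER BY queries rely on.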

Key Observations

• C* Data Model

– Single-row partitions

– Multi-row partitions

• C* Query Model

– Partition key

– Partition and clustering keys

– Range search and ordering on a clustering key

• Relational Data Model

– Normalized tables

• Relational Query Model

– SQL and relational algebra

– Expressive

– Expensive

2 Rigorous Data Modeling

Rigorous: Definition and Implications

Formal, Well-Defined, Sound

Repeatable, Automatable

Tools, Ease of Use

Wider Adoption

We Need the Methodology!

[Figure: methodology overview. The Conceptual Data Model and the Application Workflow are combined by Mapping into the Logical Data Model, which Optimization turns into the Physical Data Model.]

Methodology Models

Model                        Representation
Conceptual Data Model        ERD
Application Workflow Model   Graph
Logical Data Model           Chebotko Diagram
Physical Data Model          Chebotko Diagram, CQL

Methodology Protocols

• Conceptual-to-logical mapping

– Mapping rules

– Mapping patterns

• Physical optimizations

– Partition size analysis

– Duplication factor analysis

– Keys, aggregation, transactions, …
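Partition size analysis can be made concrete with a little arithmetic. The Python sketch below uses one commonly cited estimate for the number of values stored in a partition, Nv = Nr × (Nc − Npk − Ns) + Ns, where Nr is the number of rows, Nc the number of columns, Npk the number of primary key columns, and Ns the number of static columns. The formula is not stated on the slides, and the one-measurement-per-second rate is a made-up workload, so treat both as assumptions.

```python
def partition_values(rows, columns, primary_key_columns, static_columns):
    """Estimate values per partition: Nv = Nr * (Nc - Npk - Ns) + Ns (assumed formula)."""
    return rows * (columns - primary_key_columns - static_columns) + static_columns


# Hypothetical workload: one measurement per second for a sensor parameter.
SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_WEEK = 7 * 24 * 3600

# measurements_by_sensor keyed by (id, parameter), clustered by datetime:
# columns id, parameter, datetime, value; the partition grows without bound.
per_year = partition_values(SECONDS_PER_YEAR, columns=4,
                            primary_key_columns=3, static_columns=0)

# Adding a week column to the partition key caps each partition at one week of data.
per_week = partition_values(SECONDS_PER_WEEK, columns=5,
                            primary_key_columns=4, static_columns=0)

# sensors_by_user with 3 sensors: columns username, age, address, id, type, settings;
# primary key (username, id); static columns age and address are stored once.
alice = partition_values(3, columns=6, primary_key_columns=2, static_columns=2)

print(per_year, per_week, alice)
```

Under these assumed numbers, bucketing by week cuts the partition from tens of millions of values per year to a bounded weekly size, which is the kind of result a partition size analysis is meant to surface.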

Sample Mapping Pattern

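[Figure: conceptual model used by the pattern]

ET1 (key attributes: key1.1, key1.2; other attributes: attr1.1, attr1.2, attr1.3)
ET2 (key attributes: key2.1, key2.2; other attributes: attr2.1, attr2.2, attr2.3)
RT relates ET1 to ET2 as 1:n and has attribute attr

Notation: K = partition key column, C↑/C↓ = clustering column in ascending/descending order, S = static column.

ET2_by_ET1_key (key1.1 and key1.2 together form the partition key)
  key1.1                K
  key1.2                K
  key2.1                C↑
  key2.2                C↑
  attr1.1               S
  attr1.2               S
  attr1.3 (collection)  S
  attr2.1
  attr2.2
  attr2.3 (collection)
  attr

ACCESS PATTERN search attributes: key1.1 =, key1.2 >

ET2_by_ET1_key
  key1.1                K
  key1.2                C↑
  key2.1                C↑
  key2.2                C↑
  attr2.1
  attr2.2
  attr2.3 (collection)
  attr

PRIMARY KEY: All search attributes, followed by all key attributes of RT

STATIC COLUMNS: Non-key attributes of ET1, iff all key attributes of ET1 are part of the partition key

What if we add green attributes to the above table?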
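The two rules on this slide can be read as a small procedure. The Python sketch below is one possible interpretation, not the methodology's reference implementation: equality search attributes become the partition key, inequality search attributes become clustering columns, the key attributes of RT are appended to make rows unique, and ET1's non-key attributes are static only when every ET1 key attribute is in the partition key. The function name map_one_to_many and its parameters are hypothetical.

```python
def map_one_to_many(equality_attrs, inequality_attrs, rt_key_attrs,
                    et1_key_attrs, et1_nonkey_attrs, other_attrs):
    """Sketch of the 1:n mapping pattern (an interpretation of the slide's rules)."""
    partition_key = list(equality_attrs)
    clustering_key = list(inequality_attrs)
    # PRIMARY KEY: all search attributes, followed by all key attributes of RT.
    for attr in rt_key_attrs:
        if attr not in partition_key and attr not in clustering_key:
            clustering_key.append(attr)
    # STATIC COLUMNS: non-key attributes of ET1, iff all ET1 keys are in the partition key.
    et1_keys_in_partition = set(et1_key_attrs) <= set(partition_key)
    static = list(et1_nonkey_attrs) if et1_keys_in_partition else []
    return {
        "partition_key": partition_key,
        "clustering_key": clustering_key,
        "static": static,
        "regular": list(other_attrs),
    }


common = dict(
    rt_key_attrs=["key2.1", "key2.2"],
    et1_key_attrs=["key1.1", "key1.2"],
    et1_nonkey_attrs=["attr1.1", "attr1.2", "attr1.3"],
    other_attrs=["attr2.1", "attr2.2", "attr2.3", "attr"],
)

# Search on key1.1 = ? AND key1.2 = ?
both_equal = map_one_to_many(["key1.1", "key1.2"], [], **common)
# Search on key1.1 = ? AND key1.2 > ?
range_on_second = map_one_to_many(["key1.1"], ["key1.2"], **common)
```

With both search attributes compared by equality, the result has the two ET1 keys as the partition key and ET1's attributes as static columns; with a range on key1.2, that attribute moves to the clustering key and no static columns are produced, matching the two tables on the slide.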

The Easy Way

kdm.dataview.org

• Implements the methodology

– CDM and Query design

– Automated LDM generation

– Automated PDM and CQL generation

Yesterday’s talk: World’s Best Data Modeling Tool for Apache Cassandra

3 Data Modeling Example

Conceptual Data Model: Fact-Based Model

• Alice is a user

• Alice is 28 y.o.

• Alice wears a wristband

• A wristband is a sensor

• A wristband records a heart rate

• A heart rate is a measurement

• …

Conceptual Data Model: Entity-Relationship Model

[Figure: entity-relationship diagram]

User (username, age, address)
Sensor (id, type, settings)
Measurement (datetime, parameter, value)

User owns Sensor (1:n)
Sensor records Measurement (1:n)
User has User (n:m; roles: user, follower)

ACCESS PATTERNS
Q1: Find a user with a known username
Q2: Find followers of a user
Q3: Find sensors owned by a user
Q4: Find measurements for a sensor in a date range
Q5: Find daily summary of hourly aggregates

Application Workflow

[Figure: application workflow graph; each task is listed with the access pattern it runs]

Display user information (Q1)
Find followers (Q2)
Display sensors (Q3)
Show measurements in a date range (Q4)
Show today's hourly aggregates (Q5)

Application Workflow and Queries

Q1 (Display user information):
SELECT * FROM users
WHERE username = ?

Q2 (Find followers):
SELECT * FROM followers_by_user
WHERE username = ?

Q3 (Display sensors):
SELECT * FROM sensors_by_user
WHERE username = ?

Q4 (Show measurements in a date range):
SELECT *
FROM measurements_by_sensor
WHERE id = ? AND parameter = ?
AND datetime > ?

Q5 (Show today's hourly aggregates):
SELECT *
FROM summary_by_sensor
WHERE id = ? AND date = ?

Logical Data Model

users (Q1)
  username            K
  age
  address

followers_by_user (Q2)
  username            K
  follower_username   C↑
  follower_age
  follower_address

sensors_by_user (Q3)
  username            K
  id                  C↑
  type
  <settings>

measurements_by_sensor (Q4)
  id                  K
  parameter           K
  datetime            C↓
  value

summary_by_sensor (Q5)
  id                  K
  date                K
  parameter           C↑
  hour                C↓
  avg
  ...

Physical Data Model

users (Q1)
  username            TEXT             K
  age                 INT
  address             TEXT

followers_by_user (Q2)
  username            TEXT             K
  follower_username   TEXT             C↑
  follower_age        INT
  follower_address    TEXT

sensors_by_user (Q3)
  username            TEXT             K
  id                  UUID             C↑
  type                TEXT
  <settings>          MAP<TEXT,TEXT>

measurements_by_sensor (Q4)
  id                  UUID             K
  week                TIMESTAMP        K
  parameter           TEXT             K
  datetime            TIMESTAMP        C↓
  value               FLOAT

summary_by_sensor (Q5)
  id                  UUID             K
  date                TIMESTAMP        K
  parameter           TEXT             C↑
  hour                INT              C↓
  avg                 FLOAT
  ...

4 From Relational to Cassandra

Relational Methodology

[Figure: relational methodology pipeline. Stages: CDM, Relational LDM, Normalized Relational, Relational PDM. Steps: Mapping, Normalization, Optimization. Input: Queries.]

Relational Design Example

users
  username            PK
  age
  address

followers
  username            PK, FK
  follower_username   PK, FK

ownership
  username            PK, FK
  sensor_id           PK, FK

measurements
  sensor_id           PK, FK
  parameter           PK
  datetime            PK
  value

sensors
  sensor_id           PK
  type

settings
  sensor_id           PK, FK
  setting_name        PK
  settings_value

Relational-to-Cassandra: Indirect Translation

Relational Data Model → (Reverse Engineer) → Conceptual Data Model

Relational Application → (Reverse Engineer) → Application Workflow

Then apply the C* Methodology.

Reverse Engineering is Almost Straightforward

[Figure: the relational tables of the design example, each annotated with the ER element it is reverse-engineered into]

users: User
followers: has
ownership: owns
sensors: Sensor
measurements: records, Measurement
settings: has Setting

Relational-to-Cassandra: Direct Translation

[Figure: the Relational Schema and the SQL Queries feed a Relational-to-Cassandra Mapping, which produces the Cassandra Schema.]

Extracting Functional Dependencies

Each table contributes one functional dependency from its primary key:

users:          username → age, address
ownership:      username, sensor_id → username, sensor_id
sensors:        sensor_id → type
followers:      username, follower_username → username, follower_username
measurements:   sensor_id, parameter, datetime → value
settings:       sensor_id, setting_name → setting_value
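Reading functional dependencies off a relational schema is mechanical: each table's primary key determines its non-key columns, and a table made up only of key columns yields just the trivial dependency of the key on itself. The Python sketch below shows that step under the assumption that the schema is given as a plain dictionary; the name extract_fds and the data layout are hypothetical, and the settings column is spelled setting_value as in the dependency list.

```python
def extract_fds(tables):
    """Return one (lhs, rhs) functional dependency per table, read off its primary key."""
    fds = []
    for primary_key, columns in tables.values():
        non_key = [c for c in columns if c not in primary_key]
        # An all-key table only gives the trivial FD: key -> key.
        rhs = non_key if non_key else list(primary_key)
        fds.append((frozenset(primary_key), frozenset(rhs)))
    return fds


# table name -> (primary key columns, all columns), taken from the design example
SCHEMA = {
    "users": (["username"], ["username", "age", "address"]),
    "ownership": (["username", "sensor_id"], ["username", "sensor_id"]),
    "sensors": (["sensor_id"], ["sensor_id", "type"]),
    "followers": (["username", "follower_username"],
                  ["username", "follower_username"]),
    "measurements": (["sensor_id", "parameter", "datetime"],
                     ["sensor_id", "parameter", "datetime", "value"]),
    "settings": (["sensor_id", "setting_name"],
                 ["sensor_id", "setting_name", "setting_value"]),
}

fds = extract_fds(SCHEMA)
for lhs, rhs in fds:
    print(", ".join(sorted(lhs)), "->", ", ".join(sorted(rhs)))
```

The output is the same set of six dependencies listed on the slide, in a form that a closure computation can consume directly.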

Entailing New Functional Dependencies

• Armstrong’s Axioms

– Reflexivity: If Y ⊆ X then X → Y (trivial functional dependency)

username, sensor_id → username, sensor_id

– Augmentation: If X → Y then XZ → YZ

username → age, address

username, sensor_id → age, address, sensor_id

– Transitivity: If X → Y and Y → Z then X → Z

The Idea

A Cassandra table schema must satisfy the original or entailed relational FDs.

The best way to verify this is by computing an attribute closure.

No kidding! You better believe this guy …

Computing an Attribute Closure

FDs: (1) A → BC, (2) B → F, (3) AD → E

Closure of AD:

{AD}        (trivial)
{ADBC}      by (1)
{ADBCF}     by (2)
{ADBCFE}    by (3)
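The closure steps above follow the standard fixed-point procedure: start from the given attributes and keep applying any functional dependency whose left-hand side is already covered until nothing new is added. The Python sketch below is a minimal version of that procedure; the function name attribute_closure and the encoding of attributes as single characters are choices made for this example only.

```python
def attribute_closure(attrs, fds):
    """Compute the closure of attrs under fds, given as (lhs, rhs) pairs of attribute collections."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply an FD only when its whole left-hand side is already in the closure.
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure


# (1) A -> BC, (2) B -> F, (3) AD -> E; each character is one attribute
FDS = [("A", "BC"), ("B", "F"), ("AD", "E")]

closure_ad = attribute_closure("AD", FDS)
closure_b = attribute_closure("B", FDS)
print(sorted(closure_ad))
print(sorted(closure_b))
```

Starting from AD the loop reaches all six attributes, while starting from B alone it stops at B and F, since neither (1) nor (3) can fire.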

Simple Example

SELECT age, address FROM users WHERE username = 'Alice'

Partition key   Clustering key   Other columns   Primary key attribute closure
username                         age, address    username, age, address

Functional dependencies:

username → age, address
sensor_id → type
sensor_id, parameter, datetime → value
sensor_id, setting_name → setting_value

Advanced Example

SELECT age, type, datetime, value
FROM users NATURAL JOIN ownership NATURAL JOIN sensors NATURAL JOIN measurements
WHERE username = 'Alice' AND parameter = 'heart rate'
ORDER BY datetime DESC

First attempt:

Partition key:                   username, parameter
Clustering key:                  datetime ↓
Other:                           age (S), type, value
Primary key attribute closure:   username, age, address, parameter, datetime

The closure does not contain type or value, so this primary key does not determine those columns.

Corrected design:

Partition key:                   username, parameter
Clustering key:                  datetime ↓, sensor_id ↑
Other:                           age (S), type, value
Primary key attribute closure:   username, age, address, sensor_id, type, parameter, datetime, value

Functional dependencies:

username → age, address
sensor_id → type
sensor_id, parameter, datetime → value
sensor_id, setting_name → setting_value
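The check in the advanced example can be automated: compute the closure of a candidate primary key and see which table columns fall outside it. The Python sketch below does this for the two candidate designs, under the assumption that "satisfying the FDs" means every non-key column lies in the primary key's closure; the helper names attribute_closure and uncovered_columns are hypothetical.

```python
def attribute_closure(attrs, fds):
    """Closure of attrs under fds, given as (lhs, rhs) pairs of attribute sets."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure


def uncovered_columns(primary_key, other_columns, fds):
    """Columns that the primary key does not functionally determine (assumed validity test)."""
    return set(other_columns) - attribute_closure(primary_key, fds)


# Functional dependencies from the relational schema
FDS = [
    ({"username"}, {"age", "address"}),
    ({"sensor_id"}, {"type"}),
    ({"sensor_id", "parameter", "datetime"}, {"value"}),
    ({"sensor_id", "setting_name"}, {"setting_value"}),
]

OTHER = {"age", "type", "value"}

first_attempt = uncovered_columns({"username", "parameter", "datetime"}, OTHER, FDS)
corrected = uncovered_columns(
    {"username", "parameter", "datetime", "sensor_id"}, OTHER, FDS)

print(sorted(first_attempt))
print(sorted(corrected))
```

The first key leaves type and value undetermined, and adding sensor_id to the key clears both, which mirrors the two rows of the slide.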

Conclusions

• Cassandra data models from scratch

– The methodology: academy.datastax.com

– Automation: kdm.dataview.org

• Cassandra data models from a relational database

– Two approaches to consider

– Ripe for automation

Thank you
