Hey Relational Developer, Let's Go Crazy (Patrick McFadin, DataStax) | Cassandra Summit 2016


@PatrickMcFadin

Patrick McFadin, Chief Evangelist for Apache Cassandra, DataStax

Hey relational developer, let's go crazy


Why do you develop?

value = Business.add(you)

KillrVideo

https://killrvideo.github.io/

Major areas to cover

• Connecting to the database
• Inserting Data
• Selecting Data
• Indexing
• Data Locality

WARNING

Connecting to the database

Cluster cluster;
Session session;

// Connect to the cluster and keyspace "killrvideo"
cluster = Cluster.builder()
    .addContactPoints("192.168.0.1", "192.168.0.2")
    .build();
session = cluster.connect("killrvideo");

Cluster cluster;
Session session;

// Connect to the cluster and keyspace "killrvideo"
cluster = Cluster.builder()
    .addContactPoints("NODE1", "NODE2")
    .build();
session = cluster.connect("killrvideo");

WARNING

Cluster cluster = Cluster.builder()
    .addContactPoints("192.168.0.1", "192.168.0.2")
    .withLoadBalancingPolicy(
        DCAwareRoundRobinPolicy.builder()
            .withLocalDc("myLocalDC")
            .build())
    .build();

Multi-DC: East, West

< 1ms > 70ms

I wonder why I have random slow queries?


Inserting Data

Inserting data

CREATE TABLE video_ratings_by_user (
    videoid uuid,
    userid uuid,
    rating int,
    PRIMARY KEY (videoid, userid)
);

INSERT INTO video_ratings_by_user (videoid, userid)
VALUES (?, ?);

Inserting data

• Batch in the same partition is great
• Pay attention to the partition key

BEGIN BATCH
  INSERT INTO comments_by_video (videoid, userid, commentid, comment)
  VALUES (99051fe9-6a9c-46c2-b949-38ef78858dd0, d0f60aa8-54a9-4840-b70c-fe562b68842b, now(), 'Worst. Video. Ever.');

  …100 Inserts later…

  INSERT INTO comments_by_video (videoid, userid, commentid, comment)
  VALUES (99051fe9-6a9c-46c2-b949-38ef78858dd0, d0f60aa8-54a9-4840-b70c-fe562b68842b, now(), 'Worst. Video. Ever.');
APPLY BATCH;

Batches: The bad

BEGIN BATCH
  1000 inserts
APPLY BATCH;

[Diagram: Client and a ring of four nodes: 10.0.0.1 (00-25), 10.0.0.2 (26-50), 10.0.0.3 (51-75), 10.0.0.4 (76-100)]
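
One way to act on "pay attention to the partition key" is to group rows by partition key on the client before building any batch, so that each batch stays inside one partition. The sketch below uses only the JDK and does not call the driver. The Comment class and its field names are hypothetical stand-ins for a row of comments_by_video, with videoid as the partition key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical value type for one row of comments_by_video (not a driver class).
final class Comment {
    final UUID videoId;   // partition key of comments_by_video
    final UUID userId;
    final String text;

    Comment(UUID videoId, UUID userId, String text) {
        this.videoId = videoId;
        this.userId = userId;
        this.text = text;
    }
}

final class PartitionGrouping {
    // Groups comments by partition key (videoId), keeping first-seen order.
    // Each resulting list is a candidate for one single-partition batch.
    static Map<UUID, List<Comment>> groupByPartition(List<Comment> comments) {
        Map<UUID, List<Comment>> groups = new LinkedHashMap<>();
        for (Comment c : comments) {
            groups.computeIfAbsent(c.videoId, k -> new ArrayList<>()).add(c);
        }
        return groups;
    }
}
```

Each group could then be sent as its own batch, or as individual asynchronous inserts when the groups are small.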

WARNING

Prepared Statements

• Built for speed and efficiency

How they work: Prepare

SELECT * FROM user WHERE id = ?

[Diagram: Client and a ring of four nodes: 10.0.0.1 (00-25), 10.0.0.2 (26-50), 10.0.0.3 (51-75), 10.0.0.4 (76-100). Steps shown: Prepare → Parsed → Hashed → Cached → Prepared Statement]

How they work: Bind

id = 1 + PreparedStatement Hash

[Diagram: Client and a ring of four nodes: 10.0.0.1 (00-25), 10.0.0.2 (26-50), 10.0.0.3 (51-75), 10.0.0.4 (76-100). Steps shown: Bind & Execute → Combine Pre-parsed Query and Variable → Execute]
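
The prepare and bind steps can be sketched as a toy model in plain Java. This is not the driver or server implementation: ToyNode, its fields, and the use of String.hashCode() as the statement id are all assumptions made for illustration. The point is only that the query text is parsed once, and later executions send an id plus values.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a node's prepared-statement cache (hypothetical, not the driver API).
final class ToyNode {
    private final Map<Integer, String> cache = new HashMap<>();
    int parseCount = 0;   // how many times the "expensive" parse step ran

    // Prepare: hash the query text; parse and cache it only if it is new; return the id.
    int prepare(String cql) {
        int id = cql.hashCode();      // stand-in for the real statement hash
        if (!cache.containsKey(id)) {
            parseCount++;             // stands in for parsing the query
            cache.put(id, cql);
        }
        return id;
    }

    // Bind & execute: look up the cached query by id and combine it with the value.
    // Handles a single "?" marker only, which is enough for this sketch.
    String execute(int id, Object value) {
        String cql = cache.get(id);
        if (cql == null) {
            throw new IllegalStateException("unknown statement id: " + id);
        }
        return cql.replace("?", String.valueOf(value));
    }
}
```

Preparing the same query text twice leaves parseCount at 1, which is the saving the slides describe: prepare once, then bind and execute many times.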


Selecting Data

Getting data

• Use a partition key always
• Need JSON? Just ask
• Order of clustering columns matters

SELECT * FROM user_videos
WHERE userid = ?;

SELECT * FROM user_videos
WHERE userid = ?
AND added_date = ?;

CREATE TABLE IF NOT EXISTS user_videos (
    userid uuid,
    added_date timestamp,
    videoid uuid,
    name text,
    preview_image_location text,
    PRIMARY KEY (userid, added_date, videoid)
) WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC);

SELECT * FROM user_videos
WHERE userid = ?
AND videoid = ?;

SELECT JSON * FROM user_videos
WHERE userid = ?;
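
Why clustering column order matters can be shown with a small check. This is a simplified model and an assumption on my part, not Cassandra's actual validation logic: it looks only at equality restrictions, assumes the partition key is already restricted, and ignores range predicates, IN, ALLOW FILTERING, and secondary indexes. Under that model, the restricted clustering columns must form a prefix of the declared clustering order (added_date, videoid).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ClusteringCheck {
    // Returns true when the equality-restricted columns are exactly the first N
    // clustering columns in declared order, for some N (including N = 0).
    static boolean isPrefix(List<String> clusteringColumns, Set<String> restricted) {
        int matched = 0;
        for (String column : clusteringColumns) {
            if (!restricted.contains(column)) {
                break;   // first unrestricted column ends the prefix
            }
            matched++;
        }
        // Any restricted column beyond the prefix means a clustering column was skipped.
        return matched == restricted.size();
    }

    public static void main(String[] args) {
        List<String> order = Arrays.asList("added_date", "videoid");
        // WHERE userid = ? AND added_date = ?
        System.out.println(isPrefix(order, new HashSet<>(Arrays.asList("added_date"))));
        // WHERE userid = ? AND videoid = ?  (skips added_date)
        System.out.println(isPrefix(order, new HashSet<>(Arrays.asList("videoid"))));
    }
}
```

In this model the query restricting only videoid fails the check because it skips added_date, which is the case the slide calls out.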

Getting data

• CQLSH trace facility is your friend
• Watch the logs. Filter for warnings

SELECT * FROM videos;

SELECT * FROM videos ALLOW FILTERING;

WARNING

SELECT * FROM videos
WHERE key IN <10s, 100s or 1000s of keys>;


Indexing

Check out what I built

This query is really slow

Duh. Add an index to this field.

Oh yeah. That is faster.

Indexing data

• Secondary Indexes are not for speed
• Index clustering columns
• Index collections

CREATE INDEX videoid_idx
ON user_videos(videoid);

CREATE TABLE IF NOT EXISTS user_videos (
    userid uuid,
    added_date timestamp,
    videoid uuid,
    name text,
    preview_image_location text,
    PRIMARY KEY (userid, added_date, videoid)
) WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC);

CREATE INDEX tags_idx
ON videos(tags);

name (PK)   location
Jonathan    TX
Aleksey     UK
Patrick     CA
Stefania    HK

CREATE INDEX location_idx ON users(location)

[Diagram: USERS table with four separate Index:user(location) boxes]

name (PK)   location
Jonathan    TX
Aleksey     UK
Patrick     CA
Stefania    HK

CREATE CUSTOM INDEX location_idx ON users(location)
USING 'org.apache.cassandra.sasi.SASIIndex';

USERS


USERS

[Diagram: Users data in a Memtable and in an SSTable, each with its own SASI Index]

SASI Queries

SELECT * FROM users WHERE firstname LIKE 'pat%';

SELECT * FROM users WHERE lastname LIKE '%Fad%';

SELECT * FROM users WHERE email LIKE '%data%';

SELECT * FROM users WHERE created_date > '2011-06-15' AND created_date < '2011-06-30';

 userid                               | created_date                    | email                | firstname | lastname
--------------------------------------+---------------------------------+----------------------+-----------+----------
 9761d3d7-7fbd-4269-9988-6cfd4e188678 | 2011-06-20 20:50:00.000000+0000 | patrick@datastax.com |   Patrick |  McFadin


Data Locality

8 Fallacies of Distributed Computing

1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn’t change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous

Insert Alternative

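Instead of:

BEGIN BATCH
  1000 inserts
APPLY BATCH;

Do this:

// "statements" is a placeholder for your own collection of insert statements
for (Statement statement : statements) {
    ResultSetFuture future = session.executeAsync(statement);
}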

WARNING

Collect and deal with your futures!
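
A minimal sketch of "collect and deal with your futures", written against the JDK only. The executeAsync parameter is a hypothetical stand-in for session.executeAsync, and CompletableFuture stands in for whatever future type your driver version returns. The maxInFlight limit is my own addition, included so an unbounded loop cannot flood the cluster. Treat it as an illustration, not the driver's recommended pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

final class BoundedAsyncWriter {
    // Submits every statement through executeAsync with at most maxInFlight
    // requests outstanding, then waits for all of them and returns how many failed.
    static <S> int writeAll(List<S> statements,
                            Function<S, CompletableFuture<Void>> executeAsync,
                            int maxInFlight) {
        Semaphore permits = new Semaphore(maxInFlight);
        List<CompletableFuture<Void>> futures = new ArrayList<>();
        for (S statement : statements) {
            permits.acquireUninterruptibly();            // wait for a free slot
            CompletableFuture<Void> future = executeAsync.apply(statement);
            future.whenComplete((result, error) -> permits.release());
            futures.add(future);                         // collect every future
        }
        int failures = 0;
        for (CompletableFuture<Void> future : futures) {
            try {
                future.join();                           // deal with each result
            } catch (RuntimeException e) {               // failed or cancelled request
                failures++;
            }
        }
        return failures;
    }
}
```

A real application would likely retry or log the failed statements rather than just count them.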

Thank you! Questions?

Follow me @PatrickMcFadin
