BigData Developers MeetUp

Christian Johannsen, Solutions Engineer

Evaluating Apache Cassandra as a Cloud Database

About me

2

Christian JohannsenSolutions Engineer @ DataStax

Twitter: @cjohannsen81

What is a cloud database not?

3

• A Cloud database is not simply taking a traditional RDBMS and

running it in a Cloud provider’s environment

What is a cloud database?

4

• A cloud database is a database optimised to handle the

infrastructure limitations and features of running on

shared, geographically dispersed infrastructure.

• Some of them are: Outages, noisy neighbours, security,

variable networks etc…

Key attributes of a cloud database

5

• Transparent Elasticity

• being able to add and subtract nodes (physical or virtual machines) with load balancing when the underlying application and business demands it.

• Transparent Scalability

• addition of nodes increases both (1) performance throughput; (2) ability to

handle Big Data and maintain high performance

• High Availability

• always up; no single point of failure

• Easy Data Distribution

• able to span multiple geographies, data centers, and cloud provider

zones. Can read/write to any node

6

Key attributes of a cloud database

• Support for multiple data types

• Able to handle structured, semi-structured, and unstructured data.

• Easy to manage

• easy to administer a logical database across many nodes

• Lower Cost

• It shouldn’t break the bank and you shouldn’t have to have a massive upfront cost

• Multiple Infrastructure Support

• It should support being deployed on multiple cloud providers (public and

private) as well as traditional infrastructure

• Security

• Data should be secure in flight, at rest and audited.

Key Attributes at a glance

7

1. Transparent Elasticity

2. Transparent Scalability

3. High Availability

4. Easy Data Distribution

5. Data Redundancy

6. Support for multiple data types

7. Easy to manage

8. Lower Cost

9. Multiple Infrastructure Support

10. Security

Apache Cassandra

8

Apache Cassandra

9

• Cassandra is a massively scalable, open source, NoSQL, distributed database built for modern, mission-critical online applications.

• Written in Java and is a hybrid of Amazon Dynamo and Google BigTable

• Masterless with no single point of failure

• Distributed, data centre and rack aware

• 100% uptime

• Predictable scaling Dynamo

BigTable

BigTable: http://research.google.com/archive/bigtable-osdi06.pdf

Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

http://research.google.com/archive/bigtable-osdi06.pdf

http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

Apache Cassandra Overview

10

• Cassandra was designed with the understanding that system/hardware failures

can and do occur

• Peer-to-peer, distributed system

• All nodes the same - masterless with no single point of failure

• Data partitioned among all nodes in the cluster

• Custom data replication to ensure fault tolerance

• Tunable data consistency

• Data Center and Rack aware

• Familiar SQL-Like language – CQL

• More capacity? Add a server!

• More throughput? Add a server!

Cassandra Adoption

11

Cassandra - Locally Distributed

12

• Replication factor (RF): How many copies of your data?

• RF = 3 in this example

• Each node is storing 60% of the clusters total data i.e. 3/5

Handy Calculator: http://www.ecyrd.com/cassandracalculator/

Node 1

1st copy

Node 4

Node 5

Node 2

2nd copy

Node 3

3rd copy

• Client reads or writes to any node

• Node coordinates with others

• Data read or replicated in parallel

Node 2

2nd copy

http://www.ecyrd.com/cassandracalculator/

Cassandra - Rack/Zone aware

13

Node 1

1st copy

Node 4

Node 2

Node 3

2nd copy

Rack 1

Rack 2Rack 2

Rack 3

Rack 1

Node 5

3rd copy

• Cassandra is aware of which rack or zone each nod resides in

• It will attempt to place each data copy in a different rack

• RF=3 in this example

Cassandra - DC/Region aware

14

• Active Everywhere – reads/writes in multiple data centres

• Client writes local

• Data syncs across WAN

• Replication Factor per DC

• Different number of nodes per

data center

Node 1

1st copy

Node 4

Node 5Node 2

2nd copy

Node 3

3rd copy

Node 1

1st copy

Node 4

Node 5Node 2

2nd copy

Node 3

3rd copy

DC: EUROPEDC: USA

Cassandra - Tuneable Consistency

15

• Consistency Level (CL)

• Client specifies per operation

• Handles multi-data center operations

ALL = All replicas ack

QUORUM = > 51% of replicas ack

LOCAL_QUORUM = > 51% in local DC ack

ONE = Only one replica acks

Plus more…. (see docs)

Blog: Eventual Consistency != Hopeful Consistency

http://planetcassandra.org/blog/post/a-netflix-experiment-eventual-consistency-hopeful-

consistency-by-christos-kalantzis/

Node 1

1st copy

Node 4

Node 5Node 2

2nd copy

Node 3

3rd copy

Parallel

Write

Write

CL=QUORUM

5 μs ack

12 μs ack

500 μs ack

12 μs ack

http://planetcassandra.org/blog/post/a-netflix-experiment-eventual-consistency-hopeful-consistency-by-christos-kalantzis/

Cassandra - Node failure

16

• A single node failure shouldn’t bring failure.

• Replication Factor + Consistency Level = Success

• This example:

• RF = 3

• CL = QUORUM

Node 1

1st copy

Node 4

Node 5Node 2

2nd copy

Node 3

3rd copy

Parallel

Write

Write

CL=QUORUM

5 μs ack

12 μs ack

12 μs ack

>51% ack – so request is a success

Cassandra - Node Recovery

17

• When a write is performed and a replica node for the row is unavailable the

coordinator will store a hint locally (3 hours)

• When the node recovers, the coordinator replays the missed writes.

• Note: a hinted write does not count the consistency level

• Note: you should still run repairs across your cluster

Node 1

1st copy

Node 4

Node 5Node 2

2nd copy

Node 3

3rd copy

Stores Hints while Node 3 is offline

Cassandra Rack/Zone Failure

18

• Cassandra will place the data in as many

different racks or availability zones as it can.

• This example:

• RF = 3

• CL = QUORUM

• AZ/Rack 2 fails

• Data copies still available in Node 1 and

Node 5

• Quorum can be honored i.e. > 51% ack

Node 1

1st copy

Node 4

Node 2

Node 3

2nd copy

Rack 1

Rack 2Rack 2

Rack 3

Rack 1

Node 5

3rd copy

request is a success

Operational Simplicity

19

• Cassandra is a complete product – there is not a multitude

of components to install, set-up and monitor.

• Extremely simple to administer and deploy

• Backups are instantaneous and simple to restore• Supports snapshots, incremental backups and point-in-time recovery.

• Cassandra can handle non-uniform hardware and disks. o This enables the mixing of solid state and spinning disks in a single cluster and pinning

tables to workload-appropriate disks.

• No downtime is required in Cassandra for upgrades or

adding/removing servers from the cluster.

What ´s up with DataStax?

20

DataStax at a glance

21

Founded in April 2010

~25 500+

Santa Clara, Austin, New York, London, Sydney

330+Employees Percent Customers

DataStax delivers value

22

Certified,

Enterprise-ready

Cassandra

Security Analytics Search VisualMonitoring

ManagementServices

In-Memory

Dev. IDE & Drivers

ProfessionalServices

Support & Training

Commercial

Confidence

Enterprise

Functionality

Enterprise Integrations

23

• DataStax adds Enterprise Features like: Hadoop, Solr,

Spark

DataStax OpsCenter

24

• DataStax OpsCenter is a browser-based, visual management and

monitoring solution for Apache Cassandra and DataStax Enterprise

• Functionality is also exposed via HTTP APIs

OpsCenter - New Cluster Example

25

A new, 10-node Cassandra (or Hadoop) cluster with OpsCenter running in 3 minutes… A new, 10-node DSE cluster with OpsCenter running on AWS in 3 minutes…

Done1 2 3

OpsCenter 5.0

26

• Manage multiple clusters and nodes

• Add and remove nodes

• Administer individual nodes or in bulk

• Configure clusters

• Perform rolling restarts

• Automatically repair data

• Rebalance data

• Backup management

• Capacity planning

DataStax Office Demo

27

• 32 Raspberry Pi´s

• 16 per DataStax Enterprise 4.5 Cluster

• Managed in OpsCenter 5.0

• “Red Button” downs one DataCenter

• Not the Performance-Demo but

• Availability

• Commodity Hardware

Native Drivers

28

• Different Native Drivers available: Java, Python etc.

• Load Balancing Policies (Client Driver receives Updates)• Data Centre Aware

• Latency Aware

• Token Aware

• Reconnection policies

• Retry policies

• Downgrading Consistency

• Plus others..

• http://www.datastax.com/download/clientdrivers

http://www.datastax.com/download/clientdrivers

CQL

29

• Cassandra Query Language

• CQL is intended to provide a common, simpler and easier to use

interface into Cassandra - and you probably already know it!

• e.g. SELECT * FROM users

• Usual statements:

• CREATE / DROP / ALTER TABLE / SELECT

CQL Basics

30

CREATE KEYSPACE league WITH REPLICATION = {‘class’:’NetworkTopologyStrategy’, ‘DataCentre1’:3, ‘DataCentre2’: 2};

USE league;

CREATE TABLE teams (

team_name varchar,

player_name varchar,

jersey int,

PRIMARY KEY (team_name, player_name)

);

SELECT * FROM teams WHERE team_name = ‘Mighty Mutts’ and player_name = ‘Lucky’;

INSERT INTO teams (team_name, player_name, jersey) VALUES ('Mighty Mutts',’Felix’,90);

CQL Data Types

31

DevCenter 1.1

32

• Visual Query Tool for Developers and Administrators

• Easily create and run Cassandra Queries

• Visually navigate database objects

• Context-based suggestions

Do you remember the facts?

33

Key Attributes of a cloud database

34

1. Transparent Elasticity

2. Transparent Scalability

3. High Availability

4. Easy Data Distribution

5. Data Redundancy

6. Support for multiple data types

7. Easy to manage

8. Lower Cost

9. Multiple Infrastructure Support

10. Security

Transparent Elasticity

35

• Nodes can be added and removed from Cassandra online, with no

downtime being experienced.

1

2

3

4

5

6

1

7

104

2

3

5

6

8

9

11

12

Transparent Scalability

36

• Addition of Cassandra nodes increases performance linearly and

ability to manage TB’s-PB’s of data.

1

2

3

4

5

6

1

7

104

2

3

5

6

8

9

11

12

Performance

throughput = N

Performance

throughput = N x 2

High Availability

37

• Cassandra, with its peer-to-peer masterless architecture has no

single point of failure.

• 100% uptime is the norm!

Easy Data Distribution

38

• Cassandra allows a single logical database to span 1-N Data

Centers that are geographically dispersed.

• Also supports a hybrid Cloud implementation.

Data Redundancy

39

• Cassandra allows for customisable data redundancy so that data

is completely protected.

• Also supports rack/zone awareness (data can be replicated

between different racks to guard against machine/rack failures).

Support for multiple data types

40

• Cassandra’s data model (based on Google’s Bigtable) allows a

user to store structured, semi-structured, and unstructured data

with ease.

ID Name SSN DOB

Portfolio Keyspace

Customer Table

Easy to manage

41

• OpsCenter, AMI installers etc.. allow you to install and configure

an entire multi-node Cloud implementation in minutes.

• All can be managed and monitored via Web-based console or via

RESTful APIs.

Lower Cost

42

• Cassandra is open source software and is freely available.

• Commercial/advanced versions of Cassandra are available from

DataStax along with support and additional services.

Multiple Infrastructure Support

43

• Cassandra is supported on popular Cloud provider platforms and

operating systems.

Security

44

• Cassandra has multiple security features - authentication,

authorisation, inflight encryption

• Advanced “enterprise level” security can also provided via

DataStax Enterprise

What´s new?!

45

What is Spark?

46

• Apache Project since 2010

• 10-100x faster than Hadoop MapReduce

• In-Memory Storage

• Single JVM Processor per node

• Rich Scala, Java and Python API´s

• 2x-5x less code

• Interactive Shell

Why Spark on Cassandra?

47

• Data model independent queries

• cross-table operations (JOIN, UNION, etc.)!

• complex analytics (e.g. machine learning)

• data transformation, aggregation etc.

• stream processing (coming soon)

• all nodes are Spark workers

• by default resilient to worker failures

• first node promoted as Spark Master

• Standby Master promoted on failure

• Master HA available in Dactastax Enterprise

How to Spark on Cassandra?

48

• DataStax Cassandra Spark Driver on GitHub

• Compatible with:

• Spark 0.9+

• Cassandra 2.0+

• DataStax Enterprise 4.5+

• Cassandra Tables exposes as Spark RDDs (Resilient Distributed

Datasets

• Read from and write to Cassandra

• Mapping of Cassandra tables and rows to Scala objects

• Scala only driver for now

Spark type mapping

49

2.1 Release - User Defined Types

50

CREATE TYPE address (

street text,

city text,

zip_code int,

phones set<text>

)

CREATE TABLE users (

id uuid PRIMARY KEY,

name text,

addresses map<text, address>

)

SELECT id, name, addresses.city, addresses.phones FROM users;

id | name | addresses.city | addresses.phones

--------------------+----------------+--------------------------

63bf691f | chris | Berlin | {’0201234567', ’0796622222'}

2.1 Release - Secondary Indexes on

collections

51

CREATE TABLE songs (

id uuid PRIMARY KEY,

artist text,

album text,

title text,

data blob,

tags set<text>

);

CREATE INDEX song_tags_idx ON songs(tags);

SELECT * FROM songs WHERE tags CONTAINS 'blues';

id | album | artist | tags | title

----------+---------------+-------------------+-----------------------+------------------

5027b27e | Country Blues | Lightnin' Hopkins | {'acoustic', 'blues'} | Worrying My Mind

Thanks! Let´s see a demo!

52

Software

BigData Developers MeetUp