BigData Developers MeetUp in Berlin at IBM
Christian Johannsen, Solutions Engineer
Evaluating Apache Cassandra as a Cloud Database
About me
Christian Johannsen, Solutions Engineer @ DataStax
Twitter: @cjohannsen81
What is a cloud database not?
• A cloud database is not simply a traditional RDBMS running in a cloud
provider’s environment
What is a cloud database?
• A cloud database is a database optimised to handle the
infrastructure limitations and features of running on
shared, geographically dispersed infrastructure.
• These limitations include outages, noisy neighbours, security
concerns, variable networks, etc.
Key attributes of a cloud database
• Transparent Elasticity
• being able to add and subtract nodes (physical or virtual machines) with load balancing when the underlying application and business demands it.
• Transparent Scalability
• addition of nodes increases both (1) performance throughput; (2) ability to
handle Big Data and maintain high performance
• High Availability
• always up; no single point of failure
• Easy Data Distribution
• able to span multiple geographies, data centers, and cloud provider
zones. Can read/write to any node
Key attributes of a cloud database (continued)
• Support for multiple data types
• Able to handle structured, semi-structured, and unstructured data.
• Easy to manage
• easy to administer a logical database across many nodes
• Lower Cost
• It shouldn’t break the bank or require a massive upfront cost
• Multiple Infrastructure Support
• It should support being deployed on multiple cloud providers (public and
private) as well as traditional infrastructure
• Security
• Data should be secured in flight and at rest, and access should be audited.
Key Attributes at a glance
1. Transparent Elasticity
2. Transparent Scalability
3. High Availability
4. Easy Data Distribution
5. Data Redundancy
6. Support for multiple data types
7. Easy to manage
8. Lower Cost
9. Multiple Infrastructure Support
10. Security
Apache Cassandra
• Cassandra is a massively scalable, open source, NoSQL, distributed database built for modern, mission-critical online applications.
• Written in Java, with a design that is a hybrid of Amazon Dynamo and Google BigTable
• Masterless with no single point of failure
• Distributed, data centre and rack aware
• Designed for 100% uptime
• Predictable scaling
BigTable: http://research.google.com/archive/bigtable-osdi06.pdf
Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
Apache Cassandra Overview
• Cassandra was designed with the understanding that system/hardware failures
can and do occur
• Peer-to-peer, distributed system
• All nodes the same - masterless with no single point of failure
• Data partitioned among all nodes in the cluster
• Custom data replication to ensure fault tolerance
• Tunable data consistency
• Data Center and Rack aware
• Familiar SQL-Like language – CQL
• More capacity? Add a server!
• More throughput? Add a server!
Cassandra Adoption
Cassandra - Locally Distributed
• Replication factor (RF): how many copies of your data are stored?
• RF = 3 in this example
• Each node stores 60% of the cluster's total data, i.e. RF / number of nodes = 3/5
Handy Calculator: http://www.ecyrd.com/cassandracalculator/
[Diagram: five-node ring; the 1st, 2nd and 3rd copies are on Node 1, Node 2 and Node 3]
• Client reads or writes to any node
• Node coordinates with others
• Data read or replicated in parallel
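The 60% figure on this slide is simply RF divided by the number of nodes. A minimal Python sketch of that arithmetic follows. It assumes tokens are evenly balanced across the ring, and the function name `ownership_fraction` is made up for illustration.

```python
from fractions import Fraction


def ownership_fraction(replication_factor: int, node_count: int) -> Fraction:
    """Share of the cluster's total data held by each node.

    Assumes data is spread evenly across the ring (an idealisation).
    """
    if replication_factor < 1 or node_count < 1:
        raise ValueError("replication factor and node count must be positive")
    # A cluster cannot hold more distinct copies than it has nodes.
    effective_rf = min(replication_factor, node_count)
    return Fraction(effective_rf, node_count)


if __name__ == "__main__":
    share = ownership_fraction(replication_factor=3, node_count=5)
    print(f"Each node stores {float(share):.0%} of the data")
```

Adding nodes at a fixed RF lowers the share per node, which is why capacity grows when servers are added.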
Cassandra - Rack/Zone aware
[Diagram: five nodes spread over Rack 1, Rack 2 and Rack 3; the 1st copy is on Node 1, the 2nd on Node 3 and the 3rd on Node 5]
• Cassandra is aware of which rack or zone each node resides in
• It will attempt to place each data copy in a different rack
• RF=3 in this example
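The placement rule above can be sketched as a walk around the ring that prefers racks not used yet. This is a simplified illustration, not Cassandra's actual replication strategy code. The rack assigned to each node below is an assumption, because the slide diagram does not survive clearly, and `place_replicas` is a hypothetical helper.

```python
from typing import List, Tuple

# Hypothetical layout: (node, rack) pairs in ring order.
RING: List[Tuple[str, str]] = [
    ("Node 1", "Rack 1"),
    ("Node 2", "Rack 1"),
    ("Node 3", "Rack 2"),
    ("Node 4", "Rack 2"),
    ("Node 5", "Rack 3"),
]


def place_replicas(ring: List[Tuple[str, str]], start_index: int,
                   replication_factor: int) -> List[str]:
    """Walk the ring clockwise, preferring nodes in racks not used yet."""
    chosen: List[str] = []
    skipped: List[str] = []
    used_racks = set()
    for step in range(len(ring)):
        if len(chosen) == replication_factor:
            break
        node, rack = ring[(start_index + step) % len(ring)]
        if rack in used_racks:
            skipped.append(node)  # same rack as an earlier copy, try later
        else:
            chosen.append(node)
            used_racks.add(rack)
    # Fewer racks than copies: fall back to the skipped nodes in ring order.
    for node in skipped:
        if len(chosen) == replication_factor:
            break
        chosen.append(node)
    return chosen


if __name__ == "__main__":
    print(place_replicas(RING, start_index=0, replication_factor=3))
```

With this assumed layout the three copies land on Node 1, Node 3 and Node 5, one per rack, which matches the copies shown on the slide.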
Cassandra - DC/Region aware
• Active Everywhere – reads/writes in multiple data centres
• Client writes local
• Data syncs across WAN
• Replication Factor per DC
• Different number of nodes per
data center
[Diagram: two five-node rings, DC: EUROPE and DC: USA; each holds three copies, on Node 1, Node 2 and Node 3]
Cassandra - Tuneable Consistency
• Consistency Level (CL)
• Client specifies per operation
• Handles multi-data center operations
ALL = all replicas ack
QUORUM = a majority (more than 50%) of replicas ack
LOCAL_QUORUM = a majority (more than 50%) of replicas in the local DC ack
ONE = only one replica acks
Plus more…. (see docs)
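As a rough illustration of the levels listed above, the number of acknowledgements each one needs can be computed from the replication factor per data center. A quorum is a strict majority, i.e. floor(n / 2) + 1. The helper names and the RF of 3 per data center are assumptions made for this sketch, which covers only the four levels named on the slide.

```python
from typing import Dict


def quorum(replica_count: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replica_count // 2 + 1


def required_acks(consistency_level: str, rf_per_dc: Dict[str, int],
                  local_dc: str) -> int:
    """Replica acks the coordinator waits for before answering the client."""
    total_replicas = sum(rf_per_dc.values())
    if consistency_level == "ALL":
        return total_replicas
    if consistency_level == "QUORUM":
        return quorum(total_replicas)
    if consistency_level == "LOCAL_QUORUM":
        return quorum(rf_per_dc[local_dc])
    if consistency_level == "ONE":
        return 1
    raise ValueError(f"level not covered by this sketch: {consistency_level}")


if __name__ == "__main__":
    rf = {"EUROPE": 3, "USA": 3}  # assumed RF per data center
    for level in ("ALL", "QUORUM", "LOCAL_QUORUM", "ONE"):
        print(level, required_acks(level, rf, local_dc="EUROPE"))
```

With two data centers, LOCAL_QUORUM needs fewer acks than QUORUM and none of them has to cross the WAN, which is why it suits multi-data-center operation.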
Blog: Eventual Consistency != Hopeful Consistency
http://planetcassandra.org/blog/post/a-netflix-experiment-eventual-consistency-hopeful-consistency-by-christos-kalantzis/
[Diagram: the client writes with CL=QUORUM; the write goes in parallel to the three replicas (Node 1, Node 2, Node 3); acks are labelled 5 μs, 12 μs, 500 μs and 12 μs]
Cassandra - Node failure
• A single node failure shouldn’t cause a request to fail.
• Replication Factor + Consistency Level = Success
• This example:
• RF = 3
• CL = QUORUM
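The "Replication Factor + Consistency Level = Success" rule can be sketched as a small simulation: the coordinator sends the write to all replicas in parallel and answers the client as soon as a majority has acknowledged. The latencies below reuse the labels from the slide diagrams (5 μs, 12 μs, 500 μs). Treating them as per-replica ack times is an assumption, and `coordinate_write` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def coordinate_write(ack_latencies_us: List[Optional[int]],
                     replication_factor: int) -> Tuple[bool, Optional[int]]:
    """Return (success, time at which a quorum of acks has arrived).

    ack_latencies_us holds one entry per replica; None marks a replica
    that is down and never acknowledges.
    """
    needed = replication_factor // 2 + 1  # strict majority
    received = sorted(t for t in ack_latencies_us if t is not None)
    if len(received) < needed:
        return False, None  # a real coordinator would time out here
    # The request completes when the needed-th fastest ack arrives.
    return True, received[needed - 1]


if __name__ == "__main__":
    print(coordinate_write([5, 12, 500], replication_factor=3))   # all up
    print(coordinate_write([5, 12, None], replication_factor=3))  # one down
    print(coordinate_write([5, None, None], replication_factor=3))
```

With RF = 3 and CL = QUORUM, losing one replica changes nothing for the client, and the slowest replica does not hold the request back either. Losing two replicas makes the quorum unreachable.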
[Diagram: the client writes with CL=QUORUM while one replica is down; the remaining acks are labelled 5 μs, 12 μs and 12 μs. A majority of replicas ack, so the request is a success.]
Cassandra - Node Recovery
• When a write is performed and a replica node for the row is unavailable, the
coordinator stores a hint locally (for up to 3 hours)
• When the node recovers, the coordinator replays the missed writes.
• Note: a hinted write does not count towards the consistency level
• Note: you should still run repairs across your cluster
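The hint mechanism described above can be illustrated with a toy model. It is a sketch under simplifying assumptions, not Cassandra's implementation: the `Coordinator` class and its methods are hypothetical, time is passed in explicitly as seconds, and the 3-hour window is taken from the slide.

```python
from collections import defaultdict
from typing import Dict, List

HINT_WINDOW_SECONDS = 3 * 60 * 60  # the 3 hours mentioned on the slide


class Coordinator:
    """Toy coordinator that stores hints for replicas that are down."""

    def __init__(self, hint_window_s: int = HINT_WINDOW_SECONDS) -> None:
        self.hint_window_s = hint_window_s
        self.hints: Dict[str, List[str]] = defaultdict(list)

    def write(self, mutation: str, replicas: List[str],
              down_since: Dict[str, int], now: int) -> int:
        """Send a write; return the number of live acks (hints excluded)."""
        acks = 0
        for replica in replicas:
            if replica not in down_since:
                acks += 1  # live replica acknowledges the write
            elif now - down_since[replica] <= self.hint_window_s:
                self.hints[replica].append(mutation)  # keep for replay
            # Otherwise the window has passed: no hint is stored, and only
            # a repair can bring the replica back in sync.
        return acks

    def replay(self, replica: str) -> List[str]:
        """Hand back (and forget) the writes a recovered replica missed."""
        return self.hints.pop(replica, [])


if __name__ == "__main__":
    nodes = ["Node 1", "Node 2", "Node 3"]
    coordinator = Coordinator()
    print(coordinator.write("w1", nodes, {"Node 3": 0}, now=60))
    print(coordinator.write("w2", nodes, {"Node 3": 0}, now=4 * 3600))
    print(coordinator.replay("Node 3"))
```

The second write arrives after the window has expired, so it is never hinted. That gap is the reason repairs are still needed.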
[Diagram: five-node ring with copies on Node 1, Node 2 and Node 3; the coordinator stores hints while Node 3 is offline]
Cassandra - Rack/Zone Failure
• Cassandra will place the data in as many
different racks or availability zones as it can.
• This example:
• RF = 3
• CL = QUORUM
• AZ/Rack 2 fails
• Data copies still available in Node 1 and
Node 5
• Quorum can still be honored, i.e. a majority of replicas ack (2 of 3)
[Diagram: five nodes spread over Rack 1, Rack 2 and Rack 3, with copies on Node 1, Node 3 and Node 5; Rack 2 is down and the request is a success]
Operational Simplicity
• Cassandra is a complete product – there is not a multitude
of components to install, set-up and monitor.
• Extremely simple to administer and deploy
• Backups are instantaneous and simple to restore
• Supports snapshots, incremental backups and point-in-time recovery.
• Cassandra can handle non-uniform hardware and disks.
• This enables the mixing of solid state and spinning disks in a single cluster and pinning
tables to workload-appropriate disks.
• No downtime is required in Cassandra for upgrades or
adding/removing servers from the cluster.
What's up with DataStax?
DataStax at a glance
• Founded in April 2010
• Santa Clara, Austin, New York, London, Sydney
• 330+ Employees
• ~25 Percent
• 500+ Customers
DataStax delivers value
• Certified, Enterprise-ready Cassandra
• Security
• Analytics
• Search
• Visual Monitoring
• Management Services
• In-Memory
• Dev. IDE & Drivers
• Professional Services
• Support & Training
• Commercial Confidence
• Enterprise Functionality
Enterprise Integrations
• DataStax adds enterprise integrations such as Hadoop, Solr and Spark
DataStax OpsCenter
• DataStax OpsCenter is a browser-based, visual management and
monitoring solution for Apache Cassandra and DataStax Enterprise
• Functionality is also exposed via HTTP APIs
OpsCenter - New Cluster Example
• A new, 10-node Cassandra (or Hadoop) cluster with OpsCenter running in 3 minutes
• A new, 10-node DSE cluster with OpsCenter running on AWS in 3 minutes
[Screenshots: steps 1, 2, 3, Done]
OpsCenter 5.0
• Manage multiple clusters and nodes
• Add and remove nodes
• Administer individual nodes or in bulk
• Configure clusters
• Perform rolling restarts
• Automatically repair data
• Rebalance data
• Backup management
• Capacity planning
DataStax Office Demo
• 32 Raspberry Pis
• 16 per DataStax Enterprise 4.5 cluster
• Managed in OpsCenter 5.0
• “Red Button” takes down one data center
• Not a performance demo; it demonstrates:
• Availability
• Commodity hardware
Native Drivers
• Different Native Drivers available: Java, Python etc.
• Load Balancing Policies (Client Driver receives Updates)
• Data Centre Aware
• Latency Aware
• Token Aware
• Reconnection policies
• Retry policies
• Downgrading Consistency
• Plus others..
• http://www.datastax.com/download/clientdrivers
CQL
• Cassandra Query Language
• CQL is intended to provide a common, simpler and easier to use
interface into Cassandra - and you probably already know it!
• e.g. SELECT * FROM users
• Usual statements:
• CREATE / DROP / ALTER TABLE / SELECT
CQL Basics
-- Keyspace replicated to two data centres: 3 copies in one, 2 in the other
CREATE KEYSPACE league WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'DataCentre1': 3, 'DataCentre2': 2};
USE league;
-- team_name is the partition key, player_name the clustering column
CREATE TABLE teams (
  team_name varchar,
  player_name varchar,
  jersey int,
  PRIMARY KEY (team_name, player_name)
);
SELECT * FROM teams WHERE team_name = 'Mighty Mutts' AND player_name = 'Lucky';
INSERT INTO teams (team_name, player_name, jersey) VALUES ('Mighty Mutts', 'Felix', 90);
CQL Data Types
DevCenter 1.1
• Visual Query Tool for Developers and Administrators
• Easily create and run Cassandra Queries
• Visually navigate database objects
• Context-based suggestions
Do you remember the facts?
Key Attributes of a cloud database
1. Transparent Elasticity
2. Transparent Scalability
3. High Availability
4. Easy Data Distribution
5. Data Redundancy
6. Support for multiple data types
7. Easy to manage
8. Lower Cost
9. Multiple Infrastructure Support
10. Security
Transparent Elasticity
• Nodes can be added to and removed from a Cassandra cluster online, with no
downtime.
[Diagram: a 6-node ring growing to a 12-node ring]
Transparent Scalability
• Adding Cassandra nodes increases performance linearly, along with the
ability to manage terabytes to petabytes of data.
[Diagram: a 6-node ring with performance throughput = N; a 12-node ring with performance throughput = N x 2]
High Availability
• Cassandra, with its peer-to-peer masterless architecture, has no
single point of failure.
• 100% uptime is the norm!
Easy Data Distribution
• Cassandra allows a single logical database to span 1-N Data
Centers that are geographically dispersed.
• Also supports a hybrid Cloud implementation.
Data Redundancy
• Cassandra allows for customisable data redundancy so that data
is completely protected.
• Also supports rack/zone awareness (data can be replicated
between different racks to guard against machine/rack failures).
Support for multiple data types
• Cassandra’s data model (based on Google’s Bigtable) allows a
user to store structured, semi-structured, and unstructured data
with ease.
[Diagram: the Portfolio keyspace containing a Customer table with columns ID, Name, SSN and DOB]
Easy to manage
• OpsCenter, AMI installers, etc. allow you to install and configure
an entire multi-node Cloud implementation in minutes.
• Everything can be managed and monitored via a web-based console or via
RESTful APIs.
Lower Cost
• Cassandra is open source software and is freely available.
• Commercial/advanced versions of Cassandra are available from
DataStax along with support and additional services.
Multiple Infrastructure Support
• Cassandra is supported on popular Cloud provider platforms and
operating systems.
Security
• Cassandra has multiple security features: authentication,
authorisation and in-flight encryption
• Advanced “enterprise level” security can also be provided via
DataStax Enterprise
What's new?!
What is Spark?
• Open source since 2010, Apache project since 2013
• Up to 10-100x faster than Hadoop MapReduce
• In-Memory Storage
• Single JVM process per node
• Rich Scala, Java and Python APIs
• 2x-5x less code
• Interactive Shell
Why Spark on Cassandra?
• Data model independent queries
• cross-table operations (JOIN, UNION, etc.)!
• complex analytics (e.g. machine learning)
• data transformation, aggregation etc.
• stream processing (coming soon)
• all nodes are Spark workers
• by default resilient to worker failures
• first node promoted as Spark Master
• Standby Master promoted on failure
• Master HA available in DataStax Enterprise
How to Spark on Cassandra?
• DataStax Cassandra Spark Driver on GitHub
• Compatible with:
• Spark 0.9+
• Cassandra 2.0+
• DataStax Enterprise 4.5+
• Cassandra tables exposed as Spark RDDs (Resilient Distributed
Datasets)
• Read from and write to Cassandra
• Mapping of Cassandra tables and rows to Scala objects
• Scala only driver for now
Spark type mapping
2.1 Release - User Defined Types
CREATE TYPE address (
  street text,
  city text,
  zip_code int,
  phones set<text>
);
-- In Cassandra 2.1 a user-defined type used as a column must be frozen
CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  address frozen<address>
);
SELECT id, name, address.city, address.phones FROM users;
 id       | name  | address.city | address.phones
----------+-------+--------------+------------------------------
 63bf691f | chris | Berlin       | {'0201234567', '0796622222'}
2.1 Release - Secondary Indexes on collections
CREATE TABLE songs (
id uuid PRIMARY KEY,
artist text,
album text,
title text,
data blob,
tags set<text>
);
CREATE INDEX song_tags_idx ON songs(tags);
SELECT * FROM songs WHERE tags CONTAINS 'blues';
id | album | artist | tags | title
----------+---------------+-------------------+-----------------------+------------------
5027b27e | Country Blues | Lightnin' Hopkins | {'acoustic', 'blues'} | Worrying My Mind
Thanks! Let's see a demo!