zznate
A presentation for Data Day Austin on January 29th, 2011. Introduces how to effectively use Apache Cassandra for Java developers using the Hector Java client: http://github.com/rantav/hector
Brief Intro
• NOT a "key/value store"
• Columns are dynamic inside a column family
• SSTables are immutable
• SSTables merged on reads
• All nodes share the same role (i.e. no single point of failure)
Trading ACID compliance for scalability is a fundamental design decision
How does this impact development?
Substantially.
For operations affecting the same data, that data will become consistent eventually as determined by the timestamps.
But you can trade availability for consistency. (More on this later)
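A minimal sketch of the timestamp rule above, assuming a plain "highest timestamp wins" merge. The class and method names are made up for illustration and this is not Cassandra's actual code; ties between equal timestamps are ignored here. It shows two replicas that receive the same writes in a different order and still end up with the same value, and that re-applying an old write changes nothing.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative last-write-wins store; a sketch, not Cassandra's implementation. */
final class LastWriteWins {

    /** A column value paired with the timestamp supplied by the client. */
    static final class Versioned {
        final String value;
        final long timestamp;

        Versioned(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final Map<String, Versioned> columns = new HashMap<>();

    /** Apply a write; it only takes effect if it is newer than the stored one. */
    void apply(String name, String value, long timestamp) {
        Versioned current = columns.get(name);
        if (current == null || timestamp > current.timestamp) {
            columns.put(name, new Versioned(value, timestamp));
        }
    }

    String get(String name) {
        Versioned v = columns.get(name);
        return v == null ? null : v.value;
    }

    public static void main(String[] args) {
        LastWriteWins replicaA = new LastWriteWins();
        LastWriteWins replicaB = new LastWriteWins();

        // The same two writes arrive in opposite order on each replica.
        replicaA.apply("city", "Austin", 100L);
        replicaA.apply("city", "Round Rock", 200L);
        replicaB.apply("city", "Round Rock", 200L);
        replicaB.apply("city", "Austin", 100L);

        System.out.println(replicaA.get("city")); // Round Rock
        System.out.println(replicaB.get("city")); // Round Rock
    }
}
```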
You can store whatever you want. It's all just bytes.
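To make "just bytes" concrete, here is a small sketch using only the JDK: a String and a long are turned into byte arrays and back. The helper names are hypothetical; a client library would normally do this conversion for you, and the encodings chosen here (UTF-8, 8-byte big-endian) are assumptions of the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Round-trips typed values through raw bytes; helper names are illustrative. */
final class BytesRoundTrip {

    static byte[] fromString(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String toStringValue(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static byte[] fromLong(long value) {
        // ByteBuffer writes big-endian by default.
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    static long toLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }

    public static void main(String[] args) {
        byte[] city = fromString("Austin");
        byte[] npanxx = fromLong(512202L);

        System.out.println(city.length);           // 6
        System.out.println(npanxx.length);         // 8
        System.out.println(toStringValue(city));   // Austin
        System.out.println(toLong(npanxx));        // 512202
    }
}
```

The store cannot tell these two arrays apart by type, which is why the reader has to know how a value was written.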
You need to think about how you will query the data before you write it.
Neat. So Now What?
Like any database, you need a client!
• Python:
  o Telephus: http://github.com/driftx/Telephus (Twisted)
  o Pycassa: http://github.com/pycassa/pycassa
• Java:
  o Hector: http://github.com/rantav/hector (Examples: https://github.com/zznate/hector-examples )
  o Pelops: http://github.com/s7/scale7-pelops
  o Kundera: http://code.google.com/p/kundera/
  o Datanucleus JDO: http://github.com/tnine/Datanucleus-Cassandra-Plugin
• Grails:
  o grails-cassandra: https://github.com/wolpert/grails-cassandra
• .NET:
  o FluentCassandra: http://github.com/managedfusion/fluentcassandra
  o Aquiles: http://aquiles.codeplex.com/
• Ruby:
  o Cassandra: http://github.com/fauna/cassandra
• PHP:
  o phpcassa: http://github.com/thobbs/phpcassa
  o SimpleCassie: http://code.google.com/p/simpletools-php/wiki/SimpleCassie
... but do not roll your own
Thrift
• Fast, efficient serialization and network IO.
• Lots of clients available (you can probably use it in other places as well)
Why you don't want to work with the Thrift API directly:
• SuperColumn
• ColumnOrSuperColumn
• ColumnParent.super_column
• ColumnPath.super_column
• Map<ByteBuffer,Map<String,List<Mutation>>> mutationMap
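To give a feel for that last item, here is a sketch of what filling a map of that shape by hand looks like. The `Mutation` class below is a hypothetical stand-in (the real one is generated by Thrift and looks different), and the `add` helper is made up for this example; only the nesting (row key, then column family, then a list of mutations) is taken from the slide.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Shows the bookkeeping needed for a row key -> column family -> mutations map. */
final class MutationMapSketch {

    /** Hypothetical stand-in for the Thrift-generated Mutation type. */
    static final class Mutation {
        final String columnName;
        final String value;

        Mutation(String columnName, String value) {
            this.columnName = columnName;
            this.value = value;
        }
    }

    /** Adds one mutation, creating the intermediate maps and lists as needed. */
    static void add(Map<ByteBuffer, Map<String, List<Mutation>>> mutationMap,
                    String rowKey, String columnFamily, Mutation mutation) {
        ByteBuffer key = ByteBuffer.wrap(rowKey.getBytes(StandardCharsets.UTF_8));
        mutationMap
            .computeIfAbsent(key, k -> new HashMap<>())
            .computeIfAbsent(columnFamily, cf -> new ArrayList<>())
            .add(mutation);
    }

    public static void main(String[] args) {
        Map<ByteBuffer, Map<String, List<Mutation>>> mutationMap = new HashMap<>();
        add(mutationMap, "512202", "Npanxx", new Mutation("city", "Austin"));
        add(mutationMap, "512202", "Npanxx", new Mutation("state", "TX"));
        add(mutationMap, "512203", "Npanxx", new Mutation("city", "Austin"));

        System.out.println(mutationMap.size()); // 2
    }
}
```

A higher-level client hides this nesting behind a builder-style object, which is the point of the next slide.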
Higher Level Client
Hector
• JMX Counters
• Add/remove hosts:
  o automatically
  o programmatically
  o via JMX
• Pluggable load balancing
• Complete encapsulation of Thrift API
• Type-safe approach to dealing with Apache Cassandra
• Lightweight ORM (supports JPA 1.0 annotations)
• Mavenized! http://repo2.maven.org/maven2/me/prettyprint/
"CQL"
• Currently in Apache Cassandra trunk
• Experimental
• Lots of possibilities
from test/system/test_cql.py:
UPDATE StandardLong1 SET 1L="1", 2L="2", 3L="3", 4L="4" WHERE KEY="aa"
SELECT "cd1", "col" FROM Standard1 WHERE KEY = "kd"
DELETE "cd1", "col" FROM Standard1 WHERE KEY = "kd"
Avro??
Gone. Added too much complexity after Thrift caught up.
"None of the libraries distinguished themselves as being a particularly crappy choice for serialization."
(See CASSANDRA-1765)
Thrift API Methods
Retrieving
Writing/Removing
Meta Information
Schema Manipulation
Thrift API Methods - Retrieving
get: retrieve a single column for a key
get_slice: retrieve a "slice" of columns for a key
multiget_slice: retrieve a "slice" of columns for a list of keys
get_count: count the columns of a key (you have to deserialize the row to do it)
get_range_slices: retrieve a slice for a range of keys
get_indexed_slices (FTW!)
Thrift API Methods - Writing/Removing
insert
batch_mutate (batch insertion AND deletion)
remove
truncate**
Thrift API Methods - Meta Information
describe_cluster_name
describe_version
describe_keyspace
describe_keyspaces
Thrift API Methods - Schema
system_add_keyspace
system_update_keyspace
system_drop_keyspace
system_add_column_family
system_update_column_family
system_drop_column_family
vs. RDBMS - Consistency Level
Consistency is tunable per request!
Cassandra provides consistency when R + W > N (read replica count + write replica count > replication factor).
*** CONSISTENCY LEVEL FAILURE IS NOT A ROLLBACK ***
Idempotent: an operation can be applied multiple times without changing the result
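The R + W > N rule from this slide can be checked with a few lines of arithmetic. This is a sketch with made-up method names: `overlaps` tests the inequality and `quorum` computes a majority of N replicas as N/2 + 1 (integer division).

```java
/** Arithmetic behind "R + W > N"; names are illustrative. */
final class ConsistencyCheck {

    /** True when every read set must intersect every write set. */
    static boolean overlaps(int readReplicas, int writeReplicas, int replicationFactor) {
        if (replicationFactor < 1
                || readReplicas < 1 || readReplicas > replicationFactor
                || writeReplicas < 1 || writeReplicas > replicationFactor) {
            throw new IllegalArgumentException("replica counts must be between 1 and N");
        }
        return readReplicas + writeReplicas > replicationFactor;
    }

    /** A majority of the replicas. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    public static void main(String[] args) {
        int n = 3;
        int q = quorum(n);

        System.out.println(q);                 // 2
        System.out.println(overlaps(1, 1, n)); // false: read ONE, write ONE
        System.out.println(overlaps(q, q, n)); // true:  read QUORUM, write QUORUM
        System.out.println(overlaps(1, n, n)); // true:  read ONE, write ALL
    }
}
```

Because the level is chosen per request, an application can mix these combinations depending on how much latency or availability each operation can give up.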
vs. RDBMS - Append Only
Proper data modelling minimizes seeks (Go to Tyler's presentation for more!)
On to the Code...
https://github.com/zznate/cassandra-tutorial
Uses Maven.
Really basic.
Modify/abuse/alter as needed.
Descriptions of what is going on and how to run each example are in the Javadoc comments.
Sample data is based on the North American Numbering Plan: http://en.wikipedia.org/wiki/North_American_Numbering_Plan
Data Shape
512 202 30.27 097.74 W TX Austin
512 203 30.27 097.74 L TX Austin
512 204 30.32 097.73 W TX Austin
512 205 30.32 097.73 W TX Austin
512 206 30.32 097.73 L TX Austin
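A sketch of reading one of these lines into fields. The slides do not label the columns, so the field names here are assumptions: area code (NPA), exchange (NXX), latitude, longitude, a one-letter flag whose meaning is not stated, state, and city. Using NPA + NXX as the row key is also a guess, based on the example file names that follow.

```java
/** One line of the sample data; all field meanings are assumptions of this sketch. */
final class NpanxxRecord {
    final String npa;   // assumed: area code, e.g. "512"
    final String nxx;   // assumed: exchange, e.g. "202"
    final double lat;
    final double lng;
    final String flag;  // "W" or "L"; meaning not stated in the slides
    final String state;
    final String city;

    private NpanxxRecord(String[] f) {
        npa = f[0];
        nxx = f[1];
        lat = Double.parseDouble(f[2]);
        lng = Double.parseDouble(f[3]);
        flag = f[4];
        state = f[5];
        city = f[6];
    }

    /** Splits on whitespace; the limit of 7 keeps multi-word city names intact. */
    static NpanxxRecord parse(String line) {
        String[] fields = line.trim().split("\\s+", 7);
        if (fields.length != 7) {
            throw new IllegalArgumentException("expected 7 fields: " + line);
        }
        return new NpanxxRecord(fields);
    }

    /** Assumed row key: area code followed by exchange. */
    String rowKey() {
        return npa + nxx;
    }

    public static void main(String[] args) {
        NpanxxRecord r = parse("512 202 30.27 097.74 W TX Austin");
        System.out.println(r.rowKey() + " -> " + r.city + ", " + r.state); // 512202 -> Austin, TX
    }
}
```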
Get a Single Column for a Key
GetCityForNpanxx.java
Retrieve a single column with:
• Name
• Value
• Timestamp
• TTL
Get the Contents of a Row
GetSliceForNpanxx.java
Retrieves a list of columns (Hector wraps these in a ColumnSlice)
"SlicePredicate" can be either an explicit set of columns OR a range (more on ranges soon)
Another messy either/or choice encapsulated by Hector
Get the (sorted!) Columns of a Row
GetSliceForStateCity.java
Shows why the choice of comparator is important (this is the order in which the columns hit the disk - take advantage of it)
Can be easily modified to return results in reverse order (but this is slightly slower)
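The effect of the comparator can be imitated with a sorted map. This sketch uses a JDK `TreeMap` as a stand-in for a row; it is not how Cassandra stores columns, and the helper name is made up. The same four column names come back in a different order depending on whether they are compared as text or as numbers, and the reversed order is just the same comparator flipped.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

/** Same column names, different comparators, different on-disk order. */
final class ComparatorOrdering {

    /** Returns the column names in the order the given comparator would store them. */
    static List<String> sortedNames(Comparator<String> comparator, String... names) {
        TreeMap<String, String> row = new TreeMap<>(comparator);
        for (String name : names) {
            row.put(name, "");
        }
        return new ArrayList<>(row.keySet());
    }

    public static void main(String[] args) {
        String[] names = {"10", "9", "100", "2"};
        Comparator<String> asText = Comparator.naturalOrder();
        Comparator<String> asLong = Comparator.comparingLong(Long::parseLong);

        System.out.println(sortedNames(asText, names));            // [10, 100, 2, 9]
        System.out.println(sortedNames(asLong, names));            // [2, 9, 10, 100]
        System.out.println(sortedNames(asLong.reversed(), names)); // [100, 10, 9, 2]
    }
}
```

A range slice from "2" to "10" only makes sense under the numeric ordering, which is why the comparator has to be chosen with the queries in mind.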
Get the Same Slice from Several Rows
MultigetSliceForNpanxx.java
Very similar to get_slice examples, except we provide a list of keys
Get Slices From a Range of Rows
GetRangeSlicesForStateCity.java
Like multiget_slice, except we can specify a KeyRange (encapsulated by RangeSlicesQuery#setKeys(start, end))
The results of this query will be significantly more meaningful with OrderPreservingPartitioner (try this at home!)
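Why the partitioner matters for key ranges can be sketched with plain JDK code. The example assumes that a hash-based partitioner orders rows by an MD5-derived token (this is how RandomPartitioner behaves, but the code below is only an imitation) while an order-preserving one orders them by the key itself. The row keys are hypothetical "state city" strings.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Compares key order with hash-token order; an imitation, not Cassandra code. */
final class PartitionerOrdering {

    /** Non-negative token derived from the MD5 digest of the key. */
    static BigInteger md5Token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return new BigInteger(md5.digest(key.getBytes(StandardCharsets.UTF_8))).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is required by the Java platform", e);
        }
    }

    /** Row order when rows are placed by key. */
    static List<String> byKey(Collection<String> keys) {
        List<String> ordered = new ArrayList<>(keys);
        Collections.sort(ordered);
        return ordered;
    }

    /** Row order when rows are placed by hash token. */
    static List<String> byToken(Collection<String> keys) {
        List<String> ordered = new ArrayList<>(keys);
        ordered.sort(Comparator.comparing(PartitionerOrdering::md5Token));
        return ordered;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("TX Houston", "CA Fresno", "TX Austin", "NY Albany");

        // By key: all "TX ..." rows are adjacent, so a start/end key range is meaningful.
        System.out.println(byKey(keys));
        // By token: the order depends on the hash, so neighbours are unrelated.
        System.out.println(byToken(keys));
    }
}
```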
Get Slices From a Range of Rows - 2
GetSliceForAreaCodeCity.java
Compound column name for controlling ranges
Comparator at work on text field
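A sketch of the compound-name idea, again with a `TreeMap` standing in for a text-sorted row. The ":" separator, the helper names, and the "Placeholder City" entry are all hypothetical; the tutorial code may build its names differently. Putting the city first means that all columns for one city sort next to each other, so a single start/end range returns them.

```java
import java.util.SortedMap;
import java.util.TreeMap;

/** Compound "city:nxx" column names sliced by prefix; separator is an assumption. */
final class CompoundColumnNames {

    static final String SEPARATOR = ":"; // hypothetical delimiter

    static String name(String city, String nxx) {
        return city + SEPARATOR + nxx;
    }

    /** Returns every column whose name starts with "city:" as one contiguous range. */
    static SortedMap<String, String> sliceForCity(TreeMap<String, String> row, String city) {
        String start = city + SEPARATOR;
        String end = start + Character.MAX_VALUE; // sorts after any name with this prefix
        return row.subMap(start, end);
    }

    public static void main(String[] args) {
        // One row for area code 512; values are "lat,lng" strings.
        TreeMap<String, String> row = new TreeMap<>();
        row.put(name("Austin", "204"), "30.32,097.73");
        row.put(name("Placeholder City", "000"), "0,0"); // not real numbering-plan data
        row.put(name("Austin", "202"), "30.27,097.74");
        row.put(name("Austin", "203"), "30.27,097.74");

        System.out.println(sliceForCity(row, "Austin").keySet()); // [Austin:202, Austin:203, Austin:204]
    }
}
```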
Get Slices from Indexed Columns
GetIndexedSlicesForCityState.java
You only need to index a single column to apply clauses on other columns (BUT: the indexed column must be present with an EQUALS clause!)
(It's just another ColumnFamily maintained automatically)
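The "just another ColumnFamily" idea can be sketched as a second map from indexed value to row keys. Everything here is illustrative: the class, the choice of "state" as the indexed column, and the in-memory maps are assumptions, and the sketch does not clean up index entries when a row changes. The query starts from the EQUALS clause on the indexed column and then filters the candidates with the remaining clauses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

/** Toy secondary index: indexed value -> row keys, kept up to date on insert. */
final class SecondaryIndexSketch {

    private final Map<String, Map<String, String>> rows = new HashMap<>();
    private final Map<String, Set<String>> stateIndex = new HashMap<>();

    /** Stores the row and records its key under the indexed "state" value. */
    void insert(String key, String city, String state, String lat) {
        Map<String, String> columns = new HashMap<>();
        columns.put("city", city);
        columns.put("state", state);
        columns.put("lat", lat);
        rows.put(key, columns);
        stateIndex.computeIfAbsent(state, s -> new TreeSet<>()).add(key);
    }

    /** EQUALS on the indexed column narrows the candidates; other clauses filter them. */
    List<String> query(String state, Predicate<Map<String, String>> otherClauses) {
        List<String> matches = new ArrayList<>();
        for (String key : stateIndex.getOrDefault(state, Collections.<String>emptySet())) {
            if (otherClauses.test(rows.get(key))) {
                matches.add(key);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        SecondaryIndexSketch cf = new SecondaryIndexSketch();
        cf.insert("512202", "Austin", "TX", "30.27");
        cf.insert("512204", "Austin", "TX", "30.32");

        System.out.println(cf.query("TX", c -> c.get("lat").equals("30.32"))); // [512204]
        System.out.println(cf.query("CA", c -> true));                         // []
    }
}
```

Without the EQUALS clause there is no index entry to start from, and every row would have to be examined.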
Insert, Update and Delete
... are effectively the same operation.
InsertRowsForColumnFamilies.java
DeleteRowsForColumnFamily.java
Run each in succession (in whichever combination you like) and verify your results on the CLI
Hint: watch the timestamps
bin/cassandra-cli --host localhost
use Tutorial;
list AreaCode;
list Npanxx;
list StateCity;
Stuff I Punted on for the Sake of Brevity
meta_* methods
CassandraClusterTest.java: L43-81 @hector
system_* methods
SchemaManipulation.java @ hector-examples
CassandraClusterTest.java: L84-157 @hector
ORM (it works and is in production)
ORM Documentation
multiple nodes
failure scenarios
Data modelling (go see Tyler's presentation)
Things to Remember
• deletes and timestamp granularity
• "range ghosts"
• using the wrong column comparator and InvalidRequestException
• deletions actually write data
• use column-level TTL to automate deletion
• "how do I iterate over all the rows in a column family"?
  o get_range_slices, but don't do that
  o a good sign your data model is wrong
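Two of the points above ("deletions actually write data" and "deletes and timestamp granularity") can be sketched together. This is an illustration with made-up names, not Cassandra's storage code: a delete is stored as a marker with its own timestamp, and any insert whose timestamp is not newer than that marker stays hidden. Letting the delete win when timestamps are equal is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Deletes stored as timestamped markers; a sketch, not Cassandra's implementation. */
final class TombstoneSketch {

    static final class Cell {
        final String value;      // null for a delete marker
        final long timestamp;
        final boolean tombstone;

        Cell(String value, long timestamp, boolean tombstone) {
            this.value = value;
            this.timestamp = timestamp;
            this.tombstone = tombstone;
        }
    }

    private final Map<String, Cell> cells = new HashMap<>();

    void insert(String name, String value, long timestamp) {
        write(name, new Cell(value, timestamp, false));
    }

    /** A delete is itself a write: it stores a marker instead of removing anything. */
    void delete(String name, long timestamp) {
        write(name, new Cell(null, timestamp, true));
    }

    private void write(String name, Cell incoming) {
        Cell current = cells.get(name);
        boolean newer = current == null
                || incoming.timestamp > current.timestamp
                || (incoming.timestamp == current.timestamp && incoming.tombstone);
        if (newer) {
            cells.put(name, incoming);
        }
    }

    String get(String name) {
        Cell cell = cells.get(name);
        return (cell == null || cell.tombstone) ? null : cell.value;
    }

    int storedCells() {
        return cells.size();
    }

    public static void main(String[] args) {
        TombstoneSketch row = new TombstoneSketch();
        row.insert("city", "Austin", 100L);
        row.delete("city", 200L);
        System.out.println(row.get("city"));   // null
        System.out.println(row.storedCells()); // 1 (the marker is still stored)

        row.insert("city", "Austin", 150L);    // older than the delete: stays hidden
        System.out.println(row.get("city"));   // null

        row.insert("city", "Austin", 300L);    // newer than the delete: visible again
        System.out.println(row.get("city"));   // Austin
    }
}
```

If client timestamps are too coarse, a re-insert issued right after a delete can land on the same timestamp and silently lose, which is one reading of the "timestamp granularity" warning.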
Dealing with *Lots* of Data (Briefly)
Two biggest headaches have been addressed:
• Compaction pollutes the OS page cache (CASSANDRA-1470)
• Greater than 143mil keys on a single SSTable means more Bloom filter (BF) false positives (CASSANDRA-1555)
Hadoop integration: Yes. (Go see Jeremy's presentation)
Bulk loading: Yes. CASSANDRA-1278
For more information: http://wiki.apache.org/cassandra/LargeDataSetConsiderations