Cassandra + S3 + Hadoop = Quick Auditing and Analytics


DESCRIPTION

The Cassandra database is an excellent choice when you need scalability and high availability without compromising performance. Cassandra's linear scalability, proven fault tolerance, and tunable consistency, combined with its optimization for write traffic, make it an attractive choice for structured logging of application and transactional events. But using a columnar store like Cassandra for analytical needs poses its own problems, which we solved by careful construction of Column Families combined with diplomatic use of Hadoop. This tutorial focuses on building a similar system from scratch, showing how to perform analytical queries in near real time while still getting the benefits of Cassandra's high-performance database engine. The key subjects are:

• The splendors and miseries of NoSQL
• Apache Cassandra use cases
• Difficulties of using Map/Reduce directly on Cassandra
• Amazon cloud solutions: Elastic MapReduce and S3
• "Real-enough" time analysis


Polyglot Persistence in the Real World

Anton Yazovskiy, Thumbtack Technology

• Software Engineer at Thumbtack Technology
• an active user of various NoSQL solutions
• consulting with focus on scalability
• a significant part of my work is advising people on which solutions to use and why
• big fan of BigData and clouds

• NoSQL – not a silver bullet
• Choices that we make
• Cassandra: operational workload
• Cassandra: analytical workload
• The best of both worlds
• Some benchmarks
• Conclusions

• well-known ways to scale: scale in/out, scale by function, data denormalization
• really works
• each has disadvantages
• mostly a manual process (NewSQL)

(image: http://qsec.deviantart.com)

• solves exactly this kind of problem
• rapid application development
• aggregate
• schema flexibility
• auto-scale-out
• auto-failover

• amount of data it can handle
• shared-nothing architecture, no SPOF
• performance

• splendors and miseries of aggregate
• CAP theorem dilemma

(Diagram: the CAP theorem triangle – Consistency, Availability, Partition Tolerance)

(Diagram: analytical vs. operational workloads plotted against Consistency, Availability, Performance and Reliability)

I want it all

Apache Cassandra (released by Facebook in 2008)

• elastic scalability & linear performance *
• dynamic schema
• very high write throughput
• tunable per-request consistency
• fault-tolerant design
• multiple datacenter and cloud readiness
• CAS transaction support *

* http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-cassandra

• Large data set on commodity hardware
• Tradeoff between speed and reliability
• Heavy-write workload
• Time-series data


(Diagram: Cassandra placed on the operational side of the spectrum – Reliability and Performance – rather than the analytical side)

Small demo after this slide

(Diagram: rows of (TIMESTAMP, FIELD 1, …, DATA) with timestamps 12344567, 12326346, 13124124, 13237457 and 13627236 spread across SERVER 1 and SERVER 2)

• expensive range queries across the cluster
• unless you shard by timestamp…
• …which becomes a bottleneck for a heavy-write workload

select * from table where timestamp > 12344567 and timestamp < 13237457

• all columns are sorted by name
• row – aggregate item (never sharded)

Column Family

row key 1 | column 1: value 1.1 | column 2: value 1.2 | column 3: value 1.3 | .. | column N: value 1.N
row key 2 | column 1: value 2.1 | column 2: value 2.2 | … | column M: value 2.M

Super columns are discouraged and omitted here.
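A minimal sketch (not from the talk) of the same wide-row layout expressed as CQL through the DataStax Java driver; the keyspace "audit" and table "events" are made up, and the clustering column plays the role of the sorted column name:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class WideRowSchema {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        session.execute("CREATE KEYSPACE IF NOT EXISTS audit "
                + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
        // row_key  -> partition key (the row is an aggregate item, never sharded)
        // col_name -> clustering column (the "columns" stay sorted by name)
        session.execute("CREATE TABLE IF NOT EXISTS audit.events ("
                + "row_key text, col_name text, value text, "
                + "PRIMARY KEY (row_key, col_name))");
        cluster.close();
    }
}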

• get key
• get slice
• get range
+ combinations of these queries
+ composite columns

• all columns are sorted by name
• row – aggregate item (never sharded)

(Diagram: row keys 1–5, each holding timestamp-named columns, distributed across SERVER 1 and SERVER 2; a row is never split across servers and its columns stay sorted)

get_slice(row_key, from, to, count)

get_slice("row key 1", from: "timestamp 1", null, 11)
get_slice("row key 1", from: "timestamp 11", null, 11)   (next page)
get_slice("row key 1", null, to: "timestamp 11", 11)     (prev. page)
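The same paging pattern sketched in CQL through the DataStax Java driver, assuming the audit.events table from the earlier sketch; the page size of 11 mirrors the get_slice calls above:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SlicePaging {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("audit");

        // get_slice(row_key, from, null, 11): first page of 11 columns
        String page = "SELECT col_name, value FROM events "
                + "WHERE row_key = ? AND col_name >= ? LIMIT 11";
        String lastSeen = "timestamp 1";
        for (Row row : session.execute(page, "row key 1", lastSeen)) {
            lastSeen = row.getString("col_name");   // remember the cursor
        }

        // next page: restart the slice from the last column we saw
        for (Row row : session.execute(page, "row key 1", lastSeen)) {
            System.out.println(row.getString("col_name") + " -> " + row.getString("value"));
        }

        // previous page: slice backwards up to a known column (returned in descending order)
        String prevPage = "SELECT col_name, value FROM events "
                + "WHERE row_key = ? AND col_name <= ? ORDER BY col_name DESC LIMIT 11";
        session.execute(prevPage, "row key 1", lastSeen);

        cluster.close();
    }
}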


• Time-range with filter:
  • "get all events for User J from N to M"
  • "get all success events for User J from N to M"
  • "get all events for all users from N to M"


row key                    | column       | value
events::success::User_123  | timestamp 1  | value 1
events::success            | timestamp 1  | value 1
events::User_123           | timestamp 1  | value 1
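A sketch of the write path this key design implies, assuming the audit.events table from the earlier sketches: each event is fanned out into the three rows shown above, so every filter query becomes a single-row slice.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class FanOutWrite {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("audit");

        String user = "User_123";
        String ts = String.valueOf(System.currentTimeMillis());
        String payload = "value 1";
        String insert = "INSERT INTO events (row_key, col_name, value) VALUES (?, ?, ?)";

        // One event, three rows: each row key answers one of the filter queries
        // ("all events for user", "all success events", "all success events for user").
        session.execute(insert, "events::" + user, ts, payload);
        session.execute(insert, "events::success", ts, payload);
        session.execute(insert, "events::success::" + user, ts, payload);

        cluster.close();
    }
}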

• Counters:
  • "get # of events for User J grouped by hour"
  • "get # of events for User J grouped by day"

row key                    | 1380400000 | 1380403600
events::success::User_123  | 14         | 42
events::User_123           | 842        | 1024

(group by day – same but in different column family for TTL support)
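A sketch of the hourly counters as a CQL counter table; the table name counters_by_hour is made up, and per the note above the day-level counters would live in a second table.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class HourlyCounters {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("audit");

        // Counter table: one row per composite key, one column per hour bucket
        // (the 1380400000 / 1380403600 columns in the slide above).
        session.execute("CREATE TABLE IF NOT EXISTS counters_by_hour ("
                + "row_key text, hour_bucket bigint, count counter, "
                + "PRIMARY KEY (row_key, hour_bucket))");

        long hourBucket = (System.currentTimeMillis() / 1000 / 3600) * 3600; // epoch seconds, hour granularity
        String update = "UPDATE counters_by_hour SET count = count + 1 "
                + "WHERE row_key = ? AND hour_bucket = ?";
        session.execute(update, "events::success::User_123", hourBucket);
        session.execute(update, "events::User_123", hourBucket);

        cluster.close();
    }
}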

• the row key should consist of a combination of fields with high cardinality of values: name, id, etc. (see the sketch below)
• boolean values are a bad option on their own
• composite columns are a good option for this
• a timestamp may help to spread historical data
• otherwise, scalability will not be linear
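A hypothetical key-builder helper (not from the talk) that follows these guidelines: a high-cardinality id, a status flag only as one component of the composite key, and a coarse time bucket so historical data spreads across many rows.

public class RowKeys {
    private static final long BUCKET_SECONDS = 24 * 3600; // one row per day

    public static String eventKey(String userId, String status, long epochSeconds) {
        long bucket = (epochSeconds / BUCKET_SECONDS) * BUCKET_SECONDS;
        StringBuilder key = new StringBuilder("events");
        if (status != null) {
            key.append("::").append(status);   // e.g. "success" – never the whole key
        }
        key.append("::").append(userId);       // high-cardinality part
        key.append("::").append(bucket);       // spreads history across rows
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(eventKey("User_123", "success", 1380400000L));
        // -> events::success::User_123::1380326400
    }
}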

In theory – possible in real time:
• averages, 3-dimensional filters, group by, etc.

But:
• hard to tune the data model
• lack of aggregation options
• aggregation over historical data

“I want interactive reports”

Cassandra

“Reports could be a little bit out of date, but I want to control this delay value”

Auto update somehow

• impact on the production system, or
• higher total cost of ownership
• difficulties with scalability
• hard to support with multiple clusters

http://www.datastax.com/docs/0.7/map_reduce/hadoop_mr

http://aws.amazon.com

• Hadoop tech stack
• Automatic deployment
• Management API
• Temporal (transient) cluster
• Amazon S3 as data storage *

* copy from S3 to EMR HDFS and back
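A sketch of pushing input data to S3 with the AWS SDK for Java v1 TransferManager, which parallelizes multipart uploads (the "multi-threading helps" point in the benchmark notes later); the bucket, key, and file path are placeholders.

import java.io.File;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.Upload;

public class UploadToS3 {
    public static void main(String[] args) throws InterruptedException {
        // TransferManager splits large files into parts and uploads them in parallel;
        // credentials come from the default AWS credentials provider chain.
        TransferManager tm = new TransferManager();
        Upload upload = tm.upload("my-audit-bucket",               // placeholder bucket
                                  "input/events-2013-09-28.csv",   // placeholder key
                                  new File("/tmp/events-2013-09-28.csv"));
        upload.waitForCompletion();
        tm.shutdownNow();
    }
}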

Start a new cluster and run a job:

JobFlowInstancesConfig instances = ..;
instances.setHadoopVersion(..);
instances.setInstanceCount(dataNodeCount + 1);   // data nodes + 1 master
instances.setMasterInstanceType(..);
instances.setSlaveInstanceType(..);

RunJobFlowRequest req = ..(name, instances);
req.addSteps(new StepConfig(name, jar));

AmazonElasticMapReduce emr = ..;
emr.runJobFlow(req);

Execute a job on a running cluster:

StepConfig stepConfig = new StepConfig(name, jar);

AddJobFlowStepsRequest addReq = …;
addReq.setJobFlowId(jobFlowId);
addReq.setSteps(Arrays.asList(stepConfig));

AmazonElasticMapReduce emr = …;
emr.addJobFlowSteps(addReq);
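A fleshed-out version of the elided snippets above, as a sketch with the AWS SDK for Java v1; the cluster size, instance types, job jar, and S3 paths are all placeholders.

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class LaunchTransientCluster {
    public static void main(String[] args) {
        int dataNodeCount = 4;

        // Master + data nodes; a transient cluster terminates once its steps finish.
        JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
                .withHadoopVersion("1.0.3")              // placeholder version
                .withInstanceCount(dataNodeCount + 1)    // +1 for the master node
                .withMasterInstanceType("m1.medium")     // placeholder instance types
                .withSlaveInstanceType("m1.medium")
                .withKeepJobFlowAliveWhenNoSteps(false);

        // The job jar and its arguments live in S3; EMR copies them in for us.
        HadoopJarStepConfig jar = new HadoopJarStepConfig()
                .withJar("s3://my-audit-bucket/jobs/report-job.jar")  // placeholder
                .withArgs("s3://my-audit-bucket/input/", "s3://my-audit-bucket/output/");

        // Newer EMR releases also require .withServiceRole(..) and .withJobFlowRole(..).
        RunJobFlowRequest req = new RunJobFlowRequest("audit-report", instances)
                .withSteps(new StepConfig("build-reports", jar))
                .withLogUri("s3://my-audit-bucket/emr-logs/");        // placeholder

        AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient();
        RunJobFlowResult result = emr.runJobFlow(req);
        System.out.println("Started job flow " + result.getJobFlowId());
    }
}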

• cluster lifecycle: Long-Running or Transient
• cold start = ~20 min
• tradeoff: cluster cost vs. availability

• compression and Combiner tuning may speed up jobs a lot (see the sketch below)

• common problems for all big data processing tools: monitoring, testability, and debugging (MRUnit, local Hadoop, smaller data sets)
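A sketch of the compression and Combiner tuning mentioned above, using the Hadoop 2 MapReduce API; identity Mapper/Reducer classes stand in for the real job classes, and SnappyCodec assumes the native snappy library is available on the cluster (swap in GzipCodec otherwise).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReportJobConfig {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output so less data is shuffled between nodes
        // (these are the Hadoop 2 property names; Hadoop 1 used mapred.compress.map.output).
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.set("mapreduce.map.output.compress.codec", SnappyCodec.class.getName());

        Job job = Job.getInstance(conf, "audit-report");
        job.setMapperClass(Mapper.class);          // identity mapper as a stand-in
        job.setCombinerClass(Reducer.class);       // combiner = local pre-aggregation before the shuffle
        job.setReducerClass(Reducer.class);        // identity reducer as a stand-in
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Compress the final output written back to HDFS/S3 as well.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        return job;
    }
}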

long txId = -1;
try {
    txId = cassandra.persist(entity);
    sql.insert(some);
    sql.update(someElse);
    cassandra.commit(txId);
    sql.commit();
} catch (Exception e) {
    sql.rollback();
    cassandra.rollback(txId);
}

insert into CHANGES (key, commited, data) values ('tx_id-58e0a7d7-eebc', 'false', ..);

update CHANGES set commited = 'true'
where key = 'tx_id-58e0a7d7-eebc';

delete from CHANGES
where key = 'tx_id-58e0a7d7-eebc';

Non-production setup:
• 3 nodes (Cassandra)
• m1.medium EC2 instances
• 1 data center
• 1 app instance

In numbers

Real-time metrics update (sync):
• average latency: 60 ms
• processes > 2,000 events per second
• generates > 1,000 reports per second

Real-time metrics update (async):
• processes > 15,000 events per second

Uploading to AWS S3: slow, but multi-threading helps *

It is more than enough, but what if …

• distributed systems force you to make decisions
• systems like Cassandra trade speed for consistency
• the CAP theorem is oversimplified: you have many more options
• polyglot persistence can make this world a better place
• do not try to hammer every nail with the same hammer

• Cassandra – great for time-series data and heavy-write workloads…
• …but use cases should be clearly defined
• Amazon S3 – great: simple, slow, but predictable storage
• Amazon EMR: integration with S3 is great and the API is very good, but it isn't a magic trick and requires knowledge of Hadoop and skill to use effectively

/** ayazovskiy@thumbtack.net @yazovsky www.linkedin.com/in/yazovsky

*/

/** http://www.thumbtack.net http://thumbtack.net/whitepapers

*/
