Cassandra + S3 + Hadoop = Quick Auditing and Analytics


DESCRIPTION

The Cassandra database is an excellent choice when you need scalability and high availability without compromising performance. Cassandra's linear scalability, proven fault tolerance, and tunable consistency, combined with its optimization for write traffic, make it an attractive choice for structured logging of application and transactional events. But using a columnar store like Cassandra for analytical needs poses its own problems, which we solved by careful construction of Column Families combined with diplomatic use of Hadoop. This tutorial focuses on building a similar system from scratch, showing how to perform analytical queries in near real time while still getting the benefits of Cassandra's high-performance database engine. The key subjects are:

• The splendors and miseries of NoSQL
• Apache Cassandra use cases
• Difficulties of using Map/Reduce directly on Cassandra
• Amazon cloud solutions: Elastic MapReduce and S3
• "Real-enough" time analysis


Polyglot Persistence in the Real World

Anton Yazovskiy, Thumbtack Technology

• Software Engineer at Thumbtack Technology
• an active user of various NoSQL solutions
• consulting with focus on scalability
• a significant part of my work is advising people on which solutions to use and why
• big fan of BigData and clouds

• NoSQL – not a silver bullet
• Choices that we make
• Cassandra: operational workload
• Cassandra: analytical workload
• The best of both worlds
• Some benchmarks
• Conclusions

• well-known ways to scale: scale in/out, scale by function, data denormalization
• really works
• each has disadvantages
• mostly a manual process (NewSQL)

(image: http://qsec.deviantart.com)

• solves exactly this kind of problem
• rapid application development
• aggregate
• schema flexibility
• auto-scale-out
• auto-failover

• amount of data it can handle
• shared-nothing architecture, no SPOF
• performance

• splendors and miseries of aggregate
• CAP theorem dilemma

(Diagram: the CAP theorem triangle – Consistency, Availability, Partition Tolerance)

(Diagram: analytical vs. operational workloads plotted against Consistency, Availability, Performance and Reliability)

I want it all

Apache Cassandra (released by Facebook in 2008)

• elastic scalability & linear performance *
• dynamic schema
• very high write throughput
• tunable per-request consistency
• fault-tolerant design
• multiple datacenter and cloud readiness
• CAS transaction support *

* http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-cassandra

• Large data set on commodity hardware
• Tradeoff between speed and reliability
• Heavy-write workload
• Time-series data


(Diagram: Cassandra placed on the operational side of the spectrum – Reliability and Performance – rather than the analytical side)

Small demo after this slide

(Diagram: rows of (TIMESTAMP, FIELD 1, …, DATA) with timestamps 12344567, 12326346, 13124124, 13237457 and 13627236 spread across SERVER 1 and SERVER 2)

• expensive range queries across the cluster
• unless you shard by timestamp…
• …which becomes a bottleneck for a heavy-write workload

select * from table where timestamp > 12344567 and timestamp < 13237457

• all columns are sorted by name
• row – aggregate item (never sharded)

Column Family

row key 1 | column 1: value 1.1 | column 2: value 1.2 | column 3: value 1.3 | .. | column N: value 1.N
row key 2 | column 1: value 2.1 | column 2: value 2.2 | … | column M: value 2.M

Super columns are discouraged and omitted here.
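A minimal sketch (not from the talk) of the same wide-row layout expressed as CQL through the DataStax Java driver; the keyspace "audit" and table "events" are made up, and the clustering column plays the role of the sorted column name:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class WideRowSchema {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        session.execute("CREATE KEYSPACE IF NOT EXISTS audit "
                + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
        // row_key  -> partition key (the row is an aggregate item, never sharded)
        // col_name -> clustering column (the "columns" stay sorted by name)
        session.execute("CREATE TABLE IF NOT EXISTS audit.events ("
                + "row_key text, col_name text, value text, "
                + "PRIMARY KEY (row_key, col_name))");
        cluster.close();
    }
}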

• get key
• get slice
• get range
+ combinations of these queries
+ composite columns

• all columns are sorted by name
• row – aggregate item (never sharded)

(Diagram: row keys 1–5, each holding timestamp-named columns, distributed across SERVER 1 and SERVER 2; a row is never split across servers and its columns stay sorted)

get_slice(row_key, from, to, count)

get_slice("row key 1", from: "timestamp 1", null, 11)
get_slice("row key 1", from: "timestamp 11", null, 11)   (next page)
get_slice("row key 1", null, to: "timestamp 11", 11)     (prev. page)
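The same paging pattern sketched in CQL through the DataStax Java driver, assuming the audit.events table from the earlier sketch; the page size of 11 mirrors the get_slice calls above:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SlicePaging {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("audit");

        // get_slice(row_key, from, null, 11): first page of 11 columns
        String page = "SELECT col_name, value FROM events "
                + "WHERE row_key = ? AND col_name >= ? LIMIT 11";
        String lastSeen = "timestamp 1";
        for (Row row : session.execute(page, "row key 1", lastSeen)) {
            lastSeen = row.getString("col_name");   // remember the cursor
        }

        // next page: restart the slice from the last column we saw
        for (Row row : session.execute(page, "row key 1", lastSeen)) {
            System.out.println(row.getString("col_name") + " -> " + row.getString("value"));
        }

        // previous page: slice backwards up to a known column (returned in descending order)
        String prevPage = "SELECT col_name, value FROM events "
                + "WHERE row_key = ? AND col_name <= ? ORDER BY col_name DESC LIMIT 11";
        session.execute(prevPage, "row key 1", lastSeen);

        cluster.close();
    }
}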


• Time-range with filter:
  • "get all events for User J from N to M"
  • "get all success events for User J from N to M"
  • "get all events for all users from N to M"


row key                    | column       | value
events::success::User_123  | timestamp 1  | value 1
events::success            | timestamp 1  | value 1
events::User_123           | timestamp 1  | value 1
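A sketch of the write path this key design implies, assuming the audit.events table from the earlier sketches: each event is fanned out into the three rows shown above, so every filter query becomes a single-row slice.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class FanOutWrite {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("audit");

        String user = "User_123";
        String ts = String.valueOf(System.currentTimeMillis());
        String payload = "value 1";
        String insert = "INSERT INTO events (row_key, col_name, value) VALUES (?, ?, ?)";

        // One event, three rows: each row key answers one of the filter queries
        // ("all events for user", "all success events", "all success events for user").
        session.execute(insert, "events::" + user, ts, payload);
        session.execute(insert, "events::success", ts, payload);
        session.execute(insert, "events::success::" + user, ts, payload);

        cluster.close();
    }
}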

• Counters:
  • "get # of events for User J grouped by hour"
  • "get # of events for User J grouped by day"

row key                    | 1380400000 | 1380403600
events::success::User_123  | 14         | 42
events::User_123           | 842        | 1024

(group by day – same but in different column family for TTL support)
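A sketch of the hourly counters as a CQL counter table; the table name counters_by_hour is made up, and per the note above the day-level counters would live in a second table.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class HourlyCounters {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("audit");

        // Counter table: one row per composite key, one column per hour bucket
        // (the 1380400000 / 1380403600 columns in the slide above).
        session.execute("CREATE TABLE IF NOT EXISTS counters_by_hour ("
                + "row_key text, hour_bucket bigint, count counter, "
                + "PRIMARY KEY (row_key, hour_bucket))");

        long hourBucket = (System.currentTimeMillis() / 1000 / 3600) * 3600; // epoch seconds, hour granularity
        String update = "UPDATE counters_by_hour SET count = count + 1 "
                + "WHERE row_key = ? AND hour_bucket = ?";
        session.execute(update, "events::success::User_123", hourBucket);
        session.execute(update, "events::User_123", hourBucket);

        cluster.close();
    }
}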

• the row key should consist of a combination of fields with high cardinality of values: name, id, etc. (see the sketch below)
• boolean values are a bad option on their own
• composite columns are a good option for this
• a timestamp may help to spread historical data
• otherwise, scalability will not be linear
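A hypothetical key-builder helper (not from the talk) that follows these guidelines: a high-cardinality id, a status flag only as one component of the composite key, and a coarse time bucket so historical data spreads across many rows.

public class RowKeys {
    private static final long BUCKET_SECONDS = 24 * 3600; // one row per day

    public static String eventKey(String userId, String status, long epochSeconds) {
        long bucket = (epochSeconds / BUCKET_SECONDS) * BUCKET_SECONDS;
        StringBuilder key = new StringBuilder("events");
        if (status != null) {
            key.append("::").append(status);   // e.g. "success" – never the whole key
        }
        key.append("::").append(userId);       // high-cardinality part
        key.append("::").append(bucket);       // spreads history across rows
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(eventKey("User_123", "success", 1380400000L));
        // -> events::success::User_123::1380326400
    }
}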

In theory – possible in real time:
• averages, 3-dimensional filters, group by, etc.

But:
• hard to tune the data model
• lack of aggregation options
• aggregation over historical data

“I want interactive reports”

Cassandra

“Reports could be a little bit out of date, but I want to control this delay value”

Auto update somehow

• impact on the production system, or
• higher total cost of ownership
• difficulties with scalability
• hard to support with multiple clusters

http://www.datastax.com/docs/0.7/map_reduce/hadoop_mr

http://aws.amazon.com

• Hadoop tech stack
• Automatic deployment
• Management API
• Temporal (transient) cluster
• Amazon S3 as data storage *

* copy from S3 to EMR HDFS and back
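A sketch of pushing input data to S3 with the AWS SDK for Java v1 TransferManager, which parallelizes multipart uploads (the "multi-threading helps" point in the benchmark notes later); the bucket, key, and file path are placeholders.

import java.io.File;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.Upload;

public class UploadToS3 {
    public static void main(String[] args) throws InterruptedException {
        // TransferManager splits large files into parts and uploads them in parallel;
        // credentials come from the default AWS credentials provider chain.
        TransferManager tm = new TransferManager();
        Upload upload = tm.upload("my-audit-bucket",               // placeholder bucket
                                  "input/events-2013-09-28.csv",   // placeholder key
                                  new File("/tmp/events-2013-09-28.csv"));
        upload.waitForCompletion();
        tm.shutdownNow();
    }
}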

Start a new cluster and run a job:

JobFlowInstancesConfig instances = ..;
instances.setHadoopVersion(..);
instances.setInstanceCount(dataNodeCount + 1);   // data nodes + 1 master
instances.setMasterInstanceType(..);
instances.setSlaveInstanceType(..);

RunJobFlowRequest req = ..(name, instances);
req.addSteps(new StepConfig(name, jar));

AmazonElasticMapReduce emr = ..;
emr.runJobFlow(req);

Execute a job on a running cluster:

StepConfig stepConfig = new StepConfig(name, jar);

AddJobFlowStepsRequest addReq = …;
addReq.setJobFlowId(jobFlowId);
addReq.setSteps(Arrays.asList(stepConfig));

AmazonElasticMapReduce emr = …;
emr.addJobFlowSteps(addReq);
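A fleshed-out version of the elided snippets above, as a sketch with the AWS SDK for Java v1; the cluster size, instance types, job jar, and S3 paths are all placeholders.

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class LaunchTransientCluster {
    public static void main(String[] args) {
        int dataNodeCount = 4;

        // Master + data nodes; a transient cluster terminates once its steps finish.
        JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
                .withHadoopVersion("1.0.3")              // placeholder version
                .withInstanceCount(dataNodeCount + 1)    // +1 for the master node
                .withMasterInstanceType("m1.medium")     // placeholder instance types
                .withSlaveInstanceType("m1.medium")
                .withKeepJobFlowAliveWhenNoSteps(false);

        // The job jar and its arguments live in S3; EMR copies them in for us.
        HadoopJarStepConfig jar = new HadoopJarStepConfig()
                .withJar("s3://my-audit-bucket/jobs/report-job.jar")  // placeholder
                .withArgs("s3://my-audit-bucket/input/", "s3://my-audit-bucket/output/");

        // Newer EMR releases also require .withServiceRole(..) and .withJobFlowRole(..).
        RunJobFlowRequest req = new RunJobFlowRequest("audit-report", instances)
                .withSteps(new StepConfig("build-reports", jar))
                .withLogUri("s3://my-audit-bucket/emr-logs/");        // placeholder

        AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient();
        RunJobFlowResult result = emr.runJobFlow(req);
        System.out.println("Started job flow " + result.getJobFlowId());
    }
}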

• cluster lifecycle: Long-Running or Transient
• cold start = ~20 min
• tradeoff: cluster cost vs. availability

• compression and Combiner tuning may speed up jobs a lot (see the sketch below)

• common problems for all big data processing tools: monitoring, testability, and debugging (MRUnit, local Hadoop, smaller data sets)
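A sketch of the compression and Combiner tuning mentioned above, using the Hadoop 2 MapReduce API; identity Mapper/Reducer classes stand in for the real job classes, and SnappyCodec assumes the native snappy library is available on the cluster (swap in GzipCodec otherwise).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReportJobConfig {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output so less data is shuffled between nodes
        // (these are the Hadoop 2 property names; Hadoop 1 used mapred.compress.map.output).
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.set("mapreduce.map.output.compress.codec", SnappyCodec.class.getName());

        Job job = Job.getInstance(conf, "audit-report");
        job.setMapperClass(Mapper.class);          // identity mapper as a stand-in
        job.setCombinerClass(Reducer.class);       // combiner = local pre-aggregation before the shuffle
        job.setReducerClass(Reducer.class);        // identity reducer as a stand-in
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Compress the final output written back to HDFS/S3 as well.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        return job;
    }
}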

long txId = -1;
try {
    txId = cassandra.persist(entity);
    sql.insert(some);
    sql.update(someElse);
    cassandra.commit(txId);
    sql.commit();
} catch (Exception e) {
    sql.rollback();
    cassandra.rollback(txId);
}

insert into CHANGES (key, commited, data) values ('tx_id-58e0a7d7-eebc', 'false', ..);

update CHANGES set commited = 'true'
where key = 'tx_id-58e0a7d7-eebc';

delete from CHANGES
where key = 'tx_id-58e0a7d7-eebc';

Non-production setup:
• 3 nodes (Cassandra)
• m1.medium EC2 instances
• 1 data center
• 1 app instance

In numbers

Real-time metrics update (sync):
• average latency: 60 ms
• processes > 2,000 events per second
• generates > 1,000 reports per second

Real-time metrics update (async):
• processes > 15,000 events per second

Uploading to AWS S3: slow, but multi-threading helps *

It is more than enough, but what if …

• distributed systems force you to make decisions
• systems like Cassandra trade speed for consistency
• the CAP theorem is oversimplified: you have many more options
• polyglot persistence can make this world a better place
• do not try to hammer every nail with the same hammer

• Cassandra – great for time-series data and heavy-write workloads…
• …but use cases should be clearly defined
• Amazon S3 – great: simple, slow, but predictable storage
• Amazon EMR: integration with S3 is great and the API is very good, but it isn't a magic trick and requires knowledge of Hadoop and skill to use effectively

/** ayazovskiy@thumbtack.net @yazovsky www.linkedin.com/in/yazovsky

*/

/** http://www.thumbtack.net http://thumbtack.net/whitepapers

*/
