 Brisk: Truly peer-to-peer Hadoop      srisatish.ambati AT gmail.com   DataStax/OpenJDK   @srisatish

Brisk hadoop june2011

DESCRIPTION

Brisk - Truly peer-to-peer Hadoop.

Brisk is an open-source Hadoop & Hive distribution that uses Apache Cassandra for its core services and storage. Brisk makes it possible to run Hadoop MapReduce on top of CassandraFS, an HDFS-compatible storage layer. By replacing HDFS with CassandraFS, users can run MapReduce jobs on Cassandra’s peer-to-peer, fault-tolerant and scalable architecture.

With CassandraFS all nodes are peers. Data files can be loaded through any node in the cluster and any node can serve as the JobTracker for MapReduce jobs. The Hive MetaStore is stored & accessed as just another column family (table) on the distributed data store. Brisk makes Hadoop truly peer-to-peer.

We demonstrate visualisation & monitoring of Brisk using OpsCenter. The operational simplicity of Cassandra’s multi-datacenter & multi-region aware replication makes Brisk well-suited for a rich set of applications and use cases. And by being able to store and isolate HDFS & online data within the same data cluster, Brisk makes analytics possible without ETL!

LA Scalability Talk, Mahalo, May 31, 2011

Page 1: Brisk hadoop june2011

   

Brisk: Truly peer-to-peer Hadoop

  srisatish.ambati AT gmail.com  DataStax/OpenJDK  @srisatish

Page 2: Brisk hadoop june2011

   

Brisk: Hive + Hadoop + Cassandra

@srisatish

Page 3: Brisk hadoop june2011

   

Map Reduce

@srisatish

Page 4: Brisk hadoop june2011

   

Have large sets of data & you can work on small pieces in parallel. 

@srisatish
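A minimal sketch of the idea on this slide, assuming Java 8+ streams as a stand-in for a cluster: each line is split independently (the map step) and counts for the same word are merged (the reduce step). The class name `WordCountSketch` and the sample lines are illustrative only; this is not Hadoop code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Toy in-process illustration of map and reduce; not Hadoop code.
final class WordCountSketch {

    // Map step: every line is split into words independently of the others.
    // Reduce step: occurrences of the same word are grouped and counted.
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("brisk hadoop hive", "hadoop cassandra", "hive hadoop");
        // prints {brisk=1, cassandra=1, hadoop=3, hive=2}
        System.out.println(wordCount(lines));
    }
}
```

Because no line depends on any other, the pieces can be processed in any order and on any worker, which is what makes the problem a fit for MapReduce.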

Page 5: Brisk hadoop june2011

Map Reduce

@srisatish

Page 6: Brisk hadoop june2011

   

Multi-core MapReduce framework, Kunle, et al.

@srisatish

Page 7: Brisk hadoop june2011

   

Parallel Execution View

@srisatish

Page 8: Brisk hadoop june2011

   

@srisatish

Page 9: Brisk hadoop june2011

   

@srisatish

Page 10: Brisk hadoop june2011

   

JobTracker

NameNode

HDFS

@srisatish

Page 11: Brisk hadoop june2011

   

Write-once-read-many!

A file once created, written & closed need not change.

@srisatish

Page 12: Brisk hadoop june2011

   

Move computation, not data

@srisatish
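One way to picture "move computation, not data" is a scheduler that prefers a free node already holding the input block. The sketch below is a simplified assumption, not Hadoop's actual scheduling code; `LocalitySketch`, `pickNode` and the node names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical locality-aware task placement; not Hadoop's scheduler.
final class LocalitySketch {

    // Prefer a free node that already stores the block (data-local task).
    // Otherwise fall back to the first free node, which must read remotely.
    static String pickNode(Set<String> replicaHosts, List<String> freeNodes) {
        for (String node : freeNodes) {
            if (replicaHosts.contains(node)) {
                return node;
            }
        }
        return freeNodes.isEmpty() ? null : freeNodes.get(0);
    }

    public static void main(String[] args) {
        Set<String> replicas = new HashSet<>(Arrays.asList("node2", "node3"));
        System.out.println(pickNode(replicas, Arrays.asList("node1", "node3"))); // data-local choice
        System.out.println(pickNode(replicas, Arrays.asList("node1", "node4"))); // remote fallback
    }
}
```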

Page 13: Brisk hadoop june2011

   

@srisatish

Page 14: Brisk hadoop june2011

   

DataNodes: Read, Write Blocks

@srisatish

Page 15: Brisk hadoop june2011

   

NameNode: Single Master node

Single Machine Address space

Single Point of failure

Page 16: Brisk hadoop june2011

   

Enter the Cassandra:

High Scale

Peer-to-peer

@srisatish

When “it” does not fit in a single node!… Enter the distributed dragon!

Page 17: Brisk hadoop june2011

   

NameNode

DataNodes

Page 18: Brisk hadoop june2011

   

One-kind-of-node!

Page 19: Brisk hadoop june2011

   

Cassandra:

High Scale

Peer-to-peer

@srisatish

Page 20: Brisk hadoop june2011

   

Portfolio Demo

Low latency: Live tick prices for stocks.

Batch Analytics: Historical EOD prices. Value at Risk.

http://www.datastax.com/docs/0.8/brisk/brisk_demo

Page 21: Brisk hadoop june2011

   

http://ec2-50-19-4-143.compute-1.amazonaws.com:8888/opscenter/index.html

http://ec2-67-202-12-176.compute-1.amazonaws.com:50030/jobdetails.jsp?jobid=job_201105310219_0008&refresh=30

http://ec2-50-19-4-143.compute-1.amazonaws.com:8983/portfolio/

Demo URLs (good for this demo only)

Page 22: Brisk hadoop june2011

Bigtable, 2006

Dynamo, 2007

OSS, 2008

Incubator, 2009

TLP, 2010

Page 23: Brisk hadoop june2011

   

[Figure: ring of nodes labeled A, L, T, W, F, P, Y, U; Key “C”]

Cassandra:

High Scale

Peer-to-peer

No SPOF

@srisatish
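The lettered ring on this slide can be read as a token ring: each node owns the range ending at its token, so a key goes to the first node whose token is at or after it, wrapping around at the end. The sketch below assumes the letters are literal tokens compared as strings and that extra replicas are simply the next nodes clockwise; real clusters typically hash keys first, and `TokenRingSketch` with its `node-X` names is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy token ring; letters from the slide are used directly as tokens (an assumption).
final class TokenRingSketch {
    private final TreeMap<String, String> ring = new TreeMap<>(); // token -> node name

    void addNode(String token, String node) {
        ring.put(token, node);
    }

    // Primary replica: node with the first token >= key, wrapping to the smallest token.
    // Additional replicas: the following nodes clockwise on the ring.
    List<String> replicasFor(String key, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        if (ring.isEmpty()) {
            return replicas;
        }
        String token = ring.ceilingKey(key);
        if (token == null) {
            token = ring.firstKey();
        }
        int wanted = Math.min(replicationFactor, ring.size());
        while (replicas.size() < wanted) {
            replicas.add(ring.get(token));
            token = ring.higherKey(token);
            if (token == null) {
                token = ring.firstKey();
            }
        }
        return replicas;
    }

    // Builds a ring with the eight letters shown on the slide.
    static TokenRingSketch slideRing() {
        TokenRingSketch r = new TokenRingSketch();
        for (String t : new String[] {"A", "F", "L", "P", "T", "U", "W", "Y"}) {
            r.addNode(t, "node-" + t);
        }
        return r;
    }

    public static void main(String[] args) {
        // prints [node-F, node-L, node-P]
        System.out.println(slideRing().replicasFor("C", 3));
    }
}
```

Every node runs the same lookup, so any node can route a request for key “C” without asking a master.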

Page 24: Brisk hadoop june2011

   

Page 25: Brisk hadoop june2011

   

Page 26: Brisk hadoop june2011

   

Brisk

@srisatish

Page 27: Brisk hadoop june2011

   

Brisk

HowStuffWorks version

@srisatish

Page 28: Brisk hadoop june2011

   

YDH security edition (soon to be Apache)

Apache Hive: SQL-like access

CassandraHandler

Cassandra 0.8

Page 29: Brisk hadoop june2011

   

Use ColumnFamilies

inode

sblock

@srisatish

Page 30: Brisk hadoop june2011

   

String keyspace = "cfs";

// Column family holding inodes
CfDef cf = new CfDef();
cf.setName(inodeDefaultCf);
cf.setComparator_type("BytesType");
…

// Column family holding the blocks that belong to an inode
cf.setName(sblockDefaultCf);
cf.setKey_cache_size(1000000); // written as 1M on the slide
cf.setComment("Stores blocks of information associated with a inode");
cf.setKeyspace(keyspace);

@srisatish
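To make the inode/sblock split concrete, here is a toy in-memory model, assuming an inode is just a path mapped to an ordered list of block ids and an sblock is a block id mapped to bytes. It illustrates the layout only; `ToyCassandraFs` and its methods are hypothetical and are not the CassandraFS implementation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Toy model of a file system backed by two tables; not the real CassandraFS.
final class ToyCassandraFs {
    private final Map<String, List<UUID>> inode = new HashMap<>(); // path -> ordered block ids
    private final Map<UUID, byte[]> sblock = new HashMap<>();      // block id -> block bytes
    private final int blockSize;

    ToyCassandraFs(int blockSize) {
        this.blockSize = blockSize;
    }

    // Split the data into fixed-size blocks and record their ids under the path.
    // Overwriting a path leaves the old blocks orphaned; acceptable for a sketch.
    void write(String path, byte[] data) {
        List<UUID> blockIds = new ArrayList<>();
        for (int off = 0; off < data.length; off += blockSize) {
            UUID id = UUID.randomUUID();
            sblock.put(id, Arrays.copyOfRange(data, off, Math.min(off + blockSize, data.length)));
            blockIds.add(id);
        }
        inode.put(path, blockIds);
    }

    // Look up the inode, then fetch and concatenate its blocks in order.
    byte[] read(String path) {
        List<UUID> blockIds = inode.get(path);
        if (blockIds == null) {
            throw new IllegalArgumentException("no such file: " + path);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (UUID id : blockIds) {
            byte[] block = sblock.get(id);
            out.write(block, 0, block.length);
        }
        return out.toByteArray();
    }

    int blockCount(String path) {
        return inode.get(path).size();
    }

    public static void main(String[] args) {
        ToyCassandraFs fs = new ToyCassandraFs(4);
        fs.write("/demo.txt", "hello brisk".getBytes(StandardCharsets.UTF_8));
        System.out.println(fs.blockCount("/demo.txt")); // 11 bytes in 4-byte blocks: prints 3
        System.out.println(new String(fs.read("/demo.txt"), StandardCharsets.UTF_8));
    }
}
```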

Page 31: Brisk hadoop june2011

   

Consistency: R + W > N

"brisk.consistencylevel.read", "QUORUM";

"brisk.consistencylevel.write", "QUORUM";

@srisatish
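A small worked check of the R + W > N rule, assuming N is the replication factor and a quorum is N / 2 + 1 replicas (integer division). With QUORUM for both reads and writes, every read set overlaps every write set in at least one replica. The class and method names are illustrative.

```java
// Worked check of the overlap rule for read and write replica counts.
final class QuorumSketch {

    // Quorum size for a given replication factor: a strict majority.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // True when any read set of size r must intersect any write set of size w.
    static boolean overlaps(int r, int w, int n) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3;
        int q = quorum(n); // 2 of 3 replicas
        System.out.println("QUORUM/QUORUM overlaps: " + overlaps(q, q, n)); // 2 + 2 > 3
        System.out.println("ONE/ONE overlaps: " + overlaps(1, 1, n));       // 1 + 1 > 3 is false
    }
}
```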

Page 32: Brisk hadoop june2011

   

Hadoop: job tracker, task tracker

@srisatish

Page 33: Brisk hadoop june2011

   

BriskSnitch: brisk nodes, cassandra nodes

@srisatish

Page 34: Brisk hadoop june2011

   

BriskSimpleSnitch.java

if (TrackerInitializer.isTrackerNode)
{
    myDC = BRISK_DC;
    logger.info("Detected Hadoop trackers are enabled, setting my DC to " + myDC);
}
else
{
    myDC = CASSANDRA_DC;
    logger.info("Looks like Vanilla Cassandra nodes, setting my DC to " + myDC);
}

@srisatish

Page 35: Brisk hadoop june2011

   

Hive: SQL-like access

cli, hwi, jdbc, metastore

Pushdown predicates (v beta2)

@srisatish

Page 36: Brisk hadoop june2011

   

hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

hive> LOAD DATA LOCAL INPATH '$BRISK_HOME/resources/hive/examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

hive> SELECT count(*), ds FROM invites GROUP BY ds;

http://www.datastax.com/docs/0.8/brisk/about_hive

@srisatish

Page 37: Brisk hadoop june2011

   

ETL

Real-time

Cassandra CFs

DataCenters

Scale

@srisatish

Page 38: Brisk hadoop june2011

   

@srisatish

Page 39: Brisk hadoop june2011

   

No me in team!

● Ben Coverston

● Ben Werther

● Brandon Williams

● Cathy Daw

● Daria Hutchinson

● Jackson Chung

● Jake Luciani

● Joaquin Casares

● Jonathan Ellis

● Michael Allen

● Mike Bulman

● Michael Weir

● Nate McCall

● Nick M Bailey

● Patricio Echague

● Tyler Hobbs

● SriSatish Ambati

● Yewei Zhang

@srisatish

Page 40: Brisk hadoop june2011

   

100-node Brisk Cluster on OpsCenter

@srisatish

Page 41: Brisk hadoop june2011

   

[Diagram: Bigtable, 2006; Dynamo, 2007; Cassandra (OSS, 2008; Incubator, 2009; TLP, 2010); Brisk]

Page 42: Brisk hadoop june2011

   

git clone [email protected]:riptano/brisk.git

http://www.datastax.com/product/brisk

Getting Started via Brisk AMI.

Mahalo. Thank You.

@srisatish

Page 43: Brisk hadoop june2011

   

References

● MapReduce: Simplified Data Processing on Large Clusters, 2004, Jeffrey Dean and Sanjay Ghemawat, http://bit.ly/googmr_pdf

● Multi-core MapReduce, Kunle, et al. http://bit.ly/iRJd1n

@srisatish