Brisk hadoop june2011


DESCRIPTION

Brisk: truly peer-to-peer Hadoop.

Brisk is an open-source Hadoop & Hive distribution that uses Apache Cassandra for its core services and storage. Brisk makes it possible to run Hadoop MapReduce on top of CassandraFS, an HDFS-compatible storage layer. By replacing HDFS with CassandraFS, users run MapReduce jobs on Cassandra’s peer-to-peer, fault-tolerant and scalable architecture.

With CassandraFS all nodes are peers. Data files can be loaded through any node in the cluster, and any node can serve as the JobTracker for MapReduce jobs. The Hive MetaStore is stored & accessed as just another column family (table) on the distributed data store. Brisk makes Hadoop truly peer-to-peer.

We demonstrate visualisation & monitoring of Brisk using OpsCenter. The operational simplicity of Cassandra’s multi-datacenter & multi-region aware replication makes Brisk well suited to a rich set of applications and use cases. And because HDFS & online data can be stored and isolated within the same data cluster, Brisk makes analytics possible without ETL!

LA Scalability Talk, Mahalo. May 31, 2011.


   

Brisk: Truly peer-to-peer Hadoop

  srisatish.ambati AT gmail.com  DataStax/OpenJDK  @srisatish

   

Brisk: Hive + Hadoop + Cassandra


   

Map Reduce


   

Have large sets of data & you can work on small pieces in parallel. 
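
The idea on this slide can be sketched in a few lines of plain Java. This is a toy word count, not Hadoop code: the class name `WordCountSketch` and its methods are made up for illustration, and a parallel stream stands in for a cluster of workers. Each chunk is mapped independently, then the partial counts are reduced by key.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration class; not part of Hadoop or Brisk.
final class WordCountSketch {

    // Map phase: count the words of ONE chunk, independently of all other chunks.
    static Map<String, Long> mapChunk(String chunk) {
        return Arrays.stream(chunk.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    // Reduce phase: merge two partial results by summing counts per word.
    // A fresh map is returned so the inputs are never mutated.
    static Map<String, Long> reduce(Map<String, Long> left, Map<String, Long> right) {
        Map<String, Long> merged = new TreeMap<>(left);
        right.forEach((word, n) -> merged.merge(word, n, Long::sum));
        return merged;
    }

    // Small pieces are processed in parallel, then combined.
    static Map<String, Long> wordCount(List<String> chunks) {
        return chunks.parallelStream()
                .map(WordCountSketch::mapChunk)
                .reduce(new TreeMap<>(), WordCountSketch::reduce);
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("brisk hadoop hive", "hadoop cassandra", "hive hadoop");
        System.out.println(wordCount(chunks));
    }
}
```

Because the reduce step is associative and never mutates its inputs, the chunks can be processed in any order and on any number of workers.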


Map Reduce

   

Multi-core map reduce framework, Kunle, et al

   

Parallel Execution View

   


   

JobTracker

NameNode

HDFS

   

Write-once-read-many!

A file once created, written & closed need not be changed.

   

Move computation, not data
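
A minimal sketch of what "move computation, not data" means for a scheduler, assuming we already know which nodes hold a replica of the input block. The class `LocalitySketch`, the method `pickNode` and the node names are hypothetical; the real Hadoop scheduler is considerably more involved.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not the Hadoop JobTracker.
final class LocalitySketch {

    // Choose where to run a task: prefer an idle node that already stores the block.
    static String pickNode(List<String> replicaHolders, Set<String> idleNodes) {
        for (String node : replicaHolders) {
            if (idleNodes.contains(node)) {
                return node; // data-local: the block does not cross the network
            }
        }
        // No replica holder is idle: fall back to any idle node (block must be shipped).
        return idleNodes.isEmpty() ? null : idleNodes.iterator().next();
    }

    public static void main(String[] args) {
        Set<String> idle = new LinkedHashSet<>(Arrays.asList("n1", "n3"));
        // The block lives on n2 and n3; n3 is idle, so the task goes to n3.
        System.out.println(pickNode(Arrays.asList("n2", "n3"), idle));
    }
}
```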


   

DataNodes: Read, Write Blocks


   

NameNode:

Single Master node

Single Machine Address space

Single Point of failure

   

Enter the Cassandra:

High Scale

Peer-to-peer

When “it” does not fit in a single node!… Enter the distributed dragon!

   

NameNode

DataNodes

   

One-kind-of-node!

   

Cassandra:

High Scale

Peer-to-peer

   

Portfolio Demo

Low latency: Live tick prices for stocks.

Batch Analytics: Historical EOD prices. Value at Risk.

http://www.datastax.com/docs/0.8/brisk/brisk_demo

   

Demo URLs (good for this demo only)

http://ec2-50-19-4-143.compute-1.amazonaws.com:8888/opscenter/index.html

http://ec2-67-202-12-176.compute-1.amazonaws.com:50030/jobdetails.jsp?jobid=job_201105310219_0008&refresh=30

http://ec2-50-19-4-143.compute-1.amazonaws.com:8983/portfolio/

Bigtable, 2006

Dynamo, 2007

OSS, 2008

Incubator, 2009

TLP, 2010

   

[Figure: Cassandra ring with nodes A, F, L, P, T, U, W, Y; Key “C” placed on the ring]

Cassandra:

High Scale

Peer-to-peer

No SPOF
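
A minimal sketch of how a key such as “C” finds its node on a ring with no master. It assumes each node owns a single letter token and that keys are compared as plain strings; production clusters commonly hash the key first (for example with RandomPartitioner). The class `RingSketch` and the `node-X` names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not Cassandra's partitioner or token metadata.
final class RingSketch {
    // token -> node name, kept sorted so we can walk the ring clockwise
    private final TreeMap<String, String> ring = new TreeMap<>();

    void addNode(String token, String nodeName) {
        ring.put(token, nodeName);
    }

    // A key belongs to the first node whose token is >= the key,
    // wrapping around to the smallest token at the end of the ring.
    String ownerOf(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<String, String> entry = ring.ceilingEntry(key);
        if (entry == null) {
            entry = ring.firstEntry();
        }
        return entry.getValue();
    }

    // Ring using the node letters from the slide's diagram.
    static RingSketch demoRing() {
        RingSketch r = new RingSketch();
        for (String t : new String[] {"A", "F", "L", "P", "T", "U", "W", "Y"}) {
            r.addNode(t, "node-" + t);
        }
        return r;
    }

    public static void main(String[] args) {
        System.out.println(demoRing().ownerOf("C")); // prints node-F
    }
}
```

Any node can run this lookup locally, which is why no single master is needed to locate data.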


   

   

   

Brisk


   

Brisk: HowStuffWorks version

   

YDH security edition (soon to be Apache)

Apache Hive – SQL-like access

CassandraHandler

Cassandra 0.8

   

Use ColumnFamilies: inode, sblock

   

String keyspace = "cfs";

// Column family for inodes
CfDef cf = new CfDef();
cf.setName(inodeDefaultCf);
cf.setComparator_type("BytesType");
…

// Column family for sblocks
cf.setName(sblockDefaultCf);
cf.setKey_cache_size(1000000);  // "1M" on the slide; Java needs a numeric literal
cf.setComment("Stores blocks of information associated with an inode");
cf.setKeyspace(keyspace);
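
To make the inode/sblock split concrete, here is an in-memory model, assuming a file is cut into fixed-size blocks and the inode only records the ordered block ids. Two hash maps stand in for the two column families. The class `InodeBlockSketch`, its methods and the block size are hypothetical; the actual CassandraFS schema is not shown on the slides.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical in-memory model; not the CassandraFS implementation.
final class InodeBlockSketch {
    private final Map<String, List<UUID>> inodes = new HashMap<>(); // stand-in for the inode CF
    private final Map<UUID, byte[]> sblocks = new HashMap<>();      // stand-in for the sblock CF
    private final int blockSize;

    InodeBlockSketch(int blockSize) {
        this.blockSize = blockSize;
    }

    // Split the data into blocks, store each block under its own id,
    // and record the ordered id list in the file's inode.
    void write(String path, byte[] data) {
        List<UUID> blockIds = new ArrayList<>();
        for (int off = 0; off < data.length; off += blockSize) {
            int end = Math.min(off + blockSize, data.length);
            UUID id = UUID.randomUUID();
            sblocks.put(id, Arrays.copyOfRange(data, off, end));
            blockIds.add(id);
        }
        inodes.put(path, blockIds);
    }

    // Look up the inode, then fetch and concatenate its blocks in order.
    byte[] read(String path) {
        List<UUID> ids = inodes.get(path);
        if (ids == null) {
            throw new IllegalArgumentException("no such file: " + path);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (UUID id : ids) {
            byte[] block = sblocks.get(id);
            out.write(block, 0, block.length);
        }
        return out.toByteArray();
    }

    int blockCount(String path) {
        List<UUID> ids = inodes.get(path);
        return ids == null ? 0 : ids.size();
    }
}
```

Since every block is an ordinary row, the blocks of one file can land on different nodes of the ring and be read through any of them.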


   

Consistency: R + W > N

"brisk.consistencylevel.read", "QUORUM";

"brisk.consistencylevel.write", "QUORUM";
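
Why QUORUM on both sides satisfies R + W > N can be checked with a little arithmetic. The sketch below assumes N is the replication factor and a quorum is floor(N/2) + 1 replicas; the class `QuorumSketch` and its method names are made up for illustration.

```java
// Hypothetical illustration class; plain arithmetic, no Cassandra API involved.
final class QuorumSketch {

    // Number of replicas in a quorum for replication factor n.
    static int quorum(int n) {
        return n / 2 + 1;
    }

    // True when every read set must overlap every write set in at least one replica.
    static boolean readsSeeLatestWrite(int r, int w, int n) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3;
        int q = quorum(n); // 2 of 3 replicas
        System.out.println("QUORUM/QUORUM overlaps: " + readsSeeLatestWrite(q, q, n)); // true: 2 + 2 > 3
        System.out.println("ONE/ONE overlaps: " + readsSeeLatestWrite(1, 1, n));       // false: 1 + 1 <= 3
    }
}
```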


   

Hadoop: job tracker, task tracker


   

BriskSnitch: brisk nodes, cassandra nodes


   

BriskSimpleSnitch.java

if (TrackerInitializer.isTrackerNode)
{
    myDC = BRISK_DC;
    logger.info("Detected Hadoop trackers are enabled, setting my DC to " + myDC);
}
else
{
    myDC = CASSANDRA_DC;
    logger.info("Looks like Vanilla Cassandra nodes, setting my DC to " + myDC);
}

   

Hive: SQL-like access

cli, hwi, jdbc, metastore

Pushdown predicates (v beta2)

   

hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

hive> LOAD DATA LOCAL INPATH '$BRISK_HOME/resources/hive/examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

hive> SELECT count(*), ds FROM invites GROUP BY ds;

http://www.datastax.com/docs/0.8/brisk/about_hive

   

ETL

Real-time

Cassandra CFs

DataCenters

Scale

   


   

No me in team!

● Ben Coverston

● Ben Werther

● Brandon Williams

● Cathy Daw

● Daria Hutchinson

● Jackson Chung

● Jake Luciani

● Joaquin Casares

● Jonathan Ellis

● Michael Allen

● Mike Bulman

● Michael Weir

● Nate McCall

● Nick M Bailey

● Patricio Echague

● Tyler Hobbs

● SriSatish Ambati

● Yewei Zhang


   

100-node Brisk Cluster on OpsCenter

   

[Figure: Cassandra lineage (Bigtable, 2006; Dynamo, 2007; OSS, 2008; Incubator, 2009; TLP, 2010) combined, via “+” signs, into Brisk]

   

git clone git@github.com:riptano/brisk.git

http://www.datastax.com/product/brisk

Getting Started via Brisk AMI.

Mahalo. Thank You.

   

References

● MapReduce: Simplified Data Processing on Large Clusters, 2004, Jeffrey Dean and Sanjay Ghemawat, http://bit.ly/googmr_pdf

● Multi-core MapReduce, Kunle, et al. http://bit.ly/iRJd1n