srisatish-ambati
DESCRIPTION
Brisk - Truly peer-to-peer Hadoop.
Brisk is an open-source Hadoop & Hive distribution that uses Apache Cassandra for its core services and storage. Brisk makes it possible to run Hadoop MapReduce on top of CassandraFS, an HDFS-compatible storage layer. By replacing HDFS with CassandraFS, users can run MapReduce jobs on Cassandra’s peer-to-peer, fault-tolerant and scalable architecture.
With CassandraFS all nodes are peers. Data files can be loaded through any node in the cluster and any node can serve as the JobTracker for MapReduce jobs. The Hive MetaStore is stored & accessed as just another column family (table) on the distributed data store. Brisk makes Hadoop truly peer-to-peer.
We demonstrate visualisation & monitoring of Brisk using OpsCenter. The operational simplicity of Cassandra’s multi-datacenter & multi-region aware replication makes Brisk well-suited for a rich set of applications and use cases. And by being able to store and isolate HDFS & online data within the same data cluster, Brisk makes analytics possible without ETL!
LA Scalability Talk, Mahalo
May 31, 2011
Brisk: Truly peer-to-peer Hadoop
srisatish.ambati AT gmail.com DataStax/OpenJDK @srisatish
Brisk: Hive + Hadoop + Cassandra
Map Reduce
You have large sets of data, and you can work on small pieces in parallel.
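To make the "small pieces in parallel" idea concrete, here is a minimal in-memory sketch of a word count. It is not Hadoop API code: the class `WordCountSketch` and its methods are hypothetical names chosen for illustration, and Java's parallel streams stand in for a cluster of workers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: an in-memory word count, not Hadoop API code.
class WordCountSketch {

    // Map phase: emit one (word, 1) pair per word found in a single input piece.
    static List<Map.Entry<String, Integer>> map(String piece) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : piece.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum the counts emitted for each distinct word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Run the map phase over every piece in parallel, then reduce the combined output.
    static Map<String, Integer> wordCount(List<String> pieces) {
        List<Map.Entry<String, Integer>> emitted = pieces.parallelStream()
                .flatMap(piece -> map(piece).stream())
                .collect(Collectors.toList());
        return reduce(emitted);
    }

    public static void main(String[] args) {
        List<String> pieces = List.of("brisk runs hadoop", "hadoop on cassandra", "brisk on cassandra");
        // prints {brisk=2, cassandra=2, hadoop=2, on=2, runs=1}
        System.out.println(wordCount(pieces));
    }
}
```

Each piece is mapped independently, which is what lets the work spread across cores or nodes; only the reduce step needs to see pairs that share a key.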
Map Reduce
Multicore map reduce framework, Kunle, et al
Parallel Execution View
JobTracker
NameNode
HDFS
Write-once-read-many!
A file once created, written & closed need not change
Move computation, not data
DataNodes: Read, Write Blocks
NameNode:
Single Master node
Single Machine Address space
Single Point of failure
Enter the Cassandra:
High Scale
Peer-to-peer
When “it” does not fit in a single node!… Enter the distributed dragon!
NameNode
DataNodes
One kind of node!
Cassandra:
High Scale
Peer-to-peer
Portfolio Demo
Low latency: Live tick prices for stocks.
Batch Analytics: Historical EOD prices. Value at Risk.
http://www.datastax.com/docs/0.8/brisk/brisk_demo
http://ec250194143.compute1.amazonaws.com:8888/opscenter/index.html
http://ec26720212176.compute1.amazonaws.com:50030/jobdetails.jsp?jobid=job_201105310219_0008&refresh=30
http://ec250194143.compute1.amazonaws.com:8983/portfolio/
Demo URLs (good for this demo only)
Bigtable, 2006
Dynamo, 2007
OSS, 2008
Incubator, 2009
TLP, 2010
[Figure: Cassandra ring of lettered nodes (A, F, L, P, T, U, W, Y), showing where key “C” is placed]
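A ring like the one in the figure can be sketched in a few lines. This is a simplified illustration, not Cassandra code: `TokenRingSketch` is a hypothetical class, node tokens are compared as plain strings (in the spirit of an order-preserving partitioner, whereas real clusters usually hash keys first), and replicas are simply the next nodes clockwise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative only: key placement on a token ring with lettered nodes.
class TokenRingSketch {
    // Sorted node tokens; the sort order defines the clockwise direction.
    private final TreeSet<String> tokens = new TreeSet<>();

    TokenRingSketch(List<String> nodeTokens) {
        tokens.addAll(nodeTokens);
    }

    // The primary node is the first token at or after the key, wrapping to the lowest token.
    String primaryFor(String key) {
        String owner = tokens.ceiling(key);
        return owner != null ? owner : tokens.first();
    }

    // Replicas are the primary plus the next distinct nodes clockwise (assumed placement rule).
    List<String> replicasFor(String key, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        int wanted = Math.min(replicationFactor, tokens.size());
        String current = primaryFor(key);
        while (replicas.size() < wanted) {
            replicas.add(current);
            String next = tokens.higher(current);
            current = next != null ? next : tokens.first();
        }
        return replicas;
    }

    public static void main(String[] args) {
        TokenRingSketch ring = new TokenRingSketch(List.of("A", "F", "L", "P", "T", "U", "W", "Y"));
        System.out.println(ring.replicasFor("C", 3)); // prints [F, L, P]
        System.out.println(ring.primaryFor("Z"));     // prints A (wraps around the ring)
    }
}
```

Because every node runs the same lookup, any node can route a request for key "C" without consulting a master.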
Cassandra:
High Scale
Peer-to-peer
No SPOF
Brisk
Brisk: HowStuffWorks version
YDH security edition (soon to be Apache)
Apache Hive: SQL-like access
CassandraHandler
Cassandra 0.8
Use ColumnFamilies: inode, sblock
String keyspace = "cfs";
CfDef cf = new CfDef();
cf.setName(inodeDefaultCf);
cf.setComparator_type("BytesType");
…
cf.setName(sblockDefaultCf);
cf.setKey_cache_size(1000000); // written as "1M" on the slide
cf.setComment("Stores blocks of information associated with an inode");
cf.setKeyspace(keyspace);
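The split between an inode column family and an sblock column family can be pictured with a toy model. This is a sketch under stated assumptions, not Brisk's implementation: `BlockStoreSketch`, its block-id scheme, and the two in-memory maps standing in for column families are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Illustrative only: two maps stand in for the inode and sblock column families.
class BlockStoreSketch {
    private final Map<String, List<String>> inode = new HashMap<>(); // path -> ordered block ids
    private final Map<String, byte[]> sblock = new HashMap<>();      // block id -> block bytes
    private final int blockSize;

    BlockStoreSketch(int blockSize) {
        this.blockSize = blockSize;
    }

    // Split the file into fixed-size blocks, store each block, then record the block list.
    void write(String path, byte[] data) {
        List<String> blockIds = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += blockSize) {
            int end = Math.min(offset + blockSize, data.length);
            String blockId = path + "#" + blockIds.size(); // hypothetical id scheme
            sblock.put(blockId, Arrays.copyOfRange(data, offset, end));
            blockIds.add(blockId);
        }
        inode.put(path, blockIds);
    }

    // Look up the block list for the path and concatenate the blocks in order.
    byte[] read(String path) {
        List<String> blockIds = inode.get(path);
        if (blockIds == null) {
            throw new NoSuchElementException(path);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String blockId : blockIds) {
            byte[] block = sblock.get(blockId);
            out.write(block, 0, block.length);
        }
        return out.toByteArray();
    }

    // Number of blocks recorded for a path, or 0 if the path is unknown.
    int blockCount(String path) {
        List<String> blockIds = inode.get(path);
        return blockIds == null ? 0 : blockIds.size();
    }

    public static void main(String[] args) {
        BlockStoreSketch fs = new BlockStoreSketch(4);
        fs.write("/demo.txt", "hello brisk".getBytes());
        System.out.println(fs.blockCount("/demo.txt"));       // prints 3
        System.out.println(new String(fs.read("/demo.txt"))); // prints hello brisk
    }
}
```

Keeping metadata and block data in separate stores means both can be distributed and replicated like any other rows, with no dedicated NameNode.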
Consistency: R + W > N
"brisk.consistencylevel.read", "QUORUM";
"brisk.consistencylevel.write", "QUORUM";
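The R + W > N rule is plain arithmetic: if the replicas read (R) plus the replicas written (W) exceed the replication factor (N), every read set shares at least one replica with the latest write set. A small worked sketch, where `QuorumSketch` and its method names are hypothetical helpers rather than Brisk or Cassandra API:

```java
// Illustrative only: arithmetic behind the R + W > N consistency rule.
class QuorumSketch {

    // A quorum is a strict majority of the N replicas.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Read and write replica sets must overlap when R + W > N.
    static boolean overlaps(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int n = 3;
        int q = quorum(n);
        System.out.println(q);                 // prints 2
        System.out.println(overlaps(q, q, n)); // prints true:  QUORUM reads + QUORUM writes, 2 + 2 > 3
        System.out.println(overlaps(1, 1, n)); // prints false: ONE read + ONE write, 1 + 1 <= 3
    }
}
```

With N = 3, QUORUM for both reads and writes satisfies the rule while still tolerating one unavailable replica.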
Hadoop: job tracker, task tracker
BriskSnitch: brisk nodes, cassandra nodes
BriskSimpleSnitch.java
if (TrackerInitializer.isTrackerNode) {
    myDC = BRISK_DC;
    logger.info("Detected Hadoop trackers are enabled, setting my DC to " + myDC);
} else {
    myDC = CASSANDRA_DC;
    logger.info("Looks like Vanilla Cassandra nodes, setting my DC to " + myDC);
}
Hive: SQL-like access
cli, hwi, jdbc, metastore
Pushdown predicates (v beta2)
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
hive> LOAD DATA LOCAL INPATH '$BRISK_HOME/resources/hive/examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='20080815');
hive> SELECT count(*), ds FROM invites GROUP BY ds;
http://www.datastax.com/docs/0.8/brisk/about_hive
ETL
Realtime
Cassandra CFs
DataCenters
Scale
No me in team!
● Ben Coverston
● Ben Werther
● Brandon Williams
● Cathy Daw
● Daria Hutchinson
● Jackson Chung
● Jake Luciani
● Joaquin Casares
● Jonathan Ellis
● Michael Allen
● Mike Bulman
● Michael Weir
● Nate McCall
● Nick M Bailey
● Patricio Echague
● Tyler Hobbs
● SriSatish Ambati
● Yewei Zhang
100-node Brisk Cluster on OpsCenter
Cassandra: Bigtable, 2006; Dynamo, 2007; OSS, 2008; Incubator, 2009; TLP, 2010
Brisk
git clone [email protected]:riptano/brisk.git
http://www.datastax.com/product/brisk
Getting Started via Brisk AMI.
Mahalo. Thank You.
References
● MapReduce: Simplified Data Processing on Large Clusters, 2004, Jeffrey Dean and Sanjay Ghemawat, http://bit.ly/googmr_pdf
● Multicore MapReduce, Kunle, et al. http://bit.ly/iRJd1n