DESCRIPTION
Big Data Camp LA 2014, Don't re-invent the Big-Data Wheel, Building real-time, Big Data applications on Cassandra with the open-source Kiji project by Clint Kelly of Wibidata
Don’t Reinvent the Big-Data Wheel!
Clint Kelly - @clintwkelly, WibiData
Building real-time, Big Data applications on Cassandra with the open-source Kiji project
Big Data Camp LA, 14 June 2014
Agenda
The problem
How Kiji works
Kiji in production
Kiji on Cassandra
The problem.
[Diagram, built up over many slides: a Big Data application assembled from open-source software. Stages: Data in (REST), Inspect, Train (“Trained model”), Model, Score (batch), Data out (REST), Experiments / Deployment]
3
Data in / out (REST)
Inspect and train
Score (real-time)
Kiji
How Kiji works
Kiji History
How does it work?
[Diagram, built up over many slides: Kiji sits between Engineering (channels) and Data Science, with KijiSchema (Cassandra) at the center storing one row each for User 1, User 2, User 3]
Write: logs and DBs load through KijiMR; channels stream through KijiREST
Read: channels read through KijiREST
Query: Data Science works on the data with KijiHive, KijiExpress, and KijiMR
Score: Scorer, Kiji Model Repository, KijiScoring; each user row shows cells marked “C” and, by the end of the build, cells marked “R”
3
Data in / out: KijiREST, KijiMR
Inspect and train: KijiHive, KijiMR, KijiExpress
Score (real-time): KijiModelRepository, KijiScoring
Modular
Kiji in production
In production now
Fortune 500 retailer: Personalized recommendations
Opower: Energy usage and analytics reporting
Fortune 500 retailer
Serving personalized recommendations
[Diagram, bulk load: logs and DBs from Engineering channels are written into Kiji through KijiMR]
[Diagram, train: Data Science trains on the data in the KijiSchema (Cassandra) rows for User 1, User 2, User 3 using KijiExpress and KijiMR]
[Diagram, score: Engineering channels stream writes and reads through KijiREST against KijiSchema (Cassandra), with the Scorer, Kiji Model Repository, and KijiScoring on the Data Science side]
Kiji on Cassandra
[Diagram: KijiSchema on Cassandra; KijiSchema on HBase]
Kiji ~ BigTable
[Diagram: a table is a collection of rows]
Row key = entity ID
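[Diagram: each row pairs an entity ID with its data]
Composite entity IDs
[Diagram: composite entity ID (0xfa, “bob”) with its data]
Column families
[Diagram: row (0xfa, “bob”) with column families payment, interactions, recommendations]
Columns
[Diagram: row (0xfa, “bob”) with columns inter:clicks, inter:search, payment:cardnum, payment:address, rec:scorer1, rec:scorer2]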
Timestamped versions
[Diagram: row (0xfa, “bob”) in which a column can hold several versions, each tagged with a timestamp (1396560123, 1395650231); songs:let it be and rec:scorer3 appear as stacks of versions alongside inter:search, inter:clicks, payment:cardnum, payment:address, rec:scorer1, rec:scorer2]
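The row, column, and version structure above can be pictured as a nested map. The following is a minimal in-memory sketch only, not Kiji's API: the `Row` class, its method names, and the cell values (`click-a`, `click-b`) are hypothetical, while the entity ID, column name, and timestamps are taken from the slides.

```python
from collections import defaultdict


class Row:
    """Toy row: (family, qualifier) -> {timestamp: value}. Hypothetical, not Kiji's API."""

    def __init__(self, entity_id):
        self.entity_id = entity_id
        self._cells = defaultdict(dict)

    def put(self, family, qualifier, timestamp, value):
        # Writing at a new timestamp adds a version; it does not overwrite older ones.
        self._cells[(family, qualifier)][timestamp] = value

    def get(self, family, qualifier, max_versions=1):
        """Return up to max_versions (timestamp, value) pairs, newest first."""
        versions = self._cells.get((family, qualifier), {})
        newest_first = sorted(versions.items(), reverse=True)
        return newest_first[:max_versions]


# Composite entity ID as shown on the slide; the values are made up.
row = Row(entity_id=(0xFA, "bob"))
row.put("inter", "clicks", 1395650231, "click-a")
row.put("inter", "clicks", 1396560123, "click-b")

latest = row.get("inter", "clicks")
history = row.get("inter", "clicks", max_versions=2)
```

Here `latest` holds only the newest cell, while `history` returns both versions ordered from newest to oldest.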
Complex data types
record Search {
  string search_term;
  long session_id;
  device_type device;
}
Locality group
[Diagram: column families are tagged Batch or Real-time and sorted into two locality groups]
locality_group_batch: On disk. Compressed.
locality_group_real_time: In memory.
Row ➔ transactional consistency
Locality group ➔ Column family
CREATE TABLE loc_grp
Entity ID ➔ Primary key
CREATE TABLE loc_grp (city text, user text,
PRIMARY KEY (city, user) )
WITH CLUSTERING ORDER BY (user ASC);
Family, Qualifier, Version ➔ Clustering Columns
CREATE TABLE loc_grp (city text, user text,
family text, qualifier text, version bigint,
PRIMARY KEY (city, user, family, qualifier, version) )
WITH CLUSTERING ORDER BY (user ASC, family ASC, qualifier ASC, version DESC);
Column values ➔ Blobs
CREATE TABLE loc_grp (city text, user text,
family text, qualifier text, version bigint, value blob,
PRIMARY KEY (city, user, family, qualifier, version) )
WITH CLUSTERING ORDER BY (user ASC, family ASC, qualifier ASC, version DESC);
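To see what the clustering order in the table definition implies for reads, here is a small sketch that emulates it in plain Python for the cells of a single (city, user) entity. It does not talk to Cassandra; the `Cell` type, the helper names, and the blob values are hypothetical, and only the ordering rule (family ASC, qualifier ASC, version DESC) comes from the slide.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    family: str
    qualifier: str
    version: int
    value: bytes


def clustering_key(cell):
    # family ASC, qualifier ASC, version DESC (negating the integer timestamp).
    return (cell.family, cell.qualifier, -cell.version)


def latest_version(cells, family, qualifier):
    """Return the first cell in clustering order for a column, i.e. its newest version."""
    for cell in sorted(cells, key=clustering_key):
        if cell.family == family and cell.qualifier == qualifier:
            return cell
    return None


# Hypothetical cells for one entity, deliberately out of order.
partition = [
    Cell("rec", "scorer1", 1395650231, b"old"),
    Cell("inter", "clicks", 1396560123, b"c2"),
    Cell("rec", "scorer1", 1396560123, b"new"),
    Cell("inter", "clicks", 1395650231, b"c1"),
]
ordered = sorted(partition, key=clustering_key)
```

Because versions sort in descending order, the most recent value of a column is always the first cell encountered for that family and qualifier, which is the common case for real-time reads.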
Implementation notes
DataStax Java driver
Cassandra 2.0.6
Async API
New MapReduce InputFormat
Issues
Operations across locality groups
Kiji locality group ➔ C* column family
Read across locality groups ➔ multiple C* reads (async API!)
Compare-and-set across locality groups ➔ not allowed in C* Kiji
Lose transactional consistency
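A read that spans locality groups turns into one read per Cassandra column family, which is where an asynchronous API helps. The sketch below illustrates the pattern with Python's `asyncio` rather than the DataStax Java driver the talk uses; the in-memory `TABLES` dictionary, the function names, and the cell values are all hypothetical stand-ins.

```python
import asyncio

# Hypothetical stand-in for two Cassandra tables, one per locality group.
TABLES = {
    "locality_group_batch": {("bob", "payment", "address"): b"addr"},
    "locality_group_real_time": {("bob", "inter", "clicks"): b"clicks"},
}


async def read_locality_group(table, user):
    """Pretend to issue one asynchronous read against one table."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {
        (family, qualifier): value
        for (row_user, family, qualifier), value in TABLES[table].items()
        if row_user == user
    }


async def read_row(user, tables):
    """Issue one read per locality group concurrently, then merge the results."""
    partials = await asyncio.gather(
        *(read_locality_group(table, user) for table in tables)
    )
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged


row = asyncio.run(read_row("bob", list(TABLES)))
```

The reads overlap instead of running back to back, but they are still independent requests, so nothing here restores the row-level transactional consistency noted above.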
Filters
HBase ➔ Rich server-side filters
Cassandra ➔ WHERE clauses
Client-side filtering
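As a rough illustration of client-side filtering: the query narrows the result as far as a WHERE clause allows, and the remaining predicate runs in the client after the data arrives. This is a sketch under assumptions, not Kiji's filter API; the function names and the in-memory `store` are hypothetical, and “hey jude” is a made-up qualifier added for contrast.

```python
def fetch_cells(store, family):
    """Stand-in for a query whose WHERE clause restricts only the family."""
    return [cell for cell in store if cell["family"] == family]


def client_side_filter(cells, predicate):
    """Apply a filter the server cannot express, after the data has been fetched."""
    return [cell for cell in cells if predicate(cell)]


# Hypothetical cells; only "songs:let it be" and "inter:clicks" appear on the slides.
store = [
    {"family": "songs", "qualifier": "let it be", "version": 1396560123},
    {"family": "songs", "qualifier": "hey jude", "version": 1395650231},
    {"family": "inter", "qualifier": "clicks", "version": 1396560123},
]

songs = fetch_cells(store, "songs")
matches = client_side_filter(songs, lambda c: c["qualifier"].startswith("let"))
```

The cost of this approach is that every cell in `songs` crosses the network before the predicate discards any of them, which a server-side filter would avoid.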
Project status
Components working with Cassandra
KijiSchema
KijiMR
KijiREST
KijiExpress
KijiSchema available for download / tutorial
https://github.com/kijiproject/kiji-schema/blob/cassandra/cassandra_tutorial.md
(tinyurl.com/mmubg5o)
All code available with tutorial within 1-2 months
Summary
3
Data in / out: KijiREST, KijiMR
Inspect and train: KijiHive, KijiMR, KijiExpress
Score (real-time): KijiModelRepository, KijiScoring
Thanks to Cassandra community
Mailing lists
Meetups, webinars, conferences
Try it now!
www.kiji.org
tinyurl.com/mmubg5o
@clintwkelly