A real-time architecture using Hadoop and Storm.
Speakers
Nathan Bijnens (@nathan_gs)
Geert Van Landeghem (@gvanlandeghem)
Our Vision
Big Data: Volume, Velocity, Variety
Credits
Nathan Marz, engineer at BackType (now Twitter).
- Storm
- Cascalog
- ElephantDB
- manning.com/marz
A Data System
Data is more than Information
Not all information is equal. Some information is derived from other pieces of information.
Eventually you will reach the most basic information, which is not derived from anything else. This is the information you hold true, simply because it exists.
Events
Everything we do generates events:
- Pay with Credit Card
- Commit to Git
- Click on a webpage
- Tweet
Events - Before
Events used to manipulate the master data.
Events - After
Today, events are the master data.
everything.
Data System
Events
- Data is immutable.
- Data is time based.
Capturing change traditionally
Before the update:
Person   Location
Nathan   Antwerp
Geert    Dendermonde
John     Ghent
After the update (Nathan's row is overwritten):
Person   Location
Nathan   Ghent
Geert    Dendermonde
John     Ghent
Capturing change
Before the change:
Person   Location      Time
Nathan   Antwerp       2005-01-01
Geert    Dendermonde   2011-10-08
John     Ghent         2010-05-02
After the change (a new row is appended):
Person   Location      Time
Nathan   Antwerp       2005-01-01
Geert    Dendermonde   2011-10-08
John     Ghent         2010-05-02
Nathan   Ghent         2013-02-03
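The timestamped table above can be sketched as an append-only list of immutable facts. This is only a minimal illustration: the `LocationFact` record and the `record_move` helper are hypothetical names, not something from the slides, and a real master dataset would live in distributed storage rather than in a Python list.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical record type: one immutable, timestamped fact per row.
@dataclass(frozen=True)
class LocationFact:
    person: str
    location: str
    time: date

# The master dataset is only ever appended to, never updated in place.
master_data = [
    LocationFact("Nathan", "Antwerp", date(2005, 1, 1)),
    LocationFact("Geert", "Dendermonde", date(2011, 10, 8)),
    LocationFact("John", "Ghent", date(2010, 5, 2)),
]

def record_move(log, person, location, when):
    """Capture a change by appending a new fact; old facts stay untouched."""
    log.append(LocationFact(person, location, when))

record_move(master_data, "Nathan", "Ghent", date(2013, 2, 3))

# Both of Nathan's locations remain available, in the order they were recorded.
nathan_history = [f.location for f in master_data if f.person == "Nathan"]
print(nathan_history)  # ['Antwerp', 'Ghent']
```

Because nothing is overwritten, the earlier state (Nathan in Antwerp) can still be reconstructed from the log.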
Query
The data you query is often transformed, aggregated, ...
Query = function(data)
Example: number of people living in each city.
Person   Location      Time
Nathan   Antwerp       2005-01-01
Geert    Dendermonde   2011-10-08
John     Ghent         2010-05-02
Nathan   Ghent         2013-02-03
Location      Count
Ghent         2
Dendermonde   1
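As a rough sketch of "Query = function(data)", the count per city can be written as a pure function over all the rows. The function name `people_per_city` and the tuple layout are assumptions made for this example; the rule that only a person's most recent location counts is inferred from the result table.

```python
from collections import Counter

# Same rows as the table above: (person, location, ISO date).
all_data = [
    ("Nathan", "Antwerp", "2005-01-01"),
    ("Geert", "Dendermonde", "2011-10-08"),
    ("John", "Ghent", "2010-05-02"),
    ("Nathan", "Ghent", "2013-02-03"),
]

def people_per_city(data):
    """Query = function(data): count people by their most recent location."""
    latest = {}  # person -> (time, location)
    for person, location, time in data:
        # ISO dates compare correctly as plain strings.
        if person not in latest or time > latest[person][0]:
            latest[person] = (time, location)
    return Counter(location for _, location in latest.values())

print(dict(people_per_city(all_data)))  # {'Ghent': 2, 'Dendermonde': 1}
```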
[Figure: All Data, Query]
Query: Precompute
[Figure: All Data, Precomputed View, Query]
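One way to picture the precompute step: scan all data once to build a view, then answer each query from that view. The names `precompute_view` and `query` and the sample data are hypothetical, chosen only for this sketch.

```python
from collections import Counter

# Hypothetical input: each person's current location.
current_locations = {"Nathan": "Ghent", "Geert": "Dendermonde", "John": "Ghent"}

def precompute_view(data):
    """Expensive step: scans all data once and stores the result."""
    return dict(Counter(data.values()))

def query(view, city):
    """Cheap step: answers from the precomputed view without touching all data."""
    return view.get(city, 0)

view = precompute_view(current_locations)
print(query(view, "Ghent"), query(view, "Antwerp"))  # 2 0
```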
Layered Architecture
- Speed Layer
- Batch Layer
- Serving Layer
[Figure: Incoming Data, Hadoop, ElephantDB, Cassandra, Query]
Batch Layer
[Figure: Incoming Data, Hadoop, ElephantDB]
- Unrestrained computation.
- Horizontally scalable.
- High latency. … matter.
- Stores the master copy of the data set, append only.
Batch: View generation
[Figure: Master Dataset, MapReduce, View #1, View #2, View #3]
MapReduce
1. Take a large problem and divide it into sub-problems
2. Perform the same function on all sub-problems
3. Combine the output from all sub-problems
[Figure: Map runs DoWork() DoWork() DoWork() …; Reduce produces the Output]
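The three steps above can be sketched in plain Python. This runs in a single process and does not use Hadoop; the helper names (`split_into_chunks`, `map_chunk`, `combine`) are made up for the illustration.

```python
from collections import defaultdict
from functools import reduce

def split_into_chunks(records, size):
    """Step 1: divide the large problem into sub-problems."""
    return [records[i:i + size] for i in range(0, len(records), size)]

def map_chunk(chunk):
    """Step 2: the same function runs on every sub-problem."""
    counts = defaultdict(int)
    for word in chunk:
        counts[word] += 1
    return dict(counts)

def combine(left, right):
    """Step 3: merge two partial outputs into one."""
    merged = dict(left)
    for key, value in right.items():
        merged[key] = merged.get(key, 0) + value
    return merged

records = ["Ghent", "Antwerp", "Ghent", "Dendermonde", "Ghent", "Antwerp"]
partials = [map_chunk(chunk) for chunk in split_into_chunks(records, 2)]
output = reduce(combine, partials, {})
print(output)  # {'Ghent': 3, 'Antwerp': 2, 'Dendermonde': 1}
```

Each call to `map_chunk` is independent of the others, which is what would let a real framework run them on different machines.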
Batch View Database
Read-only database. No random writes required.
- ElephantDB
- Splout
Batch Layer
[Figure: timeline ending at Now. Most of it is data absorbed into Batch Views; the part not yet absorbed is just a few hours of data.]
Speed Layer
[Figure: overview with Incoming Data, Hadoop, ElephantDB, Cassandra]
- Stream processing.
- Continuous computation.
- Transactional.
- Storing a limited window of data. Compensating for the last few hours of data.
- All the complexity is isolated in the Speed layer. If anything goes wrong, it is auto-corrected.
CAP
You have a choice between:
- Availability: queries are eventually consistent.
- Consistency: queries are consistent.
Eventual accuracy
Some algorithms are hard to implement in real time. For those cases we could estimate the results.
Speed Layer
[Figure: Incoming Data, Real Time View 1, Real Time View 2]
Storm
- Message passing.
- Distributed processing.
- Horizontally scalable.
- Incremental algorithms.
- Fast.
- Data in motion.
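To illustrate what "incremental algorithms" could mean for a real-time view, the sketch below updates a count per location one event at a time instead of rescanning all data. `RealTimeView` and `on_event` are hypothetical names, the view is kept in memory for simplicity, and events are assumed to arrive in time order.

```python
from collections import Counter

class RealTimeView:
    """Hypothetical in-memory view updated one event at a time."""

    def __init__(self):
        self.residents = Counter()  # location -> number of people
        self.current = {}           # person -> current location

    def on_event(self, person, location):
        """Incremental update: adjust only the counts this event touches."""
        previous = self.current.get(person)
        if previous is not None:
            self.residents[previous] -= 1
            if self.residents[previous] == 0:
                del self.residents[previous]
        self.current[person] = location
        self.residents[location] += 1

view = RealTimeView()
# Assumption: events arrive in time order.
for person, location in [("Nathan", "Antwerp"), ("John", "Ghent"), ("Nathan", "Ghent")]:
    view.on_event(person, location)
print(dict(view.residents))  # {'Ghent': 2}
```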
Storm
[Figure: Nimbus, Zookeeper, and three Worker Nodes, each with a Supervisor and three Workers]
Storm
- Tuple
- Stream
- Spout
- Bolt
- Grouping
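The Storm terms above can be pictured with a small plain-Python simulation. It does not use the Storm API: `location_spout`, `CountBolt`, and `fields_grouping` are illustrative names, and the routing rule (hash of one field modulo the number of tasks) is an assumed stand-in for a grouping that sends equal field values to the same task.

```python
import zlib
from collections import Counter

def location_spout():
    """Spout: the source of a stream; here it just yields a few tuples."""
    yield ("Nathan", "Ghent")
    yield ("Geert", "Dendermonde")
    yield ("John", "Ghent")

class CountBolt:
    """Bolt: consumes tuples and keeps a running count per location."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, tup):
        _, location = tup
        self.counts[location] += 1

def fields_grouping(tup, field_index, num_tasks):
    """Grouping: route tuples with the same field value to the same task."""
    key = tup[field_index].encode("utf-8")
    return zlib.crc32(key) % num_tasks  # stable across runs, unlike built-in hash()

tasks = [CountBolt(), CountBolt()]
for tup in location_spout():
    tasks[fields_grouping(tup, 1, len(tasks))].execute(tup)

total = sum((task.counts for task in tasks), Counter())
print(total["Ghent"], total["Dendermonde"])  # 2 1
```

Because both "Ghent" tuples reach the same task, that task's local count for Ghent is complete without any coordination between tasks.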
Speed Layer Views
The views are stored in a read & write database.
- Cassandra
- HBase
- MongoDB
- MySQL
- ElasticSearch
Much more complex than a read-only view.
Serving Layer
[Figure: overview with Incoming Data, Hadoop, ElephantDB, Cassandra, Query]
This layer queries the Batch & Real Time views and merges the results.
[Figure: Real Time Views, Batch Views, Merge]
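A minimal sketch of the merge step, assuming both views hold additive counts keyed the same way. The `merge_views` function and the click counts are invented for the example; metrics that are not additive would need a different merge rule.

```python
from collections import Counter

def merge_views(batch_view, realtime_view):
    """Serve a query by combining the batch view with the real-time view."""
    return Counter(batch_view) + Counter(realtime_view)

# Hypothetical view contents: page -> number of clicks.
batch_view = {"/home": 1000, "/about": 40}     # data already absorbed by the batch layer
realtime_view = {"/home": 12, "/pricing": 3}   # only the last few hours

merged = merge_views(batch_view, realtime_view)
print(merged["/home"], merged["/about"], merged["/pricing"])  # 1012 40 3
```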
Overview
[Figure: Incoming Data, Hadoop, ElephantDB, Cassandra, Query]
Lambda Architecture
You can discard any view, batch or real time, and recreate everything from the master data.
Mistakes are corrected via recomputation.
- Wrote bad data? Remove the data & recompute.
- Bug in view generation? Just recompute the view.
Data storage is highly optimized.
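The recomputation idea above works because a view is a pure function of the master data. In this sketch the `build_view` function, the event fields, and the misspelled page are all hypothetical.

```python
def build_view(master_data):
    """View generation: a pure function of the master data."""
    view = {}
    for event in master_data:
        view[event["page"]] = view.get(event["page"], 0) + 1
    return view

master_data = [
    {"id": 1, "page": "/home"},
    {"id": 2, "page": "/home"},
    {"id": 3, "page": "/hmoe"},  # bad data written by mistake
]

view = build_view(master_data)  # the view now contains the bad entry

# Wrote bad data? Remove it from the master data and recompute the view.
master_data = [event for event in master_data if event["id"] != 3]
view = build_view(master_data)
print(view)  # {'/home': 2}
```

A bug in `build_view` itself would be handled the same way: fix the function and run it again over the unchanged master data.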
Recommendations
Serialization & Schema
Catch errors as soon as they happen. Validate on write, not on read.
CSV is actually a serialization format, just a poorly defined one.
Use a format with a schema.
- Thrift
- Avro
- Protocol Buffers
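To show the difference between validating on write and on read, here is a sketch that uses a Python dataclass as a stand-in for a schema. It is not Thrift, Avro, or Protocol Buffers; `LocationRecord` and `write_record` are hypothetical names, and a real project would define the schema in one of the formats listed above.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class LocationRecord:
    """Hypothetical schema, standing in for a Thrift/Avro/Protocol Buffers definition."""
    person: str
    location: str
    time: date

    def __post_init__(self):
        # Validation on write: reject malformed records before they are stored.
        if not self.person or not self.location:
            raise ValueError("person and location must be non-empty")
        if not isinstance(self.time, date):
            raise TypeError("time must be a datetime.date")

def write_record(store, person, location, time):
    """Append a record; construction fails if it violates the schema."""
    store.append(LocationRecord(person, location, time))

store = []
write_record(store, "Nathan", "Ghent", date(2013, 2, 3))

try:
    write_record(store, "Geert", "Dendermonde", "2011-10-08")  # wrong type
except TypeError as error:
    print("rejected on write:", error)

# By contrast, schemaless CSV accepts anything; the error surfaces only on read.
rows = list(csv.reader(io.StringIO("Geert,Dendermonde,not-a-date\n")))
print(rows)  # [['Geert', 'Dendermonde', 'not-a-date']]
```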
Questions?
What are your needs?
@nathan_gs & @gvanlandeghem
DataCrunchers
We enable companies in envisioning, defining and implementing a data strategy.
A one-stop-shop for all your Big Data needs.
The first Big Data Consultancy agency in Belgium.