Scaling Out with Hadoop and NoSQL
An Introduction to Dealing with Big Data
Age Mooij
About me...
@agemooij
Big Data ...and me
IP Address Registration for Europe, the Middle East, Russia
IPv4: 2^32 (4.3 × 10^9) addresses
IPv6: 2^128 (3.4 × 10^38) addresses
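The address-space sizes above are simple powers of two; a quick sketch to verify the arithmetic (illustrative only, not part of the talk):

```java
import java.math.BigInteger;

// Verify the IPv4 / IPv6 address-space sizes quoted above.
public class AddressSpace {
    static BigInteger ipv4() { return BigInteger.valueOf(2).pow(32); }
    static BigInteger ipv6() { return BigInteger.valueOf(2).pow(128); }

    public static void main(String[] args) {
        System.out.println("IPv4: " + ipv4() + " addresses"); // 4294967296, ~4.3 x 10^9
        System.out.println("IPv6: " + ipv6() + " addresses"); // ~3.4 x 10^38
    }
}
```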
My Current Project...
Challenge
10 years of historical registration/routing data in flat files
200+ billion (!) historical data records (25 TB)
30 billion records per year (4 TB)
80 million per day / 1,000 per second
Make it searchable...
Big Data ...and you
Google, Amazon, Yahoo, Facebook, eBay, Marktplaats, Hyves, Digg, Flickr, YouTube, Wikipedia, MySpace
(user bases ranging from 6.5M users / 5.5M ads up to 300M users)
Scalability:
Handling more load / requests
Handling more data
Handling more types of data
...without anything breaking or falling over
...and without going bankrupt
Scaling UP vs Scaling OUT
Scaling Out, Part 1
Processing Data
a.k.a. Data Crunching
Map/Reduce: Parallel Batch Processing of Data
Break the data into chunks
Distribute the chunks
Process the chunks in parallel
Merge the results
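The four steps above can be sketched in plain Java: a local simulation of the map/reduce flow (word counting as the classic example), not Hadoop itself; all names are illustrative:

```java
import java.util.*;
import java.util.stream.*;

// Local sketch of the map/reduce flow: chunk the input, "map" each chunk
// independently (in parallel on a real cluster), then merge the partial results.
public class MapReduceSketch {

    // Map step: count the words within one chunk of lines.
    static Map<String, Integer> mapChunk(List<String> chunk) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : chunk)
            for (String word : line.split("\\s+"))
                counts.merge(word, 1, Integer::sum);
        return counts;
    }

    // Reduce step: merge the per-chunk partial counts.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> partial : partials)
            partial.forEach((word, n) -> merged.merge(word, n, Integer::sum));
        return merged;
    }

    static Map<String, Integer> wordCount(List<String> lines, int chunkSize) {
        List<List<String>> chunks = new ArrayList<>();       // 1. break the data into chunks
        for (int i = 0; i < lines.size(); i += chunkSize)
            chunks.add(lines.subList(i, Math.min(i + chunkSize, lines.size())));
        List<Map<String, Integer>> partials = chunks
                .parallelStream()                            // 2/3. distribute + process in parallel
                .map(MapReduceSketch::mapChunk)
                .collect(Collectors.toList());
        return reduce(partials);                             // 4. merge the results
    }

    public static void main(String[] args) {
        List<String> data = List.of("ip routing data", "routing data records", "ip data");
        System.out.println(wordCount(data, 2)); // data=3, ip=2, routing=2, records=1
    }
}
```

On a real Hadoop cluster the chunking and distribution are done by the framework (HDFS blocks, task trackers); only the map and reduce functions are yours to write.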
Reliable, Scalable, Distributed Computing
(written in Java)
Distributed File System (DFS)
Automatic file replication
Automatic checksumming / error correction
Foundation for all Hadoop projects
Based on Google’s File System (GFS)
Map / Reduce
Simple Java API
Powerful supporting framework
Powerful tools
Good support for non-Java languages
4 TB of raw image TIFF data (stored in S3)
100 Amazon EC2 instances running Hadoop Map/Reduce
24 hours, about $240
11 million finished PDFs
Scaling Out, Part II
Storing & Retrieving Data: Reads and Writes
Relational Databases are hard to scale out
Ways to Scale Out an RDBMS (1): Replication
Master-Slave
Master-Master
Good for scaling reads
Limited scaling of writes
Single point of failure / single point of bottleneck
Complicated
Ways to Scale Out an RDBMS (2): Partitioning
Vertical: by function / table
Horizontal: by key / id (Sharding)
Not truly relational anymore (application-side joins)
Limited scalability (relocating, resharding)
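Horizontal partitioning can be as simple as hashing each key to a shard. A minimal sketch, with a made-up shard count and made-up shard names:

```java
// Minimal horizontal-partitioning (sharding) sketch: route each key to one of
// N shards by hashing. Shard names and count are invented for illustration.
public class ShardRouter {
    static final String[] SHARDS = { "db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3" };

    // Non-negative hash of the key, modulo the number of shards.
    static String shardFor(String key) {
        return SHARDS[Math.floorMod(key.hashCode(), SHARDS.length)];
    }

    public static void main(String[] args) {
        for (String key : new String[] { "user:42", "user:1337", "ip:193.0.0.1" })
            System.out.println(key + " -> " + shardFor(key));
    }
}
```

The slide's caveat shows up directly here: changing the number of shards remaps almost every key, which is why resharding and relocating data limit scalability; consistent hashing is the usual remedy.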
Why are RDBMSs so hard to scale out?
Brewer's CAP Theorem:
Consistency, Availability, Partition tolerance ...pick any two
ACID vs BASE
ACID: Atomic, Consistent, Isolated, Durable
BASE: Basic Availability, Soft state, Eventual consistency
Relational vs Non-Relational
NoSQL, not "NO SQL"
Not better, just different
Non-Relational Databases
Types of NoSQL
(Distributed) Key-Value: Redis, Voldemort, Scalaris (D)
Document Oriented: CouchDB, MongoDB, Riak (D)
Column Oriented: Cassandra (D), HBase (D)
Graph Oriented: Neo4J
(D) = Distributed (automatic scaling out)
RIPE NCC
Experiences so far...
Those Big Numbers Again...
10 years of historical data in flat files
200+ billion (!) historical data records (25 TB)
30 billion records per year (4 TB)
80 million per day / 1,000 per second
Make it searchable...
~ 200,000,000,000 records → Map/Reduce → ~ 15,000,000,000 records
Our Data is 3D: IP Address → Timestamp → Record (1 → 10..* → 0..*)
Mapping: Row = IP Address, Column Name (!) = Timestamp, Values (!) = Records
Best fit & performance: Column Oriented
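The 3D model above (row key = IP address, column name = timestamp, values = records) can be mimicked in memory with nested sorted maps; a sketch with made-up data, showing the timestamp-range query shape the column-oriented layout is chosen for:

```java
import java.util.*;

// In-memory mimic of a column-oriented table: row key = IP address,
// column name = timestamp, value = the historical record. Data is invented.
public class ColumnModel {
    // row key -> (timestamp -> record), columns kept sorted by timestamp
    static final Map<String, NavigableMap<Long, String>> TABLE = new HashMap<>();

    static void put(String ip, long timestamp, String record) {
        TABLE.computeIfAbsent(ip, k -> new TreeMap<>()).put(timestamp, record);
    }

    // Slice one row by timestamp range: [from, to)
    static SortedMap<Long, String> history(String ip, long from, long to) {
        return TABLE.getOrDefault(ip, new TreeMap<>()).subMap(from, true, to, false);
    }

    public static void main(String[] args) {
        put("193.0.0.1", 1000L, "allocated");
        put("193.0.0.1", 2000L, "route announced");
        put("193.0.0.1", 3000L, "transferred");
        System.out.println(history("193.0.0.1", 1000L, 2500L)); // {1000=allocated, 2000=route announced}
    }
}
```

In a real column store the same shape holds, but the sorted columns live on disk and the rows are distributed across the cluster by row key.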
Cassandra 0.4.1 (used at Digg)
Tunable: Availability vs Consistency
Very active community
No documentation
HBase 0.20.1 (used at Tumblr, Yahoo, StumbleUpon, Meetup, Streamy, Adobe)
Good documentation
Very active community
Built on top of Hadoop DFS
Initial results: tested on an EC2 cluster of 8 XLarge instances
Map/Reduce: 3.8 B records (23 GB) → 33 M records (1 GB) in 5 hours
Loading 33 M records (1 GB): 75 minutes
44,000 inserts/second
"Needle in a haystack" full on-disk table scan: 0.5 M records/second
On disk: 15 GB (record duplication: 6x)
In order to choose the right scaling tools, you need to:
Know what you want to query and how
Understand your data
Big Data...Be Prepared !
Try some Scala in the basement !
val shameless = <SelfPromotion>
</SelfPromotion>