Scaling Out with Hadoop and NoSQL
An Introduction to Dealing with Big Data
Age Mooij
About me...
@agemooij
Big Data ...and me
IP Address Registration for Europe, the Middle East, Russia
IPv4: 2^32 (4.3 × 10^9) addresses
IPv6: 2^128 (3.4 × 10^38) addresses
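The address-space sizes above are simple powers of two; a quick sketch to verify the arithmetic (illustrative only, not part of the talk):

```java
import java.math.BigInteger;

// Verify the IPv4 / IPv6 address-space sizes quoted above.
public class AddressSpace {
    static BigInteger ipv4() { return BigInteger.valueOf(2).pow(32); }
    static BigInteger ipv6() { return BigInteger.valueOf(2).pow(128); }

    public static void main(String[] args) {
        System.out.println("IPv4: " + ipv4() + " addresses"); // 4294967296, ~4.3 x 10^9
        System.out.println("IPv6: " + ipv6() + " addresses"); // ~3.4 x 10^38
    }
}
```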
My Current Project...
Challenge
10 years of historical registration/routing data in flat files
200+ billion (!) historical data records (25 TB)
30 billion records per year (4 TB)
80 million per day / 1,000 per second
Make it searchable...
Big Data ...and you
Google, Amazon, Yahoo, Facebook, eBay, Marktplaats, Hyves, Digg, Flickr, YouTube, Wikipedia, MySpace
(user bases ranging from 6.5M users / 5.5M ads up to 300M users)
Scalability:
Handling more load / requests
Handling more data
Handling more types of data
...without anything breaking or falling over
...and without going bankrupt
Scaling UP vs Scaling OUT
Scaling Out, Part 1
Processing Data
a.k.a. Data Crunching
Map/Reduce: Parallel Batch Processing of Data
Break the data into chunks
Distribute the chunks
Process the chunks in parallel
Merge the results
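The four steps above can be sketched in plain Java: a local simulation of the map/reduce flow (word counting as the classic example), not Hadoop itself; all names are illustrative:

```java
import java.util.*;
import java.util.stream.*;

// Local sketch of the map/reduce flow: chunk the input, "map" each chunk
// independently (in parallel on a real cluster), then merge the partial results.
public class MapReduceSketch {

    // Map step: count the words within one chunk of lines.
    static Map<String, Integer> mapChunk(List<String> chunk) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : chunk)
            for (String word : line.split("\\s+"))
                counts.merge(word, 1, Integer::sum);
        return counts;
    }

    // Reduce step: merge the per-chunk partial counts.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> partial : partials)
            partial.forEach((word, n) -> merged.merge(word, n, Integer::sum));
        return merged;
    }

    static Map<String, Integer> wordCount(List<String> lines, int chunkSize) {
        List<List<String>> chunks = new ArrayList<>();       // 1. break the data into chunks
        for (int i = 0; i < lines.size(); i += chunkSize)
            chunks.add(lines.subList(i, Math.min(i + chunkSize, lines.size())));
        List<Map<String, Integer>> partials = chunks
                .parallelStream()                            // 2/3. distribute + process in parallel
                .map(MapReduceSketch::mapChunk)
                .collect(Collectors.toList());
        return reduce(partials);                             // 4. merge the results
    }

    public static void main(String[] args) {
        List<String> data = List.of("ip routing data", "routing data records", "ip data");
        System.out.println(wordCount(data, 2)); // data=3, ip=2, routing=2, records=1
    }
}
```

On a real Hadoop cluster the chunking and distribution are done by the framework (HDFS blocks, task trackers); only the map and reduce functions are yours to write.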
Reliable, Scalable, Distributed Computing
(written in Java)
Distributed File System (DFS)
Automatic file replication
Automatic checksumming / error correction
Foundation for all Hadoop projects
Based on Google’s File System (GFS)
Map / Reduce
Simple Java API
Powerful supporting framework
Powerful tools
Good support for non-Java languages
4 TB of raw image TIFF data (stored in S3)
100 Amazon EC2 instances running Hadoop Map/Reduce
24 hours, about $240
11 million finished PDFs
Scaling Out, Part II
Storing & Retrieving Data: Reads and Writes
Relational Databases are hard to scale out
Ways to Scale Out an RDBMS (1): Replication
Master-Slave
Master-Master
Good for scaling reads
Limited scaling of writes
Single point of failure / single point of bottleneck
Complicated
Ways to Scale Out an RDBMS (2): Partitioning
Vertical: by function / table
Horizontal: by key / id (Sharding)
Not truly relational anymore (application-side joins)
Limited scalability (relocating, resharding)
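Horizontal partitioning can be as simple as hashing each key to a shard. A minimal sketch, with a made-up shard count and made-up shard names:

```java
// Minimal horizontal-partitioning (sharding) sketch: route each key to one of
// N shards by hashing. Shard names and count are invented for illustration.
public class ShardRouter {
    static final String[] SHARDS = { "db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3" };

    // Non-negative hash of the key, modulo the number of shards.
    static String shardFor(String key) {
        return SHARDS[Math.floorMod(key.hashCode(), SHARDS.length)];
    }

    public static void main(String[] args) {
        for (String key : new String[] { "user:42", "user:1337", "ip:193.0.0.1" })
            System.out.println(key + " -> " + shardFor(key));
    }
}
```

The slide's caveat shows up directly here: changing the number of shards remaps almost every key, which is why resharding and relocating data limit scalability; consistent hashing is the usual remedy.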
Why are RDBMSs so hard to scale out?
Brewer's CAP Theorem:
Consistency, Availability, Partition tolerance ...pick any two
ACID vs BASE
ACID: Atomic, Consistent, Isolated, Durable
BASE: Basic Availability, Soft state, Eventual consistency
Relational vs Non-Relational
NoSQL, not "NO SQL"
Not better, just different
Non-Relational Databases
Types of NoSQL
(Distributed) Key-Value: Redis, Voldemort, Scalaris (D)
Document Oriented: CouchDB, MongoDB, Riak (D)
Column Oriented: Cassandra (D), HBase (D)
Graph Oriented: Neo4J
(D) = Distributed (automatic scaling out)
RIPE NCC
Experiences so far...
Those Big Numbers Again...
10 years of historical data in flat files
200+ billion (!) historical data records (25 TB)
30 billion records per year (4 TB)
80 million per day / 1,000 per second
Make it searchable...
~ 200,000,000,000 records → Map/Reduce → ~ 15,000,000,000 records
Our Data is 3D: IP Address → Timestamp → Record (1 → 10..* → 0..*)
Mapping: Row = IP Address, Column Name (!) = Timestamp, Values (!) = Records
Best fit & performance: Column Oriented
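The 3D model above (row key = IP address, column name = timestamp, values = records) can be mimicked in memory with nested sorted maps; a sketch with made-up data, showing the timestamp-range query shape the column-oriented layout is chosen for:

```java
import java.util.*;

// In-memory mimic of a column-oriented table: row key = IP address,
// column name = timestamp, value = the historical record. Data is invented.
public class ColumnModel {
    // row key -> (timestamp -> record), columns kept sorted by timestamp
    static final Map<String, NavigableMap<Long, String>> TABLE = new HashMap<>();

    static void put(String ip, long timestamp, String record) {
        TABLE.computeIfAbsent(ip, k -> new TreeMap<>()).put(timestamp, record);
    }

    // Slice one row by timestamp range: [from, to)
    static SortedMap<Long, String> history(String ip, long from, long to) {
        return TABLE.getOrDefault(ip, new TreeMap<>()).subMap(from, true, to, false);
    }

    public static void main(String[] args) {
        put("193.0.0.1", 1000L, "allocated");
        put("193.0.0.1", 2000L, "route announced");
        put("193.0.0.1", 3000L, "transferred");
        System.out.println(history("193.0.0.1", 1000L, 2500L)); // {1000=allocated, 2000=route announced}
    }
}
```

In a real column store the same shape holds, but the sorted columns live on disk and the rows are distributed across the cluster by row key.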
Cassandra 0.4.1 (used at Digg)
Tunable: Availability vs Consistency
Very active community
No documentation
HBase 0.20.1 (used at Tumblr, Yahoo, StumbleUpon, Meetup, Streamy, Adobe)
Good documentation
Very active community
Built on top of Hadoop DFS
Initial results: tested on an EC2 cluster of 8 XLarge instances
Map/Reduce: 3.8 B records (23 GB) → 33 M records (1 GB) in 5 hours
Loading 33 M records (1 GB): 75 minutes
44,000 inserts/second
"Needle in a haystack" full on-disk table scan: 0.5 M records/second
On disk: 15 GB (record duplication: 6x)
In order to choose the right scaling tools, you need to:
Know what you want to query and how
Understand your data
Big Data...Be Prepared !
Try some Scala in the basement !
val shameless = <SelfPromotion>
</SelfPromotion>