Tradeoffs Between Parallel Database Systems, Hadoop, and HadoopDB as Platforms for Petabyte-Scale Analysis
Daniel Abadi
Yale University
November 23rd, 2011
Data, Data, Everywhere
Data explosion
• Web 2.0 → more user data
• More devices that sense data
• More equipment that produces data at extraordinary rates (e.g. high-throughput sequencing)
• More interactions being tracked (e.g. clickstream data)
• More business processes being digitized
• More history being kept
Data becoming core to decision making, operational activities, and scientific process
• Want raw data (not aggregated versions)
• Want to run complex, ad-hoc analytics (in addition to reporting)
System Design for the Data Deluge
Shared-memory does not scale nearly well enough for petascale analytics
Shared-disk is adequate for many applications, especially CPU-intensive applications, but can have scalability problems for data I/O-intensive workloads
For scan performance, nothing beats putting CPUs next to the disks
• Partition data across CPUs/disks
• Shared-nothing designs increasingly being used for petascale analytics
Parallel Database Systems
Shared-nothing implementations have existed since the '80s
• Plenty of commercial options (Teradata, Microsoft PDW, IBM Netezza, HP Vertica, EMC Greenplum, Aster Data, many more)
• SQL interface, with UDF support
• Excel at managing and processing structured, relational data
• Query execution via relational operator pipelines (select, project, join, group by, etc.)
MapReduce
Data is partitioned across N machines
• Typically stored in a distributed file system (GFS/HDFS)
On each machine n, apply a function, Map, to each data item d
• Map(d) → {(key1, value1)}  ("map job")
• Sort output of all map jobs on n by key
• Send (key1, value1) pairs with the same key value to the same machine (using e.g. hashing)
On each machine m, apply the Reduce function to the (key1, {value1}) pairs mapped to it
• Reduce(key1, {value1}) → (key2, value2)  ("reduce job")
Optionally, collect output and present to the user
Example
Count occurrences of the words "cern" and "france" in all documents

map(d):
    words = split(d, ' ')
    foreach w in words:
        if w == 'cern':
            emit ('cern', 1)
        if w == 'france':
            emit ('france', 1)

reduce(key, valueSet):
    count = 0
    foreach v in valueSet:
        count += v
    emit (key, count)
[Diagram: data partitions feed map workers, whose (cern, 1) and (france, 1) pairs are routed to a CERN reducer and a France reducer]
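The same job can be simulated in a single Python process (a sketch only; the function names and driver code here are illustrative, not the Hadoop API, which would use Mapper/Reducer classes):

```python
from collections import defaultdict

def map_doc(doc):
    """Map phase: emit (word, 1) for each occurrence of the target words."""
    for w in doc.split():
        if w in ("cern", "france"):
            yield (w, 1)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_counts(key, values):
    """Reduce phase: sum the counts for one key."""
    return (key, sum(values))

docs = ["cern is in france", "france borders italy", "visit cern"]
pairs = [p for d in docs for p in map_doc(d)]
result = dict(reduce_counts(k, vs) for k, vs in shuffle(pairs).items())
print(result)  # {'cern': 2, 'france': 2}
```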
Relational Operators in MR
Straightforward to implement relational operators in MapReduce
• Select: simple filter in Map phase
• Project: project function in Map phase
• Join: Map produces tuples with join key as key; Reduce performs the join
Query plans can be implemented as a sequence of MapReduce jobs (e.g. Hive)
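The repartition join described above can be sketched in plain Python (illustrative table contents and tags, not Hadoop code): Map tags each tuple with its source table, the shuffle groups by join key, and Reduce pairs tuples across tables within each group.

```python
from collections import defaultdict

rankings = [("a.com", 10), ("b.com", 3)]                          # (pageURL, pageRank)
visits = [("1.2.3.4", "a.com", 5.0), ("5.6.7.8", "a.com", 1.5)]  # (sourceIP, destURL, adRevenue)

# Map phase: emit (join_key, tagged_tuple) from each input table.
pairs = [(url, ("R", rank)) for url, rank in rankings]
pairs += [(dest, ("V", ip, rev)) for ip, dest, rev in visits]

# Shuffle: group by join key.
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)

# Reduce phase: join tuples from the two tables within each group.
joined = []
for url, tuples in groups.items():
    ranks = [t[1] for t in tuples if t[0] == "R"]
    vs = [t[1:] for t in tuples if t[0] == "V"]
    for rank in ranks:
        for ip, rev in vs:
            joined.append((ip, url, rank, rev))

print(sorted(joined))
# [('1.2.3.4', 'a.com', 10, 5.0), ('5.6.7.8', 'a.com', 10, 1.5)]
```

Note that every matching tuple from both tables must be buffered at the reducer, one reason joins are a weak spot for plain MapReduce compared with a co-partitioned database join.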
Overview of Talk
Compare these two approaches to petascale data analysis
Discuss a hybrid approach called HadoopDB
Similarities
Both are suitable for large-scale data processing
• I.e. analytical processing workloads
• Bulk loads
• Not optimized for transactional workloads
• Queries over large amounts of data
• Both can handle relational and nonrelational queries (DBMSs via UDFs)
Differences
MapReduce can operate on in-situ data, without requiring transformation or loading
Schemas:
• MapReduce doesn't require them; DBMSs do
• Easy to write simple MR programs
Indexes
• MR provides no built-in support
Declarative vs. imperative programming
MapReduce uses a run-time scheduler for fine-grained load balancing
MapReduce checkpoints intermediate results for fault tolerance
Key (Not Fundamental) Difference
Hadoop
• Open source implementation of MapReduce
There exists no widely used open source parallel database system
• Commercial systems charge by the terabyte or CPU
• Big problem for "big data" companies like Facebook
Goal of Rest of Talk
Discuss our experience working with these systems
• Tradeoffs
• Include overview of SIGMOD 2009 benchmark paper
Discuss a hybrid system we built at Yale (HadoopDB)
• VLDB 2009 paper plus quick overviews of two 2011 papers
Three Benchmarks
Stonebraker Web analytics benchmark (SIGMOD 2009 paper)
TPC-H
LUBM
Web Analytics Benchmark
Goals
• Understand differences in load and query time for some common data processing tasks
• Choose a representative set of tasks that:
  - Both should excel at
  - MapReduce should excel at
  - Databases should excel at
Hardware Setup
100 node cluster
Each node:
• 2.4 GHz Core 2 Duo processor
• 4 GB RAM
• 2 × 250 GB SATA HDs (74 MB/sec sequential I/O)
Dual GigE switches, each with 50 nodes
• 128 Gbit/sec fabric
Connected by a 64 Gbit/sec ring
Benchmarked Software
Compare:
• Popular commercial row-store parallel database system (DBMS-X)
• Vertica (commercial column-store parallel database system)
• Hadoop
Grep
Used in original MapReduce paper
Look for a 3-character pattern in the 90-byte field of 100-byte records with schema:

  key VARCHAR(10) PRIMARY KEY
  field VARCHAR(90)

• Pattern occurs in 0.01% of records

SELECT * FROM T WHERE field LIKE '%XYZ%'

1 TB of data spread across 25, 50, or 100 nodes
• ~10 billion records, 10–40 GB/node
Expected Hadoop to perform well
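The grep task maps naturally onto a map-only job, since selection needs no reduce phase. A plain-Python sketch (record tuples and driver loop are illustrative):

```python
def grep_map(record, pattern="XYZ"):
    """Map phase for grep: emit the whole record if the pattern occurs in `field`.
    No reduce phase is needed; this is a pure selection."""
    key, field = record
    if pattern in field:
        yield record

records = [("k1", "aaaXYZbbb"), ("k2", "no match here"), ("k3", "XYZ at start")]
matches = [r for rec in records for r in grep_map(rec)]
print(matches)  # [('k1', 'aaaXYZbbb'), ('k3', 'XYZ at start')]
```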
1 TB Grep – Load Times1 TB Grep – Load Times
0
5000
10000
15000
20000
25000
30000
25 nodes 50 nodes 100 nodes
Tim
e (
se
co
nd
s)
Vertica
DBMS-X
Hadoop
1 TB Grep – Query Times
[Chart: query time (seconds) at 25, 50, and 100 nodes for Vertica, DBMS-X, and Hadoop]
• All systems scale linearly (to 100 nodes)
• Database systems have better compression (and can operate directly on compressed data)
• Vertica's compression works better than DBMS-X's
Analytical TasksAnalytical Tasks
CREATE TABLE CREATE TABLE UserVisits UserVisits ( (
sourceIP VARCHAR(16),sourceIP VARCHAR(16),
destURL VARCHAR(100), destURL VARCHAR(100),
visitDate DATE, adRevenue FLOAT, visitDate DATE, adRevenue FLOAT,
userAgent VARCHAR(64), userAgent VARCHAR(64),
countryCode VARCHAR(3), countryCode VARCHAR(3),
languageCode VARCHAR(6), languageCode VARCHAR(6),
searchWord VARCHAR(32), searchWord VARCHAR(32),
duration INT ); duration INT );
• Simple web processing schema
• Task mix both relational and non-relational
• 600,000 randomly generated documents /node
• Embedded URLs reference documents on other nodes
• 155 million user visits / node• ~20 GB / node
• 18 million rankings / node• ~1 GB / node
CREATE TABLE Documents ( url VARCHAR(100) PRIMARY KEY, contents TEXT );
CREATE TABLE Rankings (pageURL VARCHAR(100) PRIMARY KEY, pageRank INT, avgDuration INT );
Loading – User Visits
Other tables show similar trends
[Chart: load time (seconds) at 10, 25, 50, and 100 nodes for Vertica, DBMS-X, and Hadoop]
Aggregation Task
Simple aggregation query to find adRevenue by IP prefix:

SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue)
FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7)

Parallel analytics query for DBMS
• (Compute partial aggregate on each node, merge answers to produce result)
• Yields 2,000 records (24 KB)
Aggregation Task Performance
[Chart: query time (seconds) at 10, 25, 50, and 100 nodes for Vertica, DBMS-X, and Hadoop]
Join Task
Join Rankings and UserVisits for sourceIP analysis and revenue attribution:

SELECT sourceIP, AVG(pageRank), SUM(adRevenue)
FROM Rankings, UserVisits
WHERE pageURL = destURL
AND visitDate BETWEEN '2000-01-15' AND '2000-01-22'
GROUP BY sourceIP
Join Task
Database systems can co-partition by join key!
[Chart: query time (seconds) at 10, 25, 50, and 100 nodes for Vertica, DBMS-X, and Hadoop]
UDF Task
Calculate PageRank over a set of HTML documents
Performed via a UDF
[Chart: query time (seconds) at 10, 25, 50, and 100 nodes for DBMS and Hadoop]
• DBMS clearly doesn't scale
Scalability
Except for DBMS-X load time and UDFs, all systems scale near-linearly
BUT: only ran on 100 nodes
As node counts approach 1,000, other effects come into play
• Faults go from being rare to not so rare
• It is nearly impossible to maintain homogeneity at scale
Fault Tolerance and Cluster Heterogeneity Results
[Chart: percentage slowdown in the fault tolerance and slowdown tolerance tests for DBMS and Hadoop]
Database systems restart the entire query upon a single node failure, and do not adapt if a node is running slowly
Benchmark Conclusions
Hadoop is consistently more scalable
• Checkpointing allows for better fault tolerance
• Runtime scheduling allows for better tolerance of unexpectedly slow nodes
• Better parallelization of UDFs
Hadoop is consistently less efficient for structured, relational data
• Reasons both fundamental and non-fundamental
• Needs better support for compression and direct operation on compressed data
• Needs better support for indexing
• Needs better support for co-partitioning of datasets
Best of Both Worlds Possible?
Many of Hadoop's deficiencies are not fundamental
• Result of initial design for unstructured data
HadoopDB: use Hadoop to coordinate execution of multiple independent (typically single-node, open source) database systems
• Flexible query interface (accepts both SQL and MapReduce)
• Open source (built using open source components)
HadoopDB Architecture
SMS Planner
HadoopDB Experiments
VLDB 2009 paper ran the same Stonebraker Web analytics benchmark
Used PostgreSQL as the DBMS storage layer
Join Task
[Chart: query time (seconds) at 10, 25, and 50 nodes for Vertica, DBMS-X, Hadoop, and HadoopDB]
• HadoopDB much faster than Hadoop
• Doesn't quite match the database systems in performance:
  - Hadoop start-up costs
  - PostgreSQL join performance
  - Fault tolerance/run-time scheduling
UDF Task
[Chart: query time (seconds) at 10, 25, and 50 nodes for DBMS, Hadoop, and HadoopDB]
Fault Tolerance and Cluster Heterogeneity Results
[Chart: percentage slowdown in the fault tolerance and slowdown tolerance tests for DBMS, Hadoop, and HadoopDB]
HadoopDB: Current Status
Recently commercialized by Hadapt
• Raised $9.5 million in venture capital
SIGMOD 2011 paper benchmarking HadoopDB on TPC-H data
• Added various other techniques:
  - Column-store storage
  - 4 different join algorithms
  - Referential partitioning
VLDB 2011 paper on using HadoopDB for graph data (with RDF-3X for storage)
TPC-H Benchmark Results
Graph Experiments
Invisible Loading
Data starts in HDFS
Data is immediately available for processing (immediate gratification paradigm)
Each MapReduce job causes data movement from HDFS to the database systems
Data is incrementally loaded, sorted, and indexed
Query performance improves "invisibly"
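A toy sketch of the invisible-loading idea (plain Python; the data structures and single-key query are illustrative, not the HadoopDB implementation): each query scans whatever raw data remains, and as a side effect migrates the rows it touches into a sorted, indexed store that later queries consult first.

```python
import bisect

hdfs_rows = [(5, "e"), (2, "b"), (9, "j"), (1, "a"), (7, "g")]  # unloaded raw (key, value) rows
loaded_keys = []    # sorted keys: the incrementally built "index"
loaded_vals = {}    # key -> value for loaded rows

def query(key):
    """Answer a point query; invisibly migrate scanned rows into the sorted store."""
    # Fast path: binary search over already-loaded, sorted data.
    i = bisect.bisect_left(loaded_keys, key)
    if i < len(loaded_keys) and loaded_keys[i] == key:
        return loaded_vals[key]
    # Slow path: scan raw data, loading each row we touch as a side effect.
    answer = None
    while hdfs_rows:
        k, v = hdfs_rows.pop()
        bisect.insort(loaded_keys, k)
        loaded_vals[k] = v
        if k == key:
            answer = v
            break  # incremental: stop scanning once this query is answered
    return answer

print(query(1))   # first query scans raw data -> a
print(query(1))   # repeat query hits the sorted store -> a
```

The important property is that no explicit load step blocks the first query; the cost of sorting and indexing is amortized invisibly across the queries that actually touch the data.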
Conclusions
Parallel database systems can be used for many data-intensive tasks
• Scalability can be an issue at extreme scale
• Parallelization of UDFs can be an issue
Hadoop is becoming increasingly popular and more robust
• Free and open source
• Great scalability and flexibility
• Inefficient on structured data
HadoopDB is trying to get the best of both worlds
• Storage layer of database systems with the parallelization and job scheduling layer of Hadoop
Hadapt is improving the code with all kinds of things that researchers don't want to do
• Full SQL support (via SMS planner)
• Speed up (and automate) replication and loading
• Easier deployment and management
• Automatic repartitioning upon node addition/removal