Tradeoffs Between Parallel Database Systems, Hadoop, and HadoopDB as Platforms for Petabyte-Scale Analysis
Daniel Abadi
Yale University
November 23rd, 2011
Data, Data, Everywhere
Data explosion
• Web 2.0 → more user data
• More devices that sense data
• More equipment that produces data at extraordinary rates (e.g. high-throughput sequencing)
• More interactions being tracked (e.g. clickstream data)
• More business processes being digitized
• More history being kept
Data becoming core to decision making, operational activities, and scientific process
• Want raw data (not aggregated versions)
• Want to run complex, ad-hoc analytics (in addition to reporting)
System Design for the Data Deluge
Shared-memory does not scale nearly well enough for petascale analytics
Shared-disk is adequate for many applications, especially CPU-intensive applications, but can have scalability problems for data I/O-intensive workloads
For scan performance, nothing beats putting CPUs next to the disks
• Partition data across CPUs/disks
• Shared-nothing designs increasingly being used for petascale analytics
Parallel Database Systems
Shared-nothing implementations have existed since the '80s
• Plenty of commercial options (Teradata, Microsoft PDW, IBM Netezza, HP Vertica, EMC Greenplum, Aster Data, many more)
• SQL interface, with UDF support
• Excel at managing and processing structured, relational data
• Query execution via relational operator pipelines (select, project, join, group by, etc.)
MapReduce
Data is partitioned across N machines
• Typically stored in a distributed file system (GFS/HDFS)
On each machine n, apply a function, Map, to each data item d
• Map(d) → {(key1, value1)}  ("map job")
• Sort output of all map jobs on n by key
• Send (key1, value1) pairs with the same key value to the same machine (using e.g. hashing)
On each machine m, apply the Reduce function to the (key1, {value1}) pairs mapped to it
• Reduce(key1, {value1}) → (key2, value2)  ("reduce job")
Optionally, collect output and present to the user
Example
Count occurrences of the words "cern" and "france" in all documents

map(d):
    words = split(d, ' ')
    foreach w in words:
        if w == 'cern':
            emit ('cern', 1)
        if w == 'france':
            emit ('france', 1)

reduce(key, valueSet):
    count = 0
    foreach v in valueSet:
        count += v
    emit (key, count)
[Diagram: data partitions feed map workers, whose (cern, 1) and (france, 1) pairs are routed to a CERN reducer and a France reducer]
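The same job can be simulated in a single Python process (a sketch only; the function names and driver code here are illustrative, not the Hadoop API, which would use Mapper/Reducer classes):

```python
from collections import defaultdict

def map_doc(doc):
    """Map phase: emit (word, 1) for each occurrence of the target words."""
    for w in doc.split():
        if w in ("cern", "france"):
            yield (w, 1)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_counts(key, values):
    """Reduce phase: sum the counts for one key."""
    return (key, sum(values))

docs = ["cern is in france", "france borders italy", "visit cern"]
pairs = [p for d in docs for p in map_doc(d)]
result = dict(reduce_counts(k, vs) for k, vs in shuffle(pairs).items())
print(result)  # {'cern': 2, 'france': 2}
```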
Relational Operators in MR
Straightforward to implement relational operators in MapReduce
• Select: simple filter in Map phase
• Project: project function in Map phase
• Join: Map produces tuples with join key as key; Reduce performs the join
Query plans can be implemented as a sequence of MapReduce jobs (e.g. Hive)
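The repartition join described above can be sketched in plain Python (illustrative table contents and tags, not Hadoop code): Map tags each tuple with its source table, the shuffle groups by join key, and Reduce pairs tuples across tables within each group.

```python
from collections import defaultdict

rankings = [("a.com", 10), ("b.com", 3)]                          # (pageURL, pageRank)
visits = [("1.2.3.4", "a.com", 5.0), ("5.6.7.8", "a.com", 1.5)]  # (sourceIP, destURL, adRevenue)

# Map phase: emit (join_key, tagged_tuple) from each input table.
pairs = [(url, ("R", rank)) for url, rank in rankings]
pairs += [(dest, ("V", ip, rev)) for ip, dest, rev in visits]

# Shuffle: group by join key.
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)

# Reduce phase: join tuples from the two tables within each group.
joined = []
for url, tuples in groups.items():
    ranks = [t[1] for t in tuples if t[0] == "R"]
    vs = [t[1:] for t in tuples if t[0] == "V"]
    for rank in ranks:
        for ip, rev in vs:
            joined.append((ip, url, rank, rev))

print(sorted(joined))
# [('1.2.3.4', 'a.com', 10, 5.0), ('5.6.7.8', 'a.com', 10, 1.5)]
```

Note that every matching tuple from both tables must be buffered at the reducer, one reason joins are a weak spot for plain MapReduce compared with a co-partitioned database join.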
Overview of Talk
Compare these two approaches to petascale data analysis
Discuss a hybrid approach called HadoopDB
Similarities
Both are suitable for large-scale data processing
• I.e. analytical processing workloads
• Bulk loads
• Not optimized for transactional workloads
• Queries over large amounts of data
• Both can handle relational and nonrelational queries (DBMSs via UDFs)
Differences
MapReduce can operate on in-situ data, without requiring transformation or loading
Schemas:
• MapReduce doesn't require them; DBMSs do
• Easy to write simple MR programs
Indexes
• MR provides no built-in support
Declarative vs. imperative programming
MapReduce uses a run-time scheduler for fine-grained load balancing
MapReduce checkpoints intermediate results for fault tolerance
Key (Not Fundamental) Difference
Hadoop
• Open source implementation of MapReduce
There exists no widely used open source parallel database system
• Commercial systems charge by the terabyte or CPU
• Big problem for "big data" companies like Facebook
Goal of Rest of Talk
Discuss our experience working with these systems
• Tradeoffs
• Include overview of SIGMOD 2009 benchmark paper
Discuss a hybrid system we built at Yale (HadoopDB)
• VLDB 2009 paper plus quick overviews of two 2011 papers
Three Benchmarks
Stonebraker Web analytics benchmark (SIGMOD 2009 paper)
TPC-H
LUBM
Web Analytics Benchmark
Goals
• Understand differences in load and query time for some common data processing tasks
• Choose a representative set of tasks that:
  - Both should excel at
  - MapReduce should excel at
  - Databases should excel at
Hardware Setup
100 node cluster
Each node:
• 2.4 GHz Core 2 Duo processor
• 4 GB RAM
• 2 × 250 GB SATA HDs (74 MB/sec sequential I/O)
Dual GigE switches, each with 50 nodes
• 128 Gbit/sec fabric
Connected by a 64 Gbit/sec ring
Benchmarked Software
Compare:
• Popular commercial row-store parallel database system (DBMS-X)
• Vertica (commercial column-store parallel database system)
• Hadoop
Grep
Used in original MapReduce paper
Look for a 3-character pattern in the 90-byte field of 100-byte records with schema:

  key VARCHAR(10) PRIMARY KEY
  field VARCHAR(90)

• Pattern occurs in 0.01% of records

SELECT * FROM T WHERE field LIKE '%XYZ%'

1 TB of data spread across 25, 50, or 100 nodes
• ~10 billion records, 10–40 GB/node
Expected Hadoop to perform well
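The grep task maps naturally onto a map-only job, since selection needs no reduce phase. A plain-Python sketch (record tuples and driver loop are illustrative):

```python
def grep_map(record, pattern="XYZ"):
    """Map phase for grep: emit the whole record if the pattern occurs in `field`.
    No reduce phase is needed; this is a pure selection."""
    key, field = record
    if pattern in field:
        yield record

records = [("k1", "aaaXYZbbb"), ("k2", "no match here"), ("k3", "XYZ at start")]
matches = [r for rec in records for r in grep_map(rec)]
print(matches)  # [('k1', 'aaaXYZbbb'), ('k3', 'XYZ at start')]
```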
1 TB Grep – Load Times1 TB Grep – Load Times
0
5000
10000
15000
20000
25000
30000
25 nodes 50 nodes 100 nodes
Tim
e (
se
co
nd
s)
Vertica
DBMS-X
Hadoop
1 TB Grep – Query Times
[Chart: query time (seconds) at 25, 50, and 100 nodes for Vertica, DBMS-X, and Hadoop]
• All systems scale linearly (to 100 nodes)
• Database systems have better compression (and can operate directly on compressed data)
• Vertica's compression works better than DBMS-X's
Analytical TasksAnalytical Tasks
CREATE TABLE CREATE TABLE UserVisits UserVisits ( (
sourceIP VARCHAR(16),sourceIP VARCHAR(16),
destURL VARCHAR(100), destURL VARCHAR(100),
visitDate DATE, adRevenue FLOAT, visitDate DATE, adRevenue FLOAT,
userAgent VARCHAR(64), userAgent VARCHAR(64),
countryCode VARCHAR(3), countryCode VARCHAR(3),
languageCode VARCHAR(6), languageCode VARCHAR(6),
searchWord VARCHAR(32), searchWord VARCHAR(32),
duration INT ); duration INT );
• Simple web processing schema
• Task mix both relational and non-relational
• 600,000 randomly generated documents /node
• Embedded URLs reference documents on other nodes
• 155 million user visits / node• ~20 GB / node
• 18 million rankings / node• ~1 GB / node
CREATE TABLE Documents ( url VARCHAR(100) PRIMARY KEY, contents TEXT );
CREATE TABLE Rankings (pageURL VARCHAR(100) PRIMARY KEY, pageRank INT, avgDuration INT );
Loading – User Visits
Other tables show similar trends
[Chart: load time (seconds) at 10, 25, 50, and 100 nodes for Vertica, DBMS-X, and Hadoop]
Aggregation Task
Simple aggregation query to find adRevenue by IP prefix:

SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue)
FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7)

Parallel analytics query for DBMS
• (Compute partial aggregate on each node, merge answers to produce result)
• Yields 2,000 records (24 KB)
Aggregation Task Performance
[Chart: query time (seconds) at 10, 25, 50, and 100 nodes for Vertica, DBMS-X, and Hadoop]
Join Task
Join Rankings and UserVisits for sourceIP analysis and revenue attribution:

SELECT sourceIP, AVG(pageRank), SUM(adRevenue)
FROM Rankings, UserVisits
WHERE pageURL = destURL
AND visitDate BETWEEN '2000-01-15' AND '2000-01-22'
GROUP BY sourceIP
Join Task
Database systems can co-partition by join key!
[Chart: query time (seconds) at 10, 25, 50, and 100 nodes for Vertica, DBMS-X, and Hadoop]
UDF Task
Calculate PageRank over a set of HTML documents
Performed via a UDF
[Chart: query time (seconds) at 10, 25, 50, and 100 nodes for DBMS and Hadoop]
• DBMS clearly doesn't scale
Scalability
Except for DBMS-X load time and UDFs, all systems scale near-linearly
BUT: only ran on 100 nodes
As node counts approach 1,000, other effects come into play
• Faults go from being rare to not so rare
• It is nearly impossible to maintain homogeneity at scale
Fault Tolerance and Cluster Heterogeneity Results
[Chart: percentage slowdown in the fault tolerance and slowdown tolerance tests for DBMS and Hadoop]
Database systems restart the entire query upon a single node failure, and do not adapt if a node is running slowly
Benchmark Conclusions
Hadoop is consistently more scalable
• Checkpointing allows for better fault tolerance
• Runtime scheduling allows for better tolerance of unexpectedly slow nodes
• Better parallelization of UDFs
Hadoop is consistently less efficient for structured, relational data
• Reasons both fundamental and non-fundamental
• Needs better support for compression and direct operation on compressed data
• Needs better support for indexing
• Needs better support for co-partitioning of datasets
Best of Both Worlds Possible?
Many of Hadoop's deficiencies are not fundamental
• Result of initial design for unstructured data
HadoopDB: use Hadoop to coordinate execution of multiple independent (typically single-node, open source) database systems
• Flexible query interface (accepts both SQL and MapReduce)
• Open source (built using open source components)
HadoopDB Architecture
SMS Planner
HadoopDB Experiments
VLDB 2009 paper ran the same Stonebraker Web analytics benchmark
Used PostgreSQL as the DBMS storage layer
Join Task
[Chart: query time (seconds) at 10, 25, and 50 nodes for Vertica, DBMS-X, Hadoop, and HadoopDB]
• HadoopDB much faster than Hadoop
• Doesn't quite match the database systems in performance:
  - Hadoop start-up costs
  - PostgreSQL join performance
  - Fault tolerance/run-time scheduling
UDF Task
[Chart: query time (seconds) at 10, 25, and 50 nodes for DBMS, Hadoop, and HadoopDB]
Fault Tolerance and Cluster Heterogeneity Results
[Chart: percentage slowdown in the fault tolerance and slowdown tolerance tests for DBMS, Hadoop, and HadoopDB]
HadoopDB: Current Status
Recently commercialized by Hadapt
• Raised $9.5 million in venture capital
SIGMOD 2011 paper benchmarking HadoopDB on TPC-H data
• Added various other techniques:
  - Column-store storage
  - 4 different join algorithms
  - Referential partitioning
VLDB 2011 paper on using HadoopDB for graph data (with RDF-3X for storage)
TPC-H Benchmark Results
Graph Experiments
Invisible Loading
Data starts in HDFS
Data is immediately available for processing (immediate gratification paradigm)
Each MapReduce job causes data movement from HDFS to the database systems
Data is incrementally loaded, sorted, and indexed
Query performance improves "invisibly"
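A toy sketch of the invisible-loading idea (plain Python; the data structures and single-key query are illustrative, not the HadoopDB implementation): each query scans whatever raw data remains, and as a side effect migrates the rows it touches into a sorted, indexed store that later queries consult first.

```python
import bisect

hdfs_rows = [(5, "e"), (2, "b"), (9, "j"), (1, "a"), (7, "g")]  # unloaded raw (key, value) rows
loaded_keys = []    # sorted keys: the incrementally built "index"
loaded_vals = {}    # key -> value for loaded rows

def query(key):
    """Answer a point query; invisibly migrate scanned rows into the sorted store."""
    # Fast path: binary search over already-loaded, sorted data.
    i = bisect.bisect_left(loaded_keys, key)
    if i < len(loaded_keys) and loaded_keys[i] == key:
        return loaded_vals[key]
    # Slow path: scan raw data, loading each row we touch as a side effect.
    answer = None
    while hdfs_rows:
        k, v = hdfs_rows.pop()
        bisect.insort(loaded_keys, k)
        loaded_vals[k] = v
        if k == key:
            answer = v
            break  # incremental: stop scanning once this query is answered
    return answer

print(query(1))   # first query scans raw data -> a
print(query(1))   # repeat query hits the sorted store -> a
```

The important property is that no explicit load step blocks the first query; the cost of sorting and indexing is amortized invisibly across the queries that actually touch the data.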
Conclusions
Parallel database systems can be used for many data-intensive tasks
• Scalability can be an issue at extreme scale
• Parallelization of UDFs can be an issue
Hadoop is becoming increasingly popular and more robust
• Free and open source
• Great scalability and flexibility
• Inefficient on structured data
HadoopDB is trying to get the best of both worlds
• Storage layer of database systems with the parallelization and job scheduling layer of Hadoop
Hadapt is improving the code with all kinds of things that researchers don't want to do
• Full SQL support (via SMS planner)
• Speed up (and automate) replication and loading
• Easier deployment and management
• Automatic repartitioning upon node addition/removal