View
10
Download
0
Category
Preview:
Citation preview
YANG, Lin
COMP 6311 Spring 2012
CSE
HKUST
1
Outline Background
Overview of Big Data Management
Comparison btw PDB and MR
DB Solution on MapReduce
Conclusion
2
Data-driven World Science
Data bases from astronomy, genomics, environmental data, transportation data, …
Humanities and Social Sciences Scanned books, historical documents, social interactions
data, …
Business & Commerce Corporate sales, stock market transactions, census,
airline traffic, …
Entertainment Internet images, Hollywood movies, MP3 files, …
Medicine MRI & CT scans, patient records, …
3
Statistics from 2009
350 Million Named users
175 Million Active users in one day
35 Million Users updating status each day
55 Million Status each day
2.5 Billion Photos/Month
1.6 Million Active pages
Growth: 12 TB/day, 2 PB/year
Global data volume: 8.7 PB
4
A Case of Big Data
What can we do with these data? What can we do?
Scientific breakthroughs Business process
efficiencies Realistic special effects Improve quality-of-life:
healthcare, transportation, environmental disasters, daily life, …
Could We Do More?
YES: but need major advances in our capability to analyze those data
5
Challenges
6
Demand
Capacity
Time
Reso
urc
es
Time
Vo
lum
e
Moore
Data
Demand
Capacity
Time
Reso
urc
es
Basic Requirements
Capacity Elasticity
Challenges(Cont.) More Requirements
High performance
Fault tolerance
Load Balance
Cost-efficient parallelization
High availability and disaster recovery
…
7
Outline Background
Overview of Big Data Management
Comparison btw PDB and MR
DB Solution on MapReduce
Conclusion
8
The State of The Art Parallel DBMS technologies
Proposed in the late eighties
Matured over the last two decades
Multi-billion dollar industry: Proprietary DBMS Engines intended as Data Warehousing solutions for very large enterprises
MapReduce
pioneered by Google
popularized by Yahoo! (Hadoop)
9
Parallel DBMS Popularly used for more than two decades
Research Projects: Gamma, Grace, …
Commercial: Multi-billion dollar industry but access to only a privileged few
Relational Data Model
Indexing
SQL interface
Advanced query optimization
10
MapReduce Dean et al., OSDI’04
Key contribution:
Propose a simple but powerful programming model
Parallelization
Fault-tolerance
Load balancing
Implement the framework of this model and provide simple interface to users.
11
MapReduce(Cont.) Data flow of MR
12
Input Data M splits M pieces of intermediate output
R splits R pieces of outputs
Partition_1
Partition_2 Reduce
Map
M = input size / 64 MB R = #machine * 2
MapReduce(Cont.) Execution overview
13
MapReduce(Cont.) Word counter example
14 Credit: Martijn van Groningen. Introduction to Hadoop.
MapReduce(Cont.) Parallelization
Map and Reduce
Fault tolerance Master pings workers periodically Master write periodic checkpoint Re-execute the unfinished Reduce and all Map tasks of failed
machine
Load balance Break tasks into small granularity, schedule by master
Optimization Locality: Try to schedule Map task to the node has the
corresponding data Backup Tasks: When the job is close to completion, run backup
executions of remaining tasks, the first completed one wins
15
Introduction to Hadoop is a popular open-source map-reduce
implementation which is being used as an alternative to store and process extremely large data sets on commodity hard-ware.
Inspired by Google MapReduce and GFS
16 Architecture of Hadoop
Trend of Big Data Management
17 Credit: Big Philippe Julio. Big Data Architecture.
Outline Background
Overview of Big Data Management
Comparison btw PDB and MR
DB Solution on MapReduce
Conclusion
18
Comparison btw PDB & MR Pavlo et al., SIGMOD’09
Candidates
Hadoop: an open-source implementation of MapReduce.
DBMS-X: a parallel relational database.
Vertica: a parallel DBMS which stores data in column-base format.
Task 1
Grep specific-pattern string from large scale of records
5.6 million records, 100 bytes per record
Test 1: Fixes the size of the data per node as 535MB
Test 2: Fixes the total dataset size as 1TB
19
Comparison(Cont.) Data Loading
20
Observation: 1. Hadoop outperforms both of PDB 2. For DBMS-X, the data was actually loaded on each node sequentially, while the additional housekeeping can be done in parallel across nodes;
Comparison(Cont.) Grep
21
Observations: 1. For Figure 4, such little data is being
processed that the Hadoop start-up costs become the limiting factor in its performance( 10–25 seconds);
2. For Figure5, as the data volume increased, Hadoop performs almost as fast as PDB;
Comparison(Cont.) Task 2: analytical works
Data Schema & Volume
22
HTML Doc(600,000 records)
Attributes Type
url VARCHAR(100)
Contents TEXT
UserVisits(155 M records, 20GB/Node)
Attributes Type
srcIP VARCHAR(16)
dstURL VARCHAR(100)
visitDate DATE
adRevenue FLOAT
userAgent VARCHAR(64)
countryCode VARCHAR(3)
langCode VARCHAR(6)
searchWord VACHAR(32)
Duration INT
Rankings(18 M records, 1GB/node)
Attributes Type
pageURL VARCHAR(100)
pageRank INT
avgDuration INT
Comparison(Cont.) Data loading Selection
23
Find the pageURL in the Rankings table with a pageRank above a user-defined threshold.
Observations: 1. With the help of index, PDB
outperform hadoop significantly; 2. As the number of node increased,
the start-up cost would increase too.
Comparison(Cont.) Aggregation
24
Calculate the total adRevenue generated for each srcIP in the UserVisits table (20GB/node), grouped by the srcIP (Fig.7) or 7-character prefix of srcIP(Fig.8)
Observations: 1. PDB outperform hadoop; 2. For PDB, communication cost
dominate the execution time 3. Since Vertica is column-store, it
could perform better from not reading unused parts of table.
Comparison(Cont.) Join
25
Observations: 1. PDB outperforms Hadoop again! 2. PDB use indexes while Hadoop
can only perform complete scan. 3. UserVisits and Rankings in PDB
are partitioned by the join key, so PDB could do the join locally on each node without any network overhead
Find the srcIP that generated the most revenue within a particular date range, then calculate the average pageRank of all the pages visited during this interval.
SQL MR
SELECT INTO Temp srcIP, AVG(pageRank) as AvgRank, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date(‘2000-01-15’) AND Date(‘2000-01-22’) GROUP BY UV.sourceIP; SELECT sourceIP, totalRevenue, avgRank FROM Temp ORDER BY totalRevenue DESC LIMIT 1;
1. Filter on UserVisits and join wth Rankings
2. Compute the total adRevenue and avgRank based on srcIP
3. Find the one with largest adRevenu
Comparison(Cont.) UDF Aggregation
26
Scan the HTML documents and search for all the URLs appeared, and count the reference number across the entire set for each unique URL
Observations: 1. Bottom segment is the time to
execute the UDF and upper segment represent the query time;
2. Both DBMS-X and Hadoop have approximately performance
SQL MR
SELECT INTO Temp F(contents) FROM Documents; SELECT url, SUM(value) FROM Temp GROUPBY url; F is a user-defined function, which parses the contents of each record in the Documents table and emits URLs into the database.
Similar to Grep task
Quick Summary At the scale of experiences above, parallel DBMS perform
much better than Hadoop
the result of a number of technologies developed over the past 25 years: B-Tree index, column-store, compression algorithm, sophisticated query engine and etc.
Hadoop get a better fault tolerance with performance penalty.
Hadoop is more easy to set up and use the PDB at large scale of nodes.
PDB don’t do a good job in UDF aggregation.
The trend of both system is to move toward each other
27
Outline Background
Overview of big data management
Comparison btw PDB and MR
DB Solution on MapReduce
Conclusion
28
Hive Thusoo et al., VLDB’09
Problem MapReduce requires developers to write custom
programs which are hard to maintain and reuse.
Analyst are familiar with SQL
Key contribution Proposed an open-source data warehouse solution built
on top of Hadoop.
Hive provides HiveQL, a SQL-like query language which support select, join, aggregate, union all and sub-queries in from clause.
29
Hive(Cont.) Metastore: system catalog.
Thrift Server: a framework for cross-language services.
Driver: manages the life cycle of a HiveQL statement during compilation, optimization and execution.
30
Hive(Cont.) FROM (SELECT a.status, b.school, b.gender FROM status_updates a JOIN profiles b ON (a.userid = b.userid and a.ds='2009-03-20' ) ) subq1 INSERT OVERWRITE TABLE gender_summary PARTITION(ds='2009-03-20') SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender INSERT OVERWRITE TABLE school_summary PARTITION(ds='2009-03-20') SELECT subq1.school, COUNT(1) GROUP BY subq1.school
31
Hive(Cont.) Future work
Make HiveQL subsume SQL
Add cost-based optimizer
Columnar storage and more intelligent data placement
Performance enhancement
Integrate with commercial BI tools
Multi-query optimization and generic n-way joins in a single MR job
32
HadoopDB Azza Abouzeid et al. VLDB’09
Problem
It’s the age of big data.
Properties for large data analysis :
Performance
Flexible query interface
Fault tolerance
Load balance
33
Parallel database
MapReduce
Can we have both of them?
HadoopDB(Cont.) Key contribution:
To build a hybrid system to archive all the required properties.
Basic idea:
Connect multiple single-node database using Hadoop as task coordinator and net communication layer (Fault tolerance & Load balance)
Queries are parallelized across nodes using MapReduce framework
Queries are pushed inside of corresponding database note as much as possible (Performance & Multi-interface)
34
HadoopDB(Cont.)
35
Translate HiveQL to MapReduce, and translate some of MR back to SQL
Repartition data on a given key or breaking apart single node data into chunks
maintains meta information about the databases
Connect to database and execute SQL and return key/value pairs
HadoopDB(Cont.) SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY
YEAR(saleDate);
36
HadoopDB(Cont.)
37
Outline Background
Overview of big data management
Comparison btw PDB and MR
DB Solution on MapReduce
Conclusion
38
Conclusion It is the age of big Data now
Major technologies for big data management
Parallel DBMS: Well studied and wildly used
MapReduce: Attracting new technology with bright future while exist lots of drawbacks, especially in the aspect of performance
Hive and HadoopDB give a good paradigm for build DB solution on the top of MapReduce
There are still lots of work could be done in the field of big data management.
39
Reference Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data
Processing on Large Clusters. OSDI’04
Andrew Pavlo, Erik Paulson, Alexander Rasin. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD’09
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain. Hive - A Warehousing Solution Over a Map-Reduce Framework. VLDB’09
Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi. HadoopDB: An Architectural Hybrid of MapReduce and
DBMS Technologies for Analytical Workloads. VLDB’09
Divyakant Agrawal, Sudipto Das, AmrEl Abbadi. Big Data and Cloud Computing: Current State and Future Opportunities. EDBT’11
40
41
Recommended