INFO-H-419: Data Warehouses project
Hadoop in Data Warehousing
by Alexey Grigorev
Hadoop: In this Presentation
1. Introduction
2. Origins
3. MapReduce
4. Hadoop as MapReduce Implementation
5. Data Warehouse on Hadoop
6. Hadoop and Data Warehousing
7. Conclusions
Why?
• Lots of data
• How to deal with it?
• Hadoop to the rescue!
• When to use?
• When not to use?
• Curiosity
MapReduce: Origins
• Functional Programming
• Higher-order functions that operate on lists
• map
• apply to each element of the list
• reduce = fold = accumulate
• aggregate a list and produce one value of output
• No side effects
MapReduce: Origins
• (define (+1 el) (+ el 1))
• (map +1 (list 1 2 3)) ⇒ (list 2 3 4)
• (reduce + 0 (list 2 3 4)) ⇒ 9
• (reduce + 0 (map +1 (list 1 2 3))) ⇒ 9
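For readers less familiar with Scheme, the same three expressions can be sketched in Python. The name plus_one is an assumption standing in for the slide's +1 function, since +1 is not a valid Python identifier.

```python
from functools import reduce
from operator import add


def plus_one(el):
    """Counterpart of the slide's +1 function."""
    return el + 1


# (map +1 (list 1 2 3))
mapped = list(map(plus_one, [1, 2, 3]))

# (reduce + 0 (list 2 3 4))
total = reduce(add, mapped, 0)

# (reduce + 0 (map +1 (list 1 2 3)))
composed = reduce(add, map(plus_one, [1, 2, 3]), 0)

print(mapped, total, composed)
```

Neither function modifies its input, which is the "no side effects" property the slides rely on.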
MapReduce: Origins
• These functions have no side effects
• So they can be parallelized easily
• The input data can be split into chunks:
• (list 1 2 3 4) ⇒ (list 1 2) and (list 3 4)
• Apply map to each chunk separately, and then combine (reduce) the results
MapReduce: Origins
• Mapping separately:
• (define res1 (reduce + 0 (map +1 (list 1 2))))
• (reduce + res1 (map +1 (list 3 4)))
• This is the same as (reduce + 0 (map +1 (list 1 2 3 4)))
• Note that combining the results of separate chunks only works if the function passed to reduce is associative, as + is
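A minimal Python sketch of why the property of the reduce function matters, assuming each chunk is reduced independently (as parallel workers would do) and the partial results are combined afterwards. The helper names plus_one, reduce_chunk and combine_chunks are hypothetical.

```python
from functools import reduce
from operator import add, sub


def plus_one(el):
    return el + 1


def reduce_chunk(op, chunk):
    """Map and reduce one chunk on its own, as a single worker would."""
    return reduce(op, map(plus_one, chunk))


def combine_chunks(op, chunks):
    """Reduce every chunk independently, then combine the partial results."""
    return reduce(op, [reduce_chunk(op, chunk) for chunk in chunks])


chunks = [[1, 2], [3, 4]]
whole = [1, 2, 3, 4]

# Addition is associative: chunked and unchunked results agree.
add_chunked = combine_chunks(add, chunks)
add_whole = reduce_chunk(add, whole)

# Subtraction is not associative: the two results differ.
sub_chunked = combine_chunks(sub, chunks)
sub_whole = reduce_chunk(sub, whole)
```

With + both computations give 14, while with - the chunked version gives 0 and the unchunked one gives -10, so only the first can be split safely across workers.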
MapReduce
• A map function
• takes a key-value pair (in_key, in_val)
• produces zero or more key-value pairs: intermediate results
• intermediate results are grouped by key
• A reduce function
• for each group in the intermediate results
• aggregates and produces the final output
MapReduce Stages
Each MapReduce job is executed in 3 stages:
• map stage: apply map to each key-value pair
• group together the intermediate results by key
• reduce stage: apply reduce to each group
MapReduce Stages
[Diagram: four data sources each feed a map task; the map outputs are passed on to three reduce tasks]
map: (in_key, in_val) -> [(out_key, out_val)]
reduce: (out_key, [out_val]) -> [res_val]
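The three stages and the two signatures can be illustrated with a small in-memory sketch. This is not how Hadoop runs a job; run_mapreduce and the sample job (summing word lengths per first letter) are assumptions made up for the illustration.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run one MapReduce job in memory over (in_key, in_val) records."""
    # Map stage: each input pair yields zero or more intermediate pairs.
    intermediate = []
    for in_key, in_val in records:
        intermediate.extend(map_fn(in_key, in_val))

    # Group stage: collect the intermediate values by key.
    groups = defaultdict(list)
    for out_key, out_val in intermediate:
        groups[out_key].append(out_val)

    # Reduce stage: apply reduce to each group.
    return {key: reduce_fn(key, vals) for key, vals in groups.items()}


# Hypothetical job: total length of the words starting with each letter.
def first_letter_map(in_key, in_val):
    return [(word[0], len(word)) for word in in_val.split()]


def sum_reduce(out_key, out_vals):
    return [sum(out_vals)]


records = [("doc1", "map merge"), ("doc2", "reduce run map")]
result = run_mapreduce(records, first_letter_map, sum_reduce)
```

The map function never sees other records and the reduce function never sees other groups, which is what lets a real framework spread both stages over many machines.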
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean dictum justo est, quis
sagittis leo tincidunt sit amet. Donec scelerisque rutrum quam non sagittis. Phasellus sem
nisi, cursus eu lacinia eu, tempor ac eros. Class aptent taciti sociosqu ad litora torquent per
conubia nostra, per inceptos himenaeos. In mollis elit quis orci congue, quis aliquet mauris
mollis. Interdum et malesuada fames ac ante ipsum primis in faucibus.
Proin euismod non quam vitae pretium. Quisque vel nisl et leo volutpat rhoncus quis ac eros.
Sed lacus tellus, aliquam non ullamcorper in, dictum at magna. Vestibulum consequat
egestas lacinia. Proin tempus rhoncus mi, et lacinia elit ornare auctor. Sed sagittis euismod
massa ut posuere. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis
fringilla dolor ornare mi dictum ornare.
MapReduce Example
def map(input_key, doc):
    # Emit the intermediate pair (w, 1) for each word w in the document
    for w in doc.split():
        yield (w, 1)

def reduce(output_key, output_vals):
    # Sum the ones collected for this word and emit the total
    res = 0
    for v in output_vals:
        res += v
    yield res
MapReduce Example
• map stage: output 1 for each word w
• group a list of pairs (w, 1) into (w, [1, 1, ..., 1])
• reduce stage: for each w calculate how many ones there are
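The data produced by each stage can be inspected with a short sketch. The sample sentence is a made-up fragment, not the full text from the earlier slide, and the variable names are assumptions.

```python
from itertools import groupby

text = "sit amet dolor sit amet"

# Map stage: one (w, 1) pair per word.
pairs = [(w, 1) for w in text.split()]

# Group stage: sort by key, then collect the ones for each word.
grouped = {
    w: [one for _, one in group]
    for w, group in groupby(sorted(pairs), key=lambda pair: pair[0])
}

# Reduce stage: count the ones in each group.
counts = {w: sum(ones) for w, ones in grouped.items()}
```

Here grouped holds the (w, [1, 1, ...]) lists from the second bullet and counts holds the final word frequencies.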
MapReduce Example: Result
• amet: 2
• ante: 2
• aptent: 1
• consectetur: 1
• dictum: 3
• dolor: 2
• elit: 3
• ...
Hadoop
Image: http://flickr.com/photos/erikeldridge/3614786392/
Hadoop
“... is a framework that allows for the distributed processing of large data
sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on hardware to
deliver high-availability, the library itself is designed to detect and handle
failures at the application layer, so delivering a highly-available service on
top of a cluster of computers, each of which may be prone to failures.”
Hadoop
• Open Source implementation of MapReduce
• "Hadoop":
• HDFS
• Hadoop MapReduce
• HBase
• Hive
• ... many others
Hadoop Cluster: Terminology
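• Name Node: the master node that orchestrates the process (strictly, it keeps the HDFS metadata, while in Hadoop 1.x the JobTracker schedules the tasks)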
• Workers: nodes that do the computation
• Mappers do the map phase
• Reducers do the reduce phase
Hadoop
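[Diagram: data flow of a Hadoop job]
• mapper: Read (file in HDFS) → Map → Combine, with the output kept in the mapper's local storage
• Pull: reducers fetch the mapper output
• reducer: Copy → Sort → Reduce, using the reducer's local storage, then write the result
Source: http://escience.washington.edu/get-help-now/what-hadoop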
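The Combine step in the diagram can be pictured as a local pre-aggregation on each mapper, so that less data has to be pulled by the reducers. The sketch below assumes a word-count job; map_words and combine are hypothetical names, not Hadoop API calls.

```python
from collections import Counter


def map_words(doc):
    """Map step: one (word, 1) pair per word."""
    return [(w, 1) for w in doc.split()]


def combine(pairs):
    """Combine step: pre-aggregate one mapper's output before it is pulled."""
    local = Counter()
    for key, val in pairs:
        local[key] += val
    return sorted(local.items())


# Each document stands for the input split of one mapper.
mapper_outputs = [combine(map_words(doc)) for doc in ["a b a", "b b c"]]

# Reducer side: merge the already combined outputs of all mappers.
totals = Counter()
for output in mapper_outputs:
    for key, val in output:
        totals[key] += val
```

Without the combiner the first mapper would ship three pairs; with it, only two. This shortcut is valid here because addition is associative and commutative.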
Fault-Tolerance ≈ Load-Balancing
• No execution plan
• Node failed ⇒ Task reassigned
• Node done ⇒ Another task assigned
• No communication costs
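Why fault tolerance and load balancing come from the same mechanism can be seen in a toy simulation. This is only a sketch under simplifying assumptions (one task at a time, failures known in advance); run_tasks is a hypothetical helper and does not reflect Hadoop's actual scheduler.

```python
from collections import deque


def run_tasks(tasks, workers, failing):
    """Hand out tasks one by one; requeue a task when its worker fails."""
    pending = deque(tasks)
    alive = deque(workers)
    done = {}
    while pending:
        if not alive:
            raise RuntimeError("no workers left")
        task = pending.popleft()
        worker = alive.popleft()
        if worker in failing:
            # Node failed => the task goes back to the queue,
            # and the failed worker is not returned to the pool.
            pending.append(task)
            continue
        done[task] = worker
        # Node done => it becomes available for another task.
        alive.append(worker)
    return done


assignment = run_tasks(["t1", "t2", "t3"], ["w1", "w2"], failing={"w2"})
```

There is no precomputed execution plan: a task is simply given to whichever worker is free, so a failed node and a slow node are handled by the same queue.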
Advantages
• Simple, especially for programmers who know FP
• Fault tolerant
• No schema, can process any data
• Flexible
• Cheap and runs on commodity hardware
Disadvantages
• No declarative high-level language like SQL
• Performance issues:
• Map and Reduce are blocking
• Name Node: single point of failure
• It's young
Disadvantages
[Abouzeid, Azza et al 2009]
Hadoop as a Data Warehouse
• Cheetah
• Hive
Cheetah
• Typical DW relation-like schemas
• ... But not exactly
• They call it virtual views
Cheetah
• Virtual views consist of columns that can be queried
• Everything inside is entirely denormalized
• Append-only design and slowly changing dimensions
• Proprietary
Hive
• A data warehousing solution built by Facebook
• For Big data analysis:
• in 2010 (4 years ago!), 30+ PB
• Has its own data model
• HiveQL: a declarative SQL-like language for ad-hoc querying
HiveQL
Tables
status_updates(userid int, status string, ds string)
profiles(userid int, school string, gender int)

LOAD DATA LOCAL INPATH 'logs/status_updates'
INTO TABLE status_updates
PARTITION (ds='2009-03-20')
HiveQL
FROM
  (SELECT a.status, b.school, b.gender
   FROM status_updates a JOIN profiles b
   ON (a.userid = b.userid and a.ds = '2009-03-20')) subq1
INSERT OVERWRITE TABLE gender_summary
  PARTITION (ds='2009-03-20')
  SELECT subq1.gender, count(1)
  GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary
  PARTITION (ds='2009-03-20')
  SELECT subq1.school, count(1)
  GROUP BY subq1.school
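What this multi-insert query computes can be sketched on in-memory data. The sample rows, school names and gender codes below are made up for illustration; only the column layout follows the table definitions.

```python
from collections import Counter

# Hypothetical sample rows.
status_updates = [  # (userid, status, ds)
    (1, "hello", "2009-03-20"),
    (2, "exams", "2009-03-20"),
    (1, "old post", "2009-03-19"),
]
profiles = {  # userid -> (school, gender)
    1: ("School A", 0),
    2: ("School B", 1),
}

# subq1: join on userid, keeping only the rows of the 2009-03-20 partition.
subq1 = [
    (status, *profiles[userid])
    for userid, status, ds in status_updates
    if ds == "2009-03-20" and userid in profiles
]

# Both INSERT clauses aggregate the same join result.
gender_summary = Counter(gender for _, _, gender in subq1)
school_summary = Counter(school for _, school, _ in subq1)
```

The point of the single FROM clause is that the join is evaluated once and feeds both summaries, instead of being recomputed for each target table.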
HiveQL
REDUCE subq2.school, subq2.meme, subq2.cnt
  USING 'top10.py' AS (school, meme, cnt)
FROM (
  SELECT subq1.school, subq1.meme, count(1) as cnt
  FROM
    (MAP b.school, a.status
     USING 'meme_extractor.py'
     AS (school, meme)
     FROM status_updates a JOIN profiles b
     ON (a.userid = b.userid)) subq1
  GROUP BY subq1.school, subq1.meme
  DISTRIBUTE BY school, meme
  SORT BY school, meme, cnt desc
) subq2
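The slides do not show the contents of top10.py, so the following is only a guess at what such a script might look like. It assumes that rows reach the script as tab-separated lines, which is the default for Hive's script operators, and that they arrive already sorted, so the script just keeps the first n rows it sees for each school. The function name top_n is hypothetical.

```python
def top_n(lines, n=10):
    """Yield the first n rows seen for each school from tab-separated input."""
    seen = {}
    for line in lines:
        school, meme, cnt = line.rstrip("\n").split("\t")
        seen[school] = seen.get(school, 0) + 1
        if seen[school] <= n:
            yield "\t".join((school, meme, cnt))


# Hypothetical input rows: school, meme, cnt.
sample = ["A\tcat\t5\n", "A\tdog\t3\n", "A\teel\t1\n", "B\tcat\t2\n"]
top2 = list(top_n(sample, n=2))
```

Used as an actual streaming script, the same function would be fed sys.stdin and each yielded row would be printed to standard output.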
Hadoop + Data Warehouse
Image: http://www.flickr.com/photos/mrflip/5150336351/in/photostream/
Hadoop + Data Warehouse
• Hadoop and Data Warehouses can co-exist
• DW: OLAP, BI, transactional data
• Hadoop: Raw, unstructured data
ETL
• Extract: load to HDFS, parse, prepare
• Run some analysis
• Transform: clean data and transform to some structured format
• with MapReduce
• Load: extract from HDFS, load to DW
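The Transform step is essentially a map function that turns raw lines into clean, structured rows. A minimal sketch, assuming a hypothetical raw format of "userid|status" per line and CSV as the structured format handed to the DW loader:

```python
import csv
import io


def transform(raw_line):
    """Map step: parse one raw line into a clean (userid, status) row, or drop it."""
    parts = raw_line.strip().split("|")
    if len(parts) != 2 or not parts[0].strip().isdigit():
        return []  # malformed record: emit nothing
    userid, status = parts
    return [(int(userid), status.strip().lower())]


raw = ["1| Hello World ", "garbage", "2|ETL on Hadoop"]
rows = [row for line in raw for row in transform(line)]

# Stand-in for the Load step: serialize the structured rows as CSV.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
```

Because transform emits zero or one row per input line and keeps no state, it fits the map signature from the earlier slides and can run on every HDFS block in parallel.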
ETL: examples
• Text processing
• Call center records analysis
• extract sentiment
• link to profile
• which customers are more important to keep?
• Image processing
Active Storage
• Don't delete the data after processing
• Hadoop storage is cheap: it can store anything
• Run more analysis when needed
• For example: extract new keywords/features from the old dataset
Active Storage - 2
• Up to 80% of data is dormant (or cold)
• Hadoop storage can be far cheaper than high-cost data management solutions
• Move this data to Hadoop
• When needed, quickly analyze it there or move it back to the DW
Analytical Sandbox
Images: http://www.flickr.com/photos/pasukaru76/9824401426/ ⇒ http://www.flickr.com/photos/pasukaru76/4977447932/
Analytical Sandbox
• What are we looking for in this data?
• No structure - hard to know
• Run ad-hoc Hive queries to see what's there
Conclusions
• Hadoop is becoming more and more popular
• Many companies plan to adopt it
• Best used with existing DW solutions
• as an ETL tool
• as Active Storage
• as Analytical Sandbox
References
1. Lee, Kyong-Ha, et al. "Parallel data processing with MapReduce: a survey." ACM SIGMOD Record 40.4 (2012): 11-20.
2. "MapReduce vs Data Warehouse." Webpage. Accessed 15/12/2013.
3. Ordonez, Carlos, Il-Yeol Song, and Carlos Garcia-Alvarado. "Relational versus non-relational database systems for data warehousing." Proceedings of the ACM 13th international workshop on Data warehousing and OLAP. ACM, 2010.
4. A. Awadallah, D. Graham. "Hadoop and the Data Warehouse: When to Use Which." (2011). (by Cloudera and Teradata)
5. Thusoo, Ashish, et al. "Hive: a warehousing solution over a map-reduce framework." Proceedings of the VLDB Endowment 2.2 (2009): 1626-1629.
6. Chen, Songting. "Cheetah: a high performance, custom data warehouse on top of MapReduce." Proceedings of the VLDB Endowment 3.1-2 (2010): 1459-1468.
7. "How (and Why) Hadoop is Changing the Data Warehousing Paradigm." Webpage. Accessed 15/12/2013.
8. P. Russom. "Integrating Hadoop into Business Intelligence and Data Warehousing." (2013).
9. M. Ferguson. "Offloading and Accelerating Data Warehouse ETL Processing Using Hadoop."
10. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
11. "What is Hadoop?" Webpage. Accessed 15/12/2013.
12. Apache Hadoop project home page.
13. Apache HBase home page.
14. Apache Mahout home page.
15. "How Hadoop Cuts Big Data Costs." Webpage. Accessed 05/01/2014.
16. "The Impact of Data Temperature on the Data Warehouse." Whitepaper by Teradata (2012).
17. Abouzeid, Azza, et al. "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads." Proceedings of the VLDB Endowment 2.1 (2009): 922-933.
Thank you
Prepared with Shower