Hadoop in Data Warehousing

INFO-H-419: Data Warehouses project

by Alexey Grigorev


Hadoop: In this Presentation

1. Introduction

2. Origins

3. MapReduce

4. Hadoop as MapReduce Implementation

5. Data Warehouse on Hadoop

6. Hadoop and Data Warehousing

7. Conclusions


Why?

• Lots of data

• How to deal with it?

• Hadoop to the rescue!

• When to use?

• When not to use?

• Curiosity


MapReduce: Origins

• Functional Programming

• Higher-order functions to operate on lists

• map

  • apply a function to each element of the list

• reduce = fold = accumulate

  • aggregate a list and produce one value of output

• No side effects

MapReduce: Origins

• (define (+1 el) (+ el 1))

• (map +1 (list 1 2 3)) ⇒ (list 2 3 4)

• (reduce + 0 (list 2 3 4)) ⇒ 9

• (reduce + 0 (map +1 (list 1 2 3))) ⇒ 9
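For readers more at home in Python than in Scheme, the same three expressions can be sketched with the built-in map and functools.reduce. The name plus_one is only an illustrative stand-in for the +1 function; it does not come from the slides.

```python
from functools import reduce
from operator import add

def plus_one(el):
    """Counterpart of the Scheme +1 function: add one to a single element."""
    return el + 1

# map applies plus_one to each element of the list
mapped = list(map(plus_one, [1, 2, 3]))

# reduce folds the list into one value, starting from 0
total = reduce(add, mapped, 0)

print(mapped)  # [2, 3, 4]
print(total)   # 9
```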

MapReduce: Origins

• These functions do not have side effects

• So they can be parallelized easily

• Can split the input data into chunks:

• (list 1 2 3 4) ⇒ (list 1 2) and (list 3 4)

• Apply map to each chunk separately, and then combine (reduce) them together

MapReduce: Origins

• Mapping separately:

• (define res1 (reduce + 0 (map +1 (list 1 2))))

• (reduce + res1 (map +1 (list 3 4)))

• This is the same as (reduce + 0 (map +1 (list 1 2 3 4)))

• Note that for this to work, the function passed to reduce must be associative
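The chunked computation can be checked with a small Python sketch. The helper names (plus_one, map_reduce) and the choice of two equal chunks are assumptions made for illustration.

```python
from functools import reduce
from operator import add

def plus_one(el):
    return el + 1

def map_reduce(chunk, initial=0):
    """Map plus_one over one chunk, then fold the results with +."""
    return reduce(add, map(plus_one, chunk), initial)

data = [1, 2, 3, 4]
chunks = [data[:2], data[2:]]  # (list 1 2) and (list 3 4)

# Each chunk could be processed on a different machine
partials = [map_reduce(chunk) for chunk in chunks]

# Combining the partial results gives the same answer as one pass over all the data
combined = reduce(add, partials, 0)
whole = map_reduce(data)

print(partials)         # [5, 9]
print(combined, whole)  # 14 14
```

The equality holds because + is associative; with a non-associative function such as subtraction, the chunked and unchunked results would generally differ.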

MapReduce

• A map function

  • takes a key-value pair (in_key, in_val)

  • produces zero or more key-value pairs: intermediate results

  • intermediate results are grouped by key

• A reduce function

  • for each group in the intermediate results

  • aggregates and produces the final output

MapReduce Stages

Each MapReduce job is executed in 3 stages:

• map stage: apply map to each key-value pair

• group together the intermediate results by key

• reduce stage: apply reduce to each group


MapReduce Stages

[Figure: four data sources feed four map tasks, whose intermediate results feed three reduce tasks]

map: (in_key, in_val) -> [(out_key, out_val)]

reduce: (out_key, [out_val]) -> [res_val]
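The three stages and the two signatures can be put together in a single-process Python sketch. The function run_mapreduce and the toy sales job are hypothetical and only illustrate the data flow; a real framework distributes each stage across machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

def run_mapreduce(
    records: Iterable[Tuple[Hashable, object]],
    map_fn: Callable[[Hashable, object], Iterable[Tuple[Hashable, object]]],
    reduce_fn: Callable[[Hashable, List[object]], Iterable[object]],
) -> Dict[Hashable, List[object]]:
    # Stage 1 (map): apply map_fn to each (in_key, in_val) pair
    intermediate = []
    for in_key, in_val in records:
        intermediate.extend(map_fn(in_key, in_val))

    # Stage 2 (group): collect all out_val that share the same out_key
    groups = defaultdict(list)
    for out_key, out_val in intermediate:
        groups[out_key].append(out_val)

    # Stage 3 (reduce): apply reduce_fn to each (out_key, [out_val]) group
    return {key: list(reduce_fn(key, vals)) for key, vals in groups.items()}


# Toy job (made-up data): total sales per city
def sales_map(store_id, sale):
    city, amount = sale
    yield (city, amount)

def sales_reduce(city, amounts):
    yield sum(amounts)

records = [(1, ("Brussels", 10)), (2, ("Ghent", 5)), (3, ("Brussels", 7))]
result = run_mapreduce(records, sales_map, sales_reduce)
print(result)  # {'Brussels': [17], 'Ghent': [5]}
```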

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean dictum justo est, quis sagittis leo tincidunt sit amet. Donec scelerisque rutrum quam non sagittis. Phasellus sem nisi, cursus eu lacinia eu, tempor ac eros. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. In mollis elit quis orci congue, quis aliquet mauris mollis. Interdum et malesuada fames ac ante ipsum primis in faucibus.

Proin euismod non quam vitae pretium. Quisque vel nisl et leo volutpat rhoncus quis ac eros. Sed lacus tellus, aliquam non ullamcorper in, dictum at magna. Vestibulum consequat egestas lacinia. Proin tempus rhoncus mi, et lacinia elit ornare auctor. Sed sagittis euismod massa ut posuere. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis fringilla dolor ornare mi dictum ornare.

MapReduce Example

def map(input_key: str, doc: str):
    # map stage: emit the intermediate pair (w, 1) for each word w in doc
    for w in doc.split():
        yield (w, 1)

def reduce(output_key: str, output_vals):
    # reduce stage: add up the ones collected for this word
    res = 0
    for v in output_vals:
        res += v
    yield res

MapReduce Example

• map stage: output 1 for each word

• group a list of pairs (w, 1) into (w, [1, 1, ..., 1])

• reduce stage: for each w calculate how many ones there are
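The intermediate shapes of the word count can be made visible in a short Python sketch. It uses only the first two sentences of the sample text; lower-casing the words, stripping punctuation with a regular expression, and grouping by sorting are assumptions of this sketch, not something the slides specify.

```python
import re
from itertools import groupby
from operator import itemgetter

doc = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
       "Aenean dictum justo est, quis sagittis leo tincidunt sit amet.")

# map stage: output (w, 1) for each word w
pairs = [(w, 1) for w in re.findall(r"[a-z]+", doc.lower())]

# group stage: sort by word, then collect the ones for each word
pairs.sort(key=itemgetter(0))
grouped = [(w, [one for _, one in group])
           for w, group in groupby(pairs, key=itemgetter(0))]

# reduce stage: count the ones in each group
counts = {w: sum(ones) for w, ones in grouped}

print(dict(grouped)["amet"])  # [1, 1]
print(counts["amet"])         # 2
```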

MapReduce Example: Result

• amet: 2

• ante: 2

• aptent: 1

• consectetur: 1

• dictum: 3

• dolor: 2

• elit: 3

• ...


Hadoop

http://www.flickr.com/photos/erikeldridge/3614786392/

Hadoop

“... is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”

Hadoop

• Open Source implementation of MapReduce

• "Hadoop" often refers to the whole ecosystem:

  • HDFS

  • Hadoop MapReduce

  • HBase

  • Hive

  • ... many others

Hadoop Cluster: Terminology

• Name Node: orchestrates the process

• Workers: nodes that do the computation

• Mappers do the map phase

• Reducers do the reduce phase


Hadoop

[Figure: data flow of a job. Mapper: read file from HDFS, map, combine, write to local storage. Reducer: pull and copy mapper output to local storage, sort, reduce, write result.]


http://escience.washington.edu/get-help-now/what-hadoop


Fault-Tolerance ≈ Load-Balancing

• No execution plan

• Node failed ⇒ task reassigned

• Node done ⇒ another task assigned

• No communication costs
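The idea that fault tolerance and load balancing come from the same mechanism can be illustrated with a toy scheduler in Python. This is a deliberately simplified model, not how Hadoop's scheduler is implemented: run_tasks, the round-robin choice of worker, and the fails_on failure model are all hypothetical.

```python
from collections import deque

def run_tasks(tasks, workers, fails_on):
    """Hand out tasks one at a time; re-queue a task if its worker fails.

    tasks: list of task ids
    workers: list of worker names
    fails_on: set of (worker, task) pairs on which the worker dies (toy failure model)
    Returns a dict mapping each task to the worker that completed it.
    """
    pending = deque(tasks)
    alive = list(workers)
    done = {}
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("no workers left")
        worker = alive[turn % len(alive)]
        task = pending.popleft()
        if (worker, task) in fails_on:
            # Node failed: drop the node and put its task back in the queue
            alive.remove(worker)
            pending.append(task)
        else:
            # Node done: it will simply be offered another task later
            done[task] = worker
            turn += 1
    return done

completed = run_tasks([0, 1, 2, 3, 4], ["w1", "w2", "w3"], {("w2", 1)})
print(sorted(completed.items()))
# [(0, 'w1'), (1, 'w1'), (2, 'w3'), (3, 'w1'), (4, 'w3')]
```

There is no up-front plan that fixes which node runs which task, so a failed node's task is handled exactly like any other pending task.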

Advantages

• Simple, especially for programmers who know FP

• Fault tolerant

• No schema, can process any data

• Flexible

• Cheap and runs on commodity hardware


Disadvantages

• No declarative high-level language like SQL

• Performance issues:

• Map and Reduce are blocking

• Name Node: single point of failure

• It's young


Disadvantages

[Abouzeid, Azza et al 2009]


Hadoop as a Data Warehouse

• Cheetah

• Hive


Cheetah

• Typical DW relation-like schemas

• ... but not exactly

• They are called virtual views


Cheetah

• Virtual views consist of columns that can be queried

• Everything inside is entirely denormalized

• Append-only design and slowly changing dimensions

• Proprietary


Hive

• A data warehousing solution built by Facebook

• For Big data analysis:

• in 2010 (4 years ago!), 30+ PB

• Has its own data model

• HiveQL: a declarative SQL-like language for ad-hoc querying


HiveQL

Tables

status_updates(userid int, status string, ds string)

profiles(userid int, school string, gender int)

LOAD DATA LOCAL INPATH 'logs/status_updates'
INTO TABLE status_updates
PARTITION (ds='2009-03-20')

HiveQL

FROM
  (SELECT a.status, b.school, b.gender
   FROM status_updates a JOIN profiles b
   ON (a.userid = b.userid AND a.ds = '2009-03-20')) subq1
INSERT OVERWRITE TABLE gender_summary
  SELECT subq1.gender, count(1)
  GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary
  SELECT subq1.school, count(1)
  GROUP BY subq1.school
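What this multi-insert query computes can be sketched in plain Python: one pass over the joined rows feeds two independent aggregations. The sample rows, the school names and the function summarize are made up for illustration; only the table and column names come from the slides.

```python
from collections import Counter

# (userid, status, ds) rows; made-up sample data
status_updates = [
    (1, "hello", "2009-03-20"),
    (2, "busy", "2009-03-20"),
    (1, "old", "2009-03-19"),
    (3, "hi", "2009-03-20"),
]

# userid -> (school, gender); made-up sample data
profiles = {
    1: ("ULB", 0),
    2: ("ULB", 1),
    3: ("VUB", 1),
}

def summarize(status_updates, profiles, ds):
    """Join status updates with profiles for one day and count per gender and per school."""
    gender_summary = Counter()
    school_summary = Counter()
    for userid, status, row_ds in status_updates:
        # Join condition: matching userid and the requested partition date
        if row_ds != ds or userid not in profiles:
            continue
        school, gender = profiles[userid]
        gender_summary[gender] += 1   # feeds gender_summary
        school_summary[school] += 1   # feeds school_summary
    return gender_summary, school_summary

gender_summary, school_summary = summarize(status_updates, profiles, "2009-03-20")
print(dict(gender_summary))  # {0: 1, 1: 2}
print(dict(school_summary))  # {'ULB': 2, 'VUB': 1}
```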


HiveQL

REDUCE subq2.school, subq2.meme, subq2.cnt
  USING 'top10.py' AS (school, meme, cnt)
FROM (
  SELECT subq1.school, subq1.meme, count(1) AS cnt
  FROM
    (MAP b.school, a.status
     USING 'meme_extractor.py'
     AS (school, meme)
     FROM status_updates a JOIN profiles b
     ON (a.userid = b.userid)) subq1
  GROUP BY subq1.school, subq1.meme
  DISTRIBUTE BY school, meme
  SORT BY school, meme, cnt desc
) subq2
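The slides do not show the contents of top10.py, so the following is only a guess at what such a reduce script might do. It assumes rows arrive as tab-separated lines (school, meme, cnt), which is Hive's default format for streaming scripts; the function name top_n_per_school and the sample rows are hypothetical.

```python
from collections import defaultdict
from heapq import nlargest

def top_n_per_school(lines, n=10):
    """Keep the n most frequent memes for each school.

    lines: iterable of tab-separated "school<TAB>meme<TAB>cnt" rows
    Yields rows in the same tab-separated format.
    """
    by_school = defaultdict(list)
    for line in lines:
        school, meme, cnt = line.rstrip("\n").split("\t")
        by_school[school].append((int(cnt), meme))
    for school in sorted(by_school):
        # nlargest picks the rows with the highest counts for this school
        for cnt, meme in nlargest(n, by_school[school]):
            yield f"{school}\t{meme}\t{cnt}"

rows = ["ULB\tcats\t3", "ULB\tdogs\t5", "VUB\tcats\t2"]
top = list(top_n_per_school(rows, n=1))
print(top)  # ['ULB\tdogs\t5', 'VUB\tcats\t2']
```

Used as a streaming script, the function would be fed sys.stdin and each yielded row printed to standard output.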

Hadoop + Data Warehouse

http://www.flickr.com/photos/mrflip/5150336351/in/photostream/

Hadoop + Data Warehouse

• Hadoop and Data Warehouses can co-exist

• DW: OLAP, BI, transactional data

• Hadoop: Raw, unstructured data


ETL

• Extract: load to HDFS, parse, prepare

• Run some analysis

• Transform: clean data and transform to some structured format

• with MapReduce

• Load: extract from HDFS, load to DW
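
The ETL steps above can be sketched as a tiny pipeline in Python. Everything specific here is an assumption for illustration: the pipe-separated record layout, the sample lines, and the function names. In a real deployment the transform step would run as a MapReduce job over files in HDFS, and the CSV output would go to the warehouse's bulk loader.

```python
import csv
import io

# Made-up raw input: date|user_id|free text, with one malformed line
RAW = """2013-12-01|42|Great service, thanks
garbage line without separators
2013-12-02|17|  Waited too long
"""

def extract(raw_text):
    """Extract: split the raw dump into lines (stands in for reading from HDFS)."""
    return raw_text.splitlines()

def transform(lines):
    """Transform: drop malformed lines and emit clean, structured records."""
    for line in lines:
        parts = line.split("|")
        if len(parts) != 3 or not parts[1].strip().isdigit():
            continue  # skip records that do not fit the expected layout
        date, user_id, text = parts
        yield {"date": date.strip(), "user_id": int(user_id), "text": text.strip()}

def load(records):
    """Load: serialize the records as CSV, ready for a warehouse bulk loader."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["date", "user_id", "text"])
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()

records = list(transform(extract(RAW)))
csv_text = load(records)
print(len(records))  # 2
```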


ETL: examples

• Text processing

• Call center records analysis

• extract sentiment

• link to profile

• which customers are more important to keep?

• Image processing


Active Storage

• Don't delete the data after processing

• Hadoop storage is cheap: it can store anything

• Run more analysis when needed

• Like: extract new keywords/features from the old dataset


Active Storage - 2

• Up to 80% of data is dormant (or cold)

• Hadoop storage can be much cheaper than high-cost data management solutions

• Move this data to Hadoop

• When needed, quickly analyze it there or move it back to the DW

Analytical Sandbox

http://www.flickr.com/photos/pasukaru76/9824401426/ ⇒ http://www.flickr.com/photos/pasukaru76/4977447932/

Analytical Sandbox

• What are we looking for in this data?

• No structure, so it is hard to know

• Run ad-hoc Hive queries to see what's there

Conclusions

• Hadoop is becoming more and more popular

• Many companies plan to adopt it

• Best used with existing DW solutions

  • as an ETL tool

  • as Active Storage

  • as an Analytical Sandbox

References

1. Lee, Kyong-Ha, et al. "Parallel data processing with MapReduce: a survey." ACM SIGMOD Record 40.4 (2012): 11-20. http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf

2. "MapReduce vs Data Warehouse." Webpage. http://www.cbsolution.net/techniques/ontarget/mapreduce_vs_data_warehouse Accessed 15/12/2013.

3. Ordonez, Carlos, Il-Yeol Song, and Carlos Garcia-Alvarado. "Relational versus non-relational database systems for data warehousing." Proceedings of the ACM 13th international workshop on Data warehousing and OLAP. ACM, 2010. http://www2.cs.uh.edu/~ordonez/co_research_proceedings.html

4. A. Awadallah, D. Graham. "Hadoop and the Data Warehouse: When to Use Which." (2011). By Cloudera and Teradata. http://www.teradata.com/white-papers/Hadoop-and-the-Data-Warehouse-When-to-Use-Which/

5. Thusoo, Ashish, et al. "Hive: a warehousing solution over a map-reduce framework." Proceedings of the VLDB Endowment 2.2 (2009): 1626-1629. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.151.2637

6. Chen, Songting. "Cheetah: a high performance, custom data warehouse on top of MapReduce." Proceedings of the VLDB Endowment 3.1-2 (2010): 1459-1468. http://www.vldb.org/pvldb/vldb2010/papers/I08.pdf

7. "How (and Why) Hadoop is Changing the Data Warehousing Paradigm." Webpage. http://tdwi.org/articles/2013/08/13/hadoop-changing-dw-paradigm.aspx Accessed 15/12/2013.

8. P. Russom. "Integrating Hadoop into Business Intelligence and Data Warehousing." (2013). http://www.slideshare.net/emcacademics/tdwi-best-practices-report-hadoop-foro-bi-and-dw-april-2013

9. M. Ferguson. "Offloading and Accelerating Data Warehouse ETL Processing Using Hadoop." http://www.mapr.com/Download-document/40-Data-Warehouse-Offload

10. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113. http://research.google.com/archive/mapreduce-osdi04.pdf

11. "What is Hadoop?" Webpage. http://escience.washington.edu/get-help-now/what-hadoop Accessed 15/12/2013.

12. Apache Hadoop project home page. http://hadoop.apache.org/

13. Apache HBase home page. http://hbase.apache.org/

14. Apache Mahout home page. http://mahout.apache.org/

15. "How Hadoop Cuts Big Data Costs." [link]. Accessed 05/01/2014.

16. "The Impact of Data Temperature on the Data Warehouse." Whitepaper by Teradata (2012). http://www.teradata.com/white-papers/The-Impact-of-Data-Temperature-on-the-Data-Warehouse-eb6690/?type=WP

17. Abouzeid, Azza, et al. "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads." Proceedings of the VLDB Endowment 2.1 (2009): 922-933. http://www-master.ufr-info-p6.jussieu.fr/2009/Ext/naacke/grbd2010/extra/exposes2010/C3_VLDB09_HadoopDB.pdf

Thank you
