Page 1: Hadoop in Data Warehousing

INFO-H-419: Data Warehouses project

Hadoop in Data Warehousing

by Alexey Grigorev


Page 2: Hadoop in Data Warehousing

Hadoop: In this Presentation

1. Introduction

2. Origins

3. MapReduce

4. Hadoop as MapReduce Implementation

5. Data Warehouse on Hadoop

6. Hadoop and Data Warehousing

7. Conclusions


Page 3: Hadoop in Data Warehousing

Why?

• Lots of data

• How to deal with it?

• Hadoop to the rescue!

• When to use?

• When not to use?

• Curiosity


Page 4: Hadoop in Data Warehousing

MapReduce: Origins

• Functional Programming

• Higher-order functions that operate on lists

• map: applies a function to each element of the list

• reduce (= fold = accumulate): aggregates a list and produces one output value

• No side effects


Page 5: Hadoop in Data Warehousing

MapReduce: Origins

• (define (+1 el) (+ el 1))

• (map +1 (list 1 2 3)) ⇒ (list 2 3 4)

• (reduce + 0 (list 2 3 4)) ⇒ 9

• (reduce + 0 (map +1 (list 1 2 3))) ⇒ 9
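
For readers more at home in Python than in Scheme, the same two operations can be sketched with the built-in `map` and `functools.reduce`. The helper name `add_one` is ours and stands in for the slide's `+1`.

```python
from functools import reduce
from operator import add

def add_one(el):
    """Counterpart of the slide's (+1 el)."""
    return el + 1

# map applies add_one to every element of the list
mapped = list(map(add_one, [1, 2, 3]))

# reduce folds the list into a single value, starting from 0
total = reduce(add, mapped, 0)

print(mapped)  # [2, 3, 4]
print(total)   # 9
```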


Page 6: Hadoop in Data Warehousing

MapReduce: Origins

• These functions do not have side effects

• So they can be parallelized easily

• The input data can be split into chunks:

• (list 1 2 3 4) ⇒ (list 1 2) and (list 3 4)

• Apply map to each chunk separately, and then combine (reduce) the results


Page 7: Hadoop in Data Warehousing

MapReduce: Origins

• Mapping separately:

• (define res1 (reduce + 0 (map +1 (list 1 2))))

• (reduce + res1 (map +1 (list 3 4)))

• This is the same as (reduce + 0 (map +1 (list 1 2 3 4)))

• Note that for this to work the function passed to reduce must be associative, as + is
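
The chunked evaluation can be sketched in Python as follows. This is a single-process illustration only: the names `add_one` and `map_reduce_chunk` are ours, and each chunk would run on a different machine in a real cluster.

```python
from functools import reduce
from operator import add

def add_one(el):
    return el + 1

def map_reduce_chunk(chunk):
    """Map and reduce one chunk independently of the others."""
    return reduce(add, map(add_one, chunk), 0)

data = [1, 2, 3, 4]
chunks = [data[:2], data[2:]]          # (list 1 2) and (list 3 4)

# Each partial result could be computed on a different machine.
partials = [map_reduce_chunk(c) for c in chunks]

# Combining partial results relies on + being associative.
chunked_total = reduce(add, partials, 0)
whole_total = map_reduce_chunk(data)

print(partials, chunked_total, whole_total)  # [5, 9] 14 14
```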


Page 8: Hadoop in Data Warehousing

MapReduce

• A map function

• takes a key-value pair (in_key, in_val)

• produces zero or more key-value pairs: intermediate results

• intermediate results are grouped by key

• A reduce function

• for each group in the intermediate results

• aggregates and produces the final output


Page 9: Hadoop in Data Warehousing

MapReduce Stages

Each MapReduce job is executed in 3 stages:

• map stage: apply map to each key-value pair

• group stage: group the intermediate results by key

• reduce stage: apply reduce to each group


Page 10: Hadoop in Data Warehousing

MapReduce Stages

[Diagram: four data sources each feed a map task; the outputs of the four map tasks are routed to three reduce tasks]

map: (in_key, in_val) -> [(out_key, out_val)]

reduce: (out_key, [out_val]) -> [res_val]
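
The two signatures and the three stages can be put together in a minimal single-process sketch. It is not how Hadoop is implemented; `run_mapreduce`, `parity_map` and `sum_reduce` are hypothetical names, and the toy input is made up.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Run one job in a single process: map, group by key, reduce."""
    # Map stage: every (in_key, in_val) yields zero or more (out_key, out_val).
    intermediate = []
    for in_key, in_val in records:
        intermediate.extend(map_fn(in_key, in_val))

    # Group stage: collect all values that share an out_key.
    groups = defaultdict(list)
    for out_key, out_val in intermediate:
        groups[out_key].append(out_val)

    # Reduce stage: each (out_key, [out_val]) yields a list of results.
    return {key: list(reduce_fn(key, vals)) for key, vals in groups.items()}

# Toy job (made-up input): sum the odd and the even numbers separately.
def parity_map(in_key, number):
    yield ("even" if number % 2 == 0 else "odd", number)

def sum_reduce(out_key, values):
    yield sum(values)

records = list(enumerate([1, 2, 3, 4]))   # in_key is just the position
result = run_mapreduce(records, parity_map, sum_reduce)
print(result)  # {'odd': [4], 'even': [6]}
```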


Page 11: Hadoop in Data Warehousing

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean dictum justo est, quis

sagittis leo tincidunt sit amet. Donec scelerisque rutrum quam non sagittis. Phasellus sem

nisi, cursus eu lacinia eu, tempor ac eros. Class aptent taciti sociosqu ad litora torquent per

conubia nostra, per inceptos himenaeos. In mollis elit quis orci congue, quis aliquet mauris

mollis. Interdum et malesuada fames ac ante ipsum primis in faucibus.

Proin euismod non quam vitae pretium. Quisque vel nisl et leo volutpat rhoncus quis ac eros.

Sed lacus tellus, aliquam non ullamcorper in, dictum at magna. Vestibulum consequat

egestas lacinia. Proin tempus rhoncus mi, et lacinia elit ornare auctor. Sed sagittis euismod

massa ut posuere. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis

fringilla dolor ornare mi dictum ornare.


Page 12: Hadoop in Data Warehousing

MapReduce Example

def map(input_key, doc):
    # input_key: document name, doc: document contents
    for w in doc.split():
        yield (w, 1)          # emit the intermediate pair (w, 1)

def reduce(output_key, output_vals):
    # output_key: a word, output_vals: the ones emitted for that word
    res = 0
    for v in output_vals:
        res += v
    yield res                 # emit the total count for the word

Page 13: Hadoop in Data Warehousing

MapReduce Example

• map stage: output 1 for each word w

• group a list of pairs (w, 1) into (w, [1, 1, ..., 1])

• reduce stage: for each w calculate how many ones there are


Page 14: Hadoop in Data Warehousing

MapReduce Example: Result

• amet: 2

• ante: 2

• aptent: 1

• consectetur: 1

• dictum: 3

• dolor: 2

• elit: 3

• ...
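
These counts can be reproduced with a small single-process sketch that runs the word-count map and reduce over the sample text from the earlier slide. The tokenization (lowercasing and keeping only letters) is our assumption, since the slides do not state it; `map_fn`, `reduce_fn` and the key "sample" are hypothetical names.

```python
import re
from collections import defaultdict

TEXT = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean dictum "
    "justo est, quis sagittis leo tincidunt sit amet. Donec scelerisque rutrum "
    "quam non sagittis. Phasellus sem nisi, cursus eu lacinia eu, tempor ac "
    "eros. Class aptent taciti sociosqu ad litora torquent per conubia nostra, "
    "per inceptos himenaeos. In mollis elit quis orci congue, quis aliquet "
    "mauris mollis. Interdum et malesuada fames ac ante ipsum primis in "
    "faucibus. "
    "Proin euismod non quam vitae pretium. Quisque vel nisl et leo volutpat "
    "rhoncus quis ac eros. Sed lacus tellus, aliquam non ullamcorper in, "
    "dictum at magna. Vestibulum consequat egestas lacinia. Proin tempus "
    "rhoncus mi, et lacinia elit ornare auctor. Sed sagittis euismod massa ut "
    "posuere. Interdum et malesuada fames ac ante ipsum primis in faucibus. "
    "Duis fringilla dolor ornare mi dictum ornare."
)

def map_fn(input_key, doc):
    # Assumed tokenization: lowercase, letters only.
    for w in re.findall(r"[a-z]+", doc.lower()):
        yield (w, 1)

def reduce_fn(output_key, output_vals):
    return sum(output_vals)

# Group stage: (w, 1) pairs become (w, [1, 1, ..., 1]).
groups = defaultdict(list)
for w, one in map_fn("sample", TEXT):
    groups[w].append(one)

counts = {w: reduce_fn(w, ones) for w, ones in groups.items()}

for w in ["amet", "ante", "aptent", "consectetur", "dictum", "dolor", "elit"]:
    print(f"{w}: {counts[w]}")
```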


Page 15: Hadoop in Data Warehousing

Hadoop

Image: http://flickr.com/photos/erikeldridge/3614786392/

Page 16: Hadoop in Data Warehousing

Hadoop

... is a framework that allows for the distributed processing of large data

sets across clusters of computers using simple programming models. It is

designed to scale up from single servers to thousands of machines, each

offering local computation and storage. Rather than rely on hardware to

deliver high-availability, the library itself is designed to detect and handle

failures at the application layer, so delivering a highly-available service on

top of a cluster of computers, each of which may be prone to failures.


Page 17: Hadoop in Data Warehousing

Hadoop

• Open Source implementation of MapReduce

• "Hadoop":

• HDFS

• Hadoop MapReduce

• HBase

• Hive

• ... many others


Page 18: Hadoop in Data Warehousing

Hadoop Cluster: Terminology

• Name Node: orchestrates the process

• Workers: nodes that do the computation

• Mappers do the map phase

• Reducers do the reduce phase


Page 19: Hadoop in Data Warehousing

Hadoop

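[Diagram: data flow of a Hadoop job]

• mapper: file in HDFS ⇒ Read ⇒ Map ⇒ Combine ⇒ mapper local storage

• reducer: Pull from mapper local storage ⇒ Copy ⇒ Sort ⇒ Reduce (using reducer local storage) ⇒ result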


Page 20: Hadoop in Data Warehousing

Image: http://escience.washington.edu/get-help-now/what-hadoop


Page 27: Hadoop in Data Warehousing

Fault Tolerance and Load Balancing

• No execution plan

• Node failed ⇒ task reassigned

• Node done ⇒ another task assigned

• No communication costs
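
The two rules above (a failed node's task is reassigned, an idle node gets another task) can be illustrated with a toy simulation. This is only a sketch of the idea, not Hadoop's scheduler: `run_tasks`, the worker and task names, and the injected failure are all made up.

```python
from collections import deque

def run_tasks(tasks, workers, failures):
    """Hand out tasks one at a time; re-queue a task when its worker fails.

    `failures` is a set of (worker, task) pairs that crash (demo input only).
    """
    pending = deque(tasks)
    done = {}
    turn = 0
    while pending:
        worker = workers[turn % len(workers)]  # next worker asking for work
        turn += 1
        task = pending.popleft()
        if (worker, task) in failures:
            pending.append(task)               # node failed => task reassigned
        else:
            done[task] = worker                # node done => gets another task
    return done

done = run_tasks(["t1", "t2", "t3"], ["w1", "w2"], {("w1", "t1")})
print(done)  # {'t2': 'w2', 't3': 'w1', 't1': 'w2'}
```

Because work is handed out dynamically instead of following a fixed plan, the failed task t1 simply ends up on another worker. The sketch would loop forever if a task failed on every worker; a real system caps the number of retries.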


Page 28: Hadoop in Data Warehousing

Advantages

• Simple, especially for programmers who know FP

• Fault tolerant

• No schema, can process any data

• Flexible

• Cheap and runs on commodity hardware


Page 29: Hadoop in Data Warehousing

Disadvantages

• No declarative high-level language like SQL

• Performance issues:

• Map and Reduce are blocking

• Name Node: single point of failure

• It's young


Page 30: Hadoop in Data Warehousing

Disadvantages

[Abouzeid, Azza et al 2009]


Page 31: Hadoop in Data Warehousing

Hadoop as a Data Warehouse

• Cheetah

• Hive


Page 32: Hadoop in Data Warehousing

Cheetah

• Typical DW relation-like schemas

• ... But not exactly

• They call them virtual views


Page 33: Hadoop in Data Warehousing

Cheetah


Page 34: Hadoop in Data Warehousing

Cheetah

• Virtual views consist of columns that can be queried

• Everything inside is entirely denormalized

• Append-only design and slowly changing dimensions

• Proprietary


Page 35: Hadoop in Data Warehousing

Hive

• A data warehousing solution built by Facebook

• For Big data analysis:

• in 2010 (4 years ago!), 30+ PB

• Has its own data model

• HiveQL: a declarative SQL-like language for ad-hoc querying


Page 36: Hadoop in Data Warehousing

HiveQL

Tables

status_updates(userid int, status string, ds string)

profiles(userid int, school string, gender int)

LOAD DATA LOCAL INPATH 'logs/status_updates'
INTO TABLE status_updates
PARTITION (ds='2009-03-20')

Page 37: Hadoop in Data Warehousing

HiveQL

-- the join is computed once and feeds both inserts
FROM
  (SELECT a.status, b.school, b.gender
   FROM status_updates a JOIN profiles b
   ON (a.userid = b.userid AND a.ds = '2009-03-20')) subq1
-- status updates per gender
INSERT OVERWRITE TABLE gender_summary
  PARTITION (ds='2009-03-20')
  SELECT subq1.gender, count(1)
  GROUP BY subq1.gender
-- status updates per school
INSERT OVERWRITE TABLE school_summary
  PARTITION (ds='2009-03-20')
  SELECT subq1.school, count(1)
  GROUP BY subq1.school


Page 39: Hadoop in Data Warehousing

HiveQL

REDUCE subq2.school, subq2.meme, subq2.cnt
  USING 'top10.py' AS (school, meme, cnt)
FROM (
  SELECT subq1.school, subq1.meme, count(1) AS cnt
  FROM
    (MAP b.school, a.status
     USING 'meme_extractor.py'
     AS (school, meme)
     FROM status_updates a JOIN profiles b
     ON (a.userid = b.userid)) subq1
  GROUP BY subq1.school, subq1.meme
  DISTRIBUTE BY school, meme
  SORT BY school, meme, cnt desc
) subq2
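
The slides do not show the contents of 'top10.py'. Hive streams rows to such a script as tab-separated lines and reads tab-separated lines back, so a reducer script could look roughly like the sketch below. The function name `top_n`, the per-school grouping logic and the sample rows are our assumptions, not the original script.

```python
import heapq
from itertools import groupby

def top_n(lines, n=10):
    """Yield the n most frequent memes per school from tab-separated rows."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    # Assumption: rows of one school arrive contiguously (sorted by school).
    for school, group in groupby(rows, key=lambda r: r[0]):
        best = heapq.nlargest(n, group, key=lambda r: int(r[2]))
        for _, meme, cnt in best:
            yield f"{school}\t{meme}\t{cnt}"

# Made-up rows in the (school, meme, cnt) layout of the query above.
sample = ["ulb\tcat\t3\n", "ulb\tdog\t7\n", "vub\tcat\t1\n"]
top = list(top_n(sample, n=1))
print(top)  # ['ulb\tdog\t7', 'vub\tcat\t1']
```

Used as an actual streaming script, it would iterate over `sys.stdin` instead of `sample` and print each yielded line.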


Page 40: Hadoop in Data Warehousing

Hadoop + Data Warehouse

Image: http://www.flickr.com/photos/mrflip/5150336351/in/photostream/

Page 41: Hadoop in Data Warehousing

Hadoop + Data Warehouse

• Hadoop and Data Warehouses can co-exist

• DW: OLAP, BI, transactional data

• Hadoop: Raw, unstructured data


Page 42: Hadoop in Data Warehousing

ETL

• Extract: load to HDFS, parse, prepare

• Run some analysis

• Transform: clean data and transform to some structured format

• with MapReduce

• Load: extract from HDFS, load to DW
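
As an illustration of the Transform step, here is a sketch of a map function that cleans raw records and emits structured rows. The pipe-separated input layout, the field names and the sample lines are made up for the example; a real job would read its input from HDFS rather than from a list.

```python
import csv
import io

def transform_map(offset, raw_line):
    """Map-only transform: parse one raw line, drop it if malformed."""
    parts = [p.strip() for p in raw_line.split("|")]
    if len(parts) != 3 or not parts[1].isdigit():
        return                      # cleaning: skip malformed records
    timestamp, customer_id, text = parts
    yield (int(customer_id), (timestamp, text.lower()))

# Hypothetical raw input: timestamp | customer id | free text
raw = [
    "2013-12-15 10:02 | 42 | Great Service ",
    "garbage line",
    "2013-12-15 10:05 | 7 | Too SLOW",
]

# Structured CSV rows, ready to be bulk-loaded into a warehouse table.
out = io.StringIO()
writer = csv.writer(out)
for offset, line in enumerate(raw):
    for customer_id, (timestamp, text) in transform_map(offset, line):
        writer.writerow([customer_id, timestamp, text])

print(out.getvalue())
```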


Page 43: Hadoop in Data Warehousing

ETL: examples

• Text processing

• Call center records analysis

• extract sentiment

• link to profile

• which customers are more important to keep?

• Image processing


Page 44: Hadoop in Data Warehousing

Active Storage

• Don't delete the data after processing

• Hadoop storage is cheap: it can store anything

• Run more analysis when needed

• Like: extract new keywords/features from the old dataset


Page 45: Hadoop in Data Warehousing

Active Storage - 2

• Up to 80% of data is dormant (or cold)

• Hadoop storage can be far cheaper than high-cost data management solutions

• Move this data to Hadoop

• When needed, quickly analyze it there or move it back to the DW


Page 46: Hadoop in Data Warehousing

Analytical Sandbox ⇒


Page 47: Hadoop in Data Warehousing

Image: http://www.flickr.com/photos/pasukaru76/9824401426/

Page 48: Hadoop in Data Warehousing

Image: http://www.flickr.com/photos/pasukaru76/4977447932/

Page 49: Hadoop in Data Warehousing

Analytical Sandbox

• What are we looking for in this data?

• No structure - hard to know

• Run ad-hoc Hive queries to see what's there


Page 50: Hadoop in Data Warehousing

Conclusions

• Hadoop is becoming more and more popular

• Many companies plan to adopt it

• Best used alongside existing DW solutions

• as an ETL tool

• as Active Storage

• as an Analytical Sandbox


Page 51: Hadoop in Data Warehousing

References

1. Lee, Kyong-Ha, et al. "Parallel data processing with MapReduce: a survey." ACM SIGMOD Record 40.4 (2012): 11-20. [pdf]

2. "MapReduce vs Data Warehouse". Webpage, [link]. Accessed 15/12/2013.

3. Ordonez, Carlos, Il-Yeol Song, and Carlos Garcia-Alvarado. "Relational versus non-relational database systems for data warehousing." Proceedings of the ACM 13th international workshop on Data warehousing and OLAP. ACM, 2010. [pdf]

4. A. Awadallah, D. Graham. "Hadoop and the Data Warehouse: When to Use Which." (2011). [pdf] (by Cloudera and Teradata)

5. Thusoo, Ashish, et al. "Hive: a warehousing solution over a map-reduce framework." Proceedings of the VLDB Endowment 2.2 (2009): 1626-1629. [pdf]

6. Chen, Songting. "Cheetah: a high performance, custom data warehouse on top of MapReduce." Proceedings of the VLDB Endowment 3.1-2 (2010): 1459-1468. [pdf]

Page 52: Hadoop in Data Warehousing

References

7. "How (and Why) Hadoop is Changing the Data Warehousing Paradigm." Webpage [link]. Accessed 15/12/2013.

8. P. Russom. "Integrating Hadoop into Business Intelligence and Data Warehousing." (2013). [pdf]

9. M. Ferguson. "Offloading and Accelerating Data Warehouse ETL Processing Using Hadoop." [pdf]

10. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113. [pdf]

11. "What is Hadoop?" Webpage [link]. Accessed 15/12/2013.

12. Apache Hadoop project home page, url: [link].

13. Apache HBase home page, [link].

14. Apache Mahout home page, [link].

15. "How Hadoop Cuts Big Data Costs" [link]. Accessed 05/01/2014.

16. "The Impact of Data Temperature on the Data Warehouse." Whitepaper by Teradata (2012). [pdf]

17. Abouzeid, Azza, et al. "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads." Proceedings of the VLDB Endowment 2.1 (2009): 922-933. [pdf]

Page 53: Hadoop in Data Warehousing

Thank you

Page 54: Hadoop in Data Warehousing

Prepared with Shower