Hadoop in Data Warehousing


INFO-H-419: Data Warehouses project

Hadoop in Data Warehousing

by Alexey Grigorev

1

Hadoop: In this Presentation

1. Introduction

2. Origins

3. MapReduce

4. Hadoop as MapReduce Implementation

5. Data Warehouse on Hadoop

6. Hadoop and Data Warehousing

7. Conclusions

2

Why?

• Lots of data

• How to deal with it?

• Hadoop to the rescue!

• When to use?

• When not to use?

• Curiosity

3

MapReduce: Origins

• Functional Programming

• Higher-order functions that operate on lists

• map

• apply to each element of the list

• reduce = fold = accumulate

• aggregate a list and produce one value of output

• No side effects

4

MapReduce: Origins

• (define (+1 el) (+ el 1))

• (map +1 (list 1 2 3)) ⇒ (list 2 3 4)

• (reduce + 0 (list 2 3 4)) ⇒ 9

• (reduce + 0 (map +1 (list 1 2 3))) ⇒ 9

5
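For readers less used to Scheme, the same three evaluations can be written in Python. This sketch is added for illustration and is not part of the original deck; functools.reduce plays the role of reduce/fold:

from functools import reduce

def add_one(el):
    return el + 1

print(list(map(add_one, [1, 2, 3])))                                # [2, 3, 4]
print(reduce(lambda acc, x: acc + x, [2, 3, 4], 0))                 # 9
print(reduce(lambda acc, x: acc + x, map(add_one, [1, 2, 3]), 0))   # 9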

MapReduce: Origins

• These functions have no side effects

• And can be parallelized easily

• Can split the input data into chunks:

• (list 1 2 3 4) ⇒ (list 1 2) and (list 3 4)

• Apply map to each chunk separately, and then combine (reduce) them together

6

MapReduce: Origins

• Mapping separately:

• (define res1 (reduce + 0 (map +1 (list 1 2))))

• (reduce + res1 (map +1 (list 3 4)))

• This is the same as (reduce + 0 (map +1 (list 1 2 3 4)))

• Note that for this to work, the combining function given to reduce must be associative (here, +)

7
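A minimal Python sketch of the same idea (the two-chunk split is arbitrary and chosen only for illustration): map each chunk independently, fold each mapped chunk into a partial result, then combine the partials. The final value matches a single pass over the whole list because + is associative.

from functools import reduce

def add_one(x):
    return x + 1

data = [1, 2, 3, 4]
chunks = [data[:2], data[2:]]                       # (list 1 2) and (list 3 4)

# map each chunk separately -- this part can run in parallel
mapped = [list(map(add_one, chunk)) for chunk in chunks]

# reduce each chunk to a partial result, then combine the partials
partials = [reduce(lambda a, b: a + b, m, 0) for m in mapped]
combined = reduce(lambda a, b: a + b, partials, 0)

# same answer as processing the whole list in one go
assert combined == reduce(lambda a, b: a + b, map(add_one, data), 0)
print(combined)                                     # 14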

MapReduce

• A map function

• takes a key-value pair (in_key, in_val)

• produces zero or more key-value pairs: intermediate results

• intermediate results are grouped by key

• A reduce function

• for each group in the intermediate results

• aggregates and produces the final output

8

MapReduce Stages

Each MapReduce job is executed in 3 stages:

• map stage: apply map to each key-value pair

• group (shuffle) stage: group the intermediate results by key

• reduce stage: apply reduce to each group

9
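The whole pipeline fits in a few lines of Python. This is a toy, single-machine sketch of the three stages, not Hadoop's actual API; run_mapreduce, map_fn, and reduce_fn are names invented here for illustration:

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # map stage: apply map_fn to every (key, value) pair;
    # each call may emit zero or more intermediate pairs
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # group (shuffle) stage: collect intermediate values by key
    groups = defaultdict(list)
    for out_key, out_val in intermediate:
        groups[out_key].append(out_val)

    # reduce stage: apply reduce_fn to each (key, [values]) group
    return {key: reduce_fn(key, vals) for key, vals in groups.items()}

The word-count example on the following slides is exactly this shape: its map_fn emits (word, 1) pairs and its reduce_fn sums each group.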

MapReduce Stages

[Diagram: several data sources feed parallel map tasks; the grouped map outputs feed the reduce tasks]

map: (in_key, in_val) -> [(out_key, out_val)]

reduce: (out_key, [out_val]) -> [res_val]

10

MapReduce Example: Input Document

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean dictum justo est, quis sagittis leo tincidunt sit amet. Donec scelerisque rutrum quam non sagittis. Phasellus sem nisi, cursus eu lacinia eu, tempor ac eros. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. In mollis elit quis orci congue, quis aliquet mauris mollis. Interdum et malesuada fames ac ante ipsum primis in faucibus.

Proin euismod non quam vitae pretium. Quisque vel nisl et leo volutpat rhoncus quis ac eros. Sed lacus tellus, aliquam non ullamcorper in, dictum at magna. Vestibulum consequat egestas lacinia. Proin tempus rhoncus mi, et lacinia elit ornare auctor. Sed sagittis euismod massa ut posuere. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis fringilla dolor ornare mi dictum ornare.

11

MapReduce Example

def map(input_key: str, doc: str):
    # emit an intermediate (word, 1) pair for every word in the document
    for w in doc.split():
        EmitIntermediate(w, 1)

def reduce(output_key: str, output_vals):
    # sum up all the ones emitted for this word
    res = 0
    for v in output_vals:
        res += v
    Emit(res)

12

MapReduce Example

• map stage: output (w, 1) for each word w

• group stage: a list of (w, 1) pairs becomes (w, [1, 1, ..., 1])

• reduce stage: for each w, count how many ones there are

13
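A self-contained Python version of these three stages for word counting (the framework-style EmitIntermediate/Emit calls are replaced by plain lists and a dictionary; the input here is only a short fragment, so the counts differ from the full results on the next slide):

from collections import defaultdict

doc = "lorem ipsum dolor sit amet consectetur adipiscing elit dictum dolor"

# map stage: output (w, 1) for each word w
pairs = [(w, 1) for w in doc.split()]

# group stage: (w, 1) pairs become (w, [1, 1, ..., 1])
groups = defaultdict(list)
for w, one in pairs:
    groups[w].append(one)

# reduce stage: for each w, count how many ones there are
counts = {w: sum(ones) for w, ones in groups.items()}
print(counts)   # {'lorem': 1, ..., 'dolor': 2, ...}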

MapReduce Example: Result

• amet: 2

• ante: 2

• aptent: 1

• consectetur: 1

• dictum: 3

• dolor: 2

• elit: 3

• ...

14

Hadoop

(photo: http://flickr.com/photos/erikeldridge/3614786392/)

Hadoop

... is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

16

Hadoop

• Open Source implementation of MapReduce

• "Hadoop":

• HDFS

• Hadoop MapReduce

• HBase

• Hive

• ... many others

17

Hadoop Cluster: Terminology

• Name Node (the master): orchestrates the process

• Workers: nodes that do the computation

• Mappers do the map phase

• Reducers do the reduce phase

18

Hadoop

[Diagram: each mapper reads its split of the HDFS file, applies map and an optional combine step, and writes to mapper local storage; each reducer pulls (copies) the map outputs, sorts them, applies reduce, and writes the result]

19

[Figure from http://escience.washington.edu/get-help-now/what-hadoop]

20


Fault Tolerance and Load Balancing

• No execution plan

• Node failed ⇒ task reassigned

• Node done ⇒ another task assigned

• No communication costs

27

Advantages

• Simple, especially for programmers who know FP

• Fault tolerant

• No schema, can process any data

• Flexible

• Cheap and runs on commodity hardware

28

Disadvantages

• No declarative high-level language like SQL

• Performance issues:

• Map and Reduce are blocking

• Name Node: single point of failure

• It's young

29

Disadvantages

[Abouzeid et al., 2009]

30

Hadoop as a Data Warehouse

• Cheetah

• Hive

31

Cheetah

• Typical DW relation-like schemas

• ... But not exactly

• They call them virtual views

32

Cheetah

33

Cheetah

• Virtual views consist of columns that can be queried

• Everything inside is entirely denormalized

• Append-only design and slowly changing dimensions

• Proprietary

34

Hive

• A data warehousing solution built by Facebook

• For Big data analysis:

• in 2010 (4 years ago!), 30+ PB

• Has its own data model

• HiveQL: a declarative SQL-like language for ad-hoc querying

35

HiveQL

Tables

STATUS_UPDATES(userid int, status string, ds string)

PROFILES(userid int, school string, gender int)

LOAD DATA LOCAL INPATH 'logs/status_updates'
INTO TABLE status_updates
PARTITION (ds='2009-03-20')

36

HiveQL

FROM (
  SELECT a.status, b.school, b.gender
  FROM status_updates a JOIN profiles b
    ON (a.userid = b.userid and a.ds = '2009-03-20')
) subq1

INSERT OVERWRITE TABLE gender_summary
  PARTITION (ds='2009-03-20')
SELECT subq1.gender, count(1)
GROUP BY subq1.gender

INSERT OVERWRITE TABLE school_summary
  PARTITION (ds='2009-03-20')
SELECT subq1.school, count(1)
GROUP BY subq1.school

37


HiveQL

REDUCE subq2.school, subq2.meme, subq2.cnt
  USING 'top10.py' AS (school, meme, cnt)
FROM (
  SELECT subq1.school, subq1.meme, count(1) AS cnt
  FROM (
    MAP b.school, a.status
      USING 'meme_extractor.py'
      AS (school, meme)
    FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid)
  ) subq1
  GROUP BY subq1.school, subq1.meme
  DISTRIBUTE BY school, meme
  SORT BY school, meme, cnt desc
) subq2

39
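The scripts named in the query above ('meme_extractor.py', 'top10.py') are user-supplied streaming programs: Hive pipes the selected columns to them as tab-separated lines on stdin and reads tab-separated lines back from stdout. A hedged sketch of what a mapper script like meme_extractor.py could look like; the meme-extraction rule itself is invented here purely for illustration:

#!/usr/bin/env python
import sys

def extract_memes(status):
    # placeholder logic: treat every hashtag-like token as a "meme"
    return [tok for tok in status.split() if tok.startswith("#")]

# Hive streams one row per line: school <TAB> status
for line in sys.stdin:
    school, status = line.rstrip("\n").split("\t", 1)
    for meme in extract_memes(status):
        # emit one (school, meme) output row per meme found in the status
        print(school + "\t" + meme)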

Hadoop + Data Warehouse

(photo: http://www.flickr.com/photos/mrflip/5150336351/in/photostream/)

Hadoop + Data Warehouse

• Hadoop and Data Warehouses can co-exist

• DW: OLAP, BI, transactional data

• Hadoop: Raw, unstructured data

41

ETL

• Extract: load to HDFS, parse, prepare

• Run some analysis

• Transform: clean data and transform to some structured format

• with MapReduce

• Load: extract from HDFS, load to DW

42
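A hedged sketch of how such an ETL flow could be driven end-to-end from Python, using only standard command-line tools (hadoop fs -put/-get and hive -e); the paths, table name, and query are hypothetical placeholders:

import subprocess

def run(cmd):
    # run a shell command and raise if it fails
    subprocess.check_call(cmd)

# Extract: copy raw logs from the local filesystem into HDFS
run(["hadoop", "fs", "-put", "logs/status_updates", "/raw/status_updates"])

# Transform: clean/aggregate the raw data with a Hive query
# (a hand-written MapReduce job would work here as well)
run(["hive", "-e",
     "INSERT OVERWRITE DIRECTORY '/clean/daily_summary' "
     "SELECT ds, count(1) FROM status_updates GROUP BY ds"])

# Load: pull the structured result back out of HDFS so the
# data warehouse's bulk loader can pick it up
run(["hadoop", "fs", "-get", "/clean/daily_summary", "export/daily_summary"])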

ETL: examples

• Text processing

• Call center records analysis

• extract sentiment

• link to profile

• which customers are more important to keep?

• Image processing

43

Active Storage

• Don't delete the data after processing

• Hadoop storage is cheap: it can store anything

• Run more analysis when needed

• For example: extract new keywords/features from the old dataset

44

Active Storage - 2

• Up to 80% of data is dormant (or cold)

• Hadoop storage can be way cheaper than high-cost data management solutions

• Move this data to Hadoop

• When needed, quickly analyze it there or move it back to the DW

45

Analytical Sandbox

46


Analytical Sandbox

• What are we looking for in this data?

• No structure, so it's hard to know

• Run ad-hoc Hive queries to see what's there

49

Conclusions

• Hadoop is becoming more and more popular

• Many companies plan to adopt it

• Best used alongside existing DW solutions

• as an ETL platform

• as Active Storage

• as Analytical Sandbox

50

References

1. Lee, Kyong-Ha, et al. "Parallel data processing with MapReduce: a survey." ACM SIGMOD Record 40.4 (2012): 11-20. [pdf]

2. "MapReduce vs Data Warehouse". Webpage, [link]. Accessed 15/12/2013.

3. Ordonez, Carlos, Il-Yeol Song, and Carlos Garcia-Alvarado. "Relational versus non-relational database systems for data warehousing." Proceedings of the ACM 13th International Workshop on Data Warehousing and OLAP. ACM, 2010. [pdf]

4. A. Awadallah, D. Graham. "Hadoop and the Data Warehouse: When to Use Which." (2011). [pdf] (by Cloudera and Teradata)

5. Thusoo, Ashish, et al. "Hive: a warehousing solution over a map-reduce framework." Proceedings of the VLDB Endowment 2.2 (2009): 1626-1629. [pdf]

6. Chen, Songting. "Cheetah: a high performance, custom data warehouse on top of MapReduce." Proceedings of the VLDB Endowment 3.1-2 (2010): 1459-1468. [pdf]

51

References

7. "How (and Why) Hadoop is Changing the Data Warehousing Paradigm." Webpage [link]. Accessed 15/12/2013.

8. P. Russom. "Integrating Hadoop into Business Intelligence and Data Warehousing." (2013). [pdf]

9. M. Ferguson. "Offloading and Accelerating Data Warehouse ETL Processing Using Hadoop." [pdf]

10. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113. [pdf]

11. "What is Hadoop?" Webpage [link]. Accessed 15/12/2013.

12. Apache Hadoop project home page, url: [link].

13. Apache HBase home page, [link].

14. Apache Mahout home page, [link].

15. "How Hadoop Cuts Big Data Costs" [link]. Accessed 05/01/2014.

16. "The Impact of Data Temperature on the Data Warehouse." whitepaper by Terradata (2012). [pdf]

17. Abouzeid, Azza, et al. "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads." Proceedings of the VLDB Endowment 2.1 (2009): 922-933. [pdf]

52

Thank you

Prepared with Shower
