Is Hadoop for You?

Introduction to Hadoop for Oracle Database professionals. Presented at E4 conference.

1

Is Hadoop For You?

Gwen Shapira, Solutions Architect

2

About Me

• Solutions Architect @ Cloudera
• Making our customers successful
• Formerly:
  • Database consultant @ Pythian
  • Specializing in Exadata, RAC, replication
• Oracle ACED, Oak Table Member

• @gwenshap <- Hadoop tips in 140 characters

3

Agenda

Answer the question: Who needs Hadoop?

4

In more detail…

[Bar chart: % of session spent on each topic (axis from 0% to 45%). Topics: Getting Started, What you need to succeed, When to Hadoop, Basic Hadoop Architecture, What's so special about Hadoop]

5

What’s so special about Hadoop?

Technically Speaking

6

Databases in 1999

1. Buy a really big machine
2. Install an expensive DBMS on it
3. Point your workload at it
4. Hope it doesn’t fail
5. Ambitious: buy another really big machine as a backup

7

Problems:

• Reliability
• Scalability
• Storage throughput
• Complex Upgrades
• Relational only

8

Exadata: State of the Art - 2007

1. Storage and compute in one rack
2. Cluster with InfiniBand interconnect
3. Balanced architecture
4. Offloading
5. Parallelism
6. Compression

9

Hadoop

• Distributed File System
• Programming Framework
• Many projects on top
• Open Source (This means free)

10

Designed For:

• Reliability
• Parallel Processing
• Scalability
• Flexibility

11

Reminders:

• Disk does a seek for each I/O operation
• Seeks are expensive (~10ms)
• Big I/Os mean better throughput
• Network is fast inside rack
• Slower between racks
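The arithmetic behind "big I/Os mean better throughput" can be worked through in a few lines. This is a rough sketch, not a disk model: the ~10ms seek comes from the slide, while the 100 MB/s sequential transfer rate and the function name are assumptions chosen only for illustration.

```python
def effective_throughput_mb_s(io_size_mb, seek_ms=10.0, transfer_mb_s=100.0):
    """Throughput when every I/O pays one seek and then streams io_size_mb.

    seek_ms is the ~10ms figure from the slide; transfer_mb_s is an
    assumed sequential transfer rate, not a measured one.
    """
    seek_s = seek_ms / 1000.0
    transfer_s = io_size_mb / transfer_mb_s
    return io_size_mb / (seek_s + transfer_s)


if __name__ == "__main__":
    # 8 KB (a typical database block), 1 MB, and one 64 MB HDFS block
    for size_mb in (0.008, 1, 64):
        rate = effective_throughput_mb_s(size_mb)
        print(f"{size_mb:>7} MB per I/O: {rate:6.1f} MB/s")
```

Under these assumptions, small I/Os spend almost all their time seeking, while a 64 MB read gets close to the full sequential rate of the disk.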

12

The File System

• Files are split into 64M blocks
• 64M!!!
• Distributed
• Replicated
• Write-Once
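A minimal sketch of what splitting and replication mean for one file. The 64M block size is from the slide. The replication factor of 3 is an assumption here (it matches the three DataNodes shown on the write-path slide), and the function names are made up for this example.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64M blocks, as on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block of a file.

    Every block is full except possibly the last one, which only
    holds whatever is left of the file.
    """
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


def raw_bytes_stored(file_size, replication=3):
    """Bytes consumed across the cluster (replication=3 is an assumption)."""
    return file_size * replication


if __name__ == "__main__":
    mb = 1024 * 1024
    size = 200 * mb
    blocks = split_into_blocks(size)
    print(len(blocks), "blocks, last one is", blocks[-1] // mb, "MB")
    print("raw storage:", raw_bytes_stored(size) // mb, "MB")
```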

HDFS Architecture

13

[Diagram: one NameNode and four DataNodes. The NameNode holds the metadata: paths, filenames, file sizes, block locations, …]

HDFS Architecture

14

[Diagram: one NameNode and four DataNodes. The DataNodes hold the data: blocks, checksums]

HDFS Write Path

15

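[Diagram: a Client, the NameNode, and four DataNodes (DN 1 to DN 4) spread over Rack 1 and Rack 2. The client calls create(“/tmp/myfile”) and the NameNode answers "Write to [DN4,DN3,DN2]". The client writes to DN4, which passes the data on to [DN3,DN2]; DN3 passes it on to [DN2].]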
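The shrinking list on the write-path slide ([DN4,DN3,DN2], then [DN3,DN2], then [DN2]) is a replication pipeline: each node stores the block and hands it to the next one. The toy simulation below only illustrates that hand-off. It is not HDFS client code, and `pipeline_write` and the `storage` dictionary are hypothetical names.

```python
def pipeline_write(block, pipeline, storage):
    """Store block on each node of the pipeline in turn.

    storage maps node name -> list of blocks held by that node.
    Returns the hops as (node, nodes still to receive the block).
    """
    hops = []
    remaining = list(pipeline)
    while remaining:
        node, remaining = remaining[0], remaining[1:]
        storage.setdefault(node, []).append(block)  # node keeps its replica
        hops.append((node, list(remaining)))        # then forwards downstream
    return hops


if __name__ == "__main__":
    storage = {}
    # The NameNode's answer from the slide: write to [DN4, DN3, DN2]
    for node, downstream in pipeline_write(b"block-0", ["DN4", "DN3", "DN2"], storage):
        print(node, "stores the block, forwards to", downstream)
```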

HDFS Read Path

16

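[Diagram: a Client, the NameNode, and four DataNodes (DN 1 to DN 4) spread over Rack 1 and Rack 2. The client calls open(“/tmp/myfile”,“r”) and the NameNode answers "Read from [DN4,DN3,DN2]". The client then reads the data from a DataNode.]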
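On the read path the client gets a list of replicas and should prefer the closest one, since the network is fast inside a rack and slower between racks. A small sketch of that ordering follows. The rack assignment in `RACK_OF` is an assumption (the slide does not say which DataNode sits in which rack), and the 0/1/2 ranks are a simplification for illustration.

```python
# Assumed layout: the slide shows two racks but not the exact placement.
RACK_OF = {"DN1": "rack1", "DN2": "rack1", "DN3": "rack2", "DN4": "rack2"}


def distance(reader, replica, rack_of=RACK_OF):
    """0 = same node, 1 = same rack, 2 = different rack (or unknown reader)."""
    if reader == replica:
        return 0
    if rack_of.get(reader) == rack_of.get(replica):
        return 1
    return 2


def order_replicas(reader, replicas):
    """Closest replica first; ties keep the order the NameNode returned."""
    return sorted(replicas, key=lambda node: distance(reader, node))


if __name__ == "__main__":
    print(order_replicas("DN3", ["DN4", "DN3", "DN2"]))
    print(order_replicas("DN1", ["DN4", "DN3", "DN2"]))
```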

17

Map-Reduce

• Java Framework
• Works on Key-Value pairs
• Map:
  • Operate on every element
  • Filter or transform
  • Code runs where the data is stored
• Shuffle:
  • Redistribution of data
• Reduce:
  • Aggregate or Join
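The three phases can be sketched in plain Python on one machine. This is an illustration of the programming model only, not the Hadoop Java API: `run_map_reduce`, the mapper, the reducer, and the log-line format ("status bytes") are all invented for the example.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Single-process sketch of map, shuffle, and reduce."""
    # Map: operate on every element, emitting zero or more (key, value) pairs
    mapped = [pair for record in records for pair in mapper(record)]
    # Shuffle: redistribute the pairs so all values for a key end up together
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: aggregate the values of each key
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def error_bytes_mapper(line):
    """Filter to 5xx responses and transform each line to (status, bytes)."""
    status, size = line.split()
    if status.startswith("5"):
        yield status, int(size)


def sum_reducer(key, values):
    return sum(values)


if __name__ == "__main__":
    log_lines = ["200 512", "500 128", "503 64", "500 256"]  # made-up input
    print(run_map_reduce(log_lines, error_bytes_mapper, sum_reducer))
```

In a real cluster the map calls run in parallel on the nodes that hold the data, and the shuffle moves pairs across the network; here everything happens in one loop.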

MapReduce Architecture

18

[Diagram: the JobTracker and the NameNode, with four DataNodes (DN 1 to DN 4) and four TaskTrackers (TT 1 to TT 4) spread over Rack 1 and Rack 2]

JobTracker:
• Gateway for users
• Assigns tasks to TaskTrackers
• Tracks job status

MapReduce Architecture

19

[Diagram: the JobTracker and the NameNode, with four DataNodes (DN 1 to DN 4) and four TaskTrackers (TT 1 to TT 4) spread over Rack 1 and Rack 2]

• TaskTrackers execute Map and Reduce tasks assigned by the JobTracker

20

Word Count Example

MapReduce Architecture

21

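[Diagram: wordcount(<files>) is submitted to the JobTracker. Map tasks M1 to M4 and reduce task R1 run on the TaskTrackers (TT 1 to TT 4), next to the DataNodes (DN 1 to DN 4) in Rack 1 and Rack 2. The map tasks emit pairs such as [cat, 1] [dog, 1] [the, 1] [sat, 1].]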

MapReduce Architecture

22

[Diagram: wordcount(<files>) continues with map tasks M5 to M8 and reduce task R1 on the TaskTrackers (TT 1 to TT 4). The aggregated counts are [a, 5] [cat, 2] [dog, 1] [the, 4] [mat, 1].]
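The word count slides can be mirrored in a short sketch: several map tasks each emit [word, 1] pairs for their own split, and a single reducer (R1 on the slide) adds them up. The input splits below are invented, so the counts will not match the numbers on the slide, and the function names are hypothetical.

```python
from collections import defaultdict


def map_words(split):
    """One map task: emit (word, 1) for every word in its split."""
    return [(word, 1) for word in split.lower().split()]


def reduce_counts(pairs):
    """The single reduce task: sum the 1s for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(splits):
    map_outputs = [map_words(split) for split in splits]  # one task per split
    # Shuffle: every pair goes to the one reducer, sorted by key
    shuffled = sorted(pair for output in map_outputs for pair in output)
    return reduce_counts(shuffled)


if __name__ == "__main__":
    splits = ["the cat sat", "on the mat", "the dog sat", "a cat"]  # made-up data
    for word, count in sorted(word_count(splits).items()):
        print(f"[{word}, {count}]")
```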

23

Compare to Oracle PX

• Mappers -> Producers
• Reducers -> Consumers
• Shuffle -> Re-distribution
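The re-distribution step is where the comparison is easiest to see: with more than one reducer, each key is hashed to decide which reducer (consumer) receives it. A hedged sketch follows. It uses CRC32 as a stand-in hash so the result is deterministic across runs; that choice, and the names `partition` and `shuffle`, are assumptions of this example rather than what Hadoop itself uses.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer for a key; the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_output, num_reducers):
    """Redistribute (key, value) pairs from the mappers into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in map_output:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


if __name__ == "__main__":
    map_output = [("cat", 1), ("dog", 1), ("cat", 1), ("the", 1)]
    for reducer_id, bucket in enumerate(shuffle(map_output, 3)):
        print("reducer", reducer_id, "receives", bucket)
```

Because all pairs for a key land in one bucket, each reducer can aggregate its keys without talking to the others.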

24

In Short

Benefits

• Reliable
• Scalable
• Infinite Flexibility
• Cheap

Challenges

• New skills
• Infinitely Flexible
• Feature-completeness
• Best practices and examples

25

Use Cases

When to Hadoop?

26

When to Hadoop?

When Relational Databases Don’t Add Benefits

27

Non-relational Data

• XML
• Logs
• Geospatial data
• Video

28

Adding to the Data Warehouse

• ETL
• History
• Some reports
• Rocket Data Science

29

What you Need to Succeed

30

A Problem

31

Right Toolset

32

Toolset

33

Toolset for DBAs

• Hive – Turn SQL to Map-Reduce
• Streaming – Map-Reduce in any language
• Pig – Write and execute execution plans
• Oozie – Coordinate workflows
• Impala – real-time SQL
• HBase – key-value real-time data store
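To give a feel for "Map-Reduce in any language": with Streaming, the mapper and reducer are ordinary programs that read lines and write tab-separated key/value lines, and the reducer sees its input sorted by key. The sketch below follows that contract but runs locally on a list of lines. The "movie_id,rating" input format, the function names, and the local runner are assumptions for illustration; in a real job each function would be a script reading standard input.

```python
from itertools import groupby


def mapper(lines):
    """Turn assumed 'movie_id,rating' lines into 'movie_id<TAB>rating' lines."""
    for line in lines:
        movie_id, rating = line.strip().split(",")
        yield f"{movie_id}\t{rating}"


def reducer(sorted_lines):
    """Input is sorted by key, so each movie's ratings arrive together."""
    pairs = (line.split("\t") for line in sorted_lines)
    for movie_id, group in groupby(pairs, key=lambda pair: pair[0]):
        ratings = [float(rating) for _, rating in group]
        yield f"{movie_id}\t{sum(ratings) / len(ratings):.2f}"


def run_locally(lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for line in run_locally(["m1,4", "m2,3", "m1,5"]):
        print(line)
```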

34

Data Model

• Partitions
• Batch processing
• Star Schema
• Materialized Views
• Sort and Compress

• De-normalize
• Tune the data
• Nested data structures
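Two of these ideas are easy to show in miniature: partitions as key=value directories (the layout Hive uses for partitioned tables), and de-normalizing by nesting the dimension row inside each fact record so no join is needed at read time. The table root, column names, and sample rows below are hypothetical.

```python
def partition_path(table_root, record, partition_keys):
    """Build a key=value directory path for the partition a record belongs to."""
    parts = [f"{key}={record[key]}" for key in partition_keys]
    return "/".join([table_root, *parts])


def denormalize(orders, customers):
    """Embed the matching customer row in each order as a nested structure."""
    by_id = {customer["id"]: customer for customer in customers}
    return [{**order, "customer": by_id[order["customer_id"]]} for order in orders]


if __name__ == "__main__":
    customers = [{"id": 7, "name": "Acme"}]                       # made-up rows
    orders = [{"order_id": 1, "customer_id": 7, "year": 2013, "month": "08"}]
    for row in denormalize(orders, customers):
        print(partition_path("/warehouse/orders", row, ["year", "month"]), row)
```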

35

Right Hardware

• If possible – POC with your workload
• Sizing by storage
• You probably need to over-provision
• Machine reliability
• Big Data Appliance is a good start
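"Sizing by storage" comes down to simple arithmetic: the data volume, multiplied by the replication factor and some headroom, divided by the usable disk per node. Every number below (replication of 3, 25% headroom for temporary data, 24 TB of disk per node) is an assumption picked for illustration, not a recommendation from the talk.

```python
import math


def nodes_needed(data_tb, replication=3, temp_overhead=0.25, disk_per_node_tb=24.0):
    """Rough node count from storage alone; all defaults are assumptions."""
    raw_tb = data_tb * replication * (1 + temp_overhead)
    return math.ceil(raw_tb / disk_per_node_tb)


if __name__ == "__main__":
    for data_tb in (10, 100, 500):
        print(f"{data_tb} TB of data: about {nodes_needed(data_tb)} nodes")
```

A proof of concept with the real workload is still the better guide, since this ignores CPU, memory, and growth.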

36

Non-technical Advice

• Your team will have to learn a lot
• Be ready for a challenge

37

Getting Started

38

Why get started?

• Hadoop projects are more visible
• 48% of Hadoop clusters are owned by DWH team
• Big Data == Business pays attention to data
• New skills – from coding to cluster administration
• Interesting projects

• No, you don’t need to learn Java

39

VM, Cloud, Cluster

40

Books

41

More Books

42

Beginner Projects

• Install 5-node Hadoop cluster in AWS
• Load data:
  • Complete works of Shakespeare
  • MovieLens database
• Find the 10 most common words in Shakespeare
• Find the 10 most recommended movies
• Run TPC-H
• Cloudera Data Science Challenge
• Actual use-case: XML ingestion, ETL process, DWH history

43

Need Help?

• I can help:
  • @gwenshap
  • gshapira@cloudera.com
• Hadoop Community:
  • http://community.cloudera.com
  • user@hadoop.apache.org
  • Google group: CDH Users

44
