Hadoop 101

Introducing: The Modern Data Operating System


DESCRIPTION

An introduction to big data and Hadoop: defining big data, the history of Hadoop, how Hadoop works, and HDFS.


Page 1: Hadoop 101

Introducing: The Modern Data Operating System

Page 2: Hadoop 101

Hadoop is ... a scalable, fault-tolerant, distributed system for data storage and processing (open source under the Apache license).

Core Hadoop has two main systems:

● Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage

● MapReduce: distributed, fault-tolerant resource management and scheduling, coupled with a scalable data-programming abstraction

Page 3: Hadoop 101

Hadoop Origins

Each Google technology maps to a Hadoop counterpart:

GFS → HDFS
Map/Reduce → MapReduce
BigTable → HBase

Page 4: Hadoop 101

Hadoop Chronicles

[Timeline slide: the Google papers (GFS, Map/Reduce, BigTable) and Doug Cutting, who brought their ideas to open source.]

Page 5: Hadoop 101

Etymology

● Hadoop was created in 2004 by Douglass (Doug) Cutting

● It implemented Google's File System and MapReduce papers

● He aimed to index the internet, Google-style, for the fledgling search engine Nutch

● He named it after his son's favourite toy, an elephant-shaped stuffed animal called Hadoop

Page 6: Hadoop 101

What is Big Data?

"In information technology, big data is a loosely defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools."

— Wikipedia

Pages 7–11: Hadoop 101 (image-only slides)
Page 12: Hadoop 101

How big is big?

● 2008: Google processes 20 PB a day

● 2009: eBay has 6.5 PB of user data, plus 50 TB more a day

● 2011: Yahoo! has 180–200 PB of data

● 2012: Facebook ingests 500 TB of data a day

Page 13: Hadoop 101

Limitations of Existing Analytics Architecture

The traditional stack, top to bottom:

BI Reports + Online Apps
RDBMS (aggregated data)
ETL (Extract, Transform & Load)
Storage Grid
Data Collection
Instrumentation (raw data sources)

Its limitations:

- Moving data from storage to compute doesn't scale
- Can't explore the original raw data
- Archiving = premature death
- Mostly append

Page 14: Hadoop 101

Why Hadoop? Challenge: read 1 TB of data.

1 machine: 4 I/O channels, each channel 100 MB/s → 45 minutes

10 machines: 4 I/O channels each, 100 MB/s per channel → 4.5 minutes
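The arithmetic behind those figures: 1 TB ÷ (4 channels × 100 MB/s) = 1,000,000 MB ÷ 400 MB/s = 2,500 s ≈ 42 minutes, which the slide rounds to 45. Ten machines read in parallel, so each scans only 100 GB and the wall-clock time drops tenfold, to about 4.5 minutes.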

Page 15: Hadoop 101

Hadoop and Friends

Page 16: Hadoop 101

The Key Benefit: Agility/Flexibility

Schema-on-Write (RDBMS):

- The schema must be created before any data can be loaded
- An explicit load operation transforms the data into the database's internal structure
- New columns must be added explicitly before data for them can be loaded
- Reads are fast; standards and governance come built in

Schema-on-Read (Hadoop):

- Data is simply copied to the file store; no transformation is needed
- A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding)
- New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it
- Loads are fast; flexibility and agility come built in
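Concretely, "data is simply copied" means landing the raw file in HDFS as-is; a minimal sketch, with a hypothetical path and file name:

bin/hadoop fs -mkdir /kenshoo/raw
bin/hadoop fs -copyFromLocal clicks.csv /kenshoo/raw/clicks.csv

No schema exists at this point; a SerDe interprets the bytes only when the data is read.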

Page 17: Hadoop 101

Hadoop Components: Master/Slave Architecture

Masters: Name Node (HDFS), Job Tracker (MapReduce)

Slaves: Data Nodes (HDFS), Task Trackers (MapReduce)

Page 18: Hadoop 101

File metadata on the NameNode:

/kenshoo/data1.txt → blocks 1, 2, 3
/kenshoo/data2.txt → blocks 4, 5

The replication factor (r=3 here) is set via the dfs.replication property in hdfs-site.xml.

[Diagram: blocks 1–5 spread across the Data Nodes, each block stored on three different nodes.]
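A minimal sketch of the relevant hdfs-site.xml property (3 also happens to be Hadoop's stock default):

<property>
  <!-- default block replication factor for new files -->
  <name>dfs.replication</name>
  <value>3</value>
</property>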

Page 19: Hadoop 101

Underlying FS options

ext3- released in 2001 - Used by Yahoo!- bootstrap + format slow- set:

- noatime- tune2fs (to turn off reserved blocks)

ext4- released in 2008 - Used by Google- Fast as XFS- set:

- delayed allocation off-noatime- tune2fs (to turn off reserved blocks)

XFS- released in 1993 - Fast- Drawbacks:

- deleting large # of files
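In shell terms, the ext3/ext4 tunings above amount to something like this (the device and mount point are illustrative):

tune2fs -m 0 /dev/sdb1              # reclaim the root-reserved blocks for data
mount -o noatime /dev/sdb1 /data/1  # skip access-time writes on every read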

Page 20: Hadoop 101

Sample HDFS Shell Commands

bin/hadoop fs -ls
bin/hadoop fs -mkdir
bin/hadoop fs -copyFromLocal
bin/hadoop fs -copyToLocal
bin/hadoop fs -moveToLocal
bin/hadoop fs -rm
bin/hadoop fs -tail
bin/hadoop fs -chmod
bin/hadoop fs -setrep -w 4 -R /dir1/s-dir

Mounting using FUSE:

hadoop-fuse-dfs dfs://10.73.9.50 /hdfs
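A short illustrative session tying some of these together (paths and file names are hypothetical):

bin/hadoop fs -mkdir /kenshoo                      # create an HDFS directory
bin/hadoop fs -copyFromLocal data1.txt /kenshoo/   # upload a local file
bin/hadoop fs -ls /kenshoo                         # list it
bin/hadoop fs -tail /kenshoo/data1.txt             # print the file's last kilobyte
bin/hadoop fs -setrep -w 4 /kenshoo/data1.txt      # raise replication to 4 and wait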

Page 21: Hadoop 101

Network Topology

[Diagram: the Name Node, Job Tracker and HBase Master above three racks of Data Nodes, connected through the cluster switches.]

Yahoo! installation:

- 8 core switches
- 100 racks
- 40 servers/rack
- 1 Gbit within a rack
- 10 Gbit among racks
- Total: 11 PB

Page 22: Hadoop 101

Rack Awareness

NameNode metadata:

file.txt = Blk A: DN 2, 7, 8
           Blk B: DN 9, 12, 14

[Diagram: Data Nodes 2–5 in Rack 1, 7–10 in Rack 2 and 12–15 in Rack 3, with the Name Node, Job Tracker and HBase Master above them. Block A's replicas sit on nodes 2, 7 and 8; block B's on nodes 9, 12 and 14. Each block's replicas span two racks, so losing a whole rack never loses a block.]
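fsck can report this placement directly, including the rack of every replica (the path is illustrative):

bin/hadoop fsck /file.txt -files -blocks -racks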

Page 23: Hadoop 101

HDFS Writes

NameNode metadata:

file.txt = Blk A: DN 2, 7, 9

[Diagram: the client asks the NameNode where to put block A, then streams it to Data Node 2 in Rack 1; that node pipelines the replica across the core switch to node 7 in Rack 2, which forwards it to node 9 in the same rack.]
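To watch the outcome of such a write, copy a file in and then list where its blocks landed (path illustrative; -locations prints the Data Nodes per block):

bin/hadoop fs -copyFromLocal file.txt /file.txt
bin/hadoop fsck /file.txt -files -blocks -locations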

Page 24: Hadoop 101

Reading Files

The client asks the NameNode: "I want to read file1.txt." The NameNode answers from its metadata with the file's parts:

file1.txt = Blk A: DN 2, 7, 8
            Blk B: DN 9, 12, 14

The client then fetches each block directly from the closest Data Node holding a replica.

[Diagram: Data Nodes 2–5 in Rack 1, 7–10 in Rack 2 and 12–15 in Rack 3; the client pulls block A from one of nodes 2, 7, 8 and block B from one of 9, 12, 14, routed through the core switch.]
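The same read from the shell, which streams each block from the closest replica (path is illustrative):

bin/hadoop fs -cat /file1.txt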