Hadoop Distributed File System (HDFS)

10/05/2018




HDFS Overview

A distributed file system

Built on the architecture of the Google File System (GFS)

Shares a similar architecture with many other common distributed storage engines, such as Amazon S3 and Microsoft Azure

HDFS is a stand-alone storage engine and can be used independently of the query processing engine

HDFS Architecture

[Figure: one name node and a set of data nodes; each data node stores several blocks (B)]

What is where?

Name node: file and directory names; block ordering and locations; capacity of data nodes; architecture of data nodes

Data nodes: block data; name node location
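
A minimal sketch of the bookkeeping a name node could hold for file names, block ordering, and block locations. The class `NameNodeMetadata` and its methods are hypothetical illustrations, not Hadoop classes, and capacity and topology tracking are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of name node metadata; not Hadoop's real classes.
final class NameNodeMetadata {
    // File path -> ordered list of block IDs (the block ordering).
    private final Map<String, List<Long>> fileBlocks = new HashMap<>();
    // Block ID -> data nodes that hold a replica (the block locations).
    private final Map<Long, Set<String>> blockLocations = new HashMap<>();

    // Appends a block to a file and records which data nodes store it.
    void addBlock(String path, long blockId, List<String> dataNodes) {
        fileBlocks.computeIfAbsent(path, p -> new ArrayList<>()).add(blockId);
        blockLocations.put(blockId, new LinkedHashSet<>(dataNodes));
    }

    // Blocks of a file in order; empty if the file is unknown.
    List<Long> blocksOf(String path) {
        return new ArrayList<>(fileBlocks.getOrDefault(path, new ArrayList<>()));
    }

    // Data nodes holding the block; empty if the block is unknown.
    Set<String> locationsOf(long blockId) {
        return new LinkedHashSet<>(blockLocations.getOrDefault(blockId, new LinkedHashSet<>()));
    }
}
```

Note that the block contents never appear in this structure: the name node only answers "which blocks, in which order, on which nodes".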

Analogy to Unix FS

The logical view is similar

[Figure: a directory tree rooted at / with the entries user, mary, chu, etc, and hadoop]

The physical model is comparable

Unix: File1 -> List of inodes -> Block 1, Block 2, Block 3

HDFS: File1 -> List of block locations (metadata) -> blocks (B) on the data nodes

HDFS Create

The creator process calls the create function, Create(…), which translates into an RPC call to the name node

The name node allocates the first block and assigns its three replicas to data nodes

1. The first replica is assigned to a random machine

2. The second replica is assigned to another random machine in the same rack as the first machine

3. The third replica is assigned to a random machine in another rack
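
The three placement rules listed above can be sketched as follows. This is an illustration under stated assumptions: `DataNodeInfo` and `ReplicaPlacement` are hypothetical names, the cluster is assumed to have at least two racks with at least two nodes each, and the default policy of a particular Hadoop release may order the rack choices differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical node descriptor: a host name plus the rack it sits in.
final class DataNodeInfo {
    final String host;
    final String rack;
    DataNodeInfo(String host, String rack) { this.host = host; this.rack = rack; }
}

final class ReplicaPlacement {
    // Picks three target nodes: a random node, another node in the same rack,
    // and a node in a different rack.
    static List<DataNodeInfo> choose(List<DataNodeInfo> cluster, Random rnd) {
        DataNodeInfo first = cluster.get(rnd.nextInt(cluster.size()));
        List<DataNodeInfo> sameRack = new ArrayList<>();
        List<DataNodeInfo> otherRack = new ArrayList<>();
        for (DataNodeInfo n : cluster) {
            if (n == first) continue;
            if (n.rack.equals(first.rack)) sameRack.add(n); else otherRack.add(n);
        }
        if (sameRack.isEmpty() || otherRack.isEmpty()) {
            throw new IllegalStateException("need a second node in the first rack and a second rack");
        }
        List<DataNodeInfo> targets = new ArrayList<>();
        targets.add(first);
        targets.add(sameRack.get(rnd.nextInt(sameRack.size())));
        targets.add(otherRack.get(rnd.nextInt(otherRack.size())));
        return targets;
    }
}
```

Keeping two copies in one rack limits cross-rack traffic during the write, while the copy in another rack survives the loss of a whole rack.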

The file creator receives an OutputStream and writes the data with OutputStream#write to the three assigned data nodes (1, 2, 3)

When a block is filled up, the creator contacts the name node to create the next block
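
A minimal sketch of the block roll-over described above: the writer asks for a new block only when the current one is full. `BlockRollover` is a hypothetical model, the name node RPC is replaced by a counter, and for simplicity the first block is requested on the first write rather than at create time.

```java
// Hypothetical writer model: counts how often the name node would be asked for a block.
final class BlockRollover {
    private final long blockSize;
    private long bytesInCurrentBlock = 0;
    private int blocksAllocated = 0;

    BlockRollover(long blockSize) {
        if (blockSize <= 0) throw new IllegalArgumentException("blockSize must be positive");
        this.blockSize = blockSize;
    }

    // Simulates writing `length` bytes through the output stream.
    void write(long length) {
        long remaining = length;
        while (remaining > 0) {
            if (blocksAllocated == 0 || bytesInCurrentBlock == blockSize) {
                blocksAllocated++;            // stands in for the RPC to the name node
                bytesInCurrentBlock = 0;
            }
            long chunk = Math.min(remaining, blockSize - bytesInCurrentBlock);
            bytesInCurrentBlock += chunk;
            remaining -= chunk;
        }
    }

    int blocksAllocated() { return blocksAllocated; }
}
```

The name node is contacted once per block, not once per write call, which keeps it off the data path.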

Notes about writing to HDFS

Data transfers of replicas are pipelined

The data does not go through the name node

Random writing is not supported

Appending to a file is supported, but it creates a new block
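
To illustrate what "pipelined" means here, the sketch below has the writer send each packet only to the first data node, which stores it and forwards it to the next one. `PipelineNode` and `PipelineDemo` are hypothetical names, and acknowledgements and failure handling are omitted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical data node that stores a packet and forwards it down the pipeline.
final class PipelineNode {
    final String name;
    final List<String> stored = new ArrayList<>();
    private final PipelineNode next;   // null for the last node in the pipeline

    PipelineNode(String name, PipelineNode next) {
        this.name = name;
        this.next = next;
    }

    void receive(String packet) {
        stored.add(packet);
        if (next != null) next.receive(packet);   // forward to the next replica holder
    }
}

final class PipelineDemo {
    // Builds the pipeline dn1 -> dn2 -> dn3 and returns the nodes in that order.
    static List<PipelineNode> build() {
        PipelineNode dn3 = new PipelineNode("dn3", null);
        PipelineNode dn2 = new PipelineNode("dn2", dn3);
        PipelineNode dn1 = new PipelineNode("dn1", dn2);
        List<PipelineNode> nodes = new ArrayList<>();
        nodes.add(dn1);
        nodes.add(dn2);
        nodes.add(dn3);
        return nodes;
    }
}
```

The writer uploads each packet once, and no packet passes through the name node.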

Self-writing

If the file creator is running on one of the data nodes, the first replica is always assigned to that node

Reading from HDFS

Reading is simpler than writing

No replication is needed

Replication can be exploited

Random reading is allowed


HDFS Read

The reader process calls the open function, open(…), which translates into an RPC call to the name node

The name node locates the first block of that file and returns the address of one of the nodes that store that block

The name node returns an input stream (InputStream) for the file

The file reader then reads the data from the data nodes with InputStream#read(…)

When an end-of-block is reached, the name node locates the next block

The InputStream#seek(pos) operation locates a block and positions the stream accordingly
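
A minimal sketch of the arithmetic behind locating a block for a seek position. It assumes every block except possibly the last has the same fixed size, which does not hold once blocks of different lengths exist; `BlockPosition` is a hypothetical helper, not part of the Hadoop API.

```java
// Hypothetical helper: maps a file offset to (block index, offset inside that block).
final class BlockPosition {
    final long blockIndex;
    final long offsetInBlock;

    private BlockPosition(long blockIndex, long offsetInBlock) {
        this.blockIndex = blockIndex;
        this.offsetInBlock = offsetInBlock;
    }

    static BlockPosition locate(long pos, long blockSize) {
        if (pos < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("invalid position or block size");
        }
        return new BlockPosition(pos / blockSize, pos % blockSize);
    }
}
```

With a block size of 128 MB, for instance, `BlockPosition.locate(300_000_000L, 128L * 1024 * 1024)` lands in the third block (index 2), so only that block's data node needs to be contacted.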

Self-reading

1. If the block is locally stored on the reader, this replica is chosen to read

2. If not, a replica on another machine in the same rack is chosen

3. Otherwise, any other random replica is chosen

When self-reading occurs, HDFS can make it much faster through a feature called short-circuit reads
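
The preference order above (local, then same rack, then anything else) can be sketched as below. `ReplicaInfo` and `ReplicaSelector` are hypothetical names, and rack membership is passed in as a plain string rather than derived from a real network topology.

```java
import java.util.List;
import java.util.Random;

// Hypothetical replica descriptor: the host storing the replica and that host's rack.
final class ReplicaInfo {
    final String host;
    final String rack;
    ReplicaInfo(String host, String rack) { this.host = host; this.rack = rack; }
}

final class ReplicaSelector {
    // Prefers a replica on the reader's host, then one in the reader's rack,
    // and otherwise falls back to a random replica.
    static ReplicaInfo choose(List<ReplicaInfo> replicas, String readerHost,
                              String readerRack, Random rnd) {
        if (replicas.isEmpty()) throw new IllegalArgumentException("no replicas");
        for (ReplicaInfo r : replicas) {
            if (r.host.equals(readerHost)) return r;
        }
        for (ReplicaInfo r : replicas) {
            if (r.rack.equals(readerRack)) return r;
        }
        return replicas.get(rnd.nextInt(replicas.size()));
    }
}
```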

Notes About Reading

The API is much richer than the simple open/seek/close API

You can retrieve block locations

You can choose a specific replica to read

The same API is generalized to other file systems including the local FS and S3

Review question: Compare random access read in local file systems to HDFS

HDFS Special Features

Node decommission

Load balancer

Cheap concatenation


Node Decommission

[Figure: data nodes and their blocks (B) during a node decommission]

Load Balancing

[Figure: blocks (B) distributed across the data nodes]

Start the load balancer

Cheap Concatenation

Concatenate File 1 + File 2 + File 3 -> File 4

Rather than creating new blocks, HDFS can just change the metadata in the name node to delete File 1, File 2, and File 3, and assign their blocks to a new File 4 in the right order.
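
A minimal sketch of why this concatenation is cheap: only the name node's file-to-blocks table changes, and no block data is read or copied. `ConcatMetadata` is a hypothetical model that follows the description above; it is not how the Hadoop classes are organized.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical name node table: file path -> ordered block IDs. No block data is touched.
final class ConcatMetadata {
    private final Map<String, List<Long>> fileBlocks = new HashMap<>();

    void putFile(String path, List<Long> blocks) {
        fileBlocks.put(path, new ArrayList<>(blocks));
    }

    // Moves the block lists of the sources, in order, under the target and drops the sources.
    void concat(String target, List<String> sources) {
        List<Long> merged = new ArrayList<>();
        for (String src : sources) {
            List<Long> blocks = fileBlocks.get(src);
            if (blocks == null) throw new IllegalArgumentException("unknown file: " + src);
            merged.addAll(blocks);
        }
        for (String src : sources) fileBlocks.remove(src);
        fileBlocks.put(target, merged);
    }

    // Ordered block IDs of a file, or null if the file does not exist.
    List<Long> blocksOf(String path) {
        return fileBlocks.get(path);
    }
}
```

The cost depends on the number of blocks listed in the metadata, not on the number of bytes stored in them.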

HDFS API

FileSystem, with the implementations DistributedFileSystem, LocalFileSystem, and S3FileSystem

Path

Configuration

Create the file system

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    Path path = new Path("…");
    // Resolve the file system that owns this path
    FileSystem fs = path.getFileSystem(conf);

    // To get the local FS
    fs = FileSystem.getLocal(conf);

    // To get the default FS
    fs = FileSystem.get(conf);

Create a new file

    FSDataOutputStream out = fs.create(path, …);

Delete a file

    fs.delete(path, recursive);
    fs.deleteOnExit(path);

Rename a file

    fs.rename(oldPath, newPath);

Open a file

    FSDataInputStream in = fs.open(path, …);

Seek to a different location

    in.seek(pos);
    in.seekToNewSource(pos);

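Concatenate

    // sources is a Path[] of the files to append to destination
    fs.concat(destination, sources);

Get file metadata

    fs.getFileStatus(path);

Get block locations

    // The third argument is the length of the range, not its end offset
    fs.getFileBlockLocations(path, start, len);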