48
Hadoop Distributed File System (HDFS) 1

Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

Hadoop Distributed

File System (HDFS)

1

Page 2: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS Overview

A distributed file system

Built on the architecture of Google File

System (GFS)

Shares a similar architecture to many other

common distributed storage engines such as

Amazon S3 and Microsoft Azure

HDFS is a stand-alone storage engine and

can be used in isolation of the query

processing engine

2

Page 3: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

Background on Disk Storage

What are file systems and why do we need

them?

A file is a logical sequence of bits/bytes

A physical disk stores data in sectors, tracks,

tapes, blocks, … etc.

3

Page 4: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

File System

Any file system, is a method to provide a

high-level abstraction on physical disk to

make it easier to store files

4

File System

FoldersFiles

Page 5: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

Distributed File System

5

FoldersFiles

Distributed File System

Page 6: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

Analogy to Unix FS

The logical view is similar

/

usermary

chu

etc hadoop

6

Page 7: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

Analogy to Unix FS

The physical model is comparable

Unix HDFS

File1

List of iNodes

Block 1

Block 2

Block 3

File1

List of block locations

Meta data

B B B

B B B

B B B

B

B B B

B B

7

Page 8: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS Architecture

B B B

B B B

B B B

B

B B B

B B

Name node

Data nodes

8

Page 9: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

What is where?

B B B

B B B

B B B

B

B B B

B B

Name node

Data nodes

File and directory names

Block ordering and locations

Capacity of data nodes

Architecture of data nodes

Block data

Name node location

9

Page 10: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

Physical

Cluster

Layout

10

Rack

#1

Rack

#2 …

Node #32

…Node #3

Node #2

Node #1

…Node #34

Node #33

Page 11: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS ShellManage the files from command line

11

Page 12: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS Shell

The easiest way to deal with HDFS is through

its shell

The commands are very similar to the Linux

shell commands

General format

So, instead of

You will write

12

hdfs dfs -<cmd> <arguments>

mkdir –p myproject/mydir

hdfs dfs -mkdir –p myproject/mydir

Page 13: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS Shell

In addition to regular commands, there are

special commands in HDFS

copyToLocal/get Copies a file from HDFS to

the local file system

copyFromLocal/put Copies a file from the local

file system to HDFS

setrep Changes the replication factor

A list of shell commands with usagehttps://hadoop.apache.org/docs/r2.10.0/hadoop-project-

dist/hadoop-common/FileSystemShell.html

13

Page 14: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS Writing Process

Data nodes

File creator

Name node

14

Page 15: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS Writing Process

Data nodes

File creatorCreate(…)

Name node

The creator process calls the create

function which translates to an RPC

call at the name node

15

Page 16: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS Writing ProcessName node

Data nodes

File creatorCreate(…)

The master node creates three initial

blocks

1. First block is assigned to a random

machine

2. Second block is assigned to another

random machine on a different rack

3. Third block is assigned to a random

machine on the second rack

1 2 3

16

Page 17: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS Writing ProcessName node

Data nodes

File creatorOutputStream

1 2 3

17

Page 18: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS Writing ProcessName node

Data nodes

File creator

1 2 3

OutputStream#write

18

Page 19: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS Writing ProcessName node

Data nodes

File creator

1 2 3

OutputStream#write

19

Page 20: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS Writing ProcessName node

Data nodes

File creator

1 2 3

OutputStream#write

20

Page 21: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS Writing ProcessName node

Data nodes

File creator

1 2 3

OutputStream#write

When a block is filled up, the

creator contacts the name node

to create the next block

Next block

21

Page 22: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

Notes about writing to HDFS

Data transfers of replicas are pipelined

The data does not go through the name node

Random writing is not supported

Appending to a file is supported but it creates

a new block

22

Page 23: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

Self-writing

Name node

Data nodes

File

creator

If the file creator is running on one

of the data nodes, the first replica

is always assigned to that node

The second and third replicas are

assigned as before

23

Page 24: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

Reading from HDFS

Reading is relatively easier

No replication is needed

Replication can be exploited

Random reading is allowed

24

Page 25: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS Reading Process

Data nodes

File readeropen(…)

Name node

The reader process calls the open

function which translates to an RPC

call at the name node

25

Page 26: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS Reading Process

Data nodes

File readerInputStream

Name node

The name node locates the first block

of that file and returns the address of

one of the nodes that store that block

The name node returns an input

stream for the file

26

Page 27: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS Reading Process

Data nodes

File reader

InputStream#read(…)

Name node

27

Page 28: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS Reading Process

Data nodes

File reader

Name node

When an end-of-block is

reached, the name node

locates the next block

Next block

28

Page 29: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS Reading Process

Data nodes

File reader

Name node

seek(pos)

InputStream#seek operation locates

a block and positions the stream

accordingly

29

Page 30: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

Self-reading

Data nodes

File

reader

Name node

1. If the block is locally stored

on the reader, this replica is

chosen to read

2. If not, a replica on another

machine in the same rack is

chosen

3. Any other random block is

chosen

Open,

seek

30

When self-reading occurs,

HDFS can make it much faster

through a feature called

short-circuit

Page 31: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

Notes About Reading

The API is much richer than the simple

open/seek/close API

You can retrieve block locations

You can choose a specific replica to read

The same API is generalized to other file

systems including the local FS and S3

Review question: Compare random access

read in local file systems to HDFS

31

Page 32: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS Special Features

Node decommission

Load balancer

Cheap concatenation

32

Page 33: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

Node Decommission

33

B B B

B B B

B B B

B

B B B

B B

B B B

B

Page 34: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

Load Balancing

34

B B B

B B B

B B B

B

B B B

B B

Page 35: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

Load Balancing

35

B B B

B B B

B B B

B

B B B

B B

Start the load balancer

Page 36: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

Cheap Concatenation

36

Name node

File 1

File 2

File 3

Concatenate File 1 + File 2 + File 3 ➔ File 4

Rather than creating new blocks, HDFS can just

change the metadata in the name node to delete

File 1, File 2, and File 3, and assign their blocks to a

new File 4 in the right order.

Page 37: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS APIMange the file system programmatically

37

Page 38: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

FileSystem API

HDFS provides a Java API that allows your

programs to manage the files similar to the

shell. It is even more powerful.

For interoperability, the FileSystem API

covers not only HDFS, but also the local file

system and other common file systems, e.g.,

Amazon S3

If you write your program in Hadoop

FileSystem API, it will generally work for

those file systems

38

Page 39: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS API Basic Classes

39

FileSystem

DistributedFileSystemLocalFileSystem S3FileSystem

Path Configuration

Page 40: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS API Classes

Configuration: Holds system configuration

such as where the master node is running

and default system parameters

Path: Stores a path to a file or directory

FileSystem: An abstract class for file system

commands

40

Page 41: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

Fully Qualified Path

41

hdfs://masternode:9000/path/to/file

hdfs: the file system scheme. Other possible

values are file, ftp, s3, …

masternode: the name or IP address of the node

that hosts the master of the file system

9000: the port on which the master node is

listening

/path/to/file: the absolute path of the file

Page 42: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

Shorter Path Forms

file: relative path to the current working

directory in the default file system

/path/to/file: Absolute path to a file in the

default* file system (as configured)

hdfs://path/to/file: Use the default* values for

the master node and port

hdfs://masternode/path/tofile: Use the given

masternode name or IP and the default* port

*All the defaults are in the Configuration object

42

Page 43: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS API

43

Configuration conf = new Configuration();Path path = new Path(“…”);FileSystem fs = path.getFileSystem(conf);

// To get the local FSfs = FileSystem.getLocal(conf);

// To get the default FSfs = FileSystem.get(conf);

Create the file system

Page 44: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS API

44

FSDataOutputStream out = fs.create(path, …);

Create a new file

fs.delete(path, recursive);

fs.deleteOnExit(path); // For temporary files

Delete a file

fs.rename(oldPath, newPath);

Rename/Move a file

Page 45: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS API

45

FSDataInputStream in = fs.open(path, …);

Open a file for reading

in.seek(pos);in.seekToNewSource(pos);

Seek to a different location for random access

Page 46: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

HDFS API

46

fs.concat(destination, src[]);

Concatenate

fs.getFileStatus(path);

Get file metadata

fs.getFileBlockLocations(path, from, to);

Get block locations

Page 47: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

Conclusion

HDFS is a general-purpose distributed file

system

Provides a similar abstraction to other file

systems

HDFS provides two interfaces

Shell script. Similar to Linux and MacOS

Java API: For programmatic access

The FileSystem API applies to other file

systems including the local file system and

Amazon S3

47

Page 48: Hadoop Distributed File System (HDFS)eldawy/20SCS167/slides/CS167-03-HDFS.pdf · HDFS Overview A distributed file system Built on the architecture of Google File System (GFS) Shares

Further Readings

HDFS Architecture

https://hadoop.apache.org/docs/r2.10.0/hadoop-

project-dist/hadoop-hdfs/HdfsDesign.html

Shell commands

https://hadoop.apache.org/docs/r2.10.0/hadoop-

project-dist/hadoop-

common/FileSystemShell.html

FileSystem API

https://hadoop.apache.org/docs/r2.10.0/api/org/ap

ache/hadoop/fs/FileSystem.html

48