
Page 1: Hadoop Interacting with HDFS

DataTorrent

HADOOP: Interacting with HDFS

1

Page 2: Hadoop Interacting with HDFS

→ What's the “Need” ? ←

❏ Big data Ocean

❏ Expensive hardware

❏ Frequent Failures and Difficult recovery

❏ Scaling up with more machines

2

Page 3: Hadoop Interacting with HDFS

→ Hadoop ←

Open source software; a Java framework

Initial release: December 10, 2011

It provides both,

Storage → [HDFS]

Processing → [MapReduce]

HDFS: Hadoop Distributed File System

3

Page 4: Hadoop Interacting with HDFS

→ How does Hadoop address the need? ←

Big data Ocean

Have multiple machines. Each will store some portion of data, not the entire data.

Expensive hardware

Use commodity hardware. Simple and cheap.

Frequent Failures and Difficult recovery

Have multiple copies of data. Have the copies in different machines.

Scaling up with more machines

If more processing is needed, add new machines on the fly

4

Page 5: Hadoop Interacting with HDFS

→ HDFS ←

Runs on commodity hardware: doesn't require expensive machines

Large Files; Write-once, Read-many (WORM)

Files are split into blocks

Actual blocks go to DataNodes

The metadata is stored at the NameNode

Replicate blocks to different node

Default configuration:

Block size = 128MB

Replication Factor = 3

5
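With the defaults above, the number of blocks a file occupies is just a ceiling division, and raw storage is multiplied by the replication factor. A minimal sketch in plain Python (arithmetic only, not an HDFS API):

```python
BLOCK_SIZE = 128 * 1024 * 1024   # default block size: 128 MB
REPLICATION = 3                  # default replication factor

def num_blocks(file_size_bytes):
    """Number of HDFS blocks a file of the given size occupies."""
    return -(-file_size_bytes // BLOCK_SIZE)  # ceiling division

def raw_storage(file_size_bytes):
    """Total bytes consumed across the cluster, counting all replicas."""
    return file_size_bytes * REPLICATION

# A 300 MB file spans 3 blocks (128 + 128 + 44 MB) and, with 3 replicas,
# consumes 900 MB of raw cluster storage.
print(num_blocks(300 * 1024 * 1024), raw_storage(300 * 1024 * 1024))
```

Note that a file smaller than one block does not waste a full block on disk; the block is a logical unit of splitting and replication.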

Page 6: Hadoop Interacting with HDFS

6

Page 7: Hadoop Interacting with HDFS

7

Page 8: Hadoop Interacting with HDFS

8

Page 9: Hadoop Interacting with HDFS

→ Where NOT TO use Hadoop/HDFS ←

Low latency data access

HDFS is optimized for high throughput of data at the expense of latency.

Large number of small files

NameNode has the entire file-system metadata in memory.

Too much metadata as compared to actual data.

Multiple writers / Arbitrary file modifications

No support for multiple writers for a file

Always append to end of a file

9
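The small-files problem can be made concrete. Each file and each block is an object held in NameNode memory; roughly 150 bytes per object is a commonly quoted rule of thumb, not a Hadoop constant. A rough back-of-the-envelope sketch:

```python
BYTES_PER_OBJECT = 150  # rough rule of thumb; real per-object size varies

def namenode_heap_estimate(num_files, blocks_per_file=1):
    """Very rough NameNode heap needed for file metadata, in bytes."""
    objects = num_files * (1 + blocks_per_file)  # one file object + its blocks
    return objects * BYTES_PER_OBJECT

# ~10 TB stored as 10 million 1 MB files (each well under one block):
small = namenode_heap_estimate(10_000_000, blocks_per_file=1)
# the same ~10 TB stored as 100 large files of 128 MB blocks (~819 each):
large = namenode_heap_estimate(100, blocks_per_file=819)
print(small // 10**6, "MB vs", large // 10**6, "MB")
```

Same data either way, but the small-file layout needs gigabytes of NameNode heap where the large-file layout needs megabytes: too much metadata compared to actual data.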

Page 10: Hadoop Interacting with HDFS

→ Some Key Concepts ←

❏ NameNode

❏ DataNodes

❏ JobTracker (MR v1)

❏ TaskTrackers (MR v1)

❏ ResourceManager (MR v2)

❏ NodeManagers (MR v2)

❏ ApplicationMasters (MR v2)

10

Page 11: Hadoop Interacting with HDFS

→ NameNode & DataNodes ←

❏ NameNode:

Centerpiece of HDFS: The Master

Only stores the block metadata: block-name, block-location etc.

Critical component; When down, whole cluster is considered down; Single point of failure

Should be configured with higher RAM

❏ DataNode:

Stores the actual data: The Slave

In constant communication with NameNode

When down, it does not affect the availability of data/cluster

Should be configured with higher disk space

❏ SecondaryNameNode:

Doesn't actually act as a NameNode

Stores an image of the primary NameNode at certain checkpoints

Used as backup to restore NameNode

11

Page 12: Hadoop Interacting with HDFS

12

Page 13: Hadoop Interacting with HDFS

→ JobTracker & TaskTrackers ←

❏ JobTracker:

Talks to the NameNode to determine location of the data

Monitors all TaskTrackers and submits status of the job back to the client

When down, HDFS is still functional; no new MR job; existing jobs halted

Replaced by ResourceManager/ApplicationMaster in MRv2

❏ TaskTracker:

Runs on all DataNodes

TaskTracker communicates with JobTracker signaling the task progress

TaskTracker failure is not considered fatal

Replaced by NodeManager in MRv2

13

Page 14: Hadoop Interacting with HDFS

→ ResourceManager & NodeManager ←

❏ Present in Hadoop v2.0

❏ Equivalent of JobTracker & TaskTracker in v1.0

❏ ResourceManager (RM):

Usually runs on the NameNode machine; distributes resources among applications.

Two main components: Scheduler and ApplicationsManager

❏ NodeManager (NM):

Per-node framework agent

Responsible for containers

Monitors their resource usage

Reports the stats to RM

The central ResourceManager and the per-node NodeManagers together form YARN

14

Page 15: Hadoop Interacting with HDFS

15

Page 16: Hadoop Interacting with HDFS

→ Hadoop 1.0 vs. 2.0 ←

HDFS 1.0:

Single point of failure

Horizontal scaling performance issue

HDFS 2.0:

HDFS High Availability

HDFS Snapshot

Improved performance

HDFS Federation

16

Page 18: Hadoop Interacting with HDFS

→ Interacting with HDFS ←

Command prompt:

Similar to Linux terminal commands

Unix is the model, POSIX is the API

Web Interface:

Similar to browsing an FTP site on the web

18

Page 19: Hadoop Interacting with HDFS

Interacting With HDFS

On Command Prompt

19

Page 20: Hadoop Interacting with HDFS

→ Notes ←

File Paths on HDFS:

hdfs://<namenode>:<port>/path/to/file.txt

hdfs://127.0.0.1:8020/user/USERNAME/demo/data/file.txt

hdfs://localhost:8020/user/USERNAME/demo/data/file.txt

/user/USERNAME/demo/file.txt

demo/file.txt
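An `hdfs://` URL has the same shape as any other URL, so standard URL parsing applies to the fully qualified forms above. A sketch using Python's `urllib.parse` (illustration only; Hadoop itself resolves relative paths such as `demo/file.txt` against the user's home directory `/user/<username>/`):

```python
from urllib.parse import urlparse

url = urlparse("hdfs://localhost:8020/user/USERNAME/demo/data/file.txt")
print(url.scheme)    # hdfs
print(url.hostname)  # localhost  (the NameNode)
print(url.port)      # 8020       (the NameNode IPC port)
print(url.path)      # /user/USERNAME/demo/data/file.txt
```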

File System:

Local: local file system (Linux)

HDFS: Hadoop file system

In some places, the terms “file” and “directory” have the same meaning.

20

Page 21: Hadoop Interacting with HDFS

→ Before we start ←Command:

hdfs

Usage:

hdfs [--config confdir] COMMAND

Example:

hdfs dfs

hdfs dfsadmin

hdfs fsck

hdfs namenode

hdfs datanode

21

Page 22: Hadoop Interacting with HDFS

hdfs `dfs` commands

22

Page 23: Hadoop Interacting with HDFS

→ General syntax for `dfs` commands ←

hdfs

dfs

-<COMMAND>

-[OPTIONS]

<PARAMETERS>

e.g.

hdfs dfs -ls -R /user/USERNAME/demo/data/

23
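The general shape — `hdfs dfs -<COMMAND> [OPTIONS] <PARAMETERS>` — can be captured in a small helper that assembles the argv list. This is a hypothetical convenience function for illustration; it only builds the command line, it does not run it:

```python
def dfs_argv(command, options=(), parameters=()):
    """Build the argv list for an `hdfs dfs` invocation."""
    return ["hdfs", "dfs", f"-{command}", *options, *parameters]

argv = dfs_argv("ls", options=["-R"], parameters=["/user/USERNAME/demo/data/"])
print(" ".join(argv))  # hdfs dfs -ls -R /user/USERNAME/demo/data/
```

A list such as this could be handed to `subprocess.run` on a machine where the `hdfs` client is installed.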

Page 24: Hadoop Interacting with HDFS

0. Do it yourself

Syntax:

hdfs dfs -help [COMMAND … ]

hdfs dfs -usage [COMMAND … ]

Example:

hdfs dfs -help cat

hdfs dfs -usage cat

24

Page 25: Hadoop Interacting with HDFS

1. List the file/directory

Syntax:

hdfs dfs -ls [-d] [-h] [-R] <hdfs-dir-path>

Example:

hdfs dfs -ls

hdfs dfs -ls /

hdfs dfs -ls /user/USERNAME/demo/list-dir-example

hdfs dfs -ls -R /user/USERNAME/demo/list-dir-example

25

Page 26: Hadoop Interacting with HDFS

2. Creating a directory

Syntax:

hdfs dfs -mkdir [-p] <hdfs-dir-path>

Example:

hdfs dfs -mkdir /user/USERNAME/demo/create-dir-example

hdfs dfs -mkdir -p /user/USERNAME/demo/create-dir-example/dir1/dir2/dir3
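`-mkdir -p` behaves like `mkdir -p` on Linux: it creates any missing intermediate directories and does not fail if the path already exists. The local-filesystem analogue, sketched with Python's standard library (local paths only, not HDFS):

```python
import os
import tempfile

# Local analogue of `hdfs dfs -mkdir -p .../dir1/dir2/dir3`
base = tempfile.mkdtemp()
path = os.path.join(base, "dir1", "dir2", "dir3")
os.makedirs(path, exist_ok=True)  # creates all intermediate directories
os.makedirs(path, exist_ok=True)  # repeating is a no-op, like -p
print(os.path.isdir(path))  # True
```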

26

Page 27: Hadoop Interacting with HDFS

3. Create a file on local & put it on HDFS

Syntax:

vi filename.txt

hdfs dfs -put [options] <local-file-path> <hdfs-dir-path>

Example:

vi file-copy-to-hdfs.txt

hdfs dfs -put file-copy-to-hdfs.txt /user/USERNAME/demo/put-example/

27

Page 28: Hadoop Interacting with HDFS

4. Get a file from HDFS to local

Syntax:

hdfs dfs -get <hdfs-file-path> [local-dir-path]

Example:

hdfs dfs -get /user/USERNAME/demo/get-example/file-copy-from-hdfs.txt ~/demo/

28

Page 29: Hadoop Interacting with HDFS

5. Copy From LOCAL To HDFS

Syntax:

hdfs dfs -copyFromLocal <local-file-path> <hdfs-file-path>

Example:

hdfs dfs -copyFromLocal file-copy-to-hdfs.txt /user/USERNAME/demo/copyFromLocal-example/

29

Page 30: Hadoop Interacting with HDFS

6. Copy To LOCAL From HDFS

Syntax:

hdfs dfs -copyToLocal <hdfs-file-path> <local-file-path>

Example:

hdfs dfs -copyToLocal /user/USERNAME/demo/copyToLocal-example/file-copy-from-hdfs.txt ~/demo/

30

Page 31: Hadoop Interacting with HDFS

7. Move a file from local to HDFS

Syntax:

hdfs dfs -moveFromLocal <local-file-path> <hdfs-dir-path>

Example:

hdfs dfs -moveFromLocal /path/to/file.txt /user/USERNAME/demo/moveFromLocal-example/

31

Page 32: Hadoop Interacting with HDFS

8. Copy a file within HDFS

Syntax:

hdfs dfs -cp <hdfs-source-file-path> <hdfs-dest-file-path>

Example:

hdfs dfs -cp /user/USERNAME/demo/copy-within-hdfs/file-copy.txt /user/USERNAME/demo/data/

32

Page 33: Hadoop Interacting with HDFS

9. Move a file within HDFS

Syntax:

hdfs dfs -mv <hdfs-source-file-path> <hdfs-dest-file-path>

Example:

hdfs dfs -mv /user/USERNAME/demo/move-within-hdfs/file-move.txt /user/USERNAME/demo/data/

33

Page 34: Hadoop Interacting with HDFS

10. Merge files on HDFS

Syntax:

hdfs dfs -getmerge [-nl] <hdfs-dir-path> <local-file-path>

Examples:

hdfs dfs -getmerge -nl /user/USERNAME/demo/merge-example/ /path/to/all-files.txt
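`-getmerge` concatenates every file under the directory into one local file; `-nl` appends a newline after each file so that records from different files do not run together. The concatenation logic, sketched in Python over in-memory strings (an illustration of the behavior, not the HDFS implementation):

```python
def getmerge(file_contents, add_newline=False):
    """Concatenate file contents the way `hdfs dfs -getmerge [-nl]` does."""
    if add_newline:
        return "".join(content + "\n" for content in file_contents)
    return "".join(file_contents)

print(repr(getmerge(["a", "b"])))                    # 'ab' - runs together
print(repr(getmerge(["a", "b"], add_newline=True)))  # 'a\nb\n'
```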

34

Page 35: Hadoop Interacting with HDFS

11. View file contents

Syntax:

hdfs dfs -cat <hdfs-file-path>

hdfs dfs -tail <hdfs-file-path>

hdfs dfs -text <hdfs-file-path>

Examples:

hdfs dfs -cat /user/USERNAME/demo/data/cat-example.txt

hdfs dfs -cat /user/USERNAME/demo/data/cat-example.txt | head

35

Page 36: Hadoop Interacting with HDFS

12. Remove files/dirs from HDFS

Syntax:

hdfs dfs -rm [options] <hdfs-file-path>

Examples:

hdfs dfs -rm /user/USERNAME/demo/remove-example/remove-file.txt

hdfs dfs -rm -R /user/USERNAME/demo/remove-example/

hdfs dfs -rm -R -skipTrash /user/USERNAME/demo/remove-example/

36

Page 37: Hadoop Interacting with HDFS

13. Change file/dir properties

Syntax:

hdfs dfs -chgrp [-R] <NewGroupName> <hdfs-file-path>

hdfs dfs -chmod [-R] <permissions> <hdfs-file-path>

hdfs dfs -chown [-R] <NewOwnerName> <hdfs-file-path>

Examples:

hdfs dfs -chmod -R 777 /user/USERNAME/demo/data/file-change-properties.txt
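`777` is an octal permission triple (owner/group/other, each an `rwx` group of bits), the same scheme Linux uses. A sketch that expands an octal mode into the symbolic string that `-ls` displays:

```python
def symbolic(mode):
    """Expand a 9-bit octal mode into 'rwxrwxrwx'-style notation."""
    letters = "rwxrwxrwx"
    return "".join(
        letters[i] if mode & (1 << (8 - i)) else "-" for i in range(9)
    )

print(symbolic(0o777))  # rwxrwxrwx - everyone can read, write, execute
print(symbolic(0o644))  # rw-r--r-- - owner writes, everyone else reads
```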

37

Page 38: Hadoop Interacting with HDFS

14. Check the file size

Syntax:

hdfs dfs -du <hdfs-file-path>

Examples:

hdfs dfs -du /user/USERNAME/demo/data/file.txt

hdfs dfs -du -s -h /user/USERNAME/demo/data/

38

Page 39: Hadoop Interacting with HDFS

15. Create a zero byte file in HDFS

Syntax:

hdfs dfs -touchz <hdfs-file-path>

Examples:

hdfs dfs -touchz /user/USERNAME/demo/data/zero-byte-file.txt

39

Page 40: Hadoop Interacting with HDFS

16. File test operations

Syntax:

hdfs dfs -test -[defsz] <hdfs-file-path>

Examples:

hdfs dfs -test -e /user/USERNAME/demo/data/file.txt

echo $?
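`-test` prints nothing; the answer comes back only through the exit status, which `echo $?` displays (`0` means true, by the usual Unix convention). In a script you would branch on that status. A sketch of a wrapper — it assumes an `hdfs` binary on the PATH, and the wrapper function itself is hypothetical:

```python
import subprocess

def passed(returncode):
    """Unix convention, same as checking `echo $?`: 0 means success/true."""
    return returncode == 0

def hdfs_test(path, flag="-e", hdfs_bin="hdfs"):
    """Return True iff `hdfs dfs -test <flag> <path>` exits with status 0."""
    result = subprocess.run([hdfs_bin, "dfs", "-test", flag, path])
    return passed(result.returncode)
```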

40

Page 41: Hadoop Interacting with HDFS

17. Get FileSystem Statistics

Syntax:

hdfs dfs -stat [format] <hdfs-file-path>

Format Options:

%b - file size in bytes, %g - group name of owner

%n - filename, %o - block size

%r - replication, %u - user name of owner

%y - modification date
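The format string is simple token substitution. A sketch emulating it over a plain dict (the dict field names are illustrative choices, not Hadoop's internal names):

```python
def hdfs_stat_format(fmt, info):
    """Emulate `hdfs dfs -stat <format>` token substitution over a dict."""
    tokens = {
        "%b": "size",         # file size
        "%g": "group",        # group name of owner
        "%n": "name",         # filename
        "%o": "block_size",   # block size
        "%r": "replication",  # replication factor
        "%u": "user",         # user name of owner
        "%y": "mtime",        # modification date
    }
    for token, field in tokens.items():
        if token in fmt:
            fmt = fmt.replace(token, str(info[field]))
    return fmt

print(hdfs_stat_format("%n %r", {"name": "file.txt", "replication": 3}))
```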

41

Page 42: Hadoop Interacting with HDFS

18. Get File/Dir Counts

Syntax:

hdfs dfs -count [-q] [-h] [-v] <hdfs-file-path>

Example:

hdfs dfs -count -v /user/USERNAME/demo/

42

Page 43: Hadoop Interacting with HDFS

19. Set replication factor

Syntax:

hdfs dfs -setrep -w -R n <hdfs-file-path>

Examples:

hdfs dfs -setrep -w -R 2 /user/USERNAME/demo/data/file.txt

43

Page 44: Hadoop Interacting with HDFS

20. Set Block Size

Syntax:

hdfs dfs -D dfs.blocksize=blocksize -copyFromLocal <local-file-path> <hdfs-file-path>

Examples:

hdfs dfs -D dfs.blocksize=67108864 -copyFromLocal /path/to/file.txt /user/USERNAME/demo/block-example/
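`dfs.blocksize` takes a value in bytes, so the long number in the example is simply 64 MB. A one-line helper to avoid typing raw byte counts:

```python
def blocksize_bytes(megabytes):
    """Convert a block size in MB to the byte value dfs.blocksize expects."""
    return megabytes * 1024 * 1024

print(blocksize_bytes(64))   # 67108864, the value used in the example
print(blocksize_bytes(128))  # 134217728, the HDFS default
```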

44

Page 45: Hadoop Interacting with HDFS

21. Empty the HDFS trash

Syntax:

hdfs dfs -expunge

Location:

/user/USERNAME/.Trash (per-user trash directory)

45

Page 46: Hadoop Interacting with HDFS

Other hdfs commands (admin)

46

Page 47: Hadoop Interacting with HDFS

22. HDFS Admin Commands: fsck

Syntax:

hdfs fsck <hdfs-file-path>

Options:

[-list-corruptfileblocks]

[-move | -delete | -openforwrite]

[-files [-blocks [-locations | -racks]]]

[-includeSnapshots]

47

Page 48: Hadoop Interacting with HDFS

48

Page 49: Hadoop Interacting with HDFS

23. HDFS Admin Commands: dfsadmin

Syntax:

hdfs dfsadmin

Options:

[-report [-live] [-dead] [-decommissioning]]

[-safemode enter | leave | get | wait]

[-refreshNodes]

[-refresh <host:ipc_port> <key> [arg1..argn]]

[-shutdownDatanode <datanode:port> [upgrade]]

[-getDatanodeInfo <datanode_host:ipc_port>]

[-help [cmd]]

Examples:

hdfs dfsadmin -report -live

49

Page 50: Hadoop Interacting with HDFS

50

Page 51: Hadoop Interacting with HDFS

24. HDFS Admin Commands: namenode

Syntax:

hdfs namenode

Options:

[-checkpoint] |

[-format [-clusterid cid] [-force] [-nonInteractive]] |

[-upgrade [-clusterid cid]] |

[-rollback] |

[-recover [-force]] |

[-metadataVersion]

Examples:

hdfs namenode -help

51

Page 52: Hadoop Interacting with HDFS

25. HDFS Admin Commands: getconf

Syntax:

hdfs getconf [-options]

Options:

[-namenodes] [-secondaryNameNodes]

[-backupNodes] [-includeFile]

[-excludeFile] [-nnRpcAddresses]

[-confKey [key]]

52

Page 53: Hadoop Interacting with HDFS

Again... THE most important commands!!

Syntax:

hdfs dfs -help [options]

hdfs dfs -usage [options]

Examples:

hdfs dfs -help help

hdfs dfs -usage usage

53

Page 54: Hadoop Interacting with HDFS

Interacting With HDFS

In Web Browser

54

Page 55: Hadoop Interacting with HDFS

Web HDFS

URL:

http://namenode:50070/explorer.html

Examples:

http://localhost:50070/explorer.html

http://ec2-52-23-214-111.compute-1.amazonaws.com:50070/explorer.html

55

Page 56: Hadoop Interacting with HDFS

References

1. http://www.hadoopinrealworld.com

2. http://www.slideshare.net/sanjeeb85/hdfscommandreference

3. http://www.slideshare.net/jaganadhg/hdfs-10509123

4. http://www.slideshare.net/praveenbhat2/adv-os-presentation

5. http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html

6. http://www.snia.org/sites/default/files/Hadoop2_New_And_Noteworthy_SNIA_v3.pdf

7. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html

8. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html

9. http://hadoop.apache.org/docs/r1.2.1/distcp.html

56

Page 57: Hadoop Interacting with HDFS

Thank You!!

57

Page 58: Hadoop Interacting with HDFS

APPENDIX

58

Page 59: Hadoop Interacting with HDFS

Copy data from one cluster to another

Description:

Copy data between hadoop clusters

Syntax:

hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo

hadoop distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs://nn2:8020/bar/foo

hadoop distcp -f hdfs://nn1:8020/srclist.file hdfs://nn2:8020/bar/foo

Where srclist.file contains

hdfs://nn1:8020/foo/a

hdfs://nn1:8020/foo/b

59
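The `-f` variant reads source paths from a file, one URI per line. A sketch that writes such a list with Python's standard library, using the paths from the example above (the file would still need to be placed on HDFS before being passed to `distcp -f`):

```python
import tempfile

sources = [
    "hdfs://nn1:8020/foo/a",
    "hdfs://nn1:8020/foo/b",
]

# Write one source URI per line, the format `distcp -f` expects.
with tempfile.NamedTemporaryFile("w", suffix=".file", delete=False) as f:
    f.write("\n".join(sources) + "\n")
    srclist_path = f.name

print(srclist_path)
```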