
Page 1: Hadoop Interacting with HDFS

DataTorrent

HADOOP: Interacting with HDFS

1

Page 2: Hadoop Interacting with HDFS

→ What's the “Need” ? ←

❏ Big data Ocean

❏ Expensive hardware

❏ Frequent Failures and Difficult recovery

❏ Scaling up with more machines

2

Page 3: Hadoop Interacting with HDFS

→ Hadoop ←

Open source software; a Java framework

Initial release: December 10, 2011

It provides both,

Storage → [HDFS]

Processing → [MapReduce]

HDFS: Hadoop Distributed File System

3

Page 4: Hadoop Interacting with HDFS

→ How does Hadoop address the need? ←

Big data Ocean

Have multiple machines. Each will store some portion of data, not the entire data.

Expensive hardware

Use commodity hardware. Simple and cheap.

Frequent Failures and Difficult recovery

Have multiple copies of data. Have the copies in different machines.

Scaling up with more machines

If more processing is needed, add new machines on the fly

4

Page 5: Hadoop Interacting with HDFS

→ HDFS ←

Runs on commodity hardware: doesn't require expensive machines

Large Files; Write-once, Read-many (WORM)

Files are split into blocks

Actual blocks go to DataNodes

The metadata is stored at the NameNode

Replicate blocks to different node

Default configuration:

Block size = 128MB

Replication Factor = 3

5
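With the defaults above, the number of blocks a file occupies is just a ceiling division, and raw storage is multiplied by the replication factor. A minimal sketch in plain Python (arithmetic only, not an HDFS API):

```python
BLOCK_SIZE = 128 * 1024 * 1024   # default block size: 128 MB
REPLICATION = 3                  # default replication factor

def num_blocks(file_size_bytes):
    """Number of HDFS blocks a file of the given size occupies."""
    return -(-file_size_bytes // BLOCK_SIZE)  # ceiling division

def raw_storage(file_size_bytes):
    """Total bytes consumed across the cluster, counting all replicas."""
    return file_size_bytes * REPLICATION

# A 300 MB file spans 3 blocks (128 + 128 + 44 MB) and, with 3 replicas,
# consumes 900 MB of raw cluster storage.
print(num_blocks(300 * 1024 * 1024), raw_storage(300 * 1024 * 1024))
```

Note that a file smaller than one block does not waste a full block on disk; the block is a logical unit of splitting and replication.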

Page 6: Hadoop Interacting with HDFS

6

Page 7: Hadoop Interacting with HDFS

7

Page 8: Hadoop Interacting with HDFS

8

Page 9: Hadoop Interacting with HDFS

→ Where NOT TO use Hadoop/HDFS ←

Low latency data access

HDFS is optimized for high throughput of data at the expense of latency.

Large number of small files

NameNode has the entire file-system metadata in memory.

Too much metadata as compared to actual data.

Multiple writers / Arbitrary file modifications

No support for multiple writers for a file

Always append to end of a file

9
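The small-files problem can be made concrete. Each file and each block is an object held in NameNode memory; roughly 150 bytes per object is a commonly quoted rule of thumb, not a Hadoop constant. A rough back-of-the-envelope sketch:

```python
BYTES_PER_OBJECT = 150  # rough rule of thumb; real per-object size varies

def namenode_heap_estimate(num_files, blocks_per_file=1):
    """Very rough NameNode heap needed for file metadata, in bytes."""
    objects = num_files * (1 + blocks_per_file)  # one file object + its blocks
    return objects * BYTES_PER_OBJECT

# ~10 TB stored as 10 million 1 MB files (each well under one block):
small = namenode_heap_estimate(10_000_000, blocks_per_file=1)
# the same ~10 TB stored as 100 large files of 128 MB blocks (~819 each):
large = namenode_heap_estimate(100, blocks_per_file=819)
print(small // 10**6, "MB vs", large // 10**6, "MB")
```

Same data either way, but the small-file layout needs gigabytes of NameNode heap where the large-file layout needs megabytes: too much metadata compared to actual data.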

Page 10: Hadoop Interacting with HDFS

→ Some Key Concepts ←

❏ NameNode

❏ DataNodes

❏ JobTracker (MR v1)

❏ TaskTrackers (MR v1)

❏ ResourceManager (MR v2)

❏ NodeManagers (MR v2)

❏ ApplicationMasters (MR v2)

10

Page 11: Hadoop Interacting with HDFS

→ NameNode & DataNodes ←

❏ NameNode:

Centerpiece of HDFS: The Master

Only stores the block metadata: block-name, block-location etc.

Critical component; When down, whole cluster is considered down; Single point of failure

Should be configured with higher RAM

❏ DataNode:

Stores the actual data: The Slave

In constant communication with NameNode

When down, it does not affect the availability of data/cluster

Should be configured with higher disk space

❏ SecondaryNameNode:

Doesn't actually act as a NameNode

Stores an image of the primary NameNode at certain checkpoints

Used as backup to restore NameNode

11

Page 12: Hadoop Interacting with HDFS

12

Page 13: Hadoop Interacting with HDFS

→ JobTracker & TaskTrackers ←

❏ JobTracker:

Talks to the NameNode to determine location of the data

Monitors all TaskTrackers and submits status of the job back to the client

When down, HDFS is still functional; no new MR job; existing jobs halted

Replaced by ResourceManager/ApplicationMaster in MRv2

❏ TaskTracker:

Runs on all DataNodes

TaskTracker communicates with JobTracker signaling the task progress

TaskTracker failure is not considered fatal

Replaced by NodeManager in MRv2

13

Page 14: Hadoop Interacting with HDFS

→ ResourceManager & NodeManager ←

❏ Present in Hadoop v2.0

❏ Equivalent of JobTracker & TaskTracker in v1.0

❏ ResourceManager (RM):

Usually runs on the NameNode machine; distributes resources among applications.

Two main components: Scheduler and ApplicationsManager

❏ NodeManager (NM):

Per-node framework agent

Responsible for containers

Monitors their resource usage

Reports the stats to RM

The central ResourceManager and the per-node NodeManagers together form YARN

14

Page 15: Hadoop Interacting with HDFS

15

Page 16: Hadoop Interacting with HDFS

→ Hadoop 1.0 vs. 2.0 ←

HDFS 1.0:

Single point of failure

Horizontal scaling performance issue

HDFS 2.0:

HDFS High Availability

HDFS Snapshot

Improved performance

HDFS Federation

16

Page 18: Hadoop Interacting with HDFS

→ Interacting with HDFS ←

Command prompt:

Similar to Linux terminal commands

Unix is the model, POSIX is the API

Web Interface:

Similar to browsing an FTP site on the web

18

Page 19: Hadoop Interacting with HDFS

Interacting With HDFS

On Command Prompt

19

Page 20: Hadoop Interacting with HDFS

→ Notes ←

File Paths on HDFS:

hdfs://<namenode>:<port>/path/to/file.txt

hdfs://127.0.0.1:8020/user/USERNAME/demo/data/file.txt

hdfs://localhost:8020/user/USERNAME/demo/data/file.txt

/user/USERNAME/demo/file.txt

demo/file.txt
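An `hdfs://` URL has the same shape as any other URL, so standard URL parsing applies to the fully qualified forms above. A sketch using Python's `urllib.parse` (illustration only; Hadoop itself resolves relative paths such as `demo/file.txt` against the user's home directory `/user/<username>/`):

```python
from urllib.parse import urlparse

url = urlparse("hdfs://localhost:8020/user/USERNAME/demo/data/file.txt")
print(url.scheme)    # hdfs
print(url.hostname)  # localhost  (the NameNode)
print(url.port)      # 8020       (the NameNode IPC port)
print(url.path)      # /user/USERNAME/demo/data/file.txt
```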

File System:

Local: local file system (Linux)

HDFS: Hadoop file system

In some places, the terms “file” and “directory” have the same meaning.

20

Page 21: Hadoop Interacting with HDFS

→ Before we start ←Command:

hdfs

Usage:

hdfs [--config confdir] COMMAND

Example:

hdfs dfs

hdfs dfsadmin

hdfs fsck

hdfs namenode

hdfs datanode

21

Page 22: Hadoop Interacting with HDFS

hdfs `dfs` commands

22

Page 23: Hadoop Interacting with HDFS

→ General syntax for `dfs` commands ←

hdfs

dfs

-<COMMAND>

-[OPTIONS]

<PARAMETERS>

e.g.

hdfs dfs -ls -R /user/USERNAME/demo/data/

23
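The general shape — `hdfs dfs -<COMMAND> [OPTIONS] <PARAMETERS>` — can be captured in a small helper that assembles the argv list. This is a hypothetical convenience function for illustration; it only builds the command line, it does not run it:

```python
def dfs_argv(command, options=(), parameters=()):
    """Build the argv list for an `hdfs dfs` invocation."""
    return ["hdfs", "dfs", f"-{command}", *options, *parameters]

argv = dfs_argv("ls", options=["-R"], parameters=["/user/USERNAME/demo/data/"])
print(" ".join(argv))  # hdfs dfs -ls -R /user/USERNAME/demo/data/
```

A list such as this could be handed to `subprocess.run` on a machine where the `hdfs` client is installed.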

Page 24: Hadoop Interacting with HDFS

0. Do it yourself

Syntax:

hdfs dfs -help [COMMAND … ]

hdfs dfs -usage [COMMAND … ]

Example:

hdfs dfs -help cat

hdfs dfs -usage cat

24

Page 25: Hadoop Interacting with HDFS

1. List the file/directory

Syntax:

hdfs dfs -ls [-d] [-h] [-R] <hdfs-dir-path>

Example:

hdfs dfs -ls

hdfs dfs -ls /

hdfs dfs -ls /user/USERNAME/demo/list-dir-example

hdfs dfs -ls -R /user/USERNAME/demo/list-dir-example

25

Page 26: Hadoop Interacting with HDFS

2. Creating a directory

Syntax:

hdfs dfs -mkdir [-p] <hdfs-dir-path>

Example:

hdfs dfs -mkdir /user/USERNAME/demo/create-dir-example

hdfs dfs -mkdir -p /user/USERNAME/demo/create-dir-example/dir1/dir2/dir3
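`-mkdir -p` behaves like `mkdir -p` on Linux: it creates any missing intermediate directories and does not fail if the path already exists. The local-filesystem analogue, sketched with Python's standard library (local paths only, not HDFS):

```python
import os
import tempfile

# Local analogue of `hdfs dfs -mkdir -p .../dir1/dir2/dir3`
base = tempfile.mkdtemp()
path = os.path.join(base, "dir1", "dir2", "dir3")
os.makedirs(path, exist_ok=True)  # creates all intermediate directories
os.makedirs(path, exist_ok=True)  # repeating is a no-op, like -p
print(os.path.isdir(path))  # True
```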

26

Page 27: Hadoop Interacting with HDFS

3. Create a file on local & put it on HDFS

Syntax:

vi filename.txt

hdfs dfs -put [options] <local-file-path> <hdfs-dir-path>

Example:

vi file-copy-to-hdfs.txt

hdfs dfs -put file-copy-to-hdfs.txt /user/USERNAME/demo/put-example/

27

Page 28: Hadoop Interacting with HDFS

4. Get a file from HDFS to local

Syntax:

hdfs dfs -get <hdfs-file-path> [local-dir-path]

Example:

hdfs dfs -get /user/USERNAME/demo/get-example/file-copy-from-hdfs.txt ~/demo/

28

Page 29: Hadoop Interacting with HDFS

5. Copy From LOCAL To HDFS

Syntax:

hdfs dfs -copyFromLocal <local-file-path> <hdfs-file-path>

Example:

hdfs dfs -copyFromLocal file-copy-to-hdfs.txt /user/USERNAME/demo/copyFromLocal-example/

29

Page 30: Hadoop Interacting with HDFS

6. Copy To LOCAL From HDFS

Syntax:

hdfs dfs -copyToLocal <hdfs-file-path> <local-file-path>

Example:

hdfs dfs -copyToLocal /user/USERNAME/demo/copyToLocal-example/file-copy-from-hdfs.txt ~/demo/

30

Page 31: Hadoop Interacting with HDFS

7. Move a file from local to HDFS

Syntax:

hdfs dfs -moveFromLocal <local-file-path> <hdfs-dir-path>

Example:

hdfs dfs -moveFromLocal /path/to/file.txt /user/USERNAME/demo/moveFromLocal-example/

31

Page 32: Hadoop Interacting with HDFS

8. Copy a file within HDFS

Syntax:

hdfs dfs -cp <hdfs-source-file-path> <hdfs-dest-file-path>

Example:

hdfs dfs -cp /user/USERNAME/demo/copy-within-hdfs/file-copy.txt /user/USERNAME/demo/data/

32

Page 33: Hadoop Interacting with HDFS

9. Move a file within HDFS

Syntax:

hdfs dfs -mv <hdfs-source-file-path> <hdfs-dest-file-path>

Example:

hdfs dfs -mv /user/USERNAME/demo/move-within-hdfs/file-move.txt /user/USERNAME/demo/data/

33

Page 34: Hadoop Interacting with HDFS

10. Merge files on HDFS

Syntax:

hdfs dfs -getmerge [-nl] <hdfs-dir-path> <local-file-path>

Examples:

hdfs dfs -getmerge -nl /user/USERNAME/demo/merge-example/ /path/to/all-files.txt
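`-getmerge` concatenates every file under the directory into one local file; `-nl` appends a newline after each file so that records from different files do not run together. The concatenation logic, sketched in Python over in-memory strings (an illustration of the behavior, not the HDFS implementation):

```python
def getmerge(file_contents, add_newline=False):
    """Concatenate file contents the way `hdfs dfs -getmerge [-nl]` does."""
    if add_newline:
        return "".join(content + "\n" for content in file_contents)
    return "".join(file_contents)

print(repr(getmerge(["a", "b"])))                    # 'ab' - runs together
print(repr(getmerge(["a", "b"], add_newline=True)))  # 'a\nb\n'
```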

34

Page 35: Hadoop Interacting with HDFS

11. View file contents

Syntax:

hdfs dfs -cat <hdfs-file-path>

hdfs dfs -tail <hdfs-file-path>

hdfs dfs -text <hdfs-file-path>

Examples:

hdfs dfs -cat /user/USERNAME/demo/data/cat-example.txt

hdfs dfs -cat /user/USERNAME/demo/data/cat-example.txt | head

35

Page 36: Hadoop Interacting with HDFS

12. Remove files/dirs from HDFS

Syntax:

hdfs dfs -rm [options] <hdfs-file-path>

Examples:

hdfs dfs -rm /user/USERNAME/demo/remove-example/remove-file.txt

hdfs dfs -rm -R /user/USERNAME/demo/remove-example/

hdfs dfs -rm -R -skipTrash /user/USERNAME/demo/remove-example/

36

Page 37: Hadoop Interacting with HDFS

13. Change file/dir properties

Syntax:

hdfs dfs -chgrp [-R] <NewGroupName> <hdfs-file-path>

hdfs dfs -chmod [-R] <permissions> <hdfs-file-path>

hdfs dfs -chown [-R] <NewOwnerName> <hdfs-file-path>

Examples:

hdfs dfs -chmod -R 777 /user/USERNAME/demo/data/file-change-properties.txt
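`777` is an octal permission triple (owner/group/other, each an `rwx` group of bits), the same scheme Linux uses. A sketch that expands an octal mode into the symbolic string that `-ls` displays:

```python
def symbolic(mode):
    """Expand a 9-bit octal mode into 'rwxrwxrwx'-style notation."""
    letters = "rwxrwxrwx"
    return "".join(
        letters[i] if mode & (1 << (8 - i)) else "-" for i in range(9)
    )

print(symbolic(0o777))  # rwxrwxrwx - everyone can read, write, execute
print(symbolic(0o644))  # rw-r--r-- - owner writes, everyone else reads
```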

37

Page 38: Hadoop Interacting with HDFS

14. Check the file size

Syntax:

hdfs dfs -du <hdfs-file-path>

Examples:

hdfs dfs -du /user/USERNAME/demo/data/file.txt

hdfs dfs -du -s -h /user/USERNAME/demo/data/

38

Page 39: Hadoop Interacting with HDFS

15. Create a zero byte file in HDFS

Syntax:

hdfs dfs -touchz <hdfs-file-path>

Examples:

hdfs dfs -touchz /user/USERNAME/demo/data/zero-byte-file.txt

39

Page 40: Hadoop Interacting with HDFS

16. File test operations

Syntax:

hdfs dfs -test -[defsz] <hdfs-file-path>

Examples:

hdfs dfs -test -e /user/USERNAME/demo/data/file.txt

echo $?
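`-test` prints nothing; the answer comes back only through the exit status, which `echo $?` displays (`0` means true, by the usual Unix convention). In a script you would branch on that status. A sketch of a wrapper — it assumes an `hdfs` binary on the PATH, and the wrapper function itself is hypothetical:

```python
import subprocess

def passed(returncode):
    """Unix convention, same as checking `echo $?`: 0 means success/true."""
    return returncode == 0

def hdfs_test(path, flag="-e", hdfs_bin="hdfs"):
    """Return True iff `hdfs dfs -test <flag> <path>` exits with status 0."""
    result = subprocess.run([hdfs_bin, "dfs", "-test", flag, path])
    return passed(result.returncode)
```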

40

Page 41: Hadoop Interacting with HDFS

17. Get FileSystem Statistics

Syntax:

hdfs dfs -stat [format] <hdfs-file-path>

Format Options:

%b - file size in bytes, %g - group name of owner

%n - filename, %o - block size

%r - replication, %u - user name of owner

%y - modification date
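The format string is simple token substitution. A sketch emulating it over a plain dict (the dict field names are illustrative choices, not Hadoop's internal names):

```python
def hdfs_stat_format(fmt, info):
    """Emulate `hdfs dfs -stat <format>` token substitution over a dict."""
    tokens = {
        "%b": "size",         # file size
        "%g": "group",        # group name of owner
        "%n": "name",         # filename
        "%o": "block_size",   # block size
        "%r": "replication",  # replication factor
        "%u": "user",         # user name of owner
        "%y": "mtime",        # modification date
    }
    for token, field in tokens.items():
        if token in fmt:
            fmt = fmt.replace(token, str(info[field]))
    return fmt

print(hdfs_stat_format("%n %r", {"name": "file.txt", "replication": 3}))
```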

41

Page 42: Hadoop Interacting with HDFS

18. Get File/Dir Counts

Syntax:

hdfs dfs -count [-q] [-h] [-v] <hdfs-file-path>

Example:

hdfs dfs -count -v /user/USERNAME/demo/

42

Page 43: Hadoop Interacting with HDFS

19. Set replication factor

Syntax:

hdfs dfs -setrep -w -R n <hdfs-file-path>

Examples:

hdfs dfs -setrep -w -R 2 /user/USERNAME/demo/data/file.txt

43

Page 44: Hadoop Interacting with HDFS

20. Set Block Size

Syntax:

hdfs dfs -D dfs.blocksize=blocksize -copyFromLocal <local-file-path> <hdfs-file-path>

Examples:

hdfs dfs -D dfs.blocksize=67108864 -copyFromLocal /path/to/file.txt /user/USERNAME/demo/block-example/
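`dfs.blocksize` takes a value in bytes, so the long number in the example is simply 64 MB. A one-line helper to avoid typing raw byte counts:

```python
def blocksize_bytes(megabytes):
    """Convert a block size in MB to the byte value dfs.blocksize expects."""
    return megabytes * 1024 * 1024

print(blocksize_bytes(64))   # 67108864, the value used in the example
print(blocksize_bytes(128))  # 134217728, the HDFS default
```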

44

Page 45: Hadoop Interacting with HDFS

21. Empty the HDFS trash

Syntax:

hdfs dfs -expunge

Location:

/user/USERNAME/.Trash (per-user trash directory)

45

Page 46: Hadoop Interacting with HDFS

Other hdfs commands (admin)

46

Page 47: Hadoop Interacting with HDFS

22. HDFS Admin Commands: fsck

Syntax:

hdfs fsck <hdfs-file-path>

Options:

[-list-corruptfileblocks]

[-move | -delete | -openforwrite]

[-files [-blocks [-locations | -racks]]]

[-includeSnapshots]

47

Page 48: Hadoop Interacting with HDFS

48

Page 49: Hadoop Interacting with HDFS

23. HDFS Admin Commands: dfsadmin

Syntax:

hdfs dfsadmin

Options:

[-report [-live] [-dead] [-decommissioning]]

[-safemode enter | leave | get | wait]

[-refreshNodes]

[-refresh <host:ipc_port> <key> [arg1..argn]]

[-shutdownDatanode <datanode:port> [upgrade]]

[-getDatanodeInfo <datanode_host:ipc_port>]

[-help [cmd]]

Examples:

hdfs dfsadmin -report -live

49

Page 50: Hadoop Interacting with HDFS

50

Page 51: Hadoop Interacting with HDFS

24. HDFS Admin Commands: namenode

Syntax:

hdfs namenode

Options:

[-checkpoint] |

[-format [-clusterid cid] [-force] [-nonInteractive]] |

[-upgrade [-clusterid cid]] |

[-rollback] |

[-recover [-force]] |

[-metadataVersion]

Examples:

hdfs namenode -help

51

Page 52: Hadoop Interacting with HDFS

25. HDFS Admin Commands: getconf

Syntax:

hdfs getconf [-options]

Options:

[-namenodes] [-secondaryNameNodes]

[-backupNodes] [-includeFile]

[-excludeFile] [-nnRpcAddresses]

[-confKey [key]]

52

Page 53: Hadoop Interacting with HDFS

Again... THE most important commands!!

Syntax:

hdfs dfs -help [options]

hdfs dfs -usage [options]

Examples:

hdfs dfs -help help

hdfs dfs -usage usage

53

Page 54: Hadoop Interacting with HDFS

Interacting With HDFS

In Web Browser

54

Page 55: Hadoop Interacting with HDFS

Web HDFS

URL:

http://namenode:50070/explorer.html

Examples:

http://localhost:50070/explorer.html

http://ec2-52-23-214-111.compute-1.amazonaws.com:50070/explorer.html

55

Page 56: Hadoop Interacting with HDFS

References

1. http://www.hadoopinrealworld.com

2. http://www.slideshare.net/sanjeeb85/hdfscommandreference

3. http://www.slideshare.net/jaganadhg/hdfs-10509123

4. http://www.slideshare.net/praveenbhat2/adv-os-presentation

5. http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html

6. http://www.snia.org/sites/default/files/Hadoop2_New_And_Noteworthy_SNIA_v3.pdf

7. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html

8. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html

9. http://hadoop.apache.org/docs/r1.2.1/distcp.html

56

Page 57: Hadoop Interacting with HDFS

Thank You!!

57

Page 58: Hadoop Interacting with HDFS

APPENDIX

58

Page 59: Hadoop Interacting with HDFS

Copy data from one cluster to another

Description:

Copy data between hadoop clusters

Syntax:

hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo

hadoop distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs://nn2:8020/bar/foo

hadoop distcp -f hdfs://nn1:8020/srclist.file hdfs://nn2:8020/bar/foo

Where srclist.file contains

hdfs://nn1:8020/foo/a

hdfs://nn1:8020/foo/b

59
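The `-f` variant reads source paths from a file, one URI per line. A sketch that writes such a list with Python's standard library, using the paths from the example above (the file would still need to be placed on HDFS before being passed to `distcp -f`):

```python
import tempfile

sources = [
    "hdfs://nn1:8020/foo/a",
    "hdfs://nn1:8020/foo/b",
]

# Write one source URI per line, the format `distcp -f` expects.
with tempfile.NamedTemporaryFile("w", suffix=".file", delete=False) as f:
    f.write("\n".join(sources) + "\n")
    srclist_path = f.name

print(srclist_path)
```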