HDFS User Reference
Biju Nair
Local File System

[Diagram: files (FileA, FileB, FileC) in a directory map to inodes (inode-n, inode-m, inode-p); each inode holds the file attributes and a list of block addresses. Disk layout: MBR, partition table, boot block, super block, free-space tracking, i-nodes, root directory.]

Note: the file block size is fixed when the file system is defined.
Hadoop Distributed File System
FileA
FileB
FileC
H1:blk0, H2:blk1
H3:blk0,H1:blk1
H2:blk0;H3:blk1
HDFS Directory Master Host (NN)
DISK
Local File System File
FileA0
FileB1
Inode-‐x
Inode-‐y
Local FS Directory Host 1
FileA1
FileC0
Inode-‐a
Inode-‐n
Local FS Directory Host 2
FileB0
FileC1
Inode-‐r
Inode-‐c
Local FS Directory Host 3
In-‐x
In-‐y
In-‐a
In-‐n
In-‐r
In-‐c DISK
DISK
DISK
Files created are of size equal to the HDFS blksize
3
HDFS

[Diagram: on-disk layout of each daemon.]

Data Node (x3): ${dfs.data.dir}/current/VERSION, blk_<id_1>, blk_<id_1>.meta, .../subdir2/
Name Node: ${dfs.name.dir}/current/VERSION, edits, fsimage, fstime
Secondary Name Node: ${fs.checkpoint.dir}/current/VERSION, edits, fsimage, fstime

[Diagram: the Hadoop CLI and HDFS UI reach the Name Node over RPC and HTTP; WebHDFS uses HTTP/S; clients exchange data with the Data Nodes over the HDFS data transfer protocol.]
HDFS Config Files and Ports

• Default configuration
  – core-default.xml, hdfs-default.xml
• Site-specific configuration
  – core-site.xml, hdfs-site.xml under conf
• Configuration of daemon processes
  – hadoop-env.sh under conf
• List of slave/data nodes
  – "slaves" file under conf
• Ports
  – Default NN UI port 50070 (HTTP), 50470 (HTTPS)
  – Default NN port 8020/9000
  – Default DN UI port 50075 (HTTP), 50475 (HTTPS)
HDFS - Write Flow

[Diagram: client, Name Node (namespace metadata, block map, fsimage/edit files), and three Data Nodes, with steps 1-8 below.]

1. Client requests to open a file for write through the fs.create() call; this overwrites an existing file.
2. Name Node responds with a lease on the file path.
3. Client writes locally, and when the data reaches the block size it requests a write from the Name Node.
4. Name Node responds with a new block id and the destination data nodes for the write and its replicas.
5. Client sends the first data node the data and the checksum generated over the data to be written.
6. The first data node writes the data and checksum and in parallel pipelines the replication to the other DNs.
7. Each data node receiving a replica responds with success/failure to the first DN.
8. The first data node in turn informs the Name Node that the write request for the block is complete, and the Name Node updates its block map.

Note: there can be only one writer at a time on a file.
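From the shell, the whole flow is exercised by an ordinary copy into HDFS; the paths below are illustrative:

```shell
# Copy a local file into HDFS; for each block the client runs
# through steps 1-8 above with the NN and the DN pipeline.
hdfs dfs -mkdir -p /user/demo
hdfs dfs -put localfile.txt /user/demo/filea.txt

# List the file; the second column shows its replication factor.
hdfs dfs -ls /user/demo/filea.txt
```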
HDFS - Read Flow

[Diagram: client, Name Node (namespace metadata, block map, fsimage/edit files), and three Data Nodes, with steps 1-6 below.]

1. Client requests to open a file for read through the fs.open() call.
2. Name Node responds with a lease on the file path.
3. Client requests to read the data in the file.
4. Name Node responds with the block ids in sequence and the corresponding data nodes.
5. Client reaches out directly to the DNs for each block of data in the file.
6. When a DN sends back data along with its checksum, the client verifies it by generating a checksum over the received data.
7. If the checksum verification fails, the client reaches out to another DN that holds a replica.
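The read path can likewise be driven from the CLI (illustrative path):

```shell
# Stream the file; the client fetches each block directly from a DN
# and verifies the checksum that accompanies it.
hdfs dfs -cat /user/demo/filea.txt

# Ask HDFS for the file's checksum, computed from the per-block
# checksums stored alongside the data.
hdfs dfs -checksum /user/demo/filea.txt
```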
HDFS - Name Node

Fsimage (metadata): namespace, ownership, permissions, create/modify/access time, hidden flag
Edit file (journal): changes to the metadata
BlockMap (in-memory): details on file blocks and where they are stored

1. The Name Node manages the HDFS file system using the fsimage/edit file and block-map data structures.
2. Fsimage and edit-file data are stored on disk; when HDFS starts they are read, merged and held in memory.
3. Data nodes send details about the blocks they store when they start and at regular intervals.
4. The Name Node uses the block reports sent by data nodes to build the BlockMap data structure.
5. The BlockMap is consulted when read requests on files come to the file system.
6. The BlockMap is also used to identify under/over-replicated files that require correction.
7. At no point does the Name Node store file data locally or get directly involved in transferring file data to clients.
8. A client reading/writing data receives metadata from the NN and then works directly with the DNs.
9. Name Nodes require large memory since they hold all of these data structures in memory.
10. If the NN is lost, the data in the file system can't be accessed.
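The fsimage described above can be examined without a running Name Node using the offline image viewer; the file path below is illustrative (a checkpointed fsimage under ${dfs.name.dir}/current):

```shell
# Dump a checkpointed fsimage to XML with the offline image viewer
# (hdfs oiv); this reads the namespace metadata directly from disk.
hdfs oiv -p XML -i /data/dfs/name/current/fsimage -o /tmp/fsimage.xml
```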
FS Meta Data Change Management

[Diagram: at start-up the Name Node merges its fsimage and edit file; periodically the Secondary Name Node pulls both, merges them into a new fsimage (fsimage_1) and a fresh edit file (editfile_1), and copies the new fsimage back to the Name Node.]

1. While HDFS is up and running, changes to file system metadata are stored in edit files.
2. When the NN starts it looks for edit files and merges their content with the fsimage on disk.
3. The merge produces a new fsimage and edit file; the old fsimage and edit files are discarded.
4. Since the edit files can be large on a very active HDFS cluster, NN start-up can take a long time.
5. The Secondary Name Node, at a regular interval or after a certain edit-file size, merges the edit file and the fsimage file.
6. The merge creates a new fsimage file and an edit file; the Secondary NN copies the new fsimage back to the NN.
7. This shortens NN start-up, and the fsimage copy can be used to restore after a failure of the NN server.
HDFS - Data Node

[Diagram: Name Node (metadata, block map) receiving heartbeats and block reports from three Data Nodes.]

1. Data nodes store the blocks of each file kept in HDFS; the default block size is 128 MB.
2. Blocks of data are replicated n times, 3 times by default.
3. A data node periodically sends a heartbeat to the Name Node to show that it is alive.
4. If the NN doesn't receive a heartbeat, it marks the DN as dead and stops sending it further requests.
5. At periodic intervals a data node also sends a block report listing all the file blocks it stores.
6. When a DN is dead, all files that had blocks stored on it get marked as under-replicated.
7. The NN rectifies under-replication by replicating the blocks to other data nodes.
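The Name Node's view of data node liveness and storage is visible from the admin CLI:

```shell
# Summarize the cluster as the NN sees it: total capacity, and
# per-DN status, including nodes marked dead for missed heartbeats.
hdfs dfsadmin -report
```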
Ensuring Data Integrity

• Through replication / replication assurance
  – First replica close to the client node
  – Second replica on a different rack
  – Third replica on the same rack as the second replica
• File system checks run manually
• Block scanning over a period of time
• Storing checksums along with block data
Permission and Quotas

• Files and directories use much of the POSIX model
  – Associated with an owner and a group
  – Permissions for owner, group and others
  – r to read, w to append, for files
  – r to list files, w to delete/create files, for directories
  – x to access child directories
  – Sticky bit on directories prevents deletions by others
  – User identification can be simple (OS) or Kerberos
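A sketch of these permission bits in practice (the paths and names are illustrative):

```shell
# Owner-only file; group may read and traverse the directory.
hdfs dfs -chmod 600 /user/demo/secret.txt
hdfs dfs -chown alice:analysts /user/demo/shared
hdfs dfs -chmod 750 /user/demo/shared

# Sticky bit: anyone may create files in /tmp/scratch, but only a
# file's owner (or the superuser) may delete or rename it.
hdfs dfs -chmod 1777 /tmp/scratch
```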
Permission and Quotas

• Quota on the number of files (name quota)
  – dfsadmin -setQuota <N> <dir>...<dir>
  – dfsadmin -clrQuota <dir>...<dir>
• Quota on the size of data (space quota)
  – Space quota can be set to restrict space usage
  – dfsadmin -setSpaceQuota <N> <dir>...<dir>
  – dfsadmin -clrSpaceQuota <dir>...<dir>
• Replicated data also consumes space quota
• Reporting
  – fs -count -q <dir>...<dir>
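For example, on an illustrative directory with the default replication of 3, a 30 GB space quota holds about 10 GB of file data, since replicas count against the quota:

```shell
# Name quota: at most 10,000 files and directories under /user/demo.
hdfs dfsadmin -setQuota 10000 /user/demo

# Space quota: 30 GB of raw (post-replication) storage.
hdfs dfsadmin -setSpaceQuota 30g /user/demo

# Report: name quota, names remaining, space quota, space remaining.
hdfs dfs -count -q /user/demo

# Clear both quotas.
hdfs dfsadmin -clrQuota /user/demo
hdfs dfsadmin -clrSpaceQuota /user/demo
```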
HDFS snapshot

• No copy of data blocks; only the metadata (block list and file names) is copied
• Allow snapshots on a directory
  – hdfs dfsadmin -allowSnapshot <path>
• Create a snapshot
  – hdfs dfs -createSnapshot <path> [<name>]
  – Default name is 's' + timestamp
• Verify a snapshot
  – hadoop fs -ls <path>/.snapshot
• A directory with snapshots can't be deleted or renamed
• Disallow snapshots
  – hdfs dfsadmin -disallowSnapshot <path>
  – All existing snapshots must be deleted before disallowing
• Delete a snapshot
  – hdfs dfs -deleteSnapshot <path> <name>
• Rename a snapshot
  – hdfs dfs -renameSnapshot <path> <oldname> <newname>
• Snapshot differences
  – hdfs snapshotDiff <path> <starting snapshot name> <ending snapshot name>
• List all snapshottable directories
  – hdfs lsSnapshottableDir
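A typical round trip through these commands, on an illustrative directory:

```shell
hdfs dfsadmin -allowSnapshot /user/demo
hdfs dfs -createSnapshot /user/demo snap1
hadoop fs -ls /user/demo/.snapshot      # verify: snap1 is listed

# ...modify files under /user/demo, then compare...
hdfs dfs -createSnapshot /user/demo snap2
hdfs snapshotDiff /user/demo snap1 snap2

# Rename and clean up; all snapshots must go before disallowing.
hdfs dfs -renameSnapshot /user/demo snap1 base
hdfs dfs -deleteSnapshot /user/demo base
hdfs dfs -deleteSnapshot /user/demo snap2
hdfs dfsadmin -disallowSnapshot /user/demo
```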
HDFS back-up using snapshot

• Create a snapshot on the source cluster
• Perform a "distcp" of the snapshot to the backup cluster
• Create a snapshot of the copy on the backup cluster
• Clean up old back-up copies to comply with the enterprise retention policy
• The reverse can be followed to recover data from the backup
  – Data needs to be removed on the production cluster before the restore
  – During deletion, the -skipTrash option of "rm" helps reduce space usage
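The back-up steps might look like the sketch below; nn1 (source) and nn2 (backup) are hypothetical cluster addresses, and both /data directories are assumed to be snapshottable:

```shell
# 1. Snapshot the source so distcp reads a frozen image.
hdfs dfs -createSnapshot /data nightly

# 2. Copy the snapshot (not the live tree) to the backup cluster.
hadoop distcp hdfs://nn1:8020/data/.snapshot/nightly \
              hdfs://nn2:8020/backups/data

# 3. Snapshot the copy on the backup cluster.
hdfs dfs -createSnapshot hdfs://nn2:8020/backups/data nightly
```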
distcp

• Tool to perform inter- and intra-cluster copies of data
• Utilizes MapReduce to perform the copy
• It can be used to
  – Copy data within a cluster
  – Copy data between clusters
  – Copy files or directories
  – Copy data from multiple sources
• Can be used to create a backup cluster
• Starts up containers on both source and target
• Consumes network bandwidth between clusters
• Needs to be scheduled at an appropriate time
• Resource utilization can be controlled using parameters
distcp

• hadoop distcp [options] <srcURL> ... <srcURL> <destURL>
  – Source paths need to be absolute
  – The destination directory is created if not present
  – "update" option copies only the changed files
  – "skipcrccheck" option disables the checksum check
  – "overwrite" option overwrites existing files, which are otherwise skipped if present
  – "delete" option deletes files in the destination that are not in the source
  – The "hftp" fs needs to be used to copy between different versions of HDFS
  – "m" option specifies the number of mappers
  – "atomic" option commits all changes or none
  – "async" runs distcp asynchronously, i.e. non-blocking
  – "i" option ignores failures during the copy
  – "log" sets the directory on DFS where logs are saved
  – "p [rbugp]" preserves file status as at the source
  – "strategy [static|dynamic]" selects the copy strategy
  – "bandwidth [MB]" limits the bandwidth per map in MB
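Combining several of these options (cluster addresses are illustrative):

```shell
# Incremental mirror: copy only changed files, delete destination
# files no longer present at the source, 20 mappers, 10 MB/s per
# mapper, preserving user, group and permissions.
hadoop distcp -update -delete -m 20 -bandwidth 10 -p ugp \
  hdfs://nn1:8020/data hdfs://nn2:8020/data
```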
HDFS Java APIs

Function: API
Directory create: FileSystem.mkdirs(path, permission)
Directory rename/move: FileSystem.rename(oldpath, newpath)
Directory delete: FileSystem.delete(path, true)
File create: FileSystem.createNewFile(path)
File open: FileSystem.open(path)
File read: FSDataInputStream.read*
File write: FSDataOutputStream.write*
File rename/move: FileSystem.rename(oldpath, newpath)
File delete: FileSystem.delete(path, false)
File append: FileSystem.append(path)
File seek: FSDataInputStream.seek(long)
File system handle: FileSystem.get(conf)
HDFS Federation

Diagram source: hadoop.apache.org, JIRA HDFS-1052

HDFS without federation
- Namespace management and block management together
- Supports one namespace
- Hinders scalability above 4000 nodes
- Doesn't support some multi-tenancy requirements

HDFS with federation
- Namespace management and block management separated
- Block management can run on a node of its own
- Supports more than one namespace/NN
- Scalable beyond 4000 nodes and millions of files
- Can deploy multi-tenancy requirements, such as an NN for a specific department, with isolation
- A namespace together with its block pool is called a namespace volume
Enabling HDFS federation

• Identify a unique cluster id
• Identify nameservice ids for the name nodes
• Add dfs.nameservices to hdfs-site.xml
  – Comma-separated nameservice (ns) names
• Update hdfs-site.xml on all NNs and DNs
  – dfs.namenode.rpc-address.ns
  – dfs.namenode.http-address.ns
  – dfs.namenode.servicerpc-address.ns
  – dfs.namenode.https-address.ns
  – dfs.namenode.secondary.http-address.ns
  – dfs.namenode.backup.address.ns
• Format all name nodes using the cluster id
  – hdfs namenode -format -clusterId <cluster id>
HDFS Rack Awareness

• Rack awareness enables efficient data placement
  – Data writes
  – Balancer
  – Decommissioning/commissioning of nodes
• Each node is assigned to a rack (rack id)
  – The rack id is used in the path names
• Data placement
  – First replica of a block is placed near the client, or on a random node/rack
  – Second replica of the block is placed on a node in a second rack
  – Third replica is placed on a different node in the second rack
  – If HDFS is not rack aware, the second and third replicas are placed on random nodes
Enabling HDFS Rack Awareness

• Update core-site.xml with the topology properties
  – topology.script.file.name
    • Script can be a shell script, Python, Java
  – topology.script.number.args
• Copy the script to the conf directory
• Distribute the script and core-site.xml
• Stop and start the name node
• Verify that the racks are recognized by HDFS
  – hdfs fsck -racks
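A minimal topology script might look like the sketch below; the subnets and rack names are made up. HDFS invokes the script with one or more IPs/hostnames as arguments and expects one rack path per argument on stdout:

```shell
#!/usr/bin/env bash
# Map a node address to a rack id; unknown nodes fall back to the
# default rack, which is what HDFS assumes without any script.
resolve_rack() {
  case "$1" in
    10.1.1.*) echo "/dc1/rack1" ;;
    10.1.2.*) echo "/dc1/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
}

# Emit one rack path per argument, in order.
for node in "$@"; do
  resolve_rack "$node"
done
```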
HDFS NFS Gateway

• Allows HDFS to be mounted as part of the local FS
• A stateless daemon translates NFS to the HDFS access protocol
• A DFSClient is part of the gateway daemon
  – Averages 30 MB/s for writes
• Multiple gateways can be used for scalability
• The gateway machine requires all the software and configs of an HDFS client
  – The gateway can run on HDFS cluster nodes
• Random writes are not supported

[Diagram: a client speaking NFSv3 reaches the NFS gateway (DFSClient), which talks to the HDFS cluster: the NN over RPC and the DNs over the HDFS protocol.]
HDFS NFS Gateway Configuration

• Consists of two daemons: portmap and nfs3
• Configuration
  – dfs.namenode.accesstime.precision; 3600000 (1 hr)
    • Requires a name node restart
  – dfs.nfs3.dump.dir; directory to store out-of-sequence data
    • Needs enough space to store data for all concurrent file writes
    • Use NFS for smaller file transfers, on the order of 1 GB
  – dfs.nfs.exports.allowed.hosts; host access
    • client*.abc.com r;client*.xyc.com rw
  – Update the log4j.properties file
    • log4j.logger.org.apache.hadoop.hdfs.nfs=DEBUG
    • log4j.logger.org.apache.hadoop.oncrpc=DEBUG
HDFS NFS Gateway Configuration

• Stop the nfs and rpcbind services provided by the OS
  – service nfs stop
  – service rpcbind stop
• Start the Hadoop portmap as root
  – hadoop-daemon.sh start portmap
  – To stop, use "stop" instead of "start" as the parameter
• Start mountd and nfsd as the user starting HDFS
  – hadoop-daemon.sh start nfs3
  – To stop, use "stop" instead of "start" as the parameter
HDFS NFS Gateway Configuration

• Validate that the NFS services are running
  – rpcinfo -p $nfs_server_ip
  – Should see entries for mountd, portmapper and nfs
• Verify the HDFS namespace is exported for mount
  – showmount -e $nfs_server_ip
  – Should see the export list
• Mount HDFS on the client
  – Create a mount point as root
  – Change ownership of the mount point to the user running the HDFS cluster
  – mount -t nfs -o vers=3,proto=tcp,nolock $nfs_server:/ $mount_point
  – The client sends the UID of the user to NFS
  – NFS looks up the username for the UID and uses it to access HDFS
  – The username and UID should be the same on the client and the NFS gateway
HDFS Name Node HA

[Diagram: active and passive Name Nodes sharing storage, each monitored by a ZKFC; a ZooKeeper quorum of three ZK nodes; Data Nodes send heartbeats and block reports to both NNs.]

• ZooKeeper does failure detection and helps with active name node election
• ZKFC: ZooKeeper Failover Controller
  – Monitors the health of the name node
  – Holds a session open on ZK, and a lock while its NN is active
  – If no other NN holds the lock, it tries to acquire it to make its NN active
• Shared storage can be an NFS mount or a quorum of journal nodes
• Fencing is defined to prevent the split-brain scenario of two NNs writing
HDFS NN HA Configuration

• Define dfs.nameservices
  – Nameservice id
• Define dfs.ha.namenodes.[nameservice id]
  – Comma-separated list of name nodes
• Define dfs.namenode.rpc-address.[nameservice id].[name node id]
  – Fully qualified machine name and port
• Define dfs.namenode.http-address.[nameservice id].[name node id]
  – Fully qualified machine name and port
• Define dfs.namenode.shared.edits.dir
  – For NFS: file:///mnt/...
  – For journal nodes: qjournal://node1:8485;node2.com:8485;...
• Define dfs.client.failover.proxy.provider.[nameservice id]
  – org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
• Define dfs.ha.fencing.methods
  – sshfence; requires passwordless ssh between the name nodes
  – shell
• Define fs.defaultFS as the HA-enabled logical URI
• For journal nodes
  – Define dfs.journalnode.edits.dir, where edits and other local state used by the JNs will be stored
HDFS NN HA Configuration

• Define dfs.ha.automatic-failover.enabled
  – Set to true
• Define ha.zookeeper.quorum
  – Hosts and ports of the ZK ensemble
• To enable HA in an existing cluster
  – Run hdfs dfsadmin -safemode enter
  – Run hdfs dfsadmin -saveNamespace
  – Stop the HDFS cluster: stop-dfs.sh
  – Start the journal node daemons: hadoop-daemon.sh start journalnode
  – Run hdfs zkfc -formatZK on the existing NN
  – Run hdfs namenode -initializeSharedEdits on the existing NN
  – Run hdfs namenode -bootstrapStandby on the new NN
  – Remove the secondary name node
  – Start the HDFS cluster: start-dfs.sh
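Collected into one sequence, the conversion steps above might be scripted as follows; which host each command runs on is noted in the comments:

```shell
# Quiesce and checkpoint the namespace on the existing NN.
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace

# Stop the cluster, then bring up the journal node daemons.
stop-dfs.sh
hadoop-daemon.sh start journalnode    # on each journal node

# Initialize ZK failover state and the shared edits directory on
# the existing NN, then bootstrap the new standby from it.
hdfs zkfc -formatZK
hdfs namenode -initializeSharedEdits
hdfs namenode -bootstrapStandby       # on the new NN

start-dfs.sh
```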
hdfs haadmin

• -ns <nameserviceId>
• -transitionToActive <serviceId>
• -transitionToStandby <serviceId>
• -failover [--forcefence] [--forceactive] <serviceId> <serviceId>
• -getServiceState <serviceId>
• -checkHealth <serviceId>
• -help <command>
hdfs dfsadmin

• -report
• -safemode [enter|leave|get|wait]
• -finalizeUpgrade
• -refreshNodes; uses the files defined in dfs.hosts and dfs.hosts.exclude
• -lsr
• -upgradeProgress status
• -metasave
• -setQuota <quota> / -clrQuota <dirname>...<dirname>
• -setRep [-w] <rep> <path/file>
hdfs fsck

• hdfs fsck [options] <path>
  – -move
  – -delete
  – -openforwrite
  – -files
  – -blocks
  – -locations
  – -racks
Balancer

• start-balancer.sh
  – -policy datanode|blockpool
  – -threshold <percentage>; default 10%
  – dfs.datanode.balance.bandwidthPerSec, specified in bytes
    • Default 1 MB/sec
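For example, a tighter balance target with a higher copy rate:

```shell
# Rebalance until every DN is within 5% of average utilization.
start-balancer.sh -threshold 5

# Raise the per-DN balancing bandwidth to 10 MB/s for this run
# (overrides dfs.datanode.balance.bandwidthPerSec until restart).
hdfs dfsadmin -setBalancerBandwidth 10485760
```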
Adding New Nodes

• Add the node address to the dfs.hosts file
  – Update the mapred.hosts file if using MapReduce
• Update the namenode with the new set of nodes
  – hadoop dfsadmin -refreshNodes
• Update the jobtracker with the new set of nodes
  – hadoop mradmin -refreshNodes
• Update the "slaves" file with the new node names
• Start the new datanodes (and tasktrackers)
• Check the availability of the new nodes in the UI
• Run the balancer so that data gets distributed
Decommissioning Nodes

• Add the node address to the exclude files
  – dfs.hosts.exclude
  – mapred.hosts.exclude
• Update the namenode (and jobtracker)
  – hadoop dfsadmin -refreshNodes
  – hadoop mradmin -refreshNodes
• Verify all the nodes are decommissioned (UI)
• Remove the nodes from the dfs.hosts (and mapred.hosts) files
• Update the namenode (and jobtracker) again
• Remove the nodes from the "slaves" file
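The decommission steps reduce to a short sequence; the hostname and exclude-file location below are illustrative:

```shell
# 1. List the node in the exclude file named by dfs.hosts.exclude.
echo "dn4.example.com" >> /etc/hadoop/conf/dfs.exclude

# 2. Tell the NN to re-read its host lists; the node switches to
#    "Decommission in progress" while its blocks are re-replicated.
hadoop dfsadmin -refreshNodes

# 3. After the UI shows "Decommissioned", remove the node from
#    dfs.hosts and the exclude file, then refresh again.
hadoop dfsadmin -refreshNodes
```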
HDFS Upgrade

• No file system layout change
  – Install the new version of HDFS (and MapReduce)
  – Stop the old daemons
  – Update the configuration files
  – Start the new daemons
  – Update clients to use the new libraries
  – Remove the old install and the configuration files
  – Update application code for deprecated APIs
HDFS Upgrade

• With file system layout changes
  – When there is a layout change the NN will not start
  – Run fsck to make sure the FS is healthy
  – Keep a copy of the fsck output for verification
  – Clear HDFS and MapReduce temporary files
  – Make sure any previous upgrade is finalized
  – Shut down MapReduce and kill orphaned tasks
  – Shut down HDFS and make a copy of the NN directories
  – Install the new versions of HDFS and MapReduce
  – Start HDFS with the -upgrade option
    • start-dfs.sh -upgrade
  – Once the upgrade is complete, perform manual spot checks
    • hadoop dfsadmin -upgradeProgress status
  – Start MapReduce
  – Roll back or finalize the upgrade
    • stop-dfs.sh; start-dfs.sh -rollback
    • hadoop dfsadmin -finalizeUpgrade
Key Parameters

Parameter | Description | Default
dfs.blocksize | File block size | 128 MB
dfs.replication | File block replication count | 3
dfs.datanode.numblocks | Number of blocks after which a new subdirectory gets created in a DN |
io.bytes.per.checksum | Number of data bytes for which a checksum is calculated | 512
dfs.datanode.scan.period.hours | Timeframe in hours to complete block scanning | 504 (3 weeks)