
Data Spillage In Hadoop Clusters - CERIAS



Specific Goals

Data Spillage In Hadoop Clusters

Oluwatosin Alabi¹, Joe Beckman², Dheeraj Gurugubelli³

¹Purdue University, [email protected]; ²Purdue University, [email protected]; ³Purdue University, [email protected]

Problem Statement

HDFS

• Retrieval of NameNode Metadata

• Recovery of DataNode Data Remnants

Load Tagged Document

• Retrieval of NameNode Metadata

• Recovery of DataNode Data Remnants

Delete Tagged Document

Apply digital forensics procedure to locate data remnants

Assess data sanitization level

Compare results using the delete file method to a more secure method (e.g. NIST 800-88)

• Apply a deletion protocol

• Determine whether data can be recovered in HDFS, and to what extent

• Data spillage is the undesired transfer of classified information onto an unauthorized compute node or storage medium

• The loss of control over sensitive and protected data can become a serious threat to business operations and national security (NSA Mitigation Group, 2012).
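The contrast between the plain delete-file method and a NIST 800-88-style removal can be sketched on a single local file. This is a minimal, hypothetical sketch: real HDFS deletion spans replicas on many DataNodes, and both function names below are ours, not from the poster. Note also that journaling or copy-on-write filesystems may retain copies that an in-place overwrite does not reach.

```python
import os

def insecure_delete(path):
    # Plain unlink: the directory entry disappears, but the file's
    # data blocks stay on disk until reused, so remnants remain
    # recoverable by carving tools.
    os.remove(path)

def clear_then_delete(path):
    # One-pass zero overwrite before unlinking, in the spirit of the
    # NIST 800-88 "Clear" technique: the on-disk blocks no longer
    # hold the original bytes when the file is removed.
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        f.write(b"\x00" * size)
        f.flush()
        os.fsync(f.fileno())
    os.remove(path)
```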

Research Question

Can classified data leaked, by user error, into an unauthorized Hadoop Distributed File System (HDFS) be located, recovered, and removed completely from the server?

Research Process

Scope: Determine the investigation boundaries and limitations

Preparation: Prepare cluster environment

Identification: Identify the spilled data's location on the DataNodes using the metadata on the NameNode

Collection & Preservation: Acquire the .vmdk image files of impacted nodes.

Examine & Analyze: Conduct forensic procedure to find any spilled data remnants on the disks

Presentation: Document and Report Findings

Iterative

NameNode:
• Maintains the namespace tree and the mapping of blocks to DataNodes
• Metadata: inodes & block list
• Edit logs: journal

[Figure: HDFS cluster during LOAD. A tagged test file is loaded into the cluster. Each DataNode (1-8) stores the actual data in blocks of at least 64 MB plus block metadata, and reports to the NameNode via heartbeats. File1 and File2 are each replicated three times (R1-R3) across the DataNodes; the NameNode's file metadata is used to retrieve location information and identify the impacted DataNodes, including DataNode 4.]

[Figure: HDFS cluster after DELETE. The NameNode tracks metadata and data-location logs. DataNode 8 is dead (no heartbeat), so its File1 R2 replica has been re-replicated to another node. The .vmdk (Virtual Machine Disk) files of the impacted DataNodes are acquired to retrieve file evidence (step 4).]
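On disk, a DataNode keeps each replica as a plain file named blk_<blockId>, with a companion blk_<blockId>_<genstamp>.meta checksum file, under its configured storage directory. The identification step can therefore be sketched as a filesystem scan for a block ID taken from the NameNode metadata. The directory layout is standard HDFS, but the helper name is ours:

```python
from pathlib import Path

def find_block_replicas(storage_dir, block_id):
    # Scan a DataNode storage directory (dfs.datanode.data.dir) for
    # the raw block file blk_<id> and its blk_<id>_<genstamp>.meta
    # checksum file; both may hold remnants of the spilled document.
    return sorted(Path(storage_dir).rglob(f"blk_{block_id}*"))
```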

Forensic Acquisition

• Mount disk images
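Before analysis, each acquired .vmdk image is typically hashed so the evidence can later be shown to be unmodified. A sketch of that acquisition step (the hash algorithm and chunk size are our choices, not stated on the poster):

```python
import hashlib

def image_digest(path, algo="sha256", chunk=1 << 20):
    # Stream the disk image in chunks so multi-gigabyte .vmdk files
    # never need to fit in memory; record the digest before and after
    # analysis to demonstrate the evidence was not altered.
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()
```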

Log Analysis (Name Node)

• HDFS Event Logs

• Disk ID’s (Before Delete)

• Moved to .Trash (after delete)
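The NameNode-side log analysis can be sketched as a scan of the HDFS audit log for delete and rename commands, since a deletion first appears as a rename into the user's .Trash directory. The sample lines, paths, and regex below are illustrative, not taken from a real cluster:

```python
import re

# Matches the cmd=... src=... fields of an HDFS audit-log entry.
AUDIT_RE = re.compile(r"cmd=(?P<cmd>\S+)\s+src=(?P<src>\S+)")

def delete_events(lines):
    # Collect (command, path) pairs for delete/rename operations,
    # which is how a file's trip through .Trash shows up in the log.
    events = []
    for line in lines:
        m = AUDIT_RE.search(line)
        if m and m.group("cmd") in ("delete", "rename"):
            events.append((m.group("cmd"), m.group("src")))
    return events
```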

Forensic Analysis (Data Node)

• Block-Based Carving

• Header/Footer Carving

• File-Structure-Based Carving

• Live Search

• Index Search
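Header/footer carving works by scanning the raw image for known magic bytes. A minimal sketch for PDF signatures (%PDF ... %%EOF), with no attempt to handle fragmented files; the function name is ours:

```python
def carve_pdfs(raw):
    # Header/footer carving: find each %PDF header, then the next
    # %%EOF footer, and return the bytes in between as a candidate
    # file. Fragmented or partially overwritten files will not carve
    # cleanly with this simple approach.
    found = []
    start = raw.find(b"%PDF")
    while start != -1:
        end = raw.find(b"%%EOF", start)
        if end == -1:
            break
        found.append(raw[start:end + len(b"%%EOF")])
        start = raw.find(b"%PDF", end)
    return found
```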

Forensic Tools

FTK Imager

Forensic Toolkit (FTK) 5.3

Examine & Analyze Disk Files


Lessons Learned

• Imaging retrieval and processing time is proportional to disk size: the larger the disk, the longer the processing time

• Different file types require different recovery procedures; text-file searching with the FTK tool could not be applied in every case

• The research process is iterative, and each iteration may require different steps and tools

Future Work

• Exploration of the efficacy of various secure data-sanitization methods for data removal in a virtual HDFS cluster

• Extension of this process to minimize cluster downtime during the removal process

• Extension of this process to include detection and removal of other file types

• Automation of the data-removal process in HDFS


References:
[1] NSA Mitigations Group. (2012). Securing Data and Handling Spillage Events [White Paper]. Retrieved from https://www.nsa.gov/ia/_files/factsheets/final_data_spill.pdf
[2] Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010, May). The Hadoop Distributed File System. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) (pp. 1-10). IEEE.
[3] Lim, S., Yoo, B., Park, J., Byun, K., & Lee, S. (2012). A research on the investigation method of digital forensics for a VMware Workstation's virtual machine. Mathematical and Computer Modelling, 55(1), 151-160.