
SECURITY IMPLEMENTATION IN HADOOP

By

Narsimha Chary(200607008)

Siddalinga K M(200950034)

Rahman(200950032)

AGENDA

What is security?

Security in distributed file systems

Current level of security in Hadoop

Security features to be incorporated in HDFS to make it robust

What is Security?

Protection of information and property from theft, corruption, or natural disaster, while allowing the information and property to remain accessible and productive to its intended users.

Processes and mechanisms by which sensitive and valuable information and services are protected from publication, tampering, or collapse by unauthorized activities, untrustworthy individuals, and unplanned events.

Security in Distributed File Systems

Private clouds are more or less secure, as they are deployed within the premises of the organization and are also protected by a firewall.

Public clouds are prone to all sorts of danger, as you never know where your data resides at any instant.

Current level of security in Hadoop

The current version of Hadoop has a very basic, rudimentary implementation of security: an advisory access control mechanism.

Hadoop doesn't strongly authenticate the client; it simply asks the underlying Unix system by executing the `whoami` command.

Anyone can communicate directly with a Datanode (without ever talking to the Namenode) and ask for blocks, provided they have the block location details (this was demonstrated at a recent Cloudera Hadoop Hackathon).
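A minimal sketch of the weak identity check described above (illustrative code, not Hadoop's actual source): the user name is whatever the local OS reports, so a client that controls its own machine can claim any identity.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class WhoAmI {
        // Ask the underlying Unix system for the current user name.
        // Nothing ties this name to a verified identity.
        static String currentUser() throws Exception {
            Process p = Runtime.getRuntime().exec(new String[] {"whoami"});
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                return r.readLine();
            }
        }
    }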

With current provisions…

The Hadoop cluster may be prone to the following attacks:

Unauthorized clients can impersonate authorized users and access the cluster.

One can get the blocks directly from the Datanodes by bypassing the Namenode.

Eavesdropping/sniffing of data packets being sent by Datanodes to the client.

(Can this be resolved by using a secure socket over a regular socket? Yes, yes! But it is quite an overhead and hinders performance.)
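A sketch of that secure-socket option using standard JSSE (the host name is illustrative; 50010 was the default Datanode transfer port). The handshake and per-byte encryption are exactly the overhead the slide warns about.

    import javax.net.ssl.SSLSocket;
    import javax.net.ssl.SSLSocketFactory;

    public class SecureBlockRead {
        public static void main(String[] args) throws Exception {
            SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
            // Wrap the Datanode connection in TLS to stop sniffing of block data.
            try (SSLSocket socket = (SSLSocket)
                    factory.createSocket("datanode.example.com", 50010)) {
                socket.startHandshake(); // extra round trips and crypto = the cost
                // ... read block data over socket.getInputStream()
            }
        }
    }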

Proposed Solutions

Authentication of users/clients accessing the Hadoop cluster using the Kerberos protocol

Authorization for accessing data residing on HDFS (by granting and revoking capabilities)

A little about the Kerberos Protocol

Network authentication protocol

Developed at MIT in the mid-1980s

Available as open source or in supported commercial software

How does Kerberos work?

Instead of the client sending a password to the application server:

– Request a ticket from the authentication server

– The ticket and an encrypted request are sent to the application server

How to request tickets without repeatedly sending credentials?

– Ticket granting ticket (TGT)

How does Kerberos work? Ticket Granting Tickets
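The TGT exchange can be driven from Java through the standard GSS-API. A minimal client-side sketch follows; the service principal name is illustrative, and the JAAS/Kerberos configuration that supplies the cached TGT is assumed to be in place.

    import org.ietf.jgss.GSSContext;
    import org.ietf.jgss.GSSManager;
    import org.ietf.jgss.GSSName;
    import org.ietf.jgss.Oid;

    public class KerberosTicketSketch {
        public static void main(String[] args) throws Exception {
            GSSManager manager = GSSManager.getInstance();
            Oid krb5 = new Oid("1.2.840.113554.1.2.2"); // Kerberos v5 mechanism
            // Service principal we want a ticket for (illustrative name).
            GSSName server = manager.createName("hdfs@namenode.example.com",
                                                GSSName.NT_HOSTBASED_SERVICE);
            GSSContext context = manager.createContext(server, krb5,
                    null /* use the cached TGT */, GSSContext.DEFAULT_LIFETIME);
            context.requestMutualAuth(true);
            // Obtains a service ticket via the TGT; no password goes to the server.
            byte[] token = context.initSecContext(new byte[0], 0, 0);
            // ... send token to the application server and continue the handshake
            context.dispose();
        }
    }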

Kerberos contd…

Some of the notations used:

A -> B : M # denotes a message M from node A to node B

KU_A and KR_A : the public and private keys of node A

K_AB : key shared between A and B

{M}K_AB : a message M encrypted with K_AB

<M>KR_A : a message M signed with KR_A

C, N, D : Client, Namenode, Datanode

1. Authentication

The Namenode checks the details of the request and, if the client is a valid user, issues (or refuses to issue) the ticket T.

C -> N : request_ticket, TS, hash<request_ticket, TS, K_CN>

N -> C : T
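The keyed hash in this request could look like the following sketch (the slides only say hash<…, K_CN>; HMAC-SHA256 is an assumption):

    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;
    import java.nio.charset.StandardCharsets;

    public class RequestTag {
        // hash<request_ticket, TS, K_CN>: a keyed hash the Namenode can
        // recompute to check that the request is genuine and fresh.
        static byte[] tag(byte[] requestTicket, long ts, byte[] kcn) throws Exception {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(kcn, "HmacSHA256"));
            mac.update(requestTicket);
            mac.update(Long.toString(ts).getBytes(StandardCharsets.UTF_8));
            return mac.doFinal();
        }
    }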

Authentication contd…

Message exchange between client C and Datanode D to establish a shared key between them:

C -> D : {<K_CD, TS, nonce>KR_C}KU_D, T

D -> C : nonce', hash<nonce', K_CD>

The client sends the ticket T along with a shared key K_CD that it wants to establish with the Datanode D; the client also sends a nonce so that the Datanode can verify the freshness of the message.
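A sketch of how the client could build {<K_CD, TS, nonce>KR_C}KU_D with standard JCA primitives. The slides don't name algorithms, so RSA signing plus hybrid RSA/AES encryption is an assumption (raw RSA cannot seal a payload plus signature in one block):

    import java.io.ByteArrayOutputStream;
    import java.security.PrivateKey;
    import java.security.PublicKey;
    import java.security.SecureRandom;
    import java.security.Signature;
    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.GCMParameterSpec;

    public class KeyEstablishment {
        // Sign the (K_CD, TS, nonce) payload with the client's private key,
        // then encrypt payload + signature so only the Datanode can read them.
        static byte[] seal(byte[] payload, PrivateKey clientPriv,
                           PublicKey datanodePub) throws Exception {
            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(clientPriv);
            signer.update(payload);
            byte[] sig = signer.sign();

            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            plain.write(payload);
            plain.write(sig);

            // Fresh AES key encrypts the data; RSA-OAEP encrypts the AES key.
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(256);
            SecretKey aesKey = kg.generateKey();
            byte[] iv = new byte[12];
            new SecureRandom().nextBytes(iv);

            Cipher aes = Cipher.getInstance("AES/GCM/NoPadding");
            aes.init(Cipher.ENCRYPT_MODE, aesKey, new GCMParameterSpec(128, iv));
            byte[] body = aes.doFinal(plain.toByteArray());

            Cipher rsa = Cipher.getInstance("RSA/ECB/OAEPWithSHA-256AndMGF1Padding");
            rsa.init(Cipher.ENCRYPT_MODE, datanodePub);
            byte[] wrappedKey = rsa.doFinal(aesKey.getEncoded());

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write(wrappedKey); // fixed length for a given RSA key size
            out.write(iv);
            out.write(body);
            return out.toByteArray();
        }
    }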

Authentication contd…

To complete the ticket establishment step, the Datanode has to respond to a nonce challenge.

T = <ID_U, KU_C, IV, TS, TE>KR_N

K_CD = hash<IV, KU_D, random_data>

Authentication contd…

T contains the user id, the client's public key, an initialization vector, and the ticket's lifetime (TS and TE).

The shared key is computed by hashing the IV with the Datanode's public key and some random data.
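A sketch of that derivation as the slide defines it, K_CD = hash<IV, KU_D, random_data>; SHA-256 is assumed for the hash:

    import java.security.MessageDigest;
    import java.security.PublicKey;

    public class SharedKey {
        // K_CD = hash<IV, KU_D, random_data>
        static byte[] deriveKcd(byte[] iv, PublicKey datanodePub,
                                byte[] randomData) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(iv);
            md.update(datanodePub.getEncoded()); // hash over the encoded public key
            md.update(randomData);
            return md.digest();
        }
    }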

2. Capabilities

To read data from HDFS, the client has to obtain block locations and capabilities from the Namenode before it goes to the Datanodes.

C -> N : read(path), TS, hash<read(path), TS, K_CN>

N -> C : block_locations, hash<block_locations>

The capabilities are embedded into the block location information and signed by the Namenode. The Datanode verifies the capabilities and accordingly allows the read or denies it.

C -> D : read(block), TS, …

Capabilities contd…

Description of the capability information embedded into the block location information. The sign (with the Namenode's private key) over the capability and block id is also embedded:

C = ID, permissions, path

Sign = <C, block_id>KR_N
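Sign = <C, block_id>KR_N maps directly onto the standard Signature API; a sketch follows (the field encoding with "|" separators is illustrative):

    import java.nio.charset.StandardCharsets;
    import java.security.PrivateKey;
    import java.security.Signature;

    public class CapabilitySigner {
        // The Namenode signs the capability fields together with the block id,
        // so a Datanode can later verify both in one step.
        static byte[] sign(String userId, String permissions, String path,
                           long blockId, PrivateKey namenodePriv) throws Exception {
            Signature sig = Signature.getInstance("SHA256withRSA");
            sig.initSign(namenodePriv);
            String capability = userId + "|" + permissions + "|" + path;
            sig.update((capability + "|" + blockId).getBytes(StandardCharsets.UTF_8));
            return sig.sign();
        }
    }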

Revocation of capabilities

Capabilities could potentially be re-used by clients to read data from HDFS at any time after they were issued; however, file permissions change over time.

Revocation of capabilities needs to be done in order to prevent replay attacks.

Capabilities issued by the Namenode will have an expiry period (say, 1 hour), and this can be configured in hadoop-site.xml.
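A sketch of how such an expiry period might be read through Hadoop's Configuration API; the property name below is hypothetical, not an actual Hadoop key:

    import org.apache.hadoop.conf.Configuration;

    public class CapabilityExpiry {
        public static void main(String[] args) {
            // Reads hadoop-site.xml (among other resources) from the classpath.
            Configuration conf = new Configuration();
            // "dfs.capability.expiry.seconds" is a hypothetical property name.
            long expirySecs = conf.getLong("dfs.capability.expiry.seconds", 3600);
            System.out.println("Capabilities expire after " + expirySecs + " s");
        }
    }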

Revocation of Capabilities contd…

The client has to get a renewal ticket issued by the Namenode and present it to the Datanode with every request after the capabilities expire. If the renewal ticket is not presented, the Datanode will deny the request.

Revocation of capabilities is also done actively by the Namenode, by sending a message to the Datanodes to deny the particular capabilities.
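Putting the expiry and active-revocation checks together, a Datanode-side gate might look like this sketch (all names, and the encoding matching the signing sketch above, are illustrative):

    import java.nio.charset.StandardCharsets;
    import java.security.PublicKey;
    import java.security.Signature;
    import java.util.Set;

    public class CapabilityChecker {
        static boolean allowRead(String capability, long blockId, long issuedTs,
                                 long expirySecs, byte[] sign,
                                 Set<Long> revokedBlocks, PublicKey namenodePub)
                throws Exception {
            long now = System.currentTimeMillis() / 1000;
            if (now - issuedTs > expirySecs) {
                return false; // expired: the client must present a renewal ticket
            }
            if (revokedBlocks.contains(blockId)) {
                return false; // actively revoked by a message from the Namenode
            }
            // Verify the Namenode's signature over the capability and block id.
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(namenodePub);
            verifier.update((capability + "|" + blockId)
                    .getBytes(StandardCharsets.UTF_8));
            return verifier.verify(sign);
        }
    }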

Difficulties faced

Integrating the Kerberos protocol with the HDFS framework is quite a task!

A more efficient design for granting and revoking capabilities is needed.

Conclusion

The overhead introduced by capabilities is low, and it ensures that only clients which have been issued capabilities by the Namenode can access the data.

However, for a file smaller than 64 MB the overhead remains the same as for a single full block, so for smaller files the relative overhead would be substantial.

Although the performance overhead at the Datanode isn't significant for 64 MB or larger block sizes, it can be reduced further by caching the capabilities for each block.

Thank You for your time!