
Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop


DESCRIPTION

Presented at HadoopDay, Seattle, 2010.


Page 1: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Up-Armoring the Elephant: Secure Hadoop is Here

Jakob Homan

[email protected]

Page 2: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Jakob Homan

• HDFS full-time @Y!

• Apache Hadoop committer

• Past six months: security!

• @blueboxtraveler

Who I am

8/14/2010

Page 3: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Using Hadoop at Yahoo!

38,000+ nodes

170 PB worth of storage

More than 1,000,000 MR jobs monthly

Almost every product uses Hadoop in some way


Page 4: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

As of 2009, 72% of patches going into the Hadoop source code were coming from Yahoo!

Developing Hadoop at Yahoo!


Page 5: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Yahoo! provides extensive QE and QA resources to test Hadoop releases at scale.

Developing Hadoop at Yahoo!


Page 6: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Developing Hadoop at Yahoo!


The Yahoo! distribution of Hadoop, available on GitHub, is the same code we run internally on our servers.

Patches important to stability and performance are applied here, as well as to Apache.

Page 7: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Developing Hadoop at Yahoo!


The rest of the family

Page 8: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Hadoop at Yahoo! Sunnyvale


Page 9: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Why do we need a secure Hadoop?

Silos don’t cut it anymore

• Different clusters for different data is not a workable solution

• Costs of operating clusters and moving data are too high

More sensitive data

• Personally Identifiable Information

• Financial data

• Regulatory requirements


Page 10: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Current state of security in Hadoop

Less than ideal

File system

• POSIX-style permissions

• Audit logging available

Authentication

• Do we really know who we’re talking to?

• Both users and services

Authorization

• Who can see files, launch jobs?

• File system permissions help with this

Page 11: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Current state of security in Hadoop


Bowser copyright Nintendo

Page 12: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

The elephant is too trusting


Page 13: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Which can let bad people do bad things


Page 14: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Why is securing Hadoop hard?


Page 15: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Industry-standard network authentication protocol.

Open-source project from MIT.

Acts as a trusted third party to identify and authenticate components in a Hadoop cluster.

It’s out there: Microsoft’s Active Directory can act as a KDC.

Enter Kerberos!


Page 16: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Kerberos workflow

User or service authenticates to KDC

• Users use kinit, can be automatic upon login

• Services use keytabs

KDC provides a ticket-granting-ticket (TGT)

• This verifies identity to other actors in system

• TGTs last for 10 hours, renewable for up to 7 days

User or service presents a service ticket, obtained with the TGT, to the NameNode (NN) or JobTracker (JT)
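The lifetimes quoted on this slide (10-hour tickets, renewable for up to 7 days) imply some simple date arithmetic. The following is a minimal sketch, assuming the usual Kerberos rule that a renewal must happen before the current ticket expires and can never extend past the renewable window. The function names are hypothetical and are not part of Kerberos or Hadoop.

```python
from datetime import datetime, timedelta

# Values taken from the slide; real deployments set these in KDC policy.
TICKET_LIFETIME = timedelta(hours=10)
RENEWABLE_LIFETIME = timedelta(days=7)


def ticket_windows(issued_at):
    """Return (expires_at, renew_until) for a TGT issued at `issued_at`."""
    return issued_at + TICKET_LIFETIME, issued_at + RENEWABLE_LIFETIME


def can_renew(issued_at, expires_at, now):
    """A ticket is renewable only while still valid and inside the 7-day window."""
    return now < expires_at and now < issued_at + RENEWABLE_LIFETIME


def renewed_expiry(issued_at, now):
    """New expiry after a renewal at `now`, capped at the end of the window."""
    return min(now + TICKET_LIFETIME, issued_at + RENEWABLE_LIFETIME)


issued = datetime(2010, 8, 14, 9, 0)
expires, renew_until = ticket_windows(issued)
print(expires)      # 2010-08-14 19:00:00
print(renew_until)  # 2010-08-21 09:00:00
```

A long-running service would need to renew before each 10-hour expiry, and would need a fresh login (for example from its keytab) once the 7-day window closes.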


Page 17: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

RPC upgraded to use SASL/GSSAPI

Hadoop RPC

• Hadoop has own RPC framework

• Lots of players:

• Namenode

• Datanodes

• Clients

• JobTracker

• TaskTrackers

Simple Authentication and Security Layer (SASL)

• RFC 2222: standard for lightweight authentication between clients and servers

• Works with GSSAPI to support Kerberos as an authentication method

• Delegation tokens are also supported

Delegation Tokens

• DIGEST-MD5-based identifiers generated by Namenode

• Alleviate load on Kerberos server when 10,000s of tasks launch simultaneously

• Used to support cross-cluster authentication
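To illustrate why delegation tokens take load off the Kerberos server, here is a toy sketch of the idea: the issuer derives a token password from a token identifier and a secret master key, so it can later verify a token by recomputation, with no KDC round trip. The class name, the identifier layout, and the choice of HMAC-SHA1 are assumptions made for illustration only. In the real protocol the password serves as the shared secret of the DIGEST-MD5 exchange rather than being sent and compared directly as it is here.

```python
import hashlib
import hmac
import os


class ToyNameNode:
    """Illustrative token issuer; not the real NameNode implementation."""

    def __init__(self):
        self._master_key = os.urandom(20)  # secret known only to the issuer
        self._sequence = 0

    def issue_token(self, owner, renewer):
        """Return (identifier, password) for a new delegation token."""
        self._sequence += 1
        identifier = f"{owner}|{renewer}|{self._sequence}".encode()
        password = hmac.new(self._master_key, identifier, hashlib.sha1).digest()
        return identifier, password

    def verify(self, identifier, password):
        """Recompute the password from the identifier; no KDC round trip needed."""
        expected = hmac.new(self._master_key, identifier, hashlib.sha1).digest()
        return hmac.compare_digest(expected, password)


namenode = ToyNameNode()
identifier, password = namenode.issue_token("alice", "jobtracker")
print(namenode.verify(identifier, password))         # True
print(namenode.verify(identifier + b"x", password))  # False
```

Because verification needs only the master key, thousands of tasks can authenticate with the same token without each contacting the KDC.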


Page 18: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

What does a secure Hadoop look like?


Page 19: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Like this


Page 20: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Everyone now authenticated

Users browsing filesystem on command line

Users submitting jobs on command line

Servers within system

• DataNodes ↔ NameNode

• TaskTrackers ↔ JobTracker

• SNN ↔ NameNode

Oozie

• Submits jobs on behalf of users

• Configurable proxy user


Page 21: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Additional security throughout system

On-disk directory permissions are 700

• MapReduce system directory

• Task directory

• On-node HDFS directories

Tasks run as user who launched them

• Linux Task Controller now runs tasks as the user who owns the job

DataNodes’ ports secured

• Use privileged ports for non-RPC calls

• Working on making this pluggable for other types of solutions

Streaming secured

• Streaming tasks can verify the identity of the TaskTracker and vice versa
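Two of the measures on this slide, 700 directory permissions and privileged ports, can be sketched with the standard library alone. This is a minimal sketch for a POSIX system. The helper names and the example port numbers are assumptions for illustration and are not taken from Hadoop's code.

```python
import os
import stat
import tempfile


def make_private_dir(parent, name):
    """Create a directory readable, writable and searchable by its owner only."""
    path = os.path.join(parent, name)
    os.mkdir(path)
    os.chmod(path, 0o700)  # explicit chmod, since mkdir's mode is filtered by umask
    return path


def is_owner_only(path):
    """True if the permission bits are exactly 700."""
    return stat.S_IMODE(os.stat(path).st_mode) == 0o700


def is_privileged_port(port):
    """On Unix-like systems, ports below 1024 can normally be bound only by root."""
    return 0 < port < 1024


with tempfile.TemporaryDirectory() as parent:
    task_dir = make_private_dir(parent, "task-dir")
    print(is_owner_only(task_dir))

# Example port numbers only; the actual ports are a deployment choice.
print(is_privileged_port(1004), is_privileged_port(50010))
```

Binding to a privileged port is a weak form of proof that the process was started by root on that host, which is why it helps for channels that do not go through the authenticated RPC layer.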


Page 22: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

How do I write a secure MapReduce job?


Word count pre-security

Page 23: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Word count post-security

This is how

No changes!

Page 24: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

UserGroupInformation.java

• Completely re-written – nexus of authentication code

• Really should never have been public

New type of DistributedCache

• Public is available to all users

• Private is secured for only submitting user

Significant user-facing changes


Page 25: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Authenticating users for web access is pluggable

• Yahoo! has an internal web authentication system; other organizations do as well.

Would really like to have a SPNEGO implementation

• Any volunteers?

Until then, the Doctor has returned

• Simple plugin returns DrWho for web access
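The pluggable design can be pictured as a small interface that the web layer calls without knowing which implementation is installed. The sketch below is hypothetical throughout: the class names, the `authenticate` method, and the `X-Authenticated-User` header are invented for illustration and are not Hadoop's actual plugin API. Only the default user name DrWho comes from the slide.

```python
class StaticUserAuthenticator:
    """Placeholder plugin: treats every web request as the same fixed user."""

    def __init__(self, user="DrWho"):
        self._user = user

    def authenticate(self, request_headers):
        return self._user


class HeaderAuthenticator:
    """Hypothetical plugin: trusts a user name set by an upstream auth proxy."""

    def __init__(self, header="X-Authenticated-User"):
        self._header = header

    def authenticate(self, request_headers):
        return request_headers.get(self._header)


def current_user(plugin, request_headers):
    """The web layer depends only on the plugin's `authenticate` method."""
    user = plugin.authenticate(request_headers)
    if user is None:
        raise PermissionError("unauthenticated web request")
    return user


print(current_user(StaticUserAuthenticator(), {}))  # DrWho
print(current_user(HeaderAuthenticator(), {"X-Authenticated-User": "alice"}))  # alice
```

Swapping in an organization-specific system, or a future SPNEGO implementation, would then mean writing one more class with the same method.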

Secure web access is pluggable


Page 26: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

DistCP works… in 3 out of 4 cases

                    Destination cluster
Source cluster      Unsecure 20    Secure 20
Unsecure 20         ✔              ✗
Secure 20           ✔              ✔
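Reading the rows as the source cluster and the columns as the destination cluster, the matrix can be written as a small lookup. This is only a restatement of the slide's table under that reading. The function name is hypothetical and is not part of DistCp.

```python
def distcp_supported(source_secure, destination_secure):
    """Return True if the slide's matrix marks this source/destination pair as working."""
    matrix = {
        (False, False): True,   # unsecure -> unsecure
        (False, True): False,   # unsecure -> secure: the one unsupported case
        (True, False): True,    # secure -> unsecure
        (True, True): True,     # secure -> secure
    }
    return matrix[(source_secure, destination_secure)]


for source in (False, True):
    for destination in (False, True):
        print(source, destination, distcp_supported(source, destination))
```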


Page 27: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Out of scope

On-disk encryption

• DataNode directories’ permissions more locked down

• Actual block files and metadata not encrypted

On-the-wire encryption

• RPC and HTTP data transfer sent in the clear

• Assumption that network is secure


Page 28: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Impact on performance

4%: maximum performance degradation allowed by our performance team. We met or bested this requirement.


Page 29: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Download from http://yhoo.it/aVAke1

Take security for a test drive


Page 30: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Gory details at http://bit.ly/aze3Ba

Or build a secure cluster at home


Page 31: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Other projects and security

Pig

• We’ve worked with the Pig team.

• Pig 6 and 7 support security

HBase

• Work in progress

• JIRA: HBASE-2016

Oozie

• Extensive collaboration.

• Oozie 2 supports security

Hive

• Early work in progress

• JIRA: HIVE-1264


Page 32: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Yahoo!’s distribution

• Security deployed to all clusters.

• The rest soon.

• All patches in Yahoo!’s Git repository at: http://github.com/yahoo/hadoop-common

• Committed to open-sourcing all improvements and bug fixes to Y20S.

Current state


Page 33: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Apache Distribution

• All of the security work has been forward-ported to trunk

• Still working on securing new-to-trunk features

• Hadoop 0.22 will be the first fully secured Apache release

Current state


Page 34: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Security list

[email protected]

Send security holes to this email list

Two security issues have already been identified; fixes are in flight

Page 35: Up-Armoring The Elephant: Adding Kerberos-based Security to Hadoop

Questions?
