
Improvements in Hadoop Security


Page 1: Improvements in Hadoop Security

© Hortonworks Inc. 2011

Improvements in Hadoop Security

Sanjay Radia

[email protected]

@srr

Chris Nauroth

[email protected]

@cnauroth


Page 2: Improvements in Hadoop Security


Hello

Sanjay Radia

• Founder, Hortonworks

• Part of the Hadoop team at Yahoo! since 2007

– Chief Architect of Hadoop Core at Yahoo!

– Long-time Apache Hadoop PMC and Committer

– Designed and developed several key Hadoop features

• Prior

– Data center automation, virtualization, Java, HA, OSs, File Systems (Startup, Sun Microsystems, …)

– Ph.D., University of Waterloo

Chris Nauroth

• Member of Technical Staff, Hortonworks

– Apache Hadoop Committer

– Major contributor to HDFS ACLs

• Hadoop user since 2010

– Prior employment experience deploying, maintaining and using Hadoop clusters


Page 3: Improvements in Hadoop Security


Overview

• Models of Deployment

– Secure and insecure

• Hadoop Authentication

– The how and why

– Knox – perimeter security

• Authorization – existing and what is new

– HDFS

– Tables and Hive

– HBase and Accumulo

• Data protection and encryption

– Wire

– Data at rest


Page 4: Improvements in Hadoop Security


Two Reasons for Security in Hadoop

1. Hadoop Contains Sensitive Data

– As Hadoop adoption grows, so too have the types of data that organizations look to store. Often the data is proprietary or personal, and it must be protected.

– In this context, Hadoop is governed by the same security requirements as any data center platform.

2. Hadoop Is Subject to Compliance Requirements

– Organizations are often required to comply with regulations such as HIPAA, PCI DSS, and FISMA that mandate protection of personal information.

– Adherence to other corporate security policies is also required.

Page 5: Improvements in Hadoop Security


Three Models of Hadoop Deployment

• Insecure cluster

– Protection is via the perimeter

– You trust the code that runs in the system

– Note: in a Hadoop cluster, user-submitted code runs inside the cluster

– (Not true in typical client-server applications)

– The client-side libraries pass the client’s login credential

– There is no end-to-end authentication here

– Authorization is done against these credentials

• Secure cluster

– Full authentication

– Can run arbitrary code in jobs

• Perimeter security using Knox

– Internal cluster can be secure or insecure depending on your needs

Page 6: Improvements in Hadoop Security


Pillars of Hadoop Security

• Authentication – Who am I / prove it? Controls access to the cluster. AD/Kerberos in native Apache Hadoop; perimeter security with the Apache Knox Gateway.

• Authorization – Restrict access to explicit data

• Audit – Understand who did what

• Data Protection – Encrypt data at rest & in motion

Page 7: Improvements in Hadoop Security


Hadoop Authentication Overview

• Kerberos/Active Directory based security

– SSO – users do not have to re-login into Hadoop

– Hadoop accounts do not have to be recreated

– Caveat – MR task isolation currently requires a Unix account for each user, but this is going away with Linux containers

– Hadoop tokens – supplement the Kerberos authentication

– Delegation tokens – deal with the delayed job execution

– Block tokens – capabilities to deal with the distributed nature of HDFS

– Trusted Proxies – support for third party services to act as proxy

– Oozie

– Gateways – HDFS proxy, Knox, etc

• The base security infrastructure is being extended to support other authentication services besides Kerberos

• Knox – Perimeter Security and REST Gateway


Page 8: Improvements in Hadoop Security


User DB/Account

Single sign-on using your organization’s existing user DB

• No Hadoop user accounts need to be created

– Note: the OS accounts on compute nodes are for isolation – this will be removed when we use Linux containers

• Currently supports Kerberos (and LDAP indirectly)

• The RPC layer’s authentication is fairly general

– Supports Kerberos and Hadoop’s tokens

– Can be extended

Page 9: Improvements in Hadoop Security


Why the tokens

• Why does Hadoop have its own tokens?

– Standard client-server security model is not sufficient for Hadoop

– Works when logged-in client is directly accessing a Hadoop service

– But for a job, the execution happens much later

– The job submitter has long logged off

• Hence we needed to add delegation tokens

• HDFS is a distributed service, and we needed to add capability-like tokens for DataNode authentication

– The permissions/ACLs are in the NameNode
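
To make the delegation-token flow concrete, here is a minimal Java sketch of obtaining an HDFS delegation token through the FileSystem API on a secure (Kerberos) cluster; the "yarn" renewer principal is an assumption for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.security.token.Token;

    public class DelegationTokenExample {
        public static void main(String[] args) throws Exception {
            // Runs while the submitter's Kerberos TGT is still valid.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Ask the NameNode for a delegation token; the job carries this
            // token so it can authenticate later, after the user logs off.
            // "yarn" is a hypothetical renewer principal for illustration.
            Token<?> token = fs.getDelegationToken("yarn");
            System.out.println("Token kind: " + token.getKind());
        }
    }

Frameworks like MapReduce do this on the user’s behalf at job submission time and ship the tokens with the job.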

Page 10: Improvements in Hadoop Security


Apache Knox: Perimeter Security for Hadoop REST APIs


Page 11: Improvements in Hadoop Security


The Gateway or Edge Node

• Hadoop APIs can be used from any desktop after SSO login

– FileSystem and MapReduce Java APIs

– Pig, Hive and Oozie clients (that wrap the Java APIs)

• However, it is typical to use an “Edge Node” or “Gateway Node” that is “inside” the cluster

– The libraries for the APIs are generally only installed on the gateway

– Users SSH to Edge Node and execute API commands from shell


[Diagram: a Hadoop user connects over SSH to an Edge Node inside the cluster]

Page 12: Improvements in Hadoop Security


Perimeter Security with Apache Knox

Incubated and led by Hortonworks, Apache Knox provides a simple and open framework for Hadoop perimeter security.

Single, simple point of access for a cluster:

• Single Hadoop access point

• REST API hierarchy

• Consolidated API calls

• Multi-cluster support

• Eliminates the SSH “edge node”

Central controls ensure consistency across one or more clusters:

• Central API management

• Central audit control

• Simple service-level authorization

Integrated with existing systems to simplify identity maintenance:

• SSO integration – SiteMinder, API Key*, OAuth* & SAML*

• LDAP & AD integration

Page 13: Improvements in Hadoop Security


Hadoop REST APIs

• Useful for connecting to Hadoop from outside the cluster

• When more client language flexibility is required

– e.g., when a Java binding is not an option

• Challenges (Knox addresses these challenges)

– Client must have knowledge of cluster topology

– Ports must be opened to outside the cluster (in some cases, on every host)


• WebHDFS – Supports HDFS user operations including reading files, writing to files, making directories, changing permissions and renaming.

• WebHCat – Job control for MapReduce, Pig and Hive jobs, and HCatalog DDL commands.

• Hive – Hive REST API operations.

• HBase – HBase REST API operations.

• Oozie – Job submission and management, and Oozie administration.
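
To illustrate the topology challenge, here is a minimal Java sketch that reads a file through the WebHDFS REST API; the host name is hypothetical, and note that the OPEN operation redirects the client to a DataNode – exactly the kind of cluster knowledge Knox hides behind a single gateway:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsReadExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical NameNode host; on an insecure cluster the caller
            // is asserted with the user.name query parameter.
            URL url = new URL("http://namenode.example.com:50070"
                + "/webhdfs/v1/sales-data?op=OPEN&user.name=maya");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // WebHDFS answers OPEN with a redirect to a DataNode;
            // HttpURLConnection follows the redirect automatically.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }

With Knox in place, the same call would target the gateway URL instead, and the client never learns which DataNode served the data.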

Page 14: Improvements in Hadoop Security


What can be done today?

[Pillars: Authentication – Who am I / prove it? Control access to cluster; Authorization – Restrict access to explicit data; Audit – Understand who did what; Data Protection – Encrypt data at rest & in motion]

Previously:

• All Services: Service-level ACLs

• HDFS: Permissions

• YARN: Queue ACLs

• Hive/Pig Tables: Table level via HDFS

• Apache Accumulo: Cell level

• HBase: Namespace, Table, Column Family and Column-level ACLs

New in Hadoop 2.x:

• HDFS: ACLs

• Hive: Column-level ACLs

• HBase: Cell-level ACLs

• Knox: REST service-level authorization; access audit with Knox

Page 15: Improvements in Hadoop Security


HDFS ACLs

• The existing HDFS POSIX permissions are good, but not flexible enough

– Permission requirements may differ from the natural organizational hierarchy of users and groups

• HDFS ACLs augment the existing HDFS POSIX permissions model by implementing the POSIX ACL model

– An ACL (Access Control List) provides a way to set different permissions for specific named users or named groups, not only the file’s owner and the file’s group


Page 16: Improvements in Hadoop Security


HDFS File Permissions Example

• Authorization requirements:

– In a sales department, a single user, Maya (the Department Manager), should control all modifications to sales data

– Other members of the sales department need to view the data, but can’t modify it

– Everyone else in the company must not be allowed to view the data

• Can be implemented via the following:

[Diagram: a file with sales data; read/write permission for user maya; read permission for group sales]
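
A minimal Java sketch of these two settings through the FileSystem API, equivalent to chown/chmod from the HDFS shell (the /sales-data path follows the slide’s example; setOwner requires HDFS superuser privilege):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class SalesDataPermissions {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path salesData = new Path("/sales-data");
            // Make maya the owner and sales the group (requires superuser).
            fs.setOwner(salesData, "maya", "sales");
            // rw- for the owner, r-- for the group, nothing for others (640).
            fs.setPermission(salesData, new FsPermission((short) 0640));
        }
    }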

Page 17: Improvements in Hadoop Security


HDFS ACLs

• Problem

– No longer feasible for Maya to control all modifications to the file

– New requirement: Maya, Diane and Clark are allowed to make modifications

– New requirement: a new group called executives should be able to read the sales data

– The current permissions model only allows permissions for 1 group and 1 user

• Solution: HDFS ACLs

– Now assign different permissions to different users and groups

[Diagram: an HDFS directory with rwx permission entries for Owner, Group and Others, plus additional ACL entries for Group D, Group F and User Y]

Page 18: Improvements in Hadoop Security


HDFS ACLs

New Tools for ACL Management (setfacl, getfacl)

– hdfs dfs -setfacl -m group:execs:r-- /sales-data

– hdfs dfs -getfacl /sales-data
  # file: /sales-data
  # owner: maya
  # group: sales
  user::rw-
  group::r--
  group:execs:r--
  mask::r--
  other::---

– How do you know if a file has ACLs set? The “+” at the end of the permissions string indicates an ACL:

– hdfs dfs -ls /sales-data
  Found 1 items
  -rw-r-----+  3 maya sales  0 2014-03-04 16:31 /sales-data
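
The same change can be made programmatically; a minimal sketch against the FileSystem ACL API added in Hadoop 2.4 (the path and group name follow the slide’s example):

    import java.util.Collections;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.AclEntry;
    import org.apache.hadoop.fs.permission.AclEntryScope;
    import org.apache.hadoop.fs.permission.AclEntryType;
    import org.apache.hadoop.fs.permission.FsAction;

    public class AclExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path salesData = new Path("/sales-data");
            // Equivalent of: hdfs dfs -setfacl -m group:execs:r-- /sales-data
            AclEntry entry = new AclEntry.Builder()
                .setScope(AclEntryScope.ACCESS)
                .setType(AclEntryType.GROUP)
                .setName("execs")
                .setPermission(FsAction.READ)
                .build();
            fs.modifyAclEntries(salesData, Collections.singletonList(entry));
            // Equivalent of: hdfs dfs -getfacl /sales-data
            System.out.println(fs.getAclStatus(salesData));
        }
    }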

Page 19: Improvements in Hadoop Security


HDFS ACLs

Default ACLs

– hdfs dfs -setfacl -m default:group:execs:r-x /monthly-sales-data

– hdfs dfs -mkdir /monthly-sales-data/JAN

– hdfs dfs -getfacl /monthly-sales-data/JAN
  # file: /monthly-sales-data/JAN
  # owner: maya
  # group: sales
  user::rwx
  group::r-x
  group:execs:r-x
  mask::r-x
  other::---
  default:user::rwx
  default:group::r-x
  default:group:execs:r-x
  default:mask::r-x
  default:other::---

– The new sub-directory automatically inherits the parent’s default ACL entries.

Page 20: Improvements in Hadoop Security


HDFS ACLs Best Practices

• Start with traditional HDFS permissions to implement most permission requirements.

• Define a smaller number of ACLs to handle exceptional cases.

• A file with an ACL incurs an additional memory cost in the NameNode compared to a file that has only traditional permissions.


Page 21: Improvements in Hadoop Security


Tables and Hive


Page 22: Improvements in Hadoop Security


Table ACLs – The Challenge and Solution

• Hive and Pig have traditionally offered full-table access control via HDFS access control

• The challenge in column-level access control

– Hive and Pig queries are executed as Tez-based tasks that access the HDFS files directly

– HDFS does not have knowledge of columns (there are several file/table formats)

• Solution for Column level ACLs

– Let the Hive server check and submit the query execution

– Let the table be accessible only by special user (“HiveServer”)

– But one has to restrict the UDFs and file formats

– Good news: Hive provides an authorization plugin to do this cleanly

• Use standard SQL permission constructs

– GRANT/REVOKE

• Store the ACLs in the Hive Metastore instead of an external DB

• But what about Pig? There is no Pig server …


Page 23: Improvements in Hadoop Security


Hive ATZ-NG – Architecture

[Architecture diagram: O/JDBC and Beeline CLI clients (1. Authentication) connect to HiveServer2, where ATZ-NG performs authorization (2. Authorization) before queries reach the Metastore and HDFS. Hive CLI, Pig, HCat, Oozie, Hue and Ambari continue to go through the Storage Based Authorization Provider. UDFs and columns are protected at HiveServer2, tables at the HDFS level; direct access to the Metastore is restricted, and HDFS is protected with Kerberos & HDFS ACLs. (0. Enable Hive ATZ-NG)]

• ATZ-NG is called for O/JDBC & Beeline CLI

• Standard SQL GRANT / REVOKE for management

• The privilege to register UDFs is restricted to the Admin user

• Policy is integrated with the Table/View life cycle

Page 24: Improvements in Hadoop Security


What about MR/Pig

• Note there is no Pig/MR server to submit and check column ACLs

• Hence, in the same cluster running Hive:

– You cannot give Pig similar access control

– If Pig/MR is important:

– Use coarse-grained table-level access control

– Or run Pig/MR as privileged users with full table-level access


Page 25: Improvements in Hadoop Security


Hive ATZ-NG Example


Page 26: Improvements in Hadoop Security


Scenario

• Objective: Share Product Management Roadmap securely

• Actors:

– Admin Role – specified in hive-site.xml

– The Admin role controls role memberships

– Product Management Role

– Should be able to create and read all roadmap details

– Members: Vinay Shukla, Tim Hall

– Engineering Role

– Should be able to read (see) all roadmap details

– Members: Kevin Minder, Larry McCay


Page 27: Improvements in Hadoop Security


Step 1: Admin role Creates Roles, Adds Users

1. CREATE ROLE PM;

2. CREATE ROLE ENG;

3. GRANT ROLE PM TO USER timhall WITH ADMIN OPTION;

4. GRANT ROLE PM TO USER vinayshukla;

5. GRANT ROLE ENG TO USER kevinminder WITH ADMIN OPTION;

6. GRANT ROLE ENG TO USER larrymccay;

Page 28: Improvements in Hadoop Security


Step 2: Super-user Creates Tables/Views

create table hdp_hadoop_plans (

id int,

hadoop_roadmap string,

hdp_roadmap string

);

Page 29: Improvements in Hadoop Security


Step 3: Users or Roles Assigned To Tables

1. GRANT ALL ON hdp_hadoop_plans TO ROLE PM;

2. GRANT SELECT ON hdp_hadoop_plans TO ROLE ENG;

Page 30: Improvements in Hadoop Security


HBase Cell Level Authorization

• The HBase permissions model already supports ACLs defined at the namespace, table, column family and column level.

– This is sufficient to meet many requirements

– It can be insufficient if a data model requires protection on individual rows/cells

– Example: with medical data, where each row represents a patient, it may be necessary to customize who can see an individual patient’s data, and the social security number in each row may need further restriction


Page 31: Improvements in Hadoop Security


HBase Cell Level Authorization

• Cell-level authorization augments the permissions model by allowing ACLs to be specified on individual cells (sketched below).

– ACLs are now supported at the individual cell level

– Individual operations may choose the order of evaluation: cell-level ACLs may be evaluated last or first

– Evaluating last is useful if the common case is access granted through table or column family ACLs, and cell-level ACLs define exceptions for denial

– Evaluating first is useful if many users are granted access through cell-level ACLs
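
A minimal sketch of attaching a cell-level ACL to a single Put, assuming the HBase 1.x client API; the table, row, column and user names are hypothetical:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.security.access.Permission;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CellAclExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("patients"))) {
                Put put = new Put(Bytes.toBytes("patient-123"));
                put.addColumn(Bytes.toBytes("pii"), Bytes.toBytes("ssn"),
                    Bytes.toBytes("123-45-6789"));
                // Grant read access on just this cell to the user "drsmith".
                put.setACL("drsmith", new Permission(Permission.Action.READ));
                table.put(put);
            }
        }
    }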


Page 32: Improvements in Hadoop Security


HBase Cell Level Authorization

• Visibility labels (example sketched below)

– Visibility expressions can be stored as metadata in a cell’s tag

– A visibility expression consists of labels combined with boolean operators

– E.g. (financial | strategy | research) & !newhire

– This means that a user must be labeled financial, strategy or research, and must not be a newhire, in order to see the cell

– The mapping of users to their labels is pluggable; by default, a user’s labels are specified as authorizations in the individual operation

– HBase visibility labels were inspired by similar features in Apache Accumulo, and the model will look very familiar to Accumulo users
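
A minimal sketch of writing a cell with the visibility expression above and reading it back with matching authorizations, again assuming the HBase 1.x client API and hypothetical table and column names:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.security.visibility.Authorizations;
    import org.apache.hadoop.hbase.security.visibility.CellVisibility;
    import org.apache.hadoop.hbase.util.Bytes;

    public class VisibilityLabelExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("plans"))) {
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"),
                    Bytes.toBytes("secret value"));
                // Store the slide's visibility expression as cell metadata.
                put.setCellVisibility(
                    new CellVisibility("(financial|strategy|research)&!newhire"));
                table.put(put);

                // The reader passes its labels as authorizations on the Get.
                Get get = new Get(Bytes.toBytes("row1"));
                get.setAuthorizations(new Authorizations("financial"));
                Result result = table.get(get);
                System.out.println(result);
            }
        }
    }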


Page 33: Improvements in Hadoop Security


What can be done today?

[Pillars: Authentication – Who am I / prove it? Control access to cluster; Authorization – Restrict access to explicit data; Audit – Understand who did what; Data Protection – Encrypt data at rest & in motion]

Wire encryption:

• In native Hadoop

• With Knox

• SSL for REST (2.x)

File encryption:

• Via MR file format

• 3rd-party encryption tools for column-level encryption

• Native HDFS support coming

Page 34: Improvements in Hadoop Security


Wire Encryption – for data in motion


• Hadoop client to DataNode is via the Data Transfer Protocol

– HDFS clients read/write to the HDFS service over an encrypted channel

– Configurable encryption strength

• ODBC/JDBC client to HiveServer2

– Encryption is via SASL Quality of Protection

• Map to Reduce via shuffle

– Shuffle is over HTTP(S)

– Supports mutual authentication via SSL

– Host name verification enabled

• REST protocols

– SSL support (see the configuration sketch below)
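
As a hedged sketch, these are some of the configuration properties involved. The property names are real Hadoop/Hive settings, but the values are illustrative, and they are shown being set programmatically here rather than in the usual *-site.xml files:

    import org.apache.hadoop.conf.Configuration;

    public class WireEncryptionConfigExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Encrypt the HDFS Data Transfer Protocol (hdfs-site.xml).
            conf.set("dfs.encrypt.data.transfer", "true");
            conf.set("dfs.encrypt.data.transfer.algorithm", "3des"); // or "rc4"
            // SASL Quality of Protection for Hadoop RPC:
            // "privacy" means authentication, integrity and encryption.
            conf.set("hadoop.rpc.protection", "privacy");
            // HiveServer2 Thrift SASL QoP (hive-site.xml): "auth-conf"
            // enables encryption of the ODBC/JDBC channel.
            conf.set("hive.server2.thrift.sasl.qop", "auth-conf");
            // Encrypt the MapReduce shuffle over HTTPS (mapred-site.xml).
            conf.set("mapreduce.shuffle.ssl.enabled", "true");
            System.out.println("Sample wire-encryption settings prepared.");
        }
    }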

Page 35: Improvements in Hadoop Security


Data at Rest

• Coming: an HDFS encrypted file system is currently under development in Apache

– https://issues.apache.org/jira/browse/HADOOP-10150

– https://issues.apache.org/jira/browse/HDFS-6134


Page 36: Improvements in Hadoop Security


XA Secure: A Major Step Forward in Hadoop Security

See Shaun Connolly’s keynote on Wednesday, June 4


Page 37: Improvements in Hadoop Security


Security in Hadoop with HDP 2.1 + XA Secure

[Diagram: the four security pillars, with HDP 2.1 capabilities alongside the XA Secure additions and centralized security administration spanning both]

Authentication – Who am I / prove it?

• HDP 2.1: Kerberos in native Apache Hadoop; HTTP/REST API secured with the Apache Knox Gateway

• XA Secure: works as-is with current authentication methods

Authorization – Restrict access to explicit data

• HDP 2.1: MapReduce Access Control Lists; HDFS Permissions and HDFS ACLs; Hive ATZ-NG; cell-level access control in Apache Accumulo

• XA Secure: fine-grained access control for HDFS, Hive and HBase; RBAC

Audit – Understand who did what

• HDP 2.1: audit logs in HDFS & MR

• XA Secure: centralized audit reporting; policy and access history

Data Protection – Encrypt data at rest & in motion

• HDP 2.1: wire encryption in Hadoop; orchestrated encryption with 3rd-party tools

• XA Secure: future roadmap; strategy to be finalized

Centralized Security Administration with XA Secure

Page 38: Improvements in Hadoop Security


Open Source?

• Yes, the XASecure technology will be open sourced

– Not just under an Apache license where you are forced to get the latest from Hortonworks

– But as a full-fledged Apache project that is truly open to the community of developers and users

• See Shaun Connolly’s keynote on Wednesday for details


Page 39: Improvements in Hadoop Security


Summary

• Very strong authentication via Kerberos and Active Directory

– Uses your organization’s user DB and integrates with its group and role membership

– Supplemented by Hadoop tokens

– Note: these are necessary due to delayed job execution after the user logs off

• Strong fine-grained authorization with some recent improvements

– HDFS ACLs

– Hive – integrated via the SQL model and the Hive Metastore

– Not a hacked-on side add-on

– HBase cell-level authorization

• Strong encryption support

– Wire

– Data

– Some improvements coming soon

• Every product has audit logs

• XASecure adds a major step forward

– Yes, it will be open sourced as an Apache project


Page 40: Improvements in Hadoop Security


Thank you, Q&A


Resources

• Hortonworks Security Labs – http://hortonworks.com/labs/security/

• Apache Knox Project Page – http://knox.incubator.apache.org/

• HDFS ACLs Blog Post – http://hortonworks.com/blog/hdfs-acls-fine-grained-permissions-hdfs-files-hadoop/

• Encrypted File System Development – https://issues.apache.org/jira/browse/HADOOP-10150 and https://issues.apache.org/jira/browse/HDFS-6134

• HBase Cell Level Authorization – https://blogs.apache.org/hbase/entry/hbase_cell_security