Big Data Security
Joey Echeverria | Principal Solutions Architect [email protected] | @fwiffo
©2013 Cloudera, Inc.
Big Data Security
EARLY DAYS
Hadoop File Permissions
• Added in HADOOP-1298
• Hadoop 0.16
• Early 2008
• Authorization without authentication
• POSIX-like RWX bits
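The POSIX-like check can be sketched as follows. This is a minimal illustration, not HDFS code: the function name `is_allowed` and its parameters are hypothetical, and the user name is taken on trust from the caller, which is what authorization without authentication amounts to.

```python
def is_allowed(mode, owner, group, user, user_groups, want):
    """Return True if `user` may perform `want` ('r', 'w' or 'x').

    `mode` is an octal permission value such as 0o640. The caller-supplied
    `user` is trusted as-is (no authentication), which is the weakness
    the slide points out.
    """
    wanted_bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == owner:
        class_bits = (mode >> 6) & 7   # owner triplet
    elif group in user_groups:
        class_bits = (mode >> 3) & 7   # group triplet
    else:
        class_bits = mode & 7          # everyone else
    return bool(class_bits & wanted_bit)


if __name__ == "__main__":
    # A file owned by joe:analysts with mode rw-r-----
    print(is_allowed(0o640, "joe", "analysts", "ann", {"analysts"}, "r"))
    print(is_allowed(0o640, "joe", "analysts", "ann", {"analysts"}, "w"))
    # Anyone who simply claims to be "joe" gets the owner's rights.
    print(is_allowed(0o640, "joe", "analysts", "joe", set(), "w"))
```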
MapReduce ACLs
• Added in HADOOP-3698
• Hadoop 0.19
• Late 2008
• ACLs per job queue
• Set a list of allowed users or groups per operation
  • Job submission
  • Job administration
• No authentication
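A per-queue, per-operation ACL lookup might look like the sketch below. The dictionary layout, the queue and user names, and the deny-by-default rule are all assumptions made for illustration; this is not Hadoop's configuration format.

```python
# Hypothetical in-memory ACL table: queue -> operation -> allowed users/groups.
QUEUE_ACLS = {
    "default": {
        "submit": {"users": {"joe", "ann"}, "groups": {"analysts"}},
        "administer": {"users": {"ops"}, "groups": set()},
    },
}


def queue_allows(acls, queue, operation, user, user_groups):
    """Return True if `user` (or one of their groups) is listed for `operation`."""
    acl = acls.get(queue, {}).get(operation)
    if acl is None:
        return False  # assumption: deny when no ACL is configured
    return user in acl["users"] or bool(acl["groups"] & set(user_groups))


if __name__ == "__main__":
    print(queue_allows(QUEUE_ACLS, "default", "submit", "bob", {"analysts"}))
    print(queue_allows(QUEUE_ACLS, "default", "administer", "joe", set()))
```

As with file permissions at this stage, the user and group names are whatever the client reports.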
Securing a Cluster Through a Gateway
• Hadoop cluster runs on a private network
• Gateway server dual-homed (Hadoop network and public network)
• Users SSH onto gateway
• Optionally can create an SSH proxy for jobs to be submitted from the client machine
• Provides minimum level of protection
Big Data Security
WHY SECURITY MATTERS
Prevent Accidental Access
• Don’t let users shoot themselves in the foot
• Main driver for early features
• Not security per se, but a critical first step
• Doesn’t require strong authentication
Stop Malicious Users
• Early features were necessary, but not sufficient
• Security has to get real
• Hadoop runs arbitrary code
• Implicit trust doesn’t prevent the insider threat
Co-mingle All Your Data
• Often overlooked
• Big data means getting rid of stovepipes
• Scalability and flexibility are only 50% of the problem
• Trust your data in a multi-tenant environment
• Most critical driver
Big Data Security
AN EVOLVING STORY
Authorization
• Files
• MapReduce/YARN job queues
• Service-level authorization
• Whitelists and blacklists of hosts and users
Authentication
• HADOOP-4487
• Hadoop 0.22 and 0.20.205
• Late 2010
• Based on Kerberos and internal delegation tokens
• Provides strong user authentication
• Also used for service-to-service authentication
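The general idea behind a delegation token, a credential a service can verify itself without going back to Kerberos, can be sketched with an HMAC. This is a simplified illustration only: the field names, the JSON encoding, and the choice of SHA-256 are assumptions, not Hadoop's actual token format.

```python
import hashlib
import hmac
import json


def issue_token(master_key, owner, renewer, expires_at):
    """Issue an (identifier, password) pair after the user has authenticated.

    The password is an HMAC of the identifier under a key only the issuing
    service knows, so the service can later verify the token on its own.
    """
    identifier = json.dumps(
        {"owner": owner, "renewer": renewer, "expires_at": expires_at},
        sort_keys=True,
    ).encode("utf-8")
    password = hmac.new(master_key, identifier, hashlib.sha256).digest()
    return identifier, password


def verify_token(master_key, identifier, password, now):
    """Return the token owner if the token is genuine and unexpired, else None."""
    expected = hmac.new(master_key, identifier, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, password):
        return None  # forged or tampered token
    fields = json.loads(identifier)
    if now >= fields["expires_at"]:
        return None  # expired
    return fields["owner"]


if __name__ == "__main__":
    key = b"hypothetical-master-key"
    ident, pw = issue_token(key, "joe", "mapred", expires_at=1000)
    print(verify_token(key, ident, pw, now=500))
```

A task holding such a token can act as the user for the token's lifetime without holding the user's Kerberos credentials.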
2.2 High Level Use Cases

1. Applications accessing files on HDFS clusters. Non-MapReduce applications, including hadoop fs, access files stored on one or more HDFS clusters. The application should only be able to access files and services they are authorized to access. See figure 1. Variations:
(a) Access HDFS directly using HDFS protocol.
(b) Access HDFS indirectly through HDFS proxy servers via the HFTP FileSystem or HTTP get.

[Figure 1: HDFS High-level Dataflow. Labeled elements: Application, MapReduce Task, Name Node, Data Node, kerb(joe), kerb(hdfs), delg(joe), block token.]

2. Applications accessing third-party (non-Hadoop) services. Non-MapReduce applications and MapReduce tasks accessing files or operations supported by third party services. An application should only be able to access services they are authorized to access. Examples of third-party services:
(a) Access NFS files
(b) Access ZooKeeper

3. User submitting jobs to MapReduce clusters. A user submits jobs to one or more MapReduce clusters. Jobs can only be submitted to queues the user is authorized to use. The user can disconnect after job submission and may re-connect to get job status. Jobs may need to access files stored on HDFS clusters as the user as described in case 1). The user needs to specify the list of HDFS clusters for a job at job submission. Jobs should be able to access only those HDFS files or third-party services authorized for the submitting user. See figure 2. Variations:
(a) Job is submitted via JobClient protocol
(b) Job is submitted via Web Services protocol (Phase 2)
Encryption
• Over the wire encryption for some socket connections
  • RPC encryption added soon after Kerberos
  • Shuffle encryption (HTTPS) added in Hadoop 2.0.2-alpha, back ported to CDH4 MR1
  • HDFS block streamer encryption added in Hadoop 2.0.2-alpha
• Volume-level encryption for data at rest
Big Data Security
SECURITY FOR KEY VALUE STORES
Apache Accumulo
• Robust, scalable, high performance data storage and retrieval system
• Built by NSA, now an Apache project
• Based on Google’s BigTable
• Built on top of HDFS, ZooKeeper and Thrift
• Iterators for server-side extensions
• Cell labels for flexible security models
Data Model
• Multi-dimensional, persistent, sorted map
• Key/Value store with a twist
• A single primary key (Row ID)
• Secondary key (Column) internal to a row
  • Family
  • Qualifier
• Per-cell timestamp
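The key structure above can be modeled as a sorted map over (row, family, qualifier, timestamp) tuples. The sketch below is illustrative only: the class and function names are hypothetical, it sorts on read for brevity rather than keeping data sorted on disk, and ordering versions newest first within a column is an assumption of this sketch.

```python
from typing import NamedTuple


class Key(NamedTuple):
    row: str        # primary key (Row ID)
    family: str     # column family
    qualifier: str  # column qualifier
    timestamp: int  # per-cell timestamp


def sort_key(key):
    # Row, then column, then timestamp descending (assumed: newest first).
    return (key.row, key.family, key.qualifier, -key.timestamp)


class SortedTable:
    """Toy in-memory stand-in for a sorted key/value table."""

    def __init__(self):
        self._cells = {}

    def put(self, key, value):
        self._cells[key] = value

    def scan(self, row=None):
        """Return (key, value) pairs in sorted order, optionally for one row."""
        keys = sorted(self._cells, key=sort_key)
        return [(k, self._cells[k]) for k in keys if row is None or k.row == row]


if __name__ == "__main__":
    table = SortedTable()
    table.put(Key("row2", "meta", "size", 1), "10")
    table.put(Key("row1", "meta", "owner", 1), "joe")
    table.put(Key("row1", "meta", "owner", 2), "ann")
    table.put(Key("row1", "attr", "color", 1), "red")
    for key, value in table.scan():
        print(key, value)
```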
Cell-Level Security
• Labels stored per cell
• Labels consist of Boolean expressions (AND, OR, nesting)
• Labels associated with each user
• Cell labels checked against user’s labels with a built-in iterator
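Checking a cell label against a user's labels amounts to evaluating a Boolean expression over a set. The sketch below is a simplified evaluator, not Accumulo's implementation: it writes AND as `&` and OR as `|`, requires parentheses when the two are mixed, and treats an empty label as visible to everyone (an assumption of this sketch). The function names and example labels are hypothetical.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expression):
    """Split a label expression into label names, operators and parentheses."""
    tokens, pos = [], 0
    expression = expression.strip()
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if not match:
            raise ValueError(f"bad character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def visible(expression, authorizations):
    """Return True if a user holding `authorizations` may read the cell."""
    tokens = tokenize(expression)
    if not tokens:
        return True  # assumption: an unlabeled cell is visible to everyone
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return value
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected {token!r}")
        return token in authorizations

    def parse_expr():
        nonlocal pos
        value = parse_term()
        operator = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if operator not in (None, tokens[pos]):
                raise ValueError("mixing & and | requires parentheses")
            operator = tokens[pos]
            pos += 1
            right = parse_term()
            value = (value and right) if operator == "&" else (value or right)
        return value

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


if __name__ == "__main__":
    print(visible("admin&(finance|audit)", {"admin", "audit"}))
    print(visible("admin&(finance|audit)", {"finance"}))
```

A scan would apply this check to every cell and silently drop the ones that fail, so users see only the data their labels permit.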
Pluggable Authentication
• Currently supports username/password authentication backed by ZooKeeper
• ACCUMULO-259
• Targeted for Accumulo 1.5.0
• Authentication info replaced with generic tokens
• Supports multiple implementations (e.g. Kerberos)
Application Level
• Accumulo often paired with application level authentication/authorization
• Accumulo users created per application
• Each application granted access level of most permitted user
• Application authenticates users, grabs user authorizations, passes user labels with requests
Apache HBase
• Also based on Google’s BigTable
• Started as a Hadoop contrib project
• Supports column-level ACLs
• Kerberos for authentication
• Discussion and early prototypes of cell-level security ongoing
Big Data Security
FUTURE
Encryption for Data at Rest
• Need multiple levels of granularity
• Encryption keys tied to authorization labels (like Accumulo labels or HBase ACLs)
• APIs for file-level, block-level, or record-level encryption
Hive Security
• Column-level ACLs
• Kerberos authentication
• AccessServer