Hadoop 101


Hadoop is emerging as the preferred solution for big data analytics across unstructured data. Using real-world examples, this session shows how to achieve a competitive advantage by finding effective ways to analyze new sources of unstructured and machine-generated data.

© Copyright 2011 EMC Corporation. All rights reserved.

Agenda

•  What is Hadoop?
•  History of Hadoop
•  Hadoop Components
•  Hadoop Ecosystem
•  Customer Use Cases
•  Hadoop Challenges


What is Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

•  Concepts:
   –  NameNode (aka Master) is responsible for managing the namespace repository (index) for the filesystem, and for managing jobs
   –  DataNode (aka Segment) is responsible for storing blocks of data and running tasks
   –  MapReduce – push compute to where the data resides

What is Hadoop and Where did it start?

•  Created by Doug Cutting
   –  HDFS (storage) & MapReduce (compute)
   –  Inspired by Google's MapReduce and Google File System (GFS) papers
•  It is now a top-level Apache project backed by a large open-source development community
•  Three major related projects
   –  Nutch
   –  Lucene
   –  Hadoop

What makes Hadoop Different?

•  Hadoop is a complete paradigm shift

•  Bypasses 25 years of enterprise database scaling ceilings

•  Hadoop also defers some really difficult challenges:
   –  Non-transactional
   –  The file system is essentially read-only
•  "Greater potential for the Hadoop architecture to mature and handle the complexity of transactions than for RDBMS to figure out failures and data growth"

Confluence of Factors

•  Hadoop makes analytics on large-scale data sets more pragmatic
   –  BI solutions often suffer from garbage-in, garbage-out
   –  Opens up new ways of understanding, and thus running, lines of business
•  Classic architectures won't scale any further
•  New sources of information (social media) are too big and unwieldy for traditional solutions
   –  Five-year enterprise data growth is estimated at 650%, with over 80% of that unstructured data (e.g., Facebook collects 100 TB per day)
•  It's what works for Google & Facebook!

Hadoop vs Relational Solutions

•  Hadoop is a paradigm shift in the way we think about and manage data
•  Traditional solutions were not designed with growth in mind
•  Big data accelerates this problem dramatically

Comparison, Traditional RDBMS vs Hadoop:

•  Scalability
   –  Traditional RDBMS: resource constrained; growth requires re-architecture; ~5 TB
   –  Hadoop: linear expansion; seamless addition & subtraction of nodes; ~5 PB
•  Fault Tolerance
   –  Traditional RDBMS: an afterthought, with many critical points of failure
   –  Hadoop: designed in; tasks are automatically restarted
•  Problem Space
   –  Traditional RDBMS: transactional, OLTP; inability to incorporate new sources
   –  Hadoop: batch, OLAP (today!); no bounds

History of Hadoop

•  Google paper, "MapReduce: Simplified Data Processing on Large Clusters" – 2004
   –  Described the MapReduce framework (companion to the 2003 GFS paper)
•  Top-level Apache open-source community project – 2008
•  Yahoo, Facebook, and Powerset become the main contributors, with Yahoo running over 10K nodes (300K cores) – 2009
•  Hadoop cluster at Yahoo sets the TeraSort benchmark standard – Jul 08
   –  209 seconds to sort 1 TB (62 seconds now)
   –  Cluster config:
      •  910 nodes – 4 dual-core Xeons @ 2.0 GHz, 8 GB RAM, 4 SATA disks per node
      •  1 Gb Ethernet
      •  40 nodes per rack
      •  8 Gb Ethernet uplinks from each rack
      •  Red Hat Enterprise Linux Server 5.1
      •  Sun JDK 1.6.0_05-b13

Hadoop Components

Component Design

[Stack diagram: HDFS runs over JBOD disks; MapReduce jobs are written in Java or Python, or use streaming; packages (Hive, HBase, Pig) build on MapReduce; analytics tools (Mahout, R) sit on top.]

Hadoop Components

•  Storage & Compute in One Framework •  Open Source Project of the Apache Software Foundation •  Written in Java

HDFS MapReduce

Two Core Components

Storage Compute

HDFS

HDFS Concepts

•  Sits on top of a native (ext3, xfs, etc.) file system
•  Performs best with a 'modest' number of large files
   –  Millions, rather than billions, of files
   –  Each file typically 100 MB or more
•  Files in HDFS are 'write once'
   –  No random writes to files are allowed
   –  Append support is available in Hadoop 0.21
•  HDFS is optimized for large, streaming reads of files, rather than random reads

HDFS

•  Hadoop Distributed File System
   –  Data is organized into files & directories
   –  Files are divided into blocks, typically 64-128 MB each, and distributed across cluster nodes
   –  Block placement is known at runtime by MapReduce, so computation can be co-located with data
   –  Blocks are replicated (default is 3 copies) to handle failure
   –  Checksums are used to ensure data integrity
•  Replication is the one and only strategy for error handling, recovery, and fault tolerance
   –  Make multiple copies and be happy!
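As a concrete illustration of the write-once model, here is a minimal sketch (not taken from the deck; the path and file contents are hypothetical) that uses Hadoop's Java FileSystem API to create a file, write it in one pass, and stream it back:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnce {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up fs.default.name from core-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/demo/hello.txt");  // hypothetical path

    // 'Write once': create the file and write its entire contents in a single pass
    FSDataOutputStream out = fs.create(path, true); // true = overwrite if present
    out.writeBytes("hello, hdfs\n");
    out.close();                                    // once closed, no random rewrites

    // Large, streaming read back
    FSDataInputStream in = fs.open(path);
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    System.out.println(reader.readLine());
    reader.close();
  }
}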

Hadoop Architecture - HDFS

[HDFS architecture diagram]

HDFS Components

•  NameNode
•  DataNode
•  Standby NameNode
•  Job Tracker
•  Task Tracker

NameNode

•  Provides a centralized repository for the namespace
   –  An index of which files are stored in which blocks
•  Responds to client requests (MapReduce jobs) by coordinating the distribution of tasks
•  In memory only
   –  Hadoop 0.23 introduces a federated (distributed) NameNode
   –  NameNode recovery must rebuild the entire metadata repository

Hadoop Architecture - HDFS

•  Block-level storage
•  N-node replication
•  NameNode for:
   –  File system index (EditLog)
   –  Access coordination
   –  IPC via TCP/IP
•  DataNode for:
   –  Data block management
   –  Job execution (MapReduce)
•  Automated fault tolerance

[Diagram: a file 'put' operation flowing through the NameNode to DataNodes]

Job Tracker

•  MapReduce jobs are controlled by a software daemon known as the JobTracker
•  The JobTracker resides on a single node
   –  Clients submit MapReduce jobs to the JobTracker
   –  The JobTracker assigns Map and Reduce tasks to other nodes on the cluster
   –  These nodes each run a software daemon known as the TaskTracker
   –  The TaskTracker is responsible for actually instantiating the Map or Reduce task, and for reporting progress back to the JobTracker
•  A Job consists of a collection of Map & Reduce tasks (a minimal driver sketch follows this list)
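To make the submission flow concrete, here is a minimal driver sketch using the 0.20-era org.apache.hadoop.mapreduce API; the WordCountMapper and WordCountReducer classes are the illustrative ones sketched in the MapReduce section below, and the input/output paths are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);    // sketched in the MapReduce section
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/user/demo/input"));     // hypothetical
    FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));  // hypothetical
    // waitForCompletion hands the job to the JobTracker and reports progress
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}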

MapReduce

Map Reduce Framework

•  Map step
   –  Input records are parsed into intermediate key/value pairs
   –  Multiple Maps per node (10 TB at 128 MB/block => ~82K Maps)
•  Reduce step
   –  Each Reducer handles all like keys
   –  3 steps:
      •  Shuffle: all like keys are retrieved from each Mapper
      •  Sort: intermediate keys are sorted prior to reduce
      •  Reduce: values are processed
   (a word-count sketch of both steps follows this list)
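A minimal word-count sketch of the two steps (an illustration against the 0.20-era API, not code from the deck): the Map step parses each input line into intermediate (word, 1) pairs, and the Reduce step sums the values delivered for each key after the shuffle and sort.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: parse input records into intermediate key/value pairs
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      ctx.write(word, ONE);   // intermediate (word, 1) pair
    }
  }
}

// Reduce step: all values for a like key arrive together, in sorted key order
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) sum += c.get();
    ctx.write(word, new IntWritable(sum));  // final pair, written to HDFS
  }
}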

Map Reduce

[MapReduce data-flow diagram]

Reduce Task

•  After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list
•  This list is given to a Reducer
   –  There may be a single Reducer, or multiple Reducers; this is specified as part of the job configuration (see the snippet below)
   –  All values associated with a particular intermediate key are guaranteed to go to the same Reducer
   –  The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
   –  This step is known as the 'shuffle and sort'
•  The Reducer outputs zero or more final key/value pairs
   –  These are written to HDFS
•  In practice, the Reducer usually emits a single key/value pair for each input key
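For example, the Reducer count is set on the job object in the driver; Job.setNumReduceTasks is the standard call (the count of 4 is arbitrary, for illustration):

// In the driver, after constructing the Job (see WordCountDriver above):
job.setNumReduceTasks(4);   // four Reducers; 0 would make the job map-only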

Fault Tolerance

•  HDFS will only allocate jobs to active nodes
•  MapReduce can compensate for slow-running jobs (speculative execution; sketched after this list)
   –  If a Mapper appears to be running significantly more slowly than the others, a new instance of the Mapper will be started on another machine, operating on the same data
   –  The results of the first Mapper to finish will be used
   –  Hadoop will kill off the Mapper which is still running
•  Yahoo experiences multiple failures (> 10) of various components (drives, cables, servers) every day
   –  Which have exactly zero impact on operations
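This compensate-for-stragglers behavior is known as speculative execution, and it can be toggled per job; a small sketch assuming the 0.20-era property names:

// In the driver: speculative execution is on by default and can be set per phase
Configuration conf = new Configuration();
conf.setBoolean("mapred.map.tasks.speculative.execution", true);     // re-run slow Mappers
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false); // but not slow Reducers
Job job = new Job(conf, "speculative demo");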

Ecosystem

Hadoop Ecosystem

[Hadoop ecosystem diagram]

Ecosystem Distribution by Role

[Vendor landscape diagram grouping the ecosystem by role: consulting, IDEs, reporting, analytics, distributions, monitoring, manageability, training, data integration, data visualization, UAP, and the Apache projects themselves]

Hadoop Components (hadoop.apache.org)

•  HDFS – Hadoop Distributed File System
•  MapReduce – framework for writing scalable data applications
•  Pig – procedural language that abstracts lower-level MapReduce
•  ZooKeeper – highly reliable distributed coordination
•  Hive – system for querying and managing structured data built on top of HDFS (SQL-like query)
•  HBase – database for random, real-time read/write access
•  Oozie – workflow/coordination system to manage jobs
•  Mahout – scalable machine-learning libraries

Technology Adoption Lifecycle

[Adoption-curve diagram: Innovators/Early Adopters, Early Majority, Late Majority, Laggards; Hadoop sits at the Innovators/Early Adopters stage today]

HBase, Pig, Hive

HBase Overview

•  HBase is a sparse, distributed, persistent, scalable, reliable multi-dimensional map which is indexed by row key
   –  The Hadoop database; a "NoSQL" database
   –  Many relational features
   –  Scalable: region servers
   –  Multiple client access: Java, REST, Thrift
•  What's it good for?
   –  Queries against a number of rows that makes your Oracle server puke!
•  HBase leverages HDFS for its storage

HBase in Practice

•  High-performance, real-time query
•  The client is typically a Java program
•  But HBase supports many other APIs:
   –  JSON: JavaScript Object Notation
   –  REST: Representational State Transfer
   –  Thrift, Avro: serialization/RPC frameworks

HBase – Key/Value Store

•  Excellent key-based access to a specific cell or sequential cells of data
•  Column-oriented architecture (like GPDB)
   –  Column families group related attributes that are often queried together
   –  Members of a family are stored together
•  Versioning of cells is used to provide update capability
   –  A change to an existing cell is stored as a new version, by timestamp
•  No transactional guarantee
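A minimal sketch of key-based access with the 2011-era HBase Java client (the table, row key, and column family are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "orders");        // hypothetical table

    // Write: updating an existing cell simply stores a new timestamped version
    Put put = new Put(Bytes.toBytes("order-1001"));   // row key
    put.add(Bytes.toBytes("info"), Bytes.toBytes("status"), Bytes.toBytes("SHIPPED"));
    table.put(put);

    // Read: key-based access straight to the cell (latest version by default)
    Result result = table.get(new Get(Bytes.toBytes("order-1001")));
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("status"))));
    table.close();
  }
}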

Hive

•  Data warehousing package built on top of Hadoop
•  System for managing and querying structured data
   –  Leverages MapReduce for execution
   –  Utilizes HDFS (or HBase) for storage
•  Data is stored in tables
   –  Consists of a separate schema metastore and data files
•  HiveQL is a SQL-like language
   –  Queries are converted into MapReduce jobs

Hive – Basics & Syntax

--- Hive example
-- set hive to use local (non-hdfs) storage for mapreduce
hive> SET mapred.job.tracker=local;
hive> SET mapred.local.dir=/Users/hardar/Documents/training/HDWorkshop/labs/9.hive/data;
hive> SET hive.exec.mode.local.auto=false;

-- set up hive storage locations in hdfs (if not using local)
$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir /user/hive/warehouse
$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /user/hive/warehouse

-- create an orders table
create table orders (orderid bigint, customerid bigint, productid int,
    qty int, rate int, estdlvdate string, status string)
  row format delimited fields terminated by ",";

-- load some data from the local file system
load data local inpath '9.hive/data/orders.txt' into table orders;

-- query
select * from orders;

-- create a products table
create table products (productid int, description string)
  row format delimited fields terminated by ",";

-- load some data from the local file system
load data local inpath '9.hive/data/products.txt' into table products;

-- query
select * from products;

Pig

•  Provides a mechanism for using MapReduce without programming in Java
   –  Utilizes HDFS & MapReduce
•  Allows a more intuitive means of specifying data flows
   –  High-level, sequential data-flow language: Pig Latin
   –  Python integration
•  Comfortable for researchers who are familiar with Perl & Python
•  Pig is easier to learn & execute, but more limited in scope of functionality than Java

Pig – Basics & Syntax

-- file: demographic.pig
-- extracts INCOME (in thousands) and ZIPCODE from census data; filters out zero incomes

-- define a table and load directly from a local file
grunt> DEMO_TABLE = LOAD 'data/input/demo_sample.txt' using PigStorage(',')
         AS (gender:chararray, age:int, income:int, zip:chararray);

-- describe DEMO_TABLE
grunt> describe DEMO_TABLE;

-- run an mr job to dump DEMO_TABLE (select * from)
grunt> dump DEMO_TABLE;

-- store DEMO_TABLE in hdfs
grunt> store DEMO_TABLE into '/gphd/pig/DEMO_TABLE';

Others…

Mahout

•  Important stuff first: the most common pronunciation is "Ma-h-out" – rhymes with 'trout'
•  Machine-learning library that runs on HDFS
•  4 primary use cases (a recommender sketch follows this list):
   –  Recommendation mining – people who like X also like Y
   –  Clustering – topic-based association
   –  Classification – assign new docs to existing categories
   –  Frequent itemset mining – which things appear together
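As a flavor of the recommendation-mining case, a minimal sketch with Mahout's Taste API (the prefs.csv file, a userID,itemID,rating CSV, and the neighborhood size of 10 are hypothetical):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
  public static void main(String[] args) throws Exception {
    // "People who like X, also like Y": user-based collaborative filtering
    DataModel model = new FileDataModel(new File("prefs.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    List<RecommendedItem> top3 = recommender.recommend(42L, 3);  // top 3 items for user 42
    for (RecommendedItem item : top3) System.out.println(item);
  }
}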

Revolution Analytics R

•  Statistical programming language for Hadoop
   –  Open source & Revolution R Enterprise
   –  More than just counts and averages
•  Ability to manipulate HDFS directly from R
•  Mimics the Java APIs

Hadoop Use Cases

•  Internet
   –  Search index generation
   –  User engagement behavior
   –  Targeting / advertising optimizations
   –  Recommendations
•  BioMed
   –  Computational biomedical systems
   –  Bioinformatics
   –  Data mining and genome analysis
•  Financial
   –  Prediction models
   –  Fraud analysis
   –  Portfolio risk management
•  Telecom
   –  Call data records
   –  Set-top & DVR streams
•  Social
   –  Recommendations
   –  Network graphs
   –  Feed updates
•  Enterprises
   –  Email analysis and image processing
   –  ETL
   –  Reporting & analytics
   –  Natural language processing
•  Media/Newspapers
   –  Image conversions
•  Agriculture
   –  Processing "agri" streams
•  Imaging
   –  Geo-spatial processing
•  Education
   –  Systems research
   –  Statistical analysis of web data

Greenplum Hadoop Customers

How our customers are using Hadoop:

•  Return Path
   –  World leader in email certification & scoring
   –  Uses Hadoop & HBase to store & process ISP data
   –  Replaced Cloudera with Greenplum MR
•  American Express
   –  Early stages of developing a Big Data analytics strategy
   –  Greenplum MR selected over Cloudera
   –  Chose Greenplum because of EMC support & the existing relationship
•  SunGard
   –  IT company focusing on availability services
   –  Chose Greenplum MR as the platform for big-data-analytics-as-a-service
   –  Competes against AWS Elastic MapReduce

Major Telco: CDR Churn Analysis

•  Business problem: construct a churn model to provide early detection of customers who are going to end their contracts
•  Available data
   –  Dependent variable: did a customer leave in a 4-month period?
   –  Independent variables: various features of customer call history
   –  ~120,000 training data points, ~120,000 test data points
•  First attempt
   –  Use R, specifically the Generalised Additive Models (GAM) package
   –  Quickly built a model that matched T-Mobile's existing model

Challenges

Hadoop Pain Points

•  Integrated product suite
   –  No integrated Hadoop stack (Hadoop, Pig, Hive, HBase, ZooKeeper, Oozie, Mahout, ...)
•  Interoperability
   –  No industry-standard ETL and BI stack integration (Informatica, MicroStrategy, Business Objects, ...)
•  Monitoring
   –  Poor job and application monitoring solutions
   –  Non-existent performance monitoring
•  Operability and manageability
   –  Complex system configuration and manageability
   –  No data format interoperability & storage abstractions
•  Performance
   –  Poor dimensional lookup performance
   –  Very poor random access and serving performance

[Greenplum UAP architecture diagram: an analytic-productivity layer (applications, tools, Chorus) sits above Greenplum Database (SQL DB engine: compute + storage) and Hadoop (MapReduce engine: compute + storage); the two sides are joined by parallel data exchange over the network for data co-processing. Data computing interfaces: SQL, MapReduce, in-database analytics, and parallel data loading (batch or real-time). All data types are supported: unstructured, structured, temporal, geospatial, sensor, and spatial data.]

Questions……? &

THANK YOU

Recommended