
Hadoop 101



Hadoop is emerging as the preferred solution for big data analytics across unstructured data. Using real-world examples, learn how to achieve a competitive advantage by finding effective ways to analyze new sources of unstructured and machine-generated data.


Page 1: Hadoop 101


Hadoop 101

Page 2: Hadoop 101


Agenda

•  What is Hadoop?
•  History of Hadoop
•  Hadoop Components
•  Hadoop Ecosystem
•  Customer Use Cases
•  Hadoop Challenges

Page 3: Hadoop 101


What is Hadoop?

Page 4: Hadoop 101


What is Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

•  Concepts:
   –  NameNode (aka Master) is responsible for managing the namespace repository (index) for the filesystem, and for managing jobs
   –  DataNode (aka Segment) is responsible for storing blocks of data and running tasks
   –  MapReduce – push compute to where the data resides

Page 5: Hadoop 101


What is Hadoop and Where did it start?

•  Created by Doug Cutting

–  HDFS (storage) & MapReduce (compute)

–  Inspired by Google’s MapReduce and Google File System (GFS) papers

•  It is now a top-level Apache project backed by a large open source development community

•  Three major subprojects

–  Nutch

–  Lucene

–  Hadoop

Page 6: Hadoop 101


What makes Hadoop Different?

•  Hadoop is a complete paradigm shift

•  Bypasses 25 years of enterprise ceilings

•  Hadoop also defers some really difficult challenges:
   –  Non-transactional
   –  File system is essentially read-only

•  “Greater potential for the Hadoop architecture to mature and handle the complexity of transactions than for RDBMSs to figure out failures and data growth”

Page 7: Hadoop 101


Confluence of Factors

•  Hadoop makes analytics on large-scale data sets more pragmatic
   –  BI solutions often suffer from garbage-in, garbage-out
   –  Opens up new ways of understanding, and thus running, lines of business

•  Classic architectures won’t scale any further

•  New sources of information (social media) are too big and unwieldy for traditional solutions
   –  5-year enterprise data growth is estimated at 650%, with over 80% of that unstructured data (e.g., Facebook collects 100TB per day)

•  What works for Google & Facebook!

Page 8: Hadoop 101


Hadoop vs Relational Solutions

•  Hadoop is a paradigm shift in the way we think about and manage data

•  Traditional solutions were not designed with growth in mind

•  Big-Data accelerates this problem dramatically

Category        | Traditional RDBMS                                          | Hadoop
Scalability     | Resource constrained; growth requires re-architecture; ~5TB | Linear expansion; seamless addition & subtraction of nodes; ~5PB
Fault Tolerance | Afterthought; many critical points of failure               | Designed in; tasks are automatically restarted
Problem Space   | Transactional, OLTP; inability to incorporate new sources   | Batch, OLAP (today!); no bounds

Page 9: Hadoop 101


History of Hadoop?

Page 10: Hadoop 101


History

•  Google paper: “MapReduce: Simplified Data Processing on Large Clusters” – 2004

–  GFS & MapReduce framework

•  Top Level Apache Open Source Community Project - 2008

•  Yahoo, Facebook, and Powerset become the main contributors, with Yahoo running over 10K nodes (300K cores) - 2009

•  Hadoop cluster at Yahoo sets the Terasort benchmark standard – Jul 08
   –  209 seconds to sort 1TB (62 seconds now)
   –  Cluster config:
      •  910 nodes – 4 dual-core Xeons @ 2.0GHz, 8GB RAM, 4 SATA disks per node
      •  1Gb Ethernet per node
      •  40 nodes per rack
      •  8Gb Ethernet uplinks from each rack
      •  RH Server 5.1
      •  Sun JDK 1.6.0_05-b13

Page 11: Hadoop 101


Hadoop Components

Page 12: Hadoop 101


Component Design

(Diagram: the component stack — HDFS over JBOD disks at the bottom; MapReduce jobs written in Java, Python, or via streaming; packages such as Hive, HBase, and Pig on top of MapReduce; analytics tools such as Mahout and R at the top.)

Page 13: Hadoop 101


Hadoop Components

•  Storage & compute in one framework
•  Open source project of the Apache Software Foundation
•  Written in Java

Two core components: HDFS (storage) and MapReduce (compute)

Page 14: Hadoop 101


HDFS

Page 15: Hadoop 101


HDFS Concepts

•  Sits on top of a native (ext3, xfs, etc.) file system

•  Performs best with a ‘modest’ number of large files
   –  Millions, rather than billions, of files
   –  Each file typically 100MB or more

•  Files in HDFS are ‘write once’
   –  No random writes to files are allowed
   –  Append support is available in Hadoop 0.21

•  HDFS is optimized for large, streaming reads of files
   –  Rather than random reads
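To make the write-once, streaming-read model concrete, here is a minimal sketch using Hadoop's Java FileSystem API; the path and record contents are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/events.log");  // hypothetical path

        // Write once: the file is created, written sequentially, then closed.
        FSDataOutputStream out = fs.create(file);
        out.writeBytes("first record\nsecond record\n");
        out.close();                                    // no random rewrites after this point

        // Large, streaming read: scan the file from the beginning.
        FSDataInputStream in = fs.open(file);
        byte[] buffer = new byte[4096];
        int read;
        while ((read = in.read(buffer)) != -1) {
            System.out.write(buffer, 0, read);
        }
        in.close();
        fs.close();
    }
}
```

The same operations are available from the command line via hadoop fs -put and hadoop fs -cat.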

Page 16: Hadoop 101


HDFS

•  Hadoop Distributed File System
   –  Data is organized into files & directories
   –  Files are divided into blocks, typically 64–128MB each, and distributed across cluster nodes
   –  Block placement is known at runtime by MapReduce, so computation can be co-located with data
   –  Blocks are replicated (default is 3 copies) to handle failure
   –  Checksums are used to ensure data integrity

•  Replication is the one and only strategy for error handling, recovery and fault tolerance
   –  Make multiple copies and be happy!
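A short sketch of how a client can see block placement and replication through the same Java FileSystem API; the file path and replication factor are hypothetical values:

```java
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/events.log");   // hypothetical path

        // Ask for an extra replica of this file (the default is dfs.replication = 3).
        fs.setReplication(file, (short) 4);

        // Block placement is exposed to clients; this is how MapReduce
        // schedules tasks next to the data they read.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + Arrays.toString(block.getHosts()));
        }
    }
}
```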

Page 17: Hadoop 101


Hadoop Architecture - HDFS

Page 18: Hadoop 101


HDFS Components

•  NameNode
•  DataNode
•  Standby NameNode
•  Job Tracker
•  Task Tracker

Page 19: Hadoop 101


NameNode

•  Provides a centralized repository for the namespace
   –  An index of which files are stored in which blocks

•  Responds to client requests (MapReduce jobs) by coordinating the distribution of tasks

•  In memory only
   –  0.23 provides a distributed NameNode (federation)
   –  NameNode recovery must re-build the entire metadata repository

Page 20: Hadoop 101


Hadoop Architecture – HDFS

•  Block-level storage

•  N-node replication

•  NameNode for
   –  File system index (EditLog)
   –  Access coordination
   –  IPC via TCP/IP

•  DataNode for
   –  Data block management
   –  Job execution (MapReduce)

•  Automated fault tolerance

(Diagram: a client ‘Put’ writing data into HDFS.)
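The ‘Put’ in the diagram corresponds to copying a file into HDFS: the client asks the NameNode where to write, then streams blocks directly to DataNodes. A minimal sketch with hypothetical file names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Equivalent of `hadoop fs -put`: copy a local file into the cluster.
        fs.copyFromLocalFile(new Path("/tmp/orders.txt"),         // hypothetical local file
                             new Path("/user/demo/orders.txt"));  // destination in HDFS
    }
}
```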

Page 21: Hadoop 101


Job Tracker

•  MapReduce jobs are controlled by a software daemon known as the JobTracker

•  The JobTracker resides on a single node
   –  Clients submit MapReduce jobs to the JobTracker
   –  The JobTracker assigns Map and Reduce tasks to other nodes on the cluster
   –  These nodes each run a software daemon known as the TaskTracker
   –  The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting progress back to the JobTracker

•  A Job consists of a collection of Map & Reduce tasks
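A minimal driver sketch of how a client submits a job to the JobTracker. The mapper and reducer named here are the word-count classes sketched on the MapReduce slides below; input and output paths come from the command line, and the reducer count is an arbitrary example value:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // mapred.job.tracker points at the JobTracker
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);      // sketched on the MapReduce slide below
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2);                        // reducer count is part of the job config

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submits the job to the JobTracker and waits; the JobTracker hands
        // map and reduce tasks to TaskTrackers across the cluster.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```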

Page 22: Hadoop 101


MapReduce

Page 23: Hadoop 101


Map Reduce Framework

•  Map step
   –  Input records are parsed into intermediate key/value pairs
   –  Multiple Maps per node
      •  10TB => 128MB/block => ~82K maps

•  Reduce step
   –  Each Reducer handles all like keys
   –  3 steps
      •  Shuffle: all like keys are retrieved from each Mapper
      •  Sort: intermediate keys are sorted prior to reduce
      •  Reduce: values are processed
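A minimal word-count Mapper and Reducer sketch illustrating the map and reduce steps described above (new-API classes from org.apache.hadoop.mapreduce; in a real project each class lives in its own file):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: parse each input record into intermediate key/value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);       // emit intermediate (word, 1) pair
        }
    }
}

// Reduce step: after shuffle and sort, each reducer sees one key with all its values.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));   // one final pair per input key
    }
}
```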

Page 24: Hadoop 101


Map Reduce

Page 25: Hadoop 101


Reduce Task

•  After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list

•  This list is given to a Reducer
   –  There may be a single Reducer, or multiple Reducers
   –  This is specified as part of the job configuration (see later)
   –  All values associated with a particular intermediate key are guaranteed to go to the same Reducer
   –  The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
   –  This step is known as the ‘shuffle and sort’

•  The Reducer outputs zero or more final key/value pairs
   –  These are written to HDFS

•  In practice, the Reducer usually emits a single key/value pair for each input key

Page 26: Hadoop 101


Fault Tolerance

•  Hadoop will only allocate tasks to active nodes

•  MapReduce can compensate for slow-running tasks (speculative execution)
   –  If a Mapper appears to be running significantly more slowly than the others, a new instance of the Mapper will be started on another machine, operating on the same data
   –  The results of the first Mapper to finish will be used
   –  Hadoop will kill off the Mapper which is still running

•  Yahoo experiences multiple failures (> 10) of various components (drives, cables, servers) every day
   –  Which have exactly 0 impact on operations
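The straggler handling described above is Hadoop's speculative execution. In Hadoop 1.x-era releases it is controlled by two job configuration properties; a minimal sketch of enabling it explicitly (it is on by default):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Speculative execution re-runs straggling tasks on other nodes; the
        // first copy to finish wins and the slower copy is killed.
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);

        Job job = new Job(conf, "speculative demo");
        // ... set mapper/reducer and input/output paths, then job.waitForCompletion(true)
    }
}
```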

Page 27: Hadoop 101


Ecosystem

Page 28: Hadoop 101


Hadoop Ecosystem

Page 29: Hadoop 101


Ecosystem Distribution by Role

(Diagram: vendors in the Hadoop ecosystem grouped by role — consulting, IDEs, reporting, analytics, distributions, monitoring, manageability, training, data integration, data visualization, and UAP — all built around the Apache projects.)

Page 30: Hadoop 101


Hadoop Components (hadoop.apache.org)

•  HDFS – Hadoop Distributed File System
•  MapReduce – framework for writing scalable data applications
•  Pig – procedural language that abstracts lower-level MapReduce
•  ZooKeeper – highly reliable distributed coordination
•  Hive – system for querying and managing structured data built on top of HDFS (SQL-like query)
•  HBase – database for random, real-time read/write access
•  Oozie – workflow/coordination to manage jobs
•  Mahout – scalable machine learning libraries

Page 31: Hadoop 101


Technology Adoption Lifecycle

(Diagram: technology adoption lifecycle — innovators/early adopters, early majority, late majority, laggards — with a ‘Today’ marker showing where Hadoop currently sits on the curve.)

Page 32: Hadoop 101


HBase, Pig, Hive

Page 33: Hadoop 101


HBase Overview

•  HBase is a sparse, distributed, persistent, scalable, reliable multi-dimensional map which is indexed by row key
   –  Hadoop database, ~ “NoSQL” database
   –  Many relational features
   –  Scalable: Region Servers
   –  Multiple client access: Java, REST, Thrift

•  What’s it good for?
   –  Queries against a number of rows that makes your Oracle server puke!

•  HBase leverages HDFS for its storage

Page 34: Hadoop 101


HBase in Practice

•  High-performance, real-time query
•  Client is typically a Java program
•  But HBase supports many other APIs:
   –  JSON: JavaScript Object Notation
   –  REST: Representational State Transfer
   –  Thrift, Avro: frameworks
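A minimal sketch of the classic Java client (the 0.90-era HTable API); the table, column family, and values are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        HTable table = new HTable(conf, "orders");           // hypothetical table

        // Write one cell: row key, column family, qualifier, value.
        Put put = new Put(Bytes.toBytes("order-1001"));
        put.add(Bytes.toBytes("details"), Bytes.toBytes("status"), Bytes.toBytes("SHIPPED"));
        table.put(put);

        // Real-time read of the same row.
        Get get = new Get(Bytes.toBytes("order-1001"));
        Result result = table.get(get);
        byte[] status = result.getValue(Bytes.toBytes("details"), Bytes.toBytes("status"));
        System.out.println(Bytes.toString(status));

        table.close();
    }
}
```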

Page 35: Hadoop 101


HBase – Key/Value Store

•  Excellent key-based access to a specific cell or sequential cells of data

•  Column-oriented architecture (like GPDB)
   –  Column families: related attributes often queried together
   –  Members are stored together

•  Versioning of cells is used to provide update capability
   –  A change to an existing cell is stored as a new version, by timestamp

•  No transactional guarantee
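A small sketch of how cell versioning looks from the same Java client API: writing the same cell twice produces two timestamped versions, and a Get can ask for more than just the latest one (table name and values are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedCellSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "orders");          // hypothetical table

        byte[] row = Bytes.toBytes("order-1001");
        byte[] family = Bytes.toBytes("details");
        byte[] qualifier = Bytes.toBytes("status");

        // Two writes to the same cell become two versions, ordered by timestamp.
        table.put(new Put(row).add(family, qualifier, Bytes.toBytes("PACKED")));
        table.put(new Put(row).add(family, qualifier, Bytes.toBytes("SHIPPED")));

        // Ask for up to three versions of the cell instead of just the latest.
        Get get = new Get(row);
        get.setMaxVersions(3);
        Result result = table.get(get);
        for (KeyValue kv : result.raw()) {
            System.out.println(kv.getTimestamp() + " -> " + Bytes.toString(kv.getValue()));
        }
        table.close();
    }
}
```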

Page 36: Hadoop 101


Hive

•  Data warehousing package built on top of Hadoop

•  System for managing and querying structured data
   –  Leverages MapReduce for execution
   –  Utilizes HDFS (or HBase) for storage

•  Data is stored in tables
   –  Consists of a separate schema metastore and data files

•  HiveQL is a SQL-like language
   –  Queries are converted into MapReduce jobs

Page 37: Hadoop 101


Hive – Basics & Syntax

--- Hive example
-- set hive to use local (non-hdfs) storage
hive> SET mapred.job.tracker=local;
hive> SET mapred.local.dir=/Users/hardar/Documents/training/HDWorkshop/labs/9.hive/data;
hive> SET hive.exec.mode.local.auto=false;

-- setup hive storage location in hdfs - if not using local
$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir /user/hive/warehouse
$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /user/hive/warehouse

-- create an orders table
create table orders (orderid bigint, customerid bigint, productid int, qty int,
  rate int, estdlvdate string, status string)
  row format delimited fields terminated by ",";

-- load some data
load data local inpath '9.hive/data/orders.txt' into table orders;

-- query
select * from orders;

-- create a product table
create table products (productid int, description string)
  row format delimited fields terminated by ",";

-- load some data
load data local inpath '9.hive/data/products.txt' into table products;

-- select * from products

Key steps in the example:

•  Tell Hive to use a local repository for MapReduce, not HDFS
•  Create the repository folders in HDFS
•  Create an Orders table and load data from the local file system
•  Create a Products table and load data from the local file system

Page 38: Hadoop 101


Pig

•  Provides a mechanism for using MapReduce without programming in Java
   –  Utilizes HDFS & MapReduce

•  Allows for a more intuitive means to specify data flows
   –  High-level, sequential data flow language – Pig Latin
   –  Python integration

•  Comfortable for researchers who are familiar with Perl & Python

•  Pig is easier to learn & execute, but more limited in scope of functionality than Java

Page 39: Hadoop 101


PIG – Basics & Syntax

-- file : demographic.pig --
-- extracts INCOME (in thousands) and ZIPCODE from census data. Filters out ZERO incomes
grunt> DEMO_TABLE = LOAD 'data/input/demo_sample.txt' using PigStorage(',')
         AS (gender:chararray, age:int, income:int, zip:chararray);

-- describe DEMO_TABLE
grunt> describe DEMO_TABLE;

## run mr job to dump DEMO_TABLE
grunt> dump DEMO_TABLE;

## store DEMO_TABLE in hdfs
grunt> store DEMO_TABLE into '/gphd/pig/DEMO_TABLE';

Define a table and load directly from local file

Describe

Select * from

Store the data in hdfs

Page 40: Hadoop 101


Others…….

Page 41: Hadoop 101


Mahout

•  Important stuff first: the most common pronunciation is “Ma-h-out” – rhymes with ‘trout’

•  Machine learning library that runs on HDFS

•  4 primary use cases:
   –  Recommendation mining – people who like X also like Y
   –  Clustering – topic-based association
   –  Classification – assign new docs to existing categories
   –  Frequent itemset mining – which things will appear together
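As an illustration of recommendation mining, a minimal sketch using Mahout's Taste API; this is the single-machine form (the same recommenders also have Hadoop-based implementations), and ratings.csv is a hypothetical file of userID,itemID,preference lines:

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
    public static void main(String[] args) throws Exception {
        // ratings.csv holds lines of userID,itemID,preference (hypothetical file).
        DataModel model = new FileDataModel(new File("ratings.csv"));

        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // "People who like X also like Y": top 3 items for user 42.
        List<RecommendedItem> items = recommender.recommend(42L, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " scored " + item.getValue());
        }
    }
}
```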

Page 42: Hadoop 101


Revolution Analytics R

•  Statistical programming language for Hadoop
   –  Open source & Revolution R Enterprise
   –  More than just counts and averages

•  Ability to manipulate HDFS directly from R

•  Mimics Java APIs

Page 43: Hadoop 101


Hadoop Use Cases

Page 44: Hadoop 101


Hadoop Use Cases

•  Internet
   –  Search index generation
   –  User engagement behavior
   –  Targeting / advertising optimizations
   –  Recommendations

•  BioMed
   –  Computational biomedical systems
   –  Bioinformatics
   –  Data mining and genome analysis

•  Financial
   –  Prediction models
   –  Fraud analysis
   –  Portfolio risk management

•  Telecom
   –  Call data records
   –  Set-top & DVR streams

•  Social
   –  Recommendations
   –  Network graphs
   –  Feed updates

•  Enterprises
   –  Email analysis and image processing
   –  ETL
   –  Reporting & analytics
   –  Natural language processing

•  Media/Newspapers
   –  Image conversions

•  Agriculture
   –  Process “agri” streams

•  Image
   –  Geo-spatial processing

•  Education
   –  Systems research
   –  Statistical analysis of stuff on the web

Page 45: Hadoop 101


Greenplum Hadoop Customers
How our customers are using Hadoop

•  Return Path
   –  World’s leader in email certification & scoring
   –  Uses Hadoop & HBase to store & process ISP data
   –  Replaced Cloudera with Greenplum MR

•  American Express
   –  Early stages of developing a Big Data analytics strategy
   –  Greenplum MR selected over Cloudera
   –  Chose GP because of EMC support & the existing relationship

•  SunGard
   –  IT company focusing on availability services
   –  Chose Greenplum MR as the platform for big-data-analytics-as-a-service
   –  Competes against AWS Elastic MapReduce

Page 46: Hadoop 101


Major Telco: CDR Churn Analysis

•  Business problem: construct a churn model to provide early detection of customers who are going to end their contracts

•  Available data
   –  Dependent variable: did a customer leave in a 4-month period?
   –  Independent variables: various features on customer call history
   –  ~120,000 training data points, ~120,000 test data points

•  First attempt
   –  Use R, specifically the Generalised Additive Models (GAM) package
   –  Quickly built a model that matched T-Mobile’s existing model

Page 47: Hadoop 101


Challenges

Page 48: Hadoop 101


Hadoop Pain Points

•  Integrated Product Suite
   –  No integrated Hadoop stack
   –  Hadoop, Pig, Hive, HBase, Zookeeper, Oozie, Mahout…

•  Interoperability
   –  No industry-standard ETL and BI stack integration
   –  Informatica, MicroStrategy, Business Objects…

•  Monitoring
   –  Poor job and application monitoring solutions
   –  Non-existent performance monitoring

•  Operability and Manageability
   –  Complex system configuration and manageability
   –  No data format interoperability & storage abstractions

•  Performance
   –  Poor dimensional lookup performance
   –  Very poor random access and serving performance

Page 49: Hadoop 101


(Diagram: analytic productivity applications, tools, and Chorus sit on top of common data computing interfaces — SQL, MapReduce, in-database analytics, and parallel data loading (batch or real-time) — which span the Greenplum Database (SQL DB engine with its own compute and storage) and Hadoop (MapReduce engine with its own compute and storage). The two sides co-process data over a parallel data exchange across the network, covering all data types: unstructured, structured, temporal, geospatial, sensor, and spatial data.)

Page 50: Hadoop 101


Questions?

THANK YOU