DESCRIPTION
Hadoop is emerging as the preferred solution for big-data analytics across unstructured data. Using real-world examples, learn how to achieve a competitive advantage by finding effective ways to analyze new sources of unstructured and machine-generated data.
1 © Copyright 2011 EMC Corporation. All rights reserved.
Hadoop 101
Agenda
• What is Hadoop?
• History of Hadoop
• Hadoop Components
• Hadoop Ecosystem
• Customer Use Cases
• Hadoop Challenges
What is Hadoop?
What is Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
• Concepts:
– NameNode (aka Master) is responsible for managing the namespace repository (index) for the filesystem, and for managing jobs
– DataNode (aka Segment) is responsible for storing blocks of data and running tasks
– MapReduce – push compute to where the data resides
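The NameNode/DataNode split above can be sketched as a toy model. This is illustrative Python only, not Hadoop's actual API; the class, method, and node names (`NameNode`, `add_file`, `dn1`, …) are hypothetical, and the round-robin block placement is a stand-in for HDFS's real placement policy:

```python
# Toy model of the NameNode/DataNode split: the NameNode keeps only
# metadata -- which blocks make up a file, and which DataNodes hold
# each block. The data itself lives on the DataNodes.

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, a common HDFS default

class NameNode:
    def __init__(self):
        self.files = {}            # filename -> [block ids]
        self.block_locations = {}  # block id -> [datanode ids]

    def add_file(self, name, size_bytes, datanodes, replication=3):
        n_blocks = -(-size_bytes // BLOCK_SIZE)  # ceiling division
        blocks = [f"{name}#blk{i}" for i in range(n_blocks)]
        self.files[name] = blocks
        for i, blk in enumerate(blocks):
            # round-robin placement, `replication` copies of each block
            self.block_locations[blk] = [
                datanodes[(i + r) % len(datanodes)] for r in range(replication)
            ]

    def locate(self, name):
        """Answer a client query: where does each block of `name` live?"""
        return {blk: self.block_locations[blk] for blk in self.files[name]}

nn = NameNode()
nn.add_file("/logs/day1", 200 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
print(len(nn.files["/logs/day1"]))                 # 200 MB / 64 MB -> 4 blocks
print(nn.locate("/logs/day1")["/logs/day1#blk0"])  # -> ['dn1', 'dn2', 'dn3']
```

A MapReduce job would then ask the NameNode where each block lives and schedule its tasks on those same nodes, which is what "push compute to where the data resides" means in practice.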
What is Hadoop and Where did it start?
• Created by Doug Cutting
– HDFS (storage) & MapReduce (compute)
– Inspired by Google’s MapReduce and Google File System (GFS) papers
• It is now a top-level Apache project backed by a large open-source development community
• Three closely related Apache projects:
– Lucene
– Nutch
– Hadoop (began life as a subproject of Nutch)
What makes Hadoop Different?
• Hadoop is a complete paradigm shift
• Bypasses 25 years of enterprise scaling ceilings
• Hadoop also defers some really difficult challenges:
– Non-transactional
– File system is essentially read-only
• “Greater potential for the Hadoop architecture to mature and handle the complexity of transactions than for RDBMS to figure out failures and data growth”
Confluence of Factors
• Hadoop makes analytics on large-scale data sets more pragmatic
– BI solutions often suffer from garbage-in, garbage-out
– Opens up new ways of understanding, and thus running, lines of business
• Classic architectures won’t scale any further
• New sources of information (social media) are too big and unwieldy for traditional solutions
– 5-year enterprise data growth is estimated at 650%, with over 80% of that unstructured data (e.g., Facebook collects 100TB per day)
• It works for Google & Facebook!
Hadoop vs Relational Solutions
• Hadoop is a paradigm shift in the way we think about and manage data
• Traditional solutions were not designed with growth in mind
• Big Data accelerates this problem dramatically
Traditional RDBMS vs Hadoop, by category:
• Scalability
– Traditional RDBMS: resource constrained; growth requires re-architecture; ~5TB
– Hadoop: linear expansion; seamless addition & subtraction of nodes; ~5PB
• Fault Tolerance
– Traditional RDBMS: an afterthought, with many critical points of failure
– Hadoop: designed in; failed tasks are automatically restarted
• Problem Space
– Traditional RDBMS: transactional, OLTP; inability to incorporate new sources
– Hadoop: batch, OLAP (today!); no bounds
History of Hadoop?
History
• Google papers: the Google File System (2003) and “MapReduce: Simplified Data Processing on Large Clusters” (2004)
– GFS & the MapReduce framework
• Top-level Apache open-source community project – 2008
• Yahoo, Facebook, and Powerset become the main contributors, with Yahoo running over 10K nodes (300K cores) – 2009
• Hadoop cluster at Yahoo sets the Terasort benchmark standard – Jul ’08
– 209s to sort 1TB (62 seconds now)
– Cluster config:
• 910 nodes – 4 dual-core Xeons @ 2.0GHz, 8GB RAM, 4 SATA disks per node
• 1Gb Ethernet
• 40 nodes per rack
• 8Gb Ethernet uplinks from each rack
• RHEL Server 5.1
• Sun JDK 1.6.0_05-b13
Hadoop Components
Component Design (stack diagram)
• Storage: HDFS over JBOD disks
• Compute: MapReduce (Java, Python, streaming)
• Packages: Hive, HBase, Pig
• Analytics: Mahout, R
Hadoop Components
• Storage & compute in one framework
• Open-source project of the Apache Software Foundation
• Written in Java
• Two core components:
– HDFS (storage)
– MapReduce (compute)
HDFS
HDFS Concepts
• Sits on top of a native (ext3, xfs, etc.) file system
• Performs best with a ‘modest’ number of large files
– Millions, rather than billions, of files
– Each file typically 100MB or more
• Files in HDFS are ‘write once’
– No random writes to files are allowed
– Append support is available in Hadoop 0.21
• HDFS is optimized for large, streaming reads of files
– Rather than random reads
HDFS
• Hadoop Distributed File System
– Data is organized into files & directories
– Files are divided into blocks, typically 64–128MB each, and distributed across cluster nodes
– Block placement is known at runtime by MapReduce, so computation can be co-located with data
– Blocks are replicated (default is 3 copies) to handle failure
– Checksums are used to ensure data integrity
• Replication is the one and only strategy for error handling, recovery, and fault tolerance
– Make multiple copies and be happy!
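The "make multiple copies" strategy can be sketched as follows. This is a simplified Python model assuming naive round-robin placement; real HDFS placement is rack-aware and re-replication happens asynchronously via NameNode bookkeeping:

```python
# Sketch of HDFS-style recovery by replication: when a node dies,
# every block it held is copied to another live node so the
# replication factor is restored.

REPLICATION = 3

def place_blocks(blocks, nodes):
    """Spread each block onto REPLICATION distinct nodes, round-robin."""
    placement = {}
    for i, blk in enumerate(blocks):
        placement[blk] = {nodes[(i + r) % len(nodes)] for r in range(REPLICATION)}
    return placement

def handle_node_failure(placement, dead_node, live_nodes):
    """Restore the replication factor for every block the dead node held."""
    for blk, holders in placement.items():
        if dead_node in holders:
            holders.discard(dead_node)
            # pick any live node that does not already hold this block
            replacement = next(n for n in live_nodes if n not in holders)
            holders.add(replacement)
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
placement = place_blocks(["b0", "b1", "b2"], nodes)
live = [n for n in nodes if n != "dn2"]
placement = handle_node_failure(placement, "dn2", live)
print(sorted(placement["b0"]))  # dn2's copy of b0 was replaced
```

After the failure, every block is again held by exactly three live nodes, which is why the slide can call replication the "one and only" recovery strategy.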
Hadoop Architecture - HDFS
HDFS Components
• NameNode
• DataNode
• Standby NameNode
• Job Tracker
• Task Tracker
NameNode
• Provides a centralized repository for the namespace
– An index of which files are stored in which blocks
• Responds to client requests (MapReduce jobs) by coordinating the distribution of tasks
• In memory only
– 0.23 provides a distributed NameNode
– NameNode recovery must rebuild the entire metadata repository
Hadoop Architecture – HDFS
• Block-level storage
• N-node replication
• NameNode for:
– File system index (EditLog)
– Access coordination
– IPC via TCP/IP
• DataNode for:
– Data block management
– Job execution (MapReduce)
• Automated fault tolerance
Job Tracker
• MapReduce jobs are controlled by a software daemon known as the JobTracker
• The JobTracker resides on a single node
– Clients submit MapReduce jobs to the JobTracker
– The JobTracker assigns Map and Reduce tasks to other nodes in the cluster
– These nodes each run a software daemon known as the TaskTracker
– The TaskTracker is responsible for actually instantiating the Map or Reduce task, and for reporting progress back to the JobTracker
• A job consists of a collection of Map & Reduce tasks
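The JobTracker/TaskTracker interaction above can be modeled in a few lines. This is a heavily simplified Python sketch with hypothetical class names, not Hadoop's actual daemons (which fork JVMs and communicate via heartbeats):

```python
# Toy JobTracker/TaskTracker interaction: clients submit a job (a set of
# map and reduce tasks) to the JobTracker, which hands the tasks out to
# TaskTrackers and collects their progress reports.

from collections import deque

class TaskTracker:
    def __init__(self, name):
        self.name = name
        self.completed = []

    def run(self, task):
        # A real TaskTracker would instantiate the task in a child JVM
        # and heartbeat progress back; here we just record completion.
        self.completed.append(task)
        return f"{task} done on {self.name}"

class JobTracker:
    def __init__(self, trackers):
        self.trackers = trackers

    def submit_job(self, map_tasks, reduce_tasks):
        pending = deque(map_tasks + reduce_tasks)
        reports = []
        i = 0
        while pending:  # round-robin assignment to TaskTrackers
            tracker = self.trackers[i % len(self.trackers)]
            reports.append(tracker.run(pending.popleft()))
            i += 1
        return reports

jt = JobTracker([TaskTracker("tt1"), TaskTracker("tt2")])
reports = jt.submit_job(["map0", "map1", "map2"], ["reduce0"])
print(reports[0])   # -> map0 done on tt1
print(len(reports)) # 3 map tasks + 1 reduce task = 4 reports
```

The single-node JobTracker is also the scheduling bottleneck and single point of failure the "Challenges" section alludes to later.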
MapReduce
Map Reduce Framework
• Map step
– Input records are parsed into intermediate key/value pairs
– Multiple Maps per node
• 10TB => 128MB/block => ~82K Maps
• Reduce step
– Each Reducer handles all like keys
– 3 steps:
• Shuffle: all like keys are retrieved from each Mapper
• Sort: intermediate keys are sorted prior to reduce
• Reduce: values are processed
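The Map → Shuffle → Sort → Reduce steps above can be simulated in-process with word count, the canonical example. This is a single-machine Python sketch of the programming model, not the distributed implementation:

```python
# Word count expressed as map / shuffle & sort / reduce.

from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # Map: parse an input record into intermediate (key, value) pairs
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):
    # Reduce: all values for one intermediate key arrive together
    return (key, sum(values))

records = ["the quick brown fox", "the lazy dog", "the fox"]

# Map step (one call per record; in Hadoop, one Mapper per block)
intermediate = [pair for rec in records for pair in map_fn(rec)]

# Shuffle & sort: group all like keys, in sorted key order
intermediate.sort(key=itemgetter(0))
results = [
    reduce_fn(key, [v for _, v in group])
    for key, group in groupby(intermediate, key=itemgetter(0))
]

print(results)
# -> [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```

In a real job the shuffle moves intermediate pairs across the network so that all values for a key land on the same Reducer; here `sort` plus `groupby` plays that role.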
Map Reduce
Reduce Task
• After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list
• This list is given to a Reducer
– There may be a single Reducer, or multiple Reducers
– This is specified as part of the job configuration (see later)
– All values associated with a particular intermediate key are guaranteed to go to the same Reducer
– The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
– This step is known as the ‘shuffle and sort’
• The Reducer outputs zero or more final key/value pairs
– These are written to HDFS
• In practice, the Reducer usually emits a single key/value pair for each input key
Fault Tolerance
• HDFS will only allocate jobs to active nodes
• MapReduce can compensate for slow-running jobs
– If a Mapper appears to be running significantly more slowly than the others, a new instance of the Mapper is started on another machine, operating on the same data
– The results of the first Mapper to finish are used
– Hadoop kills off the Mapper that is still running
• Yahoo experiences multiple failures (> 10) of various components (drives, cables, servers) every day
– Which have exactly zero impact on operations
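The duplicate-Mapper trick described above (known as speculative execution) can be sketched as follows. This Python model simulates it with made-up completion times rather than real scheduling; the function and parameter names are hypothetical:

```python
# Speculative execution, simulated: run a backup copy of a straggling
# task and keep whichever attempt finishes first. The slower attempt
# is killed off.

def run_with_speculation(task, primary_time, backup_time):
    """Return (winner, finish_time) for the attempt that completes first."""
    attempts = {"primary": primary_time, "backup": backup_time}
    winner = min(attempts, key=attempts.get)
    return winner, attempts[winner]

# The primary node is a straggler; the backup copy on another node wins,
# so the job finishes in 35s instead of 120s.
winner, t = run_with_speculation("map7", primary_time=120, backup_time=35)
print(winner, t)  # -> backup 35
```

Because Map tasks are deterministic functions of their input block, running the same task twice is safe, which is what makes this strategy possible.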
Ecosystem
Hadoop Ecosystem
Ecosystem Distribution by Role
(vendor-landscape diagram, built around Apache Hadoop)
• Roles include: consulting, IDE, reporting, analytics, distribution, monitoring, manageability, training, data integration, data visualization, UAP
Hadoop Components (hadoop.apache.org)
• HDFS – Hadoop Distributed File System
• MapReduce – framework for writing scalable data applications
• Pig – procedural language that abstracts lower-level MapReduce
• Zookeeper – highly reliable distributed coordination
• Hive – system for querying data and managing structured data built on top of HDFS (SQL-like query)
• HBase – database for random, real-time read/write access
• Oozie – workflow/coordination to manage jobs
• Mahout – scalable machine-learning libraries
Technology Adoption Lifecycle
(adoption-curve diagram: innovators/early adopters → early majority → late majority → laggards; “Today” sits with the innovators/early adopters)
HBase, Pig, Hive
HBase Overview
• HBase is a sparse, distributed, persistent, scalable, reliable multi-dimensional map, indexed by row key
– The Hadoop database, a “NoSQL” database
– Many relational features
– Scalable: Region Servers
– Multiple client access: Java, REST, Thrift
• What’s it good for?
– Queries against a number of rows that makes your Oracle server puke!
• HBase leverages HDFS for its storage
HBase in Practice
• High-performance, real-time query
• Client is typically a Java program
• But HBase supports many other APIs:
– JSON: JavaScript Object Notation
– REST: Representational State Transfer
– Thrift, Avro: serialization/RPC frameworks
HBase – Key/Value Store
• Excellent key-based access to a specific cell or to sequential cells of data
• Column-oriented architecture (like GPDB)
– Column families: related attributes often queried together
– Members are stored together
• Versioning of cells is used to provide update capability
– A change to an existing cell is stored as a new version by timestamp
• No transactional guarantee
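The versioned-cell idea above can be modeled with a dictionary of timestamped values. This is a toy Python illustration of the data model, not the HBase client API; the class and column names are hypothetical:

```python
# Toy model of HBase's versioned cells: a cell is addressed by
# (row key, "family:qualifier"), and each write adds a new version
# keyed by timestamp rather than overwriting in place.

class VersionedTable:
    def __init__(self):
        self.cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, value, ts):
        # An 'update' never overwrites; it stores a new version.
        self.cells.setdefault((row, column), {})[ts] = value

    def get(self, row, column):
        """Return the newest version of the cell."""
        versions = self.cells[(row, column)]
        return versions[max(versions)]

t = VersionedTable()
t.put("user42", "info:city", "Boston", ts=100)
t.put("user42", "info:city", "Seattle", ts=200)  # update = new version
print(t.get("user42", "info:city"))              # -> Seattle
print(len(t.cells[("user42", "info:city")]))     # -> 2 versions kept
```

Because reads simply return the highest-timestamped version, HBase gets update semantics on top of HDFS's write-once files without needing random writes.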
Hive
• Data-warehousing package built on top of Hadoop
• System for managing and querying structured data
– Leverages MapReduce for execution
– Utilizes HDFS (or HBase) for storage
• Data is stored in tables
– Consists of a separate schema metastore and data files
• HiveQL is a SQL-like language
– Queries are converted into MapReduce jobs
Hive – Basics & Syntax

-- Hive example
-- Tell Hive to use a local (non-HDFS) repository for MapReduce
hive> SET mapred.job.tracker=local;
hive> SET mapred.local.dir=/Users/hardar/Documents/training/HDWorkshop/labs/9.hive/data;
hive> SET hive.exec.mode.local.auto=false;

-- Create repository folders in HDFS (if not using local storage)
$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir /user/hive/warehouse
$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /user/hive/warehouse

-- Create an orders table
create table orders (orderid bigint, customerid bigint, productid int,
  qty int, rate int, estdlvdate string, status string)
  row format delimited fields terminated by ",";

-- Load data from the local file system
load data local inpath '9.hive/data/orders.txt' into table orders;

-- Query
select * from orders;

-- Create a products table
create table products (productid int, description string)
  row format delimited fields terminated by ",";

-- Load data from the local file system
load data local inpath '9.hive/data/products.txt' into table products;

-- Query
select * from products;
Pig
• Provides a mechanism for using MapReduce without programming in Java
– Utilizes HDFS & MapReduce
• Allows a more intuitive means of specifying data flows
– High-level, sequential data-flow language – Pig Latin
– Python integration
• Comfortable for researchers who are familiar with Perl & Python
• Pig is easier to learn & execute, but more limited in scope of functionality than Java
PIG – Basics & Syntax

-- file: demographic.pig
-- Extracts INCOME (in thousands) and ZIPCODE from census data; filters out zero incomes

-- Define a table and load directly from a local file
grunt> DEMO_TABLE = LOAD 'data/input/demo_sample.txt' USING PigStorage(',')
         AS (gender:chararray, age:int, income:int, zip:chararray);

-- Describe the schema
grunt> describe DEMO_TABLE;

-- Run an MR job to dump DEMO_TABLE (like SELECT *)
grunt> dump DEMO_TABLE;

-- Store the data in HDFS
grunt> store DEMO_TABLE into '/gphd/pig/DEMO_TABLE';
Others…….
Mahout
• Important stuff first: the most common pronunciation is “Ma-h-out” – rhymes with ‘trout’
• Machine-learning library that runs on HDFS
• 4 primary use cases:
– Recommendation mining – people who like X also like Y
– Clustering – topic-based association
– Classification – assign new docs to existing categories
– Frequent itemset mining – which things will appear together
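The "people who like X also like Y" use case can be sketched with a tiny co-occurrence recommender. This is an illustrative Python sketch of the idea behind recommendation mining, not Mahout's API, and the item names are made up:

```python
# Minimal co-occurrence recommender: rank items that appear alongside
# a given item across users' "baskets" of liked items.

from collections import Counter

def recommend(item, user_baskets, top_n=2):
    """Rank items that co-occur with `item` across user baskets."""
    cooc = Counter()
    for basket in user_baskets:
        if item in basket:
            cooc.update(x for x in basket if x != item)
    return [x for x, _ in cooc.most_common(top_n)]

baskets = [
    {"hadoop", "hive", "pig"},
    {"hadoop", "hive"},
    {"hadoop", "hbase"},
    {"hive", "pig"},
]
print(recommend("hadoop", baskets))  # 'hive' co-occurs twice, ranked first
```

Mahout's value is doing exactly this kind of counting as MapReduce jobs over item sets far too large for one machine's memory.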
Revolution Analytics R
• Statistical programming language integrated with Hadoop
– Open source & Revolution R Enterprise
– More than just counts and averages
• Ability to manipulate HDFS directly from R
• Mimics the Java APIs
Hadoop Use Cases
Hadoop Use Cases
• Internet
– Search index generation
– User engagement behavior
– Targeting / advertising optimizations
– Recommendations
• BioMed
– Computational biomedical systems
– Bioinformatics
– Data mining and genome analysis
• Financial
– Prediction models
– Fraud analysis
– Portfolio risk management
• Telecom
– Call data records
– Set-top & DVR streams
• Social
– Recommendations
– Network graphs
– Feed updates
• Enterprises
– Email analysis and image processing
– ETL
– Reporting & analytics
– Natural language processing
• Media/Newspapers
– Image conversions
• Agriculture
– Processing “agri” streams
• Image
– Geo-spatial processing
• Education
– Systems research
– Statistical analysis of data on the web
Greenplum Hadoop Customers
How our customers are using Hadoop
• Return Path – world leader in email certification & scoring
– Uses Hadoop & HBase to store & process ISP data
– Replaced Cloudera with Greenplum MR
• American Express – early stages of developing a Big Data analytics strategy
– Greenplum MR selected over Cloudera
– Chose GP because of EMC support & the existing relationship
• SunGard – IT company focusing on availability services
– Chose Greenplum MR as the platform for big-data-analytics-as-a-service
– Competes against AWS Elastic MapReduce
Major Telco: CDR Churn Analysis
• Business problem: construct a churn model to provide early detection of customers who are going to end their contracts
• Available data
– Dependent variable: did a customer leave in a 4-month period?
– Independent variables: various features of customer call history
– ~120,000 training data points, ~120,000 test data points
• First attempt
– Use R, specifically the Generalised Additive Models (GAM) package
– Quickly built a model that matched T-Mobile’s existing model
Challenges
Hadoop Pain Points
• Integrated product suite
– No integrated Hadoop stack: Hadoop, Pig, Hive, HBase, Zookeeper, Oozie, Mahout…
• Interoperability
– No industry-standard ETL and BI stack integration: Informatica, MicroStrategy, Business Objects…
• Monitoring
– Poor job and application monitoring solutions
– Non-existent performance monitoring
• Operability and manageability
– Complex system configuration and manageability
– No data-format interoperability & storage abstractions
• Performance
– Poor dimensional-lookup performance
– Very poor random-access and serving performance
Data Co-Processing (architecture diagram)
• Analytic productivity: applications, tools, Chorus
• Data computing interfaces: SQL, MapReduce, in-database analytics, parallel data loading (batch or real-time)
• Greenplum Database (SQL DB engine: compute + storage) and Hadoop (MapReduce engine: compute + storage), linked over the network by parallel data exchange
• All data types: unstructured, structured, temporal, geospatial, sensor, spatial
Questions…?
THANK YOU