TCloud Computing, Inc. Hadoop Product Family and Ecosystem

Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3


TCloud Computing, Inc.

Hadoop Product Family and Ecosystem

Agenda

• What is Big Data?

• Big Data Opportunities

• Hadoop

– Introduction to Hadoop

– Hadoop 2.0

– What’s next for Hadoop?

• Hadoop ecosystem

• Conclusion

What is Big Data?

A set of files | A database | A single file

Big data expands on four fronts:

• Volume: MB → GB → TB → PB

• Velocity: batch → periodic → near real-time → real-time

• Variety

• Veracity

http://whatis.techtarget.com/definition/3Vs

Big Data Opportunities

http://www.sap.com/corporate-en/news.epx?PressID=21316

Big Data Revenue by Market Segment 2012


http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2012-2017

Big Data Market Forecast 2012-2017


http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2012-2017

Hadoop Solutions

The most common problems Hadoop can solve

Threat Analysis/Trade Surveillance

• Challenge:

– Detecting threats in the form of fraudulent activity or attacks

• Large data volumes involved

• Like looking for a needle in a haystack

• Solution with Hadoop:

– Parallel processing over huge datasets

– Pattern recognition to identify anomalies, i.e., threats

• Typical Industry:

– Security, Financial Services

Recommendation Engine

• Challenge:

– Using user data to predict which products to recommend

• Solution with Hadoop:

– Batch processing framework

• Allows execution in parallel over large datasets

– Collaborative filtering

• Collecting ‘taste’ information from many users

• Utilizing information to predict what similar users like

• Typical Industry

– ISP, Advertising
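The collaborative-filtering idea above can be sketched in a few lines of plain Java. This is illustrative only: the class and method names are made up, and real systems (e.g. Mahout, covered later) do this at scale over Hadoop.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of collaborative filtering: represent each user's 'taste'
// as sparse item ratings and score similarity between users. Users with
// high similarity are a source of recommendations for each other.
public class TasteSketch {

    // Cosine similarity between two sparse rating vectors.
    public static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) normB += v * v;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Double> alice = new HashMap<>();
        alice.put("beer", 5.0); alice.put("diapers", 4.0);
        Map<String, Double> bob = new HashMap<>();
        bob.put("beer", 4.0); bob.put("diapers", 5.0);
        Map<String, Double> carol = new HashMap<>();
        carol.put("wine", 5.0);
        System.out.printf("alice~bob:   %.2f%n", similarity(alice, bob));
        System.out.printf("alice~carol: %.2f%n", similarity(alice, carol));
    }
}
```

In a Hadoop deployment this pairwise computation would run as a batch MapReduce job over all users, which is exactly the parallel-over-large-datasets framing of the slide.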

Walmart Case

(Diagram: beer and diapers sell together on Fridays, pointing to a cross-selling revenue opportunity.)

http://tech.naver.jp/blog/?p=2412

Hadoop!

• Apache Hadoop project

– inspired by Google's MapReduce and Google File System papers.

• An open-source, flexible, and available architecture for large-scale computation and data processing on a network of commodity hardware

• Open-source software + commodity hardware

– IT cost reduction

Hadoop Concepts

• Distribute the data as it is initially stored in the system

• Moving Computation is Cheaper than Moving Data

• Individual nodes can work on data local to those nodes

• Users can focus on developing applications.

Hadoop 2.0

• Hadoop 2.2.0 is expected to go GA in fall 2013

• HDFS Federation

• HDFS High Availability (HA)

• Hadoop YARN (MapReduce 2.0)

HDFS Federation - Limitation of Hadoop 1.0

• Scalability

– Storage scales horizontally - namespace doesn’t

• Performance

– File system operations throughput limited by a single node

• Poor isolation

– All the tenants share a single namespace

HDFS Federation

• Multiple independent NameNodes and Namespace Volumes in a cluster

– Namespace Volume = Namespace + Block Pool

• Block Storage as generic storage service

– Set of blocks for a Namespace Volume is called a Block Pool

– DataNodes store blocks for all the Namespace Volumes – no partitioning
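The federation bookkeeping above can be sketched as a toy data structure (all names here are hypothetical, not the HDFS implementation): a DataNode keeps blocks for every block pool, keyed by pool id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of HDFS Federation on a DataNode: one physical node stores
// blocks for every Namespace Volume, keyed by block-pool id, with no
// partitioning of nodes between namespaces.
public class DataNodeSketch {
    // blockPoolId -> block ids stored on this DataNode
    private final Map<String, List<Long>> blocksByPool = new HashMap<>();

    public void store(String blockPoolId, long blockId) {
        blocksByPool.computeIfAbsent(blockPoolId, k -> new ArrayList<>()).add(blockId);
    }

    public int blockCount(String blockPoolId) {
        return blocksByPool.getOrDefault(blockPoolId, List.of()).size();
    }

    public static void main(String[] args) {
        DataNodeSketch dn = new DataNodeSketch();
        // The same DataNode serves blocks of two independent namespaces.
        dn.store("BP-home", 1L);
        dn.store("BP-home", 2L);
        dn.store("BP-hbase", 3L);
        System.out.println(dn.blockCount("BP-home") + " blocks for BP-home");
    }
}
```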

HDFS Federation

Hadoop 1.0 vs. Hadoop 2.0

http://hortonworks.com/blog/an-introduction-to-hdfs-federation/

(Example namespace volumes: /home, /app/Hive, /app/HBase)

HDFS High Availability (HA)

• The Secondary NameNode is not a standby NameNode

• http://www.youtube.com/watch?v=hEqQMLSXQlY

HDFS High Availability (HA)

https://issues.apache.org/jira/browse/HDFS-1623

Why do we need YARN

• Scalability

– Maximum Cluster size – 4,000 nodes

– Maximum concurrent tasks – 40,000

• Single point of failure

– Failure kills all queued and running jobs

• Lacks support for alternate paradigms

– Iterative applications implemented using MapReduce are 10x slower

– Example: K-Means, PageRank

Hadoop YARN

http://hortonworks.com/hadoop/yarn/

Role of YARN

• Resource Manager

– Per-cluster

– Global resource scheduler

– Hierarchical queues

• Node Manager

– Per-machine agent

– Manages the life-cycle of container

– Container resource monitoring

• Application Master

– Per-application

– Manages application scheduling and task execution

– E.g. MapReduce Application Master

JobTracker → ResourceManager + ApplicationMaster

Hadoop YARN architecture

http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/

• Container

– Basic unit of allocation

– Ex. Container A = 2GB, 1CPU

– Fine-grained resource allocation

– Replace the fixed map/reduce slots
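A toy sketch of the container model above (all names hypothetical, not the YARN API): a request names arbitrary memory and CPU amounts, and a node grants it only if capacity remains, instead of occupying a fixed map or reduce slot.

```java
// Toy sketch of YARN-style fine-grained allocation: a container request
// carries its own memory/CPU amounts, and a node tracks remaining capacity.
public class ContainerSketch {
    int freeMemGb;
    int freeVcores;

    ContainerSketch(int memGb, int vcores) {
        this.freeMemGb = memGb;
        this.freeVcores = vcores;
    }

    // Returns true and reserves the resources if the request fits.
    boolean allocate(int memGb, int vcores) {
        if (memGb > freeMemGb || vcores > freeVcores) return false;
        freeMemGb -= memGb;
        freeVcores -= vcores;
        return true;
    }

    public static void main(String[] args) {
        ContainerSketch node = new ContainerSketch(8, 4);
        System.out.println(node.allocate(2, 1));  // Container A = 2GB, 1CPU
        System.out.println(node.allocate(16, 1)); // too big for this node
    }
}
```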

What’s next for Hadoop?

• Real-time

– Apache Tez

• Part of Stinger

– Spark

• SQL in Hadoop

– Stinger

• An immediate aim of a 100x performance increase for Hive, more ambitious than any other effort

• Based on industry standard SQL, the Stinger Initiative improves HiveQL to deliver SQL compatibility.

– Shark

What’s next for Hadoop?

• Security: Data encryption

– hadoop-9331: Hadoop crypto codec framework and crypto codec implementations

• hadoop-9332: Crypto codec implementations for AES

• hadoop-9333: Hadoop crypto codec framework based on compression codec

• mapreduce-5025: Key Distribution and Management for supporting crypto codec in Map Reduce

• 2013/09/28 Hadoop in Taiwan 2013

– Hadoop Security: Now and future

– Session B, 16:00~16:40

The Hadoop Ecosystems

Growing Hadoop Ecosystem

• The term ‘Hadoop’ is taken to be the combination of HDFS and MapReduce

• There are numerous other projects surrounding Hadoop

– Typically referred to as the ‘Hadoop Ecosystem’

• Zookeeper

• Hive and Pig

• HBase

• Flume

• Other Ecosystem Projects

– Sqoop

– Oozie

– Mahout

The Ecosystem is the System

• Hadoop has become the kernel of the distributed operating system for Big Data

• No one uses the kernel alone

• A collection of projects at Apache

Relation Map

• MapReduce Runtime (Dist. Programming Framework)

• Hadoop Distributed File System (HDFS)

• HBase (Column NoSQL DB)

• Sqoop/Flume (Data Integration)

• Oozie (Job Workflow & Scheduling)

• Pig/Hive (Analytical Language)

• Mahout (Data Mining)

• YARN

• ZooKeeper (Coordination)

• Tez (near real-time processing)

• Spark (in-memory)

• Shark

ZooKeeper – Coordination Framework


What is ZooKeeper

• A centralized service for

– Maintaining configuration information

– Providing distributed synchronization

• A set of tools to build distributed applications that can safely handle partial failures

• ZooKeeper was designed to store coordination data

– Status information

– Configuration

– Location information

Why use ZooKeeper?

• Manage configuration across nodes

• Implement reliable messaging

• Implement redundant services

• Synchronize process execution

ZooKeeper Architecture

– All servers store a copy of the data (in memory)

– A leader is elected at startup

– 2 roles – leader and follower

• Followers service clients; all updates go through the leader

• Update responses are sent when a majority of servers have persisted the change

– HA support
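The majority rule above reduces to a one-line check (a sketch, not ZooKeeper code): any two majorities of the ensemble overlap, so an acknowledged change survives minority failures.

```java
// Toy sketch of the ZooKeeper quorum rule: an update is acknowledged only
// once a strict majority of the ensemble has persisted it.
public class QuorumSketch {
    public static boolean committed(int acks, int ensembleSize) {
        return acks > ensembleSize / 2;
    }

    public static void main(String[] args) {
        System.out.println(committed(3, 5)); // majority of 5
        System.out.println(committed(2, 5)); // not a majority
    }
}
```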

HBase – Column NoSQL DB


Structured data vs. raw data

• Apache open source project

• Inspired by Google BigTable

• Non-relational, distributed database written in Java

• Coordinated by Zookeeper

Row & Column Oriented

HBase – Data Model

• Cells are “versioned”

• Table rows are sorted by row key

• Region – a row range [start-key:end-key]
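The sorted-by-row-key model above can be mimicked with a sorted map (a toy sketch, not the HBase API): a region scan is then simply a range query [start-key, end-key) over the key order.

```java
import java.util.TreeMap;

// Toy sketch of the HBase data model: rows kept sorted by row key, with a
// region lookup expressed as a half-open range over that order.
public class RegionSketch {
    private final TreeMap<String, String> rows = new TreeMap<>();

    void put(String rowKey, String value) { rows.put(rowKey, value); }

    // Number of rows falling in the region [startKey, endKey).
    int rowsInRegion(String startKey, String endKey) {
        return rows.subMap(startKey, endKey).size();
    }

    public static void main(String[] args) {
        RegionSketch t = new RegionSketch();
        t.put("row-a", "1");
        t.put("row-b", "2");
        t.put("row-z", "3");
        System.out.println(t.rowsInRegion("row-a", "row-c")); // row-a, row-b
    }
}
```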

When to use HBase

• Need random, low latency access to the data

• Application has a flexible schema where each row is slightly different

– Add columns on the fly

• Most columns are NULL in each row

Flume / Sqoop – Data Integration Framework


What's the problem with data collection?

• Data collection is currently a priori and ad hoc

• A priori – decide what you want to collect ahead of time

• Ad hoc – each kind of data source goes through its own collection path

What is Flume (and how can it help?)

• A distributed data collection service

• Efficiently collects, aggregates, and moves large amounts of data

• Fault tolerant, with many failover and recovery mechanisms

• One-stop solution for data collection of all formats
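The collection path described above is typically wired up in a Flume agent's properties file. A minimal sketch, assuming the classic netcat source, memory channel, and logger sink; the agent and component names (a1, r1, c1, k1) and the port are placeholders.

```properties
# One agent: one source feeding one channel feeding one sink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: read events from a TCP port (illustrative values).
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink.
a1.channels.c1.type = memory

# Sink: log events (in production, an HDFS sink would land data in Hadoop).
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```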

An example flow

Sqoop

• Easy, parallel database import/export

• What do you want to do?

– Import data from an RDBMS into HDFS

– Export data from HDFS back into an RDBMS

What is Sqoop

• A suite of tools that connect Hadoop and database systems

• Import tables from databases into HDFS for deep analysis

• Export MapReduce results back to a database for presentation to end-users

• Provides the ability to import from SQL databases straight into your Hive data warehouse

How Sqoop helps

• The Problem

– Structured data in traditional databases cannot be easily combined with complex data stored in HDFS

• Sqoop (SQL-to-Hadoop)

– Easy import of data from many databases to HDFS

– Generate code for use in MapReduce applications

Why Sqoop

• JDBC-based implementation

– Works with many popular database vendors

• Auto-generation of tedious user-side code

– Write MapReduce applications to work with your data, faster

• Integration with Hive

– Allows you to stay in a SQL-based environment

Pig / Hive – Analytical Language


Why Hive and Pig?

• Although MapReduce is very powerful, it can also be complex to master

• Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code

• Many organizations have programmers who are skilled at writing code in scripting languages

• Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce

– Hive was initially developed at Facebook, Pig at Yahoo!

Hive – Developed by Facebook

• What is Hive?

– An SQL-like interface to Hadoop

• Data Warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop

– MapReduce for execution

– HDFS for storage

• Hive Query Language

– Basic-SQL : Select, From, Join, Group-By

– Equi-Join, Multi-Table Insert, Multi-Group-By

– Batch query

SELECT storeid, SUM(price) FROM purchases WHERE price > 100 GROUP BY storeid

Hive/MR vs. Hive/Tez

http://www.slideshare.net/adammuise/2013-jul-23thughivetuningdeepdive

Pig

• A high-level scripting language (Pig Latin)

• Process data one step at a time

• A simple way to write MapReduce programs

• Easy to understand

• Easy to debug

A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';

– Initiated by Yahoo!

Hive vs. Pig

                      Hive                        Pig
Language              HiveQL (SQL-like)           Pig Latin, a scripting language
Schema                Table definitions that      A schema is optionally
                      are stored in a metastore   defined at runtime
Programmatic Access   JDBC, ODBC                  PigServer

WordCount Example

• Input:

Hello World Bye World
Hello Hadoop Goodbye Hadoop

• For the given sample input the map emits:

<Hello, 1> <World, 1> <Bye, 1> <World, 1>
<Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

• The reduce just sums up the values:

<Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>

WordCount Example In MapReduce

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

private final static IntWritable one = new IntWritable(1);

private Text word = new Text();

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

String line = value.toString();

StringTokenizer tokenizer = new StringTokenizer(line);

while (tokenizer.hasMoreTokens()) {

word.set(tokenizer.nextToken());

context.write(word, one);

}

}

}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

public void reduce(Text key, Iterable<IntWritable> values, Context context)

throws IOException, InterruptedException {

int sum = 0;

for (IntWritable val : values) {

sum += val.get();

}

context.write(key, new IntWritable(sum));

}

}

public static void main(String[] args) throws Exception {

Configuration conf = new Configuration();

Job job = new Job(conf, "wordcount");

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(IntWritable.class);

job.setMapperClass(Map.class);

job.setReducerClass(Reduce.class);

job.setInputFormatClass(TextInputFormat.class);

job.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.addInputPath(job, new Path(args[0]));

FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.waitForCompletion(true);

}

}

WordCount Example By Pig

A = LOAD 'wordcount/input' USING PigStorage() AS (token:chararray);

B = GROUP A BY token;

C = FOREACH B GENERATE group, COUNT(A) as count;

DUMP C;

WordCount Example By Hive

CREATE TABLE wordcount (token STRING);

LOAD DATA LOCAL INPATH 'wordcount/input'

OVERWRITE INTO TABLE wordcount;

SELECT token, count(*) FROM wordcount GROUP BY token;

Spark / Shark - Analytical Language


Why Spark?

• MapReduce is too slow

• Aims to make data analytics fast — both fast to run and fast to write.

• When you have the request: iterative algorithms

What is Spark?

• In-memory distributed computing framework

• Created by UC Berkeley AMP Lab in 2010

• Targets problems that Hadoop MR is bad at

– Iterative algorithm (Machine Learning )

– Interactive data mining

• More general purpose than Hadoop MR

• Active contributions from ~15 companies
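The iterative-algorithm gap above can be illustrated with a toy in plain Java (this is an analogy, not the Spark API): an iterative job reuses the same dataset every pass, so loading it once and keeping it in memory avoids re-reading it from storage on each of N iterations, which is what chained MapReduce jobs effectively do.

```java
import java.util.List;

// Toy contrast between re-reading input per iteration (MapReduce-style)
// and caching it in memory across iterations (Spark-style).
public class IterativeSketch {
    static int reads = 0;

    static List<Integer> loadFromStorage() { // stands in for an HDFS read
        reads++;
        return List.of(1, 2, 3, 4);
    }

    // MapReduce style: every iteration goes back to storage.
    static long sumWithReload(int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++)
            for (int v : loadFromStorage()) total += v;
        return total;
    }

    // Spark style: load once, cache, iterate in memory.
    static long sumWithCache(int iterations) {
        List<Integer> cached = loadFromStorage();
        long total = 0;
        for (int i = 0; i < iterations; i++)
            for (int v : cached) total += v;
        return total;
    }

    public static void main(String[] args) {
        sumWithReload(10);
        int reloadReads = reads;
        reads = 0;
        sumWithCache(10);
        System.out.println(reloadReads + " reads vs " + reads + " read");
    }
}
```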

BDAS, the Berkeley Data Analytics Stack

https://amplab.cs.berkeley.edu/software/

What's different between Hadoop and Spark

(Diagram: Hadoop chains Map and Reduce stages, writing to HDFS between jobs; Spark composes transformations such as Map(), Join(), and Cache() over in-memory data from multiple sources.)

http://spark.incubator.apache.org

What is Shark

• A data analytic (warehouse) system that

– Ports Apache Hive to run on Spark

– Is compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)

– Delivers speedups of up to 40x over Hive

– Scales out and is fault-tolerant

– Supports low-latency, interactive queries through in-memory computing

Shark Architecture

(Diagram: CLI and Thrift/JDBC clients go through a Driver to a SQL Parser, Query Optimizer, and Physical Plan stage; execution runs on Spark with a Cache Manager, using the Hive Meta Store and HDFS/HBase for storage.)

Oozie – Job Workflow & Scheduling


What is Oozie?

• A Java Web Application

• Oozie is a workflow scheduler for Hadoop

• Cron for Hadoop

(Diagram: an example workflow DAG of Jobs 1 through 5.)

Why Oozie?

• Why use Oozie instead of just cascading jobs one after another?

• Major flexibility

– Start, Stop, Suspend, and re-run jobs

• Oozie allows you to restart from a failure

– You can tell Oozie to restart a job from a specific node in the graph or to skip specific failed nodes

How it triggered

• Time

– Execute your workflow every 15 minutes

• Time and Data

– Materialize your workflow every hour, but only run them when the input data is ready.

(Diagram: time-only triggers fire every 15 minutes; time-and-data triggers materialize hourly but each run waits on "input data exists?" in Hadoop.)
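The time-and-data trigger described above is expressed as an Oozie coordinator application. A rough sketch of its shape; the paths, names, dates, and dataset layout here are illustrative, not from the original deck.

```xml
<coordinator-app name="hourly-wf" frequency="${coord:hours(1)}"
                 start="2013-09-01T00:00Z" end="2013-09-02T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <!-- The input dataset materializes once per hour. -->
    <dataset name="input" frequency="${coord:hours(1)}"
             initial-instance="2013-09-01T00:00Z" timezone="UTC">
      <uri-template>hdfs:///data/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <!-- The hourly run only fires when the current instance exists. -->
    <data-in name="ready" dataset="input">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///apps/my-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>
```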

Oozie use criteria

• Need to launch, control, and monitor jobs from your Java apps

– Java Client API/Command Line Interface

• Need to control jobs from anywhere

– Web Service API

• Have jobs that you need to run every hour, day, or week

• Need to receive a notification when a job is done

– Email when a job is complete

Mahout – Data Mining


What is Mahout?

• Machine-learning tool

• Distributed and scalable machine learning algorithms on the Hadoop platform

• Makes building intelligent applications easier and faster

Why Mahout?

• Current state of ML libraries

– Lack Community

– Lack Documentation and Examples

– Lack Scalability

– Are Research oriented

Mahout – scale

• Scale to large datasets

– Hadoop MapReduce implementations that scale linearly with data

• Scalable to support your business case

– Mahout is distributed under a commercially friendly Apache Software license

• Scalable community

– Vibrant, responsive and diverse

Mahout – four use cases

• Mahout machine learning algorithms

– Recommendation mining: takes users' behavior and finds items a given user might like

– Clustering: takes e.g. text documents and groups them based on related document topics

– Classification: learns from existing categorized documents what documents of a specific category look like, and assigns unlabeled documents to the appropriate category

– Frequent itemset mining: takes a set of item groups (e.g. terms in a query session, shopping-cart contents) and identifies which individual items typically appear together

Use case Example

• Predict what the user likes based on

– His/Her historical behavior

– The aggregate behavior of users similar to him/her

Conclusion

• Big Data Opportunities

– The market is still growing

• Hadoop 2.0

– Federation

– HA

– YARN

• What’s next for Hadoop

– Real-time query

– Data encryption

• What other projects are included in the Hadoop ecosystem

– Different project for different purpose

– Choose right tools for your needs

Recap – Hadoop Ecosystem

• MapReduce Runtime (Dist. Programming Framework)

• Hadoop Distributed File System (HDFS)

• HBase (Column NoSQL DB)

• Sqoop/Flume (Data Integration)

• Oozie (Job Workflow & Scheduling)

• Pig/Hive (Analytical Language)

• Mahout (Data Mining)

• YARN

• ZooKeeper (Coordination)

• Tez (near real-time processing)

• Spark (in-memory)

• Shark

Questions?

Thank you!