Apache Hadoop for Big Science
History, Use cases & Futures
Eric Baldeschwieler, “Eric14”
Hortonworks CTO
@jeric14
Agenda
• What is Apache Hadoop
• Project motivation & history
• Use cases
• Futures and observations
What is Apache Hadoop?
Traditional data systems vs. Hadoop
Traditional data systems
• Limited scaling options
• Expensive at scale
• Complex components
• Proprietary software
• Reliability in hardware
• Optimized for latency, IOPs

Hadoop cluster
• Low-cost scale-out
• Commodity components
• Open source software
• Reliability in software
• Optimized for throughput

When your data infrastructure does not scale … Hadoop
Apache Hadoop: Big Data Platform

Open source data management with scale-out storage & distributed processing

Storage: HDFS
• Distributed across a cluster
• Natively redundant, self-healing
• Very high bandwidth

Processing: MapReduce
• Splits a job into small tasks and moves compute “near” the data
• Self-healing
• Simple programming model
Key characteristics
• Scalable
  – Efficiently store and process petabytes of data
  – Scale out linearly by adding nodes (node == commodity computer)
• Reliable
  – Data replicated 3x
  – Failover across nodes and racks
• Flexible
  – Store all types of data in any format
• Economical
  – Commodity hardware
  – Open source software (via ASF)
  – No vendor lock-in
Hadoop’s cost advantage
(From Richard McDougall, VMware, Hadoop Summit 2012 talk)

• SAN storage: $2 – $10/gigabyte. $1M gets 0.5 petabytes, 1,000,000 IOPS, 1 GByte/sec
• NAS filers: $1 – $5/gigabyte. $1M gets 1 petabyte, 400,000 IOPS, 2 GBytes/sec
• Local storage: $0.05/gigabyte. $1M gets 20 petabytes, 10,000,000 IOPS, 800 GBytes/sec

“And you get racks of free computers when you buy storage!” – Eric14
Hadoop hardware
• 10 to 4,500 node clusters
  – 1-4 “master nodes”
  – Interchangeable workers
• Typical node
  – 4-12 × 2-4TB SATA drives
  – 64GB RAM
  – 2 × 4-8 core CPUs, ~2GHz
  – 2 × 1Gb NICs
  – Single power supply
  – JBOD, not RAID, …
• Switches
  – 1-2 Gb to the node
  – ~20 Gb to the core
  – Full bisection bandwidth
  – Layer 2 or 3, simple
Zooming out: An Apache Hadoop Platform

HORTONWORKS DATA PLATFORM (HDP)
• Platform services: enterprise readiness (HA, DR, snapshots, security, …)
• Hadoop core: distributed storage & processing (HDFS, MapReduce)
• Data services: store, process and access data (HCatalog, Hive, Pig, HBase, Sqoop, Flume)
• Operational services: manage & operate at scale (Oozie, Ambari)

Deployable as an appliance, in the cloud, or on an OS / VM
Zooming out: A Big Data Architecture

• Data sources
  – Traditional sources (RDBMS, OLTP, OLAP), POS systems, mobile data
  – New sources (web logs, email, sensor data, social media)
• Data systems
  – Traditional repos: RDBMS, EDW, MPP
  – HORTONWORKS DATA PLATFORM
• Applications
  – Business analytics, custom applications, packaged applications
• Operational tools: manage & monitor
• Dev & data tools: build & test
Motivation and History

(Timeline chart, 2007 – 2010; credit: The Datagraph Blog)
Eric Baldeschwieler - CTO Hortonworks
• 2011 – now: Hortonworks, CTO
• 2006 – 2011: Yahoo!, VP Engineering, Hadoop
• 2003 – 2005: Yahoo!, Web Search Engineering
  – Built systems that crawl & index the web
• 1996 – 2003: Inktomi, Web Search Engineering
  – Built systems that crawl & index the web
• Previously
  – UC Berkeley – Masters CS
  – Video game development
  – Digital video & 3D rendering software
  – Carnegie Mellon – BS Math/CS
Early history

• 1995 – 2005
  – Yahoo! search team builds 4+ generations of systems to crawl & index the WWW. 20 billion pages!
• 2004
  – Google publishes Google File System & MapReduce papers
• 2005
  – Doug Cutting builds Nutch DFS & MapReduce, joins Yahoo!
  – Yahoo! search commits to build open source DFS & MapReduce
    – Compete / differentiate via open source contribution!
    – Attract scientists; become a known center of big data excellence
    – Avoid building proprietary systems that will be obsolesced
    – Gain leverage of a wider community building one infrastructure
• 2006
  – Hadoop is born!
  – Dedicated team under E14 staffed at Yahoo!
  – Nutch prototype used to seed the new Apache Hadoop project
Hadoop at Yahoo!
Source: http://developer.yahoo.com/blogs/ydn/posts/2013/02/hadoop-at-yahoo-more-than-ever-before/
Hortonworks – 100% Open Source
• We distribute the only 100% Open Source Enterprise Hadoop Distribution: Hortonworks Data Platform
• We engineer, test & certify HDP for enterprise usage
• We employ the core architects, builders and operators of Apache Hadoop
• We drive innovation within Apache Software Foundation projects
• We are uniquely positioned to deliver the highest quality of Hadoop support
• We enable the ecosystem to work better with Hadoop
We develop, distribute and support the ONLY 100% open source Enterprise Hadoop distribution
Endorsed by Strategic Partners
Headquarters: Palo Alto, CA
Employees: 200+ and growing
Investors: Benchmark, Index, Yahoo
CASE STUDY: YAHOO SEARCH ASSIST™ (© Yahoo 2011)

                   Before Hadoop   After Hadoop
Time               26 days         20 minutes
Language           C++             Python
Development time   2-3 weeks       2-3 days

• Database for Search Assist™ is built using Apache Hadoop
• Several years of log data
• 20 steps of MapReduce
Apache Hadoop Ecosystem History

• Early adopters scale and productize Hadoop (2006 – present)
• Other internet companies add tools / frameworks, enhance Hadoop (2008 – present) …
• Service providers offer training, support, hosting (2010 – present): Cloudera, MapR, Microsoft, IBM, EMC, Oracle, …
• Wide adoption funds further development and enhancements (2011 – present)
Use cases
Use case: full genome sequencing

• The data
  – 1 full genome = 1TB (raw, uncompressed)
  – 1M people sequenced = 1 exabyte
  – Cost per person = $1,000 and continues to drop
• Uses for Hadoop
  – Large-scale compute applications
    – Map NGS data (“reads”) to a reference genome
    – Used for drug development, personalized treatment
    – Community-developed Hadoop-based software for gene matching: CloudBurst, Crossbow
  – Store, manage and share genomics data in the bioinformatics community

See: http://hortonworks.com/blog/big-data-in-genomics-and-cancer-treatment
Use case: oil & gas

• Digital oil field
  – Data sizes: 2+ TB / day
  – Application: safety/security, improve field performance
  – Hadoop used for data storage and analytics
• Seismic image processing
  – Drill ship costs $1M/day
  – One “shot” (in SEGY format) contains ~2.5GB
  – Hadoop used to parallelize computation and store data post-processing
    – Previously, data was discarded immediately after processing!
    – Now kept for reprocessing and R&D
Use case: high-energy physics

• Collecting events from colliders
  – “We have a very big digital camera”; each “event” = ~1MB
  – Looking for rare events (need millions of events for statistical significance)
• Typical task: scan through events and look for particles with a certain mass
  – Analyze millions of events in parallel
  – Hadoop used in streaming mode with C++ code to analyze events
• HDFS used for low-cost storage
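For concreteness, a minimal sketch of how such a streaming job could be launched from Python. The `filter_events` C++ binary, its mass-window flags, the jar path and the HDFS paths are all hypothetical; only the streaming options themselves (`-files`, `-input`, `-output`, `-mapper`, `-reducer`) are standard Hadoop Streaming flags.

```python
# Hypothetical launcher for a Hadoop Streaming job whose mapper is a
# compiled C++ event filter. Binary name and paths are illustrative only.
import subprocess

subprocess.run([
    "hadoop", "jar", "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar",
    "-files", "filter_events",              # ship the C++ binary to every node
    "-input", "/data/collider/events/",     # raw event records in HDFS
    "-output", "/user/physics/candidates",
    "-mapper", "./filter_events --min-mass 120 --max-mass 130",
    "-reducer", "NONE",                     # pure scan/filter, no aggregation
], check=True)
```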
http://www.linuxjournal.com/content/the-large-hadron-collider
Use case: National Climate Assessment

• Rapid, flexible, and open source big data technologies for the U.S. National Climate Assessment
  – Chris A. Mattmann, Senior Computer Scientist, NASA JPL
  – Chris and team have done a number of projects with Hadoop
• Goal
  – Compare regional climate models to a variety of satellite observations
  – Traditionally, models are compared to other models, not to actual observations
  – Normalize complex multi-format data to lat/long + observation values
• Hadoop
  – Used Apache Hive to provide a scale-out SQL warehouse of the data
  – See the paper, or the case study in “Programming Hive” (O’Reilly, 2012)

(Image credit: Kathy Jacobs)
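As a hedged sketch of that Hive usage (table, column and variable names are invented here, not taken from the NCA project), the normalized lat/long + observation data could be defined and aggregated like this through the standard Hive CLI:

```python
# Illustrative only: a Hive table of normalized observations and a
# per-grid-cell aggregation, run via the Hive CLI's -e flag.
import subprocess

hql = """
CREATE TABLE IF NOT EXISTS climate_obs (
  lat DOUBLE, lon DOUBLE, obs_time TIMESTAMP,
  variable STRING, value DOUBLE);

SELECT floor(lat) AS lat_bin, floor(lon) AS lon_bin, avg(value) AS mean_value
FROM climate_obs
WHERE variable = 'surface_temp_k'
GROUP BY floor(lat), floor(lon);
"""
subprocess.run(["hive", "-e", hql], check=True)
```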
Apache Hadoop: Patterns of Use

Big Data: Transactions + Interactions + Observations

Refine, Explore, Enrich
Operational Data Refinery (Refine): Hadoop as a platform for ETL modernization

Capture
• Capture new unstructured data and log files alongside existing sources
• Retain inputs in raw form for audit and continuity purposes

Process
• Parse & cleanse the data
• Apply structure and definition
• Join datasets together across disparate data sources

Exchange
• Push to the existing enterprise data warehouse for downstream consumption
• Feed operational reporting and online systems

(Diagram: unstructured, log and DB data is captured and archived, parsed & cleansed, structured and joined in the refinery, then uploaded to the EDW)
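A minimal sketch of the parse & cleanse step, written as a Hadoop Streaming mapper in Python; the combined-log input format and the selected fields are assumptions for illustration:

```python
#!/usr/bin/env python
# Streaming mapper: turn raw web-server log lines into tab-separated
# records that can be joined with warehouse tables downstream.
import sys

for line in sys.stdin:
    parts = line.split()
    if len(parts) < 10:
        continue                              # cleanse: drop malformed lines
    ip, ts = parts[0], parts[3].lstrip("[")
    method, url, status = parts[5].lstrip('"'), parts[6], parts[8]
    if not status.isdigit():
        continue                              # cleanse: drop garbage statuses
    print("\t".join([ip, ts, method, url, status]))   # apply structure
```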
Big Data Exploration (Explore): Hadoop as an agile, ad-hoc data mart

Capture
• Capture multi-structured data and retain inputs in raw form for iterative analysis

Process
• Parse the data into a queryable format
• Explore & analyze using Hive, Pig, Mahout and other tools to discover value
• Label data and type information for compatibility and later discovery
• Pre-compute stats, groupings, patterns in data to accelerate analysis

Exchange
• Use visualization tools to facilitate exploration and find key insights
• Optionally move actionable insights into an EDW or datamart

(Diagram: unstructured, log and DB data is captured and archived, categorized into tables, structured and joined, then queried over JDBC / ODBC from visualization tools; the EDW / datamart hand-off is optional)
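To illustrate the JDBC/ODBC hand-off in the diagram, here is a hedged sketch using the third-party PyHive package against HiveServer2; the host and table names are invented:

```python
# Ad-hoc exploration: run a Hive query and pull results into Python.
from pyhive import hive   # third-party package: pip install pyhive

conn = hive.connect(host="hive-gateway", port=10000)  # HiveServer2
cur = conn.cursor()
cur.execute("""
    SELECT url, count(*) AS hits
    FROM raw_clicks
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 20
""")
for url, hits in cur.fetchall():
    print(url, hits)
conn.close()
```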
Application Enrichment (Enrich): deliver Hadoop analysis to online apps

Capture
• Capture data that was once too bulky and unmanageable

Process
• Uncover aggregate characteristics across data
• Use Hive, Pig and MapReduce to identify patterns
• Filter useful data from mass streams (Pig)
• Micro- or macro-batch oriented schedules

Exchange
• Push results to HBase or another NoSQL alternative for real-time delivery
• Use patterns to deliver the right content/offer to the right person at the right time

(Diagram: unstructured, log and DB data is captured, parsed and derived/filtered on scheduled and near-real-time cycles, then served from NoSQL / HBase at low latency)
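A minimal sketch of that exchange step using the third-party happybase client to push precomputed results into HBase for low-latency serving; the table and column names are hypothetical:

```python
# Push per-user recommendations into HBase via the Thrift gateway.
import happybase   # third-party package: pip install happybase

conn = happybase.Connection("hbase-gateway")     # Thrift server host
table = conn.table("user_recommendations")
table.put(b"user42", {                           # row key = user id
    b"rec:item_1001": b"0.91",                   # column family "rec"
    b"rec:item_2002": b"0.87",
})
conn.close()
```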
CASE STUDY: YAHOO! HOMEPAGE (© Yahoo 2011)

• Personalized for each visitor
• Result: twice the engagement
  – +160% clicks vs. one-size-fits-all
  – +79% clicks vs. randomly selected
  – +43% clicks vs. editor selected
• Recommended links, News Interests, Top Searches
CASE STUDY: YAHOO! HOMEPAGE (© Yahoo 2011)

• Serving maps (users to interests), rebuilt every five minutes from user behavior on the production Hadoop cluster
• Categorization models, rebuilt weekly on the science Hadoop cluster
  » Identify user interests using categorization models
  » Machine learning to build ever-better categorization models
• Serving systems build customized home pages with the latest data (thousands / second)
Futures & observations
Hadoop 2.0 Innovations – YARN

• Focus on scale and innovation
  – Support 10,000+ computer clusters
  – Extensible to encourage innovation
• Next-generation execution
  – Improves MapReduce performance
• Supports new frameworks beyond MapReduce
  – Do more with a single Hadoop cluster
  – Low latency, streaming, services
  – Science: MPI, Spark, Giraph

(Diagram: HDFS provides redundant, reliable storage; YARN provides cluster resource management; MapReduce, Tez, streaming and other frameworks run on top)
Stinger Initiative

• Community initiative around Hive
• Enables Hive to support interactive workloads
• Improves existing tools & preserves investments

Hive (query planner) + Tez (execution engine) + ORC file (file format) = 100X
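Opting into the ORC piece is a one-line change in Hive DDL. A hedged sketch (the table names are illustrative):

```python
# Rewrite an existing table into ORC format via the Hive CLI.
import subprocess

subprocess.run(["hive", "-e", """
CREATE TABLE page_views_orc
STORED AS ORC
AS SELECT * FROM page_views;
"""], check=True)
```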
Data Lake Projects

• Keep raw data
  – 20+ PB projects
  – Previously discarded
• Unify many data sources
  – Pull from all over the organization
• Produce derived views
  – Automatic “ETL” for regular downstream use cases
  – New applications from unified data
• Support ad hoc exploration
  – Prototype new use cases
  – Answer unanticipated questions
  – Agile rebuild from raw data

(Diagram: data lands in a landing zone (NFS, JMS), is ingested and archived into a staging area, then flows into general and secure core zones; the data flow is described in descriptor docs)
Interesting things on the horizon

• Solid-state storage and disk drive evolution
  – So far, LFF drives seem to be maintaining their economic advantage (4TB drives now, and 7TB next year!)
  – SSDs are becoming ubiquitous and will become part of the architecture
• In-RAM databases
  – Bring them on, let’s port them to YARN!
  – Hadoop complements these technologies, and shines with huge data
• Atom / ARM processors
  – This is great for Hadoop! But…
  – Vendors are not yet designing the right machines (bandwidth to disk)
• Software-defined networks
  – This is great for Hadoop: more network for less!
Thank You!
Eric Baldeschwieler, CTO Hortonworks
Twitter: @jeric14

Get involved! (Diagram: new users feed contributions & validation back to the Apache Foundation)
See Hadoop > Learn Hadoop > Do Hadoop

• Full environment to evaluate Hadoop
• Hands-on, step-by-step tutorials to learn
STOP! Bonus material follows
Hortonworks Approach: Community-Driven Enterprise Apache Hadoop

• Identify and introduce enterprise requirements into the public domain
• Work with the community to advance and incubate open source projects
• Apply enterprise rigor to provide the most stable and reliable distribution
Driving Enterprise Hadoop Innovation

(Chart: lines of code by company, and Hortonworks vs. Cloudera committer counts, for Ambari, HBase, HCatalog, Hive, Pig and Hadoop core; source: Apache Software Foundation)
Hortonworks Process for Enterprise Hadoop

• Upstream community projects (Apache Hadoop, Pig, Hive, HBase, HCatalog, Ambari and other Apache projects): design & develop, then release stable project releases
• Downstream enterprise product (Hortonworks Data Platform): integrate & test, package & certify, distribute, then test & patch, feeding fixed issues back upstream
• Virtuous cycle when development & fixed issues are done upstream and stable project releases flow downstream
• No lock-in: an integrated, tested & certified distribution lowers risk by ensuring close alignment with Apache projects
Hadoop and Cloud

• Can I run Hadoop in OpenStack or in my virtualization infrastructure?
  – Yes, but… it depends on your use case and hardware choices
  – We will see a lot of innovation in this space in coming years
  – OpenStack Savanna: a collaboration to bring Hadoop to OpenStack
• Zero-procurement POC: try Hadoop in the cloud
  – 5-10 nodes works great! (on a private or public cloud)
  – Many projects are done today in public clouds
• Occasional use (run Hadoop when the cluster is not busy)
  – Where do you store the data when Hadoop is not running?
  – At >20 nodes, review your network and storage design
• Large-scale, continuous deployment (100 – 4,000 nodes)
  – Need to design your storage and network for Hadoop
Open Source in the Architecture

• Applications
  – BI: Jaspersoft, Pentaho, …
  – NoSQL in apps: HBase, Cassandra, MongoDB, …
  – Search apps: ElasticSearch, Solr, …
• Data systems
  – HORTONWORKS DATA PLATFORM
  – DBs: Postgres, MySQL
  – Search: ElasticSearch, Solr, …
  – ESB, ETL: ActiveMQ, Talend, Kettle
• Data sources: DBs, search repos
• Operational tools: Nagios, Ganglia, Chef, Puppet, …
• Dev & data tools: Eclipse, OpenJDK, Spring, VirtualBox, …
CASE STUDY: YAHOO! WEBMAP (© Yahoo 2011)

What is a WebMap?
• A gigantic table of information about every web site, page and link Yahoo! knows about
• A directed graph of the web
• Various aggregated views (sites, domains, etc.)
• Various algorithms for ranking, duplicate detection, region classification, spam detection, etc.

Why was it ported to Hadoop?
• The custom C++ solution was not scaling
• Leverage the scalability, load balancing and resilience of Hadoop infrastructure
• Focus on the application vs. the infrastructure
CASE STUDY: WEBMAP PROJECT RESULTS (© Yahoo 2011)

• 33% time savings over the previous system on the same cluster (and Hadoop keeps getting better)
• Was the largest Hadoop application, drove scale
  – Over 10,000 cores in the system
  – 100,000+ maps, ~10,000 reduces
  – ~70 hours runtime
  – ~300 TB shuffled
  – ~200 TB compressed output
• Moving the data to Hadoop increased the number of groups using the data
Use case: computational advertising

• A principled way to find “best match” ads, in context, for a query (or page view)
• Lots of data
  – Search: billions of unique queries per hour
  – Display: trillions of ads displayed per hour
  – Billions of users
  – Billions of ads
• Big business
  – $132B total advertising market (2015)
  – $600B total worldwide market (2015)
• Challenges
  – A huge number of small transactions
  – Cost of serving < revenue per search

Example: predicting CTR (search ads)
• Rank = bid * CTR
• Predict CTR for each ad to determine placement, based on historical CTR, keyword match, etc.
• Approach: supervised learning
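A toy illustration of the ranking rule (the numbers are made up): with rank = bid * predicted CTR, a low-bid ad can still win the top slot if its predicted CTR is high enough.

```python
# rank = bid * predicted CTR; order ads by expected revenue per impression
ads = [
    {"ad": "A", "bid": 2.00, "ctr": 0.010},   # expected value 0.020
    {"ad": "B", "bid": 0.50, "ctr": 0.050},   # expected value 0.025
    {"ad": "C", "bid": 1.00, "ctr": 0.020},   # expected value 0.020
]
for ad in sorted(ads, key=lambda a: a["bid"] * a["ctr"], reverse=True):
    print(ad["ad"], round(ad["bid"] * ad["ctr"], 3))
# B ranks first despite the lowest bid, because its predicted CTR is high.
```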
Hadoop for advertising science @ Yahoo!

• Advertising science moved CTR prediction from “legacy” (MyNA) systems to Hadoop
  – Scientist productivity dramatically improved
  – Platform for massive A/B testing of computational-advertising algorithmic improvements
• Hadoop enabled a next-gen contextual advertising matching platform
  – A heavy compute process that is highly parallelizable
MapReduce

• MapReduce is a distributed computing programming model
• It works like a Unix pipeline:
  – cat input | grep | sort | uniq -c > output
  – Input | Map | Shuffle & Sort | Reduce | Output
• Strengths
  – Easy to use! The developer just writes a couple of functions
  – Moves compute to the data
    – Schedules work on the HDFS node holding the data if possible
    – Scans through data, reducing seeks
  – Automatic reliability and re-execution on failure
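The classic illustration of that pipeline is word count. A minimal Python sketch for Hadoop Streaming, as two small scripts; the shuffle & sort phase between them is what delivers each word's pairs to the reducer grouped and sorted:

```python
# mapper.py: emit a (word, 1) pair for every word seen on stdin
import sys
for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py: input arrives sorted by key, so counts sum in one pass
import sys
current, count = None, 0
for line in sys.stdin:
    word, _ = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(current + "\t" + str(count))
        current, count = word, 0
    count += 1
if current is not None:
    print(current + "\t" + str(count))
```

The same logic runs locally as `cat input | ./mapper.py | sort | ./reducer.py`, which is exactly the Unix analogy above.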
HDFS in action

(Diagram: an HDFS client, the NameNode, and DataNodes 1-3)
• The client puts big data into HDFS (via RPC or REST)
• The data is broken into chunks and distributed to the DataNodes
• The DataNodes replicate the chunks
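As a sketch of the REST path mentioned above, writing a file through WebHDFS is a documented two-step exchange: the NameNode answers the first PUT with a redirect to a DataNode, and the bytes go there. The host, the 2013-era default port 50070, and the paths are assumptions:

```python
# Put a local file into HDFS over WebHDFS (REST).
import requests

NAMENODE = "http://namenode:50070/webhdfs/v1"

# Step 1: ask the NameNode where to write; no data is sent yet.
r1 = requests.put(NAMENODE + "/data/big.csv",
                  params={"op": "CREATE"}, allow_redirects=False)
datanode_url = r1.headers["Location"]     # 307 redirect to a DataNode

# Step 2: stream the file's bytes to the chosen DataNode.
with open("big.csv", "rb") as f:
    r2 = requests.put(datanode_url, data=f)
r2.raise_for_status()                     # expect 201 Created
```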