Apache Hadoop for Big Science
History, Use cases & Futures
Eric Baldeschwieler, “Eric14”
Hortonworks CTO
@jeric14
Agenda
• What is Apache Hadoop
• Project motivation & history
• Use cases
• Futures and observations
What is Apache Hadoop?
Traditional data systems vs. Hadoop
Traditional data systems
• Limited scaling options
• Expensive at scale
• Complex components
• Proprietary software
• Reliability in hardware
• Optimized for latency, IOPs

Hadoop cluster
• Low-cost scale-out
• Commodity components
• Open source software
• Reliability in software
• Optimized for throughput

When your data infrastructure does not scale … Hadoop
Apache Hadoop: Big Data Platform

Open source data management with scale-out storage & distributed processing

Storage: HDFS
• Distributed across a cluster
• Natively redundant, self-healing
• Very high bandwidth

Processing: MapReduce
• Splits a job into small tasks and moves compute “near” the data
• Self-healing
• Simple programming model
Key characteristics
• Scalable
  – Efficiently store and process petabytes of data
  – Scale out linearly by adding nodes (node == commodity computer)
• Reliable
  – Data replicated 3x
  – Failover across nodes and racks
• Flexible
  – Store all types of data in any format
• Economical
  – Commodity hardware
  – Open source software (via ASF)
  – No vendor lock-in
Hadoop’s cost advantage
(From Richard McDougall, VMware, Hadoop Summit 2012 talk)

• SAN storage: $2 – $10/gigabyte. $1M gets 0.5 petabytes, 1,000,000 IOPS, 1 GByte/sec
• NAS filers: $1 – $5/gigabyte. $1M gets 1 petabyte, 400,000 IOPS, 2 GBytes/sec
• Local storage: $0.05/gigabyte. $1M gets 20 petabytes, 10,000,000 IOPS, 800 GBytes/sec

“And you get racks of free computers when you buy storage!” – Eric14
Hadoop hardware
• 10 to 4,500 node clusters
  – 1-4 “master nodes”
  – Interchangeable workers
• Typical node
  – 4-12 × 2-4TB SATA drives
  – 64GB RAM
  – 2 × 4-8 core CPUs, ~2GHz
  – 2 × 1Gb NICs
  – Single power supply
  – JBOD, not RAID, …
• Switches
  – 1-2 Gb to the node
  – ~20 Gb to the core
  – Full bisection bandwidth
  – Layer 2 or 3, simple
Zooming out: An Apache Hadoop Platform

HORTONWORKS DATA PLATFORM (HDP)
• Platform services: enterprise readiness (HA, DR, snapshots, security, …)
• Hadoop core: distributed storage & processing (HDFS, MapReduce)
• Data services: store, process and access data (HCatalog, Hive, Pig, HBase, Sqoop, Flume)
• Operational services: manage & operate at scale (Oozie, Ambari)

Deployable as an appliance, in the cloud, or on an OS / VM
Zooming out: A Big Data Architecture

• Data sources
  – Traditional sources (RDBMS, OLTP, OLAP), POS systems, mobile data
  – New sources (web logs, email, sensor data, social media)
• Data systems
  – Traditional repos: RDBMS, EDW, MPP
  – HORTONWORKS DATA PLATFORM
• Applications
  – Business analytics, custom applications, packaged applications
• Operational tools: manage & monitor
• Dev & data tools: build & test
Motivation and History

(Timeline chart, 2007 – 2010; credit: The Datagraph Blog)
Eric Baldeschwieler - CTO Hortonworks
• 2011 – now: Hortonworks, CTO
• 2006 – 2011: Yahoo!, VP Engineering, Hadoop
• 2003 – 2005: Yahoo!, Web Search Engineering
  – Built systems that crawl & index the web
• 1996 – 2003: Inktomi, Web Search Engineering
  – Built systems that crawl & index the web
• Previously
  – UC Berkeley – Masters CS
  – Video game development
  – Digital video & 3D rendering software
  – Carnegie Mellon – BS Math/CS
Early history

• 1995 – 2005
  – Yahoo! search team builds 4+ generations of systems to crawl & index the WWW. 20 billion pages!
• 2004
  – Google publishes Google File System & MapReduce papers
• 2005
  – Doug Cutting builds Nutch DFS & MapReduce, joins Yahoo!
  – Yahoo! search commits to build open source DFS & MapReduce
    – Compete / differentiate via open source contribution!
    – Attract scientists; become a known center of big data excellence
    – Avoid building proprietary systems that will be obsolesced
    – Gain leverage of a wider community building one infrastructure
• 2006
  – Hadoop is born!
  – Dedicated team under E14 staffed at Yahoo!
  – Nutch prototype used to seed the new Apache Hadoop project
Hadoop at Yahoo!
Source: http://developer.yahoo.com/blogs/ydn/posts/2013/02/hadoop-at-yahoo-more-than-ever-before/
Hortonworks – 100% Open Source
• We distribute the only 100% Open Source Enterprise Hadoop Distribution: Hortonworks Data Platform
• We engineer, test & certify HDP for enterprise usage
• We employ the core architects, builders and operators of Apache Hadoop
• We drive innovation within Apache Software Foundation projects
• We are uniquely positioned to deliver the highest quality of Hadoop support
• We enable the ecosystem to work better with Hadoop
We develop, distribute and support the ONLY 100% open source Enterprise Hadoop distribution
Endorsed by Strategic Partners
Headquarters: Palo Alto, CA
Employees: 200+ and growing
Investors: Benchmark, Index, Yahoo
CASE STUDY: YAHOO SEARCH ASSIST™ (© Yahoo 2011)

                   Before Hadoop   After Hadoop
Time               26 days         20 minutes
Language           C++             Python
Development time   2-3 weeks       2-3 days

• Database for Search Assist™ is built using Apache Hadoop
• Several years of log data
• 20 steps of MapReduce
Apache Hadoop Ecosystem History

• Early adopters scale and productize Hadoop (2006 – present)
• Other internet companies add tools / frameworks, enhance Hadoop (2008 – present) …
• Service providers offer training, support, hosting (2010 – present): Cloudera, MapR, Microsoft, IBM, EMC, Oracle, …
• Wide adoption funds further development and enhancements (2011 – present)
Use cases
Use case: full genome sequencing

• The data
  – 1 full genome = 1TB (raw, uncompressed)
  – 1M people sequenced = 1 exabyte
  – Cost per person = $1,000 and continues to drop
• Uses for Hadoop
  – Large-scale compute applications
    – Map NGS data (“reads”) to a reference genome
    – Used for drug development, personalized treatment
    – Community-developed Hadoop-based software for gene matching: CloudBurst, Crossbow
  – Store, manage and share genomics data in the bioinformatics community

See: http://hortonworks.com/blog/big-data-in-genomics-and-cancer-treatment
Use case: oil & gas

• Digital oil field
  – Data sizes: 2+ TB / day
  – Application: safety/security, improve field performance
  – Hadoop used for data storage and analytics
• Seismic image processing
  – Drill ship costs $1M/day
  – One “shot” (in SEGY format) contains ~2.5GB
  – Hadoop used to parallelize computation and store data post-processing
    – Previously, data was discarded immediately after processing!
    – Now kept for reprocessing and R&D
Use case: high-energy physics

• Collecting events from colliders
  – “We have a very big digital camera”; each “event” = ~1MB
  – Looking for rare events (need millions of events for statistical significance)
• Typical task: scan through events and look for particles with a certain mass
  – Analyze millions of events in parallel
  – Hadoop used in streaming mode with C++ code to analyze events
• HDFS used for low-cost storage
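For concreteness, a minimal sketch of how such a streaming job could be launched from Python. The `filter_events` C++ binary, its mass-window flags, the jar path and the HDFS paths are all hypothetical; only the streaming options themselves (`-files`, `-input`, `-output`, `-mapper`, `-reducer`) are standard Hadoop Streaming flags.

```python
# Hypothetical launcher for a Hadoop Streaming job whose mapper is a
# compiled C++ event filter. Binary name and paths are illustrative only.
import subprocess

subprocess.run([
    "hadoop", "jar", "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar",
    "-files", "filter_events",              # ship the C++ binary to every node
    "-input", "/data/collider/events/",     # raw event records in HDFS
    "-output", "/user/physics/candidates",
    "-mapper", "./filter_events --min-mass 120 --max-mass 130",
    "-reducer", "NONE",                     # pure scan/filter, no aggregation
], check=True)
```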
http://www.linuxjournal.com/content/the-large-hadron-collider
Use case: National Climate Assessment

• Rapid, flexible, and open source big data technologies for the U.S. National Climate Assessment
  – Chris A. Mattmann, Senior Computer Scientist, NASA JPL
  – Chris and team have done a number of projects with Hadoop
• Goal
  – Compare regional climate models to a variety of satellite observations
  – Traditionally, models are compared to other models, not to actual observations
  – Normalize complex multi-format data to lat/long + observation values
• Hadoop
  – Used Apache Hive to provide a scale-out SQL warehouse of the data
  – See the paper, or the case study in “Programming Hive” (O’Reilly, 2012)

(Image credit: Kathy Jacobs)
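As a hedged sketch of that Hive usage (table, column and variable names are invented here, not taken from the NCA project), the normalized lat/long + observation data could be defined and aggregated like this through the standard Hive CLI:

```python
# Illustrative only: a Hive table of normalized observations and a
# per-grid-cell aggregation, run via the Hive CLI's -e flag.
import subprocess

hql = """
CREATE TABLE IF NOT EXISTS climate_obs (
  lat DOUBLE, lon DOUBLE, obs_time TIMESTAMP,
  variable STRING, value DOUBLE);

SELECT floor(lat) AS lat_bin, floor(lon) AS lon_bin, avg(value) AS mean_value
FROM climate_obs
WHERE variable = 'surface_temp_k'
GROUP BY floor(lat), floor(lon);
"""
subprocess.run(["hive", "-e", hql], check=True)
```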
Apache Hadoop: Patterns of Use

Big Data: Transactions + Interactions + Observations

Refine, Explore, Enrich
Operational Data Refinery (Refine): Hadoop as a platform for ETL modernization

Capture
• Capture new unstructured data and log files alongside existing sources
• Retain inputs in raw form for audit and continuity purposes

Process
• Parse & cleanse the data
• Apply structure and definition
• Join datasets together across disparate data sources

Exchange
• Push to the existing enterprise data warehouse for downstream consumption
• Feed operational reporting and online systems

(Diagram: unstructured, log and DB data is captured and archived, parsed & cleansed, structured and joined in the refinery, then uploaded to the EDW)
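A minimal sketch of the parse & cleanse step, written as a Hadoop Streaming mapper in Python; the combined-log input format and the selected fields are assumptions for illustration:

```python
#!/usr/bin/env python
# Streaming mapper: turn raw web-server log lines into tab-separated
# records that can be joined with warehouse tables downstream.
import sys

for line in sys.stdin:
    parts = line.split()
    if len(parts) < 10:
        continue                              # cleanse: drop malformed lines
    ip, ts = parts[0], parts[3].lstrip("[")
    method, url, status = parts[5].lstrip('"'), parts[6], parts[8]
    if not status.isdigit():
        continue                              # cleanse: drop garbage statuses
    print("\t".join([ip, ts, method, url, status]))   # apply structure
```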
Big Data Exploration (Explore): Hadoop as an agile, ad-hoc data mart

Capture
• Capture multi-structured data and retain inputs in raw form for iterative analysis

Process
• Parse the data into a queryable format
• Explore & analyze using Hive, Pig, Mahout and other tools to discover value
• Label data and type information for compatibility and later discovery
• Pre-compute stats, groupings, patterns in data to accelerate analysis

Exchange
• Use visualization tools to facilitate exploration and find key insights
• Optionally move actionable insights into an EDW or datamart

(Diagram: unstructured, log and DB data is captured and archived, categorized into tables, structured and joined, then queried over JDBC / ODBC from visualization tools; the EDW / datamart hand-off is optional)
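To illustrate the JDBC/ODBC hand-off in the diagram, here is a hedged sketch using the third-party PyHive package against HiveServer2; the host and table names are invented:

```python
# Ad-hoc exploration: run a Hive query and pull results into Python.
from pyhive import hive   # third-party package: pip install pyhive

conn = hive.connect(host="hive-gateway", port=10000)  # HiveServer2
cur = conn.cursor()
cur.execute("""
    SELECT url, count(*) AS hits
    FROM raw_clicks
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 20
""")
for url, hits in cur.fetchall():
    print(url, hits)
conn.close()
```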
Application Enrichment (Enrich): deliver Hadoop analysis to online apps

Capture
• Capture data that was once too bulky and unmanageable

Process
• Uncover aggregate characteristics across data
• Use Hive, Pig and MapReduce to identify patterns
• Filter useful data from mass streams (Pig)
• Micro- or macro-batch oriented schedules

Exchange
• Push results to HBase or another NoSQL alternative for real-time delivery
• Use patterns to deliver the right content/offer to the right person at the right time

(Diagram: unstructured, log and DB data is captured, parsed and derived/filtered on scheduled and near-real-time cycles, then served from NoSQL / HBase at low latency)
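A minimal sketch of that exchange step using the third-party happybase client to push precomputed results into HBase for low-latency serving; the table and column names are hypothetical:

```python
# Push per-user recommendations into HBase via the Thrift gateway.
import happybase   # third-party package: pip install happybase

conn = happybase.Connection("hbase-gateway")     # Thrift server host
table = conn.table("user_recommendations")
table.put(b"user42", {                           # row key = user id
    b"rec:item_1001": b"0.91",                   # column family "rec"
    b"rec:item_2002": b"0.87",
})
conn.close()
```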
CASE STUDY: YAHOO! HOMEPAGE (© Yahoo 2011)

• Personalized for each visitor
• Result: twice the engagement
  – +160% clicks vs. one-size-fits-all
  – +79% clicks vs. randomly selected
  – +43% clicks vs. editor selected
• Recommended links, News Interests, Top Searches
CASE STUDY: YAHOO! HOMEPAGE (© Yahoo 2011)

• Serving maps (users to interests), rebuilt every five minutes from user behavior on the production Hadoop cluster
• Categorization models, rebuilt weekly on the science Hadoop cluster
  » Identify user interests using categorization models
  » Machine learning to build ever-better categorization models
• Serving systems build customized home pages with the latest data (thousands / second)
Futures & observations
Hadoop 2.0 Innovations – YARN

• Focus on scale and innovation
  – Support 10,000+ computer clusters
  – Extensible to encourage innovation
• Next-generation execution
  – Improves MapReduce performance
• Supports new frameworks beyond MapReduce
  – Do more with a single Hadoop cluster
  – Low latency, streaming, services
  – Science: MPI, Spark, Giraph

(Diagram: HDFS provides redundant, reliable storage; YARN provides cluster resource management; MapReduce, Tez, streaming and other frameworks run on top)
Stinger Initiative

• Community initiative around Hive
• Enables Hive to support interactive workloads
• Improves existing tools & preserves investments

Hive (query planner) + Tez (execution engine) + ORC file (file format) = 100X
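Opting into the ORC piece is a one-line change in Hive DDL. A hedged sketch (the table names are illustrative):

```python
# Rewrite an existing table into ORC format via the Hive CLI.
import subprocess

subprocess.run(["hive", "-e", """
CREATE TABLE page_views_orc
STORED AS ORC
AS SELECT * FROM page_views;
"""], check=True)
```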
Data Lake Projects

• Keep raw data
  – 20+ PB projects
  – Previously discarded
• Unify many data sources
  – Pull from all over the organization
• Produce derived views
  – Automatic “ETL” for regular downstream use cases
  – New applications from unified data
• Support ad hoc exploration
  – Prototype new use cases
  – Answer unanticipated questions
  – Agile rebuild from raw data

(Diagram: data lands in a landing zone (NFS, JMS), is ingested and archived into a staging area, then flows into general and secure core zones; the data flow is described in descriptor docs)
Interesting things on the horizon

• Solid-state storage and disk drive evolution
  – So far, LFF drives seem to be maintaining their economic advantage (4TB drives now, and 7TB next year!)
  – SSDs are becoming ubiquitous and will become part of the architecture
• In-RAM databases
  – Bring them on, let’s port them to YARN!
  – Hadoop complements these technologies, and shines with huge data
• Atom / ARM processors
  – This is great for Hadoop! But…
  – Vendors are not yet designing the right machines (bandwidth to disk)
• Software-defined networks
  – This is great for Hadoop: more network for less!
Thank You!
Eric Baldeschwieler, CTO Hortonworks
Twitter: @jeric14

Get involved! (Diagram: new users feed contributions & validation back to the Apache Foundation)
See Hadoop > Learn Hadoop > Do Hadoop

• Full environment to evaluate Hadoop
• Hands-on, step-by-step tutorials to learn
STOP! Bonus material follows
Hortonworks Approach: Community-Driven Enterprise Apache Hadoop

• Identify and introduce enterprise requirements into the public domain
• Work with the community to advance and incubate open source projects
• Apply enterprise rigor to provide the most stable and reliable distribution
Driving Enterprise Hadoop Innovation

(Chart: lines of code by company, and Hortonworks vs. Cloudera committer counts, for Ambari, HBase, HCatalog, Hive, Pig and Hadoop core; source: Apache Software Foundation)
Hortonworks Process for Enterprise Hadoop

• Upstream community projects (Apache Hadoop, Pig, Hive, HBase, HCatalog, Ambari and other Apache projects): design & develop, then release stable project releases
• Downstream enterprise product (Hortonworks Data Platform): integrate & test, package & certify, distribute, then test & patch, feeding fixed issues back upstream
• Virtuous cycle when development & fixed issues are done upstream and stable project releases flow downstream
• No lock-in: an integrated, tested & certified distribution lowers risk by ensuring close alignment with Apache projects
Hadoop and Cloud

• Can I run Hadoop in OpenStack or in my virtualization infrastructure?
  – Yes, but… it depends on your use case and hardware choices
  – We will see a lot of innovation in this space in coming years
  – OpenStack Savanna: a collaboration to bring Hadoop to OpenStack
• Zero-procurement POC: try Hadoop in the cloud
  – 5-10 nodes works great! (on a private or public cloud)
  – Many projects are done today in public clouds
• Occasional use (run Hadoop when the cluster is not busy)
  – Where do you store the data when Hadoop is not running?
  – At >20 nodes, review your network and storage design
• Large-scale, continuous deployment (100 – 4,000 nodes)
  – Need to design your storage and network for Hadoop
Open Source in the Architecture

• Applications
  – BI: Jaspersoft, Pentaho, …
  – NoSQL in apps: HBase, Cassandra, MongoDB, …
  – Search apps: ElasticSearch, Solr, …
• Data systems
  – HORTONWORKS DATA PLATFORM
  – DBs: Postgres, MySQL
  – Search: ElasticSearch, Solr, …
  – ESB, ETL: ActiveMQ, Talend, Kettle
• Data sources: DBs, search repos
• Operational tools: Nagios, Ganglia, Chef, Puppet, …
• Dev & data tools: Eclipse, OpenJDK, Spring, VirtualBox, …
CASE STUDY: YAHOO! WEBMAP (© Yahoo 2011)

What is a WebMap?
• A gigantic table of information about every web site, page and link Yahoo! knows about
• A directed graph of the web
• Various aggregated views (sites, domains, etc.)
• Various algorithms for ranking, duplicate detection, region classification, spam detection, etc.

Why was it ported to Hadoop?
• The custom C++ solution was not scaling
• Leverage the scalability, load balancing and resilience of Hadoop infrastructure
• Focus on the application vs. the infrastructure
CASE STUDY: WEBMAP PROJECT RESULTS (© Yahoo 2011)

• 33% time savings over the previous system on the same cluster (and Hadoop keeps getting better)
• Was the largest Hadoop application, drove scale
  – Over 10,000 cores in the system
  – 100,000+ maps, ~10,000 reduces
  – ~70 hours runtime
  – ~300 TB shuffled
  – ~200 TB compressed output
• Moving the data to Hadoop increased the number of groups using the data
Use case: computational advertising

• A principled way to find “best match” ads, in context, for a query (or page view)
• Lots of data
  – Search: billions of unique queries per hour
  – Display: trillions of ads displayed per hour
  – Billions of users
  – Billions of ads
• Big business
  – $132B total advertising market (2015)
  – $600B total worldwide market (2015)
• Challenges
  – A huge number of small transactions
  – Cost of serving < revenue per search

Example: predicting CTR (search ads)
• Rank = bid * CTR
• Predict CTR for each ad to determine placement, based on historical CTR, keyword match, etc.
• Approach: supervised learning
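A toy illustration of the ranking rule (the numbers are made up): with rank = bid * predicted CTR, a low-bid ad can still win the top slot if its predicted CTR is high enough.

```python
# rank = bid * predicted CTR; order ads by expected revenue per impression
ads = [
    {"ad": "A", "bid": 2.00, "ctr": 0.010},   # expected value 0.020
    {"ad": "B", "bid": 0.50, "ctr": 0.050},   # expected value 0.025
    {"ad": "C", "bid": 1.00, "ctr": 0.020},   # expected value 0.020
]
for ad in sorted(ads, key=lambda a: a["bid"] * a["ctr"], reverse=True):
    print(ad["ad"], round(ad["bid"] * ad["ctr"], 3))
# B ranks first despite the lowest bid, because its predicted CTR is high.
```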
Hadoop for advertising science @ Yahoo!

• Advertising science moved CTR prediction from “legacy” (MyNA) systems to Hadoop
  – Scientist productivity dramatically improved
  – Platform for massive A/B testing of computational-advertising algorithmic improvements
• Hadoop enabled a next-gen contextual advertising matching platform
  – A heavy compute process that is highly parallelizable
MapReduce

• MapReduce is a distributed computing programming model
• It works like a Unix pipeline:
  – cat input | grep | sort | uniq -c > output
  – Input | Map | Shuffle & Sort | Reduce | Output
• Strengths
  – Easy to use! The developer just writes a couple of functions
  – Moves compute to the data
    – Schedules work on the HDFS node holding the data if possible
    – Scans through data, reducing seeks
  – Automatic reliability and re-execution on failure
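The classic illustration of that pipeline is word count. A minimal Python sketch for Hadoop Streaming, as two small scripts; the shuffle & sort phase between them is what delivers each word's pairs to the reducer grouped and sorted:

```python
# mapper.py: emit a (word, 1) pair for every word seen on stdin
import sys
for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py: input arrives sorted by key, so counts sum in one pass
import sys
current, count = None, 0
for line in sys.stdin:
    word, _ = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(current + "\t" + str(count))
        current, count = word, 0
    count += 1
if current is not None:
    print(current + "\t" + str(count))
```

The same logic runs locally as `cat input | ./mapper.py | sort | ./reducer.py`, which is exactly the Unix analogy above.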
HDFS in action

(Diagram: an HDFS client, the NameNode, and DataNodes 1-3)
• The client puts big data into HDFS (via RPC or REST)
• The data is broken into chunks and distributed to the DataNodes
• The DataNodes replicate the chunks
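As a sketch of the REST path mentioned above, writing a file through WebHDFS is a documented two-step exchange: the NameNode answers the first PUT with a redirect to a DataNode, and the bytes go there. The host, the 2013-era default port 50070, and the paths are assumptions:

```python
# Put a local file into HDFS over WebHDFS (REST).
import requests

NAMENODE = "http://namenode:50070/webhdfs/v1"

# Step 1: ask the NameNode where to write; no data is sent yet.
r1 = requests.put(NAMENODE + "/data/big.csv",
                  params={"op": "CREATE"}, allow_redirects=False)
datanode_url = r1.headers["Location"]     # 307 redirect to a DataNode

# Step 2: stream the file's bytes to the chosen DataNode.
with open("big.csv", "rb") as f:
    r2 = requests.put(datanode_url, data=f)
r2.raise_for_status()                     # expect 201 Created
```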