Common and Unique Use Cases for Apache Hadoop (August 30, 2011)
Agenda
• What is Apache Hadoop?
• Log Processing
• Catching 'Osama'
• Extract Transform Load (ETL)
• Analytics in HBase
• Machine Learning
• Final Thoughts
Copyright 2011 Cloudera Inc. All rights reserved
Exploding Data Volumes
• Online: web-ready devices, social media, digital content, smart grids
• Enterprise: transactions, R&D data, operational (control) data
(Chart: relational vs. complex, unstructured data volumes, 2005–2009)
2,500 exabytes of new information in 2012, with the Internet as the primary driver.
The digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 "zettabytes" this year. Source: An IDC White Paper sponsored by EMC, "As the Economy Contracts, the Digital Universe Expands," May 2009.
Origin of Hadoop: How does an elephant sneak up on you? (Timeline, 2002–2010)
• Open source web crawler project (Nutch) created by Doug Cutting
• Google publishes the MapReduce and GFS papers
• Open source MapReduce & HDFS project (Hadoop) created by Doug Cutting
• Yahoo! runs a 4,000-node Hadoop cluster
• Hadoop wins the terabyte sort benchmark
• Facebook launches SQL support for Hadoop (Hive)
• Cloudera releases CDH3 and Cloudera Enterprise
MapReduce
Hadoop Distributed File System (HDFS)
• Consolidates everything: move complex and relational data into a single repository
• Stores inexpensively: keep raw data always available; use commodity hardware
• Processes at the source: eliminate ETL bottlenecks; mine data first, govern later
What is Apache Hadoop? Open Source Storage and Processing Engine
What is Apache Hadoop? The Standard Way Big Data Gets Done
• Hadoop is Flexible:
• Structured or unstructured
• Schema or no schema
• High volume or merely terabytes
• All kinds of analytic applications
• Hadoop is Open: 100% Apache-licensed open source
• Hadoop is Scalable: proven at petabyte scale
• Benefits:
• Controls costs by storing data more affordably per terabyte than any other platform
• Drives revenue by extracting value from data that was previously out of reach
No Lock-In: investments in skills, services & hardware are preserved regardless of vendor choice
Community Development: Hadoop & related projects are expanding at a rapid pace
Rich Ecosystem: dozens of complementary software, hardware, and services firms
What is Apache Hadoop? The Importance of Being Open
• Common uses of logs
• Find or count events (grep)
grep "ERROR" file
grep -c "ERROR" file
• Calculate metrics (performance or user behavior analysis)
awk '{sums[$1]+=$2; counts[$1]+=1} END {for (k in counts) print k, sums[k]/counts[k]}'
• Investigate user sessions
grep "USER" files … | sort | less
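The three operations above can be tried end to end on made-up data (the file names and log lines here are invented for illustration):

```shell
# Count events: how many ERROR lines are in the log?
printf 'ERROR disk full\nINFO ok\nERROR timeout\n' > app.log
grep -c 'ERROR' app.log            # prints 2

# Calculate metrics: mean response time per URL (column 1 = URL, 2 = ms).
printf '/home 10\n/home 20\n/about 30\n' > times.log
awk '{sums[$1]+=$2; counts[$1]+=1}
     END {for (k in counts) print k, sums[k]/counts[k]}' times.log | sort
# prints: /about 30   then   /home 15
```

The awk script keeps a running sum and count per key and emits the averages at end-of-input, which is exactly the shape of a map-plus-aggregate job.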
Log Processing A Perfect Fit
• Shoot…too much data
• Homegrown parallel processing is often done on a per-file basis, because it's easy
• No parallelism within a single large file
(Diagram: three separate access_log files, each processed by its own task: Task 0, Task 1, and Task 2.)
• MapReduce to the rescue!
• Processing is done per unit of data
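The per-unit idea can be mimicked locally (this is only an illustration of split-based parallelism, not real Hadoop; the file and the two-line "split size" are made up):

```shell
# Carve a "log" into fixed-size units and run one grep task per unit,
# the way map tasks each own one split of a large file.
printf 'ERROR a\nINFO b\nERROR c\nINFO d\n' > access_log
split -l 2 access_log part_        # two splits of two lines each
for p in part_*; do grep -c ERROR "$p" & done
wait                               # each split reports its own count
```

Each task sees only its own unit, so adding splits adds parallelism with no coordination beyond the final merge of the per-split results.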
Each task is responsible for a unit of data.
(Diagram: one access_log divided into 64 MB units, 0–64 MB through 192–256 MB, with Tasks 0–3 each owning one unit.)
• Network or disk is the bottleneck
• Reading 100GB of data
• 14 minutes over a 1 GbE network connection
• 22 minutes on a standard disk drive
(Diagram: a single grep process reading access_log; bandwidth is the limit.)
• Hadoop to the rescue!
• Eliminates the network bottleneck: data is on local disk
• Data is read from many, many disks in parallel
(Diagram: Tasks 0–3, each assigned a 64 MB unit of access_log, 0–64 MB through 192–256 MB, run on the physical machines NodeA, NodeX, NodeY, and NodeZ where that data resides.)
• Hadoop currently scales to 4,000 nodes
• Goal for next release is 10,000 nodes
• Nodes typically have 12 hard drives
• A single hard drive has throughput of about 75MB/second
• 12 Hard Drives * 75 MB/second * 4000 Nodes = 3.4 TB/second
• That’s bytes, not bits
• That’s enough bandwidth to read 1PB (1000 TB) in 5 minutes
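The bandwidth arithmetic above can be checked with a quick shell calculation (integer math, so the results are truncated):

```shell
drives_per_node=12
mb_per_s_per_drive=75
nodes=4000
total_mb_per_s=$((drives_per_node * mb_per_s_per_drive * nodes))
echo "aggregate: ${total_mb_per_s} MB/s"   # 3,600,000 MB/s, i.e. ~3.4 TB/s
# Time to read 1 PB (about 10^9 MB) at that rate, in minutes:
echo "1 PB in ~$((1000000000 / total_mb_per_s / 60)) minutes"
```

The ~3.4 TB/s figure treats 1 TB as 1024^2 MB; with decimal units it is 3.6 TB/s, and either way a petabyte streams off the cluster's disks in roughly five minutes.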
• You have a few billion images of faces with geo-tags
• Tremendous storage problem
• Tremendous processing problem
• Bandwidth
• Coordination
Catching 'Osama': Embarrassingly Parallel
• Store the images in Hadoop
• When processing, Hadoop will read the images from local disk, thousands of local disks spread throughout the cluster
• Use a map-only job to compare input images against the 'needle' image
(Diagram: images stored in SequenceFiles are streamed to Map Task 0 and Map Task 1; every task has a copy of the 'needle' image and outputs the faces matching it.)
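A toy map-only matcher, shrunk to a local shell sketch (the file names are made up, and real face matching would use a similarity model, not byte-for-byte checksums):

```shell
# Each "map task" compares one input image against the shared needle
# and emits only the matches; there is no reduce step at all.
printf 'face-A' > img0.jpg
printf 'face-B' > img1.jpg
printf 'face-B' > needle.jpg
needle_sum=$(md5sum needle.jpg | cut -d' ' -f1)
for img in img0.jpg img1.jpg; do
  [ "$(md5sum "$img" | cut -d' ' -f1)" = "$needle_sum" ] && echo "match: $img"
done
# prints: match: img1.jpg
```

Because no task depends on any other, the job scales linearly with the number of disks holding images, which is what "embarrassingly parallel" means here.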
• One of the most common use cases I see is replacing ETL processes
• Hadoop is a huge sink of cheap storage and processing
• Aggregates built in Hadoop and exported
• Apache Hive provides SQL-like querying on raw data
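The aggregate-then-export pattern can be sketched locally; here awk stands in for the Hive/MapReduce aggregation, and the final file is what would be exported to the warehouse (the data and file names are invented):

```shell
# Raw fact data lands in the "sink" (here, a local CSV: product,quantity).
printf 'shoes,2\nshoes,3\nhats,1\n' > raw_sales.csv
# Build the aggregate (per-product totals), the job Hadoop would run at scale.
awk -F, '{s[$1]+=$2} END {for (k in s) print k "," s[k]}' raw_sales.csv \
  | sort > daily_totals.csv
cat daily_totals.csv   # prints: hats,1   then   shoes,5
```

Only the small aggregate leaves the cluster; the raw data stays cheap and queryable in place.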
Extract Transform Load (ETL): Everyone Is Doing It
(Diagram: a 'real'-time system (website) backed by an online DB feeds an ETL pipeline into an analytical DB / data warehouse, which serves business intelligence applications. Caption: "Much blood shed, here.")
(Diagram: the same pipeline with Hadoop inserted between the online DB and the analytical DB / data warehouse, handling the import and export steps.)
(Diagram: the same pipeline with Apache Sqoop performing both the import from the online DB into Hadoop and the export from Hadoop to the analytical DB / data warehouse.)
• Analytics is often simply counting things
• Facebook chose HBase to store its massive counter infrastructure (more later)
• How might one implement a counter infrastructure in HBase?
Analytics in HBase: Scaling Writes
Individual page counters:

URL                         Counter
com.cloudera/blog/…         154
com.cloudera/downloads/…    923621
com.cloudera/resources/…    2138

User & content-type counters:

User               Content    Counter
[email protected]    NEWS       5431
[email protected]    TECH       79310
[email protected]    SHOPPING   59
[email protected]    SPORTS     94214

A 'Like' button IMG request sends an HTTP request to Facebook servers, which increments several counters.
Individual page counters:

URL                         Counter
com.cloudera/blog/…         154
com.cloudera/downloads/…    923621
com.cloudera/resources/…    2138

The host is reversed in the URL as part of the row key:
• Data is physically stored in sorted order
• Scanning all 'com.cloudera' counters results in sequential I/O
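The reversed-host trick can be demonstrated with plain shell (the URLs are made up; in HBase the reversed string would be the row key and the scan would be a prefix scan):

```shell
# Reverse each URL's host so that a plain lexicographic sort clusters
# all of one domain's pages together; scanning a prefix is then
# sequential I/O instead of seeks scattered across the keyspace.
keys=$(printf 'blog.cloudera.com/post1\nwww.example.org/a\ndownloads.cloudera.com/x\n' |
  while IFS=/ read -r host path; do
    echo "$(echo "$host" | tr '.' '\n' | tac | paste -sd.)/$path"
  done | sort)
echo "$keys"
# prints: com.cloudera.blog/post1
#         com.cloudera.downloads/x
#         org.example.www/a
```

All `com.cloudera` rows are now adjacent, so a scan over that prefix touches one contiguous region of the table.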
• Real-time counters of URLs shared, links "liked", and impressions generated
• 20 billion events/day (about 200K events/sec)
• ~30 second latency from click to count
• Heavy use of the incrementColumnValue API for consistent counters
• Tried MySQL and Cassandra, settled on HBase (http://tiny.cloudera.com/hbase-…-analytics)
Facebook Analytics: Scaling Writes
Machine Learning: Apache Mahout
Text Clustering on Google News
Collaborative filtering on Amazon
Classification in Gmail
• Apache Mahout implements:
• Collaborative filtering
• Classification
• Clustering
• Frequent itemset mining
• More coming with the integration of MapReduce.Next
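To give a flavor of what collaborative filtering computes, here is a toy item-co-occurrence count in awk (the preference data is invented; Mahout's item-based recommender builds this kind of matrix at scale with MapReduce):

```shell
# user,item preference pairs.
printf 'alice,beer\nalice,diapers\nbob,beer\nbob,diapers\ncarol,beer\n' > prefs.csv
# Count how often two items are liked by the same user; high-co-occurrence
# pairs drive "people who liked X also liked Y" recommendations.
cooccur=$(awk -F, '{items[$1] = items[$1] "," $2}
  END {for (u in items) {n = split(substr(items[u], 2), a, ",")
         for (i = 1; i <= n; i++) for (j = i+1; j <= n; j++) pairs[a[i] "-" a[j]]++}
       for (p in pairs) print p, pairs[p]}' prefs.csv)
echo "$cooccur"   # prints: beer-diapers 2
```

Two of the three users bought both items, so "beer-diapers" scores 2; at billions of preferences this counting is exactly the kind of job MapReduce parallelizes well.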
• Other use cases:
• OpenTSDB, an open, distributed, scalable Time Series Database (TSDB)
• Building search indexes (the canonical use case)
• Facebook Messaging
• Cheap and deep storage, e.g. archiving emails for SOX compliance
• Audit logging
• Non-use cases:
• Data processing that one beefy server can handle
• Data that requires transactions
Final Thoughts: Use the Right Tool
About the Presenter
• Brock Noland
• http://twitter.com/brocknoland
• TC-HUG: http://tch.ug