Common and Unique Use Cases for Apache Hadoop (August 30, 2011)
Agenda
• What is Apache Hadoop?
• Log Processing
• Catching 'Osama'
• Extract Transform Load (ETL)
• Analytics in HBase
• Machine Learning
• Final Thoughts
Copyright 2011 Cloudera Inc. All rights reserved
Exploding Data Volumes
• Online: web-ready devices, social media, digital content, smart grids
• Enterprise: transactions, R&D data, operational (control) data
(Chart: relational vs. complex, unstructured data volumes, 2005–2009)
2,500 exabytes of new information in 2012, with the Internet as the primary driver.
The digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 "zettabytes" this year. Source: An IDC White Paper sponsored by EMC, "As the Economy Contracts, the Digital Universe Expands," May 2009.
Origin of Hadoop: How does an elephant sneak up on you? (Timeline, 2002–2010)
• Open source web crawler project (Nutch) created by Doug Cutting
• Google publishes the MapReduce and GFS papers
• Open source MapReduce & HDFS project (Hadoop) created by Doug Cutting
• Yahoo! runs a 4,000-node Hadoop cluster
• Hadoop wins the terabyte sort benchmark
• Facebook launches SQL support for Hadoop (Hive)
• Cloudera releases CDH3 and Cloudera Enterprise
MapReduce
Hadoop Distributed File System (HDFS)
• Consolidates everything: move complex and relational data into a single repository
• Stores inexpensively: keep raw data always available; use commodity hardware
• Processes at the source: eliminate ETL bottlenecks; mine data first, govern later
What is Apache Hadoop? Open Source Storage and Processing Engine
What is Apache Hadoop? The Standard Way Big Data Gets Done
• Hadoop is Flexible:
• Structured or unstructured
• Schema or no schema
• High volume or merely terabytes
• All kinds of analytic applications
• Hadoop is Open: 100% Apache-licensed open source
• Hadoop is Scalable: proven at petabyte scale
• Benefits:
• Controls costs by storing data more affordably per terabyte than any other platform
• Drives revenue by extracting value from data that was previously out of reach
No Lock-In: investments in skills, services & hardware are preserved regardless of vendor choice
Community Development: Hadoop & related projects are expanding at a rapid pace
Rich Ecosystem: dozens of complementary software, hardware, and services firms
What is Apache Hadoop? The Importance of Being Open
• Common uses of logs
• Find or count events (grep)
grep "ERROR" file
grep -c "ERROR" file
• Calculate metrics (performance or user behavior analysis)
awk '{sums[$1]+=$2; counts[$1]+=1} END {for (k in counts) print k, sums[k]/counts[k]}'
• Investigate user sessions
grep "USER" files … | sort | less
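The three operations above can be tried end to end on made-up data (the file names and log lines here are invented for illustration):

```shell
# Count events: how many ERROR lines are in the log?
printf 'ERROR disk full\nINFO ok\nERROR timeout\n' > app.log
grep -c 'ERROR' app.log            # prints 2

# Calculate metrics: mean response time per URL (column 1 = URL, 2 = ms).
printf '/home 10\n/home 20\n/about 30\n' > times.log
awk '{sums[$1]+=$2; counts[$1]+=1}
     END {for (k in counts) print k, sums[k]/counts[k]}' times.log | sort
# prints: /about 30   then   /home 15
```

The awk script keeps a running sum and count per key and emits the averages at end-of-input, which is exactly the shape of a map-plus-aggregate job.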
Log Processing A Perfect Fit
• Shoot…too much data
• Homegrown parallel processing is often done on a per-file basis, because it's easy
• No parallelism within a single large file
(Diagram: three separate access_log files, each processed by its own task: Task 0, Task 1, and Task 2.)
• MapReduce to the rescue!
• Processing is done per unit of data
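The per-unit idea can be mimicked locally (this is only an illustration of split-based parallelism, not real Hadoop; the file and the two-line "split size" are made up):

```shell
# Carve a "log" into fixed-size units and run one grep task per unit,
# the way map tasks each own one split of a large file.
printf 'ERROR a\nINFO b\nERROR c\nINFO d\n' > access_log
split -l 2 access_log part_        # two splits of two lines each
for p in part_*; do grep -c ERROR "$p" & done
wait                               # each split reports its own count
```

Each task sees only its own unit, so adding splits adds parallelism with no coordination beyond the final merge of the per-split results.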
Each task is responsible for a unit of data.
(Diagram: one access_log divided into 64 MB units, 0–64 MB through 192–256 MB, with Tasks 0–3 each owning one unit.)
• Network or disk is the bottleneck
• Reading 100GB of data
• 14 minutes over a 1 GbE network connection
• 22 minutes on a standard disk drive
(Diagram: a single grep process reading access_log; bandwidth is the limit.)
• Hadoop to the rescue!
• Eliminates the network bottleneck: data is on local disk
• Data is read from many, many disks in parallel
(Diagram: Tasks 0–3, each assigned a 64 MB unit of access_log, 0–64 MB through 192–256 MB, run on the physical machines NodeA, NodeX, NodeY, and NodeZ where that data resides.)
• Hadoop currently scales to 4,000 nodes
• Goal for next release is 10,000 nodes
• Nodes typically have 12 hard drives
• A single hard drive has throughput of about 75MB/second
• 12 Hard Drives * 75 MB/second * 4000 Nodes = 3.4 TB/second
• That’s bytes, not bits
• That’s enough bandwidth to read 1PB (1000 TB) in 5 minutes
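The bandwidth arithmetic above can be checked with a quick shell calculation (integer math, so the results are truncated):

```shell
drives_per_node=12
mb_per_s_per_drive=75
nodes=4000
total_mb_per_s=$((drives_per_node * mb_per_s_per_drive * nodes))
echo "aggregate: ${total_mb_per_s} MB/s"   # 3,600,000 MB/s, i.e. ~3.4 TB/s
# Time to read 1 PB (about 10^9 MB) at that rate, in minutes:
echo "1 PB in ~$((1000000000 / total_mb_per_s / 60)) minutes"
```

The ~3.4 TB/s figure treats 1 TB as 1024^2 MB; with decimal units it is 3.6 TB/s, and either way a petabyte streams off the cluster's disks in roughly five minutes.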
• You have a few billion images of faces with geo-tags
• Tremendous storage problem
• Tremendous processing problem
• Bandwidth
• Coordination
Catching 'Osama': Embarrassingly Parallel
• Store the images in Hadoop
• When processing, Hadoop will read the images from local disk, thousands of local disks spread throughout the cluster
• Use a map-only job to compare input images against the 'needle' image
(Diagram: images stored in SequenceFiles are streamed to Map Task 0 and Map Task 1; every task has a copy of the 'needle' image and outputs the faces matching it.)
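A toy map-only matcher, shrunk to a local shell sketch (the file names are made up, and real face matching would use a similarity model, not byte-for-byte checksums):

```shell
# Each "map task" compares one input image against the shared needle
# and emits only the matches; there is no reduce step at all.
printf 'face-A' > img0.jpg
printf 'face-B' > img1.jpg
printf 'face-B' > needle.jpg
needle_sum=$(md5sum needle.jpg | cut -d' ' -f1)
for img in img0.jpg img1.jpg; do
  [ "$(md5sum "$img" | cut -d' ' -f1)" = "$needle_sum" ] && echo "match: $img"
done
# prints: match: img1.jpg
```

Because no task depends on any other, the job scales linearly with the number of disks holding images, which is what "embarrassingly parallel" means here.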
• One of the most common use cases I see is replacing ETL processes
• Hadoop is a huge sink of cheap storage and processing
• Aggregates built in Hadoop and exported
• Apache Hive provides SQL-like querying on raw data
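The aggregate-then-export pattern can be sketched locally; here awk stands in for the Hive/MapReduce aggregation, and the final file is what would be exported to the warehouse (the data and file names are invented):

```shell
# Raw fact data lands in the "sink" (here, a local CSV: product,quantity).
printf 'shoes,2\nshoes,3\nhats,1\n' > raw_sales.csv
# Build the aggregate (per-product totals), the job Hadoop would run at scale.
awk -F, '{s[$1]+=$2} END {for (k in s) print k "," s[k]}' raw_sales.csv \
  | sort > daily_totals.csv
cat daily_totals.csv   # prints: hats,1   then   shoes,5
```

Only the small aggregate leaves the cluster; the raw data stays cheap and queryable in place.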
Extract Transform Load (ETL): Everyone Is Doing It
(Diagram: a 'real'-time system (website) backed by an online DB feeds an ETL pipeline into an analytical DB / data warehouse, which serves business intelligence applications. Caption: "Much blood shed, here.")
(Diagram: the same pipeline with Hadoop inserted between the online DB and the analytical DB / data warehouse, handling the import and export steps.)
(Diagram: the same pipeline with Apache Sqoop performing both the import from the online DB into Hadoop and the export from Hadoop to the analytical DB / data warehouse.)
• Analytics is often simply counting things
• Facebook chose HBase to store its massive counter infrastructure (more later)
• How might one implement a counter infrastructure in HBase?
Analytics in HBase: Scaling Writes
Individual page counters:

URL                         Counter
com.cloudera/blog/…         154
com.cloudera/downloads/…    923621
com.cloudera/resources/…    2138

User & content-type counters:

User               Content    Counter
[email protected]    NEWS       5431
[email protected]    TECH       79310
[email protected]    SHOPPING   59
[email protected]    SPORTS     94214

A 'Like' button IMG request sends an HTTP request to Facebook servers, which increments several counters.
Individual page counters:

URL                         Counter
com.cloudera/blog/…         154
com.cloudera/downloads/…    923621
com.cloudera/resources/…    2138

The host is reversed in the URL as part of the row key:
• Data is physically stored in sorted order
• Scanning all 'com.cloudera' counters results in sequential I/O
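The reversed-host trick can be demonstrated with plain shell (the URLs are made up; in HBase the reversed string would be the row key and the scan would be a prefix scan):

```shell
# Reverse each URL's host so that a plain lexicographic sort clusters
# all of one domain's pages together; scanning a prefix is then
# sequential I/O instead of seeks scattered across the keyspace.
keys=$(printf 'blog.cloudera.com/post1\nwww.example.org/a\ndownloads.cloudera.com/x\n' |
  while IFS=/ read -r host path; do
    echo "$(echo "$host" | tr '.' '\n' | tac | paste -sd.)/$path"
  done | sort)
echo "$keys"
# prints: com.cloudera.blog/post1
#         com.cloudera.downloads/x
#         org.example.www/a
```

All `com.cloudera` rows are now adjacent, so a scan over that prefix touches one contiguous region of the table.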
• Real-time counters of URLs shared, links "liked", and impressions generated
• 20 billion events/day (about 200K events/sec)
• ~30 second latency from click to count
• Heavy use of the incrementColumnValue API for consistent counters
• Tried MySQL and Cassandra, settled on HBase (http://tiny.cloudera.com/hbase-…-analytics)
Facebook Analytics: Scaling Writes
Machine Learning: Apache Mahout
Text Clustering on Google News
Collaborative filtering on Amazon
Classification in Gmail
• Apache Mahout implements:
• Collaborative filtering
• Classification
• Clustering
• Frequent itemset mining
• More coming with the integration of MapReduce.Next
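To give a flavor of what collaborative filtering computes, here is a toy item-co-occurrence count in awk (the preference data is invented; Mahout's item-based recommender builds this kind of matrix at scale with MapReduce):

```shell
# user,item preference pairs.
printf 'alice,beer\nalice,diapers\nbob,beer\nbob,diapers\ncarol,beer\n' > prefs.csv
# Count how often two items are liked by the same user; high-co-occurrence
# pairs drive "people who liked X also liked Y" recommendations.
cooccur=$(awk -F, '{items[$1] = items[$1] "," $2}
  END {for (u in items) {n = split(substr(items[u], 2), a, ",")
         for (i = 1; i <= n; i++) for (j = i+1; j <= n; j++) pairs[a[i] "-" a[j]]++}
       for (p in pairs) print p, pairs[p]}' prefs.csv)
echo "$cooccur"   # prints: beer-diapers 2
```

Two of the three users bought both items, so "beer-diapers" scores 2; at billions of preferences this counting is exactly the kind of job MapReduce parallelizes well.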
• Other use cases:
• OpenTSDB, an open, distributed, scalable Time Series Database (TSDB)
• Building search indexes (the canonical use case)
• Facebook Messaging
• Cheap and deep storage, e.g. archiving emails for SOX compliance
• Audit logging
• Non-use cases:
• Data processing that one beefy server can handle
• Data that requires transactions
Final Thoughts: Use the Right Tool
About the Presenter
• Brock Noland
• http://twitter.com/brocknoland
• TC-HUG: http://tch.ug