Feeds processing at Yahoo!: One Platform, One Hadoop, Two Systems. Yahoo! Inc. Apache Hadoop India Summit, 16th February 2011

Apache Hadoop India Summit 2011 talk "Feeds Processing at Yahoo!" by Jean-Christophe Counio




Agenda

Pacman: Design, Contributions

The small feeds problem

Pepper: Requirements, Design

Production numbers

Cover the whole spectrum

Examples of processing

Conclusion


Pacman

Started in 2006 in Bangalore

Processes large feeds: millions of records in a few hours

Multi-tenant

Reliability, operability

Uses Hadoop M/R; one record is the unit of processing

Workflow semantics over Hadoop

Workflow defined by a DAG

Each node result is stored in HDFS ‘Channels’

Feeds-processing-oriented API, abstracting M/R

High availability, cross-colo replication of HDFS data
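The workflow model on this slide (a DAG of nodes, each node's result stored in its own channel) can be illustrated with a short sketch. This is not Pacman's API: `run_workflow`, the node names, and the in-memory `channels` dict standing in for HDFS channels are all hypothetical, and the sketch runs in one process rather than as one Hadoop job per node.

```python
from graphlib import TopologicalSorter


def run_workflow(nodes, deps, feed):
    """Run each node once all of its upstream nodes have finished.

    nodes: name -> function(list of upstream record lists) -> record list
    deps:  name -> set of upstream node names (the DAG edges)
    feed:  input records handed to nodes that have no upstream
    """
    channels = {}  # hypothetical stand-in for per-node HDFS channels
    for name in TopologicalSorter(deps).static_order():
        upstream = [channels[d] for d in sorted(deps.get(name, ()))]
        channels[name] = nodes[name](upstream or [feed])
    return channels


# Hypothetical three-node workflow: validate -> enrich -> publish
nodes = {
    "validate": lambda ins: [r for r in ins[0] if "id" in r],
    "enrich": lambda ins: [dict(r, category="auto") for r in ins[0]],
    "publish": lambda ins: [r["id"] for r in ins[0]],
}
deps = {"validate": set(), "enrich": {"validate"}, "publish": {"enrich"}}

channels = run_workflow(nodes, deps, [{"id": 1}, {"title": "no id"}, {"id": 2}])
```

Keeping every intermediate result in a named channel means any node's output can be inspected or replayed without rerunning its upstream nodes.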


Design

Notification

Asynchronous processing

One job for each WF node

State in DB

Feed copied on the Grid

Reporting service exposes metrics and logs

[Architecture diagram] Components: Admin User, Deployment Service, Receiver, Feeds Archive, Core DB, Workflow Executor, Pacman Grid (Hadoop, HDFS), Pebls/UDF, Reporting Service

Diagram steps:

1 : Deploy WF, deploy native pkg

2 : Large feeds notify

3 : Store notification

4 : Launch WF

5 : Send job and wait for notification (for each WF node)

6 : Read feeds

7 : Send instrumentation data (for each WF node)

9 : Read logs


Contributions

Multiple Output files for a Job

Counters

Chaining of Maps

Led to open-sourced Oozie


The small feeds problem

More and more small feeds onboarded (NPC, OMG, Green…)

Overhead of Pacman is high (Hadoop, DB…)

Too many small files on HDFS

Solution : Process nodes of Workflow in WebServer Farm

Lack of isolation between executions

Native libraries management

Operability issues (provisioning,…)


Pepper requirements

Be able to support all properties: News, Finance, Travel, …

Scalable (millions of feeds a day), Elastic

Isolation, Multiple Native Libraries versions

Low overhead (<5s)

Compatible with Pacman API

Reuse Pacman code/infrastructure as much as possible


Pepper

Servlet Model

Synchronous in-memory execution of the workflow (very fast)

No use of HDFS

Shares Pacman API and infrastructure: Hadoop, Reporting, Deployment…

Cloud-like qualities: Elastic, Scalable, Isolation


Design

Embedded Jetty server runs in a Map task and registers with ZooKeeper

1 Hadoop job = 1 Map task = 1 Web Server = 1 WebApp = 1 Workflow

Proxy Router receives incoming requests, looks up ZooKeeper and redirects to the appropriate Web Server

[Architecture diagram] Components: Admin User, Job Manager, HDFS, Hadoop JT, ZooKeeper, Proxy Router, Map Web Engine

Diagram steps:

1 : Register webapp

2 : Copy webapp

3 : Add webapp node

4 : Send job

5 : Create host entry

6 : Send request (synchronous)

7 : Read available entries

9 : Send request (synchronous)

10 : Copy logs
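The register-then-route flow described on this slide can be sketched as follows. It is a minimal illustration under stated assumptions: the `Registry` class is an in-memory stand-in for ZooKeeper, the class and method names are hypothetical, and round-robin selection is an assumption, since the slides do not say how the Proxy Router picks among available servers.

```python
import itertools


class Registry:
    """In-memory stand-in for ZooKeeper: workflow name -> live server addresses."""

    def __init__(self):
        self._entries = {}

    def register(self, workflow, address):
        # Corresponds to the web server creating its host entry at startup.
        self._entries.setdefault(workflow, []).append(address)

    def available(self, workflow):
        return list(self._entries.get(workflow, []))


class ProxyRouter:
    """Looks up the registry and picks a web server for each request."""

    def __init__(self, registry):
        self._registry = registry
        self._counter = itertools.count()

    def route(self, workflow):
        servers = self._registry.available(workflow)
        if not servers:
            raise LookupError(f"no web server registered for {workflow!r}")
        # Round-robin is an assumption, not something the slides specify.
        return servers[next(self._counter) % len(servers)]


registry = Registry()
registry.register("news", "host-a:8080")
registry.register("news", "host-b:8080")

router = ProxyRouter(registry)
picked = [router.route("news") for _ in range(3)]
```

Because the router reads the registry on every request, a newly started Map task becomes routable as soon as it registers, with no router restart.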


Production numbers

System   Burst rate (requests/min)   Throughput (requests/day)   Platform latency (avg.)   Response time (avg.)
Pepper   2,000                       3 million                   75 ms                     4 s
Pacman   50                          10,000                      90 s                      120 s

Qualified with a simple workflow on a cluster of 3 Hadoop slaves
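The burst rates are quoted per minute and the throughputs per day, so comparing them requires a unit conversion. The short calculation below uses only the qualification figures above; the variable and function names are illustrative.

```python
MINUTES_PER_DAY = 24 * 60

# Figures copied from the qualification numbers above.
burst_per_min = {"Pepper": 2_000, "Pacman": 50}
requests_per_day = {"Pepper": 3_000_000, "Pacman": 10_000}


def average_per_min(system):
    """Daily throughput spread evenly over 24 hours, in requests per minute."""
    return requests_per_day[system] / MINUTES_PER_DAY


burst_ratio = burst_per_min["Pepper"] / burst_per_min["Pacman"]
throughput_ratio = requests_per_day["Pepper"] / requests_per_day["Pacman"]

for system in ("Pepper", "Pacman"):
    print(f"{system}: {average_per_min(system):.1f} requests/min on average")
print(f"Burst ratio: {burst_ratio:.0f}x, daily throughput ratio: {throughput_ratio:.0f}x")
```

On these numbers Pepper handles 40 times Pacman's burst rate and 300 times its daily throughput.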


Production numbers

Pacman :

20+ solutions (Autos, Real Estate, Deals…)

150,000 feeds

250 requests/h

200 million listings processed/week

Pepper :

News, Finance, NPC

600,000 feeds

10,000 requests/h… for now

20-slave Hadoop cluster (x2 colos)


Cover the whole spectrum

Clever switch between the 2 systems

Choice can be made upfront

‘Sticky’ feeds go to Pacman

Size > 2MB go to Pacman

Failed feeds in Pepper are redirected to Pacman: OutOfMemory, TimeOut
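The switching rules on this slide can be expressed as a small routing function. This is a sketch, not the production logic: the function names and the feed dictionary layout are hypothetical, "2MB" is interpreted as binary megabytes, and Python's built-in `MemoryError` and `TimeoutError` stand in for the OutOfMemory and TimeOut failures.

```python
SIZE_LIMIT_BYTES = 2 * 1024 * 1024  # "2MB" from the slide; binary megabytes is an assumption


def choose_system(size_bytes, sticky=False):
    """Upfront choice: sticky or large feeds go to Pacman, the rest to Pepper."""
    if sticky or size_bytes > SIZE_LIMIT_BYTES:
        return "pacman"
    return "pepper"


def process(feed, run_pepper, run_pacman):
    """Try Pepper when eligible; redirect to Pacman on out-of-memory or timeout."""
    if choose_system(len(feed["payload"]), feed.get("sticky", False)) == "pacman":
        return "pacman", run_pacman(feed)
    try:
        return "pepper", run_pepper(feed)
    except (MemoryError, TimeoutError):
        return "pacman", run_pacman(feed)


def flaky_pepper(feed):
    # Simulates a Pepper execution that times out.
    raise TimeoutError("simulated Pepper timeout")


result = process({"payload": b"x" * 10}, flaky_pepper, lambda feed: "done")
```

The fallback keeps the fast path cheap: small feeds pay Pacman's overhead only when Pepper cannot finish them.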


Examples of processing

Validation against schema

Filtering (Security), Image resizing

Send images to edge serving

Reformat to common model

Simple (in-line) enrichments

Categorization

Geocoding

Entity Recognition

Clustering


Conclusion

One common platform (Deployment, Reporting…)

Covers the whole spectrum of feeds

Share the same Hadoop cluster

Very generic concepts

Pacman : Workflow engine

Pepper : Serving cloud on top of Hadoop


Pepper future work

On-demand allocation of servers

Async NIO between Proxy Router & Map Web Engine to increase scalability

Improving distribution of requests across web servers

Follow Hadoop roadmap


References

Oozie

http://yahoo.github.com/oozie/

http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-oozie/

Pepper

http://yahoo.github.com/pepper/ (new !!)

http://www.computer.org/portal/web/csdl/doi/10.1109/CloudCom.2010.39

http://salsahpc.indiana.edu/CloudCom2010/slides/PDF/Pepper%20An%20Elastic%20Web%20Server%20Farm%20for%20Cloud%20based%20on%20Hadoop.pdf


Questions ?
