
Hadoop at eBay


Page 1: Hadoop at eBay


@eBay

Anil [email protected]

Analytics Platform Development

Page 2: Hadoop at eBay


• 2007 – Research team builds a 4-node cluster
– Subset of Click Stream and EDW data
– Innovation with Mobius Query Language
– Visualization and click path analysis

• 2009 Sept – Search clusters
– Machine Learning Ranking cluster of 28 nodes
– Search relevance cluster of 10 nodes
– Subset of Click Stream and EDW data

• 2010 May – Athena* exploratory cluster of 532 nodes
– Platform teams join hands with Search/Research to build a larger cluster
– Build it as a core competency for advanced insights into complex data
– Rapid build-out, with timelines pulled in by a couple of months

* Athena is the goddess of civilization, wisdom, strength, strategy, craft, justice and skill in Greek mythology.

MIT's Athena ushered in a new era of distributed systems when it started in the mid-80s.


Page 3: Hadoop at eBay

Infrastructure


• Enterprise Nodes
– Sun 64-bit, Red Hat Linux
– 2 quad-core Nehalem, 72GB RAM, 4TB
– Servers: NameNode(s), JobTracker, ZooKeeper, HBase Master, Ganglia server, eBay (Cloudera) HUE

• Data Nodes
– SGI Rackable, CentOS, 1U, 5.3PB
– 2 quad-core Nehalem, 36GB RAM, 10TB
– HBase on 20 nodes

• Network
– TOR 1Gbps
– Core switches uplink 40Gbps
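A quick sanity check on the storage figures above, as a rough sketch. It assumes that all 532 nodes of the Athena cluster are data nodes and that HDFS runs with its default replication factor of 3; the slides state neither. Under those assumptions the raw capacity works out to roughly the 5.3PB quoted for the data nodes.

```python
# Back-of-the-envelope capacity check. Node count and replication are assumptions.
DATA_NODES = 532        # assumed: every node of the 532-node cluster is a data node
DISK_TB_PER_NODE = 10   # from the slide: 10TB per data node
REPLICATION = 3         # assumed: HDFS default replication factor


def raw_capacity_pb(nodes: int, tb_per_node: float) -> float:
    """Raw disk capacity in PB, using 1 PB = 1000 TB."""
    return nodes * tb_per_node / 1000.0


def usable_capacity_pb(raw_pb: float, replication: int) -> float:
    """Capacity left for unique data once every block is stored `replication` times."""
    return raw_pb / replication


raw = raw_capacity_pb(DATA_NODES, DISK_TB_PER_NODE)
usable = usable_capacity_pb(raw, REPLICATION)
print(f"raw: {raw:.2f} PB, usable at {REPLICATION}x replication: {usable:.2f} PB")
```

The usable figure ignores space reserved for intermediate MapReduce output and the operating system, so real headroom would be lower.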


Page 4: Hadoop at eBay

Ecosystem


Hadoop Core (HDFS, Common)

MapReduce (Java, Streaming, Pipes, Scala)

Data Access (HBase, Pig, Hive)

Tools & Libraries (HUE, UC4, Oozie, Mobius, Mahout)

Monitoring & Alerting (Ganglia, Nagios)

• MapReduce
– Sourcing data: primarily Java
– Applications: Perl, Scala, Python…

• Data Access Frameworks
– HBase – for EDW data
– Pig – data pipelines
– Hive – ad hoc queries
– MQL – Mobius Query Language

• Monitoring & Alerting
– Ganglia, Nagios

• Tools
– HUE/Mobius – lifecycle of user jobs
– UC4 – scheduling
– Oozie – user workflows and data pipelines
– Mahout – data mining
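To illustrate how a scripting language fits into MapReduce through Streaming, here is a minimal Python sketch of a mapper and reducer that count events per session. The tab-separated record layout (session id, then event name) and the function names are assumptions for illustration, not eBay's actual format. In a real Streaming job the two functions would live in separate scripts that read `sys.stdin` and print tab-separated pairs; here the shuffle is simulated with a local sort.

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple


def map_events(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit (session_id, 1) for each click record; malformed lines are skipped."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue
        yield fields[0], 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """Sum counts per key; input must already be sorted by key, as after the shuffle."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


# Local simulation of map -> sort (shuffle) -> reduce on hypothetical records.
sample = ["s1\tview_item\n", "s2\tsearch\n", "s1\tbid\n", "bad line\n"]
events_per_session = dict(reduce_counts(sorted(map_events(sample))))
print(events_per_session)
```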

Page 5: Hadoop at eBay

Administration

• Groups
– Built to support multiple groups
– Job invocation uses the group name
– Fair Scheduler
   • Allocations based on investment
   • Weights
   • Minimum share of mappers and reducers
   • poolMaxJobsDefault
   • userMaxJobsDefault
   • defaultMinSharePreemptionTimeout
   • fairSharePreemptionTimeout

• Auth & Auth
– HUE – custom module to use corp. credentials
– CLI* – PAM custom module
– Security* – implement token interface to replace Kerberos with SAML

* Work in progress
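The Fair Scheduler settings listed above (weights and minimum shares per group) can be made concrete with a small sketch. This is a deliberately simplified model, not the scheduler's actual algorithm: it hands every pool its minimum share and then splits the remaining slots in proportion to weight, ignoring demand and preemption. The pool names, weights and slot counts are hypothetical.

```python
from typing import Dict, NamedTuple


class Pool(NamedTuple):
    weight: float   # relative weight, e.g. proportional to a group's investment
    min_share: int  # guaranteed minimum number of map (or reduce) slots


def fair_shares(pools: Dict[str, Pool], total_slots: int) -> Dict[str, float]:
    """Give each pool its minimum share, then split the spare slots by weight."""
    guaranteed = sum(p.min_share for p in pools.values())
    if guaranteed > total_slots:
        raise ValueError("minimum shares exceed cluster capacity")
    spare = total_slots - guaranteed
    total_weight = sum(p.weight for p in pools.values())
    return {
        name: p.min_share + spare * p.weight / total_weight
        for name, p in pools.items()
    }


# Hypothetical pools: one per group, weighted by investment.
pools = {
    "search": Pool(weight=2.0, min_share=100),
    "research": Pool(weight=1.0, min_share=50),
    "platform": Pool(weight=1.0, min_share=0),
}
shares = fair_shares(pools, total_slots=1000)
print(shares)
```

With these numbers the group holding half of the total weight ends up with a little more than half of the slots, because its minimum share is also the largest.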

Page 6: Hadoop at eBay

Data Sourcing Patterns


Click Stream

EDW

Images

Search Indices

Analytics Reporting

Algorithmic Models

Acquisition Description

Source Preparation Format Pattern

Click Stream (Session, Event, Session Container)
– Session/Event: streamed as LZO/Text
– Session Container: generate SequenceFiles
– Session/Event data: build an index and use LzoTextInputFormat for splits, based on the work done by Johan Oskarsson/Twitter
– Session Container: ‘Value to Type Conversion’ pattern; secondary sort with reduce-side join
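The secondary sort with reduce-side join named for the Session Container can be sketched in plain Python. The idea is to tag each record with its source, sort on a composite key so the session record arrives ahead of its time-ordered events, and group on the session id alone. The record layouts and names below are assumptions for illustration; in a real job the sorting and grouping would be done by the framework's sort and grouping comparators.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

SESSION, EVENT = 0, 1  # the tag orders the session record ahead of its events

# ((session_id, tag, timestamp), payload)
Record = Tuple[Tuple[str, int, int], str]


def tag_records(
    sessions: Iterable[Tuple[str, str]],
    events: Iterable[Tuple[str, int, str]],
) -> List[Record]:
    """Map side: wrap both inputs in a composite key (session_id, tag, timestamp)."""
    tagged: List[Record] = [((sid, SESSION, 0), info) for sid, info in sessions]
    tagged += [((sid, EVENT, ts), name) for sid, ts, name in events]
    return tagged


def join_reduce(sorted_records: Iterable[Record]) -> Iterator[Tuple[str, str, List[str]]]:
    """Reduce side: group on session_id only; records arrive in composite-key order."""
    for sid, group in groupby(sorted_records, key=lambda r: r[0][0]):
        info = None
        names: List[str] = []
        for (_, tag, _), payload in group:
            if tag == SESSION:
                info = payload
            else:
                names.append(payload)
        if info is not None:  # inner join: drop events that have no session record
            yield sid, info, names


# Hypothetical inputs; sorting by the full key stands in for the shuffle.
sessions = [("s1", "US"), ("s2", "DE")]
events = [("s2", 5, "search"), ("s1", 9, "bid"), ("s1", 3, "view"), ("s3", 1, "orphan")]
records = sorted(tag_records(sessions, events), key=lambda r: r[0])
joined = list(join_reduce(records))
print(joined)
```

Because the session record is guaranteed to come first, the reducer needs to hold only one session record in memory while it streams through the events.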

EDW (Item, Transaction, User, Feedback, Bids)
– Streamed as GZIP/Text
– Generate SequenceFile/HBase snapshot from the previous day's snapshot and the current day's data
– Hive StorageHandlers point to the SequenceFile/HBase snapshot
– TotalOrderPartitioner with RandomSamplers to identify partition ranges for reducers
– Create HBase regions using HFile
– Update RegionServers using the Ruby script loadtable.rb
– Concerns: HBase append performance, HFile flush (HBASE-1923)
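The idea behind pairing a total order partitioner with random sampling can be sketched as follows: sample the keys, pick evenly spaced split points from the sorted sample, and route each key to the reducer whose range contains it, so that the concatenated reducer outputs are globally sorted. This is a plain-Python illustration of the technique under assumed key formats, not Hadoop's implementation; the function names are hypothetical.

```python
import bisect
import random
from typing import List, Sequence


def sample_split_points(
    keys: Sequence[str], num_reducers: int, sample_size: int, seed: int = 0
) -> List[str]:
    """Pick num_reducers - 1 split points from a random sample of the keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(list(keys), min(sample_size, len(keys))))
    step = len(sample) / num_reducers
    return [sample[int(step * i)] for i in range(1, num_reducers)]


def partition_for(key: str, split_points: Sequence[str]) -> int:
    """Reducer index for key: keys below the first split point go to reducer 0."""
    return bisect.bisect_right(split_points, key)


# Hypothetical row keys; zero-padding keeps string order equal to numeric order.
keys = [f"item{n:05d}" for n in range(10000)]
splits = sample_split_points(keys, num_reducers=4, sample_size=1000)
partitions = [partition_for(k, splits) for k in keys]
print(splits)
```

Since each reducer then owns one contiguous key range, its sorted output can correspond to one region when the files are bulk-loaded.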

Page 7: Hadoop at eBay

Search Use Case – Machine Learned Ranking


ClickStream Items Users Feedback

Classifiers

Ranking Function

Great Search Results

• Goal
– Enhance search relevance for eBay’s items

• Hadoop Usage
– Build a ranking function that takes multiple factors into account, such as price, listing format, seller track record, and relevance
– Ability to add new factors to validate hypotheses
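As a toy illustration of a ranking function that combines several factors and lets new ones be added to test a hypothesis, the sketch below scores items with a weighted sum. The slides do not describe the form of eBay's learned ranking function, so the linear form, the factor names and all values here are assumptions.

```python
from typing import Dict, List

Features = Dict[str, float]


def score(features: Features, weights: Dict[str, float]) -> float:
    """Weighted sum over the factors named in weights; missing factors count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def rank(items: Dict[str, Features], weights: Dict[str, float]) -> List[str]:
    """Item ids ordered from highest to lowest score."""
    return sorted(items, key=lambda item_id: score(items[item_id], weights), reverse=True)


# Hypothetical items and factor values.
items = {
    "a": {"relevance": 0.9, "seller_track_record": 0.5, "price_competitiveness": 0.2},
    "b": {"relevance": 0.6, "seller_track_record": 0.9, "price_competitiveness": 0.9},
}

baseline = rank(items, {"relevance": 1.0})
# Hypothesis: seller track record and price should also influence the ranking.
with_new_factors = rank(
    items,
    {"relevance": 1.0, "seller_track_record": 1.0, "price_competitiveness": 1.0},
)
print(baseline, with_new_factors)
```

Adding a factor is a one-line change to the weights, which makes it cheap to compare rankings before and after.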

Page 8: Hadoop at eBay

Research Use Case – Description Data Mining

• Goal
– Extend catalog coverage

• Hadoop Usage
– Leverage data mining/machine learning techniques to convert inventory descriptions into name-value pairs in a completely unsupervised way


BARBIE 1999 "PREMIERE NIGHT"
Home Shopping Special Edition
Gorgeous Doll With Beautiful Blond Hair / In A Gown Of Purple And Silver
New / Never Removed From Box / Doll Is In Mint Condition / Remember This Beauty Is 11 Years Old
Free Shipping To US Only / Will Ship International / Please E-mail For Cost
Feel Free To Ask Me Any Questions Or Concerns
Smoke - Free Environment
Free Shipping

Year: 1999
Model: premiere night
Edition: home shopping special
Hair: blond
Gown: purple and silver
Condition: new / never removed from box / mint

Page 9: Hadoop at eBay

Platform – Details

Metrics – Job statistics, system/disk consumption, utilization

Infrastructure – Publish/subscribe, ETL tools, low-latency data movement

Development – Tools, environment, IDE

Architecture – Schemas, metadata, governance, policies

Operations – Administration, configuration, monitoring

Reporting – Visualization, BI generation, information delivery

Security – User & group management, Auth & Auth


Clusters – Details

Exploratory – Strategic investment, 1000-5000 nodes

Production – Site facing, low latency, high availability

Use Case Specific – Advertising, Trust & Safety, Merchandising

Page 10: Hadoop at eBay


Acknowledgments

• Athena Team

• Cloudera Inc.

• Community