hadoop-user-group
• 2007 – Research team builds a 4-node cluster
– Subset of Click Stream and EDW data
– Innovation with Mobius Query Language
– Visualization and click path analysis
• Sept 2009 – Search clusters
– Machine Learning Ranking cluster of 28 nodes
– Search relevance cluster of 10 nodes
– Subset of Click Stream and EDW data
• May 2010 – Athena* exploratory cluster of 532 nodes
– Platform teams join hands with Search/Research to build a larger cluster
– Build it as a core competency for advanced insights into complex data
– Rapid build-out with timelines pulled in by a couple of months
* Athena is the goddess of civilization, wisdom, strength, strategy, craft, justice and skill in Greek mythology.
MIT's Athena ushered the world into a new era of distributed systems when it started in the mid-80s.
Infrastructure
• Enterprise Nodes
– Sun 64-bit, Red Hat Linux
– 2 Quad Core Nehalem, 72GB RAM, 4TB
– Servers: NameNode(s), Job Tracker, Zookeeper, HBaseMaster, Ganglia Server, eBay (Cloudera) HUE
• Data Nodes
– SGI-Rackables, CentOS, 1U, 5.3PB
– 2 Quad Core Nehalem, 36GB RAM, 10TB
– HBase on 20 nodes
• Network
– TOR 1Gbps
– Core switches uplink 40Gbps
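The 5.3PB figure is consistent with the node count and per-node disk size quoted on the slides. A minimal sketch of that arithmetic follows; it assumes all 532 nodes of the Athena cluster are data nodes and that HDFS runs at its default replication factor of 3. Neither assumption is stated in the deck.

```python
# Back-of-the-envelope capacity check for the data node tier.
DATA_NODES = 532         # assumption: every node in the 532-node cluster is a data node
DISK_PER_NODE_TB = 10    # from the slide: 10TB per data node
REPLICATION = 3          # assumption: HDFS default replication factor


def raw_capacity_pb(nodes: int, tb_per_node: float) -> float:
    """Raw disk capacity in decimal petabytes (1 PB = 1000 TB)."""
    return nodes * tb_per_node / 1000


def usable_capacity_pb(raw_pb: float, replication: int) -> float:
    """Capacity left for unique data once every block is stored `replication` times."""
    return raw_pb / replication


raw = raw_capacity_pb(DATA_NODES, DISK_PER_NODE_TB)
usable = usable_capacity_pb(raw, REPLICATION)
print(f"raw: {raw:.2f} PB, usable at {REPLICATION}x replication: {usable:.2f} PB")
```

Under these assumptions the raw figure rounds to the 5.3PB on the slide, while the space available for unique data is roughly a third of that.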
Ecosystem
Hadoop Core (HDFS, Common)
MapReduce (Java, Streaming, Pipes, Scala)
Data Access (HBase, Pig, Hive)
Tools & Libraries (HUE, UC4, Oozie, Mobius, Mahout)
Monitoring & Alerting (Ganglia, Nagios)
• MapReduce
– Sourcing data
– Primarily Java; applications also use Perl, Scala, Python…
• Data Access Frameworks
– HBase – for EDW data
– Pig – data pipelines
– Hive – ad hoc queries
– MQL – Mobius Query Language
• Monitoring & Alerting
– Ganglia, Nagios
• Tools
– HUE/Mobius – lifecycle of user jobs
– UC4 – scheduling
– Oozie – user workflow and data pipelines
– Mahout – data mining
Administration
• Groups
– Built to support multiple groups
– Job invocation uses the group name
– Fair Scheduler
  • Allocations based on investment
  • Weights
  • Minimum share of mappers and reducers
  • poolMaxJobsDefault
  • userMaxJobsDefault
  • defaultMinSharePreemptionTimeout
  • fairSharePreemptionTimeout
• Auth & Auth
– HUE – custom module to use corp. credentials
– CLI* – PAM custom module
– Security* – Implement token interface to replace Kerberos with SAML
* Work in Progress
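The Fair Scheduler settings listed above live in an XML allocation file. The sketch below generates such a file with Python's standard library. The element names (`pool`, `minMaps`, `minReduces`, `weight`, plus the four defaults named on the slide) follow the Hadoop 0.20/1.x Fair Scheduler allocation format as I understand it. The pool names `search` and `research` and every numeric value are hypothetical placeholders, not eBay's actual allocations.

```python
import xml.etree.ElementTree as ET


def build_allocations(pools, pool_max_jobs_default, user_max_jobs_default,
                      default_min_share_preemption_timeout,
                      fair_share_preemption_timeout):
    """Return Fair Scheduler allocation XML as a string.

    `pools` maps a pool (group) name to its weight and minimum map/reduce slots.
    Timeouts are given in seconds.
    """
    root = ET.Element("allocations")
    for name, cfg in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        ET.SubElement(pool, "minMaps").text = str(cfg["min_maps"])
        ET.SubElement(pool, "minReduces").text = str(cfg["min_reduces"])
        ET.SubElement(pool, "weight").text = str(cfg["weight"])
    ET.SubElement(root, "poolMaxJobsDefault").text = str(pool_max_jobs_default)
    ET.SubElement(root, "userMaxJobsDefault").text = str(user_max_jobs_default)
    ET.SubElement(root, "defaultMinSharePreemptionTimeout").text = str(
        default_min_share_preemption_timeout)
    ET.SubElement(root, "fairSharePreemptionTimeout").text = str(
        fair_share_preemption_timeout)
    return ET.tostring(root, encoding="unicode")


# Hypothetical pools: a larger investment gets a larger weight and minimum share.
xml_text = build_allocations(
    pools={
        "search": {"weight": 2.0, "min_maps": 200, "min_reduces": 100},
        "research": {"weight": 1.0, "min_maps": 100, "min_reduces": 50},
    },
    pool_max_jobs_default=20,
    user_max_jobs_default=5,
    default_min_share_preemption_timeout=600,
    fair_share_preemption_timeout=1200,
)
print(xml_text)
```

Generating the file from a small dictionary keeps the "allocations based on investment" policy in one place, so a change in a group's share becomes a one-line edit.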
Data Sourcing Patterns
Click Stream
EDW
Images
Search Indices
Analytics Reporting
Algorithmic Models
Acquisition / Description
Source | Preparation | Format | Pattern
• Click Stream – Session, Event, Session Container
– Preparation: Session/Event streamed as LZO/Text
– Format: Session Container generates SequenceFiles
– Pattern (Session/Event data): build an index and use LzoTextInputFormat for splits, based on the work done by Johan Oskarsson/Twitter
– Pattern (Session Container): "Value to Type Conversion" pattern; secondary sort with reduce-side join
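The "secondary sort with reduce-side join" pattern named for the Session Container can be illustrated without a cluster. The sketch below simulates the shuffle in plain Python: each record gets a composite key (join key, tag) so that a session record sorts ahead of its events, and grouping uses only the join key. The record layout, the tag values and the function names are assumptions made for illustration; this is not the deck's actual job.

```python
from itertools import groupby
from operator import itemgetter

SESSION, EVENT = 0, 1  # sort tags: a session record sorts before its events


def map_phase(sessions, events):
    """Emit (composite_key, value) pairs, tagging each record with its source."""
    for session_id, info in sessions:
        yield (session_id, SESSION), info
    for session_id, event in events:
        yield (session_id, EVENT), event


def reduce_side_join(sessions, events):
    """Inner-join events to their session, streaming one group at a time."""
    # Shuffle: sort on the full composite key (this is the secondary sort).
    shuffled = sorted(map_phase(sessions, events), key=itemgetter(0))
    # Grouping comparator: group on the natural key (session id) only.
    for session_id, group in groupby(shuffled, key=lambda kv: kv[0][0]):
        session_info = None
        for (_, tag), value in group:
            if tag == SESSION:
                session_info = value          # arrives first thanks to the sort
            elif session_info is not None:
                yield session_id, session_info, value


sessions = [("s2", "mobile"), ("s1", "desktop")]
events = [("s1", "view"), ("s2", "search"), ("s1", "bid"), ("s3", "view")]
joined = list(reduce_side_join(sessions, events))
print(joined)
```

Because the session record is guaranteed to arrive first within its group, the reducer only has to hold one session in memory while its events stream past.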
• EDW – Item, Transaction, User, Feedback, Bids
– Preparation: streamed as GZIP/Text
– Format: generate SequenceFile/HBase snapshot from the previous day's snapshot and the current day's data; Hive StorageHandlers point to the SequenceFile/HBase snapshot
– Pattern: TotalOrderPartitioner with RandomSamplers to identify partition ranges for reducers; create HBase regions using HFile; update RegionServers using the ruby script loadtable.rb
– Concerns: HBase append performance, HFile flush (HBASE-1923)
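The idea behind pairing TotalOrderPartitioner with a random sampler is to pick reducer boundaries from a sample of the keys, so that each reducer receives a contiguous key range and the concatenated outputs are globally sorted. The sketch below shows that idea in plain Python. It is a conceptual illustration with hypothetical function names, not the Hadoop API.

```python
import bisect
import random


def sample_split_points(keys, num_reducers, sample_size, seed=0):
    """Pick num_reducers - 1 cut points from a random sample of the keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Evenly spaced positions in the sorted sample approximate key quantiles.
    return [sample[len(sample) * i // num_reducers] for i in range(1, num_reducers)]


def partition(key, split_points):
    """Return the reducer index whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)


# Hypothetical row keys; in the deck these would be EDW record keys.
keys = [random.Random(42).randrange(1_000_000) for _ in range(10_000)]
splits = sample_split_points(keys, num_reducers=4, sample_size=500)

buckets = [[] for _ in range(4)]
for k in keys:
    buckets[partition(k, splits)].append(k)

# Each reducer sorts its own range; concatenation is then globally sorted.
merged = [k for bucket in buckets for k in sorted(bucket)]
print([len(b) for b in buckets])
```

Sampling keeps the boundary computation cheap while still giving roughly balanced reducers, which matters when each reducer's output becomes one pre-sorted region file.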
Search Use Case – Machine Learned Ranking
[Diagram: ClickStream, Items, Users, Feedback → Classifiers → Ranking Function → Great Search Results]
• Goal
– Enhance search relevance for eBay's items
• Hadoop Usage
– Build a ranking function that takes multiple factors into account, such as price, listing format, seller track record and relevance
– Ability to add new factors to validate hypotheses
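A ranking function that combines several factors, and that accepts a new factor when a hypothesis needs testing, can be sketched as a weighted sum over pluggable scoring functions. The sketch below is only an illustration of that shape: the linear form, the weights, the item fields and the factor formulas are all assumptions, and the deck describes a machine-learned function rather than hand-set weights.

```python
from typing import Callable, Dict, List, Tuple

Factor = Callable[[dict], float]


class RankingFunction:
    """Weighted sum of named factors; factors can be added at any time."""

    def __init__(self) -> None:
        self._factors: Dict[str, Tuple[Factor, float]] = {}

    def add_factor(self, name: str, fn: Factor, weight: float) -> None:
        self._factors[name] = (fn, weight)

    def score(self, item: dict) -> float:
        return sum(weight * fn(item) for fn, weight in self._factors.values())

    def rank(self, items: List[dict]) -> List[dict]:
        return sorted(items, key=self.score, reverse=True)


# Hypothetical items and weights.
items = [
    {"id": "a", "relevance": 0.90, "price": 20.0, "seller_rating": 0.99},
    {"id": "b", "relevance": 0.90, "price": 20.0, "seller_rating": 0.80},
    {"id": "c", "relevance": 0.85, "price": 5.0, "seller_rating": 0.99},
]

ranker = RankingFunction()
ranker.add_factor("relevance", lambda it: it["relevance"], weight=1.0)
ranker.add_factor("seller_track_record", lambda it: it["seller_rating"], weight=0.5)
before = [it["id"] for it in ranker.rank(items)]

# Hypothesis to validate: cheaper items deserve a boost.
ranker.add_factor("price", lambda it: 1.0 / (1.0 + it["price"]), weight=1.0)
after = [it["id"] for it in ranker.rank(items)]
print(before, after)
```

Comparing the ranking before and after a factor is added is the small-scale version of validating a hypothesis against click stream data.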
Research Use Case – Description Data Mining
• Goal
– Extend catalog coverage
• Hadoop Usage
– Leverage data mining/machine learning techniques to turn inventory descriptions into name-value pairs in a completely unsupervised way
Example listing description:
BARBIE 1999 "PREMIERE NIGHT"
Home Shopping Special Edition
Gorgeous Doll With Beautiful Blond Hair / In A Gown Of Purple And Silver
New / Never Removed From Box / Doll Is In Mint Condition / Remember This Beauty Is 11 Years Old
Free Shipping To US Only / Will Ship International / Please E-mail For Cost
Feel Free To Ask Me Any Questions Or Concerns
Smoke-Free Environment
Free Shipping

Extracted name-value pairs:
Year: 1999
Model: premiere night
Edition: home shopping special
Hair: blond
Gown: purple and silver
Condition: new / never removed from box / mint
Platform Details
• Metrics – Job statistics, system/disk consumption, utilization
• Infrastructure – Publish/subscribe, ETL tools, low-latency data movement
• Development – Tools, environment, IDE
• Architecture – Schemas, metadata, governance, policies
• Operations – Administration, configuration, monitoring
• Reporting – Visualization, BI generation, information delivery
• Security – User & group management, Auth & Auth
Clusters Details
• Exploratory – Strategic investment, 1000-5000 nodes
• Production – Site facing, low latency, high availability
• Use Case Specific – Advertising, Trust & Safety, Merchandizing
Acknowledgments
• Athena Team
• Cloudera Inc.
• Community