Yahoo Chid Presentation

  • 8/8/2019 Yahoo Chid Presentation

    1/47

    Large Scale Distributed Infrastructures (using Hadoop)

    Chidambaran Kollengode
    Hadoop Engineering
    Cloud Computing and Data Infrastructure Group
    Yahoo India R & D, Bangalore

    Workshop on Cloud Computing
    18-20 Aug 2010
    at IIT Madras, Chennai

    Agenda

    - Demystifying the Cloud
    - Why have "Cloudy" datacenters?
    - Why does Yahoo! need a cloud infrastructure?
    - Case studies in Yahoo!
    - Hadoop Architecture: a bird's eye view
    - Key Challenges in Cloud Computing, and the continued learning from those challenges!
    - Q&A

    Demystifying the Cloud.

    Is the Cloud concept new?

    First Gen
    - Network as a cloud: message in and out
    - Cloud hid processing from users

    Next Evolution
    - WWW: cloud around documents
    - URL in, document out

    And the present really rocks!

    But now with cloud computing we have...

    [Figure labels: YOUR BUSINESS; YOUR DATA (Control, Processing, Storage); The Internet; HOSTED SERVICES; YOUR DATA?]

    So what is a cloud?

    Cloud Computing is either:
    - Hosted data processing services, or
    - Hosted web services

    Which are:
    - A highly distributed and elastic computing environment
    - With predictable availability

    Basically, it's a self-scaling computing resource
    - Fire and forget

    Thanks, in part, to:
    - Cheaper bandwidth and hardware
    - Much faster machines
    - Abstraction
    - Businesses, the research community & Open Source

    Why should new datacenters be clouds?

    Cloud ROI

    - Scalability on demand
    - Concentrate on improving business process / more time for innovation
    - Streamlining data centers (including off-loading to public cloud)
    - Level playing field: minimizing startup costs

    Two facets of Scalability

    With a lot of machines:
    - More users, more data, more mining, more ads, etc.
    - or: potential to do things log(N) years earlier than others

    The latter is innovation!

    Where did all my time go?

    "If only my team had more bandwidth, we could innovate"

    UC Berkeley study:
    - 30-40%: differentiated work and value creation
    - 60-70%: undifferentiated tasks (hardware, installs, upgrades, provisioning, load balancing!!)

    Data Center streamlining

    Data Center challenges
    - Conflicting demands: bring costs down, yet provide innovative solutions
    - Can on-premise data centers do the balancing act?
    - Enterprises will begin with private clouds (and solve scalability problems!)
    - Offloading spikes to public clouds
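
    The idea of keeping a private cloud and offloading only the spikes can be illustrated with a small calculation. This is a minimal sketch, not something from the talk: the demand series, the capacity of 80 machines, and the function name split_load are all invented for illustration.

```python
def split_load(demand, private_capacity):
    """Split each period's demand into a private part and a public overflow."""
    private = [min(d, private_capacity) for d in demand]
    public = [max(d - private_capacity, 0) for d in demand]
    return private, public


# Hypothetical demand in "machines needed" per hour; the spike is at hour 3.
demand = [40, 55, 60, 140, 70, 45]
private, public = split_load(demand, private_capacity=80)

print(private)  # [40, 55, 60, 80, 70, 45]
print(public)   # [0, 0, 0, 60, 0, 0]
```

    Under these assumed numbers the private data center is sized for the typical load, and only the 60 machines of overflow in one hour are rented from a public cloud.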

    On the horizon

    - Every business has to have Web presence
      - Zero control on who, how many, when, how long...
      - No choice but to migrate this to scalable infrastructures
    - Blurring the line between apps for employees versus customers (web will enable this), so why have two experiences?
    - Enterprise then runs its apps as metered utilities: increased machine usage and ROI!

    This means cloud.

    Why Cloud @ Yahoo!

    Yahoo Business Model

    Customer Experience

    Traffic

    Ads

    Simple Growth Model

    For this growth:
    - Incremental scaling is the key. Add one node at a time!
    - Reverse scalability (redirecting resources to apps)

    The 21st century is about understanding people and the experiences they want. It is a lot more than infrastructure.

    Yahoo! is Perfect for Cloud Computing

    HUNDREDS OF PROPERTIES / PRODUCTS
    600M UNIQUE USERS / MONTH
    300M+ YAHOO! MAIL USERS / MONTH
    HUNDREDS OF PETABYTES OF STORAGE
    BILLIONS OF OBJECTS STORED
    PETABYTES OF TRAFFIC DAILY

    Why Cloud Infrastructure is the only answer

    - Cost effective
    - Multi-tenant
    - Rapid experimentation
    - Handle failure daily
    - Unpredictable peaks (scale)

    Only cloud can enable this.

    What is Yahoo! doing?

    - Private cloud for internal use. But, many open source components.
    - Hadoop:
      - Open source (Apache) framework for running applications on large clusters on commodity hardware
      - Is the largest (only?) open source framework for data-intensive apps (petabytes)
    - PIG: Parallel Programming Language and Runtime
    - Zookeeper: High Availability Directory and Configuration Service

    Yahoo! Cloud Services: Horizontal and Functional

    (Hadoop)

    How is Yahoo! seeing the space?

    Yahoo! sees two kinds of Cloud services:
    - Horizontal Cloud Services
      - Functionality enabling tenants to build applications or new services on top of the cloud
      - The focus of CCDI
    - Functional Cloud Services
      - Functionality that is useful in and of itself to tenants
      - Yahoo!'s Index Tools; Yahoo! properties aimed at end users, e.g., Flickr, Groups, Mail, News, Shopping
      - Could be built on top of horizontal cloud services or from scratch

    Yahoo!'s Cloud Use Case

    - Advertising Optimization & Delivery
    - Content Optimization
    - Search Index
    - Image/Video Storage & Delivery
    - RSS Feeds
    - Caching, Load Balancing
    - Machine Learning (e.g. spam filters)

    Large Applications

                          2008                    2009
    Webmap                ~70 hours runtime       ~73 hours runtime
                          ~300 TB shuffling       ~490 TB shuffling
                          ~200 TB output          ~280 TB output
                                                  +55% hardware
    Sort benchmarks       1 Terabyte sorted:      1 Terabyte sorted:
    (Jim Gray contest)    209 seconds,            62 seconds, 1500 nodes
                          900 nodes               1 Petabyte sorted:
                                                  16.25 hours, 3700 nodes
    Largest cluster       2000 nodes              4000 nodes
                          6 PB raw disk           16 PB raw disk
                          16 TB of RAM            64 TB of RAM
                          16K cores               32K cores
                                                  (40% faster too!)
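
    To put the sort benchmark figures on a common scale, they can be converted into aggregate and per-node throughput. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes), and the helper name throughput is made up here.

```python
TB = 10**12  # assumption: decimal terabyte
PB = 10**15  # assumption: decimal petabyte

# (label, bytes sorted, seconds, nodes), using the sort benchmark figures on this slide
runs = [
    ("2008, 1 TB", 1 * TB, 209, 900),
    ("2009, 1 TB", 1 * TB, 62, 1500),
    ("2009, 1 PB", 1 * PB, 16.25 * 3600, 3700),
]


def throughput(nbytes, seconds, nodes):
    """Return (aggregate GB/s, per-node MB/s) for one sort run."""
    aggregate = nbytes / seconds
    return aggregate / 1e9, aggregate / nodes / 1e6


for label, nbytes, seconds, nodes in runs:
    gb_s, mb_s_node = throughput(nbytes, seconds, nodes)
    print(f"{label}: {gb_s:.1f} GB/s aggregate, {mb_s_node:.1f} MB/s per node")
```

    These are end-to-end sort rates (read, shuffle, sort, write), so the per-node numbers are well below raw disk speed.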

    Example: Search Assist(TM)

                        Before Hadoop    After Hadoop
    Time                26 days          20 minutes
    Language            C++              Python
    Development Time    2-3 weeks        2-3 days

    - Database for Search Assist is built using Hadoop
    - 3 years of log data
    - 20 steps of map-reduce

    Image Search components

    Yahoo! Confidential & Proprietary

    - 200 Node HDFS cluster
    - RHEL AS 4 U4, 64-bit
    - 1TB * 4 Disks, JBOD (RF as 2 in HDFS)
    - 8 tasks per machine
    - Dump jobs: 800% performance gains

    Excellent Cloud use case

    NYTIMES
    - Needed offline conversion of public domain articles from 1851-1922
    - Used Hadoop to convert scanned images to PDF
    - Ran 100 Amazon EC2 instances for around 24 hours
    - 4 TB of input
    - 1.5 TB of output

    Published 1892, copyright New York Times

    Hadoop: Big Data Processor!

    A bird's eye view of Architecture

    How to Process Big Data?

    - Just reading 100 terabytes of data can be overwhelming
      - Takes ~11 days to read on a standard computer
      - Takes a day across a 10 Gbit link (very high end storage solution)
      - But it only takes 15 minutes on 1000 standard computers!
    - Using clusters of standard computers, you get
      - Linear scalability
      - Commodity pricing
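
    The three read times on this slide are consistent with each other, which a short calculation shows. This is a sketch under stated assumptions: decimal units (100 TB = 10^14 bytes), an ideal 10 Gbit/s link with no protocol overhead, and a single-machine read rate backed out of the "~11 days" figure rather than taken from the talk.

```python
DATA_BYTES = 100 * 10**12          # 100 TB, decimal units assumed
SECONDS_PER_DAY = 86_400

# Assumption: back out the single-machine read rate from the "~11 days" figure.
single_machine_seconds = 11 * SECONDS_PER_DAY
read_rate = DATA_BYTES / single_machine_seconds      # bytes/second

# An ideal 10 Gbit/s link moves 1.25e9 bytes/second.
link_seconds = DATA_BYTES / (10e9 / 8)

# 1000 machines, each reading 1/1000 of the data in parallel.
cluster_seconds = single_machine_seconds / 1000

print(f"implied read rate: {read_rate / 1e6:.0f} MB/s")        # 105 MB/s
print(f"10 Gbit link:      {link_seconds / 3600:.1f} hours")   # 22.2 hours
print(f"1000 machines:     {cluster_seconds / 60:.1f} minutes")  # 15.8 minutes
```

    The 1000-machine figure only holds if the data is already spread across the machines' local disks, which is the point of the linear scalability claim.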

    Yahoo! Hadoop Cluster

    What 25,000 Hadoop nodes look like

    How does Hadoop scale?

    Map/Reduce

    [Diagram: Input -> Map (x4) -> Transient Data -> Reduce (x4) -> Results]

    - Split into bits
    - Process the bits on each node
    - Shuffle into bins
    - Collate each bin on each node
    - Join it all together
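
    The five stages in the diagram can be mimicked in a single process. This is a toy word count in plain Python, not Hadoop's API: the function names, the number of map and reduce tasks, and the byte-sum partitioner are all assumptions chosen to keep the sketch short and deterministic.

```python
from collections import defaultdict


def split_input(lines, n_splits):
    """Split the input into roughly equal bits, one per map task."""
    return [lines[i::n_splits] for i in range(n_splits)]


def map_task(chunk):
    """Process one bit: emit (word, 1) for every word."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped, n_bins):
    """Route every key to one bin, so each reducer sees all values for its keys."""
    bins = [defaultdict(list) for _ in range(n_bins)]
    for pairs in mapped:
        for key, value in pairs:
            bins[sum(key.encode()) % n_bins][key].append(value)
    return bins


def reduce_task(bin_):
    """Collate one bin: sum the counts for each word."""
    return {key: sum(values) for key, values in bin_.items()}


def word_count(lines, n_maps=4, n_reduces=4):
    mapped = [map_task(chunk) for chunk in split_input(lines, n_maps)]
    reduced = [reduce_task(b) for b in shuffle(mapped, n_reduces)]
    result = {}
    for part in reduced:          # join it all together
        result.update(part)
    return result


counts = word_count(["the cloud", "the grid and the cloud"])
print(counts["the"], counts["cloud"])  # 3 2
```

    Each map_task and reduce_task call is independent of the others, which is what lets a real cluster run them on different nodes.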

    Map-Reduce and HDFS

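    [Diagram: the Namenode maps files to blocks, file1 -> (1,3) and file2 -> (2,4,5). Replicas of blocks 1-5 are spread across the nodes. The JobTracker hands Map tasks and Reduce tasks to the TaskTracker (TT) running on each node.]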
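
    The Namenode's role in the diagram is pure bookkeeping: which blocks make up a file, and which nodes hold a replica of each block. A minimal sketch of that metadata follows, using the file1 and file2 block lists from the slide. The class name ToyNamenode, the node names dn1..dn5, the replication factor of 3, and the round-robin placement are assumptions for illustration; real HDFS placement is rack-aware and more involved.

```python
from itertools import cycle


class ToyNamenode:
    """Keeps only metadata: file -> block ids, block id -> nodes holding a replica."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        if replication > len(self.datanodes):
            raise ValueError("replication cannot exceed the number of datanodes")
        self.replication = replication
        self.file_blocks = {}
        self.block_locations = {}
        self._next = cycle(self.datanodes)   # round-robin placement (an assumption)

    def add_file(self, name, block_ids):
        self.file_blocks[name] = list(block_ids)
        for block in block_ids:
            self.block_locations[block] = [
                next(self._next) for _ in range(self.replication)
            ]

    def locate(self, name):
        """Return [(block id, [datanodes])] for a file."""
        return [(b, self.block_locations[b]) for b in self.file_blocks[name]]


nn = ToyNamenode(["dn1", "dn2", "dn3", "dn4", "dn5"], replication=3)
nn.add_file("file1", [1, 3])
nn.add_file("file2", [2, 4, 5])
print(nn.locate("file1"))
# [(1, ['dn1', 'dn2', 'dn3']), (3, ['dn4', 'dn5', 'dn1'])]
```

    A scheduler can use the same lookup to run a map task on a node that already holds the block, so the computation moves to the data rather than the other way round.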

    Map-Reduce on a larger scale

    - Take the previous example and make it Web scale
      - Billions of web pages
      - Index can reach a few petabytes
      - Thousands of machines
      - Run multiple jobs/programs
      - Compute and process intensive
    - We need a platform to do this: HADOOP!
      - We are building Hadoop with the community!
      - Hadoop is open source

    Challenges & Learnings

    Surprises we've encountered along the way... and our approach

    Key Challenges

    - Elastic scaling: typically, with commodity infrastructure
    - Availability: trading consistency/performance/availability
    - Handling failures: what can be counted on after a failure?
    - Operational efficiency: managing and tuning multi-tenanted clouds
    - The right abstractions: data, security, and services in the cloud

    Don't forget failures!

    Data Diversity Challenges

    Types of Data include:
    - Static Text: web page crawl
    - Dynamic Text: social properties (Answers, Flickr)
    - Structured Data (Autos, Local, Shopping)
    - Streams (Finance, News)
    - Multimedia
    - Mail

    How to analyze and integrate this Big Data?

    Growth Challenges

    Challenge                            Opportunity
    Data transfer bottlenecks            FedEx-ing disks, Data Backup/Archival
    Performance unpredictability         Improved VM support, flash memory, scheduling VMs
    Scalable structured storage          Major research opportunity
    Bugs in large distributed systems    Invent Debugger that relies on Distributed VMs
    Scaling quickly                      Snapshots (maybe?)

    RAD Labs

    Adoption Challenges (Public Clouds)

    Challenge                                Opportunity
    Availability / business continuity       Multiple providers & DCs
    Data lock-in                             Standardization
    Data Confidentiality and Auditability    Encryption, VLANs, Firewalls; Jurisdiction of Data Storage

    RAD Labs

    Users!! Can't live with them, can't shoot them!

    - There is always a new way to crash the system!
    - Tragedy of the commons
      - When have you seen a shared drive that is not full?
    - We do love them of course, they pay our wages
      - We engage them!
      - Make shared costs visible!
    - Bad designs lead to bad results

    Challenges in Hadoop QE and RE

    - Reliability
      - Loss of nodes
      - Data corruption
      - Loss of data blocks
    - Scale
      - Use simulation
      - DataNode / TaskTracker simulation
    - Repeatability
      - Deployment on multi-node clusters
      - Configs for variety of clusters
      - Continuous Integration (daily integration)
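
    Simulation makes the reliability cases above cheap to exercise without a real cluster. This is a minimal sketch of the idea and not Yahoo!'s test harness: the cluster size, block count, replication factor of 3, random placement, and both function names are assumptions made for illustration.

```python
import random


def place_blocks(n_blocks, nodes, replication, rng):
    """Assign each block to `replication` distinct simulated datanodes."""
    return {b: set(rng.sample(nodes, replication)) for b in range(n_blocks)}


def fail_nodes(placement, dead):
    """Classify blocks after the nodes in `dead` disappear."""
    lost, under_replicated = [], []
    for block, holders in placement.items():
        alive = holders - dead
        if not alive:
            lost.append(block)
        elif len(alive) < len(holders):
            under_replicated.append(block)
    return lost, under_replicated


rng = random.Random(42)            # fixed seed, so a test run is repeatable
nodes = list(range(100))           # 100 simulated datanodes
placement = place_blocks(10_000, nodes, replication=3, rng=rng)

dead = set(rng.sample(nodes, 2))   # lose two nodes at once
lost, under = fail_nodes(placement, dead)
print(len(lost), len(under))
```

    With three replicas and only two failed nodes, no block can be lost, so a test can assert that `lost` is empty and that every block in `under` is queued for re-replication.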

    Testing -> Stability and Agility

    Two competing needs:
    - Rapid development: adding new features / innovate
    - Increase stability: Hadoop is mission critical / pressure to move slowly!

    How do you move the curve?
    - Invest in automated testing!
    - Continuous integration
    - Stress testing

    Research Problems

    - Checkpointing parallel applications
    - Rescheduling policies
    - Performance modeling
    - Energy-based optimizations

    Performance Problems for Hadoop / Hadoop Clusters

    - Understand external failure characteristics and the cost
      - External failures (i.e., other than those caused by Hadoop bugs): faulty disks / memory / network / CPU
    - Auto-tuning Hadoop for performance
      - Tools that can tune Hadoop clusters with the right defaults (e.g., Map and Reduce slots) for typical workloads
      - Auto-tune Hadoop job configurations to optimize execution time
    - Build tools to pinpoint hotspots that caused applications to run slowly
    - Visualize Job Progress and Cluster Utilization
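
    One simple starting point for a hotspot tool is to flag tasks that ran far longer than their peers. This is a hypothetical heuristic, not something described in the talk: the task ids, the runtimes, the function name find_stragglers, and the "2x the median" threshold are all invented for illustration.

```python
from statistics import median


def find_stragglers(task_seconds, factor=2.0):
    """Return task ids whose runtime exceeds `factor` times the median runtime."""
    typical = median(task_seconds.values())
    return sorted(t for t, s in task_seconds.items() if s > factor * typical)


# Hypothetical per-task runtimes (seconds) for one job's map tasks.
runtimes = {"map_0": 41, "map_1": 39, "map_2": 44, "map_3": 180, "map_4": 40}
print(find_stragglers(runtimes))  # ['map_3']
```

    The median is used rather than the mean so that a single very slow task does not raise the baseline it is compared against.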

    Challenges for Cloud providers

    - Hardware: datacenter investment (machines, power, cooling)
      - What kind of HW/OS? Homogeneous? Commodity?
      - Insert new machines or remove bad ones, without disrupting service
    - Software: what software stack to provide?
    - Data: how does customer data get on/off the cloud?
    - QoS: high availability, always available, updating SW/HW without bringing service down
    - Pay as you go? Easy/automatic elasticity? Pay for what you use?

    Challenges for Cloud users

    - Existing applications
      - Need elasticity?
      - Cost effectiveness (HW/SW, ops) of running on the Cloud
      - Migration (large business opportunity in future!)
    - New apps. Do new things with elastic resources?
      - Lots of data?
      - Batch processing, analytics

    Cloud Computing is NOT about saving money (as it exists today)

    "The future is here; it's just not widely distributed yet." -- William Gibson

    Chid Kollengode

    [email protected]

    For any specific questions related to Yahoo! (other than those covered in this presentation), please contact pr [email protected]