apacheconus2010_3

  • 8/2/2019 apacheconus2010_3

    1/22

    How to make your map-reduce jobs perform as well as Pig: Lessons from Pig optimizations

    http://pig.apache.org

    Thejas Nair

    Pig team @ Yahoo!

    Apache Pig PMC member

    What is Pig?

    Pig Latin, a high-level data processing language.

    An engine that executes Pig Latin locally or on a Hadoop cluster.

    Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

    Pig Latin example

    Users = load 'users' as (name, age);

    Fltrd = filter Users by age >= 18 and age

    Comparison with MR in Java

    [Figure: bar chart of lines of code, Hadoop vs. Pig. Pig needs 1/20 the lines of code.]

    [Figure: bar chart of development time in minutes, Hadoop vs. Pig. Pig needs 1/16 the development time.]

    What about performance?

    Pig Compared to Map Reduce

    - Faster development time
    - Data flow versus programming logic
    - Many standard data operations (e.g. join) included
    - Manages all the details of connecting jobs and data flow
    - Copes with Hadoop version change issues

    And, You Don't Lose Power

    - UDFs can be used to load, evaluate, aggregate, and store data
    - External binaries can be invoked
    - Metadata is optional
    - Flexible data model
    - Nested data types
    - Explicit data flow programming

    Pig performance

    PigMix: Pig vs. MapReduce

    Pig optimization principles

    - Unlike an RDBMS, Pig has no accurate models for data, operators, and the execution environment
    - Use whatever reliable information is available; trust the user's choice
    - Use rules that help in most cases
    - Use rules based on runtime information

    Logical Optimizations

    - Restructure the given logical dataflow graph
    - Apply filter, project, and limit early
    - Merge foreach and filter statements
    - Operator rewrites

    Script (A = load, B = foreach, C = filter) -> Parser -> Logical Plan (A -> B -> C) -> Logical Optimizer -> Optimized L. Plan (A -> C -> B)
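The "apply filter early" rule can be sketched in a few lines of Python. This is only an illustration, not Pig's optimizer: the `Op` class, its `reads`/`produces` fields, and `push_filters_early` are hypothetical names, and the safety check is simplified to "the filter must not read a column that the foreach computes".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One node of a linear logical plan (hypothetical, for illustration)."""
    name: str                            # alias in the script, e.g. "A"
    kind: str                            # "load", "foreach", or "filter"
    reads: frozenset = frozenset()       # columns a filter predicate reads
    produces: frozenset = frozenset()    # columns a foreach computes


def push_filters_early(plan):
    """Swap a filter ahead of the foreach before it whenever that is safe."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            prev, cur = plan[i - 1], plan[i]
            safe = not (cur.reads & prev.produces)
            if cur.kind == "filter" and prev.kind == "foreach" and safe:
                plan[i - 1], plan[i] = cur, prev
                changed = True
    return plan


# The plan from the slide: A = load, B = foreach, C = filter.
plan = [
    Op("A", "load"),
    Op("B", "foreach", produces=frozenset({"name_len"})),
    Op("C", "filter", reads=frozenset({"age"})),
]
optimized = push_filters_early(plan)
print(" -> ".join(op.name for op in optimized))  # A -> C -> B
```

If the filter read `name_len` instead, the check would fail and the plan would stay A -> B -> C.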

    Physical Optimizations

    - Physical plan: a sequence of MR jobs containing physical operators
    - Built-in rules, e.g. use of combiner
    - Specified in the query, e.g. join type

    Optimized L. Plan (X -> Y -> Z) -> Translator -> Phy/MR Plan (M(PX-PYm) R(PYr) -> M(Z)) -> Optimizer -> Optimized Phy/MR Plan (M(PX-PYm) C(PYc) R(PYr) -> M(Z))
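To make the combiner step C(PYc) concrete, here is a minimal Python simulation of a count job with and without map-side pre-aggregation. It is a sketch rather than Hadoop or Pig code; `map_phase`, `combine`, and `reduce_phase` are hypothetical helper names and the input splits are made up.

```python
from collections import Counter, defaultdict


def map_phase(split):
    """Map: emit (key, 1) for every record in one input split."""
    return [(record, 1) for record in split]


def combine(pairs):
    """Combiner: sum counts per key on the map side, before the shuffle."""
    partial = Counter()
    for key, count in pairs:
        partial[key] += count
    return list(partial.items())


def reduce_phase(shuffled):
    """Reduce: sum the (possibly pre-aggregated) counts per key."""
    totals = defaultdict(int)
    for key, count in shuffled:
        totals[key] += count
    return dict(totals)


splits = [["fred", "fred", "jane"], ["fred", "jane", "jane"]]

# Without a combiner every map output record crosses the network.
without_combiner = [pair for s in splits for pair in map_phase(s)]
# With a combiner each map ships at most one record per key.
with_combiner = [pair for s in splits for pair in combine(map_phase(s))]

print(len(without_combiner), len(with_combiner))  # 6 4
print(reduce_phase(without_combiner) == reduce_phase(with_combiner))  # True
```

The result is unchanged because summing is associative and commutative; only the amount of shuffled data shrinks.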

    Hash Join

    Users = load 'users' as (name, age);
    Pages = load 'pages' as (user, url);
    Jnd = join Users by name, Pages by user;

    [Figure: Map 1 reads Pages block n and emits (1, user); Map 2 reads Users block m and emits (2, name). Reducer 1 receives (1, fred) (2, fred) (2, fred); Reducer 2 receives (1, jane) (2, jane) (2, jane).]
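The hash (reduce-side) join in the diagram can be simulated as follows. This is a sketch under assumptions: the function names and sample rows are invented, and the tags 1 (Pages) and 2 (Users) follow the slide. Each map tags its records with the input they came from, the shuffle routes equal keys to the same reducer, and the reducer pairs up the two tags.

```python
from collections import defaultdict


def map_pages(pages):
    """Map over Pages: key by user, tag the value with input number 1."""
    for user, url in pages:
        yield user, (1, url)


def map_users(users):
    """Map over Users: key by name, tag the value with input number 2."""
    for name, age in users:
        yield name, (2, age)


def shuffle(pairs, num_reducers):
    """Route every key to one reducer by hashing, grouping values per key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions


def reduce_join(groups):
    """Reduce: for each key, pair every Users value with every Pages value."""
    for key, values in groups.items():
        urls = [v for tag, v in values if tag == 1]
        ages = [v for tag, v in values if tag == 2]
        for age in ages:
            for url in urls:
                yield (key, age, key, url)   # (name, age, user, url)


users = [("fred", 25), ("jane", 30)]
pages = [("fred", "p1"), ("fred", "p2"), ("jane", "p3")]

pairs = list(map_pages(pages)) + list(map_users(users))
joined = sorted(row for part in shuffle(pairs, 2) for row in reduce_join(part))
```

Because all records for a key meet in one reducer, a very frequent key can overload that reducer.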

    Skew Join

    Users = load 'users' as (name, age);
    Pages = load 'pages' as (user, url);
    Jnd = join Pages by user, Users by name using 'skewed';

    [Figure: Map 1 reads Pages block n and emits (1, user); Map 2 reads Users block m and emits (2, name). Reducer 1 receives (1, fred, p1) (1, fred, p2) (2, fred); Reducer 2 receives (1, fred, p3) (1, fred, p4) (2, fred). Two nodes labeled SP also appear in the diagram.]
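The idea in the diagram, where the Pages records of a hot key are spread over several reducers and the matching Users record is copied to each of them, can be sketched like this. It is a simplification and not Pig's implementation: `skew_join` and `hot_threshold` are hypothetical, and the exact key count used here stands in for whatever statistics a real system would gather.

```python
from collections import Counter, defaultdict
from itertools import count


def skew_join(pages, users, num_reducers, hot_threshold):
    """Join pages(user, url) with users(name, age), splitting hot keys."""
    # Assumption: an exact count stands in for sampled key statistics.
    freq = Counter(user for user, _ in pages)
    hot = {key for key, n in freq.items() if n > hot_threshold}

    reducers = [
        {"pages": defaultdict(list), "users": defaultdict(list)}
        for _ in range(num_reducers)
    ]

    # Skewed side: hot keys go round-robin, other keys go by hash.
    round_robin = count()
    for user, url in pages:
        if user in hot:
            r = next(round_robin) % num_reducers
        else:
            r = hash(user) % num_reducers
        reducers[r]["pages"][user].append(url)

    # Other side: records of hot keys are replicated to every reducer.
    for name, age in users:
        targets = range(num_reducers) if name in hot else [hash(name) % num_reducers]
        for r in targets:
            reducers[r]["users"][name].append(age)

    # Each reducer joins only what it received.
    out = []
    for reducer in reducers:
        for key, urls in reducer["pages"].items():
            for age in reducer["users"].get(key, []):
                for url in urls:
                    out.append((key, url, key, age))   # (user, url, name, age)
    return out


pages = [("fred", "p1"), ("fred", "p2"), ("fred", "p3"), ("fred", "p4"), ("jane", "p5")]
users = [("fred", 25), ("jane", 30)]
joined = sorted(skew_join(pages, users, num_reducers=2, hot_threshold=2))
```

Every Pages record still lands on exactly one reducer, so each joined row is produced once even though the Users record for fred is present on both reducers.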

    Merge Join

    Users = load 'users' as (name, age);
    Pages = load 'pages' as (user, url);
    Jnd = join Pages by user, Users by name using 'merge';

    [Figure: Pages and Users are both sorted by the join key (aaron ... zach). Map 1 handles the Pages and Users ranges starting at aaron (aaron, amr); Map 2 handles the ranges starting at amy (amy, barb).]
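When both inputs are already sorted on the join key, the join can run in the map phase by walking the two inputs in step. The function below is a minimal sketch of that sort-merge walk for one pair of key ranges; `merge_join` and the sample rows are assumptions for illustration.

```python
def merge_join(pages, users):
    """Join two lists of tuples that are both sorted by their first field."""
    out = []
    i = j = 0
    while i < len(pages) and j < len(users):
        page_key, user_key = pages[i][0], users[j][0]
        if page_key < user_key:
            i += 1                      # no matching user for this page
        elif page_key > user_key:
            j += 1                      # no matching page for this user
        else:
            # Find the run of equal keys on each side, then pair them up.
            i_end = i
            while i_end < len(pages) and pages[i_end][0] == page_key:
                i_end += 1
            j_end = j
            while j_end < len(users) and users[j_end][0] == page_key:
                j_end += 1
            for page in pages[i:i_end]:
                for user in users[j:j_end]:
                    out.append(page + user)   # (user, url, name, age)
            i, j = i_end, j_end
    return out


pages = [("aaron", "p1"), ("aaron", "p2"), ("amy", "p3"), ("zach", "p4")]
users = [("aaron", 31), ("amr", 40), ("amy", 22)]
joined = merge_join(pages, users)
```

No shuffle or reduce phase is needed, which is why this strategy pays off only when the sort order already exists.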

    Replicated Join

    Users = load 'users' as (name, age);
    Pages = load 'pages' as (user, url);
    Jnd = join Pages by user, Users by name using 'replicated';

    [Figure: Pages is the large input (aaron ... zach) and Users is the small one (aaron ... zach). Map 1 and Map 2 each read one split of Pages (aaron, amr and amy, barb) together with a full copy of Users.]
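A replicated join keeps the small input in memory inside every map task and streams the large input past it. The sketch below assumes Users is the small side, as in the script above; `build_lookup` and `map_replicated_join` are hypothetical names.

```python
from collections import defaultdict


def build_lookup(users):
    """Load the small input into an in-memory hash table keyed by name."""
    table = defaultdict(list)
    for name, age in users:
        table[name].append(age)
    return table


def map_replicated_join(pages_split, lookup):
    """Map task: probe the in-memory table for each record of the big input."""
    for user, url in pages_split:
        for age in lookup.get(user, []):
            yield (user, url, user, age)   # (user, url, name, age)


users = [("aaron", 31), ("zach", 19)]
lookup = build_lookup(users)            # every map task holds its own copy

split_1 = [("aaron", "p1"), ("amr", "p2")]
split_2 = [("amy", "p3"), ("zach", "p4")]
joined = list(map_replicated_join(split_1, lookup)) + list(map_replicated_join(split_2, lookup))
```

The join finishes in the map phase, at the cost of requiring the small input to fit in each task's memory.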

    Group/cogroup optimizations

    On sorted and collected data

    grp = group Users by name using 'collected';

    [Figure: the Pages input, sorted by key (aaron, aaron, barney, carol, ..., zach). Map 1 reads aaron, aaron, barney; Map 2 reads carol onward.]
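If all records of a key sit next to each other inside a single split, each map task can form the groups by itself and no reduce phase is required. A minimal sketch of that map-side grouping, with `map_side_group` and the sample splits as assumptions:

```python
from itertools import groupby
from operator import itemgetter


def map_side_group(split):
    """Group one split by its first field.

    Only valid when every key's records are contiguous and no key
    spans more than one split.
    """
    return [(key, list(records)) for key, records in groupby(split, key=itemgetter(0))]


split_1 = [("aaron", 31), ("aaron", 32), ("barney", 45)]
split_2 = [("carol", 28), ("zach", 19)]

groups = map_side_group(split_1) + map_side_group(split_2)
```

`itertools.groupby` only merges adjacent equal keys, which mirrors the precondition: unsorted or scattered input would yield several partial groups for the same key.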

    Multi-store script

    A = load 'users' as (name, age, gender, city, state);
    B = filter A by name is not null;
    C1 = group B by (age, gender);
    D1 = foreach C1 generate group, COUNT(B);
    store D1 into 'bydemo';
    C2 = group B by state;
    D2 = foreach C2 generate group, COUNT(B);
    store D2 into 'bystate';

    [Figure: A: load -> B: filter, which feeds two branches: C1: group -> eval udf -> store into bydemo, and C2: group -> eval udf -> store into bystate.]

    Multi-Store Map-Reduce Plan

    map: filter -> split -> local rearrange (one per branch)

    reduce: multiplex -> package (one per branch) -> foreach (one per branch)
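The plan above runs both group-by branches of the multi-store script in one job. A rough Python simulation, assuming a branch index is attached to each key so the reduce side can tell the two streams apart (`map_phase`, `reduce_phase`, and the sample rows are hypothetical):

```python
from collections import defaultdict


def map_phase(records):
    """Filter once, then emit one keyed record per branch (the split)."""
    for name, age, gender, city, state in records:
        if name is None:                  # B = filter A by name is not null
            continue
        yield (0, (age, gender)), 1       # branch 0: group by (age, gender)
        yield (1, state), 1               # branch 1: group by state


def reduce_phase(pairs):
    """Route each record to its branch by index, then count per key."""
    outputs = {0: defaultdict(int), 1: defaultdict(int)}
    for (branch, key), value in pairs:
        outputs[branch][key] += value
    return dict(outputs[0]), dict(outputs[1])


records = [
    ("fred", 25, "m", "sf", "CA"),
    ("jane", 25, "f", "la", "CA"),
    (None, 40, "m", "ny", "NY"),
    ("amy", 25, "f", "ny", "NY"),
]
bydemo, bystate = reduce_phase(map_phase(records))
```

The input is read and filtered once instead of once per store statement.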

    Memory Management

    - Use disk if large objects don't fit into memory
    - JVM limit > physical memory: very poor performance
    - Spill on memory threshold notification from the JVM: unreliable
    - Pre-set limit for large bags
    - Custom spill logic for different bags, e.g. distinct bag
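The "pre-set limit" approach can be illustrated with a small bag that writes its contents to a temporary file once it holds a fixed number of items. This is a sketch of the general technique in Python, not Pig's bag implementation; `SpillableBag` and `max_in_memory` are hypothetical names.

```python
import pickle
import tempfile


class SpillableBag:
    """A bag that spills to disk after a pre-set number of in-memory items."""

    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self._memory = []
        self._spill_files = []

    def add(self, item):
        self._memory.append(item)
        if len(self._memory) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        """Serialize the in-memory items to a temp file and free the list."""
        spill = tempfile.TemporaryFile()
        for item in self._memory:
            pickle.dump(item, spill)
        self._spill_files.append(spill)
        self._memory = []

    def __iter__(self):
        """Yield spilled items first (in insertion order), then the rest."""
        for spill in self._spill_files:
            spill.seek(0)
            while True:
                try:
                    yield pickle.load(spill)
                except EOFError:
                    break
        yield from self._memory


bag = SpillableBag(max_in_memory=3)
for value in range(10):
    bag.add(value)
```

A fixed item count is a crude stand-in for a memory budget, but it does not depend on the runtime reporting memory pressure in time.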

    Other optimizations

    - Aggressive use of combiner, secondary sort
    - Lazy deserialization in loaders
    - Better serialization format
    - Faster regex lib, compiled pattern

    Future optimization work

    - Improve memory management
    - Join + group in a single MR job, if the same keys are used
    - Even better skew handling
    - Adaptive optimizations
    - Automated Hadoop tuning

    Pig - fast and flexible

    - More flexibility in 0.8, 0.9
    - UDFs in scripting languages (Python)
    - MR job as relation
    - Relation as scalar
    - Turing complete Pig (0.9)

    Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

    Further reading

    - Docs: http://pig.apache.org/docs/r0.7.0/
    - Papers and talks: http://wiki.apache.org/pig/PigTalksPapers
    - Training videos on vimeo.com (search "hadoop pig")