8/2/2019 apacheconus2010_3
1/22
How to make your map-reduce jobs perform as well as Pig: Lessons from Pig optimizations
http://pig.apache.org
Thejas Nair
pig team @ Yahoo!
Apache pig PMC member
What is Pig?
Pig Latin, a high-level data processing language.
An engine that executes Pig Latin locally or on a Hadoop cluster.
Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
Pig Latin example
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age
Comparison with MR in Java
[Figure: bar chart of lines of code, Hadoop vs. Pig. Pig needs 1/20 the lines of code.]
[Figure: bar chart of development time in minutes, Hadoop vs. Pig. Pig needs 1/16 the development time.]
What about performance?
Pig Compared to Map Reduce
Faster development time
Data flow versus programming logic
Many standard data operations (e.g. join) included
Manages all the details of connecting jobs and data flow
Copes with Hadoop version change issues
And, You Don't Lose Power
UDFs can be used to load, evaluate, aggregate, and store data
External binaries can be invoked
Metadata is optional
Flexible data model
Nested data types
Explicit data flow programming
Pig performance
PigMix: Pig vs. MapReduce
Pig optimization principles
Versus an RDBMS: there are no accurate models for data, operators, and the execution environment
Use the reliable information that is available. Trust the user's choice.
Use rules that help in most cases
Rules based on runtime information
Logical Optimizations
Restructure given logical dataflow graph
Apply filter, project, limit early
Merge foreach, filter statements
Operator rewrites
Script: A = load, B = foreach, C = filter
Parser -> Logical Plan: A -> B -> C
Logical Optimizer -> Optimized L. Plan: A -> C -> B
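The rewrite from A -> B -> C to A -> C -> B can be sketched as a single rule over a linear plan. This is only an illustrative Python sketch, not Pig's optimizer: the (kind, name) plan representation and the function name push_filters_early are hypothetical, and the rule assumes the filter only reads columns that the foreach passes through unchanged.

```python
# Hypothetical mini-plan: each operator is a (kind, name) pair.
def push_filters_early(plan):
    """Move each filter ahead of a foreach that directly precedes it.

    Assumes the filter only uses columns the foreach passes through
    unchanged; a real optimizer has to verify that before swapping.
    """
    ops = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(ops) - 1):
            if ops[i][0] == "foreach" and ops[i + 1][0] == "filter":
                ops[i], ops[i + 1] = ops[i + 1], ops[i]
                changed = True
    return ops


plan = [("load", "A"), ("foreach", "B"), ("filter", "C")]
optimized = push_filters_early(plan)
print(" -> ".join(name for _, name in optimized))  # A -> C -> B
```

In a hand-written map-reduce job the same idea means filtering records in the mapper before doing any per-record transformation.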
Physical Optimizations
Physical plan: sequence of MR jobs having physical operators.
Built-in rules, e.g. use of combiner
Specified in query, e.g. join type
Optimized L. Plan: X -> Y -> Z
Translator -> Phy/MR plan: M(PX-PYm) R(PYr) -> M(Z)
Optimizer -> Optimized Phy/MR plan: M(PX-PYm) C(PYc) R(PYr) -> M(Z)
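The effect of adding the combiner stage C to the plan can be illustrated with a small simulation. This is a sketch under simple assumptions (a COUNT-style aggregate, in-memory lists standing in for map output); the function names are hypothetical and nothing here is Pig or Hadoop API.

```python
from collections import defaultdict


def map_phase(records):
    # Emit (key, 1) for every record, as a COUNT would.
    return [(key, 1) for key in records]


def combine(pairs):
    # Combiner: pre-aggregate one map task's output before the shuffle.
    partial = defaultdict(int)
    for key, count in pairs:
        partial[key] += count
    return sorted(partial.items())


def reduce_phase(all_pairs):
    # Reducer: sum whatever counts arrive for each key.
    totals = defaultdict(int)
    for key, count in all_pairs:
        totals[key] += count
    return dict(totals)


splits = [["fred", "fred", "jane"], ["fred", "jane", "jane"]]
without = [p for s in splits for p in map_phase(s)]
with_combiner = [p for s in splits for p in combine(map_phase(s))]

# Same answer either way, but fewer records cross the network.
assert reduce_phase(without) == reduce_phase(with_combiner)
print(len(without), "records shuffled without combiner,", len(with_combiner), "with")
```

This only works because counting is algebraic: partial counts can be summed again in the reducer.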
Hash Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Users by name, Pages by user;
[Diagram: Map 1 reads Pages block n and Map 2 reads Users block m. Map output is tagged by input: (1, user) and (2, name). Reducer 1 receives (1, fred), (2, fred), (2, fred); Reducer 2 receives (1, jane), (2, jane), (2, jane).]
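A reduce-side hash join like the one in the diagram can be sketched in a few lines. This is an illustrative Python simulation, not Pig's implementation: the function name, the two-slot tagging scheme, and the default of two reducers are assumptions made for the example.

```python
from collections import defaultdict


def reduce_side_join(users, pages, num_reducers=2):
    # Map: route each record to a reducer by hashing the join key and
    # keep the two inputs apart (slot 0 = users, slot 1 = pages).
    reducers = [defaultdict(lambda: ([], [])) for _ in range(num_reducers)]
    for name, age in users:
        reducers[hash(name) % num_reducers][name][0].append((name, age))
    for user, url in pages:
        reducers[hash(user) % num_reducers][user][1].append((user, url))

    # Reduce: for each key, cross the records from the two inputs.
    joined = []
    for reducer in reducers:
        for left, right in reducer.values():
            joined.extend(l + r for l in left for r in right)
    return sorted(joined)
```

All records with the same key land on the same reducer, which is exactly what makes this join slow when one key is much more frequent than the others.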
Skew Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'skewed';
[Diagram: Map 1 reads Pages block n and Map 2 reads Users block m. Map output is tagged by input: (1, user) and (2, name), and the diagram marks two SP stages. Reducer 1 receives (1, fred, p1), (1, fred, p2), (2, fred); Reducer 2 receives (1, fred, p3), (1, fred, p4), (2, fred).]
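The diagram shows the page records for the frequent key fred split across both reducers, with the fred user record copied to each. A rough Python sketch of that idea follows. It is not Pig's algorithm: counting keys directly stands in for sampling, and the function name, hot_threshold parameter, and round-robin placement are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import count


def skew_join(pages, users, num_reducers=2, hot_threshold=3):
    # Stand-in for sampling: count keys on the skewed (pages) side.
    counts = Counter(user for user, _ in pages)
    hot = {k for k, c in counts.items() if c >= hot_threshold}

    reducers = [defaultdict(lambda: ([], [])) for _ in range(num_reducers)]
    round_robin = count()
    for user, url in pages:
        # Hot keys are spread over all reducers; others hash as usual.
        r = next(round_robin) if user in hot else hash(user)
        reducers[r % num_reducers][user][0].append((user, url))
    for name, age in users:
        # The other input is copied to every reducer that may hold the key.
        targets = range(num_reducers) if name in hot else [hash(name) % num_reducers]
        for r in targets:
            reducers[r][name][1].append((name, age))

    joined = []
    for reducer in reducers:
        for left, right in reducer.values():
            joined.extend(l + r for l in left for r in right)
    return sorted(joined)
```

Each page record lives on exactly one reducer, so every matching pair is still produced exactly once even though the user record is duplicated.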
Merge Join
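[Diagram: Pages and Users are both sorted by the join key, from aaron to zach.]
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'merge';
[Diagram: Map 1 reads the Pages split aaron to amr and reads Users starting at aaron. Map 2 reads the Pages split amy to barb and reads Users starting at amy.]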
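What each map task does with its two sorted inputs can be sketched as a classic sort-merge join. This is a minimal Python illustration under the assumption that both lists are already sorted by their first field; the function name is hypothetical.

```python
def merge_join(pages, users):
    """Join two lists that are already sorted by key (first field)."""
    joined = []
    i = j = 0
    while i < len(pages) and j < len(users):
        pkey, ukey = pages[i][0], users[j][0]
        if pkey < ukey:
            i += 1
        elif pkey > ukey:
            j += 1
        else:
            # Find the run of equal keys on the users side.
            j_end = j
            while j_end < len(users) and users[j_end][0] == pkey:
                j_end += 1
            # Pair every page with that key against the whole run.
            while i < len(pages) and pages[i][0] == pkey:
                joined.extend(pages[i] + u for u in users[j:j_end])
                i += 1
            j = j_end
    return joined
```

Both inputs are read once, front to back, so no shuffle or reduce phase is needed.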
Replicated Join
[Diagram: Pages is the large input and Users the small one, both running from aaron to zach.]
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'replicated';
[Diagram: Map 1 reads the Pages split aaron to amr and Map 2 reads the Pages split amy to barb. Each map holds a full copy of Users, aaron to zach.]
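The per-map work in a replicated join can be sketched as a hash lookup against the small input. This is an illustrative Python sketch, assuming Users fits in memory; build_lookup and map_task are hypothetical names.

```python
from collections import defaultdict


def build_lookup(users):
    # Every map task loads the whole small input into a hash table.
    table = defaultdict(list)
    for name, age in users:
        table[name].append((name, age))
    return table


def map_task(pages_split, users):
    lookup = build_lookup(users)
    # Stream the large input and probe the in-memory table; no reduce phase.
    return [page + user for page in pages_split for user in lookup.get(page[0], [])]
```

For example, a task given the split aaron to amr only emits rows for the keys in that split that also appear in Users, and the two tasks never need to exchange data.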
Group/cogroup optimizations
On sorted and collected data
grp = group Users by name using 'collected';
[Diagram: the input (labeled Pages) is sorted: aaron, aaron, barney, carol, ..., zach. Map 1 reads aaron, aaron, barney; Map 2 reads carol onward.]
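When all records for a key sit next to each other inside one split, the grouping can be done in the map in a single streaming pass. A minimal Python sketch of that follows; it assumes the split is sorted by key and that no key straddles two splits, and map_side_group is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter


def map_side_group(split):
    """Group a sorted split in one streaming pass, emitting (key, bag)."""
    return [(key, list(rows)) for key, rows in groupby(split, key=itemgetter(0))]
```

Because groupby only merges adjacent equal keys, the result is wrong on unsorted input, which is why the optimization has to be requested explicitly.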
Multi-store script
A = load 'users' as (name, age, gender, city, state);
B = filter A by name is not null;
C1 = group B by (age, gender);
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';
[Diagram: A: load -> B: filter, then two branches. C1: group -> eval udf -> store into bydemo. C2: group -> eval udf -> store into bystate.]
Multi-Store Map-Reduce Plan
map: filter -> split -> local rearrange, local rearrange
reduce: multiplex -> package, package -> foreach, foreach
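The plan above runs both group-bys in one map-reduce job by tagging each record with the branch it belongs to. A simplified Python sketch of the idea, for the counting script on the previous slide: the function name and the integer tags are assumptions, and a list stands in for the shuffle.

```python
from collections import Counter


def multi_store(rows):
    """rows are (name, age, gender, city, state) tuples."""
    shuffled = []
    # Map: filter once, then split each surviving row into one tagged
    # key per group-by (tag 0 = by age and gender, tag 1 = by state).
    for name, age, gender, city, state in rows:
        if name is None:
            continue
        shuffled.append((0, (age, gender)))
        shuffled.append((1, state))

    # Reduce: demultiplex on the tag and count each group separately.
    outputs = (Counter(), Counter())
    for tag, key in shuffled:
        outputs[tag][key] += 1
    bydemo, bystate = outputs
    return dict(bydemo), dict(bystate)
```

The input is loaded and filtered once instead of once per store, which is where the saving comes from.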
Memory Management
Use disk if large objects don't fit into memory
JVM limit > physical memory: very poor performance
Spill on memory threshold notification from JVM: unreliable
Pre-set limit for large bags
Custom spill logic for different bags, e.g. distinct bag
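Spilling on a pre-set limit can be sketched as a bag that writes its in-memory items to a temporary file once it reaches a fixed size. This is a simplified Python illustration, not Pig's bag implementation: the class name, the item-count limit, and the use of pickle are all assumptions made for the example.

```python
import pickle
import tempfile


class SpillableBag:
    """Keeps up to `limit` items in memory and spills the rest to disk."""

    def __init__(self, limit=1000):
        self.limit = limit          # hypothetical pre-set item limit
        self.memory = []
        self.spill_files = []

    def add(self, item):
        self.memory.append(item)
        if len(self.memory) >= self.limit:
            self._spill()

    def _spill(self):
        # Write the in-memory items to a new temp file and free the list.
        f = tempfile.TemporaryFile()
        for item in self.memory:
            pickle.dump(item, f)
        self.spill_files.append(f)
        self.memory = []

    def __iter__(self):
        # Replay spilled items first, then whatever is still in memory.
        for f in self.spill_files:
            f.seek(0)
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    break
        yield from self.memory
```

A fixed limit is cruder than reacting to JVM memory notifications, but it behaves predictably, which is the trade-off the slide points at.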
Other optimizations
Aggressive use of combiner, secondary sort
Lazy deserialization in loaders
Better serialization format
Faster regex lib, compiled pattern
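Two of these ideas, lazy deserialization and compiled patterns, carry over directly to hand-written jobs. The Python sketch below illustrates both; the LazyRecord class, the tab-delimited layout, and the URL pattern are assumptions made for the example, not anything taken from Pig's loaders.

```python
import re

# Compile once, outside the per-record loop, and reuse the pattern.
URL_HOST = re.compile(r"^https?://([^/]+)")


class LazyRecord:
    """Holds the raw line and splits it only when a field is first read."""

    def __init__(self, line, delimiter="\t"):
        self._line = line
        self._delimiter = delimiter
        self._fields = None

    def field(self, index):
        if self._fields is None:
            self._fields = self._line.split(self._delimiter)
        return self._fields[index]


def hosts(lines):
    out = []
    for line in lines:
        record = LazyRecord(line)
        match = URL_HOST.match(record.field(1))
        if match:
            out.append(match.group(1))
    return out
```

Records that are filtered out before any field is read never pay the parsing cost.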
Future optimization work
Improve memory management
Join + group in single MR, if same keys used
Even better skew handling
Adaptive optimizations
Automated Hadoop tuning
Pig - fast and flexible
More flexibility in 0.8, 0.9
UDFs in scripting languages (Python)
MR job as relation
Relation as scalar
Turing complete Pig (0.9)
Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/
Further reading
Docs - http://pig.apache.org/docs/r0.7.0/
Papers and talks - http://wiki.apache.org/pig/PigTalksPapers
Training videos on vimeo.com (search "hadoop pig")