Pig 0.11 - New Features

Daniel Dai
Member of Technical Staff
Committer, VP of Apache Pig

© Hortonworks Inc. 2011
Page 1
Pig 0.11 release plan

• Branched on Oct 12, 2012
• Release expected within weeks
  – Fix tests: PIG-2972
  – Documentation: PIG-2756
  – Several last-minute fixes

Page 2 - Architecting the Future of Big Data
New features

• CUBE operator
• RANK operator
• Groovy UDFs
• New data type: DateTime
• SchemaTuple optimization
• Works with JDK 7
• Works with Windows (tentative)

Page 3
New features (continued)

• Faster local mode
• Better stats/notification
  – Ambrose
• Default scripts: pigrc
• Integrated HCat DDL
• Grunt enhancements: history/clear
• UDF enhancements
  – New/enhanced UDFs
  – AvroStorage enhancements

Page 4
CUBE operator

Page 5
rawdata = load 'input' as (ptype, pstore, number);
cubed = cube rawdata by rollup(ptype, pstore);
result = foreach cubed generate flatten(group), SUM(cube.number);
dump result;
Ptype   Pstore  Number
Dog     Miami   12
Cat     Miami   18
Turtle  Tampa   4
Dog     Tampa   14
Cat     Naples  9
Dog     Naples  5
Turtle  Naples  1
Ptype   Pstore  Sum
Cat     Miami   18
Cat     Naples  9
Cat             27
Dog     Miami   12
Dog     Tampa   14
Dog     Naples  5
Dog             31
Turtle  Tampa   4
Turtle  Naples  1
Turtle          5
                63
CUBE operator

Page 6

• Syntax:

outalias = CUBE inalias BY { CUBE expression | ROLLUP expression }
    [, { CUBE expression | ROLLUP expression } ...] [PARALLEL n];

• Umbrella Jira: PIG-2167
• Non-distributed version will be in 0.11 (PIG-2765)
• Distributed version still in progress (PIG-2831)
  – Push algebraic computation to the map/combiner
  – Reference: "Distributed Cube Materialization on Holistic Measures", Arnab Nandi et al., ICDE 2011
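The ROLLUP grouping from the earlier example can be sketched in plain Python. This is a toy illustration of the semantics, not Pig's implementation; the `rollup` helper and the data literals are mine:

```python
from collections import defaultdict

def rollup(rows):
    """Aggregate `number` over every ROLLUP(ptype, pstore) grouping:
    (ptype, pstore), (ptype,), and the grand total. `None` stands for
    a rolled-up dimension, as in the slide's output table."""
    sums = defaultdict(int)
    for ptype, pstore, number in rows:
        sums[(ptype, pstore)] += number   # full grouping
        sums[(ptype, None)] += number     # subtotal per ptype
        sums[(None, None)] += number      # grand total
    return dict(sums)

data = [("Dog", "Miami", 12), ("Cat", "Miami", 18), ("Turtle", "Tampa", 4),
        ("Dog", "Tampa", 14), ("Cat", "Naples", 9), ("Dog", "Naples", 5),
        ("Turtle", "Naples", 1)]
result = rollup(data)
print(result[("Cat", None)])   # 27, the Cat subtotal
print(result[(None, None)])    # 63, the grand total
```

The distributed version (PIG-2831) pushes exactly these partial sums into the map/combiner stage, since SUM is algebraic.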
Rank operator

Page 7
rawdata = load 'input' as (name, gpa:double);
ranked = rank rawdata by gpa;
dump ranked;
Name Gpa
Katie 3.5
Fred 4.0
Holly 3.7
Luke 3.5
Nick 3.7
Rank  Name   Gpa
1     Katie  3.5
5     Fred   4.0
3     Holly  3.7
1     Luke   3.5
3     Nick   3.7
Rank operator

Page 8
rawdata = load 'input' as (name, gpa:double);
ranked = rank rawdata by gpa desc dense;
dump ranked;
Name Gpa
Katie 3.5
Fred 4.0
Holly 3.7
Luke 3.5
Nick 3.7
Rank Name Gpa
3 Katie 3.5
1 Fred 4.0
2 Holly 3.7
3 Luke 3.5
2 Nick 3.7
Rank operator

Page 9

• Limitation
  – Only 1 reducer
• Possible improvements
  – Provide a distributed implementation (PIG-2353)
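The difference between plain and dense ranking in the two examples can be modeled in a few lines of Python (my own sketch of the semantics, not Pig's implementation):

```python
def rank(values, dense=False, reverse=False):
    """RANK semantics: ties share a rank. Without `dense` the next
    distinct value skips ahead by the tie count (1, 1, 3, ...);
    with `dense` it does not (1, 1, 2, ...)."""
    distinct = sorted(set(values), reverse=reverse)
    if dense:
        pos = {v: i + 1 for i, v in enumerate(distinct)}
    else:
        ordered = sorted(values, reverse=reverse)
        pos = {v: ordered.index(v) + 1 for v in distinct}
    return [pos[v] for v in values]

gpas = [3.5, 4.0, 3.7, 3.5, 3.7]          # Katie, Fred, Holly, Luke, Nick
print(rank(gpas))                          # [1, 5, 3, 1, 3]  (rank ... by gpa)
print(rank(gpas, dense=True, reverse=True))# [3, 1, 2, 3, 2]  (by gpa desc dense)
```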
Groovy UDFs

Page 10
register 'test.groovy' using groovy as myfuncs;
a = load '1.txt' as (a0, a1:long);
b = foreach a generate myfuncs.square(a1);
dump b;
test.groovy:

import org.apache.pig.builtin.OutputSchema;

class GroovyUDFs {
  @OutputSchema('x:long')
  long square(long x) {
    return x*x;
  }
}
Embed Pig into Groovy

Page 11
import org.apache.pig.scripting.Pig;
public static void main(String[] args) {
String input = "input"
String output = "output"
Pig P = Pig.compile("A = load '\$in'; store A into '\$out';")
result = P.bind(['in':input, 'out':output]).runSingle()
if (result.isSuccessful()) {
print("Pig job succeeded")
} else {
print("Pig job failed")
}
}
Command line:

bin/pig -x local demo.groovy
New data type: DateTime

Page 12
a = load 'input' as (a0: datetime, a1:chararray, a2:long);
b = foreach a generate a0, ToDate(a1, 'yyyyMMdd HH:mm:ss'), ToDate(a2), CurrentTime();

• Supports timezones
• Millisecond precision
New data type: DateTime

Page 13
• DateTime UDFs

GetYear         YearsBetween          SubtractDuration
GetMonth        MonthsBetween         ToDate
GetDay          WeeksBetween          ToDateISO
GetWeekYear     DaysBetween           ToMilliSeconds
GetWeek         HoursBetween          ToString
GetHour         MinutesBetween        ToUnixTime
GetMinute       SecondsBetween        CurrentTime
GetSecond       MilliSecondsBetween
GetMilliSecond  AddDuration
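As a rough analogue of what several of these UDFs compute, here is a Python sketch using timezone-aware, millisecond-precision values. The variable names and the mapping comments are mine; this is not Pig code:

```python
from datetime import datetime, timedelta, timezone

# A timezone-aware value with millisecond precision, mirroring the
# two properties the DateTime type advertises.
tz = timezone(timedelta(hours=-5))   # example fixed offset
dt = datetime(2012, 10, 12, 9, 30, 15, 123000, tzinfo=tz)

year = dt.year                       # ~ GetYear
month = dt.month                     # ~ GetMonth
millis = dt.microsecond // 1000      # ~ GetMilliSecond
unix_secs = int(dt.timestamp())      # ~ ToUnixTime (seconds since epoch)
print(year, month, millis, unix_secs)
```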
SchemaTuple optimization

Page 14

• Idea
  – Generate schema-specific tuple code when the schema is known
• Benefits
  – Decreased memory footprint
  – Better performance
SchemaTuple optimization

Page 15

• When the tuple schema is known: (a0: int, a1: chararray, a2: double)

Original Tuple:

Tuple {
  List<Object> mFields;
  Object get(int fieldNum) {
    return mFields.get(fieldNum);
  }
  void set(int fieldNum, Object val) {
    mFields.set(fieldNum, val);
  }
}

Schema Tuple:

SchemaTuple {
  int f0;
  String f1;
  double f2;
  Object get(int fieldNum) {
    switch (fieldNum) {
      case 0: return f0;
      case 1: return f1;
      case 2: return f2;
    }
  }
  void set(int fieldNum, Object val) {
    ......
  }
}
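The same idea can be illustrated in Python (my own sketch, not Pig code): a generic list-backed tuple pays for one boxed object slot per field, while a generated class with fixed, typed fields avoids that per-row overhead — here approximated with `__slots__`:

```python
class GenericTuple:
    """Analogue of the generic Tuple: fields live in a list of objects."""
    def __init__(self, fields):
        self.mFields = list(fields)
    def get(self, i):
        return self.mFields[i]

class SchemaTuple:
    """Analogue of a generated SchemaTuple for (a0:int, a1:chararray, a2:double):
    fixed named fields, no per-instance dict or backing list."""
    __slots__ = ("f0", "f1", "f2")
    def __init__(self, f0, f1, f2):
        self.f0, self.f1, self.f2 = f0, f1, f2
    def get(self, i):
        return (self.f0, self.f1, self.f2)[i]

g = GenericTuple([1, "a", 2.0])
s = SchemaTuple(1, "a", 2.0)
print(g.get(2), s.get(2))   # both return 2.0
```

In Pig the win is larger than this analogy suggests, because the generated Java class also stores primitives unboxed.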
Pig on new environments

Page 16

• JDK 7
  – All unit tests pass
  – Jira: PIG-2908
• Hadoop 2.0.0
  – Jira: PIG-2791
• Windows
  – No need for Cygwin
  – Jira: PIG-2793
  – Trying to make it into 0.11
Faster local mode

Page 17

• Skip generating job.jar
  – PIG-2128
  – In 0.9 and 0.10 as well, unadvertised
• Remove the hardcoded 5-second wait time for JobControl
  – PIG-2702
Better stats

Page 18

• Information on the aliases and script lines behind each map/reduce job
  – One information line printed per map/reduce job
detailed locations: M: A[1,4],A[3,4],B[2,4] C: A[3,4],B[2,4] R: A[3,4],C[5,4]

Explanation:

Map contains:
  alias A: line 1 column 4
  alias A: line 3 column 4
  alias B: line 2 column 4
Combiner contains:
  alias A: line 3 column 4
  alias B: line 2 column 4
Reduce contains:
  alias A: line 3 column 4
  alias C: line 5 column 4
Better notification

Page 19

• Support for Ambrose
  – Check out "Twitter Ambrose"
  – Open source on GitHub
  – Monitors Pig job progress in a UI
Integrate HCat DDL

Page 20

• Embed HCat DDL commands in a Pig script
• Run HCat DDL commands in Grunt:

grunt> sql create table pig_test(name string, age int, gpa double) stored as textfile;
grunt>

• Embed HCat DDL in a scripting language:

from org.apache.pig.scripting import Pig
ret = Pig.sql("""drop table if exists table_1;""")
if ret==0:
    #success
Grunt enhancements

Page 21

• history
  – Show command history
• clear
  – Clear the screen
grunt> a = load '1.txt';
grunt> b = foreach a generate $0, $1;
grunt> history
1 a = load '1.txt';
2 b = foreach a generate $0, $1;
grunt>
New/enhanced UDFs

Page 22

• New UDFs

STARTSWITH   INVERSEMAP   VALUESET
BagToString  KEYSET
BagToTuple   VALUELIST

• Enhanced UDFs

RANDOM       Takes a seed
AvroStorage  Supports recursive records
             Supports globs and commas
             Upgraded to Avro 1.7.1

• EvalFunc enhancement
  – getInputSchema(): get the input schema for a UDF
Hortonworks Data Platform

Page 23

• Simplify deployment to get started quickly and easily
• Monitor and manage any size cluster with familiar console and tools
• Only platform to include data integration services to interact with any data
• Metadata services open the platform for integration with existing applications
• Dependable high availability architecture
• Tested at scale to future-proof your cluster growth

Reduce risks and cost of adoption. Lower the total cost to administer and provision. Integrate with your existing ecosystem.
Hortonworks Training

The expert source for Apache Hadoop training & certification

• Role-based Developer and Administration training
  – Coursework built and maintained by the core Apache Hadoop development team
  – The "right" course, with the most extensive and realistic hands-on materials
  – Provides an immersive experience into real-world Hadoop scenarios
  – Public and private courses available
• Comprehensive Apache Hadoop Certification
  – Become a trusted and valuable Apache Hadoop expert

Page 24
Next Steps?

Page 25

1. Download Hortonworks Data Platform
   hortonworks.com/download

2. Use the getting started guide
   hortonworks.com/get-started

3. Learn more… get support

Hortonworks Training
• Expert role-based training
• Courses for admins, developers and operators
• Certification program
• Custom onsite options
hortonworks.com/training

Hortonworks Support
• Full lifecycle technical support across four service levels
• Delivered by Apache Hadoop Experts/Committers
• Forward-compatible
hortonworks.com/support
Thank You! Questions & Answers

Follow: @hortonworks
Read: hortonworks.com/blog

Page 26