
New features in Pig 0.11


Page 1: New features in Pig 0.11

© Hortonworks Inc. 2011

Daniel Dai

Member of Technical Staff

Committer, VP of Apache Pig

Pig 0.11 - New Features

Page 2: New features in Pig 0.11

Pig 0.11 release plan

• Branched on Oct 12, 2012
• Release expected in the coming weeks
  – Fix tests: PIG-2972
  – Documentation: PIG-2756
  – Several last-minute fixes

Page 3: New features in Pig 0.11

New features

• CUBE operator
• Rank operator
• Groovy UDFs
• New data type: DateTime
• SchemaTuple optimization
• Works with JDK 7
• Works with Windows?

Page 4: New features in Pig 0.11

New features

• Faster local mode
• Better stats/notification
  – Ambrose
• Default scripts: pigrc
• Integrate HCat DDL
• Grunt enhancement: history/clear
• UDF enhancement
  – New/enhanced UDFs
  – AvroStorage enhancement

Page 5: New features in Pig 0.11

CUBE operator

rawdata = load 'input' as (ptype, pstore, number);

cubed = cube rawdata by rollup(ptype, pstore);

result = foreach cubed generate flatten(group), SUM(cube.number);

dump result;

Input:

Ptype   Pstore  number
Dog     Miami   12
Cat     Miami   18
Turtle  Tampa    4
Dog     Tampa   14
Cat     Naples   9
Dog     Naples   5
Turtle  Naples   1

Output:

Ptype   Pstore  Sum
Cat     Miami   18
Cat     Naples   9
Cat             27
Dog     Miami   12
Dog     Tampa   14
Dog     Naples   5
Dog             31
Turtle  Tampa    4
Turtle  Naples   1
Turtle           5
                63

Page 6: New features in Pig 0.11

CUBE operator

• Syntax

outalias = CUBE inalias BY { CUBE expression | ROLLUP expression }, [ CUBE expression | ROLLUP expression ] [PARALLEL n];

• Umbrella Jira: PIG-2167
• Non-distributed version will be in 0.11 (PIG-2765)
• Distributed version still in progress (PIG-2831)
  – Push algebraic computation to the map/combiner
  – Reference: "Distributed Cube Materialization on Holistic Measures", Arnab Nandi et al., ICDE 2011
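The example on the previous slide used ROLLUP; as a minimal sketch of the CUBE form of the syntax above (alias and field names are illustrative, not from the talk):

-- full cube: aggregates for every combination of ptype and pstore,
-- including the "all" (null) level of each dimension
rawdata = load 'input' as (ptype, pstore, number);
cubed = cube rawdata by cube(ptype, pstore);
result = foreach cubed generate flatten(group), SUM(cube.number);
dump result;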

Page 7: New features in Pig 0.11

Rank operator

rawdata = load 'input' as (name, gpa:double);

ranked = rank rawdata by gpa;

dump ranked;

Input:

Name   Gpa
Katie  3.5
Fred   4.0
Holly  3.7
Luke   3.5
Nick   3.7

Output:

Rank  Name   Gpa
1     Katie  3.5
5     Fred   4.0
2     Holly  3.7
1     Luke   3.5
2     Nick   3.7

Page 8: New features in Pig 0.11

Rank operator

rawdata = load 'input' as (name, gpa:double);

ranked = rank rawdata by gpa desc dense;

dump ranked;

Input:

Name   Gpa
Katie  3.5
Fred   4.0
Holly  3.7
Luke   3.5
Nick   3.7

Output:

Rank  Name   Gpa
3     Katie  3.5
1     Fred   4.0
2     Holly  3.7
3     Luke   3.5
2     Nick   3.7

Page 9: New features in Pig 0.11

Rank operator

• Limitation
  – Only 1 reducer
• Possible improvements
  – Provide a distributed implementation (PIG-2353)

Page 10: New features in Pig 0.11

Groovy UDFs

register 'test.groovy' using groovy as myfuncs;

a = load '1.txt' as (a0, a1:long);

b = foreach a generate myfuncs.square(a1);

dump b;

test.groovy:

import org.apache.pig.builtin.OutputSchema;

class GroovyUDFs {
  @OutputSchema('x:long')
  long square(long x) {
    return x*x;
  }
}

Page 11: New features in Pig 0.11

Embed Pig into Groovy

import org.apache.pig.scripting.Pig;

public static void main(String[] args) {
  String input = "input"
  String output = "output"
  Pig P = Pig.compile("A = load '\$in'; store A into '\$out';")
  def result = P.bind(['in':input, 'out':output]).runSingle()
  if (result.isSuccessful()) {
    print("Pig job succeeded")
  } else {
    print("Pig job failed")
  }
}

Command line:

bin/pig -x local demo.groovy

Page 12: New features in Pig 0.11

New data type: DateTime

a = load 'input' as (a0:datetime, a1:chararray, a2:long);

b = foreach a generate a0, ToDate(a1, 'yyyyMMdd HH:mm:ss'), ToDate(a2), CurrentTime();

• Support timezone
• Millisecond precision
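As a small sketch of the timezone support (field name, input, and format string are illustrative; the pattern letters follow the usual Joda-Time conventions):

-- e.g. ts = '20121012 10:20:47 +0530'; the trailing Z pattern captures the offset
a = load 'log' as (ts:chararray);
b = foreach a generate ToDate(ts, 'yyyyMMdd HH:mm:ss Z');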

Page 13: New features in Pig 0.11

New data type: DateTime

• DateTime UDFs
  – GetYear, GetMonth, GetDay, GetWeekYear, GetWeek, GetHour, GetMinute, GetSecond, GetMilliSecond
  – YearsBetween, MonthsBetween, WeeksBetween, DaysBetween, HoursBetween, MinutesBetween, SecondsBetween, MilliSecondsBetween
  – AddDuration, SubtractDuration
  – ToDate, ToDateISO, ToMilliSeconds, ToString, ToUnixTime, CurrentTime
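A short sketch using a few of the UDFs listed above (relation and field names are illustrative):

a = load 'events' as (t1:datetime, t2:datetime);
b = foreach a generate GetYear(t1),             -- extract the year field
                       DaysBetween(t2, t1),     -- whole days between two datetimes
                       AddDuration(t1, 'P1D');  -- add an ISO-8601 duration (one day)
dump b;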

Page 14: New features in Pig 0.11

SchemaTuple optimization

• Idea
  – Generate schema-specific tuple code when the schema is known
• Benefit
  – Decrease memory footprint
  – Better performance
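The optimization is opt-in; as far as I recall it is enabled per script with a property along these lines (the property name is an assumption, verify against the 0.11 docs):

-- assumed property name: turns on generation of schema-specific tuple classes
set pig.schematuple true;
a = load 'input' as (a0:int, a1:chararray, a2:double);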

Page 15: New features in Pig 0.11

SchemaTuple optimization

• When the tuple schema is known: (a0: int, a1: chararray, a2: double)

Original Tuple:

Tuple {
  List<Object> mFields;

  Object get(int fieldNum) {
    return mFields.get(fieldNum);
  }

  void set(int fieldNum, Object val) {
    mFields.set(fieldNum, val);
  }
}

Schema Tuple:

SchemaTuple {
  int f0;
  String f1;
  double f2;

  Object get(int fieldNum) {
    switch (fieldNum) {
      case 0: return f0;
      case 1: return f1;
      case 2: return f2;
    }
  }

  void set(int fieldNum, Object val) {
    ......
  }
}

Page 16: New features in Pig 0.11

Pig on new environment

• JDK 7
  – All unit tests pass
  – Jira: PIG-2908
• Hadoop 2.0.0
  – Jira: PIG-2791
• Windows
  – No need for Cygwin
  – Jira: PIG-2793
  – Trying to make it into 0.11

Page 17: New features in Pig 0.11

Faster local mode

• Skip generating job.jar
  – PIG-2128
  – Also in 0.9 and 0.10, but unadvertised
• Remove the hardcoded 5-second wait time for JobControl
  – PIG-2702

Page 18: New features in Pig 0.11

Better stats

• Information on the aliases/lines that make up a map/reduce job
  – An information line for every map/reduce job:

detailed locations: M: A[1,4],A[3,4],B[2,4] C: A[3,4],B[2,4] R: A[3,4],C[5,4]

Explanation:

Map contains:
  alias A: line 1 column 4
  alias A: line 3 column 4
  alias B: line 2 column 4

Combiner contains:
  alias A: line 3 column 4
  alias B: line 2 column 4

Reduce contains:
  alias A: line 3 column 4
  alias C: line 5 column 4
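For orientation, a hypothetical script of roughly this shape (not from the talk) defines alias A at lines 1 and 3, B at line 2, and C at line 5, which is how the bracketed [line,column] pairs above are read:

A = load 'input' as (k:chararray, v:int);    -- line 1: alias A
B = filter A by v > 0;                       -- line 2: alias B
A = group B by k;                            -- line 3: alias A (redefined)

C = foreach A generate group, SUM(B.v);      -- line 5: alias C
store C into 'output';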

Page 19: New features in Pig 0.11

Better notification

• Support for Ambrose
  – Check out "Twitter Ambrose"
  – Open source on GitHub
  – Monitors Pig job progress in a UI

Page 20: New features in Pig 0.11

Integrate HCat DDL

• Embed HCat DDL commands in a Pig script (sketch below)
• Run HCat DDL commands in Grunt

• Embed HCat DDL in scripting language

grunt> sql create table pig_test(name string, age int, gpa double) stored as textfile;

grunt>

from org.apache.pig.scripting import Pig

ret = Pig.sql("""drop table if exists table_1;""")

if ret == 0:
    # success
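For the first bullet (DDL embedded directly in a Pig script file), a sketch assuming the sql command is accepted in a script just as in Grunt (table name and load path are illustrative):

-- myscript.pig
sql create table if not exists pig_test(name string, age int, gpa double) stored as textfile;
a = load 'student' as (name:chararray, age:int, gpa:double);
dump a;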

Page 21: New features in Pig 0.11

Grunt enhancement

• History

• Clear
  – Clear the screen

grunt> a = load '1.txt';

grunt> b = foreach a generate $0, $1;

grunt> history

1 a = load '1.txt';

2 b = foreach a generate $0, $1;

grunt>

Page 22: New features in Pig 0.11

New/enhanced UDFs

• New UDFs
  – STARTSWITH, BagToString, BagToTuple, INVERSEMAP, KEYSET, VALUELIST, VALUESET
• Enhanced UDFs
  – RANDOM: takes a seed
  – AvroStorage: support recursive records, support globs and commas, upgrade to Avro 1.7.1
• EvalFunc enhancement
  – getInputSchema(): get the input schema for a UDF
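A small sketch exercising a couple of the new UDFs (input and field names are illustrative):

a = load 'names' as (name:chararray);
b = filter a by STARTSWITH(name, 'Dan');           -- keep names beginning with 'Dan'
c = group b all;
d = foreach c generate BagToString(b.name, ',');   -- join the bag's items with a delimiter
dump d;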

Page 23: New features in Pig 0.11


• Simplify deployment to get started quickly and easily

• Monitor, manage any size cluster with familiar console and tools

• Only platform to include data integration services to interact with any data

• Metadata services opens the platform for integration with existing applications

• Dependable high availability architecture

• Tested at scale to future proof your cluster growth

Hortonworks Data Platform


– Reduce risks and cost of adoption
– Lower the total cost to administer and provision
– Integrate with your existing ecosystem

Page 24: New features in Pig 0.11


Hortonworks Training

The expert source for Apache Hadoop training & certification

Role-based Developer and Administration training
  – Coursework built and maintained by the core Apache Hadoop development team
  – The "right" course, with the most extensive and realistic hands-on materials
  – Provide an immersive experience into real-world Hadoop scenarios
  – Public and private courses available

Comprehensive Apache Hadoop Certification
  – Become a trusted and valuable Apache Hadoop expert


Page 25: New features in Pig 0.11


Next Steps?

• Expert role-based training
• Courses for admins, developers, and operators
• Certification program
• Custom onsite options


1 Download Hortonworks Data Platform: hortonworks.com/download

2 Use the getting started guide: hortonworks.com/get-started

3 Learn more… get support

• Full lifecycle technical support across four service levels

• Delivered by Apache Hadoop Experts/Committers

• Forward-compatible

Hortonworks Support

hortonworks.com/training hortonworks.com/support

Page 26: New features in Pig 0.11


Thank You! Questions & Answers

Follow: @hortonworks
Read: hortonworks.com/blog
