Pig 0.11 - New Features

Daniel Dai
Member of Technical Staff
Committer, VP of Apache Pig

© Hortonworks Inc. 2011
Page 1
Pig 0.11 release plan

• Branched on Oct 12, 2012
• Release expected within weeks
  – Fix tests: PIG-2972
  – Documentation: PIG-2756
  – Several last-minute fixes

Page 2 - Architecting the Future of Big Data
New features

• CUBE operator
• RANK operator
• Groovy UDFs
• New data type: DateTime
• SchemaTuple optimization
• Works with JDK 7
• Works with Windows (tentative)

Page 3
New features (continued)

• Faster local mode
• Better stats/notification
  – Ambrose
• Default scripts: pigrc
• Integrated HCat DDL
• Grunt enhancements: history/clear
• UDF enhancements
  – New/enhanced UDFs
  – AvroStorage enhancements

Page 4
CUBE operator

Page 5
rawdata = load 'input' as (ptype, pstore, number);
cubed = cube rawdata by rollup(ptype, pstore);
result = foreach cubed generate flatten(group), SUM(cube.number);
dump result;
Ptype   Pstore  Number
Dog     Miami   12
Cat     Miami   18
Turtle  Tampa   4
Dog     Tampa   14
Cat     Naples  9
Dog     Naples  5
Turtle  Naples  1
Ptype   Pstore  Sum
Cat     Miami   18
Cat     Naples  9
Cat             27
Dog     Miami   12
Dog     Tampa   14
Dog     Naples  5
Dog             31
Turtle  Tampa   4
Turtle  Naples  1
Turtle          5
                63
CUBE operator

Page 6

• Syntax:

outalias = CUBE inalias BY { CUBE expression | ROLLUP expression }
    [, { CUBE expression | ROLLUP expression } ...] [PARALLEL n];

• Umbrella Jira: PIG-2167
• Non-distributed version will be in 0.11 (PIG-2765)
• Distributed version still in progress (PIG-2831)
  – Push algebraic computation to the map/combiner
  – Reference: "Distributed Cube Materialization on Holistic Measures", Arnab Nandi et al., ICDE 2011
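The ROLLUP grouping from the earlier example can be sketched in plain Python. This is a toy illustration of the semantics, not Pig's implementation; the `rollup` helper and the data literals are mine:

```python
from collections import defaultdict

def rollup(rows):
    """Aggregate `number` over every ROLLUP(ptype, pstore) grouping:
    (ptype, pstore), (ptype,), and the grand total. `None` stands for
    a rolled-up dimension, as in the slide's output table."""
    sums = defaultdict(int)
    for ptype, pstore, number in rows:
        sums[(ptype, pstore)] += number   # full grouping
        sums[(ptype, None)] += number     # subtotal per ptype
        sums[(None, None)] += number      # grand total
    return dict(sums)

data = [("Dog", "Miami", 12), ("Cat", "Miami", 18), ("Turtle", "Tampa", 4),
        ("Dog", "Tampa", 14), ("Cat", "Naples", 9), ("Dog", "Naples", 5),
        ("Turtle", "Naples", 1)]
result = rollup(data)
print(result[("Cat", None)])   # 27, the Cat subtotal
print(result[(None, None)])    # 63, the grand total
```

The distributed version (PIG-2831) pushes exactly these partial sums into the map/combiner stage, since SUM is algebraic.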
Rank operator

Page 7
rawdata = load 'input' as (name, gpa:double);
ranked = rank rawdata by gpa;
dump ranked;
Name Gpa
Katie 3.5
Fred 4.0
Holly 3.7
Luke 3.5
Nick 3.7
Rank  Name   Gpa
1     Katie  3.5
5     Fred   4.0
3     Holly  3.7
1     Luke   3.5
3     Nick   3.7
Rank operator

Page 8
rawdata = load 'input' as (name, gpa:double);
ranked = rank rawdata by gpa desc dense;
dump ranked;
Name Gpa
Katie 3.5
Fred 4.0
Holly 3.7
Luke 3.5
Nick 3.7
Rank Name Gpa
3 Katie 3.5
1 Fred 4.0
2 Holly 3.7
3 Luke 3.5
2 Nick 3.7
Rank operator

Page 9

• Limitation
  – Only 1 reducer
• Possible improvements
  – Provide a distributed implementation (PIG-2353)
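The difference between plain and dense ranking in the two examples can be modeled in a few lines of Python (my own sketch of the semantics, not Pig's implementation):

```python
def rank(values, dense=False, reverse=False):
    """RANK semantics: ties share a rank. Without `dense` the next
    distinct value skips ahead by the tie count (1, 1, 3, ...);
    with `dense` it does not (1, 1, 2, ...)."""
    distinct = sorted(set(values), reverse=reverse)
    if dense:
        pos = {v: i + 1 for i, v in enumerate(distinct)}
    else:
        ordered = sorted(values, reverse=reverse)
        pos = {v: ordered.index(v) + 1 for v in distinct}
    return [pos[v] for v in values]

gpas = [3.5, 4.0, 3.7, 3.5, 3.7]          # Katie, Fred, Holly, Luke, Nick
print(rank(gpas))                          # [1, 5, 3, 1, 3]  (rank ... by gpa)
print(rank(gpas, dense=True, reverse=True))# [3, 1, 2, 3, 2]  (by gpa desc dense)
```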
Groovy UDFs

Page 10
register 'test.groovy' using groovy as myfuncs;
a = load '1.txt' as (a0, a1:long);
b = foreach a generate myfuncs.square(a1);
dump b;
test.groovy:

import org.apache.pig.builtin.OutputSchema;

class GroovyUDFs {
  @OutputSchema('x:long')
  long square(long x) {
    return x*x;
  }
}
Embed Pig into Groovy

Page 11
import org.apache.pig.scripting.Pig;
public static void main(String[] args) {
String input = "input"
String output = "output"
Pig P = Pig.compile("A = load '\$in'; store A into '\$out';")
result = P.bind(['in':input, 'out':output]).runSingle()
if (result.isSuccessful()) {
print("Pig job succeeded")
} else {
print("Pig job failed")
}
}
Command line:

bin/pig -x local demo.groovy
New data type: DateTime

Page 12
a = load 'input' as (a0: datetime, a1:chararray, a2:long);
b = foreach a generate a0, ToDate(a1, 'yyyyMMdd HH:mm:ss'), ToDate(a2), CurrentTime();

• Supports timezones
• Millisecond precision
New data type: DateTime

Page 13
• DateTime UDFs

GetYear         YearsBetween          SubtractDuration
GetMonth        MonthsBetween         ToDate
GetDay          WeeksBetween          ToDateISO
GetWeekYear     DaysBetween           ToMilliSeconds
GetWeek         HoursBetween          ToString
GetHour         MinutesBetween        ToUnixTime
GetMinute       SecondsBetween        CurrentTime
GetSecond       MilliSecondsBetween
GetMilliSecond  AddDuration
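As a rough analogue of what several of these UDFs compute, here is a Python sketch using timezone-aware, millisecond-precision values. The variable names and the mapping comments are mine; this is not Pig code:

```python
from datetime import datetime, timedelta, timezone

# A timezone-aware value with millisecond precision, mirroring the
# two properties the DateTime type advertises.
tz = timezone(timedelta(hours=-5))   # example fixed offset
dt = datetime(2012, 10, 12, 9, 30, 15, 123000, tzinfo=tz)

year = dt.year                       # ~ GetYear
month = dt.month                     # ~ GetMonth
millis = dt.microsecond // 1000      # ~ GetMilliSecond
unix_secs = int(dt.timestamp())      # ~ ToUnixTime (seconds since epoch)
print(year, month, millis, unix_secs)
```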
SchemaTuple optimization

Page 14

• Idea
  – Generate schema-specific tuple code when the schema is known
• Benefits
  – Decreased memory footprint
  – Better performance
SchemaTuple optimization

Page 15

• When the tuple schema is known: (a0: int, a1: chararray, a2: double)

Original Tuple:

Tuple {
  List<Object> mFields;
  Object get(int fieldNum) {
    return mFields.get(fieldNum);
  }
  void set(int fieldNum, Object val) {
    mFields.set(fieldNum, val);
  }
}

Schema Tuple:

SchemaTuple {
  int f0;
  String f1;
  double f2;
  Object get(int fieldNum) {
    switch (fieldNum) {
      case 0: return f0;
      case 1: return f1;
      case 2: return f2;
    }
  }
  void set(int fieldNum, Object val) {
    ......
  }
}
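The same idea can be illustrated in Python (my own sketch, not Pig code): a generic list-backed tuple pays for one boxed object slot per field, while a generated class with fixed, typed fields avoids that per-row overhead — here approximated with `__slots__`:

```python
class GenericTuple:
    """Analogue of the generic Tuple: fields live in a list of objects."""
    def __init__(self, fields):
        self.mFields = list(fields)
    def get(self, i):
        return self.mFields[i]

class SchemaTuple:
    """Analogue of a generated SchemaTuple for (a0:int, a1:chararray, a2:double):
    fixed named fields, no per-instance dict or backing list."""
    __slots__ = ("f0", "f1", "f2")
    def __init__(self, f0, f1, f2):
        self.f0, self.f1, self.f2 = f0, f1, f2
    def get(self, i):
        return (self.f0, self.f1, self.f2)[i]

g = GenericTuple([1, "a", 2.0])
s = SchemaTuple(1, "a", 2.0)
print(g.get(2), s.get(2))   # both return 2.0
```

In Pig the win is larger than this analogy suggests, because the generated Java class also stores primitives unboxed.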
Pig on new environments

Page 16

• JDK 7
  – All unit tests pass
  – Jira: PIG-2908
• Hadoop 2.0.0
  – Jira: PIG-2791
• Windows
  – No need for Cygwin
  – Jira: PIG-2793
  – Trying to make it into 0.11
Faster local mode

Page 17

• Skip generating job.jar
  – PIG-2128
  – In 0.9 and 0.10 as well, unadvertised
• Remove the hardcoded 5-second wait time for JobControl
  – PIG-2702
Better stats

Page 18

• Information on the aliases and script lines behind each map/reduce job
  – One information line printed per map/reduce job
detailed locations: M: A[1,4],A[3,4],B[2,4] C: A[3,4],B[2,4] R: A[3,4],C[5,4]

Explanation:

Map contains:
  alias A: line 1 column 4
  alias A: line 3 column 4
  alias B: line 2 column 4
Combiner contains:
  alias A: line 3 column 4
  alias B: line 2 column 4
Reduce contains:
  alias A: line 3 column 4
  alias C: line 5 column 4
Better notification

Page 19

• Support for Ambrose
  – Check out "Twitter Ambrose"
  – Open source on GitHub
  – Monitors Pig job progress in a UI
Integrate HCat DDL

Page 20

• Embed HCat DDL commands in a Pig script
• Run HCat DDL commands in Grunt:

grunt> sql create table pig_test(name string, age int, gpa double) stored as textfile;
grunt>

• Embed HCat DDL in a scripting language:

from org.apache.pig.scripting import Pig
ret = Pig.sql("""drop table if exists table_1;""")
if ret==0:
    #success
Grunt enhancements

Page 21

• history
  – Show command history
• clear
  – Clear the screen
grunt> a = load '1.txt';
grunt> b = foreach a generate $0, $1;
grunt> history
1 a = load '1.txt';
2 b = foreach a generate $0, $1;
grunt>
New/enhanced UDFs

Page 22

• New UDFs

STARTSWITH   INVERSEMAP   VALUESET
BagToString  KEYSET
BagToTuple   VALUELIST

• Enhanced UDFs

RANDOM       Takes a seed
AvroStorage  Supports recursive records
             Supports globs and commas
             Upgraded to Avro 1.7.1

• EvalFunc enhancement
  – getInputSchema(): get the input schema for a UDF
Hortonworks Data Platform

Page 23

• Simplify deployment to get started quickly and easily
• Monitor and manage any size cluster with familiar console and tools
• Only platform to include data integration services to interact with any data
• Metadata services open the platform for integration with existing applications
• Dependable high availability architecture
• Tested at scale to future-proof your cluster growth

Reduce risks and cost of adoption. Lower the total cost to administer and provision. Integrate with your existing ecosystem.
Hortonworks Training

The expert source for Apache Hadoop training & certification

• Role-based Developer and Administration training
  – Coursework built and maintained by the core Apache Hadoop development team
  – The "right" course, with the most extensive and realistic hands-on materials
  – Provides an immersive experience into real-world Hadoop scenarios
  – Public and private courses available
• Comprehensive Apache Hadoop Certification
  – Become a trusted and valuable Apache Hadoop expert

Page 24
Next Steps?

Page 25

1. Download Hortonworks Data Platform
   hortonworks.com/download

2. Use the getting started guide
   hortonworks.com/get-started

3. Learn more… get support

Hortonworks Training
• Expert role-based training
• Courses for admins, developers and operators
• Certification program
• Custom onsite options
hortonworks.com/training

Hortonworks Support
• Full lifecycle technical support across four service levels
• Delivered by Apache Hadoop Experts/Committers
• Forward-compatible
hortonworks.com/support
Thank You! Questions & Answers

Follow: @hortonworks
Read: hortonworks.com/blog

Page 26