The Pig Latin Dataflow Language: A Brief Overview
James Jolly, University of Wisconsin-Madison

What is Pig Latin?

• set-oriented data transformation language
  – primitives filter, combine, split, and order data
  – users describe transformations in steps
  – steps bundled into queries
  – each set transformation is stateless

• flexible data model
  – nested bags of tuples
  – semi-structured datatypes

• extensible
  – supports user-defined functions
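
To make the data model concrete, here is a minimal Python sketch (Python, not Pig syntax). It assumes a bag can be modelled as a list and a tuple as a Python tuple, with a field allowed to hold another bag. The variable names are made up for illustration.

```python
# A bag is modelled as a list of tuples; a field may itself hold a bag.
interests = [
    ("mike", [("mike", "mccain"), ("mike", "obama")]),
    ("bill", [("bill", "economy")]),
]

# Semi-structured data: tuples in one bag need not share the same shape.
mixed_bag = [("obama", 0.010), ("bush",), ("economy", 0.038, 20081006)]

# Walk the outer bag and unpack the nested bag in each tuple.
for user, inner_bag in interests:
    words = [word for _, word in inner_bag]
    print(user, words)

# Number of fields in each tuple of the mixed bag.
arities = [len(t) for t in mixed_bag]
print(arities)
```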

How is it used in practice?

• useful for computations across large, distributed datasets
  – abstracts away details of execution framework
  – users can change order of steps to improve performance

• often used in tandem with Hadoop and HDFS
  – transformations converted to MapReduce dataflows
  – HDFS tracks where data is stored
  – operations scheduled near their data
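
The point about reordering steps can be illustrated with a small Python sketch. It shows the idea only and says nothing about how Pig actually executes a plan. Filtering before a join and filtering after it give the same answer here, but the first ordering feeds fewer tuples into the join. The sample tuples and the helper names `join` and `significant` are assumptions made for the example.

```python
# Hypothetical sample data: (website, word, freq) and (website, user).
freqs = [
    ("www.cnn.com", "mccain", 0.031),
    ("www.cnn.com", "obama", 0.001),
    ("www.foxnews.com", "economy", 0.038),
    ("www.foxnews.com", "bush", 0.001),
]
visits = [("www.cnn.com", "mike"), ("www.foxnews.com", "bill")]


def join(left, right):
    """Nested-loop equijoin on the first field of each tuple."""
    return [a + b[1:] for a in left for b in right if a[0] == b[0]]


def significant(t):
    """Keep tuples whose frequency field (index 2) exceeds 0.002."""
    return t[2] > 0.002


# Ordering 1: join everything, then filter.
join_then_filter = [t for t in join(freqs, visits) if significant(t)]

# Ordering 2: filter first, so the join sees fewer tuples.
filtered = [t for t in freqs if significant(t)]
filter_then_join = join(filtered, visits)

print(len(freqs), "tuples enter the join in ordering 1,", len(filtered), "in ordering 2")
print(join_then_filter == filter_then_join)
```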

An example...

Given two datasets:
• list of words and their frequency of appearance on webpages
• list of users and webpages they visit

Let’s find words users might be interested in lately.

Dataset: words and their frequency of appearance...

website          word      frequency  date
news.bbc.co.uk   obama     0.010      20081005
abcnews.go.com   scheme    0.025      20081010
abcnews.go.com   bombing   0.021      20081006
www.foxnews.com  bush      0.001      20081006
www.cnn.com      mccain    0.031      20081017
www.cnn.com      obama     0.001      20081002
www.reuters.com  bush      0.012      20080921
abcnews.go.com   congress  0.002      20080927
www.reuters.com  bush      0.012      20080921
www.foxnews.com  bush      0.001      20081006
www.latimes.com  abortion  0.001      20081015
www.latimes.com  attack    0.010      20081015
www.reuters.com  obama     0.005      20080917
www.foxnews.com  economy   0.038      20081006

Dataset: webpages users visit...

website          user
www.reuters.com  bill
news.bbc.co.uk   mike
www.cnn.com      mike
www.foxnews.com  bill
www.reuters.com  drew
www.latimes.com  james
abcnews.go.com   james

Loading word frequency data...

freqs = LOAD '/home/jolly/TestData/NewsWords.txt' USING PigStorage(',')
    AS (website_indexed, word, freq, date);

(news.bbc.co.uk, obama, 0.010, 20081005)
(abcnews.go.com, scheme, 0.025, 20081010)
(abcnews.go.com, bombing, 0.021, 20081006)
(www.foxnews.com, bush, 0.001, 20081006)
(www.cnn.com, mccain, 0.031, 20081017)
(www.cnn.com, obama, 0.001, 20081002)
(www.reuters.com, bush, 0.012, 20080921)
(abcnews.go.com, congress, 0.002, 20080927)
(www.reuters.com, bush, 0.012, 20080921)
(www.foxnews.com, bush, 0.001, 20081006)
(www.latimes.com, abortion, 0.001, 20081015)
(www.latimes.com, attack, 0.010, 20081015)
(www.reuters.com, obama, 0.005, 20080917)
(www.foxnews.com, economy, 0.038, 20081006)
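
What the load step does with a comma-delimited file can be approximated in Python as follows. This is a sketch of the idea, not of PigStorage itself. The raw file contents are assumed (the slides show only the loaded tuples), and `load` and `FIELDS` are hypothetical names. Fields stay as strings because the AS clause names the fields without declaring types.

```python
import io

# Two sample lines in the comma-delimited layout the LOAD statement expects.
# The file contents are assumed; only the loaded tuples appear in the slides.
raw = io.StringIO(
    "news.bbc.co.uk,obama,0.010,20081005\n"
    "abcnews.go.com,scheme,0.025,20081010\n"
)

# Field names taken from the AS clause.
FIELDS = ("website_indexed", "word", "freq", "date")


def load(lines, delimiter=","):
    """Split each non-empty line on the delimiter and yield one tuple per line."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            yield tuple(line.split(delimiter))


freqs = list(load(raw))
for t in freqs:
    print(dict(zip(FIELDS, t)))
```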

Hmm, we have some repeats...

(news.bbc.co.uk, obama, 0.010, 20081005)
(abcnews.go.com, scheme, 0.025, 20081010)
(abcnews.go.com, bombing, 0.021, 20081006)
(www.foxnews.com, bush, 0.001, 20081006)      <- repeated
(www.cnn.com, mccain, 0.031, 20081017)
(www.cnn.com, obama, 0.001, 20081002)
(www.reuters.com, bush, 0.012, 20080921)      <- repeated
(abcnews.go.com, congress, 0.002, 20080927)
(www.reuters.com, bush, 0.012, 20080921)      <- repeated
(www.foxnews.com, bush, 0.001, 20081006)      <- repeated
(www.latimes.com, abortion, 0.001, 20081015)
(www.latimes.com, attack, 0.010, 20081015)
(www.reuters.com, obama, 0.005, 20080917)
(www.foxnews.com, economy, 0.038, 20081006)

Duplicate data no more!

distinct_freqs = DISTINCT freqs;

(www.cnn.com, obama, 0.001, 20081002)
(www.cnn.com, mccain, 0.031, 20081017)
(abcnews.go.com, scheme, 0.025, 20081010)
(abcnews.go.com, bombing, 0.021, 20081006)
(abcnews.go.com, congress, 0.002, 20080927)
(news.bbc.co.uk, obama, 0.010, 20081005)
(www.foxnews.com, bush, 0.001, 20081006)
(www.foxnews.com, economy, 0.038, 20081006)
(www.latimes.com, attack, 0.010, 20081015)
(www.latimes.com, abortion, 0.001, 20081015)
(www.reuters.com, bush, 0.012, 20080921)
(www.reuters.com, obama, 0.005, 20080917)
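
Duplicate elimination can be sketched in Python like this, assuming a bag is a list of hashable tuples. The helper `distinct` is hypothetical. It keeps first-seen order, which is a choice made in the sketch: the Pig output above comes back in a different order from the input, so no particular order should be relied on.

```python
# A few tuples from the loaded relation, including the two repeated ones.
freqs = [
    ("www.foxnews.com", "bush", "0.001", "20081006"),
    ("www.reuters.com", "bush", "0.012", "20080921"),
    ("abcnews.go.com", "congress", "0.002", "20080927"),
    ("www.reuters.com", "bush", "0.012", "20080921"),
    ("www.foxnews.com", "bush", "0.001", "20081006"),
]


def distinct(bag):
    """Return the bag with exact duplicate tuples removed."""
    seen = set()
    result = []
    for t in bag:
        if t not in seen:
            seen.add(t)
            result.append(t)
    return result


distinct_freqs = distinct(freqs)
print(len(freqs), "->", len(distinct_freqs))
```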

Hmm, these tuples are old…

(www.cnn.com, obama, 0.001, 20081002)
(www.cnn.com, mccain, 0.031, 20081017)
(abcnews.go.com, scheme, 0.025, 20081010)
(abcnews.go.com, bombing, 0.021, 20081006)
(abcnews.go.com, congress, 0.002, 20080927)   <- old
(news.bbc.co.uk, obama, 0.010, 20081005)
(www.foxnews.com, bush, 0.001, 20081006)
(www.foxnews.com, economy, 0.038, 20081006)
(www.latimes.com, attack, 0.010, 20081015)
(www.latimes.com, abortion, 0.001, 20081015)
(www.reuters.com, bush, 0.012, 20080921)      <- old
(www.reuters.com, obama, 0.005, 20080917)     <- old

... and these (marked) tuples are not very significant.

(www.cnn.com, obama, 0.001, 20081002)         <- low frequency
(www.cnn.com, mccain, 0.031, 20081017)
(abcnews.go.com, scheme, 0.025, 20081010)
(abcnews.go.com, bombing, 0.021, 20081006)
(abcnews.go.com, congress, 0.002, 20080927)   <- low frequency
(news.bbc.co.uk, obama, 0.010, 20081005)
(www.foxnews.com, bush, 0.001, 20081006)      <- low frequency
(www.foxnews.com, economy, 0.038, 20081006)
(www.latimes.com, attack, 0.010, 20081015)
(www.latimes.com, abortion, 0.001, 20081015)  <- low frequency
(www.reuters.com, bush, 0.012, 20080921)
(www.reuters.com, obama, 0.005, 20080917)

Let’s filter them out.

important_freqs = FILTER distinct_freqs BY date > 20081001 AND freq > 0.002;

(www.cnn.com, mccain, 0.031, 20081017)
(abcnews.go.com, scheme, 0.025, 20081010)
(abcnews.go.com, bombing, 0.021, 20081006)
(news.bbc.co.uk, obama, 0.010, 20081005)
(www.foxnews.com, economy, 0.038, 20081006)
(www.latimes.com, attack, 0.010, 20081015)
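
The same filter can be checked by hand with a short Python sketch. It assumes freq is treated as a float and date as an integer; `is_important` is a hypothetical helper name. The input is the twelve distinct tuples from the previous step.

```python
# The twelve distinct tuples: (website_indexed, word, freq, date).
distinct_freqs = [
    ("www.cnn.com", "obama", 0.001, 20081002),
    ("www.cnn.com", "mccain", 0.031, 20081017),
    ("abcnews.go.com", "scheme", 0.025, 20081010),
    ("abcnews.go.com", "bombing", 0.021, 20081006),
    ("abcnews.go.com", "congress", 0.002, 20080927),
    ("news.bbc.co.uk", "obama", 0.010, 20081005),
    ("www.foxnews.com", "bush", 0.001, 20081006),
    ("www.foxnews.com", "economy", 0.038, 20081006),
    ("www.latimes.com", "attack", 0.010, 20081015),
    ("www.latimes.com", "abortion", 0.001, 20081015),
    ("www.reuters.com", "bush", 0.012, 20080921),
    ("www.reuters.com", "obama", 0.005, 20080917),
]


def is_important(t):
    """Mirror the condition: date > 20081001 AND freq > 0.002."""
    website, word, freq, date = t
    return date > 20081001 and freq > 0.002


important_freqs = [t for t in distinct_freqs if is_important(t)]
for t in important_freqs:
    print(t)
```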

Hmm, we don’t need these fields (freq and date) anymore...

(www.cnn.com, mccain, 0.031, 20081017)
(abcnews.go.com, scheme, 0.025, 20081010)
(abcnews.go.com, bombing, 0.021, 20081006)
(news.bbc.co.uk, obama, 0.010, 20081005)
(www.foxnews.com, economy, 0.038, 20081006)
(www.latimes.com, attack, 0.010, 20081015)

Let’s project them out.

websites_to_words = FOREACH important_freqs GENERATE website_indexed, word;

(www.cnn.com, mccain)
(abcnews.go.com, scheme)
(abcnews.go.com, bombing)
(news.bbc.co.uk, obama)
(www.foxnews.com, economy)
(www.latimes.com, attack)
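
Projection simply keeps some fields of each tuple and drops the rest. A minimal Python sketch, assuming tuples of the form (website_indexed, word, freq, date):

```python
# Tuples that survived the filter: (website_indexed, word, freq, date).
important_freqs = [
    ("www.cnn.com", "mccain", 0.031, 20081017),
    ("abcnews.go.com", "scheme", 0.025, 20081010),
    ("abcnews.go.com", "bombing", 0.021, 20081006),
    ("news.bbc.co.uk", "obama", 0.010, 20081005),
    ("www.foxnews.com", "economy", 0.038, 20081006),
    ("www.latimes.com", "attack", 0.010, 20081015),
]

# Keep only the first two fields of every tuple.
websites_to_words = [(website, word) for website, word, _freq, _date in important_freqs]
for t in websites_to_words:
    print(t)
```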

Now we are ready to join our lists.

Websites to Users

(news.bbc.co.uk, mike)
(www.cnn.com, mike)
(www.foxnews.com, bill)
(www.reuters.com, drew)
(www.latimes.com, james)
(abcnews.go.com, james)

Websites to Words

(www.cnn.com, mccain)
(abcnews.go.com, scheme)
(abcnews.go.com, bombing)
(news.bbc.co.uk, obama)
(www.foxnews.com, economy)
(www.latimes.com, attack)

Joining on website: finding words interesting to users...

users_to_words_equijoin = JOIN websites_to_users BY website_visited, websites_to_words BY website_indexed;
users_to_words = FOREACH users_to_words_equijoin GENERATE user, word;

(mike, mccain)
(james, scheme)
(james, bombing)
(mike, obama)
(bill, economy)
(james, attack)

Let’s group our results.

interests = GROUP users_to_words BY user;

(bill, {(bill, economy)})
(mike, {(mike, mccain), (mike, obama)})
(james, {(james, scheme), (james, bombing), (james, attack)})
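
Grouping produces one output tuple per key, holding a nested bag of the original tuples. A minimal Python sketch of that behaviour follows; `group_by` is a hypothetical helper, and the order of groups in the sketch follows first appearance, which need not match the order shown above.

```python
from collections import defaultdict

users_to_words = [
    ("mike", "mccain"),
    ("james", "scheme"),
    ("james", "bombing"),
    ("mike", "obama"),
    ("bill", "economy"),
    ("james", "attack"),
]


def group_by(bag, key_index):
    """Return (key, nested_bag) pairs, one per distinct key value."""
    groups = defaultdict(list)
    for t in bag:
        groups[t[key_index]].append(t)
    return list(groups.items())


interests = group_by(users_to_words, 0)
for user, nested_bag in interests:
    print(user, nested_bag)
```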

How does it work?

• logic factored into MapReduce jobs
  – mapper processes run on machines with input tuples
  – input tuples processed using MAP( ) function, producing intermediate tuples
  – intermediate tuples grouped together, transferred to reducer nodes
  – reducer processes consume intermediate tuples with REDUCE( ) function

Translating Pig Latin to MapReduce...

These statements can be executed using a single MapReduce job:

transformed_by_map = FOREACH input_tuple GENERATE MAP(*);
intermediate_tuple_partition = GROUP transformed_by_map BY input_tuple_key;
result_tuples = FOREACH intermediate_tuple_partition GENERATE REDUCE(*);
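
The map, group, reduce pattern in these statements can be imitated on a single machine with a few lines of Python. This is a toy sketch under stated assumptions: `run_mapreduce`, `emit_word`, and `count` are hypothetical names, the map function is assumed to return a list of (key, value) pairs, and nothing here is distributed.

```python
from collections import defaultdict


def run_mapreduce(input_tuples, map_fn, reduce_fn):
    """Apply map_fn to every input, group by key, then reduce each group."""
    # Map step: each input tuple yields zero or more (key, value) pairs.
    intermediate = []
    for t in input_tuples:
        intermediate.extend(map_fn(t))

    # Group step: collect the values that share a key.
    partitions = defaultdict(list)
    for key, value in intermediate:
        partitions[key].append(value)

    # Reduce step: one result tuple per key.
    return [reduce_fn(key, values) for key, values in partitions.items()]


def emit_word(t):
    """Map function: emit (word, 1) for a (website, word) tuple."""
    _website, word = t
    return [(word, 1)]


def count(key, values):
    """Reduce function: sum the values collected for one key."""
    return (key, sum(values))


websites_to_words = [
    ("www.cnn.com", "mccain"),
    ("www.cnn.com", "obama"),
    ("news.bbc.co.uk", "obama"),
]
word_counts = run_mapreduce(websites_to_words, emit_word, count)
print(word_counts)
```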

[Figure: example message traffic]

Why Pig Latin? Why not a C library?

We could just supply MAP( ) and REDUCE( ) to a C library...

Pig Latin allows you to:

• describe long tasks

– in a friendly scripting language

• use many built-in datatypes

– support for semi-structured data

• use many built-in functions

– filters, projections, joins, unions, splits, etc.

– tends to make user-defined functions simpler

Why Pig Latin? Why not SQL?

Pig Latin:

• is imperative

– lets users manually tune query execution plan

• doesn’t need a schema

– can easily read, write, and represent semi-structured data

Pig Latin really describes a generic dataflow.

inputs = LOAD 'input.txt';
results = FILTER inputs BY IsBoring(important_attribute);
STORE results INTO 'results.txt';
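
The same load, filter, store shape can be written as a short Python sketch. Everything here is an assumption made for illustration: `is_boring` is a made-up stand-in for the user-defined function, each line of the file is treated as one single-field record, and a temporary directory stands in for real storage. As in the statements above, the filter keeps the records for which the predicate is true.

```python
import os
import tempfile


def is_boring(record):
    """Hypothetical stand-in for the IsBoring user-defined function."""
    return len(record) < 4


def run_dataflow(input_path, output_path, predicate):
    """Load lines, keep those matching the predicate, store the survivors."""
    with open(input_path) as src, open(output_path, "w") as dst:
        for line in src:
            record = line.rstrip("\n")
            if predicate(record):
                dst.write(record + "\n")


# Set up a throwaway input file so the sketch is self-contained.
workdir = tempfile.mkdtemp()
input_path = os.path.join(workdir, "input.txt")
output_path = os.path.join(workdir, "results.txt")

with open(input_path, "w") as f:
    f.write("cat\nelephant\ndog\n")

run_dataflow(input_path, output_path, is_boring)

with open(output_path) as f:
    results = f.read().splitlines()
print(results)
```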

Summary

Pig Latin programs:
• typically operate on large volumes of unstructured data
• describe a dataflow between primitive operations
  – many RDBMS-like operations built into the language
  – custom operations can be provided by the user
  – user specifies order of operations
  – dataflows can be executed using MapReduce paradigm

Thanks for listening!
