CS-495/595 Pig

Lecture #6

Dr. Chuck Cartledge

18 Feb. 2015



Table of contents

1 Miscellanea

2 The Book

3 Chapter 11

4 Conclusion

5 References


Corrections and additions since last lecture.

Completed grading Assignment #1.


Hadoop, The Definitive Guide

Version 3 is specified in the syllabus [2]

Version 4 came out in November 2015

We’ll use Version 3 as much as possible


Pig provides information/process hiding

The essence of Pig.

Pig provides a level of abstraction for dealing with large data sets.

There are two major parts to the Pig ecosystem:

The language (Pig Latin),

The execution environment

In a previous lecture, we touched on how a JOIN operation could be performed using MapReduce technology. Pig hides all that complexity.
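To give a feel for what is being hidden, here is a minimal Python sketch of a reduce-side join written as explicit map, shuffle, and reduce steps. It is an illustration only, not the code from the earlier lecture and not what Pig generates; the record layouts and the names `users`, `orders`, and `reduce_side_join` are made up for the example.

```python
from collections import defaultdict

def map_phase(records, tag, key_index):
    """Emit (join_key, (tag, record)) pairs, as a mapper would."""
    for record in records:
        yield record[key_index], (tag, record)

def shuffle(pairs):
    """Group mapper output by key, standing in for Hadoop's shuffle/sort."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """For each key, pair every left record with every right record."""
    for key in sorted(groups):
        left = [rec for tag, rec in groups[key] if tag == "L"]
        right = [rec for tag, rec in groups[key] if tag == "R"]
        for l in left:
            for r in right:
                yield l + r

def reduce_side_join(left, right, left_key=0, right_key=0):
    """Inner join of two lists of tuples on the given key positions."""
    pairs = list(map_phase(left, "L", left_key))
    pairs += list(map_phase(right, "R", right_key))
    return list(reduce_phase(shuffle(pairs)))

# Hypothetical sample data: (user_id, name) and (user_id, item).
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
joined = reduce_side_join(users, orders)
```

In Pig Latin the same inner join is a single statement of the form `JOIN users BY id, orders BY user_id;` (field names again hypothetical), and Pig takes care of turning it into MapReduce jobs.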


Pig is not installed on the Hadoop cluster

You will have to download it and install it.

The tar.gz file is about 120 MB

You’ll need to download it, untar it, and test your installation

There are some gotchas:

1 Pig looks for the environment variable JAVA_HOME to be set

2 The Hadoop cluster runs tcsh rather than bash by default

3 You have to set JAVA_HOME before Pig will run

Some things are left as an exercise for the student.

Section “Installing and Running Pig” on page 366 gives you information on where to download it from, how to install it, and how to test it.
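One way to catch the JAVA_HOME gotcha before starting Pig is a small pre-flight check. The Python sketch below is an assumption-laden illustration, not part of Pig or the book: the function name `check_java_home` is made up, and it only verifies that the variable is set and that a `bin/java` file exists underneath it.

```python
import os

def check_java_home(env=None):
    """Return (ok, message) describing whether JAVA_HOME looks usable."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return False, "JAVA_HOME is not set"
    # Assumption: a usable JDK/JRE has an executable at <JAVA_HOME>/bin/java.
    java_bin = os.path.join(java_home, "bin", "java")
    if not os.path.isfile(java_bin):
        return False, "no java executable at " + java_bin
    return True, "JAVA_HOME points at " + java_home

ok, message = check_java_home()
print(message)
```

Because the cluster's default shell is tcsh, the variable is set there with `setenv JAVA_HOME <path>`; the bash form is `export JAVA_HOME=<path>`. The path itself depends on the machine.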


Pig runs on top of Hadoop

Pig can run in three different modes:

Script: a file contains Pig Latin commands

Grunt: an interactive mode

Embedded: run from Java using the PigServer class

The Eclipse and NetBeans IDEs are supposed to have Pig plug-ins.

Initially you will “tickle” your Pig installation via Grunt; later we will use scripts.


Language differences between Pig and RDBMS.

Pig Latin is a data flow programming language

“. . . dataflow programming emphasizes the movement of data and models programs as a series of connections. Explicitly defined inputs and outputs connect operations, which function like black boxes.” [3]

SQL is a declarative programming language

“. . . declarative programming is a programming paradigm, a style of building the structure and elements of computer programs, that expresses the logic of a computation without describing its control flow.” [4]


Schema differences between Pig and RDBMS.

Pig allows optional schema definition at run time

RDBMS store data in tables, and schemas are well known in advance

Pig defaults to tab-delimited fields; CSV files are processed via a UDF.

Pig reads its data at (roughly) program start, whereas an RDBMS already has the data in tables when a query starts.
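The idea of an optional, read-time schema can be sketched in Python. This is an analogy, not Pig's loader: the function `load`, the field names, and the sample lines are all made up for illustration. Without a schema the fields are positional strings; with one, names and types are applied as the data is read.

```python
def load(lines, schema=None, delimiter="\t"):
    """Split each line into fields; apply (name, type) pairs if a schema is given."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if schema is None:
            # No schema: positional, untyped fields.
            yield tuple(fields)
        else:
            # Schema supplied at read time: name and convert each field.
            yield {name: cast(value) for (name, cast), value in zip(schema, fields)}

# Hypothetical tab-delimited input: year, temperature, quality code.
raw = ["1950\t0\t1\n", "1950\t22\t1\n", "1949\t111\t1\n"]

untyped = list(load(raw))
typed = list(load(raw, schema=[("year", str), ("temperature", int), ("quality", int)]))
```

The same file is read both ways; nothing about the data had to be declared before the program started, which is the contrast with an RDBMS table.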


Data differences between Pig and RDBMS.

Pig allows complex, nested data structures

RDBMS tables are much “flatter”

Pig Latin is generally more customizable than most SQL dialects.
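Pig's complex types are the tuple, the bag (a collection of tuples), and the map. As a rough Python analogy, with made-up data, grouping flat rows produces one record per key whose second field is itself a collection of rows, which is the kind of nesting a flat table does not hold directly.

```python
from itertools import groupby

# Flat, table-like rows: (year, temperature). Values are hypothetical.
flat_rows = [("1949", 111), ("1949", 78), ("1950", 0), ("1950", 22)]

# One nested record per year: (year, [all of that year's rows]).
nested = [
    (year, list(rows))
    for year, rows in groupby(sorted(flat_rows), key=lambda row: row[0])
]
```

Here each element of `nested` plays the role of a tuple whose second field is a bag.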


Access time differences between Pig and RDBMS.

Pig does not support random reads and writes to the data (WORM: write once, read many)

RDBMS supports random access (indices, views, etc.)

RDBMS are good for interactive or low-latency activities.

Pig uses Hadoop and HDFS as its underpinnings and inherits all those strengths and weaknesses.


A simple example

LOAD: establishes where the data will be coming from

AS: defines the schema

FILTER, GROUP: similar to SQL

FOREACH: processes each tuple

MAX: one of many functions

DUMP: outputs the data

Nothing happens until a data flow is defined and a trigger event occurs.
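The operator sequence can be mirrored step by step in plain Python. The sketch below is an analogy, not Pig Latin and not a listing from the book: the records of (year, temperature, quality), the missing-reading marker 9999, and the acceptable quality codes are all assumptions chosen for the example.

```python
from collections import defaultdict

def load(lines):
    """LOAD ... AS: read tab-delimited lines as (year, temperature, quality)."""
    for line in lines:
        year, temperature, quality = line.split("\t")
        yield year, int(temperature), int(quality)

def max_temperature_by_year(lines, bad_reading=9999, good_quality=(0, 1)):
    records = load(lines)
    # FILTER: drop missing readings and unacceptable quality codes.
    filtered = (r for r in records if r[1] != bad_reading and r[2] in good_quality)
    # GROUP: collect the temperatures for each year.
    grouped = defaultdict(list)
    for year, temperature, _ in filtered:
        grouped[year].append(temperature)
    # FOREACH ... MAX: produce one output tuple per group.
    return [(year, max(temps)) for year, temps in sorted(grouped.items())]

# Hypothetical input lines.
sample = [
    "1950\t0\t1", "1950\t22\t1", "1950\t-11\t1",
    "1949\t111\t1", "1949\t9999\t1", "1949\t78\t5",
]

# DUMP: print the result.
for row in max_temperature_by_year(sample):
    print(row)
```

Each commented step corresponds to one statement of a Pig Latin script; in Pig the statements only describe the flow, and the work is done on the cluster when the trigger is reached.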


What are trigger events?

Pig Latin is a data flow language, so something has to start the data flow. Different commands act as triggers.

DUMP: a diagnostic statement

STORE: depends on when the statement is encountered

Image from [1].

Pig Latin → Logical Plan → Physical Plan → MapReduce Plan → Execution
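Deferred execution of this kind can be sketched in a few lines of Python. This is a toy model, not Pig's planner: the class `Relation` and its methods are hypothetical, and there is no logical or physical plan, only a recorded list of steps that runs when a trigger (`dump` here) is called.

```python
class Relation:
    """A deferred data flow: steps are recorded, not run, until a trigger is called."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def filter(self, predicate):
        return Relation(self.source, self.steps + (("filter", predicate),))

    def foreach(self, function):
        return Relation(self.source, self.steps + (("foreach", function),))

    def _execute(self):
        rows = iter(self.source())
        for kind, fn in self.steps:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)

    def dump(self):
        """Trigger: run the recorded steps and return the rows."""
        return self._execute()

reads = []

def source():
    reads.append("read")  # records each time the input is actually read
    return [1, 2, 3, 4, 5, 6]

# Defining the flow records two steps but reads no data.
evens_squared = Relation(source).filter(lambda x: x % 2 == 0).foreach(lambda x: x * x)
reads_before_trigger = len(reads)

# The trigger runs the whole flow.
result = evens_squared.dump()
```

Until `dump` is called the input is never touched, which is the behavior the slide describes for DUMP and STORE.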


What have we covered?

Covered the “essence” of Pig

Pig runs on top of the Hadoop ecosystem and has all the strengths and limitations thereof

Compared Pig to traditional RDBMS

Pig is a dataflow programming language

Next lecture: Hadoop book, Chapter 12 and return exam


References

[1] Prashanth Babu, Introduction to Pig, http://www.slideshare.net/prashanthvvbabu/the-fifthelephant-2012handsonintrotopig, 2013.

[2] Tom White, Hadoop: The Definitive Guide, 3rd edition, O’Reilly Media, Inc., 2012.

[3] Wikipedia, Dataflow programming, Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Dataflow_programming, 2014.

[4] Wikipedia, Declarative programming, Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Declarative_programming, 2014.