Pig : Data Analysis Tool in the Cloud Jeff Zhang [email protected] Committer of Pig in ASF




Presentation at the JavaOne conference in Beijing, 2010


Page 1: Pig: Data Analysis Tool in Cloud

Pig : Data Analysis Tool in the Cloud

Jeff Zhang

Committer of Pig in ASF

Page 2: Pig: Data Analysis Tool in Cloud

Agenda

• Background

• What is Pig

• Brief introduction of Pig internals

• Demo

• Q/A

Page 3: Pig: Data Analysis Tool in Cloud

Data Explosion

• Web 2.0

• More digital terminals

Page 4: Pig: Data Analysis Tool in Cloud

What we have for data analysis

• RDBMS (Scalability)

• Parallel RDBMS (Expensive)

• Programming Language (Too complex)

• Hadoop MapReduce (Still too complex for non-Hadoop users)

Page 5: Pig: Data Analysis Tool in Cloud

Then Along Comes Pig

Page 6: Pig: Data Analysis Tool in Cloud

What is Pig

Apache Pig is a platform for analyzing large data sets that consists of a high-level language (PigLatin) for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

• Ease of programming

• Optimization opportunities

• Extensibility

• Built upon Hadoop

Page 7: Pig: Data Analysis Tool in Cloud

A simple example of Pig-Latin

raw_data = load '/java_one/pv' using PigStorage(',') as (time_stamp:long, url:chararray);

pages = foreach raw_data generate url;
pages = group pages by url;
pages = foreach pages generate group as url, COUNT(pages.url) as pv;

pages = order pages by pv desc;
top10 = limit pages 10;

dump top10;

• Page view

• The 10 most popular pages

1291950309812, http://snda.com/page_1
1291950309822, http://snda.com/page_2
1291950309832, http://snda.com/page_3

….
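To make the meaning of the script concrete, here is a minimal plain-Java sketch of the same computation (count page views per URL, order by count descending, keep the top N). It does not use Pig or Hadoop. The class name `TopPages`, the method `topPages`, and the repeated `page_1` record in `main` are illustrative assumptions, not part of the original slides.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of what the Pig Latin script computes; names are hypothetical.
final class TopPages {

    // Each input line looks like "1291950309812, http://snda.com/page_1".
    static List<Map.Entry<String, Long>> topPages(List<String> lines, int n) {
        // Equivalent of "group ... by url" followed by COUNT: tally views per url.
        Map<String, Long> pv = new LinkedHashMap<>();
        for (String line : lines) {
            String url = line.split(",", 2)[1].trim();
            pv.merge(url, 1L, Long::sum);
        }
        // Equivalent of "order ... by pv desc" followed by "limit n".
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(pv.entrySet());
        sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return sorted.subList(0, Math.min(n, sorted.size()));
    }

    public static void main(String[] args) {
        // Sample records in the slide's format; the repeated page_1 is made up.
        List<String> lines = List.of(
                "1291950309812, http://snda.com/page_1",
                "1291950309822, http://snda.com/page_2",
                "1291950309832, http://snda.com/page_1");
        for (Map.Entry<String, Long> e : topPages(lines, 10)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The Java version needs explicit bookkeeping for grouping, counting and sorting, and it runs on a single machine only; the Pig Latin script states the same steps in a few lines and leaves the parallel execution to Hadoop.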

Page 8: Pig: Data Analysis Tool in Cloud

Operators in Pig-Latin

Load - a = load 'data' using PigStorage('\t') as (f1:int, f2:double, f3:chararray);

Store - store a into '/test/output' using PigStorage(',');

Dump - dump a;

Filter - b = filter a by f1 > 0 and f3 == 'java_one';

Foreach - b = foreach a generate f1, f3;

Group - b = group a by f3;

Join - c = join a by f1, b by f1;

Describe - describe b;

….

Page 9: Pig: Data Analysis Tool in Cloud

Data Structure in Pig

• Cell (a field in a database)
- Primitive types: int, long, float, double, bytearray, chararray, null
- Complex types: map, tuple, databag

• Tuple (a row): (1, 1.2, "java")

• DataBag (a table or view): { (1, 1.2, "java"), (2, 2.3, "c++"), (3, 4.5, "c") }

Page 10: Pig: Data Analysis Tool in Cloud

How to use Pig

Grunt (Interactive Shell)

Java API

Other languages (in future)

Page 11: Pig: Data Analysis Tool in Cloud

Architecture of Pig

Parser (PigLatin → LogicalPlan)

Optimizer (LogicalPlan → LogicalPlan)

Compiler (LogicalPlan → PhysicalPlan → MapReducePlan)

ExecutionEngine

PigContext

Hadoop

Grunt (Interactive shell)

PigServer (Java API)

Page 12: Pig: Data Analysis Tool in Cloud

Three basic operations of Pig

• Group by

• Join

• Order

Page 13: Pig: Data Analysis Tool in Cloud

How Pig Does Group By

Data Source → Split → Mapper → Partition → Reducer

Data Source: (A,1)(B,2)(C,3)(B,4)(B,5)(C,6)(A,7)(E,8)(D,9)

Split 1: (A,1)(B,2)(C,3)
Split 2: (B,4)(B,5)(C,6)
Split 3: (A,7)(E,8)(D,9)

Reducer 1: (A,{(A,1),(A,7)})(C,{(C,3),(C,6)})
Reducer 2: (E,{(E,8)})
Reducer 3: (B,{(B,2),(B,4),(B,5)})(D,{(D,9)})
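The split, map, partition and reduce flow for group by can be sketched in plain Java as follows. This is a single-process simulation under stated assumptions, not Pig's actual implementation: the class `GroupBySketch` and its methods are hypothetical names, and hash partitioning is assumed because the slide does not say how keys are assigned to reducers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the group-by flow; it uses neither Pig nor Hadoop.
final class GroupBySketch {

    // Partition step: choose the reducer for a key.
    // Hash partitioning is an assumption made for this sketch.
    static int partition(String key, int numReducers) {
        return Math.floorMod(key.hashCode(), numReducers);
    }

    // splits: one list of (key, value) records per mapper.
    // Returns, per reducer, a map from group key to the bag of records with that key.
    static List<Map<String, List<Map.Entry<String, Integer>>>> groupBy(
            List<List<Map.Entry<String, Integer>>> splits, int numReducers) {
        List<Map<String, List<Map.Entry<String, Integer>>>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new TreeMap<>());
        }
        for (List<Map.Entry<String, Integer>> split : splits) {
            for (Map.Entry<String, Integer> record : split) {
                // The mapper emits the record under its group key;
                // the partitioner routes it to exactly one reducer.
                int r = partition(record.getKey(), numReducers);
                reducers.get(r)
                        .computeIfAbsent(record.getKey(), k -> new ArrayList<>())
                        .add(record);
            }
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, Integer>>> splits = List.of(
                List.of(Map.entry("A", 1), Map.entry("B", 2), Map.entry("C", 3)),
                List.of(Map.entry("B", 4), Map.entry("B", 5), Map.entry("C", 6)),
                List.of(Map.entry("A", 7), Map.entry("E", 8), Map.entry("D", 9)));
        List<Map<String, List<Map.Entry<String, Integer>>>> reducers = groupBy(splits, 3);
        for (int r = 0; r < reducers.size(); r++) {
            System.out.println("Reducer " + (r + 1) + ": " + reducers.get(r));
        }
    }
}
```

Which reducer receives which key depends on the partition function, so the assignment printed here may differ from the slide. The property that matters is that all records sharing a key reach the same reducer, which can then build the complete bag for that key.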

Page 14: Pig: Data Analysis Tool in Cloud

How Pig Does Join

Data Source → Split → Mapper → Partition → Reducer

Data Source A: (1,A1)(4,A4)(3,A3)(5,A5)(2,A2)
Data Source B: (5,B5)(1,B1)(3,B3)(2,B2)(4,B4)

Split 1: (1,A1)(4,A4)(5,B5)(1,B1)
Split 2: (3,A3)(5,A5)(3,B3)(2,B2)
Split 3: (2,A2)(4,B4)

Reducer 1: ((1,A1),(1,B1))((3,A3),(3,B3))((5,A5),(5,B5))
Reducer 2: ((2,A2),(2,B2))((4,A4),(4,B4))
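The join flow can be sketched the same way: records from both inputs are routed by join key, so matching records meet on the same reducer, which then pairs them. The code below is a simplified single-process simulation, not Pig's actual code. The names `JoinSketch`, `shuffle` and `join` are hypothetical, the partition function (key modulo reducer count) is an assumption, and each relation is shuffled separately for brevity, whereas on a cluster one split may contain records from both inputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side join; it uses neither Pig nor Hadoop.
final class JoinSketch {

    // Partition step: key modulo reducer count (an assumption for this sketch).
    static int partition(int key, int numReducers) {
        return Math.floorMod(key, numReducers);
    }

    // Map + partition for one relation: route each (key, value) record
    // to the reducer that owns its join key.
    static List<Map<Integer, List<String>>> shuffle(
            List<Map.Entry<Integer, String>> relation, int numReducers) {
        List<Map<Integer, List<String>>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new TreeMap<>());
        }
        for (Map.Entry<Integer, String> record : relation) {
            buckets.get(partition(record.getKey(), numReducers))
                    .computeIfAbsent(record.getKey(), k -> new ArrayList<>())
                    .add(record.getValue());
        }
        return buckets;
    }

    // Reduce step: on each reducer, pair every A record with every B record
    // that has the same key (inner join). Returns one output list per reducer.
    static List<List<String>> join(List<Map.Entry<Integer, String>> a,
                                   List<Map.Entry<Integer, String>> b,
                                   int numReducers) {
        List<Map<Integer, List<String>>> left = shuffle(a, numReducers);
        List<Map<Integer, List<String>>> right = shuffle(b, numReducers);
        List<List<String>> output = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            List<String> joined = new ArrayList<>();
            for (Map.Entry<Integer, List<String>> e : left.get(r).entrySet()) {
                int key = e.getKey();
                for (String av : e.getValue()) {
                    for (String bv : right.get(r).getOrDefault(key, List.of())) {
                        joined.add("((" + key + "," + av + "),(" + key + "," + bv + "))");
                    }
                }
            }
            output.add(joined);
        }
        return output;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> a = List.of(Map.entry(1, "A1"), Map.entry(4, "A4"),
                Map.entry(3, "A3"), Map.entry(5, "A5"), Map.entry(2, "A2"));
        List<Map.Entry<Integer, String>> b = List.of(Map.entry(5, "B5"), Map.entry(1, "B1"),
                Map.entry(3, "B3"), Map.entry(2, "B2"), Map.entry(4, "B4"));
        join(a, b, 2).forEach(System.out::println);
    }
}
```

With two reducers and this partition function, even keys land on one reducer and odd keys on the other. Because both inputs use the same partition function, a reducer never needs data held by another reducer to complete its part of the join.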

Page 15: Pig: Data Analysis Tool in Cloud

How Pig Does Sort

Data Source → Split → Mapper → Range Partition → Reducer

Data Source: (100)(200)(900)(50)(600)(800)(300)(400)

Split 1: (100)(200)(900)
Split 2: (50)(600)(800)
Split 3: (300)(400)

Reducer 1: (50)(100)(200)(300)(400)
Reducer 2: (600)(800)(900)
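Range partitioning is what turns per-reducer sorting into a total order: each reducer receives one contiguous value range, sorts it locally, and the reducer outputs read in sequence are globally sorted. The plain-Java sketch below simulates this in one process. The names `RangeSortSketch`, `partition` and `sort` are hypothetical, and the boundary value 500 is an assumption chosen for illustration; the slide does not show how the range boundaries are determined.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java simulation of sorting with range partitioning; no Pig or Hadoop involved.
final class RangeSortSketch {

    // Range partition step: boundaries must be ascending. Reducer i receives
    // values below boundaries[i]; the last reducer receives everything else.
    static int partition(int value, int[] boundaries) {
        int r = 0;
        while (r < boundaries.length && value >= boundaries[r]) {
            r++;
        }
        return r;
    }

    // splits: one list of values per mapper. Returns one sorted list per reducer.
    static List<List<Integer>> sort(List<List<Integer>> splits, int[] boundaries) {
        List<List<Integer>> reducers = new ArrayList<>();
        for (int i = 0; i <= boundaries.length; i++) {
            reducers.add(new ArrayList<>());
        }
        for (List<Integer> split : splits) {
            for (int value : split) {
                reducers.get(partition(value, boundaries)).add(value);
            }
        }
        // Each reducer sorts only its own range.
        for (List<Integer> reducer : reducers) {
            Collections.sort(reducer);
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = List.of(
                List.of(100, 200, 900),
                List.of(50, 600, 800),
                List.of(300, 400));
        // One boundary at 500 (hypothetical) gives two reducers.
        sort(splits, new int[] {500}).forEach(System.out::println);
    }
}
```

A hash partitioner would not work here, because it scatters neighbouring values across reducers and the concatenated outputs would no longer be ordered.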

Page 16: Pig: Data Analysis Tool in Cloud

UDF (User-Defined-Function)

register myudf.jar;
raw_data = load '/java_one/udf' as (name:chararray);
firstnames = foreach raw_data generate myudf.FirstName(name);
store firstnames into '/java_one/udf_output';

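package myudf; // matches myudf.FirstName in the script above

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class FirstName extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        // The first field of the input tuple is the full name.
        String name = input.get(0).toString();
        ….
        return firstname;
    }
}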

Page 17: Pig: Data Analysis Tool in Cloud

What Storage Pig Supports

• HDFS
- Plain Text
- Binary format
- Customized format (XML, JSON, Protobuf, Thrift…)

• RDBMS (DBStorage)

• Cassandra (CassandraStorage)

• HBase (HBaseStorage)

Page 18: Pig: Data Analysis Tool in Cloud

What fields can Pig be applied to

• Data Analysis

• Text Processing

• ETL

• Machine Learning

Page 19: Pig: Data Analysis Tool in Cloud

Who’s using Pig

More: http://wiki.apache.org/pig/PoweredBy

Page 20: Pig: Data Analysis Tool in Cloud

References

• http://pig.apache.org (Pig official site)

• http://hadoop.apache.org (Hadoop official site)

• https://github.com/zjffdu/RAF-PIG (Rich API for Pig)

Page 21: Pig: Data Analysis Tool in Cloud

Demo

Page 22: Pig: Data Analysis Tool in Cloud

Thank you!

Q&A