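MapReduce High-Level Applications & a MapReduce Implementation of Frequent Itemset Mining. http://net.pku.edu.cn/~course/cs402/2009/ Peng Bo (彭波), pb@net.pku.edu.cn, School of Information Science and Technology, Peking University (北京大学信息科学技术学院), 7/14/2009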


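MapReduce High-Level Applications & a MapReduce Implementation of Frequent Itemset Mining

http://net.pku.edu.cn/~course/cs402/2009/

Peng Bo (彭波), pb@net.pku.edu.cn

School of Information Science and Technology, Peking University (北京大学信息科学技术学院)

7/14/2009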


Outline

• Homework review
• MapReduce high-level applications
• A MapReduce implementation of frequent itemset mining
• Course schedule


Review of Lecture 3

Process synchronization refers to the coordination of simultaneous threads or processes to complete a task, in order to get the correct runtime order and avoid unexpected race conditions.


What makes this work?

Underneath the socket layer are several more protocols

Most important are TCP and IP (which are used hand-in-hand so often, they’re often spoken of as one protocol: TCP/IP)

[Figure: packet layout with an IP header, a TCP header, and your data]


Why is This Necessary?

• Not actually tube-like “underneath the hood”
• Unlike the phone system (circuit switched), the packet switched Internet uses many routes at once

[Figure: many routes between you and www.google.com]


“A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable”

-- Leslie Lamport


Ken Arnold, CORBA designer:

“Failure is the defining difference between distributed and local programming”


The Eight Design Fallacies

1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.

-- Peter Deutsch and James Gosling, Sun Microsystems


Random Walks Over the Web

Model:
• User starts at a random Web page
• User randomly clicks on links, surfing from page to page

What's the amount of time that will be spent on any given page? This is PageRank.


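PageRank: Defined

Given page x with in-bound links t1…tn, where
• C(t) is the out-degree of t
• α is the probability of a random jump
• N is the total number of nodes in the graph

$$PR(x) = \alpha \left( \frac{1}{N} \right) + (1 - \alpha) \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$$

[Figure: page X with in-bound links from t1, t2, …, tn]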
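A minimal sketch of one evaluation of this definition, assuming the standard form PR(x) = α(1/N) + (1 − α) Σ PR(t_i)/C(t_i). The function name `pagerank_of` and the sample numbers are hypothetical and only illustrate the arithmetic.

```python
def pagerank_of(in_links, pr, out_degree, alpha, n_nodes):
    """PR(x) = alpha * (1/N) + (1 - alpha) * sum(PR(t) / C(t)) over in-links t."""
    # Each in-bound page t passes on its rank divided by its out-degree C(t).
    link_mass = sum(pr[t] / out_degree[t] for t in in_links)
    return alpha * (1.0 / n_nodes) + (1.0 - alpha) * link_mass


# Hypothetical 4-node graph: x is linked from t1 and t2.
pr = {"t1": 0.25, "t2": 0.25, "t3": 0.25}
out_degree = {"t1": 1, "t2": 2, "t3": 5}
value = pagerank_of(["t1", "t2"], pr, out_degree, alpha=0.2, n_nodes=4)
print(value)
```

In a full computation this update would be repeated for every page until the values stop changing; the sketch shows a single page in a single round.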


The Google File System

Main Contribution

We treat component failures as the norm rather than the exception, optimize for huge files that are mostly appended to (perhaps concurrently) and then read (usually sequentially), and both extend and relax the standard file system interface to improve the overall system.


MapReduce: Simplified Data Processing on Large Clusters

Main Contribution

MapReduce is a programming model and an associated implementation for processing and generating large data sets.

1. The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault-tolerance, locality optimization, and load balancing.

2. A large variety of problems are easily expressible as MapReduce computations.

3. The authors developed an implementation of MapReduce that scales to large clusters comprising thousands of machines.


MapReduce High-Level Applications


Pig Latin: A Not-So-Foreign Language For Data Processing

Chris Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins

Research


Data Processing Renaissance

Internet companies swimming in data
• E.g. TBs/day at Yahoo!

Data analysis is “inner loop” of product innovation

Data analysts are skilled programmers


Data Warehousing …?

• Scale: often not scalable enough
• $$$$: prohibitively expensive at web scale
  – Up to $200K/TB
• SQL:
  – Little control over execution method
  – Query optimization is hard: parallel environment, little or no statistics, lots of UDFs


New Systems For Data Analysis

Map-Reduce

Apache Hadoop

Dryad

. . .


Map-Reduce

[Figure: Map-Reduce data flow]
• Input records: (k1, v1), (k2, v2), (k1, v3), (k2, v4), (k1, v5)
• map: applied to the input records
• Grouped by key: k1 → v1, v3, v5; k2 → v2, v4
• reduce: applied to each key group, producing the output records

Just a group-by-aggregate?
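The "group-by-aggregate" reading of map-reduce can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not Hadoop's API: `map_reduce`, `identity_map`, and `sum_reduce` are hypothetical names, and the integer values stand in for v1…v5.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map, group the emitted pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):      # map phase
            groups[key].append(value)          # shuffle: group values by key
    # reduce phase: one call per distinct key
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def identity_map(record):
    """Emit the (key, value) record unchanged."""
    return [record]


def sum_reduce(key, values):
    """Aggregate all values that share a key."""
    return sum(values)


# Keys mirror the slide (k1, k2); the values are made up for illustration.
records = [("k1", 1), ("k2", 2), ("k1", 3), ("k2", 4), ("k1", 5)]
result = map_reduce(records, identity_map, sum_reduce)
print(result)
```

A real framework distributes the map and reduce calls across machines and handles failures; the grouping step in the middle is the part that corresponds to "group by".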


The Map-Reduce Appeal

• Scale: scalable due to simpler design
  – Only parallelizable operations
  – No transactions
• $: runs on cheap commodity hardware
• Procedural control (in place of SQL): a processing “pipe”


Disadvantages

1. Extremely rigid data flow (M → R)
• Other flows constantly hacked in: join, union, split, chains (e.g. M → M → R → M)

2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct

3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize


Pros And Cons

Need a high-level, general data flow language


Enter Pig Latin

Pig Latin

Need a high-level, general data flow language


Outline

• Map-Reduce and the need for Pig Latin

• Pig Latin example

• Salient features

• Implementation


Example Data Analysis Task

Find the top 10 most visited pages in each category

Visits:
User   Url          Time
Amy    cnn.com      8:00
Amy    bbc.com      10:00
Amy    flickr.com   10:05
Fred   cnn.com      12:00

Url Info:
Url          Category   PageRank
cnn.com      News       0.9
bbc.com      News       0.8
flickr.com   Photos     0.7
espn.com     Sports     0.9


Data Flow

[Figure: data flow]
1. Load Visits
2. Group by url
3. Foreach url generate count
4. Load Url Info
5. Join on url (the counts from step 3 with the Url Info from step 4)
6. Group by category
7. Foreach category generate top10 urls


In Pig Latin

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';
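To make the data flow concrete, here is a rough single-machine Python equivalent of the script, run on the example Visits and Url Info rows. It is a sketch under assumptions: the function name `top_urls_per_category` is hypothetical, the join is treated as an inner join, and ties are broken alphabetically, which the slides do not specify.

```python
from collections import Counter, defaultdict

visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
          ("Amy", "flickr.com", "10:05"), ("Fred", "cnn.com", "12:00")]
url_info = [("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8),
            ("flickr.com", "Photos", 0.7), ("espn.com", "Sports", 0.9)]


def top_urls_per_category(visits, url_info, n=10):
    # group visits by url; foreach url generate count
    visit_counts = Counter(url for _user, url, _time in visits)
    # join the counts with urlInfo on url, then group by category
    by_category = defaultdict(list)
    for url, category, _p_rank in url_info:
        if url in visit_counts:            # inner join: unvisited urls drop out
            by_category[category].append((url, visit_counts[url]))
    # foreach category generate the top n urls by visit count
    return {category: sorted(rows, key=lambda row: (-row[1], row[0]))[:n]
            for category, rows in by_category.items()}


top = top_urls_per_category(visits, url_info)
print(top)
```

Each Python statement corresponds to one or two Pig Latin steps, which is the point of the slide: the script reads as a sequence of named intermediate results.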


Outline

• Map-Reduce and the need for Pig Latin

• Pig Latin example

• Salient features

• Implementation


Step-by-step Procedural Control

Target users are entrenched procedural programmers

“The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.”

-- Jasmine Novak, Engineer, Yahoo!

• Automatic query optimization is hard
• Pig Latin does not preclude optimization

“With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it’s more powerful.”

-- David Ciemiewicz, Search Excellence, Yahoo!


Quick Start and Interoperability

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';

Operates directly over files


Quick Start and Interoperability

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';

Schemas optional; can be assigned dynamically


User-Code as a First-Class Citizen

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';

User-defined functions (UDFs) can be used in every construct
• Load, Store
• Group, Filter, Foreach


Nested Data Model

• Pig Latin has a fully-nestable data model with:
  – Atomic values, tuples, bags (lists), and maps
• More natural to programmers than flat tuples
• Avoids expensive joins
• See paper

[Figure: nested example pairing yahoo with the bag finance, email, news]


Outline

• Map-Reduce and the need for Pig Latin

• Pig Latin example

• Salient features

• Implementation


Implementation

[Figure: implementation stack. Labels: user; SQL (automatic rewrite + optimize); Pig; Hadoop Map-Reduce; cluster]

Pig is open-source. http://incubator.apache.org/pig


Compilation into Map-Reduce

[Figure: the data flow (Load Visits; Group by url; Foreach url generate count; Load Url Info; Join on url; Group by category; Foreach category generate top10(urls)) divided into three map-reduce jobs: Map1/Reduce1, Map2/Reduce2, Map3/Reduce3]

Every group or join operation forms a map-reduce boundary

Other operations pipelined into map and reduce phases


Usage

• First production release about a year ago

• 150+ early adopters within Yahoo!

• Over 25% of the Yahoo! map-reduce user base


Related Work

• Sawzall
  – Data processing language on top of map-reduce
  – Rigid structure of filtering followed by aggregation

• DryadLINQ
  – SQL-like language on top of Dryad

• Nested data models
  – Object-oriented databases


Future Work

• Optional “safe” query optimizer
  – Performs only high-confidence rewrites

• User interface
  – Boxes and arrows UI
  – Promote collaboration, sharing code fragments and UDFs

• Tight integration with a scripting language
  – Use loops, conditionals of host language


Credits

Arun Murthy, Pi Song, Santhosh Srinivasan, Amir Youssefi, Shubham Chopra, Alan Gates, Shravan Narayanamurthy, Olga Natkovich


A MapReduce Implementation of Frequent Itemset Mining


'Basket data'

A very common type of data; often also called transaction data.

Next slide shows example transaction database, where each record represents a transaction between (usually) a customer and a shop. Each record in a supermarket’s transaction DB, for example, corresponds to a basket of specific items.


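Transaction database: each row is a record ID followed by a 1 for each item in that basket.
Item columns: apples, beer, cheese, dates, eggs, fish, glue, honey, ice-cream

1 1 1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1
6 1 1
7 1 1 1
8 1 1
9 1 1
10 1 1
11 1 1
12 1
13 1 1
14 1 1
15 1 1
16 1
17 1 1
18 1 1 1 1 1
19 1 1 1 1 1
20 1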


Discovering Rules

A common and useful application of data mining.

A 'rule' is something like this: If a basket contains apples and cheese, then it also contains beer.

Any such rule has two associated measures:
• confidence: when the 'if' part is true, how often is the 'then' part true? This is the same as accuracy.
• coverage or support: how much of the database contains the 'if' part?


[Table: the 20-record transaction database shown earlier, with item columns apples, beer, cheese, dates, eggs, fish, glue, honey, ice-cream]

What is the confidence and coverage of: If the basket contains beer and cheese, then it also contains honey?

2/20 of the records contain both beer and cheese, so coverage is 10%.

Of these 2, 1 contains honey, so confidence is 50%.
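The two measures can be computed mechanically. The sketch below uses a small made-up list of ten baskets (not the lecture's 20-record table) and a hypothetical helper `rule_stats`; coverage is taken over the 'if' part, as defined on the earlier slide.

```python
def rule_stats(baskets, if_items, then_items):
    """Return (coverage, confidence) of the rule IF if_items THEN then_items."""
    if_items, then_items = set(if_items), set(then_items)
    # Records in which the whole 'if' part is present.
    matching_if = [basket for basket in baskets if if_items <= basket]
    coverage = len(matching_if) / len(baskets)
    if not matching_if:
        return coverage, 0.0               # rule never fires
    both = sum(1 for basket in matching_if if then_items <= basket)
    return coverage, both / len(matching_if)


# Hypothetical mini database of 10 baskets.
baskets = [
    {"beer", "cheese", "honey"}, {"beer", "cheese"}, {"apples"}, {"beer"},
    {"cheese", "eggs"}, {"dates", "honey"}, {"apples", "fish"},
    {"glue", "eggs"}, {"fish"}, {"ice-cream"},
]
coverage, confidence = rule_stats(baskets, {"beer", "cheese"}, {"honey"})
print(coverage, confidence)
```

Here 2 of the 10 baskets contain beer and cheese, and 1 of those 2 also contains honey, mirroring the 2-then-1 counting done on the slide.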


Interesting/Useful rules

Statistically, anything that is interesting is something that happens significantly more than you would expect by chance.

E.g. basic statistical analysis of basket data may show that 10% of baskets contain bread, and 4% of baskets contain washing-up powder.

I.e. if you choose a basket at random:
• There is a probability 0.1 that it contains bread.
• There is a probability 0.04 that it contains washing-up powder.


Bread and washing up powder

What is the probability of a basket containing both bread and washing-up powder?

The laws of probability say: if these two things are independent, the chance is 0.1 * 0.04 = 0.004.

That is, we would expect 0.4% of baskets to contain both bread and washing-up powder.


Interesting means surprising

We therefore have a prior expectation that just 4 in 1,000 baskets should contain both bread and washing-up powder.

If we investigate, and discover that really it is 20 in 1,000 baskets, then we will be very surprised. It tells us that:
• Something is going on in shoppers' minds: bread and washing-up powder are connected in some way.
• There may be ways to exploit this discovery … put the powder and bread at opposite ends of the supermarket?
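One way to quantify "surprising" is the ratio of the observed co-occurrence rate to the rate expected under independence (often called lift in the association-rule literature). A small sketch with the bread and washing-up powder numbers; the function name `surprise_ratio` is hypothetical.

```python
def surprise_ratio(p_a, p_b, observed_both):
    """Observed joint frequency divided by the frequency expected if A and B were independent."""
    expected_both = p_a * p_b              # 0.1 * 0.04 = 0.004 for bread and powder
    return observed_both / expected_both


# 10% bread, 4% washing-up powder, and 20 in 1,000 baskets contain both.
ratio = surprise_ratio(0.10, 0.04, 20 / 1000)
print(round(ratio, 6))
```

A ratio near 1 means the pair co-occurs about as often as chance predicts; here the pair shows up roughly five times more often than expected.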


Finding surprising rules

Suppose we ask 'what is the most surprising rule in this database?'

This would be, presumably, a rule whose accuracy is more different from its expected accuracy than any others. But it also has to have a suitable level of coverage, or else it may be just a statistical blip, and/or unexploitable.


Here are some interesting ones in our mini basket DB:

• If a basket contains glue, then it also contains either beer or eggs
  – confidence 100%; coverage 25%

• If a basket contains apples and dates, then it also contains honey
  – confidence 100%; coverage 20%


Finding surprising rules

Looking only at rules of the form: if a basket contains X and Y, then it also contains Z …

• Our realistic numbers tell us that there may be around 500,000,000 distinct possible rules. For each of these we need to work out its accuracy and coverage, by trawling through a database of around 20,000,000 basket records … c. 10^16 operations …
• We have to search through, somehow, 500,000,000 (or usually immensely more) rules to sniff out what may be the interesting ones.

Does MapReduce work here?


Find rules in two stages

Agrawal and colleagues divided the problem of finding good rules into two phases:

1. Find all itemsets with a specified minimal support (coverage). An itemset is just a specific set of items, e.g. {apples, cheese}. The Apriori algorithm can efficiently find all itemsets whose coverage is above a given minimum.

2. Use these itemsets to help generate interesting rules. Having done stage 1, we have considerably narrowed down the possibilities, and can do reasonably fast processing of the large itemsets to generate candidate rules.


Terminology

k-itemset: a set of k items. E.g.
• {beer, cheese, eggs} is a 3-itemset
• {cheese} is a 1-itemset
• {honey, ice-cream} is a 2-itemset

support: an itemset has support s% if s% of the records in the DB contain that itemset.

minimum support: the Apriori algorithm starts with the specification of a minimum level of support, and will focus on itemsets with this level or above.


Terminology

large itemset: doesn’t mean an itemset with many items. It means one whose support is at least minimum support.

Lk : the set of all large k-itemsets in the DB.

Ck : a set of candidate large k-itemsets. In the algorithm we will look at, it generates this set, which contains all the k-itemsets that might be large, and then eventually generates the set above.


Terminology

sets: Let A be a set (A = {cat, dog}), let B be a set (B = {dog, eel, rat}), and let C = {eel, rat}.

I use 'A + B' to mean A union B. So A + B = {cat, dog, eel, rat}.

When X is a subset of Y, I use Y – X to mean the set of things in Y which are not in X. E.g. B – C = {dog}.


[Table: the same 20-record transaction database, with the item columns abbreviated to a, b, c, d, e, f, g, h, i]

E.g.
• 3-itemset {a, b, h} has support 15%
• 2-itemset {a, i} has support 0%
• 4-itemset {b, c, d, h} has support 5%

If minimum support is 10%, then {b} is a large itemset, but {b, c, d, h} is a small itemset!


Insight

What's the relationship between a k-itemset and a (k+1)-itemset?

If a k-itemset is not a large itemset, then any (k+1)-itemset that contains it cannot be a large itemset either.


The Apriori algorithm for finding large itemsets efficiently in big DBs

1: Find all large 1-itemsets
2: For (k = 2 ; while Lk-1 is non-empty; k++)
3 { Ck = apriori-gen(Lk-1)
4   For each c in Ck, initialise c.count to zero
5   For all records r in the DB
6   { Cr = subset(Ck, r); For each c in Cr , c.count++ }
7   Set Lk := all c in Ck whose count >= minsup
8 } /* end -- return all of the Lk sets. */
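The pseudocode translates fairly directly into Python. The version below is a minimal in-memory sketch, not a tuned or distributed implementation: `apriori` and `apriori_gen` are illustrative names, `minsup` is taken as an absolute record count, and the five-record database is made up.

```python
from itertools import combinations


def apriori_gen(prev_large, k):
    """Build candidate k-itemsets from the large (k-1)-itemsets: join, then prune."""
    prev = sorted(tuple(sorted(s)) for s in prev_large)
    candidates = set()
    for a, b in combinations(prev, 2):
        if a[:-1] == b[:-1]:                       # join: same first k-2 items
            cand = a + (b[-1],)
            # prune: every (k-1)-subset of the candidate must itself be large
            if all(frozenset(sub) in prev_large
                   for sub in combinations(cand, k - 1)):
                candidates.add(frozenset(cand))
    return candidates


def apriori(db, minsup):
    """Return {k: set of large k-itemsets}; db is a list of sets."""
    items = {item for record in db for item in record}
    large = {1: {frozenset([i]) for i in items
                 if sum(1 for r in db if i in r) >= minsup}}     # line 1
    k = 2
    while large[k - 1]:                                          # line 2
        counts = {c: 0 for c in apriori_gen(large[k - 1], k)}    # lines 3-4
        for record in db:                                        # line 5
            for c in counts:                                     # line 6
                if c <= record:
                    counts[c] += 1
        large[k] = {c for c, n in counts.items() if n >= minsup} # line 7
        k += 1
    del large[k - 1]                       # drop the final, empty level
    return large                                                 # line 8


db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(db, minsup=3)
print(result)
```

In this toy database every pair is large (3 records each), but {a, b, c} appears in only 2 records, so the loop stops after L2.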


Explaining the Apriori Algorithm …

1: Find all large 1-itemsets

To start off, we simply find all of the large 1-itemsets. This is done by a basic scan of the DB. We take each item in turn, and count the number of times that item appears in a basket. In our running example, suppose minimum support was 60%, then the only large 1-itemsets would be: {a}, {b}, {c}, {d} and {f}. So we get

L1 = { {a}, {b}, {c}, {d}, {f}}


Explaining the Apriori Algorithm …

1: Find all large 1-itemsets

2: For (k = 2 ; while Lk-1 is non-empty; k++)

We already have L1. This next bit just means that the remainder of the algorithm generates L2, L3, and so on until we get to an Lk that's empty.

How these are generated is like this:


Explaining the Apriori Algorithm …

1: Find all large 1-itemsets

2: For (k = 2 ; while Lk-1 is non-empty; k++)

3 {Ck = apriori-gen(Lk-1)

Given the large (k-1)-itemsets, this step generates some candidate k-itemsets that might be large. Because of how apriori-gen works, the set Ck is guaranteed to contain all the large k-itemsets, but also contains some that will turn out not to be 'large'.


Explaining the Apriori Algorithm …

1: Find all large 1-itemsets

2: For (k = 2 ; while Lk-1 is non-empty; k++)

3 {Ck = apriori-gen(Lk-1)

4 For each c in Ck, initialise c.count to zero

We are going to work out the support for each of the candidate k-itemsets in Ck, by working out how many times each of these itemsets appears in a record in the DB. This step starts us off by initialising these counts to zero.


Explaining the Apriori Algorithm …

1: Find all large 1-itemsets

2: For (k = 2 ; while Lk-1 is non-empty; k++)

3 {Ck = apriori-gen(Lk-1)

4 For each c in Ck, initialise c.count to zero

5 For all records r in the DB

6 {Cr = subset(Ck, r); For each c in Cr , c.count++ }

We now take each record r in the DB and do this: get all the candidate k-itemsets from Ck that are contained in r. For each of these, update its count.


Explaining the Apriori Algorithm …

1: Find all large 1-itemsets
2: For (k = 2 ; while Lk-1 is non-empty; k++)
3 {Ck = apriori-gen(Lk-1)
4 For each c in Ck, initialise c.count to zero
5 For all records r in the DB
6 {Cr = subset(Ck, r); For each c in Cr , c.count++ }
7 Set Lk := all c in Ck whose count >= minsup

Now we have the count for every candidate. Those whose count is big enough are valid large itemsets of the right size. We therefore now have Lk. We now go back into the for loop of line 2 and start working towards finding Lk+1.


Explaining the Apriori Algorithm …

1: Find all large 1-itemsets
2: For (k = 2 ; while Lk-1 is non-empty; k++)
3 {Ck = apriori-gen(Lk-1)
4 For each c in Ck, initialise c.count to zero
5 For all records r in the DB
6 {Cr = subset(Ck, r); For each c in Cr , c.count++ }
7 Set Lk := all c in Ck whose count >= minsup
8 } /* end -- return all of the Lk sets. */

We finish at the point where we get an empty Lk. The algorithm returns all of the (non-empty) Lk sets, which gives us an excellent start in finding interesting rules (although the large itemsets themselves will usually be very interesting and useful).


apriori-gen : notes

Suppose we have worked out that the large 2-itemsets are:

L2 = { {milk, noodles}, {milk, tights}, {noodles, quorn}}

apriori-gen now generates 3-itemsets that all may be large.


apriori-gen : the join step

Keep an ordering of the items. a < b will mean that a comes before b in alphabetical order.

Suppose we have Lk and wish to generate Ck+1.

First we take every distinct pair of sets in Lk, {a1, a2, … ak} and {b1, b2, … bk}, and do this: in all cases where {a1, a2, … ak-1} = {b1, b2, … bk-1} and ak < bk, {a1, a2, … ak, bk} is a candidate (k+1)-itemset.


An illustration of that

Suppose the 2-itemsets are:

L2 = { {milk, noodles}, {milk, tights}, {noodles, quorn}, {noodles, peas}, {noodles, tights}}

The pairs that satisfy {a1, a2, … ak-1} = {b1, b2, … bk-1} and ak < bk are:
• {milk, noodles} | {milk, tights}
• {noodles, peas} | {noodles, quorn}
• {noodles, peas} | {noodles, tights}
• {noodles, quorn} | {noodles, tights}

So the candidate 3-itemsets are: {milk, noodles, tights}, {noodles, peas, quorn}, {noodles, peas, tights}, {noodles, quorn, tights}


apriori-gen : the prune step

In the prune step, we take the candidate (k+1)-itemsets we have, and remove any for which some k-subset of it is not a large k-itemset. Such a candidate couldn't possibly be a large (k+1)-itemset.

E.g. in the current example, we have (n = noodles, etc.):

L2 = { {milk, n}, {milk, tights}, {n, quorn}, {n, peas}, {n, tights}}

And candidate (k+1)-itemsets so far: {m, n, t}, {n, p, q}, {n, p, t}, {n, q, t}

Now,
• {p, q} is not a large 2-itemset, so {n, p, q} is pruned.
• {p, t} is not a large 2-itemset, so {n, p, t} is pruned.
• {q, t} is not a large 2-itemset, so {n, q, t} is pruned.

After this we finally have C3 = {{milk, noodles, tights}}
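The join and prune steps can be checked on the milk/noodles example with a short sketch. `join_step` and `prune_step` are hypothetical helper names; itemsets are kept as alphabetically sorted tuples so that "ak < bk" falls out of the sort order.

```python
from itertools import combinations


def join_step(large_k):
    """Join pairs of sorted k-itemsets that agree on their first k-1 items."""
    ordered = sorted(tuple(sorted(s)) for s in large_k)
    # combinations() yields pairs (a, b) with a before b, so a[-1] < b[-1]
    # whenever the first k-1 items are equal.
    return [a + (b[-1],) for a, b in combinations(ordered, 2) if a[:-1] == b[:-1]]


def prune_step(candidates, large_k):
    """Keep a candidate only if all of its k-subsets are large k-itemsets."""
    large = {frozenset(s) for s in large_k}
    if not large:
        return []
    k = len(next(iter(large)))
    return [c for c in candidates
            if all(frozenset(sub) in large for sub in combinations(c, k))]


L2 = [{"milk", "noodles"}, {"milk", "tights"}, {"noodles", "quorn"},
      {"noodles", "peas"}, {"noodles", "tights"}]
joined = join_step(L2)
C3 = prune_step(joined, L2)
print(joined)
print(C3)
```

The join produces the four candidates listed on the previous slide, and the prune leaves only {milk, noodles, tights}, matching C3 above.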


Understanding rules

The Apriori algorithm finds interesting (i.e. frequent) itemsets.

E.g. it may find that {apples, bananas, milk} has coverage 30%, so 30% of transactions contain all three of these things.

What can you say about the coverage of {apples, milk}?

We can invent several potential rules, e.g.: IF basket contains apples and bananas, THEN it also contains milk.

Suppose support of {apples, bananas} is 40%; what is the confidence of this rule?


Understanding rules II

Suppose itemset A = {beer, cheese, eggs} has 30% support in the DB, {beer, cheese} has 40%, {beer, eggs} has 30%, {cheese, eggs} has 50%, and each of beer, cheese, and eggs alone has 50% support.

What is the confidence of: IF basket contains Beer and Cheese, THEN basket also contains Eggs?

The confidence of a rule if A then B is simply: support(A + B) / support(A).

So it's 30/40 = 0.75; this rule has 75% confidence.

What is the confidence of: IF basket contains Beer, THEN basket also contains Cheese and Eggs?

30 / 50 = 0.6, so this rule has 60% confidence.
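Given the supports of an itemset and all of its subsets, the confidence of every rule that can be carved out of it follows from support(A + B) / support(A). A small sketch using the beer, cheese, eggs numbers from this slide; `rules_from_itemset` is a hypothetical helper and supports are stored as whole percentages.

```python
from itertools import combinations


def rules_from_itemset(itemset, support):
    """Yield (if_part, then_part, confidence) for every split of the itemset."""
    items = sorted(itemset)
    whole = support[frozenset(items)]              # support(A + B)
    for size in range(1, len(items)):
        for if_part in combinations(items, size):
            then_part = tuple(i for i in items if i not in if_part)
            yield if_part, then_part, whole / support[frozenset(if_part)]


# Supports in percent, as given on the slide.
support = {
    frozenset(["beer", "cheese", "eggs"]): 30,
    frozenset(["beer", "cheese"]): 40,
    frozenset(["beer", "eggs"]): 30,
    frozenset(["cheese", "eggs"]): 50,
    frozenset(["beer"]): 50,
    frozenset(["cheese"]): 50,
    frozenset(["eggs"]): 50,
}
rules = {if_part: conf
         for if_part, _then, conf in rules_from_itemset({"beer", "cheese", "eggs"}, support)}
for if_part, conf in rules.items():
    print(if_part, conf)
```

This reproduces the 75% and 60% figures worked out above and also shows that, with these supports, "IF beer and eggs THEN cheese" would have 100% confidence.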


Understanding rules III

If the following rule has confidence c: If A then B, and if support(A) = 2 * support(B), what can be said about the confidence of: If B then A?

Confidence c is support(A + B) / support(A) = support(A + B) / (2 * support(B)).

Let d be the confidence of 'If B then A'. Then d is support(A + B) / support(B), so clearly d = 2c.

E.g. A might be milk and B might be newspapers


Summary

• The Apriori algorithm for efficiently finding frequent (large) itemsets in large DBs
• Associated terminology
• Associated notes about rules, and working out the confidence of a rule based on the support of its component itemsets

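A full run through of Apriori

[Table: transaction database D with 20 records (ID 1 to 20) over the items a, b, c, d, e, f, g; a 1 marks an item contained in a record. The column positions are not recoverable; the number of items per record is 1: 4, 2: 3, 3: 3, 4: 2, 5: 3, 6: 1, 7: 3, 8: 1, 9: 2, 10: 2, 11: 3, 12: 1, 13: 2, 14: 4, 15: 0, 16: 1, 17: 3, 18: 4, 19: 5, 20: 1.]

We will assume this is our transaction database D and we will assume minsup is 4 (20%)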

This will not be run through in the lecture; it is here to help with revision

First we find all the large 1-itemsets, i.e., in this case, all the 1-itemsets that are contained in at least 4 records in the DB. In this example, that's all of them. So,

L1 = {{a}, {b}, {c}, {d}, {e}, {f}, {g}}

Now we set k = 2 and run apriori-gen to generate C2

The join step when k = 2 just gives us the set of all alphabetically ordered pairs from L1, and we cannot prune any away, so we have C2 = {{a, b}, {a, c}, {a, d}, {a, e}, {a, f}, {a, g}, {b, c}, {b, d}, {b, e}, {b, f}, {b, g}, {c, d}, {c, e}, {c, f}, {c, g}, {d, e}, {d, f}, {d, g}, {e, f}, {e, g}, {f, g}}
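For k = 2 the join step amounts to listing every alphabetically ordered pair of large 1-itemsets. A minimal Python sketch of that special case (the variable names are illustrative):

```python
from itertools import combinations

# L1 from this example: every single item is large.
L1 = ["a", "b", "c", "d", "e", "f", "g"]

# All alphabetically ordered pairs; nothing can be pruned at k = 2
# because every 1-subset of every pair is in L1.
C2 = list(combinations(L1, 2))

print(len(C2))
print(C2[0], C2[-1])
```

With 7 items this gives 7 * 6 / 2 = 21 candidate pairs, matching the list above.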

So we have C2 = {{a, b}, {a, c}, {a, d}, {a, e}, {a, f}, {a, g}, {b, c}, {b, d}, {b, e}, {b, f}, {b, g}, {c, d}, {c, e}, {c, f}, {c, g}, {d, e}, {d, f}, {d, g}, {e, f}, {e, g}, {f, g}}

Line 4 of the Apriori algorithm now tells us to set a counter for each of these to 0. Line 5 now prepares us to take each record in the DB in turn, and find which of those in C2 are contained in it. The first record r1 is: {a, b, d, g}. Those of C2 it contains are: {a, b}, {a, d}, {a, g}, {b, d}, {b, g}, {d, g}. Hence Cr1 = {{a, b}, {a, d}, {a, g}, {b, d}, {b, g}, {d, g}}

and the rest of line 6 tells us to increment the counters of these itemsets.

The second record r2 is:{c, d, e}; Cr2 = {{c, d}, {c, e}, {d, e}},

and we increment the counters for these three itemsets. … After all 20 records, we look at the counters, and in this case we will find that the itemsets whose counters are >= minsup (4) are: {a, c}, {a, d}, {c, d}, {c, e}, {c, f}.

So, L2 = {{a, c}, {a, d}, {c, d}, {c, e}, {c, f}}
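The counting pass can be sketched in Python as follows. Only the two records quoted in the text (r1 and r2) are used here, because the full 20-record table is not reproduced; the helper name `subset` mirrors the notation Cr = subset(Ck, r), but its body is an assumption about the simplest way to implement it.

```python
from collections import Counter
from itertools import combinations

# All candidate 2-itemsets over the items a..g.
C2 = {frozenset(pair) for pair in combinations("abcdefg", 2)}


def subset(candidates, record):
    """Return the candidates that are fully contained in the record."""
    return {cand for cand in candidates if cand <= record}


# Only the first two records of the database, as quoted in the text.
records = [frozenset("abdg"), frozenset("cde")]

counts = Counter()
for r in records:
    counts.update(subset(C2, r))   # increment the counter of each contained candidate

Cr1 = subset(C2, records[0])
print(sorted("".join(sorted(c)) for c in Cr1))
```

Running the same loop over all 20 records and keeping the candidates whose counter reaches minsup would yield L2.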

So we have L2 = {{a, c}, {a, d}, {c, d}, {c, e}, {c, f}}

We now set k = 3 and run apriori-gen on L2. The join step finds the following pairs that meet the required pattern: {a, c}:{a, d} {c, d}:{c, e} {c, d}:{c, f} {c, e}:{c, f}

This leads to the candidate 3-itemsets: {a, c, d}, {c, d, e}, {c, d, f}, {c, e, f}

We prune {c, d, e} since {d, e} is not in L2

We prune {c, d, f} since {d, f} is not in L2

We prune {c, e, f} since {e, f} is not in L2

We are left with C3 = {{a, c, d}}

We now run lines 5 to 7, to count how many records contain {a, c, d}. The count is 4, so L3 = {{a, c, d}}
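The join and prune steps of apriori-gen for this example can be sketched in Python. The function name `apriori_gen` and its tuple-based representation are illustrative assumptions; the logic follows the description above (join two (k-1)-itemsets that agree on their first k-2 items, then drop any candidate with a (k-1)-subset that is not large).

```python
from itertools import combinations


def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from the large (k-1)-itemsets."""
    prev = sorted(tuple(sorted(s)) for s in L_prev)
    prev_set = set(prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Join step: p and q must share their first k-2 items.
            if p[:k - 2] == q[:k - 2]:
                cand = p + (q[-1],)
                # Prune step: every (k-1)-subset must itself be large.
                if all(sub in prev_set for sub in combinations(cand, k - 1)):
                    candidates.append(cand)
    return candidates


L2 = [{"a", "c"}, {"a", "d"}, {"c", "d"}, {"c", "e"}, {"c", "f"}]
C3 = apriori_gen(L2, 3)
print(C3)

# With a single large 3-itemset there is nothing to join at k = 4.
C4 = apriori_gen([{"a", "c", "d"}], 4)
print(C4)
```

Checking the subsets inside the join loop merges the two steps into one pass; a version that first builds all joined candidates and then prunes them gives the same result.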

So we have L3 = {{a, c, d}}

We now set k = 4, but when we run apriori-gen on L3 we get the empty set, and hence eventually we find L4 = {}

This means we now finish, and return the set of all of the non-empty Ls – these are all of the large itemsets:

Result = {{a}, {b}, {c}, {d}, {e}, {f}, {g}, {a, c}, {a, d}, {c, d}, {c, e}, {c, f}, {a, c, d}}

Each large itemset is intrinsically interesting, and may be of business value. Simple rule-generation algorithms can now use the large itemsets as a starting point.

MapReduce Implementation

Ck = apriori-gen(Lk-1)
  Join step
  Prune step

Cr = subset(Ck, r)

2-itemset generation
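One possible way to map the counting pass onto MapReduce is sketched below as a single-process Python simulation. This is an assumption about the design, not the implementation used in the course: the map function emits (candidate, 1) for every candidate contained in a record, and the reduce function sums the ones and keeps candidates that reach minsup. The function names, the toy database `db`, and the minsup value are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_record(record, candidates):
    """Map: emit (candidate, 1) for each candidate contained in the record."""
    for cand in candidates:
        if set(cand) <= record:
            yield cand, 1


def reduce_counts(candidate, ones, minsup):
    """Reduce: sum the counts; keep the candidate only if it is large."""
    total = sum(ones)
    return (candidate, total) if total >= minsup else None


def count_pass(db, candidates, minsup):
    """Simulate one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in db:
        for key, value in map_record(record, candidates):
            groups[key].append(value)
    large = {}
    for key, values in groups.items():
        result = reduce_counts(key, values, minsup)
        if result is not None:
            large[result[0]] = result[1]
    return large


# Hypothetical toy database (NOT the 20-record table from the lecture).
db = [{"a", "b", "d", "g"}, {"c", "d", "e"}, {"a", "d"}, {"c", "e"}, {"a", "c", "d"}]
C2 = list(combinations("abcdefg", 2))
L2 = count_pass(db, C2, minsup=2)
print(L2)
```

In this sketch the candidate set Ck would have to be shipped to every mapper, and apriori-gen would run on the driver between jobs, one job per value of k.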

Announcements

Special Talk

Brewster Kahle

His stated goal is "Universal Access to all Knowledge".

Director of the Internet Archive

A key supporter of the Open Content Alliance

In 2005, Kahle was elected a fellow of the American Academy of Arts and Sciences.

Location: Science Building 1, Room 1131

Time: the 17th, 9 a.m.

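Course schedule

This Thursday's class is moved to next Tuesday.

This Friday: attend the talk by Brewster Kahle, founder of the Internet Archive.

Next Tuesday: each group presents its course project proposal.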

Q&A