E6893 Big Data Analytics
Lecture 3: Big Data Storage and Analytics

Ching-Yung Lin, Ph.D.
Adjunct Professor, Dept. of Electrical Engineering and Computer Science
September 21, 2017

© 2017 CY Lin, Columbia University



Demos — Hive, HBase, and Spark by TA


Analytics 1 — Recommendation


Key Components of Mahout


Mahout reference book


Setting Up Mahout

Step 1: Java JVM and IDEs (e.g., Eclipse)
Step 2: Maven
Step 3: Mahout


Motivation – Information Overload

⇒ Provide users with the information they need at the right time

Figure labels: books, journals, research papers; news; books; movies; electronics


Recommender Systems: General Idea

[Figure: user profile (info about the user) → comparison with a set of items → recommendation]

Set of items: e.g. news articles, books, music, movies, products, …

Sample applications
• E-commerce: product recommender (Amazon)
• Enterprise Activity Intelligence: domain expert finder, …
• Digital Libraries: pages/documents/books recommendation
• Personal Assistance: museum guidance, …

Algorithms
• Content-based Filtering
• Collaborative Filtering
• Hybrid Filtering (combination of the above two)


Content-based Filtering (CBF)
Recommending items based on their content and properties

[Figure: a user's preferences are matched against an item stream (Sports, World, Health, …, Tech) to produce recommendations]


Collaborative Filtering (CF)
Leveraging opinions of like-minded users

[Figure: given people with similar tastes, recommend items]


Exploiting Dynamic Patterns for Recommender Systems

Dynamic nature from both items and users
• Items expire over time, with two types: short-term and long-term
• Users’ intentions: updating breaking information, or looking for long-term items
• Users’ interests evolve over time

User adoption patterns
• Some users are earlier adopters


Recommender — Inputs

Solid lines: positively related
Dashed lines: negatively related

Input data: User, Item, Rating


User-based Recommendation — Scenario I



User-based Recommendation — Scenario II


User-based Recommendation — Scenario III


User-based Recommendation Algorithms


Example Recommender Code via Mahout


Process and output of the example

Recommendation for Person 1: Item 104 > Item 106
Item 107 is not favored
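The flow behind such an output can be sketched in a few lines of plain Python. This is not the Mahout code from the lecture: the ratings, the user and item names, and the `recommend` helper are hypothetical, and keeping only positively correlated neighbors is an assumption made to keep the sketch short. It scores every other user against the target, keeps the nearest neighbors, and estimates unseen items as a similarity-weighted average of the neighbors' ratings.

```python
from math import sqrt

# Hypothetical toy ratings (user -> {item: rating}); not the data used in the slides.
RATINGS = {
    "u1": {"a": 5.0, "b": 3.0, "c": 2.5},
    "u2": {"a": 2.0, "b": 2.5, "c": 5.0, "d": 2.0},
    "u3": {"a": 5.0, "b": 3.0, "c": 3.0, "d": 4.5, "e": 4.0},
    "u4": {"a": 4.5, "b": 3.5, "c": 2.0, "d": 4.0, "e": 3.5},
}


def pearson(x, y):
    """Pearson correlation over co-rated items; None when it is undefined."""
    common = sorted(set(x) & set(y))
    if len(common) < 2:
        return None
    mx = sum(x[i] for i in common) / len(common)
    my = sum(y[i] for i in common) / len(common)
    num = sum((x[i] - mx) * (y[i] - my) for i in common)
    den = sqrt(sum((x[i] - mx) ** 2 for i in common)
               * sum((y[i] - my) ** 2 for i in common))
    return num / den if den else None


def recommend(user, ratings, n_neighbors=2, how_many=2):
    # 1. Score every other user against the target; keep positive correlations only.
    sims = []
    for other in ratings:
        if other == user:
            continue
        s = pearson(ratings[user], ratings[other])
        if s is not None and s > 0:
            sims.append((s, other))
    # 2. Keep the n most similar users as the neighborhood.
    neighbors = sorted(sims, reverse=True)[:n_neighbors]
    # 3. Estimate each unseen item as a similarity-weighted average of neighbor ratings.
    totals, weights = {}, {}
    for s, other in neighbors:
        for item, rating in ratings[other].items():
            if item not in ratings[user]:
                totals[item] = totals.get(item, 0.0) + s * rating
                weights[item] = weights.get(item, 0.0) + s
    estimates = {item: totals[item] / weights[item] for item in totals}
    return sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)[:how_many]
```

With this toy data, `recommend("u1", RATINGS)` ranks item `d` ahead of item `e`, because both of u1's nearest neighbors rate `d` higher.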


User Similarity Measurements

• Pearson Correlation Similarity
• Euclidean Distance Similarity
• Cosine Measure Similarity
• Spearman Correlation Similarity
• Tanimoto Coefficient Similarity (Jaccard coefficient)
• Log-Likelihood Similarity


Pearson Correlation Similarity

Data:

missing data


On Pearson Similarity

Three problems with the Pearson similarity:

1. It does not take into account the number of items on which two users’ preferences overlap. (E.g., an overlap of only 2 items can yield a correlation of 1, while an overlap on more items may yield a lower value.)

2. If two users overlap on only one item, no correlation can be computed.

3. The correlation is undefined if all the preference values in either series are identical.

Passing Weighting.WEIGHTED as the 2nd parameter of the constructor pushes the resulting correlation towards 1.0 or -1.0, depending on how many points are used.
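The three problems are easy to reproduce with a small, self-contained Pearson implementation. The sketch below is illustrative Python rather than Mahout's implementation, and the preference dictionaries are hypothetical values chosen only to trigger each case. It returns `None` whenever the correlation is undefined.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation over the items both users rated, or None if undefined."""
    common = [item for item in x if item in y]
    n = len(common)
    if n < 2:
        return None  # problem 2: a single shared item gives no correlation
    mean_x = sum(x[i] for i in common) / n
    mean_y = sum(y[i] for i in common) / n
    cov = sum((x[i] - mean_x) * (y[i] - mean_y) for i in common)
    var_x = sum((x[i] - mean_x) ** 2 for i in common)
    var_y = sum((y[i] - mean_y) ** 2 for i in common)
    if var_x == 0 or var_y == 0:
        return None  # problem 3: a constant series has zero variance
    return cov / sqrt(var_x * var_y)


# Hypothetical preference data, chosen only to trigger each case.
two_items = pearson({"a": 1, "b": 2}, {"a": 3, "b": 9})
five_items = pearson({"a": 1, "b": 2, "c": 3, "d": 4, "e": 5},
                     {"a": 1, "b": 2, "c": 3, "d": 5, "e": 4})
one_item = pearson({"a": 4, "b": 1}, {"a": 5, "c": 2})
constant = pearson({"a": 3, "b": 3, "c": 3}, {"a": 1, "b": 2, "c": 5})
```

Here `two_items` is a perfect correlation even though the two users share only two items, while `five_items` is lower although the users agree on far more evidence (problem 1). `one_item` and `constant` are both `None` (problems 2 and 3).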


Euclidean Distance Similarity

Similarity = 1 / ( 1 + d )
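A minimal Python sketch of this formula, assuming d is the Euclidean distance computed over the items both users have rated. That restriction, and returning `None` when there is no overlap, are assumptions of the sketch, not something the slide states.

```python
from math import sqrt


def euclidean_similarity(x, y):
    """Similarity = 1 / (1 + d), with d taken over co-rated items only."""
    common = [item for item in x if item in y]
    if not common:
        return None  # no shared items, so no distance can be computed
    d = sqrt(sum((x[i] - y[i]) ** 2 for i in common))
    return 1.0 / (1.0 + d)
```

Identical preferences give d = 0 and a similarity of 1; the similarity approaches 0 as the distance grows.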


Cosine Similarity

Cosine similarity and Pearson similarity get the same results if data are normalized (mean == 0).
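This equivalence can be checked numerically. The sketch below uses plain Python with two hypothetical rating vectors: Pearson is computed from the textbook sums formula, and cosine is computed on the raw and on the mean-centered vectors.

```python
from math import sqrt


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


def center(x):
    """Subtract the mean so that the vector has mean 0."""
    mean = sum(x) / len(x)
    return [a - mean for a in x]


def pearson(x, y):
    """Pearson correlation from raw sums (no explicit centering)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))


# Hypothetical ratings of the same three items by two users.
x = [5.0, 3.0, 2.5]
y = [2.0, 2.5, 5.0]
```

On the raw vectors the cosine is positive while the Pearson correlation is negative; after centering, `cosine(center(x), center(y))` agrees with `pearson(x, y)` up to rounding.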


Spearman Correlation Similarity

Pearson value on the relative ranks

[Figure: example for ties]
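A small Python sketch of the idea: replace each preference value by its rank, then compute Pearson on the ranks. Giving tied values the average of the ranks they span is one common convention and is an assumption here; the slide's own tie example is not recoverable from the text.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    return cov / sqrt(var) if var else None


def spearman(x, y):
    """Pearson correlation computed on the relative ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only the ordering matters, any monotonically increasing relationship yields a correlation of 1, however nonlinear the raw values are.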


Caching User Similarity

Spearman Correlation Similarity is time consuming. Use caching: remember each user-user similarity that was previously computed.


Tanimoto (Jaccard) Coefficient Similarity

Discard preference values

Tanimoto similarity is the same as Jaccard similarity. But, Tanimoto distance is not the same as Jaccard distance.
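As a sketch, the coefficient is the size of the intersection of the two users' item sets divided by the size of their union. The Python below is illustrative (returning 0.0 for two empty sets is a choice of the sketch); passing dictionaries of ratings shows that the preference values are discarded, because only the keys are used.

```python
def tanimoto(items_a, items_b):
    """|A intersect B| / |A union B| over the items each user has expressed a preference for."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical preferences; only the keys (items) matter, the values are ignored.
user_a = {"x": 5.0, "y": 1.0, "z": 3.0}
user_b = {"y": 4.5, "z": 2.0, "w": 1.0}
```

Here `tanimoto(user_a, user_b)` is 2 / 4 = 0.5: two shared items out of four distinct items.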


Log-Likelihood Similarity

Assess how unlikely it is that the overlap between the two users is just due to chance.
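One common way to compute such a score is the log-likelihood ratio over a 2x2 contingency table of item counts. The Python below is a sketch under that assumption: the helper names are hypothetical, and mapping the ratio to a similarity with 1 - 1 / (1 + LLR) is an illustrative choice rather than something the slide specifies.

```python
from math import log


def x_log_x(x):
    return x * log(x) if x > 0 else 0.0


def entropy(*counts):
    """Unnormalized entropy: N log N - sum(k log k)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """k11: items both users have, k12/k21: items only one has, k22: items neither has."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # clamp tiny negative rounding errors


def llr_similarity(items_a, items_b, n_items):
    a, b = set(items_a), set(items_b)
    llr = log_likelihood_ratio(len(a & b), len(a - b), len(b - a), n_items - len(a | b))
    return 1.0 - 1.0 / (1.0 + llr)
```

When the overlap is exactly what chance would produce (for example one item in each cell of the table), the ratio is 0 and so is the similarity. Two users who picked the same 2 items out of 10 score above 0.9.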


Performance measurements

Using GroupLens data (http://grouplens.org): the 10 million rating MovieLens dataset.

Spearman: 0.8
Tanimoto: 0.82
Log-Likelihood: 0.73
Euclidean: 0.75
Pearson (weighted): 0.77
Pearson: 0.89


Performance measurements

95% of the data for training; 5% for testing

10 nearest neighbors: 0.98
100 nearest neighbors: 0.89
500 nearest neighbors: 0.75
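The slides do not name the metric behind these scores. Assuming they are average absolute differences between estimated and actual ratings (lower is better), the evaluation procedure can be sketched in Python as follows; the function names and the fixed random seed are hypothetical.

```python
import random


def split(ratings, train_fraction=0.95, seed=0):
    """Shuffle the ratings and split them into training and test portions."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def average_absolute_difference(pairs):
    """pairs: iterable of (estimated, actual) ratings from the test portion."""
    pairs = list(pairs)
    return sum(abs(est - act) for est, act in pairs) / len(pairs)
```

A recommender is trained on the first portion, asked to estimate each held-out rating in the second, and scored with `average_absolute_difference`. For example, estimates of 4.0 and 2.0 against actual ratings of 3.5 and 3.0 give 0.75.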


Selecting the number of neighbors

Based on number of neighbors

Based on a fixed threshold, e.g., 0.7 or 0.5
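The two selection strategies can be sketched in Python as follows, assuming the similarities between the target user and every other user have already been computed. The similarity values and user names are hypothetical.

```python
def nearest_n(similarities, n):
    """Keep the n most similar users, however similar they actually are."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [user for user, _ in ranked[:n]]


def above_threshold(similarities, threshold):
    """Keep every user whose similarity reaches the threshold, however many there are."""
    return [user for user, sim in similarities.items() if sim >= threshold]


# Hypothetical similarities between a target user and four other users.
SIMS = {"u2": 0.91, "u3": 0.72, "u4": 0.55, "u5": 0.10}
```

A fixed-size neighborhood always returns n users, even weakly similar ones. A threshold adapts the neighborhood size to the data: with these values, 0.7 keeps two users and 0.5 keeps three.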


Item-based recommendation


Item-based recommendation algorithm


Code and Performance of Item-Based Recommendation

performance


Slope-One Recommender


Slope-One Algorithm

Slope-One got a result of near 0.65 on the GroupLens data

Difference values from the example
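A compact Python sketch of the weighted Slope-One scheme: first compute the average rating difference for every pair of items over the users who rated both, then predict a rating by adding those differences to the user's known ratings. The three users and their ratings are hypothetical and are not the example shown on the slide.

```python
from collections import defaultdict

# Hypothetical ratings; not the example from the slide.
RATINGS = {
    "john": {"A": 5.0, "B": 3.0, "C": 2.0},
    "mark": {"A": 3.0, "B": 4.0},
    "lucy": {"B": 2.0, "C": 5.0},
}


def slope_one_diffs(ratings):
    """Average difference rating(i) - rating(j) over the users who rated both items."""
    sums, counts = defaultdict(float), defaultdict(int)
    for prefs in ratings.values():
        for i, ri in prefs.items():
            for j, rj in prefs.items():
                if i != j:
                    sums[(i, j)] += ri - rj
                    counts[(i, j)] += 1
    diffs = {pair: sums[pair] / counts[pair] for pair in sums}
    return diffs, dict(counts)


def predict(user_prefs, target, diffs, counts):
    """Weighted Slope-One estimate of the user's rating for the target item."""
    num = den = 0.0
    for item, rating in user_prefs.items():
        pair = (target, item)
        if pair in diffs:
            num += (rating + diffs[pair]) * counts[pair]
            den += counts[pair]
    return num / den if den else None
```

For lucy and item A: A is rated 0.5 higher than B on average (two users) and 3 higher than C (one user), so the estimate is (2 × 2.5 + 1 × 8) / 3 ≈ 4.33.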


Other recommenders — SVD recommender

Parameters of the SVD recommender:
• number of features
• lambda: factor for regularization
• number of training steps

SVD method got 0.69 on the GroupLens data
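To show what the three parameters control, here is a minimal regularized matrix factorization in Python trained with stochastic gradient descent. SGD is a stand-in chosen for brevity and is not necessarily the factorizer used in the lecture's Mahout code; the learning rate, the initialization, and the toy ratings are all assumptions of the sketch.

```python
import random
from math import sqrt

# Hypothetical (user, item, rating) triples.
RATINGS = [
    ("u1", "a", 5.0), ("u1", "b", 3.0), ("u2", "a", 4.0), ("u2", "c", 1.0),
    ("u3", "b", 2.0), ("u3", "c", 5.0), ("u4", "a", 4.5), ("u4", "b", 3.5),
]


def factorize(ratings, n_features=2, lam=0.05, n_steps=200, rate=0.01, seed=0):
    """Learn user and item vectors whose dot product approximates the ratings."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    p = {u: [0.1 + 0.1 * rng.random() for _ in range(n_features)] for u in users}
    q = {i: [0.1 + 0.1 * rng.random() for _ in range(n_features)] for i in items}
    for _ in range(n_steps):  # number of training steps
        for u, i, r in ratings:
            err = r - sum(a * b for a, b in zip(p[u], q[i]))
            for f in range(n_features):  # number of features
                pu, qi = p[u][f], q[i][f]
                # lam shrinks the vectors towards zero to limit overfitting.
                p[u][f] += rate * (err * qi - lam * pu)
                q[i][f] += rate * (err * pu - lam * qi)
    return p, q


def rmse(ratings, p, q):
    se = sum((r - sum(a * b for a, b in zip(p[u], q[i]))) ** 2 for u, i, r in ratings)
    return sqrt(se / len(ratings))
```

More features give the model more capacity, a larger lambda regularizes it more strongly, and more training steps let it fit the training ratings more closely.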


Linear Interpolation Item-based recommender

The linear interpolation method got 0.76 on the GroupLens data


Cluster-based Recommendation


Other Recommenders not in Mahout — Groups (SDM 06)

[Figure: CBDR (Community-based Dynamic Recommendation)]

– Data: a third-party knowledge repository with 30K users and 20K documents. The study covers the 697 most active users, who each have at least 20 downloads in a year.

– Results beyond collaborative filtering: (1) collaborative + content filtering (53% improvement); (2) CBDR: collaborative + content filtering + graph community analytics (259% accuracy improvement over collaborative filtering)


In E-Commerce: Rogers’ Diffusion of Innovations Theory

[Figure: bar chart of adopter categories (%): Innovators 3, Early adopters 14, Early majority 34, Late majority 34, Laggards 16]

Users’ adoption patterns: some users tend to adopt innovations earlier than others
⇒ Information virtually flows from early adopters to late adopters


Recommendation Driven by Information Flow

Influence is not symmetric!

[Figure: people with similar tastes, labeled innovator, early adopter, early majority, late majority, and laggard, linked by "adopt" arrows]


Scheme Overview
• Leverage the uneven influence

EABIF: Early Adoption based Information Flow Network

[Figure: pipeline: Dataset → EABIF → Information Propagation Model → Application: Personalized Recommendation]


EABIF (1) -- Markov Chain
• A Markov chain has two components
  ■ A network structure where each node is called a state
  ■ A transition probability of traversing a link given that the chain is in a state (P: transition probability matrix)
• A stationary distribution is a probability distribution q such that q = qP (how likely you will stay at one node)
• Application -- PageRank [Brin and Page ‘98]
  ■ Assumption: a link from page A to page B is a recommendation of page B by the author of A
  ■ Quality of a page is related to
    • the number of pages linking to it
    • the quality of pages linking to it
  ■ Assume the web is a Markov chain
    • PageRank = stationary distribution of this Markov chain

[Figure: states i and j linked with transition probability P_ij; example graph with nodes A, B, C]
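The stationary distribution q = qP can be approximated by repeatedly multiplying a starting distribution by P (power iteration). The Python sketch below uses a hypothetical three-state transition matrix and a fixed iteration count; the iteration converges when the chain is irreducible and aperiodic.

```python
def stationary_distribution(p, iterations=200):
    """Power iteration: start from the uniform distribution and apply q <- qP."""
    n = len(p)
    q = [1.0 / n] * n
    for _ in range(iterations):
        q = [sum(q[i] * p[i][j] for i in range(n)) for j in range(n)]
    return q


# Hypothetical transition matrix over three states; every row sums to 1.
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
```

For this matrix the result is q = (0.25, 0.5, 0.25): the middle state, which the other two states link to most strongly, receives the largest share. That is the same intuition PageRank applies to web pages.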


EABIF (2)
• Early Adoption Matrix (EAB)
  ■ Count how many items one user accesses earlier than the other – pairwise comparison
• Markov Chain Model
  ■ Normalize EAB to a transition matrix F of a Markov chain
  ■ Adjust F to guarantee the existence of a stationary distribution of the Markov chain
    • Make the matrix stochastic: $\sum_j F_{ij} = 1$
    • Make the Markov chain irreducible: $F_{ij} \neq 0,\ 1 \le i, j \le N$
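These steps can be sketched in Python. The access times are hypothetical, the orientation of the matrix (row user earlier than column user) is an assumption, and the adjustment shown, a PageRank-style mix with a uniform jump controlled by a hypothetical `alpha`, is only one way to obtain a stochastic and irreducible matrix. The slides do not say which adjustment the authors used.

```python
# Hypothetical access log: user -> {item: time of first access}.
ACCESS = {
    "u1": {"d1": 1, "d2": 5},
    "u2": {"d1": 2, "d2": 3},
    "u3": {"d1": 4},
}
USERS = ["u1", "u2", "u3"]


def early_adoption_matrix(access, users):
    """eab[a][b] = number of shared items that user a accessed earlier than user b."""
    n = len(users)
    eab = [[0] * n for _ in range(n)]
    for a, u in enumerate(users):
        for b, v in enumerate(users):
            if a != b:
                shared = set(access[u]) & set(access[v])
                eab[a][b] = sum(1 for item in shared if access[u][item] < access[v][item])
    return eab


def to_transition_matrix(eab, alpha=0.85):
    n = len(eab)
    f = []
    for row in eab:
        total = sum(row)
        # Rows without any count become uniform, so that every row sums to 1.
        base = [c / total for c in row] if total else [1.0 / n] * n
        # Mixing in a uniform jump makes every entry non-zero (irreducible chain).
        f.append([alpha * p + (1.0 - alpha) / n for p in base])
    return f
```

With this log, u1 and u2 each lead the other on one document and both lead u3, while u3 leads nobody, so its row falls back to the uniform distribution.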


Information Propagation Models

[Figure: three diagrams of information propagating between users v and u via intermediate users r1, r2, r3, …]

N: number of nodes

1. Summation of various propagation steps

$F_{if}^{(m)} = F + F^{2} + \cdots + F^{m}$

2. Direct summation

$F_{if}^{(d)} = \dfrac{F \cdot \left( I - F^{N-1} \right)}{I - F}$

3. Exponential weighted summation

$F_{if}^{(\exp)} = \exp(-\beta) \left[ \beta F + \dfrac{1}{2!} \beta^{2} F^{2} + \cdots + \dfrac{1}{(N-1)!} \beta^{N-1} F^{N-1} \right]$
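As a rough illustration, each of the three models can be written as a weighted sum of powers of the transition matrix F. The Python sketch below assumes the following readings, which are reconstructions and not confirmed by the slide: model 1 sums the first m powers, model 2 sums the first N-1 powers term by term (instead of using a closed form), and model 3 weights the k-th power by exp(-β) β^k / k!. All function names are hypothetical.

```python
from math import exp, factorial


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def weighted_power_sum(f, weights):
    """Return the sum over k = 1..len(weights) of weights[k-1] * F^k."""
    n = len(f)
    total = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in f]  # F^1
    for idx, w in enumerate(weights):
        if idx:
            power = matmul(power, f)  # advance to the next power of F
        for i in range(n):
            for j in range(n):
                total[i][j] += w * power[i][j]
    return total


def propagate_steps(f, m):
    """Model 1 (assumed): F + F^2 + ... + F^m."""
    return weighted_power_sum(f, [1.0] * m)


def propagate_direct(f):
    """Model 2 (assumed): F + F^2 + ... + F^(N-1), summed term by term."""
    return weighted_power_sum(f, [1.0] * (len(f) - 1))


def propagate_exponential(f, beta):
    """Model 3 (assumed): the k-th power weighted by exp(-beta) * beta^k / k!."""
    n = len(f)
    weights = [exp(-beta) * beta ** k / factorial(k) for k in range(1, n)]
    return weighted_power_sum(f, weights)
```

Under these readings, a larger β shifts weight from direct neighbors (low powers of F) to longer propagation paths (high powers).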


Topic-Sensitive Early Adoption Based Information Flow (TEABIF) Network

• Adoption is typically category specific
  ■ An early adopter of fashion may not be an early adopter of technology

[Figure: LDA → Topic 1, Topic 2, Topic 3 → Application: Personalized Recommendation]


Experimental Setup

ER dataset
• Apr. 2004 to Apr. 2005 as training data
• May 2005 to Jul. 2005 as test data
• 1033 users, 586 documents

Process
• Construct the information flow network based on the training data
• Trigger the earliest users to start the process
• Predict who will also be interested in these documents

Evaluation
• Precision & Recall
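For a single document, precision and recall compare the users the model retrieves with the users who actually accessed the document in the test period. A minimal Python sketch, with hypothetical user IDs:

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / |retrieved|; recall = hits / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 predicted users, 5 users who actually accessed the document.
predicted = ["u1", "u2", "u3", "u4"]
actual = ["u2", "u4", "u7", "u8", "u9"]
```

Two of the four predicted users are correct, so precision is 0.5; two of the five actual users were found, so recall is 0.4.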


Experimental Results -- Recommendation Quality

[Figure: four plots comparing CF, EABIF, and TEABIF against the number of retrieved users (1 to 4), with propagation steps = 1: precision and recall for 1 triggered user, and precision and recall for 2 triggered users]

• Compared to Collaborative Filtering (CF):
  ■ Precision: EABIF is 91.0% better, TEABIF is 108.5% better
  ■ Recall: EABIF is 87.1% better, TEABIF is 112.8% better


Experimental Results -- Propagation Performance

[Figure: four bar charts of improvement ratio (x100%) over the CF baseline for EABIF and TEABIF: precision and recall improvement for 1 triggered user, and precision and recall improvement for 2 triggered users. Horizontal axis: propagation model (m = 1 to 5, sum, and exp with β = 1, 1.5, 2, 3, 4, 5, 8, 16)]

• TEABIF with exponential weighted summation ($\beta = 3$) achieves the best performance: it improves precision by 108.5% and recall by 116.9% compared to CF


Summary

Leverage dynamic patterns from both the documents’ and the users’ perspectives, including
• Analyzing documents’ access types
• Predicting documents’ expiration dates
• Detecting users’ intentions
• Identifying users’ interests evolving over time – CTC model
• Ranking the documents adaptively – Time-sensitive Adaboost

Utilize users’ adoption patterns
• Information virtually flows from early adopters to late adopters

Experimental results demonstrate
• Dynamic factors are important for recommendations


Selected Publications

X. Song, C.-Y. Lin, B. L. Tseng, and M.-T. Sun, “Modeling Evolutionary Behaviors for Community-based Dynamic Recommendation,” SIAM Conf. on Data Mining, Bethesda, MD, Apr. 2006.

X. Song, C.-Y. Lin, B. L. Tseng, and M.-T. Sun, “Personalized Recommendation Driven by Information Flow,” ACM SIGIR, Aug. 2006.

X. Song, C.-Y. Lin, B. L. Tseng, and M.-T. Sun, “Modeling and Predicting Personal Information Dissemination Behavior,” ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2005.

X. Song, B. L. Tseng, C.-Y. Lin, and M.-T. Sun, “ExpertiseNet: Relational and Evolutionary Expert Modeling,” International Conference on User Modeling, Edinburgh, UK, Jul. 24-30, 2005.


Distributed Item-based Recommender


Distributed recommender — get co-occurrence matrix

Data:


Multiply the co-occurrence matrix with user preference

The highest is 103 (101, 104, 105, 107 have been purchased by user 3)
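The same computation can be sketched on a single machine in Python: count how often each pair of items appears together in a user's preferences, multiply that co-occurrence matrix by the user's preference vector, and drop the items the user already has. The preference data below is hypothetical and is not the data set from the slides.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical preferences (user -> {item: rating}).
PREFS = {
    "A": {"x": 5.0, "y": 3.0},
    "B": {"x": 4.0, "y": 2.0, "z": 5.0},
    "C": {"y": 4.0, "z": 3.0, "w": 2.0},
}


def cooccurrence(prefs):
    """counts[(i, j)] = number of users who have a preference for both i and j."""
    counts = defaultdict(int)
    for items in prefs.values():
        for i, j in permutations(items, 2):
            counts[(i, j)] += 1
    return counts


def recommend(user, prefs):
    co = cooccurrence(prefs)
    all_items = {item for items in prefs.values() for item in items}
    scores = {}
    # One row of (co-occurrence matrix x preference vector) per candidate item.
    for candidate in all_items - set(prefs[user]):
        scores[candidate] = sum(co.get((candidate, item), 0) * rating
                                for item, rating in prefs[user].items())
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For user A, item `z` scores 1 × 5.0 + 2 × 3.0 = 11.0 and item `w` scores 1 × 3.0 = 3.0, so `z` is recommended first. Items `x` and `y` are excluded because A already has them.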


Translating to MapReduce: generating user vectors


Translating to MapReduce: calculating co-occurrence


Translating to MapReduce: matrix multiplication


Translating to MapReduce: partial products


Translating to MapReduce: partial product II


Running Recommender on MapReduce and HDFS


Questions?


E6893 Big Data Analytics:

Appendix — references


Part I: Pig installation and Demo

Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.


1. Installation of Pig: https://pig.apache.org/docs/r0.7.0/setup.html

Download Pig and run the following command:

export PATH=/<my-path-to-pig>/pig-n.n.n/bin:$PATH


2. Running pig in local mode:

pig -x local

movies = LOAD '/Users/Rich/Documents/Courses/Fall2014/BigData/Pig/movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);

DUMP movies;


Filter:

List the movies that were released between 1950 and 1960 

movies_1950_1960 = FILTER movies BY (float)year>1949 and (float)year<1961;

store movies_1950_1960 into '/Users/Rich/Desktop/Demo/movies_1950_1960'; 


Foreach Generate:

List movie names and their durations (in minutes)

-- assumes the duration column is stored in seconds
movies_name_duration = foreach movies generate name, (float)duration/60;

store movies_name_duration into '/Users/Rich/Desktop/Demo/movies_name_duration'; 


Order:

List all movies in descending order of year 

movies_year_sort = order movies by year desc;

store movies_year_sort into '/Users/Rich/Desktop/Demo/movies_year_sort'; 


3. Running pig with HDFS:

pig

Run HDFS first:

ssh localhost

cd /usr/local/Cellar/hadoop/2.5.0

sbin/start-dfs.sh


3. Upload file to HDFS:

Make a directory:

bin/hdfs dfs -mkdir /PigSource

Upload a file:

bin/hdfs dfs -put /Users/Rich/Documents/Courses/Spring2014/CloudComputing/HW/MINI5/movies_data.csv /PigSource


4. Run pig in grunt with HDFS:

pig

movies = LOAD '/PigSource/movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);

DUMP movies;
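To picture what the LOAD statement produces, here is a rough Python sketch that splits comma-delimited lines and attaches the field names from the `as (...)` clause. It uses a plain split, so it does not handle commas inside quoted values, and the values stay strings because the schema declares no types. The sample lines and the helper name `load` are hypothetical.

```python
# Field names taken from the schema in the LOAD statement.
FIELDS = ("id", "name", "year", "rating", "duration")

def load(lines, delimiter=","):
    """Split each line on the delimiter and pair the values with the field names."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        rows.append(dict(zip(FIELDS, line.split(delimiter))))
    return rows

# Hypothetical file contents, not taken from movies_data.csv.
sample = ["1,Movie A,1993,3.9,4568\n", "2,Movie B,1955,4.1,6000\n"]
movies = load(sample)
for row in movies:
    print(row)
```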


5. Run .pig file with HDFS:

pig

pig /Users/Rich/Documents/Courses/Fall2014/BigData/Pig/run1.pig


Part II: HBase installation and Demo

HBase is an open source, non-relational, distributed database modeled after Google's Bigtable and written in Java.

It runs on top of HDFS and can serve as both input and output for MapReduce jobs run in Hadoop.

It can be accessed through the Java API and from Pig.


HBase Configuration:

Download and configure HBase by following the instructions online.

To run HBase, from your HBase directory:

$ bin/start-hbase.sh

Enter the shell:

$ bin/hbase shell

Also visit localhost:60010 to check the HBase web UI (HBase 1.0 and later use port 16010 by default).


HBase Commands:

Create:

hbase> create 'table', 'cf'

hbase> put 'table', 'r1', 'cf:c1', 'value1'

hbase> scan 'table'

We can also count, get, delete, and drop, much like in other database systems.
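The shell commands above map onto a simple data model: a table holds rows, and each row holds values addressed as family:qualifier. The toy Python class below sketches that model in memory. It is not the HBase API; the class name `ToyTable` and its methods are hypothetical, and real HBase also keeps timestamped versions of each cell, which this sketch omits.

```python
from collections import defaultdict

class ToyTable:
    """In-memory stand-in for one table with a fixed set of column families."""

    def __init__(self, *families):
        self.families = set(families)
        self.rows = defaultdict(dict)  # row key -> {"family:qualifier": value}

    def put(self, row, column, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row][column] = value

    def get(self, row):
        return dict(self.rows.get(row, {}))

    def scan(self):
        # Rows come back ordered by row key.
        for key in sorted(self.rows):
            yield key, dict(self.rows[key])

    def count(self):
        return len(self.rows)

# Mirrors: create 'table', 'cf' / put 'table', 'r1', 'cf:c1', 'value1' / scan 'table'
table = ToyTable("cf")
table.put("r1", "cf:c1", "value1")
rows = list(table.scan())
print(rows)
```

Writing to a column family that was not declared at creation time raises an error here, which reflects the fact that families are fixed when the table is created while qualifiers such as `c1` can be added freely.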


Part III: Hive installation and Demo


Simple demo for Hive

Install steps:

Mac: first install Homebrew, then run:

$ brew install hive

Linux:

cd ~/Downloads

wget http://mirror.cc.columbia.edu/pub/software/apache/hive/hive-0.13.1/apache-hive-0.13.1-bin.tar.gz

cp apache-hive-0.13.1-bin.tar.gz ~

cd ~

tar zxvf apache-hive-0.13.1-bin.tar.gz

mv apache-hive-0.13.1-bin hive

cd hive

cd bin

./hive


Simple demo for Hive

Step 1. Start Hadoop and create a directory under /user/yourname

hadoop fs -mkdir test

Step 2. Put your dataset into HDFS

hadoop fs -put (your dataset path) test

hadoop fs -ls test


Simple demo for Hive

Step 3. Create the table (SQL) in the Hive shell

hive

Step 4. Load the data into the table

load data inpath '/user/yourname/test' into table client;

You can see that your data has been loaded into the table.


Simple demo for Hive

1. select * from client where name='Henry';

2. select avg(age) from client where gender='male';
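To try the two queries without a Hive installation, the sketch below runs the same SQL against an in-memory SQLite database from Python. SQLite is only a stand-in for Hive here. The `client` schema (name, age, gender) is inferred from the queries and the sample rows are invented, so both are assumptions rather than the actual demo dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: only the columns the two queries refer to.
conn.execute("CREATE TABLE client (name TEXT, age INTEGER, gender TEXT)")
conn.executemany(
    "INSERT INTO client VALUES (?, ?, ?)",
    [("Henry", 30, "male"), ("Alice", 25, "female"), ("Bob", 40, "male")],
)

# Query 1: all columns for one client.
henry = conn.execute("SELECT * FROM client WHERE name='Henry'").fetchall()

# Query 2: average age over the male clients.
avg_male_age = conn.execute(
    "SELECT avg(age) FROM client WHERE gender='male'"
).fetchone()[0]

print(henry)
print(avg_male_age)
conn.close()
```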


Simple demo for Hive

Use JDBC (Java) to access data in the Hive server.

Other choices: Python, PHP, etc.


Simple demo for Hive

Open the Hive server (starting the Hive Thrift server):

hive --service hiveserver -p 10000

URL: jdbc:hive://localhost:10000/default

Result in the console
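The connection URL packs the driver name, host, port, and database into one string. The Python sketch below takes it apart to show which piece is which. The helper name `parse_jdbc_url` is hypothetical, and this is only a string-level illustration, not a Hive client.

```python
from urllib.parse import urlsplit

def parse_jdbc_url(url):
    """Split a JDBC URL of the form jdbc:<driver>://<host>:<port>/<database>."""
    prefix = "jdbc:"
    if not url.startswith(prefix):
        raise ValueError("not a JDBC URL: " + url)
    parts = urlsplit(url[len(prefix):])
    return {
        "driver": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }

info = parse_jdbc_url("jdbc:hive://localhost:10000/default")
print(info)
```

The port in the URL has to match the one passed with -p when the server is started. Newer Hive releases replace this server with HiveServer2, whose URLs use the jdbc:hive2:// prefix instead.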


Simple demo for Hive

Example. Jetty Server with Jersey, RESTful web service

Create a simple API (HTTP request, XML/JSON format output)
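The slide's service is built with Jetty and Jersey in Java. As a language-neutral illustration of the output side only, the Python sketch below renders the same query result as either JSON or XML. The function name `render`, the element names, and the sample row are all hypothetical.

```python
import json
import xml.etree.ElementTree as ET

def render(rows, fmt="json"):
    """Serialize a list of row dictionaries as a JSON or XML string."""
    if fmt == "json":
        return json.dumps({"rows": rows})
    if fmt == "xml":
        root = ET.Element("rows")
        for row in rows:
            item = ET.SubElement(root, "row")
            for key, value in row.items():
                ET.SubElement(item, key).text = str(value)
        return ET.tostring(root, encoding="unicode")
    raise ValueError("unsupported format: " + fmt)

# Hypothetical query result, as a web service handler might receive it.
result = [{"name": "Henry", "age": 30, "gender": "male"}]
print(render(result, "json"))
print(render(result, "xml"))
```

In a real service, the format would typically be chosen from the HTTP request, for example from its Accept header.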


Simple demo for Hive

Example. Jetty Server with Jersey, RESTful web service