22
Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang Internet Services Research Center (ISRC) Microsoft Research Redmond

Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce

Embed Size (px)

DESCRIPTION

Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce. Chao Liu, Hung- chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang Internet Services Research Center (ISRC) Microsoft Research Redmond. Internet Services Research Center (ISRC). - PowerPoint PPT Presentation

Citation preview

Page 1: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

Distributed Nonnegative Matrix Factorization for Web-Scale DyadicData Analysis on MapReduce

Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang

Internet Services Research Center (ISRC)Microsoft Research Redmond

Page 2: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

Internet Services Research Center (ISRC)• Advancing the state of the art in online services• Dedicated to accelerating innovations in search and ad

technologies• Representing a new model for moving technologies quickly

from research projects to improved products and servicesThursday, 04/29/2010 Friday, 04/30/201010:30~12:00pm: Data Analysis & Efficiency• Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce

11:00~12:30pm: Query Analysis• Exploring Web Scale Language Models for Search Query Processing • Building Taxonomy of Web Search Intents for Name Entity Queries• Optimal Rare Query Suggestion With Implicit User Feedback

1:30~3:00pm: Information Extraction• Automatic Extraction of Clickable Structured Web Contents for Name Entity Queries

1:30~3:00pm: Infrastructure 2• Large-scale Bot Detection for Search Engines

Page 3: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

Dyadic Data on the Web

• Web abounds with dyadic data– Web search: term by document,

query by clickedURL, web linkage, …– Advertising: query by ad, bid term by ad,

user by ad, …– Social media: tag by image, user by community,

friendship graph, …• Common characteristics– Good source for discovering latent relationships– High dimensionality, sparse, nonnegative, dynamic

Page 4: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

Nonnegative Matrix Factorization (NMF)

• Effective tool to uncover latent relationships in nonnegative matrices with many applications [Berry et al., 2007, Sra & Dhillon, 2006]

– Interpretable dimensionality reduction [Lee & Seung, 1999]

– Document clustering [Shahnaz et al., 2006, Xu et al, 2006]

• Challenge: Can we scale NMF to million-by-million matrices

Am

n

WH

m

nkk

0,0,0 HWA

Page 5: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

NMF Algorithm [Lee & Seung, 2000]

Am

n

WH

m

nkk

0,0,0 HWA

Page 6: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

Parallel NMF [Robila & Maciak, 2006]

• Parallelism on multi-core machines– Partition along the long dimension for parallelism– Assuming all matrices can be held in shared memory

Page 7: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

Distributed NMF

• Data Partition: A, W and H across machines

A…

),,( , jiAji

W. . . . .

),( iwi

H

. . . . .

),( jhj

Page 8: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

Copmuting DNMF: The Big Picture

WAW

AWH

Y

XHH

T

T

*.*.

Page 9: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

… … …

),,(: , jiAjiA

),,,( , iji wAji

Map-I

Reduce-I

),( , iji wAj

Map-II

),( , iji wAj

Reduce-II

),( jxj

Map-IIIMap-IV

),0( WW T

Map-V

),0( iTi ww

),,,( jjj yxhj

…),( jyj

),(: iwiW ),(: jhjH

… ),( newjhj

Reduce-III

Reduce-V

Page 10: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

AWX T

… …

),,(: , jiAjiA

),,,( , iji wAji

Map-I

Reduce-I

),( , iji wAj

Map-II

),( , iji wAj

Reduce-II

),( jxj

),(: iwiW

Page 11: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

WHWY T

… …

Map-IIIMap-IV

),0( WW T

),0( iTi ww …),( jyj

),(: iwiW ),(: jhjH

Reduce-III WHWY T

m

ii

Ti

T wwWWC1

W

. . . . .

),( iwi

. . .

. . .

Page 12: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

YXHH *.

),( jxj

Map-V

),,,( jjj yxhj

…),( jyj

),(: jhjH

… ),( newjhj

Reduce-V

Page 13: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

… … …

),,(: , jiAjiA

),,,( , iji wAji

Map-I

Reduce-I

),( , iji wAj

Map-II

),( , iji wAj

Reduce-II

),( jxj

Map-IIIMap-IV

),0( WW T

Map-V

),0( iTi ww

),,,( jjj yxhj

…),( jyj

),(: iwiW ),(: jhjH

… ),( newjhj

Reduce-III

Reduce-V

Page 14: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

Experimental Evaluation

• Synthesized data on a sandbox cluster– No interference from other jobs– Performance with various parameters

• Real-world data on a commercial cluster– Real-world scalability

Page 15: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

Synthesized Data on Sandbox Cluster

• A Hadoop cluster with 8 workers in total– Worker: Pentium-IV CPU, 1 or 2 cores, 1~2 GB

memory, 150G hard drive– V: Number of workers in cluster

• Matrix simulator– Generate m-by-n matrix with sparsity δ– k: factorization dimensionality– Defaults:

371617 2,2,2,2 knm

Page 16: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

Computation Breakdown

• dominates the computation• is lightweight• The sparser, the faster

AWX TWHWY T

Page 17: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

Performance w.r.t. Parameters

• Linear to m×n×δ• Linear to factorization dimension k• Sub-ideal speedup w.r.t. cluster

size V

Page 18: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

Scalability on Real-world Data

• User-by-Website matrix– Browsed URLs of opt-in users, represented by UID– URLs trimmed to site level

• http://www.cnn.com/breakingnews --> www.cnn.com

• Experiments on Microsoft SCOPE– SCOPE: Structure Computations Optimized for Parallel

Execution [Chaiken et al., VLDB’08]

Page 19: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

Executions w.r.t. Iterations

• Observations– Longer total elapse time– Shorter time per iteration

• Reason– Overlapped computation

across iterations

0 1 2 3 4 5 6 70

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

f(x) = 0.721508828250402 x + 0.422552166934188R² = 0.993424501613606

Iterations

Nor

mal

ized

Elap

se T

ime

Page 20: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

Scalability w.r.t. Matrix Size

3 hours per iteration, 20 iterations take around 20*3*0.72 ≈ 43 hours

Less than 7 hours on a 43.9M-by-769M matrix with 4.38 billion nonzero values

Page 21: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

Conclusion

• NMF is an effective tool to uncover latent structures in dyadic data that is abundant on the Web

• NMF is admissible to MapReduce • Distributed NMF solves the scalability

challenge• Applications down the road

Page 22: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on  MapReduce

Q&A

Thank You!