Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce

Distributed Nonnegative Matrix Factorization for Web-Scale DyadicData Analysis on MapReduce

Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang

Internet Services Research Center (ISRC)Microsoft Research Redmond

http://research.microsoft.com/en-us/groups/isrc/

http://research.microsoft.com/en-us/labs/redmond/default.aspx

Internet Services Research Center (ISRC)• Advancing the state of the art in online services• Dedicated to accelerating innovations in search and ad

technologies• Representing a new model for moving technologies quickly

from research projects to improved products and servicesThursday, 04/29/2010 Friday, 04/30/201010:30~12:00pm: Data Analysis & Efficiency• Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce

11:00~12:30pm: Query Analysis• Exploring Web Scale Language Models for Search Query Processing • Building Taxonomy of Web Search Intents for Name Entity Queries• Optimal Rare Query Suggestion With Implicit User Feedback

1:30~3:00pm: Information Extraction• Automatic Extraction of Clickable Structured Web Contents for Name Entity Queries

1:30~3:00pm: Infrastructure 2• Large-scale Bot Detection for Search Engines

Dyadic Data on the Web

• Web abounds with dyadic data– Web search: term by document,

query by clickedURL, web linkage, …– Advertising: query by ad, bid term by ad,

user by ad, …– Social media: tag by image, user by community,

friendship graph, …• Common characteristics– Good source for discovering latent relationships– High dimensionality, sparse, nonnegative, dynamic

Nonnegative Matrix Factorization (NMF)

• Effective tool to uncover latent relationships in nonnegative matrices with many applications [Berry et al., 2007, Sra & Dhillon, 2006]

– Interpretable dimensionality reduction [Lee & Seung, 1999]

– Document clustering [Shahnaz et al., 2006, Xu et al, 2006]

• Challenge: Can we scale NMF to million-by-million matrices

Am

n

WH

m

nkk

0,0,0 HWA

NMF Algorithm [Lee & Seung, 2000]

Am

n

WH

m

nkk

0,0,0 HWA

Parallel NMF [Robila & Maciak, 2006]

• Parallelism on multi-core machines– Partition along the long dimension for parallelism– Assuming all matrices can be held in shared memory

Distributed NMF

• Data Partition: A, W and H across machines

A…

…

),,( , jiAji

W. . . . .

),( iwi

H

. . . . .

),( jhj

Copmuting DNMF: The Big Picture

WAW

AWH

Y

XHH

T

T

*.*.

… … …

…

),,(: , jiAjiA

),,,( , iji wAji

Map-I

Reduce-I

),( , iji wAj

Map-II

),( , iji wAj

Reduce-II

),( jxj

Map-IIIMap-IV

),0( WW T

Map-V

),0( iTi ww

…

),,,( jjj yxhj

…),( jyj

),(: iwiW ),(: jhjH

…

…

…

… ),( newjhj

Reduce-III

Reduce-V

AWX T

… …

…

),,(: , jiAjiA

),,,( , iji wAji

Map-I

Reduce-I

),( , iji wAj

Map-II

),( , iji wAj

Reduce-II

),( jxj

),(: iwiW

…

…

…

WHWY T

… …

Map-IIIMap-IV

),0( WW T

),0( iTi ww …),( jyj

),(: iwiW ),(: jhjH

Reduce-III WHWY T

m

ii

Ti

T wwWWC1

W

. . . . .

),( iwi

. . .

. . .

YXHH *.

…

),( jxj

Map-V

…

),,,( jjj yxhj

…),( jyj

),(: jhjH

…

… ),( newjhj

Reduce-V

… … …

…

),,(: , jiAjiA

),,,( , iji wAji

Map-I

Reduce-I

),( , iji wAj

Map-II

),( , iji wAj

Reduce-II

),( jxj

Map-IIIMap-IV

),0( WW T

Map-V

),0( iTi ww

…

),,,( jjj yxhj

…),( jyj

),(: iwiW ),(: jhjH

…

…

…

… ),( newjhj

Reduce-III

Reduce-V

Experimental Evaluation

• Synthesized data on a sandbox cluster– No interference from other jobs– Performance with various parameters

• Real-world data on a commercial cluster– Real-world scalability

Synthesized Data on Sandbox Cluster

• A Hadoop cluster with 8 workers in total– Worker: Pentium-IV CPU, 1 or 2 cores, 1~2 GB

memory, 150G hard drive– V: Number of workers in cluster

• Matrix simulator– Generate m-by-n matrix with sparsity δ– k: factorization dimensionality– Defaults:

371617 2,2,2,2 knm

Computation Breakdown

• dominates the computation• is lightweight• The sparser, the faster

AWX TWHWY T

Performance w.r.t. Parameters

• Linear to m×n×δ• Linear to factorization dimension k• Sub-ideal speedup w.r.t. cluster

size V

Scalability on Real-world Data

• User-by-Website matrix– Browsed URLs of opt-in users, represented by UID– URLs trimmed to site level

• http://www.cnn.com/breakingnews --> www.cnn.com

• Experiments on Microsoft SCOPE– SCOPE: Structure Computations Optimized for Parallel

Execution [Chaiken et al., VLDB’08]

http://www.cnn.com/breakingnews

http://www.cnn.com/

Executions w.r.t. Iterations

• Observations– Longer total elapse time– Shorter time per iteration

• Reason– Overlapped computation

across iterations

0 1 2 3 4 5 6 70

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

f(x) = 0.721508828250402 x + 0.422552166934188R² = 0.993424501613606

Iterations

Nor

mal

ized

Elap

se T

ime

Scalability w.r.t. Matrix Size

3 hours per iteration, 20 iterations take around 20*3*0.72 ≈ 43 hours

Less than 7 hours on a 43.9M-by-769M matrix with 4.38 billion nonzero values

Conclusion

• NMF is an effective tool to uncover latent structures in dyadic data that is abundant on the Web

• NMF is admissible to MapReduce • Distributed NMF solves the scalability

challenge• Applications down the road

Q&A

Thank You!

Documents

Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce