Upload
jessenia-kianoush
View
38
Download
2
Embed Size (px)
DESCRIPTION
Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce. Chao Liu, Hung- chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang Internet Services Research Center (ISRC) Microsoft Research Redmond. Internet Services Research Center (ISRC). - PowerPoint PPT Presentation
Citation preview
Distributed Nonnegative Matrix Factorization for Web-Scale DyadicData Analysis on MapReduce
Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang
Internet Services Research Center (ISRC)Microsoft Research Redmond
Internet Services Research Center (ISRC)• Advancing the state of the art in online services• Dedicated to accelerating innovations in search and ad
technologies• Representing a new model for moving technologies quickly
from research projects to improved products and servicesThursday, 04/29/2010 Friday, 04/30/201010:30~12:00pm: Data Analysis & Efficiency• Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce
11:00~12:30pm: Query Analysis• Exploring Web Scale Language Models for Search Query Processing • Building Taxonomy of Web Search Intents for Name Entity Queries• Optimal Rare Query Suggestion With Implicit User Feedback
1:30~3:00pm: Information Extraction• Automatic Extraction of Clickable Structured Web Contents for Name Entity Queries
1:30~3:00pm: Infrastructure 2• Large-scale Bot Detection for Search Engines
Dyadic Data on the Web
• Web abounds with dyadic data– Web search: term by document,
query by clickedURL, web linkage, …– Advertising: query by ad, bid term by ad,
user by ad, …– Social media: tag by image, user by community,
friendship graph, …• Common characteristics– Good source for discovering latent relationships– High dimensionality, sparse, nonnegative, dynamic
Nonnegative Matrix Factorization (NMF)
• Effective tool to uncover latent relationships in nonnegative matrices with many applications [Berry et al., 2007, Sra & Dhillon, 2006]
– Interpretable dimensionality reduction [Lee & Seung, 1999]
– Document clustering [Shahnaz et al., 2006, Xu et al, 2006]
• Challenge: Can we scale NMF to million-by-million matrices
Am
n
WH
m
nkk
0,0,0 HWA
NMF Algorithm [Lee & Seung, 2000]
Am
n
WH
m
nkk
0,0,0 HWA
Parallel NMF [Robila & Maciak, 2006]
• Parallelism on multi-core machines– Partition along the long dimension for parallelism– Assuming all matrices can be held in shared memory
Distributed NMF
• Data Partition: A, W and H across machines
A…
…
),,( , jiAji
W. . . . .
),( iwi
H
. . . . .
),( jhj
Copmuting DNMF: The Big Picture
WAW
AWH
Y
XHH
T
T
*.*.
… … …
…
),,(: , jiAjiA
),,,( , iji wAji
Map-I
Reduce-I
),( , iji wAj
Map-II
),( , iji wAj
Reduce-II
),( jxj
Map-IIIMap-IV
),0( WW T
Map-V
),0( iTi ww
…
),,,( jjj yxhj
…),( jyj
),(: iwiW ),(: jhjH
…
…
…
… ),( newjhj
Reduce-III
Reduce-V
AWX T
… …
…
),,(: , jiAjiA
),,,( , iji wAji
Map-I
Reduce-I
),( , iji wAj
Map-II
),( , iji wAj
Reduce-II
),( jxj
),(: iwiW
…
…
…
WHWY T
… …
Map-IIIMap-IV
),0( WW T
),0( iTi ww …),( jyj
),(: iwiW ),(: jhjH
Reduce-III WHWY T
m
ii
Ti
T wwWWC1
W
. . . . .
),( iwi
. . .
. . .
YXHH *.
…
),( jxj
Map-V
…
),,,( jjj yxhj
…),( jyj
),(: jhjH
…
… ),( newjhj
Reduce-V
… … …
…
),,(: , jiAjiA
),,,( , iji wAji
Map-I
Reduce-I
),( , iji wAj
Map-II
),( , iji wAj
Reduce-II
),( jxj
Map-IIIMap-IV
),0( WW T
Map-V
),0( iTi ww
…
),,,( jjj yxhj
…),( jyj
),(: iwiW ),(: jhjH
…
…
…
… ),( newjhj
Reduce-III
Reduce-V
Experimental Evaluation
• Synthesized data on a sandbox cluster– No interference from other jobs– Performance with various parameters
• Real-world data on a commercial cluster– Real-world scalability
Synthesized Data on Sandbox Cluster
• A Hadoop cluster with 8 workers in total– Worker: Pentium-IV CPU, 1 or 2 cores, 1~2 GB
memory, 150G hard drive– V: Number of workers in cluster
• Matrix simulator– Generate m-by-n matrix with sparsity δ– k: factorization dimensionality– Defaults:
371617 2,2,2,2 knm
Computation Breakdown
• dominates the computation• is lightweight• The sparser, the faster
AWX TWHWY T
Performance w.r.t. Parameters
• Linear to m×n×δ• Linear to factorization dimension k• Sub-ideal speedup w.r.t. cluster
size V
Scalability on Real-world Data
• User-by-Website matrix– Browsed URLs of opt-in users, represented by UID– URLs trimmed to site level
• http://www.cnn.com/breakingnews --> www.cnn.com
• Experiments on Microsoft SCOPE– SCOPE: Structure Computations Optimized for Parallel
Execution [Chaiken et al., VLDB’08]
Executions w.r.t. Iterations
• Observations– Longer total elapse time– Shorter time per iteration
• Reason– Overlapped computation
across iterations
0 1 2 3 4 5 6 70
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
f(x) = 0.721508828250402 x + 0.422552166934188R² = 0.993424501613606
Iterations
Nor
mal
ized
Elap
se T
ime
Scalability w.r.t. Matrix Size
3 hours per iteration, 20 iterations take around 20*3*0.72 ≈ 43 hours
Less than 7 hours on a 43.9M-by-769M matrix with 4.38 billion nonzero values
Conclusion
• NMF is an effective tool to uncover latent structures in dyadic data that is abundant on the Web
• NMF is admissible to MapReduce • Distributed NMF solves the scalability
challenge• Applications down the road
Q&A
Thank You!