Scalable Network Analysis
Inderjit S. Dhillon, University of Texas at Austin
Dec 20, 2013, COMAD, Ahmedabad, India
Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis
Outline
• Unstructured Data - Scale & Diversity
• Evolving Networks
• Machine Learning Problems arising in Networks
  • Recommender Systems
  • Link Prediction
  • Sign Prediction
• Formulation as Missing Value Estimation
• Scalable Algorithms
  • NOMAD: Distributed matrix completion algorithm
• Results on Applications
• Conclusions
Structured Data
• Data organized into fields: Relational databases, spreadsheets, XML
• Highly optimized for storage & retrieval (e.g. using SQL)
Structured Data
• Focus is on data format, efficient storage & search
• Little or no uncertainty in semantics: e.g., businesses know the fields of their data
Unstructured Data
Modern data is unstructured and diverse:
• Networks, Graphs
• Text
• Images, Videos
Unstructured Data
Much greater growth rate
Unstructured Data
• Dynamic aspects of unstructured data:
• Constantly evolving
• Uncertainties abound: What should I ask of the data?
• Seek insights
• Heterogeneity renders traditional database models inadequate
Unstructured Data
• Buzzwords - “Big Data” & “Data Science”
• Machine Learning: Predictive models for data
• Engineering perspective: Scale matters
Network Graphs
• Social networks (Friendship)
• Bipartite networks (Membership, Ratings, etc.)
• Gene networks (Functional interaction)
[Figure: example networks; the gene network shows nodes labeled APQ6, MIP, AQP2, AQP1, AQP5, GK]
Graph Evolution
• Social networks are highly dynamic
• Constantly grow, change quickly over time
• Users arrive/leave, relationships form/dissolve
• Understanding graph evolution is important
Graphs meet Machine Learning
• Network analysis: Understanding structure & evolution of networks
• Formulate predictive problems on the adjacency matrix of the graph
• Confluence of graph theory & machine learning
Facebook growth
Link Prediction
Recommender Systems
Netflix problem: 100M ratings, 0.5M users, 20K movies
A Toy Problem In Comparison!
[Figure: ratings matrix with Users as rows and Movies as columns]
Link Prediction in Social Networks
• Problem: Infer missing relationships from a given snapshot of the network
[Figure: network at time T and network at time T + 1, with candidate links marked "?"]
Predicting gene-disease links
Signed Social Networks
• Sign Prediction Problem: Given a snapshot of the signed social network, predict the signs of missing edges
[Figure: signed network with edges labeled Like and Dislike; an unobserved edge is marked "Like / Dislike?"]
Formulation as Missing Value Estimation
Low-rank Matrix Completion
[Figure: partially observed ratings matrix with Users as rows and Movies as columns; an unobserved entry is marked "?"]
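$$\min_{W \in \mathbb{R}^{m \times k},\, H \in \mathbb{R}^{n \times k}} \sum_{(i,j) \in \Omega} (A_{ij} - W_i H_j^T)^2 + \lambda \left( \|W\|_F^2 + \|H\|_F^2 \right)$$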
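As a minimal sketch of the objective above, the following NumPy function evaluates the regularized squared error over the observed entries only. The names `mf_objective`, `rows`, `cols`, `vals`, and `lam`, as well as the tiny dataset, are illustrative assumptions and do not come from the slides.

```python
import numpy as np

def mf_objective(W, H, rows, cols, vals, lam):
    """Regularized squared error over the observed entries (i, j) in Omega.

    W: (m, k) row factors, H: (n, k) column factors.
    rows, cols, vals: parallel arrays listing the observed entries A[i, j].
    """
    # Predicted value for each observed entry: inner product of W_i and H_j
    pred = np.sum(W[rows] * H[cols], axis=1)
    residual = vals - pred
    # lam scales the squared Frobenius norms of both factor matrices
    reg = lam * (np.sum(W ** 2) + np.sum(H ** 2))
    return float(np.sum(residual ** 2) + reg)

# Made-up example: 3 users, 2 movies, rank k = 1, three observed ratings
W = np.array([[1.0], [2.0], [0.5]])
H = np.array([[2.0], [1.0]])
rows = np.array([0, 1, 2])
cols = np.array([0, 1, 0])
vals = np.array([2.0, 3.0, 1.0])
obj = mf_objective(W, H, rows, cols, vals, lam=0.1)
```

Only the entries listed in `rows` and `cols` contribute to the error term, which is what keeps the cost proportional to the number of observed ratings rather than to m × n.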
[Figure: the entry marked "?" is filled in with the estimate 3.44]
Link Prediction
• Can be posed as a matrix completion problem
• Issue: Only positive relationships are observed
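$$A(t) = \begin{bmatrix} \cdot & 1 & 1 & \cdot & \cdot \\ 1 & \cdot & 1 & 1 & 1 \\ 1 & 1 & \cdot & \cdot & 1 \\ \cdot & 1 & \cdot & \cdot & \cdot \\ \cdot & 1 & 1 & \cdot & \cdot \end{bmatrix} \approx \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \end{bmatrix}$$

Test Link    Score
(4,5)        1
(1,4)        1
(3,4)        1
(1,5)        1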
Link Prediction
• Formulate Biased Matrix Completion Problem:
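$$\min_{W \in \mathbb{R}^{n \times k},\, H \in \mathbb{R}^{n \times k}} \sum_{(i,j) \in \Omega} (A_{ij} - W_i H_j^T)^2 + \alpha \sum_{(i,j) \notin \Omega} (A_{ij} - W_i H_j^T)^2 + \lambda \left( \|W\|_F^2 + \|H\|_F^2 \right)$$

$$A(t) = \begin{bmatrix} \cdot & 1 & 1 & \cdot & \cdot \\ 1 & \cdot & 1 & 1 & 1 \\ 1 & 1 & \cdot & \cdot & 1 \\ \cdot & 1 & \cdot & \cdot & \cdot \\ \cdot & 1 & 1 & \cdot & \cdot \end{bmatrix} \approx \begin{bmatrix} 0.85 \\ 1.09 \\ 0.97 \\ 0.69 \\ 0.85 \end{bmatrix} \begin{bmatrix} 0.85 & 1.09 & 0.97 & 0.69 & 0.85 \end{bmatrix}$$

Test Link    Score
(4,5)        0.59
(1,4)        0.59
(3,4)        0.67
(1,5)        0.72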
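The test-link scores on the two Link Prediction slides follow directly from the rank-1 factors shown there. The sketch below recomputes them from the printed factor values. The helper `link_scores` and the variable names are illustrative assumptions; the factor values and test links are the ones on the slides.

```python
import numpy as np

# Rank-1 factors as printed on the slides (W = H = u, so the score is u[i] * u[j])
u_plain = np.ones(5)
u_biased = np.array([0.85, 1.09, 0.97, 0.69, 0.85])

# Test links use the slides' 1-based node numbering
test_links = [(4, 5), (1, 4), (3, 4), (1, 5)]

def link_scores(u, links):
    """Score each (i, j) pair with the rank-1 model u[i] * u[j]."""
    return {(i, j): float(u[i - 1] * u[j - 1]) for i, j in links}

plain_scores = link_scores(u_plain, test_links)
biased_scores = link_scores(u_biased, test_links)

# Rank candidate links from most to least likely under the biased model
ranking = sorted(test_links, key=lambda ij: biased_scores[ij], reverse=True)
```

With the all-ones factor every candidate link ties, so the unbiased formulation cannot rank them. The biased factors separate the candidates, which is the point of adding the α-weighted term over unobserved entries.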
Signed Social Networks
• Social Balance [Harary, 1953]:
  • In real-world signed networks, triangles tend to be balanced
[Figure: balanced triangles ("Friend of a friend is a friend", "Enemy of an enemy is a friend") next to triangles that are not balanced]
Signed Social Networks
Theorem: All triangles in a network are balanced if and only if there exist two antagonistic groups.
Signed Social Networks
• Relaxation: Weak balance
• Allow triangles with all negative edges
[Figure: example triangles labeled Weakly Balanced and Not balanced]
Signed Social Networks
Theorem: All triangles in a network are weakly balanced if and only if there exist k antagonistic groups.
Sign Prediction
• Sign inference can be posed as low-rank matrix completion
Theorem: A k-weakly balanced signed network has rank at most k.
Theorem: If there are no “small” groups, the underlying network can be exactly recovered, under certain conditions.
K. Chiang et al. Prediction and Clustering in Signed Networks: A Local to Global Perspective. To appear in JMLR.
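The weak-balance and low-rank statements can be checked numerically on a small example. The sketch below builds a complete signed network from k mutually antagonistic groups, verifies that every triangle is weakly balanced, and computes the rank of the sign matrix. The function names, the 9-node grouping, and the choice to put +1 on the diagonal are assumptions made for illustration.

```python
import itertools
import numpy as np

def group_network(groups):
    """Complete signed network: +1 inside a group, -1 across groups.

    The diagonal is +1 under this construction (a node agrees with itself).
    """
    g = np.asarray(groups)
    return np.where(g[:, None] == g[None, :], 1, -1)

def all_triangles_weakly_balanced(A):
    """True if no triangle has exactly one negative edge.

    Zero or two negative edges is balanced; weak balance also allows three.
    """
    n = A.shape[0]
    for a, b, c in itertools.combinations(range(n), 3):
        negatives = sum(int(s < 0) for s in (A[a, b], A[b, c], A[a, c]))
        if negatives == 1:
            return False
    return True

# Hypothetical example: 9 nodes split into k = 3 mutually antagonistic groups
groups = [0, 0, 0, 1, 1, 2, 2, 2, 2]
A = group_network(groups)
rank = int(np.linalg.matrix_rank(A))

# Flipping one within-group edge creates a triangle with a single negative edge
B = A.copy()
B[0, 1] = B[1, 0] = -1
```

Here the sign matrix of a 9-node network has rank no larger than the number of groups, which is why sign inference fits the low-rank matrix completion framework.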
Scalable Algorithms
Stochastic Gradient
• Sample random index (i,j) and update corresponding factors:
$$w_i \leftarrow w_i - \eta \left( (w_i^T h_j - A_{ij}) h_j + \lambda w_i \right)$$
$$h_j \leftarrow h_j - \eta \left( (w_i^T h_j - A_{ij}) w_i + \lambda h_j \right)$$
• Time per update: O(k)
• Effective for very large-scale problems
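A minimal sketch of the stochastic gradient update on this slide, written with NumPy. The error is taken as prediction minus observation so that each step descends the objective, with the constant factor 2 folded into the step size. The synthetic data, the step size `eta`, the regularizer `lam`, and the number of sweeps are all assumed values chosen for illustration.

```python
import numpy as np

def sgd_step(W, H, i, j, a_ij, eta, lam):
    """One stochastic gradient step on a single observed entry A[i, j]; O(k) work."""
    w_i = W[i].copy()  # keep the old w_i so both updates use the same point
    h_j = H[j].copy()
    err = w_i @ h_j - a_ij  # prediction error for this entry
    W[i] = w_i - eta * (err * h_j + lam * w_i)
    H[j] = h_j - eta * (err * w_i + lam * h_j)

rng = np.random.default_rng(0)
m, n, k = 20, 15, 3
# Synthetic rank-k matrix with roughly 60% of entries observed (made-up data)
A = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))
observed = [(i, j) for i in range(m) for j in range(n) if rng.random() < 0.6]

W = 0.1 * rng.normal(size=(m, k))
H = 0.1 * rng.normal(size=(n, k))

def rmse(W, H):
    """Root mean squared error on the observed entries."""
    return float(np.sqrt(np.mean([(A[i, j] - W[i] @ H[j]) ** 2 for i, j in observed])))

rmse_before = rmse(W, H)
for _ in range(50):  # 50 sweeps over the observed entries, in random order
    for idx in rng.permutation(len(observed)):
        i, j = observed[idx]
        sgd_step(W, H, i, j, A[i, j], eta=0.02, lam=0.01)
rmse_after = rmse(W, H)
```

Each step touches only one row of W and one row of H, so two updates on entries that share neither a row nor a column can run independently. The distributed algorithms that follow rely on this property.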
Distributed Stochastic Gradient Descent (DSGD) [Gemulla et al. KDD 2011]
• Decoupled updates
• Easy to parallelize
• But communication & computation are interleaved
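To illustrate why the updates decouple, the sketch below generates one possible block schedule for p workers: the rating matrix is cut into p × p blocks, and each stratum picks p blocks that share no row block and no column block. The cyclic-shift schedule and the name `dsgd_strata` are assumptions for illustration, not necessarily the schedule used by Gemulla et al.

```python
def dsgd_strata(p):
    """Return p strata; stratum s pairs row block q with column block (q + s) % p."""
    return [[(q, (q + s) % p) for q in range(p)] for s in range(p)]

# Three workers: each inner list is processed in parallel, one block per worker
strata = dsgd_strata(3)
for s, blocks in enumerate(strata):
    print(f"stratum {s}: {blocks}")
```

Workers must finish a stratum and exchange factor blocks before the next stratum starts, which is where communication and computation take turns rather than overlap.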
DSGD
Curse of the last reducer: every synchronized round waits for the slowest worker
NOMAD
• Goal: Keep CPU & network simultaneously busy.
• Asynchronous distributed solution.
Non-locking stOchastic Multi-machine algorithm for Asynchronous & Decentralized matrix factorization
NOMAD
[Figure: animation of NOMAD across workers. Each worker keeps its native variables (w1, ..., w4) and a queue of nomadic variables (h1, ..., h5); after a worker finishes with a nomadic variable, it pushes the variable to another worker's queue]
NOMAD Algorithm
Initialize: Randomly assign columns to worker queues
Parallel Foreach q in {1,2,...,p}
    If queue[q] not empty then
        (j, h_j) ← queue[q].pop()
        for (i, j) in Ω_j^(q) do
            Do SGD updates
        end for
        Sample q’ uniformly from {1,2,...,p}
        queue[q’].push((j, h_j))
    end if
Parallel End

• Each queue is a concurrent object
• Distributed setting: the push is a write over the network
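The loop above can be simulated in a single process to see how column ownership moves between workers. The sketch below is a sequential stand-in under stated assumptions, not a parallel or distributed implementation: workers take turns instead of running concurrently, rows are assigned to workers by `i % p`, Ω_j^(q) is taken to mean the observed entries of column j whose rows belong to worker q, and the queues carry only the index j because H lives in shared memory here. All names and hyperparameters are hypothetical.

```python
import random
from collections import deque
import numpy as np

def nomad_simulation(entries, m, n, k, p, eta=0.02, lam=0.01, rounds=2000, seed=0):
    """Sequential simulation of the NOMAD loop with p simulated workers.

    entries: list of observed (i, j, a_ij) triples.
    Each column factor h_j sits in exactly one worker queue at any time.
    """
    rng = random.Random(seed)
    np_rng = np.random.default_rng(seed)
    W = 0.1 * np_rng.normal(size=(m, k))
    H = 0.1 * np_rng.normal(size=(n, k))

    # omega[q][j]: observed (i, a_ij) in column j whose row i belongs to worker q
    omega = [[[] for _ in range(n)] for _ in range(p)]
    for i, j, a in entries:
        omega[i % p][j].append((i, a))

    # Initialize: randomly assign columns to worker queues
    queues = [deque() for _ in range(p)]
    for j in range(n):
        queues[rng.randrange(p)].append(j)

    for _ in range(rounds):
        for q in range(p):  # stands in for "Parallel Foreach"
            if not queues[q]:
                continue
            j = queues[q].popleft()
            for i, a in omega[q][j]:  # SGD updates on worker q's part of column j
                w_i, h_j = W[i].copy(), H[j].copy()
                err = w_i @ h_j - a
                W[i] = w_i - eta * (err * h_j + lam * w_i)
                H[j] = h_j - eta * (err * w_i + lam * h_j)
            queues[rng.randrange(p)].append(j)  # hand h_j to a random worker
    return W, H, queues

# Made-up fully observed rank-2 matrix, 3 simulated workers
data_rng = np.random.default_rng(1)
m, n, k, p = 12, 8, 2, 3
A = data_rng.normal(size=(m, k)) @ data_rng.normal(size=(k, n))
entries = [(i, j, A[i, j]) for i in range(m) for j in range(n)]
W, H, queues = nomad_simulation(entries, m, n, k, p)
fit_rmse = float(np.sqrt(np.mean((A - W @ H.T) ** 2)))
```

Because a column index is in one queue at a time and each row belongs to one worker, no two workers ever update the same w_i or h_j, so the updates need no locks.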
Algorithm Complexity
• Average space required per worker: O(mk/p + nk/p + |Ω|/p)
• Average time for one sweep: O(|Ω|k/p)
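As a rough worked example, the two expressions can be evaluated for the Netflix sizes quoted later in the deck. The sketch ignores constant factors, and the rank k = 100 and worker count p = 32 are assumed values, not settings reported on the slides.

```python
def nomad_cost_per_worker(m, n, nnz, k, p):
    """Plug sizes into the O(.) expressions, ignoring constant factors."""
    # Stored values: share of row factors + share of column factors + share of ratings
    space = m * k / p + n * k / p + nnz / p
    # O(k) work for each observed entry, split evenly across p workers
    time_per_sweep = nnz * k / p
    return space, time_per_sweep

# m, n, nnz are the Netflix sizes from the results slide (ratings count is approximate)
space, sweep = nomad_cost_per_worker(
    m=2_649_429, n=17_770, nnz=100_000_000, k=100, p=32
)
```

Both quantities shrink in proportion to 1/p, so doubling the number of workers roughly halves the memory and the time per sweep for each worker.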
Results on Applications
Recommender Systems
[Figure: results in the Multicore and Distributed settings]
Netflix dataset: 2,649,429 users, 17,770 movies, ~100M ratings
Recommender Systems
Synthetic dataset: ~85M users, 17,770 items, ~8.5B observations
Link Prediction
• Flickr dataset: 1.9M users & 42M links.
• Test set: sampled 5K users.
Predicting Gene-Disease Links
[Figure: two panels, "Singleton gene rank" and "Test pair rank", each plotting Pr(Rank <= x) against Rank (0 to 100) for CATAPULT (Blom et al., 2013), Katz, and Biased MF]
Sign Prediction
• Epinions dataset (+ve & -ve reviews):
• 131K nodes, 840K edges, 15% edges negative.
• MF-ALS is faster and achieves higher accuracy.
• MF-ALS takes 455 secs on a network with 1.1M nodes & 120M edges
Conclusions
• Rapid growth of unstructured data demands scalable machine learning solutions for analysis
• Machine learning problems arising in network analysis can be cast in the matrix completion framework
• Our proposed asynchronous distributed algorithm NOMAD outperforms state-of-the-art matrix completion solvers
• Beyond Matrix Completion: Asynchronous distributed framework for solving machine learning problems