Content

• Link analysis
  o PageRank, HITS, …
• Markov chains, random walks
• Learning on graphs
  o Classification, clustering
  o Transductive and inductive learning
• Theme and sentiment analysis
  o Latent models
  o Markov Chain Monte Carlo
• Graph mining
  o Community detection
  o Diffusion in graphs
• Recommendation
  o Collaborative recommendation
  o Singular value decomposition, non-negative matrix factorization
  o Ranking, etc.

Mining Content Information Networks
Content Information Networks

• Web
  o Hyperlinks
• Social networks
  o Friends, comments, tags, metadata (date, geo-localization, etc.), …
• Bibliographical networks
  o Authors, co-authors, conferences, editor site, metadata, …
• Blogs
  o Comments, messages, backlinks, linkbacks
  o Micro-blogging: followers
• E-mails
  o To, from, subject, date, etc.
• Any collection of content elements with relations
  o Images, video, texts, …
  o Implicit relations based on similarities
• Collaborative recommendation networks
Examples

• Enron e-mail
• 11 K Web hosts – Webspam
• Wikipedia themes classification
• Flickr friendship network
Heterogeneous network
• Modeling
  o Graphs
    • Nodes are content elements
    • Links represent relations
• Characteristics
  o Content elements
    • May be of different types (heterogeneous)
  o Relations
    • Simple
    • Homogeneous
    • Heterogeneous
    • Multiple
    • Directed / undirected
  o Static or dynamic networks
• Needs
  o Structural characteristics of the network
  o Dynamics
    • Network evolution
    • Information propagation
  o Node importance
  o Classification, ranking
  o Content analysis
    • Thematic
    • Sentiment
    • …
Link analysis

PageRank, HITS, SALSA
Motivations

• Computing score functions on graph data
  o Importance of an item
• Web pages
  o The number of incoming links measures the popularity of a page
• Social networks
  o Links measure social interaction (e.g. friendship)
• Scientific literature
  o Impact factor (journals)
    • Average number of citations per published item
  o Classification or ranking score
    • Annotation of items (images)
  o Recommendation
PageRank

• General
  o Popularized by Google
  o Assigns an authority score to each web page
  o Uses only the structure of the web graph (query independent)
  o Now one of the many components used for computing page scores in Google's search engine
• Intuition
  o Assign higher scores to pages with many in-links from authoritative pages with few out-links
• Model
  o Random surfer model
    • Stationary distribution of a Markov chain
  o Principal eigenvector of a linear system
Notations

• G = (V, E) a graph
• A its adjacency matrix
  o Binary matrix
    • aij = 1 if there is a link from i to j
    • aij = 0 otherwise
• P its transition matrix
  o pij = aij / di, for i, j = 1..n
  o di is the out-degree of node vi (di = Σj aij)
  o pij is the probability of moving from node i to node j in the graph
  o P is row stochastic: Σj pij = 1
• Example: a 4-node graph (nodes 1, 2, 3, 4)

  A =
  0  1  0  1
  0  0  1  1
  1  0  0  0
  0  0  1  0

  P =
  0   1/2  0    1/2
  0   0    1/2  1/2
  1   0    0    0
  0   0    1    0

• Basic PageRank
  o Initialize the PageRank score vector to a stochastic vector, e.g. p(0) = (1/4, 1/4, 1/4, 1/4)T
  o Update the PageRank vector until convergence
    • p(k+1) = PT p(k)
    • p(0) = (1/4, 1/4, 1/4, 1/4)T, p(1) = (2/8, 1/8, 3/8, 2/8)T, p(2) = (6/16, 2/16, 5/16, 3/16)T, …
• Conditions for convergence, uniqueness of the solution?
Non-Negative Matrices

• A square matrix Anxn is non-negative if aij ≥ 0
  o Notation: A ≥ 0
  o Example: graph adjacency matrix
• Anxn is positive if aij > 0
  o Notation: A > 0
• Anxn is irreducible if
  o ∀ i, j, ∃ k ∈ ℕ such that (Ak)ij > 0
  o If A is a graph adjacency matrix, this means that G is strongly connected
    • There is a path between any pair of vertices
• Anxn is primitive if ∃ k ∈ ℕ such that Ak > 0
  o A primitive matrix is irreducible
  o The converse is false
Examples (Baldi et al. 2003)
Perron-Frobenius theorem

• Anxn a non-negative irreducible matrix
  o A has a real, positive eigenvalue λ such that λ ≥ |λ'| for any other eigenvalue λ'
  o λ corresponds to a strictly positive eigenvector
  o No other eigenvector is positive
  o λ is a simple root of the characteristic equation det(A − λI) = 0
• Remarks
  o λ is called the dominant eigenvalue of A, and the corresponding eigenvector the dominant eigenvector
    • The dominant eigenvalue is denoted λ1 in the following
  o There might be other eigenvalues λj with |λj| = |λ1|
    • e.g. the matrix
      0  1
      1  0
      is non-negative and irreducible, with the two eigenvalues 1 and −1 on the unit circle
• Perron-Frobenius theorem for a primitive matrix
  o In the first property, the inequality is strict
    • i.e. A has a real, positive eigenvalue λ1 such that λ1 > |λ'| for any other eigenvalue λ'
• For a primitive row-stochastic matrix
  o λ1 = 1, since A1 = 1
• Why is it interesting?
  o It yields a simple procedure for computing the dominant eigenvalue and eigenvector of a matrix using the powers of the matrix
Intuition on the power method

• Let
  o x ∈ ℝn
  o u1, …, un the eigenvectors of A
  o c1, …, cn the coordinates of x in the eigenvector basis
• Then
  o x = Σi ci ui
  o At x = Σi ci λi^t ui
  o Since λ1 dominates, At x / λ1^t → c1 u1 for t large
    • True if x is not orthogonal to u1
    • Since u1 is positive, any positive vector will do
      o e.g. x = 1
Power method

• Let A be a primitive matrix
• Start with an arbitrary vector x0
  o yt = A xt
  o xt+1 = yt / ||yt||
• Convergence
  o Converges towards u1, the eigenvector associated with λ1, the largest eigenvalue of A
  o Whatever the initial vector x0 (provided it is not orthogonal to u1)
• Rate of convergence
  o Geometric, with ratio |λ2 / λ1|
  o λ1 > λ2 are the two dominant eigenvalues of A
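The iteration above can be sketched in a few lines of NumPy. This is an illustrative sketch, not from the slides; the 2×2 matrix, tolerance, and iteration cap are arbitrary choices:

```python
import numpy as np

def power_method(A, tol=1e-10, max_iter=1000):
    """Iterate y = A x, x = y / ||y|| until the direction stabilizes."""
    n = A.shape[0]
    x = np.ones(n) / n               # any positive start vector will do
    for _ in range(max_iter):
        y = A @ x
        x_new = y / np.linalg.norm(y)
        if np.linalg.norm(x_new - x) < tol:
            x = x_new
            break
        x = x_new
    lam = x @ A @ x / (x @ x)        # Rayleigh quotient estimate of lambda_1
    return lam, x

# Primitive (positive) matrix: eigenvalues 3 and 1, dominant eigenvector (1,1)/sqrt(2)
A = np.array([[2.0, 1.0], [1.0, 2.0]])
lam, v = power_method(A)
```

The convergence test on successive iterates is valid here because A is positive, so the iterates keep a consistent sign.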
PageRank

• Recall
  o G a directed graph (the Web)
  o A its adjacency matrix
  o P the transition matrix
• Intuition
  o The rank of a document is high if the rank of its parents is high
  o Embodied e.g. in
    • r(v) ∝ Σw∈pa(v) r(w) / dw
    • r(v): rank value at node v; pa(v): parents of v
  o Each parent w contributes
    • Proportionally to r(w)
    • Inversely to its out-degree dw
  o Amounts to solving r = MT r
    • for a given matrix M
    • An eigenvector problem
Examples (Baldi et al. 2003)
• In order to converge to a stationary solution
  o Remove sink nodes
    • Many such situations on the web
      o Images, files, etc.
  o Make M primitive
Adjustments of the P matrix

• The transition matrix P most often lacks these properties
  o Stochasticity
    • Dangling nodes (nodes with no out-links) make P non-stochastic
    • Rows corresponding to dangling nodes are replaced by a stochastic vector v
      o A common choice is v = (1/n) 1, with 1 the vector of 1s
    • The new transition matrix is
      o P' = P + a.vT
      o ai = 1 if i is a dangling node, 0 otherwise
      o P' is row stochastic
Example (Langville&Meyer 2006)
• Primitivity
  o The matrix shall be primitive in order for the PageRank vector to exist
  o One possible solution is
    • P'' = α P' + (1 − α) 1.vT, with 0 < α < 1 and v a stochastic vector
    • Different v correspond to different random walks
      o v uniform (v = (1/n) 1): teleportation operator in the random walk model
      o v non-uniform: personalization vector
  o P'' is a mixture of two stochastic matrices
    • It is stochastic
    • P'' is trivially primitive, since every node is connected to every node (all entries are positive)
    • α controls the proportion of time P' and 1.vT are used
      o α also controls the convergence rate of the random walk
  o P'' is called the Google matrix
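The construction of P' and P'' can be sketched as follows. This is a hedged illustration on the 4-node example graph used earlier (which has no dangling node, so the dangling-row replacement is a no-op here); α = 0.85 follows the value quoted later from Brin & Page:

```python
import numpy as np

def google_matrix(A, alpha=0.85):
    """Build P'' = alpha * P' + (1 - alpha) * 1 v^T from a binary adjacency matrix."""
    n = A.shape[0]
    v = np.ones(n) / n                   # uniform teleportation vector
    d = A.sum(axis=1)
    P = np.zeros((n, n))
    nz = d > 0
    P[nz] = A[nz] / d[nz, None]          # row-normalize non-dangling rows
    P[~nz] = v                           # replace dangling rows by v  -> P'
    return alpha * P + (1 - alpha) * np.outer(np.ones(n), v)

A = np.array([[0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)
G = google_matrix(A)

# PageRank by power iteration on P''^T, starting from a stochastic vector
y = np.ones(4) / 4
for _ in range(100):
    y = G.T @ y
```

Since G is row stochastic and y(0) is stochastic, every iterate stays stochastic, and the limit is the (strictly positive) PageRank vector.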
• Example (Langville&Meyer 2006)
Two formulations of the PageRank problem
1. Eigenvector solution

• Solve y = P''T y
  o With y a stochastic vector
  o The original PageRank algorithm uses the power method
    • y(k+1) = P''T y(k), with any stochastic starting vector y(0)
  o This rewrites as
    • P''T y = α P'T y + (1 − α) v 1T y = α P'T y + (1 − α) v
    • = α PT y + α (aT y) v + (1 − α) v
  o Note
    • Computations can be performed on the sparse matrix P instead of the dense matrix P''
• Check
  o P'' being stochastic, its dominant eigenvalue is λ1 = 1
  o P'' being primitive (hence irreducible), the eigenvector associated with λ1 (the PageRank vector) is unique, which guarantees convergence
• Rate of convergence
  o For the web graph, convergence is governed by α
  o The rate of convergence is the rate at which αt → 0
    • The initial paper by Brin & Page uses α = 0.85 and 50 to 100 iterations
Two formulations of the PageRank problem
2. Linear system formulation

• Solve (I − α P'T) y = (1 − α) v
  o This can be rewritten as a function of P directly
    • Solve (I − α PT) y = (1 − α + α aT y) v
  o Properties of (I − α PT)
    • It is non-singular
    • Column sums are 1 − α for non-dangling nodes and 1 for dangling nodes
  o Classical iterative methods for linear systems apply
    • e.g. Jacobi, Gauss-Seidel, successive over-relaxation methods
Jacobi method for solving linear systems

• Consider the linear system
  • Ax = b
• Decompose A into
  o A = D + R, with D the diagonal of A
    • The diagonal of R is 0
• Ax = b writes Dx = b − Rx
• If D is invertible, the Jacobi method solves the linear system by iterating
  • Matrix form: x(k+1) = D−1 (b − R x(k))
  • Element form: xi(k+1) = (bi − Σj≠i aij xj(k)) / aii
• Converges if A is strictly diagonally dominant
  • i.e. |aii| > Σj≠i |aij| (strict row diagonal dominance)
• Levy-Desplanques theorem
  o A square matrix with a strictly dominant diagonal is invertible
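The matrix-form update above can be sketched directly in NumPy. A hedged illustration on an arbitrary strictly diagonally dominant 2×2 system (not from the slides):

```python
import numpy as np

def jacobi(A, b, n_iter=100):
    """Solve A x = b by iterating x <- D^-1 (b - R x), A = D + R."""
    D = np.diag(A)                 # diagonal entries of A
    R = A - np.diag(D)             # off-diagonal part, zero diagonal
    x = np.zeros_like(b)
    for _ in range(n_iter):
        x = (b - R @ x) / D
    return x

A = np.array([[4.0, 1.0],          # strictly diagonally dominant:
              [2.0, 5.0]])         # |4| > |1| and |5| > |2|
b = np.array([1.0, 2.0])
x = jacobi(A, b)
```

Diagonal dominance guarantees the iteration contracts, so after a fixed number of sweeps x is an accurate solution of Ax = b.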
• PageRank with Jacobi
• Algorithm
  o Start with an arbitrary vector y(0)
  o Iterate
    • y(k+1) = α P'T y(k) + (1 − α) v
• Personalization vector
  o Any probability vector v with positive elements can be used
  o v = (1/n) 1 gives uniform teleportation
  o Can be used to
    • Personalize the search
    • Control spamming (link farms)
Convergence rate of PageRank

• Theorem (Bianchini et al. 2005)
  o Let y* be the stationary vector of PageRank and
    e(t) = ||y(t) − y*||1 / ||y*||1
    the 1-norm of the relative error in the computation of PageRank at time t; then e(t) ≤ αt
  o If there is no dangling page, then there exists v ≥ 0 with v = P''v such that the equality holds
Random walk interpretation

• The initial formulation of PageRank was in terms of random walks
  o A surfer walks the web and moves from page to page according to a transition probability matrix M
  o Rank of a page v
    • Probability that the surfer is browsing page v
  o M is interpreted as the matrix of a first-order Markov chain
  o The Google vector r is the stationary distribution of a discrete-time Markov chain
Markov Chains

• Stochastic process
  o A set of random variables {Xt} defined on a state space S = {S1, …, Sn}
    • t is often the time; Xt is the state of the process at time t
  o e.g. S: pages of the Web, process: surfing the web, Xt: the web page viewed at time t
• Markov chain
  o A stochastic process that satisfies the Markov property
  o P(Xt+1 = Sj | Xt = Si, …, X0) = P(Xt+1 = Sj | Xt = Si)
  o i.e. a memoryless process: the state at time t+1 only depends on the state at time t
    • P(Xt = Sj | Xt−1 = Si) is the transition probability, i.e. the probability of moving from Si at time t−1 to Sj at time t
    • P is a row-stochastic matrix
• Stationary Markov chain
  o An MC in which the transition probabilities do not depend on t: pij = P(Xt+1 = Sj | Xt = Si), ∀t
• Transition matrix
  o P = (pij)n×n
• Initial distribution vector
  o p(0) = (p1(0), …, pn(0))
    • pi(0) is the probability that the chain starts in Si
• Irreducible Markov chain
  o The transition matrix is irreducible
Graphical representation

• State = circle, transition = directed link, labeled with the transition probability pij
Example 1 (Rabiner – Juang)

• S1 = Rain, S2 = Clouds, S3 = Sun
Example 2

• Web surfing
  o States: pages
  o Transitions: hyperlinks
  o Parameter estimation: statistics on users' browsing
  o Use: model the browsing behavior
Example 3: n-gram language model

• Build a language model which captures the sequential nature of texts in a corpus.
• An n-gram model = an MC of order n−1
• Example: 20 K word vocabulary

  MODEL     # PARAMETERS
  Bigram    20K × 20K = 400·10^6
  Trigram   8·10^12
  4-gram    1.6·10^17
• Probability distribution vector
  • A non-negative vector π = (π1, …, πn) whose components sum to 1
• Stationary distribution vector for an MC with transition matrix P
  • A vector π s.t. πT = πT P
• kth step probability vector of an MC
  • Probability of being in each state at time k
  • p(k) = (p1(k), …, pn(k))
Properties

• For an initial vector p(0), what is the state distribution p(k) at time k?
• Property
  o Let P be the transition matrix of an MC on states S1, …, Sn
  o Pk is the kth step transition matrix
    • [Pk]ij is the probability of moving from i to j in k steps
  o p(k) = (Pk)T p(0) is the kth step probability vector
  o If P is primitive, p(k) → π, with π the unique dominant eigenvector of the transition matrix P
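These properties can be checked numerically. A hedged sketch on an arbitrary 2-state primitive chain (not from the slides); its stationary distribution is π = (5/6, 1/6):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])        # row-stochastic, primitive transition matrix
p = np.array([1.0, 0.0])          # start deterministically in state S1

# k-th step distribution: p(k) = (P^k)^T p(0), computed incrementally
for _ in range(200):
    p = P.T @ p

# For a primitive P, p(k) converges to the stationary distribution pi,
# the solution of pi^T = pi^T P; here pi = (5/6, 1/6).
```

The second eigenvalue of this P is 0.4, so convergence is geometric with ratio 0.4 and 200 steps are far more than enough.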
Random walks on graphs

• Random walk
  o A stochastic process which randomly jumps from node to node
  o i.e. an MC
• G = (V, W) a weighted graph
  o The transition matrix of the random walk is
    • pij = wij / di, for i, j = 1..n
    • di = Σj wij
• If the graph is connected and non-bipartite (P primitive), the random walk possesses a unique stationary distribution
HITS (Kleinberg 98)
• Hub
• Authority
• Hub
  o Points to good authority pages
  o Hub score of a page: sum of the authority scores of its children
• Authority
  o Important reference pages for a topic
  o Pointed to by good hub pages
  o Authority score of a page: sum of the hub scores of its parents
HITS – Algorithm

• Input
  o Web subgraph relative to a query
    • The subgraph is composed of the retrieved documents + the linked (in and out) web documents
    • Only a part of the linked documents is considered (e.g. 100)
• Output
  o Authority and hub scores, a(v) and h(v), for all pages in the graph
• Algorithm
  o Initialize
    • a(v) = 1, h(v) = 1 for all v (any positive vector will do)
  o Repeat
    • a(v) = Σw→v h(w)
    • h(v) = Σv→w a(w)
    • Normalize h and a
      o a ← a / ||a||
      o h ← h / ||h||
  o Until convergence
  o Return the two lists
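The loop above can be sketched as follows; a hedged illustration on the 4-node adjacency matrix used in the PageRank example. One assumption: the hub update here uses the freshly computed authority scores (the slides leave the update order within one sweep unspecified):

```python
import numpy as np

def hits(A, n_iter=100):
    """Mutual hub/authority updates with L2 normalization."""
    n = A.shape[0]
    h = np.ones(n)
    a = np.ones(n)
    for _ in range(n_iter):
        a = A.T @ h                 # a(v) = sum of h(w) over parents w -> v
        h = A @ a                   # h(v) = sum of a(w) over children v -> w
        a /= np.linalg.norm(a)
        h /= np.linalg.norm(h)
    return h, a

A = np.array([[0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)
h, a = hits(A)
```

As the next slides show, this is exactly the power method applied to A Aᵀ (for h) and Aᵀ A (for a).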
HITS algorithm (continued)

• For the subgraph, let
  o h: vector of page hub scores
  o a: vector of page authority scores
  o A: the adjacency matrix
• In matrix form, the algorithm writes
  o ht = A at−1, at = AT ht−1 (+ normalization)
  o or, combining the two updates
    • ht = A AT ht−1, at = AT A at−1 (+ normalization)
• Matrices
  o AT A is called the authority matrix
    • Determines the authority scores
  o A AT is called the hub matrix
    • Determines the hub scores
  o Both are symmetric, positive semi-definite
    • The dominant eigenvalue λ1 is unique
• Algorithm
  o The update algorithm is the power method for the matrices A AT and AT A
  o It converges towards a dominant eigenvector of A AT and of AT A, respectively
• Convergence
  o Although λ1 is unique, it may have multiple eigenvectors, so that the limit will depend on the initial vectors a(0) and h(0)
  o A trick similar to PageRank can be used to make the matrices primitive and converge to a unique eigenvector:
  o Replace A AT with α A AT + (1 − α) 1.vT, with 0 < α < 1 and v a stochastic vector
  o Same thing with AT A
• Example (Langville – Meyer 2006)
• Matrices
  o A symmetric matrix B is positive semi-definite if
    • For all non-zero vectors x, xT B x ≥ 0
    • Or, equivalently, all its eigenvalues are ≥ 0
  o A matrix B is positive definite if
    • ≥ is replaced by >
SALSA (Lempel – Moran 2001)

• Many variants and algorithms were inspired by the success of PageRank and HITS.
• SALSA (Stochastic Approach for Link Structure Analysis) is a stochastic extension of HITS
  o Makes use of a subgraph of the web
  o Computes hub and authority values
• G = (V, E)
• Build a bipartite undirected graph with two sets of vertices
  o Vh: all vertices with out-degree > 0 in G
  o Va: all vertices with in-degree > 0 in G
  o Edges connect Vh to Va
• Perform two separate random walks to compute hub and authority scores, using hub and authority transition matrices H and B
  o Hub score
    • Start from a node in Vh
    • Jump to a node in Va according to H
      o Follow a link in G
    • Jump back to a node in Vh according to B
      o Follow a backlink in G
  o Authority score
    • Same, starting from Va
  o The stationary vectors of the random walks are the two score vectors h and a
    • h and a are the principal eigenvectors of H and B
  o Note
    • Each walk starts on one side of the bipartite graph and remains on this side
Example (Langville – Meyer 2006)
• Transition matrices
  o Hub matrix H
    • hu,v = Σw: (u,w)∈E, (v,w)∈E 1 / (deg(u) deg(w))
    • u, v ∈ Vh, w ∈ Va
  o Authority matrix B
    • bu,v = Σw: (w,u)∈E, (w,v)∈E 1 / (deg(u) deg(w))
    • u, v ∈ Va, w ∈ Vh
• The transition matrices can be computed from the adjacency matrix A of the initial graph G
  o Ar: row-normalized adjacency matrix
  o Ac: column-normalized adjacency matrix
  o H: non-zero rows and columns of Ar (Ac)T
  o B: non-zero rows and columns of (Ac)T Ar
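The matrix construction can be sketched as follows. A hedged illustration on the earlier 4-node example, where every node has both in- and out-links, so the restriction to non-zero rows and columns (needed in general) is omitted:

```python
import numpy as np

A = np.array([[0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)

def normalize_rows(M):
    """Divide each non-zero row by its sum."""
    s = M.sum(axis=1, keepdims=True)
    return np.divide(M, s, out=np.zeros_like(M), where=s > 0)

Ar = normalize_rows(A)        # row-normalized adjacency matrix
Ac = normalize_rows(A.T).T    # column-normalized adjacency matrix
H = Ar @ Ac.T                 # SALSA hub chain
B = Ac.T @ Ar                 # SALSA authority chain
```

On this graph both chains come out row stochastic, as expected for random-walk transition matrices.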
Example (Langville – Meyer 2006)
Bibliography

• Langville A., Meyer C.D. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton, NJ, 2006. ISBN 0691122024.
• Baldi P., Frasconi P., Smyth P. Modeling the Internet and the Web: Probabilistic Methods and Algorithms. Wiley, 2003.
• Bianchini M., Gori M., Scarselli F. Inside PageRank. ACM Transactions on Internet Technology. 2005;5(1):92–128.
Classification and ranking on networked data

• Introduction
  o Motivation
  o Graph Laplacian
• Regularization-based methods
• Collective classification
Relational graph data

• Fast-growing semantic resources
  o Web
  o Social networks
  o Media sharing
• New
  o Services
  o Data types
  o Industrial problems
  o Research problems
• Challenge
  o Machine learning and information retrieval for networked data
Different types of analysis

• Network analysis (Kleinberg, Faloutsos, …)
  o Mainly based on connectivity analysis
  o Structure
  o Dynamics
  o Information propagation
• Machine learning on graph data
  o Mainly classification
  o Two main approaches
    • Collective classification
    • Regularization framework
  o May take into account both content and connectivity
Classification and Ranking on networked data
• Problem
  o Classification and ranking are two important generic problems in machine learning
    • Mainly developed for vectorial and sequential data
  o Classification / ranking on graphs
    • As usual
      o Some data points are labeled
      o Infer the labels of the other nodes
    • Specificity of graph data
      o Node interdependency
      o The label inferred at a given node will depend on its neighbors
Example: webspam detection

• WebSpam challenge 2007
  o 11 K hosts
  o 7 K labeled
  o 26 % spam
• Partial view of the host graph
  o Black: spam
  o White: non-spam
• Example: blog spam
Graph Laplacians (von Luxburg 2007)

• The following definitions hold for undirected graphs
  o Let G = (V, E) be an undirected graph
    • |V| = n, W an n×n non-negative, symmetric weight matrix
    • D an n×n diagonal matrix with dii = Σj wij
  o The unnormalized Laplacian of G is
    • L = D − W
  o Properties
    • ∀ f ∈ ℝn, fT L f = ½ Σi,j wij (fi − fj)²
    • L is symmetric, positive semi-definite
    • L has n non-negative real eigenvalues
    • The smallest eigenvalue of L is 0, with eigenvector 1
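The quadratic-form identity above is easy to verify numerically. A hedged sketch on an arbitrary 3-node weighted graph (a path 1–0–2 with unit weights):

```python
import numpy as np

W = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)   # symmetric, non-negative weights
D = np.diag(W.sum(axis=1))
L = D - W                                # unnormalized Laplacian

f = np.array([1.0, 2.0, 4.0])
quad = f @ L @ f                         # f^T L f
identity = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)
```

Both expressions agree, and L annihilates the constant vector, illustrating that 0 is an eigenvalue with eigenvector 1.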
Graph Laplacians (continued)

• The following two matrices are called normalized Laplacians
  o Lrw = D−1 L = I − D−1 W; Lrw is related to random walks
  o Ls = D−1/2 L D−1/2 = I − D−1/2 W D−1/2; Ls is symmetric
• Properties
  o Ls and Lrw are positive semi-definite; they have n non-negative real eigenvalues
  o 0 is an eigenvalue of Ls and Lrw, with eigenvector D1/2 1 and 1 respectively
  o Ls and Lrw are similar matrices
    • They have the same eigenvalues
    • λ is an eigenvalue of Lrw with eigenvector v iff λ is an eigenvalue of Ls with eigenvector D1/2 v
  o ∀ f ∈ ℝn, fT Ls f = ½ Σi,j wij (fi/√dii − fj/√djj)²
Classification and ranking on networked data

• Introduction
  o Motivation
  o Graph Laplacian
• Regularization-based methods
• Collective classification
Graph labeling with regularization

• The initial framework comes from semi-supervised learning
• Later extended to other settings
  o Classification on graphs
  o Ranking
• Classification framework
  o The classical setting for classification is inductive learning
    • Learn from a set of labeled data
      o Usually manual labeling of the data
    • Infer on new data
• Semi-supervised learning
  o Motivation
    • Labeling data is expensive, while unlabeled data is often available in large quantities
      o Often the case for e.g. web applications
    • Train classifiers using both (few) labeled data and unlabeled data
  o The regularization framework mainly concerns transductive learning
    • i.e. all data (labeled + unlabeled) are available at once
• Where does the graph come from in semi-supervised learning?
• Make use of local data consistency (proximity, similarity) besides global consistency
• See illustration
Data consistency (Zhou et al. 2003)

• Context: semi-supervised learning (SSL)
• SSL relies on local (neighbors share the same label) and global (data structure) data consistency

Fig. from Zhou et al. 2003
• Graph methods for semi-supervised learning – general idea
  o Given
    • An undirected graph G defined on the data points
    • A similarity matrix between nodes in G
    • A set of labeled nodes of G
    • Propagate the observed labels to unlabeled nodes using the similarities
• Notations
  o D = {x1, …, xl, xl+1, …, xn} data points
    • The first l points are labeled, the others unlabeled
  o y: n×1 vector
    • We consider binary classification (classes C1 and C2) for simplicity
    • n: # data points
    • yi: class score for pattern xi
      o e.g. the target is 1 if xi is in C1 and 0 if in C2
  o G = (V, E) an undirected graph
    • A its adjacency matrix
    • W a similarity matrix: Wij is the similarity between nodes i and j
    • S a row-stochastic matrix defined on G
      o Different choices are possible for S
Iterative algorithm for semi-supervised classification – general scheme

• I/O
  o Input
    • Labeled and unlabeled data points
  o Output
    • Labels for all data points
• Algorithm
  o Compute
    • a similarity matrix W
    • a normalized similarity matrix S
  o Iterate until convergence
    • y(t+1) = α S y(t) + (1 − α) y(0)
  o Label each point
    • e.g. y*i = 1 if y*i > 0.5, 0 otherwise
• Example
  o Wij = exp(−||xi − xj||² / 2σ²) for i ≠ j, Wii = 0
  o S = D−1 W
    • D is a diagonal matrix whose ith element is the sum of the ith row of W
  o Where y(0) is:
    • the vector of initial labels for the labeled nodes (1 if C1, 0 if C2)
    • 0 for the unlabeled nodes
• Properties
  o Classical convergence conditions of iterative methods apply
    • e.g. S primitive, α < 1
  o Converges to
    • y* = (1 − α)(I − αS)−1 y(0)
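The iteration and its closed form can be compared numerically. A hedged sketch on a hypothetical 5-node chain graph where only the endpoint node 0 carries the class-1 label (the choices of graph, α, and iteration count are illustrative):

```python
import numpy as np

# Toy chain graph 0-1-2-3-4; node 0 is labeled class 1, node 4 class 2 (score 0)
W = np.zeros((5, 5))
for i in range(4):
    W[i, i + 1] = W[i + 1, i] = 1.0
S = W / W.sum(axis=1, keepdims=True)      # row-stochastic S = D^-1 W
y0 = np.array([1.0, 0, 0, 0, 0])          # initial labels (unlabeled nodes: 0)
alpha = 0.5

# Iterative scheme: y(t+1) = alpha * S y(t) + (1 - alpha) * y(0)
y = y0.copy()
for _ in range(500):
    y = alpha * S @ y + (1 - alpha) * y0

# Closed form: y* = (1 - alpha) (I - alpha S)^{-1} y(0)
y_star = (1 - alpha) * np.linalg.solve(np.eye(5) - alpha * S, y0)
```

Both agree, and the propagated scores decay with distance from the labeled class-1 node, as the smoothness intuition predicts.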
• Different variants
  o Different W matrices can be used
    • Inverse exponential distances
      o Dense connection matrix
      o May use a threshold to make it sparse
    • K nearest neighbors
      o Local connectivity, sparse connection matrix
    • Kernels on graphs
      o See later
  o Any S matrix which satisfies the convergence conditions can be used
    • e.g. S = D−1 W
    • e.g. S = D−1/2 W D−1/2
Iterations
Fig. from Zhou et al. 2003
Same algorithm but with
Regularization view of the algorithm

• y* can be obtained as the solution minimizing the following cost function
  o Q(y) = (1 − α) Σi (yi − yi(0))² + (α/2) Σi,j Sij (yi − yj)²
    • First term: fitting constraint w.r.t. the initial labels
    • Second term: smoothness constraint on neighbor nodes
    • yi(0) = 1 if node i is in class 1, 0 otherwise (class 2 and unlabeled points)
• In compact form (for symmetric S)
  o Q(y) = (1 − α)(y − y(0))T (y − y(0)) + α yT (I − S) y
• Differentiating Q w.r.t. y and setting the gradient to 0 gives
  o (I − α S) y = (1 − α) y(0) (*)
• I − α S is non-singular; the solution is
  o y* = (1 − α)(I − α S)−1 y(0)
• The fixed-point iteration y(t+1) = α S y(t) + (1 − α) y(0) above is the Jacobi iterative algorithm for solving the linear system (*)
Multiclass extension

• Direct extension of the above algorithm
  o Replace the vector yn×1 with a matrix Yn×c
    • c is the number of classes
    • Yij = 1 if xi is of class Cj, and 0 otherwise
    • Y(0) is the matrix of initial labels
    • Final assignment: xi is given class Cj with j = argmaxk Y*ik
Ranking extensions

• Remark
  o Similar formulations have been proposed for ranking in web search engines (e.g. Zhou 2004, Deng 2008)
  o Ideas
    • Documents and queries are the graph vertices
    • Scores are propagated for computing document relevance to queries while considering document similarity
    • Documents are ranked for each query according to their scores
Content + link information

• Propagation methods
  o do not directly consider the content of the different nodes
  o content only appears through the similarity or kernel matrix
• It is possible to use the graph regularization idea together with content-based classifiers
Content + link information (continued)
Abernethy et al. 2010 (classification), Denoyer et al. 2010 (ranking)

• Context
  o Transductive semi-supervised learning
  o Each node is characterized by content information
    • e.g. image, text, other
• Content classifier term
  o Σi∈labeled Δ(f(xi), yi(0))
• Smoothing term
  o Σ(i,j)∈E wij ||f(xi) − f(xj)||²
• Regularized content + link classifier
  o Σi∈labeled Δ(f(xi), yi(0)) + λ Σ(i,j)∈E wij ||f(xi) − f(xj)||²
• Learning
  o Gradient-like algorithm for learning the parameters of f
  o Extensions allow learning the weights as well
Ranking model for image annotation in a social network (Denoyer 2009)

• Problem
  o Automatic annotation of images in large social networks (e.g. Flickr)
  o Considers simultaneously
    • Explicit relations (authorship, friendship)
    • Implicit relations (similarity)
    • Different types of content
      o Text, image
• Approach
  o Regularization-based method
• Cost function
  o Fitting term: ranking function
  o Regularity term: based on one type of relation
• Results
  o Importance of social links
    • Authors, friendship
  o Large improvement over non-relational (classical) ranking methods
  o Little improvement with implicit relations
Experiments
• 3 corpora extracted from Flickr
Results
Other extensions

• Directed graphs
• Multiple relations
• Heterogeneous networks
• Zhou D., Bousquet O., Lal T.N., Weston J., Schölkopf B. Learning with local and global consistency. In: Advances in Neural Information Processing Systems 16 (NIPS 2003). 2004:595–602.
• Zhu X., Ghahramani Z., Lafferty J. Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of ICML 2003.
• Abernethy J., Chapelle O., Castillo C. Graph regularization methods for Web spam detection. Machine Learning. 2010;81(2):207–225.
Classification and ranking on networked data

• Introduction
  o Motivation
  o Graph Laplacian
• Regularization-based methods
• Collective classification
Networked data

• Available information
  o Connectivity
    • Labels: partial labeling
    • Links
      o Usually assumed known – explicit or implicit
  o Node features
  o Others
    • Label metric
• Three types of correlations can then be exploited
  o Correlation between the label of node i and its features
  o Correlation between the label of i and the observed features and/or labels of node i's neighbors
  o Correlation between the label of i and the unobserved features and labels of node i's neighbors
• Solving the global label assignment problem is usually NP-hard
• Exact inference algorithms, when they exist, are too costly
• Most methods use approximate inference algorithms
• Note: most methods consider only
  o Unweighted links
  o Single links
Notations and problem definition

• Notations
  o Graph G = (V, E)
  o Node i features: xi
    • xi may incorporate input features (e.g. text) and/or relational features
      o Local features: e.g. neighbor labels, number of neighbors, …
      o Global features: e.g. centrality
  o Node i label: yi
  o Neighborhood of node i: N(i)
  o Labels take their values in L = {l1, …, lp}
• Classification problem
  o Some labels and/or features being observed
  o Infer the unobserved labels of the other nodes
Collective classification methods (Sen et al. 2008)

• Usual scheme
  o Bootstrap
    • Assign an initial value to each node using a local classifier
    • Any classifier may be used
  o Iterate
    • Compute node labels using graph contextual information
    • Iterations are needed since the new label values for the nodes in N(i) provide new information for yi
• Most methods for collective classification thus require
  o A relational classifier
  o An iteration policy
Collective classification methods

• Gibbs sampling
• Iterative classification
• Relaxation labeling
• Stacked learning
• Random walks, …
Feature vectors

• For vector classifiers, xi should be of fixed size
  o Neighborhoods N(i) may be of variable size for different nodes i
  o Usual solution: use aggregate features in order to build fixed-size feature vectors
    • e.g. # class-k labels in N(i), class-k relative frequency in N(i), majority label in N(i), …
• The value of xi may change from one iteration to the other
  o xi must be recomputed at each iteration
• Example – aggregate features
Iterative classification (Neville et al. 2000, Lu et al. 2003)

• Bootstrap
  o For each unlabeled node i
    • Local classifier
      o Compute xi
      o Compute label yi using the observed nodes in N(i): yi = F(xi)
• Iterate
  o Generate an ordering of the unlabeled nodes
  o For each unlabeled node i
    • Relational classifier
      o Compute xi
      o Compute label yi using N(i): yi = F(xi)
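The bootstrap/iterate loop can be sketched on a toy problem. Everything here is a hypothetical illustration: the 4-node chain, the scalar node features, and the hand-rolled classifier F (which averages the local feature with the neighbor-label frequency, a simple aggregate feature) are all made up for the sketch:

```python
# Toy chain 0-1-2-3; nodes 0 and 3 are observed, 1 and 2 are unlabeled
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labels = {0: 1, 3: 0}                  # observed labels
features = {1: 0.8, 2: 0.2}           # local features of the unlabeled nodes

def F(x_local, neigh_labels):
    """Toy relational classifier: mix the local feature with an
    aggregate feature (frequency of label 1 among labeled neighbors)."""
    if neigh_labels:
        rel = sum(neigh_labels) / len(neigh_labels)
        return 1 if (x_local + rel) / 2 > 0.5 else 0
    return 1 if x_local > 0.5 else 0

# Bootstrap: label each unlabeled node from currently labeled neighbors only
y = dict(labels)
for i in features:
    y[i] = F(features[i], [y[j] for j in adj[i] if j in y])

# Iterate: recompute labels until they stabilize (fixed sweep count here)
for _ in range(10):
    for i in features:
        y[i] = F(features[i], [y[j] for j in adj[i]])
```

On this toy graph the labels stabilize after one sweep: node 1 follows its class-1 neighbor, node 2 its class-2 side.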
Simulated Iterative Classification (Maes et al. 2009)

• Training bias in ICA
  o Training is performed on correct labels
  o Inference is performed with noisy labels

SICA (continued)

• Idea
  o Training and test conditions should be made similar
  o Make training examples representative of test ones by simulating inference during learning
  o How
    • Repeatedly run inference during training, sampling from the current classifier's distribution of predicted labels
    • Different sampling schemes
Gibbs sampling (McDowell et al. 2007, Neville et al. 2007)

• Simplified version of the original Gibbs sampling strategy (Geman & Geman 84)
  o Introduces a classifier F() not present in the original Gibbs sampler
• Training often requires a fully labeled training set
• Inference
  o Sample the outputs for each node and take the majority label
• Differences with IC
  o Label sampling
  o Sequential update
  o Any classifier can be used for the Bootstrap and Iterate steps
Gibbs sampling

• Bootstrap
  o For each unlabeled node i
    • Local classifier
      o Compute xi
      o Compute label yi using the observed nodes in N(i): yi = F(xi)
  o For each label l
    • Counts[i, l] = 0
• Iterate
  o Generate an ordering of the unlabeled nodes
  o For each unlabeled node i
    • Relational classifier
      o Compute xi
      o Compute label yi using N(i): yi = F(xi)
    • Counts[i, yi] = Counts[i, yi] + 1
• Finally, yi = argmaxl Counts[i, l]
Remarks

• For both ICA and Gibbs
  o A sequential update of the unobserved labels is performed
  o Any classifier can be used for the Bootstrap and Iterate steps
  o Hard labels are computed at each step
  o Node ordering has no real impact
  o The choice of classifier may impact the performance
• Training
  o Usually requires a fully annotated data set
ICA – Gibbs
Stacked graphical learning (Cohen et al. XX)

• The main difference is in the training phase
• Ideas
  o Train a local classifier y = F(x)
  o Train a second classifier using both the input x and the predicted outputs in N(i)
  o Uses stacked learning
• Usually requires only a few (even 1!) iterations
Stacked graphical learning
• Training
o Bootstrap
• Learn a local classifier F0 on the training set D
o Iterate k = 1 to K
• Build training set Dk by augmenting xi with the predicted neighbor labels YN(i): xk = (x, YN(x))
• Learn Fk on Dk
• Note: this step uses stacked learning
o Final model: FK
• Inference
o y0 = F0(x)
o For k = 1 to K
• Compute xk as above
• yk = Fk(xk)
o Final prediction: yK = FK(xK)
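A minimal sketch of this training/inference scheme with one stacking step (K = 1). The chain graph, the noisy features, and the hand-rolled nearest-centroid classifier standing in for the base learner F are all hypothetical; a proper stacked setup would use held-out predictions (next slide) rather than in-sample ones.

```python
import numpy as np

# Stacked graphical learning sketch, K = 1, nearest-centroid base classifier.
rng = np.random.default_rng(0)

def fit_centroid(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_centroid(model, X):
    classes = sorted(model)
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[d.argmin(axis=0)]

# toy chain graph: node i is linked to i-1 and i+1, labels form two blocks
n = 40
y = np.array([0] * 20 + [1] * 20)
X = y[:, None] + 0.3 * rng.standard_normal((n, 2))   # noisy local features
nbrs = [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]

f0 = fit_centroid(X, y)                  # local classifier F0
y0 = predict_centroid(f0, X)             # in-sample predictions (simplification)

def augment(X, yhat):
    # relational feature: mean predicted label of the neighbors
    rel = np.array([yhat[nb].mean() for nb in nbrs])
    return np.hstack([X, rel[:, None]])

f1 = fit_centroid(augment(X, y0), y)         # stacked classifier F1
y1 = predict_centroid(f1, augment(X, y0))    # inference: F0, then F1
accuracy = float((y1 == y).mean())
print(accuracy)
```

The stacked classifier sees both the local features and a summary of the neighbors' predicted labels, which is the whole point of the augmentation step.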
Stacked learning
o For robustness, training is performed using stacked learning
o Training set D
• Let D1, ..., Dm be a partition of D
• Fk is trained as follows:
o Train m functions fi
• fi is trained on D – Di
• For x ∈ Di, y = F(x) = fi(x)
o Note
• At each iteration a different partition is used
• This prevents overfitting
Some tests
• Sequential data (handwritten word recognition)
Other approaches
• Extensions of graphical model classifiers have been proposed for collective and relational classification
o Directed models
• Relational Bayesian Networks (Taskar et al. 2001)
o Undirected models
• Relational Dependency Networks (Neville et al. 2003, 2007)
• Relational Markov Networks (Taskar et al. 2002)
o ...
Special case: univariate classification
• When the input features are ignored, collective classification is known as univariate collective classification
• Labels are propagated from observed nodes to unlabeled nodes
• All the above methods can be used in this setting (Macskassy et al. 2007)
References
• Sen P, Namata G, Bilgic M, Getoor L, Gallagher B, Eliassi-Rad T. Collective classification in network data. AI Magazine. 2008;29(3):1‐24.
Graph kernels
Motivations
• Graph kernels make it possible to define similarities between the nodes of a graph, based on the graph structure
o e.g. the number of paths connecting two nodes, the mean weight of these paths, etc.
o i.e. complex similarity measures
• Distance measures between nodes can easily be derived from these similarities
o The kernel framework allows a large variety of distance measures to be considered, and represents the nodes of a graph as points in a Euclidean space
• Link with regularization-based approaches
o Some graph kernels may be obtained as solutions to the optimization of loss functions
• Link with random walks
o Some graph kernels may be defined in terms of random walks on the graph
Kernels
• Kernels are "similarity" functions k(x, x') s.t.
o k(x, x') can be computed via an inner product of some transformation of x and x' in a feature space
• Definition
o K: X × X → R is a kernel function if for all x, z in X, K(x, z) = <Φ(x), Φ(z)>, where Φ is a mapping from X into an inner product feature space (Hilbert space)
• Initial motivations in machine learning
o Non-linear classification
• Map the data onto a possibly high dimensional space, so that the problem becomes linear in that space
• Limit the complexity of similarity computations to O(input space dimension)
o Computations may be performed in the original (smaller) space at a linear cost
• Illustration: x and x' are mapped to Φ(x) and Φ(x') in feature space; K(x, x') computes their inner product there
Kernel functions ‐ examples

• Linear kernel
o K(x, z) = x · z

• Degree-2 polynomial kernels
o K(x, z) = (x · z)² = (Σ_i x_i z_i)(Σ_j x_j z_j) = Σ_{i,j=1..n} (x_i x_j)(z_i z_j)
• Hence K(x, z) = Φ(x) · Φ(z) with Φ(x) = (x_i x_j)_{i,j=1..n}
• i.e. Φ maps x onto all the monomials of degree 2
o K(x, z) = (x · z + c)² = Φ(x) · Φ(z) with Φ(x) = ((x_i x_j)_{i,j=1..n}, (√(2c) x_i)_{i=1..n}, c)
• i.e. a subset of the set of polynomials of degree ≤ 2
Kernels on finite spaces
• Let
o X = {x1, ..., xN}, with x ∈ X
o K(x, x') a symmetric function, K: X × X → R
• K is a kernel function iff the matrix K = (K(xi, xj))_{i,j=1..N} is positive semi-definite
• There are several equivalent characterizations of a kernel function
o The matrix K is symmetric positive semi-definite iff any of the following properties holds
• xᵀ K x ≥ 0, ∀x
• All the eigenvalues of the real matrix K are real and non-negative
• K = Bᵀ B for some real matrix B
o B is not unique in general, and different decompositions may exist
• K = (K(xi, xj))_{i,j=1..N} is called the kernel matrix (or Gram matrix) of the sample
Symmetric matrices ‐ useful properties

• A symmetric matrix has only real eigenvalues
• The eigenvectors of a symmetric matrix are orthogonal and can then be chosen orthonormal
• If a symmetric matrix A has k non-zero eigenvalues, then it can be diagonalized and expressed as
o A = U Λ Uᵀ
o With
• Λ the diagonal k×k matrix of eigenvalues
o Usually ordered in decreasing order
• U the n×k orthonormal matrix of corresponding eigenvectors
• Another expression for A is
o A = (U Λ^(1/2))(U Λ^(1/2))ᵀ = X Xᵀ
• The data matrix associated with a kernel matrix
o Let
• K be a kernel matrix
• K = (U Λ^(1/2))(U Λ^(1/2))ᵀ = X Xᵀ its eigenvalue decomposition
• xi the ith column vector of Xᵀ
o Then
• (K)ij = xiᵀ xj
• xi is an r-dimensional vector (r the rank of K); this is the feature vector associated with the ith pattern
• xi is the Euclidean representation of pattern i in this space
o When pattern i characterizes a node in a graph, xi is its "Euclidean representation"
o X is called the data matrix associated with the kernel matrix K
o Note
• This means that data points in complex spaces (e.g. graphs) may be represented in a Euclidean space using this data matrix representation
• Classical Euclidean operations (dot products, distances, projections) can be defined on these complex objects via the kernel matrix directly
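The data-matrix construction above can be sketched in a few lines of numpy; the kernel here is a hypothetical choice (A², which is PSD for any symmetric A).

```python
import numpy as np

# Sketch: recovering a "data matrix" X with K = X X^T from a kernel matrix,
# via the eigendecomposition K = U Lambda U^T and X = U Lambda^(1/2).
A = np.array([[0., 1, 1, 1],
              [1., 0, 0, 1],
              [1., 0, 0, 0],
              [1., 1, 0, 0]])
K = A @ A                      # A^2 = A A^T is PSD, hence a valid kernel matrix

w, U = np.linalg.eigh(K)       # eigenvalues may carry tiny negative round-off
X = U * np.sqrt(np.clip(w, 0, None))   # scales column j of U by sqrt(w_j)

# rows of X are Euclidean embeddings of the nodes: dot products reproduce K,
# and ||x_i - x_j||^2 = K_ii + K_jj - 2 K_ij
K_rec = X @ X.T
dist_01 = np.linalg.norm(X[0] - X[1])
print(np.allclose(K_rec, K))
```

The rows of X can then be fed to any Euclidean method (k-means, projections, etc.) even though the original objects are graph nodes.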
• Angles
o cos(xi, xj) = xiᵀ xj / (‖xi‖ ‖xj‖) = Kij / √(Kii Kjj)
• Distances
o ‖xi − xj‖² = Kii + Kjj − 2 Kij = (ei − ej)ᵀ K (ei − ej)
• With ei = (0, ..., 0, 1, 0, ..., 0)ᵀ, with 1 in position i
Kernels on graphs
Similarity between graph nodes

Graph → "metric" space

How to define meaningful kernels on graphs?
Example: kernels based on the adjacency matrix

• Kernel matrices
o K = Aⁿ : (Aⁿ)ij = number of paths of length n between i and j
o K = A + A² + ... + Aⁿ : all paths of length 1 to n
o K = Σ_{n≥1} γⁿ Aⁿ : infinite discounted sum

• Example on a 4-node graph v1, v2, v3, v4:

A =            A² (# paths of length 2) =
0 1 1 1        3 1 0 1
1 0 0 1        1 2 1 1
1 0 0 0        0 1 1 1
1 1 0 0        1 1 1 2
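These quantities are easy to check numerically. A small sketch on the same 4-node example, also verifying that the infinite discounted sum has the closed form (I − γA)⁻¹ − I (the value of γ is an arbitrary choice below the convergence bound):

```python
import numpy as np

# Adjacency-based kernels: (A^n)_ij counts length-n paths, and the discounted
# sum K = sum_{n>=1} g^n A^n equals (I - g A)^{-1} - I for g < 1/rho(A).
A = np.array([[0., 1, 1, 1],
              [1., 0, 0, 1],
              [1., 0, 0, 0],
              [1., 1, 0, 0]])

A2 = A @ A                                  # number of length-2 paths
rho = np.abs(np.linalg.eigvalsh(A)).max()   # spectral radius of A
g = 0.5 / rho                               # discount factor, ensures convergence
K = np.linalg.inv(np.eye(4) - g * A) - np.eye(4)

# the closed form matches the truncated series
K_series = sum(g ** n * np.linalg.matrix_power(A, n) for n in range(1, 60))
print(A2.astype(int))
```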
Diffusion kernels on graphs (Shawe‐Taylor et al. 2004)
• Let B = (bij), i,j = 1..n, denote a similarity matrix between the graph nodes, s.t. bij is the similarity between nodes i and j; B is symmetric
• Consider the following similarity:
o bij(2) = Σ_k bik bkj
o i.e. bij(2) is the sum of the similarities of all length-2 paths between i and j, where the similarity of a path is the product of the similarities of its edges
o Then B(2) = B², and B² is a kernel matrix
• In the same way
o B^k is a kernel matrix
• It gives the sum of all length-k path similarities
o Any linear combination of powers of B is a kernel matrix
• Von Neumann diffusion kernel
o K_VN = Σ_{k≥1} γᵏ Aᵏ
o Aᵏ measures the number of paths of length k between any pair of nodes
o The similarity between two nodes i and j, (K_VN)ij, integrates contributions from all the paths from i to j in the graph, with a discounting factor decreasing with k
o Here the importance of a path decreases with its length
o The series converges if 0 < γ < (ρ(A))⁻¹, with ρ(A) the spectral radius of A
o It converges to K_VN = (I − γA)⁻¹ − I
o Note
• Similar kernels can be defined for any symmetric matrix B
Other graph kernels (Fouss et al. 2009)

• Exponential diffusion kernel
o K_ED = Σ_{k≥0} αᵏ Aᵏ / k! = exp(αA), with A the adjacency matrix of graph G
o (Aᵏ)ij is the number of length-k paths between i and j
o Similar to the Von Neumann kernel, with a different discounting rate
• The Laplacian exponential diffusion kernel is the same as the exponential diffusion kernel, except that the adjacency matrix A is replaced with minus the Laplacian matrix L
o K_LED = Σ_{k≥0} αᵏ (−L)ᵏ / k! = exp(−αL)
• The regularized Laplacian kernel is similar to the Von Neumann kernel, with minus the unnormalized Laplacian −L substituted for A
o K_RL = Σ_{k≥0} γᵏ (−L)ᵏ = (I + γL)⁻¹
o This kernel also appears in the regularized approach to semi‐supervised learning (slide XX)
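The exponential diffusion kernel can be computed by truncating its series. A minimal sketch on the same 4-node example graph (the value of α is an arbitrary illustrative choice):

```python
import numpy as np

# Exponential diffusion kernel sketch: K_ED = sum_{k>=0} alpha^k A^k / k!,
# i.e. the matrix exponential expm(alpha * A), via a truncated series.
A = np.array([[0., 1, 1, 1],
              [1., 0, 0, 1],
              [1., 0, 0, 0],
              [1., 1, 0, 0]])
alpha = 0.5

K = np.eye(4)                      # k = 0 term
term = np.eye(4)
for k in range(1, 30):             # 30 terms are plenty for this alpha
    term = term @ (alpha * A) / k  # alpha^k A^k / k!
    K += term

# K is symmetric positive definite, hence a valid kernel matrix
min_eig = np.linalg.eigvalsh(K).min()
print(min_eig > 0)
```

The factorial discount decays much faster than the geometric γᵏ of the Von Neumann kernel, so long paths contribute far less here.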
• Random walk with restart kernel
o Consider a random walker which jumps from node i to node j with a probability given by a row-stochastic transition matrix P, and which at each step jumps back to its start node i with probability 1 − α (Gori et al. 2006)
o The walk is described by the following process
• x(0) = ei
• x(t+1) = α Pᵀ x(t) + (1 − α) ei
• The steady-state solution for a walk starting at node i is
o x = (1 − α)(I − α Pᵀ)⁻¹ ei
o x is the ith column of (1 − α)(I − α Pᵀ)⁻¹; it provides a similarity between node i and the other nodes of the graph
• The random walk with restart matrix is K_RWR = (1 − α)(I − α Pᵀ)⁻¹
o P is transposed because x is a column vector of probabilities over the nodes, while P is row-stochastic
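The steady state can be computed for all start nodes at once by inverting a single matrix. A sketch on the 4-node example graph (the restart probability is an arbitrary illustrative value):

```python
import numpy as np

# Random-walk-with-restart sketch: steady state x_i = (1-a)(I - a P^T)^{-1} e_i.
A = np.array([[0., 1, 1, 1],
              [1., 0, 0, 1],
              [1., 0, 0, 0],
              [1., 1, 0, 0]])
P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
a = 0.8                                # continue with prob a, restart with 1 - a
n = len(A)

# column i of X is the steady-state distribution of a walk restarting at node i;
# P is transposed because the state x is a column vector of node probabilities
X = (1 - a) * np.linalg.inv(np.eye(n) - a * P.T)
print(np.round(X, 3))
```

Each column sums to 1: it is a genuine probability distribution over the nodes, concentrated around the restart node.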
Using graph kernels for recommendation
• Collaborative filtering
o U a set of users
o I a set of items
o Each user rates some of the items
• The user‐item matrix is sparse
o Collaborative filtering
• Recommend items to users
• Usually based on the similarity of user ratings
• Predict the missing ratings for users
• Many different techniques
• Example: user‐item rating matrix (? = unknown)

User \ Item   1   2   3   4
1             5   ?   ?   3
2             ?   ?   ?   ?
3             ?   3   ?   ?
4             ?   ?   ?   ?
5             1   2   ?   ?
• Popular challenges on movie recommendation
o CAMRa2010
o Netflix Prize
• On September 21, 2009 we awarded the $1M Grand Prize to team “BellKor’s Pragmatic Chaos”.
• There are currently 51051 contestants on 41305 teams from 186 different countries. We have received 44014 valid submissions from 5169 different teams;
Using graph kernels for recommendation
• Let G = (V, E) be the user‐item bipartite graph
o Nodes V are the users and the items
o Links
• A is the adjacency matrix
• aij = 1 if user i has rated item j
• aij = 0 otherwise
Using graph kernels for recommendation
• The different graph kernels can be computed on this bipartite graph
o The kernel matrix K, of size (M+N) × (M+N), provides the similarities between graph nodes
o It can be partitioned into 4 blocks
• K = [K_UU K_UI ; K_IU K_II]
o K_UU is the M×M user‐user similarity matrix
o K_II is the N×N item‐item similarity matrix
o K_UI is the M×N user‐item preference matrix and K_IU = K_UIᵀ its transpose
Using graph kernels for recommendation
• Three ways of computing recommendations (Fouss et al. 2009)
o Direct
• Use sim(Useri, Itemj) for a direct ranking of the recommendations
o User based
• Compute sim(Useri, Userj)
• Keep the k nearest neighbors of Useri
o k is a hyperparameter
o The recommendation score of item j for user i is
• score_user(i, j) = Σ_{p ∈ kNN(i)} sim(Useri, Userp) · apj / Σ_{p ∈ kNN(i)} sim(Useri, Userp)
• apj = 1 if Userp rated Itemj and 0 otherwise
Using graph kernels for recommendation
o Item based
• Compute sim(Itemi, Itemj)
• Keep the k nearest neighbors of Itemj
o k is a hyperparameter
o The recommendation score of item j for user i is
• score_item(i, j) = Σ_{p ∈ kNN(j)} sim(Itemj, Itemp) · aip / Σ_{p ∈ kNN(j)} sim(Itemj, Itemp)
• aip = 1 if Useri rated Itemp and 0 otherwise
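The direct strategy can be sketched end to end: build the bipartite adjacency, compute a graph kernel, and rank the unrated items by the user-item block. The 3-user/4-item ratings below are hypothetical, and the Von Neumann kernel with γ = 0.5/ρ(A) is one arbitrary choice among the kernels above.

```python
import numpy as np

# Direct kernel-based recommendation sketch on a toy user-item bipartite graph.
Rb = np.array([[1., 1, 0, 0],     # user 0 liked items 0, 1
               [1., 0, 1, 0],     # user 1 liked items 0, 2
               [0., 0, 1, 1]])    # user 2 liked items 2, 3
m, n = Rb.shape

A = np.zeros((m + n, m + n))      # bipartite adjacency: users first, then items
A[:m, m:] = Rb
A[m:, :m] = Rb.T

rho = np.abs(np.linalg.eigvalsh(A)).max()
gamma = 0.5 / rho
K = np.linalg.inv(np.eye(m + n) - gamma * A) - np.eye(m + n)  # Von Neumann kernel

K_UI = K[:m, m:]                  # user-item block: sim(User_i, Item_j)
user = 0
scores = K_UI[user].copy()
scores[Rb[user] == 1] = -np.inf   # do not re-recommend already rated items
best_item = int(scores.argmax())
print(best_item)
```

User 0 shares item 0 with user 1, who also liked item 2, so the shortest user-item-user-item paths push item 2 to the top.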
Using graph kernels for classification
• Kernels may be used for semi‐supervised classification
o Several of the matrices obtained via transductive approaches to semi‐supervised learning are indeed kernels – or similarity matrices
o Given a kernel matrix K (n×n), a simple classification rule is:
• Let yc be an n×1 vector s.t. yc,i = 1 if node i is from class c and 0 otherwise (unknown or other class)
• K · yc is the vector of class-c scores for the graph nodes
o It computes the similarity of each node with the labeled nodes of class c
• Finally, a node is classified into the class with the highest score
o This is similar to what we did with regularization-based approaches
• Note
o Other, more sophisticated classification rules might be used
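The K · yc rule is a one-liner once a kernel is available. A sketch on a hypothetical 6-node chain graph with one labeled node per class, using the Von Neumann kernel (γ chosen below 1/ρ(A)):

```python
import numpy as np

# Sketch of the K . y_c rule on a toy chain 0-1-2-3-4-5,
# with node 0 labeled class a and node 5 labeled class b.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]:
    A[i, j] = A[j, i] = 1.0

gamma = 0.2                                           # < 1 / spectral radius of A
K = np.linalg.inv(np.eye(6) - gamma * A) - np.eye(6)  # Von Neumann kernel

y_a = np.array([1., 0, 0, 0, 0, 0])   # indicator of the labeled class-a nodes
y_b = np.array([0, 0, 0, 0, 0, 1.])   # indicator of the labeled class-b nodes
scores = np.stack([K @ y_a, K @ y_b]) # similarity of every node to each class
pred = scores.argmax(axis=0)          # 0 -> class a, 1 -> class b
print(pred)
```

Each node ends up with the class of its nearest labeled node, since the discounted path counts decay with distance along the chain.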
References
• Shawe‐Taylor J., Cristianini N. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
• Fouss F., Yen L., Pirotte A., Saerens M. An Experimental Investigation of Graph Kernels on Collaborative Recommendation and Semisupervised Classification. Submitted, 2009.
Latent models
Non Negative Matrix Factorization
Apprentissage Statistique ‐ P. Gallinari 140
Non Negative Matrix Factorization
• Idea
o Project the data vectors into a latent space of dimension k < m, the size of the original space
o The axes of this latent space form a new basis for data representation
o Each original data vector is approximated as a linear combination of the k basis vectors of this new space
o Each data point is assigned to its nearest axis
o This provides a clustering of the data
o {x1, ..., xn}, xi ∈ R^m, xi ≥ 0
o X the m × n non‐negative matrix whose columns are the xi
o Find non‐negative factors U, V s.t. X ≈ UV
• With U an m × k matrix and V a k × n matrix, k < m, n

X ≈ U × V
(m × n) (m × k)(k × n)
• Interpretation
o xi ≈ Σ_{j=1..k} vji uj
o The columns uj of U are basis vectors; the vji are the coefficients of xi in this basis
• Optimization
o Solve min_{U,V} ‖X − UV‖² under the constraints U ≥ 0, V ≥ 0
o The loss function is convex in U and in V separately, but not in U and V jointly
• Algorithm
o Constrained optimization problem
o Can be solved via a Lagrangian formulation
o Iterative algorithm (Xu et al. 2003)
• U, V initialized with random values
• Iterate until convergence
o vij ← vij (UᵀX)ij / (UᵀUV)ij
o uij ← uij (XVᵀ)ij / (UVVᵀ)ij
o The solution (U, V) is not unique: if (U, V) is a solution, then (UD, D⁻¹V), for D diagonal positive, is also a solution
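These multiplicative updates are a few lines of numpy. A minimal sketch on synthetic non-negative data of exact rank k (sizes and iteration count are arbitrary illustrative choices):

```python
import numpy as np

# Multiplicative-update sketch for NMF with the squared-error loss:
#   V <- V * (U^T X) / (U^T U V),   U <- U * (X V^T) / (U V V^T)
rng = np.random.default_rng(0)
m, n, k = 20, 30, 3

X = rng.random((m, k)) @ rng.random((k, n))   # non-negative data of exact rank k
U = rng.random((m, k)) + 0.1                  # strictly positive initialization
V = rng.random((k, n)) + 0.1
eps = 1e-9                                    # guards against division by zero

for _ in range(1000):
    V *= (U.T @ X) / (U.T @ U @ V + eps)
    U *= (X @ V.T) / (U @ V @ V.T + eps)

err = np.linalg.norm(X - U @ V) / np.linalg.norm(X)
print(err < 0.1)
```

Note that the updates never subtract anything, so U and V stay non-negative by construction.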
• Clustering
o Normalize the columns of U (each column vector of norm 1) and rescale V accordingly
• uij ← uij / √(Σ_i uij²)
• vji ← vji √(Σ_i uij²)
o Under the constraint "U normalized", the solution (U, V) is unique
o Associate xi with cluster j if vji = max_l vli
• Note
o There are many different versions of NMF
o Different loss functions
• e.g. different constraints on the decomposition
o Different algorithms
• Applications
o Clustering
o Recommendation
o Link prediction
o Etc.
• Specific forms of NMF can be shown to be equivalent to
o PLSA
o Spectral clustering
Illustration (Lee & Seung 1999)
• Basis images for
• NMF
• Vector Quantization
• Principal Component Analysis
Latent models
Probabilistic Latent Semantic Analysis‐ PLSA
Preliminaries: the unigram model

• Generative model of a document
o Select the document length
o Pick a word w with probability p(w|d)
o Continue until the end of the document
• p(d) = Π_i p(wi|d)
• Applications
o Classification
o Clustering
o Ad‐hoc retrieval (language models)
Preliminaries ‐ unigram model: geometric interpretation

• A document d is a point on the word simplex; its coordinates are the probabilities p(wi|d)
• Example (figure): document d represented by p(w1|d) = 1/2, p(w2|d) = 1/4, p(w3|d) = 1/4
Latent models for document generation
• Several factors influence the creation of a document (authors, topics, mood, etc.)
o They are usually unknown
• Generative statistical models
o Associate the factors with latent variables
o Identifying (learning) the latent variables allows us to uncover (by inference) complex latent structures
Probabilistic Latent Semantic Analysis ‐ PLSA (Hofmann 99)
• Motivations
o Several topics may be present in a document or in a document collection
o Learn the topics from a training collection
o Applications
• Identify the semantic content of documents, document relationships, trends, ...
• Segment documents, ad‐hoc IR, ...
PLSA
• The latent structure is a set of topics
o Each document is generated as a set of words chosen from selected topics
o A latent variable z (topic) is associated with each word occurrence in the document
• Generative process
o Select a document d, with probability P(d)
o Iterate
• Choose a latent class z, with probability P(z|d)
• Generate a word w according to P(w|z)
o Note: P(w|z) and P(z|d) are multinomial distributions over the V words and the T topics
PLSA ‐ Topic
• A topic is a distribution P(w|z) over words
• Remarks
o A topic is shared by several words
o A word is associated with several topics

• Example of a topic distribution:
word          P(w|z)
machine       0.04
learning      0.01
information   0.09
retrieval     0.02
...           ...
PLSA as a graphical model

• P(d, w) = P(d) · P(w|d)
• P(w|d) = Σ_z P(w|z) P(z|d)

• Plate notation: d → z → w
o P(z|d) is a document-level parameter, P(w|z) a corpus-level parameter
o Boxes represent repeated sampling: Nd word occurrences per document, D documents in the corpus
PLSA model
• Hypotheses
o The number of values of z is fixed a priori
o Bag of words
o Documents are independent
• No specific distribution over the documents
o Conditional independence
• z being known, w and d are independent
• Learning
o Maximum likelihood: maximize p(Doc‐collection)
o EM algorithm and variants
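The EM iterations for PLSA can be sketched compactly in numpy. The toy term-document counts and the number of topics below are hypothetical; the E-step computes P(z|d,w) ∝ P(z|d)P(w|z) and the M-step re-estimates both multinomials from the expected counts.

```python
import numpy as np

# EM sketch for PLSA on a toy term-document count matrix:
# documents 0-1 mostly use words 0-2, documents 2-3 mostly use words 3-5.
rng = np.random.default_rng(0)
N = np.array([[5, 3, 4, 0, 0, 1],
              [4, 5, 3, 1, 0, 0],
              [0, 1, 0, 4, 5, 3],
              [1, 0, 0, 3, 4, 5]], dtype=float)   # n(d, w) counts
D, W, T = 4, 6, 2

p_z_d = rng.dirichlet(np.ones(T), size=D)          # P(z|d), shape D x T
p_w_z = rng.dirichlet(np.ones(W), size=T)          # P(w|z), shape T x W

def loglik():
    return (N * np.log(p_z_d @ p_w_z)).sum()       # sum_dw n(d,w) log P(w|d)

ll_start = loglik()
for _ in range(100):
    # E-step: P(z|d,w) proportional to P(z|d) P(w|z)
    joint = p_z_d[:, :, None] * p_w_z[None, :, :]  # D x T x W
    p_z_dw = joint / joint.sum(axis=1, keepdims=True)
    # M-step: re-estimate from expected counts n(d,w) P(z|d,w)
    nz = N[:, None, :] * p_z_dw                    # D x T x W
    pw = nz.sum(axis=0)                            # T x W
    p_w_z = pw / pw.sum(axis=1, keepdims=True)
    pz = nz.sum(axis=2)                            # D x T
    p_z_d = pz / pz.sum(axis=1, keepdims=True)
ll_end = loglik()
print(ll_end > ll_start)
```

EM guarantees that the log-likelihood never decreases; on this well-separated toy corpus the two learned topics align with the two document groups.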
PLSA ‐ geometric interpretation

• P(w|d) = Σ_z P(w|z) P(z|d)
• Each topic is a point on the word simplex
• Documents are constrained to lie on the topic simplex (spanned by topic1, topic2, topic3 in the figure)
• This creates a bottleneck in document representation
Applications
• Thematic segmentation
• Creating document hierarchies
• IR: the PLSI model
• Clustering and classification
• Image annotation
o Learn and infer P(w|image)
• Collaborative filtering
• Note: many variants and extensions exist
o E.g. Hierarchical PLSA (see Gaussier et al.)
An introduction to recommender systems:

Collaborative Filtering
Local Neighborhood methods
Matrix Factorization methods
Example ‐ Amazon

Example (2) ‐ Amazon

Example (3) ‐ Netflix

Example (4) ‐ Recsys
Personalized recommendation
• Two main strategies
o Content filtering (not in this course)
• Use product or user characteristics to recommend a list of products to a user
• Learn to associate users with products
o Collaborative filtering: this course
• Use users' previous transactions or ratings to associate users with products
o Introduced by researchers from Xerox PARC in 1992: "Using collaborative filtering to weave an information Tapestry", D. Goldberg, D. Nichols, B.M. Oki, D. Terry
o Domain free
o Implements the "word-of-mouth" principle: given a user, his interests for a given product are predicted using taste information from the other users who, globally, have the same tastes as the current user
• Methods
o Neighborhood methods
o Factorization methods
Collaborative filtering: data

• The data take the form of recommendation matrices
o m users u1, ..., um
o n products i1, ..., in
o The rating matrix R (m × n) contains values rij characterizing the interest of users for products
• rij measures the interest of user i for item j
• rij = ? if the value is unknown
• R is very sparse (often almost empty)
o Measures of interest
• Ratings, binary values (e.g. "like")
• Clicks on search results, purchases, etc.
Collaborative filtering: neighborhood methods

• Ideas
o Define a similarity between users or between items
• Two users are similar if they share the same tastes or have similar interactions with the system
• Two items are similar if they were given similar ratings by many users, or if the user‐product interactions are similar for the two items
o Predict an unknown rating for a product
• User based
o The rating of product p for user u is a weighted average, over the users similar to u, of their known ratings for product p
• Item based
o The rating of product p for user u is a weighted average, over the products similar to p, of the known ratings of user u
User based collaborative filtering
• Similarity measures between users
o Cosine measure
• cos(a, b) = Σ_{j : raj ≠ ?, rbj ≠ ?} raj rbj / ( √(Σ_{j : raj ≠ ?} raj²) √(Σ_{j : rbj ≠ ?} rbj²) )
o Correlation coefficient
• Let r̄a be the average of the known ratings of user a
• corr(a, b) = Σ_{j : raj ≠ ?, rbj ≠ ?} (raj − r̄a)(rbj − r̄b) / ( √(Σ_{j : raj ≠ ?} (raj − r̄a)²) √(Σ_{j : rbj ≠ ?} (rbj − r̄b)²) )
User based collaborative filtering
• Prediction function
o Let
• r̂ij be the predicted rating for user i and product j
• U(i) the K users most similar to user i
o Prediction function
• r̂ij = Σ_{p ∈ U(i), rpj ≠ ?} sim(i, p) rpj / Σ_{p ∈ U(i), rpj ≠ ?} sim(i, p)
• where sim(·, ·) is one of the similarity functions above
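The user-based scheme (cosine over co-rated items, weighted average over the K most similar users) can be sketched in a few lines. The 4×4 rating matrix below is hypothetical, with `np.nan` marking unknown ratings.

```python
import numpy as np

# User-based CF sketch: cosine similarity restricted to co-rated items,
# prediction = similarity-weighted average of the neighbors' ratings.
R = np.array([[5., 4., np.nan, 1.],
              [4., 5., 1., np.nan],
              [1., np.nan, 5., 4.],
              [np.nan, 2., 4., 5.]])
m, n = R.shape

def cos_users(a, b):
    mask = ~np.isnan(R[a]) & ~np.isnan(R[b])        # co-rated items only
    if not mask.any():
        return 0.0
    u, v = R[a, mask], R[b, mask]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def predict(i, j, k=2):
    """Weighted average of the K most similar users' known ratings of item j."""
    sims = sorted(((cos_users(i, p), p) for p in range(m)
                   if p != i and not np.isnan(R[p, j])), reverse=True)[:k]
    den = sum(s for s, _ in sims)
    return sum(s * R[p, j] for s, p in sims) / den if den else np.nan

print(round(predict(0, 2), 2))
```

User 0's closest neighbor (user 1) gave item 2 a low rating, so the prediction lands near the low end of the scale.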
Item based collaborative filtering
• Similarity measures between products
o Cosine measure
• cos(i, j) = Σ_{a : rai ≠ ?, raj ≠ ?} rai raj / ( √(Σ_{a : rai ≠ ?} rai²) √(Σ_{a : raj ≠ ?} raj²) )
o Adjusted cosine
• Let r̄a be the average of the known ratings of user a
• adjcos(i, j) = Σ_{a : rai ≠ ?, raj ≠ ?} (rai − r̄a)(raj − r̄a) / ( √(Σ_{a : rai ≠ ?} (rai − r̄a)²) √(Σ_{a : raj ≠ ?} (raj − r̄a)²) )
o Correlation coefficient
o Other measures
Item based collaborative filtering
• Prediction function
o Let
• r̂ij be the predicted rating for user i and product j
• N(j) the k products most similar to product j
o Prediction function
• r̂ij = Σ_{p ∈ N(j), rip ≠ ?} sim(j, p) rip / Σ_{p ∈ N(j), rip ≠ ?} sim(j, p)
• where sim(·, ·) is one of the similarity functions above
Quality of the predicted ratings
Matrix Factorization methods
• Data
o The interaction (rating) matrix R
• Users and items are mapped onto a common latent factor space of dimensionality d
• User‐item interactions are measured as inner products in this space
Example (Koren et al. 2009)
Basic model
• User i is represented by a d-dimensional vector ui
• Item j is represented by a d-dimensional vector vj
• Predicted rating: r̂ij = uiᵀ vj
• Optimization problem
o min_{U,V} ‖M ⊙ (R − UVᵀ)‖², with M the mask of known ratings
• Where ⊙ denotes elementwise multiplication
• i.e. only the known ratings are considered
• Avoiding overfitting using regularization terms
o min_{U,V} ‖M ⊙ (R − UVᵀ)‖² + λ (‖U‖² + ‖V‖²)
Algorithms
• Two popular approaches
o Stochastic gradient descent
o Alternating Least Squares
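The SGD variant of the regularized objective above can be sketched directly: loop over the known ratings, take a gradient step on the corresponding user and item vectors. The toy matrix, latent dimension, and hyperparameters below are hypothetical illustrative choices (0 marks an unknown rating here).

```python
import numpy as np

# SGD sketch for regularized matrix factorization over the known ratings only:
#   min sum_(i,j known) (r_ij - u_i . v_j)^2 + lam (||u_i||^2 + ||v_j||^2)
rng = np.random.default_rng(0)
R = np.array([[5., 4, 0, 1],
              [4., 5, 1, 0],
              [1., 0, 5, 4],
              [0., 1, 4, 5]])
known = [(i, j) for i in range(4) for j in range(4) if R[i, j] > 0]

d, lam, lr = 2, 0.01, 0.05
U = 0.1 * rng.standard_normal((4, d))   # user factors
V = 0.1 * rng.standard_normal((4, d))   # item factors

for _ in range(2000):                   # epochs over the known ratings
    for i, j in known:
        e = R[i, j] - U[i] @ V[j]       # error on one observed rating
        U[i] += lr * (e * V[j] - lam * U[i])
        V[j] += lr * (e * U[i] - lam * V[j])

rmse = np.sqrt(np.mean([(R[i, j] - U[i] @ V[j]) ** 2 for i, j in known]))
print(rmse < 0.5)
```

Alternating Least Squares would instead fix V, solve for U in closed form, and alternate; SGD is simpler to write and scales to very sparse rating matrices.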