Content

• Link analysis
  o PageRank, HITS, …
• Markov chains, random walks
• Learning on graphs
  o Classification, clustering
  o Transductive and inductive learning
• Theme and sentiment analysis
  o Latent models
  o Markov Chain Monte Carlo
• Graph mining
  o Community detection
  o Diffusion in graphs
• Recommendation
  o Collaborative recommendation
  o Singular value decomposition, non-negative matrix factorization
  o Ranking, etc.

Mining Content Information Networks
Content Information Networks

• Web
  o Hyperlinks
• Social networks
  o Friends, comments, tags, metadata (date, geo-localization, etc.), …
• Bibliographical networks
  o Authors, co-authors, conferences, editor site, metadata, …
• Blogs
  o Comments, messages, backlinks, linkbacks
  o Micro-blogging: followers
• E-mails
  o To, from, subject, date, etc.
• Any collection of content elements with relations
  o Images, video, texts, …
  o Implicit relations based on similarities
• Collaborative recommendation networks
Examples

• Enron e-mail
• 11 K Web hosts – Webspam
• Wikipedia themes classification
• Flickr friendship network
Heterogeneous network
• Modeling
  o Graphs
    • Nodes are content elements
    • Links represent relations
• Characteristics
  o Content elements
    • May be of different types (heterogeneous)
  o Relations
    • Simple
    • Homogeneous
    • Heterogeneous
    • Multiple
    • Directed / undirected
  o Static or dynamic networks
• Needs
  o Structural characteristics of the network
  o Dynamics
    • Network evolution
    • Information propagation
  o Node importance
  o Classification, ranking
  o Content analysis
    • Thematic
    • Sentiment
    • …
Link analysis

PageRank, HITS, SALSA
Motivations

• Computing score functions on graph data
  o Importance of an item
• Web pages
  o The number of incoming links measures the popularity of a page
• Social networks
  o Links measure social interaction (e.g. friendship)
• Scientific literature
  o Impact factor (journals)
    • Average number of citations per published item
  o Classification or ranking score
    • Annotation of items (images)
  o Recommendation
PageRank

• General
  o Popularized by Google
  o Assigns an authority score to each web page
  o Uses only the structure of the web graph (query independent)
  o Now one of the many components used for computing page scores in Google's search engine
• Intuition
  o Assign higher scores to pages with many in-links from authoritative pages with few out-links
• Model
  o Random surfer model
    • Stationary distribution of a Markov chain
  o Principal eigenvector of a linear system
Notations

• G = (V, E) a graph
• A its adjacency matrix
  o Binary matrix
    • aij = 1 if there is a link from i to j
    • aij = 0 otherwise
• P its transition matrix
  o pij = aij / di, for i, j = 1..n
  o di is the out-degree of node vi (di = Σj aij)
  o pij is the probability of moving from node i to node j in the graph
  o P is row stochastic: Σj pij = 1
• Example: a 4-node graph (nodes 1, 2, 3, 4)

  A =
  0  1  0  1
  0  0  1  1
  1  0  0  0
  0  0  1  0

  P =
  0   1/2  0    1/2
  0   0    1/2  1/2
  1   0    0    0
  0   0    1    0

• Basic PageRank
  o Initialize the PageRank score vector to a stochastic vector, e.g. p(0) = (1/4, 1/4, 1/4, 1/4)T
  o Update the PageRank vector until convergence
    • p(k+1) = PT p(k)
    • p(0) = (1/4, 1/4, 1/4, 1/4)T, p(1) = (2/8, 1/8, 3/8, 2/8)T, p(2) = (6/16, 2/16, 5/16, 3/16)T, …
• Conditions for convergence, uniqueness of the solution?
Non-Negative Matrices

• A square matrix Anxn is non-negative if aij ≥ 0
  o Notation: A ≥ 0
  o Example: graph adjacency matrix
• Anxn is positive if aij > 0
  o Notation: A > 0
• Anxn is irreducible if
  o ∀ i, j, ∃ k ∈ ℕ such that (Ak)ij > 0
  o If A is a graph adjacency matrix, this means that G is strongly connected
    • There is a path between any pair of vertices
• Anxn is primitive if ∃ k ∈ ℕ such that Ak > 0
  o A primitive matrix is irreducible
  o The converse is false
Examples (Baldi et al. 2003)
Perron-Frobenius theorem

• Anxn a non-negative irreducible matrix
  o A has a real, positive eigenvalue λ such that λ ≥ |λ'| for any other eigenvalue λ'
  o λ corresponds to a strictly positive eigenvector
  o No other eigenvector is positive
  o λ is a simple root of the characteristic equation det(A − λI) = 0
• Remarks
  o λ is called the dominant eigenvalue of A, and the corresponding eigenvector the dominant eigenvector
    • The dominant eigenvalue is denoted λ1 in the following
  o There might be other eigenvalues λj with |λj| = |λ1|
    • e.g. the matrix
      0  1
      1  0
      is non-negative and irreducible, with the two eigenvalues 1 and −1 on the unit circle
• Perron-Frobenius theorem for a primitive matrix
  o In the first property, the inequality is strict
    • i.e. A has a real, positive eigenvalue λ1 such that λ1 > |λ'| for any other eigenvalue λ'
• For a primitive row-stochastic matrix
  o λ1 = 1, since A1 = 1
• Why is it interesting?
  o It yields a simple procedure for computing the dominant eigenvalue and eigenvector of a matrix using the powers of the matrix
Intuition on the power method

• Let
  o x ∈ ℝn
  o u1, …, un the eigenvectors of A
  o c1, …, cn the coordinates of x in the eigenvector basis
• Then
  o x = Σi ci ui
  o At x = Σi ci λi^t ui
  o Since λ1 dominates, At x / λ1^t → c1 u1 for t large
    • True if x is not orthogonal to u1
    • Since u1 is positive, any positive vector will do
      o e.g. x = 1
Power method

• Let A be a primitive matrix
• Start with an arbitrary vector x0
  o yt = A xt
  o xt+1 = yt / ||yt||
• Convergence
  o Converges towards u1, the eigenvector associated with λ1, the largest eigenvalue of A
  o Whatever the initial vector x0 (provided it is not orthogonal to u1)
• Rate of convergence
  o Geometric, with ratio |λ2 / λ1|
  o λ1 > λ2 are the two dominant eigenvalues of A
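The iteration above can be sketched in a few lines of NumPy. This is an illustrative sketch, not from the slides; the 2×2 matrix, tolerance, and iteration cap are arbitrary choices:

```python
import numpy as np

def power_method(A, tol=1e-10, max_iter=1000):
    """Iterate y = A x, x = y / ||y|| until the direction stabilizes."""
    n = A.shape[0]
    x = np.ones(n) / n               # any positive start vector will do
    for _ in range(max_iter):
        y = A @ x
        x_new = y / np.linalg.norm(y)
        if np.linalg.norm(x_new - x) < tol:
            x = x_new
            break
        x = x_new
    lam = x @ A @ x / (x @ x)        # Rayleigh quotient estimate of lambda_1
    return lam, x

# Primitive (positive) matrix: eigenvalues 3 and 1, dominant eigenvector (1,1)/sqrt(2)
A = np.array([[2.0, 1.0], [1.0, 2.0]])
lam, v = power_method(A)
```

The convergence test on successive iterates is valid here because A is positive, so the iterates keep a consistent sign.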
PageRank

• Recall
  o G a directed graph (the Web)
  o A its adjacency matrix
  o P the transition matrix
• Intuition
  o The rank of a document is high if the rank of its parents is high
  o Embodied e.g. in
    • r(v) ∝ Σw∈pa(v) r(w) / dw
    • r(v): rank value at node v; pa(v): parents of v
  o Each parent w contributes
    • Proportionally to r(w)
    • Inversely to its out-degree dw
  o Amounts to solving r = MT r
    • for a given matrix M
    • An eigenvector problem
Examples (Baldi et al. 2003)
• In order to converge to a stationary solution
  o Remove sink nodes
    • Many such situations on the web
      o Images, files, etc.
  o Make M primitive
Adjustments of the P matrix

• The transition matrix P most often lacks these properties
  o Stochasticity
    • Dangling nodes (nodes with no out-links) make P non-stochastic
    • Rows corresponding to dangling nodes are replaced by a stochastic vector v
      o A common choice is v = (1/n) 1, with 1 the vector of 1s
    • The new transition matrix is
      o P' = P + a.vT
      o ai = 1 if i is a dangling node, 0 otherwise
      o P' is row stochastic
Example (Langville&Meyer 2006)
• Primitivity
  o The matrix shall be primitive in order for the PageRank vector to exist
  o One possible solution is
    • P'' = α P' + (1 − α) 1.vT, with 0 < α < 1 and v a stochastic vector
    • Different v correspond to different random walks
      o v uniform (v = (1/n) 1): teleportation operator in the random walk model
      o v non-uniform: personalization vector
  o P'' is a mixture of two stochastic matrices
    • It is stochastic
    • P'' is trivially primitive, since every node is connected to every node (all entries are positive)
    • α controls the proportion of time P' and 1.vT are used
      o α also controls the convergence rate of the random walk
  o P'' is called the Google matrix
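The construction of P' and P'' can be sketched as follows. This is a hedged illustration on the 4-node example graph used earlier (which has no dangling node, so the dangling-row replacement is a no-op here); α = 0.85 follows the value quoted later from Brin & Page:

```python
import numpy as np

def google_matrix(A, alpha=0.85):
    """Build P'' = alpha * P' + (1 - alpha) * 1 v^T from a binary adjacency matrix."""
    n = A.shape[0]
    v = np.ones(n) / n                   # uniform teleportation vector
    d = A.sum(axis=1)
    P = np.zeros((n, n))
    nz = d > 0
    P[nz] = A[nz] / d[nz, None]          # row-normalize non-dangling rows
    P[~nz] = v                           # replace dangling rows by v  -> P'
    return alpha * P + (1 - alpha) * np.outer(np.ones(n), v)

A = np.array([[0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)
G = google_matrix(A)

# PageRank by power iteration on P''^T, starting from a stochastic vector
y = np.ones(4) / 4
for _ in range(100):
    y = G.T @ y
```

Since G is row stochastic and y(0) is stochastic, every iterate stays stochastic, and the limit is the (strictly positive) PageRank vector.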
• Example (Langville&Meyer 2006)
Two formulations of the PageRank problem
1. Eigenvector solution

• Solve y = P''T y
  o With y a stochastic vector
  o The original PageRank algorithm uses the power method
    • y(k+1) = P''T y(k), with any stochastic starting vector y(0)
  o This rewrites as
    • P''T y = α P'T y + (1 − α) v 1T y = α P'T y + (1 − α) v
    • = α PT y + α (aT y) v + (1 − α) v
  o Note
    • Computations can be performed on the sparse matrix P instead of the dense matrix P''
• Check
  o P'' being stochastic, its dominant eigenvalue is λ1 = 1
  o P'' being primitive (hence irreducible), the eigenvector associated with λ1 (the PageRank vector) is unique, which guarantees convergence
• Rate of convergence
  o For the web graph, convergence is governed by α
  o The rate of convergence is the rate at which αt → 0
    • The initial paper by Brin & Page uses α = 0.85 and 50 to 100 iterations
Two formulations of the PageRank problem
2. Linear system formulation

• Solve (I − α P'T) y = (1 − α) v
  o This can be rewritten as a function of P directly
    • Solve (I − α PT) y = (1 − α + α aT y) v
  o Properties of (I − α PT)
    • It is non-singular
    • Column sums are 1 − α for non-dangling nodes and 1 for dangling nodes
  o Classical iterative methods for linear systems apply
    • e.g. Jacobi, Gauss-Seidel, successive over-relaxation methods
Jacobi method for solving linear systems

• Consider the linear system
  • Ax = b
• Decompose A into
  o A = D + R, with D the diagonal of A
    • The diagonal of R is 0
• Ax = b writes Dx = b − Rx
• If D is invertible, the Jacobi method solves the linear system by iterating
  • Matrix form: x(k+1) = D−1 (b − R x(k))
  • Element form: xi(k+1) = (bi − Σj≠i aij xj(k)) / aii
• Converges if A is strictly diagonally dominant
  • i.e. |aii| > Σj≠i |aij| (strict row diagonal dominance)
• Levy-Desplanques theorem
  o A square matrix with a strictly dominant diagonal is invertible
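The matrix-form update above can be sketched directly in NumPy. A hedged illustration on an arbitrary strictly diagonally dominant 2×2 system (not from the slides):

```python
import numpy as np

def jacobi(A, b, n_iter=100):
    """Solve A x = b by iterating x <- D^-1 (b - R x), A = D + R."""
    D = np.diag(A)                 # diagonal entries of A
    R = A - np.diag(D)             # off-diagonal part, zero diagonal
    x = np.zeros_like(b)
    for _ in range(n_iter):
        x = (b - R @ x) / D
    return x

A = np.array([[4.0, 1.0],          # strictly diagonally dominant:
              [2.0, 5.0]])         # |4| > |1| and |5| > |2|
b = np.array([1.0, 2.0])
x = jacobi(A, b)
```

Diagonal dominance guarantees the iteration contracts, so after a fixed number of sweeps x is an accurate solution of Ax = b.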
• PageRank with Jacobi
• Algorithm
  o Start with an arbitrary vector y(0)
  o Iterate
    • y(k+1) = α P'T y(k) + (1 − α) v
• Personalization vector
  o Any probability vector v with positive elements can be used
  o v = (1/n) 1 gives uniform teleportation
  o Can be used to
    • Personalize the search
    • Control spamming (link farms)
Convergence rate of PageRank

• Theorem (Bianchini et al. 2005)
  o Let y* be the stationary vector of PageRank and
    e(t) = ||y(t) − y*||1 / ||y*||1
    the 1-norm of the relative error in the computation of PageRank at time t; then e(t) ≤ αt
  o If there is no dangling page, then there exists v ≥ 0 with v = P''v such that the equality holds
Random walk interpretation

• The initial formulation of PageRank was in terms of random walks
  o A surfer walks the web and moves from page to page according to a transition probability matrix M
  o Rank of a page v
    • Probability that the surfer is browsing page v
  o M is interpreted as the matrix of a first-order Markov chain
  o The Google vector r is the stationary distribution of a discrete-time Markov chain
Markov Chains

• Stochastic process
  o A set of random variables {Xt} defined on a state space S = {S1, …, Sn}
    • t is often the time; Xt is the state of the process at time t
  o e.g. S: pages of the Web, process: surfing the web, Xt: the web page viewed at time t
• Markov chain
  o A stochastic process that satisfies the Markov property
  o P(Xt+1 = Sj | Xt = Si, …, X0) = P(Xt+1 = Sj | Xt = Si)
  o i.e. a memoryless process: the state at time t+1 only depends on the state at time t
    • P(Xt = Sj | Xt−1 = Si) is the transition probability, i.e. the probability of moving from Si at time t−1 to Sj at time t
    • P is a row-stochastic matrix
• Stationary Markov chain
  o An MC in which the transition probabilities do not depend on t: pij = P(Xt+1 = Sj | Xt = Si), ∀t
• Transition matrix
  o P = (pij)n×n
• Initial distribution vector
  o p(0) = (p1(0), …, pn(0))
    • pi(0) is the probability that the chain starts in Si
• Irreducible Markov chain
  o The transition matrix is irreducible
Graphical representation

• State = circle, transition = directed link, labeled with the transition probability pij
Example 1 (Rabiner – Juang)

• S1 = Rain, S2 = Clouds, S3 = Sun
Example 2

• Web surfing
  o States: pages
  o Transitions: hyperlinks
  o Parameter estimation: statistics on users' browsing
  o Use: model the browsing behavior
Example 3: n-gram language model

• Build a language model which captures the sequential nature of texts in a corpus.
• An n-gram model = an MC of order n−1
• Example: 20 K word vocabulary

  MODEL     # PARAMETERS
  Bigram    20K × 20K = 400·10^6
  Trigram   8·10^12
  4-gram    1.6·10^17
• Probability distribution vector
  • A non-negative vector π = (π1, …, πn) whose components sum to 1
• Stationary distribution vector for an MC with transition matrix P
  • A vector π s.t. πT = πT P
• kth step probability vector of an MC
  • Probability of being in each state at time k
  • p(k) = (p1(k), …, pn(k))
Properties

• For an initial vector p(0), what is the state distribution p(k) at time k?
• Property
  o Let P be the transition matrix of an MC on states S1, …, Sn
  o Pk is the kth step transition matrix
    • [Pk]ij is the probability of moving from i to j in k steps
  o p(k) = (Pk)T p(0) is the kth step probability vector
  o If P is primitive, p(k) → π, with π the unique dominant eigenvector of the transition matrix P
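These properties can be checked numerically. A hedged sketch on an arbitrary 2-state primitive chain (not from the slides); its stationary distribution is π = (5/6, 1/6):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])        # row-stochastic, primitive transition matrix
p = np.array([1.0, 0.0])          # start deterministically in state S1

# k-th step distribution: p(k) = (P^k)^T p(0), computed incrementally
for _ in range(200):
    p = P.T @ p

# For a primitive P, p(k) converges to the stationary distribution pi,
# the solution of pi^T = pi^T P; here pi = (5/6, 1/6).
```

The second eigenvalue of this P is 0.4, so convergence is geometric with ratio 0.4 and 200 steps are far more than enough.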
Random walks on graphs

• Random walk
  o A stochastic process which randomly jumps from node to node
  o i.e. an MC
• G = (V, W) a weighted graph
  o The transition matrix of the random walk is
    • pij = wij / di, for i, j = 1..n
    • di = Σj wij
• If the graph is connected and non-bipartite (P primitive), the random walk possesses a unique stationary distribution
HITS (Kleinberg 98)
• Hub
• Authority
• Hub
  o Points to good authority pages
  o Hub score of a page: sum of the authority scores of its children
• Authority
  o Important reference pages for a topic
  o Pointed to by good hub pages
  o Authority score of a page: sum of the hub scores of its parents
HITS – Algorithm

• Input
  o Web subgraph relative to a query
    • The subgraph is composed of the retrieved documents + the linked (in and out) web documents
    • Only a part of the linked documents is considered (e.g. 100)
• Output
  o Authority and hub scores, a(v) and h(v), for all pages in the graph
• Algorithm
  o Initialize
    • a(v) = 1, h(v) = 1 for all v (any positive vector will do)
  o Repeat
    • a(v) = Σw→v h(w)
    • h(v) = Σv→w a(w)
    • Normalize h and a
      o a ← a / ||a||
      o h ← h / ||h||
  o Until convergence
  o Return the two lists
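The loop above can be sketched as follows; a hedged illustration on the 4-node adjacency matrix used in the PageRank example. One assumption: the hub update here uses the freshly computed authority scores (the slides leave the update order within one sweep unspecified):

```python
import numpy as np

def hits(A, n_iter=100):
    """Mutual hub/authority updates with L2 normalization."""
    n = A.shape[0]
    h = np.ones(n)
    a = np.ones(n)
    for _ in range(n_iter):
        a = A.T @ h                 # a(v) = sum of h(w) over parents w -> v
        h = A @ a                   # h(v) = sum of a(w) over children v -> w
        a /= np.linalg.norm(a)
        h /= np.linalg.norm(h)
    return h, a

A = np.array([[0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)
h, a = hits(A)
```

As the next slides show, this is exactly the power method applied to A Aᵀ (for h) and Aᵀ A (for a).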
HITS algorithm (continued)

• For the subgraph, let
  o h: vector of page hub scores
  o a: vector of page authority scores
  o A: the adjacency matrix
• In matrix form, the algorithm writes
  o ht = A at−1, at = AT ht−1 (+ normalization)
  o or, combining the two updates
    • ht = A AT ht−1, at = AT A at−1 (+ normalization)
• Matrices
  o AT A is called the authority matrix
    • Determines the authority scores
  o A AT is called the hub matrix
    • Determines the hub scores
  o Both are symmetric, positive semi-definite
    • The dominant eigenvalue λ1 is unique
• Algorithm
  o The update algorithm is the power method for the matrices A AT and AT A
  o It converges towards a dominant eigenvector of A AT and of AT A, respectively
• Convergence
  o Although λ1 is unique, it may have multiple eigenvectors, so that the limit will depend on the initial vectors a(0) and h(0)
  o A trick similar to PageRank can be used to make the matrices primitive and converge to a unique eigenvector:
  o Replace A AT with α A AT + (1 − α) 1.vT, with 0 < α < 1 and v a stochastic vector
  o Same thing with AT A
• Example (Langville – Meyer 2006)
• Matrices
  o A symmetric matrix B is positive semi-definite if
    • For all non-zero vectors x, xT B x ≥ 0
    • Or, equivalently, all its eigenvalues are ≥ 0
  o A matrix B is positive definite if
    • ≥ is replaced by >
SALSA (Lempel – Moran 2001)

• Many variants and algorithms were inspired by the success of PageRank and HITS.
• SALSA (Stochastic Approach for Link Structure Analysis) is a stochastic extension of HITS
  o Makes use of a subgraph of the web
  o Computes hub and authority values
• G = (V, E)
• Build a bipartite undirected graph with two sets of vertices
  o Vh: all vertices with out-degree > 0 in G
  o Va: all vertices with in-degree > 0 in G
  o Edges connect Vh to Va
• Perform two separate random walks to compute hub and authority scores, using hub and authority transition matrices H and B
  o Hub score
    • Start from a node in Vh
    • Jump to a node in Va according to H
      o Follow a link in G
    • Jump back to a node in Vh according to B
      o Follow a backlink in G
  o Authority score
    • Same, starting from Va
  o The stationary vectors of the random walks are the two score vectors h and a
    • h and a are the principal eigenvectors of H and B
  o Note
    • Each walk starts on one side of the bipartite graph and remains on this side
Example (Langville – Meyer 2006)
• Transition matrices
  o Hub matrix H
    • hu,v = Σw: (u,w)∈E, (v,w)∈E 1 / (deg(u) deg(w))
    • u, v ∈ Vh, w ∈ Va
  o Authority matrix B
    • bu,v = Σw: (w,u)∈E, (w,v)∈E 1 / (deg(u) deg(w))
    • u, v ∈ Va, w ∈ Vh
• The transition matrices can be computed from the adjacency matrix A of the initial graph G
  o Ar: row-normalized adjacency matrix
  o Ac: column-normalized adjacency matrix
  o H: non-zero rows and columns of Ar (Ac)T
  o B: non-zero rows and columns of (Ac)T Ar
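The matrix construction can be sketched as follows. A hedged illustration on the earlier 4-node example, where every node has both in- and out-links, so the restriction to non-zero rows and columns (needed in general) is omitted:

```python
import numpy as np

A = np.array([[0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)

def normalize_rows(M):
    """Divide each non-zero row by its sum."""
    s = M.sum(axis=1, keepdims=True)
    return np.divide(M, s, out=np.zeros_like(M), where=s > 0)

Ar = normalize_rows(A)        # row-normalized adjacency matrix
Ac = normalize_rows(A.T).T    # column-normalized adjacency matrix
H = Ar @ Ac.T                 # SALSA hub chain
B = Ac.T @ Ar                 # SALSA authority chain
```

On this graph both chains come out row stochastic, as expected for random-walk transition matrices.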
Example (Langville – Meyer 2006)
Bibliography

• Langville A., Meyer C.D. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton, NJ, 2006. ISBN 0691122024.
• Baldi P., Frasconi P., Smyth P. Modeling the Internet and the Web: Probabilistic Methods and Algorithms. Wiley, 2003.
• Bianchini M., Gori M., Scarselli F. Inside PageRank. ACM Transactions on Internet Technology. 2005;5(1):92–128.
Classification and ranking on networked data

• Introduction
  o Motivation
  o Graph Laplacian
• Regularization-based methods
• Collective classification
Relational graph data

• Fast-growing semantic resources
  o Web
  o Social networks
  o Media sharing
• New
  o Services
  o Data types
  o Industrial problems
  o Research problems
• Challenge
  o Machine learning and information retrieval for networked data
Different types of analysis

• Network analysis (Kleinberg, Faloutsos, …)
  o Mainly based on connectivity analysis
  o Structure
  o Dynamics
  o Information propagation
• Machine learning on graph data
  o Mainly classification
  o Two main approaches
    • Collective classification
    • Regularization framework
  o May take into account both content and connectivity
Classification and Ranking on networked data
• Problem
  o Classification and ranking are two important generic problems in machine learning
    • Mainly developed for vectorial and sequential data
  o Classification / ranking on graphs
    • As usual
      o Some data points are labeled
      o Infer the labels of the other nodes
    • Specificity of graph data
      o Node interdependency
      o The label inferred at a given node will depend on its neighbors
Example: webspam detection

• WebSpam challenge 2007
  o 11 K hosts
  o 7 K labeled
  o 26 % spam
• Partial view of the host graph
  o Black: spam
  o White: non-spam
• Example: blog spam
Graph Laplacians (von Luxburg 2007)

• The following definitions hold for undirected graphs
  o Let G = (V, E) be an undirected graph
    • |V| = n, W an n×n non-negative, symmetric weight matrix
    • D an n×n diagonal matrix with dii = Σj wij
  o The unnormalized Laplacian of G is
    • L = D − W
  o Properties
    • ∀ f ∈ ℝn, fT L f = ½ Σi,j wij (fi − fj)²
    • L is symmetric, positive semi-definite
    • L has n non-negative real eigenvalues
    • The smallest eigenvalue of L is 0, with eigenvector 1
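The quadratic-form identity above is easy to verify numerically. A hedged sketch on an arbitrary 3-node weighted graph (a path 1–0–2 with unit weights):

```python
import numpy as np

W = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)   # symmetric, non-negative weights
D = np.diag(W.sum(axis=1))
L = D - W                                # unnormalized Laplacian

f = np.array([1.0, 2.0, 4.0])
quad = f @ L @ f                         # f^T L f
identity = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)
```

Both expressions agree, and L annihilates the constant vector, illustrating that 0 is an eigenvalue with eigenvector 1.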
Graph Laplacians (continued)

• The following two matrices are called normalized Laplacians
  o Lrw = D−1 L = I − D−1 W; Lrw is related to random walks
  o Ls = D−1/2 L D−1/2 = I − D−1/2 W D−1/2; Ls is symmetric
• Properties
  o Ls and Lrw are positive semi-definite; they have n non-negative real eigenvalues
  o 0 is an eigenvalue of Ls and Lrw, with eigenvector D1/2 1 and 1 respectively
  o Ls and Lrw are similar matrices
    • They have the same eigenvalues
    • λ is an eigenvalue of Lrw with eigenvector v iff λ is an eigenvalue of Ls with eigenvector D1/2 v
  o ∀ f ∈ ℝn, fT Ls f = ½ Σi,j wij (fi/√dii − fj/√djj)²
Classification and ranking on networked data

• Introduction
  o Motivation
  o Graph Laplacian
• Regularization-based methods
• Collective classification
Graph labeling with regularization

• The initial framework comes from semi-supervised learning
• Later extended to other settings
  o Classification on graphs
  o Ranking
• Classification framework
  o The classical setting for classification is inductive learning
    • Learn from a set of labeled data
      o Usually manual labeling of the data
    • Infer on new data
• Semi-supervised learning
  o Motivation
    • Labeling data is expensive, while unlabeled data is often available in large quantities
      o Often the case for e.g. web applications
    • Train classifiers using both (few) labeled data and unlabeled data
  o The regularization framework mainly concerns transductive learning
    • i.e. all data (labeled + unlabeled) are available at once
• Where does the graph come from in semi-supervised learning?
• Make use of local data consistency (proximity, similarity) besides global consistency
• See illustration
Data consistency (Zhou et al. 2003)

• Context: semi-supervised learning (SSL)
• SSL relies on local (neighbors share the same label) and global (data structure) data consistency

Fig. from Zhou et al. 2003
• Graph methods for semi-supervised learning – general idea
  o Given
    • An undirected graph G defined on the data points
    • A similarity matrix between nodes in G
    • A set of labeled nodes of G
    • Propagate the observed labels to unlabeled nodes using the similarities
• Notations
  o D = {x1, …, xl, xl+1, …, xn} data points
    • The first l points are labeled, the others unlabeled
  o y: n×1 vector
    • We consider binary classification (classes C1 and C2) for simplicity
    • n: # data points
    • yi: class score for pattern xi
      o e.g. the target is 1 if xi is in C1 and 0 if in C2
  o G = (V, E) an undirected graph
    • A its adjacency matrix
    • W a similarity matrix: Wij is the similarity between nodes i and j
    • S a row-stochastic matrix defined on G
      o Different choices are possible for S
Iterative algorithm for semi-supervised classification – general scheme

• I/O
  o Input
    • Labeled and unlabeled data points
  o Output
    • Labels for all data points
• Algorithm
  o Compute
    • a similarity matrix W
    • a normalized similarity matrix S
  o Iterate until convergence
    • y(t+1) = α S y(t) + (1 − α) y(0)
  o Label each point
    • e.g. y*i = 1 if y*i > 0.5, 0 otherwise
• Example
  o Wij = exp(−||xi − xj||² / 2σ²) for i ≠ j, Wii = 0
  o S = D−1 W
    • D is a diagonal matrix whose ith element is the sum of the ith row of W
  o Where y(0) is:
    • the vector of initial labels for the labeled nodes (1 if C1, 0 if C2)
    • 0 for the unlabeled nodes
• Properties
  o Classical convergence conditions of iterative methods apply
    • e.g. S primitive, α < 1
  o Converges to
    • y* = (1 − α)(I − αS)−1 y(0)
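The iteration and its closed form can be compared numerically. A hedged sketch on a hypothetical 5-node chain graph where only the endpoint node 0 carries the class-1 label (the choices of graph, α, and iteration count are illustrative):

```python
import numpy as np

# Toy chain graph 0-1-2-3-4; node 0 is labeled class 1, node 4 class 2 (score 0)
W = np.zeros((5, 5))
for i in range(4):
    W[i, i + 1] = W[i + 1, i] = 1.0
S = W / W.sum(axis=1, keepdims=True)      # row-stochastic S = D^-1 W
y0 = np.array([1.0, 0, 0, 0, 0])          # initial labels (unlabeled nodes: 0)
alpha = 0.5

# Iterative scheme: y(t+1) = alpha * S y(t) + (1 - alpha) * y(0)
y = y0.copy()
for _ in range(500):
    y = alpha * S @ y + (1 - alpha) * y0

# Closed form: y* = (1 - alpha) (I - alpha S)^{-1} y(0)
y_star = (1 - alpha) * np.linalg.solve(np.eye(5) - alpha * S, y0)
```

Both agree, and the propagated scores decay with distance from the labeled class-1 node, as the smoothness intuition predicts.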
• Different variants
  o Different W matrices can be used
    • Inverse exponential distances
      o Dense connection matrix
      o May use a threshold to make it sparse
    • K nearest neighbors
      o Local connectivity, sparse connection matrix
    • Kernels on graphs
      o See later
  o Any S matrix which satisfies the convergence conditions can be used
    • e.g. S = D−1 W
    • e.g. S = D−1/2 W D−1/2
Iterations
Fig. from Zhou et al. 2003
Same algorithm but with
Regularization view of the algorithm

• y* can be obtained as the solution minimizing the following cost function
  o Q(y) = (1 − α) Σi (yi − yi(0))² + (α/2) Σi,j Sij (yi − yj)²
    • First term: fitting constraint w.r.t. the initial labels
    • Second term: smoothness constraint on neighbor nodes
    • yi(0) = 1 if node i is in class 1, 0 otherwise (class 2 and unlabeled points)
• In compact form (for symmetric S)
  o Q(y) = (1 − α)(y − y(0))T (y − y(0)) + α yT (I − S) y
• Differentiating Q w.r.t. y and setting the gradient to 0 gives
  o (I − α S) y = (1 − α) y(0) (*)
• I − α S is non-singular; the solution is
  o y* = (1 − α)(I − α S)−1 y(0)
• The fixed-point iteration y(t+1) = α S y(t) + (1 − α) y(0) above is the Jacobi iterative algorithm for solving the linear system (*)
Multiclass extension

• Direct extension of the above algorithm
  o Replace the vector yn×1 with a matrix Yn×c
    • c is the number of classes
    • Yij = 1 if xi is of class Cj, and 0 otherwise
    • Y(0) is the matrix of initial labels
    • Final assignment: xi is given class Cj with j = argmaxk Y*ik
Ranking extensions

• Remark
  o Similar formulations have been proposed for ranking in web search engines (e.g. Zhou 2004, Deng 2008)
  o Ideas
    • Documents and queries are the graph vertices
    • Scores are propagated for computing document relevance to queries while considering document similarity
    • Documents are ranked for each query according to their scores
Content + link information

• Propagation methods
  o do not directly consider the content of the different nodes
  o content only appears through the similarity or kernel matrix
• It is possible to use the graph regularization idea together with content-based classifiers
Content + link information (continued)
Abernethy et al. 2010 (classification), Denoyer et al. 2010 (ranking)

• Context
  o Transductive semi-supervised learning
  o Each node is characterized by content information
    • e.g. image, text, other
• Content classifier term
  o Σi∈labeled Δ(f(xi), yi(0))
• Smoothing term
  o Σ(i,j)∈E wij ||f(xi) − f(xj)||²
• Regularized content + link classifier
  o Σi∈labeled Δ(f(xi), yi(0)) + λ Σ(i,j)∈E wij ||f(xi) − f(xj)||²
• Learning
  o Gradient-like algorithm for learning the parameters of f
  o Extensions allow learning the weights as well
Ranking model for image annotation in a social network (Denoyer 2009)

• Problem
  o Automatic annotation of images in large social networks (e.g. Flickr)
  o Considers simultaneously
    • Explicit relations (authorship, friendship)
    • Implicit relations (similarity)
    • Different types of content
      o Text, image
• Approach
  o Regularization-based method
• Cost function
  o Fitting term: ranking function
  o Regularity term: based on one type of relation
• Results
  o Importance of social links
    • Authors, friendship
  o Large improvement over non-relational (classical) ranking methods
  o Little improvement with implicit relations
Experiments
• 3 corpora extracted from Flickr
Results
Other extensions

• Directed graphs
• Multiple relations
• Heterogeneous networks
• Zhou D., Bousquet O., Lal T.N., Weston J., Schölkopf B. Learning with local and global consistency. In: Advances in Neural Information Processing Systems 16 (NIPS 2003). 2004:595–602.
• Zhu X., Ghahramani Z., Lafferty J. Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of ICML 2003.
• Abernethy J., Chapelle O., Castillo C. Graph regularization methods for Web spam detection. Machine Learning. 2010;81(2):207–225.
Classification and ranking on networked data

• Introduction
  o Motivation
  o Graph Laplacian
• Regularization-based methods
• Collective classification
Networked data

• Available information
  o Connectivity
    • Labels: partial labeling
    • Links
      o Usually assumed known – explicit or implicit
  o Node features
  o Others
    • Label metric
• Three types of correlations can then be exploited
  o Correlation between the label of node i and its features
  o Correlation between the label of i and the observed features and/or labels of node i's neighbors
  o Correlation between the label of i and the unobserved features and labels of node i's neighbors
• Solving the global label assignment problem is usually NP-hard
• Exact inference algorithms, when they exist, are too costly
• Most methods use approximate inference algorithms
• Note: most methods consider only
  o Unweighted links
  o Single links
Notations and problem definition

• Notations
  o Graph G = (V, E)
  o Node i features: xi
    • xi may incorporate input features (e.g. text) and/or relational features
      o Local features: e.g. neighbor labels, number of neighbors, …
      o Global features: e.g. centrality
  o Node i label: yi
  o Neighborhood of node i: N(i)
  o Labels take their values in L = {l1, …, lp}
• Classification problem
  o Some labels and/or features being observed
  o Infer the unobserved labels of the other nodes
Collective classification methods (Sen et al. 2008)

• Usual scheme
  o Bootstrap
    • Assign an initial value to each node using a local classifier
    • Any classifier may be used
  o Iterate
    • Compute node labels using graph contextual information
    • Iterations are needed since the new label values for the nodes in N(i) provide new information for yi
• Most methods for collective classification thus require
  o A relational classifier
  o An iteration policy
Collective classification methods

• Gibbs sampling
• Iterative classification
• Relaxation labeling
• Stacked learning
• Random walks, …
Feature vectors

• For vector classifiers, xi should be of fixed size
  o Neighborhoods N(i) may be of variable size for different nodes i
  o Usual solution: use aggregate features in order to build fixed-size feature vectors
    • e.g. # class-k labels in N(i), class-k relative frequency in N(i), majority label in N(i), …
• The value of xi may change from one iteration to the other
  o xi must be recomputed at each iteration
• Example – aggregate features
Iterative classification (Neville et al. 2000, Lu et al. 2003)

• Bootstrap
  o For each unlabeled node i
    • Local classifier
      o Compute xi
      o Compute label yi using the observed nodes in N(i): yi = F(xi)
• Iterate
  o Generate an ordering of the unlabeled nodes
  o For each unlabeled node i
    • Relational classifier
      o Compute xi
      o Compute label yi using N(i): yi = F(xi)
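The bootstrap/iterate loop can be sketched on a toy problem. Everything here is a hypothetical illustration: the 4-node chain, the scalar node features, and the hand-rolled classifier F (which averages the local feature with the neighbor-label frequency, a simple aggregate feature) are all made up for the sketch:

```python
# Toy chain 0-1-2-3; nodes 0 and 3 are observed, 1 and 2 are unlabeled
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labels = {0: 1, 3: 0}                  # observed labels
features = {1: 0.8, 2: 0.2}           # local features of the unlabeled nodes

def F(x_local, neigh_labels):
    """Toy relational classifier: mix the local feature with an
    aggregate feature (frequency of label 1 among labeled neighbors)."""
    if neigh_labels:
        rel = sum(neigh_labels) / len(neigh_labels)
        return 1 if (x_local + rel) / 2 > 0.5 else 0
    return 1 if x_local > 0.5 else 0

# Bootstrap: label each unlabeled node from currently labeled neighbors only
y = dict(labels)
for i in features:
    y[i] = F(features[i], [y[j] for j in adj[i] if j in y])

# Iterate: recompute labels until they stabilize (fixed sweep count here)
for _ in range(10):
    for i in features:
        y[i] = F(features[i], [y[j] for j in adj[i]])
```

On this toy graph the labels stabilize after one sweep: node 1 follows its class-1 neighbor, node 2 its class-2 side.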
Simulated Iterative Classification (Maes et al. 2009)

• Training bias in ICA
  o Training is performed on correct labels
  o Inference is performed with noisy labels

SICA (continued)

• Idea
  o Training and test conditions should be made similar
  o Make training examples representative of test ones by simulating inference during learning
  o How
    • Repeatedly run inference during training, sampling from the current classifier's distribution of predicted labels
    • Different sampling schemes
Gibbs sampling (McDowell et al. 2007, Neville et al. 2007)

• Simplified version of the original Gibbs sampling strategy (Geman & Geman 84)
  o Introduces a classifier F() not present in the original Gibbs sampler
• Training often requires a fully labeled training set
• Inference
  o Sample the outputs for each node and take the majority label
• Differences with IC
  o Label sampling
  o Sequential update
  o Any classifier can be used for the Bootstrap and Iterate steps
Gibbs sampling

• Bootstrap
  o For each unlabeled node i
    • Local classifier
      o Compute xi
      o Compute label yi using the observed nodes in N(i): yi = F(xi)
  o For each label l
    • Counts[i, l] = 0
• Iterate
  o Generate an ordering of the unlabeled nodes
  o For each unlabeled node i
    • Relational classifier
      o Compute xi
      o Compute label yi using N(i): yi = F(xi)
    • Counts[i, yi] = Counts[i, yi] + 1
• Finally, yi = argmaxl Counts[i, l]
Remarks

• For both ICA and Gibbs
  o A sequential update of the unobserved labels is performed
  o Any classifier can be used for the Bootstrap and Iterate steps
  o Hard labels are computed at each step
  o Node ordering has no real impact
  o The choice of classifier may impact the performance
• Training
  o Usually requires a fully annotated data set
ICA – Gibbs
Stacked graphical learning (Cohen et al. XX)

• The main difference is in the training phase
• Ideas
  o Train a local classifier y = F(x)
  o Train a second classifier using both the input x and the predicted outputs in N(i)
  o Uses stacked learning
• Usually requires only a few (even 1!) iterations
Stacked graphical learning
• Training
o Bootstrap
• Learn a local classifier F0 on the training set D
o Iterate k = 1 to K
• Build training set Dk by augmenting xi with the predicted neighbor labels YN(i): xk = (x, YN(x))
• Learn Fk on Dk
• Note: this step uses stacked learning
o Final model: FK
• Inference
o y0 = F0(x)
o For k = 1 to K
• Compute xk as above
• yk = Fk(xk)
o Final prediction: yK = FK(xK)
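A minimal sketch of this training/inference scheme with one stacking step (K = 1). The chain graph, the noisy features, and the hand-rolled nearest-centroid classifier standing in for the base learner F are all hypothetical; a proper stacked setup would use held-out predictions (next slide) rather than in-sample ones.

```python
import numpy as np

# Stacked graphical learning sketch, K = 1, nearest-centroid base classifier.
rng = np.random.default_rng(0)

def fit_centroid(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_centroid(model, X):
    classes = sorted(model)
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[d.argmin(axis=0)]

# toy chain graph: node i is linked to i-1 and i+1, labels form two blocks
n = 40
y = np.array([0] * 20 + [1] * 20)
X = y[:, None] + 0.3 * rng.standard_normal((n, 2))   # noisy local features
nbrs = [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]

f0 = fit_centroid(X, y)                  # local classifier F0
y0 = predict_centroid(f0, X)             # in-sample predictions (simplification)

def augment(X, yhat):
    # relational feature: mean predicted label of the neighbors
    rel = np.array([yhat[nb].mean() for nb in nbrs])
    return np.hstack([X, rel[:, None]])

f1 = fit_centroid(augment(X, y0), y)         # stacked classifier F1
y1 = predict_centroid(f1, augment(X, y0))    # inference: F0, then F1
accuracy = float((y1 == y).mean())
print(accuracy)
```

The stacked classifier sees both the local features and a summary of the neighbors' predicted labels, which is the whole point of the augmentation step.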
Stacked learning
o For robustness, training is performed using stacked learning
o Training set D
• Let D1, ..., Dm be a partition of D
• Fk is trained as follows:
o Train m functions fi
• fi is trained on D – Di
• For x ∈ Di, y = F(x) = fi(x)
o Note
• At each iteration a different partition is used
• This prevents overfitting
Some tests
• Sequential data (handwritten word recognition)
Other approaches
• Extensions of graphical model classifiers have been proposed for collective and relational classification
o Directed models
• Relational Bayesian Networks (Taskar et al. 2001)
o Undirected models
• Relational Dependency Networks (Neville et al. 2003, 2007)
• Relational Markov Networks (Taskar et al. 2002)
o ...
Special case: univariate classification
• When the input features are ignored, collective classification is known as univariate collective classification
• Labels are propagated from observed nodes to unlabeled nodes
• All the above methods can be used in this setting (Macskassy et al. 2007)
References
• Sen P, Namata G, Bilgic M, Getoor L, Gallagher B, Eliassi-Rad T. Collective classification in network data. AI Magazine. 2008;29(3):1‐24.
Graph kernels
Motivations
• Graph kernels make it possible to define similarities between the nodes of a graph, based on the graph structure
o e.g. the number of paths connecting two nodes, the mean weight of these paths, etc.
o i.e. complex similarity measures
• Distance measures between nodes can easily be derived from these similarities
o The kernel framework allows a large variety of distance measures to be considered, and represents the nodes of a graph as points in a Euclidean space
• Link with regularization-based approaches
o Some graph kernels may be obtained as solutions to the optimization of loss functions
• Link with random walks
o Some graph kernels may be defined in terms of random walks on the graph
Kernels
• Kernels are "similarity" functions k(x, x') s.t.
o k(x, x') can be computed via an inner product of some transformation of x and x' in a feature space
• Definition
o K: X × X → R is a kernel function if for all x, z in X, K(x, z) = <Φ(x), Φ(z)>, where Φ is a mapping from X into an inner product feature space (Hilbert space)
• Initial motivations in machine learning
o Non-linear classification
• Map the data onto a possibly high dimensional space, so that the problem becomes linear in that space
• Limit the complexity of similarity computations to O(input space dimension)
o Computations may be performed in the original (smaller) space at a linear cost
• Illustration: x and x' are mapped to Φ(x) and Φ(x') in feature space; K(x, x') computes their inner product there
Kernel functions ‐ examples

• Linear kernel
o K(x, z) = x · z

• Degree-2 polynomial kernels
o K(x, z) = (x · z)² = (Σ_i x_i z_i)(Σ_j x_j z_j) = Σ_{i,j=1..n} (x_i x_j)(z_i z_j)
• Hence K(x, z) = Φ(x) · Φ(z) with Φ(x) = (x_i x_j)_{i,j=1..n}
• i.e. Φ maps x onto all the monomials of degree 2
o K(x, z) = (x · z + c)² = Φ(x) · Φ(z) with Φ(x) = ((x_i x_j)_{i,j=1..n}, (√(2c) x_i)_{i=1..n}, c)
• i.e. a subset of the set of polynomials of degree ≤ 2
Kernels on finite spaces
• Let
o X = {x1, ..., xN}, with x ∈ X
o K(x, x') a symmetric function, K: X × X → R
• K is a kernel function iff the matrix K = (K(xi, xj))_{i,j=1..N} is positive semi-definite
• There are several equivalent characterizations of a kernel function
o The matrix K is symmetric positive semi-definite iff any of the following properties holds
• xᵀ K x ≥ 0, ∀x
• All the eigenvalues of the real matrix K are real and non-negative
• K = Bᵀ B for some real matrix B
o B is not unique in general, and different decompositions may exist
• K = (K(xi, xj))_{i,j=1..N} is called the kernel matrix (or Gram matrix) of the sample
Symmetric matrices ‐ useful properties

• A symmetric matrix has only real eigenvalues
• The eigenvectors of a symmetric matrix are orthogonal and can then be chosen orthonormal
• If a symmetric matrix A has k non-zero eigenvalues, then it can be diagonalized and expressed as
o A = U Λ Uᵀ
o With
• Λ the diagonal k×k matrix of eigenvalues
o Usually ordered in decreasing order
• U the n×k orthonormal matrix of corresponding eigenvectors
• Another expression for A is
o A = (U Λ^(1/2))(U Λ^(1/2))ᵀ = X Xᵀ
• The data matrix associated with a kernel matrix
o Let
• K be a kernel matrix
• K = (U Λ^(1/2))(U Λ^(1/2))ᵀ = X Xᵀ its eigenvalue decomposition
• xi the ith column vector of Xᵀ
o Then
• (K)ij = xiᵀ xj
• xi is an r-dimensional vector (r the rank of K); this is the feature vector associated with the ith pattern
• xi is the Euclidean representation of pattern i in this space
o When pattern i characterizes a node in a graph, xi is its "Euclidean representation"
o X is called the data matrix associated with the kernel matrix K
o Note
• This means that data points in complex spaces (e.g. graphs) may be represented in a Euclidean space using this data matrix representation
• Classical Euclidean operations (dot products, distances, projections) can be defined on these complex objects via the kernel matrix directly
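The data-matrix construction above can be sketched in a few lines of numpy; the kernel here is a hypothetical choice (A², which is PSD for any symmetric A).

```python
import numpy as np

# Sketch: recovering a "data matrix" X with K = X X^T from a kernel matrix,
# via the eigendecomposition K = U Lambda U^T and X = U Lambda^(1/2).
A = np.array([[0., 1, 1, 1],
              [1., 0, 0, 1],
              [1., 0, 0, 0],
              [1., 1, 0, 0]])
K = A @ A                      # A^2 = A A^T is PSD, hence a valid kernel matrix

w, U = np.linalg.eigh(K)       # eigenvalues may carry tiny negative round-off
X = U * np.sqrt(np.clip(w, 0, None))   # scales column j of U by sqrt(w_j)

# rows of X are Euclidean embeddings of the nodes: dot products reproduce K,
# and ||x_i - x_j||^2 = K_ii + K_jj - 2 K_ij
K_rec = X @ X.T
dist_01 = np.linalg.norm(X[0] - X[1])
print(np.allclose(K_rec, K))
```

The rows of X can then be fed to any Euclidean method (k-means, projections, etc.) even though the original objects are graph nodes.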
• Angles
o cos(xi, xj) = xiᵀ xj / (‖xi‖ ‖xj‖) = Kij / √(Kii Kjj)
• Distances
o ‖xi − xj‖² = Kii + Kjj − 2 Kij = (ei − ej)ᵀ K (ei − ej)
• With ei = (0, ..., 0, 1, 0, ..., 0)ᵀ, with 1 in position i
Kernels on graphs
Similarity between graph nodes

Graph → "metric" space

How to define meaningful kernels on graphs?
Example: kernels based on the adjacency matrix

• Kernel matrices
o K = Aⁿ : (Aⁿ)ij = number of paths of length n between i and j
o K = A + A² + ... + Aⁿ : all paths of length 1 to n
o K = Σ_{n≥1} γⁿ Aⁿ : infinite discounted sum

• Example on a 4-node graph v1, v2, v3, v4:

A =            A² (# paths of length 2) =
0 1 1 1        3 1 0 1
1 0 0 1        1 2 1 1
1 0 0 0        0 1 1 1
1 1 0 0        1 1 1 2
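These quantities are easy to check numerically. A small sketch on the same 4-node example, also verifying that the infinite discounted sum has the closed form (I − γA)⁻¹ − I (the value of γ is an arbitrary choice below the convergence bound):

```python
import numpy as np

# Adjacency-based kernels: (A^n)_ij counts length-n paths, and the discounted
# sum K = sum_{n>=1} g^n A^n equals (I - g A)^{-1} - I for g < 1/rho(A).
A = np.array([[0., 1, 1, 1],
              [1., 0, 0, 1],
              [1., 0, 0, 0],
              [1., 1, 0, 0]])

A2 = A @ A                                  # number of length-2 paths
rho = np.abs(np.linalg.eigvalsh(A)).max()   # spectral radius of A
g = 0.5 / rho                               # discount factor, ensures convergence
K = np.linalg.inv(np.eye(4) - g * A) - np.eye(4)

# the closed form matches the truncated series
K_series = sum(g ** n * np.linalg.matrix_power(A, n) for n in range(1, 60))
print(A2.astype(int))
```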
Diffusion kernels on graphs (Shawe‐Taylor et al. 2004)
• Let B = (bij), i,j = 1..n, denote a similarity matrix between the graph nodes, s.t. bij is the similarity between nodes i and j; B is symmetric
• Consider the following similarity:
o bij(2) = Σ_k bik bkj
o i.e. bij(2) is the sum of the similarities of all length-2 paths between i and j, where the similarity of a path is the product of the similarities of its edges
o Then B(2) = B², and B² is a kernel matrix
• In the same way
o B^k is a kernel matrix
• It gives the sum of all length-k path similarities
o Any linear combination of powers of B is a kernel matrix
• Von Neumann diffusion kernel
o K_VN = Σ_{k≥1} γᵏ Aᵏ
o Aᵏ measures the number of paths of length k between any pair of nodes
o The similarity between two nodes i and j, (K_VN)ij, integrates contributions from all the paths from i to j in the graph, with a discounting factor decreasing with k
o Here the importance of a path decreases with its length
o The series converges if 0 < γ < (ρ(A))⁻¹, with ρ(A) the spectral radius of A
o It converges to K_VN = (I − γA)⁻¹ − I
o Note
• Similar kernels can be defined for any symmetric matrix B
Other graph kernels (Fouss et al. 2009)

• Exponential diffusion kernel
o K_ED = Σ_{k≥0} αᵏ Aᵏ / k! = exp(αA), with A the adjacency matrix of graph G
o (Aᵏ)ij is the number of length-k paths between i and j
o Similar to the Von Neumann kernel, with a different discounting rate
• The Laplacian exponential diffusion kernel is the same as the exponential diffusion kernel, except that the adjacency matrix A is replaced with minus the Laplacian matrix L
o K_LED = Σ_{k≥0} αᵏ (−L)ᵏ / k! = exp(−αL)
• The regularized Laplacian kernel is similar to the Von Neumann kernel, with minus the unnormalized Laplacian −L substituted for A
o K_RL = Σ_{k≥0} γᵏ (−L)ᵏ = (I + γL)⁻¹
o This kernel also appears in the regularized approach to semi‐supervised learning (slide XX)
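The exponential diffusion kernel can be computed by truncating its series. A minimal sketch on the same 4-node example graph (the value of α is an arbitrary illustrative choice):

```python
import numpy as np

# Exponential diffusion kernel sketch: K_ED = sum_{k>=0} alpha^k A^k / k!,
# i.e. the matrix exponential expm(alpha * A), via a truncated series.
A = np.array([[0., 1, 1, 1],
              [1., 0, 0, 1],
              [1., 0, 0, 0],
              [1., 1, 0, 0]])
alpha = 0.5

K = np.eye(4)                      # k = 0 term
term = np.eye(4)
for k in range(1, 30):             # 30 terms are plenty for this alpha
    term = term @ (alpha * A) / k  # alpha^k A^k / k!
    K += term

# K is symmetric positive definite, hence a valid kernel matrix
min_eig = np.linalg.eigvalsh(K).min()
print(min_eig > 0)
```

The factorial discount decays much faster than the geometric γᵏ of the Von Neumann kernel, so long paths contribute far less here.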
• Random walk with restart kernel
o Consider a random walker which jumps from node i to node j with a probability given by a row-stochastic transition matrix P, and which at each step jumps back to its start node i with probability 1 − α (Gori et al. 2006)
o The walk is described by the following process
• x(0) = ei
• x(t+1) = α Pᵀ x(t) + (1 − α) ei
• The steady-state solution for a walk starting at node i is
o x = (1 − α)(I − α Pᵀ)⁻¹ ei
o x is the ith column of (1 − α)(I − α Pᵀ)⁻¹; it provides a similarity between node i and the other nodes of the graph
• The random walk with restart matrix is K_RWR = (1 − α)(I − α Pᵀ)⁻¹
o P is transposed because x is a column vector of probabilities over the nodes, while P is row-stochastic
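The steady state can be computed for all start nodes at once by inverting a single matrix. A sketch on the 4-node example graph (the restart probability is an arbitrary illustrative value):

```python
import numpy as np

# Random-walk-with-restart sketch: steady state x_i = (1-a)(I - a P^T)^{-1} e_i.
A = np.array([[0., 1, 1, 1],
              [1., 0, 0, 1],
              [1., 0, 0, 0],
              [1., 1, 0, 0]])
P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
a = 0.8                                # continue with prob a, restart with 1 - a
n = len(A)

# column i of X is the steady-state distribution of a walk restarting at node i;
# P is transposed because the state x is a column vector of node probabilities
X = (1 - a) * np.linalg.inv(np.eye(n) - a * P.T)
print(np.round(X, 3))
```

Each column sums to 1: it is a genuine probability distribution over the nodes, concentrated around the restart node.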
Using graph kernels for recommendation
• Collaborative filtering
o U a set of users
o I a set of items
o Each user rates some of the items
• The user‐item matrix is sparse
o Collaborative filtering
• Recommend items to users
• Usually based on the similarity of user ratings
• Predict the missing ratings for users
• Many different techniques
• Example: user‐item rating matrix (? = unknown)

User \ Item   1   2   3   4
1             5   ?   ?   3
2             ?   ?   ?   ?
3             ?   3   ?   ?
4             ?   ?   ?   ?
5             1   2   ?   ?
• Popular challenges on movie recommendation
o CAMRa2010
o Netflix Prize
• On September 21, 2009 we awarded the $1M Grand Prize to team “BellKor’s Pragmatic Chaos”.
• There are currently 51051 contestants on 41305 teams from 186 different countries. We have received 44014 valid submissions from 5169 different teams;
Using graph kernels for recommendation
• Let G = (V, E) be the user‐item bipartite graph
o Nodes V are the users and the items
o Links
• A is the adjacency matrix
• aij = 1 if user i has rated item j
• aij = 0 otherwise
Using graph kernels for recommendation
• The different graph kernels can be computed on this bipartite graph
o The kernel matrix K, of size (M+N) × (M+N), provides the similarities between graph nodes
o It can be partitioned into 4 blocks
• K = [K_UU K_UI ; K_IU K_II]
o K_UU is the M×M user‐user similarity matrix
o K_II is the N×N item‐item similarity matrix
o K_UI is the M×N user‐item preference matrix and K_IU = K_UIᵀ its transpose
Using graph kernels for recommendation
• Three ways of computing recommendations (Fouss et al. 2009)
o Direct
• Use sim(Useri, Itemj) for a direct ranking of the recommendations
o User based
• Compute sim(Useri, Userj)
• Keep the k nearest neighbors of Useri
o k is a hyperparameter
o The recommendation score of item j for user i is
• score_user(i, j) = Σ_{p ∈ kNN(i)} sim(Useri, Userp) · apj / Σ_{p ∈ kNN(i)} sim(Useri, Userp)
• apj = 1 if Userp rated Itemj and 0 otherwise
Using graph kernels for recommendation
o Item based
• Compute sim(Itemi, Itemj)
• Keep the k nearest neighbors of Itemj
o k is a hyperparameter
o The recommendation score of item j for user i is
• score_item(i, j) = Σ_{p ∈ kNN(j)} sim(Itemj, Itemp) · aip / Σ_{p ∈ kNN(j)} sim(Itemj, Itemp)
• aip = 1 if Useri rated Itemp and 0 otherwise
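The direct strategy can be sketched end to end: build the bipartite adjacency, compute a graph kernel, and rank the unrated items by the user-item block. The 3-user/4-item ratings below are hypothetical, and the Von Neumann kernel with γ = 0.5/ρ(A) is one arbitrary choice among the kernels above.

```python
import numpy as np

# Direct kernel-based recommendation sketch on a toy user-item bipartite graph.
Rb = np.array([[1., 1, 0, 0],     # user 0 liked items 0, 1
               [1., 0, 1, 0],     # user 1 liked items 0, 2
               [0., 0, 1, 1]])    # user 2 liked items 2, 3
m, n = Rb.shape

A = np.zeros((m + n, m + n))      # bipartite adjacency: users first, then items
A[:m, m:] = Rb
A[m:, :m] = Rb.T

rho = np.abs(np.linalg.eigvalsh(A)).max()
gamma = 0.5 / rho
K = np.linalg.inv(np.eye(m + n) - gamma * A) - np.eye(m + n)  # Von Neumann kernel

K_UI = K[:m, m:]                  # user-item block: sim(User_i, Item_j)
user = 0
scores = K_UI[user].copy()
scores[Rb[user] == 1] = -np.inf   # do not re-recommend already rated items
best_item = int(scores.argmax())
print(best_item)
```

User 0 shares item 0 with user 1, who also liked item 2, so the shortest user-item-user-item paths push item 2 to the top.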
Using graph kernels for classification
• Kernels may be used for semi‐supervised classification
o Several of the matrices obtained via transductive approaches to semi‐supervised learning are indeed kernels – or similarity matrices
o Given a kernel matrix K (n×n), a simple classification rule is:
• Let yc be an n×1 vector s.t. yc,i = 1 if node i is from class c and 0 otherwise (unknown or other class)
• K · yc is the vector of class-c scores for the graph nodes
o It computes the similarity of each node with the labeled nodes of class c
• Finally, a node is classified into the class with the highest score
o This is similar to what we did with regularization-based approaches
• Note
o Other, more sophisticated classification rules might be used
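The K · yc rule is a one-liner once a kernel is available. A sketch on a hypothetical 6-node chain graph with one labeled node per class, using the Von Neumann kernel (γ chosen below 1/ρ(A)):

```python
import numpy as np

# Sketch of the K . y_c rule on a toy chain 0-1-2-3-4-5,
# with node 0 labeled class a and node 5 labeled class b.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]:
    A[i, j] = A[j, i] = 1.0

gamma = 0.2                                           # < 1 / spectral radius of A
K = np.linalg.inv(np.eye(6) - gamma * A) - np.eye(6)  # Von Neumann kernel

y_a = np.array([1., 0, 0, 0, 0, 0])   # indicator of the labeled class-a nodes
y_b = np.array([0, 0, 0, 0, 0, 1.])   # indicator of the labeled class-b nodes
scores = np.stack([K @ y_a, K @ y_b]) # similarity of every node to each class
pred = scores.argmax(axis=0)          # 0 -> class a, 1 -> class b
print(pred)
```

Each node ends up with the class of its nearest labeled node, since the discounted path counts decay with distance along the chain.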
References
• Shawe‐Taylor J., Cristianini N. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
• Fouss F., Yen L., Pirotte A., Saerens M. An Experimental Investigation of Graph Kernels on Collaborative Recommendation and Semisupervised Classification. Submitted, 2009.
Latent models
Non Negative Matrix Factorization
Apprentissage Statistique ‐ P. Gallinari 140
Non Negative Matrix Factorization
• Idea
o Project the data vectors into a latent space of dimension k < m, the size of the original space
o The axes of this latent space form a new basis for data representation
o Each original data vector is approximated as a linear combination of the k basis vectors of this new space
o Each data point is assigned to its nearest axis
o This provides a clustering of the data
o {x1, ..., xn}, xi ∈ R^m, xi ≥ 0
o X the m × n non‐negative matrix whose columns are the xi
o Find non‐negative factors U, V s.t. X ≈ UV
• With U an m × k matrix and V a k × n matrix, k < m, n

X ≈ U × V
(m × n) (m × k)(k × n)
• Interpretation
o xi ≈ Σ_{j=1..k} vji uj
o The columns uj of U are basis vectors; the vji are the coefficients of xi in this basis
• Optimization
o Solve min_{U,V} ‖X − UV‖² under the constraints U ≥ 0, V ≥ 0
o The loss function is convex in U and in V separately, but not in U and V jointly
• Algorithm
o Constrained optimization problem
o Can be solved via a Lagrangian formulation
o Iterative algorithm (Xu et al. 2003)
• U, V initialized with random values
• Iterate until convergence
o vij ← vij (UᵀX)ij / (UᵀUV)ij
o uij ← uij (XVᵀ)ij / (UVVᵀ)ij
o The solution (U, V) is not unique: if (U, V) is a solution, then (UD, D⁻¹V), for D diagonal positive, is also a solution
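These multiplicative updates are a few lines of numpy. A minimal sketch on synthetic non-negative data of exact rank k (sizes and iteration count are arbitrary illustrative choices):

```python
import numpy as np

# Multiplicative-update sketch for NMF with the squared-error loss:
#   V <- V * (U^T X) / (U^T U V),   U <- U * (X V^T) / (U V V^T)
rng = np.random.default_rng(0)
m, n, k = 20, 30, 3

X = rng.random((m, k)) @ rng.random((k, n))   # non-negative data of exact rank k
U = rng.random((m, k)) + 0.1                  # strictly positive initialization
V = rng.random((k, n)) + 0.1
eps = 1e-9                                    # guards against division by zero

for _ in range(1000):
    V *= (U.T @ X) / (U.T @ U @ V + eps)
    U *= (X @ V.T) / (U @ V @ V.T + eps)

err = np.linalg.norm(X - U @ V) / np.linalg.norm(X)
print(err < 0.1)
```

Note that the updates never subtract anything, so U and V stay non-negative by construction.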
• Clustering
o Normalize the columns of U (each column vector of norm 1) and rescale V accordingly
• uij ← uij / √(Σ_i uij²)
• vji ← vji √(Σ_i uij²)
o Under the constraint "U normalized", the solution (U, V) is unique
o Associate xi with cluster j if vji = max_l vli
• Note
o There are many different versions of NMF
o Different loss functions
• e.g. different constraints on the decomposition
o Different algorithms
• Applications
o Clustering
o Recommendation
o Link prediction
o Etc.
• Specific forms of NMF can be shown to be equivalent to
o PLSA
o Spectral clustering
Illustration (Lee & Seung 1999)
• Basis images for
• NMF
• Vector Quantization
• Principal Component Analysis
Latent models
Probabilistic Latent Semantic Analysis‐ PLSA
Preliminaries: the unigram model

• Generative model of a document
o Select the document length
o Pick a word w with probability p(w|d)
o Continue until the end of the document
• p(d) = Π_i p(wi|d)
• Applications
o Classification
o Clustering
o Ad‐hoc retrieval (language models)
Preliminaries ‐ unigram model: geometric interpretation

• A document d is a point on the word simplex; its coordinates are the probabilities p(wi|d)
• Example (figure): document d represented by p(w1|d) = 1/2, p(w2|d) = 1/4, p(w3|d) = 1/4
Latent models for document generation
• Several factors influence the creation of a document (authors, topics, mood, etc.)
o They are usually unknown
• Generative statistical models
o Associate the factors with latent variables
o Identifying (learning) the latent variables allows us to uncover (by inference) complex latent structures
Probabilistic Latent Semantic Analysis ‐ PLSA (Hofmann 99)
• Motivations
o Several topics may be present in a document or in a document collection
o Learn the topics from a training collection
o Applications
• Identify the semantic content of documents, document relationships, trends, ...
• Segment documents, ad‐hoc IR, ...
PLSA
• The latent structure is a set of topics
o Each document is generated as a set of words chosen from selected topics
o A latent variable z (topic) is associated with each word occurrence in the document
• Generative process
o Select a document d, with probability P(d)
o Iterate
• Choose a latent class z, with probability P(z|d)
• Generate a word w according to P(w|z)
o Note: P(w|z) and P(z|d) are multinomial distributions over the V words and the T topics
PLSA ‐ Topic
• A topic is a distribution P(w|z) over words
• Remarks
o A topic is shared by several words
o A word is associated with several topics

• Example of a topic distribution:
word          P(w|z)
machine       0.04
learning      0.01
information   0.09
retrieval     0.02
...           ...
PLSA as a graphical model

• P(d, w) = P(d) · P(w|d)
• P(w|d) = Σ_z P(w|z) P(z|d)

• Plate notation: d → z → w
o P(z|d) is a document-level parameter, P(w|z) a corpus-level parameter
o Boxes represent repeated sampling: Nd word occurrences per document, D documents in the corpus
PLSA model
• Hypotheses
o The number of values of z is fixed a priori
o Bag of words
o Documents are independent
• No specific distribution over the documents
o Conditional independence
• z being known, w and d are independent
• Learning
o Maximum likelihood: maximize p(Doc‐collection)
o EM algorithm and variants
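The EM iterations for PLSA can be sketched compactly in numpy. The toy term-document counts and the number of topics below are hypothetical; the E-step computes P(z|d,w) ∝ P(z|d)P(w|z) and the M-step re-estimates both multinomials from the expected counts.

```python
import numpy as np

# EM sketch for PLSA on a toy term-document count matrix:
# documents 0-1 mostly use words 0-2, documents 2-3 mostly use words 3-5.
rng = np.random.default_rng(0)
N = np.array([[5, 3, 4, 0, 0, 1],
              [4, 5, 3, 1, 0, 0],
              [0, 1, 0, 4, 5, 3],
              [1, 0, 0, 3, 4, 5]], dtype=float)   # n(d, w) counts
D, W, T = 4, 6, 2

p_z_d = rng.dirichlet(np.ones(T), size=D)          # P(z|d), shape D x T
p_w_z = rng.dirichlet(np.ones(W), size=T)          # P(w|z), shape T x W

def loglik():
    return (N * np.log(p_z_d @ p_w_z)).sum()       # sum_dw n(d,w) log P(w|d)

ll_start = loglik()
for _ in range(100):
    # E-step: P(z|d,w) proportional to P(z|d) P(w|z)
    joint = p_z_d[:, :, None] * p_w_z[None, :, :]  # D x T x W
    p_z_dw = joint / joint.sum(axis=1, keepdims=True)
    # M-step: re-estimate from expected counts n(d,w) P(z|d,w)
    nz = N[:, None, :] * p_z_dw                    # D x T x W
    pw = nz.sum(axis=0)                            # T x W
    p_w_z = pw / pw.sum(axis=1, keepdims=True)
    pz = nz.sum(axis=2)                            # D x T
    p_z_d = pz / pz.sum(axis=1, keepdims=True)
ll_end = loglik()
print(ll_end > ll_start)
```

EM guarantees that the log-likelihood never decreases; on this well-separated toy corpus the two learned topics align with the two document groups.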
PLSA ‐ geometric interpretation

• P(w|d) = Σ_z P(w|z) P(z|d)
• Each topic is a point on the word simplex
• Documents are constrained to lie on the topic simplex (spanned by topic1, topic2, topic3 in the figure)
• This creates a bottleneck in document representation
Applications
• Thematic segmentation
• Creating document hierarchies
• IR: the PLSI model
• Clustering and classification
• Image annotation
o Learn and infer P(w|image)
• Collaborative filtering
• Note: many variants and extensions exist
o E.g. Hierarchical PLSA (see Gaussier et al.)
An introduction to recommender systems:

Collaborative Filtering
Local Neighborhood methods
Matrix Factorization methods
Example ‐ Amazon

Example (2) ‐ Amazon

Example (3) ‐ Netflix

Example (4) ‐ Recsys
Personalized recommendation
• Two main strategies
o Content filtering (not in this course)
• Use product or user characteristics to recommend a list of products to a user
• Learn to associate users with products
o Collaborative filtering: this course
• Use users' previous transactions or ratings to associate users with products
o Introduced by researchers from Xerox PARC in 1992: "Using collaborative filtering to weave an information Tapestry", D. Goldberg, D. Nichols, B.M. Oki, D. Terry
o Domain free
o Implements the "word-of-mouth" principle: given a user, his interests for a given product are predicted using taste information from the other users who, globally, have the same tastes as the current user
• Methods
o Neighborhood methods
o Factorization methods
Collaborative filtering: data

• The data take the form of recommendation matrices
o m users u1, ..., um
o n products i1, ..., in
o The rating matrix R (m × n) contains values rij characterizing the interest of users for products
• rij measures the interest of user i for item j
• rij = ? if the value is unknown
• R is very sparse (often almost empty)
o Measures of interest
• Ratings, binary values (e.g. "like")
• Clicks on search results, purchases, etc.
Collaborative filtering: neighborhood methods

• Ideas
o Define a similarity between users or between items
• Two users are similar if they share the same tastes or have similar interactions with the system
• Two items are similar if they were given similar ratings by many users, or if the user‐product interactions are similar for the two items
o Predict an unknown rating for a product
• User based
o The rating of product p for user u is a weighted average, over the users similar to u, of their known ratings for product p
• Item based
o The rating of product p for user u is a weighted average, over the products similar to p, of the known ratings of user u
User based collaborative filtering
• Similarity measures between users
o Cosine measure
• cos(a, b) = Σ_{j : raj ≠ ?, rbj ≠ ?} raj rbj / ( √(Σ_{j : raj ≠ ?} raj²) √(Σ_{j : rbj ≠ ?} rbj²) )
o Correlation coefficient
• Let r̄a be the average of the known ratings of user a
• corr(a, b) = Σ_{j : raj ≠ ?, rbj ≠ ?} (raj − r̄a)(rbj − r̄b) / ( √(Σ_{j : raj ≠ ?} (raj − r̄a)²) √(Σ_{j : rbj ≠ ?} (rbj − r̄b)²) )
User based collaborative filtering
• Prediction function
o Let
• r̂ij be the predicted rating for user i and product j
• U(i) the K users most similar to user i
o Prediction function
• r̂ij = Σ_{p ∈ U(i), rpj ≠ ?} sim(i, p) rpj / Σ_{p ∈ U(i), rpj ≠ ?} sim(i, p)
• where sim(·, ·) is one of the similarity functions above
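The user-based scheme (cosine over co-rated items, weighted average over the K most similar users) can be sketched in a few lines. The 4×4 rating matrix below is hypothetical, with `np.nan` marking unknown ratings.

```python
import numpy as np

# User-based CF sketch: cosine similarity restricted to co-rated items,
# prediction = similarity-weighted average of the neighbors' ratings.
R = np.array([[5., 4., np.nan, 1.],
              [4., 5., 1., np.nan],
              [1., np.nan, 5., 4.],
              [np.nan, 2., 4., 5.]])
m, n = R.shape

def cos_users(a, b):
    mask = ~np.isnan(R[a]) & ~np.isnan(R[b])        # co-rated items only
    if not mask.any():
        return 0.0
    u, v = R[a, mask], R[b, mask]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def predict(i, j, k=2):
    """Weighted average of the K most similar users' known ratings of item j."""
    sims = sorted(((cos_users(i, p), p) for p in range(m)
                   if p != i and not np.isnan(R[p, j])), reverse=True)[:k]
    den = sum(s for s, _ in sims)
    return sum(s * R[p, j] for s, p in sims) / den if den else np.nan

print(round(predict(0, 2), 2))
```

User 0's closest neighbor (user 1) gave item 2 a low rating, so the prediction lands near the low end of the scale.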
Item based collaborative filtering
• Similarity measures between products
o Cosine measure
• cos(i, j) = Σ_{a : rai ≠ ?, raj ≠ ?} rai raj / ( √(Σ_{a : rai ≠ ?} rai²) √(Σ_{a : raj ≠ ?} raj²) )
o Adjusted cosine
• Let r̄a be the average of the known ratings of user a
• adjcos(i, j) = Σ_{a : rai ≠ ?, raj ≠ ?} (rai − r̄a)(raj − r̄a) / ( √(Σ_{a : rai ≠ ?} (rai − r̄a)²) √(Σ_{a : raj ≠ ?} (raj − r̄a)²) )
o Correlation coefficient
o Other measures
Item based collaborative filtering
• Prediction function
o Let
• r̂ij be the predicted rating for user i and product j
• N(j) the k products most similar to product j
o Prediction function
• r̂ij = Σ_{p ∈ N(j), rip ≠ ?} sim(j, p) rip / Σ_{p ∈ N(j), rip ≠ ?} sim(j, p)
• where sim(·, ·) is one of the similarity functions above
Quality of the predicted ratings
Matrix Factorization methods
• Data
o The interaction (rating) matrix R
• Users and items are mapped onto a common latent factor space of dimensionality d
• User‐item interactions are measured as inner products in this space
Example (Koren et al. 2009)
Basic model
• User i is represented by a d-dimensional vector ui
• Item j is represented by a d-dimensional vector vj
• Predicted rating: r̂ij = uiᵀ vj
• Optimization problem
o min_{U,V} ‖M ⊙ (R − UVᵀ)‖², with M the mask of known ratings
• Where ⊙ denotes elementwise multiplication
• i.e. only the known ratings are considered
• Avoiding overfitting using regularization terms
o min_{U,V} ‖M ⊙ (R − UVᵀ)‖² + λ (‖U‖² + ‖V‖²)
Algorithms
• Two popular approaches
o Stochastic gradient descent
o Alternating Least Squares
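The SGD variant of the regularized objective above can be sketched directly: loop over the known ratings, take a gradient step on the corresponding user and item vectors. The toy matrix, latent dimension, and hyperparameters below are hypothetical illustrative choices (0 marks an unknown rating here).

```python
import numpy as np

# SGD sketch for regularized matrix factorization over the known ratings only:
#   min sum_(i,j known) (r_ij - u_i . v_j)^2 + lam (||u_i||^2 + ||v_j||^2)
rng = np.random.default_rng(0)
R = np.array([[5., 4, 0, 1],
              [4., 5, 1, 0],
              [1., 0, 5, 4],
              [0., 1, 4, 5]])
known = [(i, j) for i in range(4) for j in range(4) if R[i, j] > 0]

d, lam, lr = 2, 0.01, 0.05
U = 0.1 * rng.standard_normal((4, d))   # user factors
V = 0.1 * rng.standard_normal((4, d))   # item factors

for _ in range(2000):                   # epochs over the known ratings
    for i, j in known:
        e = R[i, j] - U[i] @ V[j]       # error on one observed rating
        U[i] += lr * (e * V[j] - lam * U[i])
        V[j] += lr * (e * U[i] - lam * V[j])

rmse = np.sqrt(np.mean([(R[i, j] - U[i] @ V[j]) ** 2 for i, j in known]))
print(rmse < 0.5)
```

Alternating Least Squares would instead fix V, solve for U in closed form, and alternate; SGD is simpler to write and scales to very sparse rating matrices.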