



Hypergraph p-Laplacian: A Differential Geometry View

Shota Saito, The University of Tokyo, [email protected]

Danilo P. Mandic, Imperial College London, [email protected]

Hideyuki Suzuki, Osaka University, [email protected]

Abstract

The graph Laplacian plays key roles in the information processing of relational data, and has analogies with the Laplacian in differential geometry. In this paper, we generalize the analogy between the graph Laplacian and differential geometry to the hypergraph setting, and propose a novel hypergraph p-Laplacian. Unlike the existing two-node graph Laplacians, this generalization makes it possible to analyze hypergraphs, where the edges are allowed to connect any number of nodes. Moreover, we propose a semi-supervised learning method based on the proposed hypergraph p-Laplacian, and formalize it as the analogue of the Dirichlet problem, which often appears in physics. We further explore theoretical connections to the normalized cut on a hypergraph, and propose a normalized cut corresponding to the hypergraph p-Laplacian. The proposed p-Laplacian is shown to outperform standard hypergraph Laplacians in experiments on hypergraph semi-supervised learning and normalized cut settings.

Introduction

Graphs are a standard way to represent pairwise relationship data on both regular and irregular domains. One of the most important operators characterizing a graph is the graph Laplacian, which can be explained in several ways: examples include spectral clustering (von Luxburg 2007), normalized graph cut (Shi and Malik 1997; Yu and Shi 2003), random walks (Meila and Shi 2001; Grady 2006), and analogues to the differential geometry of graphs (Branin 1966; Grady and Schwartz 2003; Zhou and Schölkopf 2006; Bougleux, Elmoataz, and Melkemi 2007).

Hypergraphs are a natural generalization of graphs, where the edges are allowed to connect more than two nodes (Berge 1984). The data representation with a hypergraph is used in a variety of applications (Huang, Liu, and Metaxas 2009; Liu, Latecki, and Yan 2010; Klamt, Haus, and Theis 2009; Tan et al. 2014). This natural generalization of graphs motivates us to consider a natural generalization of the Laplacian to hypergraphs, which can be applied to hypergraph clustering problems. However, there is no straightforward approach to generalize the graph Laplacian to a hypergraph Laplacian. One way is to model a hypergraph as a tensor, for which we can define a Laplacian (Cooper and Dutle 2012; Hu and Qi 2015) and construct hypergraph cut algorithms (Bulò and Pelillo 2009; Ghoshdastidar and Dukkipati 2014). However, this requires the hypergraph to obey the strict condition of a k-uniform hypergraph, where each edge connects exactly k nodes. The second approach is to construct a weighted graph, which can deal with arbitrary hypergraphs. Rodriguez's approach defines the Laplacian of an arbitrary hypergraph through the adjacency matrix of a weighted graph (Rodriguez 2002). Zhou's approach defines a hypergraph Laplacian from the normalized cut viewpoint, and outperforms Rodriguez's Laplacian on a clustering problem (Zhou, Huang, and Schölkopf 2006). However, Rodriguez's Laplacian does not consider how many nodes are connected by each edge, and Zhou's Laplacian is not consistent with the graph Laplacian. Although all of the previous studies consider analogues to the graph Laplacian, none of them considers the analogue to the Laplacian from differential geometry. Considering this analogue allows us to further extend to a more general hypergraph p-Laplacian, which has not been extensively studied, unlike the graph p-Laplacian (Bühler and Hein 2009; Zhou and Schölkopf 2006).

In this paper, we generalize the analogy between the graph Laplacian and differential geometry to the hypergraph setting, and propose a novel hypergraph p-Laplacian, which is consistent with the graph Laplacian. We define the gradient of a function over a hypergraph, and derive the divergence and Laplacian as formulated in differential geometry. Taking advantage of this formulation, we extend our hypergraph Laplacian to a hypergraph p-Laplacian, which allows us to better capture hypergraph characteristics. We also propose a semi-supervised machine learning method based upon this p-Laplacian. Our experiments on hypergraph semi-supervised clustering problems show that our hypergraph p-Laplacian outperforms the current hypergraph Laplacians.

The versatility of differential geometry allows us to introduce several rigorous interpretations of our hypergraph Laplacian. A normalized cut formulation is shown to yield the proposed hypergraph Laplacian in the same manner as for standard graphs. We further propose a normalized cut corresponding to our p-Laplacian, which shows better performance than the cuts corresponding to current Laplacians in the experiments. We also explore the physical interpretation of the hypergraph Laplacian by considering the analogue of the continuous p-Dirichlet problem, which is widely used in physics. All proofs are given in the Appendix.

arXiv:1711.08171v1 [stat.ML] 22 Nov 2017


Differential Geometry on Hypergraphs

Preliminary Definition of Hypergraph

In this section, we review standard definitions and notations from hypergraph theory; we refer to the literature (Berge 1984) for a more comprehensive study. A hypergraph G is a pair (V, E), where E ⊂ ∪_{k=1}^{|V|} ∪_{{v_1,...,v_k}⊂V} {[v_{σ(1)}, . . . , v_{σ(k)}] | σ ∈ S_k}, and S_k denotes the set of permutations σ on {1, . . . , k}. An element of V is called a vertex or node, and an element of E is referred to as an edge or hyperedge of the hypergraph. A hypergraph is connected if the intersection graph of its edges is connected; in what follows, we assume that the hypergraph G is connected. A hypergraph is undirected when the set of edges is symmetric, and we denote the set of undirected edges as E_un = E/S, where S = ∪_{k=1}^{|V|} S_k. In other words, the edges [v_1, v_2, . . . , v_k] ∈ E and [v_{σ(1)}, v_{σ(2)}, . . . , v_{σ(k)}] ∈ E are not distinguished in E_un for any σ ∈ S_k, where k is the number of nodes in the edge. A hypergraph is weighted when it is associated with a function w : E → R_+. For an undirected hypergraph it holds that w([v_1, v_2, . . . , v_k]) = w([v_{σ(1)}, v_{σ(2)}, . . . , v_{σ(k)}]). We define the degree of a node v ∈ V as d(v) = ∑_{e∈E: v∈e} w(e), while the degree of an edge e ∈ E is defined as δ(e) = |e|. To simplify the notation, we write δ_e instead of δ(e).
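To make these preliminaries concrete, the following minimal Python sketch (ours, not the authors' code; the toy hypergraph and all names are illustrative assumptions) stores an undirected hypergraph as node tuples with weights and computes the node degree d(v) and edge degree δ_e defined above.

```python
# Toy undirected hypergraph: edges are tuples of node ids, with weights w(e).
# All names (edges, weights, node_degree, edge_degree) are illustrative.
edges = [(0, 1, 2), (1, 2), (2, 3, 4, 5)]
weights = [1.0, 1.0, 1.0]

def node_degree(v, edges, weights):
    """d(v) = sum of w(e) over edges e containing v."""
    return sum(w for e, w in zip(edges, weights) if v in e)

def edge_degree(e):
    """delta_e = |e|, the number of nodes the edge connects."""
    return len(e)

print([node_degree(v, edges, weights) for v in range(6)])
print([edge_degree(e) for e in edges])
```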

We define H(V) as a Hilbert space of real-valued functions endowed with the usual inner product

⟨f, g⟩_{H(V)} := ∑_{v∈V} f(v) g(v)   (1)

for all f, g ∈ H(V). Accordingly, the Hilbert space H(E) is defined with the inner product

⟨f, g⟩_{H(E)} := ∑_{e∈E} (1/δ_e!) f(e) g(e).   (2)

Note that f and g in H(E) are defined for directed edges.

Hypergraph Gradient and Divergence Operators

We shall now extend the standard graph gradient and divergence operators studied in (Zhou and Schölkopf 2006) to hypergraphs; the extensions can be considered as hypergraph analogues in both the discrete and the continuous case. First, we propose to define the hypergraph gradient as follows.

Definition 1. The hypergraph gradient is an operator ∇ : H(V) → H(E) defined by

(∇ψ)([v_1, . . . , v_{δ_e}] = e) := (√w(e)/√(δ_e − 1)) ∑_{i=1}^{δ_e} ( ψ(v_i)/√d(v_i) − ψ(v_1)/√d(v_1) ).   (3)

The gradient is defined as a sum of pairwise smoothness terms between the node e[1] and the other nodes of the edge. Since the coefficient of the graph gradient is defined as the square root of the weight w(e) (Zhou and Schölkopf 2006), we derive the coefficient for a hypergraph by averaging over the pairs between e[1] and each other node, i.e., w(e) divided by δ_e − 1, in order to normalize the effect of the weight. For an undirected hypergraph, we define a gradient for an edge e ∈ E_un and a vertex v, i.e., (∇ψ)(e, v). Using the gradient defined for each edge, we can define the gradient at each node v as ∇ψ(v) := {(∇ψ)(e)/δ_e! | e[1] = v, e ∈ E}, where e[1] denotes the first element of the edge e. Then, the norm of the gradient ∇ψ at a node v is defined by

‖∇ψ(v)‖ := ( ∑_{e∈E: e[1]=v} (∇ψ)²(e)/δ_e! )^{1/2}.   (4)

It then follows that this definition satisfies the conditions of a norm in a metric space. The p-Dirichlet sum of the function ψ is given by

S_p(ψ) := ∑_{v∈V} ‖∇ψ(v)‖^p.   (5)

Loosely speaking, the norm of the gradient at a node of a hypergraph measures the local smoothness of the function around the node, and the Dirichlet sum measures the total roughness over the hypergraph. Remark that ‖∇ψ‖ is defined in the space H(E) as ‖∇ψ‖ = ⟨∇ψ, ∇ψ⟩^{1/2}_{H(E)}, and satisfies S_2(ψ) = ‖∇ψ‖².
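As a small illustration of Eqs. (3)-(5), here is a hedged Python sketch (ours; the toy hypergraph and the function ψ are assumptions, not data from the paper). It uses the fact that, for an undirected edge, the (δ_e − 1)! directed orderings sharing the same first element all yield one gradient value, so the 1/δ_e! weight in Eq. (4) collapses to 1/δ_e per undirected edge.

```python
import math

# Toy undirected hypergraph and function psi; names are illustrative.
edges = [(0, 1, 2), (1, 2), (2, 3)]
weights = [1.0, 1.0, 1.0]
psi = {0: 1.0, 1: 0.5, 2: 0.0, 3: -0.5}
d = {v: sum(w for e, w in zip(edges, weights) if v in e) for v in psi}

def grad_norm(v):
    """||grad psi(v)||, Eq. (4). The (delta_e - 1)! directed orderings with
    e[1] = v share one gradient value, so 1/delta_e! collapses to 1/delta_e."""
    total = 0.0
    for e, w in zip(edges, weights):
        if v in e:
            g = math.sqrt(w / (len(e) - 1)) * sum(   # Eq. (3) with e[1] = v
                psi[u] / math.sqrt(d[u]) - psi[v] / math.sqrt(d[v]) for u in e)
            total += g ** 2 / len(e)
    return math.sqrt(total)

print({v: round(grad_norm(v), 4) for v in psi})
print(sum(grad_norm(v) ** 2 for v in psi))   # p-Dirichlet sum, Eq. (5), p = 2
```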

Definition 2. The hypergraph divergence is an operator div : H(E) → H(V) which satisfies, for all ψ ∈ H(V) and all φ ∈ H(E),

⟨∇ψ, φ⟩_{H(E)} = ⟨ψ, −div φ⟩_{H(V)}.   (6)

Notice that Eq. (6) can be regarded as a hypergraph analogue of Stokes' theorem on manifolds. The divergence can now be written in closed form as follows:

Proposition 3.

div φ(v) = −∑_{e∈E: v∈e} ( √w(e)/(δ_e! √(δ_e − 1) √d(v)) ) φ(e) + ∑_{e∈E: e[1]=v} δ_e ( √w(e)/(δ_e! √(δ_e − 1) √d(v)) ) φ(e).   (7)

Intuitively, Eq. (7) measures the net flow at the vertex v; the first term counts the outflows from the originator v, and the second term measures the inflows towards v. Note that this allows us to use Eq. (7) as a definition of the divergence; it satisfies Eq. (6), analogously to Stokes' theorem in the continuous case. Note also that the divergence is always 0 if φ is undirected, i.e., if φ(v_1, v_2, . . . , v_k) = φ(v_{σ(1)}, v_{σ(2)}, . . . , v_{σ(k)}).

Laplace Operators

In this section, we present the hypergraph p-Laplace operator, which can be considered as a discrete analogue of the Laplacian in the continuous case.

Definition 4. The hypergraph p-Laplacian is an operator ∆_p : H(V) → H(V) defined by

∆_p ψ := −div(‖∇ψ‖^{p−2} ∇ψ).   (8)

This operator is linear for p = 2. For an undirected hypergraph, we obtain the hypergraph p-Laplacian as follows.

Proposition 5.

(∆_p ψ)(v) = ∑_{u∈V\{v}} ( d_p(v) ψ(v)/√d(v) − w_p(u, v) ψ(u)/√d(u) ),   (9)

denoting w_p(v, v) = 0 and

w_p(u, v) = ∑_{e∈E_un: u,v∈e} (w(e)/(δ_e − 1)) × ( −‖∇ψ_e‖^{p−2} + ‖∇ψ(u)‖^{p−2} + ‖∇ψ(v)‖^{p−2} ),

and

d_p(v) = d(v) ‖∇ψ(v)‖^{p−2} − ∑_{e∈E_un: v∈e} (w(e)/(δ_e − 1)) ( −‖∇ψ_e‖^{p−2} + ‖∇ψ(v)‖^{p−2} ),

where ‖∇ψ_e‖^p = ∑_{v′∈e} ‖∇ψ(v′)‖^p / δ_e.

Let W_p be the matrix whose elements are w_p(u, v), and let D_p be the diagonal matrix whose elements are d_p(v, v) = d_p(v). For the case p = 2, which is the standard setting for the hypergraph Laplacian, this becomes

(∆ψ)(v) = ∑_{u∈V\{v}} ( d(v) ψ(v)/√d(v) − w(u, v) ψ(u)/√d(u) ),   (10)

where

w(u, v) = ∑_{e∈E_un: u,v∈e} w(e)/(δ_e − 1),   w(u, u) = 0.

We denote W_2 by W, and by D the diagonal matrix whose elements are d(u, u) = d(u). Note that d_p(v, v) = ∑_{u∈V} w_p(v, u). Using these matrices, the Laplacian in (8) can be rewritten as

(∆_p ψ) = D^{−1/2}(D_p − W_p) D^{−1/2} ψ.   (11)

We shall denote the matrix associated with the Laplacian by L_p = D^{−1/2}(D_p − W_p) D^{−1/2}, so that the Dirichlet sum can be rewritten using L_p as follows.

Proposition 6. The Dirichlet sum S_p(ψ) can be rewritten as

S_p(ψ) = ψ^⊤ L_p ψ.   (12)

Note that L_p depends on the function ψ, while L := L_2 is independent of it. When the hypergraph degenerates into a standard graph and p = 2, L coincides with the graph Laplacian.

From the above analysis, the following three statements follow straightforwardly.

Proposition 7. ⟨ψ, ∆_pψ⟩_{H(V)} = S_p(ψ).

Corollary 8. The Laplacian L_p is positive semi-definite.

Proposition 9. ∂/∂ψ S_p(ψ) = p ∆_pψ.

Remark 1. For a standard graph, the discussion in this section reduces to the discrete geometry for standard graphs introduced in (Zhou and Schölkopf 2006). This implies that our proposed definition is a natural generalization of the discrete geometry of a graph.
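The matrix form lends itself to a direct numerical check. The sketch below (ours; the toy hypergraph and ψ are assumptions) computes the Dirichlet sum from the gradient of Eqs. (3)-(5), builds L = D^{−1/2}(D − W)D^{−1/2} from Eq. (10), and confirms Proposition 6 for p = 2.

```python
import math
import numpy as np

# Toy hypergraph; names are illustrative, not from the paper.
edges = [(0, 1, 2), (1, 2), (2, 3)]
weights = [1.0, 1.0, 1.0]
n = 4
psi = np.array([1.0, 0.5, 0.0, -0.5])
d = np.array([sum(w for e, w in zip(edges, weights) if v in e) for v in range(n)])

# Dirichlet sum S_2(psi) from the gradient, Eqs. (3)-(5).
S2 = 0.0
for e, w in zip(edges, weights):
    for v in e:
        g = math.sqrt(w / (len(e) - 1)) * sum(
            psi[u] / math.sqrt(d[u]) - psi[v] / math.sqrt(d[v]) for u in e)
        S2 += g ** 2 / len(e)          # the 1/delta_e! factor collapses to 1/delta_e

# Matrix form, Eqs. (10)-(12): L = D^{-1/2}(D - W)D^{-1/2}.
W = np.zeros((n, n))
for e, w in zip(edges, weights):
    for u in e:
        for v in e:
            if u != v:
                W[u, v] += w / (len(e) - 1)   # w(u, v) below Eq. (10)
D_is = np.diag(1.0 / np.sqrt(d))
L = D_is @ (np.diag(d) - W) @ D_is
print(S2, psi @ L @ psi)               # the two values agree (Proposition 6)
```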

Hypergraph Regularization

Hypergraph Regularization Algorithm

In this section, we consider the hypergraph regularization problem and propose a novel solution. Given a hypergraph G = (V, E) and a label set Y = {−1, 1}, and assuming that a subset S ⊂ V is labeled, the problem is to classify the vertices in V\S using the labels of S. To solve this, we formulate hypergraph regularization as follows. The regularization of a given function y ∈ H(V) aims to find a function ψ*, which enforces smoothness on all the nodes of the hypergraph and, at the same time, closeness to the values of the given function y:

ψ* = argmin_{ψ∈H(V)} ( S_p(ψ) + µ‖ψ − y‖² ),   (13)

where y(v) takes the value −1 or 1 if v is labeled, and 0 otherwise. The first energy term represents the smoothness explained in Eq. (5), while the second term is a regularization term. Let E_p(ψ, y, µ) be the objective function of Eq. (13). Since a positive power of a positive convex function is also convex, E_p is a convex function of ψ, meaning that Eq. (13) has a unique solution, satisfying

∂E_p/∂ψ |_v = ∂/∂ψ ‖∇ψ(v)‖^p + 2µ(ψ(v) − y(v)) = 0   (14)

for all v ∈ V. Using Proposition 9, we can rewrite the left-hand side of Eq. (14), so that

p(∆_pψ)(v) + 2µ(ψ(v) − y(v)) = 0, ∀v ∈ V.   (15)

The solution to the problem (13) is therefore also a solution to (15). Substituting the expression of the Laplacian from Eq. (9) into Eq. (15) yields

p ∑_{u∈V} l_p(u, v) ψ(u) + 2µ(ψ(v) − y(v)) = 0,   (16)

where l_p(u, v) is an element of L_p. To solve Eq. (16) numerically, we use the Gauss-Jacobi iterative algorithm, similarly to the discrete case introduced in (Bougleux, Elmoataz, and Melkemi 2007). Let ψ^{(t)} be the solution obtained at iteration step t; the update rule of the corresponding linearized Gauss-Jacobi algorithm is then given by

ψ^{(t+1)}(v) = ∑_{u∈V\{v}} c^{(t)}(u, v) ψ^{(t)}(u) + m^{(t)}(v) y(v),   (17)

where

c^{(t)}(u, v) = −p l_p^{(t)}(u, v) / ( p l_p^{(t)}(v, v) + 2µ )   and   m^{(t)}(v) = 2µ / ( p l_p^{(t)}(v, v) + 2µ ),

and where l_p^{(t)}(u, v) is the p-Laplacian defined by ψ^{(t)}. We take ψ^{(0)} = y as the initial condition. The following theorem guarantees the convergence of the update rule for arbitrary p.

Theorem 10. The update rule Eq. (17) yields a convergent sequence. Moreover, with the notations α = 1/(1 + µ) and β = µ/(1 + µ), a closed-form solution to Eq. (13) for p = 2 is

ψ = β(I − αD^{−1/2}WD^{−1/2})^{−1} y.   (18)
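For p = 2, the closed form of Theorem 10 can be implemented in a few lines. The following hedged sketch (ours; the toy hypergraph, the labels y, and the value of µ are assumptions) builds W from Eq. (10) and evaluates Eq. (18).

```python
import numpy as np

# Toy hypergraph and two labeled nodes; names and values are illustrative.
edges = [(0, 1, 2), (1, 2), (2, 3)]
weights = [1.0, 1.0, 1.0]
n, mu = 4, 1.0

W = np.zeros((n, n))
for e, w in zip(edges, weights):
    for u in e:
        for v in e:
            if u != v:
                W[u, v] += w / (len(e) - 1)
d = W.sum(axis=1)                    # row sums of W equal the node degrees d(v)
Q = W / np.sqrt(np.outer(d, d))      # D^{-1/2} W D^{-1/2}

y = np.array([1.0, 0.0, 0.0, -1.0])  # labels: +1, -1, and 0 for unlabeled nodes
alpha, beta = 1 / (1 + mu), mu / (1 + mu)
psi = beta * np.linalg.solve(np.eye(n) - alpha * Q, y)   # Eq. (18)
print(np.sign(psi), psi)             # predicted labels for all nodes
```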


The update rule (17) can be intuitively thought of as an analogue of a heat diffusion process, similar to the standard graph case (Zhou and Schölkopf 2006). At each step, every vertex is affected by its neighbors, normalized by the relationships among any number of nodes, while also retaining some fraction of its own value. The relative amounts by which these updates occur are specified by the coefficients defined in Eq. (17).

Physical Interpretation of Hypergraph Regularization

In the standard graph case, regularization with the graph Laplacian can be explained as an analogue of the continuous Dirichlet problem (Grady and Schwartz 2003), which is widely used in physics, particularly in fluid dynamics. To avoid confusion, the continuous calculus operators are marked with a superscript (c). The Dirichlet integral is defined as

S_p^{(c)}(ψ) = ∫_Ω ‖∇^{(c)}ψ‖^p dΩ,   (19)

and is minimized when the Laplace equation

∆^{(c)}(ψ) := div^{(c)}(‖∇^{(c)}ψ‖^{p−2} ∇^{(c)}ψ) = 0   (20)

is satisfied (Courant and Hilbert 1962). The parameter p is a coefficient characterizing the viscosity of the fluid. A function ψ satisfying the Laplace equation is called a harmonic function. Solving Eq. (19) with a boundary condition makes it possible to find a plausible interpolation between the boundary points.

From a physical standpoint, finding the shape of an elastic membrane is well approximated by the Dirichlet problem. One may think of a rubber sheet fixed along its boundary and sagging under gravity. This setting can be written as a Dirichlet problem, and the solution gives the most stable form of the rubber sheet, whose elasticity is characterized by p. To solve this numerically, we have to discretize the continuous function. With the pairwise effect between the nodes, solving the Dirichlet problem over a standard graph can be thought of as finding a plausible surface over the graph.

In the graph setup, solving the Dirichlet problem over a standard graph corresponds to finding a plausible surface over the graph which favors the boundary condition y. For the hypergraph setting, solving the hypergraph Dirichlet problem gives a plausible surface with the boundary y, and p is a parameter of the hypergraph; this is achieved by considering not only the pairwise effects, but also the interactions among any number of nodes. In fact, if we discretize the continuous domain of definition into a lattice and consider the effect of the nearest neighbors, the second-order differential operator is given by D − W when p = 2. Interestingly, if we set up the lattice as a hypergraph, which means that we take into account any number of neighbors at the same time, the second-order differential operator is also D − W.

Hypergraph Cut

Revisiting the Hypergraph Two-class Cut

From the discussion so far, it is to be expected that there exists a relationship between hypergraph spectral theory and the considered manifold setup. Similarly to the case of a standard graph and Zhou's hypergraph Laplacian, we now introduce the hypergraph cut problem that has a connection to our Laplacian, whereby a hypergraph is partitioned into two disjoint sets A and B, with A ∪ B = V and A ∩ B = ∅. The normalized hypergraph cut can now be formulated as the minimization problem

Ncut(A, B) = ∂V(A, B) ( 1/vol(A) + 1/vol(B) ),   (21)

where

∂V(A, B) := ∑_{u∈A, v∈B} ∑_{e∈E_un: u,v∈e} w(e)/(δ_e − 1) = ∑_{u∈A, v∈B} w(u, v),   (22)

and vol(A) = ∑_{u∈A} d(u). Note that this setting is consistent with the normalized cut problem on a standard graph. Let f ∈ H(V) be a |V|-dimensional indicator vector function: f(v) = a if node v is in A, and f(v) = −b otherwise, where a = 1/vol(A) and b = 1/vol(B). With these notations, the problem (21) can be rewritten as a Rayleigh quotient:

min Ncut(A, B) = f^⊤ D^{−1/2}(D − W) D^{−1/2} f / (f^⊤ f)
s.t. √d(v) f(v) ∈ {a, −b},   f^⊤ D 1 = 0.   (23)

Minimizing Ncut is NP-hard, but the problem can be relaxed by embedding it in the real domain, and the solution of the relaxed problem is given by the eigenvector associated with the second smallest eigenvalue of L (Golub and Van Loan 1996; Shi and Malik 1997).
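The relaxed two-class cut reduces to an ordinary symmetric eigenproblem. The following hedged sketch (ours, not the authors' implementation) thresholds the eigenvector of the second smallest eigenvalue of L by sign; W is assumed to be the reduced adjacency matrix with weights w(u, v) from Eq. (22).

```python
import numpy as np

def two_class_cut(W):
    """Relaxed two-class normalized cut on the reduced graph W."""
    d = W.sum(axis=1)
    D_is = np.diag(1.0 / np.sqrt(d))
    L = D_is @ (np.diag(d) - W) @ D_is
    vals, vecs = np.linalg.eigh(L)        # L is symmetric and PSD
    second = vecs[:, 1]                   # eigenvector of 2nd smallest eigenvalue
    return second >= 0                    # boolean cluster assignment A vs. B

# Toy reduced adjacency matrix (an assumption, for illustration only).
W = np.array([[0.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.1],
              [0.0, 0.0, 0.1, 0.0]])
print(two_class_cut(W))
```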

This setting is somewhat different from the work by Zhou, Huang, and Schölkopf (2006): if we replace the denominator of Eq. (22) from (δ_e − 1) to δ_e, then it is exactly the same as (Zhou, Huang, and Schölkopf 2006). This difference from Zhou's approach allows the proposed setting to be consistent with standard graphs and the standard random walk setting, whereas Zhou's setting can be seen as a case of the lazy random walk, as discussed in the Appendix.

Hypergraph Multiclass Cut

We shall now extend the two-class cut to multiclass cuts and establish the connection between this setting and our proposed Laplacian, similarly to (Zhou, Huang, and Schölkopf 2006). As in the standard graph case, the multiclass clustering problem corresponds to decomposing V into k disjoint sets: V = ∪_{i=1}^k V_i and V_i ∩ V_j = ∅ for i ≠ j. We shall denote this multiclass clustering by Γ_V^k = {V_1, . . . , V_k}, and formulate the problem as that of minimizing

Ncut(Γ_V^k) = ∑_{i=1}^k ∂V(V_i, V\V_i) / vol(V_i),   (24)

where ∂V(V_i, V\V_i)/vol(V_i) measures the total weight of the links from V_i to the other clusters in G. We denote the multiclass clustering Γ_V^k by a |V| × k matrix X, where x(u, i) = 1 if node u belongs to the i-th cluster and 0 otherwise. This allows us to rewrite the problem as

min Ncut(Γ_V^k) = ∑_{i=1}^k X_i^⊤ (D − W) X_i / (X_i^⊤ D X_i)
s.t. X ∈ {1, 0}^{|V|×k},   X 1_k = 1_{|V|}.   (25)

To this end, we consider relaxing the constraint on X by minimizing Ncut(Γ_V^k) subject to Z̄^⊤ Z̄ = I_k, where Z̄ = D^{1/2} Z and Z = X(X^⊤ D X)^{−1/2}. The optimal solution to this problem is given by the eigenvectors associated with the smallest k eigenvalues of the Laplacian L. Similarly to Zhou's Laplacian, the following proposition holds.

Proposition 11. Denote the eigenvalues of the Laplacian L by λ_1 ≤ · · · ≤ λ_{|V|}, and define c_k(H) = min Ncut s.t. X ∈ {1, 0}^{|V|×k}, X 1_k = 1_{|V|}. Then ∑_{i=1}^k λ_i ≤ c_k(H).

As discussed in (Zhou, Huang, and Schölkopf 2006), this result shows that the real-valued relaxed optimization problem gives a lower bound for the original combinatorial optimization problem. However, it is not clear how to use the k eigenvectors to obtain k clusters. For a standard graph, applying the k-means method to the k eigenvectors heuristically performs well, and this approach can be applied to the hypergraph problem as well.
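A minimal sketch of this heuristic (ours, not the authors' implementation; it assumes scikit-learn is available for k-means) embeds the nodes with the eigenvectors of the k smallest eigenvalues of L and then clusters them:

```python
import numpy as np
from sklearn.cluster import KMeans

def multiclass_cut(W, k):
    """Spectral embedding from L, followed by the k-means heuristic."""
    d = W.sum(axis=1)
    D_is = np.diag(1.0 / np.sqrt(d))
    L = D_is @ (np.diag(d) - W) @ D_is
    _, vecs = np.linalg.eigh(L)
    embedding = vecs[:, :k]                       # k smallest eigenvectors
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
```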

Hypergraph p-Normalized Cut

From the above discussion, one might expect that there exists a corresponding hypergraph cut induced from the hypergraph p-Laplacian, similarly to the graph p-Laplacian case (Bühler and Hein 2009). Since the p-Laplace operator is nonlinear, we need to define its eigenvalues and eigenvectors.

Definition 12. A hypergraph p-eigenvalue λ_p ∈ R and p-eigenvector ψ ∈ H(V) of ∆_p are defined by

(∆_pψ)(v) = λ_p ξ_p(ψ(v)),   where ξ_p(x) = |x|^{p−1} sgn(x).   (26)

To obtain p-eigenvectors and p-eigenvalues, we consider the Rayleigh quotient, for which the following statements hold.

Proposition 13. Consider the Rayleigh quotient for the p-Laplacian,

R_p(ψ) = S_p(ψ) / ‖ψ‖_p^p,   where ‖ψ‖_p = ( ∑_v |ψ(v)|^p )^{1/p}.   (27)

The function R_p has a critical point at ψ if and only if ψ is a p-eigenvector of ∆_p. The corresponding p-eigenvalue λ_p is given by λ_p = R_p(ψ). Moreover, we have R_p(αψ) = R_p(ψ) for all ψ ∈ H(V) and α ∈ R, α ≠ 0.

Corollary 14. The smallest p-eigenvalue λ_p^{(1)} equals 0, and the corresponding p-eigenvector is D^{1/2}1.

Eq. (27) is the analogue of the continuous nonlinear Rayleigh quotient

R_p^{(c)}(ψ) = ∫_Ω ‖∇^{(c)}ψ‖^p dΩ / ∫_Ω ‖ψ‖^p dΩ,   (28)

which relates to a nonlinear eigenproblem.

In order to define a hypergraph cut corresponding to the hypergraph p-Laplacian, let us consider f ∈ H(V), the |V|-dimensional indicator vector function from Eq. (23). Then, substituting f into Eq. (27) gives

min p-Ncut(A, B) := f^⊤ D^{−1/2}(D_p − W_p) D^{−1/2} f / ‖f‖_p^p
s.t. √d(v) f(v) ∈ {a, −b},   f^⊤ D 1 = 0,   (29)

which can be seen as the cut corresponding to our hypergraph p-Laplacian. The problem (29) is NP-hard, and therefore we need to consider a relaxed problem, similarly to the case of p = 2. The constraints in Eq. (29) require the second eigenvector to be orthogonal to the first eigenvector. However, the orthogonality constraint is not suitable for the p-eigenvalue problem, since the p-Laplacian is nonlinear and therefore its eigenvectors are not necessarily orthogonal to each other.

For the case p = 2, since we see that

‖ψ‖₂² = ‖ψ − (⟨ψ, D^{1/2}1⟩/|V|) D^{1/2}1‖² = min_{c∈R} ‖ψ − cD^{1/2}1‖²   (30)

by ⟨ψ, D^{1/2}1⟩ = 0 for the second eigenvector, the Rayleigh quotient yielding the second eigenvector v^{(2)} can be written as

v^{(2)} = argmin_{ψ∈H(V)} S_p(ψ) / min_c ‖ψ − cD^{1/2}1‖².   (31)

Motivated by this, we here define the Rayleigh quotient for the second smallest p-eigenvalue as

R_p^{(2)}(ψ) = S_p(ψ) / min_c ‖ψ − cD^{1/2}1‖_p^p.   (32)

This quotient is supported by the following theorem.

Theorem 15. The global minimum of Eq. (32) is equal to the second smallest p-eigenvalue λ_p^{(2)} of ∆_p. The corresponding p-eigenvector ψ_p^{(2)} can be obtained as ψ_p^{(2)} = ψ* − c*D^{1/2}1 for any global minimizer ψ* of R_p^{(2)}, where c* = argmin_{c∈R} ∑_v |ψ*(v) − √d(v) c|^p. Moreover, we have R_p^{(2)}(tψ + c) = R_p^{(2)}(ψ), where t ∈ R, t ≠ 0, and ∀c ∈ R.

Therefore, for the relaxed problem of Eq. (29), we consider the Rayleigh quotient of Eq. (32). We note that, as R_p^{(2)} is not convex, optimization algorithms may fail to attain the global minimum. However, since the function R_p^{(2)} is continuous in p, we can assume that the global minimizers of R_{p1}^{(2)} and R_{p2}^{(2)} are close if p1 and p2 are close. Hence, we first obtain the second eigenvector for p = 2, for which more stable algorithms exist, and use it as the initial condition of the optimization algorithms for p ≠ 2.
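A minimal sketch of this continuation strategy, under our own assumptions (a toy hypergraph and SciPy's general-purpose optimizers rather than any particular method from the paper), is:

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

# Toy hypergraph; all names and values are illustrative assumptions.
edges = [(0, 1, 2), (1, 2), (2, 3)]
weights = [1.0, 1.0, 1.0]
n = 4
d = np.array([sum(w for e, w in zip(edges, weights) if v in e) for v in range(n)])
sqrt_d = np.sqrt(d)

def S_p(psi, p):
    """p-Dirichlet sum, Eq. (5), via the gradient norm of Eq. (4)."""
    norms_sq = np.zeros(n)
    for e, w in zip(edges, weights):
        for v in e:
            g = np.sqrt(w / (len(e) - 1)) * sum(
                psi[u] / sqrt_d[u] - psi[v] / sqrt_d[v] for u in e)
            norms_sq[v] += g ** 2 / len(e)
    return np.sum(norms_sq ** (p / 2))

def R2_p(psi, p):
    """Eq. (32): S_p(psi) / min_c ||psi - c D^{1/2} 1||_p^p."""
    var = minimize_scalar(lambda c: np.sum(np.abs(psi - c * sqrt_d) ** p)).fun
    return S_p(psi, p) / var

# Warm start: the second eigenvector of L for p = 2.
W = np.zeros((n, n))
for e, w in zip(edges, weights):
    for u in e:
        for v in e:
            if u != v:
                W[u, v] += w / (len(e) - 1)
L = np.diag(1 / sqrt_d) @ (np.diag(d) - W) @ np.diag(1 / sqrt_d)
psi0 = np.linalg.eigh(L)[1][:, 1]

res = minimize(R2_p, psi0, args=(1.5,), method="Nelder-Mead")  # p = 1.5
print(res.x)
```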

Comparison to Existing Hypergraph Laplacians and Related Regularizer

We now compare our Laplacian with the other two standard ones. Zhou, Huang, and Schölkopf (2006) have proposed the Laplacian L_Z = I − D_v^{−1/2} H W_e D_e^{−1} H^⊤ D_v^{−1/2} based on a normalized cut and lazy random walk view, where the degree matrices D_v and D_e are diagonal matrices containing the degrees of the nodes and edges, respectively, W_e is a diagonal matrix containing the weights of the edges, and the incidence matrix H is a |V| × |E_un| matrix whose element h(v, e) = 1 if node v is connected to the edge e, and 0 otherwise. In this setting, the hypergraph is represented by the matrix H W_e D_e^{−1} H^⊤, where the weights W_e are normalized by the edge degrees D_e. For a standard graph, this gives the same Laplacian as the graph Laplacian, except for a coefficient of 1/2. This difference comes from the consistency with the lazy random walk view, as explained in the Appendix. Agarwal, Branson, and Belongie (2006) show that Zhou's Laplacian is equivalent to the hypergraph star expansion in (Zien, Schlag, and Chan 1999) and (Li and Solé 1996).

Another Laplacian has been proposed in the unweighted setting by Rodriguez (2002), and is referred to as the Simple Graph Method in (Zhou, Huang, and Schölkopf 2006). The hypergraph is represented by the matrix H W_e H^⊤ − D_v, and the Laplacian is defined as L_R = I − D_R^{−1/2} H W_e H^⊤ D_R^{−1/2}, where D_R is a diagonal matrix whose elements are d_R(u, u) = ∑_{v∈V} w_R(u, v) and w_R(u, v) = ∑_{e∈E_un: u,v∈e} w(e). This view is consistent with the standard graph, but it does not consider the differences of the edge degrees δ_e. Rodriguez's Laplacian is theoretically equivalent to the hypergraph clique expansion in (Zien, Schlag, and Chan 1999), (Bolla 1993), and (Gibson, Kleinberg, and Raghavan 2000), as shown in (Agarwal, Branson, and Belongie 2006).

Our Laplacian can be regarded as a member of the family of Rodriguez's Laplacians, but we normalize the weight by the edge degree δ_e − 1 when constructing the Laplacian; the interpretation of this normalization lies in the definition of the gradient. If we consider the clique expansion constructed with w(e), and also normalize w(e) by δ_e − 1, we obtain our 2-Laplacian. Moreover, from the viewpoint of differential geometry, we obtain Rodriguez's Laplacian by changing the denominator in the definition of the gradient (Def. 1) from √(δ_e − 1) to 1.

Note that Zhou's, Rodriguez's, and our Laplacian can each be seen as reducing a hypergraph to an ordinary graph, whose adjacency matrices are H W_e D_e^{−1} H^⊤, H W_e H^⊤ − D_v, and W, respectively. Moreover, Rodriguez's and our Laplacian can be constructed from the graph gradient in (Zhou and Schölkopf 2006) using the graph reduced from the hypergraph. However, because our hypergraph gradient is different from the graph gradient, we cannot construct our hypergraph p-Laplacian from the graph p-Laplacian in (Zhou and Schölkopf 2006) or from the alternative definition of the graph p-Laplacian in (Bühler and Hein 2009). For example, consider an undirected hypergraph G where V = {v_1, v_2, v_3}, E = {e} with e = {v_1, v_2, v_3} and w(e) = 1, p = 1, and the function ψ(v_1) = 1, ψ(v_2) = 0, ψ(v_3) = 0. In this setting we get ∆_1(v_1) = 4/√6, whereas, on the graph reduced from the hypergraph, Zhou's value for v_1 is 1/2 + 1/(2√2), and the unnormalized and normalized Bühler values for v_1 are both 1/2, while our and Zhou's 2-Laplacians give the same values. Details are in the Appendix.

Hein et al. (2013) proposed a semi-supervised clustering method using a p-regularizer S_p^{(h)} = ∑_e w(e) (max_{v∈e} ψ(v) − min_{u∈e} ψ(u))^p, induced from a total variation on hypergraphs, which is the Lovász extension of the hypergraph cut in Eq. (22). Moreover, they use the total variation S_1^{(h)} for the hypergraph cut, which favors balanced cuts, while ours and the others favor cuts attracted by larger hypergraph weights. This regularizer reduces to the one composed from the graph p-Laplacian in (Bühler and Hein 2009; Tudisco and Hein 2016) when we consider a standard graph.
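The three reductions can also be compared numerically. The hedged sketch below (ours; the toy incidence data are an assumption) builds Zhou's L_Z, Rodriguez's L_R, and our 2-Laplacian from the same incidence matrix H:

```python
import numpy as np

# Toy hypergraph via its incidence matrix H (|V| x |E_un|); names are ours.
edges = [(0, 1, 2), (1, 2), (2, 3)]
w_e = np.array([1.0, 1.0, 1.0])
n = 4

H = np.zeros((n, len(edges)))
for j, e in enumerate(edges):
    H[list(e), j] = 1.0
We = np.diag(w_e)
De = np.diag([len(e) for e in edges])
Dv = np.diag(H @ w_e)                                  # node degrees d(v)

# Zhou: L_Z = I - Dv^{-1/2} H We De^{-1} H^T Dv^{-1/2}
Dv_is = np.diag(1.0 / np.sqrt(np.diag(Dv)))
L_Z = np.eye(n) - Dv_is @ H @ We @ np.linalg.inv(De) @ H.T @ Dv_is

# Rodriguez: clique-expansion adjacency H We H^T - Dv, own degrees D_R
A_R = H @ We @ H.T - Dv
D_R_is = np.diag(1.0 / np.sqrt(A_R.sum(axis=1)))
L_R = np.eye(n) - D_R_is @ A_R @ D_R_is

# Ours (p = 2): adjacency W with weights w(e)/(delta_e - 1), Eq. (10)
W = (H * (w_e / (np.diag(De) - 1))) @ H.T
np.fill_diagonal(W, 0.0)
L_ours = np.eye(n) - Dv_is @ W @ Dv_is
print(L_Z.round(3), L_R.round(3), L_ours.round(3), sep="\n\n")
```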

Experiments

We compare the proposed hypergraph Laplacian with the other hypergraph Laplacians, Zhou's and Rodriguez's, and with Hein's regularizer, on categorical data, where one or more attributes are given for each instance. Each attribute has a small number of attribute values, corresponding to a specific category. We summarize the benchmark datasets in Table 1. As in (Zhou, Huang, and Schölkopf 2006), we constructed a hypergraph for each dataset, where each category is represented as one hyperedge, and set the weight of all edges to 1. We used semi-supervised learning and clustering to classify the objects of a dataset. For fair comparison, we evaluate the performance with the error rate, which is used in the previous studies on hypergraph clustering (Zhou, Huang, and Schölkopf 2006; Hein et al. 2013).

Semi-supervised Learning. We compared our semi-supervised learning method, given in Eq. (17), with the existing ones using the Laplacians reported by Zhou, Huang, and Schölkopf (2006) and Rodriguez (2002). There are a variety of ways to extend two-class clustering to multiclass clustering (Bishop 2007), but in order to keep the comparison simple, we conducted the experiment only on two-class datasets. The parameter µ was chosen for all methods from 10^k, with k ∈ {0, 1, 2, 3, 4}, by 5-fold cross-validation. We randomly picked a certain number of labels as known labels, and predicted the remaining ones. We repeated this procedure 10 times for each number of known labels. For our p-Laplacian, we varied p from 1 to 3 with an interval of 0.1, and we show the result for p = 2 and the result for the p giving the smallest average error for each number of known labeled points. The parameter p for Hein's regularizer is fixed at 2, since this is recommended by Hein et al. (2013). The results are shown in Fig. 1 (a)-(d). When p = 2, our Laplacian almost consistently outperformed Rodriguez's Laplacian, and showed almost the same behavior as Zhou's. This means that normalizing the hypergraph weights by the edge degree when constructing the Laplacian enhances the performance. When we tuned p, our Laplacian consistently outperformed the other Laplacians, and Hein's regularizer except on mushroom. The mushroom dataset might fit Hein's balanced cut assumption better than the normalized cut assumption of the other Laplacians. Table 2 shows the values of p which give the first and the second smallest average errors. We can observe that the p giving the optimal error for one number of known label points is close to those for the other numbers. This result implies that p is a parameter of each hypergraph, rather than a parameter of the number of known label points. This can be seen as the analogue of fluid dynamics, where p is a coefficient characterizing the viscosity of each fluid.


Table 1: Dataset summary. All datasets were taken from the UCI Machine Learning Repository.

                mushroom  cancer   chess  congress   zoo  20 newsgroups  nursery
# of classes        2        2       2       2        7        4            5
|V|              8124      699    3196     435      101    16242        12960
|E|               112       90      73      48       42      100           27
∑_{e∈E} |e|    170604     6291  115056    6960     1717    65451       103680

Figure 1: Results of semi-supervised learning from the proposed method and the state of the art. [Four panels, (a) Mushroom, (b) Breast Cancer, (c) Chess, and (d) Congress, plot the error rate against the portion of labeled points (2%-20%) for Proposed p=2, p best, Zhou, Rodriguez, and Hein.]

Table 2: The value of p which gives an optimal error. Columns give the portion of known label points.

Dataset        error           2%   4%   6%   8%   10%  12%  14%  16%  18%  20%
Mushroom       smallest        1.6  1.6  2.3  1.5  1.3  1.3  1.7  1.7  1.0  1.4
               2nd smallest    2.0  2.1  2.4  1.6  1.4  1.4  1.6  1.4  1.1  1.7
Breast Cancer  smallest        2.7  2.4  2.7  2.7  2.7  2.7  2.5  2.5  2.6  2.6
               2nd smallest    2.6  2.1  2.4  2.6  2.6  2.6  2.6  2.6  2.7  2.7
Chess          smallest        1.2  1.0  1.0  1.0  1.0  1.1  1.0  1.9  1.0  1.0
               2nd smallest    1.1  1.1  1.1  1.5  1.1  1.2  1.2  2.0  2.4  1.1
Congress       smallest        1.1  1.6  1.3  1.5  1.1  1.1  1.1  2.9  1.1  1.1
               2nd smallest    1.2  1.7  1.4  1.6  1.2  1.2  1.2  3.0  1.2  1.2

Clustering. This experiment aimed to evaluate the proposed Laplacian on a clustering task. We performed two-class and multiclass clustering tasks by solving the normalized cut eigenvalue problem of the Laplacian L for p = 2. For our p-Laplacian, we obtained the second eigenvector of Eq. (27) by varying p from 1 to 3 with an interval of 0.1, and report the optimal result. In the multiclass task experiments, we applied the k-means method to the k eigenvectors obtained from L, using the true number of classes as the number of clusters. To keep the comparison simple, we conducted the p-Laplacian experiments only on the two-class problems, for the same reason as in the semi-supervised setting. For comparison, we present the results obtained from Zhou's and Rodriguez's Laplacians and Hein's regularizer for the normalized cut, and compare the error rates of the clustering results, as summarized in Table 3. Among the Laplacians, we observe that our p-Laplacian consistently outperformed Rodriguez's and Zhou's Laplacians, while our 2-Laplacian showed slightly better or similar results compared to Rodriguez's and Zhou's. For mushroom, Hein's is significantly better than the others, possibly for the same reason as in the semi-supervised learning experiment.

Table 3: The experimental results on clustering: error rate of clustering. For the proposed p, the value of p giving the optimal error is given in parentheses next to the error value.

                  Two-Class                                             Multiclass
                  mushroom     cancer        chess         congress     zoo     20 newsgroups  nursery
Proposed p        0.2329 (1)   0.0243 (1.5)  0.2847 (2.6)  0.1195 (2.5) -       -              -
Proposed p = 2    0.3156       0.0286        0.4775        0.1241       0.2287  0.3307         0.2400
Zhou's            0.3156       0.0300        0.4925        0.1241       0.1975  0.3307         0.2426
Rodriguez's       0.4791       0.3419        0.4931        0.3885       0.5376  0.4318         0.2607
Hein's            0.1349       0.3362        0.4778        0.3034       0.1881  0.5113         0.5131

Conclusion

We have proposed a hypergraph p-Laplacian from the perspective of differential geometry, have used it to develop a semi-supervised learning method in a clustering setting, and have formalized it as the analogue of the Dirichlet problem. We have further explored a theoretical connection with the normalized cut, and proposed a normalized cut corresponding to our p-Laplacian. Our proposed p-Laplacian has consistently outperformed the current hypergraph Laplacians on the semi-supervised clustering and the clustering tasks. There are several future directions. A fruitful direction would be to explore extensions, such as algorithms which require less memory (Hein et al. 2013), and a nodal domain theorem (Tudisco and Hein 2016). It is also worthwhile to find more applications where hypergraphs are used, such as (Huang, Liu, and Metaxas 2009; Liu, Latecki, and Yan 2010; Klamt, Haus, and Theis 2009), and where the hypergraph Laplacian is the most effective approach compared to other machine learning approaches. In addition, it would be valuable to establish how to choose the best parameter p, especially in the clustering case, where we have to assume that no labeled data is available. Moreover, it would be interesting to explore a theoretical connection between the hypergraph Laplacian and the continuous Laplacian, as in the graph case, where the graph Laplacian is shown to converge to the continuous Laplacian (Belkin and Niyogi 2003).

Acknowledgments. We wish to thank Daiki Nishiguchi for useful comments. This research is supported by the JST ERATO Kawarabayashi Large Graph Project, Grant Number JPMJER1201, Japan, and by JST CREST, Grant Number JPMJCR14D2, Japan.

References

Agarwal, S.; Branson, K.; and Belongie, S. 2006. Higher order learning with graphs. In Proc. ICML, 17-24.
Belkin, M., and Niyogi, P. 2003. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15(6):1373-1396.
Berge, C. 1984. Hypergraphs: Combinatorics of Finite Sets, volume 45. Elsevier.
Bishop, C. M. 2007. Pattern Recognition and Machine Learning. Springer.
Bolla, M. 1993. Spectra, euclidean representations and clusterings of hypergraphs. Discrete Math. 117(1-3):19-39.
Bougleux, S.; Elmoataz, A.; and Melkemi, M. 2007. Discrete regularization on weighted graphs for image and mesh filtering. In Proc. SSVM, 128-139.
Bougleux, S.; Elmoataz, A.; and Melkemi, M. 2009. Local and nonlocal discrete regularization on weighted graphs for image and mesh processing. Int. J. Comput. Vision 84(2):220-236.
Branin, F. H. 1966. The algebraic topological basis for network analogies and the vector calculus. In Proc. Symposium on Generalized Networks, 452-491.
Bühler, T., and Hein, M. 2009. Spectral clustering based on the graph p-Laplacian. In Proc. ICML, 81-88.
Bulò, S. R., and Pelillo, M. 2009. A game-theoretic approach to hypergraph clustering. In Proc. NIPS, 1571-1579.
Cooper, J., and Dutle, A. 2012. Spectra of uniform hypergraphs. Linear Algebra Appl. 436(9):3268-3292.
Courant, R., and Hilbert, D. 1962. Methods of Mathematical Physics, Volume 2. Interscience Publishers.
Ghoshdastidar, D., and Dukkipati, A. 2014. Consistency of spectral partitioning of uniform hypergraphs under planted partition model. In Proc. NIPS, 397-405.
Gibson, D.; Kleinberg, J.; and Raghavan, P. 2000. Clustering categorical data: An approach based on dynamical systems. The VLDB Journal 8(3-4):222-236.
Golub, G. H., and Van Loan, C. F. 1996. Matrix Computations (3rd Ed.). Baltimore, MD, USA: Johns Hopkins University Press.
Grady, L., and Schwartz, E. L. 2003. Anisotropic interpolation on graphs: The combinatorial Dirichlet problem. Technical report, Boston University.
Grady, L. 2006. Random walks for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 28(11):1768-1783.
Hein, M.; Setzer, S.; Jost, L.; and Rangapuram, S. S. 2013. The total variation on hypergraphs - learning on hypergraphs revisited. In Proc. NIPS, 2427-2435.
Hu, S., and Qi, L. 2015. The Laplacian of a uniform hypergraph. J. Comb. Optim. 29(2):331-366.
Huang, Y.; Liu, Q.; and Metaxas, D. 2009. Video object segmentation by hypergraph cut. In Proc. CVPR, 1738-1745.
Klamt, S.; Haus, U.-U.; and Theis, F. 2009. Hypergraphs and cellular networks. PLoS Comput. Biol. 5(5):e1000385+.
Li, W.-C. W., and Solé, P. 1996. Spectra of regular graphs and hypergraphs and orthogonal polynomials. Europ. J. Combinatorics 17(5):461-477.
Liu, H.; Latecki, L. J.; and Yan, S. 2010. Robust clustering as ensembles of affinity relations. In Proc. NIPS, 1414-1422.
Meila, M., and Shi, J. 2001. A random walks view of spectral segmentation. In Proc. AISTATS.
Rodriguez, J. A. 2002. On the Laplacian eigenvalues and metric parameters of hypergraphs. Linear Multilinear Algebra 50(1):1-14.
Shi, J., and Malik, J. 1997. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22:888-905.
Tan, S.; Guan, Z.; Cai, D.; Qin, X.; Bu, J.; and Chen, C. 2014. Mapping users across networks by manifold alignment on hypergraph. In Proc. AAAI, 159-165.
Tudisco, F., and Hein, M. 2016. A nodal domain theorem and a higher-order Cheeger inequality for the graph p-Laplacian. arXiv:1602.05567.
von Luxburg, U. 2007. A tutorial on spectral clustering. Stat. Comput. 17(4):395-416.
Yu, S. X., and Shi, J. 2003. Multiclass spectral clustering. In Proc. ICCV, 313-319.
Zhou, D., and Schölkopf, B. 2006. Discrete regularization. In Semi-supervised Learning. MIT Press.
Zhou, D.; Huang, J.; and Schölkopf, B. 2006. Learning with hypergraphs: Clustering, classification, and embedding. In Proc. NIPS, 1601-1608.
Zien, J. Y.; Schlag, M. D. F.; and Chan, P. K. 1999. Multilevel spectral hypergraph partitioning with arbitrary vertex sizes. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 18(9):1389-1399.

Proof of Proposition 3

⟨∇ψ, φ⟩_{H(E)}
= ∑_{e∈E} ∇ψ(e) φ(e) / δ_e!
= ∑_{e∈E} ( √w(e)/(δ_e! √(δ_e − 1)) ) ( ∑_{v∈e} ψ(v)/√d(v) − δ_e ψ(e[1])/√d(e[1]) ) φ(e)
= ∑_{e∈E} ( √w(e)/(δ_e! √(δ_e − 1)) ) ( ∑_{v∈e} (ψ(v)/√d(v)) φ(e) − δ_e (ψ(e[1])/√d(e[1])) φ(e) )
= ∑_{v∈V} ∑_{e∈E: v∈e} (√w(e)/√d(v)) ψ(v) φ(e) / (δ_e! √(δ_e − 1)) − ∑_{v∈V} ∑_{e∈E: e[1]=v} δ_e (√w(e)/√d(v)) ψ(v) φ(e) / (δ_e! √(δ_e − 1))
= ∑_{v∈V} ψ(v) ( ∑_{e∈E: v∈e} (√w(e)/√d(v)) φ(e) / (δ_e! √(δ_e − 1)) − ∑_{e∈E: e[1]=v} δ_e (√w(e)/√d(v)) φ(e) / (δ_e! √(δ_e − 1)) ).

The last equality implies Eq. (7).

Proof of Proposition 5

By substituting Eq. (3) and Eq. (7) into the definition (8), the Laplace operator for an undirected hypergraph becomes

−div(‖∇ψ‖^{p−2}∇ψ)(v)
= ∑_{e∈E: v∈e} ( √w(e)/(δ_e! √(δ_e − 1) √d(v)) ) ‖∇ψ‖^{p−2}∇ψ − ∑_{e∈E: e[1]=v} δ_e ( √w(e)/(δ_e! √(δ_e − 1) √d(v)) ) ‖∇ψ‖^{p−2}∇ψ
= ∑_{e∈E_un: v∈e} ( ∑_{u∈e} ( √w(e)/(δ_e! √(δ_e − 1) √d(v)) ) (δ_e − 1)! ‖∇ψ(u)‖^{p−2} ∇ψ(e; e[1]=u) − δ_e ( √w(e)/(δ_e! √(δ_e − 1) √d(v)) ) (δ_e − 1)! ‖∇ψ(v)‖^{p−2} ∇ψ(e; e[1]=v) )
= ∑_{e∈E: v∈e} ∑_{u∈e\{v}} ( w(e)/((δ_e − 1) √d(v)) ) × ( ‖∇ψ(u)‖^{p−2} + ‖∇ψ(v)‖^{p−2} − ∑_{u′∈e} ‖∇ψ(u′)‖^{p−2}/δ_e ) ψ(u)/√d(u) − ( w(e)/(δ_e(δ_e − 1) √d(v)) ) δ_e ‖∇ψ(v)‖^{p−2} ψ(v)
= ∑_{e∈E: v∈e} ( ∑_{u∈e} ( w(e)/((δ_e − 1) √d(v)) ) × ( ‖∇ψ(u)‖^{p−2} + ‖∇ψ(v)‖^{p−2} − ∑_{u′∈e} ‖∇ψ(u′)‖^{p−2}/δ_e ) ψ(u)/√d(u) − ( w(e)(δ_e − 1)/((δ_e − 1) √d(v)) ) ‖∇ψ(v)‖^{p−2} ψ(v)/√d(v) ) + ∑_{e∈E_un: v∈e} ( w(e)/((δ_e − 1) √d(v)) ) ( ‖∇ψ(v)‖^{p−2} − ‖∇ψ_e‖^{p−2} ) ψ(v)/√d(v)
= −( ∑_{u∈V\{v}} w_p(u, v) ψ(u)/√d(u) − d_p(v) ψ(v)/√d(v) ).   (33)

Note that the first term of Eq. (7) vanishes due to the symmetry property of the gradient when p = 2.

Proof of Proposition 6 and Proposition 7

Proposition 7 can be shown by

⟨ψ, ∆_pψ⟩_{H(V)} = ⟨ψ, −div(‖∇ψ‖^{p−2}∇ψ)⟩_{H(V)}
= ⟨∇ψ, ‖∇ψ‖^{p−2}∇ψ⟩_{H(E)}
= ∑_{v∈V} ∑_{e∈E: v∈e} ‖∇ψ‖^{p−2} (∇ψ)²(e)/δ_e!
= ∑_{v∈V} ‖∇ψ(v)‖^p = S_p(ψ).   (34)

Corollary 8 immediately follows; the hypergraph Laplacian is positive semi-definite.

Proposition 6 also follows from Proposition 7 by considering:

S_p(ψ) = ⟨ψ, ∆_pψ⟩ = ψ^⊤ D^{−1/2}(D_p − W_p) D^{−1/2} ψ.   (35)

Proof of Proposition 9

∂/∂ψ S_p(ψ) = ∂/∂ψ ∑_{v∈V} ‖∇ψ(v)‖^p
= ∂/∂ψ ∑_{v∈V} [ ∑_{e∈E: e[1]=v} ( w(e)/(δ_e!(δ_e − 1)) ) ( ∑_{υ′∈e} ψ(υ′)/√d(υ′) − δ_e ψ(v)/√d(v) )² ]^{p/2}.

Since the derivative at v only depends on the vertices connected to v by the edges E, we do not have to consider the other terms. Hence, we obtain

∂/∂ψ S_p(ψ)|_v
= −p ∑_{e∈E: e[1]=v} [ ∑_{e′∈E: e′[1]=v} ( w(e′)/(δ_{e′}!(δ_{e′} − 1)) ) ( ∑_{υ′∈e′} ψ(υ′)/√d(υ′) − δ_{e′} ψ(v)/√d(v) )² ]^{(p−2)/2} × ( w(e)(δ_e − 1)/(√d(v) δ_e!(δ_e − 1)) ) ( ∑_{υ′∈e} ψ(υ′)/√d(υ′) − δ_e ψ(v)/√d(v) )
+ p ∑_{u∈V\{v}} ∑_{e∈E: e[1]=u, u,v∈e} [ ∑_{e′∈E: e′[1]=u} ( w(e′)/(δ_{e′}!(δ_{e′} − 1)) ) ( ∑_{υ′∈e′} ψ(υ′)/√d(υ′) − δ_{e′} ψ(u)/√d(u) )² ]^{(p−2)/2} × ( w(e)/(√d(v) δ_e!(δ_e − 1)) ) ( ∑_{υ′∈e} ψ(υ′)/√d(υ′) − δ_e ψ(u)/√d(u) )
= −p ∑_{e∈E: e[1]=v} (δ_e − 1) ( √w(e)/(δ_e! √(δ_e − 1) √d(v)) ) ‖∇ψ‖^{p−2} ∇ψ(v)
+ p ∑_{u∈V\{v}} ∑_{e∈E: v,u∈e} ( √w(e)/(δ_e! √(δ_e − 1) √d(v)) ) ‖∇ψ‖^{p−2} ∇ψ(u)
= −p ∑_{e∈E: e[1]=v} δ_e ( √w(e)/(δ_e! √(δ_e − 1) √d(v)) ) ‖∇ψ‖^{p−2} ∇ψ(v)
+ p ∑_{u∈V} ∑_{e∈E: v,u∈e} ( √w(e)/(δ_e! √(δ_e − 1) √d(v)) ) ‖∇ψ‖^{p−2} ∇ψ(u)
= p ∆_pψ(v).

Proof of Theorem 10

We show this theorem in a similar way to the standard graph case reported in (Bougleux, Elmoataz, and Melkemi 2009). Let G be the update map, and let C^{(t)} be the matrix whose elements are c^{(t)}(u, v) for u ≠ v and 0 otherwise. To simplify the discussion, we omit the superscript (t) of C and of m(v). The matrix C can be rewritten as

C = pD^{−1/2}(pD_p + 2µ)^{−1} W_p D^{−1/2}.   (36)

Then the following conditions are satisfied:

m(v) ≥ 0, ∀v ∈ V,   (37)
∑_{u∈V} c(v, u) ≥ 0, ∀v ∈ V,   (38)
m(v) + ∑_{u∈V} c(v, u) = 1, ∀v ∈ V.   (39)

From these conditions, we get

min{ψ^{(0)}(v), min_{u∈V} ψ^{(t)}(u)} ≤ ψ^{(t+1)}(v) ≤ max{ψ^{(0)}(v), max_{u∈V} ψ^{(t)}(u)}.   (40)

Let M(V) denote the set of functions ψ ∈ H(V) such that ‖ψ‖_∞ ≤ ‖ψ^{(0)}‖_∞, where ‖ψ‖_∞ = max_{v∈V} |ψ(v)|. By this definition, M(V) is a Banach space. From the conditions above, for the iteration G(ψ^{(t)}) = ψ^{(t+1)}, we can say G : M(V) → M(V), and by the minimum and maximum principle, G(M(V)) ⊂ M(V). The map G(ψ)(v) is continuous with respect to ‖·‖_∞ for all v ∈ V, which states that G is a continuous mapping. From the discussion above, since M(V) is non-empty and convex, Schauder's fixed point theorem shows that there exists ψ ∈ M(V) satisfying ψ = G(ψ). Since S_p is convex and G has a fixed point, S_p has a global minimum, and G converges to the global minimum of S_p.

We remark that the discussion is simpler for p = 2. In this case, Eq. (16) can be rewritten in matrix form as

(I − D^{−1/2}WD^{−1/2})ψ + µ(ψ − y) = 0,   (41)

which yields the closed-form solution to Eq. (13):

ψ = β(I − αD^{−1/2}WD^{−1/2})^{−1} y,   (42)

with the notations α = 1/(1 + µ) and β = µ/(1 + µ). We also show that the update rule G is then a contraction mapping:

‖Gψ − Gψ′‖₂ = ‖αD^{−1/2}WD^{−1/2}(ψ − ψ′)‖₂ ≤ α‖D^{−1/2}WD^{−1/2}‖₂ ‖ψ − ψ′‖₂ ≤ α‖ψ − ψ′‖₂.   (43)

The last inequality holds since all the eigenvalues of D^{−1/2}WD^{−1/2} lie in the range [−1, 1]. The inequality states that the update rule always converges. This can also be solved by the power method, as follows. Using W and D, we can rewrite Eq. (17) as ψ^{(t)} = αD^{−1/2}WD^{−1/2}ψ^{(t−1)} + βy. Denote Q = D^{−1/2}WD^{−1/2}, and note that the eigenvalues of Q lie in the range [−1, 1]. Then, by iterating, we obtain

ψ^{(t)} = (αQ)^{t−1}ψ^{(1)} + ∑_{j=1}^{t−1} β(αQ)^{j−1}y.   (44)

Since 0 < α < 1, we can show that lim_{t→∞}(αQ)^t = 0 and lim_{t→∞} ∑_t (αQ)^t = (I − αQ)^{−1}, to yield

ψ^{(∞)} = β(I − αQ)^{−1}y,   (45)

which is the same as the closed form in Eq. (42).
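For p = 2, the convergence argument of Eqs. (44)-(45) is easy to check numerically; the sketch below (ours; random toy data, all names are assumptions) iterates ψ^{(t)} = αQψ^{(t−1)} + βy and compares it against the closed form.

```python
import numpy as np

# Toy symmetric W with zero diagonal, standing for a reduced hypergraph.
rng = np.random.default_rng(0)
W = rng.random((5, 5)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)
Q = W / np.sqrt(np.outer(d, d))            # eigenvalues of Q lie in [-1, 1]
y = np.array([1.0, 0.0, 0.0, 0.0, -1.0])
mu = 1.0
alpha, beta = 1 / (1 + mu), mu / (1 + mu)

psi = y.copy()
for _ in range(200):                       # power-series iteration, Eq. (44)
    psi = alpha * Q @ psi + beta * y
closed = beta * np.linalg.solve(np.eye(5) - alpha * Q, y)   # Eq. (45)
print(np.allclose(psi, closed))            # True
```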

Proof of Proposition 11

If we relax X to be real-valued, then

c_k(H) = min Ncut(Γ_V^k) ≥ min_{Z̄^⊤Z̄=I} trace(Z̄^⊤ L Z̄) = ∑_{i=1}^k λ_i.   (46)

Random Walk View of Hypergraph Laplacian

Spectral clustering on a standard graph can be interpreted using a random walk (Meila and Shi 2001). In the following, we establish the random walk view for clustering on a hypergraph, similarly to Zhou's (Zhou, Huang, and Schölkopf 2006). One can move from the current position u ∈ V to another node v, as long as u, v ∈ e, in the following way: first choose a hyperedge e containing u with probability proportional to w(e), and next choose a node v ∈ e, other than the current position u, from a uniform distribution. Let P denote the transition matrix; each element of P is defined as

p(u, v) := (1/d(u)) ∑_{e∈E_un: u,v∈e} w(e)/(δ_e − 1) = w(u, v)/d(u).

We define π_∞ = (π_∞(u))_{u∈V}, where π_∞(u) = d(u)/vol(V), and it is easy to show that π_∞ is a stationary distribution, that is, P^⊤π_∞ = π_∞. We also note that this Markov chain is reversible, that is, π_∞(u)p(u, v) = π_∞(v)p(v, u) = w(u, v)/vol(V).

We define P_{AB} as the probability of a transition from cluster A to another cluster B when the random walk has reached its stationary distribution. Then, P_{AB} can be written as

P_{AB} = ∑_{u∈A, v∈B} π_∞(u) p(u, v) / π_∞(A) = ∑_{u∈A, v∈B} w(u, v) / vol(A),

to give

Ncut(A, B) = P_{AB} + P_{BA}.

Note that this formulation is consistent with the random walk defined on a standard graph. We also remark that this formulation is somewhat different from Zhou's random walk matrix p(u, v) = ∑_{e∈E_un} w(e) h(u, e) h(v, e)/(d(u) δ_e), which can be obtained by changing the denominator of the definition of the random walk, and also by filling in the non-zero diagonal entries p(u, u) = ∑_{e∈E_un} w(e) h(u, e) h(u, e)/(d(u) δ_e). Our approach is different from Zhou's, which can be seen as a lazy random walk setting that has self-loops in the random walk even if the original hypergraph does not have any self-loops, while ours is a standard random walk that does not have self-loops if none appear in the original. As an example, consider the random walk setting for a standard graph with no self-loops, whose adjacency matrix is A. In Zhou's setting, the walker can move from node v to another node u with probability a_{vu}/2d(v), and stays at the same node v with probability 1/2. On the other hand, our approach is consistent with the random walk on a standard graph, which means that one can move from v to another node u with probability a_{vu}/d(v).
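The stationarity and reversibility claims can be checked directly; the following hedged sketch (ours; a toy weighted adjacency W standing for the reduced hypergraph) verifies P^⊤π_∞ = π_∞ and π_∞(u)p(u, v) = π_∞(v)p(v, u).

```python
import numpy as np

# Toy reduced adjacency W (an assumption); P = D^{-1} W as defined above.
W = np.array([[0.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.5, 0.0],
              [1.0, 1.5, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
d = W.sum(axis=1)
P = W / d[:, None]                 # p(u, v) = w(u, v) / d(u)
pi = d / d.sum()                   # pi(u) = d(u) / vol(V)
print(np.allclose(pi @ P, pi))     # stationarity: P^T pi = pi -> True
print(np.allclose(pi[:, None] * P, (pi[:, None] * P).T))  # reversibility -> True
```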

Proof of Proposition 13

By differentiating Eq. (27) with respect to ψ, we obtain the condition for critical points of Eq. (27) as

∆_pψ − ( S_p(ψ)/‖ψ‖_p^p ) ξ_p(ψ) = 0.   (47)

By Eq. (26), we can immediately see that ψ is then a p-eigenvector of ∆_p. Moreover, the eigenvalue λ_p is given by S_p(ψ)/‖ψ‖_p^p. The last statement follows immediately from the definition.

By the positive semi-definiteness of ∆_p, all p-eigenvalues are nonnegative, and the vector D^{1/2}1 attains λ_p^{(1)} = 0. This shows Corollary 14.

Proof of Theorem 15

Most of the proof can be done in a similar manner to (Bühler and Hein 2009), although the definition of the graph p-Laplacian in (Bühler and Hein 2009) is different from the definition in (Zhou and Schölkopf 2006), and therefore from the graph p-Laplacian induced from our hypergraph p-Laplacian. In (Bühler and Hein 2009), Bühler's graph p-Laplacian ∆_p^{(B)} is defined through

Q_p(ψ) = ⟨ψ, ∆_p^{(B)}ψ⟩ = (1/2) ∑_{v,u∈V} w(u, v)|ψ(v) − ψ(u)|^p,   (48)

where we restrict all the hypergraph functions to standard graph ones, while our definition is given in Definition 4 (Eq. (8)). Note that when p = 2, Q_2(ψ) = S_2(ψ). From this definition, we immediately get the following lemma.

Lemma 16.

∂/∂ψ Q_p(ψ) = p ∆_p^{(B)}ψ.   (49)

Lemma 16 is analogous to Proposition 9. Using this fact, we can show Theorem 15 in a similar way to (Bühler and Hein 2009).

However, since our p-Laplacian is different from Buhler’sp-Laplacian, (Buhler and Hein 2009) cannot be applieddirectly. Namely, we need to set up the different p-mean andp-variant functions, which play an important role to proveTheorem 15;Definition 17. We define p-mean and p-variance on hyper-graph G as follows:

$$\mathrm{mean}_{p,G}(\psi) := \operatorname*{argmin}_c \|\psi - cD^{1/2}\mathbf{1}\|_p^p, \qquad (50)$$
$$\mathrm{var}_{p,G}(\psi) := \min_c \|\psi - cD^{1/2}\mathbf{1}\|_p^p. \qquad (51)$$

In what follows, we write mean_{p,G}(ψ) = mean_p(ψ) and var_{p,G}(ψ) = var_p(ψ) for simplicity.

On the other hand, Buhler’s p-mean and p-varient func-tions are

$$\mathrm{mean}_p^{(u,B)}(\psi) := \operatorname*{argmin}_c \|\psi - c\mathbf{1}\|_p^p, \qquad (52)$$
$$\mathrm{var}_p^{(u,B)}(\psi) := \min_c \|\psi - c\mathbf{1}\|_p^p \qquad (53)$$

for Bühler's unnormalized p-Laplacian, and

$$\mathrm{mean}_p^{(n,B)}(\psi) := \operatorname*{argmin}_c \sum_{v \in V} d(v)\,|\psi(v) - c|^p, \qquad (54)$$
$$\mathrm{var}_p^{(n,B)}(\psi) := \min_c \sum_{v \in V} d(v)\,|\psi(v) - c|^p \qquad (55)$$


for Buhler’s normalized p-Laplacian.This change is postulated from the difference of denomi-

nator of Rayleigh quotient between ours and Buhler’s, that iscaused by the difference of the definition of p-Laplacian. Thismakes a change in the proof of Theorem 15, from Theorem3.2 of (Buhler and Hein 2009). However, apart from this, theproof can be done in a similar manner.
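To make Definition 17 concrete, here is a minimal numeric sketch (the function values, degrees, and helper names are hypothetical, not the paper's): both p-means are one-dimensional convex minimizations, and ours weights the shift direction by $D^{1/2}\mathbf{1}$ while Bühler's unnormalized variant does not, so they differ whenever the degrees are non-constant.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def mean_var_p(psi, d, p):
    """p-mean and p-variance of Definition 17 (Eqs. 50-51)."""
    obj = lambda c: np.sum(np.abs(psi - c * np.sqrt(d)) ** p)
    res = minimize_scalar(obj)
    return res.x, res.fun

def mean_var_p_buhler(psi, p):
    """Buhler's unnormalized p-mean and p-variance (Eqs. 52-53)."""
    obj = lambda c: np.sum(np.abs(psi - c) ** p)
    res = minimize_scalar(obj)
    return res.x, res.fun

psi = np.array([1.0, -0.3, 0.7, 0.1])  # hypothetical function on V
d = np.array([3.0, 1.0, 2.0, 2.0])     # hypothetical degrees
print(mean_var_p(psi, d, p=1.5))
print(mean_var_p_buhler(psi, p=1.5))   # coincides only for constant degrees
```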

We start the proof of Theorem 15 with the following lemma.

Lemma 18. For any c ∈ R and ψ ∈ H(V), the following properties are satisfied by ∆_p and S_p(ψ):

$$\Delta_p(\psi + cD^{1/2}\mathbf{1}) = \Delta_p(\psi), \qquad (56)$$
$$\Delta_p(c\psi) = \xi_p(c)\,\Delta_p(\psi), \qquad (57)$$
$$S_p(\psi + cD^{1/2}\mathbf{1}) = S_p(\psi), \qquad (58)$$
$$S_p(c\psi) = |c|^p S_p(\psi). \qquad (59)$$

Proof. All these statements follow directly from the definitions of ∆_p(ψ) and S_p(ψ).

We shall move on to the basic properties of the p-mean and p-variance.

Proposition 19. The p-variance has the following properties:

$$\mathrm{var}_p(\psi + cD^{1/2}\mathbf{1}) = \mathrm{var}_p(\psi), \qquad (60)$$
$$\mathrm{var}_p(c\psi) = |c|^p\, \mathrm{var}_p(\psi). \qquad (61)$$

Proof. Let the p-means of ψ and ψ + cD^{1/2}1 be given by m_1 = mean_p(ψ) and m_2 = mean_p(ψ + cD^{1/2}1), and write m'_2 = m_1 + c. Then it follows that

$$\mathrm{var}_p(\psi + cD^{1/2}\mathbf{1}) = \min_m \sum_{v \in V} \big|\psi(v) - \sqrt{d(v)}(m - c)\big|^p \le \sum_{v \in V} \big|\psi(v) - \sqrt{d(v)}(m'_2 - c)\big|^p = \sum_{v \in V} \big|\psi(v) - \sqrt{d(v)}\, m_1\big|^p = \mathrm{var}_p(\psi). \qquad (62)$$

Accordingly, with m'_1 = m_2 - c, we obtain var_p(ψ) ≤ var_p(ψ + cD^{1/2}1), and hence var_p(ψ + cD^{1/2}1) = var_p(ψ).

The latter equation can be shown in the same way as in (Bühler and Hein 2009).

Moreover, we have the following statement.

Proposition 20. Let ψ ∈ H(V) and m ∈ R. Then ψ has p-mean m = mean_p(ψ) if and only if the following condition holds:

$$\sum_{v \in V} \sqrt{d(v)}\; \xi_p\big(\psi(v) - \sqrt{d(v)}\, m\big) = 0. \qquad (63)$$

Proof. Differentiating with respect to m yields

$$\partial_m \Big( \sum_{v \in V} \big|\psi(v) - \sqrt{d(v)}\, m\big|^p \Big) = -p \sum_{v \in V} \sqrt{d(v)}\, \big|\psi(v) - \sqrt{d(v)}\, m\big|^{p-1} \operatorname{sgn}\big(\psi(v) - \sqrt{d(v)}\, m\big) = -p \sum_{v \in V} \sqrt{d(v)}\; \xi_p\big(\psi(v) - \sqrt{d(v)}\, m\big), \qquad (64)$$

which implies that a necessary condition for m to minimize $\sum_{v \in V} |\psi(v) - \sqrt{d(v)}\, m|^p$ is $\sum_{v \in V} \sqrt{d(v)}\, \xi_p(\psi(v) - \sqrt{d(v)}\, m) = 0$. Convexity of $\sum_{v \in V} |\psi(v) - \sqrt{d(v)}\, m|^p$ in m implies that this is also a sufficient condition.
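Proposition 20 admits a quick numeric check: minimize $\|\psi - cD^{1/2}\mathbf{1}\|_p^p$ over c and evaluate the left-hand side of Eq. (63) at the minimizer. A minimal sketch with hypothetical ψ and degrees, assuming $\xi_p(x) = |x|^{p-1}\operatorname{sgn}(x)$ consistently with Eq. (64):

```python
import numpy as np
from scipy.optimize import minimize_scalar

xi = lambda x, p: np.abs(x) ** (p - 1) * np.sign(x)  # xi_p(x)

psi = np.array([1.0, -0.3, 0.7, 0.1])  # hypothetical function on V
d = np.array([3.0, 1.0, 2.0, 2.0])     # hypothetical degrees
p = 1.5

# p-mean: minimizer of ||psi - c D^{1/2} 1||_p^p over c.
m = minimize_scalar(lambda c: np.sum(np.abs(psi - c * np.sqrt(d)) ** p)).x

# Eq. (63): the first-order condition vanishes at the p-mean.
print(np.sum(np.sqrt(d) * xi(psi - np.sqrt(d) * m, p)))  # ~ 0
```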

Proposition 21. The derivative of var_p(ψ) with respect to ψ(v) is given as

$$\partial_{\psi(v)} \mathrm{var}_p(\psi) = p\, \xi_p\big(\psi(v) - \sqrt{d(v)}\, \mathrm{mean}_p(\psi)\big). \qquad (65)$$

Proof.

$$\partial_{\psi(v)} \mathrm{var}_p(\psi) = \partial_{\psi(v)} \Big( \sum_{u \in V} \big|\psi(u) - \sqrt{d(u)}\, \mathrm{mean}_p(\psi)\big|^p \Big)$$
$$= \sum_{u \in V} p\, \big|\psi(u) - \sqrt{d(u)}\, \mathrm{mean}_p(\psi)\big|^{p-1} \operatorname{sgn}\big(\psi(u) - \sqrt{d(u)}\, \mathrm{mean}_p(\psi)\big) \times \frac{\partial}{\partial \psi(v)}\big(\psi(u) - \sqrt{d(u)}\, \mathrm{mean}_p(\psi)\big)$$
$$= \sum_{u \in V} p\, \xi_p\big(\psi(u) - \sqrt{d(u)}\, \mathrm{mean}_p(\psi)\big)\, \frac{\partial \psi(u)}{\partial \psi(v)} - \sum_{u \in V} p\, \sqrt{d(u)}\, \xi_p\big(\psi(u) - \sqrt{d(u)}\, \mathrm{mean}_p(\psi)\big)\, \frac{\partial\, \mathrm{mean}_p(\psi)}{\partial \psi(v)}$$
$$= p\, \xi_p\big(\psi(v) - \sqrt{d(v)}\, \mathrm{mean}_p(\psi)\big) - \frac{\partial\, \mathrm{mean}_p(\psi)}{\partial \psi(v)}\, p \sum_{u \in V} \sqrt{d(u)}\, \xi_p\big(\psi(u) - \sqrt{d(u)}\, \mathrm{mean}_p(\psi)\big)$$
$$= p\, \xi_p\big(\psi(v) - \sqrt{d(v)}\, \mathrm{mean}_p(\psi)\big). \qquad (66)$$

The last equality follows from Proposition 20.
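Proposition 21 can likewise be checked by a finite-difference sketch (same hypothetical data and reading of $\xi_p$ as above): the analytic derivative $p\,\xi_p(\psi(v) - \sqrt{d(v)}\,\mathrm{mean}_p(\psi))$ should match a central difference of var_p in one coordinate.

```python
import numpy as np
from scipy.optimize import minimize_scalar

xi = lambda x, p: np.abs(x) ** (p - 1) * np.sign(x)

def var_p(psi, d, p):
    return minimize_scalar(lambda c: np.sum(np.abs(psi - c * np.sqrt(d)) ** p)).fun

psi = np.array([1.0, -0.3, 0.7, 0.1])
d = np.array([3.0, 1.0, 2.0, 2.0])
p, v, eps = 1.5, 2, 1e-5

# Analytic derivative from Proposition 21.
m = minimize_scalar(lambda c: np.sum(np.abs(psi - c * np.sqrt(d)) ** p)).x
analytic = p * xi(psi[v] - np.sqrt(d[v]) * m, p)

# Central finite difference of var_p in the psi(v) coordinate.
e = np.zeros_like(psi); e[v] = eps
numeric = (var_p(psi + e, d, p) - var_p(psi - e, d, p)) / (2 * eps)
print(analytic, numeric)  # should agree closely
```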

Proposition 22. For any function ψ ∈ H(V), let ψ̄ denote its p-mean, ψ̄ = mean_p(ψ) = argmin_c ‖ψ − cD^{1/2}1‖_p^p. Then it holds that

$$R_p^{(2)}(\psi) = R_p(\psi - \bar\psi D^{1/2}\mathbf{1}), \qquad (67)$$
$$\Big(\frac{\partial}{\partial \psi(v)} R_p^{(2)}\Big)(\psi) = \Big(\frac{\partial}{\partial \psi(v)} R_p\Big)(\psi - \bar\psi D^{1/2}\mathbf{1}), \qquad (68)$$
$$\Big(\frac{\partial^2}{\partial \psi(v)\partial \psi(u)} R_p^{(2)}\Big)(\psi) = \Big(\frac{\partial^2}{\partial \psi(v)\partial \psi(u)} R_p\Big)(\psi - \bar\psi D^{1/2}\mathbf{1}) + R_p^{(2)}(\psi)\, \Omega(\psi)_{u,v}, \qquad (69)$$

where

$$\Omega(\psi)_{v,u} = \frac{p(p-1)\, \big|\psi(u) - \sqrt{d(u)}\,\bar\psi\big|^{p-2}\, \big|\psi(v) - \sqrt{d(v)}\,\bar\psi\big|^{p-2}}{\sum_{w \in V} \big|\psi(w) - \sqrt{d(w)}\,\bar\psi\big|^{p}\; \sum_{w \in V} \big|\psi(w) - \sqrt{d(w)}\,\bar\psi\big|^{p-2}}. \qquad (70)$$


Proof. Eq. (67) can be directly proven by Lemma 18. Writing

$$\mathrm{var}_p(\psi) = \min_c \|\psi - cD^{1/2}\mathbf{1}\|_p^p, \qquad (71)$$

we can rewrite $R_p^{(2)}$ as

$$R_p^{(2)}(\psi) = \frac{S_p(\psi)}{\mathrm{var}_p(\psi)}. \qquad (72)$$

Here the derivative of $R_p^{(2)}$ with respect to ψ(v) can be rewritten as

$$\Big(\frac{\partial}{\partial \psi(v)} R_p^{(2)}\Big)(\psi) = \frac{p\, \Delta_p \psi(v)}{\mathrm{var}_p(\psi)} - \frac{p\, S_p(\psi)}{\mathrm{var}_p^2(\psi)}\, \xi_p\big(\psi(v) - \sqrt{d(v)}\,\bar\psi\big) \qquad (73)$$

using Proposition 21. Eq. (73) can be rewritten as

$$\frac{p}{\|\psi - \bar\psi D^{1/2}\mathbf{1}\|_p^p}\, \Delta_p\big(\psi - \bar\psi D^{1/2}\mathbf{1}\big)\Big|_v - \frac{p\, S_p(\psi - \bar\psi D^{1/2}\mathbf{1})}{\|\psi - \bar\psi D^{1/2}\mathbf{1}\|_p^{2p}}\, \xi_p\big(\psi(v) - \sqrt{d(v)}\,\bar\psi\big), \qquad (74)$$

which yields the second statement by comparison with $(\partial/\partial\psi(v))\, R_p(\psi)$.

The third statement can be shown in the same manner as in (Bühler and Hein 2009).

By using Lemma 18 and Proposition 22, we can show Theorem 15 in the same way as the proof of Theorem 3.2 in (Bühler and Hein 2009).

Discussion on Comparison to Other Hypergraph Laplacians and Related Regularizers

Table 4 summarizes the forms of standard graph and hypergraph regularization using various Laplacians and the total variation regularizer, where

$$w_{R_p}(u, v) = \sum_{e \in E_{un}:\, u, v \in e} w(e) \left(-\|\nabla\psi_e\|^{p-2} + \|\nabla\psi(u)\|^{p-2} + \|\nabla\psi(v)\|^{p-2}\right),$$

and where $W_{R_p}$ is the matrix whose elements are $w_{R_p}(u, v)$ and $D_{R_p}$ is the diagonal matrix whose elements are $d_{R_p}(v) = \sum_{u \in V} w_{R_p}(u, v)$.

Regarding a concrete example of a hypergraph where the graph p-Laplacian and the hypergraph p-Laplacian are not equal, consider an undirected hypergraph G with V = {v_1, v_2, v_3}, E = {e} where e = {v_1, v_2, v_3} and w(e) = 1, and a function ψ ∈ H(V) over G. The hypergraph gradient for v_1 and e is computed as follows:

$$\nabla\psi(e, v_1) = \frac{\sqrt{w(e)}}{\sqrt{\delta_e - 1}} \sum_{u \in V \setminus \{v_1\}} \left(\frac{\psi(u)}{\sqrt{d(u)}} - \frac{\psi(v_1)}{\sqrt{d(v_1)}}\right) = \frac{1}{\sqrt{2}}\big(\psi(v_2) + \psi(v_3) - 2\psi(v_1)\big). \qquad (75)$$

Hence the norm of the gradient at v_1 can be computed as

$$\|\nabla\psi(v_1)\| = \left(\sum_{e \in E:\, e[1] = v_1} \frac{(\nabla\psi)^2(e)}{\delta_e!}\right)^{1/2} = \left(\frac{1}{3}\left(\frac{1}{\sqrt{2}}\big(\psi(v_2) + \psi(v_3) - 2\psi(v_1)\big)\right)^2\right)^{1/2} = \frac{1}{\sqrt{3 \cdot 2}}\,\big|\psi(v_2) + \psi(v_3) - 2\psi(v_1)\big|. \qquad (76)$$

We can compute the values for v_2 and v_3 in the same manner. Now the p-Laplacian is

$$l_p(v_1, v_2) = -\sum_{e \in E_{un}:\, v_1, v_2 \in e} \frac{w(e)}{(\delta_e - 1)\sqrt{d(v_1)d(v_2)}} \left(-\|\nabla\psi_e\|^{p-2} + \|\nabla\psi(v_1)\|^{p-2} + \|\nabla\psi(v_2)\|^{p-2}\right)$$
$$= -\frac{1}{2}\left(-\|\nabla\psi_e\|^{p-2} + \|\nabla\psi(v_1)\|^{p-2} + \|\nabla\psi(v_2)\|^{p-2}\right)$$
$$= -\frac{1}{2}\left(\frac{2}{3}\|\nabla\psi(v_1)\|^{p-2} + \frac{2}{3}\|\nabla\psi(v_2)\|^{p-2} - \frac{1}{3}\|\nabla\psi(v_3)\|^{p-2}\right). \qquad (77)$$

On the other hand, in this setting we can obtain a reduced graph from the hypergraph, represented by the adjacency matrix A,

$$A = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 1/2 & 0 & 1/2 \\ 1/2 & 1/2 & 0 \end{pmatrix}. \qquad (78)$$
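A minimal sketch of the clique expansion behind Eq. (78), where each pair within a hyperedge e receives weight w(e)/(δ_e − 1) (the variable names are ours):

```python
import numpy as np

edges, w = [{0, 1, 2}], [1.0]   # e = {v1, v2, v3}, w(e) = 1
n = 3
A = np.zeros((n, n))
for e, we in zip(edges, w):
    for u in e:
        for v in e:
            if u != v:
                A[u, v] += we / (len(e) - 1)   # w(e) / (delta_e - 1)
print(A)  # off-diagonal entries 1/2, matching Eq. (78)
```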

We can compute the graph gradient of (Zhou and Schölkopf 2006) for v_1 and v_2 as follows:

$$\nabla^{(g)}\psi(v_1, v_2) = \frac{\sqrt{w(v_1, v_2)}}{\sqrt{d(v_2)}}\,\psi(v_2) - \frac{\sqrt{w(v_1, v_2)}}{\sqrt{d(v_1)}}\,\psi(v_1) = \sqrt{\frac{1}{2}}\,\big(\psi(v_2) - \psi(v_1)\big), \qquad (79)$$

and we get

$$\|\nabla^{(g)}\psi(v_1)\| = \left(\sum_{e \in E:\, e[1] = v_1} \frac{(\nabla^{(g)}\psi)^2(e)}{\delta_e!}\right)^{1/2} = \left(\frac{1}{2}\big(\psi(v_2) - \psi(v_1)\big)^2 + \frac{1}{2}\big(\psi(v_3) - \psi(v_1)\big)^2\right)^{1/2}, \qquad (80)$$

where the superscript (g) denotes operators for graphs.


Table 4: Comparison of standard graph and hypergraph regularizations based on various Laplacians and total variation.

Clique Expansion (Rodriguez 2002):
  Hypergraph (p = 2): $\psi^\top(I - D_R^{-1/2} H W_e H^\top D_R^{-1/2})\psi$
  Hypergraph (p): $\psi^\top D^{-1/2}(D_{R_p} - W_{R_p})D^{-1/2}\psi$
  Graph (p = 2): $\psi^\top(I - D^{-1/2} W D^{-1/2})\psi$
  Graph (p): $\frac{1}{2}\sum_{u,v \in E} w_{R_p}(u, v)(\psi(u) - \psi(v))^2$

Clique Expansion (this work):
  Hypergraph (p = 2): $\psi^\top D^{-1/2}(D - W)D^{-1/2}\psi$
  Hypergraph (p): $\psi^\top D^{-1/2}(D_p - W_p)D^{-1/2}\psi$
  Graph (p = 2): $\psi^\top(I - D^{-1/2} W D^{-1/2})\psi$
  Graph (p): $\frac{1}{2}\sum_{u,v \in E} w_p(u, v)(\psi(u) - \psi(v))^2$

Star Expansion (Zhou et al. 2006):
  Hypergraph (p = 2): $\psi^\top(I - D_v^{-1/2} H W_e D_e^{-1} H^\top D_v^{-1/2})\psi$
  Hypergraph (p): -
  Graph (p = 2): $\psi^\top\big((I - D^{-1/2} W D^{-1/2})/2\big)\psi$
  Graph (p): -

Total Variation (Hein et al. 2013):
  Hypergraph (p = 2): $\sum_e w(e)\big(\max_{v \in e} \psi(v) - \min_{u \in e} \psi(u)\big)^2$
  Hypergraph (p): $\sum_e w(e)\big(\max_{v \in e} \psi(v) - \min_{u \in e} \psi(u)\big)^p$
  Graph (p = 2): $\psi^\top(I - D^{-1/2} W D^{-1/2})\psi = \frac{1}{2}\sum_{u,v \in E} w(u, v)(\psi(u) - \psi(v))^2$
  Graph (p): $\frac{1}{2}\sum_{u,v \in E} w(u, v)|\psi(u) - \psi(v)|^p$

From these results we can compute the p-Laplacian matrix for the graph as

$$l_p^{(g)}(v_1, v_2) = -\frac{1}{2}\,\frac{w(v_1, v_2)}{\sqrt{d(v_1)d(v_2)}}\left(\|\nabla^{(g)}\psi(v_1)\|^{p-2} + \|\nabla^{(g)}\psi(v_2)\|^{p-2}\right) \qquad (81)$$
$$= -\frac{1}{4}\left(\|\nabla^{(g)}\psi(v_1)\|^{p-2} + \|\nabla^{(g)}\psi(v_2)\|^{p-2}\right). \qquad (82)$$

Let us consider the case of p = 1 with ψ(v_1) = 1, ψ(v_2) = 0, and ψ(v_3) = 0. We then obtain

$$\|\nabla\psi(v_1)\| = \frac{2}{\sqrt{3 \cdot 2}}, \qquad (83)$$
$$\|\nabla\psi(v_2)\| = \|\nabla\psi(v_3)\| = \frac{1}{\sqrt{3 \cdot 2}}, \qquad (84)$$
$$l_1(v_1, v_2) = l_1(v_1, v_3) = -\frac{\sqrt{6}}{3}, \qquad (85)$$
$$l_1(v_1, v_1) = -\big(l_1(v_1, v_2) + l_1(v_1, v_3)\big) = \frac{2\sqrt{6}}{3}, \qquad (86)$$
$$\|\nabla^{(g)}\psi(v_1)\| = 1, \qquad (87)$$
$$\|\nabla^{(g)}\psi(v_2)\| = \|\nabla^{(g)}\psi(v_3)\| = \frac{1}{\sqrt{2}}, \qquad (88)$$
$$l_1^{(g)}(v_1, v_2) = l_1^{(g)}(v_1, v_3) = -\frac{1}{4}\big(1 + \sqrt{2}\big), \qquad (89)$$
$$l_1^{(g)}(v_1, v_1) = -\big(l_1^{(g)}(v_1, v_2) + l_1^{(g)}(v_1, v_3)\big) = \frac{1}{2}\big(1 + \sqrt{2}\big), \qquad (90)$$

which yields $\Delta_1(\psi)(v_1) = 2\sqrt{6}/3 = 4/\sqrt{6}$ and $\Delta_1^{(g)}(\psi)(v_1) = 1/2 + \sqrt{2}/2$. Additionally, for this setting we get $\Delta_1^{(u,B)} = \Delta_1^{(n,B)} = \sum_{u \sim v} w(u, v)\,\xi_1(\psi(u) - \psi(v)) = 1/2$, where

$$\Delta_p^{(u,B)}(v) = \sum_{u \sim v} w(u, v)\, \xi_p\big(\psi(u) - \psi(v)\big), \qquad (91)$$
$$\Delta_p^{(n,B)}(v) = \frac{1}{d(v)} \sum_{u \sim v} w(u, v)\, \xi_p\big(\psi(u) - \psi(v)\big). \qquad (92)$$

We note that when p = 2, our and Zhou’s Laplacians wouldgive the same values as ∆2(v1) = ∆

(g)2 (v1) = 1. However,

both unnormalized and normalized Laplacian in (Buhler andHein 2009) are not same as our and Zhou’s Laplacian.

Our and Zhou’s way to formalize Laplacian need nor-malizing factor 1/

√d(v) for any ψ(v), while Laplacians

in (Buhler and Hein 2009) do not have normalizing factor forthe function. This means that our and Zhou’s Laplacian andBuhler’s Laplacian are not same.

These results show that the graph p-Laplacians of (Zhou and Schölkopf 2006) and (Bühler and Hein 2009), constructed from the graph reduced from a hypergraph, are not equal to our hypergraph p-Laplacian.
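As a numeric sanity check of the worked example above (a sketch under our reading of Eqs. (75)-(90); the variable names and the vectorized computation are ours):

```python
import numpy as np

psi = np.array([1.0, 0.0, 0.0])
p = 1.0

# Hypergraph side: e = {v1, v2, v3}, w(e) = 1, d(v) = 1, delta_e = 3.
grad = (psi.sum() - 3 * psi) / np.sqrt(2)   # Eq. (75) at each vertex
norms = np.abs(grad) / np.sqrt(3)           # Eq. (76): 2/sqrt(6), 1/sqrt(6), 1/sqrt(6)
npw = norms ** (p - 2)
l12 = -0.5 * (2/3 * npw[0] + 2/3 * npw[1] - 1/3 * npw[2])   # Eq. (77)
l13 = -0.5 * (2/3 * npw[0] + 2/3 * npw[2] - 1/3 * npw[1])
delta_h = -(l12 + l13) * psi[0]             # Delta_1(psi)(v1)

# Reduced-graph side: triangle with w(u, v) = 1/2, d(v) = 1 (Eqs. 78-82).
pairs = [(0, 1, 2), (1, 0, 2), (2, 0, 1)]
gnorm = np.array([np.sqrt(0.5 * (psi[j] - psi[i]) ** 2 +
                          0.5 * (psi[k] - psi[i]) ** 2) for i, j, k in pairs])
g = gnorm ** (p - 2)
lg11 = 0.25 * (g[0] + g[1]) + 0.25 * (g[0] + g[2])   # -(lg12 + lg13)
delta_g = lg11 * psi[0]                              # Delta_1^{(g)}(psi)(v1)

print(delta_h, 2 * np.sqrt(6) / 3)      # both ~1.633 (hypergraph value)
print(delta_g, 0.5 * (1 + np.sqrt(2)))  # both ~1.207 (graph value)
```

The two printed values differ, illustrating concretely that the reduced-graph p-Laplacian does not reproduce the hypergraph p-Laplacian for p = 1.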