25
Noname manuscript No. (will be inserted by the editor) Beyond rankings: comparing directed acyclic graphs Eric Malmi · Nikolaj Tatti · Aristides Gionis Received: date / Accepted: date Abstract Defining appropriate distance measures among rankings is a classic area of study which has led to many useful applications. In this paper, we propose a more general abstraction of preference data, namely directed acyclic graphs (DAGs), and introduce a measure for comparing DAGs, given that a vertex correspondence between the DAGs is known. We study the properties of this measure and use it to aggregate and cluster a set of DAGs. We show that these problems are NP-hard and present efficient methods to obtain solutions with approximation guarantees. In addition to preference data, these methods turn out to have other interesting applications, such as the analysis of a collection of information cascades in a network. We test the methods on synthetic and real-world datasets, showing that the methods can be used to, e.g., find a set of influential individuals related to a set of topics in a network or to discover meaningful and occasionally surprising clustering structure. Keywords directed acyclic graphs · aggregation · clustering · preferences · information cascades 1 Introduction Rankings and partial rankings are used in real-life applications to model asso- ciations in data. Examples include ranking documents for information-retrieval applications, ranking user preferences for modeling users and making rec- ommendations, ranking experts for learning applications, and many more. E. Malmi HIIT, Aalto University, Espoo, Finland E-mail: eric.malmi@aalto.fi N. Tatti E-mail: nikolaj.tatti@aalto.fi A. Gionis E-mail: aristides.gionis@aalto.fi

Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

Noname manuscript No.(will be inserted by the editor)

Beyond rankings: comparing directed acyclic graphs

Eric Malmi · Nikolaj Tatti ·Aristides Gionis

Received: date / Accepted: date

Abstract Defining appropriate distance measures among rankings is a classicarea of study which has led to many useful applications. In this paper, wepropose a more general abstraction of preference data, namely directed acyclicgraphs (DAGs), and introduce a measure for comparing DAGs, given that avertex correspondence between the DAGs is known. We study the propertiesof this measure and use it to aggregate and cluster a set of DAGs. We showthat these problems are NP-hard and present efficient methods to obtainsolutions with approximation guarantees. In addition to preference data, thesemethods turn out to have other interesting applications, such as the analysisof a collection of information cascades in a network. We test the methods onsynthetic and real-world datasets, showing that the methods can be used to,e.g., find a set of influential individuals related to a set of topics in a networkor to discover meaningful and occasionally surprising clustering structure.

Keywords directed acyclic graphs · aggregation · clustering · preferences ·information cascades

1 Introduction

Rankings and partial rankings are used in real-life applications to model asso-ciations in data. Examples include ranking documents for information-retrievalapplications, ranking user preferences for modeling users and making rec-ommendations, ranking experts for learning applications, and many more.

E. MalmiHIIT, Aalto University, Espoo, FinlandE-mail: [email protected]

N. TattiE-mail: [email protected]

A. GionisE-mail: [email protected]

Page 2: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

2 Eric Malmi et al.

Defining appropriate distance measures among rankings is a classic area ofstudy [18, 26, 27] and it provides useful tools to analyze datasets that containrankings.

The problems of aggregating and clustering rankings are motivated by manyapplication scenarios and they have been widely studied in the literature [2,6, 31, 32]. For example, the rank-aggregation problem arises in meta-searchengines, where a number of search engines return different rankings over a set ofdocuments and the goal is to find a new ranking that incorporates the rankingsof the individual search engines in the best possible way. Similarly, the problemof clustering rankings is encountered when users state their preferences withrespect to a set of items (commercial products, movies, restaurants, etc.) andone needs to segment the user base so that in each market segment the usersagree as much as possible with respect to their rankings.

A number of authors convincingly argue that rankings encountered in prac-tical applications are not total orders over the full set of items [1, 13, 14].Instead, it is more common to be confronted with missing information, andrankings become partial rankings (rankings with ties) and top-k lists. Conse-quently, the problem of defining distance measures among partial rankings andtop-k lists was studied in the literature, as well as the problem of aggregatingsuch rankings with missing information.

In this paper, we go one step further towards modeling and managing datawith missing information. We assume that associations in data are modeled notjust by partial rankings or by top-k lists, but by the more general abstraction ofdirected acyclic graphs (DAGs). We view DAGs as a useful abstraction to modelinteresting associations between data items in many modern applications. Wethus consider the problem of developing methods that can be used to cope withdata represented as DAGs. We start by addressing the problem of devising ameaningful distance measure between DAGs, and we then consider the classicproblems of aggregation and clustering on DAG data. We should point out thatin our problem setting, the correspondence between the vertices of differentgraphs is known, as will become clear later in the examples and experimentsthat we provide. This makes the comparison of two graphs computationallymuch cheaper than in methods based on graph edit distances (see e.g. [9,22]).

The notion of a distance between partial orders, that is, transitively closedDAGs, has been previously introduced by Brandenburg et al. [7, 8]. However,computing the two distance measures they propose results in an NP or coNP-complete problem, so more efficient methods are called for.

Before discussing our methods and our contributions in more detail, weprovide further motivation for our approach by giving examples of real-worlddatasets modeled as DAGs.

User preferences. Humans are better at making comparative rather thanabsolute judgements [29]. In many cases, preferences of users over a set ofitems can be recorded via user-feedback mechanisms. For example, when auser is selecting an item among a small number of choices provided by theapplication, the user, implicitly states a preference for that item over the rest.

Page 3: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

Beyond rankings: comparing directed acyclic graphs 3

In such applications, a user can be modeled by a preference graph, a directedgraph that represents preferences between pairs of data items. Such preferencegraphs are likely to be DAGs, or near-DAGs, as the underlying preferencerelation tends to be transitive. The crucial observation here is that since weare typically recording preferences over small sets of items, a large number ofedges is missing, and thus, the preference graph resembles a DAG rather thana full or a partial ranking.

Information cascades. Our second example arises in the analysis of infor-mation cascades in social media [3, 4, 16, 17, 20, 25, 30, 33, 34]. In this casewe consider “actions” performed in a social network. Due to social influence,copies of such actions are performed by neighboring nodes in the network, e.g.,retweets in the Twitter network, and thus information cascades are observed.Due to time ordering, the cascade of an action in a network is a DAG. Analysisof information cascades is important in understanding dynamic phenomena inthe social network, for example, who are the influential nodes, how informa-tion propagates, and so on. Therefore, one needs appropriate tools to compare,aggregate, and cluster cascades (DAGs) of different actions.

Motivated from the above applications, we develop techniques for manag-ing data represented as DAGs. We first propose a sound distance measure forcomparing DAGs. Our measure extends the Kendall-tau measure for full rank-ings and partial rankings. Similar to the work of Fagin et al. [13], our measureuses parameters 0 ≤ p, q ≤ 1 to penalize for pairs of items for which one, orboth, of the two DAGs do not make a clear ordering decision. We study theproposed distance measure with respect to its metric properties. We are ableto show that the measure satisfies a relaxed version of the triangle inequality,where the “slackness” factor depends on the penalty parameters p and q.

The proposed distance measure is then used to define the problems ofaggregating and clustering a set of DAGs. We show that the DAG aggregationand DAG clustering problems are NP-hard and we present efficient methodsto obtain solutions with approximation guarantees. Our solutions rely cruciallyon the relaxed triangle inequality property of the proposed distance measure.

We test our methods on synthetic and real-world datasets. Our experimentsshow that a simple greedy approach yields good performance. It can be used torecover ground-truth DAGs when data are inflicted with high levels of noise,as well as to identify meaningful clusters in real-world applications where dataare represented with DAGs.

The rest of the paper is organized as follows. In Section 2 we discuss basicconcepts and we introduce our notation. Our distance measure is presented inSection 3. Sections 4 and 5 discuss the problems of aggregating and clusteringDAGs. In Section 6 we discuss other work related to the problems we consider.Our experimental evaluation on synthetic and real datasets are presented inSection 7, while Section 8 is a short conclusion.

Page 4: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

4 Eric Malmi et al.

2 Preliminaries and Notation

We will discuss basic concepts and we establish the notation that we will usethroughout the remaining paper.

A directed graph G = (V,E) is a tuple of vertices V and edges E, whichis a set of ordered pairs (i, j) ∈ V × V . A directed acyclic graph (DAG) is adirected graph that has no directed cycles.

Without loss of generality, we assume that graphs share the same set ofvertices. Indeed, if two graphs have different set of vertices, the missing verticesfor each graph can be added as singletons. Furthermore, as we will see, ourDAG-comparison measure penalizes appropriately for such missing verticesadded as singletons. Consequently, we assume that when comparing two DAGswe can just compare their edges. Given two DAGs G1 = (V,E1) and G2 =(V,E2) we say that a vertex pair (i, j) is a discordant if (i, j) ∈ E1 and (j, i) ∈E2. On the other hand, if (i, j) ∈ E1 and (i, j) ∈ E2, then we say that the pairis concordant.

A special case of DAG is a total order, in which the vertices are assumedto have an order and the edge set E consists of

(|V |2

)edges, so that (i, j) ∈ E

if and only if i occurs before j in the total order. We should point out that wedo not assume that our graphs are transitively closed. This means that whilewe can apply our distance to a partial order, which is essentially a transitivelyclosed DAG, we also consider DAGs that are not partial orders.

Given two total orders G1 and G2, a Kendall-tau distance K(G1, G2) be-tween the two orders is the number of all discordant vertex pairs between them.Kendall-tau is a distance measure that takes values between 0 and

(|V |2

).1 It

becomes 0 when the two orders are the same, and it becomes(|V |

2

)when G1 is a

reversed version of G2. Kendall-tau is a distance measure that is widely used tocompare rankings. Fagin et al. [13] extend the Kendall-tau distance (as well asother ranking distances) for partial rankings, that is, rankings with ties. Theirwork is not only theoretically interesting but of great practical importance,since orders (rankings) encountered in practice are rarely total orders (fullrankings). In this paper, we take one further step on the problem of devisingmeasures to compare rankings: we extend the Kendall-tau distance to generalDAGs. Consequently, our methods provide a basis to deal with datasets inwhich DAGs is the appropriate data abstraction.

3 Measuring DAG distance

Our goal is to develop a meaningful and well-founded distance measure be-tween DAGs. We first define such a distance measure as a generalization ofKendall-tau, and then study its properties. In particular we show that theproposed measure satisfies a relaxed version of the triangle inequality. This

1 Most often the Kendall-tau distance is defined to be a value between 0 and 1 by nor-

malizing with the total number of vertex pairs(|V |

2

).

Page 5: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

Beyond rankings: comparing directed acyclic graphs 5

near-metric property is particularly useful as it allows to develop approxima-tion algorithms for aggregation and clustering tasks.

3.1 Generalization of Kendall-tau for DAGs

As with the Kendall-tau distance, we consider a measure defined over pairsof vertices in the two DAGs. As opposed to full rankings (total orders), inDAGs a pair of vertices may not be directly comparable. As a result, giventwo directed acyclic graphs, a pair of vertices may be neither concordant nordiscordant. Thus, a distance measure that considers pairs of vertices in twoDAGs, should consider the following two cases of assigning a penalty:

Discordant pairs: We should penalize an edge (i, j) if i precedes j in one graphwhile j precedes i in the other graph.

Potentially discordant pairs: If one or both of the input DAGs are not fullyconnected then there is a pair of vertices (i, j) for which we do not know ifi precedes j or vice versa, so we have a potentially discordant pair, whichshould be penalized less heavily than a pure discordant pair.

The proposed DAG distance measure satisfies the above two requirements. Ourapproach is similar to the approach by Fagin et al. [13], where they generalizethe Kendall-tau distance to partial rankings by assigning a penalty 0 < p < 1to pairs for which the two partial rankings do not have clear agreement ordisagreement. In a similar way, our DAG distance measure is defined withrespect to partial penalty parameters p and q, where 0 ≤ p, q ≤ 1. As it willbecome clear below we also assume q ≤ p.

For our definition, we also assume an arbitrary ordering of the verticesof the two DAGs. Such an ordering is only assumed for notational simplicityand is not related to the concept of a total order. This ordering is only usedto ensure that each pair of vertices i and j is considered only once, so anyarbitrary bijection from the set of vertices V to {1, . . . , |V |} can be used. Weshould stress that even though the definition of the DAG distance is based onan ordering of vertices, the final result does not depend on it.

Let G1 = (V,E1) and G2 = (V,E2) be two DAGs and let 0 ≤ p, q ≤ 1be fixed parameters. Let i and j be two vertices of V such that i < j. Let

e = (i, j) and f = (j, i) be e with reversed direction. We define K(p,q)ij (G1, G2)

to be a penalty associated to vertices i and j for the DAGs G1 and G2 withrespect to the parameters p and q. We consider four cases:

Case 1: (e ∈ E1 and e ∈ E2) or (f ∈ E1 and f ∈ E2). In this case, G1 and G2

agree on e, and we set the distance to be Kij(G1, G2) = 0.Case 2: (e ∈ E1 and f ∈ E2) or (e ∈ E2 and f ∈ E1). In this case, G1 and G2

completely disagree on e, and we set the distance to be Kij(G1, G2) = 1.Case 3: (e or f ∈ E1 and e, f /∈ E2) or (e or f ∈ E2 and e, f /∈ E1). In this

case, an edge between i and j exists in one graph but not in the other. Weset Kij(G1, G2) = p.

Page 6: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

6 Eric Malmi et al.

ab

cd

(a) G1

ab

cd

(b) G2

ab

(c) H1

ab

(d) H2

ab

(e) H3

Fig. 1 Toy graphs related to Examples 1–2

Case 4: e, f /∈ E1 and e, f /∈ E2. In this last case there is no edge betweeni and j in either of the graphs. We set Kij(G1, G2) = q. The motivationfor setting q > 0 is that otherwise two empty DAGs are as similar as twoidentical complete DAGs even though one might argue that the latter twohave much more in common than the empty DAGs.

We are now ready to define our DAG distance measure.

Definition 1 Let G1 = (V,E1) and G2 = (V,E2) be two DAGs and let0 ≤ p, q ≤ 1 be fixed parameters. We define the DAG distance K(p,q)(G1, G2)to be the sum of distances over pairs of vertices (i, j) ∈ V ×V such that i < j,

K(p,q)(G1, G2) =∑

(i,j)∈V×Vi<j

K(p,q)ij (G1, G2) .

Whenever clear from the context, we will drop p and q from the notation andwrite K (G1, G2) instead of K(p,q)(G1, G2).

Example 1 Let us compute the distance K(p,q)(G1, G2) of the graphs given inFigure 1. There is one concordant edge, namely (a, b), one discordant pair,(c, d), one pair of type Case 3, (a, d), and the remaining three pairs are ofCase 4. Hence, the distance is equal to

K(p,q)(G1, G2) = 0 + 1 + p + 3q.

A few observations are in order: First, it is easy to see that in the case thatthe two DAGs are total orders, Definition 1 reduces to the standard Kendall-tau.

Second, it is reasonable to assume that Case 3, when an edge exists in onegraph but not the other, is a case of stronger or equal disagreement comparedto Case 4, when an edge is missing from both graphs. Therefore, we will assumethat the penalty associated to Case 3 is greater than or equal to the penaltyassociated to Case 4, that is, q ≤ p. In fact, the case q > p has undesirableconsequences. For example, for the problem of DAG aggregation, which wediscuss in the next section, q > p implies that the optimal centroid of twoempty DAGs will be a complete DAG.

Page 7: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

Beyond rankings: comparing directed acyclic graphs 7

3.2 Relaxed triangle inequality

Our next step is to examine whether the proposed distance measure K isa metric. We remind that for a distance measure d : X × X → R to bea metric, the following properties should hold: (i) non-negativity: d(x, y) ≥0; (ii) symmetry: d(x, y) = d(y, x); (iii) identity: d(x, y) = 0 if and only ifx = y; and (iv) triangle inequality: d(x, z) ≤ d(x, y) + d(y, z). The reasonfor undertaking this study is twofold: First, satisfying the metric propertiesprovides additional evidence that our distance measure is meaningful and wellfounded. Second, many algorithms have been designed to take advantage ofmetric properties with respect to efficiency, e.g., via effective pruning rules,or with respect to satisfying certain quality guarantees. Indeed, after we showthat the proposed distance measure K satisfies relaxed metric properties, weare able to exploit those properties in order to obtain high-quality solutionsfor the problems of aggregating and clustering DAGs.

Back to the question whether the proposed distance K satisfies the metricproperties, it is immediate that the distance is non-negative and symmetric. Ifq = 0, then the distance satisfies the identity property, namely, K (G,G) = 0,however this property does not hold for q > 0 as penalty q is induced by edgesthat are missing from both input graphs. Note that this was done by design aswe want to punish potential discordant pairs. The most important propertywith respect to designing approximation algorithms is the triangle inequality.Our next result shows that the distance measure K satisfies a relaxed versiontriangle inequality, meaning that for any directed acyclic graphs G1, G2, G3

it holds K (G1, G2) ≤ c(K (G1, G3) + K (G3, G2)), for some constant c ≥ 1,which depends on the parameters p and q.

Proposition 1 Let 0 ≤ q ≤ p ≤ 1 and 0 < p. Then distance measure Ksatisfies relaxed triangle inequality,

K (G1, G2) ≤ c(K (G1, G3) + K (G2, G3)),

where c = max(4p2, q + max(2p, 1)

)/2p.

Proof We will prove the result by upper bounding the distance K (G1, G2) andlower bounding the distance measures K (G1, G3) and K (G2, G3).

Consider K (G1, G2), and let ki be the number of Case i edges as definedin Section 3.1. The distance is equal to

K (G1, G2) = k2 + pk3 + qk4.

Define also n1, . . . , n4 for K (G1, G3) and m1, . . . ,m4 for K (G2, G3).Consider a set D = (E1 \ E2) ∪ (E2 \ E1). For every Case 2 edge, say

e = (i, j), there are two edges (i, j) ∈ D and (j, i) ∈ D. The remaining edgesin D correspond to Case 3 edges possibly with a reversed direction. This impliesthat |D| = 2k2 + k3. Let k = 2k2 + k3, n = 2n2 + n3, and m = 2m2 + m3.The previous argument shows that k is the number of directed edges whereG1 and G2 disagree. This is a known metric and it follows that k ≤ n + m.

Page 8: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

8 Eric Malmi et al.

Note that any edge of Case 4 in K (G1, G2) can only be Case 3 or Case 4edge in K (G1, G3). Hence, k4 ≤ n3 + n4 and similarly k4 ≤ m3 + m4.

We consider three cases, depending on values of p and q.First, let us assume that p ≤ 1/2. Since k2 ≤ k/2 we can upper bound

K (G1, G2) by

K (G1, G2) = k2 + pk3 + qk4 = k2 + p(k − 2k2) + qk4

= (1− 2p)k2 + pk + qk4

≤ (1− 2p)k/2 + pk + qk4 = k/2 + qk4.

Define x = p/(1 + q). Note that x + qx = p and x ≤ 1/2. Consequently,

K (G1, G3) = n2 + pn3 + qn4 = n2 + xn3 + q(n4 + xn3)

= n2 + x(n− 2n2) + q(n4 + xn3)

= (1− 2x)n2 + xn + q(n4 + xn3)

≥ xn + q(n4 + xn3).

Define c = (1 + q)/(2p). Note that cx = 1/2 and that c ≥ 1/2. Multiplyingthe previous inequality with c gives us

cK (G1, G3) ≥ cxn + cq(n4 + xn3) = n/2 + q(cn4 + n3/2)

≥ n/2 + q/2(n4 + n3).

Similarly, we can lower bound the distance K (G2, G3) by

cK (G2, G3) ≥ m/2 + q/2(m4 + m3).

We can now combine the inequalities,

K (G1, G2) ≤ k/2 + qk4

≤ (n + m)/2 + q/2(m3 + m4 + n3 + n4)

≤ c(K (G2, G3) + K (G1, G3)).

Assume now that p ≥ 1/2. We can upper bound K (G1, G2) by settingk2 = 0,

K (G1, G2) = (1− 2p)k2 + pk + qk4 ≤ pk + qk4.

Assume that 4p2 − q ≤ 2p. Let z = p/(q + 2p) and x = p− qz = 2p2/(q + 2p).Note that the assumption implies that x ≤ 1/2, which allows us to lower bound

K (G1, G3) = (1− 2x)n2 + xn + q(n4 + zn3)

≥ xn + q(n4 + zn3).

Let c = (q + 2p)/2p. Then cx = p and cz = 1/2. Since c ≥ 1/2, we have

cK (G1, G3) ≥ pn + q/2(n4 + n3).

Combining the inequalities, similar to the first case, proves the proposition forthe second case.

Page 9: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

Beyond rankings: comparing directed acyclic graphs 9

Finally, assume that 4p2 − q ≥ 2p. Let z = 1/(4p) and set x = p − qz =(4p2 − q)/(4p2). Since x ≥ 1/2, we obtain a lower bound

K (G1, G3) = (1− 2x)n2 + xn + q(n4 + zn3)

≥ n/2 + q(n4 + zn3).

Let c = 2p. Then c/2 = p and cz = 1/2. Since c ≥ 1/2,

cK (G1, G3) ≥ pn + q/2(n4 + n3).

Combining the inequalities, similar to the first case, proves the proposition forthe third case.

By setting c = max(4p2, q + max(2p, 1)

)/2p, we are able to handle all three

cases simultaneously. ut

Note that if we set p = 1/2 and q = 0, then the distance satisfies thetriangle inequality. In fact, K(1/2,0)(G1, G2) is a half of symmetric differencebetween edge sets E1 and E2,

K(1/2,0)(G1, G2) =1

2|(E1 \ E2) ∪ (E2 \ E1)|.

If we set p = q = 0, then we are penalizing only Case 1 edges, that is,discordant pairs. Proposition 1 holds only when p > 0, but it suggests thatwhen p = 0, K does not satisfy even a relaxed version of the triangle inequality.In order to prove this intuition, we show a simple example.

Example 2 Consider H1, H2, and H3 given in Figure 1. If we set p = 0, thenK (H1, H2) = 1 but K (H1, H3) = 0 K (H2, H3) = 0. Consequently, there is noc > 0 such that

K (H1, H2) ≤ c(K (H1, H3) + K (H2, H3)).

4 DAG aggregation

We now consider the problem of DAG aggregation, that is, summarizing aset of DAGs with a single DAG. We define the problem and demonstratethat it is computationally intractable. We then propose two different methodsfor solving the DAG-aggregation problem, and we show that both methodsprovide approximations to the optimal solution. While one method provides abetter theoretical bound than the other, a simple heuristic based on the secondapproach is shown to be the algorithm that works best in practice.

Page 10: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

10 Eric Malmi et al.

4.1 Problem definition

We formulate the problem of aggregating DAGs as the following median-typeoptimization problem.

Problem 1 (DAG Aggregation) Given a set of M directed graphs G1, . . . ,GM , find a DAG C minimizing

M∑i=1

K(p,q)(Gi, C) .

We can view DAG Aggregation as an extension of Rank Aggrega-tion, if we view ranks as graphs. It is known that Rank Aggregation isNP-hard [11], however this does not imply directly that DAG Aggregationis also NP-hard, since in DAG Aggregation the output is not required tobe a full ranking.

In order to prove hardness we reduce a known NP-hard problem calledFeedback Arc Set, where the goal is to construct a DAG with least amountof edges.

Problem 2 (Feedback Arc Set (FAS)) Given a directed graph G = (V,E),find the smallest set of edges F ⊂ E such that (V,E \ F ) is a DAG.

The NP-hardness of FAS [24] implies the following result.

Proposition 2 DAG Aggregation is NP-hard.

Proof We will prove the hardness by reducing FAS to DAG Aggregation.Fix p, q > 0.

Assume that we are given a directed graph G = (V,E). We can safelyassume that there are no 2-cycles in G, otherwise we can modify G by splittingedge into two edges by adding at most |E| vertices.

Our first step is to split G into two DAGs. In order to do that select andfix an arbitrary order over the vertices. Define E1 = {(i, j) ∈ E | i < j} andE2 = E \ E1. Clearly E1 ∩ E2 = ∅ and E1 ∪ E2 = E. Set G1 = (V,E1) andG2 = (V,E2). Both G1 and G2 are DAGs.

Let C = (V,H) be the solution for DAG Aggregation for G1, G2. Weclaim that F = E \H solves FAS.

In order to prove this, let us first show that H ⊆ E. Assume otherwiseand let e ∈ H \ E. Then this edge contributes either 2p or p + 1 to the cost,depending whether the reversed edge is in E. However, if we delete e from H,then the cost is either 2q or p + q. Consequently, we can always decrease thecost by deleting e. Thus, we can safely assume that H ⊆ E.

Since we have no 2-cycles in E, an edge e ∈ H contributes p to the cost ofDAG Aggregation because e ∈ E1 and e /∈ E2, or vice versa. On the otherhand, an edge e ∈ E \H contributes p+q to the cost. Hence, the cost of DAGaggregation is equal to

K (G1, C) + K (G2, C) = q(|V |(|V | − 1)− 2|E|) + p|H|+ (p + q)|E \H|= q(|V |(|V | − 1)− 2|E|)− q|H|+ (p + q)|E|.

Page 11: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

Beyond rankings: comparing directed acyclic graphs 11

Consequently, minimizing the cost is equal to maximizing |H| which is equalto minimizing |E \H|. ut

Proposition 2 implies that solving DAG Aggregation optimally is com-putationally intractable. Hence, in the rest of this section we consider twoapproximation algorithms.

4.2 Approximating DAG aggregation by a median centroid

Our first approach to select a centroid is simply picking the centroid from theinput graphs instead of constructing it from scratch. Remarkably, the fact thatthe distance is almost a metric gives us an approximation ratio guarantee.

More formally, assume that we are given G1, . . . , GM . We select the cen-troid from the input graphs G1, . . . , GM minimizing the distance, that is,

Median(G1, . . . , GM ) = arg minGj

M∑i=1

K(p,q)(Gi, Gj) .

Since K satisfies a relaxed triangle inequality, we can achieve a constantapproximation ratio. The proof of the following proposition is standard and itis omitted for lack of space.

Proposition 3 Assume a set of DAGs G1, . . . , GM . For 0 ≤ q ≤ p ≤ 1 with0 < p, Median(G1, . . . , GM ) yields an approximation ratio of 2c, where c isa constant as defined in Proposition 1.

4.3 Greedy approximation of DAG aggregation

Our next approach is based on the fact that we can express DAG Aggrega-tion as an instance of Weighted Feedback Arc Set problem.

In order to do that, assume that we are given G1, . . . , GM DAGs and letC = (V,E) be a candidate centroid. Let b(i, j) define the cost of not having(i, j) as an edge in the centroid,

b(i, j) =

M∑m=1

Kij(Gm, Hempty) , (1)

where Hempty is a graph with no edges. Similarly, define w(i, j) to be the costof having (i, j) as an edge in the centroid,

w(i, j) =

M∑m=1

Kij(Gm, Hfull) , (2)

where Hfull is a full directed graph. This gives us

M∑m=1

K (Gm, C) =∑e∈E

w(e) +∑e/∈E

b(e).

Page 12: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

12 Eric Malmi et al.

ab

cd

(a) G1

ab

cd

(b) G2

ab

cd

(c) G3

ab

cd

7/12

7/12

1/6

7/121/6

(d) (V, P )

ab

cd

7/12

7/12

1/6

7/12

(e) C

Fig. 2 Toy graphs related to Example 3

Let us now define P = {e ∈ V × V | w(e) < b(e)} to be the set of edgesthat ideally we would like to have in the centroid. We can safely assume thatthe centroid edges are all in P , E ⊆ P . However, since P may contain cycles,we cannot have all of these edges. For each edge e not included in the centroid,we define regret r(e) = b(e)−w(e) to be the difference we need to pay for nothaving the edge in C. We can write the cost as

M∑m=1

K (Gm, C) =∑e∈P

w(e) +∑e/∈P

b(e) +∑

e∈P\E

r(e).

Since the first two terms do not depend on C, we see that for selecting opti-mal centroid we need to minimize the third sum. This is in fact an instanceof weighted feedback arc set (WFAS) problem, with an input graph(V, P, r).

Example 3 Consider G1, G2, and G3 given in Figure 2. Set p = 1/3 andq = 1/4 and let us compute the optimal centroid. In order to do that, let uscompute P . The regret for an edge (a, b) is equal to

r(a, b) = 2/3 + 1/4− 1/3 = 7/12.

Similarly, r(a, c) = r(c, d) = 7/12. The regret for (a, d) and (b, d) is

r(a, d) = r(b, d) = 1/3 + 2/4− 2/3 = 1/6.

Finally, the regret for (b, c) is

r(b, c) = 2/3 + 1/4− 1/3− 1 = −5/12

which is negative and hence (b, c) is not included in P . The rest of the vertexpairs also have negative regrets, and are not included in P . The graph (V, P )is given in Figure 2(d). This graph has a cycle and the optimal DAG, C, isobtained by removing the weakest edge (b, d). The optimal centroid is givenin Figure 2(e).

We can estimate WFAS with O(log |V | log log |V |) with an algorithm basedon Linear Programming techniques [12].

Proposition 4 AggrFas yields an approximation ratio of O(log |V | log log |V |).

Page 13: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

Beyond rankings: comparing directed acyclic graphs 13

Algorithm 1: AggrFas, solves DAG Aggregation problem usingFeedback arc set solver

compute w and b for each pair (i, j) using Eqs. 1–2;P ← {e ∈ V × V | w(e) < b(e)};r(e)← b(e)− w(e) for all e ∈ P ;F ←WFAS(V, P, r);return (V, P \ F );

Algorithm 2: Greedy, estimates optimal centroid given a set of DAGsG1, . . . , GM

Q←{⋃M

i=1 Ei

};

compute w and b for each edge e ∈ Q using Eqs. 1–2;P ← {e ∈ Q | w(e) < b(e)};r(e)← b(e)− w(e) for all e ∈ P ;E ← ∅;foreach e ∈ P sorted by regret do

if e ∪ E has no cycle thenadd e to E;

return (V,E);

Proof Let G1, . . . , Gm be the input graphs and let r be the regret. We showedearlier that we can write the cost function as

M∑m=1

K (Gm, C) = const +∑

e∈P\E

r(e),

for any graph C = (V,E), where const ≥ 0 is a constant and does not dependon C. Let OPT be the optimal solution of the second term and let Q be thecost of a solution of weighted FAS found by an algorithm given in [12]. Thensince const ≥ 0 and Q ≥ OPT , we have

const + Q

const + OPT≤ Q

OPT.

The solver in [12] has an approximation ratio guarantee of

Q/OPT ∈ O(log |V | log log |V |)

which proves the result. ut

As approximation algorithms based on LP-approaches are rarely practical,we consider a significantly simpler but efficient approach, given in Algorithm 2.

The idea behind this algorithm is straightforward. We order the edges basedon regret, edges with smallest regret first. Note that we only need to consideredges that have appeared in at least one of the input DAGs since w(e) < b(e)can not hold for the other edges. Then we keep adding edges into a centroidin an order, ignoring the edges that create cycles. Even though this is a very

Page 14: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

14 Eric Malmi et al.

simple approach, in our experiments it outperforms Median, an algorithm forwhich we have a constant approximation guarantee.

Computational complexity: Calculating the proposed distance measure Kbetween DAGs G1 = (V,E1) and G2 = (V,E2) only requires counting thenumber of concordant and discordant pairs as the potentially discordant pairscan be computed based on these two numbers and the number of vertices|V |. If we store edges in a hash table as a preprocessing step, this yields acomplexity of O(min(|E1|, |E2|)).

Assume that we are given M input DAGs. Let k be the total number ofedges in the input graphs, k =

∑Mi=1 |E(Gi)|. Computing the cost of a centroid

can be done in O(k) time. Hence, the complexity cost for Median is O(Mk).

In Greedy, we first need to form sets Q ={⋃M

i=1 Ei

}and P = {e ∈

Q such that w(e) < b(e)}. Taking the union of all edges for Q takes O(k) steps.We can see that |P | ≤ |Q|. Second, we need to detect cycles in an incrementally

increasing graph. This can be done in O(min(|P |12 , |V |

23 )|P |) steps using the

method described by Bender et al. [5], which gives us a running time of O(k+

min(|P |12 , |V |

23 )|P |). However, in our experiments, we implemented the cycle

detection simply by keeping track of the transitive closure of the centroid whileadding new edges to it since term O(k) seemed to be the bottleneck of thealgorithm.

Finally, we would like to point out that the main computational challenge inDAG Aggregation is due to the requirement of having an acyclic centroid.Indeed, if we ignored the acyclicity constraint, we could compute the optimalcentroid in polynomial time by taking all edges for which it holds that w(e) <b(e). Nevertheless, in some cases it is useful to have a DAG centroid, forexample, if you want to also get a topological ordering of the vertices. Such ascenario is exemplified by our last experiment on clustering preference data.

5 Clustering DAGs

A natural application for distance measure is clustering. More formally, con-sider the following problem.

Problem 3 Given a set of DAGs G1, . . . , GM and a number k, find k clustersP1, . . . , Pk and centroids C1, . . . , CK such that

k∑i=1

∑G∈Pi

K(p,q)(G,Ci)

is minimized.

Note that since we require to discover centroids along with the clusters, theclustering problem becomes automatically NP-hard. In fact, if we set k = 1,then the problem reduces into finding a single centroid for all input graphs,

Page 15: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

Beyond rankings: comparing directed acyclic graphs 15

a DAG Aggregation problem. This contrasts standard clustering problemswhere finding the centroid is typically a straightforward computation.

We approach this problem by running a k-means type algorithm. Givena set of centroids, we group the input DAGs into clusters minimizing thedistance. Once the groups are selected, we then select a centroid from eachcluster. In order to select a centroid, we use Median and Greedy algorithmsintroduced in the previous section.

6 Related work

DAG Aggregation is an extension of the rank aggregation problem. Thelatter task arises typically from aggregating search engine results. However,this optimization problem has been studied in the context of voting, centuriesbefore first computers, see for example [6]. When using Kendall-tau distance,the rank aggregation problem is also known as Kemeny-Young rank aggre-gation problem. The problem is NP-hard [11], however it admits a PTASscheme [28]. Extensions of rank aggregations have suggested such as partialrankings, that is, rankings with ties [1, 13], and top-k rankings [1, 14], whereonly top-k elements are ranked and the remaining objects are left unranked.

Extensions of rank aggregation to DAGs have been considered by Bran-denburg et al. [7,8]. Here the extension is done by representing a DAG with aset of linear extensions. This allows to use set distances with Kendall-tau orSpearman footrule as a base distance. Unfortunately, the number of extensionsmay be exponential and indeed the problem of computing certain set distancesbecomes NP-hard.

In case the vertex correspondence between the graphs to be compared isunknown, the main computational challenge is to find an alignment betweenvertices. Several studies are based on this premise (see e.g. [9, 22]). However,since our starting point is rankings, it is natural to assume that the vertexcorrespondence is known. Thus we focus on the problem of defining a mean-ingful distance measure given the correspondence, which makes the distancemeasure evaluation computationally much cheaper.

As we have seen in previous sections, finding the optimal solution for DAGAggregation is intimately related to the feedback arc set (FAS) prob-lem. The reduction of NP-hardness is done by reducing FAS to DAG Ag-gregation and one we can get an approximation algorithm by using a solvergiven in [12]. FAS is known to be APX-hard with a known coefficient ofc = 1.3606 [10, 23], that is, given that P 6= NP there is no approximationalgorithm with a constant approximation guarantee of c. On the other hand,selecting a centroid from the input graphs gives us a guarantee ratio of (1+q)/p.Solving FAS for tournaments, graphs having edges for each vertex pair, is sub-stantially easier as this problem admits a PTAS scheme [28]. Note that in orderto get a tournament graph we need to have dense input graphs. This hints thatDAG Aggregation may be easier to solve if the input graphs are dense.

Page 16: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

16 Eric Malmi et al.

7 Experimental evaluation

7.1 Experiments on synthetic data

In this section, we present our experiments with synthetic datasets. The ob-jective is to test the ability of the K distance measure to distinguish DAGsgenerated from different distributions, and also to study the sensitivity ofthe distance measure with respect to its parameters. For the aggregation andclustering tasks, knowing the ground truth in the generated data allows tocompare the Median and the Greedy algorithms against baselines. We startour discussion by describing how we generate the data.

Synthetic DAG generation. In order to generate synthetic DAGs havinga cluster structure, we use the following approach. We start by generatinga seed DAG, adding edges (i, j), where i < j, with probability pedge. Fromthis seed we create N corrupted DAGs, by deleting edges with a probabilitypremove, adding new edges with a probability of padd, and swapping verticesNswaps times. The probability padd is determined from pedge and premove so thatthe corrupted graphs have on average the same amount of edges as the seed.This leaves us three parameters, pedge, premove and Nswaps. Increasing premove

and Nswaps will make graphs more corrupt. The parameter pedge determinesthe density of the DAGs, so that pedge = 0 will produce an empty DAG whilepedge = 1 will produce a total order. In all experiments with the synthetic data,we set the number of vertices to 50 and we assume transitivity, thus takingthe transitive closures of the DAGs once all of them have been generated.Note that transitivity is not required for the proposed distance measure andalgorithms to work. Indeed, in Section 7.2, we present experiments on real-world information cascade data where transitivity does not hold.

Selecting distance measure parameters. A good distance measure shouldbe able to differentiate between samples from different distributions while rec-ognizing the similarity of the samples from the same distribution. In order tomeasure the goodness of the distance, we can vary parameters p and q of theproposed measure K, and see how well K can differentiate two sets of DAGs,given that they are drawn from two different distributions. To measure thesimilarity of the sets, we use the Friedman-Rafsky MST-based runs test [15].The idea is that we calculate the pairwise distances between all samples fromboth sets and find the minimum spanning tree. Test statistic R is the numberof edges that connect samples from different sets, and a low value of R indi-cates that we can reject the null hypothesis of the sets being from the samedistribution.

Our experimental setting is the following. First, we generate two sets ofDAGs with the same parameters but with different seed DAGs. Then, weuse the Friedman-Rafsky test to see whether these two sets follow the samedistribution, while varying p and q. Since we know that the two sets are fromdifferent distributions, we can simply see how the variation in the parametersp and q affects the statistics R. The analysis is repeated for sparse DAGs

Page 17: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

Beyond rankings: comparing directed acyclic graphs 17

0 0.5 10

5

10

15

20

p

Test sta

tistic R

Sparse DAGs, q=0.85*p

0 0.5 110

15

20

25

30

35

p

Test sta

tistic R

Dense DAGs, q=0.45*p

0 0.1 0.2 0.3 0.40

5

10

15

20

q

Test sta

tistic R

Sparse DAGs, p=0.5

0 0.1 0.2 0.3 0.410

15

20

25

30

35

q

Test sta

tistic R

Dense DAGs, p=0.5

Fig. 3 Differentiation capability of the proposed measure K with different values of q (top)and different values of p keeping the ratio of p and q fixed (bottom).

(pedge = 0.03) and dense DAGs (pedge = 0.08) and for different levels ofcorruption. Results for a single level of corruption (premove = 0.7, Nswaps = 0)while varying p and q are shown in Fig. 3. From the top figures, we havecomputed the optimal ratios of p and q for sparse DAGs (q/p = 0.85) and fordense DAGs (q/p = 0.45) when p is fixed. In the bottom figures, these ratioshave been fixed after which we have varied p and q accordingly.

We omit detailed illustration of our results, however, the main conclusionsdrawn are the following: (i) When setting p = 0.5, the optimal value of qdepends on the density of the DAGs so that for sparse DAGs q should benear to p, whereas for denser DAGs q can be smaller in order to be able todistinguish DAGs from different distributions. (ii) When the ratio of p and qhas been fixed, the distance measure is not sensitive to the value of p. (iii)The optimal value of q depends on how corrupted the DAGs are.

All in all, these results show that the optimal values of p and q are problemdependent and it remains an open problem how to best select them. Never-theless, in our experiments with real-world data, the following simple strategyproved to be useful: set p = 0.5, q = 0 and increase q until the obtained aggre-gated centroids become nonempty. Furthermore, an expert might be able touse his or her domain knowledge to decide how much potentially discordantpairs should be penalized and select p and q accordingly.

DAG aggregation. Next we evaluate our DAG aggregation methods, by test-ing how well they are able to uncover an underlying seed DAG. Our methodol-ogy here is to generate a set of input DAGs from a fixed seed, apply our DAG

Page 18: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

18 Eric Malmi et al.

0 0.25 0.5 0.75 1280

290

300

310

320

330

premove

Sparse DAGs, Nswaps

=20

K(c

en

ter,

se

ed

)

0 5 10 15 20280

290

300

310

320

330

Nswaps

Sparse DAGs, premove

=0

K(c

en

ter,

se

ed

)

0 0.25 0.5 0.75 1200

250

300

350

400

premove

Dense DAGs, Nswaps

=20

0 5 10 15 20200

250

300

350

400

Nswaps

Dense DAGs, premove

=0

median greedy empty baseline optimal

Fig. 4 DAG aggregation performance measured as the distance from the planted seed DAGto the obtained centroid.

aggregation methods to compute a centroid, and then calculate the distancebetween the centroid and the seed DAG.

We compare Median and Greedy along with two other methods. Thebaseline method uses an empty DAG as the centroid, while the optimal methodoutputs the seed DAG itself giving a lower bound for the distance. We set theparameters of the measure to p = 1

2 and q = 14 . The results are shown in Fig. 4

for different values of premove and Nswaps.

From the results we see that in most cases Greedy performs significantlybetter than Median. However, the drawback of Greedy is that if the datageneration parameters are increased above a certain point so that the inputDAGs become very diverse, then the centroid converges into an empty DAG.In our other experiments, we notice that this happens even more easily if thevalue of q is decreased.

DAG clustering. In our final experiment with synthetic data we clusterDAGs. We generate 5 clusters using our synthetic-data generator, each clustercontaining 20 graphs. Furthermore, we make 20 additional vertex index swapsfor each seed creating discordant pairs between clusters, and furthering clustersaway from each other.

Evaluating the performance of a clustering algorithm is generally a difficultproblem due to lack of ground truth but with our synthetic dataset we knowthe correct partitioning of the DAGs. Thus we can measure performance usingthe Adjusted Rand Index (ARI) [21].

We use k-means type algorithm which updates the cluster centroids usingeither Median or Greedy. These approaches are compared to a hierarchi-

Page 19: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

Beyond rankings: comparing directed acyclic graphs 19

0 5 10 15 20

0

0.2

0.4

0.6

0.8

1

Nswaps

Sparse DAGs, premove

=0

Adju

ste

d R

and Index

0 5 10 15 20

0

0.2

0.4

0.6

0.8

1

Nswaps

Dense DAGs, premove

=0

Adju

ste

d R

and Index

0 0.25 0.5 0.75 1

0

0.2

0.4

0.6

0.8

1

premove

Sparse DAGs, Nswaps

=0

Adju

ste

d R

and Index

0 0.25 0.5 0.75 1

0

0.2

0.4

0.6

0.8

1

premove

Dense DAGs, Nswaps

=0

Adju

ste

d R

and Index

median

greedy

hierarchical

optimal

Fig. 5 Clustering method comparison for the synthetic data. Higher values are better.

cal clustering method with complete linkage criterion and to optimal methodwhich sets the seed DAGs as the initial cluster centroids and then assignseach input DAG to the closest centroid, giving us an upper bound for theperformance. For the k-means approaches, we run ten restarts with randominitial cluster assignments and select the run which minimizes the clusteringcost defined in Problem 3. For all methods, we run ten repetitions with newlygenerated datasets and calculate the average ARI. We set the parameters forthe distance measure to p = 1

2 and q = 14 . The results are shown in Fig. 5.

When varying premove, Greedy yields the correct clustering up to premove =0.6 for sparse DAGs, but after this the performance drops. Median and thehierarchical methods perform comparably. For dense DAGs all methods areable to find the correct clustering on almost every occasion. On the otherhand, when varying parameter Nswaps and potentially inducing discordantpairs within a cluster, the differences are more clear. Greedy has the bestoverall performance, whereas the hierarchical method performs the worst.

7.2 Experiments on information cascades

In this section, we apply the proposed DAG-aggregation methods to real-world music-listening data from Last.fm. This type of analysis can providevaluable insights, e.g., for artists who want to advertise their music to targetedindividuals that are influential among their peers. Our experimental settingis inspired by the large body of work in social-network analysis, where oneobserves information cascades in social networks and the goal is to infer theunderlying influence model [3,4,16,17,19,20,30,33,34]. This line of work focuses

Page 20: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

20 Eric Malmi et al.

Table 1 Clustering results for information cascades.

Clustering cost ± std Time (sec) Iterations

Median 73 746 000± 960 166 2.3Greedy 73 738 430± 350 63 8.3

Table 2 A clustering for artists in the Last.fm dataset.

Cluster #artists Example artists Top tags

1 25 Amy Winehouse, Kelly Rowland,Evanescence, Linkin Park, Jason Mraz

pop, female vocalists,rnb, dance, soul

2 24 Pink Floyd, Black Sabbath, Joy Divi-sion, Led Zeppelin, Duran Duran

rock, classic rock, 80s,new wave, alternative

3 12 Shakira, Taylor Swift, Lana Del Rey,Florence + the Machine, Madonna

pop, female vocalists,rnb, dance, indie

4 16 Feist, La Roux, Mika, Bat for Lashes,Gossip

indie, alternative, femalevocalists, rock, indie rock

5 119 Johnny Cash, Placebo, Vampire Week-end, Air, Kiss

rock, alternative, indie,pop, electronic

on learning influence probabilities on the edges of the social network, but alsothe roles of the network users in the information-diffusion process.

Music listening data. The dataset we use is collected from Last.fm, a musicservice that tracks user music listening and provides recommendations. Thedataset contains 1 372 users who have listened to a total of 1.2 million tracks(51 495 unique tracks) from 4 322 different artists between Jan 1, 2010 and Nov10, 2010. The dataset also includes the social network of the users. Since wewant to analyze information cascades, we study how the listening of differentartists is propagated in the social network. When user A starts listening toan artist that her friend B is already listening to, we say that A is followingB, and draw an edge (B,A). We note that we do not have complete userhistories, so user A may have listened to the same artist previously (or infact from another platform), but as all studies of social influence we ignorethis effect. This process gives us a DAG for each artist whose vertices are theLast.fm users. We limit ourselves to 196 artists who have at least 100 listeners.

Artist clustering. We apply k-means, using Greedy and Median for se-lecting centroids, with ten restarts. We set the number of clusters to 5, andp = 1/2 and q = 0.4.

Greedy outperforms Median both in terms of the average running timeand the average cost of the clustering. The results are shown in Table 1. Therelative differences of the costs are small since the costs mainly consists of thepenalties caused by Case 4 pairs as the DAGs are very sparse. Nevertheless,Greedy consistently outperforms Median.

The clustering that obtained the lowest cost using Greedy is shown inTable 2. We can see that different clusters capture different music genres.

Page 21: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

Beyond rankings: comparing directed acyclic graphs 21

Clusters 2 and 5 contain mostly rock artists with the difference that the formeris focused on classic rock whereas the latter is more diverse. Cluster 3 capturesfemale pop artists, whereas in Cluster 4, we have indie artists. Cluster 1 is aninteresting mix of pop artists and alternative metal artists suggesting thatthere is a group of users listening to both genres. The results also show thatthe pop and rock genres are well represented in the data causing smaller genresto merge to these.

Influential users. We should note that our approach gives not only a cluster-ing of artists based on cascades, but it can also help us identify the most influ-ential users for each cluster. The users who are roots in the centroid DAG ofeach cluster and have large subtrees below are the potentially influential ones.Identifying influential users is very important for viral marking and campaigndesign.

7.3 Experiments on preference data

In this section, we apply our methods to analyze preference data of differentusers. This analysis shows how the proposed methods can be used to discoverand visualize groups of people with different tastes.

Artist preference data. The data was collected through a Finnish musicrelated website whose owner allowed us to display a survey for the visitorsof the site for two days. We selected twenty popular Finnish/foreign artistsfrom five different music genres (pop, rock, rap, metal, and electronic) andpresented the visitors a series of questions of the form “Which artist do youlike more: A or B” where A and B were two randomly selected, distinct artists.In total, we received data from 3 683 users (=IP addresses) out of whom 960satisfy the following criteria: (i) answered at least 10 preference questions (ii)answered questions about their age and gender (iii) country is Finland (iv)preference graph is acyclic (6% of the graphs contain a cycle).2

User clustering. A preference DAG is formed by taking all the pairwisepreferences of a user and drawing an edge (a, b) if the user prefers artist a to b.We divide the users into 480 train users and 480 test users and cluster the trainusers using the greedy method with 3 clusters and parameters p = 0.50, q =0.48. The value of q is selected by increasing it until the centroids obtained byGreedy become nonempty. The results averaged over ten restarts are shownin Table 3. Again Greedy obtains a better performance than Median interms of the clustering cost and the running time even though it uses moreiterations to converge.

The centroids of the clustering with the lowest cost are shown in Fig. 6. Ananalysis of the centroids reveals that the clusters capture very different typesof preferences. The bottom cluster contains all metal fans, whereas the top-left cluster contains all those who like anything but metal. Quite interestingly,

2 The dataset can be downloaded at http://users.ics.aalto.fi/emalmi/artist_

preference_data.zip

Page 22: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

22 Eric Malmi et al.

Table 3 Clustering results for preference data.

Clustering cost ± std Time (millisec) Iterations

Median 43 751± 22 650 2.0Greedy 43 097± 28 260 14.0

ApulantaEppu

Normaali

Coldplay

Muse

Metallica

System ofa Down

Childrenof Bodom

Kotiteollisuus

PMMP

LadyGaga

IsacElliot

MileyCyrus

Eminem

Cheek PitbullJukkaPoika

DaftPunk

AviciiEllie

Goulding

Arminvan Buuren

ApulantaEppu

NormaaliColdplay

Muse

MetallicaSystem ofa Down

Childrenof Bodom

Kotiteollisuus PMMP

LadyGaga

IsacElliot

MileyCyrus

Eminem

Cheek

Pitbull

JukkaPoika

DaftPunk

Avicii

EllieGoulding

Arminvan Buuren

Apulanta

EppuNormaali

Coldplay

Muse

MetallicaSystem ofa Down

Childrenof Bodom

Kotiteollisuus

PMMP

LadyGaga

IsacElliot

MileyCyrus

Eminem

Cheek

Pitbull

JukkaPoika

DaftPunk

AviciiEllie

GouldingArmin

van Buuren

Genre:

Origin:

Metal Rock Pop Rap Electronic

International Finnish

Fig. 6 Centroids for a clustering of the artist preference data.

the users in the top-right cluster seem to be indifferent to genre, however,preferring Finnish artists over foreign ones. To make the structure of the DAGsmore visible, we removed all edges e = (u, v) if there was an alternative pathfrom u to v via some intermediate vertices.

Preference prediction. To obtain further evidence that the obtained clus-ters are meaningful, we apply them to a music preference prediction problem.For each test user, we use k randomly selected preferences to determine theclosest cluster centroid c and then predict all the remaining preferences bytaking the majority vote over all train users in cluster c. As a baseline, we usea majority vote over the train users in all clusters. The results in Fig. 7 showthat we obtain up to an 8% absolute improvement over the baseline method.Note that further improvements might be attainable using more sophisticatedmethods but the purpose of this experiment is merely to show that the pro-posed clustering method is able to group people with similar preferences.

Page 23: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

Beyond rankings: comparing directed acyclic graphs 23

1 3 5 7 9 11 130.54

0.56

0.58

0.6

0.62

0.64

# of train preferences

Accura

cy

Majority vote (baseline)

Majority vote within cluster

Fig. 7 Artist preference prediction results.

8 Conclusions

In this paper, we suggested a natural generalization of Kendall-tau distanceto directed acyclic graphs. We showed that this distance is a near-metric, thatis, it satisfies a relaxed version of triangle inequality. We considered two appli-cations of the distance: DAG aggregation and clustering. We were able to usethe near-metric property to show that we can obtain a constant approximationguarantee for the DAG aggregation problem using a median approach. DAGaggregation, in turn, can used to obtain a clustering by running a k-meanstype algorithm.

Our measure has potential applications in information-scarce rank analy-sis, where instead of full rankings we only have partial ranking information.Interestingly enough, DAGs also arise naturally from information cascades.The nature of the problem here is different: in rank analysis we have DAGsbecause we have missing information while in cascade analysis, DAGs containall information that we hope to have. Here clustering and analyzing DAGsalso have many applications, for example, by studying different clusters andtheir centroids, we can find the influential users in each cluster.

We run experiments both on synthetic data, measuring how well we areable to recover a planted clustering, and on real-world data regarding user pref-erences and information cascades, measuring the clustering cost and runningtime. In most cases a simple greedy approach outperformed Median, an al-gorithm for which we have a constant approximation guarantee. Furthermore,we showed that the obtained clusterings reveal meaningful and occasionallysurprising information. For instance, a clustering of music preferences showedthat while the preferences of some users are dictated by the genre of the music,for others they depend on the nationalities of the artists.

Our work opens several lines for future work. First, we could study how torank items from a given set of DAGs. Second, we saw that DAG aggregationand Feedback Arc Set problem are intimately related to each other. Thelatter is a well-studied problem and it would be fruitful to see whether this

Page 24: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

24 Eric Malmi et al.

connection can be used further to establish new theoretical complexity resultsconcerning both problems. Third, the standard Kendall-tau coefficient is oftenused as a test statistic for studying the dependency between two variables.Similarly, it would interesting to look into whether the proposed measure couldbe used as a test statistic for assessing the dependency between two graphs.

A limitation of the proposed distance measure is that it is not clear how tofind the best values of parameters p and q since these are problem dependentas our experiments show. In the experiments with real-world datasets, we haveused a simple heuristic proposed in Section 7.1, but it remains an open problemhow to select the parameter values optimally.

Acknowledgements The authors are grateful to Nicola Barbieri for providing the Last.fmdataset. We also thank the anonymous reviewers for their constructive feedback. This workwas supported by Academy of Finland grant 118653 (ALGODAN).

References

1. Ailon, N.: Aggregation of partial rankings, p-ratings and top-m lists. Algorithmica57(2) (2010)

2. Ailon, N., Charikar, M., Newman, A.: Aggregating inconsistent information: Rankingand clustering. Journal of ACM 55(5) (2008)

3. Anagnostopoulos, A., Kumar, R., Mahdian, M.: Influence and correlation in social net-works. In: KDD (2008)

4. Barbieri, N., Bonchi, F., Manco, G.: Cascade-based community detection. In: WSDM(2013)

5. Bender, M.A., Fineman, J.T., Gilbert, S., Tarjan, R.E.: A new approach to incrementalcycle detection and related problems. arXiv:1112.0784 (2011)

6. Borda, J.: Memoire sur les elections au scrutin. Histoire de l’Academie Royale desSciences (1781)

7. Brandenburg, F., Gleißner, A., Hofmeier, A.: Comparing and aggregating partial orderswith kendall tau distances. Discrete Mathematics, Algorithms and Applications 5(2)(2013)

8. Brandenburg, F., Gleißner, A., Hofmeier, A.: The nearest neighbor Spearman footruledistance for bucket, interval, and partial orders. Journal of Combinatorial Optimization26(2), 310–332 (2013)

9. Bunke, H., Shearer, K.: A graph distance metric based on the maximal common sub-graph. Pattern recognition letters 19(3) (1998)

10. Dinur, I., Safra, S.: On the hardness of approximating vertex cover. Annals of Mathe-matics 162(1) (2005)

11. Dwork, C., Kumar, R., Naor, M., Sivakumar, D.: Rank aggregation methods for theweb. WWW (2001)

12. Even, G., Naor, J., Schieber, B., Sudan, M.: Approximating minimum feedback sets andmulti-cuts in directed graphs. In: IPCO (1995)

13. Fagin, R., Kumar, R., Mahdian, M., Sivakumar, D., Vee, E.: Comparing partial rankings.SIAM Journal on Discrete Mathematics 20(3) (2006)

14. Fagin, R., Kumar, R., Sivakumar, D.: Comparing top-k lists. SIAM Journal on DiscreteMathematics 17(1) (2003)

15. Friedman, J.H., Rafsky, L.C.: Multivariate generalizations of the Wald-Wolfowitz andSmirnov two-sample tests. The Annals of Statistics 7(4) (1979)

16. Gomez-Rodriguez, M., Balduzzi, D., Scholkopf, B.: Uncovering the Temporal Dynamicsof Diffusion Networks. In: ICML (2011)

17. Gomez-Rodriguez, M., Leskovec, J., Krause, A.: Inferring Networks of Diffusion andInfluence. ACM Transactions on Knowledge Discovery from Data 5(4) (2012)

Page 25: Beyond rankings: comparing directed acyclic graphs · For example, the rank-aggregation problem arises in meta-search engines, where a number of search engines return di erent rankings

Beyond rankings: comparing directed acyclic graphs 25

18. Goodman, L.A., Kruskal, W.H.: Measures of association for cross classifications, iv:Simplification of asymptotic variances. Journal of the American Statistical Association67(338) (1972)

19. Goyal, A., Bonchi, F., Lakshmanan, L.: Discovering leaders from community actions.In: CIKM (2008)

20. Goyal, A., Bonchi, F., Lakshmanan, L.V.: Learning influence probabilities in socialnetworks. In: WSDM (2010)

21. Hubert, L., Arabie, P.: Comparing partitions. Journal of classification 2(1) (1985)22. Jiang, X., Munger, A., Bunke, H.: An median graphs: properties, algorithms, and ap-

plications. Pattern Analysis and Machine Intelligence 23(10) (2001)23. Kann, V.: On the approximability of np-complete optimization problems. Ph.D. thesis,

KTH (1992)24. Karp, R.M.: Reducibility among combinatorial problems. CCC (1972)

25. Kempe, D., Kleinberg, J., Tardos, E.: Maximizing the spread of influence through asocial network. KDD (2003)

26. Kendall, M.: A new measure of rank correlation. Biometrika 30 (1938)27. Kendall, M.: Rank Correlation Methods, 4th, revised edn. Hodder Arnold (1976)28. Kenyon-Mathieu, C., Schudy, W.: How to rank with few errors. STOC (2007)29. Laming, D.: Human judgment: the eye of the beholder. Cengage Learning EMEA (2003)30. Macchia, L., Bonchi, F., Gullo, F., Chiarandini, L.: Mining Summaries of Propagations.

In: ICDM (2013)31. Madden, J.I.: Analyzing and Modeling Rank Data. Chapman & Hall (1995)32. Murphy, T.B., Martin, D.: Mixtures of distance-based models for ranking data. Comp.

Statistics & Data Analysis 41(3–4) (2003)33. Saito, K., Nakano, R., Kimura, M.: Prediction of information diffusion probabilities for

independent cascade model. Knowledge-Based Intelligent Information and EngineeringSystems (2008)

34. Su, H., Gionis, A., Rousu, J.: Structured prediction of network response. In: ICML(2014)