Estimating heterogeneous graphical models for discrete ...dept.stat.lsa.umich.edu/~elevina//hetero-ising.pdf · arXiv: math.PR/0000000 ESTIMATING HETEROGENEOUS GRAPHICAL MODELS FOR

arXiv: math.PR/0000000

ESTIMATING HETEROGENEOUS GRAPHICAL MODELSFOR DISCRETE DATA WITH AN APPLICATION TO

ROLL CALL VOTING

By Jian Guo, Elizaveta Levina, George Michailidis and Ji Zhu

Department of Statistics, University of Michigan

We consider the problem of jointly estimating a collection ofgraphical models for discrete data, corresponding to several categoriesthat share some common structure. An example of this setting is vot-ing records of senators on different issues, such as defense, energy,and healthcare. We develop a Markov graphical model to character-ize the heterogeneous dependence structures arising from such data.The model is fitted via a joint estimation method that preserves theunderlying common graph structure, but also allows for differencesbetween the networks. The method employs a group penalty that tar-gets the common zero interaction effects across all the networks. Weapply the method to describe the internal networks of the US Senateon several important issues. Our analysis reveals individual structurefor each issue, distinct from the underlying well-known bipartisanstructure common to all categories which we are able to extract sep-arately. We also establish consistency of the proposed method bothfor parameter estimation and model selection, and evaluate its nu-merical performance on a number of simulated examples.

1. Introduction. The analysis of roll call data of legislative bodies hasattracted a lot of attention both in the political science and statistical liter-ature. For political scientists, such data allow to study broad issues such asparty cohesion as well as more specific ones such as coalition formation (seefor example the books [7, 15, 18, 20]). For statisticians the challenge is how tobest model and present the data, so that interesting patterns become appar-ent and informative for subsequent analyses. A number of techniques havebeen employed including principal components analysis (PCA) [5], multidi-mensional scaling (MDS) [6], Bayesian models [4] and graphical models [2].Dimension reduction techniques such as PCA and MDS aim at constructinga “map”, with the members of the legislative body positioned relative totheir peers according to their voting pattern. A typical example of such amap of the US Senate members in the 109th Congress (2005-06) using mul-tidimensional scaling for selected votes is shown in Figure 1; for a detailed

AMS 2000 subject classifications: Primary 62H99Keywords and phrases: graphical models, group penalty, high-dimensional data, `1

penalty, Markov network, binary data

1

http://arxiv.org/abs/math.PR/0000000

2 J. GUO ET AL.

description of the data see Section 4. A clear separation between membersof the two parties is seen (Republicans to the left of the map and Democratsto the right), together with some members exhibiting a voting pattern devi-ating from their ideological peers, e.g., Nelson (Democrat of Nebraska), andCollins and Snow (Republicans of Maine), while the independent Jeffords(shown in purple) votes like a Democrat. More interestingly, the voting pat-terns within the two parties exhibit a clustering structure, which a closerinspection of the votes and subsequent analysis showed was mainly drivenby votes on defense/security and healthcare.

Fig 1. Multidimensional scaling projection of roll call data of the US Senate for the period2005-06 (Republicans shown in red and Democrats in blue).

This finding suggests that treating all votes as homogeneous, i.e., assum-ing that they represent the same underlying relationship between senators,may mask more subtle patterns which depend on the issues being votedupon. Therefore, treating votes as heterogeneous is more accurate, and canprovide further insight into the voting behavior of different groups of sena-tors on different issues. In this paper, we focus on voting records on threetypes of bills: defense and national security, environment and energy, andhealthcare issues. Voting on the latter category is typically more partisanthan voting on defense and national security, and thus we expect to seedifferent connections in different categories.

To model this heterogeneity, we use Markov network models to capturethe dependence structure of binary and/or categorical random variables.Similarly to Gaussian graphical models, nodes in the network correspond

ESTIMATING HETEROGENEOUS GRAPHICAL MODELS 3

to variables, while edges represent dependence between nodes. Graphicalmodels have been used in a number of application areas, including bioinfor-matics [1], natural language processing [11] and image analysis [13]. In thecase of Gaussian graphical models, the structure of the underlying graphcan be fully determined from the corresponding inverse covariance (preci-sion) matrix, the off-diagonal elements of which are proportional to par-tial correlations between the variables. A number of methods have beenrecently proposed in the literature to fit sparse Gaussian graphical models(see for example, Banerjee, El Ghaoui and d’Aspremont [2], Meinshausenand Buhlmann [16], Peng et al. [19], Ravikumar et al. [22], Rothman et al.[23], Yuan and Lin [25] and references therein).

To accomplish the analysis allowing for heterogeneity, we develop a frame-work for fitting different Markov models for each category that are never-theless linked, sharing nodes and some common edges across all categories,while other edges are uniquely associated with a particular category. Asymp-totic properties of the proposed estimator are also established. Note that theGaussian case was studied by Guo et al. [8], who proposed a joint likelihoodbased estimation method that borrowed strength across categories.

The advantage of using a Markov graphical model is that it quantifiesthe degree of dependence between the senators based on their voting recordand hence the obtained network and is directly interpretable. Techniqueslike multidimensional scaling and principal components analysis representrelative similarities between senators’ voting records on the map, and hencethe distance between any two senators can be interpreted as a quantitavemeasure of similarity between their voting records. However, unlike in aMarkov network, these distances are not intepretable in the context of agenerative probabilty model.

The remainder of the paper is organized as follows. Section 2 introducesthe Markov network and addresses algorithmic issues, and Section 3 brieflyillustrates the performance of the joint estimation method on simulated data.A detailed analysis of the US Senate’s voting record from 109th Congressis presented in Section 4. Section 5 presents asymptotic results, and someconcluding remarks are drawn in Section 6. The Appendix contains theproofs.

2. Model and Estimation Algorithm. In this section, we presentthe Markov model for heterogeneous data, focusing on the special case ofbinary variables (also known as the Ising model). The extension to generalcategorical variables is briefly discussed in Section 6. We start by discussingestimation of separate models for each category and then develop a model

4 J. GUO ET AL.

for joint estimation.Notice that the main technical challenge when estimating the likelihood of

Markov graphical models is its computational intractability due to the nor-malizing constant. To overcome this difficulty, different methods employingcomputationally tractable approximations to the likelihood have been pro-posed in the literature; these include methods based on surrogate likelihood[2, 12] and pseudo-likelihood [9, 10, 21]. Hoefling and Tibshirani [10] alsoproposed an iterative algorithm that successively approximates the originallikelihood through a series of pseudo-likelihoods, while Ravikumar, Wain-wright and Lafferty [21] and Guo et al. [9] established asymptotic consistencyof their respective methods.

2.1. Problem Setup and Separate Estimation. Suppose that data havebeen collected on p binary variables for K categories, with nk observations

for the k-th category. Let x(k)i = (x

(k)i,1 , . . . , x

(k)i,p ) denote a p-dimensional row

vector containing the data for the i-th observation in the k-th category andassume that it is drawn independently from an exponential family with thedensity function

(2.1) fk(X1, . . . , Xp) =1

Z(Θ(k))exp

( p∑j=1

θ(k)j,j Xj +

∑1≤j<j′≤p

θ(k)j,j′XjXj′

).

The partition function Z(Θ(k)) =∑

Xj∈{0,1},j exp(θ(k)j,j Xj+

∑j<j′ θ

(k)j,j′XjXj′)

ensures that the density function in (2.1) integrates to one. The parameters

θ(k)j,j , 1 ≤ j ≤ p correspond to the main effect for variable Xj in the k-th cat-

egory, while θ(k)j,j′ , 1 ≤ j < j′ ≤ p to the interaction effect between variables

Xj and Xj′ . The underlying network associated with the k-th category is de-

termined by the symmetric matrix Θ(k) = (θ(k)j,j′)p×p. Specifically, if θ

(k)j,j′ = 0,

then Xj and Xj′ are conditionally independent in the k-th category givenall the remaining variables, and hence their corresponding nodes are notconnected. For each category, criterion (2.1) is referred to as the Markovnetwork in the machine learning literature, and as the log-linear model in

the statistics literature, where θ(k)j,j′ is also interpreted as the conditional log-

odds-ratio between Xj and Xj′ given the other variables. Although generalMarkov networks allow higher order interactions (3-way, 4-way, etc), Raviku-mar, Wainwright and Lafferty [21] pointed out that one can consider onlythe pairwise interaction effects without loss of generality, since higher or-der interactions can be converted to pairwise ones by introducing additionalvariables [24].


The simplest way to deal with heterogenous data is to estimate K separateMarkov models, one for each category. If one further assumes sparsity forthe k-th category, the structure of the underlying graph can be estimatedby regularizing the log-likelihood using an `1 penalty:(2.2)

maxΘ(k)

1

nk

nk∑i=1

{ p∑j=1

θ(k)j,j x

(k)i,j +

∑j<j′

θ(k)j,j′x

(k)i,j x

(k)i,j′

}− log Z(Θ(k))− λ

∑j<j′

|θ(k)j,j′ |.

The `1 penalty shrinks some of the interaction effects θ(k)j,j′ to zero and λ

controls the degree of sparsity. However, estimating (2.2) directly is compu-tationally infeasible due to the nature of the partition function. To overcomethis difficulty, we adopt a pseudo-likelihood estimation method Guo et al.[9], Hoefling and Tibshirani [10], based on

maxΘ(k)

1

nk

nk∑i=1

p∑j=1

[x(k)i,j

(θ(k)j,j +

∑j′ 6=j

θ(k)j,j′x

(k)i,j′

)− log

{1 + exp

(θ(k)j,j +

∑j′ 6=j

θ(k)j,j′x

(k)i,j′

)}]−λ

∑j<j′

|θ(k)j,j′ | ,(2.3)

where Θ(k) is restricted to be symmetric. Criterion (2.3) can be efficientlymaximized using the modified coordinate descent algorithm of Hoefling andTibshirani [10].

2.2. Joint Estimation of Heterogeneous Networks. We start by reparam-

eterizing each θ(k)j,j′ as

(2.4) θ(k)j,j′ = φj,j′γ

(k)j,j′ , 1 ≤ j 6= j′ ≤ p; 1 ≤ k ≤ K.

To avoid sign ambiguities between φj,j′ and γ(k)j,j′ , we restrict φj,j′ ≥ 0, 1 ≤

j < j′ ≤ p. To preserve the symmetry of Θ(k), we also require φj,j′ = φj′,j

and γ(k)j,j′ = γ

(k)j′,j , for all 1 ≤ j < j′ ≤ p and 1 ≤ k ≤ K. Moreover, for

identifiability reasons, we restrict the diagonal elements φj,j = 1 and γ(k)j,j =

ω(k)j,j . Note that φj,j′ is a common factor across all K categories that controls

the occurrence of common links shared across categories, while γ(k)j,j′ is an

individual factor specific to the k-th category. The proposed joint estimation

6 J. GUO ET AL.

method maximizes the following penalized criterion:

max{Φ(k),Γ(k)}Kk=1

K∑k=1

1

nk

nk∑i=1

p∑j=1

[x(k)i,j

(θ(k)j,j +

∑j′ 6=j

θ(k)j,j′x

(k)i,j′

)− log

{1 + exp

(θ(k)j,j +

∑j′ 6=j

θ(k)j,j′x

(k)i,j′

)}]

−η1∑j<j′

φj,j′ − η2∑j<j′

K∑k=1

|γ(k)j,j′ | ,(2.5)

where Φ(k) = (φj,j′)p×p and Γ(k) = (γ(k)j,j′)p×p. The tuning parameter η1 con-

trols sparsity of the common structure across the K networks. Specifically,

if φj,j′ is shrunk to zero, all θ(1)j,j′ , . . . , θ

(K)j,j′ are also zero, and hence there is no

link between nodes j and j′ in any of the K graphs. Similarly, η2 is a tuningparameter controlling sparsity of links in individual categories. Due to the

nature of the `1 penalty, some of γ(k)j,j′ ’s will be shrunk to zero, resulting in

a collection of graphs with individual differences. Note that this two-levelpenalty was originally proposed by Zhou and Zhu [26] for group variableselection in linear regression.

To simplify estimation, we convert the criterion (2.5) to an equivalentcriterion with only one tuning parameter:

max{Θ(k)}Kk=1

K∑k=1

1

nk

nk∑i=1

p∑j=1

[x(k)i,j

(θ(k)j,j +

∑j′ 6=j

θ(k)j,j′x

(k)i,j′

)− log

{1 + exp

(θ(k)j,j +

∑j′ 6=j

θ(k)j,j′x

(k)i,j′

)}]

−λ∑

1≤j<j′≤p

√√√√ K∑k=1

|θ(k)j,j′ | ,(2.6)

where λ = 2√η1η2. The equivalence between (2.5) and (2.6) can be for-

malized as follows (here A · B denotes the Schur-Hadamard element-wiseproduct of two matrices).

Proposition 1. Let {Θ(k)}Kk=1 be a local minimizer of (2.6). Then there

exists a local minimizer of (2.5), (Φ, {Γ(k)}Kk=1), such that Θ

(k)= Φ · Γ

(k),

for all 1 ≤ k ≤ K. On the other hand, if (Φ, {Γ(k)}Kk=1) is a local minimizer

of (2.5), then there also exists a local minimizer of (2.6), {Θ(k)}Kk=1, such

that Θ(k)

= Φ · Γ(k)

, for all 1 ≤ k ≤ K.


The proof of this proposition is similar to the proofs of Lemma 1 andTheorem 1 in [26] and is omitted here.

2.3. Algorithm and Model Selection. Criterion (2.6) leads to an efficientestimation algorithm based on the local linear approximation. Specifically,

letting (θ(k)j,j′)

[t] denote the estimates from the t-th iteration, we approximate√∑Kk=1 |θ

(k)j,j′ | ≈

∑Kk=1 |θ

(k)j,j′ |/

√∑Kk=1 |(θ

(k)j,j′)

[t]|, when θ(k)j,j′ ≈ (θ

(k)j,j′)

[t]. Thus,

at the (t + 1)-th iteration, problem (2.6) is decomposed into K individualoptimization problems:

maxΘ(k)

1

nk

nk∑i=1

p∑j=1

[x(k)i,j

(θ(k)j,j +

∑j′ 6=j

θ(k)j,j′x

(k)i,j′

)− log

{1 + exp

(θ(k)j,j +

∑j′ 6=j

θ(k)j,j′x

(k)i,j′

)}]

−λ∑

1≤j<j′≤p

( K∑k=1

|(θ(k)j,j′)[t]|)−1/2

|θ(k)j,j′ | .(2.7)

Note that criterion (2.7) is a variant of criterion (2.3) with a weighted `1penalty and hence can be solved by the algorithm of [10]. For numerical sta-

bility, we threshold√∑K

k=1 |(θ(k)j,j′)

[t]| at 10−10. The algorithm is summarized

as follows:

Step 1. Initialize θ(k)j,j′ ’s (1 ≤ j, j′ ≤ p; 1 ≤ k ≤ K) using the estimates from

the separate estimation method;

Step 2. For each 1 ≤ k ≤ K, update θ(k)j,j′ ’s by solving (2.7) using the JOSE

algorithm in Guo et al. [9];Step 3. Repeat Step 2 until convergence.

The tuning parameter λ in (2.6) controls the sparsity of the resultingestimator and can be selected using cross-validation. Specifically, for each1 ≤ k ≤ K, we randomly split the data in the k-th category into D subsets ofsimilar sizes and denote the index set of the observations in the d-th subsetas T (k)

d , 1 ≤ d ≤ D. Then λ is selected by maximizing

1

D

D∑d=1

K∑k=1

1

|T (k)d |

∑i∈T (k)

d

p∑j=1

x(k)i,j

{(θ

(k)j,j )[−d](λ) +

∑j′ 6=j

(θ(k)j,j′)

[−d](λ)x(k)i,j′

}− log

[1 + exp

{(θ

(k)j,j )[−d](λ) +

∑j′ 6=j

(θ(k)j,j′)

[−d](λ)x(k)i,j′

}],(2.8)

8 J. GUO ET AL.

where |T (k)d | is the cardinality of T (k)

d and (θ(k)j,j′)

[−d](λ) is the joint estimate

of θ(k)j,j′ based on all observations except those in T (1)

d

⋃. . .⋃T (K)d , as well

as the tuning parameter λ.

3. Simulation Study. Before turning our attention to examining theUS Senate voting patterns, we evaluate the performance of the joint estima-tion method on three synthetic examples, each with p = 50 variables andK = 3 categories. The network structure in each example is composed oftwo parts: the common structure across all categories and the individualstructure specific to a category. The common structures in these examplesare a chain graph, a nearest neighbor graph and a scale-free graph. Thesegraphs are generated as follows:

Example 1: Chain Graph. A chain graph is generated by connecting nodes1 to p in increasing order, as shown in Figure 2 (A1).

Example 2: Nearest Neighbor Graph. The data generating mechanismof the nearest neighbor graph is adapted from Li and Gui [14]. Specif-ically, we generate p points randomly on a unit square, calculate allp(p− 1)/2 pairwise distances, and find three nearest neighbors of eachpoint in terms of these distances. The nearest neighbor network is ob-tained by linking any two points that are nearest neighbors of eachother. Figure 2 (B1) illustrates a nearest-neighbor graph.

Example 3: Scale-free Graph. A scale-free graph has a power-law de-gree distribution and can be simulated by the Barabasi-Albert algo-rithm [3]. A realization of a scale-free network is depicted in Figure 2(C1).

In each example, the network for the k-th category (k = 1, . . . ,K) is createdby randomly adding links to the common structure. The individual linksin different categories are disjoint and have the same degree of sparsity,measured by ρ, the ratio of the number of individual links to the numberof common links. In particular, ρ = 0 corresponds to identical networks forall three categories. In the simulation study, we consider ρ=0, 1/4 and 1,gradually increasing the proportion of individual links (Figure 2). Given thegraphs, the symmetric parameter matrix Θ(k) is generated as follows. Each

θ(k)j,j′ = θ

(k)j′,j corresponding to an edge between nodes j and j′ is uniformly

drawn from [−1,−0.5] ∪ [0.5, 1], whereas all other elements are set to zero.Then we generate the data using Gibbs sampling. Specifically, suppose the

i-th iteration sample has been drawn and is denoted as (x(k)1 )[t], . . . , (x

(k)p )[t];

then, in the (t + 1)-th iteration, we draw (x(k)j )[t+1], 1 ≤ j ≤ p, from the


Bernoulli distribution:

(3.1) (x(k)j )[t+1] ∼ Bernoulli

( exp(θ(k)j,j +

∑j′ 6=j θ

(k)j,j′(x

(k)j′ )[t])

1 + exp(θ(k)j,j +

∑j′ 6=j θ

(k)j,j′(x

(k)j′ )[t])

).

To ensure that the simulated observations are close to i.i.d. samples from thetarget distribution, the first 1,000,000 rounds are discarded (burn-in) and thedata are collected every 100 iterations from the sampler. In the simulationstudy, we consider a balanced scenario and an unbalanced scenario. Theformer consists of nk = 200 observations in each category, whereas the latterhas three unbalanced categories with sample sizes n1 = 150, n2 = 300 andn3 = 450.

We compared the structure estimation results of the joint estimationmethod and the separate estimation method using ROC curves, which dy-namically characterize the sensitivity (proportion of correctly identified links)and the specificity (proportion of correctly excluded links) by varying thetuning parameter λ. Figure 3 shows the ROC curves averaged over 50 repli-cations from the three examples in the balanced scenario, where the jointestimation method dominates separate estimation when the proportion ofindividual links is low. As ρ increases, the structures become more different,and the joint and separate methods move closer together. This is expected,since the joint estimation method is designed to take advantage of commonstructure. The results in the unbalanced scenario exhibit a similar pattern(Figure 4).

4. Analysis of the U.S. Senate voting records. We applied theproposed joint estimation method to the voting records of the U.S. Senatefrom the 109th Congress covering the period 2005-2006. The data were ob-tained directly from the Senate’s website (www.senate.gov). The variablescorrespond to the 100 senators, and the observations to the 645 votes thatthe Senate deliberated and voted on during that period, which include bills,resolutions, motions, debates and roll call votes. The votes are recorded as“yes” (encoded as “1”) and “no” (encoded as “0”). Missing observationswere replaced with the majority vote of the senator’s party on that partic-ular vote. The bills with a “yes/no” proportion greater than 90% or lessthan 10% were excluded from the analysis. Three categories of votes wereextracted from bills, resolutions and motions: 1) defense and security issues(133); 2) environment and energy issues (34); 3) health and medical careissues (46). The tuning parameter for the proposed method was selectedthrough cross-validation. Following Li and Gui [14], we used a bootstrap

10 J. GUO ET AL.

procedure with the proposed estimator to evaluate the confidence of the es-timated edges. Specifically, we estimated the network for multiple bootstrapsamples of the same size, and only retained the edges that appeared morethat α percent of the time. This procedure is similar to stability selection[17]

The network representation, depicting both the common and the individ-ual structures with a cut-off value for inclusion α of 0.4 and 0.6, is givenin Figures 5 and 6, respectively. Note that unlike techniques such as princi-pal components analysis and multidimensional scaling that directly embedthe senators in a two-dimensional map, the proposed method estimates theedges and constructs the adjacency matrix of the graph of senators; sub-sequently, we employed a graph drawing program to position the senatorson a two-dimensional map. The common networks estimated by the jointestimation method are shown in the top left panels of Figures 5 and 6. Forthe individual categories, we only plot the edges associated with the cat-egory to enhance the visual reading of the graphs. As expected, membersof the two political parties are clearly separated. For both cutoff inclusionvalues, there are strong positive associations between senators of the sameparty and equally strong negative associations between senators of oppositeparties. Obviously, at the higher cutoff value the common dependence struc-ture becomes sparser. Of particular interest is the finding that at both cutoffvalues, there are many more associations between Democratic senators thanRepublican ones and this pattern holds for both the common and individualstructures. One possible explanation may be that the Democrats were in theopposition, thus voting more like a block. Further, the Independent SenatorJeffords is associated with the Democrats, while the moderate RepublicansCollins, Snowe, Chafee and Specter (who switched to the Democratic partyin early 2009) and the conservative Democrat Nelson (Nebraska) are notstrongly associated with the Republican colleagues, thus confirming resultsof previous analyses by Clinton, Jackman and Rivers [4] and de Leeuw [5](albeit based on data from the 105th Congress). Somewhat more surpris-ingly, a similar finding holds for the Republican senators McCain, Grahamand DeMint, a pattern also hinted at by the multidimensional scaling analy-sis presented in the Introduction, but more clearly captured by the Markovmodel. Also, the analysis suggests that Senator Lieberman had a solid Demo-cratic voting record before switching to an independent in 2008.

Other interesting patterns emerging from the analysis are that the moremoderate members of two parties are located closer to the center of theirrespective “clouds”’ (e.g. Warner, Voinovich, Smith on the Republican sideand Levin, Reid, Mikulski, Rockefeller on the Democratic side), the close


ties of the liberal Democrats Kennedy, Boxer and Nelson (Florida), the closevoting records of senators from the same state (Murkowski and Stevens fromAlaska and Cantwell and Murray from Washington).

Examining the individual networks for the three categories shown in Fig-ures 5 and 6, we note that additional positive associations among Democratsemerge, primarily for defense and healthcare categories, thus indicating astronger ideological cohesion on these issues. Further, a number of stablenegative associations emerge in the environment and healthcare categories,indicating a stronger ideological divide between senators.

Other patterns of interest include a strong dependence between Durbin,Corzine, Lincoln, Harkin, Dod and Dayton on defense and health issues,and between Schumer, Clinton, Murray, and Lautenberg, on the same is-sues. There is also a cluster of positive associations around Democratic sen-ator Nelson (Florida) on environmental issues, which is of interest given hisposition on offshore oil drilling.

For comparison purposes, separate multidimensional scaling analyses areshown in Figure 7 for all the votes together and for the three individual cat-egories. As alluded in the Introduction, the clustering within the two partiesis driven to a large extent by the corresponding clustering in the defenseand health categories. On the other hand, voting on environmental issuescreates a clear separation between the two parties, although the analysis re-veals that the moderate Republicans Chafee, Collins and Snowe are shown tohave a voting record similar to the Democrats, while the Democrats Nelson(Nebraska) and Landrieu are closer to the Republicans.

5. Asymptotic Properties. In this section, we study the asymptoticproperties of the proposed joint estimation method. Since the structure ofthe underlying network only depends on the interaction effects, we focus ona variant of the model without main effects. Specifically, we solve

max{Θ(k)}Kk=1

K∑k=1

1

nk

nk∑i=1

p∑j=1

[x(k)i,j

(∑j′ 6=j

θ(k)j,j′x

(k)i,j′

)− log

{1 + exp

(∑j′ 6=j

θ(k)j,j′x

(k)i,j′

)}]

−λ∑j<j′

√√√√ K∑k=1

|θ(k)j,j′ | .(5.1)

We will show that the estimator in criterion (5.1) is consistent in terms ofboth parameter estimation and model selection, when p and n go to infinityand the tuning parameter λ goes to zero at some appropriate rate.

Before stating the main results, we introduce necessary notation and regu-

larity conditions. For each k = 1, . . . ,K, denote θ(k) = (θ(k)1,2 , . . . , θ

(k)j,j′ , . . . , θ

(k)p−1,p)

12 J. GUO ET AL.

as a p(p−1)/2-dimensional vector, recording all upper triangular elements in

Θ(k). Let θ(k)

be the true value of θ(k). Let Q(k)

be the population Fisher in-formation matrix of the model in criterion (5.1) (see the Appendix for a pre-

cise definition) and let X (k)(i) be a matrix with p rows and p(p−1)/2 columns,

whose (j, j′)-th column is composed of zeros except the j-th (j′-th) compo-

nent being xi,j′ (xi,j). In addition, we define U(k)

= E[X (k)(i)

T

X (k)(i) ]. To index

the zero and nonzero elements, let Sk = {(j, j′) : θ(k)j,j′ 6= 0, 1 ≤ j < j′ ≤ p}

and Sck = {(j, j′) : θ(k)j,j′ = 0, 1 ≤ j < j′ ≤ p}, and let S∩ =

⋂Kk=1 Sk,

S∪ =⋃Kk=1 Sk. The cardinalities of Sk and S∪ are denoted by qk and q,

respectively. For any matrix W and subsets of row and column indices Uand V, let W U ,V be the matrix consisting of rows U and columns V in W .Finally, let Λmin(·) and Λmax(·) denote the smallest and largest eigenvalueof a matrix, respectively.

The asymptotic properties of the joint estimation method rely on thefollowing regularity conditions:

(A) Nonzero elements bounds: There exist positive constants γmin andγmax such that

(i) min1≤k≤K min(j,j′)∈Sk|θ(k)j,j′ | ≥ γmin;

(ii) max1≤k≤K max(j,j′)∈Sk\S∩ |θ(k)j,j′ | ≤ γmax.

(B) Dependency: There exist positive constants τmin and τmax such thatfor any k = 1, . . . ,K,

(5.2) Λmin(Q(k)Sk,Sk

) ≥ τmin and Λmax(U(k)Sk,Sk

) ≤ τmax .

(C) Incoherence: There exists a constant τ ∈ (1−√γmin/4γmax, 1) such

that for any k = 1, . . . ,K,

(5.3) ‖Q(k)Sck,Sk

(Q(k)Sk,Sk

)−1‖∞ ≤ 1− τ .

Condition (A) enforces a lower bound on the magnitudes of all nonzeroelements, as well as an upper bound on the magnitudes of those nonzeroelements associated with individual links. Conditions (B) and (C) boundthe amount of dependence and the influence that the non-neighbors canhave on a given node, respectively. Conditions similar to (B) and (C) werealso assumed by Meinshausen and Buhlmann [16], Ravikumar, Wainwrightand Lafferty [21], Peng et al. [19] and Guo et al. [9]. Our conditions are mostclosely related to those of Guo et al. [9], but here they are extended to theheterogenous data setting.


Theorem 1. (Parameter estimation). Suppose all regularity conditionshold. If the tuning parameter λ = Cλ

√(log p)/n for some constant Cλ >

(8 − 4τ)√γmin/(1 − τ) and if min{n/q3, n1/q31, . . . , nK/q3K} > (4/C) log p

for some constant C = min{τ2minτ2/288(1 − τ)2, τ2minτ

2/72, τminτ/48}, then

there exists a local maximizer of the criterion (5.1), {θ(k)}Kk=1, such that,

with probability tending to 1,

(5.4)

K∑k=1

‖θ(k)− θ(k)‖2 ≤M

√q log p

n,

for some constant M > (2KCλ/τmin√γmin)(3− 2τ)/(2− τ).

Theorem 2. (Structure selection). Under conditions of Theorem 1, with

probability tending to 1, the maximizer {θ(k)}Kk=1 from Theorem 1 satisfies

θ(k)j,j′ 6= 0, for all (j, j′) ∈ Sk, k = 1, . . . ,K;

θ(k)j,j′ = 0, for all (j, j′) ∈ Sck, k = 1, . . . ,K .

Theorems 1 and 2 establish the consistency in terms of parameter es-timation and structure selection, respectively. The proofs are given in theAppendix.

6. Concluding Remarks. We have proposed a joint estimation methodfor the analysis of heterogenous Markov networks motivated by the need tojointly estimate heterogeneous networks, such as those of the Senate vot-ing patterns. The method improves estimation of the networks’ commonstructure by borrowing strength across categories, and allows for individualdifferences. Asymptotic properties of the method have been established. Inparticular, we show that the convergence rate is similar to the rate for Gaus-sian graphical models in a similar context [9]. The proposed method can beextended to deal with general categorical data with more than two levels us-ing the strategy described in Ravikumar, Wainwright and Lafferty [21] andGuo et al. [9]. The most interesting feature emerging from the analysis ofthe Senate voting records is the existence of more stable associations for theDemocrats, both in terms of the common structure and in the healthcareand defense categories.

There are other techniques suitable for analyzing roll call data. Dimensionreduction techniques create maps, where the relative positioning of the sen-ators allows one to infer similarity in their voting patterns. They provide auseful visual tool to capture broad patterns and relationships. On the other

14 J. GUO ET AL.

hand, a Markov network model aims directly at estimating the associationsbetween the senators and thus provides an alternative view of the votingpatterns, which together with the thresholding technique employed gives ameasure of the stability of such associations. Further, the joint estimationmethod allows one to separately study the overall voting patterns and thosedriven by specific issues. In our view, both set of techniques are useful, withdimension reduction providing a global perspective, and the Markov modelrevealing more nuanced patterns.

Acknowledgements. E. Levina’s research is partially supported byNSF grant DMS-0805798. G. Michailidis’s research is partially supportedby NIH grant 1RC1CA145444-0110. J. Zhu’s research is partially supportedby NSF grants DMS-0705532 and DMS-0748389.

References.

[1] Airoldi, E. M. (2007). Getting Started in Probabilistic Graphical Models. PLoSComputational Biology 3 e252.

[2] Banerjee, O., El Ghaoui, L. and d’Aspremont, A. (2008). Model selectionthrough sparse maximum likelihood estimation. Journal of Machine Learning Re-search 9 485–516.

[3] Barabasi, A. L. and Albert, R. (1999). Emergence of scaling in random networks.Science 286 509–512.

[4] Clinton, J., Jackman, S. and Rivers, D. (2004). The statistical analysis of rollcall data. American Political Science Review 98 355–370.

[5] de Leeuw, J. (2006). Principal component analysis of senate voting patterns. InReal Data Analysis (S. Sawilowski, ed.) 405–411. Information Age Publishing, NorthCarolina.

[6] Diaconis, P., Goel, S. and Holmes, S. (2008). Horseshoes in multidimensionalscaling and local kernel methods. Annals of Applied Statistics 777–807.

[7] Enelow, J. M. and Hinich, M. J. (1984). The spatial theory of voting: an introduc-tion. Cambridge University Press, Cambridge.

[8] Guo, J., Levina, E., Michailidis, G. and Zhu, J. (2010a). Joint estimation ofmultiple graphical models. Biometrika To appear.

[9] Guo, J., Levina, E., Michailidis, G. and Zhu, J. (2010b). Joint structure esti-mation for categorical Markov networks Technical Report, Department of Statistics,University of Michigan, Ann Arbor.

[10] Hoefling, H. and Tibshirani, R. (2009). Estimation of sparse binary pairwiseMarkov networks using pseudo-likelihoods. Journal of Machine Learning Research10 883–906.

[11] Jung, S. Y., Park, Y. C., Choi, K. S. and Kim, Y. (1996). Markov random fieldbased English part-of-speech tagging system. In Proceedings of the 16th Conferenceon Computational Linguistics 236–242.

[12] Kolar, M. and Xing, E. P. (2008). Improved estimation of high-dimensional Isingmodels. In Eprint arXiv:0811.1239.

[13] Li, S. Z. (2001). Markov Random Field Modeling in Image Analysis. Springer, NewYork.


[14] Li, H. and Gui, J. (2006). Gradient directed regularization for sparse Gaussian con-centration graphs, with applications to inference of genetic networks. Biostatistics 7302–317.

[15] Matthews, D. R. and A., S. J. (1975). Yeas and nays: normal decision-making inthe U.S. House of Representatives. Wiley Press, New York.

[16] Meinshausen, N. and Buhlmann, P. (2006). High-dimensional graphs with thelasso. Annals of Statistics 34 1436–1462.

[17] Meinshausen, N. and Buhlmann, P. (2010). Stability Selection (with discussion).Journal of the Royal Statistical Society, Series B 72 417–473.

[18] Morton, R. B. (1999). Methods and models: a guide to the empirical analysis offormal models in political science. Cambridge University Press, Cambridge.

[19] Peng, J., Wang, P., Zhou, N. and Zhu, J. (2009). Partial correlation estimation byjoint sparse regression model. Journal of the American Statistical Asscociation 104735–746.

[20] Poole, K. T. and Rosenthal, H. (1997). Congress: a political-economic history ofroll-call voting. Oxford University Press, Oxford.

[21] Ravikumar, P., Wainwright, M. J. and Lafferty, J. (2010). High-dimensionalIsing model selection using `1-regularized logistic regression. Annals of Statistics 381287-1319.

[22] Ravikumar, P., Wainwright, M. J., Raskutti, G. and Yu, B. (2008). High-dimensional covariance estimation by minimizing `1-penalized log-determinant diver-gence Technical Report, Department of Statistics, University of California, Berkeley.

[23] Rothman, A. J., Bickel, P. J., Levina, E. and Zhu, J. (2008). Sparse permutationinvariant covariance estimation. Electronic Journal of Statistics 2 494–515.

[24] Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential fami-lies, and variational inference. Foundations and Trends in Machine Learning 1 1–305.

[25] Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graph-ical model. Biometrika 94 19–35.

[26] Zhou, N. and Zhu, J. (2007). Group variable selection via a hierarchical lasso and itsoracle property Technical Report, Department of Statistics, University of Michigan.

Appendix. The appendix presents the proofs of Theorems 1 and 2. Themain idea of the proof is closely related to [9], and some strategies for dealingwith the joint estimation are borrowed from [8].

We introduce notation first. For the k-th category, we define the log-likelihood as

l(θ(k)) =1

nk

nk∑i=1

p∑j=1

[x(k)i,j (∑j′ 6=j

θ(k)j,j′x

(k)i,j′)− log{1 + exp(

∑j′ 6=j

θ(k)j,j′x

(k)i,j′)}] ,

whose first derivative and second derivative are denoted by ∇l(θ(k)) and∇2l(θ(k)), respectively. Note that ∇l(θ(k)) is a p(p−1)/2-dimensional vectorand ∇2l(θ(k)) is a p(p − 1)/2 × p(p − 1)/2 matrix. Then, the populationFisher information matrix of the model in (5.1) at θ can be defined as

Q(k)

= −E[∇2l(θ(k)

)], and its sample counterpart is Q(k)

= −∇2l(θ(k)

).

We also write U(k)

= 1/n∑n

i=1X(k)(i)

T

X (k)(i) for the sample counterpart of

16 J. GUO ET AL.

U(k)

defined in Section 5. Let θ(k) = (θ(k)1,2, . . . , θ

(k)j,j′ , . . . , θ

(k)p−1,p) be the same

as θ(k) except that all elements in Sck are set to zero and write δ(k) =

θ(k)−θ(k) and δ(k) = θ(k)−θ(k). Finally, let W be a subset of the index set{1, 2, . . . , p(p− 1)/2}. For a p(p− 1)/2-dimensional vector β, we define βWas the vector consisting of the elements of β associated with W.

Next, we introduce a variant of criterion (5.1) by restricting all true zerosin {θ(k)}Kk=1 to be estimated as zero. Specifically, the restricted criterion isformulated as follows:

(6.1) max{θ(k)}Kk=1

K∑k=1

l(θ(k))− λ∑

1≤j<j′≤p

√√√√ K∑k=1

|θ(k)j,j′ | ,

and its maximizer is denoted by {θ(k)}Kk=1. In addition, we consider the

sample versions of regularity conditions (B) and (C).

(B′) Sample dependency: There exist positive constants τmin and τmax

such that for any k = 1, . . . ,K,

(6.2) Λmin(Q(k)

Sk,Sk) ≥ τmin and Λmax(U

(k)

Sk,Sk) ≤ τmax .

(C′) Sample incoherence: There exists a constant τ ∈ (1−√γmin/4γmax, 1)

such that for any k = 1, . . . ,K,

(6.3) ‖Q(k)

Sck,Sk

(Q(k)

Sk,Sk)−1‖∞ ≤ 1− τ .

For convenience of the readers, the proof of our main result is divided intotwo parts: Part I presents the main idea of the proof by listing the importantpropositions and the proofs of Theorems 1 and 2, whereas Part II containsadditional technical details and proofs of propositions in Part I.

Part I: Propositions and Proof of Theorems 1 and 2. The proof consistsof the following steps. Proposition 2 shows that, under sample regularityconditions (B′) and (C′), the conclusions of Theorems 1 and 2 hold forthe local maximizer of the restricted problem (6.1). Next, Proposition 3proves that the population regularity conditions (B) and (C) give rise to theirsample counterparts (B′) and (C′) with probability tending to one; hencethe conclusions of Proposition 2 also hold with the population regularityconditions. Lastly, we show that the local maximizer of (6.1) is also a localmaximizer of the original model (5.1). This is established via Proposition4, which sets out the Karush-Kuhn-Tucker (KKT) conditions for the local


maximizer of criterion (5.1), and Proposition 5, which shows that, withprobability tending to one, the local maximizer of (6.1) satisfies these KKTconditions.

Proposition 2. Suppose condition (A) and the sample conditions (B′)and (C′) hold. If the tuning parameter λ = Cλ

√(log p)/n for some constant

Cλ > (8 − 4τ)√γmin/(1 − τ) and q

√(log p)/n = o(1), then with probability

tending to one, there exists a local maximizer of the restricted criterion,

{θ(k)}Kk=1, satisfying

(i)∑K

k=1 ‖θ(k)−θ(k)‖2 ≤M

√q(log p)/n for some constant M > (2KCλ/τmin

√γmin)[(3−

2τ)/(2− τ)];

(ii) For each k = 1, . . . ,K, θ(k)

j,j′ 6= 0 for all (j, j′) ∈ Sk and θ(k)

j,j′ = 0 for all(j, j′) ∈ Sck.

Proposition 3. Suppose the regularity conditions (B) and (C) hold,then for any ε > 0, the following inequalities hold with probability tending toone for all k = 1, . . . ,K:

(i) P{Λmin(Q(k)

Sk,Sk) ≤ τmin − ε} ≤ 2 exp{−(ε2/2)(nk/q

2k) + 2 log qk};

(ii) P{Λmax(U(k)

Sk,Sk) ≥ τmax + ε} ≤ 2 exp{−(ε2/2)(nk/q

2k) + 2 log qk};

(iii) P[‖Q(k)

Sck,Sk

(Q(k)

Sk,Sk)−1‖∞ ≥ 1 − τ/2] ≤ 12 exp(−Cnk/q3k + 4 log p), for

some constant C = min{τ2minτ2/288(1− τ)2, τ2minτ2/72, τminτ/48}.

Proposition 4. {θ}Kk=1 is a local maximizer of problem (5.1) if andonly if the following conditions hold for all k = 1, . . . ,K:

(6.4)∇j,j′ l(θ

(k)) = λsgn(θ

(k)j,j′)/(

∑Kk=1 |θ

(k)j,j′ |)

1/2, if θ(k)j,j′ 6= 0;

|∇j,j′ l(θ(k)

)| < λ/(∑K

k=1 |θ(k)j,j′ |)

1/2, if θ(k)j,j′ = 0.

Proposition 5. Under all conditions of Proposition 2, with probabilitytending to one, we have, for each k = 1, . . . ,K,(6.5)

∇j,j′ l(θ(k)

) = λsgn(θ(k)

j,j′)/(∑K

k=1 |θ(k)

j,j′ |)1/2, for all (j, j′) ∈ Sk;

|∇j,j′ l(θ(k)

)| < λ/(∑K

k=1 |θ(k)

j,j′ |)1/2, for all (j, j′) ∈ Sck.

Proof of Theorems 1 and 2. The condition min{n/q3, n1/q31, . . . , nK/q3K} >(4/C) log p implies that, for each k = 1, . . . ,K, we have −Cnk/q3k+4 log p <0 and −(ε2/2)(nk/q

2k) + 2 log qk < 0 when qk is large enough. This condi-

tion also implies q√

(log p)/n = o(1). In addition, by Proposition 3, the

18 J. GUO ET AL.

sample conditions (B′) and (C′) hold with probability tending to one whenregularity conditions (B) and (C) hold. Therefore, by Proposition 2, with

probability tending to one, the solution of the restricted problem {θ(k)}Kk=1

satisfies both parameter estimation consistency and structure selection con-sistency. On the other hand, by Proposition 5, with probability tending to

one, {θ(k)}Kk=1 also satisfies the KKT conditions in Proposition 4, thus it is

a local maximizer of criterion (5.1). This proves Theorems 1 and 2. �

Part II: Proofs of Propositions. Before proving the propositions, we statea few lemmas which will be used in the proofs. These lemmas are variantsof Lemmas 1, 2 and 5 in Guo et al. [9], adapted to the settings of theheterogenous model and thus the proofs are omitted here. Likewise, theproof of Proposition 3 is very similar to the proof of Propositions 3 and 4in Guo et al. [9] and is omitted.

Lemma 1. For each k = 1, . . . ,K, with probability tending to 1, we have

‖∇l(θ(k))‖∞ ≤ C∇√

(log p)/n for some constant C∇ > 4.

Lemma 2. If the sample dependency condition (B′) holds and q√

(log p)/n =o(1), then for any αk ∈ [0, 1], k = 1, . . . ,K, the following inequality holdswith probability tending to 1:

(6.6) −K∑k=1

δ(k)Sk

T

[∇2l(θ(k)

+ αkδ(k))]Sk,Sk

δ(k)Sk≥ 1

2τmin

K∑k=1

‖δ(k)‖22 .

Lemma 3. Suppose the sample dependency condition (B) holds. For anyαk ∈ [0, 1], k = 1, . . . ,K, the following inequality holds with probabilitytending to one:

(6.7) ‖[∇2l(θ(k)

+ αkδ(k))−∇2l(θ

(k))]δ(k)‖∞ ≤ τmax‖δ(k)‖22 .

Proof of Proposition 2. The main idea of the proof was first introducedin this context in Rothman et al. [23] and has since been used by manyauthors. Define(6.8)

G({δ(k)}Kk=1) = −K∑k=1

[l(θ(k)

+δ(k))−l(θ(k))]+λ∑

1≤j<j′≤p{(

K∑k=1

|θ(k)j,j′+δ(k)j,j′ |)

1/2−(K∑k=1

|θ(k)j,j′ |)1/2}.

It can be seen from (6.1) that, {δ(k)}Kk=1 minimizes G({δ(k)}Kk=1) and G({0}Kk=1) =

0. Thus we must have G({δ(k)}Kk=1) ≤ 0. If we take a closed set A which


contains {0}Kk=1, and show that G is strictly positive everywhere on theboundary ∂A, then it implies that G has a local minimum inside A, since Gis continuous and G({0}Kk=1) = 0. Specifically, we define A = {{δ(k)}Kk=1 :∑K

k=1 ‖δ(k)‖2 ≤ Man}, with boundary ∂A = {{δ(k)}Kk=1 :

∑Kk=1 ‖δ

(k)‖2 =Man}, for some constant M > (2KCλ/τmin

√γmin)[(3 − 2τ)/(2 − τ)] and

an =√q(log p)/n. For any {δ(k)}Kk=1 ∈ ∂A, the Taylor series expansion

gives G({δ(k)}Kk=1) = I1 + I2 + I3, where

I1 = −K∑k=1

[∇l(θ(k))]Sk

T

δ(k)Sk

,

I2 = −K∑k=1

δ(k)Sk

T

[∇2l(θ(k)

+ αkδ(k))]Sk,Sk

δ(k)Sk, for some αk ∈ [0, 1] ,

I3 = λ∑

(j,j′)∈S∪

{(K∑k=1

|θ(k)j,j′ + δ(k)j,j′ |)

1/2 − (

K∑k=1

|θ(k)j,j′ |)1/2} .(6.9)

Since Cλ > (8− 4τ)√γmin/(1− τ), we have [(1− τ)/(2− τ)]Cλ/

√γmin > 4.

By Lemma 1,(6.10)

|I1| ≤K∑k=1

‖[∇l(θ(k))]Sk‖∞‖δ(k)Sk

‖1 ≤ [(1− τ)CλMγ−1/2min /(2− τ)](q log p)/n .

In addition, by condition q√

(log p)/n = o(1), Lemma 2 holds and thus

(6.11) I2 ≥ (τmin/2)

K∑k=1

‖δ(k)‖22 ≥ [τmin/(2K)]M2q(log p)/n .

Finally, by the triangular inequality and regularity condition (A),

|I3| ≤ λ∑

(j,j′)∈S∪

K∑k=1

||θ(k)j,j′ + δ(k)j,j′ | − |θ

(k)j,j′ ||

(∑K

k=1 |θ(k)j,j′ + δ

(k)j,j′ |)1/2 + (

∑Kk=1 |θ

(k)j,j′ |)1/2

≤ (λγ−1/2min )

K∑k=1

∑(j,j′)∈S∪

|δ(k)j,j′ | ≤ (λq1/2γ−1/2min )

K∑k=1

‖δ(k)‖2

≤ (MCλγ−1/2min ){q(log p)/n}(6.12)

Then we have

(6.13) G({δ(k)}Kk=1) ≥M2 q log p

n

(τmin

2K− (1− τ)Cλ

(2− τ)Mγ1/2min

− Cλ

Mγ1/2min

)> 0.

20 J. GUO ET AL.

The last inequality uses the condition M > (2KCλ/τmin√γmin)[(3−2τ)/(2−

τ)]. Therefore, with probability tending to 1, we have∑K

k=1 ‖θ(k)−θ(k)‖2 ≤

M√q(log p)/n, and consequently claim (i) in Proposition 2 holds.

On the other hand, by the definition of θ(k)

, we have θ(k)

j,j′ = 0 for all(j, j′) ∈ Sck. By regularity condition (A) and Proposition 2 (i), for any

(j, j′) ∈ Sk, k = 1, . . . ,K, we have |θ(k)

j,j′ | ≥ |θ(k)j,j′ |−|θ

(k)

j,j′−θ(k)j,j′ | ≥ γmin/2 > 0,

when n is large enough. �

Proof of Proposition 5. By Proposition 2, with probability tending to

one, we have θj,j′ 6= 0 for all (j, j′) ∈ Sk. Since {θ(k)}Kk=1 is a local max-

imizer of the restricted problem (6.1), with probability tending to one,

∇j,j′ l(θ(k)

) = λsgn(θ(k)

j,j′)/(∑K

k=1 |θ(k)

j,j′ |)1/2, for all (j, j′) ∈ Sk.To show the second claim, we apply the mean value theorem and write

∇l(θ(k)

) = ∇l(θ(k)) + r(k) − Q(k)δ(k)

, where r(k) = {∇2l(θ(k)

+ αkδ(k)

) −∇2l(θ

(k))}δ

(k). After some simplifications, we have

(6.14)

[∇l(θ(k)

)]Sck

= [∇l(θ(k))]Sck+r

(k)Sck−[Q

(k)

Sck,Sk

(Q(k)

Sk,Sk)−1]{[∇l(θ(k))]Sk

+r(k)Sk−[∇l(θ

(k))]Sk}

and thus,

‖[∇l(θ(k)

)]Sck‖∞ ≤ ‖[∇l(θ(k))]Sc

k‖∞ + ‖r(k)Sc

k‖∞

+‖Q(k)

Sck,Sk

(Q(k)

Sk,Sk)−1‖∞{‖[∇l(θ

(k))]Sk‖∞ + ‖r(k)Sk

‖∞ + ‖[∇l(θ(k)

)]Sk‖∞}

≤ (2− τ)‖∇l(θ(k))‖∞ + (2− τ)‖r(k)‖∞ + (1− τ)‖[∇l(θ(k)

)]Sk‖∞

≤ [(1− τ)Cλ/√γmin]

√(log p)/n+ (2− τ)τmaxM

2q(log p)/n

+(1− τ)λ/ min(j,j′)∈Sk

[

K∑k=1

|θj,j′ |]1/2

≤ [2(1− τ)/√γmin]λ+ op(λ).(6.15)

On the other hand, λ/[∑K

k=1 |θ(k)

j,j′ |]1/2 = +∞ when (j, j′) ∈ Sc∪. Otherwise,if (j, j′) ∈ S∪\Sk, then

λ/(

K∑k=1

|θj,j′ |)1/2 ≥ λ/{K∑k=1

(|θj,j′−θj,j′ |+|θj,j′ |}1/2 ≥ λ/√γmax ≥ (2−2τ)λ/

√γmin .


Thus, for any (j, j′) ∈ Sck (k = 1, . . . ,K), we have

|∇j,j′ l(θ(k)

)| ≤ max1≤k≤K

max(j,j′)∈Sc

k

|∇j,j′ l(θ(k)

)|

< min1≤k≤K

min(j,j′)∈Sc

k

λ/

√√√√ K∑k=1

|θ(k)

j,j′ | ≤ λ/

√√√√ K∑k=1

|θ(k)

j,j′ | .(6.16)

�

1085 South University, Ann Arbor, MI 48109-1107

22 J. GUO ET AL.

(A1) Chain: !=0

(A2) Chain: !=1/4

(A3) Chain: !=1

(B1) Nearest-neighbor: !=0

(B2) Nearest-neighbor: !=1/4

(B3) Nearest-neighbor: !=1

(C1) Scale-free: !=0

(C2) Scale-free: !=1/4

(C3) Scale-free: !=1

Fig 2. The networks used in three simulated examples. The black lines represent the com-mon structure, whereas the red, blue and green lines represent the individual links in thethree categories. ρ is the ratio of the number of individual links to the number of commonlinks.


0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.0

0.2

0.4

0.6

0.8

1.0

Chain: ρ=0

1 − Specificity

Se

nsitiv

ity

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.0

0.2

0.4

0.6

0.8

1.0

Chain: ρ=1/4

1 − Specificity

Se

nsitiv

ity

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.0

0.2

0.4

0.6

0.8

1.0

Chain: ρ=1

1 − Specificity

Se

nsitiv

ity

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.0

0.2

0.4

0.6

0.8

1.0

Nearest−neighbor: ρ=0

1 − Specificity

Se

nsitiv

ity

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.0

0.2

0.4

0.6

0.8

1.0

Nearest−neighbor: ρ=1/4

1 − Specificity

Se

nsitiv

ity

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.0

0.2

0.4

0.6

0.8

1.0


1 − Specificity

Se

nsitiv

ity

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.0

0.2

0.4

0.6

0.8

1.0

Scale−free: ρ=0

1 − Specificity

Se

nsitiv

ity

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.0

0.2

0.4

0.6

0.8

1.0

Scale−free: ρ=1/4

1 − Specificity

Se

nsitiv

ity

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.0

0.2

0.4

0.6

0.8

1.0

Scale−free: ρ=1

1 − Specificity

Se

nsitiv

ity

Fig 3. Results for the balanced scenario (n1 = n2 = n3 = 200). The ROC curves areaveraged over 50 replications. ρ is the ratio between the number of individual links and thenumber of common links.

24 J. GUO ET AL.

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.0

0.2

0.4

0.6

0.8

1.0

Chain: ρ=0

1 − Specificity

Se

nsitiv

ity

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.0

0.2

0.4

0.6

0.8

1.0

Chain: ρ=1/4

1 − Specificity

Se

nsitiv

ity

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.0

0.2

0.4

0.6

0.8

1.0

Chain: ρ=1

1 − Specificity

Se

nsitiv

ity

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.0

0.2

0.4

0.6

0.8

1.0


1 − Specificity

Se

nsitiv

ity

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.0

0.2

0.4

0.6

0.8

1.0

Nearest−neighbor: ρ=1/4

1 − Specificity

Se

nsitiv

ity

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.0

0.2

0.4

0.6

0.8

1.0


1 − Specificity

Se

nsitiv

ity

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.0

0.2

0.4

0.6

0.8

1.0

Scale−free: ρ=0

1 − Specificity

Se

nsitiv

ity

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.0

0.2

0.4

0.6

0.8

1.0

Scale−free: ρ=1/4

1 − Specificity

Se

nsitiv

ity

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.0

0.2

0.4

0.6

0.8

1.0

Scale−free: ρ=1

1 − Specificity

Se

nsitiv

ity

Fig 4. Results for the unbalanced scenario (n1 = 150, n2 = 300, n3 = 450). The ROCcurves are averaged over 50 replications. ρ is the ratio between the number of individuallinks and the number of common links.


Common Structure

Akaka

Alexander

AllardAllen

Baucus

Bayh

Bennett

Biden

Bingaman

Bond

Boxer

Brownback

Bunning

Burns

Burr

Byrd

Cantwell

Carper

Chafee

Chambliss

Clinton

Coburn

Cochran

ColemanCollins

Conrad

Cornyn

Corzine

Craig

Crapo

DaytonDeMint

DeWine

Dodd

Dole

Domenici

Dorgan

Durbin

Ensign

Enzi

Feingold

Feinstein

Frist

Graham

GrassleyGregg

Hagel

Harkin

Hatch

Hutchison

Inhofe

InouyeIsakson

Jeffords Johnson

Kennedy

Kerry

KohlKyl

Landrieu

Lautenberg

Leahy

Levin

Lieberman

Lincoln

Lott

Lugar

Martinez

McCain

McConnell

Mikulski

Murkowski

Murray

Nelson

Nelson

Obama

Pryor

Reed

Reid

Roberts

Rockefeller

SalazarSantorum

Sarbanes

Schumer

Sessions

Shelby

Smith

SnoweSpecter

Stabenow

Stevens

Sununu

Talent

Thomas

Thune

Vitter

Voinovich

WarnerWyden

Defense and Security

Akaka

Alexander

AllardAllen

Baucus

Bayh

Bennett

Biden

Bingaman

Bond

Boxer

Brownback

Bunning

Burns

Burr

Byrd

Cantwell

Carper

Chafee

Chambliss

Clinton

Coburn

Cochran

ColemanCollins

Conrad

Cornyn

Corzine

Craig

Crapo

DaytonDeMint

DeWine

Dodd

Dole

Domenici

Dorgan

Durbin

Ensign

Enzi

Feingold

Feinstein

Frist

Graham

GrassleyGregg

Hagel

Harkin

Hatch

Hutchison

Inhofe

InouyeIsakson

Jeffords Johnson

Kennedy

Kerry

KohlKyl

Landrieu

Lautenberg

Leahy

Levin

Lieberman

Lincoln

Lott

Lugar

Martinez

McCain

McConnell

Mikulski

Murkowski

Murray

Nelson

Nelson

Obama

Pryor

Reed

Reid

Roberts

Rockefeller

SalazarSantorum

Sarbanes

Schumer

Sessions

Shelby

Smith

SnoweSpecter

Stabenow

Stevens

Sununu

Talent

Thomas

Thune

Vitter

Voinovich

WarnerWyden

Environment and Energy

Akaka

Alexander

AllardAllen

Baucus

Bayh

Bennett

Biden

Bingaman

Bond

Boxer

Brownback

Bunning

Burns

Burr

Byrd

Cantwell

Carper

Chafee

Chambliss

Clinton

Coburn

Cochran

ColemanCollins

Conrad

Cornyn

Corzine

Craig

Crapo

DaytonDeMint

DeWine

Dodd

Dole

Domenici

Dorgan

Durbin

Ensign

Enzi

Feingold

Feinstein

Frist

Graham

GrassleyGregg

Hagel

Harkin

Hatch

Hutchison

Inhofe

InouyeIsakson

Jeffords Johnson

Kennedy

Kerry

KohlKyl

Landrieu

Lautenberg

Leahy

Levin

Lieberman

Lincoln

Lott

Lugar

Martinez

McCain

McConnell

Mikulski

Murkowski

Murray

Nelson

Nelson

Obama

Pryor

Reed

Reid

Roberts

Rockefeller

SalazarSantorum

Sarbanes

Schumer

Sessions

Shelby

Smith

SnoweSpecter

Stabenow

Stevens

Sununu

Talent

Thomas

Thune

Vitter

Voinovich

WarnerWyden

Health and Medical Issues

Akaka

Alexander

AllardAllen

Baucus

Bayh

Bennett

Biden

Bingaman

Bond

Boxer

Brownback

Bunning

Burns

Burr

Byrd

Cantwell

Carper

Chafee

Chambliss

Clinton

Coburn

Cochran

ColemanCollins

Conrad

Cornyn

Corzine

Craig

Crapo

DaytonDeMint

DeWine

Dodd

Dole

Domenici

Dorgan

Durbin

Ensign

Enzi

Feingold

Feinstein

Frist

Graham

GrassleyGregg

Hagel

Harkin

Hatch

Hutchison

Inhofe

InouyeIsakson

Jeffords Johnson

Kennedy

Kerry

KohlKyl

Landrieu

Lautenberg

Leahy

Levin

Lieberman

Lincoln

Lott

Lugar

Martinez

McCain

McConnell

Mikulski

Murkowski

Murray

Nelson

Nelson

Obama

Pryor

Reed

Reid

Roberts

Rockefeller

SalazarSantorum

Sarbanes

Schumer

Sessions

Shelby

Smith

SnoweSpecter

Stabenow

Stevens

Sununu

Talent

Thomas

Thune

Vitter

Voinovich

WarnerWyden

Fig 5. The common and individual structures for the Senate voting data with an inclu-sion cutoff value of 0.4. The nodes represent the 100 senators, with red, blue and purplenode colors corresponding to Republican, Democrat, or Independent (Senator Jeffords),respectively. A solid line corresponds to a positive interaction effect and a dashed line toa negative interaction effect. The width of a link is proportional to the magnitude of thecorresponding overall interaction effect. For each individual network, the links that onlyappear in this category are highlighted in grey.

26 J. GUO ET AL.

Common Structure

Akaka

Alexander

AllardAllen

Baucus

Bayh

Bennett

Biden

Bingaman

Bond

Boxer

Brownback

Bunning

Burns

Burr

Byrd

Cantwell

Carper

Chafee

Chambliss

Clinton

Coburn

Cochran

ColemanCollins

Conrad

Cornyn

Corzine

Craig

Crapo

DaytonDeMint

DeWine

Dodd

Dole

Domenici

Dorgan

Durbin

Ensign

Enzi

Feingold

Feinstein

Frist

Graham

GrassleyGregg

Hagel

Harkin

Hatch

Hutchison

Inhofe

InouyeIsakson

Jeffords Johnson

Kennedy

Kerry

KohlKyl

Landrieu

Lautenberg

Leahy

Levin

Lieberman

Lincoln

Lott

Lugar

Martinez

McCain

McConnell

Mikulski

Murkowski

Murray

Nelson

Nelson

Obama

Pryor

Reed

Reid

Roberts

Rockefeller

SalazarSantorum

Sarbanes

Schumer

Sessions

Shelby

Smith

SnoweSpecter

Stabenow

Stevens

Sununu

Talent

Thomas

Thune

Vitter

Voinovich

WarnerWyden


Akaka

Alexander

AllardAllen

Baucus

Bayh

Bennett

Biden

Bingaman

Bond

Boxer

Brownback

Bunning

Burns

Burr

Byrd

Cantwell

Carper

Chafee

Chambliss

Clinton

Coburn

Cochran

ColemanCollins

Conrad

Cornyn

Corzine

Craig

Crapo

DaytonDeMint

DeWine

Dodd

Dole

Domenici

Dorgan

Durbin

Ensign

Enzi

Feingold

Feinstein

Frist

Graham

GrassleyGregg

Hagel

Harkin

Hatch

Hutchison

Inhofe

InouyeIsakson

Jeffords Johnson

Kennedy

Kerry

KohlKyl

Landrieu

Lautenberg

Leahy

Levin

Lieberman

Lincoln

Lott

Lugar

Martinez

McCain

McConnell

Mikulski

Murkowski

Murray

Nelson

Nelson

Obama

Pryor

Reed

Reid

Roberts

Rockefeller

SalazarSantorum

Sarbanes

Schumer

Sessions

Shelby

Smith

SnoweSpecter

Stabenow

Stevens

Sununu

Talent

Thomas

Thune

Vitter

Voinovich

Warner

Wyden


Akaka

Alexander

AllardAllen

Baucus

Bayh

Bennett

Biden

Bingaman

Bond

Boxer

Brownback

Bunning

Burns

Burr

Byrd

Cantwell

Carper

Chafee

Chambliss

Clinton

Coburn

Cochran

ColemanCollins

Conrad

Cornyn

Corzine

Craig

Crapo

DaytonDeMint

DeWine

Dodd

Dole

Domenici

Dorgan

Durbin

Ensign

Enzi

Feingold

Feinstein

Frist

Graham

GrassleyGregg

Hagel

Harkin

Hatch

Hutchison

Inhofe

InouyeIsakson

Jeffords Johnson

Kennedy

Kerry

KohlKyl

Landrieu

Lautenberg

Leahy

Levin

Lieberman

Lincoln

Lott

Lugar

Martinez

McCain

McConnell

Mikulski

Murkowski

Murray

Nelson

Nelson

Obama

Pryor

Reed

Reid

Roberts

Rockefeller

SalazarSantorum

Sarbanes

Schumer

Sessions

Shelby

Smith

SnoweSpecter

Stabenow

Stevens

Sununu

Talent

Thomas

Thune

Vitter

Voinovich

Warner

Wyden


Akaka

Alexander

AllardAllen

Baucus

Bayh

Bennett

Biden

Bingaman

Bond

Boxer

Brownback

Bunning

Burns

Burr

Byrd

Cantwell

Carper

Chafee

Chambliss

Clinton

Coburn

Cochran

ColemanCollins

Conrad

Cornyn

Corzine

Craig

Crapo

DaytonDeMint

DeWine

Dodd

Dole

Domenici

Dorgan

Durbin

Ensign

Enzi

Feingold

Feinstein

Frist

Graham

GrassleyGregg

Hagel

Harkin

Hatch

Hutchison

Inhofe

InouyeIsakson

Jeffords Johnson

Kennedy

Kerry

KohlKyl

Landrieu

Lautenberg

Leahy

Levin

Lieberman

Lincoln

Lott

Lugar

Martinez

McCain

McConnell

Mikulski

Murkowski

Murray

Nelson

Nelson

Obama

Pryor

Reed

Reid

Roberts

Rockefeller

SalazarSantorum

Sarbanes

Schumer

Sessions

Shelby

Smith

SnoweSpecter

Stabenow

Stevens

Sununu

Talent

Thomas

Thune

Vitter

Voinovich

WarnerWyden

Fig 6. The common and individual structures for the Senate voting data with an inclu-sion cutoff value of 0.6. The nodes represent the 100 senators, with red, blue and purplenode colors corresponding to Republican, Democrat, or Independent (Senator Jeffords),respectively. A solid line corresponds to a positive interaction effect and a dashed line toa negative interaction effect. The width of a link is proportional to the magnitude of thecorresponding overall interaction effect. For each individual network, the links that onlyappear in this category are highlighted in grey.


All votes

Akaka

Alexander

AllardAllen

Baucus

BayhBennettBiden

Bingaman

Bond

Boxer

BrownbackBunning

Burns

Burr

ByrdCantwell

Carper

Chafee

Chambliss

Clinton

Coburn

Cochran

Coleman

Collins

Conrad

Cornyn

Corzine

Craig

Crapo

Dayton

DeMint

DeWine

Dodd

Dole

Domenici

Dorgan

Durbin

Ensign

Enzi

Feingold

Feinstein

Frist

Graham

Grassley

Gregg

Hagel

Harkin

Hatch

Hutchison

Inhofe

Inouye

Isakson

Jeffords

Johnson

KennedyKerry

Kohl

Kyl

Landrieu

Lautenberg

LeahyLevin

Lieberman

Lincoln

Lott

Lugar

Martinez

McCain

McConnell

Mikulski

Murkowski Murray

Nelson

Nelson

Obama

PryorReed

Reid

Roberts

Rockefeller

Salazar

Santorum

SarbanesSchumer

Sessions

Shelby

SmithSnowe

Specter StabenowStevens

Sununu

Talent

Thomas

Thune

Vitter

VoinovichWarner

Wyden


Akaka

AlexanderAllard

Allen

Baucus

Bayh

Bennett

Biden

Bingaman

Bond

Boxer

Brownback

Bunning

BurnsBurr ByrdCantwell

Carper

Chafee

Chambliss

Clinton

Coburn

Cochran

ColemanCollins

Conrad

Cornyn

Corzine

Craig

Crapo

Dayton

DeMint

DeWine

Dodd

Dole

Domenici

Dorgan

Durbin

Ensign

Enzi

Feingold

Feinstein

Frist

Graham

Grassley

Gregg

Hagel

Harkin

Hatch

Hutchison

Inhofe

Inouye

Isakson

Jeffords

Johnson

KennedyKerry

Kohl

Kyl

Landrieu

Lautenberg

Leahy

Levin

Lieberman

Lincoln

Lott

Lugar

Martinez

McCain

McConnell

Mikulski

Murkowski Murray

Nelson

Nelson

Obama

Pryor

Reed

Reid

Roberts Rockefeller

Salazar

Santorum

SarbanesSchumer

Sessions

ShelbySmith

Snowe

Specter

Stabenow

Stevens

Sununu

Talent

Thomas

Thune

Vitter

VoinovichWarner

Wyden


Akaka

Alexander

Allard

Allen

Baucus

Bayh

Bennett

Biden

Bingaman

Bond

Boxer

Brownback

Bunning

Burns

Burr

Byrd

Cantwell

Carper

Chafee

Chambliss

Clinton

Coburn

Cochran

Coleman

Collins

Conrad

Cornyn

Corzine

Craig

Crapo

Dayton

DeMint

DeWine

Dodd

Dole

Domenici

Dorgan

Durbin

Ensign

EnziFeingold

Feinstein

Frist Graham

Grassley

Gregg

Hagel

Harkin

Hatch

Hutchison

Inhofe

Inouye

Isakson

Jeffords

Johnson

Kennedy

Kerry

Kohl

Kyl

Landrieu

Lautenberg

Leahy

Levin

Lieberman

Lincoln

Lott

Lugar

Martinez

McCain

McConnell

MikulskiMurkowskiMurray

Nelson

Nelson

Obama

Pryor

Reed

Reid

Roberts

Rockefeller

Salazar

Santorum

Sarbanes

Schumer

SessionsShelby

Smith

SnoweSpecter

Stabenow

Stevens

Sununu

TalentThomas

Thune

Vitter

Voinovich

Warner

Wyden


Akaka

Alexander

Allard

Allen

Baucus

Bayh

Bennett

BidenBingaman

Bond

Boxer

Brownback

Bunning

BurnsBurr Byrd

CantwellCarper

Chafee

Chambliss

Clinton

Coburn

CochranColeman

Collins

Conrad

Cornyn

Corzine

Craig

Crapo

Dayton

DeMint

DeWine

Dodd

Dole

Domenici

Dorgan

Durbin

Ensign

Enzi

Feingold

Feinstein

Frist

GrahamGrassleyGregg

Hagel

Harkin

Hatch

Hutchison

Inhofe

Inouye

Isakson

Jeffords

Johnson

Kennedy

Kerry

Kohl

Kyl

Landrieu

LautenbergLeahyLevin

Lieberman

Lincoln

Lott

LugarMartinez

McCain

McConnell

Mikulski

Murkowski

MurrayNelson

Nelson

Obama

Pryor

Reed

Reid

Roberts

Rockefeller

Salazar

Santorum

SarbanesSchumer

Sessions

Shelby

Smith

Snowe

Specter

Stabenow

Stevens

Sununu

Talent

Thomas

ThuneVitter

Voinovich

Warner Wyden

Fig 7. Multidimensional scaling analysis for all the votes together, and the three individualcategories. The nodes represent the 100 senators, with red, blue and purple node colorscorresponding to Republican, Democrat, or Independent (Senator Jeffords), respectively.

Documents

Estimating heterogeneous graphical models for discrete ...dept.stat.lsa.umich.edu/~elevina//hetero-ising.pdf · arXiv: math.PR/0000000 ESTIMATING HETEROGENEOUS GRAPHICAL MODELS FOR