
Efficient Sampling for Selecting Important Nodes in Random Network

Haidong Li, Xiaoyun Xu, Yijie Peng, and Chun-Hung Chen

Abstract—We consider the problem of selecting important nodes in a random network, where the nodes connect to each other randomly with certain transition probabilities. The node importance is characterized by the stationary probabilities of the corresponding nodes in a Markov chain defined over the network, as in Google's PageRank. Unlike in a deterministic network, the transition probabilities in a random network are unknown but can be estimated by sampling. Under a Bayesian learning framework, we apply a first-order Taylor expansion and normal approximation to provide a computationally efficient posterior approximation of the stationary probabilities. In order to maximize the probability of correct selection, we propose a dynamic sampling procedure which uses not only the posterior means and variances of certain interaction parameters between different nodes, but also the sensitivities of the stationary probabilities with respect to each interaction parameter. Numerical experiment results demonstrate the superiority of the proposed sampling procedure.

Index Terms—network, Markov chain, ranking and selection, Bayesian learning, dynamic sampling.

I. INTRODUCTION

We consider the problem of selecting the top $m$ important nodes from $n$ ($n > m$) nodes in a network, a central problem for many social and economic networks. In the World Wide Web, web pages and hyperlinks constitute a network, and Google's PageRank lists the most important web pages for each keyword [1], [2]; in sports events such as basketball and football, teams compete with each other, and their win-loss relationship network helps determine which teams should be invited [3], [4]; in social networks like Twitter, the linking topology is used to rank members for popularity recommendation [5]. Other examples include venture capitalist selection [6] and academic paper search [7], [8].

A Markov chain is often used to describe the network. Specifically, each node (page/user/team) in the network is considered as a state of the Markov chain, and the nodes are linked randomly with certain transition probabilities. The node importance is ranked by the stationary probability of the Markov chain, which is the long-run proportion of visits to each state. A larger stationary probability indicates that the corresponding node is more important. Existing works consider a Markov chain with given transition probabilities and focus on how to efficiently calculate the stationary probabilities (see, e.g., [1], [2], [9]). In practice, the transition probabilities are usually not known a priori but estimated from data. For instance, the hyperlinks between web pages on the Internet change dynamically; the relationship network among Twitter users is topic-specific; the competition results in sports are uncertain. Therefore, we focus on a random network with unknown transition probabilities.

Haidong Li is with the Department of Industrial Engineering and Management, Peking University, Beijing, 100871 China (e-mail: [email protected]).

Xiaoyun Xu is with the Department of Industrial Engineering and Management, Peking University, Beijing, 100871 China (e-mail: [email protected]).

Yijie Peng is with the Department of Industrial Engineering and Management, Peking University, Beijing, 100871 China (e-mail: [email protected]).

Chun-Hung Chen is with the Department of Systems Engineering and Operations Research, George Mason University, Fairfax, VA, 22030 USA (e-mail: [email protected]).

In the random network, we sample the interactions between the nodes to estimate the transition probabilities as functions of certain interaction parameters, which are in turn used to estimate the stationary probabilities. Since sampling can be expensive, the total number of samples is usually limited. Moreover, the number of transition probabilities grows with the square of the number of nodes, so it would be practically infeasible to estimate all transition probabilities accurately for large-scale networks. We consider the problem of maximizing the probability of correct selection (PCS) for selecting the top $m$ nodes subject to a fixed sampling budget. The estimation uncertainties in different interaction parameters have heterogeneous effects on the PCS: the final ranking may be more sensitive to perturbations in some interaction parameters than in others. We aim to develop a dynamic sampling procedure that selects the top $m$ nodes with high statistical efficiency.

Our problem is closely related to the ranking and selection (R&S) problem well known in the field of simulation optimization [10], which considers selecting the best alternative or an optimal subset from a finite set of alternatives. There are frequentist and Bayesian branches in R&S [11]. Sampling procedures in the frequentist branch allocate samples to guarantee a pre-specified PCS level (see, e.g., [12], [13], [14]). Sampling procedures in the Bayesian branch aim to either maximize the PCS or minimize the expected opportunity cost subject to a given sampling budget (see, e.g., [15], [16], [17]). Chen et al. [18], Zhang et al. [19], and Gao and Chen [20] study sampling procedures that maximize the PCS for selecting an optimal subset; Xiao and Lee [21] derive the convergence rate of the false subset-selection probability and offer an allocation rule achieving an asymptotically optimal convergence rate; and Gao and Chen [22] develop a sampling procedure based on the expected opportunity cost. In R&S, the alternatives are ranked by the expectations of their sample performance, which can be directly estimated by the sample average of each alternative, whereas in our problem, the nodes are ranked by the stationary probabilities of the Markov chain, which are estimated indirectly from the interaction samples between different nodes.

In this research, a Bayesian estimation scheme is introduced to update the posterior belief on the unknown interaction parameters, and an efficient posterior approximation of the stationary probability is derived by Taylor expansion and normal approximation. An asymptotic analysis of the normal approximation is provided. We propose a dynamic allocation scheme for Markov chains (DAM) to efficiently select the top $m$ nodes, which myopically maximizes an approximation of the PCS and is proved to be consistent. The DAM uses not only the posterior means and variances of certain interaction parameters between different nodes, but also the sensitivities of the stationary probabilities with respect to each interaction parameter.

The rest of this paper is organized as follows. In Section II, we formulate the problem. Section III derives a posterior distribution approximation of the stationary probability. The DAM is proposed in Section IV, and numerical results are given in Section V. The last section concludes the paper and outlines future directions.

arXiv:1901.03466v1 [stat.ME] 11 Jan 2019


Fig. 1. Illustration of node ranking process.

II. PROBLEM FORMULATION

The objective of this study is to identify a subset of important nodes of a random network. The importance of the nodes is ranked by their stationary probabilities in a Markov chain. Specifically, $\pi_i$ denotes the stationary probability of node $i$, and our goal is to select the top $m$ nodes from $n$ nodes:

$$\Theta_m \triangleq \{\langle 1\rangle, \ldots, \langle m\rangle\},$$

where the notations $\langle i\rangle$, $i = 1, \ldots, n$, are the ranking indices such that $\pi_{\langle 1\rangle} \geq \cdots \geq \pi_{\langle n\rangle}$. The Markov chain of the network is constructed from the interaction strength between each pair of nodes. Given an interaction parameter $x_{ij}$ describing the interaction strength between nodes $i$ and $j$, $1 \leq i < j \leq n$, the transition probabilities $P_{ij} = P_{ij}(\mathcal{X})$, $i, j = 1, \ldots, n$, are functions of the vector $\mathcal{X} \triangleq (x_{ij})_{1 \leq i < j \leq n}$ containing all interaction parameters as its elements. These functions take various forms in different applications [23]. The transition matrix $\mathbf{P} = [P_{ij}]_{n \times n}$ is a stochastic matrix satisfying irreducibility and aperiodicity, which guarantees the existence and uniqueness of the stationary distribution. The vector of stationary probabilities $\pi \triangleq (\pi_1, \ldots, \pi_n)$ is the solution of the following equilibrium equation:

$$\pi \mathbf{P} = \pi, \quad \text{and} \quad \sum_{i=1}^{n} \pi_i = 1, \;\; \pi_i > 0. \tag{1}$$

Notice that each stationary probability $\pi_i$ is also a function of $\mathcal{X}$. Figure 1 summarizes the process of constructing a transition probability matrix from the interaction parameters and ranking all nodes according to their stationary probabilities.
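As a concrete illustration (not part of the original text), the equilibrium equation (1) can be solved numerically by power iteration on a small hypothetical 3-node chain; the transition matrix below is invented for demonstration only.

```python
# Power iteration for the stationary distribution of a small Markov chain.
# The 3-node transition matrix P is a hypothetical example, not from the paper.

def stationary_distribution(P, iters=10000, tol=1e-12):
    """Iterate pi <- pi P from the uniform distribution until convergence."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        new = [sum(pi[l] * P[l][k] for l in range(n)) for k in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

P = [[0.1, 0.6, 0.3],
     [0.4, 0.2, 0.4],
     [0.5, 0.3, 0.2]]
pi = stationary_distribution(P)
print(pi)  # components sum to 1; a larger pi_i means node i is more important
```

For an irreducible, aperiodic chain of this size the iteration converges quickly; for web-scale matrices the same scheme is applied in sparse form, as discussed in Remark 1.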

In a random network, the interactions between nodes are random. Specifically, let $X_{ij,t}$, $t \in \mathbb{Z}^+$, be the $t$-th sample of the interaction between nodes $i$ and $j$, which is assumed to follow an independent and identically distributed (i.i.d.) Bernoulli distribution with unknown parameter $x_{ij}$, $1 \leq i < j \leq n$. The Bernoulli assumption is natural in many applications involving pairwise interaction. For instance, in web page ranking, a binary variable takes value 1 for a visit from page $j$ to page $i$ and 0 for the reverse direction; in sports matches, the competition result between team $i$ and team $j$ takes binary values 1 (win) and 0 (loss). All interaction parameters $x_{ij}$, $1 \leq i < j \leq n$, are assumed to be unknown but can be estimated by sampling. With the estimates of the interaction parameters, the transition probabilities and stationary probabilities can in turn be estimated.

Suppose the total number of samples is fixed. The research problem of this work is to sequentially allocate each sample, based on the information collected from all previous samples at each step, to estimate the interaction strengths between different pairs of nodes so as to efficiently select the top $m$ nodes. Given the information of $s$

allocated samples, the selection picks the nodes with the top $m$ posterior estimates of the stationary probabilities, i.e.,
$$\Theta_m^{(s)} \triangleq \{\langle 1\rangle_s, \ldots, \langle m\rangle_s\},$$
where $\langle i\rangle_s$, $i = 1, \ldots, n$, are the ranking indices such that
$$\pi_{\langle 1\rangle_s}(\mathcal{X}^{(s)}) \geq \cdots \geq \pi_{\langle n\rangle_s}(\mathcal{X}^{(s)}),$$
and $\mathcal{X}^{(s)}$ is a posterior estimate of $\mathcal{X}$ based on $s$ samples. We measure the statistical efficiency of a sampling procedure by the PCS, defined as follows:
$$\Pr\left(\Theta_m^{(s)} = \Theta_m\right).$$

III. POSTERIOR OF STATIONARY PROBABILITIES

We introduce a Bayesian framework to obtain posterior estimates of the stationary probabilities. From the Bayes rule, the posterior distribution of $\pi_k$ is
$$F(d\pi_k \mid \mathcal{E}_t) \triangleq \frac{L(\mathcal{E}_t; \pi_k)\, F(d\pi_k \mid \zeta_0)}{\int L(\mathcal{E}_t; \pi_k)\, F(d\pi_k \mid \zeta_0)},$$
where $\zeta_0$ is the parameter of the prior distribution, $\mathcal{E}_t$ is the information collected through the $t$-th sample, and $L(\cdot\,; \pi_k)$ is the likelihood function of the observed samples. In our problem, the likelihood function $L(\cdot\,; \pi_k)$ does not have a closed form, so it is computationally challenging to calculate the posterior distribution of the stationary probabilities. To address this, we propose an efficient approximation for the posterior distributions of the stationary probabilities using the first-order Taylor expansion:

$$\pi_k(\mathcal{X}) \approx \pi_k(\mathcal{X}^{(t)}) + \sum_{1 \leq i < j \leq n} \left[\left.\frac{\partial \pi_k(\mathcal{X})}{\partial x_{ij}}\right|_{\mathcal{X} = \mathcal{X}^{(t)}} \left(x_{ij} - x_{ij}^{(t)}\right)\right]. \tag{2}$$

Section III-A will provide the details of calculating $x_{ij}^{(t)}$, $\pi_k(\mathcal{X}^{(t)})$, and $\left.\frac{\partial \pi_k(\mathcal{X})}{\partial x_{ij}}\right|_{\mathcal{X}=\mathcal{X}^{(t)}}$, and in Section III-B, we will provide a normal approximation for the posterior of $\pi_k$.

A. Posterior of Interaction Parameters

Suppose the prior distribution of $x_{ij}$ is the non-informative uniform $U[0,1]$. By conjugacy [24], the posterior distribution of $x_{ij}$ is a Beta distribution $\mathrm{Beta}(\alpha_{ij}^{(t)}, \beta_{ij}^{(t)})$ with density
$$f\left(x; \alpha_{ij}^{(t)}, \beta_{ij}^{(t)}\right) = \frac{\Gamma(\alpha_{ij}^{(t)} + \beta_{ij}^{(t)})}{\Gamma(\alpha_{ij}^{(t)})\,\Gamma(\beta_{ij}^{(t)})}\, x^{\alpha_{ij}^{(t)} - 1} (1 - x)^{\beta_{ij}^{(t)} - 1},$$
where
$$\alpha_{ij}^{(t)} \triangleq 1 + \sum_{\ell=1}^{t_{ij}} X_{ij,\ell}, \qquad \beta_{ij}^{(t)} \triangleq 1 + \sum_{\ell=1}^{t_{ij}} (1 - X_{ij,\ell}),$$


and $t_{ij}$ is the number of samples allocated to estimating $x_{ij}$ after allocating $t$ samples in total. The posterior mean is
$$x_{ij}^{(t)} \triangleq \alpha_{ij}^{(t)} / (\alpha_{ij}^{(t)} + \beta_{ij}^{(t)}),$$
and the posterior variance is
$$(\sigma_{ij}^{(t)})^2 \triangleq \frac{\alpha_{ij}^{(t)} \beta_{ij}^{(t)}}{(\alpha_{ij}^{(t)} + \beta_{ij}^{(t)})^2 (\alpha_{ij}^{(t)} + \beta_{ij}^{(t)} + 1)}.$$
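A minimal sketch of this conjugate update, assuming a short illustrative sequence of Bernoulli interaction samples: starting from the uniform prior $\mathrm{Beta}(1,1)$, each observed 1 increments $\alpha$, each observed 0 increments $\beta$, and the posterior mean and variance follow the formulas above.

```python
# Conjugate Beta update for an interaction parameter x_ij under Bernoulli
# sampling, starting from the non-informative uniform prior Beta(1, 1).

def beta_posterior(samples):
    """Return (alpha, beta, mean, variance) of the posterior of x_ij."""
    alpha = 1 + sum(samples)                # 1 + number of observed ones
    beta = 1 + len(samples) - sum(samples)  # 1 + number of observed zeros
    s = alpha + beta
    mean = alpha / s
    var = alpha * beta / (s ** 2 * (s + 1))
    return alpha, beta, mean, var

# Three illustrative Bernoulli samples of the interaction between nodes i and j.
alpha, beta, mean, var = beta_posterior([1, 1, 0])
print(alpha, beta, mean, var)
```

With the samples $[1, 1, 0]$ the posterior is $\mathrm{Beta}(3, 2)$, with mean $3/5$ and variance $6/150 = 0.04$, matching the closed-form expressions.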

Let $\mathcal{X}^{(t)} \triangleq (x_{ij}^{(t)})_{1 \leq i < j \leq n}$ be a posterior estimate of $\mathcal{X}$. Posterior estimates of the transition probability matrix $\mathbf{P}$ and its derivative matrix are
$$\mathbf{P}(\mathcal{X}^{(t)}) = \left[P_{\ell k}(\mathcal{X}^{(t)})\right]_{n \times n} \quad \text{and} \quad \left.\frac{\partial \mathbf{P}(\mathcal{X})}{\partial x_{ij}}\right|_{\mathcal{X}^{(t)}} \triangleq \left[\left.\frac{\partial P_{\ell k}(\mathcal{X})}{\partial x_{ij}}\right|_{\mathcal{X} = \mathcal{X}^{(t)}}\right]_{n \times n},$$
respectively. Solving equilibrium equation (1) with $\mathbf{P}(\mathcal{X}^{(t)})$ plugged in yields a posterior estimate of the vector of stationary probabilities:
$$\pi(\mathcal{X}^{(t)}) = \left(\pi_1(\mathcal{X}^{(t)}), \ldots, \pi_n(\mathcal{X}^{(t)})\right).$$

By taking derivatives on both sides of the equilibrium equation $\pi_k(\mathcal{X}) = \sum_{\ell=1}^{n} \pi_\ell(\mathcal{X}) P_{\ell k}(\mathcal{X})$ with respect to $x_{ij}$, we have
$$\begin{aligned}
\frac{\partial \pi_k(\mathcal{X})}{\partial x_{ij}} &= \frac{\partial \sum_{\ell=1}^{n} \pi_\ell(\mathcal{X}) P_{\ell k}(\mathcal{X})}{\partial x_{ij}} \\
&= \sum_{\ell=1}^{n} \left(\frac{\partial \pi_\ell(\mathcal{X})}{\partial x_{ij}} P_{\ell k}(\mathcal{X}) + \pi_\ell(\mathcal{X}) \frac{\partial P_{\ell k}(\mathcal{X})}{\partial x_{ij}}\right) \\
&= \sum_{\ell=1}^{n} \frac{\partial \pi_\ell(\mathcal{X})}{\partial x_{ij}} P_{\ell k}(\mathcal{X}) + \sum_{\ell=1}^{n} \pi_\ell(\mathcal{X}) \frac{\partial P_{\ell k}(\mathcal{X})}{\partial x_{ij}}.
\end{aligned}$$

The derivative of the stationary distribution vector, denoted by
$$\frac{\partial \pi(\mathcal{X})}{\partial x_{ij}} \triangleq \left(\frac{\partial \pi_1(\mathcal{X})}{\partial x_{ij}}, \ldots, \frac{\partial \pi_n(\mathcal{X})}{\partial x_{ij}}\right),$$
is a solution of the following set of equations:
$$\frac{\partial \pi(\mathcal{X})}{\partial x_{ij}} \left[\mathbf{I} - \mathbf{P}(\mathcal{X})\right] = \pi(\mathcal{X}) \frac{\partial \mathbf{P}(\mathcal{X})}{\partial x_{ij}}, \qquad \sum_{k=1}^{n} \frac{\partial \pi_k(\mathcal{X})}{\partial x_{ij}} = 0. \tag{3}$$

By plugging in $\mathcal{X}^{(t)}$, we have a posterior estimate of the derivative vector of the stationary probabilities:
$$\left.\frac{\partial \pi(\mathcal{X})}{\partial x_{ij}}\right|_{\mathcal{X}^{(t)}} \triangleq \left.\left(\frac{\partial \pi_1(\mathcal{X})}{\partial x_{ij}}, \ldots, \frac{\partial \pi_n(\mathcal{X})}{\partial x_{ij}}\right)\right|_{\mathcal{X} = \mathcal{X}^{(t)}}.$$

Remark 1. In order to calculate $\pi(\mathcal{X}^{(t)})$ and $\left.\frac{\partial \pi(\mathcal{X})}{\partial x_{ij}}\right|_{\mathcal{X}^{(t)}}$, we need to solve linear equations involving the transition matrix of a Markov chain. Numerous efficient methods of industrial strength can be applied [25], [26]. Google, for instance, has applied the power method to solve linear equations with a transition matrix of order 8.1 billion [27].

B. Normal Approximation

The posterior approximation of $\pi_k$ on the right-hand side of (2) is a linear combination of $x_{ij} \sim \mathrm{Beta}(\alpha_{ij}^{(t)}, \beta_{ij}^{(t)})$, $1 \leq i < j \leq n$. Since Beta distributions are not closed under linear combination, we use the normal distribution $N\left(x_{ij}^{(t)}, (\sigma_{ij}^{(t)})^2\right)$ to approximate the posterior distribution of $x_{ij}$, which leads to a closed-form approximate posterior distribution of $\pi_k$.

Notice that $\mathrm{Beta}(\alpha_{ij}^{(t)}, \beta_{ij}^{(t)})$ and $N\left(x_{ij}^{(t)}, (\sigma_{ij}^{(t)})^2\right)$ share the same mean and variance. We further show that $\mathrm{Beta}(\alpha_t, \beta_t)$ converges in distribution to a normal distribution as $t \to +\infty$, where $\alpha_t = \alpha_0 t$ and $\beta_t = \beta_0 t$ with $\alpha_0, \beta_0 > 0$.

Theorem 1. As $t \to +\infty$,
$$\sqrt{t}\left[W_t - \frac{\alpha_0}{\alpha_0 + \beta_0}\right] \xrightarrow{d} N\left(0, \frac{\alpha_0 \beta_0}{(\alpha_0 + \beta_0)^3}\right),$$
where $W_t \sim \mathrm{Beta}(\alpha_t, \beta_t)$ and $\xrightarrow{d}$ denotes convergence in distribution.

Proof. A $\mathrm{Beta}(\alpha_t, \beta_t)$ random variable can be represented as $Y_t/(Y_t + Z_t)$, where $Y_t \sim \mathrm{Gamma}(\alpha_t, 1)$, $Z_t \sim \mathrm{Gamma}(\beta_t, 1)$, and $Y_t$ is independent of $Z_t$ [28]. Note that $\mathrm{Gamma}(n, \lambda)$ can be represented as the sum of $n$ i.i.d. exponential random variables with parameter $\lambda$. By a central limit theorem,
$$\sqrt{\alpha_t}\left[Y_t/\alpha_t - 1\right] \xrightarrow{d} N(0, 1) \quad \text{as } t \to +\infty$$
and
$$\sqrt{\beta_t}\left[Z_t/\beta_t - 1\right] \xrightarrow{d} N(0, 1) \quad \text{as } t \to +\infty.$$
Therefore,
$$\sqrt{t}\left[\begin{pmatrix} Y_t/(t\sqrt{\alpha_0}) \\ Z_t/(t\sqrt{\beta_0}) \end{pmatrix} - \begin{pmatrix} \sqrt{\alpha_0} \\ \sqrt{\beta_0} \end{pmatrix}\right] \xrightarrow{d} N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right) \quad \text{as } t \to +\infty.$$
By the multivariate delta method [29], if a sequence of multivariate random variables $\theta_n$ satisfies $\sqrt{n}(\theta_n - \theta) \xrightarrow{d} N(0, \Sigma)$ as $n \to +\infty$, where $\theta$ and $\Sigma$ are constant matrices, then for any continuously differentiable function $g(\cdot)$,
$$\sqrt{n}\left(g(\theta_n) - g(\theta)\right) \xrightarrow{d} N\left(0, (\nabla g(\theta))^T \Sigma\, (\nabla g(\theta))\right) \quad \text{as } n \to +\infty.$$
Note that $W_t$ is a function of $Y_t/(t\sqrt{\alpha_0})$ and $Z_t/(t\sqrt{\beta_0})$, i.e.,
$$W_t = \frac{\sqrt{\alpha_0}\,\big(Y_t/(t\sqrt{\alpha_0})\big)}{\sqrt{\alpha_0}\,\big(Y_t/(t\sqrt{\alpha_0})\big) + \sqrt{\beta_0}\,\big(Z_t/(t\sqrt{\beta_0})\big)};$$
then
$$\sqrt{t}\left[W_t - \frac{\alpha_0}{\alpha_0 + \beta_0}\right] \xrightarrow{d} N\left(0, \frac{\alpha_0 \beta_0}{(\alpha_0 + \beta_0)^3}\right) \quad \text{as } t \to +\infty,$$
which proves the conclusion. ∎
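Theorem 1 can also be checked by simulation (a sketch, not part of the paper): draws of $W_t \sim \mathrm{Beta}(\alpha_0 t, \beta_0 t)$ should concentrate around $\alpha_0/(\alpha_0 + \beta_0)$ with variance approximately $\alpha_0\beta_0/\big((\alpha_0 + \beta_0)^3 t\big)$.

```python
import random

# Monte Carlo sanity check of Theorem 1: W_t ~ Beta(a0*t, b0*t) concentrates
# around a0/(a0+b0) with variance approximately a0*b0 / ((a0+b0)^3 * t).
random.seed(7)
a0, b0, t = 2.0, 3.0, 200
draws = [random.betavariate(a0 * t, b0 * t) for _ in range(20000)]

mean = sum(draws) / len(draws)
var = sum((w - mean) ** 2 for w in draws) / len(draws)
theory_mean = a0 / (a0 + b0)                 # 0.4
theory_var = a0 * b0 / ((a0 + b0) ** 3 * t)  # 6 / (125 * 200) = 0.00024
print(mean, var)
```

The empirical variance sits slightly below the asymptotic value, consistent with the exact Beta variance $\alpha_t\beta_t/\big((\alpha_t+\beta_t)^2(\alpha_t+\beta_t+1)\big)$.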

By the law of large numbers, $\alpha_{ij}^{(t)}, \beta_{ij}^{(t)} \sim O(t_{ij})$ almost surely (a.s.) as $t_{ij} \to +\infty$, where $A(x) = O(B(x))$ as $x \to \infty$ ($x \to 0$) means that $|A(x)/B(x)| \to C > 0$ as $x \to \infty$ ($x \to 0$). The asymptotic result in Theorem 1 justifies the asymptotic normality of $\mathrm{Beta}(\alpha_{ij}^{(t)}, \beta_{ij}^{(t)})$. Moreover, we show that the Kullback-Leibler (KL) divergence between $\mathrm{Beta}(\alpha_{ij}^{(t)}, \beta_{ij}^{(t)})$ and $N\left(x_{ij}^{(t)}, (\sigma_{ij}^{(t)})^2\right)$ goes to zero as $t_{ij} \to +\infty$. The KL divergence is a statistical (asymmetric) distance between two distributions [30]. Specifically, if $U$ and $V$ are probability measures over a set $\Omega$, the KL divergence between $V$ and $U$ is defined by
$$D_{KL}(U \,\|\, V) = \int_{\Omega} \log \frac{dU}{dV}\, dU.$$

Theorem 2. When $x_{ij} \neq 1/2$,
$$D_{KL}(\mathrm{Beta} \,\|\, \mathrm{Normal}) = O\left(t_{ij}^{-1}\right) \quad \text{a.s. as } t_{ij} \to +\infty,$$
and when $x_{ij} = 1/2$,
$$D_{KL}(\mathrm{Beta} \,\|\, \mathrm{Normal}) = o\left(t_{ij}^{-1}\right) \quad \text{a.s. as } t_{ij} \to +\infty,$$
where $A(x) = o(B(x))$ means that $\lim_{x \to +\infty} A(x)/B(x) = 0$.


Proof. Let
$$f\left(x; \alpha_{ij}^{(t)}, \beta_{ij}^{(t)}\right) = \frac{\Gamma(\alpha_{ij}^{(t)} + \beta_{ij}^{(t)})}{\Gamma(\alpha_{ij}^{(t)})\,\Gamma(\beta_{ij}^{(t)})}\, x^{\alpha_{ij}^{(t)} - 1} (1 - x)^{\beta_{ij}^{(t)} - 1},$$
where $\Gamma(x) = \int_0^{+\infty} z^{x-1} e^{-z}\, dz$ is the Gamma function, and
$$g\left(x; x_{ij}^{(t)}, \sigma_{ij}^{(t)}\right) = \left(\sqrt{2\pi}\,\sigma_{ij}^{(t)}\right)^{-1} \exp\left\{-\frac{(x - x_{ij}^{(t)})^2}{2(\sigma_{ij}^{(t)})^2}\right\}.$$

Then,
$$\begin{aligned}
&D_{KL}(\mathrm{Beta} \,\|\, \mathrm{Normal}) \\
&= \int_0^1 f\left(x; \alpha_{ij}^{(t)}, \beta_{ij}^{(t)}\right) \log \frac{f\left(x; \alpha_{ij}^{(t)}, \beta_{ij}^{(t)}\right)}{g\left(x; x_{ij}^{(t)}, \sigma_{ij}^{(t)}\right)}\, dx \\
&= \log e \times \int_0^1 f\left(x; \alpha_{ij}^{(t)}, \beta_{ij}^{(t)}\right) \ln \frac{f\left(x; \alpha_{ij}^{(t)}, \beta_{ij}^{(t)}\right)}{g\left(x; x_{ij}^{(t)}, \sigma_{ij}^{(t)}\right)}\, dx \\
&= \log e \times \left(\int_0^1 f\left(x; \alpha_{ij}^{(t)}, \beta_{ij}^{(t)}\right) \ln f\left(x; \alpha_{ij}^{(t)}, \beta_{ij}^{(t)}\right) dx - \int_0^1 f\left(x; \alpha_{ij}^{(t)}, \beta_{ij}^{(t)}\right) \ln g\left(x; x_{ij}^{(t)}, \sigma_{ij}^{(t)}\right) dx\right).
\end{aligned}$$

From [31], the negative entropy of the Beta distribution can be calculated by
$$\begin{aligned}
&\int_0^1 f\left(x; \alpha_{ij}^{(t)}, \beta_{ij}^{(t)}\right) \ln f\left(x; \alpha_{ij}^{(t)}, \beta_{ij}^{(t)}\right) dx \\
&= -\ln B(\alpha_{ij}^{(t)}, \beta_{ij}^{(t)}) + (\alpha_{ij}^{(t)} - 1)\left(\psi(\alpha_{ij}^{(t)}) - \psi(\alpha_{ij}^{(t)} + \beta_{ij}^{(t)})\right) + (\beta_{ij}^{(t)} - 1)\left(\psi(\beta_{ij}^{(t)}) - \psi(\alpha_{ij}^{(t)} + \beta_{ij}^{(t)})\right),
\end{aligned}$$

where the digamma function $\psi(\cdot)$ is the first derivative of the log-gamma function, and $B(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha + \beta)$. By Stirling's formula, we have, as $\alpha_{ij}^{(t)}, \beta_{ij}^{(t)} \to +\infty$,
$$\begin{aligned}
\ln \Gamma(\alpha_{ij}^{(t)} + \beta_{ij}^{(t)}) &= \tfrac{1}{2}\ln 2\pi + \left(\alpha_{ij}^{(t)} + \beta_{ij}^{(t)} - \tfrac{1}{2}\right)\ln(\alpha_{ij}^{(t)} + \beta_{ij}^{(t)}) - (\alpha_{ij}^{(t)} + \beta_{ij}^{(t)}) \\
&\quad + \tfrac{1}{12}(\alpha_{ij}^{(t)} + \beta_{ij}^{(t)})^{-1} + o\left((\alpha_{ij}^{(t)} + \beta_{ij}^{(t)})^{-1}\right); \\
\ln \Gamma(\alpha_{ij}^{(t)}) &= \tfrac{1}{2}\ln 2\pi + \left(\alpha_{ij}^{(t)} - \tfrac{1}{2}\right)\ln \alpha_{ij}^{(t)} - \alpha_{ij}^{(t)} + \tfrac{1}{12}(\alpha_{ij}^{(t)})^{-1} + o\left((\alpha_{ij}^{(t)})^{-1}\right); \\
\ln \Gamma(\beta_{ij}^{(t)}) &= \tfrac{1}{2}\ln 2\pi + \left(\beta_{ij}^{(t)} - \tfrac{1}{2}\right)\ln \beta_{ij}^{(t)} - \beta_{ij}^{(t)} + \tfrac{1}{12}(\beta_{ij}^{(t)})^{-1} + o\left((\beta_{ij}^{(t)})^{-1}\right).
\end{aligned}$$

With the results in [32], as $\alpha_{ij}^{(t)}, \beta_{ij}^{(t)} \to +\infty$, the digamma function has the following expansion:
$$\begin{aligned}
\psi(\alpha_{ij}^{(t)} + \beta_{ij}^{(t)}) &= \ln(\alpha_{ij}^{(t)} + \beta_{ij}^{(t)}) - \tfrac{1}{2}(\alpha_{ij}^{(t)} + \beta_{ij}^{(t)})^{-1} - \tfrac{1}{12}(\alpha_{ij}^{(t)} + \beta_{ij}^{(t)})^{-2} + o\left((\alpha_{ij}^{(t)} + \beta_{ij}^{(t)})^{-2}\right); \\
\psi(\alpha_{ij}^{(t)}) &= \ln \alpha_{ij}^{(t)} - \tfrac{1}{2}(\alpha_{ij}^{(t)})^{-1} - \tfrac{1}{12}(\alpha_{ij}^{(t)})^{-2} + o\left((\alpha_{ij}^{(t)})^{-2}\right); \\
\psi(\beta_{ij}^{(t)}) &= \ln \beta_{ij}^{(t)} - \tfrac{1}{2}(\beta_{ij}^{(t)})^{-1} - \tfrac{1}{12}(\beta_{ij}^{(t)})^{-2} + o\left((\beta_{ij}^{(t)})^{-2}\right).
\end{aligned}$$

In addition,
$$\begin{aligned}
&\int_0^1 f\left(x; \alpha_{ij}^{(t)}, \beta_{ij}^{(t)}\right) \ln g\left(x; x_{ij}^{(t)}, \sigma_{ij}^{(t)}\right) dx \\
&= -\int_0^1 f\left(x; \alpha_{ij}^{(t)}, \beta_{ij}^{(t)}\right)\left(\ln\left(\sqrt{2\pi}\,\sigma_{ij}^{(t)}\right) + \frac{(x - x_{ij}^{(t)})^2}{2(\sigma_{ij}^{(t)})^2}\right) dx \\
&= -\ln\left(\frac{\sqrt{2\pi}\,(\alpha_{ij}^{(t)})^{\frac{1}{2}} (\beta_{ij}^{(t)})^{\frac{1}{2}}}{(\alpha_{ij}^{(t)} + \beta_{ij}^{(t)})(\alpha_{ij}^{(t)} + \beta_{ij}^{(t)} + 1)^{\frac{1}{2}}}\right) - \frac{1}{2}.
\end{aligned}$$

By the law of large numbers, we have
$$\alpha_{ij}^{(t)}, \beta_{ij}^{(t)} \to +\infty \quad \text{a.s. as } t_{ij} \to +\infty$$
and
$$\lim_{t_{ij} \to +\infty} \frac{\alpha_{ij}^{(t)}}{\alpha_{ij}^{(t)} + \beta_{ij}^{(t)}} = 1 - \lim_{t_{ij} \to +\infty} \frac{\beta_{ij}^{(t)}}{\alpha_{ij}^{(t)} + \beta_{ij}^{(t)}} = x_{ij} \quad \text{a.s.}$$

Therefore, as $t_{ij} \to +\infty$,
$$\begin{aligned}
&D_{KL}(\mathrm{Beta} \,\|\, \mathrm{Normal}) \\
&= \log e \times \left(\int_0^1 f\left(x; \alpha_{ij}^{(t)}, \beta_{ij}^{(t)}\right) \ln f\left(x; \alpha_{ij}^{(t)}, \beta_{ij}^{(t)}\right) dx - \int_0^1 f\left(x; \alpha_{ij}^{(t)}, \beta_{ij}^{(t)}\right) \ln g\left(x; x_{ij}^{(t)}, \sigma_{ij}^{(t)}\right) dx\right) \\
&= \log e \times \left(\frac{1}{2}\ln\left[1 - (\alpha_{ij}^{(t)} + \beta_{ij}^{(t)} + 1)^{-1}\right] + \frac{1}{3\alpha_{ij}^{(t)}} + \frac{1}{3\beta_{ij}^{(t)}} - \frac{5}{6(\alpha_{ij}^{(t)} + \beta_{ij}^{(t)})} + o\left(t_{ij}^{-1}\right)\right) \\
&= \log e \times \left(\frac{1}{3x_{ij}(1 - x_{ij})} - \frac{4}{3}\right) t_{ij}^{-1} + o\left(t_{ij}^{-1}\right) \quad \text{a.s.},
\end{aligned}$$

where the last equality holds due to the fact that $\ln(1 - x^{-1}) = -x^{-1} + o(x^{-1})$ as $x \to +\infty$. Notice that
$$\frac{1}{3x_{ij}(1 - x_{ij})} - \frac{4}{3} = 0$$
if and only if $x_{ij} = 1/2$. The conclusion follows immediately. ∎

Remark 2. Notice that $D_{KL}(\mathrm{Beta} \,\|\, \mathrm{Normal})$ converges at the fastest rate when $x_{ij} = 1/2$. This can be explained by the fact that the normal distribution is symmetric, and the Beta distribution is also symmetric when $x_{ij} = 1/2$. Numerical results show that $D_{KL}(\mathrm{Beta} \,\|\, \mathrm{Normal})$ is close to zero even when $t_{ij}$ is not large. For instance, when $t_{ij} = 3$, $\alpha_{ij}^{(t)} = 2$, and $\beta_{ij}^{(t)} = 3$, the KL divergence between $\mathrm{Beta}(2, 3)$ and $N(2/5, 1/25)$ is 0.0444; when $t_{ij} = 18$, $\alpha_{ij}^{(t)} = 8$, and $\beta_{ij}^{(t)} = 12$, the KL divergence between $\mathrm{Beta}(8, 12)$ and $N(2/5, 2/175)$ is 0.0049.
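The divergence values quoted above can be reproduced by direct numerical integration of $D_{KL}$ in nats; the midpoint rule below is a simple sketch of that computation.

```python
import math

# Numerical KL divergence (in nats) between Beta(a, b) and the normal
# distribution with the same mean and variance, via a midpoint rule on (0, 1).

def kl_beta_vs_normal(a, b, n=100000):
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    mu = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    total = 0.0
    for k in range(n):
        x = (k + 0.5) / n
        log_f = (a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_beta_fn
        log_g = -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        total += math.exp(log_f) * (log_f - log_g)
    return total / n

kl_small = kl_beta_vs_normal(2, 3)    # approximately 0.0444, as in Remark 2
kl_large = kl_beta_vs_normal(8, 12)   # approximately 0.0049
print(kl_small, kl_large)
```

The divergence shrinks roughly like $t_{ij}^{-1}$ as the sample size grows, in line with Theorem 2.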

The discussion above suggests that the statistical characteristics of $\mathrm{Beta}(\alpha_{ij}^{(t)}, \beta_{ij}^{(t)})$ and $N\left(x_{ij}^{(t)}, (\sigma_{ij}^{(t)})^2\right)$ are close when the allocated sample size is fairly large. Thus, we replace $x_{ij} \sim \mathrm{Beta}(\alpha_{ij}^{(t)}, \beta_{ij}^{(t)})$ with $x_{ij} \sim N\left(x_{ij}^{(t)}, (\sigma_{ij}^{(t)})^2\right)$ in (2). The posterior approximation of $\pi_k$ in (2) is then a linear combination of normal random variables, which follows the normal distribution
$$\pi_k \sim N\left(\pi_k^{(t)}, (\tau_k^{(t)})^2\right),$$

where
$$\pi_k^{(t)} \triangleq \pi_k(\mathcal{X}^{(t)}),$$
and
$$(\tau_k^{(t)})^2 \triangleq \sum_{1 \leq i < j \leq n} \left[\left(\left.\frac{\partial \pi_k(\mathcal{X})}{\partial x_{ij}}\right|_{\mathcal{X} = \mathcal{X}^{(t)}}\right)^2 (\sigma_{ij}^{(t)})^2\right].$$

When $n = 2$,
$$(\tau_1^{(t)})^2 = \left(\left.\frac{\partial \pi_1(\mathcal{X})}{\partial x_{1,2}}\right|_{\mathcal{X} = \mathcal{X}^{(t)}}\right)^2 (\sigma_{1,2}^{(t)})^2 = \left(\left.\frac{\partial \pi_2(\mathcal{X})}{\partial x_{1,2}}\right|_{\mathcal{X} = \mathcal{X}^{(t)}}\right)^2 (\sigma_{1,2}^{(t)})^2 = (\tau_2^{(t)})^2.$$
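The $n = 2$ identity can be verified on a hypothetical 2-node chain with $\mathbf{P}(x) = \begin{bmatrix} 1-x & x \\ 0.3 & 0.7 \end{bmatrix}$, whose stationary distribution is $\pi_1 = 0.3/(0.3+x)$ and $\pi_2 = x/(0.3+x)$ in closed form, so the two sensitivities are negatives of each other; the posterior variance value used below is illustrative.

```python
# Closed-form check of the n = 2 identity (tau_1)^2 = (tau_2)^2 on a
# hypothetical 2-node chain: P(x) = [[1 - x, x], [0.3, 0.7]], with stationary
# distribution pi_1 = 0.3/(0.3 + x), pi_2 = x/(0.3 + x).

def sensitivities(x, h=1e-6):
    """Central finite differences of (pi_1, pi_2) with respect to x."""
    def pi(x):
        return (0.3 / (0.3 + x), x / (0.3 + x))
    hi, lo = pi(x + h), pi(x - h)
    return tuple((a - b) / (2 * h) for a, b in zip(hi, lo))

x, sigma2 = 0.5, 0.01  # illustrative posterior mean and variance of x_{1,2}
d1, d2 = sensitivities(x)
tau1_sq, tau2_sq = d1 ** 2 * sigma2, d2 ** 2 * sigma2
print(tau1_sq, tau2_sq)  # equal, as claimed for n = 2
```

Since $\pi_1 + \pi_2 = 1$ identically, the derivatives satisfy $d_1 = -d_2$ ($d_1 = -0.3/(0.3+x)^2 = -0.46875$ at $x = 0.5$), and squaring makes the two variance contributions coincide.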

Note that the variance $(\tau_k^{(t)})^2$ in the normal approximation for the posterior distribution of the stationary probability is affected by


both the posterior variance $(\sigma_{ij}^{(t)})^2$ of $x_{ij}$ and the posterior estimate $\left.\frac{\partial \pi_k(\mathcal{X})}{\partial x_{ij}}\right|_{\mathcal{X} = \mathcal{X}^{(t)}}$ of the derivative of $\pi_k$ with respect to $x_{ij}$. Obviously, an increase in the posterior variance of $x_{ij}$ results in an increase in the variance of $\pi_k$ in the posterior approximation. On the other hand, if the stationary probability $\pi_k$ is insensitive to the parameter uncertainty in $x_{ij}$, i.e., $\partial \pi_k / \partial x_{ij}$ is small, a large variance in $x_{ij}$ may not lead to a large variance in $\pi_k$. Thus, the posterior variance $(\sigma_{ij}^{(t)})^2$ of $x_{ij}$ is scaled by the posterior estimate $\left.\frac{\partial \pi_k(\mathcal{X})}{\partial x_{ij}}\right|_{\mathcal{X} = \mathcal{X}^{(t)}}$ of the derivative of $\pi_k$ with respect to $x_{ij}$ in $(\tau_k^{(t)})^2$.

IV. DYNAMIC SAMPLING FOR MARKOV CHAIN

Given the posterior approximations of the stationary probabilities, we derive an efficient dynamic sampling procedure based on an approximate PCS. The PCS for selecting the top $m$ nodes can be expressed as
$$\mathrm{PCS} = \Pr\left(\Theta_m^{(s)} = \Theta_m\right) = \Pr\left(\pi_{\langle i\rangle_s} > \pi_{\langle j\rangle_s},\; i \in \{1, \ldots, m\},\; j \in \{m+1, \ldots, n\}\right).$$

Uncertainty in estimating $x_{ij}$, $1 \leq i < j \leq n$, leads to uncertainty in estimating $\pi_k$, $k = 1, \ldots, n$, which in turn can lead to a low PCS. A noticeable feature of estimating the stationary probabilities is that the marginal influence of the estimation uncertainty of each $x_{ij}$ on the stationary probabilities, and thus on the PCS, is heterogeneous. To be more specific, large perturbations in some interaction parameters may have little influence on the stationary probabilities, whereas small perturbations in other interaction parameters could cause significant changes in the ranking of the stationary probabilities and thus greatly affect the PCS. This heterogeneity can be demonstrated by the following simple example: consider a 3-node network with three interaction parameters $(x_{1,2}, x_{1,3}, x_{2,3})$ to be estimated. The second column of Table I lists the true interaction parameters, the stationary probabilities, and the resulting ranking. Table I shows that a perturbation in $x_{1,3}$ can cause a significant change in the stationary probabilities (Estimation 1), which even leads to an incorrect selection of the best node. On the other hand, the same perturbation in $x_{2,3}$ has little influence on correctly selecting the optimal node subset (Estimation 2). To enhance the PCS under a limited sample size, this heterogeneity needs to be taken into account in the design of the sampling scheme.

We aim to obtain a dynamic sampling policy $\mathcal{D}_s$ that maximizes the PCS:
$$\max_{\mathcal{D}_s} \Pr\left(\pi_{\langle i\rangle_s} > \pi_{\langle j\rangle_s},\; i \in \{1, \ldots, m\},\; j \in \{m+1, \ldots, n\}\right). \tag{4}$$

The dynamic sampling policy $\mathcal{D}_s$ is a sequence of maps $\mathcal{D}_s(\cdot) = (D_1(\cdot), \ldots, D_s(\cdot))$. Based on the information set $\mathcal{E}_{t-1}$, $1 \leq t \leq s$, $D_t(\mathcal{E}_{t-1}) \in \{(i, j) : 1 \leq i < j \leq n\}$ allocates the $t$-th sample to estimate an interaction parameter $x_{ij}$, $1 \leq i < j \leq n$. Similar to [33] and [34], a policy optimization problem such as (4) can be formulated as a stochastic control (dynamic programming) problem. The expected payoff of a sampling scheme $\mathcal{D}_s$ can be defined recursively by
$$V_s(\mathcal{E}_s; \mathcal{D}_s) \triangleq \mathrm{E}\left[\mathbb{1}\left\{\Theta_m^{(s)} = \Theta_m\right\} \,\Big|\, \mathcal{E}_s\right] = \Pr\left(\pi_{\langle i\rangle_s} > \pi_{\langle j\rangle_s},\; i \in \{1, \ldots, m\},\; j \in \{m+1, \ldots, n\} \,\Big|\, \mathcal{E}_s\right), \tag{5}$$

and for $0 \leq t < s$,
$$V_t(\mathcal{E}_t; \mathcal{D}_s) \triangleq \mathrm{E}\left[V_{t+1}(\mathcal{E}_t \cup \{X_{ij,t+1}\}; \mathcal{D}_s) \,\Big|\, \mathcal{E}_t\right] \Big|_{(i,j) = D_{t+1}(\mathcal{E}_t)},$$
where equation (5) is a posterior integrated PCS. The optimal sampling policy is then well defined by
$$\mathcal{D}_s^* \triangleq \arg\max_{\mathcal{D}_s} V_0(\zeta_0; \mathcal{D}_s),$$
where $\zeta_0$ is the prior information. It is important to note that the definition of the decision variable in our study differs from the one in R&S. In the R&S problem, the decision is to choose an alternative $i$ to sample, whereas our decision is to choose a pair of nodes $(i, j)$ to sample.

In principle, backward induction can be used to solve the stochastic control problem, but it suffers from the curse of dimensionality (see [34]). To address this issue, we adopt approximate dynamic programming (ADP) schemes, which make dynamic decisions based on a value function approximation (VFA) and keep learning the VFA as decisions move forward [35]. From Section III, an approximation of the posterior distribution of π_k conditioned on E_t is a normal distribution with mean π_k^(t) and variance (τ_k²)^(t). Therefore, the vector

    ( π_⟨1⟩_t − π_⟨m+1⟩_t , …, π_⟨1⟩_t − π_⟨n⟩_t , …, π_⟨m⟩_t − π_⟨m+1⟩_t , …, π_⟨m⟩_t − π_⟨n⟩_t )

follows a joint normal distribution with mean vector

    ( π^(t)_⟨1⟩_t − π^(t)_⟨m+1⟩_t , …, π^(t)_⟨1⟩_t − π^(t)_⟨n⟩_t , …, π^(t)_⟨m⟩_t − π^(t)_⟨m+1⟩_t , …, π^(t)_⟨m⟩_t − π^(t)_⟨n⟩_t ),

and covariance matrix Γ′ΛΓ, where

    Λ ≜ diag( (σ^(t)_{1,2})², …, (σ^(t)_{1,n})², (σ^(t)_{2,3})², …, (σ^(t)_{n−1,n})² ),

and

    Γ ≜
    [ d^(t)_{1,2}(1, m+1)    ⋯  d^(t)_{1,2}(1, n)    ⋯  d^(t)_{1,2}(m, m+1)    ⋯  d^(t)_{1,2}(m, n)   ]
    [         ⋮                       ⋮                        ⋮                        ⋮             ]
    [ d^(t)_{1,n}(1, m+1)    ⋯  d^(t)_{1,n}(1, n)    ⋯  d^(t)_{1,n}(m, m+1)    ⋯  d^(t)_{1,n}(m, n)   ]
    [ d^(t)_{2,3}(1, m+1)    ⋯  d^(t)_{2,3}(1, n)    ⋯  d^(t)_{2,3}(m, m+1)    ⋯  d^(t)_{2,3}(m, n)   ]
    [         ⋮                       ⋮                        ⋮                        ⋮             ]
    [ d^(t)_{n−1,n}(1, m+1)  ⋯  d^(t)_{n−1,n}(1, n)  ⋯  d^(t)_{n−1,n}(m, m+1)  ⋯  d^(t)_{n−1,n}(m, n) ],

where Λ is a diagonal matrix whose dimension equals the number of interaction parameters, Γ is an [n(n−1)/2] × [m(n−m)] matrix, and for i ∈ {1, …, m}, j ∈ {m+1, …, n}, 1 ≤ r < q ≤ n,

    d^(t)_{r,q}(i, j) ≜ ∂( π_⟨i⟩_t(X) − π_⟨j⟩_t(X) ) / ∂x_rq |_{X = X^(t)}.

The elements of matrix Γ reflect the posterior information on the sensitivities of the differences in stationary probabilities with respect to each x_ij.
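As a concrete sketch, the covariance matrix Γ′ΛΓ can be assembled from sensitivity estimates of the stationary probabilities. The code below is illustrative only: it uses the PageRank-style chain of Remark 3 as the model for π(X) and replaces the analytical derivatives of equation (3) with finite differences; all function names are assumptions, not from the paper.

```python
import itertools
import numpy as np

def transition_matrix(x, n):
    # PageRank-style chain of Remark 3: P[j,i] = x_ij/(n-1),
    # P[i,j] = 1/(n-1) - P[j,i] for i < j (0-based node indices)
    P = np.zeros((n, n))
    for (i, j), xij in x.items():
        P[j, i] = xij / (n - 1)
        P[i, j] = 1.0 / (n - 1) - P[j, i]
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))
    return P

def stationary(P):
    # Left eigenvector of P for eigenvalue 1, normalized to sum to one
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return pi / pi.sum()

def diff_covariance(x_mean, x_var, n, m, h=1e-6):
    """Assemble G' L G for the m(n-m) pairwise differences pi_<i> - pi_<j>:
    L is the diagonal of parameter variances, and each row of G holds
    finite-difference estimates of d_rq(i, j)."""
    pairs = list(itertools.combinations(range(n), 2))
    pi0 = stationary(transition_matrix(x_mean, n))
    order = np.argsort(-pi0)            # nodes ranked by posterior mean
    top, rest = order[:m], order[m:]
    G = np.zeros((len(pairs), m * (n - m)))
    for a, rq in enumerate(pairs):
        xp = dict(x_mean)
        xp[rq] += h
        d = (stationary(transition_matrix(xp, n)) - pi0) / h
        G[a] = [d[i] - d[j] for i in top for j in rest]
    L = np.diag([x_var[rq] for rq in pairs])
    return G.T @ L @ G

x_mean = {(0, 1): 0.7, (0, 2): 0.35, (1, 2): 0.6}
x_var = {k: 0.01 for k in x_mean}
cov = diff_covariance(x_mean, x_var, n=3, m=1)
print(cov.shape)    # m(n-m) = 2 pairwise differences -> a 2x2 covariance
```

The resulting matrix is symmetric positive semidefinite by construction, as a covariance of a linear (delta-method) approximation should be.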

To derive a dynamic sampling procedure with an analytical form, we use the same VFA technique developed in [34]. At any step t, we treat the (t+1)-th step as the last step and maximize the expected value function by allocating the (t+1)-th sample to a pair (i, j):

    V_t(E_t; (i, j)) ≜ E[ V_{t+1}(E_t ∪ {X_{ij,t+1}}) | E_t ],

where

    V_{t+1}(E_{t+1}) ≜ Pr( π_⟨i⟩_{t+1} > π_⟨j⟩_{t+1} , i ∈ {1, …, m}, j ∈ {m+1, …, n} | E_{t+1} ).

The posterior probability above is an integral of the multivariate standard normal density over a region bounded by hyperplanes. We approximate the posterior probability by an integral over a maximum tangent inner ball of the integral region; see [34] for more details on this approximation. By symmetry of the normal density, maximizing the integral over a maximum tangent inner ball


TABLE I
THE INFLUENCE OF ESTIMATION ERRORS IN TESTED PARAMETERS.

                                            True Value                 Estimation 1                Estimation 2
    Parameter (x_{1,2}, x_{1,3}, x_{2,3})   (0.7, 0.35, 0.6)           (0.7, 0.35+0.02, 0.6)       (0.7, 0.35, 0.6+0.02)
    Stationary Probability (π_1, π_2, π_3)  (0.3477, 0.2916, 0.3607)   (0.3582, 0.2897, 0.3521)    (0.3497, 0.2989, 0.3514)
    Order Statistics                        (3, 1, 2)                  (1, 3, 2)                   (3, 1, 2)

is equivalent to maximizing the volume of the ball, which has the following analytical formula:

    v(E_{t+1}) = min_{k∈{1,…,m}, ℓ∈{m+1,…,n}}  η^(t+1)_{kℓ} / ζ^(t+1)_{kℓ},

where

    η^(t)_{kℓ} ≜ ( π^(t)_⟨k⟩_t − π^(t)_⟨ℓ⟩_t + ε )²,

and

    ζ^(t)_{kℓ} ≜ Σ_{1 ≤ r < q ≤ n} [ d^(t)_{r,q}(k, ℓ) σ^(t)_{rq} ]².

Here, we introduce a small positive real number ε so that the volume of the ball v(E_{t+1}) is guaranteed to be positive. In [34], where the samples follow normal distributions, the volume of the ball is positive a.s. In our study, however, all samples follow Bernoulli distributions, so ( π^(t)_⟨i⟩_t − π^(t)_⟨j⟩_t ) is a discrete random variable and the event

    ( π^(t)_⟨i⟩_t − π^(t)_⟨j⟩_t )² = 0

happens with positive probability. In other words, the hyperplanes bounding the integral region could pass through the origin. In order to keep the volume of the ball positive, we shift each hyperplane away from the origin by a distance ε, which is visualized in Figure 2. In implementation, ε is set to a small positive number, e.g., 0.0001.

Fig. 2. Illustration of the effect of ε.

By the certainty equivalent approximation [36],

    v( E_t ∪ { E[X_{ij,t+1} | E_t] } ) ≈ E[ v( E_t ∪ {X_{ij,t+1}} ) | E_t ],

we have the following VFA:

    V_t(E_t; (i, j)) ≜ v( E_t ∪ { E[X_{ij,t+1} | E_t] } )
                     = min_{k∈{1,…,m}, ℓ∈{m+1,…,n}}  η^(t)_{kℓ} / Σ_{1 ≤ r < q ≤ n} ( d^(t)_{r,q}(k, ℓ) )² (σ^(t)_{rq})²_{(i,j)},    (6)

where

    (σ^(t)_{rq})²_{(i,j)} ≜  α^(t)_{rq} β^(t)_{rq} / [ (α^(t)_{rq} + β^(t)_{rq})² (α^(t)_{rq} + β^(t)_{rq} + 2) ],  when (r, q) = (i, j);
                             α^(t)_{rq} β^(t)_{rq} / [ (α^(t)_{rq} + β^(t)_{rq})² (α^(t)_{rq} + β^(t)_{rq} + 1) ],  when (r, q) ≠ (i, j).

A dynamic allocation scheme for Markov chain (DAM) that optimizes the VFA is given by

    D_{t+1}(E_t) = arg max_{1 ≤ i < j ≤ n} V_t(E_t; (i, j)).    (7)

The DAM uses the information on the posterior means of the stationary probabilities, which are calculated from the posterior means of the interaction parameters via equilibrium equation (1), the posterior variances of the interaction parameters, and the sensitivities of the stationary probabilities with respect to each interaction parameter. Ignoring the small positive constant ε, we note that η^(t)_{kℓ} and ζ^(t)_{kℓ} are, respectively, the squared mean and the variance of the approximate posterior distribution of the difference in the stationary probabilities. Therefore, equation (6) can be rewritten as

    min_{k∈{1,…,m}, ℓ∈{m+1,…,n}}  1 / c_v²(k, ℓ),

where c_v(k, ℓ) is the coefficient of variation (CV, sometimes called the noise-signal ratio) of the posterior approximation for ( π_⟨k⟩_t − π_⟨ℓ⟩_t ). The DAM minimizes the maximum of the c_v(k, ℓ)'s, which is intuitively reasonable, since a large c_v(k, ℓ) implies high difficulty in comparing π_⟨k⟩_t and π_⟨ℓ⟩_t from the posterior information. The DAM sequentially allocates each sample to estimate an interaction parameter so as to reduce the CV of the difference in each pair of the stationary probabilities. In particular, at each step the DAM focuses on the pair that is most difficult to compare among all pairs differentiating the top-m stationary probabilities from the rest, based on the posterior information. The DAM is proved to be consistent in the following theorem.

Theorem 3. If for 1 ≤ i < j ≤ n and t ∈ Z⁺,

    ∂π(X)/∂x_ij |_{X = X^(t)} ≠ 0  a.s.,

then the DAM is consistent, i.e.,

    lim_{s→+∞} Θ_m^(s) = Θ_m  a.s.


Proof. We only need to prove that each x_ij is sampled infinitely often a.s. under the DAM; consistency then follows by the law of large numbers. Suppose parameter x_ij is sampled only finitely often and parameter x_rq is sampled infinitely often. Then there exists a finite number N_0 such that x_ij stops receiving replications after the sampling number t exceeds N_0. Thus we have

    lim_{t→+∞} (σ^(t)_{ij})² > 0,    lim_{t→+∞} (σ^(t)_{rq})² = 0.

If there exists a pair (k, ℓ), k ∈ {1, …, m}, ℓ ∈ {m+1, …, n}, such that

    lim_{t→+∞} [ d^(t)_{i,j}(k, ℓ) ]² > 0,

then

    lim_{t→+∞} v(E_t) < +∞.

Consider the pair

    (k′, ℓ′) ≜ arg min_{k∈{1,…,m}, ℓ∈{m+1,…,n}} lim_{t→+∞} η^(t)_{kℓ} / ζ^(t)_{kℓ}.

If

    lim_{t→+∞} [ d^(t)_{i,j}(k′, ℓ′) ]² = 0

holds for every parameter x_ij that is sampled only finitely often, then

    lim_{t→+∞} η^(t)_{k′ℓ′} / ζ^(t)_{k′ℓ′} = +∞,

which contradicts

    lim_{t→+∞} η^(t)_{k′ℓ′} / ζ^(t)_{k′ℓ′} = min_{k∈{1,…,m}, ℓ∈{m+1,…,n}} lim_{t→+∞} η^(t)_{kℓ} / ζ^(t)_{kℓ} = lim_{t→+∞} v(E_t) < +∞.

However, if

    lim_{t→+∞} [ d^(t)_{i,j}(k′, ℓ′) ]² > 0

holds for a certain parameter x_ij that is sampled only finitely often, then, noticing that

    lim_{t→+∞} [ (σ^(t)_{ij})² − (σ^(t)_{ij})²_{(i,j)} ] > 0

and

    lim_{t→+∞} [ (σ^(t)_{rq})² − (σ^(t)_{rq})²_{(r,q)} ] = 0,

we have

    lim_{t→+∞} [ V_t(E_t; (i, j)) − v(E_t) ] > 0  a.s.,

and

    lim_{t→+∞} [ V_t(E_t; (r, q)) − v(E_t) ] = 0  a.s.,

which contradicts the sampling rule in equation (7) that the parameter with the largest V_t(E_t; (i, j)) is sampled.

Therefore,

    lim_{t→+∞} [ d^(t)_{i,j}(k, ℓ) ]² = 0

holds for every pair (k, ℓ), k ∈ {1, …, m}, ℓ ∈ {m+1, …, n}; that is, for 1 ≤ k, ℓ ≤ n,

    lim_{t→+∞} ∂π_k(X)/∂x_ij |_{X = X^(t)} = lim_{t→+∞} ∂π_ℓ(X)/∂x_ij |_{X = X^(t)}.

Since

    Σ_{k=1}^{n} ∂π_k(X)/∂x_ij |_{X = X^(t)} = 0,

we have

    lim_{t→+∞} ∂π_k(X)/∂x_ij |_{X = X^(t)} = 0,  1 ≤ k ≤ n,

which contradicts

    ∂π(X)/∂x_ij |_{X = X^(t)} ≠ 0  a.s.,  1 ≤ i < j ≤ n, t ∈ Z⁺.

Therefore, the DAM must be consistent. ∎

Remark 3. The assumptions in Theorem 3 can be checked for the Markov chain in Google's PageRank [9], where the transition probabilities are given by

    P_ji ≜ x_ij/(n−1),  1 ≤ i < j ≤ n;
    P_ij ≜ 1/(n−1) − P_ji,  i ≠ j;
    P_ii ≜ 1 − Σ_{j≠i} P_ij,  1 ≤ i ≤ n.

This Markov chain is a random walk, which is irreducible and aperiodic. At each step, the current state (page) j chooses another page with equal probability to interact with; if page i is chosen, the next state is i with probability x_ij, and otherwise the chain stays at j. The importance of each web page is described by the long-run proportion of time spent in each state, i.e., its stationary probability. For the transition matrix in PageRank,

    ∂P/∂x_ij =   1/(n−1),  for elements (i, i) and (j, i);
                −1/(n−1),  for elements (i, j) and (j, j);
                 0,        otherwise.

From (3),

    ∂π(X)/∂x_ij |_{X = X^(t)} = 0

is equivalent to π_i^(t) + π_j^(t) = 0. By the ergodicity of the Markov chain,

    π_i^(t) + π_j^(t) > 0  a.s.,  1 ≤ i < j ≤ n, t ∈ Z⁺.

Therefore,

    ∂π(X)/∂x_ij |_{X = X^(t)} ≠ 0  a.s.,  1 ≤ i < j ≤ n, t ∈ Z⁺.

V. NUMERICAL RESULTS

In the numerical experiments, we test the performance of different sampling procedures for ranking node importance in the Markov chain of PageRank. The proposed DAM is compared with equal allocation (EA) and an approximately optimal allocation (AOA) adapted from a sampling procedure for the classic R&S problem in [34]. Specifically, EA allocates the sampling budget equally to estimate each x_ij, 1 ≤ i < j ≤ n (roughly s/(n(n−1)/2) samples for each x_ij); AOA allocates samples according to the following rule:

    A_{t+1}(E_t) = arg max_{1 ≤ i < j ≤ n} V_t(E_t; (i, j)),

where

    V_t(E_t; (i, j)) ≜ min_{k∈{1,…,m}, ℓ∈{m+1,…,n}}  η^(t)_{kℓ} / Σ_{1 ≤ r < q ≤ n} (σ^(t)_{rq})²_{(i,j)}.

Notice that the AOA utilizes only the information in the posterior means of the stationary probabilities and the posterior variances of the interaction parameters; it does not consider the sensitivities of the stationary probabilities with respect to each interaction parameter. In all numerical examples, the statistical efficiency of the sampling procedures is measured by the PCS estimated from 10,000 independent experiments. The PCS is reported as a function of the sampling budget in each experiment.


Example 1: selecting top-3 nodes in a 10-node network

In this example, we aim to identify the top-3 nodes in a network of 10 nodes. Suppose the true value of each interaction parameter is

    x_ij = 0.5 + 0.03 × (j − i),  1 ≤ i < j ≤ 10.

As assumed in Section II, the samples of the interaction parameter x_ij are generated i.i.d. from a Bernoulli distribution with parameter x_ij. By the definition of x_ij, node i is visited more often in the interactions between nodes i and j when x_ij > 0.5. It is then straightforward to see that nodes 1, 2, and 3 are the top-3 nodes.

In Figure 3, we can see that AOA has a slight edge over EA, which can be attributed to the fact that EA utilizes no sample information while AOA utilizes the information in the posterior means and variances, and that DAM performs significantly better than the other two sampling procedures. To attain PCS = 80%, DAM needs fewer than 1500 samples, whereas EA and AOA require more than 2000 samples; that is, DAM reduces the sampling budget by more than 25%. The performance enhancement of DAM can be attributed to its use of not only the information in the posterior means and variances but also the sensitivity information (∂π_k/∂x_ij). The numerical result shows that in this example, the sensitivity information plays a dominant role in enhancing the sampling efficiency.

Fig. 3. PCS of the three sampling procedures in Example 1.

Fig. 4. PCS of the three sampling procedures in Example 2.

Example 2: selecting top-5 nodes in a 20-node network

In this example, we test the performance of the proposed DAM in a larger network with 20 nodes. The true value of each interaction parameter x_ij, 1 ≤ i < j ≤ 20, is drawn from a uniform prior distribution U[0, 1]. Our objective is to identify the optimal subset of nodes with the top-5 largest stationary probabilities. Figure 4 illustrates the performance of the three sampling procedures. As in Example 1, DAM remains the most efficient sampling procedure among the three, and AOA is slightly better than EA. Notice, however, that the advantage of DAM becomes more significant as the network size grows. To attain PCS = 60%, DAM consumes fewer than 9000 samples, while both EA and AOA require more than 15,000 samples; that is, DAM reduces the sampling budget by more than 40%.

Example 3: selecting top-15 nodes in a 105-node website network

In this example, we test the robustness of the performance of DAM on a real data set from Sogou Labs, a major web search engine company in China (http://www.sogou.com/labs/resource/t-link.php). The data set includes a mapping table from URL to document ID and a list of hyperlink relationships among the documents. Our objective is to select the top-15 websites from a 105-node website network. The true value of each interaction parameter x_ij is estimated from the data set. Figure 5 illustrates the interactions among the websites. For instance, if the visits from website j to i occur 12 times while the visits from website i to j occur only 7 times, then the true value of x_ij is set to 12/(12 + 7).

Fig. 5. Interactions in Website Network from Sogou Labs.

In Figure 6, we can see that the PCS of the DAM grows at a much faster rate than those of the EA and AOA. To attain PCS = 60%, DAM consumes fewer than 3.7 × 10^5 samples, while both EA and AOA require more than 5.5 × 10^5 samples. In addition, the gap between the PCS of the DAM and those of the EA and AOA widens as the sampling budget increases.

VI. CONCLUSIONS

This paper deals with a sample allocation problem for selecting important nodes in a random network. Node importance is ranked by the stationary probabilities of a Markov chain. We use a first-order Taylor expansion and normal approximation to estimate the posterior distribution of the stationary probabilities. An efficient sampling procedure named DAM is derived by maximizing a VFA


Fig. 6. PCS of the three sampling procedures in Example 3.

one-step ahead. The sensitivity of the stationary probabilities with respect to each interaction parameter is taken into account in the design of DAM. Numerical experiments demonstrate that DAM is much more efficient than the other tested sampling procedures, and that its performance is robust across different network scales and in the real-data setting.

Unlike the existing literature, which considers deterministic networks, we focus on random networks with unknown interaction parameters. A random network is a more realistic setting for the node-importance ranking problem. The proposed DAM improves the sample allocation pattern for Markov ranking in a random network, reflecting a trade-off among posterior means, variances, and sensitivities. As suggested by the numerical results, DAM can significantly reduce the sampling budget in practical applications such as Google's PageRank.

In general, a Markov chain can have several ergodic classes and transient states, and it may not satisfy the aperiodicity condition. Decomposition of the Markov chain may be needed in order to rank the nodes in each ergodic class. Future research includes developing an efficient sampling scheme for both decomposition and ranking. Moreover, the asymptotic analysis of the sampling ratios of the sequential sampling procedure for ranking node importance in a Markov chain also deserves future work (see [37]).

ACKNOWLEDGMENT

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants 71571048, 71720107003, 71690232, and 61603321, and by the National Science Foundation under Awards ECCS-1462409 and CMMI-1462787.

REFERENCES

[1] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Computer Networks, vol. 30, no. 1-7, pp. 107–117, 1998.

[2] L. Page, "The PageRank citation ranking: Bringing order to the web," Stanford Digital Libraries Working Paper, vol. 9, no. 1, pp. 1–14, 1998.

[3] A. Y. Govan, "Ranking theory with application to popular sports," Hyperfine Interactions, vol. 175, no. 1-3, pp. 9–14, 2008.

[4] I. Luke, "Ranking NCAA sports teams with linear algebra," Master's thesis, College of Charleston, 2007.

[5] J. Weng, E. P. Lim, J. Jiang, and Q. He, "TwitterRank: finding topic-sensitive influential twitterers," in Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, 2010, pp. 261–270.

[6] H. S. Bhat and B. Sims, "InvestorRank and an inverse problem for PageRank," Electronic Theses & Dissertations, 2012.

[7] D. Walker, H. Xie, K.-K. Yan, and S. Maslov, "Ranking scientific publications using a model of network traffic," Journal of Statistical Mechanics: Theory and Experiment, vol. 2007, no. 06, p. P06010, 2007.

[8] P. Jomsri, S. Sanguansintukul, and W. Choochaiwattana, "CiteRank: combination similarity and static ranking with research paper searching," International Journal of Internet Technology and Secured Transactions, vol. 3, no. 2, pp. 161–177, 2011.

[9] A. N. Langville and C. D. Meyer, Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2011.

[10] R. E. Bechhofer, T. J. Santner, and D. M. Goldsman, Design and Analysis of Experiments for Statistical Selection, Screening, and Multiple Comparisons. John Wiley & Sons, New York, 1995.

[11] C.-H. Chen and L. H. Lee, Stochastic Simulation Optimization: An Optimal Computing Budget Allocation. World Scientific, 2011, vol. 1.

[12] Y. Rinott, "On two-stage selection procedures and related probability-inequalities," Communications in Statistics - Theory and Methods, vol. 7, no. 8, pp. 799–811, 1978.

[13] L. W. Koenig and A. M. Law, "A procedure for selecting a subset of size m containing the l best of k independent normal populations, with applications to simulation," Communications in Statistics - Simulation and Computation, vol. 14, no. 3, pp. 719–734, 1985.

[14] S.-H. Kim and B. L. Nelson, "A fully sequential procedure for indifference-zone selection in simulation," ACM Transactions on Modeling and Computer Simulation, vol. 11, no. 3, pp. 251–273, 2001.

[15] C.-H. Chen, J. Lin, E. Yucesan, and S. E. Chick, "Simulation budget allocation for further enhancing the efficiency of ordinal optimization," Discrete Event Dynamic Systems, vol. 10, no. 3, pp. 251–270, 2000.

[16] S. E. Chick and K. Inoue, "New two-stage and sequential procedures for selecting the best simulated system," Operations Research, vol. 49, no. 5, pp. 732–743, 2001.

[17] Y. Peng, C.-H. Chen, M. C. Fu, and J.-Q. Hu, "Efficient simulation resource sharing and allocation for selecting the best," IEEE Transactions on Automatic Control, vol. 58, no. 4, pp. 1017–1023, 2013.

[18] C.-H. Chen, D. He, M. Fu, and L. H. Lee, "Efficient simulation budget allocation for selecting an optimal subset," INFORMS Journal on Computing, vol. 20, no. 4, pp. 579–595, 2008.

[19] S. Zhang, L. H. Lee, E. P. Chew, C. H. Chen, and H. Y. Jen, "An improved simulation budget allocation procedure to efficiently select the optimal subset of many alternatives," in IEEE International Conference on Automation Science and Engineering, 2012, pp. 230–236.

[20] S. Gao and W. Chen, "A note on the subset selection for simulation optimization," in Winter Simulation Conference, 2015, pp. 3768–3776.

[21] H. Xiao and L. H. Lee, "Efficient simulation budget allocation for ranking the top m designs," Discrete Dynamics in Nature and Society, vol. 2014, pp. 1–9, 2014.

[22] S. Gao and W. Chen, "Efficient subset selection for the expected opportunity cost," Automatica, vol. 59, no. C, pp. 19–26, 2015.

[23] A. N. Langville and C. D. Meyer, Who's #1?: The Science of Rating and Ranking. Princeton University Press, 2012.

[24] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin, Bayesian Data Analysis. CRC Press, 2014.

[25] W. J. Stewart, Introduction to the Numerical Solution of Markov Chains. Princeton University Press, 1994.

[26] R. Barrett, M. W. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. Van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, 1994, vol. 43.

[27] C. Moler, "The world's largest matrix computation," MATLAB News and Notes, pp. 12–13, 2002.

[28] W. T. Song and Y. C. Chen, "Eighty univariate distributions and their relationships displayed in a matrix format," IEEE Transactions on Automatic Control, vol. 56, no. 8, pp. 1979–1984, 2011.

[29] C. Cox, Delta Method. John Wiley & Sons, Ltd, 2006.

[30] S. Kullback, Information Theory and Statistics. John Wiley & Sons, 1959.

[31] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley; Tsinghua University Press, 1991.

[32] M. Abramowitz, I. Stegun, and D. A. McQuarrie, Handbook of Mathematical Functions. United States Department of Commerce, National Bureau of Standards (NBS), 1964.

[33] Y. Peng, C.-H. Chen, M. C. Fu, and J.-Q. Hu, "Dynamic sampling allocation and design selection," INFORMS Journal on Computing, vol. 28, no. 2, pp. 195–208, 2016.

[34] Y. Peng, E. K. Chong, C.-H. Chen, and M. C. Fu, "Ranking and selection as stochastic control," IEEE Transactions on Automatic Control, vol. 63, no. 8, pp. 2359–2373, 2018.


[35] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley & Sons, 2007, vol. 703.

[36] D. P. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, 1995, vol. 1, no. 2.

[37] Y. Peng and M. C. Fu, "Myopic allocation policy with asymptotically optimal sampling rate," IEEE Transactions on Automatic Control, vol. 62, no. 4, pp. 2041–2047, 2017.

Haidong Li is a Ph.D. candidate in the Department of Industrial Engineering and Management, Peking University, Beijing, China. He received his B.S. degree from the Department of Engineering Mechanics at Peking University. His research interests include simulation optimization and network analysis.

Xiaoyun Xu received his B.S. degree in Industrial Engineering from Tsinghua University, China, in 2003, and his Ph.D. degree in Industrial Engineering from Arizona State University in 2008. He is an Associate Professor at the Department of Industrial Engineering and Management, Peking University, Beijing, China. His main research interests lie in scheduling, simulation optimization, and their applications in manufacturing and service industries.

Yijie Peng received the B.E. degree in mathematics from Wuhan University, Wuhan, China, in 2007, and the Ph.D. degree in management science from Fudan University, Shanghai, China, in 2014. He was a research fellow with Fudan University and George Mason University.

He is currently an Assistant Professor at the Department of Industrial Engineering and Management, Peking University, Beijing, China. His research interests include ranking and selection and sensitivity analysis in the simulation optimization field, with applications in data analytics, health care, and machine learning.

Chun-Hung Chen received the Ph.D. degree in engineering sciences from Harvard University, Cambridge, MA, USA, in 1994.

He is currently a Professor with the Department of Systems Engineering and Operations Research, George Mason University, Fairfax, VA, USA. He is the author of two books, including a best seller: Stochastic Simulation Optimization: An Optimal Computing Budget Allocation (World Scientific, 2010).

Dr. Chen received the National Thousand Talents Award from the central government of China in 2011, the Best Automation Paper Award from the 2003 IEEE International Conference on Robotics and Automation, and the 1994 Eliahu I. Jury Award from Harvard University. He was a Department Editor for the IIE Transactions, a Department Editor for the Asia-Pacific Journal of Operational Research, an Associate Editor for the IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, an Associate Editor for the IEEE TRANSACTIONS ON AUTOMATIC CONTROL, an Area Editor for the Journal of Simulation Modeling Practice and Theory, an Advisory Editor for the International Journal of Simulation and Process Modeling, and an Advisory Editor for the Journal of Traffic and Transportation Engineering.