Dynamic Cooperative Secondary Access in Hierarchical Spectrum Sharing Networks

1536-1276 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TWC.2014.2333744, IEEE Transactions on Wireless Communications

ACCEPTED FOR PUBLICATION IN IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS


Liping Wang, and Viktoria Fodor, Member, IEEE

Abstract—We address the challenge of energy efficiency in hierarchical spectrum sharing networks with dynamic traffic. We consider a primary and a cognitive secondary transmitter-receiver pair, where the secondary transmitter can utilize cooperative transmission to relay primary traffic while superimposing its own information. The secondary user faces a dilemma in this scenario. By choosing cooperation it can transmit a packet immediately, but it has to bear the additional cost of relaying. Otherwise, it can wait for the primary user to become idle, which increases the queuing delay that secondary packets experience. To solve this dilemma, and to trade off delay and energy consumption, we propose dynamic cooperative secondary access control that takes the state of the spectrum sharing network into account. We formulate the problem as a Markov Decision Process (MDP) and prove the existence of a stationary policy that is average cost optimal. We evaluate reinforcement learning to find the optimal transmission strategy when the traffic and link statistics are not known. We demonstrate that dynamic cooperation is necessary for the secondary system to be able to adapt to changing network conditions, and show that optimal sequential decision can significantly improve the tradeoff between energy consumption and delay.

Index Terms—Hierarchical spectrum sharing, cooperative transmission, queuing systems, Markov decision process, reinforcement learning.

I. Introduction

Hierarchical spectrum sharing among users of different networks is a promising solution to improve spectrum efficiency, and thus to alleviate the spectrum shortage problem caused by the rapidly growing demand for wireless applications and services. Under hierarchical spectrum sharing, the higher-priority primary users (PUs) have performance guarantees, whereas the lower-priority secondary users (SUs) need to be cognitive and adjust their access strategies so that the primary performance does not degrade. One traditional paradigm to facilitate hierarchical spectrum sharing is opportunistic spectrum sharing [1], where SUs identify the time-frequency resources unused by the PUs [2][3][4], and exploit them for their own transmissions [5][6][7].

Thanks to the development of advanced signal processing and interference management techniques, cooperative spectrum sharing is considered as an alternative way to share spectrum. Instead of transmitting in idle time or on idle frequency, the SUs relay primary packets, and transmit their own packets with superimposed signal [8][9][10], or with a time- [11], frequency- [12][13], or space- [14] division based cooperative relaying scheme. With appropriate cooperation between the two networks, the throughput or the power efficiency of the PUs can be guaranteed or improved, whereas the SUs gain more transmission opportunities.

Part of this work was presented at the IEEE International Conference on Communications (ICC), Budapest, Hungary, June 2013. This work is supported by the Swedish Research Council, under the SRA TNG grant.

The authors are with the School of Electrical Engineering and the ACCESS Linnaeus Center, KTH Royal Institute of Technology, Sweden (e-mail: [email protected]; [email protected]).

The literature on such cooperative spectrum sharing networks aims in general at improving the spectrum efficiency of the cognitive system, without considering energy efficiency. Optimal relay selection and resource allocation solutions are proposed in [11][15][16], while scaling laws are derived in [17][18][19], assuming that users are always willing to cooperate and always have packets to transmit. Non-backlogged traffic is considered for various cooperative spectrum sharing schemes in [20][21][22][23][24]; however, these works still disregard the increased power consumption due to cooperation and optimize for maximum secondary throughput.

At the same time, most networking applications, including web access, VoIP, monitoring and networked control, require the transmission of irregularly generated, small amounts of information, possibly with some rate and delay constraint. Secondary users with such applications do not require throughput maximization. Instead, their objective should be to minimize energy consumption while maintaining some quality of service.

In this paper we address the challenge of energy efficient spectrum sharing by considering a primary and a secondary node pair, both with dynamic traffic and an unreliable transmission channel. The secondary user faces a dilemma in this case. As the primary performance needs to be guaranteed, the SU has an additional cost of cooperation, in terms of increased transmission power. Therefore, instead of cooperating, the SU may wait for idle time and transmit opportunistically, trading off delay and energy consumption. Note that the PU does not have this dilemma, since its transmission performance is guaranteed in the hierarchical spectrum sharing scenario.

We define the cost of secondary access as the combination of the cost of cooperation and the cost of additional packet delay, and aim at minimizing the long-run average cost of the SU. In general, the SU may decide about transmission or waiting at each transmission opportunity. The decision policy may be static, in which case the SU either transmits only opportunistically or always cooperates. In contrast, under a dynamic access policy the SU decides for each packet transmission whether to cooperate or wait. This dynamic, sequential decision may depend on the time, on the history of the system state, or on the present state of the system. Our objective is to find the optimal sequential decision policy and to evaluate its achievable gain.




The main contributions of the paper are summarized as follows:

i) We define four secondary access strategies in the hierarchical spectrum sharing system, with opportunistic, cooperative, sequential decision and random cooperation based access. We derive the stable-throughput region following the notion of strong stability.

ii) To find the optimal sequential decision policy, we formulate the dilemma of the secondary user as an MDP. We show that the long-run average cost is upper bounded within the stable-throughput region. We prove the existence of an optimal sequential decision policy that is stationary, that is, depends only on the present state of the system. Consequently, to find the optimal policy for the SU, it is enough to consider the stationary policies only. To find the optimal stationary policy, we give an approximation method based on linear programming.

iii) We consider the case of unknown primary and secondary traffic and link statistics, and evaluate the efficiency of R-learning to find the optimal sequential decision policy.

iv) We show that optimal sequential decision can significantly reduce the average cost compared to the cost of pure opportunistic and pure cooperative spectrum sharing, and improves the tradeoff between energy consumption and delay compared to random cooperation with optimized probability of cooperation. We show that the performance of sequential decision under learning is close to optimal, even with limited knowledge of the primary queue size.

Networks with dynamic traffic have often been characterized by their stable-throughput region, see for example [20][21][22] for hierarchical, and [25][26] for general spectrum sharing queues. We extend the above results by considering dynamic cooperation, and also by considering the notion of strong stability, which is necessary to evaluate the secondary cost. The MDP framework and its variations have been used extensively for optimizing control strategies in general stochastic and queuing systems [27][28][29][30]. In the area of spectrum sharing networks with opportunistic secondary access, MDPs have been used to design sensing and access strategies for the SUs when the primary traffic can be modeled with some known stochastic processes [2][3][5][6][23]. We contribute to this line of work by optimizing the SU access strategy in the case of cooperation cost and infinite transmission buffer. Reinforcement learning techniques, such as Q-learning [31] and R-learning [32], provide online optimization tools that can solve MDPs iteratively without a priori knowledge of the state transition probabilities, and therefore have been applied for secondary access or interference control design [4][6][7]. As the convergence of reinforcement learning is not proved for most of the practical cases, we evaluate R-learning, developed to find average cost optimal policies, via numerical examples.

The rest of the paper is organized as follows. We introduce the system model and the four different spectrum sharing schemes in Section II. The stable-throughput regions of the considered schemes are evaluated in Section III. In Section IV, we define and discuss the MDP to achieve the optimal sequential decision of the secondary system, and in Section V give the R-learning formulation. The performance of optimal sequential decision based spectrum sharing is evaluated in Section VI. Finally, we conclude the paper in Section VII.

II. Spectrum sharing schemes and System model

We consider a spectrum sharing network, where a primary user and a secondary user, PT and ST, intend to transmit packets to their respective destinations, PR and SR, via a shared wireless channel. Time is slotted and the transmission of each packet takes one time slot. The PT and the ST can transmit in separate time slots directly to the destinations, or they can use a cooperative transmission scheme, where the ST relays the primary packet toward the PR, at the same time superimposing its own packet to the SR.

We compare two static and two dynamic secondary access schemes:

Opportunistic spectrum sharing: the PT transmits directly to the PR. The ST senses the channel at the beginning of each time slot. If the channel is idle, the ST transmits a packet directly to the SR, and it keeps silent otherwise.

Cooperative spectrum sharing: the ST always relays the primary packet, and superimposes its own packet, if the PT transmission queue is non-empty. If the PT is idle, the ST transmits directly to the SR.

Sequential decision: in each time slot the ST decides whether to cooperate or not, where the decision can depend on the current state of the system. The ST needs to be aware of the number of packets waiting, that is, the state of the two queues $Q_p$ and $Q_s$, and needs to inform the PT about its decision. If the PT does not transmit, the ST uses direct transmission.

Random cooperation: in each time slot the ST cooperates or transmits opportunistically, with fixed probabilities, and informs the PT about its decision. Again, the ST transmits directly to the SR if the PT is silent. This scheme is a special case of the sequential decision scheme where the decision does not depend on the system state. The random cooperation scheme is clearly sub-optimal, but leads to simpler access control.

The cost of the ST reflects the increased delay of opportunistic access and the increased power consumption of cooperative transmission. Specifically, in each time slot the accumulated secondary cost increases by $C_h Q_s$ and, if the secondary node performs cooperative transmission, by $C_c$. Here $C_h$ denotes the cost of the ST for holding one packet in its queue for one time slot, so that $C_h Q_s$ reflects the secondary queuing delay, and $C_c$ represents the cost of cooperation. Clearly, by investing in cooperative transmission, the ST can decrease its queuing delay at a given secondary load. The objective of the ST is to minimize the long-run average cost, defined as:

$$C = \lim_{N\to\infty} \frac{1}{N} \sum_{n=1}^{N} c(n), \tag{1}$$

where $c(n)$, the cost of the ST in time slot $n$, is $c(n) = C_h Q_s(n)$ if the ST transmits opportunistically or is silent, and $c(n) = C_c + C_h Q_s(n)$ if it uses cooperative transmission.

We model the primary and the secondary system as follows.

Service process: Since packets may not be received successfully due to the impairments of the radio channel, we model the packet transmission on each end-to-end communication link via independent Bernoulli processes. Let $q_{pd}$ and $q_{sd}$ denote the probabilities of successful packet transmission in a time slot under direct transmission at the primary and secondary systems, respectively, whereas $q_{pc}$ and $q_{sc}$ represent the primary and secondary transmission success probabilities over the cooperative transmission channel. That is, $q_{pc}$ is the probability that the primary packet is successfully received at the ST and then reconstructed from the signals received from the PT and from the ST, or not received at the ST, but still successfully reconstructed at the PR. The secondary cooperative transmission needs to improve or at least guarantee the probability of successful packet transmission at the PT, therefore we consider $q_{pc} \ge q_{pd}$. At the same time cooperative transmission may decrease the transmission success probability at the ST, that is, $q_{sc} \le q_{sd}$. The probability of successful transmission is an abstraction of different cooperative transmission schemes and channel models. The expressions for calculating these probabilities can be found for instance in [8] for a spectrum sharing network using a two-phase cooperative decode-and-forward relaying protocol under Rayleigh fading channels.

[Fig. 1. Queuing network modeling the spectrum sharing system with (a) opportunistic and (b) cooperative spectrum sharing. Panel labels: (a) PT served with $q_{pd}$; ST served with $q_{sd}$ if $Q_p = 0$ and with $0$ if $Q_p > 0$. (b) PT served with $q_{pc}$; ST served with $q_{sd}$ if $Q_p = 0$ and with $q_{sc}$ if $Q_p > 0$.]

Arrival process: We model the packet arrival at the primary and secondary users by independent Bernoulli processes with per-slot packet arrival probabilities $\lambda_p$ and $\lambda_s$, respectively. Our work can be extended to consider Markov modulated arrival and loss processes.

Buffer capacity: Both PT and ST have a buffer of infinite capacity for storing the incoming packets. While real systems have finite buffers, in most cases they operate in a regime where the packet loss probability due to buffer overflow is very low. Therefore, instead of addressing the issue of buffer dimensioning, we assume infinite buffers.

Retransmission control: Packets stay in the buffer until they are successfully received. Packets that are not received successfully are retransmitted, and the number of retransmissions is not limited. ACK/NACK messages from the PR and the SR do not get lost.

Control messages and channel sensing: The PT and the ST have correct information about the queue sizes and the channel status.

The resulting queuing networks for opportunistic and cooperative spectrum sharing are shown in Figs. 1(a) and (b), respectively. We can see that in both cases the primary and the secondary queues are coupled; more precisely, the service rate of the ST depends on the status of the queue at the PT. The key symbols used in the paper are listed in Table I.

| Symbol | Meaning |
|---|---|
| $p, s$ | index for primary and secondary user |
| $d, c$ | index for direct and cooperative transmission |
| $D, C, R$ | opportunistic (direct), cooperative and random cooperative spectrum sharing |
| $n$ | time slot |
| $\lambda$ | arrival probability |
| $q$ | successful transmission probability |
| $\mu$ | average transmission rate |
| $p_c$ | cooperation probability |
| $Q$ | queue length |
| $S$ | stable throughput region |
| $N$ | buffer capacity |
| $C_h$ | cost of holding packet |
| $C_c$ | cost of cooperation |
| $c(s, a)$ | cost of action $a$ in state $s$ |
| $\Pi, \Pi^*$ | policy and optimal policy |

TABLE I. Key symbols used in the paper.
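The system model above can be exercised with a small simulation. The following is a minimal sketch, not the authors' evaluation code: the function name `simulate`, the `policy(Qp, Qs)` callback, and the default values of `C_h`, `C_c`, `n_slots` and `seed` are assumptions made for illustration, and the primary and secondary successes in a cooperative slot are assumed to be independent.

```python
import random

def simulate(policy, lam_p, lam_s, q_pd, q_pc, q_sd, q_sc,
             C_h=1.0, C_c=2.0, n_slots=100_000, seed=0):
    """Return the time-averaged ST cost over n_slots slots.

    policy(Qp, Qs) -> 0 (wait / opportunistic) or 1 (cooperate);
    it is only consulted when the primary queue is non-empty.
    """
    rng = random.Random(seed)
    Qp = Qs = 0
    total_cost = 0.0
    for _ in range(n_slots):
        a = policy(Qp, Qs) if Qp > 0 else 0
        total_cost += C_h * Qs + C_c * a          # per-slot cost c(n) as in (1)
        if Qp == 0:
            # PT idle: the ST transmits directly if it has a packet.
            served_p = 0
            served_s = int(Qs > 0 and rng.random() < q_sd)
        elif a == 0:
            # PT transmits directly, the ST stays silent.
            served_p = int(rng.random() < q_pd)
            served_s = 0
        else:
            # Cooperative slot: the ST relays and superimposes its own packet.
            served_p = int(rng.random() < q_pc)
            served_s = int(Qs > 0 and rng.random() < q_sc)
        # Bernoulli arrivals at the end of the slot.
        Qp = Qp - served_p + int(rng.random() < lam_p)
        Qs = Qs - served_s + int(rng.random() < lam_s)
    return total_cost / n_slots
```

For example, `simulate(lambda Qp, Qs: 0, 0.3, 0.2, 0.6, 0.8, 0.5, 0.4)` estimates the cost of pure opportunistic access, and replacing the lambda with `lambda Qp, Qs: 1` gives pure cooperation. A finite run only approximates the limit in (1).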

III. The Stable-Throughput Region

First we evaluate the stable-throughput regions of the considered spectrum sharing methods. In contrast to related work where the notion of mean rate stability is used [20][21][22][25][26], we follow the notion of strong stability given in [33][34], which is needed to characterize the long-run average cost.

To define strong stability for a general queue, let us denote the queue length at the beginning of time slot $n$ as $Q(n)$, the number of arrivals in time slot $n$ as $A(n)$, and the service rate as $B(n)$. We assume that arrivals happen at the end of the time slot, and can be served only in the following slot. Then the queue evolves as:

Q(n + 1) = max[Q(n) − B(n), 0] + A(n). (2)

Definition 1. The queue is strongly stable if [33]:

$$\limsup_{N\to\infty} \frac{1}{N} \sum_{n=1}^{N} E[Q(n)] < \infty. \tag{3}$$

That is, the queue is strongly stable if it has a bounded average queue length. A network of queues is strongly stable if all queues are strongly stable. The stable-throughput region of a queuing network is given by the arrival rate vectors for which the network of queues is strongly stable.
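To make the definition concrete, the sketch below iterates the queue recursion (2) for a single queue with Bernoulli arrivals and Bernoulli service opportunities and reports the time-averaged queue length. The function name and the default run length and seed are illustrative assumptions. A finite simulation cannot prove (3); it only shows the qualitative difference between an arrival rate below and above the service rate.

```python
import random

def average_queue_length(lam, mu, n_slots=200_000, seed=1):
    """Time-averaged queue length of Q(n+1) = max(Q(n) - B(n), 0) + A(n)."""
    rng = random.Random(seed)
    Q, total = 0, 0
    for _ in range(n_slots):
        total += Q
        A = int(rng.random() < lam)   # Bernoulli arrival at the end of the slot
        B = int(rng.random() < mu)    # Bernoulli service opportunity
        Q = max(Q - B, 0) + A
    return total / n_slots
```

With `lam < mu` the average stays small as the run length grows, whereas with `lam > mu` it grows roughly linearly with `n_slots`.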

Proposition 1. The following conditions are sufficient for strong stability in a general slotted system.

1) The arrival and service processes are rate convergent, that is:

$$\lim_{N\to\infty} \frac{1}{N} \sum_{n=1}^{N} E[A(n)] = \lambda, \qquad \lim_{N\to\infty} \frac{1}{N} \sum_{n=1}^{N} E[B(n)] = \mu, \tag{4}$$

and for each positive $\delta$ there exists an $N$ such that, regardless of past history:

$$E\left\{ \frac{1}{N} \sum_{n=1}^{N} A(n_0 + n) \right\} \le \lambda + \delta, \qquad E\left\{ \frac{1}{N} \sum_{n=1}^{N} B(n_0 + n) \right\} \ge \mu - \delta; \tag{5}$$

2) in each time slot the arrival process is bounded in second moment, and the service process is bounded, regardless of past history; and finally

3) the arrival rate is less than the average service rate, that is, $\lambda < \mu$ (note that this is the known condition for the weaker mean rate stability of queues).

Proof: The proof of the proposition is given in [34].

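Theorem 1. 1) The stable-throughput regions $S_D$ and $S_C$, for opportunistic and cooperative spectrum sharing respectively, are:

$$S_D = \left\{ (\lambda_p, \lambda_s) : \lambda_p < q_{pd} \text{ and } \lambda_s < q_{sd} - \frac{q_{sd}}{q_{pd}} \lambda_p \right\}, \tag{6}$$

and

$$S_C = \left\{ (\lambda_p, \lambda_s) : \lambda_p < q_{pc} \text{ and } \lambda_s < q_{sd} + \frac{q_{sc} - q_{sd}}{q_{pc}} \lambda_p \right\}, \tag{7}$$

while $S_R$, the stable-throughput region of random cooperation with cooperation probability $p_c$, is:

$$S_R = \left\{ (\lambda_p, \lambda_s) : \begin{array}{l} \lambda_p < (1 - p_c) q_{pd} + p_c q_{pc} \\ \lambda_s < q_{sd} + \dfrac{p_c q_{sc} - q_{sd}}{(1 - p_c) q_{pd} + p_c q_{pc}} \lambda_p \end{array} \right\}. \tag{8}$$

2) If the ST performs sequential decisions $\Pi$, then the stable-throughput region $S_\Pi$ is respectively lower- and upper-bounded by $S_D$ and $S_C$:

$$S_D \subseteq S_\Pi \subseteq S_C. \tag{9}$$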

Proof: To derive the stable-throughput region of the spectrum sharing system, we prove that conditions 1 and 2 in Proposition 1 hold for the secondary and primary arrival and service processes, provided that condition 3 is also fulfilled. Then we find the arrival rates $\lambda_p$ and $\lambda_s$ for which condition 3 holds.

Let $A_p(n)$ and $A_s(n)$ denote the number of arrivals in time slot $n$ at the PT and ST, respectively, whereas $B_p(n)$ and $B_s(n)$ are the respective service rates. Since the considered primary and secondary arrival processes are i.i.d. Bernoulli processes with $E[A_p(n)] = \lambda_p$ and $E[A_s(n)] = \lambda_s$, they are rate convergent with bounded second moment, that is, conditions 1 and 2 on the arrival processes hold.

The service processes depend on the spectrum sharing scheme. For all schemes, however, $B_p(n)$ and $B_s(n)$ can take values 0 or 1, that is, condition 2 for the service processes holds. Let us consider condition 1. Under opportunistic spectrum sharing, the primary service process is an i.i.d. Bernoulli process with $E[B_p(n)] = q_{pd}$, and in this case condition 1 certainly holds. For the other cases the service processes are modulated by the state of the queues. If the queues are stable according to the weak definition of $\lambda < \mu$, they can be described with an ergodic discrete time birth-death process that converges monotonically to steady state [35]. (For example, under opportunistic spectrum sharing, the secondary service process is controlled by the primary queue length $Q_p$, with $E[B_s(n) \mid Q_p(n) = 0] = q_{sd}$ and $E[B_s(n) \mid Q_p(n) > 0] = 0$.) Consequently, under condition 3, the primary and secondary service processes are rate convergent, and thus condition 1 is fulfilled as well, for all considered spectrum sharing schemes.

Let us now find the $(\lambda_p, \lambda_s)$ pairs for which condition 3 holds. These arrival rate pairs give the stable-throughput region of the system.

Consider first opportunistic spectrum sharing. The PT transmits a packet with success probability $q_{pd}$ whenever its queue is non-empty, independently of the state of the ST. Consequently, $\mu_{pd}$, the average primary service rate under direct transmission, can be derived as:

$$P[B_p(n) = 1] = q_{pd}, \quad P[B_p(n) = 0] = 1 - q_{pd} \;\Rightarrow\; \mu_{pd} = E[B_p(n)] = q_{pd}, \tag{10}$$

where $P[x]$ denotes the probability of event $x$.

However, the ST can transmit a packet with success probability $q_{sd}$ if and only if the primary queue is empty. Considering that $P[Q_p = 0] = 1 - \lambda_p/\mu_{pd}$, the average secondary service rate under direct transmission, $\mu_{sd}$, becomes:

$$\left. \begin{aligned} P[B_s(n) = 1] &= q_{sd} P[Q_p(n) = 0], \\ P[B_s(n) = 0] &= 1 - q_{sd} P[Q_p(n) = 0], \end{aligned} \right\} \Rightarrow \mu_{sd} = E[B_s(n)] = q_{sd} P[Q_p = 0] = q_{sd} \left( 1 - \frac{\lambda_p}{\mu_{pd}} \right) = q_{sd} - \frac{q_{sd}}{q_{pd}} \lambda_p. \tag{11}$$

Condition 3, that is, $\lambda_p < \mu_{pd}$ and $\lambda_s < \mu_{sd}$, gives the following stable-throughput region for opportunistic spectrum sharing:

$$S_D = \left\{ (\lambda_p, \lambda_s) : \lambda_p < q_{pd} \text{ and } \lambda_s < q_{sd} - \frac{q_{sd}}{q_{pd}} \lambda_p \right\}. \tag{12}$$
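The step $P[Q_p = 0] = 1 - \lambda_p/\mu_{pd}$ used in (11) can be checked numerically. The sketch below is illustrative only: `idle_fraction` and `mu_sd` are hypothetical helper names, and the run length and seed are arbitrary. It simulates the primary queue on its own and compares the observed fraction of idle slots with the closed form.

```python
import random

def idle_fraction(lam_p, q_pd, n_slots=200_000, seed=2):
    """Fraction of slots that start with an empty primary queue."""
    rng = random.Random(seed)
    Qp, idle = 0, 0
    for _ in range(n_slots):
        idle += (Qp == 0)
        served = int(Qp > 0 and rng.random() < q_pd)
        Qp = Qp - served + int(rng.random() < lam_p)
    return idle / n_slots

def mu_sd(lam_p, q_pd, q_sd):
    """Closed form (11): q_sd * (1 - lam_p / q_pd)."""
    return q_sd * (1.0 - lam_p / q_pd)
```

For instance, with `lam_p = 0.3` and `q_pd = 0.6` the idle fraction should be close to 0.5, so that `mu_sd(0.3, 0.6, 0.5)` evaluates to 0.25.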

Similarly, we can derive the stable-throughput region under cooperative spectrum sharing. In this case the PT transmits a packet with success probability $q_{pc}$ whenever its queue is non-empty, using cooperative transmission, that is, $\mu_{pc}$, the average primary service rate under cooperation, is:

$$\mu_{pc} = q_{pc}. \tag{13}$$

The ST transmits a packet with success probability $q_{sd}$ if the primary queue is empty, and with success probability $q_{sc}$ otherwise. Consequently, $\mu_{sc}$, the average secondary service rate under cooperation, becomes:

$$\mu_{sc} = q_{sd} P[Q_p = 0] + q_{sc} P[Q_p \neq 0] = q_{sd} \left( 1 - \frac{\lambda_p}{\mu_{pc}} \right) + q_{sc} \frac{\lambda_p}{\mu_{pc}} = q_{sd} + \frac{q_{sc} - q_{sd}}{q_{pc}} \lambda_p. \tag{14}$$

Under $\lambda_p < \mu_{pc}$ and $\lambda_s < \mu_{sc}$ we get $S_C$, the stable-throughput region of cooperative spectrum sharing:

$$S_C = \left\{ (\lambda_p, \lambda_s) : \lambda_p < q_{pc} \text{ and } \lambda_s < q_{sd} + \frac{q_{sc} - q_{sd}}{q_{pc}} \lambda_p \right\}. \tag{15}$$
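As a worked example of the two secondary service rates, take the success probabilities used in the Fig. 2 example ($q_{pd} = 0.6$, $q_{pc} = 0.8$, $q_{sd} = 0.5$, $q_{sc} = 0.4$) and, as an assumption made only for this illustration, a primary arrival rate $\lambda_p = 0.3$:

```latex
\begin{align*}
\mu_{sd} &= q_{sd} - \frac{q_{sd}}{q_{pd}}\lambda_p
          = 0.5 - \frac{0.5}{0.6}\cdot 0.3 = 0.25,\\
\mu_{sc} &= q_{sd} + \frac{q_{sc}-q_{sd}}{q_{pc}}\lambda_p
          = 0.5 + \frac{0.4-0.5}{0.8}\cdot 0.3 = 0.4625.
\end{align*}
```

Under these numbers a secondary arrival rate such as $\lambda_s = 0.3$ lies outside the opportunistic region but inside the cooperative one.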

Clearly, $S_D \subseteq S_C$ for all $q_{pc} \ge q_{pd}$ and $q_{sc} \le q_{sd}$.

To consider the random cooperation scheme, let $p_c$ denote the probability that the secondary user chooses to cooperate, given that $Q_p \neq 0$. The average service rates under random cooperation, $\mu_{pr}$ and $\mu_{sr}$, become:

$$\mu_{pr} = (1 - p_c) q_{pd} + p_c q_{pc}, \tag{16}$$

$$\mu_{sr} = q_{sd} P[Q_p = 0] + p_c q_{sc} P[Q_p \neq 0], \tag{17}$$

which, similarly to the opportunistic and cooperative cases, give $S_R$, the stable-throughput region of random cooperation:

$$S_R = \left\{ (\lambda_p, \lambda_s) : \begin{array}{l} \lambda_p < (1 - p_c) q_{pd} + p_c q_{pc} \\ \lambda_s < q_{sd} + \dfrac{p_c q_{sc} - q_{sd}}{(1 - p_c) q_{pd} + p_c q_{pc}} \lambda_p \end{array} \right\}. \tag{18}$$

[Fig. 2. Stable-throughput region for the spectrum sharing schemes. Axes: $\lambda_p$ and $\lambda_s$. Regions labeled OPP, COOP and RAND, with example values $q_{pd} = 0.6$, $q_{pc} = 0.8$, $q_{sd} = 0.5$, $q_{sc} = 0.4$; scenarios 1 to 5 are marked as points.]
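The intermediate step between (17) and (18) is not spelled out above. Assuming, as in the derivations of (11) and (14), that the primary queue is busy with probability $\lambda_p/\mu_{pr}$, it can be written as:

```latex
\begin{align*}
\mu_{sr} &= q_{sd}\left(1 - \frac{\lambda_p}{\mu_{pr}}\right)
          + p_c\, q_{sc}\,\frac{\lambda_p}{\mu_{pr}}
        = q_{sd} + \frac{p_c\, q_{sc} - q_{sd}}{(1-p_c)\,q_{pd} + p_c\, q_{pc}}\,\lambda_p .
\end{align*}
```

Setting $p_c = 0$ recovers $\mu_{sd}$ in (11), and setting $p_c = 1$ recovers $\mu_{sc}$ in (14).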

Let us now evaluate the stable-throughput region of the sequential decision scheme, following the dominant system approach. We consider a system $X$ to be a dominant system of $Y$ if the queue sizes in $X$ are, at all times, at least as large as those in $Y$. The stable-throughput region of the dominant system $X$ inner bounds that of $Y$ [22][36].

By comparing the average service rate of the PT in (10) with that in (13), and the average service rate of the ST in (11) with that in (14), we get $\mu_{pd} \le \mu_{pc}$ and $\mu_{sd} \le \mu_{sc}$. So for any sequential decision scheme $\Pi$, the primary and secondary service rates are bounded as $\mu_p^\Pi \in [\mu_{pd}, \mu_{pc}]$ and $\mu_s^\Pi \in [\mu_{sd}, \mu_{sc}]$. Consequently, any sequential decision scheme stochastically dominates the opportunistic one, and is dominated by the cooperative one, that is, $S_D \subseteq S_\Pi \subseteq S_C$.

Fig. 2 gives an example of the stable-throughput region for the opportunistic, cooperative and random cooperation schemes. The shaded area shows the improvement achieved by cooperation, which is significant if $q_{pc}$ is larger than $q_{pd}$, and $q_{sc}$ is close to $q_{sd}$. The corner point of the random cooperation scheme moves as a function of $p_c$. Opportunistic and cooperative spectrum sharing are special cases of random cooperation with $p_c = 0$ and $p_c = 1$, respectively. As expected, $S_R$ is equivalent to $S_D$ when $p_c = 0$, and to $S_C$ when $p_c = 1$. According to Theorem 1, the boundary of any stable-throughput region $S_\Pi$ is located in the shaded area.
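The three regions of Theorem 1 can be tested with one membership function, since $S_D$ and $S_C$ are the $p_c = 0$ and $p_c = 1$ cases of $S_R$. The sketch below is illustrative; the function name `in_region` and its argument order are assumptions.

```python
def in_region(lam_p, lam_s, q_pd, q_pc, q_sd, q_sc, p_c):
    """Membership test for S_R in (8); p_c=0 gives S_D, p_c=1 gives S_C."""
    mu_pr = (1.0 - p_c) * q_pd + p_c * q_pc              # (16)
    if not lam_p < mu_pr:
        return False
    mu_sr = q_sd + (p_c * q_sc - q_sd) / mu_pr * lam_p   # (17) in closed form
    return lam_s < mu_sr
```

With the Fig. 2 values, the point $(\lambda_p, \lambda_s) = (0.3, 0.3)$ is rejected for `p_c=0` and accepted for `p_c=1`, which matches the shaded improvement area.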

IV. Optimal Sequential Decision Policy for the Secondary System

A. MDP formulation of sequential decision

We use a Markov Decision Process (MDP) to model the sequential decision of the secondary user, and to find the optimal decision policy when the system parameters, that is, the packet arrival probabilities $(\lambda_p, \lambda_s)$ and the successful transmission probabilities $(q_{pd}, q_{pc}, q_{sd}, q_{sc})$, as well as the system state, that is, the primary and secondary queue lengths $(Q_p, Q_s)$, are known to the ST.

In general, an MDP describes a stochastic control system whose state can be observed in discrete time. At each time slot, the decision maker chooses an action depending on the present state or the history of the process. An immediate cost (or reward) is incurred after taking the action, and the system moves to a state with some transition probability that is determined by the present state and the selected action.

The MDP we formulate is defined as MDP$\langle \mathcal{S}, \mathcal{A}, A, p, c \rangle$, where

• $\mathcal{S} = \{(Q_p, Q_s) : Q_p \in \mathbb{N}_0 \text{ and } Q_s \in \mathbb{N}_0\}$: the countable set of discrete states, each of which is defined as the queue length pair.

• $\mathcal{A} = \{0, 1\}$: the set of control actions taken by the secondary system, where 0 denotes the case that the ST chooses to access the channel opportunistically, and 1 refers to cooperative transmission.

• $A : \mathcal{S} \to \mathcal{P}(\mathcal{A})$: the action constraint function. $A(s)$ denotes the set of allowed actions in state $s$. We have:

$$A(s) = \begin{cases} \{0\} & \text{if } s \in \{(Q_p, Q_s) : Q_p = 0 \text{ and } Q_s \in \mathbb{N}_0\}, \\ \{0, 1\} & \text{if } s \in \{(Q_p, Q_s) : Q_p \neq 0 \text{ and } Q_s \in \mathbb{N}_0\}. \end{cases} \tag{19}$$

• $p : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$: the transition function, where $\Delta(\mathcal{S})$ denotes the set of all probability distributions on $\mathcal{S}$. The probability that the process moves to state $s'$ after taking action $a$ in state $s$ is given by $p_a(s, s') = P[s_{t+1} = s' \mid s_t = s, a_t = a]$, which depends on the arrival rates, and also on the state and action dependent service rates. The derivation of state transition probabilities is straightforward; examples are given in [37].

• $c : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$: the cost function $c(s, a)$ denoting the immediate cost that depends on the present state and the selected action. A general cost function for a queue-length controlled system is $c(s, a) = B(a) + H(s)$, where $B(a) \ge 0$ represents the cost of selecting action $a$, and $H(s) \ge 0$ is a cost that depends on the system state. According to (1) we consider:

$$c(s, a) = B(a) + H(s) = C_c a + C_h Q_s, \tag{20}$$

that is, we set $B(a) = C_c a$ to represent the additional power consumption of relaying the primary packet, and $H(s) = C_h Q_s$ to denote the cost of secondary queuing delay.
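The transition function $p$ is left to [37] in the text. As a hedged illustration of how it could be tabulated from the model of Section II, the sketch below enumerates the service and arrival outcomes of one slot. The function name `transitions` is hypothetical, and the primary and secondary successes in a cooperative slot are assumed to be independent, which the text does not state explicitly.

```python
from itertools import product

def transitions(state, action, lam_p, lam_s, q_pd, q_pc, q_sd, q_sc):
    """Return {next_state: probability} for one slot of the infinite-buffer MDP."""
    Qp, Qs = state
    if Qp == 0:
        succ_p, succ_s = 0.0, q_sd      # only action 0 is allowed, see (19)
    elif action == 0:
        succ_p, succ_s = q_pd, 0.0      # PT transmits alone, ST is silent
    else:
        succ_p, succ_s = q_pc, q_sc     # cooperative slot
    if Qs == 0:
        succ_s = 0.0                    # the ST has nothing to send

    probs = {}
    # bp, bs: departures; ap, as_: arrivals at the end of the slot.
    for bp, bs, ap, as_ in product((0, 1), repeat=4):
        pr = ((succ_p if bp else 1 - succ_p) * (succ_s if bs else 1 - succ_s)
              * (lam_p if ap else 1 - lam_p) * (lam_s if as_ else 1 - lam_s))
        if pr == 0.0:
            continue
        nxt = (Qp - bp + ap, Qs - bs + as_)
        probs[nxt] = probs.get(nxt, 0.0) + pr
    return probs
```

For example, from the empty state $(0, 0)$ with $\lambda_p = 0.3$ and $\lambda_s = 0.2$ the chain stays in $(0, 0)$ with probability $0.7 \cdot 0.8 = 0.56$.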

In each state, actions are chosen by following a policy $\Pi$, which defines a rule for decisions that may depend on the current state, on the past history of the process, and on the time. A policy is Markovian if the choice does not depend on the history, and is stationary if it does not depend on the time either. The policy is randomized if several actions can be selected in a state with some probabilities, and is deterministic otherwise.

Note that both the opportunistic spectrum sharing policy $\Pi_D$ and the cooperative spectrum sharing policy $\Pi_C$ are deterministic stationary policies. They can be expressed respectively as:

$$\Pi_D = \{\pi_n = 0,\ n \in \mathbb{N}^+\}; \tag{21}$$

$$\Pi_C = \{\pi_n = i,\ n \in \mathbb{N}^+\} \text{ with } i = \begin{cases} 0 & \text{if } s_n = (0, j),\ j \in \mathbb{N}_0, \\ 1 & \text{otherwise.} \end{cases} \tag{22}$$




The sequential decision reduces to opportunistic spectrum sharing if $a = 0$, and to cooperative spectrum sharing if $a = 1$, for all $Q_p \neq 0$ and $Q_s$. It reduces to random cooperation if $P(a = 0)$ is constant and independent of $Q_p$ and $Q_s$ for all $Q_p \neq 0$.
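The policies discussed above can be written as small decision functions of the state. This is a sketch for illustration only: the names `pi_D`, `pi_C`, `make_pi_R` and `make_threshold` are assumptions, and the threshold rule is a hypothetical example of a state-dependent stationary policy, not a policy proposed in the paper.

```python
import random

def pi_D(Qp, Qs):
    """Opportunistic policy (21): never cooperate."""
    return 0

def pi_C(Qp, Qs):
    """Cooperative policy (22): cooperate whenever the PT has a packet."""
    return 1 if Qp > 0 else 0

def make_pi_R(p_c, seed=0):
    """Random cooperation: cooperate with probability p_c when Qp > 0."""
    rng = random.Random(seed)
    def pi_R(Qp, Qs):
        return int(Qp > 0 and rng.random() < p_c)
    return pi_R

def make_threshold(k):
    """Hypothetical state-dependent rule: cooperate once Qs reaches k."""
    def pi_T(Qp, Qs):
        return int(Qp > 0 and Qs >= k)
    return pi_T
```

All four return 0 when the primary queue is empty, in line with the action constraint (19).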

The objective is to find the optimal sequential decision policy, that is, the policy $\Pi^*$ that minimizes the long-run average cost $C(\Pi)$, for given system parameters $(\lambda_p, \lambda_s)$ and $(q_{pd}, q_{pc}, q_{sd}, q_{sc})$:

$$C(\Pi) = \lim_{N\to\infty} \frac{1}{N} E_\Pi \left[ \sum_{n=1}^{N} c(s_n, \pi_n) \,\middle|\, s_1 = (0, 0) \right], \tag{23}$$

where $s_1$ denotes the initial state of the network, $\pi_n$ denotes the action taken in the $n$th time slot according to policy $\Pi$, and $E_\Pi$ is the expectation taken under policy $\Pi$.

We find $\Pi^*$ in three steps. First we prove that $C(\Pi)$ is upper bounded within the stable-throughput region. Then we prove the existence of an optimal policy that is stationary. Finally, building on these two results, we show the correctness of a finite state MDP based approximation.

B. Long-run average cost upper bound

Theorem 2. 1) The long-run average cost defined in (23) achieved by any policy $\Pi$ is upper bounded when the arrival rates lie within the stable-throughput region $S_\Pi$. Moreover, upper bounded average cost implies the strong stability of $Q_s$.

2) If the ST makes sequential decisions according to the optimal policy $\Pi^*$, then the long-run average cost is upper bounded when the arrival rates lie within the stable-throughput region $S_C$.

Proof: For $c(s, a) = C_c a + C_h Q_s$, the long-run average cost in (23) becomes:

$$C(\Pi) = C_c \lim_{N\to\infty} \frac{1}{N} E_\Pi \left[ \sum_{n=1}^{N} \pi_n \right] + C_h \lim_{N\to\infty} \frac{1}{N} E_\Pi \left[ \sum_{n=1}^{N} Q_s(n) \right] \le C_c + C_h \lim_{N\to\infty} \frac{1}{N} E_\Pi \left[ \sum_{n=1}^{N} Q_s(n) \right] < \infty. \tag{24}$$

The first inequality holds because the first limit is at most 1, the value attained by taking action 1 in every time slot. The second inequality holds as a consequence of the definition of the stable-throughput region, considering strong stability according to (3) in Definition 1.

Similarly, by selecting action 0 in every time slot, we get a lower bound of 0 on the first limit, and consequently,

$$C(\Pi) \ge C_h \lim_{N\to\infty} \frac{1}{N} E_\Pi \left[ \sum_{n=1}^{N} Q_s(n) \right], \tag{25}$$

which implies the strong stability of $Q_s$ for $C(\Pi) < \infty$. However, the stability of $Q_p$ is not guaranteed.

As the optimal policy $\Pi^*$ achieves the minimum average cost under given $\lambda_p$ and $\lambda_s$, this cost has to be upper bounded by the average cost of the cooperation based spectrum sharing, that is, $C(\Pi^*) \le C(\Pi_C)$. As $C(\Pi_C) < \infty$ within $S_C$, $C(\Pi^*) < \infty$ within $S_C$ as well.

C. The existence of optimal stationary policy

We modeled the sequential decision with an infinite state MDP with unbounded costs and a finite action set. We prove the existence of a stationary policy that is average cost optimal, building on the results of [28].

Following [28], let us introduce a discount factor $0 < \beta < 1$, and give the total expected discounted cost incurred by policy $\Pi$ as:

$$V_{\Pi,\beta}(i, j) = E_\Pi \left[ \sum_{n=1}^{\infty} \beta^{n-1} c(s_n, \pi_n) \,\middle|\, s_1 = (i, j) \right]. \tag{26}$$

Let $V_\beta(i, j) = \inf_\Pi V_{\Pi,\beta}(i, j)$ and $h_\beta(i, j) = V_\beta(i, j) - V_\beta(0, 0)$.

Proposition 2 specifies the conditions that must be satisfied for the average cost optimal stationary policy to exist.

Proposition 2. There exists a stationary policy that is average cost optimal for the MDP$\langle \mathcal{S}, \mathcal{A}, A, p, c \rangle$ if the following conditions are satisfied:

1) $V_\beta(i, j)$ is finite for all $(i, j)$ and $\beta$;

2) There exists a nonnegative $N$ such that $h_\beta(i, j) \ge -N$ for all $(i, j)$ and $\beta$;

3) There exists a nonnegative $M_{i,j}$ such that $h_\beta(i, j) \le M_{i,j}$ for every $(i, j)$ and $\beta$. For every $(i, j)$, there exists an action $a$ such that $\sum_{k,l} p_a((i, j), (k, l)) M_{k,l} < \infty$.

Proof: See [28].

Corollary 1. Under condition 1, the quantity $V_\beta(i, j)$ satisfies the optimality equation:

$$V_\beta(i, j) = \min_a \left\{ c((i, j), a) + \beta \sum_{k,l} p_a((i, j), (k, l)) V_\beta(k, l) \right\}. \tag{27}$$

Proof: See [28].

Now we can prove the existence of the optimal stationary policy for sequential decision based spectrum sharing.
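The optimality equation (27) can be explored numerically with standard value iteration. The paper uses (27) only as a proof device, so the sketch below is an illustration rather than the authors' method. It makes three assumptions that are not in the text: both queues are capped at `N` packets so that the state space is finite, successes in a cooperative slot are independent, and the parameter tuple `prm = (lam_p, lam_s, q_pd, q_pc, q_sd, q_sc)` and default costs are arbitrary.

```python
from itertools import product

def step_probs(state, a, prm, N):
    """{next_state: prob} with queues capped at N (overflowing arrivals dropped)."""
    Qp, Qs = state
    lam_p, lam_s, q_pd, q_pc, q_sd, q_sc = prm
    if Qp == 0:
        sp, ss = 0.0, q_sd
    elif a == 0:
        sp, ss = q_pd, 0.0
    else:
        sp, ss = q_pc, q_sc
    if Qs == 0:
        ss = 0.0
    out = {}
    for bp, bs, ap, as_ in product((0, 1), repeat=4):
        pr = ((sp if bp else 1 - sp) * (ss if bs else 1 - ss)
              * (lam_p if ap else 1 - lam_p) * (lam_s if as_ else 1 - lam_s))
        if pr > 0.0:
            nxt = (min(Qp - bp + ap, N), min(Qs - bs + as_, N))
            out[nxt] = out.get(nxt, 0.0) + pr
    return out

def value_iteration(prm, beta=0.9, C_h=1.0, C_c=2.0, N=10, tol=1e-9):
    """Iterate the optimality equation (27) until the update is below tol."""
    states = list(product(range(N + 1), repeat=2))
    P = {(s, a): step_probs(s, a, prm, N) for s in states for a in (0, 1)}
    V = {s: 0.0 for s in states}
    while True:
        V_new, policy = {}, {}
        for s in states:
            actions = (0, 1) if s[0] > 0 else (0,)      # constraint (19)
            best = None
            for a in actions:
                q = C_c * a + C_h * s[1] + beta * sum(
                    p * V[t] for t, p in P[(s, a)].items())
                if best is None or q < best:
                    best, policy[s] = q, a
            V_new[s] = best
        diff = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if diff < tol:
            return V, policy
```

Calling `value_iteration((0.3, 0.2, 0.6, 0.8, 0.5, 0.4))` returns the discounted values and a greedy stationary policy for the capped model; the result depends on the cap `N` and should not be read as the optimal policy of the infinite-buffer system.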

Theorem 3. For the MDP$\langle \mathcal{S}, \mathcal{A}, A, p, c \rangle$, there exists a stationary policy $\Pi^*$ that minimizes the long-run average cost.

Proof: We prove that conditions 1-3 in Proposition 2 hold. To evaluate condition 1, we upper bound $V_\beta(i, j)$ in two steps, by considering $\Pi_D$, the policy of always accessing the spectrum opportunistically defined in (21), and by assuming that there are primary and secondary packet arrivals in each time slot. This gives:

$$V_\beta(i, j) \le V_{\Pi_D,\beta}(i, j) \le \sum_{n=1}^{\infty} \beta^{n-1} c(s_n, 0) \Big|_{s_n = (i+n,\, j+n)} = \sum_{n=1}^{\infty} C_h \beta^{n-1} (j + n) = \frac{C_h j}{1 - \beta} + \frac{C_h}{(1 - \beta)^2}, \tag{28}$$

that is, condition 1 is fulfilled.

Condition 2 is fulfilled if $V_\beta(i, j)$ is positive and non-decreasing in $i$ and $j$, since then $h_\beta(i, j) = V_\beta(i, j) - V_\beta(0, 0) \ge 0$. The state space falls into four areas, $(i = 0, j = 0)$, $(i = 0, j > 0)$, $(i > 0, j = 0)$ and $(i > 0, j > 0)$, where under the same action $V_\beta(i, j)$ has the same form for all states. As $c((i, j), a)$ is non-decreasing in $i$ and $j$, Corollary 1 can be used to show that $V_\beta(i, j)$ is also non-decreasing in $i$ and $j$ (see [28]).




Finally, let us consider condition 3. Since Vβ(i, j) is positive and non-decreasing in i and j, we have:

$$h_\beta(i,j) \le V_\beta(i,j) - 0 \le \sum_{n=1}^{\infty} \beta^{n-1} c(s_n,0)\big|_{s_n=(i+n,\,j+n)}. \tag{29}$$

Let $M_{i,j} = \sum_{n=1}^{\infty} \beta^{n-1} c(s_n,0)\big|_{s_n=(i+n,\,j+n)}$. Then the first part of condition 3 is fulfilled. As in the considered system there is a finite number of possible transitions from each state, the second part of the condition holds as well.
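The geometric-series step in (28) can be checked numerically. The following is a minimal sketch that compares a truncated version of the sum with the closed form; the function names and the truncation length are illustrative assumptions, not part of the original analysis.

```python
def truncated_discounted_sum(c_h, j, beta, n_terms=10_000):
    """Truncated sum of C_h * beta**(n-1) * (j + n) for n = 1..n_terms."""
    return sum(c_h * beta ** (n - 1) * (j + n) for n in range(1, n_terms + 1))


def bound_closed_form(c_h, j, beta):
    """Right-hand side of (28): C_h*j/(1-beta) + C_h/(1-beta)**2."""
    return c_h * j / (1 - beta) + c_h / (1 - beta) ** 2


# Compare both expressions for a few discount factors and queue lengths.
for beta in (0.5, 0.9, 0.99):
    for j in (0, 3, 10):
        lhs = truncated_discounted_sum(1.0, j, beta)
        rhs = bound_closed_form(1.0, j, beta)
        assert abs(lhs - rhs) <= 1e-6 * rhs
```

The bound is finite for every β < 1 but grows as β approaches 1, which is why condition 1 is stated per discount factor.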

Note that Theorem 3 holds for the more general case when H(s) in (20) is a (nonnegative and nondecreasing) polynomial of degree m in Qs. Only condition 1 needs to be reevaluated. The right side of (28) is now a polynomial of degree m with the leading item $\sum_{n=1}^{\infty} \beta^{n-1} n^m$, where

$$\sum_{n=1}^{\infty} \beta^{n-1} n^m = (1-\beta)^{-m-1} \sum_{k=1}^{m} a_k^{(m)} \beta^{k-1},$$

with $a_1^{(m)} = a_m^{(m)} = 1$ and $a_k^{(m)} = k\,a_k^{(m-1)} + (m-k+1)\,a_{k-1}^{(m-1)}$ for $k = 2, \ldots, m-1$. Consequently, $\sum_{n=1}^{\infty} \beta^{n-1} n^m$ is the sum of a finite number of finite terms and $\sum_{n=1}^{\infty} \beta^{n-1} n^m < \infty$. Therefore, Vβ(i, j) < ∞, and condition 1 is satisfied.
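As a sanity check of the identity used in this note, the coefficients can be generated from the stated recursion and the closed form compared with a truncated series. This is an illustrative sketch; the function names and the truncation length are assumptions.

```python
def coefficients(m):
    """Return [a_1^(m), ..., a_m^(m)] from the recursion stated in the note."""
    a = [1]  # degree 1: a_1^(1) = 1
    for deg in range(2, m + 1):
        prev = a
        a = [1] * deg                      # boundary values a_1 = a_deg = 1
        for k in range(2, deg):            # k = 2, ..., deg - 1
            a[k - 1] = k * prev[k - 1] + (deg - k + 1) * prev[k - 2]
    return a


def truncated_series(m, beta, n_terms=5000):
    """Truncated sum of beta**(n-1) * n**m for n = 1..n_terms."""
    return sum(beta ** (n - 1) * n ** m for n in range(1, n_terms + 1))


def series_closed_form(m, beta):
    """(1-beta)**(-m-1) * sum_k a_k^(m) * beta**(k-1)."""
    a = coefficients(m)
    poly = sum(a[k - 1] * beta ** (k - 1) for k in range(1, m + 1))
    return poly / (1 - beta) ** (m + 1)


for m in (1, 2, 3, 4):
    for beta in (0.3, 0.9):
        approx = truncated_series(m, beta)
        exact = series_closed_form(m, beta)
        assert abs(approx - exact) <= 1e-6 * exact
```

The closed form is a finite polynomial in β divided by a positive power of (1 − β), so it is finite for every β < 1, which is the property condition 1 needs.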

D. LP approximation

From Theorem 3 we know that there exists an optimal policy that is stationary. However, obtaining the optimal stationary policy is computationally prohibitive since it involves solving an MDP with a countably infinite state space. To make the problem tractable, we aim at approximating the original MDP by a finite-state MDP with a tunable number of states. Specifically, we consider the system where the PT and the ST have finite buffers for storing at most Np and Ns packets, respectively, and arriving packets are dropped if there is no space in the corresponding queue. In this case, the state space becomes S = {(Qp, Qs), Qp ∈ {0, 1, ..., Np} and Qs ∈ {0, 1, ..., Ns}}, whereas the action space and the cost function remain the same.
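A minimal sketch of the truncation, assuming at most one arrival per queue and slot: the finite state space is enumerated explicitly, and an arrival to a full buffer is dropped. The function names are hypothetical.

```python
from itertools import product


def finite_state_space(n_p, n_s):
    """All (Qp, Qs) pairs with 0 <= Qp <= n_p and 0 <= Qs <= n_s."""
    return list(product(range(n_p + 1), range(n_s + 1)))


def enqueue(queue_len, arrival, capacity):
    """Queue length after a possible arrival; the packet is dropped if the buffer is full."""
    return min(queue_len + (1 if arrival else 0), capacity)


states = finite_state_space(3, 2)
assert len(states) == (3 + 1) * (2 + 1)
```

The number of states grows as (Np + 1)(Ns + 1), which is the quantity that tunes the size of the approximating problem.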

Proposition 3. If the arrival rates are inside the stable-throughput region SC, the long-run average cost from the LP approximation converges as Np → ∞ and Ns → ∞.

Proof: Let us denote with C(Π, (Np, Ns)) the long-term average cost of the finite buffer system under policy Π. From Theorem 2 the long-run average cost under policy Π∗ is bounded in the infinite buffer system. Since a bounded non-decreasing sequence converges, to prove the convergence we need to show that C(Π∗, (Np, Ns)) ≤ C(Π∗+, (Np, Ns)+), where + denotes the system with increased primary or secondary buffer.

First consider the optimal policy Π∗+. The MDP of the spectrum sharing queues fulfils the conditions of optimal monotonic policy [30], that is, under Π∗+ the probability of taking action 1 is nondecreasing in (Qp, Qs). Consequently, truncating the state space by decreasing Np or Ns cannot increase any of the two cost components of C(Π∗+). This gives C(Π∗+, (Np, Ns)) ≤ C(Π∗+, (Np, Ns)+).

We prove the inequality by contradiction. Assume that C(Π∗, (Np, Ns)) > C(Π∗+, (Np, Ns)+). From C(Π∗+, (Np, Ns)) ≤ C(Π∗+, (Np, Ns)+) it follows that C(Π∗+, (Np, Ns)) < C(Π∗, (Np, Ns)). That is, Π∗ cannot be the optimal policy for the (Np, Ns) system. This is a contradiction, thus C(Π∗, (Np, Ns)) ≤ C(Π∗+, (Np, Ns)+) needs to hold.

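We find the optimal stationary policy for the finite-state system by solving the following linear program (LP) [29]:

$$\begin{aligned}
\{z^*_{s,a}\}_{s \in S,\, a \in A(s)} = \arg\min\; & \sum_{s \in S} \sum_{a \in A(s)} z_{s,a}\, c(s,a), \\
\text{s.t.}\; & \sum_{s \in S} \sum_{a \in A(s)} z_{s,a} = 1, \\
& \sum_{a \in A(s')} z_{s',a} = \sum_{s \in S} \sum_{a \in A(s)} z_{s,a}\, p_a(s,s'), \quad \forall s' \in S, \\
& z_{s,a} \ge 0, \quad \forall a \in A(s),\ s \in S,
\end{aligned} \tag{30}$$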

where zs,a denotes the probability that the system is in state s and chooses action a. With the optimal solution to the LP, the optimal randomized stationary policy Π∗ = {π∗s,a} that minimizes the long-run average cost per unit time is computed as $\pi^*_{s,a} = z^*_{s,a} / \sum_{b \in A(s)} z^*_{s,b}$, and the objective function gives the minimum long-run average cost. As shown in [29], in each state there is only one action that has positive z∗s,a, and therefore the optimal policy achieved by (30) is the optimal deterministic stationary policy. We estimate the optimal stationary policy for the original infinite buffer MDP by letting Np → ∞ and Ns → ∞.
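To make the roles of zs,a, the constraints of (30) and the policy extraction concrete, the sketch below uses a hypothetical two-state, two-action MDP (not the queueing model of this paper). Instead of calling an LP solver, it enumerates the deterministic stationary policies, which is a swapped-in technique that is only practical for tiny examples; it then checks that the resulting occupation measure is feasible for (30). All numbers and names are illustrative assumptions.

```python
from itertools import product

# Hypothetical toy MDP; every chain induced by a policy is irreducible.
STATES = (0, 1)
ACTIONS = (0, 1)
P = {  # P[a][s][t]: probability of moving from s to t under action a
    0: [[0.9, 0.1], [0.4, 0.6]],
    1: [[0.7, 0.3], [0.8, 0.2]],
}
COST = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 3.0, (1, 1): 4.0}  # c(s, a)


def stationary(policy, iters=2000):
    """Stationary distribution under a deterministic policy (power iteration)."""
    d = [1.0 / len(STATES)] * len(STATES)
    for _ in range(iters):
        d = [sum(d[s] * P[policy[s]][s][t] for s in STATES) for t in STATES]
    return d


def occupation(policy):
    """z[s, a]: long-run probability of being in state s and choosing action a."""
    d = stationary(policy)
    return {(s, a): (d[s] if policy[s] == a else 0.0)
            for s in STATES for a in ACTIONS}


def objective(z):
    """Objective of (30): long-run average cost."""
    return sum(z[s, a] * COST[s, a] for s in STATES for a in ACTIONS)


def feasible(z, tol=1e-9):
    """Check the normalization, balance and nonnegativity constraints of (30)."""
    normalized = abs(sum(z.values()) - 1.0) <= tol
    balanced = all(
        abs(sum(z[t, a] for a in ACTIONS)
            - sum(z[s, a] * P[a][s][t] for s in STATES for a in ACTIONS)) <= tol
        for t in STATES)
    return normalized and balanced and all(v >= 0.0 for v in z.values())


def extract_policy(z):
    """pi[s, a] = z[s, a] / sum_b z[s, b]; uniform if state s is never visited."""
    pi = {}
    for s in STATES:
        mass = sum(z[s, b] for b in ACTIONS)
        for a in ACTIONS:
            pi[s, a] = z[s, a] / mass if mass > 0 else 1.0 / len(ACTIONS)
    return pi


policies = [dict(zip(STATES, acts)) for acts in product(ACTIONS, repeat=len(STATES))]
best = min(policies, key=lambda pol: objective(occupation(pol)))
z_best = occupation(best)
assert feasible(z_best)
```

For the actual finite-buffer model the number of deterministic policies grows exponentially with the number of states, so an LP solver would be used on (30) directly; the enumeration here only illustrates what the LP variables and its solution represent.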

V. Sequential Decision with Online Reinforcement Learning

In the case when the system parameters or the system state are not known to the ST, MDP based optimization of sequential decision cannot be applied. Online reinforcement learning provides a viable alternative in this case. It can also be used to find near-optimal solutions for large MDPs, where the complexity of LP based optimization is prohibitive. Specifically, we propose to adapt R-learning, an average-reward reinforcement learning method [6][32]. R-learning uses simulation-based stochastic approximation, and thus can avoid the need for computing the transition probability and the reward matrices. It is based on the iterative updating of the state dependent action-value functions called the R-factors, and the experienced average cost ρ, via a sample path. The R-factor Rt(s, a) represents the expected cost of taking action a in state s given that an optimal policy is applied for all future steps. The R-learning algorithm works as follows:

1) At time t = 1, all the R-factors are initialized to a finite value (for instance 0), and the average cost to ρ1 = 0. Let s denote the current state.

2) Action a = arg min_{b∈A(s)} Rt(s, b) is selected with probability 1 − αt, whereas with probability αt, an exploratory action a is chosen uniformly from A(s), to ensure that the action space is explored and the learning does not converge to a local optimum.

3) Let c(s, a) and s′ denote the incurred cost and the next state, respectively. The R-factor is updated as:

$$R_{t+1}(s,a) \leftarrow (1-\beta_r) R_t(s,a) + \beta_r \left[ c(s,a) - \rho_t + \min_{b \in A(s')} R_t(s',b) \right]. \tag{31}$$




[Figure 3: four panels plotted against the buffer size Np = Ns, with unit costs Ch = Cc = 1. Panels (a), (c) and (d) show the average cost (C) of OPP, COOP, ORC and OSD from simulation and of OSD from the LP; panel (b) shows the packet loss ratio (ploss) at the PT and the ST for each scheme.]

Fig. 3. (a) Average cost and (b) packet loss ratio (simulation) vs. buffer size for scenario 1. Average cost vs. buffer size for (c) scenario 2 and (d) scenario 3. The average costs of COOP and ORC overlap when Np, Ns ≥ 3 in (a), and when Np, Ns ≥ 2 in (c). They overlap with the average cost of OSD in (d).

In the case a = arg min_{b∈A(s)} Rt(s, b) was selected, the average cost is updated as well:

$$\rho_{t+1} \leftarrow (1-\beta_\rho)\rho_t + \beta_\rho \left[ c(s,a) + \min_{b \in A(s')} R_t(s',b) - \min_{b \in A(s)} R_t(s,b) \right]. \tag{32}$$

4) Let t = t + 1 and s = s′, and go to step 2.

In (31) and (32), βr and βρ denote the update rate of the R-factors and ρ, with 0 ≤ βr ≤ 1 and 0 ≤ βρ ≤ 1. After convergence, the decision in state s is set to arg min_{b∈A(s)} R(s, b).
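Steps 1 to 4 with the updates (31) and (32) can be sketched as follows. The environment `toy_queue_step` is a hypothetical single-queue stand-in with made-up parameters, not the two-queue model of this paper, and the function and argument names are assumptions. The sketch interprets "the greedy action was selected" as "the step was not an exploration step".

```python
import random


def r_learning(step, actions, s0, n_steps,
               alpha=0.5, beta_r=0.5, beta_rho=0.05, seed=0):
    """Run R-learning along one sample path and return (R, rho, policy)."""
    rng = random.Random(seed)
    R = {}      # R-factors, stored lazily so the state space can grow
    rho = 0.0   # running estimate of the average cost
    s = s0

    def r(state, action):
        return R.get((state, action), 0.0)   # step 1: R-factors start at 0

    for _ in range(n_steps):
        # Step 2: greedy action with probability 1 - alpha, else uniform exploration.
        explore = rng.random() < alpha
        a = rng.choice(actions) if explore else min(actions, key=lambda b: r(s, b))
        # Step 3: observe cost and next state, then apply (31) and (32).
        cost, s_next = step(s, a, rng)
        min_next = min(r(s_next, b) for b in actions)
        min_here = min(r(s, b) for b in actions)
        R[(s, a)] = (1 - beta_r) * r(s, a) + beta_r * (cost - rho + min_next)
        if not explore:   # assumption: rho is updated only on non-exploratory steps
            rho = (1 - beta_rho) * rho + beta_rho * (cost + min_next - min_here)
        s = s_next        # step 4

    states = {state for state, _ in R}
    policy = {state: min(actions, key=lambda b: r(state, b)) for state in states}
    return R, rho, policy


def toy_queue_step(s, a, rng, lam=0.2, q_wait=0.3, q_coop=0.6,
                   c_hold=1.0, c_coop=2.0, n_max=10):
    """Hypothetical environment: s is a queue length, a = 1 means cooperate."""
    cost = c_hold * s + c_coop * a
    served = s > 0 and rng.random() < (q_coop if a == 1 else q_wait)
    arrived = rng.random() < lam
    s_next = min(n_max, s - int(served) + int(arrived))
    return cost, s_next


R, rho, policy = r_learning(toy_queue_step, [0, 1], s0=0, n_steps=20_000, seed=1)
```

Storing the R-factors in a dictionary keyed by visited (state, action) pairs means that only states actually experienced consume memory, which matches the dynamic extension of the state space used for infinite buffers.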

We apply the R-learning algorithm with semi-uniform exploration, with constant αt. Other exploratory methods, like Boltzmann exploration and uncertainty estimation (UE) exploration, are presented and compared in [32][38]. For the infinite buffer system the state space is extended dynamically according to the maximum experienced queue lengths.

The computational complexity of R-learning is related to storing and updating the R-factors in each iteration step, where in turn the number of R-factors is given by the product of the size of the state and the action spaces. The number of required iteration steps, that is, the efficiency of learning, depends on the underlying stochastic process and also on the learning parameters such as αt, βr and βρ. While the convergence of R-learning to the optimal value is not proved, detailed evaluations show that R-learning finds near-optimal solutions in most scenarios [32]. The fundamental computational and information-theoretic limitations of reinforcement learning in general are discussed in [39].

We consider two cases of R-learning based OSD, full-state and reduced-state. In both cases the ST does not have knowledge of the arrival and successful transmission probabilities.
• Full-state case: the ST has full knowledge of the primary and secondary queue lengths.
• Reduced-state case: the ST only knows whether the primary queue is empty or not. In this case, the original state space S = {(Qp, Qs), Qp ∈ N0, Qs ∈ N0} is reduced to S′ = {(Qp, Qs), Qp ∈ {0, 1}, Qs ∈ N0}, where 0 denotes that the primary queue is empty, and 1 otherwise.
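The reduced-state observation can be written as a simple mapping from the full state. The function name is hypothetical.

```python
def reduce_state(state):
    """Map (Qp, Qs) to the reduced state: primary queue empty (0) or not (1)."""
    q_p, q_s = state
    return (1 if q_p > 0 else 0, q_s)


assert reduce_state((0, 4)) == (0, 4)
assert reduce_state((7, 4)) == (1, 4)
```

A learner for the reduced-state case would apply this mapping to every observed state before looking up or updating its R-factors.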

VI. Case Study

In this section, we compare the optimal sequential decision (OSD) with the opportunistic (OPP) and the cooperative (COOP) schemes, as well as with optimal random cooperation (ORC), that is, random cooperation with the optimal state-independent cooperation probability that minimizes the long-run average cost. For OSD we obtain the optimal sequential decision policy by solving the LP in (30), while we find the optimal cooperation probability for ORC with exhaustive search. Moreover, we evaluate whether R-learning (RL) can provide an OSD average cost close to the optimal one.

Matlab based simulation results are presented to validate the analytic results of OSD, and also for evaluating the performance of the OPP, COOP and ORC schemes. We consider the same system model as for the analysis, given by the packet arrival probabilities λp and λs and the successful transmission probabilities qpd, qpc, qsd and qsc. Under OSD the ST selects its action according to the LP based optimal policy in each time slot. The length of each time slot is assumed to be one time unit. All simulation results shown are average values from 10 runs, each lasting for 50,000 time slots.

A. Stable-throughput region and simulation scenarios

Fig. 2 shows the stable-throughput region when the PT and the ST have infinite buffers and the probabilities of successful packet transmission are set as qpd = 0.6, qpc = 0.8, qsd = 0.5, and qsc = 0.4. We keep these parameters fixed, and consider five scenarios with different sets of (λp, λs). Under scenarios 1-3, the arrival rates (λp, λs) are (0.2, 0.2), (0.5, 0.2) and (0.2, 0.5), respectively. As shown in Fig. 2, under these three scenarios the (λp, λs) pairs are in SD, in SC and outside SC, respectively. Under scenario 4 we fix λp = 0.2, giving a moderate primary load, and increase λs until it reaches the upper bound of SC, whereas under scenario 5, we fix λs = 0.2 and increase λp.

B. Average cost under increasing buffer size

To evaluate the LP approximation we set the unit costs as Ch = Cc = 1, and increase the buffer size of PT and ST in scenarios 1-3. Fig. 3 shows that if the (λp, λs) pair is inside the stable-throughput region of a scheme, the average cost converges, and the packet loss ratio (ploss) reduces to zero as the buffer sizes increase. The LP approximation gives a good estimation of the average cost, though the required buffer size increases with the load. If the (λp, λs) pair is outside the stable-throughput region of a scheme, the average cost goes to infinity, as it happens for OPP in scenario 2 and for all schemes in scenario 3 (shown in Fig. 3(c) and (d), respectively). Fig. 3 also indicates that for OSD the analytic and simulation results are consistent. For the rest of the evaluation, unless specified, we use fixed buffer size Np = Ns = 30, and use the analytic LP results to show the average cost under OSD.

[Figure 4: six panels for scenario 1 with Np = Ns = 30. Panels (a)-(d) are plotted against Cc/Ch and show the average cost (C), the average cost components (Ccoop, Chold), the average packet delay (Dp, Ds) and the cooperation probability (pc) for OPP, COOP, ORC and OSD. Panels (e) and (f) plot the cooperation energy (Ec) against the delay (Ds) for ORC and OSD with qpc ∈ {0.7, 0.8, 0.9} and qsc ∈ {0.35, 0.4, 0.45}, respectively, together with OPP and COOP.]

Fig. 4. (a) Average cost, (b) components of average cost, (c) average PT and ST packet delay, and (d) cooperation probability for scenario 1, and (e,f) energy-delay tradeoff vs. Cc/Ch for scenario 1 and varying successful transmission probabilities under cooperation.

C. On the increase of the cost for cooperation

The cost of cooperation depends on the preferences of the secondary system, and may change, for example increase as energy resources become scarce. To evaluate the effect of increasing cooperation cost Cc, we consider scenario 1, keep the cost of holding a packet constant at Ch = 1, and increase Cc = 0, 1, 2, ..., 10.

As shown in Fig. 4(a), the cost of OPP does not depend on Cc/Ch, while the cost of COOP increases linearly, as expected. OSD achieves the lowest average cost, which converges to the cost of COOP at low Cc, and to the cost of OPP at high Cc values, as OSD trades off the cost of cooperation and the cost of holding packets. ORC can trade off these costs as well, however with lower efficiency. For the considered parameters OSD can decrease the average cost by around 15% at high Cc values, compared to OPP, while it can achieve up to 40% gain for medium Cc values, where neither OPP nor COOP is efficient.

Fig. 4(b) shows the components of the average cost, that is, the cost of cooperation (Ccoop) and the cost of holding packets (Chold). For OSD and ORC the cost of cooperation is kept more or less constant, or is even decreased, while the cost of holding packets increases sublinearly with Cc. Note that when OSD and ORC have the same cooperation cost (at around Cc/Ch = 5), OSD achieves a significantly lower cost of holding packets, which shows the efficiency of the state dependent decision policy. Fig. 4(c) shows the average packet delay at the PT (Dp) and the ST (Ds). Compared to OPP, the delay experienced by the PT is decreased or at least guaranteed for all the other schemes, motivating that the primary system will allow cooperation. Comparing the secondary packet delay in Fig. 4(c) with the packet holding cost in Fig. 4(b), we can see that they are proportional, which indicates that the introduced state dependent cost reflects the experienced delay, and delay sensitive secondary systems can tune Ch to achieve the preferred performance. To see the reason for the delay increase, in Fig. 4(d) we compare the probability that cooperation is performed in an arbitrary time slot for OSD and ORC. Both schemes reduce the cooperation probability when the unit cost for cooperation increases, though at different rates. At Cc = 0, both schemes act as the COOP scheme, which under the given load parameters leads to a cooperation probability of 0.25. As Cc increases, these schemes move from cooperating towards transmitting opportunistically.

Finally, in Figs 4(e) and (f) we assess the achieved tradeoff between the delay experienced at the ST and the energy spent for cooperative transmission (Ec), when Cc/Ch ∈ [0, 10]. The energy consumption is assumed to be linear in the cooperation probability shown in Fig. 4(d). To evaluate the effect of the system parameters, these figures show results for different successful transmission probabilities under cooperation. Specifically, in Fig. 4(e) qpc is varied for qsc = 0.4, and in Fig. 4(f) qsc is varied for constant qpc = 0.8. The OPP and COOP schemes are not sensitive to the Cc/Ch ratio, while ORC and OSD trade energy consumption for delay. The gap between the ORC and OSD curves is significant for all considered qpc and qsc values. For the considered scenarios, OSD can halve the energy consumption for a given delay, or decrease the delay by one third under a given energy consumption value.

[Figure 5: three panels plotted against λs for scenario 4 (λp = 0.2; Ch = 1, Cc = 2; Np = Ns = 30), showing the average cost (C), the cost relative to OSD (OPP/OSD, COOP/OSD, ORC/OSD, with 30% and 50% markers), and the average cost components (Ccoop, Chold) for OPP, COOP, ORC and OSD.]

Fig. 5. (a) Average cost, (b) relative cost, and (c) components of average cost vs. arrival rate at the ST for scenario 4.

[Figure 6: three panels plotted against λp for scenario 5 (λs = 0.2; Ch = 1, Cc = 2; Np = Ns = 30), showing the average cost (C), the cost relative to OSD (OPP/OSD, COOP/OSD, ORC/OSD, with a 40% marker), and the average cost components (Ccoop, Chold) for OPP, COOP, ORC and OSD.]

Fig. 6. (a) Average cost, (b) relative cost, and (c) components of average cost vs. arrival rate at the PT, for scenario 5.

D. On the increase of the arrival rates

Let us now evaluate how the proposed solutions adapt to changing load conditions, first considering an increasing load at the ST, according to scenario 4. Fig. 5(a) shows that OPP performs well at low λs, while COOP performs well at high λs. OSD and ORC balance well between the two deterministic solutions, though ORC performs only as well as the more effective deterministic scheme. To better understand the gain of OSD, in Fig. 5(b) we show the costs relative to the OSD one. We see that the OSD gain is significant at moderate λs, where the cost of OPP and COOP is higher by 50% and the cost of ORC by 30%. The components of the average cost are shown in Fig. 5(c), where the cooperation cost curves indicate that the two dynamic secondary access schemes (ORC and OSD) increase the cooperation probability when the load at the secondary user increases, and cooperation is necessary to keep the ST queuing delay low. The packet holding cost increases exponentially for OPP and COOP, as expected. Interestingly, for this scenario OSD can manage nearly the same packet holding cost as COOP, but with lower cooperation cost.

Let us now increase λp according to scenario 5. Figs 6(a) and (b) show that OSD always achieves the lowest average cost, and its gain is significant when λp is not very low. Fig. 6(b) shows that the average cost of OPP is slightly lower than that of COOP when λp is low, but increases very fast and the system becomes unstable. ORC balances between OPP and COOP again, but it is not efficient: for the considered scenario its cost is up to 40% higher than that of OSD. Fig. 6(c) shows the components of the average cost. As we can see, cooperation can decrease the cost of holding packets significantly; in the considered scenario this cost is nearly independent of λp under COOP, OSD and ORC. The price is the increased cost of cooperation, which increases nearly linearly with λp not only for COOP, but also for OSD and ORC.

Our results considering the changing primary or secondary load show that although ORC can trade off the cost of cooperation and delay, its performance is not significantly better than the better one of the deterministic (OPP or COOP) schemes. The average cost can be significantly reduced with OSD, by utilizing cooperation when it effectively decreases the packet holding cost.

[Figure 7: three panels showing the average cost (C) with Ch = 1, Cc = 2. Panel (a): scenario 1, plotted against the number of time slots (n) for Np = Ns = 5, 10 and ∞. Panel (b): scenario 1, plotted against Np = Ns. Panel (c): scenario 4, plotted against λs with infinite buffers. Panels (b) and (c) compare OPP (∞), COOP (∞), OSD (LP), OSD (RL, full-state) and OSD (RL, reduced-state).]

Fig. 7. (a) Average cost from a single RL simulation run for the full-state case. Average cost (b) vs. buffer size for scenario 1, and (c) vs. secondary arrival rate for scenario 4.

[Figure 8: two panels showing the average cost (C) at ST1 and ST2 with Ch = 1, Cc = 2 and infinite buffers. Panel (a): plotted against λs1 = λs2 with λp1 = λp2 = 0.1, comparing OPP, COOP, OSD (RL, full-state) and OSD (RL, reduced-state). Panel (b): plotted against the number of time slots (n) with all arrival rates equal to 0.1.]

Fig. 8. Multi-user network. (a) Average cost vs. secondary arrival rates. (b) Average cost from a single RL simulation run with ST2 entering at time slot 7.5 · 10^4.

E. On the performance of R-learning

We consider the case when the traffic and link statistics and possibly even the primary system state are not known to the SU, and the optimal sequential decision policy needs to be found by R-learning (RL), as defined in Section V. The parametrization of R-learning is non-trivial [32]. We consider fixed exploration probability αt = 0.5 for t ∈ N+ and update rates βr = 0.5 and βρ = 0.05. The optimization of these parameters is beyond the focus of this paper.

We evaluate the convergence speed of RL in Fig. 7(a), showing the average cost of the full-state case, from time zero as a function of the number of iteration steps. We consider scenario 1, and single simulation runs for Np = Ns = 5, 10, +∞. Similar convergence can be shown under the reduced-state case. The figure shows that the average cost converges fast and changes little after step 10,000. The convergence speed is almost unaffected by the buffer size, which shows that RL can efficiently discover the action space in the states where the system resides with high probability.

Figs 7(b) and (c) show the average cost with 95% confidence interval, based on 10 simulation runs. Each simulation run starts with a learning phase of 50,000 iteration steps. Then the policy is fixed and the average cost is calculated for the next 50,000 time slots. In Fig. 7(b) we compare the average cost of OSD under scenario 1 with different buffer sizes, considering the LP solution, and full- and reduced-state RL. We show the OPP and COOP performance for comparison. For all three OSD solutions the cost first increases with the buffer size, just as in Fig. 3(a). RL achieves a slightly higher average cost than LP, however, it still performs better than the two deterministic schemes. Fig. 7(c), considering scenario 4, shows similar results. The efficiency of the RL algorithm motivates its use to find dynamic secondary access control policies when the secondary user does not know a priori the traffic and link statistics, and even the present queue size of the primary user.
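The text does not state how the 95% confidence intervals are computed. One common choice for a small number of independent runs, assumed here, is a Student-t interval around the sample mean; the sketch below uses the two-sided critical value for 9 degrees of freedom, matching 10 runs. The function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

T_975_DF9 = 2.262  # two-sided 95% Student-t critical value, 9 degrees of freedom


def mean_ci95(run_averages, t_crit=T_975_DF9):
    """Return (mean, half-width) of a t-based confidence interval over runs."""
    m = mean(run_averages)
    half_width = t_crit * stdev(run_averages) / sqrt(len(run_averages))
    return m, half_width
```

The critical value has to be changed if the number of runs differs from 10, since it depends on the degrees of freedom.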

F. On multi-user spectrum sharing

To demonstrate the performance of optimal sequential decision in multi-user networking scenarios, we consider primary and secondary transmission on a single channel. PTs access the channel with time division multiplexing. ST-SR pairs are pre-assigned to the PT-PR ones. An ST can transmit with cooperative spectrum sharing by relaying the packet of its PT pair, or opportunistically in any slot left idle by a PT. STs introduce access control for the opportunistic transmissions, e.g., with short random back-offs. Here we assume that one of the STs with a packet to send is selected uniformly at random.
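The assumed opportunistic access rule, where one backlogged ST is picked uniformly at random, can be sketched as follows. The function name and the list-based queue representation are illustrative assumptions.

```python
import random


def select_st(secondary_queues, rng):
    """Pick uniformly at random among the STs that have a packet to send.

    secondary_queues: list of queue lengths, one entry per ST.
    Returns the index of the selected ST, or None if all queues are empty.
    """
    backlogged = [k for k, q in enumerate(secondary_queues) if q > 0]
    return rng.choice(backlogged) if backlogged else None


rng = random.Random(0)
assert select_st([0, 0], rng) is None
assert select_st([0, 3], rng) == 1
```

From the point of view of a single ST, this rule makes the success probability of an opportunistic attempt depend on how many other STs are backlogged, which is why that probability is not known a priori.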

OSD requires a set of control actions as in the single transmitter case, that is, access the channel opportunistically according to the secondary access control, or with cooperative transmission. The state space, however, increases with the number of user pairs. For each ST, it includes all primary and its own queue lengths for the full-state case, and the own queue length and empty or non-empty states for the primary queues for the reduced-state case. Moreover, the probability of successful opportunistic spectrum access now depends not only on the channel state, but also on the traffic generated by the other PTs and STs, and therefore is not known a priori. Consequently, LP based approximation is not feasible in multi-user networks, and online reinforcement learning needs to be used.

Fig. 8 evaluates the performance of R-learning when two primary and two secondary user pairs share the spectrum and RL runs independently at the two STs. We set qp1d = qp2d = 0.6, qp1c = qp2c = 0.8, qs1d = qs2d = 0.5, qs1c = qs2c = 0.4, Ch = 1 and Cc = 2 as in most of the single-user evaluation scenarios. Fig. 8(a) shows the OSD performance with RL compared to OPP and COOP for increasing λs1 = λs2. We can see that both STs adjust the access policy as the arrival rates increase, similarly to the single-user case in Fig. 7(c). We have to note, however, that at high load, COOP achieves slightly lower average cost, that is, the RL algorithm could not find the optimal policy. Fig. 8(b) evaluates RL under changing traffic. At time zero only PT1, PT2 and ST1 are active, and ST1 uses mainly opportunistic transmission due to the low aggregate load. ST2 enters at time slot 7.5 · 10^4. After the learning process, both STs settle down to the new optimal policy, which is now closer to COOP.

These results show that OSD is necessary even in multi-user networks. R-learning still finds near-optimum solutions, and allows the STs to tune the spectrum access policy under changing traffic conditions. However, the performance of R-learning decreases at increased state and action spaces. Therefore, in multi-user networks, we suggest a modular design where the relay selection and the secondary network access control are performed independently from the optimization of the spectrum access policy.

VII. Discussion

In this work we presented novel results on spectrum sharing networks where the transmission power cost needs to be taken into account. We considered non-backlogged traffic, where the secondary user has the possibility to trade off cooperation cost and channel access delay. We formulated the problem of optimal sequential decision as an MDP with a cost function that combines the cost of cooperation, e.g., increased secondary energy consumption, and the cost of queuing delay, and proved the existence of a stationary policy that is average cost optimal. We showed that dynamic secondary access with optimal sequential decision significantly improves the tradeoff between energy consumption and delay compared to random dynamic cooperation and can achieve significant gain compared to the static opportunistic and cooperative access schemes, even when online learning needs to be applied.

Several of the system assumptions could be relaxed at the cost of increased modeling complexity, e.g., to consider Markov modulated arrival and loss processes, limited number of retransmissions, ACK/NACK loss, or errors in the PT-ST control information exchange. Further tradeoffs could be discovered when tuning the power allocation parameter of the cooperative scheme, changing the ratio of qpc and qsc. We considered spectrum sharing options that ensure per packet performance guarantees for the primary system. Other solutions ensuring instead long term average performance guarantees could be evaluated in a similar framework as well.

The analytic modeling of sequential decision in multi-user networks is a challenging direction of future research, since the interacting dynamic queues may have unexpected behavior [40], unless a fully symmetric case is considered and mean-field methods can be applied [26]. Therefore we considered learning based OSD for the multi-user, single channel scenario, and showed that it can still decrease the average cost, though it loses efficiency due to the increased state space. Therefore, to extend the proposed OSD to multi-channel, multi-user networks we suggest that the OSD policy, the one-to-one assignment of the primary and secondary links [11][15][16] and the opportunistic channel access control [41][42] are optimized independently or jointly through an iterative process.

References

[1] Q. Zhao and B. Sadler, “A survey of dynamic spectrumaccess,” IEEE Signal Proc. Mag., vol. 24, no. 3, pp. 79–89, 2007.

[2] Y. Chen, Q. Zhao, and A. Swami, “Joint design andseparation principle for opportunistic spectrum access inthe presence of sensing errors,” IEEE Trans. Inf. Theory,vol. 54, no. 5, pp. 2053–2071, 2008.

[3] A. T. Hoang, Y.-C. Liang, and Y. Zeng, “Adaptive jointscheduling of spectrum sensing and data transmissionin cognitive radio networks,” IEEE Trans. Commun.,vol. 58, no. 1, pp. 235–246, 2010.

[4] U. Berthold, F. Fu, M. V. der Schaar, and F. Jondral,“Detection of spectral resources in cognitive radios usingreinforcement learning,” in Proc. of IEEE DySPAN, 2008.

[5] X. Li, Q. Zhao, X. Guan, and L. Tong, “Optimal cogni-tive access of Markovian channels under tight collisionconstraints,” IEEE J. Sel. Areas Commun., vol. 29, no. 4,pp. 746–756, Apr. 2011.

[6] M. Levorato, S. Firouzabadi, and A. Goldsmith, “A learn-ing framework for cognitive interference networks withpartial and noisy observations,” IEEE Trans. WirelessCommun., vol. 11, no. 9, pp. 3101–3111, Sept. 2012.

[7] A. Galindo-Serrano and L. Giupponi, “Distributed Q-learning for aggregated interference control in cognitiveradio networks,” IEEE Trans. Veh. Technol., vol. 59,no. 4, pp. 1823–1834, 2010.

[8] Y. Han, A. Pandharipande, and S. H. Ting, “Cooperativedecode-and-forward relaying for secondary spectrum ac-




cess,” IEEE Trans. Wireless Commun., vol. 8, no. 10, pp.4945–4950, Oct. 2009.

[9] I. Krikidis, “Multilevel modulation for cognitive multiac-cess relay channel,” IEEE Trans. Veh. Technol., vol. 59,no. 6, pp. 3121–3125, Jul. 2010.

[10] B. Cao, L. X. Cai, and et. al., “Cooperative cognitiveradio networking using quadrature signaling,” in Proc.of IEEE INFOCOM, 2012.

[11] T. Elkourdi and O. Simeone, “Spectrum leasing viacooperation with multiple primary users,” IEEE Trans.Veh. Technol., vol. 61, no. 2, pp. 820–825, Feb. 2012.

[12] W. Su, J. Matyjas, and S. Batalama, “Active cooperationbetween primary users and cognitive radio users in het-erogeneous ad hoc networks,” IEEE Trans. Inf. Theory,vol. 60, no. 4, pp. 1796–1805, Apr. 2012.

[13] W. Lu, Y. Gong, S. H. Ting, X. Wu, and N. Zhang,“Cooperative OFDM relaying for opportunistic spectrumsharing: protocol design and resource allocation,” IEEETrans. Wireless Commun., vol. 11, no. 6, pp. 2126–2135,Jun. 2012.

[14] S. Hua, H. Liu, M. Wu, and S. Panwar, “ExploitingMIMO antennas in cooperative cognitive radio net-works,” in Proc. of IEEE INFOCOM, 2011.

[15] Z. Guan, T. Melodia, D. Yuan, and D. A. Pados, “Dis-tributed spectrum management and relay selection ininterference-limited cooperative wireless networks,” inProc. of ACM MobiCom, 2011, pp. 229–240.

[16] M. Shamaiah, S. H. Lee, S. Vishwanath, and H. Vikalo,“Distributed algorithms for spectrum access in cognitiveradio relay networks,” IEEE J. Sel. Areas Commun.,vol. 30, no. 10, pp. 1947–1957, 2012.

[17] L. Gao, R. Zhang, C. Yin, and S. Cui, “Throughput anddelay scaling in supportive two-tier networks,” IEEE J.Sel. Areas Commun., vol. 30, no. 2, pp. 415–424, Feb.2012.

[18] Y. Han, S. H. Ting, M. Motani, and A. Pandharipande,“On throughput and delay scaling with cooperative spec-trum sharing,” in Proc. of IEEE ISIT, Aug. 2011.

[19] L. Wang and V. Fodor, “On the gain of primary exclu-sion region and vertical cooperation in spectrum sharingwireless networks,” IEEE Trans. Veh. Technol., vol. 61,no. 8, pp. 3746–3758, Oct. 2012.

[20] O. Simeone, Y. Bar-Ness, and U. Spagnolini, “Stablethroughput of cognitive radios with and without relay-ing capacity,” IEEE Trans. Wireless Commun., vol. 55,no. 12, pp. 2351–2360, Dec. 2007.

[21] I. Krikidis, J. N. Laneman, J. S. Thompson, and S. McLaughlin, “Protocol design and throughput analysis for multi-user cognitive cooperative systems,” IEEE Trans. Wireless Commun., vol. 8, no. 9, pp. 4740–4751, Sep. 2009.

[22] S. Kompella, G. Nguyen, J. Wieselthier, and A. Ephremides, “Stable throughput tradeoffs in cognitive shared channels with cooperative relaying,” in Proc. of IEEE INFOCOM, 2011.

[23] M. Levorato, U. Mitra, and M. Zorzi, “Cognitive interference management in retransmission-based wireless networks,” IEEE Trans. Inf. Theory, vol. 58, no. 5, pp. 3023–3046, 2012.

[24] F. Lapiccirella, X. Liu, and Z. Ding, “Distributed control of multiple cognitive radio overlay for primary queue stability,” IEEE Trans. Wireless Commun., vol. 12, no. 1, pp. 112–122, 2013.

[25] B. Rong and A. Ephremides, “Cooperative access in wireless networks: stable throughput and delay,” IEEE Trans. Inf. Theory, vol. 58, no. 9, pp. 5890–5907, Sep. 2012.

[26] A. Sadek, K. Liu, and A. Ephremides, “Cognitive multiple access via cooperation: Protocol design and performance analysis,” IEEE Trans. Inf. Theory, vol. 53, no. 10, pp. 3677–3696, 2007.

[27] K. W. Ross, “Randomized past-dependent policies for Markov decision processes with multiple constraints,” Operations Research, vol. 37, no. 3, pp. 474–477, 1989.

[28] L. Sennott, “Average cost optimal stationary policies in infinite state Markov decision processes with unbounded cost,” Operations Research, vol. 37, pp. 626–633, 1989.

[29] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming. New York, NY: John Wiley & Sons, Inc., 1994.

[30] E. Altman and S. Stidham Jr., “Optimality of monotonic policies for two-action Markovian decision processes, with applications to control of queues with delayed information,” Queueing Systems, vol. 21, no. 3-4, pp. 267–291, 1995.

[31] D. P. Bertsekas, Dynamic programming and optimal control, 2nd ed. Belmont, MA: Athena Scientific, 2001.

[32] S. Mahadevan, “Average reward reinforcement learning: foundations, algorithms, and empirical results,” Machine Learning, vol. 22, no. 1, pp. 159–195, 1996.

[33] L. Georgiadis, M. J. Neely, and L. Tassiulas, “Resource allocation and cross-layer control in wireless networks,” Found. Trends Netw., vol. 1, no. 1, pp. 1–144, Apr. 2006.

[34] M. J. Neely, “Dynamic power allocation and routing for satellite and wireless networks with time varying channels,” Ph.D. dissertation, Massachusetts Institute of Technology, 2003.

[35] P. Coolen-Schrijner and E. A. van Doorn, “On the convergence to stationarity of birth-death processes,” Journal of Applied Probability, vol. 38, no. 3, pp. 696–706, 2001.

[36] R. Rao and A. Ephremides, “On the stability of interacting queues in a multiple-access system,” IEEE Trans. Inf. Theory, vol. 34, no. 5, pp. 918–930, Sep. 1988.

[37] L. Wang and V. Fodor, “Cooperate or not: the secondary user’s dilemma in hierarchical spectrum sharing network,” in Proc. of IEEE ICC, 2013.

[38] S. B. Thrun, “The role of exploration in learning control,” in Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, 1992.

[39] M. J. Kearns, Computational Complexity of Machine Learning. Cambridge, Massachusetts, London, England: The MIT Press, 1990.

[40] H. Takagi and L. Kleinrock, “Optimal transmission ranges for randomly distributed packet radio terminals,” IEEE Trans. Commun., vol. 32, no. 3, pp. 246–257, 1984.

[41] X. Zhang and H. Su, “CREAM-MAC: Cognitive radio-enabled multi-channel MAC protocol over dynamic spectrum access networks,” IEEE J. Sel. Topics in Signal Processing, vol. 5, no. 1, pp. 110–123, Feb. 2011.

[42] H. Cho and G. Hwang, “An optimized random channel access policy in cognitive radio networks under packet collision requirement for primary users,” IEEE Trans. Wireless Commun., vol. 12, no. 12, pp. 6382–6391, Dec. 2013.

Liping Wang received the B.Sc. and M.Sc. degrees in electronic and information engineering from Tongji University, Shanghai, China, in 2004 and 2007, respectively. In 2007 and 2008, she was with the National Institute of Informatics, Tokyo, Japan, where she worked on resource allocation in orthogonal frequency-division multiple-access relay-enhanced cellular networks. In 2013, she received a Ph.D. in Telecommunications from KTH Royal Institute of Technology, Stockholm, Sweden. Her current research interests include cognitive radio, cooperative communications, and radio resource management in wireless networks.

Viktoria Fodor received the M.Sc. and Ph.D. degrees from the Budapest University of Technology and Economics, Budapest, Hungary, in 1992 and 1999, respectively, both in computer engineering. In 1994 and 1995, she was a Visiting Researcher with Polytechnic University of Turin, Turin, Italy, and with Boston University, Boston, MA, where she conducted research on optical packet switching solutions. In 1998, she was a Senior Researcher with the Hungarian Telecommunication Company. Since 1999, she has been with the KTH Royal Institute of Technology, Stockholm, Sweden, where she now acts as an Associate Professor with the Laboratory for Communication Networks. Her current research interests include network performance evaluation, cognitive and cooperative communication, and protocol design for sensor and multimedia networking.
