

IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING, VOL. 6, NO. 2, JUNE 2020

Deep Reinforcement Learning for Dynamic Spectrum Sensing and Aggregation in Multi-Channel Wireless Networks

Yunzeng Li, Wensheng Zhang, Member, IEEE, Cheng-Xiang Wang, Fellow, IEEE, Jian Sun, Member, IEEE, and Yu Liu

Abstract—In this paper, the problem of dynamic spectrum sensing and aggregation is investigated in a wireless network containing N correlated channels, where these channels are occupied or vacant following an unknown joint 2-state Markov model. At each time slot, a single cognitive user with a certain bandwidth requirement either stays idle or selects a segment comprising C (C < N) contiguous channels to sense. Then, the vacant channels in the selected segment are aggregated to satisfy the user requirement. The user receives a binary feedback signal indicating whether the transmission is successful (i.e., an ACK signal) after each transmission, and makes the next decision based on the sensed channel states. Here, we aim to find a policy that maximizes the number of successful transmissions without interrupting the primary users (PUs). The problem can be considered a partially observable Markov decision process (POMDP) because the system environment is not fully observable. We implement a Deep Q-Network (DQN) to address the challenges of unknown system dynamics and computational expense. The performance of DQN, Q-Learning, and the Improvident Policy with known system dynamics is evaluated through simulations. The simulation results show that DQN can achieve near-optimal performance across different system scenarios based only on partial observations and ACK signals.

Manuscript received September 1, 2019; revised January 8, 2020 and February 14, 2020; accepted February 14, 2020. Date of publication March 23, 2020; date of current version June 9, 2020. The authors acknowledge the support from the National Key R&D Program of China under Grant 2018YFB1801101, the National Natural Science Foundation of China (NSFC) under Grant 61960206006, the Fundamental Research Funds of Shandong University under Grants 2017JC029 and 2017JC009, the China Scholarship Council (CSC) under Grant 201806225029, the High Level Innovation and Entrepreneurial Research Team Program in Jiangsu, the High Level Innovation and Entrepreneurial Talent Introduction Program in Jiangsu, the Research Fund of National Mobile Communications Research Laboratory, Southeast University, under Grant 2020B01, the Fundamental Research Funds for the Central Universities under Grant 2242019R30001, the Taishan Scholar Program of Shandong Province, the EU H2020 RISE TESTBED2 project under Grant 872172, and the Shandong Natural Science Foundation under Grant ZR2019BF040. The associate editor coordinating the review of this article and approving it for publication was Y. Wu. (Corresponding authors: Wensheng Zhang; Cheng-Xiang Wang.)

Yunzeng Li, Wensheng Zhang, and Jian Sun are with the School of Information Science and Engineering, Shandong Provincial Key Laboratory of Wireless Communication Technologies, Shandong University, Qingdao 266237, China (e-mail: [email protected]; [email protected]; [email protected]).

Cheng-Xiang Wang is with the National Mobile Communications Research Laboratory, School of Information Science and Engineering, Southeast University, Nanjing 210096, China, and also with the Pervasive Communication Research Center, Purple Mountain Laboratories, Nanjing 211111, China (e-mail: [email protected]).

Yu Liu is with the School of Microelectronics, Shandong University, Jinan 250101, China (e-mail: [email protected]).

Digital Object Identifier 10.1109/TCCN.2020.2982895

Index Terms—Dynamic spectrum aggregation, dynamic spectrum sensing, deep reinforcement learning, deep Q-network, POMDP.

I. INTRODUCTION

IN WIRELESS networks, the spectrum is assigned to primary users (PUs) under a static and inflexible spectrum allocation policy, in which spectrum holes are left unutilized in the temporal or frequency domain, as shown in Fig. 1. With growing spectrum demand and limited spectrum resources, it is necessary to address the problem of spectrum underutilization and inefficiency. Cognitive radio [1], [2] allows secondary users (SUs) to sense and leverage the spectrum holes that are not occupied by PUs, thereby improving spectrum utilization and alleviating spectrum scarcity. As shown in Fig. 2, there are two main parts in cognitive radio: the primary network and the cognitive network. PUs in the primary network are licensed to use spectrum bands, while SUs in the cognitive network have to access the spectrum holes in an opportunistic manner. The spectrum holes, however, are discrete and usually insufficient to meet SUs' demands. As a solution, spectrum aggregation [3], [4] has attracted great attention. Spectrum aggregation means that a user can simultaneously access multiple discrete spectrum holes through Discontiguous Orthogonal Frequency Division Multiplexing (D-OFDM) [5] and aggregate them into a sufficiently wide band for successful transmission. Although the aggregation capacity (i.e., the range of the aggregated bands) is fixed due to hardware limitations [6], [7], spectrum aggregation will play a critical role in future cognitive radio networks [8].

A. Existing Spectrum Occupancy Models

Given that the spectrum occupancy activity of PUs causes the dynamics and uncertainty of the spectrum environment, a reasonable spectrum occupancy model is necessary to describe the channel state transitions for the utilization of spectrum holes. A spectrum occupancy model can provide a reliable basis for SUs to predict future spectrum occupancy status, so that conflicts between PUs and SUs are effectively reduced. Many spectrum occupancy models, as shown in Fig. 3, have been discussed to simulate the behavior of PUs and describe the time-varying spectrum environment precisely.

2332-7731 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.


Fig. 1. The concept of spectrum holes.

Fig. 2. Structure of a cognitive wireless network.

The usage percentage of these models is shown in Fig. 4 [9], [10]. These models can be divided into time-domain, frequency-domain, and space-domain models from the perspective of spectrum measurement, or into Markov process, queuing theory, ON/OFF, time series, mathematical distribution, and miscellaneous models on the basis of the modeling approach. Different models focus on different characteristics of the wireless spectrum environment, and no model is completely applicable to all wireless scenarios. Since we focus on the state transitions of the spectrum environment, the most widely used Markov model is adopted in this paper.

B. Deep Reinforcement Learning and Related Works

In recent years, Machine Learning (ML) has made great achievements, not only in computer vision and natural language processing [11]–[14], but also in wireless communication [15]–[18], giving rise to a collection of theoretical research on optimization principles [19]–[23]. As an important branch of ML, Reinforcement Learning (RL) is characterized by frequent interaction with a changing and uncertain environment to acquire knowledge, which gives it excellent performance in handling dynamic systems [24]–[26]. Q-Learning, implemented in this paper, is one of the most popular RL methods. Instead of trying to model the dynamic characteristics of the Markov decision process, Q-Learning directly estimates the Q-value of each action in each state. The Q-value estimates the expected accumulated discounted reward. The policy can then be executed by selecting the action with the highest Q-value in each state. Different from RL, Deep Reinforcement Learning (DRL) combines Deep Learning (DL) with RL, making it more capable of dealing with huge state spaces and complex computations.

Recently, DRL has achieved significant breakthroughs in dynamic spectrum allocation problems [27]–[35]. The works in [27], [28], [29], [30], and [31] studied the multichannel access problem under the assumption of a Markov spectrum occupancy model. The authors of [27] considered the high correlation between channels, so that the user can access a vacant channel based on historical partial observations. An actor-critic DRL based framework was proposed in [28], [29], and its performance was further improved in [27], especially in scenarios with a large number of channels. In [30], all channels are independent, so the user is assumed to have full observation of the system via wideband spectrum sensing techniques. The independent channel model is also adopted in [31], but the authors of [31] considered the presence of spectrum sensing errors, and the position of each user is specified in the proposed scenario. Moreover, multi-user scenarios are also studied in [29] and [31] through distributed learning. However, in most of the aforementioned works the user selects only one channel to access at each time slot in the hope of avoiding collisions. The authors of [29] considered a scenario where the user can access more than one channel at a time, but this is unrelated to the user's bandwidth requirement. In other words, previous works mainly focused on allocating a single channel to the user without taking into account the user's demand for bandwidth.

Our system model differs considerably from the previous ones. What is new in our work is that we consider the user's requirement for broadband transmission (i.e., a successful transmission may not be achievable with a single channel) and provide the user with sufficient bandwidth by applying spectrum aggregation technology. In this paper, the correlation between channels is also taken into account, and the user exploits this correlation to avoid sensing the whole frequency band. The user only needs to sense a segment of multiple channels, and the vacant channels in the selected segment are aggregated for transmission. Meanwhile, the next segment to be sensed is determined according to the sensing results. The problem can be formulated as a partially observable Markov decision process (POMDP), where the user cannot accurately know the current state of the environment due to incomplete environmental observations.

C. Contributions

We implement a Deep Q-Network (DQN) [36], [37] to approximate the action-value function, which gives estimated Q-values of the user's available actions with the partial observations of channel states as input.


Fig. 3. Classification of existing spectrum occupancy models.

Fig. 4. Usage percentage of all spectrum occupancy models.

We apply DQN to the dynamic spectrum sensing and aggregation problem in correlated channels to find a good policy for coping with the uncertain spectrum environment. The major contributions and novelties of our work can be summarized as follows:

• The bandwidth requirement of the user is considered in the correlated multichannel spectrum environment, which is modeled as a Markov chain. The user is given the spectrum aggregation capability to synthesize reliable frequency bands for successful transmission based on partial observations of the spectrum. We describe the problem as dynamic spectrum sensing and aggregation, and formulate it as a POMDP.

• DQN is adopted for the dynamic spectrum sensing and aggregation problem to deal with the uncertain spectrum environment. The action-value function is given by DQN through online learning to guide the user's decisions with no prior knowledge of the system dynamics and low computational complexity.

• Q-Learning and the Improvident Policy, which is based on full knowledge of the system, are used to evaluate the performance of DQN. Simulations suggest that DQN can provide near-optimal performance compared with Q-Learning and the Improvident Policy.

The rest of the paper is organized as follows. Section II formulates the dynamic spectrum sensing and aggregation problem when channels are potentially correlated. Section III presents the Improvident Policy and the DQN framework to solve this problem, and Section IV shows through simulations that DQN can achieve near-optimal performance across different system scenarios. Finally, Section V concludes our work.

II. PROBLEM FORMULATION

The multichannel access problem has been studied in [27]–[31] with a single user, where the state transition of the channels is modeled as a joint Markov chain. The correlation between channels has been taken into account in [27]–[29], while an independent channel model has been used in [30], [31]. The efficiency of spectrum aggregation and the spectrum assignment problem have been studied in [6] and [7], respectively. The authors of [6], [7] considered the bandwidth demand of SUs and the fixed aggregation capability in a multichannel network. Based on the above works, we focus on the dynamic spectrum sensing and aggregation problem with a single user over several correlated channels. Furthermore, the user's bandwidth requirement and the fixed aggregation capacity are taken into account in our work. In this section we formulate the problem in detail.

A. System Model

We consider a wireless network containing N correlated channels whose states can be either vacant (0) or occupied (1). The joint state transition of these channels follows a 2^N-state Markov model. Generally, the SU needs to sense the states of all channels and aggregate the vacant channels among them. However, due to the limited aggregation capability, only the vacant channels within the aggregation range can be utilized by the user, which makes full-band sensing inefficient. Some existing works assume the dynamic radio environment follows a simple independent-channel model, while in practice external interference results in a high degree of correlation


between these channels in wireless networks [27]. Based on the correlation between channels, the single SU with a certain bandwidth demand d only needs to select a segment comprising C channels to sense and aggregate the vacant channels for transmission, or simply stay idle, at the beginning of each time slot. The segment length is also the user's aggregation capability C, which is determined by hardware limitations, so all sensed vacant channels in the selected segment can be aggregated. If the number of vacant channels in the selected segment is larger (smaller) than d, the transmission is successful (fails), which is indicated by the ACK signal. The ACK signal is a control character sent by the receiving station to the sending station through a control channel, indicating that the data have been received successfully. If the transmitter obtains the ACK signal, it transmits the next block of data; otherwise it repeats the current block. Based on the sensing of the selected segment, the user decides what action to take in the next time slot. The goal is to achieve as many successful transmissions as possible over time.

As the user can only sense the selected segment and has no full observation of the system, the problem can be formulated as a POMDP, where the user's environmental observation is incomplete at each time slot. Consequently, the user cannot accurately know even the current state of the system, and predicting the next state is more difficult still. Without knowledge of the system dynamics, partial observations lead to a larger state space and higher computational complexity. The user is expected to deduce the current state from partial observations based on the channel correlations, and to infer the next state by learning the Markov process.

B. State Space

Consider a wireless network with N correlated channels divided from a shared bandwidth. Given that each channel has two possible states, occupied (1) or vacant (0), the whole system can be described as a 2^N-state Markov model, and the state space is denoted as

S = \{ s = (s_1, \ldots, s_N) \mid s_i \in \{0, 1\},\ i \in \{1, \ldots, N\} \}.  (1)

Let

P = \begin{bmatrix} p_{00} & p_{10} \\ p_{01} & p_{11} \end{bmatrix}

be the transition matrix of the Markov chain; the state transition of each channel is shown in Fig. 5. The dynamic change of the spectrum environment over time in the whole system is shown in Fig. 6.
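To make the joint channel model concrete, the following minimal Python sketch (our own illustration, not the authors' simulation code) evolves a few independent channels with a 2-state Markov chain and ties every other channel to one of them with full correlation, matching the correlated-channel setup used later in the simulations of Section IV; the transition probabilities chosen here are hypothetical.

```python
import numpy as np

# Minimal illustrative sketch of the correlated-channel Markov model:
# a few independent channels evolve with a 2-state Markov chain, and every
# other channel is fully correlated with one of them (rho = +1: same state,
# rho = -1: opposite state). Transition values below are hypothetical.

P = np.array([[0.8, 0.2],    # from vacant (0): stay vacant / become occupied
              [0.3, 0.7]])   # from occupied (1): become vacant / stay occupied

def step_independent(states, rng):
    """One Markov transition for the independent channels."""
    return (rng.random(states.size) < P[states, 1]).astype(int)

def joint_state(indep, mapping, rho):
    """Full N-channel state: each channel copies or flips its reference channel."""
    s = indep[mapping]
    return s if rho == 1 else 1 - s

rng = np.random.default_rng(0)
N, n_indep, rho = 24, 4, -1
mapping = rng.integers(0, n_indep, size=N)   # reference independent channel per channel
indep = rng.integers(0, 2, size=n_indep)     # initial independent-channel states

for t in range(5):
    indep = step_independent(indep, rng)
    print(t, joint_state(indep, mapping, rho))
```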

C. Action Space and User Observation

A single user with a certain bandwidth demand d (i.e., it requires d vacant channels for broadband transmission) is able to aggregate vacant channels within the range of the aggregation capacity C (C < N). At the beginning of each time slot, the user either stays idle and transmits nothing or selects a length-C segment of the whole band to sense, so there are N − C + 1 segments available for selection. The vacant channels in the selected segment are aggregated for transmission. Let A = {0, 1, ..., N − C + 1} denote the action space: the user chooses the i-th segment at the beginning of time slot t if a_t = i (i ∈ A, i ≠ 0), or transmits nothing and senses the state of the first segment by default if a_t = 0. Spectrum sensing errors are not taken into account. The user's observation, denoted as o_t ∈ {(o_t^1, ..., o_t^C) | o_t^i ∈ {0, 1}, i ∈ {1, ..., C}}, is the basis for determining the next action a_{t+1}. Since the user only senses the state of the selected segment, i.e., the whole system is only partially observable to the user, the problem falls into a general POMDP. However, the user can use the correlation between channels to infer the current system state based on its decision and observation, and then deduce the next state and determine the next decision. If the number of vacant channels in the selected segment is larger than the user demand d, the transmission is successful; otherwise it fails.

Fig. 5. State transition of each channel.
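The per-slot sensing and aggregation interaction can be sketched as below (an illustrative Python fragment under our own naming; we use "at least d vacant channels" as the success condition implied by the demand definition, and the idle action defaults to sensing the first segment as stated above).

```python
def sense_and_transmit(s, a, C, d):
    """One time slot of the user-environment interaction (illustrative sketch).

    s : full channel-state list (0 = vacant, 1 = occupied) of length N
    a : action, 0 = stay idle, i >= 1 = sense/aggregate the i-th length-C segment
    Returns (observation, ack), where observation is the C sensed channel states.
    """
    seg = 1 if a == 0 else a                  # idle users still sense the first segment
    obs = list(s[seg - 1: seg - 1 + C])       # partial observation of the selected segment
    if a == 0:
        return obs, 0                         # nothing transmitted, so no ACK
    vacant = C - sum(obs)                     # vacant channels available for aggregation
    ack = 1 if vacant >= d else 0             # success when the demand d is satisfied
    return obs, ack
```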

D. Reward Design

After action a_t is taken at time slot t, we assume that the user receives a perfectly accurate binary feedback f_t indicating whether its packet is successfully delivered (e.g., an ACK signal). Let f_t = 1 if the transmission has succeeded and f_t = 0 otherwise; the reward function at time slot t is then defined as

r_t(s_t, a_t) = \begin{cases} 0, & \text{if } a_t = 0 \\ 4f_t - 2, & \text{if } 1 \le a_t \le N - C + 1 \end{cases}  (2)

where s_t is the system state at time slot t, which is not completely observable to the user but implicitly determines the binary feedback f_t. Our objective is to find a policy π, a function mapping the observation o_t to the next action a_{t+1} at each time slot, that maximizes the expected accumulated discounted reward

V_\pi(o) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1}(s_{t+1}, \pi(o_t)) \,\middle|\, o_0 = o \right]  (3)

where γ ∈ (0, 1) is a discount factor and π(o_t) is the action taken under policy π at time slot t + 1 when the current observation is o_t. The optimal policy π* can be presented as

\pi^* = \arg\max_\pi V_\pi(o) = \arg\max_\pi \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1}(s_{t+1}, \pi(o_t)) \,\middle|\, o_0 = o \right].  (4)
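As a small sketch, the reward in (2) and a finite-horizon estimate of the objective in (3) can be computed as follows (the discount factor value is the one reported later in Section IV-A).

```python
GAMMA = 0.9  # discount factor gamma, as reported in Section IV-A

def reward(a, ack):
    """Per-slot reward from (2): 0 when idle, +2 on success, -2 on failure."""
    return 0 if a == 0 else 4 * ack - 2

def discounted_return(rewards, gamma=GAMMA):
    """Finite-horizon estimate of the accumulated discounted reward in (3)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```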


Fig. 6. The system spectrum environment.

III. IMPROVIDENT POLICY AND DRL FRAMEWORK

There are two approaches to cope with the dynamic spectrum sensing and aggregation problem in correlated channels formulated in Section II: i) investigate the system transition matrix and make decisions based on prior knowledge of the system dynamics, which is known as model-based planning; or ii) approximate the function mapping observations to the optimal action by interacting with the system directly. In this section, we propose the Improvident Policy with full knowledge of the system dynamics as the first approach to obtain near-optimal performance. Q-Learning and DQN are adopted, without any prior knowledge of the system dynamics, as the second approach.

A. Improvident Policy

The Improvident Policy aims at maximizing the immediate expected reward, which means that prior knowledge of the system dynamics is necessary. Under the Improvident Policy, we assume that the system transition matrix is known and that the current state can be deduced precisely from the partial observation; the policy can then be presented as

\hat{\pi} = \arg\max_a \sum_{s' \in S} P(s' \mid s)\, r(s', a)  (5)

where P(s' | s) is the joint transition probability from the current state s to the next state s'. Given a next possible state s' and an action a, the reward r(s', a) is obtained by checking whether the number of vacant channels in the selected segment satisfies the user demand d, rather than through the ACK signal.

Note that the next state depends only on the current state and is independent of the action taken, so the performance of the Improvident Policy with known system dynamics is near-optimal. The Improvident Policy is primarily designed to benchmark the performance of DQN, so we grant it several favorable assumptions. In practice, however, it is hard to obtain the system dynamics and infer the current state from partial observations.
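A minimal sketch of the Improvident Policy in (5) is given below; it assumes the joint transition probability P(s'|s) and the demand-based reward r(s', a) are supplied as functions, i.e., the favorable assumptions described above (the names and the brute-force enumeration are our own simplification).

```python
import itertools

def improvident_action(s, transition_prob, reward_fn, N, C):
    """Choose the action maximizing the immediate expected reward, as in (5).

    transition_prob(s, s_next) and reward_fn(s_next, a) are assumed known.
    In practice only the 2^i states reachable from s (i independent channels)
    have nonzero probability, so the inner loop can be restricted to them.
    """
    best_action, best_value = 0, float("-inf")
    for a in range(N - C + 2):                               # 0 = idle, 1..N-C+1 = segments
        value = 0.0
        for s_next in itertools.product((0, 1), repeat=N):   # candidate next joint states
            p = transition_prob(s, s_next)
            if p > 0.0:
                value += p * reward_fn(s_next, a)
        if value > best_value:
            best_action, best_value = a, value
    return best_action
```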

B. Q-Learning

RL is an important branch of ML that mainly involves four elements: agent, environment state, action, and reward. The agent acts by observing the state of the environment and obtains rewards, and according to these rewards it gradually acquires an action strategy adapted to the current environment. Therefore, RL is well suited to sequential decision-making problems in Markov processes.

Although the dynamic spectrum sensing and aggregation problem becomes a POMDP due to the partial observability of the whole system, we can convert the POMDP into a Markov decision process (MDP) by regarding x as the system state, where x includes two parts: the sensing action and the corresponding observation of the sensed segment.

In RL the agent interacts with the environment in discrete time and receives a reward r for each state-action pair (x, a), as shown in Fig. 7. Q-Learning, one of the most popular RL methods, aims at finding a sequence of actions that maximizes the expected accumulated discounted reward by approximating an action-value function. The Q-value of action a in state x, given by the action-value function, denotes the expected return of the state-action pair (x, a), and the action with the largest Q-value is chosen at each time slot. We define Q_π(x, a) as the action-value function when a sensing action a is taken in environment state x under policy π. The Q-value of each state-action pair (x, a), denoted as q(x, a), is updated through interaction with the system environment as follows

q(x_t, a_{t+1}) \leftarrow q(x_t, a_{t+1}) + \alpha \left[ r_{t+1} + \gamma \max_{a' \in \mathcal{A}} q(x_{t+1}, a') - q(x_t, a_{t+1}) \right]  (6)

where α ∈ (0, 1] is the learning rate.

The problem with Q-Learning is that the Q-value of each state-action pair is stored in a look-up table. A large system state space causes the Q-value table to grow enormously. As a consequence, the Q-values of some state-action pairs cannot be sufficiently updated, or are seldom updated at all, within a limited number of iterations.
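For reference, a minimal tabular Q-Learning agent implementing the update in (6) might look as follows (hyper-parameter values are illustrative; the state x is the hashable (action, observation) pair defined above).

```python
import random
from collections import defaultdict

class TabularQLearning:
    """Minimal tabular Q-Learning sketch for the (x, a) formulation; the state x
    is the (action, observation) pair, stored as a hashable tuple."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)          # look-up table q(x, a), default 0
        self.n_actions = n_actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def select_action(self, x):
        if random.random() < self.epsilon:   # explore with probability epsilon
            return random.randrange(self.n_actions)
        return max(range(self.n_actions), key=lambda a: self.q[(x, a)])

    def update(self, x, a, r, x_next):
        """Update rule (6): move q(x, a) toward the bootstrapped target."""
        target = r + self.gamma * max(self.q[(x_next, a2)] for a2 in range(self.n_actions))
        self.q[(x, a)] += self.alpha * (target - self.q[(x, a)])
```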

C. Deep Q-Network

The performance of traditional RL methods is limited by the scale of the state space and the action space of the problem. However, complex and realistic tasks are characterized by large state spaces and continuous action spaces, which are intractable with general RL methods. DRL combines DL with RL to enable the agent to deal with complex states and actions, and DQN is one of the most popular DRL methods. In DQN, a deep neural network is adopted to replace the look-up table used in Q-Learning and to provide the Q-value of each state-action pair.


Fig. 7. Interaction between agent and environment in reinforcement learning.

Algorithm 1 DQN Algorithm for Dynamic Spectrum Sensing and Aggregation

Input: memory size M, mini-batch size B, discount rate γ, learning rate α, ε in the ε-greedy policy, target network update frequency F, and the number of iterations I_max.

Initialize the Q-network Q(x_t, a_{t+1}; θ) and its target network Q̂(x_t, a_{t+1}; θ̂) with random weights. Initialize the starting action a_0 and execute it to get the initial state x_0. Initialize Train = True.

For t = 1, 2, ... do
    If Train Then
        Choose a_t by the ε-greedy policy.
    Else
        a_t = argmax_a Q(x_{t−1}, a; θ).
    End If
    Execute action a_t and collect r_t and x_t.
    If Train Then
        Store (x_{t−1}, a_t, r_t, x_t) in the memory unit.
        If t ≥ M Then
            Remove the oldest experience tuple from the memory unit.
        End If
    End If
    If Train and t ≥ M Then
        Sample a random mini-batch of experience tuples from the memory unit.
        Compute the loss function L(θ) and update the weights θ.
        If (t − M) mod F = 0 Then copy the weights θ → θ̂.
        If (t − M) > I_max Then Train = False.
    End If
End For

The neural network enables DQN to tackle the curse of dimensionality resulting from the large system state space, which is intractable with the Q-value table in Q-Learning. The main process of DQN for dynamic spectrum sensing and aggregation is detailed in Alg. 1. The structure of DQN is presented in Fig. 8, and each component is specified below.

1) Input Layer: The input of DQN is the state x_t, which includes the sensing action taken at time slot t and the corresponding observation, i.e., x_t = [a_t, o_t], where the action vector a_t is the one-hot representation of the action a_t, obtained by setting the (a_t + 1)-th element to 1.

Fig. 8. An illustration of DQN.

2) Output Layer: The output of DQN is a vector of size N − C + 2. The estimated Q-value for the user staying idle is given in the first entry. The (k + 1)-th entry is the estimated Q-value for transmitting in the k-th segment at time slot t + 1, where 1 ≤ k ≤ N − C + 1.

3) Reward Definition: The reward r_t after taking action a_t is obtained through the ACK signal f_t, as defined in (2).

4) Q-Network: The Q-network maps the current state to a vector of action values, denoted as Q(x_t, a_{t+1}; θ): x_t → {q(x_t, a; θ) | a ∈ A}, where θ denotes the network parameters. Given a state x, the Q-values obtained from the Q-network are estimates of the expected accumulated discounted rewards of all actions. After the training process, the action with the largest Q-value is taken in each time slot.

5) Action Selection: In the initial stage of training, the Q-value of each state-action pair is inaccurate since the network has not converged. If we always take the action with the largest Q-value, most actions will never be executed and their corresponding Q-values cannot be effectively updated. To avoid local optima, the ε-greedy policy is adopted for action selection: the action a_{t+1} = argmax_a Q(x_t, a; θ) is chosen with probability 1 − ε, while a random action is selected with probability ε.
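A small sketch of the ε-greedy selection, together with a decay schedule for ε (the paper states only that ε starts at 0.9 and reaches 0 after 10000 iterations; the linear shape below is our assumption):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Return argmax_a Q(x, a; theta) with prob. 1 - epsilon, else a random action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def epsilon_schedule(t, eps_start=0.9, decay_steps=10000):
    """Decay epsilon from eps_start to 0 over decay_steps iterations (linear shape assumed)."""
    return max(0.0, eps_start * (1.0 - t / decay_steps))
```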

6) Experience Replay: For each time slot t, we refer to (x_t, a_{t+1}, r_{t+1}, x_{t+1}) as an experience tuple stored in a memory unit, and a mini-batch of experience tuples is sampled in each iteration for training. As a supervised learning model, a deep neural network requires the training data to be independent and identically distributed, but the samples obtained through interaction with the environment are correlated. Experience replay breaks this correlation through the store-and-sample mechanism.
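A minimal replay-memory sketch (capacity M and uniform sampling as described; the implementation details are our own):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size experience memory: old tuples are discarded once capacity M is
    reached, and training samples are drawn uniformly at random."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # deque drops the oldest item automatically

    def store(self, x, a, r, x_next):
        self.buffer.append((x, a, r, x_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```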

7) Target Network: We implement a target network, denoted as Q̂(x_t, a_{t+1}; θ̂), to generate the target value. The target network has the same structure as the Q-network, and its parameters θ̂ are copied from the Q-network at regular intervals. In other words, the target network is updated at a lower frequency, which ensures that the target value used by the Q-network during training is relatively stable.

8) Loss Function: The loss function is defined as the mean square error between the target value and the Q-value, i.e.,

L(\theta) = \mathbb{E}\left[ \left( y_j - Q(x_j, a_{j+1}; \theta) \right)^2 \right]  (7)


Algorithm 2 Re-Training of DQN

Train the DQN and find a good policy as in Alg. 1.
For t = 1, 2, ... do
    Perform actions according to the trained DQN and receive rewards.
    Calculate the accumulated reward.
    If the accumulated reward is less than a given threshold Then
        Re-train the DQN as in Alg. 1 to find a new policy.
    End If
End For

where y_j is the target value combining the output of the target network and the reward

y_j = r_{j+1} + \gamma \max_{a \in \mathcal{A}} \hat{Q}(x_{j+1}, a; \hat{\theta}).  (8)
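Putting (7) and (8) together, one mini-batch update and the periodic target-network copy could be sketched as follows, assuming TensorFlow 2.x in eager mode and Keras models for the Q-network and target network (function names and tensor handling are our own illustration, not the authors' code).

```python
import numpy as np
import tensorflow as tf

def train_step(q_net, target_net, optimizer, batch, gamma=0.9):
    """One mini-batch update implementing (7)-(8): targets come from the target
    network and the Q-network is regressed toward them with an MSE loss.
    batch is assumed to be a list of (x, a, r, x_next) experience tuples."""
    x = np.array([b[0] for b in batch], dtype=np.float32)
    a = np.array([b[1] for b in batch], dtype=np.int32)
    r = np.array([b[2] for b in batch], dtype=np.float32)
    x_next = np.array([b[3] for b in batch], dtype=np.float32)

    # y_j = r_{j+1} + gamma * max_a Q_hat(x_{j+1}, a; theta_hat)   -- Eq. (8)
    q_next = target_net(x_next).numpy()
    y = r + gamma * q_next.max(axis=1)

    with tf.GradientTape() as tape:
        q_all = q_net(x)                                      # Q(x_j, a; theta) for all actions
        idx = tf.stack([tf.range(len(batch)), a], axis=1)
        q_taken = tf.gather_nd(q_all, idx)                    # Q-value of the action taken
        loss = tf.reduce_mean(tf.square(y - q_taken))         # Eq. (7)
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return float(loss)

def sync_target(q_net, target_net):
    """Copy theta -> theta_hat every F steps (target-network update)."""
    target_net.set_weights(q_net.get_weights())
```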

As shown in Alg. 1, unlike in general DL, there is no training or test data set for network training. In the proposed scheme, the user has to choose a channel segment for sensing and aggregation according to its bandwidth demand and aggregation capacity in a dynamic spectrum environment. The user then receives feedback (i.e., an ACK signal) indicating whether the transmission is successful. We implement DRL to find a good selection strategy in the unknown dynamic environment. The user's selection of channel segments is treated as an interaction with the environment, and the reward is designed based on the ACK signal. The user simply interacts with the environment continuously and learns from the rewards received in the process. When the learning process is finished, the user knows which actions should be taken in different environment states to obtain a greater reward. Therefore, no special training and test sets need to be provided; only an interactive environment and accurate feedback are needed. This is why a training set is not required in the proposed scheme. During the learning process, the results of the user's interactions with the environment over a past period of time are stored in the memory unit as experience tuples, which are taken out in batches for network training. This process is called experience replay in RL. Since the experience tuples in the memory unit change over time, they cannot be considered a data set.

Note that the ACK signal not only provides rewards during the training process but also serves as a monitoring signal during DQN deployment. Performance degradation caused by changes in the system environment is reflected in the ACK signal, which reminds DQN to enter the training process again. In other words, DQN can use the ACK signal to adjust itself in time to better adapt to the dynamic spectrum environment. The whole process is shown in Alg. 2.

IV. SIMULATION RESULTS AND DISCUSSIONS

In this section we compare DQN with three other policies: the Improvident Policy, Q-Learning, and the Random Policy. We assume that under the Improvident Policy the user can deduce the current states of all system channels precisely based on the partial observation, and that the system transition pattern is also known, even though this is hard to achieve in practice.

TABLE I
HYPER-PARAMETERS OF DQN

In the Random Policy, the user randomly selects one of the available segments and aggregates the vacant channels for transmission at the beginning of each time slot.

A. Details of DQN

The neural network adopted in the DQN has four fully connected layers, with each hidden layer containing 50 neurons and using ReLU as the activation function. ReLU is the most popular activation function, is simple to compute, and helps avoid gradient explosion. The Adam algorithm [38] is applied to implement gradient descent when updating the parameters of the DQN; it is a commonly used optimizer with a high convergence speed. The memory size is selected according to the number of possible system states and available user actions to ensure that the experience tuple of each state-action pair can be included. The mini-batch size is the number of samples fed into the model at each network training step; 32 is a commonly used value in DL, which guarantees a fast convergence speed and avoids huge memory occupation. The ε in the ε-greedy policy is initially set to 0.9 and decreases to 0 over 10000 iterations. At the beginning of training, the network has not converged, so we encourage the user to explore the rewards brought by different actions. When the training is over, the network has converged, and the user can choose the action according to the given Q-values. Therefore, ε should gradually decrease from a value close to 1 during training. The discount rate (usually close to 1) is set to 0.9 so that the user focuses on rewards in the immediate future, because paying more attention to the long-term reward would make the training process slower and more difficult due to the uncertainty of the long-term reward. Details of the DQN hyper-parameters are summarized in Table I. The simulation is implemented in TensorFlow with GPU.
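For concreteness, a Keras model definition consistent with this description might look as follows; interpreting "four fully connected layers" as three 50-neuron hidden layers plus the output layer, and the learning-rate value, are our assumptions (the exact values belong to Table I).

```python
import tensorflow as tf

def build_q_network(N=24, C=8, lr=0.01):
    """Q-network sketch matching the description in Section IV-A: fully connected
    ReLU hidden layers of 50 neurons and a linear output of size N-C+2.
    The depth split and learning rate are assumptions, not values from the paper."""
    input_dim = (N - C + 2) + C            # one-hot action vector plus C observed states
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(50, activation="relu", input_shape=(input_dim,)),
        tf.keras.layers.Dense(50, activation="relu"),
        tf.keras.layers.Dense(50, activation="relu"),
        tf.keras.layers.Dense(N - C + 2),  # one Q-value per action (idle + N-C+1 segments)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr), loss="mse")
    return model

q_net = build_q_network()
target_net = build_q_network()
target_net.set_weights(q_net.get_weights())   # initialize the target network as a copy
```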

B. Performance Evaluation

We consider a system containing N = 24 highly correlated channels, where several independent channels follow the same 2-state Markov chain with transition matrix P, while the state of every other channel is the same as (correlation coefficient ρ = 1) or opposite to (ρ = −1) that of an independent channel. Note that which channels are interrelated is determined randomly.

As discussed in Section II, the user can stay idle (a = 0) or select one of the available segments (a = i, i > 0) to sense.


Fig. 9. Decision accuracy of four policies in 10 different scenarios when C = 8 and d = 4.

There are four possible situations after an action is taken: i) a = 0 and none of the segments in the system environment can satisfy the user's transmission demand, i.e., a successful transmission is impossible in the current system state; ii) a = 0 but there exist segments in which the number of vacant channels could support a successful transmission; iii) a = i, i > 0 and the transmission succeeds; and iv) a = i, i > 0 but the transmission fails, which means that the PUs are disrupted to some extent. In practice, the user is unable to distinguish the first two situations due to its partial observation of the system, but the difference between them is taken into account in the simulations to evaluate performance. We assume the user has made the right decision if it achieves a successful transmission or stays idle when the system environment offers no feasible segment, and we calculate the decision accuracy over 10000 time slots.

1) Decision Accuracy: Fig. 9 shows the decision accuracy of the four policies: DQN, the Improvident Policy, Q-Learning, and the Random Policy. We assume there are 4, 5, or 6 independent channels in the system and change the transition matrix P to obtain 10 different dynamic system scenarios, with the correlation coefficient ρ = −1 in the first six scenarios and ρ = 1 in the last four. Specifically, there are 4 independent channels in scenarios 1, 2, 7, and 8; 5 independent channels in scenarios 3, 4, and 9; and 6 independent channels in scenarios 5, 6, and 10. Moreover, four different transition matrices are used in these ten scenarios. As shown in Fig. 9, the Improvident Policy, which is assumed to have full knowledge of the system dynamics, as well as DQN, performs better than Q-Learning. In spite of the scarcity of prior system knowledge, DQN achieves performance very close to the Improvident Policy in most scenarios, which indicates the strong learning ability of DQN in dynamic environments. Q-Learning works worse than DQN because it is incapable of dealing with the large state space.

Additionally, the performance curves of the three policies above fluctuate more across scenarios than that of the Random Policy, which indicates that the state transitions in different scenarios have different degrees of randomness, limiting the achievable performance even if the system dynamics are fully known.

Fig. 10. Accumulated discounted reward of four policies in scenario 1 when C = 8 and d = 4.

If the randomness of the state transition of each channel were very small (i.e., the transition probabilities were very large or very small), it would be easy for the user to predict the next state of the system environment and thus achieve a high decision accuracy, such as 100% in scenario 1. If the randomness of the state transition of each channel were very large, the achievable optimal decision accuracy would be limited under any policy. This is why in scenario 7 the decision accuracy of DQN, the Improvident Policy, and Q-Learning is about 60%. The different scenarios are independent of each other in our simulations. In different scenarios, there are different numbers of states, different state transition probabilities, and different sets of correlated channels. So even under the same policy, the achievable optimal performance differs across scenarios due to the limitations of the environmental conditions. Therefore, the performance curve fluctuates with the scenario. However, as shown in the figure, the proposed DQN performs close to the near-optimal Improvident Policy in various scenarios, indicating that the proposed algorithm can achieve near-optimal performance regardless of the spectrum environment.

Fig. 10 shows the accumulated discounted reward over time of the four policies in scenario 1. It can be seen that, except for the Random Policy, the accumulated discounted rewards of the policies increase gradually with time and finally stabilize. The curves of DQN and the Improvident Policy coincide and stabilize at the highest value, indicating that DQN achieves performance extremely close to that of the Improvident Policy even in the absence of system dynamics information.

In addition, we show the change of the maximum Q-value over time to illustrate the learning process of DQN. The maximum Q-value over all actions in a given state represents the estimate of the maximum expected cumulative discounted reward. We calculated the average maximum Q-value in each iteration from ten different initial states of the first five scenarios. As shown in Fig. 11, the average maximum Q-value in all scenarios increases and then remains stable, indicating that DQN gradually learns a good policy and maintains it.

Previous works [27]–[30] tended to adopt the obtained reward for performance evaluation. In our work, we define different performance evaluation parameters for the four different situations discussed above. It is necessary to distinguish situations i) and ii) accurately in the performance evaluation.


Fig. 11. Average maximum Q-values of DQN in the training process when C = 8 and d = 4.

Staying idle when the spectrum environment is in good condition is not the correct decision, but it is still better than a failed transmission, because a transmission failure leads to interference to the PUs. However, these two situations cannot be differentiated through the reward, because staying idle means the obtained reward is 0. So we treat situations i) and iii) as the right decision, situation ii) as the conservative decision, and situation iv) as the wrong decision, and assign different weights to them in the definition of the modified decision accuracy. We define the modified decision accuracy as

\text{modified decision accuracy} = \frac{\#\text{success} + \beta \times \#\text{conservative}}{\#\text{timeslots}}  (9)

where #success is the number of times the user makes the correct decision, #conservative is the number of times the user stays idle while the system environment is in good condition, β is the measurement weight of such conservative choices, which is set to 0.5 in this paper, and #timeslots is the total number of time slots.

In the definition of the modified decision accuracy, the weights of the correct decision and the failed transmission can be viewed as 1 and 0, respectively. So the weight of #conservative should be between 0 and 1, depending on how much we approve of this conservative action. We regard the conservative action as a half-correct decision, so the weight of #conservative is set to the middle value. The result is shown in Fig. 12.

Compared with Fig. 9, the modified decision accuracy of DQN is higher than its decision accuracy in the last three scenarios, while for the other policies the change is small. This means that at some time slots the DQN user stays idle and avoids transmission failures, consequently reducing the interference to PUs.

We can also directly compare the interference of the four policies resulting from transmission failures, where the interference is defined as

\text{interference} = \frac{\#\text{failure}}{\#\text{timeslots}}  (10)

where #failure is the number of failed transmissions.
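Both evaluation metrics reduce to simple counters over the simulated time slots; a direct transcription of (9) and (10):

```python
def modified_decision_accuracy(n_success, n_conservative, n_timeslots, beta=0.5):
    """Metric (9): right decisions count fully, conservative idling with weight beta."""
    return (n_success + beta * n_conservative) / n_timeslots

def interference(n_failure, n_timeslots):
    """Metric (10): fraction of time slots with a failed transmission."""
    return n_failure / n_timeslots
```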

Fig. 12. Modified decision accuracy of four policies in 10 different scenarios when C = 8 and d = 4.

Fig. 13. Interference of four policies in 10 different scenarios when C = 8 and d = 4.

Fig. 13 shows that, compared with the Improvident Policy, the interference of DQN is similar or even lower, which is why the modified decision accuracy of DQN is better than that of the Improvident Policy in the last three scenarios.

The proposed DQN can in theory be extended to cases with any user demand and aggregation capacity. However, the aggregation capacity and the user demand jointly determine the complexity of the dynamic spectrum sensing and aggregation problem. If the aggregation capability were strong enough and the user requirement small, it would be easy to achieve successful transmissions under any policy. If the aggregation capability were weak and the user requirement large, a successful transmission would tend to be impossible. We therefore set the aggregation capacity to 1/3 of the full band and the user demand to 1/2 of the aggregation capacity to make the problem more reasonable and practical. We also compare the performance of the four policies with different values of the user demand and aggregation capacity to show that the proposed algorithm performs well regardless of these values.

The robustness of DQN with different aggregation capacities C and user demands d is verified, and the results are shown in Figs. 14–17. We found that in most scenarios DQN performs close to the Improvident Policy even without any knowledge of the system dynamics.


Fig. 14. Modified decision accuracy of four policies in 10 different scenarios when C = 8 and d = 3.

Fig. 15. Modified decision accuracy of four policies in 10 different scenarios when C = 8 and d = 5.

Fig. 16. Modified decision accuracy of four policies in 10 different scenarios when C = 7 and d = 4.

In some special scenarios DQN performs best, mainly because DQN tends to stay idle to avoid transmission failures at some intractable time slots. In a very few scenarios, the performance of DQN is significantly worse than that of the Improvident Policy, but it is still better than that of Q-Learning.

2) Computational Complexity: From the perspective of temporal complexity, DQN and Q-Learning have an obvious advantage over the Improvident Policy.

Fig. 17. Modified decision accuracy of four policies in 10 different scenarios when C = 9 and d = 4.

Fig. 18. Average processing time of three policies in 10 different scenarios.

The mapping from state to action has already been learned during the training process, while the Improvident Policy needs to apply the knowledge of the system dynamics at each time slot. The Improvident Policy has to compute the transition probabilities from the current state to the next possible states, as well as the reward for each action in each of the next possible states. So the temporal complexity of the Improvident Policy can be presented as O(2^i + 2^i × (N − C + 2)), where i is the number of independent channels, 2^i denotes the number of next possible states, and N − C + 2 is the number of actions. Both DQN and Q-Learning have a temporal complexity of O(1) since there is no recurrent computation.

In Fig. 18 we show the average processing time of three policies over 10000 time slots on our device to verify the above discussion. The Improvident Policy has significantly lower time efficiency than the other two strategies. What is worse, the curve of the Improvident Policy varies greatly across scenarios, because each additional independent channel doubles the number of next possible states, and thus the processing time. Q-Learning clearly has the least processing time in that it uses table lookup instead of computation. DQN provides a performance very close to Q-Learning in temporal complexity.


Although DQN and Q-Learning require extra training time compared with the Improvident Policy, they have greater advantages in the long run. Incidentally, Q-Learning, which has the lowest temporal complexity, also has the highest spatial complexity due to the huge size of its look-up table. Another advantage of DQN and Q-Learning over the Improvident Policy is that their curves barely fluctuate across scenarios, which reflects the robustness of their performance in different system environments.

In terms of spatial complexity, since DQN has a fixed network structure, the storage space of the algorithm is not affected no matter how large the environment state space and action space are. So the spatial complexity of DQN is O(1). The Improvident Policy needs to traverse all possible next states to obtain the expected reward of each action, so its spatial complexity is proportional to the number of states, which can be formulated as O(2^i). Although Q-Learning has higher temporal efficiency than the other policies, it incurs a huge spatial complexity of O(2^i × (N − C + 2)) because Q-Learning has to store the Q-value corresponding to each action in each state. As discussed above, only the spatial complexity of DQN does not increase with the size of the problem. While the spatial complexity of the Improvident Policy is affected by the state space, the spatial complexity of Q-Learning is affected by both the state space and the action space.

V. CONCLUSION

In this paper, we have considered the correlation between channels in wireless networks and modeled the dynamic spectrum environment as a joint Markov chain. We have assumed that the SU with a certain bandwidth demand has a fixed aggregation capacity to access multiple vacant channels simultaneously for successful transmission. At each time slot, the SU either stays idle or selects a segment of the spectrum to sense. The segment length is determined by the aggregation capacity so that the SU can aggregate the vacant channels in the selected segment. The next user decision is made based on the sensed state of the selected segment. We have formulated the dynamic spectrum sensing and aggregation problem as a POMDP and proposed a DQN framework to address it. We have compared the performance of three different policies: DQN, Q-Learning, and the Improvident Policy with known system dynamics. Simulations have shown that DQN can achieve near-optimal decision accuracy in most system scenarios even without prior knowledge of the system dynamics. The performance is also robust across different aggregation capacities and different user bandwidth demands. Moreover, DQN has the lowest computational complexity in both time and space, and its temporal and spatial complexity is not affected by the expansion of the state space or action space of the problem.

REFERENCES

[1] J. Mitola and G. Q. Maguire, “Cognitive radio: Making software radios more personal,” IEEE Pers. Commun., vol. 6, no. 4, pp. 13–18, Aug. 1999.

[2] Y.-C. Liang, K.-C. Chen, G. Y. Li, and P. Mahonen, “Cognitive radio networking and communications: An overview,” IEEE Trans. Veh. Technol., vol. 60, no. 7, pp. 3386–3407, Sep. 2011.

[3] A. Shukla, B. Willamson, J. Burns, E. Burbidge, A. Taylor, and D. Robinson, A Study for the Provision of Aggregation of Frequency to Provide Wider Bandwidth Services, QinetiQ Ltd., Farnborough, U.K., 2006.

[4] W. Wang, Z. Zhang, and A. Huang, “Spectrum aggregation: Overview and challenges,” Netw. Protocols Algorithms, vol. 2, no. 1, pp. 184–196, May 2010.

[5] J. D. Poston and W. D. Horne, “Discontiguous OFDM considerations for dynamic spectrum access in idle TV channels,” in Proc. IEEE Int. Symp. New Front. Dyn. Spectr. Access Netw. (DySPAN), Baltimore, MD, USA, Nov. 2005, pp. 607–610.

[6] B. Gao, Y. Yang, and J. Park, “Channel aggregation in cognitive radio networks with practical considerations,” in Proc. IEEE Int. Conf. Commun. (ICC), Kyoto, Japan, Jun. 2011, pp. 1–5.

[7] F. Huang, W. Wang, H. Luo, G. Yu, and Z. Zhang, “Prediction-based spectrum aggregation with hardware limitation in cognitive radio networks,” in Proc. IEEE Veh. Technol. Conf. (VTC), Taipei, China, May 2010, pp. 1–5.

[8] W. Zhang, C.-X. Wang, X. Ge, and Y. Chen, “Enhanced 5G cognitive radio networks based on spectrum sharing and spectrum aggregation,” IEEE Trans. Commun., vol. 66, no. 12, pp. 6304–6316, Dec. 2018.

[9] M. López-Benítez and F. Casadevall, “An overview of spectrum occupancy models for cognitive radio networks,” in Proc. Int. Conf. Res. Netw., 2011, pp. 32–41.

[10] Y. Saleem and M. H. Rehmani, “Primary radio user activity models for cognitive radio networks: A survey,” J. Netw. Comput. Appl., vol. 43, pp. 1–16, Aug. 2014.

[11] E. Nishani and B. Çiço, “Computer vision approaches based on deep learning and neural networks: Deep neural networks for video analysis of human pose estimation,” in Proc. IEEE Mediterr. Conf. Embedded Comput. (MECO), Bar, Montenegro, Jun. 2017, pp. 1–4.

[12] A. Lucas, M. Iliadis, R. Molina, and A. K. Katsaggelos, “Using deep neural networks for inverse problems in imaging: Beyond analytical methods,” IEEE Signal Process. Mag., vol. 35, no. 1, pp. 20–36, Jan. 2018.

[13] T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent trends in deep learning based natural language processing,” IEEE Comput. Intell. Mag., vol. 13, no. 3, pp. 55–75, Aug. 2018.

[14] Y. Wu, F. Hu, G. Min, and A. Zomaya, Big Data and Computational Intelligence in Networking. Boca Raton, FL, USA: Taylor & Francis, 2017.

[15] L. Bai et al., “Predicting wireless mmWave massive MIMO channel characteristics using machine learning algorithms,” Wireless Commun. Mobile Comput., vol. 2018, Aug. 2018, Art. no. 9783863.

[16] C.-K. Wen, W.-T. Shih, and S. Jin, “Deep learning for massive MIMO CSI feedback,” IEEE Wireless Commun. Lett., vol. 7, no. 5, pp. 748–751, Oct. 2018.

[17] J. Huang et al., “A big data enabled channel model for 5G wireless communication systems,” IEEE Trans. Big Data, vol. 6, no. 2, Jun. 2020.

[18] Z. Zhang, H. Chen, M. Hua, C. Li, Y. Huang, and L. Yang, “Double coded caching in ultra dense networks: Caching and multicast scheduling via deep reinforcement learning,” IEEE Trans. Commun., vol. 68, no. 2, pp. 1071–1086, Feb. 2020.

[19] L. Zhang, W. Zhang, Y. Li, J. Sun, and C.-X. Wang, “Standard condition number of Hessian matrix for neural networks,” in Proc. IEEE Int. Conf. Commun. (ICC), Shanghai, China, May 2019, pp. 1–6.

[20] J. Pennington and Y. Bahri, “Geometry of neural network loss surfaces via random matrix theory,” in Proc. Int. Conf. Mach. Learn. (ICML), Sydney, NSW, Australia, Aug. 2017, pp. 2798–2806.

[21] C. Louart, Z. Liao, and R. Couillet, “A random matrix approach to neural networks,” Ann. Appl. Probab., vol. 28, no. 2, pp. 1190–1248, Apr. 2018.

[22] Q. Nguyen and M. Hein, “The loss surface and expressivity of deep convolutional neural networks,” Oct. 2017. [Online]. Available: https://arxiv.org/abs/1710.10928

[23] H. Wang, Y. Wu, G. Min, J. Xu, and P. Tang, “Data-driven dynamic resource scheduling for network slicing: A deep reinforcement learning approach,” Inf. Sci., vol. 498, pp. 106–116, Sep. 2019.

[24] X. Hu, S. Liu, R. Chen, W. Wang, and C. Wang, “A deep reinforcement learning-based framework for dynamic resource allocation in multibeam satellite systems,” IEEE Commun. Lett., vol. 22, no. 8, pp. 1612–1615, Aug. 2018.

[25] J. Liu, B. Krishnamachari, S. Zhou, and Z. Niu, “DeepNap: Data-driven base station sleeping operations through deep reinforcement learning,” IEEE Internet Things J., vol. 5, no. 6, pp. 4273–4282, Dec. 2018.

[26] H. Zhu, Y. Cao, W. Wang, T. Jiang, and S. Jin, “Deep reinforcement learning for mobile edge caching: Review, new features, and open issues,” IEEE Netw., vol. 32, no. 6, pp. 50–57, Nov./Dec. 2018.

[27] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep reinforcement learning for dynamic multichannel access in wireless networks,” IEEE Trans. Cogn. Commun. Netw., vol. 4, no. 2, pp. 257–265, Jun. 2018.

[28] C. Zhong, Z. Lu, M. C. Gursoy, and S. Velipasalar, “Actor-critic deep reinforcement learning for dynamic multichannel access,” in Proc. IEEE Glob. Conf. Signal Inf. Process. (GlobalSIP), Anaheim, CA, USA, Nov. 2018, pp. 599–603.

[29] C. Zhong, Z. Lu, M. C. Gursoy, and S. Velipasalar, “A deep actor–critic reinforcement learning framework for dynamic multichannel access,” Aug. 2019. [Online]. Available: https://arxiv.org/abs/1908.08401

[30] H. Q. Nguyen, B. T. Nguyen, T. Q. Dong, D. T. Ngo, and T. A. Nguyen, “Deep Q-learning with multiband sensing for dynamic spectrum access,” in Proc. IEEE Int. Symp. Dyn. Spectr. Access Netw. (DySPAN), Seoul, South Korea, Oct. 2018, pp. 1–5.

[31] H. Chang, H. Song, Y. Yi, J. Zhang, H. He, and L. Liu, “Distributive dynamic spectrum access through deep reinforcement learning: A reservoir computing-based approach,” IEEE Internet Things J., vol. 6, no. 2, pp. 1938–1948, Apr. 2019.

[32] O. Naparstek and K. Cohen, “Deep multi-user reinforcement learning for distributed dynamic spectrum access,” IEEE Trans. Wireless Commun., vol. 18, no. 1, pp. 310–323, Jan. 2019.

[33] Y. Yu, T. Wang, and S. C. Liew, “Deep-reinforcement learning multiple access for heterogeneous wireless networks,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1277–1290, Jun. 2019.

[34] X. Liu, Y. Xu, L. Jia, Q. Wu, and A. Anpalagan, “Anti-jamming communications using spectrum waterfall: A deep reinforcement learning approach,” IEEE Commun. Lett., vol. 22, no. 5, pp. 998–1001, May 2018.

[35] P. Yang et al., “Dynamic spectrum access in cognitive radio networks using deep reinforcement learning and evolutionary game,” in Proc. IEEE/CIC Int. Conf. Commun. China (ICCC), Beijing, China, Aug. 2018, pp. 405–409.

[36] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.

[37] M. Fang, Y. Li, and T. Cohn, “Learning how to active learn: A deep reinforcement learning approach,” Aug. 2017. [Online]. Available: https://arxiv.org/abs/1708.02383

[38] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Dec. 2014. [Online]. Available: https://arxiv.org/abs/1412.6980

Yunzeng Li received the B.Sc. degree in communication engineering from Shandong University, China, in 2018, where he is currently pursuing the M.Eng. degree in electronics and communication engineering. His research interests include deep reinforcement learning and B5G dynamic spectrum allocation.

Wensheng Zhang (Member, IEEE) received the M.E. degree in electrical engineering from Shandong University, China, in 2005, and the Ph.D. degree in electrical engineering from Keio University, Japan, in 2011. In 2011, he joined the School of Information Science and Engineering, Shandong University, where he is currently an Associate Professor. He was a Visiting Scholar with the University of Oulu, Finland, in 2010 and the University of Arkansas, USA, in 2019. His research interests lie in tensor computing, random matrix theory, and intelligent B5G wireless communications.

Cheng-Xiang Wang (Fellow, IEEE) received the B.Sc. and M.Eng. degrees in communication and information systems from Shandong University, China, in 1997 and 2000, respectively, and the Ph.D. degree in wireless communications from Aalborg University, Denmark, in 2004.

He was a Research Assistant with the Hamburg University of Technology, Hamburg, Germany, from 2000 to 2001, a Visiting Researcher with Siemens AG Mobile Phones, Munich, Germany, in 2004, and a Research Fellow with the University of Agder, Grimstad, Norway, from 2001 to 2005. He has been with Heriot-Watt University, Edinburgh, U.K., since 2005, where he was promoted to a Professor in 2011. In 2018, he joined Southeast University, China, as a Professor. He is also a Part-Time Professor with Purple Mountain Laboratories, Nanjing, China. He has authored 3 books, 1 book chapter, and more than 370 papers in refereed journals and conference proceedings, including 23 Highly Cited Papers. He has also delivered 18 Invited Keynote Speeches/Talks and 7 Tutorials in international conferences. His current research interests include wireless channel measurements and modeling, B5G wireless communication networks, and applying artificial intelligence to wireless communication networks.

Prof. Wang is a Fellow of the IET, an IEEE Communications Society Distinguished Lecturer in 2019 and 2020, and a Highly Cited Researcher recognized by Clarivate Analytics in 2017–2019. He is currently an Executive Editorial Committee member for the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS. He has served as an Editor for nine international journals, including the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS from 2007 to 2009, the IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY from 2011 to 2017, and the IEEE TRANSACTIONS ON COMMUNICATIONS from 2015 to 2017. He was a Guest Editor for the IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, Special Issue on Vehicular Communications and Networks (Lead Guest Editor), Special Issue on Spectrum and Energy Efficient Design of Wireless Communication Networks, and Special Issue on Airborne Communication Networks. He was also a Guest Editor for the IEEE TRANSACTIONS ON BIG DATA, Special Issue on Wireless Big Data, and is a Guest Editor for the IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING, Special Issue on Intelligent Resource Management for 5G and Beyond. He has served as a TPC Member, TPC Chair, and General Chair for over 80 international conferences. He received ten Best Paper Awards from IEEE GLOBECOM 2010, IEEE ICCT 2011, ITST 2012, IEEE VTC 2013-Spring, IWCMC 2015, IWCMC 2016, IEEE/CIC ICCC 2016, WPMC 2016, and WOCC 2019.

Jian Sun (Member, IEEE) received the B.Sc. degree in applied electronic technology, the M.Eng. degree in measuring and testing technologies and instruments, and the Ph.D. degree in communication and information systems from Zhejiang University, Hangzhou, China, in 1996, 1999, and 2005, respectively.

From 2005 to 2018, he was a Lecturer with the School of Information Science and Engineering, Shandong University, China. Since 2018, he has been an Associate Professor. In 2008, he was a Visiting Scholar with the University of California San Diego. In 2011, he was a Visiting Scholar with Heriot-Watt University, U.K., supported by the U.K.-China Science Bridges: R&D on B4G Wireless Mobile Communications project. His current research interests include signal processing for wireless communications, channel sounding and modeling, propagation measurement and parameter extraction, maritime communication, visible light communication, software-defined radio, MIMO, multicarrier, and wireless systems design and implementation.

Yu Liu received the Ph.D. degree in communication and information systems from Shandong University, Jinan, China, in 2017. From 2015 to 2017, she was a Visiting Scholar with the School of Engineering and Physical Sciences, Heriot-Watt University, Edinburgh, U.K. From 2017 to 2019, she was a Postdoctoral Research Associate with the School of Information Science and Engineering, Shandong University. Since 2019, she has been an Associate Professor with the School of Microelectronics, Shandong University. Her main research interests include nonstationary wireless MIMO channel modeling, high-speed train wireless propagation characterization and modeling, and channel modeling for special scenarios.
