Application of Reinforcement Learning in Network Routing
By Chaopin Zhu
Machine Learning
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Supervised Learning
Feature: Learning with a teacher
Phases
• Training phase
• Testing phase
Application
• Pattern recognition
• Function approximation
Unsupervised Learning
Feature
• Learning without a teacher
Application
• Feature extraction
• Other preprocessing
Reinforcement Learning
Feature: Learning with a critic
Application
• Optimization
• Function approximation
Elements of Reinforcement Learning
Agent
Environment
Policy
Reward function
Value function
Model of environment (optional)
Reinforcement Learning Problem
(Diagram: the agent observes state x_n, takes action a_n on the environment, and receives reward r_n; the environment then returns the next state x_{n+1} and reward r_{n+1}.)
Markov Decision Process (MDP)
Definition:
A reinforcement learning task that satisfies the Markov property
Transition probabilities
P^a_{xy} = Pr{ x_{n+1} = y | x_n = x, a_n = a }
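As a concrete reading of this definition, a transition-probability table can be stored as a nested mapping, one distribution over next states per state-action pair. This is a minimal sketch; the state names, actions, and probabilities below are hypothetical:

```python
# A transition-probability table P^a_{xy} as a nested mapping:
# (state, action) -> distribution over next states.
# All states, actions, and numbers here are illustrative.
P = {
    ('High', 'search'):   {'High': 0.7, 'Low': 0.3},
    ('High', 'wait'):     {'High': 1.0},
    ('Low',  'search'):   {'Low': 0.6, 'High': 0.4},
    ('Low',  'wait'):     {'Low': 1.0},
    ('Low',  'recharge'): {'High': 1.0},
}

def is_valid(P):
    """For each (x, a), the probabilities over next states y must sum to 1."""
    return all(abs(sum(dist.values()) - 1.0) < 1e-12 for dist in P.values())
```

The check makes the Markov structure explicit: the next state depends only on the current state and action, via one fixed distribution per pair.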
An Example of MDP
(Diagram: a two-state MDP with states High and Low; actions search and wait are available in both states, and recharge is additionally available in Low.)
Markov Decision Process (cont.)
Parameters
x_n ---- state
a_n ---- action
r_n ---- reward
γ ---- discount factor
π ---- policy
Value functions
V^π(x) = E_π{ Σ_{k=0}^∞ γ^k r_{n+k+1} | x_n = x }
Q^π(x,a) = E_π{ Σ_{k=0}^∞ γ^k r_{n+k+1} | x_n = x, a_n = a }
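The quantity inside both expectations is the discounted return along one sampled trajectory, which is simple to compute directly (a minimal sketch; the reward values in the example are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """Compute sum_{k=0}^{K-1} gamma**k * r_{n+k+1} for one trajectory,
    where rewards[k] is the reward received k steps after time n."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# e.g. rewards (1, 0, 2) with gamma = 0.5: 1 + 0.5*0 + 0.25*2 = 1.5
```

V^π and Q^π are then the expectations of this quantity over trajectories generated by the policy π.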
Elementary Methods forReinforcement Learning Problem
Dynamic programming
Monte Carlo methods
Temporal-difference learning
Bellman’s Equations
V*(x) = max_a E{ r_{n+1} + γ V*(x_{n+1}) | x_n = x, a_n = a }
Q*(x,a) = E{ r_{n+1} + γ max_{a'} Q*(x_{n+1}, a') | x_n = x, a_n = a }
Dynamic Programming Methods
Policy evaluation
V_{k+1}(x) = E_π{ r_{n+1} + γ V_k(x_{n+1}) | x_n = x }
Policy improvement
Q^π(x,a) = E{ r_{n+1} + γ V^π(x_{n+1}) | x_n = x, a_n = a }
π'(x) = argmax_a Q^π(x,a)
Dynamic Programming (cont.)
E ---- policy evaluation
I ---- policy improvement
Policy Iteration
π_0 --E--> V^{π_0} --I--> π_1 --E--> V^{π_1} --I--> ... --I--> π* --E--> V*
Value Iteration
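The value-iteration branch of this scheme can be sketched on a tiny MDP. This is a minimal illustration, not the lecture's own example; the two-state transition table, rewards, and discount factor are all invented:

```python
# Value iteration: repeatedly apply the Bellman optimality backup
#   V(x) <- max_a sum_y P(y | x, a) * (r(x, a, y) + gamma * V(y)).
# Hypothetical MDP: states 0 and 1, actions 0 ("stay") and 1 ("switch").
P = {  # (state, action) -> list of (probability, next_state, reward)
    (0, 0): [(1.0, 0, 0.0)],
    (0, 1): [(1.0, 1, 1.0)],
    (1, 0): [(1.0, 1, 0.0)],
    (1, 1): [(1.0, 0, 2.0)],
}
gamma = 0.9
V = {0: 0.0, 1: 0.0}
for _ in range(200):  # enough sweeps to converge at gamma = 0.9
    V = {x: max(sum(p * (r + gamma * V[y]) for p, y, r in P[(x, a)])
                for a in (0, 1))
         for x in (0, 1)}
```

Because the backup is a γ-contraction, the sweep converges geometrically to V*; here the optimal policy alternates between the two states, so the fixed point satisfies V(0) = 1 + 0.9·V(1) and V(1) = 2 + 0.9·V(0).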
Monte Carlo Methods
Feature
• Learning from experience
• Do not need complete transition probabilities
Idea
• Partition experience into episodes
• Average sample returns
• Update on an episode-by-episode basis
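The idea above can be sketched as first-visit Monte Carlo policy evaluation. The episodic environment below (a three-state chain with a fixed move-right policy and unit step rewards) is hypothetical:

```python
# First-visit Monte Carlo evaluation of V for a fixed policy on a toy
# episodic chain (states 0, 1, 2; terminal state 3). Rewards are made up.
def run_episode():
    """Follow the fixed policy: always move right; each step pays reward 1."""
    x, traj = 0, []
    while x != 3:
        traj.append((x, 1.0))  # (state, reward received on leaving it)
        x += 1
    return traj

def mc_evaluate(episodes, gamma=1.0):
    returns = {}                      # state -> list of sampled returns
    for _ in range(episodes):
        traj = run_episode()
        G, first = 0.0, {}
        for x, r in reversed(traj):   # accumulate the return backwards
            G = r + gamma * G
            first[x] = G              # the first visit's G overwrites later ones
        for x, g in first.items():
            returns.setdefault(x, []).append(g)
    # Estimate V(x) as the average sample return, episode by episode.
    return {x: sum(g) / len(g) for x, g in returns.items()}
```

Note that no transition probabilities appear anywhere: the estimate uses only sampled episodes, exactly as the slide states.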
Temporal-Difference Learning
Features (combination of Monte Carlo and DP ideas)
• Learn from experience (Monte Carlo)
• Update estimates based in part on other learned estimates (DP)
The TD(λ) algorithm seamlessly integrates TD and Monte Carlo methods
TD(0) Learning
Initialize V(x) arbitrarily, and π to the policy to be evaluated
Repeat (for each episode):
    Initialize x
    Repeat (for each step of episode):
        a ← action given by π for x
        Take action a; observe reward r and next state x'
        V(x) ← V(x) + α[r + γV(x') − V(x)]
        x ← x'
    until x is terminal
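The loop above can be written out for a fixed policy on a tiny deterministic chain. The three-state environment, terminal reward, and parameter values are invented for illustration:

```python
# TD(0) policy evaluation on a hypothetical 3-state chain: the fixed policy
# always moves right, and entering terminal state 3 pays reward 1.
def td0(episodes=2000, alpha=0.1, gamma=0.9):
    V = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}  # V of the terminal state stays 0
    for _ in range(episodes):
        x = 0                              # Initialize x
        while x != 3:                      # Repeat for each step of episode
            x_next = x + 1                 # action given by pi for x
            r = 1.0 if x_next == 3 else 0.0
            # V(x) <- V(x) + alpha * [r + gamma * V(x') - V(x)]
            V[x] += alpha * (r + gamma * V[x_next] - V[x])
            x = x_next
    return V

V = td0()  # true values for this chain: V(2)=1, V(1)=0.9, V(0)=0.81
```

Each update bootstraps from the current estimate V(x'), which is the DP ingredient, while r is sampled from experience, which is the Monte Carlo ingredient.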
Q-Learning
Initialize Q(x,a) arbitrarily
Repeat (for each episode):
    Initialize x
    Repeat (for each step of episode):
        Choose a from x using a policy derived from Q
        Take action a; observe r, x'
        Q(x,a) ← Q(x,a) + α[r + γ max_{a'} Q(x',a') − Q(x,a)]
        x ← x'
    until x is terminal
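The same loop in Python, using an ε-greedy policy derived from Q. The four-state chain environment and all parameter values are hypothetical:

```python
import random

# Tabular Q-learning on a hypothetical 4-state chain (states 0..3, actions
# 0 = left, 1 = right); reaching state 3 ends the episode with reward 1.
def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(x, a): 0.0 for x in range(4) for a in (0, 1)}
    for _ in range(episodes):
        x = 0
        while x != 3:
            # Choose a from x using an epsilon-greedy policy derived from Q.
            if rng.random() < eps:
                a = rng.choice((0, 1))
            else:
                a = max((0, 1), key=lambda act: Q[(x, act)])
            x_next = max(0, x - 1) if a == 0 else x + 1
            r = 1.0 if x_next == 3 else 0.0
            best_next = max(Q[(x_next, 0)], Q[(x_next, 1)])
            # Q(x,a) <- Q(x,a) + alpha * [r + gamma * max_a' Q(x',a') - Q(x,a)]
            Q[(x, a)] += alpha * (r + gamma * best_next - Q[(x, a)])
            x = x_next
    return Q
```

Unlike TD(0), the target uses max over next actions rather than the action the policy actually takes, so Q-learning estimates Q* regardless of the exploration policy.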
Q-Routing
Q_x(y,d) ---- estimated time for a packet to reach destination node d from current node x via x's neighbor node y
T_y(d) ---- y's estimate of the time remaining in the trip:
T_y(d) = min_{z∈N(y)} Q_y(z,d)
q_y ---- queuing time in node y
T_xy ---- transmission time between x and y
Algorithm of Q-Routing
1. Set initial Q-values for each node
2. Get the first packet from the packet queue of node x
3. Choose the best neighbor node ŷ = argmin_{y∈N(x)} Q_x(y,d) and forward the packet to node ŷ
4. Get the estimated value T_ŷ(d), together with q_ŷ, from node ŷ
5. Update Q_x(ŷ,d) ← Q_x(ŷ,d) + η[ q_ŷ + T_xŷ + T_ŷ(d) − Q_x(ŷ,d) ]
6. Go to 2.
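Steps 3 to 5 can be sketched as follows. The table layout, topology, and timing numbers are hypothetical, chosen only to show one update:

```python
# Q[x] maps (neighbor y, destination d) -> estimated delivery time Q_x(y, d);
# neighbors[x] lists N(x). All names and values here are illustrative.
def best_neighbor(Q, neighbors, x, d):
    """Step 3: y_hat = argmin_{y in N(x)} Q_x(y, d)."""
    return min(neighbors[x], key=lambda y: Q[x][(y, d)])

def remaining_time(Q, neighbors, y, d):
    """T_y(d) = min_{z in N(y)} Q_y(z, d)."""
    return min(Q[y][(z, d)] for z in neighbors[y])

def q_routing_update(Q, neighbors, x, d, q_y, t_xy, eta=0.5):
    """Steps 3-5: pick y_hat, obtain its estimate, and move Q_x(y_hat, d)
    toward the target q_y + T_xy + T_y(d) with learning rate eta."""
    y = best_neighbor(Q, neighbors, x, d)
    target = q_y + t_xy + remaining_time(Q, neighbors, y, d)
    Q[x][(y, d)] += eta * (target - Q[x][(y, d)])
    return y
```

For example, with Q_x(y,d) = 2, Q_x(z,d) = 5, T_y(d) = 1, q_y = 0.5, and T_xy = 1, node x forwards via y and moves Q_x(y,d) from 2 toward the target 2.5, giving 2.25 at η = 0.5.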
Dual Reinforcement Q-Routing
(Diagram: a packet travels from source s through nodes x and y toward destination d. Forward exploration updates Q_x(y,d) about the path ahead; backward exploration uses the same packet to update Q_y(x,s) about the path back to the source.)
Network Model
(Diagram: the simulated network is divided into two subnets, Subnet1 and Subnet2.)
Network Model (cont.)
(Diagram: a 36-node topology laid out as a 6×6 grid; nodes 1 to 18 form one subnet and nodes 19 to 36 the other.)
Node Model
(Diagram: each node contains a packet generator, a packet destroyer, a routing controller, and four pairs of input and output queues connecting it to its neighbors.)
Routing Controller
(State diagram: the routing-controller process starts in Init and then waits in Idle; Arrival and Departure events trigger the corresponding procedures, with a Default transition returning to Idle.)
Initialization/ Termination Procedures
Initialization
Initialize and/or register global variables
Initialize routing table
Termination
Destroy routing table
Release memory
Arrival Procedure
Data packet arrival
Update routing table
Route it with control information, or destroy the packet if it has reached the destination
Control information packet arrival
Update routing table
Destroy the packet
Departure Procedure
Set all fields of the packet
Get a shortest route
Send the packet according to the route
References
[1] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction
[2] Chengan Guo, Applications of Reinforcement Learning in Sequence Detection and Network Routing
[3] Simon Haykin, Neural Networks: A Comprehensive Foundation