8/8/2019 Reinforcement approach to Adaptive Package Sheduling in Routers
http://slidepdf.com/reader/full/reinforcement-approach-to-adaptive-package-sheduling-in-routers 1/13
Reinforcement Approach to Adaptive Packet Scheduling in Routers
Halina Kwasnicka, Michał Stanek
15th April 2008
Abstract
A current and important question for the Internet is how to assure quality of service. Several protocols have been proposed to support different classes of network traffic. The open research problem is how to divide the available bandwidth among those traffic classes to support their Quality of Service (QoS) requirements. A major challenge in this area is developing algorithms that can handle situations in which the traffic intensities in the traffic classes are not known in advance or change in time. In this paper we formulate the problem and then propose a reinforcement learning algorithm to solve it. The proposed reinforcement function is evaluated and compared to other methods.
1 Introduction
People have changed the way in which they use the Internet. The Internet has become a place where, more and more often, they seek multimedia information. The fact that we can hold voice conversations with the whole world no longer surprises anybody. More and more often we listen to the radio or even watch television broadcast through the Internet. The major challenge in the Internet is to assure good quality of such services in terms of packet loss, delay and delay variation [12]. The currently used Internet Protocol (IP) does not consider any QoS requirements. Several protocols have been proposed to support different classes of service, such as IntServ [9] or DiffServ [3], which were created to support a small number of traffic classes with different QoS requirements. The main question here is how to divide the available bandwidth among traffic classes to support their QoS requirements.

A solution nowadays commonly used in routers is to provide separate buffers for all traffic classes [5]. In this approach all incoming packets are classified and then moved to the appropriate queues. The router's task is to decide on the order in which the queues are serviced. All scheduling decisions must take into account the QoS (Quality of Service) constraints of all queues and ensure reasonable performance for the best-effort traffic class, i.e. the class of packets which do not have QoS requirements and should be serviced with the highest possible bandwidth. Finding good scheduling algorithms is still an open topic in this area [6].
In the literature one can find many existing scheduling algorithms. The simplest and most commonly used, FCFS (First Come First Served), schedules packets in order of their arrival time. This approach is easy to implement and time-efficient because it does not require any computation, but its major weakness is that it does not consider any QoS requirements.

Another algorithm, EDF (Earliest Deadline First), calculates the difference between the waiting time and the delay constraint for all packets. The packet with the smallest difference is serviced. The danger of using this algorithm lies in the fact that best-effort traffic does not have delay constraints and might never be serviced.

The SP (Sequential Priority) algorithm assigns a priority to each queue. The non-empty queue with the highest priority is serviced until a packet arrives in a queue of higher priority. In this approach the queues with the lowest priorities may never be serviced.
The weakness of SP is partially solved by the WFQ (Weighted Fair Queueing) algorithm. In this approach all queues have weights, and the scheduling algorithm assigns service time proportionally to the queue weight. This prevents situations in which the lowest-priority queues are ignored. Knowledge about traffic intensities and QoS constraints is required before the scheduling process starts and is directly encoded in the queue weights.

The necessity of a priori knowledge of the traffic intensities and QoS requirements is the main weakness of the WFQ algorithm. In real situations these parameters are unknown and, additionally, they can change considerably while the system is working. Solving real problems requires adaptive methods that do not need prior information about traffic conditions and constraints. Such approaches can be found in the literature [6, 1, 7, 11]. The method proposed by Hall and Mars can be used in a dynamic environment with variable inflow intensities.
Hall and Mars proposed a method based on Stochastic Learning Automata (SLA) [7]. The service of each queue is assigned one action, and during the system run the scheduler learns the probabilities of choosing each action in order to meet the predefined QoS requirements and to maximize
the bandwidth for the best-effort traffic class. The results obtained by Hall were better than those of the previously described algorithms (FCFS, EDF, SP).
Hall's method assigns equal action probabilities to all system states. Ferra et al. [6] extended Hall's method so that the scheduler can prefer some actions depending on information about the waiting time in each queue. Ferra used reinforcement learning to learn the scheduling policy and obtained much better results than Hall.
This paper is inspired by Ferra's work. We have pointed out that his method does not take into consideration some system properties whose inclusion could accelerate the learning process and give better results in terms of allocating the available bandwidth among traffic classes. Our main contribution is a modification of the reinforcement function.
The paper is organized as follows. In the next section we formulate the problem. Section three describes the proposed method. Experiments are described in the fourth section. The last section contains conclusions and points out possible further work.
2 Problem Formulation
In our model of the network router, all packets have a fixed, constant length. This is typical for the internal queues in routers that use a cell switching fabric [5]. This assumption allows us to model traffic using a discrete-time arrival process, where one time slot of fixed length is required to transmit one packet.
In our system we can identify the following blocks. The input flow represents all traffic packets sent to the router. The classifier identifies the class of each packet in the input flow and moves it to the appropriate queue. The priority queues hold packets from the different traffic classes; each traffic class has its own queue. The last element of our system is the scheduler; its task is to choose the queue which will be serviced. Figure 1 presents the system model. Below, all blocks are described more precisely.
Figure 1: Packet scheduling for multiple traffic classes
The classifier moves packets from the input flow to the appropriate queues. We assume perfect recognition of packet classes. Each traffic class i is represented by a separate priority queue q_i, where i = 1, ..., N. All queues have finite capacity and can hold at most N_i packets. If q_i holds exactly N_i packets and the classifier finds in the input flow another packet from this traffic class, this packet is rejected.
All traffic classes have the same arrival distribution, described by a Bernoulli process with mean value λ_i for the i-th traffic class. This means that in each time slot a new packet arrives in queue q_i with probability λ_i.
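This arrival model can be sketched as a short simulation; the function and variable names below are illustrative only and not taken from the paper.

```python
import random

def simulate_arrivals(lambdas, capacities, time_slots, seed=0):
    """Simulate discrete-time Bernoulli arrivals into finite queues.

    lambdas[i] is the per-slot arrival probability for traffic class i;
    a packet arriving at a full queue q_i (holding N_i packets) is rejected.
    Each queue stores the arrival slot of its packets, so waiting times
    can later be derived from the current slot.
    """
    rng = random.Random(seed)
    queues = [[] for _ in lambdas]
    dropped = [0] * len(lambdas)
    for t in range(time_slots):
        for i, lam in enumerate(lambdas):
            if rng.random() < lam:            # at most one arrival per class per slot
                if len(queues[i]) < capacities[i]:
                    queues[i].append(t)       # remember the arrival slot
                else:
                    dropped[i] += 1           # queue q_i full: reject the packet
    return queues, dropped
```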
The mean delay requirement R_i^t is the maximum acceptable mean delay per packet in queue q_i at time slot t. We assume that this requirement is defined for queues q_1, ..., q_{N-1} and that it can change in time. The last queue, q_N, does not have any delay requirement and should be served as fast as possible; we denote this queue as the best-effort queue.
The scheduler has to choose exactly one packet to be transmitted in each time slot. Formally, the scheduler must choose one action a_i from the action set A = {a_1, a_2, ..., a_N}. Action a_i means that the server will take one packet from queue q_i. All queues are served in FCFS (First Come First Served) order. The scheduler uses a strategy π to choose an action based on the current system state. The strategy is a function which maps the current system state to an action:

π : Ω → A   (1)

A strategy function can be static and predefined in the system, or it can be dynamic, in which case the scheduler can change and improve it while the system is working.
The system state x^t ∈ Ω in time slot t can be defined as a vector:

x^t = [Q_1^t, ..., Q_N^t]   (2)

where Q_i^t is the state of q_i in time slot t. The state of the i-th queue is described as a vector:

Q_i^t = [p_{i,1}^t, p_{i,2}^t, ..., p_{i,L_i^t}^t]   (3)

where p_{i,j}^t ∈ Z is the waiting time of the j-th packet in the i-th queue at time slot t, and L_i^t is the number of packets waiting in queue q_i at time slot t. It always holds that L_i^t ≤ N_i.

Based on Q_i^t we can define the mean waiting time in queue q_i in time slot t as a function m : Z × Z → R:

m(i, t) = ( Σ_{j=1}^{L_i^t} p_{i,j}^t ) / L_i^t   (4)

Our goal is to find a strategy π which assures the smallest possible mean delay in the best-effort queue q_N while simultaneously holding the delay requirements for the remaining queues: ∀i ∈ {1, ..., N-1} : m(i, t) < R_i^t.
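The per-queue statistic of eq. (4) and the goal condition can be sketched in Python as follows; the function names are illustrative only, and an empty queue is assumed to contribute zero delay (the paper leaves this case unspecified).

```python
def mean_waiting_time(waits):
    """Mean waiting time m(i, t) of eq. (4): the average of the waiting
    times p_{i,j}^t of the packets currently held in queue q_i.
    An empty queue is treated as having zero mean delay (assumption)."""
    if not waits:
        return 0.0
    return sum(waits) / len(waits)

def constraints_satisfied(queues, requirements):
    """Check the goal condition m(i, t) < R_i^t for every queue with a
    QoS requirement (the best-effort queue q_N is excluded)."""
    return all(mean_waiting_time(q) < r for q, r in zip(queues, requirements))
```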
3 Proposed method
To solve the problem described in the previous section, we propose a reinforcement learning approach. This technique allows us to learn the scheduling policy π while the system is working, without prior knowledge about the traffic distributions, based only on the feedback received after performed actions and on knowledge about the system state.
The basic idea behind a reinforcement learning algorithm can be described in five steps:
1. observe the current system state xt,
2. choose the action at to perform it in the state xt,
3. perform the selected action at,
4. observe the reinforcement rt and the next state xt+1,
5. learn from the experience < xt, at, rt, xt+1 >.
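The five steps above can be sketched as a generic control loop; `env` and `agent` are hypothetical stand-ins for the router simulator and the scheduler, not objects defined in the paper.

```python
def learning_loop(env, agent, steps):
    """A minimal sketch of the five-step reinforcement learning loop.
    Only the control flow is taken from the text; the method names on
    `env` and `agent` are illustrative assumptions."""
    x = env.observe()                      # 1. observe the current state x_t
    for _ in range(steps):
        a = agent.choose_action(x)         # 2. choose action a_t for state x_t
        r, x_next = env.step(a)            # 3./4. perform a_t, observe r_t and x_{t+1}
        agent.learn(x, a, r, x_next)       # 5. learn from <x_t, a_t, r_t, x_{t+1}>
        x = x_next
```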
To use the above idea one must solve three very important problems that determine success: the representation of states, the reinforcement function and the learning algorithm. These three problems are discussed in the following subsections.
3.1 State representation
Based on the problem formulation (Section 2), we have an infinite state space. Reinforcement learning algorithms work only if the numbers of states and actions are finite. Therefore we use a state aggregation method [10, 8]. In this approach many states are recognized as one state, and that state is passed to the learning algorithm.

The efficiency of reinforcement learning depends on the state space size. If the state space is very large, the learning process can be very long and ineffective because too many parameters must be tuned. This feature of reinforcement learning must be considered when developing the aggregation function. We have tested a number of aggregation functions, but the best results were produced by the method proposed by Ferra [6].
Let us introduce the aggregation function, which transforms the given state space into another one:

Ψ : Ω → Ω'   (5)

We assume that the size of the state space Ω' is smaller than that of Ω. The form of the aggregation function is as follows:

Ψ(x^t) = [ψ(Q_1^t), ..., ψ(Q_{N-1}^t)]   (6)

where the state of each queue is transformed by the function:

ψ(Q_i^t) = { 1, if m(i, t) ≤ R_i^t
           { 0, if m(i, t) > R_i^t   (7)
In other words, after applying the above aggregation function to a given system state, we obtain a vector of size N - 1 in which all queues with satisfied mean delay constraints have value 1, and the others 0. For example, the vector [1, 1, 1, 0, 0] corresponds to the situation where the system consists of six queues and queues q_4 and q_5 do not satisfy the requirements R_4 and R_5 in time slot t.

It is worth noticing that in the adopted aggregation approach there is no variable corresponding to the best-effort queue q_N. Our state space is thus reduced to 2^{N-1} states. Such a simplification allows us to solve real-life problems in which the number of queues is greater than 10. An additional advantage is connected with the queue capacities: their sizes do not influence the number of system states.
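A minimal sketch of the aggregation of eqs. (6)-(7), assuming the convention that a satisfied constraint maps to 1; the function name is hypothetical.

```python
def aggregate_state(mean_delays, requirements):
    """State aggregation Psi of eqs. (6)-(7): map the full system state to
    a binary vector of length N-1, one entry per QoS queue (the best-effort
    queue is omitted). Entry i is 1 when m(i, t) <= R_i^t and 0 otherwise,
    so the aggregated space has only 2^(N-1) states."""
    return tuple(1 if m <= r else 0 for m, r in zip(mean_delays, requirements))
```

With five QoS queues this reproduces the example from the text: delays well under their constraints in the first three queues and over them in the last two yield the vector [1, 1, 1, 0, 0].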
3.2 Reinforcement function
A reinforcement function provides feedback to the learning algorithm about the effects of performed actions. Based on this information the learning algorithm can change the scheduling policy if another action leads the system to a better state in the considered time step or even in the future.

The reinforcement function is a crucial element of a learning algorithm in terms of the obtained results and convergence. The function proposed by Ferra [6] does not take into consideration many important aspects of the system state whose inclusion could improve both the results and the convergence time. In this paper we refer to Ferra's proposition of the reinforcement function as the "Ferra function".
As a major improvement of the Ferra function, we propose taking into consideration the waiting time in the best-effort queue and the degree of violation of the QoS requirements in all queues. The proposed reinforcement function is given by equation (10):
r(x^t) = Σ_{i=1}^{N-1} ζ(Q_i^t) · ξ(x^{t-1}, i) + ς(Q_N^t) · ξ(x^{t-1}, N) + φ(Ψ(x^t), Ψ(x^{t-1}))   (10)
Below we describe all elements of the r(x^t) function. The ζ(Q_i^t) function is responsible for holding the time constraint in queue q_i; in the case when the time constraint has been broken, it measures the degree of the violation. The function is given by equation (11):

ζ(Q_i^t) = { (C_1^2 · m(i, t)) / R_i^t,   if m(i, t) ≤ R_i^t
           { -(C_2 · m(i, t)^2) / R_i^t,  if m(i, t) > R_i^t   (11)
where C_1 and C_2 are constants whose values should be fixed before the learning process starts, and m(i, t) (eq. 4) is the mean waiting time in queue i at time step t. Figure 2 presents a visual representation of ζ(Q_i^t).
Figure 2: The visualization of the ζ(Q_i^t) function.
ς(Q_i^t) (eq. 12) is used as the reinforcement function for the best-effort queue. The task of this function is to promote situations in which packets wait only a short time:

ς(Q_i^t) = -C_4 · m(i, t) / max(m(i, 1), ..., m(i, t)) + C_4   (12)

where C_4 is a constant and max(m(i, 1), ..., m(i, t)) is the maximal mean waiting time in the queue up to the t-th time step.

After each decision we scale the value for the last served queue by a factor of 0.3. This scaling is introduced for situations in which packets of only one traffic class are being served. For the scaling purposes we introduce the function ξ(x^t, i), given by equation (13):

ξ(x^t, i) = { 0.3, if π^t(x^t) = a_i
            { 1,   if π^t(x^t) ≠ a_i   (13)
Figure 3: The visualization of the ς function
where π^t(x^t) (eq. 1) is the strategy function used to obtain an action in state x^t, and a_i ∈ A is the action responsible for serving queue q_i.

The last element which we take into consideration in the reinforcement function (eq. 10) is the situation in which one or more mean waiting times in the queues are greater than the constraints assigned to these queues. An action which improves the system state by decreasing the waiting time in such queues should be additionally rewarded. For this purpose we use the φ(x^t, x^{t-1}) function, given by the following equation:
φ(x^t, x^{t-1}) = C_3 · max( Σ_{i=1}^{N-1} (x_i^t - x_i^{t-1}), 0 )   (14)

where x^t, x^{t-1} ∈ Ω' (the state space obtained after using the aggregation method introduced in the previous section), C_3 is a constant, and x_i^t ∈ {0, 1} is the i-th value of the state vector in time slot t.
The φ(x^t, x^{t-1}) function takes two parameters: the aggregated state vectors for time steps t and t - 1. The value returned by this function reflects the number of queues whose waiting time has improved enough to satisfy their time constraints, scaled by the constant C_3. The function does not penalize a situation in which the next system state is worse than the previous one in terms of the number of satisfied constraints; in such situations it returns zero.
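The four components of the reinforcement function can be sketched as below. The exact piecewise form of ζ follows the reconstruction of eq. (11) given above, and the default constants mirror Table 2; both should be treated as assumptions rather than the authors' exact definitions.

```python
def zeta(m, R, C1=50.0, C2=30.0):
    """Per-queue term of eq. (11): positive while the constraint m <= R
    holds, increasingly negative once it is violated (assumed form)."""
    if m <= R:
        return (C1 ** 2) * m / R
    return -C2 * m * m / R

def sigma(m, m_max, C4=20.0):
    """Best-effort term of eq. (12): rewards short waiting times relative
    to the largest mean waiting time observed so far."""
    return -C4 * m / m_max + C4 if m_max > 0 else C4

def xi(last_action, i):
    """Scaling of eq. (13): 0.3 for the queue served last, 1 otherwise."""
    return 0.3 if last_action == i else 1.0

def phi(agg_now, agg_prev, C3=50.0):
    """Bonus of eq. (14): rewards an increase in the number of queues
    whose constraint became satisfied; never negative."""
    return C3 * max(sum(a - b for a, b in zip(agg_now, agg_prev)), 0)
```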
3.3 Learning algorithm
Let us consider a system in state x^t at time step t. The scheduler selects one of the available actions a^t ∈ A according to policy π^t. As a result of performing the selected action, the system changes its state from x^t ∈ Ω to x^{t+1} ∈ Ω, and the scheduler receives information about the effects of the selected action, returned by the reinforcement function r(x^{t+1}) (eq. 10). An action a^t changes the system state in a non-deterministic manner: one packet is serviced, but new ones can arrive with probabilities λ_i for each queue q_i.

The scheduler receives a sequence of reinforcement values (r(x^{t+1}), r(x^{t+2}), r(x^{t+3}), ...) in the future time units.
Our goal is to find the strategy π which maximizes the expression:

V^π(x^t) = E[ Σ_{k=1}^∞ γ^k r(x^{t+k}) ]   (15)

where V^π : Ω → R, E is the expected value, γ is a discount factor, and r(x^{t+k}) is the reward received in time step t + k. The discount factor 0 ≤ γ < 1 assures that the sum of future rewards is finite; a second advantage of using the discount factor can be found in [10]. V^π(x^t) is called the value function for strategy π.
Let us introduce the function g : Ω × A → R, which returns the expected value of the reward after performing action a^t in state x^t:

g(x^t, a^t) = Σ_{y∈Ω} P(y | x^t, a^t) · r(y)   (16)

where P(y | x^t, a^t) is the probability of the system state changing from x^t ∈ Ω into y ∈ Ω after action a^t ∈ A.
We can now calculate V^π(x) for any policy π using eq. (17):

V^π(x) = g(x, π(x)) + γ Σ_{y∈Ω} P(y | x, π(x)) · V^π(y)   (17)
More useful for the learning process is the action-value function Q^π : Ω × A → R, which calculates the discounted sum of future rewards when in state x^t we select an arbitrary action a^t and in the following time steps we select actions according to strategy π:

Q^π(x, a) = g(x, a) + γ Σ_{y∈Ω} P(y | x, a) · Q^π(y, π(y))   (18)
In this setting the optimal policy π* (not necessarily unique) is obtained by maximizing:

π*(x) = argmax_{a∈A} Q*(x, a)   (19)

where Q*(x, a) is the optimal action-value function given by the following equation:

Q*(x, a) = g(x, a) + γ Σ_{y∈Ω} P(y | x, a) · max_{a'∈A} Q*(y, a')   (20)
The situation would be simple if P(y | x, a) were known for all x, y ∈ Ω, because then the optimal value of the Q-function could be computed directly and the optimal policy π* chosen. Since we assume that the traffic intensities are unknown, it is impossible to calculate those probabilities in advance.
Algorithm 1 Q-learning pseudo code
Require: i := 0
         α - learning rate, 0 < α < 1
         γ - discount factor, 0 ≤ γ < 1
         Q[x, a] initialized arbitrarily for all x ∈ Ω' and a ∈ A
 1: repeat
 2:   i := i + 1
 3:   Observe the current state x
 4:   Apply the aggregation function x := Ψ(x)
 5:   Choose a for the aggregated state x using the policy π(x) derived from Q (ε-greedy, soft-max)
 6:   Perform action a, observe the next state y
 7:   Calculate r(y)
 8:   ∆ := α(r(y) + γ max_{a'∈A} Q[Ψ(y), a'] - Q[x, a])
 9:   Q[x, a] := Q[x, a] + ∆
10: until forever
To omit this inconvenience we use the Q-learning method, whose pseudo code is presented as Algorithm 1. The Q-learning algorithm stores the values of Q*(x, a) in a two-dimensional array and updates them in each time step. The updating process takes into account four parameters: the previous aggregated system state, the selected action, the value of the reinforcement function for the current state, and the current aggregated system state. Based on these parameters the algorithm calculates ∆ (line 8 of the pseudo code), which can be treated as a correction to the previous approximation of the optimal action-value function Q*(x, a). In this correction the Q-learning method uses the α parameter, the learning rate. In static problems α can be decreased in time in such a way that the values stored in Q[x, a] are the arithmetic mean of all approximations [10]. For dynamic problems α should be constant, for better adaptation to changing traffic intensities or QoS requirements.
When certain conditions are fulfilled, Q-learning provably converges to the optimal values of Q*(x, a) [8, 2, 4]. To meet those requirements we need to assure that all available actions in each state are selected during the learning process. This can be done by the action selection strategies described in the next subsection.
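The update of lines 8-9 of Algorithm 1 can be sketched as follows; the dictionary-backed table and the default α and γ (taken from Table 2) are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def q_update(Q, x, a, r, y, actions, alpha=0.05, gamma=0.5):
    """One step of lines 8-9 of Algorithm 1:
    Q[x,a] := Q[x,a] + alpha * (r + gamma * max_a' Q[y,a'] - Q[x,a]).
    States x, y are aggregated states (binary tuples produced by Psi);
    Q maps (state, action) pairs to values, defaulting to 0."""
    best_next = max(Q[(y, b)] for b in actions)
    delta = alpha * (r + gamma * best_next - Q[(x, a)])
    Q[(x, a)] += delta
    return Q[(x, a)]
```

Usage: `Q = defaultdict(float)` gives every unseen state-action pair an initial value of zero, so no explicit initialization over Ω' × A is needed.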
3.4 Exploration and Exploitation
During the learning process we need to assure a balance between the exploitation of known actions and the exploration of new ones. We need to explore the available actions to ensure good learning convergence, but we also need to select the best known actions to ensure good scheduling quality. We have used two strategies in the experiments: ε-greedy and soft-max.
To understand how these strategies work, let us extend
the definition of the policy function presented in equation (1) to the following form:

π : Ω × A → [0, 1]   (21)

Now the policy takes two parameters: an aggregated system state x ∈ Ω' and an action a ∈ A. For these parameters the policy function returns the probability of selecting action a in state x.
In the ε-greedy strategy all non-optimal actions have equal selection probabilities:

∀a ∈ A . a ∉ argmax_{b∈A} Q(x, b) ⇒ π(x, a) = ε / |A|   (22)

and for the optimal actions the probability of selection is defined as:

∀a ∈ A . a ∈ argmax_{b∈A} Q(x, b) ⇒ π(x, a) = (1 - ε) / |argmax_{b∈A} Q(x, b)| + ε / |A|   (23)

where the set of optimal actions A* = argmax_{b∈A} Q(x, b) contains the actions with the highest Q-function value, and |A| is the number of elements in the set A. The ε parameter is constant; its value should be greater than 0, less than 1, and fixed before the learning process starts.
There are two main disadvantages of the ε-greedy strategy: ε is constant during the whole learning process, and the selection probabilities among non-optimal actions are equal even if their corresponding Q-function values differ greatly. The first disadvantage can easily be fixed by decreasing the ε value in successive iterations. In the early learning phases we need to explore the available actions; later we want to tune up the learning process and decrease the probabilities of selecting non-optimal actions.
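A sketch of ε-greedy selection under these definitions; passing a decayed ε over iterations implements the fix described above. The names are illustrative.

```python
import random

def epsilon_greedy(Q, x, actions, eps, rng=random):
    """Eps-greedy selection over the Q-values of state x: with probability
    eps pick a uniformly random action, otherwise a greedy one (ties broken
    at random). Decaying eps over iterations shifts the balance from
    exploration toward exploitation."""
    if rng.random() < eps:
        return rng.choice(actions)          # explore: uniform over all actions
    best = max(Q[(x, a)] for a in actions)  # exploit: pick a maximizing action
    return rng.choice([a for a in actions if Q[(x, a)] == best])
```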
The second disadvantage of ε-greedy is solved by the soft-max strategy. In this approach the probability of action selection in each system state depends on the action quality in comparison to the other actions (eq. 24):

∀a ∈ A . π(x, a) = exp(Q[x, a] / τ(i)) / Σ_{b∈A} exp(Q[x, b] / τ(i))   (24)
The τ function is called the temperature. When its value is high, the selection probabilities of all actions are very similar, but when τ → 0 only the optimal actions have non-zero selection probabilities. Because we want to decrease the temperature during the learning process, it depends on the current learning iteration i. Equation (25) defines the value of τ(i):

τ(i) = { ((τ_k - τ_s) / τ_i) · i + τ_s, if i ≤ τ_i
       { τ_k,                           if i > τ_i   (25)

where i is the learning iteration, τ_s and τ_k are the starting and ending temperatures respectively, and τ_i is the number of iterations over which the value of the function is decreased until it reaches τ_k.
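Eqs. (24)-(25) can be sketched together as follows; the default schedule constants mirror Table 2 (τ_s = 50, τ_k = 3, τ_i = 2000) and the helper names are assumptions.

```python
import math
import random

def temperature(i, tau_s=50.0, tau_k=3.0, tau_i=2000):
    """Linear temperature schedule of eq. (25): decreases from tau_s to
    tau_k over the first tau_i iterations, then stays at tau_k."""
    if i <= tau_i:
        return (tau_k - tau_s) / tau_i * i + tau_s
    return tau_k

def softmax_action(Q, x, actions, i, rng=random):
    """Soft-max (Boltzmann) selection of eq. (24): the probability of an
    action grows with exp(Q[x, a] / tau(i))."""
    tau = temperature(i)
    weights = [math.exp(Q[(x, a)] / tau) for a in actions]
    total = sum(weights)
    r, acc = rng.random() * total, 0.0
    for a, w in zip(actions, weights):      # sample proportionally to weights
        acc += w
        if r <= acc:
            return a
    return actions[-1]                      # guard against rounding at the edge
```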
4 Experiments
To evaluate the proposed method we performed a series of experiments, which can be divided into three groups. Experiments in the first group check the convergence of the learning process and the achieved mean waiting times in a system containing three traffic classes, with the parameters of the arrival process constant in time. Experiments in the second group check how well the method copes with changes in the traffic arrival process and the QoS requirements. The last group of experiments checks the behaviour of the system when additional traffic classes are included. All results are compared to the "Ferra method".
4.1 Three traffic classes
The first three experiments evaluate our method with the soft-max action selection strategy (eq. 24) and three traffic classes. In all these experiments we used the same learning parameters (Table 2) and QoS requirements but different probabilities of packet arrivals in the traffic classes (Table 1). All experiments start with a clean system (without any packets in the queues) and a randomly selected initial scheduling policy, which is thereafter improved by the Q-learning method.
Queue | Arrival Rate [packet / time slot] | Mean Delay Constraint [time slots]
      | Exp. 1 | Exp. 2 | Exp. 3 |
1     | 15     | 15     | 20     | 100
2     | 25     | 25     | 25     | 40
3     | 25     | 55     | 55     | best effort

Table 1: Traffic parameters and QoS requirements in experiments 1 - 3
All experiments were repeated ten times, and the results are presented in Figures 4, 5 and 6 respectively. In all three experiments the learning time was shorter in comparison with the "Ferra method". This is especially visible in the third experiment, where convergence was approximately 3 times faster. The mean waiting times in the queues with QoS requirements are comparable in both methods, but our method produces better policies: they assure significantly smaller waiting times in the best-effort queue
Figure 4: Results obtained in the first experiment for the "Ferra method" (function 1) and the authors' method (function 2); the QoS constraints and traffic conditions are presented in Table 1

Figure 5: Results obtained in the second experiment for the "Ferra method" (function 1) and the authors' method (function 2); the QoS constraints and traffic conditions are presented in Table 1
Figure 6: Results obtained in the third experiment for the "Ferra method" (function 1) and the authors' method (function 2); the QoS constraints and traffic conditions are presented in Table 1
especially in the early learning stages (Table 3). A comparison of the results of the first three experiments is presented in Fig. 7.
Figure 7: Comparison of the results obtained using the Ferra method (function 1) and the authors' method (function 2) in the first three experiments
In the fourth and fifth experiments we use the ε-greedy action selection method in order to evaluate the performance with different exploration functions. The parameters used in the learning process are presented in Table 4, and the traffic parameters are equal to the parameters used in the previous experiments (Table 1).
During the experiments we found that the "Ferra method"
Function parameters | Value | Learning parameters | Value
C_1 | 50 | γ   | 0.5
C_2 | 30 | α   | 0.05
C_3 | 50 | τ_i | 2000
C_4 | 20 | τ_s | 50
    |    | τ_k | 3

Table 2: Learning parameters used in the first three experiments
                | Exp. 1 | Exp. 2 | Exp. 3
Ferra method    | 11     | 320    | 320
Authors' method | 12     | 206    | 175

Table 3: Maximal waiting time in the best-effort queue in the first three experiments
Function parameters | Value | Learning parameters | Value
C_1 | 50 | γ | 0.5
C_2 | 30 | α | 0.05
C_3 | 50 | ε | 0.09
C_4 | 20 |   |

Table 4: The learning parameters in the fourth and fifth experiments
may lead to situations where the learning process cannot tune up and the obtained scheduling policies do not preserve the QoS requirements (Fig. 8). By changing the value of ε we can avoid such situations, but this must be done for each case separately; this problem eliminates the method from practical usage. In contrast, the proposed method copes with the same examples very well without any hand-made tuning (Fig. 8). Even in the cases when the "Ferra method" finds satisfactory policies (Fig. 9), the mean waiting time in the best-effort queue is smaller with the proposed method.
It is worth noticing that the results obtained using the ε-greedy strategy are worse than those obtained after applying the soft-max action selection method. This is connected with the fact that in the ε-greedy strategy, during the whole learning process, the probability of selecting a non-optimal action must be relatively large to assure good state space exploration in the early learning phases. This inconvenience is fixed in the soft-max method, where this probability decreases during the learning process and the produced strategies are well tuned.
4.2 Adaptation to traffic condition changes

The aim of the next experiment is to check the ability of our method to adapt to traffic condition changes. The learning process took 100000 time slots. The experiment starts with the traffic parameters presented in Table 6, but after 50000 time slots the arrival process and the QoS requirements are changed (Table 7). In the learning process we use the soft-max action selection method with the parameters presented in Table 2.
Queue | Arrival Rate [packet / time slot] | Mean Delay Constraint [time slots]
1 | 50 | 400
2 | 30 | 300
3 | 15 | best effort

Table 6: Starting traffic parameters and QoS requirements used in the sixth experiment
Queue | Arrival Rate [packet / time slot] | Mean Delay Constraint [time slots]
1 | 30 | 250
2 | 50 | 350
3 | 5  | best effort

Table 7: Ending traffic parameters and QoS requirements used in the sixth experiment
The convergence of the proposed method after switching the traffic parameters is faster than in the method proposed by Ferra; additionally, the scheduling strategy obtained by our method in the first traffic phase is significantly better than that obtained by the "Ferra method". The results of this experiment are presented in Fig. 10.
4.3 Greater number of traffic classes
The last two experiments were performed to evaluate our method with a greater number of traffic classes. In the case of four traffic classes, with the soft-max action selection method, the learning parameters presented in Table 2, and the traffic conditions as in Table 8, the scheduling policies obtained by our method outperform the policy produced by the "Ferra method". The mean waiting times in the queues with QoS are more stable, and the mean waiting time for the best-effort queue is significantly better. Results are presented in Table 8 and in Figure 11.

In the last experiment the number of traffic classes was increased to five. The learning parameters were the same as in the previous experiment. The traffic conditions and the obtained results are presented in Table 9. In Fig. 12 we can see that the "Ferra method" cannot find a good scheduling strategy, whereas our method finds a strategy which assures system stability – the mean waiting times do not grow in time.
5 Conclusion
In the paper we have formulated the packet scheduling problem and solved it using a reinforcement learning method. The main contribution of our work is the proposition of a new reinforcement function, which is compared with the function proposed by Ferra [6]. We have evaluated the proposed method in a number of experiments, and the obtained results are significantly better. Our method is characterized by faster learning and better waiting times in the best-effort queue, especially in the early learning stages. The proposed method preserves these features also with a greater number of traffic classes.

Future study may concentrate on the state aggregation function as well as on further improvements of the reinforcement function. It is also desirable to introduce a mechanism which can detect significant changes in the traffic conditions and adjust the learning parameters, for example the learning rate or the discount factor.
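One possible shape of such a change-detection mechanism, sketched here purely as an illustration (the moving-average constants, drift threshold, and learning-rate values are hypothetical, not taken from the experiments), is:

```python
class TrafficChangeDetector:
    """Track the per-slot arrival count with a slow and a fast
    exponential moving average; a large relative drift between the two
    signals a change in traffic conditions, and the learning rate is
    temporarily boosted so that Q-learning re-adapts faster."""

    def __init__(self, slow=0.001, fast=0.05, threshold=0.3,
                 base_alpha=0.1, boosted_alpha=0.5):
        self.slow = slow
        self.fast = fast
        self.threshold = threshold
        self.base_alpha = base_alpha
        self.boosted_alpha = boosted_alpha
        self.slow_avg = None  # long-term arrival-rate estimate
        self.fast_avg = None  # short-term arrival-rate estimate

    def update(self, arrivals):
        """Feed the number of arrivals in the last time slot and get
        back the learning rate to use for the next Q-update."""
        if self.slow_avg is None:
            self.slow_avg = self.fast_avg = float(arrivals)
        else:
            self.slow_avg += self.slow * (arrivals - self.slow_avg)
            self.fast_avg += self.fast * (arrivals - self.fast_avg)
        drift = abs(self.fast_avg - self.slow_avg) / max(self.slow_avg, 1e-9)
        return self.boosted_alpha if drift > self.threshold else self.base_alpha
```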
References
[1] T. Anker, R. Cohen, D. Dolev, and Y. Singer. PRFQ: Probabilistic fair queuing, 2000.
Queue  Arrival Rate           Mean Delay Constraint  Achieved Mean Delay [time slots]  Standard Deviation [time slots]
       [packet / time slot]   [time slots]           Fun. 1      Fun. 2                Fun. 1      Fun. 2
  1    15                     100                    92.42       92.80                 0.04        0.045
  2    25                     40                     37.14       37.73                 0.004       0.0006
  4    25                     best-effort            6.81        6.08                  0.09        0.05

Table 5: Obtained results in the fifth experiment using “Ferra method” (function 1) and the proposed method (function 2)
Figure 8: Using the ε-greedy action selection method with the “Ferra method” (function 1) can lead to a situation in which the learning process cannot tune up, whereas the proposed method finds satisfactory solutions
Queue  Arrival Rate           Mean Delay Constraint  Achieved Mean Delay [time slots]  Standard Deviation [time slots]
       [packet / time slot]   [time slots]           Fun. 1      Fun. 2                Fun. 1      Fun. 2
  1    15                     70                     62.39       63.74                 0.071       0.045
  2    25                     200                    180.27      190.37                0.002       0.020
  3    25                     100                    92.03       97.72                 0.023       0.007
  4    25                     best-effort            84.54       8.63                  302.777     0.103

Table 8: Parameters used and results obtained for the experiment with four traffic classes
Figure 9: Results obtained after using the “Ferra method” (function 1) and the proposed method (function 2) to find scheduling policies using Q-learning and the ε-greedy action selection method
Figure 10: Results obtained by using the “Ferra method” (function 1) and the proposed method (function 2) when the traffic conditions are changed during the learning process
Figure 11: Mean waiting times obtained for the four traffic classes after using the policies produced by the “Ferra method” (function 1) and the authors' method (function 2)
Queue  Arrival Rate           Mean Delay Constraint  Achieved Mean Delay [time slots]  Standard Deviation [time slots]
       [packet / time slot]   [time slots]           Fun. 1      Fun. 2                Fun. 1      Fun. 2
  1    20                     70                     65.44       67.50                 0.015       0.001
  2    10                     200                    176.25      163.95                0.698       2.166
  3    10                     100                    85.78       71.66                 0.204       0.222
  4    25                     150                    114.08      147.87                0.059       0.007
  5    35                     best-effort            7023.99     1705.05               34313.020   212.304

Table 9: Parameters used and results obtained for the experiment with five traffic classes
[2] R. Babuska, L. Busoniu, and B. De Schutter. Rein-
forcement learning for multi-agent systems. Tech-
nical Report 06-041, Delft Center for Systems and
Control, Delft University of Technology, July 2006.
Paper for a keynote presentation at the 11th IEEE
International Conference on Emerging Technolo-
gies and Factory Automation (ETFA 2006), Prague,
Czech Republic, Sept. 2006.
[3] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss. An architecture for differentiated services. RFC 2475, 1998.
[4] Hyeong Soo Chang, Michael C. Fu, Jiaqiao Hu, Ji-
aqiao Hu, and Steven I. Marcus. Simulation-based
Algorithms for Markov Decision Processes (Com-
munications and Control Engineering). Springer-
Verlag New York, Inc., Secaucus, NJ, USA, 2007.
[5] Cisco IOS Documentation. Quality of service so-
lution guide, implementing diffserv for end-to-end
quality of service, 2002.
[6] Herman L. Ferra, Ken Lau, Christopher Leckie, and
Anderson Tang. Applying reinforcement learning to
packet scheduling in routers. In IAAI , pages 79–84,
2003.
[7] J. Hall and P. Mars. Satisfying QoS with a learning based scheduling algorithm, 2000.
[8] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
Figure 12: Mean waiting times obtained for the five traffic classes after using the policies produced by the “Ferra method” (function 1) and the authors' method (function 2)
[9] S. Shenker, C. Partridge, and R. Guerin. Specifi-
cation of guaranteed quality of service. RFC 2212,
1997.
[10] R.S. Sutton and A.G. Barto. Reinforcement Learn-
ing: An Introduction. MIT Press, Cambridge, MA,
1998.
[11] H. Wang, C. Shen, and K. Shin. Adaptive-weighted
packet scheduling for premium service, 2001.
[12] H. Zhang. Service disciplines for guaranteed performance service in packet-switching networks. Proceedings of the IEEE, 83(10), 1995.