
Reinforcement approach to Adaptive Package Scheduling in Routers

Halina Kwaśnicka, Michał Stanek

15th April 2008

Abstract

An important current question for the Internet is how to assure quality of service. Several protocols have been proposed to support different classes of network traffic. The open research problem is how to divide the available bandwidth among those traffic classes so that their Quality of Service requirements are met. A major challenge in this area is developing algorithms that can handle situations in which the traffic intensities of the traffic classes are not known in advance or change over time. In this paper we formulate the problem and then propose a reinforcement learning algorithm to solve it. The proposed reinforcement function is evaluated and compared to other methods.

1 Introduction

People are changing the way they use the Internet. The Internet is becoming the place where, more and more often, they look for multimedia content. The fact that we can hold voice conversations with the whole world surprises nobody anymore. More and more often we listen to the radio or even watch television broadcast over the Internet. The major challenge in the Internet is to assure good quality of such services (QoS) in terms of packet loss, delay and delay variation [12]. The currently used Internet Protocol (IP) does not consider any QoS requirements. Several protocols have been proposed to support different classes of service, such as IntServ [9] or DiffServ [3], which were created to support a small number of traffic classes with different QoS requirements. The main question here is how to divide the available bandwidth among traffic classes so that their QoS requirements are met.

A solution commonly used in routers today is to provide separate buffers for all traffic classes [5]. In this approach all incoming packets are classified and then moved to the appropriate queues. The router's task is to decide on the order in which the queues are serviced. All scheduling decisions must take into account the QoS (Quality of Service) constraints of all queues and ensure reasonable performance for the best-effort traffic class. The best-effort traffic class is the class of packets that do not have QoS requirements and should be serviced with the highest possible bandwidth. Finding good scheduling algorithms is still an open topic in this area [6].

In the literature one can find many existing scheduling algorithms. The simplest and most commonly used, FCFS (First Come First Served), schedules packets in order of their arrival time. This approach is easy to implement and computationally cheap because it does not require any calculations, but its major weakness is that it does not consider any QoS requirements.

Another algorithm, EDF (Earliest Deadline First), calculates the difference between the waiting time and the delay constraint for all packets. The packet with the smallest difference is serviced. The danger of using this algorithm lies in the fact that best-effort traffic does not have delay constraints and could never be serviced.

The SP (Sequential Priority) algorithm assigns a priority to every queue. The non-empty queue with the highest priority is serviced until packets arrive in a higher-priority queue. In this approach the queues with the lowest priorities could never be serviced.

The weakness of SP is partially addressed by the WFQ (Weighted Fair Queueing) algorithm. In this approach all queues have weights and the scheduling algorithm assigns service time proportionally to the queue weight. This prevents situations in which the lowest-priority queues are ignored. Knowledge about traffic intensities and QoS constraints is required before the scheduling process starts and is encoded directly in the queue weights.

The necessity of knowing the traffic intensities and QoS requirements a priori is the main weakness of the WFQ algorithm. In real situations these parameters are unknown and, additionally, they can change considerably while the system is running. Solving real problems requires adaptive methods that do not need prior information about traffic conditions and constraints. Such approaches can be found in the literature [6, 1, 7, 11]. The method proposed by Hall and Mars can be used in a dynamic environment with variable inflow intensities.

Hall and Mars proposed a method based on Stochastic Learning Automata (SLA) [7]. The service of each queue is assigned one action, and during the system run the scheduler learns the probabilities of choosing each action in order to meet the predefined QoS requirements and to maximize


the bandwidth for the best-effort traffic class. The results obtained by Hall and Mars were better than those of the previously described algorithms (FCFS, EDF, SP).

Hall's method assigns equal action probabilities to all system states. Ferra et al. [6] extended Hall's method so that the scheduler can prefer some actions depending on information about the waiting time in each queue. Ferra et al. used reinforcement learning to learn the scheduling policy and obtained much better results than Hall.

This paper is inspired by the work of Ferra et al. We have pointed out that their method does not take into consideration some system properties whose acknowledgment could accelerate the learning process and yield better results in terms of allocating the available bandwidth among traffic classes. Our main contribution is a modification of the reinforcement function.

The paper is organized as follows. In the next section we formulate the problem statement. Section three describes the proposed method. Experiments are described in the fourth section. The last section contains conclusions and points out possible further work.

2 Problem Formulation

In our model of the network router, all packets have a fixed, constant length. This is typical for the internal queues in routers that use a cell switching fabric [5]. This assumption allows us to model traffic using a discrete-time arrival process, where one time slot of fixed length is required to transmit one packet.

In our system we can identify the following blocks. The input flow represents all traffic packets sent to the router. The classifier identifies the class of each packet in the input flow and moves it to the appropriate queue. The priority queues hold packets from different traffic classes; each traffic class has its own queue. The last element of our system is the scheduler. Its task is to choose the queue which will be serviced. Figure 1 presents the system model. Below, all blocks are described more precisely.

Figure 1: Packet scheduling for multiple traffic classes

The classifier moves packets from the input flow to the appropriate queues. We assume perfect recognition of packet classes. Each traffic class i is represented by a separate priority queue q_i, where i = 1, ..., N. All queues have finite capacity and can hold only N_i packets. If q_i holds exactly N_i packets and the classifier finds in the input flow another packet from this traffic class, this packet is rejected.

All traffic classes have the same type of arrival distribution, described by a Bernoulli process with mean value λ_i for the i-th traffic class. This means that in each time slot we can observe a new packet in queue q_i with probability λ_i.

The mean delay requirement R^t_i is the maximum acceptable mean delay per packet in queue q_i at time slot t. We assume that this requirement is defined for the queues q_1, ..., q_{N-1} and that it can change in time. The last queue q_N does not have any delay requirement and should be served as fast as possible; we denote this queue as the best-effort queue.

The scheduler has to choose exactly one packet to be transmitted in each time slot. Formally, the scheduler must choose one action a_i from the action set A = {a_1, a_2, ..., a_N}. Action a_i means that the server will take one packet from queue q_i. All queues are served in FCFS (First Come First Served) order. The scheduler uses a strategy π to choose an action based on the current system state. The strategy is the function which maps the current system state to an action:

π : Ω → A (1)

A strategy function can be static and predefined in the system, or it can be dynamic, in which case the scheduler can change and improve it while the system is working.

The system state x^t ∈ Ω in time slot t can be defined as a vector:

x^t = [Q^t_1, \ldots, Q^t_N]    (2)

where Q^t_i is the state of queue q_i in time slot t. The state of the i-th queue is described as a vector:

Q^t_i = [p^t_{i,1}, p^t_{i,2}, \ldots, p^t_{i,L^t_i}]    (3)

where p^t_{i,j} ∈ \mathbb{Z} is the waiting time of the j-th packet in the i-th queue at time slot t, and L^t_i is the number of packets waiting in queue q_i at time slot t. It always holds that L^t_i ≤ N_i.

Based on Q^t_i we can define the mean waiting time in queue q_i in time slot t as a function m : \mathbb{Z} \times \mathbb{Z} \to \mathbb{R}:

m(i, t) = \frac{\sum_{j=1}^{L^t_i} p^t_{i,j}}{L^t_i}    (4)

Our goal is to find a strategy π which assures the smallest possible mean delay in the best-effort queue q_N while simultaneously holding the delay requirements for the remaining queues: \forall i \in \{1, \ldots, N-1\}: m(i, t) < R^t_i.
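The discrete-time model above can be summarised in a few lines of code. The following is only a minimal sketch under the stated assumptions (Bernoulli arrivals, finite queues, one packet served per slot); the class and method names are illustrative choices, not part of the paper.

import random

class RouterModel:
    """Sketch of the discrete-time router model: N finite queues with Bernoulli arrivals."""

    def __init__(self, arrival_rates, capacities):
        self.lam = arrival_rates                    # lambda_i for each queue
        self.cap = capacities                       # capacity N_i for each queue
        self.queues = [[] for _ in arrival_rates]   # waiting times p_{i,j} of queued packets

    def step(self, action):
        """One time slot: serve the chosen queue in FCFS order, then apply arrivals."""
        if self.queues[action]:
            self.queues[action].pop(0)              # transmit the oldest packet
        for i, q in enumerate(self.queues):
            for j in range(len(q)):
                q[j] += 1                           # remaining packets wait one more slot
            if random.random() < self.lam[i]:
                if len(q) < self.cap[i]:
                    q.append(0)                     # new packet arrives with probability lambda_i
                # otherwise the packet is rejected (queue q_i is full)

    def mean_waiting_time(self, i):
        """m(i, t) from Eq. (4); returns 0 for an empty queue, where Eq. (4) is undefined."""
        q = self.queues[i]
        return sum(q) / len(q) if q else 0.0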


3 Proposed method

To solve the problem described in the previous section we propose a reinforcement learning approach. This technique allows us to learn the scheduling policy π while the system is working, without prior knowledge of the traffic distributions, based only on the feedback received after performed actions and on knowledge of the system state.

The basic idea behind a reinforcement learning algorithm can be described in five steps:

1. observe the current system state x^t,

2. choose the action a^t to perform in the state x^t,

3. perform the selected action a^t,

4. observe the reinforcement r^t and the next state x^{t+1},

5. learn from the experience <x^t, a^t, r^t, x^{t+1}>.

To use the above idea one must solve three very important problems that determine its success: the representation of states, the reinforcement function and the learning algorithm. These three problems are discussed in the following subsections.

3.1 State representation

Based on the problem formulation (section 2), we have an infinite state space. All reinforcement learning algorithms work only if the numbers of states and actions are finite. Therefore we use a state aggregation method [10, 8]. In this approach many states are recognized as one state and that aggregated state is passed to the learning algorithm.

The efficiency of reinforcement learning depends on the size of the state space. If the state space is very large, the learning process can be very long and ineffective because too many parameters must be tuned. This feature of reinforcement learning must be considered when developing the aggregation function. We have tested a number of aggregation functions, but the best results were produced by the method proposed by Ferra [6].

Let us introduce the aggregation function, which transforms the given state space into another one:

Ψ : Ω → Ω'    (5)

We assume that the size of the state space Ω' is smaller than that of Ω. The form of the aggregation function is the following:

Ψ(x^t) = [ψ(Q^t_1), \ldots, ψ(Q^t_{N-1})]    (6)

where the state of each queue is transformed by the function:

ψ(Q^t_i) = \begin{cases} 1, & \text{if } m(i, t) \le R^t_i \\ 0, & \text{if } m(i, t) > R^t_i \end{cases}    (7)

In other words, after applying the above aggregation function to a given system state we obtain a vector of size N−1 in which all queues with satisfied mean-delay constraints have the value 1, and the others the value 0. For example, the vector [1, 1, 1, 0, 0] corresponds to the situation where the system consists of six queues and queues q_4 and q_5 do not satisfy requirements R_4 and R_5 in time slot t.

It is worth noticing that in the adopted aggregation approach there is no variable corresponding to the best-effort queue q_N. Our state space has been reduced to 2^{N-1} states. Such a simplification allows us to solve real-life problems where the number of queues is greater than 10. An additional advantage is connected with the queue capacities: their sizes do not influence the number of system states.
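A minimal sketch of this aggregation, reusing the notation of Eqs. (6)-(7); the function names and the way mean delays are passed in are illustrative choices, not part of the paper.

def psi(mean_delay, requirement):
    """psi(Q_i^t) from Eq. (7): 1 if the mean-delay constraint is satisfied, 0 otherwise."""
    return 1 if mean_delay <= requirement else 0

def aggregate(mean_delays, requirements):
    """Psi(x^t) from Eq. (6): one bit per constrained queue q_1..q_{N-1}.

    `mean_delays[i]` is m(i, t) and `requirements[i]` is R_i^t; the best-effort
    queue q_N is deliberately left out, so the result has N-1 entries.
    """
    return tuple(psi(m, r) for m, r in zip(mean_delays, requirements))

# Example: six queues, q_4 and q_5 violate their constraints -> (1, 1, 1, 0, 0)
print(aggregate([10, 20, 30, 80, 90], [40, 40, 40, 50, 50]))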

3.2 Reinforcement function

A reinforcement function provides feedback to the learning algorithm about the effects of the performed actions. Based on this information the learning algorithm can change the scheduling policy if another action leads the system to a better state in the considered time step or even in the future.

The reinforcement function is a crucial element for a learning algorithm in terms of the obtained results and the convergence. The function proposed by Ferra [6] does not take into consideration many important aspects of the system state whose inclusion could improve both the results and the convergence time. In this paper we refer to Ferra's proposition of the reinforcement function as the “Ferra function”.

As a major improvement over the Ferra function we propose to take into account the waiting time in the best-effort queue and the degree to which the QoS requirements are violated in all queues. The proposed reinforcement function is given by equation (10):


r(x^t) = \sum_{i=1}^{N-1} \left( \zeta(Q^t_i) \cdot \xi(x^{t-1}, i) \right) + \varsigma(Q^t_N) \cdot \xi(x^{t-1}, N) + \phi(\Psi(x^t), \Psi(x^{t-1}))    (10)

Below we describe all elements of the r(x^t) function. The ζ(Q^t_i) function is responsible for holding the time constraint in queue q_i but, in the case when the time constraint has been broken, it measures the degree to which the QoS is violated. The function is given by equation (11):

\zeta(Q^t_i) = \begin{cases} \dfrac{C_1^2 \cdot m(i,t)}{R^t_i}, & \text{if } m(i, t) \le R^t_i \\[4pt] -\dfrac{C_2 \cdot m(i,t)^2}{R^t_i}, & \text{if } m(i, t) > R^t_i \end{cases}    (11)

where C_1 and C_2 are constants whose values should be fixed before the learning process starts, and m(i, t) (eq. 4) is the mean waiting time in queue i at time step t. Figure 2 presents a visual representation of ζ(Q^t_i).

Figure 2: Visualization of the ζ(Q^t_i) function

ς(Q^t_i) (eq. 12) is used as the reinforcement function for the best-effort queue. The task of this function is to promote situations in which packets wait only a short time.

\varsigma(Q^t_i) = \frac{-C_4 \cdot m(i, t)}{\max(m(i, 1), \ldots, m(i, t))} + C_4    (12)

where C_4 is a constant value and \max(m(i, 1), \ldots, m(i, t)) is the maximal mean waiting time observed in the queue up to the t-th time step.

After each decision we scale the value for the last served queue by a factor of 0.3. Such scaling is introduced for situations in which packets of only one traffic class are being served. For the scaling purposes we introduce the function ξ(x^t, i), given by equation (13):

\xi(x^t, i) = \begin{cases} 0.3, & \text{if } \pi^t(x^t) = a_i \\ 1, & \text{if } \pi^t(x^t) \ne a_i \end{cases}    (13)

Figure 3: Visualization of the ς function

where π^t(x^t) (eq. 1) is the strategy function used to obtain an action in state x^t, and a_i ∈ A is the action responsible for serving queue q_i.

The last element which we take into consideration in the reinforcement function (eq. 10) is the situation in which one or more mean waiting times in queues are greater than the constraints assigned to these queues. An action which improves the system state by decreasing the waiting time in such queues should be additionally rewarded. For this purpose we use the φ(x^t, x^{t-1}) function, given by the following equation:

\phi(x^t, x^{t-1}) = C_3 \cdot \max\left( \sum_{i=1}^{N-1} (x^t_i - x^{t-1}_i),\ 0 \right)    (14)

where x^t and x^{t-1} ∈ Ω' (the state space obtained after applying the aggregation method introduced in the previous section), C_3 is a constant, and x^t_i is the i-th value of the state vector in time slot t.

The φ(x^t, x^{t-1}) function takes two parameters, the aggregated state vectors for time steps t and t−1. The value returned by this function reflects the number of queues in which the waiting time improved enough to satisfy their time constraints; this count is scaled by the constant C_3. The function does not penalize the situation in which the next system state is worse than the previous one in terms of the number of satisfied constraints; in such situations it returns zero.
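Putting Eqs. (10)-(14) together gives the following sketch. It is only an illustration: the helper names and the way state information is passed in are our own, the default constants follow Table 2, and the exact form of the first branch of Eq. (11) is hard to recover from the source, so that branch follows one plausible reading.

def zeta(m_it, R_it, C1=50.0, C2=30.0):
    """Eq. (11): reward while the constraint holds, quadratic penalty once it is broken.

    The exact form of the first branch is an assumption (the source equation is garbled).
    """
    if m_it <= R_it:
        return C1 ** 2 * m_it / R_it
    return -C2 * m_it ** 2 / R_it

def varsigma(m_Nt, m_hist_max, C4=20.0):
    """Eq. (12): best-effort term, close to C4 when the current mean wait is short."""
    return -C4 * m_Nt / max(m_hist_max, 1e-9) + C4

def xi(last_action, i):
    """Eq. (13): scale the term of the queue served by the last decision by 0.3."""
    return 0.3 if last_action == i else 1.0

def phi(agg_now, agg_prev, C3=50.0):
    """Eq. (14): bonus for constraints that became satisfied; never negative."""
    return C3 * max(sum(a - b for a, b in zip(agg_now, agg_prev)), 0)

def reinforcement(mean_waits, requirements, last_action, agg_now, agg_prev, best_effort_max):
    """Eq. (10): mean_waits has one entry per queue, the last one being best-effort."""
    N = len(mean_waits)
    r = sum(zeta(mean_waits[i], requirements[i]) * xi(last_action, i) for i in range(N - 1))
    r += varsigma(mean_waits[N - 1], best_effort_max) * xi(last_action, N - 1)
    return r + phi(agg_now, agg_prev)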

3.3 Learning algorithm

Let us consider a system in state x^t at time step t. The scheduler selects one of the available actions a^t ∈ A according to the policy π^t. As a result of performing the selected action the system changes its state from x^t ∈ Ω to x^{t+1} ∈ Ω and the scheduler receives information about the effects of the selected action, returned by the reinforcement function r(x^{t+1}) (eq. 10). An action a^t changes the system state in a non-deterministic manner: one packet is serviced, but new ones can arrive with probability λ_i for each queue q_i.

The scheduler receives a sequence of reinforcement values (r(x^{t+1}), r(x^{t+2}), r(x^{t+3}), ...) in the future time units.


Our goal is to find the strategy π which maximizes the following expression:

V^\pi(x^t) = E\left[ \sum_{k=1}^{\infty} \gamma^k r(x^{t+k}) \right]    (15)

where V^\pi : Ω → \mathbb{R}, E denotes the expected value, γ is a discount factor, and r(x^{t+k}) is the reward received in time step t + k. The discount factor 0 ≤ γ < 1 assures that the sum of future rewards is finite. A second advantage of using the discount factor can be found in [10]. V^\pi(x^t) is called the value function for strategy π.

Let us introduce the function g : Ω × A → \mathbb{R}, which returns the expected value of the reward after performing action a^t in state x^t:

g(x^t, a^t) = \sum_{y \in \Omega} P(y \mid x^t, a^t) \cdot r(y)    (16)

where P(y | x^t, a^t) is the probability of the system changing its state from x^t ∈ Ω to y ∈ Ω after action a^t ∈ A.

We can now calculate V^\pi(x^t) for any policy π using eq. (17):

V^\pi(x) = g(x, \pi(x)) + \gamma \sum_{y \in \Omega} P(y \mid x, \pi(x)) \cdot V^\pi(y)    (17)

More useful for the learning process is the action-value function Q^\pi : Ω × A → \mathbb{R}, which calculates the discounted sum of future rewards when in state x^t we select an arbitrary action a^t and in the following time steps we select actions according to the strategy π:

Q^\pi(x, a) = g(x, a) + \gamma \sum_{y \in \Omega} P(y \mid x, a) \cdot Q^\pi(y, \pi(y))    (18)

In this situation an optimal policy π^* (not necessarily unique) is obtained by maximizing:

\pi^*(x) = \arg\max_{a \in A} Q^*(x, a)    (19)

Q^*(x, a) is the optimal action-value function, given by the following equation:

Q^*(x, a) = g(x, a) + \gamma \sum_{y \in \Omega} P(y \mid x, a) \cdot \max_{a' \in A} Q^*(y, a')    (20)

The situation would be simple if P(y | x, a) were known for all x, y ∈ Ω, because then the optimal values of the Q-function could be computed directly and the optimal policy π^* could be chosen. Since we assume that the traffic intensities are unknown, it is impossible to calculate those probabilities in advance.

Algorithm 1 Q-learning pseudo code

Require: i := 0
  α – learning rate, 0 < α < 1
  γ – discount factor, 0 ≤ γ < 1
  Q[x, a] ← initialized arbitrarily for all x ∈ Ω and a ∈ A

1: repeat
2:   i := i + 1
3:   Observe current state x
4:   Apply aggregation function x = Ψ(x)
5:   Choose a for aggregated state x using policy π(x) derived from Q (ε-greedy, soft-max)
6:   Perform action a, observe next state y
7:   Calculate r(y)
8:   ∆ = α(r(y) + γ max_{a ∈ A} Q[Ψ(y), a] − Q[x, a])
9:   Q[x, a] := Q[x, a] + ∆
10: until forever

To avoid this inconvenience we use the Q-learning method, whose pseudo code is presented as Algorithm 1. The Q-learning algorithm stores the values of Q^*(x, a) in a two-dimensional array. In each time step the algorithm updates the values in this array. The updating process takes into account four parameters: the previous aggregated system state, the selected action, the value of the reinforcement function for the current state, and the current aggregated system state. Based on these parameters the algorithm calculates ∆ (line 8 in the pseudo code), which can be treated as a correction to the previous approximation of the optimal action-value function Q^*(x, a). In this correction the Q-learning method uses the parameter α, the learning rate. In static problems α can be decreased over time in such a way that the values stored in Q[x, a] are the arithmetic means of all approximations [10]. For dynamic problems α should be constant in order to adapt better to changing traffic intensities or QoS requirements.
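A compact sketch of the update in lines 8-9 of Algorithm 1 is shown below; the dictionary-based Q-table and function names are illustrative, and the default α and γ follow Table 2.

from collections import defaultdict

# Q-table initialised arbitrarily (here: zeros), indexed by (aggregated state, action).
Q = defaultdict(float)

def q_update(Q, x_agg, action, reward, y_agg, actions, alpha=0.05, gamma=0.5):
    """Lines 8-9 of Algorithm 1: Q[x,a] := Q[x,a] + alpha*(r + gamma*max_a' Q[y,a'] - Q[x,a])."""
    best_next = max(Q[(y_agg, a)] for a in actions)
    delta = alpha * (reward + gamma * best_next - Q[(x_agg, action)])
    Q[(x_agg, action)] += delta
    return delta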

When certain conditions are fulfilled, it can be proven that Q-learning converges to the optimal values of Q^*(x, a) [8, 2, 4]. To meet those requirements we need to assure that all available actions in each state are selected during the learning process. This can be done by the action selection strategies described in the next subsection.

3.4 Exploration and Exploitation

During the learning process we need to assure a balance between the exploitation of known actions and the exploration of new ones. We need to explore the available actions to ensure good learning convergence, but we also need to select the best known actions to ensure good scheduling quality. In our experiments we have used two strategies: ε-greedy and soft-max.

To understand how those strategies work let us extend

5

Page 6: Reinforcement approach to Adaptive Package Sheduling in Routers

8/8/2019 Reinforcement approach to Adaptive Package Sheduling in Routers

http://slidepdf.com/reader/full/reinforcement-approach-to-adaptive-package-sheduling-in-routers 6/13

the definition of the policy function presented in equation

1 to the following form:

π : Ω × A → [0, 1] (21)

Now the policy takes two parameters: an aggregated system state x and an action a ∈ A. For those parameters the policy function returns the probability of selecting action a in state x.

In the ε-greedy strategy all non-optimal actions have equal selection probabilities:

\forall a \in A,\ a \notin \arg\max_{b \in A} Q(x, b) \;\Rightarrow\; \pi(x, a) = \frac{\varepsilon}{|A|}    (22)

and for the optimal actions the selection probability is defined as:

\forall a \in A,\ a \in \arg\max_{b \in A} Q(x, b) \;\Rightarrow\; \pi(x, a) = \frac{1 - \varepsilon}{|\arg\max_{b \in A} Q(x, b)|} + \frac{\varepsilon}{|A|}    (23)

where the set of optimal actions A^* = \arg\max_{b \in A} Q(x, b) contains the actions with the highest Q-function value. The parameter ε is constant and its value should be fixed before the learning process starts; ε should be greater than 0 and less than 1/|A|, where |A| is the number of elements in the set A.

There are two main disadvantages of the ε-greedy strategy: ε is constant during the whole learning process, and the selection probabilities of the non-optimal actions are equal even if the corresponding Q-function values differ greatly. The first disadvantage can easily be fixed by decreasing the value of ε in successive iterations. In the early learning phases we need to explore the available actions, and later we want to fine-tune the learning process and decrease the probabilities of selecting non-optimal actions.

The second disadvantage of ε-greedy is solved by the soft-max strategy. In this approach the probability of selecting an action in each system state depends on the action's quality in comparison to the other actions (eq. 24):

\forall a \in A:\ \pi(x, a) = \frac{\exp(Q[x, a]/\tau(i))}{\sum_{b \in A} \exp(Q[x, b]/\tau(i))}    (24)

The function τ is called the temperature. When its value is high, the selection probabilities of all actions are very similar, but when τ → 0 only the optimal actions have non-zero selection probabilities. Because we want to decrease the temperature during the learning process, it depends on the current learning iteration i. Equation (25) defines the value of τ(i):

\tau(i) = \begin{cases} \dfrac{\tau_k - \tau_s}{\tau_i} \cdot i + \tau_s, & \text{if } i \le \tau_i \\[4pt] \tau_k, & \text{if } i > \tau_i \end{cases}    (25)

where i is the learning iteration, τ_s and τ_k are the starting and ending temperatures respectively, and τ_i is the number of iterations over which the value of the function decreases until it reaches τ_k.
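The two action-selection strategies can be sketched as follows; the dictionary-based Q-table is the same illustrative structure as before, and the default parameter values follow Tables 2 and 4.

import math
import random

def epsilon_greedy(Q, x_agg, actions, eps=0.09):
    """Eqs. (22)-(23): with probability eps pick uniformly, otherwise pick a greedy action."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(x_agg, a)])

def temperature(i, tau_s=50.0, tau_k=3.0, tau_i=2000):
    """Eq. (25): temperature decreases linearly from tau_s to tau_k over tau_i iterations."""
    return (tau_k - tau_s) / tau_i * i + tau_s if i <= tau_i else tau_k

def soft_max(Q, x_agg, actions, i):
    """Eq. (24): sample an action with probability proportional to exp(Q[x,a]/tau(i))."""
    tau = temperature(i)
    weights = [math.exp(Q[(x_agg, a)] / tau) for a in actions]
    threshold, acc = random.random() * sum(weights), 0.0
    for a, w in zip(actions, weights):
        acc += w
        if threshold <= acc:
            return a
    return actions[-1]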

4 Experiments

To evaluate the proposed method we performed a series of experiments, which can be divided into three groups. The experiments in the first group check the convergence of the learning process and the achieved mean waiting times in a system containing three traffic classes, with the parameters of the arrival process constant in time. The experiments in the second group check how well the method copes with changes in the traffic arrival process and the QoS requirements. The last group of experiments checks the behaviour of the system when additional traffic classes are included. All results are compared to the “Ferra method”.

4.1 Three traffic classes

The first three experiments evaluate our method with the soft-max action selection strategy (eq. 24) and three traffic classes. In all these experiments we have used the same learning parameters (Table 2) and QoS requirements but different probabilities of packet arrivals in the traffic classes (Table 1). All experiments start with a clean system (without any packets in the queues) and a randomly selected initial scheduling policy, which is thereafter improved by the Q-learning method.

Queue   Arrival Rate [packet / time slot]   Mean Delay Constraint [time slots]
        Exp. 1   Exp. 2   Exp. 3
1       15       15       20                100
2       25       25       25                40
3       25       55       55                best effort

Table 1: Traffic parameters and QoS requirements in experiments 1 - 3

All experiments are repeated ten times, and the results are presented in Figures 4, 5 and 6, respectively.

In all three experiments the learning time was shorter than with the “Ferra method”. This is especially visible in the third experiment, where convergence was approximately 3 times faster. The mean waiting times in the queues with QoS requirements are comparable for both methods, but our method produces better policies: they assure significantly smaller waiting times in the best-effort queue, especially in the early learning stages (Table 3).


Figure 4: Results obtained in the first experiment for the “Ferra method” (function 1) and the authors' method (function 2); the QoS constraints and traffic conditions are presented in Table 1

Figure 5: Results obtained in the second experiment for the “Ferra method” (function 1) and the authors' method (function 2); the QoS constraints and traffic conditions are presented in Table 1


Figure 6: Results obtained in the third experiment for the “Ferra method” (function 1) and the authors' method (function 2); the QoS constraints and traffic conditions are presented in Table 1

A comparison of the results of the first three experiments is presented in Fig. 7.

Figure 7: Comparison of the results obtained using the “Ferra method” (function 1) and the authors' method (function 2) in the first three experiments

In the fourth and fifth experiments we use the ε-greedy action selection method in order to evaluate the performance with a different exploration strategy. The parameters used in the learning process are presented in Table 4, and the traffic parameters are equal to those used in the previous experiments (Table 1).

During the experiments we found that the “Ferra method”

Function parameters        Learning parameters
C_1    50                  γ      0.5
C_2    30                  α      0.05
C_3    50                  τ_i    2000
C_4    20                  τ_s    50
                           τ_k    3

Table 2: Learning parameters used in the first three experiments

                   Exp. 1   Exp. 2   Exp. 3
Ferra method       11       320      320
Authors' method    12       206      175

Table 3: Maximal waiting time in the best-effort queue in the first three experiments

Function parameters        Learning parameters
C_1    50                  γ    0.5
C_2    30                  α    0.05
C_3    50                  ε    0.09
C_4    20

Table 4: Learning parameters used in the fourth and fifth experiments


may lead to situations where the learning process cannot tune itself and the obtained scheduling policies do not preserve the QoS requirements (Fig. 8). By changing the value of ε we can avoid such situations, but this must be done for each case separately, which effectively eliminates the method from practical usage. In contrast, the proposed method copes with the same examples very well without any hand-made tuning (Fig. 8). Even in the cases where the “Ferra method” finds satisfactory policies (Fig. 9), the mean waiting time in the best-effort queue is smaller with the proposed method.

It is worth noticing that the results obtained using the ε-greedy strategy are worse than those obtained with the soft-max action selection method. This is connected with the fact that in the ε-greedy strategy the probability of selecting a non-optimal action must be relatively large during the whole learning process in order to assure good state space exploration in the early learning phases. This inconvenience is fixed in the soft-max method, where we decrease this probability during the learning process, so the produced strategies are well tuned.

4.2 Adapting to traffic condition changes

The aim of the next experiment is to check the ability of our method to adapt to traffic condition changes. The learning process took 100000 time slots. The experiment starts with the traffic parameters presented in Table 6, but after 50000 time slots the arrival process and the QoS requirements are changed (Table 7). In the learning process we use the soft-max action selection method with the parameters presented in Table 2.

Queue   Arrival Rate [packet / time slot]   Mean Delay Constraint [time slots]
1       50                                  400
2       30                                  300
3       15                                  best effort

Table 6: Starting traffic parameters and QoS requirements used in the sixth experiment

Queue   Arrival Rate [packet / time slot]   Mean Delay Constraint [time slots]
1       30                                  250
2       50                                  350
3       5                                   best effort

Table 7: Ending traffic parameters and QoS requirements used in the sixth experiment

The convergence of the proposed method after switching the traffic parameters is faster than in the method proposed by Ferra; additionally, the scheduling strategy obtained by our method in the first traffic phase is significantly better than that obtained by the “Ferra method”. The results of this experiment are presented in Fig. 10.

4.3 Greater number of traffic classes

The last two experiments were performed to evaluate our method with a greater number of traffic classes. In the case of four traffic classes, with the soft-max action selection method, the learning parameters presented in Table 2 and the traffic conditions from Table 8, the scheduling policies obtained by our method outperform the policy produced by the “Ferra method”. The mean waiting times in the queues with QoS requirements are more stable, and the mean waiting time for the best-effort queue is significantly better. The results are presented in Table 8 and in Figure 11.

In the last experiment the number of traffic classes was increased to five. The learning parameters were the same as in the previous experiment. The traffic conditions and the obtained results are presented in Table 9. In Fig. 12 we can see that the “Ferra method” cannot find a good scheduling strategy, whereas our method finds a strategy which assures system stability – the mean waiting times do not grow in time.

5 Conclusion

In the paper we have formulated the packet scheduling problem and solved it using a reinforcement learning method. The main contribution of our work is the proposition of a new reinforcement function, which is compared with the method proposed by Ferra [6]. We have evaluated the proposed method in a number of experiments. The obtained results are significantly better: our method is characterized by faster learning and better waiting times in the best-effort queue, especially in the early learning stages. The proposed method preserves these properties also with a greater number of traffic classes.

Future work may concentrate on the state aggregation function as well as on further improvements of the reinforcement function. It is also desirable to introduce a mechanism which can detect significant changes in traffic conditions and adjust the learning parameters, for example the learning rate or the discount factor.

References

[1] T. Anker, R. Cohen, D. Dolev, and Y. Singer. Prfq:

Probabilistic fair queuing, 2000.


Queue   Arrival Rate           Mean Delay Constraint   Achieved Mean Delay [time slots]   Standard Deviation [time slots]
        [packet / time slot]   [time slots]            Fun. 1     Fun. 2                  Fun. 1     Fun. 2
1       15                     100                     92.42      92.80                   0.04       0.045
2       25                     40                      37.14      37.73                   0.004      0.0006
4       25                     best-effort             6.81       6.08                    0.09       0.05

Table 5: Results obtained in the fifth experiment using the “Ferra method” (function 1) and the proposed method (function 2)

Figure 8: Using the ε-greedy action selection method with the “Ferra method” (function 1) can lead to a situation where the learning process cannot tune itself, while the proposed method finds satisfactory solutions

Queue   Arrival Rate           Mean Delay Constraint   Achieved Mean Delay [time slots]   Standard Deviation [time slots]
        [packet / time slot]   [time slots]            Fun. 1     Fun. 2                  Fun. 1      Fun. 2
1       15                     70                      62.39      63.74                   0.071       0.045
2       25                     200                     180.27     190.37                  0.002       0.020
3       25                     100                     92.03      97.72                   0.023       0.007
4       25                     best-effort             84.54      8.63                    302.777     0.103

Table 8: Parameters used and results obtained for the experiment with four traffic classes


Figure 9: Results obtained using the “Ferra method” (function 1) and the proposed method (function 2) to find scheduling policies with Q-learning and the ε-greedy action selection method

Figure 10: Results obtained using the “Ferra method” (function 1) and the proposed method (function 2) when the traffic conditions are changed during the learning process


Figure 11: Mean waiting times obtained for the four traffic classes after using the policies produced by the “Ferra method” (function 1) and the authors' method (function 2)

Queue   Arrival Rate           Mean Delay Constraint   Achieved Mean Delay [time slots]   Standard Deviation [time slots]
        [packet / time slot]   [time slots]            Fun. 1      Fun. 2                 Fun. 1        Fun. 2
1       20                     70                      65.44       67.50                  0.015         0.001
2       10                     200                     176.25      163.95                 0.698         2.166
3       10                     100                     85.78       71.66                  0.204         0.222
4       25                     150                     114.08      147.87                 0.059         0.007
5       35                     best-effort             7023.99     1705.05                34313.020     212.304

Table 9: Parameters used and results obtained for the experiment with five traffic classes

[2] R. Babuska, L. Busoniu, and B. De Schutter. Rein-

forcement learning for multi-agent systems. Tech-

nical Report 06-041, Delft Center for Systems and

Control, Delft University of Technology, July 2006.

Paper for a keynote presentation at the 11th IEEE

International Conference on Emerging Technolo-

gies and Factory Automation (ETFA 2006), Prague,

Czech Republic, Sept. 2006.

[3] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang,

and W. Weiss. An architecture for differentiated ser-

vice, 1998.

[4] Hyeong Soo Chang, Michael C. Fu, Jiaqiao Hu, Ji-

aqiao Hu, and Steven I. Marcus. Simulation-based

Algorithms for Markov Decision Processes (Com-

munications and Control Engineering). Springer-

Verlag New York, Inc., Secaucus, NJ, USA, 2007.

[5] Cisco IOS Documentation. Quality of service so-

lution guide, implementing diffserv for end-to-end

quality of service, 2002.

[6] Herman L. Ferra, Ken Lau, Christopher Leckie, and

Anderson Tang. Applying reinforcement learning to

packet scheduling in routers. In IAAI , pages 79–84,

2003.

[7] J. Hall and P. Mars. Satisfying qos with a learning

based scheduling algorithm, 2000.

[8] Leslie Pack Kaelbling, Michael L. Littman, and An-

drew P. Moore. Reinforcement learning: A survey.

Journal of Artificial Intelligence Research, 4:237–

285, 1996.


Figure 12: Mean waiting times obtained for the five traffic classes after using the policies produced by the “Ferra method” (function 1) and the authors' method (function 2)

[9] S. Shenker, C. Partridge, and R. Guerin. Specifi-

cation of guaranteed quality of service. RFC 2212,

1997.

[10] R.S. Sutton and A.G. Barto. Reinforcement Learn-

ing: An Introduction. MIT Press, Cambridge, MA,

1998.

[11] H. Wang, C. Shen, and K. Shin. Adaptive-weighted

packet scheduling for premium service, 2001.

[12] H. Zhang. Service disciplines for guaranteed perfor-

mance service in packet-switching networks, 1995.
