

Large-Scale Traffic Grid Signal Control Using Decentralized Fuzzy Reinforcement Learning

Tian Tan, TianShu Chu, Bo Peng, and Jie Wang

Department of Civil and Environmental Engineering, Stanford University, Stanford, CA 94305-2004, USA

{tiantan,cts1988,bpeng,jiewang}@stanford.edu

Abstract. With the rise of rapid urbanization around the world, a majority of countries have experienced a significant increase in traffic congestion. The negative impacts of this change have resulted in a number of serious and adverse effects, not only regarding the quality of daily life at an individual level but also for nations' economic growth. Thus, the importance of traffic congestion management is well recognized. Adaptive real-time traffic signal control is effective for traffic congestion management. In particular, adaptive control with reinforcement learning (RL) is a promising technique that has recently been introduced in the field to better manage traffic congestion. Traditionally, most studies on traffic signal control have used centralized reinforcement learning, whose computational inefficiency prevents it from being employed for large traffic networks. In this paper, we propose a computationally cost-effective distributed algorithm, namely, a decentralized fuzzy reinforcement learning approach, to deal with problems related to the exponentially growing number of possible states and actions in RL models for a large-scale traffic signal control network. More specifically, the traffic density at each intersection is first mapped to four different fuzzy sets (i.e., low, medium, high, and extremely high). Next, two different kinds of algorithms, greedy and neighborhood approximate Q-learning (NAQL), are adaptively selected, based on the real-time, fuzzified congestion levels. To further reduce computational costs and the number of state-action pairs in the RL model, coordination and communication between the intersections are confined within a single neighborhood, i.e., the controlled intersection with its immediate neighbor intersections, for the NAQL algorithm. Finally, we conduct several numerical experiments to verify the efficiency and effectiveness of our approach. The results demonstrate that the decentralized fuzzy reinforcement learning algorithm achieves comparable results when measured against traditional heuristic-based algorithms. In addition, the decentralized fuzzy RL algorithm generates more adaptive control rules for the underlying dynamics of large-scale traffic networks. Thus, the proposed approach sheds new light on how to provide further improvements to a networked traffic signal control system for real-time traffic congestion.


© Springer International Publishing AG 2018. Y. Bi et al. (eds.), Proceedings of SAI Intelligent Systems Conference (IntelliSys) 2016, Lecture Notes in Networks and Systems 15, DOI 10.1007/978-3-319-56994-9_44


Keywords: Traffic signal control · Fuzzy logic · Approximate Q-learning · Reinforcement learning · Multi-agent systems · Intelligent systems

1 Introduction

Traffic congestion is getting exponentially worse, which brings greater attention to the importance of effective traffic management. With today's advanced technology, traffic signals have become ubiquitous. This fact leads us to the question of how to utilize traffic signals efficiently to make optimal traffic-management decisions. Many traditional traffic control systems use a time-fixed model. These systems are encoded based on the time periods within a day and the days of a week [1-4]. However, due to the dynamic nature and large variation of traffic flows, the time-fixed approach is inefficient. Therefore, it is necessary to develop a traffic signal control model that adapts based on real-time traffic data.

With the rapid development of computer science and information technology, the concept of adaptive signal control has drawn greater attention from transportation researchers. Q-learning, one of the most important techniques of reinforcement learning (RL), can be applied to learn traffic patterns and obtain real-time control [5]. Q-learning was first applied to isolated intersection controls in [6]. Since then, multi-agent models have also incorporated Q-learning to solve traffic signal control issues [7-10].

However, there are serious challenges in implementing Q-learning algorithms for large-scale traffic signal control. A traditional way to apply Q-learning is to take the traffic grid as a whole, treating the different traffic conditions in the grid as separate states, which results in increased computational costs. When it comes to large-scale control, the computations are so costly that the centralized method becomes impractical, since the state space and the action space grow exponentially with the increasing number of intersections [5,11]. Therefore, a decentralized model was proposed in [12] to better solve this problem. More specifically, decentralized models can be divided into totally decentralized models, where each intersection functions as an independent control agent, and relatively less decentralized models, which consider the controlled intersection together with its surrounding cooperative intersections. One example of a relatively less decentralized model is the regional reinforcement learning proposed in [13]. In this paper, we introduce fuzzy logic as a linkage between the totally decentralized model and the less decentralized model to improve computational efficiency. In related work, Qiao et al. [14] used genetic algorithms to find the optimal fuzzy rules for traffic congestion levels. Additionally, a geometric type-2 fuzzy inference system was proposed and used for urban traffic control problems in [15]. Fuzzy neural networks are an alternative approach to traffic signal optimization [16,17]. Thus, here, we use fuzzy logic within the decentralized RL model with the aim of simultaneously increasing efficiency and cooperative responses with the neighbor intersections. We first


map traffic density into four fuzzy sets (low, medium, high, and extremely high) to represent the varying congestion levels. When the fuzzified congestion level near a controlled intersection is either low or medium, we apply a greedy algorithm for control. In cases of heavier traffic, we apply approximate Q-learning (AQL) to control the intersection along with its immediate neighbor intersections.

The remainder of this paper is organized as follows. Section 2 describes the problem formulation of the traffic signal control model using a directed graph. Section 3 illustrates the new algorithm that combines fuzzy logic with approximate Q-learning. Section 4 analyzes the simulation results. Lastly, in Sect. 5, we present our conclusions.

2 Problem Statement

The traffic grid is represented as a directed graph $G(V, E)$, where $V$ is the node collection and $E$ is the edge collection. An arbitrary node $i$ represents an intersection, and an edge $ij$ represents the road from node $i$ to node $j$. For each intersection $i$, $u_i(t)$ is the traffic signal and $x_i(t)$ is its local observation or state. To simplify the problem, each intersection has only four neighbors (north, south, east, and west), denoted as $V_i = \{N_i, S_i, E_i, W_i\}$. Also, there are only two one-lane roads connecting nodes $i$ and $j$: roads $ij$ and $ji$. The traffic signal control at each intersection, $u_i(t)$, has only two possible actions: green lights for incoming traffic in the north and south directions, or green lights for incoming traffic in the east and west directions. To prevent traffic lights from changing too often, an action can be taken only every $\Delta t$ time steps. The state, i.e., the observation $x_i(t)$, is the number of vehicles waiting on the roads that connect intersection $i$ to its neighbors: $x_i(t) = \{q_{ji}(t)\}_{j \in V_i}$, where the queue length $q_{ji}$ is the number of vehicles waiting on road $ji$.

This problem can be described as a Markov decision process (MDP) with system dynamics $(x, u, g, p)$, standing for the state, action, stage cost, and transition probability. Although the traffic signal at each intersection can be regarded as a single control agent, the global state of the multi-agent system is still Markov: the next state depends only on the current state:

$$x(t+1) = f(x(t), u(t), z(t)) \qquad (1)$$

where $z$ is the randomness generated by the traffic simulation, such as drivers' turning behaviors. Next, the cost function $g$ is defined as the total number of vehicles waiting at all intersections:

$$g(t) = \sum_{ij \in E} |q_{ij}(t)| \qquad (2)$$

where $|q_{ij}(t)|$ is the number of vehicles on road $ij$ with speed zero. Finally, the traffic signal control problem can be formulated as minimizing the number of waiting vehicles on the traffic grid over a certain planning horizon $T$:


$$\begin{aligned}
\underset{u(0),\ldots,u(T-1)}{\text{minimize}} \quad & \frac{1}{T}\,\mathbb{E}\sum_{t=0}^{T-1} g(t) \\
\text{subject to} \quad & x(0) = x_0 \\
& x(t+1) = f(x(t), u(t), z(t)) \\
& u(t) = u(t-1) \ \ \text{if } t \bmod \Delta t \neq 0 \\
& u(t) \in U \ \ \text{if } t \bmod \Delta t = 0
\end{aligned} \qquad (3)$$

Note that the action is only updated every $\Delta t$ time steps. Since the traffic dynamics and transition models are unknown to the control agents, reinforcement learning is a promising approach for solving this problem [5].
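The stage cost in Eq. (2) and the action-holding constraint in Eq. (3) translate directly into code. Below is a minimal sketch, assuming an edge-indexed queue map; the data structure and helper names are illustrative, not from the paper.

```python
# Sketch of the stage cost g(t) from Eq. (2) and the Delta-t
# action-holding rule from Eq. (3); names are illustrative.
from typing import Dict, Tuple

Edge = Tuple[int, int]  # road from node i to node j

def stage_cost(queues: Dict[Edge, int]) -> int:
    """g(t): total number of vehicles waiting (speed zero) on all roads."""
    return sum(queues.values())

def held_action(t: int, dt: int, u_prev: int, u_new: int) -> int:
    """Signals may only switch every dt steps; otherwise keep u(t-1)."""
    return u_new if t % dt == 0 else u_prev
```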

3 Method

Centralized reinforcement learning is computationally infeasible for large-scale traffic signal control, as the state space and the action space grow exponentially with the number of intersections.

To resolve the problems with centralized control, fuzzy logic and fuzzy sets are introduced to reduce computational costs by establishing rules for algorithm selection. To reduce the search space for finding optimal actions, coordination and communication between individual agents, or intersections, are restricted to the controlled intersection and its neighbors.

3.1 Fuzzy Sets and Fuzzification of Queue Length

Different from crisp (two-valued) sets, elements of a fuzzy set are described by a membership degree, which specifies the certainty (or uncertainty) of a specific element belonging to the set. More formally, suppose $X$ is the domain of all possible numeric values and $x \in X$ is an element of the domain; then a fuzzy set $A$ over domain $X$ can be characterized by a membership function $\mu_A$ [18]:

$$\mu_A : X \to [0, 1] \qquad (4)$$

The function $\mu_A$ maps the elements of $X$ to a membership degree between 0 and 1, where 0 means element $x$ is excluded from set $A$, while 1 means it is certain that the element is in set $A$. Therefore, $\mu_A(x)$ indicates to what degree an element $x$ belongs to set $A$.

For the traffic control problem, the numerical domain is the number of vehicles on each road, or edge, between two connected intersections, and the numerical queue length on each road is mapped to four fuzzy sets (Low, Medium, High, and Extremely High) to indicate congestion levels, as shown in Fig. 1 below. Therefore, all possible queue values are mapped to four congestion levels.


Fig. 1. The fuzzy representations of queue length (max. capacity = 20 vehicles on road $ij \in E$)
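The fuzzification step can be sketched as follows. The paper does not give the membership functions of Fig. 1 numerically, so the trapezoidal shapes and breakpoints below are illustrative assumptions for a 20-vehicle road capacity.

```python
# Sketch of fuzzification, assuming trapezoidal membership functions;
# the breakpoints are illustrative, not taken from the paper.
def trapezoid(x: float, a: float, b: float, c: float, d: float) -> float:
    """Membership that rises on [a, b], equals 1 on [b, c], falls on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def fuzzify(queue_len: int) -> dict:
    """Map a numeric queue length to membership degrees of the four sets."""
    return {
        "low":            trapezoid(queue_len, -1, 0, 3, 7),
        "medium":         trapezoid(queue_len, 3, 7, 9, 13),
        "high":           trapezoid(queue_len, 9, 13, 15, 19),
        "extremely_high": trapezoid(queue_len, 15, 19, 20, 21),
    }
```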

3.2 Coordination Between Agents

Each intersection is treated as an individual control agent. Communication and coordination are confined between distributed agents and their immediate neighbors. Since each intersection can choose between two possible actions, or control plans, has at most four immediate neighbors in the traffic grid, and all agents have the same decision-making capability, the number of combinations of possible control actions in the action space is reduced to $2^5 = 32$.

3.3 Rules for Algorithm Selection

For any intersection $i$, the control or action $u_i$ taken at time step $t$ is generated either by the greedy algorithm or by neighborhood approximate Q-learning (NAQL). The fuzzified queue length, or fuzzified congestion level, on each road is used as a guide for choosing between the two algorithms.

For intersection $i$, there are at most four outgoing edges to its neighbors $V_i$, denoted as $E^i_{out}$. The fuzzified congestion levels on the roads $E^i_{out}$ are indicators of the traffic congestion levels at the immediate neighbor intersections. Therefore, if the congestion levels on $E^i_{out}$ are relatively low, it is sufficient for agent $i$ to choose an action that minimizes its own costs without considering its neighbors. Hence, we define the rule as follows: if the maximum congestion level on $E^i_{out}$ at time step $t$ belongs to fuzzy set "Low" or "Medium" with the highest membership degree, the greedy algorithm is applied to choose the action for intersection $i$ at the current time step; otherwise, NAQL is used to choose the action for intersection $i$.
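This dispatch rule is straightforward to express in code. The sketch below reuses the illustrative fuzzify function from Sect. 3.1 and defuzzifies each queue to the label with the highest membership degree; the function names are assumptions.

```python
# Sketch of the selection rule of Sect. 3.3: act greedily when the most
# congested outgoing road of i is at most "Medium"; otherwise use NAQL.
def congestion_level(queue_len: int) -> str:
    """Defuzzify: return the label with the highest membership degree."""
    degrees = fuzzify(queue_len)  # illustrative sketch from Sect. 3.1
    return max(degrees, key=degrees.get)

def select_algorithm(outgoing_queues: list) -> str:
    """Choose the controller for intersection i from its E_out^i queues."""
    order = ["low", "medium", "high", "extremely_high"]
    worst = max((congestion_level(q) for q in outgoing_queues),
                key=order.index)
    return "greedy" if order.index(worst) <= order.index("medium") else "NAQL"
```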

3.4 Greedy Algorithm

Intersection agent $i$ chooses actions that minimize its own costs at the current time step, considering all incoming roads, or edges, from its immediate neighbors, denoted as $E^i_{in}$, in the four possible directions $d \in \{\text{North}(N), \text{South}(S), \text{East}(E), \text{West}(W)\}$.


The greedy policy for traffic signal control can be expressed as follows: if the sum of the numerical queue lengths on $E^i_{in}$ in the north and south directions is greater than the sum in the east and west directions, then we set the signal to green for traffic in the north and south directions, $u_i = 1$; otherwise, the signal is set to green for traffic in the east and west directions, $u_i = 0$.
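The policy is a direct comparison of queue sums, as the following sketch shows; the argument names are illustrative, with each argument being the numeric queue length on the corresponding incoming road of $E^i_{in}$.

```python
# Sketch of the greedy policy of Sect. 3.4.
def greedy_action(q_north: int, q_south: int, q_east: int, q_west: int) -> int:
    """u_i = 1: green for north-south; u_i = 0: green for east-west."""
    return 1 if q_north + q_south > q_east + q_west else 0
```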

3.5 Neighborhood Approximate Q-Learning (NAQL)

Agent $i$ cooperates with its immediate neighbors in choosing optimal actions. Denote $NB_i = \{i, V_i\}$ as the neighborhood of $i$, where $V_i = \{N_i, S_i, E_i, W_i\}$; this is shown in Fig. 2. The Q-function at intersection $i$ is approximated as $Q_i(x_i, u_i) = w_i^T \phi_i(x_i, u_i)$ [19], where $w_i$ is the weight vector at intersection $i$, and $\phi_i(x_i, u_i)$ is the high-level feature vector extracted from the traffic grid to represent state-action patterns at intersection $i$ at time step $t$. Thus, the Q-function for $NB_i$ is simply the summation of the $Q_j$'s at all intersections in the neighborhood:

$$Q_{NB_i}(x_{NB}, u_{NB}) = \sum_{j \in NB_i} w_j^T \phi_j(x_j, u_j) \qquad (5)$$

where $x_{NB} = \{x_i, x_j\}$ and $u_{NB} = \{u_i, u_j\}$, $j \in V_i$, represent the state and action for the neighborhood, respectively. If the action $u_j$ at one of the neighbors of $i$ has already been decided by the greedy algorithm, the greedily selected action is used when computing Q-values for intersection $j$. Given the Q-function for any neighborhood $NB_i$ in the traffic grid, the approximately optimal control can easily be obtained for the intersections in $NB_i$:

$$u_{NB_i}(t) = \{u_i(t), u_{N_i}(t), u_{S_i}(t), u_{E_i}(t), u_{W_i}(t)\} = \arg\min_{u_{NB} \in U} Q_{NB_i}(x_{NB}(t), u_{NB}) \qquad (6)$$

where each component of $u_{NB_i}(t)$ can take two possible plans or values, either 0 or 1. Again, if the action $u_j(t)$ at one of the neighbors has been pre-selected by the greedy algorithm and is inconsistent with the action generated for $NB_i$, the greedy action is used for signal control at the current time step, for simplicity. Therefore, the remaining challenge is to learn a good Q-function for each intersection together with its neighbors in the traffic grid.

(1) Feature Design: For each intersection $i$, $\phi_i(x_i, u_i)$ is a high-level feature vector that captures and predicts the influence of the action taken at time step $t$ on the cost of intersection $i$ in the next time step. To design the feature $\phi_{i(t)}(x_i, u_i)$, we keep track of the queue length $q_{ji}(t)$ on all incoming roads connected to intersection $i$ from neighbor $j$ at each time step, where $j \in V_i = \{N_i, S_i, E_i, W_i\}$. The number of vehicles entering the inflow edges of $i$, $E^i_{in}$, during every interval between $t$ and $t + \Delta t$ is also recorded. If $q^{in}_{ji}(t)$ denotes the number of vehicles entering inflow edge $ji$ between $t - \Delta t$ and $t$, then the feature vector $\phi_{i(t)}(x_i, u_i)$ at time $t$ can be designed as follows:

$$\phi_{i(t)}(x_i, u_i = 1) = \big[\, q_{N_i i} + q_{S_i i} + q^{in}_{N_i i} + q^{in}_{S_i i} - q^{out}_{N_i i} - q^{out}_{S_i i};\ \ q_{W_i i} + q_{E_i i} + q^{in}_{W_i i} + q^{in}_{E_i i} \,\big] \qquad (7)$$


$$\phi_{i(t)}(x_i, u_i = 0) = \big[\, q_{N_i i} + q_{S_i i} + q^{in}_{N_i i} + q^{in}_{S_i i};\ \ q_{W_i i} + q_{E_i i} + q^{in}_{W_i i} + q^{in}_{E_i i} - q^{out}_{E_i i} - q^{out}_{W_i i} \,\big] \qquad (8)$$

where $q^{out}_{ji}(t) = \min(q_{ji}(t), q^{out}_{max})$, $j \in V_i$, is an estimate of the number of vehicles crossing intersection $i$ from edge $ji$ between $t$ and $t + \Delta t$, given a green traffic signal at time $t$. $q^{out}_{max}$ is an empirical constant defining the maximum possible number of vehicles crossing an intersection during a $\Delta t$ time span.

Fig. 2. Traffic intersections and neighborhood of i
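Equations (7) and (8) can be sketched as follows, assuming dictionaries keyed by direction for the current queues and last-interval arrivals; the dictionary layout and names are illustrative, and the constant corresponds to $q^{out}_{max}$ (5 vehicles in the experiments of Sect. 4).

```python
# Sketch of the feature construction in Eqs. (7) and (8).
Q_OUT_MAX = 5  # q_max^out: max vehicles crossing during Delta t

def phi(q: dict, q_in: dict, u_i: int) -> list:
    """Two-component feature predicting next-step queues at intersection i.

    q[d]    = current queue q_ji on the incoming road from direction d
    q_in[d] = arrivals q_ji^in on that road during the last interval
    """
    q_out = {d: min(q[d], Q_OUT_MAX) for d in q}  # q_ji^out = min(q_ji, q_max^out)
    ns = q["N"] + q["S"] + q_in["N"] + q_in["S"]
    ew = q["W"] + q["E"] + q_in["W"] + q_in["E"]
    if u_i == 1:  # green for north-south: N/S queues discharge, Eq. (7)
        return [ns - q_out["N"] - q_out["S"], ew]
    return [ns, ew - q_out["E"] - q_out["W"]]  # green for east-west, Eq. (8)
```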

(2) Parameter Learning: Each intersection $i$ in the traffic grid keeps one copy of its own weight vector $w_i$, and we use dynamic programming and stochastic gradient descent to update the weight vector after each observation [19]. To be more specific, the one-step-ahead Q-function for $NB_i$ is computed by

$$TQ(x_{NB}(t), u_{NB}(t)) = g(t) + \gamma \min_{u \in U} Q(x_{NB}(t+1), u) \qquad (9)$$

where $\gamma$ is a discount factor to ensure convergence, and $u$ is an action vector representing the actions for the whole $NB_i$. Then, the weights can be updated by stochastic gradient descent as follows:

$$w = w - \alpha\,(Q - TQ)(x_{NB}(t), u_{NB}(t))\,\phi_{NB(t)}(x_{NB}(t), u_{NB}(t)) \qquad (10)$$

$$w = \begin{bmatrix} w_i \\ w_{N_i} \\ w_{S_i} \\ w_{E_i} \\ w_{W_i} \end{bmatrix}; \qquad \phi_{NB} = \begin{bmatrix} \phi_i(x_i, u_i) \\ \phi_{N_i}(x_{N_i}, u_{N_i}) \\ \phi_{S_i}(x_{S_i}, u_{S_i}) \\ \phi_{E_i}(x_{E_i}, u_{E_i}) \\ \phi_{W_i}(x_{W_i}, u_{W_i}) \end{bmatrix} \qquad (11)$$

where $\alpha$ is the learning rate, and $w$ and $\phi_{NB}$ are the weights and features for $NB_i$, respectively; we update $w$ for each neighborhood in the traffic grid until convergence.
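A minimal sketch of one such update, assuming the weights and features of the five neighborhood intersections are stacked into single vectors as in Eq. (11); the brute-force minimization over the $2^5$ joint actions and the features_next callback are illustrative choices, not from the paper.

```python
# Sketch of the NAQL update in Eqs. (9)-(10) for one neighborhood.
import itertools
import numpy as np

ALPHA, GAMMA = 0.001, 0.9  # learning rate and discount from Sect. 4.2

def q_value(w: np.ndarray, phi_nb: np.ndarray) -> float:
    """Q_NB = sum_j w_j^T phi_j, with weights and features stacked."""
    return float(w @ phi_nb)

def naql_update(w: np.ndarray, phi_t: np.ndarray, g_t: float,
                features_next) -> np.ndarray:
    """One stochastic gradient step toward the one-step target TQ.

    features_next(u) is assumed to return the stacked feature vector at
    t+1 for a joint action u over the five neighborhood intersections.
    """
    tq = g_t + GAMMA * min(q_value(w, features_next(u))
                           for u in itertools.product([0, 1], repeat=5))
    return w - ALPHA * (q_value(w, phi_t) - tq) * phi_t
```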


4 Experiments

The proposed decentralized fuzzy reinforcement learning algorithm is tested and evaluated on a 10 × 10 traffic grid with an identical control agent at each intersection. Although the grid is small by real-world standards, it is considerably larger than many simulated traffic grids used elsewhere in the literature. To evaluate traffic conditions and to simulate traffic flows, we use the intermodal, open-source traffic simulation software SUMO [20] (traci package). Here, SUMO and the reinforcement learning program interact with one another: the reinforcement learning program controls the traffic signal at each time step, and SUMO simulates the traffic conditions and internal car-following models based on the given traffic signal control. Figure 3 shows a 10 × 10 traffic grid in SUMO; black lines are the roads, and white dots on the roads denote vehicles.

Fig. 3. The traffic grid in SUMO simulation
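The interaction between the RL program and SUMO can be sketched as a traci control loop. The scenario file name, the phase indices, and the placeholder controller below are assumptions for illustration; only the traci calls themselves (start, simulationStep, trafficlight and lane queries) are standard API, and the 500 s horizon and $\Delta t = 5$ s interval follow Sect. 4.1.

```python
# Sketch of the RL <-> SUMO loop via the traci package, assuming a
# prebuilt grid scenario "grid.sumocfg" and a 2-phase signal program.
import traci

DT = 5  # control interval (Delta t) in simulation seconds

def choose_action(queues):
    # Placeholder for the fuzzy dispatch of Sect. 3.3 (greedy vs. NAQL);
    # here it simply acts greedily on the first half vs. second half.
    half = len(queues) // 2
    return 1 if sum(queues[:half]) > sum(queues[half:]) else 0

traci.start(["sumo", "-c", "grid.sumocfg"])  # assumed scenario file
while traci.simulation.getTime() < 500:      # 500 s per trial (Sect. 4.1)
    t = int(traci.simulation.getTime())
    if t % DT == 0:
        for tls in traci.trafficlight.getIDList():
            lanes = traci.trafficlight.getControlledLanes(tls)
            # halting-vehicle counts serve as the queues q_ji(t)
            queues = [traci.lane.getLastStepHaltingNumber(l) for l in lanes]
            u = choose_action(queues)
            traci.trafficlight.setPhase(tls, 0 if u == 1 else 2)  # assumed phases
    traci.simulationStep()  # SUMO advances traffic by one second
traci.close()
```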

4.1 Traffic Simulation

To simplify the traffic dynamics, the length of each road connecting two adjacent intersections is set to 100 m, and the length of each vehicle is set to 5 m. One induction loop detector is installed in the middle of each road to measure the incoming traffic to the connected intersections, i.e., $q^{in}_{ji}(t)$. The simulation is 500 s per trial, and the traffic signal control updates every $\Delta t = 5$ s. For each trial, a minor traffic flow is generated along each road with an average generation rate of 2.5 vehicles/min and a maximum speed of 5 m/s for each generated vehicle. Moreover, there are 12 major traffic flows during the entire simulation, stemming from either the north or south boundary (source) to the middle of the traffic grid (sink), with an average rate of 6.25 vehicles/min and a maximum speed of 10 m/s for generated vehicles. We also estimated the maximum number of vehicles crossing an intersection during the $\Delta t = 5$ s time span in this simulation; its value is $q^{out}_{max} = 5$ vehicles, given the above traffic settings.


4.2 Results and Comparisons

As a first step, we train the NAQL algorithm over ten different trials to learn the weights and Q-functions for each neighborhood. The values of the learning parameters are $\alpha = 0.001$ and $\gamma = 0.9$. Figure 4 shows the cumulative average magnitude of all the weights in the traffic grid after each update, from the first to the last trial. The average magnitude of the weights increases from 0.00 to 0.40 during the first five trials, and then the curve flattens out and begins to converge. Although there is still some fluctuation after the ten trials, possibly due to stochastic traffic conditions and noise from the simulations, the weights vary within a narrow range and can be considered convergent after training for ten trials.

Fig. 4. Weight learning curve

Fig. 5. Cumulative stage costs for different control policies (RL = proposed decentralized fuzzy RL algorithm)

Next, to better evaluate the performance of the proposed algorithm, we compare the trained RL model with two heuristic decentralized algorithms under the same traffic conditions and simulations. One heuristic algorithm is the "sotl-phase" (SOTL) algorithm, which was proposed in [9]. The other one is


the "Anticipated All Clearing" (AAC) policy, which was proposed in [8]. We show the comparison of the cumulative average stage costs,

$$J(t) = \frac{1}{t} \sum_{k=0}^{t} g(k) \qquad (12)$$

for the three algorithms in Fig. 5. In general, AAC performs better than SOTL in lighter traffic conditions (0-450 s). As the simulation continues, however, and more vehicles enter the traffic grid, making it more congested, SOTL performs better (450-500 s). The proposed decentralized fuzzy RL algorithm performs best at the beginning, during relatively light traffic, and shows a medium performance among the three algorithms from 200 to 500 s of the simulation. Equally important is stability, which is also shown in Fig. 5. Specifically, among the three curves shown, the proposed algorithm has the steadiest slope, which indicates better stability than the other two algorithms.

Since the coordination and communication between agents are confined within each neighborhood, which contains only five nearby intersections in order to reduce the action space, the overall traffic dynamics of larger regions are difficult to assess and are therefore unlikely to be captured by the trained NAQL model. This may be one reason why the proposed algorithm does not outperform the two heuristic algorithms in the simulation.

5 Conclusion

To summarize, this paper integrates fuzzy logic into a decentralized RL algorithm and applies it to a large-scale traffic grid in order to achieve better traffic signal control. In particular, we proposed a new decentralized reinforcement learning algorithm, NAQL, by dividing the whole traffic grid into neighborhoods and then applying approximate Q-learning within each sub-grid. We tested the proposed algorithm on a large-scale, 10 × 10 traffic grid in the SUMO simulator and compared it with other decentralized algorithms. The simulation results are comparable with those of the two heuristic, rule-based, decentralized algorithms, SOTL and AAC. Furthermore, our proposed algorithm exhibits a higher degree of adaptive control for dynamic traffic flow patterns; that is, it is relatively more stable during sudden and unexpected transitions between high and low congestion levels compared to SOTL and AAC. Thus, in theory, the proposed algorithm is more practical for controlling congested traffic flows in large-scale traffic networks because of its ability to deal with the underlying traffic flow dynamics.

In future work, we will analyze the optimality and scalability of the proposed algorithm against the heuristic decentralized algorithms, as well as against centralized learning approaches, to better quantify the efficiency improvement. In addition, to achieve better performance, we need to consider implementing cooperation and communication among neighborhoods so that the learning model can capture dynamic traffic patterns over expanded regions.


However, computational efficiency may be reduced when implementing cooperation among more control agents, so future work also needs to consider a good trade-off between efficiency and optimality.

References

1. Zhao, D., Dai, Y., Zhang, Z.: Computational intelligence in urban traffic signal control: a survey. IEEE Trans. Syst. Man Cybern. C Appl. Rev. 42(4), 485-494 (2012)
2. Webster, F.V.: Traffic signal settings. Road Research Laboratory, HMSO, London, UK, Technical Paper 39, p. 144 (1958)
3. Miller, A.J.: Settings for fixed-cycle traffic signals. Oper. Res. Q. 14(4), 373-386 (1963)
4. December 2010. http://www.trlsoftware.co.uk/
5. Sutton, R.S., Barto, A.G.: Reinforcement learning: an introduction. IEEE Trans. Neural Netw. 9(5), 1051-1053 (1998)
6. Wiering, M., Van Veenen, J., Vreeken, J., Koopman, A.: Intelligent Traffic Light Control. Utrecht University, Institute of Information and Computing Sciences (2004)
7. Zhang, C., Lesser, V.: Coordinating multi-agent reinforcement learning with limited communication. In: Ito, T., Jonker, C., Gini, M., Shehory, O. (eds.) Proceedings of the 12th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2013), Saint Paul, Minnesota, USA, 6-10 May 2013
8. Lammer, S., Helbing, D.: Self-control of traffic lights and vehicle flows in urban road networks. J. Stat. Mech. Theory Exp. 2008(04), P04019 (2008)
9. Gershenson, C.: Self-organizing traffic lights. Complex Syst. 16, 29-53 (2005)
10. Khamis, M., Gomaa, W.: Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework. Eng. Appl. Artif. Intell. 29, 134-151 (2014)
11. Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8(3-4), 279-292 (1992)
12. Le, T., Kovacs, P., Walton, N., Vu, H.: Decentralized signal control for urban road networks. Transp. Res. Part C: Emerg. Technol. 58, 431-450 (2015)
13. Chu, T., Qu, S., Wang, J.: Large-scale traffic grid signal control with regional reinforcement learning. In: American Control Conference (ACC), July 2016, in press
14. Qiao, J., Yang, N., Gao, J.: Two-stage fuzzy logic controller for signalized intersection. IEEE Trans. Syst. 41(1), 389-403 (2011)
15. Gokulan, B.P., Srinivasan, D.: Distributed geometric fuzzy multiagent urban traffic signal control. IEEE Trans. Intell. Transp. Syst. 11(3), 517-523 (2010)
16. Fan, S., Tian, H., Sengul, C.: Self-optimization of coverage and capacity based on a fuzzy neural network with cooperative reinforcement learning. EURASIP J. Wirel. Commun. Netw. 2014, 57 (2014)
17. Bingham, E.: Reinforcement learning in neurofuzzy traffic signal control. Eur. J. Oper. Res. 131(2), 232-241 (2001)
18. Engelbrecht, A.P.: Fuzzy sets. In: Computational Intelligence: An Introduction, 2nd edn., chap. 20, pp. 453-463. Wiley, Chichester (2007)
19. Szepesvári, C.: Algorithms for reinforcement learning. In: Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 4, no. 1 (2010)
20. SUMO traffic simulator. http://sumo.sourceforge.net

