
UNIVERSIDAD DE SEVILLA
Departamento de Teoría de la Señal y Comunicaciones

Simulation tool for building and analyzing complex, hierarchically structured AER-based systems implementing visual information processing

Dissertation presented by

José Antonio Pérez Carrasco

to qualify for the degree of Doctor from the Universidad de Sevilla

Seville, January 2011


Simulation tool for building and analyzing complex, hierarchically structured AER-based systems implementing visual information processing

Dissertation presented by:

José Antonio Pérez Carrasco

to qualify for the degree of Doctor from the Universidad de Sevilla

Seville, January 2011

Supervisors:

Dra. Begoña Acha Piñero

Dra. Carmen Serrano Gotarredona

Dr. Bernabé Linares Barranco

Dra. Teresa Serrano Gotarredona

Departamento de Teoría de la Señal y Comunicaciones

UNIVERSIDAD DE SEVILLA


To my grandfather, my grandmother, and my mother.


Acknowledgements

It has been a long road to get here, and in truth not one free of difficulties. But enthusiasm and the will to work can overcome anything and open the way despite the complications. Looking back to the beginning, I still remember the spark of excitement at doing a final-year project on medical imaging. That spark turned into a flame when, thanks to Carmen, Bego, Bernabé and Teresa, the world of research "opened its doors to me". A kind of feeling of belonging came over me, making me feel that this is my job and this is what I want to do. Sometimes life seems like a television series, in the sense that new characters keep appearing, or that through life's circumstances (or misfortune) others are no longer there to accompany us along the way, and the storyline keeps adding new situations, new complications, moments of tension, but also moments of excitement and joy, yes, the kind where one has to admit having cried from sheer happiness. In my life I know very well who my protagonists are. Thank you, Carmen and Bego; not just thanks, but infinite thanks. Not even an entire book of gratitude could tell everything you mean to me. Above all, thank you for opening the doors of your lives to me, for your affection and support at every moment, and for your advice at every step I have taken. I am lucky to have two thesis supervisors whom, beyond loving them as my "teachers", I love even more as my family. Only I know how much you have helped me, infinitely in personal matters, and infinitely in work too. Thank you; I owe it all to the two of you. Thank you, Bernabé and Teresa, and again, not just thanks, but infinite thanks. With your example, your work and all your advice and constructive criticism you have made me progress enormously and have taught me to be more critical of myself and of my work. Thank you for watering, every day, the seed of enthusiasm planted inside me, but above all, thank you for being the wonderful people you are and for having supported me at every moment. You too are responsible for my having come this far. Thank you very much. Thank you, Aurora and Carlos, infinitely as well, because in the end one spends almost more time at work than at home, and you have accompanied me in the good moments and in the bad, in the ones full of laughter and in the ones full of tears. Thank you for filling me with life every day with your friendship and company, and for sharing with me the sparks of your own hopes and worries. Thanks to José Ignacio Acha, for his advice and his attitude towards me. Thank you very much.


Thanks to all my colleagues in the department and to the professors. Many thanks to Pablo Aguilera and Pablo Olmos, to Michelle and to Luis, for bringing life to our room and for being such wonderful people to me. Thanks to Eugenio, S. Thorpe and S. Furber, for having welcomed me and for devoting part of their extremely busy time to me. Thank you for having taught me so much. Of course, if there is anyone I must be grateful to, it is my parents and my brothers. They have lived through all my work very closely, getting more than one headache along the way. Many thanks to them, because they are the pillars of my life and without them I could not have achieved anything. Thanks to my mother in particular, because she is the greatest in the whole world, because how hard she has had to work and fight so that I (and my brothers) could be where we are is not written anywhere. It must be from her example and her struggle that I have learned to value infinitely the effort behind every little thing. Thank you, Mum. Finally, thanks to the star of my life, who I am sure still carries my photo in his wallet today, showing off his grandson up there. I am sure he will be the proudest one on the day of my thesis defence and, even though he understands no English nor anything of what I will say, he will sit down, watch me the whole time with an amazing look of interest, and will not miss a single syllable of the words that come out of my mouth. And if anyone talks, my grandfather will surely send them a "ssshhhhh" so he can keep following. Thanks to my grandfather for being the source of my motivation, the one responsible for my always wanting to be a better person and to improve in my work. Thank you for being my companion now, every day. Many thanks to everyone, I love you all.


Contents

List of Figures ix

List of Tables xv

1 INTRODUCTION 1

1.1 Antecedents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 EVENT-BASED PROCESSING SYSTEMS 5

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2 Event-Based vs Frame-Based Processing Systems . . . . . . . . . . . . . 6

2.3 Coding Schemes for Event-Based Systems . . . . . . . . . . . . . . . . . 7

2.3.1 Rate Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.3.2 Rank Order Coding . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3.3 Time-to-First-Spike . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3.4 Spike-Count Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3.5 Population coding . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3.6 Phase-of-firing code . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3.7 Intensity Variation . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.4 AER Protocol for Event-Based Systems . . . . . . . . . . . . . . . . . . 12

3 IMPLEMENTATION OF AN AER SIMULATION TOOL 15

3.1 Requirement of a simulation tool . . . . . . . . . . . . . . . . . . . . . . 15

3.1.1 Synchronous or Clock-driven Algorithms . . . . . . . . . . . . . . 16

3.1.2 Asynchronous or Event-driven Algorithms . . . . . . . . . . . . . 16


3.2 Description of the AER Simulation Tool . . . . . . . . . . . . . . . . . . 19

3.2.1 Configuration File . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.2.2 Event Description . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.2.3 Instance Description . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.2.4 Description of Program Flow . . . . . . . . . . . . . . . . . . . . 24

3.2.5 Algorithm Optimizations for Efficient Computation Speed . . . . 26

3.2.6 C++ implementation . . . . . . . . . . . . . . . . . . . . . . . . 29

3.3 AER Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.3.1 AER switch Module . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.3.1.1 AER Splitter . . . . . . . . . . . . . . . . . . . . . . . . 31

3.3.1.2 AER Merger . . . . . . . . . . . . . . . . . . . . . . . . 31

3.3.2 Subsampling Module . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.3.3 Mapper Module . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.3.3.1 AER Scanner . . . . . . . . . . . . . . . . . . . . . . 34

3.3.3.2 AER Rotator . . . . . . . . . . . . . . . . . . . . . . 35

3.3.4 AER Convolution Chip . . . . . . . . . . . . . . . . . . . . . . . 36

3.3.4.1 System Level Architecture of the Convolution Chip . . 38

3.3.4.2 AERST Convolution Module . . . . . . . . . . . . . . 40

3.3.5 Projection Module . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.3.6 Integrate and Fire Module . . . . . . . . . . . . . . . . . . . . . . 44

3.3.7 Rate-Reducer Module . . . . . . . . . . . . . . . . . . . . . . . . 45

3.3.8 Self-Exciting Modules . . . . . . . . . . . . . . . . . . . . . . . . 45

3.4 Validation of the AER Tool . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.4.1 Detection and tracking of moving circles of given radius . . . . . 47

3.4.2 Recognition of high speed Rotating Propellers . . . . . . . . . . . 49

4 MULTI-CHIP MULTI-LAYER CONVOLUTION PROCESSING FOR CHARACTER RECOGNITION 51

4.1 Fukushima’s Neocognitron . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.2 AER-based system for Character Recognition . . . . . . . . . . . . . . . 54

4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62


5 IMPLEMENTATION OF TEXTURE RETRIEVAL USING AER-BASED SYSTEMS 67

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.2 State of the art in texture recognition . . . . . . . . . . . . . . . . . . . 67

5.3 AER implementation for texture retrieval . . . . . . . . . . . . . . . . . 69

5.3.1 Frame-based implementation for texture retrieval . . . . . . . . . 70

5.3.2 AER-based implementation for texture retrieval . . . . . . . . . 72

5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

5.4.1 Comparison with the State-of-the-Art . . . . . . . . . . . . . . . 78

5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

5.6 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . 82

6 EVENT-DRIVEN CONVOLUTIONAL NETWORKS FOR FAST VISION POSTURE RECOGNITION 85

6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

6.2 Frame-Based Convolutional Network . . . . . . . . . . . . . . . . . . . . 88

6.3 Justification of the Architecture Used . . . . . . . . . . . . . . . . . . . 92

6.4 Frame-Free Convolutional Network . . . . . . . . . . . . . . . . . . . . . 95

6.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

6.5.1 AER ConvNet with 32x32 pixel inputs . . . . . . . . . . . . . . . 102

6.5.2 AER ConvNet with 64x64 pixel inputs . . . . . . . . . . . . . . . 113

6.6 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

6.6.1 Learning in Convolutional Networks . . . . . . . . . . . . . . . . 115

6.6.2 Computations in the Frame-Based System . . . . . . . . . . . . . 117

6.6.2.1 Filtering layers . . . . . . . . . . . . . . . . . . . . . . . 117

6.6.2.2 Subsampling Layers . . . . . . . . . . . . . . . . . . . . 118

6.6.2.3 Full-Connection layer F6 . . . . . . . . . . . . . . . . . 118

6.6.3 Computations in the Frame-free system . . . . . . . . . . . . . . 118

6.6.3.1 Filtering Layers . . . . . . . . . . . . . . . . . . . . . . 119

6.6.3.2 Subsampling Layers . . . . . . . . . . . . . . . . . . . . 119

6.6.3.3 Sixth Layer F6 . . . . . . . . . . . . . . . . . . . . . . . 119

6.6.4 Implementation of non-linearities and equivalences between the

frame-based and the AER-based implementation . . . . . . . . . 120


7 CONCLUSIONS 125

Appendices 127

Appendix A AERST Tool User Guide 129

A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

A.2 Description of an AER System . . . . . . . . . . . . . . . . . . . . . . . 129

A.3 MATLAB Initialization of Parameters and States . . . . . . . . . . . . . 132

A.3.1 Initialization of Parameters . . . . . . . . . . . . . . . . . . . . . 133

A.3.2 Initialization of States . . . . . . . . . . . . . . . . . . . . . . . . 133

A.4 RUNNING AERST in MATLAB . . . . . . . . . . . . . . . . . . . . . . 134

A.4.1 Building Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

A.5 C++ Initialization of Parameters and States . . . . . . . . . . . . . . . 137

A.5.1 Initialization of Parameters . . . . . . . . . . . . . . . . . . . . . 138

A.5.2 Initialization of States . . . . . . . . . . . . . . . . . . . . . . . . 138

A.6 RUNNING AERST in C++ . . . . . . . . . . . . . . . . . . . . . . . . 139

A.6.1 Building C++ Modules . . . . . . . . . . . . . . . . . . . . . . . 142

A.7 Matlab Auxiliary Functions . . . . . . . . . . . . . . . . . . . . . . . . . 143

A.7.1 Generation of AER events from a standard image . . . . . . . . . 143

A.7.2 Reconstruction of images from channels . . . . . . . . . . . . . . 144

A.7.3 Reconstruction of channels from the text output file . . . . . . . 144

A.8 MATLAB Step-by-Step Example . . . . . . . . . . . . . . . . . . . . . . 145

A.8.1 Preparing the Stimulus Events . . . . . . . . . . . . . . . . . . . 146

A.8.2 Setting Up the Configuration File . . . . . . . . . . . . . . . . . 147

A.8.3 Initializing Parameters . . . . . . . . . . . . . . . . . . . . . . . . 147

A.8.3.1 Splitter . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

A.8.3.2 Chip1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

A.8.4 Editing the Modules . . . . . . . . . . . . . . . . . . . . . . . . . 149

A.8.4.1 Splitter Module . . . . . . . . . . . . . . . . . . . . . . 150

A.8.4.2 Chip1 Module . . . . . . . . . . . . . . . . . . . . . . . 150

A.8.4.3 Merger Module . . . . . . . . . . . . . . . . . . . . . . . 152

A.8.5 Editing the AERST.m file . . . . . . . . . . . . . . . . . . . . . . 152

A.8.6 Simulating the System . . . . . . . . . . . . . . . . . . . . . . . . 153

A.8.7 Viewing Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 154


A.9 C++ Step-by-Step Example . . . . . . . . . . . . . . . . . . . . . . . . . 155

A.9.1 Converting a Matrix of Events to a source text file . . . . . . . . 156

A.9.2 Setting Up the Configuration File . . . . . . . . . . . . . . . . . 156

A.9.3 Initializing Parameters . . . . . . . . . . . . . . . . . . . . . . . . 156

A.9.3.1 Splitter . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

A.9.3.2 Chip1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

A.9.4 Editing the C++ Modules . . . . . . . . . . . . . . . . . . . . . . 160

A.9.4.1 Splitter Module . . . . . . . . . . . . . . . . . . . . . . 160

A.9.4.2 Chip1 C++ Module . . . . . . . . . . . . . . . . . . . . 162

A.9.4.3 MERGER C++ Module . . . . . . . . . . . . . . . . . 165

A.9.5 Simulating the System in C++ . . . . . . . . . . . . . . . . . . . 167

Appendix B RESUMEN 171

B.1 INTRODUCCION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

B.2 Descripcion del Simulador AERST . . . . . . . . . . . . . . . . . . . . . 173

B.3 IMPLEMENTACIONES . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

B.4 Sistema de Reconocimiento de Caracteres basado en AER . . . . . . . . 176

B.5 Resultados . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

B.6 Clasificacion de Imagenes basada en informacion de textura . . . . . . . 179

B.7 Red neuronal Convolucional para el reconocimiento de personas . . . . . 180

B.7.1 RED NEURONAL DE DETECCION DE PERSONAS BASADA

EN FOTOGRAMAS . . . . . . . . . . . . . . . . . . . . . . . . . 181

B.7.2 RED NEURONAL DE DETECCION DE PERSONAS BASADA

EN AER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182

B.7.3 Resultados . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

Bibliography 187


List of Figures

2.1 Conceptual illustration of frame-based (top) versus event-based (bottom)

vision sensing and processing system. . . . . . . . . . . . . . . . . . . . . 7

2.2 Comparison of timing issues between (top) a frame- and (bottom) an

event-based sensing and processing system. . . . . . . . . . . . . . . . . 8

2.3 Rate-based vs Rank-Order based scheme . . . . . . . . . . . . . . . . . . 10

2.4 Representation of the Time-to-First Spike coding scheme . . . . . . . . . 11

2.5 Concept of point-to-point interchip AER communication. . . . . . . . . 13

3.1 Example AER system and its ASCII file netlist description . . . . . . . 20

3.2 Basic Algorithm implemented by the AER tool . . . . . . . . . . . . . . 25

3.3 Time Optimizations in the Simulation tool . . . . . . . . . . . . . . . . . 28

3.4 CAVIAR AER vision system . . . . . . . . . . . . . . . . . . . . . . . . 30

3.5 AER-Switch hardware interface . . . . . . . . . . . . . . . . . . . . . . . 31

3.6 AER-switch acting as Splitter or Merger . . . . . . . . . . . . . . . . . . 32

3.7 Scanner and Rotator AER Modules . . . . . . . . . . . . . . . . . . . . . 33

3.8 Comparison between (a) classical frame-based and (b) AER event-based

convolution processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.9 Convolution Chip 32x32 . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.10 AER convolution module implemented in the tool . . . . . . . . . . . . 42

3.11 AER Integrate and Fire module implemented in the tool . . . . . . . . . 44

3.12 AER Self-Exciting Module . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.13 Block diagram of the AER system developed to simulate the hardware

implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3.14 a) Kernel to detect a circumference of a certain radius. b) Kernel used

in the WTA module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48


3.15 Winner-Takes-All module . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.16 On the left, input and output obtained with the hardware implemen-

tation. On the right, input and output obtained with the simulated

implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.17 a) Kernel used to detect the propeller, b) and c) input and output when

we collect events during 50µs, d) and e) input and output when we collect

events during 200ms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.1 A typical architecture of the neocognitron network. . . . . . . . . . . . . 52

4.2 The process of pattern recognition in the neocognitron. The lower half

of the figure is an enlarged illustration of a part of the network. . . . . . 54

4.3 Character recognition system based on AER . . . . . . . . . . . . . . . . 55

4.4 Kernels used in the first layer for feature detection. The red cross indicates the origin of coordinates of the kernel when it is projected onto the pixel array. . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.5 Kernels used in the second layer for spatial weighting. The kernel C at

the bottom end is a single convolution chip for detecting whether the

events coming from the previous layer are more or less clustered together. 59

4.6 Letters used for testing the system based on AER for character recognition. 60

4.7 The output events generated at the different convolution outputs {c1,

c2, c3, c5, c11, d1, d2, d3, d5, d11, dA, dB, dC, dH, dL, dM, dT, fA,

fB, fC, fH, fL, fM, fT} for the case of input stimulus ‘A’. . . . . . . . . 62

4.8 Events obtained in the system at outputs {c1,c2,c3,c5,c11,d1,d2,d3,d5,d11,dA,fA} when the input is letter 'A'. Time is expressed in µs. . . . . . . . . . 63

4.9 Events obtained in the system at input and output channels for the first

version of each of the letters. . . . . . . . . . . . . . . . . . . . . . . . . 64

5.1 Scheme of the AER-based system implemented for texture-based re-

trieval of images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.2 Scheme of a generic FEM used in Fig. 5.1. . . . . . . . . . . . . . . . . . 74

5.3 Texture retrieval accuracy obtained for images D1-D2-D3-D8-D9-D10 as

function of Tcount (in milliseconds) . . . . . . . . . . . . . . . . . . . . 78

5.4 Comparison between frame-based and AER-based systems . . . . . . . . 79


6.1 Frame-based ConvNet to detect people in up, up-side-down or horizontal

positions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

6.2 Real scenarios where AER recordings with the motion retina were obtained 88

6.3 Images obtained collecting input spikes from the retina every 30 ms. The second and third rows were obtained by previously rotating the input events 90 and 180 degrees respectively. . . . . . . . . . . . . . . . 89

6.4 Comparison of recognition rates when we use a trainable set of filters in

first layer or a fixed Gabor filter bank. . . . . . . . . . . . . . . . . . . . 93

6.5 Comparison of recognition rates when we use different Gabor filter banks

at different number of scales and orientations. . . . . . . . . . . . . . . . 94

6.6 Different accuracies obtained when varying the number of feature maps

in the third layer fixing the number of feature maps in the fifth layer. . 95

6.7 Different accuracies obtained when varying the number of feature maps

in the fifth layer fixing the number of feature maps in the third layer. . 96

6.8 Maximum absolute value of the weights during the training stage and at

the end of the training stage. . . . . . . . . . . . . . . . . . . . . . . . . 97

6.9 AER-based implementation of the ConvNet system. . . . . . . . . . . . 98

6.10 Convolution Structure at layers C3, C5. Each incoming spike makes a

convolution map to be added on a pixel array. . . . . . . . . . . . . . . . 99

6.11 Neuron in the pixel array. Each time a spike is received a certain weight

is added to the neuron state. . . . . . . . . . . . . . . . . . . . . . . . . 100

6.12 Algorithm used to configure the system. First, the system was trained

with the frame-based version. Then all the obtained weights were used

in the frame-free system. . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

6.13 a) Images corresponding to downsampling the 128x128 input stimulus

to 32x32. b) Images obtained cropping the input stimulus in a central

square of size 64x64 and downsampling the cropped stimulus to 32x32. . 103

6.14 Input events used to test the system. x axis represents time in seconds.

y axis represents the input event coordinates in a 32x32 pixel array,

numbered from 0 to 1023. . . . . . . . . . . . . . . . . . . . . . . . . . . 106

6.15 Output events corresponding to each one of the input flows, a) outputs

when input is up position, b) output when input is horizontal position,

c) output when input is up-side-down position. . . . . . . . . . . . . . . 107


6.16 Recognition rate and number of output events per second obtained by

varying the refractory times in layers C3 and C5. . . . . . . . . . . . . . 108

6.17 a) Input and output activity when the input is alternated between up, horizontal and up-side-down positions. No refractory periods have been considered.

Values ‘5’, ‘6’ and ‘7’ correspond to up, horizontal and up-side-down

respectively. Absolute values ‘1’, ‘2’, ‘3’ and ‘4’ correspond to the output

channels identifying up-side-down, horizontal, up positions and noise

respectively. b) Input and output activity when input is alternated and

a refractory period of 9ms is used in layer C5. Input event correct

orientation is shown by the blue line. Values ‘1’, ‘2’ and ‘3’ correspond to

up, horizontal and up-side-down positions, respectively. Output events

corresponding to the up category are represented with blue circles, with

red crosses for the horizontal category, with green stars for the up-side-

down category and black dots for the noise category. c) Input and output

activity when a refractory time of 18ms is used in layer C5. d) Input

and output activity when the simulated annealing algorithm is employed

to obtain optimum parameters . . . . . . . . . . . . . . . . . . . . . . . 110

6.18 Recognition Rate and Number of Output Events obtained when varying

forgetting rates F1, F3, F5, and F6. a) Results when varying F1 and F3.

b)Results when varying F3 and F6. c)Results when varying F3 and F5.

d)Results when varying F1 and F5. . . . . . . . . . . . . . . . . . . . . . 111

6.19 Zoomed version of the simulation results of Fig. 6.17 between 5760 ms and 5830 ms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

6.20 a) Input and output activity when input is alternated and a refractory

period of 18ms is used in layer C5. Input events are shown by grey

circles. Values ‘1’, ‘2’ and ‘3’ correspond to up, horizontal and up-side-

down positions respectively. Output events corresponding to the up

category are represented with blue circles, output events corresponding

to the horizontal position are represented by red crosses, output events

corresponding to the up-side-down positions are represented by green

stars and the noise category by black points. b) Input and output activity

when the simulated annealing algorithm is employed to obtain optimum

parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116


6.21 Computation of the saturation point in the hyperbolic tangent function.

The function saturates when the absolute value of the argument is higher

than 1.5283. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

A.1 System Simulated in the Step-by-Step Example . . . . . . . . . . . . . . 148

B.1 Concepto de comunicacion punto a punto basada en AER. . . . . . . . . 172

B.2 Ejemplo de Sistema AER y su descripcion mediante un fichero ASCII . . 174

B.3 Sistema de Reconocimiento de Caracteres basado en AER . . . . . . . . 177

B.4 Caracteres utilizados para evaluar el Sistema AER. . . . . . . . . . . . . 178

B.5 Esquema del sistema basado en AER para clasificacion de imagenes

basada en textura . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

B.6 Sistema neuronal convolucional basado en fotogramas para detectar per-

sonas de pie, en posicion horizontal o boca a bajo. . . . . . . . . . . . . 181

B.7 Implementacion AER de la red neuronal convolucional para el reconocimiento

de personas. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183

B.8 a) Entrada y salida del sistema cuando la entrada es alternada entre

las posiciones ‘de pie’, ‘horizontal’ y ‘boca abajo’. Los valores ‘5’, ‘6’

y ‘7’ corresponden a las posiciones ‘horizontal’, ‘boca abajo’ y ‘de pie’

respectivamente. Los valores absolutos ‘1’, ‘2’, ‘3’ y ‘4’ corresponden a la

actividad en los canales de salida identificando las posiciones ‘horizontal’,

‘boca abajo’, ‘de pie’ y ‘ruido’. . . . . . . . . . . . . . . . . . . . . . . . 186


List of Tables

4.1 Origin of Coordinates for Kernels in Layer 2 . . . . . . . . . . . . . . . . 58

4.2 Timing and accuracy obtained for each of the letters . . . . . . . . . . . 61

5.1 Retrieval Performance for Each of the 112 Brodatz Images. Compar-

ison Between Manjunath’s Frame-Based Method and the AER-Based

Proposed Approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.2 Comparison of Average Retrieval Rate Between Different Methods Using the Brodatz Database . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.3 Comparison of Computational Times Between Different Methods Using the Brodatz Database . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

6.1 Parameters used in the frame-based and AER-based implementations . 90

6.2 Maximum kernel weights, threshold values, refractory times, layer times

and events per second in the system . . . . . . . . . . . . . . . . . . . . 104

6.3 Parameter Vector obtained with the Simulated Annealing Algorithm . . 112

6.4 Time-to-first output event after transitions of the input between up,

horizontal and up-side-down positions. . . . . . . . . . . . . . . . . . . . 113

6.5 Parameter Vector obtained with the Simulated Annealing Algorithm . . 115

6.6 Refractory periods, layer times and maximum number of events per sec-

ond computed for each layer in the system . . . . . . . . . . . . . . . . . 123


Chapter 1

INTRODUCTION

1.1 Antecedents

Still today it seems inconceivable that even the most powerful computer is unable to perform a "simple" detection, recognition and tracking of an object or person, within milliseconds, in an environment containing multiple distractor elements. It is true that computers are good at some kinds of tasks (mathematical operations, storage, data retrieval, etc.). However, even today, computers are not ready to carry out other tasks that a person would consider simple, such as object recognition or decision making.

These operations, which are straightforward for human brains, require multiple sequential operations in computers, still today mainly based on the classical von Neumann architecture, where a stored program is executed line by line. For vision and image processing the situation is even worse, as all the pixels in the image have to be processed one by one with complex operations. Only when all the pixels have been processed can some features and results be extracted and sent to the next layer to continue processing. Thus, in addition to the individual processing of every pixel, several layers are required to complete a processing that, in the best of cases, will only come close to that offered by a human brain, both in terms of speed and efficiency.

These efficiency and speed limitations of classical processing systems have motivated many researchers to study the biological processing implemented in brains in order to propose and analyze new techniques and architectures for implementing those "so simple" tasks. This is not a simple issue, as the processing power of human brains is mainly due to the high number of neurons and the massive interconnections (synapses) present between them. This massive interconnection allows parallel processing of all the input information, providing real-time operation. Moreover, in the case of vision processing, and unlike in classical computers, information is not processed in frames; instead, independent and individual pulses are transmitted and integrated, processed in parallel and passed on layer by layer without waiting to collect all the input visual information.

Many researchers are working on emulating (both in software and in hardware) the human visual system and its processing (consider, for example, the work of T. Serre [1], Fukushima [2], T. Masquelier [3], Y. LeCun [4], S. Furber [5], etc.). Other researchers study where and how the information is coded (rank-order coding, rate-based coding, population coding, etc.), and other analyses focus on determining the best model for neurons and their learning strategy (integrate-and-fire (I&F), Izhikevich [6], Hodgkin and Huxley [7], etc.).

The main barrier that these researchers face when trying to mimic bio-inspired systems is the massive connectivity present in biological systems. With today's technologies it is plausible to fabricate on a single chip many thousands (even millions) of artificial neurons or simple processing cells. However, it is not viable to physically connect each of them to even a few hundred other neurons. The problem is greater for multi-chip, multi-layer, hierarchically structured bio-inspired systems. Address-event representation (AER) is a promising emergent hardware technology that shows potential for providing the computing requirements of large frameless projection-field-based multilayer systems, offering a hardware solution to the massive connectivity problem. AER was first proposed in 1991 in one of the California Institute of Technology (Caltech) research labs [30], and has been used since then by a wide community of neuromorphic hardware engineers.

The main conclusion that can be drawn from these ideas is that it would be desirable to have an easy-to-use simulation tool with which to model and simulate cortical-like processing systems. This tool should be able to consider not only the existing biological models for neurons and their learning strategies, but also the limitations (timing, space, connectivity, etc.) and performance figures of the existing hardware elements that also try to emulate these complex bio-inspired processing strategies.


1.2 Objectives

The first main objective of this thesis work is to propose and evaluate an event-driven software tool, AERST, for simulating systems based on the AER (address-event representation) protocol. The second objective is to build and analyze complex multilayer multichip AER vision processing systems using the developed simulation tool. The systems considered should be composed of existing, available AER hardware elements in order to provide realistic timing and performance figures. To achieve these two objectives, several stages of work have been carried out:

1. Implementation of the AER software simulation tool AERST. This requires studying the different programming environment options and the characteristics of the AER elements and systems that are going to be used (if already existing) or implemented.

2. Validation of the AERST tool. To do this, several experiments already implemented in hardware are repeated with the tool for comparison purposes. AERST should reproduce the results (timing and performance) obtained with the hardware systems.

3. Proposal of new vision and complex processing AER systems using AERST. These systems should only be composed of existing AER elements, or of elements that can be physically implemented in the near future.

The work developed in this thesis has been carried out within the Andalusian project P06TIC01417 (Brain System) at the National Microelectronics Center (IMSE-CNM-CSIC) and the University of Seville. The aim of this project was to build a complete vision processing system using bio-inspired AER convolution chips to emulate the processing implemented in human brains.

1.3 Structure

The current thesis is structured as follows:

1. The first chapter is this introduction, presenting the antecedents, objectives and structure of the thesis.

2. The second chapter compares event-based and frame-based systems, and explains the AER hardware protocol as a solution for implementing, with multiple AER chips, the massive interconnection existing in brains.

3. The third chapter describes the AERST simulation tool and the validations carried out to assess it.

4. The fourth chapter presents a multi-layer feed-forward neural processing system for character recognition in AER.

5. The fifth chapter explains a four-layer system implementing texture retrieval for image classification.

6. In the sixth chapter a neural network consisting of six layers, trained using backpropagation, is implemented to detect people in three different positions: up, horizontal and up-side-down.

7. The seventh chapter presents the conclusions and future work.

8. An appendix provides a tutorial and user manual on how to use the AERST simulation tool.

9. Finally, a list of publications is provided.


Chapter 2

EVENT-BASED PROCESSING SYSTEMS

2.1 Introduction

Artificial, man-made machine vision systems operate in a quite different way from biological brains. Machine vision systems usually capture and process sequences of frames. For example, a video camera captures images at about 25-30 frames per second, which are then processed frame by frame, pixel by pixel, usually with convolution operations, to extract, enhance, and combine features, and to perform operations in feature spaces, until a desired recognition is achieved. This frame convolution processing is slow, especially if many convolutions need to be computed in sequence for each input image or frame. Biological brains do not seem to operate on a frame-by-frame basis. In the retina, each pixel sends spikes (also called events) to the cortex when its activity level reaches a threshold. Pixels are not read by an external scanner; pixels decide when to send an event. All these spikes are transmitted as they are produced, and do not wait for an artificial "frame time" before being sent to the next processing layer. Besides this frameless nature, brains are structured hierarchically in cortical layers [8]. Neurons (pixels) in one layer connect to a projection field of neurons (pixels) in the next layer. This processing based on projection fields is similar to convolution-based processing [9], at least for the earlier cortical layers. For example, it is widely accepted that the first layer of visual cortex V1 performs an operation similar to a bank of 2-D Gabor-like filters at different scales and orientations [1] whose actual parameters have been measured [10][11][12]. This fact has been exploited by many researchers to propose powerful convolution-based image processing algorithms [1][13][14]. However, convolutions are computationally expensive. It seems unlikely that the high number of convolutions that might be performed by the brain could be emulated fast enough by software programs running on the fastest of today's computers.

It seems that the solution for powerful, frame-free, biological-like real-time vision processing systems is the development of hardware combining event-based multineuron modules that compute projective fields. In these systems, relevant image features are communicated and processed first, resulting in extremely high-speed processing throughput. This way, the processing delay depends mainly on the number of layers, and not on the complexity of the objects and shapes to be recognized. Their latency and throughput are not limited by a conventional sampling rate.

2.2 Event-Based vs Frame-Based Processing Systems

To show the high-speed processing capabilities of event-based systems when compared

to frame-based systems consider Fig. 2.1. Fig. 2.1 illustrates the conceptual differ-

ence between a frame- and an event-based sensing and processing system. Each use

a camera sensor to capture reality. In the top row, a frame-based camera captures a

sequence of frames, each of which is transmitted to the computing system. Each frame

is processed by sophisticated image processing algorithms for achieving some recogni-

tion. The computing system needs to have all pixel values of a frame before starting

any computation. In the bottom row, an event-based vision sensor operates without

frames. Each pixel sends an event (usually its own (x, y) coordinate) when it senses

something (change in intensity [15], contrast with respect to neighboring pixels [16],

etc.). Events are sent out to the computing system as they are produced, without

waiting for a frame time. The computing system updates its state after each event.

Fig. 2.2 illustrates the inherent difference in timings between both concepts. In the top

(frame-based), reality is binned into compartments of duration Tframe. During the first

frame T1, an event happens (such as a flashing shape), but the information produced

by this event does not reach the computing system until the full frame is captured (at

T1) and transmitted (with an additional delay ∆). Then, the computing system has

to process the full frame, handling large amount of data and requiring a long frame

6

Page 28: UNIVERSIDAD DE SEVILLAbernabe/theses/Thesis_JAPCarrasco.pdf · 2012. 12. 30. · UNIVERSIDAD DE SEVILLA . A mi abuelo, mi abuela, y mi madre. Agradecimientos Ha sido un largo camino

2.3 Coding Schemes for Event-Based Systems

Figure 2.1: Conceptual illustration of frame-based (top) versus event-based (bottom)vision sensing and processing system.

computation time TFC before the “recognition” information is available. In the bot-

tom of Fig. 2.2, pixels “see” directly the event in reality and send out their own events

with a delay ∆ to the computing system. Events are processed as they flow with an

event computation delay Tev (some nanoseconds [17]). For performing recognition not

all events are necessary. Actually, more relevant events usually come out first or with

higher frequency. Consequently, recognition time Trcg can be smaller than the total

time of the events produced. Note that recognition is possible before frame time T1,

resulting in a negative T ′FC when compared to the recognition delay of a frame-based

system.
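To make this timing comparison concrete, the two recognition latencies can be written down from the quantities of Fig. 2.2. This is a back-of-the-envelope sketch: the number of events n needed for recognition and the numerical values below are illustrative assumptions, not measurements from this thesis.

\[
L_{\mathrm{frame}} \approx T_{\mathrm{frame}} + \Delta + T_{FC}, \qquad
L_{\mathrm{event}} \approx \Delta + n \, T_{ev} .
\]

For instance, assuming Tframe = 40 ms, ∆ = 1 ms and TFC = 20 ms gives a frame-based latency of about 61 ms, whereas with Tev ≈ 20 ns and n = 10^4 relevant events the event-driven path adds only about 0.2 ms of computation, so recognition can indeed finish well before the first frame would even have been delivered.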

2.3 Coding Schemes for Event-Based Systems

Neuroscience, computational, and engineering application researchers have reported and used several schemes for information coding with spikes or events. Some of them are summarized next.

2.3.1 Rate Coding

Rate coding is a traditional coding scheme, assuming that most, if not all, information about the stimulus is contained in the firing rate of the neuron. In most sensory systems, the firing rate increases, generally non-linearly, with increasing stimulus intensity [18]. Any information possibly encoded in the temporal structure of the spike train is ignored. The concept of firing rates has been successfully applied during the last 80 years. It dates back to the pioneering work of Adrian, who showed that the firing rate of stretch receptor neurons in the muscles is related to the force applied to the muscle [19]. In the following decades, the measurement of firing rates became a standard tool for describing the properties of all types of sensory or cortical neurons, partly due to the relative ease of measuring rates experimentally. However, this approach neglects all the information possibly contained in the exact timing of the spikes. During recent years, more and more experimental evidence has suggested that a straightforward firing rate concept based on temporal averaging may be too simplistic to describe brain activity [20].

2.3.2 Rank Order Coding

While the idea of coding an image using a rate code may seem plausible, recent experimental work has actually ruled it out in the mouse retina, since the amount of information available by counting spikes within a given amount of time was insufficient to explain the animal's behavioral performance [21]. In rank-order schemes (originally proposed by Thorpe [22]), the information is encoded in the relative order of firing across the population of neurons that is used (see also [23]). The idea follows naturally from the fact that an integrate-and-fire neuron can be thought of as a capacitance with a threshold. In response to a visual stimulus, retinal ganglion cells will charge up progressively until they reach a threshold for generating a spike, and the time taken to reach threshold will depend on how well the stimulus matches the cell's receptive field. Simulation studies have demonstrated that, by using the order in which the cells fire, it is possible to reconstruct an image sufficiently well to allow the key objects to be identified even when less than 1% of the cells in the retina have had time to emit a spike [24]. Furthermore, the idea that relative spike timing can be used as an efficient code has recently been demonstrated experimentally in the salamander retina [25]. The discovery of spike-timing-dependent plasticity, where synaptic efficacy is modulated by the precise timing of spikes, is strong evidence that temporal codes (such as the rank-order scheme) are used for cortical information transmission. The rate-based and rank-order-based coding schemes are compared in Fig. 2.3. In the upper part of the figure, the stimuli received by three different neurons (labeled nA, nB and nC) are shown. As can be seen, in the rate-based implementation (at the bottom of the figure), the output activities of the three neurons vary with the intensity of the stimuli; in this scheme no reference signal is needed. On the contrary, in the rank-order scheme (central part of the figure), the information is coded in the relative order in which the three neurons fired. This time a reference signal is needed to mark the start of new time windows. However, the number of spikes required is lower, thus requiring a lower bandwidth in communications.

Figure 2.3: Rate-based vs Rank-Order based scheme
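As an illustration of the rank-order idea, the following minimal C++ sketch encodes a small set of analogue pixel values purely by the order in which their neurons would fire (stronger stimulation reaches threshold earlier). It is written for this text under that simple assumption; names such as rankOrderEncode are hypothetical and the code is not part of the AERST tool.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Rank-order encoding: return neuron indices sorted by decreasing stimulus
    // strength. Only the firing order is kept; exact spike times are discarded.
    std::vector<int> rankOrderEncode(const std::vector<double>& intensity) {
        std::vector<int> order(intensity.size());
        for (int i = 0; i < static_cast<int>(order.size()); ++i) order[i] = i;
        std::sort(order.begin(), order.end(),
                  [&intensity](int a, int b) { return intensity[a] > intensity[b]; });
        return order;                    // order[0] fires first, order[1] next, ...
    }

    int main() {
        std::vector<double> stimulus = {0.2, 0.9, 0.5};    // neurons nA, nB, nC
        std::vector<int> order = rankOrderEncode(stimulus);
        for (std::size_t rank = 0; rank < order.size(); ++rank)
            std::printf("spike %zu comes from neuron %d\n", rank, order[rank]);
        return 0;   // prints nB, then nC, then nA: the order alone encodes the image
    }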

2.3.3 Time-to-First-Spike

This is a type of rank order coding. Each neuron sends only one spike (event) after a periodic reset signal. The time difference between a pixel's spike and the reference signal codes the state of the neuron. Consequently, not only the ordering of spikes codes the information (as in rank order coding), but the timing of the spikes is also used to code the precise amplitude of the neural states (see Fig. 2.4).

Figure 2.4: Representation of the Time-to-First-Spike coding scheme
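A corresponding sketch of the time-to-first-spike rule, under the simple integrate-to-threshold assumption used above: a neuron driven by a constant input I crosses a fixed threshold after a latency proportional to 1/I, so the delay measured from the reset signal carries the analogue value. The function name and the constants are illustrative only.

    #include <cstdio>

    // Time-to-first-spike under a linear integrate-to-threshold model: the neuron
    // state grows as intensity * t, so it crosses the threshold at t = thr / I.
    double firstSpikeLatencyUs(double intensity, double threshold) {
        if (intensity <= 0.0) return -1.0;        // no spike within this window
        return 1.0e6 * threshold / intensity;     // latency in microseconds
    }

    int main() {
        const double threshold = 1.0;             // arbitrary units
        for (double I : {0.2, 0.5, 0.9})
            std::printf("I = %.1f -> first spike after %.0f us\n",
                        I, firstSpikeLatencyUs(I, threshold));
        return 0;   // brighter pixels fire earlier; the delay itself codes intensity
    }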

2.3.4 Spike-Count Rate

The spike-count rate, also referred to as temporal average, is obtained by counting the number of spikes that appear during a trial and dividing by the duration of the trial. In practice, to get sensible averages, several spikes should occur within the time window. Typical values are T = 100 ms or T = 500 ms, but the duration may also be longer or shorter. The spike-count rate can be determined from a single trial, but at the expense of losing all temporal resolution about variations in the neural response during the course of the trial. Temporal averaging can work well in cases where the stimulus is constant or slowly varying. Real-world input, however, is hardly stationary, but often changes on a fast time scale. For example, even when viewing a static image, humans perform saccades, rapid changes of the direction of gaze. The image projected onto the retinal photoreceptors therefore changes every few hundred milliseconds. Despite its shortcomings, the concept of a spike-count rate code is widely used not only in experiments, but also in models of neural networks. It has led to the idea that a neuron transforms information about a single input variable (the stimulus strength) into a single continuous output variable (the firing rate).
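Computing the spike-count rate itself reduces to a single division over the chosen window; the short helper below (an illustrative sketch, not part of AERST) makes the definition explicit.

    #include <cstdio>
    #include <vector>

    // Spike-count rate: number of spikes inside the observation window divided by
    // the window duration. All temporal structure inside the window is discarded.
    double spikeCountRateHz(const std::vector<double>& spikeTimesSec,
                            double tStartSec, double tEndSec) {
        int count = 0;
        for (double t : spikeTimesSec)
            if (t >= tStartSec && t < tEndSec) ++count;
        return count / (tEndSec - tStartSec);
    }

    int main() {
        std::vector<double> spikes = {0.01, 0.05, 0.12, 0.31, 0.44};   // seconds
        std::printf("rate = %.1f Hz\n", spikeCountRateHz(spikes, 0.0, 0.5));
        return 0;   // 5 spikes in a 500 ms window -> 10.0 Hz
    }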

2.3.5 Population coding

Population coding is a method to represent stimuli by using the joint activities of a number of neurons. In population coding, each neuron has a distribution of responses over some set of inputs, and the responses of many neurons may be combined to determine some value about the inputs. From the theoretical point of view, population coding is one of a few mathematically well-formulated problems in neuroscience. It grasps the essential features of neural coding and yet is simple enough for theoretical analysis [26]. Experimental studies have revealed that this coding paradigm is widely used in the sensory and motor areas of the brain. For example, in the medial temporal visual area (MT), neurons are tuned to the direction of motion [27]. In response to an object moving in a particular direction, many neurons in MT fire, with a noise-corrupted and bell-shaped activity pattern across the population. The moving direction of the object is retrieved from the population activity, making it immune to the fluctuations present in a single neuron's signal. Population coding has a number of advantages, including the reduction of uncertainty due to neuronal variability and the ability to represent several different stimulus attributes simultaneously. Population coding is also much faster than rate coding and can reflect changes in the stimulus conditions nearly instantaneously [28]. Individual neurons in such a population typically have different but overlapping selectivities, so that many neurons, but not necessarily all, respond to a given stimulus.
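A standard way to read out such a population is the population-vector estimate: each neuron votes for its preferred direction with a weight given by its firing rate, and the angle of the summed vector is taken as the decoded direction. The sketch below is a generic textbook construction with made-up tuning data, not a method taken from this thesis.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Population-vector decoding: average the preferred directions of the neurons,
    // weighted by their firing rates, and return the angle of the resulting vector.
    double decodeDirectionDeg(const std::vector<double>& preferredDeg,
                              const std::vector<double>& rateHz) {
        const double pi = 3.14159265358979323846;
        double x = 0.0, y = 0.0;
        for (std::size_t i = 0; i < preferredDeg.size(); ++i) {
            double a = preferredDeg[i] * pi / 180.0;
            x += rateHz[i] * std::cos(a);
            y += rateHz[i] * std::sin(a);
        }
        return std::atan2(y, x) * 180.0 / pi;
    }

    int main() {
        std::vector<double> preferred = {0.0, 45.0, 90.0, 135.0};    // tuning
        std::vector<double> rate      = {5.0, 20.0, 30.0, 20.0};     // bell-shaped
        std::printf("decoded direction ~ %.1f deg\n",
                    decodeDirectionDeg(preferred, rate));
        return 0;   // prints a value close to 90 deg (about 85 deg for these rates)
    }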

2.3.6 Phase-of-firing code

Phase-of-firing code is a neural coding scheme that combines the spike count code with a time reference based on slow oscillations. It has been shown that neurons in some cortical sensory areas encode rich naturalistic stimuli in terms of their spike times relative to the phase of ongoing network fluctuations, rather than only in terms of their spike count [29]. Oscillations reflect local field potential signals. It is often categorized as a temporal code, although the time label used for spikes is coarse grained. That is, four discrete values for phase are enough to represent all the information content in this kind of code with respect to the phase of oscillations at low frequencies. The phase-of-firing code is loosely based on the phase precession phenomena observed in place cells of the hippocampus.

2.3.7 Intensity Variation

In this coding scheme a neuron (pixel) generates a spike if its intensity has changed by a certain quantity since the previously generated spike. This type of information coding computes the temporal derivative of the signal. It is used, for example, in Dynamic Vision Sensors (DVS), where a pixel produces an address event every time it detects a temporal contrast above a pre-tuned threshold [15].
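A minimal sketch of this intensity-variation rule for a single pixel is given below: an ON or OFF event is emitted whenever the intensity has changed by more than a threshold since the last emitted event, and the reference level is then updated. This is a simplified illustration written for this text (actual DVS pixels operate on the temporal contrast of the photocurrent); the struct and its names are hypothetical.

    #include <cmath>
    #include <cstdio>

    // Simplified intensity-variation (DVS-like) pixel: emit an ON/OFF event when the
    // intensity has changed by more than 'threshold' since the last emitted event,
    // then take the current intensity as the new reference level.
    struct ChangePixel {
        double reference;
        double threshold;
        // Returns +1 (ON event), -1 (OFF event) or 0 (no event) for a new sample.
        int update(double intensity) {
            double diff = intensity - reference;
            if (std::fabs(diff) < threshold) return 0;
            reference = intensity;
            return diff > 0.0 ? +1 : -1;
        }
    };

    int main() {
        ChangePixel px{0.50, 0.10};                    // reference, threshold
        const double samples[] = {0.52, 0.63, 0.70, 0.40, 0.38};
        for (double s : samples)
            std::printf("I = %.2f -> event %+d\n", s, px.update(s));
        return 0;   // events fire only on sufficiently large brightness changes
    }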

2.4 AER Protocol for Event-Based Systems

As it has been shown, latency and throughput in event-based systems are not limited by

a sampling rate. However, in real-time hardware implementations, hardware engineers

face a very strong barrier when trying to mimic the bio-inspired hierarchically layered

structure: the massive connectivity. In present day state-of-the-art very large-scale

integrated (VLSI) circuit technologies it is plausible to fabricate on a single chip many

thousands (even millions) of artificial neurons or simple processing cells. However, it is

not viable to physically connect each of them to even a few hundred other neurons.

Figure 2.5: Concept of point-to-point interchip AER communication.

The problem is greater for multi-chip multi-layer hierarchically structured bio-inspired

systems.

Address-event representation (AER) is a promising emergent hardware technol-

ogy that shows potential for providing the computing requirements of large frameless

projection-field-based multilayer systems, while offering a hardware solution to the inter-

chip massive connectivity problem. AER was first proposed in 1991 in one of the

California Institute of Technology (Caltech) research labs [30], and has been used since

then by a wide community of neuromorphic hardware engineers.

Fig. 2.5 illustrates event communication in a point-to-point rate-coded AER link

[31], where pixel intensity is coded directly as pixel event frequency. The continuous-

time states of pixels in an emitter chip are transformed into sequences of fast digital

pulses (spikes or events) of minimal width (in the order of nanoseconds) but with

much longer inter-spike intervals (typically in the order of milliseconds). Each time a

pixel generates a spike, its address is written on the interchip digital bus, after proper

arbitration [30]. This is called an address event. The receiver chip reads and decodes

the addresses of the incoming events and sends spikes to the corresponding receiving

pixels for reconstruction or further processing. This point-to-point communication in

Fig. 2.5 can be extended to a multireceiver scheme [31]. Also, multiple emitters can

merge their outputs into a smaller set of receiver chips [36]. Moreover, AER visual

information can easily be translated or rotated by remapping the addresses during

interchip transmission [37][38]. Complex processing such as convolutions can be also

implemented [17][39][36].

AER has been used fundamentally in image sensors, for simple light intensity to

frequency transformations [32], time-to-first-spike coding [33][34], foveated sensors [40],

contrast [41][16], more elaborate transient detectors [15], and motion sensing and com-

putation systems [42]. But AER has also been used for auditory systems [43][44],

competition and winner-takes-all networks [45][46], and even for systems distributed

over wireless networks [47]. However, the high potential of AER has become even

more apparent since the availability of AER convolution chips [17][39]. These chips,

which can perform large arbitrary kernel convolutions (32x32 in [17]) at speeds of about

3×10⁹ connections/s/chip, can be used as building blocks for larger cortical-like multi-

layer hierarchical structures, because of the modular and scalable nature of AER-based

systems.

There is a growing community of AER protocol users for bio-inspired applications in

vision and audition systems. The goal of this community is to build large multi-chip and

multi-layer hierarchically structured systems capable of performing complicated array

data processing in real time. Currently, only a small number of AER-based chips have

been used simultaneously [36]. The largest AER system reported so far is the CAVIAR

system [36], which uses four custom made AER chips (motion retina, convolution chip,

winner-take-all chip, and learning chip) plus a set of FPGA based AER interfacing and

mapping modules. The CAVIAR system includes 45k neurons, emulates up to 5 million

synapses, performs an equivalent of 9 giga-connects-per-second, and can sense, identify

and track objects with a 3 ms delay. This system has only four convolution

modules, but it is expected that hundreds of such modular AER convolution units

could be integrated in a compact volume, such as a miniature printed circuit board

(PCB) or into chips of the type known as networks-on-chip (NoC) [49]. This would

eventually allow the assembly of large cortical-like convolutional neural networks and

event-based frameless vision processing systems operating at very high speeds. The

success of such systems will strongly depend on the availability of robust and efficient

development and debugging AER tools, as well as a theoretical know-how on how to

assemble and program multi-layer multi-chip AER systems for specific applications.

The objective of the present thesis is to provide a simulation tool for complex AER

systems, and study some particular application examples.

Chapter 3

IMPLEMENTATION OF AN

AER SIMULATION TOOL

3.1 Requirement of a simulation tool

With the growing popularity and real-time capability of AER systems, it becomes

highly desirable to have available a powerful tool to efficiently simulate the behaviour

and operation of such systems, prior to their physical development in hardware. In

the present thesis we have developed an AER event-driven simulator. There are two

versions, one in Matlab and one in C++. This allows us to behaviorally describe any

AER module (including timing and non-ideal characteristics), and to assemble large

systems composed of many different modules, thus building complex and event-driven

processing systems.

The performance characteristics of the simulated AER modules (convolution chips,

mergers, splitters, and mappers) are obtained from already manufactured, tested and

reported AER modules. Many of the AER modules modeled are only available as exper-

imental prototype chips, so physically assembling large AER systems is not possible.

However, by modeling the performance characteristics of the available AER hardware mod-

ules (chips) with the AER behavioral simulator, we can obtain a very good estimate of

the overall system performance. Furthermore, the AER behavioral simulator can be

used to propose and test new AER processing modules to be used in larger systems,

and thus orient hardware developers on what kind of AER hardware modules may be

useful and what performance characteristics they should possess.

At present, no AER simulation tools exist. So far, AER-based

system developers and researchers have been proposing new systems mainly based on

hardware implementations. Currently, existing AER modules implement simple processing

tasks such as edge detection and other simple kinds of filtering [36][50][51][45].

In a software simulator there are two main strategies for simulating the behaviour of

a system: synchronous (clock-driven) algorithms and asynchronous (event-driven) algorithms.

3.1.1 Synchronous or Clock-driven Algorithms

In a synchronous or clock-driven algorithm the state variables of all neurons (and pos-

sibly synapses) are updated at every tick of a clock: X(t) → X(t + dt). Then, after

updating all variables, the threshold condition is checked for every neuron. Each neuron

that satisfies this condition produces a spike which is transmitted to its target neurons,

updating the corresponding variables.
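As an illustration, the following MATLAB sketch simulates a small population of leaky integrate-and-fire neurons in a clock-driven way; all parameter names and values (tau, vth, W, I) are illustrative and are not taken from any of the AER modules described later:

    N = 100; dt = 1e-4; T = 0.1;                 % time step and total simulated time (s)
    tau = 20e-3; vth = 1; vreset = 0;            % membrane time constant and thresholds
    W = 0.05*rand(N);                            % random synaptic weight matrix
    v = zeros(N,1); I = 1.2*ones(N,1);           % membrane states and constant input
    for t = dt:dt:T
        v = v + (dt/tau)*(-v + I);               % update ALL neurons at every tick
        fired = find(v >= vth);                  % threshold checked only at tick times
        v(fired) = vreset;                       % reset the neurons that spiked
        v = v + sum(W(:,fired), 2);              % propagate the spikes to their targets
    end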

The obvious drawback of clock-driven algorithms is that spike timings are aligned to a

grid (ticks of the clock), thus the simulation is approximate even when the differential

equations are computed exactly. Other specific errors come from the fact that threshold

conditions are checked only at the ticks of the clock, implying that some spikes might

be missed.

For realistic large-scale networks, the algorithmic complexity and, hence, computational

load scales linearly with the number of neurons, but also linearly with the temporal

resolution. Increasing the temporal resolution (i.e., decreasing dt) leads to a marked increase in the

time needed to simulate neural activity in a given time window and, as stated

above, dt determines the accuracy of the numerical simulation (note that dt introduces

an artificial cutoff for the time-scales captured by the simulation).

The main argued advantage of clock-driven algorithms is that they can be coded

and applied to any neuron model. However, when the timing of the spikes is important

(as occurs for instance in STDP learning algorithms), the choice of a proper dt

value is crucial, and a poor choice can lead to severe misbehaviours.

3.1.2 Asynchronous or Event-driven Algorithms

The growing experimental evidence that spike timing may be important to explain

neural computations has motivated the use of event-based simulation techniques, rather

than the traditional clock-driven-based models.

In event-driven algorithms, the simulation advances from one event to the next event.

In contrast with standard clock-driven simulations, state variables need to be updated

at the time of every incoming spike rather than at every tick of the clock in order to

simulate the network.
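As an illustration of this strategy, the following MATLAB sketch updates a single leaky integrate-and-fire neuron only when an input spike arrives, applying the analytic decay for the elapsed interval; all names (tau, vth, w) are illustrative and unrelated to the tool described in the next sections:

    function [v, t_last, spike] = lif_on_event(v, t_last, t_event, w, tau, vth)
        % Called once per incoming event at time t_event with synaptic weight w.
        v = v*exp(-(t_event - t_last)/tau);      % exact decay since the last update
        v = v + w;                               % add the synaptic contribution
        spike = (v >= vth);                      % threshold checked at the event time
        if spike
            v = 0;                               % reset after firing
        end
        t_last = t_event;                        % remember the time of this update
    end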

The key advantages in event-driven algorithms are:

1. A potential gain in speed due to the avoidance of update steps in neurons where

no events have arrived. Systems with a reduced number of events require very

short processing times to complete the simulations.

2. Spike timings are computed exactly.

3. The event-driven approaches are free from the dependence on the temporal res-

olution by using the exact times of events. This gain in accuracy comes at the

cost that, now, the computational load scales with the number of events in the

network, which rises linearly with the number of neurons in realistic large-scale

neuronal networks [52].

The main drawback argued when using event-driven simulators is that not all neu-

ron models can be implemented (such as the Hodgkin-Huxley model [7]). So far, the simple

i&f (integrate-and-fire) model has been preferred with this kind of algorithm. How-

ever, there are efforts to design suitable algorithms for complex models (for example the

two-variable i&f models of Izhikevich [6] and Brette and Gerstner [53]), or to develop

more realistic models that are suitable for event-driven simulation. Besides, as it will

be shown in the next sections, some tricks can be implemented to emulate clock-driven

elements, such as dummy connections operating as clocks to activate the computation

of some differential equations.

Comprehensive comparisons between clock-driven and event-driven simulators can

be found in [54][52] together with the description of some well-known already imple-

mented simulators (NEURON, GENESIS, NEST, Mvaspike, etc.).

In the present thesis, an event-driven simulator tool has been implemented and

it is called AERST (Address Event Representation Simulator Tool). As it is event-

driven, it is aimed at simulating systems where event-times can be computed efficiently.

Among all programming languages, we initially chose MATLAB due to the following

properties:

1. Power in matrix manipulation. Matlab provides many convenient ways for cre-

ating (and operating with) vectors, matrices, and multi-dimensional arrays. This

property is crucial when simulating AER systems as neurons are usually allocated

in array distributions.

2. Modularity. It is very easy to design new functions in MATLAB implementing

the same functionalities as the physical devices. Besides, it is easy to update these

functions, adding new functionality or modifying their parameters.

3. Powerful interface with programs written in other languages, including C, C++,

Java, ActiveX, .NET, and Fortran. MATLAB can call functions and subroutines

written in the C programming language or Fortran. A wrapper function is cre-

ated allowing MATLAB data types to be passed and returned. The dynamically

loadable object files created by compiling such functions are termed ‘MEX-files’

(for MATLAB executable). Libraries written in Java, ActiveX or .NET can be

directly called from MATLAB and many MATLAB libraries (for example XML

or SQL support) are implemented as wrappers around Java or ActiveX libraries.

4. Easy to translate. In spite of not being a very fast processing language, in recent

years MATLAB has added many translation tools so that it is easy to translate

MATLAB applications to other environments and languages, such as SIMULINK, Java, C++ or

even hardware description languages such as VHDL.

These four powerful properties, together with the growing popularity of MATLAB, make

this development environment a very attractive and appropriate tool to simulate AER-

based systems.

In spite of achieving an acceptable processing speed for small systems, the Matlab

implementation turned out to be not fast enough when trying to simulate large and

complex systems with a huge number of total events (higher than 1Mevents). This

fact motivated us to implement the tool in C++ trying to keep the format of all the

events and modules in the Matlab implementation.

In the following sections we describe the algorithm implemented, the library of

modules developed and how to analyze the information obtained after a simulation.

3.2 Description of the AER Simulation Tool

In this simulator a generic AER system is described by a netlist that uses only two

types of elements: instances and channels. An instance is a block that generates

and/or processes AER streams. For example, a retina chip would be a source that

provides an input AER stream to the AER system. A convolution chip [36] would be

an AER processing instance with an input AER stream and an output AER stream.

A splitter [55] would be an instance which replicates the events from one input AER

stream onto several output AER streams. Similarly, a merger [55] is another instance

which would receive as input several AER streams and merge them into a single output

AER stream. AER streams constitute the nodes of the netlist in an AER system, and

are called channels. The simulator imposes the restriction that a channel connects a

single AER output from an instance to a single AER input of another (or the same)

instance. This way, channels represent point-to-point connections. For splitting and/or

merging channels, splitter and/or merger instances must be included in the netlist.

3.2.1 Configuration File

An AER system will be composed of several instances interconnected in some way and

a set of channels carrying the information between the instances. The system to be

simulated is described by a configuration file that includes a netlist with the instances

and channels. Fig 3.1 shows an example netlist and its ASCII file netlist description.

The netlist contains 7 instances and 8 channels. The netlist description is provided to

the simulator through a text file, which is shown in the bottom of Fig 3.1. Channel 1 is

a source channel. All its events are available a priori as an input file to the simulator.

There can be any arbitrary number of source channels in the system. Source channels

need a line in the netlist file, starting with the keyword sources, followed by the channel

numbers and the files containing their events.

Figure 3.1: Example AER system and its ASCII file netlist description

If we want to use more than one source in the system we only have to enumerate

the different sources using commas in the following way:

sources {1, 2, ..., N}{datasource1, datasource2, ..., datasourceN} (3.1)

The following lines describe each of the instances, one line per instance in the

network. Generally, a descriptive line for one instance will have the following format:

instance {ch in1, ..., ch inN}{ch out1, ..., ch outN}{file params}{file state} (3.2)

The first field in the line is the instance name, followed by its input channels, output

channels, name of structure containing its parameters, and name of structure contain-

ing its state. Each instance is described by a MATLAB function whose name is the

name of the instance.

The simulator imposes no restriction on the format of the parameters and states struc-

tures. This is left open to the user writing the code of the function of each instance.

The simulator only needs to know the name of the parameter and state files where

these structures are stored. If one instance does not need parameters or states, these

fields will appear as empty. For example, consider that an instance called receiver needs

neither output channels nor any kind of parameters. Then its description will

be something like:

receiver {channel N}{}{}{} (3.3)

The module File in Fig 3.1 acts as a source in the system providing the events

that will travel through the channel denoted as ‘1’ and enter the module splitter. The

module splitter will receive channel ‘1’ events and will copy them to the two output

ports (corresponding to the channels labelled as ‘2’ and ‘4’ in the figure) without any

kind of processing. These two channels are connected to the processing modules

labelled as HorizontalEdge and ‘−90’. The events generated by HorizontalEdge

travel through channel ‘3’ to the upper input port in the module named as merger in

the system. The events coming to the instance ‘−90’ get their (x, y) coordinates rotated

-90 degrees and travel through channel ‘5’ to HorizontalEdge. Channel ‘6’ connects

processing chip HorizontalEdge with the processing element ‘+90’, where events are

rotated this time +90 degrees. Events going out from the instance described as ‘+90’

will travel through channel ‘7’ to the second (bottom) input port in the merger module.

This module will forward all the input events to its only output port, labelled as

‘8’. The information in all the channels will be available to be used as source in other

systems or to analyze the behaviour and delays in the system simulated.
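The exact file for this example is the one reproduced in Fig 3.1. As an orientation only, a netlist with this topology could look like the following sketch, where the instance names (splitter, HorizontalEdge, rotm90, rotp90, merger) and the parameter and state file names are illustrative:

    sources {1}{datasource1}
    splitter {1}{2, 4}{params_split}{state_split}
    HorizontalEdge {2}{3}{params_hedge}{state_hedge1}
    rotm90 {4}{5}{params_rotm90}{}
    HorizontalEdge {5}{6}{params_hedge}{state_hedge2}
    rotp90 {6}{7}{params_rotp90}{}
    merger {3, 7}{8}{params_merger}{state_merger}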

3.2.2 Event Description

Every event in the system carries information about the neuron that originated it and

the time in which the event was created so that any receptor module can decode the

information and decide how to process it. The reception of an event by a neuron can

cause changes in the state variables and the generation of new output events. For

instance, in a convolution chip [56], the reception of one event implies that charge

packets will contribute to the state voltage integral of a group of receiving neurons in

the chip. If the state of any of these neurons reaches a certain threshold, the neurons

will fire new output events and reset themselves.

AER systems generally use asynchronous communication. Therefore, there will

be two corresponding request and acknowledge signals between the emitter and receiver

elements to handle the event communication. In our implementation each event con-

tains six fields. The first three correspond to timing information of the event, while the

last three correspond to data transmitted by the event:

[Tprereqst Treqst Tack x y sign] (3.4)

The data fields are irrelevant for the simulator, and only need to be interpreted

properly by the modules (instances) receiving and generating them. For the particular

cases we describe in this thesis we have always used the same three fields: ‘x′ and ‘y′

represent the coordinates or addresses of the pixel originating the event and ‘sign′ its

sign. The three timing fields are as follows: ‘Tprereqst’ represents the time at which

the event is created at the emitter instance, ‘Treqst’ represents the time at which the

event can start to be processed by the receiver instance, and ‘Tack’ represents the time

at which the event is finally acknowledged by the receiver instance. We distinguish

between a pre-Request time and an effective Request time. The first one is only depen-

dent on the emitter instance, while the second one requires that the receiver instance is

ready to read and process an event request. Thus, we can provide as source a full list

of events which are described only by their data fields and pre-Request times. Events

travelling through one channel in the system will be stored sequentially inside a matrix.

Each row in this matrix will correspond to the information of one single event as in

eq. B.1. In this way, for each channel in the system there will be a matrix storing

all events that travelled through it, with as many rows as events and with six columns

to provide the six fields. Once the events are processed by the simulator, their final

effective request and acknowledge times are established. As the simulator is handling

the events in the different channels, it keeps track of the ‘actual time’ Tact, and uses it

to generate the final Treqst and Tack times for each terminated event transfer.
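For illustration, a channel matrix holding three still unprocessed events could look as follows (the time values in seconds are arbitrary; Treqst = 0 and Tack = −1 mean that the event has not been processed yet):

    %            Tprereqst   Treqst   Tack    x    y   sign
    channel = [  0.001000     0       -1     12    7    1;
                 0.001250     0       -1      3   21   -1;
                 0.002075     0       -1     12    8    1 ];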

3.2.3 Instance Description

Each module in the system receives events and can implement some kind of processing.

When a module receives a request signal from one of its input ports (meaning the

presence of a new event) it reads the event (in case the module is not busy) and

returns an acknowledge signal to the emitter. The module will process the event, and

according to the parameters defining the module and its previous state variables it

might produce new output events. Its internal state variables will be updated and

also the internal actual time Tact for the block, which is the last time this block was

“visited” by the simulator. The new output events will be written to the output ports

with the corresponding pre−Request times for each of the generated events.

The operation of an instance must consider all these aspects and is described by an

independent MATLAB function whose name is identical to the instance name (module

name) in the netlist. A user can add and write new instances as desired. The only

restriction is to respect the calling format of the function. The calling format of a

function will be the following:

[new event in, events out, new state, new time, port out] = (3.5)

module f(event in, pars, old state, old time, port in)

event in corresponds to the present event information (as in eq. B.1) sent through

the channel. The event in information passed to the function as input parameter

contains the x and y coordinates of the event being processed and its Tprereqst time.

The updated new event in returned by the function contains also the established Treqst and Tack times. old state and new state represent the instance state before and after

processing the event. old time and new time are the global system times before and

after processing the event. events out is a list of output events produced by the instance

at its different output channels. port in is the port number from where the event has

entered the module and port out is a list of numbers identifying the output ports where

each of the output events created will be written. These new output events (which

are still unprocessed events) are included by the simulator in their respective channel

matrices with Tprereqst as the present actual times, which at a later time should be

processed by their respective destination instances.

The basic operation of a general module in our tool is as described next. Each time

an event has to be processed, the simulator tool provides to the corresponding instance

its parameters pars, the state variables old state describing the instance previous state,

the instance actual time old time and the input port from which the event is coming

in. According to the actual time and the parameters defining the internal delays and

processing times, the instance updates correspondingly the values Treqst and Tack for

the incoming event. Then the event is processed by the instance using the functional

parameters pars and internal behaviour. After this, the internal state is changed and

some new events might be created. All these new events will have their value Tprereqst set to the present creation time, and their values Treqst and Tack initialized to ‘0’ and

‘-1’ respectively (meaning such events are waiting to be processed). For each new event,

the corresponding output port is specified. After the execution, the actual time for the

instance is also updated and the main program will write the created events on the

corresponding channel list.
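As an orientation, a minimal instance function respecting this calling format could look like the following MATLAB sketch. It implements a trivial pass-through module: the field positions follow eq. (3.4), and the parameter names timedelay and timetoprocess follow the modules described in Section 3.3; everything else is illustrative:

    function [new_event_in, events_out, new_state, new_time, port_out] = ...
             passthrough(event_in, pars, old_state, old_time, port_in)
        % 1. Establish the effective Request and Acknowledge times of the input event.
        new_event_in = event_in;
        new_event_in(2) = old_time;                        % Treqst
        new_event_in(3) = old_time + pars.timedelay;       % Tack
        % 2. Update the actual time of the instance.
        new_time = old_time + pars.timedelay + pars.timetoprocess;
        % 3. "Process" the event: here it is simply replicated on output port 1,
        %    as a new unprocessed event (Treqst = 0, Tack = -1).
        events_out = [new_time 0 -1 event_in(4) event_in(5) event_in(6)];
        port_out = 1;
        new_state = old_state;                             % stateless module
    end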

The way how the tool is run and how the netlist, parameters and sources are

initialized is carefully described in Appendix 1.

3.2.4 Description of Program Flow

The execution of the simulator is described in Fig 3.2. Initially the netlist file is read

as well as all parameters and states files of all instances. Each instance is initialized

according to the initial state of each instance. Then, the program enters a continuous

loop that performs the following steps:

1. All channels are examined. The simulator selects the channel with the earliest un-

processed event. An event is unprocessed when only its pre-Request time Tprereqst has been established, but not its final request time Treqst nor its acknowledge time

Tack. The system is able to handle the situation in which several channels have

unprocessed events with the same Tprereqst time. This is left transparent to the

user. Note that events in different channels can occur simultaneously, since chan-

nels are physically independent.

2. Once a channel is selected for processing, its earliest unprocessed event informa-

tion is provided as input for the destination instance. By default, this event is

provided with updated values of its Tprereqst and its Tack time (to consider the case

in which an instance built by a user does not specify this operation). The state

(old state) and parameter (pars) variables, which were created and stored by the

user in Matlab files before running the simulation, are loaded to be provided to

the destination instance.

3. The instance is called and it updates and corrects the event time information

(Tprereqst and Tack) in case some instance specific delay times are considered.

The instance updates its internal state according to the information carried by

Figure 3.2: Basic Algorithm implemented by the AER tool

the event. In case this event triggers new output events, a list of new unprocessed

events is provided as output. This list of new unprocessed events provides the

events information (such as address and sign) and their Tprereqst time only.

4. The simulator updates all channels with the new events, and stores the new state

for the processing instance.

5. If there are more unprocessed events in the channels, the simulator goes back to

step 1, otherwise the simulation finishes.
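A compact sketch of this loop, in MATLAB-like pseudocode, is given below; the helper functions (read_netlist, earliest_unprocessed, append_events, load_state, save_state) and the struct fields are illustrative placeholders, not the actual names used in the tool:

    [instances, channels] = read_netlist('netlist.txt');
    while true
        % 1. Select the channel holding the earliest unprocessed event.
        [ch, ev] = earliest_unprocessed(channels);
        if isempty(ch), break; end                         % 5. no pending events: finish
        % 2. Gather the event, the destination instance, its parameters and its state.
        dst       = channels(ch).destination;
        event_in  = channels(ch).events(ev, :);
        old_state = load_state(instances(dst).state_file);
        % 3. Call the instance function (calling format of eq. (3.5)).
        [new_ev, ev_out, new_state, new_time, p_out] = feval(instances(dst).name, ...
            event_in, instances(dst).pars, old_state, instances(dst).time, ...
            channels(ch).port_in);
        % 4. Write back the processed event, append the new events, store the new state.
        channels(ch).events(ev, :) = new_ev;
        channels = append_events(channels, instances(dst), ev_out, p_out);
        save_state(instances(dst).state_file, new_state);
        instances(dst).time = new_time;
    end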

3.2.5 Algorithm Optimizations for Efficient Computation Speed

One important aim pursued during the development of the tool was always to achieve

a minimum simulation time with optimum memory management. This aim led

us to construct our algorithms in efficient ways, such as optimizing the selection of the

next event to be processed, the way in which unprocessed events are allocated, the way

in which the already processed events are saved, etc.

In the first designs, the search for the first event to be processed in the system at

each step of the simulation was implemented by looking for that event with the lowest

Tprereqst among all the events belonging to every channel. The idea of searching among

all the events and all the channels was not feasible as the number of events in every

channel grew fast at each step of the simulation. Therefore, this exhaustive search

resulted in very slow simulations. Furthermore, the growth of the number of events in

some channels could be exponential. Another limiting factor is that, for a particular

simulation, there was a growing number of processed events stored in memory. This

implies that the channel matrices kept growing, making the simulation progressively

slower and eventually reaching unsustainable simulation times.

The first optimization implemented was the use of a temporary matrix composed of

cells to save all those events already processed. This way, after a certain number of

processed events, all the channels were analyzed and their processed events were taken

out and stored in the temporary cell matrix. Operating in this way avoids the search

of the earliest event among all the processed and unprocessed events, considering only

those still unprocessed. The adoption of this solution does not solve the growth of the

temporary matrix, which was still a problem, as it can grow enormously and accessing it

to save new processed events can be very expensive in terms of computation time.

The second optimization implemented is the use of an auxiliary matrix of indexes.

The first row in this matrix stores a pointer to the following event to be processed for

every channel. The second row stores the Tprereqst values for these events. Finally, the

third row stores a pointer to the last event. Therefore, each time an event belonging to

one channel is processed, the pointer corresponding to the channel is incremented by

one and the new Tprereqst value is loaded also into the matrix. This way of operation

imposes the restriction that all the events in a channel have to be in increasing order

according to their Tprereqst values. To tackle this, a sorting function is used to order

new events in case they do not arrive in time order with respect to the previously existing ones.

The main advantage of working with indexes was that it avoids searching for the next

earliest unprocessed event through all events in a channel. With the matrix of indexes

the algorithm only has to compare the Tprereqst times stored in the matrix to select the

channel and event to process. With this optimization we can simulate large systems

composed of a considerable number of modules, processing events at speeds of around

500eps (events per second).
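For illustration, assuming an index matrix idx with one column per channel (row 1: pointer to the next unprocessed event, row 2: its Tprereqst, row 3: pointer to the last event) and a cell array channels with one event matrix per channel, the selection and pointer update could be sketched in MATLAB as follows; all names are illustrative:

    [~, ch] = min(idx(2, :));                 % channel with the earliest pending event
    ev = idx(1, ch);                          % position of that event in its channel
    % ... process channels{ch}(ev, :) ...
    idx(1, ch) = ev + 1;                      % advance the pointer of this channel
    if idx(1, ch) <= idx(3, ch)
        idx(2, ch) = channels{ch}(idx(1, ch), 1);   % Tprereqst of the next event
    else
        idx(2, ch) = Inf;                           % no pending events in this channel
    end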

Despite the use of these two optimizations, the problem of the growth of the tem-

porary matrix storing the processed events was still unsolved. The simulation speed still

decreases exponentially with the growth of the temporary matrix, which makes it almost

impossible to simulate systems with a large number of events. To solve this, it was

decided not to keep the processed events inside a matrix in main memory, but outside

main memory in a text file that would stay open until the end of the simulation. Thus,

after a certain time or number of events, all the already processed events were saved in

this file and cleared from main memory, avoiding the use of memory resources which

made the simulations too slow.

Thanks to all these optimizations, the tool in its current form is able to simulate

large systems at constant speeds independently of the system size (number of channels

and instances) and number of events. The average per event speed will only depend on

the complexity of the instances and their parameters (such as size). After the end of

a simulation a text file is available containing all the events that have been transferred

through all channels and with all the corresponding timing information. At this point

the user can use functions provided by the simulator to read the text file and recover

all the events in the simulated system to analyze the behaviour of the system or to use

some of the channels as sources in different systems.

To compare and better appreciate the optimizations implemented in the system,

Fig 3.3 shows the time figures with and without these optimizations. Particularly, we

have simulated the system shown in Fig 3.1, which involves a total of 1.9 Mevents.

Note that the peaks in the simulation times in the second and third versions are due

to the processing time required by the simulator to store the already processed events.

Note how these times were longer when the tool used matrices of cells to store the

events.

Figure 3.3: Time Optimizations in the Simulation tool

Fig 3.3 plots the simulation times required under three different situations. The x axis

represents the number of blocks, where each block contains 100 events. The upper

curve (plotted in red) corresponds to the simulation time when neither pointers nor

storing elements were used to save the processed events. In the central curve (plotted

in blue) the tool made use of pointers (matrix of indexes) to find the next events to be

processed and an auxiliary matrix of cells to store the already processed events. Finally,

the bottom curve (plotted in green) makes use of pointers and also of an external file to

store the processed events. As can be appreciated, in the first and second cases the

simulation time depends on the size of the sources and follows a linear law proportional

to the number of events already processed. In particular, for the system of Fig 3.1

the time in the first two implementations follows a law with a time constant of

approximately 4×10⁻⁶ s per event, so that the simulation time can be written as

tfinal = 4×10⁻⁶ · nevents + time100events (3.6)

where time100events is the time to process the first 100 events and nevents is the number

of events. In the first case time100events was approximately 3.9 s. In the second case

the time required to store the processed events in the auxiliary array (each 10kevents)

was proportional to the number of events already processed and time100events was 0.8s.

Finally, in the third case note that the simulation time required for each 100 event-

block is constant (and less than 0.10 s) throughout the whole simulation for the particular system

simulated. This constant value implies a simulation speed higher than 1keps (1000

events per second). It must be pointed out that for systems in general, this time does

not only depend on the number of events but also on the complexity of the modules,

especially the size and number of variables used inside them.

3.2.6 C++ implementation

In spite of achieving a high processing speed, the Matlab implementation soon turned

out to be not fast enough when trying to simulate large and complex systems with a

huge number of total events (higher than 1Mevents). This fact motivated us to imple-

ment the tool in C++. Matlab includes utilities to translate matlab code into C++.

However, when we tested the resulting C++ code, the speed improvement was not re-

ally significant (times were comparable). Thus, the tool was rewritten entirely in C++

trying to keep the format of all the events and modules in the Matlab implementation.

However, some changes were implemented to optimize the execution time and the use

of memory by the C++ application. The main differences (and optimizations) with

respect to the Matlab implementation were:

1. Dynamic memory management. The memory is managed dynamically so that

channels grow when new events are added and shrink when events are concluded.

State variables and parameters are also created dynamically when the application

is started.

2. Use of lists. Events are not stored in matrices, as in the Matlab implementation,

but in lists, where each list corresponds to one channel. This way, each element

in the list stores the information corresponding to one event and also a pointer to

the following event to be processed in the channel list. The format of each event

is the same as in the Matlab implementation.

The main advantage of using lists and pointers is that memory is used more ef-

ficiently, as events are stored in different and small memory positions instead of

in matrices occupying sequential blocks of memory.

Figure 3.4: CAVIAR AER vision system

These two features allowed faster simulations and a more efficient use of memory. With

the C++ implementation the tool was able to process events at speeds higher than

30keps, more than 30 times the speed achieved in the Matlab implementation.

3.3 AER Modules

When describing modules, we have to distinguish between interconnection modules

and processing modules. In the complex systems developed by neuromorphic engineers

different interfaces are required to implement interconnections between the modules, and to

connect them to PCs for development, debugging, or other purposes. There are some

AER tools developed under the European CAVIAR project to facilitate these intercon-

nections [36]. To date, the biggest AER chain has been built in the CAVIAR project

(see Fig. 3.4). In this system, the front of the signal chain is composed of a 128x128

retina that spikes with temporal contrast changes, four convolution chips that can be

programmed with arbitrary kernels of up to 32x32 pixels, a Winner-takes-all chip and

a two-chip spike-learning stage comprised of a delay line and a learning chip.

In the next subsections, different hardware AER modules are briefly described to-

gether with their corresponding implementation in the AER tool. Extra modules are

then proposed to allow complex processing in larger and more sophisticated systems.

3.3.1 AER Switch Module

To implement many-to-one and one-to-many connections between AER chips inside a system, CAVIAR pro-

vided an interconnection module called AER-Switch [55] that can perform two different

Figure 3.5: AER-Switch hardware interface

operations:

3.3.1.1 AER Splitter

In this configuration, one AER input is replicated to up to four AER outputs.

3.3.1.2 AER Merger

Up to four inputs are joined to one output. It can add bits to identify the input channel

if necessary.

The AER switch is based on a Xilinx 9500 complex programmable logic device

(CPLD). It has five AER ports: one input, one output, and three bidirectional ports

(Fig. 3.5). It provides delays in the order of tens of nanoseconds.

A Figure showing the representation of the AER-switch module implemented in the

tool with the internal parameters can be seen in Fig. 3.6.

The workflow of this module is as follows:

1. Use current time, timedelay (delay configured for the asynchronous communi-

cation) parameters to update the incoming event (event in) timing information

(Treqst and Tack):

new eventin(Treqst)← current time

new eventin(Tack)← current time+ timedelay

2. Update New current time using timetoprocess (time considered to process the

input event and generate the output ones):

New current time← current time+ timedelay + timetoprocess

Figure 3.6: AER-switch acting as Splitter or Merger

3. Create out (set of output events) with as many output events as output ports

(numb ports) exist, setting their Tprereqst values to current time.

3.3.2 Subsampling Module

A subsampling module is important because it reduces the resolution of the input visual

flow. This module reduces the input event address space by a factor coeff so that the

address of each input event (xin, yin) is modified and turns to (xout, yout), which is

computed as:

xout = ⌊xin/coeff⌋ (3.7)

yout = ⌊yin/coeff⌋ (3.8)

Note that the floor operation makes the output coordinates integer numbers. This

module can be easily created using a merger module to which the parameter coeff is

incorporated to reduce the input event address coordinates.

3.3.3 Mapper Module

An AER mapper implements spatial transformations of the address space. It com-

municates events between two AER chips by applying a transformation on the event

Figure 3.7: Scanner and Rotator AER Modules

data during the transmission. Each event from the sender is used to address an LUT

(look-up table). The event sent to the receiver is the one stored in the LUT. Through the

mapper, one can transform the address space through a translation, rotation, shifting,

compression, etc. or by filtering the events.

Most of the current AER mappers have the following functionalities:

1. Map each address event (AE) from an emitter module into a different address for

the receiver module, (1 to 1 mapper).

2. Map each event from an emitter to several address events for the receiver (1 to n

mapper).

3. Send a mapped event following a probabilistic model (stochastic mapper).

4. Repeat a mapped event several times in order to make the effect stronger in the

receiver module (repetition mapper).

5. Manipulate the time information of the events so that multiple copies of an event

can be transmitted with different delays (delay mapper).

An example of an AER mapper module that is able to apply not only a spatial

address transformation, but also a time transformation in the AER bus traffic can be

found in [57]. In order to implement the example systems presented in this thesis, two

types of mapper modules were developed: an AER scanner and an AER rotator.

Fig. 3.7 shows the representation of the two modules with their respective parameters.

3.3.3.1 AER Scanner

This module has one AER input port and one AER output port. The module has

an internal look up table which stores consecutive pixel addresses of an (x,y) array,

scanned row by row. For each incoming event, regardless of its address, the module

sends out events with consecutive addresses. This way, each incoming event simply

increments the pointer to the following position in the LUT in a circular way. The

implemented module uses the following parameters:

-size1, X dimension of the array address space,

-size2, Y dimension of the array address space,

-timedelay, delay of the asynchronous communication,

-timetoprocess, time required by the module to process each incoming event.

The module has one state vector parameter called prev, storing the x and y coor-

dinates of the last event sent out by the module, and it is initialized with the value

[0, 0].

The workflow of this module is as follows:

1. Use current time, timedelay and timetoprocess parameters to update the in-

coming event (event in) timing information (Treqst and Tack):

new eventin(Treqst)← current time

new eventin(Tack)← current time+ timedelay

2. Update New current time:

New current time← current time+ timedelay + timetoprocess

3. Ignore event coordinates (x, y) and use parameters size1, size2 and state vector

parameter prev to compute the coordinates (xo, yo) of the output event, updating prev:

xo ← prev(1); yo ← prev(2)

yo ← yo + 1

if yo >= size2

    yo ← 0; xo ← xo + 1

if xo >= size1

    xo ← 0

prev ← [xo, yo]

4. Create out with one event, setting its Tprereqst value to New current time

and (xo, yo) as its coordinates:

out = [New current time 0 − 1 xo yo 1]

3.3.3.2 AER Rotator

This module has one AER input port and one AER output port. For each incoming

event it generates an output event with a 0, 90, 180 or 270 rotated address depending

on the value specified in the direction parameter. The parameters in this module are:

-size1, X dimension of the input address space,

-size2, Y dimension of the input address space,

-direction, parameter (integer value from 0 to 3) to set the value of the rotation that

is going to be applied to the incoming event: 0, 90, 180, 270,

-timedelay, delay of the asynchronous communication,

-timetoprocess, time required by the module to process each incoming event.

This module does not have internal state. The operation of this module is as follows:

1. Use current time, timedelay and timetoprocess parameters to update the in-

coming event (event in) timing information (Treqst and Tack):

new eventin(Treqst)← current time

new eventin(Tack)← current time+ timedelay

2. Update New current time:

New current time← current time+ timedelay + timetoprocess

3. Use event in information to get the event coordinates (x, y)

x← event in(3); y ← event in(4)

4. Use parameters size1, size2 and direction to rotate the (x, y) coordinates:

switch direction

case 0: xnew ← x; ynew ← y;

case 90: ynew ← x; xnew ← size2 − y;

case 180: xnew ← size1 − x; ynew ← size2 − y;

case 270: xnew ← y; ynew ← size1 − x;

5. Create an output event out setting its Tprereqst value to New current time and

(xnew, ynew) as its new coordinates:

out = [New current time 0 − 1 xnew ynew 1]

3.3.4 AER Convolution Chip

An important processing module in complex AER-based systems is the 2D convolution

module. This module is meant for frame-free visual information processing. To illus-

trate how event-driven convolution is performed consider the example in Fig. 3.8. Fig.

3.8(a) corresponds to a conventional frame-based convolution, where a 5x5 input static

image f(i, j) is convolved with a 3x3 kernel h(m,n), producing a 5x5 output image

g(i, j). Mathematically, this corresponds to the convolution operation:

g(i, j) = Σ_m Σ_n f(m,n) h(i−m, j−n) (3.9)


In an AER system, shown in Fig. 3.8(b), a luminance retina sensing the same visual

stimulus would produce events for some pixels only (those sensing a non-zero light

intensity). In the figure, the pixel at coordinate (3,3) senses twice as much intensity as

pixels (2,3) and (3,2). Thus the source will produce output events with address (3,3)

at twice the frequency of those for pixels (3,2) and (2,3). These events are sent to a

convolution module.

The convolution module has an internal pixel array where a fixed threshold level is

defined for all pixels and a convolution mask (kernel). Every time an event is received

by the convolution chip, the kernel is added to the array of pixels (which operate as

adders and accumulators) around the pixel having the same event coordinate. This

Figure 3.8: Comparison between (a) classical frame-based and (b) AER event-based convolution processing.

is actually a projection-field operation. Whenever a pixel exceeds the fixed threshold

level, it will generate an output event, and the pixel will be reset. As a consequence,

pixels also act as AER sender pixels, so that an AER convolution module operates

as an AER transceiver (receiver and emitter) module. Note that, in the example in

Fig. 3.8(b), after the four retina events have been received and processed, the result

accumulated in the array of pixels in Fig. 3.8(b) is equal to that in Fig. 3.8(a).
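A behavioural sketch of this per-event operation in MATLAB is given below; it assumes that the projection field of the incoming event fits fully inside the pixel array (the chip clips or discards the other cases, as described in the next subsection), and the names J, K and vth are illustrative:

    function [J, out_events] = conv_event(J, K, x, y, s, vth, t_now)
        % J: pixel-state array, K: kernel, (x,y,s): incoming event, vth: threshold.
        [kx, ky] = size(K);
        cx = floor(kx/2); cy = floor(ky/2);              % kernel centre offsets
        rx = x-cx : x-cx+kx-1;                           % projection-field rows
        ry = y-cy : y-cy+ky-1;                           % projection-field columns
        J(rx, ry) = J(rx, ry) + s*K;                     % add the (signed) kernel
        [fx, fy] = find(abs(J) >= vth);                  % pixels reaching the threshold
        out_events = zeros(numel(fx), 6);
        for k = 1:numel(fx)
            out_events(k, :) = [t_now 0 -1 fx(k) fy(k) sign(J(fx(k), fy(k)))];
            J(fx(k), fy(k)) = 0;                         % reset the firing pixel
        end
    end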

The convolution module implemented in this work emulates a fully digital convo-

lution chip with programmable arbitrary-shape kernels, like the one published in [58]. The

convolution chip is shown in Fig. 3.9. It receives input AER events, which represent

visual information from a previous sensing or processing stage, and generates output

AER events, which represent the result of the convolution operation. The chip includes

a periodic forgetting mechanism which needs to be kept active during absence of input

events.

3.3.4.1 System Level Architecture of the Convolution Chip

The system level architecture of the chip [58] is illustrated in Fig. 3.9, where the

following blocks are shown:

1. Array of 32 x 32 digital pixels.

2. Static RAM that stores the kernel in two’s complement representation.

3. Synchronous controller, which performs the sequencing of all operations for each

input event and the global forgetting mechanism.

4. High-speed clock generator, used by the synchronous controller.

5. Configuration registers that store configuration parameters loaded serially at

startup.

6. A two’s complement block that changes the sign of the kernel data before being

added to the pixels, if the input event is negative.

7. Left/right column shifter, to properly align the stored kernel with the incoming

event coordinates.

8. AER-out, asynchronous circuitry for arbitrating and sending out the output

events.

The operation of the chip is as follows: when the synchronous controller detects a

falling edge in the input Rqst in line, the event address (x, y) and sign at Address in are

latched and the asynchronous handshaking completed. Then the controller, using the

available kernel size information, computes the limits of the projection field with three

different possible results: 1) the projection field fits fully inside the array of pixels, 2)

it can be partially inside the array, or 3) it can be completely outside the array. If the

projection field is outside the array, the controller discards the event and waits for the

next one. However, in any of the other possible situations, the controller calculates the

left/right shift between the RAM columns holding the kernel and the projection field

columns in the pixel array. After this, it enables the addition, row after row, of the

kernel values onto the pixels. Hence, after receiving an input event, the pixels inside the

projection field change their state. If any of them reaches the programmed threshold

Figure 3.9: Convolution Chip 32x32

it resets itself and generates an output event that will be handled by the asynchronous

AER-out block and sent off chip with its corresponding handshaking signals. Parallel

to this per-event processing, there is a global forgetting mechanism common for all the

pixels.

In the asynchronous AER-out block events are arbitrated by rows (for the same row

all request signals are wired-or). Once the row arbiter answers, all the events generated

in this row are latched on the top periphery, freeing the row arbiter. This way, the row-

arbiter can acknowledge the request of another row, while the events of the previous

row are sent out in a burst. The size of the array is 32x32 pixels, but the input address

space it can “see” is larger (128x128). This makes it possible to build arrays of convolution chips to

process larger pixel arrays, programming each one of them to see a part of the address

space by setting some configuration registers. The size of the RAM is 32x32 words of

6 bits in two’s complement representation. In general, since convolution kernels can

have positive or negative values, output events generated by a convolution chip can also

be either positive or negative. In a multilayer system convolution operations can be

cascaded, which implies that a generic convolution chip must be able to handle signed

input events, and produce signed output events. For this reason, the chip includes a

sign bit both for the input and output address events, and also for the values stored in

the kernel RAM (in two’s complement representation). The pixels are able to compute

signed addition and produce positive and negative events. When processing a negative

input event, the controller enables the two’s complement block to invert the kernel

values before being added to the pixels. In the chip the forgetting mechanism is also

handled by the synchronous controller. The aim of this mechanism is that the absolute

state values stored in the pixels are decremented at a programmable rate, so that

they can “forget” their previous state after some controlled time. This functionality is

implemented by a 20-bit counter in the controller which generates a periodic forgetting

pulse for all the pixels every time it reaches the programmed limit. Each forgetting

pulse will decrement by ‘1’ the state of all the pixels with positive state, and increment

by ‘1’ those with negative state. Consequently, the chip implements a (programmable)

constant-rate (or linear) forgetting mechanism.
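A minimal sketch of this forgetting pulse, written in Python purely for illustration (it is not the chip logic), would move every pixel state one step towards zero:

    import numpy as np

    def apply_forgetting_pulse(J):
        """One forgetting pulse: positive states are decremented by 1,
        negative states incremented by 1 (illustrative sketch only)."""
        J = J.copy()
        J[J > 0] -= 1
        J[J < 0] += 1
        return J

    J = np.array([[3, -2], [0, 1]])
    print(apply_forgetting_pulse(J))   # [[ 2 -1] [ 0  0]]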

3.3.4.2 AERST Convolution Module

The convolution module implemented in AERST emulates the convolution chip de-

scribed above [58] and uses the following input parameters:

-size1, X dimension of the input address space,

-size2, Y dimension of the input address space,

-s, matrix containing the kernel values,

-cteloss, forgetting factor. Its value specifies the number of charge units discharged per

second in the pixels belonging to the pixel array,

-threshold, threshold value for the pixels,

-timedelay, delay of the asynchronous communication,

-zs, vector storing the origin of coordinates of the convolution kernel,

-timetoprocess, time required to process each event,


-offset, reset value for the firing pixels,

-trefract, time that a pixel that has fired has to wait before it can fire again,

-option, parameter that chooses the way the output events are created.

Besides the parameters described, the module also has the following state variables:

-J , matrix storing the state values of the pixels,

-time, matrix storing the previous time of modification for every pixel,

-time2, matrix storing the previous time in which the pixel fired an event,

-flags, matrix indicating which pixels have to wait for a trefract period.

A figure showing the convolution module implemented in the tool, with its internal parameters and state variables, can be seen in Fig. 3.10.

The operation of this module is as follows:

1. Use current time, timedelay and timetoprocess parameters to update the in-

coming event (event in) timing information (Treqst and Tack):

eventin(Treqst)← current time

eventin(Tack)← current time+ timedelay

2. Update New current time:

New current time← current time+ timedelay + timetoprocess

3. Use event in information to get the event coordinates (x, y) and sign information

(sign):

x← event in(3); y ← event in(4); sign← event in(5)

4. Use parameter vector zs (origin of coordinates of the projection field), and pa-

rameters size1 and size2 (dimensions X and Y of J) to compute the limits of the

projection field s that fits inside the array of pixels J , which is called kern eff :

kern eff ←part of projection field s overlapping with J

5. Use parameter vector zs, parameters size1 and size2 and (x, y) coordinates of the

incoming event to compute the neurons in J that will be affected by kern eff .

These neurons will be called neur aff

neur aff ←neurons in J affected by kern eff.


Figure 3.10: AER convolution module implemented in the tool

6. Compute the time elapsed since the last change for the affected neurons, using current time and the state array time, and apply the forgetting loss cteloss. Positive and negative states discharge towards ‘0’:

abs(J(neur aff)) ← abs(J(neur aff)) − cteloss ∗ [current time − time(neur aff)]

time(neur aff)← current time

7. For those neurons in J which fired more than trefract ago, set their corresponding value in matrix flags to ‘0’. For this, use matrix time2 and current time:

positions ← find((current time − time2) > trefract)

flags(positions)← 0


8. Apply projection field kern eff to neurons neur aff considering the sign (sign)

of the incoming event:

J(neur aff) = J(neur aff) + kern eff ∗ sign

9. Due to the change in flags and J, some neurons may have reached the parameter

threshold value and will fire events. These neurons will be called neur firing.

Locate these neurons, set their corresponding values in flags to ‘1’ and update

their firing time in time2. Finally, reset the firing positions in J with parameter

offset:

neur firing = find((abs(J) >= threshold)&(flags == 0))

flags(neur firing) = 1

time2(neur firing) = current time

J(neur firing) = offset

10. Use parameter option to select the sign of the newly created events. If parameter option is ‘0’, their sign will be that of the threshold reached in each neuron (positive or negative). If option is ‘1’, their sign will always be 1 (full-wave

rectification).

11. Create the output matrix of events with as many events as neurons firing (neur firing).

Compute their Tprereqst values using current time and timetoprocess to consider

delays between them. Update current time accordingly.
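The following Python/NumPy sketch condenses the eleven steps above into a single per-event update routine. It is an illustrative simplification, not the actual AERST source code: the handshake timing fields are omitted, and the way the kernel origin zs is mapped onto the pixel array is an assumption of this sketch.

    import numpy as np

    def process_event(x, y, sign, t, J, last_update, last_fire, flags,
                      kernel, zs, cteloss, threshold, offset, trefract,
                      option=0):
        """One input event through a simplified convolution module.
        J: pixel states; last_update/last_fire: per-pixel times;
        flags: 1 while a pixel is refractory. Illustrative names only."""
        size2, size1 = J.shape
        kh, kw = kernel.shape
        # projection field limits, clipped to the array (steps 4-5);
        # assumes zs is the kernel origin offset relative to the event
        x0, y0 = x - zs[0], y - zs[1]
        xa, xb = max(x0, 0), min(x0 + kw, size1)
        ya, yb = max(y0, 0), min(y0 + kh, size2)
        if xa >= xb or ya >= yb:
            return []                    # projection field fully outside
        kern_eff = kernel[ya - y0:yb - y0, xa - x0:xb - x0]
        neur = (slice(ya, yb), slice(xa, xb))
        # constant-rate forgetting towards zero, clipped at zero (step 6)
        decay = cteloss * (t - last_update[neur])
        J[neur] = np.sign(J[neur]) * np.maximum(np.abs(J[neur]) - decay, 0)
        last_update[neur] = t
        # release pixels whose refractory period has expired (step 7)
        flags[(t - last_fire) > trefract] = 0
        # add the kernel with the sign of the incoming event (step 8)
        J[neur] += kern_eff * sign
        # firing pixels: above threshold and not refractory (step 9)
        firing = (np.abs(J) >= threshold) & (flags == 0)
        out_sign = np.sign(J[firing]) if option == 0 else np.ones(firing.sum())
        flags[firing] = 1
        last_fire[firing] = t
        J[firing] = offset
        ys, xs = np.nonzero(firing)
        return list(zip(xs, ys, out_sign.astype(int)))   # output events (step 11)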

3.3.5 Projection Module

This module creates a set of output events with the shape of a projection field received

as parameter each time an event is received. The module can use different input ports

to provide different shapes depending on the input port of the incoming event. This

module can be implemented using a multikernel convolution module similar to the one

described in Section 3.3.4, with the kernel variable s set to the desired shape. The threshold is set to ‘0’, so that each incoming event always produces output events coding the shape.


Figure 3.11: AER Integrate and Fire module implemented in the tool

3.3.6 Integrate and Fire Module

This module has several AER input ports, one AER output port and one stored array.

The element adds (or subtracts) a fixed quantity (specified by a parameter called value) in the array position coded by the incoming event. The addition or subtraction depends on the input port from which the event is received. The sign of the operation

is coded by a parameter vector (called oper) with values -1, +1 for every input port.

Each time a certain value (specified by the parameter threshold) is achieved by one

pixel of the array, a new output event is produced and the pixel is reset.

A figure showing the representation of the integrate and fire module implemented

in the tool with the internal parameters and state variables can be seen in Fig. 3.11.

The operation of this module is as follows:

1. Use current time, timedelay and timetoprocess parameters to update the in-

coming event (event in) timing information (Treqst and Tack):

eventin(Treqst)← current time


eventin(Tack)← current time+ timedelay

2. Update New current time:

New current time← current time+ timedelay + timetoprocess

3. Use event in information to get the event coordinates (x, y) and sign information

(sign)

x← event in(3); y ← event in(4); sign← event in(5)

4. Use parameters port in, oper, value and state array J to update the neuron

addressed by the incoming event:

J(x, y) = J(x, y) + value ∗ oper(port in)

5. If the neuron state is higher than threshold, it will fire an event and will be reset

to ‘0’.

if (abs(J(x, y)) > threshold)

xnew ← x; ynew ← y;

J(x, y) = 0

6. Create the new output event with coordinates (xnew, ynew). Use the sign of the

threshold achieved (positive or negative). Use current time and timetoprocess

to compute its Tprereqst value. Update current time accordingly.
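A minimal Python sketch of this per-event update (illustrative only, with the timing bookkeeping omitted) is:

    import numpy as np

    def integrate_and_fire(J, x, y, port_in, oper, value, threshold):
        """Single-event update of an integrate-and-fire element: 'oper'
        holds +1/-1 per input port; when |state| exceeds the threshold,
        an output event is returned and the pixel is reset."""
        J[y, x] += value * oper[port_in]
        if abs(J[y, x]) > threshold:
            sign = 1 if J[y, x] > 0 else -1
            J[y, x] = 0
            return (x, y, sign)      # output event
        return None

    J = np.zeros((16, 16), dtype=int)
    for _ in range(4):
        ev = integrate_and_fire(J, 3, 5, port_in=0, oper=[+1], value=1, threshold=3)
    print(ev)   # (3, 5, 1): the fourth event pushes the state past the threshold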

3.3.7 Rate-Reducer Module

This module has only one input port and one output port. If we configure the module

i&f to have only one input port with value ‘+1’ and we fix the parameters value and

threshold (with threshold > value), the input rate will be decreased at the output by

a factor value/threshold. This means that several events coding one address will be

needed to fire only one output event with the same address. As in the i&f module, an internal stored array is needed.
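For example (illustrative arithmetic only, not the module implementation), configuring the i&f element with value = 1 and threshold = 3 makes roughly one output event appear for every three input events at the same address:

    # rate reduction by a factor value/threshold = 1/3
    value, threshold, state, n_out = 1, 3, 0, 0
    for _ in range(9):            # nine input events at one address
        state += value
        if state >= threshold:    # fire and reset
            n_out += 1
            state = 0
    print(n_out)                  # -> 3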

3.3.8 Self-Exciting Modules

The present simulator is event-driven and consequently updates states of modules only

when events are processed. In between events, it does not update any state, nor does it check for new output events. However, one can think of AER modules which include some


Figure 3.12: AER Self-Exciting Module

type of internal self-exciting capability, such that they could eventually generate output

spikes while no input events have been received for some time. In a clock-driven simula-

tor, one would just describe such modules through differential equations updated after a given time step. This time step can be either fixed, or made to change dynamically

according to the actual effective time constant. The modules considered so far are

not described, in general, by a set of clock-driven differential equations, because the

module states are only updated when they receive input events. However, it is possible

to describe a module by a set of differential equations. With the AERST event-driven

simulator, the idea is to add a “dummy channel” as shown in Fig. 3.12, whose in-

put and output connect only to the AER module described internally with differential

equations. Then, the module should be described in such a way that it will put a

future event in the dummy channel at the time at which the differential equations need

to be updated. This will depend on the time step used by the differential equations

algorithm.
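A minimal Python sketch of this “dummy channel” mechanism, with an assumed event-queue format and a toy leaky differential equation (not the AERST implementation), is the following:

    import heapq

    def schedule_dummy_update(event_queue, t_now, dt, module_id):
        """Post a future event to the module itself so the event-driven
        scheduler wakes it up when its equation must be integrated."""
        heapq.heappush(event_queue, (t_now + dt, module_id, "dummy"))

    def on_event(event_queue, t, module_id, state, dt, tau):
        # integrate a simple leaky state dx/dt = -x/tau over one step
        state = state + dt * (-state / tau)
        schedule_dummy_update(event_queue, t, dt, module_id)   # next wake-up
        return state

    q, state = [], 1.0
    schedule_dummy_update(q, 0.0, dt=1e-3, module_id=0)
    for _ in range(3):
        t, mid, _ = heapq.heappop(q)
        state = on_event(q, t, mid, state, dt=1e-3, tau=10e-3)
    print(round(state, 4))   # state decayed over three 1 ms steps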

3.4 Validation of the AER Tool

To validate the simulator, we have implemented simulations of two AER systems

that had been previously built in hardware [50][17]. All the parameters describing

the modules such as thresholds, forgetting ratio, kernel values, delays, array sizes, etc.


Figure 3.13: Block diagram of the AER system developed to simulate the hardware implementation

have been chosen according to the specifications of the corresponding AER hardware

devices. These two example systems are described next.

3.4.1 Detection and tracking of moving circles of given radius

The first example simulates part of the demonstration system in the CAVIAR project

[50]. This system could track a circular object of a given size. A block diagram of the

complete system is shown in Fig. 3.4.

The complete chain consisted of 17 AER modules. The AER block diagram that we

have used to emulate the CAVIAR system is shown in Fig. 3.13. The system receives as

input events recorded previously by the electronic retina when watching the movement

of a rotating disc with two solid circles of different radii. These events are sent to a

convolution module with a kernel tuned to detect a circumference of a certain radius

(Fig. 3.14 (a)).

The positive output events of the convolution module follow the center of the tar-

get circumference. These events are sent to a winner-takes-all module (WTA). We

implement the WTA module by using a two-input merger module together with a

convolution module. The convolution module is programmed with a kernel which is

positive in the center and negative in the rest of positions (Fig. 3.14 (b)). We use

the output activity from the convolution chip as feedback to the merger's second input port. Due to the feedback in the winner-takes-all, and to the convolution kernel being positive only at the central point, the output activity of the WTA module responds only to the


Figure 3.14: a) Kernel to detect a circumference of a certain radius. b) Kernel used in the WTA module.

Figure 3.15: Winner-Takes-All module

incoming addresses having the highest activity.

A representation of the winner-takes-all created this way can be seen in Fig. 3.15.
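A small Python sketch of the kind of kernel used inside this WTA loop is shown below; the size and weight values are illustrative, not the values programmed in the CAVIAR hardware:

    import numpy as np

    def wta_feedback_kernel(size=7, center=8, surround=-1):
        """Kernel for the WTA feedback path: positive only at the central
        position and negative everywhere else, so fed-back activity
        suppresses all competitors (illustrative values)."""
        k = np.full((size, size), surround, dtype=int)
        k[size // 2, size // 2] = center
        return k

    print(wta_feedback_kernel(5))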

The disc with the two circles rotates at a speed of approximately 0.28 rev/sec. In Fig. 3.16 the four images on the left represent the images reconstructed with the hardware implementation (images were obtained with the jAER tool [59]). Each 2D image is obtained by collecting events during 33ms. Gray values correspond to no activity. Black values correspond to changes in intensity sensed by the motion-sensing retina at the input, and the white levels in the bottom figures correspond to the pixels detecting the center of the moving ball at the output. The four images on the right correspond to

the images obtained using the C++ version of the simulator AERST with the same

input stimulus.


Figure 3.16: On the left, input and output obtained with the hardware implementation. On the right, input and output obtained with the simulated implementation.

3.4.2 Recognition of high speed Rotating Propellers

The second experiment demonstrates the high-speed processing capabilities of AER

based systems. It is the recognition and tracking of a high-speed S-shaped propeller rotating at 5000 rev/sec [17] while moving across the screen. At this speed, a human

observer would not be able to discriminate the propeller shape and would only see a

moving circle across the screen. The propeller has a diameter of 16 pixels. The AER

simulated system is again the one shown in Fig. 3.13. This time, the convolution chip

was programmed with a kernel to detect the center of the S-shaped propeller when it

is in the horizontal position. Fig. 3.17 (a) shows the kernel. Fig. 3.17 (b) and (c)

show the 2-D input (propeller) and output reconstructed images by collecting events

during a 50µs interval (1/4 of a rotating movement). Fig. 3.17 (d) and (e) show

the 2-D input and output images reconstructed by collecting events during a 200ms

interval (corresponding to one complete back-and-forth screen crossing). As can be

seen, only those pixels detecting the center of the propeller produce output activity.

The propeller is properly detected and tracked at any instant in real time. Note that

using conventional frame-based image processing methods to discriminate the propeller

is a complicated task, which requires a high computational load. First, images must be acquired with an exposure time of no more than 20µs (one tenth of a rotation) and, second, recognition must also be achieved in less than one rotation, 200µs.


Figure 3.17: a) Kernel used to detect the propeller, b) and c) input and output when we collect events during 50µs, d) and e) input and output when we collect events during 200ms.


Chapter 4

MULTI-CHIP MULTI-LAYER CONVOLUTION PROCESSING FOR CHARACTER RECOGNITION

The system reported in this chapter is a simplification of Fukushima’s Neocognitron [2].

First we will briefly describe the network originally proposed by Fukushima and then

we will explain how a simplification of this system was implemented using AER-based

modules.

4.1 Fukushima’s Neocognitron

Fukushima’s Neocognitron is a hierarchical network consisting of several layers of

neuron-like cells. There are forward connections between cells in adjoining layers.

Some of these connections are variable, and can be modified by learning. The neocog-

nitron can acquire the ability to recognize patterns by learning, and can be trained to

recognize any set of patterns. Since it has a large power of generalization, presentation

of only a few typical examples of deformed patterns (or features) is enough for the

learning. It is not necessary to present all the deformed versions of the patterns which

might appear in the future. After learning, it can recognize input patterns robustly,

with little effect from deformation, changes in size, or shifts in position. In contrast to


Figure 4.1: A typical architecture of the neocognitron network.

most conventional pattern recognition systems, it does not require any preprocessing

such as normalizing the position, size, or deformation of input patterns.

Fig 4.1 shows a typical architecture of the neocognitron network. The lowest stage

is the input layer consisting of a two-dimensional array of cells, which corresponds to

photoreceptors of the retina. There are retinotopically ordered connections between

cells of adjoining layers. Each cell receives input connections that lead from cells

situated in a limited area on the preceding layer. Layers of “S-cells” and “C-cells” are

arranged alternately in the hierarchical network. In the network shown in Fig 4.1, a

contrast-extracting layer is inserted between the input layer and the S-cell layer of the

first stage. S-cells work as feature-extracting cells. They resemble simple cells of the

primary visual cortex in their response. Their input connections are variable and can

be modified through learning. Each S-cell responds selectively to a particular feature

presented in its receptive field (the receptive field is a portion of sensory space that

can elicit neuronal responses when stimulated). The features extracted by S-cells are

determined during the learning process. Generally speaking, local features, such as

edges or lines in particular orientations, are extracted in lower stages. More global

features, such as parts of learning patterns, are extracted in higher stages.

C-cells, which resemble complex cells in the visual cortex, are inserted in the network

to allow for positional errors in the features of the stimulus. The input connections


of C-cells, which come from S-cells of the preceding layer, are fixed and invariable.

Each C-cell receives excitatory input connections from a group of S-cells that extract

the same feature, but from slightly different positions. The C-cell responds if at least

one of these S-cells yields an output. Even if the stimulus feature shifts in position

and another S-cell comes to respond instead of the first one, the same C-cell keeps

responding. Thus, the C-cell’s response is less sensitive to shift in position of the

input pattern. We can also say that C-cells perform a blurring operation, because the

response of a layer of S-cells is spatially blurred in the response of the succeeding layer

of C-cells.

Each layer of S-cells or C-cells is divided into sub-layers, called “cell-planes” (feature

maps), according to the features to which the cells respond. The cells in each cell-plane

are arranged in a two-dimensional array. A cell-plane is a group of cells that share the

same set of connections. As a result, all the cells in a cell-plane have receptive fields

of an identical characteristic, but the locations of the receptive fields differ from cell

to cell. The modification of variable connections during the learning progresses also

under the restriction of shared connections. In the whole network, with its alternate

layers of S-cells and C-cells, the process of feature-extraction by S-cells and toleration

of positional shift by C-cells is repeated. During this process, local features extracted in

lower stages are gradually integrated into more global features, as illustrated in Fig 4.2.

Since small amounts of positional errors of local features are absorbed by the blurring

operation by C-cells, an S-cell in a higher stage comes to respond robustly to a specific

feature even if the feature is slightly deformed or shifted.

Thus, tolerating small positional errors at each stage, rather than a considerable positional error in a single step, plays an important role in endowing the network

with the ability to recognize even distorted patterns.

C-cells in the highest stage work as recognition cells, which indicate the result of the

pattern recognition. Each C-cell of the recognition layer at the highest stage integrates

all the information of the input pattern, and responds only to one specific pattern.

Since errors in the relative position of local features are tolerated in the process of

extracting and integrating features, the same C-cell responds in the recognition layer

at the highest stage, even if the input pattern is deformed, changed in size, or shifted in

position. In other words, after having finished learning, the neocognitron can recognize


Figure 4.2: The process of pattern recognition in the neocognitron. The lower half of the figure is an enlarged illustration of a part of the network.

input patterns robustly, with little effect from deformation, change in size, or shift in

position.

4.2 AER-based system for Character Recognition

We have adapted the original structure of the Neocognitron so that it can distinguish

between characters ‘A’, ‘B’, ‘C’, ‘H’, ‘L’, ‘M’ and ‘T’. It is based on AER and makes use

of the programmable kernel AER convolution chip proposed by Serrano-Gotarredona

et al. [17][50][39]. As shown in Fig 4.3, it receives an input visual stimulus (of 16


Figure 4.3: Character recognition system based on AER

x 16 pixels), which can be one of the previous characters, and it can tolerate slight

deformations. Each active pixel of the 16 x 16 input stimulus will fire ten events, and

the rest of the pixels will not fire. Input events will be separated by 50ns. In this way, the

complete input stimulus, which has around 30 active pixels, will be transmitted in

about 15µs.
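As an illustration (not the recorded stimuli actually used), the following Python sketch converts a binary 16 x 16 character into such an event list:

    import numpy as np

    def image_to_events(img, events_per_pixel=10, dt_ns=50):
        """Turn a binary 16x16 character into an AER event list
        (x, y, sign, t_ns): every active pixel fires 'events_per_pixel'
        events, with consecutive events separated by dt_ns.
        The interleaving order is an assumption of this sketch."""
        ys, xs = np.nonzero(img)
        events, t = [], 0
        for _ in range(events_per_pixel):
            for x, y in zip(xs, ys):
                events.append((int(x), int(y), +1, t))
                t += dt_ns
        return events

    img = np.zeros((16, 16), dtype=int)
    img[4, 3:13] = 1                       # a horizontal stroke, 10 active pixels
    evs = image_to_events(img)
    print(len(evs), evs[-1][3] / 1e3)      # 100 events spanning about 5 us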

The first processing layer can be considered as a layer of “S-cells” and performs

17 convolutions in parallel for feature extraction with convolution masks (also called

kernels) ki (i = 1, ..., 17). Kernels have positive and negative values. Therefore, convo-

lution outputs would include both positive and negative events. The kernels are shown

in Fig 4.4 normalised from ‘-1’ to ‘1’. Black pixels (value ‘-1’) correspond to the most

negative value in each kernel and white pixels (value ‘1’) correspond to the highest

positive value. The cross in the kernels of Fig 4.4 marks the origin of coordinates. As explained

in the previous chapter, the convolution modules work by adding a convolution mask onto a matrix of pixels around the address coordinate specified by the incoming events. When

a pixel in the matrix of neurons belonging to one of the convolution modules reaches

a configurable threshold value, it will reset itself and generate an output event, which

will be sent out of the convolution module. In the system of Fig 4.3, each convolu-

tion module is configured not to send out any negative event¹. Only positive events


Figure 4.4: Kernels used in the first layer for feature detection. The red cross indicates the origin of coordinates of the kernel when it is projected onto the pixel array.

will be transmitted. Consequently, each convolution module will compute a half-wave

rectification after the convolution operation.

Each kernel in layer ‘1’ is intended to detect discriminatory features that help to

identify the characters. Kernel k1 is intended to detect the presence and position of

the upper peak in letter ‘A’. Kernel k2 detects a horizontal segment ending on the left

and touching a vertical segment. Kernel k3 detects a horizontal segment ending on

the right and touching a vertical segment. Kernel k4 detects a vertical segment ending

on the top and touching a horizontal segment. Kernel k5 detects the bottom end of

a vertical segment. Kernel k6 detects the top end of a vertical segment. Kernel k7

detects the left end of a horizontal segment, kernel k8 the same but for the right end.

Kernel k9 is intended to detect the upper curvature of letter ‘C’ and kernel k10 the

same but for the lower one. Kernel k11 detects a horizontal segment and kernel k12 the

same but for a vertical one. Kernel k13 is intended to detect the central crossing point

between the two inclined segments of letter ‘M’, kernel k14 detects the crossing point

between the two right curves in letter ‘B’. Kernel k15 detects the upper left peak in letter ‘M’,

kernel k16 the same, but on the right and finally, kernel k17 detects the crossing point

¹Events in AER-based systems can have positive or negative sign. However, our convolution modules are configured so that if a pixel produces a negative output event, the pixel is reset, but the event is not transmitted out of the chip.


between the horizontal and vertical segments in letter ‘L’. Consequently, the first layer

of convolutions is intended to detect a set of 17 geometrical features which can be used

to detect and discriminate between the different letters. As shown in Fig. 4.3, each

kernel ki (i = 1, ..., 17) produces activity on channel ci (i = 1, ..., 17). Consequently,

letter ‘A’ should produce activity at outputs {c1, c2, c3, c5, c11}, letter ‘B’ at {c2,

c11, c12, c14, c17}, letter ‘C’ at {c8, c9, c10, c11, c12}, letter ‘H’ at {c2, c3, c5, c6,

c11, c12}, letter ‘L’ at {c6, c8, c11, c12, c17}, letter ‘M’ at {c5, c12, c13, c15, c16} and letter ‘T’ at {c4, c5, c7, c8, c11, c12}.

The second layer of convolution processing can be considered as a layer of “C-cells”

and performs 17 convolutions in parallel. Each of these convolution chips uses one of

six different convolution masks shown on the right of Fig 4.5 adjusted to the range ‘-1’

to ‘1’. The kernel that each filter pi uses and its origin of coordinates in the pixel array

are shown in Table 4.1. As shown in Fig. 4.3, each kernel pi (i = 1, ..., 17) produces

activity on channel di (i = 1, ..., 17). This layer is intended to evaluate whether the

spatial distribution of features detected in the first layer is meaningful for the character

to be detected. For example, for letter ‘A’, the top peak (detected by k1 in the first

layer) should be in the upper part above all other features. Consequently, filter p1 will

produce a positive contribution in the region below the peak, because this would be the

place in output d1 where the center of letter ‘A’ would be if all its features are detected

simultaneously. In a similar manner, if there is output activity at c2, the center of ‘A’

should be to the right. Therefore, filter p2 will add contribution to the pixels in d2

which are to the right of those which fired in c2. The output at c3 has to be treated symmetrically to the one for c2. Kernel k5 places events at c5 if the bottom end of a

vertical segment is detected. This means that the center of letter ‘A’ is above, either

to the right or to the left. This spatial weighting is performed by filter p5. Finally, if

there is output at c11, the center of ‘A’ should be in the same position, as kernel k11

is intended to detect horizontal segments. Therefore, filter p11 will add contribution to the pixels in d11 which are at the same position as those which fired in c11. In this way,

when the input is letter ‘A’ the activity at {c1, c2, c3, c5, c11} will be on different

pixels. However, the activity at {d1, d2, d3, d5, d11} would be around the center of

letter ‘A’. In this way, layer ‘2’ joins the meaningful features for each character in the

central region of the corresponding character. Something similar will occur with the rest of the letters, whose centers will also be identified with the respective outputs obtained


                      CENTRAL COORDINATE
FILTER   KERNEL   X-COORD   Y-COORD
p1       f1        10         9
p2       f4         8         7
p3       f5         8         2
p4       f1         9         9
p5       f3        -3         9
p6       f3        10         8
p7       f6        10         7
p8       f6        10         0
p9       f1         7        11
p10      f1         0        11
p11      f1        -2         8
p12      f1         4        12
p13      f1         5         9
p14      f1         3         7
p15      f2        10        13
p16      f2        10         5
p17      f1        -2        12

Table 4.1: Origin of Coordinates for Kernels in Layer 2

from filters in layer ‘2’, which will perform the spatial weighting needed for each of the

different features extracted in the first layer.

This layer is intended to implement in some way the blurring operation of C-cells,

as the response of the previous layer of S-cells (first layer) is spatially blurred and the

detected features corresponding to one character are copied near the center of the character at this stage.

The purpose of the third layer is to combine the outputs of the second layer with positive or negative weights. For example, for letter ‘A’ outputs {d1, d2, d3, d5, d11} should contribute positively, while outputs {d4, d6, d7, d8, d9, d10, d12, d13, d14,

d15, d16, d17} should inhibit. The same will occur with the rest of letters. In this

layer, all outputs d1 − d17 from the second layer are split (blocks Sp in Fig 4.3)

into seven separate pathways with seven independent 17-input merger blocks (block

M in Fig 4.3), each one to detect one of the characters. Only positive events come

out at outputs d1 − d17. However, the sign bits are hardwired at the inputs of the

merger blocks, with positive sign if the events contribute positively or negative sign if

the events contribute negatively.

To implement the operation of addition or subtraction between the input channels,

the merger blocks sequence the events coming from their seventeen input channels, and


Figure 4.5: Kernels used in the second layer for spatial weighting. The kernel C at the bottom is used by a single convolution chip to detect whether the events coming from the previous layer are more or less clustered together.

feed them to a convolution chip programmed with a 1x1 kernel with weight ‘1’ (block U in Fig 4.3). The convolution chip parameters are set so that three input events are necessary to produce an output event for one pixel. Three events is considered a value high enough to implement the operation of addition between the channels efficiently.

Besides, the value is low enough to speed up the response of the system. Finally, the

fourth layer consists of one single convolution chip for each character path (block C in

Fig 4.3), which will detect whether the events coming from the previous layer are more

or less clustered together, rather than spread over the pixel array. If they are clustered

(in the center of the character), it means the character has been detected. The kernel,

normalised to ‘1’, is shown on the bottom of Fig 4.5. The ‘x’ in the kernel C of Fig 4.5

shows the origin of coordinates of the kernel.

Note that if this system is built with AER hardware modules, all this processing is done in parallel and in real time, with events sent from layer to layer with nanosecond delays. On the other hand, in a frame-based system, we would have to 1) receive the


Figure 4.6: Letters used for testing the system based on AER for character recognition.

complete set of pixel values belonging to the letter, 2) process the letter image

sequentially with all the convolution masks of the first layer, 3) process the resulting

filtered images sequentially with all the convolution masks of the second layer, 4) add

or subtract the corresponding filtered images, and 5) process the seven resulting images

from the third layer (each one corresponding to one letter) with the convolution mask

described in the fourth layer.

4.3 Experimental Results

The multi-chip (68 convolution modules) multi-layer (4 layers) system described above

has been tested using three slightly modified versions for each one of the seven charac-

ters proposed. The twenty-one characters are shown together in Fig 4.6. The results

obtained after the simulations of the system described in the previous Section are shown

in Table 4.2.

In Table 4.2, we show the duration of the input stimulus (stimulus time, T1), the

time when the first event corresponding to each character is obtained at the system

output (time first output, T2), the difference between these times (T2− T1), the time

when the last event corresponding to each character is obtained at the system output

(time last output) and the retrieval accuracy achieved for each of the letters (retrieval

accuracy). Finally, we also show the number of events obtained at each of the seven

output channels for each of the characters.

The output events generated at the different convolution outputs {c1, c2, c3, c5,

c11, d1, d2, d3, d5, d11, dA, dB, dC, dH, dL, dM, dT, fA, fB, fC, fH, fL, fM, fT} are shown in Fig. 4.7, for the case of input stimulus ‘A’. The specific timing of the


Table 4.2: Timing and accuracy obtained for each of the letters

events can be seen in Fig 4.8. The vertical axes indicate pixel numbers (from 0 to 255)

in the 16x16 pixel arrays, while the horizontal axes represent time in µs (from 0 to

40µs). The specific timing of the input and output event bursts for the first version of

each character can be seen in Fig 4.9. As Table 4.2 indicates, in all cases, the system

is capable of detecting which letter is present less than 9.3µs after the first input

stimulus event is received by the system. This delay is even smaller than the average

duration of the input stimulus spike burst (12.4µs). Consequently, on average the

system is able to recognize the letter before processing all the input spikes. In any case,

the recognition speed of such a system (it should be noted that it tolerates a certain degree of letter deformation and scaling) is unprecedented (as can be seen in Table 4.2, the average time for detecting each letter is only 9.31µs, which is equivalent to processing over 100,000 images per second). In a frame-by-frame based system, we would

have to wait for the frame-time to recover all the pixel values of the character under

recognition, and after that, we would have to process the entire image sequentially with

the 68 convolution modules described. If we suppose that a scheme using 25 frames per

second is used, we would always have the limitation of 40ms for processing each letter

(note that this value is computed without considering the post-processing time due to

the convolution modules). We believe that this technique for bio-inspired AER-based

vision processing is very promising.


Figure 4.7: The output events generated at the different convolution outputs {c1, c2, c3, c5, c11, d1, d2, d3, d5, d11, dA, dB, dC, dH, dL, dM, dT, fA, fB, fC, fH, fL, fM, fT} for the case of input stimulus ‘A’.

One limitation that can be argued against this vision processing technique is that for real-size images many more events need to be processed. This is certainly true. However, our experience is that if one uses appropriate input sensors, like retinae that directly sense motion [15] or contrast [41][60][16] instead of image intensity [32], then the flow of events is kept at a reasonable event rate (below 1 Meps for arrays of 128x128 pixels

[15]).

4.4 Discussion

In this Chapter, we have implemented a system with four layers for character recog-

nition. It can distinguish between characters ‘A’, ‘B’, ‘C’, ‘H’, ‘L’, ‘M’ and ‘T’. The

system is based on AER and makes use of the programmable kernel AER convolution

chip proposed by Serrano-Gotarredona et al. [17][50][39].


Figure 4.8: Events obtained in the system at outputs {c1, c2, c3, c5, c11, d1, d2, d3, d5, d11, dA, fA} when the input is letter ‘A’. Time is expressed in µs.

The character catalog used in the application could be expanded by simply sending the outputs of layer ‘3’ to new merger modules. In a hardware implementation, besides the convolution modules, one also requires splitters and/or merger blocks, and possibly some extra mappers. In the future, as AER processors become more sophisticated,

we expect to be able to fit in a single chip several convolution arrays together with

splitters/mergers and mappers. In the example of Fig 4.3, all convolution chips are of

size 16x16.

The system could be conceived to include learning, so that an external supervisor

trains it and updates the weights dynamically, optimizing final performance [61].

The results shown in Table 4.2 indicate some of the clear advantages of using AER


Figure 4.9: Events obtained in the system at input and output channels for the first version of each of the letters.

rather than using the classical image processing methods based on frames. Some of

these advantages regarding this application can be summarized as follows:

1. We do not need to wait for the complete image frame to start processing in any of the layers. On the contrary, as soon as the first event is received in the first layer of the system, new output events are produced and can be processed by the following layers. However, if we used the classical frame-based methods, we would have to process the input image frame completely and sequentially with each one of the convolution modules. Moreover, we could not start processing in one module until the image had been completely processed by the previous one. Note that

in the AER implementation we require less than 9.31µs to detect the character

under recognition.

2. We do not need to collect all the output events in layer ‘3’ to identify the character

under analysis in the system. With only the first output events from layer ‘3’, we

are able to finish the recognition task.


3. It is possible to add new modules in parallel in each of the layers. This will allow us to analyze more features in the characters under recognition, improving the results without increasing the computational cost.


Chapter 5

IMPLEMENTATION OF TEXTURE RETRIEVAL USING AER-BASED SYSTEMS

5.1 Introduction

To illustrate the processing power of AER-based systems, we have developed an AER

architecture example of a sophisticated image processing application of content-based

image retrieval. Content-based image retrieval is emerging as an important research

area with applications in digital libraries and multimedia databases. An image can be

considered as a mosaic of different texture regions, and the image features associated

with these regions can be exploited for search and retrieval.

5.2 State of the art in texture recognition

Texture analysis has a long history, and a very large number of algorithms for texture characterization have been developed in recent decades. The commonly used methods

for texture characterization can be divided into three categories: statistical, model-

based, and filtering approaches [62]. Statistical methods, such as cooccurrence features

[63][64], analyze the spatial distribution of gray values, by computing local features

at each point in the image, and deriving a set of statistics from the distributions of

the local features. Model-based methods such as Markov random field (MRF) [65]

and simultaneous autoregressive (SAR) models [66] provide a description of texture in


terms of spatial interaction. Most of the statistical and model-based approaches for

texture classification consider spatial interactions over relatively small neighborhoods.

Therefore, these approaches are better suited only to microtextures [67], [68]. Filter-

ing approaches including wavelet [69], [70], Gabor filters [67], [71], steerable pyramid

[72], and directional filter bank (DFB) [73], [74] characterize textures in the frequency

domain. Among the three categories, MPEG-7 has adopted Gabor-like filtering for tex-

ture description [75]. The rationale behind this is that the visual cortex is sensitive to localized frequency components [76]. It has been shown that the direction together with scale in-

formation is important for texture perception. In the last decade, researchers have been

combining different methods in order to provide a better classification and retrieval of

images. Fusion of different types of texture features can be found in the literature

[77]-[80]. A comprehensive performance evaluation on filtering (i.e., spectral-based)

methods for texture classification is presented in [62], which suggests that no single

set of features derived from filtering approaches has consistent superior performances

on all textures. Other comparative studies about all these methods can be found in

[81]-[83]. In [84], two fast algorithms for multiscale directional filter banks (MDFB) are

proposed. These two algorithms are compared with the previous algorithm for MDFB

proposed in [85] and with the contourlet transform [86], [87] in terms of time of feature

extraction (FE) and total computational time. In [88], a texture representation suit-

able for recognizing images of textured surfaces under a wide range of transformations,

including viewpoint changes and nonrigid deformations is presented. At the feature

extraction stage, a sparse set of affine Harris and Laplacian regions is found in the im-

age. Each of these regions can be thought of as a texture element having an elliptic-shape

characteristic and a distinctive appearance pattern. Using the Brodatz database [120],

the approach achieves a maximum average retrieval rate (ARR, see definition in the

Experimental Results Section) of 76.26% when combined Harris and Laplacian descrip-

tor channels are used. In [89], a linear family of filters is introduced, which provides

certain scale invariance, resulting in a texture description invariant to local changes in

orientation, contrast and scale, and robust to local skew. Then, a texture discrimina-

tion method based on the χ2 similarity measure is applied to the histograms derived

from the filter responses. This approach achieves a maximum average retrieval rate of

78.5% when it is tested using the Brodatz database [120]. In [90], the authors propose

an approach for rotation-invariant texture image retrieval by using a set of dual-tree


rotated complex wavelet filter (DT-RCWF) and DT complex wavelet transform (DT-

CWT) jointly. They make a comparison of average retrieval accuracy using standard

real DWT (discrete wavelet transform), DT-CWT and a combination of DT-CWT and

DT-RCWF. In [74], rotation-invariant and scale-invariant Gabor representations are

proposed, where each representation only requires few summations on the conventional

Gabor filter impulse responses. The results show that the new implementations behave

better than the conventional Gabor-based scheme when rotated or scaled images are

considered. However, a conventional Gabor-based scheme provides better results when

no rotation or scaling is considered.

In [85], an MDFB is first proposed and it is compared with the Gabor filters in polar

form [71] and steerable pyramid [91] in terms of retrieval accuracy. In [92], fractal-code

signatures are proposed for texture-based retrieval of images. Fractal image coding

is a block-based scheme that exploits the self-similarity hiding within an image. By

combining fractal parameters and collage error, a set of statistical fractal signatures

is proposed. In [93], image signatures constructed from the bit planes of wavelet sub-

bands are presented [bit plane signature (BP) and three-pass layer probability (TPLP)

signature]. As can be observed, the method that provides the highest ARR is filter

based and is the combination of DT-CWT and DT-RCWF implemented by Kokare et

al. [90].

5.3 AER implementation for texture retrieval

To illustrate the potential of the AER technique, we have implemented a multireso-

lution representation based on Gabor filters using the filter-based method proposed

by Manjunath [94]. This example adapts a known frame-based image processing al-

gorithm to the AER frame-less vision processing philosophy. The use of Gabor filters

in extracting textured image features is motivated by various factors. The Gabor

representation has been shown to be optimal in the sense of minimizing the joint two-dimensional uncertainty in space and frequency [95]. These filters can be considered as

orientation and scale tunable edge and line (bar) detectors, and the statistics of these

microfeatures in a given region are often used to characterize the underlying texture

information. Gabor features have been used in several image analysis applications in-

cluding texture classification and segmentation [96]-[97], image recognition [98]-[100],


object recognition [1], image registration, medical applications [101][102] and motion

tracking [1], and it has been demonstrated that using the Brodatz texture database,

the Gabor features provide a very good pattern retrieval accuracy. Furthermore, since

Hubel and Wiesel’s [103] discovery of the crystalline organization of the primary visual

cortex in mammalian brains some thirty years ago, an enormous amount of experimen-

tal and theoretical research has greatly advanced our understanding of this area and

the response properties of cortex cells. On the theoretical side, an important insight

has been advanced by Marcelja [104] and Daugman [105][106], who suggest that simple

cells in the visual cortex can be modeled by Gabor functions. The 2D Gabor functions

proposed by Daugman are local spatial bandpass filters that achieve the theoretical

limit for conjoint resolution of information in the 2D spatial and 2D Fourier domains.

For all these reasons, we will exploit Gabor wavelets for the texture based retrieval of

image data. The focus of this AER convolution processing application is on the image

processing aspects of the texture based retrieval processes. We have developed an AER

architecture to obtain Manjunath’s Gabor wavelet features for texture analysis [94] and

provide a comprehensive experimental evaluation. These features are still today being

widely used in many applications [107]-[116]. By performing texture analysis using Ga-

bor filters at different scales and orientations, these patterns can be efficiently described

in the frequency domain and localized in the spatial domain. Next we summarize the

sequence of computations performed in Manjunath’s method, and indicate how we have

adapted them for an AER hardware system.

5.3.1 Frame-based implementation for texture retrieval

In the method proposed by Manjunath [94], texture is analysed by applying a bank

of scale and orientation Gabor filters to an image. A two-dimensional (2-D) Gabor

function g(x,y) can be constructed as:

g(x, y) = (1/(2π σ_x σ_y)) exp[ −(1/2)( x²/σ_x² + y²/σ_y² ) + 2πjWx ]    (5.1)

where σx , σy, and W are its characteristic geometrical parameters. A class of self-

similar functions referred to as Gabor wavelets is now considered. Let g(x,y) be the


mother wavelet. A Gabor filter bank can be obtained by appropriate dilations and rotations of g(x,y) through the generating function

g_{s,k}(x′, y′) = a^{−s} g(x, y)    (5.2)

x′ = a^{−s}( x cos θ + y sin θ )    (5.3)

y′ = a^{−s}( −x sin θ + y cos θ )    (5.4)

where θ = kπ/K is the orientation of the filter with respect to the vertical, k ∈ {0, ..., K − 1} is the orientation index, and s ∈ {0, ..., S − 1} is the scale index. K is the total number

of orientations, and S is the total number of scales in the filter bank. The filter bank

parameters σx, σy, a, θ,W are computed by Manjunath’s method [94], given the input

specifications S, K, and the upper and lower center frequencies of the filters Uh and

Ul. Given an image I (x,y), its Gabor wavelet transform is then defined as

W_mn(x, y) = ∫ I(x1, y1) g*_mn(x − x1, y − y1) dx1 dy1    (5.5)

where * indicates the complex conjugate. It is assumed that the local texture regions

are spatially homogeneous, and the mean µmn and the standard deviation σmn of the

magnitude of the transform coefficients are used to represent the region for classification

and retrieval purposes:

µ_mn = ∫∫ |W_mn(x, y)| dx dy    (5.6)

σ_mn = sqrt( ∫∫ ( |W_mn(x, y)| − µ_mn )² dx dy )    (5.7)

As we will see below, in our AER implementation we will not compute σmn as given in

eq. 5.7, but

S_mn = sqrt( ∫∫ | |W_mn(x, y)| − µ_mn | dx dy )    (5.8)


without any degradation in performance. A feature vector is now constructed using

µmn and σmn as feature components. In the experiments, we use four scales S = 4 and

six orientations K = 6, resulting in a feature vector:

FE = [µ_11, σ_11, µ_12, σ_12, ..., µ_46, σ_46] = [µ_mn, σ_mn], m = 1, ..., 4; n = 1, ..., 6    (5.9)
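As an illustration of eqs. (5.1)-(5.4), the following Python sketch generates the real part of one rotated and dilated Gabor kernel, of the kind loaded into the convolution modules described in the next subsection. The kernel size and the numeric values of σ_x, σ_y, W and a are arbitrary placeholders, not the parameters produced by Manjunath's design procedure.

    import numpy as np

    def gabor_kernel_real(size, sigma_x, sigma_y, W, a, s, k, K):
        """Real part of the scaled/rotated Gabor kernel g_{s,k}
        of eqs. (5.1)-(5.4); all parameter values are illustrative."""
        theta = k * np.pi / K
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
        # rotate and dilate the coordinates (eqs. 5.3-5.4)
        xp = a ** (-s) * (x * np.cos(theta) + y * np.sin(theta))
        yp = a ** (-s) * (-x * np.sin(theta) + y * np.cos(theta))
        # mother Gabor function (eq. 5.1), keeping only the real part
        g = (1.0 / (2 * np.pi * sigma_x * sigma_y)) * \
            np.exp(-0.5 * (xp ** 2 / sigma_x ** 2 + yp ** 2 / sigma_y ** 2)) * \
            np.cos(2 * np.pi * W * xp)
        return a ** (-s) * g

    # e.g. a 15x15 kernel for scale s=1 and orientation k=2 out of K=6
    kern = gabor_kernel_real(15, sigma_x=2.0, sigma_y=2.0, W=0.4, a=2.0, s=1, k=2, K=6)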

Consider two image patterns i and j, and let FEi and FEj represent the corresponding

feature vectors. The distance between the two patterns in the feature space is then

defined as

d(i, j) = Σ_m Σ_n d_mn(i, j)    (5.10)

where

d_mn(i, j) = | ( µ_mn^(i) − µ_mn^(j) ) / α(µ_mn) | + | ( σ_mn^(i) − σ_mn^(j) ) / α(σ_mn) |    (5.11)

with α(µmn) and α(σmn) being the standard deviations of the respective features over

the entire database. They are used to normalize the individual feature components.

For database texture retrieval, the feature vector FEi of a new input image is compared

with a precomputed database of feature vectors FEj . Computation of d(i, j) is fast

and can be done using simple algorithmic computations on conventional FPGA or DSP

like circuits. However, computing the feature vector is a slow process in conventional

computers. Other authors consider other distance measures, such as Mahalanobis,

Bhattacharyya or Euclidean distances [117][118]. We have tested all of them but the

distance measure providing the best result was that described in eq. (5.11).
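The following Python sketch illustrates how the distance of eqs. (5.10)-(5.11) (and, analogously, eq. (5.15) for the AER features) can be evaluated against a database of precomputed feature vectors; the array names and the random data are purely illustrative.

    import numpy as np

    def texture_distance(fe_i, fe_j, alpha):
        """Sum over all feature components of the absolute differences,
        each normalised by the standard deviation 'alpha' of that
        component over the whole database (eqs. 5.10-5.11).
        fe_i, fe_j and alpha are 1-D arrays of equal length
        (48 components for S=4 scales and K=6 orientations)."""
        return np.sum(np.abs(fe_i - fe_j) / alpha)

    # e.g. a query feature vector against a small synthetic database
    rng = np.random.default_rng(0)
    db = rng.random((5, 48))              # five stored feature vectors
    alpha = db.std(axis=0) + 1e-12        # per-component normalisation
    query = rng.random(48)
    best = min(range(len(db)), key=lambda j: texture_distance(query, db[j], alpha))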

In the next subsection we show how FEi can be computed very quickly using AER hardware convolution modules.

5.3.2 AER-based implementation for texture retrieval

Our resulting AER system implements a slightly modified version of the system origi-

nally proposed by Manjunath for texture retrieval. The AER system is shown in Fig.

5.1. It has three layers. The first one is composed of a splitter module and 24 AER

convolution chips in parallel. It implements a Gabor filter bank with 4 scales and 6

orientations. In [119] this configuration of filters was demonstrated to provide the best


Figure 5.1: Scheme of the AER-based system implemented for texture-based retrieval of images

results. In Fig. 5.1, a texture image is coded by events at intervals of 50ns. These

events are fed to a splitter module that replicates them on the 24 output channels.

Each output channel is connected to a convolution module g_mn that uses as kernel the real part of a Gabor wavelet with scale m and orientation n.
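As an illustration of the kernels programmed in these modules, the following sketch generates the real part of one such Gabor kernel; it assumes the standard Gabor mother wavelet used by Manjunath [94], and it takes σx, σy, W and θ as already computed from S, K, Uh and Ul (that computation is not reproduced here, and the function name is ours).

// Hedged sketch: real part of a dilated and rotated Gabor kernel (dilation factor a^{-m}),
// assuming the standard mother wavelet of [94]. Parameter values are illustrative.
#include <cmath>
#include <vector>

std::vector<std::vector<double>> gaborRealKernel(int half, double sx, double sy,
                                                 double W, double theta, double dilation)
{
    const double PI = 3.14159265358979;
    const int N = 2 * half + 1;
    std::vector<std::vector<double>> k(N, std::vector<double>(N));
    for (int y = -half; y <= half; ++y)
        for (int x = -half; x <= half; ++x) {
            double xr = dilation * ( x * std::cos(theta) + y * std::sin(theta));  // rotate + dilate
            double yr = dilation * (-x * std::sin(theta) + y * std::cos(theta));
            double env = std::exp(-0.5 * (xr * xr / (sx * sx) + yr * yr / (sy * sy)))
                         / (2.0 * PI * sx * sy);
            k[y + half][x + half] = dilation * env * std::cos(2.0 * PI * W * xr);  // real part only
        }
    return k;
}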

The sign of the output events from the convolution modules is changed to positive (a full-wave rectification). This way, the output at each convolution module is |Wmn(x,y)| (represented as c_mn in Fig. 5.1).

Note that adding more chips to layer ‘1’ increases the number of scales and orienta-

tions in the bank of Gabor filters. This will help to improve classification performance.

However, note that adding more chips to layer ‘1’ will not increase the processing delay

of the hardware.

Layer ‘2’ consists of 24 Feature Extraction Modules (FEM in Fig. 5.1). A FEM

module is shown in Fig. 5.2. The first block in the FEM is a splitter with three

output channels. The top channel (labelled ‘2’ in Fig. 5.2) goes directly to layer 3,

thus providing an AER representation for |Wmn(x, y)|. The bottom channel (labelled

‘5’) goes to an internal merger module with a hardwired positive sign. The central

channel (labelled ‘3’) goes to an internal mapper. This mapper ignores the address of


Figure 5.2: Scheme of a generic FEM used in Fig. 5.1.

the incoming event, and generates a new address by sequentially sweeping all addresses. Consequently, the mapper output represents a uniform AER image with the same number of events as |Wmn(x,y)|. Thus, this output is an estimation of the mean µmn in eq. (5.6). This mean is fed to the internal merger with a hardwired negative sign. As a result, at the merger output we have all |Wmn(x,y)| events with a positive sign and all µmn events with a negative sign. After convolving them with a unitary kernel C and changing the negative output event signs to positive, the output will represent:

S_{mn} = \big|\, |W_{mn}(x,y)| - \mu_{mn} \big| \qquad (5.12)
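A behavioral sketch of this mapper (in the style of our simulator modules, with illustrative names) could be the following: the input address is discarded and a raster-sweep address is emitted instead, so that the output carries the same total number of events spread uniformly over the array.

// Sketch of the FEM internal mapper of Fig. 5.2 (illustrative, not the real module API):
// each incoming event produces one outgoing event whose address sweeps the array in
// raster order, yielding a uniform AER "image" with the same total event count.
struct Event { int x, y, sign; };

class SweepMapper {
public:
    SweepMapper(int width, int height) : w_(width), h_(height), next_(0) {}

    Event onEvent(const Event& /*in*/) {            // the input address is ignored
        Event out{ next_ % w_, next_ / w_, +1 };    // estimated-mean channel, positive sign here
        next_ = (next_ + 1) % (w_ * h_);            // advance the raster sweep
        return out;                                 // the merger later hardwires it to negative
    }

private:
    int w_, h_, next_;
};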

Finally, for each of its inputs |Wmn(x, y)| and Smn(x, y), layer ‘3’ will count the

total number of events (regardless of their addresses) per unit time. We will use these

numbers to create our feature vector described as:

FE = [W_{11}\, S_{11}\, W_{12}\, S_{12}, \ldots, W_{46}\, S_{46}] \qquad (5.13)

where

W_{mn} = \sum_{x}\sum_{y} W_{mn}(x,y); \qquad S_{mn} = \sum_{x}\sum_{y} S_{mn}(x,y) \qquad (5.14)

Wmn and Smn are an estimation of µmn and an approximation to σmn as given in Eqs. (5.6) and (5.8), and consequently they can be used as features for the input texture. Although this feature vector is slightly different from Manjunath's vector described in eq. (5.9), this does not introduce any significant deviations, as will become apparent in the


Section on experimental results. This feature vector, obtained without a frame-based scheme, will then be compared with Manjunath's frame-based method and other state-of-the-art frame-based methods on the Brodatz database [120] using Eqs. (5.10) and (5.11).

We have computed distances dmn in Eq. (5.10) between two feature vectors FEi

and FEj as:

d_{mn}(i,j) = \left|\frac{W^{(i)}_{mn} - W^{(j)}_{mn}}{\alpha(W_{mn})}\right| + \left|\frac{S^{(i)}_{mn} - S^{(j)}_{mn}}{\alpha(S_{mn})}\right| \qquad (5.15)

There is an abundance of literature in the field of texture-based image retrieval. Some works use different filters [121][62], different features [111], [62], [122]-[130], different

distance measures [117][131] or different numbers of scales or orientations [132]-[134].

But all of them can be mapped to an AER architecture similar to the one we have de-

scribed. The AER hardware implementation technique is therefore not restricted to the

particular example we have picked for illustration purposes. In the next Section we will

provide realistic performance characteristics for an eventual AER system implementing

the texture classification system described above.

5.4 Experimental Results

In this Section we provide a realistic performance evaluation of an eventual hardware

implementation. With this aim we have used our AER C++ behavioral simulator tool.

As in the rest of the applications presented in this thesis, the performance character-

istics of the AER modules employed here (convolution chips, mergers, splitters, and

mappers) are obtained from already manufactured, tested and reported AER modules

[50]-[55]. Using the module performance characteristics together with the AER behav-

ioral simulator, we can obtain a very good estimate of the overall system performance.

Below, we illustrate the performance obtained for the AER-based analysis system with

the 48 convolution modules described previously.

When researchers in texture retrieval and classification try to test their proposed

approaches, they use mainly two different databases, the Brodatz database [120] and

the Vistex database [135]. Compared to Brodatz textures, VisTex images are less

suitable for texture classification. Unlike Brodatz textures, the images in VisTex do


not conform to rigid frontal plane perspectives and studio lighting conditions¹, which

brings in a large variation of scale, rotation, contrast and perspective. For these reasons,

we have chosen the Brodatz database as benchmark for texture recognition. Some

researchers use a part of the Brodatz database to show their results (normally forty

classes), but this alone is not a proof of the validity of their algorithms, because a method may be good for some classes of textures but provide bad results for

the rest. Due to this, we have preferred to compare our approach with the state-of-the-

art methods that make use of the entire Brodatz database [84][88][89][90][74][85][94]

and two methods that use a part of the Brodatz database [92][93]. We have made our

comparison in terms of average retrieval rate (ARR), feature extraction time (FET)

and total computational time.

For this purpose we have used the entire Brodatz database [120], which consists of 112 images; each image has been divided into sixteen 90x90 nonoverlapping

subimages, thus creating a database of 1792 texture images. These images have been

rate-coded into events separated by 50ns, creating stimulus bursts of 30ms on average².

We used our C++ behavioral simulation tool to estimate the performance of an eventual

hardware implementation. The 48 channel outputs of Layer 2 (see Fig. 5.1) obtained

for each of the images in the database were collected during 30ms (duration of the input

burst) to create the feature vector database. In what follows, a query pattern is any

one of the 1792 patterns in the database. This pattern is then processed to compute

the feature vector as in Eq. 5.13. The distances d(i, j), where i is the query pattern

index and j is the index of a pattern from the database (with i ≠ j), are computed

and sorted in increasing order. Only the closest set of patterns are retrieved. Ideally,

all top 15 retrievals are from the same large image.

The performance is measured in terms of the average retrieval rate (ARR), which

is defined as the average percent number of patterns belonging to the same image as

the query pattern in the top 15 matches.

Table 5.1 summarizes the results. It shows the retrieval accuracy for each of the

112 texture classes in the database when we compare our AER-based method with the

original Manjunath frame-based results. As can be seen, the retrieval accuracies are

approximately equal.

¹ Quoted from the VisTex Web Site.
² This burst time is conceptually comparable to the frame time in a frame-based system.


IMAGE  FRAME-BASED  AER-BASED  |  IMAGE  FRAME-BASED  AER-BASED  |  IMAGE  FRAME-BASED  AER-BASED

D1 100 100 D39 47,67 40,49 D77 100 100

D2 67,39 64,58 D40 25,75 38,44 D78 88,21 87,64

D3 100 100 D41 79,45 41 D79 100 100

D4 100 100 D42 19,72 22,04 D80 100 100

D5 39,45 65,09 D43 37,81 42,54 D81 100 100

D6 100 100 D44 40,55 37,41 D82 100 100

D7 19,18 22,55 D45 10,41 13,32 D83 100 100

D8 88,21 100 D46 86,02 66,62 D84 100 100

D9 93,69 96,35 D47 100 100 D85 100 100

D10 72,87 69,7 D48 75,06 57,91 D86 46,03 64,06

D11 100 100 D49 100 100 D87 100 100

D12 100 100 D50 82,74 89,69 D88 24,11 26,65

D13 19,18 23,06 D51 100 99,43 D89 19,18 33,31

D14 27,4 29,21 D52 89,31 58,94 D90 52,6 37,41

D15 78,9 99,43 D53 100 100 D91 15,89 16,4

D16 100 100 D54 80 87,64 D92 100 99,42

D17 100 100 D55 100 100 D93 100 98,91

D18 52,6 66,62 D56 100 100 D94 100 100

D19 100 90,71 D57 100 100 D95 100 97,37

D20 100 100 D58 14,25 15,37 D96 66,85 72,26

D21 100 100 D59 37,81 45,1 D97 36,16 59,96

D22 100 100 D60 39,45 51,25 D98 33,97 43,56

D23 21,92 33,31 D61 33,42 44,07 D99 19,72 24,09

D24 100 100 D62 32,33 33,83 D100 45,48 49,2

D25 56,98 55,86 D63 38,35 43,56 D101 100 99,43

D26 96,43 71,24 D64 100 100 D102 100 100

D27 31,23 38,44 D65 100 100 D103 98,08 100

D28 64,65 75,85 D66 76,71 87,12 D104 77,26 87,64

D29 100 100 D67 49,86 48,18 D105 100 94,3

D30 24,66 36,9 D68 100 100 D106 100 100

D31 21,37 15,89 D69 80 83,54 D107 32,87 11,79

D32 100 100 D70 95,89 97,89 D108 13,15 13,84

D33 94,79 100 D71 98,08 100 D109 72,32 66,11

D34 100 100 D72 35,07 33,82 D110 100 86,1

D35 100 100 D73 20,27 27,68 D111 71,78 78,41

D36 95,89 84,05 D74 56,44 37,93 D112 52,6 57,4

D37 100 100 D75 84,38 92,76

D38 100 93,79 D76 100 100 AVERAGE 73,21 73,89

Table 5.1: Retrieval Performance for Each of the 112 Brodatz Images. Comparison Between Manjunath's Frame-Based Method and the AER-Based Proposed Approach.

To estimate the minimum time for correct texture retrieval we proceeded as follows.

Input stimuli lasted for about 30ms. Layer 3 counts events coming from the 48 Layer 2

output channels during a time Tcount. This time was increased in steps of 15µs from 0

to 30ms. We found that for Tcount approximately equal to 10ms the results were similar

to those shown in Table 5.1. Consequently, an AER hardware implementation would

be able to achieve correct texture retrieval in about Trcg=10ms. As an illustration, Fig.

5.3 shows the retrieval accuracy as a function of Tcount for six of the texture images in

[120]. As can be seen, after 10ms the retrieval accuracy has stabilized; this is 20ms

before the input stimulus is finished.
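In the behavioral simulations, the Layer-3 counters can be emulated with a few lines of code; the sketch below (illustrative names, timestamps in microseconds) counts, for each of the 48 Layer-2 output channels, the events received up to Tcount and returns them as the feature vector of eq. (5.13).

// Sketch of the Layer-3 counters: only events with timestamp <= Tcount contribute.
#include <array>
#include <vector>

struct TimedEvent { double t_us; int channel; };        // channel index in [0, 47]

std::array<double, 48> featureVector(const std::vector<TimedEvent>& events, double Tcount_us)
{
    std::array<double, 48> fe{};                        // one event counter per channel
    for (const TimedEvent& e : events)
        if (e.t_us <= Tcount_us)
            fe[e.channel] += 1.0;
    return fe;
}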


Figure 5.3: Texture retrieval accuracy obtained for images D1-D2-D3-D8-D9-D10 as a function of Tcount (in milliseconds)

5.4.1 Comparison with the State-of-the-Art

In Table 5.2 we compare our AER event-based method with those reported in [74] and

[84][85], [88]-[90][92][93] and with Manjunath's approach [94] in terms of average retrieval

rate (ARR) using the entire Brodatz database. In Table 5.3, we compare our method

with those published in [84], [90], [92], and [93] and also with Manjunath’s method [94],

in terms of computation times. We distinguish between a FE time (time required to

obtain a feature vector) and a searching and sorting time (additional time to classify

the texture: computation of the terms d(i,j), sorting them, and selecting the best match).

The sum of both is the total computation time. Note that, because of the conceptual

difference between a frame- and an event-based approach, total computation time for

a frame-based system is TFC (as defined in Fig. 5.4), while for an event-based system

it is Trcg (as defined in Fig. 5.4). Consequently, comparing the computational delay of

the two approaches by simply comparing times TFC and Trcg is not a fair comparison.


Figure 5.4: Comparison between frame-based and AER-based systems

It is more realistic either to compare Tframe + TFC against Trcg, or to compare the time between the instant a frame is fully available (T1 + ∆ in Fig. 5.4) and the instant the computing system provides a recognition result: TFC for a frame-based system against T′FC = Trcg + td − Tframe (see Fig. 5.4) for an event-based system. Note that the latter ends up being negative (for instance, with Tframe ≈ 30ms and Trcg ≈ 10ms, T′FC ≈ td − 20ms, i.e., the recognition result is available about 20ms before the stimulus burst ends).

5.5 Discussion

AER is an emerging hardware technology with great potential for providing complex

cortical-like sensory-processing systems. Of special interest is its potential for provid-

ing very fast spike-processing convolutional neural networks with complex hierarchical

structures, similar to those found in biological cortex. Recent work on individual AER

convolutional chips reveals the outstanding capabilities of such components as “bricks”

for larger highly sophisticated and hierarchically structured cortical-like sensory pro-

cessing systems. To date, the largest AER multimodule system reported uses only

four processing stages, one of which is a convolution [36]. We believe that we are not

far from seeing systems made out of several hundreds (or thousands) of AER convo-


METHOD | NUMBER OF CLASSES CONSIDERED (%) | ARR (%)
MDFB [84] | 100% | 73,00%
FAST MDFB1 [84] | 100% | 73,00%
FAST MDFB2 [84] | 100% | 73,00%
CONTOURLET TRANSFORM [84] | 100% | 71,00%
LOCAL AFFINE REGIONS [88] | 100% | 76,26%
LOCALLY INVARIANT DESCRIPTORS [89] | 100% | 78,50%
STANDARD REAL DWT [90] | 100% | 64,17%
DT-CWT [90] | 100% | 76,83%
COMBINATION OF DT-CWT AND DT-RCWF [90] | 100% | 78,93%
ROTATION INVARIANT GABOR FEATURE [74] | 100% | 59,00%
SCALE INVARIANT GABOR FEATURE [74] | 100% | 57,00%
MDFB [85] | 100% | 72,10%
STEERABLE PYRAMID [85] | 100% | 69,60%
FRACTAL-CODE SIGNATURES [92] | 36% | 53,2% - 85,3%
TPLP signature [93] | 36% | 82,10%
MANJUNATH APPROACH [94] | 100% | 73,2%
AER-BASED APPROACH | 100% | 73,89%

Table 5.2: Comparison of Average Retrieval Rate Between Different Methods Using the Brodatz Database.

lutional modules in the near future. NoC (Network-on-Chip¹) technology [136] could host around 100 individual convolutional modules on a single chip, and about 100 such chips could be put on a single PCB (printed circuit board). Consequently, a small physical volume like a desktop computer could easily hold 20-40 such PCBs, providing a total of almost half a million convolution modules. However, currently, it is not

obvious what architectural structures should be used to assemble these emulated AER

convolutional “bricks” and how to set their parameters for a desired (recognition) appli-

cation. In this chapter, we have concentrated on one such possible application, texture

recognition, emulated it with a behavioral AER simulator, and used it as an exercise

to see how to set up such a system, its parameters, and estimate the performance of

multilayer AER convolutional systems. Some software-based computational works that use massive convolutions for vision processing are starting to appear in the literature. For example, in texture recognition, experiments in recent years have demonstrated

that filter-based schemes provide excellent results [62], [81]-[83]. However, massive con-

¹ In a NoC system, modules such as processor cores, memories and specialized IP blocks exchange data using a network as a "public transportation" sub-system for the information traffic.


METHOD | FE time (s) | Searching and sorting time (s) | Total time (s), frame-based (TFC) | Total time (s), AER-based (Trcg) | SOFTWARE | HARDWARE
MDFB [84] | - | - | 2,59 | - | Matlab 6.5 | Intel Pentium 4 CPU, 2.4 GHz
FMDFB1 [84] | - | - | 1,69 | - | Matlab 6.5 | Intel Pentium 4 CPU, 2.4 GHz
FMDFB2 [84] | - | - | 1,62 | - | Matlab 6.5 | Intel Pentium 4 CPU, 2.4 GHz
CONTOURLET TRANSFORM [84] | - | - | 1,38 | - | Matlab 6.5 | Intel Pentium 4 CPU, 2.4 GHz
STANDARD REAL DWT [90] | 0,47 | 0,06 | 0,53 | - | Matlab 5.3 | Intel Pentium III CPU, 866 MHz
DT-CWT [90] | 0,56 | 0,06 | 0,62 | - | Matlab 5.3 | Intel Pentium III CPU, 866 MHz
DT-CWT AND DT-RCWF [90] | 1,05 | 0,09 | 1,14 | - | Matlab 5.3 | Intel Pentium III CPU, 866 MHz
FRACTAL-CODE SIGNATURES [92] | - | - | 0,42 - 18 | - | - | Intel Pentium 4 CPU (2 GHz)
TPLP [93] | 3,3 | 4,78 | - | - | Visual C++ 6.0 | Intel Pentium 4 CPU (2 GHz)
MANJUNATH APPROACH [94] | 9,3 | 1,02 | 10,32 | - | Matlab 5.0 | SUN Sparc20 CPU
AER-BASED APPROACH | 0,01 | 0,01 | - | 0,02 | - | AER-based devices

Table 5.3: Comparison of Computational Times Between Different Methods Using the Brodatz Database.

volutions on conventional computers result in excessive computational times, making such approaches impractical for real-world applications. In general, vision processing

researchers tend to avoid the use of convolutional processing because of its excessive

computational load. For example, quoting Serre et al. [1] who use a first stage with

64 Gabor filters (for an input image of 128x128 pixels), the main limitation of their

powerful recognition system is the delay of this first stage, which requires several tens

of seconds. An AER-based spiking hardware could perform this processing with delays

of a few milliseconds, or fractions of milliseconds, while the visual input is being sensed.

In all reported approaches for texture recognition, there is a relationship between the

length of the feature vector and the computational time. The longer the feature vector,

the longer the feature extraction time. In AER convolutional hardware, this is not the

case, because all the elements of the feature vector are computed in parallel. Conse-

quently, it is possible to increase the feature vector length or elements [74] to improve

retrieval rate, without increasing feature extraction time, although at the cost of using

more hardware “bricks”. Actually, novel approaches for texture retrieval are based on

the use of filters that take into account more frequencies or scales [84], [90] and produce

less redundant features as compared to other wavelets (Gabor wavelet in our case). In

AER convolutional hardware, increasing the number of convolutional filters does not

degrade speed response of the overall system. This is because the filters receive the


same input events simultaneously and process them in parallel. The hardware will add some delay to distribute the events to a larger number of receivers, but this extra delay will be on the order of nanoseconds, and consequently not perceived

by the overall system. For present-day reported AER links, a typical bandwidth is on the order of 10-30 Meps (mega events per second). The output event rate of retina sensors is usually below 1 Meps. However, when merging several AER module outputs into one single AER channel, especially if we are thinking of several hundreds of modules in the near future, it is realistic to expect that the limited AER link bandwidth could easily end up being the main delay bottleneck for such systems. A solution to this problem could be a hierarchical merging of outputs combined with the replication of AER links to increase bandwidth.

Also, we have observed that event traffic is higher for the first stages and is gradually

reduced as convolutional processing compresses and extracts relevant information.

Perhaps the most interesting observation is that in AER sensory processing hardware,

processing is performed as events flow between modules. As the retina sends out its events, they are fed directly to the processing structure and are processed as they flow

in. In the same way, each “brick” processes its input events as they flow in and gener-

ates new ones. This way the whole system operates as if a wave of (visual) information

(in the form of flow of events) travels through the convolutional structure while it is

processed. Since processing is on a per-event basis, stages do not wait until full "images" have been transmitted before processing them, thus drastically reducing the latency between the input and output information flows.

What we have found with the specific example in this chapter is that when mapping

a known convolutional processing (frame-based) algorithm to AER hardware: 1) the

recognition performance remains similar, and also comparable to state-of-the-art com-

putational methods not based on convolutions (or filters), and 2) if someday we are able to physically build this hardware, it will be capable of providing output recognition while the input stimulus is still being produced by the sensors.

5.6 Conclusions and Future Work

The application presented in this chapter shows performance results for a relatively

large multi-module, multilayer convolutional neural network frameless AER processing


system, estimated through behavioral simulations but using performance figures of real

individual AER hardware modules already available. A texture classification system

based on Manjunath’s method has been analyzed. This scheme uses 48 AER convolu-

tional modules plus a similar number of interfacing modules, such as splitters mergers

and mappers. We have shown that the recognition performance of the AER system is

equivalent to its original frame-based reference. However, if built with realistic AER

hardware, recognition is achieved while the sensory stimulus is being generated. This

would be equivalent to stating that an AER system has a negative processing delay

when compared to a frame-based system, where each frame has to be fully available be-

fore starting any recognition computation. Thus, AER systems reveal some interesting

properties. First, they are not constrained to frames and the output is often available

even before the input stimulus has finished. Processing delay is given mainly by the

number of layers and the number of events needed to represent the input stimulus.

The processing capability of such systems is increased by adding more modules per

layer, but without increasing the number of layers. Consequently, processing capability

can be increased without penalizing delays, although at the cost of adding hardware.

Currently, the available AER hardware modules are quite preliminary, although their

performance figures provide very promising system level performance estimations.


Chapter 6

EVENT-DRIVEN CONVOLUTIONAL NETWORKS FOR FAST VISION POSTURE RECOGNITION

In this chapter a bio-inspired six-layer frame-free event-driven convolutional network

for people recognition is proposed. The system consists of six feed-forward layers and

22 AER convolution modules. Its corresponding frame-based version was trained using

32x32 images reconstructed collecting output spikes from a 128x128 AER motion (tem-

poral contrast) electronic retina. The computed weights obtained during the training

stage were then used in the frame-free version of the system. This frame-free implemen-

tation was tested with output spikes obtained from the same retina chip. We provide

simulation results of the system trained for people recognition, showing recognition

delays of a few milliseconds from stimulus onset.

6.1 Motivation

Nowadays, power and speed requirements for sophisticated tasks such as people or object tracking and recognition, fabrication and quality control of components, vision pro-


cessing, etc., impose strong real-time restrictions. Frame-based systems have difficulties dealing with such restrictions, as they are not able to work in real time, mainly when

several parallel processing layers are involved. In the last decades, a series of multi-

layer frame-based systems with short time responses have been proposed to solve and

accelerate complicated tasks emulating brain behavior [137]-[150], but the most suc-

cessful systems have been those based on convolutional networks trained using some

kind of continuous-time gradient-based learning algorithms [151]-[162]. Note that the

high number of connections present between neurons in perceptron neural networks is

reduced considerably in convolutional networks (ConvNets) since the connections are

shared (weight-sharing) and the connection weights are those stored in the convolu-

tion masks. Application examples of Convolutional Networks are object recognition

and scene analysis [1], image segmentation and biological image analysis (brain circuit

reconstruction) [151], natural language processing and understanding [152], biological

image analysis, object recognition and visual navigation for robots [153] and many

others [154]-[162]. In Convolutional Networks, early stages extract elementary visual

features such as oriented edges, endpoints and corners, which are then combined by sub-

sequent layers in order to detect higher order features. Early stages usually operate

with small but dense convolution masks, while later stages use longer range but sparser

masks [1]. Example ConvNet systems for face and character recognition applications

may have several tens to hundreds of filters per layer. There are many reasons that motivate the choice of ConvNets to implement bio-inspired tasks in object recognition.

One of them is that, compared to other neural networks, ConvNets have a graceful

scaling capability. To increase knowledge one simply has to increase the number of

filters in a layer. Thus, the number of neurons (pixels) scales linearly with the number of

filters, and as there is a fixed number of synapses per filter (the convolutional kernel

weights), the number of synapses also scales linearly with the number of filters. On

the other hand, the latency of the computing structure (if implemented as parallel

hardware) is determined mainly by the number of sequential layers, which is a reduced

number and does not change for a given application. Therefore, speed does not degrade

by adding more filters per layer (more knowledge). Consequently, ConvNets seem very

appealing for configurable, modular and scalable spiking hardware implementations.

Other important reasons motivating the use of ConvNets are that they combine three

architectural ideas to ensure some degree of shift, scale, and distortion invariance: 1)


Local receptive fields. If the input image is shifted, the filter output will be shifted

by the same amount. This property is at the basis of the robustness of convolutional

networks to shifts and distortions of the input. Once a feature has been detected, its

exact location becomes less important. Only its approximate position relative to other

features is relevant; 2) Shared weights (or weight replication). This leads to a great

reduction of the number of trainable weights and provides shifting independence. 3)

Spatial or temporal subsampling. Subsampling at upper layers also provides shift, scale

and distortion invariance.

The big disadvantage of the convolutional networks developed so far is that they are

mainly frame-based systems, and consequently they are not truly bioinspired as they

lack the idea of continuous real-time spike processing implemented in the brain. They

have to wait to collect frames (or sections of them) in every layer to start processing.

A second drawback is that frame-based implementations are not good at handling

the massive interconnections usually present in ConvNets. In spite of the weight sharing

technique employed in ConvNets, each neuron in one layer is connected to a set of neurons in the following layer, and sometimes also to neurons in the same

layer. An example of a state-of-the-art frame-based successful ConvNet can be found

in [163].

All these drawbacks motivated our interest in designing an AER system as an efficient alternative that provides good results in speed and recognition at the same time. Moreover, the massive interconnections present in ConvNets can be perfectly handled by the AER protocol [30][43][31]. In this Chapter we present a six-layer frame-free ConvNet similar to the Convolutional Network LeNet-5 implemented by Y. LeCun [4] for online handwriting recognition, but which is fully intended for an AER implementation in a frame-free scheme. The implemented frame-free ConvNet detects people in vertical,

up-side-down and horizontal positions captured with a real temporal contrast (motion)

128x128 AER retina [15].

In the next section the implemented frame-based system is described. Then we explain why this particular structure was chosen compared to others with different numbers of filters and parameters. Finally, the full AER-based version is described.


Figure 6.1: Frame-based ConvNet to detect people in up, up-side-down or horizontal positions

Figure 6.2: Real scenarios where AER recordings with the motion retina were obtained

6.2 Frame-Based Convolutional Network

The frame-based version of our AER six-layer ConvNet is shown in Fig. 6.1. This

frame-based version of the system has been implemented to obtain the trained weights

to be used in the corresponding frame-free (AER-based) implementation. It has six

layers and it receives as inputs 32x32 pixel images obtained after collecting spikes from

an electronic 128x128 AER retina during 30ms. The AER silicon retina chip used in

our implementation [157] generates events corresponding to relative changes in image

intensity. The retina recordings were obtained in scenarios like those shown in Fig. 6.2,

where static distractor objects are mixed with walking people. Some images obtained

after collecting spikes during 30ms are shown in Fig. 6.3. Note that, due to the retina's dynamic nature, only motion information is captured. This way, all the static objects present in the scene are removed, which implies a first stage of processing implemented directly at the sensor.

The retina 128x128 address space was downsampled to 32x32 pixels. The first


Figure 6.3: Images obtained collecting input spikes from the retina every 30ms. The second and third rows were obtained by previously rotating the input events 90 and 180 degrees, respectively.

column in Table 6.1 shows the different parameters that are used in the frame-based

implementation of the system.

The output of each of the six layers in the system consists of a set of output images

or planes called “feature maps”, which are composed of arrays of neurons. Neurons

belonging to a feature map in one layer are only connected to neurons in feature maps

in the following layer through projection fields (convolution masks). There are no

connections between neurons inside one layer. A unit (pixel or neuron) located at

position (i, j) inside a feature map q (of size KxL, q = 1, ..., Q) belonging to layer l will

have a value y_l^q(i,j) computed as:

x_l^q(i,j) = \sum_{p\in P}\sum_{m\in M}\sum_{n\in N}\left( y_{l-1}^{p}(m,n)\cdot W_l^{p,q}(i-m,\,j-n)\right) + b_l^q \qquad (6.1)

y_l^q(i,j) = A\,\tanh\!\left(S\cdot x_l^q(i,j)\right) \qquad (6.2)

where P is the total number of feature maps in the preceding layer l − 1, Q is the


Table 6.1: Parameters used in the frame-based and AER-based implementations

total number of feature maps in the present layer l, x_l^q(i,j) is the state of pixel (i,j) in feature map q, W_l^{p,q} is the convolution mask connecting input feature map y_{l-1}^p to output feature map y_l^q, b_l^q is a bias, and A and S are constants. For our simulations, we use A = 1.7159 and S = 2/3, as these particular values improve the convergence towards the end of the learning session [4]. For simplification purposes, in the AER-based hardware implementation we have not considered trainable biases.
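As a reference, eqs. (6.1) and (6.2) can be summarized with the following C++ sketch for one output feature map (square maps and kernels are assumed, and the kernel indexing is written as a correlation; flipping the kernel gives the exact convolution of eq. (6.1)).

// Sketch of eqs. (6.1)-(6.2): one output feature map q of a filtering layer.
#include <cmath>
#include <cstddef>
#include <vector>

using Map = std::vector<std::vector<double>>;              // [row][column]

Map featureMap(const std::vector<Map>& y_prev,              // P input feature maps of layer l-1
               const std::vector<Map>& W,                   // P kernels W_l^{p,q}, one per input map
               double bias, double A = 1.7159, double S = 2.0 / 3.0)
{
    const std::size_t kSize = W[0].size();
    const std::size_t out   = y_prev[0].size() - kSize + 1; // "valid" output size
    Map y(out, std::vector<double>(out, 0.0));
    for (std::size_t i = 0; i < out; ++i)
        for (std::size_t j = 0; j < out; ++j) {
            double x = bias;                                 // x_l^q(i,j) of eq. (6.1)
            for (std::size_t p = 0; p < y_prev.size(); ++p)
                for (std::size_t m = 0; m < kSize; ++m)
                    for (std::size_t n = 0; n < kSize; ++n)
                        x += y_prev[p][i + m][j + n] * W[p][m][n];
            y[i][j] = A * std::tanh(S * x);                  // eq. (6.2)
        }
    return y;
}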

The first layer C1 of the system in Fig. 6.1 is a Gabor filter bank [94] with six 10x10

filters (convolution masks) and six 28x28 feature maps. As done by LeCun [4], the size

of each feature map is 28x28 because we have only considered a square of 28x28 pixels

from each filter output. Each pixel in a feature map in C1 is connected to a square of

100 (10x10) pixels of the input, called the receptive field of the pixel. All the pixels

in a particular feature map in layer C1 share the same set of 100 weights, which are

the values of the corresponding filter (convolution mask). As each unit (neuron) has

100 inputs and the weights are shared for each feature map in C1, we would need

100 coefficients for each convolution mask (there are six feature maps each with its

corresponding convolution mask). In generic ConvNets, filtering layers are trainable.


However, unlike conventional ConvNets, we have chosen a fixed (non-trainable)

bank of six 10x10 Gabor filters with two scales and three orientations in this layer. Due

to the fixed weights, this layer has 0 trainable coefficients and 470400 connections.

The second hidden layer S2 is a subsampling layer with six feature maps of size

14x14 pixels (each feature map is connected to each of the six feature maps in layer C1).

A subsampling layer in generic ConvNets performs a smoothing of the input followed by

a subsampling operation by two (in rows and columns), thereby reducing the resolution

of the feature map and the sensitivity of the outputs to shifts and distortions. The

receptive field of each pixel at this layer is a 2x2 area at the previous layer corresponding

feature maps. In generic subsampling layers, each pixel computes the average of its

inputs multiplied by a trainable coefficient and adds a trainable bias to the sum. Then

the result is passed to a sigmoid function. As contiguous pixels have nonoverlapping

contiguous receptive fields, these subsampling layer feature maps have half the number

of rows and columns as the feature maps in layer C1.

For simplification purposes (especially for the eventual hardware implementation)

we have implemented subsampling layers as addition layers with a multiplying factor

of value 1. In other words, each pixel computes the sum of its four corresponding input

pixels in the previous layer. This simplification did not affect the recognition results

in our experiments but led to different weights obtained after training in the filtering

layers. With this simplification, subsampling layers have neither trainable coefficients

nor sigmoid functions. This implies that this layer has 0 trainable coefficients and 4704

connections.

As done by Y. LeCun [4], the third layer C3 is a convolutional layer with four

10x10 pixel feature maps. Each pixel in the third layer has input connections from all

the six feature maps in the previous layer. This way each feature map p in layer S2

is connected to a feature map q in layer C3 through an independent projection field (filter or convolution mask) W_3^{p,q}. Thus, there are 24 different 5x5 trainable filters (6 input feature maps and 4 output feature maps). At this stage, each pixel in layer C3 is connected to a 5x5 square of pixels in each of the six input feature maps, and the result for each output pixel in

C3 is passed through a sigmoid function. This layer has 600 trainable coefficients and

60000 connections.

Layer S4 is a subsampling block again with four 5x5 output feature maps. The

number of trainable parameters is zero again and the number of connections is 400.


The fifth layer C5 is a convolutional layer with eight 1x1 feature maps. Each unit

(pixel) is connected to all the 5x5 pixels on all the feature maps in S4 through different

projection fields. This implies that there is a full connection between S4 and C5. In

this layer there are 32 5x5 trainable filters (projection fields) connecting each of the 4

5x5 feature maps in S4 with the 8 1x1 feature maps in C5. Thus, layer C5 has 800 trainable coefficients (25 weights in each of the 32 convolution masks).

Finally, layer F6 contains 4 units and is fully connected to C5. It has 32 trainable

connections.

The mean-squared error for the output units can be computed as:

E = \sum_{i} (y_i - d_i)^2 \qquad (6.3)

where y_i and d_i are the computed output and the desired output at unit i, respectively.

This frame-based version of the categorizing system has 536336 connections and only

1432 trainable parameters. All the trainable parameters have been computed using the

backpropagation algorithm [4] (see Appendix Section 6.6).

6.3 Justification of the Architecture Used

Since the ultimate long term goal of our work is the eventual implementation in AER

hardware of the systems this thesis analyzes, we seek to find the simplest architectures

for a given task. Our aim is to obtain a recognition rate over 98% on the training

set using the minimum number of filters and trainable parameters. As done by Y.

LeCun [4], we have preferred a six-layer system (alternating filtering and subsampling layers) because in this way the number of trainable parameters is low

and we obtain a high recognition rate [155].

As opposed to LeCun [4], we have substituted the trainable first layer with a bank of fixed-weight 10x10 Gabor filters with two scales and three orientations, because a bank of Gabor filters is often the first stage of visual processing in many systems and in the human brain [8][1]. In addition, Gabor filters are selective to different scales and orientations, and they remove noise due to sparse spikes produced by the retina. The reason for choosing a bank of Gabor filters of size 10x10 was that smaller sizes


Figure 6.4: Comparison of recognition rates when we use a trainable set of filters in the first layer or a fixed Gabor filter bank.

did not provide adequate filters in the upper scales and larger sizes did not improve

the recognition results.

In any case, for comparison purposes we evaluated the recognition rate using fixed-

weight Gabor filters and using trainable filters at this stage with 250 training images

and 100 epochs (repetitions of the training set) and we obtained approximately the

same recognition rate of 98%. In Fig. 6.4 we show the recognition rate in both imple-

mentations.

To determine the simplest architecture that provides the best recognition performance, several tests were performed varying the number of filters and feature maps at each layer. First, the performance of the system was compared when different numbers of Gabor filters were used in the first layer C1. Thus, different Gabor filter banks with different scales and orientations in the first layer were tested, for a fixed number of feature maps in the rest of the layers. In Fig. 6.5 we show the recognition rate obtained for these combinations. It is evident that the choice that provides the best results is 2 scales and 3 orientations.

Once the optimum number of Gabor filters was adopted (2 scales and 3 orientations),

the optimum number of feature maps in the rest of the layers had to be determined. Note

that by fixing the number of filters in layer C1 we only have to choose the proper

number of feature maps in layers C3 and C5. This is because the number of feature


Figure 6.5: Comparison of recognition rates when we use different Gabor filter banks with different numbers of scales and orientations.

maps in layers S2, S4 is fixed (since they implement subsampling of the feature maps

in their previous layers) and layer F6 is the final 4-outputs layer fully-connected to

layer C5.

In Fig. 6.6 we show the recognition rates obtained when varying the number of feature maps in the third layer for different numbers of feature maps in the fifth layer. The recognition rate for a fixed number of feature maps in the fifth layer remained almost constant regardless of the number of filters in the third layer. This demonstrates that the number of feature maps in the fifth layer is more critical than the number of feature maps in the third layer. In Fig. 6.7 we show the different recognition rates when varying the number of feature maps in the fifth layer for different numbers of feature maps in the third layer. It is evident that beyond 8 feature maps in the fifth layer there are stable values with high accuracy (≈98%), almost independently of the number of feature maps in the third layer. These results motivated us to select four feature maps

for the third layer and eight for the fifth layer. A higher number of feature maps did

not provide significant improvement in the recognition rate and it would increase the

number of trainable weights. Note that in both experiments (Fig. 6.6 and Fig. 6.7)

the number of filters in layers C3 and C5 varies correspondingly to the variations in

the number of feature maps.

For hardware design purposes it is also interesting to analyze the possible ranges


Figure 6.6: Different accuracies obtained when varying the number of feature maps in the third layer, fixing the number of feature maps in the fifth layer.

of weight values at each layer. To do this, the values of the weights were computed

for different combinations of the number of feature maps in layers 3 and 5. In Fig.

6.8(a) and (b) the maximum absolute values that any of the weights achieved during

the training stage and at the end of the training stage are shown.

The structure we selected (as shown in Fig. 6.1) provides the lowest number of

filters and trainable parameters while maintaining a high recognition rate.

6.4 Frame-Free Convolutional Network

In Fig. 6.9 the frame-free AER structure corresponding to the frame-based scheme in

Fig. 6.1 is shown. The second column in Table 6.1 shows the different parameters that

are used in this frame-free implementation of the system.

In the frame-free architecture, we use as input a flow of events captured with an

AER motion sensing retina coding a 128x128 address space, downsampled to an address

space of 32x32. This is feasible using an AER subsampling module that transforms each

input event address coordinate in the range [0-127] to a new event address coordinate

in the range [0-31]. The subsampling module simply assigns to each input event with

address coordinates (xin, yin) the following new coordinate values:

x_{new} = \lfloor x_{in}/4 \rfloor; \qquad y_{new} = \lfloor y_{in}/4 \rfloor \qquad (6.4)


Figure 6.7: Different accuracies obtained when varying the number of feature maps in the fifth layer, fixing the number of feature maps in the third layer.

where the operator ⌊·⌋ indicates rounding down to the nearest integer. Note that modifying

the event address coordinates in this way is equivalent to an averaging and downsam-

pling operation. This new event flow with modified coordinates is used as input to the

system in Fig. 6.9. As shown in the figure, each input event is replicated using a 1-to-6 splitter module [36][50]. These six replicas are connected to layer C1, com-

posed of six AER convolution modules [165], each programmed with a 10x10 Gabor

filter (belonging to the Gabor filter bank with two scales and three orientations). Each

convolution module has internally an array (feature map) of 28x28 pixels.
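A behavioral sketch of this input stage (with illustrative names, not the actual AERST module interface) is given below: each retina event address is divided by four as in eq. (6.4), and the resulting event is then replicated towards the six Gabor modules of layer C1.

// Sketch of the AER subsampling module of eq. (6.4) plus the 1-to-6 splitter of Fig. 6.9.
#include <array>

struct AddrEvent { int x, y, sign; };

AddrEvent subsampleBy4(const AddrEvent& in)        // 128x128 -> 32x32 address space
{
    return { in.x / 4, in.y / 4, in.sign };        // integer division equals the floor for x, y >= 0
}

std::array<AddrEvent, 6> split6(const AddrEvent& in)  // one replica per C1 convolution module
{
    std::array<AddrEvent, 6> replicas;
    replicas.fill(in);
    return replicas;
}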

Each neuron at position (i,j) in feature map q (q ∈ {1,...,6}) belonging to layer C1 (l = 1) will have a state s_1^q(i,j) represented by the mathematical operation:

s_1^q(i,j) = \sum_{m\in M}\sum_{n\in N}\left( e_{in}(m,n)\cdot W_1^{q}(i-m,\,j-n)\right) \qquad (6.5)

where ein(m,n) is the number of input events per second coding activity at address

(m,n) (representing the input visual stimulus after being replicated with the splitter)

and W_1^q is the Gabor filter connecting the input stimulus with the 28x28 output feature

map q (output of filtering the input stimulus with Gabor filter q).

The events obtained at the output of the six AER 28x28 internal arrays in the

convolution modules, thus coding a 28x28 address space, are sent to the six subsampling


Figure 6.8: Maximum absolute value of the weights during the training stage and at the end of the training stage.

modules in layer S2. With the simplifications considered for subsampling layers (no

trainable weights and no non-linearities), these modules can be easily implemented

using AER subsampling modules again: the address of each input event (xin,2, yin,2) is

modified so that address (i, j), for i, j = 1, ..., 28 turns to (k, l), for k, l = 1, ..., 14:

x_{new,2} = \lfloor x_{in,2}/2 \rfloor; \qquad y_{new,2} = \lfloor y_{in,2}/2 \rfloor \qquad (6.6)

The output of each of the six subsampling modules is sent to a splitter again to

replicate the output onto four channels. Each of the four channels is connected to one

input of the four convolution structures with six input ports available in the third layer.


Figure 6.9: AER-based implementation of the ConvNet system.

A detailed description of one of these convolution structures is shown in Fig. 6.10. In

this structure, each time an event is received, the convolution map (projection field)

corresponding to the event input port is added around the event address in the pixel

array (feature map). These kinds of structures can be implemented with the recently

developed multikernel AER-convolution chips [56], which have multikernel capability

(up to 32). This way, a neuron at position (i, j) in feature map q belonging to the third

layer (l = 3) will have a state represented by the mathematical operation:

s_l^q(i,j) = \sum_{p\in P}\sum_{m\in M}\sum_{n\in N}\left( ein_l^{p}(m,n)\cdot W_l^{p,q}(i-m,\,j-n)\right) \qquad (6.7)

where ein_l^p(m,n) is the number of events per second coming into input port p (feature map input of size 14x14) of layer l (l = 3) coding the address (m,n). Matrix W_l^{p,q} is the convolution mask connecting input feature map p with feature map q (of size

10x10). Fig. 6.11 illustrates with an example what actually happens inside one of the


Figure 6.10: Convolution structure at layers C3 and C5. Each incoming spike causes a convolution map to be added onto the pixel array.

pixels in the pixel array of the feature map. In the figure, three events coming from

different input ports are received by the convolution module. The first event splashes

the convolution mask corresponding to that input port around the address coded by

the event in the pixel array and adds it to its state. If the neuron under study is inside

this area, only one of the weights (w1 in Fig. 6.11) corresponding to the first filter

will affect. As shown in this figure example, this weight w1 increments the state inside

the neuron (the weight will be added or decreased according to the event and weight

signs). The second event arriving later from a different (or the same) input port will

produce a different weight (or the same) to be added to the state value in the neuron.

The third event produces the same result but this time a threshold is reached and a

new output event is produced coding the address of the firing neuron.
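The per-event behavior of such a convolution structure can be summarized with the following behavioral sketch (illustrative names and simplifications: the state is reset to zero instead of to the feature-map bias, and arrays are square). It follows the description above: the projection field of the incoming port is added around the event address, and a neuron whose state crosses the threshold emits an output event only if its refractory time has elapsed.

// Behavioral sketch of the multikernel convolution structure of Figs. 6.10-6.11.
#include <cmath>
#include <vector>

struct SpikeEvent { int x, y, sign, port; double t; };   // timestamp t in seconds

class EventConvModule {
public:
    EventConvModule(int size, std::vector<std::vector<std::vector<double>>> kernels,
                    double threshold, double tRefractory)
        : n_(size), k_(std::move(kernels)), th_(threshold), tref_(tRefractory),
          state_(size * size, 0.0), tLast_(size * size, -1e9) {}

    void onEvent(const SpikeEvent& e, std::vector<SpikeEvent>& out) {
        const auto& K = k_[e.port];                        // projection field of this input port
        const int h = static_cast<int>(K.size()) / 2;
        for (int dy = -h; dy <= h; ++dy)
            for (int dx = -h; dx <= h; ++dx) {
                const int x = e.x + dx, y = e.y + dy;
                if (x < 0 || y < 0 || x >= n_ || y >= n_) continue;
                const int idx = y * n_ + x;
                double& s = state_[idx];
                s += e.sign * K[dy + h][dx + h];           // "splash" the kernel around the event
                if (std::fabs(s) >= th_ && (e.t - tLast_[idx]) >= tref_) {
                    out.push_back({x, y, (s > 0) ? +1 : -1, 0, e.t});  // fire an output event
                    tLast_[idx] = e.t;
                    s = 0.0;                               // reset (bias not modelled here)
                }
            }
    }

private:
    int n_;
    std::vector<std::vector<std::vector<double>>> k_;      // one kernel per input port
    double th_, tref_;
    std::vector<double> state_;
    std::vector<double> tLast_;
};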

Each time a neuron (unit or pixel) in a feature map reaches a threshold and the

time since the last output event (Toutput) is higher than an established refractory time

(Trefractory), a new output event is sent to the following layers coding the neuron

address, and the neuron is reset to the bias value of the feature map. Using this

refractory time to limit the neuron's maximum firing rate [15], we emulate a rectifying

non-linearity. This is one of the most important factors in improving the performance of

a recognition system [156]. If we do not consider refractory times, the number of events


Figure 6.11: Neuron in the pixel array. Each time a spike is received, a certain weight is added to the neuron state.

that a neuron would fire in layer C3 at position (i,j) in feature map q, eout_{3,lin}^q(i,j), is

eout_{3,lin}^q(i,j) = \frac{\sum_{p\in P}\sum_{m\in M}\sum_{n\in N}\left( ein_3^{p}(m,n)\cdot W_3^{p,q}(i-m,\,j-n)\right)}{Threshold_3}\cdot\left(1 - F_3^q\right) + n_3^q \qquad (6.8)

where ein_3^p(m,n) is the number of events per second coming to input port p of layer C3 coding address (m,n), W_3^{p,q} is the convolution map connecting feature map p with output feature map q, and Threshold_3 is the threshold selected for layer C3. In this expression we have incorporated two new variables: a forgetting factor F_3^q, which models leakage (events per second lost by one neuron belonging to feature map q), and a quantization noise factor n_3^q (quantization caused by thresholding). The forgetting mechanism is important because it "empties" the state stored in neurons (forgets), so that old information becomes irrelevant for the computations. In a certain way, the forgetting mechanism has a similar effect to the refractory time, as it limits the neuron's firing


output activity.

When using refractory times (to emulate the sigmoid functions in frame-based sys-

tems), the number of events fired by a neuron in this layer is limited as:

eout_3^q(i,j) = \min\left( eout_{3,sat}^q,\; eout_{3,lin}^q(i,j)\right) \qquad (6.9)

where eout_{3,lin} is computed as described in Eq. (6.8) and eout_{3,sat} is the maximum number of output events per second allowed by the imposed refractory time Tref3.
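As a quick numerical aid, the two expressions can be combined in a few lines; this is a sketch under the assumption that the saturation rate eout_{3,sat} is simply 1/Tref3 events per second.

// Sketch of eqs. (6.8)-(6.9): estimated output event rate of one C3 neuron.
#include <algorithm>

double outputRateC3(double weightedInputRate,   // sum over p,m,n of ein * W, in events/s
                    double threshold3,          // Threshold_3
                    double forgettingF,         // F_3^q
                    double quantNoise,          // n_3^q
                    double tref3)               // Tref3, in seconds
{
    const double lin = (weightedInputRate / threshold3) * (1.0 - forgettingF) + quantNoise;  // eq. (6.8)
    const double sat = 1.0 / tref3;             // assumed saturation rate allowed by Tref3
    return std::min(sat, lin);                  // eq. (6.9)
}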

Layer S4 implements subsampling in the same way as in layer S2. Output events

from layer S4 are connected to neurons in layer C5 in the same way as in layer C3 (see

eq. 6.8). Each event produced in layer C5 is replicated onto four different outputs using splitter modules. These output neurons are fully connected to the four output neurons in layer F6. Neurons at layer F6 will fire positive or negative events indicating whether or not the input has been categorized as the class coded by the firing neuron. In the system, there are two stages where refractory periods have been incorporated. In the frame-based system we considered sigmoid functions at layers C3, C5 and F6. However, in the AER implementation, only layers C3 and C5 needed refractory times, because the activity at layer F6 does not saturate but will be high and positive only for the desired output neuron. Besides, using too many or too long refractory times has the negative effect of saturating the firing frequency, thereby increasing the time to the first event and the separation between events.

6.5 Results

Due to the unavailability so far of the large number of filters that would be needed

in our implementation, the system was simulated again with our AERST simulator

tool but using real input stimuli from the AER electronic retina and the performance

figures from the physically available AER hardware [39][36][165]. The combination of

real sensory event-format data with performance figures of available AER devices allows

us to estimate reasonably well the performance of our AER-based ConvNet proposed

in this thesis work. The AER-based system was first trained and tested using a frame-

based version of our AER (frame-free) ConvNet, as indicated in the algorithm depicted

in Fig. 6.12.


Figure 6.12: Algorithm used to configure the system. First, the system was trained with the frame-based version. Then all the obtained weights were used in the frame-free system.

Several experiments were carried out with the system. The first set of experiments was implemented by downsampling the 128x128 input space obtained from the retina to a 32x32 input space. The second set of experiments was implemented by selecting a 64x64 square in the centre of the 128x128 input space and downsampling it to 32x32. The first set of experiments is called AER ConvNet with 32x32 pixel inputs and the second set is called AER ConvNet with 64x64 pixel inputs.

6.5.1 AER ConvNet with 32x32 pixel inputs

In this first experiment, we used the 128x128 pixel retina to collect events during

intervals of 30ms. These events were histogrammed into images and then downsampled

to 32x32 pixel images. This way we created a total of 262 images of people walking.

We rotated these images 90 and 180 degrees to create the corresponding images in

horizontal and up-side-down positions. Some of these images are shown in the top row

of Fig. 6.13.
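As an illustration of this preprocessing, the following MATLAB sketch histograms the events collected during one 30ms interval into an image and downsamples it to 32x32. It assumes an event list in the [Tprereqst Treqst Tack x y sign] format used by AERST, with addresses already in the range 1..128; variable names are illustrative.

% events: Nx6 matrix [Tprereqst Treqst Tack x y sign] (assumed format, addresses 1..128)
T0 = 0; Tw = 30e-3;                       % 30 ms collection interval
img128 = zeros(128,128);
for k = 1:size(events,1)
    if events(k,1) >= T0 && events(k,1) < T0 + Tw
        img128(events(k,4),events(k,5)) = img128(events(k,4),events(k,5)) + abs(events(k,6));
    end
end
img32 = zeros(32,32);                     % downsample by summing 4x4 blocks
for i = 1:32
    for j = 1:32
        block = img128(4*i-3:4*i, 4*j-3:4*j);
        img32(i,j) = sum(block(:));
    end
end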


Figure 6.13: a) Images corresponding to downsampling the 128x128 input stimulus to 32x32. b) Images obtained by cropping the input stimulus to a central square of size 64x64 and downsampling the cropped stimulus to 32x32.

Finally, to add some distractors, we used another set of 262 images of moving objects

recorded with the retina. Thus, we generated a database composed of 1048 images

representing a total of four different categories. 250 of these images were used to train

the frame-based version of the system and the rest were used for testing purposes. With

these images we obtained a 98% recognition rate with the training set and 93.2% with

the testing set on the frame-based version.

To map the trained frame-based version to an event-based version we proceed as

follows. In the AER-based system there are mainly three sets of parameters that have

to be set before using the system. These are:

a) The weights of the convolution masks of each AER convolution module. In the

frame-free implementation, all the weights that had been obtained during the training

of the frame-based version were used as weights in the frame-free implementation.

b) Thresholds. The threshold values to be used inside the convolution modules in the

AER system were chosen considering two limiting factors:

1. Threshold values should be low to provide a high output rate, thus speeding up

the system.


Table 6.2: Maximum kernel weights, threshold values, refractory times, layer times and events per second in the system

2. Threshold values inside the convolution modules must be higher than the con-

volution mask maximum weights in order to avoid high quantization noise. Otherwise, each neuron inside the convolution module affected by an incoming event would always reach the threshold value, thus generating undesirable output events.

These two considerations lead us to choose threshold values between 1.5 and 2 times

the maximum weight existing inside each convolution module.
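As a simple sketch of this rule (the kernels cell array is a hypothetical container chosen here for illustration), the threshold of each AER convolution module could be derived from its maximum absolute kernel weight as follows:

% kernels: cell array with the convolution masks of one module (hypothetical container)
factor = 1.75;                       % any value between 1.5 and 2 follows the rule above
maxw = 0;
for k = 1:numel(kernels)
    maxw = max(maxw, max(abs(kernels{k}(:))));
end
threshold = factor * maxw;           % threshold used inside the AER convolution module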

c) Refractory Times. To compute the refractory times in layers C3 and C5, we estab-

lished relations (see Appendix Section 6.6) between the sigmoid saturation values in

the frame-based version and the event rate in the AER-based version. Operating this

way we obtained the values 2.3ms and 17.5ms for the refractory times in layers C3 and

C5 respectively.

Table 6.2 shows the maximum kernel weights, the threshold values, and the refrac-

tory times used in each layer.

Once the main parameters have been set, the AER version can be tested using

different experiments. In the tests we are interested in getting positive activity at a high rate, thus minimizing the system response time and the time-to-first event. In our

system, the time-to-first event is considered to be the delay between the first input

event categorizing one person position and the first positive output event in the target

output corresponding to the input person position.

The AER-based system shown in Fig. 6.9 was tested with three different flows of

spikes of visual information. The first flow, corresponding to up position, was obtained

composing several data files of people walking, recorded with the AER motion sensing


retina. This flow had a duration of 9s. Note that if we assume a rate of 33 frames

per second in a frame-based version, 9s in the frame-free version would correspond to

approximately 297 frames. In the recording, several people appear at different times.

As the flow had 102572 spikes, this results in a total equivalent input firing rate of

11.4keps (kilo-events per second). Then, this flow of spikes (corresponding to the up

position) was rotated 90 and 180 degrees to create the other two flows for horizontal

and up-side-down positions. Note that rotating a flow of spikes in AER is very simple

as we only need to change the pixel address of each spike through simple operations

(as implemented in the rotator module). The three flows of activity created this way

were used as inputs in the frame-free system configured with the theoretical parameters

explained above.
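The address manipulation performed by the rotator module can be sketched in MATLAB as follows; the sketch assumes a 32x32 address space with coordinates 1..32 in the x and y fields of each event, and the variable names are illustrative.

% events_up: Nx6 event matrix [Tprereqst Treqst Tack x y sign] for the up position
N = 32;                                   % linear size of the address space
events_horiz = events_up;                 % 90 degree rotation: (x,y) -> (y, N+1-x)
events_horiz(:,4) = events_up(:,5);
events_horiz(:,5) = N + 1 - events_up(:,4);
events_updown = events_up;                % 180 degree rotation: (x,y) -> (N+1-x, N+1-y)
events_updown(:,4) = N + 1 - events_up(:,4);
events_updown(:,5) = N + 1 - events_up(:,5);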

Fig. 6.14 shows the spike sequence of the three input flows. Fig. 6.15 shows the

activities of the four output channels of the system when each of the three inputs is

used. Positive events in a particular output channel indicate that the system recognizes

input events as belonging to the category represented by that output channel. Negative

events indicate the opposite. As can be observed, the system was able to recognize

the people position in the three cases (up, horizontal and up-side-down).

Despite having theoretical values for the refractory times, we verified the per-

formance of the system when sweeping them systematically. We varied the refractory

times for layers C3 and C5 from 0 to 50ms. Note that 50ms is a time considered to be

excessively long. Defining the recognition rate as the ratio between the number of pos-

itive events in the target output channel, and the total positive output events collected

in the four output channels, the combination of refractory values in layers C3 and C5

that provided the highest recognition rate was 1.3ms and 16ms (note that these exper-

imental values are close to the theoretical values shown in Table 6.2). Higher refractory

periods provided better recognition but also a reduced number of output events and

consequently a slower response. Lower values for the refractory period in layer C5 (but

close to 16ms) also provided good recognition rates (and close to 98%). However, we

discarded these lower values for the refractory time as small variations provided highly

fluctuating recognition rates, thus increasing delay times and the time-to-first output

event. The selected combination of refractory times provided a recognition rate close

to 98% (97.92%) and a high number of output events per second (close to 100eps)


Figure 6.14: Input events used to test the system. The x axis represents time in seconds. The y axis represents the input event coordinates in a 32x32 pixel array, numbered from 0 to 1023.

(speeding up the system). In Fig. 6.16 we show the number of output events and the

recognition rate obtained when we vary refractory periods in layers C3 and C5.
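The recognition rate defined above can be computed directly from the recorded output events. A minimal MATLAB sketch follows; it assumes (as an illustration, not the actual test script) that the events of the four output channels have been retrieved into a cell array out_ch and that target is the index of the correct category.

% out_ch: 1x4 cell array with the event lists of the four output channels (assumed)
% target: index of the channel coding the correct category
pos = zeros(1,4);
for c = 1:4
    pos(c) = sum(out_ch{c}(:,6) > 0);    % count positive events in channel c
end
recog_rate = pos(target) / sum(pos);     % ratio defined in the text above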

A different measure that can be used for computing the recognition rate is the

percent of time in which activity is positive on the correct output channel. However,

this measure is not realistic as there are periods with very low activity at the input

resulting also in a low activity at the output. Anyway, looking at the output flows in Fig.

6.15, it is clear that the system performed correctly almost all the time. The system

misclassified mainly up positions (classifying them as up-side-down). In most cases, this occurred when people were moving in or out of the field of view. This is reasonable as up and up-side-down positions are very similar, especially when only a few events are received during transitions. The minimum time-to-first event was

approximately 16ms. Considering that the input flow was approximately 11.5 keps, 16ms corresponds to about 184 input events. This value corresponds to an average percentage of effectively firing pixels of 5.66%. This percentage was computed using images created by collecting 184-event slices from the input flow and counting the number of


Figure 6.15: Output events corresponding to each one of the input flows: a) outputs when the input is the up position, b) outputs when the input is the horizontal position, c) outputs when the input is the up-side-down position.

active pixels in each image. The 184 events and the 5.66% value of effective firing pixels

lead to an approximate average of 3.17 events per pixel when considering a 32x32 input

array. Note that the retina output has an address space of 128x128, so that the 3.17

events per pixel computed here corresponds on average to 0.4 events per pixel in the

128x128 array (less than one event per pixel). The maximum firing rate in each output

channel led to minimum delays between events in the order of 15ms. These delays can

be considered low, but note that even so they are not determined by the computation

process within the ConvNet (in the order of microseconds [165]), but by the reduced

input firing frequency provided by the retina. With the development of new faster


Figure 6.16: Recognition rate and number of output events per second obtained by varying the refractory times in layers C3 and C5.

electronic retinae it will be possible to achieve shorter delays, as recent convolution

chips have delays between input and output flows in the order of microseconds (or

fraction) [165].

A second experiment analyzed the system response when the input flow was alternated among the three different people positions. The input (up) flow was rotated after a certain random time (0.5s on average) to create the other two positions (horizontal and up-side-down). The new flow lasted 7.3s. To show how

critical refractory times are, Fig. 6.17(a) shows the input and output flows obtained

when they are not used in the simulations. Note that the output channels respond

producing double sign activity (positive and negative), which makes the system unable

to recognize the input pattern. For clarifying purposes, we have represented input

events corresponding to up position as ‘5’. Events corresponding to horizontal position

are represented as ‘6’ and up-side-down position is represented as ‘7’. The different


outputs are shown in the same figure. Positive and negative output events for the

output channel identifying up position are represented with ‘1’ and ‘-1’ respectively.

Output events identifying horizontal position are represented with ‘2’ and ‘-2’. Output

events corresponding to up-side-down position are represented with ‘3’ and ‘-3’. Finally,

the events corresponding to the noise output are represented as ‘4’ and ‘-4’.

In Fig. 6.17(b) the system has been tested using a fixed refractory time (Tref3)

of 1.3ms for layer C3 and 9ms for layer C5 (Tref5). Here we are representing only

the positive activity of the output channels. Categories of input events are shown

by the blue line. Values ‘1’, ‘2’ and ‘3’ correspond to up, horizontal and up-side-down

positions, respectively. Output events corresponding to the up category are represented

with blue circles, horizontal category output events are represented by red crosses, up-

side-down category events are represented by green stars, and noise category events by

black points. Note that this time the system is able to track the input stimulus at any

time. Only a few wrong events are produced. Besides, the system never identifies the

people as noise, which is correct.

In Fig. 6.17(c) the system has been tested using a fixed refractory period (Tref3)

of 1.3ms for layer C3 and 18ms for layer C5 (Tref5). Note that this time the system

improves in accuracy producing a lower number of wrong events. However the total

number of events produced is lower, which means a degradation of the system speed.

So far, all the experiments have been implemented without considering forgetting

inside neurons. To add the forgetting mechanism to our system we have included

forgetting rates in layers C1, C3, C5 and F6. A forgetting rate Fl means that a state quantity of value Fl will be discharged each second in all neurons belonging to layer l.

The state stored (positive or negative) will always leak towards ‘0’. In our system, the

forgetting rates are denoted as F1, F3, F5 and F6. To evaluate how important these

forgetting rates are, we have varied the four variables between 0 and 35 nups (nano

units per second). This last value has been chosen empirically. It corresponds to null

output activity without refractory times.
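A minimal sketch of how such a forgetting rate can be applied inside a neuron, assuming the state is only updated when an event arrives (variable names are illustrative, and F is the forgetting rate of the layer in state units per second):

% state: neuron accumulated state; t_last: time of the previous update (seconds)
% F: forgetting rate of the layer; t_now: time of the incoming event; w: event contribution
leak = F * (t_now - t_last);                 % amount of state forgotten since the last event
if state > 0
    state = max(state - leak, 0);            % positive state leaks towards '0'
else
    state = min(state + leak, 0);            % negative state leaks towards '0'
end
state = state + w;                           % then add the incoming event contribution
t_last = t_now;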

In Fig. 6.18 we show the recognition rates and Number of Output Events obtained

when varying forgetting rates F1, F3, F5, and F6 (in pairs). As can be seen, compared

to the other forgetting rates, the sensitivity to parameter F5 is very high, as it causes the recognition rate and number of events to change abruptly when F5 is close to 26 nups.

The system is also very sensitive to F3 variations. The system performance degrades


Figure 6.17: a) Input and output activity when the input is alternated between up, horizontal and up-side-down positions. No refractory periods have been considered. Values '5', '6' and '7' correspond to up, horizontal and up-side-down respectively. Absolute values '1', '2', '3' and '4' correspond to the output channels identifying up-side-down, horizontal, up positions and noise respectively. b) Input and output activity when the input is alternated and a refractory period of 9ms is used in layer C5. The correct input event orientation is shown by the blue line. Values '1', '2' and '3' correspond to up, horizontal and up-side-down positions, respectively. Output events corresponding to the up category are represented with blue circles, with red crosses for the horizontal category, with green stars for the up-side-down category and black dots for the noise category. c) Input and output activity when a refractory time of 18ms is used in layer C5. d) Input and output activity when the simulated annealing algorithm is employed to obtain optimum parameters.

when F3 is close to 7nups. Finally F1 and F6 do not degrade the recognition in the

system very much, but their values affect considerably the number of output events,

which implies a slower system response. Thus we chose a 5nups value for F1 and F6.

For these experiments we fixed the refractory times and thresholds to the values shown


Figure 6.18: Recognition rate and number of output events obtained when varying forgetting rates F1, F3, F5, and F6. a) Results when varying F1 and F3. b) Results when varying F3 and F6. c) Results when varying F3 and F5. d) Results when varying F1 and F5.

in Table 6.2.

As the threshold, forgetting rate and refractory time parameters influence the system performance, both in terms of recognition rate and speed, variations of them

were analyzed jointly to obtain the optimum practical parameters maximizing accu-

racy and speed. To find optimum parameters we used the simulated annealing algorithm

[170]. Note that there are 10 free parameters: Threshold1, Threshold3, Threshold5,

Threshold6, F1, F3, F5, F6, Tref3, Tref5. The simulated annealing algorithm min-

imizes a cost function subject to lower and upper bounds on the parameters.

We used a cost function which penalizes a reduced number of positive output events

(under 90) and a reduced recognition rate (under 95%). For the refractory times we

imposed an upper bound of 5ms. Larger values are not necessary as the neurons include the forgetting mechanism.

good results. The best vector after 2000 iterations is shown in Table 6.3. Note that

in this set of parameters, the refractory times were driven to low values (almost ‘0’)


PARAMETER VALUE

Threshold1 0.62

Threshold3 3.12

Threshold5 6.4

Threshold6 1.5

F1 (nups) 17.46

F3 (nups) 0.21

F5 (nups) 21.17

F6 (nups) 32.6

Tref3 (ms) 2.23

Tref5 (ms) 0.95

Table 6.3: Parameter Vector obtained with the Simulated Annealing Algorithm

increasing the forgetting parameters, which is a very desirable effect.
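A sketch of the kind of cost function used, written as a MATLAB function, is shown below. The helper run_aerst_system, which would simulate the network and return the number of positive output events and the recognition rate for a given parameter vector, is hypothetical, and the penalty weights are merely illustrative.

function cost = sa_cost(params)
% params: [Th1 Th3 Th5 Th6 F1 F3 F5 F6 Tref3 Tref5], the 10 free parameters
[num_pos_events, recog_rate] = run_aerst_system(params);  % hypothetical simulation wrapper
cost = 0;
if num_pos_events < 90                      % penalize a reduced number of positive events
    cost = cost + (90 - num_pos_events);
end
if recog_rate < 0.95                        % penalize a reduced recognition rate
    cost = cost + 100*(0.95 - recog_rate);
end
end

A call such as simulannealbnd(@sa_cost, x0, lb, ub) (from MATLAB's Global Optimization Toolbox) could then search the bounded parameter space.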

The resulting activity obtained with these parameters can be seen in Fig. 6.17(d).

Note that again only a few wrong recognition events are produced. The main advantage

with this last version is that we have included the forgetting mechanisms, making the

system more biologically plausible. To a certain extent the forgetting mechanism complements the

refractory time effect. For illustration purposes negative events are also shown in this

figure. Note how the different output channels respond strongly with negative activity

when they are not the target outputs.

Table 6.4 shows the time-to-first event at the target output when the input flow is

changed between positions. The first column corresponds to the times at which the input events start to represent a new position. The second column shows the delay to obtain a positive

output event at the target output when the first input event corresponding to a new

position has been fed to the system. This delay was computed considering only the

positive output events in the channel identifying the correct input.

As Table 6.4 shows, the minimum time-to-first event properly identifying the position after a change of the people position in the input flow was 15.5ms (see Fig. 6.19, where a zoomed version of the simulation results between 5760ms and 5830ms is shown).


INPUT TRANSITION TIME (ms)    TIME-TO-FIRST OUTPUT EVENT (Delay) (ms)

0 82.3

512 39.3

957 141.8

1310 28

1633 32

1909 13.8

2152 59.2

2453 87.7

2819 21

3029 31.5

3297 69.6

3623 48

3934 51.4

4349 26.8

4820 37.4

5574 46.3

5794 15.5

6019 112.3

6339 26

6673 37.2

Table 6.4: Time-to-first output event after transitions of the input between up, horizontal and up-side-down positions.

6.5.2 AER ConvNet with 64x64 pixel inputs

In this set of experiments, the 128x128 input flow of events obtained from the retina

was not downsampled directly to 32x32. This time, only the events in a 64x64 window

containing the region of interest of the 128x128 stimulus address space (coding addresses

inside a 64x64 square) were collected to create the set of training images and the flow

of events for testing. Then, the coordinates of these new events coding a 64x64 address

space were modified to a downsampled space of 32x32. Note that, in contrast to the


Figure 6.19: Zoomed version of the simulation results of Fig. 6.17 between 5760ms and 5830ms.

previous set of experiments, images obtained by collecting these new events have higher

resolution (see Fig. 6.13(b)).

The first experiment was implemented again alternating the input between up,

horizontal and up-side-down positions. This time the input flow had a duration of 6.5s.

This first experiment was carried out without considering forgetting mechanisms. The

refractory period for layer C3 was 1.3ms and 18ms for layer C5. The results can be

seen in Fig. 6.20(a). As in the previous implementation, the system is able to track the input

people position with delays lower than 15ms. Note that only a few wrong events are

produced.

The second experiment was carried out using forgetting parameters. The simulated

annealing algorithm was used again. The best set of parameters after 2000 iterations

is shown in Table 6.5.

The input and output events obtained using this set of parameters are shown in Fig.

6.20(b). Note that the accuracy of the system has been improved with this set of

parameters and that the system is able to provide correct output even when there

are fast transitions at the input (see the abrupt change between 4.82s and 4.88s).

Again, note that the refractory periods have been reduced and part of their operation

is implemented by the forgetting mechanisms.


PARAMETER VALUE

Threshold1 0.6

Threshold3 1.11

Threshold5 6.17

Threshold6 1.75

F1 (nups) 18.68

F3 (nups) 4.08

F5 (nups) 0.015

F6 (nups) 15.58

Tref3 (ms) 0.72

Tref5 (ms) 5.96

Table 6.5: Parameter Vector obtained with the Simulated Annealing Algorithm

6.6 Appendix

6.6.1 Learning in Convolutional Networks

There are several approaches to automatic machine learning, but one of the most suc-

cessful approaches is called gradient-based learning. The learning machine computes

a loss function that measures the discrepancy between the “correct” or desired output

for a pattern and the output produced by the system. The simplest output loss function

that can be used in convolutional networks is the minimum mean squared error (MSE)

[4]. The loss function can then be computed as:

$E = \frac{1}{P}\sum_{p=1}^{P}\frac{1}{2}\sum_{q}\left(y_q - d_q\right)^2$   (6.10)

where $y_q$ is the output obtained at output port q, $d_q$ is the desired output for port q, and P is the number of training samples considered.

The loss function is minimized by computing the gradient with respect to all the

parameters in all the layers, and a simple and efficient procedure to compute it in a

nonlinear system composed of several layers of processing is to use the back-propagation

algorithm [4]. The standard algorithm must be slightly modified to take the weight

sharing into account. The weight sharing technique has the interesting side effect of

reducing the number of free parameters. An easy way to implement it is to first compute


Figure 6.20: a) Input and output activity when the input is alternated and a refractory period of 18ms is used in layer C5. Input events are shown by grey circles. Values '1', '2' and '3' correspond to up, horizontal and up-side-down positions respectively. Output events corresponding to the up category are represented with blue circles, output events corresponding to the horizontal position are represented by red crosses, output events corresponding to the up-side-down position are represented by green stars and the noise category by black points. b) Input and output activity when the simulated annealing algorithm is employed to obtain optimum parameters.

the partial derivatives of the loss function with respect to each connection. Then the

partial derivatives of all the connections that share a same parameter are added to

form the derivative with respect to that parameter. Before training, the weights are

initialized with random values drawn from a uniform distribution between $-2.4/F_i$ and $2.4/F_i$, where $F_i$ is the number of inputs (fan-in) of the unit to which the connection belongs

[4]. To train the system, the patterns are presented in a constant random order, and the

training set is typically repeated a certain number of times (epochs). At each learning


iteration, a particular parameter is updated according to the following update rule:

$w_k = w_k - \varepsilon_k\,\dfrac{\partial E^p}{\partial w_k}$   (6.11)

where $E^p$ is the error obtained for pattern p. The step sizes are not constant and are

computed as

$\varepsilon_k = \dfrac{\eta}{\mu + h_{kk}}$   (6.12)

where $\eta$ and $\mu$ are hand-picked constants and $h_{kk}$ is an estimate of the second derivative of the loss function E with respect to $w_k$. The larger $h_{kk}$ is, the smaller the weight

update. Once the system has been trained, the performance is estimated by measuring

the accuracy on a set of samples disjoint from the training set, which is called the test

set.
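A minimal MATLAB sketch of this update rule for a single shared parameter is shown below; the variable names are illustrative, and grad_shared is assumed to already be the sum of the partial derivatives of all connections sharing the weight.

% w: current value of the shared weight; grad_shared: dE^p/dw for the current pattern
% h_kk: running estimate of the second derivative of E with respect to w
eta = 0.001; mu = 0.02;          % hand-picked constants, as in eq. (6.12)
eps_k = eta / (mu + h_kk);       % per-parameter step size
w = w - eps_k * grad_shared;     % update rule of eq. (6.11)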

6.6.2 Computations in the Frame-Based System

6.6.2.1 Filtering layers

In filtering layers C1, C3 and C5, the state of a neuron $y^q_l(i,j)$ located at position (i, j) in feature map q (q = 1, ..., Q) belonging to layer l (l = 1, 3, 5) is computed as

$x^q_l(i,j) = \sum_{p\in P}\sum_{m\in M}\sum_{n\in N}\left(y^p_{l-1}(m,n)\cdot W^{p,q}_l(i-m,\,j-n)\right) + b^q_l$   (6.13)

$y^q_l(i,j) = A\tanh\left(S\cdot x^q_l(i,j)\right)$   (6.14)

where P is the number of input feature maps, Q is the number of output feature maps, $y^p_{l-1}(i,j)$ is the analog value of the input feature map p at position (i, j), $W^{p,q}_l$ is the convolution map connecting input feature map p with output feature map q, $b^q_l$ is the offset for the output feature map q, and A and S are constants. For simplification purposes in the hardware implementation, we have used zero values for all the biases $b^q_l$. For the first layer C1, P = 1 (the input image), Q = 6, $W^{p,q}_1$ is a set of 6 Gabor filters (coding 2 scales and 3 orientations) and $y^p_{l-1}$ is the input image. We have not used the hyperbolic tangent sigmoid function in neurons in this layer. In layer C3, P = 6, Q = 4 and $W^{p,q}_3$ is a set of 24 5x5 trainable filters. In layer C5, P = 4, Q = 8 and $W^{p,q}_5$ is a set of 32 5x5 trainable filters.


6.6.2.2 Subsampling Layers

For the second and fourth layers (S2, S4), the analog output values are computed as:

$y^q_l(i/2,\,j/2) = W^q_l\cdot\sum_{m=0}^{1}\sum_{n=0}^{1} y^q_{l-1}(i+m,\,j+n), \qquad i, j = 2n,\ n\in\mathbb{N}$   (6.15)

In subsampling layers, $W^q_l$ are fixed 1x1 averaging factors (of value 1 in our implementation) instead of convolution matrices.
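A one-line MATLAB equivalent of this subsampling, with the unit averaging factor used in our implementation and an input map assumed to have even dimensions (illustrative names):

% y_prev: feature map of the previous layer (even dimensions assumed)
Wq = 1;                                               % fixed averaging factor
y_sub = Wq * (y_prev(1:2:end,1:2:end) + y_prev(2:2:end,1:2:end) + ...
              y_prev(1:2:end,2:2:end) + y_prev(2:2:end,2:2:end));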

6.6.2.3 Full-Connection layer F6

In layer F6, the state of a neuron $y^q_6$ (q = 1, ..., 4) is computed as

$y^q_6 = A\tanh\left(S\cdot y^p_5\cdot W^{p,q}_6 + b^q_6\right)$   (6.16)

This time, $W^{p,q}_6$ are trainable 1x1 weights connecting each one of the 8 output neurons in layer C5 with each one of the 4 neurons in layer F6. Again, we have used zero values for the biases $b^q_6$. The error function can then be computed as:

$E = \frac{1}{2}\sum_{q}\left(y^q_6 - d^q_6\right)^2$   (6.17)

where $y^q_6$ is the output obtained at output port q and $d^q_6$ is the desired output for port q.

6.6.3 Computations in the Frame-free system

In order to develop some simplifications in the hardware implementation of the AER-

based system, we decided to consider non-linearities only in layers C3 and C5. In the

AER-based system, the output events per second in feature map q from neuron (i, j)

at layer l − 1 (denoted by $eout^q_{l-1}(i,j)$) are used as input events in the following layer (and are denoted by $ein^q_l(i,j)$). This way we have the following equivalence:

$ein^q_l(i,j) = eout^q_{l-1}(i,j)$   (6.18)


6.6.3.1 Filtering Layers

For filtering layers, the number of events per second that a neuron at the output

feature map q in layer l (l = 1, 3, 5) and position (i, j) fires if the refractory period is

not considered is denoted as $eout^q_{l,lin}(i,j)$ and is computed as:

$eout^q_{l,lin}(i,j) = \dfrac{\sum_{p\in P}\sum_{m\in M}\sum_{n\in N}\left(ein^p_l(m,n)\cdot W^{p,q}_l(i-m,\,j-n)\right)}{Threshold_l}\cdot\left(1 - F^q_l\right) + n^q_l$   (6.19)

where $ein^p_l(m,n)$ is the number of events per second coming to input port p (feature map p of the previous layer) in layer l coding the address (m, n), $W^{p,q}_l$ is the convolution map connecting feature map p with output feature map q and $Threshold_l$ is the threshold selected for layer l. In the expression we have incorporated two new variables to model the loss of events due to a forgetting factor $F^q_l$ (events per second forgotten by one neuron belonging to feature map q) and a quantization noise factor $n^q_l$ (caused by the use of a certain threshold).
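The per-event computation that eq. (6.19) summarizes can be sketched in MATLAB as follows. This is a simplified, illustrative routine (without refractory periods or forgetting, and assuming an odd-sized square kernel) that projects one incoming event onto the neuron states of one feature map and collects the addresses of the output events fired where the threshold is crossed:

% state: neuron states of feature map q; W: convolution mask (odd-sized, square); Th: threshold
% (xe, ye, se): address and sign of the incoming event
h = floor(size(W,1)/2);
out_list = [];                                  % [x y sign] of the output events generated
for i = -h:h
    for j = -h:h
        xi = xe + i; yj = ye + j;
        if xi>=1 && xi<=size(state,1) && yj>=1 && yj<=size(state,2)
            state(xi,yj) = state(xi,yj) + se * W(i+h+1, j+h+1);   % add kernel contribution
            if abs(state(xi,yj)) >= Th                            % threshold reached: fire
                out_list = [out_list; xi yj sign(state(xi,yj))];
                state(xi,yj) = 0;                                 % reset the neuron state
            end
        end
    end
end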

6.6.3.2 Subsampling Layers

In subsampling layers S2 and S4, each input event with address $(x_{IN}, y_{IN})$ is replicated to the corresponding output port but coding a new address $(x_{NEW}, y_{NEW})$ computed by

$x_{NEW} = \lfloor x_{IN}/2\rfloor;\qquad y_{NEW} = \lfloor y_{IN}/2\rfloor$   (6.20)

6.6.3.3 Sixth Layer F6

In this layer, each connection between output unit q and input unit p is done through

a trained weight $W^{p,q}_6$:

$eout^q_{6,lin} = \dfrac{\sum_{p\in P}\left(ein^p_6\cdot W^{p,q}_6\right)}{Threshold_6}\cdot\left(1 - F^q_6\right) + n^q_6$   (6.21)


6.6.4 Implementation of non-linearities and equivalences between the

frame-based and the AER-based implementation

After implementing non-linearities in the frame-free system, the real number of output

events per second in layer l and feature map q with position (i, j) is:

$eout^q_l(i,j) = \min\left(eout_{l,sat},\ eout^q_{l,lin}(i,j)\right)$   (6.22)

where $eout^q_{l,lin}(i,j)$ is computed as described in Eqs. (6.19) and (6.21), and $eout_{l,sat}$ is the maximum number of events per second allowed by the refractory period in layer l.

We can consider that each layer l is characterized by a layer time constant τl that

corresponds to the minimum time between events in one of the channels belonging to

the layer. This layer time constant τl is specified by the refractory period limiting the

firing activity at the outputs of layers C3 and C5, or by the maximum number of events

per second travelling in one channel belonging to one layer without refractory periods

(C1, S2, S4 and F6). Taking these ideas into consideration, in layers C3 and C5, the

layer time constants τ3 and τ5 are equal to the refractory periods (limiting the firing

activities) computed for these layers. In layers C1, S2, S4 and F6, as we do not use

refractory periods, time constants τ1, τ2, τ4 and τ6 are determined by the maximum

number of events per second travelling in these layers.

To compute the layer time constants τl for each layer in the AER-based system, we

have to compute the corresponding refractory periods (so that non-linearities equivalent

to sigmoid functions can be implemented). To do this, we have established mathematical

relations with the frame-based computations.

In the first layer, as we do not use non-linearities, we have no saturation in the

output units. Thus, we have computed τ1 experimentally. We have analyzed the output

obtained in all the units of layer C1 in the frame-free system and have computed the

minimum time separation between events corresponding to each output unit. Then

these values have been averaged over all the output units and the mean value obtained has been chosen as τ1. With these computations we obtained a layer time constant τ1 of 1.8ms, which corresponds to approximately 560 eps.

In layers S2 (and therefore in S4), events coming from previous layers are simply

replicated but modifying their coded address, as specified in eq. (6.20), in order to

reduce the input address space by four. For these layers, we do not use refractory


periods. However, the layer time constants τ2 and τ4 are determined by τ1 and τ3 and

are computed as:

τ2 = τ1/4; τ4 = τ3/4; (6.23)

Using eq. (6.23) we obtain a value for τ2 of 0.45ms.

In layers C3 and C5, we can use Eqs. (6.13) and (6.19) to relate the analog output

for one neuron belonging to a feature map q in position (i, j) in the frame-based version

with the number of output events fired by a neuron located in the same position, layer

and feature map in the AER-based version with the following expressions:

$eout^q_{3,lin}(i,j) = \dfrac{x^q_3(i,j)}{4\cdot Threshold_3\cdot\tau_2}\cdot\left(1-F^q_3\right) + n^q_3, \qquad \tau_2 = \dfrac{1}{ein_{2,max}}$   (6.24)

$eout^q_{5,lin}(i,j) = \dfrac{x^q_5(i,j)}{4\cdot A\cdot S\cdot Threshold_5\cdot\tau_4}\cdot\left(1-F^q_5\right) + n^q_5, \qquad \tau_4 = \dfrac{1}{ein_{4,max}}$   (6.25)

where τ2 and τ4 indicate the layer time constants of the previous layers, determined by the maximum firing rates ein2,max and ein4,max.

Note that the factors 4 are due to the 4-neighbour merging operation implemented in the previous subsampling layers. The factor A · S accounts for the fact that the output activity of units belonging to layer C3 is modulated by this constant (see eq. (6.14)) in the frame-based implementation.

To implement the non-linearities in the frame-free system we need first to compute

the saturation point in the AER-based version. To do this we can relate the analog

state for one neuron belonging to a feature map q in position (i, j) in the frame-based

version with the number of output events fired by a neuron in the AER-based version.

For this we can use eqs. (6.24) and (6.25), making $x^q_l(i,j)$ equal to the saturation point xsat in the frame-based version. For simplification purposes, the simple case without

forgetting ratio and quantization noise will be considered.

Taking these ideas into consideration, in layer C3 the maximum number of events

per second in the AER version can be computed as:

$eout_{3,sat} = \dfrac{x_{sat}}{4\cdot Threshold_3\cdot\tau_2}$   (6.26)


Figure 6.21: Computation of the saturation point of the hyperbolic tangent function. The function saturates when the absolute value of the argument is higher than 1.5283.

The hyperbolic tangent function computed as described in eq. (6.14) in the frame-based version saturates when the argument is greater than 1.528 (saturation point xsat), as shown in Fig. 6.21. Thus, considering that the τ2 value is 0.45ms, that xsat is approximately 1.528, and using the threshold value 2 for the third layer, we obtain an approximate saturation event rate in layer C3 of 427.8 eps. This value leads to a refractory period (third layer time constant τ3 = 1/eout3,sat) of approximately 2.3ms.

Using eq. (6.23) we get a value for τ4 of 0.58ms. As done in layer C3, τ5 in the

AER-based implementation can be computed by making $x^q_5(i,j)$ equal to the saturation point xsat:

$\tau_5 = \dfrac{Threshold_5\cdot\tau_4}{x_{sat}/(4\cdot S\cdot A)}$   (6.27)

Again, using the saturation point xsat equal to 1.528 and a threshold value of 10 for the fifth layer, we obtain an approximate refractory period in this layer τ5 of 17.5ms, corresponding to a saturation event rate of 57.15 eps.

Note that the selected experimental refractory period was 16ms, which is very close to

the theoretical computed value.
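These computations can be reproduced numerically with a few MATLAB lines; the numerical values are taken from the text, and A and S are only assumed to be the usual sigmoid constants (their product of about 1.14 reproduces the 17.5ms figure).

x_sat = 1.528;              % saturation point of the hyperbolic tangent (Fig. 6.21)
tau2  = 0.45e-3;            % layer time constant of S2 (seconds)
Th3   = 2;                  % threshold of layer C3
eout3_sat = x_sat / (4 * Th3 * tau2);           % eq. (6.26): about 424-428 eps
tau3  = 1 / eout3_sat;                          % refractory period of C3: about 2.3 ms
tau4  = tau3 / 4;                               % eq. (6.23): about 0.58 ms
A = 1.7159; S = 2/3;        % assumed sigmoid constants
Th5   = 10;                 % threshold of layer C5
tau5  = (Th5 * tau4) / (x_sat / (4 * S * A));   % eq. (6.27): about 17.5 ms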

In layer F6, we do not use refractory periods. However, considering the results

obtained, we can compute the minimum achievable time between events at the output

using the weights computed for this layer. The minimum time between events will

occur when events from the fifth layer cause all the weights to interfere constructively


Table 6.6: Refractory periods, layer times and maximum number of events per second computed for each layer in the system

in the output neurons. Then, we can compute the maximum number of events per

second as:

$eout^q_{6,max} = \dfrac{\sum_{p\in P}\left\|W^{p,q}_6\right\|}{Threshold_6\cdot\tau_5}$   (6.28)

With the weights computed for this layer, using a threshold of value 1.5 and τ5

equal to 16ms, we obtain a maximum firing rate of 111.1eps, which corresponds to a

minimum time between events of 9ms.

In Table 6.6 we show the refractory periods, layer times and events per second

obtained at each layer. Using the expressions described in eqs. (6.26), (6.27) and

(6.28) we can configure our system in order to obtain certain desired system times

(and corresponding events per second at each layer) according to the input firing rate

provided by the AER retina, the thresholds selected and the weight values computed

at each layer.


Chapter 7

CONCLUSIONS

Throughout this thesis work we have presented the event-driven software simulator tool

AERST implemented in Matlab and C++ to simulate systems based on AER.

The AERST tool allows the user to simulate easily any system described by a netlist

where the modules together with their parameters are listed. AERST is able to process

around 20 Keps. It allows us to simulate complex and hierarchically-structured systems

before the available hardware technology becomes mature. AERST can be used to test

new AER processing modules within large systems, and thus orient hardware devel-

opers on what kind of AER hardware modules may be useful and what performance

characteristics they should possess.

Throughout this thesis work we have presented three large multi-module multi-

layer convolutional neural networks. They have been simulated considering available

AER hardware modules modeled with their performance figures. The first application

is intended to recognize characters even when they present slight deformations. The

second system implements texture classification for image retrieval and the third one

is a large neural network trained using backpropagation to implement people posture

recognition. In the three systems, the results clearly show the high speed and the possibility of implementing complex processing systems that AER provides. Besides, in all the

systems recognition was achieved while the sensory stimulus was still being generated.

In general, processing delay in feed-forward event-based systems is given mainly by

the number of layers and the number of events needed to represent the input stimulus.

In some of the systems implemented, the processing capability could be increased by

adding more modules per layer working in parallel and without increasing the number


of layers. This means that the processing capability is increased without penalizing

delays, although at the cost of adding hardware.

Currently, the available AER hardware modules are quite preliminary, although

their performance figures provide very promising system level performance estimations.

In the future, AER hardware design is aimed at miniaturizing present AER modules

so that a large number of them (several hundred) could fit on a single PCB or in

a large NoC die. Also, such multi-module elements should allow a large degree of

reconfigurability and reprogrammability, so that many different applications can easily

be set up.


Appendices


Appendix A

AERST Tool User Guide

A.1 Introduction

The purpose of AERST is to simulate systems composed of several interconnected AER

modules. It is a simple and open MATLAB and C++ based simulator. A basic library

of AER modules has been developed. The user can easily add new AER modules to

the library or elaborate the available modules with different levels of complexity. In

this Appendix a guide of how to use the tool together with a step-by-step example is

provided.

A.2 Description of an AER System

An AER system is described in AERST by a netlist given in a configuration file. In

this netlist the AER system is composed of sources, channels and modules.

In the netlist channels are identified by integer numbers {1,2,3,...,N}. All input

sources to the system are defined in a single line beginning with the word sources.

Then, between brackets the user indicates the channel numbers the input sources are

connected to. Finally, the user has to indicate between brackets the name of the

MATLAB files (mat files) containing the list of source events.

The sources command has the following structure:

sources {1, 2, ..., N}{datasource1, datasource2, ..., datasourceN} (A.1)


In the Matlab implementation, the list of events belonging to a source or channel

has the structure of a two-dimensional matrix. Each row of the matrix corresponds

to one event. In our implementation each event contains six fields. The first three

correspond to timing information of the event, while the last three correspond to data

transmitted by the event:

[Tprereqst Treqst Tack x y sign] (A.2)

The data fields are irrelevant for the simulator, and only need to be interpreted

properly by the modules (instances) receiving and generating them. For the particular

cases we describe in this thesis we have always used the same three fields: ‘x′ and ‘y′

represent the coordinates or addresses of the pixel originating the event and ‘sign′ its

sign. The three timing fields are as follows: ‘Tprereqst’ represents the time at which

the event is created at the emitter instance, ‘Treqst’ represents the time at which the

event is processed by the receiver instance, and ‘Tack’ represents the time at which

the event is finally acknowledged by the receiver instance. We distinguish between a

pre−Request time and an effective Request time. The first one is only dependent on

the emitter instance, while the second one requires that the receiver instance is ready

to read and process an event request. This way, we can provide as source a full list of

events which are described only by their data fields and pre-Request times.

Times in AERST are double numbers without units. This means that the unit

used in all the times appearing in parameters and events information is a choice for

the user. Is the user who determines the meaning of such numbers. The tool only uses

these numbers and processes them according to the parameters and the events timing

information provided. The user has to check that parameters and the events timing

information are correspondent.

In the C++ implementation, the sources are provided as matrixes (as in the MAT-

LAB implementation) to the tool. A matlab function called write2file is provided to

convert the sources to text files. However, the internal representation of events in the

C++ implementation is a bit different. The events are stored in lists where each el-

ement has two fields. The first field contains the information of the event, which is

stored in a vector with the six components previously described. The second field is a

pointer to the following event to be processed in the list. The use of lists and pointers


is totally transparent to the user. The number of fields inside an event can be increased

using more fields in the row vector. This is an open issue left to the user.

After declaring the sources, the instances are listed. The syntax to declare a system

module is:

module name {input channels} {output channels} {parameters file} {state file}   (A.3)

In the Matlab version, file extensions are omitted as all source, parameter and state

files have the .mat extension (Matlab format). However, in the C++ implementation

files can have different extensions. Therefore, it is necessary to include their extensions

in the file names.
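As an illustration, a minimal hypothetical netlist combining one source and two chained modules could look as follows; the module, source and file names are invented for the example, and the modules are simply assumed to exist in the user's library:

sources {1}{myevents}
convolution1 {1}{2}{pars_conv}{state_conv}
splitter1 {2}{3,4}{pars_split}{state_split}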

An instance is described by an independent function whose name is identical to the

instance name (module name) that will appear in the netlist file describing the system.

The declaration format of a MATLAB instance (first line of a Matlab function) is

the following:

[new_event_in, events_out, new_state, new_time, port_out] = module_f(event_in, pars, old_state, old_time, port_in)   (A.4)

event_in corresponds to the present event information (as in eq. A.2) sent through the channel. The event_in information passed to the function as input parameter contains the x and y coordinates of the event being processed and its Tprereqst time. The updated new_event_in returned by the function also contains the established Treqst and Tack times. old_state and new_state represent the instance state before and after processing the event. old_time and new_time are the global system times before and after processing the event. events_out is a list of output events produced by the instance at its different output channels. port_in is the port number from where the event has entered the module and port_out is a list of numbers identifying the output ports where each of the output events created will be written. These output events (which are still unprocessed events) are included by the simulator in their respective channel matrices with Tprereqst set to the present actual time, and at a later time they should be processed by

their respective destination instances.

In the C++ implementation, the syntax is similar but with some differences:


int instance_name(double *event_in, double ***events_out, params2 params, params2 *state, double *timeact, int port_input, int **port_output, int tam_vect)

event_in is a pointer to the input event vector. It is received as a pointer so it can be modified and updated inside the module. The input event is given back by the function but the event fields Trqst and Tack have been updated.

params are the module parameters. They are stored in a struct where each field corresponds to one parameter. The structure of params has been defined in an initialization routine.

state is the internal state of the AER module before processing the incoming event. The variables are stored in a struct where each field corresponds to one state variable. The structures of the state variables have been defined in an initialization routine.

timeact is a pointer to the actual simulation time before and after processing the event.

port_input is the port through which the event is received.

events_out is a pointer to an empty two-dimensional array. It returns the events generated (if any) in the module output ports during the execution of the function. The module has to reserve dynamically the memory space needed for this array. When the events are created and stored in the array they have their Trqst set to '0', their Tack set to '-1' and their Tprerqst set to the simulation time. The generated events are added to the event lists in the corresponding channels.

port_output is a pointer to a vector of port numbers indicating the output port where each event in events_out belongs.

tam_vect is the number of fields in an event. In the format we have used this field was set to '6'. However, the user can increase this number.

The C++ function returns an integer value corresponding to the number of output events that have been generated by the processing module.

A.3 MATLAB Initialization of Parameters and States

As will be further explained in the next sections, the files containing the parameters of

each AER instance (parameters file) and their initial internal states (state file) are

created each time the simulator AERST is invoked. The user has to create and/or


update the initialization routines beforehand. These initialization routines are invoked

at the initial step of the simulation and have to be customized for the particular AER

modules listed in the configuration file. To illustrate how to create the parameter and

state files for a certain application, we show below some example initialization routines.

A.3.1 Initialization of Parameters

The example code to create a parameter file for an AER module that rotates an input

visual flow in a counterclockwise direction is:

function []=pars_file_rotate()
%SAMPLE PARAMETERS INITIALIZATION ROUTINE
direction=1; %SET THE OPTION FOR A COUNTERCLOCKWISE ROTATION DIRECTION
size1=128; % X DIMENSION IMAGE SIZE
size2=128; % Y DIMENSION IMAGE SIZE
save pars_rotate direction size1 size2

A.3.2 Initialization of States

Below, we show a sample initialization routine to create an initial internal state file for an AER module. The example initialization routine creates a file (state1 in this case) containing two all-zero matrices (one 128x128 called J and one 32x32 called times) and a scalar state variable (potential_value). The state file created can be shared by several AER modules with matched sizes.

function []=initstate1()
%INITIALIZATION OF THE INTERNAL STATE PARAMETERS OF AN AER MODULE
size1_1=128;
size1_2=128;
size2_1=32;
size2_2=32;
potential_value=0;%FIRST SCALAR STATE VARIABLE
J=zeros(size1_1,size1_2);%FIRST MATRIX STATE VARIABLE
times=zeros(size2_1,size2_2);%SECOND MATRIX STATE VARIABLE
save state1 J times potential_value%CREATE STATE FILE


A.4 RUNNING AERST in MATLAB

AERST in MATLAB is called by the user in the command prompt to execute a simu-

lation as:

[VARS, CHANNELS] = AERST()

Before running a simulation the user has to customize the AERST initialization

contents. The user has to:

1. Store the name of the configuration file in the variable CONFIG_FILE. The configuration file contains the netlist of the AER system to be simulated.

2. Store the name of the output file in the variable OUTPUT_FILE. The output file is a text file that will be created during the simulation. The simulator stores there the data of the events generated in the system channels.

3. List the initialization routines. These routines create the parameter and state files (parameters files and state files) that contain the parameters and internal states used inside the AER modules used in the system. The proper initialization routines for the current system netlist have to be created and/or updated by the user before each simulation.

At the end of the AERST file the call to the main function (main_aerst) is invoked. Below is an example showing a system described by a configuration file conf_file.txt and an output file outfile.txt. The system receives a source called myevents.mat and it has 9 channels. The initialization routines are pars_file1, pars_file2, pars_file3, pars_file4, initstate1, initstate2 and initstate3:

function [VARS,CHANNELS] = AERST()
%USER CONFIGURATION FILE:
CONFIG_FILE='conf_file.txt';
%USER OUTPUT FILE:
OUTPUT_FILE=('outfile.txt');
MAT_SOURCE='myevents.mat';
NUMB_CHANNELS=9;
%GO TO THE PARAMETER FOLDER AND CREATE PARAMETER AND STATE FILES
cd ./CREATE_PARAMETERS;
%USER PARAMETER AND STATE FILES:
pars_file1;%USER PARAMETER FILE 1
pars_file2;%USER PARAMETER FILE 2
pars_file3;%USER PARAMETER FILE 3
pars_file4;%USER PARAMETER FILE 4
initstate1;%USER STATE FILE 1
initstate2;%USER STATE FILE 2
initstate3;%USER STATE FILE 3
copyfile('*.mat','../AERST_MAIN');
cd ../CONFIG_FILES
copyfile(CONFIG_FILE,'../AERST_MAIN');
cd ../SOURCES
copyfile(MAT_SOURCE,'../AERST_MAIN');
cd ../AERST_MAIN
%THE MAIN SIMULATOR FUNCTION IS INVOKED
VARS = main_aerst(CONFIG_FILE,OUTPUT_FILE);
delete('*.mat');
copyfile(OUTPUT_FILE,'../ALG_REC');
delete('*.txt');
cd ../ALG_REC
%FINALLY, THE EVENTS IN ALL CHANNELS ARE RETRIEVED
[CHANNELS]=disktocell2(NUMB_CHANNELS,0,0,OUTPUT_FILE);
delete('*.txt');
cd ..

The AERST function returns two variables: CHANNELS and VARS. CHANNELS is a cell array where each element stores all the events that have travelled in one channel (a usage sketch is given after the list below). VARS is a cell array where each row contains the information regarding one particular AER module in the system. Each row in VARS has five columns:


1. First column contains the name of the corresponding AER module.

2. Second column contains the number of the input channels connected to that

module.

3. Third column is a structure containing the parameters of the AER module.

4. Fourth column is a structure containing the final internal states of the AER

module.

5. Fifth column contains the number of the output channels connected to that module.
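As an illustrative sketch (the row index 3 and the variable names on the left are arbitrary; cell indexing of VARS is assumed), the information of one module can be inspected after a simulation as follows:

module_name   = VARS{3,1}; % name of the third AER module in the netlist
in_channels   = VARS{3,2}; % numbers of its input channels
module_pars   = VARS{3,3}; % structure with its parameters
module_states = VARS{3,4}; % structure with its final internal states
out_channels  = VARS{3,5}; % numbers of its output channels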

A.4.1 Building Modules

To build a MATLAB user defined module the user has to respect the format for the

declaration of functions in the simulator. The user also has to provide the outputs to

the module. An example for one simple function is the following:

function [new event in,events out,new state,new time,port out]=myfunction(event in,pars,old state,old time, port in)

new event in=event in;

new state=old state;

new time=old time;

% USE THE INCOMING EVENT TO GET (x,y) COORDINATES AND sign

x = event in(1,4);

y=event in(1,5);

sign = event in(1,6);

size1 = pars.size1;% EXAMPLE OF HOW TO USE PARAMETER 1

size2 = pars.size2;% EXAMPLE OF HOW TO USE PARAMETER 2

new state.J(x,y) = old state.J(x,y)+1*sign; % WE UPDATE THE INTERNAL STATE

new event in(1,2) = new time; % WE UPDATE THE REQUEST TIME OF THE INCOMING

EVENT

new time = new time+10e-9; % A TEN NANOSECONDS DELAY IS INTRODUCED


new event in(1,3) = new time; % THE FINAL TIME IS USED TO SET THE ACKNOWL-

EDGE TIME

events out = [new time 0 -1 x y sign; new time 0 -1 x y sign]; % AN OUTPUT EVENT PER

OUTPUT CHANNEL WITH THE SAME X,Y,SIGN IS GENERATED

port out=[1 2]; %THE TWO EVENTS ARE WRITTEN IN PORTS ONE AND TWO.

In the above function, the incoming event is updated with a Trqst value equal to the initial time and a Tack equal to the final time. Updating Trqst and Tack inside the AER module is not mandatory: by default, the simulator updates Trqst and Tack to the simulator time before the execution of the module function. If one or more events are created, it is necessary to indicate the output port for each event in port out.

A.5 C++ Initialization of Parameters and States

In the C++ implementation, similar Matlab functions can be used to create the parameter and state text files. The steps needed to build an initialization file for C++ are the following:

1. Create parameters as described in the Matlab initialization function.

2. If there are matrices, indicate in rows the number of rows and columns of each matrix. Use the first row in rows to indicate the number of rows of each matrix. Use the second row in rows to indicate the number of columns.

3. Create and open the parameter (or state) file.

4. Write key expression #doubles if there are scalar parameters.

5. Write the number of scalar parameters (num doubles) in the file.

6. Write sequentially each of the scalar parameters in the file.

7. Write key expression #matrices if there are matrices used as parameters.

8. Write the number of matrices (num matrices) in the file.

9. Write rows in the file to indicate rows and columns of each matrix.

10. Write sequentially each of the matrices in the file.

11. Close the parameters (or state) file.


Note that by creating the initialization files in this way the memory space used by the application is managed efficiently, as the tool reserves only the number and size of the parameters indicated.

A.5.1 Initialization of Parameters

The parameter initialization file corresponding to the Matlab initialization procedure described above is the following:

function []=pars file rotate()

%SAMPLE PARAMETERS INITIALIZATION ROUTINE FOR A ROTATION AER MODULE

direction=1; %SET A COUNTERCLOCKWISE ROTATION DIRECTION

size1=128; % X DIMENSION IMAGE SIZE

size2=128; % Y DIMENSION IMAGE SIZE

%START THE C++ INITIALIZATION:

s2=‘pars rotate.txt’; %name of the file with the parameters

num doubles=3; %number of double parameters

%WRITE TO THE C++ txt PARAMETERS FILE:

fid=fopen(s2,‘w’); %OPEN PARAMETERS OUTPUT FILE

fprintf(fid,‘#doubles\n’); %WRITE KEYWORD #doubles

fprintf(fid,‘%d\n’,num doubles);

fprintf(fid,‘%f ’,direction);

fprintf(fid,‘%f ’,size1);

fprintf(fid,‘%f\n’,size2);

fclose(fid);

Note that for the parameters file there are only 3 scalar parameters and there are

no matrices. This implies that we do not have to indicate the number of them nor the

number of rows and columns (stored in rows matrix).
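For reference, and assuming the fprintf calls above behave as written, the generated pars rotate.txt would contain approximately the following (the exact number formatting depends on the %f conversions):

#doubles
3
1.000000 128.000000 128.000000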

A.5.2 Initialization of States

For the state file, the C++ initialization function is:


function []=initstate1()

%INITIALIZATION OF THE INTERNAL STATE PARAMETERS OF AN AER MODULE

size1 1=128;

size1 2=128;

size2 1=32;

size2 2=32;

potential value=0;

J=zeros(size1 1,size1 2);

times=zeros(size2 1,size2 2);

%START THE C++ INITIALIZATION

num doubles=1;

num matrices=2;

%NOW, WE STORE THE NUMBER OF ROWS OF EACH MATRIX IN ROWS:

rows=[size1 1 size2 1;size1 2 size2 2];

%WRITE TO THE C++ STATE FILE:

s2=‘state1.txt’; fid=fopen(s2,‘w’); %OPEN STATE OUTPUT FILE

fprintf(fid,‘#doubles\n’); %WRITE KEYWORD #doubles

fprintf(fid,‘%d\n’,num doubles);

fprintf(fid,‘%f\n’,potential value);

fprintf(fid,‘#matrices\n’); %WRITE KEYWORD #matrices

fprintf(fid,‘%d\n’,num matrices);

fprintf(fid,‘%d\n’,rows);

fprintf(fid,‘%f\n’,J);

fprintf(fid,‘%f\n’,times);

fclose(fid);

Note that this time we have one scalar parameter (potential value) and two matrices (J and times). For each matrix, rows stores its number of rows and columns.
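For reference, and assuming the fprintf calls above behave as written, state1.txt would start approximately as follows (the matrix values, all zeros here, are written one per line and are mostly omitted in this sketch):

#doubles
1
0.000000
#matrices
2
128
128
32
32
0.000000
0.000000
...

(the remaining 128x128 values of J are followed by the 32x32 values of times).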

A.6 RUNNING AERST in C++

In the C++ implementation we should provide the source txt files and the initialization

files for the parameters and state variables. A function for creating the sources in txt


format from the Matlab sources stored as matrices is provided, called write2file. The calling format is the following:

write2file(input matrix,text file);

input matrix is a two-dimensional matrix with the source events and text file is the

name of the destination txt file.

AERST in C++ can be called from MATLAB by the user in the command prompt

to execute a simulation as:

[CHANNELS] = AERST()

Before running a simulation the user has to customize the AERST initialization

contents. The user has to:

1. Provide the sources to the system (stored in txt files).

2. Store the name of the configuration file in the variable CONFIG FILE. The configuration file contains the netlist of the AER system to be simulated.

3. Store the name of the output file in the variable OUTPUT FILE. The output file is a text file that will be created during the simulation; in it the simulator stores the data of the events generated in the system channels.

4. List the initialization routines. These routines create the parameter and state files (parameter files and state files) that contain the parameters and internal states used inside the AER modules used in the system. The proper initialization routines for the current system netlist have to be created and/or updated by the user before each simulation.

Below is an example showing a system with 9 channels described by configuration

file conf file.txt and an output file outfile.txt. There is one source called myevents in

MATLAB format, which is converted to txt format using write2file. The initialization

routines are called pars file1, pars file2, pars file3, pars file4, initstate1, initstate2 and

initstate3 :

function [CHANNELS]=AERST()
%MATLAB ARRAY OF SOURCE EVENTS:


MAT SOURCE=‘myevents.mat’;
%CPP TXT SOURCE FILE:
TXT SOURCE=‘myevents.txt’;
%USER CONFIGURATION FILE:
CONFIG FILE=‘conf file.txt’;
%USER OUTPUT FILE:
OUTPUT FILE=‘outfile.txt’;
NUMB CHANNELS=9;

%CODE TRANSPARENT TO THE USER:
%LOAD SOURCE AS A MATLAB EVENT ARRAY AND CONVERT IT TO TXT FORMAT:
cd ./SOURCES
load(MAT SOURCE)
write2file(MAT SOURCE, TXT SOURCE); %convert from mat to txt format
copyfile(TXT SOURCE,‘../AERST MAIN/’);
cd ..
delete(TXT SOURCE);

%GO TO THE CONFIGURATION FILES FOLDER AND COPY THE CONFIG FILE TO THE MAIN FOLDER
cd ./CONFIG FILES
copyfile(CONFIG FILE, ‘../AERST MAIN/config file.txt’);

%GO TO THE PARAMETER FOLDER AND CREATE PARAMETER AND STATE FILES
cd ../CREATE PARAMETERS
%USER PARAMETER AND STATE FILES:
pars file1;%USER PARAMETER FILE 1
pars file2;%USER PARAMETER FILE 2
pars file3;%USER PARAMETER FILE 3
pars file4;%USER PARAMETER FILE 4
initstate1;%USER STATE FILE 1
initstate2;%USER STATE FILE 2
initstate3;%USER STATE FILE 3
copyfile(‘*.txt’,‘../AERST MAIN’);%IN THE C++ SIMULATOR PARAMETERS AND STATES ARE STORED IN TXT FILES
delete(‘*.txt’);

%CALL THE C++ MAIN PROGRAM:
cd ../AERST MAIN
system(‘./AERST.exe config file.txt out1.txt’);

copyfile(‘out1.txt’,‘../ALG REC’);
delete(‘*.txt’);
cd ../ALG REC
copyfile(‘out1.txt’,OUTPUT FILE);

%FINALLY, THE EVENTS IN ALL CHANNELS ARE RETRIEVED:
[CHANNELS]=disktocell3(NUMB CHANNELS,0,0,OUTPUT FILE);
delete(‘out1.txt’);
delete(OUTPUT FILE)
cd ..

Note that now the parameter and state files are stored in text files. Note also how

at the end of the AERST file the call to the main function is invoked (AERST.exe).

The aer tool function returns CHANNELS, which is a matrix of cells where each element

stores all the events that have travelled in one channel.

A.6.1 Building C++ Modules

To build a user-defined module the user has to respect the format for the declaration of functions in the simulator. The user has to allocate dynamically the memory needed for the output events and for the output port vector in case there are output events.

An example for one simple function that replicates each input event in two different

output ports and accumulates the event on a state array J is the following:

int example function (double *event in, double ***events out, params2 params, params2 *state, double *timeact, int port input, int **port output, int tam vec)
{
int i, c, numb ports, numb events;
double timedelay, timetoprocess, **tt, **J;
// Get parameters
timedelay = params.par doub[0]; // First parameter: delay of the asynchronous communication
timetoprocess = params.par doub[1]; // Second parameter: time to process one event
numb ports=(int)params.par doub[2]; // Number of output ports
// Update incoming event timing information
event in[1]=*timeact; // We set the request time for the input event
*timeact=*timeact+timedelay; // We compute the present time for the module
event in[2]=*timeact; // We acknowledge the input event
*timeact=*timeact+timetoprocess; // We compute again the new present time in the module
// Update neuron state addressed by the incoming event
J=state->p[0]; // We load the state array J
J[event in[3]][event in[4]]=J[event in[3]][event in[4]]+1*event in[5]; // We accumulate the event in the array
// Start dynamic management of memory to create the output events
tt = new double *[numb ports]; // We create dynamically the set of events
for(i=0;i<numb ports;i++){
tt[i]=new double[tam vec]; // We create dynamically each event of size specified by tam vec
}
*events out=tt; // events out is the pointer to the array provided as output
*port output=new int[numb ports]; // We reserve space for the output port vector. Each element indicates in
// which output port each event must be written.
for (i=0;i<numb ports;i++){ // We initialize each event
(*events out)[i][0]=*timeact;
(*events out)[i][1]=0;
(*events out)[i][2]=-1;
(*events out)[i][3]=event in[3];
(*events out)[i][4]=event in[4];
(*events out)[i][5]=event in[5];
(*port output)[i]=i+1;
}
numb events=numb ports;
return numb events; // We return the number of events
}

A.7 Matlab Auxiliary Functions

Auxiliary functions have been developed to help the user to create AER matrix sources

to be provided to the system from a specified standard image and to recover a standard

image from the AER events obtained after a simulation.

A.7.1 Generation of AER events from a standard image

To create an AER matrix from a specified standard image there are three methods

provided. These methods implement the random, scan and uniform methods proposed

by A. Linares-Barranco in [169]. The auxiliary functions implementing the algorithmic

approximations to generate the events are located in the directory ‘./ALG GEN’ and


are the following:

[CIN] = randimage(I, prectemp, max events)

[CIN] = scanimage(I, prectemp, max events)

[CIN] = unifimage(I, prectemp, max events)

Their input parameters are I, which is the image coded in gray scale, prectemp, the minimum time interval between consecutive events, and max events, the maximum number of events per frame. The output CIN is a matrix containing the whole sequence of events representing image I, in the format required by AERST.
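As a usage sketch (the image file name is illustrative, and the grayscale conversion shown is only one possible preprocessing), the three generators can be called as follows:

I = double(imread('myimage.png'));   % load a test image (illustrative file name)
if ndims(I) == 3, I = mean(I,3); end % keep a single gray-scale channel if the image is RGB
CIN1 = unifimage(I, 100, 400000);    % events distributed uniformly in time
CIN2 = scanimage(I, 100, 400000);    % events generated in scan order
CIN3 = randimage(I, 100, 400000);    % events generated in random order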

A.7.2 Reconstruction of images from channels

To reconstruct an image from channel events, the auxiliary function is reconstaer back

located in the directory ‘./ALG REC’.

The function is invoked with the following format:

[J]=reconstaer back(CIN, size1, size2),

where CIN is the AER events matrix with the appropriate format, size1 is the

image X dimension and size2 is the image Y dimension. J is the reconstructed image

coded in gray scale.

A.7.3 Reconstruction of channels from the text output file

To reconstruct channel events from the text output file, the auxiliary function is

disktocell2, which is located in directory ‘./ALG REC’. This function recovers the

events at one channel or at all the channels stored in the text output file.

The function is invoked as:

[AC]=disktocell2 (N, numb, flag, output file),

where N is the number of channels in the AER system, numb is the number of

the channel to be recovered, flag is a logic variable and output file is the text output

file. If flag value is ‘0’, the information in all the system channels will be reconstructed


and numb is ignored. If flag is ‘1’, only the information of the channel specified in

numb is recovered. AC is a matrix of cells containing the recovered events. It has

as many cells as channels in the AER system. The cells for the non-reconstructed

channels are left empty. As an example, if we want to recover the information from

channel 3 in a system with a total of 5 channels we should invoke the function as follows:

[AC] = disktocell2 (5, 3, 1, ‘output file.txt’);

channel3 = AC{3};

Afterwards, to visualize the image reconstructed from the channel 3 we can use the

previously described function reconstaer back.

A.8 MATLAB Step-by-Step Example

In AERST, all the auxiliary functions, modules, sources, and rest of files are organized

in folders. The tool looks for all the needed files in these locations. The folders are:

1. AERST MAIN. This folder contains the main function main aerst.m called by

the AERST tool (AERST.m) located in the root directory.

2. ALG GEN. This folder contains the functions to convert images to events unifim-

age.m, scanimage.m and randimage.m. The functions mat2dat (to generate dat

files from events) and dat2mat (to generate matrices of events from dat files) are

also stored in this location.

3. ALG REC. In this folder the function to recover the events from the text out-

put file disktocell2.m and the function to reconstruct images from events recon-

staer back are stored.

4. CONFIG FILES. This folder contains the user configuration files.

5. CREATE PARAMETERS. This folder stores the initialization files for parame-

ters and state variables.

6. FUNCS. This folder contains the auxiliary functions needed by the tool internal

processing.


7. SOURCES. This folder contains the sources used as input to the system.

8. MODULES. This folder contains the library of modules and the user-defined

modules.
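For reference, the resulting working-directory layout is sketched below (only the folder names are fixed by the tool; the files listed are those mentioned above):

AERST.m (root directory, edited by the user)
AERST MAIN/ main aerst.m
ALG GEN/ unifimage.m, scanimage.m, randimage.m, mat2dat, dat2mat
ALG REC/ disktocell2.m, reconstaer back
CONFIG FILES/ user configuration files
CREATE PARAMETERS/ parameter and state initialization files
FUNCS/ internal auxiliary functions
MODULES/ library modules and user-defined modules
SOURCES/ input sources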

A.8.1 Preparing the Stimulus Events

The example proposed is for visual processing. We will use two different input stimuli.

The first one is generated by converting a static 64x64 pixel image into a sequence

of events. The second stimulus is directly obtained from a physical motion sensitive

64x64 pixel AER retina recording [15]. The static image is stored in a matlab file called

myimage.mat. The events from the retina are stored in a file called myevents.dat. The

.dat file format is provided by the jAER software [59] when recording real life scenes.

The first step is to convert the selected source to the proper format required by AERST.

If we choose the static image, we have to convert it to a sequence of events. In this

example, we have chosen the uniform method [169] to code the image into events:

[myevents] = unifimage(myimage, 100, 400000);

With this function, we obtain a list of events (myevents) as output. The input

parameters are:

1. I is the input image coded in gray scale.

2. prectemp is the minimum time interval between consecutive events.

3. max events is the maximum number of events per frame.

In the example, prectemp is 100 (so consecutive events are separated by at least 100 time units) and the maximum number of events max events is 400000. The output events are stored in the matrix myevents, with as many rows as events and six columns containing the parameters of each event. The uniform method distributes all the events uniformly in the 400000 available slots.
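For illustration, each row of myevents follows the six-field event format used by AERST (the numerical values below are made up); the request and acknowledge fields of source events are left unset until the simulator processes them:

% one row of myevents: [Tprereqst Treqst Tack x y sign]
% e.g. myevents(1,:) could be [100 0 -1 12 40 1]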

If we use the .dat retina recording as source, we can do the format conversion using

the dat2mat function:

[myevents]=dat2mat(‘myevents.dat’);


Once the events have been created, they have to be saved in the Matlab working

directory inside the SOURCES directory:

cd ./SOURCES

save myevents myevents

cd ..

If we want to use more sources, we should repeat the above steps for each source.

A.8.2 Setting Up the Configuration File

The next step is writing the text configuration file which describes the netlist of the

system to be simulated.

In this example we consider the system shown in Fig. A.1. It is composed of two par-

allel processing modules (chip1 replicated twice) which receive the same input visual

stimulus (from a two output splitter) and merge the output events into one port using

a merger module. The configuration file describing this system has the following lines:

sources {1} {myevents}
splitter {1} {2,3} {file1}
chip1 {2} {4} {file2} {state2}
chip2 {3} {5} {file3} {state3}
merger {4,5} {6} {} {}

The file is saved as config example.txt in the subdirectory CONFIG FILES.

If we want to use more sources (for example two), the first line in the configuration

file would be:

sources {1,2} {myevents1, myevents2}

A.8.3 Initializing Parameters

The following step is to create the parameters and state variables that each block will

use. In our example there are four blocks, each one with its parameters and state

variables. Consequently, we must create the initialization files that will initialize and

save the parameters and state variables. In the example we need five initialization files,


Figure A.1: System Simulated in the Step-by-Step Example

two for each processing chip (one for parameters and one for state variables), and one

for the splitter module. The merger block does not have parameters nor state variables.

The initialization files are stored in subdirectory CREATE PARAMETERS. They are

as follows:

A.8.3.1 Splitter

The module splitter in the system has only one initialization file called initparams1.m

(it has no state variables) with the following lines:

timedelay=0.5; %parameter delay time.

numb ports=2;

save file1 timedelay numb ports

A.8.3.2 Chip1

Module Chip1 is used twice in the system but with different parameters and state vari-

ables. Note that modules can be used more than once in the same system and that

the user can specify different parameters and state variables for them. For the upper

Chip1 in Fig. A.1 there are two initialization files: initparams2 and initstate2.

a)initparams2 :

timedelay=0.6; %parameter delay time.

size1=64; %size 1 of the image.


size2=64; %size 2 of the image.

s=[0 0 0;1 2 1;0 0 0]; %convolution kernel

sizekernel1= floor(size(s,1)/2);

threshold=60;

save file2 timedelay size1 size2 s sizekernel1 threshold

b)initstate2 :

imagestate2=zeros(64); %temporal bidimensional array used in block chip1.

timestate=zeros(64); %auxiliar state variable used in chip1.m.

save state2 imagestate2 timestate

For Chip1 at the bottom of Fig. A.1, there are two initialization files: initparams3 and initstate2.

Note that both modules (upper Chip1 and bottom Chip1) use the same initialization file for state variables (initstate2), as they use state variables with the same characteristics.

a)initparams3 :

timedelay=0.3; %parameter delay time.

size1=64; %size 1 of the image.

size2=64; %size 2 of the image.

s=[0 1 0; 0 2 0; 0 1 0]; %convolution mask

sizekernel1= floor(size(s,1)/2);

threshold=60;

save file3 timedelay size1 size2 s sizekernel1 threshold

A.8.4 Editing the Modules

Once the initialization files have been created, the next step is to write the code of

the different modules. The declaration of each module will be according to the syntax

below. In the example, the modules have the following code:


A.8.4.1 Splitter Module

function [new event in,events out,new state,new time,port out]=

splitter(event in,pars,old state,old time, port in)

new event in=event in;

new state=old state;

% Get the parameter variables timedelay, numb ports

tdel=pars.timedelay;

numb ports=pars.numb ports;

% USE THE INCOMING EVENT TO GET (x,y) COORDINATES AND sign

x = event in(4);

y=event in(5);

sign = event in(6);

new event in(2)=old time;% Update incoming event Treqst

new time=old time+tdel;% Update current time

new event in(3)=new time;% Update incoming event Tack

events out=[new time 0 -1 x y sign];

events out=repmat(events out,numb ports,1); %ONE EVENT FOR EACH PORT

port out=1:numb ports;

According to the initialization file initfile1.m, this module creates two events, one in channel 2 and one in channel 3, with a delay. It also updates the variables time, Treqst and Tack for the incoming event. If we want more than one replicated event in each channel, for example two, we only have to modify the instructions that create the output events. A possible way could be as follows:

events out=[new time 0 -1 x y sign];

events out=repmat(events out,2*numb ports,1);

port out=1:numb ports;

port out=repmat(port out,1,2);

A.8.4.2 Chip1 Module

function [new event in,events out,new state,new time,port out]=

chip1(event in,pars,old state,old time, port in)


% Get the parameter variables timedelay, size1 and size2, threshold and convolution mask snew event in=event in; new state=old state; new time=old time;tdel=pars.timedelay;size1=pars.size1;size2=pars.size2;threshold=pars.threshold;s2=pars.s;

% USE THE INCOMING EVENT TO GET (x,y) COORDINATES AND signx = event in(4);y=event in(5);sign = event in(6);

event in(2)=new time;% Update incoming event Treqst

new time=new time+tdel;% Update current timeevent in(3)=new time;% Update incoming event Tack

J=old state.imagestate2;% Get state bidimensional state array J

%APPLY CONVOLUTION MASK s TO STATE ARRAY J AT (x, y) POSITIONSc=params.sizekernel1;J(max(1,(x-c)):min(size1,(x+c)),max(1,(y-c)):min(size2,(y+c)))= ...J(max(1,(x-c)):min(size1,(x+c)),max(1,(y-c)):min(size2,(y+c)))+...sgn*s2(max(1,c-x+2):min(size(s2,1),size1-x+c+1),max(1,c-y+2):min(size(s2,2),size1-y+c+1));

%FIND NEURONS REACHING threshold

[a, b]=find(abs(J)>threshold);e=find(abs(J)>threshold);new state.timestate(e)=new time; %We update time of those pixels achieving thresholdx2=a’;y2=b’;signo=sign(J(e))’;

time re=zeros(1,length(a));Ack=(-1)*ones(1,length(a));treqini=new time*ones(1,length(a));

%RESET NEURONS ACHIEVING threshold

if length(e)>0

J(e)=0;

end


new state.imagestate2=J;%save new state B

%CREATE NEW OUTPUT EVENTS

events out=[treqini’ time re’ Ack’ x2’ y2’ signo’];

port out=ones(1,length(a));%set the number of output port for each created event

In this block, a square convolution kernel s is applied to the present 2D state at

positions x, y. Positive and negative pixels reaching the threshold threshold (in this

case 60), will produce new events in the output channel.

A.8.4.3 Merger Module

function [new event in,events out,new state,new time,port out]=

merger(event in,pars,old state,old time, port input)

% USE THE INCOMING EVENT TO GET (x,y) COORDINATES AND sign
new event in=event in; new state=old state; new time=old time;

x = event in(4);

y=event in(5);

sign = event in(6);

new event in(2)=new time;% Update incoming event Treqst

new event in(3)=new time;% Update incoming event Tack

events out=[new time 0 -1 x y sign]; %ONE EVENT COPIED TO THE OUTPUT PORT

port out=[1];

In this case, the input events coming from channels four and five are transferred to

channel ‘6’.

A.8.5 Editing the AERST.m file

Once the modules and initialization files are available, we must include the names of

the configuration file and the initialization files in the AERST.m file before starting the

simulation. AERST.m is located in the root directory. In this step-by-step example,

AERST.m has the following lines:


function [VARS,CHANNELS] = AERST()
%USER CONFIGURATION FILE:
CONFIG FILE=‘config example.txt’;
%USER OUTPUT FILE:
OUTPUT FILE=(‘outfile.txt’);
NUMB CHANNELS=6;

%GO TO THE PARAMETER FOLDER AND CREATE PARAMETER AND STATE FILES
cd ./CREATE PARAMETERS;
%USER PARAMETER AND STATE FILES:
initfile1;%USER PARAMETER FILE 1
initfile2;%USER PARAMETER FILE 2
initfile3;%USER PARAMETER FILE 3
initstate1;%USER STATE FILE 1
initstate2;%USER STATE FILE 2

copyfile(‘*.mat’,‘../AERST MAIN’);

cd ../CONFIG FILES
copyfile(CONFIG FILE,‘../AERST MAIN’);

cd ../AERST MAIN
%THE MAIN SIMULATOR FUNCTION IS INVOKED
VARS = main aerst(CONFIG FILE,OUTPUT FILE);
delete(‘*.mat’);
copyfile(OUTPUT FILE,‘../ALG REC’);
delete(‘*.txt’);

cd ../ALG REC

%FINALLY, THE EVENTS IN ALL CHANNELS ARE RETRIEVED

[CHANNELS]=disktocell2(NUMB CHANNELS,0,0,OUTPUT FILE);

delete(‘*.txt’);

cd ..

A.8.6 Simulating the System

Finally, we must invoke the AERST simulator by typing in the matlab prompt the

following command:


[VARS,CHANNELS] = AERST()

If all steps have been done correctly, no errors will appear on the screen. We strongly recommend checking that all required files (sources, configuration file, initfiles, initstates, new modules and aer tool.m) have been created or modified before starting the simulation.

A.8.7 Viewing Results

To analyze the resulting events of all channels, we can use the disktocell2 function as

mentioned previously:

[A4]=disktocell2(6, 0, 0, ‘outfile.txt’);

Here we use the value ‘6’ for the first argument because there are six channels in

our system. If we want to analyze only one channel, for example channel ‘2’, we must

type the following:

[A4]=disktocell2(6, 2, 1, ‘outfile.txt’);

In both cases, we can access the information of one channel (in this case channel ‘2’) with the following instruction:

channel=A4{2};

Now, in channel we have the full list of events at channel ‘2’, each with its 6-field format. If we want to view the image resulting from accumulating all the events in the recovered channel channel, we can type:

[J2]=reconstaer back(channel, 64, 64);

There is a function to convert events in Matlab files to data files (.dat) in case the user wants to use the jAER software tools [59] to visualize the events as video


streams. This function is called mat2dat and the calling format is the following:

mat2dat(CIN,s,size)

CIN is a matrix of events with the format required by the AERST tool. s is the name

of the dat file where the events are going to be stored. size is the size of the address

space coded by the events.

mat2dat requires times in the events in CIN to be in nanoseconds. For instance if the

user wants to convert a matrix of events called myevents coding a 128x128 address

space to a data file called myevents.dat the calling format is:

mat2dat(myevents, ‘myevents.dat’, 128);

A.9 C++ Step-by-Step Example

To maintain a correspondence with the Matlab implementation, the C++ implementation of AERST has been organized with the same folder structure. Again, the folders are:

1. AERST MAIN. This folder contains the main executable file AERST.exe called by the AERST tool (AERST C.m), located in the root directory.

2. ALG GEN. This folder contains the functions to convert images to events unifim-

age.m, scanimage.m and randimage.m. The functions mat2dat (to generate dat

files from events) and dat2mat (to generate matrices of events from dat files) are

also stored in this location.

3. ALG REC. In this folder the function to recover the events from the text out-

put file disktocell3.m and the function to reconstruct images from events recon-

staer back are stored.

4. CONFIG FILES. This folder contains the user configuration files.

5. CREATE PARAMETERS. This folder stores the initialization files for parame-

ters and state variables.


6. FUNCS. This folder contains the auxiliary functions needed by the tool internal

processing.

7. SOURCES. This folder contains the sources used as input to the system.

A.9.1 Converting a Matrix of Events to a source text file

Given an event matrix stored as a mat file (in this example MATRIX.mat) with

the format required by the Matlab version of AERST (six fields of information per

event), it can be converted into the format required by the C++ version using function

write2file:

write2file(‘MATRIX.mat’, ‘myevents.txt’);

myevents.txt stores the events with the format required by the C++ version of AERST.

A.9.2 Setting Up the Configuration File

The next step is to write the text configuration file describing the netlist of the system

to be simulated. For the example system shown in Fig. A.1, the configuration file is:

sources {1} {myevents.txt}
splitter {1} {2,3} {file1.txt}
chip1 {2} {4} {file2.txt} {state2.txt}
chip2 {3} {5} {file3.txt} {state3.txt}
merger {4,5} {6} {} {}

Note that the configuration file is the same as the one used in the Matlab version, except that the file extensions of the source and parameter files have to be given.

A.9.3 Initializing Parameters

The next step is to create the parameters and state variables that each module will

use. In our example there are four modules, each one with its parameters and state

variables associated. Consequently, we must create the files that will initialize and

save the parameters and state variables. In the example we need five initialization


files, two for each processing chip (one for parameters and one for state variables), and

one for the splitter module. The merger module does not have parameters nor state

variables. The initialization files are stored in ‘./CREATE PARAMETERS’. In this

example, we can use the matlab initialization files described in the matlab step-by-step

section just including some lines to save the variables in a txt file. The initialization

files are described next:

A.9.3.1 Splitter

The module splitter has only one initializing file called initparams1.m (it has no state

variables) with the following lines:

timedelay=0.5; %parameter delay time.

numb ports=2;

%THE FOLLOWING CODE CREATES THE file1.txt INITIALIZATION FILE

num doubles=2;

s2=‘file1.txt’;

fid=fopen(s2,‘w’);

fprintf(fid,‘#doubles\n’);

fprintf(fid,‘%d\n’,num doubles);

fprintf(fid,‘%f ’,timedelay);

fprintf(fid,‘%f\n’,numb ports);
fclose(fid);

A.9.3.2 Chip1

For the upper module Chip1 in Fig. A.1 there are two initialization files: initparams2 and initstate2.

a)initparams2 :

timedelay=0.6; %parameter delay time.

size1=64; %size 1 of the input visual flow.

size2=64; %size 2 of the input visual flow.


s=[0 0 0;1 2 1;0 0 0]; %CONVOLUTION MASK

shift= floor(size(s,1)/2);

threshold=60;

%THE FOLLOWING CODE CREATES THE TXT INITIALIZATION FILE

num doubles=3;

num matrices=1;

rows=[3;3];%THERE IS ONLY ONE MATRIX, THE 3x3 CONVOLUTION MASK s

s2=‘file2.txt’;

fid=fopen(s2,‘w’);

fprintf(fid,‘#doubles\n’); %WRITE KEYWORD #doubles

fprintf(fid,‘%d\n’,num doubles);

fprintf(fid,‘%f ’,timedelay);

fprintf(fid,‘%f ’,shift);

fprintf(fid,‘%f\n’,threshold);

fprintf(fid,‘#matrices\n’); %WRITE KEYWORD #matrices

fprintf(fid,‘%d\n’,num matrices);

fprintf(fid,‘%d\n’,rows);

fprintf(fid,‘%f\n’,s’);

fclose(fid);

b)initstate2 :

imagestate2=zeros(64); %temporal bidimensional array used in block chip1.

timestate=zeros(64); %auxiliar state variable used in chip1.m.

%THE FOLLOWING CODE CREATES THE TXT INITIALIZATION FILE

num doubles=2;

num matrices=2;

rows=[64 64;64 64];%THIS TIME THERE ARE TWO MATRICES, the 64x64 imagestate2 and the 64x64 timestate

s2=‘state2.txt’;


fid=fopen(s2,‘w’);

fprintf(fid,‘#doubles\n’); %WRITE KEYWORD #doubles

fprintf(fid,‘%d\n’,num doubles);

fprintf(fid,‘%f ’,size1);

fprintf(fid,‘%f\n’,size2);

fprintf(fid,‘#matrices\n’); %WRITE KEYWORD #matrices
fprintf(fid,‘%d\n’,num matrices);
fprintf(fid,‘%d\n’,rows);
fprintf(fid,‘%f\n’,imagestate2’);
fprintf(fid,‘%f\n’,timestate’);

fclose(fid);

For the module Chip1 at the bottom of Fig. A.1, there are two initialization files: initparams3 and initstate2. Note that both modules (upper Chip1 and bottom Chip1) use the same initialization file for state variables (initstate2), as they use state variables with the same characteristics.

a)initparams3 :

timedelay=0.3; %parameter delay time.

size1=64; %size 1 of the input visual flow.

size2=64; %size 2 of the input visual flow.

s=[0 1 0; 0 2 0; 0 1 0]; %CONVOLUTION MASK

shift= floor(size(s,1)/2);

threshold=60;

%THE FOLLOWING CODE CREATES THE TXT INITIALIZATION FILE

num doubles=3;

num matrices=1;

rows=[3;3];%THERE IS ONLY ONE MATRIX, THE CONVOLUTION MASK s

s2=‘file3.txt’;

fid=fopen(s2,‘w’);

fprintf(fid,‘#doubles\n’); %WRITE KEYWORD #doubles


fprintf(fid,‘%d\n’,num doubles);

fprintf(fid,‘%f ’,timedelay);

fprintf(fid,‘%f ’,shift);

fprintf(fid,‘%f\n’,threshold);

fprintf(fid,‘#matrices\n’); %WRITE KEYWORD #matrices

fprintf(fid,‘%d\n’,num matrices);

fprintf(fid,‘%d\n’,rows);

fprintf(fid,‘%5.20f\n’,s’);

fclose(fid);

A.9.4 Editing the C++ Modules

Once the initialization files have been created, the next step is to write the different

modules (in case new ones have been created and are not available in the modules

library).

In the example system, the modules will have the code given below:

A.9.4.1 Splitter Module

int splitter (double *event in, double ***events out, params2 params, params2 *state,double *timeact, int port input, int **port output, int tam vec)

{int i,c, numb ports;double timedelay, **tt;// Get parameterstimedelay = params.par doub[0]; // First parameter: delay of the asynchronous communicationnumb ports=params.par doub[1];//Number of output ports//Update incoming event timing information:event in[1]=*timeact;// Request time for input event*timeact=*timeact+timedelay; // We compute the present time for the moduleevent in[2]=*timeact;// We acknowledge the input eventtt = new double *[numb ports]; //We create dinamically the set of events. In this case only two events (two ports)if (tt==NULL){


printf(“ERROR IN MEMORY ALLOCATION\n”);}else{

for(c=0;c<numb ports;c++){

tt[c]=new double[tam vec]; //We create dinamically each event of size specified by tam vecif (tt[c]==NULL){

printf(“ERROR IN MEMORY ALLOCATION\n”);}

}*events out=tt; //events out is the pointer to the array provided as output}*port output=new int[numb ports]; //We reserve space for the output port vector. Each element indicates inif (*port output==NULL){

printf(“ERROR IN MEMORY ALLOCATION\n”);}else{// which output port each event must be written.

for (i=0;i<numb ports;i++){//We initialize each event

(*events out)[i][0]=*timeact;(*events out)[i][1]=0;(*events out)[i][2]=-1;(*events out)[i][3]=event in[3];(*events out)[i][4]=event in[4];(*events out)[i][5]=event in[5];(*port output)[i]=i+1;

}}numb events=numb ports;return numb events;//We return the number of events}

In this block, a new event is created in channels ‘2’ and ‘3’ with a delay, and timeact and state are refreshed. The effective request time and the acknowledge time of the input event are modified by adding to them the delay specified in params.


A.9.4.2 Chip1 C++ Module

int chip1(double *event in, double ***events out, params2 params, params2 *state,double *timeact, int port input, int **port output, int tam vec)

{int i,j,numb events=0,cont=0, cont2=0,c=0;int ind11,ind12,ind21,ind22,sg11,sg21,sg12,sg22,shift;int *a, *b;double x, y, sgn event;double **J, **tb,**s2,**tt;double *signo;double timedelay, treqinic,offset,size1,size2,threshold;%GET PARAMETERS AND STATE VARIABLEStimedelay=params.par doub[0];size1=(int)state->par doub[0];size2=(int)state->par doub[1];s2=params.p[0];shift=params.par doub[1];threshold=params.par doub[2];%GET EVENT INFORMATION AND UPDATE ITx=event in[3];y=event in[4];sgn event=event in[5];treqinic=*timeact;event in[1]=*timeact;*timeact=*timeact+timedelay;event in[2]=*timeact;c=shift;J=state->p[0];%COMPUTE WHICH PART OF CONVOLUTION MASK s FITS IN THE STATE ARRAYind11=maxim2(0,-x+c);ind21=minim2((params.rows2 m[0][0]-1),(state->rows2 m[0][0]+ ...

params.rows2 m[0][0]-(x+c+2)));ind12=maxim2(0,-y+c);ind22=minim2((params.rows2 m[1][0]-1),(state->rows2 m[1][0]+ ...

params.rows2 m[1][0]-(y+c+2)));
// if length(s2)>0 => s2 has elements
if(((ind21-ind11)>=0)&&((ind22-ind12)>=0)){

a=new int[(ind21-ind11+1)*(ind22-ind12+1)];b=new int[(ind21-ind11+1)*(ind22-ind12+1)];


signo=new double[(ind21-ind11+1)*(ind22-ind12+1)];if((a!=NULL)&(b!=NULL)&(signo!=NULL)){

cont=1;sg11=maxim2(0,x-c);sg21=minim2(size1-1, x+c);sg12=maxim2(0,y-c);sg22=minim2(size2-1,y+c);tb=state->p[1];s2=params.p[0];for(j=sg12;j<=sg22;j++){

for (i=sg11;i<=sg21;i++){

%APPLY CONVOLUTION MASK TO STATE ARRAY BJ[i][j]=J[i][j]+sgn event*s2[ind11+(i-sg11)][ind12+(j-sg12)];

%LOCATE THOSE NEURONS WITH A STATE HIGHER THAN thresholdif (fabs(J[i][j])>=threshold){

a[cont2]=i;b[cont2]=j;tb[i][j]=*timeact;if (J[i][j]>=0){

signo[cont2]=1;%RESET THOSE NEURONS WITH A STATE HIGHER THAN threshold

J[i][j]=0;}else{

signo[cont2]=-1;J[i][j]=0;

}cont2++;

}}

}}

}if(cont>0){


%CREATE OUTPUT EVENTSnumb events=cont2;tt = new double *[numb events]; //We create dinamically the set of events. In this case only two events (two ports)if (tt==NULL){

printf(“ERROR IN MEMORY ALLOCATION\n”);}else{

for(c=0;c<numb events;c++){

tt[c]=new double[tam vec]; //We create dinamically each event of size specified by tam vecif (tt[c]==NULL){

printf(“ERROR IN MEMORY ALLOCATION\n”);}

}*events out=tt; //events out is the pointer to the array provided as output}*port output=new int[numb events]; //We reserve space for the output port vector. Each element indicates inif (*port output==NULL){

printf(“ERROR IN MEMORY ALLOCATION\n”);}else{// which output port each event must be written.

for (i=0;i<numb events;i++){//We initialize each event

(*events out)[i][0]=treqinic;(*events out)[i][1]=0;(*events out)[i][2]=-1;(*events out)[i][3]=a[i];(*events out)[i][4]=b[i];(*events out)[i][5]=signo[i];(*port output)[i]=1;*timeact=(*events out)[i][0];

}}

delete a;delete b;


delete signo;}

return numb events;}

In this block, the convolution kernel s is applied to the present 2D state at positions x, y. Positive and negative pixels that reach the threshold (in this example 60) will produce new events in the output channel.

A.9.4.3 MERGER C++ Module

int merger (double *event in, double ***events out, params2 params, params2 *state,double *timeact, int port input, int **port output, int tam vec)

{int numb ports=1;double **tt;// Get parameters//Update incoming event timing information:event in[1]=*timeact;// Request time for input eventevent in[2]=*timeact;// We acknowledge the input eventtt = new double *[numb ports]; //We create dinamically the set of events. In this case only one eventif (tt==NULL){

printf(“ERROR IN MEMORY ALLOCATION\n”);}else{

tt[0]=new double[tam vec]; //We create dinamically each event of size specified by tam vecif (tt[0]==NULL){

printf(“ERROR IN MEMORY ALLOCATION\n”);}

*events out=tt; //events out is the pointer to the array provided as output}*port output=new int[numb ports]; //We reserve space for the output port vector. Each element indicates inif (*port output==NULL){

printf(“ERROR IN MEMORY ALLOCATION\n”);}else{


//We initialize the event(*events out)[0][0]=*timeact;(*events out)[0][1]=0;(*events out)[0][2]=-1;(*events out)[0][3]=event in[3];(*events out)[0][4]=event in[4];(*events out)[0][5]=event in[5];(*port output)[0]=1;

}numb events=numb ports;return numb events;//We return the number of events}

Once the modules have been written, the user has to:

1. Include the modules in the hpp file called lib modules.hpp.

2. Include the module names in the file AERST.cpp. The module names are included

as independent lines in the C++ structure called intern func type. This structure

stores the library modules and the user-defined modules. It has the following lines:

struct intern func type {

char *f name;

int (*p)(double *,double ***,params2 , params2 *,double *,int , int **, int );

} intern func[]={

“aerswitch”, aerswitch, //INCLUDE THE NAME OF NEW MODULES AS DONE

NEXT:

“aerscanner”,aerscanner,

“convolution”, convolution,

“projection”, projection,

“integandfire”,integandfire,

“ratereducer”, ratereducer,

“rotator”, rotator,

“subsampling”, subsampling,

“splitter”,splitter,

“merger”, merger,


“chip1”,chip1,

“chip2”,chip2,

“”,0

};

A.9.5 Simulating the System in C++

Once the modules and initfiles are available, the simulator can be run. A MATLAB function (AERST.m) that calls the initialization functions and the main program (AERST.exe) is provided. (The user can also run the application outside MATLAB, by creating the txt files as described above and calling AERST.exe with the name of the configuration file.) It has the following lines:

function [CHANNELS]=AERST()
%MATLAB ARRAY OF SOURCE EVENTS:
MAT SOURCE=‘myevents.mat’;
%CPP TXT SOURCE FILE:
TXT SOURCE=‘myevents.txt’;
%USER CONFIGURATION FILE:
CONFIG FILE=‘config example c.txt’;
%USER OUTPUT FILE:
OUTPUT FILE=‘output.txt’;
NUMB CHANNELS=6;

%CODE TRANSPARENT TO THE USER:
%LOAD SOURCE AS A MATLAB EVENT ARRAY AND CONVERT IT TO TXT FORMAT:
cd ./SOURCES
load(MAT SOURCE)
write2file(MAT SOURCE, TXT SOURCE); %convert from mat to txt format
copyfile(TXT SOURCE,‘../AERST MAIN/’);
cd ..
delete(TXT SOURCE);

%GO TO THE CONFIGURATION FILES FOLDER AND COPY THE CONFIG FILE TO THE MAIN FOLDER
cd ./CONFIG FILES
copyfile(CONFIG FILE, ‘../AERST MAIN/config file.txt’);

%GO TO THE PARAMETER FOLDER AND CREATE PARAMETER AND STATE FILES
cd ../CREATE PARAMETERS
%USER PARAMETER AND STATE FILES:
initfile1 C;%USER PARAMETER FILE 1
initfile2 C;%USER PARAMETER FILE 2
initfile3 C;%USER PARAMETER FILE 3
initstate2 C;%USER STATE FILE 2
copyfile(‘*.txt’,‘../AERST MAIN’);%IN THE C++ SIMULATOR PARAMETERS AND STATES ARE STORED IN TXT FILES
delete(‘*.txt’);

%CALL THE C++ MAIN PROGRAM:
cd ../AERST MAIN
system(‘./AERST.exe config file.txt out1.txt’);

copyfile(‘out1.txt’,‘../ALG REC’);
delete(‘*.txt’);
cd ../ALG REC
copyfile(‘out1.txt’,OUTPUT FILE);

%FINALLY, THE EVENTS IN ALL CHANNELS ARE RETRIEVED:
[CHANNELS]=disktocell3(NUMB CHANNELS,0,0,OUTPUT FILE);
delete(‘out1.txt’);
delete(OUTPUT FILE)
cd ..

Now all the events that have travelled through the system are stored in CHANNELS. The user can access the information of one channel (consider the case of channel ‘2’) by invoking the MATLAB command:

channel=CHANNELS{2};

Now, in channel we have the information of channel two with the format explained previously. If the user wants to view the image resulting from reconstructing the events in channel, an example way to do it is:


[J2]=reconstaer back(channel, 64, 64);

imshow(J2,[])


Appendix B

SUMMARY

B.1 INTRODUCTION

Current technology allows the implementation of complex applications at high speed and with fairly efficient results. However, when it comes to implementing applications that are immediate for the brain, such as recognition, tracking or object-motion tasks, present-day electronic systems turn out to be inefficient. In the case of vision applications, most current systems base their operation on frame processing. Vision systems usually work by capturing and processing sequences of frames, which are processed pixel by pixel until some given task is accomplished. This frame-based processing is slow, especially if several convolutions in sequence are needed for each input image. The brain does not work on a frame-based scheme. In the retina, each pixel sends pulses (also called events) to the cerebral cortex when its activity level reaches a certain threshold. Very active pixels will send more pulses than less active ones. All these pulses are transmitted as they are produced, and do not wait for the artificial ‘frame time’ before being sent to the next processing stage [8]. The extracted features are propagated and processed stage by stage as soon as they have been produced, without waiting for the collection and processing of the data of complete frames to finish. An important problem that engineers face when trying to implement bio-inspired vision processing systems is providing the massive amount of feed-forward


Figure B.1: Concept of point-to-point AER-based communication.

and feedback interconnections that appear between the neuronal stages present in the human vision processing system. The Address Event Representation (AER [30][31]) is a possible solution. Fig. B.1 illustrates the communication over a traditional point-to-point AER link.

The continuous-time state of the emitting neurons in a chip is transformed into a sequence of very fast digital pulses (events) of minimum width (on the order of ns) but with inter-pulse intervals on the order of ms (similar to brain neurons). This large inter-pulse interval allows a powerful multiplexing, and the pulses generated by the emitting neurons can be time-multiplexed on a common high-speed output bus. Every time a neuron emits a pulse or event, the address of that neuron appears on the digital bus, together with its request and acknowledge signals. This is known as an address event. The receiver chip reads and decodes the addresses of the incoming events and sends pulses to the corresponding receiving neurons, which integrate those pulses and are able to reproduce the state of the emitting neurons. This is the simplest AER-based inter-chip communication. However, this point-to-point communication can be extended to a multi-emitter or multi-receiver scheme [50], where rotations, translations or more complicated processing such as convolutions can be implemented by processing chips that receive these events [37]. Moreover, the information can be translated or rotated easily, simply by changing the event addresses while they travel from one chip to the next. There is a growing community of users of the AER protocol for the design of bio-inspired vision and audition applications, robotics, object tracking and recognition, etc., as has been demonstrated by the success in recent years of the participants in the ‘Neuromorphic Engineering Workshop series’ [48]. The aim of this community is to design large, hierarchically structured,


multichip, multistage systems capable of implementing complex array processing in real time. The success of such systems will depend greatly on the availability of robust and efficient tools for the design and debugging of AER systems [55][50]. It is therefore essential to have a simulator of complex AER-based processing systems that allows such systems to be analyzed, and new ones to be proposed, before their physical implementation. This work presents a simulator implemented in Matlab, which has also been implemented in C++.

B.2 Description of the AERST Simulator

En este simulador, un sistema generico AER es descrito mediante un netlist o fichero de

conexiones que usa solamente dos tipos de elementos: modulos y canales. Un modulo

es un bloque que genera y/o produce trenes de eventos (streams) AER. Por ejemplo,

un chip retina serıa una fuente que proporciona eventos AER a un sistema AER. Un

chip de convolucion [39] serıa un modulo de procesamiento AER que recibirıa como

entrada un stream de eventos AER y que producirıa un nuevo stream de eventos AER

a la salida. Los streams AER constituyen los nodos del netlist en un sistema AER y

son llamados canales. De este modo, los canales representan conexiones punto a punto.

Para replicar o multiplexar canales, se deben incluir en el netlist modulos splitter o

merger. En la Fig. B.2 se muestra un netlist de ejemplo y su descripcion mediante un

archivo ASCII.

El netlist de la figura contiene 7 modulos y 8 canales. Como se puede observar,

en el netlist hay que indicar aquellos canales fuente del sistema (junto con el nombre

del fichero de texto donde estan los eventos de cada fuente), ademas de los modulos de

procesamiento junto con sus estructuras de parametros y estados. Para cada modulo se

indica aquellos canales que son de entrada y de salida. La descripcion del netlist es pro-

porcionada a la herramienta de simulacion mediante un fichero de texto, que se muestra

en la parte inferior de la Fig. B.2. Cada modulo de procesamiento es descrito mediante

una funcion cuyo nombre es el propio nombre del modulo. El simulador no impone

ninguna restriccion en el formato de las estructuras de parametros y estados. Estos

formatos estan abiertos al usuario que escribe el codigo de la funcion de cada modulo.
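As a purely illustrative sketch (the real AERST syntax is the ASCII format shown in Fig. B.2, which is not reproduced here), such a netlist could be captured in Matlab as a list of source channels plus a list of module instances, each one naming its function, its parameter/state files and its input and output channels; every field name below is hypothetical:

    % Hypothetical netlist representation (illustrative field names only)
    sources{1}.channel   = 1;                   % source channel number
    sources{1}.eventfile = 'retina_events.txt'; % file with the source events

    modules{1}.func   = 'aerswitch';            % module function name
    modules{1}.params = 'split_par.mat';  modules{1}.state = 'split_st.mat';
    modules{1}.in     = 1;                modules{1}.out   = [2 3];

    modules{2}.func   = 'convolution';
    modules{2}.params = 'conv_par.mat';   modules{2}.state = 'conv_st.mat';
    modules{2}.in     = 2;                modules{2}.out   = 4;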


Figure B.2: Example of an AER system and its description by means of an ASCII file

The simulator only needs to know the names of the files where these structures are stored. The information associated with an event has six components:

[T_prereqst  T_reqst  T_ack  x  y  sign]    (B.1)

x and y represent the coordinates or address of the event, and sign is its sign. T_prereqst is the time at which the event was created by the emitting module, T_reqst is the time at which the event is processed by the receiving module, and T_ack is the time at which the event is finally acknowledged by the receiving module. In our application we distinguish between the pre-request time T_prereqst and the effective request time T_reqst. The former depends only on the emitting module, whereas the latter requires the receiving module to be ready to process the request signal of an event. In this way, a whole list of events described only by their addresses, sign and T_prereqst times can be supplied as a source to the system. As the events are processed by the simulator, their T_reqst and T_ack times are filled in.
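For instance, under the six-field convention of (B.1), a source event list can be pictured as one matrix row per event, with the T_reqst and T_ack fields left unset until the simulator processes each event. The following is only a sketch of this data layout (variable names are illustrative), not the tool's internal code:

    % One row per event: [Tprereqst Treqst Tack x y sign]  (times in ns)
    % Treqst and Tack start unset (NaN) and are stamped during simulation.
    events = [   0  NaN NaN  5  7   1;
                50  NaN NaN  5  7   1;
               100  NaN NaN  9  3  -1 ];
    pending = isnan(events(:,2));   % unprocessed events: request time not set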

The events of the source channels can come directly from AER devices


such as electronic retinas, or from the result of converting images into a list of events. The execution of the simulator proceeds as follows.

1. Initially, the netlist file is read, together with all the parameter and state-variable files belonging to the different modules.

2. All channels are examined. The simulator selects the one holding the unprocessed event with the smallest T_prereqst time.

3. The information of that event is supplied as input to the module to which the channel is connected. At that moment, the module updates its internal state from the information carried by the event and from the rest of the module's configuration parameters. This may cause new unprocessed events to be generated in the module, which are delivered to the corresponding output ports of the module. These events, having just been created but not yet processed, have no values assigned to their T_reqst and T_ack times. From this point on, the simulator updates all channels, stores the new state of the module that has just been processed, and returns to step 2. Processed events are stored in a text file that the user of the application can inspect and visualise at the end of the simulation in order to analyse the evolution of the system.
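The scheduling idea behind steps 2 and 3 can be sketched as follows, assuming events are stored per channel as rows [Tprereqst Treqst Tack x y sign] and that each module is invoked through a hypothetical helper process_module (names are illustrative; this is not the actual AERST code):

    % Minimal event-driven scheduler sketch
    done = false;
    while ~done
        best_t = Inf; best_ch = 0; best_ev = 0;
        for c = 1:numel(channels)
            ev  = channels{c}.events;
            idx = find(isnan(ev(:,2)));          % unprocessed events
            if ~isempty(idx)
                [t, k] = min(ev(idx,1));         % smallest Tprereqst
                if t < best_t
                    best_t = t; best_ch = c; best_ev = idx(k);
                end
            end
        end
        if best_ch == 0
            done = true;                         % nothing left to process
        else
            m = channel_dest(best_ch);           % module fed by this channel
            % The module stamps Treqst/Tack on the input event, updates its
            % state and may append new unprocessed events to its outputs.
            [channels, modules{m}] = process_module(modules{m}, channels, ...
                                                    best_ch, best_ev);
        end
    end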

The tool provides a basic library of modules that can easily be extended by the user. The main modules are:

1. The aerswitch module, which replicates the event at its input port (or ports) onto its output port (or ports).

2. The mapper module, which transforms the addresses encoded by the incoming events according to a LUT (look-up table) or some kind of processing. Two variants of this module have been implemented: rotator (which changes the addresses of the incoming events so as to apply a rotation to the visual stimulus encoded by those events), and aerscanner (which ignores the addresses encoded by the incoming events and produces, for each incoming event, an output event encoding an address consecutive to the one encoded by the previously sent event).


3. The subsampling module, which reduces the address space encoded by the events.

4. The convolution module, which implements the convolution of the visual stimulus encoded by the incoming events with a convolution mask stored inside the module (see the sketch below).
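The event-driven behaviour of such a convolution module (described in more detail for the Gabor modules of Section B.6) can be sketched as follows: each incoming event adds the kernel, centred at the event address and weighted by the event sign, to an internal neuron array, and any neuron reaching its threshold is reset and emits an output event. The function below is only an illustrative sketch with assumed names, not the module's actual code:

    function [state, out_events] = conv_event(state, x, y, s, t)
    % Integrate one incoming event (address x,y, sign s, time t) into the
    % module's neuron array state.A using kernel state.k and threshold
    % state.thr, collecting any output events it triggers.
    out_events = zeros(0, 6);
    [kh, kw] = size(state.k);                  % odd-sized kernel assumed here
    for i = 1:kh
        for j = 1:kw
            xx = x + i - (kh+1)/2;  yy = y + j - (kw+1)/2;
            if xx >= 1 && xx <= size(state.A,1) && yy >= 1 && yy <= size(state.A,2)
                state.A(xx,yy) = state.A(xx,yy) + s * state.k(i,j);
                if state.A(xx,yy) >= state.thr % threshold reached: reset, fire
                    state.A(xx,yy) = 0;
                    out_events(end+1,:) = [t NaN NaN xx yy 1]; %#ok<AGROW>
                end
            end
        end
    end
    end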

B.3 IMPLEMENTATIONS

Several multi-stage, multi-processing AER systems have been designed with the tool described above. The main ones are an AER character recognition system, an image classification system based on texture information, and a multi-stage convolutional neural network for recognising people in several postures.

B.4 AER-Based Character Recognition System

One of the multi-stage, multi-processing AER implementations corresponds to a handwritten character recognition system. In particular, a simplified version of the Neocognitron character recognition architecture [2] has been implemented. The implemented structure allows the recognition of several characters ('A', 'B', 'C', 'H', 'L', 'M', 'T') which may also present slight deformations. The system is shown in Fig. B.3. The input to the system is a 16x16-pixel visual stimulus encoded as events, which can represent any of the seven characters ('A', 'B', 'C', 'H', 'L', 'M', 'T'). Each pixel produces 10 events and the separation between events is 50 ns. Since the approximate number of active pixels is 30, the complete stimulus lasts approximately 15 µs.

The first processing stage implements 17 convolutions (with convolution masks k_i (i = 1, ..., 17)) in parallel for feature extraction from the visual stimulus. Each convolution mask (kernel) in stage '1' is intended to detect discriminative features that help identify the characters. For example, convolution mask k1 detects the presence and position of the upper peak of the letter 'A'.


Figure B.3: AER-based character recognition system

Mask k2 detects a horizontal segment that ends on the left and touches a vertical segment. Mask k3 does the same, but with the segment ending on the right, and so on. The first stage therefore aims to detect a set of 17 geometric features that can be used to detect and discriminate between the different characters.

The second stage implements 17 convolutions in parallel (with masks p_i (i = 1, ..., 17)). The purpose of this stage is to evaluate whether the spatial distribution of the features detected in the first stage is meaningful for the character being analysed. For example, for the character 'A', the upper peak (detected by mask k1 in the first stage) should be above the rest of the features. Mask p1 therefore produces a positive contribution in the region just below the peak detected by k1, because that is where the centre of the character 'A' should be. The same applies to the rest of the convolution masks.

The purpose of the third stage is to combine, with positive or negative weights, the outputs of the second stage. To do so, each of the 17 outputs obtained in stage 2 is replicated onto 7 independent channels. Each of these channels enters a 17-input merger module (module M in Fig. B.3). Since the events coming from the second stage all have positive sign, the merger modules have the signs hard-wired at their inputs, so that a positive sign is assigned to the events that contribute positively to the recognition of the character and a negative sign to those that contribute negatively.


Figure B.4: Characters used to evaluate the AER system.

The events with their new sign obtained from each merger module are sent to a convolution module with a 1x1 convolution mask of value 1 (module U in Fig. B.3). The parameters of these convolution modules are adjusted so that 3 input events encoding the same address are needed to trigger the generation of an output event. The third stage therefore implements a summation of the activities received at its input.
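A minimal sketch of this stage-3 behaviour, with illustrative names, sizes and 1-based addresses, is the following: each merged event adds its (hard-wired) sign to a per-address accumulator, and an output event is produced once three net same-address contributions have been integrated:

    % Illustrative stage-3 sketch: per-address accumulation with threshold 3
    acc = zeros(16, 16);                    % one accumulator per address
    thr = 3;
    for e = 1:size(merged_events, 1)        % merged_events: hypothetical list
        x = merged_events(e,4); y = merged_events(e,5); s = merged_events(e,6);
        acc(x,y) = acc(x,y) + s;            % sign hard-wired by the merger
        if acc(x,y) >= thr                  % three net events at this address
            acc(x,y) = 0;                   % reset and emit an output event
            fprintf('output event at (%d,%d)\n', x, y);
        end
    end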

Finally, the fourth stage consists of one convolution module for each output channel of stage 3 (modules C in Fig. B.3). There is one C module per character (seven in total), and the purpose of these modules is to analyse whether the events coming from the previous stages are more or less clustered rather than scattered. If they are clustered (at the centre of the character), the character has been detected.

B.5 Results

Since the whole system is implemented with AER modules, all the processing is parallel and in real time, with events travelling from stage to stage in a matter of nanoseconds. The multi-chip (68 convolution modules), multi-stage (4 stages) system has been tested using three slightly modified versions of each of the 7 characters. The 21 characters are shown in Fig. B.4.

In all cases, the system is able to detect which letter is present in less than 9.3 µs (equivalent to processing approximately 100,000 images per second) from the moment the first input event is received by the system. This delay is even shorter than the duration of the input visual stimulus (12.4 µs). The system is therefore able to recognise the character at its input before having received and processed all the events. In a frame-based implementation, one would have to wait the time corresponding to one frame in order to collect all the pixel values of a character, and after that, all the pixels of the image would have to be processed sequentially by the 68 convolution modules of the system.


Figure B.5: Scheme of the AER-based system for texture-based image classification

Assuming a coding scheme of 25 frames per second, there would always be a 40 ms limit for processing each character (without even considering the processing times of the convolution modules).

B.6 Image Classification Based on Texture Information

In this implementation, a system for texture-based image classification using Gabor filters has been built. The proposed AER system is a slightly modified version of an earlier frame-based implementation proposed by Manjunath [94].

Our proposed AER system is described in Fig. B.5.

In the system, a texture image encoded as events reaches the first stage (layer '1'), which is composed of a splitter module that replicates each event onto each of its 24 output ports, and of 24 AER-based convolution modules working in parallel [39].


This first stage therefore implements a bank of 24 Gabor filters (4 scales and 6 orientations). Each convolution module G_mn uses as its convolution map the real part of a Gabor wavelet at a given scale and orientation. When a pixel in the pixel array inside a convolution module reaches its firing threshold, it resets itself and generates an output event, which is sent out of the convolution module. Each event obtained at the output of a stage-1 module reaches a processing module in stage 2. Stage 2 consists of 24 feature extraction modules (FEM in Fig. B.5) working in parallel. Each FEM module computes an estimate of the mean µ_mn (called W_mn in Fig. B.5) and of the variance σ_mn (labelled S_mn in Fig. B.5) of the convolution result obtained in the previous stage. These parameters, encoded by the events travelling from stage 2 to stage 3, are received in stage '3' by an FPGA. The FPGA scans, during a specified and programmed time, the events of the 48 input channels, thereby building a feature vector of the following form:

f = [W_11 S_11 W_12 S_12 ... W_46 S_46]    (B.2)

Once the feature vector has been built, this stage computes the distance between the newly created feature vector and the feature vectors corresponding to the other textures in the database. The texture under analysis is classified as texture k if the distance to the feature vector corresponding to texture k in the database is the minimum one.
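This last step can be sketched as follows, assuming W(m,n) and S(m,n) hold the estimates delivered by the 24 FEM modules and db is a matrix with one stored feature vector per database texture (all names are illustrative, and a plain Euclidean distance is used here as a stand-in for the distance actually employed):

    % Illustrative nearest-texture classification from the 48 features
    f = zeros(1, 48);
    k = 1;
    for m = 1:4                          % 4 scales
        for n = 1:6                      % 6 orientations
            f(k) = W(m,n);  f(k+1) = S(m,n);  k = k + 2;
        end
    end
    d = sqrt(sum((db - repmat(f, size(db,1), 1)).^2, 2));  % distance to each
    [~, texture_id] = min(d);            % database texture at minimum distance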

The implemented AER scheme shows that, in a time (approximately 10 ms) shorter than that of one frame (for example 33 ms), the system is able to recognise, in real time, the texture encoded by the events of the visual stimulus entering the system, and before the stimulus has been received in full.

B.7 Convolutional Neural Network for People Recognition

The goal of this section is to implement a multi-stage, convolution-based neural network similar to the LeNet-5 network implemented by Y. LeCun in [4].


Figure B.6: Frame-based convolutional neural system for detecting people standing, lying horizontally or upside down.

The proposed system has been implemented entirely in AER and has been applied to the recognition and identification of human postures. The implemented scheme consists of 6 stages and is shown in Fig. B.6.

Stages 1, 3, 5 and 6 are convolution stages, where each weight of each convolution mask is obtained through the backpropagation training algorithm [4]. Stages 2 and 4 are subsampling stages. The implemented system has 4 outputs indicating one of the following options: person standing, horizontal, upside down, or noise. The system has been implemented in 2 versions: the first implementation is frame-based and its main purpose is to obtain the values of the trainable parameters (the weights of the convolution masks) through the backpropagation algorithm; the second implementation has been carried out entirely in AER using the AERST simulation tool.

B.7.1 FRAME-BASED PEOPLE DETECTION NEURAL NETWORK

The implemented frame-based system is shown in Fig. B.6. In this system the input data came directly from real AER events obtained with an electronic AER retina that is sensitive to motion and therefore only detects changes in the scene [15]. In this way, the static background typical of a scene is completely removed. Since the retina only provides information about moving objects/people, the system was trained to identify people (in 3 possible postures) or objects (the latter categorised as noise by the application). The information provided by the retina (events in AER format) was collected to form 128x128 images. These images were then reduced to 32x32, and a percentage of them were used to train the network.


The system consists of the following stages:

1. Stage C1: first stage. It consists of a bank of six 10x10 Gabor filters with 2 scales and 3 orientations (see the sketch after this list).

2. Stages S2 and S4: second and fourth stages. They are subsampling stages.

3. Stage C3: third stage. It consists of 24 trainable 5x5 filters and 4 output channels.

4. Stage C5: fifth stage. It consists of 32 trainable 5x5 filters and 8 output channels.

5. Stage F6: sixth stage. A fully connected trainable stage with 32 training parameters.
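As an illustrative sketch of the kind of kernel used in C1, the real part of a Gabor filter at a given orientation and wavelength can be generated as below (the thesis follows the Gabor-wavelet parametrisation of [94]; the standard form and the parameter values used here are only stand-ins):

    % Real part of a 10x10 Gabor kernel (illustrative parameters)
    ksize = 10;  theta = pi/3;  lambda = 5;  sigma = 3;  gamma = 0.8;
    [x, y] = meshgrid(linspace(-ksize/2, ksize/2, ksize));
    xr =  x*cos(theta) + y*sin(theta);        % rotate coordinates
    yr = -x*sin(theta) + y*cos(theta);
    g  = exp(-(xr.^2 + (gamma*yr).^2) / (2*sigma^2)) .* cos(2*pi*xr/lambda);
    g  = g - mean(g(:));                      % zero-mean kernel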

As a result of training the network with the frame-based system, all the weights (training parameters) obtained were stored to be used in the AER convolution modules of the non-frame-based structure.

B.7.2 AER-BASED PEOPLE DETECTION NEURAL NETWORK

In the implemented AER architecture, the input to the system is directly the event flow captured with the AER retina (without collecting the events to produce images). The system input encodes a 128x128 address space, which is first subsampled to provide an event flow with a 32x32 address space. This is possible in AER by using mapper modules, which modify the address of each incoming event with very simple operations so that the new output events encode the desired 32x32 space. This mapper module assigns to each incoming event with address (x_in, y_in) the output address (x_new, y_new) as follows:

x_new = floor(x_in/4);    y_new = floor(y_in/4)    (B.3)
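Applied to a list of events stored one per row as [Tprereqst Treqst Tack x y sign] (with x, y in 0..127), this remapping is simply (variable names are illustrative):

    % Sketch of the 128x128 -> 32x32 subsampling mapper of (B.3)
    ev(:,4) = floor(ev(:,4) / 4);    % x_new = floor(x_in/4), now in 0..31
    ev(:,5) = floor(ev(:,5) / 4);    % y_new = floor(y_in/4)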


Figure B.7: AER implementation of the convolutional neural network for people recognition.

This new event flow with new coordinates is used as the effective input to the system shown in Fig. B.7.

To ease the hardware design of the network and adapt it to the specific problem of our application, the following two simplifications were adopted:

1. The usually trainable filters of stage C1 in convolutional neural networks were replaced in this implementation by a bank of Gabor filters with 2 scales and 3 orientations (the 3 orientations of the person under analysis). In the implementation, each of these Gabor filters is an AER convolution module [165]. Each convolution module has an internal array of 28x28 neurons and a 10x10 convolution mask implementing one of the 6 Gabor filters.

2. The subsampling modules were replaced by mapper modules that simply map every 4 input neighbours onto a single output, changing their address so as to encode the reduced address space. In this way, stage S2 transforms the incoming 28x28 address space into a 14x14 one, and stage S4 transforms the incoming 10x10 space into a 5x5 output space. The use of these modules simplifies the hardware implementation and


removes the trainable parameters usually present in the subsampling stages of multi-stage neural systems.

The output events of stage S2 are sent to four convolution structures with six input ports each. These structures use a shared internal neuron array of size 10x10 and a different convolution mask (of size 5x5) for each input channel. These structures emulate the behaviour of multi-kernel AER convolution chips [56]. When, after the arrival of several events, a neuron of the 10x10 array exceeds a firing threshold encoded as a parameter, that neuron is reset and fires an output event. To implement the saturation of the neuron state usually provided by sigmoid functions in frame-based neural systems, the following solution was adopted in the AER system: before a neuron can fire another output event, it must wait a refractory time T_ref, which limits (saturates) the firing activity of the neuron. These refractory times were used in stages C3 and C5 and were obtained analytically by establishing equivalences between the frame-based structures (with sigmoid functions) and the AER-based ones.
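A minimal sketch of this refractory mechanism, with illustrative names, is the following: a neuron that reaches its threshold only emits an event if at least T_ref has elapsed since its previous output event, so its output rate cannot exceed 1/T_ref:

    % Illustrative refractory-time check inside a C3/C5 convolution module.
    % state.tlast(x,y): time of the neuron's previous output event.
    if state.A(x,y) >= state.thr
        if t - state.tlast(x,y) >= state.Tref   % outside refractory period
            state.A(x,y)     = 0;               % reset and fire
            state.tlast(x,y) = t;
            out_events(end+1,:) = [t NaN NaN x y 1]; %#ok<AGROW>
        end
        % otherwise the neuron must wait: its firing activity saturates
    end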

The events obtained in stage 3 are again sent to mapper modules in stage S4, which reduce the encoded address space by a factor of 2.

Stage C5 is entirely equivalent to stage C3 and is implemented in the same way. This time each output neuron (8 in total) produces output events encoding a 1x1 address space. Each event coming from one of these 8 output channels is replicated by means of splitter modules to be used as input to stage F6, which implements full connectivity between the 8 inputs and the 4 outputs through trainable connection weights (32 in total). No refractory times were used for stage F6, since channel saturation is not a relevant issue in this stage: the (unsaturated) activity is positive only for the output channel of interest and negative for the rest.

B.7.3 Results

The proposed system was first tested using the frame-based implementation. For this purpose, images obtained by collecting events from the electronic AER retina were used.


The images of people in the horizontal and upside-down positions were obtained by rotating the images of the people (usually standing and walking in real scenarios) by 90 and 180 degrees respectively. The frame-based implementation produced very good results, with a recognition rate above 93% (using 250 images for training out of a total of 1048).

The AER-based implementation also produced excellent results, with a success rate above 96%. The result obtained depended on the values of the refractory times used in the third (C3) and fifth (C5) stages.

A figure quite representative of the achieved success rate is Fig. B.8. To obtain it, an event flow was created in which people in the 3 different postures alternated, and the response of the system was analysed. For this experiment, refractory times T_ref of 0.5 ms and 18 ms were used for stages C3 and C5 respectively. The figure shows the input events and the output events. The events corresponding to the standing posture are represented with value 7, those corresponding to the horizontal posture with value 5, and those corresponding to the upside-down posture with value 6. The output channel 'UP', encoding the 'standing' posture, is represented with the values 3 and -3 (the positive value means the system interprets the input as 'STANDING' and the negative value means the system does not recognise it as such); the events of the 'HORIZONTAL' channel (person in horizontal position) are represented with the values 1 and -1, and those of the 'UPSIDE-DOWN' channel (person upside down) with the values 2 and -2. Finally, the 'NOISE' channel (other kinds of objects or noise) is encoded with the values 4 and -4. As can be seen, the system recognises at all times, and in real time, that there is a person in the scene and the posture he or she is in, with delays below 15 ms from the moment the input visual stimulus is received by the system.


Figure B.8: a) System input and output when the input alternates between the 'standing', 'horizontal' and 'upside down' postures. The values '5', '6' and '7' correspond to the 'horizontal', 'upside down' and 'standing' postures respectively. The absolute values '1', '2', '3' and '4' correspond to the activity on the output channels identifying the 'horizontal', 'upside down', 'standing' and 'noise' categories.


Bibliography

[1] T. Serre, L.Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, “Robust object

recognition with cortex-like mechanisms,” IEEE Trans. Pattern Anal. Mach.

Intell., vol. 29(3), pp. 411-426, Mar. 2007. 2, 5, 6, 70, 81, 86, 92

[2] K. Fukushima and N. Wake, “Handwritten alphanumeric character recognition

by the neocognitron,” IEEE Trans. Neural Netw., vol. 2(3), pp. 355-365, May

1991. 2, 51, 176

[3] T. Masquelier, R. Guyonneau, and S. J. Thorpe, “Competitive STDP-based

spike pattern learning,” Neural Comp., 21, 1-18, 2008. 2

[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning ap-

plied to document recognition,” Proc. IEEE, vol. 86(11), pp. 2278-2324, Nov.

1998. 2, 87, 90, 91, 92, 115, 116, 180, 181

[5] S. Furber , “The Future of Computer Technology and its Implications for the

Computer Industry”, J. Comput. vol. 51(6), pp. 735-740, 2008. 2

[6] E. M. Izhikevich, “Simple model of spiking neurons”, IEEE Transactions on

Neural Networks, vol. 14, pp. 1569-1572, 2003. 2, 17

[7] A. L. Hodgkin, A. F. Huxley, “A quantitative description of membrane current

and its application to conduction and excitation in nerve”, Journal of Physiology,

vol. 117(4), pp. 500-544, 1952. 2, 17

[8] G. M. Shepherd, The Synaptic Organization of the Brain, 3rd ed. New York:

Oxford Univ. Press, 1990. 5, 92, 171


[9] E. T. Rolls and G. Deco, Computational Neuroscience of Vision. New York:

Oxford Univ. Press, 2002. 5

[10] R. DeValois, D. Albrecht, and L. Thorell, “Spatial frequency selectivity of cells

in macaque visual cortex,” Vis. Res., vol. 22, pp. 545-559, 1982. 6

[11] R. DeValois, E. Yund, and N. Hepler, “The orientation and direction selectivity

of cells in macaque visual cortex,” Vis. Res., vol. 22, pp. 531-544, 1982. 6

[12] P. H. Schiller, B. L. Finlay, and S. F. Volman, “Quantitative studies of single-cell

properties in monkey striate cortex. Spatial frequency,” J. Neurophysiol., vol.

39(6), pp. 1334-1351, 1976. 6

[13] S. Grossberg, E. Mingolla, and J.Williamson, “Synthetic aperture radar process-

ing by a multiple scale neural system for boundary and surface representation,”

Neural Netw., vol. 8(7/8), pp. 1005-1028, 1995. 6

[14] S. Lawrence, C. L. Giles, A. Tsoi, and A. Back, “Face recognition: A convo-

lutional neural network approach,” IEEE Trans. Neural Netw., vol. 8(1), pp.

98-113, Jan. 1997. 6

[15] P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128x128 120 dB 30 mW asyn-

chronous vision sensor that responds to relative intensity change,” IEEE J. Solid-

State Circuits, vol. 43(2), pp. 566-576, Feb. 2008. 6, 12, 14, 62, 87, 99, 146, 181

[16] J. Costas-Santos, T. Serrano-Gotarredona, R. Serrano-Gotarredona, and B.

Linares-Barranco, “A contrast retina with on-chip calibration for neuromorphic

spike-based AER vision systems,” IEEE Trans. Circuits Syst. I, Reg. Papers,

vol. 54(7), pp. 1444-1458, Jul. 2007. 6, 14, 62

[17] R. Serrano-Gotarredona, T. Serrano-Gotarredona, A. Acosta-Jimenez, and B.

Linares-Barranco, “A neuromorphic cortical layer microchip for spike based

event processing vision systems,” IEEE Trans. Circuits Syst. I, Reg. Papers,

vol. 53(12), pp. 2548-2566, Dec. 2006. 7, 13, 14, 46, 49, 54, 62

[18] E. Kandel, J. Schwartz, and T. M. Jessel,. Principles of Neural Science. Elsevier,

New York, 1991. 8


[19] E. D. Adrian, Y. Zotterman, “The impulses produced by sensory nerve endings:

Part II: The response of a single end organ”, Journal of Physiology, no 61, pp.

151-71, 1926. 8

[20] R. Stein, E. Gossen, K. Jones, “Neuronal variability: noise or part of the sig-

nal?”, Nature Reviews Neuroscience vol. 6, pp. 389-397, 2005. 8

[21] A. L. Jacobs, et al., “Ruling out and ruling in neural codes,” Proc Natl Acad

Sci U S A, vol. 106, pp. 5936-41, 2009. 9

[22] S. J. Thorpe, “Spike arrival times: A highly efficient coding scheme for neural

networks,” Parallel processing in neural systems and computers, R. Eckmiller,

G. Hartmann, and G. Hauske, Editors. 1990, Elsevier: North-Holland. p. 91-94.

9

[23] R. VanRullen, R. Guyonneau, and S. J. Thorpe, “Spike times make sense,”

Trends Neurosci, vol 28, pp. 1-4, 2005. 9

[24] R. VanRullen and S. J. Thorpe, “Rate coding versus temporal order coding:

what the retinal ganglion cells tell the visual cortex,” Neural Comput, vol 13,

pp. 1255-83, 2001. 9

[25] T. Gollisch and M. Meister, “Rapid neural coding in the retina with relative

spike latencies,” Science, vol 319, pp. 1108-11, 2008. 9

[26] S Wu, S Amari, and H Nakahara, “Population Coding and Decoding in a Neural

Field: A Computational Study,” Neural Computation vol. 14, pp. 999-1026,

2002. 11

[27] J. H. R. Maunsell, D. C. Van Essen, “Functional properties of neurons in middle

temporal visual area of the Macaque monkey. I. Selectivity for stimulus direction,

speed, and orientation,” Journal of Neurophysiology, vol. 49, pp. 1127-1147,

1983. 11

[28] D. H. Hubel, T. N. Wiesel, “Receptive fields of single neurons in the cat’s striate

cortex,” Journal of Physiology, vol. 148, pp. 574-591,1959. 12


[29] M. A. Montemurro, M. J. Rasch, Y. Murayama, N. K. Logothetis, S. Panzeri,

“Phase-of-Firing Coding of Natural Visual Stimuli in Primary Visual Cortex,”

Current Biology, vol. 18(5), pp. 375-380, 2008 12

[30] M. Sivilotti, “Wiring considerations in analog VLSI systems with application to

field-programmable networks,” Ph.D. dissertation, Comput. Sci. Div., California

Inst. Technol., Pasadena, CA, 1991. 2, 13, 87, 172

[31] K. Boahen, “Point-to-point connectivity between neuromorphic chips using ad-

dress events,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol.

47(5), pp. 416-434, May 2000. 13, 87, 172

[32] E. Culurciello, R. Etienne-Cummings, and K. A. Boahen, “A biomorphic digital

image sensor,” IEEE J. Solid-State Circuits, vol. 38(2), pp. 281-294, Feb. 2003.

13, 62

[33] P. F. Ruedi, P. Heim, F. Kaess, E. Grenet, F. Heitger, P.-Y. Burgi, S. Gyger,

and P. Nussbaum, “A 128x128, pixel 120-dB dynamic-range vision-sensor chip

for image contrast and orientation extraction,” IEEE J. Solid-State Circuits,

vol. 38(12), pp. 2325-2333, Dec. 2003. 13

[34] C. Shoushun and A. Bermak, “A low power CMOS imager based on time-to-

first-spike encoding and fair AER,” in Proc. IEEE Int. Symp. Circuits Syst.,

2005, pp. 5306-5309. 13

[35] A. Delorme, L. Perrinet, and S. J. Thorpe, “Networks of integrate and-fire neu-

rons using rank order coding B: Spike timing dependent plasticity and emergence

of orientation selectivity,” Neurocomputing, vol. 38-40, pp. 539-45, 2001.

[36] R. Serrano-Gotarredona, M. Oster, P. Lichtsteiner, A. Linares-Barranco, R.

Paz-Vicente, F. Gomez-Rodrıguez, L. Camunas-Mesa, R. Berner, M. Rivas,

T. Delbruck, S. C. Liu, R. Douglas, P. Hafliger, G. Jimenez-Moreno, A.

Civit, T. Serrano-Gotarredona, A. Acosta-Jimenez, and B. Linares-Barranco,

“CAVIAR: A 45 k-Neuron, 5 M-Synapse, 12 G-connects/sec AER hardware

sensory-processing-learning-actuating system for high speed visual object recog-

nition and tracking,” IEEE Trans. Neural Netw., vol. 20(9), pp. 1417-1438, Sep.

2009. 13, 14, 16, 19, 30, 79, 96, 101


[37] T. Serrano-Gotarredona, A. G. Andreou, and B. Linares-Barranco, “AER image

filtering architecture for vision-processing systems,” IEEE Trans. Circuits Syst.

I, Fundam. Theory Appl., vol. 46(9), pp. 1064-1071, Sep. 1999. 13, 172

[38] D. H. Goldberg, G. Cauwenberghs, and A. G. Andreou, “Probabilistic synap-

tic weighting in a reconfigurable network of VLSI integrate-and-fire neurons,”

Neural Netw., vol. 14, pp. 781-793, 2001. 13

[39] R. Serrano-Gotarredona, T. Serrano-Gotarredona, A. Acosta-Jimenez, C.

Serrano-Gotarredona, J. A. Perez-Carrasco, A. Linares-Barranco, G. Jimenez-

Moreno, A. Civit-Ballcels, and B. Linares-Barranco, “On real-time AER 2D

convolutions hardware for neuromorphic spike based cortical processing,” IEEE

Trans. Neural Netw., vol. 19(7), pp. 1196-1219, Jul. 2008. 13, 14, 54, 62, 101,

173, 180

[40] M. Azadmehr, J. Abrahamsen, and P. Hfliger, “A foveated AER imager chip,”

in Proc. IEEE Int. Symp. Circuits Syst., Kobe, Japan, 2005, pp. 2751-2754. 13

[41] K. A. Zaghloul and K. Boahen, “Optic nerve signals in a neuromorphic chip:

Part I and II,” IEEE Trans. Biomed. Eng., vol. 51(4), pp. 657-675, Apr. 2004.

14, 62

[42] K. Boahen, “Retinomorphic chips that see quadruple images,” in Proc. Int.

Conf. Microelectron. Neural Fuzzy Bio-Inspired Syst., Granada, Spain, 1999,

pp. 12-20. 14

[43] J. Lazzaro, J. Wawrzynek, M. Mahowald, M. Silvilotti, and D. Gillespie, “Silicon

auditory processors as computer peripherals,” IEEE Trans. Neural Netw., vol.

4(3), pp. 523-528, May 1993. 14, 87

[44] G. Cauwenberghs, N. Kumar, W. Himmelbauer, and A. G. Andreou, “An analog

VLSI chip with asynchronous interface for auditory feature extraction,” IEEE

Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 45(5), pp. 600-606,

May 1998. 14

[45] E. Chicca, A. M. Whatley, P. Lichtsteiner, V. Dante, T. Delbruck, P. Del Giu-

dice, R. J. Douglas, and G. Indiveri, “A multichip pulse-based neuromorphic


infrastructure and its application to a model of orientation selectivity,” IEEE

Trans. Circuits Syst. I, Reg. Papers, vol. 54(5), pp. 981-993, May 2007.

[46] M. Oster, Y. Wang, R. Douglas, and S.-C. Liu, “Quantification of a spike-based

winner-take-all VLSI network,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol.

55(10), pp. 3160-3169, Nov. 2008. 14, 16


[47] T. Teixeira, A. G. Andreou, and E. Culurciello, “Event-based imaging with

active illumination in sensor networks,” in Proc. IEEE Int. Symp. Circuits Syst.,

Kobe, Japan, 2005, pp. 644-647. 14

[48] A. Cohen, R. Etienne-Cummings, T. Horiuchi, G. Indiveri, S. Shamma, R. Dou-

glas, C. Koch, and T. Sejnowski, “Rep. 2004 Workshop on Neuromorphic Eng.,”,

Telluride, CO, Jun. 27 to Jul. 17 2004. 172

[49] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P.

Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, and N. Borkar,

“An 80-Tile 1.28 TFLOPS network-on-chip in 65 nm CMOS,” in Proc. IEEE

Int. Solid-State Circ. Conf., Feb. 2007, pp. 98-99. 14

[50] R. Serrano-Gotarredona, et al., AER Building Blocks for Multi-Layers Multi-

Chips Neu-romorphic Vision Systems, in Advances in Neural Information Pro-

cessing Systems, Vol. 18, Y. Weiss and B. S. and J. Platt (Eds.), (NIPS’06),

MIT Press, Cambridge, MA, 1217-1224, (2006). 16, 46, 47, 54, 62, 75, 96, 172,

173

[51] T. Texeira, E. Culurciello and A.G. Andreou, “An address-event image sensor

network,” Proceedings of the 2006 IEEE International Symposium on Circuits

and Systems, (ISCAS 2006), Kos, Greece, pp. 4467-4470, May 2006. 16

[52] R. Brette, M. Rudolph, T. Carnevale, et al. “Simulation of networks of spiking

neurons: A review of tools and strategies”, Journal of Computional Neuro-

science, Vol. 23(3), pp.349-398, Dec 2007. 17

[53] R. Brette, W. Gerstner, “Adaptive exponential integrate-and-fire model as an

effective description of neuronal activity”, Journal of Neurophysiology, vol. 94, pp.

3637-3642, 2005. 17


[54] T. Clayton, “How much can we trust neural simulation strategies?”, Neurocom-

puting, vol. 70(10-12), pp. 1966-1969, June 2007. 17

[55] F. Gomez-Rodrıguez, R. Paz, A. Linares-Barranco, M. Rivas, L. Miro, G.

Jimenez, A. Civit. “AER tools for Communications and Debugging”. Proc. IEEE

ISCAS06. Kos, Greece, May 2006. 19, 30, 75, 173

[56] L. A. Camunas-Mesa, A. Linares-Barranco, A. J. Acosta-Jimenez, T. Serrano-

Gotarredona, B. Linares Barranco, “Improved Aer Convolution Chip for Vision

Processing With Higher Resolution and New Functionalities,” Conference on

Design of Circuits and Integrated Systems 2009. Num. 21. Barcelona. DCIS.

2009. Pag. 1-6. 21, 98, 184

[57] A. Linares-Barranco, F. Gomez-Rodrıguez, G. Jimenez-Moreno, T. Delbruck, R.

Berner, et. al., “Implementation of a Time-Warping Aer Mapper,” Proc. IEEE

International Symposium on Circuits and Systems. IEEE Circuits and Systems

Society. pp. 2886-2889, Taiwan, 2009. 33

[58] L. Camunas-Mesa, A. Acosta-Jimenez, C. Zamarreno-Ramos, T. Serrano-

Gotarredona, and B. Linares-Barranco, “A 32x32 Pixel Convolution Processor

Chip for Address Event Vision Sensors with 155ns Event Latency and 20Meps

Throughput,” accepted for publication in IEEE Transactions On Circuits And

Systems, 2010. 37, 38, 40

[59] http://sourceforge.net/apps/trac/jaer/wiki. 48, 146, 154

[60] K. A. Zaghloul and K. Boahen, “Optic nerve signals in a neuro-morphic chip:

Part 2,” IEEE Trans.Biomed Eng., vol. 51, pp. 667-675, 2004. 62

[61] K. Fukushima: “Visual feature extraction by a multilayered network of analog

threshold elements”, IEEE Transactions on Systems Science and Cybernetics,

SSC-5 (4), pp. 322-333, Oct. 1969. 63

[62] T. Randen and J. H. Husoy, “Filtering for texture classification: A comparative

study,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 4, pp. 291-310,

Apr. 1999. 67, 68, 75, 80


[63] R. M. Haralick, “Statistical and structural approaches to texture,” Proc. IEEE,

vol. 67, no. 5, pp. 786-804, May 1979. 67

[64] G. V. D. Wouwer, P. Scheunders, and D. V. Dyck, “Statistical texture charac-

terization from discrete wavelet representations,” IEEE Trans. Image Process.,

vol. 8, no. 4, pp. 592-598, Apr. 1999. 67

[65] G. R. Cross and A. K. Jain, “Markov random field texture models,” IEEE Trans.

Pattern Anal. Mach. Intell., vol. 5, no.1, pp. 25-39, Jan. 1983. 67

[66] R. L. Kashyap and R. Chellappa, “Estimation and choice of neighbors in spatial-

interaction models of images,” IEEE Trans. Inf. Theory, vol. 29, no. 1, pp. 60-72,

Jan. 1983. 67

[67] R. M. Haralick, K. Shanmugan, and I. Dinstein, “Texture features for image

classification,” IEEE Trans. Syst. Man Cybern., vol. 3, no. 6, pp. 610-621, Nov.

1973. 68

[68] A. Speis and G. Healey, “Feature extraction for texture discrimination via ran-

dom field models with random spatial interaction,” IEEE Trans. Image Process.,

vol. 5, no. 4, pp. 635-645, Apr. 1996. 68

[69] T. Chang and C.C. J. Kuo, “Texture analysis and classification with trees-

tructured wavelet transform,” IEEE Trans. Image Process., vol. 2, no. 4, pp.

429-441, Apr. 1993. 68

[70] M. Unser, “Texture classification and segmentation using wavelet frames,” IEEE

Trans. Image Process., vol. 4, no. 11, pp. 1549-1560, Nov. 1995. 68

[71] G. M. Haley and B. S. Manjunath, “Rotation-invariant texture classification

using a complete space-frequency model,” IEEE Trans. Image Process., vol. 8,

no. 2, pp. 255-269, Feb. 1999. 68, 69

[72] W. T. Freeman and E. H. Adelson, “The design and use of steerable filters,”

IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 9, pp. 891-906, Sep. 1991.

68


[73] J. G. Rosiles and M. J. T. Smith, “Texture classification with a biorthogonal

directional filter bank,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process.,

May 2001, pp. 1549-1552. 68

[74] J. Han and K. K. Ma, “Rotation-invariant and scale-invariant Gabor features

for texture image retrieval,” Image Vis. Comput., vol. 25, no. 9, pp. 1474-1481,

Sep. 2007. 68, 69, 76, 78, 80, 81

[75] T. Sikora, “The MPEG-7 visual standard for content description. An overview,”

IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 6, pp. 696-702, Jun. 2001.

68

[76] J. J. Kulikowski and P. O. Bishop, “Fourier analysis and spatial representation

in the visual cortex,” Experientia, vol. 37, pp. 160-163, 1981. 68

[77] D. A. Clausi and H. Deng, “Design-based texture feature fusion using Gabor

filters and co-occurrence probabilities,” IEEE Trans. Image Process., vol. 14,

no. 7, pp. 925-936, Jul. 2005. 68

[78] D. A. Clausi, “Comparison and fusion of co-occurrence, Gabor and MRF texture

features for classification of SAR sea-ice imagery,” Atmos. Oceans, vol. 39, no.

4, pp. 183-194, 2001.

[79] S. Li and J. Shawe-Taylor, “Comparison and fusion of multiresolution features

for texture classification,” Pattern Recognit. Lett., vol. 26, pp. 633-638, 2005.

[80] N. Qaiser,M. Hussain, A. Hussain, “Texture recognition by fusion of optimized

moment based and Gabor energy features,” Int. J. Comput. Sci. Network Secu-

rity, vol. 8, no. 2, pp. 264-270, Feb. 2008. 68

[81] C.C. Chen and C.C. Chen, “Filtering methods for texture discrimination,” Pat-

tern Recognit. Lett., vol. 20, pp. 783-790, 1999. 68, 80

[82] R. Picard, T. Kabir, and F. Liu, “Real-time recognition with the entire Brodatz

texture database,” in Proc. Comput. Vis. Pattern Recognit., 1993, pp. 638-639.

[83] P. P. Ohanian and R. C. Dubes, “Performance evaluation for four classes of

textural features,” Pattern Recognit., vol. 25, no. 8, pp. 819-833, 1992. 68, 80


[84] K. O. Cheng, N. F. Law, and W. C. Siu, “A novel fast and reduced redundancy

structure for multiscale directional filter banks,” IEEE Trans. Image Process.,

vol. 16, no. 8, pp. 2058-68, Aug. 2007. 68, 76, 78, 80, 81

[85] K. O. Cheng, N. F. Law, and W. C. Siu, “Multiscale directional filter bank with

applications to structured and random texture retrieval,” Pattern Recognit., vol.

40, no. 4, pp. 1182-1194, 2007. 68, 69, 76, 78, 80

[86] M. N. Do and M. Vetterli, “Pyramidal directional filter banks and curvelets,”

in Proc. IEEE Int. Conf. Image Process., Oct. 2001, vol. 3, pp. 158-161. 68

[87] M. N. Do and M. Vetterli, “The contourlet transform: An efficient directional

multiresolution image representation,” IEEE Trans. Image Process., vol. 14, no.

12, pp. 2091-2106, Dec. 2005. 68

[88] S. Lazebnik, C. Schmid, and J. Ponce, “A sparse texture representation using

local affine regions,” Proc. Comput. Vis. Pattern Recognit., vol. 2, pp. 319-324,

2003. 68, 76, 78, 80

[89] M. Mellor, B. W. Hong, and M. Brady, “Locally rotation, contrast, and scale

invariant descriptors for texture analysis,” IEEE Trans. Pattern Anal. Mach.

Intell., vol. 30, no. 1, pp. 52-61, Jan. 2008. 68, 76, 80

[90] M. Kokare, P. K. Biswas, and B. N. Chatterji, “Rotation-invariant texture image

retrieval using rotated complex wavelet filters,” IEEE Trans. Syst. Man Cybern.

B, Cybern., vol. 36, no. 6, pp. 1273-1282, Dec. 2006. 68, 69, 76, 78, 80, 81

[91] E. P. Simoncelli and W. T. Freeman, “The steerable pyramid: A flexible ar-

chitecture for multi-scale derivative computation,” in Proc. Int. Conf. Image

Process., Oct. 1995, pp. 444-447. 69

[92] M. Pi and H. Li, “Fractal indexing with the joint statistical properties and its

application in texture image retrieval,” IET Image Process., vol. 2, no. 4, pp.

218-230, 2008. 69, 76, 78, 80, 81

[93] M. H. Pi, C. S. Tong, S. K. Choy, and H. Zhang, “A fast and effective model

for wavelet subband histograms and its application in texture image retrieval,”


IEEE Trans. Image Process., vol. 15, no. 10, pp. 3078-3088, Oct. 2006. 69, 76,

78, 80, 81

[94] B. S. Manjunath and W. Y. Ma, “Texture features for browsing and retrieval

of image data,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 18, no. 8, pp.

837-842, Aug. 1996. 69, 70, 71, 76, 78, 80, 81, 90, 179

[95] J.G. Daugman, “Complete Discrete 2D Gabor Transforms by Neural Net-works

for Image Analysis and Compression,” IEEE Trans. ASSP, vol. 36, pp. 1169-

1179, July 1988. 69

[96] A.C. Bovic, M. Clark, and W.S. Geisler, “Multichannel Texture Analysis Using

Localized Spatial Filters,” IEEE Trans. Pattern Analysis and Machine Intelli-

gence, vol. 12, no. 1, pp. 55-73, Jan. 1990.

[97] B.S. Manjunath and R. Chellappa, “A Unified Approach to Boundary Detec-

tion,” IEEE Trans. Neural Networks, vol. 4, no. 1, pp. 96-108, Jan. 1993. 69


[98] J.G. Daugman, “High Confidence Visual Recognition of Persons by a Test of

Statistical Independence,” IEEE Trans. Pattern Analysis and Machine Intelli-

gence, vol. 15, no. 11, pp. 1148-1161, Nov. 1993. 69

[99] M. Lades et al., “Distortion Invariant Object Recognition in the Dynamic Link

Architecture,” IEEE Trans. Computers, vol. 42, no. 3, pp. 300-311, Mar. 1993.

[100] B.S. Manjunath and R. Chellappa, “A Feature Based Approach to Face Recog-

nition,” Proc. IEEE Conf. CVPR ’92, pp. 373-378, Champaign, Ill., June 1992.

69

[101] R.J. Ferrari, R.M. Rangayyan, J.E.L. Desautels, R.A. Borges, A.F. Frere,

“Analysis of asymmetry in mammograms via directional filtering with Gabor

wavelets,” IEEE Trans. Med. Imag. 20(9), pp. 953-964, 2001. 70

[102] R.J. Ferrari, R.M. Rangayyan, J.E.L. Desautels, R.A. Borges, and A.F. Frere,

“Automatic identification of the pectoral muscle in mammograms”, IEEE Trans.

on Medical Imaging, 23(2), pp. 232-245, 2004. 70


[103] D.H. Hubel and T.N. Wiesel, “Functional Architecture of Macaque Mon-key

Visual Cortex,” Proc. Royal Soc. B (London), vol. 198, pp. 1-59,1978. 70

[104] S. Marcelja, “Mathematical Description of the Responses of Simple Cortical

Cells,” J. Optical Soc. Am., vol. 70, pp. 1297-1300, 1980. 70

[105] J.G. Daugman, “Two-Dimensional Spectral Analysis of Cortical Receptive Field

Profile,” Vision Research, vol. 20, pp. 847-856,1980. 70

[106] J.G. Daugman, “Uncertainty Relation for Resolution in Space, Spatial Fre-

quency, and Orientation Optimized by Two- Dimensional Visual Cortical Fil-

ters,” J. Optical Soc. Amer., vol. 2, no. 7, pp. 1160-1169,1985. 70

[107] B. S. Manjunath, Jens-Rainer Ohm, Vinod V. Vasudevan, and Akio Yamada,

“Color and Texture Descriptors,” IEEE Trans. on Circuits and Systems for

Video Technology, vol. 11, no. 6, June 2001. 70

[108] M. Jian, J. Dong, R. Tang, “Combining Color, Texture and Region with Objects

of User’s Interest for Content-based Image Retrieval”, in Proceedings of Eighth

ACIS International Conference on Software Engineering, Artificial Intelligence,

Networking, and Parallel/Distributed Computing, pp. 764-769, IEEE Computer

Society, 2007.

[109] S. Bhagavathy and B. S. Manjunath, “Modeling and Detection of Geospatial

Objects Using Texture Motifs,” IEEE Transactions on Geoscience and Remote

Sensing, vol. 44, no. 12, December 2006.

[110] S. Newsam and B. S. Manjunath, “Normalized texture motifs and their appli-

cation to statistical object modeling,” IEEE International Conference on Com-

puter Vision and Pattern Recognition: Workshop on Perceptual Organization

in Computer Vision, Washington, D. C., June 2004.

[111] J. Dong, M. Jian, D. Gao, S. Wang. “Reducing the Dimensionality of Feature

Vectors for Texture Image Retrieval Based on Wavelet Decomposition”, in Proc.

of Eighth ACIS International Conference on Software Engineering, Artificial

Intelligence, Networking, and Parallel/Distributed Computing, IEEE Computer

Society, 2007. 75


[112] B. S. Manjunath, B. Sumengen, Z. Bi, J. Byun, M. El-Saban, D. Fedorov, and

N. Vu, “Towards automated bioimage analysis: From features to semantics,” in

IEEE Int. Symposium on Biomedical Imaging (ISBI), 2006.

[113] Text of ISO/IEC 15 938-3 Multimedia Content Description Interface-Part 3:

Visual. Final Committee Draft, ISO/IEC/JTC1/SC29/ WG11, Doc. N4062,

Mar. 2001.

[114] MPEG-7 Visual Experimentation Model (XM), Version 10.0,

ISO/IEC/JTC1/SC29/WG11, Doc. N4063, Mar. 2001.

[115] B. S. Manjunath, P. Salembier, and T. Sikora, Eds., Introduction to MPEG-7:

Multimedia Content Description Interface, John Wiley and Sons, first edition,

2002.

[116] Z. Sun, G. Bebis, and R. Miller, “On-Road Vehicle Detection Using Evolu-

tionary Gabor Filter Optimization”, IEEE Trans. On Intelligent Transportation

Systems, 6(2), 2005, 125-137. 70

[117] K. Xu, B. Georgescu, D. Comaniciu, P. Meer, “Performance Analysis in Content-

based Retrieval with Textures”. ICPR 2000, pp. 4275-4278. 72, 75

[118] M. Kokare, B.N.Chatterji and P.K.Biswas, “Comparison of Similarity Metrics

for Texture Image Retrieval”, IEEE Digital index No. 0-7803-7651-X, 2003. 72

[119] L. Chen, G. Lu, D. Zhang, “Effects of Different Gabor Filter Parameters on Im-

age Retrieval by Texture,” pp.273, Proceedings of the 10th International Multi-

media Modelling Conference, 2004 (MMM’04). 72

[120] P. Brodatz, Textures: A Photographic Album for Artists and Designers. New

York: Dover, 1966. 68, 75, 76, 77

[121] A.P.N. Vo, T. T. Nguyen, S. Oraintara, “Texture Image Retrieval Using Com-

plex Directional Filter Bank,” 2006 IEEE International Symposium on Circuits

and Systems, Island of Kos, Greece, May 2006. 75

[122] B.S. Manjunath, C. Shekhar, and R. Chellappa, “A New Approach to Image

Feature Detection with Applications,” Pattern Recognition, vol. 29(4), pp. 627-

640, April 1996. 75


[123] M. Jian, J. Dong, D. Gao, Z. Liang, “New Texture Features Based on Wavelet

Transform Coinciding with Human Visual Perception,” in Proc. of Eighth ACIS

International Conference on Software Engineering, Artificial Intelligence, Net-

working, and Parallel/Distributed Computing, IEEE Computer Society, 2007.

[124] M. Kokare, P. K. Biswas, B. N. Chatterji, “Rotation invariant texture features

using rotated complex wavelet for content based image retrieval,” ICIP 2004,

pp. 393-396.

[125] Z. Liu and S. Wada, “Robust Feature Extraction Technique for Texture Image

Retrieval,” ICIP 2005, pp. 525-528.

[126] G. Guo; H. J. Zhang; S.Z. Li, “Distance-From-Boundary As A Metric for Tex-

ture Image Retrieval,” IEEE International Conference on Acoustics, Speech and

Signal Processing, vol. 3, pp. 1629-1632, 2001.

[127] Y. Liu, and X. Zhou, “A Simple Texture Descriptor for Texture Retrieval”,

Proceedings of ICTT, pp. 1662-1665, 2003.

[128] X. Fu, Y. Li, R. Harrison, S. Belkasim, “Content-based Image Retrieval Using

Gabor-Zernike Features,” ICPR 2006, pp. 417-420.

[129] B.S. Manjunath, P. Wu, S. Newsam, H.D. Shin1, “A texture descriptor for

browsing and similarity retrieval,” SP:IC(16), No. 1-2, September 2000, pp. 33-

43.

[130] A. Ahmadiad , E. Faramarz, Sayadian, “Image Indexing and Retrieval Using

Gabor Wavelet and Legendre Moments,” Proceedings of the 25th Annual Inter-

national Conference of the IEEE EMBS, Cancun, Mexico, pp.17-21, September

2003. 75

[131] M.N. Do, and M. Vetterli, “Wavelet-based texture retrieval using generalized

Gaussian density and Kullback-Leibler distance,” IEEE Trans. Image Process.

v11(2), pp. 146-158. February 2002. 75

[132] A. Ahmadian, A. Mostafa, “An efficient texture classification algorithm using

Gabor Wavelet”, Proceedings of the 25th Annual International Conference of

the IEEE EMBS, Cancun, Mexico, September 17-21, 2003, pp. 930-933. 75


[133] Y. Liu, D. S. Zhang, G. Lu and W.-Y. Ma, “Study on Texture Feature Extraction in Region-based Image Retrieval System,” in Proc. of IEEE International Conf. on Multimedia Modeling (MMM06), pp. 264-271, Beijing, Jan. 2006.

[134] P. Howarth and S. Ruger, “Robust texture features for still-image retrieval,” IEE Proc. Vision, Image and Signal Processing, vol. 152, no. 6, pp. 868-874, December 2005.

[135] R. Picard, C. Graczyk, S. Mann, et al., “Vision texture 1.0,” tech. rep., Media Laboratory, MIT, 1995. http://www.white.media.mit.edu/vismod/imagery/VisionTexture/vistex.html

[136] R. Marculescu and P. Bogdan, “The chip is the network: Toward a science of network-on-chip design,” Foundations and Trends in Electronic Design Automation, pp. 371-461, March 2009.

[137] T. Serre, “Learning a dictionary of shape-components in visual cortex: Comparison with neurons, humans and machines,” MIT Comput. Sci. & AI Lab, Cambridge, MA, Tech. Rep. MIT-CSAIL-TR-2006-028 CBCL-260, 2006.

[138] H. Fujii, H. Ito, K. Aihara, N. Ichinose, and M. Tsukada, “Dynamical cell assembly hypothesis - Theoretical possibility of spatio-temporal coding in the cortex,” Neural Netw., vol. 9, pp. 1303-1350, 1996.

[139] Y. Le Cun and Y. Bengio, “Convolutional networks for images, speech, and time series,” in Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed. Cambridge, MA: MIT Press, 1995, pp. 255-258.

[140] M. Matsugu, K. Mori, M. Ishi, and Y. Mitarai, “Convolutional spiking neural network model for robust face detection,” in Proc. 9th Int. Conf. Neural Inf. Process., 2002, vol. 2, pp. 660-664.

[141] B. Fasel, “Robust face analysis using convolutional neural networks,” in Proc. Int. Conf. Pattern Recognit., pp. 40-43, 2002.

[142] M. Browne and S. S. Ghidary, “Convolutional neural networks for image processing: An application in robot vision,” in Advances in Artificial Intelligence. Cambridge, MA: MIT Press, 2003, pp. 641-652.

[143] C. Neubauer, “Evaluation of convolutional neural networks for visual recognition,” IEEE Trans. Neural Netw., vol. 9, no. 4, pp. 685-696, Jul. 1998.

[144] S. Thorpe, D. Fize, and C. Marlot, “Speed of processing in the human visual system,” Nature, vol. 381, pp. 520-522, Jun. 1996.

[145] K. Fukushima, “Analysis of the process of visual pattern recognition by the neocognitron,” Neural Netw., vol. 2, pp. 413-420, 1989.

[146] L. Itti, “Quantitative modeling of perceptual salience at human eye position,” Vis. Cogn., vol. 14, no. 4, pp. 959-984, 2006.

[147] L. Itti, “Quantifying the contribution of low-level saliency to human eye movements in dynamic scenes,” Vis. Cogn., vol. 12, no. 6, pp. 1093-1123, Aug. 2005.

[148] T. Crimmins, “Geometric filter for speckle reduction,” Appl. Opt., vol. 24, pp. 1438-1443, 1985.

[149] S. Grossberg, E. Mingolla, and W. D. Ross, “Visual brain and visual perception: How does the cortex do perceptual grouping?,” Trends Neurosci., vol. 20, pp. 106-111, 1997.

[150] E. Mingolla, W. Ross, and S. Grossberg, “A neural network for enhancing boundaries and surfaces in synthetic aperture radar images,” Neural Netw., vol. 12, no. 3, pp. 499-511, 1999.

[151] V. Jain et al., “Supervised learning of image restoration with convolutional networks,” IEEE Int. Conf. Comput. Vis. (ICCV), 2007.

[152] R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” Proc. Int. Conf. on Machine Learning (ICML 08), pp. 160-167, 2008.

[153] R. Hadsell, P. Sermanet, M. Scoffier, A. Erkan, K. Kavukcuoglu, U. Muller and Y. LeCun, “Learning Long-Range Vision for Autonomous Off-Road Driving,” J. Field Robotics, vol. 26(2), pp. 120-144, February 2009.

[154] Y. Bengio and Y. LeCun, “Scaling learning algorithms towards AI,” in L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines, MIT Press, 2007.

[155] Y. Le Cun et al., “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, pp. 541-551, 1989.

[156] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the Best Multi-Stage Architecture for Object Recognition?,” in Proc. International Conference on Computer Vision (ICCV’09), 2009.

[157] P. Lichtsteiner, J. Kramer, T. Delbruck, “Improved ON/OFF temporally differentiating address-event imager,” 11th IEEE International Conference on Electronics, Circuits and Systems (ICECS 2004), Tel Aviv, Israel, pp. 211-214.

[158] A. Gentile and D. S. Wills, “Portable video supercomputing,” IEEE Trans. Comput., vol. 53(8), pp. 960-972, Aug. 2004.

[159] V. Wall, M. Torkelson, and P. Egelberg, “A custom image convolution DSP with a sustained calculation capacity of >1 GMAC/s and low I/O bandwidth,” J. VLSI Signal Process., vol. 23, pp. 335-349, 1999.

[160] H. Kwon, “A low-power image convolution algorithm for variable voltage processors,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2003, vol. 2, pp. 677-680.

[161] H. H. Cat, A. Gentile, J. C. Eble, M. Lee, O. Vendier, Y. J. Joo, D. S. Wills, M. Brooke, N. M. Jokerst, and A. S. Brown, “SIMPiL: An OE integrated SIMD architecture for focal plane processing applications,” in Proc. 3rd Int. Conf. Massively Parallel Process. Using Opt. Interconnects, 1996, pp. 44-52.

[162] F. Paillet, D. Mercier, and T. M. Bernard, “Making the most of 15k silicon area for a digital retina PE,” in Proc. SPIE Adv. Focal Plane Arrays Electron. Cameras II, Zurich, Switzerland, May 1998, vol. 3410, pp. 158-167.

[163] C. Farabet, C. Poulet and Y. LeCun, “An FPGA-Based Stream Processor for Embedded Real-Time Vision with Convolutional Networks,” in Proc. of the Fifth IEEE Workshop on Embedded Computer Vision (ECV’09, ICCV’09), IEEE, Kyoto, 2009.

[164] A. Torralba, “How many pixels make an image?,” Visual Neuroscience, vol. 26(1), pp. 123-131, 2009.

[165] L. Camunas-Mesa, et al., “Fully digital AER convolution chip for vision processing,” Proc. 2008 IEEE Int. Symp. Circ. & Syst. (ISCAS08), pp. 652-655, May 2008.

[166] R. Etienne-Cummings, Z. K. Kalayjian, and D. Cai, “A programmable focal-plane MIMD image processor chip,” IEEE J. Solid-State Circuits, vol. 36(1), pp. 64-73, Jan. 2001.

[167] D. B. Strukov and K. K. Likharev, “CMOL FPGA: a configurable architecture for hybrid digital circuits with two-terminal nanodevices,” Nanotechnology, vol. 16, pp. 888-900, 2005.

[168] B. Linares-Barranco and T. Serrano-Gotarredona, “Memristance can explain Spike-Time-Dependent-Plasticity in Neural Synapses,” http://hdl.handle.net/10101/npre.2009.3010.1, 2009.

[169] A. Linares-Barranco, G. Jimenez-Moreno, B. Linares-Barranco and A. Civit-Ballcels, “On Algorithmic Rate-Coded AER Generation,” IEEE Trans. on Neural Networks, vol. 17, no. 3, pp. 771-788, May 2006.

[170] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, pp. 671-680, 1983.

Publications

Journal Papers

1. J. A. Perez-Carrasco, B. Acha, C. Serrano, L. Camunas-Mesa, T. Serrano-Gotarredona, and B. Linares-Barranco, “Fast Vision through Frame-less Event-based Sensing and Convolutional Processing. Application to Texture Recognition,” IEEE Trans. Neural Networks, vol. 21, no. 4, pp. 609-620, April 2010.

2. R. Serrano-Gotarredona, T. Serrano-Gotarredona, A. Acosta-Jimenez, C. Serrano-Gotarredona, J. A. Perez-Carrasco, A. Linares-Barranco, G. Jimenez-Moreno, A. Civit-Ballcels, and B. Linares-Barranco, “On Real-Time AER 2D Convolutions Hardware for Neuromorphic Spike-Based Cortical Processing,” IEEE Trans. Neural Networks, vol. 19, no. 7, pp. 1196-1219, July 2008.

Under Review or in Preparation

1. S. Chen, P. Akselrod, E. Culurciello, J. A. Perez-Carrasco, B. Linares-Barranco, “Efficient feedforward categorization of objects and human postures with address-event image sensors,” submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI).

2. J. A. Perez-Carrasco, C. Serrano, B. Acha, T. Serrano-Gotarredona, and B. Linares-Barranco, “Event-Driven convolutional networks for fast vision posture recognition,” in preparation for IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI).

Conference Proceedings

1. J. A. Perez-Carrasco, C. Serrano, B. Acha, T. Serrano-Gotarredona, B. Linares-Barranco, “Spike-Based Convolutional Network for real-time processing,” 20th International Conference on Pattern Recognition (ICPR 2010), pp. 3085-3088, Istanbul, Turkey, 2010.

2. J. A. Perez-Carrasco, C. Zamarreno-Ramos, L. Camunas-Mesa, T. Serrano-Gotarredona, and B. Linares-Barranco, “On Neuromorphic Spiking Architectures for Asynchronous STDP Memristive Systems,” IEEE International Symposium on Circuits and Systems (ISCAS 2010), Paris, France, 2010.

3. L. Camunas-Mesa, J. A. Perez-Carrasco, C. Zamarreno-Ramos, T. Serrano-Gotarredona, and B. Linares-Barranco, “On Scalable Spiking ConvNet Hardware for Cortex-Like Visual Sensory Processing Systems,” IEEE International Symposium on Circuits and Systems (ISCAS 2010), Paris, France, 2010.

4. S. Thorpe, A. Brilhault, J. A. Perez-Carrasco, “Suggestions for a Biologically Inspired Spiking Retina Using Order-Based Coding,” IEEE International Symposium on Circuits and Systems (ISCAS 2010), Paris, France, 2010.

5. J. A. Perez-Carrasco, C. Zamarreno-Ramos, L. Camunas-Mesa, T. Serrano-Gotarredona, and B. Linares-Barranco, “Neocortical Frame-free Vision Sensing and Processing through Scalable Spiking ConvNet Hardware,” 2010 IEEE World Congress on Computational Intelligence, Barcelona, Spain, 2010.

6. J. A. Perez-Carrasco, C. Serrano, B. Acha, T. Serrano-Gotarredona, B. Linares-Barranco, “Advanced Vision Processing Systems: Spike-Based Simulation and Processing,” Advanced Concepts for Intelligent Vision Systems (ACIVS 2009), pp. 640-651, Bordeaux, France, 2009.

7. J. A. Perez-Carrasco, C. Serrano, B. Acha, T. Serrano-Gotarredona, B. Linares-Barranco, “Simulación de Sistemas Basados en Eventos,” XXIV Simposium Nacional de la Unión Científica Internacional de Radio (URSI 2009), Santander, September 2009.

8. J. A. Perez-Carrasco, C. Serrano, B. Acha, T. Serrano-Gotarredona, B. Linares-Barranco, “Procesamiento rápido de visión basado en AER,” XXIV Simposium Nacional de la Unión Científica Internacional de Radio (URSI 2009), Santander, September 2009.

9. J. A. Perez-Carrasco, B. Acha, C. Serrano, “Calibración colorimétrica para el diagnóstico automático de quemaduras,” XXIV Simposium Nacional de la Unión Científica Internacional de Radio (URSI 2009), Santander, September 2009.

10. J. A. Perez-Carrasco, C. Serrano, B. Acha, “Clasificación de Lesiones de Piel Basada en Filtros de Gabor y Color,” Congreso Anual de la Sociedad Española de Ingeniería Biomédica (CASEIB 2009), no. 27, pp. 125-128, Cádiz, Spain, 2009.

11. J. A. Perez-Carrasco, T. Serrano-Gotarredona, C. Serrano-Gotarredona, B. Acha, B. Linares-Barranco, “High-Speed Character Recognition System Based on a Complex Hierarchical AER Architecture,” IEEE International Symposium on Circuits and Systems (ISCAS 2008), pp. 2150-2153, Seattle, Washington, USA, 18-21 May 2008.

12. J. A. Perez-Carrasco, C. Serrano, B. Acha, T. Serrano-Gotarredona, B. Linares-Barranco, “Event Based Vision Sensing and Processing,” Proceedings of the 15th IEEE International Conference on Image Processing (ICIP 2008), pp. 1392-1395, San Diego, California, USA, 12-15 October 2008.

13. J. A. Perez-Carrasco, C. Serrano, B. Acha, T. Serrano-Gotarredona, B. Linares-Barranco, “Simulador de Sistemas AER Basado en Eventos,” XXIII Simposium Nacional de la Unión Científica Internacional de Radio (URSI 2008), Madrid, Spain, September 2008.

14. J. A. Perez-Carrasco, T. Serrano-Gotarredona, C. Serrano, B. Acha, B. Linares-Barranco, “On the Computational Power of Address-Event Representation (AER) Vision Processing Hardware,” DCIS 2007, pp. 21-23, Sevilla, Spain, 2007.
