Proceedings of the 2011 22nd IEEE International Symposium on Rapid System Prototyping
24-27 May 2011, Karlsruhe, Germany
Shortening the Path from Specification to Prototype
Sponsored by the IEEE Reliability Society and the Karlsruhe Institute of Technology
Copyright and Reprint Permission:
Abstracting is permitted with credit to the source. Libraries are permitted to photocopy, beyond the limit of U.S. copyright law, for private use of patrons, those articles in this volume that carry a code at the bottom of the first page, provided the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For other copying, reprint or republication permission, write to IEEE Copyrights Manager, IEEE Operations Center, 445 Hoes Lane, Piscataway, NJ 08854.
All rights reserved. Copyright © 2011 by IEEE.
2011 22nd IEEE International Symposium on Rapid System Prototyping
Table of Contents
Message from the General Chair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Message from the Program Chairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Message from the Organizing Chairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Conference Committees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Tutorial and Keynotes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Session 1: Automotive & FPGA
An FPGA-Based Signal Processing System for a 77 GHz MEMS Tri-Mode Automotive Radar . . . . . 2
Sazzadur Chowdhury, Roberto Muscedere and Sundeep Lal
FPGA based Real-Time Object Detection Approach with Validation of Precision and Performance . . . . . 9
Alexander Bochem, Kenneth Kent and Rainer Herpers
Rapid Prototyping of OpenCV Image Processing Applications using ASP . . . . . 16
Felix Mühlbauer, Michael Großhans and Christophe Bobda
Optimization Issues in Mapping AUTOSAR Components To Distributed Multithreaded Implementations . . . . . 23
Ming Zhang and Zonghua Gu
FPGA Design for Monitoring CANbus Traffic in a Prosthetic Limb Sensor Network . . . . . 30
Alexander Bochem, Kenneth Kent, Yves Losier, Jeremy Williams and Justin Deschenes
Session 2: Prototyping Architectures
Rapid Single-Chip Secure Processor Prototyping on OpenSPARC FPGA Platform . . . . . 38
Jakub Szefer, Wei Zhang, Yu-Yuan Chen, David Champagne, King Chan, Will Li, Ray Cheung and Ruby Lee
A Study in Rapid Prototyping: Leveraging Software and Hardware Simulation Tools in the Bringup of System-on-a-Chip Based Platforms . . . . . 45
Owen Callanan, Antonino Castelfranco, Catherine Crawford, Eoin Creedon, Scott Lekuch, Kay Muller, Mark Nutter, Hartmut Penner, Brian Purcell, Mark Purcell and Jimi Xenidis
Rapid automotive bus system synthesis based on communication requirements . . . . . 53
Matthias Heinz, Martin Hillenbrand, Kai Klindworth and Klaus D. Müller-Glaser
An event-driven FIR filter: design and implementation . . . . . 59
Taha Beyrouthy and Laurent Fesquet
Session 3: Prototyping Radio Devices
Applying Graphics Processor Acceleration in a Software Defined Radio Prototyping Environment . . . . . 67
William Plishker, George Zaki, Shuvra Bhattacharyya, Charles Clancy and John Kuykendall
Validation of Channel Decoding ASIPs: A Case Study . . . . . 74
Christian Brehm and Norbert Wehn
Area and Throughput Optimized ASIP for Multi-Standard Turbo Decoding . . . . . 79
Rachid Alkhayat, Purushotham Murugappa, Amer Baghdadi and Michel Jezequel
Design of an Autonomous Platform for Distributed Sensing-Actuating Systems . . . . . 85
François Philipp, Faizal A. Samman and Manfred Glesner
Session 4: Virtual Prototyping for MPSoC
A Novel Low-Overhead Flexible Instrumentation Framework for Virtual Platforms . . . . . 92
Tennessee Carmel-Veilleux, Jean-François Boland and Guy Bois
Using Multiple Abstraction Levels to Speedup an MPSoC Virtual Platform Simulator . . . . . 99
João Moreira, Felipe Klein, Alexandro Baldassin, Paulo Centoducatte, Rodolfo Azevedo and Sandro Rigo
A non intrusive simulation-based trace system to analyse Multiprocessor Systems-on-Chip software . . . . . 106
Damien Hedde and Frédéric Pétrot
Embedded Virtualization for the Next Generation of Cluster-based MPSoCs . . . . . 113
Alexandra Aguiar, Felipe Gohring De Magalhaes and Fabiano Hessel
Session 5: Model Based System Design
Rapid Property Specification and Checking for Model-Based Formalisms . . . . . 121
Daniel Balasubramanian, Gabor Pap, Harmon Nine, Gabor Karsai, Michael Lowry, Corina Pasareanu and Tom Pressburger
Automatic Generation of System-Level Virtual Prototypes from Streaming Application Models . . . . . 128
Philipp Kutzer, Jens Gladigau, Christian Haubelt and Jürgen Teich
An Automated Approach to SystemC/Simulink Co-Simulation . . . . . 135
Francisco Mendoza, Christian Koellner, Juergen Becker and Klaus D. Müller-Glaser
Extension of Component-Based Models for Control and Monitoring of Embedded Systems at Runtime . . . . . 142
Tobias Schwalb and Klaus D. Müller-Glaser
A model-driven based framework for rapid parallel SoC FPGA prototyping . . . . . 149
Mouna Baklouti, Manel Ammar, Philippe Marquet, Mohamed Abid and Jean-Luc Dekeyser
A State-Based Modeling Approach for Fast Performance Evaluation of Embedded System Architectures . . . . . 156
Sebastien Le Nours, Anthony Barreteau and Olivier Pasquier
Session 6: Software for Embedded Devices
Task Mapping on NoC-Based MPSoCs with Faulty Tiles: Evaluating the Energy Consumption and the Application Execution Time . . . . . 164
Alexandre Amory, César Marcon, Fernando Moraes and Marcelo Lubaszewski
Me3D: A Model-driven Methodology Expediting Embedded Device Driver Development . . . . . 171
Hui Chen, Guillaume Godet-Bar, Frédéric Rousseau and Frédéric Pétrot
Session 7: Tools and Designs for Configurable Architectures
Schedulers-Driven Approach for Dynamic Placement/Scheduling of multiple DAGs onto SoPCs . . . . . 179
Ikbel Belaid, Fabrice Muller and Maher Benjemaa
Generation of emulation platforms for NoC exploration on FPGA . . . . . 186
Junyan Tan, Virginie Fresse and Frédéric Rousseau
Arbitration and Routing Impact on NoC Design . . . . . 193
Edson Moreno, Cesar Marcon, Ney Calazans and Fernando Moraes
On-Chip Efficient Round-Robin Scheduler for High-Speed Interconnection . . . . . 199
Surapong Pongyupinpanich and Manfred Glesner
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Message from the General Chair
Welcome to Germany and welcome to the Karlsruhe Institute of Technology (KIT) for the 22nd IEEE International Symposium on Rapid System Prototyping (RSP). RSP explores trends in rapid prototyping of computer-based systems. Its scope ranges from embedded system design, formal methods for the verification of systems, engineering methods, and process and tool chains to case studies of actual software and hardware systems. It aims to bring together researchers from the hardware and software communities to share their experiences and to foster collaboration on new and innovative science and technology. The focus of the 22nd annual symposium encompasses theoretical and practical methodologies, addressing technologies of specification, completeness, dynamics of change, technology insertion, complexity, integration, and time to market.

RSP 2011 is a four-day event, starting with an industry-driven tutorial program on safety-critical systems and the corresponding standard IEC 61508, accompanied by a visit to a car manufacturing plant. We will have three days of technical program, each day opening with outstanding keynote speakers from industry and academia talking about RSP in automotive, aerospace and robotics applications. Time for technical discussions will be supplemented by social gatherings with old and new friends.

The success of the symposium is based on the effort of many volunteers. I wish to thank and express my appreciation for the hard and dedicated efforts of the Program Chairs Fabiano Hessel and Frédéric Pétrot, who set up an excellent symposium program; the Tutorial Chair Michael Huebner, for organizing the highly topical tutorial day; the Publicity Chair Jérôme Hugues; the IEEE Liaison Chair Alfred Stevens from the IEEE Reliability Society, which is sponsoring RSP 2011; the Organizing Chair Matthias Heinz; and last but not least the Finance Chair Martin Hillenbrand. Special thanks to all of you who contributed a paper and will contribute to formal presentations and informal discussions.

I hope that you will find this year's symposium interesting and rewarding, and that you will enjoy your time in Karlsruhe.
Klaus D. Mueller-Glaser
Karlsruhe Institute of Technology, Germany
Message from the Program Chairs
We welcome you to the 22nd IEEE International Symposium on Rapid System Prototyping (RSP 2011), held in Karlsruhe, Germany. RSP is the first conference to bring together people from the software and hardware communities, from academia and industry, to exchange on RSP-related research topics from scientific and technical standpoints. For more than 20 years, the symposium has been attracting an outstanding mix of practitioners and researchers, and since its early days it has spanned numerous disciplines, making it the only one of its kind.

Your participation in RSP will have a significant impact on understanding the challenges behind rapid system prototyping and how the methods and tools you develop can lead to better systems for our everyday lives. It is the intent of this symposium to foster exchanges between professionals from industry and academia, from hardware design to software engineering, and to promote a dialog on the latest innovations in rapid system prototyping.

The quality of the technical program is a key point of RSP. The strength of the technical program is due to the work of the Technical Program Committee, in soliciting colleagues for quality submissions and in the review work. We believe that this year's program covers many exciting topics, and we hope that you will enjoy it. We extend our sincere appreciation to all those who contributed to making RSP 2011 a bubbling experience: the authors, speakers, reviewers, session chairs, and volunteers. We extend a particular thank you to the Technical Program Committee for their dedication to RSP and their excellent work in reviewing the submissions. We wish you a very productive and exciting conference.
Frédéric Pétrot - TIMA, France
Fabiano Hessel - PUCRS, Brazil
Message from the Organizing Chairs
We are happy to welcome you to the 22nd IEEE International Symposium on Rapid System Prototyping in Karlsruhe. We hope you will enjoy the well-chosen program of remarkable keynotes, talks and topics, as well as the technology region Karlsruhe.

The city of Karlsruhe was founded in 1715 by Margrave Charles III William of Baden-Durlach. From the Karlsruhe Palace, which forms the center of the city, 32 streets radiate out like the ribs of a fan; this is why Karlsruhe is also called the "fan city". Directly next to the Karlsruhe Palace, on the right-hand side, lies the campus of the University of Karlsruhe. In 2009, the University of Karlsruhe and the Research Center of Karlsruhe joined forces as the Karlsruhe Institute of Technology (KIT). This year, the RSP Symposium is hosted by the KIT.

We thank our partners for their contributions and efforts in accomplishing the symposium: the local partners and institutions at KIT for their support, the General Chair and the Program Chairs for developing the scientific program of this symposium, and the members of the program committee for their conscientious reviews. We are also grateful to the Institute of Electrical and Electronics Engineers (IEEE) and the KIT for financially and technically sponsoring the symposium. We hope you enjoy the conference program and additionally take some time to explore the city of Karlsruhe.
Matthias Heinz (KIT), Martin Hillenbrand (KIT), Germany
Conference Committees

General Chair
K. Mueller-Glaser - KIT, Germany
Program Chairs: F. Pétrot - TIMA, France; F. Hessel - PUCRS, Brazil
Tutorial Chair: M. Hübner - KIT, Germany
Publicity Chair: J. Hugues - ISAE, France
IEEE Liaison Chair: A. Stevens - IEEE Reliability Society, USA
Local Organization Chair: M. Heinz - KIT, Germany
Finance Chair: M. Hillenbrand - KIT, Germany
Technical Program Committee Members
M. Aboulhamid - Université de Montréal
G. Alexiou - CTI
T. Antonakopoulos - University of Patras
P. Athanas - Virginia Tech
M. Auguston - NPS
A. Baghdadi - TELECOM Bretagne
J. Becker - Karlsruhe Institute of Technology
C. Bobda - University of Arkansas
G. Bois - Ecole Polytechnique de Montreal
D. Buchs - CUI, University of Geneva
R. Cheung - UCLA
R. Drechsler - University of Bremen, Germany
J. Drummond - RSP PC
M. Engels - FMTC
A. Fröhlich - Federal University of Santa Catarina
M. Glesner - TU Darmstadt
M. Glockner - BMW AG
D. Hamilton - Auburn University
W. Hardt - Chemnitz University of Technology
M. Heinz - Karlsruhe Institute of Technology
J. Henkel - Karlsruhe Institute of Technology
F. Hessel - PUCRS
J. Hugues - ISAE
A. Jerraya - CEA
G. Karsai - Vanderbilt University
K. Kent - University of New Brunswick
F. Kordon - Univ. P. & M. Curie
R. Kress - Infineon
I. Krueger - UCSD
R. Lauwereins - IMEC
T. Le - National University of Singapore
M. Lemoine - CERT
P. Leong - The Chinese University of Hong Kong
R. Ludewig - IBM Germany
G. Martin - Tensilica
B. Michael - Naval Postgraduate School
K. Mueller-Glaser - Karlsruhe Institute of Technology
N. Navet - INRIA / RTaW
G. Nicolescu - Ecole Polytechnique de Montreal
V. Olive - CEA-LETI
Y. Papaefstathiou - Technical University of Crete
C. Park - SAMSUNG Electronics
L. Pautet - TELECOM ParisTech
C. Pereira - Univ. Federal do Rio Grande do Sul
R. Pettit - The Aerospace Corporation
D. Pnevmatikatos - Tech. Univ. of Crete & FORTH-ICS
F. Pétrot - TIMA Lab, Grenoble-INP
J. Rice - University of Lethbridge
F. Rousseau - TIMA - UJF
M. Shing - Naval Postgraduate School
O. Sokolsky - University of Pennsylvania
T. Taha - Clemson University
E. Todt - Universidade Federal do Parana
B. Zalila - ReDCAD Laboratory, Univ. of Sfax
Additional Reviewers

Fred Doucet, Stephan Eggersglüß, Stephanie Friederich, Luiza Gheorghe, Marius Gligor, Daniel Grosse, Jan Heisswolf, Steve Hostettler, Ulrich Kuehne, Hoang Le, Suzanne Lesecq, Hua Li, Min Li, Alban Linard, Andrew Love, Erik Jan Marinissen, Massimiliano Menarini, Ivan Muller, Olivier Muller, Frederik Naessens, Joerg Noack, Francois Philipp, Faizal Arya Samman, Christopher Spies, Florian Thoma, Wenwei Zha
Tutorial and Keynotes
Tutorial Program
Tuesday, May 24th: Safety Integrity Levels in FPGA based designs
Gernot Klaes, Technical-Inspection Authority, Germany: "Functional Safety according to IEC 61508 — a short introduction"
Giulio Corradi, Xilinx: "FPGA and the IEC 61508: an inside perspective"
Romual Girardey, Endress+Hauser: "Safety Aware Place and Route for On-Chip Redundancy in FPGA"
Keynote Speeches
Wednesday, May 25th:
Prof. Dr.-Ing. Juergen Bortolazzi, Porsche AG: "RSP in Automotive"

Thursday, May 26th:
Dr. Costa Pinto, Efacec: "FPGA in Aerospace Applications"

Friday, May 27th:
Prof. Dr. Rüdiger Dillmann, FZI: "Prototyping in Robotics"
Session 1: Automotive & FPGA
An FPGA-Based Signal Processing System for a 77 GHz MEMS Tri-Mode Automotive Radar
Sundeep Lal, Roberto Muscedere, Sazzadur Chowdhury
Department of Electrical and Computer Engineering
University of Windsor Windsor, Ontario, N9B 3P4
Canada
Abstract—An FPGA-implemented signal processing algorithm to determine the range and velocity of targets using a MEMS-based tri-mode 77 GHz FMCW automotive radar is presented. In the developed system, a Xilinx Virtex-5 FPGA-based signal processing and control algorithm dynamically reconfigures a MEMS-based FMCW radar to provide short-range, mid-range, and long-range coverage using the same hardware. The MEMS radar incorporates two MEMS SP3T RF switches, two microfabricated Rotman lenses, and two reconfigurable microstrip antenna arrays embedded with MEMS SPST switches, in addition to other microelectronic components. By sequencing the FMCW signal through the three beam ports of the Rotman lens, the radar beam can be steered ±4 degrees in a combined cycle time of 62 ms for all three modes. A worst-case range accuracy of ±0.21 m and velocity accuracy of ±0.83 m/s has been achieved, which is better than the state-of-the-art Bosch LRR3 radar sensor.
I. INTRODUCTION

The global auto industry is extensively pursuing radar-based proximity detection systems for applications including adaptive cruise control, collision avoidance, and pre-crash warning, in order to avoid or mitigate collision damage. In [1], it has been identified, by analyzing actual crash records from the 2004-08 files of the National Automotive Sampling System General Estimates System (NASS GES) and the Fatality Analysis Reporting System (FARS), that a forward collision warning/mitigation system comprised of radar sensors has the greatest potential to prevent or mitigate up to 1.2 million crashes, up to 66,000 nonfatal serious and moderate injury crashes, and 879 fatal crashes per year. However, the IIHS study found that the forward collision warning crash avoidance features that could prevent or mitigate this many fatal and nonfatal injury-related crashes were available on just a handful of luxury vehicle models, due to the high cost of currently available forward collision warning technology. Thus, a low-cost radar technology for forward collision warning that can be made available to all on-road vehicles would be able to prevent or mitigate up to 1.2 million crashes per year. Market research firm Strategy Analytics predicts that over the period 2006 to 2011, the use of long-range distance warning systems in cars could increase by more than 65 percent annually, with demand reaching 3 million units in 2011, 2.3 million of them using radar sensors [2]. The Strategic Automotive Radar frequency Allocation (SARA) consortium specified that a combined SRR and LRR platform in the 77-79 GHz range will make it possible to reduce the size and improve the performance of automotive radars [3-4]. In [3] it has been identified that, in the long term, 77 GHz will become the only reasonable technology platform to serve both short and long range radars.
In [3, 5] it has been determined that frequency-modulated continuous wave (FMCW) radar with analog or digital beamforming capability and a low-cost SiGe-based radar front end is the technology of choice for forward collision warning applications. Though GaAs- or SiGe-based MMICs are being pursued vigorously to minimize the cost and size while improving the performance of automotive radars [3, 6], the auto industry is eyeing the low cost and batch-fabrication capability of MEMS technology to realize more sophisticated radar systems [2]. The European consortium SARFA has set its project goal to utilize RF MEMS as an enabling technology for performance improvement and cost reduction of automotive radar front ends operating at 76-81 GHz [7]. In [5], a MEMS-based long range radar comprised of a microfabricated Rotman lens, MEMS SP3T switches, and a microstrip antenna array has been presented [8]. The DistronicPlus™ system uses one long range and two short range radars in the front, two other short range radars in the front for park assist, and two other short range radars in the rear to provide an effective collision avoidance system. Due to the individual short and long range units, the price tag of the DistronicPlus™ system is quite high.
Almost all commercially available automotive radars use microelectronics-based ASICs. However, FPGAs are becoming increasingly popular in the development phase for rapid prototyping, as opposed to DSP-based solutions. Relative advantages of an FPGA-based system over DSP-based ones for automotive radars are discussed in [9-11], where it has been determined that FPGAs can offer superior performance in terms of footprint area, high throughput, more time- and resource-efficient implementations, high-speed parallel processing, digital data interfacing, ADC and DAC handling, and clock management in a relatively low-cost platform. For example, an FPGA can generate a precision Vtune signal to
This research has been supported by NSERC Canada, Ontario Centres of Excellence (OCE), and Auto21 Canada.
978-1-4577-0660-8/11/$26.00 ©2011 IEEE
control the VCO linearity, which is a critical issue in avoiding false targets.
Investigation shows that instead of a passive antenna system, a MEMS-based reconfigurable microstrip antenna in conjunction with MEMS SP3T switches and a microfabricated Rotman lens can be used to realize a compact tri-mode radar that provides short, mid and long range functionality in a small form-factor single unit. An FPGA-based control unit controls the operation of an array of MEMS SPST RF switches embedded in the reconfigurable antenna array to dynamically alter the antenna beamwidth, switching the radar from short to mid to long range using a predetermined time constant. This reduces the price tag significantly by multiplexing the functionality of three different range radars in the same hardware. Additionally, the passive microfabricated Rotman lens eliminates the microelectronics-based analog or digital beamforming components used in commercially available automotive radars. Consequently, the overall system becomes less complex, faster, lower cost, and more reliable. Due to its faster signal processing and digital data interfacing capability, an FPGA like the Xilinx Virtex-5 can offer very robust control of VCO linearity and a faster refresh rate for range and velocity data. In addition to the conventional signal processing tasks, control algorithms for the MEMS SP3T and SPST RF switches can also be embedded in the same FPGA.
In this context, this paper presents the development, implementation, and validation of Xilinx Virtex-5 FPGA-based control and signal processing algorithms for the developed MEMS tri-mode radar sensor. The algorithm is able to determine the target range and velocity with a very high degree of precision in a cycle time that is shorter than that of the state-of-the-art 3rd generation long range radar (LRR3) from Bosch [12].
II. MEMS TRI-MODE RADAR OPERATING PRINCIPLE

A simplified architecture of the MEMS tri-mode radar is shown in Fig. 1. The radar operating principle is as follows: (i) An FPGA-implemented control circuit generates a triangular signal (Vtune) to modulate a voltage-controlled oscillator (VCO), producing a linear frequency-modulated continuous wave (FMCW) signal centered at 77 GHz. (ii) The FMCW signal is fed to a MEMS SP3T switch. (iii) An FPGA-implemented control algorithm controls the SP3T switch to sequentially switch the FMCW signal among the three beam ports of a microfabricated Rotman lens. (iv) As the FMCW signal arrives at the array ports of the Rotman lens after traveling through the Rotman lens cavity, the time-delayed in-phase signals are fed to a reconfigurable microstrip antenna array. (v) The reconfigurable antenna has MEMS SPST switches embedded in each of the linear sections, as shown in Fig. 2. The scan area of a conventional microstrip antenna array depends on the antenna beamwidth, which in turn depends on the number of microstrip patches: the higher the number of patches, the narrower the beam. It has been determined that for a short range radar, a beamwidth of 80 degrees is necessary to scan an area up to 30 meters in front of the vehicle, as shown in Fig. 3. For mid range, a beamwidth of 20 degrees is necessary to cover an area between 30-80 meters ahead of the vehicle, and in the LRR mode, a beamwidth of 9 degrees is necessary to
Figure 1. MEMS tri-mode radar block diagram.
Figure 2. Reconfigurable microstrip antenna array.
Figure 3. SRR, MRR, and LRR coverage with beam width.
scan an area 80-200 meters ahead of the vehicle. Following Fig. 2, when both the SPST switches SW1 and SW2 are in the OFF position, 4 microstrip patches per linear array provide short-range coverage. When SW1 is turned ON and SW2 is OFF, 8 microstrip patches per linear array provide mid-range coverage. Finally, when both SW1 and SW2 are ON, 12 microstrip patches per linear array provide long-range coverage. An FPGA-implemented control module controls the operation of the switches SW1 and SW2. (vi) The sequential switching of the input signal among the beam ports of the Rotman lens enables the beam to be steered across the target area in steps of a pre-specified angle, as shown in Fig. 4. (vii) On the receiving side, a receiver antenna array receives the signal reflected off a vehicle or an obstacle and feeds it to another SP3T switch through another Rotman lens. (viii) An FPGA-based control circuit operates the receiver SP3T switch in tandem with the transmit SP3T switch so that the signal output at a specific beam port of the receiver Rotman lens can be mixed with the corresponding
Figure 4. Beam steering by the Rotman lens.
transmit signal. (ix) The output of the receiver SP3T switch is passed through a mixer to generate an IF signal in the range of 0-200 kHz. (x) An analog-to-digital converter (ADC) samples the received IF signal and converts it to a digital signal. (xi) Finally, an FPGA-implemented algorithm processes the digital signal from the ADC to determine the range and velocity of the detected target. In this way, a wider near-field area and a narrow far-field area can be progressively scanned with minimal hardware.
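The mode-dependent antenna reconfiguration described above can be summarized as a small lookup table. The following Python sketch is purely illustrative (the paper implements this control logic in HDL on the FPGA; the data-structure and function names here are our own), using the switch states, patch counts, beamwidths, and coverage intervals given in the text:

```python
# Illustrative model of the tri-mode antenna reconfiguration.
# Values (switch states, patches, beamwidths, coverage) are from the paper;
# the Python names and structure are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class RadarMode:
    sw1: bool            # state of MEMS SPST switch SW1
    sw2: bool            # state of MEMS SPST switch SW2
    patches: int         # microstrip patches per linear array
    beamwidth_deg: int   # resulting antenna beamwidth
    coverage_m: tuple    # scanned interval in front of the vehicle (m)

MODES = {
    "SRR": RadarMode(sw1=False, sw2=False, patches=4,  beamwidth_deg=80, coverage_m=(0, 30)),
    "MRR": RadarMode(sw1=True,  sw2=False, patches=8,  beamwidth_deg=20, coverage_m=(30, 80)),
    "LRR": RadarMode(sw1=True,  sw2=True,  patches=12, beamwidth_deg=9,  coverage_m=(80, 200)),
}

def switch_states(mode: str) -> tuple:
    """(SW1, SW2) control bits the FPGA would assert for a given mode."""
    m = MODES[mode]
    return (m.sw1, m.sw2)
```

More patches give a narrower beam, so progressively enabling SW1 and SW2 trades beamwidth for range, exactly the SRR-to-LRR progression of Fig. 3.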
III. RADAR SIGNAL PROCESSING AND SWITCH CONTROL
A. Choice of Development Platform – FPGA vs. DSP

Older radar systems relied on various analog components. However, Digital Signal Processors (DSPs) and Field-Programmable Gate Arrays (FPGAs) are becoming increasingly popular, as both offer attractive features. Off-the-shelf signal processing blocks are available from both DSP and FPGA manufacturers as IP (Intellectual Property) cores for rapid prototyping. Key considerations in choosing one over the other are internal architecture, speed, development time, complexity, and system requirements. DSPs follow a pipelined architecture, and even a dual-core DSP requires resource sharing. This limits the overall data throughput in DSPs, and data capture channels depend on the number of memory modules and interrupts. On the other hand, FPGAs are fully customizable, and modules do not have to share resources like I/O ports and memory. DSPs spend most of their processing time and power moving instructions and variables in and out of shared memories, which FPGAs inherently avoid. This makes FPGAs more suitable for parallel processing and optimized resource handling, especially in low-memory applications. The capabilities of FPGAs are continuously increasing due to the development of fast multipliers, accumulators, RAM units, etc. With parallel processing capabilities, FPGAs have a higher throughput than DSPs while using a slower clock. Table I shows benchmark results comparing a Xilinx FPGA with various processors for a 2048-point FFT, and Table II compares DSPs to FPGAs in terms of various development criteria that affect their development time, reliability and applicability. From Table I, it is clear that the Virtex-5 SX50T is well adapted for signal processing applications and computes the same FFT in fewer clock cycles than typical DSPs by using parallel resources. Based on Tables I and II, the Xilinx Virtex-5 SX50T FPGA has been selected for rapid system prototyping of the target MEMS tri-mode radar.
TABLE I. SPEED COMPARISON - FPGA VS. A DUAL CORE μP AND DSP

Part Name                     | Clock Freq. (MHz) | 2048-point FFT Latency (µs) | No. of Clock Cycles
Intel 32-bit Core 2 Duo       | 3000              | 37.55                       | 112650
Analog Devices ADSP-BF53x     | 600               | 32.40                       | 19440
Texas Instruments TMS320C67xx | 600               | 34.20                       | 20520
Xilinx Virtex-5 FFT Core      | 200               | 39.60                       | 7920
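The latency and cycle-count columns of Table I are related simply by latency = cycles / clock frequency, which a one-line helper makes explicit (the Virtex-5 core needs far fewer cycles, and its latency is comparable only because it runs at a much slower clock):

```python
# Cross-check of Table I: FFT latency follows directly from
# cycle count divided by clock frequency.
def fft_latency_us(clock_mhz: float, cycles: int) -> float:
    """2048-point FFT latency in microseconds (cycles / MHz = µs)."""
    return cycles / clock_mhz
```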
B. LFMCW Sweep Generation and Switch Control

To keep the hardware requirements to a minimum for the Virtex-5 FPGA, FMCW sweep durations of 1.0 ms, 3 ms, and 6 ms have been selected for the LRR, MRR, and SRR modes, respectively. A 10-bit counter in the FPGA generates a digital sweep signal, which is fed to a 10-bit DAC to generate the VTune signal for the successive modes, as shown in Fig. 5. The sweep generation timing diagram in Fig. 5 is for a clock frequency of 100 MHz on a Virtex-5 FPGA. A sweep bandwidth of 800 MHz for LRR, 1.4 GHz for MRR, and 2 GHz for SRR has been selected to provide sufficient samples. Fig. 5 also shows the required DAC output and tuning voltage for a TLC™ Precision 77 GHz GaAs VCO (MINT77TR™). The MEMS SP3T switches (SP3T-T and SP3T-R in Fig. 1) are operated by 18 V charge pumps, which are activated at the end of the down sweep of the SRR mode.
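From the bandwidths and sweep durations above, the chirp slope k = B/T used later in the range equation follows directly. The sketch below computes it per mode; the assumption that the 10-bit counter ramps through its full 1024 codes over one sweep is ours, not stated explicitly in the paper:

```python
# Sweep-slope arithmetic for the three radar modes.
# (B, T) pairs are from the text; full-scale 10-bit ramp is an assumption.
SWEEPS = {              # mode: (sweep bandwidth B in Hz, sweep duration T in s)
    "LRR": (800e6, 1.0e-3),
    "MRR": (1.4e9, 3.0e-3),
    "SRR": (2.0e9, 6.0e-3),
}

def sweep_slope(mode: str) -> float:
    """k = B / T, the chirp slope in Hz/s."""
    b, t = SWEEPS[mode]
    return b / t

def dac_step_period(mode: str, counter_bits: int = 10) -> float:
    """Time between DAC code increments for a full-scale ramp (assumption)."""
    _, t = SWEEPS[mode]
    return t / (2 ** counter_bits)
```

The LRR mode, for example, has the steepest slope (800 GHz/s), which is what gives it the finest range resolution per hertz of beat frequency.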
TABLE II. DEVELOPMENT CRITERIA COMPARISON

Criteria          | DSP                                                                      | FPGA
Sampling rates    | Low (interrupts and I/O are shared)                                      | High (parallel data capture and processing)
Data rate         | Better below 30 MB/s                                                     | Can handle faster data rates
Memory management | Unpredictable optimized arrangement - can lead to failures by mishandling pointers | Fully customizable memory arrangement and read/write ports
Data capture      | Uses interrupts with varying orders of precedence - can lead to conflicts | Independent data capture with dedicated input/output ports and memory blocks
Group development | Developers do not have a clear idea of resource availability and usage by others | Developers can work independently without resource availability concerns
Figure 5. VTune timing diagram.
C. Signal Processing Algorithm

Following the theory of FMCW radars, the range R and relative velocity V_R of a detected target can be calculated from:

    R = (f_up + f_down) c / (4k)          (1)

    V_R = (f_up - f_down) c / (4 f_0)     (2)

where f_up is the up-sweep beat frequency, f_down the down-sweep beat frequency, c the speed of the electromagnetic wave in the medium, k = B/T the ratio of sweep bandwidth to sweep duration, and f_0 the center frequency, as shown in Fig. 6. Fig. 7 presents the developed radar signal processing algorithm. In the system, the transmitted and received signals are mixed to generate a beat signal, which is then passed through a low-pass filter (LPF) to remove noise. The filtered signal is converted to digital format using an ADC, and a Hamming window is applied to limit the frequency content. Afterwards, a Fast Fourier Transform (FFT) is applied to the time-domain samples. The normalized peak intensity for all the FFT samples is computed and processed by a Cell-Averaging Constant False Alarm Rate (CA-CFAR) processor architecture as in [11]. Upon processing of both the up and down sweeps for a beam port, peak pairing is done to compute preliminary values for target range and velocity. Peak pairing is responsible for matching a detected peak in the up-sweep spectrum to a detected peak in the down-sweep spectrum as belonging to the same target.
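As a worked illustration of Eqs. (1) and (2), the sketch below computes range and velocity from a matched pair of beat frequencies. The free-space propagation speed and the example beat frequencies are our assumptions; the LRR slope k = 800 GHz/s and f_0 = 77 GHz come from the text:

```python
# Range/velocity from paired up/down-sweep beat frequencies, per Eqs. (1)-(2).
C = 3.0e8  # propagation speed in m/s (free-space assumption)

def range_velocity(f_up: float, f_down: float, k: float, f0: float = 77e9):
    """Return (range in m, relative velocity in m/s)."""
    r = C * (f_up + f_down) / (4.0 * k)
    v = C * (f_up - f_down) / (4.0 * f0)
    return r, v
```

For a stationary target, Doppler is zero and both sweeps see the same beat frequency 2kR/c; e.g. with the LRR slope k = 800e9 Hz/s, a target at 150 m gives f_up = f_down = 800 kHz, and the function returns (150.0, 0.0).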
D. Implementation Methodology
The critical design considerations to implement the developed algorithm include: (a) Interconnection between the processing modules, (b) Data interdependency between the units, (c) Control interdependency between the units, (d) Coherency in data format, and (e) Synchronization of up and down sweeps and timely switching of the SP3T switches.
Figure 6. FMCW chirp signal and beat frequency.
Figure 7. Signal processing algorithm.
TABLE III. SIGNAL PROCESSING UNITS
Processing Unit       | Details
DAC                   | 10-bit, 1.2 MHz
ADC                   | 11-bit, 2.2 MHz
Window Type/Length    | Hamming/2048
FFT Type              | Mixed Radix-2/4 DIT
FFT Length            | 2048
CFAR Type             | Cell Averaging
CFAR Parameters       | M = 8, GB = 2, Pfa = 10^-6 *
Peak Pairing Criteria | Power comparison, spectral proximity

* M = depth of cell averaging, GB = no. of guard bands, Pfa = probability of false alarm.
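A generic cell-averaging CFAR using the Table III parameters can be sketched as below. This is an illustrative Python sketch, not the exact processor architecture of [11]; the threshold scaling formula and function name are assumptions.

```python
def ca_cfar(power, M=8, GB=2, pfa=1e-6):
    """Return indices of cells exceeding a cell-averaging CFAR threshold.

    power: list of PSD values. M training cells and GB guard cells are taken
    on each side of the cell under test (parameters mirror Table III).
    """
    N = 2 * M                               # total number of training cells
    alpha = N * (pfa ** (-1.0 / N) - 1.0)   # classic CA-CFAR scaling for the desired Pfa
    detections = []
    for i in range(M + GB, len(power) - M - GB):
        lead = power[i - GB - M : i - GB]             # training cells before the guard band
        lag = power[i + GB + 1 : i + GB + M + 1]      # training cells after the guard band
        noise_avg = (sum(lead) + sum(lag)) / N        # local noise estimate
        if power[i] > alpha * noise_avg:
            detections.append(i)
    return detections
```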
The following stepwise methodology has been adopted to address the mentioned considerations and achieve a high degree of accuracy in the FPGA based signal processing scheme: (a) Development and simulation of the signal processing algorithm in the Matlab environment for a single target, (b) Verification of the algorithm implemented in Matlab for a single target, and then extension of the Matlab code to a single Rotman lens beam with 7 random targets, (c) HDL coding and testing of the individual modules of the verified algorithm, (d) Identification of hardware resource sharing, timing, and area optimization, (e) Assembly of the modules to form the overall radar signal processing system in HDL, (f) Validation using a 7-target scenario against the results obtained from the Matlab code.
IV. HARDWARE IMPLEMENTATION
A. HDL Modules

Fig. 8 shows the top-level black-box view of the FPGA-implemented signal processing system, and Fig. 9 presents the developed HDL building blocks for the FPGA implementation. The system has been developed using Verilog HDL 2005 (IEEE 1364-2005).
B. Fixed-Point Considerations

Fixed-point implications arise at four stages of the developed HDL system: (i) ADC – the quantization noise added by sampling is unavoidable. To minimize the quantization error, an 11-bit ADC has been employed, and a maximum error of 0.125% is induced by quantization. (ii) Windowing – a 2048-point Hamming window is stored inside an on-chip ROM. The window coefficients are stored in 10-bit resolution, which results in a 0.084% error in computation. The use of such digitized window functions is validated in [13]. (iii) FFT – the accuracy of the FFT, which depends on the input resolution and the phase coefficient (twiddle factor) resolution, affects the precision of the entire signal processing algorithm and the deduced target information. With a fixed input resolution of 12 bits, different phase coefficient resolutions were tested; highly accurate results were obtained with 16-bit resolution. (iv) Peak Pairing – as the implementation of the ADC and the window function in an FPGA environment involves several multiplications/divisions, a simplified approach is used to avoid computational delay and resource overhead. The actual frequency is the product of the frequency resolution of the FFT and the bin number of the target peak detected by the CFAR processor. For the developed system, the FFT frequency resolution, defined as sampling frequency / FFT size, equals 976.5625 Hz/bin. Also, the factor k in (1) can be calculated from the bandwidth B (800 MHz) and the sweep duration T (1 ms) as 8 × 10^11 for the LRR mode. Using this information, (1) can be simplified to:
R = (f_up_bin + f_down_bin) × 0.09290625    (3)

where f_up(down) = f_up(down)_bin × 976.5625 and f_up(down)_bin is the FFT bin number for the up and down peaks of a valid target detected by the CFAR. Similarly, (2) can be simplified to:

V_R = (f_up_bin − f_down_bin) × 3.398 km/h    (4)
The constants 0.09290625 in (3) and 3.398 in (4) have been approximated as 0.0927734375 (an 11-bit binary number) and 3.40625 (a 7-bit binary number). This reduces the multiplier resolution and is suitable for fast computation using embedded FPGA multipliers, e.g. in Xilinx DSP48E slices, and it creates a maximum error of 0.193% in the range and velocity computation. Similar calculations are done for the MRR and SRR modes.
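The constant quantization described above can be reproduced with a small sketch. The fractional bit widths below are chosen so that rounding yields the paper's constants; the exact fixed-point format used in the hardware is an assumption.

```python
# Sketch of the fixed-point constant quantization; names and bit splits are assumptions.
FFT_BIN_HZ = 2_000_000 / 2048   # sampling frequency / FFT size = 976.5625 Hz/bin

def quantize(value: float, frac_bits: int) -> float:
    """Round a constant to a fixed-point value with the given fractional bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale

range_const = quantize(0.09290625, 10)  # rounds to 0.0927734375 for a cheap multiplier
vel_const = quantize(3.398, 5)          # rounds to 3.40625
```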
C. HDL Optimization

Critical optimization considerations include: (i) Interfacing of modules, (ii) Synchronization of modules in the overall system, (iii) Fixed-point truncation/rounding errors, (iv) Potential overflow identification, (v) Identification of data flow or processing bottlenecks in the algorithm and use of multiple parallel units to resolve them, (vi) Potential race conditions and data consistency issues in shared resources, (vii) Memory sharing, (viii) Clocked/combinational logic synchronization, (ix) Data word length and fixed-point format (position of the binary point) considerations, and (x) Handling of signed and unsigned data.

Figure 8. Top-level module.

Figure 9. HDL building block modules.
A bottleneck was identified in the power spectral density calculation module (PSD unit), where a square root operation caused significant delay. This was resolved by using 4 PSD units in parallel, as shown in Fig. 10. Since the time-domain data RAM and the frequency-domain data RAM are implemented separately, the next frequency sweep can be sampled while the samples of the previous sweep are being processed, which also optimizes the HDL design for time. Testing of the Xilinx FFT showed that the first half of the output has more noise and higher DC components at lower frequencies compared to the latter half; accordingly, only the latter half of the FFT output is utilized. This inherently saves memory and improves timing by a factor of 2. The CFAR unit is designed to process 32 values at a time and is fully synchronized with the FDR and PSD units (Fig. 8), thus avoiding storage of the entire 1024 frequency samples.
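The half-spectrum selection and chunked magnitude computation can be sketched as follows. This is an illustrative Python stand-in for the 4 hardware PSD units, not the HDL design itself; the function name is an assumption.

```python
import math

def psd_half(fft_out, units=4):
    """Magnitude spectrum of the latter half of an FFT output.

    The half-spectrum is processed in `units` equal chunks, mimicking the
    4 parallel PSD units that each perform the costly square root.
    """
    half = fft_out[len(fft_out) // 2:]   # keep only the less noisy latter half
    chunk = len(half) // units
    mags = []
    for u in range(units):               # each chunk maps to one PSD unit
        for x in half[u * chunk:(u + 1) * chunk]:
            mags.append(math.sqrt(x.real ** 2 + x.imag ** 2))
    return mags
```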
D. Resource Usage The design was synthesized for Xilinx Spartan-3A DSP
Edition and Virtex-5 SX50T. The resource usage for the HDL implementation of the developed signal processing system is listed in Table IV.
Figure 10. Parallel PSD units.
TABLE IV. RESOURCE USAGE ON DIFFERENT FPGAS
Resource         | Spartan-3A | Virtex-5
Slice registers  | 46%        | 4%
Slice LUTs       | 96%        | 23%
DSP48 slices     | 30%        | 6%
LUT-FF pairs     | 11%        | 9%
FPGA fabric area | 47%        | 21%
E. Processing Latency

Table V shows the number of clock cycles required by each module of the HDL implementation in Fig. 9 for the LRR mode, along with the total number of clock cycles consumed for processing both the up and down frequency sweeps and producing the final target information. The implementation has a safe maximum operating frequency of 50 MHz on the Xilinx Spartan-3A and 160 MHz on the Xilinx Virtex-5. The system has been simulated at 50 MHz on the Spartan-3A and 100 MHz on the Virtex-5, as presented in Table V. One beam corresponds to both the up and down frequency sweeps on the same beam port of the Rotman lens.
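The latencies in Table V follow directly from cycle count and clock frequency. A small sketch (the function name is an assumption), using the total-processing cycle count quoted for the LRR mode:

```python
# Latency = clock cycles / clock frequency; values below use Table V's
# total-processing figure for the LRR mode.
def latency_ms(cycles: int, clock_hz: float) -> float:
    """Convert a cycle count to milliseconds at a given clock frequency."""
    return cycles / clock_hz * 1e3

spartan = latency_ms(21163, 50e6)   # Spartan-3A simulated at 50 MHz
virtex = latency_ms(21163, 100e6)   # Virtex-5 simulated at 100 MHz
```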
TABLE V. PROCESSING LATENCY ON DIFFERENT FPGAS FOR LRR

Operation              | Clock Cycles / Beam | Latency at 50 MHz, Spartan-3A (ms) | Latency at 100 MHz, Virtex-5 (ms)
Sweep Sampling         | 204756              | 2.04756                            | 2.04756
Window and feed to FFT | 2072                | 0.04144                            | 0.02072
FFT ¹                  | 3960                | 0.07920                            | 0.03960
PSD (4 parallel units) | 10743               | 0.21486                            | 0.10743
CFAR                   | 4388                | 0.08776                            | 0.04388
Total Processing       | 21163               | 0.42326                            | 0.21163
Overall                | 225928              | 2.47082                            | 2.25928

¹ This FFT delay is for real-valued input data.
V. SIMULATION AND VALIDATION

A randomly generated 3-lane highway scenario, as shown in Fig. 11, has been used to test the validity of the HDL code. The scenario includes 7 arbitrary targets covered by a composite 17° wide scanning beam formed by the 3 beam ports of the Rotman lens as shown in Fig. 1 [5]. Table VI compares the range accuracy of the Matlab and HDL implemented versions with the actual values for the LRR mode only. The simulation has been carried out over 6 iterations of time-domain Matlab-generated samples, with the algorithm running on a Xilinx Virtex-5 ML506 development board. During simulation, the following physical conditions are assumed:
1. Light-medium rain producing RF attenuation of 0.8 dB/km [14].
2. Negligible attenuation and reflection from radome with less than 0.05 mm thick water deposition [15].
3. Simulated targets can be entirely described by Swerling I, III and V (or 0) models [16] and clutter sources are spectrally stationary.
4. Best case SNR of 4.73 dB.

Table VII shows a similar comparison of the velocity accuracy for the same targets in the LRR mode. From Tables VI and VII, it appears that the HDL version of the developed algorithm
Figure 11. Highway test scenario.
can determine the range with a maximum error of 0.28 m when compared with the actual values, and the maximum error in relative velocity is only 3 km/h (0.83 m/s). In both cases, the HDL generates more accurate results compared to the Matlab-determined values.

Comparative accuracy and FPGA parameters for all three modes are listed in Table VIII, which makes clear that the SRR mode offers the highest range and velocity accuracy. Table IX compares the range and velocity accuracy of all 3 modes of the implemented HDL version with the state-of-the-art Bosch LRR3. From Table IX, it is clear that the new radar can determine the range and velocity with almost the same accuracy as the Bosch LRR3 while covering 3 ranges, and the complete cycle time of all three modes is 62 ms as compared to 80 ms for the Bosch LRR3. The marginal
TABLE VI. LRR RANGE ACCURACY COMPARISON: MATLAB-HDL

Target ID | Actual Distance from Host (m) | Matlab value (m) | HDL value (m) | Δ Matlab-Actual (m) | Δ HDL-Actual (m) | Δ Matlab-HDL (m)
1         | 9.00                          | 9.38             | 9.00          | 0.38                | 0.00             | 0.38
2         | 24.00                         | 24.34            | 24.00         | 0.34                | 0.00             | 0.34
3         | 29.00                         | 29.27            | 29.00         | 0.27                | 0.00             | 0.27
4         | 55.00                         | 55.37            | 55.00         | 0.37                | 0.00             | 0.37
5         | 78.00                         | 78.32            | 78.00         | 0.32                | 0.00             | 0.32
6         | 106.00                        | 106.28           | 106.00        | 0.28                | 0.00             | 0.28
7         | 148.00                        | 148.37           | 147.75        | 0.37                | 0.28             | 0.62
TABLE VII. LRR VELOCITY ACCURACY COMPARISON: MATLAB-HDL

Target ID | Actual Velocity relative to Host (km/h) | Matlab value (km/h) | HDL value (km/h) | Δ Matlab-Actual (km/h) | Δ HDL-Actual (km/h) | Δ Matlab-HDL (km/h)
1         | 123                                     | 123.85              | 123.5            | 0.85                   | 0.5                 | 0.35
2         | 55                                      | 52.31               | 53.5             | 2.69                   | 1.5                 | 1.19
3         | 89                                      | 89.78               | 87.5             | 0.78                   | 1.5                 | 2.28
4         | 100                                     | 100.00              | 100.0            | 0.00                   | 0.0                 | 0.00
5         | 70                                      | 69.34               | 70.5             | 0.66                   | 0.5                 | 1.16
6         | 80                                      | 79.56               | 83.0             | 0.44                   | 3.0                 | 3.44
7         | 22                                      | 21.64               | 22.0             | 0.36                   | 0.0                 | 0.36
TABLE VIII. TRI-MODE RADAR DESIGN SPECIFICATIONS

Criteria                                            | SRR         | MRR         | LRR
Range coverage                                      | 0-30 m      | 30-100 m    | 100-200 m
Relative velocity coverage                          | ±300 km/h   | ±300 km/h   | ±300 km/h
Up or down sweep duration                           | 6 ms        | 3 ms        | 1 ms
Sweep bandwidth                                     | 2000 MHz    | 1400 MHz    | 800 MHz
Required sampling rate                              | 200 kSPS    | 700 kSPS    | 2000 kSPS
FFT size                                            | 1024 points | 2048 points | 2048 points
FFT frequency resolution                            | 196 Hz/bin  | 342 Hz/bin  | 977 Hz/bin
Range accuracy with worst-case VCO linearity of 25% | ±0.28 m     | ±0.29 m     | ±0.34 m
Range accuracy with practical VCO linearity of 1%   | ±0.10 m     | ±0.14 m     | ±0.28 m
Velocity accuracy                                   | ±0.14 m/s   | ±0.42 m/s   | ±0.83 m/s
Processing delay per beam port                      | 106 µs      | 212 µs      | 212 µs
deviation from Bosch LRR3 for MRR and LRR modes is offset by the fact that the combined cycle time for the 3 modes (62 ms) is smaller than the cycle time of Bosch LRR3 (80 ms).
TABLE IX. TRI-MODE RADAR ACCURACY COMPARISON WITH BOSCH LRR3

Parameter               | Bosch LRR3   | MEMS Tri-Mode (Xilinx Virtex-5 HDL simulation)
                        |              | SRR    | MRR    | LRR
Range (m)               | 0.5-250      | 0.4-30 | 30-100 | 100-200
Velocity (km/h)         | -100 to +200 | ±300   | ±300   | ±300
Range accuracy (m)      | ±0.10        | ±0.10  | ±0.14  | ±0.28
Velocity accuracy (m/s) | ±0.12        | ±0.14  | ±0.42  | ±0.83
Processing latency      | N/A          | 106 µs | 212 µs | 212 µs
Cycle time              | 80 ms        | 62 ms for the three modes combined
VI. CONCLUSIONS

A signal processing algorithm has been developed and tested on a Xilinx Virtex-5 SX50T FPGA for a MEMS based tri-mode (short, mid, and long range) 77 GHz FMCW automotive radar to determine the range and velocity of multiple targets. The MEMS based radar uses a microfabricated Rotman lens and MEMS SP3T RF switches in conjunction with a reconfigurable microstrip antenna array with embedded MEMS SPST switches. The proposed signal processing hardware implementation for the MEMS radar makes use of independent modules, thus eliminating the latency posed by the MicroBlaze softcore used in [11]. The algorithm can determine the target range with a maximum error of ±0.28 m for LRR and ±0.10 m for SRR. The maximum velocity error has been determined as ±0.83 m/s for LRR and ±0.14 m/s for SRR. Further investigation shows that, by increasing the sweep duration to 6 ms, the maximum velocity error for LRR can be reduced to 0.14 m/s following (2). These accuracies of the FPGA-determined range and velocity results meet the specifications set by the auto industry and are on par with the Bosch/Infineon LRR3 radar. The developed system allows for a highly reliable, low cost, small form factor radar sensor that can enable even lower-end vehicles to be equipped with a collision avoidance system. All the MEMS components have been fabricated, and the assembly and packaging of a prototype device is in progress.
ACKNOWLEDGMENT

The authors gratefully acknowledge the additional support provided by the Canadian Microelectronics Corporation (CMC Microsystems) and Evigia Systems Inc., Ann Arbor, MI.
REFERENCES
[1] J. S. Jermakian, "Crash Avoidance Potential of Four Passenger Vehicle Technologies," Insurance Institute for Highway Safety, April 2010. [Online]. Available: http://www.iihs.org/research/topics/pdf/r1130.pdf
[2] H. Arnold, "Infineon: Automotive radar is aimed at mid-range cars." [Online]. Available: http://www.electronics-eetimes.com/en/infineon-automotive-radar-is-aimed-at-mid-range-cars?cmp_id=7&news_id=202803354
[3] R. Lachner, "Development Status of Next Generation Automotive Radar in EU," ITS Forum 2009, Tokyo, 2009. [Online]. Available: http://www.itsforum.gr.jp/Public/J3Schedule/P22/lachner090226.pdf
[4] G. Rollmann, "Frequency Regulations for Automotive Radar," SARA, presented at the Industrial Wireless Consortium (IWPC), Düsseldorf, Germany, 2009.
[5] R. Schneider, H. Blöcher, K. Strohm, "KOKON – Automotive High Frequency Technology at 77/79 GHz," in Proc. 4th European Radar Conference, Munich, Germany, 2007, pp. 247-250.
[6] R. Stevenson, "SiGe threatens to weaken GaAs' grip on automotive radar," Compound Semiconductor, 2009. [Online]. Available: http://www.compoundsemiconductor.net
[7] J. Oberhammer, "RF MEMS Steerable Antennas for Automotive Radar and Future Wireless Applications (SARFA)," NORDITE, the Scandinavian ICT Research Programme. [Online]. Available: http://www.sarfa.ee.kth.se
[8] A. Sinjari, S. Chowdhury, "MEMS Automotive Collision Avoidance Radar Beamformer," in Proc. IEEE ISCAS 2008, Seattle, WA, 2008, pp. 2086-2089.
[9] D. Kok, J. S. Fu, "Signal Processing for Automotive Radar," in IEEE Radar Conf. (EURAD 2005), Arlington, VA, June 2005, pp. 842-846.
[10] J. Saad, A. Baghdadi, "FPGA-based Radar Signal Processing for Automotive Driver Assistance System," in IEEE/IFIP Intl. Symp. Rapid System Prototyping, Fairfax, VA, 2009, pp. 196-199.
[11] T. R. Saed, J. K. Ali, Z. T. Yassen, "An FPGA Based Implementation of CA-CFAR Processor," Asian Journal of Information Technology, vol. 6, no. 4, pp. 511-514, 2007.
[12] Robert Bosch GmbH, "LRR3: 3rd Generation Long-Range Radar Sensor." [Online]. Available: http://www.bosch-automotivetechnology.com/media/en/pdf/fahrsicherheitssysteme_2/lrr3_datenblatt_de_2009.pdf
[13] G. Hampson, "Implementation Results of a Windowed FFT," Sys. Eng. Div., Ohio State Univ., Columbus, OH, July 12, 2002. [Online]. Available: http://esl.eng.ohio-state.edu/~rstheory/iip/window.pdf
[14] P. W. Gorham, "RF Atmosphere Absorption/Ducting," Antarctic Impulsive Transient Antenna Project (ANITA), Univ. Hawaii (Manoa), April 21, 2003.
[15] A. Arage, G. Kuehnle, R. Jakoby, "Measurement of Wet Antenna Effects on Millimeter Wave Propagation," in Proc. IEEE Radar Conf., New York City, NY, 2006, pp. 190-194.
[16] P. Swerling, "Probability of Detection for Fluctuating Targets," The RAND Corp., Santa Monica, CA, Mar. 17, 1954.
FPGA based Real-Time Object Detection Approach with Validation of Precision and Performance

Alexander Bochem, Kenneth B. Kent
Faculty of Computer Science
University of New Brunswick
Fredericton, [email protected], [email protected]

Rainer Herpers
Department of Computer Science
University of Applied Sciences Bonn-Rhein-Sieg
Sankt Augustin, [email protected]
Abstract—This paper presents the implementation and evaluation of a computer vision problem on a Field Programmable Gate Array (FPGA). This work is based upon previous work in which the feasibility of application specific image processing algorithms on an FPGA platform was evaluated by experimental approaches. This work covers the development of a BLOB detection system in Verilog on an Altera Development and Education II (DE2) board with a Cyclone II FPGA. It detects binary spatially extended objects in image material and computes their center points. Bounding Box and Center-of-Mass methods have been applied for estimating the center points of the BLOBs. The results are transmitted via a serial interface to the PC for validation against their ground truth and for further processing. The evaluation compares precision and performance gains dependent on the applied computation methods.
keywords: FPGA; BLOB Detection; Image Processing; Bounding Box; Center-of-Mass; Verilog
I. INTRODUCTION
The usability of interfaces in software is a major criterion for acceptance by the user. A major issue in computer science is the improvement of information representation in interfaces and finding alternatives for user interaction. Simplifying the way a user operates a computer helps optimize usability and increases the benefit of digitization.
A common virtual reality (VR) system uses standard input devices such as keyboards and mice, or multimodal devices such as omnidirectional treadmills and wired gloves. The issue with these "active input devices" is the lack of information about the user and his point of interest. A computer mouse, for example, only gives information about its relative position on the desk in relation to its position on the computer monitor. A wired glove represents the user in the coordinate system of the virtual world, but it gives no information about the absolute position in the real world.
The common ways to display a virtual environment are computer monitors, customized stereoscopic displays, also known as Head Mounted Displays (HMDs), and the Cave Automatic Virtual Environment (CAVE). In general, computer screens are relatively small compared to the virtual environment they display, and the frame of a computer monitor causes a significant disturbance in the user's perception of the VR. HMDs, on the other hand, allow complete immersion in the virtual reality; the downside is that HMDs are heavy and therefore not recommended for longer wear. The CAVE concept applies projection technology on multiple large screens arranged in the form of a big cubicle [1], which usually fits several people. Based on the CAVE concept, the Bonn-Rhein-Sieg University of Applied Sciences has invented a low-cost mobile platform for immersive visualisations called "Immersion Square" [2]. It consists of three back-projection walls using standard PC hardware for the image processing task.
For immersive environments, knowing the position and orientation of the user in relation to the projection surface would allow alternative ways of user interaction. Based on this information, it would be possible to manipulate the ambiance without having the user actively use an input device. At the Bonn-Rhein-Sieg University of Applied Sciences this
problem has been addressed by the Computer Vision research group in a project named 6 Degree-of-Freedom Multi-User Interaction-Device for 3D-Projection Environments (MI6). The project's intent is to estimate the position and orientation of the user to improve the interaction of the user with the immersive environment. For this purpose a light-emitting device is developed that creates a unique pattern of light dots. This device will be carried by the user and pointed towards the area on the projection screens of the immersive environment where the user is looking. By detecting the light dots on the projection screens, it is possible to estimate the position and orientation of the user in the cubicle, as shown in [3], [4]. One problem is that the detection of the BLOBs and the estimation of their center points require fast image processing, since the intent is to provide immediate feedback to the user for any change of direction and position in the immersive environment. Another problem is the required precision for the estimation of the BLOBs' center points: a mismatch of a BLOB's center point by a few pixels will cause an error in the estimated position of the user. With those problems in mind, the intent is to develop a BLOB detection system that offers both high performance and high accuracy.
An FPGA is a grid of freely programmable logic blocks which can be combined through interconnection wires. This gives the hardware designer a flexible and application specific hardware design. While FPGAs were initially used only for prototyping, the revolution of personal computers helped this field become attractive for consumer and industry products as well. The decreasing sizes of chip circuit elements allowed manufacturers to create FPGAs which are large enough to fit even complete processor designs. Those advantages have been perceived in the computer vision area [5] and used in various research projects [6], [7], [8]. The FPGA architecture allows one to create a hardware design that can process data in parallel, like a multi-core architecture or a GPU. This again is well suited for computer vision problems, as addressed in [9]. It has to be kept in mind that, although ASICs and FPGAs have the design of hardware in common, the circuit space and power consumption of an FPGA are still several times higher than those of an ASIC.
The remainder of this paper is organized as follows. Section 2 gives a brief overview of the applied methods and the system design. Section 3 shows how the implementation of the system is completed. In Section 4 the results of the system are verified and validated. Section 6 finally concludes the paper.
II. SYSTEM DESIGN
Fig. 1: Schematic design of the system architecture.
The target platform of the realized system design is a DE2 board from Altera [10]. Figure 1 shows the schematic design of the BLOB detection system. The image material can be acquired from different input devices. The analog video input is used for precision evaluation with recorded image material, while the CCD camera has been applied for performance evaluation with real-time data. Both input devices have individual pre-processing modules and provide the image material in RGB format with a resolution of 640x480 pixels. The AD-converter of the analog video input performs at 30 frames per second (fps). The CCD camera can provide up to 70 fps for the given image resolution if the full five megapixels of the camera sensor are used [11].
A. CCD Camera
The CCD camera D5M from Terasic has been used [11] for the performance evaluation of the BLOB detection approach. The camera sensor is arranged in a Bayer pattern format. The read-out
Fig. 2: Four and eight pixel neighbourhood.
speed depends on several features such as resolution, exposure time, and the activation of skipping or binning. Skipping and binning are used to combine lines and columns of the CCD sensor, which allows the design to read out a larger area of the sensor in the form of a smaller image resolution. Consecutive lines in the sensor are merged and handled as a single image line. The controller on the D5M camera can read out one pixel per clock pulse and processes each sensor line in sequential order.
B. BLOB Detection
In computer vision, a binary large object (BLOB) is considered a set of pixels which share a common attribute and are adjacent. This common attribute is in most cases defined by the color or brightness of the pixel; the representation of the pixel depends on the color model applied in the image material. For the detection of the BLOBs, the first problem to be solved is the identification of relevant pixels. In this work, a pixel is considered relevant if its brightness value exceeds a specified threshold value. This threshold value can be a static parameter or a computed average value based upon the pixel values of previous frames.
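The relevance test described above can be sketched as a small Python function; this is illustrative only, and the function name is an assumption.

```python
def relevant_pixels(frame, threshold):
    """Return (x, y) coordinates of pixels whose brightness exceeds the threshold.

    frame: 2D list of brightness values. threshold may be a static parameter
    or an average computed from previous frames, as described in the text.
    """
    return [(x, y)
            for y, row in enumerate(frame)
            for x, value in enumerate(row)
            if value > threshold]
```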
The adjacency condition is the second important step in BLOB detection. The two most common definitions for adjacency are known as the four pixel neighbourhood and the eight pixel neighbourhood. Figure 2 shows the two ways of labelling pixels to describe adjacency. In the left image, the four pixel neighbourhood is applied and four BLOBs are detected; the adjacency check considers only the horizontal and vertical axes. In the right image it can be seen that the same pixels are labelled as only two BLOBs, since the eight pixel neighbourhood takes the diagonal axes into account as well [12]. The check of the adjacency condition has to be performed on all pixels that match the criteria of the common attribute. Based on those two actions, all BLOBs in a frame can be found.
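The effect of the two neighbourhood definitions can be illustrated with a small flood-fill sketch in Python; the names are assumptions, and this is not the hardware implementation.

```python
# Offsets for the four and eight pixel neighbourhoods.
N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def count_blobs(pixels, neighbourhood):
    """Count connected groups in a set of (x, y) pixels via flood fill."""
    pixels = set(pixels)
    blobs = 0
    while pixels:
        stack = [pixels.pop()]   # start a new BLOB from any remaining pixel
        blobs += 1
        while stack:
            x, y = stack.pop()
            for dx, dy in neighbourhood:
                if (x + dx, y + dy) in pixels:
                    pixels.remove((x + dx, y + dy))
                    stack.append((x + dx, y + dy))
    return blobs
```

Two diagonally touching pixels form two BLOBs under the four pixel neighbourhood but a single BLOB under the eight pixel neighbourhood.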
The aim of the BLOB detection for our requirements is to determine the center points of the BLOBs in the current frame. With respect to the application area, this project describes BLOBs as sets of white and light gray pixels while the background pixels are black. This circumstance follows from the setup for image acquisition, where infrared cameras will be applied to track the projection surface of the Immersion Square. The expected image material will be similar to the samples in Figure 3.
Fig. 3: Example of BLOBs perfect shape (left) and withblur (right).
The characteristics of the blurring effect depend on the acceleration of the light source by the user. This effect causes some issues for the estimation of the BLOBs' center points, which have been addressed in this project.
The Bounding Box based computation estimates the center point of the BLOB by searching for the minimal and maximal XY-coordinates of the BLOB. The computation can be implemented very efficiently and does not cause large performance issues.
BLOB's X center position = (max X position + min X position) / 2
BLOB's Y center position = (max Y position + min Y position) / 2
The result for the center coordinates is strongly affected by the pixels at the BLOB's border, and this effect becomes even stronger for BLOBs in motion. With reference to the light-emitting device for the Immersion Square environment, the angle between the light beams and the projection surface changes the shape of the BLOBs and increases the range of pixels with less intensity. In addition, the movement of the device by the user will cause motion blur. These effects increase the flickering of pixels at the BLOB's border and cause flickering in the computed center point.
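The Bounding Box rule above can be sketched as a short Python function; this is an illustrative sketch, and the function name is an assumption.

```python
def bounding_box_center(pixels):
    """Center point from the min/max X and Y coordinates of a BLOB's pixels."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return (max(xs) + min(xs)) / 2, (max(ys) + min(ys)) / 2
```

Only the extreme coordinates contribute, which is why border flicker moves the computed center.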
With Center-of-Mass, the coordinates of all pixels of the detected BLOB are taken into account for the computation of the center point. The algorithm combines the coordinate values of the detected pixels as a weighted sum and calculates an averaged center coordinate [12].

BLOB's X center position = (Σ X positions of all BLOB pixels) / (number of pixels in BLOB)
BLOB's Y center position = (Σ Y positions of all BLOB pixels) / (number of pixels in BLOB)
To achieve even better results, the brightness values of the pixels have been applied as weights as well. This increases the precision of the estimated center point with respect to the BLOB's intensity.
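The plain and intensity-weighted Center-of-Mass computations can be sketched together in Python; an illustrative sketch, with names chosen here as assumptions.

```python
def center_of_mass(pixels, weights=None):
    """Averaged center coordinate of a BLOB's (x, y) pixels.

    Optional brightness values act as weights, as in the intensity-weighted
    variant described above; without weights every pixel counts equally.
    """
    if weights is None:
        weights = [1.0] * len(pixels)
    total = sum(weights)
    cx = sum(w * x for (x, _), w in zip(pixels, weights)) / total
    cy = sum(w * y for (_, y), w in zip(pixels, weights)) / total
    return cx, cy
```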
As described in [12], the sequential processing of a frame requires an additional adjacency check for the BLOBs themselves. Depending on the BLOB's shape and orientation, the detection might separate the pixels of one BLOB into two different BLOBs. The adjacency check for BLOBs is based on the same concept as explained for pixels earlier.
C. Serial Interface
To observe the system process and evaluate its results, a module for communication over the serial interface of the target board has been developed. Other hardware interfaces were available, such as USB, Ethernet, and IrDA, but the serial interface allows the smallest design with respect to protocol overhead and resource allocation on the FPGA. The serial module reads the information about the BLOB results from a FIFO buffer and transmits it to the RS232 controller. The serial interface module operates independently from the other system modules and sends results as long as the FIFO containing the BLOB results is not empty.
III. IMPLEMENTATION
The overall processing of the BLOB detection is separated into several sub-processes. For higher flexibility of the system design, the functionality has been implemented in separate modules. Those modules use FIFOs to store their results, which allows all modules to run separately with individual clock rates. The identification of pixels which might belong to a BLOB and the collection of BLOB attributes are very similar for both BLOB detection solutions. Only the computation of the center point for the detected BLOBs differs, based on the algorithms explained in the previous section.
In the first processing step, the check of the common attribute on the RGB pixel data identifies the relevant data. All pixels which do not match the criteria are dropped from further processing. The chosen identification criterion is the brightness value of the pixel: every pixel with a value above the specified threshold is saved for the adjacency check. Those pixels are saved in a dual-clock FIFO from which the adjacency check acquires its input data.
The BLOB detection module implements the adjacency check for the pixel data and sorts the pixels into data structures referred to as containers. During the detection process, the containers are continuously updated until a new frame starts. At that point the BLOB attributes are written into a FIFO and the container contents are cleared. The number of attributes stored for each BLOB depends on the selected method for computing the center point.
The last task to be processed in the BLOB detection is the computation of the center point. For controlling the processing of the BLOB attributes, the module is designed as a state machine (Figure 4). Using state machines is a common method for process control in hardware design. The results of the center point computation are written into the result FIFO, from which the serial communication module obtains its input data.
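A minimal software analogue of such a FIFO-driven state machine is sketched below. The state names and attribute record layout are assumptions for illustration, not the exact states of Figure 4.

```python
from collections import deque

def center_point_fsm(attribute_fifo, result_fifo):
    """Drain BLOB attribute records and push computed center points.

    Mimics the read -> compute -> write cycle of a hardware state machine
    sitting between two FIFOs; uses Bounding Box attributes as an example.
    """
    state = "IDLE"
    while True:
        if state == "IDLE":
            if not attribute_fifo:   # nothing left to process
                break
            state = "READ"
        elif state == "READ":
            blob = attribute_fifo.popleft()   # one attribute record per BLOB
            state = "COMPUTE"
        elif state == "COMPUTE":
            cx = (blob["min_x"] + blob["max_x"]) / 2
            cy = (blob["min_y"] + blob["max_y"]) / 2
            state = "WRITE"
        elif state == "WRITE":
            result_fifo.append((cx, cy))      # the serial module reads from here
            state = "IDLE"
```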
IV. VERIFICATION AND VALIDATION
For the verification and validation of the BLOB detection results, two different input sources have been applied. The S-Video input is used for the
Fig. 4: State Machine for reading BLOB attributes and computing center points.
verification of the detection accuracy and precision of the different center point computation methods. Secondly, the CCD camera input is used for performance benchmarking of the system design. For the performance measurements, the visualization of the captured image data from the CCD camera on the VGA display had to be disabled, because the VGA controller module did not support higher frame rates. The applied input material on the S-Video input had a resolution of 640x480 pixels. The sampling of the AD-converter was set to the same resolution to avoid the artificial offset in the digitized image material that would have shown up for differing image resolutions. The ground truth values for the expected BLOB center points had been verified by hand.
Based on the specified application area, the shapes of the BLOBs to be detected have been estimated as perfect circles and circular shapes with a blur effect, as shown in Figure 3. The BLOB detection system needs to be configured before its results are used for further processing; this includes the estimation of the best threshold value for the given application environment.
A. Precision
For the computation of the BLOBs’ center pointthe Bounding Box and the Center-of-Mass basedmethods showed the exact same results for the clearBLOBs with a perfect circular shape. This has beentested for several representative threshold values.If a BLOB is not showing any blur effect, theapplied method for computing the center point hasno influence on the precision.
The Bounding Box and Center-of-Mass computations showed different results for image material containing BLOBs with a blur effect. The Center-of-Mass results turned out to be closer to the BLOB's center point. The results for one particular example are shown in Table I.
Threshold | Ground Truth (X, Y) | Bounding Box (X, Y) | Center-of-Mass (X, Y)
0x190     | 230, 306            | 226, 308            | 229, 307
0x20B     | 230, 306            | 227, 308            | 229, 307
0x286     | 232, 305            | 228, 307            | 230, 306
0x301     | 234, 303            | 231, 305            | 233, 304
0x37C     | 237, 300            | 236, 300            | 237, 300
0x3E0     | 242, 298            | 241, 299            | 241, 298

TABLE I: Results for center point computation with blur-shaped BLOBs.
Center point computation with Center-of-Mass shows higher precision for BLOBs with a blur effect, compared to Bounding Box. The Center-of-Mass method shows an error of 0.02 %, the Bounding Box based computation an error of 0.04 %. The threshold values used for the evaluation of precision lie within the value range of the BLOBs in the applied image material. This value range depends on the applied image material and cannot simply be reused for an arbitrary image source or material. The estimation of the value range is a configuration requirement before using the BLOB detection system. For threshold values above or below the given value range, the system was not able to detect all BLOBs in the image material accurately.
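The two center-point methods compared above can be sketched as follows. This is an illustrative software model, not the paper's hardware implementation: it operates on the binarized blob pixels, and the unweighted center of mass is an assumption (an implementation could also weight by grey value).

```python
def bounding_box_center(pixels):
    """Center as the midpoint of the blob's bounding box.
    pixels: iterable of (x, y) coordinates above the threshold."""
    xs = [p[0] for p in pixels]
    ys = [p[1] for p in pixels]
    return ((min(xs) + max(xs)) / 2.0, (min(ys) + max(ys)) / 2.0)

def center_of_mass(pixels):
    """Center as the mean of all blob pixel coordinates (unweighted
    center of mass of the binarized blob)."""
    n = len(pixels)
    return (sum(p[0] for p in pixels) / n, sum(p[1] for p in pixels) / n)
```

For a symmetric blob both functions agree, which matches the observation above for clear circular BLOBs; a one-sided blur tail adds extra pixels on one side, pulling the bounding-box midpoint further off-center than the mass center.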
B. Performance
The system performance has been evaluated on a fixed environment setup. The CCD camera has been applied to perform the BLOB detection on real-time image data. The performance has been measured by counting the number of frames per second processed on the DE2 board. To verify that the BLOB detection was working correctly, the results have been observed manually on the connected host PC. The visualization of the BLOB detection results on the connected host PC allowed a reasonable validation by hand for an input speed of 45 frames per second, which was the limit of the CCD camera for the applied resolution of 640x480 pixels. The measured performance values and resource allocations are given in Tables II and III.
with Monitor Output          Bounding Box | Center-of-Mass
Speed (fps)                            12 | 12
Camera Speed (MHz)                     25 | 25
System Speed (MHz)                     40 | 50
Max. System Speed (MHz)                72 | 65
Allocated Resources on the FPGA
Logic Elements                      7,850 | 14,430
Memory Bits                       147,664 | 237,616
Registers                           2,260 | 2,871
Verified                                * | *

TABLE II: Resource allocation and benchmark results with monitor output.
without Monitor Output       Bounding Box | Center-of-Mass
Speed (fps)                            46 | 50
Camera Speed (MHz)                     96 | 96
System Speed (MHz)                    125 | 125
Max. System Speed (MHz)               140 | 189
Allocated Resources on the FPGA
Logic Elements                      5,884 | 13,311
Memory Bits                       113,364 | 239,316
Registers                           1,510 | 2,078
Verified                                * | *

TABLE III: Resource allocation and benchmark results without monitor output.
The performance result "fps" refers to the frame rate obtained during benchmarking. The "camera speed" is the clock rate used to read out the CCD sensor. "System speed" is the clock rate of the BLOB detection module during the performance test, and "max. system speed" describes the maximum possible clock rate for the implemented design. The CCD camera itself has an average speed of 45 frames per second for the applied configuration parameters. While both BLOB detection approaches would have been able to operate at faster frame rates, the estimation of the maximum performance was again restricted by the input source. The resource allocation shows that Center-of-Mass requires about twice as many logic elements and memory bits as Bounding Box.
Figure 5 shows the relation between the number of BLOBs that can be processed by the system and the maximum performance in MHz. The maximum number of BLOBs that can be detected with the applied target board is approximately seven for Center-of-Mass and sixteen for Bounding Box.
V. FUTURE WORK
It has been shown in Section IV-B that the BLOB detection could not be tested at its maximum performance. Both available input sources turned out to be a bottleneck for the acquisition of image material. It is planned to integrate the BLOB detection approach into a camera with an onboard FPGA. With direct access to the sensor of a high-performance camera, the BLOB detection can be evaluated at its maximum processing speed.
VI. CONCLUSION
With Bounding Box and Center-of-Mass, two common methods for the estimation of a BLOB's center point have been implemented. The evaluation has shown reliable precision results with respect to the given application area. As estimated in [12], the Center-of-Mass based approach shows higher precision and is the recommended solution for the given application task.
The BLOB detection approach is a working threshold-based solution specialized for BLOBs which consist of white and light grey pixels on a black image background. Both available input sources have been identified as a bottleneck for the system's processing speed, which demonstrates that the FPGA design is well qualified for the proposed computer vision problem.
For the evaluation and further processing, a module for transmitting results over the serial interface has been designed, tested and applied. The maximum performance of the serial interface is high enough even for faster frame processing rates, as described in Section II-C. The output format of the computation results can easily be changed in the system design. For instance, further information about the BLOBs, such as their direction of movement or speed, can be computed on the FPGA system and transmitted as well.
This paper has shown that a BLOB detection system can be successfully implemented and evaluated on an FPGA platform. Validation and verification showed reliable results and demonstrated the advantages of image processing tasks designed in hardware. The work required a higher time effort compared to the implementation of a similar system in a high-level language such as C++ or Java.
Acknowledgments
This work is supported in part by Matrix Vision,CMC Microsystems and the Natural Sciences andEngineering Research Council of Canada.
Fig. 5: Performance for BLOB Detection.
REFERENCES
[1] C. Cruz-Neira, D. J. Sandin, and T. A. DeFanti, "Surround-screen projection-based virtual reality: the design and implementation of the CAVE," in SIGGRAPH '93: Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques. New York, NY, USA: ACM, 1993, pp. 135–142.

[2] R. Herpers, F. Hetmann, A. Hau, and W. Heiden, "Immersion square - a mobile platform for immersive visualisations," University of Applied Sciences Bonn-Rhein-Sieg, 2005. [Online]. Available: http://www.cv-lab.inf.fh-brs.de/paper/remagen2005-I-Square1b.pdf

[3] C. Wienss, I. Nikitin, G. Goebbels, K. Troche, M. Gobel, L. Nikitina, and S. Muller, "Sceptre - an infrared laser tracking system for virtual environments," in Proceedings of the ACM Symposium on Virtual Reality Software and Technology (VRST 2006), 2006, pp. 45–50.

[4] M. E. Latoschik and E. Bomberg, "Augmenting a laser pointer with a diffraction grating for monoscopic 6DOF detection," Journal of Virtual Reality and Broadcasting, vol. 4, no. 14, Jan. 2007, ISSN 1860-2037. [Online]. Available: http://www.jvrb.org/4.2007/1275/4200714.pdf

[5] J. Hammes, A. P. W. Bohm, C. Ross, M. Chawathe, B. Draper, and W. Najjar, "High performance image processing on FPGAs," in Proceedings of the Los Alamos Computer Science Institute Symposium, Santa Fe, NM, 2000.

[6] A. Benedetti, A. Prati, and N. Scarabottolo, "Image convolution on FPGAs: the implementation of a multi-FPGA FIFO structure," in Proceedings of the 24th EUROMICRO Conference, vol. 1, Aug. 1998. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.50.1811&rep=rep1&type=pdf

[7] D. G. Bariamis, D. K. Iakovidis, and D. E. Maroulis, "An FPGA-based architecture for real time image feature extraction," in Proceedings of the ICPR International Conference on Pattern Recognition, 2004, pp. 801–804. [Online]. Available: http://rtsimage.di.uoa.gr/publications/01334338.pdf

[8] J. Trein, A. T. Schwarzbacher, and B. Hoppe, "FPGA implementation of a single pass real-time BLOB analysis using run length encoding," MPC-Workshop, Feb. 2008. [Online]. Available: http://www.electronics.dit.ie/postgrads/jtrein/mpc08-1Blob.pdf

[9] J. W. MacLean, "An evaluation of the suitability of FPGAs for embedded vision systems," in The First IEEE Workshop on Embedded Computer Vision Systems, San Diego, June 2005. [Online]. Available: http://wjamesmaclean.net/Papers/cvpr05-ecv-MacLean-eval.pdf

[10] Terasic, DE2 Development and Education Board - User Manual, 1st ed., Altera, 101 Innovation Drive, San Jose, CA 95134, 2007. [Online]. Available: http://www.terasic.com/downloads/cd-rom/de2/

[11] Terasic, TRDB-D5M Hardware Specification, Terasic, June 2008.

[12] A. Bochem, R. Herpers, and K. Kent, "Hardware acceleration of BLOB detection for image processing," in Advances in Circuits, Electronics and Micro-Electronics (CENICS), 2010 Third International Conference on, July 2010, pp. 28–33.
Rapid Prototyping of OpenCV Image Processing Applications using ASP

Felix Muhlbauer*, Michael Großhans*, Christophe Bobda†
*Chair of Computer Engineering, University of Potsdam, Germany
{muehlbauer,grosshan}@cs.uni-potsdam.de
†CSCE, University of Arkansas, USA
[email protected]
Abstract—Image processing is becoming more and more present in our everyday life. With the requirements of miniaturization, low power and performance in order to provide intelligent processing directly in the camera, embedded cameras will dominate the image processing landscape in the future. While the common approach to developing such embedded systems is to use sequentially operating processors, image processing algorithms are inherently parallel; thus hardware devices like FPGAs provide a perfect match for developing highly efficient systems. Unfortunately, hardware development is more difficult, and there are fewer experts available compared to software. Automating the design process will leverage the existing infrastructure, providing faster time to market and quick investigation of new algorithms. We exploit ASP (answer set programming) for system synthesis, with the goal of generating an optimal hardware/software partitioning, a viable communication structure and the corresponding scheduling from an image processing application.
I. INTRODUCTION
Image processing is becoming more and more present in our everyday life. Mobile devices are able to automatically take a photo when detecting a smiling face; intelligent cameras are used to monitor suspicious people and operations at airports. In production chains, smart cameras are being used for quality control. Besides those fields of application, many others are being considered and will be widened in the future. The challenge in developing such embedded image processing systems is that image processing often results in very high resource utilization, while embedded systems are usually equipped with only limited resources. The common approach is based on general purpose processor systems, which process data mainly sequentially. In contrast, image processing algorithms are inherently parallel, and thus hardware devices like FPGAs and ASICs provide the perfect match to develop highly efficient solutions. Unfortunately, compared to software development, only few hardware experts are available. Additionally, hardware development is error prone, difficult to debug and time consuming, leading to a huge time-to-market. From an economic point of view, two criteria are important for the development: time to market and the performance of the product. Automatic synthesis, with the aim of generating an optimal architecture according to the application, will help provide the required performance in reasonable time.
Our motivation is to design a development environment in which high-level software developers can leverage the speed of hardware accelerators without knowledge of low-level hardware design and integration. Our approach consists of running a ready-to-use and well-known software library for computer vision on a processor featuring an operating system. We rely on very popular open source software, namely OpenCV [1] and the operating system Linux. This approach allows the software application developers to focus on the development of high-quality algorithms, which will be implemented with the best performance. Given an application, deciding how to map a task to a processing element and defining the underlying communication infrastructure and protocol is a challenging task.
In this paper we focus on the system synthesis problem of OpenCV applications using heterogeneous FPGA-based on-chip architectures as target architecture. We use ASP (answer set programming) to prune the solution space. The goal is to find optimal solutions for task mapping and communication simultaneously, using constraints like timings and chip resources.
This paper is structured as follows: after addressing related work, we explain our model for image processing architectures and the resulting design space. A brief introduction to ASP is followed by a description of the strategy for expressing and solving the problem in an ASP-like manner. The paper concludes with results and future work.
II. RELATED WORK
Search algorithms such as evolutionary algorithms are capable of solving very complex optimization problems. A generic approach is implemented by the PISA tool [2]. Here, a search algorithm is a method which tries to find solutions for a given problem by iterating three steps: first, the evaluation of candidate solutions; second, the selection of promising candidates based on this evaluation; and third, the generation of new candidates by variation of these selected
978-1-4577-0660-8/11/$26.00 c© 2011 IEEE
candidates. Most evolutionary algorithms as well as simulated annealing fall into this category. PISA is mainly dedicated to multi-objective search, where the optimization problem is characterized by a set of conflicting goals. The module SystemCoDesigner (SCD) was developed to explore the design space of embedded systems. The solution it finds is not necessarily the best possible; in contrast, we use ASP to find an optimal solution.
In Ishebabi et al. [3], several approaches to architecture synthesis for adaptive multi-processor systems on chip were investigated. Besides heuristic methods, ILP (integer linear programming), ASP and evolutionary algorithms were used. In our work the focus is less on processor allocation and more on communication, especially the handling of streaming data. The resulting complexity is different.
III. ARCHITECTURE
The flexibility of FPGAs allows for the generation of application specific architectures without modification of the hardware infrastructure. We are especially interested in centralized systems in which a coordinating processing element is supported by other slave processing elements. Generally speaking, these processing elements could be GPPs (general purpose processors) like those available in SMP systems, co-processors like FPUs (floating point units), or dedicated hardware accelerators (AES encryption, Discrete Fourier Transformation, ...). Each processing element has different capabilities and interfaces, which define the way data exchange with the environment takes place. Communication is also a key component of a hardware/software system: while an efficient communication infrastructure will boost the performance, a poorly designed communication system will badly affect the whole system.
In the following, our architecture model and the different paths of communication are described in detail. In general, the most important components for image processing are processing elements, memories and communication channels. The specification will therefore focus in more detail on those components.
In our architecture model we distinguish two kinds of processing elements: software executing processors and dedicated hardware accelerators, which we call PUs (processing units). For each image processing function one or more implementations may exist in software or hardware, each with a different processing speed and resource utilization (BRAM, slices, memory, ...).
Considering the communication, image data usually does not fit1 into the limited memory inside an FPGA and must be stored in an external memory. Because of the sequential nature (picture after picture) of the image capture, the computation on video data is better organized and processed as a stream. The hardware representation of this idea is known as pipelining: several PUs are concatenated and build up a processing
1A video with VGA resolution, 8 bit per color and 25 frames per second amounts to (640·480)·(3·8)·25 ≈ 175 megabit of data per second.
[Figure 1 block diagram omitted: a processor, several PUs and IMs inside the FPGA, linked by SDI connections to a memory controller (to external memory), a camera input and the system bus. Legend: PU = processing unit, IM = interconnection module.]

Fig. 1. Example architecture according to our model
chain. The data is processed while flowing through the PU chain. In order to allow a seamless integration of streaming oriented computation in a software environment, we implemented an interface called SDI (streaming data interface). The interface is simple and able to control the data flow to prevent data loss due to different speeds of the modules. It consists of the signals and protocols that allow an interlocking data transport across a chain of connected components. The SDI interface allows for reusing PUs in different calculation contexts.
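The interlocking, stream-oriented processing described here can be mimicked in software with chained generators, where each stage only pulls data when the downstream stage is ready. This is only a behavioral analogy to a PU chain over SDI, not the hardware interface itself; the two example stages (thresholding and run-length coding) are hypothetical PUs.

```python
def threshold(stream, level):
    # PU 1: binarize incoming pixels, consuming one pixel per step.
    for pixel in stream:
        yield 1 if pixel >= level else 0

def run_length(stream):
    # PU 2: collapse the binary stream into (value, count) runs.
    value, count = None, 0
    for bit in stream:
        if bit == value:
            count += 1
        else:
            if value is not None:
                yield (value, count)
            value, count = bit, 1
    if value is not None:
        yield (value, count)

# Chaining the generators mirrors chaining PUs: each stage only pulls
# data when the next stage asks for it, so no element is lost even if
# the stages run at different speeds.
pixels = [7, 9, 9, 2, 1, 8]
runs = list(run_length(threshold(pixels, level=5)))
```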
A variety of processors available for FPGAs, like PowerPC, MicroBlaze, NIOS, LEON or OpenRISC, provide a dedicated interface to connect co-processors. These interfaces can be used for instruction set extensions, but also as a dedicated communication channel to other modules.
Hence, to build an architecture, a PU or a chain of PUs can be connected to a memory or a processor. Furthermore, memories are accessible directly (e.g. internal BRAM) or via a shared bus (e.g. external memory). For these interconnections, so-called IMs (interconnection modules) are introduced in our model to link an SDI interface to another interface. Figure 1 shows an example architecture.
IV. DESIGN SPACE AND SCHEDULING
Compared to a software-only solution, the best architecture, based on a hardware/software partitioning, should be found. The search is based on a given set of image processing algorithms and a given set of objective functions and optimization constraints. These general requirements usually concern the system performance, the chip's resource utilization or the power consumption.
We use the task graph model to capture a computation application. It defines the dependencies between the different processing steps, from the capture of the raw image data to the production of the result.
Besides a pool of software and hardware implementations, a database was filled with meta information about these implementations, like costs and interoperability. Assuming that for each task a software implementation exists, the costs of selecting a component are its processing time and its memory utilization. For computationally intensive tasks a hardware implementation is available; the important costs here are processing time, initial delay, throughput and chip utilization (slices, BRAM, DSP, ...). This information is gathered, e.g., by profiling of function calls or data flow analysis.
The problem to be solved is to distribute the tasks to a selected set of processing elements while being aware of timings and scheduling.
As mentioned earlier, communication is a key part, and often a trade-off between communication and processing has to be found. For example, for adjacent tasks it could be faster in total to process them all locally instead of transferring the data from one task in the middle to and from another high-speed module.
While mapping the tasks to processors, two kinds of parallelism have to be considered: first, the parallel operation of independent processors as in SMP systems, and second, PUs processing data in series while streaming, as in pipelining. Both kinds of parallel operation have a different impact on the scheduling.
Figure 2 shows a Petri net modeling the behavior of the implementation of an image processing function. One byte of data is represented by one token, which traverses the chain from pin to pout according to the implementation type T. I is the number of pixels in one image or frame. The two paths on the left describe filter-like operations which consume one image with α bytes per pixel and output one image with possibly another data rate α′. The two paths on the right describe operations with a fixed result size r, like the image brightness. While stream-based implementations (T=0;2) work on a pixel-by-pixel basis and cause an initialization delay (modeled as transitions t1 / t4) and a processing delay (t2 / t5), other implementations (T=1;3) need full access to the whole image and take a delay of t3 / t6.
PUs introduce two additional parameters which are important for calculating the scheduling: the initial delay and the throughput of data, which must be considered for chained PUs. It takes some time after a PU has read the first data before the first result is available. This delay determines the starting time of the next PU as specified in the task graph. Additionally, it is not always possible for a PU to operate at its maximum speed, because the speed is also determined by the speed of incoming data from the previous module. Thus, to calculate the actual operation speed of a PU, these two facts have to be taken into account too. The different operation speeds for different contexts make the decision for the best architecture more difficult. Furthermore, the interplay of components on the chip may incur communication bottlenecks. For example, parallel access to the main memory by different modules will incur delays in the computation.
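How initial delay and throughput propagate along a chain of PUs can be sketched as follows. This is a simplified illustrative model, not the paper's solver encoding: it assumes each PU starts once the upstream initial delays have elapsed, and that its effective speed is bounded by the slowest upstream stage (speed values as in the paper's relative model, 1 = fastest).

```python
def schedule_chain(pus):
    """pus: list of (initial_delay, slots_per_item) along the chain.
    slots_per_item is a relative speed value (1 = fastest component).
    Returns a (start_time, effective_speed) pair per PU.

    A PU can start once its predecessor emits the first result, and it
    can never run faster than the data arriving from upstream.
    """
    result = []
    start, upstream_speed = 0, 0
    for delay, speed in pus:
        effective = max(speed, upstream_speed)  # throttled by slower input
        result.append((start, effective))
        start += delay       # next PU waits for this PU's first output
        upstream_speed = effective
    return result
```

For instance, a fast third stage behind a slow second stage is forced down to the second stage's speed, which is exactly the context dependence described above.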
V. ANSWER SET PROGRAMMING
Answer Set Programming (ASP) is a declarative programming paradigm which uses facts, rules and other language elements to specify a problem. Based on this formal description, an ASP solver can determine possible sets of facts which fit the given model.
[Figure 2 Petri net omitted: tokens traverse from place pin to place pout along one of four paths (T=0: transitions t1, t2; T=1: t3; T=2: t4, t5; T=3: t6), with arc weights α, α′, αI, α′I, I and r as described in the text.]

Fig. 2. Petri net modeling the data flow within processing elements with different implementations T
A. Background
The basic concept for modeling ASP programs is rules of the form:
p0 ← p1, . . . , pm, not pm+1, . . . , not pn (1)
For a rule r the sets body+(r) = {p1, . . . , pm}, body−(r) ={pm+1, . . . , pn} and head(r) = p0 are defined.
To understand the concept of answer sets, the rule can be interpreted as follows: if an answer set A contains p1, . . . , pm, but not pm+1, . . . , pn, then p0 has to be inserted into this answer set.
Additionally, to avoid unfounded solutions: for an answer set A which contains p0, there must exist a rule r such that head(r) = p0, body+(r) ⊆ A and body−(r) ∩ A = ∅.
Of course, this is only an intuitive way to describe the wide field of answer sets. More precise definitions can be found in several works about answer set programming [4].
With regard to the answer set semantics, a solving strategy and some solving tools are needed to handle the proposed way of understanding logic programs.
The first step in computing answer sets is to build a grounded version of the logic program: all variables are eliminated by duplicating rules for each possible constant and substituting the variable. For example, the program:
p(1). p(2). q(X)← p(X). (2)
is grounded to:
p(1). p(2). q(1)← p(1). q(2)← p(2). (3)
Of course, the grounded version of the program can be much bigger than the original one. Another problem is that the grounder needs a complete domain for each variable. For this reason it is sometimes necessary to model such a domain manually; e.g., the grounder could need a time domain where all possible time values are explicitly given. After generating the grounded version, a SAT-like solver is used to compute all answer sets.
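The grounding step can be illustrated for the single-variable example above. This naive grounder is a didactic sketch only, not how the grounders behind real ASP solvers work internally; the rule and fact representation is an assumption made for the example.

```python
def ground(rules, facts):
    """Naively ground rules of the form head(X) :- body(X).

    rules: list of (head_pred, body_pred) pairs sharing one variable X.
    facts: list of (pred, constant) ground atoms, e.g. ('p', 1).
    The domain of X is taken from all constants appearing in the facts,
    mirroring the  p(1). p(2). q(X) :- p(X).  example, nothing more.
    """
    domain = sorted({c for _, c in facts})
    grounded = []
    for head, body in rules:
        for c in domain:
            grounded.append(((head, c), (body, c)))  # head(c) :- body(c)
    return grounded
```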
The most common way to model logic programs for an ASP solver is to use the generate-and-test paradigm. Thus, there are some rules which are responsible for generating a set of facts. Additionally, there exist some constraints which have to be met, such that the generated set is an answer set and consequently a solution for the given problem. For that reason, ASP solvers support some special language extensions.
Generating rules could be modeled using aggregates [5]:
l [ v0=a0, v1=a1, . . . , vn=an ] u. (4)
The brackets define a weighted sum of atoms v0, . . . , vn with weights a0, . . . , an. The rule describes the fact that a subset A ⊆ {v0, . . . , vn} of true atoms exists such that the sum of weights is within the bounds [l;u]. Omitted weights default to 1. These rules can be used to generate different sets of atoms.
Integrity constraints test whether a generated set of atoms is an answer set. These constraints describe which conditions must not be true in any answer set, and are written:
← p, q (5)
In this example, no answer set contains both p and q.
So far we have described all language features which are necessary to model our optimization problem.
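The generate-and-test idea can be illustrated by a brute-force search: generate candidate atom sets that satisfy a weight aggregate, then reject those violating integrity constraints. This sketch enumerates explicitly what an ASP solver explores far more cleverly; the data representation is hypothetical.

```python
from itertools import combinations

def solve(atoms, weights, lower, upper, forbidden_pairs):
    """Brute-force generate-and-test in the spirit of ASP solving:
    generate candidate sets via the weight aggregate  l [ v=a, ... ] u,
    then discard candidates violating integrity constraints  <- p, q.
    """
    solutions = []
    for r in range(len(atoms) + 1):
        for subset in combinations(atoms, r):
            total = sum(weights[a] for a in subset)
            if not (lower <= total <= upper):
                continue                  # aggregate bound violated
            if any(p in subset and q in subset for p, q in forbidden_pairs):
                continue                  # integrity constraint fires
            solutions.append(set(subset))
    return solutions
```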
B. Model
Our ASP model is structured in three parts: first, the problem description, including a task graph and constraints for the demanded architecture; second, a summary of meta information for all implementations, like costs, the mapping of tasks to HW or SW components and the mapping interconnect; third, the solver itself with all rules needed to find a solution. This separation (into files) also offers a highly flexible and reusable model.
The basic idea of the model is to select an implementationfor each task, connect the associated modules and finallyconsider timings and dependencies to build the scheduling.Details are described in the following sections.
C. Allocation of processors
First, each task needs to be mapped to exactly one component. For the introduced scenario this can be the main processor or a PU.
The number of permutations in the model to map components is M!, where M is the maximum number of components allowed to be instantiated. To reduce symmetries, a component is defined to have two indices: cij, j ∈ [1; Ji]. With Ji defining the maximum possible number of instantiations of a component i, the number of permutations is reduced to J1! · . . . · Jn!. The values Ji are derived from the task graph. Normally it is not necessary to instantiate a certain component more than once, thus Ji is often equal to 1.
For each instantiated component cij and each task tn, an atom M_{tn,cij} is defined, where i specifies the implementation type of a component and j an instance counter. Thus, for each task ti the sum of mapped components must equal 1:

1 [ M_{ti,c11}, . . . , M_{ti,ckl} ] 1. (6)
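The choice expressed by rule (6), each task mapped to exactly one component, spans the following search space, enumerated here explicitly for illustration; the ASP solver explores this space symbolically rather than by brute force.

```python
from itertools import product

def enumerate_mappings(tasks, components):
    """Generate every assignment in which each task is mapped to exactly
    one component, i.e. the candidates admitted by rule (6)."""
    for choice in product(components, repeat=len(tasks)):
        yield dict(zip(tasks, choice))
```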
D. Data flow
After all processing units are instantiated, they need to be connected. Connections are derived from edges in the task graph and can be simple point-to-point connections, but may also involve more components. For example, in the case that data should be transferred from a processing unit to a memory, the connection requires an IM in between to link the different interfaces. For each transfer n, an atom C_{n,cij,ckl} is defined, which indicates that data for that transfer is sent from component cij to ckl. This atom only exists if the transfer actually happens:

0 [ C_{n,cij,ckl} ] 1. (7)
Again, after generating the atoms, they have to be limited to useful ones. Modeling these constraints is similar to path-finding algorithms. Assuming a transfer n describes the dependency between two tasks tx (source) and ty (sink), the following constraints need to be met:
← M_{tx,cij}, [ C_{n,cij,ckl} ] 0. (8)

← M_{ty,cij}, [ C_{n,ckl,cij} ] 0. (9)
If a source task tx is mapped to a component cij, there must exist a component ckl to which cij sends data in transfer n (see rule 8). Similarly, if a sink task ty is mapped to a component cij, there must exist a component ckl which sends data to cij in transfer n (see rule 9). Otherwise the solution is invalid.
Additionally, the model must ensure that there exists a path for each transfer between the source and sink components, and it has to avoid senseless connections, e.g. in the case of incompatible interfaces between two components.
E. Time
To evaluate the performance of a hardware architecture, it is necessary to schedule all tasks in a temporal order that will ensure the minimal run-time of the algorithm. Modeling a temporal behavior can be done with the help of a time domain, which defines a discrete and finite set of possible time slots. Each task is assigned to a time slot to indicate its starting time. Additionally, the task graph is extended by two special tasks to mark the start and end of the computation. While the start task is assigned to time slot 0, the time slot of the end task indicates the total runtime and is used as the value to be optimized by the solver.
Choosing a practical duration for a time slot is difficult. Selecting shorter time intervals results in a very accurate scheduling; however, a small time slot leads to an explosion of the number of possibilities that the solver has to deal with. As a trade-off we choose a normalized time interval for a time slot, related to the fastest component: one time slot is the amount of time the fastest component takes to process a certain amount of data. The occupation of two time slots indicates that a component operates at half the speed of the fastest one.
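Under this normalization, a component's speed value is simply its processing time measured in units of the fastest component's time. The sketch below rounds up to whole slots; the rounding choice is an assumption, since the paper does not state how fractional ratios are handled.

```python
from math import ceil

def speed_value(processing_time, fastest_time):
    """Normalized speed: number of time slots a component occupies per
    unit of data, relative to the fastest component (one slot)."""
    return ceil(processing_time / fastest_time)
```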
In our ASP model, each task ti is assigned to exactly one time slot k, indicated by an atom T_{ti,k}:

1 [ T_{ti,1}, . . . , T_{ti,m} ] 1. (10)
where m is the total number of time slots available in the time domain, given as a constant in our model. To meet the dependencies given by the task graph, a task ty may not start before its predecessor tx:

← T_{tx,kx}, T_{ty,ky}, ky < kx. (11)
F. Synchronization
In Section IV we explained why the maximum operation speed of a component may not be exhausted and why the actual speed depends on the processing context. Therefore, an atom S_{ti,k} is defined:

1 [ S_{ti,1}, . . . , S_{ti,p} ] 1. (12)
where k is the speed of the task ti. Similar to the time model, a relative criterion was chosen for modeling the speed, for the same reasons. A value of 1 implies that the fastest component needs one time slot to process a certain amount of data. The constant p is the maximum speed value, and thus the speed of the slowest component.
The possible speed values of a component depend on the speed of its predecessor. If two tasks tx and ty are dependent and mapped onto adjacent components, the assigned speed values have to be equal:

← S_{tx,kx}, S_{ty,ky}, kx ≠ ky. (13)
To find a scheduling, some more helper values are needed. If the starting time and the speed of a task are known, then its end time can be determined. Similar to the definition of the starting time T in Section V-E, the end time of a task ti is described by an atom E_{ti,k}:

E_{ti,ke} ← S_{ti,d}, T_{ti,ks}, ke = ks + d. (14)
To introduce a local scheduling on each component, there may not exist any tasks with intersecting computation times. Thus, if two tasks tx and ty are mapped to the same component and ty starts after tx, then tx must have finished before ty starts:

← T_{tx,ks}, E_{tx,ke}, T_{ty,ky}, ks ≤ ky, ky < ke. (15)
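Rules (11) and (15) can be checked on a concrete candidate schedule with a short sketch. The dictionaries and the symmetric interval-overlap test are illustrative only; the ASP encoding states these conditions declaratively rather than checking them imperatively.

```python
def valid_schedule(start, end, mapping, deps):
    """Check the scheduling constraints on one candidate solution.

    start/end: dicts task -> start/end time slot (atoms T and E),
    mapping:   dict task -> component (atoms M),
    deps:      list of (predecessor, successor) edges of the task graph.
    """
    # Rule (11): a task may not start before its predecessor.
    for pred, succ in deps:
        if start[succ] < start[pred]:
            return False
    # Rule (15): tasks on the same component must not overlap in time.
    tasks = list(start)
    for i, tx in enumerate(tasks):
        for ty in tasks[i + 1:]:
            if mapping[tx] != mapping[ty]:
                continue
            if start[tx] < end[ty] and start[ty] < end[tx]:
                return False
    return True
```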
G. Resource utilization
With the rules introduced so far, it is possible to build up valid architectures. In the following, further rules are presented which, first, have a global influence on the quality of the generated architectures and, second, ensure compliance with the general conditions. In detail, this concerns memory bandwidth, chip area utilization and total runtime.
One major issue is the utilization of the system bus and memory bus, respectively, because most data is stored in the main memory and the memory interface easily becomes a bottleneck. For each point in time it must be assured that the bus is not overloaded and that the speed of the attached components is throttled if necessary.
In our model the speed of the system bus is given as a constant sb. For each time slot the traffic of all active bus transfers is summed up and compared to the system bus capacity. Furthermore, the traffic caused by a component is inversely proportional to its speed; e.g., if a component operates four times slower than the bus, its bus utilization is one quarter. This is expressed by the following inequality:
1/stc1 + . . . + 1/stcn ≥ 1/sb (16)
For the time slot t the components c1, . . . , cn load the bus according to their individual speeds stck. For ease of reading, cij is shortened to ck.
In ASP, fractional numbers should be avoided and integer numbers used instead. Therefore the bus capacity is modeled as discrete work slots which can be allocated by active components. The constant p (introduced in rule 12) is derived from the slowest component, and hence the minimal bus load is 1/p if the bus speed sb equals 1. This also results in p as the number of needed slots, respectively p/sb for sb ≠ 1.
To understand this, consider that it is possible to normalize all speed values, including the maximum value p, by sb, and obtain a new maximum value p′ = p/sb and a new bus speed s′b = 1.
For a component ck sending data and operating at speed stck, the normalization coefficient is stck/sb. Thus ck uses sb/stck of the bus capacity and consequently allocates
sb/stck · p′ = sb/stck · p/sb = p/stck (17)
work slots. With this normalization, inequality (16) becomes an integrity constraint using only integer numbers:

← ⌈p/stc1⌉ + . . . + ⌈p/stcn⌉ ≥ ⌊p′⌋. (18)
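The work-slot accounting of (17)-(18) can be sketched procedurally; a minimal illustration, assuming integer speeds and sb dividing p (helper name ours):

```python
import math

# Sketch of the integer work-slot normalization from (17)-(18):
# the bus offers floor(p') = p // sb slots per time slot; a component
# at speed s allocates ceil(p / s) of them. Mirroring the integrity
# constraint (18), a time slot is rejected when the allocations
# reach the capacity.

def bus_overloaded(speeds, p, sb=1):
    """speeds: speeds of components active on the bus in one time slot;
    p: speed value of the slowest component; sb: system bus speed."""
    capacity = p // sb                                  # floor(p')
    used = sum(math.ceil(p / s) for s in speeds)        # allocated slots
    return used >= capacity                             # body of (18)
```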
Another issue concerning the general constraints of a solution is the chip area. As described before, the resource utilization r of each component is given as part of the meta-information. For each instantiated component ck the value rk is represented by the atom Rckrk. With Ru defining the overall resource constraint, the integrity constraint
← Rc1r1 , . . . , Rcnrn , Ru, r = r1 + . . . + rn, u ≤ r. (19)
rejects architectures that consume too many resources. This rule is replicated to handle different resources like slices or BRAMs.
Finally, to obtain the optimized model, the total runtime should be minimized. As indicator for the runtime, the end task te was
task         throughput        resources
             PPC    PU       slices   BRAM
gauss         16     1          2       2
sobel         16     1          2       2
gradient       8     2          1       0
trace         16     -          -       -
system bus:       10
IM (bus):          3

TABLE I
BRIEF META-INFORMATION FOR DIFFERENT IMPLEMENTATIONS
defined earlier. In the ASP model an aggregate is used to find the time slot of te:
minimize [ Tte1 = 1, . . . , Ttem = m ]. (20)
Each atom Ttek is weighted by its time slot number k. Becauseonly one atom is true, the sum results in the time slot numberof the end task and hence the total runtime.
VI. RESULTS
At the University of Potsdam, Germany, a collection of tools called POTASSCO [6] was developed to support the computation of answer sets. Some of these tools are trend-setting and award-winning in the wide field of logic programming [7].
We use the tools gringo and clasp to solve our problem. These applications are capable of handling optimization statements, which themselves are very similar to aggregates. A sum of specific weighted literals is built, and the solver tries to optimize this sum during the solving process.
As an example application for this paper we used the Canny edge detector, a common preprocessing stage for object recognition. The processing steps are: camera → Gauss filter (noise reduction) → Sobel filter (find edges) → calculate gradient of edges → trace edges to find contours. While a hardware implementation is very fast for the first steps, the tracing of edges has no consecutive memory access and thus only a software implementation is assumed for it. Table I summarizes the resource utilization and the assumed throughput for each implementation.
On our test system2 the ASP solver needs about 3 seconds to find a solution. Figure 3 shows two different architectures generated while decreasing the constraint for the available chip area. The mapping of software tasks is illustrated with parallelograms and dashed arrows. In the bottom right corner of each drawing the consumption of chip area and the estimated runtime is given. Finally, the software-only solution (not shown) takes 65 time slots, compared to 21 for the most hardware-intensive architecture.
Our current ASP model is a first approach and thus not optimized. Nevertheless, we want to give an idea of the performance of the solving process. For each number of tasks from 5 up to 10, 1000 problems were generated randomly, in particular the task graph and the meta-information for the different implementations. Figure 4 shows an
2Desktop with Intel Core2 Duo Processor 3.16 GHz, 3.2 GB RAM
[Fig. 3 drawings omitted: two FPGA architectures built from processor, memory controller, IM, and PU modules (gauss, sobel, gradient, trace) on the system bus; variant 1: chip area 18, 21 time slots; variant 2: chip area 17, 37 time slots.]
Fig. 3. Resulting architectures for different chip area constraints. The puresoftware solution takes 65 time slots.
Fig. 4. ASP solver runtime measurements according to the number of tasks
exponential growth of the solving time relative to the number of tasks in the problem, which is no worse than expected.
VII. CONCLUSION AND FUTURE WORK
We have shown that answer set programming is a viable approach to solve complex problems like architecture generation for data-stream-based hardware/software co-design systems. The advantage over evolutionary algorithms or heuristic methods is the guarantee of finding an optimal solution.
Our development platform is an intelligent camera system based on a Virtex-4 FX FPGA with an embedded PowerPC hardcore processor. Because image data is normally huge and stored in the external DDR memory, the first IM which was developed connects the PLB (system bus) to an SDI component and vice versa. This module operates similar to a DMA controller, but instead of just copying, the data is streamed out of and into the module between the load and store operations in order to pass through PUs.
Our next step is to improve the ASP model so that it can be solved faster and is more accurate, especially concerning the resolution of timings, and to examine more complex task graphs.
An extension of the POTASSCO tools, which is currently under heavy development and capable of handling real numbers, will be used for this purpose.
Our ASP model is already capable of generating architectures which include a partial reconfiguration of modules. The scheduling is also valid, except for the case that a module is used, reconfigured and directly used again. Here the delay of the reconfiguration must be considered in the scheduling because it stalls the processing. It is possible to include this case in our model with little modification. In the future we are going to combine the work of this paper with our work in the domain of partial reconfiguration.
REFERENCES
[1] Intel Inc., “Open Computer Vision Library,” http://www.intel.com/research/mrl/research/opencv/, 2007.
[2] ETH Zurich, “PISA - A Platform and Programming Language Inde-pendent Interface for Search Algorithms,” http://www.tik.ee.ethz.ch/pisa/,2010.
[3] H. Ishebabi and C. Bobda, “Automated architecture synthesis for parallelprograms on fpga multiprocessor systems,” Microprocess. Microsyst.,vol. 33, no. 1, pp. 63–71, 2009.
[4] C. Anger, K. Konczak, T. Linke, and T. Schaub, "A glimpse of answer set programming," Künstliche Intelligenz, no. 1/05, pp. 12–17, 2005.
[5] M. Gebser, R. Kaminski, B. Kaufmann, M. Ostrowski, S. Thiele, and T. Schaub, A User's Guide to gringo, clasp, clingo, and iclingo, Nov. 2008.
[6] University of Potsdam, “Potassco - Tools for Answer Set Programming,”http://potassco.sourceforge.net/, 2010.
[7] M. Denecker, J. Vennekens, S. Bond, M. Gebser, and M. Truszczynski,“The second answer set programming competition,” in Proceedings ofthe Tenth International Conference on Logic Programming and Nonmono-tonic Reasoning (LPNMR’09), ser. Lecture Notes in Artificial Intelligence,E. Erdem, F. Lin, and T. Schaub, Eds., vol. 5753. Springer-Verlag, 2009,pp. 637–654.
Optimization Issues in Mapping AUTOSAR Components To Distributed Multithreaded Implementations
Ming Zhang, Zonghua Gu College of Computer Science, Zhejiang University
Hangzhou, China 310027 {editing, zgu}@zju.edu.cn
Abstract—AUTOSAR is a component-based modeling language and development framework for automotive embedded systems. Component-to-ECU mapping is conventionally done manually and empirically. As the number of components and ECUs in vehicle systems grows rapidly, it becomes infeasible to find optimal solutions by hand. We address some design issues involved in mapping an AUTOSAR model to a distributed hardware platform with multiple ECUs connected by a bus, each ECU running a real-time operating system. We present algorithms for extracting connectivity between ports of atomic software components from an AUTOSAR model and for calculating blocking times of all tasks of a taskset scheduled by PCP. We then address optimization issues in mapping AUTOSAR components (SWCs) to distributed multithreaded implementations. We formulate and solve two optimization problems: map SWCs to ECUs with the objective of minimizing the bus load; and, for a given SWC-to-ECU mapping, map runnable entities on each ECU to OS tasks and assign a data consistency mechanism to each shared data item to minimize the memory size requirement on each ECU while guaranteeing schedulability of tasksets on all ECUs.
Keywords-software component; ECU; schedulability; data consistency
I. INTRODUCTION

Today's automotive electrical and electronic systems are
becoming more and more complex. In order to ease the development of automotive electronic systems, leading automobile companies and first-tier suppliers formed a partnership in 2003, and established AUTOSAR (AUTomotive Open System Architecture), a standard for automotive software development. According to AUTOSAR, application software components (SWCs) are platform-independent and need to be mapped to ECUs [1]. This mapping is an important step of system configuration. For a system consisting of multiple ECUs, with a number of application SWCs to be mapped to them, there could be many different mapping schemes, including valid and invalid schemes with respect to constraints (e.g., timing constraints of tasks). As the number of ECUs and application SWCs increases, it is inefficient and error-prone to perform the mapping manually in a trial-and-error manner.
In this work, we propose an approach that works closely with the AUTOSAR model to automate the mapping process, which guarantees schedulability of tasks and consistency of data shared among application tasks, while minimizing the data rate over the bus as well as the memory overhead to protect data consistency.
W. Peng et al. [2] addressed the deployment optimization problem for AUTOSAR system configuration. In their work, an algorithm was presented to find an SWC-to-ECU mapping scheme that guarantees task schedulability while minimizing inter-ECU communication bandwidth. However, their work did not consider shared data among tasks and their protection mechanisms. Ferrari et al. [3] discussed several strategies for protecting shared data items and raised the issue of optimization for time and memory tradeoffs, but did not propose any concrete algorithms. In this paper, we attempt to formulate and solve the optimization problems involved in mapping an AUTOSAR model to distributed multithreaded implementations.
This paper is organized as follows: Section II introduces basic concepts of AUTOSAR. Section III describes our approach in detail, while Section IV presents two algorithms for extracting connectivity of ports and calculating blocking times of tasks. In Section V, two simple experiments on an application example demonstrate the correctness and effectiveness of our approach. Finally, this work is concluded in Section VI.
II. BASIC CONCEPTS OF AUTOSAR

According to AUTOSAR, application software is
conceptually located above the AUTOSAR RTE and consists of platform-independent software components (SWCs). An SWC may have multiple ports. A port is either a P-Port or an R-Port. A P-Port provides output data while an R-Port requires input data. Each port is associated with a port interface. Two types of port interfaces, client-server and sender-receiver, are supported by AUTOSAR. A client-server interface defines a set of operations that can be invoked by a client and implemented by a server. A sender-receiver interface defines a set of data elements sent/received over the VFB. Runnable entities (runnables for short) are the smallest executable elements. One component consists of at least one runnable. All runnables are activated by RTEEvents. If no RTEEvent is specified as StartOnEvent for a runnable, then the runnable is never activated by the RTE. Two categories of runnables are defined. Category 1 runnables do not have WaitPoints and have to terminate in finite time. Category 2 runnables always have at least one WaitPoint, e.g., they invoke a server and wait for the response. At runtime, runnables within software components are grouped into tasks scheduled by the OS scheduler. The RTE generator is responsible for constructing the OS task bodies. Before the RTE generator takes the ECU configuration description as the input information to generate the code, the
978-1-4577-0660-8/11/$26.00 2011 IEEE
RTE configurator configures parts of the ECU-Configuration, e.g. mapping of runnables to tasks.
III. PROBLEM FORMULATION
A. Outline

Our approach consists of two phases. The first phase tries
to find one or more optimal SWC-to-ECU mapping schemes, with respect to the data rate over the inter-ECU bus, while trying to guarantee schedulability of the task set on every ECU. This phase takes as input a set of inter-connected atomic application SWCs [1], and a set of ECUs, and outputs a mapping scheme from the atomic application SWCs to the ECUs.
In the second phase, a per-ECU optimization is performed. For each ECU, our approach tries to select a method for each data item to guarantee its data consistency as well as schedulability of the task set on the ECU, while minimizing the memory overhead used to protect data consistency on the ECU. This phase takes as input a mapping scheme produced by the first phase, and outputs the selected method to guarantee data consistency, for each data item.
Both phases involve runnable-to-task mapping. We assume that the worst-case execution time of every runnable and the worst-case execution time of its longest critical section are given.
Before the two phases of optimization, the top-level component of a system [1] is decomposed into inter-connected atomic SWCs. During this process, the connectivity between the ports of the atomic SWCs is maintained, as described in detail in Section IV.A.
Given a set of atomic application SWCs C and a set of ECUs E, a mapping scheme is a function M : C → E, where

M(c) = e, c ∈ C, e ∈ E. (1)

M(c) gives the ECU to which SWC c is mapped. For the mapping to meet timing constraints, every task τ_i in the taskset on each ECU must finish before its deadline D_i, that is:

WCRT_i ≤ D_i. (2)
In this work, we assume fixed-priority scheduling with the Priority Ceiling Protocol (PCP) [5], under which the worst-case response time WCRT_i of task τ_i is given by the standard recurrence

WCRT_i = C_i + B_i + Σ_{τ_j ∈ hp(τ_i)} ⌈WCRT_i / T_j⌉ · C_j, (3)

B_i = max { cs_{j,s} | τ_j ∈ lp(τ_i), ceil(s) ≥ prio(τ_i) }, (4)

where C_i is the worst-case execution time of τ_i, T_j the period of a higher-priority task τ_j, hp(τ_i) and lp(τ_i) the sets of tasks with higher and lower priority than τ_i, cs_{j,s} the longest critical section in which τ_j holds semaphore s, and ceil(s) the priority ceiling of s.
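The response-time recurrence (3) can be solved by fixed-point iteration. A minimal sketch, assuming implicit deadlines (D = T) and tasks already sorted by rate-monotonic priority; the data layout is ours, not the paper's:

```python
import math

# Sketch: worst-case response time by fixed-point iteration of (3).
# tasks: list of (period, wcet, blocking), sorted from highest priority
# (index 0, shortest period) to lowest.

def wcrt(tasks, i):
    T_i, C_i, B_i = tasks[i]
    r = C_i + B_i
    while True:
        # own execution + blocking + interference from higher-priority tasks
        r_next = C_i + B_i + sum(
            math.ceil(r / T_j) * C_j for T_j, C_j, _ in tasks[:i])
        if r_next == r:
            return r          # fixed point reached
        if r_next > T_i:
            return None       # deadline (= period) missed
        r = r_next
```

With the taskset of TABLE III in Section V (periods 30, 80, 620 ms; WCETs 5, 25, 100 ms; no blocking) this reproduces the listed WCRT values 5, 30 and 210 ms.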
During the first phase, we do not consider data sharing among tasks; hence, for this phase,

B_i = 0. (5)
In addition, the total data rate DR over the bus needs to be minimized. We define DR as:

DR = Σ_{c ∈ C} Σ_{ρ ∈ R(c)} DR_ρ. (6)

DR_ρ = Σ_{d ∈ D_ρ} size(d) / T_d. (7)

Where d is a data item transmitted between ECUs; R(c) is the set of runnables of SWC c; D_ρ is the set of data items transmitted by runnable ρ over the bus; size(d) is the size of data item d; and T_d is the period of transmission of d.
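Numerically, this amounts to summing size over period for every data item that crosses an ECU boundary. A sketch (helper name ours) that reproduces the 36 B / 80 ms = 450 B/s figure used in the application example of Section V:

```python
# Sketch: total bus data rate per (6)-(7), summing size/period over
# all data items transmitted between ECUs.

def bus_data_rate(items):
    """items: list of (size_bytes, period_seconds) for each data item
    that crosses the ECU boundary."""
    return sum(size / period for size, period in items)

rate = bus_data_rate([(36, 0.080)])   # one 36-byte item every 80 ms
```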
The first phase is formulated as the following optimization problem:
Find M(c) for all c ∈ C,

Minimize DR,

Subject to (2) with B_i = 0 for every task τ_i.
After the application SWCs are mapped to the ECUs, consistency of data shared among tasks on each ECU needs to be guaranteed. We consider two methods mentioned by [3]: semaphore lock and rate transition blocks. Semaphore lock incurs negligible memory overhead while introducing significant delays; Rate transition blocks incur negligible time overhead but require additional memory space to store multiple copies of data.
For each data item d shared among tasks on a given ECU e, one of the two methods is selected. Hence, a function P is to be found, where

P(d) ∈ {SL, RTB}. (8)

P(d) gives the method to protect the consistency of data item d (SL: semaphore lock or RTB: rate transition blocks), where d is a data item shared by tasks on the given ECU. As in the first phase, schedulability of the tasks on e must be guaranteed. In this phase, data sharing is taken into account: as required by PCP [5], B_i is the longest critical section of lower-priority tasks that can block τ_i, which is given by (4). In addition, the memory overhead on the ECU introduced by rate transition blocks is to be minimized:
MEM = Σ_d (n(d) − 1) · size(d). (9)

Where n(d) is the number of copies of data item d. Since the original copy exists even if no mechanism is applied to guarantee data consistency, the original copy is not counted as overhead.
To determine n(d), we need to consider three cases:

• If d has only one writer and no reader, i.e., d is written by one task and read by no task, no extra copy is needed.

• If d has only readers and no writer, i.e., d is read by some tasks but written by no task, the original copy suffices.

• If d has more than one writer, or both a writer and a reader, i.e., d is written by more than one task, or written by some tasks and read by other tasks, then a copy is needed for each of the writers and another copy is required for all the readers, including the original copy. In other words, the number of extra copies is equal to the number of writers.
Combining all of the three cases above, we define n(d) as:

n(d) = n_w(d) + sgn(n_r(d)). (10)

Where n_w(d) and n_r(d) are the number of writer tasks and reader tasks of d, respectively, and sgn(n_r(d)) evaluates to 1 when d has a reader and 0 when d has no reader.
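The case analysis above can be sketched directly; `extra_copies` (a hypothetical helper name) returns the overhead count n(d) − 1 used in (9):

```python
# Sketch: extra copies a rate transition block needs for a shared
# data item, following the three cases above: one copy per writer,
# plus one shared readers' copy, minus the original copy.

def extra_copies(n_writers, n_readers):
    if n_writers == 0:
        return 0          # read-only (or unused): the original copy suffices
    has_reader = 1 if n_readers > 0 else 0
    return n_writers + has_reader - 1
```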
The second phase of our approach is formulated as the following optimization problem:
For a given ECU e,

Find P(d) for all shared data items d on e,

Minimize MEM,

Subject to (2) with B_i given by (4) for every task τ_i on e.
The algorithm for calculating B_i as given by (4) is described in detail in Section IV.B.
In the outline described above, there are a few pending problems:
• Mapping runnables to tasks;
• Identification of data items transmitted between ECUs;
• Identification of data items shared by more than one task on the same ECU.
We will address these problems next.
B. Mapping Runnables to Tasks

In this work, we take into account only periodic runnables.
We map runnables to tasks in a simple way that is common practice in the industry: all runnables on the same ECU with the same period are mapped to the same task τ. Further, we define the period of a task as the period of the runnables mapped to it. Hence,

T_τ = T_ρ for every runnable ρ mapped to τ. (11)

The worst-case execution time of a task is defined as the sum of the worst-case execution times of the runnables mapped to it. Hence,

C_τ = Σ_{ρ mapped to τ} C_ρ. (12)
Therefore, the period of a task is unique among all the tasks on the same ECU. By rate-monotonic priority assignment [4], the priority of each task is also unique.
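A sketch of this grouping (the helper name is ours; the runnable names and values below are taken from the application example tables later in the paper):

```python
from collections import defaultdict

# Sketch: group periodic runnables on one ECU into tasks by period
# (rule (11)) and sum their WCETs (rule (12)). Sorting by ascending
# period then yields the unique rate-monotonic priority order.

def build_tasks(runnables):
    """runnables: list of (name, period_ms, wcet_ms) on one ECU."""
    by_period = defaultdict(int)
    for _, period, wcet in runnables:
        by_period[period] += wcet        # rule (12): sum WCETs per period
    # ascending period = descending rate-monotonic priority
    return sorted(by_period.items())

tasks = build_tasks([("RE00", 30, 5), ("RE10", 80, 25), ("RE40", 620, 100)])
# each entry: (task period, summed WCET)
```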
C. Identification of Data Items Transmitted between ECUs

According to AUTOSAR [1], an atomic application SWC must be mapped to an ECU as a whole. From the viewpoint of the sender SWC, this implies that it can send data to a remote ECU only via its ports. The structure of a data item is defined by a data element in a port interface. Therefore, we identify a data item transmitted over the bus by a port-data element pair (p, de). From the data element, the size of the data item, size(p, de), can be obtained.

From the viewpoint of a runnable, a data item it transmits is referenced by its data send point or data write access. Hence we can obtain the period of transmission of a data item (p, de) from the runnable that has a data send point or data write access referencing it:

T_(p, de) = T_ρ(p, de). (13)

Where ρ(p, de) denotes the runnable that transmits the data item identified by (p, de).
In order to identify data items transmitted over the bus instead of via memory on the local ECU, it is necessary to find out whether there is a receiver SWC on a remote ECU. We tackle this problem in two steps. First, we define a function conn, where

conn(p) = { r | r is an R-Port connected with P-Port p }. (14)

conn(p) gives the set of all R-Ports that are connected with the given P-Port p. Then, for each R-Port r in conn(p), the ECU to which the owner SWC of r is mapped is compared with that of p. If there exists an r in conn(p) whose owner SWC is mapped to a different ECU from that of the owner SWC of p, or formally,

M(swc(r)) ≠ M(swc(p)), (15)

Where

• swc(p) is the owner SWC of p;

• M(swc(p)) is the ECU to which swc(p) is mapped;

then every data item transmitted via p must be transmitted over the bus.
With (12), (13), (14), and (15), we can rewrite (7) as:

DR_ρ = Σ_{(p, de) ∈ D_ρ} x(p) · size(p, de) / T_(p, de). (16)

Where

x(p) = 1 if some r ∈ conn(p) satisfies (15), and x(p) = 0 otherwise. (17)
D. Identification of Data Items Shared by Tasks on the Same ECU

In an AUTOSAR model, data shared by runnables of application SWCs can be classified into two categories: data shared by runnables on the same ECU, and data shared by runnables on different ECUs. For the first category, race conditions may occur since the data is shared via memory. For the second category, data consistency is not an issue, since the communication is via message passing.
From the viewpoint of application SWCs, data shared by runnables on the same ECU come in two forms:
• Data shared by runnables of the same atomic SWC, or inter-runnable variables;
• Data shared by runnables of different atomic SWCs, transmitted and received by different SWCs via their ports.
According to the AUTOSAR specification [1], an inter-runnable variable irv is referenced by the runnables that read or write it. By counting the tasks that reference irv as a writer or a reader, we obtain n_w(irv) and n_r(irv), respectively. For shared data items in the form of an inter-runnable variable, (9) can be rewritten as:

MEM_irv = Σ_{c : M(c) = e} Σ_{irv ∈ c} (n(irv) − 1) · size(irv). (18)
For the latter, it is easy to prove the following lemma.

Lemma 1. If data is transmitted by an atomic SWC via a P-Port p and received by another atomic SWC via an R-Port p′, the data passes exactly one assembly connector.

The concept of an assembly connector is defined in [1]. Note that it is possible for different data items transmitted via a P-Port to pass different assembly connectors, since AUTOSAR allows a port to be connected to more than one port. The same holds for a data item received via an R-Port.
We identify a data item shared by runnables of different SWCs on the same ECU with a pair (a, de), where

• a is the assembly connector the data item passes;

• de is the data element in the port interface associated with the port via which the data item is transmitted, i.e., the P-Port connected by the assembly connector a.

To make sure that the data item is shared by runnables on the same ECU, the P-Port via which the data item is transmitted and the R-Port via which it is received must belong to SWCs mapped to the same ECU, or formally:

y(a) = 1 if M(swc(p(a))) = M(swc(r(a))) for the P-Port p(a) and an R-Port r(a) connected by a, and y(a) = 0 otherwise. (19)
By counting the tasks (possibly with runnables of different SWCs mapped to them) on the given ECU that reference the data item identified by (a, de) as a writer or a reader, we obtain n_w(a, de) and n_r(a, de), respectively. To determine whether a data item (a, de) is read or written by a runnable ρ, it needs to be determined whether the data item transmitted by ρ passes a. We define a function ac, where

ac(p) = { a | a is an assembly connector connected to P-Port p }. (20)

ac(p) gives the set of assembly connectors that data transmitted via p could pass.
For data items shared by runnables of different SWCs on the same ECU, (9) can be rewritten as:

MEM_conn = Σ_{(a, de)} y(a) · (n(a, de) − 1) · size(a, de). (21)
Note that (19) and (20) are used by the counting process that determines n_w(a, de) and n_r(a, de), which in turn are used to find n(a, de).
Considering both (18) and (21), we can rewrite (9) as:

MEM = MEM_irv + MEM_conn. (22)
E. Genetic Algorithm

We used the NSGA-II [6] variant of the Genetic Algorithm to solve the optimization problems described in the first sub-section of this section.

For the optimization of the SWC-to-ECU mapping, we encode each individual as a vector v, with the i-th element v_i representing the ECU to which SWC c_i is mapped. To recombine two individuals v and w, the cross-over operator randomly picks a subset S ⊆ {1, . . . , |C|} of the positions each time, and exchanges the values v_i and w_i for every i ∈ S. The mutation operator also picks a subset of the positions, and maps each corresponding SWC to a random ECU by assigning a random ECU to the corresponding element in the target individual.
To optimize the selection of the method to guarantee data consistency of shared data for a given ECU, the individual encoding, cross-over operator and mutation operator are similar to those for the optimization of the SWC-to-ECU mapping scheme. Each individual is encoded as a vector u, with the j-th element u_j representing the method (SL: semaphore lock or RTB: rate transition block) selected to guarantee the consistency of data item d_j. To recombine two individuals, the cross-over operator randomly picks a subset of the set of all shared data items each time and exchanges the corresponding values. The mutation operator also picks a subset of the data items, and assigns a random method from the available data consistency methods to each of them in the target individual.
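The two operators can be sketched as follows; this is a minimal illustration of subset cross-over and random-reset mutation on the vector encoding, not the NSGA-II implementation the authors used:

```python
import random

# Sketch: variation operators for the SWC-to-ECU mapping encoding.
# An individual is a vector whose i-th gene is the ECU index of SWC i.

def crossover(a, b, rng):
    # exchange the genes at a random subset of positions
    child_a, child_b = list(a), list(b)
    for i in rng.sample(range(len(a)), rng.randint(1, len(a))):
        child_a[i], child_b[i] = b[i], a[i]
    return child_a, child_b

def mutate(ind, num_ecus, rng):
    # re-map a random subset of SWCs to random ECUs
    out = list(ind)
    for i in rng.sample(range(len(ind)), rng.randint(1, len(ind))):
        out[i] = rng.randrange(num_ecus)
    return out

rng = random.Random(0)
a, b = [0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0]
ca, cb = crossover(a, b, rng)
```

The same operators apply unchanged to the data-consistency encoding, with ECU indices replaced by method identifiers (SL/RTB).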
IV. ALGORITHMS
A. Deriving Port Clusters and Port Connectivity from AUTOSAR Models

In the last section, we defined the function conn, which finds the set of all R-Ports that are connected with a given P-Port, and the function ac, which finds the set of assembly connectors that data transmitted via a given P-Port could pass. We define a port cluster to be a pair (PSet, RSet), where
26
• PSet is a set of P-Ports of atomic SWCs;

• RSet is a set of R-Ports of atomic SWCs;

• data transmitted via any P-Port in PSet can be received by every R-Port in RSet.

By Lemma 1, it is obvious that a port cluster corresponds to exactly one assembly connector, which data transmitted by a P-Port in the PSet passes before being received via one or more R-Ports in the RSet.
In this sub-section, we propose an algorithm that finds, for every port whose owner is an atomic SWC, the port clusters the port belongs to.
Our algorithm starts with the top-level composition [1] of a system and performs a breadth-first traversal through the hierarchical structure of the components. In this process, for every composition,
• a port cluster is created for each assembly connector, with the P-Port the connector connects added to the PSet of the cluster and the R-Port to the RSet;
• every outer port of the composition, which has been added to one or more port clusters, is conceptually replaced with inner ports that are connected to it by delegation connectors [1].
This process continues until all compositions are processed, at which point every port in every port cluster belongs to an atomic SWC. The pseudo-code of the algorithm is as follows:
Algorithm 4.1 (find port clusters)
Input: top-level component CP0

add CP0 to Queue Q
while Q is not empty
    remove CP1 from Q
    if CP1 is a composition
        clear DelegationMap
        for each connector Cn0 in CP1
            if Cn0 is an assembly connector
                create a port cluster PC(Cn0) = PC(PSet, RSet)
                add Cn0.PPort to PSet; add Cn0.RPort to RSet
                add PC(Cn0) to PortToClusterMap
            else /* Cn0 is a delegation connector */
                add (Cn0.outerPort, Cn0.innerPort) to DelegationMap
            end if
        end for each
        update PortToClusterMap with DelegationMap
        add all ComponentPrototypes in CP1 to Q
    end if
end while
for each (p, PCSet(p)) in PortToClusterMap
    /* p is a port; PCSet(p) is the set of all port clusters containing p */
    for each PC(PSet, RSet) in PCSet(p)
        if p is a P-Port
            add p to PSet
        else /* p is an R-Port */
            add p to RSet
        end if
    end for each
end for each
return PortToClusterMap
In the pseudo-code above, we use DelegationMap to track inner ports that are directly connected with each outer port. The pseudo-code of the “update PortToClusterMap with DelegationMap” step is as follows:
for each (outerPort, innerPortSet) in DelegationMap
    /* innerPortSet is the set of all inner ports connected directly with outerPort */
    find PCSet(outerPort) in PortToClusterMap
    if PCSet(outerPort) exists in PortToClusterMap
        /* outerPort is connected from the outside of its owner ComponentPrototype */
        for each innerPort in innerPortSet
            find PCSet(innerPort) in PortToClusterMap
            /* PCSet(innerPort) is the set of all port clusters that contain innerPort */
            add (innerPort, PCSet(innerPort) ∪ PCSet(outerPort)) to PortToClusterMap
        end for each
        remove (outerPort, PCSet(outerPort)) from PortToClusterMap
    end if
end for each
Note that an inner port may belong to multiple port clusters. When updating a (innerPort, PCSet(innerPort)), care must be taken not to lose the port clusters associated with innerPort previously.
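A condensed sketch of the traversal, using flat data structures instead of the AUTOSAR meta-model (all names hypothetical); each assembly connector yields one cluster (Lemma 1), and outer ports are resolved through delegation connectors down to atomic-SWC ports:

```python
from collections import deque

# Sketch: resolve assembly connectors through delegation connectors so
# that every port cluster contains only atomic-SWC ports.

def port_clusters(assemblies, delegations, atomic_ports):
    """assemblies: list of (p_port, r_port) pairs
    delegations: {outer_port: [inner_port, ...]}
    atomic_ports: set of ports that belong to atomic SWCs"""
    def resolve(port):
        # replace an outer port by the atomic ports it delegates to
        result, queue = [], deque([port])
        while queue:
            p = queue.popleft()
            if p in atomic_ports:
                result.append(p)
            else:
                queue.extend(delegations.get(p, []))
        return result
    # one cluster (PSet, RSet) per assembly connector
    return [(resolve(p), resolve(r)) for p, r in assemblies]
```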
B. Calculating Blocking Times of Tasks

In this sub-section, we propose an algorithm to calculate B_i in (4) for all tasks, which have been sorted by priority (from the highest to the lowest). In the context of this paper, we do not distinguish between a semaphore and a shared data item.
Our algorithm performs two scans through the sorted list of tasks. First, a scan is performed from the highest priority to the lowest, determining the ceiling ceil(s) of each semaphore s protecting a shared data item. Then, a second scan is performed from the lowest to the highest priority. During this scan, our algorithm maintains a map from each semaphore s to the longest critical section that uses s. For every task τ_i, this scan performs the following steps one by one:
• remove all semaphores that cannot contribute to the blocking time of τ_i, hence all semaphores s with ceil(s) < prio(τ_i);

• find the longest critical section currently in the map, and save its length as B_i;

• add all semaphores τ_i uses that are encountered for the first time during this scan, hence all s with lp(s) = prio(τ_i), to the map, along with the critical section cs_{i,s} of τ_i; we use lp(s) to denote the priority of the lowest-priority task that uses s;

• for each remaining semaphore s used by τ_i, update the critical section currently associated with s in the map, if it is shorter than cs_{i,s}.
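The two scans can be sketched as follows; the data layout (priority encoded as list index, 0 = highest) is ours, not the paper's:

```python
# Sketch: per-task PCP blocking times B_i via two scans.
# tasks: list ordered from highest (index 0) to lowest priority; each
# entry maps semaphore -> length of the longest critical section using it.

def blocking_times(tasks):
    n = len(tasks)
    # scan 1 (high -> low): ceiling = highest priority that uses the semaphore
    ceiling = {}
    for prio, cs in enumerate(tasks):
        for sem in cs:
            ceiling.setdefault(sem, prio)   # first (highest) user wins
    # scan 2 (low -> high): track the longest critical section per semaphore
    # among strictly lower-priority tasks; a task can only be blocked by
    # semaphores whose ceiling is at least its own priority
    B = [0] * n
    longest = {}
    for prio in range(n - 1, -1, -1):
        eligible = [v for s, v in longest.items() if ceiling[s] <= prio]
        B[prio] = max(eligible, default=0)
        for sem, length in tasks[prio].items():
            longest[sem] = max(longest.get(sem, 0), length)
    return B
```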
V. APPLICATION EXAMPLE

In this section, we describe two experiments on an application example, which consists of 6 atomic SWCs to be mapped to 2 ECUs. The hierarchy of SWCs, along with the connectors, is shown in Figure 1. For simplicity, each atomic SWC contains one and only one runnable, each port interface contains only one data element, and the size of the data element is 36 bytes. There is no inter-runnable variable. The worst-case execution times of the runnables are shown in TABLE I. The maximum lengths of critical sections of runnables, along with the accessed shared data items, are shown in TABLE II, where each shared data item accessed by a runnable is represented by the port and the data element.
Figure 1. Hierarchy of SWCs and Connectors of the Application Example
TABLE I. WORST-CASE EXECUTION TIMES OF RUNNABLES IN MS (RE DENOTES RUNNABLE ENTITY)
SC0/CP2,RE40 = 100
SC0/CP1/CP12,RE30 = 30
SC0/CP0/CP01,RE10 = 25
SC0/CP1/CP11,RE20 = 10
SC0/CP0/CP00,RE00 = 5
SC0/CP1/CP10,RE10 = 25
TABLE II. CRITICAL SECTIONS OF RUNNABLES IN MS (DE DENOTES DATA ELEMENT SHARED BETWEEN DIFFERENT RUNNABLES)
SC0/CP2,RE40,R0,DE00 = 10
SC0/CP1/CP12,RE30,R1,DE00 = 2
SC0/CP1/CP12,RE30,R0,DE00 = 8
SC0/CP1/CP12,RE30,P0,DE00 = 3
SC0/CP0/CP01,RE10,R0,DE00 = 5
SC0/CP0/CP01,RE10,P0,DE00 = 6
SC0/CP1/CP11,RE20,R0,DE00 = 1
SC0/CP1/CP11,RE20,P0,DE00 = 2
SC0/CP0/CP00,RE00,P1,DE00 = 1
SC0/CP0/CP00,RE00,P0,DE00 = 1
SC0/CP1/CP10,RE10,R0,DE00 = 5
SC0/CP1/CP10,RE10,P0,DE00 = 6
A. Experiment 1

In this experiment, we map the atomic SWCs to ECUs. Three mapping schemes are found, all with the same data rate over the bus of 450 B/s. The first mapping scheme maps CP00, CP01 and CP2 to EcuInstance0, and CP10, CP11 and CP12 to EcuInstance1. From Figure 1, we can see that all data on the bus come from P0 of CP01. This port is written by the runnable RE10 of CP01 with a period of 80 ms. Hence the data rate over the bus is 36 B / 80 ms = 450 B/s. (The numerical values are for illustration purposes only.) The second mapping scheme maps CP00 and CP01 to EcuInstance0, and CP2, CP10, CP11 and CP12 to EcuInstance1. The data rate over the bus is the same as for the first scheme. The third mapping scheme is actually the same as the first, except that the ECUs are interchanged, thus resulting in the same data rate over the bus, since we assume a homogeneous hardware platform consisting of identical ECUs. The tasksets on EcuInstance0 and EcuInstance1 using the first scheme are shown in TABLE III and TABLE IV, respectively. The tasksets using the second scheme are shown in TABLE V and TABLE VI, respectively. We can see that all tasks meet their deadlines.
TABLE III. TASKSET ON ECUINSTANCE0 UNDER 1ST MAPPING SCHEME
Task ID T (ms) = D C (ms) B (ms) WCRT (ms)
Task 0 30 5 0 5
Task 1 80 25 0 30
Task 2 620 100 0 210
TABLE IV. TASKSET ON ECUINSTANCE1 UNDER 1ST MAPPING SCHEME
Task ID T (ms) = D C (ms) B (ms) WCRT (ms)
Task 0 40 10 0 10
Task 1 80 25 0 35
Task 2 132 30 0 75
TABLE V. TASKSET ON ECUINSTANCE0 UNDER 2ND MAPPING SCHEME
Task ID T (ms) = D C (ms) B (ms) WCRT (ms)
Task 0 30 5 0 5
Task 1 80 25 0 30
TABLE VI. TASKSET ON ECUINSTANCE1 UNDER 2ND MAPPING SCHEME
Task ID T (ms) = D C (ms) B (ms) WCRT (ms)
Task 0 40 10 0 10
Task 1 80 25 0 35
Task 2 132 30 0 75
Task 3 620 100 0 600
Next, we assign a data consistency method to each shared data item based on the 1st mapping scheme. All the local shared data items are protected with semaphore locks; hence the memory overhead is minimal (0). The tasksets on EcuInstance0 and EcuInstance1 after data consistency method assignment are shown in TABLE VII. and TABLE VIII. respectively. Again, we can see that all tasks meet their deadlines.
TABLE VII. TASKSET ON ECUINSTANCE0 AFTER DATA CONSISTENCY METHOD ASSIGNMENT
Task ID T (ms) = D C (ms) B (ms) WCRT (ms)
Task 0 30 5 5 10
Task 1 80 25 0 30
Task 2 620 100 0 210
TABLE VIII. TASKSET ON ECUINSTANCE1 AFTER DATA CONSISTENCY METHOD ASSIGNMENT
Task ID T (ms) = D C (ms) B (ms) WCRT (ms)
Task 0 40 10 5 15
Task 1 80 25 8 53
Task 2 132 30 0 75
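The WCRT values in the tables above follow standard fixed-point response-time analysis for fixed-priority (rate-monotonic) scheduling with blocking, R = C + B + Σ⌈R/Tj⌉·Cj over all higher-priority tasks. A minimal sketch (the function name and structure are ours, not from the paper; termination assumes the iteration converges):

```python
from math import ceil

def wcrt(C, B, higher):
    """Iterate R = C + B + sum(ceil(R/Tj)*Cj) to a fixed point.

    C: worst-case execution time of the task under analysis
    B: its blocking time
    higher: list of (Tj, Cj) pairs for all higher-priority tasks
    """
    R = C + B
    while True:
        R_new = C + B + sum(ceil(R / Tj) * Cj for Tj, Cj in higher)
        if R_new == R:
            return R
        R = R_new

# Taskset of TABLE VIII (EcuInstance1 after data consistency assignment);
# rate-monotonic priorities: smaller period = higher priority.
tasks = [(40, 10, 5), (80, 25, 8), (132, 30, 0)]  # (T, C, B)
for i, (T, C, B) in enumerate(tasks):
    hp = [(Tj, Cj) for (Tj, Cj, _) in tasks[:i]]
    R = wcrt(C, B, hp)
    print(f"Task {i}: WCRT = {R} ms, deadline {T} ms, "
          f"{'OK' if R <= T else 'MISS'}")
```

Running the same recursion over the other tasksets reproduces the remaining WCRT columns as well.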
B. Experiment 2

In this experiment, we keep the SWC-to-ECU mapping of the 1st mapping scheme (TABLE III. and TABLE IV. ) and increase the length of the critical section in which the runnable RE30 of SC0/CP1/CP12 accesses the shared data item identified by (R0, DE00) (third line in TABLE II. ) from 8 ms to 60 ms. If all shared data items are protected with semaphore locks, the taskset is not schedulable due to excessive blocking time, as shown in TABLE IX. We then run our algorithm for data consistency method assignment on EcuInstance1. This time, our approach assigns a rate transition block to the shared data item identified by (A0, DE00), as shown in TABLE X. Now the taskset is schedulable, as shown in TABLE XI.
TABLE IX. NON-SCHEDULABLE TASKSET ON ECUINSTANCE1
Task ID T (ms) = D C (ms) B (ms) WCRT (ms)
Task 0 40 10 5 15
Task 1 80 25 60 115
Task 2 132 30 0 75
TABLE X. DATA CONSISTENCY METHOD ASSIGNMENT ON ECUINSTANCE1 AFTER MODIFICATION
SC0, A2, DE00: Lock
SC0/CP1, A1, DE00: Lock
SC0/CP1, A0, DE00: Rate Transition Block
Memory overhead: 36.0
TABLE XI. SCHEDULABLE TASKSET ON ECUINSTANCE1 FOR THE DATA CONSISTENCY METHOD IN TABLE X.
Task ID T (ms) = D C (ms) B (ms) WCRT (ms)
Task 0 40 10 5 15
Task 1 80 25 2 37
Task 2 132 30 0 75
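The effect of the longer critical section can be checked with the same response-time recursion: with semaphore locks, Task 1 on EcuInstance1 suffers B = 60 ms of blocking and misses its deadline, while the rate transition block reduces the blocking to 2 ms. A sketch (function and variable names are ours):

```python
from math import ceil

def response_time(C, B, higher):
    """Fixed-point iteration R = C + B + sum(ceil(R/Tj)*Cj); assumes convergence."""
    R = C + B
    while True:
        R_new = C + B + sum(ceil(R / Tj) * Cj for Tj, Cj in higher)
        if R_new == R:
            return R
        R = R_new

hp = [(40, 10)]    # Task 0 on EcuInstance1: T = 40 ms, C = 10 ms
T1, C1 = 80, 25    # Task 1: period/deadline 80 ms, WCET 25 ms

locks = response_time(C1, 60, hp)  # semaphore lock: B = 60 ms (TABLE IX)
rtb = response_time(C1, 2, hp)     # rate transition block: B = 2 ms (TABLE XI)
print(locks, locks <= T1)          # -> 115 False (deadline miss)
print(rtb, rtb <= T1)              # -> 37 True (schedulable)
```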
VI. CONCLUSIONS

As vehicle electronic systems become increasingly complex, the numbers of software components and ECUs have also increased, making it difficult or infeasible to find optimal SWC-to-ECU mapping schemes manually. In this work, we present an approach that automates the mapping process, guaranteeing schedulability of tasks and consistency of data shared among tasks while minimizing both the data rate over the bus and the memory overhead needed to protect data consistency. Along with our approach, we present an algorithm for extracting the connectivity between ports of atomic software components from an AUTOSAR model and an algorithm for calculating the blocking times of tasks under PCP. Finally, we use an application example to show the correctness and effectiveness of the proposed techniques.
VII. ACKNOWLEDGEMENTS

This work was supported by NSFC Project Grants #61070002 and #60736017, and by the National Important Science & Technology Specific Projects under Grants No. 2009ZX01038-001 and 2009ZX01038-002.
REFERENCES

[1] AUTOSAR GbR, AUTOSAR Specifications, Release 3.0, AUTOSAR Development Partnership, 2008.
[2] Wei Peng, Hong Li, Min Yao, Zheng Sun, "Deployment Optimization for AUTOSAR System Configuration," International Conference on Computer Engineering and Technology, vol. 4, pp. 4189-4193, 2010.
[3] Alberto Ferrari, Marco Di Natale, Giacomo Gentile, Giovanni Reggiani, Paolo Gai, "Time and memory tradeoffs in the implementation of AUTOSAR components," Design, Automation & Test in Europe Conference & Exhibition, pp. 864-869, 2009.
[4] C. L. Liu, J. W. Layland, "Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment," Journal of the ACM, vol. 20, no. 1, pp. 46-61, 1973.
[5] L. Sha, R. Rajkumar, J. Lehoczky, "Priority Inheritance Protocols: An Approach to Real-Time Synchronization," IEEE Transactions on Computers, vol. 39, no. 9, pp. 1175-1185, 1990.
[6] Kalyanmoy Deb, Samir Agrawal, Amrit Pratap, T. Meyarivan, "A Fast Elitist Non-dominated Sorting Genetic Algorithm for Multi-objective Optimisation: NSGA-II," Parallel Problem Solving from Nature VI, pp. 849-858, 2000.
FPGA Design for Monitoring CANbus Traffic in a Prosthetic Limb Sensor Network

A. Bochem∗, J. Deschenes∗, J. Williams∗, K.B. Kent∗, Y. Losier+
Faculty of Computer Science∗, Institute of Biomedical Engineering+
University of New Brunswick, Fredericton, Canada
{alexander.bochem, justin.deschenes, jeremy.williams, ken, ylosier}@unb.ca
Abstract—This paper presents a successful implementation of a Field Programmable Gate Array (FPGA) CANbus monitor for embedded use in a prosthesis device, the University of New Brunswick (UNB) hand. The monitor collects serial communications from two separate Controller Area Networks (CAN) within the prosthetic limb's embedded system. The information collected can be used by researchers to optimize performance and monitor patient use. The data monitor is designed with an understanding of the constraints inherent in both the prosthesis industry and embedded systems technologies. The design uses a number of Verilog logic cores which compartmentalize individual logic areas, allowing for more successful validation and verification through both simulations and practical experiments.
keywords: FPGA; CANbus; Data Monitor; Prosthesis; Verilog
I. INTRODUCTION
In the prosthetics field, various research institutions and commercial vendors are currently developing new microprocessor-based prosthetic limb components which use a serial communication bus. Although some groups' efforts have been of a proprietary nature, many have expressed interest in the development of an open bus communication standard. The goal is to simplify the interconnection of these components within a prosthetic limb system and to allow the interchangeability of devices from different manufacturers. This initiative is still in development and will undoubtedly face some obstacles during its development and implementation, as there are currently no embedded devices available to reliably monitor the bus activity for the newly developed protocol.
The open bus communication standard uses the CAN bus protocol as its underlying hardware communication platform. Higher levels of the protocol define the initialization, inter-module communication, and data streaming capabilities. Commercially available off-the-shelf CANbus logic analyzers, although capable of decoding the primary CAN fields, are unable to interpret the protocol messages in order to provide detailed information about the system behavior. The design of an FPGA-based prosthetic limb data monitor will allow embedded system engineers to monitor the new protocol's communication activity occurring in the system. This provides an effective development tool that will not only help develop new prosthetic limb components but also advance the open bus standard initiative. The design of the system will be flexible enough to meet future needs and follow current standards, such as simplifying the work required for end users to utilize the system. Furthermore, the monitor's data logging capabilities will allow the prosthetic fitting rehabilitation team to analyze the amputee's daily use of the system in order to assess its rehabilitation effectiveness. The evaluation of the data monitor's capabilities will be performed in conjunction with UNB researchers who are leading members of the Standardized Communication Interface for Prosthetics forum [1].
Section 2 of the paper outlines the field of biomedical engineering and presents an overview of related work. Section 3 gives an introduction to the CANbus standard. Section 4 covers the system design and its implementation. Section 5 shows how the functionality of the system has been tested and evaluated. Finally, Section 6 concludes the paper.

978-1-4577-0660-8/11/$26.00 © 2011 IEEE
II. EMBEDDED SYSTEMS
Within this section we will look at traditional embedded system and biomedical engineering projects, highlighting the characteristics of some approaches and the applied technology, with particular attention to projects that implement CANbus communication. Initially, robots were controlled by large and expensive computers requiring a physical connection to link the control unit to the robot. Today the shrinking size and cost of embedded systems and the advances in communication, specifically wireless methods, have allowed smaller, cheaper mobile robots. Robots operate and interact with the physical world and thus require solutions to hard real-time problems. These solutions must be robust and take into account imperfections in the world. Such autonomous systems usually consist of sensors and actuators: the sensors collect information coming into the system, and the actuators are the outputs that can be used to interact with the outside environment. The signal and control data is sent to a central processing unit which runs the main operating system. To reduce power consumption and complexity, these sensor networks use communication buses to exchange data.
In Zhang et al. [2] the researchers started with a typical robotic arm setup which included a number of controllers and command systems communicating with one another through a communication bus. The researchers, who consider the communication system the most important aspect of the space arm design, set out to improve it. Their system utilized the Controller Area Network (CAN) communication bus and enhanced reliability through the implementation of a redundancy strategy. The researchers cited desirable features of the CANbus, which include: error detection mechanisms, error handling by priority, adaptability and a high cost-performance ratio. In order to increase reliability, they implemented a "Hot Double Redundancy" technology. This was implemented as a communication system consisting of an ARM microprocessor, the CANbus controller circuit, data storage, system memory and a complex programmable logic device (CPLD) used to implement the redundancy strategy. The CPLD interfaced with two redundant CAN controller circuits, each connected to its own set of system devices, while taking commands from the main microprocessor. The logic to handle the "Hot" aspect of the technology, the ability to switch from one system to the other without any downtime, was implemented in the hardware description language VHDL and put onto the CPLD. This increases the redundancy, but it also significantly improves reliability by handling major system faults without downtime, and it increases the flexibility of the design by having hardware components which can be updated and changed without physical contact. The process was tested with the Quartus II software tool from Altera, which simulated normal and extreme activity for periods of sixteen to thirty-two hours, all the while alternating between the two redundant systems. The researchers reported the tests to be successful, with a 100% transmission rate and no error frames or losses due to the redundancy switch time.
Biomedical engineering is a multi-disciplinary industry in which engineering principles are applied to medicine and the life sciences. In recent years renewed interest from the American military has promoted the advancement of artificial limb technology. In 2005, the Defense Advanced Research Projects Agency (DARPA) launched two prosthetic revolution programs, Revolutionizing Prosthetics 2007 (RP2007) and Revolutionizing Prosthetics 2009 (RP2009) [3]. The goal of RP2007 was to deliver a fully functional upper extremity prosthesis to an amputee, utilizing the best possible technologies. The arm is to have the same capacity as a native arm, including fully dexterous fingers, wrist, elbow and shoulder, with the ability to lift weight, reach above one's head and around one's back. The prosthetic will include a practical power source, a natural look and a weight equal to that of the average arm. Another, more difficult goal includes control of the prosthesis through use of the patient's central nervous system. In RP2009 the prosthesis technology will be extended to include sensors for proprioception feedback, a 24-hour power source, the ability to tolerate various environmental issues and increased durability. Although the prostheses developed during the course of both projects proved to be technological achievements [4], it is still unclear whether the cost of these systems, expected to be around $100,000 [5], will prohibit their use when they become commercially available.
The University of New Brunswick's (UNB) hand project [6] seeks to design a low-cost, three-axis, six-basic-grip anthropomorphic hand, with control of the hand using subconscious grasping to determine movement. The UNB hand team built intelligent electromyography (EMG) sensors [7] that could amplify and process signal information, passing the required information to the main microprocessor through a serial communication bus. This allows for a reduction in wiring, which reduces the weight and simplifies the component architecture. The serial bus chosen, the Controller Area Network bus (CANbus), allowed a power strategy to be implemented, reducing overall power consumption. The CANbus is also noted as having a good compromise between speed and data protection, a necessity for prosthetics [8]. The hand project creators have begun creating a communication standard [9] to improve interchangeability and interconnection between limb components. If adopted, major increases in flexibility would be gained, which would be beneficial to all people involved in the prosthesis industry.
The paper by Banzi, Mainardi and Davalli [8] extends the idea of a distributed control system and Controller Area Network serial bus to an entire arm prosthesis. In this project, along with the parallel distribution cited in the UNB hand project, the device had the additional task of handling external communication through either a Bluetooth or RS232 serial connection. The paper cites reasons for using the CANbus similar to those of the UNB hand project, adding evidence of successful device integration in the CANbus' traditional area, automotives, and how this could parallel the prosthetic industry. The paper also outlines reasons for not choosing other communication protocols or technologies. The two initial choices, Inter-Integrated Circuit (I2C) [10] and Serial Peripheral Interface (SPI) bus [11], were rejected because they failed to have adequate data control and data protection systems and were unable to handle faults acceptably. The SPI system required additional hardware overhead to allow for device addressing. The personal computing and industry standards were unable to adapt to the space and weight constraints required by the profile of the prosthesis. The CANbus allowed for a reasonable number of sensors, flexibility in expansion and interfacing, microcontrollers with integrated bus controllers, and efficient, robust, deterministic data transmission with a reduction in required cabling and near optimal voltage levels.
III. CANBUS
The CANbus is a serial communication standard used to handle secure and realtime communication between electrical and microcontroller devices, and it primarily defines the physical and data link layers of the open systems interconnection (OSI) model. The CANbus was originally created to support the automotive industry and its increased reliance on electronics; however, because of its reliability, high speed and low-cost wiring, it has been used in many additional areas. The CANbus supports bit rates of up to 1 Mbps and has been engineered to handle the constraints of a security-conscious real-time system [12]. The physical medium is shielded copper wiring, which utilizes a non-return-to-zero line coding. The CAN message frame uses either an 11-bit identifier base frame (CAN 2.0A) or a 29-bit identifier extended frame (CAN 2.0B). There are a number of custom bit identifiers, most of which are used to synchronize messages, perform error handling or signal various values. The CAN standard operates on a network bus, therefore all devices have access to each message; addressing is handled by identifiers in each message frame. The CANbus standard outlines the arbitration method as CSMA/BA, where the BA stands for bit arbitration. The bit arbitration method allows any device to transmit its message onto the bus. If there is a collision, the transmitter with the greatest priority, identified by the most successive dominant bits in its identifier, wins bus arbitration. The lesser-priority devices then stand off for a predefined period of time and re-transmit their messages. This allows the highest priority messages to be handled the fastest. Other important properties defined in this standard are:
• Message prioritization - Critical devices or messages have priority on the network. This is done through the media arbitration protocol.
• Guaranteed latency - Realtime messaging latency utilizes a scheduling algorithm which has a proven worst case and can therefore be relied upon in all situations [13].
• Configuration flexibility - The standard is robust in its handling of additional nodes; nodes can be added and removed without requiring a change in the hardware or software of any device on the system.
• Concurrent multicasting - Through the addressing and message filtering protocols in place, the CANbus can perform multicasting in which it is guaranteed that all nodes accept the message at the same time, allowing every node to act upon the same message.
• Global data consistency - Messages contain a consistency flag which every node must check to determine if the message is consistent.
• Emphasis on error detection - Transmissions are checked for errors at many points throughout messaging. This includes monitoring at global and local transmission points, cyclic redundancy checks, bit stuffing and message frame checking [12].
• Automatic retransmission - Corrupted messages are retransmitted when the bus becomes idle again, according to prioritization.
• Error distinction - Through a combination of line coding, transmission detection, and hardware and software logic, the CANbus is able to differentiate between temporary disturbances, total failures and the switching off of nodes.
• Reduced power consumption - Nodes can be set to sleep mode during periods of inactivity. Activity on the bus or an internal condition will awaken the nodes.
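The bitwise arbitration behavior can be illustrated with a small software model. This is a simplified sketch of our own (it ignores bit stuffing and the rest of the frame; the function name is illustrative): each node shifts out its identifier MSB-first onto a wired-AND bus, and a node that sends a recessive bit (1) but reads back a dominant bit (0) drops out, so the lowest identifier wins.

```python
def arbitrate(ids, width=11):
    """Model CAN bitwise arbitration over a wired-AND bus.

    Each contender shifts its identifier out MSB-first; a dominant bit (0)
    overwrites a recessive bit (1) on the bus.  A node that transmits
    recessive but observes dominant loses arbitration and stops, so the
    lowest identifier (highest priority) always wins without corrupting
    the winning message.
    """
    contenders = list(ids)
    for bit in range(width - 1, -1, -1):
        sent = {i: (i >> bit) & 1 for i in contenders}
        bus = min(sent.values())  # wired-AND: any dominant 0 pulls the bus to 0
        contenders = [i for i in contenders if sent[i] == bus]
    assert len(contenders) == 1
    return contenders[0]

print(hex(arbitrate([0x47, 0x12, 0x300])))  # -> 0x12, the lowest identifier
```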
IV. SYSTEM DESIGN
The open bus standard-based data monitor system for prosthetic limbs captures and collects the serial information from two separate Controller Area Network (CAN) buses, the sensor bus and the actuator bus. It then transforms the collected information into the correct CAN message format. A timestamp is added, and the messages are passed through a user-controlled filter that dictates which messages should be logged. After the filter, the two buses' messages are merged and sorted according to their timestamps. Once sorted, the information is sent to an output device for further processing. The current design allows choosing between three different output interfaces. The first is the RS232 serial interface that sends the CAN message data, encoded in ASCII format, to the serial port using the RS232 chip on the DE2 board. The second is a direct connection of the RS232 serial interface module to the pin interface of the DE2 board; this allows transmission at higher bandwidth using an external RS232 chip with better performance. The third communication module uses the USB interface of the DE2 board to transmit the CAN message data [14]. This project will allow engineers to observe the communication on the buses and search for sources of errors. An overview of the system design is given in Figure 1.

The implementation consists of various "working" cores that can be individually tested, each containing one piece of the functionality required for the overall system. These cores interface with one another through a standard FIFO, alleviating timing issues. The FIFO modules are dual-clocked memory buffers that work on the first-in/first-out concept. With the dual-clock ability, these cores can be used to exchange data between two modules in a hardware design even if those modules run at different clock speeds. The FIFO cores belong to the Intellectual Property (IP) core library that is available within Altera's Quartus II development environment. The usage of those cores is usually free for academic use but requires a license fee for industrial or end-consumer development purposes.
The "CAN Reader" module handles the observation of the connected CANbus. It receives all messages that are transmitted without causing a collision on the bus and forwards them to a FIFO buffer, from which the "Filter" module gets its input data. One pair of the "CAN Reader" and the "Filter" is connected to the control bus, while the other pair listens to the sensor bus. The modular design would allow connecting more or fewer CANbuses to the monitoring system. In the current implementation of the "Message Writer" module, only data transmission over the RS232 interface is implemented. Data compression of the messages and the application of different interface technologies were evaluated during the project; their implementations have been postponed as future work. RS232 was implemented because it required the least overhead to establish a communication channel [15].
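The merge step described above, combining the two filtered, timestamped streams into one chronologically ordered log, behaves like a standard two-way merge. A minimal software model of that behavior (the names and record layout are illustrative, not taken from the Verilog design):

```python
import heapq

def merge_streams(sensor_msgs, actuator_msgs):
    """Merge two streams of (timestamp, bus, payload) records by timestamp.

    Models the behavior of the Message Merge Unit: each input stream is
    already ordered by capture time, so a two-way merge yields a single
    chronologically ordered log.
    """
    return list(heapq.merge(sensor_msgs, actuator_msgs))

sensor = [(10, "sensor", b"\x01"), (30, "sensor", b"\x02")]
actuator = [(5, "actuator", b"\xa0"), (25, "actuator", b"\xa1")]
log = merge_streams(sensor, actuator)
print([t for t, _, _ in log])  # -> [5, 10, 25, 30]
```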
The design of the "CAN Reader" module is based upon the "CAN Protocol Controller" project by Igor Mohor from opencores.org. This design implements the specification of the "SJA1000 Stand-alone CAN controller" from Philips Semiconductors [16]. To allow the integration of the core design, an I/O module that handles the Wishbone interface had to be implemented. The configuration of the CAN controller is register based, as defined in the specification of the SJA1000. The created I/O module is used by the "CAN Reader", which configures the "CAN controller". The "CAN controller" receives the data from the CANbus, while the "CAN Reader" collects the messages from the "CAN controller" and forwards them to the timestamp and filter procedures.

Fig. 1: System design of CANbus monitor.

For a manageable processing flow, the internal control of the modules has been designed as state machines. This design concept allowed the localization of problems caused by signal runtimes and race conditions. Figure 2 gives an idea of the state machine that describes the functionality of the "CAN Reader" module.
To give an idea of the design complexity, the state "Read Receive Buffer" uses the I/O module to communicate with the "CAN controller". This leads to a design with state machines encapsulated in state machines. Such designs should be avoided in hardware if possible, since they tend to obscure runtime race conditions. In the current design the occurrence of race conditions is prevented by event-driven mutexes.
V. VALIDATION & VERIFICATION
The validation of the system design has been done by simulation and by runtime experiments on the target platform. For verification of the functionality, the individual modules have been simulated with Altera's ModelSim tool. This allowed a step-wise execution of the Verilog code to identify logic errors in the implementation. Afterwards the modules were tested on the target platform, with testbench modules providing the required input data. The output of the tested modules was transmitted over the RS232 interface to a connected computer and verified by hand. The timing verification of the hardware design was done with an experimental test setup with two sensor nodes on one connected CANbus (Figure 3).
The two sensor nodes were configured to send messages on the bus continuously at a frequency of 250 Hz. To allow verification that all messages are received and no message is lost, each message contained a set of four consecutive numbers. The first node was configured to send numbers in the range from 0 to 999; the second node sent numbers in the range from 1000 to 1999. According to the CANbus standard, each node resends its current message until it receives an acknowledge response on the bus from the other node, which causes the node to increment the numbers for the next message. If a collision occurs on the bus, the CANbus protocol standard ensures that the transmitting node is informed by a global collision signal, sent by the first node on the CANbus that detects the collision.
The analysis of the system led to the conclusion that the hardware design works correctly. All messages that were sent by both nodes on the CANbus were received successfully and with correct data by the "CAN Reader" module in the FPGA design. This could be verified with the logged data from the output of the serial module in the hardware design. Since the test setup only had one CANbus, the input pin was used twice to evaluate the system with two CANbuses connected. This allowed verifying the proper ordering of the messages by the "Message Merge Unit". It turned out that the only flaw in the current design is the RS232 interface. The available chip on the DE2 board has a maximum bandwidth of 120 kbit/s, which becomes a bottleneck if the connected CANbus runs at its maximum bandwidth of 1 Mbit/s, and even more so with two connected CANbuses. A design for a USB module as communication interface has been created but could not be sufficiently evaluated for further usage.
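The consecutive-number scheme used in the test lends itself to a simple offline check: scan each node's logged numbers and report every break in the sequence. A sketch of our own (assuming the log has already been parsed into one number sequence per node):

```python
def find_gaps(seq):
    """Return (expected, got) pairs wherever consecutive numbering breaks."""
    gaps = []
    for prev, cur in zip(seq, seq[1:]):
        if cur != prev + 1:
            gaps.append((prev + 1, cur))
    return gaps

node1 = [0, 1, 2, 3, 4, 7, 8]     # messages carrying 5 and 6 were lost
print(find_gaps(node1))           # -> [(5, 7)]
print(find_gaps(list(range(5))))  # -> [] : no loss
```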
Fig. 2: State machine design for "CAN Reader" module.

The analysis of the test results showed some unexpected behavior. While the messages of one node were received completely, some messages of the other node were missing. After resetting the CANbus system, this effect could switch to the other node or stay the same. For either one of the two nodes some messages were missing, but never for both nodes in the same test execution. The extensive analysis of the system design leads to the conclusion that this error might be caused by the configuration of the sensor nodes. Further evaluation of the system design will be continued once a new test setup can be provided by the project partner.
VI. FUTURE WORK
It might be useful to have the ability to run the CANbus monitoring design without a direct connection to a computer. The logged data could be stored on an integrated flash memory module. For this feature it would be helpful to have a compression module to increase the system's mobility. The compression would need to be fast enough that it does not become a bottleneck for the system. Ideally the core would take in a stream of data and output a stream of compressed data. This could plug directly into the current system design.
It would be useful to have wireless functionality so that the prosthesis does not need to be tethered to a computer to retrieve the logged data. The biggest hurdle to overcome is understanding how to communicate with the wireless controller at the hardware level. Existing solutions from open source projects could provide a starting point here.
At the moment all the values for the CAN controller are hard coded and thus cannot be changed after the design has been synthesized. This could be improved in several ways, ranging from full configurability at runtime to a bit of code reorganization so that the values can easily be changed for re-synthesis. A compromise could be the ability to configure a small number of values at runtime. The most time-effective solution would be to reorganize the code so that the hard-coded values are extracted into a configuration file.
The filtering is currently fairly rudimentary and inflexible. It would be advantageous to have more flexibility in how messages are filtered. Reconfiguring the filters, and enabling or disabling filters at runtime, would be helpful. This becomes even more crucial as the design moves from lab testing to real-world testing, where re-synthesizing the design becomes less feasible. The system could potentially be a memory-mapped device, with masks stored in registers that could be written by a microcontroller. A simpler solution would be to have several kilobytes of persistent memory to which masks could be written, to be used during the filtering of the messages.

Fig. 3: Experimental test setup for final system verification.
VII. SUMMARY
This paper has shown the successful implementation of a monitoring system for a CANbus sensor network in a hardware design on an FPGA development board. Details of successful projects and related work have been introduced. The information presented in this paper should offer a good starting point, providing a general understanding of embedded projects, with an emphasis on actual biomedical engineering solutions and the basics of the CANbus standard. The modular design of the approach allows its application in other projects that need a CANbus monitoring system. All project source code will be made available through opencores.org.
Acknowledgments
This work is supported in part by CMC Microsystems, the Natural Sciences and Engineering Research Council of Canada, and Altera Corporation.
REFERENCES
[1] Standardised Communication Interface for Prosthetics. [Online]. Available: http://groups.google.ca/group/scip-forum/
[2] J. Yang, T. Zhang, J. Song, H. Sun, G. Shi, and Y. Chen, "Redundant design of a CAN bus testing and communication system for space robot arm," in Control, Automation, Robotics and Vision, 2008. ICARCV 2008. 10th International Conference on, Dec. 2008, pp. 1894-1898.
[3] DARPA, DARPA: 50 Years of Bridging the Gap. Defense Advanced Research Projects Agency, 2008.
[4] L. Ward. (2007, October) Breakthrough awards 2007 - DARPA-funded Proto 2 brings mind control to prosthetics. Popular Mechanics.
[5] S. Adee. (2008, February) Dean Kamen's "Luke Arm" prosthesis readies for clinical trials. IEEE Spectrum.
[6] A. Wilson, Y. Losier, P. Kyberd, P. Parker, and D. Lovely, "EMG sensor and controller design for a multifunction hand prosthesis system - the UNB hand," draft document, 2009.
[7] A. W. Wilson, Y. G. Losier, P. A. Parker, and D. F. Lovely, "A bus-based smart myoelectric electrode/amplifier," in Medical Measurements and Applications Proceedings (MeMeA), 2010 IEEE International Workshop on, 2010.
[8] S. Banzi, E. Mainardi, and A. Davalli, "A CAN-based distributed control system for upper limb myoelectric prosthesis," in Computational Intelligence Methods and Applications, 2005 ICSC Congress on, Istanbul, 2005.
[9] Y. Losier and A. Wilson, "Moving towards an open standard: The UNB prosthetic device communication protocol," 2009.
[10] The I2C-Bus Specification, Philips Semiconductors Std. 9398 393 40 011, Rev. 2.1, January 2000.
[11] Motorola, M68HC11 Microcontrollers Reference Manual, rev. 6.1 ed., Freescale Semiconductor Inc., May 2007.
[12] CAN Specification, Robert Bosch GmbH Std., Rev. 2.0, September 1991.
[13] J. Krakora and Z. Hanzalek, "Verifying real-time properties of CAN bus by timed automata," in FISITA World Automotive Congress, Barcelona, May 2004.
[14] DE2 Development and Education Board - User Manual, 1st ed., Altera, 101 Innovation Drive, San Jose, CA 95134, 2007.
[15] MAX232, MAX232I Dual EIA-232 Drivers/Receivers, Texas Instruments, Post Office Box 655303, Dallas, Texas 75265, March 2004.
[16] SJA1000 Stand-alone CAN Controller, Philips Semiconductors, 5600 MD Eindhoven, The Netherlands, January 2000.
Session 2: Prototyping Architectures
Rapid Single-Chip Secure Processor Prototyping on the OpenSPARC FPGA Platform

Jakub M. Szefer∗3, Wei Zhang#1, Yu-Yuan Chen∗3, David Champagne∗3, King Chan#1, Will X.Y. Li#1, Ray C.C. Cheung#2, Ruby B. Lee∗3

#Department of Electronic Engineering, City University of Hong Kong
1{wezhang6, kingychan8, xiangyuli4}@student.cityu.edu.hk, 2[email protected]

∗Electrical Engineering Department, Princeton University, USA
3{szefer, yctwo, dav, rblee}@princeton.edu
Abstract—Secure processors have become increasingly important for trustworthy computing as security breaches escalate. By providing hardware-level protection, a secure processor ensures a safe computing environment where confidential data and applications can be protected against both hardware and software attacks. In this paper, we present a single-chip secure processor model and demonstrate rapid prototyping of the secure processor on the OpenSPARC FPGA platform. OpenSPARC T1 is an industry-grade, open-source, FPGA-synthesizable general-purpose microprocessor originally developed by Sun Microsystems, now acquired by Oracle. It is a multi-core, multi-threaded 64-bit processor with open-source hardware, including the microprocessor core, as well as system software that can be freely modified by researchers. We modify the OpenSPARC T1 processor by adding security modules: an AES engine, a TRNG and a memory integrity tree. These enhancements enable security features such as memory encryption and memory integrity verification. By prototyping this single-chip secure processor on the FPGA platform, we find that the OpenSPARC T1 FPGA platform has many advantages for secure processor research. Our prototyping demonstrates that additional modules can be added quickly and easily, and that they add little resource overhead to the base OpenSPARC processor.
I. INTRODUCTION
As computing devices become ubiquitous and security breaches escalate, protection of information security has become increasingly important. Many software schemes, e.g., [1]–[3], have been proposed to enhance the security of computing systems and are effective in defending against software attacks. However, they are generally ineffective against physical or hardware attacks. Attackers who have full control of the physical device can easily bypass software-only protection, and the whole system is left unsafe and subject to hardware attacks. This results in an increasing need for hardware-enhanced security features in the microprocessor.
Considerable efforts have been made to build secure computing platforms that can address security threats. In this paper, we present our extensible secure computing model prototyped on the OpenSPARC FPGA platform. The platform consists of the OpenSPARC T1 processor and system software, including the hypervisor and the operating system. The OpenSPARC T1 processor is the open-source form of the UltraSPARC T1 processor (from Sun Microsystems, now Oracle) that gives designers the freedom to modify the processor according to their own needs [4]. The OpenSPARC T1 processor is also easily synthesizable for FPGA targets, which makes implementing the processor straightforward. A Field-Programmable Gate Array (FPGA) is an integrated circuit designed to be configured by the customer or the designer after manufacture. Because of its reconfigurability, an FPGA can be used to implement any logic function that an Application-Specific Integrated Circuit (ASIC) chip could perform, which makes it a good platform for rapid system prototyping.
Through the prototyping process, we have found that the OpenSPARC FPGA platform has many advantages for secure processor research. For example, the memory subsystem is emulated in a MicroBlaze softcore, which allows new features to be added without re-synthesizing the whole platform. Furthermore, new hardware components can be easily added as Fast Simplex Link (FSL) peripherals without worrying about strict timing or editing fragile HDL code of the processor core. However, to the best of our knowledge, there are currently only a few papers [5]–[7] about processor research on the OpenSPARC FPGA platform and none focusing on security. In this paper, we propose a single-chip secure processor architecture based on this platform. Furthermore, we find that the OpenSPARC FPGA platform is relatively friendly for secure processor prototyping. Although other open-source processors such as OpenRISC and LEON are available [8], they lack the infrastructure and components (e.g., a hypervisor and emulated caches) of the OpenSPARC platform.
The main contributions of this paper are:
∙ A reconfigurable single-chip secure processor model.
∙ A prototype of the single-chip secure processor on the OpenSPARC FPGA platform with our new additions: an AES engine, a true random number generator (TRNG), and a memory integrity tree (MIT).
∙ An evaluation showing the OpenSPARC FPGA platform's advantages for secure processor research.
The rest of the paper is organized as follows. Section II describes some existing secure processor models as background information. Section III proposes our single-chip secure processor architecture on the OpenSPARC FPGA platform. Section IV describes the single-chip secure processor features.
Section V gives an evaluation of the secure OpenSPARC platform. Finally, in Section VI we conclude the paper.
II. RELATED WORK
With the emergence of hardware attacks, hardware-enhanced security has been given considerable attention by researchers and engineers. Different secure processor architectures have been proposed to provide a secure computing environment for protecting sensitive information against both software and hardware attacks.
Single-chip secure processors consider the processor to be trusted, but anything outside the processor, e.g., memory, is untrusted. One such example is the Aegis architecture, as described in [9], [10]. In this approach, the secure processor contains two key primitives: a physical unclonable function (PUF) and off-chip memory protection. Both primitives are realized within one single-chip processor so that the internal state of the processor cannot be tampered with or observed directly by physical means.
The Secret-Protecting (SP) architecture is proposed in [11]. In SP, trusted software modules have their data and code encrypted and hashed when off-chip, and a concealed execution mode is provided where these software modules are protected from other software snooping on them; e.g., registers are encrypted on an interrupt so that a potentially compromised commodity operating system cannot read them. Also, a hierarchical key chain structure is used to store all keys in encrypted and hashed form, and only the root key needs to be stored in hardware.
Another secure processor architecture is Bastion [12], which can protect a trusted hypervisor, which in turn protects trusted software modules in the application or in the operating system. Bastion scales to provide support for multiple mutually distrustful security domains. Bastion also provides a memory integrity tree for runtime memory authentication and protection from memory replay attacks. It protects the hypervisor not just from software attacks, but also from physical and offline attacks.
Based on this previous work, we propose our secure computing model in Section III. Despite the many advantages of the OpenSPARC FPGA platform, it is not widely used as a research platform. We propose a single-chip secure processor on this platform and hope that our work can serve as a reference for researchers interested in OpenSPARC.
III. SINGLE-CHIP SECURE OPENSPARC PROCESSOR
This section presents our secure computing model and single-chip secure OpenSPARC T1 processor.
A. Secure Computing Model
Figure 1 illustrates our secure computing model. We divide the computing system into two parts. The first part includes the components on the processor chip, shown inside the dashed box, and the second part consists of all components off the processor chip, shown outside the dashed box. All on-chip modules, including the CPU core, cache, registers, encryption/decryption engine and integrity verification module, are assumed to be trusted and protected from physical attacks in our design, because the internal state of the processor chip cannot easily be tampered with or observed directly by physical means. We do not consider side-channel attacks, such as those employing differential power analysis or electromagnetic analysis, in this paper. On the other hand, all off-chip modules, including external memory and peripherals, are considered insecure because those modules can easily be tampered with by an adversary using physical attacks. In addition to hardware attacks, the system may suffer from software attacks by an untrusted operating system or malicious software.

Fig. 1: Secure computing model. The processor chip is regarded as a physically secure region and gray blocks represent new security enhancements.
To ensure a secure computing environment, the computing system must have secure functions that enable it to defend against either software or hardware attacks. The gray blocks in Figure 1 represent our initial set of new security enhancements that we add to the computing system. The encryption/decryption module encrypts all data evicted off the processor chip so that they are meaningless to an adversary. The integrity verification module verifies that all data coming from the off-chip memory have not been tampered with.
B. OpenSPARC FPGA Platform
Our single-chip secure processor design targets the FPGA-synthesizable version of the OpenSPARC T1 general-purpose microprocessor. The OpenSPARC T1 microprocessor is an industry-grade, 64-bit, multi-threaded processor and is freely available from the OpenSPARC website [4], [13]. In addition to the processor core source code (HDL), simulation tools, design verification suites, and hypervisor source code (C and assembly) are available for download [4]. The OpenSPARC FPGA platform consists of the following major components: the OpenSPARC T1 microprocessor core, the memory subsystem (L2 cache), the DRAM controller and DRAM, the hypervisor, and a choice of operating systems (Linux or OpenSolaris).
Due to the size constraint of the FPGA chip, the OpenSPARC FPGA platform that we use includes only one single-thread T1 CPU core to minimize its size. In addition, the L2 cache and the L2 cache controller are emulated in a MicroBlaze softcore, i.e., there is no physical L2 cache. A high-level block diagram of the OpenSPARC T1 processor is shown in Figure 2. The microprocessor core is connected to the L2 cache, emulated by a MicroBlaze softcore, through the CPU-Cache Crossbar (CCX). The DRAM controller is an IP (Intellectual Property) block synthesized and implemented in the FPGA fabric and connected to the MicroBlaze softcore. A physical DRAM is connected to the FPGA board to serve as the actual memory.

Fig. 2: Block diagram of the stock OpenSPARC FPGA platform.
Due to the different complexities of these components, and to the place-and-route operations that determine critical paths when implementing on an FPGA, the FPGA version of the OpenSPARC processor chip has multiple clock domains. The OpenSPARC T1 core is in one clock domain (50MHz), and the MicroBlaze softcore is in another (125MHz). Peripherals synthesized in the FPGA fabric and connected to the MicroBlaze softcore are in yet another clock domain (e.g., 10MHz or 50MHz for the peripherals we create). Finally, the DRAM chip has its own clock domain (400MHz).
The OpenSPARC FPGA platform has many advantages for security research [14]. It allows users to freely modify real hardware. In particular, the memory subsystem is emulated rather than fully implemented in HDL, so new features can be easily added (in C code run on the MicroBlaze softcore) and re-synthesizing the whole platform can be avoided. In addition, new hardware components can be added as firmware code, or synthesized in the FPGA fabric and connected to the emulated cache by buses, e.g., the FSL bus. The peripherals on the ML505 board are also very useful, e.g., the network port. Finally, secure processors can be prototyped without fabricating a real processor chip.
C. Secure Processor Architecture
Based on the secure computing model described in Figure 1, we propose our single-chip secure processor architecture. Our approach is to add security modules to the stock OpenSPARC FPGA platform. We synthesize and implement the design on an FPGA [15].
Figure 3 shows the block diagram of our single-chip secureprocessor architecture on the OpenSPARC FPGA platform.
Fig. 3: Block diagram of the secure OpenSPARC FPGA platform. Gray blocks show our new additions.
The gray blocks in the diagram represent the security modules we add to the original platform: a TRNG (true random number generator, HDL code), an AES (Advanced Encryption Standard) engine (HDL code), and a memory integrity tree (MIT, MicroBlaze firmware code). The TRNG and the AES engine are implemented in the FPGA fabric, and the MIT is executed in the MicroBlaze softcore as firmware. The TRNG and the AES engine are connected to the MicroBlaze through the FSL bus. Through the MicroBlaze, the OpenSPARC microprocessor communicates with these security modules. The MIT firmware calls the AES engine for memory integrity verification. Each module works in its own clock domain, and a Digital Clock Manager (DCM) on the FPGA board generates these clock frequencies. Table I shows the different clock domains in our secure OpenSPARC system.
In addition to the FPGA chip, the FPGA board has many on-board resources that can be utilized by the secure processor. In our experimental setup, we use the Xilinx Virtex-5 ML505 FPGA board. This board contains a 256MB DRAM module, in which an 80MB ramdisk is used to boot the Linux or Solaris operating system. In addition, the board has an Ethernet port that can connect the secure processor to the Internet if enabled. The Ethernet port can also serve as a communication port that provides a high data exchange rate between the host computer and the secure processor.
Our secure processor works in one of four modes:
∙ STD - standard mode, which has no additional security measures;
∙ CONF - confidential mode, which performs memory encryption to ensure data confidentiality;
∙ ITR - integrity tamper-resistant mode, which performs memory integrity verification to ensure data integrity;
∙ FTR - full tamper-resistant mode, which performs both memory encryption and memory integrity verification to ensure data confidentiality and integrity.
The secure processor can work in any of the four modes depending on the user's need.
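The paper does not specify how the modes are encoded; as an illustrative sketch (the names and flag layout are ours), the four modes can be viewed as two independent flags, one enabling memory encryption and one enabling integrity verification:

```c
#include <stdbool.h>

/* Hypothetical encoding of the four operating modes as two flags:
 * bit 0 = memory encryption (CONF), bit 1 = integrity verification (ITR).
 * FTR is simply both bits set; STD is neither. */
typedef enum {
    MODE_STD  = 0,          /* no additional security measures        */
    MODE_CONF = 1,          /* memory encryption only                 */
    MODE_ITR  = 2,          /* memory integrity verification only     */
    MODE_FTR  = 3           /* both encryption and verification       */
} secure_mode_t;

static bool mode_encrypts(secure_mode_t m) { return (m & MODE_CONF) != 0; }
static bool mode_verifies(secure_mode_t m) { return (m & MODE_ITR)  != 0; }
```

With this encoding, the hardware can gate the encryption engine and the MIT firmware independently from a single two-bit mode register.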
TABLE I: Clock domains in the OpenSPARC system

Module    | OpenSPARC T1 | MicroBlaze | AES engine | TRNG  | DRAM
Frequency | 50MHz        | 125MHz     | 50MHz      | 10MHz | 400MHz
IV. SINGLE-CHIP SECURE PROCESSOR FEATURES
The proposed single-chip secure processor provides the following security features: memory encryption/decryption, secret key generation, and memory integrity verification.
A. AES Engine
The Advanced Encryption Standard (AES) is a symmetric encryption algorithm approved by the National Institute of Standards and Technology (NIST). It is one of the most widely used symmetric encryption algorithms and is advanced in terms of both security and performance. While any symmetric key encryption algorithm suits our purpose, we adopt AES as the memory encryption/decryption algorithm and AES CBC MAC as the cryptographic hash primitive.
AES processes data blocks of 128 bits using cryptographic keys of 128, 192 or 256 bits. The encryption or decryption takes 10 to 14 rounds of array operations, depending on the key size. In our AES unit design, we employ the idea of parallel table lookup (PTLU) as in [16], [17]. The AES unit is based on AES-128 and takes a block of 128-bit input and a 128-bit key to produce a 128-bit output. The block diagram of the AES unit is shown in Figure 4.
The AES unit consists of one finite state machine (FSM) that controls the operation of aes_round and aes_key_expander. The load signal triggers the FSM to load the registers with the input data and key. The start signal sets an internal counter value and starts the AES encryption/decryption cycle from round 0. The mode signal switches the AES operation between encryption and decryption. The latency for one block of AES-128 encryption and decryption is 14 cycles and 25 cycles, respectively. For encryption, the AES unit takes 11 cycles to produce the output data (1 cycle per round and 1 cycle for the initial AddRoundKey operation), and 3 cycles to assert the load, start, and done signals. The decryption process, however, incurs an extra 11 cycles in order to generate the first round key for decryption, since the first round key of decryption is the last round key of encryption. After the round operations are done, a done signal is asserted to signify that the output data is ready on the data_out bus.
The AES engine works in cipher-block chaining (CBC) mode, i.e., AES-CBC. The input and output data width is 512 bits (64 bytes) in our design, to correspond to the common size of modern cache lines. The AES engine is connected to the MicroBlaze softcore through the FSL bus, which has a data bus width of 32 bits. All input data first has to be loaded into the AES engine to start the AES-CBC encryption, which needs (32+128+128+512)/32 = 25 cycles to load the mode, initial_vector, key, and data_in; outputting the data back to the MicroBlaze takes 512/32 = 16 cycles. If we include the 14 cycles for encrypting each block of 128-bit input (25 cycles for decryption), the total latency of encrypting 64 bytes of input data is 25+4×14+16 = 97 cycles. Similarly, we get 25+4×25+16 = 141 cycles for decryption. We note the high overhead of the FSL bus for data transfer (41 of the 97 cycles for encryption). Also, performance can be improved by storing the decryption round keys so they do not need to be regenerated, at the cost of using more hardware resources.

Fig. 4: Block diagram of the AES-128 unit.
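The cycle arithmetic above can be captured in a small model; the constants come from the paper, while the function names are ours:

```c
/* Cycle-count model of the AES-CBC engine behind the 32-bit FSL bus.
 * A 64-byte cache line is processed as four 128-bit AES blocks. */

#define FSL_WIDTH_BITS 32
#define AES_BLOCK_BITS 128
#define LINE_BITS      512   /* 64-byte cache line */

/* Loading mode (32b) + initial_vector (128b) + key (128b) + data (512b). */
static int load_cycles(void) {
    return (32 + 128 + 128 + LINE_BITS) / FSL_WIDTH_BITS;   /* = 25 */
}

/* Reading the 512-bit result back over the 32-bit FSL bus. */
static int unload_cycles(void) {
    return LINE_BITS / FSL_WIDTH_BITS;                      /* = 16 */
}

/* per_block_cycles = 14 for encryption, 25 for decryption. */
static int line_cycles(int per_block_cycles) {
    return load_cycles()
         + (LINE_BITS / AES_BLOCK_BITS) * per_block_cycles
         + unload_cycles();
}
```

Evaluating `line_cycles(14)` and `line_cycles(25)` reproduces the 97- and 141-cycle figures, and makes explicit that 41 of the 97 encryption cycles are pure FSL transfer overhead.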
B. True Random Number Generator (TRNG)
The secure processor needs a secret key for memory encryption and decryption. In addition, the secret key should be unpredictable to attackers. A true random number generator (TRNG) is used to generate a secret key for the AES engine.
Figure 5 shows the internal structure of the TRNG, which consists of many identically laid-out delay loops, or ring oscillators (ROs). We call this a ring oscillator TRNG, introduced by Sunar et al. in [18]. Based on the design in [19], which uses 110 rings with 13 inverters, our TRNG consists of 114 ROs, each of which is a simple circuit containing 15 concatenated NOR gates that oscillate at a certain frequency. One of the two inputs of each NOR gate is used to reset the TRNG. Because of manufacturing variations, each RO oscillates at a slightly different frequency. The outputs of all ROs are exclusive-ORed in order to correct bias and correlation and to generate a random signal. We sample the random signal output from the XOR gate at a frequency of 10MHz. Similar to the AES engine, the TRNG is connected to the MicroBlaze (the emulated cache) through the FSL bus and interacts with the OpenSPARC T1 core through the MicroBlaze.
The TRNG module is separate from the other modules, which incurs the overhead of data transfer over the FSL bus when the random bits are used by firmware or another module. For example, a transfer of 128 random bits is needed for the AES engine: TRNG → MicroBlaze → AES. The advantage is that the TRNG is not tied to AES and can be used for other purposes. Only the TRNG → MicroBlaze transfer is needed if the random bits are used inside the firmware. Furthermore, the TRNG could be integrated with the AES unit to reduce the transfer overhead if it is dedicated to AES key generation. To the best of our knowledge, combining a TRNG with an AES engine on a secure processor platform for key generation has not been explicitly discussed in previous works.

Fig. 5: Internal structure of the true random number generator (TRNG): 114 ring oscillators, each of 15 NOR gates, are XORed and sampled.

Fig. 6: 128-bit key generation circuit.
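As a rough software illustration of the structure described above (a behavioral sketch only: the oscillator periods below are invented stand-ins for process variation, and a deterministic model like this produces no real randomness, which in hardware comes from physical jitter):

```c
#include <stddef.h>

/* Behavioral sketch of the ring-oscillator TRNG: each RO is modeled as a
 * square wave with its own period; all outputs are XORed together and the
 * XOR output is sampled by a slower clock. */

#define NUM_ROS 114

/* One ring oscillator's output at time t; the per-RO period mimics
 * manufacturing variation (values are illustrative, not measured). */
static int ro_output(size_t ro, unsigned long t) {
    unsigned long period = 7 + ro % 13;
    return (int)((t / period) & 1);
}

/* XOR of all RO outputs corrects bias and correlation. */
static int trng_sample(unsigned long t) {
    int x = 0;
    for (size_t i = 0; i < NUM_ROS; i++)
        x ^= ro_output(i, t);
    return x;
}

/* Collect n samples, one per tick of the (slower) sampling clock. */
static void trng_collect(int *out, size_t n, unsigned long stride) {
    for (size_t i = 0; i < n; i++)
        out[i] = trng_sample(i * stride);
}
```

In the real design the sampling clock is the 10MHz domain listed in Table I, and the sampled bits are shifted over the FSL bus to the MicroBlaze.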
To test the TRNG on the FPGA platform, we devised a new method to collect enough random bits. We connect an Ethernet core, which works at 10/100/1000Mb/s, to the MicroBlaze. The TRNG outputs random bits at 10Mb/s, so the MicroBlaze can read the random bits from the TRNG and send them through the Ethernet port on the FPGA board to the host computer for testing.
C. Cryptographic Key Generation
The output of the TRNG is a single bit per cycle, but AES encryption and decryption require a 128-bit key. To generate the 128-bit key from the single-bit TRNG output, a key generator is employed, shown in Figure 6.
The key generator uses a single-bit-in, 128-bit-out shift register. The shift register contains 128 concatenated flip-flops, each of which stores one bit of the 128-bit symmetric key. The TRNG generates only one random bit per clock cycle, and its output is connected to the input of the shift register, so it takes the shift register 128 clock cycles to accumulate 128 random bits. These 128 random bits are then latched into a 128-bit register, which outputs them as a key. In addition to the registers, the key generator contains a counter that counts from 0 to 127. When the count reaches 127, it sets a flag signal to 1, indicating that the key is ready. After the key is read, the flag signal is reset to 0 until 128 new bits are available. This allows fresh randomness to be returned immediately if some time has passed since the last read.
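The shift-register-plus-counter behavior can be modeled in software as follows (a sketch with our own naming; the actual design is HDL):

```c
#include <stdint.h>
#include <stdbool.h>

/* Model of the 128-bit key generator: a 1-in/128-out shift register plus a
 * counter that raises 'flag' once 128 fresh TRNG bits have been shifted in. */
typedef struct {
    uint8_t key[16];    /* 128-bit key, first-shifted bit ends up as MSB */
    int     count;      /* bits collected since last key */
    bool    flag;       /* key ready */
} keygen_t;

static void keygen_reset(keygen_t *kg) {
    for (int i = 0; i < 16; i++) kg->key[i] = 0;
    kg->count = 0;
    kg->flag  = false;
}

/* Called once per TRNG clock with one random bit. */
static void keygen_shift(keygen_t *kg, int bit) {
    /* Shift the whole 128-bit register left by one, inserting 'bit'. */
    for (int i = 0; i < 15; i++)
        kg->key[i] = (uint8_t)((kg->key[i] << 1) | (kg->key[i + 1] >> 7));
    kg->key[15] = (uint8_t)((kg->key[15] << 1) | (bit & 1));
    if (++kg->count == 128) {
        kg->flag  = true;   /* key ready after 128 TRNG cycles */
        kg->count = 0;
    }
}
```

At the TRNG's 10MHz clock, the 128 shifts correspond to the 128-cycle key generation latency listed in Table II.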
In this way, generating one key takes 128 clock cycles. The generated key can also be used as a seed to generate more 128-bit keys using the AES engine. The 10MHz working frequency of the TRNG might be a little slow. To obtain a faster key generation speed, more than one TRNG can be used. For example, with two TRNGs, the generated random bits can be stored in two 64-bit shift registers respectively, in which case a key can be generated in only 64 clock cycles and the speed is doubled. However, this comes at the price of more hardware resource utilization.

Fig. 7: Hierarchical key chain.
D. Key Management
Cryptographic keys in the secure OpenSPARC system are critical secrets. In the future, as more secure modules and applications are added to the system, the number of keys may increase, and key management and protection will become a problem. We address this problem by utilizing the concept of a key chain, described in [11].
The key chain is a hierarchical structure that stores all keys of the secure OpenSPARC system in encrypted form, as shown in Figure 7. Each key in the chain is encrypted by its parent key. At the root of the chain is the Root Key. Only a leaf key can be used to encrypt a user's data. This tree structure allows an unlimited number of keys to be stored on the key chain. The Root Key is the most critical and is stored in the Root Key register on the processor chip, which is assumed secure from physical probing. Because all the other keys are in encrypted form, they can be stored in off-chip repositories (external memory) for retrieval and need no special protection. This greatly enhances key security and also reduces the on-chip storage needed for keys.
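The key-chain structure can be sketched as follows. This is a structural illustration only: the XOR "cipher" is a stand-in for AES-128 key wrapping (it is not secure), and all names are ours, not from the design:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of hierarchical key-chain unwrapping.  A child key is stored only
 * in wrapped (encrypted-by-parent) form; walking from the on-chip root key
 * down the chain recovers a leaf key for use. */

#define KEY_BYTES 16

/* Placeholder for AES-128 encryption of a child key under its parent. */
static void toy_wrap(const uint8_t parent[KEY_BYTES],
                     const uint8_t child[KEY_BYTES],
                     uint8_t wrapped[KEY_BYTES]) {
    for (int i = 0; i < KEY_BYTES; i++)
        wrapped[i] = child[i] ^ parent[i];
}

/* Placeholder for the matching AES-128 decryption. */
static void toy_unwrap(const uint8_t parent[KEY_BYTES],
                       const uint8_t wrapped[KEY_BYTES],
                       uint8_t child[KEY_BYTES]) {
    for (int i = 0; i < KEY_BYTES; i++)
        child[i] = wrapped[i] ^ parent[i];
}

/* chain[0] is wrapped by the root key; chain[i] is wrapped by key i-1.
 * Only 'root' lives on-chip; the wrapped chain can sit in external memory. */
static void unwrap_chain(const uint8_t root[KEY_BYTES],
                         const uint8_t chain[][KEY_BYTES], int depth,
                         uint8_t leaf[KEY_BYTES]) {
    uint8_t cur[KEY_BYTES];
    memcpy(cur, root, KEY_BYTES);
    for (int i = 0; i < depth; i++)
        toy_unwrap(cur, chain[i], cur);
    memcpy(leaf, cur, KEY_BYTES);
}
```

Only the root of the walk must be protected in hardware; every other key is recoverable on demand from untrusted storage.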
E. Memory Integrity Verification
Even though off-chip data stored in external memory are encrypted, an adversary can still tamper with them. In ITR or FTR mode, the secure processor performs memory integrity verification on all off-chip data fed into the processor to ensure that they have not been tampered with. The memory integrity verification is realized using a hash tree, shown in Figure 8.
The external memory is divided into data blocks, and each data block is hashed to produce one hash value. These hash values are further hashed to produce their parent hash values (the nodes above them). Thus a hash tree (called a Memory Integrity Tree, MIT) is formed, and at the top of the tree is the root hash. The root hash value is stored in the root hash register on-chip, which is assumed secure. When data are evicted to off-chip memory, the processor performs cache line hashing and updates the non-leaf nodes of the MIT along the path from the leaf node to the root hash. When external data are read in, the processor performs cache line verification along the path from the leaf node to the root hash. Whenever there is a mismatch between the new root hash and the old root hash, it can be asserted that the contents of the external memory have been tampered with, and an exception will be generated.

Fig. 8: Memory Integrity Tree.
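The tree mechanics can be sketched in a few lines. This is a toy model: the FNV-1a hash below stands in for the AES CBC MAC used in the design, and the tree is fixed at four blocks for brevity:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of a two-level memory integrity tree over four data blocks.
 * Leaves hash data blocks, inner nodes hash their children, and the root
 * is compared against the (assumed tamper-proof) on-chip root-hash
 * register. */

#define NBLOCKS 4
#define BLOCK_BYTES 64

/* FNV-1a, as a stand-in for the AES-CBC-MAC hash primitive. */
static uint64_t toy_hash(const void *p, size_t n) {
    const uint8_t *b = (const uint8_t *)p;
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < n; i++) { h ^= b[i]; h *= 1099511628211ULL; }
    return h;
}

static uint64_t hash_pair(uint64_t a, uint64_t b) {
    uint64_t pair[2] = { a, b };
    return toy_hash(pair, sizeof pair);
}

/* Recompute the root hash over all of (untrusted) external memory. */
static uint64_t mit_root(const uint8_t mem[NBLOCKS][BLOCK_BYTES]) {
    uint64_t leaf[NBLOCKS];
    for (int i = 0; i < NBLOCKS; i++)
        leaf[i] = toy_hash(mem[i], BLOCK_BYTES);
    return hash_pair(hash_pair(leaf[0], leaf[1]),
                     hash_pair(leaf[2], leaf[3]));
}
```

In the real design only the nodes on the path from a touched cache line to the root are rehashed, rather than the whole memory as in this sketch.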
The memory integrity tree (MIT) is realized as MicroBlaze firmware in our design, rather than in FPGA fabric. The MIT firmware calls the AES engine to perform the hash algorithm using AES CBC MAC. Using firmware to emulate new hardware has some benefits, one of which is that the firmware can be updated without having to re-synthesize any of the existing components.
V. SYSTEM EVALUATION
This section evaluates our single-chip secure OpenSPARC T1 processor, including its performance and hardware costs. The secure processor is prototyped on a Xilinx Virtex-5 ML505 FPGA board with an XC5VLX110T FPGA chip. We find that the OpenSPARC FPGA platform is relatively easy to use for secure processor prototyping. The synthesis time in the Xilinx ISE environment for the whole secure platform is about 4 hours. However, if only firmware is altered, there is no need to re-synthesize the whole system, and recompiling the firmware takes only several minutes, which saves considerable time.
A. Encryption/Decryption Performance
In CONF and FTR mode, the secure processor has to perform encryption and decryption operations. As noted in Table I, the secure OpenSPARC system has multiple clock domains. It takes the AES engine 97 clock cycles to encrypt 64 bytes of input plaintext, and 141 clock cycles to decrypt 64 bytes of input ciphertext, where many of these cycles are due to data transfer via the 32-bit FSL bus. The TRNG works at a frequency of 10MHz, and generating one 128-bit key needs 128 TRNG cycles; however, a new key is infrequently needed. Data from the AES engine and the TRNG have to be processed by the MicroBlaze first, and they are fetched by the MicroBlaze at a frequency of 125MHz. The clock cycles needed for all these operations are shown in Table II.

TABLE II: Clock cycles needed for various operations.

Operation                              | Cycles | Cycle frequency
AES encryption of 64-byte block        | 97     | 50MHz (AES cycles)
AES decryption of 64-byte block        | 141    | 50MHz (AES cycles)
128-bit key generation from TRNG       | 128    | 10MHz (TRNG cycles)
128-bit key transfer TRNG → MicroBlaze | 4      | 125MHz (FSL cycles)
128-bit key transfer MicroBlaze → AES  | 4      | 125MHz (FSL cycles)
B. Overall Performance
The STD mode causes no extra performance overhead for the processor. In CONF and FTR mode, the processor has to encrypt/decrypt off-chip data, which incurs performance overhead, but this only happens on cache misses, which are relatively infrequent. For each 64 bytes of data evicted off the secure processor chip, a 97-cycle delay is incurred by the encryption operation. Similarly, for each 64 bytes of off-chip data fed into the processor, a 141-cycle delay is incurred by the decryption operation. This overhead could be reduced if the FSL bus were widened or multiple FSL buses were used. Counter-mode AES could also be used to reduce the effective encryption/decryption latency. In ITR and FTR mode, the processor performs cache line hashing on data evicted to off-chip memory, and cache line verification on data read into the processor, which also causes additional performance overhead.
In the FPGA version of OpenSPARC, the frequency of the OpenSPARC T1 core is 50MHz, which is a bit slow for performance research running large software applications. Regardless, the OpenSPARC platform is still suitable for secure processor research because of its many advantages. In this paper, we mainly focus on how to modify the platform to add security features rather than on performance.
C. Hardware Costs
Our single-chip secure processor is implemented on a Xilinx Virtex-5 XC5VLX110T FPGA. Table III shows the total resources of the FPGA chip and the hardware costs after the security modules are added. Table III shows that the OpenSPARC T1 core takes up most of the slices of the FPGA chip (78%), while the new security modules consume far fewer slices. After both AES and TRNG are added, slice utilization increases by only 10 percentage points, far less than the 78% consumed by the T1 core. In our design, the OpenSPARC T1 microprocessor has been tailored to include only one CPU core; the relative resource utilization of AES and TRNG would be even lower if two or more cores were used.
TABLE III: Logic utilization of the single-chip secure OpenSPARC processor on a Xilinx Virtex-5 FPGA

Module                               | Slice       | LUT         | Register    | BRAM
Virtex-5 FPGA XC5VLX110T (total)     | 17280       | 69120       | 69120       | 148
OpenSPARC T1                         | 13561 (78%) | 40270 (58%) | 30087 (43%) | 119 (80%)
OpenSPARC T1 with AES added          | 14030 (81%) | 43174 (62%) | 30945 (44%) | 143 (96%)
OpenSPARC T1 with TRNG added         | 15166 (87%) | 42162 (60%) | 30146 (43%) | 119 (80%)
OpenSPARC T1 with AES and TRNG added | 15181 (88%) | 45030 (65%) | 31004 (44%) | 143 (96%)
Table III also shows that the secure OpenSPARC T1 processor takes up almost all the resources of the Virtex-5 XC5VLX110T FPGA chip. This constrains further development of the system: for example, if more security modules are to be added, or two OpenSPARC T1 cores are desired, the remaining resources may not be enough. One solution to this problem is to move the system to a larger FPGA chip with more logic resources, for example, a Xilinx Virtex-6 or Altera Stratix V FPGA.
VI. CONCLUSIONS
OpenSPARC T1 is an open-source, FPGA-synthesizable general-purpose microprocessor. In this paper, we have described a secure computing model which assumes that only the on-chip environment is secure from physical attacks, and proposed a single-chip secure processor architecture. Further, we have prototyped the secure OpenSPARC T1 processor on an FPGA and evaluated the resulting system. The new security modules added to the OpenSPARC system incur little extra hardware cost and performance overhead.
By prototyping the secure OpenSPARC T1 processor, we find that the OpenSPARC FPGA platform has many advantages for secure processor research and prototyping: the ability to modify real hardware, ease of modification due to the emulated cache, the ability to run a commodity OS and benchmarks, the availability of an open-source hypervisor, etc. The low clock frequency of the stock OpenSPARC T1 processor and the high overhead of data transfers over the 32-bit FSL bus affect the performance of the prototype. Hence, performance monitoring, performance estimation and performance improvement are fruitful areas for further research.
In this work, we have added the AES engine, TRNG and MIT to the OpenSPARC platform. More security modules can be added to further enhance its security features. On the other hand, high-level applications can be developed to make use of these security modules. In summary, we find that the OpenSPARC FPGA platform is relatively easy to use for secure processor prototyping, with many advantages including its advanced software and hardware platform components.
ACKNOWLEDGMENT
This work is supported in part by NSF CCF-0917134 and NSF EEC-0540832, and by the City University of Hong Kong Start-up Grant 7200179.
Abstract— Traditional use of software and hardware simulators and emulators has been in efforts for chip-level analysis and verification. However, prototyping and bringup requirements often demand system- or platform-level integration and analysis, requiring new uses of these traditional pre-silicon methods along with novel interpretations of existing hardware to prototype functions matching the behaviors of future systems. In order to demonstrate the versatility and breadth of the pre-silicon environments in our systems lab, ranging from functional instruction set software simulators to Field Programmable Gate Array (FPGA) chip logic implementations to integrated systems of existing hardware built to mimic key functional aspects of future platforms, we present our experiences with platform-level verification, analysis and early software development/enablement for an I/O-attached network appliance system. More specifically, we show how simulation tools along with these early prototype systems were used to do chip-level verification, early software development and even system-level software testing for a System on a Chip processor attached as an I/O accelerator via Peripheral Component Interconnect Express (PCI Express) to a host system. Our experiences demonstrate that leveraging the full range of pre-silicon environment capabilities results in full system-level integrated software test for an I/O-attached platform prior to the availability of fully functional ASICs.

Index Terms—Software debugging, software prototyping, accelerator architectures, product engineering, system analysis and design.
I. INTRODUCTION
WORKLOAD optimized computer systems represent an integrated approach to hardware and software development to achieve maximum performance from an available footprint, power or cost metric. For many of these systems, general purpose processors are integrated with purpose-built processors to balance ease of programming and implementation with integrated acceleration. When this design occurs in a single ASIC we refer to this as a System On a Chip (SoC) architecture.

Copyright 978-1-4577-0660-8/11/$26.00 ©2011 IEEE

O. Callanan, A. Castelfranco, E. Creedon, K. Muller, B. Purcell and M. Purcell are with IBM Ireland Product Dist. Ltd., Muddart, Ireland (email: {owen.callanan, antonino_castelfranco, eoin.creedon, kay.muller, brianpurcell, mark_purcell}@ie.ibm.com).

C.H. Crawford, S. Lekuch, M. Nutter, and H. Penner are with the IBM TJ Watson Research Center in Yorktown Heights, NY (email: {catcraw, scottl, hpenner}@us.ibm.com).

J. Xenidis is with the IBM Austin Research Lab in Austin, TX (email: [email protected]).
One example of the SoC architecture is the IBM Power
Edge of NetworkTM (PowerENTM) processor. This processor
was designed to handle wirespeed, network facing
applications. It consists of 64 PowerPCTM cores and a set of
accelerators which exist as first class units in the memory
subsystems on the chip. PowerENTM includes compression/decompression, encryption/decryption, regular expression (RegX pattern matching) and Extensible Markup Language (XML) accelerators, all connected via a high speed bus with an integrated Host Ethernet Adapter (HEA).
(For a complete review of the PowerENTM architecture see
[1]). The combination of integrated I/O with the accelerators
and a massively multithreaded (MMT) capability targeting
many sessions of parallel processing is ideally suited for many
edge of network applications [2]. However, for some solutions and workloads, where required libraries or components are unavailable or demand significant single-threaded performance, implementing an entire end-to-end application on PowerENTM is not possible, and a hybrid solution is warranted.
The integration of general purpose processors with special
purpose accelerators has become a mainstay in performance
sensitive computing solutions. In the high performance
computing segment, hybrid architectures have been used to
create systems with over a petaflop of performance using Cell
Broadband Architecture [3] and GPUs [4] connected to x86
ISA based “hosts”. In other technical computing systems
FPGAs have been employed to provide application specific
acceleration [5]. Accelerated systems exist outside of
technical and high performance computing as well. Today,
there are a variety of vendors offering both ASIC and FPGA
based I/O attached accelerators for TCP/IP offload [6],
security [7], and financial data processing for real time trading
[8], exactly some of the workloads for which PowerENTM was
targeted. In all of these systems, the accelerators were connected to the hosts via PCI Express, allowing for a variety of choices for both the host platform and operating system, and so we also chose PCI Express for PowerENTM accelerated systems.

A Study in Rapid Prototyping: Leveraging Software and Hardware Simulation Tools in the Bringup of System-on-a-Chip Based Platforms

O. Callanan, A. Castelfranco, C.H. Crawford, E. Creedon, S. Lekuch, K. Muller, M. Nutter, H. Penner, B. Purcell, M. Purcell and J. Xenidis
There are also a variety of programming models to support
these hybrid computing systems. Flexible runtimes such as
OpenCL [9] and CUDA [10] provide versatile language
bindings for a variety of applications. Runtimes derived from
accelerator hierarchies, clusters of hybrid systems or a hybrid
system built from x86_64 hosts and Cell Broadband Engine
accelerators, have also been developed and show another level
of flexibility of heterogeneous programming [11]. Other vendors build libraries for very specific functions on the accelerator platform, and these libraries are then linked in with customer applications. In any of these approaches, fast and
reliable communication is required between the host and the
device both for latency (e.g. synchronization) and throughput
(e.g. bulk data transfer) driven communication (see for
instance [11]). For PowerENTM accelerated systems, we are
especially concerned with the performance of our PCI Express
data transport layer given that our set of targeted workloads
have inherent wirespeed computing requirements.
Designing and developing the appropriate PCI Express
software stack for these performance sensitive applications
requires significant access to systems for implementation and
testing. In fact, verifying the PCI Express hardware features in
PowerENTM in addition to a highly integrated and optimized
user space networking stack from the HEA to the PCI Express
interface requires testing from hardware functional tests to
software unit tests all the way to full system level stress tests.
In order to verify hardware designs and logic implementations,
as well as develop and test software and stress test the system
without jeopardizing time to market for our solution, we
leveraged a full range of pre-silicon environments. Our
challenge was to find the right mix of tools and prototype
environments to efficiently develop, debug and test for what
would eventually be a complex, heterogeneous system.
This paper is organized as follows. In section 2, we
describe the PCI Express hardware implementation on
PowerENTM and the software stack which supported PCI
Express connected PowerENTM to an x86 based Linux host.
We review our instruction set simulation tools along with some
prototype environments we used to develop and test the
software in section 3. The hardware verification via HDL
simulation and FPGA implementation is described in section
4. We demonstrate the flexibility of all these environments
with a review of our comprehensive “pre-silicon” test
development, including performance and system stress tests in
section 5. We conclude the paper with a section to summarize
the value that the simulation tools and prototype environments
provide for complex heterogeneous systems along with some
of our future work plans in terms of application design and
development.
II. THE POWERENTM PCI EXPRESS FUNCTION
A. Hardware
The PowerENTM architecture incorporated two PCIe ports.
The two ports share up to 17 PCI Express Generation 2 lanes,
with each lane providing a data traffic bandwidth of 500 MB/s.
Additionally, PCI Express Port 0 can be freely configured as a PCI Express Host Bridge or as a PCI Express endpoint. In the
context of the hybrid appliance architecture, the endpoint
configuration is of interest, since it enables a PowerENTM chip
to be attached to a host as a device. However, as we will show
later in this paper, the ability to have the loopback
configuration (Port 0 endpoint, Port 1 root complex) allowed
PowerENTM chip logic to operate both functions at the same
time, i.e. be a PCI Express host server and be a PCI Express
device, and provide an additional hardware verification and
software test configuration for bring-up.
The PowerENTM PCI Express endpoint configuration
provides a set of PCI Express Gen 2 features which gives a host a wide variety of useful capabilities. Most prominently, it supports SR-IOV [12], with 2 physical functions (PFs) and 16 virtual functions (VFs). This feature enables the
development of device drivers on the host which exploit the
PowerENTM in a virtual environment -- multiple device drivers
can operate independently with hardware isolation. As an
example that will be described in more detail in the next
section, PF0 operates a virtual Ethernet driver whereas PF1 is used as the platform for a userspace enablement driver. In future
exploitations, the 16 virtual functions can be used for many
instances of PCIe device drivers and corresponding device
features (embedded analytics for instance), providing one
instance per logical partition or user space process of the host.
The PCI Express endpoint DMA engine in the PowerENTM
architecture is connected to the main bus (PowerBus or PBus)
on the chip as a coprocessor. As such, just as all other
coprocessors on the PowerENTM chip (e.g. crypto), it has a
PBus Interface Controller (PBIC) TLB, which provides mapping between bus addresses and virtual user space addresses, and a DMA engine. Because it is a coprocessor, the
DMA engine can be operated in userspace with coprocessor
initiate (icswx) commands along with the hardware protection
features of the PBIC TLB. Furthermore, by using an IOMMU
on the host side, the PCI Express addresses can be translated
into user space addresses, providing a user space to user space
communication path from PowerENTM to the host.
B. Software
Given the rich PowerENTM hardware features to support
user space PCI Express communication, the software stack
needed to provide corresponding function for high-speed
communications between userspace applications on the host
and device. In other words, our PCI Express software infrastructure does not require kernel involvement in packet transfers, unlike standard socket-style systems. In order to
enable such high speed communication, a kernel device driver
is required to provide low-level access to the PCIe device.
That is to say, after initial probing and provision of features
(i.e. DMA engine, shared memory areas etc.) to userspace, the
kernel device driver involvement is minimal.
Since the PCI Express DMA engine is a PowerENTM
coprocessor, we created a thin abstraction interface to initiate
coprocessor requests. We called this interface libarb. This
abstraction interface also provided requisite memory
registration functions for DMA and MMIO between user space
processes. We implemented libarb on both x86 and PowerENTM to provide architectural neutrality for any software developed on top of this layer, so that only this layer needed to be ported across the various simulation and prototype platforms. Since the DMA engine is only on the PowerENTM
device, host-initiated DMA actually first required the DMA command to be transferred via MMIO to the device and remapped to a device-initiated call. Any thread management, locking, state management or endian conversion is done by the callers of libarb. By providing implementations of this
abstraction, we have the ability to swap device drivers in/out
depending on the particular bring up platform we’re using at
the time whilst preserving the integrity of the upper layers.
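As a sketch of how such a swappable abstraction can be structured, the following uses a function-pointer ops table. The mechanism and all names here (arb_ops, shmem_send, run_backend_demo) are our own illustration; the paper does not describe libarb's internals.

```c
/* Sketch: swapping backends (shared memory, PX-CAB, real PCIe)
 * behind a fixed interface. The ops-table mechanism and all names
 * are hypothetical; libarb's real internals are not public. */
#include <stddef.h>
#include <string.h>

struct arb_ops {
    const char *name;
    /* initiate a transfer; returns 0 on success */
    int (*send)(void *dst, const void *src, size_t len);
};

static int shmem_send(void *dst, const void *src, size_t len) {
    memcpy(dst, src, len);       /* shared-memory "DMA" is a copy */
    return 0;
}

static const struct arb_ops shmem_ops = { "shmem", shmem_send };

/* Upper layers (e.g. HAL-d) hold only an arb_ops pointer, so the
 * backend can change without touching them. */
static const struct arb_ops *arb = &shmem_ops;

int run_backend_demo(void) {
    char src[16] = "hello", dst[16] = { 0 };
    if (arb->send(dst, src, sizeof src) != 0)
        return -1;
    return strcmp(dst, "hello") == 0 && strcmp(arb->name, "shmem") == 0;
}
```

Swapping to a different bring-up platform then amounts to pointing `arb` at another ops table, leaving the upper layers untouched.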
Layered on top of this is our communications protocol
library, the Hardware Abstraction Layer for devices (HAL-d).
HAL-d provides the necessary queue infrastructure required
for two sided packet flow as well as the Remote DMA
(RDMA) semantics and support for one sided throughput
oriented bulk transfer. HAL-d is also not thread safe, but it is
thread-friendly. That is to say that we provide for multiple
DMA groups so that individual threads can initiate DMAs
using separate command groups.
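The per-thread command group idea can be sketched as follows. The struct and function names are hypothetical, not the actual HAL-d API; the point is simply that each thread initiates transfers only through its own group, so no locking is needed.

```c
/* Sketch of per-thread DMA command groups in the style the text
 * describes for HAL-d ("thread-friendly" rather than thread-safe).
 * All names (hald_group, worker) are hypothetical illustrations. */
#include <pthread.h>
#include <stddef.h>

#define NGROUPS 4

/* One command group per thread: private state, never shared. */
struct hald_group {
    int initiated;    /* DMA commands issued through this group */
};

static struct hald_group groups[NGROUPS];

/* No locking: each thread only ever touches its own group. */
static void *worker(void *arg) {
    struct hald_group *g = arg;
    for (int i = 0; i < 1000; i++)
        g->initiated++;       /* stands in for initiating one DMA */
    return NULL;
}

int run_groups_demo(void) {
    pthread_t tid[NGROUPS];
    for (int i = 0; i < NGROUPS; i++)
        pthread_create(&tid[i], NULL, worker, &groups[i]);
    for (int i = 0; i < NGROUPS; i++)
        pthread_join(tid[i], NULL);
    int total = 0;
    for (int i = 0; i < NGROUPS; i++)
        total += groups[i].initiated;
    return total;   /* no updates lost, despite no locks */
}
```

Because no group is ever shared between threads, the count is deterministic even though the library itself takes no locks.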
With a full userspace stack in place, we require a way to start host and device userspace applications. Starting host applications is trivial, given that the host is a standard Linux server. However, starting applications on the remote
device is not so straightforward. An intuitive method is to ssh
into the remote system and then start the device’s userspace
application. In order to do this, we have an additional kernel
device driver that essentially provides a virtual Ethernet channel over PCI Express. When this device is probed, a new network interface is instantiated, e.g. eth1, which can be configured using ifconfig. In order to avoid writing yet
another network device driver, we researched the feasibility of
using the virtio infrastructure [13] (particularly virtio_net) to
provide most of the interaction with the kernel’s networking
stack. Although virtio was originally intended for
virtualization, it proved suitable for our purpose. Internally,
this driver also uses the same communications protocol library
for packet flow as the userspace stack, thus providing the
possibility of having a device userspace to host kernel packet
flow.
Although many concepts about MMIO, receive and send
queue management, and RDMA have been well understood in
protocols and network research for decades, actual design and
implementation on new hardware can be a challenge. This is
especially true when working across endian boundaries. To
help facilitate our work on the PowerENTM platforms we used
a variety of pre-silicon environments, many of which are
described in [14]. In the next sections we describe how we
leveraged these tools and environments for PCI Express and
PowerENTM in order to meet our feature and performance
requirements within strict time to market constraints.
III. SOFTWARE DESIGN AND VERIFICATION
A. Shared memory platforms
Using memory areas shared between two processes is an effective way to mimic the behavior of PCI Express communications hardware without needing access to any specific hardware. By mapping shared memory areas into two separate processes running on a single system, with one process running the device-side code whilst the other runs the host-side code, both MMIO and DMA data transfer methods can be simulated on a single system.
Figure 2: A diagram showing the logical description of the HAL-d IOREMAP as described in the text for MMIO based data movement.
MMIO is the simplest to implement; two shared memory areas are created within a kernel module, a device-side MMIO receive area and a host-side MMIO receive area. The device-side area is mapped into the device process as the receive MMIO space, and into the host process as the send MMIO space. Similarly the host-side receive area is mapped to the host process as the receive MMIO space, and to the device process as the send MMIO space. This “IOREMAP” is shown in Figure 2. MMIO data communication is performed by, for example, the host process writing data to its send MMIO space, which will then be visible to the device process in its receive space.
DMA is more complex to simulate due to its asynchronous nature, and because on many systems arbitrary areas of memory can be used as DMA buffers once they have been suitably prepared. The libarb interface is the key to effectively simulating DMA with shared memory. Libarb places two key restrictions on DMA users (e.g. HAL-d):
1. All memory for DMA must be allocated through arb_allocate_buffer.
2. DMA transfers may only be initiated by calling arb_send.
Restriction 1 ensures that only shared memory areas are allocated to the user application as DMA buffers.
This restriction is also useful for the PCIe libarb, since it hides the complexities of creating DMA'able memory from the libarb user. To enable user-space memory copies, libarb also internally maps the "remote" DMA buffers, so when arb_send() is called it simply copies data between the "local" DMA buffer (mapped to the user-space process) and the "remote" DMA buffer, sets the completion struct and then returns. Due to restriction 2 this is invisible to the user-space process, so the behaviour of shared-memory libarb is identical to that of the libarb running on physical PCI Express PowerENTM hardware.
With libarb, a DMA command is issued with arb_send(), and command completion is checked using arb_check_completion(). When run on a standard processor architecture, however, such as a Power or x86 instruction set, the simulated DMA transfers are inherently synchronous; the transfer is performed in software within the calling process' thread of execution. As a result, shared memory on x86 does not properly test the DMA completion monitoring of applications using libarb, since arb_check_completion will always return true when called after arb_send(). On PowerENTM PCI Express hardware the transfer is performed asynchronously by the chip's PCI Express DMA engine, which notifies DMA completion by updating a struct held in the calling process' memory. DMAs may be completed out-of-order and at any time.
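The synchronous-completion behaviour can be illustrated with a minimal shared-memory libarb shim. The function names (arb_allocate_buffer, arb_send, arb_check_completion) follow the text, but the bodies are illustrative stand-ins, not IBM's implementation:

```c
/* Minimal shared-memory libarb shim, illustrating why completion
 * monitoring cannot be tested this way: the "DMA" is a memcpy done
 * in the caller's thread, and the completion struct is set before
 * arb_send() returns. Internals are illustrative stand-ins. */
#include <stdlib.h>
#include <string.h>

struct arb_completion { int done; };

/* In the real shim, buffers come from the kernel module's shared
 * area; plain malloc stands in, since only copy semantics matter. */
static void *arb_allocate_buffer(size_t len) { return malloc(len); }

static void arb_send(void *dst, const void *src, size_t len,
                     struct arb_completion *c) {
    memcpy(dst, src, len);  /* synchronous: done in caller's thread */
    c->done = 1;            /* completion already set on return */
}

static int arb_check_completion(const struct arb_completion *c) {
    return c->done;
}

int run_arb_demo(void) {
    char *local = arb_allocate_buffer(64);
    char *remote = arb_allocate_buffer(64);
    struct arb_completion c = { 0 };
    strcpy(local, "payload");
    arb_send(remote, local, 64, &c);
    /* Unlike real hardware, an in-flight DMA can never be observed. */
    return arb_check_completion(&c) && strcmp(remote, "payload") == 0;
}
```

Since the completion flag is set before arb_send() returns, a caller's completion-polling logic is never exercised, which motivates the ADM-based port described next.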
To provide a more complete test of the DMA path, and to verify the HAL-d stack on PowerENTM's A2 processor architecture, the x86 shared memory libarb was ported to an early version of PowerENTM hardware. This hardware did not have PCI Express endpoint functionality, so to test the DMA completion code arb_send was altered to use the asynchronous data mover (ADM). The ADM is a coprocessor on the PowerENTM chip that asynchronously moves data between locations in PowerENTM memory. Its mode of operation is very similar to that of the PCI Express DMA engine, except that both the source and destination addresses must be in PowerENTM memory. The shared memory drivers were ported to single-chip PowerENTM, and arb_send was enhanced to use the ADM. In this way libarb users can test their completion monitoring more completely, without needing full PowerENTM PCI Express hardware.
B. Hybrid Architecture Simulation
The shared memory device drivers and libarb layer are very useful for early development and test of the user-space software stack; however, they are of little use for pre-silicon development of the kernel-space device drivers. For this, a much more sophisticated simulation environment that simulates at a hardware register level is required. A combination of the PowerENTM version of the IBM Full System Simulator (IBM FSS or Mambo) [15] and an x86 simulator from Wind River called Simics [16] is used to provide this. Similar in concept to the IBM FSS, Simics speeds up system design, software development, deployment and test automation for hardware architectures such as embedded systems, single- and multicore CPUs, complex hybrid architectures and network-connected systems like clusters, racks and distributed systems. It supports several processor families, I/O devices and standard communication protocols. It runs, unmodified, the same binaries that work on real systems on modeled virtual hardware, enabling developers to program, debug and deploy firmware, device drivers, operating systems, middleware and application software.
Debugging and testing are simplified by a user-friendly interface to run, break and stop the execution of the simulator, inspect hardware faults, save the hardware state for later inspection, and capture output when driving the system in batch mode from automation tools.
By appropriately connecting the Simics x86 simulator to the IBM FSS PowerENTM simulator, the PCIe link between an endpoint-mode PowerENTM processor and a root-complex x86 processor was simulated with sufficient accuracy to allow pre-silicon design and implementation of the x86 and PowerENTM device drivers. We used this software to develop two PCI Express device drivers and a middleware layer on top of them. We built two very small images containing a BusyBox Linux OS and a customized Linux kernel, giving an agile environment for kernel development that allowed us to reboot, debug and modify the system quickly in order to investigate and help fix hardware and software issues. We developed a set of bash scripts to create these images automatically, and a set of Simics scripts to test our device drivers using either Linux or Bare Metal Applications (BMAs) [14] on the device.
Furthermore since the Simics/FSS environment runs Linux
on both sides, it was also suitable for test and verification of
the user-space software stack. Using the Simics File System
we were able to mount the partition on the real machine to run
user-space applications. Testing the user-space stack on
Simics/FSS uncovered a number of bugs and problems which
did not show up on the shared memory platform, in particular software bugs related to endian conversion. We were also able to discover bugs in the fully integrated software stack, since the Simics scripts provide a way to run the simulation in batch mode from an external automation tool, our test harness, for continuous integration: automating builds, deployments, unit tests and functional tests.
In particular, Simics provides a way to connect X terminals
using a virtual serial connections that allowed us to stream the
output of the host side and the device side terminal in log files
saved on the host machine and to monitor execution of the
tests to exit gracefully in case of errors during the executions.
The tests are called automatically from an external automation
tool and produce reports about the sanity of the software. The
FSS/Simics environment was our first system level pre-silicon
platform on which we developed and executed a
comprehensive test suite for the PCI Express software and
corresponding applications.
C. Prototype PowerPCTM - x86 Testbed
A third bring up platform was based on existing hardware,
namely the PowerXCellTM 8i based PCI Express accelerator
board, or PX-CAB [17]. The PowerXCellTM 8i is a Cell BE
based processor, and thus has a PowerPCTM based main
processor, just like PowerENTM. The PCI Express DMA
engine logic implemented as a separate ASIC on the PX-CAB
also resembles the PCI Express logic on PowerENTM. Given
these two features of PX-CAB, it is the most closely related
physical platform to our target platform.
There are several benefits to using real hardware for bring
up. Due to the fact that PX-CAB is a PowerPCTM system, its
use immediately highlights any endian correction issues when
attached to an Intel host. It is also an established and marketed
device, thus being a stable platform for test case generation
and execution, allowing us to use it to drive more complete
end-end test case scenarios, for example, using the PX-CAB as
a NIC and performing some packet processing prior to
forwarding packets to the host. Additionally, we were able to
re-use some portions of the existing PX-CAB device driver to
accelerate bring up.
Our PX-CAB stack is identical to our target software stack
at the upper stack layers, the main difference being the actual
device driver and an implementation of our libarb abstraction
layer to interface with it. Also, some development effort was required on the PX-CAB device driver to bring it up to the kernel version of our target software stack, from 2.6.26 to 2.6.32, which gave the development teams experience in moving our driver from one kernel version to another.
One particularly useful aspect of using PX-CAB was to be
able to stress test and performance tune the upper stack layers
relatively early. Especially in relation to our packet flow
communications library, we were able to stress the packet
queues and ensure that there was no unnecessary polling
across the PCI Express bus accessing either the send or receive
queues. By measuring the latency of direct data transfers, it was possible to identify any large discrepancy between these and transfers made using our communications library.
IV. HARDWARE VERIFICATION
A. HDL Simulator
The HDL which described the actual implementation of the
PowerENTM chip was regularly run on the hardware simulator
[14] to ensure the functionality and performance of the
resulting chip. As stated previously PowerENTM has two PCI
Express ports which could be operated in different modes.
This feature enabled us to create an easy test configuration
with only minor changes of the HDL. This was done by
connecting PCI Express Port 1 Root Complex (RC) to the PCI
Express Port 0 Endpoint Complex (EP). In the HDL this basically means connecting the TX lanes of one port to the RX lanes of the other. This configuration enabled a wide range of tests, from running the full initialization code, including the PCI Express scan, to extensive testing of PCI Express EP functionality such as DMA. We were able to find and resolve issues before the ‘real’ chip arrived, significantly improving the quality of the first chip samples.
The major development and testing effort in this environment was the firmware code responsible for the setup of the PCI Express EP and PCI Express RC. For the former a static setting has to be applied, whereas for the latter the initialization code is more complex by the nature of the logic: the PCI Express scan has to do a PCI Express bus walk and configure the devices found along the way. The wrap configuration of our model gave this code the opportunity to find a device, an SR-IOV capable PCI Express EP, which provides the most complex setup of existing PCI Express devices. With this we could not only find bugs in the newly developed SR-IOV capable scan code, but also give feedback to the chip team where the functionality was incorrect or the documentation lacked precision.
Another challenge we addressed in the loopback test setup was the complex process of PCI Express link training. Since the PHY was part of the simulated HDL and the PCI Express lanes were connected after the PHY, we were able to run the link training sequence, and we spent considerable time getting the link up and running. This not only exposed the firmware team to the complexity of link training, but also provided the opportunity, ahead of hardware availability, to have initialization and debug code ready for debugging on the real hardware.
The second task on this environment was the development
of testcases for the PCI Express EP functionality, e.g. DMA,
MMIO access, interrupt handling and SR-IOV. As part of the
firmware a special wrap test was created, which ensured the
functionality of the entire feature set listed above in the wrap
configuration. This test not only discovered bugs in the early phase but also gave us a first performance impression, since the HDL model is cycle accurate. Only through this effort were we able to ensure the full PCI Express functionality of the PowerENTM chip. As a side effect, we created a combined hardware and software team which fully understands the system architecture and was able to do the ‘real’ hardware bringup in a much shorter test cycle than our lab had previously experienced for full platform enablement.
B. FPGA Emulation of the PCI Express Platform
The PowerENTM FPGA Based Emulator (PFBE) architecture
[14] could be used to exercise unique unit functions by
reassigning the logic configurations assigned to the FPGA
units. For example, logic cards connected to the central bus could be assigned to hold additional processing nodes, or be reconfigured to hold multiple instances of either I/O or accelerator units. In one example, multiple instances of a
single accelerator were emulated in the system by re-assigning
multiple instances of the accelerator to FPGAs originally
provisioned to be used for other accelerator units. Having
multiple instances of the accelerator in the FPGA logic created
a platform that enabled the software team to exercise advanced
unit to unit SMP communication functions of the accelerator
engines well in advance of the availability of the final ASIC
mounted in an SMP configuration. In a second example, a
unique PCI Express to PCI Express wrap logic block was
created to bridge two controllers in the PCI Express logic
chiplet. This wrap logic block removed a dependency on any
external physical I/O by directly connecting the PIPE interface
of one PCI Express unit configured as a host controller, to the
PIPE interface of a PCI Express unit configured as an endpoint.
By directly connecting the FPGA logic units, without any
physical external connection, all of the logic could be run at
the same relative system speed, eliminating external real time
dependencies. This combination of a root complex and an endpoint wrapped together in the FPGA system
enabled the software development teams to develop the drivers
for early root and endpoint logic function. As a result of this
early pre-silicon work, the software team was able to
demonstrate PCI Express driver function on the chip within
one week of receiving the physical chip.
V. TEST DEVELOPMENT AND MULTI-PLATFORM EXECUTION
Given the variety of function capabilities in our pre-silicon
environment, we were able to develop a comprehensive
platform test suite to validate function, measure performance,
and stress both the hardware and software of the x86-host/PowerENTM-device environment. To achieve a broad range of
test execution on multiple pre-silicon platforms a common test
framework was developed which could be utilized across all
target test platforms. Our test framework began with the
various unit and function level testing of the A2 cores and
corresponding memory subsystem. For this, we leveraged the
Linux Test Program (LTP) suite [18], augmented with
lmbench [19], along with some microbenchmarks that we
developed internally which exercised some of the key features
of a highly multithreaded indexing code, e.g. a pointer chase
test. We then added component level testing for each of the
coprocessors to stress thread concurrency. Finally, at the chip
level we added user level networking. Where possible, similar
tests were run in the BMA environment to measure software overheads and determine whether any errors found were introduced by our own PBIC address management code.
As part of system level testing, test suites have been
developed to test the MMIO, DMA and virtual ethernet
channels of the PCI Express software stack for PowerENTM.
As it is difficult to predict how a user will utilize HAL-d, e.g.
what combinations of DMA vs. MMIO, data sizes, address
offsets, etc., we had to develop a test harness with which one
could easily change the data patterns and simulate multi-
threaded and multi-process use cases, as well as the standard
single-thread single-process use case. The test driver was
designed primarily to enable the easy, robust and extendable
testing of HAL-d functionality. It also provides monitoring and
reporting functionality. Monitoring is especially important in
utilities such as HAL-d where blocking communication plays a
role. Should a test fail during the communication or hang, the
monitoring function has the ability to terminate the test after a
user defined timeout. Defect identification is assisted by the
reporting functionality of the framework. This functionality
can provide detailed descriptions of the type of failure
encountered and also generate its own backtrace allowing for
quick identification of any issues should they arise.
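A monitoring function of this kind can be sketched as a simple watchdog. The sketch below is illustrative only; the class and method names are ours, not the actual HAL-d test driver API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative watchdog: runs a (possibly blocking) test body and
// terminates it after a user-defined timeout. Names are hypothetical.
public class TestMonitor {
    // Returns true if the test completed in time, false if it hung
    // (and was interrupted) or failed with an exception.
    public static boolean runWithTimeout(Runnable test, long timeoutMs) {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        Future<?> future = exec.submit(test);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true);   // interrupt the hung test thread
            return false;
        } catch (Exception e) {
            return false;          // the test body raised an error
        } finally {
            exec.shutdownNow();
        }
    }
}
```

A caller would wrap each blocking communication test in `runWithTimeout` and report a timeout as a distinct failure type for the defect report.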
This test framework provides uniform test execution across all target test platforms, simplifying results comparison, issue identification and status reporting. For
instance, we found several scenarios where the software tests
would pass in the functional simulator environment, but the
same software tests would fail on the PFBE “loopback”
system. Upon further inspection and discussions with the
hardware teams, we discovered that the problem was errors in
the hardware documentation when compared with the
implementation. Because we could isolate this to a functional
level on the PFBE system well ahead of silicon delivery, this
helped us considerably when actual hardware arrived. In
comparison to testing on the simulated hardware, in which test
execution can take much longer than on real hardware, shared
memory testing provides real world execution times of the PCI
Express software stack on the target architecture. This allows
for more efficient prototyping of the test framework as both
host and device processes are resident on a single machine and
the tests themselves are running at current microprocessor
speeds. Along with PX-CAB, which provided fast execution
time for developing tests around endian issues, these systems
provided the test team with a complete test execution
environment prior to the release of PowerENTM. Using the
different pre-silicon platforms, satisfactory test coverage of the
target architecture can be achieved.
We were also able to leverage the wide variety of our pre-
silicon platforms to create performance models of the PCI
Express hardware and software. As mentioned previously, the
hardware performance model was developed using the
awanstar cycle accurate simulator. On top of that we took
measurements on both shared memory x86 as well as the PX-
CAB environment to develop estimates of the HAL-d software
overhead. For PowerENTM, we took measurements for both
pure shared memory and the asynchronous data mover (ADM)
for DMA. Interestingly enough, these measurements showed
performance variations depending upon the size of individual
transfers as well as number of transfers before synchronizing
when using the ADM. Upon further investigation we found
that this was an artifact of the ADM hardware architecture and
not something we would necessarily find in the PCI Express
DMA engine. We adjusted the performance models
accordingly. In the end, the results of this suite and model have been used to identify performance bottlenecks and to tune the PCI Express software stack to remove them. Finally, to gauge the HAL-
d and PCI Express software stack performance both user space
and PCI Express link level applications were used.
Differences between the applications reflect the overhead of
HAL-d and the PCI Express software stack, allowing the
performance model to be created independent of the execution
platform.
VI. DISCUSSION AND FUTURE WORK
A PCI Express attached network processor to an x86 based
host allows for a variety of applications, ranging from
intrusion detection systems to financial market data feed
handlers to sensor network data aggregation and filtering.
Many of these applications target the hybrid architectures to
gain substantive performance benefit while maintaining the
general programmability and library availability of the host.
The hybrid approach also allows a development path in which
portions of the code can be ported while others remain on x86
host – allowing for greater experimentation as well as a
progressive and limited-risk process. For all the reasons noted above, the hybrid computing approach is attractive, especially one in which an integrated, highly programmable and powerful network processor such as PowerENTM is used.
In order to enable PowerENTM’s PCI Express hardware
capabilities and the corresponding high performance
applications, we implemented a low latency, high throughput
PCI Express software stack, containing both MMIO and
RDMA based programming interfaces. This software stack
was designed with the requirements of application data planes
in mind. Since many applications also require control plane
operations, such as logging, heartbeating, etc., we also
implemented a virtual Ethernet over PCI Express stack which
allowed for standard socket based programming between the
x86 host process and the PowerENTM based device process.
Therefore application developers have a choice of interfaces to
leverage when optimizing with the PowerENTM PCI Express
system.
To gain the greatest benefit from workload optimized
application development on combined general purpose and
purpose built systems, the software application and hardware
platform, including the processor, needs to be designed and
developed in an integrated process. This integrated
development process implies that iterative hardware designs
should be considered as applications are ported and tuned.
Therefore, the availability of system-level pre-silicon environments for hardware verification and test, along with software development, is crucial not just from a time-to-market perspective but also from a performance optimization perspective. In this paper we have presented a variety
of pre-silicon environments which were used to validate
hardware and software designs for the new PowerENTM
processor in a hybrid system.
Traditionally, pre-silicon environments have been used at
the microprocessor level. System level design and
development work, including system test, required the
existence of silicon and early hardware. Our goals were to
design and prototype at the system level from processor unit
verification all the way to system software design and
enablement to fully integrated stress tests. In order to
accomplish this we had to include standard software
instruction set simulators, HDL simulators, a novel approach
to using FPGAs to emulate actual chip logic VHDL at
microprocessor speeds, and even some early hardware
“prototype” environments. The various stages and
requirements of integrated development were carefully
reviewed and the various pre-silicon environments were
chosen for appropriate and efficient verification, development
and debugging. For instance, much of our PCI Express
software stack early design and development was done in
shared memory and then on prototypes mimicking tightly coupled
x86 and PowerPCTM systems. Once the PCI Express software
stack was available on the various pre-silicon environments,
build acceptance, functional verification, and even full system,
stress and performance tests were developed. This allowed for
both the full software stack and the various tests to be
available as soon as the real hardware arrived.
Our test suites have been developed not just to exercise the
PCI Express function, but also the entirety of the system used
by applications of interest, e.g. the HEA for packet processing,
the A2 PowerPCTM binaries, the PowerENTM coprocessor
along with moving data to and from the x86 host. We
are currently evaluating a variety of wire speed applications in
the areas mentioned previously in terms of their performance
characteristics and ability to stress the overall system to
integrate into our system test suite. As actual hardware
arrives, having already developed and debugged the core of
the system infrastructure will allow us to focus on the
performance optimizations required to reach new levels of
processing capability in network computing.
ACKNOWLEDGMENT
We are grateful to Nancy Greco and Heather Achilles for
their continued support and guidance throughout all phases of
these projects as well as the members of the Hybrid Systems
Lab, specifically, Heinz Baier, Thomas Hovarth, Ken Inoue,
and Steve Millman.
REFERENCES
[1] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, C. L. Johnson, "Introduction to the wire-speed processor and architecture", IBM Journal of Research and Development, vol. 54, no. 1, paper 3, January/February 2010.
[2] D. P. Lapotin, S. Daijavad, C. L. Johnson, S. W. Hunter, K. Ishizaki, H. Franke, H. D. Achilles, D. P. Dumarot, N. A. Greco, B. Davari, "Workload and network-optimized computing systems", IBM Journal of Research and Development, vol. 54, no. 1, paper 1, pp. 1-12, January/February 2010.
[3] M. Kistler, J. Gunnels, D. Brokenshire, B. Benton, "Petascale computing with accelerators", Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2009), Raleigh, NC, 2009, pp. 241-250.
[4] R. Stone and H. Xin, "Supercomputer leaves competition – and users – in the dust", Science, vol. 330, no. 6005, pp. 746-747, 5 November 2010.
[5] R. Sass, W. V. Kritikos, A. G. Schmidt, S. Beeravolu, P. Beeraka, "Reconfigurable Computing Cluster (RCC) Project: Investigating the Feasibility of FPGA-Based Petascale Computing", Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, IEEE Computer Society, Washington, DC, 2007, pp. 127-140.
[6] "The Unified Wire", Chelsio Communications whitepaper, http://www.chelsio.com/unifiedwire_eng.html.
[7] "Accelerators in Cray's Adaptive Supercomputing", http://rssi.ncsa.illinois.edu/2007/docs/industry/Cray_presentation.pdf.
[8] G. Valente, "Implementing Hardware Accelerated Applications For Market Data and Financial Computations", HPC on Wall Street, New York, NY, September 17, 2007, http://www.lighthouse-partners.com/highperformance/presentations07/Session-7.pdf.
[9] OpenCL, Khronos Group, http://www.khronos.org/opencl/.
[10] CUDA Zone, NVIDIA Corporation, http://www.nvidia.com/object/cuda_home.html.
[11] IBM Corporation, Data Communication and Synchronization Library Programmer's Guide and API Reference, http://publib.boulder.ibm.com/infocenter/systems/scope/syssw/topic/eicck/dacs/DaCS_Prog_Guide_API_v3.1.pdf.
[12] PCI Special Interest Group, "Single Root I/O Virtualization", http://www.pcisig.com/specifications/iov/single_root/.
[13] R. Russell, "virtio: Towards a De-Facto Standard For Virtual I/O Devices", http://portal.acm.org/citation.cfm?id=1400108.
[14] J. Aylward, C. Cox, C. H. Crawford, K. Inoue, S. Lekuch, K. Muller, M. Nutter, H. Penner, K. Schleupen, J. Xenidis, "A Review of Software and System Tools for Hardware Design, Verification and Software Enablement for System-on-a-Chip Architectures", IBM Research report, submitted.
[15] P. Bohrer, M. Elnozahy, A. Gheith, C. Lefurgy, T. Nakra, J. Peterson, R. Rajamony, R. Rockhold, H. Shafi, R. Simpson, E. Speight, K. Sudeep, E. Van Hensbergen, and L. Zhang, "Mambo – A Full System Simulator for the PowerPC Architecture", ACM SIGMETRICS Performance Evaluation Review, vol. 31, no. 4, pp. 8-12, March 2004.
[16] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg, F. Larsson, A. Moestedt, B. Werner, "Simics: A Full System Simulation Platform", IEEE Computer, February 2002.
[17] H. Penner, U. Bacher, J. Kunigk, C. Rund, H. J. Schick, "directCell: Hybrid systems with tightly coupled accelerators", IBM Journal of Research and Development, vol. 53, no. 5, paper 2, 2009.
[18] LTP: Linux Test Project, http://ltp.sourceforge.net/.
[19] LMbench, http://www.bitmover.com/lmbench/.
Rapid automotive bus system synthesis based on communication requirements
Matthias Heinz, Martin Hillenbrand, Kai Klindworth, K.-D. Mueller-Glaser
Karlsruhe Institute of Technology (KIT), Germany
Institute of Information Processing Technology
Email: {heinz, hillenbrand, kai.klindworth, klaus.mueller-glaser}@kit.edu
Abstract—The complexity of modern cars, and with it their electric/electronic architecture (EEA), has increased rapidly during the last years. New applications like driver assistance systems are highly distributed over the network of hardware components. More and more systems share common sensors placed in sensor clusters. This leads to a greater number of mutually connected electronic control units (ECUs) and bus systems. The traditional domain-specific approach of grouping connatural ECUs into one bus system does not necessarily lead to an overall optimal EEA design. We developed a method to automatically determine a network structure based on the communication requirements of ECUs. The EEA model, which is developed during the vehicle development life-cycle, provides all the information we need, such as cycle times and data widths, to build a network of automotive bus systems. We integrated our method into the EEA tool PREEvision to allow rapid investigation of realization alternatives. The relocation of functions from one ECU to another is ideally supported by our method, since we can generate a new network structure fitting the new communication demands within minutes.
I. INTRODUCTION
During the design of a vehicle, several thousand signals between up to 70 ECUs have to be considered [1]. Based on the given requirements, the network designer has to set up a compound of automotive bus systems. Arising from the communication requirements, he has to select the right bus systems for the given bus load. Although not all ECUs in a new vehicle model are newly introduced and former architectures can be taken as reference, the high innovation rate in vehicle electronics leads to numerous new functions in every new car. New concepts, like the use of sensor clusters sharing sensors among several ECUs, also lead to changed design requirements. Moving functions from one ECU to another may require a restructuring of bus systems. New technologies, like radar- or video-based driver assistance systems, impose high requirements on the required data rates. This additionally leads to new connections to the already established ECUs.
Currently the complete bus system architecture is designed by hand. Using a domain approach, where ECUs of a certain application area are bound together, designers try to handle the complexity. In the past, powertrain, chassis and body bus systems were installed to fulfill the communication needs. But new, highly distributed systems such as Adaptive Cruise Control (ACC) or lane keeping span a network of many ECUs and do not clearly fit into this fixed structure. So the grown domain-specific approach does not necessarily lead to the overall optimal design.
To avoid these problems and to speed up the prototyping and development process, we present a method that automatically generates automotive bus system connections between ECUs. The connections are based on the communication requirements of the provided ECUs. To get all necessary information, we employ the data provided in the electric/electronic architecture (EEA) model. Since the EEA model is a crucial part of the overall vehicle design process, the data provided by this model is always up to date and can ideally be taken as reference. This allows us to directly influence the network structure based on the current data requirements. Since our methodology has been implemented as a plugin for the Eclipse-based EEA modeling tool PREEvision, we can automatically create all generated bus systems with one click.
The partitioning of ECUs we use for building our networks has been the basis of hardware/software co-design approaches for a while. The place-and-route algorithms used in tools for reconfigurable hardware devices like Field Programmable Gate Arrays (FPGAs), and also in Very-Large-Scale Integration (VLSI) processes, make extensive use of partitioning techniques [2]. The literature provides many different algorithms for solving such problems. Basic algorithms for Electronic Design Automation (EDA) of electronic devices can be found, e.g., in [3]. We adapted several of these techniques for the partitioning problem described in this paper.
The remainder of this paper is organized in five sections. A short introduction to automotive bus systems and architecture modeling is given in Sections II and III. Section IV presents our approach to ECU partitioning, describing the clustering, the implemented nearness functions, partition optimization and partition merging. Section V describes the verification and test of our approach, followed by conclusions and outlook in Section VI.
II. AUTOMOTIVE BUS SYSTEMS
Current vehicles feature a number of different bus systems fulfilling the diverse communication requirements of the distributed network of interconnected sensors, ECUs and
actuators. Based on their communication bandwidth and application, they are separated into different classes. Currently deployed bus systems are listed in Table I. Since infotainment bus systems have not been designed for open- or closed-loop control, they form their own class.
Class          Data rate         Deployed buses
A              < 25 kbit/s       Local Interconnect Network (LIN) [4]
B              25-125 kbit/s     Controller Area Network (CAN) Class-B [5]
C              125-1000 kbit/s   Controller Area Network (CAN) Class-C [5]
D              > 1 Mbit/s        FlexRay [6]
Infotainment   > 10 Mbit/s       Media Oriented Systems Transport (MOST) [7]

TABLE I
OVERVIEW OF AUTOMOTIVE BUS SYSTEMS [8]
III. ELECTRIC/ELECTRONIC ARCHITECTURE MODELING
During the concept phase of a vehicle, the electric/electronic architecture (EEA) is designed. The modeling of such architectures allows balancing the possible realization alternatives and finding an overall design. The tool PREEvision, which is used by leading car manufacturers [9], allows designing such complex architectures, containing up to 800,000 elements for a premium car. To handle this complexity, different perspectives on the model are provided (Fig. 1). The EEA elements required for our method are located in the function network and the component network. Functions feature ports that implement interfaces. An interface describes the data elements exchanged between the participants. A communication requirement, which can be allocated to a port prototype, describes the cycle time of the exchanged data elements. In the component network, ECUs, bus connectors and bus systems are modeled.
IV. ECU PARTITIONING
The following information from the EEA model is utilized to start the ECU partitioning. Elements can be accessed directly in the model using Java code.
• Cycle time given by the communication requirement (PortCommunicationRequirement)
• Type and number of data elements out of interfaces (DataElement)
• Sending ECU (sender)
• Receiving ECU (receiver)
A. Representation as graph
Representing ECUs and their communication requirements as a graph allows solving the partitioning problem with the help of algorithms. In our case, edges represent the exchanged data while nodes represent ECUs. The algorithm partitions the nodes into single networks while trying to reduce the cutting costs between the partitions. Partitioning is a classical problem in computer science and is considered non-deterministic polynomial-time hard (NP-hard). There
Fig. 1. Layered EEA (model perspectives: requirements, function network, networking & communication, power distribution, component description, schematics, wiring harness, grounding concept, topology)
are some well-established algorithms to solve such problems [10], namely Kernighan-Lin, Fiduccia-Mattheyses, Simulated Annealing, Hierarchical Clustering, Evolutionary Algorithms, Integer Linear Programming and Tabu Search.
B. Hierarchical clustering
To build a set of partitions in the first place, we used the Hierarchical Clustering (HC) algorithm. While the nodes of the graph are ECUs, weighted edges indicate the nearness of the ECUs as given by their communication requirements. The edges are undirected because the direction of information exchange does not influence the result. The information flow in both directions between the ECUs is summed up in the weight of the edge. Hierarchical Clustering starts with one ECU in each cluster and merges the two partitions with the greatest nearness. This proceeds until only one partition is left. Each group of partitions covering all ECUs forms a possible solution, independent of the solution step in which it appears. This feature is utilized in the succeeding steps.
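The clustering step can be sketched as a plain agglomerative loop. The class below is our illustration, not the PREEvision plugin code; it uses a summed-weight nearness between clusters and records every intermediate partitioning as a candidate solution, as described above.

```java
import java.util.*;

// Minimal sketch of agglomerative clustering over ECUs (names are ours).
// nearness[i][j] holds the symmetric nearness between ECUs i and j.
public class HCluster {
    // Returns the merge history: each step merges the two nearest clusters,
    // and every merged cluster is recorded as a candidate partition.
    public static List<Set<Integer>> cluster(double[][] nearness) {
        List<Set<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < nearness.length; i++)
            clusters.add(new HashSet<>(Set.of(i)));
        List<Set<Integer>> history = new ArrayList<>();
        while (clusters.size() > 1) {
            int bi = 0, bj = 1;
            double best = -1;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double n = linkNearness(clusters.get(i), clusters.get(j), nearness);
                    if (n > best) { best = n; bi = i; bj = j; }
                }
            clusters.get(bi).addAll(clusters.remove(bj));
            history.add(new HashSet<>(clusters.get(bi)));
        }
        return history;
    }

    // Summed pairwise nearness between two clusters, mirroring the summed
    // bidirectional traffic on the undirected edges described in the text.
    static double linkNearness(Set<Integer> a, Set<Integer> b, double[][] w) {
        double s = 0;
        for (int x : a) for (int y : b) s += w[x][y];
        return s;
    }
}
```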
To find the best overall solution, the costs for each determined solution have to be calculated. The costs for one partition are composed of the following parts:
• Bus system costs: the costs for establishing a bus system
• Bus participant costs: calculated for each bus member
• Gateway costs: calculated if there is data transfer to other partitions
• Bytes/s of external data transfer: covers the costs for the data that must be transferred through the gateway
The overall cost of a partition is the cheapest sum of these single costs, calculated over all possible bus systems (LIN, CAN, FlexRay). If no bus system can fulfill the communication requirements, the algorithm returns an error message.
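Using the fictitious example costs given later in this section (High-speed CAN bus 200 and device 10, FlexRay bus 300 and device 25, gateway and external-transfer costs zero), the per-partition cost can be sketched as below. The 250 kbit/s feasibility bound for High-speed CAN is our assumption, chosen only so that this sketch reproduces the choices in Table II; the actual implementation derives feasibility from the communication requirements.

```java
// Sketch of the per-partition cost calculation, using the paper's
// fictitious example costs. Gateway and external costs are zero here,
// as in the example; the HS-CAN traffic bound is an assumption.
public class PartitionCost {
    static final double HS_CAN_MAX_KBIT = 250;   // assumed usable bus load

    // Cheapest feasible bus for a partition with the given device count
    // and aggregate traffic; FlexRay is assumed always feasible here.
    public static int cost(int devices, double trafficKbit) {
        int flexRay = 300 + 25 * devices;
        int hsCan = 200 + 10 * devices;
        return trafficKbit <= HS_CAN_MAX_KBIT ? Math.min(hsCan, flexRay) : flexRay;
    }
}
```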
Fig. 2. Partition tree after Hierarchical clustering (seven partitions; 660 and 410 kbit/s of traffic on FlexRay; 250, 180, 230, 100 and 150 kbit/s on High-speed CAN)
To find the overall best solution for the whole HC tree, the algorithm (Fig. 3) processes the tree, beginning at the top. In the first step, the costs for the currently selected partition are calculated. Afterwards, the costs for the child partitions are calculated and their sum is compared to the own costs. The cheaper solution is then taken. The algorithm thus steps recursively through the tree and determines the cheapest solution over all possible partitions.
Using the graph in Fig. 2, the algorithm returns the solution in Table II. To keep the overview simple, we set the gateway and external data costs to zero for this example. As fictitious costs, we set a High-speed CAN bus to 200, a HS-CAN device to 10, a FlexRay bus to 300, and a FlexRay device to 25. Partitions 3, 4 and 5 emerge as the cheapest solution.
Since the HC algorithm does not take the available data rate of a bus system into account during partitioning, inappropriate partition sizes can appear (Fig. 4). To overcome this issue, an additional bin packing algorithm has been implemented to merge underutilized partitions.
Data: Partition part
Result: costs c, list of partitions list
myCost := costOfPartition(part)
mySolution := {part}
childrenCost := 0
childrenSolution := ∅
while child := part.nextChild do
    childrenCost += cheapestSolution(child).c
    childrenSolution := childrenSolution ∪ cheapestSolution(child).list
if childrenSolution ≠ ∅ and childrenCost < myCost then
    return (childrenCost, childrenSolution)
else
    return (myCost, mySolution)

Fig. 3. Cheapest solution algorithm
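The recursion of Fig. 3 can be sketched in Java as follows. This is our illustration, not the plugin implementation: node costs are supplied directly (the Table II values) rather than computed from bus parameters.

```java
import java.util.*;

// Sketch of the recursive cheapest-solution traversal over the HC tree.
// Each node carries its own partition cost; the recursion keeps whichever
// is cheaper: the node itself or the sum of its children's best solutions.
public class CheapestSolution {
    static class Node {
        int id; int cost;
        List<Node> children = new ArrayList<>();
        Node(int id, int cost) { this.id = id; this.cost = cost; }
    }
    static class Result { int cost; List<Integer> parts = new ArrayList<>(); }

    public static Result cheapest(Node part) {
        Result mine = new Result();
        mine.cost = part.cost;
        mine.parts.add(part.id);
        if (part.children.isEmpty()) return mine;
        Result kids = new Result();
        for (Node c : part.children) {
            Result r = cheapest(c);
            kids.cost += r.cost;
            kids.parts.addAll(r.parts);
        }
        return kids.cost < mine.cost ? kids : mine;  // keep the cheaper cut
    }
}
```

On the Table II tree this recursion selects partitions 3, 4 and 5 at a total cost of 860.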
Partition   Bus system                   Own costs / Costs of children
1           FlexRay                      950 / 860
2           FlexRay                      525 / 490
3           High-speed CAN               370 / 570
4           High-speed CAN (5 devices)   250 / -
5           High-speed CAN (4 devices)   240 / -
6           High-speed CAN (8 devices)   280 / -
7           High-speed CAN (9 devices)   290 / -

TABLE II
CALCULATED COSTS FOR THE GRAPH IN FIGURE 2
Fig. 4. Inappropriate partitioning (Low-speed CAN buses at 90% and 30% bus load attached to High-speed CAN buses; bus load is approx. 50% of the 125 kbit/s data rate for Low-speed CAN)
C. Nearness function
To execute the HC algorithm, a nearness function has to be implemented. The obvious idea of taking the absolute data rate between the partitions turned out to be inapplicable. A typical network consists of unequally fast participants. An algorithm only taking the data rate into account would start by merging the fastest participants. A big partition dominating the others would emerge. Since it is very likely that leftover partitions featuring only one ECU also have a high nearness to this big partition, they would be added to it one after another. This leads to a degenerate HC tree. This behavior is not desired, since the slower participants have no chance to build their own network with minor bus requirements.
During development it turned out that different nearness functions lead to different results.
For the first nearness function, which we call "relative nearness", the data transfer of the current partition to another partition is divided by the overall transfer rate of the current partition. This shows the percentage of the communication going to other partitions (Fig. 5). It enables partitions communicating strongly with each other to have a high nearness, even if their data rates are low.
To avoid merging slow nodes with faster ones, the smaller of the two values for one connection is always used. In Fig. 5, a nearness of 0.05 would be taken for the shown connection. We call this "bothsided relative nearness".
Because the right node massively lowers the nearness of the connection, and hence the sum over all connections of the left node, for the "weighted nearness" the quotient of one connection to all other connections is calculated again (Fig. 6). The sum of
Fig. 5. Weighting of connections (relative nearness of 0.45 and 0.05 at the two ends of the same connection)
the connections in this example is 0.20 + 0.14 + 0.02 = 0.36. Dividing the single values by 0.36 yields 0.56, 0.39 and 0.06 as new values.
Fig. 6. Balanced weighting of connections (relative/weighted values of 0.20/0.56, 0.14/0.39 and 0.02/0.06 for the three connections)
It turned out that penalizing partitions with many connections leads to a better solution, so we divided the "weighted nearness" by the number of connections. We call this new function "shared weighted nearness".
The software we developed contains all the above-mentioned nearness functions, since the structure of future networks cannot be foreseen. Therefore all functions are calculated and the best solution is taken. Since the optimal solution for a given communication network is unknown, we used randomly generated networks to validate the concepts experimentally.
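The four nearness variants can be sketched as small pure functions. The method names are ours, not the plugin's API; edge weights are the summed bidirectional traffic and the totals are each partition's overall transfer rate, as described above.

```java
// Sketch of the nearness variants from this section (names are ours).
public class Nearness {
    // share of a partition's total traffic that goes over this edge
    public static double relative(double edge, double total) {
        return edge / total;
    }
    // use the smaller share so a slow node is not absorbed by a fast one
    public static double bothsidedRelative(double edge, double totalA, double totalB) {
        return Math.min(relative(edge, totalA), relative(edge, totalB));
    }
    // renormalize one edge's share against all of the node's edge shares
    public static double weighted(double share, double sumOfShares) {
        return share / sumOfShares;
    }
    // additionally penalize nodes with many connections
    public static double sharedWeighted(double share, double sumOfShares, int connections) {
        return weighted(share, sumOfShares) / connections;
    }
}
```

With the Fig. 6 values, `weighted(0.20, 0.36)` reproduces the 0.56 from the example.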
D. Partition optimization
Since the HC algorithm follows a greedy strategy and makes a locally optimal choice at each stage, it does not necessarily lead to the overall best solution. To improve the solution found by HC, we implemented a subsequent Fiduccia-Mattheyses (FM) algorithm. This iterative algorithm, featuring a complexity of O(n), helps to lower the cutting costs between partitions. Its advantage is that, unlike e.g. Kernighan-Lin, it does not require partitions of equal size. A balance criterion can be used to check that the balance between partitions is not too unequal and that the bus load of the bus system is not exceeded.
The implemented FM algorithm comprises the following steps:
1) The gain of all nodes for shifting between the partitions is calculated.
2) A node not violating the balance criterion and holding the highest gain is selected. If several nodes feature the same gain, the one best fitting the balance criterion is selected.
3) This node is added to a list and cannot be moved in further steps during this optimization pass.
4) While not all nodes are in this list, the gain of all connected nodes is recalculated and the algorithm continues with step 2.
5) The shifting sequence is then executed up to the point of highest aggregate gain. If the gain is negative for all steps, we stop and do not shift any nodes. Otherwise, the list of fixed elements is cleared and we start over with step 1.
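A compact single pass over a two-way partition can be sketched as follows. This is our simplification (a size-range balance check and a dense weight matrix), not the tool's implementation: gains are recomputed after each locked move, and only the move prefix with the highest positive cumulative gain is kept.

```java
import java.util.*;

// Sketch of one Fiduccia-Mattheyses pass over a bipartition (names ours).
public class FMPass {
    // w: symmetric edge weights; side[i]: 0/1 partition of node i.
    public static int[] onePass(double[][] w, int[] side, int minSize, int maxSize) {
        int n = w.length;
        boolean[] locked = new boolean[n];
        int[] cur = side.clone();
        List<Integer> moves = new ArrayList<>();
        List<Double> cumGain = new ArrayList<>();
        double total = 0;
        for (int step = 0; step < n; step++) {
            int best = -1;
            double bestGain = Double.NEGATIVE_INFINITY;
            for (int v = 0; v < n; v++) {
                if (locked[v] || violates(cur, v, minSize, maxSize)) continue;
                double g = gain(w, cur, v);
                if (g > bestGain) { bestGain = g; best = v; }
            }
            if (best < 0) break;          // no movable node left
            cur[best] ^= 1;
            locked[best] = true;          // fixed for the rest of this pass
            total += bestGain;
            moves.add(best);
            cumGain.add(total);
        }
        // keep the move prefix with the highest positive cumulative gain
        int cut = -1;
        double bestTotal = 0;
        for (int i = 0; i < cumGain.size(); i++)
            if (cumGain.get(i) > bestTotal) { bestTotal = cumGain.get(i); cut = i; }
        int[] result = side.clone();
        for (int i = 0; i <= cut; i++) result[moves.get(i)] ^= 1;
        return result;
    }

    // gain = external edge weight minus internal edge weight of v
    static double gain(double[][] w, int[] side, int v) {
        double g = 0;
        for (int u = 0; u < w.length; u++)
            if (u != v) g += (side[u] != side[v]) ? w[v][u] : -w[v][u];
        return g;
    }

    // moving v must not leave either side outside [minSize, maxSize]
    static boolean violates(int[] side, int v, int minSize, int maxSize) {
        int from = 0;
        for (int s : side) if (s == side[v]) from++;
        return from - 1 < minSize || (side.length - from) + 1 > maxSize;
    }
}
```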
The first pass of the given algorithm is depicted in Fig. 7. The shown nodes are ECUs, while the dashed lines indicate the limits of the partitions. The numbers below the nodes show the achievable gain when moving the node to the other partition.
Fig. 7. Example FM algorithm, first pass (balance criterion: 1 <= partition size <= 3; cumulative gains after steps 1-4 are 2, 2, 2 and 1; maximum gain at steps 1-3, with step 2 best fitting the balance criterion)
E. Partition merging
After executing the HC and FM algorithms, several partitions are often not filled to 100%. To lower the costs of the overall architecture it is reasonable to merge such partitions onto a single bus. Only partitions featuring the same bus type are merged, since the costs would rise if nodes were shifted to a faster and thus more expensive bus.
The merging of partitions can be considered a bin-packing problem [11]. Our goal is to maximize the filling level of the partitions. Several algorithms are available in the literature to solve the bin-packing problem exactly [12]. We solved the problem using the dynamic programming approach [3].
A challenge specific to this problem is that partitions change their weight when they are packed together. A simple example of this relationship is depicted in Fig. 8. Data transferred between two partitions through a gateway is counted for both partitions. If these partitions are merged, the data transfer is only counted once, since the ECUs can communicate directly, so the utilization of the merged bus is lower than the sum of both parts.
The bin-packing algorithm has been implemented separately for each bus system. Since the dynamic programming approach reuses
Fig. 8. Weight change arising from partition merging
previously calculated blocks to speed up the computation, it suffers from the partition-size problem described above. To minimize this drawback we implemented the following steps:
1) Generate a list of all partitions featuring the same bus system.
2) Create an empty bin featuring the capacity of that bus system.
3) Fill the bin with partitions. Delete used partitions from the list, and add the newly created bin to the list if more than one partition has been packed into it.
4) If unpacked partitions remain, restart with step 1.
Step 3 recalculates the size of the bins and allows another partition to be packed into a bin if there is enough space. This addresses the problem described above concerning size changes after merging partitions.
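The merge loop can be sketched as follows. The partition representation and the `load` helper are illustrative assumptions; the key point is that gateway traffic between merged partitions is counted only once, so a bin's load is recomputed after every merge.

```python
def merge_partitions(partitions, traffic, capacity):
    """Greedily merge partitions of the same bus type onto shared buses.

    partitions: list of sets of ECU names (assumed to fit individually);
    traffic: dict (ecu, ecu) -> data rate in kbit/s;
    capacity: bus capacity in kbit/s. Returns the merged groups.
    """
    def load(group):
        members = set().union(*group)
        # Every flow touching the merged group loads its bus exactly once,
        # so gateway traffic between merged partitions is no longer doubled.
        return sum(rate for (u, v), rate in traffic.items()
                   if u in members or v in members)

    todo = list(partitions)
    result = []
    while todo:
        bin_group = [todo.pop(0)]
        # Recompute the bin's load after every merge: the weights change.
        for part in todo[:]:
            if load(bin_group + [part]) <= capacity:
                bin_group.append(part)
                todo.remove(part)
        result.append(set().union(*bin_group))
    return result
```

On the Fig. 8 example (two single-ECU partitions exchanging 5 kbit/s), the merged group loads its bus with 5 kbit/s instead of loading two buses with 5 kbit/s each, so the merge succeeds whenever the bus capacity allows it.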
For a small number of bus systems it is also possible to use an exact algorithm, which uses more computing time and memory than the dynamic programming approach. The exact algorithm steps through every possible combination of partitions: it calls itself recursively once with the current partition added to the bin and once without it, thus checking all available solutions. The pseudocode is given in Fig. 9.
Data: list of old partitions oldpart, partition list allparts, list position it, used bus bus
Result: list of used partitions

part := oldpart ∪ {allparts(it)}
if it = allparts.last then
    if bus = checkBus(part) then   /* busload of the current bus not exceeded */
        return part
    else
        return oldpart
else
    if bus = checkBus(part) then
        return maxInternalTraffic(rucksack(part, it+1, bus), rucksack(oldpart, it+1, bus))
    else
        return rucksack(oldpart, it+1, bus)
Fig. 9. Exact bin packing algorithm
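The pseudocode of Fig. 9 translates roughly into the following Python. The `fits` and `score` callables are simplified placeholders for `checkBus` and `maxInternalTraffic`; the real busload and traffic computations are not reproduced here.

```python
def rucksack(oldpart, it, allparts, fits, score):
    """Exact recursive packing of partitions into one bus (cf. Fig. 9).

    fits(parts) stands in for checkBus (busload not exceeded);
    score(parts) stands in for the internal-traffic metric being maximized.
    """
    part = oldpart + [allparts[it]]
    if it == len(allparts) - 1:            # last list position
        return part if fits(part) else oldpart
    if not fits(part):                     # current partition cannot be added
        return rucksack(oldpart, it + 1, allparts, fits, score)
    # Branch twice: once with the current partition, once without,
    # and keep the combination with the higher score.
    with_it = rucksack(part, it + 1, allparts, fits, score)
    without_it = rucksack(oldpart, it + 1, allparts, fits, score)
    return max(with_it, without_it, key=score)
```

For partition sizes [3, 4, 2] and a capacity of 6, the recursion enumerates all feasible subsets and returns the best-scoring one, here {4, 2}.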
After finishing the bin packing, it makes sense to run the FM algorithm again, because merged partitions may have a higher nearness to nodes in other partitions. This relationship is depicted in Fig. 10. Before merging, shifting the gray node would cause a loss of 1; after merging the partitions on the right, a gain of 3 can be achieved.
Fig. 10. Additional execution of the FM-algorithm
The whole process now looks as follows:
1) Hierarchical clustering
2) Selecting the best solution in the HC tree
3) Optimizing the cutting costs using the FM algorithm
4) Merging of non-busy partitions using the bin-packing algorithm
5) Optimizing the cutting costs again using the FM algorithm

To enable a rapid exploration of different architecture prototypes, we implemented these steps in a customized metric block in PREEvision (Fig. 11). PREEvision is based on the Eclipse platform and can therefore easily be extended with custom metric blocks. Our metric block consists of Java code that is executed directly in the framework [13].
To start the calculation, we provide a list of ECUs that shall be partitioned and a folder containing the allowed bus systems. This makes it possible to exclude ECUs from bus generation, e.g. to meet non-technical requirements. The same holds for the list of allowed bus systems, which allows certain network types to be included or excluded.
Additional data, e.g. communication requirements, is read directly from the EEA model. This provides all necessary input data for the steps described above. The metric block automatically generates the determined bus connectors and bus systems in the architecture model.
Fig. 11. PREEvision block plugin implementation
V. RESULTS
Since the real communication structure between the functions and ECUs is strictly confidential knowledge of the car manufacturers, no data from a real car was available to
test our approach. As a workaround, we designed a customizable random network generator. This generator features the following settings:
• min/max number of ECUs
• min/max number of connections
• min/max distance between the min/max number of connections
• min/max of the minimum data rate of connections
• min/max of the maximum data rate of connections

Furthermore, we implemented a likelihood that an ECU connects to ECUs of the same block of ten. This means that ECU 16 has a higher likelihood of connecting to ECUs 10-19 than to all others. In addition, the user can set a different data rate for each of these blocks. This helps to verify whether the algorithm correctly detects the ECUs that belong together. The network generator also allows the group size of ECUs belonging together to be set, but this prevents identifying the ECUs that belong together. Another setting allows the groups of adjacent ECUs to be set by hand.
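Such a generator can be sketched in a few lines. The parameter names and the fixed bias value are our own assumptions; the block-of-ten preference is implemented as a simple weighted choice.

```python
import random

def generate_network(n_ecus, n_connections, rate_range, same_block_bias=0.8,
                     seed=None):
    """Random ECU network: connections carry a random data rate, and an ECU
    prefers partners from its own block of ten (ECU 16 favors ECUs 10-19)."""
    rng = random.Random(seed)
    traffic = {}
    while len(traffic) < n_connections:
        a = rng.randrange(n_ecus)
        block = (a // 10) * 10
        if rng.random() < same_block_bias:
            b = rng.randrange(block, min(block + 10, n_ecus))  # same block
        else:
            b = rng.randrange(n_ecus)                          # any ECU
        if a == b or (min(a, b), max(a, b)) in traffic:
            continue  # no self-loops, no duplicate connections
        traffic[(min(a, b), max(a, b))] = rng.uniform(*rate_range)
    return traffic
```

The returned dictionary maps ECU pairs to data rates and can be fed directly into a partitioning benchmark.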
We generated 100 different networks using our network generator. The results are depicted in Fig. 12. While the "shared weighted" nearness appears to be the overall winner of the benchmark, this is only the case for most of the networks. Looking at the standard deviation shows that other nearness functions can also lead to a better solution for a specific network. Because of this, the result of every nearness function is calculated and the best one is selected. The graph in Fig. 12 is based on the best solution found for each network; the deviation of each solution from the best solution found is depicted in percent.
[Bar chart comparing the nearness functions (equal distribution, normal, relative, both-sided relative, weighted, shared weighted) for three variants: without optimization, bin packing, and FM + bin packing.]
Fig. 12. Comparison of implemented nearness functions and algorithms
VI. CONCLUSION AND FUTURE WORK
Our method to automatically partition communicating ECUs onto automotive networks allows different design alternatives to be evaluated rapidly. During the design phase of a vehicle, different architectures are investigated. Automatic ECU partitioning can help the designer to quickly generate a new network prototype when moving function blocks from one ECU to another. With our approach, all bus-system requirements are met. Since a subset of ECUs can be selected, the partitioning can also be executed for a specific set of ECUs only.
Modifications to the automatically generated network may of course be necessary, since political decisions always have to be considered during the design. Nevertheless, our tool can provide a very good starting solution that meets all requirements concerning data rates. The cost function on which the decision for a specific network is based can be set individually by the designer and can thus match the specific calculations of different car manufacturers. The current approach can also easily be extended with new bus systems, since it does not depend on a certain kind of bus.
In the next steps, we will try to improve the selection of bus systems. Currently, a certain bus is selected by a fixed bandwidth value. This could be extended by an in-depth configuration and scheduling for the selected bus, possibly yielding a better bandwidth utilization.
REFERENCES
[1] J. Broy and K. D. Mueller-Glaser, "The impact of time-triggered communication in automotive embedded systems," in Proc. Int. Symposium on Industrial Embedded Systems (SIES '07), Jul. 2007, pp. 353-356.
[2] J. Teich and C. Haubelt, Digitale Hardware/Software-Systeme: Synthese und Optimierung, 2nd ed. Berlin: Springer, 2007.
[3] J. Lienig, Layoutsynthese elektronischer Schaltungen – Grundlegende Algorithmen für die Entwurfsautomatisierung. Berlin: Springer, 2006.
[4] LIN Consortium, LIN Specification Package, revision 2.1, Nov. 2006.
[5] Robert Bosch GmbH, CAN Specification, 2nd ed., Stuttgart, Sep. 1991. [Online]. Available: http://www.semiconductors.bosch.de/pdf/can2spec.pdf
[6] FlexRay Consortium, FlexRay Communications System – Protocol Specification, version 2.1 revision A, Dec. 2005.
[7] MOST Cooperation, MOST Specification, rev. 3.0 E2, Jul. 2010.
[8] W. Zimmermann and R. Schmidgall, Bussysteme in der Fahrzeugtechnik: Protokolle und Standards, 3rd ed. Vieweg+Teubner, Sep. 2008.
[9] aquintos GmbH, E/E-Architekturwerkzeug PREEvision, 2009.
[10] R. Xu and D. Wunsch, Clustering (IEEE Press Series on Computational Intelligence). New York: IEEE Press, 2009.
[11] H. Kellerer, U. Pferschy, and D. Pisinger, Knapsack Problems. Berlin: Springer, 2004.
[12] S. Martello and P. Toth, Knapsack Problems: Algorithms and Computer Implementations. New York: John Wiley & Sons, 1990.
[13] B. Daum, Java-Entwicklung mit Eclipse 3.3: Anwendungen, Plugins und Rich Clients, 5th ed. Heidelberg: dpunkt.verlag, 2008.
978-1-4577-0660-8/11/$26.00 ©2011 IEEE
Abstract — Non-uniform sampling has been shown in several works to be a better scheme than uniform sampling for low-activity signals. With such signals it generates fewer samples, which means less data to process and lower power consumption. In addition, it is well known that asynchronous logic is a low-power technology. This paper deals with the coupling of a non-uniform sampling scheme and an asynchronous design in order to implement a digital filter. It presents the first design of a micro-pipeline asynchronous FIR filter architecture coupled to a non-uniform sampling scheme. The implementation has been done on an Altera FPGA board.
Index Terms — Asynchronous logic, non-uniform sampling,
FIR filter, FPGA.
I. INTRODUCTION
With the increasing system-on-chip complexity, several problems become more and more critical and severely affect the performance of the system. These issues
can take different forms, such as power consumption, clock distribution, electromagnetic emission, etc. Synchronous logic seems to be reaching its technological limits when dealing with these problems, whereas asynchronous logic has proven that it can be a better alternative in many cases. It is well known to have many interesting properties, such as immunity to metastable states [2], low electromagnetic noise emission [15], low power consumption [11][12], high operating speed [13][14], and robustness towards variations in supply voltage, temperature, and fabrication process parameters [16].
Moreover, non-uniform sampling, and especially level-crossing sampling, becomes more interesting and beneficial for specific signals like temperature, pressure, electrocardiograms or speech, which evolve smoothly or sporadically. Indeed, these signals can remain constant over a long period and vary significantly during a short period of time.
Therefore, using the Shannon theory to sample such signals leads to useless samples, which artificially increases the computational load: classical uniform sampling takes samples even if no change occurs in the input signal. The authors in [5] and [6] show how using the non-uniform sampling technique in ADCs leads to drastic power savings compared to Nyquist ADCs.
A new class of ADCs, called asynchronous ADCs (A-
ADCs) has been developed by the TIMA Laboratory [7]. This
A-ADC is based on the combination of a level-crossing
sampling scheme and a dedicated asynchronous logic [8]. The
asynchronous logic only samples digital signals when an event
occurs, i.e. a sample is produced by the A-ADC which
delivers non-uniform data in time. This event-driven
architecture combined with the level-crossing sampling
scheme is able to significantly reduce the dynamic activity of
the signal processing chain.
Many publications on non-uniform sampling are available in the literature but, to the best of our knowledge, none relates to the coupling of event-driven (asynchronous) logic and FIR filter techniques applied to a non-uniform sampling scheme.
This paper presents an asynchronous FIR filter architecture based on a micro-pipeline asynchronous design style, and shows a successful implementation of this architecture on a commercial FPGA board from Altera. The second section of the paper is dedicated to asynchronous logic, and more precisely to one kind of asynchronous circuit: the micro-pipeline. Some details about the A-ADC as well as the non-uniform sampling scheme are shown in the third section. The fourth section handles the asynchronous FIR filter algorithm and architecture. Finally, the fifth section presents the implementation results of the proposed architecture on a DE1 Altera FPGA board.
II. PRINCIPLES OVERVIEW
A. Asynchronous logic
Asynchronous logic is well known for interesting properties that synchronous logic does not have, such as low electromagnetic emission, low power consumption, robustness, etc. [1]. It has been proven that this logic improves the performance of Nyquist ADCs in terms of immunity to metastable states [2], low electromagnetic emission [3] and low power consumption [4]. This section briefly presents the main asynchronous logic principles. It also shows how asynchronous micro-pipeline circuits are built from two distinct parts: the data path and the asynchronous control path.
An event-driven FIR filter: design and implementation
Taha Beyrouthy, Laurent Fesquet
TIMA Laboratory – Concurrent Integrated Systems Group, Grenoble, France
[email protected] – [email protected]
Asynchronous principles
Unlike synchronous logic, where synchronization is based on a global clock signal, asynchronous logic does not need a clock to maintain synchronization between its sub-blocks. It is considered a data-driven logic, where computation occurs only when new data arrives. Each part of an asynchronous circuit establishes a communication protocol with its neighbors in order to exchange data with them. This kind of communication protocol is known as a "handshake" protocol. It is a bidirectional protocol between two blocks, called Sender and Receiver, as shown in Figure 1.
The sender starts the communication cycle by sending a request signal "req" to the receiver. This signal means that data is ready to be sent. The receiver starts the new computation after detecting the "req" signal, and sends back an acknowledge signal "ack" to the sender, marking the end of the communication cycle so that a new one can start.
The main gate used in this kind of protocol is the "Muller" gate, also known as the C-element. Thanks to its properties, it detects a rendezvous between different signals. The C-element is in fact a state-holding gate; Table 1 shows its output behavior.
Consequently, when the output changes from '0' to '1', we may conclude that both inputs are '1'. Similarly, when the output changes from '1' to '0', we may conclude that both inputs are now set to '0'. This behavior can be interpreted as an acknowledgement indicating when both inputs are '1' or '0'. This is why the C-element is extensively used in asynchronous logic and is considered the fundamental component on which the communication protocols are based.
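The state-holding behavior of Table 1 can be modeled in a few lines. This is a behavioral sketch, not a gate-level description; the class name is our own.

```python
class CElement:
    """Muller C-element: the output copies the inputs when they agree and
    holds its previous value when they differ."""

    def __init__(self, initial=0):
        self.out = initial

    def step(self, a, b):
        if a == b:          # both '0' or both '1': copy the common value
            self.out = a
        return self.out     # otherwise: hold the previous output
```

Driving it with req/ack pairs shows the rendezvous behavior: the output rises only once both inputs are high and falls only once both are low.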
Asynchronous Micro-pipeline circuits
Many asynchronous logic styles exist in the literature, and the choice of style affects the circuit implementation (area, speed, power, robustness, etc.). One of the best-known styles is the micro-pipeline style. Among all the asynchronous circuit styles, the micro-pipeline most closely resembles synchronous circuit design, due to its extensive use of timing assumptions [5]. As in a synchronous pipeline circuit, the storage elements are controlled by control signals; nevertheless, there is no global clock. These signals are generated by the Muller gates in the pipeline controlling the storage elements, as shown in Figure 3. A simple asynchronous micro-pipeline circuit can be built by using transparent latches as storage elements, as shown in Figure 4.
The Muller gate pipeline is used to generate the local clocks. The clock pulse generated in a stage overlaps the pulses generated in the neighboring stages in a specific, controlled, interlocked manner (depending on the handshake protocol). This circuit can be seen as an asynchronous data-flow structure composed of two main blocks: the "data path", which is clocked by a distributed gated-clock driver, and the "control path".
Figure 1: Handshake protocol established between two sub-blocks of an asynchronous circuit that need to exchange data with each other
Figure 2: C-Element or Muller gate
Input1  Input2  Output
0       0       0
0       1       previous output
1       0       previous output
1       1       1

Table 1: Truth table of the C-element. The output copies the input value when both inputs are equal, and maintains its previous value when the inputs differ.
Figure 3: Muller pipeline, controlling Latch chain
Figure 4: Micro-pipeline asynchronous circuit
III. ASYNCHRONOUS ANALOG TO DIGITAL CONVERTER – 'A-ADC'
Most real-life signals are time-varying in nature. The spectral content of these signals varies with time, which is a direct consequence of the signal generation process [6]. Synchronous ADCs are based on Nyquist architectures: they do not exploit the input signal variations. Indeed, they sample the signal at a fixed rate, without taking the intrinsic signal nature into account. Moreover, they are highly constrained by the Shannon theory, especially in the case of low-activity sporadic signals like electrocardiograms, seismic signals, etc. This leads to capturing and processing a large number of samples without any relevant information, and a useless increase in system activity and power consumption.
The Asynchronous Analog-to-Digital Converter (AADC)
presented in [7] and [8] is based on a non-uniform sampling
scheme called level-crossing sampling [9]. This system is only
driven by the information present in the input signal. Indeed, it
only reacts to the analog input signal variations.
A. Non-uniform - level crossing sampling
The sampling process strongly affects the performance of the subsequent Digital Signal Processing (DSP) chain. The best performance can be achieved if the signal is efficiently sampled. Several ways exist to sample an analog signal. The classical uniform sampling is well developed and well adapted to existing signal processing devices. Although it covers all existing DSP areas, it is not the best choice for all of them. In many cases, non-uniform sampling can be a better candidate, providing advantages such as reduced system complexity, compression, smarter data transmission and acquisition, etc., which are not attainable with the uniform sampling process.
With our non-uniform sampling scheme, a sample is only
captured when the Continuous Time (CT) input signal x(t)
crosses one of the defined levels (Figure 5).
For an M-bit resolution, 2^M − 1 quantization levels are regularly disposed along the amplitude range of the signal. Unlike classical Nyquist sampling, the samples are not
uniformly spaced out in time, because they depend on the
signal variations. Thus, together with the value of the sample ax_n, the time dtx_n = tx_n − tx_{n−1} is defined. It corresponds to the time elapsed since the previous sample ax_{n−1}. A local timer of period Tc is dedicated to recording dtx_n and delivering it, when necessary, along with ax_n.
Contrary to the usual sampling technique, the amplitude of the sample is known exactly and the time elapsed between two samples is quantized by a timer. The Signal-to-Noise Ratio (SNR) depends on the timer period Tc, and not on the number of quantization levels [8]. Thus, for a given implementation of the non-uniform sampling A/D converter (a fixed number of quantization levels L = 2^M − 1), the SNR can be tuned externally by changing the period Tc of the timer. In theory, for level-crossing sampling, the SNR can be improved as far as needed by reducing Tc [10].
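The sampling scheme can be sketched as follows: a sample (level, dtx_n) is emitted each time the signal crosses a quantization level, with the elapsed time quantized by the timer period Tc. The function name and the discrete-time input representation are illustrative assumptions.

```python
def level_crossing_sample(x, t, q, tc):
    """Emit a sample (level, dtx) each time the signal crosses one of the
    quantization levels (spacing q); dtx is the time since the previous
    sample, quantized by the timer period tc."""
    samples = []
    level = round(x[0] / q)               # start on the nearest level
    last_time = t[0]
    for xi, ti in zip(x[1:], t[1:]):
        while xi >= (level + 1) * q:      # upward crossing
            level += 1
            samples.append((level, round((ti - last_time) / tc) * tc))
            last_time = ti
        while xi <= (level - 1) * q:      # downward crossing
            level -= 1
            samples.append((level, round((ti - last_time) / tc) * tc))
            last_time = ti
    return samples
```

A slow ramp produces one sample per crossed level, while a constant signal produces no samples at all, which is exactly the activity-driven behavior described above.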
B. A-ADC architecture
Let δ be the A-ADC processing delay for one sample. Then proper capture of the signal x(t) must satisfy the tracking condition given by equation (1):

|dx(t)/dt| ≤ q/δ   (1)

where q is the A-ADC quantum, defined by equation (2):

q = E / (2^M − 1)   (2)

where E represents the amplitude dynamics of the A-ADC, and M its resolution.
The output digital value Vnum is converted to Vref by the DAC and compared to the CT input signal x(t) (Figure 6). If the difference is greater than q/2, the counter is incremented; if it is lower than −q/2, the counter is decremented. In all other cases nothing is done, and the output Vnum remains constant.
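The tracking loop of Fig. 6 can be modeled behaviorally as follows; the DAC is simply Vnum·q here, and the function name is our own.

```python
def track(x_samples, q, vnum0=0):
    """Step the counter by +/-1 whenever the input deviates from the DAC
    output Vnum*q by more than half a quantum; otherwise hold Vnum."""
    vnum = vnum0
    trace = []
    for x in x_samples:
        err = x - vnum * q
        if err > q / 2:
            vnum += 1          # counter incremented
        elif err < -q / 2:
            vnum -= 1          # counter decremented
        trace.append(vnum)     # in other cases Vnum remains constant
    return trace
```

As long as the tracking condition (1) holds, Vnum stays within one quantum of the input, and a constant input produces no counter activity.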
The output signal is composed of couples (ax_n, dtx_n), where ax_n is the digital value of the sample and dtx_n the time elapsed
Figure 5: Level crossing sampling. ‘q’ is considered as the A-
ADC quantum.
Figure 6: Block diagram of the A-ADC
since the previous converted sample ax_{n−1}, given by the timer as said before. Since the architecture of the A-ADC is asynchronous, it uses an asynchronous communication protocol (based on 'req' and 'ack' signals) to exchange data with its environment.
IV. ASYNCHRONOUS FIR-FILTER
A. Principles & Algorithm
A synchronous Nth-order FIR filter based on a uniform sampling scheme computes a digital convolution product (equation (3)):

y(nT) = Σ_{k=0}^{N} h(kT) · x((n−k)T)   (3)

where T is the sampling period.

In the non-uniform sampling scheme, the sampling time of the kth sample of the impulse response h does not necessarily correspond to the sampling time of the (n−k)th sample of the input signal ax (equation (4)):

th_k ≠ tx_{n−k}   (4)

The product of two samples is thus meaningless. To bypass this issue, the impulse response h of the filter is resampled and interpolated, as is the input signal ax. The new convolution product is then processed between these new samples (Figure 7).
The new convolution product is an area computation. The easiest way to compute this area is the rectangle method, i.e. a zero-order interpolation. This method splits the area corresponding to the convolution product into a sum of rectangle areas with different widths (Figure 7).
In order to compute each rectangle area, an iterative loop can do the job [10]:

y_n = Σ dt_min · ax_{n−k} · ah_j

If dt_min = dtx_{n−k} then k = k + 1
If dt_min = dth_j then j = j + 1
If dt_min = dtx_{n−k} = dth_j then j = j + 1 and k = k + 1

where dt_min = min(dtx_{n−k}, dth_j).

An example illustrating these iterations is shown in Figure 8.
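The iterative loop can be written out as follows. This is a sketch of the zero-order resampling convolution for a single output value; the flat-list data layout (samples already ordered for this output) is an assumption of ours.

```python
def async_fir(ax, dtx, ah, dth):
    """One non-uniform convolution product using rectangle (zero-order)
    interpolation: walk the input samples (ax, dtx) and the impulse-response
    samples (ah, dth) together, each rectangle being dt_min wide."""
    y = 0.0
    j = k = 0
    rem_h, rem_x = dth[0], dtx[0]      # remaining widths of current intervals
    while j < len(ah) and k < len(ax):
        dt_min = min(rem_h, rem_x)
        y += dt_min * ax[k] * ah[j]    # rectangle area contribution
        rem_h -= dt_min
        rem_x -= dt_min
        if rem_h == 0:                 # coefficient interval consumed: j + 1
            j += 1
            if j < len(ah):
                rem_h = dth[j]
        if rem_x == 0:                 # input interval consumed: k + 1
            k += 1
            if k < len(ax):
                rem_x = dtx[k]
    return y
```

When both interval widths are exhausted at the same time, both indices advance, matching the third case of the loop above; with uniform widths the result reduces to the classical convolution sum.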
B. FIR filter asynchronous micro-pipeline architecture
General architecture
The previously proposed algorithm is implemented with the feedback structure presented in Figure 9. This structure
Figure 7: Principle of the resampling scheme used in the
irregular FIR computation. The continuous lines represent the
original samples, whereas the dashed lines correspond to the
new resampled interpolated samples.
Figure 8 : an example of an asynchronous convolution
product.
describes the architecture of the FIR filter.
The "Delay Line" gets the sampled signal data (ax_n, dtx_n) from the A-ADC. The communication between these two blocks is based on the handshake protocol. The "Delay Line" is the memory block of the filter: a shift register that stores the input samples (magnitudes and time intervals). The output of this register is connected to a multiplexer (not shown) that selects samples depending on the value of the selection input "k". A ROM and another multiplexer (not shown) are used to store the impulse response coefficients; the coefficients are selected by the signal "j".
The "MIN" block has multiple functionalities:
1- It determines the minimum time interval dt_min of (dtx_{n−k}, dth_j).
2- It generates the selection signals 'j' and 'k' that control the selection process in the Delay Line.
3- It detects the end of a convolution product round and allows a new one to start. This functionality is based on monitoring the signal 'k': if 'k' reaches its maximum, all filter coefficients have been used and the convolution product is done; the filter is ready to perform a new one. At this point the MIN block generates a reset signal to the other blocks, commanding them to start a new convolution.
4- Finally, at the end of each convolution product cycle, MIN generates a 'reset' signal and an 'enable' signal to reset the output of the Accumulator and enable the output of the Buffer.
Then the "Multiplier" computes all sub-area values (dt_min · ax_{n−k} · ah_j), which are accumulated in the "Accumulator" in order to compute the convolution product. So far, the structure is a simple and logical translation of the iterative function presented in the previous section. In the micro-pipeline architecture, this part of the circuit is considered the "data path".
The challenge begins with defining the control path of the micro-pipeline of the filter. The control path is in charge of synchronizing the communication between the different parts of the data path, so that they work in complete harmony while exchanging their data.
Once the data and control paths are described, the filter implementation starts on the FPGA. In order to implement the control path, a specific asynchronous library has to be specified. This library contains Muller gates, asynchronous controllers and some other asynchronous functions [17].
As shown in Section II, each functional block has its own controller. For clarity, not all the controllers are presented; as an illustration, the controller of the MIN block is studied below.
Control block of MIN
The simplest block specified for our asynchronous controllers is the 'Linear_control' (Figure 10). It ensures the rendezvous between two incoming signals, 'req' and 'ack'. It has two inputs and two delayed outputs.
The first input represents an input request signal coming from
a previous ‘P’ block connected to the inputs of the MIN block.
The ‘P’ block sends along with its data a request signal to the
MIN block.
The second input is used for the input acknowledge signal that comes from the following block 'F'. The 'F' block receives data to compute from the MIN block and sends back an acknowledge signal; the MIN block is then ready to receive new data.
The Linear Controller also has two outputs. The first one is an output request signal, indicating whether MIN has finished its computation and thus whether new data is ready to be sent. This signal is sent to the controller of
Figure 9: Iterative structure of the asynchronous FIR filter
the 'F' block. The second output is the output acknowledge signal; it is sent to the 'P' block to indicate whether MIN is ready to receive new data.
In the case of the MIN block, it receives inputs from only one 'P' block: the delay line. This means only one 'req' input is sent to its controller. However, the MIN output is connected to more than one block: it is connected to the 'Multiplier', which receives dt_min, and to the 'Accumulator'. The 'Accumulator' receives a 'reset' signal from MIN in order to restart a new accumulation (a new accumulation corresponds to a new convolution product, as mentioned in the description of the MIN functionality).
Finally, one of the MIN outputs is connected to the Buffer input. At the end of each convolution product, the buffer receives an 'enable' signal from MIN in order to transfer the new convolution product value to the output. In conclusion, the MIN outputs are connected to four 'F' blocks. This means that the MIN controller receives four incoming acknowledge signals, one from each 'F' block, and has to wait for these four signals in order to generate new output data. Thus, a rendezvous between these four signals has to be processed. This is done by a block called "join_4", which is implemented with three 2-input Muller gates connected to each other. The MIN block and its controller are shown in Figure 11.
In practice, the MIN block is more complex: it is divided into three sub-blocks, each processing one of the functions previously described, and each sub-block has its own controller. The problem that can appear with multiple blocks connected to each other in a non-linear pipeline is a deadlock. These problems are managed manually, because no asynchronous design tools are available. The controllers of the other blocks are designed following the same steps.
C. Asynchronous FIR-filter implementation results
The micro-pipeline asynchronous FIR filter architecture, as well as part of the A-ADC, is implemented on a synchronous FPGA board: the DE1 from Altera. As mentioned before, a dedicated library had to be specified, because synchronous commercial FPGAs do not natively support asynchronous circuits. Figure 12 shows the simulation of our asynchronous FIR filter after place and route on the Altera FPGA. It is a low-pass FIR filter of 15th order; an input signal varying from 1 kHz to 18 kHz has been injected at its input.
V. CONCLUSION
An asynchronous FIR filter architecture was presented in this paper, along with an asynchronous analog-to-digital converter (A-ADC). The FIR filter architecture is designed using the micro-pipeline asynchronous style. This architecture has been successfully implemented for the first time on a commercial FPGA board (Altera DE1). A specific library has also been designed for this purpose; it allows the synthesis of asynchronous primitive blocks (the control path in our case) on the synchronous FPGA. Simulation results of the FIR filter after place and route validate the implementation. This work is ongoing, in order to optimize the implementation: we expect a very low-power FIR filter, with a reduction of the total power consumption by one order of magnitude.
REFERENCES
[1] M. Renaudin, “Asynchronous Circuits and Systems: a
Promising Design Alternative”, Journal of
Microelectronic Engineering, Vol. 54, pp. 133-149, 2000.
[2] D. Kinniment et al., "Synchronous and Asynchronous A-D Conversion", IEEE Trans. on VLSI Syst., Vol. 8, no. 2, pp. 217-220, April 2000.
[3] D.J. Kinniment et al., “Low Power, Low Noise
Micropipelined Flash A-D Converter”, IEE Proc. On
Circ. Dev. Syst., Vol. 146, n° 5, pp. 263-267, Oct. 1999.
Figure 10: primitive Linear Controller. The delay value depends
on the propagation delay of the functional block.
Figure 11: MIN block and its controller
Figure 12: Asynchronous FIR Filter after P&R simulation
[4] L. Alacoque et al., “An Irregular Sampling and Local
Quantification Scheme A-DConverter”, IEE Electronics
Letters, Vol. 39, n° 3, pp. 263-264, Feb. 2003.
[5] J. Sparsø and S. Furber (Eds.), Principles of Asynchronous Circuit Design - A Systems Perspective, Technical University of Denmark and The University of Manchester, UK.
[6] L. Williams, "A Stereo 16-bit Delta-Sigma A/D Converter for Digital Audio", Ph.D. dissertation, Stanford University, 1993.
[7] E. Allier, L. Fesquet, G. Sicard, M. Renaudin, “Low
Power Asynchronous A/D Conversion”, Proceedings of
the 12th International Workshop on Power and Timing,
Modeling,Optimization and Simulation (PATMOS‟02),
September 11-13 2002, Sevilla, Spain.
[8] E. Allier, G. Sicard, L. Fesquet,M. Renaudin, “A New
Class of Asynchronous A/D Converters Based on Time
Quantization”, ASYNC Proceedings, pp. 197-205, May
12-16 2003, Vancouver, Canada.
[9] J.W.Marketal.,“ANonuniformSamplingApproach to
Data Compression”, IEEE Trans. on Communication.
Vol. COM-29, n° 4, pp. 24-32, Jan. 1981. W.-K. Chen,
Linear Networks and Systems (Book style). Belmont,
CA: Wadsworth, 1993, pp. 123–135.
[10] F. Aeschlimann, E. Allier, L. Fesquet, M. Renaudin,
"Asynchronous FIR Filters: Towards a New Digital
Processing Chain," Asynchronous Circuits and Systems,
International Symposium on, pp. 198-206, 10th IEEE
International Symposium on Asynchronous Circuits and
Systems (ASYNC'04), 2004
[11] S.B. Furber, J.D. Garside, S. Temple, J. Liu, P. Day, and
N.C. Paver.AMULET2e: An asynchronous embedded
controller. In Proc. International Symposium on advanced
Research in Asynchronous Circuits and Systems, pages
290–299. IEEE Computer Society Press, 1997.
[12] L.S. Nielsen. Low-power Asynchronous VLSI Design.
PhD thesis, Department of Information Technology,
Technical University of Denmark, 1997. IT-TR:1997-12.
[13] SPeedster, a very high speed FPGA by Achronix:
http://www.achronix.com/
[14] A.J. Martin, A. Lines, R. Manohar, M. Nystr¨om, P.
Penzes, R. Southworth, U.V. Cummings, and T.-K. Lee.
The design of an asynchronous MIPS R3000. In
Proceedings of the 17th Conference on Advanced
Research in VLSI, pages 164–181. MIT Press, September
1997.
[15] N.C. Paver, P. Day, C. Farnsworth, D.L. Jackson, W.A.
Lien, and J. Liu. A low-power, low-noise configurable
self-timed DSP. In Proc. International Symposium on
Advanced Research in Asynchronous Circuits and
Systems, pages 32–42, 1998.
[16] L.S. Nielsen, C. Niessen, J. Sparsø, and C.H. van Berkel.
Low-power operation using self-timed circuits and
adaptive scaling of the supply voltage. IEEE Transactions
on VLSI Systems, 2(4):391–397, 1994.
[17] Quoc Thai Ho, J.-B. Rigaud, L. Fesquet, M. Renaudin, R.
Rolland, "Implementing asynchronous circuits on LUT
based FPGAs", The 12th International Conference on
Field Programmable Logic and Applications (FPL),
September 2-4, 2002, Montpellier (La Grande-Motte),
France.
Session 3: Prototyping Radio Devices
Applying Graphics Processor Acceleration in a Software Defined Radio Prototyping Environment
William Plishker, George F. Zaki, Shuvra S. Bhattacharyya
Dept. of Electrical and Computer Engineering
and Institute for Advanced Computer Studies
University of Maryland
College Park, Maryland
{plishker,gzaki,ssb}@umd.edu
Charles Clancy, John Kuykendall
Laboratory for Telecommunications Sciences
College Park, Maryland, USA
{clancy, jbk}@ltsnet.net
Abstract—With higher bandwidth requirements and more complex protocols, software defined radio (SDR) has ever-growing computational demands. SDR applications have different levels of parallelism that can be exploited on multicore platforms, but design and programming difficulties have inhibited the adoption of specialized multicore platforms like graphics processors (GPUs). In this work we propose a new design flow that augments a popular existing SDR development environment (GNU Radio) with a dataflow foundation and a stand-alone GPU-accelerated library. The approach gives an SDR developer the ability to prototype a GPU-accelerated application and explore its design space quickly and effectively. We demonstrate this design flow on a standard SDR benchmark and show that deciding how to utilize a GPU can be non-trivial for even relatively simple applications.
I. INTRODUCTION
GNU Radio [1] is a software development framework that
provides software defined radio (SDR) developers a rich
library and a customized runtime engine to design and test
radio applications. GNU Radio is extensive enough to describe
audio radio transceivers, distributed sensor networks, and radar
systems, and fast enough to run such systems on off-the-shelf
radio hardware and general purpose processors (GPPs). Such
features have made GNU Radio an excellent rapid prototyping
system, allowing designers to come to an initial functional
implementation quickly and reliably.
GNU Radio was developed with general purpose pro-
grammable systems in mind. Often initial SDR prototypes
were fast enough to be deployed on general purpose processors
or needed few custom accelerators. As new generations of
processors were backwards compatible with software, GNU
Radio implementations could track with Moore’s Law. As a
result, programmable solutions have been competitive with
custom hardware solutions that required longer design time
and greater expense to port to the latest process generation.
But with the decline in frequency improvements of GPPs, SDR
solutions are increasingly in need of multicore acceleration,
such as that provided by graphics processors (GPUs). SDR
is well positioned to make use of them since many SDR
applications have abundant parallelism.
GPUs are starting to be employed in SDR solutions, but
their adoption has been inhibited by a number of difficul-
ties, including architectural complexity, new programming
languages, and stylized parallelism. While other research is
addressing these topics [5], [6], one of the primary barriers
in many domains is the ability to quickly prototype the
performance advantages of a GPU for a particular application.
The inability to assess the performance impact of a GPU with
an initial prototype leaves developers doubting whether the time
and expense of targeting a GPU is worth the potential benefit.
Many design decisions must be made before arriving at an initial
multicore prototype, including mapping tasks to processors and
data to distributed memories. Mapping SDR applications is
further complicated by application requirements. The amount
of parallelism present may be dictated by the application itself
based on its latency tolerances and available vectorization
of the kernels. More vectorization tends to lead to higher
utilization of the platform (and therefore higher throughput),
but often at the expense of increased latency and buffer
memory requirements. Also an accelerator typically requires
significant latency to move data to or from the host processor,
so sufficient data must be burst to the accelerator to amortize
such overheads.
Ideally, application designers would simply be presented
with a Pareto curve of latency versus vectorization trade-offs
so that an appropriate design point can be selected. However,
vectorization generally influences the efficiency of a given
mapping. Thus, to fully unlock the potential of heterogeneous
multiprocessor platforms for SDR, designers must be able to
arrive at a variety of solutions quickly, so that the design space
may be explored along such critical dimensions.
To enable developers to arrive at an initial prototype that
utilizes GPUs, we introduce a new SDR design flow, as shown
in Figure 1. We begin with a formal description of an SDR
application, which we extract from a GNU Radio specification.
Formalisms provide the design flow with a structured, portable
application description which can be used for vectorization,
978-1-4577-0660-8/11/$26.00 c© 2011 IEEE
Fig. 1. Dataflow-founded SDR design flow.
latency, and other design decisions. These design decisions can
ultimately be incorporated into an SDR application through
a GPU-specific library of SDR actors. For this work, we
have constructed GRGPU, which is such a library written
for GNU Radio. We demonstrate the value of this approach
with a GNU Radio benchmark on a platform with a GPU.
II. BACKGROUND
Dataflow graphs are widely used in the modeling of signal
processing applications. A dataflow graph G consists of a set
of vertices V and a set of edges E. The vertices, or actors,
represent computational functions, and edges represent FIFO
buffers that can hold data values, which are encapsulated as
tokens. Depending on the application and the required level
of model-based decomposition, actors may represent simple
arithmetic operations, such as multipliers, or more complex
operations, such as turbo decoders.
A directed edge e(v1, v2) in a dataflow graph is an ordered
pair of a source actor v1 = src(e) and a sink actor v2 = snk(e), where v1 ∈ V and v2 ∈ V. When a vertex v executes
or fires, it consumes zero or more tokens from each input
edge and produces zero or more tokens on each output edge.
Synchronous Data Flow (SDF) [8] is a specialized form of
dataflow where for every edge e ∈ E, a fixed number of
tokens is produced onto e every time src(e) is invoked,
and similarly, a fixed number of tokens is consumed from
e every time snk(e) is invoked. These fixed numbers are
represented, respectively, by prd(e) and cns(e). Homogeneous
Synchronous Data Flow (HSDF) is a restricted form of SDF
where prd(e) = cns(e) = 1 for every edge e.

Given an SDF graph G, a schedule for the graph is a
sequence of actor invocations. A valid schedule guarantees
that every actor is fired at least once, there is no deadlock
due to token underflow on any edge, and there is no net
change in the number of tokens on any edge in the graph
(i.e., the total number of tokens produced on each edge during
the schedule is equal to the total number consumed from the
edge). If a valid schedule exists for G, then we say that G is
consistent. For each actor v in a consistent SDF graph, there
is a unique repetition count q(v), which gives the number of
times that v must be executed in a minimal valid schedule (i.e.,
a valid schedule that involves a minimum number of actor
firings). In general, a consistent SDF graph can have many
different valid schedules, and these schedules can differ widely
in the associated trade-offs in terms of metrics such as latency,
throughput, code size, and buffer memory requirements [4].
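The consistency check and repetition counts described above reduce to solving the balance equations prd(e) · q(src(e)) = cns(e) · q(snk(e)) over all edges. The sketch below is our illustration (not part of the paper's DIF toolchain; the actor names and rates are made up) and computes the minimal integer repetition vector for a connected SDF graph using rational arithmetic:

```python
from fractions import Fraction
from math import lcm

def repetition_vector(actors, edges):
    """Solve prd(e) * q(src) == cns(e) * q(snk) for every edge.
    edges: list of (src, snk, prd, cns). Returns the minimal integer
    q per actor, or None if the graph is inconsistent (no valid
    schedule exists). Assumes the graph is connected."""
    q = {a: None for a in actors}
    q[actors[0]] = Fraction(1)
    changed = True
    while changed:
        changed = False
        for src, snk, prd, cns in edges:
            if q[src] is not None and q[snk] is None:
                q[snk] = q[src] * prd / cns
                changed = True
            elif q[snk] is not None and q[src] is None:
                q[src] = q[snk] * cns / prd
                changed = True
            elif q[src] is not None and q[snk] is not None:
                if q[src] * prd != q[snk] * cns:
                    return None  # inconsistent: tokens cannot balance
    # Scale rates to the smallest positive integer solution.
    scale = lcm(*(f.denominator for f in q.values()))
    return {a: int(f * scale) for a, f in q.items()}

# A consistent SDF edge: A produces 2 tokens per firing, B consumes 3.
print(repetition_vector(["A", "B"], [("A", "B", 2, 3)]))  # {'A': 3, 'B': 2}
```

For this two-actor graph, the minimal valid schedule fires A three times and B twice, which is exactly the repetition count q described above.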
III. RELATED WORK
Many models of computation have been suggested to de-
scribe software radio systems. In [2], the advantages and
drawbacks of various models are investigated. Also different
dataflow models that can be applied to various actors of an
LTE receiver are demonstrated.
Actor implementation on GPUs is discussed in [13]. A GPU
compiler is described that takes a naive actor implementation
written in CUDA [11] and generates an efficient kernel
configuration that enhances the load balance on the available
GPU cores, hides memory latency, and coalesces data move-
ment. This work can be used in our proposed framework to
enhance the implementation of individual software radio actors
on a GPU. Raising the abstraction of CUDA programming
through program analysis is the focus of Copperhead [6].
In [12], the authors present a multicore scheduler that maps
SDF graphs to a tile based architecture. The mapping process
is streamlined to avoid the derivation of equivalent HSDF
graphs, which can involve significant time and space over-
head. In more general work, MpAssign [5] employs several
heuristics and allows different cost functions and architectural
constraints to arrive at a solution.
In [15], a dynamic multiprocessor scheduler for SDR
applications is described. The basic platform consists of a
Universal Software Radio Peripheral (USRP) and a cluster of
GPPs. A flexible framework for dynamic mapping of SDR
components onto heterogeneous multiprocessor platforms is
described in [9].
Various heuristics and mixed linear programming models
have been suggested for scheduling task graphs on homoge-
neous and heterogeneous processors (e.g., see [10]). In these
works, the problem formulations are developed to address
different objective functions and target platforms for imple-
menting the input application graphs.
The focus of this work is to construct a backend capable
of integrating specialized multicore solutions into a domain
specific prototyping environment. This should facilitate the
previously described dataflow based design flow, but should
also enable these other works to be applied in the field of SDR.
Any solution targeting a complex multicore system is unlikely
to produce the optimal solution with its first implementation.
The ability to quickly generate and evaluate many solutions on
a multicore platform should improve the efficacy of the approach
and ultimately the quality of the final solution.
IV. SDR DESIGN FLOW FOR GPUS
We implemented the design flow proposed in Figure 1
by using GNU Radio as the SDR description and runtime
environment and the Dataflow Interchange Format (DIF) [7]
for the dataflow representation and associated tools. Our GPU
target was CUDA enabled NVIDIA GPUs. With these tools in
place the design flow proceeds as described in the following
steps:
1) Designers write their SDR application in GNU Radio
with no consideration for the underlying platform. As
GNU Radio has an execution engine and a library of
SDR components, designers can verify correct function-
ality of their application. For existing GNU Radio appli-
cations, nothing must be changed with the description
to continue with the design flow.
2) If actors of interest are not in the GPU accelerated li-
brary, a designer writes accelerated versions of the actors
in CUDA. The design focuses on exposing the parallelism
to match the GPU architecture in as parametrized a way
as possible.
3) Through automated or manual processes, instantiated
actors are either assigned to a GPU or designated
to remain on a GPP. With complex trade offs between
GPU and GPP assignments possible, this step may be
revisited often as part of a system level design space
exploration. Dataflow provides a platform independent
foundation for analytically determining good mappings,
but designer insight is also a valuable resource to be
utilized at this step.
4) The mapping result is utilized by augmenting the origi-
nal SDR application description environment. By lever-
aging a stand-alone library of CUDA accelerated actors
for GNU Radio, the designer can describe and run the
accelerated application description with existing design
flow properties.
The following sections cover these steps in detail, specif-
ically as they relate to our instance of the design flow that
utilizes CUDA, GNU Radio, and DIF.
A. Writing GPU Accelerated Actors
After the application graph is described in GNU Radio, ac-
tors are individually accelerated using GPU specific tools. If an
actor of interest is not present in the GPU accelerated library,
the developer switches to the GPU customized programming
environment, which in our case is CUDA. The designer is still
saddled with difficult design decisions, but these decisions are
localized to a single actor. System level design decisions are
orthogonal to this step of the design process. While we do
not aim to replace the programming approach of the actors
functionality, the following design strategy lends itself to later
design space exploration by the developer.
As with other GPU programming environments, in CUDA
a designer must divide their application into levels of par-
allelism: threads and blocks, where threads represent the
smallest unit of a sequential task to be run in parallel and
blocks are groups of threads. In our experience, SDR actors
vary in how to use thread level parallelism, but tend to
realize block level parallelism with parallelism at the sample
level. The ability to tightly couple execution between threads
within a block creates a host of possibilities for the basic
unit of work within a block, be it processing a code word,
multiplying and accumulating for a tap, or performing an
operation on a matrix. Because blocks are decoupled, only
fully independent tasks can be parallelized. For SDR those
situations tend to arise between channels or between samples
on a single channel. Some samples may overlap between
blocks to support the processing of a neighboring sample, but
this redundancy is often more than offset by the performance
benefits of parallelization.
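As a concrete sketch of this block-level strategy, the helper below splits the output samples across decoupled blocks and widens each block's input range by a halo of (taps − 1) delayed samples, so that boundary outputs need no inter-block communication. This is an illustrative model of the partitioning, not GRGPU code; the block and tap counts are arbitrary:

```python
def partition_samples(num_outputs, num_blocks, num_taps):
    """Split output samples across decoupled blocks. Each block also
    reads (num_taps - 1) earlier input samples (the halo) so that an
    FIR can compute its boundary outputs independently."""
    per_block = -(-num_outputs // num_blocks)  # ceiling division
    ranges = []
    for b in range(num_blocks):
        out_start = b * per_block
        out_end = min(out_start + per_block, num_outputs)
        if out_start >= out_end:
            break
        # Input range includes the halo of delayed samples.
        in_start = max(0, out_start - (num_taps - 1))
        ranges.append((out_start, out_end, in_start, out_end))
    return ranges

# 1000 output samples, 4 blocks, a 60-tap filter:
for r in partition_samples(1000, 4, 60):
    print(r)
```

Each tuple is (first output, last output, first input, last input); the duplicated halo samples are the overlap that the text argues is more than offset by the benefits of parallelization.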
The performance of this parallelization strategy is strongly
influenced by the number of channels or the size of a chunk
of samples that can be processed at one time. When the
application requests processing on a small chunk of samples,
there are few blocks to spread across a GPU, leaving it
underutilized, while large chunks enable high utilization. The
performance difference between small and large chunks is
non-linear due to the high fixed latency penalty that both
scenarios experience when transferring data to and from the
GPU and launching kernels. When chunks are small, GPU
time is dominated by transfer time, but when chunks are larger,
computation time of the kernel dominates, which amortizes the
fixed penalty delay. As the application dictates these values,
actors must be written in a parametrized way to accommodate
different size inputs.
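The non-linear effect of chunk size can be illustrated with a simple cost model: total GPU time is a fixed launch/transfer penalty plus per-sample transfer and compute costs, so effective throughput rises steeply and then saturates as chunks grow. The constants below are invented for illustration only and do not come from the paper's measurements:

```python
def gpu_time_us(chunk, fixed_us=50.0, xfer_us=0.01, compute_us=0.002):
    """Hypothetical time to process one chunk on the GPU: a fixed
    kernel-launch/transfer penalty plus per-sample costs."""
    return fixed_us + chunk * (xfer_us + compute_us)

def throughput(chunk):
    """Samples processed per microsecond for a given chunk size."""
    return chunk / gpu_time_us(chunk)

for chunk in (64, 1024, 65536):
    print(chunk, round(throughput(chunk), 2))
```

Small chunks are dominated by the fixed penalty, large chunks by kernel compute time, which is the amortization behavior the text describes.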
B. Partitioning, Scheduling, and Mapping
Once actors are written, system level design decisions must
be made, such as assigning which actors are to invoke GPU
acceleration. With some applications, the best solution may
be to offload every actor that is faster on the GPU than it
is on the GPP. But in some cases, this greedy strategy fails
to recognize the work that could occur simultaneously on
the GPP, while the host thread with the kernel call waits for
the GPU kernel to finish. A general solution to the problem
would consider application features such as rates of firings,
dependencies, and execution times on each platform of each
actor, as well as architectural features such as the number
and types of processing elements, memories, and topology.
To simplify the problem, designers can cluster certain actors
together so that they are assigned to the same processor. To promote this
clustering, designers may partition the application graph.
Multirate applications also need to be scheduled properly to
ensure that firing rates and dependencies are properly accounted for.
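Checking that firing rates and dependencies are respected amounts to simulating token counts along each edge: a schedule is invalid if any firing would underflow a FIFO, and a repeatable schedule must leave every buffer at its initial level. A minimal checker in the SDF notation of Section II (our illustration only):

```python
def is_valid_schedule(schedule, edges):
    """Check an SDF schedule: no firing may consume tokens that are
    not yet available, and every edge must show no net token change.
    edges: list of (src, snk, prd, cns); schedule: list of actor names."""
    tokens = [0] * len(edges)
    for actor in schedule:
        # Consume from every input edge first; fail on underflow.
        for i, (src, snk, prd, cns) in enumerate(edges):
            if snk == actor:
                if tokens[i] < cns:
                    return False
                tokens[i] -= cns
        # Then produce onto every output edge.
        for i, (src, snk, prd, cns) in enumerate(edges):
            if src == actor:
                tokens[i] += prd
    # No net change on any edge means the schedule can repeat forever.
    return all(t == 0 for t in tokens)

edges = [("A", "B", 2, 3)]
print(is_valid_schedule(["A", "A", "A", "B", "B"], edges))  # True
print(is_valid_schedule(["B", "A"], edges))                 # False
```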
Fig. 2. GRGPU: A GNU Radio integration of GPU accelerated actors.
When the application can be extracted into a formal dataflow
model, schedulers will not only respect these constraints but
are able to optimize for buffer assignments [3]. The applicability
of such techniques to specialized multicore platforms
is still open research, and this design flow enables greater
experimentation with them for SDR applications. Manual
scheduling and mapping is likely to continue to dominate
smaller, more homogeneous mappings, but a grounding in
dataflow opens the door for new automation techniques. In
this work we focus on the design flow, conventions for writing
SDR actors, and integrating GPU accelerated actors with GNU
Radio.
C. GRGPU: GPU Acceleration in GNU Radio
We developed GPU accelerated GNU Radio actors in a
separate, stand-alone library called GRGPU. GRGPU extends
GNU Radio’s build and install framework to link against
libraries in CUDA as shown in Figure 2. After building against
CUDA libraries, the resulting actors may be instantiated along-
side traditional GNU Radio actors, meaning that designers
may swap out existing actors for GRGPU actors to bring
GPU acceleration to existing SDR applications. The traditional
GNU Radio actors run unaffected on the host GPP, while
GRGPU actors utilize the GPU.
When writing a new GRGPU actor, application developers
start by writing a normal GNU Radio actor including a C++
wrapper that describes the interface to the actor. The GPU
kernels are written in CUDA in a separate file and tied back
to the C++ wrapper via C functions such as device_work().
Additional configuration information may be sent in through
the same mechanism. For example, the taps of a FIR filter
typically need to be updated only once or rarely during the
execution, so instead of passing the tap coefficients during
each firing of the actor (taps sent from work() to device_work()
to the kernel call), they could be loaded into device mem-
ory when the taps are updated in GNU Radio. The CUDA
compiler, NVCC, is invoked to synthesize C++ code which
contains binaries of the code destined for the GPU along with
glue code formatted for C++. By generating the C++ instead of an
object file directly, we are able to make use of the standard
GNU build process using libtool. Even though the original
application description was in a different language, the code
is wrapped and built in the GNU standard way giving it
compatibility with previous and future versions of GNU and
GNU Radio.
When a GNU Radio actor is instantiated, a new C++
object is created which stores and manages the state of the
actor. However, state in the CUDA file is not automatically
replicated, creating a conflict when more than one GRGPU
actor of the same type is instantiated. To work around this
issue, we save CUDA (both host and GPU) state inside the
C++ actor, which includes GPU memory pointers of data
already loaded to the GPU. The state from the GPU itself is
not saved inside the C++ object, but rather the pointers to the
device memory are. Data residing in the GPUs memory space
is explicitly managed on the host, so saving GPU pointers is
sufficient for keeping the state of the CUDA portion of an
actor.
To minimize the number of host-to-GPU and GPU-to-
host transfers, we introduce two actors, H2D and D2H, to
explicitly move data to and from the device in the flow graph.
This allows other GRGPU actors to contain only kernels that
produce and consume data in the GPU memory. If multiple
GPU operations are chained together, data is processed locally,
reducing redundant I/O between GPU and host as shown in
Figure 3. In GNU Radio, the host-side buffers still exist,
connecting the C++ objects that wrap the CUDA
kernels. Instead of carrying data, these buffers now carry
pointers to data in GPU memory. From a host perspective,
H2D and D2H transform host data to and from GPU pointers,
respectively.
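The pointer-passing convention above (host buffers carry device pointers between H2D and D2H while the samples stay resident in GPU memory) can be sketched with a toy host-side model. The classes and functions below are illustrative stand-ins, not GRGPU's actual API:

```python
class FakeDeviceMemory:
    """Stand-in for GPU memory: maps integer 'device pointers' to data."""
    def __init__(self):
        self._mem, self._next = {}, 0
    def alloc(self, data):
        ptr = self._next
        self._next += 1
        self._mem[ptr] = list(data)
        return ptr
    def read(self, ptr):
        return self._mem[ptr]

gpu = FakeDeviceMemory()

def h2d(samples):
    """Host-to-device actor: consumes host samples, produces a pointer."""
    return gpu.alloc(samples)

def gpu_scale(ptr, k):
    """A 'kernel' actor: consumes and produces device pointers only."""
    return gpu.alloc(x * k for x in gpu.read(ptr))

def d2h(ptr):
    """Device-to-host actor: turns a pointer back into host samples."""
    return gpu.read(ptr)

# Chained GPU actors pass only pointers through the host-side buffers;
# the data itself never leaves "device memory" between H2D and D2H.
out = d2h(gpu_scale(gpu_scale(h2d([1, 2, 3]), 2), 10))
print(out)  # [20, 40, 60]
```

Only h2d and d2h touch host data, which mirrors how GRGPU avoids redundant host/GPU transfers for chained GPU actors.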
While having both a host buffer and a GPU buffer introduces
some redundancy, it has a number of benefits which make this
an attractive solution. First, there is no change to the GNU
Radio engine. The GNU Radio engine still manages data being
produced and consumed by each actor, so decisions on chunk
size or invocation order do not need to be changed with the
use of GRGPU actors. Second, GPU buffers may be safely
managed by the GRGPU actors. With GPU pointers being
sent through host buffers, actors need only concern themselves
with maintaining their own input and output buffers. This
provides dynamic flexibility (actors can choose to create and
free memory for data as needed) or static performance tuning
(actors can maintain circular buffers to and from which they
read and write a fixed amount of data). Such schemes
require coordination between GRGPU actors and potentially
information regarding buffer sizing, but the designer does have
the power to manage these performance critical actions without
redesigning or changing GRGPU. Future versions of GRGPU
could provide designers with a few options regarding these
schemes and even make use of the dataflow schedule or
other analysis to make quality design decisions. Finally, no
extraneous transfers between GPU and host occur. While the
host and GPU buffers mirror each other, no transfers occur
between them, which avoids I/O latencies that can be the cause
of application bottlenecks.
Fig. 3. GRGPU actors between H2D and D2H communicate data using the GPU's memory, avoiding unnecessary host/GPU transfers.
Fig. 4. SDF graph of the mp-sched Benchmark.
V. EVALUATION
We have experimented with the proposed design flow us-
ing the mp-sched benchmark. Figure 4 shows the mp-sched
benchmark pictorially. Each of the actors after the distributor
performs FIR filtering. To provide flexibility for evaluating
different multicore platforms, it is configurable with the number
of chains of FIR filters (pipelines) and the depth of the chains
(stages). This benchmark describes a flow graph that consists
of a rectangular grid of FIR filters. The dimensions of this
grid are parametrized by the number of stages (STAGES) and the number of pipelines (PIPES). The total number of FIR
filters is thus equal to PIPES×STAGES. This benchmark
represents a non-trivial problem for the multiprocessor sched-
uler as all actors in different pipelines can be executed in
parallel. More information about the mp-sched benchmark can
be found in [1].
A. FIR Filter Design
In this implementation [14], we take advantage of data
parallelism between the filter output samples as well as
functional parallelism to calculate every sample. For relatively
large chunks of samples, the CUDA kernel is configured such
that the number of blocks is equal to double the number of
available streaming multiprocessors. With this configuration,
the first level of data parallelism is achieved by making
every CUDA block responsible for calculating a different set
of output samples. In other words, the required number of
output samples are evenly distributed across the CUDA
blocks. To overcome the inherent stateful property of the FIR
filter (i.e., consecutive output samples depend on some shared
input samples), the input of every block must contain an extra
set of delayed input samples equal to the number of taps.
To reduce the number of device memory accesses, initially
all of the threads will perform a load of a coalesced chunk of
input elements to the shared memory of a multiprocessor. Then
every thread will be responsible for calculating the product
of a filter tap coefficient with an input sample, and adding
this product to the partial sum of the previous stage. After
processing a set of inputs, the threads perform a block store
of the calculated results to the GPU device memory.
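The per-block computation described above can be emulated on the host to check correctness before tuning a CUDA kernel. This NumPy sketch is our illustration, not the paper's kernel: each block receives its output range plus a halo of (taps − 1) delayed input samples and computes its outputs independently, then the result is compared against a reference FIR:

```python
import numpy as np

def blocked_fir(x, taps, num_blocks):
    """Emulate the block-parallel FIR: each block computes a contiguous
    range of outputs from its own input slice plus a halo of
    (len(taps) - 1) delayed samples, with no inter-block communication."""
    n_taps = len(taps)
    x_pad = np.concatenate([np.zeros(n_taps - 1), x])  # zero initial state
    n_out = len(x)
    bounds = np.linspace(0, n_out, num_blocks + 1, dtype=int)
    out = np.empty(n_out)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        # Block input: the halo plus the samples for outputs [lo, hi).
        block_in = x_pad[lo : hi + n_taps - 1]
        for i in range(hi - lo):
            # Each output is a tap-by-tap multiply-accumulate, written
            # here as a dot product over the reversed tap vector.
            out[lo + i] = np.dot(block_in[i : i + n_taps], taps[::-1])
    return out

x = np.random.default_rng(0).standard_normal(256)
taps = np.array([0.25, 0.5, 0.25])
ref = np.convolve(x, taps)[: len(x)]        # reference FIR output
assert np.allclose(blocked_fir(x, taps, 4), ref)
```

The duplicated halo samples are the "extra set of delayed input samples equal to the number of taps" that the text describes.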
B. Empirical Results
We explored a variety of design points of mp-sched to evaluate
the utility of rapid prototyping with GRGPU. The target
platform was two GPPs (Intel Xeon CPUs, 3 GHz) and a GPU
(an NVIDIA GTX 260). The actors performed 60-tap FIR
filtering with either CUDA acceleration in the case of GPU
accelerated actors or SSE acceleration in the case of the GPP.
To minimize the latencies incurred by using H2D and D2H,
the GPU accelerated actors were clustered together leaving
remaining GPP actors similarly clustered.
In our exploration of the mp-sched implementation design
space, each pipeline was located in a separate
thread and the number of actors with GPU acceleration was
configurable. Mp-sched pipelines could run in parallel and
share the GPU as an acceleration resource during runtime.
Multiple pipelines with GPU accelerated actors were forced to
serialize their GPU accesses according to CUDA conventions.
For example, one possible solution to a 2x20 instance of
mp-sched is shown in Figure 5. The Gantt chart is not to
scale, but shows how the two different pipelines (one in red
and one in blue), are able to run in parallel on the two
GPPs, but must have exclusive access to the GPU when
running accelerated actors. While the cross thread sequencing
was not specified at runtime, GRGPU’s ability to specify
acceleration and clustering enables the creation of complete
multicore, GPU-accelerated solutions.
The problem for a designer is then to leverage GPPs
and GPU, weigh SSE acceleration and CUDA acceleration,
account for communication latencies between GPU and GPP
and thread to thread, and consider how all of this will occur
in parallel. Models and automated techniques should continue
to assist in providing good starting points, but a necessary
condition to arriving at a quality solution is still the ability to
try many points quickly.
To this end, we constructed an illustrative example that
produces an interesting set of design points: mp-sched with 20
stages and a varying number of pipelines. Figure 6 shows a
sub-sampling of the design space. “All GPP” means all stages
of all pipelines are assigned to the GPPs, while “All GPU”
means all stages of all pipelines are assigned to the GPU.
“3/4 GPP”, “Half GPP”, and “3/4 GPU” indicate that three
quarters, one half, or one quarter of the stages of all pipelines
are assigned to the GPP, respectively, while the remaining
actors use the GPU. For example, Figure 5 shows the 2x20
Half GPP solution. We also evaluated solutions in which
one of the pipelines was all GPP and the rest GPU (“One
GPP”) and the reverse (“One GPU”). In the case of only one
pipeline, these solutions were equivalent to an all GPP or all
GPU solution. We ran each solution for 200,000 samples and
recorded the execution time, including GNU Radio overheads,
communication overheads, etc.
Fig. 5. Gantt chart for the 2x20 mp-sched graph on 2 GPPs and 1 GPU. The blue and the red sets of blocks and arrows each represent one branch of the mp-sched instance.
For the 60 tap FIR filter, SSE acceleration performs well,
but is still somewhat slower than the GPU implementation, so
once a sufficient amount of computation is located on the
GPU, GPU weighted implementations tend to perform better.
But this graph does reveal that the GPU should be employed
in different ways depending on the number of pipelines. For
example, a single pipeline implies that there is not quite
enough computation present to merit GPU acceleration. How-
ever when 2 or more pipelines are used, the GPPs become
saturated to the point that GPU acceleration can improve upon
the result. When 4 pipelines are needed, one GPP-only pipeline
proves higher-performing than an all-GPU solution, indicating
that the GPU itself has become saturated with computation and
that employing more of the GPP is appropriate. In each of the
cases, retrospective reasoning gives us insight into improving
performance, but a change in GPU, communication latencies,
etc. would likely change this space again, leaving a designer
to re-explore the design space.
It should be possible to arrive at these solutions more
analytically to accelerate the design space exploration, but
inevitably a set of points will need to be evaluated to judge
the efficacy of any analytical assistance. GRGPU will continue
to provide value in such a scenario, feeding back empirical
solutions to the design space exploration engine.
VI. CONCLUSION AND FUTURE WORK
As SDR attempts to leverage more special purpose multi-
core platforms in complex applications, application developers
must be able to quickly arrive at an initial prototype to
understand the potential performance benefits. In this paper,
we have presented a design flow that extends a popular SDR
environment, lays the foundation for rigorous analysis from
formal models, and creates a stand-alone library of GPU
accelerated actors which can be placed inside of existing
applications. GPU integration into an SDR specific program-
ming environment allows application designers to quickly
evaluate GPU accelerated implementations and explore the
design space of possible solutions at a system level.
Useful directions for future work include new methods
for dealing with scheduling, partitioning, and mapping for
multicore systems along with evaluating existing automation
Fig. 6. A sampling of the design space of 1x20, 2x20, 3x20, and 4x20 mp-sched graphs on 2 GPPs and 1 GPU for different assignments.
solutions that have been developed. Also, GRGPU should
be able to extend to multi-GPU platforms by customizing
GRGPU actors to communicate and launch on a specific GPU.
Acknowledgments
This research was sponsored in part by the Laboratory for
Telecommunication Sciences, and Texas Instruments.
REFERENCES
[1] GNU Radio. http://gnuradio.org/redmine/wiki/gnuradio, Nov. 2010.
[2] H. Berg, C. Brunelli, and U. Lucking. Analyzing models of computation for software defined radio applications. In Proc. IEEE International Symposium on System-on-Chip, pages 1–4, Nov. 2008.
[3] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. Software Synthesis from Dataflow Graphs. Kluwer Academic Publishers, 1996.
[4] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. Synthesis of embedded software from synchronous dataflow specifications. Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 21(2):151–166, June 1999.
[5] Y. Bouchebaba, P. Paulin, A. E. Ozcan, B. Lavigueur, M. Langevin, O. Benny, and G. Nicolescu. MpAssign: A framework for solving the many-core platform mapping problem. In Rapid System Prototyping (RSP), June 2010.
[6] B. Catanzaro, M. Garland, and K. Keutzer. Copperhead: Compiling an embedded data parallel language. Technical Report UCB/EECS-2010-124, EECS Department, University of California, Berkeley, Sep. 2010.
[7] C. Hsu, I. Corretjer, M. Ko, W. Plishker, and S. S. Bhattacharyya. Dataflow interchange format: Language reference for DIF language version 1.0, user's guide for DIF package version 1.0. Technical Report UMIACS-TR-2007-32, Institute for Advanced Computer Studies, University of Maryland at College Park, June 2007. Also Computer Science Technical Report CS-TR-4871.
[8] E. A. Lee and D. G. Messerschmitt. Synchronous dataflow. Proceedings of the IEEE, 75(9):1235–1245, September 1987.
[9] V. Marojevic, X. R. Balleste, and A. Gelonch. A computing resource management framework for software-defined radios. IEEE Transactions on Computers, 57:1399–1412, 2008.
[10] R. Niemann and P. Marwedel. Hardware/software partitioning using integer programming. In Proc. of the European Design and Test Conference, pages 473–479, Mar. 1996.
[11] NVIDIA. CUDA C programming guide version 3.1.1, July 2010.
[12] S. Stuijk, T. Basten, M. C. W. Geilen, and H. Corporaal. Multiprocessor resource allocation for throughput-constrained synchronous dataflow graphs. In Proc. of the 44th Annual Design Automation Conference, DAC '07, pages 777–782, June 2007.
[13] Y. Yang, P. Xiang, J. Kong, and H. Zhou. A GPGPU compiler for memory optimization and parallelism management. In Proc. of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2010.
[14] G. Zaki, W. Plishker, T. O'Shea, N. McCarthy, C. Clancy, E. Blossom, and S. S. Bhattacharyya. Integration of dataflow optimization techniques into a software radio design framework. In Proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, pages 243–247, Pacific Grove, California, November 2009. Invited paper.
[15] K. Zheng, G. Li, and L. Huang. A weighted-selective scheduling scheme in an open software radio environment. In IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, pages 561–564, Aug. 2007.
Validation of Channel Decoding ASIPs: A Case Study
Christian Brehm, Norbert WehnMicroelectronic Systems Design Research Group
University of Kaiserslautern, Germany{brehm, wehn}@eit.uni-kl.de
Sacha Loitz, Wolfgang KunzElectronic Design Automation Research Group
University of Kaiserslautern, Germany{loitz, kunz}@eit.uni-kl.de
Abstract— It is well known that validation and verification are the most time-consuming steps in complex System-on-Chip design. Thus, different validation and verification approaches and methodologies for various implementation styles have been devised and adopted by industry. Application-specific instruction-set processors (ASIPs) are an emerging implementation technology to solve the energy efficiency/flexibility trade-off in baseband processing for wireless communication, where multiple standards have to be supported at a very low power budget and a small silicon footprint. In order to balance these contrary aims, ASIPs for these application domains have a restricted functionality tailored to a specific class of algorithms compared to traditional ASIPs. The downside of this outstanding efficiency/flexibility ratio is the combination of unfavorable attributes for validation. Compared to standard processors, these ASIPs often have a very complex instruction set architecture (ISA) due to the tight coupling between the instructions and the optimized micro-architecture, requiring new validation concepts.
This paper sensitizes the reader to the distinctiveness and complexity of validating ASIPs tailored to channel decoding. In a case study, a composite approach comprising formal methods as well as simulations and rapid prototyping is applied to validate an existing channel decoding ASIP and transfer it into an industrial product.
I. INTRODUCTION
Today's and future wireless communication networks require flexible modem architectures to support seamless services between different network standards. Next-generation handsets will have to support multiple standards, such as UMTS, LTE, DVB-SH, or WiMax. This creates the demand for the design of flexible, yet power- and area-efficient solutions for baseband signal processing, which is one of the most computation-intensive tasks in mobile wireless devices [1].
Application-specific instruction set processors are a very promising candidate for this task, as they promise a much higher flexibility than dedicated architectures and a better energy efficiency than general purpose processors (GPPs) [2]. For many applications, efficient ASIP designs are best derived from standard processor pipelines in a top-down manner. This is done by adding functionality and instructions for the most common kernel operations of the targeted algorithms, such as an FFT.
Also in the field of channel decoding ASIPs are verypopular, as they are seen as an elegant way to cope with the
978-1-4577-0660-8/11/$26.00 c© 2011 IEEE
vast amount of different coding schemes and their parameters, e.g. [3]–[7]. However, these ASIP designs often have little in common with an enhanced standard processor pipeline. Energy and area efficiency demand distributed memories embedded into the pipeline, which are typical for many state-of-the-art decoding schemes, as well as the unification of the commonalities of several dedicated architectures. Fully customized deep pipelines with non-standard memory interfaces and instructions tailored to the targeted algorithms are the consequence. Minimal support for flow control operations is added, resulting in a weakly programmable architecture that offers no more than the desired flexibility. We denote this type of ASIP as Weakly Programmable IP Cores (WPIPs).
While WPIPs combine many advantages of standard IP block design and programmable architectures, they also inherit the drawbacks of the respective implementation styles w.r.t. validation. This has so far barely been addressed by the research community. Muller [8] and Alles [9] have presented rapid prototyping platforms for ASIPs, which can be used for testing purposes. But both approaches are far too inflexible, as they can only show the presence of errors, never their source or their absence.
The rest of the paper is structured as follows: we will
• illuminate the differences in the design flows for the various implementation styles (Section II) and the challenges in WPIP validation (Section III),
• quantify the effort required for different verification and validation tasks in order to underline their importance, and
• introduce our validation approach in a case study in which it was successfully applied to our FlexiTreP ASIP in order to bring it to product level (Section IV).
II. IMPLEMENTATION STYLES
Design methodologies for the implementation of digital signal processing systems consist of two phases. The goal of the first phase is to make all functional design decisions, from algorithm selection down to quantization. Purely functional, software-based system models are used at this stage to guarantee the desired functional behavior. For state-of-the-art communication systems this step is particularly challenging, as the communications performance of today's channel codes cannot be evaluated analytically. Instead, extensive Monte Carlo simulations have to be performed to determine the
Fig. 1. Implementation Styles [10]
bit-error or frame-error performance of every single design candidate. At the end of this iterative refinement and evaluation process stands the so-called Golden Reference Model, a functional software implementation of the system.
The second phase deals mostly with non-functional aspectsof the actual system implementation. Various implementationstyles exist and the right choice depends on flexibility andenergy or area efficiency requirements (cf. Figure 1). By farthe most challenging part at this stage is the validation andverification of the implementation against the Golden Refer-ence Model. Their properties are highlighted in the following.
• General Purpose Processors (GPP) offer the greatestflexibility of all implementation styles. It is also easilypossible to upgrade such systems to support new featuresor new standards by a simple update of the systemsoftware. Another advantage of this implementation styleis the comparatively low effort for system validation andverification. Given the correctness of the processor anda functional model of the instruction set architecture (socalled ISA model), the application software can be vali-dated independently from the underlying processor. Thecorrectness of the hardware is often proven with formalverification methods and guaranteed by the manufacturer.The big drawback of such platforms is their very low areaand energy efficiency.
• Dedicated, hardwired architectures in contrast offer thehighest implementation efficiency. For such architectures,traditional synthesis based design flows are widely used.As the RTL (register transfer level) hardware descrip-tion is typically derived at least in parts by iterativerefinement of the well elaborated golden reference model,the correctness of the implementation can be shown bysimulation or formal methods. The high implementationefficiency of dedicated architectures of course comes atthe cost of very limited flexibility.
• Application Specific Instruction Set Processors (ASIP)[11]–[13] try to close the gap between dedicated hard-wired and programmable off-the-shelf solutions. Typi-cally, the instruction set of a GPP is enhanced with
TABLE I
CHARACTERISTICS OF ASIP IMPLEMENTATION STYLES

  Top-Down Approach (classical ASIP)      | Bottom-Up Approach (WPIP)
  ----------------------------------------|----------------------------------------------
  Standard pipeline                       | Application-specific pipeline
  Standard memory access scheme           | Application-specific memory access and
                                          | organization
  Standard instruction set extended by    | Only application-specific instructions
  application-specific instructions       | defining the interplay of functional blocks
  Single-context instructions             | Multi-context instructions
special non-standard instructions to allow more efficient processing of the algorithms under consideration. These additional instructions, identified through detailed analysis and profiling of the algorithms, are supported by additional dedicated stages which are inserted into the processor pipeline. Thus, the original instructions of the GPP remain unchanged and the ISA is only extended. Concepts from standard processor validation are still applicable.
• Weakly programmable IPs (WPIPs), too, are ASIPs, but they are created in a bottom-up approach, starting from dedicated architectures rather than from a GPP. The commonalities of dedicated architectures with similar kernel operations and memory requirements are extracted and unified in a fully customized pipeline with a custom, scattered memory architecture offering exactly the required bandwidth and flexibility. The characteristic differences compared to traditional ASIPs are contrasted in Table I. The gain of this approach is a performance and energy efficiency very close to that of dedicated architectures while at the same time offering at least the minimum flexibility required from programmable architectures (see Figure 1). Thus, they are the preferable implementation style for upcoming multi-standard channel decoder implementations. The biggest challenge in WPIP design, however, is validation.
III. WPIP VALIDATION
While WPIPs inherit many desirable properties from dedicated as well as programmable architectures, this is not true for the ease of validation. The ISA of a WPIP is not designed explicitly, but merely emerges from the combination of architectures. Thus, there is no standard ISA model that can be used in tools for formal verification. The tight coupling of hardware and software hinders separate validation of the WPIP architecture without the applications running on it. Furthermore, the pipeline of a typical WPIP is very deep (e.g., 15 stages in the case of [14]) and contains a complex system of irregularly sized, distributed memories, which may even be accessed in an out-of-order fashion in several pipeline stages. This creates inter-instruction dependencies over many clock cycles, exceeding the capabilities of formal verification tools commonly available today. Taking these properties into account, the following approaches turned out to be potentially appropriate for WPIP validation.
TABLE II
RUNTIMES FOR SIMULATIONS AND VERIFICATION (FOR 10 k BLOCKS)

  Simulation                              | Runtime | Throughput
  ----------------------------------------|---------|--------------
  Property checking [16]                  | 18 h    | 83 properties
  Monte Carlo simulation (software):      |         |
    Viterbi, 1k info bits, w/ ASIP        | 0.7 h   | 3.8 kbps
    bTC UMTS, w/ ASIP                     | 10 h    | 1.4 kbps
    bTC UMTS, w/ SW reference decoder     | 15 min  | 58 kbps
  RTL sim., bTC UMTS, only ASIP           | 47 h    | 0.3 kbps
A. Formal Verification
Formal methods prove the absence of errors, while simulation can only ever show their presence.
Loitz et al. [15] have recently established a way to reduce the complexity of interval property checking for WPIPs by composing instructions of micro-operations, making the validation of complex WPIP instructions feasible. Their completeness approach enables a formal proof that each of the complex instructions behaves as intended. Although this does not necessarily prove the design as a whole to behave as expected, verification of each instruction can detect errors such as saturation or rounding issues. However, formal verification of the system behavior is as yet impossible and still under research.
B. Simulations
For programmable architectures based on a standard instruction set, verification of the instruction set guarantees that any arbitrary algorithm can be implemented. In contrast, for WPIPs the correctness of all instructions is not sufficient, since it does not guarantee that the intended system behavior can be implemented. Hence, despite the confidence that formal methods provide, there is still a demand for simulations. In particular, they are invaluable for analysis during the program development phase, which has turned into a challenging and error-prone task due to the optimized instruction set.
As WPIPs are designed to implement a large number of possible standards, a purely simulation-based validation approach needs to simulate every channel decoding application and compare a statistically significant number of computed values to the respective golden reference model or to the respective frame or bit error rate (FER, BER) point specified in the communication standards. Furthermore, WPIPs are not created by refinement from the golden reference model, as dedicated architectures are. Hence, there is no structural similarity between the two that could be exploited for validation. An approach to cope with this will be presented in Section IV.
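The statistical core of such a simulation-based check can be sketched as a Monte Carlo frame-error-rate estimate. The decoder, channel model, and all parameters below are illustrative stand-ins, not the actual golden reference setup:

```python
import random

def estimate_fer(decode, num_blocks, block_len, p_bit_flip, seed=0):
    """Estimate the frame error rate (FER) by Monte Carlo: push random
    frames through a binary symmetric channel and count the frames the
    decoder fails to reconstruct."""
    rng = random.Random(seed)
    frame_errors = 0
    for _ in range(num_blocks):
        tx = [rng.randint(0, 1) for _ in range(block_len)]
        rx = [bit ^ (rng.random() < p_bit_flip) for bit in tx]
        if decode(rx) != tx:
            frame_errors += 1
    return frame_errors / num_blocks

# Toy "decoder" with no error correction: every flipped bit stays, so
# the estimate approaches 1 - (1 - p)^block_len, roughly 1e-2 here.
fer = estimate_fer(lambda rx: rx, num_blocks=10_000,
                   block_len=100, p_bit_flip=1e-4)
```

Resolving an FER point near 10^-2 with reasonable confidence already takes on the order of 10^4 frames per operating point, which is why the per-block runtimes of Table II dominate the total validation effort.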
Simulation times of WPIPs exceed those of the golden reference or even an RTL model of a dedicated architecture by orders of magnitude (see Table II). It quickly becomes infeasible to perform statistical Monte Carlo simulations as the only means of implementation validation.
TABLE III
SIMULATIONS REQUIRED FOR MULTI-STANDARD SYSTEM VALIDATION

  Standard    | Code                 | Code types (enc. type, tailing, | Different blocksizes
              |                      | rates, polynomials)             | per type
  ------------|----------------------|---------------------------------|---------------------
  GSM/EDGE    | Viterbi, 256 states  | 3                               | 375
              | Viterbi, 64 states   | 8                               | 1000
              | Viterbi, 16 states   | 4                               | 724
  LTE         | bin. Turbo           | 1                               | 188
  UMTS/HSPA   | bin. Turbo           | 1                               | 5075
  CDMA2k      | bin. Turbo           | 1                               | 18
  WiMax       | duobin. Turbo        | 1                               | 17

  Overall different code blocks: 17,319
Finally, one of the biggest advantages designers hope for when choosing programmable architectures is the flexibility to easily extend the system to new standards. While this is feasible through software adjustments for GPPs or ASIPs based on standard ISAs, every change in the pipeline of a WPIP effectively poses a potential change to the functional behavior of every single application running on it. As there is no clear separation between the WPIP and the applications by means of a well-defined ISA, and hardware is shared among the supported algorithms, even small changes can require a complete re-validation of all implemented applications.
C. Rapid Prototyping
Simulations are a powerful validation method. The drawback is that simulations with a sufficient number of test vectors take very long, up to several days, depending on complexity and computational intensity. This problem can be attenuated by rapid prototyping: the simulation is transferred to an acceleration platform, usually an FPGA board, and run there. A sophisticated variant is often denoted as "hardware in the loop", where the testbench or simulation environment remains in software and the device under test is integrated as a real hardware component.
D. Combined Approach
The specific properties of WPIPs and the advantages and disadvantages of the validation methods described above imply that none of these common approaches for traditional implementation styles is applicable on its own. A mix of formal methods and simulation or emulation is appropriate.
In the next section we introduce this approach for WPIP validation using the example of a WPIP designed for industrial use and quantify the validation and verification effort.
IV. CASE STUDY: FLEXITREP VALIDATION
The WPIP for the case study is FlexiTreP [14], a Flexible Trellis Processing engine. With its capability of decoding binary and duobinary turbo and convolutional codes, it supports the most important wireless communication standards, among others UMTS, LTE, DVB-SH, and WiMax. It comprises 15 pipeline stages and seven memories that are accessed in different pipeline stages. The pipeline is dynamically reconfigurable in order to react to code changes. The pipeline is implemented in
Fig. 2. ASIP Design and Validation Flow
a high-level processor description language, LISA [17]. From this description, a cycle-accurate C++ simulation model as well as a synthesizable RTL model are generated using Synopsys Processor Designer [11].
For validation we applied a combined approach according to Figure 2: with the properties we gained during the implementation phase from our system specification and the existing implementation knowledge, we can formally verify that the instructions described in our high-level language work correctly by applying property checking to the RTL model (cf. left part of Figure 2). This can be done independently from application program development and excludes a wide range of errors such as memory access conflicts (e.g., from stall units), address range faults, or rounding and saturation faults.
In addition to the instruction verification, a huge amount of simulation remains to be done, as shown in the right part of Figure 2. Simulations are mandatory for two purposes. For channel decoding architectures like FlexiTreP, the algorithmic performance needs to be proven. This is only possible by Monte Carlo simulations comparing FER or BER against the specification in the respective standards with all their parameters. Therefore, the approach from [8] is not applicable, since only single blocks can be decoded with this platform. Rapid prototyping as introduced in [9] is an option for gathering BER/FER performance but lacks flexibility for analysis, debugging, and application development: usually the designer wants to check a modification as early as possible. Following traditional IP block design approaches, smaller functional parts of the pipeline are compared against the reference model. This shortens hardware as well as software development, which consumes a great amount of time due to the sophisticated ISA. For a first test of minor modifications, simulations on single blocks are perfectly suitable and save simulation time.
Table III shows that for the validation of FlexiTreP for the required standards, more than 17,000 different code blocks
Fig. 3. Generic System Simulation Chain
exist. Each of them can be combined with various typical parameters depending on the code (e.g., window size, block or acquisition length, . . . ), so that the theoretical number of possible combinations multiplies. However, many parameters can be considered constant for a given code, as the values providing the best properties are known. Nevertheless, for each code block, hundreds of thousands of bits have to be simulated in order to reach statistical significance. As an example, let us assume that 10,000 blocks (corresponding to a FER of 10^-2) per simulation were sufficient. Table II lists the simulation times for a single block for each case. A complete validation of FlexiTreP for the given standards would require computed simulation times of more than five years.
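A back-of-envelope check of this order of magnitude, using the figures from Tables II and III; the blocks-per-run and bits-per-frame values are assumptions for illustration:

```python
# Rough total for a purely software-based Monte Carlo validation on
# the cycle-accurate ASIP model (bTC UMTS row of Table II: 1.4 kbps).
code_blocks = 17_319        # different code blocks, Table III
runs_per_block = 10_000     # Monte Carlo frames per code block (FER 1e-2)
bits_per_frame = 1_000      # assumed average information block length
throughput_bps = 1_400      # cycle-accurate simulation throughput

total_bits = code_blocks * runs_per_block * bits_per_frame
years = total_bits / throughput_bps / (3600 * 24 * 365)
print(f"~{years:.1f} years of simulation")  # on the order of years
```

Under these assumptions the total already lands in the multi-year range, consistent with the infeasibility argued in the text.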
In order to reduce this simulation effort, we have set up a simulation environment modeling the channel encoding and decoding chain as depicted in Figure 3. It is arbitrarily configurable: modules can be exchanged, added, or removed according to the needs. For comparison of a design against a reference, it is possible to instantiate the design under test (DUT) and an arbitrary reference model in parallel. The deployed reference models are I/O-equivalent to the implementation; they are well elaborated and also proven by existing hardware implementations.
This enables debugging and program development in an environment supporting the full flexibility of the design. Functional simulations can be run without the fairly slow cycle-accurate model, solely on the basis of the well-elaborated reference models, which reduces simulation times by up to an order of magnitude, depending on the code. In conjunction with formal methods, we still obtain high quality while simulation times reduce to a few days. The additional time for property checking is negligible once the properties are set up. Verification can be done independently, in parallel to the simulations.
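The exchangeable-module structure of such a chain can be sketched as a small harness in which the design under test and a reference decoder see identical received values; every module below is a hypothetical placeholder, not an actual FlexiTreP component:

```python
def run_chain(encode, channel, decoders, info_bits):
    """Push one block through the encoder and channel once, then feed
    the identical received values to all instantiated decoders (e.g.
    DUT and golden reference) and collect their outputs by name."""
    received = channel(encode(info_bits))
    return {name: decode(received) for name, decode in decoders.items()}

# Stand-in modules: identity encoder/channel/decoders for illustration.
outputs = run_chain(
    encode=lambda bits: bits,
    channel=lambda symbols: symbols,
    decoders={"dut": lambda rx: list(rx), "reference": lambda rx: list(rx)},
    info_bits=[1, 0, 1, 1],
)
```

Comparing `outputs["dut"]` against `outputs["reference"]` block by block localizes the first diverging value, which is what makes debugging in such a setup practical.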
Additionally, we have added an interface from the simulation chain to an FPGA board over Ethernet. With this, the whole simulation chain runs on a standard PC, offering the full flexibility of simulation. Only the design under test is exchanged for the hardware implementation. This setup enables, from the same environment, debugging and analysis of single code blocks, performance simulations in software or emulation with the RTL design, and comparisons to an already verified reference. The emulation offers an acceleration of another order of magnitude.
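A hardware-in-the-loop link of this kind can be sketched as a small TCP client; the length-prefixed little-endian wire format, function names, and hard-decision payload are assumptions for illustration, not the platform's actual Ethernet protocol:

```python
import socket
import struct

def _recv_exactly(sock, n):
    """Read exactly n bytes from the socket (TCP may fragment)."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("FPGA link closed early")
        data += chunk
    return data

def fpga_decode(host, port, soft_values):
    """Hardware-in-the-loop stand-in: ship one block of 16-bit soft
    values to the FPGA board over TCP and read back the decoded bits,
    one byte per bit."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(struct.pack(f"<I{len(soft_values)}h",
                                 len(soft_values), *soft_values))
        (n_bits,) = struct.unpack("<I", _recv_exactly(sock, 4))
        return list(_recv_exactly(sock, n_bits))
```

Because only the decoder module is swapped out, the software testbench, channel model, and reference comparison stay exactly as in the pure-software runs.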
Our separation approach offers an additional big advantage: whenever the hardware is modified, it can be shown formally that unmodified instructions are not influenced by the modifications. Hence, by applying the existing (or only slightly modified) properties again, it is assured that programs that do not use any new or modified instructions still work as before and need not be simulated again. This reduces the additional validation time for an enhancement to a third of a full rerun of all simulations.
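The resulting regression policy amounts to a simple selection rule; the program and instruction names below are invented for illustration:

```python
def programs_to_resimulate(programs, changed_instructions):
    """After a hardware modification, property checking shows that
    untouched instructions still behave as before, so only programs
    that use a changed or new instruction need to be re-simulated.
    `programs` maps a program name to the set of instructions it uses."""
    return {name for name, used in programs.items()
            if used & changed_instructions}

programs = {
    "viterbi_gsm": {"branch_metric", "acs", "traceback"},
    "turbo_lte": {"branch_metric", "acs", "llr_calc", "interleave"},
}
# Only the Turbo program touches the modified llr_calc instruction.
assert programs_to_resimulate(programs, {"llr_calc"}) == {"turbo_lte"}
```

A change to a shared instruction such as `acs` would, by the same rule, put every program back into the simulation queue.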
V. CONCLUSIONS AND FUTURE WORK
Application-specific programmable architectures are spreading quickly in the field of channel decoding. In this paper we outlined the differences between various implementation styles, showed the advantages of ASIPs and in particular WPIPs for channel decoding, highlighted their disadvantages w.r.t. validation, and backed this up with concrete numbers. We validated our existing FlexiTreP ASIP with our flexible channel decoding simulation environment and formal verification methods. We showed that with deliberate simulations in combination with formal methods, the effort for validating a multi-standard architecture can be reduced from infeasible times in the range of several months to a few days, depending on the supported codes. Our validated ASIP was successfully produced in a 65 nm technology and integrated into a commercial product.
For the future, enhancements of the architecture and an extension to a multi-core system are planned. The validation environment proved to be perfectly suitable for debugging these, thanks to its modular character and configurability.
REFERENCES
[1] Y. Lin, H. Lee, M. Woh, Y. Harel, S. Mahlke, T. Mudge, C. Chakrabarti, and K. Flautner, "SODA: A Low-power Architecture For Software Radio," in Proc. 33rd International Symposium on Computer Architecture ISCA '06, 2006, pp. 89–101.
[2] C. Rowen, “Silicon-efficient dsps and digital architecture for lte base-band,” in 10th International Forum on Embedded MPSoC and Multicore,Gifu, Japan, Aug. 2010.
[3] B. Bougard, R. Priewasser, L. V. der Perre, and M. Huemer, “Algorithm-Architecture Co-Design of a Multi-Standard FEC Decoder ASIP,” inICT-MobileSummit 2008 Conference Proceedings, Stockholm, Sweden,Jun. 2008.
[4] S. Kunze, E. Matus, and G. P. Fettweis, “ASIP decoder architecture forconvolutional and LDPC codes,” in Proc. IEEE International Symposiumon Circuits and Systems ISCAS 2009, May 2009, pp. 2457–2460.
[5] F. Naessens, B. Bougard, S. Bressinck, L. Hollevoet, P. Raghavan,L. Van der Perre, and F. Catthoor, “A unified instruction set pro-grammable architecture for multi-standard advanced forward error cor-rection,” in Proc. IEEE Workshop on Signal Processing Systems SiPS2008, Oct. 2008, pp. 31–36.
[6] O. Muller, A. Baghdadi, and M. Jezequel, “From Parallelism Levels toa Multi-ASIP Architecture for Turbo Decoding,” IEEE Transactions onVery Large Scale Integration (VLSI) Systems, vol. 17, no. 1, pp. 92–102,Jan. 2009.
[7] M. Alles, T. Vogt, and N. Wehn, “FlexiChaP: A Reconfigurable ASIPfor Convolutional, Turbo, and LDPC Code Decoding,” in Proc. 5thInternational Symposium on Turbo Codes and Related Topics, Lausanne,Switzerland, Sep. 2008, pp. 84–89.
[8] O. Muller, A. Baghdadi, and M. Jezequel, “From Application to ASIP-based FPGA Prototype: a Case Study on Turbo Decoding,” IEEEInternational Workshop on Rapid System Prototyping, pp. 128–134, Jun.2008.
[9] M. Alles, T. Lehnigk-Emden, C. Brehm, and N. Wehn, “A RapidPrototyping Environment for ASIP Validation in Wireless Systems,” inProc. edaWorkshop 09, Dresden, Germany, May 2009, pp. 43–48.
[10] T. Noll, T. Sydow, B. Neumann, J. Schleifer, T. Coenen, and G. Kappen,“Reconfigurable Components for Application-Specific Processor Archi-tectures,” in Dynamically Reconfigurable Systems, M. Platzner, J. Teich,and N. Wehn, Eds. Springer Netherlands, 2010, pp. 25–49.
[11] “Synopsys Processor Designer,” June 2010. [Online]. Available:http://www.synopsys.com/Tools/SLD/ProcessorDev/
[12] "Target Compiler Technologies," http://www.retarget.com.
[13] "Tensilica Inc.," http://www.tensilica.com.
[14] T. Vogt and N. Wehn, "A Reconfigurable ASIP for Convolutional and Turbo Decoding in a SDR Environment," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, pp. 1309–1320, Oct. 2008. [Online]. Available: http://dx.doi.org/10.1109/TVLSI.2008.2002428
[15] S. Loitz, M. Wedler, C. Brehm, T. Vogt, N. Wehn, and W. Kunz,“Proving Functional Correctness of Weakly Programmable IPs - ACase Study with Formal Property Checking,” in Proc. Symposium onApplication Specific Processors SASP 2008, Anaheim, CA, USA, Jun.2008, pp. 48–54.
[16] S. Loitz, M. Wedler, D. Stoffel, C. Brehm, N. Wehn, and W. Kunz,“Complete Verification of Weakly Programmable IPs against TheirOperational ISA Model,” in FDL, A. Morawiec and J. Hinderscheit,Eds. ECSI, Electronic Chips & Systems design Initiative, 2010, pp.29–36.
[17] A. Hoffmann, O. Schliebusch, A. Nohl, G. Braun, O. Wahlen, andH. Meyr, “A methodology for the design of application specific in-struction set processors (ASIP) using the machine description languageLISA,” in Computer Aided Design, 2001. ICCAD 2001. IEEE/ACMInternational Conference on, Nov. 2001, pp. 625–630.
Area and Throughput Optimized ASIP for Multi-Standard Turbo Decoding
Rachid Al-Khayat, Purushotham Murugappa, Amer Baghdadi, Michel JezequelInstitut Telecom; Telecom Bretagne; UMR CNRS 3192 Lab-STICC
Electronics Department, Telecom Bretagne, Technopole Brest Iroise CS 83818, 29238 BrestUniversite Europeenne de Bretagne, France
E-mail: {firstname.surname}@telecom-bretagne.eu
Abstract—In order to address the large variety of channel coding options specified in existing and future digital communication standards, there is an increasing need for flexible solutions. Recently proposed flexible solutions in this context generally present a significant area overhead and/or throughput reduction compared to dedicated implementations. This is particularly true when adopting instruction-set programmable processors, including the recent trend toward the use of Application-Specific Instruction-set Processors (ASIPs). In this paper we illustrate how the application of adequate algorithmic- and architecture-level optimization techniques to an ASIP for turbo decoding can make it an attractive and efficient solution in terms of area and throughput. The proposed architecture integrates two ASIP components supporting binary/duo-binary turbo codes and combines several optimization techniques regarding pipeline structure, trellis compression (Radix-4), and memory organization. The logic synthesis results yield an overall area of 1.5mm2 using 90nm CMOS technology. Payload throughputs of up to 115.5Mbps in both double binary turbo code (DBTC) and single binary turbo code (SBTC) modes are achievable at 520MHz. The demonstrated results constitute a promising trade-off between throughput and occupied area compared with existing implementations.
Index Terms—SoC design, Embedded System Architecture, ASIP, Pipeline Processor, Turbo codes, WiMAX, 3GPP, LTE, DVB-RCS.
I. INTRODUCTION
Systems on chips (SoCs) in the field of digital communication are becoming more and more diversified and complex. In this field, performance requirements, like throughput and error rates, are becoming increasingly severe. To reduce the error rate (closer to the Shannon limit) at a lower signal-to-noise ratio (SNR), turbo (iterative) processing algorithms have been proposed [1] and adopted in emerging digital communication standards. These standards target different sectors: LTE and WiMAX cover metropolitan areas for voice and data applications with limited video service, while the DVB series targets video broadcasting. A selected list of current standards and their throughput requirements is given in Table I.

The user demands, on the other hand, require these applications to be supported on a single portable device, which calls for future wireless devices, such as PDAs and smart phones, to be multi-standard. The efficient implementation
  Standard           | Codes | Rates     | States | Blocksize  | Channel throughput
  -------------------|-------|-----------|--------|------------|-------------------
  IEEE802.16 (WiMax) | DBTC  | 1/2 - 3/4 | 8      | .. 4800 .. | 75 Mbps
  DVB-RCS            | DBTC  | 1/3 - 6/7 | 8      | .. 1728 .. | 2 Mbps
  3GPP-LTE           | SBTC  | 1/3       | 8      | .. 6144 .. | 150 Mbps

TABLE I: Selection of standards supporting turbo codes. DBTC: Double Binary Turbo Code, SBTC: Single Binary Turbo Code.
of the advanced channel decoder, which is among the most area-consuming and computationally intensive blocks in the baseband modem, therefore becomes more important.

Numerous research groups have come up with different architectures providing specific reconfigurability to support multiple standards on a single device. A majority of these works target channel decoding and particularly turbo decoding. The supported types of channel coding for turbo codes are usually Single Binary and/or Double Binary Turbo Codes (SBTC and DBTC). In this context, the work in [2] presents an ASIP-based (Application-Specific Instruction-set Processor) implementation with a flexible pipeline architecture that supports turbo decoding of SBTC and DBTC. The presented ASIP occupies a small area of 0.42mm2 in 65nm technology (0.84mm2 in 90nm); however, it achieves a limited throughput of 37.2Mbps in DBTC and 18.6Mbps in SBTC mode at 400MHz. Besides ASIP-based solutions, other flexible implementations are proposed using a parametrized dedicated architecture (not based on an instruction set), like the work presented in [3]. The proposed architecture supports DBTC and SBTC modes and achieves a high throughput of 187Mbps. However, the occupied area is large: 10.7mm2 in 130nm technology (5.35mm2 in 90nm).
On the other hand, several single-standard dedicated architectures exist. In this category we can cite the dedicated architectures presented in [4] and [5], which support only the SBTC mode (3GPP-LTE). In [4] a maximum throughput of 150 Mbps is achieved at the cost of a large area of 2.1 mm2 in 65 nm technology (4.2 mm2 in 90 nm), while in [5] a maximum throughput of 130 Mbps is achieved at the cost of 2.1 mm2 in 90 nm technology. Another example is the dedicated architecture proposed in [6], which supports only the DBTC mode (WiMAX). A limited throughput of 45 Mbps is achieved (not covering all WiMAX requirements) with an occupied area of 3.8 mm2 in 180 nm technology (0.95 mm2 in 90 nm).

978-1-4577-0660-8/11/$26.00 c©2011 IEEE
Analyzing the overall state of the art in this context, one can note that the proposed flexible solutions generally present a significant area overhead and/or throughput reduction compared to dedicated implementations. This is particularly true when adopting instruction-set programmable processors, including the recent trend toward the use of ASIPs. In this paper we illustrate how the application of adequate algorithmic- and architecture-level optimization techniques to an ASIP for turbo decoding can make it an attractive and efficient solution in terms of area and throughput. The considered initial ASIP for turbo decoding is the one proposed in [7]; the proposed optimizations reduce its area from 0.2 mm2 to 0.15 mm2 in 90 nm technology and increase its throughput from 50 Mbps to 115.5 Mbps in DBTC mode and from 25 Mbps to 115.5 Mbps in SBTC mode. These significant improvements are obtained by applying three levels of optimization: (1) architecture optimization, by re-arranging the pipeline and reducing the number of instructions used in the iterative loop to generate extrinsic information; (2) algorithmic optimization, by applying trellis compression (Radix-4) to double the throughput in SBTC mode; and (3) memory re-organization to optimize the area.

The proposed architecture yields a simple, lightweight 1×1 decoder system which achieves an excellent ratio between throughput and area while supporting SBTC/DBTC turbo decoding for an array of standards (WiMAX, LTE, DVB-RCS). The rest of the article is organized as follows: section II presents the decoding algorithms used in the proposed architecture. Section III explains in detail the proposed architecture of the decoder system and the proposed optimization techniques. The synthesis results and comparisons w.r.t. the state of the art are given in section IV, and finally the paper concludes with section V, giving some future perspectives.
II. DECODING ALGORITHMS
A. Turbo decoding

The typical system diagram for turbo decoding is shown in Fig. 1. It consists of two component decoders exchanging extrinsic information via interleaving (Π) and deinterleaving (Π−1) processes. Component decoder0 receives the log-likelihood ratio Λ_k (1) for each bit k of a frame of length N in the natural order, while component decoder1 is initialized in the interleaved order.

Λ_k = log( Pr{d_k = 0 | y_0..N−1} / Pr{d_k = 1 | y_0..N−1} )    (1)

For efficient hardware implementation the Max-Log MAP algorithm is used, as described in [7]. For DBTC, the three normalized extrinsic information values are defined by (2), where i ∈ {01, 10, 11} is the value of the kth symbol and s′ and s are the previous and current corresponding trellis states respectively.

Z^n.ext_k(d(s′, s) = i) = Z^ext_k(d(s′, s) = i) − Z^ext_k(d(s′, s) = 00)    (2)
Fig. 1: Turbo decoding system
The extrinsic information defined by (3) is calculated from the a posteriori probability given by (4), wherein α_k(s) and β_k(s) are the state metrics in the forward (5) and backward (6) recursions respectively and γ_k(s′, s) are the branch metrics (7). γ^sys_k(s′, s) and γ^par_k(s′, s) are the systematic and parity symbol LLRs. Finally, when the required number of iterations N_iter is completed, the hard decision is calculated as given by (9).

Z^ext_k(d(s′, s) = i) = Γ × (Z^apos_k(d(s′, s) = i) − γ^int_k(s′, s))    (3)

Z^apos_k(d(s′, s) = i) = max_{(s′,s)/d(s′,s)=i} (α_{k−1}(s′) + γ_k(s′, s) + β_k(s)),  i ∈ {00, 01, 10, 11}    (4)

α_k(s) = max_{s′} (α_{k−1}(s′) + γ_k(s′, s))    (5)

β_k(s) = max_{s′} (β_{k+1}(s′) + γ_{k+1}(s, s′))    (6)

γ_k(s′, s) = γ^int_k(s′, s) + γ^n.ext_k(s′, s)    (7)

γ^int_k(s′, s) = γ^sys_k(s′, s) + γ^par_k(s′, s)    (8)

Z^Hard.dec_k = sign(Z^apos_k)    (9)
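The Max-Log MAP recursions above can be sketched compactly. The following Python code runs the forward recursion (5), the backward recursion (6), and the max-based a posteriori LLR computation on a generic single-binary toy trellis; the trellis representation and metric values are illustrative assumptions, not the paper's DBTC datapath:

```python
import numpy as np

def max_log_map(gamma, transitions, n_states):
    """Max-Log MAP recursions on a toy single-binary trellis.

    gamma: (K, n_trans) branch metrics, one column per trellis transition
    transitions: list of (s_prev, s_cur, bit) indexing gamma's columns
    Returns per-step a posteriori LLRs; hard decision = sign(LLR), eq. (9).
    """
    K = gamma.shape[0]
    NEG = -1e9                      # stand-in for -infinity
    alpha = np.full((K + 1, n_states), NEG)
    alpha[0, 0] = 0.0               # trellis starts in state 0
    beta = np.full((K + 1, n_states), NEG)
    beta[K, :] = 0.0                # unterminated trellis: all end states equal
    # forward recursion, eq. (5)
    for k in range(K):
        for t, (sp, sc, _) in enumerate(transitions):
            alpha[k + 1, sc] = max(alpha[k + 1, sc], alpha[k, sp] + gamma[k, t])
    # backward recursion, eq. (6)
    for k in range(K - 1, -1, -1):
        for t, (sp, sc, _) in enumerate(transitions):
            beta[k, sp] = max(beta[k, sp], beta[k + 1, sc] + gamma[k, t])
    # a posteriori LLR: max over bit=1 transitions minus max over bit=0
    llr = np.empty(K)
    for k in range(K):
        best = [NEG, NEG]
        for t, (sp, sc, bit) in enumerate(transitions):
            best[bit] = max(best[bit], alpha[k, sp] + gamma[k, t] + beta[k + 1, sc])
        llr[k] = best[1] - best[0]
    return llr
```

On a 2-state accumulator trellis with branch metrics biased toward a known bit pattern, the signs of the returned LLRs recover that pattern, matching the hard decision rule (9).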
B. Radix4 decoding algorithm

For SBTC, the trellis length is reduced by half by applying the one-level look-ahead recursion [8]. The modified α and β state metrics for this Radix-4 optimization are given by (10) and (11), where γ_k(s′′, s) is the new branch metric for the combined two-bit symbol (u_{k−1}, u_k) connecting states s′′ and s.
Fig. 2: Trellis compression (Radix4)
α_k(s) = max_{s′′} (α_{k−2}(s′′) + γ_k(s′′, s))    (10)

β_k(s) = max_{s′′} (β_{k+2}(s′′) + γ_k(s′′, s))    (11)

γ_k(s′′, s) = γ_{k−1}(s′′, s′) + γ_k(s′, s)    (12)

The extrinsic information for u_{k−1} and u_k is computed as:

Z^n.ext_{k−1} = Γ × (max(Z^ext_10, Z^ext_11) − max(Z^ext_00, Z^ext_01))    (13)

Z^n.ext_k = Γ × (max(Z^ext_01, Z^ext_11) − max(Z^ext_00, Z^ext_10))    (14)
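The per-bit extrinsic separation of (13) and (14) amounts to a handful of max operations. A minimal sketch (the dictionary-based interface is an illustrative assumption; the Γ value of 0.5 for SBTC is the one the paper reports later in section III-B):

```python
def split_radix4_extrinsic(z_ext, gamma_sf=0.5):
    """Recover per-bit extrinsic LLRs from Radix-4 symbol-pair metrics.

    z_ext: dict mapping the two-bit symbol '00'..'11' to its extrinsic metric
    gamma_sf: scaling factor Gamma (0.5 for SBTC per the paper)
    Returns (Z_{k-1}, Z_k) following equations (13) and (14).
    """
    # eq. (13): bit u_{k-1} is the first bit of the pair
    z_prev = gamma_sf * (max(z_ext['10'], z_ext['11']) -
                         max(z_ext['00'], z_ext['01']))
    # eq. (14): bit u_k is the second bit of the pair
    z_cur = gamma_sf * (max(z_ext['01'], z_ext['11']) -
                        max(z_ext['00'], z_ext['10']))
    return z_prev, z_cur
```

For example, with pair metrics {'00': 0, '01': 1, '10': 4, '11': 2}, equation (13) gives 0.5 × (4 − 1) = 1.5 and equation (14) gives 0.5 × (2 − 4) = −1.0.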
C. QPP Interleaving

Interleaving/deinterleaving of the extrinsic information is a key issue when addressing multiple extrinsic values in the same cycle, because memory access contention may occur when the MAP decoders fetch/write extrinsic information from/to memory. The interleaving/deinterleaving addresses follow the LTE-standard QPP interleaver, which is contention-free and expressed by a simple mathematical formula. Let N be the number of data couples in each block at the encoder input:

For j = 0 .. N − 1,  I(j) = (F2·j² + F1·j) mod N    (15)

where F1 and F2 are constants defined in the standard and j is the index in the natural order. By definition, parameter F1 is always odd whereas F2 is always even. The QPP interleaver has many algebraic properties; an interesting one is that I(x) has the same even/odd parity as x:

I(2k) mod 2 = 0,  I(2k + 1) mod 2 = 1    (16)

This algebraic property is used in section III-C to design a memory system that addresses multiple extrinsic values without memory access contention.
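The parity property (16) is easy to verify numerically. The sketch below evaluates (15) for one LTE block size; the parameters N = 40, F1 = 3, F2 = 10 are taken from the 3GPP-LTE interleaver table (quoted here from memory, so treat them as an assumption):

```python
def qpp_interleave(j, n, f1, f2):
    """QPP interleaver address, eq. (15): I(j) = (f2*j^2 + f1*j) mod n."""
    return (f2 * j * j + f1 * j) % n

# LTE parameters for block size 40 (assumed for illustration)
N, F1, F2 = 40, 3, 10
addresses = [qpp_interleave(j, N, F1, F2) for j in range(N)]

# I is a permutation of 0..N-1 (contention-free addressing)
assert sorted(addresses) == list(range(N))
# Parity property (16): I(x) keeps the parity of x, since f2*j^2 is
# always even (f2 even) and f1*j has the parity of j (f1 odd).
assert all(a % 2 == j % 2 for j, a in enumerate(addresses))
```

This parity preservation is exactly what allows the odd/even extrinsic memory banking described in section III-C.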
III. DECODER SYSTEM ARCHITECTURE
The proposed decoder system architecture consists of 2 ASIPs interconnected directly, as shown in Fig. 3. The shuffled-decoding ASIPs [7] are configured to operate in a 1×1 mode, with one ASIP (ASIP0) processing the data in the natural order while the other (ASIP1) processes it in the interleaved order. The generated extrinsic information is exchanged between the two ASIP decoder components via a connection of buffers and multiplexers.
Fig. 3: 1× 1 Decoder system architecture
A. ASIP architecture and optimization levels
Fig. 7a illustrates the overall architecture of the optimized ASIP with the proposed memory structure and a pipeline organisation in 9 stages. The numbers in brackets indicate the equations (referred to in section II-A) mapped onto the corresponding pipeline stages. The extrinsic information format at the output of the ASIP is also depicted in the same figure for the two modes, SBTC and DBTC. The rest of this sub-section details the proposed architecture optimizations, classified in three levels, to achieve an efficient solution in terms of area and throughput.
1) Architecture level optimization: The initial decoding process of the ASIP proposed in [7] was performed through 8 pipeline stages and implemented the butterfly scheme for metric computation. During this process, the ASIP calculates the α metrics (forward recursion) and the β metrics (backward recursion) simultaneously until it reaches the middle of the processed sub-block (the left butterfly). In the left butterfly, the recursion units perform the state metric calculations in the first clock cycle and the max operators find the state metric maximum values in the second clock cycle, eq. (5) & eq. (6). While processing the other half of the sub-block (the right butterfly), besides finding the state metric values, another three clock cycles are required for the extrinsic-information additions and for finding the maximum a posteriori information, eq. (4). All these operations take place in the EX stage, so 7 clock cycles in total are required to generate the extrinsic information for two symbols.

The major architecture-level optimization re-arranges the pipeline by modifying the EX stage to place the recursion units and max operators in series, so that a single instruction calculates the state metrics and finds the maximum values, eq. (5) & eq. (6). Similarly, one instruction finds the maximum a posteriori information, eq. (4). In fact, finding the maximum a posteriori information is done in three cascaded stages of max operators (searching for the maximum among 8 metric values); placing them in series with the recursion units in one pipeline stage would increase the critical path (i.e. reduce the maximum clock frequency). To avoid that, a new pipeline stage (MAX stage) is added after the EX stage to distribute the max operators, as shown in Fig. 4.

During the decoding process in the left butterfly, the ACS (Add, Compare, Select) units perform the state metric calculations and find the state metric maximum values in the same clock cycle in the EX stage, while during the right butterfly, besides finding the state metric values, the ACS units perform the additions and find the maximum a posteriori information, eq. (4), in the MAX stage in one clock cycle. So 3 clock cycles in total are required to generate the extrinsic information for two symbols.

Another proposed optimization concerns the implementation
of windowing to process large block sizes, which is achieved by dividing the frame into N windows, where the maximum supported window size is 128 bits. Fig. 5 shows the window processing in the butterfly scheme: the ASIP calculates the α values (forward recursion) and β values (backward recursion) simultaneously, and when it reaches half of the processed window (left butterfly) and starts the other half (right butterfly), it can calculate the extrinsic information on the fly along with the α and β calculations. State initializations (α_init(w^i_{n−1}), β_init(w^i_{n−1})) of the α−β recursions across windows are done by message passing via a dedicated array of registers. Since the maximum window size is 128 bits, 48 windows are needed to cover all LTE block sizes, so a 48×96 array of registers is added.
2) Algorithmic level optimization: In the ASIP proposedin [7], SBTC throughput equals half of DBTC throughput
Fig. 4: Modified pipeline stages (EX, MAX)
Fig. 5: Windowing in butterfly computation scheme
because the decoded symbol is composed of 1 bit in SBTC while it is 2 bits in DBTC mode. Trellis compression is applied to overcome this bottleneck, as explained in section II-B. This makes the decoding calculation for SBTC similar to DBTC, as presented in eq. (10), eq. (11) and eq. (12), so no additional ACS units are added. The only extra calculation is to separate the extrinsic information into the corresponding symbols, as presented in eq. (13) and eq. (14), and the cost of its hardware implementation is very small. Fig. 6 depicts the butterfly scheme with Radix-4, where the numbers indicate the equations (referred to in section II-B). In this case four bits (single binary symbols) are decoded each time.
Fig. 6: Butterfly scheme with Radix4
3) Memory level optimization: Three major memory structure optimizations are implemented. The first one concerns the normalization of the extrinsic information as presented in eq. (2). This optimization reduces the extrinsic memory by 25% because the 00 component (γ^n.ext_00) is no longer stored. The second optimization restricts the support of the trellis definition to a limited number of standards (WiMAX, 3GPP) rather than all possible ones. Besides reducing the complex multiplexing logic, this optimization allows a 1-bit mode selection which is passed through the instruction set, and thus the configuration memories which store the trellis definition are eliminated.
The third optimization re-organizes the input and extrinsic memories. Input memories contain the channel LLRs Λn, which are quantized to 6 bits each. In the proposed organization for DBTC mode, the LLR values of the systematic bits (S^n_0, S^n_1) and parities (P^n_0, P^n_1) of the same double binary symbol are stored in the same input memory word. In SBTC mode, however, each input memory word stores the LLR values (S^n_0, S^{n+1}_0, P^n_0, P^{n+1}_0) of two consecutive bits (single binary symbols). The same approach is applied to the extrinsic memories. As normalized extrinsic values are quantized to 8 bits, in DBTC mode the values γ^n.ext_01, γ^n.ext_10, γ^n.ext_11 related to the same symbol are stored in the same memory word, while in SBTC mode each memory word stores the extrinsic values γ^n.ext_1, γ^{(n+1).ext}_1 of two consecutive bits. In this way the memory resources in the two turbo code modes (SBTC/DBTC) are efficiently re-utilized.

Memory sizes are dimensioned to support the maximum block size of the target standards (Table I). This corresponds to the 6144-bit frame size of the 3GPP-LTE standard, which results in a memory depth of (6144 + 3 tail bits + 1 unused) / ((N_sym = 2) × (N_mb = 2)) = 1537 words for both input and extrinsic memories, where N_sym is the number of symbols per memory word and N_mb is the number of memory blocks (N_mb = 2 as the butterfly scheme is adopted).

Table II presents the memories used in the proposed ASIP. It has 2 single-port input memories of size 24×1537 to store channel LLR values and 2 simple dual-port (one read port, one write port) extrinsic memories to save a priori information. Each extrinsic memory is split into two banks: odd 8×1537 and even 16×1537. Each ASIP is further equipped with a 128×16 cross-metric memory which implements buffers to store β and α during the left-butterfly calculation phase, re-utilized in the right-butterfly phase.
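The 1537-word memory depth quoted above is simple arithmetic; a quick check with the paper's numbers:

```python
block = 6144          # maximum 3GPP-LTE block size (bits)
tail, unused = 3, 1   # tail bits plus one unused entry
n_sym, n_mb = 2, 2    # symbols per memory word, memory banks (butterfly)

# Depth of both input and extrinsic memories, per the dimensioning above
depth = (block + tail + unused) // (n_sym * n_mb)
assert depth == 1537
```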
Memory name            #  Depth  Width
Program memory         1  64     16
Input memory           2  1537   24
Extrinsic memory odd   2  1537   8
Extrinsic memory even  2  1537   16
Cross-metric memory    1  16     128

TABLE II: Typical RAM configuration used for one ASIP decoder component
B. Assembly Code Example
An assembly code example of the proposed optimized ASIP in turbo mode is shown in Fig. 7b. First we initialize the ASIP mode (SBTC, DBTC) and the scaling factor Γ identified in eq. (3) and eq. (13); software simulations showed the best BER performance for Γ = 0.75 in DBTC and Γ = 0.5 in SBTC. We also initialize the current iteration number (iter = 0), the number of windows (N) per ASIP, the window length (L) and the length of the last window (L_last). The REPEAT instruction controls the number of iterations (ITER_MAX = 6). For the first iteration (i = 0) the ASIP starts with zero as the initial state metrics (α_init(w^{i=0}_n) = β_init(w^{i=0}_n) = 0). The ZOLB instruction controls instructions @10 and @12-13 to execute L times (or L_last times in the case of the last window). The DATA LEFT instruction @10 executes the left-butterfly recursion, calculating the α/β metrics and storing them in the cross-metric memory. The DATA RIGHT instruction executes the right-butterfly recursion, calculating α/β metrics that are used on the fly, along with the corresponding stored metrics from the cross-metric memory, by the next instruction EXTCALC, which calculates the extrinsic information (3) @12-13 and sends it to the other decoder component through the buffer and multiplexer; the extrinsic calculation thus requires two clock cycles. To avoid a conflict in the cross-metric memory when the ASIP finishes processing the left butterfly and starts the right butterfly, a NOP @11 is placed and executed once for a 1-clock-cycle delay. In SBTC mode, four extrinsic values are generated, one for each input LLR, while in DBTC, six extrinsic values are generated, three for each input LLR ((13), (14)). The EXCH WIN instruction forwards the last α^i_{(n)} values as α_init(w^i_{(n)}), initializes the state metrics of the next window with the β values of window n, and increments the current window counter (n = n + 1).
C. Addressing implementation
In DBTC turbo decoding, due to the use of the butterfly scheme, two symbols are decoded at the same time, so two extrinsic information values are generated simultaneously and must be addressed to the other component decoder. As explained in section III-B, two clock cycles are available during the right butterfly to generate the extrinsic information, so one value is addressed in the first clock cycle and the other is buffered to be addressed in the next clock cycle. In SBTC, Radix-4 decoding is adopted. Using this decoding with the butterfly scheme generates four extrinsic values simultaneously each time, which must be addressed and sent to the other decoder component in two clock cycles. To avoid collisions, QPP interleaving is applied as explained in section II-C. According to eq. (16), odd addresses in the natural domain are also odd in the interleaved domain, and the same holds for even addresses. The extrinsic memories have therefore been split into two banks (odd/even) to avoid memory conflicts. In fact, in the first clock cycle two extrinsic values (out of the four generated in SBTC mode), one with an odd and one with an even address, are sent, followed by the other two extrinsic values in the next clock cycle.
IV. SYNTHESIS RESULTS
The ASIP was modeled in the LISA language using CoWare's Processor Designer tool. The generated VHDL code was validated and synthesized using Synopsys tools and 90 nm CMOS technology. The obtained results demonstrate an area of 0.15 mm2 per ASIP with a maximum clock frequency of F_clk = 520 MHz. Thus, the proposed turbo decoder architecture with 2 ASIPs occupies a logic area of 0.3 mm2 with a total memory area of 1.2 mm2. With these results, the turbo decoder throughput can be computed through equation (17). An average of N_instr = 3 instructions per iteration is needed to generate the extrinsic information for N_sym = 2 symbols in DBTC mode, where a symbol is composed of Bits_sym = 2 bits. In SBTC mode, the same number of instructions is required for N_sym = 4 symbols, where a symbol is composed of Bits_sym = 1 bit. Considering N_iter = 6 iterations, the maximum throughput achieved is 115.5 Mbps in both modes.

Throughput = (N_sym × Bits_sym × F_clk) / (N_instr × N_iter)    (17)
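Equation (17) with the reported parameters reproduces the 115.5 Mbps figure in both modes; a quick check:

```python
def decoder_throughput_mbps(n_sym, bits_sym, f_clk_mhz, n_instr=3, n_iter=6):
    """Turbo decoder throughput per equation (17), in Mbps
    (f_clk in MHz, so the bits/cycle ratio comes out directly in Mbps)."""
    return n_sym * bits_sym * f_clk_mhz / (n_instr * n_iter)

# DBTC: 2 symbols of 2 bits per 3 instructions; SBTC (Radix-4): 4 symbols of 1 bit
dbtc = decoder_throughput_mbps(n_sym=2, bits_sym=2, f_clk_mhz=520)
sbtc = decoder_throughput_mbps(n_sym=4, bits_sym=1, f_clk_mhz=520)
assert abs(dbtc - 115.5) < 0.1 and abs(sbtc - 115.5) < 0.1
```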
Work       Standard compliant    Tech (nm)  Core area (mm2)  Norm. core area @90nm (mm2)  Throughput (Mbps)   Fclk (MHz)
This work  WiMAX, DVB-RCS, LTE   90         1.5              1.5                          115.5 @6 iter       520
[3]        WiMAX, LTE            130        10.7             5.35                         187 @8 iter         250
[2]        DBTC, SBTC            65         0.42             0.84                         18.6-37.2 @5 iter   400
[6]        WiMAX                 180        3.8              0.95                         45                  99
[4]        LTE                   65         2.1              4.2                          150 @6.5 iter       300
[5]        LTE                   90         2.1              2.1                          130 @8 iter         275

TABLE III: Comparison with state-of-the-art implementations
Table III compares the obtained results of the proposed architecture with other related works. The ASIP presented in [2] supports both turbo modes (DBTC, SBTC). Although it occupies almost half the area of our proposed ASIP, its throughput is 6 times lower in SBTC mode and 3 times lower in DBTC mode. The parametrized dedicated architecture in [3] supports both turbo modes (DBTC, SBTC) and achieves a higher throughput (1.6 times) at the cost of more than 3.5 times the area of this work. The SBTC-dedicated architecture proposed in [4] achieves a throughput 30% higher than the proposed work, but at the cost of almost 3 times the occupied area. Similarly, the SBTC-dedicated architecture proposed in [5] achieves a throughput 13% higher than the proposed work, but at the cost of almost 1.4 times the occupied area. The DBTC-dedicated architecture proposed in [6] occupies an area around 30% smaller than this work, but the achieved throughput is around 40% lower. This analysis demonstrates how the proposed optimized architecture constitutes a promising trade-off between throughput and occupied area compared with existing implementations.
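The normalized core areas in Table III are consistent with quadratic feature-size scaling. The sketch below applies the common (90 nm / tech)² rule; this rule is an assumption on our part, since the paper does not state its normalization, and its figures round some entries toward factor-of-two steps:

```python
def normalize_area_90nm(area_mm2, tech_nm):
    """Scale a core area to 90 nm assuming area scales as (feature size)^2."""
    return area_mm2 * (90.0 / tech_nm) ** 2

# Spot-checks against the normalized column of Table III (loose tolerances,
# since the paper appears to round scaling ratios to powers of two)
assert abs(normalize_area_90nm(10.7, 130) - 5.35) < 0.25   # [3]
assert abs(normalize_area_90nm(0.42, 65) - 0.84) < 0.05    # [2]
assert abs(normalize_area_90nm(3.8, 180) - 0.95) < 0.01    # [6] (exact)
assert abs(normalize_area_90nm(2.1, 65) - 4.2) < 0.2       # [4]
```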
(a) ASIP Pipeline Architecture
k   instruction
1   SET CONF double
2   SET SF 6
3   SET WINDOW ID 1          ; set num windows
4   SET WINDOW N 3           ; 1st and last window length
5   SET SIZE 32,8
                             ; repeat @11=41 if last window executed else
                             ; repeat @28-41, for 6*WINDOW N times
6   REPEAT until LOOP 6times
7   NOP
                             ; repeat 30-31, and 35-36 for CurrWindowLen times
8   ZOLB RW1, CW1, LW1
9   NOP
10  RW1: DATA LEFT add m column2
                             ; save last beta, load alpha init
11  CW1: NOP
12  DATA RIGHT add m column2
13  LW1: EXTCALC add i line2 EXT
                             ; save last alpha, load beta init if last window else
                             ; exch calculated alpha and beta
14  EXCH WIN
15  NOP
16  LOOP: NOP
(b) Turbo assembly code
Fig. 7: ASIP pipeline and execution schedule
V. CONCLUSION
In this paper, we have presented an area-efficient, high-throughput 1×1 decoder system based on an ASIP that supports turbo codes in both modes, DBTC (WiMAX, DVB-RCS) and SBTC (LTE). Three levels of optimization (architecture, algorithmic, memory) have been proposed and significant performance improvements have been demonstrated. The proposed contribution illustrates how the application of adequate optimization techniques to a flexible ASIP for turbo decoding can make it an attractive and efficient solution in terms of area and throughput. Future work targets the integration of low-power decoding techniques.

VI. ACKNOWLEDGMENT
This work was supported in part by UDEC and TEROPPprojects of the French National Research Agency (ANR).
REFERENCES
[1] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit error-correcting coding and decoding: Turbo-codes (1),” In Proc. IEEE International Conference on Communications, ICC'93, vol. 2, pp. 1064–1070, 1993.
[2] T. Vogt and N. Wehn, “A Reconfigurable Application Specific InstructionSet Processor for Viterbi and Log-MAP Decoding,” In Proc. IEEEWorkshop on Signal Processing Systems Design and Implementation,SIPS’06, pp. 142–147, 2006.
[3] J.-H. Kim and I.-C. Park, “A Unified Parallel Radix-4 Turbo Decoderfor Mobile WiMAX and 3GPP-LTE,” In Proc. IEEE Custom IntegratedCircuits Conference, CICC’09., pp. 487–490, 2009.
[4] M. May, T. Ilnseher, N. Wehn, and W. Raab, “A 150 Mbit/s 3GPP LTETurbo Code Decoder,” In Proc. Design, Automation and Test in EuropeConference & Exhibition, DATE’10, pp. 1420–1425, 2010.
[5] C.-C. Wong and H.-C. Chang, “Reconfigurable Turbo Decoder With Parallel Architecture for 3GPP LTE System,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 57, no. 7, pp. 566–570, 2010.
[6] H. Arai, N. Miyamoto, K. Kotani, H. Fujisawa, and T. Ito, “A WiMAXturbo decoder with tailbiting BIP architecture,” In Proc. IEEE Asian Solid-State Circuits Conference, SSCC’09., pp. 377–380, 2009.
[7] O. Muller, A. Baghdadi, and M. Jezequel, “From Parallelism Levels toa Multi-ASIP Architecture for Turbo Decoding,” IEEE Transactions onVery Large Scale Integration (VLSI) Systems, vol. 17, no. 1, pp. 92–102,2009.
[8] Y. Zhang and K. Parhi, “High-Throughput Radix-4 logMAP TurboDecoder Architecture,” In Proc. Asilomar Conference on Signals, Systems,and Computers, ACSSC’06., pp. 1711–1715, 2006.
Design of an Autonomous Platform forDistributed Sensing-Actuating Systems
Francois Philipp, Faizal A. Samman and Manfred Glesner
Microelectronic Systems Research Group
Technische Universitat Darmstadt
Merckstraße 25, 64283 Darmstadt, Germany
{francoisp,faizalas,glesner}(at)mes.tu-darmstadt.de
Abstract—A platform for the prototyping of distributed sensing and actuating applications is presented in this paper. By combining a low power FPGA and a System-on-Chip specialized in low power wireless communication, we enable the development of a large range of smart wireless networks and control systems. Thanks to multiple customization possibilities, the platform can be adapted to specific applications while providing high performance and consuming little energy. We present our approach to designing the platform and two application examples showing how it was used in practice within a research project on adaptronics.
Keywords-Smart structures, Control Systems, Wireless SensorNetworks, Reconfigurable Hardware
I. INTRODUCTION
Distributed sensing systems are nowadays a key element
for the development of intelligent environments and adaptive
structures. Information gathered by spatially distributed sen-
sors can either be used for passive monitoring of a system con-
dition or active real-time feedback control. In both cases, tiny
platforms that can be easily integrated on existing structures
are required. Using wireless sensor nodes, placement of power
and sensor cables along a construction is no longer necessary,
but new issues regarding synchronization, speed and autonomy
are appearing.
We introduce a platform combining low power consumption
for autonomous wireless sensor networks applications and
high performance for real-time distributed control systems.
The Hardware accelerated LOw Energy Wireless Embedded
Sensor-Actuator node (HaLOEWEn) relies on fine-grained
reconfigurable hardware to implement complex data processing
tasks. While FPGAs tend to replace microcontrollers and DSPs
for prototyping control systems, their introduction in the
design of very low power autonomous embedded systems is
recent. The new generation of FPGAs based on non volatile
memory is highly suitable for this range of application where
a switch between active and sleep periods is frequent. The
power consumption of these devices is also sufficiently low
to be integrated in systems intended to run on long-term
deployments with batteries.
In addition, monitoring and control of structures with wireless
sensor networks depend on high-bandwidth sensing. Large
amounts of data are generated by vibration or acceleration
sensors. Wireless transmission of raw data would have a non-
negligible impact on the energy consumption of the node.
Alternatively, local data preprocessing and in-network ag-
gregation algorithms can be implemented with high energy-
efficiency on FPGAs, improving significantly the lifetime of
the network.
The paper is organized as follows. After a short review of
related work in section II, the architecture of the developed
platform and our design concept are introduced in section III.
We then present in section IV-A and IV-B two Wireless Sensor
- Actuator Networks (WSANs) applications using HaLOEWEn
as a prototype.
II. RELATED WORK
Prototyping of wireless sensor nodes with reconfigurable
hardware was addressed by Hinkelmann et al. in [1]. A Xilinx
Spartan3E with 2000k gates was used as a prototyping chip
for emulating wireless sensor networks microcontrollers. A
sophisticated hardware / software debugging interface has been
additionally developed for precise internal debugging of the
design implemented on the FPGA. Although the platform pro-
vides enough flexibility to implement and test a large range of
wireless sensor networks applications, its power consumption
is too high for long-term deployments. The node still needs
a reliable power supply close at hand, or cables, making the
wireless communication feature less practical.
Reconfigurability of FPGAs was used by Portilla et al. [2]
to implement custom sensor interfaces. Based on a Spartan III
with 200k gates, the COOKIE platform can interface a large
range of analog and digital sensors thanks to a HDL interface
library. However, even if it reduces the power consumption, the
limited size of the FPGA does not allow the implementation
of complex data processing circuits required by our target
applications.
Following a similar approach to reduce energy consump-
tion by locally processing the data, the Imote 2 node [3]
developed by Intel includes a high speed multimedia DSP
coprocessor to handle high bandwidth sensing. Significant im-
provements in performance and energy-efficiency were shown
for various applications in comparison to standard nodes
including only a simple microcontroller. Now commercially
available with multiple sensor and power supply extensions,
the Imote 2 is an interesting alternative for rapid-prototyping of
wireless networks applications implicating complex data pro-
cessing. However, custom hardware implementations enabled
Low Power Mode                        Power Consumption
FPGA Flash & Freeze - RF SoC Idle     30.7 mW
FPGA deep sleep - RF SoC Idle         29.8 mW
FPGA deep sleep - RF SoC LPM1         2.1 mW
FPGA deep sleep - RF SoC deep sleep   50 µW

TABLE I: POWER CONSUMPTION OF THE PLATFORM DURING DIFFERENT SLEEP MODES
by FPGAs result in many cases in higher performance for a
larger range of applications.
III. PLATFORM ARCHITECTURE
For our design, we considered an Actel IGLOO FPGA
AGL1000V5 [4] with 1000k equivalent system gates as the
central unit of the system. It is by default extended by a Texas
Instruments CC2531 System-on-Chip (SoC), integrating an
IEEE 802.15.4 compliant 2.4 GHz transceiver and an 8051 CPU
core [5]. The SoC includes 256 kB of programmable Flash memory
and 8 kB of RAM. The system runs with a 32 MHz oscillator.
Any node can be connected to a PC via the integrated USB
port. Through such nodes acting as base stations, data can
be accumulated and visualized immediately and user requests
can be disseminated to the whole network. Debuggers can be
plugged to the board allowing simultaneous monitoring of the
FPGA and the RF SoC operation.
The FPGA and the RF SoC communicate via a dedicated
SPI bus. When running at the maximum frequency with a
DMA controller, datarate can reach 2Mbps. Both components
have different deep low power modes useful for applications
involving long sleeping periods. The FPGA has a so-called
Flash & Freeze Low-Power mode with internal SRAM re-
tention activated by a dedicated pin driven by the software
running on the RF SoC. The RF SoC is also able to switch off
the power supply of the FPGA resulting in a deep low power
mode with configuration retention since IGLOO FPGAs are
based on flash memory. Thus, FPGA functionality is quickly
and energy-efficiently recovered after sleeping periods. Power
consumption measured on the platform are summarized in
table I. The power consumption of the FPGA in active mode
depends on the implemented design and can range from 4 mW
to 120 mW. When the radio is activated, 47 mW have to be
added in listening mode and 72mW when transmitting at
maximum power output (10dBm).
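From these figures a rough duty-cycled energy budget can be derived; the workload split below (FPGA design power, radio listening share, duty cycle) is an invented example for illustration, not a measurement from the paper:

```python
def avg_power_mw(active_mw, sleep_mw, duty_cycle):
    """Average platform power for a periodic wake/sleep schedule."""
    return duty_cycle * active_mw + (1.0 - duty_cycle) * sleep_mw

# Assumed workload: a 20 mW FPGA design plus radio listening (47 mW)
# active 1% of the time; both chips in deep sleep (~0.05 mW) otherwise.
avg = avg_power_mw(active_mw=20 + 47, sleep_mw=0.05, duty_cycle=0.01)
assert abs(avg - 0.7195) < 1e-6   # sub-milliwatt average at 1% duty cycle
```

Such estimates are what make the deep-sleep modes in Table I decisive for long-term battery-powered deployments.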
The platform can be powered by an external power supply
available next to the node or by a battery. If the external
conditions are adequate, a hybrid energy harvesting circuit
combining power extracted from different sources has been
developed for this platform [6]. The latter solutions are well
suited for monitoring-only systems but they are inappropriate
when the application includes the control of actuators. Power
generated by batteries or energy harvesting is not sufficient
in these cases and the platform is likely to be supplied by an
external source.
As they do not require complex data processing, low rate
analog sensors are connected to the integrated ADC of the
Fig. 1. Schematic Side View of the Platform with Extension Boards
SoC. A temperature and a light sensor were placed by default
on the board. Further analog sensors may be attached through
an external connector.
Other parts of the platform are customizable modular
circuits which are connected to one of the three remaining FPGA I/O
banks. An FPGA with a relatively high number of I/Os (256)
has been chosen to maximize the connectivity of the platform.
Available extension boards include for example additional
volatile and non-volatile memory, Analog-to-Digital (ADC)
and Digital-to-Analog (DAC) converters, interfaces to other
boards, digital sensors, etc. Each available I/O bank has a
dedicated 50 pins header connector to plug the extensions.
The board is small enough (60 mm x 96 mm) to be easily
integrated in various environments.
A. Development Environment
We distinguished two main parts for typical distributed
sensing-actuating applications: communication and data pro-
cessing. The wireless communication with other nodes of the
network is handled by the microcontroller in software while
sensors and actuators are directly interfaced by the FPGA
(Fig. 3). Preprocessing of the sensor data is thus handled
by dedicated hardware circuits for enhanced energy-efficiency.
Similarly, actuator control is implemented on the FPGA for
fast and accurate operation. Communication between the
microcontroller and the FPGA is limited to updates of the
control parameters (feedback information from other nodes)
and to data extracted from the sensors.
The wireless communication should guarantee a very accu-
rate synchronization between nodes in the case of distributed
control systems in order to minimize delays in the feedback
loop. As the communication is independent from the sensor
and actuator data processing, both operations can run in
parallel. Thus, very accurate synchronization may be achieved
through frequent resynchronization phases without interfering
with the accelerator operation.
A direct implementation of the communication protocol on
the FPGA, as it was done in [7], is limited by the area available
on the FPGA. Complex synchronization or routing protocols
require control units and memory that cannot fit together with
the data processing blocks. Even if improvements in energy
consumption are possible, it is preferable to use software
implementations of the networking protocol for prototyping
purposes.

Fig. 2. Top and Bottom View of the HaLOEWEn platform
Power management is also handled in software: FPGA and
RF SoC operation can be shut down to reduce the power con-
sumption of the platform during idle periods. If the platform
uses energy harvesting, part of the power management control,
like the maximum-power-point-tracking algorithm, can also be
mapped onto the processor [6].
The communication protocol is programmed in the C lan-
guage and can be supported by well-known wireless sensor
network operating systems like Contiki [8]. The design imple-
mented on the FPGA is described with VHDL or Verilog and
Actel IP cores within the Libero IDE.
Fig. 3. Typical Application Mapping
Fig. 4. Wireless Sensor Network for Acoustic Source Localization
IV. PROTOTYPING APPLICATIONS
In this section, we present the details of two applications
illustrating how the platform is used to test and develop
distributed sensing-actuating systems within a research
project.
A. Acoustic localization
We first detail a setup for the prototyping of a wireless
sensor network used for acoustic localization [9] as illustrated
in Fig. 4. The purpose of this application is to identify the
source of a sound disturbance in a closed environment.
Each platform is extended with two sensors: an ultrasonic
transceiver and a low-cost MEMS microphone. The ultrasonic
transceiver is used to perform an accurate self-localization
of the nodes relative to each other. Ultrasonic signals are
exchanged at regular time intervals to determine the distances,
and then the positions, of the nodes with a multilateration
algorithm. Nodes may thus have arbitrary positions and can be
moved during operation without interfering with the correct
operation of the application.

Fig. 5. Time of Arrival Estimator
The sound disturbance localization process involves two
steps in which intensive computation is required. In the first
step, the time differences of arrival of the sound at the different
nodes must be estimated. The most common way to do this
is to use cross-correlation either with a reference signal or
among the sounds recorded by the nodes of the network. In
order to minimize delays due to communication overhead, the
first solution is preferred, although it limits the localization to
predefined sounds (a ringtone, for example).
t_arrival = t,  max{(record ⋆ reference)(t)} > threshold    (1)
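As a software illustration of Equ. (1), the threshold-gated correlation peak search can be sketched in plain C; the function name, the brute-force sliding-window loop, and all numeric values are illustrative assumptions — on the platform itself this computation runs in dedicated FPGA hardware:

```c
#include <stddef.h>

/* Cross-correlate a recorded window against a known reference and
 * return the lag of the correlation peak, or -1 if no lag exceeds
 * the detection threshold (no arrival detected). */
static int toa_detect(const float *record, size_t n_rec,
                      const float *reference, size_t n_ref,
                      float threshold)
{
    int best_lag = -1;
    float best = threshold;          /* the peak must exceed the threshold */
    for (size_t lag = 0; lag + n_ref <= n_rec; lag++) {
        float acc = 0.0f;
        for (size_t j = 0; j < n_ref; j++)
            acc += record[lag + j] * reference[j];
        if (acc > best) {
            best = acc;
            best_lag = (int)lag;
        }
    }
    return best_lag;                 /* timestamp = lag / sampling rate */
}
```

The returned lag, divided by the sampling rate, gives the local timestamp that the nodes exchange during Post-Facto Synchronization.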
Thanks to the FPGA implementation of the cross-
correlation based detection, a timestamp can be very quickly
extracted. The nodes can then synchronize with each other by
exchanging their local time references (Post-Facto Synchronization)
and compute time differences of arrival (TDOAs).
In a second step, the TDOAs and the spatial coordinates of the
nodes are combined using an algorithm based on a least-squares
estimation called spherical intersection [10]. The
complexity of this algorithm is O(n³) for three-dimensional
localization, where n denotes the number of available
measurements. When the network is large, hardware acceleration
is thus desirable to speed up the computation. However, an
implementation of this algorithm is only necessary on one node,
since the result is unique for the whole network. As it does not
fit together with the cross-correlator on a single FPGA, the
localization accelerator is only implemented on a reference node
that keeps track of the other nodes' positions and measurements.
The hardware architecture of the accelerator is depicted in Fig. 6.
Using FPGAs for this application has several advantages: it
first allows a fast estimation of the cross-correlation necessary
for precise synchronization of the nodes. When activated, sound
processing has to be performed continuously in order to detect
the sound on every node. This data streaming implies real-time
processing of the incoming data, which is only supported by
dedicated computation blocks. Additionally, the network traffic
generated by the application is greatly reduced: only
time-of-arrival and synchronization information needs to be
exchanged.

Fig. 6. Architecture of the Localization Accelerator
Secondly, it allows a fast estimation of the source po-
sition within the network. The system is then able to run
autonomously without support from an external computation
unit allowing deployments in harsh environments. Examples
of applications can be found in the military domain (Counter-
sniper project) [11], but also for structural health monitoring
based on acoustic emissions [12].
B. Wireless Distributed Control Systems
In our project, the platform will also be used for active
vibration and noise control systems, in which the platform
will be deployed as a wireless distributed controller. A
decentralized control strategy will be used, where a master
platform coordinates data synchronization and communications.
Fig. 7 presents an example of a distributed control system
for vibration control of a large plate. We assume that the plate
vibration is controlled by N local adaptive controllers. Each
small area i ∈ {1, 2, · · · , N} on the plate is controlled by one
adaptive controller. As shown in the figure, only two adaptive
controllers are presented for the sake of simplicity. The
objective of the vibration control system is as follows: the
vibration sensed at N points on the plate is to be minimized
subject to a force disturbance d at an arbitrary location on the
plate, i.e. the error signals e_i, i ∈ {1, 2, · · · , N}, measured
by the error sensors should be minimized.
A sensor–actuator pair is placed on each node i. Piezoelectric
patches can be used as both actuators and sensors. In some
cases, tuned-mass dampers can also be used to absorb
vibrations. The signal u_i is the actuating signal sent by the
controller to the actuator, while e_i and x_i are the perturbed
error signal and the reference signal, respectively. Both signals
are used by the controller parameter adaptation mechanism.
Fig. 7. Distributed parameter control for vibration control of a large plate.
Fig. 8. Block diagram of the control systems.
Some blocks of transfer functions are shown in Fig. 7.
Tdx is the transfer function from the disturbance signal d to
reference sensor signal x. Tde is the transfer function from
the disturbance signal d to error sensor signal e. Tux is the
transfer function from the control signal u to reference sensor
signal x. Tue is the transfer function from the control signal
u to error sensor signal e. Fig. 8 shows the block diagram of
the distributed adaptive control system. Because the location
of the disturbance signal d is not fixed, the parameter values
(and possibly also the structure) of Tdx and Tde change
accordingly. An adaptive control system is therefore used to
handle this situation: the parameters of the adaptive controller
can be adaptively tuned to compensate for the changes in the
parameters of the transfer functions Tdx and Tde.
The structure of the adaptive transversal filter is shown
in Fig. 9. The transversal filter and the parameter adaptation
algorithm will be implemented on the Actel IGLOO FPGA
mounted on the platform. The filter consists of three main
units, i.e. a multiplier, an adder and a delay unit (z⁻¹). The
filter output is described by Equ. (2), where u(k) is the control
signal, a_j(k) are the tunable controller parameters, and x(k) is
the reference signal.
Fig. 9. Adaptive Controller Architecture.
u(k) = ∑_{j=0}^{P} a_j(k) × x(k − j)    (2)
The controller parameters a_p, p ∈ {0, 1, · · · , P}, are
adaptively tuned using the well-known Least-Mean-Square
(LMS) algorithm [13] shown in Equ. (3). The parameters are
updated using the measurement of the mean square error of an
error sensor signal at a local point in the system. The constant
γ is the adaptation gain, which can be increased to accelerate
the parameter adaptation. However, a higher gain tends to
destabilize the system, so a suitable value must be chosen.
aj(k + 1) = aj(k) + γe(k)x(k − j) (3)
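One controller step combines the filter output of Equ. (2) with the LMS update of Equ. (3). A minimal C sketch follows; the filter order, function name and the way samples are passed in are illustrative assumptions — on the platform the filter runs on the IGLOO FPGA, not in software:

```c
#define P 3  /* illustrative filter order: P+1 taps */

/* One adaptive-filter step: compute the control output u(k) per
 * Equ. (2), then update the taps a_j with the LMS rule of Equ. (3)
 * using the measured error e(k) and adaptation gain gamma.
 * x holds the reference samples x(k), x(k-1), ..., x(k-P). */
static float lms_step(float a[P + 1], const float x[P + 1],
                      float e, float gamma)
{
    float u = 0.0f;
    for (int j = 0; j <= P; j++)
        u += a[j] * x[j];              /* Equ. (2): transversal filter */
    for (int j = 0; j <= P; j++)
        a[j] += gamma * e * x[j];      /* Equ. (3): LMS tap update */
    return u;
}
```

The serial FPGA architecture mentioned below performs the same multiply–accumulate loop with a single multiplier and (P + 1) storage registers.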
For a system with a single adaptive controller, the error
signal is e(k) = y(k) − z(k) = Tde(z)d(k) − Tue(z)u(k).
Using Equ. (2), we then have
e(k) = Tde(z)d(k) − Tue(z){∑_{j=0}^{P} a_j(k)x(k − j)},
or, based on Fig. 8, e(k) = Tde(z)d(k) − Tue(z)C(z)x(k).
From the figure, we see that x(k) = Tdx(z)d(k) − Tux(z)C(z)x(k),
and hence x(k) = [Tdx(z) / (1 + Tux(z)C(z))] d(k). Therefore,
in a system with a single controller, the adaptive control will
reach a steady state, i.e. e(k) → 0, when Equ. (4) is fulfilled.

Tde(z) = Tdx(z)Tue(z)C(z) / (1 + Tux(z)C(z))    (4)
For a system with N adaptive controllers, the steady-state
condition holds when Equ. (5) is fulfilled:

Tde,i(z) = Tdx(z)Tue,i(z)Ci(z) / (1 + ∑_{i=1}^{N} Tux,i(z)Ci(z)),    (5)

where i ∈ {1, 2, · · · , N}.

In order to optimize logic gate usage in the IGLOO FPGA, the
adaptive transversal filter structure presented in Fig. 9 can
also be implemented with a serial architecture and (P + 1)
storage registers for the shifted reference signals and
controller parameters.
Wireless communication has been an active research topic in
the area of networked control systems [14], [15]. Two control
strategies can be used on our platform, depending on the role
of the wireless communication. Firstly, the wireless
communication can be used by a master controller and the
local controllers to synchronize data sampling from the
reference and error sensors. In this case, a decentralized
control strategy is used, where the LMS adaptation algorithm
tunes the controller parameters.
Secondly, the wireless communication can be used to exchange
the actuator and sensor data so that a master controller can
perform online parameter identification. The online parameter
identification can also be performed in each local controller.
The identified parameters can then be used to reconfigure the
controller parameter values so as to meet the control objective.
However, this strategy requires a high-speed and
guaranteed-lossless communication infrastructure. To address
these issues, a predictive or a state-estimation control strategy
can be used to reduce data exchanges over the wireless
medium [16], [17].
V. CONCLUSION AND FUTURE WORK
A platform that can be employed in multiple distributed
sensing and actuating applications has been introduced in this
paper. The hardware platform is realized as a printed circuit
board (PCB) carrying two main computing elements, the
IGLOO FPGA and the CC253x RF SoC. By plugging multiple
independent extensions into the FPGA, the node can be easily
adapted to specific applications. The power consumption of
the whole system is low enough for autonomous operation
over long periods, while complex control and data-processing
algorithms can be implemented with high efficiency.
In future work, the performance and reliability of our platform
in the aforementioned applications will be precisely measured.
Based on comparisons with traditional motes, speedup and
energy consumption gains will be estimated.
Additionally, we will use the platform for further appli-
cations on smart structures. In particular, a structural health
monitoring network for a bridge will be deployed.
ACKNOWLEDGEMENTS
This work has been supported by the European FP7 project
Maintenance on Demand (MoDe), Grant FP7-SST-2008-RTD
233890, and by the Hessian Ministry of Science and Arts through
the project AdRIA (Adaptronik-Research, Innovation,
Application), Grant Number III L 4–518/14.004 (2008).
REFERENCES
[1] H. Hinkelmann, A. Reinhardt, and M. Glesner, “A Methodology forWireless Sensor Network Prototyping with Sophisticated DebuggingSupport,” in Proceedings of the 19th IEEE/IFIP International Symposium
on Rapid System Prototyping, 2008.[2] J. Portilla, A. de Castro, E. de la Torre, and T. Riesgo, “A Modular Ar-
chitecture for Nodes in Wireless Sensor Networks,” Journal of UniversalComputer Science, vol. 12, pp. 328 – 339, 2006.
[3] L. Nachman, J. Huang, J. Shahabdeen, and R. A. R. Kling, “IMOTE2:Serious Computation at the Edge,” in Proceedings of the International
Conference on Wireless Communications and Mobile Computing, 2008.[4] IGLOO Low-Power Flash FPGAs Datasheet, Actel.[5] CC253x System-on-Chip Solution for 2.4 GHz IEEE 802.15.4 and ZigBee
Applications User’s Guide, Texas Instruments.[6] F. Philipp, P. Zhao, F. A. Samman, and M. Glesner, “Demonstration :
Monitoring and Control of a Dynamically Reconfigurable Wireless Sen-sor Node Powered by Hybrid Energy Harvesting,” in Design, Automation
& Test in Europe (DATE), University Booth, 2011.
[7] L. A. Vera-Salasa, S. V. Moreno-Tapiaa, R. A. Osornio-Riosa, andR. de J. Romero-Troncosob, “Reconfigurable Node Processing UnitFor A Low-Power Wireless Sensor Network,” in Proceedings of theIntenational Conference on Reconfigurable Computing, 2010.
[8] A. Dunkels, B. Gronvall, and T. Voigt, “Contiki - A Lightweight andFlexible Operating System for Tiny Networked Sensors,” in Proceedings
of the 29th IEEE International Conference on Local Computer Networks,2004.
[9] F. Philipp, F. A. Samman, and M. Glesner, “Real-time Characterizationof Noise Sources with Computationally Optimised Wireless SensorNetworks,” in Proceedings of the 37th Annual Convention for Acoustics
(DAGA), 2011.[10] H. C. Schau and A. Z. Robinson, “Passive Source Localization Employ-
ing Intersecting Spherical Surfaces from Time-of-Arrival Differences,”in IEEE Transactions on Acoustics, Speech and Signal Processing, 1987.
[11] G. Simon, M. Marti, . Ldeczi, G. Balogh, B. Kusy, A. Ndas, G. Pap,J. Sallai, and K. Frampton, “Sensor Network-Based Countersniper Sys-tem,” in Proceedings of the 2nd International Conference on Embeddednetworked sensor systems, 2004.
[12] S. D. G. C. U. Grosse and M. Krger, “Initial Development of WirelessAcoustic Emission Sensor Motes for Civil Infrastructure State Monitor-ing,” Smart Structures and Systems, vol. 6, pp. 197 – 209, 2010.
[13] S. Haykin, Adaptive Filter Theory, 3rd ed. Prentice-Hall, 1996.[14] N. J. Ploplys, P. A. Kawka, and A. G. Alleyne, ““Closed-Loop Control
over Wireless Networks”,” IEEE Control Systems Magazine, vol. 24,no. 3, pp. 58–71, June 2004.
[15] H. A. Thompson, ““Wireless and internet communications technologiesfor monitoring and control”,” Elsevier J., Control Engineering Practice,vol. 12, no. 6, pp. 781–791, June 2004.
[16] J. K. Yook, D. M. Tilbury, and N. R. Soparkar, “Trading Computation forBandwidth: Reducing Communication in Distributed Control Systemsusing State Estimator,” IEEE Trans. Control Systems Technology, vol. 10,no. 4, pp. 503–518, July 2002.
[17] R. Wang, G.-P. Liu, W. Wang, D. Rees, and Y. B. Zhao, “GuaranteedCost Control for Networked Control Systems Based on an ImprovedPredictive Control Method,” IEEE Trans. Control Systems Technology,vol. 18, no. 5, pp. 1226–1232, Sep. 2010.
Session 4: Virtual Prototyping for MPSoC
A Novel Low-Overhead Flexible Instrumentation Framework for Virtual Platforms
Tennessee Carmel-Veilleux∗, Jean-François Boland∗ and Guy Bois†
∗ Dept. of Electrical Engineering, École de Technologie Supérieure, Montréal, Québec, Canada† Dept. of Software and Computer Engineering, École Polytechnique de Montréal, Montréal, Québec, Canada
Abstract—Instrumentation methods for code profiling, tracing and semihosting on virtual platforms (VP) and instruction-set simulators (ISS) rely on function call and system call interception. To reduce instrumentation overhead that can affect program behavior and timing, we propose a novel low-overhead flexible instrumentation framework called Virtual Platform Instrumentation (VPI). The VPI framework uses a new table-based parameter-passing method that reduces the runtime overhead of instrumentation to only that of the interception. Furthermore, it provides a high-level interface to extend the functionality of any VP or ISS with debugging support, without changes to their source code. Our framework unifies the implementation of tracing, profiling and semihosting use cases, while at the same time reducing detrimental runtime overhead on the target by as much as 90% compared to widely deployed traditional methods, without significant simulation time penalty.
Index Terms—Computer simulation, Software debugging, Software prototyping, System-level design
I. INTRODUCTION
With the advent of multiprocessor systems-on-chip (MPSoC) for consumer and networking applications, complexity has become a significant issue for system debugging and prototyping. Simulators and system-level modeling tools have become necessary to manage this complexity. Virtual platforms (VP) are system-level software tools combining instruction-set simulators (ISS) and peripheral models that are used to start software prototyping before availability of the final product. In the case of state-of-the-art MPSoCs, virtual platform models can even be used as the "golden model" provided to developers years before availability of final silicon [1]. The proliferation of SystemC-based design-space exploration tools (e.g. Platform Architect [2], ReSP [3], Space Studio [4], etc.) was also made possible by mature VP technology.
When using VPs for debugging or design-space exploration, software instrumentation methods can be used to obtain profiling data, execution traces or other introspective behavior. The runtime overhead (i.e. intrusiveness) of these instrumentation methods on the target is critical. It must be minimized to prevent interference with the strict timing constraints common in embedded software [5].
In this paper, we present a novel low-overhead flexible code instrumentation framework called Virtual Platform Instrumentation (VPI). The VPI framework can be used to extend existing virtual platforms with additional tracing, profiling and semihosting capabilities with minimal target code overhead and timing interference. Semihosting is a mechanism whereby a function's execution on the target is delegated to an external hosted environment, such as a VP.
The authors would like to acknowledge financial support from the Fonds québécois de la recherche sur la nature et les technologies (FQRNT), the École de technologie supérieure (ÉTS) and the Regroupement Stratégique en Microsystème du Québec (ReSMiQ) in the realization of this research work.
Semihosting is traditionally used to exploit the host's I/O, console and file system before support becomes available on the target [6].
Through our proposed framework, we make three main contributions.
Firstly, we describe a new mechanism for fully inlinable instrumentation insertion with table-driven parameter-passing between a simulated target and its host. Our method completely foregoes the function call parameter preparation overhead seen in traditional semihosting. In doing so, we reduce detrimental runtime overhead on the target by 2–10 times in comparison to traditional methods, while showing nearly identical simulation run times.
Secondly, we show that our framework can realize function semihosting, tracing and profiling tasks, thus unifying usually separate use cases.
Thirdly, we propose a generic high-level instrumentation handling interface for VPs which allows new instrumentation behavior to be added to existing tools without requiring modifications.
This paper is organized as follows: section II presents background information and related work about virtual platform code instrumentation, section III describes our proposed instrumentation framework, and section IV presents experimental case studies of semihosting and profiling, with conclusions and future work in section V.
II. BACKGROUND AND RELATED WORK
In this section we explore different instrumentation methods used for debugging, system prototyping and profiling on virtual platforms. This is followed by an overall comparison of the methods, including our proposed VPI framework.
For our purposes, we define virtual platforms as software environments that simulate a full target system on a host platform. Virtual platforms integrate instruction-set simulators as well as models of memories, system buses and peripherals to realize a full SoC simulator. The conceptual layering of a VP is shown in Figure 1. Through the development of our framework, we evaluated the features and mechanisms present in the Simics [7], Platform Architect [2], QEMU [8], ReSP [3] and OVPSim [9] virtual platforms. In our experimental case study of section IV, we concentrated on Simics and QEMU.
A. Instrumentation use cases overview
We define instrumentation as tools added to a program to aid in testing, debugging or measurements at run-time. These tools can be implemented as intrusive instrumentation functions in source code or as non-intrusive instrumentation functionality within a VP. In our context, intrusive means that target run time is affected in some way by the instrumentation. An instrumentation site refers to the location where instrumentation is inserted.
978-1-4577-0660-8/11$26.00 © 2011 IEEE
Figure 1. Virtual platform modeling layers: the host hardware (user's workstation) and host operating system (Windows, Linux) run the virtual platform model of the target system (e.g. a SystemC model), which contains the ISS cores, system bus, memory and peripheral models, as well as the Virtual Platform Instrumentation Interface (VPII).
Some examples of intrusive instrumentation use cases are:
• compile-time insertion of tracing or profiling calls at every function entry and exit point [10, p. 75];
• compile-time insertion of code coverage or other measurement statements in existing source code;
• insertion of probe points for fine-grained execution tracing at the OS kernel level (e.g. Kernel Markers [11] in the Linux kernel).
Conversely, examples of non-intrusive instrumentation include:
• insertion of breakpoints and watchpoints at runtime using a debugger to aid in tracing and debugging;
• interception of library function calls through their runtime address to emulate functionality or store profiling data [4], [3];
• runtime insertion of transparent user-defined instruments tied to program or data accesses, such as the probe, event and watch mechanisms of Avrora [12];
• storage of control-flow data in hardware trace buffers readable through specialized interfaces.
The instrumentation methods which implement these use cases differ significantly in how much they affect target and host run time (i.e. their intrusiveness) and in what they enable the user to do (i.e. their flexibility).
In terms of tracing instrumentation, many varieties exist which differ in semantic level. The VPs we evaluated each allowed straightforward dumping of a trace showing all instructions executed and every data access, with no context-related semantic information. Conversely, user-defined or compiler-inserted high-level tracing stores much less data, but with much higher semantic content (e.g. a list of task context switches in an OS [7]). When we refer to tracing in this paper, we are referring to the high-level tracing case.
In the next section we discuss semihosting, a method used to implement several of the aforementioned instrumentation tasks either intrusively or non-intrusively.
B. Semihosting function calls
In the general case, semihosting works by intercepting calls to specific function stubs in the target code. Instead of running the function on the target at these sites, the ISS forwards an event to the VP. The VP's semihosting implementation then examines the processor state in the ISS and emulates the function's behavior appropriately, with target time stopped.
With semihosting, mechanisms for call interception differ by implementation. Run-time and code size intrusiveness are directly linked to which interception mechanism is used. We distinguish three such mechanisms:
• "Syscall" interception: the "system call" instruction is diverted for use in interception. This is the traditional approach, as used by ARM tools [6] as well as in QEMU for PPC and ARM platforms. This approach may require an exception to be taken, with associated runtime overhead.
• "Simcall" interception: a specific instruction is diverted for use as an interception point. The instruction can be specific for that purpose, like the SIMCALL instruction in the Tensilica Xtensa architecture [13, p. 520], or it can be an architectural NO-OP, as in the Simics virtual platform, where it is called a "magic instruction" [7].
• Address interception: the entrypoints of all functions to be emulated are registered as implicit breakpoints in the VP. When the program counter (PC) reaches these breakpoints, interception occurs. This approach is used in tools such as Imperas OVPSim [9], ReSP [3] and Space Studio [4], amongst others. It can also be implemented in any VP with debugging support using watchpoints or breakpoints.
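As an illustration of the address-interception idea, a VP can test the program counter against a table of registered entrypoints before executing each instruction. The following C sketch is hypothetical — the names, handler signature and table layout are assumptions for illustration, not code from any of the cited tools:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handler type: a semihosted function emulated by the VP. */
typedef void (*semihost_handler)(void *cpu_state);

struct intercept {
    uint32_t addr;             /* entrypoint of the function to emulate */
    semihost_handler handler;  /* host-side emulation routine */
};

/* Consulted by the ISS before executing each instruction: returns the
 * registered handler when pc hits an intercepted entrypoint, or NULL
 * so that the instruction at pc is executed normally. */
static semihost_handler check_intercept(const struct intercept *tab,
                                        size_t n, uint32_t pc)
{
    for (size_t i = 0; i < n; i++)
        if (tab[i].addr == pc)
            return tab[i].handler;
    return NULL;
}

/* Example host-side handler (hypothetical): count calls to a stub. */
static int fopen_calls;
static void semihost_fopen(void *cpu_state)
{
    (void)cpu_state;  /* a real handler would read call arguments here */
    fopen_calls++;
}
```

A real VP would typically replace the linear search with a hash or an implicit-breakpoint mechanism so the per-instruction check stays cheap.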
With all traditional semihosting methods, context-specific parameters are passed using regular function parameter-passing. The preparation of semihosted function call parameters according to normal calling conventions accounts for most of the runtime overhead of these methods.
Since the emulated function is fully executed by VP host code, the entire state of the modeled system can be exploited. Using the example of an MPSoC model, this implies that the internal registers of all CPU cores can be accessed while processing the emulated function. This opens the possibility of emulating as much as a full OS system call, as done in [3], [14], or of providing "perfect" barrier synchronization primitives across the system [15].
Semihosting is increasingly used in the implementation of design-space exploration tools for hardware–software codesign. In that case, it enables developers to quickly assess the performance of an algorithm without having to deal with the accessory details of OS porting or adaptation early in the design phase [3], [4].
C. Comparison of instrumentation methods
In this section we compare different instrumentation methods commonly used under virtual platforms. The comparison matrix is shown in Table I. Although we have not yet detailed our Virtual Platform Instrumentation (VPI) method, it is included in the table for comparison purposes and fully described in section III.
Firstly, we evaluated intrusiveness in terms of code size, run time and features "lost" to the method (e.g. a system call no longer available). Secondly, we established whether the methods work without symbol information (i.e. even with a raw binary image) and whether they allow for inlining of the instrumentation. By symbol information, we mean the symbol table that links function names to their addresses, which is present in all object file formats. Finally, we determined whether the methods listed are suitable for the different use cases presented earlier. For the qualitative criteria, we evaluated the implementation source code or manuals of every method listed to determine the values shown.
Our VPI method appears to compare favorably with existing approaches. We contrast our method with other approaches and provide experimental results supporting these intrusiveness comparisons in section IV.
Table I
INSTRUMENTATION METHODS COMPARISON MATRIX
(The "Code", "Run-time" and "Features lost" columns rate intrusiveness; "Without symbols" and "Inline" whether the method works; the last three columns which use cases are supported.)

Method                                           | Code   | Run-time | Features lost | Without symbols | Inline | Tracing | Profiling | Syscall/OS emulation
Compiler-inserted profiling function calls       | High   | High     | No            | No              | No     | Yes     | Yes       | No
Traditional semihosting ("Syscall" interception) | Medium | Medium   | Yes           | Yes             | No     | Depends | Depends   | Yes
Traditional semihosting ("Simcall" interception) | Low    | Low      | Yes           | Yes             | No     | Depends | Depends   | Yes
Traditional semihosting (Address interception)   | Low    | None     | No            | No              | No     | Yes     | Yes       | Yes
Watchpoints / Breakpoints                        | None   | None     | No            | No              | N/A    | Yes     | No        | No
VPI (Proposed method)                            | Low    | None–Low | No            | Yes             | Yes    | Yes     | Yes       | Yes
III. DETAILS OF PROPOSED FRAMEWORK
Our code instrumentation framework (VPI) is composed of twosoftware elements:
1) an inline instrumentation insertion method with table-basedparameter-passing, implemented with inline assembler in Ccode;
2) a high-level virtual platform instrumentation interface (VPII)that handles interception of instrumentation sites by callingappropriate virtual platform instrumentation functions (VPIFs).
Combined, these two components form a low-overhead generic codeinstrumentation framework that can be implemented on any VP orISS with debugger support or extension capabilities.
For the purposes of this paper, the compiler's inline assembler extensions are those of the unmodified GCC version 4.5 C compiler [16]. However, the concepts behind our method are tool-agnostic and applicable to production-level compilers.
In the following subsections, we refer to the numbered markers in Figure 2 to illustrate the flow of instrumentation insertion from the initial source code to the compiler-generated assembler code. Marker 1 of Figure 2 will be listed as (Ê), marker 2 as (Ë), and so on. We will use the fopen() C library function as a semihosting example to illustrate instrumentation insertion.
A. Target-side instrumentation insertion
Instrumentation statements are inserted into target code by the developer using common C macros (Ê). They can refer to any program variable (Ë). Each instrumentation macro expands to inline assembler statements containing a semihosting interception block (Ì,Í) and a parameter-passing payload related to the desired instrumentation (Î). The entire instrumentation call site is inserted inline (i.e. in-situ).
At compile time, the interception block (Ð) and parameter-passing payload table (Ñ) are constructed from compiler-provided register and memory address allocations. This is done by accessing inline-assembler-specific placeholders (Ï) and pretending instructions are emitted from them.
When inline assembler is used within a function's body, placeholders referencing C variables in the assembler code are replaced by values from the compiler's internal register and memory address allocation algorithms.
We save these references out of band from the main code section (“.text”), in the read-only data section (“.rodata”). The choice of the “.rodata” section for the payload data table is deliberate, to prevent instruction cache interference by data that never gets read by user code. However, it is possible and sometimes required to use the “.text” section for the payload table. For example, if the target OS uses paged virtual memory, the interception block and payload table may need to be inlined in the code section. Otherwise, the table’s effective address range might not currently be mapped in by the OS, causing a data access exception at interception time.
Interception block
Although interception is still necessary with our method, we do not mandate the use of a specific mechanism. The interception block from our example of Figure 2 (Ì,Í) is composed of three parts: 1) “simcall” interception instruction (“rlwimi 0,0,0,0,9” in this case); 2) pointer-skipping branch; and 3) payload table pointer. The interception block shown is an arbitrary example. Any other interception mechanism described in Section II-B could be used, as long as it is supported by the VP.
Along with the interception block, a pointer to the parameter-passing payload table is used to link an instrumentation site with its parameters. An unconditional branch is added to the interception block to prevent the fetching and execution of the payload table pointer.
Parameter-passing payload table
The parameter-passing payload table serves as a link between the target program’s state and the high-level instrumentation interface running in the VP. For an instrumentation site, it both uniquely identifies the desired behavior and provides reference descriptors to the function parameters that should be passed to and from the handler. These reference descriptors allow a high-level instrumentation interface to both read data from, and write data back to, the target program’s state.
The format used for each payload table is as follows:
• Signature header (1 word), including a functional identifier (16 bits) and the quantity (from 0–15 each) of constants, input variable references and output variable references;
• Constants table (1 word each);
• Input variable references (fixed number of strings and/or instructions);
• Output variable references (fixed number of strings and/or instructions).
The signature header identifies the desired functional behavior (e.g. tracing, fopen(), printf(), etc.). For every functional identifier, it is possible to use more or fewer constants, inputs and outputs depending on the need. For instance, a “printf()” function could be implemented as 16 versions, covering the cases where 0 to 15 variables need to be formatted.
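As a concrete illustration of the signature header described above, the following Python sketch decodes one header word. The exact bit positions are not stated in the text, so the layout below is an assumption chosen to be consistent with the 0x0120000b example of Figure 2 and Table II (function 0x000b, 0 constants, 2 inputs, 1 output).

```python
# Hypothetical signature-header decoder. Assumed layout (not specified in
# the paper): low 16 bits = functional identifier; top three nibbles hold
# the 0-15 counts of constants, outputs and inputs respectively.

def decode_signature(word):
    """Split a 32-bit signature header into (func_id, constants, inputs, outputs)."""
    func_id = word & 0xFFFF
    n_constants = (word >> 28) & 0xF
    n_outputs = (word >> 24) & 0xF
    n_inputs = (word >> 20) & 0xF
    return func_id, n_constants, n_inputs, n_outputs

# The fopen() example from Figure 2: function 0x000b, 0 constants,
# 2 inputs, 1 output.
print(decode_signature(0x0120000B))  # → (11, 0, 2, 1)
```

Under this assumed layout, the decoded counts tell the VPII how many constant words and reference descriptors to read after the header.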
Constants are emitted from references known before runtime. For instance, our implementation of a semihosted “printf()” uses a
[Figure 2 listing: a C call f = VP_FOPEN(filename, "w+") is macro-expanded into inline assembler; after compilation, this yields the interception block (“rlwimi 0,0,0,0,9”, a branch skipping the payload table pointer, and the pointer itself: 3 words, 2 executed) and a “.rodata” parameter-passing payload (signature word 0x0120000b plus reference strings "11", "30" and "8(10)": 1 word and 3 strings, 16 bytes), annotated with markers Ê–Ñ.]
Figure 2. Overview of VPI instrumentation insertion in source code
constant slot for the pointer to the format string. The example of Figure 2 does not use any constants.
Input variable references and output variable references are compiler-provided data references that can be accessed by the VPIFs through the VPII.
Table II breaks down the payload table of our “fopen()” example from Figure 2. Again, this example is based on a PowerPC target, but equivalent content would be present for any architecture.
Although our example of Figure 2 uses only strings for references, both strings and instructions can be used, as long as the table format is understood by the VPII implementation. In the case where instructions are used, the VPII can disassemble them at runtime to decode the references they contain. To illustrate this, we show a store
Table II
DESCRIPTION OF FIGURE 2’S PAYLOAD TABLE

Compiled value            Description
0x0120000b                Signature header: function 0x000b; 0 constants; 2 inputs; 1 output
“11”                      Value of “retval” output variable is in GPR11
“30”                      Value of “filename” input variable is in GPR30
“8(10)” or stw 0,8(10)    Pointer to “mode” (“w+”) input variable is contents of GPR10 + 8
(“stw”) instruction that could replace the reference string of the last reference in the table.
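The string descriptors of Table II are compact enough for a VPII implementation to parse with a few lines of code. The following Python sketch, our own illustration rather than code from the framework, distinguishes a bare register number from a PowerPC-style offset(base) memory reference:

```python
import re

# Illustrative parser for the string reference descriptors of Table II:
# a bare number such as "11" names a general-purpose register, while
# "8(10)" uses PowerPC displacement syntax for "contents of GPR10 + 8".

def parse_reference(ref):
    m = re.fullmatch(r"(-?\d+)\((\d+)\)", ref)
    if m:  # memory reference: offset(base-register)
        return ("mem", int(m.group(2)), int(m.group(1)))
    return ("reg", int(ref), 0)  # plain register reference

print(parse_reference("11"))     # → ('reg', 11, 0)
print(parse_reference("8(10)"))  # → ('mem', 10, 8)
```

A VPIF handler would then read the value through the VP's register or memory accessors, depending on the returned kind.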
Remarks about insertion method construction
As far as we know, every other semihosting-based method is designed for source code equivalence: all instrumentation-calling code must remain identical after instrumentation is removed. This requirement has the advantage of allowing instrumentation to be included by simply linking with different versions of the libraries. However, parameter-passing becomes bound to the C calling conventions in effect on the target platform. We constructed our proposed instrumentation insertion method to overcome the artificial requirement of function call setup when running on virtual platforms.
In our case, where we know the instrumented binary will be run in a VP, it is only necessary to somehow tell the VP where to find function parameters after a call is intercepted. Function call preparation merely copies program variables into predetermined registers or stack frame locations. Since the VP can access all system state “in the background” without incurring instruction execution penalties, we replaced the function call and its associated execution overhead with a static parameter-passing table. Parameters can then be accessed by interpreting the table, rather than by reading predetermined registers or stack frame locations. The compiler guarantees the “reloading” from memory of any variable not locally available from registers or offsetable memory locations. This reloading overhead is, in all cases, a subset of standard function call overhead.
Another side effect of our method’s construction is that instrumentation insertion is always inlined. This has the desirable consequence of “following” other inlining done by the compiler. It then becomes trivial to instrument functions inlined by the compiler’s optimizer, and to identify them uniquely, without any special compiler support.
Finally, while optimizing compilers can reorder statements around sequence points in C code, some compiler-specific mechanisms can be used to guarantee the positioning of the inlined assembler blocks. During our tests with GCC 4.5 on ARM and PPC platforms, the use of volatile asm statements with a “memory” clobber prevented any instruction reordering from affecting the test result signatures at every optimization level.
B. Virtual Platform Instrumentation Interface
Within our framework, we propose that the VP be pre-configured to run a centralized instrumentation handler whenever interception occurs at an instrumentation point. A high-level, object-oriented virtual platform instrumentation interface (VPII) layer is used to interface between the VP and the instrumentation functions by providing abstract interfaces to the VP’s state and parameter-passing tables.
[Figure 3 diagram, bottom to top: Virtual Platform (VP) containing an Instruction Set Simulator (ISS); GDB + Python or an internal Python interface; Virtual Platform Instrumentation Interface (VPII); Virtual Platform Instrumentation Functions (VPIF).]
Figure 3. VPI framework implementation layers
[Figure 4 class diagram: abstract DebuggerInterface (eval_expr(), read_mem(), write_mem(), get_reg(), set_reg(), get_pc(), variable accessors) and ApplicationBinaryInterface (is_64_bits(), endianness and type-conversion helpers) classes, specialized as GdbDebuggerInterface, PpcApplicationBinaryInterface and PpcGdbDebuggerInterface; a VariableAccessor hierarchy with DirectAccessor, RegisterAccessor and SymbolicAccessor; and VPIInterface/VPITriggerMethod/VPIFunction classes (register_func(), get_payload(), process(), accessor_factory()), specialized down to PpcVPIInterface, PpcGdbVPIInterface and PpcGdbVPITriggerMethod, which access state through the debugger interface.]
Figure 4. Class hierarchy of a sample VPII implementation
It also executes the appropriate virtual platform instrumentation function (VPIF) handler on behalf of the target code. The layers forming this high-level interface are shown in Figure 3.
The VPII abstraction allows the VPIF handlers to access registers, memory and internal VP state using generic accessors that hide low-level platform interfaces. It also handles data conversion tasks related to a platform’s application binary interface (ABI).
The VPII is implemented using the high-level language (HLL) extension interfaces built into VPs. For instance, this could be an internal script interpreter, such as Python in Simics [7] or Tcl in Synopsys Platform Architect tools [2]. It could also be a C++ library built on top of a SystemC simulator. Alternatively, the GNU Debugger’s (GDB) Python interface can be used to implement a generic VPII suitable for existing ISS and VP implementations with GDB debugging support.
For our experimental implementation, we developed VPII and VPIF libraries supporting both Simics’ and GDB’s Python extension interfaces. The class hierarchy for our implementation of the VPII interface for PowerPC targets with GDB-based VP access is shown in Figure 4. In that example, the PpcGdbVPIInterface class is used as the focal point to register instrumentation behavior (VPIF handlers) and access the VP through GDB.
For testing, we also developed a sample library of VPIF handlers covering the common instrumentation tasks of I/O semihosting, tracing and code timing. New VPIF handlers can be registered and modified dynamically at run-time with our sample Python implementation. Handlers written generically, using only parameter accessors and no VP-specific functionality, can be reused on any supported architecture.
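A minimal sketch of this registration-and-dispatch pattern, borrowing the register_func()/process() method names from Figure 4, might look as follows. The payload dictionary and the handler body are hypothetical stand-ins for the real accessor-backed objects of the framework.

```python
# Minimal sketch of the dispatch role the VPII plays. Real handlers would
# read parameters through accessor objects backed by Simics or GDB; here
# a plain dict stands in for a decoded payload table.

class VPIInterface:
    def __init__(self):
        self._handlers = {}

    def register_func(self, func_id, handler):
        """Associate a functional identifier with a VPIF handler."""
        self._handlers[func_id] = handler

    def process(self, payload):
        """Called on interception: dispatch on the payload's function id."""
        return self._handlers[payload["func_id"]](payload)

vpii = VPIInterface()
vpii.register_func(0x000B, lambda p: "fopen(%s)" % p["inputs"][0])
print(vpii.process({"func_id": 0x000B, "inputs": ["testfile", "w+"]}))
# → fopen(testfile)
```

In the real implementation, process() would be invoked by the VP's interception hook, and the handler would write results back to the target through the output-variable descriptors.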
IV. EXPERIMENTAL CASE STUDY
In order to validate that the method we propose is flexible and has low overhead, we performed a comparative case study. We compared our experimental framework implementation to common instrumentation methods using controlled examples.
Source code for the case study, as well as for our VPI implementation, is available at http://tentech.ca/vp-instrumentation/ under a BSD open-source license.
A. Experimental setup
The case study was run on a standard PC running Windows 7 x64 Professional with an Intel Core 2 Duo P8400 with two 2.26 GHz cores. The toolchain and C libraries were from the Sourcery G++ 2010.09-53 release, based on GNU GCC 4.5.1 and GNU Binutils 2.20.51. The target was a PowerPC e600 single-core processor on the Wind River Simics 4.0.60 and QEMU PPC 0.11.50 virtual platforms. We used GDB 7.2.50 with Python support as the debugger.
We instrumented the “QURT” quadratic equation root-finding benchmark program from the SNU WCET suite [17] based on two instrumentation scenarios, which were run independently. These scenarios showcase the unification of profiling and semihosting use cases, since both are implemented using the same VPI framework functionality and insertion syntax.
Each scenario comprised a base non-instrumented case and four instrumented cases. The instrumented cases represent different combinations of instrumentation methods and VPI configurations. The VPI configurations were the following:
• Internal: the VPI handler is run internally on Simics’ Python interpreter with “simcall” interception.
• External: the VPI handler is run externally on GDB’s Python interpreter with debugger watchpoint interception, under either Simics or QEMU.
We compared the three following instrumentation methods:
• “VPI”: uses our inlined VPI instrumentation for each site;
• “Stub-call”: calls a C function stub at every site which wraps an inlined VPI instrumentation site, so that traditional semihosting function call overhead can be compared;
• “Full-code”: in the case of the printf() scenario, we run an optimized printf() implementation entirely on the target, with I/O redirected to a null device, so that “manual” non-semihosted instrumentation overhead can be compared.
For each run, we recorded binary section sizes, simulation times on the host and cycle counts on the target. Section sizes provide information about code size interference. Simulation times and cycle counts are used to compare runtime overhead. With the “stub-call” cases, the results in Tables III, IV and V are compensated by subtracting the wrapped VPI site contribution, which would have artificially inflated the results of those cases.
All results are from release-type builds with no debugging symbols and the “-O2” (“optimize more”) option on GCC. Host OS noise was quantified by executing 50 runs of each case.
B. Results of “printf()” semihosting scenario
The printf() semihosting scenario compares the space and time overhead of a printf() function semihosting use case. In this case, we inserted 3 instrumented sites to display the results of different loops of the QURT benchmark. Each loop ran 100 times, for a total of 300 calls. For all cases, the printf() implementation was functionally equivalent, with full float support. The printf() statement was printf("Roots: x1 = (%.6f%+.6fj) x2 = (%.6f%+.6fj)\n", x1[0], x1[1], x2[0], x2[1]).
Table III
TARGET RUNTIME OVERHEAD IN CYCLES FOR PRINTF() SCENARIO

Instrumentation case   CPU cycles   Total overhead   Per-call overhead   Overhead increase
None                   5 538 648    0                0                   N/A
Internal VPI           5 539 272    624±24           2                   ×1 (Base)
External VPI           5 539 872    1224±24          4                   ×2
Stub-call              5 545 848    6888±24          23                  ×11.5
Full-code              7 174 476    1 635 828±24     5453                ×2726.5
Table IV
BINARY SIZE OVERHEAD FOR PRINTF() SCENARIO

Instrumentation case   .text size   .data size   .rodata size   Total
None                   34 796       1864         1224           37 884
Internal VPI           +132         +0           +200           +332
External VPI           +168         +32          +200           +400
Stub-call              +320         +0           +80            +400
Full-code              +328         +0           +40            +368
We ran this scenario on both Simics and QEMU. QEMU only supports the external configuration without source code modifications. Since all binaries are identical between Simics and QEMU, the results of Tables III and IV apply equally to both VPs.
Simulated CPU runtime overhead results are detailed in Table III. Uncertainty on overhead was ±24 cycles because of the timing method. We observe that execution overhead per call for the VPI cases is only 2–4 cycles, depending on the configuration. The external VPI configuration—with watchpoint interception—requires twice as many instructions per call as the internal “simcall”-based VPI configuration. Overheads of the VPI cases are a significant 5–11 times reduction over traditional stub-call instrumentation. Function call preparation accounts for the higher overhead of the stub-call case. In contrast, even when excluding I/O cost, the full-code printf() case has 3 orders of magnitude higher runtime overhead than either VPI case.
Space overhead results are listed in Table IV, in comparison to the uninstrumented base case. Code section (.text) space overheads of the VPI cases are noticeably lower than those of the other cases. Through manual assembler code analysis, we confirmed that function call preparation accounted for the difference observed between the stub-call and full-code cases. As expected, the lower code section overheads of the VPI cases come at the cost of a larger constants section (.rodata), although total sizes are comparable.
In terms of simulation time, our VPI framework’s overhead depends considerably on whether an internal or external configuration is used. Simulation times for different scenarios under both Simics and QEMU are shown in Figure 5 (note the logarithmic scale on the simulation time axis). The “full-code” and “stub-call” cases in that figure do not have any interception methods enabled at runtime.
The internal uninstrumented (“Internal None”) case is shown to have no penalty on simulation time. Conversely, the external uninstrumented (“External None”) case—which uses watchpoint instead of “simcall” interception—causes some baseline interception overhead. Furthermore, the internal VPI-only case is shown to have no penalty over a traditional internal stub-call case.
With all instrumented cases, those using internal configurations
[Figure 5 content: bar chart of simulation times in seconds, logarithmic scale. Simics bars, in order: Base 0.247, Internal None 0.249, Full-code 0.285, Stub-call 0.292, Internal VPI 0.293, External None 0.353, External VPI 9.791. QEMU bars, in order: Base 0.304, External None 0.318, Full-code 0.353, Stub-call 150.0, External VPI 182.2.]
Figure 5. Simulation times for printf() scenario
display significantly better simulation performance than those using external configurations. Moreover, the internal VPI instrumented case is even faster than the external uninstrumented case under Simics. This shows that simulation time is practically unaffected by low instrumentation loads when an internal VPI framework configuration is used. The interception and VPII mechanisms appear to be much slower when going through the GDB interfaces used in all external cases. We determined that the slowdown was due to the overhead of both the GDB ASCII protocol and the context switches required to go back and forth between the GDB and VP processes. In contrast, the internal configuration has direct access to VP resources, which explains its better performance. In the case of QEMU, GDB communication overhead was prohibitive enough to prevent the use of our framework for non-trivial cases under that particular VP.
C. Results of “profiling” scenario
The profiling scenario compares the overhead of runtime profiling between stub-call (i.e. compiler-inserted) and inlined tracing/profiling instrumentation. In the stub-call cases, the -finstrument-functions option of GCC was used to automatically insert a call to instrumentation stubs at every function entry and exit point. For the VPI cases, we manually inserted the VPI tracing calls in the C source code at every function entry and exit point. In both cases, the instrumentation behavior involved recording execution tracing information to a file, as usually done by profiling tools. The tracing call was of the form vp_gcc_inst_trace("FUNC_ENTER", "NAME", __FILE__, __LINE__), where vp_gcc_inst_trace is a VPI instrumentation site insertion macro. There were 11 instrumentation sites, totalling 456 802 calls over a run and yielding a trace file over 23 megabytes long. This is more than a thousandfold increase in instrumentation calls over the printf() scenario. We did not run this scenario under QEMU in light of the prohibitive simulation times for the much simpler printf() scenario.
Results are detailed in Table V. We only present runtime and simulation time overhead results, since space overhead is negligible in runtime-dominated profiling use cases. As with the printf() semihosting scenario, large differences exist in the results depending on the configuration used. With internal configurations, the instrumentation calls penalized simulation time on the order of 150 µs per call. In contrast, the negative impact on simulation speed of accessing VP state through an external interface is clearly demonstrated by overhead results around 7 ms per call, which is close to 50 times worse than with internal cases. On the opposite end of the performance spectrum, the internal VPI instrumentation case displays significantly lower runtime overhead than the traditional stub-call approach, for a comparable simulation time.

Table V
OVERHEADS PER CALL FOR PROFILING SCENARIO

Instrumentation case   Total simulation time (s)   Runtime overhead (cycles)   Simulation overhead (s)
None                   0.250                       0                           0
Internal VPI           62.59                       2.13                        136.5 µ
Internal stub-call     69.88                       9.7                         152.4 µ
External VPI           3286                        4.06                        7193 µ
External stub-call     3299                        9.7                         7221 µ
In terms of target runtime overhead, a reduction of 2–5 times over the stub-call case is seen with the VPI cases. If complex behavior had been implemented in the instrumentation functions on the target instead of wrapping a VPI call, overhead would have increased proportionately over the simple stub-call cases shown.
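Both ratios can be checked directly against Table V. The following Python snippet reproduces the roughly 50-times simulation-overhead gap between external and internal configurations and the 2–5 times runtime-overhead reduction:

```python
# Cross-checking the profiling-scenario claims against Table V.

internal_vpi_us = 136.5   # Internal VPI simulation overhead per call (µs)
external_vpi_us = 7193.0  # External VPI simulation overhead per call (µs)
vpi_cycles = 2.13         # Internal VPI runtime overhead per call (cycles)
stub_cycles = 9.7         # stub-call runtime overhead per call (cycles)

# External vs. internal simulation overhead: "close to 50 times worse".
print(round(external_vpi_us / internal_vpi_us, 1))  # → 52.7

# Stub-call vs. VPI runtime overhead: within the stated 2-5x reduction.
print(round(stub_cycles / vpi_cycles, 1))  # → 4.6
```

The 4.6× figure corresponds to the internal case; the external case (9.7 / 4.06 ≈ 2.4) gives the lower end of the stated 2–5× range.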
V. CONCLUSION AND FUTURE WORK
Compared to existing semihosting and profiling instrumentation approaches, our contributed framework is shown to have lower runtime and space overhead on the target. In both case study scenarios, our method showed 2–11 times lower runtime interference compared to traditional methods. The lower overall target overhead and the construction of our VPI instrumentation insertion method enable the use of our framework to unify the implementation of previously separate semihosting and tracing/profiling use cases.
Because our method allows for inlining, is fully compatible with all optimization levels and has low target space and time overhead, it may remain in release code. With interception disabled in the VP, instrumented sites do not affect the runtime. This opens the possibility of distributing instrumented binaries which can later be pulled from the field for re-execution with instrumentation enabled under a VP.
In terms of simulation time, our VPI implementation has performance comparable to traditional stub-call semihosting when using the internal configuration.
We have also shown that our framework can be used to extend the instrumentation capabilities of existing VPs without changing their source code. This “add-on instrumentation” capability exploits scripting interfaces currently available in VPs and provides users with the option of reusing our sample implementation in their own environments.
While our results validate our assertions, we must also acknowledge that our prototype implementation suffers from some performance issues which are unrelated to the core VPI concepts presented in this paper.
Firstly, simulation time overhead is dominated by the choice of VPI configuration, with the external configuration executing as much as 50 times slower than internal configurations. In the case of our GDB-based external implementation, performance is limited by the communication and context switching overheads between GDB and the VP. These performance issues are due to the architecture of GDB and are shared by any tool employing GDB as a generic interface to a virtual platform.
Secondly, since our prototype implementation uses pure Python scripting code, it is at least an order of magnitude slower than what could be achieved using a native C/C++ implementation.
Future work includes implementing our VPI framework on a wider variety of VPs and architectures. Additional case studies and benchmarks could be beneficial in identifying more use cases where our method is an optimization of existing practices, while also serving as validation that inlined instrumentation is robust under more optimizations than those we validated.
ACKNOWLEDGEMENTS
We would like to thank L. Moss, J. Engblom, G. Beltrame and L. Fossati for providing us with valuable insights about code instrumentation on virtual platforms, which helped shape the construction of our framework and its presentation in this paper. We also wish to thank J-P. Oudet and the peer reviewers for helpful comments about the original manuscript.
REFERENCES
[1] Freescale Semiconductors, Inc. (2008, Jun.) Virtutech announces breakthrough hybrid simulation capability allowing mixed levels of model abstraction. Accessed 6/7/2010. [Online]. Available: http://goo.gl/UErXR
[2] Synopsys, Inc., CoWare Platform Architect Product Family: SystemC Debug and Analysis User’s Guide, v2010.1.1 ed., Jun. 2010.
[3] G. Beltrame, L. Fossati, and D. Sciuto, “ReSP: A nonintrusive transaction-level reflective MPSoC simulation platform for design space exploration,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 28, no. 12, pp. 1857–1869, Dec. 2009.
[4] L. Moss, M. de Nanclas, L. Filion, S. Fontaine, G. Bois, and M. Aboulhamid, “Seamless hardware/software performance co-monitoring in a codesign simulation environment with RTOS support,” in Proc. Design, Automation Test in Europe Conf. Exhibition (DATE), 2007, pp. 1–6.
[5] S. Fischmeister and P. Lam, “On time-aware instrumentation of programs,” in Proc. 15th IEEE Real-Time and Embedded Technology and Applications Symp. (RTAS), Apr. 2009, pp. 305–314.
[6] ARM Ltd, ARM Compiler Toolchain: Developing Software for ARM Processors, 2010, version 4.1, document number ARM DUI 0471B. [Online]. Available: http://goo.gl/qlKkO
[7] J. Engblom, D. Aarno, and B. Werner, Full-System Simulation from Embedded to High-Performance Systems. Springer US, 2010, ch. 3, pp. 25–45. [Online]. Available: http://dx.doi.org/10.1007/978-1-4419-6175-4_3
[8] QEMU open-source processor emulator. [Online]. Available: http://www.qemu.org
[9] Imperas Ltd. (2010) Technology: OVPsim. Accessed 12/15/2010. [Online]. Available: http://www.ovpworld.org/technology_ovpsim.php
[10] IBM Corporation, IBM XL C/C++ for Linux, V11.1, Optimization and Programming Guide, 2010, document number SC23-8608-00. [Online]. Available: http://goo.gl/e1Ri9
[11] J. Corbet. (2007, Aug.) Kernel markers. Accessed 12/10/2010. [Online]. Available: http://lwn.net/Articles/245671/
[12] B. L. Titzer and J. Palsberg, “Nonintrusive precision instrumentation of microcontroller software,” ACM SIGPLAN Not., vol. 40, pp. 59–68, Jun. 2005. [Online]. Available: http://doi.acm.org/10.1145/1070891.1065919
[13] Tensilica, Inc., Xtensa Instruction Set Architecture: Reference Manual, Santa Clara, CA, Nov. 2006, document number PD-06-0801-00.
[14] H. Shen and F. Petrot, “A flexible hybrid simulation platform targeting multiple configurable processors SoC,” in Proc. 15th Asia and South Pacific Design Automation Conf., Jan. 2010, pp. 155–160.
[15] N. Anastopoulos, K. Nikas, G. Goumas, and N. Koziris, “Early experiences on accelerating Dijkstra’s algorithm using transactional memory,” in Proc. IEEE Int. Symp. on Parallel Distributed Processing (IPDPS), May 2009, pp. 1–8.
[16] Free Software Foundation. The GNU C compiler. Accessed 11/1/2010. [Online]. Available: http://gcc.gnu.org
[17] S.-S. Lim. (1996) SNU-RT benchmark suite for worst case timing analysis. Original SNU site now down. [Online]. Available: http://www.cprover.org/goto-cc/examples/snu.html
Using Multiple Abstraction Levels to Speedup an MPSoC Virtual Platform Simulator
Joao Moreira∗, Felipe Klein∗, Alexandro Baldassin†, Paulo Centoducatte∗, Rodolfo Azevedo∗ and Sandro Rigo∗
∗Institute of Computing – University of Campinas (UNICAMP) – Brazil
[email protected], {klein, ducatte, rodolfo, sandro}@ic.unicamp.br
†IGCE/DEMAC – UNESP – Brazil
Abstract—Virtual platforms are of paramount importance for design space exploration, and their usage in early software development and verification is crucial. In particular, enabling accurate and fast simulation is especially useful, but such features are usually conflicting and tradeoffs have to be made. In this paper we describe how we integrated TLM communication mechanisms into a state-of-the-art, cycle-accurate, MPSoC simulation platform. More specifically, we show how we adapted ArchC fast functional instruction set simulators to the MPARM platform in order to achieve both fast simulation speed and accuracy. Our implementation led to a much faster hybrid platform, reaching speedups of up to 2.9x and 2.1x on average, with negligible impact on power estimation accuracy (average 3.26% and 2.25% of standard deviation).
I. INTRODUCTION
As new hardware architectures become increasingly complex, the need for tools to support their development becomes evident. The use of virtual platforms to enable design space exploration has proven to be an important procedure for accelerating the design of new hardware components, allowing early architectural exploration and verification.
Low power consumption is a key feature in hardware development, not only for embedded systems, extending battery lifetime, but also for hardware in general, reducing heat dissipation. The development of energy-aware systems is a hard task that can be assisted by virtual platforms, since they make it possible to trace the behavior of interconnected hardware components, allowing performance and power estimation.
Many approaches have been proposed for power estimation on single-core applications, but only a few options are available for the multi-core domain. Some simulation platforms [1], [2] with power analysis support have been published and are in use, but a need for more alternatives and resources, satisfying a wider range of testing possibilities, still exists.
Virtual platforms may be implemented using different abstraction levels. Cycle-accuracy provides precise simulations with highly trustworthy results, but this precision comes at the cost of complexity and, consequently, increased simulation time. This characteristic imposes hard performance limitations, sometimes making the execution of real-world applications unfeasible. The use of higher abstraction levels, such as functional simulators, reduces the simulation complexity, improving the platform’s time efficiency. However, since many hardware details are not taken into account, their results might be less precise compared to those generated with a cycle-accurate platform.
This paper is focused on how to improve the speed of a cycle-accurate platform by including a functional simulator while maintaining accuracy. We integrated functional simulators generated with ArchC [3] into the MPARM [1] platform, turning it into a faster hybrid platform. The contributions of this work are threefold. First, we introduce a new simulation resource into the MPARM platform to improve its speed by up to 2.9 times. Second, we present a detailed implementation description of a hybrid simulation platform, showing how the abstraction compatibility problems were fixed. Finally, we describe how we managed to statistically correct the precision loss introduced by the functional simulator.
This paper is organized as follows: Section 2 describes related work. Section 3 details the implementation of the platform, describing the interface of the functional simulator with the MPARM platform, techniques used to improve precision, and the verification process. Section 4 describes the experimental results, showing the obtained speedups and describing how power estimations were statistically corrected. Section 5 presents our conclusions.
II. RELATED WORK
MPARM [1] is a complete platform for Multi-Processor Systems-on-Chip (MPSoC) simulation. It is written in C++ and makes use of SystemC [4] as its simulation engine. The platform includes an implementation of a cycle-accurate ARM simulator called SWARM [5], AMBA buses, hierarchical memories and synchronization mechanisms. Cycle-accurate power models for many of the simulated devices are included in the MPARM platform, which makes it quite suitable for power estimation. MPARM is well known and has been used for power analysis in MPSoCs [6] and for testing hardware and software transactional memory systems [7], [8].
SimWattch [2] is a simulation tool based on Simics [9] and on Wattch [10], a power modeling extension present in SimpleScalar [11]. This tool has been designed to support microprocessor performance and power estimation, but no models are provided for other system components, such as external memories.
978-1-4577-0660-8/11/$26.00 © 2011 IEEE
Fig. 1. MPARM platform with ArchC generated cores
ArchC [3], [12] is an open-source SystemC-based architecture description language capable of generating fast, functional Instruction Set Simulators (ISSs). It is easy to modify an ArchC model to use its natively supported TLM [13] interfaces to communicate with external modules, providing seamless integration into virtual platforms. The ArchC ARMv5 model is a functional simulator, which means that there is no detailed pipeline simulation. This makes the simulator implementation very simple, turning it into a much faster option than the original SWARM simulator distributed with MPARM.
III. PROBLEM DESCRIPTION
Hardware research frequently requires the execution of large simulation sets, using multiple applications with varying numbers of configurations and hardware compositions. Simulation time is crucial in order to evaluate a specific design and, since the result of the simulation will probably require design modifications and further simulation, it becomes impractical to wait hours (or even days) for a single simulation to finish.
Cycle accuracy, as in MPARM, imposes hard restrictions on simulation sets, requiring the removal of heavyweight software or complex hardware from the workload. As a realistic example, consider the simulation of a lock-based version of the genome application, which is one of the fastest applications within the STAMP [14] benchmark. Running genome on MPARM with one core required 3:16 minutes. When the number of cores is increased to 2, 4, 8, and 16, the overall simulation time rises to 5:20, 8:47, 21:21, and 45:02 minutes, respectively. Simulating larger STAMP applications, such as Yada, takes around 31 hours with the 8-core configuration.
Another significant limitation imposed by MPARM is its lack of flexibility. The platform implementation tightly couples the processor with adjacent modules, making the exploration of different architectures a hard task. This lack of flexibility also occurs in other simulation platforms, but we focus on MPARM in this paper.
A. MPARM modifications
In order to improve performance, the cycle-accurate SWARM processor simulator was replaced by a functional ARMv5 simulator core generated with the ArchC toolset. This implementation led to a hybrid simulator, where all modules, except for the processor, are cycle-accurate. As will be seen ahead, power measurement was not seriously compromised and could be statistically estimated. We chose to modify the processor simulator because we are interested in platforms with a varying number of cores (from 1 up to 16) and our profile indicated that it is possible to get better results with a faster ISS (see Section III-E).
In the MPARM platform, each ISS module consists of a C/C++ implementation encapsulated in a SystemC wrapper. The wrapper is an interface between the processor core and the platform, handling all communication. This wrapper is also responsible for granting control to the core to execute a new cycle, which turns it into an efficient layer to enforce module synchronization on the platform.
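The cycle-stepping role of the wrapper can be sketched in plain C++. All class and method names below are illustrative stand-ins, not MPARM's actual interfaces; the point is only that the platform advances each core one cycle at a time through its wrapper, which keeps the modules in lockstep.

```cpp
#include <vector>

// Core: stand-in for an ISS; counts executed cycles.
struct Core {
    unsigned long cycles = 0;
    void execute_cycle() { ++cycles; }
};

// Wrapper: the interface layer between a core and the platform.
struct Wrapper {
    Core core;
    void clock_tick() { core.execute_cycle(); }  // grant the core one cycle
};

// Platform: advances every wrapper once per simulated cycle,
// which is what keeps the modules synchronized.
struct Platform {
    std::vector<Wrapper> cores;
    explicit Platform(int n) : cores(n) {}
    void run(unsigned long n_cycles) {
        for (unsigned long c = 0; c < n_cycles; ++c)
            for (Wrapper& w : cores) w.clock_tick();
    }
};
```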
Due to the loss of simulation accuracy implied by the higher abstraction level, some modifications needed to be applied to the model. The original ArchC ARM model, which was designed to use internal memory, was changed to use two TLM ports and the memory loader available in the platform. The platform was modified to correctly compile and instantiate the new processors. Functions to estimate core power and generate simulation reports were also created. Finally, some modifications were applied to the core signal data structure existing in the platform, making it compatible with SystemC 2.1. The implementation required 460 lines of code for the TLM interfaces and the SystemC wrapper. Around 200 lines of code were written in the original platform to correctly support the ArchC processor. The code can be easily reused to integrate any other ArchC processor model into MPARM, allowing the exploration of many architectures not yet supported by MPARM. This new feature can turn the platform, which was originally designed to evaluate embedded systems, into a more general one.
B. Model integration
Models written in the ArchC language may use internal memory or TLM interfaces to allow processor communication with external modules. When using TLM interfaces, every memory operation executed by the processor is forwarded to the TLM ports through a TLM packet. The TLM interface implements the SystemC signaling communication with external modules. Since it is a centralized communication channel, creating a memory hierarchy only required plugging the cache memories into the TLM interface code, correctly consulting and updating them on every operation. The cache memory implementation was the same originally used in MPARM. To support split data and instruction caches, two TLM ports were created in the ArchC processor model. One port is used exclusively for instruction fetching and the other for any other memory access. Each TLM port is connected to its own cache, which consequently becomes a data or instruction cache. A diagram with the detailed TLM interface implementation can be seen in Figure 1.
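The split-cache arrangement described above can be illustrated with a minimal sketch, assuming hypothetical class names (the real implementation reuses MPARM's cache code and SystemC/TLM machinery): each port is bound to exactly one cache, so the port a request arrives on alone decides whether the instruction or the data cache is consulted.

```cpp
#include <cstdint>

// Minimal cache stand-in that only counts accesses.
struct Cache {
    unsigned long accesses = 0;
    uint32_t read(uint32_t addr) { ++accesses; return addr; }
};

// A TLM port bound to exactly one cache: the port a request arrives on
// decides which cache is consulted.
struct TlmPort {
    Cache& cache;
    explicit TlmPort(Cache& c) : cache(c) {}
    uint32_t request(uint32_t addr) { return cache.read(addr); }
};

// Core with one port for instruction fetches and one for data accesses.
struct Core {
    Cache icache, dcache;
    TlmPort iport{icache}, dport{dcache};
    uint32_t fetch(uint32_t addr) { return iport.request(addr); }
    uint32_t load(uint32_t addr)  { return dport.request(addr); }
};
```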
MPARM does not support TLM by default. For this reason, once the packets reach the ArchC TLM interface, they are translated into an MPARM core signal, and a memory operation request is made to the bus master. These memory operations block the processor until a ready signal is received back
from the bus master. The MPARM platform also uses certain memory addresses to create a communication channel between the simulated application and the simulator itself. This is useful for allowing the use of the simulation support API, which provides calls for functionalities such as enabling and disabling power measurement or printing output messages.
The original AMBA bus was not modified, keeping its original cycle accuracy. When performing memory operations, this characteristic makes the processors wait the same number of cycles as cycle-accurate ones would. Being a blocking function, the TLM interface forces synchronization between the processor and the other modules in the platform.
To provide correct power estimation, calls to the measurement API were encapsulated in the TLM interface. Since the power estimation models were developed with a cycle-accurate simulation in mind, another strategy needed to be applied. The SWARM processor is modeled as a state machine with eight different states. Each of these states describes an operation mode of the processor, and its transitions are defined by the internal flow of events such as cache and memory operations. On every simulated cycle, the processor state is stored by measurement calls. At the end of the simulation, the number of cycles in each state is used to estimate power. Since the ArchC simulator is functional, this power estimation mechanism could not be directly applied. Instead of using a fixed state flow or value for each instruction, we placed the measurement calls in the TLM interface. Within the TLM interface we can easily trace all memory accesses, cache updates, and numbers of wait cycles, which allows us to dynamically reproduce the processor's state flow for each instruction. This approach led to a reduced precision loss in our power estimation model, making it very similar to the original one, as we show in Section IV. Figure 2 illustrates the measurement API calls in the TLM interface.
Fig. 2. TLM flowgraph
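The per-state cycle accounting underlying this scheme can be sketched as follows. State names and per-state power values are illustrative (SWARM's real machine has eight states, and the platform's actual power numbers are not reproduced here); the sketch only shows how counting cycles per state yields an average-power estimate at the end of a run.

```cpp
#include <array>

// Operation modes; names are illustrative stand-ins.
enum CoreState { RUN, WAIT_MEM, WAIT_BUS, IDLE, N_STATES };

struct PowerAccount {
    std::array<unsigned long, N_STATES> cycles{};  // cycles spent per state

    // Called once per simulated cycle (from the TLM interface here).
    void record(CoreState s) { ++cycles[s]; }

    // Average power = sum(cycles_i * P_i) / total cycles.
    double average_power(const std::array<double, N_STATES>& p) const {
        double energy = 0;
        unsigned long total = 0;
        for (int s = 0; s < N_STATES; ++s) {
            energy += cycles[s] * p[s];
            total  += cycles[s];
        }
        return total ? energy / total : 0.0;
    }
};
```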
C. ArchC ARMv5 model modifications
For the sake of speed, ArchC simulators use an instruction decode cache to avoid decoding the same instruction twice. Once an instruction is decoded, a data structure with the results is stored using its address as index. If the instruction is needed again, it is retrieved directly from the decode cache. Despite being essential for high performance, this mechanism imposes difficulties on statistics collection: for an instruction that was already decoded, no memory access is performed, leading to wrong memory and cache measurements. To fix this behavior, we modified the original decode cache code to perform a dummy memory access to the corresponding address in case of a cache hit. This solution forced the core to make the same memory accesses as the original, while still avoiding the need to decode the same instruction twice.
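The decode-cache fix can be sketched as below, with hypothetical names (`DecodedInsn`, `fetch`) standing in for the real ArchC structures: on a hit, a dummy read is issued so memory and cache statistics match a simulator that fetches every instruction, while the costly re-decode is still skipped.

```cpp
#include <cstdint>
#include <unordered_map>

struct DecodedInsn { uint32_t opcode; };  // decoded fields elided

struct DecodeCache {
    std::unordered_map<uint32_t, DecodedInsn> cache;  // indexed by address
    unsigned long mem_reads = 0;

    // Stand-in for the real instruction fetch through the TLM port.
    uint32_t fetch(uint32_t addr) {
        ++mem_reads;
        return addr;  // placeholder for memory contents
    }

    DecodedInsn decode(uint32_t addr) {
        auto it = cache.find(addr);
        if (it != cache.end()) {
            fetch(addr);        // dummy access: keeps memory statistics right
            return it->second;  // while skipping the costly re-decode
        }
        DecodedInsn d{fetch(addr)};  // real fetch + decode on a miss
        cache.emplace(addr, d);
        return d;
    }
};
```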
In a first comparison between the cycle-accurate and the functional simulations, a large difference in the number of memory read operations was noticed. This difference happened due to the lack of a pipeline in the functional simulation. With a pipeline, the execution of a branch may flush the instructions in the first and second pipeline stages. In the functional simulation, these two instructions would never be fetched from memory.
This inconsistency is imposed by the different abstraction levels between the processors and the rest of the platform. To cope with this problem, a branch detection mechanism was implemented in the ArchC simulator. If a branch is taken, this mechanism generates two dummy memory reads to the next addresses of the "not taken" program flow, corresponding to the instructions that would be in the pipeline but would be discarded in the cycle-accurate simulation. This mechanism drastically improved the precision of the functional simulation, allowing a much more reliable power measurement.
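A minimal sketch of this branch compensation, assuming 4-byte ARM instructions and an illustrative `issue_read` hook in place of the real TLM instruction port: on a taken branch, the functional core fetches the two sequential instructions that a cycle-accurate pipeline would have fetched and then flushed.

```cpp
#include <cstdint>
#include <vector>

struct BranchCompensator {
    std::vector<uint32_t> issued;  // addresses of emitted memory reads

    // Stand-in for a read through the TLM instruction port.
    void issue_read(uint32_t addr) { issued.push_back(addr); }

    // On a taken branch, fetch the two not-taken-path instructions that a
    // cycle-accurate pipeline would have fetched and then discarded.
    void on_branch(uint32_t pc, bool taken) {
        if (!taken) return;
        issue_read(pc + 4);
        issue_read(pc + 8);
    }
};
```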
D. Platform Verification Process
The new platform was tested using the STAMP benchmark [14], which is a set of applications targeting the evaluation of Software Transactional Memory (STM) systems. This benchmark proved to be very suitable for our purposes, since its inherent concurrency contributed to correctly evaluating the platform's communication through the TLM interfaces. The wide variety of algorithms in the STAMP applications presents a good robustness test, evaluating different aspects of the platform, such as varying bus contention, critical section lengths, and shared memory area sizes.
The platform verification comprised three main stages. In the first stage, different applications were executed and their outputs were compared for correctness. The tests started with micro-benchmark applications and were concluded after executing all applications in STAMP. At this point, all tests were executed with a single core instantiated on the platform.
The second stage consisted of comparing memory traces generated by the new platform with traces generated by the original one. This test allowed the verification of correct memory operations and address translation. After comparing the traces, the output reports were also checked, showing that both platforms performed the same number of memory accesses for each memory area while executing the same benchmark. Once again, only one core was instantiated on the platform.
The third validation stage consisted of executing the first and second stages with multiple cores (2, 4, and 8). It is worth mentioning that some of the STAMP programs run more than a billion instructions on each processor. Once the expected output for the application was found, memory traces were compared. Due to the different simulation abstractions, the memory traces were not exactly the same for both platforms.
A large difference in the number of memory reads to the processor's private memory area was noticed in the tests. By using the branch detection mechanism mentioned in Section III-C, this difference was drastically reduced, showing that it was an effect of the pipeline absence in the new abstraction, as described. In multi-core simulations, other memory areas also showed some differences, but none was significant. The nature of such differences is described in Section III-E.
An illustrative comparison between 4 applications can be seen in Table I. Values denoted with a + sign mean that the simulation made with ArchC executed a larger number of memory operations. Values describing multi-core simulations refer to the number of operations executed by the first core.
TABLE I
MEMORY OPERATION DIFFERENCE

Operation      genome   genome    Intruder  Intruder
               1 core   8 cores   1 core    8 cores
Private Rd     0.33%    9.16%     0.05%     14.04%
Private Wr     0%       0.2%      0%        +1.44%
Shared Rd      0%       0.08%     0%        2.89%
Shared Wr      0%       1.08%     0%        +1.54%
Semaphore Rd   0%       3.12%     0%        2.47%
Semaphore Wr   0%       0.49%     0%        +1.06%
E. Platform Profiling
After reaching the expected behavior with the STAMP applications, the original and the modified platforms were profiled in order to enable a better understanding of the differences imposed by changing the abstraction. Using the gprof tool, the execution of both platforms was profiled while running the application genome with 1, 2, 4, and 8 cores. The profiling allowed the verification of the time spent running each of the platform's modules. The results showed that the differences in the time spent by each of the modules were not restricted to the ISS code, indicating that the effects of changing its abstraction were propagated to other parts of the platform, such as buses and memories.
A deeper observation of the bus behavior revealed different contention on each platform, showing that the processor's abstraction had an influence on bus operations. In fact, the way each abstraction emits its instructions and its memory accesses is not the same. Only one cycle is required to run an instruction that does not require memory accesses on the functional simulator. If we consider an empty pipeline, at least three cycles would be required to run the same instruction on a cycle-accurate pipeline. Considering that the block being executed is already in the cache, if each operation needs to perform a memory access, the functional simulator would emit one memory access per cycle. In a similar situation, the cycle-accurate simulator would emit a number of memory accesses that is directly dependent on the number of stages in its pipeline, and it would never be equal to one memory access per cycle.
Since the functional simulator requires fewer cycles to execute an instruction, it also executes a code block in a smaller interval. This fact led to differences in the number of cycles spent running critical sections, which also had a major effect on the number of cycles and memory operations performed by the other processors on the platform while waiting for lock acquisition. Consequently, the bus contention of the two platforms is not the same. Considering the effects of the new abstraction on critical sections, and the fact that a single difference here may influence the whole program flow, not only are the different numbers of memory accesses explained, but also the varying times spent on modules other than the processor on each platform.
Table II summarizes the profile data. It presents the time spent by each processor implementation, in seconds, and the percentage of the whole execution time spent on the processors. Since the abstraction level influences the whole platform, it is not possible to make an absolute efficiency comparison between the two processors. However, the values show that, while using the ArchC simulator, the percentage of time spent on the processor in relation to the overall simulation time was reduced by at least 2.5x, reaching a reduction of 4x in the 8-core configuration. After these tests were complete, a series of experiments was performed to assess the performance and power estimation capabilities of our implementation. These experiments are discussed in the next section.
TABLE II
PROCESSOR PROFILING SUMMARY

Processor   1 core    2 cores   4 cores   8 cores
ArchC       6.87%     7.07%     7.61%     5.55%
SWARM       17.25%    20.29%    23.18%    22.59%
ArchC       5.15s     8.72s     16.9s     32.07s
SWARM       25.15s    51.83s    97.1s     180.29s
IV. EXPERIMENTAL RESULTS
The STAMP benchmark with lock-based synchronization was used to evaluate both performance and power estimation. The test consisted of executing all 8 applications available in the benchmark with 1, 2, 4, and 8 cores. Some of them were executed with more than one configuration, totaling 13 application variants. A sequential version of each variant was also executed. The whole test consisted of 65 simulations, which were executed on 2.4 GHz machines with 4 GB of RAM running Ubuntu Linux with kernel 2.6.9. Due to restrictions imposed by the source code of the platform, all the code was compiled with GCC 3.4 at the -O3 optimization level.
A. Performance Assessment
The result of each simulation was compared with a similar simulation on the original MPARM platform. A performance comparison can be seen in Figures 3 and 4. The speedup is shown as gray bars in the figures. The nomenclature for each
Fig. 3. Lock-based speedup / Simulated cycles
Fig. 4. Sequential speedup / Simulated cycles
simulation follows the one presented in the original STAMP paper [14]. As can be seen, our implementation reached at least a 1.8x speedup for each simulation, with a maximum speedup of 2.9x. The average speedup was 2.1x.
The absence of a pipeline reduced the overall number of simulated cycles for each execution. The number of cycles, normalized to the values obtained with the original platform, is identified by a black line in Figures 3 and 4. As shown, 70% of the original cycles were simulated with the sequential implementation. This value also stands for lock simulations with 1 core, but it increases as more cores are added to the platform. This is an effect of the larger bus contention generated by the addition of new cores to the platform. Since the bus is cycle-accurate, memory operations keep the cores blocked for a number of cycles equivalent to the cycle-accurate simulation, which makes the number of cycles increase towards the one obtained with the original platform. Simulating fewer cycles was not the only reason for the speedup. The functional ArchC simulator is simpler, and thus faster, than SWARM. On average, the ArchC model simulated 1.9x more cycles per second, reaching a maximum of 2.4x. A comparison of simulated cycles per second can be seen in Table III. Serially running these 65 simulations would take 187 hours and 30 minutes on the original MPARM with SWARM. The same batch of simulations was completed in 98 hours and 19 minutes on the new MPARM implementation using ArchC functional cores.
B. Energy and Power estimation
In MPARM, the total energy estimation is calculated based on the stored states of the processor, as described in Section III-B. At each cycle, the processor state is stored and, at the end, applied to the energy model present in the platform. As expected, the use of a higher abstraction introduced imprecision into the energy estimation. A scatter plot showing the error in the energy estimation obtained from the new platform can be seen in Figure 5, where the dots represent the obtained results and the line represents the values obtained with the original platform, used as the reference for correctness.
Fig. 5. Total energy measurement error
In the results presented in Figure 5, the applications Bayes, Labyrinth, Labyrinth+, and Yada showed a more significant error when executed with 4 cores. Since these applications have long critical sections, they are more susceptible to the effects of the new abstraction on bus contention. The upper limit of bus contention can be understood as all processors on the platform waiting for bus operations. For the mentioned applications, the 4-core simulation reached maximum contention in
TABLE III
NUMBER OF K CYCLES SIMULATED PER SECOND

               Sequential     1 core         2 cores        4 cores        8 cores
Application    ArchC  SWARM   ArchC  SWARM   ArchC  SWARM   ArchC  SWARM   ArchC  SWARM
kmeans-low     270    150     276    209     406    196     546    345     870    515
kmeans-high    275    203     279    216     407    282     587    360     938    359
yada           297    214     295    142     409    308     465    375     727    540
bayes          318    150     312    227     487    322     657    396     991    584
intruder       351    161     341    159     480    216     603    262     889    556
intruder+      345    160     350    160     488    325     590    261     894    562
labyrinth      319    236     317    149     461    204     575    243     892    356
labyrinth+     318    233     316    223     465    204     575    357     880    530
vacation-low   352    260     341    163     468    223     588    268     906    391
vacation-high  340    261     347    260     479    333     613    427     880    415
ssca2          326    254     339    244     502    324     710    408     1063   623
genome         324    248     322    232     519    337     711    431     1083   605
genome+        334    255     330    157     520    355     709    428     1075   464
average        320    214     320    195     468    279     609    350     929    500
the hybrid platform, but not in the original one, resulting in the observed error. The 8-core simulations reached maximum bus contention, in both platforms, during almost the entire runtime and, for this reason, the observed error was not significant.
The power estimation mechanism in MPARM uses the number of simulated cycles to estimate the consumption of each core. As the hybrid platform implementation trades off cycle precision for performance, an error margin was introduced into the power calculations due to the difference in the number of cycles. Raw results obtained with the modified platform had an average power estimation error of 21.2% with a 2.7% standard deviation (SD).
In order to improve these results, a model based on regression analysis was built using the least-squares method. To build the model, values measured on the original platform were used as the expected values. The coefficients obtained, when applied to the results of the hybrid platform, minimize the percentage of error. Since this first model was built from the results obtained with the whole simulation set, it was named the "general model". After applying the general model to the results obtained with the hybrid platform, the average error was reduced to 14.45%, with an SD of 4.3%.
TABLE IV
LINEAR REGRESSION COEFFICIENTS

cores           model
general model   0.9x − 3
1               0.71x + 0.57
2               0.8x
4               0.75x + 7.5
8               0.8x + 24
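A fit of this form can be computed with an ordinary least-squares sketch like the one below, where x is the raw hybrid estimate and y the reference value from the original platform; the function name and the exact fitting code are ours, not the paper's.

```cpp
#include <cstddef>
#include <vector>

struct Fit { double a, b; };  // y ~ a*x + b

// Ordinary least-squares fit of reference values y against raw hybrid
// estimates x; a and b play the role of the coefficients in Table IV.
Fit least_squares(const std::vector<double>& x, const std::vector<double>& y) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    const double n = static_cast<double>(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    const double a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    return {a, (sy - a * sx) / n};
}
```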
As shown in Figure 3 and explained in Section IV-A, the normalized number of simulated cycles is not the same for simulations with different numbers of cores. As this value is an important parameter for the power estimation, it turns out that using the general model described above is not the most appropriate choice. In order to achieve a higher accuracy level, we calculated new models for each number of cores. The core-specific coefficients obtained through the linear regression can be seen in Table IV, where x stands for the raw power estimation value obtained with the hybrid simulation. By using the core-specific models, the error margin was reduced to an average of 3.25% with an SD of 2.25%. A scatter plot showing the error of the final results can be seen in Figure 6, where dots represent the results obtained after applying the estimation models and the diagonal line represents the results obtained with the original platform.
Fig. 6. Total power measurement error
C. Regression Model validation
In order to correctly assess our models, a small test set, composed of STAMP applications executed with different inputs, was used. The general and the core-specific coefficients were applied to the results obtained after a hybrid simulation. The applications that composed the test set were Bayes, Intruder, and Intruder+ with a different random seed, and Labyrinth and Labyrinth+ with different mazes.
Fig. 7. Power measurement error on the validation set: (a) raw results, (b) with the general model applied, (c) with the core-specific model applied
By using the general model, the average error was reduced from 22.21% to 18.74%. The specific model reduced the average error to 5.85%. Despite being generally less accurate than the core-specific model, the general model was more precise when applied to the 8-core simulation. This behavior is due to the larger data set employed in the construction of the general model and the similarity in the number of cycles executed in the simulations. Plots with the errors originally obtained and reduced after applying the models are presented in Figures 7(a), 7(b), and 7(c).
V. CONCLUSION
We have introduced a new simulation resource into MPARM, turning it into a hybrid simulation platform with regard to model abstraction levels. By replacing the cycle-accurate processor with a functional one we have significantly increased performance, as a consequence of simulating more cycles per second and of reducing the overall number of simulated cycles. We have also introduced techniques to reduce the loss of precision in cycle/power estimates imposed by the higher abstraction level. The lack of precision introduced by the abstraction modification was discussed, highlighting the effects of bus contention when running simulations with a larger number of cores. Finally, we have suggested the use of regression analysis to improve power estimation results, defining a different correction model for each number of cores due to the variation in bus contention effects in each case. By reaching an average error of 3.26% we showed that our hybrid platform, while reaching speedups of up to 2.9x compared to the original MPARM, is able to generate power estimations with a very similar level of confidence in the results.
VI. ACKNOWLEDGEMENT
This work was partially supported by grants from FAPESP(2009/04707-6, 2009/08239-7, 2009/14681-4), CNPq, and
CAPES.
REFERENCES
[1] L. Benini, D. Bertozzi, A. Bogliolo, F. Menichelli, and M. Olivieri, "MPARM: Exploring the multi-processor SoC design space with SystemC," J. VLSI Signal Process. Syst., vol. 41, no. 2, pp. 169–182, 2005.
[2] J. Chen, M. Dubois, and P. Stenstrom, "Integrating complete-system and user-level performance/power simulators: the SimWattch approach," in ISPASS '03: Proceedings of the 2003 IEEE International Symposium on Performance Analysis of Systems and Software, 2003, pp. 1–10.
[3] S. Rigo, G. Araujo, M. Bartholomeu, and R. Azevedo, "ArchC: a SystemC-based architecture description language," in Computer Architecture and High Performance Computing, pp. 66–73, October 2004.
[4] D. C. Black and J. Donovan, SystemC: From the Ground Up, 2004.
[5] M. Dales, "SWARM 0.44 documentation," February 2003, www.cl.cam.ac.uk/~mwd24/phd/swarm.html.
[6] M. Loghi, M. Poncino, and L. Benini, "Cycle-accurate power analysis for multiprocessor systems-on-a-chip," in GLSVLSI '04: Proceedings of the 14th ACM Great Lakes Symposium on VLSI, 2004, pp. 401–406.
[7] C. Ferri, T. Moreshet, R. I. Bahar, L. Benini, and M. Herlihy, "A hardware/software framework for supporting transactional memory in an MPSoC environment," SIGARCH Comput. Archit. News, vol. 35, no. 1, pp. 47–54, 2007.
[8] A. Baldassin, F. Klein, G. Araujo, R. Azevedo, and P. Centoducatte, "Characterizing the energy consumption of software transactional memory," Computer Architecture Letters, vol. 8, no. 2, pp. 56–59, Feb. 2009.
[9] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A full system simulation platform," Computer, vol. 35, pp. 50–58, 2002.
[10] D. M. Brooks, P. Bose, S. E. Schuster, H. Jacobson, P. N. Kudva, A. Buyuktosunoglu, J.-D. Wellman, V. Zyuban, M. Gupta, and P. W. Cook, "Power-aware microarchitecture: Design and modeling challenges for next-generation microprocessors," IEEE Micro, vol. 20, pp. 26–44, 2000.
[11] T. Austin, E. Larson, and D. Ernst, "SimpleScalar: An infrastructure for computer system modeling," Computer, vol. 35, no. 2, pp. 59–67, 2002.
[12] R. Azevedo, S. Rigo, M. Bartholomeu, G. Araujo, C. Araujo, and E. Barros, "The ArchC architecture description language and tools," Int. J. Parallel Program., vol. 33, no. 5, pp. 453–484, 2005.
[13] F. Ghenassia, Transaction-Level Modeling with SystemC: TLM Concepts and Applications for Embedded Systems, 2006.
[14] C. Cao Minh, J. Chung, C. Kozyrakis, and K. Olukotun, "STAMP: Stanford transactional applications for multi-processing," in IISWC '08: Proceedings of the IEEE International Symposium on Workload Characterization, September 2008.
A non-intrusive simulation-based trace system to analyse Multiprocessor Systems-on-Chip software
Damien Hedde, Frédéric Pétrot
TIMA Laboratory
CNRS/Grenoble INP/UJF, Grenoble, France
{Damien.Hedde, Frederic.Petrot}@imag.fr
Abstract—Multiprocessor Systems-on-Chip (MPSoCs) are scaling in complexity. Most parts of an MPSoC are affected by this evolution: number of processors, memory hierarchy, interconnect systems, etc. Due to this increase in complexity and the debugging and monitoring difficulties it implies, developing software targeting these platforms is very challenging. Methods and tools to assist the development process of MPSoC software are therefore mandatory. Classical debugging and profiling tools are not suited for use in the MPSoC context, because they lack adaptability and awareness of the parallelism.
As virtual prototyping is today widely used in the development of MPSoC software, we advocate the use of simulation platforms for software analysis. We present a trace system that consists of tracing hardware events produced by models of multiprocessor platform components. The component models are modified in a non-intrusive way so that their behavior in simulation is not altered. Using these trace results allows running precise analyses, such as data race detection, targeting the software executed on the platform.
I. INTRODUCTION
The ever-increasing performance and flexibility demands for running applications on embedded platforms led to the emergence of Multiprocessor Systems-on-Chip (MPSoC) platforms a decade ago. Nowadays, these systems integrate very complex memory subsystems and interconnects. In its 2009 edition [1], the ITRS (International Technology Roadmap for Semiconductors) expects Systems-on-Chip in the portable consumer segment with more than 1000 processing elements in 2020.
Unfortunately, MPSoCs embed more and more elements while keeping few external debugging or monitoring capabilities. As a consequence, the observability of such SoCs is not increasing as their complexity does. By integrating elements that were previously outside the chip, it becomes almost impossible to observe their behavior and communication. Furthermore, the growing number of internal elements that need to be connected together leads to the saturation of classical interconnect systems such as buses. To replace them, scalable interconnects (i.e., Networks-on-Chip (NoCs)) are used. Although providing higher bandwidth, they do not have the same observability. To settle this general observability problem, Design for Debug (DfD) features ([2], [3], [4], [5]) are developed and integrated into SoCs.
This complexity raises problems not only for designing an MPSoC but also for the software running on its processors. The software has to target the processors and make them communicate in order to execute the required application. Depending on the MPSoC architecture, the software may have to handle multiple communication mechanisms (shared memory, mailboxes, DMA) and target different processor types (general-purpose processors, specialized processors). Analysing and debugging programs that concurrently use several processors is not a new problem. Many techniques were developed [6] during the 1980s, particularly targeting distributed systems programs. The difficulties encountered in debugging those systems are not far from the ones for MPSoCs. A very important one is the difficulty of getting the state of the global system. There is no real problem in getting the state of each node, but getting the state of every node at the same time is nearly impossible. With the development of GALS (Globally Asynchronous, Locally Synchronous) integrated systems, it becomes clearly unfeasible.
In this paper, we present a method for fine-grain analysis of software running on MPSoCs. This method uses simulation and consists of instrumenting component models. During the simulation, these models generate events which are collected for further analysis. Our approach relies on the fact that, when simulating an entire SoC platform, we can have access to everything that happens during the simulation. The ability to collect information is indeed not limited by the classical constraints a real SoC has: limited bandwidth to an external debugging device, limited observability, etc. Contrary to software instrumentation, information from the simulation can be collected without being intrusive: it can be obtained without changing the behavior of the simulated components. However, simulation relies on models that are often only approximate in timing. In our case this is not a problem, because we are mainly concerned with the order of memory events.
The produced trace mainly contains the instructions executed by the different processors. The related memory accesses are also traced, up to the memory, by every component relaying them. These memory accesses are used to recover the inter-processor instruction dependencies, allowing analysis of the detailed, low-level synchronization mechanisms between the different processors.
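The kind of trace record this implies can be sketched as follows. Field and type names are our assumptions for illustration, not the paper's actual trace format; the essential point is that each instruction and relayed memory access carries its producing processor and a sequencing value from which cross-processor orderings can be reconstructed offline.

```cpp
#include <cstdint>
#include <vector>

enum EventKind { INSN, MEM_READ, MEM_WRITE };

// One trace record (field names are illustrative assumptions).
struct TraceEvent {
    int       cpu;   // producing processor
    EventKind kind;
    uint32_t  addr;  // pc for INSN, target address otherwise
    uint64_t  seq;   // global order as observed by the traced component
};

struct Tracer {
    std::vector<TraceEvent> log;
    uint64_t next = 0;
    void emit(int cpu, EventKind k, uint32_t addr) {
        log.push_back({cpu, k, addr, next++});
    }
};
```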
The remainder of this article is organized as follows. Related work is reviewed in Section II. Section III describes the
trace mechanisms and the analysis method. Section IV presents experiments and results. We then conclude in Section V.
II. RELATED WORK
In recent years, many methods have been proposed to help debug and analyse programs running on MPSoCs. Due to the increasing SoC parallelism, an important effort has been made on communication monitoring: parallel program errors indeed come from erroneous interactions between the processing elements, and interactions mainly go through memory. Several solutions propose to integrate debugging features directly in the chips ([5], [7]), allowing run-time debugging.
Following another track, work has been done to improve debugging abilities through simulation techniques. An advantage of these methods is that they are truly non-intrusive. [8] proposes a solution using a virtual machine that exposes the data structures of the Operating System (OS) running inside the virtual machine to an external debugger. In [9], an Instruction Set Simulator (ISS) API is proposed for the simulation of processors in MPSoCs. This ISS has an interface allowing instrumentation tools to be added independently of the simulated processor. An implementation of a GDB server using this instrumentation ability has been made, allowing the set of processors of the MPSoC to be controlled. Although it allows every processor to be controlled and monitored, it does not address the communication mechanisms. In [10], a solution is proposed for verifying the shared memory mechanism. The method consists in recording memory operations and their order, and then checking whether a bug occurred.
Non-intrusive instrumentation in simulators already exists. In eSimu [11], a very low-level trace is generated by a cycle-accurate simulator and used for energy profiling. The trace contains instructions with cache-penalty information, peripheral state changes, and the evolution of data through FIFOs, in order to link energy consumption to the instruction that generates it. For example, sending data through a wireless device costs a lot of energy but is issued long after the related instruction, so the data must be tracked. This solution does not target platforms embedding multiple processors, and thus lacks the information needed to recover the order of memory accesses in multiprocessor platforms. It is also focused only on profiling.
In his thesis, D. Kranzlmueller [12] studies the modelling of concurrent programs using events, with debugging as the goal. His work mainly targets distributed programs at the application level, not the whole software stack. He records and analyses the main events of a concurrent program (mostly communication events). The event trace is intrusive, as it is generated through software instrumentation. Event relations and orders are analyzed in order to detect erroneous behaviors. Other works, like [13], focus on MPSoC software monitoring through a specific programming model with integrated observation abilities; in that work, a component-based approach is used where each component has observation interfaces.
III. CONCURRENT PROGRAM ANALYSIS
Several kinds of analysis can be performed on a trace: verification of a cache coherency protocol, verification of a memory consistency model, detection of data races, and so on. These analyses require building order relations between the memory access instructions. Section III-A below explains how to build this instruction trace for a multiprocessor platform using a sequentially consistent memory model [14]. The analysis then needs to build the software threads (which may have migrated between several processors in SMP architectures) and to identify synchronizations between the threads in order to highlight erroneous behaviours. We explain how to proceed in Section III-B, focusing on data race detection.
A. Recovering instruction scheduling
In order to analyse a concurrent program running on a multiprocessor shared-memory platform, we need to sort the concurrent program instructions that access the same memory. But the date of a memory access may be very different from the date of the instruction that generates it: because of communication interconnects and cache or buffer components, the date of an access might be significantly shifted from the instruction date.
This section describes the method we use to associate the right access date with each instruction accessing memory, using information provided by the simulation. The method works in two steps and has no impact on the program executed by the simulated platform.
1) Tracing hardware events
This first step takes place during the simulation of a multiprocessor platform executing a concurrent program. It consists in tracing all events related to the program execution and memory accesses. Several kinds of platform components are involved in this operation: processors, caches and memories. Traced events are stored for further use.
Each involved component traces events depending on the operations it performs. The traced events are:

• processor instructions,
• processor requests,
• cache acknowledgements,
• cache requests,
• memory acknowledgements.

Requests and acknowledgements represent memory accesses. The initiator of a memory access generates a request event and the target generates an acknowledgement event. An acknowledgement matches the action of the target on the memory array (mainly a read from or a write to it).
Events contain type-specific data. A request event contains the address, width and type of the access; example types are load, store, linked load and exclusive (for write-back policy caches). A processor request event also contains the data read or written by the access. An instruction event contains the instruction address and the processor state changes. A memory acknowledgement event contains only the date of the target's action. Because dates are only used to sort acknowledgements, they can be logical dates, unrelated to the simulated time.

Figure 1. Event dependency examples for a 2-processor platform with write-through policy caches and write buffers. An arrow denotes a dependency between two events: the event at the circle end contains the identifier of the event at the triangle end.
Events are generated by several components, and some of them need to be linked together (for example an acknowledgement with the corresponding request). To this end, each event is tagged with a unique identifier.
The identifier of an event can then be used by another event to indicate a relation between the two. This method is used in three cases:

• An acknowledgement event (issued by a cache or a memory) contains the identifier of the request event that led to this acknowledgement.
• A cache request contains the identifier of a processor request when the cache relays that request to the memory (for example when a processor load triggers the load of a whole cache line).
• A processor request event contains the identifier of the instruction that generates the memory access.
Figure 1 shows some typical examples of event dependencies. In this figure, processor 0 first performs a load, which leads its cache to load a line from memory. Processor 0 then performs a second load in the same line, which is handled by the cache. Processor 1 first performs two writes, which are gathered by the cache write buffer, and then an uncached read, which is handled directly by the memory, bypassing the cache.
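As an illustration, these event records and identifier links could be modeled as follows. This is a sketch: the class and field names are ours, not the paper's actual trace format.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List

_ids = count()  # global source of unique event identifiers


@dataclass
class Event:
    # every traced event is tagged with a unique identifier
    ident: int = field(default_factory=lambda: next(_ids), init=False)


@dataclass
class InstructionEvent(Event):
    pc: int = 0                 # instruction address


@dataclass
class RequestEvent(Event):
    address: int = 0
    width: int = 4
    kind: str = "load"          # load, store, linked load, exclusive, ...
    # identifiers of the events this request derives from: the instruction
    # for a processor request; one or more processor requests for a cache
    # request (several when a write buffer gathers writes)
    parents: List[int] = field(default_factory=list)


@dataclass
class AcknowledgementEvent(Event):
    request_id: int = 0         # identifier of the acknowledged request
    date: int = 0               # logical date of the target's action


# first access of Figure 1: a processor load relayed by the cache
instr = InstructionEvent(pc=0x5000)
preq = RequestEvent(address=0x2004, parents=[instr.ident])
creq = RequestEvent(address=0x2000, parents=[preq.ident])
ack = AcknowledgementEvent(request_id=creq.ident, date=1)
```

Replaying an access then amounts to chaining identifiers: the memory acknowledgement points at the cache request, which points at the processor request, which points at the instruction.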
As long as events are not linked by identifiers, generating them is straightforward: the processor, cache and memory components have to be modified to generate the events, but not their communication channels.
But as an event may need the identifier of another event, the communication channel between the two components that generate these events must be modified: the channel has to transport the identifier of an event along with the standard communication. Due to the direction of the links (an acknowledgement needs the identifier of the request event, but not the opposite), the response channel does not need to be modified; only the request channel is.
2) Associating dates to memory related instructions
The second step consists in building, for each processor, the thread of executed instructions with the proper dates associated to memory access instructions. These dates are later used to sort the instructions of the different processors so as to match the memory access order. An instruction that generates a memory access must be tagged with the date of the acknowledgement at the memory.
Note that the way write accesses are handled depends on the cache policy. Write-through caches do not generate acknowledgment events on write accesses, but relay them to the memory as shown in Figure 1. On the contrary, write-back caches can acknowledge write accesses when the corresponding line state allows it. When a write-through cache uses a write buffer, still no write acknowledgment is generated by the cache, but several identifiers (one for each initial request) are recorded in the resulting request event (see Figure 1).
In order to associate a date with an instruction that generates an access, the relations between events must be followed from the memory acknowledgement down to the instructions. But due to the presence of caches, each processor request is not directly linked to a memory acknowledgement. In the case of a read cache hit, the cache acknowledges the request, but the real read access up to the memory may have been done long before by the cache. A similar issue arises for a write access with a write-back cache policy (except that the real access happens after, not before).
It would be possible to assign the true date (which could be in the past or in the future) of the memory access to every processor request, but this is not what we need: the dates of memory accesses would then not be in increasing order inside a processor's instruction sequence. Without this property, the analysis of threads would be complicated: for example, it would be impossible to find the next access to a memory location without scanning the whole instruction sequence of each processor. We therefore have the following constraints:

1) Dates in a processor's instruction sequence must be in increasing order.
2) The dates of two memory accesses at the same address from two different processors must be in the proper order: the access that takes place first must have the smaller date.

Figure 2. Example of date associations to processor requests for a processor behind a write-back policy cache
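These two constraints can be expressed as a small checker. The sketch below is ours: it assumes each processor sequence is a list of (address, date) pairs in program order, and that the true per-address access order is available from the acknowledgement trace; ties in dates are tolerated, matching the non-strict dating used for cache-acknowledged accesses.

```python
def check_date_constraints(proc_seqs, true_order):
    """proc_seqs: {cpu: [(addr, date), ...]} in program order.
    true_order: {addr: [(cpu, index), ...]} true access order per address,
    where index points into that cpu's sequence."""
    # Constraint 1: dates never decrease along a processor sequence.
    for seq in proc_seqs.values():
        if any(d2 < d1 for (_, d1), (_, d2) in zip(seq, seq[1:])):
            return False
    # Constraint 2: for a given address, the assigned dates must not
    # invert the true access order.
    for addr, order in true_order.items():
        dates = [proc_seqs[cpu][i][1] for cpu, i in order]
        if any(d2 < d1 for d1, d2 in zip(dates, dates[1:])):
            return False
    return True


# example: cpu 0 reads 0x2000 then 0x2004; cpu 1 accesses 0x2000 afterwards
seqs = {0: [(0x2000, 1), (0x2004, 1)], 1: [(0x2000, 3)]}
ok = check_date_constraints(seqs, {0x2000: [(0, 0), (1, 0)]})
```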
Accesses can be classified in two categories: memory-acknowledged accesses and cache-acknowledged accesses. Figure 2 shows how accesses are dated. There is no difficulty in assigning a date to an access acknowledged by the memory: we keep the date of the acknowledgement and the constraints are met.
However, for accesses that are acknowledged by a cache, a date must be computed. In order to meet the first constraint, this date must lie between the dates of the previous and next memory-acknowledged accesses of the processor. The date of the corresponding memory acknowledgment cannot be used, since it would violate this constraint. For all cache-acknowledged accesses between two memory-acknowledged accesses, we use the date of the previous memory-acknowledged access. The first constraint is obviously met, although the processor order is no longer strict.
The second constraint is also met, because the cache-acknowledged access could have been performed at the date of the previous memory-acknowledged access without changing the results of the accesses at that address. There are two cases, the read case and the write case. Let T1 be the true date of the access and T2 the date of the previous memory-acknowledged access. Sequential consistency ensures that there is a total order of all memory accesses respecting each processor's access order.
• If the access is a read, then it is a cache hit and T1 <= T2. If a modification of the memory cell had occurred before T2, the cache would have received an invalidation before the previous memory acknowledgement, due to sequential consistency. So the line has not been written between T1 and T2. Other processors may have read the line, but reordering consecutive reads to the same address is not a problem.
• If the access is a write, then it is delayed by the write-back cache policy and T1 > T2. The cache therefore holds the accessed line in exclusive state, and no other processor can access the line (even for a read) without the cache first writing it back to the memory. So the line has not been accessed between T2 and T1.
The following algorithm tags each processor request with a proper date. Due to the direction of the links between events (acknowledgement to request), it is not easy to find the acknowledgment event starting from a request event. Conversely, starting from the acknowledgement event and finding the requests that correspond to it raises no difficulty. This is why the algorithm is driven by the components at the top of the memory hierarchy (i.e., the memories).
• Main: consume all events of the memories following date order. Memories should only have generated acknowledgment events.
• Memory: for each consumed event, identify the source component (processor or cache) from the identifier contained in the event, and consume the events of this component up to the one referred to by the identifier. The acknowledgement date is given to this source component.
• Cache: for each consumed event, if there is a source identifier, consume the events of the source component (lower-level cache or processor) up to the identified event. The date given to the source component is either the previous date, or the just-received date if the event is linked to the request being acknowledged by the memory (or higher-level cache).
• Processor: each consumed request is tagged with the currently received date.
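The steps above can be sketched for a single cache level as follows. This is our simplification, not the actual implementation: event queues are given per component, and the write-buffer case where one cache request carries several identifiers is ignored.

```python
def assign_dates(acks, events, issuer, below):
    """acks: memory acknowledgements as (date, request_id) pairs.
    events: {comp: [(ident, parent_ident or None), ...]} events generated
    by each component, in generation order.
    issuer: {request_id: component that issued the request}.
    below: {comp: next component down (cache -> processor) or None}.
    Returns {processor: [(request_ident, date), ...]}."""
    cursor = {c: 0 for c in events}   # next unconsumed event per component
    last = {c: 0 for c in events}     # previous date given to the component
    dated = {c: [] for c in events if below.get(c) is None}

    def consume(comp, upto, date):
        # Consume comp's events up to (and including) the one named `upto`.
        # Events before `upto` keep the component's previous date; the
        # event `upto` itself receives the just-received date.
        evs = events[comp]
        i = cursor[comp]
        while i < len(evs):
            ident, parent = evs[i]
            d = date if ident == upto else last[comp]
            if below.get(comp) is None:       # a processor: tag the request
                dated[comp].append((ident, d))
            elif parent is not None:          # a cache: follow the link down
                consume(below[comp], parent, d)
            i += 1
            if ident == upto:
                break
        cursor[comp] = i
        last[comp] = date

    for date, req in sorted(acks):            # memory events in date order
        consume(issuer[req], req, date)
    return dated


# example: processor "p" behind cache "c"; requests 10 and 12 are relayed
# to the memory (as cache requests 20 and 21), request 11 is a cache hit
events = {"c": [(20, 10), (21, 12)],
          "p": [(10, None), (11, None), (12, None)]}
dated = assign_dates(acks=[(5, 20), (9, 21)], events=events,
                     issuer={20: "c", 21: "c"}, below={"c": "p", "p": None})
```

In this example the cache-hit request 11 receives the date of the previous memory-acknowledged access (5), exactly as the dating rule of the previous section prescribes.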
This algorithm works as long as dates are consistent across all memory components. Depending on the simulator used, such a global time might not be available. The algorithm consumes the events of each component in the order they were generated; the intermediate storage of each component's event sequence can therefore be avoided, and the components can directly feed the algorithm. In that case, only the processor sequences are generated.
B. Software analysis: data races detection
The previously generated processor instruction sequences contain the executed instructions, associated with processor state changes and memory access acknowledgment dates. This information makes it possible to run several analyses on the executed concurrent software.
1) Building software structure: threads
Processor sequences contain information related to the hardware side of the execution: processor state changes and dates of memory accesses. These dates guarantee a proper interleaving of memory instructions. However, information on the software side is missing. In addition to the processor instruction sequences, the following analysis method needs the symbols, which allow, for example, instruction addresses to be matched to functions. A prerequisite of the analysis is that the symbols are known. The Application Binary Interface (ABI) is also needed.
The processor sequences are then analysed to generate the software structure. This operation respects the memory access order and may be seen as a kind of replay of the execution. The ABI allows function calls and returns to be detected when the symbols of these functions are known.
Additional data, such as DWARF debugging information, contains the parameter locations of function symbols and allows the parameters of function calls to be identified. Some analyses rely on the identification of specific function calls or returns.
Initially, one thread is associated with each processor and the call graphs are built. This leads to a correct result as long as no Operating System (OS) performs thread scheduling. If there is an OS, the thread creation and scheduling functions have to be detected in order to create new threads and to change the thread associated with a processor.
2) Adding synchronization points
Detecting any function is possible if its symbol information is known. In order to apply this mechanism to data race detection, the synchronization mechanisms of the software threads must first be detected: without taking synchronization constraints into account, the threads would be considered entirely concurrent.
A synchronization point is a point in a software thread that is linked with at least one point in a second software thread. A link states that one synchronization point occurs either after or before the other.
At the lowest level, atomic memory accesses (load linked, store conditional, test-and-set, compare-and-swap, etc.) may be considered as synchronization points, but some synchronization can be done without them. Considering every memory access as a synchronization is not a good solution either, as most accesses do not take part in any synchronization.
From a higher-level point of view, synchronizations are done through software functions. Synchronization points are then created when the execution of such a function is detected. Useless synchronization points generate additional constraints which might mask data races. To avoid them, synchronization points must be set only for the highest-level synchronization functions, as these may be built from several lower-level ones.
3) Checking data races
A data race corresponds to the case where multiple threads access the same memory location concurrently and the result of the accesses cannot be decided without knowing in which order they take place. Two cases exist: write-read and write-write. They can be reduced to a single one: a data race occurs if one thread writes to a memory location and another thread accesses the same location (read or write).
Figure 3 shows some software threads with their synchronization points. The parts of a thread between two synchronization points are called segments in the following. To find all data races, every pair of threads must be analyzed.
Figure 3. Example of threads with their synchronization points. Numbers represent synchronization points and letters represent segments. Dashed arrows represent indirect links.
In order to find the data races between two threads, the whole graph including all threads must be reduced. Although only two threads are concerned, the graph cannot be reduced by simply removing every other thread and the links not involving the two analysed threads: due to the transitivity of the links between synchronization points, some indirect links between the two studied threads may be inferred. For example, in Figure 3, point 1.3 happens before point 0.2 although there is no direct link between them.
Two segments of the two threads may be concurrent if no constraint forces one to be completely executed before the other starts. In Figure 3, the segments of thread 1 that are concurrent with segment 0.A of thread 0 are 1.A, 1.B and 1.C. Finally, the accesses of concurrent segments must be checked in order to find data races.
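This reduction and check could be sketched as follows. The sketch is ours: segment names follow Figure 3's convention, the access sets are hypothetical, and the happens-before relation is closed transitively before concurrent pairs are enumerated.

```python
def find_races(segments, edges, accesses):
    """segments: segment ids; edges: (a, b) pairs meaning segment a is
    entirely ordered before segment b (program order plus synchronization
    links); accesses: {segment: set of (addr, 'r' or 'w')}."""
    before = {s: set() for s in segments}
    for a, b in edges:
        before[a].add(b)
    changed = True
    while changed:                    # transitive closure of `before`
        changed = False
        for a in segments:
            extra = set()
            for b in before[a]:
                extra |= before[b]
            if not extra <= before[a]:
                before[a] |= extra
                changed = True
    races = []
    for i, a in enumerate(segments):
        for b in segments[i + 1:]:
            if b in before[a] or a in before[b]:
                continue              # ordered: not concurrent
            for addr1, k1 in accesses.get(a, ()):
                for addr2, k2 in accesses.get(b, ()):
                    if addr1 == addr2 and "w" in (k1, k2):
                        races.append((a, b, addr1))
    return races


# two threads of two segments each; a synchronization link orders 1.A
# before 0.B; the accesses to the illustrative address 0x100 are unprotected
segs = ["0.A", "0.B", "1.A", "1.B"]
edges = [("0.A", "0.B"), ("1.A", "1.B"), ("1.A", "0.B")]
acc = {"0.A": {(0x100, "w")}, "1.B": {(0x100, "r")}}
races = find_races(segs, edges, acc)
```

Here segments 0.A and 1.B are unordered with respect to each other, and one of them writes 0x100, so a write-read race is reported.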
IV. IMPLEMENTATION AND RESULTS
This section presents our implementation and the obtained results. We first describe the architecture of the software analysis program, then detail our experiments and results.
A. Analysis implementation
The two steps described in Sections III-A1 and III-A2 are not implemented in the same program. The first step is implemented in the component models used in the simulation. Events are stored into separate files (one per component); the storage is done in parallel with the simulation, in a separate thread, to limit the overhead. The second step is a separate program that generates one dated instruction sequence per processor.
The software thread analysis program is organized as follows. It takes as input the previously generated processor instruction sequences and the binary image of the software executed by the simulated platform. The software image is used to obtain the software symbols.
The main program core consumes the instructions following the date order. When an instruction is consumed, it is appended to the current software thread of its processor, and, using the software symbols and the ABI, the software thread histories (call graphs) are built. The user can register hook functions at the entry or exit of function symbols to perform specific tasks (for example changing the current thread of a processor, or adding synchronization points). A hook function is executed by the main core when the given symbol is detected. The program stores the software thread histories and the synchronization graph into files.
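The hook mechanism could look like the following sketch. The names are ours; in the real tool the callback would receive the call context recovered through the ABI (for instance the function's arguments), rather than the fixed thread id used here for illustration.

```python
class ReplayCore:
    """Consumes dated instructions and fires user hooks when a registered
    function symbol is entered or left."""

    def __init__(self):
        self.hooks = {}               # (symbol, kind) -> callback
        self.current_thread = {}      # cpu -> software thread id

    def register(self, symbol, kind, callback):
        # kind is 'entry' or 'exit'
        self.hooks[(symbol, kind)] = callback

    def feed(self, cpu, symbol, kind):
        # called by the replay loop when the ABI analysis detects a call
        # ('entry') or a return ('exit') of a known symbol on `cpu`
        cb = self.hooks.get((symbol, kind))
        if cb is not None:
            cb(self, cpu)


# example: switch the software thread of a processor on a context load;
# the thread id would normally be decoded from the function's arguments
def on_context_load(core, cpu):
    core.current_thread[cpu] = "decoder_1"  # illustrative value


core = ReplayCore()
core.register("cpu_context_load", "entry", on_context_load)
core.feed(0, "cpu_context_load", "entry")
```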
The thread histories can then be used to study the software behavior. A data race program has been implemented that detects data races between a pair of threads using their two histories and the synchronization graph.
B. Experimentation detail
We implemented the first step (Section III-A1) using the SoCLib framework, a library of SystemC components for MPSoC simulation. The SoCLib [15] library provides components such as processors, caches, memories and interconnects, and allows platforms to be built at the system level.
Figure 4. Simulated platform
We modified the CABA (Cycle Accurate, Bit Accurate) implementation of the components to build a platform containing several MIPS32 ISA processors with write-through caches and write buffers. Memory and caches communicate through an abstract network. The memory is kept coherent between the caches through a directory-based mechanism.
Figure 4 shows an overview of this platform. The platform also contains other peripheral components: a timer, an interrupt controller, a frame buffer, a serial output peripheral and a storage peripheral. The simulation was run with the SystemCASS [16] SystemC kernel.
The software running on this platform is a parallel MJPEG decoder on top of a small operating system called DNA [17]. The MJPEG decoder is organized as follows: a first thread is in charge of reading the file and dispatching JPEG blocks to several threads which do the decoding; the decoded parts of the images are then given to a last thread in charge of the display.
For the software analysis, several hooks have been registeredon DNA functions.
Hooks have been set on the thread context handlers (cpu_context_init, cpu_context_load and cpu_context_save) to handle thread creation and scheduling.
Hooks have been set on the lock functions (lock_acquire, lock_release) and on the operating system barrier (cpu_mp_proceed and cpu_mp_wait) to create synchronization points between the threads.
The insertion of synchronization points must be done carefully: missing some will lead to false data race detections, while inserting too many may prevent real races from being detected. For example, semaphores should not be considered as synchronization points, since they are not generally used to protect shared variables or memory.
Each thread was separated into an application level and a kernel level, and additional hooks have been set on several kernel functions to handle the switch between the two levels. This allows the kernel synchronization to be removed from the application part and vice versa.
C. Results
The platform was simulated with 1 to 16 processors. Figure 5 shows some numbers for a simulation of 100,000,000 cycles with 1, 2 and 8 processors. Sim. time is the time needed to simulate the platform and store the events. Overhead is the simulation time overhead compared with the initial simulation without event tracing. Ev. num and Ev. size are the number of events and their size. Inst. time is the total and user time needed to generate the dated processor instruction sequences from the events (the step described in Section III-A2). Inst. num and Inst. size are the number of instructions executed by all processors and their size. Soft. time is the total and user time used to build the software threads and the synchronization graph from the dated instruction sequences.
As these numbers show, the application does not scale well on the platform. This is due to the platform implementing sequential consistency, which is a very strict constraint on the memory hierarchy. Furthermore, we do not use any DMA transfers, because the trace system does not support them yet. As a consequence, performance is low when using several processors.
Processors                 1         2         8
Sim. time (seconds)        156s      395s      468s
Overhead                   6.3%      4.5%      3.5%
Ev. num (millions)         64.5      64.9      70.2
Ev. size                   1.3GB     1.3GB     1.4GB
Inst. time (total/user)    127s/12s  127s/13s  131s/14s
Inst. num                  31.4      31.6      34.6
Inst. size                 582GB     586GB     638GB
Soft. time (total/user)    43s/7s    44s/8s    48s/8s

Figure 5. Time spent in the different steps for different numbers of processors
The overhead of the trace system during the simulation is not very high. The time spent in the following steps is significant compared to the simulation time, but it is mostly spent in system calls (the user time is low), due to the large amount of data read from files in both steps. Most of that time could therefore be avoided by pipelining all the steps and not using intermediate files; this could be done without difficulty, as the two last steps read the results of the previous step linearly.
Using the thread call graphs and the synchronization graph generated by the last step, data race detection was run on pairs of threads that communicate together; this restriction is due to the complexity of the detection (each part of the first thread between two consecutive synchronization points must be checked against each part of the second thread that can be concurrent with it). No data races were detected in the MJPEG decoder using these synchronization points. In order to test our analysis, we removed some locks from the FIFO driver used by the threads of the MJPEG decoder to transfer data; this led to the detection of data races. However, a detected data race must be studied carefully, as it may be a false positive. For example, an increasing counter which is protected by a lock for updates but not for reads will cause false data races to be reported on the reads.
V. CONCLUSION
The generalization of multiprocessors in integrated circuits raises the issue of debugging parallel programs in embedded devices. Debugging can no longer be done step by step on a console, and relies more and more on trace analysis. We have shown that, using virtual prototypes producing non-intrusive traces, it is possible to perform complex analyses such as data race detection. The analysis can be done at different levels of abstraction or target different parts of the software; for example, the operating system kernel and the application can be analysed independently.
However, our method only analyses a given execution on a platform. It could be coupled with a stimulation of the simulation (for example by adding random delays in the communications) to try to change the execution at each simulation run and increase the test coverage.
We plan to extend this work to systems that do not follow the sequential consistency model, and to the verification of cache coherence protocol implementations.
ACKNOWLEDGMENT
This work is funded by the French Authorities in the framework of the Nano 2012 Program.
REFERENCES
[1] ITRS, "ITRS 2009 Edition," 2009, http://www.itrs.net.
[2] ARM, "Embedded trace macrocell architecture specification," http://www.arm.com.
[3] MIPS Technologies, "EJTAG Trace Control Block Specification," http://www.mips.com.
[4] A. B. Hopkins and K. D. McDonald-Maier, "Debug support strategy for systems-on-chips with multiple processor cores," IEEE Transactions on Computers, vol. 55, no. 2, pp. 174–184, February 2006.
[5] B. Vermeulen, K. Goossens, and S. Umrani, "Debugging distributed-shared-memory communication at multiple granularities in networks on chip," in Proceedings of the ACM/IEEE International Symposium on Networks-on-Chip, April 2008, pp. 3–12.
[6] C. E. McDowell and D. P. Helmbold, "Debugging concurrent programs," ACM Computing Surveys, vol. 21, no. 4, pp. 593–622, December 1989.
[7] C.-N. Wen, S.-H. Chou, T.-F. Chen, and A. P. Su, "NUDA: A non-uniform debugging architecture and non-intrusive race detection for many-core," in Proceedings of the 46th ACM/IEEE Design Automation Conference, July 2009, pp. 148–153.
[8] L. Albertsson, "Simulation-based debugging of soft real-time applications," in Proceedings of the 7th IEEE Real-Time Technology and Applications Symposium, May 2001, pp. 107–108.
[9] N. Pouillon, A. Becoulet, A. V. de Mello, F. Pecheux, and A. Greiner, "A generic instruction set simulator API for timed and untimed simulation and debug of MP2-SoCs," in Proceedings of the IEEE/IFIP International Symposium on Rapid System Prototyping, June 2009, pp. 116–122.
[10] S. Taylor, C. Ramey, C. Barner, and D. Asher, "A simulation-based method for the verification of shared memory in multiprocessor systems," in IEEE/ACM International Conference on Computer Aided Design, November 2001, pp. 10–17.
[11] N. Fournel, A. Fraboulet, and P. Feautrier, "eSimu: a fast and accurate energy consumption simulator for real embedded systems," in IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks, 2007, pp. 1–6.
[12] D. Kranzlmueller, "Event graph analysis for debugging massively parallel programs," Ph.D. dissertation, Johannes Kepler University of Linz, Austria, 2000.
[13] C. Prada-Rojas, V. Marangozova-Martin, K. Georgiev, J.-F. Mehaut, and M. Santana, "Towards a component-based observation of MPSoC," in Proceedings of the International Conference on Parallel Processing Workshops, 2009, pp. 542–549.
[14] L. Lamport, "How to make a multiprocessor computer that correctly executes multiprocess programs," IEEE Transactions on Computers, vol. C-28, no. 9, pp. 690–691, 1979.
[15] "SoCLib," http://www.soclib.fr.
[16] R. Buchmann, F. Pétrot, and A. Greiner, "Fast cycle accurate simulator to simulate event-driven behavior," in Proceedings of the 2004 International Conference on Electrical, Electronic and Computer Engineering (ICEEC'04), 2004, pp. 35–39.
[17] X. Guerin and F. Pétrot, "A system framework for the design of embedded software targeting heterogeneous multi-core SoCs," in Proceedings of the 20th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2009, pp. 153–160.
Embedded Virtualization for the Next Generation of Cluster-based MPSoCs
Alexandra Aguiar, Felipe G. de Magalhaes, Fabiano Hessel
Faculty of Informatics – PUCRS – Av. Ipiranga 6681, Porto Alegre, Brazil
[email protected], [email protected], [email protected]
Abstract
Classic MPSoCs tend to be fully implemented using a single communication approach. However, recent efforts have shown a promising new multiprocessor system-on-chip infrastructure: the cluster-based, or clustered, MPSoC. This infrastructure adopts hybrid interconnection schemes in which both buses and NoCs are used concomitantly. The main idea is to decrease the size and complexity of the NoC by using bus-based communication at each local port. For example, while in a classic approach a 16-processor NoC might be formed in a 4 x 4 arrangement, in a cluster-based MPSoC a 2 x 2 NoC is employed and each router's local port connects to a bus that carries 4 processors. Nevertheless, although good results have been achieved with this approach, implementing the wrappers that connect the local router port to the bus can be complex. Therefore, we propose in this work the use of embedded virtualization, another promising current technique, to achieve results similar to cluster-based MPSoCs without the need for wrappers, while also decreasing area usage.
1 Introduction
Embedded Systems (ES) have become a solid reality in people's lives. They are present in a broad range of products, such as entertainment devices (smart phones, video cameras, games, toys), medical equipment (dialysis machines, infusion pumps, cardiac monitors), the automotive business (engine control, security, ABS) and even the aerospace and defense fields (flight management, smart weaponry, jet engine control) [18].
Usually, these systems need powerful implementation solutions comprising several processing units, such as Multiprocessor Systems-on-Chip (MPSoCs) [11]. One of the most important issues regarding MPSoCs lies in the way communication is implemented. Initially, bus-based systems were the most common communication solution, since they are usually simpler to implement.

978-1-4577-0660-8/11/$26.00 2011 IEEE
On the other hand, buses have poor scalability, since only a few dozen processors can be placed on the same structure without prohibitive contention rates. Therefore, other communication solutions started being researched, the most prominent being the Network-on-Chip (NoC) approach [16].
NoCs are a widely accepted communication solution based on general-purpose network concepts. However, NoCs can present more complex communication protocols and, consequently, less predictability.
In this context, a recent idea known as Cluster-based MPSoCs has gained attention [7], [12]. This approach intends to bring together the best of both worlds: NoCs allow higher scalability, while buses keep the design simpler even with more processors in the system. To better understand the concept, Figure 1 depicts a 2x2 NoC which contains a bus located at each local port. Each bus carries four processors, which communicate in a simpler way inside the cluster and, if needed, can communicate with other clusters through the NoC. Dotted lines represent the wrappers needed to connect the bus to the NoC.
Figure 1. Cluster-based MPSoC concept
Another recent idea for embedded systems is the use of virtualization in their composition. Virtualization has several possible advantages, including decreased area, increased security levels and eased software design [10], [2], [3]. Virtualized systems are composed of a hypervisor that holds and controls the operational details of all virtual machines.
This paper proposes the unification of both concepts. Instead of using buses on each router of the NoC, we propose a single processor holding a hypervisor that emulates several virtual processors. Since buses are poorly scalable, hypervisors do not need to support more processors than a simple bus would. The main contribution of this proposal, named Virtual Cluster-based MPSoCs, is to provide multiprocessed systems with lower area occupation.
The remainder of the paper is organized as follows. Section 2 presents related work on cluster-based MPSoCs. Section 3 reviews basic concepts of embedded virtualization. Then, in Section 4, details of the Virtual Cluster-based MPSoC are discussed. Section 5 details motivational use cases and some initial experimental results. Finally, Section 6 concludes the paper and presents future work.
2 Cluster-based MPSoCs
It is widely known that several MPSoCs are bus-based architectures; systems such as the ARM MPCore [9], the Intel IXP2855 [6] and the Cell processor [13] are examples. Nevertheless, the need for more processing elements and growing system complexity have led to other approaches being researched.
Networks-on-Chip (NoCs) have arisen as the main communication infrastructure for complex MPSoCs. However, the design of NoC-based parallel applications is far more complex than that of bus-based systems [7].
Due to the lack of scalability in bus-based systems and the excessive application design complexity found in NoCs, cluster-based systems are becoming a possible alternative. These systems intend to achieve the advantages of both.
In [7], the authors propose a cluster-based MPSoC prototype design. They integrate 17 Nios II [14] cores, organized in four processing clusters and a central core. In every cluster, each core has its own local memory and communication is performed through a shared memory accessed from the bus. For inter-cluster communication, the cores have a shared network interface.
This system further proposes that a single processing element has access to external peripherals, such as SDRAM controllers. This central control unit is also responsible for managing the mapping of the parallel application onto the clusters as well as gathering the expected results. Figure 2 depicts the architecture proposed by [7]. In this figure, LM stands for Local Memory, CSM for Common Shared Memory, NI for Network Interface and SDRAM IF for SDRAM Interface.
Figure 2. The architecture of the Cluster-based MPSoC proposed by [7]
Results were taken considering two real applications: matrix chain multiplication and JPEG picture decoding, both implemented on an FPGA development board. The implementation resulted in speedup ratios above 15 times. The main drawback is that real-time applications are not addressed by the authors.
Figure 3 shows an example of a processing cluster that composes the cluster-based MPSoC, itself composed of four processor cores. Each processor core, a Nios II, contains its own Local Memory (LM in the figure) and a bridge to access the local bus. A Common Shared Memory (CSM in the figure), used to exchange data among the processors, is also connected to this bus. In addition, a semaphore register file is present, used to synchronize the processes accessing the shared memory. Finally, the cores also share a Network Interface (NI in the figure) which allows inter-cluster communication.
Figure 3. The architecture of each processing cluster proposed by [7]
Jin [12] proposes a cluster-based MPSoC using hierarchical on-chip buses, aiming to attack some of the problems pure NoC implementations can present to the components connected to the network. One of the main problems pointed out by the authors concerns real-time applications, where the NoC must provide highly efficient data exchange. In this approach, no NoC is adopted; therefore, the performance of the computation cluster is very important for the system as a whole.
The approach presented in [12] can be seen in Figure 4. The system adopts the AMBA-AHB protocol, a high-performance system bus that supports multiple bus masters and provides high-bandwidth operation. The authors also use a hierarchical bus architecture to obtain better performance, especially by decreasing bus collision rates, improving the speed of register configuration and avoiding shared-memory contention and bottlenecks.
Figure 4. The architecture of the cluster-based MPSoC proposed by [12]
The proposed solution is divided into inner buses, which are present in each SoC itself - forming each cluster - and the outer bus, which connects them to each other and to external peripherals.
The work proposed in [4] also targets pure NoC implementations, adding a bus-based interface to NoC routers. The main goal is to ease integration with bus-based IP components, which are more commonly found. Thus, the proposed NoC is able to integrate standard non-packet-based components, reducing design time.
Other approaches have also studied the use of buses in NoCs for different purposes [15], [20]. In our case, we still want to use the NoC infrastructure, but instead of adding another level of communication we propose the use of virtual domains.
The next section introduces some concepts of embedded virtualization.
3 Virtualization and Embedded Systems
Even in classic virtualization, whose concepts date back more than 30 years [8], the main component is the hypervisor. The hypervisor is responsible for managing the virtual machines (also known as virtual domains) by providing them the environment needed for their proper operation.
Two approaches are commonly used to implement the hypervisor, also known as Virtual Machine Monitor (VMM). A type 1 hypervisor, also known as hardware-level virtualization, can itself be considered an operating system, since it is the only piece of software that works in kernel mode, as depicted in Figure 5. Its main task is to manage multiple copies of the real hardware - the virtual boards (virtual machines or domains) - just like an OS manages multitasking.
Figure 5. Hypervisor Type 1
Type 2 hypervisors, also known as operating-system-level virtualization and depicted in Figure 6, are implemented such that the hypervisor itself can be compared to a user application that simply “interprets” the guest machine ISA.
Figure 6. Hypervisor Type 2
One of the most successful techniques to implement virtualized systems is known as para-virtualization. It replaces sensitive instructions of the original kernel code with explicit hypervisor calls (also known as hypercalls). Sensitive instructions belong to a classification of the instructions of an ISA (Instruction Set Architecture) into three different groups, proposed by Popek and Goldberg [19]:
1. privileged instructions: those that trap when used in user mode and do not trap if used in kernel mode;

2. control sensitive instructions: those that attempt to change the configuration of resources in the system; and

3. behavior sensitive instructions: those whose behavior or result depends on the configuration of resources (the content of the relocation register or the processor's mode).
The goal of para-virtualization is to reduce the problems encountered when dealing with different privilege levels. Usually, a scheme referred to as protection rings is used, which guarantees that the lower-level rings (Ring 0, for instance) hold the highest privileges. Thus, most OSs are executed in Ring 0 and are able to interact directly with the physical hardware.
When a hypervisor is adopted, it becomes the only piece of software executed in Ring 0, with severe consequences for the guest OSs: they are no longer executed in Ring 0 but instead run in Ring 1, with fewer privileges.
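As a minimal illustration of para-virtualization (all names below are hypothetical and not taken from any real hypervisor ABI), a sensitive operation in the guest kernel is replaced by an explicit call into the hypervisor, which performs the privileged work on the guest's behalf:

```c
/* Hypothetical hypercall numbers; a real hypervisor defines its own ABI. */
enum { HC_DISABLE_IRQ = 1, HC_ENABLE_IRQ = 2 };

static int irq_enabled = 1;  /* stands in for real interrupt-controller state */

/* Hypervisor side: runs with full (Ring 0) privileges. */
long hypercall(int number)
{
    switch (number) {
    case HC_DISABLE_IRQ: irq_enabled = 0; return 0;
    case HC_ENABLE_IRQ:  irq_enabled = 1; return 0;
    default:             return -1;  /* unknown hypercall */
    }
}

/* Guest side: where native code would execute a privileged instruction
 * (and trap), the para-virtualized kernel invokes the hypervisor directly. */
void guest_disable_interrupts(void)
{
    hypercall(HC_DISABLE_IRQ);
}
```

The guest source is modified at build time, so no trap-and-emulate machinery is needed for this instruction at run time.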
These concepts come from virtualization for general-purpose systems, but they are very important when dealing with the typical challenges of embedded systems. Next, some peculiarities of applying virtualization solutions to embedded systems are discussed.
3.1 Virtual-Hellfire Hypervisor
There are several hypervisors with an embedded systems focus [22], [10], [21]. In this work, we adopt the Virtual-Hellfire Hypervisor (VHH) [3], part of the Hellfire Framework. The main advantages of VHH are:
• temporal and spatial isolation among domains (each domain contains its own OS);
• resource virtualization: clock, timers, interrupts, memory;
• efficient context switch for domains;
• real-time scheduling policy for domain scheduling;
• deterministic hypervisor system calls (hypercalls).
VHH considers a domain to be an execution environment where a guest OS can be executed, and it offers the domain the virtualized services of the real hardware. In embedded systems where no hardware support is offered, para-virtualization tends to present the best performance results. Therefore, in VHH, domains need to be modified before being executed on top of it. As a result, they do not manage hardware interrupts directly. Instead, the guest OS must be modified to use the virtualized operations provided by the VHH (hypercalls).
Figure 7 depicts the Virtual-Hellfire Hypervisor structure. In this figure, the hardware continues to provide basic services such as timers and interrupts, but they are managed by the hypervisor, which provides hypercalls for the different domains, allowing them to perform privileged operations.
Figure 7. Virtual-Hellfire Hypervisor Domain structure
Thus, the Virtual-Hellfire Hypervisor is implemented based on the HellfireOS [1] and counts on the following layers:
• Hardware Abstraction Layer - HAL, responsible for implementing the set of drivers that manage the mandatory hardware, such as the processor, interrupts, clock and timers;
• Kernel API and Standard C Functions, which are not available to the partitions;
• Virtualization layer, which provides the services required to support virtualization and para-virtualization. The hypercalls are implemented in this layer.
Figure 8 depicts the architecture of the VHH, where the following modules can be found:
• domain manager, responsible for domain creation, deletion, suspension, etc.;
• domain scheduler, responsible for scheduling domains on a single processor;
• interrupt manager, which handles hardware interrupts and traps and is also in charge of triggering virtual interrupts and traps to domains; and
• hypercall manager, responsible for handling calls made from domains, analogous to system calls in conventional operating systems.
Figure 8. VHH System Architecture
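A hypercall manager of this kind can be sketched as a dispatch table indexed by hypercall number, much like a conventional system-call table. The handler names and numbering below are illustrative assumptions, not VHH code:

```c
#include <stddef.h>

typedef long (*hypercall_fn)(long arg0, long arg1);

/* Two stub handlers standing in for real hypervisor services. */
static long hc_yield(long a0, long a1)    { (void)a0; (void)a1; return 0; }
static long hc_get_time(long a0, long a1) { (void)a0; (void)a1; return 42; }

/* Dispatch table indexed by hypercall number. */
static const hypercall_fn hypercall_table[] = { hc_yield, hc_get_time };

#define NUM_HYPERCALLS \
    (sizeof hypercall_table / sizeof hypercall_table[0])

/* Entry point reached when a domain traps into the hypervisor. */
long hypercall_manager(unsigned num, long arg0, long arg1)
{
    if (num >= NUM_HYPERCALLS)
        return -1;                       /* invalid hypercall number */
    return hypercall_table[num](arg0, arg1);
}
```

Bounds-checking the number before indexing is the one non-negotiable step: a domain must never be able to make the hypervisor jump through an out-of-range table slot.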
4 Virtual Cluster-Based MPSoCs
This section describes the Virtual Cluster-Based MPSoC proposal. Initially, let us look at each cluster of the MPSoC.
Since our work is based on the Hellfire Project, we also use the Plasma [5] processor, a MIPS-like architecture. The VHH is therefore placed on a Plasma processor as the basis of our cluster. The VHH is responsible for managing several virtual domains; in our case, each VHH manages its own processing cluster and allows the internal communication of its processors through shared memory.
Figure 9 is divided into two parts. Part A shows the current memory division, which provides only a single memory partition per virtual domain; this partition is considered the local memory of that domain. In part B, an extra partition has been added: the shared partition. The idea here is to provide easy, low-overhead communication inside the cluster.
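The partitioned layout of scheme B can be sketched as a simple address-map computation: each domain gets a private partition, and one shared partition sits after them all. The sizes and the struct below are our assumptions for illustration; the paper does not give VHH's actual layout:

```c
#include <stddef.h>

#define DOMAINS      4
#define PRIVATE_SIZE (24 * 1024)  /* assumed local partition per domain */
#define SHARED_SIZE  (32 * 1024)  /* assumed shared partition per cluster */

struct domain_map {
    size_t priv_base;   /* start of this domain's private partition */
    size_t priv_size;
    size_t shared_base; /* identical for every domain in the cluster */
    size_t shared_size;
};

/* Private partitions are laid out back to back; the shared partition
 * (scheme B of Figure 9) follows the last private one. */
struct domain_map layout(int domain)
{
    struct domain_map m;
    m.priv_base   = (size_t)domain * PRIVATE_SIZE;
    m.priv_size   = PRIVATE_SIZE;
    m.shared_base = (size_t)DOMAINS * PRIVATE_SIZE;
    m.shared_size = SHARED_SIZE;
    return m;
}
```

Because `shared_base` is the same in every domain's map, any two domains of the cluster see the same shared window and can exchange data through it.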
The VHH was extended to allow communication at two levels. The first level, named intra-cluster communication, occurs through shared memory. Currently, this is not transparent to the user and a specific hypercall must be used. This hypercall takes a single CPU identification (CPU ID), meaning that sender and receiver belong to the same processing cluster.
Figure 9. VHH Memory for (A) Non-clustered systems (B) Clustered systems

These hypercalls are similar to the communication functions provided by the HellfireOS and have the following parameters: VHH_SendMessage (cpu_id, task_id, message, message_length), used to send a message through the shared memory, and VHH_ReceiveMessage (source_cpu_id, source_task_id, message, message_length), used to receive it.
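The shared-memory exchange behind these hypercalls can be sketched as a single-slot mailbox living in the shared partition. This is a hypothetical model of the mechanism; the paper does not describe VHH's internal data structures, and the `vhh_send`/`vhh_receive` names are ours:

```c
#include <string.h>

#define MAILBOX_CAP 64

/* One slot standing in for a mailbox in the cluster's shared partition. */
struct mailbox {
    int  full;                /* 1 while a message is pending */
    int  src_cpu, src_task;   /* identifies the sender */
    char data[MAILBOX_CAP];
    int  len;
};

static struct mailbox shared_box;   /* stands in for shared memory */

int vhh_send(int cpu_id, int task_id, const char *msg, int len)
{
    if (shared_box.full || len > MAILBOX_CAP)
        return -1;                  /* slot busy or message too large */
    shared_box.src_cpu  = cpu_id;
    shared_box.src_task = task_id;
    memcpy(shared_box.data, msg, (size_t)len);
    shared_box.len  = len;
    shared_box.full = 1;
    return 0;
}

int vhh_receive(int src_cpu, int src_task, char *msg, int *len)
{
    if (!shared_box.full ||
        shared_box.src_cpu != src_cpu || shared_box.src_task != src_task)
        return -1;                  /* nothing pending from that sender */
    memcpy(msg, shared_box.data, (size_t)shared_box.len);
    *len = shared_box.len;
    shared_box.full = 0;            /* free the slot for the next message */
    return 0;
}
```

In a real hypervisor the slot would be guarded against concurrent access (the paper's reference design uses a semaphore register file for exactly this purpose); that synchronization is omitted here for brevity.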
The second communication level is done among clusters, through the NoC. In our case, we use the HERMES NoC [17] and a MIPS-like processor in each router. We adopted a Network Interface (NI) as a wrapper which connects the NoC router to the processor located at its local port. This interface works in a similar way to the non-virtualized approach, which increases the possibility of using several NoC infrastructures as the underlying architecture. Figure 10 depicts this approach.
Figure 10. VHH Communication Infrastructure with NoC-based Systems
The wrapper is connected to the Plasma through specific read and write memory addresses. A VHH communication driver had to be written to allow the integration between the wrapper and the virtual cluster. Also, the hypercalls provided by the VHH allow a virtual processor to send or receive messages with an extra parameter: the Virtual CPU ID, which identifies the virtual CPU on a specific cluster.
Thus, the hypercalls used for inter-cluster communication are: VHH_SendMessageNoC (cpu_id, virtual_cpu_id, task_id, message, message_length), used to send a message through the NoC, and VHH_ReceiveMessage (source_cpu_id, source_virtual_cpu_id, source_task_id, message, message_length), used to receive it.
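The choice between the two levels can be sketched as a routing decision on CPU identifiers: if the destination virtual CPU sits on the same physical processor, the message stays in shared memory; otherwise it leaves through the NI onto the NoC. The global numbering scheme below is our assumption for illustration:

```c
enum route { ROUTE_SHARED_MEM, ROUTE_NOC };

#define DOMAINS_PER_CLUSTER 4

/* Assumed global numbering: virtual CPU v on physical processor c is
 * c * DOMAINS_PER_CLUSTER + v. */
static int cluster_of(int global_vcpu)
{
    return global_vcpu / DOMAINS_PER_CLUSTER;
}

/* Decide which communication level a message must take. */
enum route route_message(int src_vcpu, int dst_vcpu)
{
    if (cluster_of(src_vcpu) == cluster_of(dst_vcpu))
        return ROUTE_SHARED_MEM;  /* intra-cluster hypercall */
    return ROUTE_NOC;             /* inter-cluster hypercall via the NI */
}
```

This is why the inter-cluster hypercalls carry both a physical CPU ID (which cluster) and a virtual CPU ID (which domain inside it).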
The complete vision of the system is depicted in Figure 11. In the figure, VHH is the Virtual-Hellfire Hypervisor, LM stands for Local Memory, NI for Network Interface and PE for Processing Element. R represents each router of the NoC.
Figure 11. Virtual Cluster-Based MPSoC proposal
5 Use Cases and Experimental Results
In this section we highlight some possible use cases for Cluster-Based MPSoCs and some preliminary prototyping results.
The main use for a Cluster-based MPSoC is the possibility of field specialization. In this case, each cluster is responsible for executing a set of tasks with a common purpose. For instance, it is possible to execute a JPEG decoder in one cluster, an MPEG decoder in another, and so on. Here, the greatest advantage is simplified communication among similar tasks, since they share a given memory area, while still allowing a large number of processors and increasing system scalability through NoC usage. Figure 12 depicts an example of a cluster-based MPSoC with application specialization.
Figure 12. Virtual Cluster-Based MPSoC with Application Specialization

Another possible use of the Virtual Cluster-based MPSoC is when a decrease in area with guaranteed system scalability is needed. Scalability is assured by NoC usage, and the cluster-based MPSoC itself allows an easier use of real-time tasks with no extra communication penalties. Regarding area occupation, we prototyped some possible configurations to illustrate the benefits of our approach in this respect, using the Xilinx Virtex-5 XC5VLX330T FPGA.
First, when using the HellfireOS with a Plasma processor, we usually indicate a processor with at least 16KB of local memory. HellfireOS is a highly optimized kernel and, depending on the application, even such a small memory can fulfill the expected needs. When using the VHH, more memory is required and the total memory size depends especially on the number of virtual domains required. Although greater memory sizes imply more block RAMs, this does not affect the FPGA area measured in LUTs. In all experiments performed, the total system memory could be mapped to block RAMs.
We used three different MPSoC configurations, all with 16 processors (physical or virtual). First, we have a 16-processor MPSoC distributed in a 4x4 NoC where each router carries its own processor, known as the Pure 4x4 NoC approach.
The second MPSoC configuration uses a 2x2 NoC with a bus-based clustering system, known as the Bus Clustered approach. Here, each router has a wrapper to connect it to the cluster bus, and each bus carries four processors.
Finally, the last approach is the Virtual cluster-based one (V-Cluster 2x2 NoC), where a 2x2 NoC is used again and each router contains a single physical processor. This processor runs the VHH, where 4 virtual domains are emulated per cluster, totaling the 16 processors of the MPSoC.
In the first two solutions, each processor has 16KB of local memory. In the last, the virtual cluster approach, 4 processors with 128KB of memory each were employed. Table 1 shows the prototyping results for the three different MPSoCs.
Table 1. Area results for the MPSoC configurations

Configuration              Area occupation (LUTs)
Pure 4x4 NoC               60934
Bus Clustered 2x2 NoC      56099
V-Cluster 2x2 NoC          17179

These results show a decrease in area occupation of up to 70%, depending on the processor local memory configuration and the original MPSoC configuration. Also, depending on the bus structure used for the bus-based clustered version, the bus communication overhead is similar to the virtualization overhead.
6 Concluding Remarks and Future Work
This paper presents a new proposal for MPSoC configuration using virtualization with a cluster-based approach. For validation purposes, we use an extension of the Hellfire Framework and, to incorporate our virtualization methodology, the Virtual-Hellfire Hypervisor (VHH).
We use a HERMES NoC as the underlying architecture, where each processor runs the VHH, forming the processing clusters. We achieved a decrease of up to 70% in FPGA area occupation in our preliminary tests.
As future work, we intend to obtain comparative results for performance and overheads against other approaches. We also want to improve the proposal itself, especially regarding memory and I/O management.
Acknowledgment
The authors acknowledge the support granted by CNPq and FAPESP to the INCT-SEC (National Institute of Science and Technology Embedded Critical Systems Brazil), processes 573963/2008-8 and 08/57870-9. This work is also supported in the scope of the project SRAM by the Research and Projects Financing (FINEP) under Grant 0108031000.
References
[1] A. Aguiar, S. Filho, F. Magalhaes, T. Casagrande, and F. Hessel. Hellfire: A design framework for critical embedded systems' applications. In Quality Electronic Design (ISQED), 2010 11th International Symposium on, pages 730–737, 2010.
[2] A. Aguiar and F. Hessel. Embedded systems' virtualization: The next challenge? In Rapid System Prototyping (RSP), 2010 21st IEEE International Symposium on, pages 1–7, 2010.
[3] A. Aguiar and F. Hessel. Virtual Hellfire Hypervisor: Extending Hellfire framework for embedded virtualization support. In Quality Electronic Design (ISQED), 2011 12th International Symposium on, 2011.
[4] B. Ahmad, A. Ahmadinia, and T. Arslan. Dynamically reconfigurable NoC with bus based interface for ease of integration and reduced design time. IEEE, June 2008.
[5] OpenCores. Plasma - most MIPS I(TM) opcodes. http://www.opencores.org.uk/projects.cgi/web/mips/, accessed September 2009.
[6] Intel Corp. Intel IXP2855 network processor. Available at http://www.intel.com/, 2005.
[7] L.-F. Geng. Prototype design of cluster-based homogeneous multiprocessor system-on-chip. 2009 3rd International Conference on Anti-counterfeiting, Security, and Identification in Communication, pages 311–315, Aug. 2009.
[8] R. P. Goldberg. Survey of virtual machine research. Computer, pages 34–35, 1974.
[9] J. Goodacre and A. N. Sloss. Parallelism and the ARM instruction set architecture. Computer, 38:42–50, July 2005.
[10] G. Heiser. Hypervisors for consumer electronics. pages 1–5, Jan. 2009.
[11] A. Jerraya, H. Tenhunen, and W. Wolf. Multiprocessor systems-on-chips. Computer, 38(7):36–40, July 2005.
[12] X. Jin, Y. Song, and D. Zhang. FPGA prototype design of the computation nodes in a cluster based MPSoC. IEEE, July 2010.
[13] M. Kistler, M. Perrone, and F. Petrini. Cell multiprocessor communication network: Built for speed. IEEE Micro, 26:10–23, May 2006.
[14] Altera Ltd. Nios II processor reference. Available at http://www.altera.com/, 2009.
[15] R. Manevich, I. Walter, I. Cidon, and A. Kolodny. Best of both worlds: A bus enhanced NoC (BENoC). IEEE, Nov. 2010.
[16] G. D. Micheli and L. Benini. Networks on Chips: Technology and Tools (Systems on Silicon). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006.
[17] F. Moraes, N. Calazans, A. Mello, L. Moller, and L. Ost. Hermes: an infrastructure for low area overhead packet-switching networks on chip. Integr. VLSI J., 38(1):69–93, 2004.
[18] T. Noergaard. Embedded Systems Architecture: A Comprehensive Guide for Engineers and Programmers. Newnes, 2005.
[19] G. J. Popek and R. P. Goldberg. Formal requirements for virtualizable third generation architectures. Commun. ACM, 17(7):412–421, 1974.
[20] T. Richardson, C. Nicopoulos, D. Park, V. Narayanan, Y. Xie, C. Das, and V. Degalahal. A hybrid SoC interconnect with dynamic TDMA-based transaction-less buses and on-chip networks. In VLSI Design, 2006. Held jointly with 5th International Conference on Embedded Systems and Design, 19th International Conference on, page 8 pp., 2006.
[21] Wind River. Available at http://www.windriver.com/, accessed 2 Oct. 2010.
[22] XEN.org. Embedded Xen project. Available at http://www.xen.org/community/projects.html, accessed 10 Aug. 2010.
Session 5: Model Based System Design
Rapid Property Specification and Checking for Model-Based Formalisms
Daniel Balasubramanian, Gabor Pap, Harmon Nine, Gabor Karsai
ISIS / Vanderbilt University, Nashville, TN 37212
Email: {daniel.a.balasubramanian, gabor.pap, harmon.s.nine, gabor.karsai}@vanderbilt.edu

Michael Lowry, Corina Pasareanu, Tom Pressburger
NASA Ames Research Center, Moffett Field, CA 94035
Email: {michael.r.lowry, tom.pressburger, corina.s.pasareanu}@nasa.gov
Abstract—In model-based development, verification techniques can be used to check whether an abstract model satisfies a set of properties. Ideally, implementation code generated from these models can also be verified against similar properties. However, the distance between the property specification languages and the implementation makes verifying such generated code difficult. Optimizations and renamings can blur the correspondence between the two, further increasing the difficulty of specifying verification properties on the generated code. This paper describes methods for specifying verification properties on abstract models that are then checked on implementation-level code. These properties are translated by an extended code generator into implementation code and special annotations that are used by a software model checker.
I. INTRODUCTION
Model-based development (MBD) is a software and system design paradigm based on abstractions called models. Domain-specific modeling languages (DSMLs) [1] provide the ability to represent models that are specific to a particular problem domain. Cast in this light, Matlab/Simulink [2] can be viewed as a DSML for physical and embedded systems, as it allows modeling the (dynamics of the) physical plant as well as the behavior of its controller software. Once the model is created, the closed-loop system can be simulated, output traces observed, and the model modified as needed.
Simulation alone, however, cannot provide rigorous guarantees about a model's behavior. In order to prove exhaustively that a model's dynamic behavior always satisfies a set of properties, some sort of verification [3] must be performed. Typical properties include state reachability, deadlock-freedom and a wide range of temporal properties. In recent years, model-level verification tools have been developed that can check models for such properties. While these tools play an important role in MBD and can provide guarantees about a model's behavior, their use is often limited to a small portion of a complex system, i.e., key properties and algorithms.
One of the key goals of MBD is to gradually refine abstract, high-level models until they can be automatically synthesized into an implementation that runs on a non-ideal computational platform. However, one crucial problem is often ignored: how can one verify that the synthesized implementation code satisfies the same properties as the models from which it was generated? Without verifying the implementation, the guarantees provided by checking the abstract models are lost. Checking or proving the correctness of the synthesis (transformation) algorithms is an open problem. Further, if no verification is performed on high-level models, then verifying the implementation is the only way to prove properties about the system.
The major difficulty in verifying model-level properties on implementation-level code lies in the different levels of abstraction. Abstract models are developed by hand and designed with readability in mind, while automatically generated code can be difficult to read. Further, the correspondence between model elements and their generated code is not obvious. Renamings and optimizations make it difficult to understand how a particular model element is represented in the generated code. As a result, knowing where to place properties that are to be verified becomes a challenge.
Another difficulty lies in the mismatch between the input languages of verification tools used at the different levels of abstraction. Individual verification tools typically each use their own input language for defining properties, so that properties checked at the model level must be rewritten in a new syntax to be checked on the implementation-level code. This problem is exacerbated by the fact that code generators typically rename model elements in the generated code, so that, for instance, the names of variables in the generated code are not known at the model level. Without knowing the names of the variables, verification properties certainly cannot be defined.
We present in this paper a method for specifying properties on high-level models that are then used in the verification of the generated, implementation-level code. Properties are written in an intuitive way, directly on the model elements. As the model is translated into various intermediate forms and
Fig. 1. Overview of framework. Verification properties can be specified using observer automata or contracts.
ultimately into executable code, the user-defined properties are preserved and translated into implementation code and annotations that are checked by a software model checker. The translation is performed via a code generator that has been extended to handle the extra information. The results of the verification are then displayed to the user (in terms of the original high-level model). While we focus on Matlab/Simulink, we believe that our method of defining properties on the model level that are checked against a generated implementation can be generalized and leveraged in other MBD tools as well. This approach makes property-based verification an integral part of the development workflow. Note that the framework enables run-time verification in addition to model checking.
The remainder of the paper is organized as follows. Section II gives an overview of our approach and background, including a description of the tool-suite. Section III provides details on how the user annotates Simulink models with properties. Section IV presents an end-to-end example. We compare our approach with related work in Section V and conclude in Section VI.
II. OVERVIEW AND BACKGROUND
An overview of our approach is depicted in Figure 1 and consists of the following steps: (1) a Simulink model is defined, (2) the model is annotated with properties to verify, (3) the code generator is invoked to produce executable code, (4) the software model checker is executed on the code and properties, and (5) results about the verification process are reported.
The first and third steps are described in [4]; this paper concentrates on the other steps. The code generator produces restricted-form Java code and is the same code generator described in [4], but extended with features for generating annotations for verification. The main motivation for this choice of target language was that the software model checker used can work with Java programs. The code generated by our toolchain is completely sequential and does not use dynamic memory (after initialization), hence it is suitable for embedded applications. The code is also object-oriented (an increasing trend in embedded software): subsystems are translated into Java classes that are instantiated at initialization time. Our code generator actually uses a re-targetable back-end, such that either Java or C code can be produced from the same abstract syntax tree.
A. Property annotations

The second step in Figure 1 is annotating the Simulink model with properties to verify on the generated code. Since the development of model checking [5] in the early 1980s, a number of specification languages have been invented to formally define properties. Common ways of specifying these properties include regular expressions and temporal logic, such as LTL and CTL. However, the drawback to using temporal logics for property specification is their steep learning curve for industrial practitioners. Consequently, designers and developers will be less likely to use verification tools if they must devote large amounts of time to learning a specification language.
For this reason, we decided to take two approaches to property specification. The first uses the pattern-based system introduced in [6]. In that work, the authors studied a large body of existing property specifications and found that the majority of them were instances of a small set of parameterizable patterns: reusable solutions to recurring problems.
Patterns are entered into our system using a custom interface that we integrated directly into Simulink. After the parameters have been entered, our interface generates an observer automaton to represent an instance of that pattern. Formulation of assertions as Statechart observer automata has been described in Chapters 4 and 5 of [7]. Because we are in a Simulink context, it is natural to represent observer automata as Stateflow subsystems inserted in the Simulink diagram that implement the logic of the specification described by the pattern. They contain input signals corresponding to the variables and events under observation, and the internal states that implement the logic of the property. The generated observer automata are competitive in size with those coded by hand. Full details are given in Section III.
The second approach to property specification is based on contracts and is similar to the idea of programming by contract [8]. Programming by contract is a methodology for writing programs that use interface specifications on software components to define properties about their behavior. Typically, the specification on a component includes three elements: properties that must hold in order to use the component correctly (preconditions), properties that will hold when the component has finished executing (postconditions), and properties that must always be satisfied (invariants). We applied this idea of contracts to specifying properties for Simulink subsystems. On any subsystem, the user is allowed to write preconditions, postconditions and invariants that must be satisfied by that subsystem. During the code generation phase, the contracts on the various subsystems are translated into annotations on the methods and classes implementing these subsystems in the generated code. A thorough description is given in Section III.
B. Software model checking

Our generated code is verified using Java Pathfinder (JPF) [9], a software model checker for Java. We chose JPF for two reasons. First, our toolsuite was already configured to generate Java code. Second, JPF provides libraries supporting a number of verification features especially useful in our toolsuite: code contracts, monitoring execution for exceptions and numerical problems, as well as symbolic execution.
The code contract feature of JPF permits annotations for preconditions, postconditions and invariants to be written on classes and methods. JPF monitors these conditions at runtime and reports any violations. This feature allows the preconditions, postconditions and invariants that are defined on the Simulink model elements to be translated to the generated code in a straightforward manner by the code generator.
The symbolic execution [10] feature of JPF allows us to perform state reachability analysis and test case generation. The symbolic execution engine runs a program much like a normal execution, but it does not assign concrete values to program input variables. Instead, input variables are left as symbolic values. When input variables are used in a branching condition, a constraint solver attempts to find values for the symbolic variables that will allow both branches of the condition to be taken. This idea is explained further in [10]. In this paper, we do not concentrate on the symbolic execution aspect.
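As a simple illustration of the idea (our example, not taken from the paper's toolchain), consider a method with a single branch. With x symbolic, two path constraints arise, x > 0 and x <= 0, and a constraint solver would produce one concrete witness per path, e.g. x = 1 and x = 0:

```java
// Minimal illustration of what symbolic execution explores.
// Each return statement is reached under a distinct path constraint.
public class BranchExample {
    static int classify(int x) {
        if (x > 0) {
            return 1;   // path constraint: x > 0
        }
        return -1;      // path constraint: x <= 0
    }
}
```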
III. SPECIFICATION PATTERNS AND CONTRACTS
This section gives details on how properties are specified at the model level and then translated into generated code. We first describe the specification patterns, which can be attached to the model using a custom interface or from a supplied library. If the interface is used, a corresponding observer automaton is automatically generated from the specifications. The interface can be used to insert basic properties, but to describe more complex properties, the observer automata can be compositionally defined using the supplied library. We also describe the details of how contracts are written on the model and then translated into annotations on the generated code.
A. Specification patterns
Property specification patterns describe commonly observed requirements in a generalized manner. They capture a particular aspect of a system’s behavior as a sequence of state configurations. Note that the specifications can be state-based or event-based. In the discussion below we mention the state-based form, but the same approach applies to events as well.
To illustrate, consider the property that throughout a system’s execution the value of a certain variable should always be greater than zero. There are two basic parts to this property that commonly occur. The first tells when the property should hold (in this case, at all times during execution), and the second tells what condition should be satisfied during this time (here, the variable should be greater than zero).
A property consists of precisely those two pieces: a scope and a pattern. The scope defines when a particular property should hold during program execution, and the pattern defines the conditions that must be satisfied. There are five basic kinds of scopes: global (the entire execution), before (execution up to a given state), after (execution after a state), between (execution from one state to another) and until (execution from one state to another, even if the second state never occurs).
There are three categories of patterns: occurrence, order and compound. The occurrence group contains the absence (never true), universality (always true), existence (true at least once) and bounded existence (true for a finite number of times) patterns. The order group contains the response (a state must be followed by another state) and the precedence (a state must be preceded by another state) patterns, and the compound group contains the chain precedence and chain response patterns.

Fig. 2. Scope library.
Dwyer et al. [6] have shown how these scopes and patterns can be expressed in LTL, CTL, and other formalisms. However, the property specification patterns can also be easily expressed as parameterized observer automata, which is the approach we take. Note that many specifications can be added to a model, and each one is translated into a separate automaton. Additionally, the definition of a simple interface allows the composition of the scope and pattern aspects of the specification, represented as two distinct automata templates. Furthermore, using the Stateflow language allows the observer automata to be created inside Simulink diagrams. Statechart hierarchy is exploited in some of the examples in Chapters 4 and 5 of [7], and we make use of hierarchy in formulating each scope as a Stateflow diagram that contains a pattern submachine.
The Simulink model extended with the observer automata is then translated into the target language. Hence the generated, ’functional’ code is augmented with the code that implements the observer automata. Now the software model checker can monitor and verify the execution of the entire implementation, paying special attention to the error states and properties specified in the observer automata. As specifications are translated into executable code, the distance between code-level monitoring and software model checking on the one hand and model-level property specifications on the other is reduced.
Figure 2 shows the automata for three of the five scopes. We now briefly describe each of these.
The automaton for the global scope is shown at the top of Figure 2. This scope indicates that a property should hold during the entire system execution. Initially, the state labeled “Pattern” is entered. There are two transitions from this state to the state labeled “Error State”. The first is triggered by an event named “error event”. This event is generated by an enclosed property when that property has been violated. The second transition is triggered by an event named “end event” and a guard condition requiring the boolean value “propertyOK” to be false. The “end event” is generated upon system termination, and the “propertyOK” variable is set to false by the scope’s enclosed property if that property is violated. That is, the second transition is taken if the system terminates and the property enclosed by this scope has been violated.
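In code terms, the global scope's behavior can be sketched as a small monitor class (our illustration; the actual Stateflow-to-code translation differs). The enclosed pattern drives "propertyOK" and may raise the error event; the scope enters its error state either immediately on the error event or when the run ends with the property still unsatisfied:

```java
// Illustrative monitor for the global scope of Figure 2.
public class GlobalScopeMonitor {
    private boolean inErrorState = false;
    boolean propertyOK = true;      // written by the enclosed pattern

    // Transition 1: the enclosed pattern signals a violation immediately.
    void onErrorEvent() {
        inErrorState = true;
    }

    // Transition 2: system termination with the property unsatisfied.
    void onEndEvent() {
        if (!propertyOK) {
            inErrorState = true;
        }
    }

    boolean violated() {
        return inErrorState;
    }
}
```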
The automaton for the before scope is shown in the middle of Figure 2. This scope is used to express that a property should hold before some other condition is met. In the figure, the event named “Before” is used to represent the condition. Initially, the “Pattern” state is entered. If “end event” occurs (the system terminates) and the enclosed property has been violated (“propertyOK” is false), then the first transition is taken and the “Error State” is entered. If the “Before” event occurs and “propertyOK” is false, the second transition is taken and the “Error State” is entered. The state named “Safe State” is only entered if the “Before” event occurs and the enclosed property has not been violated (“propertyOK” is true). Note that, in general, a property is considered satisfied as long as the error state of the property’s scope automaton is not active.
The until scope captures the requirement that some condition should hold from one state to another, even if the second state never occurs. The bottom of Figure 2 shows the automaton for this scope. The two variables named “Before” and “After” are used to represent the two conditions in between which a property should hold. Upon entry, “Initial State” is entered. When the variable “After” becomes true, the transition to the “Pattern” state is taken. While in this state, the automaton waits for the property to be satisfied before the second condition is met. When the property is satisfied, the variable “propertyOK” becomes true. If, before “propertyOK” becomes true, either the “Before” condition becomes true or system execution ends (“end event” occurs), the transition to “Error State” occurs and signals an error to the user. Otherwise, if “propertyOK” is true (the property is satisfied) and the second condition is also satisfied (“Before” is true), the transition back to “Initial State” is taken, and the cycle repeats.
Figure 3 shows the automata for three of the patterns. At the top of the figure is the automaton for the existence pattern. This pattern states that a condition (represented in the automaton by the boolean variable “P1”) should occur during a specified scope. When the “Initial State” is entered, the “propertyOK” variable is set to false, indicating that the property is initially unsatisfied: P1 has not occurred. If “P1” does become true, then the transition to “P1 Encountered” is taken and “propertyOK” is set to true.
A simple pattern, absence, is shown in the middle portion of Figure 3. This pattern states that a condition (represented in the automaton by the boolean variable “P1”) should not occur during a specified scope. When the “Initial State” is entered, the “propertyOK” variable is set to true, indicating that the property is initially satisfied: P1 has not occurred. If “P1” does become true, then the transition to “Error State” is taken, “propertyOK” is set to false and the “error event” is emitted.

Fig. 3. Pattern library.

Fig. 4. Property describing that at some point, x should be greater than 0. Scope states are white and pattern states are shaded.
The automaton for the precedence pattern is at the bottom of Figure 3. This captures the property that some condition (“P2”) must be preceded by another condition (“P1”). Note that in this automaton, the initial state sets the “propertyOK” variable to true: the property is initially satisfied. If “P2” is true before “P1”, that is, the condition denoted by “P2” happens before the condition denoted by “P1” is met, then the transition to “Error State” is taken, “propertyOK” is set to false, and the “error event” is emitted. Otherwise, the overall precedence pattern is satisfied.
Scopes and patterns are combined to form property specifications. Consider the example in Figure 4, which specifies the following property: at some point during system execution, the input variable “x” should be greater than 0. Stated differently, throughout the entire system execution (i.e., global scope), x should be greater than 0 at least once (i.e., existence pattern). To define this property, the existence pattern shown in Figure 3 is inserted into the “Pattern” state of the global scope shown in Figure 2. The only difference is that the generic condition shown as “P1” in the basic existence pattern is replaced with the condition x > 0. Note that the “propertyOK” variable is set by the pattern and its value is used by the scope.
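A hand-written sketch of this composition (our illustration, not the generator's output) makes the division of labor concrete: the existence pattern sets propertyOK once x > 0 has been observed, and the global scope flags an error only if the run ends with propertyOK still false.

```java
// Illustrative composition of the global scope with the existence
// pattern for "x > 0 at least once". The pattern writes propertyOK;
// the scope reads it on the end event.
public class ExistenceGlobalMonitor {
    private boolean propertyOK = false;  // existence: initially unsatisfied
    private boolean error = false;

    // Pattern logic: observe one sample of x per execution step.
    void observe(double x) {
        if (x > 0) {
            propertyOK = true;           // "P1 Encountered"
        }
    }

    // Scope logic: on end_event, check the enclosed pattern's verdict.
    void onEndEvent() {
        if (!propertyOK) {
            error = true;
        }
    }

    boolean violated() {
        return error;
    }
}
```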
Fig. 5. Contract example.
Additionally, we developed a dedicated user interface that uses dialog forms for inputting property specifications. The dialogs capture both the kind of scope and pattern, as well as the parameters needed to instantiate and compose them. The user picks the scope and the pattern and enters the appropriate conditions. An automaton that composes an instance of both the scope and the pattern is then automatically generated. An example using these dialog forms is described in Section IV.
B. Contracts
The second method we use for describing verification properties is based on contracts. We extended Simulink with a custom interface that allows the user to annotate any subsystem with three additional items.
• Preconditions that the input signals to the subsystem must satisfy.
• Postconditions that the output signals of the subsystem must satisfy.
• Invariants that must always be satisfied by the subsystem.

Note that a subsystem translates into an executable function that is called periodically by some scheduler. Hence, the above conditions and invariants can be checked during execution of that function block.
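The runtime view of such a check can be sketched with plain assertions around a function block's execution (our illustration with a hypothetical block; the actual toolchain instead emits annotations that JPF monitors, as shown later in Listing 1):

```java
// Illustrative runtime contract checking for a periodically scheduled
// function block. The block itself is hypothetical: it clamps its
// input to the range [0, 100].
public class SaturatorBlock {
    static int step(int input) {
        // Precondition on the block's input signal.
        if (!(input >= -1000 && input <= 1000)) {
            throw new IllegalArgumentException("precondition violated");
        }
        int output = Math.max(0, Math.min(100, input));
        // Postcondition on the block's output signal.
        if (!(output >= 0 && output <= 100)) {
            throw new IllegalStateException("postcondition violated");
        }
        return output;
    }
}
```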
Figure 5 shows an example of specifying contracts on a subsystem block. The internal details of the subsystem are not important; rather, they serve to show how our approach allows the complexities of certain elements to be ignored when writing specifications. The subsystem in Figure 5 has two inputs, x and y, and one output, z. Suppose we wish to check the following property: either x is equal to 0 and y is between 0 and 10, or x is equal to 1 and y is between 10 and 20. Suppose we also wish to check that if x is 0, then the output z is greater than 0, and if x is 1, then the output z is less than 0. These requirements are attached to the subsystem using the dialog box shown at the top of Figure 5.
The contracts are added to the subsystem model as specially formatted descriptions (which are usually just unstructured text), using an XML-like syntax. The code generator parses these descriptions and, if they are syntactically correct, constructs the properly formatted strings (with variable names rewritten into their ’code’ equivalents) that are suitable for the software model checker.
A Java implementation of the subsystem in Figure 5 that is very similar to the code produced by our code generator is shown in Listing 1. Note that in the contract, the inputs and outputs of the subsystem are referred to by their names in the model. This is an important part of our approach: the user always refers to the model elements as they are written in the model. No knowledge of the code generation process is needed to write specifications. The contract specified in the model is generated in the Java code as annotations that automatically reference the correct variable names. These annotations are used by the software model checker to monitor the code execution.
Listing 1. Java implementation of the subsystem in Figure 5.

public class Subsystem15 {
  private int value1 = 0;
  private int value2 = 0;

  @Requires("(x13 == 0 && y25 > 0 && y25 < 10) ||
             (x13 == 1 && y25 > 10 && y25 < 20)")
  @Ensures("(x13 == 0 && z65 > 0) ||
            (x13 == 1 && z65 < 0)")
  public void Main23(int x13, int y25, int[] z65) {
    value1 = x13;
    value2 = y25;
    ... // Code implementing subsystem logic
  }
}
IV. EXAMPLE
This section shows how our framework can be applied to realistic models. The example we use is the Apollo Lunar Module digital autopilot model, which is included with the Matlab/Simulink distribution as an example. The full model includes a dynamic model of the plant (the Apollo Lunar Module) as well as a model of the Reaction Jet Controller (RJC); we focused on the embedded controller. A very high-level view is shown in Figure 6. The RJC receives attitude measurements and desired attitude values, and generates control signals to activate the yaw, pitch and roll thrusters.
A. Step 1: Define Property
The “Yaw Jets” output of the RJC block is a value from the set {-2, 0, 2}, indicating that the yaw thruster should have a negative thrust, no thrust or a positive thrust, respectively. Suppose we wish to verify the property that the “Yaw Jets” output can never go directly from -2 to 2 or directly from 2 to -2: at least one output of 0 must always occur in between. Section III showed how a property like this could be built manually using automata. Using the scope and pattern automata as building blocks, one could define this property directly in Stateflow.
As mentioned above, we have also developed a custom extension to the Simulink environment that allows properties to be entered in an easier way using dialog forms. These dialogs decompose the patterns detailed in Section III-A: the user selects a pattern, enters a scope and a property, and the equivalent automaton is generated, including input ports. Our first task is to decide which pattern we need to implement the property that the “Yaw Jets” value can never go directly from -2 to 2 or directly from 2 to -2. Part of the property states that we do not want the value of “Yaw Jets” to be -2 during a certain scope. The absence pattern fits this requirement, as it checks that some condition never occurs.
The dialog form for the absence pattern is shown in Figure 7. This dialog guides the user through the process of defining a property. After defining the condition that should never hold (Command == -2), we define the scope during which this check applies. In this example, we never want Command to go directly from 2 to -2, so the condition that Command should never be -2 must hold after Command is equal to 2 and before Command is equal to 0. The property that Command should never go directly from -2 to 2 is defined in an analogous way using the absence pattern dialog.
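The check just described can be sketched in plain Java (an illustration of the monitored logic, not the generated Stateflow automaton): the monitor is "in scope" after seeing Command == 2 and until seeing Command == 0, and flags a violation if -2 is observed while in scope.

```java
// Illustrative monitor for: Command must never go directly from 2 to -2
// (absence of Command == -2, after Command == 2 and before Command == 0).
public class YawJetsMonitor {
    private boolean inScope = false;
    private boolean violated = false;

    void observe(int command) {
        if (inScope && command == -2) {
            violated = true;          // absence pattern failed in scope
        }
        if (command == 2) {
            inScope = true;           // scope opens after Command == 2
        } else if (command == 0) {
            inScope = false;          // scope closes on Command == 0
        }
    }

    boolean isViolated() {
        return violated;
    }
}
```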
B. Step 2: Connect generated automata
After entering the parameters in the dialog form, the observer automaton monitoring the property is generated, as shown in Figure 8. The states representing the scope portion of the property are white, and the states representing the pattern are shaded. The transition from the initial state is taken when Command is 2, at which point we are “in scope” and want to verify the absence of the condition that Command is -2 before it is 0. If the value of Command is -2 before it is 0, the transition to the inner error state is taken, which sets the “propertyOK” variable to false and emits the “error event”. When “error event” is emitted, the outer transition to the error state is taken and the automaton remains in this state. Note that while the automaton is in scope, system termination (the “end event”) will not cause the property to be violated as long as Command has not been set to -2. The input port for Command is automatically generated, so the user must connect the “Yaw Jets” signal to the automaton so that it can be monitored. In Figure 6, the “Command Constraint” and “Command Constraint2” automata have already been connected to the “Yaw Jets” signal.
C. Step 3: Verification with JPF
The final step is to invoke the code generator and use JPF to verify our properties. There are two ways JPF can check the code for property violations. The first uses concrete inputs provided by the user. If this is done, JPF will perform a concrete system execution using those inputs and report any property violations in the form of stack traces. The second way uses the symbolic execution module. In this case, JPF will try to determine inputs to the system that will cause properties to be violated. With either method, property violations are reported to the user in the form of a stack trace showing the sequence of method invocations that led to an error state.

Fig. 6. High-level view of the Apollo Autopilot. The Command Constraint automaton was automatically generated using the property defined in Figure 7. The second automaton was also automatically generated.

Fig. 7. Property dialog. The property says that after the input variable “Command” becomes 2, it should never be equal to -2 before returning to 0.
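For readers unfamiliar with JPF, a run is typically driven by a small application properties file. The sketch below is our illustration only (key names as we understand them from the JPF and jpf-symbc documentation; file names and paths are hypothetical, not taken from the paper):

```properties
# Hypothetical application properties file (Harness.jpf).
# "target" names the class whose main() JPF model-checks.
target = Harness
classpath = build/generated

# With the symbolic execution extension (jpf-symbc), selected method
# arguments can be marked symbolic instead of supplying concrete inputs,
# e.g. (names illustrative):
# symbolic.method = RJC.step(sym#sym#sym)
```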
V. RELATED WORK
In more traditional forms of software development, verification is done in one of two ways: either an abstract model of the software is created and verified, or the executable code itself is verified. [11] discusses the ongoing trend towards placing verification efforts directly on the executable code rather than on models. In MBD, however, one intentionally begins with models and gradually refines them until they are synthesized into executable code, and ideally both artifacts can be verified. Our approach eases the burden of both specifying and checking properties on code generated during the MBD process.
A number of tools are available for verifying Simulink/Stateflow models. Simulink Design Verifier [12] and Reactis [13] are commercial tools for checking model properties. [14] describes an approach based on hybrid automata: models are translated from Simulink into a hybrid automata formalism, and existing techniques for checking hybrid automata can then be applied. Our approach is complementary to these methods and ensures that the properties proved by these tools also hold for the generated code.

Fig. 8. Generated observer automaton implementing the property specified in Figure 7. Scope states are white and pattern states are shaded.
Our approach to specifying properties through patterns is based on the work of Dwyer et al. in [6]. The pattern library described there contains a general description along with mappings into multiple formalisms, including LTL, CTL and quantified regular expressions. Our implementation uses dialog forms to choose and configure simple patterns from which observer automata are generated, and includes a library of observer automata for individual scopes and patterns from which more complex patterns can be defined.
Runtime monitoring [15] is a related area in which formally specified properties are typically translated into executable code that is used to check program properties during program execution. Recent work in this area includes optimizing such monitors through static analysis techniques [16]. Our approach translates properties specified using observer automata into executable code that is checked by a software model checker, and translates contracts on model elements into annotations that are used by the model checker.
VI. CONCLUSION
Checking model-level properties on implementation code is a useful approach for practical model-driven development. In this paper, we have shown how relevant properties can be specified at the model level and then translated into implementation code that can be verified with a software model checker. Our approach is a pragmatic realization of the work described in [6] in the context of the Simulink/Stateflow environment. We have shown how the specification patterns can be instantiated from observer automata templates for scopes and patterns, and how subsystem blocks can be annotated with preconditions, postconditions and invariants that are monitored by the software model checker. We have demonstrated the use of the approach on a realistic example.
Our approach allows two ways of specification: contracts, and property specifications based on patterns (which are translated into observer automata). For designers of embedded systems, two extensions would be very useful: (1) specifying real-time properties, and (2) dealing with concurrency. Translated Simulink subsystems are typically executed periodically, with a fixed rate. Timing properties can relate to a single execution run (e.g., the worst-case execution time of a function block) as well as to the temporal behavior of the system over multiple execution runs (e.g., the system reacts to a triggering event within a bounded number of execution runs). Translated Simulink subsystems are also completely sequential; they are usually translated to functions in an implementation language. In order to run them on an execution platform, they have to be embedded into OS processes, and their communication and synchronization implemented outside of Simulink. Hence, we need to model these embeddings and how the threads containing the function blocks communicate and synchronize. These topics are the subject of ongoing research.
VII. ACKNOWLEDGMENTS
The work described in this paper has been supported by NASA under Cooperative Agreement NNX09AV58A. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Aeronautics and Space Administration. The authors would also like to thank Michael Whalen for valuable discussions and feedback.
REFERENCES
[1] A. Ledeczi, A. Bakay, M. Maroti, P. Volgyesi, G. Nordstrom, J. Sprinkle, and G. Karsai, “Composing domain-specific design environments,” IEEE Computer, vol. 34, no. 11, pp. 44–51, 2001.
[2] MATLAB, version 7.10.0 (R2010a). Natick, Massachusetts: The MathWorks Inc., 2010.
[3] G. J. Holzmann and R. Joshi, “Model-driven software verification,” in SPIN, 2004, pp. 76–91.
[4] J. Porter, P. Volgyesi, N. Kottenstette, H. Nine, G. Karsai, and J. Sztipanovits, “An experimental model-based rapid prototyping environment for high-confidence embedded software,” in IEEE International Workshop on Rapid System Prototyping, 2009, pp. 3–10.
[5] E. M. Clarke, “The birth of model checking,” in 25 Years of Model Checking, 2008, pp. 1–26.
[6] M. B. Dwyer, G. S. Avrunin, and J. C. Corbett, “Patterns in property specifications for finite-state verification,” in ICSE, 1999, pp. 411–420.
[7] D. Drusinsky, Modeling and Verification Using UML Statecharts - A Working Guide to Reactive System Design, Runtime Monitoring and Execution-Based Model Checking. Elsevier, 2006.
[8] B. Meyer, Object-Oriented Software Construction, 1st edition. Prentice-Hall, 1988.
[9] W. Visser, K. Havelund, G. P. Brat, S. Park, and F. Lerda, “Model checking programs,” Automated Software Engineering (ASE), vol. 10, no. 2, pp. 203–232, 2003.
[10] J. C. King, “Symbolic execution and program testing,” Commun. ACM, vol. 19, no. 7, pp. 385–394, 1976.
[11] G. J. Holzmann, “Trends in software verification,” in FME, 2003, pp. 40–50.
[12] “Mathworks Inc. Simulink Design Verifier,” http://www.mathworks.com/products/sldesignverifier/.
[13] “Reactive Systems, Inc.” http://www.reactive-systems.com/.
[14] R. Alur, A. Kanade, S. Ramesh, and K. C. Shashidhar, “Symbolic analysis for improving simulation coverage of Simulink/Stateflow models,” in Proceedings of the 8th ACM International Conference on Embedded Software, ser. EMSOFT ’08. New York, NY, USA: ACM, 2008, pp. 89–98.
[15] S. Sankar and M. Mandal, “Concurrent runtime monitoring of formally specified programs,” IEEE Computer, vol. 26, no. 3, pp. 32–41, 1993.
[16] E. Bodden, L. J. Hendren, and O. Lhotak, “A staged static program analysis to improve the performance of runtime monitoring,” in ECOOP, 2007, pp. 525–549.
Automatic Generation of System-Level Virtual Prototypes from Streaming Application Models

Philipp Kutzer, Jens Gladigau, Christian Haubelt, and Jürgen Teich
Hardware/Software Co-Design, Department of Computer Science
University of Erlangen-Nuremberg, Germany
Email: {philipp.kutzer, jens.gladigau, haubelt, teich}@cs.fau.de
Abstract—Virtual prototyping is a more and more accepted technology to enable early software development in the design flow of embedded systems. Since virtual prototypes are typically constructed manually, their value during design space exploration is limited. On the other hand, system synthesis approaches often start from abstract and executable models, allowing for fast design space exploration, considering only predefined design decisions. Usually, the output of these approaches is an "ad hoc" implementation, which is hard to reuse in further refinement steps. In this paper, we propose a methodology for automatic generation of heterogeneous MPSoC virtual prototypes starting with models for streaming applications. The advantage of the proposed approach lies in the fact that it is open to subsequent design steps. The applicability of the proposed approach to real-world applications is demonstrated using a Motion JPEG decoder application that is automatically refined into several virtual prototypes within seconds, correct by construction, instead of using error-prone manual refinement, which typically requires several days.
I. INTRODUCTION
Today, modern Multi-Processor System-on-Chip (MPSoC) architectures consist of a mixture of microprocessors, digital signal processors (DSPs), memory subsystems, and hardware accelerators, as well as interconnect components. It is noticeable that the adoption of programmable logic in such electronic systems is steadily increasing. Driven by this rise, the process of software development becomes the dominating part of system design. In the course of software development, software engineers have to cope with operating systems, communication stacks, drivers, and so forth. In order to allow early software development, virtual prototyping is a more and more frequently used technology in Electronic System Level (ESL) design. There, the desired target platform is modeled as an abstract, executable, and often completely functional software model. Hence, the virtual prototype includes all functional properties of the target platform, while non-functional properties, such as timing behavior, are mostly disregarded.
In contrast to FPGA-based prototyping, virtual prototypesare deployed before architectural models on register-transfer-level are available. Due to this early availability, the overalltime spent on hardware and software design can be reduced,
Supported in part by the German Science Foundation(DFG Project HA 4463/3-1)
Source
Sink
c1
c8
Parser
MComp
c7 c6
c2
c5
Recon
IDCT
c3 c4
CPU HWMemory
Bus
Fig. 1. Application model of a Motion JPEG decoder, clustered and mappedto an architecture template. The architecture template consists of a CPU, ahardware accelerator (HW) and an external memory. All the components areconnected via a bus.
because software can be implemented, refined, tested, debugged, and verified on realistic hardware models in parallel to the hardware design process. Nevertheless, additional time is needed to implement such prototypes from the functional and desired architectural system specification. This drawback can be avoided by automatic virtual prototype generation, which further speeds up the design process and, in addition, avoids the errors often made in manual prototype construction.
Describing a complex application abstractly as an actor-oriented model [1] is an increasingly accepted approach in ESL design. Such models describe the functional behavior of the application. They consist of concurrently executing actors, which communicate over abstract channels. In our approach, communication takes place via channels with FIFO semantics. Fig. 1 shows a small actor-oriented model of a Motion JPEG decoder, which consists of the actors Source, Parser, Reconstruction (Recon), Inverse Discrete Cosine Transformation (IDCT), Motion Compensation (MComp), and Sink, as well as FIFO channels c1 to c8. In order to generate a virtual prototype from an actor-oriented model, additional information about the system architecture candidates and the
978-1-4577-0660-8/11/$26.00 c© 2011 IEEE
mapping possibilities of the functional components has to be specified. In the lower part of Fig. 1, a possible mapping to an architecture template is indicated by the dotted arrows.
In the following, we present a method for the automatic generation of MPSoC virtual prototypes from actor-oriented models. Our approach performs the virtual prototype generation in two steps: (i) Based on a given resource mapping, communication within the application model is refined to transactions in the virtual prototype, and controllers for intra-resource communication are generated. (ii) The virtual prototype is generated by assembling cycle-accurate processor models, memory models, and models for hardware accelerators using bus models, and by synthesizing the software for each processor according to the given mapping.
The remainder of this paper is structured as follows: Section II reviews related work. Section III gives a brief overview of our approach. Section IV describes application modeling. Section V discusses the automatic generation of architectural TLM models in more detail, and Section VI describes the architectural refinement. Section VII presents experimental results from applying the proposed prototype generation approach to a Motion JPEG decoder, a multimedia streaming application mapped onto an MPSoC architecture. Finally, conclusions are given in Section VIII.
II. RELATED WORK
As virtual prototypes are nowadays commonly used in system-level design flows, several commercial as well as freely available tools exist to build, simulate, and evaluate such prototypes. Most prominent are Platform Architect from CoWare, CoMET from VaST, and OVPsim [2] from Imperas. The first two tools were acquired by Synopsys [3] within the last year. Most existing virtual prototyping tools support the integration of transaction-level models written in SystemC [4] into the prototypes. However, none of them allows the automatic transformation of a formal description, such as an actor-oriented model, into a virtual prototype.
In general, mapping formal models to MPSoCs is a current research topic in system synthesis (e.g., see [5], [6]). Several system-level synthesis tools exist that automatically map formally described applications to an MPSoC target, such as Daedalus [7], Koski [8], and SystemCoDesigner [9]. All these approaches pursue a common goal: final product generation. This means they have to cover the complete design flow, from a high-level application specification down to the running system. As a consequence, their integration into existing design flows is hard to establish.
In contrast to system synthesis tools, our approach targets automatic virtual prototype generation. In this scenario, important design decisions are reflected in the generated prototype, while support for further manual refinement is retained. Hence, the product quality can still be influenced by a designer and, even more important, our proposed approach
[Figure 2: Application Model and Architectural Template feed into TLM Generation, producing an Architectural Model (TLM); Prototype Generation then yields a System-Level Virtual Prototype, followed by Software Refinement. TLM Generation and Prototype Generation together form the automatic 2-step prototype generation.]
Fig. 2. Design flow from an application model, represented by an abstract executable specification, to a virtual prototype. The flow includes automatic mapping of actor-oriented models to TLM architecture models, as well as virtual prototype generation.
can easily be integrated into established industrial design flows.
III. VIRTUAL PROTOTYPE GENERATION - OVERVIEW
The goal of our system-level design approach is to automatically implement abstract system descriptions written in SystemC as virtual MPSoC prototypes. The associated design flow is depicted in Fig. 2.
At the beginning of our ESL design process, an abstract model has to be derived for the desired application. In our approach, a distinction is drawn between the application model, which describes the functional behavior of the system, and the architecture template, which represents all architecture instances of the system.
The system behavior is modeled in the form of actor-oriented models, which consist only of actors and channels, as depicted in the Motion JPEG example from Fig. 1. Actors are the communicating entities, which are executed concurrently. For communication, tokens are produced and consumed by actors and transmitted via dedicated channels.
The architecture template of the system is represented by a heterogeneous MPSoC platform, which is specified by connected cores. Single actors or clusters of actors can be mapped either onto processor elements (CPU) or onto dedicated hardware accelerators (HW), as depicted in Fig. 1. Hardware accelerators are typically used for computationally intensive or time-critical parts of the application. In general, Systems-on-Chip include both processor elements and hardware accelerators. Depending on the actor mapping,
communication channels can be mapped either to the internal memory of data processing units (CPUs or HW accelerators) or to shared memory modules. In the Motion JPEG decoder example, all channels except c1 and c8 are mapped to the hardware accelerator, as their communication takes place internally. Channels c1 and c8 represent the communication between the CPU and the dedicated accelerator, and hence have to be mapped to the shared memory.
After modeling the application and the architecture template, and after defining a mapping of functional to structural elements, an architectural model is automatically generated. In this intermediate model, the actors are clustered according to their mapping onto architectural resources. Since virtual prototypes are usually implemented using transaction-level modeling (TLM), we use the OSCI TLM-2.0 [10] standard in our design flow.
For virtual prototyping, parts of the architectural model are subsequently replaced by the corresponding resources from a virtual component library, which consists of cycle-accurate processor models as well as models of communication entities. Besides the architectural refinement, software is generated and cross-compiled for each CPU to match its instruction set architecture (ISA).
The resulting virtual prototype can then be used for further software and communication refinement. Moreover, due to the cycle-accurate processor models, performance estimation becomes possible. The steps of architectural mapping and prototype generation are described later in more detail. First, our application modeling approach is described.
IV. APPLICATION MODEL
This section introduces our concept of actor-oriented modeling, which is necessary to understand our proposed mapping approach. In actor-oriented models, actors are potentially executed concurrently and communicate over dedicated abstract channels. Thereby, they produce and consume data (so-called tokens), which are transmitted over those channels. These models may be represented as bipartite graphs consisting of channels c ∈ C and actors a ∈ A. In the following, we use the term network graph for this kind of representation.
Definition 1 (network graph): A network graph is a directed bipartite graph Gn = (A, C, P, E), containing a set of actors A, a set of channels C, and a channel parameter function P : C → N∞ × V* that associates with each channel c ∈ C its buffer size n ∈ N∞ = {1, 2, 3, ..., ∞} and a possibly empty sequence v ∈ V* of initial tokens, where V* denotes the set of all possible finite sequences of tokens v ∈ V. Additionally, the network graph contains directed edges e ∈ E ⊆ (C × A.I) ∪ (A.O × C) between actor output ports o ∈ A.O and channels, as well as between channels and actor input ports i ∈ A.I. An example of a network graph is given in the upper part of Fig. 1.
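Definition 1 maps almost directly onto a data structure. The following plain C++ sketch is illustrative only: the type and function names, the buffer sizes, and the simplified Source-Parser-Sink topology are our assumptions, not SysteMoC code. It encodes actors, channels with buffer sizes and initial tokens, and the bipartite edge relation for a fragment of the Motion JPEG model:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Illustrative network-graph representation (Definition 1).
struct Channel {
    std::string name;
    std::size_t bufferSize;            // n in N_inf
    std::vector<double> initialTokens; // v in V*
};

struct Actor {
    std::string name;
    std::vector<std::string> inputs;  // channels feeding input ports A.I
    std::vector<std::string> outputs; // channels fed by output ports A.O
};

struct NetworkGraph {
    std::map<std::string, Actor> actors;     // A
    std::map<std::string, Channel> channels; // C

    // Directed edge (o in A.O) -> channel
    void connectOut(const std::string& a, const std::string& c) {
        actors[a].outputs.push_back(c);
    }
    // Directed edge channel -> (i in A.I)
    void connectIn(const std::string& c, const std::string& a) {
        actors[a].inputs.push_back(c);
    }
};

// Build a simplified fragment of the Fig. 1 network graph
// (buffer sizes chosen arbitrarily; Parser feeds Sink directly here).
NetworkGraph buildMJpegFragment() {
    NetworkGraph g;
    for (const char* a : {"Source", "Parser", "Sink"})
        g.actors[a] = Actor{a, {}, {}};
    g.channels["c1"] = Channel{"c1", 16, {}};
    g.channels["c8"] = Channel{"c8", 16, {}};
    g.connectOut("Source", "c1");
    g.connectIn("c1", "Parser");
    g.connectOut("Parser", "c8");
    g.connectIn("c8", "Sink");
    return g;
}
```

A clustered network graph (Definition 3) would add a tree over the keys of `actors` and `channels`.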
[Figure 3: actor Sorter with input port i1, output ports o1 and o2; guard gcheck (return i1[0] >= 0), actions fpositive (o1[0] = i1[0]) and fnegative (o2[0] = i1[0]); a single FSM state start with self-loop transitions i1(1)&gcheck&o1(1)/fpositive and i1(1)&¬gcheck&o2(1)/fnegative]
Fig. 3. Visual representation of an actor that sorts input data according to its algebraic sign. The actor consists of one input port i1 and two output ports o1 and o2.
Definition 2 (channel): A channel is a tuple c = (I, O, n, d) containing channel ports partitioned into a set of channel input ports I and a set of channel output ports O, its buffer size n ∈ N∞ = {1, 2, 3, ..., ∞}, and a possibly empty sequence d ∈ D* of initial tokens, where D* denotes the set of all possible finite sequences of tokens d ∈ D. In the following, we use SysteMoC [11], a SystemC [4] based library for modeling and simulating actor-oriented models. In the basic SysteMoC model, each channel is a unidirectional point-to-point connection between an actor output port and an actor input port, i.e., |c.I| = |c.O| = 1. The communication between actors is restricted to these abstract channels, i.e., actors are only permitted to communicate with each other via channels to which they are connected by ports.
In a SysteMoC actor, the communication behavior is separated from the functionality. The communication behavior is defined as a finite state machine (FSM); the functionality is a collection of functions that can access data on channels via ports. These functions are classified into actions and guards and are driven by the FSM. SysteMoC thus follows the FunState [12] (Functions driven by State machines) approach.
An action of an actor can access data on all channels to which the actor is connected and may manipulate the internal state of the actor, implemented by internal variables. In contrast, a guard function is only allowed to query, but not to alter, the internal state and the data on channels. A graphical representation of a SysteMoC actor is given in Fig. 3. The actor Sorter, which sorts input data tokens according to their algebraic sign, possesses one input port (i1) and two output ports (o1 and o2). Tokens from input port i1 are forwarded to output port o1 by the function fpositive if the activation pattern i1(1)&gcheck&o1(1) of the state transition from the state start back to the state start evaluates to true. This pattern determines under which conditions the transition may be taken. In SysteMoC, an activation pattern can depend on
class Sorter : public smoc_actor {
public:
  smoc_port_in<double>  i1;
  smoc_port_out<double> o1;
  smoc_port_out<double> o2;
  smoc_firing_state start;
  Sorter(sc_module_name name) : smoc_actor(name, start) {
    start =
        (i1(1) && GUARD(check) && o1(1)) >>
        CALL(positive) >> start
      |
        (i1(1) && !GUARD(check) && o2(1)) >>
        CALL(negative) >> start;
  }
private:
  bool check(void) const {
    return i1[0] >= 0;
  }
  void positive(void) {
    double in = i1[0];
    o1[0] = in;
  }
  void negative(void) {
    double in = i1[0];
    o2[0] = in;
  }
};
Listing 1. SysteMoC code for the actor Sorter. The FSM of the actor is defined in the constructor of the actor class, whereas the functionality is encoded as private member functions.
the internal state of the actor, on the availability and values of tokens on input channels, and on the availability of free space on output channels. In our example, the state transition is taken if at least one token is available on input port i1 (i1(1)), the guard gcheck evaluates to true (the data on the input channel has a positive algebraic sign), and output port o1 has space for at least one additional token (o1(1)). Analogously, the second transition is taken if the input data is negative. The corresponding SysteMoC code is given in Listing 1.
To summarize, the transition-based execution of SysteMoC actors can be divided into four steps: (i) evaluation of all activation patterns k of all outgoing state transitions in the current state qc ∈ Q; (ii) non-deterministic selection and taking of one activated transition t ∈ T; (iii) execution of the corresponding action f ∈ a.F; (iv) notification of token consumption/production on channels connected to the corresponding actor input and output ports after completion of the action, as well as transition to the next state.
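These four steps can be mimicked in a few lines of plain C++. The sketch below is a deliberately simplified stand-in for the SysteMoC runtime, not its real API (all names are ours): activation patterns become plain predicates, and one `step()` evaluates, selects, fires, and advances the state.

```cpp
#include <cassert>
#include <cstdlib>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Bounded FIFO channel with availability checks (simplified).
struct Fifo {
    std::deque<double> tokens;
    std::size_t capacity = 4;
    bool canRead(std::size_t n) const { return tokens.size() >= n; }
    bool canWrite(std::size_t n) const { return tokens.size() + n <= capacity; }
};

struct Transition {
    std::function<bool()> pattern; // step (i): activation pattern k
    std::function<void()> action;  // step (iii): action f
    std::string next;              // step (iv): next state
};

struct Actor {
    std::string state;
    std::vector<std::pair<std::string, Transition>> fsm; // (state, transition)

    // One execution step: evaluate, select, fire, advance.
    bool step() {
        std::vector<const Transition*> enabled;
        for (auto& t : fsm)                               // (i) evaluate
            if (t.first == state && t.second.pattern())
                enabled.push_back(&t.second);
        if (enabled.empty()) return false;
        const Transition* t =
            enabled[std::rand() % enabled.size()];        // (ii) select
        t->action();                                      // (iii) execute
        state = t->next;                                  // (iv) advance
        return true;
    }
};

// Demo: a Sorter-like actor routing tokens by algebraic sign (cf. Fig. 3).
std::pair<std::vector<double>, std::vector<double>>
sortDemo(const std::vector<double>& in) {
    Fifo i1, o1, o2;
    i1.capacity = o1.capacity = o2.capacity = in.size() + 1;
    i1.tokens.assign(in.begin(), in.end());

    Actor sorter{"start", {}};
    sorter.fsm.push_back({"start", Transition{
        [&]{ return i1.canRead(1) && i1.tokens.front() >= 0 && o1.canWrite(1); },
        [&]{ o1.tokens.push_back(i1.tokens.front()); i1.tokens.pop_front(); },
        "start"}});
    sorter.fsm.push_back({"start", Transition{
        [&]{ return i1.canRead(1) && i1.tokens.front() < 0 && o2.canWrite(1); },
        [&]{ o2.tokens.push_back(i1.tokens.front()); i1.tokens.pop_front(); },
        "start"}});

    while (sorter.step()) {}
    return {{o1.tokens.begin(), o1.tokens.end()},
            {o2.tokens.begin(), o2.tokens.end()}};
}
```

For the Sorter, at most one transition is enabled per step, so the non-deterministic selection in step (ii) does not affect the result.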
During system synthesis from actor-oriented models, the actors a ∈ A and the communication channels c ∈ C are mapped to components of a system architecture. To reflect the architectural structure in network graphs after mapping, nodes can be clustered. To represent clustering, we define a clustered network graph.
Definition 3 (clustered network graph): A clustered network graph Gcn = (Gn, T) consists of a network graph Gn and a rooted tree T such that the leaves of T are exactly the vertices of Gn. Each node x of T represents a cluster X(x) of the vertices of the network graph Gn that are leaves of the subtree rooted at x.

[Figure 4: rooted tree with root x4 and child nodes x1, x2, x3]

Fig. 4. Clustered network graph of the Motion JPEG example. The cluster X(x1) represents the CPU, X(x2) the communication bus, and X(x3) the hardware accelerator. Cluster X(x4) represents the whole system.

The representation as a tree illustrates the hierarchical structure of the system. This means the root of T represents the whole system, whereas nodes x ∈ T with height(x) = 1 represent the components of the system. As reuse of parts of models is common in the design process, hierarchical structures with more than two levels are possible. The clustered network graph of the example from Fig. 1 is depicted in Fig. 4.
Although we use SysteMoC, our approach is not restricted to this framework and can be adapted to other frameworks for actor-oriented design, e.g., pure SystemC FIFO channel communication. A deeper insight into SysteMoC is given in [11].
V. GENERATING THE TLM ARCHITECTURE
Transaction-level modeling (TLM) with SystemC has emerged as the de facto industry standard for virtual prototyping and architectural modeling [13], [14]. These models are characterized by an encapsulation of low-level communication details. Due to this abstraction, very fast simulation speed can be achieved: details of bus-based communication protocol signaling are replaced with single transactions. In the course of releasing a TLM standard (OSCI TLM-2.0) to enforce interoperability of models, the Open SystemC Initiative defined two coding styles [15]: the loosely-timed (LT) and the approximately-timed (AT) coding style. The loosely-timed coding style allows only two timing points to be associated with each transaction, namely its start and its end. This timing granularity of communication is sufficient for software development using a virtual prototype model of an MPSoC. A transaction in an approximately-timed model is broken down into multiple phases, with timing points marking the transition between two consecutive phases. Due to the finer timing granularity, approximately-timed models are typically used in architectural exploration and performance analysis. As our approach targets software development, or more precisely the refinement of
parts of the application in software by means of virtual prototyping, the loosely-timed coding style is adequate [15].
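For illustration, the loosely-timed style can be reduced to its essence: a single blocking call per transaction whose delay argument accumulates the target's latency between the two timing points. The sketch below deliberately avoids the real SystemC/TLM-2.0 types; `Payload` and `Memory` are our simplified stand-ins for `tlm_generic_payload` and a `b_transport` target, and the 10 ns latency is an arbitrary assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Simplified mimic of a loosely-timed TLM-2.0 transaction (not the
// real OSCI API): the whole transfer happens in one blocking call.
struct Payload {
    enum Cmd { READ, WRITE } cmd;
    std::uint64_t address;
    std::vector<std::uint8_t> data;
};

struct Memory {
    std::vector<std::uint8_t> mem = std::vector<std::uint8_t>(1024, 0);
    unsigned latencyNs = 10; // assumed per-access latency

    // Analogous to b_transport(trans, delay): executes the transaction
    // and annotates the target latency onto the caller's delay.
    void b_transport(Payload& p, unsigned& delayNs) {
        if (p.cmd == Payload::WRITE)
            std::memcpy(&mem[p.address], p.data.data(), p.data.size());
        else
            std::memcpy(p.data.data(), &mem[p.address], p.data.size());
        delayNs += latencyNs;
    }
};
```

An AT model would instead split each access into request/response phases, each with its own timing point.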
As described, actors a ∈ A and communication channels c ∈ C are partitioned into clusters X(x) and mapped to components of a system architecture. Due to the mapping, channel communication can either be internal, in case both communicating actors aa and ab are mapped onto the same resource (aa ∈ X(xy) and ab ∈ X(xy)), or external, in case the communication crosses cluster boundaries (aa ∈ X(xy) and ab ∉ X(xy)). For intra-resource communication, FIFOs can be placed in the private memory of the architectural component, whereas FIFOs for inter-resource communication, such as c1 and c8 from Fig. 1, have to be placed in an external memory model. Either way, the actor communication semantics through ports are not altered, in order to reuse the existing actors written in SystemC-based SysteMoC. So, the challenge of this step in the design flow is to map the FIFO-based communication via dedicated channels to a memory-mapped, bus-based communication with global and local shared memory. Since our abstract communication semantics (read, write, commit) calls for uniform channel access, access transparency has to be ensured after mapping to the architectural template, resulting in the transaction-level architectural model. As communicating actors on different resources are executed concurrently, simultaneous access to FIFO storage has to be avoided. This means that memory coherence as well as cache coherence has to be guaranteed. To cope with actor clustering and to ensure synchronized channel access, independent of the communication mapping, we use aggregators and adapters that implement a suitable communication protocol [16]. Adapters, by which the SysteMoC ports (i ∈ A.I and o ∈ A.O) are substituted, serve as links between the actors and the transaction level. Due to the fact that more than one actor can be mapped to one resource, and actors can possess multiple ports, an aggregator is needed for each transaction-level component (X(xi) : height(xi) = 1) to encapsulate the desired number of adapters.
These aggregators perform transaction-level communication and implement the interface of the component to the rest of the architectural model. There is no need to connect adapters for internal channels to the aggregator, because no communication will take place across component boundaries. In Fig. 1, the communication between the actors Parser, Recon, MComp, and IDCT is internal and can be implemented using, e.g., internal memory. In our approach, we use a transaction-level memory model for each communication channel. In the following, we describe the functionality of adapters and aggregators in more detail.
A. Adapter
An adapter translates between transactions in the virtual prototype and the asynchronous FIFO channel communication used in the application model. Hence, the communication adapter implements two different interfaces. The interface towards the actor is equivalent to the abstract channel, which has to be
[Figure 5: actors Parser and Recon connected through adapters Out c2 and In c2; the virtual channel c2 is backed by a TLM memory model]
Fig. 5. Mapping of parts of cluster X(x3) from the model depicted in Fig. 1 to an architectural component. The internal communication takes place over a virtual channel, which substitutes the abstract channel. Adapters translate between the abstract model and the transaction-level model. The FIFO queue semantics are implemented using a TLM memory model.
replaced. To sustain the abstract communication semantics, the adapter needs to access tokens in a random-access manner and to commit completed transitions via this interface. Therefore, a conversion of the token data type (e.g., serialization and deserialization) has to be performed in the adapters. An adapter also has to respect the abstract channel synchronization mechanism. This means the adapter has to provide an interface through which it can be notified when tokens on the channel are produced or consumed, respectively. This notification can be used to trigger the corresponding actor waiting for free space or tokens on the channel.
The transaction-level interface consists of three transaction-level communication sockets (see Fig. 5). One is used for data transmission: the actor connected to the adapter can read or write data from a memory through this socket. The other two sockets are needed to sustain the channel synchronization. For synchronization, the adapters communicate among each other over arbitrary TLM communication resources. Therefore, a dedicated address has to be assigned to each adapter.
Due to the fact that SysteMoC channels possess memory, the FIFO storages have to be mapped to resources. As different locations are possible, we allocate the storage in a memory to which the adapters are connected. For internal communication, the sockets of the adapters can be directly coupled with each other, as depicted in Fig. 5. The synchronization sockets of the two communicating adapters are directly coupled, whereas the data sockets are connected with the memory.
The memory for external communication is accessible over a bus system to which the aggregator is connected (see Fig. 6). Allocating the storage in one adapter, or splitting and distributing it over both communicating adapters, is also possible. Independent of the chosen implementation and mapping, each adapter needs to know to which address space its buffer is mapped, in order to read or write tokens.
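A minimal sketch of such an adapter, in plain C++, is shown below. The memory layout (head and tail counters stored in front of the token slots) and all names are our assumptions for illustration; the paper's actual protocol is the one described in [16].

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Word-addressable stand-in for a TLM memory model.
struct SharedMemory {
    std::vector<std::uint32_t> words = std::vector<std::uint32_t>(64, 0);
    std::uint32_t read(std::uint64_t addr) const { return words[addr]; }
    void write(std::uint64_t addr, std::uint32_t v) { words[addr] = v; }
};

// Illustrative adapter: maps abstract FIFO access (read, write, commit)
// onto a memory-mapped buffer. Assumed layout at 'base':
//   [head counter, tail counter, slot 0 .. slot cap-1]
class FifoAdapter {
public:
    FifoAdapter(SharedMemory& m, std::uint64_t base, std::uint32_t capacity)
        : mem(m), base(base), cap(capacity) {}

    // Random access to the i-th available token (no consumption yet).
    std::uint32_t peek(std::uint32_t i) const {
        std::uint32_t head = mem.read(base);
        return mem.read(base + 2 + (head + i) % cap);
    }
    bool canRead(std::uint32_t n) const { return fill() >= n; }
    bool canWrite(std::uint32_t n) const { return cap - fill() >= n; }

    void push(std::uint32_t v) {              // write one token
        std::uint32_t tail = mem.read(base + 1);
        mem.write(base + 2 + tail % cap, v);
        mem.write(base + 1, tail + 1);        // commit production
    }
    void commitRead(std::uint32_t n) {        // commit consumption
        mem.write(base, mem.read(base) + n);
    }
private:
    std::uint32_t fill() const { return mem.read(base + 1) - mem.read(base); }
    SharedMemory& mem;
    std::uint64_t base;
    std::uint32_t cap;
};
```

Two adapters sharing the same base address then see one consistent FIFO; in the real design, the event-based synchronization sockets would additionally notify the peer on each commit.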
[Figure 6: clusters X(x1) and X(x3) with adapters Out c8/In c1 and In c8/Out c1 behind their aggregators; both aggregators are connected to a TLM bus model, which also connects a TLM memory model]
Fig. 6. Mapping of the cross-component communication between HW and CPU from Fig. 1. For the sake of clarity, the internal communication structure is omitted.
B. Aggregator
As real computational resources such as CPUs or DSPs have a limited number of connection pins, each node x ∈ T besides the root node needs a mechanism that aggregates the children connected to x. For nodes that represent data-transferring units, such as buses (x2), this is done by arbitration and address translation. Unlike the communication resources (data-transferring units), the computational resources (data-processing units) need an aggregator for this purpose. The aggregators contain TLM ports to perform transaction-level cross-component communication, and they implement the communication protocol for the connected adapters at the transaction level. For communication, the aggregators communicate among each other over arbitrary TLM communication resources. For this purpose, each aggregator is assigned a dedicated address range, whose size depends on the number of adapters registered with the aggregator. Each adapter is thus assigned a single address at which it is accessible for event-based synchronization. Besides its own address range, each aggregator has to know the addresses of the peer adapters, which are associated with its registered adapters, and the addresses of the corresponding FIFOs in memory.
VI. VIRTUAL PROTOTYPE GENERATION
In the final step of our automatic design flow, a virtual prototype is generated based on the transaction-level architectural model.
A. Architectural Refinement
In order to allow for early software development, parts of the architecture have to be substituted by virtual component models. In our approach, all resources except the hardware accelerators are replaced. As our approach focuses on software
TABLE I
MEASUREMENT TERMS OF THE DIFFERENT VIRTUAL PROTOTYPES.

VP   Instructions   Simulation           VP Performance
                    Host [s]   VP [ms]   CPI    MIPS
I    4944835683     1997       44285     1.79   111.66
II   5319738192     521        30494     1.15   174.45
III  5726625319     1791       29222     1.02   195.97
IV   5765601708     660        26993     0.94   213.59
V    6188808202     1760       7224      0.23   856.66
VI   3492102237     550        30870     1.77   113.12
development, the inserted processor models must provide an instruction set simulator in order to simulate, and furthermore debug, the software running on the models. We therefore use a commercial virtual component library [3], which provides the opportunity to integrate TLM. This feature is necessary to couple the hardware accelerators with the virtual components. In order to sustain the abstract channel synchronization mechanism, an interrupt controller is added for each processor element. Through this controller, the processor element can be informed about channel data modification by another processor or a hardware accelerator.
B. Target Software Generation
During target software generation, the actor description in SystemC is transformed into standard C/C++ code. The communication ports of the actor are replaced by pointers to FIFO interfaces, and the finite state machine is encoded as a switch-case statement. The FIFO interfaces represent the communication interface equivalent to the TLM communication adapters described in Section V. Moreover, scheduling strategies have to be implemented in case multiple actors are mapped onto the same processor element.
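What such generated code could look like for the Sorter actor from Listing 1 is sketched below. This is a hypothetical output: the FIFO interface and the structure of the generated class are our illustration, not the tool's actual generated code.

```cpp
#include <cassert>
#include <deque>

// Stand-in for the FIFO interface the generated code is linked against.
struct Fifo {
    std::deque<double> q;
    bool available(int n) const { return (int)q.size() >= n; }
    bool space(int) const { return true; } // unbounded, for simplicity
    double get() { double v = q.front(); q.pop_front(); return v; }
    void put(double v) { q.push_back(v); }
};

enum State { START };

// Hypothetical generated code: ports have become pointers to FIFO
// interfaces, the FSM of Listing 1 has become a switch-case.
struct SorterGenerated {
    Fifo *i1, *o1, *o2;
    State state = START;

    // One scheduler invocation on this actor; returns true if it fired.
    bool fire() {
        switch (state) {
        case START:
            if (i1->available(1) && i1->q.front() >= 0 && o1->space(1)) {
                o1->put(i1->get());      // action f_positive
                state = START;
                return true;
            }
            if (i1->available(1) && i1->q.front() < 0 && o2->space(1)) {
                o2->put(i1->get());      // action f_negative
                state = START;
                return true;
            }
            return false;
        }
        return false;
    }
};
```

A round-robin scheduler on a processor element would simply call `fire()` on each mapped actor until none can fire.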
VII. EXPERIMENTAL RESULTS
In order to show the applicability of our approach, we present our first results on generating virtual prototypes from an actor-oriented Motion JPEG model. We use a more fine-grained model than the one given in Fig. 1, consisting of 19 actors interconnected by a total of 56 FIFO channels. Table I presents the results of several test cases with different mappings. Since the architecture template contains 19 processors, 19 hardware accelerators, and a shared memory, all connected by a bus, many architecture instances exist. With our approach, it is possible to generate virtual prototypes for all of them. To show the applicability, we consider only a few mappings serving as representatives.
Our first prototype (I) consists of a single processor (ARM926), onto which all actors are mapped. For the next two test cases, two processors are allocated and connected via a bus. For this architecture instance, two mappings are tested: (i) the IDCT actors are mapped to one processor and all remaining actors to the other (II); (ii) the actors are mapped to the processors alternately, i.e., the
[Figure 7: bar chart of generation and compilation times in seconds (0 to 60) for prototypes I to VI]
Fig. 7. Times measured for generation and compilation of the different configurations.
neighbor of each actor in the decoding pipeline is mapped to a different processor than the actor itself (III). For the FIFO communication between the two processors, a memory is additionally allocated and connected to the bus. In the fourth prototype (IV), three processors and a memory are allocated. Here, the actors Source and Sink are clustered onto one processor, IDCT is mapped to the second, and the remaining actors are mapped to the third. To take full advantage of pipelined execution, 19 processors are allocated in the fifth prototype (V). In the last test case (VI), one processor and one hardware accelerator are allocated. This test case is analogous to the second prototype, except that the functionality of the IDCT actors is moved to the hardware accelerator.
Figure 7 shows the time needed for prototype generation and compilation. It can be seen that the time spent on prototype generation is nearly independent of the mapping, whereas the compilation time depends on the components of the prototype. On the one hand, the more processors are allocated, the more time is needed for compiling. On the other hand, the code for the transaction-level hardware accelerators is more complex than the code running on processors, so compiling hardware accelerators takes more time. In summary, all virtual prototypes have been generated within seconds instead of hours. In the following, five measurement terms are evaluated for decoding 10 images (176x144): total instructions executed, cycles per instruction (CPI), million instructions per second (MIPS), simulation time (host time), and simulated time. In order to make a statement about system performance, not simulator performance, the terms CPI and MIPS relate to the simulated time. The corresponding values are given in Table I.
It can be seen that the performance of the prototypes behaves as expected. The more processors are allocated, the better the pipeline of the decoder can be exploited. This means fewer cycles are needed per instruction, which results in a higher MIPS rate and a lower CPI. The small difference between II and III stems from a better workload distribution.
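The CPI and MIPS columns of Table I can be reproduced from the instruction counts and the simulated time, assuming a 200 MHz processor clock (our inference from the table; the paper does not state the clock frequency): CPI = f_clk * t_sim / instructions and MIPS = instructions / t_sim / 10^6.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Perf { double cpi; double mips; };

// Recompute CPI and MIPS with respect to simulated time.
// The 200 MHz default clock is an assumption inferred from Table I.
Perf evaluate(std::uint64_t instructions, double simulatedMs,
              double clockHz = 200e6) {
    double t = simulatedMs / 1000.0;                 // simulated seconds
    return { clockHz * t / (double)instructions,     // cycles per instruction
             (double)instructions / t / 1e6 };       // million instr. per second
}
```

For prototype I (4944835683 instructions, 44285 ms simulated), this yields CPI ≈ 1.79 and MIPS ≈ 111.66, matching the table.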
As different developer teams implement different parts of the application, it is often unnecessary to refine all components of the TLM architectural model to virtual processor models. Prototype VI shows that there is no appreciable difference in simulated and host time compared to the completely refined model (II).
VIII. CONCLUSION
In this paper, we have presented a two-step methodology for automatically generating virtual system-level prototypes from an abstract system specification. Our main goal was to provide a methodology that removes the dependency on hardware availability, needed for software development, in an early phase of a design flow that starts with an abstract and executable application model. For this purpose, design decisions are first represented in SystemC TLM, which is typically supported by all commercial virtual prototyping tools. Second, the TLM generation is used to assemble the virtual prototype and generate the embedded software. To show the applicability of our approach to real-world applications, we presented first simulation results for an actor-oriented Motion JPEG model.
REFERENCES
[1] E. A. Lee, "Overview of the Ptolemy Project, Technical Memorandum No. UCB/ERL M03/25," Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA, USA, Tech. Rep., Jul. 2004.
[2] OVPworld, http://www.ovpworld.org.
[3] Synopsys, http://www.synopsys.com.
[4] T. Grötker, S. Liao, G. Martin, and S. Swan, System Design with SystemC. Norwell, MA, USA: Kluwer Academic Publishers, 2002.
[5] O. Moreira, F. Valente, and M. Bekooij, "Scheduling multiple independent hard-real-time jobs on a heterogeneous multiprocessor," in Proceedings of EMSOFT, 2007, pp. 57–66.
[6] P. K. F. Hölzenspies, J. L. Hurink, J. Kuper, and G. J. M. Smit, "Run-time spatial mapping of streaming applications to a heterogeneous multiprocessor system-on-chip (MPSoC)," in Proceedings of DATE, 2008, pp. 212–217.
[7] M. Thompson, T. Stefanov, H. Nikolov, A. D. Pimentel, C. Erbas, S. Polstra, and E. F. Deprettere, "A framework for rapid system-level exploration, synthesis, and programming of multimedia MP-SoCs," in Proceedings of CODES+ISSS, 2007, pp. 9–14.
[8] T. Kangas et al., "UML-based multi-processor SoC design framework," ACM TECS, vol. 5, no. 2, pp. 281–320, May 2006.
[9] J. Keinert, M. Streubühr, T. Schlichter, J. Falk, J. Gladigau, C. Haubelt, J. Teich, and M. Meredith, "SystemCoDesigner - An Automatic ESL Synthesis Approach by Design Space Exploration and Behavioral Synthesis for Streaming Applications," TODAES, vol. 14, no. 1, pp. 1–23, 2009.
[10] Open SystemC Initiative (OSCI), "OSCI SystemC TLM 2.0," http://www.systemc.org/downloads/standards/tlm20/.
[11] J. Falk, C. Haubelt, and J. Teich, "Efficient representation and simulation of model-based designs in SystemC," in Proceedings of FDL, Sep. 2006, pp. 129–134.
[12] L. Thiele, K. Strehl, D. Ziegenbein, R. Ernst, and J. Teich, "FunState - an internal design representation for codesign," in Proceedings of ICCAD. Piscataway, NJ, USA: IEEE Press, 1999, pp. 558–565.
[13] F. Ghenassia, Transaction-Level Modeling with SystemC. Dordrecht: Springer, 2005.
[14] B. Bailey and G. Martin, ESL Models and their Application. Dordrecht: Springer, 2010.
[15] OSCI TLM-2.0 User Manual, Open SystemC Initiative, Jun. 2008.
[16] J. Gladigau, C. Haubelt, B. Niemann, and J. Teich, "Mapping actor-oriented models to TLM architectures," in Proceedings of the Forum on Specification and Design Languages (FDL), Barcelona, Spain, Sep. 2007, pp. 128–133.
134
An Automated Approach to SystemC/Simulink Co-Simulation

F. Mendoza and C. Köllner
FZI Research Center for Information Technology
Dept. of Embedded Systems and Sensors Engineering (ESS)
Haid-und-Neu-Str. 10-14, D-76131 Karlsruhe, Germany
Email: {mendoza|koellner}@fzi.de

J. Becker and K. D. Müller-Glaser
Institute for Information Processing Technology
Karlsruhe Institute of Technology, Karlsruhe, Germany
Email: {becker|klaus.mueller-glaser}@kit.edu
Abstract—We present a co-simulation framework which enables rapid elaboration, architectural exploration and verification of virtual platforms made up of SystemC and Simulink components. We exploit the benefits of Simulink's graphical environment and simulation engine to instantiate, parametrize and bind SystemC modules which reside inside single or multiple component servers. Any set of SystemC module implementations can be easily added to a component server controlled by Simulink through a set of well-defined interfaces and simulation synchronization schemes. The complexity of our approach is hidden by the automated framework, which enables a designer to focus on the creation and verification of SystemC models and not on the intricate low-level aspects of the co-simulation between different simulation engines.
I. INTRODUCTION
The increasing complexity of embedded systems has constantly triggered the creation of tools and methodologies that can aid in the different stages of their design and verification. Traditional approaches to the design of embedded systems are based on common practices, such as the creation of specifications and modeling guidelines, simulation of key concepts and algorithms, and the implementation of hardware prototypes. Though widespread, such approaches are not well suited to complex embedded systems, especially when it comes to the implementation of hardware prototypes, where up to 70% of a project's design time is invested in costly functional verification and redesign cycles [1]. There is an evident need for new approaches that can improve design efficiency and the quality of embedded systems.
In recent years, System Level Design (SLD) methodologies have gained popularity in the electronic design automation market. SLD was created to cope with the increasing complexity of embedded systems and to enhance the productivity of designers. Regardless of the definition given by each author, the goals of SLD are to enable new levels of design and reuse through higher levels of modeling abstraction, and to enable HW and SW co-design [2].
The motivation of our work is to incorporate SLD methodologies into the development flow of embedded systems. In the automotive and industrial automation fields, for example, Simulink is the most widely accepted simulation and model-driven prototyping tool for continuous- and discrete-time dataflow designs. It is here that the functionality of algorithms is tested and where SLD methodologies can be seamlessly integrated. By adding SLD support to Simulink we can enable rapid architectural exploration in early stages of a design. Our approach uses the right tool for the right job: Simulink for the creation of functional models and test benches, and SystemC for the creation of system-level models of hardware implementation solutions. A designer is able to investigate different architectural partitions of a design that can be tested along with sensors/actuators, controllers, and embedded software. This provides a better understanding of the functionality and interactions between the different components of a system. The acquired knowledge can then be used for the selection of an appropriate hardware prototype implementation, whose functionality can later be verified against the available simulation results.
Our work uses S-Functions developed in C++ as the common principle to extend Simulink's functionality. An S-Function is essentially the source code that describes the behavior of a user-defined Simulink block. S-Functions have access to Simulink's simulation engine through a set of defined function calls. Using these function calls and the expressive power of C++, we are able to instantiate, connect, parameterize and simulate SystemC models inside Simulink. This automated approach to the co-simulation of SystemC and Simulink is explained in detail in this paper. Additionally, we present its application to the verification of a DSP algorithm.
The challenges involved in the time synchronization between the simulation models of Simulink and SystemC are discussed. Simulink uses a continuous-time simulation model, while SystemC uses a discrete-time, event-driven simulation model. In a continuous simulation model, time is discretized into fixed or variable time steps, also called integration steps, according to the numerical solver used by the simulation engine. In a discrete-time, event-driven simulation model, time steps are inherently variable and are calculated according to events scheduled in a queue. Delta cycles are used to update all processes running concurrently in the same time step. Only when the event queue for that time step is empty can the time be updated to the next scheduled event.

978-1-4577-0660-8/11/$26.00 © 2011 IEEE
II. RELATED WORK
A list of available commercial mixed-language simulation tools for the creation of virtual platforms is presented below. SystemVision [3] from Mentor Graphics enables the interaction of SPICE, C/C++, SystemC and Verilog-AMS for creating simulations of analog and digital components. System Generator [4] from Xilinx provides a library of their DSP IP blocks translated into Simulink components. It enables co-simulation of these IP DSP blocks with standard Simulink components, with the advantage of being synthesizable into Xilinx FPGAs. In the System-on-Chip area, tools for the simulation of multi-processor systems are common, for example VaST from Synopsys and Seamless from Mentor Graphics.
A common feature found in mixed-language simulation tools is the use of simulation wrappers to adapt and connect different abstraction levels and simulation models. The use of wrappers is commonly found in the simulation of multiprocessor systems, such as [5] and [6], where processor models are wrapped and connected to a SystemC backplane. Further formal approaches for the generation of simulation wrappers are presented in [7] and [8].
The mixed-language simulation approach we focus on is the co-simulation between Simulink continuous models and SystemC discrete models. A systematic analysis of continuous and discrete simulation models along with their respective triggering mechanisms is presented in [9]. We have classified the available co-simulation approaches into two variants, according to the simulation engine that takes control of the whole simulation. In the first case, where SystemC takes control of the simulation, two synchronization schemes can be identified. A basic, though effective, synchronization scheme is presented in [1] and [10], where a SystemC application synchronizes with a Simulink model at fixed time intervals. A more efficient approach based on SystemC's event-driven scheduling mechanism is presented in [9]. The authors use SystemC's event queue to determine the required synchronization points with Simulink. They additionally include the possibility for a Simulink model to trigger additional synchronization points. A continuation of their research is given in [8], where better-defined interfaces between Simulink and SystemC are presented. In the second approach, where Simulink takes control of the simulation, it is possible to synchronize with one or more SystemC kernels via user-defined S-Functions triggered at fixed or variable sampling times. The authors in [11] use a synchronization scheme based on variable sampling times extracted from SystemC's event queue; however, no technical details are given on its implementation. Our work takes a similar approach, where the time synchronization scheme is controlled by Simulink and allows for a truly event-based simulation inside the SystemC sub-system. Each SystemC module instance corresponds to a Simulink block with appropriate input and output connectors. Therefore, a SystemC sub-system can consist of any number of arbitrarily interconnected module instances.
Our contribution differs from the above in that we exploit the benefits of Simulink's graphical interface to instantiate, parameterize and bind SystemC modules which reside inside single or multiple component servers. The benefit of our approach is increased usability. A hardware designer is able to rearrange the overall system structure of a virtual platform in order to explore several design aspects and realization alternatives. We thus simplify system composition and simulation control without the need to manually edit SystemC source code. Furthermore, the proposed approach enables the designer to dynamically create as many instances of SystemC modules as desired (including multiple instances of the same module) inside a Simulink model, and to interconnect them with other SystemC module instances and native Simulink blocks.
III. THE CO-SIMULATION INTEGRATION FRAMEWORK
Figure 1. Overview showing the design flow for creating a component server.
A. Overview
Figure 1 shows an overall view of our co-simulation integration framework. During a design entry step, the developer defines and creates the SystemC modules which will be available in the repository of a component server. These modules are then compiled along with a SystemC kernel and infrastructure code in order to build a component server and client application. The component server allows for the dynamic instantiation and interconnection of the SystemC modules inside its repository. The SystemC kernel built inside the component server is able to execute a dynamically created SystemC model. With the help of a set of well-defined interfaces and synchronization schemes, a client encapsulates the functionality to connect to the component server and control the data exchange.
Three variants of component servers with their respective clients are available, differing only in the middleware used to connect them. This gives the designer the liberty of deciding where a component server will be located: running inside the Simulink process, locally in another process, or on an external server connected via network. In the first variant, the SystemC kernel and the Simulink solver engine are both executed in the same process, using the same address space. This approach has the advantage of high simulation performance, but has an impact on robustness. If a software bug inside a module implementation crashes the component server, it will also crash the Simulink environment. Furthermore, debugging the design is tedious compared to debugging a standalone application. In the second and third variants, the component server and client are executed as different processes communicating through shared memory or TCP/IP inter-process communication (IPC). Both variants provide better isolation between processes and a convenient way to co-simulate Simulink with one or more SystemC kernels running concurrently.
B. Usage
Our approach separates the task of SystemC module development from the complex code infrastructure required to interface SystemC and Simulink. A SystemC module designer is provided with a small set of preprocessor macros which, when inserted inside a module class declaration, automatically register that class with a component repository. All the designer has to do is compile the module implementations along with the SystemC library and our infrastructure library. Build parameters determine which of the three variants (see Figure 1) will be created.
The component server is displayed in Simulink as an S-Function block. All modules present in the server's repository can be instantiated in arbitrary quantities using such component server blocks (which all refer to the same S-Function). Parameters specify which module to create and, if necessary, its constructor arguments. Each block automatically adopts the interface of the underlying SystemC module instance in terms of input and output ports. For enhanced usability, the user can create a Simulink block library which hides the details of the S-Function parameterization.
IV. IMPLEMENTATION
A. Infrastructure/Component Server
Figure 2 shows the class diagram of the infrastructure code. Throughout this subsection, we will focus on the key concepts required to achieve the level of integration and usability described in Section III-B.

Figure 2. Class diagram of the component server infrastructure.
1) Structural Analysis: By structural analysis we understand the process of determining the interface of a SystemC module. This includes the set of its ports along with their names and type information. The interface description is required by both the diagram editor and the solver engine. In the first case, it is needed to present a meaningful graphical representation of the module and to check the type compatibility of diagram connections. In the second case, it is required to prepare the data structures needed to run the simulation.
Approaches that try to reveal the structure of SystemC models by source code analysis are presented in [12] and [13]. These approaches require sticking to certain coding standards. PINAPA [14] is a hybrid approach where the elaboration phase of a SystemC model is virtually executed in order to determine module hierarchy and interface descriptions. We chose a simpler approach where the analysis is done at runtime. The SystemC base class sc_object implements two methods, get_child_objects and kind, which let the user enumerate dependent objects such as ports and processes. We store the processed information in instances of ModuleInstanceDescriptor and PortInstanceDescriptor. SystemC provides four basic port types, which are described using the EKind enumeration (Table I).

Table I
THE EKIND ENUMERATION

Literal   SystemC port type    Description
DataIn    sc_in<DataType>      Inbound dataflow
DataOut   sc_out<DataType>     Outbound dataflow
Import    sc_import<IfType>    Required interface
Export    sc_export<IfType>    Provided interface

Usually, it is the compiler's responsibility to apply type checking to a SystemC model to ensure that all module interconnections are correct. For example, ports of the data types sc_in<X> and sc_out<Y> may only be bound to an sc_signal<Z> if the types X, Y and Z match. Our framework allows for instantiation and interconnection of ports at runtime. This means that type checking has to be performed by the runtime infrastructure. The problem is solved using C++ run-time type information (RTTI). We apply the dynamic_cast operator to perform type compatibility checking and typeid to obtain textual type information.
2) Automation: Many higher-level languages, such as Java, C# and Objective-C, support reflection features. Reflection is a powerful metaprogramming paradigm which allows a program to examine its own structure and behavior at runtime, or even to alter its behavior. C++ RTTI can be considered a very basic and strongly limited implementation of reflection. One application of reflection is to resolve a class name (given as a string) into a class type descriptor. This way, an instance of the class can be constructed indirectly, without specifying its type at the source code level. As RTTI does not support this feature, we implemented a small meta-language which imitates that behavior by allowing the designer to enable selected classes for indirect instantiation. This is done by inserting preprocessor macros that declare a class as “automatable”. It is also possible to describe the set of constructor arguments with respect to their types and names. When the infrastructure code initializes, the class registers itself with the Repository (which follows the singleton design pattern).
3) Simulation Control: An appropriate interface supports synchronization and data exchange between the SystemC kernel and Simulink. The SimulationContext class exposes functionality to control the simulation. A Reset method resets the SystemC kernel to its initial state: all module instances are destroyed and the simulation time is reset to 0. RunSimulation executes the simulation for a specified amount of time. The GetTimeOfNextEvent method returns the point in time at which the next process inside SystemC's process queue is scheduled (or ∞ if no process is currently pending).
B. Client S-Function
The client S-Function acts as a mediator between a component server and Simulink. It synchronizes both simulator kernels and enables the exchange of signal values.
1) Signal Data Exchange: CoSimImport and CoSimExport (see Figure 2) are specializations of sc_channel which are designed to transfer signal data into and out of a SystemC simulation. Both can be interfaced with Simulink. We distinguish four kinds of connections which can occur in a Simulink model:
• A Simulink/Simulink connection models a dataflow dependency between two native Simulink blocks. This type of connection does not need any further consideration, as it is handled by the Simulink solver.
• A Simulink/SystemC connection links a Simulink signal to a so-called import gateway block. This block maps to the client S-Function, which creates an instance of CoSimImport in order to handle the data exchange from Simulink to SystemC. It is important to mention that the import gateway block is the only block which supports this connection type. It is not possible to link a Simulink signal directly to an arbitrary SystemC module (see Figure 5).
• A SystemC/Simulink connection links an export gateway block to a Simulink signal. In this case, the S-Function creates an instance of CoSimExport. Again, the export gateway is the only block supporting this type of connection.
• A SystemC/SystemC connection links two blocks, each representing a SystemC module instance. This connection type is realized using a propagate-and-bind scheme which is explained below.
Figure 3. Two connected Simulink blocks and their internal representation
A simple solution to realize SystemC/SystemC connections would be to create and bind instances of CoSimImport and CoSimExport. In this case, Simulink transfers the actual signal data according to Figure 3. To ensure that no signal value change is missed, it is necessary to carefully choose the sample times of both blocks. Setting them too high causes data loss and unintuitive behavior. Setting them too low results in poor simulation performance, since the frequent context changes prevent the SystemC simulation kernel from skipping unnecessary simulation cycles.
There are applications where the simulation performance is affected in such a way that the approach becomes completely intractable, for example if SystemC is applied to analyze packet-based data [13]. In the considered application, packets are recorded by a data logger and processed by a SystemC-based data analysis framework. The framework synchronizes the simulation time with the receive timestamp of each currently processed packet. All timestamps possess a resolution of 100 ns. However, the time lag between two consecutive packets usually lies several orders of magnitude above this resolution. Given moderate traffic, it is possible to run the data analysis faster than real time on a standard PC. Obviously, a context switch every 100 ns would result in non-viable analysis performance.
We decided to implement a different approach which allows the SystemC sub-model to be executed at a much higher rate (or truly event-based) than the rest of the model. Synchronization is only necessary at transitions between Simulink and SystemC blocks, which are modeled explicitly by import and export gateways. The approach imitates the standard way of constructing SystemC models, where a module connection is realized by binding the involved module ports to the same instance of an sc_channel (in most cases sc_signal, a specialization of sc_channel).
For each SystemC/SystemC connection inside the Simulink block diagram, the client S-Function creates an appropriate sc_signal instance and binds it to the underlying port instances. Unfortunately, there is no elegant way of extracting the set of diagram connections from inside the S-Function. However, the information is gained implicitly using a propagate-and-bind scheme. As soon as the simulation is running, Simulink provides the S-Function with buffers which are used to store its input and output values. The idea is not to store an actual signal value inside a buffer, but a reference (or pointer) to the signal instance. During the first Simulink simulation cycle, Simulink propagates the references in order to complete the binding of the whole SystemC sub-model.
On each simulation cycle, Simulink passes (amongst others) through a calculate outputs phase which instructs each block to update its outputs. The computation may involve block inputs, provided that they are marked as having a direct feedthrough property [15]. Our implementation marks every input as direct feedthrough in order to gain access to it during the calculate outputs phase. This leads to the following algorithm:
1) When a block is created: Instantiate the appropriate module class, then create and bind an sc_signal instance for each data output port. Leave all data input ports unbound.
2) When entering calculate outputs for the first time:
   • Store the references (or pointers) to all signals created in step 1 in the appropriate output data buffers (provided by Simulink).
   • Fetch all references from the input data buffers (provided by Simulink) and bind all data input ports to the corresponding signal references.
3) When all blocks have passed calculate outputs: The model is ready to elaborate; start the SystemC simulation.
Figure 4a shows the internal representation of the Simulink model shown in Figure 5 before the simulation is started (step 1). After propagating all signal references, the binding is completed (step 2, see Figure 4b). Elaboration and start of the SystemC simulation (step 3) still take place during the very first Simulink solver step, so even at simulation time 0 no information is lost. Prior to the simulation, Simulink analyzes all data dependencies in the model and computes an appropriate block execution order which ensures that the inputs of each block are computed before that block enters the calculate outputs phase. However, the propagation scheme is only viable for SystemC sub-models without loops. Simulink would recognize each loop as being algebraic and report an error, regardless of whether a register within the underlying behavior actually breaks that loop or not.
2) Time Synchronization Algorithm: Our time synchronization algorithm is controlled by Simulink, as opposed to [9], where the SystemC event queue is used to control the synchronization intervals. If the Simulink model refers to multiple component servers, it is possible to have more than one SystemC kernel. In that case, the SystemC kernels run independently of each other, though each is controlled by and synchronized with Simulink simulation time. There is no direct data exchange between modules belonging to different component servers; instead, gateway blocks have to be used. It is up to the designer to establish an appropriate sampling time for each gateway block. Setting a sampling rate too low can lead to loss of data; setting it too high will affect the simulation performance due to oversampling.
The involved co-simulation algorithm (Algorithm 1) is quite simple. The Simulink solver triggers each gateway block at its specified sampling time. This happens when entering the calculate outputs phase, which instructs SystemC to synchronize with Simulink's simulation time. If the block is an import gateway, the input signal value is transferred into the SystemC model. If the block is an export gateway, a number of single delta cycle simulations follow until no processes are pending for the current simulation time. This step accounts for combinational computation paths inside the modules and ensures that all module outputs have stable values. Afterwards, the output signal value is transferred to Simulink.
Algorithm 1 When entering calculate outputs: Synchronize Simulink and SystemC

now := CurrentSimulinkTime
Δt := now − CurrentSystemCTime
if Δt > 0 then
  RunSimulation(Δt)
end if
if current block is import gateway then
  Transfer Simulink input signal value to SystemC
else if current block is export gateway then
  while GetTimeOfNextEvent() = now do
    RunSimulation(0) {Executes a single delta cycle}
  end while
  Transfer SystemC signal value to Simulink
end if
Figure 4. Internal representation of Simulink blocks representing SystemC modules (a) during model construction and (b) after binding is completed
C. Middleware
The communication between a component server and a client is done via a middleware. In the case where the server and client are compiled together, no additional middleware software is used and communication is done by sharing pointers into the same memory space. For TCP/IP communication, an open source project called Remote Call Framework (RCF) [16] was used. Shared memory communication was implemented with the open source Boost IPC library [17]. Both middleware implementations provide convenient and powerful function calls for inter-process communication.
V. RESULTS
We used the co-simulation framework for the verification of a variable-length FIR filter, a building block commonly used in DSP applications. The filter was modeled as a SystemC module with one input and one output port. The length and coefficients of the filter must be given as parameters when the class is instantiated. Our model is approximately timed in the sense that we assume data is processed at a constant rate. The SystemC model could later be refined by adding timing information to the model.
The FIR SystemC module, with the required preprocessor macros declaring it as “automatable” (see Section IV-A2), was compiled into the repository of a component server. Three variants of the component server were generated according to Figure 1.
Figure 5 shows how the verification of the SystemC FIR filter was performed. As a reference, we used the Digital Filter Design block from Simulink's Signal Processing Toolbox to generate a 16-tap passband filter along with its coefficients. The coefficients were saved in an array and given as parameters to the SystemC model. We simulated the three component server variants and performed verification by inspecting the spectra calculated by the FFT blocks. We were able to easily verify our SystemC application in a couple of minutes. This process would have required considerably more time and effort had the designer manually coded SystemC test benches.
Figure 5. Verification of a FIR filter developed in SystemC.

A certain simulation time overhead is expected due to the numerous synchronizations that must be performed between Simulink and SystemC. The total number of synchronizations in a simulation is calculated from the number of input/output gateways and their sampling rates. For our tests, we reused the three component server variants used for the FIR validation shown in Figure 5. The simulation time for each component server variant was measured and its performance calculated as the ratio of simulation time per synchronization event. In the case of TCP/IP communication, the component server was first run on the local host and later on a remote host connected to our LAN. In all cases a standard desktop computer (Intel Core2 Quad CPU) running Windows 7 was used.
The performance results are shown in Figure 6. The results are presented as the simulation time in seconds per synchronization, in relation to the total number of synchronizations in a simulation. In all cases, the performance of a simulation improves (meaning less time per synchronization) as the total number of synchronizations increases, eventually stabilizing at a constant value. Our results can help a designer decide which communication scheme to use according to the total number of synchronizations expected in a simulation. The performance of the single address space variant departs from the rest after 100 synchronizations and reaches its maximum, approximately 20 times faster than the other variants, after 100k synchronizations. To our surprise, the performance of the shared memory and TCP/IP localhost variants is almost identical. We believe this is because the Boost [17] library implementation used for shared memory IPC is not efficient; we would expect better results if native Windows functions were used for shared memory IPC instead. Finally, the simulation performance of the TCP/IP remote host variant is naturally lower and may be affected by delays in the network.
Figure 6. Simulation performance results according to the number of synchronizations between Simulink and SystemC.
VI. CONCLUSIONS AND OUTLOOK
Our work shows that, thanks to the open source nature of SystemC, the principles and benefits of SLD, which have proven effective in the SoC market, can also be applied to the traditional design of embedded systems in order to rapidly create virtual platforms. Our work demonstrates that it is possible, from a designer's point of view, to seamlessly create and verify SystemC models within Simulink. The complexity of our approach is hidden by an automated framework that generates servers providing a library of SystemC modules, and clients attached to Simulink which control them.
In our current framework version, Simulink does not allow signal loops over SystemC blocks. Loops are allowed only if they are broken by at least one block with delaying behavior, for example a register, or, in Simulink terms, if at least one of the involved ports is declared as non-direct feedthrough. The challenge is to determine when it is safe to declare a port as non-direct feedthrough. A heuristic solution would be to analyze the sensitivity lists of all processes of a SystemC module. Ports that trigger any process, for example a clock signal, would be defined as direct feedthrough and the rest as non-direct feedthrough. As the latter case implies register behavior, the module outputs would not be immediately affected. Another solution would be to oblige the SystemC module designer to mark non-direct feedthrough ports with special meta tags. However, this topic requires further consideration.
Our approach could easily be extended to support parallelized, distributed simulation of SystemC models as done in [18]. In this way, we could increase simulation performance by distributing the simulation across multiple CPU cores. Further work includes support for TLM 2.0 interfaces, which should be possible since the reference propagation scheme can equally be applied to TLM interfaces.
REFERENCES
[1] J.-F. Boland, C. Thibeault, and Z. Zilic, "Using Matlab and Simulink in a SystemC verification environment," in Proc. Design and Verification Conference (DVCon'05), 2005.
[2] K. Keutzer, A. R. Newton, J. M. Rabaey, and A. Sangiovanni-Vincentelli, "System-level design: Orthogonalization of concerns and platform-based design," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 19, no. 12, pp. 1523–1543, 2000.
[3] SystemVision. [Online]. Available: www.mentor.com/systemvision
[4] System Generator. [Online]. Available: http://www.xilinx.com/tools/sysgen.htm
[5] P. Gerin, S. Yoo, G. Nicolescu, and A. A. Jerraya, "Scalable and flexible cosimulation of SoC designs with heterogeneous multi-processor target architectures," in Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC 2001), 2001, pp. 63–68.
[6] N. Pouillon, A. Becoulet, A. V. de Mello, F. Pecheux, and A. Greiner, "A generic instruction set simulator API for timed and untimed simulation and debug of MP2-SoCs," in Proc. IEEE/IFIP Int. Symp. Rapid System Prototyping (RSP '09), 2009, pp. 116–122.
[7] G. Nicolescu, S. Yoo, A. Bouchhima, and A. A. Jerraya, "Validation in a component-based design flow for multicore SoCs," in Proc. 15th Int. System Synthesis Symp., 2002, pp. 162–167.
[8] F. Bouchhima, M. Briere, G. Nicolescu, M. Abid, and E. Aboulhamid, "A SystemC/Simulink co-simulation framework for continuous/discrete-events simulation," in Proc. IEEE Int. Behavioral Modeling and Simulation Workshop, 2006, pp. 1–6.
[9] F. Bouchhima, G. Nicolescu, M. Aboulhamid, and M. Abid, "Discrete-continuous simulation model for accurate validation in component-based heterogeneous SoC design," in Proc. 16th IEEE Int. Workshop Rapid System Prototyping (RSP 2005), 2005, pp. 181–187.
[10] W. Hassairi, M. Bousselmi, M. Abid, and C. Sakuyama, "Using Matlab and Simulink in SystemC verification environment by JPEG algorithm," in Proc. 16th IEEE Int. Conf. Electronics, Circuits, and Systems (ICECS 2009), 2009, pp. 912–915.
[11] K. Hylla, J.-H. Oetjens, and W. Nebel, "Using SystemC for an extended Matlab/Simulink verification flow," in Proc. Forum on Specification, Verification and Design Languages (FDL 2008), 2008, pp. 221–226.
[12] D. Berner, J.-P. Talpin, H. Patel, D. A. Mathaikutty, and S. Shukla, "SystemCXML: An extensible SystemC front end using XML," in Proc. Forum on Specification and Design Languages (FDL 2005), 2005.
[13] C. Kollner, G. Dummer, A. Rentschler, and K. Muller-Glaser, "Designing a graphical domain-specific modelling language targeting a filter-based data analysis framework," in Proc. IEEE Int. Symp. Object/Component/Service-Oriented Real-Time Distributed Computing Workshops, 2010, pp. 152–157.
[14] M. Moy, F. Maraninchi, and L. Maillet-Contoz, "Pinapa: An extraction tool for SystemC descriptions of systems-on-a-chip," in Proc. EMSOFT, Sep. 2005, pp. 317–324.
[15] Matlab Simulink. [Online]. Available: http://www.mathworks.com/help/toolbox/simulink/
[16] RCF - Interprocess Communication for C++. [Online]. Available: http://www.codeproject.com
[17] Boost C++ Libraries. [Online]. Available: www.boost.org
[18] K. Huang, I. Bacivarov, F. Hugelshofer, and L. Thiele, "Scalably distributed SystemC simulation for embedded applications," in Proc. Int. Symp. Industrial Embedded Systems (SIES 2008), 2008, pp. 271–274.
Extension of Component-Based Models for Controland Monitoring of Embedded Systems at Runtime
Tobias Schwalb and Klaus D. Müller-Glaser, Karlsruhe Institute of Technology, Institute for Information Processing Technologies, Germany
Email: {tobias.schwalb, klaus.mueller-glaser}@kit.edu
Abstract—Component-based system development is widely used today to enable rapid, abstract development and reuse in embedded systems. However, control and monitoring at runtime, for adjustment and error identification, usually take place in different domains or tools. These are generally more concrete, so the user needs a deeper understanding of the system. In contrast, this paper presents a continuous concept that raises the abstraction level by integrating runtime control and monitoring into component-based models. The concept is based on an extended component-based meta model and libraries, which describe the available components together with their interfaces and parameters. At design time, source code is generated from the model built by the user. At runtime, control commands are sent to the embedded target according to user modifications in the model, and acquired monitoring data is back-annotated and displayed on model level. The concept is demonstrated and evaluated using a reconfigurable hardware platform.
I. INTRODUCTION
The development of embedded systems becomes more and more complex due to increasing demands and the pressure to meet productivity targets. Dannenberg et al. [1], for example, predict a growth of 150% for the market of electric/electronic automotive components, up to a total of 316 billion Euro in 2015, with an exponential rate in the future. To manage this complexity, many systems are built using component-based design methodologies, which allow easy reuse of existing design parts. However, while this method supports a rapid and abstract design, control and monitoring for runtime adjustment and error identification normally take place on a more detailed level using specific tools. Therefore, the user needs a deeper understanding of the system, increasing costs and development time.
In this paper we present a continuous concept to extend component-based models for runtime control and monitoring, supporting abstract adjustment and error identification. The method is based on an extended meta model for component-based systems and on libraries, which store predefined components. In contrast to current methods (see Section II), it allows the user to work on the same abstract level at design time and runtime. Following the design phase, source code is automatically generated for the embedded target. During runtime, the components of the embedded target can be controlled and monitored using special parameters. Other parameters display the monitored status of the embedded system, both within the same abstract component-based model. Therefore, the user does not need a detailed understanding of the individual components for fast prototyping.
In this context, we first describe the state of the art in model-based design, configuration, control and monitoring in Section II. The next section gives an overview of the concept and describes the flow of the method. Section IV illustrates the developed meta model, while Section V describes the actions for configuration, control and monitoring and presents our implemented model-based development environment. An example, based on the use of reconfigurable hardware, is presented in Section VI, including practical tests and results in Section VII. We close with conclusions and an outlook on future work in Section VIII.
II. STATE OF THE ART
In this section we concentrate on specific model-based design methods as part of the V-Model [3] for embedded system development. We describe the state of the art concerning system design, with a focus on component-based design methods and their possibilities. Further, we present current methods for model-based control and monitoring, because we integrate these into component-based models.
For system planning and design in the early development phases of embedded systems, the Systems Modeling Language (SysML) [4] or the Modeling and Analysis of Real Time and Embedded systems profile (MARTE) [5] are used. Both are based on the Unified Modeling Language (UML) [6] and allow abstract specification, analysis and design of complex real-time embedded systems. However, SysML and MARTE do not support runtime functionalities. Compared to component-based design, SysML and MARTE are positioned earlier in the design process.
Component-based design [7] is located in the implementation phase and well known in the software domain. It describes a concept of separating systems into components. Thereby, an individual component is often regarded as a software module that encapsulates a set of related functions (or data). Components communicate with each other via interfaces and are configured using parameters. Components are normally stored in libraries to allow rapid reuse. The design can be performed text-based or model-based; in the latter, the components are displayed as graphical objects.
The main usage of components is to enable reuse of already implemented functionality in different versions and configurations of a product as well as in other projects [8]. Therefore, it reduces the time-to-market and increases the quality, because
Fig. 1. Concept of Design, Control and Monitoring (left) and relating Meta Object Facility (MOF) Levels [2] (right)
already used components are mostly well engineered and tested. Different requirements have to be considered when using component-based design methodologies, including scalability, maintainability and interoperability as well as applicability to real systems. These also apply to the embedded systems domain; more details are outlined in [9].
Once an embedded system is implemented, the developer normally needs separate tools for runtime control and monitoring. These tools allow adjusting the behavior of the implemented software, for example an offset correction. Others are used for debugging purposes, monitoring internal signals or variables. These tools normally perform control and monitoring on a detailed level compared to the abstract component-based design. Furthermore, they are often specific to the embedded target and sometimes adapted to the implemented system.
A further domain are rich component models [10], used as a uniform representation of different design entities to support the management of functional and non-functional parts in the development process. A component-based hierarchical software platform for automotive electronics is shown in [11]. It provides a series of tools for model-driven development, visual configuration and automatic code generation. In [12], Gu et al. present an end-to-end tool chain for model-based design and analysis of component-based embedded real-time software in the avionics domain. It includes the configuration of source code as well as runtime instrumentation and statistics that are fed back into models. The management of distributed systems and their configuration is discussed in [13]. A concept based on design-time modeling, model transformation and management policies is presented.
A framework for design and runtime debugging of component-based systems is presented in [14]. It enables validating the interactions between components by automatic propagation of checks from the specifications to application code. The usage of different MDE tools at runtime is described in [15], where different runtime models are discussed and a tool to design them is presented. Monitoring of embedded systems in terms of model-based debugging is presented in [16], [17] and [18]. These concentrate on real-time monitoring of functional models (e.g. statecharts).
In our concept we follow the known issues and methods of component-based design, i.e. reuse, source code generation, interfaces, parameters, etc. However, this first version does not consider special techniques, e.g. product line techniques, analysis or distributed systems. In contrast to current methods, we integrate control and monitoring directly into component-based models, without using functional models, specific runtime models or external (low-level) tools. Thereby, the component-based model refers to the architecture of the system and does not represent its functionality. As a result, the control and monitoring possibilities are limited and need to be considered already during the design of the components.
III. CONCEPT
In our concept we extend component-based models to include runtime control and monitoring of embedded systems. Thereby, we follow the Meta Object Facility (MOF) of the Object Management Group (OMG) [2]. The flow of our method is depicted in Figure 1 on the left; the related levels of the MOF are shown on the right.
The platform-independent component-based meta model, which corresponds to the M2-Level of the MOF, characterizes a system comprising components with different attributes, interfaces and parameters. It has been extended with special parameters concerning configuration, control and monitoring (for details see Section IV). According to this meta model, libraries store the implemented platform-dependent components. The user uses these components to build his system in a component-based model, which corresponds to the M1-Level of the MOF. Thereby, the user also connects components using their interfaces and adjusts them according to their parameters.
In the next step, based on the component-based model, the source code of the system is generated using templates. The code generation includes the components, their connections and adjustments according to the set parameter values. Additionally, the templates integrate functionality for control and monitoring at runtime, including further components which handle the communication between the embedded target and a design tool on a PC. This generated
source code is between the M1- and M0-Level, because it is normally high-level code (VHDL, C, ...), which is not used to directly program the embedded target. For example, VHDL code is used in a further step to generate a platform-specific binary file. This implementation (M0-Level) is integrated into the embedded target.
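Template-based generation of this M1/M0 source code can be approximated with a short sketch. The paper uses xPand templates; the Python stand-in below is purely illustrative, and all names (the `component` dict layout, `SensorPreprocessing`, the generic names) are assumptions, not taken from the actual tool.

```python
# Hypothetical sketch of template-based code generation: one model-level
# component instance is expanded into a VHDL component instantiation,
# with configuration parameter values fixed as generics at generation time.
from string import Template

# Simplified component instance as it might appear in the M1-level model.
component = {
    "id": "sensor_pre_1",
    "type": "SensorPreprocessing",
    "params": {"input_protocol": "I2C", "input_address": "0x48"},
}

VHDL_TEMPLATE = Template(
    "  $id: entity work.$type\n"
    "    generic map ($generics)\n"
    "    port map (clk => clk, rst => rst);\n"
)

def generate(comp):
    generics = ", ".join(
        f'{name} => "{value}"' for name, value in comp["params"].items()
    )
    return VHDL_TEMPLATE.substitute(
        id=comp["id"], type=comp["type"], generics=generics
    )

print(generate(component))
```

A real xPand template additionally emits the top-level structural description that wires the instantiated components together, as described in Section VI.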
To control the embedded target, the user modifies the component-based model (M1-Level). According to his actions and general information gained from the components, commands are generated and sent to the embedded target via the integrated communication components. For control, the user cannot change every part at model level; only the values of specific control parameters, integrated in the components, can be modified. For monitoring, using the same communication components, information read from the embedded target is transferred to a PC. On the PC, this information, which corresponds to the M0-Level, is interpreted using mapping information and displayed in the component-based model (M1-Level). Thereby, the information is associated with the corresponding monitoring parameters.
As a result, during design, configuration, control and monitoring, the user works on the same abstract level using the same component-based model. There are only different views on the model at design time and runtime. At design time, algorithms map the interactions on model level to the generated source code for the embedded target. At runtime, the algorithms generate control commands according to user modifications in the model and interpret received data to display it on model level. More details on the mechanisms for design, configuration, code generation, control and monitoring are given in Section V.
The component-based meta model is based on the Ecore meta meta model (M3-Level), because it offers a general description and our model-based development environment (see also Section V) is built using the Eclipse Modeling Framework [19].
IV. COMPONENT-BASED META MODEL
The extended platform-independent component-based meta model is depicted in Figure 2. The model instantiated from this meta model later stores information from different perspectives (e.g. design time and runtime). Therefore, the meta model has to be able to, on the one hand, describe a system comprising components with individual interfaces and parameters (including their attributes). On the other hand, it holds the information the user adds during assembling, design, configuration and control, as well as the data received during monitoring of the embedded target.
The classes on the top right of the meta model describe a simplified standard component-based meta model. They outline a system composed of components and connections. Thereby, the attribute id of the class Component is unique and allows identifying a single component. The type and version specify the type of component, which is later used in correspondence with the libraries. In addition, a component has interfaces, which are modeled as Input and
Fig. 2. Extended Component-Based Meta Model
Output classes. Thereby, the attribute type is used to clarify which output can be connected to corresponding inputs. The boolean attribute fixed signals whether an output or input is compulsory, i.e. whether it needs a connection. Outputs and inputs of the components are connected with each other as source and target using connections.
The remaining meta model describes three different kinds of component parameters: for configuration, control and monitoring. The parameters have been split to allow a clear differentiation and to implement different functionality in the development environment. The class ConfigurationParameter describes parameters intended for configuring a component during design time. The class has two attributes, which describe the name and value of the parameter. In addition, the top class has two child classes, which describe different parameter forms, i.e. a numerical or a text-based parameter.
The ...Number class forms a parameter in a numerical format, i.e. an integral or floating point number. The min- and max-attributes are the limits for the value. The ...List class describes text-based parameters, whose value is selected from a list of predefined values. These possible values are stored in the ...ListValue class. In this context, a distinction is made between the displayed value on model level and the coded value used in conjunction with the embedded target. These two forms have been chosen because they represent the most used parameters in general embedded applications. Furthermore, the parameters allow abstract and easy adjustment of the embedded system from the user perspective, as well as limit the possible inputs and avoid failures.
The ControlParameter class describes parameters used at runtime to control a component on the embedded target or, respectively, adjust its behavior. The layout of this class and its subclasses is similar to the classes for configuration. The only difference is the additional command attribute, which
Fig. 3. Integration Flow for Design, Configuration, Control and Monitoring
stores the command sent to the embedded target to modify the parameter at runtime. The MonitoringParameter class describes parameters monitored during runtime. The layout of the classes for monitoring is similar to the classes for control. The only difference is that there are no minimum and maximum limits for monitored numerical parameters, as the user cannot modify the value of these parameters (it is read from the embedded target). The classes for monitoring parameters are designed in the same way to allow displaying numerical values as well as interpreting received data and displaying its abstract representation on model level.
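The three parameter kinds of the meta model can be sketched as plain classes. This is a hedged illustration: class and attribute names follow the description of Figure 2, but the validation and encoding logic shown is an assumption about how a development environment might enforce the described rules.

```python
# Illustrative sketch of the three parameter kinds from the meta model:
# configuration (design time, range-limited), control (runtime, with a
# command), and monitoring (runtime, read-only, coded -> displayed).
from dataclasses import dataclass, field

@dataclass
class ConfigurationParameterNumber:
    name: str
    value: float
    min: float
    max: float

    def set(self, new_value):
        # Configuration values may only be set within the predefined limits.
        if not (self.min <= new_value <= self.max):
            raise ValueError(f"{self.name}: {new_value} outside limits")
        self.value = new_value

@dataclass
class ControlParameterList:
    name: str
    command: str                                 # command sent to the target
    values: dict = field(default_factory=dict)   # displayed -> coded value
    value: str = ""

    def set(self, displayed):
        if displayed not in self.values:
            raise ValueError(f"{self.name}: '{displayed}' not in list")
        self.value = displayed
        # Return the string a mapping algorithm might transmit.
        return f"{self.command} {self.values[displayed]}"

@dataclass
class MonitoringParameterList:
    name: str
    values: dict = field(default_factory=dict)   # coded -> displayed value
    value: str = ""

    def update(self, coded):
        # Monitored values are read back and decoded for display;
        # the user cannot set them directly.
        self.value = self.values.get(coded, f"unknown ({coded})")
```

The split mirrors the paper's rationale: limits and predefined lists constrain user input, while the coded/displayed distinction lets the model show "sensor failure" instead of a raw error code.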
V. CONFIGURATION, CONTROL AND MONITORING
The flow of design, configuration, control and monitoring in combination with our implemented development environment and the embedded target is depicted in Figure 3. In the first step, the development environment loads the libraries with the component descriptions and source code templates. In the next step, the user assembles and designs the system on model level in a component-based model. For assembling, he uses predefined components described in the libraries. These create objects in the component-based model that are instances of classes in the meta model. The library, for example, stores a multiplexer component, which can directly be used and inserted in the model. Thereby, the component is automatically instantiated and displayed with all its interfaces and parameters. A component may be integrated multiple times. During the design, the user also connects the components to each other using their interfaces.
In the third step, the components are configured according to their parameters. Thereby, the user adjusts the values of the configuration parameters of the components in the model. The value of a numerical parameter can only be set within its limits (predefined in the component description). For text-based parameters, only an element from the predefined
list can be chosen. These parameters adjust the behavior of the individual component and can only be modified during design time, because they influence the generated source code of the component.
After system design and configuration are completed, the components and their connections are checked before the source code is generated. The generated source code is in general split into multiple files according to the individual components and the structure of the system. Additionally, communication components are integrated to allow controlling and monitoring the system at runtime.
After integration on the embedded target, in the last step the user controls and monitors the system during runtime using the same component-based model he used to design the system. For controlling, the user modifies the values of the control parameters. Thereby, the predefined restrictions on numerical and text-based parameters also apply. According to the modifications, algorithms generate commands and send them to the embedded target. The values of the control parameters can also be adjusted during design time and thereby form predefined values for runtime execution. For monitoring at runtime, the values of the monitoring parameters are periodically read back from the embedded target, back-annotated and displayed in the model. Thereby, the respective interpretation and coding of the parameters is used.
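The runtime step above — one command per control-parameter change, plus periodic read-back of monitoring parameters — can be sketched as follows. The `Target` class is a hypothetical stand-in for the real serial link; the command-string layout is assumed for illustration.

```python
# Illustrative sketch of the runtime loop: control changes become
# commands, monitoring parameters are polled periodically.
import time

class Target:
    """Fake embedded target; a real implementation would talk RS232."""
    def __init__(self):
        self.state = {("01", "STATUS"): "00"}

    def send(self, command_string):
        pass  # would write the command string to the serial port

    def query(self, component_id, command):
        return self.state.get((component_id, command), "00")

def control(target, component_id, command, coded_value):
    # One user modification in the model becomes one command string.
    target.send(f"{component_id};{command};{coded_value}")

def monitor(target, component_id, commands, cycles=3, period=0.0):
    # Periodic read-back of one component's monitoring parameters; the
    # returned coded values would be back-annotated into the model.
    samples = []
    for _ in range(cycles):
        samples.append({c: target.query(component_id, c) for c in commands})
        time.sleep(period)
    return samples
```

In the real environment this polling runs in a background thread of the IDE, which is also why, as Section VII notes, response times are not deterministic.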
We integrated the functionality for design, configuration, control and monitoring in a model-based integrated development environment (IDE), depicted in Figure 4. The development environment is based on the open source platform Eclipse [19]. For model-based design, the Eclipse Modeling Framework (EMF) and the Graphical Modeling Framework (GMF) are used. The generation of the source code is performed with the xPand framework, and the checks in the model use the integrated Check language.
The component-based meta model (see Section IV) is integrated as an Ecore model in EMF. According to this model, we created three models in GMF for the graphical model-level editor. The first model describes the palette in the editor, i.e. the tools available to build the model. The gmfgraph model describes the graphical representation of the elements in the model, i.e. their shape, color, etc. The third model lays out a mapping between the three models, i.e. it creates relations between elements in the meta model, the palette and the graphical representations. After creation of these models, a model-based IDE can be generated by the framework. The result is depicted in Figure 4. The modeling area can be seen in the middle, with the tool palette on the right (which already includes components from libraries). The project management with projects and corresponding files is on the left side. The window at the bottom in the middle shows model-based properties of the currently selected element. The additional functionality to support all steps of configuration, control and monitoring is integrated in the window on the bottom right, which has been manually implemented as an Eclipse plug-in.
The window is used to load the libraries, access the different parameters of a component, generate the source code and
Fig. 4. Integrated Model-Based Developing Environment (1 - Modeling Area; 2 - Component Palette; 3 - Properties Window; 4 - Parameter Dialog)
communicate with the embedded target. A library is loaded from an XML file, whereby the components get directly integrated into the palette. When a component is drawn in the model, it is automatically instantiated with all its parameters and interfaces, which can directly be used for connections to other components. If a component is selected in the model, its parameters are displayed in the table and the user can modify them. The numerical parameters can be specified directly; the text-based parameters are displayed with a drop-down menu that offers the predefined values (monitoring parameters cannot be changed).
A hardware connection to the embedded system can be established after specifying the connection settings. If the connection is active and the user changes a control parameter in the table, corresponding commands are automatically generated and transmitted. Additionally, the monitoring parameters of the selected component are periodically read back, interpreted and displayed. The configuration parameters cannot be changed during an active connection, because this would require a regeneration and re-integration of the system. For easier handling, the window offers additional filters and search functions to locate parameters more easily.
VI. FPGA INTEGRATION
To demonstrate the functionality and integrity of the concept and the IDE, we implemented it along with a system for design, configuration, control and monitoring of Field Programmable Gate Array (FPGA) systems. FPGAs were chosen because of their high computing power, the possibility to run processes in parallel and their easy extensibility.
In the example, the components used for the component-based design of the system are integrated in two libraries. The
first library describes general hardware components which are often used in FPGA systems, for example multiplexers, timers, AND gates, OR gates, etc. External inputs and outputs are also integrated as components to allow connections to external peripherals. A second library describes specific components for sensor and actuator control.
Using these components, we built as an example a small cooling control system (depicted in Figure 5), which reads data from a temperature and a humidity sensor and controls a cooling fan. A sensor is connected to the system using an External Input component associated with a Sensor Preprocessing component, which is used to read the sensor value and correct it if necessary. The humidity sensor is integrated twice and connected to a Multiplexer and a Sensor Check component to switch automatically if a sensor fails. The Cooling Algorithm component takes the preprocessed sensor values and controls a fan using an additional Actuator Control component. In the libraries, the components are designed with compatible interfaces and communication protocols, or adapt automatically (e.g. the multiplexer) during code generation according to the connected components.
The respective component-based model of the system is shown in the modeling area of the IDE in Figure 4. In comparison to the embedded system, the component-based model does not include the additional components and buses for runtime control and monitoring, which are described in the next paragraph. All other components are described with their interfaces and parameters. As an example, the parameters of the Sensor Preprocessing component for the temperature sensor are depicted in the table of the Parameter window in Figure 4. For configuration, the Input Protocol and Input
Fig. 5. FPGA System Integration
Address can be modified. During runtime, the Offset and Slope Correction parameters can be controlled, and the Status, Sensor Value and Output Value of the component are monitored. The Input Address, for example, is a List Parameter, so the user can only choose from a list of predefined addresses. The parameter Status is also a List Parameter, so the coded value read from the embedded target is interpreted and displayed in common language (instead of complex error codes). All components are integrated on the embedded target using VHDL templates in xPand. The system architecture (i.e. connections and buses) is generated by an additional xPand template as a top-level VHDL structural description file.
In addition to the library components, further components and buses are automatically integrated during generation of the system to allow control and monitoring at runtime. These components include an RS232 interface for communication with a PC and an 8-bit microprocessor for processing commands. The microprocessor is connected to the components with three 8-bit buses. The first bus is used to identify the component, the second for sending commands and the last one for reading the status of the components. The buses are separated from the connections between the components and can be used independently; therefore they do not influence the functionality of the system.
To use the described model-based development environment (see Section V) for the FPGA integration, the communication interface is adjusted to communicate over an RS232 connection. Furthermore, compatible mapping algorithms are integrated to generate commands for control and to interpret received monitoring data. The commands are split into three parts: the first part is the ID of the component, the second is the command associated with the actual parameter, and the third is the value (for monitoring commands the third part is empty). These parts are sent together in one string via the interface to the microprocessor inside the FPGA.
The microprocessor separates these parts and sets the ID on the first bus to address the respective component. In the next
step, the command and value are sent on the command bus. If a monitoring command was sent, the answer of the component is read from the corresponding bus and sent to the PC.
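The three-part command protocol and the microprocessor's dispatch can be sketched as a round trip. Note the separator character and field contents are assumptions; the paper only states that ID, command and value are sent together in one string over RS232.

```python
# Hedged sketch of the command protocol (component ID; command; value)
# and the on-chip microprocessor's dispatch over the three buses.

SEP = ";"  # assumed field separator, not specified in the paper

def encode(component_id, command, value=""):
    # Monitoring commands leave the value part empty.
    return f"{component_id}{SEP}{command}{SEP}{value}"

def microprocessor_handle(message, components):
    """Split the message, select the component via the ID bus, drive the
    command bus, and for monitoring commands read the status bus."""
    component_id, command, value = message.split(SEP)
    target = components[component_id]   # ID bus: address the component
    if value:                           # control command: write the value
        target[command] = value
        return None
    return target.get(command)          # monitoring: read status for the PC

# Hypothetical component state table standing in for the hardware.
components = {"01": {"STATUS": "00"}}
```

A control message such as `encode("01", "OFFSET", "05")` writes a value into component 01, while `encode("01", "STATUS")` triggers a read-back whose answer is returned to the PC.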
VII. TESTS AND RESULTS
Different tests are carried out to evaluate the functionality and integrity of the concept and the developed IDE. The tests are mainly performed using a XUP Virtex-II Pro development system, including a Xilinx Virtex-II Pro FPGA [20] as well as interfaces for communication and programming.
The size and speed of the implemented system depend on the type and number of components and connections. The test system (see Section VI) uses around 6% of the logic resources of the FPGA and runs at a frequency of 100 MHz, limited by the layout of the used components. Thereby, the resources of the additional microprocessor and communication components are fixed at approx. 1%. The resources for the additional buses as well as the functions for control and monitoring depend on the number and layout of the components. The maximum speed is in general limited by the integrated microprocessor to approx. 150 MHz, because the components communicate directly with the microprocessor and therefore need the same clock signal. The communication could also be designed independently to allow different clock frequencies, but this would increase the logic resources for buses and interfaces.
In tests, the communication, including processing modifications and sending commands as well as receiving monitored data and displaying it in the model, worked as described in Section V. There is only a time delay of up to approx. 250 ms between a change in the model and the reaction of the embedded target, as well as between a change in the embedded target and its display in the model. The reasons are the slow RS232 communication interface and the mapping algorithms. In addition, as the development environment is designed multi-threaded, it cannot be determined when the thread responsible for communication or processing is executed. Therefore, while the hardware runs in real time, real-time control and monitoring are not possible.
As a result, the component-based model allows a rapid design of the system and reuse of existing components. After assembly and configuration of the components, the generated VHDL code can be directly integrated using the Xilinx IDE tools. During runtime, the components can be controlled and monitored using the implemented IDE to support adjustment and monitoring on an abstract level.
The functionality for configuration, control and monitoring of the individual components already needs to be considered during the design of the component. This is a challenge, because in the component-based design process only existing parameters can be used on the abstract level. For example, if a parameter is not integrated in the design of a component, the user needs to work on the low level, manually adjusting the source code or using standard methods for control and monitoring. Regarding control and monitoring, there is an additional consideration, because every parameter normally increases the size and complexity of the component and may reduce its speed. However, with regard to rapid prototyping systems, size and speed are not directly critical, as the system is integrated on a high-power computing platform and not on the final target platform.
Furthermore, the tests showed that, besides the dependencies concerning the interfaces, there are further dependencies between different components and also between parameters of the same component. For example, one parameter can influence the value or the availability of another parameter. Currently, these dependencies are manually implemented according to the individual component and checked mainly using the Check language. This is error-prone and does not follow the concept of a continuous model-based approach. Moreover, the tests showed that expandability, in terms of new components, is not comfortable, because the library and templates need to be modified manually.
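A declarative rule table is one way such inter-parameter dependencies could be made less error-prone than hand-written checks. This sketch only illustrates the idea; the parameter names and the rule are invented and are not part of the presented framework.

```python
# Hedged illustration: dependencies between parameters of a component,
# expressed as declarative rules instead of hand-written Check-language
# code. All names below are invented for illustration.

def check_dependencies(params, rules):
    """Return the names of parameters whose dependency rule is violated."""
    return [name for name, rule in rules.items() if not rule(params)]

# example: the validity of one parameter depends on the value of another
params = {"filter_enabled": True, "filter_order": 0}
rules = {
    # if the filter is enabled, its order must be at least 1
    "filter_order": lambda p: not p["filter_enabled"] or p["filter_order"] >= 1,
}
```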
VIII. CONCLUSION AND OUTLOOK
In this paper, we presented a concept for expanding component-based models for runtime control and monitoring, to support abstract adjustment and debugging of embedded systems. Thereby, the practicability of a continuous abstract development has been increased. In comparison to existing techniques, the user does not only design on the model level, but also controls and monitors the system from the same abstract component-based model and does not need to use low-level domains or tools. A single model describes the structure of the system, allows adjustment and shows the status of its components during design time and runtime. All intermediate steps, from the model level to the embedded target and vice versa, are carried out by algorithms in the background. By generating the appropriate source code, the concept can be applied to rapid prototyping systems and allows adjusting, controlling and monitoring systems at runtime.
The proposed meta-model is capable of abstractly describing different aspects of a system and its components. The user uses libraries with predefined components to rapidly build the system. The components are connected using their interfaces and configured according to their parameters. During runtime, the components are controlled and monitored using different parameters in the same model. The integrated development environment allows performing all steps of design, configuration, control and monitoring on the model level. The concept has been implemented along with an FPGA integration to show functionality and feasibility on real systems. Different tests have been carried out to evaluate size, speed and maintainability.
In the future, the control and monitoring parameters will become optional for implementation, so that the user can decide on their usage and the additional resources. In addition, the IDE will be expanded to allow an easier specification of new libraries, as this is currently performed manually. In this context, the automatic integration of external modules as black-box objects will also be added. Additionally, a method will be implemented to check whether the system and components on the embedded target match the component-based model in the IDE. Moreover, the concept will be applied to other platforms and more complex systems to evaluate scalability and performance. The current meta-model will be enhanced with respect to dependencies of components and parameters as well as possibilities for hierarchical structures.
REFERENCES
[1] J. Dannenberg and C. Kleinhans, "The coming age of collaboration in the automotive industry," Mercer Management Journal, vol. 17, pp. 88–97, 2004.
[2] Object Management Group (OMG), "Meta Object Facility (MOF) 2.0 Core Specification," 2004.
[3] iABG, "V-Model," 1997. [Online]. Available: http://www.v-modell.iabg.de/vm97.html
[4] A. Korff, Modellierung von eingebetteten Systemen mit UML und SysML. Spektrum Akademischer Verlag, 2008.
[5] Object Management Group (OMG), "UML Profile for MARTE: Modeling and Analysis of Real-Time Embedded Systems, Specification, Version 1.0," 2009. [Online]. Available: http://www.omgmarte.org/
[6] Object Management Group (OMG), "Unified Modeling Language (UML) Specification, Version 2.2," 2008. [Online]. Available: http://www.uml.org/
[7] G. Heineman and W. T. Councill, Component-Based Software Engineering. Addison-Wesley Longman, Amsterdam, 2001.
[8] J. Kalaoja, E. Niemela, and H. Perunka, "Feature modelling of component-based embedded software," in Software Technology and Engineering Practice, 1997. Proceedings., Eighth IEEE International Workshop on [incorporating Computer Aided Software Engineering], 1997, pp. 444–451.
[9] D. Hammer and M. Chaudron, "Component-based software engineering for resource-constraint systems: what are the needs?" in Object-Oriented Real-Time Dependable Systems, 2001. Proceedings. Sixth International Workshop on, 2001, pp. 91–94.
[10] W. Damm, A. Votintseva, E. Metzner, and B. Josko, "Boosting re-use of embedded automotive applications through rich components abstract," Proceedings, FIT 2005 - Foundations of Interface Technologies, 2005.
[11] H. Li, P. Lu, M. Yao, and N. Li, "SmartSAR: A Component-Based Hierarchy Software Platform for Automotive Electronics," in Embedded Software and Systems, 2009. ICESS '09. International Conference on, 2009, pp. 164–170.
[12] Z. Gu, S. Wang, S. Kodase, and K. Shin, "Multi-view modeling and analysis of embedded real-time software with meta-modeling and model transformation," in High Assurance Systems Engineering. Proceedings. Eighth IEEE International Symposium on, 2004, pp. 32–41.
[13] S. Illner, A. Pohl, H. Krumm, I. Luck, D. Manka, and T. Sparenberg, "Automated runtime management of embedded service systems based on design-time modeling and model transformation," in Industrial Informatics, 2005. INDIN '05. 2005 3rd IEEE International Conference on, 2005, pp. 134–139.
[14] G. Waignier, S. Prawee, A.-F. Le Meur, and L. Duchien, "A Framework for Bridging the Gap Between Design and Runtime Debugging of Component-Based Applications," in 3rd International Workshop on Models@run.time, Toulouse, France, 2008.
[15] H. Song, G. Huang, F. Chauvel, and Y. Sun, "Applying MDE Tools at Runtime: Experiments upon Runtime Models," in Proceedings of the 5th International Workshop on Models at Run Time, Oslo, Norway, 2010.
[16] P. Graf and K. D. Muller-Glaser, "ModelScope: inspecting executable models during run-time," in ICSE Companion '08: Companion of the 30th International Conference on Software Engineering. New York, NY, USA: ACM, 2008, pp. 935–936.
[17] T. Schwalb, P. Graf, and K. D. Muller-Glaser, "Architektur für das echtzeitfähige Debugging ausführbarer Modelle auf rekonfigurierbarer Hardware," in Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen. Universitätsbibliothek Berlin, 2009, pp. 127–137.
[18] T. Schwalb, P. Graf, and K. D. Mueller-Glaser, "Monitoring Executions on Reconfigurable Hardware at Model Level," in 5th International MODELS Workshop on Models@run.time, Oslo, Norway, Oct. 2010.
[19] Eclipse Foundation, "Eclipse Modeling Project," 2010. [Online]. Available: http://www.eclipse.org/modeling/
[20] Xilinx, Virtex-II Pro and Virtex-II Pro X FPGA User Guide, v4.2, November 2007.
A model-driven based framework for rapid parallel SoC FPGA prototyping
Mouna Baklouti†∗, Manel Ammar†, Philippe Marquet∗, Mohamed Abid† and Jean-Luc Dekeyser∗
∗LIFL, Univ. Lille 1, INRIA Lille Nord Europe, UMR 8022, CNRS, F-59650, Villeneuve d'Ascq, France
Email: {mouna.baklouti,philippe.marquet,jean-luc.dekeyser}@lifl.fr
†CES Laboratory, Univ. Sfax, ENIS School, BP 1173, Sfax 3038, Tunisia
Email: [email protected], [email protected]
Abstract—Model-Driven Engineering (MDE) based approaches have been proposed as a solution to cope with the inefficiency of current design methods. In this context, this paper presents an MDE-based framework for rapid SIMD (Single Instruction Multiple Data) parametric parallel SoC (System-on-Chip) prototyping, to deal with the ever-growing complexity of the design process of such embedded systems. The design flow covers the design phases from system-level modeling to FPGA prototyping. The proposed framework allows the designer to easily and automatically generate a VHDL parallel SoC configuration from a high-level system specification model using the MARTE (Modeling and Analysis of Real-Time and Embedded systems) standard profile. It is based on an IP (Intellectual Property) library and a basic parallel SoC model. The generated parallel configuration can be adapted to the data-parallel application requirements. In an experimental setting, four steps are needed to generate a parallel SoC: data-parallel programming, SoC modeling, deployment and the generation process. Experimental results for a video application validate the approach and demonstrate that the proposed framework facilitates parallel SoC exploration.
I. INTRODUCTION
With the rising complexity of multimedia and radar/sonar signal processing applications, parallel programming techniques and multi-core Systems-on-Chip (SoC) are more and more used. Single Instruction Multiple Data (SIMD) systems have been shown to be powerful executors of data-intensive applications [1], especially in the pixel processing domain [2]. Many SIMD on-chip architectures, in particular based on FPGA (Field Programmable Gate Array) devices, have emerged to accelerate specific applications [3]–[6]. Compared to ASICs (Application Specific Integrated Circuits), FPGA devices are characterized by an increased capacity, smaller non-recurring engineering costs, and programmability [7]. Facing the ever-growing challenge of parallel SoC design, most of the proposed SIMD solutions are application-specific SoCs which lack flexibility: changing a SoC configuration may necessitate extensive redesign. While these specific systems provide good performance, they require long design cycles. The size of a parallel SoC and the complexity involved in its design are continuously outpacing designer productivity. An important challenge is to find adequate design methodologies that
efficiently address the issues of large and complex SoCs.

Nowadays, Computer-Aided Design tools are imperative to automate complex SoC design and reduce the time-to-market. Two approaches have been proposed to cope with this problem. Firstly, IP (Intellectual Property) reuse and platform-based design [8] are used to maximize the reuse of pre-designed components and to allow the customization of the system according to system requirements. Secondly, the Model-Driven Engineering (MDE) [9] approach has been introduced to raise the design abstraction level and to reduce design complexity. It stresses the use of models in the embedded systems development life cycle and argues for automation via model transformation and code generation techniques. Complex systems can be easily understood thanks to such abstract and simplified representations. Approaches based on MDE have been proposed as an efficient methodology for embedded systems design [10], [11]. An interesting model specification language is UML (Unified Modeling Language) [12], which proposes general concepts for expressing both behavioral and structural aspects of a system. The latest release of UML (2.0) has support for profiles that enable the language to be applied to particular application and platform domains with sophisticated extension mechanisms. As an example, the MARTE (Modeling and Analysis of Real-Time and Embedded systems) standard profile [13] is proposed by the OMG to add capabilities to UML for model-driven development of real-time and embedded systems. The MARTE profile enhances the possibility to model SW, HW and the relations between them.
Using the proposed framework, the designer focuses on modeling his needed SIMD configuration and not on how to implement it, since the system modeling is independent of any implementation detail. The model is specified using a unified language. The presented design flow is a library-based method that hides unnecessary details from the high-level design phases and provides an automated path from UML design entry to FPGA prototyping. So, it can be easily used by designers who are not experts in on-chip HW implementation. This makes our approach preferable to hand-crafted VHDL coding.
System concerns are represented in separate dimensions:
data-parallel coding, SoC modeling, IP selection and implementation. The implementation is performed via the generation tool, based on a model-to-text transformation using Acceleo [14]. The framework uses an IP library with various components (processors, memories, interconnection networks...) that can be selected in the deployment process to generate the needed SIMD configuration. The modeled SoC has to conform to a basic parallel SoC model, proposed in previous work [15], which is parametric, flexible and programmable.
In an experimental setting that validates our approach, we consider a video color conversion application, where we explore different parallel system configurations and decide on the best one to run the application. Experimental results show that the proposed framework considerably reduces design costs and facilitates modifying the system model and regenerating the implementation without relying on costly re-implementation cycles. Using the framework, we can create SIMD implementations that are fast enough to meet demanding processing requirements, are automatically generated from a high-level specification model to meet time-to-market constraints, and can easily be updated to provide different functionality.
The remainder of this paper is organized as follows. Section 2 discusses related work on model-based approaches to generate on-chip multi-processor or massively parallel systems. Section 3 presents the proposed MDE framework. A case study, which illustrates and validates the framework, is described in Section 4. The FPGA platform is chosen as the target platform since it is a good alternative to test and implement various parallel SoC configurations. Finally, Section 5 draws the main conclusions and proposes future research directions.
II. RELATED WORK
The high-level SoC design methodology is a rapidly emerging research area. There are many recent research efforts on embedded systems design using an MDE approach. In this context, different high-level synthesis approaches are currently being studied for different specification languages. For example, xtUML [11] defines an executable and translatable UML subset for embedded real-time systems, allowing the simulation of UML models and code generation for C oriented to different microcontroller platforms. In [16], an approach using VHDL synthesis from UML behavioral models is presented. The UML models are first translated into textual code in a language called SMDL, which can then be compiled into a target language such as VHDL. The translation from UML models to SMDL is performed using the aUML toolkit. In [17], a transformation tool called MODCO is presented, which takes a UML state diagram as input and generates HDL output suitable for use in FPGA circuit design. A HW/SW co-design is performed based on the MDA approach. XML is used to generate HDL from high-level UML diagrams. In these two works, only state-machine HW designs are described. In [18], a UML-based multiprocessor SoC design framework called Koski is described. An automated architecture exploration based on the system models in UML, as well as the automatic back and forward annotation of information in the design flow, can be performed. The proposed design flow provides an
Fig. 1. Parallel SIMD SoC configuration: 4 PEs, a 2D mesh neighboringnetwork and a crossbar based mpNoC
automated path from UML design entry to FPGA prototyping. The final implementation is application-specific. The proposed approach is based on synthesizable library components that are automatically tuned for a specific application according to the results of the architecture exploration.
Our approach is related to the design of massively parallel SoC and covers the design phases from system-level modeling and parallel programming to FPGA prototyping using the notion of transformations between models. The DaRT [10] (Data Parallelism to Real Time) project also proposes an MDA-based approach for SoC design that has many similarities with our approach in terms of the use of meta-modeling concepts. The DaRT work defines MOF-based meta-models to specify the application, the architecture, and the SW/HW association, and uses transformations between models as code transformations to optimize an association model. In DaRT, no data-parallel coding is specified and the code generation at RT (Register Transfer) level is dedicated to specific HW accelerators.
The proposed framework, presented in this paper, takes advantage of the MDE notion of transformation between models to generate a complete SIMD parallel SoC at RT level dedicated to computing data-intensive applications. Our approach is based on synthesizable library components and a few model transformations to generate the synthesizable VHDL code of the modeled SIMD SoC.
III. SIMD FRAMEWORK
The proposed framework is dedicated to generating different SIMD configurations derived from the basic parallel SoC model [15]. These configurations can then be directly simulated using available simulation tools or prototyped on FPGA devices using appropriate synthesis tools. Figure 1 illustrates a SIMD parallel SoC configuration composed of four Processing Elements (PE) connected in a 2D mesh topology. To handle parallel I/O transfers and point-to-point communications, a crossbar-based mpNoC (massively parallel Network on Chip) [19] is integrated. To accelerate and facilitate the design of a SIMD configuration, a model-driven framework is proposed. The framework allows the designer to model his needed configuration, derived from the basic provided SIMD SoC model. The designer
Fig. 2. Framework concepts
has to specify the system's parameters (number of PEs, memory size, neighboring topology) and the different components that will be integrated (mpNoC, neighborhood network, devices). The designer also has to code his data-parallel program using the specified data-parallel instruction set, depending on the chosen processor IP. A help manual is provided to the designer to facilitate the parallel programming and describe the different instructions to use according to the chosen processor.
The framework, in particular the deployment phase, is based on an IP library which contains dedicated IPs that can be directly integrated in the system. Providing an extensive library requires a significant effort. Currently, the IP library contains processors (MIPS, OpenRisc, NIOS II), networks (crossbar, shared bus and multi-stage networks), memories and some devices. To add new IP resources, the IP provider must adapt the IP to the architecture-specific interface (described in the help manual). Thus, a new component can be put into the library by following the interface format requirements. To assemble processors in the SIMD design, we distinguish two methodologies: reduction and replication. Reduction consists in reducing an available processor in order to build a PE with a small size that can be fitted in large quantities into an FPGA device. Replication consists in implementing the ACU as well as the PEs with the same processor IP, so that the design process is faster. There is clearly a compromise between the design time and the number of integrated PEs in the SIMD configuration, depending on the applied design methodology. The designer can select the suitable methodology according to his application constraints. The three processors of the IP library are provided with both methodologies.
At this step, the designer can generate different implementations while integrating different IPs. The deployment is also responsible for loading the binary data-parallel program into the ACU instruction memory. The SIMD generation approach is depicted in Figure 2. This approach allows a flexible and rapid platform development and increases platform end-user productivity.
To generate a SIMD configuration at RT level, an MDE-based design flow, presented in Figure 3, has been developed. The proposed flow uses two meta-models: the MARTE meta-model and the Deployed meta-model. All meta-model concepts are specified as UML classes and then converted into Eclipse/EMF models [20]. The generation process is based on model transformations implemented as QVT (Query, Views, Transformations) resources, standardized by the OMG.

Fig. 3. MDE-based design flow
The designer can generate a SIMD massively parallel SoC configuration in four steps: data-parallel programming, SoC modeling, deployment and then implementation generation.
A. Data-parallel programming
The designer has to write his data-parallel program using the provided data-parallel instruction set. Based on the available processor compilers (miniMIPS, OpenRisc 1200 and NIOS II) in the IP library and the developed special parallel instructions, the designer can generate his parallel program binary. For the miniMIPS processor, an extended parallel MIPS assembly language [21] is developed. For the OpenRisc and NIOS processors, high-level asm macros are defined; they can be used in any C program for control and communication instructions. The NIOS II IDE (Integrated Development Environment) and the OR1Ksim [22] tools are used with the NIOS and OpenRisc processors, respectively. The developed SW chain is a multi-compiler chain that is responsible for generating the SW code depending on the specified target processor.
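The idea of such a multi-compiler chain can be sketched as a simple dispatch on the target processor. The tool names and flags below are placeholders for illustration, not the actual commands of the NIOS II IDE, OR1Ksim or the extended MIPS assembler.

```python
# Sketch of a multi-compiler SW chain: pick the toolchain from the processor
# IP selected at deployment. Command names and flags are assumptions.

TOOLCHAINS = {
    "miniMIPS": ["pmips-asm"],          # extended parallel MIPS assembler
    "OpenRisc": ["or32-gcc", "-O2"],    # C with l.* asm macros, run on OR1Ksim
    "NIOS II": ["nios2-gcc", "-O2"],    # C with IOWR/IORD macros, NIOS II IDE
}

def build_command(processor, source, binary):
    """Assemble the (hypothetical) compiler command line for one target."""
    if processor not in TOOLCHAINS:
        raise ValueError("no toolchain for " + processor)
    return TOOLCHAINS[processor] + [source, "-o", binary]
```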
Some particular instructions are specified to be used in the programs as delimiters for parallel and sequential code. Table I shows three examples of instructions from the provided data-parallel instruction set. Clearly, these instructions depend on the processor instruction set. At this step, a SW library is provided. It includes pre-implemented application algorithms such as matrix multiplication, an FIR (Finite Impulse Response) filter, a reduction algorithm, image rotation, color conversion (RGB to YIQ, RGB to CMYK), etc.
After generating the executable SW, the second step consists in modeling the HW system.
B. SoC modeling
The designer must specify the architecture models using any UML 2.0 compliant tool, applying the MARTE profile. The most important UML diagrams used in our approach to specify the system are the Class, Composite Structure and Deployment diagrams. The modeling of SIMD SoC configurations relies on the use of UML and the MARTE profile. Three MARTE packages are used: the Hardware Resource Modeling (HRM), the Repetitive Structure Modeling (RSM) and the Generic Component Model (GCM) packages [23]. The HRM is intended to describe the HW platform by specifying its different elements. At the end, the modeled HW resources represent the whole system. In our approach, only the HRM HW Logical
TABLE I
SIMD PARALLEL MACROS

P REG SEND (reg,dir,dis,adr)
  Description: Neighboring SEND: send data (in reg) from source to destination via the neighboring network.
  miniMIPS: p addi r1,r0,dir; p addi r1,r1,dis; p addi r1,r1,adr; p SW reg,0(r1)
  OpenRisc: l.addi r1,r0,dir; l.addi r1,r1,dis; l.addi r1,r1,adr; l.sw 0x0(r1),reg
  NIOS: IOWR(WRP B, addr, data), where addr(11)='0', addr(10:3)=dis and addr(2:0)=dir

P REG REC (reg,dir,dis,adr)
  Description: Neighboring RECEIVE: receive data (in reg) from the source.
  miniMIPS: p addi r1,r0,dir; p addi r1,r1,dis; p addi r1,r1,adr; p LW reg,0(r1)
  OpenRisc: l.addi r1,r0,dir; l.addi r1,r1,dis; l.addi r1,r1,adr; l.lwz reg,0x0(r1)
  NIOS: data = IORD(WRP B, addr), where addr(11)='0', addr(10:3)=dis and addr(2:0)=dir

P GET IDENT (reg)
  Description: read identity
  miniMIPS: p lui r1,0x2; p ori r1,r1,0; p LW reg,0(r1)
  OpenRisc: l.movhi r1,0x2; l.lwz reg,0x0(r1)
  NIOS: NIOS2 READ CPUID(id)
Fig. 4. PU modeling in the case of a linear configuration
sub-package is used. It allows describing information about the kind of components (HwRAM, HwProcessor, HwBus, etc.), their characteristics, and how they are connected to each other. The architecture is graphically specified at a high abstraction level with HRM. Multidimensional data arrays and powerful constructs of data dependencies are managed thanks to the RSM package. It defines stereotypes and notations to describe in a compact way the regularity of a system's structure or topology. The structures considered are composed of repetitions of structural elements interconnected via a regular connection pattern. It provides the designer a way to efficiently and explicitly express models with a high number of identical components. The concepts found in this package allow concisely modeling large regular HW architectures such as multi-processor architectures. Finally, the GCM package is used to specify the nature of the flow-oriented communication paradigm between SoC components.
The modeling process is done in an incremental way. The designer begins by modeling the elementary components: PE, ACU, memories, mpNoC and I/O device. Then, the whole configuration is modeled through successive compositions. Figure 4 illustrates the elementary processing unit (PU). It is composed of a PE and its local data memory. The class named "Elementary processor" is stereotyped HwResource in the case of the reduction methodology, or HwProcessor in the case of the replication methodology. It has a bidirectional port, stereotyped FlowPort, to connect the data memory. The class "Local memory" is stereotyped HwMemory with a parametric tagged value addressSize. In the same manner, the ACU memories have a parametric size. The PU has one port to communicate with the ACU and a number of neighboring ports equal to the number of its neighboring connections. In Figure 4, it has two neighboring ports, since each PE can communicate with its neighbor in the east or west direction. If the designer
Fig. 5. 1D configuration modeling
chooses to integrate the mpNoC in the SIMD configuration, he must add two ports, "mpNoC in" and "mpNoC out", to ensure the communications through the mpNoC.
We distinguish between 1D and 2D mppSoC configurations. They differ in the modeling of the interconnections between PUs. In the case of a 1D configuration, the number of PEs is equal to the tagged value Shape of the stereotype Shaped applied on the PU class. To model a linear neighboring network, the interconnection link between the East and West ports is stereotyped InterRepetition. Since the PU on one edge is not connected to the PU on the opposite edge, the tagged value isModulo is set to false. The repetitionSpaceDependence attribute is used to specify the position of the neighbor of the element on which the inter-repetition dependency is defined. In this case, its value is equal to {1}, since each PE[i] is connected to PE[i+1]. Figure 5 shows the mppSoC configuration modeling integrating a linear neighboring network and the mpNoC. The link connector, stereotyped Reshape, between the PU and the ACU shows that each PU is connected to the ACU in order to receive the execution orders. To connect the PUs with the mpNoC, two Reshape connectors are expressed between the two ports of each PU and the corresponding ports of the mpNoC. The latter has a multiplicity equal to 1. The repetitionSpace tag is equal to the number of PEs. The patternShape tag is equal to 1, indicating that the mpNoC port is distributed among the ports of the PEs. The same modeling is followed in the case of a ring neighboring network; the only difference is the isModulo tagged value, which is set to true.
In the same manner, we can model a 2D SIMD configuration. We just need to know how to model the neighboring links based on the MARTE profile. Figure 6 presents a configuration
Fig. 6. 2D configuration modeling (with a mesh neighboring network)
modeling integrating a 2D mesh neighboring network. We notice that the PU class is modeled with 4 ports dedicated to inter-PE communications in the east, west, north and south directions. In this case, the repetitionSpaceDependence tagged value is equal to {1,0}, indicating that each PE[i,j] is connected to its neighbor PE[i+1,j] to ensure the east and west links. In addition, this tagged value is equal to {0,1} for the north and south links, to ensure that each PE[i,j] is connected to its neighbor PE[i,j+1]. For a mesh topology, the tagged value isModulo is set to false, since there are no connections on the edges. However, it is set to true in the case of a torus topology. The Xnet network is modeled like the 2D mesh; the designer just has to model the links on the diagonals.
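The neighbor resolution implied by the InterRepetition stereotype can be sketched as follows: repetitionSpaceDependence supplies the offset vector and isModulo decides whether edges wrap (ring, torus) or not (linear, mesh). This is an illustrative model of the semantics, not code from the framework.

```python
# Illustrative model of InterRepetition neighbor resolution:
# index  = position of the PE in the repetition space
# shape  = Shape of the repeated PU
# dependence = repetitionSpaceDependence offset vector
# is_modulo  = isModulo tagged value

def neighbor(index, shape, dependence, is_modulo):
    """Return the neighbor of PE[index], or None on a non-wrapping edge."""
    out = []
    for i, s, d in zip(index, shape, dependence):
        n = i + d
        if is_modulo:
            n %= s               # ring / torus: edges wrap around
        elif not 0 <= n < s:
            return None          # linear / mesh: edge PE has no neighbor
        out.append(n)
    return tuple(out)
```

For example, in a linear network of 4 PEs with dependence {1}, PE[3] has no east neighbor, while with isModulo set to true (ring) it wraps to PE[0]; in a 2D mesh with dependence {1,0}, PE[1,2] resolves to PE[2,2].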
C. Deployment
As described in the previous subsection, a SIMD configuration can be modeled at a high abstraction level. To generate an executable low-level model, the elementary modeled components should be associated with an existing implementation based on the provided IP library. The deployment allows moving from a general platform (Platform Independent Model) to a specific platform (Platform Specific Model), according to the MDA approach. At this step, the designer can generate and evaluate different configurations. In fact, the deployment makes it possible to specify an implementation for each elementary concept among a set of possibilities. It concerns the processor IP, the instruction memory, the mpNoC interconnection network and the I/O device IP if it exists. At this stage, the binary data-parallel program is specified as the memory initialisation file of the main instruction memory. In fact, in our case we deal with a single data-parallel program (one of the advantages of a SIMD architecture), so no mapping of tasks needs to be performed. Thus, the mapping of the application to the hardware architecture is systematic. Figure 7 expresses the deployment of a "hardwareIP" on the "Elementary processor". The concept of codeFile is used to specify the code.
A final transformation chain, MARTE2VHDL, is developed to generate the synthesizable VHDL implementation of the modeled SIMD configuration.
D. Implementation generation
The MARTE2VHDL transformation is based on the Deployed model and the IP library to generate the corresponding
Fig. 7. Deployment of the PE
synthesizable VHDL implementation, depending on the modeled configuration. A model conforming to the Deployed meta-model is generated via the UML2MARTE transformation. This model is then analysed in order to deduce the specified parameters. The number of PEs, the memory size, the processor design methodology and the topology of the neighboring network are extracted from the UML diagrams. The other configurable components (processor IP, mpNoC interconnection network, etc.) are specified in the deployment step. The developed model-to-text transformation is based on templates. It uses the Acceleo tool [14], which is part of the Eclipse Model to Text project and provides an implementation of the MOF Model to Text OMG standard. The following code example illustrates how the type of the processor is deduced (getPECodeFile) in the generation step:
[query public getPECodeFile(m : Model) : CodeFile =
  self.ownedElement
    ->select(oclIsKindOf(CodeFile) and name = 'PEImpl codefile')
    ->asOrderedSet()->first()]
Using this MDE-based framework, SIMD SoC design is accelerated: the VHDL implementation is automatically generated through model transformations, and the SoC model remains independent of any implementation detail, which makes the design flow easy to use. The proposed framework also facilitates SoC exploration and helps the user choose the best configuration for a given application. The next section illustrates the use of this framework in a real application context.
IV. CASE STUDY
An RGB-to-CMYK color conversion application, widely used in color printers and extracted from the EEMBC benchmark [24], has been developed with the provided data-parallel instruction set. The program is written using high-level macros (Table I). The binary is then generated for the selected processor by invoking the corresponding compiler. The proposed framework allowed us to generate different suitable SIMD configurations, and an FPGA is used for real experiments.
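The paper does not reproduce the benchmark source, but the kernel at the heart of this application is the classic naive RGB-to-CMYK conversion. The following C sketch is our own illustration (neither the EEMBC code nor the framework's data-parallel macros): it shows the per-pixel computation that each PE would apply to its own slice of the image.

```c
#include <stdint.h>

/* Illustrative naive RGB-to-CMYK kernel (a sketch, not the EEMBC or
 * framework code). In the SIMD system, each PE would run this same
 * scalar body on its own slice of the pixels. */
static uint8_t min3(uint8_t a, uint8_t b, uint8_t c) {
    uint8_t m = a < b ? a : b;
    return m < c ? m : c;
}

void rgb_to_cmyk(const uint8_t *rgb, uint8_t *cmyk, int npixels) {
    for (int i = 0; i < npixels; i++) {
        uint8_t c = 255 - rgb[3 * i];      /* cyan    = 255 - R */
        uint8_t m = 255 - rgb[3 * i + 1];  /* magenta = 255 - G */
        uint8_t y = 255 - rgb[3 * i + 2];  /* yellow  = 255 - B */
        uint8_t k = min3(c, m, y);         /* black: gray part  */
        cmyk[4 * i]     = c - k;           /* remove the gray   */
        cmyk[4 * i + 1] = m - k;           /* component from    */
        cmyk[4 * i + 2] = y - k;           /* the color planes  */
        cmyk[4 * i + 3] = k;
    }
}
```

Each pixel is processed independently, which is what makes the application a natural fit for the SIMD configurations explored here.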
A. HW platform
The development board is the Altera DE2-70 [25], equipped with a Cyclone II EP2C70F896C6 FPGA offering 68,416 Logic Elements (LEs) and 250 M4K RAM blocks. The software tools are Quartus II v9.0, which allows synthesizing
TABLE II
SYNTHESIS RESULTS

PE   Proc. IP   Design   LEs (%)   ACU mem.   PE mem.   Memory (%)
                                   (bytes)    (bytes)
 8   miniMIPS   rep.       71        4096       1024       18
32   miniMIPS   red.       93        4096       2048       66
 8   OpenRisc   rep.       91        4096       1024       22
16   OpenRisc   red.       98        4096       4096       36
48   NIOS       rep.       79        8192        512       87
and prototyping the design on the FPGA, and ModelSim-Altera v6.4a, which allows simulating and debugging it. To test the color-conversion application, two peripherals are used: a 1-Mpixel TRDB D5M camera and an 800×RGB×480-pixel TRDB LTM LCD display. Two external memories, an SDRAM and an SRAM, are also used: the implemented VHDL camera driver stores the captured data directly in the SDRAM, to be read by the PEs as required, and a VHDL SRAM controller stores the processed data in the SRAM and fetches it as required by the LCD.
B. SIMD configurations
For the tested application, only the mpNoC has been integrated in the system model (no neighboring network), since parallel data transfers must be guaranteed: all PEs need to read data from the SDRAM and then write data to the SRAM. In this example, the processing of each pixel must not exceed 10.42 ns to ensure real-time processing, so that an 800×480-pixel frame is processed within 4 ms. The same system model is used for all implementation generations. It is described in a composite structure diagram, as illustrated in Figure 8, which models all the hardware components composing the system as well as their connections. Only the deployment diagram changes from one configuration to another, in order to use different processors, memories and interconnection networks. Changing from one SIMD configuration to another takes just a few milliseconds, and the re-generation process is performed rapidly. The low-level synthesizable models from the IP library are used for the final implementation. The generated configurations can be directly simulated to measure execution times and assess the performance of the modeled SIMD systems.
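As a quick sanity check of the quoted figures (an arithmetic restatement of the text, not part of the authors' toolchain), the per-pixel deadline follows directly from the frame size and the 4 ms frame budget:

```c
/* Sanity check of the timing budget quoted above: an 800x480 frame
 * processed within 4 ms leaves 4 ms / 384000 pixels, i.e. about
 * 10.42 ns of processing time per pixel. */
double per_pixel_budget_ns(int width, int height, double frame_ms) {
    double pixels = (double)width * (double)height;
    return frame_ms * 1e6 / pixels;  /* 1 ms = 1e6 ns */
}
```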
Table II shows the synthesis results obtained by varying the SIMD parameters and components while integrating the maximum number of PEs on the targeted Cyclone II FPGA. All these configurations integrate a crossbar-based mpNoC, since the crossbar allows the fast, non-blocking parallel data transfers necessary for real-time image processing applications. We clearly notice that the reduction methodology allows integrating a larger number of PEs on the chip than the replication methodology: since the miniMIPS is smaller than the OpenRisc, we can reach 32 PEs on the FPGA, compared to 16 PEs when using the OpenRisc IP. The implementation results also show that the NIOS processor is optimized for the Altera FPGA: more than 48 PEs can be integrated on the chip.
Figure 9 shows the execution time results obtained when prototyping the generated configurations on the Cyclone II
Fig. 9. Execution times for different SIMD configurations
TABLE III
DIFFERENCES BETWEEN TWO DESIGN SOLUTIONS

                            Generic implem. with   Generic implem. with
SIMD config.                a reduced processor    a replicated processor
Design time using
the framework               15 minutes             40 seconds
Design time without
using the framework         1 month                7 days
FPGA. These times are measured by running the color-conversion application on the parallel FPGA-based configurations. The different SIMD SoC configurations achieve better results as the number of PEs working in parallel increases. The performance of the system is also closely related to the processor type and the design methodology. The experimental results show that a SIMD configuration composed of more than 8 PEs is needed to ensure real-time processing. According to these results, we can choose the best configuration: the proposed approach easily allows the exploration of several platform architecture alternatives.
To illustrate the efficiency of the model-based framework, Table III compares the implementation design time using the framework with results obtained by the same designer using a conventional manual implementation method, without any framework. The design time measured for the second configuration (replication methodology) is just the time needed to modify the first configuration (reduction methodology). The results in Table III show that, according to the estimated design times, the proposed framework is a far better solution for accelerating the design of specific SIMD parallel SoCs than manual design: two months were necessary to reduce an open-source processor to obtain a small PE (with only execution units) [21]. From these results, we can conclude that the model-based design framework allows a very fast SIMD implementation.
This case study illustrates a design framework that facilitates SIMD SoC implementation for data-parallel applications. Through the Model-Driven Engineering approach for parallel SoC design presented in this work, a designer can specify the needed SIMD configuration using UML models and the MARTE profile at a high abstraction level and automatically generate its implementation at RT level. The designer can easily and rapidly generate different SoC configurations
[Figure 8: the Main_architecture composite structure, with the ACU and its instruction memory (ACU_memory), the «shaped» PU array, the mpNoC_router, and two I/O devices (device: Device, device2: Device2), connected through mpNoC input/output ports and «reshape» links.]
Fig. 8. SIMD configuration composite structure diagram
to look for the best alternative for a given application.
V. CONCLUSIONS AND FUTURE WORK
A Model-Driven Engineering (MDE) approach for SIMD SoC design was presented. The proposed design flow is composed of four steps: application programming, system modeling, deployment and implementation generation. The fundamental MDE notion of transformation between models is used to generate a SIMD configuration at register transfer level from its model at a high abstraction level. The framework facilitates exploration by rapidly generating different SoC configurations, in order to choose the one that best fulfills the application requirements. Experimental results show that the proposed framework strongly contributes to increasing the designer's productivity. The case study with a video processing application proved that the presented design flow can facilitate the design of parallel SIMD SoC systems and reduce implementation costs. Besides, the use of UML and MDE promotes the reusability of application and high-level system models.
One future direction is the modeling of data-parallel applications. We also intend to develop a high-level exploration step to automatically generate the most suitable application-specific SIMD SoC configuration.
REFERENCES
[1] W. C. Meilander, J. W. Baker, and M. Jin, "Importance of SIMD Computation Reconsidered," in International Parallel and Distributed Processing Symposium, 2003.
[2] R. Kleihorst et al., "An SIMD smart camera architecture for real-time face recognition," in Abstracts of the SAFE & ProRISC/IEEE Workshops on Semiconductors, Circuits and Systems and Signal Processing, 2003.
[3] R. Rosas, A. de Luca, and F. Santillan, "SIMD Architecture for Image Segmentation using Sobel Operators Implemented in FPGA Technology," in Proc. of the 2nd International Conference on Electrical and Electronics Engineering (ICEEE'05), 2005.
[4] P. Bonnot, F. Lemonnier, G. Edelin, G. Gaillat, O. Ruch, and P. Gauget, "Definition and SIMD implementation of a multi-processing architecture approach on FPGA," in Proc. of DATE, 2008.
[5] F. Schurz and D. Fey, "A Programmable Parallel Processor Architecture in FPGA for Image Processing Sensors," in Integrated Design and Process Technology, IDPT, 2007.
[6] X. Xizhen and S. G. Ziavras, "H-SIMD machine: configurable parallel computing for matrix multiplication," in International Conf. on Computer Design: VLSI in Computers and Processors, 2005, pp. 671–676.
[7] P. Paulin, "DATE panel: Chips of the future: soft, crunchy or hard?" in Proc. Design, Automation and Test in Europe, 2004, pp. 844–849.
[8] A. Sangiovanni-Vincentelli, L. Carloni, F. D. Bernardinis, and M. Sgroi, "Benefits and challenges for platform-based design," in Proc. DAC, 2004, pp. 409–414.
[9] D. Schmidt, "Model-driven Engineering," IEEE Computer, vol. 39, no. 2, 2006.
[10] C. D. L. Bond and J.-L. Dekeyser, "Metamodels and MDA transformations for embedded systems," in FDL'04, Lille, France, 2004.
[11] S. Mellor and M. Balcer, Executable UML: A Foundation for Model-Driven Architecture. Boston: Addison-Wesley, 2002.
[12] O. M. Group. (2004, October) UML 2 superstructure (available specification). [Online]. Available: http://www.omg.org/cgi-bin/doc?ptc
[13] L. Rioux, T. Saunier, S. Gerard, A. Radermacher, R. de Simone, T. Gautier, Y. Sorel, J. Forget, J.-L. Dekeyser, A. Cuccuru, C. Dumoulin, and C. Andre, "MARTE: A new profile RFP for the modeling and analysis of real-time embedded systems," in UML-SoC'05, DAC 2005 Workshop UML for SoC Design, Anaheim, CA, June 2005.
[14] Acceleo. (2009). [Online]. Available: http://www.acceleo.org
[15] M. Bakouti, P. Marquet, M. Abid, and J.-L. Dekeyser, "IP based configurable SIMD massively parallel SoC," in PhD Forum of 20th International Conference on Field Programmable Logic and Applications (FPL), Milano, Italy, August 2010.
[16] D. Bjorklund and J. Lilius, "From UML Behavioral Models to Efficient Synthesizable VHDL," in 20th IEEE NORCHIP Conference, Copenhagen, Denmark, November 2002.
[17] F. P. Coyle and M. A. Thornton, "From UML to HDL: a Model Driven Architectural Approach to Hardware-Software Co-Design," Information Systems: New Generations Conference (ISNG), pp. 88–93, April 2005.
[18] T. Kangas, P. Kukkala, H. Orsila, E. Salminen, M. Hannikainen, and T. Hamalainen, "UML-based multiprocessor SoC design framework," ACM Trans. Embedded Computing Systems (TECS), vol. 5, no. 2, pp. 88–93, May 2006.
[19] M. Bakouti, Y. Aydi, P. Marquet, M. Abid, and J.-L. Dekeyser, "Scalable mpNoC for Massively Parallel Systems - Design and Implementation on FPGA," Journal of Systems Architecture (JSA), vol. 56, pp. 278–292, 2010.
[20] EMF. Eclipse Modeling Framework. [Online]. Available: http://www.eclipse.org/emf
[21] M. Bakouti, P. Marquet, M. Abid, and J.-L. Dekeyser, "A design and an implementation of a parallel based SIMD architecture for SoC on FPGA," in Conference on Design and Architectures for Signal and Image Processing DASIP'08, Bruxelles, Belgium, November 2008.
[22] OpenCores. OR1200 OpenRISC processor. [Online]. Available: http://opencores.org/openrisc,or1200
[23] O. M. Group. UML Profile for MARTE: Modeling and Analysis of Real-Time Embedded Systems, version 1.0. [Online]. Available: http://www.omg.org/spec/MARTE/1.0/PDF/
[24] EEMBC. (2010) The Embedded Microprocessor Benchmark Consortium. [Online]. Available: http://www.eembc.org/home.php
[25] Terasic. (2010) Altera DE2-70 Board. [Online]. Available: http://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&No=226
A State-Based Modeling Approach for Fast Performance Evaluation of Embedded System Architectures
Abstract— Abstract models help system architects evaluate hardware/software architectures and thereby cope with the ever-increasing complexity of embedded systems. Efficient methods are necessary to correctly model system architectures and to enable early performance evaluation and fast exploration of the design space. In this paper, we present a specific modeling approach to improve the evaluation of non-functional properties of embedded systems. The contribution is a computation method defined to improve the modeling of the properties used for the assessment of architecture performance. This method favors the creation of abstract transaction-level models and significantly reduces simulation time while preserving the accuracy of results. The benefits of the proposed approach for the performance evaluation of system architectures are highlighted through the analysis of two case studies.
I. INTRODUCTION

High performance applications supported by modern embedded devices imply the definition of heterogeneous multiprocessor platforms. The process of system architecting consists of optimally defining the organization and performance of such platforms in terms of processing, communication and memory resources, according to functional and non-functional requirements. Typical non-functional requirements considered for embedded systems are timing constraints, power consumption and cost. Fast exploration of the design space and evaluation of non-functional properties early in the development process have thus become mandatory to avoid costly iterations [1]. In this context, abstract models are needed to cope with the ever-increasing complexity of embedded systems.
As reported in [2], models for performance evaluation are usually created by applying the principles of the Y-chart model. Following this approach, a model of the system application is mapped onto a model of the considered platform, and the resulting description is then analyzed analytically or by simulation. Modeling of computation and modeling of communication can be strictly separated and defined at various abstraction levels on both the application and platform sides [3]. By raising the level of design abstraction above the Register Transfer Level (RTL), Transaction Level Modeling (TLM) offers a good trade-off between modeling accuracy and simulation speed, and has therefore recently emerged in the design process of embedded systems [4]. However, the achievable simulation speed of transaction-level models is limited by the number of required transactions, and the integration of non-functional properties can significantly reduce simulation speed. Therefore, specific modeling techniques are required to correctly abstract non-functional properties and improve simulation efficiency.
In this paper, an approach for the creation of efficient transaction-level models for performance evaluation of system architectures is presented. The contribution is a specific computation method proposed to improve the expression of the non-functional properties assessed for performance evaluation. The proposal is based on the distinction between the description of system evolution, driven by transactions, and the description of non-functional properties. This separation of concerns reduces the number of events in transaction-level models and favors the creation of abstract models. Simulation speed-up is achieved thanks to a significant reduction of the required transactions, while the method preserves accuracy in the evaluation of performance. The method has been validated through the use of a specific modeling framework based on the SystemC language [5]. The proposed approach provides fast performance evaluation and allows efficient exploration of different architecture configurations. Its benefits are highlighted through two case studies: the created models are simulated to evaluate performance in terms of processing resources and memory cost, in order to correctly fix platform properties.
The remainder of this paper is structured as follows. Section II analyzes related modeling and simulation approaches used for performance evaluation of embedded systems. In Section III, the proposed modeling approach is presented. In Section IV, we detail the computation method used to improve the simulation speed of models. In Section V, we describe the implementation of the proposed approach in the considered modeling environment. Section VI highlights the benefits of the approach through two case studies. Finally, conclusions are drawn in Section VII.
II. RELATED WORK

Performance evaluation of embedded systems has been approached in many ways at different levels of abstraction. A good survey of various methods, tools and environments for
Sébastien Le Nours, Anthony Barreteau, Olivier Pasquier Univ Nantes, IREENA, EA1770, Polytech-Nantes, rue C. Pauc, Nantes, F-44000 France
{sebastien.le-nours, anthony.barreteau, olivier.pasquier}@univ-nantes.fr
978-1-4577-0660-8/11/$26.00 ©2011 IEEE
early design space exploration is presented in [6]. Typically, performance models capture characteristics of system architectures and are used to obtain reliable data on resource usage. For this purpose, performance evaluation can be performed without considering a complete description of the system functionalities. This abstraction enables efficient simulation speed and favors early performance evaluation. Workload models are then defined to represent the computation and communication loads that applications cause on platforms when executed; they are mapped onto platform models, and the resulting architecture models are simulated to obtain performance data. Among simulation-based approaches, TLM has recently received wide interest in the industrial and research communities as a way to improve system design and productivity. Transaction-level models make it possible to hide unnecessary details of communication and computation. Formally, a transaction has been defined in [4] as the data transfer or synchronization between two modules at an instant determined by the hardware/software system specification. The different levels of abstraction considered in TLM approaches are classified according to the granularity of computation and communication and the time accuracy [3][4]. TLM is supported by languages such as SystemC [5] and SystemVerilog [7], notably through the TLM-2.0 standard promoted by OSCI [8]. The work presented in [9] gives a quantitative analysis of the speed/accuracy trade-off in transaction-level models. Typically, with the SystemC language, simulation speed is related to the number of thread context switches, which usually grows with the number of modules. The approach presented in [10] attempts to transform the structural description of designs by aligning concurrency along modules in order to minimize switches; the transformation technique re-assigns the concurrency along the dataflow while keeping the functionality of the model unchanged.
In [11], a method is presented to minimize the number of synchronization points in a system description by optimizing the granularity of transactions. In the following, we adopt a similar approach in order to reduce the number of events required in transaction-level models created for the evaluation of non-functional properties.
Among existing approaches for performance evaluation of embedded systems, different optimization objectives are addressed in order to help designers fix platform parameters early in the development process. The design framework presented in [12] supports system modeling at different levels of abstraction; its architecture exploration step mainly focuses on the optimization of allocation and partitioning. The system architecture consists of processing elements and memories, selected by the system designer as part of the decision making. In [13], the proposed methodology allows architectural exploration at different levels of abstraction: candidate architectures are selected using analytical modeling and multi-objective optimization, taking into account parameters such as processing capacities, power consumption and cost, and potential solutions are then simulated at transaction level using SystemC. In [14], performance evaluation is performed by a combined simulation, associating functionalities and timing in one single simulation model. The performance of each feasible implementation is then assessed with respect to a given set of stimuli, by means of average
latency and average throughput. The design framework proposed in [1] aims at evaluating non-functional properties such as power consumption and temperature. In this approach, the application is described through a model called a communication dependency graph, completed by SystemC models of non-functional properties. Simulation is then performed to evaluate the achieved power consumption. The approaches presented in [15] and [16] both combine UML2 descriptions for application modeling with platform modeling in SystemC for performance evaluation. Applications are modeled in terms of services required from the underlying platform. Workload models are defined to express the processing and communication load an application causes on a platform when executed. These workload models do not contain timing information; it is left to the platform model to determine how long it takes to process the workloads. Our approach mainly differs from the above in the way the system architecture is modeled and the workload models are defined. Besides, we pay specific attention to the optimization of models in order to improve simulation speed.
III. CONSIDERED MODELING APPROACH

The considered modeling approach aims at creating approximately-timed models used for the evaluation of properties related to system architectures. It is based on a single view that combines a structural description of the system under study with the non-functional properties relevant to the considered hardware and software resources. This approach is illustrated in Figure 1.
[Figure 1: the lower part shows the considered system architecture (functions F11 and F12 on processor P1, F2 on P2, a communication node and the memory Mem1); the upper part shows the model of the system architecture, in which activities A1 and A2 exchange transactions M and Q, and the state machine of A2 (states s0 to s3, time conditions t = Tj, Tk, Tl) updates the properties CcA2 and McA2.]
Figure 1. Considered modeling approach for evaluation of non-functional properties of system architectures.
The lower part of Figure 1 depicts a typical platform made of communication nodes, memories and processing resources. Processing resources are classified as processors and dedicated hardware resources. In Figure 1, F11, F12 and F2 represent functions of the system application; they are allocated on the processing resources P1 and P2 to form the system architecture. For clarity, the communications and memory accesses
induced by this allocation are not represented. The upper part of the figure depicts the model of the system architecture. This model exhibits the transactions exchanged between activities and the utilization of the platform resources. It is based on an activity diagram notation inspired from [17]: single-arrow links correspond to transactions exchanged between activities, and communication conforms to the rendezvous protocol. The behavior of each activity exhibits waiting conditions on input transactions and the production of output transactions; it also expresses the use of processing and memory resources, considering the allocation of functions. Transitions between states are expressed as waiting transactions, time conditions or logical conditions on internal variables. In Figure 1, the use of processing resources due to the execution of function F2 on P2 is modeled by the evolution of the parameter denoted CcA2. In this simple example, Ccs1 operations are first executed for a duration set to Tj after reception of transaction M. The production of transaction Q is done once state s3 is finished. The parameter McA2 describes the evolution of the amount of memory required during the execution of activity A2. The internal variables related to each activity can be influenced by the data associated with the input transaction M.
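A minimal sketch of how such an activity could be encoded is given below. This is our own illustration, not the authors' implementation: each state carries a duration and the associated CcA2/McA2 loads, and a step function walks the states from the arrival of M to the production of Q.

```c
/* Hedged sketch (an illustration, not the authors' code) of an
 * activity such as A2. Each state carries a duration (Tj, Tk, ...)
 * and the loads the text denotes CcA2 (processing) and McA2 (memory). */
typedef struct {
    double duration;  /* time spent in the state                  */
    double cc;        /* processing load CcA2 during the state    */
    double mc;        /* memory usage McA2 during the state       */
} State;

/* Walk the states from the arrival time of transaction M, recording
 * one (timestamp, CcA2, McA2) observation per state; the returned
 * instant is when the output transaction Q is produced. */
double run_activity(double t_m, const State *states, int nstates,
                    double *t_trace, double *cc_trace, double *mc_trace) {
    double t = t_m;
    for (int i = 0; i < nstates; i++) {
        t_trace[i]  = t;
        cc_trace[i] = states[i].cc;
        mc_trace[i] = states[i].mc;
        t += states[i].duration;
    }
    return t;  /* production instant of Q */
}
```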
Following this approach, the resulting model incorporates quantitative properties defined analytically and relevant to the use of processing resources, communication nodes and memories. These analytical expressions of quantitative properties, and the related time properties, are directly influenced by the characteristics of the resources considered to support function execution; they are provided by estimations and measurements. Using languages such as SystemC, the created models can then be simulated to evaluate the time evolution of the performance obtained for a given set of stimuli. Various platform configurations and function allocations can be compared by considering different descriptions of the behavior of activities. In the following, the presented contribution concerns the optimization of the descriptions of activities in order to improve the simulation time of such models.
IV. PROPOSED COMPUTATION METHOD OF NON-FUNCTIONAL PROPERTIES OF SYSTEM ARCHITECTURES
As previously discussed, the simulation speed of transaction-level models can be significantly improved by avoiding context switches between threads. The proposed computation method relies on the same principle as the temporal decoupling supported by the loosely-timed coding style defined by OSCI. Using this coding style, parts of the model are permitted to run ahead in a local time until they reach the point where they need to synchronize with the rest of the model. The proposed method can be seen as an application of this principle to the creation of models for the evaluation of architecture performance: it aims at minimizing the number of transactions required for the description of the properties assessed for performance evaluation. Figure 2 illustrates the application of the proposed computation method.
Figure 2. Comparison of two modeling approaches in order to minimize the amount of required transactions in models used for performance evaluation.
Figure 2 depicts the transactions exchanged between two activities and the behavior of the receiving activity, denoted A2. The upper part of the figure corresponds to a description with 4 successive transactions. The durations between the successive transactions are denoted Δt1, Δt2 and Δt3; they are relevant to the communication node used for the transfer of data. In this transaction-based modeling approach, the considered property cA2 evolves each time a transaction is received. The lower part of the figure considers the description of activity A2 at a higher abstraction level: only one transaction occurs, and its content is defined at a higher granularity. However, the evolution of the property cA2 can be preserved by separating it from the evolution of the activity behavior. In that case, the duration Ts corresponds to the time elapsed between the first and the last transaction considered in the upper part of the figure. It is computed locally, relative to the arrival time of the input transaction M, and it defines the next output event; in Figure 2, this is denoted by the action ComputeAfterM. The time condition is evaluated during state s0 according to the evolution of the simulation time, denoted ts. Besides, the evolution of the property cA2 between two external events is also computed during state s0: the successive values, denoted cs0, are evaluated in zero time with respect to the simulation time. This means that no SystemC wait primitives are used, leading to no thread context switches. The resulting observations correspond to the values cs0 and the associated timestamps To; timestamp values are considered relative to what we call the observed time, denoted to. Using this technique, the evolution of the considered property can be computed locally between external transactions. Compared to the transaction-based approach, this second modeling approach with the related computation technique can be considered a state-based approach.
Non-functional properties are then computed locally in the same state, which reduces the number of required transactions. Figure 3 represents the time evolution of property cA2 for the two modeling approaches illustrated in Figure 2.
Figure 3. Evolution of property cA2 considering, (a), a transaction-based modeling approach and, (b), the proposed state-based modeling approach.
The upper part of Figure 3 illustrates the time evolution of property cA2 with 4 successive input transactions. During simulation of the model, each transaction implies a thread context switch between activities, and cA2 evolves according to the simulation time. In the lower part of the figure, the successive values of property cA2 and the associated timestamps are computed at the reception of transaction M; their evolution is depicted according to the observed time to. Improved simulation time is achieved thanks to the context switches avoided. More generally, when the number of transactions is reduced by a factor of N, a simulation speed-up by roughly the same factor can be expected. This computation method favors the creation of abstract models, and the utilization of platform resources can be computed at a finer level with little influence on simulation time. We have implemented the proposed computation method in a specific modeling framework in order to analyze its influence on the simulation time of models.
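The following sketch (ours, not the paper's SystemC code) illustrates the state-based computation: the N per-transaction updates of cA2 are replaced by one local loop that produces the same (timestamp, value) observations without intermediate synchronizations, returning the offset Ts of the single remaining output event.

```c
/* Illustration of the state-based method: property values and
 * timestamps are computed locally, in zero simulation time, instead
 * of one simulator event per transaction. */
typedef struct {
    double t;      /* observed time "to" of the observation */
    double value;  /* value of property cA2                 */
} Observation;

/* dt[0..n-2] are the inter-transaction delays (Delta-t1, Delta-t2, ...)
 * and c[0..n-1] the successive property values; all n observations are
 * produced relative to the arrival time of transaction M, and the
 * return value is Ts, the offset of the single output event. */
double compute_after_m(double t_arrival, const double *dt,
                       const double *c, int n, Observation *obs) {
    double to = t_arrival;            /* local "observed time" */
    for (int i = 0; i < n; i++) {
        obs[i].t = to;
        obs[i].value = c[i];
        if (i < n - 1) to += dt[i];   /* advance locally, no wait() */
    }
    return to - t_arrival;            /* Ts */
}
```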
V. IMPLEMENTATION OF THE COMPUTATION METHOD IN A SPECIFIC FRAMEWORK
The proposed computation method has been implemented in the framework CoFluent Studio [18]. This environment supports the creation of transaction-level models of system applications and architectures. The captured graphical models and the associated code are automatically generated in a SystemC description and simulated to analyze model execution and to assess performance. We used the so-called Timed-Behavioral Modeling part of this framework to create models according to the considered approach. Figure 4 illustrates the graphical modeling adopted in CoFluent Studio to implement the proposed computation method. It corresponds to the specific case illustrated in Figure 2, with one input transaction and one output transaction.
Figure 4. Graphical modeling in the CoFluent Studio framework to implement the proposed computation method.
In Figure 4, the function denoted A2 is activated once the input transaction M has been received. The production instant of the output transaction Q is computed in the operation denoted OpPerformanceAnalysis, whose duration corresponds to the duration Ts defined in Figure 2. The other operations, OpInit and OpUpdating, are executed in zero time with respect to the simulation time. The loop with a boolean condition on the internal variable Wait_Input is added to manage possible output transactions produced successively before waiting for a new input transaction. The operation OpPerformanceAnalysis is described in sequential C/C++ code to define the computation of the properties and their display. The example given below corresponds to the instructions required to obtain the observations depicted in the lower part of Figure 3.
{
  To = CurrentUserTime(ns);
  CofDisplay("to=%f ns, cA2=%f op/s", To, c1);
  To = To + t1;
  CofDisplay("to=%f ns, cA2=%f op/s", To, c2);
  To = To + t2;
  CofDisplay("to=%f ns, cA2=%f op/s", To, c3);
  To = To + t3;
  CofDisplay("to=%f ns, cA2=%f op/s", To, c4);
  To = To + Tl;
  CofDisplay("to=%f ns, cA2=%f op/s", To, 0);
  OpDuration = To - CurrentUserTime(ns);
}
The procedure CurrentUserTime returns the current simulation time in CoFluent Studio; here, it obtains the reception time of input transactions and serves to compute the values of the durations To and Ts. The procedure CofDisplay plots variables in a Y = f(X) chart; here, it displays the studied properties against the observed time. The keyword OpDuration defines the duration of the operation OpPerformanceAnalysis, evaluated in simulation time. Successive values of cA2 and their timestamps are provided by estimations; they could also be computed from data associated with the input transaction M. This model has been extended to the case of functions with multiple input and output transactions. In the following, we use this implementation of the proposed method to create executable models, which are then simulated to assess the performance of the considered architectures.
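The logic of this code fragment can be sketched in a framework-independent way. The following minimal C++ sketch reproduces the trace construction (successive complexity values at computed timestamps, a final drop to zero after the latency Tl, and the resulting operation duration); the struct and function names are ours for illustration, not CoFluent APIs:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One piecewise-constant complexity trace, as displayed by
// OpPerformanceAnalysis: (timestamp, value) points plus the total
// operation duration (Ts in the paper).
struct ComplexityTrace {
    std::vector<std::pair<double, double>> points; // (time ns, op/s)
    double opDuration;                             // Ts
};

// durations = {t1, t2, t3, ...}, values = {c1, c2, c3, c4, ...},
// latency = trailing Tl after which complexity falls to zero.
ComplexityTrace buildTrace(double start,
                           const std::vector<double>& durations,
                           const std::vector<double>& values,
                           double latency) {
    ComplexityTrace tr;
    double t = start;
    for (std::size_t i = 0; i < values.size(); ++i) {
        tr.points.push_back({t, values[i]});  // display ci at time t
        if (i < durations.size()) t += durations[i];
    }
    t += latency;
    tr.points.push_back({t, 0.0});            // complexity drops to zero
    tr.opDuration = t - start;                // Ts, as in the fragment
    return tr;
}
```
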
VI. CASE STUDIES
A. Modeling of a pipeline architecture
The first case study illustrates the proposed modeling approach and the simulation speed-up obtained with the computation method presented in Section IV, through a didactic example. The considered application is a Fast Fourier Transform (FFT), an algorithm widely used in digital signal processing. A pipeline architecture based on hardware resources is analyzed. To keep the illustration simple, an 8-point FFT is considered. The modeling approach is used to estimate resource utilization, and the computation method is used to reduce the simulation time of the performance model. Figure 5 illustrates the considered pipeline architecture and the related performance model.
Figure 5. Modeling of a 3-stage pipeline architecture.
The lower part of Figure 5 describes a typical pipeline architecture as implemented in most commercial FFT IPs. This architecture makes it possible to simultaneously perform transform calculations on the current frame of 8 complex symbols, load input data for the next frame, and unload the 8 complex output symbols of the previous frame. Each stage is made of processing (adders, multipliers) and memory resources. The upper part gives the structural description of the associated model. The behaviors of the activities Stage1, Stage2, and Stage3 describe the utilization of processing resources each time an input transaction is received. The behavior of each activity is described following the modeling approach presented in Section III.
The architecture was first modeled following a transaction-based modeling approach in which each input transaction carries a single complex symbol; eight input transactions are then required to process one iteration of the FFT algorithm. This model has been captured in the CoFluent Studio framework following the previously presented modeling approach. The created model makes it possible to analyze the use of processing resources according to the rate of input transactions. Figure 6 depicts observations obtained with the CoFluent Studio simulation tool. In the considered example, input transactions are received with a period of 0.125 ms.
Figure 6. Time evolution of computational complexity (in KOPS) of the considered system architecture.
The upper part of Figure 6 shows the time evolution of the global computational complexity per time unit required for the complete architecture over three successive executions. The lower part illustrates the processed input and output transactions as observed in the timeline view of the CoFluent Studio simulation tool. For clarity, only the first input transactions and the last output transactions are depicted. The behavior of each activity reflects the architecture of each stage and the time constraints allocated to process each complex symbol. In the considered configuration, the estimated computational complexity per time unit is 120 KOPS.
Considering the state-based modeling approach depicted in Figure 3, we defined a model of the system with the same structure but a coarser data granularity: transactions carry eight complex symbols, and a single transaction triggers one iteration of the complete architecture. The start time corresponds to the reception of the first complex symbol, and the other instants are computed locally relative to this value. The evolution of the processing-resource load for each transaction is computed with the method presented above, and a similar observation of computational complexity is obtained. The measured average simulation speed-up is about 7.62, against a theoretical factor of 8, which shows that the computation method itself has only a weak influence on the simulation-time improvement. Similar observations were obtained when increasing the number of stages in the pipeline architecture.
B. Modeling of a communication receiver
The second case study concerns the creation of a transaction-level model for the analysis of the processing functions involved at the physical layer of the 3GPP LTE protocol [19]. The aim of the model is to study the required computational complexity and memory cost according to the various parameters associated with each function. In the following, we consider the reception part of a downlink transmission in a single-input single-output (SISO) configuration. The structural representation of this system is given in Figure 7.
Figure 7. Activity diagram of the studied communication receiver.
In the configuration depicted in Figure 7, input transactions are received every 1 ms. Each carries 14 OFDM symbols whose size can vary according to the considered throughput. Based on a detailed analysis of the processing and memory resources required for each function [20], we defined analytical expressions for each activity. These expressions relate functional parameters to the resulting computational complexity, in terms of arithmetic operations, and to the required memory resources. For example, the number of sub-carriers directly influences the computational complexity of the OFDM demodulator function. We used the proposed modeling approach to describe each elementary activity depicted in the lower part of Figure 7. The behavior of each activity exhibits the way processing and memory resources are used, and the computation method locally computes the time evolution of the computational complexity and memory cost of each activity. The time properties defined for each activity depend on the evaluated architecture. In the following, results are presented for a platform made of dedicated hardware resources implementing each function.
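The paper's actual analytical expressions are derived from [20] and are not reproduced here, but an illustrative expression of the same kind can be sketched for the OFDM demodulator: assuming a radix-2 N-point FFT with N/2 · log2(N) butterflies and roughly 10 real operations per butterfly (both assumptions are ours), the operation count grows with the FFT size parameter NFFT as follows:

```cpp
#include <cmath>

// Illustrative complexity expression in the spirit of Section VI.B:
// real-operation count of one radix-2 nFft-point FFT, assuming
// (nFft/2)*log2(nFft) butterflies and ~10 real operations each.
// These constants are assumptions, not the expressions used in [20].
double fftOperations(int nFft) {
    double butterflies = (nFft / 2.0) * std::log2(static_cast<double>(nFft));
    return 10.0 * butterflies;
}
```

Multiplying such a per-invocation count by the invocation rate yields the operations-per-second figures plotted in Figure 8.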
We captured the complete model with the CoFluent Studio tool; each activity follows the approach illustrated in Figure 4. The captured LTE receiver model represents 3850 lines of SystemC code, 22% of which were automatically generated by CoFluent Studio; the rest is the sequential C/C++ code defined to compute and display the studied non-functional properties. The model makes it possible to observe the evolution of the computational complexity per time unit for each activity and for the complete architecture. Figure 8 shows the obtained evolution of the computational complexity for various configurations of the input frames.
Figure 8. Observation of estimated computational complexity (in GOPS) of the receiver architecture according to various configurations of LTE sub-frames.
In Figure 8, we observe the evolution of the computational complexity during the reception of successive LTE sub-frames. The system configuration evolves during execution according to several parameters: the number of blocks of data allocated per user (NbRB), the size of the OFDM symbol (NFFT), and the number of iterations of the channel decoder (NbIterTurboDecod). These parameters vary from one frame to the next, and the demodulation scheme can also change during system execution; in Figure 8, the modulation schemes are QPSK, 64QAM, and 16QAM. We observe that the global computational complexity varies strongly during system execution, with an estimated maximum of 70 giga operations per second (GOPS) over the three evaluated configurations. Most of the computational complexity is due to the channel decoder function. The same model is used to evaluate the memory cost of the receiver system; this observation is given in Figure 9.
Figure 9. Observation of the estimated memory cost (in KByte) of the receiver architecture according to various configurations of LTE sub-frames.
Figure 9 illustrates the evolution of the memory cost during the computation of successive LTE sub-frames; the maximum value is estimated at 570 KBytes. The observations given in Figures 8 and 9 are used to estimate the resources expected for the architecture. Simulating the created model for 1000 input frames took 11 s on a 2.66 GHz Intel Core 2 Duo machine, which is fast enough to evaluate performance and to simulate multiple architecture configurations. The time properties and quantitative properties defined for each activity can easily be modified to evaluate various configurations of the architecture. We also used this approach to evaluate the properties of a heterogeneous architecture made of dedicated hardware resources and one processor core.
VII. CONCLUSION
The creation of abstract models is a reliable way to master the design complexity of embedded systems and to enable the architecting of complex hardware and software resources. In this paper, we presented an approach for the creation of transaction-level models for performance evaluation. In this approach, the system architecture is modeled as an activity diagram, and the description of activities incorporates properties relevant to resource usage. The contribution is a specific computation method that favors the creation of more abstract transaction-level models: simulation speed-up is achieved through a significant reduction of the number of transactions in models, and architecture properties are computed in zero simulation time. This method significantly increases the simulation speed of models while preserving the accuracy of observations. The method was demonstrated with the CoFluent Studio framework, but the presented modeling approach is not limited to this environment and could be applied to other SystemC-based frameworks. Further research is directed towards applying the same modeling principle to other non-functional properties, such as dynamic power consumption.
REFERENCES
[1] A. Viehl, B. Sander, O. Bringmann, and W. Rosenstiel, "Integrated requirement evaluation of non-functional system-on-chip properties", in Proceedings of the Forum on specification and Design Languages (FDL'08), Stuttgart, Germany, September 2008.
[2] D. Densmore, R. Passerone, and A. Sangiovanni-Vincentelli, "A platform-based taxonomy for ESL design", IEEE Design and Test of Computers, vol. 23, no. 5, pp. 359-374, September/October 2006.
[3] L. Cai and D. Gajski, "Transaction level modeling: an overview", in Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS'03), Newport Beach, October 2003.
[4] F. Ghenassia, Transaction-Level Modeling with SystemC: TLM Concepts and Applications for Embedded Systems, Springer, 2005.
[5] Open SystemC Initiative (OSCI), "Functional specification for SystemC 2.0", http://www.systemc.org
[6] M. Gries, "Methods for evaluating and covering the design space during early design development", Integration, the VLSI Journal, vol. 38, no. 2, pp. 131-183, 2004.
[7] SystemVerilog, http://www.systemverilog.org
[8] Open SystemC Initiative TLM Working Group, Transaction Level Modeling Standard 2 (TLM 2), June 2008.
[9] G. Schirner and R. Dömer, "Quantitative analysis of the speed/accuracy trade-off in transaction level modeling", ACM Transactions on Embedded Computing Systems, vol. 8, no. 4, pp. 1-29, 2008.
[10] N. Savoiu, S. K. Shukla, and R. K. Gupta, "Automated concurrency re-assignment in high level system models for efficient system-level simulation", in Proceedings of Design, Automation and Test in Europe (DATE'02), 2002.
[11] J. Cornet, F. Maraninchi, and L. Maillet-Contoz, "A method for the efficient development of timed and untimed transaction-level models of systems-on-chip", in Proceedings of Design, Automation and Test in Europe (DATE'08), Munich, Germany, March 2008.
[12] R. Dömer, A. Gerstlauer, J. Peng, et al., "System-on-chip environment: a SpecC-based framework for heterogeneous MPSoC design", EURASIP Journal on Embedded Systems, vol. 2008, 2008.
[13] A. D. Pimentel, C. Erbas, and S. Polstra, "A systematic approach to exploring embedded system architectures at multiple abstraction levels", IEEE Transactions on Computers, vol. 55, no. 2, pp. 99-111, 2006.
[14] C. Haubelt, J. Falk, J. Keinert, et al., "A SystemC-based design methodology for digital signal processing systems", EURASIP Journal on Embedded Systems, vol. 2007, 2007.
[15] J. Kreku, M. Hoppari, T. Kestilä, et al., "Combining UML2 application and SystemC platform modelling for performance evaluation of real-time embedded systems", EURASIP Journal on Embedded Systems, vol. 2008, 2008.
[16] T. Arpinen, E. Salminen, T. Hämäläinen, and M. Hännikäinen, "Performance evaluation of UML2-modeled embedded streaming applications with system-level simulation", EURASIP Journal on Embedded Systems, vol. 2009, 2009.
[17] J. P. Calvez, Embedded Real-Time Systems: A Specification and Design Methodology, John Wiley & Sons, May 1993.
[18] CoFluent Design, http://www.cofluentdesign.com/
[19] E. Dahlman, S. Parkvall, J. Sköld, and P. Beming, 3G Evolution: HSPA and LTE for Mobile Broadband, Academic Press, 2008.
[20] J. Berkmann, C. Carbonelli, F. Dietrich, C. Drewes, and W. Xu, "On 3G LTE terminal implementation – standard, algorithms, complexities and challenges", in Proceedings of the International Wireless Communications and Mobile Computing Conference (IWCMC'08), August 2008.
Session 6: Software for Embedded Devices
978-1-4577-0660-8/11/$26.00 ©2011 IEEE
Task Mapping on NoC-Based MPSoCs with Faulty Tiles: Evaluating the Energy Consumption and the Application Execution Time
Alexandre M. Amory, César A. M. Marcon, Fernando G. Moraes
FACIN – Faculdade de Informática, PUCRS Catholic University
Porto Alegre, Brazil
{alexandre.amory, cesar.marcon, fernando.moraes}@pucrs.br

Marcelo S. Lubaszewski
PPGC – Instituto de Informática, UFRGS Federal University
Porto Alegre, Brazil
[email protected]
Abstract— The use of spare tiles in a network-on-chip-based multi-processor chip can improve the yield, reducing the cost of the chip and maintaining the system functionality even if the chip is defective. However, the impact of this approach on application characteristics, such as energy consumption and execution time, is not documented. For instance, on the one hand, the application tasks might be mapped onto any tile of a defect-free chip; on the other hand, a chip with a defective tile needs a special task mapping that avoids the faulty tiles. This paper presents a task mapping aware of faulty tiles, where an alternative task mapping can be generated and evaluated in terms of energy consumption and execution time. The results show that faults on tiles have, on average, a small effect on energy consumption and no significant effect on execution time. This demonstrates that spare tiles can improve yield with a small impact on the application requirements.
Keywords: MPSoC, task mapping, yield, energy consumption, execution time.
I. INTRODUCTION
A multiprocessor system-on-chip (MPSoC) is typically a very large scale integrated system that incorporates most or all of the components necessary for an application, including multiple processors [1]. A network-on-chip (NoC) is the preferred intrachip communication infrastructure for MPSoCs due to its superior performance, scalability, and modularity. MPSoCs that use NoCs as the communication infrastructure are also called NoC-based MPSoCs.
NoCs can consume more than one third of the total chip energy [2][3]. On the other hand, the shrinking feature sizes of newer technologies and supply voltage scaling [4][5] increase the defect rate in chip manufacturing and reduce the yield. High manufacturability, low latency, and low energy consumption are conflicting design goals, so all these requirements have to be evaluated jointly to optimize a NoC-based MPSoC design.
The task mapping problem consists in finding an association of each application task to a tile that minimizes some given cost function. This paper presents a tool that finds an optimized task mapping in terms of energy consumption and application execution time, given a set of tiles with manufacturing defects. This way, even chips with defects can be sold, perhaps with some performance degradation, targeting low-end markets.
The goals of this paper are to present the aforementioned task mapping tool and to investigate the energy consumption and application execution time degradations for different application classes. The contributions of this paper are (i) a task mapping tool for NoC-based MPSoCs that considers faulty tiles when performing the mapping; (ii) the evaluation of energy consumption and application execution time under the presence of faulty tiles; (iii) a statistical method to generate fault scenarios for very large SoCs.
The paper is organized as follows: Section II presents the motivation, the usage of the proposed approach, and the main assumptions. Section III describes the related work. Section IV describes the task mapping tool and its models. Section V describes the experimental setup, the evaluated applications, and the fault scenarios. Section VI discusses the results. Section VII concludes the paper.
II. PRELIMINARIES
A. System Model and Assumptions
This paper assumes that the target MPSoC consists of a set of identical (homogeneous) tiles connected by a mesh-based NoC with the XY routing algorithm. Each tile contains three main components: a network interface, a processor, and a memory block. A tile supports only one task (no multitasking). This system model is equivalent, for instance, to the underlying model of the HeMPS MPSoC [6] with the Hermes NoC [7].
The present work assumes faults only in the tiles, since we assume that the tile accounts for at least 90% of the combined tile-and-router area. Therefore, the communication infrastructure is assumed fault-free. A faulty tile is completely shut down, so it neither consumes energy nor generates traffic in the network.
The faults result from defects created during chip manufacturing. Such defects are expected to become more common with the evolution of deep submicron technologies, so multiple faults per chip are considered. The proposed task mapping is executed at design time for several fault scenarios, so that an overall picture of the relationship between fault location and the performance metrics can be drawn.
B. Motivating Example
Redundant hardware is commonly used to tackle the yield problem. It has been successfully applied to all sorts of regular and repetitive hardware, such as different types of memories, programmable logic arrays, field-programmable gate arrays, and recently MPSoCs [5]. In the context of MPSoCs, the application task located on a faulty tile can be mapped (at design time) or migrated (at run time) to a spare tile, keeping the chip functional.
Shamshiri and Cheng [5] proposed a yield and cost analysis framework to evaluate the use of spare tiles in MPSoCs, which can be used to determine the amount of redundancy required to achieve a minimum cost. For instance, given the input parameters detailed in [5], the yield of a block is 94% and that of a NoC link is 72%, resulting in a system yield of just 21% for a 3x3 mesh NoC, i.e., there is a 79% probability of having at least one faulty block in the system. By adding three spare tiles to the system, increasing the number of tiles from 9 to 12, the system yield increases to 99%, since only 9 of the 12 tiles are actually required for a functional system. Moreover, the manufacturing cost is 3.2 times lower than for the original system, since the additional silicon area of the spare tiles is compensated by the increased yield.
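The 9-versus-12-tile comparison follows the standard m-out-of-n redundancy formula. The sketch below computes the system yield from the per-tile yield alone (the full framework in [5] also models link yield and cost, which is why the paper's exact numbers differ slightly):

```cpp
#include <cmath>

// Binomial coefficient C(n, k), computed iteratively in doubles.
static double binom(int n, int k) {
    double r = 1.0;
    for (int i = 1; i <= k; ++i)
        r = r * (n - k + i) / i;
    return r;
}

// Yield of an n-tile system that works as long as at least m tiles are
// defect-free (n - m spares), given an independent per-tile yield y:
// sum over i = m..n of C(n, i) * y^i * (1 - y)^(n - i).
double systemYield(int n, int m, double y) {
    double total = 0.0;
    for (int i = m; i <= n; ++i)
        total += binom(n, i) * std::pow(y, i) * std::pow(1.0 - y, n - i);
    return total;
}
```

With y = 0.94, systemYield(9, 9, y) is about 57%, while systemYield(12, 9, y) exceeds 99%, matching the trend the paper reports.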
Given these motivating results, we decided to investigate the use of spare tiles by evaluating the side effects of multiple faulty tiles on energy consumption and application execution time.
C. Usage of the Proposed Approach
Figure 1 illustrates the proposed test approach, which starts as soon as the chip is manufactured. If the tested chip fails, a diagnosis step is performed to locate the faulty tiles.
Let n be the number of system tiles and m the number of tiles necessary to implement the system's functionality, so that n - m is the number of spare tiles. If the number of faulty tiles is lower than or equal to n - m, the location of these faulty tiles is sent to the task mapping tool; otherwise, the faulty chip is discarded. The task mapping tool, presented in Section IV, loads a NoC model and the application task graph to determine a new task mapping that avoids the faulty tiles. Finally, the tool estimates the energy consumption and the application execution time of the resulting task mapping. Depending on the resulting overhead, chips with up to n - m faulty tiles can still be sent to the market, perhaps targeting low-end markets.
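The accept/discard decision of this flow can be sketched as follows (the names are ours for illustration):

```cpp
#include <cstddef>
#include <vector>

enum class ChipFate { HighEnd, LowEnd, Discarded };

// Dispatch a tested chip following the flow of Figure 1: n total
// tiles, m tiles needed by the application, so n - m spares are
// available. faultyTiles holds the tile indices found by diagnosis.
ChipFate dispatchChip(std::size_t n, std::size_t m,
                      const std::vector<std::size_t>& faultyTiles) {
    if (faultyTiles.empty())
        return ChipFate::HighEnd;      // passes the manufacturing test
    if (faultyTiles.size() <= n - m)
        return ChipFate::LowEnd;       // remappable around the faults
    return ChipFate::Discarded;        // not enough spare tiles
}
```
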
Figure 1. Proposed test flow for NoC-based MPSoCs with spare tiles.
III. RELATED WORK
There are several papers presenting approaches to improve the reliability of NoC-based SoCs. They can be broadly classified (the classes are not exhaustive) into: (i) fault-tolerant circuitry for NoCs and MPSoCs [8][9]; (ii) fault-tolerant NoC routing algorithms that explore different packet routes in case of network faults [10]; (iii) system-level reliability assessment [5]; (iv) system-level reliability co-optimization [11].
This paper fits best in the system-level reliability co-optimization category, where two main approaches are found: dynamic approaches executed at run time, and static approaches executed at design time. Dynamic system-level reliability co-optimization is commonly based on on-line task mapping and task migration to better accommodate new incoming tasks on the fly, assuming the chip might have faults. It can also be used to react, for instance, to a run-time fault generated by transient effects or by permanent faults due to wear-out or aging [15][16]; in this case, the tasks located at the faulty resources are moved at run time to healthy resources. These approaches are out of the scope of this paper, since our goal is to improve the yield of chip manufacturing: manufacturing defects are not dynamic and do not appear at run time.
For this reason, this paper is best related to static system-level reliability co-optimization approaches, based on static task scheduling executed at design time. These approaches are typically used in design space exploration targeting the optimization of metrics such as application execution time, latency, thermal constraints, and energy consumption [12][13][14]. Recently, these approaches have also begun to co-optimize reliability-related metrics.
Manolache et al. [11] address the reliability problem at the application level. They propose a way to combine spatially and temporally redundant message transmission, minimizing the energy and latency overheads.
Tornero et al. [17] propose a multi-objective optimization strategy that minimizes energy consumption and maximizes a robustness index, called path diversity, which explores the multiple paths between a pair of nodes. In case of a faulty link, a NoC with adaptive or source-based routing algorithms can explore these multiple paths, improving the chip's robustness.
Choudhury et al. [18] introduce a new task mapping whose objective is to minimize the variance of the system power and latency when faults occur, and to maximize the probability that the actual system will work when deployed.
Huang et al. [19] argue that some processors might age much faster than others, reducing the system's lifetime. They propose an analytical model to estimate the lifetime reliability of MPSoCs, integrated into a task mapping algorithm that minimizes the energy consumption of the system while satisfying a system lifetime reliability constraint. Huang and Xu [20] extend their previous task mapping tool [19] to support multi-mode embedded systems. In [21], they argue that an exponential lifetime distribution can be inaccurate, and further refine the lifetime reliability model to support arbitrary lifetime distributions, improving the accuracy of the simulation results.
IV. TASK MAPPING AWARE OF FAULTY TILES
The CAFES task mapping framework [22] is composed of high-level models, algorithms, and tools whose goal is to map application tasks onto the target architecture tiles so as to save energy and minimize the execution time. Figure 2 illustrates a partial mapping flow and the main elements used here.
Figure 2. Mapping flow used to obtain optimized application mappings.
Based on the description of an application already partitioned into tasks ti, the designer may extract the relevant computation and communication aspects.
The Communication Dependence and Computation Graph (CDCG) is the model used to describe the application. Each CDCG vertex models a communication, with its source and target tasks, the communication volume, and the computation time, i.e., the period between the moment all dependences are solved and the beginning of the communication. CDCG edges represent communication dependences: vertices are connected by an edge for each dependence. The CDCG is similar to a schedule graph, but it focuses on communication aspects instead of computation, which makes it easy to explore several requirements of the communication architecture.
Figure 3 depicts a small CDCG example containing three communications {C1, C2, C3}. C1 and C3 are concurrent communications and have no dependences, since the dependences on the Start vertex (dStart_1 and dStart_3) are always solved. Thus, C1 and C3 start immediately after their respective computation times of 10 and 20 clock cycles. Communication C1 states that t1 sends 100 bytes to t2, and C3 states that t3 sends 100 bytes to t1. As soon as the last byte of C1 is inserted into the NoC, d1_2 is solved; on the other hand, d3_2 is solved only when the last byte of C3 arrives at the processor where t1 is mapped.
Figure 3. CDCG example.
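A plausible in-memory form of a CDCG, populated with the Figure 3 example, can be sketched as follows (the struct and field names are ours, not from CAFES; C2's volume of 50 bytes and computation time of 25 cycles are taken from the figure):

```cpp
#include <string>
#include <vector>

// Hypothetical representation of a CDCG vertex: one communication
// with its source and target tasks, its volume, the computation time
// that elapses once all dependences are solved, and the vertices it
// depends on.
struct CdcgVertex {
    std::string source;        // producing task, e.g. "t1"
    std::string target;        // consuming task, e.g. "t2"
    int volumeBytes;           // communication volume
    int computationCycles;     // cycles before the communication starts
    std::vector<int> deps;     // indices of vertices this one depends on
};

// The Figure 3 example: C1 and C3 have no dependences (the Start
// dependences are always solved); C2 depends on both C1 (d1_2) and
// C3 (d3_2).
std::vector<CdcgVertex> figure3Example() {
    return {
        {"t1", "t2", 100, 10, {}},      // C1
        {"t1", "t3",  50, 25, {0, 2}},  // C2
        {"t3", "t1", 100, 20, {}},      // C3
    };
}
```
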
The target architecture topology is modeled by a Communication Resource Graph (CRG), whose nodes are tiles and whose edges are links. The energy and execution time parameters are extracted from the target architecture synthesized for a given technology. The faulty tile list is generated by the diagnostic flow presented in Figure 1. From the application description, the NoC energy and execution-time parameters, the NoC topology, and the faulty tile list, the task mapping tool estimates the NoC energy consumption and the application execution time of different mappings, enabling the evaluation of the impact of faulty tiles. The next sections detail the underlying algorithms and the timing and energy models.
A. Mapping Algorithm
As stated before, the mapping problem consists in finding an association of each application task to a given processor, placed in a given tile, that minimizes the global energy consumption and the application execution time. Let n be the number of tiles; this problem allows n! possible solutions. Given that future MPSoCs may contain hundreds of tiles, an exhaustive search of the solution space is clearly unfeasible, and optimized implementations of such SoCs require the development of efficient mapping heuristics.
Exhaustive analyses of some small applications mapped onto NoC-based MPSoCs show that task mapping is a problem with self-similar [23] behavior: there are several very different mappings with the same cost, i.e., the same energy consumption and execution time. Therefore, exploring not all mappings, but a set of very different random mappings, each followed by some refinements (new mappings with few changes), normally results in an optimized solution. With its two nested loops (an external one that looks for very different solutions and an internal one that looks for a local minimum), Simulated Annealing (SA) is an algorithm well suited to finding solutions for self-similar problems.
Our SA mapping algorithm searches for mappings that yield an MPSoC with minimum energy consumption and low execution time. To combine these requirements in a single cost function, the execution time requirement is expressed in terms of energy consumption: the static power dissipation is multiplied by the application execution time (texec), giving the static portion of the energy consumption, as detailed in Section IV.B. As a result, both dynamic and static energy consumption are considered in the mapping cost function.
To improve the yield, the SA algorithm searches for mappings with minimum cost while avoiding the tiles marked as spare or faulty. When a tile is marked as faulty, the algorithm replaces it with a fault-free spare tile.
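The two nested SA loops described above can be sketched as follows. This is an illustrative sketch under our own assumptions (swap-based neighbor moves, geometric cooling, a fixed seed), not the actual CAFES algorithm; 'cost' is assumed to already fold execution time into energy as described in the previous paragraph, and 'usableTiles' is assumed to exclude faulty tiles:

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Simplified simulated-annealing mapping loop. Requires
// nTasks <= usableTiles.size(); task i initially sits on the i-th
// usable tile, and neighbor moves swap the tiles of two random tasks.
template <typename CostFn>
std::vector<int> saMap(std::size_t nTasks,
                       const std::vector<int>& usableTiles, CostFn cost,
                       double t0 = 100.0, double cooling = 0.95,
                       int movesPerTemp = 50, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::vector<int> mapping(usableTiles.begin(),
                             usableTiles.begin() + nTasks);
    std::vector<int> best = mapping;
    double curCost = cost(mapping), bestCost = curCost;
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::uniform_int_distribution<std::size_t> pick(0, nTasks - 1);
    for (double t = t0; t > 0.1; t *= cooling) {         // outer loop
        for (int m = 0; m < movesPerTemp; ++m) {         // inner loop
            std::vector<int> cand = mapping;
            std::swap(cand[pick(rng)], cand[pick(rng)]); // neighbor move
            double c = cost(cand);
            // Metropolis criterion: always accept downhill moves,
            // accept uphill moves with probability exp(-delta / t).
            if (c < curCost || uni(rng) < std::exp((curCost - c) / t)) {
                mapping = std::move(cand);
                curCost = c;
                if (c < bestCost) { best = mapping; bestCost = c; }
            }
        }
    }
    return best;
}
```

The outer temperature loop plays the role of the "very different solutions" search, while the inner loop refines toward a local minimum.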
B. Timing Model
The total packet delay (dijq) under wormhole routing is composed of the routing delay (dRijq) and the packet delay (dPijq) of the remaining flits. The routing delay is the time necessary to create the communication path, determined during the traversal of the packet header. The packet delay depends on the number of remaining flits. Let nabq be the number of flits of the q-th packet from pa to pb, obtained by dividing wabq by the link width. Let λ be the period of a clock cycle, tr the number of cycles needed to route a packet inside a router, and tl the number of cycles needed to transmit a flit through a link (between tiles, or between a processor and a router). The routing delay (dRijq) and the packet delay (dPijq) of the q-th packet from tile τi to tile τj are given by Equations (1) and (2), considering that the packet goes through η routers without contention; contentions can only be determined at execution time.
dRijq = (η × (tr + tl) + tl) × λ (1)
dPijq = (tl × (nabq - 1)) × λ (2)
Equation (3) expresses the total packet delay (dijq), i.e., the packet latency, obtained as the sum of dRijq and dPijq.
dijq = (η × (tr + tl) + tl × nabq) × λ (3)
For example, applying Equation (3) to a packet with 10 flits (nabq = 10) sent from tile τ1 to tile τ2 (two neighboring tiles, i.e., η = 2), and considering λ = 1 ns, tr = 3, and tl = 1 clock cycle, the packet latency is 18 ns.
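Equation (3) and the worked example translate directly into code (the function name is ours):

```cpp
// Packet latency per Equation (3): eta routers traversed, tr cycles to
// route inside a router, tl cycles to transmit a flit over a link,
// nFlits flits in the packet, and lambda the clock period (ns here).
double packetLatency(int eta, int tr, int tl, int nFlits, double lambda) {
    return (eta * (tr + tl) + tl * nFlits) * lambda;
}
```

With eta = 2, tr = 3, tl = 1, nFlits = 10, and lambda = 1 ns, this reproduces the 18 ns of the example above.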
The application execution time (texec) depends on both the application computation and communication. However, a simple equation cannot express texec, since several communications and computations often run in parallel. In addition, some communications may compete for the same communication resource (e.g. links and buffers) at the same time, which may cause contentions that increase the overall execution time; contentions also make a single closed-form equation impractical. Therefore, texec is computed during the mapping algorithm execution, which repeatedly uses the dijq values and the time expended in each computation.
C. Energy Model
The dynamic energy consumption is modeled using the concept of bit energy (EBit), similarly to the model described in [24]. For several communication architectures, EBit can be expressed as a function of four quantities, as depicted by Equation (4).

EBit = function(Es, Eb, Ec, El) (4)

Es is the dynamic energy consumption of a single bit on the wires and logic gates of each router. Eb is the bit dynamic energy consumption on router buffers. Ec is the dynamic energy consumption of a single bit on the links between a router and its local module. El is the bit dynamic energy consumption on the links between routers.
Equation (5) illustrates how EBit models a 2D direct mesh NoC. It computes the dynamic energy consumed by a bit traveling through such a NoC from tile i (τi) to tile j (τj), where ηij corresponds to the number of routers that the bit traverses.
EBitij = ηij × (Es+Eb) + 2 × Ec + (ηij – 1) × El (5)
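Equation (5) translates directly into code. The sketch below uses made-up per-bit energy values purely for illustration; the real parameters come from the library characterization described in Section IV.D:

```python
def ebit_ij(eta_ij, es, eb, ec, el):
    """Dynamic energy of one bit from tile i to tile j (Equation (5)):
    switch + buffer energy in each of the eta_ij routers, two local links
    (source and destination), and eta_ij - 1 router-to-router links."""
    return eta_ij * (es + eb) + 2 * ec + (eta_ij - 1) * el

# Illustrative per-bit energies in picojoules (assumed, not calibrated).
print(ebit_ij(3, es=0.5, eb=0.8, ec=0.2, el=0.4))  # 3*1.3 + 2*0.2 + 2*0.4 = 5.1
```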
Let wabq be the total number of bits of a packet pabq going from pa to pb (i.e. processors a and b), which are mapped on tiles τi and τj, respectively. Then, the dynamic energy consumed by all k packets of the pa → pb communications is given by Equation (6).
EBitab = ∑q=1..k (wabq × EBitij) (6)
Hence, Equation (7) gives the total dynamic energy consumed by the NoC (EDyNoC), where y represents the total number of communications between different processor pairs pa and pb.
EDyNoC = ∑i=1..y (EBitab)i, ∀ pa, pb ∈ processors set (7)
The static power dissipation of each router (PRouter) is proportional to the number of gates that compose the router, and it can be estimated by electrical simulation. With n representing the number of tiles, Equation (8) computes the NoC static power dissipation (PNoC).
PNoC = n × PRouter (8)
Using the texec defined in Section IV.B, Equation (9) computes the NoC static energy consumption (EsNoC).
EsNoC = PNoC × texec (9)
Finally, Equation (10) gives the overall energy consumption of the NoC (ENoC), considering both static and dynamic effects, which the SA algorithm uses as the cost function to search for optimal mappings.
ENoC = EsNoC + EDyNoC (10)
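Putting Equations (6)–(10) together, the cost function reduces to a sum over packets plus a static term. A minimal sketch, with all numeric inputs assumed for illustration:

```python
def total_noc_energy(packets, n_tiles, p_router, t_exec):
    """ENoC = EsNoC + EDyNoC (Equation (10)).
    packets: iterable of (w_abq, ebit_ij) pairs, i.e. packet size in bits and
    the per-bit energy of its route (Equations (6)-(7)).
    Static part: n * P_router * t_exec (Equations (8)-(9))."""
    e_dynamic = sum(w * e for w, e in packets)   # Equations (6)-(7)
    e_static = n_tiles * p_router * t_exec       # Equations (8)-(9)
    return e_static + e_dynamic

# Assumed numbers: two packets, 12 tiles, 1 mW static per router, 10 us run.
print(total_noc_energy([(1024, 5.1e-12), (2048, 3.9e-12)],
                       n_tiles=12, p_router=1e-3, t_exec=10e-6))
```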
D. Model Calibration
The Hermes NoC [7], configured with a 16-bit phit and input buffers with four positions, was used to validate the timing and energy models. The Hermes VHDL description was synthesized to an ASIC standard cell library. The library also supplies energy values for the cells, which are used to extract the energy parameters.
The synthesis result is a logic gate netlist. This netlist is associated with a customized VHDL library, which enables fast and accurate energy consumption and timing estimations. A testbench applies both random and typical traffic to the netlist, and the results achieved by VHDL simulation are compared to those obtained from the high-level mapping tool. Our experiments showed average errors below 30.5% and 14% for the energy consumption and execution time estimations, respectively.
V. EXPERIMENTAL SETUP
This section presents the methods used to generate the combinations of faulty tiles, called fault scenarios. The first method is exhaustive and is used for small NoCs; the second is a statistical method used for larger NoCs. Finally, we present the application classes evaluated in this paper.
A. Exhaustive Fault Generation Method
Faulty tiles are exhaustively generated for all combinations of faulty tile locations, assuming a system with 1 to 3 faulty tiles. Thus, Equation (11) defines the total number of injected fault scenarios as the sum of all combinations of 1 to nfaults faults in x × y tiles. For instance, a 3 × 4 mesh NoC requires 298 fault scenarios (12 single faults, 66 double faults, and 220 triple faults).
scens(x × y, nfaults) = (x·y choose 1) + (x·y choose 2) + … + (x·y choose nfaults) (11)
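Equation (11) is just a sum of binomial coefficients. The helper below (its name is ours) reproduces the scenario counts quoted in this section:

```python
from math import comb

def scens(n_tiles, nfaults):
    """Total number of fault scenarios (Equation (11)): all combinations
    of 1, 2, ..., nfaults faulty tiles among n_tiles = x * y tiles."""
    return sum(comb(n_tiles, k) for k in range(1, nfaults + 1))

print(scens(3 * 4, 3))   # 12 + 66 + 220 = 298 (the 3x4 example above)
print(scens(5 * 5, 3))   # 2625 executions for a 5x5 mesh
print(scens(5 * 5, 4))   # 15275 with up to 4 simultaneous faults
```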
B. Statistical Fault Generation Method
The exhaustive fault generation method is precise; however, it might not be possible to perform exhaustive fault simulation due to the long CPU time. The main reason is that the total number of required executions, defined in Equation (11), grows exponentially with the NoC size (x × y) and the maximum number of simultaneous faults (nfaults). Moreover, the CPU time of a single execution of the task mapping tool grows with the NoC size.
For instance, a 3 × 4 mesh NoC with up to 3 simultaneous faults requires 298 task mapping executions (about 3 minutes of CPU time) to perform exhaustive fault simulation. However, a bigger NoC, such as a 5 × 5 mesh with up to 3 faults, requires 2625 executions, taking about 60 hours of CPU time. The same 5 × 5 mesh NoC with up to 4 simultaneous faults requires 15275 executions, which we estimate would require about 14 days of CPU time.
Even though the economic motivation for spare tiles is appealing, it might be unfeasible to perform an exhaustive fault simulation, since the CPU time becomes an issue for bigger NoCs with multiple faults. This section presents a statistical approach, called sample size estimation [25], used to determine the minimal number of fault scenarios required to obtain satisfactory results, i.e. results close to the ones achieved by the exhaustive approach. This way, the CPU time can be drastically reduced while the results remain accurate. Moreover, this method enables trading off CPU time against result accuracy.
Before executing the sample size estimation, a pilot simulation is performed with a small sample. A sample represents a set of executions of the task mapping tool, where each execution assumes that the faulty tiles are randomly selected. Each execution of this pilot results in a different mapping with different energy consumption and execution time. If the energy consumption is the value to be estimated, then this pilot gives the population's estimated standard deviation s of the energy consumed in the presence of randomly located faulty tiles. The population in this context represents the entire set of fault scenarios, as determined by Equation (11).
The goal of the sample size estimation is to estimate the population average (μ), i.e. the average energy consumption of the entire population of fault scenarios. Equation (12) is typically used for this purpose, where s is the estimated standard deviation of the sample, and (x − μ) is the difference between the estimated sample average (x) and μ, which represents the acceptable error between the sample and the population. tα,df is the value from Student's t-distribution table [25], where (1 − α) is the confidence level and df is the degree of freedom, defined as df = n − 1.
n = (s × tα,df)² / (x − μ)² (12)
Since n is unknown, one can select an initial value of n to obtain tα,df. This value is used in Equation (12) to find a new n and a new tα,df. This calculation is performed iteratively until the value of n stabilizes. The stable value of n is the minimal sample size required to estimate the population average μ with the expected accuracy.
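The fixed-point iteration on Equation (12) can be sketched as follows. The abridged t-table and the starting value are our own simplifications, so the resulting sample size only approximates what a full t-table would give:

```python
import math

# Abridged two-tailed Student's t critical values for alpha = 0.05
# (coarse illustration; a real implementation would use a full table).
T_TABLE = {5: 2.571, 10: 2.228, 20: 2.086, 30: 2.042, 60: 2.000, 120: 1.980}

def t_critical(df):
    """Nearest-df lookup in the abridged table."""
    return T_TABLE[min(T_TABLE, key=lambda d: abs(d - df))]

def sample_size(s, max_error, n0=30, max_iter=100):
    """Iterate Equation (12), n = (s * t_{alpha,df})^2 / (x - mu)^2,
    with df = n - 1, until n stabilizes."""
    n = n0
    for _ in range(max_iter):
        new_n = math.ceil((s * t_critical(n - 1)) ** 2 / max_error ** 2)
        if new_n == n:
            break
        n = new_n
    return n

# Pilot standard deviation s = 3.9 (%), acceptable error 4% (situation (i) below).
print(sample_size(3.9, 4.0))
```

With the coarse table the iteration settles on a sample size close to, but not identical with, the value a complete t-table yields.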
C. System Application
We explore several parallel applications with distinct features, aiming to determine what kind of application increases the overhead in energy and execution time in the presence of faulty tiles. A synthetic application generator, detailed in [22], is used to create random CDCGs.
This synthetic application generator can build several application classes by varying parameters such as: (i) the number of processors, which allows investigating different target architecture dimensions; (ii) the number of graph levels, which specifies the number of dependent communications an application has; (iii) the dependence degree, which defines the probability that a vertex has more than one dependence, keeping in mind that dependent communications cannot compete for NoC resources; (iv) the probability of end meeting, which defines whether a vertex has dependences or is a final communication; (v) the computation time, i.e. the period, associated with each source task, between the resolution of all its dependences and the start of its communication; (vi) the communication volume, i.e. the number of bytes transmitted in each communication; and (vii) the number of parallel communications, which describes the minimum number of parallel communications an application has.
For instance, by varying the relation between computation time and communication volume, an application may change from I/O-bound to CPU-bound; by varying the relation between the number of graph levels and the dependence degree, an application may be dataflow or concurrent.
We built 39 synthetic applications, which enable exploring applications classified as (i) I/O- or CPU-bound; (ii) dataflow with different levels of parallelism; and (iii) strongly parallel or sequential, with different levels of concurrency on the communication architecture.
VI. EXPERIMENTAL RESULTS
This section evaluates (i) the application execution time under exhaustive fault scenarios, (ii) the average energy consumption under exhaustive fault scenarios, and (iii) the proposed statistical fault generation method used to estimate energy consumption, comparing it to the exhaustive method.
A. Evaluating the Application Execution Time
All application classes have been evaluated in terms of execution time using the exhaustive fault generation method.
The result is that, independently of the application class, the execution time is not affected by the presence of faulty tiles. On average, the variation in execution time between the fault-free chip and the chips with up to 3 faulty tiles is close to 0%.
The reason lies in the timing model, presented in Section IV.B, more specifically in Equation (3). The total application time consists of the computation time plus the communication time. The communication time consists of the routing delay, which depends on the distance between the communicating elements, plus the packet delay, which depends on the packet size.
If the computation time is much greater than the communication time, then the task mapping has very little influence on the application execution time. Even if the computation and communication times are equivalent, if the packet size is large (hundreds of flits), the routing delay has a very small impact on the communication time (since the NoC works as a pipeline), and thus also a small impact on the application execution time.
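This pipeline argument is easy to check numerically with the terms of Equation (3); the cycle counts and hop count below are assumed values, not measurements:

```python
# Share of the routing delay in the total packet latency (Equation (3))
# for growing packet sizes; tr, tl and the hop count eta are assumed.
tr, tl, eta = 3, 1, 4
for n_flits in (1, 10, 100, 500):
    routing = eta * (tr + tl) + tl    # header routing portion, in cycles
    payload = tl * (n_flits - 1)      # remaining flits, in cycles
    share = routing / (routing + payload)
    print(f"{n_flits:4d} flits: routing delay is {share:6.1%} of the latency")
```

For single-flit packets the routing delay dominates, but with a few hundred flits it falls below a few percent of the total latency, matching the observation above.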
Since in typical scenarios an application has more computation than communication, and applications use packets with hundreds of flits, the impact of the routing delay on the overall application execution time is almost negligible. This claim can be demonstrated with the following example.
Let us assume a given application on a 3×4 mesh NoC, whose normal behavior is to have more computation than communication. This application is modified into three variations: low communication (packets of one flit), low computation (CPU time of 1 clock cycle), and both low communication and low computation. Exhaustive fault generation is performed for these cases, generating Figure 4, which shows the difference between the average execution time of the population of faulty chips and the execution time of the fault-free chip.
Figure 4. The average overhead of application execution time of faulty chips.
This figure demonstrates that faulty tiles have a significant influence on the chip execution time only if both the computation and the communication are low, which is not the typical situation. Most actual applications have larger packet sizes and more computation than communication.
B. Evaluating the Energy Consumption
The energy consumption is evaluated for each class of application described in Section V.C. The result is that, regardless of the application class, only the proportion of good to faulty tiles affects the energy consumption. For instance, a chip with 15 tiles where two of them are faulty consumes more energy than the same chip with only one faulty tile. These results are illustrated in Figure 5 for a 3×5 mesh and an application with 12 tasks and 3 spare tiles. The average impact of a faulty tile on energy consumption is worse in the center of the NoC, and it increases if there are more faulty tiles in the chip (Figure 5(a)). This impact gradually decreases as the distance from the center tiles increases (Figure 5(b)).
However, if we map the same application on a 4×4 mesh NoC, then there are 12 tasks and 4 spare tiles. Figure 6 compares the energy profile of this application on a 3×5 against a 4×4 mesh NoC, assuming three faults in each of them. It can be observed that the energy overhead in the 4×4 mesh is lower. The reason is the proportion of good to faulty tiles: in a 3×5 mesh with 3 faults the proportion is 15/3, while in a 4×4 mesh it is 16/3. This extra tile in the 4×4 mesh gives the task mapping tool more freedom to determine a good scheduling, improving the effect of self-similarity (Section IV.A) and resulting in a better task mapping.
Figure 5. The average energy consumption overhead of faulty chips.
Figure 6. The energy overhead with three faults on a 3x5 and a 4x4 mesh.
C. Evaluating the Statistical Fault Generation Method
This section demonstrates the fault generation method proposed in Section V.B. For this experiment, we assume a small NoC, such as a 3×4 mesh, because the total CPU time for both the statistical and exhaustive fault generation methods is not too high. An application with 9 tasks is used for this experiment, although all other applications presented very similar results. Let us assume that the goal of this experiment is to estimate the average energy overhead when a fault hits a given tile, considering scenarios with 3 simultaneous faults.
First, the exhaustive method is executed, running all combinations of up to 3 faults in 12 tiles, i.e. scens(3 × 4, 3) = 298 fault scenarios (Equation (11)). It took about 3 minutes of CPU time to execute them. These results are considered the target results, i.e. the results we want to approach with the statistical method.
The second step is to execute a pilot experiment with a small number of randomly selected fault scenarios per router. This pilot experiment is used solely to extract the standard deviation of the energy consumed by chips with three randomly placed faulty tiles. The estimated standard deviation is 3.9% of the energy consumption.
The proposed sample size estimation approach is executed assuming two situations: (i) standard deviation of 3.9, confidence interval of 95%, and maximum error of 4%; and (ii) standard deviation of 3.9, confidence interval of 98%, and maximum error of 2%. The estimated sample sizes are 8 and 23, respectively, meaning that each tile must appear in at least 8 or 23 fault scenarios. From now on, the first situation is called sample8 and the second sample23. TABLE 1 presents the obtained results in terms of CPU time, total number of scenarios, and the maximum error observed for each tile.

TABLE 1. RESULTS FOR THE STATISTICAL FAULT GENERATION METHOD.

             CPU time (s)   # scenarios   max obs. error (%)
Exhaustive       192            220              -
Sample8           27             39             2.4
Sample23          63            101             0.8
Figure 7 illustrates the three situations and their respective heat charts, representing the energy overhead when a fault is found at each tile. Each square represents the average energy consumption for each tile.
It can be observed that the exhaustive method produces the expected results (the energy gradually decreases from the center to the borders). Sample23 produces almost the same results as the exhaustive method, with a small error but much less CPU time. Sample8 produces large errors, indicating that the sample size is not sufficient to accurately estimate the energy overhead for each router.
Even if the exhaustive results are not available, it is still possible to check the accuracy of a sample by visually analyzing the heat chart, as demonstrated in Figure 7. For instance, the expected appearance of a good heat chart is like that of the exhaustive test set, even for NoCs of different sizes and different applications. Note that the heat chart for sample8 deviates from the expected appearance, indicating that one should increase the sample size, if possible, to increase the accuracy of the results.
Figure 7. Visual analysis of the statistical fault generation method (heat charts for exhaustive, sample8, and sample23).
Figure 8 overlaps the average results for the three situations.
By comparing the exhaustive set with the other test sets, it can be seen that the biggest error for sample8, located at tile [2, 1], is 2.4% (see 1), which is below the maximum error stipulated for this set of experiments (4%). The biggest errors for sample23, located at tiles [1, 1] and [2, 0] (see 2), are around 0.8%, which is below the maximum error stipulated for this set of experiments (2%).
Figure 8. Close analysis of the resulting error by overlapping the average results for exhaustive, sample8, and sample23.
The example presented in this section demonstrates that the proposed fault generation approach enables trading off CPU time against result accuracy by selecting different values for the acceptable difference (x − μ) and the confidence level (1 − α).
VII. FINAL REMARKS
Previous papers have demonstrated that the use of spare tiles can significantly improve the yield and reduce the manufacturing cost of NoC-based MPSoCs. The tool presented in this paper determines task mappings for NoC-based MPSoCs with faulty tiles, minimizing the energy consumption and the application execution time. This way, defective chips can still execute the application, perhaps with some performance degradation, and can at least be sold to a lower-end market.
This paper evaluates the energy consumption and application execution time of faulty chips compared to fault-free chips. We evaluated several different classes of applications to check whether any particular application feature could affect the energy consumption or application execution time under faulty tiles. The results show that the spare tile approach has a small impact on energy consumption, and this impact is even smaller when the proportion of good to faulty tiles is higher. The existence of faulty tiles on the chip has, on average, no significant influence on the application execution time. Based on these results, we conclude that the spare tile approach can increase yield and reduce cost with small penalties on the application requirements.
Finally, this paper also proposed a statistical fault generation approach targeting very large MPSoCs. This approach demonstrates that a small sample of fault scenarios is sufficient for a reasonably accurate estimation of energy consumption, and it enables trading off CPU time against result accuracy.
VIII. ACKNOWLEDGMENT
Alexandre is supported by postdoctoral scholarships from Capes-PNPD and FAPERGS-ARD, grant numbers 02388/09-0 and 10/0701-2, respectively. Fernando Moraes is supported by CNPq and FAPERGS, projects 301599/2009-2 and 10/0814-9, respectively. Cesar Marcon and Marcelo Lubaszewski are partially supported by CNPq scholarships, grant numbers 308924/2008-8 and 478200/2008-0, respectively.
IX. REFERENCES
[1] Wolf, W.; Jerraya, A. A.; Martin, G. Multiprocessor system-on-chip (MPSoC) technology. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 27(10), pp. 1701-1713, 2008.
[2] Kahng, A.; et al. ORION 2.0: a fast and accurate NoC power and area model for early-stage design space exploration. DATE, pp. 423-428, 2009.
[3] Lee, S. E. et al. A high level power model for network-on-chip (NoC) router. Computers & Electrical Engineering, 35(6), 2009.
[4] Refan, F. et al. Reliability in application specific mesh-based NoC architectures. IEEE International On-Line Testing Symposium, pp. 207-212, 2008.
[5] Shamshiri, S.; Cheng, K-T. Yield and Cost Analysis of a Reliable NoC. VLSI Test Symposium, pp. 173-178, 2009.
[6] Carara E. A. et al. HeMPS - a framework for NoC-based MPSoC generation. ISCAS, pp. 1345–1348, 2009.
[7] Moraes, F. et al. HERMES: an infrastructure for low area overhead packet-switching networks on chip. Integration, the VLSI Journal, 38(1), pp. 69-93, 2004.
[8] Bertozzi, D.; Benini, L.; De Micheli, G. Error control schemes for on-chip communication links: the energy-reliability tradeoff. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 4(6), pp. 818-831, 2005.
[9] Ejlali, A. et al. Performability/energy tradeoff in error-control schemes for on-chip networks. IEEE Transactions on Very Large Scale Integration Systems, 18(1), pp. 1-14, 2010.
[10] Zhang, Z.; Greiner, A.; Taktak, S. A reconfigurable routing algorithm for a fault-tolerant 2D-mesh network-on-chip. DAC, pp. 441-446, 2008.
[11] Manolache, S.; Eles, P.; Peng, Z. Fault and energy-aware communication mapping with guaranteed latency for applications implemented on NoC. DAC, pp. 266-269, 2005.
[12] Hu, J.; Marculescu, R. Energy-aware communication and task scheduling for network-on-chip architectures under real-time constraints. DATE, pp. 234-239, 2004.
[13] Lei, T.; Kumar, S. A two-step genetic algorithm for mapping task graphs to a network on chip architecture. Euromicro Symposium on Digital System Design, pp. 180-187, 2003.
[14] Murali, S. et al. Mapping and configuration methods for multi-use-case networks on chips. ASP-DAC, pp. 146-151, 2006.
[15] Lee, C. et al. A task remapping technique for reliable multi-core embedded systems. CODES/ISSS, pp. 307-316, 2010.
[16] Ababei, C.; Katti, R. Achieving network on chip fault tolerance by adaptive remapping. International Symposium on Parallel & Distributed Processing, pp. 1-4, 2009.
[17] Tornero, R. et al. A multi-objective strategy for concurrent mapping and routing in networks on chip. International Symposium on Parallel & Distributed Processing, pp. 1-8, 2009.
[18] Choudhury, A. et al. Yield enhancement by robust application-specific mapping on network-on-chips. NoCArc, pp. 37-42, 2009.
[19] Huang, L. et al. Lifetime reliability-aware task allocation and scheduling for MPSoC platforms. DATE, pp. 51-56, 2009.
[20] Huang, L.; Xu, Q. Energy-efficient task allocation and scheduling for multi-mode MPSoCs under lifetime reliability constraint. DATE, pp. 1584-1589, 2010.
[21] Huang, L; Xu, Q. AgeSim: A simulation framework for evaluating the lifetime reliability of processor-based SoCs, DATE, pp. 51-56, 2010.
[22] Marcon, C. et al. CAFES: a framework for intrachip application modeling and communication architecture design. Journal of Parallel and Distributed Computing, 71(5), pp. 714-728, 2011.
[23] Mandelbrot, B. How long is the coast of Britain? Statistical self-similarity and fractional dimension. Science, 156(3775), pp. 636-638, 1967.
[24] Ghadiry, M.; Nadi, M.; Rahmati, D. New approach to calculate energy on NoC. International Conference on Computer and Communication Engineering, pp. 1098-1104, 2008.
[25] Hill, T.; Lewicki, P. Statistics: methods and applications: a comprehensive reference for science, industry, and data mining. StatSoft, 832 p., 2006.
Me3D: A Model-driven Methodology Expediting Embedded Device Driver Development
Hui Chen, Guillaume Godet-Bar, Frederic Rousseau, and Frederic Petrot
TIMA Laboratory (CNRS – Grenoble INP – UJF), 46 av. Felix Viallet, 38031 Grenoble, France
{hui.chen, guillaume.godet-bar, frederic.rousseau, frederic.petrot}@imag.fr
Abstract—Traditional development of reliable device drivers for multiprocessor system on chip (MPSoC) is a complex and demanding process, as it requires interdisciplinary knowledge in the fields of hardware and software. This problem can be alleviated by an advanced driver generation environment. We have achieved this by systematically synthesizing drivers from a device features model and specifications of hardware and in-kernel interfaces, thereby lessening the impact of human error on driver reliability and reducing the development costs. We present the methodology, called Me3D, and confirm the feasibility of the driver generation environment by manually converting sources of information captured in different formalisms into a Multimedia Card Interface (MCI) driver for a real MPSoC under a lightweight operating system (OS).
I. INTRODUCTION
Nowadays, a typical multiprocessor system on chip (MPSoC) project takes place under ever-increasing time-to-market pressure. Hardware and software are regularly re-designed for new versions of a product. As is well acknowledged, the software development cycle consumes considerable time and effort.
On the software side, device driver development causes a serious bottleneck. It is intrinsically complex and error-prone due to the necessity of interdisciplinary knowledge in the fields of engineering and computer science. In other words, device driver developers require in-depth understanding of the innumerable peripherals that exist in a typical embedded system, programming tools, operating systems (OSes), bus protocols, network programming, system management [1], etc.
Delivering a high-quality and thoroughly tested device driver is laborious. For instance, the LH7A404 system on chip (SoC) from NXP Semiconductors contains 16 peripherals, and the corresponding drivers have more than 78,000 physical source lines of code (SLOC) [2] (requiring around 19.6 person-years of development effort, estimated with SLOCCount1). Software re-usability and automation methods are hence eagerly required to reduce design effort and improve productivity.
The difficulty in designing and implementing reliable device drivers is notorious. Drivers in the Linux kernel 2.6.9 account for 53% of its bugs [3]. Similarly, 85% of unexpected system crashes originate from driver problems, according to a recent report from Microsoft [4]. With this in mind, a new methodology addressing reliability is strongly needed.
The contribution presented in this paper is a flexible device driver generation environment, able to produce the final C code of a software driver, starting from a device features model.
1SLOCCount v2.26 by David A. Wheeler, www.dwheeler.com/sloccount/
Fig. 1. Device driver as a low-level module in the OS structure
This environment is composed of a device driver generation tool and a validation flow. To evaluate the generation environment, we conducted a case study on the Multimedia Card Interface (MCI) device driver for the Atmel D940 [5] MPSoC. We created a features model for the MCI device, specifications of the MCI device and the D940 board, and an in-kernel interface specification for an ad-hoc OS, all of which are then systematically converted into the MCI driver. Afterwards, we validated the generated device driver with a validation flow. The experimental results demonstrate the feasibility of implementing the generation environment, and the expected efficiency in developing device drivers for MPSoC.
The paper is organized as follows. Section II presents the anatomy of a device driver. Section III reviews related work. Section IV introduces our methodology for accelerating embedded device driver development. Section V evaluates the proposed methodology. The last section concludes the paper and identifies future work based on the findings provided here.
II. DEVICE DRIVER OVERVIEW
The term device, as used in this paper, does not refer to the primary central processing unit (CPU) or main memory, but to a specific hardware resource for a dedicated task. The device is either attached to or embedded in a computer system architecture and can interact with the CPU and other hardware resources in the system via a single system bus or through a bus hierarchy.
A device driver is a low-level software component in the OS, which allows upper-level software to interact with a device. It can be considered, from an abstract point of view, as a brick in the OS chart (Fig. 1). The device drivers mentioned here mostly target embedded systems, which differ from personal computers (PCs) in their broader adoption of SoCs and greater variety of buses.
978-1-4577-0660-8/11/$26.00 © 2011 IEEE
A device driver implements an interface to the kernel and/or application developers for an underlying device, and provides a lower-level communication channel to the device. It acts as a translator from the kernel interface to the hardware interface. As a form of communication, it requires kernel services, and often also offers services to other kernel components.
The communication channels to the device can be provided by lower-level drivers. This leads to cascaded drivers [6]. The upper-level drivers provide an abstract view of the execution platform, while the lower-level drivers are more concrete and provide transparent communication to the devices' interfaces. An example is the Inter-Integrated Circuit (I2C) driver stack in the Linux kernel 2.6 [7].
The device driver interface can be separated into four parts (Fig. 1): a) the driver requires kernel services such as memory allocation, and also offers services (e.g., hardware initialization) to the kernel; b) the user application sends general commands to the driver using the exported driver interface; c) libraries provide the driver with services such as string manipulation; and d) lastly, the hardware abstraction layer (HAL) accommodates hardware access methods, which are used by the driver.
One of the most elementary pieces of information about a device and the driver that manages it is what function the device accomplishes. Different devices carry out different tasks. There are devices that play sound samples, devices that read and write data on a magnetic disk, devices that display graphics on a video screen, and so on.
For each type of functionality, there may be many different devices that carry out similar tasks. For instance, when displaying graphical information on a video device, the display controller may be a simple Video Graphics Array (VGA) controller, or it may be a modern video card running on Peripheral Component Interconnect Express (PCIe), with several gigabytes of graphics memory. Nevertheless, in each case, the high-level purpose of the device is the same.
The device driver organization involves a set of driver entry points, a number of data structures, and possibly also global symbols and constants. A typical driver entry point encompasses the hardware programming part (gray blocks in Fig. 2) and the kernel-driver interaction part. Composing a driver entry point requires up to eight pieces of information: 1) HAL-related (e.g., register access primitives), 2) platform-related (e.g., device base address), 3) device-related (e.g., register and bit field offsets), 4) device features (e.g., register programming sequences), 5) kernel-driver interface (e.g., return type and argument list of the driver entry point), 6) kernel services (e.g., memory allocator), 7) device class-related (e.g., access protocols), and 8) libraries-related (e.g., string manipulators).
Thus, because of the various and interrelated sources of information, driver generation is intrinsically complex.
III. RELATED WORK
We briefly discuss related work in the area of device driver development methodology, which can be classified into three categories: device driver synthesis methodologies, device interface languages, and hardware specification languages.
Early device driver synthesis methods, as part of hardware/software co-design efforts, attempt to synthesize OS-based device drivers for embedded systems [8]. However, these devices are different from those targeted by Me3D, as they have a simple internal structure and a small set of input/output (I/O) signals. Moreover, the synthesized driver only runs with a platform-specific real-time operating system (RTOS). Therefore, these approaches do not take on some of the issues addressed here, including the separation of in-kernel interface and hardware specifications.
Fig. 2. Looking into the driver entry points for an OS written in C
Wang et al. propose a tool [9] for synthesizing embedded device drivers. This approach does not separate the in-kernel and hardware interfaces of the driver, forcing the driver developer to detail the complete driver behavior for every device. In addition, they assume that the driver functionality can be split into non-overlapping control and data parts. This holds for some simple drivers; in more complex drivers, the control and data paths are tightly interleaved.
Termite [10] synthesizes device drivers by merging two state machines, of the OS and of the device. This may unavoidably lead to state explosion and a large final code size. This paper addresses these limitations. In addition, we believe a device class specification, which solely defines a set of events shared between the OS and the device specification, is not necessary.
Bombieri et al. [11] propose a device driver generation methodology based on the register transfer level (RTL) test bench of an intellectual property (IP). However, device drivers cannot be generated unless the code of the RTL test bench is available. In contrast, we propose a methodology that is applicable without involving RTL test benches.
Languages such as Laddie [12] and NDL [13] offer some constructs to describe a device interface. However, the first approach does not really deal with device driver problems, being limited to generating register access functions. The NDL approach requires changing how device drivers are written, but does not offer a solution for legacy drivers.
Hardware specification languages such as IP-XACT and the Unified Modeling Language (UML) MARTE profile are able to describe some parts of electronic components and designs. UML MARTE is widely used, while IP-XACT is an IEEE standard that describes not only structural information about hardware devices and designs, but also some contents such as a register map. To demonstrate the feasibility of our methodology, we have modeled the device and the hardware platform in IP-XACT.
Fig. 3. Abstract view of the Me3D methodology
Fig. 4. Basic library
IV. DRIVER GENERATION ENVIRONMENT
Device drivers such as upper-level driver stacks in cascaded drivers may not have direct access to the underlying hardware. In the remainder of this paper, it is assumed that a driver sits directly above the HAL.
Fig. 3 shows an abstract view of the Me3D methodology. The generation environment requires a device features model, hardware specifications, an in-kernel interface specification, libraries, and driver configuration parameters to produce device drivers. The device driver, in binary format, is validated on a real MPSoC or on virtual platforms. During the validation phase, performance results can be extracted, as shown in [14], for tuning the driver configuration parameters. With modified parameters, a new version of the driver is then generated.
A. Basic and HAL libraries
The basic library contains an abstraction layer for common data manipulation methods (Fig. 4). For instance, the StrCpy primitive (Fig. 4) may be linked to the strcpy function of a standard C library implementation, such as Newlib [15] or uClibc [16], or to the dna_strcpy function of an ad-hoc C library. Introducing these primitives allows the exploration of memory footprint and performance through the selection of different C library implementations. The basic library may contain source and/or object files.
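As an illustration of this abstraction, a compile-time binding of StrCpy could look like the following sketch; the USE_ADHOC_LIB switch and the dna_strcpy body are our assumptions, not the actual Me3D basic library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical ad-hoc C library implementation (stand-in for dna_strcpy). */
static char *dna_strcpy(char *dst, const char *src) {
    char *ret = dst;
    while ((*dst++ = *src++) != '\0')
        ;
    return ret;
}

/* The basic library binds the abstract StrCpy primitive to one concrete
 * implementation at generation time; flipping USE_ADHOC_LIB explores a
 * different memory footprint / performance trade-off. */
#ifdef USE_ADHOC_LIB
#define StrCpy(dst, src) dna_strcpy((dst), (src))
#else
#define StrCpy(dst, src) strcpy((dst), (src)) /* e.g., Newlib or uClibc */
#endif
```

A generated driver then calls StrCpy(...) without knowing which C library implementation ends up linked in.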
The HAL library contains the implementation of low-level hardware access primitives (e.g., primitives to read from or write to registers). It allows the development and integration of support for new hardware architectures to proceed separately from the generation tools, thereby increasing the flexibility of the environment and the reusability of the components. It may include source and/or object files.
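A HAL read/write primitive pair can be sketched as volatile accesses; the names below are illustrative, and on a real platform addr would be a memory-mapped I/O address rather than ordinary memory.

```c
#include <stdint.h>

/* Low-level HAL primitives: volatile prevents the compiler from caching
 * or eliding the accesses, which matters for memory-mapped registers. */
static inline uint32_t cpu_read_uint32(uintptr_t addr) {
    return *(volatile uint32_t *)addr;
}

static inline void cpu_write_uint32(uintptr_t addr, uint32_t val) {
    *(volatile uint32_t *)addr = val;
}
```

Shipping these as a separate library is what lets a new architecture port swap in, say, primitives with explicit memory barriers without touching the generator.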
B. In-kernel interface specification
In order to reflect in-kernel interface changes during kernel evolution, and to differentiate kernel-driver interaction among differing device classes, we propose an in-kernel interface specification dedicated to a certain device class for a given kernel.
The in-kernel interface specification mainly contains the kernel data structures (if any) to be used, software events, and transitions. The latter define the driver's desired reactions to requests, concerning hardware events that must happen before the driver sends a notification of completion to the kernel.
C. Device and platform specifications
Hardware vendors often release user manuals that describe the interface and operations of a device and the architecture of a hardware board. Such documentation is intended to provide sufficient information for driver developers. However, it is usually informal and written in natural language. To automate driver development, we require device and hardware board specifications, which provide not only structural information but also some contents like a register map. Specifications in a format such as IP-XACT and UML MARTE are available from some hardware vendors or can be derived from informal device or board documentation.
A device specification describes the following driver-related properties of a device: i) device name and ID, ii) register file information (e.g., register widths and offsets, bit field widths and masks, register/bit field accessibilities, reset values, etc.), and iii) port information.
A platform specification provides some driver-related information as well, i.e., i) device instantiations, ii) I/O offsets, iii) interrupt connections (which indicate whether the interrupt pin of the target device is used or not), iv) bus (e.g., bus clock, bus type, data bus width, data transfer type, device access type, transport mode), and v) processor (e.g., byte ordering, clock frequency, name, word length).
In general, IP-XACT includes most of the features mentioned above, although, to the best of our knowledge, it still lacks some information such as the data transfer type (e.g., x8, x16).
D. Device features model
Reading from or writing to a certain register may cause a side effect. For instance, writing a value to the length register of a given direct memory access (DMA) controller may start the DMA transfer. Often, some other registers must be programmed before the side effect takes place. For instance, before writing the length register of this DMA controller, the source address register and the destination address register must be set with the desired values so as to achieve successful DMA operations. Such a register programming sequence needs to be modeled to ensure correct device operation. Hence, we introduce a device features model to capture the way of interacting with the device.
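The ordering constraint can be made concrete with a hypothetical DMA controller whose register file is simulated in plain memory; all names below are invented for the sketch, the point being that the side-effecting length write must come last.

```c
#include <stdint.h>

/* Simulated register file of a hypothetical DMA controller. */
enum { DMA_SRC, DMA_DST, DMA_LEN, DMA_NREGS };
static uint32_t dma_regs[DMA_NREGS];
static int dma_started;

static void dma_reg_wr(int reg, uint32_t val) {
    dma_regs[reg] = val;
    if (reg == DMA_LEN)       /* side effect: writing LEN starts the transfer */
        dma_started = 1;
}

/* Correct programming sequence: source and destination addresses are set
 * before the length register triggers the transfer. */
static void dma_start_transfer(uint32_t src, uint32_t dst, uint32_t words) {
    dma_reg_wr(DMA_SRC, src);
    dma_reg_wr(DMA_DST, dst);
    dma_reg_wr(DMA_LEN, words);   /* must come last */
}
```

It is exactly this kind of "SRC and DST before LEN" sequence that the device features model is meant to capture explicitly.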
The device features model contains a set of predefined device features, such as init, read, write, etc. This model can be translated to C functions. The translation process is explained in more detail in the following section.
E. Driver generation
The driver generation flow is broken down into four steps. This section explains each of them.
Step 1: Parsing and inline functions generation. The device features model, along with the hardware specifications and the HAL library, is mainly used to generate bit field access functions (Fig. 5). These inline functions, containing bitwise operations (e.g., not, bit shift), are responsible for accessing the bit fields.

static inline uint32_t registerA_bitfieldX_rd() {
  return <read primitive> (IO_BASE + REGISTER_A_OFFSET) & BIT_FIELD_X_MASK;
}
...

Fig. 5. Step 1: Parameters parsing and inline functions generation
It is not difficult to generate these bit field access functions. An example of a bit field read function is shown in Fig. 5.a. Producing a bit field read function requires the names of the register and of the bit field that appear in the device features model, a read primitive from the HAL library, an I/O base address from the platform specification, and the offsets and widths of the register and the bit field from the device specification. If the HAL of the OS provides bit-level manipulation functions, then our code simply calls them.
There are two reasons for generating bit field access functions. Firstly, low-level bit operations account for a large share of the bugs in manual driver development. Secondly, introducing bit field access functions enhances readability to some extent.
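To complement the read accessor of Fig. 5.a, a matching generated write accessor could be sketched as follows; the mask, shift, and the simulated register are illustrative, not taken from a real device specification.

```c
#include <stdint.h>

/* Simulated 32-bit register standing in for a memory-mapped register. */
static uint32_t registerA = 0u;

#define BIT_FIELD_X_SHIFT 4u
#define BIT_FIELD_X_MASK  (0x3u << BIT_FIELD_X_SHIFT)   /* 2-bit field */

/* Generated read accessor: mask out the field and shift it down. */
static inline uint32_t registerA_bitfieldX_rd(void) {
    return (registerA & BIT_FIELD_X_MASK) >> BIT_FIELD_X_SHIFT;
}

/* Generated write accessor: read-modify-write so that neighboring bit
 * fields in the same register are preserved. */
static inline void registerA_bitfieldX_wr(uint32_t val) {
    registerA = (registerA & ~BIT_FIELD_X_MASK)
              | ((val << BIT_FIELD_X_SHIFT) & BIT_FIELD_X_MASK);
}
```

Getting the read-modify-write and the masking right by hand for dozens of fields is precisely the error-prone work the generator takes over.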
Apart from these bit field access functions and string constant macros, this step parses out device-related parameters (e.g., device cardinality) as well.
Step 2: Device features generation. This step makes use of the products (inline functions and parameters) of the previous step and requires a device features model. The reason for using this model is explained in subsection IV-D.
The device features model reflects product dependencies (if any), hardware configurations, and device operations (e.g., data transfer operations, command/response operations) described in natural language or as functional flow charts, which are traditionally provided by hardware vendors as part of the user manual. Thus, hardware vendors are expected to write the device features model.
The device features model can be written in C or in an alternative tiny language, DFDL (Device Features Description Language). Our experience shows that the latter is easier to interpret because it has simpler semantics. Instead of writing for (int i = 0; i < 5; i++) in C/C++, one just writes foreach i (0, 5) in DFDL; the foreach loop in DFDL is simpler and less error-prone.
The in-house language-based device features model is only used to define register programming sequences, not to write device drivers; this model could evolve to an intermediate format or even be eliminated, once a device specification is capable of capturing these sequences.
Using the products (inline functions and parameters) of Step 1, the device features model is translated to device functionalities (Fig. 6) in C. This translation is feasible, as the tiny language only uses high-level constructs for logical control; e.g., the await construct (see subsection V-D) maps to the do-while structure in C.

void access_hardware(...) {
  registerB_bitfieldY_wr($VAL);
  ...
  $VAL2 = registerA_bitfieldX_rd();
}
...

Fig. 6. Step 2: Device features generation

Fig. 7. Step 3: a) Driver and b) Makefile generation

Step 3: Driver source and Makefile generation. This
step makes use of the product (device features) of Step 2, and requires the basic and HAL libraries, driver configuration parameters, hardware specifications, and an in-kernel interface specification.
The libraries are used to produce some #include directives (Fig. 7). The in-kernel interface specification describes how the driver interacts with the kernel and adjacent drivers, while the driver configuration parameters determine tunable elements such as the synchronization method (interrupt or polling). With the in-kernel interface specification and the driver configuration parameters, it is easy to synthesize an extended finite state machine (EFSM) after dependency computation. The hardware events in the synthesized EFSM are then mapped to device features (generated in the previous step). Afterwards, the EFSM is translated into device driver code in C.
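One plausible shape for the translated EFSM is a state variable plus a message dispatch, sketched here loosely after the kernel-driver interaction of Fig. 11; the enum names and the switch-based dispatch style are our assumption, not the generator's actual output.

```c
/* States and messages loosely mirroring the EFSM of Fig. 11. */
typedef enum { S_START, S_INIT_HW, S_IDLE, S_READ, S_END } state_t;
typedef enum { M_INIT_HW, M_READ, M_READ_DONE, M_EXIT } msg_t;

static state_t state = S_START;

/* One EFSM step: each case encodes a transition; at the marked points a
 * real generated driver would invoke the mapped device features. */
static void driver_step(msg_t m) {
    switch (state) {
    case S_START:
        if (m == M_INIT_HW) state = S_INIT_HW;  /* program the device */
        break;
    case S_INIT_HW:
        state = S_IDLE;                         /* report status_ok */
        break;
    case S_IDLE:
        if (m == M_READ)      state = S_READ;   /* run the read feature */
        else if (m == M_EXIT) state = S_END;
        break;
    case S_READ:
        if (m == M_READ_DONE) state = S_IDLE;   /* send status to kernel */
        break;
    default:
        break;
    }
}
```

Flattening the EFSM into such explicit per-state code, rather than composing full OS and device state machines, is what keeps the generated driver small.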
In addition, the in-kernel interface specification and the platform specification define the compiler flavor and the processor type respectively, allowing the compilation environment to produce the Makefile.
Step 4: Source code compilation. In this step the make command is iteratively executed for a certain CPU architecture, using the previously generated Makefile as input. If the HAL and basic libraries are provided in the form of source files, they are also compiled.
F. Driver configuration and space exploration
Device driver development involves a series of decision-making processes. For instance, a write Application Programming Interface (API) may be implemented as either synchronous or asynchronous. Likewise, a DMA driver can use either a circular buffer or a linked list. Different decisions result in diverse C code and differing driver performance. We call this "driver space exploration".
Fig. 8. Code generation possibilities
Conventional driver space exploration refers to iteratively refactoring the driver code. The choice of the driver version is usually driven by performance, power consumption, or binary size.
In order to explore the driver space efficiently and effectively, we use high-level configuration parameters (Fig. 8) for the driver generation environment. A change in a design decision requires only the modification of an attribute of the driver configuration; a new version of the driver can then be generated again.
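As a rough illustration of how one configuration attribute reshapes the driver code, consider the choice between polling and interrupt synchronization; Me3D regenerates the source rather than using the preprocessor, so the #ifdef selection and all names below are purely didactic.

```c
#include <stdint.h>

static volatile uint32_t mci_status;   /* simulated status register */

/* Variant emitted when the configuration selects polling: busy-wait. */
static void wait_rxrdy_polling(void) {
    while (mci_status != 1u)
        ;   /* spin on the status register */
}

/* Variant emitted when the configuration selects interrupts: the driver
 * would sleep until an IRQ handler signals readiness; simulated here by
 * setting the status directly. */
static void wait_rxrdy_interrupt(void) {
    while (mci_status != 1u)
        mci_status = 1u;   /* stand-in for "sleep until IRQ fires" */
}

/* A single configuration attribute decides which variant the generated
 * driver contains. */
#ifdef SYNC_INTERRUPT
#define wait_rxrdy wait_rxrdy_interrupt
#else
#define wait_rxrdy wait_rxrdy_polling
#endif
```

Flipping one attribute and regenerating replaces the busy-wait with the sleeping variant throughout the driver, which is the essence of the exploration loop around Fig. 8.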
G. Driver validation
In our experimentation, we used a real board to validate the driver (along with an application program, an OS, and the hardware platform's HAL).
If a real board is not available, we propose two simulation models to validate the driver. The functionality is validated with an abstract SystemC simulation model, called transaction accurate. The performance of the driver is validated on a low-abstraction SystemC simulation model, called cycle accurate bit accurate. Due to limited space, these simulation models are not presented in this paper.
V. EVALUATION
In this section, we evaluate the applicability of Me3D as a methodology for expediting embedded device driver development, on the Atmel D940 [5] MPSoC.
A. Evaluation points
The points of evaluation are: 1) the feasibility of describing device features and in-kernel interfaces; 2) the feasibility of systematically converting the device features model and the specifications to a device driver.
To evaluate point 1, we chose a well-adopted specification language for the hardware platform and devices (here IP-XACT, though the methodology is not limited to it), specified in-kernel interfaces with our in-house IISL (In-kernel Interface Specification Language), and modeled device features with DFDL (Device Features Description Language).
To evaluate point 2, we manually converted the device features model, the hardware specifications, and the in-kernel interface specification to the device driver according to the proposed methodology. The open-source DNA-OS [17] is used by the converted software.
B. Hardware specifications
Device specifications. In order to bring the Multimedia Card Interface (MCI) (Fig. 9) into operation, one must configure some registers of the power management controller (PMC),
Fig. 9. MCI device and its neighborhood
Fig. 10. In-kernel interfaces for a MCI driver
of the programmable input output (PIO), and optionally of the programmable DMA controller (PDC). In other words, we need information about the register layouts of these devices. Hence, we have modeled the specifications for the MCI, the PMC, and the PIO (in IP-XACT for this experimentation). The PDC specification is not modeled, because the native MCI driver does not use DMA.
Platform specification. The D940 MPSoC contains many peripheral devices. It is not necessary to specify the whole platform. In practice, we have only modeled the parts related to device instantiations, interrupt numbering, etc.
C. In-kernel interface specification
In this paper, we will not introduce the grammar of our in-house IISL language. However, we briefly present what the in-kernel interface specification covers. It specifies the driver interfaces (Fig. 10) toward the application, the kernel, and a generic MultiMediaCard (MMC) module. The generic MMC module is responsible for card properties discovery and MMC protocol implementation. It offers some services (denoted by the lollipop connectors on the module side) to the MCI driver, and defines APIs (e.g., read_low).
The in-kernel interface specification presents a partial EFSM (Fig. 11) describing the interactions between the driver, its adjacent drivers (if any), and the kernel. To describe these interactions, we use messages. A message is a token sent to the driver from its adjacent drivers or the kernel, or vice versa. The former, called an "inbound message", can be a kernel request (e.g., initialize hardware, publish device, etc.), a DMA request, etc., whereas the latter, called an "outbound message", is the driver's response to the sender of the token. In Fig. 11, downward dashed arrows denote inbound messages, whereas upward dashed arrows represent outbound messages.
As shown in Fig. 11, starting from the OS booting (the start state), the device driver receives inbound messages from the kernel in succession, which bring the driver from one state to another sequentially, until it reaches the idle state. In the INIT_HW (hardware initialization) state, the driver sends back the status_ok message to the OS in the case of a successful initialization. When the driver is in the idle state, it waits for an inbound message. A read inbound message sends the driver to the READ state for reading data from the MMC card. At the end of the read, the driver sends a message with the read status to the kernel and returns to the idle state. An exit inbound message brings the driver to the end state.

Fig. 11. EFSM of kernel-driver interaction

1 read {
2   in void *buffer
3   in int32_t word_count
4
5   foreach i (0, word_count):
6     await (MCI_SR.RXRDY == 1)
7       ((uint32_t *)buffer)[i] = MCI_RDR
8 }
9 ...

Fig. 12. MCI device features model
D. MCI device features model
The device features model for the MCI device consists of seven features (e.g., read, write, etc.) and the definition of the block length. It is described with our in-house language, DFDL. Fig. 12 presents the read feature. This feature has two incoming arguments, i.e., the buffer pointer and the word count. It waits until the RXRDY (receive ready) bit field of the MCI_SR (status register) equals 1, then stores the value of the MCI_RDR (read register) to a specified buffer. The process above iterates word_count times.
DFDL is a tiny ad-hoc language with some constructs tailored for modeling device features. The await construct in Fig. 12 simplifies the do-while loop of C/C++. A C/C++ equivalent to line 6 of Fig. 12 (waiting for a bit field to reach a value) would contain several statements: reading the register value and logically ANDing it with a bit field mask, comparing the masked result with a specified value, and iterating these steps until the bit field value equals the given one.
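Using the MCI names that appear later in Fig. 13, such an expansion could be sketched as follows; the status register is simulated in plain memory (and flips RXRDY on after the first read) so that the fragment stays runnable.

```c
#include <stdint.h>

#define MCI_SR_RXRDY (0x1u << 1)   /* bit field mask, as in Fig. 13 */

static uint32_t mci_sr;            /* simulated MCI status register */

static uint32_t mci_sr_read(void) {
    /* A real driver would use the HAL read primitive on MCI_BASE + MCI_SR;
     * here we read a plain variable and then set RXRDY so the loop ends. */
    uint32_t v = mci_sr;
    mci_sr |= MCI_SR_RXRDY;
    return v;
}

/* Expansion of `await (MCI_SR.RXRDY == 1)`: read the register, mask the
 * bit field, shift it down, compare with the awaited value, iterate. */
static void await_rxrdy(void) {
    while (((mci_sr_read() & MCI_SR_RXRDY) >> 1) != 1u)
        ;
}
```

One DFDL line thus hides a mask, a shift, a comparison, and a loop, each of which is an opportunity for a hand-written bug.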
E. Conversion
At first we analyzed which registers and bit fields appear in the MCI device features model. Then we parsed out the offsets and widths of these registers and bit fields (Fig. 13.a) from the device specifications, and the MCI base address from the D940 platform specification. Information related to the MCI is gathered in the d940_mci.h file. Likewise, information related to the PIO and the PMC is collected in d940_pio.h and d940_pmc.h respectively. In addition, inline functions for accessing some bit fields are also produced (Fig. 13.b).
Afterwards, the macros and inline functions, along with the MCI device features model, are used to produce the MCI device features (Fig. 14). It must be mentioned that although the read device feature looks like a C function, it is not, as it is not qualified with a return type.

#define MCI_BASE 0xFFFA8000
...
#define MCI_SR 0x40
#define MCI_SR_RXRDY (0x1 << 1)
...

static inline uint32_t MCI_SR_RXRDY_rd() {
  return cpu_read_UINT32(MCI_BASE + MCI_SR) & MCI_SR_RXRDY;
}
...

Fig. 13. Step 1: Macros and inline functions generation

read(void *buffer, int32_t word_count)
{
  for (int32_t i = 0; i < word_count; ++i) {
    while (MCI_SR_RXRDY_rd() != 1);
    ((uint32_t *)buffer)[i] = MCI_RDR_rd();
  }
}
...

Fig. 14. Step 2: MCI device features generation

/* Header inclusions */
status_t read_low(void *buffer, int32_t word_count)
{
  for (int32_t i = 0; i < word_count; ++i) {
    while (MCI_SR_RXRDY_rd() != 1);
    ((uint32_t *)buffer)[i] = MCI_RDR_rd();
  }
  return DNA_OK;
}
...

/* In-Kernel Interface Specification */
...
process CHOICES {
  ...
  || read_low; dev.read; read_low-done[$status == DNA_OK];
}
...

/* Driver Config. Parameters */
INTERRUPT
...

Fig. 15. Step 3.a: Synthesize EFSM and derive driver functionalities
Finally, the driver configuration parameters (Fig. 15.a), the MCI device features, the libraries, and the in-kernel interface specification (Fig. 15.b) are used to synthesize the EFSM and derive the driver functionalities. The INTERRUPT parameter (Fig. 15.a) selects interrupt as the synchronization mechanism. In the in-kernel interface specification there exists a process describing how the driver will interact with neighboring components when it is in the idle state and receives a message. For instance, when the driver is in the idle state and receives a read_low message, it will wait until the read operation terminates, then return a DNA_OK status to the kernel in the case of a successful read. The dev.read hardware event is mapped to the read device feature. We must note that the || construct (Fig. 15.b) is used to separate different transition conditions.
Table I summarizes the conversion results. The SLOC column refers to the source code size (excluding debugging functions and statements) of the native and converted MCI drivers, while the last column shows the size of the binary drivers.

TABLE I
SOURCE LINES OF CODE, BINARY SIZES OF THE NATIVE AND GENERATED MCI DRIVERS (EXCLUDING DEBUGGING FUNCTIONS AND STATEMENTS)

            SLOC   Binary (contains application, OS, & HAL) in KB
Native      407    421.9
Generated   362    421.4

TABLE II
EFFORT FOR MCI DRIVER DEVELOPMENT WITH AND WITHOUT ME3D

                                     Effort in person-days   SLOC
Device specifications                2                       2573
Platform specification               1                       219
In-kernel interface specification    2                       188
Device features model                1                       64
Total effort using Me3D              6                       -
Total effort without Me3D            21                      -
We can notice that the systematically generated driver source is slightly smaller than the native one. The reason is that the native MCI driver is written without optimizations, whereas the generation optimizes code size. For instance, the native driver represents registers as union structures; in contrast, the generated code only defines the offsets of the bit fields actually used. A disadvantage of the union structure is that reserved bit fields have to be specified too. Though there might be advantages to unions in terms of code review, more efficient validation is feasible by checking the high-level intermediate models.
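The two styles can be contrasted on a hypothetical two-field control register; the layout below is invented, and note that C bit field ordering is implementation-defined (the masks match the usual little-endian GCC layout), which is itself a further argument against the union style.

```c
#include <stdint.h>

/* Union/struct style (native driver): every bit field, including the
 * reserved ones, must be declared so the layout stays correct. */
typedef union {
    struct {
        uint32_t enable   : 1;    /* bit 0 on little-endian GCC */
        uint32_t reserved : 30;   /* must be spelled out */
        uint32_t ready    : 1;    /* bit 31 */
    } bits;
    uint32_t raw;
} ctrl_reg_t;

/* Offset/mask style (generated driver): only the bit fields actually
 * used by the driver are defined. */
#define CTRL_ENABLE_MASK (0x1u)
#define CTRL_READY_MASK  (0x1u << 31)
```

Both notations address the same bits; the generated style simply omits everything the driver never touches, which is where the SLOC saving in Table I comes from.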
Table II shows the results of generating the MCI driver using Me3D versus developing it manually. We find that using the Me3D methodology for generating drivers results in a 350% improvement in productivity. Considering that the native driver was developed by a highly experienced kernel developer with about 21 person-days of effort (around 26 to 40 person-days of effort using the Intermediate COCOMO² formula with coefficients for embedded software projects and typical values for an effort adjustment factor), we can expect the acceleration of driver development to be greater than 350%. It must be noted that some specifications are suitable for reuse: the platform specification is usable for all devices, and the in-kernel interface specification for a certain device class. Incidentally, around 500 lines from the IP-XACT specifications (totaling 18%) of the devices and the platform are used for the driver generation.
F. Performance
We analyzed the performance of the native MCI driver for DNA-OS against that of the generated one. Performance values were captured using a Secure Digital (SD) card (Kingston SD/1GB). We measured a benchmark that performs a sequence of unbuffered reads from the SD card connected to the MCI device. As a result, the transfer rate and CPU utilization achieved by the generated and native drivers are similar.

² Intermediate COCOMO by B. Boehm, en.wikipedia.org/wiki/COCOMO
VI. CONCLUSION AND FUTURE WORK
Device drivers are crucial software elements with considerable impact on both design productivity and quality. Device driver development has traditionally been error-prone and quite time-consuming. On that account, we propose an advanced device driver generation environment to shorten driver development time and improve driver quality. An experiment generating a Multimedia Card Interface (MCI) driver for the Atmel D940 multiprocessor system-on-chip (MPSoC) achieves favorable results regarding code size.
In the future, we plan to evaluate the methodology on several OSes, introduce an intermediate format for device driver generation and validation, and develop an automatic tool for driver generation. We will also study optimization issues such as performance and power consumption, and consider other constraints (e.g., upper-bound timing imposed by critical or real-time systems) as future research subjects.
ACKNOWLEDGMENT
The authors would like to thank the MEDEA+ and CATRENE offices and the French Ministry of Industry for supporting this work via the MEDEA+/CATRENE SoftSoC project.
REFERENCES
[1] Hewlett-Packard Company. (2010) HP Tru64 UNIX Operating System Version 5.1B-6. [Online]. h18004.www1.hp.com/products/quickspecs/13868_div/13868_div.pdf
[2] NXP Semiconductors. (2007) LH7A404 Board Support Package V1.01. [Online]. ics.nxp.com/support/documents/microcontrollers/zip/code.package.lh7a404.sdk7a404.zip
[3] Coverity, Inc. (2005) Analysis of the Linux Kernel. [Online]. www.coverity.com/library/pdf/linux_report.pdf
[4] N. Ganapathy. (2008) Introduction to "Developing Drivers with the Windows Driver Foundation". [Online]. www.microsoft.com/whdc/driver/wdf/wdfbook_intro.mspx
[5] Atmel Corp. (2008) DIOPSIS 940HF AT572D940HF Preliminary. [Online]. www.atmel.com/dyn/resources/prod_documents/doc7010.pdf
[6] K. J. Lin and J. T. Lin, "Automated development tools for Linux USB drivers," in 14th ISCE, Braunschweig, Germany, 2010, pp. 1–4.
[7] G. Kroah-Hartman, "I2C Drivers, Part II," Linux Journal, Feb. 2004.
[8] M. O'Nils and A. Jantsch, "Device Driver and DMA Controller Synthesis from HW/SW Communication Protocol Specifications," Design Automation for Embedded Systems, vol. 6, no. 2, pp. 177–205, 2001.
[9] S. Wang, S. Malik, and R. A. Bergamaschi, "Modeling and Integration of Peripheral Devices in Embedded Systems," in DATE'03, Munich, Germany, 2003, pp. 10136–10141.
[10] L. Ryzhyk, P. Chubb, I. Kuz, E. Le Sueur, and G. Heiser, "Automatic device driver synthesis with Termite," in 22nd SOSP, Big Sky, MT, 2009.
[11] N. Bombieri, F. Fummi, G. Pravadelli, and S. Vinco, "Correct-by-construction generation of device drivers based on RTL testbenches," in DATE'09, Nice, France, 2009, pp. 1500–1505.
[12] L. Wittie, C. Hawblitzel, and D. Pierret, "Generating a statically-checkable device driver I/O interface," in Workshop on Automatic Program Generation for Embedded Systems, Salzburg, Austria, 2007.
[13] C. L. Conway and S. A. Edwards, "NDL: A Domain-Specific Language for Device Drivers," SIGPLAN Not., vol. 39, pp. 30–36, Jun. 2004.
[14] X. Guerin, K. Popovici, W. Youssef, F. Rousseau, and A. Jerraya, "Flexible Application Software Generation for Heterogeneous Multi-Processor System-on-Chip," in 31st COMPSAC, Beijing, China, 2007.
[15] Red Hat, Inc. (2010) Newlib. [Online]. sources.redhat.com/newlib
[16] E. Andersen. (2011) uClibc. [Online]. www.uclibc.org/downloads
[17] X. Guerin and F. Petrot, "A System Framework for the Design of Embedded Software Targeting Heterogeneous Multi-core SoCs," in 20th ASAP, Boston, MA, USA, 2009, pp. 153–160. [Online]. tima-sls.imag.fr/viewgit/apes
Session 7: Tools and Designs for Configurable Architectures
Schedulers-Driven Approach for Dynamic Placement/Scheduling of Multiple DAGs onto SoPCs

Ikbel Belaid, Fabrice Muller
University of Nice Sophia-Antipolis, LEAT-CNRS, France
e-mail: {Ikbel.Belaid, Fabrice.Muller}@unice.fr

Maher Benjemaa
National Engineering School of Sfax, University of Sfax, Tunisia
e-mail: [email protected]
Abstract—With the advent of System on Programmable Chips (SoPCs), there is a serious need for placing and scheduling algorithms that can allow multiple Directed Acyclic Graph (DAG) structured applications to compete for the computational resources provided by SoPCs. A runtime scheme for distributed scheduling and placement of DAG-based real-time tasks on SoPCs is described in this paper. In the proposed distributed approach, called Schedulers-Driven, each scheduler associated to a DAG makes its own placement/scheduling decisions and collaborates with the available placers corresponding to the SoPCs in the system. The placers focus on managing the free resource space for the requirements of elected tasks. Schedulers-Driven aims at optimizing the DAG slowdowns and reducing the rejection ratio of real-time DAGs. Other important goals are attained by this approach: the reduction of placement and scheduling overheads, ensured by the techniques of prefetch and reuse, and the efficiency of resource utilization, guaranteed by the reuse technique and the slickness of the placement method.

Keywords—real-time DAGs; Schedulers-Driven placement/scheduling; reuse; prefetch; heterogeneous device; run-time reconfiguration.
I. INTRODUCTION
In recent years, reconfigurable computing has
advanced at a phenomenal rate. This paradigm has given
rise to SoPCs, which aim to satisfy the demands of
embedded-system designers working under many tight
constraints. SoPCs combine two parts: general-purpose
processors and reconfigurable hardware resources. Despite
their flexibility and high performance, SoPCs raise a
number of challenges that must be addressed. One of them
is the dynamic scheduling of parallel real-time jobs
modeled by directed acyclic graphs (DAGs) onto the
reconfigurable resources. It is therefore reasonable to
envisage a scenario where more than one DAG competes
at the same time to be scheduled onto a high density of
reconfigurable resources. The purpose of our work is to
provide dynamic scheduling and placement for DAGs as
they arrive at a heterogeneous system. The objective of
dynamic placement/scheduling is i) to fit the tasks within
DAGs efficiently onto reconfigurable units partitioned on
the SoPCs, respecting their heterogeneity and taking
advantage of the run-time reconfiguration mechanism, and
ii) to order their execution so that task precedence and
real-time requirements are satisfied.

Many dynamic scheduling schemes have been
introduced in parallel computing systems. One simple and
efficient type of scheduling method is to dynamically
construct a combined DAG, composed of DAGs arrived at
the system and then to schedule the composite DAG by
one among efficient single-DAG algorithms in the
literature. Some methods which fall into this category
include those presented by Zhao and Sakellariou in [1],
who focus on achieving a certain level of quality of service
for the given DAGs defined by the slowdown that each
DAG would experience. The idea of combining dynamic
DAGs is also proposed in [2]. This paper develops Serve
On Time and First Come First Serve algorithms that
schedule each arrived DAG with the unfinished DAGs.
The objective of these algorithms is to properly add the
new submitted DAG into running DAGs, forming a new
integrated DAG. In [3], the dynamic DAGs are scheduled
with periodic real-time jobs running on the heterogeneous
system. The proposed scheduling scheme introduces
admission control for DAGs and schedules globally the
tasks of each arrived DAG by modeling the spare
capability left by the periodic jobs in the system. Then,
each scheduled task is received by a machine where it will
be scheduled locally by the EDF algorithm. [4] presents a
hierarchical matching and scheduling framework to
execute multiple DAGs on computational resources. Based
upon a client-server model and DHS algorithm, each DAG
is associated with a client machine and independently
determines when a scheduling decision should be made.
Through load estimates, each client machine matches its
tasks to a suitable group of server machines. When the
application chooses a particular group of servers to execute
a given task, the low-level scheduler determines the most
appropriate member of the group to execute the received
task. [5] deals with parallel jobs arriving at the system
following a Poisson process and takes into account the
reliability measure as well as the overheads of scheduling
and dispatching tasks to processors. Using admission
control for real-time jobs, the paper presents DAEAP,
978-1-4577-0660-8/11/$26.00 ©2011 IEEE
179
DALAP and DRCD scheduling algorithms to enhance the
reliability of the system.
Several researchers have developed dynamic placement
methods of tasks on reconfigurable devices. The placement
in [6] is considered the baseline placement algorithm. The
placement is based on KAMER method that partitions the
free space into Maximal Empty Rectangles (MER) and
employs the bin-packing rules to fit tasks into MERs. [7]
presents an on-the-fly partitioning approach. [8] employs the
staircase method to manage the free space. Unlike the
previous works, [9] manages the occupied space instead of
the free space and proposes Nearest Possible Position
algorithm to fit tasks while optimizing inter-task
communication.
To the best of our knowledge, none of these existing
methods of placement and scheduling is suitable for the
environment used in this paper, as most of them target
a purely software context or are not applicable
to real-time DAGs. In this paper, a new
dynamic competitive placement/scheduling approach is
proposed to execute real-time DAGs on SoPCs. The
remainder of this paper is organized as follows. Section 2
details our proposed approach of placement/scheduling
DAGs onto SoPCs. The experimental results are given in
Section 3 followed by conclusions in Section 4.
II. SCHEDULERS-DRIVEN
PLACEMENT/SCHEDULING
Throughout the paper, the Xilinx heterogeneous
column-based FPGA was used as a reference for the SoPC.
The heterogeneous system is composed of n SoPCs. Each
one contains a set of reconfigurable hardware
resources denoted {RBk}, where k identifies the resource
type; there are NP types of reconfigurable
resources in the SoPCs. As shown in Fig. 1, the execution
system is constituted by a set of m local schedulers (Sched
i) associated with the arrived DAGs. The local schedulers
communicate with n placers. Each placer is assigned to a
SoPC and makes its own decision in managing
reconfigurable resource space. Besides the m local
schedulers and n placers, we introduce two other structures
in the system: Recover and Pending. In the distributed
Schedulers-Driven placement/scheduling, all the
structures: m local schedulers, n placers, Recover and
Pending operate to make decisions about scheduling and
placement of real-time tasks. The real-time DAGs are
submitted dynamically and periodically according to a
fixed inter-arrival interval. A real-time DAG is defined by
the pair (N,E). N is the set of nodes representing non-
preemptive tasks in the DAG and E is the set of edges
linking the dependent tasks. Each real-time task in the
DAG is characterized by its worst case execution time
(CA), its relative deadline (DA) and its release time (RA).
The release time is the time when the task is ready for
execution, having received all its required data from its
predecessors. RA is determined according to the arrival
time of the DAG to which the task belongs and to the time
of execution achievement of its predecessors. Moreover,
each task (A) is presented as a set of reconfigurable
resources (RBk) which are required to achieve its execution
on the SoPCs and defines the RB-model of the task as
expressed in (1).
(1)
Figure 1. System overview: local schedulers Sched 1 … Sched m with List_scheduler, List_recover and List_pending, placers 1 … n attached to SoPC1 … SoPCn, and the Recover and Pending structures.
Under a hardware environment, the placement and
scheduling problems are highly interlinked. Indeed,
the placers must satisfy the resource requirements of each
task elected by the schedulers, and the scheduler decisions
must be made according to the ability of the placers to
provide sufficient RBs for tasks while respecting their
precedence and real-time constraints. Thus, the major
challenge in this environment is to reduce the rejection
rate as much as possible. The two following sections detail
our proposed algorithms for placing/scheduling DAGs on
reconfigurable devices (SoPCs).
A. On-line Placement Algorithm
The placement problem consists of two sub-functions: i)
partitioning, which handles the free space of resources in
the SoPC and identifies the Maximal Empty Rectangles
(MERs) enabling task execution (MERs are the empty
rectangles that are not contained within any other empty
rectangle), and ii) fitting, which selects the best feasible
placement solution within the MERs while maintaining
resource efficiency. As stated above and as shown in Fig. 2,
we rely on 2D column-based architectures represented by
a matrix (Yi,j), where LineNumber denotes the number of
lines in the SoPC and ColumnNumber denotes its number
of columns.
(2)
To achieve the partitioning sub-function, we define
Max_widthi,j and Max_heighti,j for each Yi,j.
Max_widthi,j is the number of free RBs found along
the line of Yi,j, starting from Yi,j, without crossing an
occupied RB. Max_heighti,j is the number of free RBs
counted from Yi,j down its column until the first
occupied RB. Max_widthi,j and Max_heighti,j are null for
occupied RBs (Yi,j = 0). The search for MERs also
requires the search for key RBs. Key RBs are the free RBs
which provide the upper left vertices of MERs. A key RB
is an RB (Yi,j) that has an occupied RB on its left (Yi,j-1),
or whose free left neighbor has a Max_heighti,j-1 lower
than that of the RB itself. Moreover, a key RB must have
an occupied RB above it (Yi-1,j), or its free upper neighbor
must have a Max_widthi-1,j lower than that of the RB
itself. In Fig. 2, the RBs in the SoPC marked with a star
symbol are the key RBs, and the values in parentheses are
their Max_width and Max_height.

Figure 2. Key RB and MER search (column-based SoPC with 4 RB types: RB1, RB2, RB3, RB4; the figure labels MER1 and MER2 at Y2,2, MER3 at Y1,2, MER4 at Y2,3, and a MER nested in MER1 at Y2,3).
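The Max_width/Max_height bookkeeping and the key-RB test just described can be sketched as follows. This is a simplified model of our own (the grid only records free/occupied and ignores RB types; the device border counts as occupied), not the paper's implementation.

```python
def max_width_height(grid):
    """grid[i][j] = 1 if the RB is free, 0 if occupied.
    Returns (MW, MH): the number of free RBs rightward / downward from each cell."""
    rows, cols = len(grid), len(grid[0])
    MW = [[0] * cols for _ in range(rows)]
    MH = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in reversed(range(cols)):
            if grid[i][j]:
                MW[i][j] = 1 + (MW[i][j + 1] if j + 1 < cols else 0)
    for j in range(cols):
        for i in reversed(range(rows)):
            if grid[i][j]:
                MH[i][j] = 1 + (MH[i + 1][j] if i + 1 < rows else 0)
    return MW, MH

def key_rbs(grid):
    """Key RBs provide the upper-left vertices of MERs: the left neighbour is
    occupied (or has a smaller Max_height) AND the upper neighbour is occupied
    (or has a smaller Max_width). The border is treated as occupied."""
    MW, MH = max_width_height(grid)
    keys = set()
    for i in range(len(grid)):
        for j in range(len(grid[0])):
            if not grid[i][j]:
                continue
            left_ok = j == 0 or not grid[i][j - 1] or MH[i][j - 1] < MH[i][j]
            up_ok = i == 0 or not grid[i - 1][j] or MW[i - 1][j] < MW[i][j]
            if left_ok and up_ok:
                keys.add((i, j))
    return keys

demo = [[1, 1, 0],
        [1, 1, 1],
        [0, 1, 1]]
print(sorted(key_rbs(demo)))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

On this 3x3 grid, only the four upper-left free cells qualify: every other free cell inherits its rectangle widths from a key RB to its left or above.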
1) Partitioning: Based on key RB, Max_width and
Max_height of RBs, this first sub-function of placement
problem consists in extracting MERs to enable the
placement of elected tasks on reconfigurable device.
Partitioning is conducted through Algorithm 1. Algorithm
1 deals with each key RB independently. At the beginning
of Algorithm 1, to avoid MER nesting, line 5 and the test
ensured by line 10 select the RBs throughout the
Max_height of the key RB to be considered for SCAN
function (line 12). These selected RBs must provide
Max_width greater than that of RB above the current key
RB (line 5). In fact the Max_widths of RBs which do not
satisfy this condition are inevitably taken by the previous
key RB. Throughout the Max_height of each key RB (line
6), Algorithm 1 scans all the RBs and each time, it takes
the Max_width of the current RB (lines 7,8) as the current
width of a new MER: MER_width (line 11) and checks
whether there are RBs above this current RB and below the
current key RB having Max_width inferior or equal to that
of the current RB (lines 12-21). Should this be the case, the
current RB would not be considered. For example, in Fig.
2, for the key RB Y1,2, the Max_widths 3 given by Y2,2 and
Y3,2 and the Max_width 2 given by Y4,2 are not considered
as Y1,2 above these RBs has a Max_width of 1. This test
avoids the duplication of MERs as well as it checks the
feasibility of MER construction.

Algorithm 1. MER search.

If the Max_width of the
current RB is accepted, Algorithm 1 determines the height
of the MER by booking MER_width RBs on all the lines
between the current key RB and the last RB having
Max_width superior to MER_width (lines 22-27). Once the
construction of the MER is finished (line 28), further tests
of MER nesting are performed by line 29. The first test
Validity_Left searches the MERs added by the key RBs
situated on the same line as the current key RB on its left.
If one of these old MERs has the same height as the new
MER and if the upper right and the bottom right vertices of
the old MER are greater than or equal to the upper left and
the bottom left vertices of the new MER, the new MER is
necessarily encapsulated in the old MER. In this case, the
new MER will not be inserted. For example, in Fig. 2, the
gray MER added by Y2,3 is nested in MER1. MER1 is
created by the key RB Y2,2 on the left of the key RB Y2,3,
both situated on the same line. As both MERs have the
same height 2, and the upper left and the bottom left
vertices of this new MER are inferior to the bottom right
and the upper right vertices of MER1, the new MER of Y2,3
is deleted. Similarly, the second test Validity_Up avoids
the insertion of new MER having the same width as an old
MER provided by a key RB located above in the same
column as the current key RB and its bottom left and
181
bottom right vertices are greater than or equal to the upper
left and the upper right vertices of the new MER.
Consequently, Algorithm 1 guarantees the discovery of all
possible MERs in the SoPCs without duplication or
nesting.
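Algorithm 1 itself is not reproduced here. As a reference point, the set of MERs it must discover (all empty rectangles contained in no other empty rectangle) can be obtained by the following brute-force enumeration, which is far slower than the key-RB scan but useful for cross-checking a fast implementation:

```python
def all_free(grid, r1, r2, c1, c2):
    """True if every cell of the rectangle (inclusive corners) is free."""
    return all(grid[i][j] for i in range(r1, r2 + 1) for j in range(c1, c2 + 1))

def maximal_empty_rectangles(grid):
    """Return all empty rectangles (r1, c1, r2, c2), inclusive corners,
    that are not contained in any other empty rectangle."""
    rows, cols = len(grid), len(grid[0])
    empties = [(r1, c1, r2, c2)
               for r1 in range(rows) for r2 in range(r1, rows)
               for c1 in range(cols) for c2 in range(c1, cols)
               if all_free(grid, r1, r2, c1, c2)]
    def contained(a, b):
        # a is strictly inside b (same rectangle does not count)
        return (a != b and b[0] <= a[0] and b[1] <= a[1]
                and b[2] >= a[2] and b[3] >= a[3])
    return [a for a in empties if not any(contained(a, b) for b in empties)]

demo = [[1, 1, 0],
        [1, 1, 1],
        [0, 1, 1]]
mers = maximal_empty_rectangles(demo)
print(sorted(mers))  # [(0, 0, 1, 1), (0, 1, 2, 1), (1, 0, 1, 2), (1, 1, 2, 2)]
```

The demo grid yields exactly four MERs, matching the kind of result the key-RB scan must produce without duplication or nesting.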
2) Fitting: During scheduling, each elected task from
each local scheduler will be fitted by a placer. Thus,
according to the MERs found in their corresponding SoPCs,
the n placers provide the best Reconfigurable Physical
Blocks (RPBs) for the elected tasks in the SoPCs. The
placers search all the valid MERs for the tasks and provide
the closest RPBs in order to minimize internal
fragmentation. A valid MER must include all the RB types
required by the task, in the numbers specified in the
RB-model of the task, to enable its execution. Based
on the column-based architecture, our proposed best fitting
for a given task A and by a given placer is described by
Algorithm 2.
Algorithm 2 starts RPB search from the upper left
vertex of each valid MER. Algorithm 2 relies heavily on
the column-based architecture and only scans the first line
of the valid MER. It searches the first column in the MER
containing an RBk included in A_RB and not yet scanned
(line 7). From this current first column (line 8), it scans the
whole MER line horizontally to search the remaining RB
types required by A_RB (lines 10-22). Max_RB represents
the height of the RPB according to the required number of
RBs in the hardware task and the height of the valid MER.
If the required number of one RB type exceeds the height
of the valid MER, Max_RB is equal to the MER_height
(line 18) and the remaining number of this RB type (line
17) will be searched in the following columns of the valid
MER. Otherwise, the required number of the current RB
type is attained (line 14) and the Max_RB is adjusted to the
last highest value (line 15). Then, Algorithm 2 checks
whether all the RB types included in A_RB are found and
their required number are achieved starting from this
current first column (line 23). Should this be the case, it
books the computed necessary number (Max_RB) (line 25)
for this new possible RPB. Among all possible RPBs
(Possible_RPB) extracted by scanning all the columns
of all valid MERs in a given SoPC by a given placer, the closest one to the RB-model of A will be
selected as the best fitting for A in the SoPC (line 31).
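The fitting criterion (among the candidate RPBs, pick the one closest to the task's RB-model so as to minimize internal fragmentation) can be sketched as follows. This is our own simplified model, in which each candidate RPB is summarized only by the count of each RB type it contains:

```python
def best_fit(task_rb_model, candidate_rpbs):
    """task_rb_model: required RBs, e.g. {"RB1": 2, "RB3": 1}.
    candidate_rpbs: list of dicts giving the RB counts of each possible RPB.
    Returns the valid candidate with the least excess resources, or None."""
    def valid(rpb):
        # the RPB must cover every required RB type in sufficient number
        return all(rpb.get(k, 0) >= n for k, n in task_rb_model.items())
    def excess(rpb):
        # internal fragmentation: RBs allocated but not needed by the task
        return sum(rpb.values()) - sum(task_rb_model.values())
    valid_rpbs = [r for r in candidate_rpbs if valid(r)]
    return min(valid_rpbs, key=excess) if valid_rpbs else None

need = {"RB1": 2, "RB3": 1}
rpbs = [{"RB1": 4, "RB3": 2},            # valid, excess 3
        {"RB1": 2, "RB3": 1, "RB2": 1},  # valid, excess 1 -> best fit
        {"RB1": 1, "RB3": 5}]            # invalid: not enough RB1
print(best_fit(need, rpbs))  # {'RB1': 2, 'RB3': 1, 'RB2': 1}
```

Choosing the candidate with minimal excess is one way to read "closest to the RB-model"; the paper's Algorithm 2 additionally exploits the column order of the architecture when building the candidates.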
B. On-line Scheduling Algorithm
In this section, based on the on-line placement
presented in the previous section, we define our proposed
Schedulers-Driven placement/scheduling. Fig. 3 illustrates
the possible states for tasks in the arrived DAGs.
Schedulers-Driven placement/scheduling is performed by
means of Algorithm 3 and 4. Every tick (T time units),
Algorithm 3 uses all the previous algorithms to move tasks
between the various states. We assume that there
are DAG_number DAGs arriving at the system with a fixed
inter-arrival interval.

Algorithm 2. Best fitting of task A.

The arrived DAGs are assigned to the
idle local schedulers. The Schedulable tasks in each DAG
are fetched by its Local_scheduler and inserted in its
List_scheduler. A task in a DAG is considered schedulable
if either the task has no predecessors or if all its
predecessors have been placed/scheduled. A task is
accepted by an SoPC if its deadline and RB requirements
for that SoPC remain guaranteed. If a task is not accepted
by any SoPC during its laxity time then it is rejected.
Consequently, the DAG that the task belongs to is rejected.
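The schedulability and rejection rules just stated can be sketched as follows, under hypothetical names and assuming the relative deadline is counted from the release time (the paper does not state this explicitly):

```python
def is_schedulable(task, placed):
    """A task becomes schedulable once it has no predecessors or all of its
    predecessors have been placed/scheduled."""
    return all(p in placed for p in task["predecessors"])

def must_reject(task, now):
    """A task not accepted by any SoPC during its laxity time is rejected:
    past its latest start time (absolute deadline minus execution time),
    no placement can still meet the deadline."""
    latest_start = task["ra"] + task["da"] - task["ca"]
    return now > latest_start

t = {"predecessors": ["t1"], "ra": 10, "ca": 5, "da": 20}
print(is_schedulable(t, placed={"t1"}))  # True
print(must_reject(t, now=26))            # True: latest start is 25
```

Rejecting a task then cascades to its whole DAG, which is why the approach works so hard (recovery, RPB reuse) to find a placement before the laxity expires.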
At the beginning, Algorithm 3 checks if the tasks in
List_scheduler, List_recover and List_pending still
guarantee their deadlines (lines 19-20). If a task misses its
deadline, it is transferred to the Rejected state. A DAG is
accepted only if all its composite tasks are acceptable.
When a task is rejected, all the schedulable tasks in the
List_scheduler of the rejected task, all the recovered,
pending and placed/scheduled tasks that belong to the
DAG of the rejected task are deleted from their housed
lists and from their assigned SoPCs (line 21). Then the
Schedulers-Driven detects the SoPCs that have sustained
MER modification after task completion or task rejection
(line 25). The current time t is kept if the SoPC has
experienced MER modification (line 26). When some
deleted tasks were scheduled and placed as the last tasks to
be executed in RPBs, their elimination from the system
could enable the placement/scheduling of pending tasks. In
addition, the completion of the last tasks in RPBs frees
additional resources in SoPCs which could allow the
placement/scheduling of pending tasks. Thus, in these
cases, the pending tasks are transmitted to List_recover by
saving the time of their recovering (lines 29-32) and their
states become Recovered. In the case where the rejected
tasks are not the last tasks to execute in the RPBs (line 33),
Schedulers-Driven checks the possibility of replacing some
of these rejected tasks by pending tasks while respecting
182
Figure 3. Task states (Schedulable, Selected, Placed and scheduled, Pending, Recovered, Rejected). Transition conditions: (1) valid MER OR (valid occupied RPBs && Ts respects deadline); (2) no valid MER && (valid occupied RPBs and Ts do not respect deadline, or invalid occupied RPBs); (3) (valid MER on End_task or Reject_last) OR (valid occupied RPBs && Ts respects deadline on Reject_last). A task is Selected on the earliest deadline; a missed deadline leads to Rejected and to a rejected DAG; Reject_last/end tasks and Reject_middle trigger the recovery of pending tasks.
their release times, their deadlines and their RB-models
(line 34). If such a replacement is feasible for some tasks,
their states change to Placed/Scheduled and their successors
are searched to become the new schedulable tasks, inserted
in the list of the Local_scheduler to which these substitute
tasks belong. Then, each Local_scheduler and the Recover
picks the schedulable task with the earliest deadline from
its list List_scheduler and List_recover (lines 36-38). The
state of elected tasks is changed to Selected. When the
selected task is taken from List_recover, only the placers
whose MER modification occurred at a time greater than
or equal to the Recover_time of the elected task are
selected to deal with this task (lines 39-40); otherwise, all
the n placers are considered to place and schedule this task
(line 42). Then, each Local-scheduler and Recover calls
the on-line placers described by Algorithm 4 and
performed by the selected placers (line 44). Each selected
placer manages its free space by Algorithm 1 detailed in
the previous section (lines 56-59). If the selected placer
affords valid MERs for the selected task, it searches its
fittest RPB in its free RB space by means of Algorithm 2
and the start time of the task in its associated SoPC is
obtained by the maximum between the release time of the
task and the current time (lines 60-63). In the case that the
SoPC does not include valid MERs for the selected task, it
attempts to place and schedule it in its occupied RPBs (line
65). If the possible start time (Ts) provided by an occupied
RPB is greater than the release time of the task (line 66), the
corresponding placer checks if this start time maintains the
deadline of the selected task, should this be the case, it
verifies if this occupied RPB satisfies the RB requirements
of the task. If the occupied RPB respects the real-time
requirements and RB-model of the task (line 67), it is
accepted (lines 68-69). When the Ts of the occupied RPB
is lower than the release time of the task (line 71), only the RB
requirements are checked (line 72) as the start time of the
selected task in this occupied RPB will be its release time
(lines 73-74). Among all the accepted occupied RPBs, the
earliest start time for the task is chosen and the
corresponding RPB is selected (lines 77-78). When several
RPBs ensure the earliest start time, the fittest RPB is kept.
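The selection rule among occupied RPBs (start at max(Ts, release time), keep only RPBs that meet the deadline and the RB-model, prefer the earliest start, break ties by fit) can be sketched as follows, with our own data layout:

```python
def pick_occupied_rpb(task, occupied_rpbs):
    """task: dict with 'ra' (release), 'ca' (execution), 'deadline' (absolute)
    and 'rb_model'. occupied_rpbs: list of dicts with 'ts' (time the RPB
    becomes available) and 'rbs' (its RB counts).
    Returns (start_time, rpb) or None."""
    def satisfies(rpb):
        return all(rpb["rbs"].get(k, 0) >= n for k, n in task["rb_model"].items())
    accepted = []
    for rpb in occupied_rpbs:
        start = max(rpb["ts"], task["ra"])  # a task cannot start before release
        if start + task["ca"] <= task["deadline"] and satisfies(rpb):
            accepted.append((start, rpb))
    if not accepted:
        return None
    # earliest start first; among equal starts, the fittest (least total RBs)
    return min(accepted, key=lambda sr: (sr[0], sum(sr[1]["rbs"].values())))

task = {"ra": 4, "ca": 3, "deadline": 12, "rb_model": {"RB1": 2}}
rpbs = [{"ts": 2, "rbs": {"RB1": 5}},   # start max(2, 4) = 4, fits
        {"ts": 6, "rbs": {"RB1": 2}},   # start 6, fits but later
        {"ts": 1, "rbs": {"RB1": 1}}]   # insufficient RB1
start, chosen = pick_occupied_rpb(task, rpbs)
print(start)  # 4
```

Reusing an already-configured RPB this way is what lets the approach skip the placement step entirely for such tasks.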
The placer data are sent to the Local_schedulers and
the Recover, which make the final decision for their
selected tasks.

Algorithm 3. On-line Schedulers-Driven placer/scheduler.
Algorithm 4. On-line placers.

If the placers provide feasible placement for
a selected task by guaranteeing its real-time constraints,
among all possible RPBs, its Local_scheduler or the
Recover picks the fittest RPB that enables the earliest start
time for task execution and the new possible schedulable
tasks are searched and inserted in the list of
Local_scheduler of the selected task (lines 45-46). The
state of the selected tasks is then set to Placed/Scheduled.
If there are no available RPBs for a selected task, it will
be transmitted to Pending state (line 48), as some other
task rejections or completions could allow its
placement/scheduling in SoPCs. Schedulers-Driven
achieves the operating phase by updating the limit that controls
the existence of unscheduled DAGs (line 51).
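Each Local_scheduler (and the Recover) elects the schedulable task with the earliest deadline. With a priority queue this is a standard EDF pick; a minimal sketch of such a list, under hypothetical names:

```python
import heapq

class ListScheduler:
    """Minimal List_scheduler sketch: holds schedulable tasks, elects by EDF."""
    def __init__(self):
        self._heap = []

    def insert(self, deadline, name):
        # the heap orders entries by deadline, earliest first
        heapq.heappush(self._heap, (deadline, name))

    def elect(self):
        """Pop the schedulable task with the earliest deadline (-> Selected)."""
        return heapq.heappop(self._heap)[1] if self._heap else None

sched = ListScheduler()
sched.insert(40, "B")
sched.insert(25, "A")
sched.insert(60, "C")
print(sched.elect())  # A  (earliest deadline, 25)
```

Each elect() corresponds to one Schedulable-to-Selected transition in Fig. 3; the elected task is then handed to the placers.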
III. SIMULATION RESULTS
In order to analyze the feasibility of our Schedulers-
Driven placement/scheduling approach and to assess its
performance, several simulation experiments were
conducted. 10 DAG sets were generated by means of the
TGFF3-5 tool [10]. The DAG set features are described in
TABLE I. In each DAG set the inter-arrival interval of
DAGs is fixed to 50 T time units. The empirically chosen
values for the local scheduler number and placer number
are m=6 and n=4. n should be smaller than m in order to
create low-cost designs with high resource efficiency: we
cannot produce a number of SoPCs as great as the number
of arrived DAGs to satisfy their physical requirements.
However, the number of local schedulers should be as
large as possible in order to place and schedule several
DAGs simultaneously and to exploit the SoPC resources
as much as possible. We created 4 heterogeneous column-
based SoPCs of 6 lines and 7 columns having 4 RB types.
In all DAG sets, the average RB heterogeneity rate of the
DAG tasks is 2.31 (i.e., each task uses on average 2.31 of
the 4 RB types).
Fig. 4 shows the run time of Schedulers-Driven
placement/scheduling for the 10 DAG sets. DAG_SET6,
DAG_SET8 and DAG_SET9 give the highest run times as
they are composed of the largest numbers of DAGs (30,
24, 27). Moreover, they also produce high average
execution times (23-25 T), which explains their long run
times. Indeed, due to the long execution times of tasks, the
occupied RPBs remain in execution for a long time, and
the lateness of their release drives tasks to List_pending
many times. Thus, the DAGs stay in the system for a long
time.
The slowdown of one DAG is defined by:
Slowdown(DAG) = Msingle(DAG)/Mmultiple(DAG), where
Msingle is the makespan of the DAG up to its last placed/
scheduled task by Schedulers-Driven and when it has the
available SoPCs on its own, and Mmultiple is the current
makespan of the same DAG when it is placed/scheduled by
Schedulers-Driven onto SoPCs along with all
the other DAGs.

TABLE I. FEATURES OF DAG SETS

            DAG     Average size/DAG  Average       Average
            number  (Task)            deadline (T)  execution (T)
DAG_SET1    20      8.7               68.91         20.79
DAG_SET2    11      15.8              62.91         21.86
DAG_SET3    15      10.53             69.04         17.66
DAG_SET4    22      11.27             71.37         21.58
DAG_SET5    12      14.16             65.31         22.03
DAG_SET6    30      12.76             76.53         23.41
DAG_SET7    18      14.5              88.15         30.99
DAG_SET8    24      16.33             76.41         25.07
DAG_SET9    27      14.5              78.10         25.14
DAG_SET10   10      16.7              92.71         27.66

Figure 4. Run time measurements.

The comparison of DAG_SET slowdowns under the
Schedulers-Driven placer/scheduler is shown in Fig. 5. For
better performance in the system, the slowdown should be
close to 1. As expected, the DAG_SETs having lower
DAG numbers, smaller average sizes and shorter
executions afford more fairness to their constituent DAGs,
such as DAG_SETs 1-4, and consequently they result in
smaller slowdowns. DAG_SETs 7, 8, 9 and 10 also
produce small slowdowns thanks to RPB reuse and
placement efficiency. DAG_SETs 5 and 6 show the highest
slowdowns since they are built from large numbers of
DAGs with high average sizes and a raised heterogeneity
rate, which causes conflicts between DAGs for the use of
the SoPCs and increases their slowdowns.
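The slowdown metric above can be computed directly from the two makespans:

```python
def slowdown(makespan_single, makespan_multiple):
    """Slowdown(DAG) = Msingle / Mmultiple: the DAG's makespan with the SoPCs
    to itself, divided by its makespan when competing with the other DAGs.
    Values closer to 1 mean the DAG suffers less from the competition."""
    return makespan_single / makespan_multiple

def average_slowdown(dag_set):
    """Average slowdown over a DAG set, as plotted per DAG_SET in Fig. 5."""
    return sum(slowdown(ms, mm) for ms, mm in dag_set) / len(dag_set)

# (Msingle, Mmultiple) pairs for three hypothetical DAGs
print(average_slowdown([(50, 50), (40, 80), (60, 75)]))  # about 0.767
```

A DAG whose makespan is unchanged by the competition contributes a slowdown of exactly 1 to the average.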
Our real-time DAG-based placement/scheduling on the
heterogeneous SoPCs suffers from task rejection due to
missed deadlines and the lack of free RB space for a given
selected task. Fig. 6 presents the guarantee ratio (i.e., the
percentage of DAGs guaranteed to meet their deadlines)
measured for the DAG_SETs. For all DAG_SETs, Fig. 6
shows a guarantee ratio above 51 %. A highly relaxed
average deadline, combined with a low average execution
time and a small average size within a DAG_SET, has a
noticeable impact on increasing the guarantee ratio. We
observe 100 % of DAGs accepted in DAG_SETs 1, 3 and
4, as these parameters are suitably chosen there. In [5],
using 8 homogeneous processors, the attained guarantee
ratio for 5 DAGs of 20 tasks is 70 %.
Our approach outperforms [5]: for the DAG_SET nearly
similar to that studied in [5], namely DAG_SET5,
composed of 12 DAGs with an average size of 14.16 tasks
each, Schedulers-Driven can place and schedule 83 % of
the real-time DAGs in the system.

Figure 5. Average slowdown measurements.
Under strict physical resource constraints, Schedulers-
Driven placement/scheduling often predicts the placement
and scheduling of tasks before their release times. As
shown in Fig. 7, this property of our proposed approach
allows up to 91 % of the placement/scheduling phases
across all DAG_SETs to prefetch the schedule and
placement of tasks before their release times. These
remarkable prefetch ratios greatly reduce the placement
and scheduling overheads. Thanks to the prefetch
technique, almost all the configuration operations are
hidden, which improves the overall system performance.
For better placement quality in the system, the resource
efficiency should be close to 1. Thanks to the tightness of
our placement method, we reached a resource efficiency of
0.6 in all DAG_SETs. This resource efficiency shows how
close the used RPBs, where tasks are fitted, are to their
RB-models. In addition, for all DAG_SETs, based on the
run-time reconfiguration mechanism, reusing the occupied
RPBs in 45-75 % of placement/scheduling phases
completely eliminates the placement overhead, greatly
reduces the configuration overhead and enhances resource
efficiency by freeing more RB space for future arriving
DAGs.
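The paper does not give a formula for resource efficiency; a plausible definition consistent with the text (how close the used RPBs are to their RB-models) is the ratio of required to allocated RBs, sketched below. The definition and the data layout are our assumptions, not the paper's.

```python
def resource_efficiency(placements):
    """placements: list of (rb_model, rpb) pairs, each a dict of RB counts.
    Assumed definition: total RBs actually required divided by total RBs
    allocated in the used RPBs; 1.0 means zero internal fragmentation."""
    required = sum(sum(model.values()) for model, _ in placements)
    allocated = sum(sum(rpb.values()) for _, rpb in placements)
    return required / allocated

placements = [({"RB1": 2}, {"RB1": 3}),                       # 2 of 3 RBs used
              ({"RB1": 1, "RB2": 2}, {"RB1": 1, "RB2": 3})]   # 3 of 4 RBs used
print(resource_efficiency(placements))  # 5/7, about 0.71
```

Under this reading, the reported 0.6 means that roughly 60 % of the RBs allocated to RPBs are actually consumed by the fitted tasks.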
IV. CONCLUSION AND FUTURE WORK
This paper presents a novel placement/scheduling
approach for real-time DAGs with non-deterministic
behavior on heterogeneous SoPCs. We regard this paper as
only an initial study of heuristics for multiple-DAG
placement/scheduling onto SoPCs. It addresses some of
the most challenging problems in embedded systems:
achieving high performance, expressed by run time and
slowdown, reaching high resource efficiency and reducing
configuration overhead. Further research will focus on
approaches with task preemption and on other notions of
quality of service, exploiting the unused middle slot times
within the RPBs.
REFERENCES
[1] H. Zhao, and R. Sakellariou, “Scheduling multiple DAGs onto
heterogeneous systems,” Parallel and Distributed Processing
Symposium, p. 130, Apr. 2006.
Figure 6. Guarantee ratio measurements.
Figure 7. Placement/Scheduling prefetch measurements.
[2] L. Zhu, Z. Sun, W. Guo, Y. Jin, W. Sun and W. Hu, "Dynamic
Multi DAG Scheduling Algorithm for Optical Grid
Environment," Proceedings of SPIE, Vol. 6784, Part 1,
p. 67841F, 2007.
[3] L. He, S. Jarvis, D. Spooner and G. Nudd, "Dynamic,
capability-driven scheduling of DAG-based real-time jobs in
heterogeneous clusters" International Journal of High
Performance Computing and Networking, Vol 2, pp. 165-177,
March. 2004.
[4] M. Iverson, and F. Ozguner, "Hierarchical, competitive
scheduling of multiple DAGs in a dynamic heterogeneous
environment," Distributed Systems Engineering journal, Vol 6,
No 3, pp. 112-120, July. 1999.
[5] X. Qin, and H. Jiang, "Dynamic, reliability-driven scheduling of
parallel real-time jobs in heterogeneous systems," International
Conference on Parallel Processing, pp. 113-122, 2001.
[6] K. Bazargan, R. Kastner, and M. Sarrafzadeh, "Fast Template
Placement for Reconfigurable Computing Systems," IEEE
Design and Test, Vol. 17, pp 68-83, January. 2000.
[7] C. Steiger, H. Walder, M. Platzner, and L. Thiele, "Online
scheduling and placement of real-time tasks to partially
reconfigurable devices," International Real-Time Systems
Symposium, pp. 224-235, Dec. 2003.
[8] M. Handa, and R. Vemuri, "An Efficient Algorithm for Finding
Empty Space for Online FPGA Placement," Design Automation
Conference, pp. 960-965, June. 2004.
[9] A. Ahmadinia, C. Bobda, M. Bednara, and J. Teich, "A New
Approach for On-line Placement on Reconfigurable Devices,"
International Parallel and Distributed Processing Symposium,
p. 134, April. 2004.
[10] http://ziyang.eecs.umich.edu/~dickrp/tgff/
Generation of emulation platforms for NoC exploration on FPGA
Junyan TAN, Virginie FRESSE Hubert Curien Laboratory UMR CNRS 5516
University of Jean Monnet-University of Lyon 18 Rue du Professeur Benoît Lauras
42000 Saint-Etienne, FRANCE [email protected]
Frédéric ROUSSEAU TIMA Laboratory, UJF/CNRS/Grenoble INP, SLS Group
46, Avenue Félix Viallet 38000 Grenoble, FRANCE [email protected]
Abstract-NoC (Network on Chip) architecture exploration is
a timely problem for today's multimedia applications and platforms. The presented methodology gives a solution to easily evaluate timing and resource performance while tuning several architectural parameters, in order to find the appropriate NoC architecture with a single emulation platform. In this paper, a design flow that generates NoC-based emulation platforms on FPGA is presented. From specified traffic scenarios, our tool automatically inserts the appropriate IP blocks (emulation blocks and routing algorithm) and generates an RTL NoC model with specific, tunable components that is synthesized on the FPGA.
I. INTRODUCTION
Systems-on-Chip (SoC) based on Networks-on-Chip (NoC) architectures are one of the most appropriate solutions for media-processing embedded applications. With the growing complexity of consumer embedded systems, emerging SoC architectures integrate numerous components such as memories, DSPs, specialized processors, microcontrollers and IPs. The ever-increasing number of components and the amount and size of data to transfer required by the algorithm lead to the design of efficient ad hoc NoC architectures according to the algorithm specifications. An ad hoc NoC offers high bandwidth and high scalability at low power and low complexity [1]. However, designing an ad hoc NoC means making several architectural choices, such as buffer sizing, flow control policies, topology selection and routing algorithm selection. These choices must be made at design time, keeping in mind that the final NoC must fulfill a set of critical constraints which depend on the target application, such as latency, energy consumption and design time. The design space being very wide, automation of the design flow must be considered to ensure a rapid evaluation and test of each solution.
In recent years, several approaches [2][3] were proposed to automate the design space exploration of the architecture. These approaches can be categorized into two types: formal and experimental. A formal approach aims to construct a mathematical formulation to predict the NoC behavior. An experimental approach uses either simulation or emulation tools. In approaches that use software simulation, the NoC can be modeled at different levels of abstraction, the abstraction level being a tradeoff
between the desired accuracy and validation speed. FPGAs (Field Programmable Gate Arrays) are commonly used as reconfigurable devices for emulation and test. FPGAs are programmable logic devices used in various applications requiring rapid prototyping of digital electronics (telecommunications, image processing, aeronautics…). Modern FPGAs are now able to host processor cores and DSPs, as well as several IP blocks, to perform efficient prototyping of embedded systems.
Today, several NoC-based emulation platforms on FPGA have been proposed, such as [4][5][6]. Nevertheless, these emulation platforms are not adapted to image and signal processing applications: their emulation blocks cannot emulate all data transfers, such as data transmissions from one initiator to several destinations with automatic variation of the data injection rate.
In this paper, we propose a generic design flow for the emulation of large NoC-based MPSoCs on FPGA platforms. This design flow automatically builds the emulation architecture from the NoC architecture, the type of emulation and the routing algorithm. According to the requirements of the application, the emulation architecture can emulate data transmissions from one or several initiators to one or several destinations with automatic data rate injection. It is implemented on an FPGA platform and supplies a statistics report for the future design of the whole system. The whole emulation architecture is designed as a hierarchical, fully synthesizable and FPGA-independent VHDL description.
The remainder of this paper is organized as follows. Section 2 describes related work. Section 3 introduces the generic design flow for the automatic generation of the emulation platform on FPGA and details all the components inserted in the design flow. Section 4 presents the design flow adapted to the Hermes NoC on a Xilinx platform; experiments are presented and analyzed in this section. Section 5 concludes the paper.
II. RELATED WORK
During the last years, NoC architectures for embedded platforms have evolved impressively. Yet existing NoCs do not solve one of the principal challenges of these communication architectures: finding the optimum, or a set of optimal, NoC architectures for a target application. Several simulation and emulation models have been proposed at different abstraction levels, but these models do not permit design space exploration: exploration remains a manual task requiring the experience of the designer.
Today, several NoC architectures have been successfully implemented on FPGA devices, such as Hermes [7], SoCIN [8], PNoC [9], HIBI [10] and Extended Mesh [11]. The placed and routed (P&R) architecture for the FPGA implementation is generated by the design flow associated with each NoC. These FPGA-based tools and environments rely on simulation in VHDL, SystemC, or a combination of specification, simulation, analysis and generation of NoCs at different levels of abstraction.
In [14], a SystemC-based platform for modeling, simulation and evaluation of an MPSoC NoC including a real-time operating system is presented. In [15], a mixed design flow called NoCGen is proposed, based on SystemC simulation and a VHDL implementation of the NoC structure; it uses a template router to simulate several interconnection networks in SystemC. In [17], a SystemC modeling environment for custom NoC topologies is described. However, these approaches are limited in their estimation accuracy and in the share of the platform that can be synthesized on the FPGA. Increasing the accuracy significantly increases the simulation time: such simulations have a much larger execution time than a NoC emulated on an FPGA device. The simulation time with SystemC or ModelSim for 10⁹ packets can reach from 5 days to 36 days [4]. Moreover, only the NoC structure is implemented on the FPGA; the emulation platform itself cannot be. Emulation on FPGA is therefore proposed to obtain faster simulation times together with the higher accuracy of functional validation.
In [4], the authors present a mixed HW-SW NoC emulation platform implemented on FPGA. The VHDL-based NoC is implemented on a Virtex-II FPGA. The architecture contains a communication network, traffic generators, traffic receptors and a control module. A hard-core processor (PowerPC) is connected to the hardware emulation platform as a global controller that defines the emulation parameters. A fast network-on-chip emulation framework on a Virtex-II FPGA is presented in [16]; it speeds up the synthesis process by using several hard cores and partial reconfiguration. These frameworks integrate one or several hard-core processors used for emulation only: they control the communication architecture at the cost of the limited resources available on the FPGA.
All the emulation platforms presented above support only one-to-one or many-to-one communication. Image and signal processing algorithms, however, require sending data to multiple destinations, which existing emulation platforms do not support.
To solve these problems, the proposed NoC emulation platform consists of data emulation blocks (traffic generators and traffic receptors) and a synthesizable NoC architecture. This platform can emulate any traffic required by image and signal processing applications.
III. DESIGN FLOW FOR THE GENERATION OF THE FPGA EMULATION PLATFORM
A generic design flow is proposed to generate the emulation architecture for an FPGA platform. The design flow takes as inputs:
• the NoC architecture, with several varying parameters,
• the routing algorithms,
• the emulation blocks (traffic generators and traffic receptors), according to the type of emulation,
• the initiators and receptors, with their data transfer specifications.
1. Design Flow
The design flow depicted in Figure 1 automatically generates the emulation architecture for the FPGA platform. It is organized as a hierarchical VHDL structure to allow component instantiation at each level. Several packages (routing, data_transfers) are provided for the parameterization of the architectures, and generic IP blocks are provided for component insertion. These components are the routing components, the traffic generators (TG) and the traffic receptors (TR).
Figure 1. Design flow for the generation of the emulation platform on FPGA.
The designer first selects, in the package, the NoC structure that he wants to explore, specifying the number of switches and the width of the bus. The design flow builds on an existing NoC structure implemented on an FPGA platform and on its associated design flow. Any existing parameterized synchronous NoC structure can be used as input as long as its HDL description is available. The NoC structure should contain the switches, buffers, links and flow control protocol. The design flow takes the VHDL description of this NoC structure as input.
The first step concerns the routing algorithm. This choice is made by the designer, who selects in the routing package the routing algorithm to be used with the NoC structure: the selected algorithm is set to 1 and the unused routing algorithms are set to 0, as depicted in Figure 2.
Routing_XY:=1; Routing_NFNM:=0; Routing_WFM:=0;
Figure 2. Example of routing algorithm selection.

The design flow inserts the corresponding routing IP component from the routing IP block library. We assume that this library contains the VHDL functions of the routing algorithms. The routing algorithm is inserted by instantiating the appropriate routing function in the switch control block of the switch architecture. The design flow adds these routing IP blocks to all switches of the NoC structure to obtain the complete NoC architecture. At this level, the communication architecture is complete and the nodes can be inserted. The parameterized emulation blocks are then added to the platform in the emulation block insertion step, according to the type of emulation. Emulation blocks are traffic generators and traffic receptors designed as VHDL IP blocks; their parameters are specified in the Data_Transfer package. The emulation block insertion step is a generic VHDL component instantiation that uses the Data_Transfer package to allocate the generic VHDL IP blocks. The complete emulation architecture for the FPGA platform is generated as an HDL description, then synthesized and implemented with the synthesis and place-and-route tools of the target platform.
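The selection convention of Figure 2 can be stated precisely: exactly one routing flag is set to 1, and the flow instantiates the corresponding routing function in every switch. The following Python sketch models this one-hot check; only the flag names come from Figure 2, the helper function is our illustration, not part of the design flow.

```python
# Sketch of the one-hot routing selection of Figure 2. The flag names follow
# the figure; the checking helper itself is an illustrative assumption.

routing_flags = {"Routing_XY": 1, "Routing_NFNM": 0, "Routing_WFM": 0}

def selected_routing(flags):
    """Return the single routing algorithm whose flag is set to 1."""
    chosen = [name for name, flag in flags.items() if flag == 1]
    if len(chosen) != 1:
        raise ValueError("exactly one routing algorithm must be selected")
    return chosen[0].removeprefix("Routing_")

algo = selected_routing(routing_flags)  # the flow then instantiates this
                                        # routing function in every switch
```

Setting two flags to 1 (or none) would be rejected, which mirrors the fact that the flow inserts exactly one routing IP block per switch.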
In the presented design flow, the designer tunes the emulation platform by setting the routing IP components in the routing IP block library and by specifying the scenarios in the Data_Transfer package.
Two inputs of the design flow are reused from existing designs: the NoC structure and the routing algorithm. They can be reused immediately provided they have been previously described in a hardware description language for synthesis purposes; they are briefly described in the following sections. The emulation blocks and the data transfer specifications are inputs designed specifically for this flow; their description and parameterization are detailed in the following sections.
2. NoC architecture
NoC architectures [2] are communication architectures that improve the flexibility of the communication subsystem of a SoC, offering a highly scalable, high-performance and energy-efficient customized solution. A NoC architecture is composed of several basic elements: network interfaces (NI), switches, links and resources. These basic elements are connected according to a topology to constitute the NoC architecture.
Data transmitted over a NoC are sent as messages. Several data items can be carried by one message, and one data item can be spread over several messages. A message is a set of packets and a packet is a set of flits (FLow control unITs); the flit is the basic unit transferred over the NoC. The designer selects the size of the flits, the number of flits per packet and the size of the packets. Packets are sent according to the data injection rate, defined as the ratio between the amount of data injected and the capacity of the link to carry data: a 50% data injection rate indicates that the packets use 50% of the bandwidth.
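The message/packet/flit hierarchy described above can be illustrated with a short behavioral sketch (Python is used here only as executable pseudocode; the function name is ours):

```python
import math

# Illustrative decomposition of a message into packets and flits; the sizes
# are designer-chosen parameters, as described in the text.

def split_message(message_flits, flits_per_packet):
    """A message is a set of packets; a packet is a set of flits."""
    n_packets = math.ceil(message_flits / flits_per_packet)
    sizes = [flits_per_packet] * (message_flits // flits_per_packet)
    remainder = message_flits % flits_per_packet
    if remainder:
        sizes.append(remainder)  # last packet carries the leftover flits
    return n_packets, sizes

# A 100-flit message with 15 flits per packet needs 7 packets.
n, sizes = split_message(100, 15)
```

Evaluating the timing and resource cost of such a decomposition, for different flit and packet sizes, is exactly what the generated emulation platform is meant to do.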
Sizing the NoC has a direct impact on timing performance and resource usage. It is therefore important to evaluate the performance of the NoC according to the size of the flits, packets and messages.
Any NoC structure containing all components except the routing algorithm can be used as an input of the design flow, provided the structure is described in a Hardware Description Language (HDL).
3. Routing algorithm
Several routing algorithms have been implemented on FPGA and ASIC devices. A routing algorithm defines the path taken by a packet between the source and target switches. Three types of routing algorithms exist: deterministic, partially adaptive and fully adaptive [3]. 2D meshes and k-ary n-cubes are popular on FPGA, as their regular topologies simplify routing. In a 2D mesh, there are four directions, eight 90-degree turns, and two abstract cycles of four turns.
The most commonly used algorithms are deterministic, because of their simplicity and low resource usage; XY is a widely used deterministic routing algorithm. Other algorithms are semi-deterministic (or semi-adaptive), such as the west-first algorithm. Fully adaptive routing algorithms have also been proposed; they prevent livelock and deadlock [20], but they are generally not used for FPGA implementation because they require more resources and are more complex than the other two types.
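As an illustration of deterministic routing, the XY algorithm routes a packet fully along the X dimension first, then along Y. The sketch below is a behavioral model only; the port names and the axis orientation are our assumptions, not those of a specific NoC:

```python
# Minimal behavioral sketch of deterministic XY routing on a 2D mesh:
# route along X until the column matches, then along Y.
# Port names (EAST/WEST/NORTH/SOUTH/LOCAL) and the convention that NORTH
# increases y are illustrative assumptions.

def xy_route(src, dst):
    """Return the ordered list of output ports a packet takes from src to dst."""
    (x, y), (dx, dy) = src, dst
    ports = []
    while x != dx:                      # X dimension first
        ports.append("EAST" if dx > x else "WEST")
        x += 1 if dx > x else -1
    while y != dy:                      # then Y dimension
        ports.append("NORTH" if dy > y else "SOUTH")
        y += 1 if dy > y else -1
    ports.append("LOCAL")               # deliver to the attached core
    return ports

path = xy_route((3, 0), (0, 2))         # right-to-left, then upward
```

Because the path depends only on source and destination, two packets between the same pair of nodes always take the same route, regardless of traffic: this is precisely what makes the algorithm deterministic.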
Any routing algorithm can be inserted in the design flow provided it is described in an HDL. Some routing IP blocks have already been developed in VHDL and inserted in the design flow.
4. Traffic generators
For the emulation of the NoC, the IP blocks or other components connected to the NoC are replaced by traffic generators (TG) designed as parameterized VHDL entities. Deterministic traffic generators are widely used in NoC emulation. These traffic generators reproduce the traffic flow between IP blocks inside the NoC: they generate a stochastic traffic distribution (packet size, injection time, idle interval duration and packet destination) in order to mimic the behavior of a real IP block for a given application. Several TGs have been designed previously [4][5][6][7], but the format of the packets they send is not suitable for image and signal processing applications. For example, existing TGs can send packets to one destination node only [4][7] or in broadcast in . For most TGs, the receiving node gets no information about the address of the source node. Image processing applications, however, require unicast, multicast and broadcast transfers. They also require the address of the source node to be available when data are received: most nodes perform computations on two types of data coming from two different nodes, so the destination node must be able to extract the source address of incoming data. The proposed packet format also carries timing information (latency) in order to implement a complete synchronous emulation platform, as depicted in Figure 3.
Figure 3. Format of packets generated by TGs.
In our emulation platform, each packet contains a header part and a data part with the following information:
• Address of the destination cores (Dest): any initiator core can send data to one or several destination cores.
• Address of the initiator core (Source).
• Init clock (Clk_init): this flit is reserved for the latency evaluation; when the packet is sent, the sending time is loaded into it.
• Size of the transmitted packet (Sz_pckt).
• Number of packets (Nb_pckt).
The generic TG is depicted in Figure 4. A TG generates the control signals and the packet on the data_in output, whose width is equal to the flit size. The parameters of a TG are the address of the destination node (Address), the coordinates of the source node (IP_address_X and IP_address_Y), the size and number of packets (Size_packet, Nbre_packet) and the data injection rate between packets (Idle_packet). All this information is used to build the packet format depicted in Figure 3. The global clock of the NoC is connected to the clk signal of the TG block, constituting a synchronous platform.
Figure 4. Signals and parameters for generic Traffic Generators.
The number and format of the packets depend on the traffic scenario specified in the data_transfer package.
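The header fields listed above can be modeled behaviorally as follows; representing each field as a single flit and fixing Nb_pckt to 1 are simplifications of ours:

```python
# Behavioral sketch of the packet format of Figure 3: header flits carrying
# the destination, source, a timestamp for latency evaluation, packet size
# and packet count, followed by the payload flits.
# One list element per flit and Nb_pckt = 1 are illustrative simplifications.

def build_packet(dest, source, clk_init, payload):
    """Assemble one packet as [Dest, Source, Clk_init, Sz_pckt, Nb_pckt, data...]."""
    header = [dest, source, clk_init, len(payload), 1]
    return header + list(payload)

# A TG at switch (3,4) sending 3 payload flits to switch (0,0) at cycle 120.
pkt = build_packet(dest=(0, 0), source=(3, 4), clk_init=120, payload=[7, 8, 9])
```

Keeping the source address and the sending time inside the packet is what later allows a destination node to distinguish incoming streams and a traffic receptor to compute end-to-end latencies.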
5. Traffic Receptors
The traffic flows produced by the traffic generators are sent through the NoC and received by traffic receptors (TR). The proposed traffic receptors are parameterized VHDL entities. A TR analyzes the received packets and extracts the latencies of the NoC. Two types of traffic receptors exist. The first type performs, in hardware, a global analysis and statistics on the executed emulation: it checks every packet and extracts the latency of each received packet (the latency is computed online from the timestamp inserted in the 3rd flit of the packet). The latency and the global analysis are sent to a single LCD. The second type only generates a continuous report of the received traces, with detailed values, on the LCD available on the FPGA board. As the emulation platform is designed to emulate as precisely as possible the behavior of the final system, the output components are restricted to the LCD. The designer can add components or interfaces for analysis, but should keep in mind that doing so modifies the emulated structure (and hence the performance of the system).
Both traffic receptors are parameterized VHDL blocks to ensure an automatic generation of the emulation platform.
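The latency extraction performed by the first type of TR can be sketched behaviorally: the receiver subtracts the timestamp found in the 3rd flit (Clk_init) from its own clock at reception time. The names below are illustrative:

```python
# Behavioral sketch of the first TR type: check each received packet and
# extract its latency from the timestamp in the 3rd flit (Clk_init).
# Function names and the (packet, receive_clk) pairing are our assumptions.

def extract_latency(packet, receive_clk):
    clk_init = packet[2]            # 3rd flit: sending time loaded by the TG
    return receive_clk - clk_init

def average_latency(received):
    """Global analysis: average latency over all (packet, receive_clk) pairs."""
    latencies = [extract_latency(p, rx) for (p, rx) in received]
    return sum(latencies) / len(latencies)

avg = average_latency([
    ([(0, 0), (3, 4), 100, 2, 1, 5, 6], 140),   # received 40 cycles later
    ([(0, 0), (3, 4), 200, 2, 1, 5, 6], 260),   # received 60 cycles later
])
```

This only works because the whole platform is synchronous: the TG and the TR timestamps refer to the same global clock, as required by the design flow.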
6. Data transfer specification
TG and TR blocks are inserted according to the data transfer specification given in the Data_Transfer package. Data transfers are specified at the highest level of the emulation architecture description (top_NoC) with generic values. The designer specifies the size (size_packet) and number (nb_packet) of packets sent by each TG, together with the data injection rate (the idle time between packets, expressed with idle_packet). Data have the same format for a given TG. The designer indicates all destination nodes with the destination value: 1 indicates that the node is a TR, and 0 that the node does not receive any data. last_destination indicates the number of TRs for every TG, and total_packet gives the number of packets received by every TR. Finally, the links between TGs and TRs are given with destination_links.
The example depicted in Figure 5 indicates that switches (3,4) and (4,4) are traffic generators (called TG1 and TG2); both send 10 packets of 15 flits (the flit size is automatically extracted from the NoC structure), with an idle time of 20 clock cycles between packets. Switches (0,0), (1,0), (2,0) and (0,1) receive packets. TG1 sends packets to 2 TRs and TG2 to 3 TRs. Switch (0,0) receives 20 packets, and every other TR receives 10: TG1 sends packets to switches (0,0) and (2,0), while TG2 sends packets to switches (0,0), (1,0) and (0,1).
Figure 5. Initiator and receptor with data traffic specification
According to the data transfer specification, the design flow inserts the corresponding emulation blocks from the TG and TR library. A TG is attached to a switch if the associated node is an initiator (i.e. it sends at least one data item to another node). A TR is attached to a switch if the associated node is a receptor (i.e. it receives at least one data item from an initiator). In all other cases, no TG or TR is instantiated in the NoC architecture. The design flow can insert any other type of emulation block (random accesses, random sizes, parameterized sending latencies…); such blocks simply have to be described in an HDL and inserted in the emulation block library.
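This insertion rule can be illustrated by reading a destination_links-style specification (the dictionary layout below is our assumption, not the package's exact syntax); with the Figure 5 scenario, switch (0,0) ends up with total_packet = 20, as in the text:

```python
# Illustrative reading of a destination_links-style specification: a switch
# gets a TG if it initiates at least one transfer, and a TR if it receives
# at least one packet. The dictionary layout is an assumption for the sketch.

destination_links = {                    # initiator -> list of receptors
    (3, 4): [(0, 0), (2, 0)],            # TG1
    (4, 4): [(0, 0), (1, 0), (0, 1)],    # TG2
}

def instantiate_blocks(links, nb_packet):
    tgs = set(links)                                 # switches needing a TG
    total_packet = {}                                # packets per receptor
    for dests in links.values():
        for dst in dests:
            total_packet[dst] = total_packet.get(dst, 0) + nb_packet
    return tgs, total_packet                         # TRs = keys of total_packet

tgs, total_packet = instantiate_blocks(destination_links, nb_packet=10)
```

Switches absent from both sets get neither block, which keeps the generated platform as close as possible to the final system.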
TGs generate several packets that are sequentially sent to the NoC architecture. These packets are generated according to the data traffic specification given in the top_NoC package. If data are sent to several destination nodes, the TG generates one packet per node. TGs can send data with different packet sizes and numbers of packets to one or more destination nodes. As the designer cannot always know the data injection rate of a TG, two types of TGs are proposed:
• TGs that generate packets with a varying data injection rate. The data injection rate is automatically and dynamically swept from a 0% to a 100% load. It corresponds to the idle_packet parameter of the TG, automatically computed in the top_NoC entity from load (the data injection rate) and nbcyclesflit (the number of cycles needed to transfer one flit).
• TGs that generate packets with a given data injection rate, specified by the designer as a constant value for each TG.
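A plausible form of the idle_packet computation (an assumption of ours, not the paper's verbatim equation): if load is the fraction of the bandwidth that the packets must occupy, the idle time between packets follows directly from the packet transmission time.

```python
# Hedged reconstruction of the idle_packet computation. The formula is an
# assumption, chosen to be consistent with the definition that a 50% data
# injection rate means the packets use 50% of the bandwidth.

def idle_packet(load, size_packet, nbcyclesflit):
    """Idle cycles between packets for a target data injection rate.

    load: data injection rate, as a fraction in (0, 1].
    size_packet: number of flits per packet.
    nbcyclesflit: number of cycles needed to transfer one flit.
    """
    send_cycles = size_packet * nbcyclesflit        # cycles spent transmitting
    return send_cycles * (1.0 - load) / load        # cycles left idle

# At a 50% rate the link idles exactly as long as it transmits.
half = idle_packet(0.5, 500, 1)
```

Sweeping load from near 0 to 1 reproduces the behavior of the first TG type, whose injection rate varies automatically from a 0% to a 100% load.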
IV. DESIGN FLOW FOR THE HERMES NOC
The design flow presented above is adapted to the Hermes NoC and its associated design tools, on a Xilinx FPGA device. The generated emulation platforms are used to explore the design space of Hermes NoC architectures, as presented in the following sections. The experimental study evaluates the performance of the Hermes NoC according to the routing algorithms and to the position of the nodes. The experimental platform is an ML506 evaluation board containing a Virtex-5 XC5VSX50 FPGA; the development tools are Xilinx ISE 10.1 with Precision RTL Synthesis. The experiments focus on the average latency and the number of resources.
1. Hermes NoC and ATLAS tool
Hermes is a NoC created at the Catholic University of Rio Grande do Sul (Porto Alegre, Brazil) [7]. It is a packet-switched 2D mesh whose main components are the Hermes switch and the IP cores. The Hermes switch has routing control logic and five bi-directional ports; the local port connects the switch to its local IP core, and all ports possess input buffers for temporary storage. Hermes uses wormhole flow control. ATLAS is an open-source environment that automates the generation of the Hermes VHDL structure. Several features can be parameterized in ATLAS: flit size, buffer depth, number of virtual channels and flow control strategy. These parameters are easily set by the designer to match the specifications of the algorithm. In the design flow of Figure 1, ATLAS is the NoC generation tool and the VHDL IP blocks of the Hermes NoC are the NoC VHDL IP blocks.
The 4x4 mesh NoC architecture with a 1 initiator (I) to 3 receptors (R) scheme is used for the experiments.
2. Routing algorithms
The routing algorithms used are XY, West-First (WFM), North-Last (NLNM) and Negative-First (NFNM). All of them are designed as VHDL IP blocks for immediate insertion in the switches.
3. Impact of the routing algorithm on the number of resources
The first experiment, depicted in Figure 6, shows the number of LUTs according to the size of the NoC for several routing algorithms. The resource counts concern the NoC only; the emulation blocks are not included. The number of LUTs is almost identical for all routing algorithms. Experiments were also made on the number of registers (not depicted in this paper): the number of registers is likewise almost identical whatever the routing algorithm. The first observation is therefore that the number of resources (LUTs and registers) depends mainly on the size of the NoC.
An in-depth analysis of the numbers of LUTs and registers is depicted in Figures 7 and 8 for the XY and NLNM algorithms, chosen because they use respectively the lowest and the highest number of resources. The biggest difference in the number of LUTs is 71 for a 4-node NoC and 850 for a 36-node NoC. This number seems high but remains insignificant compared to the number of resources required by the NoC itself (respectively 3.2% and 3.4% of added LUTs).
Figure 6. Number of LUTs according to the size of the NoC.
Figure 7. Difference of LUTs for XY and NLNM routing algorithms.
Figure 8. Difference of registers for XY and NLNM routing algorithms.
The same analysis is made for the number of registers, as depicted in Figure 8. The difference in the number of registers is lower than for the LUTs (18 for a 4-node NoC and 196 for a 36-node NoC); these extra registers represent between 2.79% and 3.65% of the registers required by the NoC architecture itself. Therefore, the difference in the numbers of LUTs and registers is not significant when choosing the routing algorithm for the Hermes NoC implemented on FPGA. Functionality matters more than resource optimization in this choice: it is wiser to implement a routing algorithm that avoids deadlocks and livelocks than to save a few resources.
4. Impact of the emulation blocks and routing algorithms on the timing performances
For the following experiments, 4 TGs (TG1 to TG4) and 3 TRs (TR1 to TR3) are used, as depicted in Figure 9. Data transfers consist of 40 packets of 500 flits each, with 16-bit flits. The XY, NFM and NFNM routing algorithms are used. The position of the nodes is chosen to force data transfers from right to left, so that the routing algorithms can be compared (NFNM behaves like XY for left-to-right data transfers).
The design flow immediately and automatically generates the three emulation platforms. Each platform contains its routing algorithm (switches with the routing IP block are marked R) as depicted in Figure 9.
Figure 9. Emulation NoC platform generated by the design flow.
The first exploration compares the XY and NFNM routing algorithms. TG1 sends all its data to the three TRs; then TG2, TG3 and TG4 successively follow the same scenario, with a 25% data injection rate. The total latencies are depicted in Figure 10: the selected routing algorithm does not affect the number of cycles required to transmit the data when the traffic is low compared to the capacity of the communication architecture.
Figure 10. Total latency (nb of cycles) for the TG1-3TRs scheme according to the position of the TR (with a 25% data injection rate).
The second exploration evaluates the end-to-end latency when all TGs send data to all TRs (Figure 9), again with a 25% data injection rate. The latency depends on the routing algorithm and on the position of the TRs. The results depicted in Figure 11 show that XY gives a lower latency for TR2 and TR3, while NFM is more efficient for TR1; at this injection rate, both routing algorithms can be used.
Figure 11. End-to-end latency for three TRs with a 25% data injection rate, for three routing algorithms.
Figure 12. End-to-end latency (nb of cycles) for a 3-3 scheme according to the position of the TG and the data injection rate.
The last exploration measures the impact of the data injection rate. Figure 12 shows the end-to-end latency for two injection rates (50% and 75%). XY is the best suited for sending data to TR1 at a 50% data injection rate, but the least suited for sending data to TR3 at a 75% rate. Depending on the positions of the TRs and TGs (more precisely, on the number of hops required), no single routing algorithm always gives the best latency. These experiments highlight the need for an exploration-aid tool, as online emulations of all scenarios are required to evaluate the timing performance and select the most appropriate routing algorithm. Such a tool helps designers quickly build emulation platforms to evaluate different scenarios.
V. CONCLUSION
This paper presents a generic design flow for the automatic generation of NoC exploration platforms on FPGA. Starting from an existing NoC structure, all the required components are inserted by the design flow. Appropriate emulation blocks and traffic scenario specifications are proposed to target a wide range of image and signal processing applications. The designer can easily generate and implement several emulation platforms and explore the NoC structure in a short time. Thanks to the immediate generation of emulation platforms designed for design space exploration on FPGA, the designer can explore many architectural solutions and specify or modify the number and position of the initiators and receptors, in order to extract the best timing performance according to the routing algorithm, the data injection rate and the position of the initiators and receptors. The experiments show that the performance of the final system significantly depends on all these parameters.
REFERENCES
[1] B. M. Al-Hashimi: "System-on-Chip: Next Generation Electronics". Circuits, Devices and Systems, 2006.
[2] L. Benini: "Application Specific NoC Design", in DATE, 2006.
[3] J. Chan, S. Parameswaran: "NoCGEN: A Template Based Reuse Methodology for Networks on Chip Architecture". In Proc. 17th Int. Conference on VLSI Design, 2004, pp. 717-720.
[4] N. Genko, D. Atienza, G. De Micheli et al.: "A Complete Network-On-Chip Emulation Framework". In DATE, 2005.
[5] Y. E. Krasteva, F. Criado et al.: "A Fast Emulation-based NoC Prototyping Framework". Int. Conference on Reconfigurable Computing and FPGAs, 2008, pp. 211-216.
[6] P. Liu, C. Xiang et al.: "A NoC Emulation/Verification Framework". Sixth Int. Conference on Information Technology: New Generations, 2009, pp. 859-864.
[7] F. Moraes, A. Mello, N. Calazans: "HERMES: an Infrastructure for Low Area Overhead Packet-switching Networks on Chip", Integration, the VLSI Journal, vol. 38, no. 1, Oct. 2004.
[8] C. A. Zeferino, A. A. Susin: "SoCIN: A Parametric and Scalable Network-on-Chip", Proc. 16th Symposium on Integrated Circuits and Systems Design, 2003, pp. 169-174.
[9] C. Hilton, B. Nelson: "PNoC: A Flexible Circuit-switched NoC for FPGA-based Systems", in Field Programmable Logic, Aug. 2005.
[10] E. Salminen et al.: "HIBI Communication Network for System-on-Chip", Journal of VLSI Signal Processing Systems, vol. 43, issue 2-3, June 2006, pp. 185-205.
[11] U. Y. Ogras et al.: "Communication Architecture Optimization: Making the Shortest Path Shorter in Regular Networks-on-Chip", in DATE, 2006.
[12] OPNET, www.opnet.com
[13] J. Chan, S. Parameswaran: "NoCGEN: A Template Based Reuse Methodology for Network on Chip", VLSI Design, 2004.
[14] S. Mahadevan, K. Virk, J. Madsen: "ARTS: A SystemC-based Framework for Modelling Multiprocessor Systems-on-Chip", Design Automation of Embedded Systems, 2006.
[15] J. Chan et al.: "NoCGen: A Template Based Reuse Methodology for NoC Architecture". In Proc. ICVLSI, 2004.
[16] Y. E. Krasteva, F. Criado et al.: "A Fast Emulation-based NoC Prototyping Framework", in Reconfigurable Computing and FPGAs, 2008, pp. 211-216.
[17] A. Jalabert et al.: "Xpipes Compiler: A Tool for Instantiating Application Specific Networks on Chip", in DATE, 2004.
[18] U. Y. Ogras et al.: "Communication Architecture Optimization: Making the Shortest Path Shorter in Regular Networks-on-Chip", in DATE, 2006.
[19] OPNET, www.opnet.com
[20] J. Liang, S. Swaminathan, R. Tessier: "aSOC: A Scalable, Single-Chip Communications Architecture", in IEEE Int. Conference on Parallel Architectures and Compilation Techniques, Oct. 2000, pp. 37-46.
!"#$%"&%$'()&(*)+',%$(-)./0&1%)'()2'3)456$-())
!"#$%&'(&)$*+%$,&-+#.*&/(&)(&).*0$%,&1+2&3(&4(&-.5.6.%#,&7+*%.%"$&8(&)$*.+#&7.095:2&$;&'%;$*<.:=0#,&>?-@A,&>$*:$&/5+B*+,&C*.6=5&
D+"#$%(<$*+%$,&0+#.*(<.*0$%,&%+2(0.5.6.%#,&;+*%.%"$(<$*.+#EFG90*#(H*&)!
!"#$%&'$()!"#$!%&'($)*%&+!&,-.$(!/0!1(/'$**%&+!$2$-$&3*!1)'4$5!%&*%5$! %&3$+()3$5! '%(',%3*! ($6,%($*! '/--,&%')3%/&! )('#%3$'3,($*!*,'#! )*! )! 7$38/(4*9/&9:#%1! ;7/:*<! 3/! 5$)2! 8%3#! *')2).%2%3=>!.)&58%53#! )&5! $&$(+=! '/&*,-13%/&! +/)2*?!@)&=! 5%00$($&3! 7/:!)('#%3$'3,($*!#)A$!.$$&!1(/1/*$5>!)&5!*$A$()2!$B1$(%-$&3*!($A$)2!3#)3!(/,3%&+!)&5!)(.%3()3%/&!*'#$-$*!)($!4$=!5$*%+&!0$)3,($*!0/(!7/:! 1$(0/(-)&'$?! "#$($0/($>! 3#%*! 8/(4! 1(/1/*$*! )! (/,3%&+!*'#$-$!')22$5!12)&&$5!*/,('$!(/,3%&+>!8#%'#!%*!%-12$-$&3$5!%&!)!7/:!)('#%3$'3,($!8%3#!5%*3(%.,3$5!)(.%3()3%/&!')22$5!C$(-$*9DE?!"#$! 1)1$(! '/-1)($*! C$(-$*9DE! 3/! 3#$! C$(-$*! 7/:! 3#)3! $-912/=*! 5%*3%&'3! )(.%3()3%/&! )&5! (/,3%&+! -$'#)&%*-*! )&5! )2+/9(%3#-*?!F&$! *$3! /0! $B1$(%-$&3*! $&).2$*! 3/! '/&0(/&3! 5$*%+&! 3%-$!12)&&$5! */,('$! (/,3%&+! )&5! (,&3%-$! 5%*3(%.,3$5! (/,3%&+?! G55%93%/&)22=>!3#$!1)1$(!1($*$&3*!3#$!)5A)&3)+$*!/0!,*%&+!5$)52/'4!0($$!)5)13%A$! (/,3%&+! )2+/(%3#-*! )*! .)*%*! 0/(! .)2)&'%&+! 3#$! /A$()22!'/--,&%')3%/&!2/)5!%&!./3#!(/,3%&+!-$'#)&%*-*?!G&/3#$(!$B1$9(%-$&3!($A$)2*! 3#$! 3()5$/00*!.$38$$&!,*%&+!'$&3()2%H$5!/(!5%*3(%9.,3$5! )(.%3()3%/&?! G! 2)*3! $A)2,)3%/&! $B1/*$*! 3#$! 1$(0/(-)&'$!)5A)&3)+$*! /0! '/-.%&%&+! 5%*3(%.,3$5! )(.%3$(*! 8%3#! 12)&&$5!*/,('$! (/,3%&+?!E$*,23*! $&0/('$! 3#)3! 5$*%+&! 3%-$!12)&&$5! */,('$!(/,3%&+!3$&5*!3/!)A/%5!7/:!'/&+$*3%/&!)&5!'/&3(%.,3$*!0/(!)A$(9)+$! 2)3$&'=! ($5,'3%/&>! 8#%2$! 5%*3(%.,3$5! )(.%3()3%/&! /13%-%H$*!7/:!*)3,()3%/&!0%+,($*?!
I. INTRODUCTION

The growing density of transistors per silicon area enables the implementation of a complete system on a single die, the so-called System-on-a-Chip (SoC). SoCs target high performance with small footprint and low energy consumption when compared to the same system implemented by an equivalent set of chips. A SoC is composed of a possibly large amount of processing elements (PEs), dozens or even hundreds, interconnected by an on-chip communication architecture. As the complexity of applications fitting inside a single SoC raises, scalability and flexibility are achieved through the use of multiprocessor systems on chip (MPSoCs), a special case of SoCs where most or all PEs are programmable processors [1], increasing the SoC architecture flexibility.

SoC and MPSoC designs rely on the massive reuse of predesigned PEs. The communication architecture, on the other hand, is specifically built to fulfill the application requirements, making the design of these components communication centric. Traditional communication architectures such as shared busses and those based on dedicated point-to-point interconnections do not scale well with the ever-growing amount of parallel data transmission [2]. Dedicated point-to-point links lead to communication architectures that are difficult to reuse and enhance in subsequent design revisions. Busses may become bottlenecks, increasing latency and power dissipation. Although hierarchical busses do support parallel communications, scalability suffers and contention increases when communication between PEs located at different sides of a bridge is needed. NoCs are currently considered a better approach for enhancing scalability and power dissipation efficiency [3].

The design of NoC-based MPSoCs must take into account several communication architecture aspects, including topology, buffer dimensioning, output selection (i.e. routing algorithms) and input selection (i.e. arbitration algorithms). In this paper, a NoC is a communication architecture composed of a set of switching elements called routers that employ packet switching communication. Routers interconnected by links form the NoC topology. Some or all routers may also connect to PEs through a network interface, which is not itself considered part of the NoC. The basic function of a router is to monitor incoming packets from its input ports, select one output port and forward packets through some internal path. Arbitration and routing orchestrate the access to and priority over router internal resources and the direction decisions.

The routing behavior may be either a deterministic or an adaptive function. Deterministic routing defines the output port that a packet will take based on its source and destination, irrespective of traffic characteristics. An example of a NoC applying runtime deterministic routing is Hermes [3]. Adaptive routing allows considering more than one candidate output port for a given input port at each router, which can be important for performance, congestion control and fault tolerance, as decisions may consider the instantaneous network status [4]. An example of a NoC architecture proposing the use of runtime adaptive routing is DyAD [5].

Since each router deals with several simultaneous requests to forward packets, an arbitration strategy is necessary. Two choices are to deal with one request at a time, called centralized arbitration, or to deal with a set of requests in parallel, called distributed arbitration. These choices lead to the generic router architectures depicted in Figure 1. In both arbitration strategies competition for router resources may occur.
)!"# !$# )
I%+,($!J!K!"8/!+$&$(%'!7/:!(/,3$(!)('#%3$'3,($*!.)*$5!/&!3#$!)(.%3()3%/&!'#/%'$L!;)<!'$&3()2%H$5!)(.%3()3%/&M!;.<!5%*3(%.,3$5!)(.%3()3%/&?!
Centralized arbitration produces routers which are simpler, while distributed arbitration trades increased router complexity for enhanced performance [6]. Centralized arbitration usually implies that the router contains only one single routing unit, for which all input ports compete. Arbitration and routing define a connection between an input and an output port, after which transmission in that connection starts and the routing unit is released to serve other pending input port requests. Distributed arbitration, on the other hand, implies that competition for resources occurs only at the output ports. This requires some hardware unit replication at the input and output ports (routing and arbiters, respectively), but may increase performance dramatically. The results in this paper support this statement.

The automated design of NoCs may quickly produce a generic communication architecture solution. However, NoC/SoC silicon area, power dissipation and performance may be optimized if architecture configuration and usage are planned [7]. This is especially true when the application communication patterns are known in advance. Considering this situation and restricting attention to path selection among communicating pairs for source routing NoCs, it is usual to employ bandwidth limits as a criterion to define a set of communication routes [8] [9]. Bandwidth limits are an efficient way to spread the overall link load. However, using only this criterion may result in communication loads that are badly distributed in time, which may in turn compromise the overall NoC performance. Moreover, the advantage of design time path selection when compared to runtime distributed routing algorithms is not clear. This paper reports isolated and joint comparisons between source versus distributed routing and centralized versus distributed arbitration strategies.

Remember that most, if not all, routing algorithms used in NoCs are deadlock free because they preclude the use of some routes from source to destination. Often, the justification to use adaptive routing algorithms in place of deterministic ones comes from the capacity of the former to avoid contention by using alternate routes when a conflict occurs at runtime. This paper values another important property of adaptive routing algorithms, which is the richer set of possible routes a packet can use to go from source to destination. This property makes it sound to combine source routing and adaptive routing mechanisms when traffic patterns are known at design time. The richer set of routes facilitates overall load balancing in the communication architecture, while deadlock freedom guarantees the integrity of operation of the communication architecture.

The remainder of this paper is organized as follows. Section II presents the process adopted for route mapping coupled to the Hermes-SR NoC. Section III depicts the Hermes-SR architecture. Section IV describes the experimental setup and results, while Section V displays a set of conclusions and directions for future work.
II. ROUTE MAPPING

The definition of communication routes may be guided by different requirements, aiming at power dissipation, area or performance. This work considers communication performance as the key requirement, measured by the reduction of potential congestion and evaluated by average packet latency. Basically, congestion is detected when the amount of incoming data (or requests for incoming data) is larger than the outgoing data from a given communication element (e.g. a router). A reason for this is a bad distribution of communication flows, which may imply overloaded channels, called hotspots. To avoid hotspots, it is mandatory to: (i) explore alternative paths for each pair of communicating modules, (ii) combine paths of communicating pairs in a traffic scenario and (iii) evaluate route mappings for each traffic scenario.
A. Exploring Alternative Paths

A path is defined here as the sequence of router output ports used to transmit packets from a source to a destination. Depending on the routing algorithm, more than one alternative path may exist between a given source and destination. The exploration of alternative paths has to guarantee that at least one path exists for each communicating pair and that no deadlock will occur when paths are combined into traffic. There are two ways to obtain deadlock freedom: (i) through formal verification or (ii) through the adoption of deadlock-free routing algorithms as basis for path computation. The present work uses the second approach. It employs four different routing algorithms, Pure XY (XY), and the three turn model variations: Negative First (NF), West First (WF) and North Last (NL) [4]. The first is deterministic, while the remaining are adaptive algorithms implemented in two flavors: minimal (NFM, WFM, NLM) and non minimal (NFNM, WFNM, NLNM). For XY routing, exactly one path exists between two communicating entities. For minimal adaptive algorithms, the number of distinct paths (npaths) is either 1 or is defined by Equation (1), where x and y represent the distance in hops along the corresponding axis (x or y) between source and destination.
npaths = (|x| + |y|)! / (|x|! × |y|!)        (1)
For non minimal adaptive algorithms, npaths depends on the relative source and destination positions and on the routing algorithm rules. For example, for the non minimal negative first algorithm (NFNM), it is possible to use Equation (2).
npaths = Σ_{xt=0}^{xs} Σ_{yt=0}^{ys} (|xtd| + |ytd|)! / (|xtd|! × |ytd|!)        (2)
Equation (2) is valid when the destination of the packet is at a point (xd, yd) above and to the right of the source coordinates (xs, ys). In this equation, the pair (xt, yt) represents the position resulting from the displacement from the source to the most negative position in the network. Also, xtd and ytd represent the distance in hops along the x (resp. y) axis between position (xt, yt) and the destination, i.e. xtd = xt − xd and ytd = yt − yd. For the other non minimal adaptive routing algorithms similar equations and considerations apply.
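Equation (1) is the standard count of monotone (minimal) paths on a mesh, which can be cross-checked by brute force. The sketch below is illustrative only — the function names are ours, not part of Hermes-SR:

```python
from itertools import permutations
from math import factorial

def npaths_minimal(x, y):
    # Equation (1): number of distinct minimal paths when the destination
    # is |x| hops away along one axis and |y| hops along the other.
    x, y = abs(x), abs(y)
    return factorial(x + y) // (factorial(x) * factorial(y))

def npaths_enumerated(x, y):
    # Brute-force cross-check: count the distinct orderings of the
    # |x| horizontal ('X') and |y| vertical ('Y') hops of a minimal path.
    return len(set(permutations("X" * abs(x) + "Y" * abs(y))))
```

For a displacement of 2 hops in x and 3 in y, both functions count 10 minimal paths, matching (2+3)!/(2!·3!).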
B. Combining Paths of Communicating Pairs

A route mapping is defined when, for all pairs of communicating modules, one and only one path is chosen for each source and destination pair (represented by the iteration variable i). The amount of possible route mappings depends on the number of communicating pairs and the routing algorithm. The greater the number of alternative paths per communicating pair, the greater the number of achievable route mappings. Equation (3) defines the maximum number of route mappings (nmapRM), where i identifies a communicating pair, npairs is the total number of communicating pairs and npaths(i) is the number of alternative paths for i. As an example, nmapRM is always equal to 1 when XY routing is adopted, since there is only one possible path for each communicating pair.
nmapRM = Π_{i=1}^{npairs} npaths(i)        (3)
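Equation (3) is a plain product over the communicating pairs; a one-line Python sketch (illustrative, names are ours):

```python
from math import prod

def nmap_rm(npaths_per_pair):
    # Equation (3): the maximum number of route mappings is the product
    # of the alternative-path counts over all communicating pairs.
    return prod(npaths_per_pair)
```

With XY routing every pair contributes a single path, so three pairs give nmap_rm([1, 1, 1]) = 1, while two pairs with 2 and 3 alternatives already allow 6 distinct route mappings.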
C. Evaluation of Route Mappings

The evaluation of route mappings is based on: (i) the communication characteristics, modeled by all communications of the application in terms of source module, target module and transmission rate, (ii) the alternative paths for each communicating pair, modeled by a graph containing all possible paths from each source to each destination module, and (iii) the selected cost function, which considers the average rate of path occupancy, the peak usage of the path and the path length. Smaller values for these aspects lead to better paths.

Initially, a valid path is randomly assigned to each communicating pair. In this step, no additional care is taken for path binding. The only guarantee offered by this assignment process is the existence of the path on the list of alternative paths for the given communicating pair. After a path has been assigned to each communicating pair, NoC occupation is estimated by accumulating the transmission rate of each communicating pair. The next steps seek to optimize this initial route mapping.

The route mapping optimization is carried out by varying the paths of each communicating pair, trying all alternative paths. When a communicating pair is being evaluated, the remaining pairs have their paths fixed.

A new path for a communicating pair is assumed if, compared to the current route mapping: (i) the average rate of path occupancy is lower and its peak usage is lower or equal, or (ii) the average rate of path occupancy is equal and the peak usage is lower, or (iii) the average rate of path occupancy is equal, the peak usage is equal and the path length is shorter. The first rule guarantees a better distribution of NoC communicating flows, equalizing the communication channel occupation. The second rule guarantees that if a low congestion zone cannot be found, at least hotspots are avoided, bringing peak usage down. Finally, when using a non minimal routing algorithm, the third rule guarantees that if the same average rate of path occupancy can be found in a shorter path, the opportunity for lower power dissipation is not overlooked.

The process of route mapping optimization finishes in three possible situations: (i) when there is only one possible route mapping (e.g. when using XY routing), (ii) after reaching a given number of tries, which is parameterizable, and (iii) when no further optimization can be obtained after all communicating pairs have been evaluated at least once. The resulting route mapping is called planned source routing (PSR).
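The acceptance rules (i)-(iii) above form a simple comparison that the optimization loop applies to each candidate path. A minimal sketch, assuming a hypothetical representation where each path is summarized by its average occupancy rate, peak usage and length:

```python
def accept_new_path(current, candidate):
    # current / candidate: (avg_occupancy, peak_usage, length) tuples.
    cur_avg, cur_peak, cur_len = current
    new_avg, new_peak, new_len = candidate
    if new_avg < cur_avg and new_peak <= cur_peak:
        return True   # rule (i): lower average, peak not worse
    if new_avg == cur_avg and new_peak < cur_peak:
        return True   # rule (ii): same average, lower peak
    if new_avg == cur_avg and new_peak == cur_peak and new_len < cur_len:
        return True   # rule (iii): tie on load, shorter path
    return False
```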
III. THE HERMES-SR ARCHITECTURE

Before describing Hermes-SR, it is necessary to approach two other NoC architectures used here as a basis for comparisons, Hermes and Hermes-M. Hermes and Hermes-M are 2D-mesh topology NoCs with routers using credit based flow control, input buffering, wormhole packet switching and centralized arbitration (round-robin algorithm). However, while the routing scheme of Hermes is distributed, Hermes-M uses a planned source routing scheme.

Hermes-SR differs from Hermes-M because it employs a distributed arbitration scheme with a first come-first served (FCFS) algorithm, which guarantees in-order packet servicing.

Depending on the PE mapping, on the specific (adaptive) routing algorithm and on some application characteristics, routing may overload several network regions, implying a decrease of the communication architecture efficiency due to the increase of packet latencies.

The predefinition of paths supported by Hermes-SR source routing guarantees a known worst case for link loads. Also, it may help optimizing NoC area through the elimination of unused links and through buffer dimensioning, both of which are outside the scope of this work.

All NoCs in this work assume a simple packet structure, composed of a header containing destination and size information and a payload. All three NoCs support arbitrary flit sizes, although all experiments here use only 16-bit flits. The NoC packets slightly differ in their header structure. In Hermes, the 2-flit packet header stores the destination router address as first flit, followed by a flit with the size of the payload. In Hermes-SR and Hermes-M, the header starts with an in-order sequence of the output ports necessary to arrive at the destination, followed by the single-flit payload size. While the Hermes header occupies exactly two flits, the variable size headers of the other two NoCs comprise at least three flits, due to the use of an additional flit as route terminator flag.

Concerning arbitration in Hermes-SR, each router input port directly notifies the desired output port to transmit a packet. As illustrated in Figure 1(b), this approach enables serving multiple requests to distinct ports in parallel. Transmission requests are stored and served in arrival order by each output port. Figure 1(a) illustrates the other approach, used in Hermes and Hermes-M, which employs centralized round-robin arbitration. Here, each input port requests routing from a control unit and waits either for an output port assignment or for a denial of assignment, if the requested port is already busy. In any case, if arbitration serves a port, that port loses its priority. Because of this, input ports may suffer from starvation, depending on the network load and the amount of competition among communication flows.
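The FCFS queue that each Hermes-SR output port keeps can be captured with a toy model (Python, illustrative only — the class and port names are ours), which makes clear why requests to distinct ports are served in parallel:

```python
from collections import deque

class OutputPortArbiter:
    # One instance per output port: requests are queued and granted
    # strictly in arrival order (first come-first served).
    def __init__(self):
        self._queue = deque()

    def request(self, input_port):
        self._queue.append(input_port)

    def grant(self):
        # Returns the next input port to serve, or None when idle.
        return self._queue.popleft() if self._queue else None
```

Since every output port owns its own arbiter, a request for the East port never waits behind an unrelated request for the Local port, unlike the single round-robin unit of Figure 1(a).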
IV. EXPERIMENTAL SETUP AND RESULTS

Hermes-SR, Hermes and Hermes-M were described in synthesizable RTL VHDL with fixed dimensioning (5x5), flit size (16 bits) and a 50MHz operating frequency, resulting in a bandwidth of 800Mbps per link. The experiments employed various routing algorithm and buffer size combinations.

Several synthetic and real traffic patterns allowed evaluating arbitration and routing schemes. While real traffic scenarios allow assessing the behavior of specific applications, synthetic traffic scenarios enable exploring the limits of the NoCs, such as saturation and behavior under congestion.

A set of text files describes each traffic pattern as a set of packets (header + payload) and the ideal injection moment for each packet. This is an integer informing the number of clock cycles after simulation start. Injection sources are implemented by cycle- and pin-accurate SystemC input modules, responsible for interpreting traffic files and injecting packets into the NoC. SystemC output modules grab packets from the NoC outputs, storing their contents and arrival moment for statistical evaluation.

Latency values presented here are not limited to the NoC transmission delay. Figure 2 differentiates transmission latencies based on injection and reception distributions. A planned injection distribution is defined in the traffic scenario text files, and depicts the ideal injection moment for each packet i. The accomplished injection distribution considers the actual packet insertion moment into the NoC, which can be delayed by contention at the packet source. The ideal reception distribution represents the expected delivery moments of packets, taking network status into account or not. The accomplished reception distribution represents the real moment where packets are delivered to their destination. No contention at the destination is considered.
"#$%&'()#!
*%&%+'()#!
%&"''()
*)("&+,"-('./+
011&2."-23'+,"-('./+4(-5367+,"-('./+
+,&-%'!(./! +,&-%'!(! +,&-%'!(0/!
0..381&29:()
*)("&
0..381&29:() )I%+,($!N!K:/--,&%')3%/&!2)3$&'=!3=1$*?!
A hypothetical distribution of such injection and reception scenarios is illustrated in Figure 2. Ideal latency is the minimum number of cycles a packet needs to reach its destination. This is based on the difference between the ideal injection moment and the expected delivery moment. Network latency is the delay verified by the packet during its traffic from source to destination, which may be influenced by competition for NoC resources (e.g. links, buffers, arbitration, routing). Application latency normally brings the most important impact on the ideal communication performance. This is computed as the difference between the ideal injection moment of packets and their effective delivery moment at the destination. Application latency is the value assumed for comparison in the next experiments.
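Given the four moments defined above for one packet, the three latency figures reduce to simple differences; a sketch with illustrative names:

```python
def latency_figures(ideal_injection, accomplished_injection,
                    ideal_reception, accomplished_reception):
    # All arguments are clock-cycle counts for a single packet.
    ideal = ideal_reception - ideal_injection                   # ideal latency
    network = accomplished_reception - accomplished_injection   # network latency
    application = accomplished_reception - ideal_injection      # application latency
    return ideal, network, application
```

For example, a packet planned for cycle 100, actually injected at 110, expected at 150 and delivered at 180 has an ideal latency of 50, a network latency of 70 and an application latency of 80 cycles.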
A. Evaluating Performance under an All-to-All Traffic Pattern

Two classes of experiments were performed to evaluate performance optimization. The first class compares the usage of PSR against distributed routing (DR), with the results summarized in Figure 3. The second class focuses on centralized versus distributed arbitration, with results depicted in Figure 4.

The comparison values were captured from five distinct traffic scenarios, with injection rates of 10%, 20%, 30%, 40% and 50% of the channel bandwidth capacity, corresponding to absolute rates of respectively 80Mbps, 160Mbps, 240Mbps, 320Mbps and 400Mbps. The temporal distribution of packet injection is uniform. Packets for Hermes have 20 flits, while for the Hermes-SR and Hermes-M NoCs the size varies around this value, depending on the amount of hops to reach the destination. The spatial distribution is one-to-all from each injection source, producing an all-to-all NoC traffic pattern.

The first experiment (Figure 3) assumed a traffic scenario where sources inject 30% of the channel bandwidth capacity. It mainly compares Hermes running DR and Hermes-M using PSR with different routing algorithms. Both architectures employ a centralized arbitration scheme (round-robin).
Figure 3 – Latency results obtained when comparing distributed routing (DR) versus planned source routing (PSR) approaches.
Figure 3 depicts that PSR led to increased latency when compared to DR for NFNM and NFM, except for one case, which resulted in a small gain (1.12% – around 350 clock cycles faster in average). However, for the remaining routing algorithms latencies were reduced in all cases for PSR when compared to DR.

The behavior of WF and NL has three explanations. The first is the degree of freedom provided by WF and NL when compared to NF for route mapping exploration. This can be formally demonstrated but, intuitively speaking, WF and NL determine a single direction that must be employed at the start (WF) or end (NL) of the routing process, while NF determines that only two directions (the negative ones) can be taken at the start of the routing process. The second explanation is the global knowledge of channel loads when adopting planned routing. The third is the bad decisions that DR may take, since judgments are made based on locally available information only to resolve congestion, possibly deviating packets to other congested regions.

During simulation, DR always achieved lower latencies when comparing WFNM to NLNM and WFM to NLM. However, when using PSR, latencies are always lower when comparing NLNM to WFNM and NLM to WFM. These results show that the choice of routing algorithm strongly depends on the choice of routing strategy.

Figure 4 shows a second evaluation experiment that assumes traffic scenarios with packet injection rates varying from 10% to 50%. This evaluation essentially compares centralized (Hermes or Hermes-M) and distributed (Hermes-SR) arbitration schemes.

Routes were defined with the XY algorithm, guaranteeing the same packet distribution for all NoCs. No significant difference is observed up to a 20% injection rate. At the 30% injection rate and above, it is noticeable that distributed arbitration can reduce the router control congestion, since latencies are significantly reduced when compared to a centralized approach. Additionally, it is observable that the bigger the buffer sizes, the lower the average latency, in all cases. However, it can also be observed that at each injection rate, the lowest average latency obtained when centralized arbitration employs the biggest buffer size (32-flit buffer) is greater than the average latency obtained with the same injection rate for distributed arbitration using the smallest buffer size (4-flit buffer), except at a 30% injection rate. This case is not really relevant because the difference is slight (around 10 clock cycles in average – less than 10% of difference).
[Charts: Centralized vs Distributed Arbitration (XY Algorithm), panels for 10%-50% injection rates]

Figure 4 – Average latencies for centralized (Hermes and Hermes-M) and distributed arbitration (Hermes-SR). Injection rates vary from (a) 10%-20%, (b) 30% to (c) 40%-50%.
B. Evaluating Performance under a Hotspot Traffic Pattern

This experiment allows evaluating the distributed or centralized arbitration architectural choices, and distributed or source routing decisions. Also, it enables measuring the effectiveness of statically planned routing. A hotspot traffic scenario is used, where two nodes concentrate all packet destinations on the NoC. A 100 Mbps injection rate with uniform temporal distribution per source is used. Table 1 presents the average latency computed during simulation. In the XY algorithm line of Table 1 only architectural decisions can be analyzed, since the same routes are used for all NoCs. Comparing the Hermes and Hermes-M columns, it can be seen that no gain is achieved when adopting distributed or planned source routing in the XY line. However, the comparison of the Hermes-SR column with the two previous ones shows that distributed arbitration led to more than 40% of latency reduction when compared to the centralized approach.
").2$!J!K7/:!)A$()+$!2)3$&'%$*!'/-1)(%*/&!;%&!'2/'4!'='2$*<?!
!
1)2! 3%45%6! 3%45%6.7! 3%45%6.8*
*)9'(#:!6&;%5%! <(6'4(=9'%<! 6)94&%!>?8*@!
A4=('4,'()#!6&;%5% &%#'4,B(C%<! <(6'4(=9'%<!
*)9'(#:!,B:)4(';5!
DE! /FGHIJKHL! /FGHIJKHL MGJILKNL
1O! PGPLIKFL! /GMNIKFJ FMQKQN
1O7 IGIIPKNI! /GLNJKQL FLJKHL
RO! FG/IMKII! PGLJLKJM /GHNFKIP
RO7! //GHM/KIL! FGPIIKJJ QGPMQKQI
1S QGJPHKMF! QGHFHK/H JFIK/M
1S7! MGHFPK//! QG/HPKNN JPPKHL)
When comparing the Hermes and Hermes-M columns of Table 1 – the same arbitration and different routing schemes – for all routing algorithms (except XY) the effectiveness of PSR stands out. The experiment showed latency reductions from 24.93% (NL) up to 73.87% (NLM) in these cases.

When comparing the Hermes-M and Hermes-SR columns of Table 1 – the same routing and different arbitration schemes – the effectiveness of distributed arbitration is highlighted. The experiment showed an average latency reduction of 53.29%.

Additionally, the benefit of combining planned source routing and distributed arbitration, as supported by Hermes-SR, is noticeable. Comparing the first and last columns, the average latency is up to 11 times smaller (NLM) and the average latency reduction for all cases is 70.20%.

Figure 5 illustrates the link workload estimation when different routing algorithms are used as basis for route mappings in Hermes-SR. The XY picture presents two peaks of load concentration, resulting in high competition and consequent performance degradation. NF, NFM, NL and NLM are the most suitable algorithms to distribute the workload over the NoC links, followed by WF and WFM.
Figure 5 – Link workload estimation when employing distinct routing algorithms for path selection in planned source routing applied to a 5x5 Hermes-SR NoC. The peak at the top of each square represents the maximum workload reached during simulation.
C. Evaluating Area Costs

Table 2 depicts the area usage of Hermes and Hermes-SR central routers (with 5 input and output ports) in a XC5VLX30 Virtex 5 FPGA. Area is expressed in terms of number of LUTs and flip-flops for four configurations of buffer size. Synthesis results come from the use of the XST tool, part of the Xilinx ISE 9.2i toolset. Nothing but default parameters were assumed in the synthesis tool. Hermes-M is not explicitly referenced in Table 2, since its implementation is quite similar to Hermes. The difference implied by the routing mechanism implementation has no significant impact in terms of area.

Table 2 – Hermes and Hermes-SR area comparison (5-port router).
Buffer size | Hermes LUTs | Hermes Flip-flops | Hermes-SR LUTs | Hermes-SR Flip-flops
4           | 1064        | 212               | 1437           | 240
8           | 1128        | 235               | 1505           | 260
16          | 1227        | 254               | 1634           | 280
32          | 1532        | 266               | 1915           | 300
* results for a 5-port router

Although Hermes-SR improves performance, it clearly penalizes area, since it presents an average of 31.7% more LUTs and 11.7% more flip-flops than Hermes. However, Figure 6 (extracted from Figure 4) points out that distributed arbitration NoCs, even implemented with 4-flit buffers, achieve better performance than centralized arbitration NoCs implemented with 32-flit buffers for all injection rates, except 30%. Comparing a 4-flit buffer Hermes-SR NoC with a 32-flit buffer Hermes, the average latency for all injection rates reduces by approximately 35.2%.
Figure 6 – Average latencies when comparing distributed arbitration in 4-flit buffer Hermes-SR to centralized 32-flit buffer Hermes for injection rates varying from 10% to 50%.
When considering an implementation of a 4-flit buffer Hermes-SR and an implementation of a 32-flit buffer Hermes, it is visible that choosing Hermes-SR implies area consumption reduction (6.2% less LUTs and 9.8% less flip-flops). Since the average latency is also reduced in the Hermes-SR implementation, even with such a small buffer size, it is easy to conclude that distributed arbitration with planned source routing is a good design choice, reducing size and latency. Moreover, some works show that buffer size strongly contributes to power dissipation [10], while others show that NoC buffers may account for around 90% of the power dissipation in a router [11]. Therefore, the reduction from 32-flit buffers to 4-flit buffers probably also implies significant energy savings.
V. CONCLUSIONS AND ONGOING WORK

In this paper, the main contributions come from the performance evaluation of routing and arbitration architectural decisions and their impact on area costs. A secondary contribution is the proposition of a method for path computation, based on previously known application traffic behavior, which guarantees absence of deadlock and a more balanced communication load.

The Hermes-SR NoC architecture was implemented to explore the use of distributed arbitration and source routing. This NoC served to compare the trade-offs involved in deciding about the use of source versus distributed routing strategies, as well as about the use of centralized versus distributed arbitration strategies. Additionally, the paper proposes a route mapping process that can be advantageous for enabling the estimation of link occupancy in NoCs and the consequent performance improvement. This allows addressing congestion mitigation through hotspot avoidance. The results point to a NoC design with performance improvement, power dissipation reduction and area savings.

Ongoing work centers on dynamic traffic scenarios, where applications are loaded at runtime and communication requirements are requested on the fly. Self adaptive NoCs based on global information knowledge are an interesting choice, since results definitely showed that decisions based on locally acquired information may lead to bad performance results.
ACKNOWLEDGEMENTS

The authors acknowledge the support of CNPq through research grants 141247/2005-3, 308924/2008-8, 309255/2008-2, 301599/2009-2 and of the FAPERGS grant 10/0814-9.
REFERENCES
[1] Wolf, W. et al. "Multiprocessor System-on-Chip (MPSoC) Technology". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 27(10), Oct. 2008, pp. 1701-1713.
[2] Pasricha, S.; Dutt, N. "On-Chip Communication Architectures – System on Chip Interconnect". Morgan Kaufmann Science, 2008, 544p.
[3] Moraes, F. et al. "Hermes: an infrastructure for low area overhead packet-switching networks on chip". Integration VLSI Journal, 38(1), Oct. 2004, pp. 69-93.
[4] Glass, C.; Ni, L. "The Turn Model for Adaptive Routing". Journal of the Association for Computing Machinery, 41(5), Sep. 1994, pp. 874-902.
[5] Hu, J.; Marculescu, R. "DyAD - Smart Routing for Networks-on-Chip". In: DAC'04, 2004, pp. 260-263.
[6] Kulmala, A. et al. "Distributed bus arbitration algorithm comparison on FPGA-based MPEG-4 multiprocessor system on chip". IET Computers & Digital Techniques, Jul. 2008, 2(4), pp. 314-325.
[7] Bertozzi, D.; Benini, L. "Xpipes: A Network-on-chip Architecture for Gigascale Systems-on-Chip". IEEE Circuits and Systems Magazine, 4(2), 2004, pp. 18-31.
[8] Fen, G.; Ning, W. "A Minimum-Path Mapping Algorithm for 2D mesh Network on Chip Architecture". In: APCCAS'08, 2008, pp. 1542-1545.
[9] Bolotin, E. et al. "Routing Table Minimization for Irregular Mesh NoC". In: DATE'07, 2007, pp. 942-947.
[10] Banerjee, N. et al. "A Power and Performance Model for Network-on-Chip Architectures". In: DATE'04, 2004, pp. 1250-1255.
[11] Palma, J. et al. "Mapping Embedded Systems onto NoCs – The Traffic Effect on Dynamic Energy Estimation". In: SBCCI'05, 2005, pp. 196-201.
)
On-Chip Efficient Round-Robin Scheduler for High-Speed Interconnection
Pongyupinpanich Surapong and Manfred Glesner
Microelectronic Systems Research Group,
Technische Universität Darmstadt, Darmstadt, Germany
Email: {surapong; glesner}@mes.tu-darmstadt.de
Abstract—Due to the simplicity of scheduling, the buffered crossbar is becoming attractive for high-speed communication systems. Although the previously proposed Round-Robin algorithms achieve 100% throughput under uniform traffic, they cannot achieve satisfactory performance under non-uniform traffic. In this paper, we propose an efficient Round-Robin scheduling algorithm based on a binary-tree scheme where a service policy is applied to improve Quality-of-Service. With the proposed scheduling algorithm, a searching time-complexity of O(1) (one clock cycle) and 100% throughput under non-uniform traffic can be obtained. Based on a binary-tree structure, the design achieves high data rates in the Tbps range and a simple design with combinational circuits. The design has been simulated on both FPGA-based (Virtex 5) and silicon-based (0.18 µm) technology. The synthesis results show that the consumed resources vary from 11 to 533 slices and from 46 to 1686 2-NAND gates for crossbars of size 4×4 to 128×128. Critical path delays from 0.72 to 4.52 ns for the FPGA-based and from 1.33 to 4.0 ns for the silicon-based design were obtained.
I. INTRODUCTION
Performance and efficiency of a generic buffered crossbar depend on input-, internal-, and output-scheduling mechanisms [1]. It is composed of three main structures: input ports, output ports and a switch fabric interconnecting the input and the output ports. The complexity of the schedulers located at all crosspoints to manage the data queues is O(log N²), where N is the number of input ports, based on a symmetrical structure [1]. Thus, improving the performance and efficiency of a scheduler is attractive for interconnection designers.
Scheduling schemes are divided into two main categories: weighted algorithms and Round-Robin algorithms. T. Javadi et al. [2] and M. Nabeshima [3] introduced LQF-RR and OCF-OCF to match inputs to outputs. Since their basic building blocks for the matching operations are integer comparators and multiplexers, their complexities are O(N log N). To reduce this complexity, the internal information structure SCBF [4] was proposed with O(log N). However, it has unstable regions for the states of the input virtual output queues (VOQs), and its complexity is still too sensitive to the crossbar size. Therefore, schedulers based on weighted algorithms have limitations for building high-speed and large-capacity crossbars.
Due to its simplicity, fairness, 100% throughput and contention-free operation, a Round-Robin-based mechanism was proposed as RR-RR [5]. It has been improved with DRR [6] and DRR-k [1]. These two versions applied a double-pointer updating mechanism to overcome the limited performance of the Round-Robin scheme. However, since the position of the double pointer has to be updated as fast as possible, their design structure based on comparator and counter functions is too complex to support data rates up to Terabits per second. Chao [7] proposed a structure based on a binary-tree arbiter which can perform the arbitration in a fast and efficient way. However, this framework cannot guarantee fairness to all inputs during non-uniform traffic.
In this paper, we explore the design of an efficient Round-Robin scheduler based on a binary-tree structure which guarantees fairness and 100% throughput, without contention, under non-uniform traffic. The design achieves very high data rates with low time-complexity (O(1)). A service function has been included to improve Quality-of-Service (QoS) for all input ports.
The rest of this paper is organized as follows: the efficient Round-Robin algorithm is explained in section II. Section III introduces the hardware implementation of an 8×8 efficient Round-Robin scheduler based on an 8×8 buffered crossbar. Performance and efficiency of the design are reported in section IV and compared with the related work. Finally, conclusions are presented in section V.
II. AN EFFICIENT ROUND-ROBIN ALGORITHM
TABLE I: Binary-tree selection on a Leaf- and Root-Node [9].
State  PL  RL  PR  RR  Leaf-Node  Root-Node
1      0   0   0   0   -          -
2      0   0   0   1   right      -
3      0   0   1   0   right      right
4      0   0   1   1   right      right
5      0   1   0   0   left       -
6      0   1   0   1   left       -
7      0   1   1   0   left       left
8      0   1   1   1   right      right
9      1   0   0   0   left       left
10     1   0   0   1   right      right
11     1   0   1   0   -          -
12     1   0   1   1   -          -
13     1   1   0   0   left       left
14     1   1   0   1   left       left
15     1   1   1   0   -          -
16     1   1   1   1   -          -
A. Binary-Tree algorithm
A time-optimized Round-Robin algorithm can be realized by
applying the binary-tree arbitration [7]. With an N×N buffered crossbar structure, the binary-tree level equals log(N + 1), as shown in [8]. Since the basic element of any node of the binary tree has four inputs and two outputs [9], the outputs can be defined by the association of input priority and input request on either the left or the right side, respectively. Assuming that P(L or R) and R(L or R) are the input priority and input request, table I determines
the selective states of the Leaf-Node and the Root-Node under their possible actions.
For example, suppose that we have four inputs comprising priorities and requests, with PL3RL3PR2RR2 and PL1RL1PR0RR0 respectively equal to 0101 and 0011. Conforming to the binary-tree selection table I, RR0 is selected, as shown in figure 1.
Fig. 1: A binary-tree selection conforming to table I, where node PR0RR0 is selected when PL3RL3PR2RR2=0101 and PL1RL1PR0RR0=0011.
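The worked example can be replayed with a small software model of Table I. The Python sketch below is an illustration only: the dictionaries transcribe Table I, the function and variable names are my own, and it assumes every node resolves to a side, as in the example above.

```python
# Software model of the binary-tree selection of Table I (a sketch, not the
# RTL). Keys are (P_left, R_left, P_right, R_right); values say which side
# a node forwards, or None for the table's "-" entries.
LEAF = {
    (0, 0, 0, 0): None,    (0, 0, 0, 1): "right",
    (0, 0, 1, 0): "right", (0, 0, 1, 1): "right",
    (0, 1, 0, 0): "left",  (0, 1, 0, 1): "left",
    (0, 1, 1, 0): "left",  (0, 1, 1, 1): "right",
    (1, 0, 0, 0): "left",  (1, 0, 0, 1): "right",
    (1, 0, 1, 0): None,    (1, 0, 1, 1): None,
    (1, 1, 0, 0): "left",  (1, 1, 0, 1): "left",
    (1, 1, 1, 0): None,    (1, 1, 1, 1): None,
}
# The Root-Node column of Table I differs only in states 2, 5 and 6 ("-").
ROOT = {**LEAF, (0, 0, 0, 1): None, (0, 1, 0, 0): None, (0, 1, 0, 1): None}

def arbitrate4(pr3, pr2, pr1, pr0):
    """Arbitrate four (priority, request) pairs; return the winning index.
    Assumes each node resolves to a side, as in the paper's example."""
    left_sel = LEAF[pr3 + pr2]        # leaf node over inputs 3 and 2
    right_sel = LEAF[pr1 + pr0]       # leaf node over inputs 1 and 0
    left_win = (3, pr3) if left_sel == "left" else (2, pr2)
    right_win = (1, pr1) if right_sel == "left" else (0, pr0)
    # The root node arbitrates between the two leaf winners.
    root_sel = ROOT[left_win[1] + right_win[1]]
    return (left_win if root_sel == "left" else right_win)[0]

# The worked example: PL3 RL3 PR2 RR2 = 0101 and PL1 RL1 PR0 RR0 = 0011.
print(arbitrate4((0, 1), (0, 1), (0, 0), (1, 1)))  # -> 0, i.e. RR0 is granted
```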
B. Round-Robin-based scheduling mechanism with service function
In this section, we introduce a Round-Robin algorithm based on a binary-tree searching scheme where a service function is applied to all input ports in order to improve QoS under the non-uniform traffic defined by [1].
TABLE II: A Round-Robin-based scheduling mechanism with a service function to improve QoS under non-uniform traffic.
Mechanism: Round-Robin-based scheduling mechanism
Input: Credit Buffer (CBi), Request (Reqi), service-ratio (Servicei)
Output: Grant (Granti)
Internal: Priority (Prii), counter, start, enable, i
1)  Beginning: Pri0 = 1, counter = 0, i = 0, enable = 1, start = 0
2)  while (1) loop
3)    if enable = 1 then
4)      Reading all CBs, Reqs and Services, start = 0
5)      Grant = binary-tree function(Req, CB, Pri)
6)      if Grant > 0 then
7)        enable = 0, start = 1
8)        for j = 1 to N
9)          i = j where Grantj = 1
10)       end for
11)     end if
12)   end if
13)   if counter = Servicei then
14)     counter = 0, enable = 1
15)     Pri = cyclic shift left(Grant)
16)   elsif CBi > 0 and start = 1 then
17)     counter++
18)   end if
19) end loop
At the beginning, the internal parameters are set as in Line 1). Within the loop, if enable is in the enabled status, the algorithm reads the CB, Req and Service information. Meanwhile, start is set to the disabled status. Grant is computed by the binary-tree function in Line 5). In Line 6), if Grant is greater than zero, enable and start are set to the disabled and enabled status, respectively. Afterward, i is determined corresponding to Grant's value. Between Line 13) and Line 18), counter counts up while start is in the enabled status. When counter reaches Servicei, enable returns to the enabled status; afterward the loop restarts.
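The loop above can be exercised in software. The following is a behavioural Python rendering of the Table II mechanism, a sketch under stated assumptions: all names are mine, and the binary-tree circuit is stood in for by a simple round-robin search from the current priority position. It models one grant decision per simulated clock cycle.

```python
def rr_schedule(reqs, cbs, service, cycles):
    """Behavioural model of the Table II loop. reqs: request bit per port;
    cbs: credit count per port; service: service-ratio per port. Returns
    the granted port for each simulated clock cycle (None while idle)."""
    n = len(reqs)
    pri, counter, enable, start, i = 0, 0, True, False, 0
    grants = []
    for _ in range(cycles):
        if enable:
            start = False  # Line 4): reading inputs, start disabled
            # Stand-in for Grant = binary-tree function(Req, CB, Pri):
            # round-robin search starting at the priority position.
            for off in range(n):
                j = (pri + off) % n
                if reqs[j] and cbs[j] > 0:
                    i, enable, start = j, False, True  # Lines 6)-9)
                    break
        grants.append(i if start else None)
        if counter == service[i]:          # Lines 13)-15)
            counter, enable = 0, True
            pri = (i + 1) % n              # cyclic shift left of the grant
        elif cbs[i] > 0 and start:         # Lines 16)-17)
            counter += 1
    return grants

# Four ports, all requesting, service-ratios all 1: the grant rotates.
print(rr_schedule([1, 1, 1, 1], [3, 3, 3, 3], [1, 1, 1, 1], 8))
# -> [0, 0, 1, 1, 2, 2, 3, 3]
```

With a larger service-ratio on one port, that port holds the grant proportionally longer, which is the QoS effect the service function is meant to provide.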
Fig. 2: Design entity of the efficient Round-Robin scheduler for eight requests.
Fig. 3: Block diagram of the efficient Round-Robin scheduler, where the service-ratios (S), the 8-bit VOQ request and the credit buffers (CB) are its inputs.
Based on this mechanism and under non-uniform traffic (hot-spot and unbalanced data rates), the service-ratio (Servicei) for each input port i comes from the data rate itself. Assuming that Vi is the data rate at input port i, Servicei = Vi / Min(V1, ..., VN), where N is the number of input ports. For example, if the data rates of four input ports are 50, 100, 150 and 200 KBit/sec, their service-ratios will be 1, 2, 3 and 4 respectively.
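The service-ratio rule reduces to a one-liner; this small helper (the name is mine) reproduces the example's numbers, assuming each rate is an integer multiple of the minimum rate:

```python
def service_ratios(rates):
    """Service-ratio per port: its data rate divided by the minimum rate."""
    vmin = min(rates)
    return [v // vmin for v in rates]  # integer division; assumes multiples

print(service_ratios([50, 100, 150, 200]))  # -> [1, 2, 3, 4]
```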
III. HARDWARE IMPLEMENTATION
A. Design Architecture
The efficient Round-Robin algorithm proposed in Section II has been implemented in hardware. Since our goal is an optimal time-complexity, combinational circuits are applied as much as possible to reduce the processing time of the search and grant states of the binary-tree mechanism. Shift registers are used to maintain the pointer information.
For simplicity, an 8×8 buffered crossbar with four credit levels per internal crosspoint buffer (CB) has been used as the design case, where eight requests become the inputs of the scheduler. Figures 2 and 3 depict the design entity and a block diagram of the expected design. The circuit has three inputs and one output. The inputs are: 1) an 8-bit VOQ request (Req) vector containing the input requests; 2) a 10-bit vector array representing the service-ratios (S); 3) a 2-bit vector array called credit (CB) containing the level of each internal crosspoint buffer (00=full, 11=empty). The output of the circuit is an 8-bit vector containing the grant decision (GRANT).
B. Searching Mechanism Architecture
Fig. 4 shows the details of the Searching Mechanism block diagram with eight requests (Reqs), eight credit buffer arrays (CBs) and eight grants (GRANTs). In the CMP block, the 1-bit request
Fig. 4: Simple block diagram of the binary-tree searching mechanism for eight requests.
Fig. 5: Block diagram of a Leaf-Node based on two multiplexers and an L-Circuit module.
(Reqi) and the 2-bit CBi operate by this function:
Outi = 1 when (CBi > 00) and (Reqi = 1), else 0    (1)
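Equation (1) acts as a request qualifier: a request enters the tree only while its crosspoint buffer still has credit. A minimal model (the function name is mine):

```python
def cmp_out(cb: int, req: int) -> int:
    """Out_i of equation (1): forward Req_i only if CB_i > 0 (not '00')."""
    return 1 if cb > 0 and req == 1 else 0

print(cmp_out(3, 1), cmp_out(0, 1), cmp_out(2, 0))  # -> 1 0 0
```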
With the possible actions reported in table I, the notations right, left and "-" are mapped to the values 01, 10 and 00 respectively. Therefore, the combinational logic circuit of a Leaf-Node can be optimized by logic minimization techniques. The input and output relations of a Leaf-Node are specified by the boolean equations 2 to 5. Figures 5 and 6 show the block diagram of the Leaf-Node and its combinational logic circuit.
Gout0 = Gin Sel(0)    (2)
Gout1 = Gin Sel(1)    (3)
Sel(0) = Rin1 Pin0 Rin0 + Pin1 Pin0 (Rin1 + Rin0)    (4)
Sel(1) = Rin1 Pin0 + Rin0 (Pin1 Pin0 + Pin1 Rin1)    (5)
In the same way as for the Leaf-Node, the combinational logic circuit of the Root-Node can be specified by the following boolean functions, Equ. 6 and Equ. 7, and is illustrated in Fig. 7.
Gout0 = Pin1 Pin0 (Rin1 + Rin0) + Pin1 Rin1 Pin0 Rin0    (6)
Gout1 = Rin1 (Pin1 Pin0 Rin0 + Pin1 Pin0) + Pin1 Pin0 Rin0    (7)
C. Timing Diagram
Figure 8 illustrates the timing diagram of the efficient
Round-Robin scheduling algorithm architecture depicted in
Fig. 6: Combinational logic circuit of an L-Circuit module comprising 8 ANDs, 4 ORs and 4 NOTs.
Fig. 7: Combinational logic circuit of a root node comprising 6 ANDs, 4 ORs and 4 NOTs.
figure 3 under non-uniform traffic, where the 5th input port has a higher data rate. We assume that the packets within all crosspoint buffers can be selected in any time-slot; thus, all CBs equal 11 (3 decimal). The service-ratios of the input ports are 1, but the service-ratio of the 5th input port is 1111111111 (1023 decimal). The priority of the 1st input port is set to 1, and all the others are 0. After the 8-bit VOQ request, 1101 1001, has arrived, the binary tree searches and generates the GRANT corresponding to the priorities, service-ratios, CBs and VOQ requests. According to this figure, input ports 1 and 4 are granted for the next two clock cycles; afterwards the service is occupied by input port 5 for 1024 clock cycles, and then by input port 7 in the next clock cycle.
Fig. 8: Timing diagram of the efficient Round-Robin scheduler.
IV. PERFORMANCE AND COMPARISON
In this section, we present the synthesis results of the proposed Round-Robin structure based on the two most commonly used technologies: FPGA-based and silicon-based.
A. FPGA-based Technology
We implement SA according to the figures of [10] and synthesize it using the Xilinx ISE tool, targeting the Xilinx Virtex 5
device. Table III shows the synthesis results of the proposed structure and SA in terms of slices and critical path delay (ns) for N = 4, 8, 16, 32, 64, and 128.
TABLE III: Synthesis results in terms of slices and critical path delay (ns) of the efficient Round-Robin scheduler on a 5vlx330ff device.
Design    Report  N=4   N=8   N=16  N=32  N=64  N=128
Proposed  Slices  11    25    62    130   264   533
          ns      0.72  1.42  2.10  2.80  3.60  4.52
SA [10]   Slices  124   192   476   1137  8781  16527
          ns      4.19  6.45  6.94  7.33  15.7  22.23
As shown in table III, the proposed design, conforming to the binary-tree structure, was implemented with combinational circuits; therefore, the consumed slices are significantly lower than for SA. The critical path delay of the proposed design, optimized by the Xilinx ISE tool, varies from 0.72 to 4.52 ns for N = 4, 8, 16, 32, 64, and 128, because of the combinational circuits, which the synthesizer can simply map to logic elements.
B. Silicon-based Technology
The previous works ERR [10], PRRA [9], IPRRA [9],
PPE [10], PPA [10], and SA [10] had been analyzed and synthesized based on a 0.18 um standard-cell silicon technology under the same operating conditions and area optimization. For fairness, we also analyze and synthesize our design in the same environment.
Table IV shows the critical path delays (in nanoseconds) of these designs. Table V shows the area cost in numbers of two-input NAND gates for N = 4, 8, 16, 32, 64, and 128. Although the results depend on the standard-cell library used, they present the relative performance of these designs.
TABLE IV: Critical path delay (ns) of PPE, PPA, SA, PRRA, IPRRA, and the proposed design.
Design    N=4   N=8   N=16  N=32  N=64  N=128
PPE       1.67  2.73  3.8   5.07  6.31  7.2
PPA       1.7   2.53  3.66  4.54  5.67  6.54
SA        1.36  1.51  1.79  2.26  2.72  3.35
PRRA      1.47  2.52  3.58  4.63  5.68  6.74
IPRRA     1.29  1.89  2.68  3.68  4.56  5.01
Proposed  1.33  1.40  1.93  2.10  2.95  4.0
TABLE V: Area results of PPE, PPA, SA, PRRA, IPRRA, and the proposed design (number of NAND2 gates).
Design    N=4  N=8  N=16  N=32  N=64  N=128
PPE       53   150  349   812   1826  4010
PPA       63   143  313   644   1316  2649
SA        89   292  641   1318  2372  4780
PRRA      31   72   155   320   651   1312
IPRRA     31   82   173   356   723   1455
Proposed  46   112  255   576   867   1686
As shown in table IV, the critical path delays of SA and the proposed design grow with log4 N, while the critical path delays of PPE, PPA, PRRA and IPRRA grow with log2 N, which is consistent with the analysis of these designs. SA and the proposed design operate the fastest, with the shortest levels of basic components and combinational circuits synthesized by the Synopsys tool. However, the proposed design guarantees fairness with the service function under non-uniform traffic, while SA cannot. For comparison purposes, consider a buffered crossbar of size N = 128 and assume that the cell size is 64 bytes, where the line rate is determined by 64 × 8. The line rates that schedulers using SA and the proposed design can support are 15.2 Tbps and 12.8 Tbps, respectively.
The area results of all designs grow linearly with N, as shown in Table V. The proposed design consumes significantly fewer 2-NAND gates than PPE, PPA and SA, but more 2-NAND gates than PRRA and IPRRA over the whole range of N. Compared with its critical path delay advantage, the slightly larger area of the proposed design is negligible.
V. CONCLUSION
In this paper, we propose an efficient Round-Robin scheduling algorithm based on a binary-tree scheme where QoS is improved by applying a service policy. The proposed scheduling algorithm achieves a searching time-complexity of O(1) under non-uniform traffic. By using the service policy, 100% throughput can be attained, corresponding to an improvement of the scheduling performance. The design has been simulated on both FPGA-based (Virtex 5) and silicon-based (0.18 µm) technology. The synthesis results show that the consumed resources vary from 11 to 533 slices and from 46 to 1686 2-NAND gates for crossbars of size 4×4 to 128×128. Critical path delays from 0.72 to 4.52 ns for the FPGA-based and from 1.33 to 4.0 ns for the silicon-based design were obtained.
REFERENCES
[1] Y. Zheng, C. Shao, An Efficient Round-Robin Algorithm for Combined Input-Crosspoint Queued Switches, IEEE ICAS, 2005.
[2] T. Javadi, R. Magill, and T. Hrabik, A high-throughput algorithm for buffered crossbar switch fabric, in Proc. IEEE ICC, June 2001, pp. 1581-1591.
[3] M. Nabeshima, Performance evaluation of combined input- and crosspoint-queued switch, IEICE Trans. Commun., Vol. E83-B, no. 3, Mar. 2000.
[4] X. Zhang and L. N. Bhuyan, An efficient algorithm for combined input-crosspoint-queued (CICQ) switches, IEEE Globecom, 2004, pp. 1168-1173.
[5] R. Rojas-Cessa, E. Oki, Z. Jing and H. J. Chao, CIXB-1: Combined input one-cell-crosspoint buffered switch, Proc. 2001 IEEE WHPSR, pp. 324-329.
[6] J. Z. Luo, Y. Lee, J. Wu, DRR: A fast high-throughput scheduling algorithm for combined input-crosspoint-queued (CICQ) switches, IEEE MASCOTS, 2005, pp. 329-332.
[7] H. J. Chao, C. H. Lam, X. Guo, Fast ping-pong arbitration for input-output queued packet switches, International Journal of Communication Systems, 2001, pp. 663-678.
[8] H. J. Chao, C. H. Lam, X. Guo, Fast fair arbitration design in packet switches, IEEE, 2005, pp. 472-476.
[9] S. Q. Zheng, M. Yang, Algorithm-Hardware Codesign of Fast Parallel Round-Robin Arbiters, IEEE Transactions on Parallel and Distributed Systems, 2007, pp. 84-94.
[10] P. Gupta, N. McKeown, Designing and Implementing a Fast Crossbar Scheduler, IEEE Micro, vol. 19, no. 1, 1999, pp. 20-29.
Author Index
Abid, Mohamed, 149; Aguiar, Alexandra, 113; Alkhayat, Rachid, 79
Ammar, Manel, 149; Amory, Alexandre, 164; Azevedo, Rodolfo, 99
Baghdadi, Amer, 79; Baklouti, Mouna, 149; Balasubramanian, Daniel, 121
Baldassin, Alexandro, 99; Barreteau, Anthony, 156; Becker, Juergen, 135
Belaid, Ikbel, 179; Benjemaa, Maher, 179; Beyrouthy, Taha, 59
Bhattacharyya, Shuvra, 67; Bobda, Christophe, 16; Bochem, Alexander, 9, 30
Bois, Guy, 92; Boland, Jean-François, 92; Brehm, Christian, 74
Calazans, Ney, 193; Callanan, Owen, 45; Carmel-Veilleux, Tennessee, 92
Castelfranco, Antonino, 45; Centoducatte, Paulo, 99; Champagne, David, 38
Chan, King, 38; Chen, Hui, 171; Chen, Yu-Yuan, 38
Cheung, Ray, 38; Chowdhury, Sazzadur, 2; Clancy, Charles, 67
Cox, Charles, 45; Crawford, Catherine, 45; Dekeyser, Jean-Luc, 149
Deschenes, Justin, 30; Eoin, Creedon, 45; Fesquet, Laurent, 59
Fresse, Virginie, 186; Gladigau, Jens, 128; Glesner, Manfred, 199, 85
Godet-Bar, Guillaume, 171; Gohring De Magalhaes, Felipe, 113; Großhans, Michael, 16
Gu, Zonghua, 23; Haubelt, Christian, 128; Hedde, Damien, 106
Heinz, Matthias, 53; Herpers, Rainer, 9; Hessel, Fabiano, 113
Hillenbrand, Martin, 53; Jezequel, Michel, 79; Karsai, Gabor, 121
Kent, Kenneth, 9, 30; Klein, Felipe, 99; Klindworth, Kai, 53
Koellner, Christian, 135; Kutzer, Philipp, 128; Kuykendall, John, 67
Lal, Sundeep, 2; Le Nours, Sebastien, 156; Lee, Ruby, 38
Lekuch, Scott, 45; Li, Will, 38; Losier, Yves, 30
Lowry, Michael, 121; Lubaszewski, Marcelo, 164; Marcon, Cesar, 193
Marcon, César, 164; Marquet, Philippe, 149; Mendoza, Francisco, 135
Moraes, Fernando, 164, 193; Moreira, João, 99; Moreno, Edson, 193
Muller, Fabrice, 179; Muller, Kay, 45; Murugappa, Purushotham, 79
Muscedere, Roberto, 2; Mühlbauer, Felix, 16; Müller-Glaser, Klaus D., 53, 135, 142
Nine, Harmon, 121; Nutter, Mark, 45; Pap, Gabor, 121
Pasareanu, Corina, 121; Pasquier, Olivier, 156; Penner, Hartmut, 45
Philipp, François, 85; Plishker, William, 67; Pongyupinpanich, Surapong, 199
Pressburger, Tom, 121; Purcell, Brian, 45; Purcell, Mark, 45
Pétrot, Frédéric, 106, 171; Rigo, Sandro, 99; Rousseau, Frédéric, 171, 186
Samman, Faizal, 85; Schwalb, Tobias, 142; Szefer, Jakub, 38
Tan, Junyan, 186; Teich, Jürgen, 128; Wehn, Norbert, 74
Williams, Jeremy, 30; Xenidis, Jimi, 45; Zaki, George, 67
Zhang, Ming, 23; Zhang, Wei, 38