Proceedings of the 2011 22nd IEEE International Symposium on Rapid System Prototyping
24-27 May 2011, Karlsruhe, Germany
Shortening the Path from Specification to Prototype
Sponsored by the IEEE Reliability Society and the Karlsruhe Institute of Technology
Copyright and Reprint Permission:
Abstracting is permitted with credit to the source. Libraries are permitted to photocopy, beyond the limit of U.S. copyright law, for private use of patrons, those articles in this volume that carry a code at the bottom of the first page, provided the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For other copying, reprint or republication permission, write to IEEE Copyrights Manager, IEEE Operations Center, 445 Hoes Lane, Piscataway, NJ 08854.
All rights reserved. Copyright © 2011 by IEEE.
2011 22nd IEEE International Symposium on Rapid System Prototyping
Table of Contents
Message from the General Chair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Message from the Program Chairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Message from the Organizing Chairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Conference Committees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Tutorial and Keynotes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Session 1: Automotive & FPGA
An FPGA-Based Signal Processing System for a 77 GHz MEMS Tri-Mode Automotive Radar . . . . . 2
Sazzadur Chowdhury, Roberto Muscedere and Sundeep Lal
FPGA based Real-Time Object Detection Approach with Validation of Precision and Performance . . . . . 9
Alexander Bochem, Kenneth Kent and Rainer Herpers
Rapid Prototyping of OpenCV Image Processing Applications using ASP . . . . . 16
Felix Mühlbauer, Michael Großhans and Christophe Bobda
Optimization Issues in Mapping AUTOSAR Components To Distributed Multithreaded Implementations . . . . . 23
Ming Zhang and Zonghua Gu
FPGA Design for Monitoring CANbus Traffic in a Prosthetic Limb Sensor Network . . . . . 30
Alexander Bochem, Kenneth Kent, Yves Losier, Jeremy Williams and Justin Deschenes
Session 2: Prototyping Architectures
Rapid Single-Chip Secure Processor Prototyping on OpenSPARC FPGA Platform . . . . . 38
Jakub Szefer, Wei Zhang, Yu-Yuan Chen, David Champagne, King Chan, Will Li, Ray Cheung and Ruby Lee
A Study in Rapid Prototyping: Leveraging Software and Hardware Simulation Tools in the Bringup of System-on-a-Chip Based Platforms . . . . . 45
Owen Callanan, Antonino Castelfranco, Catherine Crawford, Eoin Creedon, Scott Lekuch, Kay Muller, Mark Nutter, Hartmut Penner, Brian Purcell, Mark Purcell and Jimi Xenidis
Rapid automotive bus system synthesis based on communication requirements . . . . . 53
Matthias Heinz, Martin Hillenbrand, Kai Klindworth and Klaus D. Müller-Glaser
An event-driven FIR filter: design and implementation . . . . . 59
Taha Beyrouthy and Laurent Fesquet
Session 3: Prototyping Radio Devices
Applying Graphics Processor Acceleration in a Software Defined Radio Prototyping Environment . . . . . 67
William Plishker, George Zaki, Shuvra Bhattacharyya, Charles Clancy and John Kuykendall
Validation of Channel Decoding ASIPs: A Case Study . . . . . 74
Christian Brehm and Norbert Wehn
Area and Throughput Optimized ASIP for Multi-Standard Turbo Decoding . . . . . 79
Rachid Alkhayat, Purushotham Murugappa, Amer Baghdadi and Michel Jezequel
Design of an Autonomous Platform for Distributed Sensing-Actuating Systems . . . . . 85
François Philipp, Faizal A. Samman and Manfred Glesner
Session 4: Virtual Prototyping for MPSoC
A Novel Low-Overhead Flexible Instrumentation Framework for Virtual Platforms . . . . . 92
Tennessee Carmel-Veilleux, Jean-François Boland and Guy Bois
Using Multiple Abstraction Levels to Speedup an MPSoC Virtual Platform Simulator . . . . . 99
João Moreira, Felipe Klein, Alexandro Baldassin, Paulo Centoducatte, Rodolfo Azevedo and Sandro Rigo
A non intrusive simulation-based trace system to analyse Multiprocessor Systems-on-Chip software . . . . . 106
Damien Hedde and Frédéric Pétrot
Embedded Virtualization for the Next Generation of Cluster-based MPSoCs . . . . . 113
Alexandra Aguiar, Felipe Gohring De Magalhaes and Fabiano Hessel
Session 5: Model Based System Design
Rapid Property Specification and Checking for Model-Based Formalisms . . . . . 121
Daniel Balasubramanian, Gabor Pap, Harmon Nine, Gabor Karsai, Michael Lowry, Corina Pasareanu and Tom Pressburger
Automatic Generation of System-Level Virtual Prototypes from Streaming Application Models . . . . . 128
Philipp Kutzer, Jens Gladigau, Christian Haubelt and Jürgen Teich
An Automated Approach to SystemC/Simulink Co-Simulation . . . . . 135
Francisco Mendoza, Christian Koellner, Juergen Becker and Klaus D. Müller-Glaser
Extension of Component-Based Models for Control and Monitoring of Embedded Systems at Runtime . . . . . 142
Tobias Schwalb and Klaus D. Müller-Glaser
A model-driven based framework for rapid parallel SoC FPGA prototyping . . . . . 149
Mouna Baklouti, Manel Ammar, Philippe Marquet, Mohamed Abid and Jean-Luc Dekeyser
A State-Based Modeling Approach for Fast Performance Evaluation of Embedded System Architectures . . . . . 156
Sebastien Le Nours, Anthony Barreteau and Olivier Pasquier
Session 6: Software for Embedded Devices
Task Mapping on NoC-Based MPSoCs with Faulty Tiles: Evaluating the Energy Consumption and the Application Execution Time . . . . . 164
Alexandre Amory, César Marcon, Fernando Moraes and Marcelo Lubaszewski
Me3D: A Model-driven Methodology Expediting Embedded Device Driver Development . . . . . 171
Hui Chen, Guillaume Godet-Bar, Frédéric Rousseau and Frédéric Pétrot
Session 7: Tools and Designs for Configurable Architectures
Schedulers-Driven Approach for Dynamic Placement/Scheduling of multiple DAGs onto SoPCs . . . . . 179
Ikbel Belaid, Fabrice Muller and Maher Benjemaa
Generation of emulation platforms for NoC exploration on FPGA . . . . . 186
Junyan Tan, Virginie Fresse and Frédéric Rousseau
Arbitration and Routing Impact on NoC Design . . . . . 193
Edson Moreno, Cesar Marcon, Ney Calazans and Fernando Moraes
On-Chip Efficient Round-Robin Scheduler for High-Speed Interconnection . . . . . 199
Surapong Pongyupinpanich and Manfred Glesner
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Message from the General Chair
Welcome to Germany and welcome to the Karlsruhe Institute of Technology (KIT) for the 22nd IEEE International Symposium on Rapid System Prototyping (RSP). RSP explores trends in rapid prototyping of computer-based systems. Its scope ranges from embedded system design, formal methods for the verification of systems, engineering methods, and process and tool chains to case studies of actual software and hardware systems. It aims to bring together researchers from the hardware and software communities to share their experiences and to foster collaboration on new and innovative science and technology. The focus of the 22nd annual symposium encompasses theoretical and practical methodologies, addressing technologies of specification, completeness, dynamics of change, technology insertion, complexity, integration, and time to market.

RSP 2011 is a four-day event, starting with an industry-driven tutorial program on safety-critical systems and the corresponding standard IEC 61508, accompanied by a visit to a car manufacturing plant. We will have three days of technical program, each day opening with outstanding keynote speakers from industry and academia talking about RSP in automotive, aerospace and robotics applications. Time for technical discussions will be supplemented by social gatherings with old and new friends.

The success of the symposium is based on the effort of many volunteers. I wish to thank and express my appreciation for the hard and dedicated efforts of the Program Chairs Fabiano Hessel and Frédéric Pétrot, who set up an excellent symposium program; the Tutorial Chair Michael Huebner, for organizing the highly topical tutorial day; the Publicity Chair Jérôme Hugues; the IEEE Liaison Chair Alfred Stevens from the IEEE Reliability Society, which is sponsoring RSP 2011; the Organizing Chair Matthias Heinz; and last but not least the Finance Chair Martin Hillenbrand. Special thanks to all of you who contributed a paper and will contribute to formal presentations and informal discussions.

I hope that you will find this year's symposium interesting and rewarding, and that you will enjoy your time in Karlsruhe.
Klaus D. Mueller-Glaser
Karlsruhe Institute of Technology, Germany
Message from the Program Chairs
We welcome you to the 22nd IEEE International Symposium on Rapid System Prototyping (RSP 2011), held in Karlsruhe, Germany. RSP is the first conference to bring together people from the software and hardware communities, from academia and industry, to exchange on RSP-related research topics from scientific and technical standpoints. For more than 20 years, the symposium has been attracting an outstanding mix of practitioners and researchers, and since its early days it has spanned numerous disciplines, making it the only one of its kind.

Your participation in RSP will have a significant impact on understanding the challenges behind rapid system prototyping and how the methods and tools you develop can lead to better systems for our everyday lives. It is the intent of this symposium to foster exchanges between professionals from industry and academia, from hardware design to software engineering, and to promote a dialog on the latest innovations in rapid system prototyping.

The quality of the technical program is a key point of RSP. The strength of the technical program is due to the work of the Technical Program Committee, in soliciting colleagues for quality submissions and in the review work. We believe that this year's program covers many exciting topics, and we hope that you will enjoy it. We extend our sincere appreciation to all those who contributed to making RSP 2011 a bubbling experience: the authors, speakers, reviewers, session chairs, and volunteers. We extend a particular thank you to the Technical Program Committee for their dedication to RSP and their excellent work in reviewing the submissions. We wish you a very productive and exciting conference.
Frédéric Pétrot - TIMA, France
Fabiano Hessel - PUCRS, Brazil
Message from the Organizing Chairs
We are happy to welcome you to the 22nd IEEE International Symposium on Rapid System Prototyping in Karlsruhe. We hope you will enjoy the well-chosen program of remarkable keynotes, talks and topics, as well as the technology region Karlsruhe.

The city of Karlsruhe was founded in 1715 by Margrave Charles III William of Baden-Durlach. From the Karlsruhe Palace, which forms the center of the city, 32 streets radiate out like the ribs of a fan; this is why Karlsruhe is also called the "fan city". Directly next to the Karlsruhe Palace, on the right-hand side, lies the campus of the University of Karlsruhe. In 2009, the University of Karlsruhe and the Research Center of Karlsruhe joined forces as the Karlsruhe Institute of Technology (KIT). This year, the RSP Symposium is hosted by the KIT.

We thank our partners for their contributions and efforts in accomplishing the symposium: the local partners and institutions at KIT for their support, the General Chair and the Program Chairs for developing the scientific program of this symposium, and the members of the program committee for their conscientious reviews. We are also grateful to the Institute of Electrical and Electronics Engineers (IEEE) and the KIT for financially and technically sponsoring the symposium. We hope you enjoy the conference program and additionally take some time to explore the city of Karlsruhe.
Matthias Heinz (KIT), Martin Hillenbrand (KIT), Germany
Conference Committees

General Chair
K. Mueller-Glaser - KIT, Germany
Program Chairs: F. Pétrot - TIMA, France; F. Hessel - PUCRS, Brazil
Tutorial Chair: M. Hübner - KIT, Germany
Publicity Chair: J. Hugues - ISAE, France
IEEE Liaison Chair: A. Stevens - IEEE Reliability Society, USA
Local Organization Chair: M. Heinz - KIT, Germany
Finance Chair: M. Hillenbrand - KIT, Germany
Technical Program Committee Members
M. Aboulhamid - Université de Montréal
G. Alexiou - CTI
T. Antonakopoulos - University of Patras
P. Athanas - Virginia Tech
M. Auguston - NPS
A. Baghdadi - TELECOM Bretagne
J. Becker - Karlsruhe Institute of Technology
C. Bobda - University of Arkansas
G. Bois - Ecole Polytechnique de Montreal
D. Buchs - CUI, University of Geneva
R. Cheung - UCLA
R. Drechsler - University of Bremen, Germany
J. Drummond - RSP PC
M. Engels - FMTC
A. Fröhlich - Federal University of Santa Catarina
M. Glesner - TU Darmstadt
M. Glockner - BMW AG
D. Hamilton - Auburn University
W. Hardt - Chemnitz University of Technology
M. Heinz - Karlsruhe Institute of Technology
J. Henkel - Karlsruhe Institute of Technology
F. Hessel - PUCRS
J. Hugues - ISAE
A. Jerraya - CEA
G. Karsai - Vanderbilt University
K. Kent - University of New Brunswick
F. Kordon - Univ. P. & M. Curie
R. Kress - Infineon
I. Krueger - UCSD
R. Lauwereins - IMEC
T. Le - National University of Singapore
M. Lemoine - CERT
P. Leong - The Chinese University of Hong Kong
R. Ludewig - IBM Germany
G. Martin - Tensilica
B. Michael - Naval Postgraduate School
K. Mueller-Glaser - Karlsruhe Institute of Technology
N. Navet - INRIA / RTaW
G. Nicolescu - Ecole Polytechnique de Montreal
V. Olive - CEA-LETI
Y. Papaefstathiou - Technical University of Crete
C. Park - SAMSUNG Electronics
L. Pautet - TELECOM ParisTech
C. Pereira - Univ. Federal do Rio Grande do Sul
R. Pettit - The Aerospace Corporation
D. Pnevmatikatos - Tech. Univ. of Crete & FORTH-ICS
F. Pétrot - TIMA Lab, Grenoble-INP
J. Rice - University of Lethbridge
F. Rousseau - TIMA - UJF
M. Shing - Naval Postgraduate School
O. Sokolsky - University of Pennsylvania
T. Taha - Clemson University
E. Todt - Universidade Federal do Parana
B. Zalila - ReDCAD Laboratory, Univ. of Sfax
Additional Reviewers

Fred Doucet, Stephan Eggersglüß, Stephanie Friederich, Luiza Gheorghe, Marius Gligor, Daniel Grosse, Jan Heisswolf, Steve Hostettler, Ulrich Kuehne, Hoang Le, Suzanne Lesecq, Hua Li, Min Li, Alban Linard, Andrew Love, Erik Jan Marinissen, Massimiliano Menarini, Ivan Muller, Olivier Muller, Frederik Naessens, Joerg Noack, Francois Philipp, Faizal Arya Samman, Christopher Spies, Florian Thoma, Wenwei Zha
Tutorial and Keynotes
Tutorial Program
Tuesday, May 24th: Safety Integrity Levels in FPGA based designs
Gernot Klaes, Technical-Inspection Authority, Germany: "Functional Safety according to IEC 61508 — a short introduction"
Giulio Corradi, Xilinx: "FPGA and the IEC 61508: an inside perspective"
Romual Girardey, Endress+Hauser: "Safety Aware Place and Route for On-Chip Redundancy in FPGA"
Keynote Speeches
Wednesday, May 25th:
Prof. Dr.-Ing. Juergen Bortolazzi, Porsche AG: "RSP in Automotive"

Thursday, May 26th:
Dr. Costa Pinto, Efacec: "FPGA in Aerospace Applications"

Friday, May 27th:
Prof. Dr. Rüdiger Dillmann, FZI: "Prototyping in Robotics"
Session 1: Automotive & FPGA
An FPGA-Based Signal Processing System for a 77 GHz MEMS Tri-Mode Automotive Radar
Sundeep Lal, Roberto Muscedere, Sazzadur Chowdhury
Department of Electrical and Computer Engineering
University of Windsor Windsor, Ontario, N9B 3P4
Canada
Abstract—An FPGA-implemented signal processing algorithm to determine the range and velocity of targets using a MEMS-based tri-mode 77 GHz FMCW automotive radar is presented. In the developed system, a Xilinx Virtex-5 FPGA-based signal processing and control algorithm dynamically reconfigures a MEMS-based FMCW radar to provide short-range, mid-range, and long-range coverage using the same hardware. The MEMS radar incorporates two MEMS SP3T RF switches, two microfabricated Rotman lenses, and two reconfigurable microstrip antenna arrays embedded with MEMS SPST switches, in addition to other microelectronic components. By sequencing the FMCW signal through the three beam ports of the Rotman lens, the radar beam can be steered ±4 degrees in a combined cycle time of 62 ms for all three modes. A worst-case range accuracy of ±0.21 m and velocity accuracy of ±0.83 m/s has been achieved, which is better than the state-of-the-art Bosch LRR3 radar sensor.
I. INTRODUCTION

The global auto industry is extensively pursuing radar-based proximity detection systems for applications including adaptive cruise control, collision avoidance, and pre-crash warning, in order to avoid or mitigate collision damage. In [1], it has been identified, by analyzing actual crash records from the 2004-08 files of the National Automotive Sampling System General Estimates System (NASS GES) and the Fatality Analysis Reporting System (FARS), that a forward collision warning/mitigation system comprised of radar sensors has the greatest potential to prevent or mitigate up to 1.2 million crashes, up to 66,000 nonfatal serious and moderate injury crashes, and 879 fatal crashes per year. However, the IIHS study found that the forward collision warning crash avoidance features that could prevent or mitigate this many fatal and nonfatal injury-related crashes were available on just a handful of luxury vehicle models, due to the high cost of currently available forward collision warning technology. Thus, a low-cost radar technology for forward collision warning that can be made available to all on-road vehicles would be able to prevent or mitigate up to 1.2 million crashes per year. Market research firm Strategy Analytics predicts that over the period 2006 to 2011, the use of long-range distance warning systems in cars could increase by more than 65 percent annually, with demand reaching 3 million units in 2011, 2.3 million of them using radar sensors [2]. The Strategic Automotive Radar frequency Allocation (SARA) consortium specified that a combined SRR and LRR platform in the 77-79 GHz range will make it possible to reduce the size and improve the performance of automotive radars [3-4]. In [3] it has been identified that, in the long term, 77 GHz will become the only reasonable technology platform to serve both short and long range radars.
In [3, 5] it has been determined that frequency-modulated continuous wave (FMCW) radar with analog or digital beamforming capability and a low-cost SiGe-based radar front end is the technology of choice for forward collision warning applications. Though GaAs- or SiGe-based MMICs are being pursued vigorously to minimize the cost and size while improving the performance of automotive radars [3, 6], the auto industry is eyeing the low cost and batch-fabrication capability of MEMS technology to realize more sophisticated radar systems [2]. The European consortium SARFA has set its project goal to utilize RF MEMS as an enabling technology for performance improvement and cost reduction of automotive radar front ends operating at 76-81 GHz [7]. In [5], a MEMS-based long range radar comprised of a microfabricated Rotman lens, MEMS SP3T switches, and a microstrip antenna array has been presented [8]. The DistronicPlus™ system uses one long range and two short range radars in the front, two other short range radars in the front for park assist, and two other short range radars in the rear to provide an effective collision avoidance system. Due to the individual short and long range units, the price tag of the DistronicPlus™ system is quite high.
Almost all commercially available automotive radars use microelectronics-based ASICs. However, FPGAs are becoming increasingly popular in the development phase for rapid prototyping, as opposed to DSP-based solutions. Relative advantages of an FPGA-based system over DSP-based ones for automotive radars are discussed in [9-11], where it has been determined that FPGAs can offer superior performance in terms of footprint area, high throughput, more time- and resource-efficient implementations, high-speed parallel processing, digital data interfacing, ADC and DAC handling, and clock management in a relatively low-cost platform. For example, an FPGA can generate a precision Vtune signal to
This research has been supported by NSERC Canada, Ontario Centres of Excellence (OCE), and Auto21 Canada.
978-1-4577-0660-8/11/$26.00 ©2011 IEEE
control the VCO linearity, which is a critical issue in avoiding false targets.
Investigation shows that instead of a passive antenna system, a MEMS-based reconfigurable microstrip antenna in conjunction with MEMS SP3T switches and a microfabricated Rotman lens can be used to realize a compact tri-mode radar that provides short, mid and long range functionality in a small form-factor single unit. An FPGA-based control unit controls the operation of an array of MEMS SPST RF switches embedded in the reconfigurable antenna array to dynamically alter the antenna beamwidth, switching the radar from short to mid to long range using a predetermined time constant. This reduces the price tag significantly by multiplexing the functionality of three different range radars in the same hardware. Additionally, the passive microfabricated Rotman lens eliminates the microelectronics-based analog or digital beamforming components used in commercially available automotive radars. Consequently, the overall system becomes less complex, faster, lower cost, and more reliable. Due to its faster signal processing and digital data interfacing capability, an FPGA like the Xilinx Virtex-5 can offer very robust control of VCO linearity and a faster refresh rate for range and velocity data. In addition to the conventional signal processing tasks, control algorithms for the MEMS SP3T and SPST RF switches can also be embedded in the same FPGA.
In this context, this paper presents the development, implementation, and validation of Xilinx Virtex-5 FPGA-based control and signal processing algorithms for the developed MEMS tri-mode radar sensor. The algorithm is able to determine the target range and velocity with a very high degree of precision in a cycle time that is shorter than that of the state-of-the-art 3rd generation long range radar (LRR3) from Bosch [12].
II. MEMS TRI-MODE RADAR OPERATING PRINCIPLE

A simplified architecture of the MEMS tri-mode radar is shown in Fig. 1. The radar operating principle is as follows: (i) An FPGA-implemented control circuit generates a triangular signal (Vtune) to modulate a voltage-controlled oscillator (VCO), producing a linear frequency-modulated continuous wave (FMCW) signal centered at 77 GHz. (ii) The FMCW signal is fed to a MEMS SP3T switch. (iii) An FPGA-implemented control algorithm controls the SP3T switch to sequentially switch the FMCW signal among the three beam ports of a microfabricated Rotman lens. (iv) As the FMCW signal arrives at the array ports of the Rotman lens after traveling through the Rotman lens cavity, the time-delayed in-phase signals are fed to a reconfigurable microstrip antenna array. (v) The reconfigurable antenna has MEMS SPST switches embedded in each of the linear sections, as shown in Fig. 2. The scan area of a conventional microstrip antenna array depends on the antenna beamwidth, which in turn depends on the number of microstrip patches: the higher the number of patches, the narrower the beam. It has been determined that for a short range radar, a beamwidth of 80 degrees is necessary to scan an area up to 30 meters in front of the vehicle, as shown in Fig. 3. For mid range, a beamwidth of 20 degrees is necessary to cover an area between 30-80 meters ahead of the vehicle, and in the LRR mode, a beamwidth of 9 degrees is necessary to
Figure 1. MEMS tri-mode radar block diagram.
Figure 2. Reconfigurable microstrip antenna array.
Figure 3. SRR, MRR, and LRR coverage with beam width.
scan an area 80-200 meters ahead of the vehicle. Following Fig. 2, when both the SPST switches SW1 and SW2 are in the OFF position, 4 microstrip patches per linear array provide short-range coverage. When SW1 is turned ON and SW2 is OFF, 8 microstrip patches per linear array provide mid-range coverage. Finally, when both SW1 and SW2 are ON, 12 microstrip patches per linear array provide long-range coverage. An FPGA-implemented control module controls the operation of the switches SW1 and SW2. (vi) The sequential switching of the input signal among the beam ports of the Rotman lens enables the beam to be steered across the target area in steps of a pre-specified angle, as shown in Fig. 4. (vii) On the receiving side, a receiver antenna array receives the signal reflected off a vehicle or an obstacle and feeds it to another SP3T switch through another Rotman lens. (viii) An FPGA-based control circuit operates the receiver SP3T switch in tandem with the transmit SP3T switch so that the signal output at a specific beam port of the receiver Rotman lens can be mixed with the corresponding
Figure 4. Beam steering by the Rotman lens.
transmit signal. (ix) The output of the receiver SP3T switch is passed through a mixer to generate an IF signal in the range of 0-200 kHz. (x) An analog-to-digital converter (ADC) samples the received IF signal and converts it to a digital signal. (xi) Finally, an FPGA-implemented algorithm processes the digital signal from the ADC to determine the range and velocity of the detected target. In this way, a wider near-field area and a narrow far-field area can be progressively scanned with minimal hardware.
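The mode-dependent antenna reconfiguration described above can be summarized as a small lookup table. The following Python sketch is purely illustrative (the paper implements this control logic in HDL on the FPGA; the data-structure and function names here are our own), using the switch states, patch counts, beamwidths, and coverage intervals given in the text:

```python
# Illustrative model of the tri-mode antenna reconfiguration.
# Values (switch states, patches, beamwidths, coverage) are from the paper;
# the Python names and structure are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class RadarMode:
    sw1: bool            # state of MEMS SPST switch SW1
    sw2: bool            # state of MEMS SPST switch SW2
    patches: int         # microstrip patches per linear array
    beamwidth_deg: int   # resulting antenna beamwidth
    coverage_m: tuple    # scanned interval in front of the vehicle (m)

MODES = {
    "SRR": RadarMode(sw1=False, sw2=False, patches=4,  beamwidth_deg=80, coverage_m=(0, 30)),
    "MRR": RadarMode(sw1=True,  sw2=False, patches=8,  beamwidth_deg=20, coverage_m=(30, 80)),
    "LRR": RadarMode(sw1=True,  sw2=True,  patches=12, beamwidth_deg=9,  coverage_m=(80, 200)),
}

def switch_states(mode: str) -> tuple:
    """(SW1, SW2) control bits the FPGA would assert for a given mode."""
    m = MODES[mode]
    return (m.sw1, m.sw2)
```

More patches give a narrower beam, so progressively enabling SW1 and SW2 trades beamwidth for range, exactly the SRR-to-LRR progression of Fig. 3.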
III. RADAR SIGNAL PROCESSING AND SWITCH CONTROL
A. Choice of Development Platform – FPGA vs. DSP

Older radar systems relied on various analog components. However, Digital Signal Processors (DSPs) and Field-Programmable Gate Arrays (FPGAs) are becoming increasingly popular, as both offer attractive features. Off-the-shelf signal processing blocks are available from both DSP and FPGA manufacturers as IP (Intellectual Property) cores for rapid prototyping. Key considerations in choosing one over the other are internal architecture, speed, development time, complexity, and system requirements. DSPs follow a pipelined architecture, and even a dual-core DSP requires resource sharing. This limits the overall data throughput in DSPs, and data capture channels depend on the number of memory modules and interrupts. On the other hand, FPGAs are fully customizable, and modules do not have to share resources like I/O ports and memory. DSPs spend most of their processing time and power moving instructions and variables in and out of shared memories, which FPGAs inherently avoid. This makes FPGAs more suitable for parallel processing and optimized resource handling, especially in low-memory applications. The capabilities of FPGAs are continuously increasing due to the development of fast multipliers, accumulators, RAM units, etc. With parallel processing capabilities, FPGAs have a higher throughput than DSPs while using a slower clock. Table I shows benchmark results comparing a Xilinx FPGA with various processors for a 2048-point FFT, and Table II compares DSPs to FPGAs in terms of various development criteria that affect their development time, reliability and applicability. From Table I, it is clear that the Virtex-5 SX50T is well adapted for signal processing applications and computes the same FFT in fewer clock cycles than typical DSPs by using parallel resources. Based on Tables I and II, the Xilinx Virtex-5 SX50T FPGA has been selected for rapid system prototyping of the target MEMS tri-mode radar.
TABLE I. SPEED COMPARISON - FPGA VS. A DUAL CORE μP AND DSP

Part Name                     | Clock Freq. (MHz) | 2048-point FFT Latency (µs) | No. of Clock Cycles
Intel 32-bit Core 2 Duo       | 3000              | 37.55                       | 112650
Analog Devices ADSP-BF53x     | 600               | 32.40                       | 19440
Texas Instruments TMS320C67xx | 600               | 34.20                       | 20520
Xilinx Virtex-5 FFT Core      | 200               | 39.60                       | 7920
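The latency and cycle-count columns of Table I are related simply by latency = cycles / clock frequency, which a one-line helper makes explicit (the Virtex-5 core needs far fewer cycles, and its latency is comparable only because it runs at a much slower clock):

```python
# Cross-check of Table I: FFT latency follows directly from
# cycle count divided by clock frequency.
def fft_latency_us(clock_mhz: float, cycles: int) -> float:
    """2048-point FFT latency in microseconds (cycles / MHz = µs)."""
    return cycles / clock_mhz
```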
B. LFMCW Sweep Generation and Switch Control

To keep the hardware requirements to a minimum for the Virtex-5 FPGA, FMCW sweep durations of 1.0 ms, 3 ms, and 6 ms have been selected for the LRR, MRR, and SRR modes, respectively. A 10-bit counter in the FPGA generates a digital sweep signal, which is fed to a 10-bit DAC to generate the VTune signal for the successive modes, as shown in Fig. 5. The sweep generation timing diagram in Fig. 5 is for a clock frequency of 100 MHz on a Virtex-5 FPGA. A sweep bandwidth of 800 MHz for LRR, 1.4 GHz for MRR, and 2 GHz for SRR has been selected to provide sufficient samples. Fig. 5 also shows the required DAC output and tuning voltage for a TLC™ Precision 77 GHz GaAs VCO (MINT77TR™). The MEMS SP3T switches (SP3T-T and SP3T-R in Fig. 1) are operated by 18 V charge pumps, which are activated at the end of the down sweep of the SRR mode.
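From the bandwidths and sweep durations above, the chirp slope k = B/T used later in the range equation follows directly. The sketch below computes it per mode; the assumption that the 10-bit counter ramps through its full 1024 codes over one sweep is ours, not stated explicitly in the paper:

```python
# Sweep-slope arithmetic for the three radar modes.
# (B, T) pairs are from the text; full-scale 10-bit ramp is an assumption.
SWEEPS = {              # mode: (sweep bandwidth B in Hz, sweep duration T in s)
    "LRR": (800e6, 1.0e-3),
    "MRR": (1.4e9, 3.0e-3),
    "SRR": (2.0e9, 6.0e-3),
}

def sweep_slope(mode: str) -> float:
    """k = B / T, the chirp slope in Hz/s."""
    b, t = SWEEPS[mode]
    return b / t

def dac_step_period(mode: str, counter_bits: int = 10) -> float:
    """Time between DAC code increments for a full-scale ramp (assumption)."""
    _, t = SWEEPS[mode]
    return t / (2 ** counter_bits)
```

The LRR mode, for example, has the steepest slope (800 GHz/s), which is what gives it the finest range resolution per hertz of beat frequency.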
TABLE II. DEVELOPMENT CRITERIA COMPARISON

Criteria          | DSP                                                                      | FPGA
Sampling rates    | Low (interrupts and I/O are shared)                                      | High (parallel data capture and processing)
Data rate         | Better below 30 MB/s                                                     | Can handle faster data rates
Memory management | Unpredictable optimized arrangement - can lead to failures by mishandling pointers | Fully customizable memory arrangement and read/write ports
Data capture      | Uses interrupts with varying orders of precedence - can lead to conflicts | Independent data capture with dedicated input/output ports and memory blocks
Group development | Developers do not have a clear idea of resource availability and usage by others | Developers can work independently without resource availability concerns
Figure 5. VTune timing diagram.
C. Signal Processing Algorithm

Following the theory of FMCW radars, the range R and relative velocity V_R of a detected target can be calculated from:

    R = (f_up + f_down) c / (4k)          (1)

    V_R = (f_up - f_down) c / (4 f_0)     (2)

where f_up is the up-sweep beat frequency, f_down the down-sweep beat frequency, c the speed of the electromagnetic wave in the medium, k = B/T the ratio of sweep bandwidth to sweep duration, and f_0 the center frequency, as shown in Fig. 6. Fig. 7 presents the developed radar signal processing algorithm. In the system, the transmitted and received signals are mixed to generate a beat signal, which is then passed through a low-pass filter (LPF) to remove noise. The filtered signal is converted to digital format using an ADC, and a Hamming window is applied to limit the frequency content. Afterwards, a Fast Fourier Transform (FFT) is applied to the time-domain samples. The normalized peak intensity for all the FFT samples is computed and processed by a Cell-Averaging Constant False Alarm Rate (CA-CFAR) processor architecture as in [11]. Upon processing of both the up and down sweeps for a beam port, peak pairing is done to compute preliminary values for target range and velocity. Peak pairing is responsible for matching a detected peak in the up-sweep spectrum to a detected peak in the down-sweep spectrum as belonging to the same target.
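As a worked illustration of Eqs. (1) and (2), the sketch below computes range and velocity from a matched pair of beat frequencies. The free-space propagation speed and the example beat frequencies are our assumptions; the LRR slope k = 800 GHz/s and f_0 = 77 GHz come from the text:

```python
# Range/velocity from paired up/down-sweep beat frequencies, per Eqs. (1)-(2).
C = 3.0e8  # propagation speed in m/s (free-space assumption)

def range_velocity(f_up: float, f_down: float, k: float, f0: float = 77e9):
    """Return (range in m, relative velocity in m/s)."""
    r = C * (f_up + f_down) / (4.0 * k)
    v = C * (f_up - f_down) / (4.0 * f0)
    return r, v
```

For a stationary target, Doppler is zero and both sweeps see the same beat frequency 2kR/c; e.g. with the LRR slope k = 800e9 Hz/s, a target at 150 m gives f_up = f_down = 800 kHz, and the function returns (150.0, 0.0).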
D. Implementation Methodology
The critical design considerations to implement the developed algorithm include: (a) Interconnection between the processing modules, (b) Data interdependency between the units, (c) Control interdependency between the units, (d) Coherency in data format, and (e) Synchronization of up and down sweeps and timely switching of the SP3T switches.
Figure 6. FMCW chirp signal and beat frequency.
Figure 7. Signal processing algorithm.
TABLE III. SIGNAL PROCESSING UNITS
Processing Unit       | Details
DAC                   | 10-bit, 1.2 MHz
ADC                   | 11-bit, 2.2 MHz
Window Type/Length    | Hamming/2048
FFT Type              | Mixed Radix-2/4 DIT
FFT Length            | 2048
CFAR Type             | Cell Averaging
CFAR Parameters       | M = 8, GB = 2, Pfa = 10^-6 *
Peak Pairing Criteria | Power comparison, spectral proximity

* M = depth of cell averaging, GB = no. of guard bands, Pfa = probability of false alarm.
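A generic cell-averaging CFAR using the Table III parameters can be sketched as below. This is an illustrative Python sketch, not the exact processor architecture of [11]; the threshold scaling formula and function name are assumptions.

```python
def ca_cfar(power, M=8, GB=2, pfa=1e-6):
    """Return indices of cells exceeding a cell-averaging CFAR threshold.

    power: list of PSD values. M training cells and GB guard cells are taken
    on each side of the cell under test (parameters mirror Table III).
    """
    N = 2 * M                               # total number of training cells
    alpha = N * (pfa ** (-1.0 / N) - 1.0)   # classic CA-CFAR scaling for the desired Pfa
    detections = []
    for i in range(M + GB, len(power) - M - GB):
        lead = power[i - GB - M : i - GB]             # training cells before the guard band
        lag = power[i + GB + 1 : i + GB + M + 1]      # training cells after the guard band
        noise_avg = (sum(lead) + sum(lag)) / N        # local noise estimate
        if power[i] > alpha * noise_avg:
            detections.append(i)
    return detections
```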
The following stepwise methodology has been adopted to address the mentioned considerations and achieve a high degree of accuracy in the FPGA based signal processing scheme: (a) Development and simulation of the signal processing algorithm in the Matlab environment for a single target, (b) Verification of the algorithm implemented in Matlab for a single target, and then extension of the Matlab code to a single Rotman lens beam with 7 random targets, (c) HDL coding and testing of the individual modules of the verified algorithm, (d) Identification of hardware resource sharing, timing, and area optimization, (e) Assembly of the modules to form the overall radar signal processing system in HDL, (f) Validation using a 7-target scenario against the results obtained from the Matlab code.
IV. HARDWARE IMPLEMENTATION
A. HDL Modules

Fig. 8 shows the top-level black-box view of the FPGA-implemented signal processing system, and Fig. 9 presents the developed HDL building blocks for the FPGA implementation. The system has been developed using Verilog HDL 2005 (IEEE 1364-2005).
B. Fixed-Point Considerations

Fixed-point implications arise at four stages of the developed HDL system: (i) ADC – the quantization noise added by sampling is unavoidable. To minimize the quantization error, an 11-bit ADC has been employed, and a maximum error of 0.125% is induced by quantization. (ii) Windowing – a 2048-point Hamming window is stored inside an on-chip ROM. The window coefficients are stored in 10-bit resolution, which results in a 0.084% error in computation. The use of such digitized window functions is validated in [13]. (iii) FFT – the accuracy of the FFT, which depends on the input resolution and the phase coefficient (twiddle factor) resolution, affects the precision of the entire signal processing algorithm and the deduced target information. With a fixed input resolution of 12 bits, different phase coefficient resolutions were tested; highly accurate results were obtained with 16-bit resolution. (iv) Peak Pairing – as the implementation of the ADC and the window function in an FPGA environment involves several multiplications/divisions, a simplified approach is used to avoid computational delay and resource overhead. The actual frequency is the product of the frequency resolution of the FFT and the bin number of the target peak detected by the CFAR processor. For the developed system, the FFT frequency resolution, defined as sampling frequency / FFT size, equals 976.5625 Hz/bin. Also, the factor k in (1) can be calculated from the bandwidth B (800 MHz) and the sweep duration T (1 ms) as 8 × 10^11 for the LRR mode. Using this information, (1) can be simplified to:
R = (f_up_bin + f_down_bin) × 0.09290625    (3)

where f_up(down) = f_up(down)_bin × 976.5625 and f_up(down)_bin is the FFT bin number for the up and down peaks of a valid target detected by the CFAR. Similarly, (2) can be simplified to:

V_R = (f_up_bin − f_down_bin) × 3.398 km/h    (4)
The constants 0.09290625 in (3) and 3.398 in (4) have been approximated as 0.0927734375 (an 11-bit binary number) and 3.40625 (a 7-bit binary number). This reduces the multiplier resolution and is suitable for fast computation using embedded FPGA multipliers, e.g. in Xilinx DSP48E slices, and it creates a maximum error of 0.193% in the range and velocity computation. Similar calculations are done for the MRR and SRR modes.
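The constant quantization described above can be reproduced with a small sketch. The fractional bit widths below are chosen so that rounding yields the paper's constants; the exact fixed-point format used in the hardware is an assumption.

```python
# Sketch of the fixed-point constant quantization; names and bit splits are assumptions.
FFT_BIN_HZ = 2_000_000 / 2048   # sampling frequency / FFT size = 976.5625 Hz/bin

def quantize(value: float, frac_bits: int) -> float:
    """Round a constant to a fixed-point value with the given fractional bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale

range_const = quantize(0.09290625, 10)  # rounds to 0.0927734375 for a cheap multiplier
vel_const = quantize(3.398, 5)          # rounds to 3.40625
```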
C. HDL Optimization

Critical optimization considerations include: (i) Interfacing of modules, (ii) Synchronization of modules in the overall system, (iii) Fixed-point truncation/rounding errors, (iv) Potential overflow identification, (v) Identification of data flow or processing bottlenecks in the algorithm and use of multiple parallel units to resolve them, (vi) Potential race conditions and data consistency issues in shared resources, (vii) Memory sharing, (viii) Clocked/combinational logic synchronization, (ix) Data word length and fixed-point format (position of the binary point) considerations, and (x) Handling of signed and unsigned data.

Figure 8. Top-level module.

Figure 9. HDL building block modules.
A bottleneck was identified in the power spectral density calculation module (PSD unit), where a square root operation caused significant delay. This was resolved by using 4 PSD units in parallel, as shown in Fig. 10. Since the time-domain data RAM and the frequency-domain data RAM are implemented separately, the next frequency sweep can be sampled while the samples of the previous sweep are being processed, which also optimizes the HDL design for time. Testing of the Xilinx FFT showed that the first half of the output has more noise and higher DC components at lower frequencies compared to the latter half; accordingly, only the latter half of the FFT output is utilized. This inherently saves memory and improves timing by a factor of 2. The CFAR unit is designed to process 32 values at a time and is fully synchronized with the FDR and PSD units (Fig. 8), thus avoiding storage of the entire 1024 frequency samples.
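The half-spectrum selection and chunked magnitude computation can be sketched as follows. This is an illustrative Python stand-in for the 4 hardware PSD units, not the HDL design itself; the function name is an assumption.

```python
import math

def psd_half(fft_out, units=4):
    """Magnitude spectrum of the latter half of an FFT output.

    The half-spectrum is processed in `units` equal chunks, mimicking the
    4 parallel PSD units that each perform the costly square root.
    """
    half = fft_out[len(fft_out) // 2:]   # keep only the less noisy latter half
    chunk = len(half) // units
    mags = []
    for u in range(units):               # each chunk maps to one PSD unit
        for x in half[u * chunk:(u + 1) * chunk]:
            mags.append(math.sqrt(x.real ** 2 + x.imag ** 2))
    return mags
```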
D. Resource Usage The design was synthesized for Xilinx Spartan-3A DSP
Edition and Virtex-5 SX50T. The resource usage for the HDL implementation of the developed signal processing system is listed in Table IV.
Figure 10. Parallel PSD units.
TABLE IV. RESOURCE USAGE ON DIFFERENT FPGAS
Resource         | Spartan-3A | Virtex-5
Slice registers  | 46%        | 4%
Slice LUTs       | 96%        | 23%
DSP48 slices     | 30%        | 6%
LUT-FF pairs     | 11%        | 9%
FPGA fabric area | 47%        | 21%
E. Processing Latency

Table V shows the number of clock cycles required by each module of the HDL implementation in Fig. 9 for the LRR mode, along with the total number of clock cycles consumed for processing both the up and down frequency sweeps and producing the final target information. The implementation has a safe maximum operating frequency of 50 MHz on the Xilinx Spartan-3A and 160 MHz on the Xilinx Virtex-5. The system has been simulated at 50 MHz on the Spartan-3A and 100 MHz on the Virtex-5, as presented in Table V. One beam corresponds to both the up and down frequency sweeps on the same beam port of the Rotman lens.
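The latencies in Table V follow directly from cycle count and clock frequency. A small sketch (the function name is an assumption), using the total-processing cycle count quoted for the LRR mode:

```python
# Latency = clock cycles / clock frequency; values below use Table V's
# total-processing figure for the LRR mode.
def latency_ms(cycles: int, clock_hz: float) -> float:
    """Convert a cycle count to milliseconds at a given clock frequency."""
    return cycles / clock_hz * 1e3

spartan = latency_ms(21163, 50e6)   # Spartan-3A simulated at 50 MHz
virtex = latency_ms(21163, 100e6)   # Virtex-5 simulated at 100 MHz
```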
TABLE V. PROCESSING LATENCY ON DIFFERENT FPGAS FOR LRR

Operation              | Clock Cycles / Beam | Latency at 50 MHz, Spartan-3A (ms) | Latency at 100 MHz, Virtex-5 (ms)
Sweep Sampling         | 204756              | 2.04756                            | 2.04756
Window and feed to FFT | 2072                | 0.04144                            | 0.02072
FFT ¹                  | 3960                | 0.07920                            | 0.03960
PSD (4 parallel units) | 10743               | 0.21486                            | 0.10743
CFAR                   | 4388                | 0.08776                            | 0.04388
Total Processing       | 21163               | 0.42326                            | 0.21163
Overall                | 225928              | 2.47082                            | 2.25928

¹ This FFT delay is for real-valued input data.
V. SIMULATION AND VALIDATION

A randomly generated 3-lane highway scenario, as shown in Fig. 11, has been used to test the validity of the HDL code. The scenario includes 7 arbitrary targets covered by a composite 17° wide scanning beam formed by the 3 beam ports of the Rotman lens as shown in Fig. 1 [5]. Table VI compares the range accuracy of the Matlab and HDL implemented versions with the actual values for the LRR mode only. The simulation has been carried out over 6 iterations of time-domain Matlab-generated samples, with the algorithm running on a Xilinx Virtex-5 ML506 development board. During simulation, the following physical conditions are assumed:
1. Light-medium rain producing RF attenuation of 0.8 dB/km [14].
2. Negligible attenuation and reflection from radome with less than 0.05 mm thick water deposition [15].
3. Simulated targets can be entirely described by Swerling I, III and V (or 0) models [16] and clutter sources are spectrally stationary.
4. Best case SNR of 4.73 dB.

Table VII shows a similar comparison of the velocity accuracy for the same targets in the LRR mode. From Tables VI and VII, it appears that the HDL version of the developed algorithm
Figure 11. Highway test scenario.
can determine the range with a maximum error of 0.28 m when compared with the actual values, and the maximum error in relative velocity is only 3 km/h (0.83 m/s). In both cases, the HDL generates more accurate results compared to the Matlab-determined values.

Comparative accuracy and FPGA parameters for all three modes are listed in Table VIII, which makes clear that the SRR mode offers the highest range and velocity accuracy. Table IX compares the range and velocity accuracy of all 3 modes of the implemented HDL version with the state-of-the-art Bosch LRR3. From Table IX, it is clear that the new radar can determine the range and velocity with almost the same accuracy as the Bosch LRR3 while covering 3 ranges, and the complete cycle time of all three modes is 62 ms as compared to 80 ms for the Bosch LRR3. The marginal
TABLE VI. LRR RANGE ACCURACY COMPARISON: MATLAB-HDL

Target ID | Actual Distance from Host (m) | Matlab value (m) | HDL value (m) | Δ Matlab-Actual (m) | Δ HDL-Actual (m) | Δ Matlab-HDL (m)
1         | 9.00                          | 9.38             | 9.00          | 0.38                | 0.00             | 0.38
2         | 24.00                         | 24.34            | 24.00         | 0.34                | 0.00             | 0.34
3         | 29.00                         | 29.27            | 29.00         | 0.27                | 0.00             | 0.27
4         | 55.00                         | 55.37            | 55.00         | 0.37                | 0.00             | 0.37
5         | 78.00                         | 78.32            | 78.00         | 0.32                | 0.00             | 0.32
6         | 106.00                        | 106.28           | 106.00        | 0.28                | 0.00             | 0.28
7         | 148.00                        | 148.37           | 147.75        | 0.37                | 0.28             | 0.62
TABLE VII. LRR VELOCITY ACCURACY COMPARISON: MATLAB-HDL

Target ID | Actual Velocity relative to Host (km/h) | Matlab value (km/h) | HDL value (km/h) | Δ Matlab-Actual (km/h) | Δ HDL-Actual (km/h) | Δ Matlab-HDL (km/h)
1         | 123                                     | 123.85              | 123.5            | 0.85                   | 0.5                 | 0.35
2         | 55                                      | 52.31               | 53.5             | 2.69                   | 1.5                 | 1.19
3         | 89                                      | 89.78               | 87.5             | 0.78                   | 1.5                 | 2.28
4         | 100                                     | 100.00              | 100.0            | 0.00                   | 0.0                 | 0.00
5         | 70                                      | 69.34               | 70.5             | 0.66                   | 0.5                 | 1.16
6         | 80                                      | 79.56               | 83.0             | 0.44                   | 3.0                 | 3.44
7         | 22                                      | 21.64               | 22.0             | 0.36                   | 0.0                 | 0.36
TABLE VIII. TRI-MODE RADAR DESIGN SPECIFICATIONS

Criteria                                            | SRR         | MRR         | LRR
Range coverage                                      | 0-30 m      | 30-100 m    | 100-200 m
Relative velocity coverage                          | ±300 km/h   | ±300 km/h   | ±300 km/h
Up or down sweep duration                           | 6 ms        | 3 ms        | 1 ms
Sweep bandwidth                                     | 2000 MHz    | 1400 MHz    | 800 MHz
Required sampling rate                              | 200 kSPS    | 700 kSPS    | 2000 kSPS
FFT size                                            | 1024 points | 2048 points | 2048 points
FFT frequency resolution                            | 196 Hz/bin  | 342 Hz/bin  | 977 Hz/bin
Range accuracy with worst-case VCO linearity of 25% | ±0.28 m     | ±0.29 m     | ±0.34 m
Range accuracy with practical VCO linearity of 1%   | ±0.10 m     | ±0.14 m     | ±0.28 m
Velocity accuracy                                   | ±0.14 m/s   | ±0.42 m/s   | ±0.83 m/s
Processing delay per beam port                      | 106 µs      | 212 µs      | 212 µs
deviation from Bosch LRR3 for MRR and LRR modes is offset by the fact that the combined cycle time for the 3 modes (62 ms) is smaller than the cycle time of Bosch LRR3 (80 ms).
TABLE IX. TRI-MODE RADAR ACCURACY COMPARISON WITH BOSCH LRR3

Parameter               | Bosch LRR3   | MEMS Tri-Mode (Xilinx Virtex-5 HDL simulation)
                        |              | SRR    | MRR    | LRR
Range (m)               | 0.5-250      | 0.4-30 | 30-100 | 100-200
Velocity (km/h)         | -100 to +200 | ±300   | ±300   | ±300
Range accuracy (m)      | ±0.10        | ±0.10  | ±0.14  | ±0.28
Velocity accuracy (m/s) | ±0.12        | ±0.14  | ±0.42  | ±0.83
Processing latency      | N/A          | 106 µs | 212 µs | 212 µs
Cycle time              | 80 ms        | 62 ms for the three modes combined
VI. CONCLUSIONS

A signal processing algorithm has been developed and tested on a Xilinx Virtex-5 SX50T FPGA for a MEMS based tri-mode (short, mid, and long range) 77 GHz FMCW automotive radar to determine the range and velocity of multiple targets. The MEMS based radar uses a microfabricated Rotman lens and MEMS SP3T RF switches in conjunction with a reconfigurable microstrip antenna array with embedded MEMS SPST switches. The proposed signal processing hardware implementation for the MEMS radar makes use of independent modules, thus eliminating the latency posed by the MicroBlaze softcore used in [11]. The algorithm can determine the target range with a maximum error of ±0.28 m for LRR and ±0.10 m for SRR. The maximum velocity error has been determined as ±0.83 m/s for LRR and ±0.14 m/s for SRR. Further investigation shows that, by increasing the sweep duration to 6 ms, the maximum velocity error for LRR can be reduced to 0.14 m/s following (2). These accuracies of the FPGA-determined range and velocity results meet the specifications set by the auto industry and are on par with the Bosch/Infineon LRR3 radar. The developed system allows for a highly reliable, low cost, small form factor radar sensor that can enable even lower-end vehicles to be equipped with a collision avoidance system. All the MEMS components have been fabricated, and the assembly and packaging of a prototype device is in progress.
ACKNOWLEDGMENT

The authors gratefully acknowledge the additional support provided by the Canadian Microelectronics Corporation (CMC Microsystems) and Evigia Systems Inc., Ann Arbor, MI.
REFERENCES
[1] J. S. Jermakian, "Crash Avoidance Potential of Four Passenger Vehicle Technologies," Insurance Institute for Highway Safety, April 2010. [Online]. Available: http://www.iihs.org/research/topics/pdf/r1130.pdf
[2] H. Arnold, "Infineon: Automotive radar is aimed at mid-range cars." [Online]. Available: http://www.electronics-eetimes.com/en/infineon-automotive-radar-is-aimed-at-mid-range-cars?cmp_id=7&news_id=202803354
[3] R. Lachner, "Development Status of Next Generation Automotive Radar in EU," ITS Forum 2009, Tokyo, 2009. [Online]. Available: http://www.itsforum.gr.jp/Public/J3Schedule/P22/lachner090226.pdf
[4] G. Rollmann, "Frequency Regulations for Automotive Radar," SARA, presented at the Industrial Wireless Consortium (IWPC), Düsseldorf, Germany, 2009.
[5] R. Schneider, H. Blöcher, K. Strohm, "KOKON – Automotive High Frequency Technology at 77/79 GHz," in Proc. 4th European Radar Conference, Munich, Germany, 2007, pp. 247-250.
[6] R. Stevenson, "SiGe threatens to weaken GaAs' grip on automotive radar," Compound Semiconductor, 2009. [Online]. Available: http://www.compoundsemiconductor.net
[7] J. Oberhammer, "RF MEMS Steerable Antennas for Automotive Radar and Future Wireless Applications (SARFA)," NORDITE, the Scandinavian ICT Research Programme. [Online]. Available: http://www.sarfa.ee.kth.se
[8] A. Sinjari, S. Chowdhury, "MEMS Automotive Collision Avoidance Radar Beamformer," in Proc. IEEE ISCAS 2008, Seattle, WA, 2008, pp. 2086-2089.
[9] D. Kok, J. S. Fu, "Signal Processing for Automotive Radar," in IEEE Radar Conf. (EURAD 2005), Arlington, VA, June 2005, pp. 842-846.
[10] J. Saad, A. Baghdadi, "FPGA-based Radar Signal Processing for Automotive Driver Assistance System," in IEEE/IFIP Intl. Symp. Rapid System Prototyping, Fairfax, VA, 2009, pp. 196-199.
[11] T. R. Saed, J. K. Ali, Z. T. Yassen, "An FPGA Based Implementation of CA-CFAR Processor," Asian Journal of Information Technology, vol. 6, no. 4, pp. 511-514, 2007.
[12] Robert Bosch GmbH, "LRR3: 3rd Generation Long-Range Radar Sensor." [Online]. Available: http://www.bosch-automotivetechnology.com/media/en/pdf/fahrsicherheitssysteme_2/lrr3_datenblatt_de_2009.pdf
[13] G. Hampson, "Implementation Results of a Windowed FFT," Sys. Eng. Div., Ohio State Univ., Columbus, OH, July 12, 2002. [Online]. Available: http://esl.eng.ohio-state.edu/~rstheory/iip/window.pdf
[14] P. W. Gorham, "RF Atmosphere Absorption/Ducting," Antarctic Impulsive Transient Antenna Project (ANITA), Univ. Hawaii (Manoa), April 21, 2003.
[15] A. Arage, G. Kuehnle, R. Jakoby, "Measurement of Wet Antenna Effects on Millimeter Wave Propagation," in Proc. IEEE Radar Conf., New York City, NY, 2006, pp. 190-194.
[16] P. Swerling, "Probability of Detection for Fluctuating Targets," The RAND Corp., Santa Monica, CA, Mar. 17, 1954.
FPGA based Real-Time Object Detection Approach with Validation of Precision and Performance

Alexander Bochem, Kenneth B. Kent
Faculty of Computer Science
University of New Brunswick
Fredericton, [email protected], [email protected]

Rainer Herpers
Department of Computer Science
University of Applied Sciences Bonn-Rhein-Sieg
Sankt Augustin, [email protected]
Abstract—This paper presents the implementation and evaluation of a computer vision problem on a Field Programmable Gate Array (FPGA). This work is based upon previous work in which the feasibility of application specific image processing algorithms on an FPGA platform was evaluated by experimental approaches. This work covers the development of a BLOB detection system in Verilog on an Altera Development and Education II (DE2) board with a Cyclone II FPGA. It detects binary spatially extended objects in image material and computes their center points. Bounding Box and Center-of-Mass methods have been applied for estimating the center points of the BLOBs. The results are transmitted via a serial interface to the PC for validation against their ground truth and for further processing. The evaluation compares precision and performance gains dependent on the applied computation methods.
keywords: FPGA; BLOB Detection; Image Processing; Bounding Box; Center-of-Mass; Verilog
I. INTRODUCTION
The usability of interfaces in software is a major criterion for acceptance by the user. A major issue in computer science is the improvement of information representation in interfaces and finding alternatives for user interaction. Simplifying the way a user operates a computer helps optimize usability and increases the benefit of digitization.
A common virtual reality (VR) system uses standard input devices such as keyboards and mice, or multimodal devices such as omnidirectional treadmills and wired gloves. The issue with these "active input devices" is the lack of information about the user and his point of interest. A computer mouse, for example, only gives information about its relative position on the desk in relation to its position on the computer monitor. A wired glove represents the user in the coordinate system of the virtual world, but it gives no information about the absolute position in the real world.
The common ways to display a virtual environment are computer monitors, customized stereoscopic displays, also known as Head Mounted Displays (HMDs), and the Cave Automatic Virtual Environment (CAVE). In general, computer screens are relatively small compared to the virtual environment they display, and the frame of a computer monitor causes a significant disturbance in the user's perception of the VR. HMDs, on the other hand, allow complete immersion in the virtual reality; the downside is that HMDs are heavy and therefore not recommended for longer wear. The CAVE concept applies projection technology on multiple large screens arranged in the form of a big cubicle [1], which usually fits several people. Based on the CAVE concept, the Bonn-Rhein-Sieg University of Applied Sciences has invented a low-cost mobile platform for immersive visualisations called "Immersion Square" [2]. It consists of three back-projection walls using standard PC hardware for the image processing task.
For immersive environments, knowing the position and orientation of the user in relation to the projection surface would allow alternative ways of user interaction. Based on this information, it would be possible to manipulate the ambiance without having the user actively use an input device. At the Bonn-Rhein-Sieg University of Applied Sciences this
problem has been addressed by the Computer Vision research group in a project named 6 Degree-of-Freedom Multi-User Interaction-Device for 3D-Projection Environments (MI6). The project's intent is to estimate the position and orientation of the user to improve the interaction of the user with the immersive environment. For this purpose a light-emitting device is developed that creates a unique pattern of light dots. This device will be carried by the user and pointed towards the area on the projection screens of the immersive environment where the user is looking. By detecting the light dots on the projection screens, it is possible to estimate the position and orientation of the user in the cubicle, as shown in [3], [4]. One problem is that the detection of the BLOBs and the estimation of their center points require fast image processing, since the intent is to provide immediate feedback to the user for any change of direction and position in the immersive environment. Another problem is the required precision for the estimation of the BLOBs' center points: a mismatch of a BLOB's center point by a few pixels will cause an error in the estimated position of the user. With those problems in mind, the intent is to develop a BLOB detection system that offers both high performance and high accuracy.
An FPGA is a grid of freely programmable logic blocks which can be combined through interconnection wires. This gives the hardware designer a flexible and application specific hardware design. While FPGAs were initially used only for prototyping, the revolution of personal computers helped this field become attractive for consumer and industry products as well. The decreasing sizes of chip circuit elements allowed manufacturers to create FPGAs which are large enough to fit even complete processor designs. Those advantages have been perceived in the computer vision area [5] and used in various research projects [6], [7], [8]. The FPGA architecture allows one to create a hardware design that can process data in parallel, like a multi-core architecture or a GPU. This again is well suited for computer vision problems, as addressed in [9]. It has to be kept in mind that, although ASICs and FPGAs have the design of hardware in common, the circuit space and power consumption of an FPGA are still several times higher than those of an ASIC.
The remainder of this paper is organized as follows. Section 2 gives a brief overview of the applied methods and the system design. Section 3 shows how the implementation of the system is completed. In Section 4 the results of the system are verified and validated. Section 6 finally concludes the paper.
II. SYSTEM DESIGN
Fig. 1: Schematic design of the system architecture.
The target platform of the realized system design is a DE2 board from Altera [10]. Figure 1 shows the schematic design of the BLOB detection system. The image material can be acquired from different input devices. The analog video input is used for precision evaluation with recorded image material, while the CCD camera has been applied for performance evaluation with real-time data. Both input devices have individual pre-processing modules and provide the image material in RGB format with a resolution of 640x480 pixels. The AD-converter of the analog video input performs at 30 frames per second (fps). The CCD camera can provide up to 70 fps for the given image resolution if the full five megapixels of the camera sensor are used [11].
A. CCD Camera
The CCD camera D5M from Terasic has been used [11] for the performance evaluation of the BLOB detection approach. The camera sensor is arranged in a Bayer pattern format. The read-out
Fig. 2: Four and eight pixel neighbourhood.
speed depends on several features such as resolution, exposure time, and the activation of skipping or binning. Skipping and binning are used to combine lines and columns of the CCD sensor, which allows the design to read out a larger area of the sensor in the form of a smaller image resolution. Consecutive lines in the sensor are merged and handled as a single image line. The controller on the D5M camera can read out one pixel per clock pulse and processes each sensor line in sequential order.
B. BLOB Detection
In computer vision, a binary large object (BLOB) is considered a set of pixels which share a common attribute and are adjacent. This common attribute is in most cases defined by the color or brightness of the pixel; the representation of the pixel depends on the color model applied in the image material. For the detection of the BLOBs, the first problem to be solved is the identification of relevant pixels. In this work, a pixel is considered relevant if its brightness value exceeds a specified threshold value. This threshold value can be a static parameter or a computed average value based upon the pixel values of previous frames.
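The relevance test described above can be sketched as a small Python function; this is illustrative only, and the function name is an assumption.

```python
def relevant_pixels(frame, threshold):
    """Return (x, y) coordinates of pixels whose brightness exceeds the threshold.

    frame: 2D list of brightness values. threshold may be a static parameter
    or an average computed from previous frames, as described in the text.
    """
    return [(x, y)
            for y, row in enumerate(frame)
            for x, value in enumerate(row)
            if value > threshold]
```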
The adjacency condition is the second important step in BLOB detection. The two most common definitions for adjacency are known as the four pixel neighbourhood and the eight pixel neighbourhood. Figure 2 shows the two ways of labelling pixels to describe adjacency. In the left image, the four pixel neighbourhood is applied and four BLOBs are detected; the adjacency check considers only the horizontal and vertical axes. In the right image it can be seen that the same pixels are labelled as only two BLOBs, since the eight pixel neighbourhood takes the diagonal axes into account as well [12]. The check of the adjacency condition has to be performed on all pixels that match the criteria of the common attribute. Based on those two actions, all BLOBs in a frame can be found.
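The effect of the two neighbourhood definitions can be illustrated with a small flood-fill sketch in Python; the names are assumptions, and this is not the hardware implementation.

```python
# Offsets for the four and eight pixel neighbourhoods.
N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def count_blobs(pixels, neighbourhood):
    """Count connected groups in a set of (x, y) pixels via flood fill."""
    pixels = set(pixels)
    blobs = 0
    while pixels:
        stack = [pixels.pop()]   # start a new BLOB from any remaining pixel
        blobs += 1
        while stack:
            x, y = stack.pop()
            for dx, dy in neighbourhood:
                if (x + dx, y + dy) in pixels:
                    pixels.remove((x + dx, y + dy))
                    stack.append((x + dx, y + dy))
    return blobs
```

Two diagonally touching pixels form two BLOBs under the four pixel neighbourhood but a single BLOB under the eight pixel neighbourhood.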
The aim of the BLOB detection for our requirements is to determine the center points of the BLOBs in the current frame. With respect to the application area, this project describes BLOBs as sets of white and light gray pixels while the background pixels are black. This circumstance follows from the setup for image acquisition, where infrared cameras will be applied to track the projection surface of the Immersion Square. The expected image material will be similar to the samples in Figure 3.
Fig. 3: Example of BLOBs perfect shape (left) and withblur (right).
The characteristics of the blurring effect depend on the acceleration of the light source by the user. This effect causes some issues for the estimation of the BLOBs' center points, which have been addressed in this project.
The Bounding Box based computation estimates the center point of the BLOB by searching for the minimal and maximal XY-coordinates of the BLOB. The computation can be implemented very efficiently and does not cause large performance issues.
BLOB's X center position = (max X position + min X position) / 2
BLOB's Y center position = (max Y position + min Y position) / 2
The result for the center coordinates is strongly affected by the pixels at the BLOB's border, and this effect becomes even stronger for BLOBs in motion. With reference to the light-emitting device for the Immersion Square environment, the angle between the light beams and the projection surface changes the shape of the BLOBs and increases the range of pixels with less intensity. In addition, the movement of the device by the user will cause motion blur. These effects increase the flickering of pixels at the BLOB's border and cause flickering in the computed center point.
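The Bounding Box rule above can be sketched as a short Python function; this is an illustrative sketch, and the function name is an assumption.

```python
def bounding_box_center(pixels):
    """Center point from the min/max X and Y coordinates of a BLOB's pixels."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return (max(xs) + min(xs)) / 2, (max(ys) + min(ys)) / 2
```

Only the extreme coordinates contribute, which is why border flicker moves the computed center.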
With Center-of-Mass, the coordinates of all pixels of the detected BLOB are taken into account for the computation of the center point. The algorithm combines the coordinate values of the detected pixels as a weighted sum and calculates an averaged center coordinate [12].

BLOB's X center position = (Σ X positions of all BLOB pixels) / (number of pixels in BLOB)
BLOB's Y center position = (Σ Y positions of all BLOB pixels) / (number of pixels in BLOB)
To achieve even better results, the brightness values of the pixels have been applied as weights as well. This increases the precision of the estimated center point with respect to the BLOB's intensity.
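The plain and intensity-weighted Center-of-Mass computations can be sketched together in Python; an illustrative sketch, with names chosen here as assumptions.

```python
def center_of_mass(pixels, weights=None):
    """Averaged center coordinate of a BLOB's (x, y) pixels.

    Optional brightness values act as weights, as in the intensity-weighted
    variant described above; without weights every pixel counts equally.
    """
    if weights is None:
        weights = [1.0] * len(pixels)
    total = sum(weights)
    cx = sum(w * x for (x, _), w in zip(pixels, weights)) / total
    cy = sum(w * y for (_, y), w in zip(pixels, weights)) / total
    return cx, cy
```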
As described in [12], the sequential processing of a frame requires an additional adjacency check for the BLOBs themselves. Depending on the BLOB's shape and orientation, the detection might separate the pixels of one BLOB into two different BLOBs. The adjacency check for BLOBs is based on the same concept as explained for pixels earlier.
C. Serial Interface
To observe the system process and evaluate its results, a module for communication over the serial interface of the target board has been developed. Other hardware interfaces were available, such as USB, Ethernet, and IrDA, but the serial interface allows the smallest design with respect to protocol overhead and resource allocation on the FPGA. The serial module reads the information about the BLOB results from a FIFO buffer and transmits it to the RS232 controller. The serial interface module operates independently from the other system modules and sends results as long as the FIFO containing the BLOB results is not empty.
III. IMPLEMENTATION
The overall processing of the BLOB detection is separated into several sub-processes. For higher flexibility of the system design, the functionality has been implemented in separate modules. Those modules use FIFOs to store their results, which allows all modules to run separately with individual clock rates. The identification of pixels which might belong to a BLOB and the collection of BLOB attributes are very similar for both BLOB detection solutions. Only the computation of the center point for the detected BLOBs differs, based on the algorithms explained in the previous section.
In the first processing step, the check of the common attribute on the RGB pixel data identifies the relevant data. All pixels which do not match the criteria are dropped from further processing. The chosen identification criterion is the brightness value of the pixel: every pixel with a value above the specified threshold is saved for the adjacency check. Those pixels are saved in a dual-clock FIFO from which the adjacency check acquires its input data.
The BLOB detection module implements the adjacency check for the pixel data and sorts the pixels into data structures referred to as containers. During the detection process, the containers are continuously updated until a new frame starts. At that point the BLOB attributes are written into a FIFO and the container contents are cleared. The number of attributes stored for each BLOB depends on the selected method for computing the center point.
The last task to be processed in the BLOB detection is the computation of the center point. For controlling the processing of the BLOB attributes, the module is designed as a state machine (Figure 4). Using state machines is a common method for process control in hardware design. The results of the center point computation are written into the result FIFO, from which the serial communication module obtains its input data.
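A minimal software analogue of such a FIFO-driven state machine is sketched below. The state names and attribute record layout are assumptions for illustration, not the exact states of Figure 4.

```python
from collections import deque

def center_point_fsm(attribute_fifo, result_fifo):
    """Drain BLOB attribute records and push computed center points.

    Mimics the read -> compute -> write cycle of a hardware state machine
    sitting between two FIFOs; uses Bounding Box attributes as an example.
    """
    state = "IDLE"
    while True:
        if state == "IDLE":
            if not attribute_fifo:   # nothing left to process
                break
            state = "READ"
        elif state == "READ":
            blob = attribute_fifo.popleft()   # one attribute record per BLOB
            state = "COMPUTE"
        elif state == "COMPUTE":
            cx = (blob["min_x"] + blob["max_x"]) / 2
            cy = (blob["min_y"] + blob["max_y"]) / 2
            state = "WRITE"
        elif state == "WRITE":
            result_fifo.append((cx, cy))      # the serial module reads from here
            state = "IDLE"
```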
IV. VERIFICATION AND VALIDATION
For the verification and validation of the BLOB detection results, two different input sources have been applied. The S-Video input is used for the
Fig. 4: State Machine for reading BLOB attributes and computing center points.
verification of the detection accuracy and precision of the different center point computation methods. Secondly, the CCD camera input is used for performance benchmarking of the system design. For the performance measurements, the visualization of the captured image data from the CCD camera on the VGA display had to be disabled, because the VGA controller module did not support higher frame rates. The applied input material on the S-Video input had a resolution of 640x480 pixels. The sampling of the AD-converter was set to the same resolution to avoid the artificial offset in the digitized image material that would have shown up for differing image resolutions. The ground truth values for the expected BLOB center points had been verified by hand.
Based on the specified application area, the shapes of the BLOBs to be detected have been estimated as perfect circles and circular shapes with a blur effect, as shown in Figure 3. The BLOB detection system needs to be configured before its results are used for further processing; this includes the estimation of the best threshold value for the given application environment.
A. Precision
For the computation of the BLOBs’ center pointthe Bounding Box and the Center-of-Mass basedmethods showed the exact same results for the clearBLOBs with a perfect circular shape. This has beentested for several representative threshold values.If a BLOB is not showing any blur effect, theapplied method for computing the center point hasno influence on the precision.
The Bounding Box and Center-of-Mass computations showed different results for image material containing BLOBs with a blur effect. The Center-of-Mass results turned out to be closer to the BLOB's center point. The results for one particular example are shown in Table I.
Threshold | Ground Truth (X, Y) | Bounding Box (X, Y) | Center-of-Mass (X, Y)
0x190     | 230, 306            | 226, 308            | 229, 307
0x20B     | 230, 306            | 227, 308            | 229, 307
0x286     | 232, 305            | 228, 307            | 230, 306
0x301     | 234, 303            | 231, 305            | 233, 304
0x37C     | 237, 300            | 236, 300            | 237, 300
0x3E0     | 242, 298            | 241, 299            | 241, 298

TABLE I: Results for center point computation with blur-shaped BLOBs.
Center point computation with Center-of-Mass shows higher precision for BLOBs with a blur effect, compared to Bounding Box. The Center-of-Mass method shows an error of 0.02 %, the Bounding Box based computation an error of 0.04 %. The threshold values used for the evaluation of precision lie within the value range of the BLOBs in the applied image material. This value range depends on the applied image material and cannot simply be reused for an arbitrary image source or material. The estimation of the value range is a configuration requirement before using the BLOB detection system. For threshold values above or below the given value range, the system was not able to detect all BLOBs in the image material accurately.
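The two center-point methods compared above can be sketched as follows. This is an illustrative software model, not the paper's hardware implementation: it operates on the binarized blob pixels, and the unweighted center of mass is an assumption (an implementation could also weight by grey value).

```python
def bounding_box_center(pixels):
    """Center as the midpoint of the blob's bounding box.
    pixels: iterable of (x, y) coordinates above the threshold."""
    xs = [p[0] for p in pixels]
    ys = [p[1] for p in pixels]
    return ((min(xs) + max(xs)) / 2.0, (min(ys) + max(ys)) / 2.0)

def center_of_mass(pixels):
    """Center as the mean of all blob pixel coordinates (unweighted
    center of mass of the binarized blob)."""
    n = len(pixels)
    return (sum(p[0] for p in pixels) / n, sum(p[1] for p in pixels) / n)
```

For a symmetric blob both functions agree, which matches the observation above for clear circular BLOBs; a one-sided blur tail adds extra pixels on one side, pulling the bounding-box midpoint further off-center than the mass center.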
B. Performance
The system performance has been evaluated on a fixed environment setup. The CCD camera has been applied to perform the BLOB detection on real-time image data. The performance has been measured by counting the number of frames per second processed on the DE2 board. To verify that the BLOB detection was working correctly, the results have been observed manually on the connected host PC. The visualization of the BLOB detection results on the connected host PC allowed a reasonable validation by hand for an input speed of 45 frames per second, which was the limit of the CCD camera for the applied resolution of 640x480 pixels. The measured performance values and resource allocations are given in Tables II and III.
with Monitor Output          Bounding Box | Center-of-Mass
Speed (fps)                            12 | 12
Camera Speed (MHz)                     25 | 25
System Speed (MHz)                     40 | 50
Max. System Speed (MHz)                72 | 65
Allocated Resources on the FPGA
Logic Elements                      7,850 | 14,430
Memory Bits                       147,664 | 237,616
Registers                           2,260 | 2,871
Verified                                * | *

TABLE II: Resource allocation and benchmark results with monitor output.
without Monitor Output       Bounding Box | Center-of-Mass
Speed (fps)                            46 | 50
Camera Speed (MHz)                     96 | 96
System Speed (MHz)                    125 | 125
Max. System Speed (MHz)               140 | 189
Allocated Resources on the FPGA
Logic Elements                      5,884 | 13,311
Memory Bits                       113,364 | 239,316
Registers                           1,510 | 2,078
Verified                                * | *

TABLE III: Resource allocation and benchmark results without monitor output.
The performance result "fps" refers to the frame rate obtained during benchmarking. The "camera speed" is the clock rate used to read out the CCD sensor. "System speed" is the clock rate of the BLOB detection module during the performance test, and "max. system speed" describes the maximum possible clock rate for the implemented design. The CCD camera itself has an average speed of 45 frames per second for the applied configuration parameters. While both BLOB detection approaches would have been able to operate at faster frame rates, the estimation of the maximum performance was again restricted by the input source. The resource allocation shows that Center-of-Mass requires about twice as many logic elements and memory bits as Bounding Box.
Figure 5 shows the relation between the number of BLOBs that can be processed by the system and the maximum performance in MHz. The maximum number of BLOBs that can be detected with the applied target board is approximately seven for Center-of-Mass and sixteen for Bounding Box.
V. FUTURE WORK
It has been shown in Section IV-B that the BLOB detection could not be tested at its maximum performance. Both available input sources turned out to be a bottleneck for the acquisition of image material. It is planned to integrate the BLOB detection approach into a camera with an onboard FPGA. With direct access to the sensor of a high-performance camera, the BLOB detection can be evaluated at its maximum processing speed.
VI. CONCLUSION
With Bounding Box and Center-of-Mass, two common methods for the estimation of a BLOB's center point have been implemented. The evaluation has shown reliable precision results with respect to the given application area. As estimated in [12], the Center-of-Mass based approach shows higher precision and is the recommended solution for the given application task.
The BLOB detection approach is a working threshold-based solution specialized for BLOBs which consist of white and light grey pixels on a black image background. Both available input sources have been identified as a bottleneck for the system's processing speed, which demonstrates that the FPGA design is well qualified for the proposed computer vision problem.
For the evaluation and further processing, a module for transmitting results over the serial interface has been designed, tested and applied. The maximum performance of the serial interface is high enough even for faster frame processing rates, as described in Section II-C. The output format of the computation results can easily be changed in the system design. For instance, further information about the BLOBs, such as their direction of movement or speed, can be computed on the FPGA system and transmitted as well.
This paper has shown that a BLOB detection system can be successfully implemented and evaluated on an FPGA platform. Validation and verification showed reliable results and demonstrated the advantages of image processing tasks designed in hardware. The work required a higher time effort compared to the implementation of a similar system in a high-level language such as C++ or Java.
Acknowledgments
This work is supported in part by Matrix Vision,CMC Microsystems and the Natural Sciences andEngineering Research Council of Canada.
Fig. 5: Performance for BLOB Detection.
REFERENCES
[1] C. Cruz-Neira, D. J. Sandin, and T. A. DeFanti, "Surround-screen projection-based virtual reality: the design and implementation of the CAVE," in SIGGRAPH '93: Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques. New York, NY, USA: ACM, 1993, pp. 135–142.

[2] R. Herpers, F. Hetmann, A. Hau, and W. Heiden, "Immersion square - a mobile platform for immersive visualisations," University of Applied Sciences Bonn-Rhein-Sieg, 2005. [Online]. Available: http://www.cv-lab.inf.fh-brs.de/paper/remagen2005-I-Square1b.pdf

[3] C. Wienss, I. Nikitin, G. Goebbels, K. Troche, M. Gobel, L. Nikitina, and S. Muller, "Sceptre - an infrared laser tracking system for virtual environments," in Proceedings of the ACM Symposium on Virtual Reality Software and Technology (VRST 2006), 2006, pp. 45–50.

[4] M. E. Latoschik and E. Bomberg, "Augmenting a laser pointer with a diffraction grating for monoscopic 6DOF detection," Journal of Virtual Reality and Broadcasting, vol. 4, no. 14, Jan. 2007, ISSN 1860-2037. [Online]. Available: http://www.jvrb.org/4.2007/1275/4200714.pdf

[5] J. Hammes, A. P. W. Bohm, C. Ross, M. Chawathe, B. Draper, and W. Najjar, "High performance image processing on FPGAs," in Proceedings of the Los Alamos Computer Science Institute Symposium, Santa Fe, NM, 2000.

[6] A. Benedetti, A. Prati, and N. Scarabottolo, "Image convolution on FPGAs: the implementation of a multi-FPGA FIFO structure," in Proceedings of the 24th EUROMICRO Conference, vol. 1, Aug. 1998. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.50.1811&rep=rep1&type=pdf

[7] D. G. Bariamis, D. K. Iakovidis, and D. E. Maroulis, "An FPGA-based architecture for real time image feature extraction," in Proceedings of the ICPR International Conference on Pattern Recognition, 2004, pp. 801–804. [Online]. Available: http://rtsimage.di.uoa.gr/publications/01334338.pdf

[8] J. Trein, A. T. Schwarzbacher, and B. Hoppe, "FPGA implementation of a single pass real-time BLOB analysis using run length encoding," MPC-Workshop, Feb. 2008. [Online]. Available: http://www.electronics.dit.ie/postgrads/jtrein/mpc08-1Blob.pdf

[9] J. W. MacLean, "An evaluation of the suitability of FPGAs for embedded vision systems," in The First IEEE Workshop on Embedded Computer Vision Systems, San Diego, June 2005. [Online]. Available: http://wjamesmaclean.net/Papers/cvpr05-ecv-MacLean-eval.pdf

[10] Terasic, DE2 Development and Education Board - User Manual, 1st ed., Altera, 101 Innovation Drive, San Jose, CA 95134, 2007. [Online]. Available: http://www.terasic.com/downloads/cd-rom/de2/

[11] Terasic, TRDB-D5M Hardware Specification, Terasic, June 2008.

[12] A. Bochem, R. Herpers, and K. Kent, "Hardware acceleration of BLOB detection for image processing," in Advances in Circuits, Electronics and Micro-Electronics (CENICS), 2010 Third International Conference on, July 2010, pp. 28–33.
Rapid Prototyping of OpenCV Image Processing Applications using ASP

Felix Muhlbauer*, Michael Großhans*, Christophe Bobda†
*Chair of Computer Engineering, University of Potsdam, Germany
{muehlbauer,grosshan}@cs.uni-potsdam.de
†CSCE, University of Arkansas, USA
[email protected]
Abstract—Image processing is becoming more and more present in our everyday life. With the requirements of miniaturization, low power and performance in order to provide intelligent processing directly in the camera, embedded cameras will dominate the image processing landscape in the future. While the common approach to developing such embedded systems is to use sequentially operating processors, image processing algorithms are inherently parallel; thus hardware devices like FPGAs provide a perfect match for developing highly efficient systems. Unfortunately, hardware development is more difficult, and there are fewer experts available compared to software. Automating the design process will leverage the existing infrastructure, providing faster time to market and quick investigation of new algorithms. We exploit ASP (answer set programming) for system synthesis, with the goal of generating an optimal hardware/software partitioning, a viable communication structure and the corresponding scheduling from an image processing application.
I. INTRODUCTION
Image processing is becoming more and more present in our everyday life. Mobile devices are able to automatically take a photo when detecting a smiling face; intelligent cameras are used to monitor suspicious people and operations at airports. In production chains, smart cameras are being used for quality control. Besides those fields of application, many others are being considered and will be widened in the future. The challenge in developing such embedded image processing systems is that image processing often results in very high resource utilization, while embedded systems are usually equipped with only limited resources. The common approach is based on general purpose processor systems, which process data mainly sequentially. In contrast, image processing algorithms are inherently parallel, and thus hardware devices like FPGAs and ASICs provide the perfect match to develop highly efficient solutions. Unfortunately, compared to software development, only few hardware experts are available. Additionally, hardware development is error prone, difficult to debug and time consuming, leading to a huge time-to-market. From an economic point of view, two criteria are important for the development: time to market and the performance of the product. Automatic synthesis, with the aim of generating an optimal architecture according to the application, will help provide the required performance in reasonable time.
Our motivation is to design a development environment in which high-level software developers can leverage the speed of hardware accelerators without knowledge of low-level hardware design and integration. Our approach consists of running a ready-to-use and well-known software library for computer vision on a processor featuring an operating system. We rely on very popular open source software, namely OpenCV [1] and the operating system Linux. This approach allows the software application developers to focus on the development of high-quality algorithms, which will be implemented with the best performance. Given an application, deciding how to map a task to a processing element and defining the underlying communication infrastructure and protocol is a challenging task.
In this paper we focus on the system synthesis problem of OpenCV applications using heterogeneous FPGA-based on-chip architectures as target architecture. We use ASP (answer set programming) to prune the solution space. The goal is to find optimal solutions for task mapping and communication simultaneously, using constraints like timings and chip resources.
This paper is structured as follows: after addressing related work, we explain our model for image processing architectures and the resulting design space. A brief introduction to ASP is followed by a description of the strategy for expressing and solving the problem in an ASP-like manner. The paper concludes with results and future work.
II. RELATED WORK
Search algorithms such as evolutionary algorithms are capable of solving very complex optimization problems. A generic approach is implemented by the PISA tool [2]. Here, a search algorithm is a method which tries to find solutions for a given problem by iterating three steps: first, the evaluation of candidate solutions; second, the selection of promising candidates based on this evaluation; and third, the generation of new candidates by variation of these selected
978-1-4577-0660-8/11/$26.00 c© 2011 IEEE
candidates. Most evolutionary algorithms as well as simulated annealing fall into this category. PISA is mainly dedicated to multi-objective search, where the optimization problem is characterized by a set of conflicting goals. The module SystemCoDesigner (SCD) was developed to explore the design space of embedded systems. The solution it finds is not necessarily the best possible; in contrast, we use ASP to find an optimal solution.
In Ishebabi et al. [3], several approaches to architecture synthesis for adaptive multi-processor systems on chip were investigated. Besides heuristic methods, ILP (integer linear programming), ASP and evolutionary algorithms were used. In our work the focus is less on processor allocation and more on communication, especially the handling of streaming data. The resulting complexity is different.
III. ARCHITECTURE
The flexibility of FPGAs allows for the generation of application specific architectures without modification of the hardware infrastructure. We are especially interested in centralized systems in which a coordinating processing element is supported by other slave processing elements. Generally speaking, these processing elements could be GPPs (general purpose processors) like those available in SMP systems, co-processors like FPUs (floating point units), or dedicated hardware accelerators (AES encryption, Discrete Fourier Transformation, ...). Each processing element has different capabilities and interfaces, which define the way data exchange with the environment takes place. Communication is also a key component of a hardware/software system: while an efficient communication infrastructure will boost the performance, a poorly designed communication system will badly affect the whole system.
In the following, our architecture model and the different paths of communication are described in detail. In general, the most important components for image processing are processing elements, memories and communication channels. The specification will therefore focus in more detail on those components.
In our architecture model we distinguish two kinds of processing elements: software executing processors and dedicated hardware accelerators, which we call PUs (processing units). For each image processing function one or more implementations may exist in software or hardware, each with a different processing speed and resource utilization (BRAM, slices, memory, ...).
Considering the communication, image data usually does not fit1 into the limited memory inside an FPGA and must be stored in an external memory. Because of the sequential nature (picture after picture) of the image capture, the computation on video data is better organized and processed as a stream. The hardware representation of this idea is known as pipelining: several PUs are concatenated and build up a processing
1A video with VGA resolution, 8 bit per color and 25 frames per second amounts to (640·480)·(3·8)·25 ≈ 175 megabit of data per second.
[Figure 1 block diagram omitted: a processor, several PUs and IMs inside the FPGA, linked by SDI connections to a memory controller (to external memory), a camera input and the system bus. Legend: PU = processing unit, IM = interconnection module.]

Fig. 1. Example architecture according to our model
chain. The data is processed while flowing through the PU chain. In order to allow a seamless integration of streaming oriented computation in a software environment, we implemented an interface called SDI (streaming data interface). The interface is simple and able to control the data flow to prevent data loss due to different speeds of the modules. It consists of the signals and protocols that allow an interlocking data transport across a chain of connected components. The SDI interface allows for reusing PUs in different calculation contexts.
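The interlocking, stream-oriented processing described here can be mimicked in software with chained generators, where each stage only pulls data when the downstream stage is ready. This is only a behavioral analogy to a PU chain over SDI, not the hardware interface itself; the two example stages (thresholding and run-length coding) are hypothetical PUs.

```python
def threshold(stream, level):
    # PU 1: binarize incoming pixels, consuming one pixel per step.
    for pixel in stream:
        yield 1 if pixel >= level else 0

def run_length(stream):
    # PU 2: collapse the binary stream into (value, count) runs.
    value, count = None, 0
    for bit in stream:
        if bit == value:
            count += 1
        else:
            if value is not None:
                yield (value, count)
            value, count = bit, 1
    if value is not None:
        yield (value, count)

# Chaining the generators mirrors chaining PUs: each stage only pulls
# data when the next stage asks for it, so no element is lost even if
# the stages run at different speeds.
pixels = [7, 9, 9, 2, 1, 8]
runs = list(run_length(threshold(pixels, level=5)))
```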
A variety of processors available for FPGAs, like PowerPC, MicroBlaze, NIOS, LEON or OpenRISC, provide a dedicated interface to connect co-processors. These interfaces can be used for instruction set extensions, but also as a dedicated communication channel to other modules.
Hence, to build an architecture, a PU or a chain of PUs can be connected to a memory or a processor. Furthermore, memories are accessible directly (e.g. internal BRAM) or via a shared bus (e.g. external memory). For these interconnections, so-called IMs (interconnection modules) are introduced in our model to link an SDI interface to another interface. Figure 1 shows an example architecture.
IV. DESIGN SPACE AND SCHEDULING
Compared to a software-only solution, the best architecture, based on a hardware/software partitioning, should be found. The search is based on a given set of image processing algorithms and a given set of objective functions and optimization constraints. These general requirements usually concern the system performance, the chip's resource utilization or the power consumption.
We use the task graph model to capture a computation application. It defines the dependencies between the different processing steps, from the capture of the raw image data to the production of the result.
Besides a pool of software and hardware implementations, a database was filled with meta information about these implementations, like costs and interoperability. Assuming that for each task a software implementation exists, the costs of selecting a component are its processing time and its memory utilization. For computationally intensive tasks a hardware implementation is available; the important costs here are processing time, initial delay, throughput and chip utilization (slices, BRAM, DSP, ...). This information is gathered, e.g., by profiling of function calls or data flow analysis.
The problem to be solved is to distribute the tasks to a selected set of processing elements while being aware of timings and scheduling.
As mentioned earlier, communication is a key part, and often a trade-off between communication and processing has to be found. For example, for adjacent tasks it could be faster in total to process them all locally instead of transferring the data from one task in the middle to and from another high-speed module.
While mapping the tasks to processors, two kinds of parallelism have to be considered: first, the parallel operation of independent processors as in SMP systems, and second, PUs processing data in series while streaming, as in pipelining. Both kinds of parallel operation have a different impact on the scheduling.
Figure 2 shows a Petri net modeling the behavior of the implementation of an image processing function. One byte of data is represented by one token, which traverses the chain from pin to pout according to the implementation type T. I is the number of pixels in one image or frame. The two paths on the left describe filter-like operations which consume one image with α bytes per pixel and output one image with possibly another data rate α′. The two paths on the right describe operations with a fixed result size r, like the image brightness. While stream-based implementations (T=0;2) work on a pixel-by-pixel basis and cause an initialization delay (modeled as transitions t1 / t4) and a processing delay (t2 / t5), other implementations (T=1;3) need full access to the whole image and take a delay of t3 / t6.
PUs introduce two additional parameters which are important for calculating the scheduling: the initial delay and the throughput of data, which must be considered for chained PUs. It takes some time after a PU has read the first data before the first result is available. This delay determines the starting time of the next PU as specified in the task graph. Additionally, it is not always possible for a PU to operate at its maximum speed, because the speed is also determined by the speed of incoming data from the previous module. Thus, to calculate the actual operation speed of a PU, these two facts have to be taken into account too. The different operation speeds for different contexts make the decision for the best architecture more difficult. Furthermore, the interplay of components on the chip may incur communication bottlenecks. For example, parallel access to the main memory by different modules will incur delays in the computation.
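How initial delay and throughput propagate along a chain of PUs can be sketched as follows. This is a simplified illustrative model, not the paper's solver encoding: it assumes each PU starts once the upstream initial delays have elapsed, and that its effective speed is bounded by the slowest upstream stage (speed values as in the paper's relative model, 1 = fastest).

```python
def schedule_chain(pus):
    """pus: list of (initial_delay, slots_per_item) along the chain.
    slots_per_item is a relative speed value (1 = fastest component).
    Returns a (start_time, effective_speed) pair per PU.

    A PU can start once its predecessor emits the first result, and it
    can never run faster than the data arriving from upstream.
    """
    result = []
    start, upstream_speed = 0, 0
    for delay, speed in pus:
        effective = max(speed, upstream_speed)  # throttled by slower input
        result.append((start, effective))
        start += delay       # next PU waits for this PU's first output
        upstream_speed = effective
    return result
```

For instance, a fast third stage behind a slow second stage is forced down to the second stage's speed, which is exactly the context dependence described above.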
V. ANSWER SET PROGRAMMING
Answer Set Programming (ASP) is a declarative programming paradigm which uses facts, rules and other language elements to specify a problem. Based on this formal description, an ASP solver can determine possible sets of facts which fit the given model.
[Figure 2 Petri net omitted: tokens traverse from place pin to place pout along one of four paths (T=0: transitions t1, t2; T=1: t3; T=2: t4, t5; T=3: t6), with arc weights α, α′, αI, α′I, I and r as described in the text.]

Fig. 2. Petri net modeling the data flow within processing elements with different implementations T
A. Background
The basic concept for modeling ASP programs is rules of the form:
p0 ← p1, . . . , pm, not pm+1, . . . , not pn (1)
For a rule r the sets body+(r) = {p1, . . . , pm}, body−(r) ={pm+1, . . . , pn} and head(r) = p0 are defined.
To understand the concept of answer sets, the rule can be interpreted as follows: if an answer set A contains p1, . . . , pm, but not pm+1, . . . , pn, then p0 has to be inserted into this answer set.
Additionally, to avoid unfounded solutions: for an answer set A which contains p0, there must exist a rule r such that head(r) = p0, body+(r) ⊆ A and body−(r) ∩ A = ∅.
Of course, this is only an intuitive way to describe the wide field of answer sets. More precise definitions can be found in several works about answer set programming [4].
With regard to the answer set semantics, a solving strategy and some solving tools are needed to handle the proposed way of understanding logic programs.
The first step in computing answer sets is to build a grounded version of the logic program: all variables are eliminated by duplicating rules for each possible constant and substituting the variable. For example, the program:
p(1). p(2). q(X)← p(X). (2)
is grounded to:
p(1). p(2). q(1)← p(1). q(2)← p(2). (3)
Of course, the grounded version of the program can be much bigger than the original one. Another problem is that the grounder needs a complete domain for each variable. For this reason it is sometimes necessary to model such a domain manually; e.g., the grounder could need a time domain where all possible time values are explicitly given. After generating the grounded version, a SAT-like solver is used to compute all answer sets.
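The grounding step can be illustrated for the single-variable example above. This naive grounder is a didactic sketch only, not how the grounders behind real ASP solvers work internally; the rule and fact representation is an assumption made for the example.

```python
def ground(rules, facts):
    """Naively ground rules of the form head(X) :- body(X).

    rules: list of (head_pred, body_pred) pairs sharing one variable X.
    facts: list of (pred, constant) ground atoms, e.g. ('p', 1).
    The domain of X is taken from all constants appearing in the facts,
    mirroring the  p(1). p(2). q(X) :- p(X).  example, nothing more.
    """
    domain = sorted({c for _, c in facts})
    grounded = []
    for head, body in rules:
        for c in domain:
            grounded.append(((head, c), (body, c)))  # head(c) :- body(c)
    return grounded
```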
The most common way to model logic programs for an ASP solver is to use the generate-and-test paradigm. Thus, there are some rules which are responsible for generating a set of facts. Additionally, there exist some constraints which have to be met, such that the generated set is an answer set and consequently a solution for the given problem. For that reason, ASP solvers support some special language extensions.
Generating rules could be modeled using aggregates [5]:
l [ v0=a0, v1=a1, . . . , vn=an ] u. (4)
The brackets define a weighted sum of atoms v0, . . . , vn with weights a0, . . . , an. The rule describes the fact that a subset A ⊆ {v0, . . . , vn} of true atoms exists such that the sum of weights is within the bounds [l;u]. Omitted weights default to 1. These rules can be used to generate different sets of atoms.
Integrity constraints test whether a generated set of atoms is an answer set. These constraints describe which conditions must not be true in any answer set, and are written:
← p, q (5)
In this example, no answer set contains both p and q.
So far we have described all language features which are necessary to model our optimization problem.
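The generate-and-test idea can be illustrated by a brute-force search: generate candidate atom sets that satisfy a weight aggregate, then reject those violating integrity constraints. This sketch enumerates explicitly what an ASP solver explores far more cleverly; the data representation is hypothetical.

```python
from itertools import combinations

def solve(atoms, weights, lower, upper, forbidden_pairs):
    """Brute-force generate-and-test in the spirit of ASP solving:
    generate candidate sets via the weight aggregate  l [ v=a, ... ] u,
    then discard candidates violating integrity constraints  <- p, q.
    """
    solutions = []
    for r in range(len(atoms) + 1):
        for subset in combinations(atoms, r):
            total = sum(weights[a] for a in subset)
            if not (lower <= total <= upper):
                continue                  # aggregate bound violated
            if any(p in subset and q in subset for p, q in forbidden_pairs):
                continue                  # integrity constraint fires
            solutions.append(set(subset))
    return solutions
```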
B. Model
Our ASP model is structured in three parts: first, the problem description, including a task graph and constraints for the demanded architecture; second, a summary of meta information for all implementations, like costs, the mapping of tasks to HW or SW components and the mapping interconnect; third, the solver itself with all rules needed to find a solution. This separation (into files) also offers a highly flexible and reusable model.
The basic idea of the model is to select an implementationfor each task, connect the associated modules and finallyconsider timings and dependencies to build the scheduling.Details are described in the following sections.
C. Allocation of processors
First, each task needs to be mapped to exactly one component. For the introduced scenario this can be the main processor or a PU.
The number of permutations in the model to map components is M!, where M is the maximum number of components allowed to be instantiated. To reduce symmetries, a component is defined to have two indices: cij, j ∈ [1; Ji]. With Ji defining the maximum possible number of instantiations of a component i, the number of permutations is reduced to J1! · . . . · Jn!. The values Ji are derived from the task graph. Normally it is not necessary to instantiate a certain component more than once, thus Ji is often equal to 1.
For each instantiated component cij and each task tn, an atom M_{tn,cij} is defined, where i specifies the implementation type of a component and j an instance counter. Thus, for each task ti the sum of mapped components must equal 1:

1 [ M_{ti,c11}, . . . , M_{ti,ckl} ] 1. (6)
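The choice expressed by rule (6), each task mapped to exactly one component, spans the following search space, enumerated here explicitly for illustration; the ASP solver explores this space symbolically rather than by brute force.

```python
from itertools import product

def enumerate_mappings(tasks, components):
    """Generate every assignment in which each task is mapped to exactly
    one component, i.e. the candidates admitted by rule (6)."""
    for choice in product(components, repeat=len(tasks)):
        yield dict(zip(tasks, choice))
```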
D. Data flow
After all processing units are instantiated, they need to be connected. Connections are derived from edges in the task graph and can be simple point-to-point connections, but may also involve more components. For example, in the case that data should be transferred from a processing unit to a memory, the connection requires an IM in between to link the different interfaces. For each transfer n, an atom C_{n,cij,ckl} is defined, which indicates that data for that transfer is sent from component cij to ckl. This atom only exists if the transfer actually happens:

0 [ C_{n,cij,ckl} ] 1. (7)
Again, after generating the atoms, they have to be limited to useful ones. Modeling these constraints is similar to path-finding algorithms. Assuming a transfer n describes the dependency between two tasks tx (source) and ty (sink), the following constraints need to be met:
← M_{tx,cij}, [ C_{n,cij,ckl} ] 0. (8)

← M_{ty,cij}, [ C_{n,ckl,cij} ] 0. (9)
If a source task tx is mapped to a component cij, there must exist a component ckl to which cij sends data in transfer n (see rule 8). Similarly, if a sink task ty is mapped to a component cij, there must exist a component ckl which sends data to cij in transfer n (see rule 9). Otherwise the solution is invalid.
Additionally, the model must ensure that there exists a path for each transfer between the source and sink components, and it has to avoid senseless connections, e.g. in the case of incompatible interfaces between two components.
E. Time
To evaluate the performance of a hardware architecture, it is necessary to schedule all tasks in a temporal order that will ensure the minimal run-time of the algorithm. Modeling a temporal behavior can be done with the help of a time domain, which defines a discrete and finite set of possible time slots. Each task is assigned to a time slot to indicate its starting time. Additionally, the task graph is extended by two special tasks to mark the start and end of the computation. While the start task is assigned to time slot 0, the time slot of the end task indicates the total runtime and is used as the value to be optimized by the solver.
Choosing a practical duration for a time slot is difficult. Selecting shorter time intervals results in a very accurate scheduling; however, a small time slot leads to an explosion of the number of possibilities that the solver has to deal with. As a trade-off we choose a normalized time interval for a time slot, related to the fastest component: one time slot is the amount of time the fastest component takes to process a certain amount of data. The occupation of two time slots indicates that a component operates at half the speed of the fastest one.
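Under this normalization, a component's speed value is simply its processing time measured in units of the fastest component's time. The sketch below rounds up to whole slots; the rounding choice is an assumption, since the paper does not state how fractional ratios are handled.

```python
from math import ceil

def speed_value(processing_time, fastest_time):
    """Normalized speed: number of time slots a component occupies per
    unit of data, relative to the fastest component (one slot)."""
    return ceil(processing_time / fastest_time)
```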
In our ASP model, each task ti is assigned to exactly one time slot k, indicated by an atom T_{ti,k}:

1 [ T_{ti,1}, . . . , T_{ti,m} ] 1. (10)
where m is the total number of time slots available in the time domain, given as a constant in our model. To meet the dependencies given by the task graph, a task ty may not start before its predecessor tx:

← T_{tx,kx}, T_{ty,ky}, ky < kx. (11)
F. Synchronization
In Section IV we explained why the maximum operation speed of a component may not be exhausted and why the actual speed depends on the processing context. Therefore, an atom S_{ti,k} is defined:

1 [ S_{ti,1}, . . . , S_{ti,p} ] 1. (12)
where k is the speed of the task ti. Similar to the time model, a relative criterion was chosen for modeling the speed, for the same reasons. A value of 1 implies that the fastest component needs one time slot to process a certain amount of data. The constant p is the maximum speed value, and thus the speed of the slowest component.
The possible speed values of a component depend on the speed of its predecessor. If two tasks tx and ty are dependent and mapped onto adjacent components, the assigned speed values have to be equal:

← S_{tx,kx}, S_{ty,ky}, kx ≠ ky. (13)
To find a scheduling, some more helper values are needed. If the starting time and the speed of a task are known, then its end time can be determined. Similar to the definition of the starting time T in Section V-E, the end time of a task ti is described by an atom E_{ti,k}:

E_{ti,ke} ← S_{ti,d}, T_{ti,ks}, ke = ks + d. (14)
To introduce a local scheduling on each component, there may not exist any tasks with intersecting computation times. Thus, if two tasks tx and ty are mapped to the same component and ty starts after tx, then tx must have finished before ty starts:

← T_{tx,ks}, E_{tx,ke}, T_{ty,ky}, ks ≤ ky, ky < ke. (15)
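Rules (11) and (15) can be checked on a concrete candidate schedule with a short sketch. The dictionaries and the symmetric interval-overlap test are illustrative only; the ASP encoding states these conditions declaratively rather than checking them imperatively.

```python
def valid_schedule(start, end, mapping, deps):
    """Check the scheduling constraints on one candidate solution.

    start/end: dicts task -> start/end time slot (atoms T and E),
    mapping:   dict task -> component (atoms M),
    deps:      list of (predecessor, successor) edges of the task graph.
    """
    # Rule (11): a task may not start before its predecessor.
    for pred, succ in deps:
        if start[succ] < start[pred]:
            return False
    # Rule (15): tasks on the same component must not overlap in time.
    tasks = list(start)
    for i, tx in enumerate(tasks):
        for ty in tasks[i + 1:]:
            if mapping[tx] != mapping[ty]:
                continue
            if start[tx] < end[ty] and start[ty] < end[tx]:
                return False
    return True
```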
G. Resource utilization
With the rules introduced so far, it is possible to build up valid architectures. In the following, further rules are presented which, first, have a global influence on the quality of the generated architectures and, second, ensure compliance with the general conditions. In detail, this concerns memory bandwidth, chip area utilization and total runtime.
One major issue is the utilization of the system bus and memory bus, respectively, because most data is stored in the main memory and the memory interface easily becomes a bottleneck. For each point in time it must be assured that the bus is not overloaded and that the speed of the attached components is throttled if necessary.
In our model the speed of the system bus is given as a constant sb. For each time slot the traffic of all active bus transfers is summed up and compared to the system bus capacity. Furthermore, the traffic caused by a component is inversely proportional to its speed; e.g., if a component operates four times slower than the bus, its bus utilization is one quarter. This is expressed by the following inequality:
1/stc1 + . . . + 1/stcn ≥ 1/sb (16)
For the time slot t the components c1, . . . , cn load the bus according to their individual speeds stck. For ease of reading, cij is shortened to ck.
In ASP, fractional numbers should be avoided and integer numbers used instead. Therefore the bus capacity is modeled as discrete work slots which can be allocated by active components. The constant p (introduced in rule 12) is derived from the slowest component, and hence the minimal bus load is 1/p if the bus speed sb equals 1. This also results in p as the number of needed slots, respectively p/sb for sb ≠ 1.
To understand this, consider that it is possible to normalize all speed values, including the maximum value p, by sb, and obtain a new maximum value p′ = p/sb and a new bus speed s′b = 1.
For a component ck sending data and operating at speed stck, the normalization coefficient is stck/sb. Thus ck uses sb/stck of the bus capacity and consequently allocates
sb/stck · p′ = sb/stck · p/sb = p/stck (17)
work slots. With this normalization, inequality (16) becomes an integrity constraint using only integer numbers:

← ⌈p/stc1⌉ + . . . + ⌈p/stcn⌉ ≥ ⌊p′⌋. (18)
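The work-slot accounting of (17)-(18) can be sketched procedurally; a minimal illustration, assuming integer speeds and sb dividing p (helper name ours):

```python
import math

# Sketch of the integer work-slot normalization from (17)-(18):
# the bus offers floor(p') = p // sb slots per time slot; a component
# at speed s allocates ceil(p / s) of them. Mirroring the integrity
# constraint (18), a time slot is rejected when the allocations
# reach the capacity.

def bus_overloaded(speeds, p, sb=1):
    """speeds: speeds of components active on the bus in one time slot;
    p: speed value of the slowest component; sb: system bus speed."""
    capacity = p // sb                                  # floor(p')
    used = sum(math.ceil(p / s) for s in speeds)        # allocated slots
    return used >= capacity                             # body of (18)
```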
Another issue concerning the general constraints of a solution is the chip area. As described before, the resource utilization r of each component is given as part of the meta-information. For each instantiated component ck the value rk is represented by the atom Rckrk. With Ru defining the overall resource constraint, the integrity constraint
← Rc1r1 , . . . , Rcnrn , Ru, r = r1 + . . . + rn, u ≤ r. (19)
rejects architectures that consume too many resources. This rule is replicated to handle different resources like slices or BRAMs.
Finally, to obtain the optimized model, the total runtime should be minimized. As indicator for the runtime, the end task te was
task         throughput        resources
             PPC    PU       slices   BRAM
gauss         16     1          2       2
sobel         16     1          2       2
gradient       8     2          1       0
trace         16     -          -       -
system bus:       10
IM (bus):          3

TABLE I
BRIEF META-INFORMATION FOR DIFFERENT IMPLEMENTATIONS
defined earlier. In the ASP model an aggregate is used to find the time slot of te:
minimize [ Tte1 = 1, . . . , Ttem = m ]. (20)
Each atom Ttek is weighted by its time slot number k. Becauseonly one atom is true, the sum results in the time slot numberof the end task and hence the total runtime.
VI. RESULTS
At the University of Potsdam, Germany, a collection of tools called POTASSCO [6] was developed to support the computation of answer sets. Some of these tools are trend-setting and award-winning in the wide field of logic programming [7].
We use the tools gringo and clasp to solve our problem. These applications are capable of handling optimization statements, which themselves are very similar to aggregates. A sum of specific weighted literals is built, and the solver tries to optimize this sum during the solving process.
As an example application for this paper we used the Canny edge detector, a common preprocessing stage for object recognition. The processing steps are: camera → Gauss filter (noise reduction) → Sobel filter (find edges) → calculate gradient of edges → trace edges to find contours. While a hardware implementation is very fast for the first steps, the tracing of edges has no consecutive memory access and thus only a software implementation is assumed for it. Table I summarizes the resource utilization and the assumed throughput for each implementation.
On our test system2 the ASP solver needs about 3 seconds to find a solution. Figure 3 shows two different architectures generated while decreasing the constraint for the available chip area. The mapping of software tasks is illustrated with parallelograms and dashed arrows. In the bottom right corner of each drawing the consumption of chip area and the estimated runtime is given. Finally, the software-only solution (not shown) takes 65 time slots, compared to 21 for the most hardware-intensive architecture.
Our current ASP model is a first approach and thus not optimized. Nevertheless, we want to give an idea of the performance of the solving process. For each number of tasks from 5 up to 10, 1000 problems were generated randomly, in particular the task graph and the meta-information for the different implementations. Figure 4 shows an
2Desktop with Intel Core2 Duo Processor 3.16 GHz, 3.2 GB RAM
[Fig. 3 drawings omitted: two FPGA architectures built from processor, memory controller, IM, and PU modules (gauss, sobel, gradient, trace) on the system bus; variant 1: chip area 18, 21 time slots; variant 2: chip area 17, 37 time slots.]
Fig. 3. Resulting architectures for different chip area constraints. The puresoftware solution takes 65 time slots.
Fig. 4. ASP solver runtime measurements according to the number of tasks
exponential growth of the solving time relative to the number of tasks in the problem, which is no worse than expected.
VII. CONCLUSION AND FUTURE WORK
We have shown that answer set programming is a viable approach to solve complex problems like architecture generation for data-stream-based hardware/software co-design systems. The advantage over evolutionary algorithms or heuristic methods is the guarantee of finding an optimal solution.
Our development platform is an intelligent camera system based on a Virtex-4 FX FPGA with an embedded PowerPC hardcore processor. Because image data is normally huge and stored in the external DDR memory, the first IM which was developed connects the PLB (system bus) to an SDI component and vice versa. This module operates similar to a DMA controller, but instead of just copying, the data is streamed out of and into the module between the load and store operations in order to pass through PUs.
Our next step is to improve the ASP model so that it can be solved faster and is more accurate, especially concerning the resolution of timings, and to examine more complex task graphs.
An extension of the POTASSCO tools, which is currently under heavy development and capable of handling real numbers, will be used for this purpose.
Our ASP model is already capable of generating architectures which include a partial reconfiguration of modules. The scheduling is also valid, except for the case that a module is used, reconfigured and directly used again. Here the delay of the reconfiguration must be considered in the scheduling because it stalls the processing. It is possible to include this case in our model with little modification. In the future we are going to combine the work of this paper with our work in the domain of partial reconfiguration.
REFERENCES
[1] Intel Inc., “Open Computer Vision Library,” http://www.intel.com/research/mrl/research/opencv/, 2007.
[2] ETH Zurich, “PISA - A Platform and Programming Language Inde-pendent Interface for Search Algorithms,” http://www.tik.ee.ethz.ch/pisa/,2010.
[3] H. Ishebabi and C. Bobda, “Automated architecture synthesis for parallelprograms on fpga multiprocessor systems,” Microprocess. Microsyst.,vol. 33, no. 1, pp. 63–71, 2009.
[4] C. Anger, K. Konczak, T. Linke, and T. Schaub, "A glimpse of answer set programming," Künstliche Intelligenz, no. 1/05, pp. 12–17, 2005.
[5] M. Gebser, R. Kaminski, B. Kaufmann, M. Ostrowski, S. Thiele, and T. Schaub, A User's Guide to gringo, clasp, clingo, and iclingo, Nov. 2008.
[6] University of Potsdam, “Potassco - Tools for Answer Set Programming,”http://potassco.sourceforge.net/, 2010.
[7] M. Denecker, J. Vennekens, S. Bond, M. Gebser, and M. Truszczynski,“The second answer set programming competition,” in Proceedings ofthe Tenth International Conference on Logic Programming and Nonmono-tonic Reasoning (LPNMR’09), ser. Lecture Notes in Artificial Intelligence,E. Erdem, F. Lin, and T. Schaub, Eds., vol. 5753. Springer-Verlag, 2009,pp. 637–654.
Optimization Issues in Mapping AUTOSAR Components To Distributed Multithreaded Implementations
Ming Zhang, Zonghua Gu College of Computer Science, Zhejiang University
Hangzhou, China 310027 {editing, zgu}@zju.edu.cn
Abstract—AUTOSAR is a component-based modeling language and development framework for automotive embedded systems. Component-to-ECU mapping is conventionally done manually and empirically. As the number of components and ECUs in vehicle systems grows rapidly, it becomes infeasible to find optimal solutions by hand. We address some design issues involved in mapping an AUTOSAR model to a distributed hardware platform with multiple ECUs connected by a bus, each ECU running a real-time operating system. We present algorithms for extracting connectivity between ports of atomic software components from an AUTOSAR model and for calculating blocking times of all tasks of a taskset scheduled by PCP. We then address optimization issues in mapping AUTOSAR components (SWCs) to distributed multithreaded implementations. We formulate and solve two optimization problems: map SWCs to ECUs with the objective of minimizing the bus load; and, for a given SWC-to-ECU mapping, map runnable entities on each ECU to OS tasks and assign a data consistency mechanism to each shared data item to minimize the memory size requirement on each ECU while guaranteeing schedulability of tasksets on all ECUs.
Keywords-software component; ECU; schedulability; data consistency
I. INTRODUCTION

Today's automotive electrical and electronic systems are
becoming more and more complex. In order to ease the development of automotive electronic systems, leading automobile companies and first-tier suppliers formed a partnership in 2003, and established AUTOSAR (AUTomotive Open System Architecture), a standard for automotive software development. According to AUTOSAR, application software components (SWCs) are platform-independent and need to be mapped to ECUs [1]. This mapping is an important step of system configuration. For a system consisting of multiple ECUs, with a number of application SWCs to be mapped to them, there could be many different mapping schemes, including valid and invalid schemes with respect to constraints (e.g., timing constraints of tasks). As the number of ECUs and application SWCs increases, it is inefficient and error-prone to perform the mapping manually in a trial-and-error manner.
In this work, we propose an approach that works closely with the AUTOSAR model to automate the mapping process, which guarantees schedulability of tasks and consistency of data shared among application tasks, while minimizing the data rate over the bus as well as the memory overhead to protect data consistency.
W. Peng et al. [2] addressed the deployment optimization problem for AUTOSAR system configuration. In their work, an algorithm was presented to find an SWC-to-ECU mapping scheme that guarantees task schedulability while minimizing inter-ECU communication bandwidth. However, their work did not consider shared data among tasks and their protection mechanisms. Ferrari et al. [3] discussed several strategies for protecting shared data items and raised the issue of optimization for time and memory tradeoffs, but did not propose any concrete algorithms. In this paper, we attempt to formulate and solve the optimization problems involved in mapping an AUTOSAR model to distributed multithreaded implementations.
This paper is organized as follows: Section II introduces basic concepts of AUTOSAR. Section III describes our approach in detail, while Section IV presents two algorithms for extracting connectivity of ports and calculating blocking times of tasks. In Section V, two simple experiments on an application example demonstrate the correctness and effectiveness of our approach. Finally, this work is concluded in Section VI.
II. BASIC CONCEPTS OF AUTOSAR

According to AUTOSAR, application software is
conceptually located above the AUTOSAR RTE and consists of platform-independent software components (SWCs). An SWC may have multiple ports. A port is either a P-Port or an R-Port. A P-Port provides output data while an R-Port requires input data. Each port is associated with a port interface. Two types of port interfaces, client-server and sender-receiver, are supported by AUTOSAR. A client-server interface defines a set of operations that can be invoked by a client and implemented by a server. A sender-receiver interface defines a set of data elements sent/received over the VFB. Runnable entities (runnables for short) are the smallest executable elements. One component consists of at least one runnable. All runnables are activated by RTEEvents. If no RTEEvent is specified as StartOnEvent for a runnable, then the runnable is never activated by the RTE. Two categories of runnables are defined. Category 1 runnables do not have WaitPoints and have to terminate in finite time. Category 2 runnables always have at least one WaitPoint, e.g., they invoke a server and wait for the response. At runtime, runnables within software components are grouped into tasks scheduled by the OS scheduler. The RTE generator is responsible for constructing the OS task bodies. Before the RTE generator takes the ECU configuration description as the input information to generate the code, the
978-1-4577-0660-8/11/$26.00 2011 IEEE
RTE configurator configures parts of the ECU-Configuration, e.g. mapping of runnables to tasks.
III. PROBLEM FORMULATION
A. Outline

Our approach consists of two phases. The first phase tries
to find one or more optimal SWC-to-ECU mapping schemes, with respect to the data rate over the inter-ECU bus, while trying to guarantee schedulability of the task set on every ECU. This phase takes as input a set of inter-connected atomic application SWCs [1], and a set of ECUs, and outputs a mapping scheme from the atomic application SWCs to the ECUs.
In the second phase, a per-ECU optimization is performed. For each ECU, our approach tries to select a method for each data item to guarantee its data consistency as well as schedulability of the task set on the ECU, while minimizing the memory overhead used to protect data consistency on the ECU. This phase takes as input a mapping scheme produced by the first phase, and outputs the selected method to guarantee data consistency, for each data item.
Both phases involve runnable-to-task mapping. We assume that the worst-case execution time of every runnable and the worst-case execution time of its longest critical section are given.
Before the two phases of optimization, the top-level component of a system [1] is decomposed into inter-connected atomic SWCs. During this process, the connectivity between the ports of the atomic SWCs is maintained, as described in detail in Section IV.A.
Given a set of atomic application SWCs C and a set of ECUs E, a mapping scheme is a function M : C → E, where

M(c) = e, c ∈ C, e ∈ E. (1)

M(c) gives the ECU to which SWC c is mapped. For the mapping to meet timing constraints, every task τ_i in the taskset on each ECU must finish before its deadline D_i, that is:

WCRT_i ≤ D_i. (2)
In this work, we assume fixed-priority scheduling with the Priority Ceiling Protocol (PCP) [5], under which the worst-case response time WCRT_i of task τ_i is given by the standard recurrence

WCRT_i = C_i + B_i + Σ_{τ_j ∈ hp(τ_i)} ⌈WCRT_i / T_j⌉ · C_j, (3)

B_i = max { cs_{j,s} | τ_j ∈ lp(τ_i), ceil(s) ≥ prio(τ_i) }, (4)

where C_i is the worst-case execution time of τ_i, T_j the period of a higher-priority task τ_j, hp(τ_i) and lp(τ_i) the sets of tasks with higher and lower priority than τ_i, cs_{j,s} the longest critical section in which τ_j holds semaphore s, and ceil(s) the priority ceiling of s.
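The response-time recurrence (3) can be solved by fixed-point iteration. A minimal sketch, assuming implicit deadlines (D = T) and tasks already sorted by rate-monotonic priority; the data layout is ours, not the paper's:

```python
import math

# Sketch: worst-case response time by fixed-point iteration of (3).
# tasks: list of (period, wcet, blocking), sorted from highest priority
# (index 0, shortest period) to lowest.

def wcrt(tasks, i):
    T_i, C_i, B_i = tasks[i]
    r = C_i + B_i
    while True:
        # own execution + blocking + interference from higher-priority tasks
        r_next = C_i + B_i + sum(
            math.ceil(r / T_j) * C_j for T_j, C_j, _ in tasks[:i])
        if r_next == r:
            return r          # fixed point reached
        if r_next > T_i:
            return None       # deadline (= period) missed
        r = r_next
```

With the taskset of TABLE III in Section V (periods 30, 80, 620 ms; WCETs 5, 25, 100 ms; no blocking) this reproduces the listed WCRT values 5, 30 and 210 ms.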
During the first phase, we do not consider data sharing among tasks; hence, for this phase,

B_i = 0. (5)
In addition, the total data rate DR over the bus needs to be minimized. We define DR as:

DR = Σ_{c ∈ C} Σ_{ρ ∈ R(c)} DR_ρ. (6)

DR_ρ = Σ_{d ∈ D_ρ} size(d) / T_d. (7)

Where d is a data item transmitted between ECUs; R(c) is the set of runnables of SWC c; D_ρ is the set of data items transmitted by runnable ρ over the bus; size(d) is the size of data item d; and T_d is the period of transmission of d.
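Numerically, this amounts to summing size over period for every data item that crosses an ECU boundary. A sketch (helper name ours) that reproduces the 36 B / 80 ms = 450 B/s figure used in the application example of Section V:

```python
# Sketch: total bus data rate per (6)-(7), summing size/period over
# all data items transmitted between ECUs.

def bus_data_rate(items):
    """items: list of (size_bytes, period_seconds) for each data item
    that crosses the ECU boundary."""
    return sum(size / period for size, period in items)

rate = bus_data_rate([(36, 0.080)])   # one 36-byte item every 80 ms
```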
The first phase is formulated as the following optimization problem:
Find M(c) for all c ∈ C,

Minimize DR,

Subject to (2) with B_i = 0 for every task τ_i.
After the application SWCs are mapped to the ECUs, consistency of data shared among tasks on each ECU needs to be guaranteed. We consider two methods mentioned by [3]: semaphore lock and rate transition blocks. Semaphore lock incurs negligible memory overhead while introducing significant delays; Rate transition blocks incur negligible time overhead but require additional memory space to store multiple copies of data.
For each data item d shared among tasks on a given ECU e, one of the two methods is selected. Hence, a function P is to be found, where

P(d) ∈ {SL, RTB}. (8)

P(d) gives the method to protect the consistency of data item d (SL: semaphore lock or RTB: rate transition blocks), where d is a data item shared by tasks on the given ECU. As in the first phase, schedulability of the tasks on e must be guaranteed. In this phase, data sharing is taken into account: as required by PCP [5], B_i is the longest critical section of lower-priority tasks that can block τ_i, which is given by (4). In addition, the memory overhead on the ECU introduced by rate transition blocks is to be minimized:
MEM = Σ_d (n(d) − 1) · size(d). (9)

Where n(d) is the number of copies of data item d. Since the original copy exists even if no mechanism is applied to guarantee data consistency, the original copy is not counted as overhead.
To determine n(d), we need to consider three cases:

• If d has only one writer and no reader, i.e., d is written by one task and read by no task, no extra copy is needed.

• If d has only readers and no writer, i.e., d is read by some tasks but written by no task, the original copy suffices.

• If d has more than one writer, or both a writer and a reader, i.e., d is written by more than one task, or written by some tasks and read by other tasks, then a copy is needed for each of the writers and another copy is required for all the readers, including the original copy. In other words, the number of extra copies is equal to the number of writers.
Combining all of the three cases above, we define n(d) as:

n(d) = n_w(d) + sgn(n_r(d)). (10)

Where n_w(d) and n_r(d) are the number of writer tasks and reader tasks of d, respectively, and sgn(n_r(d)) evaluates to 1 when d has a reader and 0 when d has no reader.
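The case analysis above can be sketched directly; `extra_copies` (a hypothetical helper name) returns the overhead count n(d) − 1 used in (9):

```python
# Sketch: extra copies a rate transition block needs for a shared
# data item, following the three cases above: one copy per writer,
# plus one shared readers' copy, minus the original copy.

def extra_copies(n_writers, n_readers):
    if n_writers == 0:
        return 0          # read-only (or unused): the original copy suffices
    has_reader = 1 if n_readers > 0 else 0
    return n_writers + has_reader - 1
```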
The second phase of our approach is formulated as the following optimization problem:
For a given ECU e,

Find P(d) for all shared data items d on e,

Minimize MEM,

Subject to (2) with B_i given by (4) for every task τ_i on e.
The algorithm for calculating B_i as given by (4) is described in detail in Section IV.B.
In the outline described above, there are a few pending problems:
• Mapping runnables to tasks;
• Identification of data items transmitted between ECUs;
• Identification of data items shared by more than one task on the same ECU.
We will address these problems next.
B. Mapping Runnables to Tasks

In this work, we take into account only periodic runnables.
We map runnables to tasks in a simple way that is common practice in the industry: all runnables on the same ECU with the same period are mapped to the same task τ. Further, we define the period of a task as the period of the runnables mapped to it. Hence,

T_τ = T_ρ for every runnable ρ mapped to τ. (11)

The worst-case execution time of a task is defined as the sum of the worst-case execution times of the runnables mapped to it. Hence,

C_τ = Σ_{ρ mapped to τ} C_ρ. (12)
Therefore, the period of a task is unique among all the tasks on the same ECU. By rate-monotonic priority assignment [4], the priority of each task is also unique.
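A sketch of this grouping (the helper name is ours; the runnable names and values below are taken from the application example tables later in the paper):

```python
from collections import defaultdict

# Sketch: group periodic runnables on one ECU into tasks by period
# (rule (11)) and sum their WCETs (rule (12)). Sorting by ascending
# period then yields the unique rate-monotonic priority order.

def build_tasks(runnables):
    """runnables: list of (name, period_ms, wcet_ms) on one ECU."""
    by_period = defaultdict(int)
    for _, period, wcet in runnables:
        by_period[period] += wcet        # rule (12): sum WCETs per period
    # ascending period = descending rate-monotonic priority
    return sorted(by_period.items())

tasks = build_tasks([("RE00", 30, 5), ("RE10", 80, 25), ("RE40", 620, 100)])
# each entry: (task period, summed WCET)
```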
C. Identification of Data Items Transmitted between ECUs

According to AUTOSAR [1], an atomic application SWC must be mapped to an ECU as a whole. From the viewpoint of the sender SWC, this implies that it can send data to a remote ECU only via its ports. The structure of a data item is defined by a data element in a port interface. Therefore, we identify a data item transmitted over the bus by a port-data element pair (p, de). From the data element, the size of the data item, size(p, de), can be obtained.

From the viewpoint of a runnable, a data item it transmits is referenced by its data send point or data write access. Hence we can obtain the period of transmission of a data item (p, de) from the runnable that has a data send point or data write access referencing it:

T_(p, de) = T_ρ(p, de). (13)

Where ρ(p, de) denotes the runnable that transmits the data item identified by (p, de).
In order to identify data items transmitted over the bus instead of via memory on the local ECU, it is necessary to find out whether there is a receiver SWC on a remote ECU. We tackle this problem in two steps. First, we define a function conn, where

conn(p) = { r | r is an R-Port connected with P-Port p }. (14)

conn(p) gives the set of all R-Ports that are connected with the given P-Port p. Then, for each R-Port r in conn(p), the ECU to which the owner SWC of r is mapped is compared with that of p. If there exists an r in conn(p) whose owner SWC is mapped to a different ECU from that of the owner SWC of p, or formally,

M(swc(r)) ≠ M(swc(p)), (15)

Where

• swc(p) is the owner SWC of p;

• M(swc(p)) is the ECU to which swc(p) is mapped;

then every data item transmitted via p must be transmitted over the bus.
With (12), (13), (14), and (15), we can rewrite (7) as:

DR_ρ = Σ_{(p, de) ∈ D_ρ} x(p) · size(p, de) / T_(p, de). (16)

Where

x(p) = 1 if some r ∈ conn(p) satisfies (15), and x(p) = 0 otherwise. (17)
D. Identification of Data Items Shared by Tasks on the Same ECU

In an AUTOSAR model, data shared by runnables of application SWCs can be classified into two categories: data shared by runnables on the same ECU, and data shared by runnables on different ECUs. For the first category, race conditions may occur since the data is shared via memory. For the second category, data consistency is not an issue, since the communication is via message passing.
From the viewpoint of application SWCs, data shared by runnables on the same ECU come in two forms:
• Data shared by runnables of the same atomic SWC, or inter-runnable variables;
• Data shared by runnables of different atomic SWCs, transmitted and received by different SWCs via their ports.
According to the AUTOSAR specification [1], an inter-runnable variable irv is referenced by the runnables that read or write it. By counting the tasks that reference irv as a writer or a reader, we obtain n_w(irv) and n_r(irv), respectively. For shared data items in the form of an inter-runnable variable, (9) can be rewritten as:

MEM_irv = Σ_{c : M(c) = e} Σ_{irv ∈ c} (n(irv) − 1) · size(irv). (18)
For the latter, it is easy to prove the following lemma.

Lemma 1. If data is transmitted by an atomic SWC via a P-Port p and received by another atomic SWC via an R-Port p′, the data passes exactly one assembly connector.

The concept of an assembly connector is defined in [1]. Note that it is possible for different data items transmitted via a P-Port to pass different assembly connectors, since AUTOSAR allows a port to be connected to more than one port. The same holds for a data item received via an R-Port.
We identify a data item shared by runnables of different SWCs on the same ECU with a pair (a, de), where

• a is the assembly connector the data item passes;

• de is the data element in the port interface associated with the port via which the data item is transmitted, i.e., the P-Port connected by the assembly connector a.

To make sure that the data item is shared by runnables on the same ECU, the P-Port via which the data item is transmitted and the R-Port via which it is received must belong to SWCs mapped to the same ECU, or formally:

y(a) = 1 if M(swc(p(a))) = M(swc(r(a))) for the P-Port p(a) and an R-Port r(a) connected by a, and y(a) = 0 otherwise. (19)
By counting the tasks (possibly with runnables of different SWCs mapped to them) on the given ECU that reference the data item identified by (a, de) as a writer or a reader, we obtain n_w(a, de) and n_r(a, de), respectively. To determine whether a data item (a, de) is read or written by a runnable ρ, it needs to be determined whether the data item transmitted by ρ passes a. We define a function ac, where

ac(p) = { a | a is an assembly connector connected to P-Port p }. (20)

ac(p) gives the set of assembly connectors that data transmitted via p could pass.
For data items shared by runnables of different SWCs on the same ECU, (9) can be rewritten as:

MEM_conn = Σ_{(a, de)} y(a) · (n(a, de) − 1) · size(a, de). (21)
Note that (19) and (20) are used by the counting process that determines n_w(a, de) and n_r(a, de), which in turn are used to find n(a, de).
Considering both (18) and (21), we can rewrite (9) as:

MEM = MEM_irv + MEM_conn. (22)
E. Genetic Algorithm

We used the NSGA-II [6] variant of the Genetic Algorithm to solve the optimization problems described in the first sub-section of this section.

For the optimization of the SWC-to-ECU mapping, we encode each individual as a vector v, with the i-th element v_i representing the ECU to which SWC c_i is mapped. To recombine two individuals v and w, the cross-over operator randomly picks a subset S ⊆ {1, . . . , |C|} of the positions each time, and exchanges the values v_i and w_i for every i ∈ S. The mutation operator also picks a subset of the positions, and maps each corresponding SWC to a random ECU by assigning a random ECU to the corresponding element in the target individual.
To optimize the selection of the method to guarantee data consistency of shared data for a given ECU, the individual encoding, cross-over operator and mutation operator are similar to those for the optimization of the SWC-to-ECU mapping scheme. Each individual is encoded as a vector u, with the j-th element u_j representing the method (SL: semaphore lock or RTB: rate transition block) selected to guarantee the consistency of data item d_j. To recombine two individuals, the cross-over operator randomly picks a subset of the set of all shared data items each time and exchanges the corresponding values. The mutation operator also picks a subset of the data items, and assigns a random method from the available data consistency methods to each of them in the target individual.
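The two operators can be sketched as follows; this is a minimal illustration of subset cross-over and random-reset mutation on the vector encoding, not the NSGA-II implementation the authors used:

```python
import random

# Sketch: variation operators for the SWC-to-ECU mapping encoding.
# An individual is a vector whose i-th gene is the ECU index of SWC i.

def crossover(a, b, rng):
    # exchange the genes at a random subset of positions
    child_a, child_b = list(a), list(b)
    for i in rng.sample(range(len(a)), rng.randint(1, len(a))):
        child_a[i], child_b[i] = b[i], a[i]
    return child_a, child_b

def mutate(ind, num_ecus, rng):
    # re-map a random subset of SWCs to random ECUs
    out = list(ind)
    for i in rng.sample(range(len(ind)), rng.randint(1, len(ind))):
        out[i] = rng.randrange(num_ecus)
    return out

rng = random.Random(0)
a, b = [0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0]
ca, cb = crossover(a, b, rng)
```

The same operators apply unchanged to the data-consistency encoding, with ECU indices replaced by method identifiers (SL/RTB).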
IV. ALGORITHMS
A. Deriving Port Clusters and Port Connectivity from AUTOSAR Models

In the last section, we defined the function conn, which finds the set of all R-Ports that are connected with a given P-Port, and the function ac, which finds the set of assembly connectors that data transmitted via a given P-Port could pass. We define a port cluster to be a pair (PSet, RSet), where
26
• PSet is a set of P-Ports of atomic SWCs;

• RSet is a set of R-Ports of atomic SWCs;

• data transmitted via any P-Port in PSet can be received by every R-Port in RSet.

By Lemma 1, it is obvious that a port cluster corresponds to exactly one assembly connector, which data transmitted by a P-Port in the PSet passes before being received via one or more R-Ports in the RSet.
In this sub-section, we propose an algorithm that finds, for every port whose owner is an atomic SWC, the port clusters the port belongs to.
Our algorithm starts with the top-level composition [1] of a system and performs a breadth-first traversal through the hierarchical structure of the components. In this process, for every composition,
• a port cluster is created for each assembly connector, with the P-Port the connector connects added to the PSet of the cluster and the R-Port to the RSet;
• every outer port of the composition, which has been added to one or more port clusters, is conceptually replaced with inner ports that are connected to it by delegation connectors [1].
This process continues until all compositions are processed, at which point every port in every port cluster belongs to an atomic SWC. The pseudo-code of the algorithm is as follows:
Algorithm 4.1 (find port clusters)
Input: top-level component CP0

add CP0 to Queue Q
while Q is not empty
    remove CP1 from Q
    if CP1 is a composition
        clear DelegationMap
        for each connector Cn0 in CP1
            if Cn0 is an assembly connector
                create a port cluster PC(Cn0) = PC(PSet, RSet)
                add Cn0.PPort to PSet; add Cn0.RPort to RSet
                add PC(Cn0) to PortToClusterMap
            else /* Cn0 is a delegation connector */
                add (Cn0.outerPort, Cn0.innerPort) to DelegationMap
            end if
        end for each
        update PortToClusterMap with DelegationMap
        add all ComponentPrototypes in CP1 to Q
    end if
end while
for each (p, PCSet(p)) in PortToClusterMap
    /* p is a port; PCSet(p) is the set of all port clusters containing p */
    for each PC(PSet, RSet) in PCSet(p)
        if p is a P-Port
            add p to PSet
        else /* p is an R-Port */
            add p to RSet
        end if
    end for each
end for each
return PortToClusterMap
In the pseudo-code above, we use DelegationMap to track inner ports that are directly connected with each outer port. The pseudo-code of the “update PortToClusterMap with DelegationMap” step is as follows:
for each (outerPort, innerPortSet) in DelegationMap
    /* innerPortSet is the set of all inner ports connected directly with outerPort */
    find PCSet(outerPort) in PortToClusterMap
    if PCSet(outerPort) exists in PortToClusterMap
        /* outerPort is connected from the outside of its owner ComponentPrototype */
        for each innerPort in innerPortSet
            find PCSet(innerPort) in PortToClusterMap
            /* PCSet(innerPort) is the set of all port clusters that contain innerPort */
            add (innerPort, PCSet(innerPort) ∪ PCSet(outerPort)) to PortToClusterMap
        end for each
        remove (outerPort, PCSet(outerPort)) from PortToClusterMap
    end if
end for each
Note that an inner port may belong to multiple port clusters. When updating a (innerPort, PCSet(innerPort)), care must be taken not to lose the port clusters associated with innerPort previously.
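A condensed sketch of the traversal, using flat data structures instead of the AUTOSAR meta-model (all names hypothetical); each assembly connector yields one cluster (Lemma 1), and outer ports are resolved through delegation connectors down to atomic-SWC ports:

```python
from collections import deque

# Sketch: resolve assembly connectors through delegation connectors so
# that every port cluster contains only atomic-SWC ports.

def port_clusters(assemblies, delegations, atomic_ports):
    """assemblies: list of (p_port, r_port) pairs
    delegations: {outer_port: [inner_port, ...]}
    atomic_ports: set of ports that belong to atomic SWCs"""
    def resolve(port):
        # replace an outer port by the atomic ports it delegates to
        result, queue = [], deque([port])
        while queue:
            p = queue.popleft()
            if p in atomic_ports:
                result.append(p)
            else:
                queue.extend(delegations.get(p, []))
        return result
    # one cluster (PSet, RSet) per assembly connector
    return [(resolve(p), resolve(r)) for p, r in assemblies]
```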
B. Calculating Blocking Times of Tasks

In this sub-section, we propose an algorithm to calculate B_i in (4) for all tasks, which have been sorted by priority (from the highest to the lowest). In the context of this paper, we do not distinguish between a semaphore and a shared data item.
Our algorithm performs two scans through the sorted list of tasks. First, a scan is performed from the highest priority to the lowest, determining the ceiling ceil(s) of each semaphore s protecting a shared data item. Then, a second scan is performed from the lowest to the highest priority. During this scan, our algorithm maintains a map from each semaphore s to the longest critical section that uses s. For every task τ_i, this scan performs the following steps one by one:
• remove all semaphores that cannot contribute to the blocking time of τ_i, hence all semaphores s with ceil(s) < prio(τ_i);

• find the longest critical section currently in the map, and save its length as B_i;

• add all semaphores τ_i uses that are encountered for the first time during this scan, hence all s with lp(s) = prio(τ_i), to the map, along with the critical section cs_{i,s} of τ_i; we use lp(s) to denote the priority of the lowest-priority task that uses s;

• for each remaining semaphore s used by τ_i, update the critical section currently associated with s in the map, if it is shorter than cs_{i,s}.
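The two scans can be sketched as follows; the data layout (priority encoded as list index, 0 = highest) is ours, not the paper's:

```python
# Sketch: per-task PCP blocking times B_i via two scans.
# tasks: list ordered from highest (index 0) to lowest priority; each
# entry maps semaphore -> length of the longest critical section using it.

def blocking_times(tasks):
    n = len(tasks)
    # scan 1 (high -> low): ceiling = highest priority that uses the semaphore
    ceiling = {}
    for prio, cs in enumerate(tasks):
        for sem in cs:
            ceiling.setdefault(sem, prio)   # first (highest) user wins
    # scan 2 (low -> high): track the longest critical section per semaphore
    # among strictly lower-priority tasks; a task can only be blocked by
    # semaphores whose ceiling is at least its own priority
    B = [0] * n
    longest = {}
    for prio in range(n - 1, -1, -1):
        eligible = [v for s, v in longest.items() if ceiling[s] <= prio]
        B[prio] = max(eligible, default=0)
        for sem, length in tasks[prio].items():
            longest[sem] = max(longest.get(sem, 0), length)
    return B
```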
V. APPLICATION EXAMPLE

In this section, we describe two experiments on an application example, which consists of 6 atomic SWCs to be mapped to 2 ECUs. The hierarchy of SWCs, along with the connectors, is shown in Figure 1. For simplicity, each atomic SWC contains one and only one runnable, each port interface contains only one data element, and the size of the data element is 36 bytes. There is no inter-runnable variable. The worst-case execution times of the runnables are shown in TABLE I. The maximum lengths of critical sections of runnables, along with the accessed shared data items, are shown in TABLE II, where each shared data item accessed by a runnable is represented by the port and the data element.
Figure 1. Hierarchy of SWCs and Connectors of the Application Example
TABLE I. WORST-CASE EXECUTION TIMES OF RUNNABLES IN MS (RE DENOTES RUNNABLE ENTITY)
SC0/CP2,RE40 = 100
SC0/CP1/CP12,RE30 = 30
SC0/CP0/CP01,RE10 = 25
SC0/CP1/CP11,RE20 = 10
SC0/CP0/CP00,RE00 = 5
SC0/CP1/CP10,RE10 = 25
TABLE II. CRITICAL SECTIONS OF RUNNABLES IN MS (DE DENOTES DATA ELEMENT SHARED BETWEEN DIFFERENT RUNNABLES)
SC0/CP2,RE40,R0,DE00 = 10
SC0/CP1/CP12,RE30,R1,DE00 = 2
SC0/CP1/CP12,RE30,R0,DE00 = 8
SC0/CP1/CP12,RE30,P0,DE00 = 3
SC0/CP0/CP01,RE10,R0,DE00 = 5
SC0/CP0/CP01,RE10,P0,DE00 = 6
SC0/CP1/CP11,RE20,R0,DE00 = 1
SC0/CP1/CP11,RE20,P0,DE00 = 2
SC0/CP0/CP00,RE00,P1,DE00 = 1
SC0/CP0/CP00,RE00,P0,DE00 = 1
SC0/CP1/CP10,RE10,R0,DE00 = 5
SC0/CP1/CP10,RE10,P0,DE00 = 6
A. Experiment 1

In this experiment, we map the atomic SWCs to ECUs. Three mapping schemes are found, all with the same data rate over the bus of 450 B/s. The first mapping scheme maps CP00, CP01 and CP2 to EcuInstance0, and CP10, CP11 and CP12 to EcuInstance1. From Figure 1, we can see that all data on the bus come from P0 of CP01. This port is written by the runnable RE10 of CP01 with a period of 80 ms. Hence the data rate over the bus is 36 B / 80 ms = 450 B/s. (The numerical values are for illustration purposes only.) The second mapping scheme maps CP00 and CP01 to EcuInstance0, and CP2, CP10, CP11 and CP12 to EcuInstance1. The data rate over the bus is the same as for the first scheme. The third mapping scheme is actually the same as the first, except that the ECUs are interchanged, thus resulting in the same data rate over the bus, since we assume a homogeneous hardware platform consisting of identical ECUs. The tasksets on EcuInstance0 and EcuInstance1 using the first scheme are shown in TABLE III and TABLE IV, respectively. The tasksets using the second scheme are shown in TABLE V and TABLE VI, respectively. We can see that all tasks meet their deadlines.
TABLE III. TASKSET ON ECUINSTANCE0 UNDER 1ST MAPPING SCHEME
Task ID T (ms) = D C (ms) B (ms) WCRT (ms)
Task 0 30 5 0 5
Task 1 80 25 0 30
Task 2 620 100 0 210
TABLE IV. TASKSET ON ECUINSTANCE1 UNDER 1ST MAPPING SCHEME
Task ID T (ms) = D C (ms) B (ms) WCRT (ms)
Task 0 40 10 0 10
Task 1 80 25 0 35
Task 2 132 30 0 75
TABLE V. TASKSET ON ECUINSTANCE0 UNDER 2ND MAPPING SCHEME
Task ID T (ms) = D C (ms) B (ms) WCRT (ms)
Task 0 30 5 0 5
Task 1 80 25 0 30
TABLE VI. TASKSET ON ECUINSTANCE1 UNDER 2ND MAPPING SCHEME
Task ID T (ms) = D C (ms) B (ms) WCRT (ms)
Task 0 40 10 0 10
Task 1 80 25 0 35
Task 2 132 30 0 75
Task 3 620 100 0 600
Next, we assign a data consistency method to each shared data item based on the 1st mapping scheme. All the local shared data items are protected with semaphore locks; hence the memory overhead is minimal (0). The tasksets on EcuInstance0 and EcuInstance1 after data consistency method assignment are shown in TABLE VII. and TABLE VIII. respectively. Again, we can see that all tasks meet their deadlines.
TABLE VII. TASKSET ON ECUINSTANCE0 AFTER DATA CONSISTENCY METHOD ASSIGNMENT
Task ID T (ms) = D C (ms) B (ms) WCRT (ms)
Task 0 30 5 5 10
Task 1 80 25 0 30
Task 2 620 100 0 210
TABLE VIII. TASKSET ON ECUINSTANCE1 AFTER DATA CONSISTENCY METHOD ASSIGNMENT
Task ID T (ms) = D C (ms) B (ms) WCRT (ms)
Task 0 40 10 5 15
Task 1 80 25 8 53
Task 2 132 30 0 75
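The WCRT values in the tables above follow standard fixed-point response-time analysis for fixed-priority (rate-monotonic) scheduling with blocking, R = C + B + Σ⌈R/Tj⌉·Cj over all higher-priority tasks. A minimal sketch (the function name and structure are ours, not from the paper; termination assumes the iteration converges):

```python
from math import ceil

def wcrt(C, B, higher):
    """Iterate R = C + B + sum(ceil(R/Tj)*Cj) to a fixed point.

    C: worst-case execution time of the task under analysis
    B: its blocking time
    higher: list of (Tj, Cj) pairs for all higher-priority tasks
    """
    R = C + B
    while True:
        R_new = C + B + sum(ceil(R / Tj) * Cj for Tj, Cj in higher)
        if R_new == R:
            return R
        R = R_new

# Taskset of TABLE VIII (EcuInstance1 after data consistency assignment);
# rate-monotonic priorities: smaller period = higher priority.
tasks = [(40, 10, 5), (80, 25, 8), (132, 30, 0)]  # (T, C, B)
for i, (T, C, B) in enumerate(tasks):
    hp = [(Tj, Cj) for (Tj, Cj, _) in tasks[:i]]
    R = wcrt(C, B, hp)
    print(f"Task {i}: WCRT = {R} ms, deadline {T} ms, "
          f"{'OK' if R <= T else 'MISS'}")
```

Running the same recursion over the other tasksets reproduces the remaining WCRT columns as well.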
B. Experiment 2

In this experiment, we keep the SWC-to-ECU mapping of the 1st mapping scheme (TABLE III. and TABLE IV. ) and increase the length of the critical section in which the runnable RE30 of SC0/CP1/CP12 accesses the shared data item identified by (R0, DE00) (third line in TABLE II. ) from 8 ms to 60 ms. If all shared data items are protected with semaphore locks, the taskset is not schedulable due to excessive blocking time, as shown in TABLE IX. We then run our algorithm for data consistency method assignment on EcuInstance1. This time, our approach assigns a rate transition block to the shared data item identified by (A0, DE00), as shown in TABLE X. Now the taskset is schedulable, as shown in TABLE XI.
TABLE IX. NON-SCHEDULABLE TASKSET ON ECUINSTANCE1
Task ID T (ms) = D C (ms) B (ms) WCRT (ms)
Task 0 40 10 5 15
Task 1 80 25 60 115
Task 2 132 30 0 75
TABLE X. DATA CONSISTENCY METHOD ASSIGNMENT ON ECUINSTANCE1 AFTER MODIFICATION
SC0, A2, DE00: Lock
SC0/CP1, A1, DE00: Lock
SC0/CP1, A0, DE00: Rate Transition Block
Memory overhead: 36.0
TABLE XI. SCHEDULABLE TASKSET ON ECUINSTANCE1 FOR THE DATA CONSISTENCY METHOD IN TABLE X.
Task ID T (ms) = D C (ms) B (ms) WCRT (ms)
Task 0 40 10 5 15
Task 1 80 25 2 37
Task 2 132 30 0 75
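The effect of the longer critical section can be checked with the same response-time recursion: with semaphore locks, Task 1 on EcuInstance1 suffers B = 60 ms of blocking and misses its deadline, while the rate transition block reduces the blocking to 2 ms. A sketch (function and variable names are ours):

```python
from math import ceil

def response_time(C, B, higher):
    """Fixed-point iteration R = C + B + sum(ceil(R/Tj)*Cj); assumes convergence."""
    R = C + B
    while True:
        R_new = C + B + sum(ceil(R / Tj) * Cj for Tj, Cj in higher)
        if R_new == R:
            return R
        R = R_new

hp = [(40, 10)]    # Task 0 on EcuInstance1: T = 40 ms, C = 10 ms
T1, C1 = 80, 25    # Task 1: period/deadline 80 ms, WCET 25 ms

locks = response_time(C1, 60, hp)  # semaphore lock: B = 60 ms (TABLE IX)
rtb = response_time(C1, 2, hp)     # rate transition block: B = 2 ms (TABLE XI)
print(locks, locks <= T1)          # -> 115 False (deadline miss)
print(rtb, rtb <= T1)              # -> 37 True (schedulable)
```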
VI. CONCLUSIONS

As vehicle electronic systems become increasingly complex, the numbers of software components and ECUs have also increased, making it difficult or infeasible to find optimal SWC-to-ECU mapping schemes manually. In this work, we present an approach that automates the mapping process, guaranteeing schedulability of tasks and consistency of data shared among tasks while minimizing both the data rate over the bus and the memory overhead needed to protect data consistency. Along with our approach, we present an algorithm for extracting the connectivity between ports of atomic software components from an AUTOSAR model and an algorithm for calculating the blocking times of tasks under PCP. Finally, we use an application example to show the correctness and effectiveness of the proposed techniques.
VII. ACKNOWLEDGEMENTS

This work was supported by NSFC Project Grants #61070002 and #60736017, and by the National Important Science & Technology Specific Projects under Grants No. 2009ZX01038-001 and 2009ZX01038-002.
REFERENCES

[1] AUTOSAR GbR, AUTOSAR Specifications, Release 3.0, AUTOSAR Development Partnership, 2008.
[2] Wei Peng, Hong Li, Min Yao, Zheng Sun, "Deployment Optimization for AUTOSAR System Configuration," International Conference on Computer Engineering and Technology, vol. 4, pp. 4189-4193, 2010.
[3] Alberto Ferrari, Marco Di Natale, Giacomo Gentile, Giovanni Reggiani, Paolo Gai, "Time and memory tradeoffs in the implementation of AUTOSAR components," Design, Automation & Test in Europe Conference & Exhibition, pp. 864-869, 2009.
[4] C. L. Liu, J. W. Layland, "Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment," Journal of the ACM, vol. 20, no. 1, pp. 46-61, 1973.
[5] L. Sha, R. Rajkumar, J. Lehoczky, "Priority Inheritance Protocols: An Approach to Real-Time Synchronization," IEEE Transactions on Computers, vol. 39, no. 9, pp. 1175-1185, 1990.
[6] Kalyanmoy Deb, Samir Agrawal, Amrit Pratap, T. Meyarivan, "A Fast Elitist Non-dominated Sorting Genetic Algorithm for Multi-objective Optimisation: NSGA-II," Parallel Problem Solving from Nature VI, pp. 849-858, 2000.
FPGA Design for Monitoring CANbus Traffic in a Prosthetic Limb Sensor Network

A. Bochem∗, J. Deschenes∗, J. Williams∗, K.B. Kent∗, Y. Losier+
Faculty of Computer Science∗, Institute of Biomedical Engineering+
University of New Brunswick, Fredericton, Canada
{alexander.bochem, justin.deschenes, jeremy.williams, ken, ylosier}@unb.ca
Abstract—This paper presents a successful implementation of a Field Programmable Gate Array (FPGA) CANbus monitor for embedded use in a prosthesis device, the University of New Brunswick (UNB) hand. The monitor collects serial communications from two separate Controller Area Networks (CAN) within the prosthetic limb's embedded system. The information collected can be used by researchers to optimize performance and monitor patient use. The data monitor is designed with an understanding of the constraints inherent in both the prosthesis industry and embedded systems technologies. The design uses a number of Verilog logic cores which compartmentalize individual logic areas, allowing for more successful validation and verification through both simulations and practical experiments.
keywords: FPGA; CANbus; Data Monitor; Prosthesis; Verilog
I. INTRODUCTION
In the prosthetics field, various research institutions and commercial vendors are currently developing new microprocessor-based prosthetic limb components which use a serial communication bus. Although some groups' efforts have been of a proprietary nature, many have expressed interest in the development of an open bus communication standard. The goal is to simplify the interconnection of these components within a prosthetic limb system and to allow the interchangeability of devices from different manufacturers. This initiative is still in development and will undoubtedly face some obstacles during its development and implementation, as there are currently no embedded devices available to reliably monitor the bus activity for the newly developed protocol.
The open bus communication standard uses the CAN bus protocol as its underlying hardware communication platform. Higher levels of the protocol define the initialization, inter-module communication, and data streaming capabilities. Commercially available off-the-shelf CANbus logic analyzers, although capable of decoding the primary CAN fields, are unable to interpret the protocol messages in order to provide detailed information about the system behavior. The design of an FPGA-based prosthetic limb data monitor will allow embedded system engineers to monitor the new protocol's communication activity occurring in the system. This provides an effective development tool that will not only help develop new prosthetic limb components but also advance the open bus standard initiative. The design of the system will be flexible enough to meet future needs and follow current standards, such as simplifying the work required for end users to utilize the system. Furthermore, the monitor's data logging capabilities will allow the prosthetic fitting rehabilitation team to analyze the amputee's daily use of the system in order to assess its rehabilitation effectiveness. The evaluation of the data monitor's capabilities will be performed in conjunction with UNB researchers who are leading members of the Standardized Communication Interface for Prosthetics forum [1].
Section 2 of the paper outlines the field of biomedical engineering and presents an overview of related work. Section 3 gives an introduction to the CANbus standard. Section 4 covers the system design and its implementation. Section 5 shows how the functionality of the system has been tested and evaluated. Finally, Section 6 concludes the paper.

978-1-4577-0660-8/11/$26.00 © 2011 IEEE
II. EMBEDDED SYSTEMS
Within this section we will look at traditional embedded system and biomedical engineering projects, highlighting the characteristics of some approaches and the applied technology, with particular attention to projects that implement CANbus communication. Initially, robots were controlled by large and expensive computers requiring a physical connection to link the control unit to the robot. Today the shrinking size and cost of embedded systems and the advances in communication, specifically wireless methods, have allowed smaller, cheaper mobile robots. Robots operate and interact with the physical world and thus require solutions to hard real-time problems. These solutions must be robust and take into account imperfections in the world. Such autonomous systems usually consist of sensors and actuators: the sensors collect information coming into the system, and the actuators are the outputs that can be used to interact with the outside environment. The signal and control data is sent to a central processing unit which runs the main operating system. To reduce power consumption and complexity, these sensor networks use communication buses to exchange data.
In Zhang et al. [2] the researchers started with a typical robotic arm setup which included a number of controllers and command systems communicating with one another through a communication bus. The researchers, who consider the communication system the most important aspect of the space arm design, set out to improve it. Their system utilized the Controller Area Network (CAN) communication bus and enhanced reliability through the implementation of a redundancy strategy. The researchers cited desirable features of the CANbus, which include: error detection mechanisms, error handling by priority, adaptability and a high cost-performance ratio. In order to increase reliability, they implemented a "Hot Double Redundancy" technology. This was implemented as a communication system consisting of an ARM microprocessor, the CANbus controller circuit, data storage, system memory and a complex programmable logic device (CPLD) used to implement the redundancy strategy. The CPLD interfaced with two redundant CAN controller circuits, each connected to its own set of system devices, while taking commands from the main microprocessor. The logic to handle the "Hot" aspect of the technology, the ability to switch from one system to the other without any downtime, was implemented in the hardware description language VHDL and put onto the CPLD. This increases the redundancy, but it also significantly improves reliability by handling major system faults without downtime, and it increases the flexibility of the design by having hardware components which can be updated and changed without physical contact. The process was tested with the Quartus II software tool from Altera, which simulated normal and extreme activity for periods of sixteen to thirty-two hours, all the while alternating between the two redundant systems. The researchers reported the tests to be successful, with a 100% transmission rate and no error frames or losses due to the redundancy switch time.
Biomedical engineering is a multi-disciplinary industry in which engineering principles are applied to medicine and the life sciences. In recent years renewed interest from the American military has promoted the advancement of artificial limb technology. In 2005, the Defense Advanced Research Projects Agency (DARPA) launched two prosthetic revolution programs, Revolutionizing Prosthetics 2007 (RP2007) and Revolutionizing Prosthetics 2009 (RP2009) [3]. The goal of RP2007 was to deliver a fully functional upper extremity prosthesis to an amputee, utilizing the best possible technologies. The arm is to have the same capacity as a native arm, including fully dexterous fingers, wrist, elbow and shoulder, with the ability to lift weight, reach above one's head and around one's back. The prosthetic will include a practical power source, a natural look and a weight equal to that of the average arm. Another, more difficult goal includes control of the prosthesis through use of the patient's central nervous system. In RP2009 the prosthesis technology will be extended to include sensors for proprioception feedback, a 24-hour power source, the ability to tolerate various environmental issues and increased durability. Although the prostheses developed during the course of both projects proved to be technological achievements [4], it is still unclear whether the cost of these systems, expected to be around $100,000 [5], will prohibit their use when they become commercially available.
The University of New Brunswick's (UNB) hand project [6] seeks to design a low-cost, three-axis, six-basic-grip anthropomorphic hand, with control of the hand using subconscious grasping to determine movement. The UNB hand team built intelligent electromyography (EMG) sensors [7] that could amplify and process signal information, passing the required information to the main microprocessor through a serial communication bus. This allows for a reduction in wiring, which reduces the weight and simplifies the component architecture. The serial bus chosen, the Controller Area Network bus (CANbus), allowed a power strategy to be implemented, reducing overall power consumption. The CANbus is also noted as having a good compromise between speed and data protection, a necessity for prosthetics [8]. The hand project creators have begun creating a communication standard [9] to improve interchangeability and interconnection between limb components. If adopted, major increases in flexibility would be gained, which would be beneficial to all people involved in the prosthesis industry.
The paper by Banzi, Mainardi and Davalli [8] extends the idea of a distributed control system and Controller Area Network serial bus to an entire arm prosthesis. In this project, along with the parallel distribution cited in the UNB hand project, the device had the additional task of handling external communication through either a Bluetooth or RS232 serial connection. The paper cites reasons for using the CANbus similar to those of the UNB hand project, adding evidence of successful device integration in the CANbus' traditional area, automotives, and how this could parallel the prosthetic industry. The paper also outlines reasons for not choosing other communication protocols or technologies. The two initial choices, Inter-Integrated Circuit (I2C) [10] and Serial Peripheral Interface (SPI) bus [11], were rejected because they failed to have adequate data control and data protection systems and were unable to handle faults acceptably. The SPI system required additional hardware overhead to allow for device addressing. The personal computing and industry standards were unable to adapt to the space and weight constraints required by the profile of the prosthesis. The CANbus allowed for a reasonable number of sensors, flexibility in expansion and interfacing, microcontrollers with integrated bus controllers, and efficient, robust, deterministic data transmission with a reduction in required cabling and near optimal voltage levels.
III. CANBUS
The CANbus is a serial communication standard used to handle secure and realtime communication between electrical and microcontroller devices, and it primarily defines the physical and data link layers of the open systems interconnection (OSI) model. The CANbus was originally created to support the automotive industry and its increased reliance on electronics; however, because of its reliability, high speed and low-cost wiring, it has been used in many additional areas. The CANbus supports bit rates of up to 1 Mbps and has been engineered to handle the constraints of a security-conscious real-time system [12]. The physical medium is shielded copper wiring, which utilizes a non-return-to-zero line coding. The CAN message frame uses either an 11-bit identifier base frame (CAN 2.0A) or a 29-bit identifier extended frame (CAN 2.0B). There are a number of custom bit identifiers, most of which are used to synchronize messages, perform error handling or signal various values. The CAN standard operates on a network bus, therefore all devices have access to each message; addressing is handled by identifiers in each message frame. The CANbus standard outlines the arbitration method as CSMA/BA, where the BA stands for bit arbitration. The bit arbitration method allows any device to transmit its message onto the bus. If there is a collision, the transmitter with the greatest priority, identified by the most successive dominant bits in its identifier, wins bus arbitration. The lesser-priority devices then stand off for a predefined period of time and re-transmit their messages. This allows the highest priority messages to be handled the fastest. Other important properties defined in this standard are:
• Message prioritization - Critical devices or messages have priority on the network. This is done through the media arbitration protocol.
• Guaranteed latency - Realtime messaging latency utilizes a scheduling algorithm which has a proven worst case and can therefore be relied upon in all situations [13].
• Configuration flexibility - The standard is robust in its handling of additional nodes; nodes can be added and removed without requiring a change in the hardware or software of any device on the system.
• Concurrent multicasting - Through the addressing and message filtering protocols in place, the CANbus can perform multicasting in which it is guaranteed that all nodes accept the message at the same time, allowing every node to act upon the same message.
• Global data consistency - Messages contain a consistency flag which every node must check to determine if the message is consistent.
• Emphasis on error detection - Transmissions are checked for errors at many points throughout messaging. This includes monitoring at global and local transmission points, cyclic redundancy checks, bit stuffing and message frame checking [12].
• Automatic retransmission - Corrupted messages are retransmitted when the bus becomes idle again, according to prioritization.
• Error distinction - Through a combination of line coding, transmission detection, and hardware and software logic, the CANbus is able to differentiate between temporary disturbances, total failures and the switching off of nodes.
• Reduced power consumption - Nodes can be set to sleep mode during periods of inactivity. Activity on the bus or an internal condition will awaken the nodes.
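The bitwise arbitration behavior can be illustrated with a small software model. This is a simplified sketch of our own (it ignores bit stuffing and the rest of the frame; the function name is illustrative): each node shifts out its identifier MSB-first onto a wired-AND bus, and a node that sends a recessive bit (1) but reads back a dominant bit (0) drops out, so the lowest identifier wins.

```python
def arbitrate(ids, width=11):
    """Model CAN bitwise arbitration over a wired-AND bus.

    Each contender shifts its identifier out MSB-first; a dominant bit (0)
    overwrites a recessive bit (1) on the bus.  A node that transmits
    recessive but observes dominant loses arbitration and stops, so the
    lowest identifier (highest priority) always wins without corrupting
    the winning message.
    """
    contenders = list(ids)
    for bit in range(width - 1, -1, -1):
        sent = {i: (i >> bit) & 1 for i in contenders}
        bus = min(sent.values())  # wired-AND: any dominant 0 pulls the bus to 0
        contenders = [i for i in contenders if sent[i] == bus]
    assert len(contenders) == 1
    return contenders[0]

print(hex(arbitrate([0x47, 0x12, 0x300])))  # -> 0x12, the lowest identifier
```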
IV. SYSTEM DESIGN
The open bus standard-based data monitor system for prosthetic limbs captures and collects the serial information from two separate Controller Area Network (CAN) buses, the sensor bus and the actuator bus. It then transforms the collected information into the correct CAN message format. A timestamp is added, and the messages are passed through a user-controlled filter that dictates which messages should be logged. After the filter, the two buses' messages are merged and sorted according to their timestamps. Once sorted, the information is sent to an output device for further processing. The current design allows choosing between three different output interfaces. The first is the RS232 serial interface that sends the CAN message data, encoded in ASCII format, to the serial port using the RS232 chip on the DE2 board. The second is a direct connection of the RS232 serial interface module to the pin interface of the DE2 board; this allows transmission at higher bandwidth using an external RS232 chip with better performance. The third communication module uses the USB interface of the DE2 board to transmit the CAN message data [14]. This project will allow engineers to observe the communication on the buses and search for sources of errors. An overview of the system design is given in Figure 1.

The implementation consists of various "working" cores that can be individually tested, each containing one piece of the functionality required for the overall system. These cores interface with one another through a standard FIFO, alleviating timing issues. The FIFO modules are dual-clocked memory buffers that work on the first-in/first-out concept. With the dual-clock ability, these cores can be used to exchange data between two modules in a hardware design even if those modules run at different clock speeds. The FIFO cores belong to the Intellectual Property (IP) core library that is available within Altera's Quartus II development environment. The usage of those cores is usually free for academic use but requires a license fee for industrial or end-consumer development purposes.
The "CAN Reader" module handles the observation of the connected CANbus. It receives all messages that are transmitted without causing a collision on the bus and forwards them to a FIFO buffer, from which the "Filter" module gets its input data. One pair of the "CAN Reader" and the "Filter" is connected to the control bus, while the other pair listens to the sensor bus. The modular design would allow connecting more or fewer CANbuses to the monitoring system. In the current implementation of the "Message Writer" module, only data transmission over the RS232 interface is implemented. Data compression of the messages and the application of different interface technologies were evaluated during the project; their implementations have been postponed as future work. RS232 was implemented because it required the least overhead to establish a communication channel [15].
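The merge step described above, combining the two filtered, timestamped streams into one chronologically ordered log, behaves like a standard two-way merge. A minimal software model of that behavior (the names and record layout are illustrative, not taken from the Verilog design):

```python
import heapq

def merge_streams(sensor_msgs, actuator_msgs):
    """Merge two streams of (timestamp, bus, payload) records by timestamp.

    Models the behavior of the Message Merge Unit: each input stream is
    already ordered by capture time, so a two-way merge yields a single
    chronologically ordered log.
    """
    return list(heapq.merge(sensor_msgs, actuator_msgs))

sensor = [(10, "sensor", b"\x01"), (30, "sensor", b"\x02")]
actuator = [(5, "actuator", b"\xa0"), (25, "actuator", b"\xa1")]
log = merge_streams(sensor, actuator)
print([t for t, _, _ in log])  # -> [5, 10, 25, 30]
```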
The design of the "CAN Reader" module is based upon the "CAN Protocol Controller" project by Igor Mohor from opencores.org. This design implements the specification of the "SJA1000 Stand-alone CAN controller" from Philips Semiconductors [16]. To allow the integration of the core design, an I/O module that handles the Wishbone interface had to be implemented. The configuration of the CAN controller is register based, as defined in the specification of the SJA1000. The created I/O module is used by the "CAN Reader", which configures the "CAN controller". The "CAN controller" receives the data from the CANbus, while the "CAN Reader" collects the messages from the "CAN controller" and forwards them to the timestamp and filter procedures.

Fig. 1: System design of CANbus monitor.

For a manageable processing flow, the internal control of the modules has been designed as state machines. This design concept allowed the localization of problems caused by signal runtimes and race conditions. Figure 2 gives an idea of the state machine that describes the functionality of the "CAN Reader" module.
To give an idea of the design complexity, the state "Read Receive Buffer" uses the I/O module to communicate with the "CAN controller". This leads to a design with state machines encapsulated in state machines. Such designs should be avoided in hardware if possible, since they tend to obscure runtime race conditions. In the current design the occurrence of race conditions is prevented by event-driven mutexes.
V. VALIDATION & VERIFICATION
The validation of the system design has been done by simulation and by runtime experiments on the target platform. For verification of the functionality, the individual modules have been simulated with Altera's ModelSim tool. This allowed a step-wise execution of the Verilog code to identify logic errors in the implementation. Afterwards the modules were tested on the target platform, with testbench modules providing the required input data. The output of the tested modules was transmitted over the RS232 interface to a connected computer and verified by hand. The timing verification of the hardware design was done with an experimental test setup with two sensor nodes on one connected CANbus (Figure 3).
The two sensor nodes were configured to send messages on the bus continuously at a frequency of 250 Hz. To allow verification that all messages are received and no message is lost, each message contained a set of four consecutive numbers. The first node was configured to send numbers in the range from 0 to 999; the second node sent numbers in the range from 1000 to 1999. According to the CANbus standard, each node resends its current message until it receives an acknowledge response on the bus from the other node, which causes the node to increment the numbers for the next message. If a collision occurs on the bus, the CANbus protocol standard ensures that the transmitting node is informed by a global collision signal, sent by the first node on the CANbus that detects the collision.
The analysis of the system led to the conclusion that the hardware design works correctly. All messages that were sent by both nodes on the CANbus were received successfully and with correct data by the "CAN Reader" module in the FPGA design. This could be verified with the logged data from the output of the serial module in the hardware design. Since the test setup only had one CANbus, the input pin was used twice to evaluate the system with two CANbuses connected. This allowed verifying the proper ordering of the messages by the "Message Merge Unit". It turned out that the only flaw in the current design is the RS232 interface. The available chip on the DE2 board has a maximum bandwidth of 120 kbit/s, which becomes a bottleneck if the connected CANbus runs at its maximum bandwidth of 1 Mbit/s, and even more so with two connected CANbuses. A design for a USB module as communication interface has been created but could not be sufficiently evaluated for further usage.
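The consecutive-number scheme used in the test lends itself to a simple offline check: scan each node's logged numbers and report every break in the sequence. A sketch of our own (assuming the log has already been parsed into one number sequence per node):

```python
def find_gaps(seq):
    """Return (expected, got) pairs wherever consecutive numbering breaks."""
    gaps = []
    for prev, cur in zip(seq, seq[1:]):
        if cur != prev + 1:
            gaps.append((prev + 1, cur))
    return gaps

node1 = [0, 1, 2, 3, 4, 7, 8]     # messages carrying 5 and 6 were lost
print(find_gaps(node1))           # -> [(5, 7)]
print(find_gaps(list(range(5))))  # -> [] : no loss
```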
Fig. 2: State machine design for "CAN Reader" module.

The analysis of the test results showed some unexpected behavior. While the messages of one node were received completely, some messages of the other node were missing. After resetting the CANbus system, this effect could switch to the other node or stay the same. For either one of the two nodes some messages were missing, but never for both nodes in the same test execution. The extensive analysis of the system design leads to the conclusion that this error might be caused by the configuration of the sensor nodes. Further evaluation of the system design will be continued once a new test setup can be provided by the project partner.
VI. FUTURE WORK
It might be useful to have the ability to run the CANbus monitoring design without a direct connection to a computer. The logged data could be stored on an integrated flash memory module. For this feature it would be helpful to have a compression module to increase the system's mobility. The compression would need to be fast enough that it does not become a bottleneck for the system. Ideally the core would take in a stream of data and output a stream of compressed data. This could plug directly into the current system design.
It would be useful to have wireless functionality so that the prosthesis does not need to be tethered to a computer to retrieve the logged data. The biggest hurdle to overcome is understanding how to communicate with the wireless controller at the hardware level. Existing solutions from open source projects could provide a starting point here.
At the moment all the values for the CAN controller are hard coded and thus cannot be changed after the design has been synthesized. This could be improved in several ways, ranging from full configurability at runtime to a bit of code reorganization so that the values can easily be changed for re-synthesis. A compromise could be the ability to configure a small number of values at runtime. The most time-effective solution would be to reorganize the code so that the hard-coded values are extracted into a configuration file.
The filtering is currently fairly rudimentary and inflexible. It would be advantageous to have more flexibility in how messages are filtered. Reconfiguring the filters, and enabling or disabling filters at runtime, would be helpful. This becomes even more crucial as the design moves from lab testing to real-world testing, where re-synthesizing the design becomes less feasible. The system could potentially be a memory-mapped device, with masks stored in registers that could be written by a microcontroller. A simpler solution would be to have several kilobytes of persistent memory to which masks could be written, to be used during the filtering of the messages.

Fig. 3: Experimental test setup for final system verification.
VII. SUMMARY
This paper has shown the successful implementation of a monitoring system for a CANbus sensor network in a hardware design on an FPGA development board. Details of successful projects and related work have been introduced. The information presented in this paper should offer a good starting point, providing a general understanding of embedded projects, with an emphasis on actual biomedical engineering solutions and the basics of the CANbus standard. The modular design of the approach allows its application in other projects that need a CANbus monitoring system. All project source code will be made available through opencores.org.
Acknowledgments
This work is supported in part by CMC Microsystems, the Natural Sciences and Engineering Research Council of Canada, and Altera Corporation.
REFERENCES
[1] Standardised Communication Interface for Prosthetics. [Online]. Available: http://groups.google.ca/group/scip-forum/
[2] J. Yang, T. Zhang, J. Song, H. Sun, G. Shi, and Y. Chen, "Redundant design of a CAN bus testing and communication system for space robot arm," in Control, Automation, Robotics and Vision, 2008. ICARCV 2008. 10th International Conference on, Dec. 2008, pp. 1894-1898.
[3] DARPA, DARPA: 50 Years of Bridging the Gap. Defense Advanced Research Projects Agency, 2008.
[4] L. Ward. (2007, October) Breakthrough awards 2007 - DARPA-funded Proto 2 brings mind control to prosthetics. Popular Mechanics.
[5] S. Adee. (2008, February) Dean Kamen's "Luke Arm" prosthesis readies for clinical trials. IEEE Spectrum.
[6] A. Wilson, Y. Losier, P. Kyberd, P. Parker, and D. Lovely, "EMG sensor and controller design for a multifunction hand prosthesis system - the UNB hand," draft document, 2009.
[7] A. W. Wilson, Y. G. Losier, P. A. Parker, and D. F. Lovely, "A bus-based smart myoelectric electrode/amplifier," in Medical Measurements and Applications Proceedings (MeMeA), 2010 IEEE International Workshop on, 2010.
[8] S. Banzi, E. Mainardi, and A. Davalli, "A CAN-based distributed control system for upper limb myoelectric prosthesis," in Computational Intelligence Methods and Applications, 2005 ICSC Congress on, Istanbul, 2005.
[9] Y. Losier and A. Wilson, "Moving towards an open standard: The UNB prosthetic device communication protocol," 2009.
[10] The I2C-Bus Specification, Philips Semiconductors Std. 9398 393 40 011, Rev. 2.1, January 2000.
[11] Motorola, M68HC11 Microcontrollers Reference Manual, rev. 6.1 ed., Freescale Semiconductor Inc., May 2007.
[12] CAN Specification, Robert Bosch GmbH Std., Rev. 2.0, September 1991.
[13] J. Krakora and Z. Hanzalek, "Verifying real-time properties of CAN bus by timed automata," in FISITA World Automotive Congress, Barcelona, May 2004.
[14] DE2 Development and Education Board - User Manual, 1st ed., Altera, 101 Innovation Drive, San Jose, CA 95134, 2007.
[15] MAX232, MAX232I Dual EIA-232 Drivers/Receivers, Texas Instruments, Post Office Box 655303, Dallas, Texas 75265, March 2004.
[16] SJA1000 Stand-alone CAN Controller, Philips Semiconductors, 5600 MD Eindhoven, The Netherlands, January 2000.
Session 2: Prototyping Architectures
Rapid Single-Chip Secure Processor Prototyping on the OpenSPARC FPGA Platform

Jakub M. Szefer∗3, Wei Zhang#1, Yu-Yuan Chen∗3, David Champagne∗3, King Chan#1, Will X.Y. Li#1, Ray C.C. Cheung#2, Ruby B. Lee∗3

#Department of Electronic Engineering, City University of Hong Kong
1{wezhang6, kingychan8, xiangyuli4}@student.cityu.edu.hk, 2[email protected]

∗Electrical Engineering Department, Princeton University, USA
3{szefer, yctwo, dav, rblee}@princeton.edu
Abstract—Secure processors have become increasingly important for trustworthy computing as security breaches escalate. By providing hardware-level protection, a secure processor ensures a safe computing environment where confidential data and applications can be protected against both hardware and software attacks. In this paper, we present a single-chip secure processor model and demonstrate rapid prototyping of the secure processor on the OpenSPARC FPGA platform. OpenSPARC T1 is an industry-grade, open-source, FPGA-synthesizable general-purpose microprocessor originally developed by Sun Microsystems, now acquired by Oracle. It is a multi-core, multi-threaded 64-bit processor with open-source hardware, including the microprocessor core, as well as system software that can be freely modified by researchers. We modify the OpenSPARC T1 processor by adding security modules: an AES engine, a TRNG and a memory integrity tree. These enhancements enable security features such as memory encryption and memory integrity verification. By prototyping this single-chip secure processor on the FPGA platform, we find that the OpenSPARC T1 FPGA platform has many advantages for secure processor research. Our prototyping demonstrates that additional modules can be added quickly and easily, and that they add little resource overhead to the base OpenSPARC processor.
I. INTRODUCTION
As computing devices become ubiquitous and security breaches escalate, protection of information security has become increasingly important. Many software schemes, e.g., [1]–[3], have been proposed to enhance the security of computing systems and are effective in defending against software attacks. However, they are generally ineffective against physical or hardware attacks. Attackers who have full control of the physical device can easily bypass software-only protection, and the whole system is left unsafe and subject to hardware attacks. This results in an increasing need for hardware-enhanced security features in the microprocessor.
Considerable efforts have been made to build secure computing platforms that can address security threats. In this paper, we present our extensible secure computing model prototyped on the OpenSPARC FPGA platform. The platform consists of the OpenSPARC T1 processor and system software, including the hypervisor and the operating system. The OpenSPARC T1 processor is the open-source form of the UltraSPARC T1 processor (from Sun Microsystems, now Oracle) that gives designers the freedom to modify the processor according to their own needs [4]. The OpenSPARC T1 processor is also easily synthesizable for FPGA targets, which makes implementing the processor straightforward. A Field-Programmable Gate Array (FPGA) is an integrated circuit designed to be configured by the customer or the designer after manufacture. Because of its reconfigurability, an FPGA can be used to implement any logic function that an Application-Specific Integrated Circuit (ASIC) chip could perform, which makes it a good platform for rapid system prototyping.
Through the prototyping process, we have found that the OpenSPARC FPGA platform has many advantages for secure processor research. For example, the memory subsystem is emulated in a MicroBlaze softcore, which allows new features to be added without re-synthesizing the whole platform. Furthermore, new hardware components can be easily added as Fast Simplex Link (FSL) peripherals without worrying about strict timing or editing fragile HDL code of the processor core. However, to the best of our knowledge, there are currently only a few papers [5]–[7] about processor research on the OpenSPARC FPGA platform and none focusing on security. In this paper, we propose a single-chip secure processor architecture based on this platform. Furthermore, we find that the OpenSPARC FPGA platform is relatively friendly for secure processor prototyping. Although other open-source processors such as OpenRISC and LEON are available [8], they lack the infrastructure and components (e.g., a hypervisor and emulated caches) of the OpenSPARC platform.
The main contributions of this paper are:
∙ A reconfigurable single-chip secure processor model.
∙ A prototype of the single-chip secure processor on the OpenSPARC FPGA platform with our new additions: an AES engine, a true random number generator (TRNG), and a memory integrity tree (MIT).
∙ An evaluation showing the OpenSPARC FPGA platform's advantages for secure processor research.
The rest of the paper is organized as follows. Section II describes some existing secure processor models as background information. Section III proposes our single-chip secure processor architecture on the OpenSPARC FPGA platform. Section IV describes the single-chip secure processor features.
Section V gives an evaluation of the secure OpenSPARC platform. Finally, in Section VI we conclude the paper.
II. RELATED WORK
With the emergence of hardware attacks, hardware-enhanced security has been given considerable attention by researchers and engineers. Different secure processor architectures have been proposed to provide a secure computing environment for protecting sensitive information against both software and hardware attacks.
Single-chip secure processors consider the processor to be trusted, but anything outside the processor, e.g., memory, is untrusted. One such example is the Aegis architecture, as described in [9], [10]. In this approach, the secure processor contains two key primitives: a physical unclonable function (PUF) and off-chip memory protection. Both primitives are realized within one single-chip processor so that the internal state of the processor cannot be tampered with or observed directly by physical means.
The Secret-Protecting (SP) architecture is proposed in [11]. In SP, trusted software modules have their data and code encrypted and hashed when off-chip, and a concealed execution mode is provided where these software modules are protected from other software snooping on them; e.g., registers are encrypted on an interrupt so that a potentially compromised commodity operating system cannot read them. Also, a hierarchical key chain structure is used to store all keys in encrypted and hashed form, and only the root key needs to be stored in hardware.
Another secure processor architecture is Bastion [12], which can protect a trusted hypervisor, which in turn protects trusted software modules in the application or in the operating system. Bastion scales to provide support for multiple mutually distrustful security domains. Bastion also provides a memory integrity tree for runtime memory authentication and protection from memory replay attacks. It protects the hypervisor not just from software attacks, but also from physical and offline attacks.
Based on this previous work, we propose our secure computing model in Section III. Despite the many advantages of the OpenSPARC FPGA platform, it is not widely used as a research platform. We propose a single-chip secure processor on this platform and hope that our work can serve as a reference for researchers interested in OpenSPARC.
III. SINGLE-CHIP SECURE OPENSPARC PROCESSOR
This section presents our secure computing model and single-chip secure OpenSPARC T1 processor.
A. Secure Computing Model
Figure 1 illustrates our secure computing model. We divide the computing system into two parts. The first part includes the components on the processor chip, shown inside the dashed box, and the second part consists of all components off the processor chip, shown outside the dashed box. All on-chip modules, including the CPU core, cache, registers, encryption/decryption engine and integrity verification module, are assumed to be trusted and protected from physical attacks in our design, because the internal state of the processor chip cannot easily be tampered with or observed directly by physical means. We do not consider side-channel attacks, such as those employing differential power analysis or electromagnetic analysis, in this paper. On the other hand, all off-chip modules, including external memory and peripherals, are considered insecure because those modules can easily be tampered with by an adversary using physical attacks. In addition to hardware attacks, the system may suffer from software attacks by an untrusted operating system or malicious software.

Fig. 1: Secure computing model. The processor chip is regarded as a physically secure region and gray blocks represent new security enhancements.
To ensure a secure computing environment, the computing system must have secure functions that enable it to defend against either software or hardware attacks. The gray blocks in Figure 1 represent our initial set of new security enhancements that we add to the computing system. The encryption/decryption module encrypts all data evicted off the processor chip so that they are meaningless to an adversary. The integrity verification module verifies that all data coming from the off-chip memory have not been tampered with.
B. OpenSPARC FPGA Platform
Our single-chip secure processor design targets the FPGA-synthesizable version of the OpenSPARC T1 general-purpose microprocessor. The OpenSPARC T1 microprocessor is an industry-grade, 64-bit, multi-threaded processor and is freely available from the OpenSPARC website [4], [13]. In addition to the processor core source code (HDL), simulation tools, design verification suites, and hypervisor source code (C and assembly) are available for download [4]. The OpenSPARC FPGA platform consists of the following major components: the OpenSPARC T1 microprocessor core, the memory subsystem (L2 cache), the DRAM controller and DRAM, the hypervisor, and a choice of operating systems (Linux or OpenSolaris).
Due to the size constraint of the FPGA chip, the OpenSPARC FPGA platform that we use includes only one single-thread T1 CPU core to minimize its size. In addition, the L2 cache and the L2 cache controller are emulated in a MicroBlaze softcore, i.e., there is no physical L2 cache. A high-level block diagram of the OpenSPARC T1 processor is shown in Figure 2. The microprocessor core is connected to the L2 cache, emulated by a MicroBlaze softcore, through the CPU-Cache Crossbar (CCX). The DRAM controller is an IP (Intellectual Property) block synthesized and implemented in the FPGA fabric and connected to the MicroBlaze softcore. A physical DRAM is connected to the FPGA board to serve as the actual memory.

Fig. 2: Block diagram of the stock OpenSPARC FPGA platform.
Due to the different complexities of these components, and to the place-and-route operations that determine critical paths when implementing on an FPGA, the FPGA version of the OpenSPARC processor chip has multiple clock domains. The OpenSPARC T1 core is in one clock domain (50MHz), and the MicroBlaze softcore is in another (125MHz). Peripherals synthesized in the FPGA fabric and connected to the MicroBlaze softcore are in yet another clock domain (e.g., 10MHz or 50MHz for the peripherals we create). Finally, the DRAM chip has its own clock domain (400MHz).
The OpenSPARC FPGA platform has many advantages for security research [14]. It allows users to freely modify real hardware. In particular, the memory subsystem is emulated rather than fully implemented in HDL, so new features can be easily added (in C code run on the MicroBlaze softcore) and re-synthesizing the whole platform can be avoided. In addition, new hardware components can be added as firmware code, or synthesized in the FPGA fabric and connected to the emulated cache by buses, e.g., the FSL bus. The peripherals on the ML505 board are also very useful, e.g., the network port. Finally, secure processors can be prototyped without fabricating a real processor chip.
C. Secure Processor Architecture
Based on the secure computing model described in Figure 1, we propose our single-chip secure processor architecture. Our approach is to add security modules to the stock OpenSPARC FPGA platform. We synthesize and implement the design on an FPGA [15].
Figure 3 shows the block diagram of our single-chip secureprocessor architecture on the OpenSPARC FPGA platform.
Fig. 3: Block diagram of the secure OpenSPARC FPGA platform. Gray blocks show our new additions.
The gray blocks in the diagram represent the security modules we add to the original platform: a TRNG (true random number generator, HDL code), an AES (Advanced Encryption Standard) engine (HDL code), and a memory integrity tree (MIT, MicroBlaze firmware code). The TRNG and the AES engine are implemented in the FPGA fabric, and the MIT is executed in the MicroBlaze softcore as firmware. The TRNG and the AES engine are connected to the MicroBlaze through the FSL bus. Through the MicroBlaze, the OpenSPARC microprocessor communicates with these security modules. The MIT firmware calls the AES engine for memory integrity verification. Each module works in its own clock domain, and a Digital Clock Manager (DCM) on the FPGA board generates these clock frequencies. Table I shows the different clock domains in our secure OpenSPARC system.
In addition to the FPGA chip, the FPGA board has many on-board resources that can be utilized by the secure processor. In our experimental setup, we use the Xilinx Virtex-5 ML505 FPGA board. This board contains a 256MB DRAM module, in which an 80MB ramdisk is used to boot the Linux or Solaris operating system. In addition, the board has an Ethernet port that can connect the secure processor to the Internet if enabled. The Ethernet port can also serve as a communication port that provides a high data exchange rate between the host computer and the secure processor.
Our secure processor works in one of four modes:
∙ STD - standard mode, which has no additional security measures;
∙ CONF - confidential mode, which performs memory encryption to ensure data confidentiality;
∙ ITR - integrity tamper-resistant mode, which performs memory integrity verification to ensure data integrity;
∙ FTR - full tamper-resistant mode, which performs both memory encryption and memory integrity verification to ensure data confidentiality and integrity.
The secure processor can work in any of the four modes depending on the user's need.
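The paper does not specify how the modes are encoded; as an illustrative sketch (the names and flag layout are ours), the four modes can be viewed as two independent flags, one enabling memory encryption and one enabling integrity verification:

```c
#include <stdbool.h>

/* Hypothetical encoding of the four operating modes as two flags:
 * bit 0 = memory encryption (CONF), bit 1 = integrity verification (ITR).
 * FTR is simply both bits set; STD is neither. */
typedef enum {
    MODE_STD  = 0,          /* no additional security measures        */
    MODE_CONF = 1,          /* memory encryption only                 */
    MODE_ITR  = 2,          /* memory integrity verification only     */
    MODE_FTR  = 3           /* both encryption and verification       */
} secure_mode_t;

static bool mode_encrypts(secure_mode_t m) { return (m & MODE_CONF) != 0; }
static bool mode_verifies(secure_mode_t m) { return (m & MODE_ITR)  != 0; }
```

With this encoding, the hardware can gate the encryption engine and the MIT firmware independently from a single two-bit mode register.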
TABLE I: Clock domains in the OpenSPARC system

Module    | OpenSPARC T1 | MicroBlaze | AES engine | TRNG  | DRAM
Frequency | 50MHz        | 125MHz     | 50MHz      | 10MHz | 400MHz
IV. SINGLE-CHIP SECURE PROCESSOR FEATURES
The proposed single-chip secure processor provides the following security features: memory encryption/decryption, secret key generation, and memory integrity verification.
A. AES Engine
The Advanced Encryption Standard (AES) is a symmetric encryption algorithm approved by the National Institute of Standards and Technology (NIST). It is one of the most widely used symmetric encryption algorithms and is advanced in terms of both security and performance. While any symmetric key encryption algorithm suits our purpose, we adopt AES as the memory encryption/decryption algorithm and AES CBC MAC as the cryptographic hash primitive.
AES processes data blocks of 128 bits using cryptographic keys of 128, 192 or 256 bits. The encryption or decryption takes 10 to 14 rounds of array operations, depending on the key size. In our AES unit design, we employ the idea of parallel table lookup (PTLU) as in [16], [17]. The AES unit is based on AES-128 and takes a block of 128-bit input and a 128-bit key to produce a 128-bit output. The block diagram of the AES unit is shown in Figure 4.
The AES unit consists of one finite state machine (FSM) that controls the operation of aes_round and aes_key_expander. The load signal triggers the FSM to load the registers with the input data and key. The start signal sets an internal counter value and starts the AES encryption/decryption cycle from round 0. The mode signal switches the AES operation between encryption and decryption. The latency for one block of AES-128 encryption and decryption is 14 cycles and 25 cycles, respectively. For encryption, the AES unit takes 11 cycles to produce the output data (1 cycle per round and 1 cycle for the initial AddRoundKey operation), and 3 cycles to assert the load, start, and done signals. The decryption process, however, incurs an extra 11 cycles in order to generate the first round key for decryption, since the first round key of decryption is the last round key of encryption. After the round operations are done, a done signal is asserted to signify that the output data is ready on the data_out bus.
The AES engine works in cipher-block chaining (CBC) mode, i.e., AES-CBC. The input and output data width is 512 bits (64 bytes) in our design, to correspond to the common size of modern cache lines. The AES engine is connected to the MicroBlaze softcore through the FSL bus, which has a data bus width of 32 bits. All input data first has to be loaded into the AES engine to start the AES-CBC encryption, which needs (32+128+128+512)/32 = 25 cycles to load the mode, initial_vector, key, and data_in; outputting the data back to the MicroBlaze takes 512/32 = 16 cycles. If we include the 14 cycles for encrypting each block of 128-bit input (25 cycles for decryption), the total latency of encrypting 64 bytes of input data is 25+4×14+16 = 97 cycles. Similarly, we get 25+4×25+16 = 141 cycles for decryption. We note the high overhead of the FSL bus for data transfer (41 of the 97 cycles for encryption). Also, performance can be improved by storing the decryption round keys so they do not need to be regenerated, at the cost of using more hardware resources.

Fig. 4: Block diagram of the AES-128 unit.
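The cycle arithmetic above can be captured in a small model; the constants come from the paper, while the function names are ours:

```c
/* Cycle-count model of the AES-CBC engine behind the 32-bit FSL bus.
 * A 64-byte cache line is processed as four 128-bit AES blocks. */

#define FSL_WIDTH_BITS 32
#define AES_BLOCK_BITS 128
#define LINE_BITS      512   /* 64-byte cache line */

/* Loading mode (32b) + initial_vector (128b) + key (128b) + data (512b). */
static int load_cycles(void) {
    return (32 + 128 + 128 + LINE_BITS) / FSL_WIDTH_BITS;   /* = 25 */
}

/* Reading the 512-bit result back over the 32-bit FSL bus. */
static int unload_cycles(void) {
    return LINE_BITS / FSL_WIDTH_BITS;                      /* = 16 */
}

/* per_block_cycles = 14 for encryption, 25 for decryption. */
static int line_cycles(int per_block_cycles) {
    return load_cycles()
         + (LINE_BITS / AES_BLOCK_BITS) * per_block_cycles
         + unload_cycles();
}
```

Evaluating `line_cycles(14)` and `line_cycles(25)` reproduces the 97- and 141-cycle figures, and makes explicit that 41 of the 97 encryption cycles are pure FSL transfer overhead.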
B. True Random Number Generator (TRNG)
The secure processor needs a secret key for memory encryption and decryption. In addition, the secret key should be unpredictable to attackers. A true random number generator (TRNG) is used to generate a secret key for the AES engine.
Figure 5 shows the internal structure of the TRNG, which consists of many identically laid-out delay loops, or ring oscillators (ROs). We call this a ring oscillator TRNG, introduced by Sunar et al. in [18]. Based on the design in [19], which uses 110 rings with 13 inverters, our TRNG consists of 114 ROs, each of which is a simple circuit containing 15 concatenated NOR gates that oscillate at a certain frequency. One of the two inputs of each NOR gate is used to reset the TRNG. Because of manufacturing variations, each RO oscillates at a slightly different frequency. The outputs of all ROs are exclusive-ORed in order to correct bias and correlation and to generate a random signal. We sample the random signal output from the XOR gate at a frequency of 10MHz. Similar to the AES engine, the TRNG is connected to the MicroBlaze (the emulated cache) through the FSL bus and interacts with the OpenSPARC T1 core through the MicroBlaze.
The TRNG module is separate from the other modules, which incurs the overhead of data transfer over the FSL bus when the random bits are used by firmware or another module. For example, a transfer of 128 random bits is needed for the AES engine: TRNG → MicroBlaze → AES. The advantage is that the TRNG is not tied to AES and can be used for other purposes. Only the TRNG → MicroBlaze transfer is needed if the random bits are used inside the firmware. Furthermore, the TRNG could be integrated with the AES unit to reduce the transfer overhead if it is dedicated to AES key generation. To the best of our knowledge, combining a TRNG with an AES engine on a secure processor platform for key generation has not been explicitly discussed in previous works.

Fig. 5: Internal structure of the true random number generator (TRNG): 114 ring oscillators, each of 15 NOR gates, are XORed and sampled.

Fig. 6: 128-bit key generation circuit.
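As a rough software illustration of the structure described above (a behavioral sketch only: the oscillator periods below are invented stand-ins for process variation, and a deterministic model like this produces no real randomness, which in hardware comes from physical jitter):

```c
#include <stddef.h>

/* Behavioral sketch of the ring-oscillator TRNG: each RO is modeled as a
 * square wave with its own period; all outputs are XORed together and the
 * XOR output is sampled by a slower clock. */

#define NUM_ROS 114

/* One ring oscillator's output at time t; the per-RO period mimics
 * manufacturing variation (values are illustrative, not measured). */
static int ro_output(size_t ro, unsigned long t) {
    unsigned long period = 7 + ro % 13;
    return (int)((t / period) & 1);
}

/* XOR of all RO outputs corrects bias and correlation. */
static int trng_sample(unsigned long t) {
    int x = 0;
    for (size_t i = 0; i < NUM_ROS; i++)
        x ^= ro_output(i, t);
    return x;
}

/* Collect n samples, one per tick of the (slower) sampling clock. */
static void trng_collect(int *out, size_t n, unsigned long stride) {
    for (size_t i = 0; i < n; i++)
        out[i] = trng_sample(i * stride);
}
```

In the real design the sampling clock is the 10MHz domain listed in Table I, and the sampled bits are shifted over the FSL bus to the MicroBlaze.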
To test the TRNG on the FPGA platform, we devised a new method to collect enough random bits. We connect an Ethernet core, which works at 10/100/1000Mb/s, to the MicroBlaze. The TRNG outputs random bits at 10Mb/s, so the MicroBlaze can read the random bits from the TRNG and send them through the Ethernet port on the FPGA board to the host computer for testing.
C. Cryptographic Key Generation
The output of the TRNG is a single bit per cycle, but AES encryption and decryption require a 128-bit key. To generate the 128-bit key from the single-bit TRNG output, a key generator is employed, shown in Figure 6.
The key generator uses a single-bit-in, 128-bit-out shift register. The shift register contains 128 concatenated flip-flops, each of which stores one bit of the 128-bit symmetric key. The TRNG generates only one random bit per clock cycle, and its output is connected to the input of the shift register, so it takes the shift register 128 clock cycles to accumulate 128 random bits. These 128 random bits are then latched into a 128-bit register, which outputs them as a key. In addition to the registers, the key generator contains a counter that counts from 0 to 127. When the count reaches 127, it sets a flag signal to 1, indicating that the key is ready. After the key is read, the flag signal is reset to 0 until 128 new bits are available. This allows fresh randomness to be returned immediately if some time has passed since the last read.
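The shift-register-plus-counter behavior can be modeled in software as follows (a sketch with our own naming; the actual design is HDL):

```c
#include <stdint.h>
#include <stdbool.h>

/* Model of the 128-bit key generator: a 1-in/128-out shift register plus a
 * counter that raises 'flag' once 128 fresh TRNG bits have been shifted in. */
typedef struct {
    uint8_t key[16];    /* 128-bit key, first-shifted bit ends up as MSB */
    int     count;      /* bits collected since last key */
    bool    flag;       /* key ready */
} keygen_t;

static void keygen_reset(keygen_t *kg) {
    for (int i = 0; i < 16; i++) kg->key[i] = 0;
    kg->count = 0;
    kg->flag  = false;
}

/* Called once per TRNG clock with one random bit. */
static void keygen_shift(keygen_t *kg, int bit) {
    /* Shift the whole 128-bit register left by one, inserting 'bit'. */
    for (int i = 0; i < 15; i++)
        kg->key[i] = (uint8_t)((kg->key[i] << 1) | (kg->key[i + 1] >> 7));
    kg->key[15] = (uint8_t)((kg->key[15] << 1) | (bit & 1));
    if (++kg->count == 128) {
        kg->flag  = true;   /* key ready after 128 TRNG cycles */
        kg->count = 0;
    }
}
```

At the TRNG's 10MHz clock, the 128 shifts correspond to the 128-cycle key generation latency listed in Table II.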
In this way, generating one key takes 128 clock cycles. The generated key can also be used as a seed to generate more 128-bit keys using the AES engine. The 10MHz working frequency of the TRNG might be a little slow. To obtain a faster key generation speed, more than one TRNG can be used. For example, with two TRNGs, the generated random bits can be stored in two 64-bit shift registers respectively, in which case a key can be generated in only 64 clock cycles and the speed is doubled. However, this comes at the price of more hardware resource utilization.

Fig. 7: Hierarchical key chain.
D. Key Management
Cryptographic keys in the secure OpenSPARC system are critical secrets. In the future, as more secure modules and applications are added to the system, the number of keys may increase, and key management and protection will become a problem. We address this problem by utilizing the concept of a key chain, described in [11].
The key chain is a hierarchical structure that stores all keys of the secure OpenSPARC system in encrypted form, as shown in Figure 7. Each key in the chain is encrypted by its parent key. At the root of the chain is the Root Key. Only a leaf key can be used to encrypt a user's data. This tree structure allows an unlimited number of keys to be stored on the key chain. The Root Key is the most critical and is stored in the Root Key register on the processor chip, which is assumed secure from physical probing. Because all the other keys are in encrypted form, they can be stored in off-chip repositories (external memory) for retrieval and need no special protection. This greatly enhances key security and also reduces the on-chip storage needed for keys.
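The key-chain structure can be sketched as follows. This is a structural illustration only: the XOR "cipher" is a stand-in for AES-128 key wrapping (it is not secure), and all names are ours, not from the design:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of hierarchical key-chain unwrapping.  A child key is stored only
 * in wrapped (encrypted-by-parent) form; walking from the on-chip root key
 * down the chain recovers a leaf key for use. */

#define KEY_BYTES 16

/* Placeholder for AES-128 encryption of a child key under its parent. */
static void toy_wrap(const uint8_t parent[KEY_BYTES],
                     const uint8_t child[KEY_BYTES],
                     uint8_t wrapped[KEY_BYTES]) {
    for (int i = 0; i < KEY_BYTES; i++)
        wrapped[i] = child[i] ^ parent[i];
}

/* Placeholder for the matching AES-128 decryption. */
static void toy_unwrap(const uint8_t parent[KEY_BYTES],
                       const uint8_t wrapped[KEY_BYTES],
                       uint8_t child[KEY_BYTES]) {
    for (int i = 0; i < KEY_BYTES; i++)
        child[i] = wrapped[i] ^ parent[i];
}

/* chain[0] is wrapped by the root key; chain[i] is wrapped by key i-1.
 * Only 'root' lives on-chip; the wrapped chain can sit in external memory. */
static void unwrap_chain(const uint8_t root[KEY_BYTES],
                         const uint8_t chain[][KEY_BYTES], int depth,
                         uint8_t leaf[KEY_BYTES]) {
    uint8_t cur[KEY_BYTES];
    memcpy(cur, root, KEY_BYTES);
    for (int i = 0; i < depth; i++)
        toy_unwrap(cur, chain[i], cur);
    memcpy(leaf, cur, KEY_BYTES);
}
```

Only the root of the walk must be protected in hardware; every other key is recoverable on demand from untrusted storage.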
E. Memory Integrity Verification
Even though off-chip data stored in external memory are encrypted, an adversary can still tamper with them. In ITR or FTR mode, the secure processor performs memory integrity verification on all off-chip data fed into the processor to ensure that they have not been tampered with. The memory integrity verification is realized using a hash tree, shown in Figure 8.
The external memory is divided into data blocks, and each data block is hashed to produce one hash value. These hash values are further hashed to produce their parent hash values (the nodes above them). Thus a hash tree (called a Memory Integrity Tree, MIT) is formed, and at the top of the tree is the root hash. The root hash value is stored in the root hash register on-chip, which is assumed secure. When data are evicted to off-chip memory, the processor performs cache line hashing and updates the non-leaf nodes of the MIT along the path from the leaf node to the root hash. When external data are read in, the processor performs cache line verification along the path from the leaf node to the root hash. Whenever there is a mismatch between the new root hash and the old root hash, it can be asserted that the contents of the external memory have been tampered with, and an exception will be generated.

Fig. 8: Memory Integrity Tree.
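The tree mechanics can be sketched in a few lines. This is a toy model: the FNV-1a hash below stands in for the AES CBC MAC used in the design, and the tree is fixed at four blocks for brevity:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of a two-level memory integrity tree over four data blocks.
 * Leaves hash data blocks, inner nodes hash their children, and the root
 * is compared against the (assumed tamper-proof) on-chip root-hash
 * register. */

#define NBLOCKS 4
#define BLOCK_BYTES 64

/* FNV-1a, as a stand-in for the AES-CBC-MAC hash primitive. */
static uint64_t toy_hash(const void *p, size_t n) {
    const uint8_t *b = (const uint8_t *)p;
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < n; i++) { h ^= b[i]; h *= 1099511628211ULL; }
    return h;
}

static uint64_t hash_pair(uint64_t a, uint64_t b) {
    uint64_t pair[2] = { a, b };
    return toy_hash(pair, sizeof pair);
}

/* Recompute the root hash over all of (untrusted) external memory. */
static uint64_t mit_root(const uint8_t mem[NBLOCKS][BLOCK_BYTES]) {
    uint64_t leaf[NBLOCKS];
    for (int i = 0; i < NBLOCKS; i++)
        leaf[i] = toy_hash(mem[i], BLOCK_BYTES);
    return hash_pair(hash_pair(leaf[0], leaf[1]),
                     hash_pair(leaf[2], leaf[3]));
}
```

In the real design only the nodes on the path from a touched cache line to the root are rehashed, rather than the whole memory as in this sketch.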
The memory integrity tree (MIT) is realized as MicroBlaze firmware in our design, rather than in FPGA fabric. The MIT firmware calls the AES engine to perform the hash algorithm using AES CBC MAC. Using firmware to emulate new hardware has some benefits, one of which is that the firmware can be updated without having to re-synthesize any of the existing components.
V. SYSTEM EVALUATION
This section evaluates our single-chip secure OpenSPARC T1 processor, including its performance and hardware costs. The secure processor is prototyped on a Xilinx Virtex-5 ML505 FPGA board with an XC5VLX110T FPGA chip. We find that the OpenSPARC FPGA platform is relatively easy to use for secure processor prototyping. The synthesis time in the Xilinx ISE environment for the whole secure platform is about 4 hours. However, if only firmware is altered, there is no need to re-synthesize the whole system, and recompiling the firmware takes only several minutes, which saves considerable time.
A. Encryption/Decryption Performance
In CONF and FTR mode, the secure processor has to perform encryption and decryption operations. As noted in Table I, the secure OpenSPARC system has multiple clock domains. It takes the AES engine 97 clock cycles to encrypt 64 bytes of input plaintext, and 141 clock cycles to decrypt 64 bytes of input ciphertext, where many of these cycles are due to data transfer via the 32-bit FSL bus. The TRNG works at a frequency of 10MHz, and generating one 128-bit key needs 128 TRNG cycles; however, a new key is infrequently needed. Data from the AES engine and the TRNG have to be processed by the MicroBlaze first, and they are fetched by the MicroBlaze at a frequency of 125MHz. The clock cycles needed for all these operations are shown in Table II.

TABLE II: Clock cycles needed for various operations.

Operation                              | Cycles | Cycle frequency
AES encryption of 64-byte block        | 97     | 50MHz (AES cycles)
AES decryption of 64-byte block        | 141    | 50MHz (AES cycles)
128-bit key generation from TRNG       | 128    | 10MHz (TRNG cycles)
128-bit key transfer TRNG → MicroBlaze | 4      | 125MHz (FSL cycles)
128-bit key transfer MicroBlaze → AES  | 4      | 125MHz (FSL cycles)
B. Overall Performance
The STD mode causes no extra performance overhead for the processor. In CONF and FTR mode, the processor has to encrypt/decrypt off-chip data, which incurs performance overhead, but this only happens on cache misses, which are relatively infrequent. For each 64 bytes of data evicted off the secure processor chip, a 97-cycle delay is incurred by the encryption operation. Similarly, for each 64 bytes of off-chip data fed into the processor, a 141-cycle delay is incurred by the decryption operation. This overhead could be reduced if the FSL bus were widened or multiple FSL buses were used. Counter-mode AES could also be used to reduce the effective encryption/decryption latency. In ITR and FTR mode, the processor performs cache line hashing on data evicted to off-chip memory, and cache line verification on data read into the processor, which also causes additional performance overhead.
In the FPGA version of OpenSPARC, the frequency of the OpenSPARC T1 core is 50MHz, which is a bit slow for performance research running large software applications. Regardless, the OpenSPARC platform is still suitable for secure processor research because of its many advantages. In this paper, we mainly focus on how to modify the platform to add security features rather than on performance.
C. Hardware Costs
Our single-chip secure processor is implemented on a Xilinx Virtex-5 XC5VLX110T FPGA. Table III shows the total resources of the FPGA chip and the hardware costs after the security modules are added. Table III shows that the OpenSPARC T1 core takes up most of the slices of the FPGA chip (78%), while the new security modules consume far fewer slices. After both AES and TRNG are added, slice utilization increases by only 10 percentage points, far less than the 78% consumed by the T1 core. In our design, the OpenSPARC T1 microprocessor has been tailored to include only one CPU core; the relative resource utilization of AES and TRNG would be even lower if two or more cores were used.
TABLE III: Logic utilization of the single-chip secure OpenSPARC processor on a Xilinx Virtex-5 FPGA

Module                               | Slice       | LUT         | Register    | BRAM
Virtex-5 FPGA XC5VLX110T (total)     | 17280       | 69120       | 69120       | 148
OpenSPARC T1                         | 13561 (78%) | 40270 (58%) | 30087 (43%) | 119 (80%)
OpenSPARC T1 with AES added          | 14030 (81%) | 43174 (62%) | 30945 (44%) | 143 (96%)
OpenSPARC T1 with TRNG added         | 15166 (87%) | 42162 (60%) | 30146 (43%) | 119 (80%)
OpenSPARC T1 with AES and TRNG added | 15181 (88%) | 45030 (65%) | 31004 (44%) | 143 (96%)
Table III also shows that the secure OpenSPARC T1 processor takes up almost all the resources of the Virtex-5 XC5VLX110T FPGA chip. This constrains further development of the system: for example, if more security modules are to be added, or two OpenSPARC T1 cores are desired, the remaining resources may not be enough. One solution to this problem is to move the system to a larger FPGA chip with more logic resources, for example, a Xilinx Virtex-6 or Altera Stratix V FPGA.
VI. CONCLUSIONS
OpenSPARC T1 is an open-source, FPGA-synthesizable general-purpose microprocessor. In this paper, we have described a secure computing model which assumes that only the on-chip environment is secure from physical attacks, and proposed a single-chip secure processor architecture. Further, we have prototyped the secure OpenSPARC T1 processor on an FPGA and evaluated the resulting system. The new security modules added to the OpenSPARC system incur little extra hardware cost and performance overhead.
By prototyping the secure OpenSPARC T1 processor, we find that the OpenSPARC FPGA platform has many advantages for secure processor research and prototyping: the ability to modify real hardware, ease of modification due to the emulated cache, the ability to run a commodity OS and benchmarks, the availability of an open-source hypervisor, etc. The low clock frequency of the stock OpenSPARC T1 processor and the high overhead of data transfers over the 32-bit FSL bus affect the performance of the prototype. Hence, performance monitoring, performance estimation and performance improvement are fruitful areas for further research.
In this work, we have added the AES engine, TRNG and MIT to the OpenSPARC platform. More security modules can be added to further enhance its security features. On the other hand, high-level applications can be developed to make use of these security modules. In summary, we find that the OpenSPARC FPGA platform is relatively easy to use for secure processor prototyping, with many advantages including its advanced software and hardware platform components.
ACKNOWLEDGMENT
This work is supported in part by NSF CCF-0917134 and NSF EEC-0540832, and by the City University of Hong Kong Start-up Grant 7200179.
Abstract— Traditional use of software and hardware simulators and emulators has been in efforts for chip-level analysis and verification. However, prototyping and bringup requirements often demand system- or platform-level integration and analysis, requiring new uses of these traditional pre-silicon methods along with novel interpretations of existing hardware to prototype functions matching the behaviors of future systems. In order to demonstrate the versatility and breadth of the pre-silicon environments in our systems lab, ranging from functional instruction set software simulators to Field Programmable Gate Array (FPGA) chip logic implementations to integrated systems of existing hardware built to mimic key functional aspects of future platforms, we present our experiences with platform-level verification, analysis and early software development/enablement for an I/O-attached network appliance system. More specifically, we show how simulation tools along with these early prototype systems were used to do chip-level verification, early software development and even system-level software testing for a System on a Chip processor attached as an I/O accelerator via Peripheral Component Interconnect Express (PCI Express) to a host system. Our experiences demonstrate that leveraging the full range of pre-silicon environment capabilities results in full system-level integrated software test for an I/O-attached platform prior to the availability of fully functional ASICs.

Index Terms—Software debugging, software prototyping, accelerator architectures, product engineering, system analysis and design.
I. INTRODUCTION
WORKLOAD optimized computer systems represent an integrated approach to hardware and software development to achieve maximum performance from an available footprint, power or cost metric. For many of these systems, general purpose processors are integrated with purpose-built processors to balance ease of programming and implementation with integrated acceleration. When this design occurs in a single ASIC we refer to this as a System On a Chip (SoC) architecture.

Copyright 978-1-4577-0660-8/11/$26.00 ©2011 IEEE

O. Callanan, A. Castelfranco, E. Creedon, K. Muller, B. Purcell and M. Purcell are with IBM Ireland Product Dist. Ltd., Muddart, Ireland (email: {owen.callanan, antonino_castelfranco, eoin.creedon, kay.muller, brianpurcell, mark_purcell}@ie.ibm.com).

C.H. Crawford, S. Lekuch, M. Nutter, and H. Penner are with the IBM TJ Watson Research Center in Yorktown Heights, NY (email: {catcraw, scottl, hpenner}@us.ibm.com).

J. Xenidis is with the IBM Austin Research Lab in Austin, TX (email: [email protected]).
One example of the SoC architecture is the IBM Power
Edge of NetworkTM (PowerENTM) processor. This processor
was designed to handle wirespeed, network facing
applications. It consists of 64 PowerPCTM cores and a set of
accelerators which exist as first class units in the memory
subsystems on the chip. PowerENTM includes compression/decompression, encryption/decryption, regular expression (RegX pattern matching) and Extensible Markup Language (XML) accelerators, all connected via a high speed bus with an integrated Host Ethernet Adapter (HEA).
(For a complete review of the PowerENTM architecture see
[1]). The combination of integrated I/O with the accelerators
and a massively multithreaded (MMT) capability targeting
many sessions of parallel processing is ideally suited for many
edge of network applications [2]. However, for some solutions and workloads, where required libraries or components are unavailable or demand significant single-threaded performance, implementing an entire end-to-end application on PowerENTM is not possible, and a hybrid solution is warranted.
The integration of general purpose processors with special
purpose accelerators has become a mainstay in performance
sensitive computing solutions. In the high performance
computing segment, hybrid architectures have been used to
create systems with over a petaflop of performance using Cell
Broadband Architecture [3] and GPUs [4] connected to x86
ISA based “hosts”. In other technical computing systems
FPGAs have been employed to provide application specific
acceleration [5]. Accelerated systems exist outside of
technical and high performance computing as well. Today,
there are a variety of vendors offering both ASIC and FPGA
based I/O attached accelerators for TCP/IP offload [6],
security [7], and financial data processing for real time trading
[8], exactly some of the workloads for which PowerENTM was
targeted. In all of these systems, the accelerators were connected to the hosts via PCI Express, allowing for a variety of choices for both the host platform and operating system, and so we also chose PCI Express for PowerENTM accelerated systems.

A Study in Rapid Prototyping: Leveraging Software and Hardware Simulation Tools in the Bringup of System-on-a-Chip Based Platforms

O. Callanan, A. Castelfranco, C.H. Crawford, E. Creedon, S. Lekuch, K. Muller, M. Nutter, H. Penner, B. Purcell, M. Purcell and J. Xenidis
There are also a variety of programming models to support
these hybrid computing systems. Flexible runtimes such as
OpenCL [9] and CUDA [10] provide versatile language
bindings for a variety of applications. Runtimes derived from
accelerator hierarchies, clusters of hybrid systems or a hybrid
system built from x86_64 hosts and Cell Broadband Engine
accelerators, have also been developed and show another level
of flexibility of heterogeneous programming [11]. Other vendors build libraries for very specific functions on the accelerator platform, and these libraries are then linked in with customer applications. In any of these approaches, fast and
reliable communication is required between the host and the
device both for latency (e.g. synchronization) and throughput
(e.g. bulk data transfer) driven communication (see for
instance [11]). For PowerENTM accelerated systems, we are
especially concerned with the performance of our PCI Express
data transport layer given that our set of targeted workloads
have inherent wirespeed computing requirements.
Designing and developing the appropriate PCI Express
software stack for these performance sensitive applications
requires significant access to systems for implementation and
testing. In fact, verifying the PCI Express hardware features in
PowerENTM in addition to a highly integrated and optimized
user space networking stack from the HEA to the PCI Express
interface requires testing from hardware functional tests to
software unit tests all the way to full system level stress tests.
In order to verify hardware designs and logic implementations,
as well as develop and test software and stress test the system
without jeopardizing time to market for our solution, we
leveraged a full range of pre-silicon environments. Our
challenge was to find the right mix of tools and prototype
environments to efficiently develop, debug and test for what
would eventually be a complex, heterogeneous system.
This paper is organized as follows. In section 2, we
describe the PCI Express hardware implementation on
PowerENTM and the software stack which supported PCI
Express connected PowerENTM to an x86 based Linux host.
We review our instruction set simulation tools along with some
prototype environments we used to develop and test the
software in section 3. The hardware verification via HDL
simulation and FPGA implementation is described in section
4. We demonstrate the flexibility of all these environments
with a review of our comprehensive “pre-silicon” test
development, including performance and system stress tests in
section 5. We conclude the paper with a section to summarize
the value that the simulation tools and prototype environments
provide for complex heterogeneous systems along with some
of our future work plans in terms of application design and
development.
II. THE POWERENTM PCI EXPRESS FUNCTION
A. Hardware
The PowerENTM architecture incorporated two PCIe ports.
The two ports share up to 17 PCI Express Generation 2 lanes,
with each lane providing a data traffic bandwidth of 500 MB/s.
Additionally, PCI Express Port 0 can be freely configured as a PCI Express Host Bridge or as a PCI Express endpoint. In the
context of the hybrid appliance architecture, the endpoint
configuration is of interest, since it enables a PowerENTM chip
to be attached to a host as a device. However, as we will show
later in this paper, the ability to have the loopback
configuration (Port 0 endpoint, Port 1 root complex) allowed
PowerENTM chip logic to operate both functions at the same
time, i.e. be a PCI Express host server and be a PCI Express
device, and provide an additional hardware verification and
software test configuration for bring-up.
The PowerENTM PCI Express endpoint configuration
provides a set of PCI Express Gen 2 features which gives a host a wide variety of useful capabilities. Most prominently, it supports SR-IOV [12], with 2 physical functions (PFs) and 16 virtual functions (VFs). This feature enables the
development of device drivers on the host which exploit the
PowerENTM in a virtual environment -- multiple device drivers
can operate independently with hardware isolation. As an
example that will be described in more detail in the next
section, PF0 operates a virtual Ethernet driver whereas PF1 is used as the platform for a userspace enablement driver. In future
exploitations, the 16 virtual functions can be used for many
instances of PCIe device drivers and corresponding device
features (embedded analytics for instance), providing one
instance per logical partition or user space process of the host.
The PCI Express endpoint DMA engine in the PowerENTM
architecture is connected to the main bus (PowerBus or PBus)
on the chip as a coprocessor. As such, just as all other
coprocessors on the PowerENTM chip (e.g. crypto), it has a
PBus Interface Controller (PBIC) TLB, which provides mapping between bus addresses and virtual user space addresses, and a DMA engine. Because it is a coprocessor, the
DMA engine can be operated in userspace with coprocessor
initiate (icswx) commands along with the hardware protection
features of the PBIC TLB. Furthermore, by using an IOMMU
on the host side, the PCI Express addresses can be translated
into user space addresses, providing a user space to user space
communication path from PowerENTM to the host.
B. Software
Given the rich PowerENTM hardware features to support
user space PCI Express communication, the software stack
needed to provide corresponding function for high-speed
communications between userspace applications on the host
and device. In other words, our PCI Express software infrastructure does not require kernel involvement in packet transfers, unlike standard socket-style systems. In order to
enable such high speed communication, a kernel device driver
is required to provide low-level access to the PCIe device.
That is to say, after initial probing and provision of features
(i.e. DMA engine, shared memory areas etc.) to userspace, the
kernel device driver involvement is minimal.
Since the PCI Express DMA engine is a PowerENTM
coprocessor, we created a thin abstraction interface to initiate
coprocessor requests. We called this interface libarb. This
abstraction interface also provided requisite memory
registration functions for DMA and MMIO between user space
processes. We implemented libarb on both x86 and PowerENTM to provide architectural neutrality for any software developed on top of this layer, so that only this layer needed to be ported across the various simulation and prototype platforms. Since the DMA engine is only on the PowerENTM
device, host-initiated DMA actually first required the DMA command to be transferred via MMIO to the device and remapped to a device-initiated call. Any thread management, locking, state management or endian conversion is done by the callers of libarb. By providing implementations of this
abstraction, we have the ability to swap device drivers in/out
depending on the particular bring up platform we’re using at
the time whilst preserving the integrity of the upper layers.
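As a sketch of how such a swappable abstraction can be structured, the following uses a function-pointer ops table. The mechanism and all names here (arb_ops, shmem_send, run_backend_demo) are our own illustration; the paper does not describe libarb's internals.

```c
/* Sketch: swapping backends (shared memory, PX-CAB, real PCIe)
 * behind a fixed interface. The ops-table mechanism and all names
 * are hypothetical; libarb's real internals are not public. */
#include <stddef.h>
#include <string.h>

struct arb_ops {
    const char *name;
    /* initiate a transfer; returns 0 on success */
    int (*send)(void *dst, const void *src, size_t len);
};

static int shmem_send(void *dst, const void *src, size_t len) {
    memcpy(dst, src, len);       /* shared-memory "DMA" is a copy */
    return 0;
}

static const struct arb_ops shmem_ops = { "shmem", shmem_send };

/* Upper layers (e.g. HAL-d) hold only an arb_ops pointer, so the
 * backend can change without touching them. */
static const struct arb_ops *arb = &shmem_ops;

int run_backend_demo(void) {
    char src[16] = "hello", dst[16] = { 0 };
    if (arb->send(dst, src, sizeof src) != 0)
        return -1;
    return strcmp(dst, "hello") == 0 && strcmp(arb->name, "shmem") == 0;
}
```

Swapping to a different bring-up platform then amounts to pointing `arb` at another ops table, leaving the upper layers untouched.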
Layered on top of this is our communications protocol
library, the Hardware Abstraction Layer for devices (HAL-d).
HAL-d provides the necessary queue infrastructure required
for two sided packet flow as well as the Remote DMA
(RDMA) semantics and support for one sided throughput
oriented bulk transfer. HAL-d is also not thread safe, but it is
thread-friendly. That is to say that we provide for multiple
DMA groups so that individual threads can initiate DMAs
using separate command groups.
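The per-thread command group idea can be sketched as follows. The struct and function names are hypothetical, not the actual HAL-d API; the point is simply that each thread initiates transfers only through its own group, so no locking is needed.

```c
/* Sketch of per-thread DMA command groups in the style the text
 * describes for HAL-d ("thread-friendly" rather than thread-safe).
 * All names (hald_group, worker) are hypothetical illustrations. */
#include <pthread.h>
#include <stddef.h>

#define NGROUPS 4

/* One command group per thread: private state, never shared. */
struct hald_group {
    int initiated;    /* DMA commands issued through this group */
};

static struct hald_group groups[NGROUPS];

/* No locking: each thread only ever touches its own group. */
static void *worker(void *arg) {
    struct hald_group *g = arg;
    for (int i = 0; i < 1000; i++)
        g->initiated++;       /* stands in for initiating one DMA */
    return NULL;
}

int run_groups_demo(void) {
    pthread_t tid[NGROUPS];
    for (int i = 0; i < NGROUPS; i++)
        pthread_create(&tid[i], NULL, worker, &groups[i]);
    for (int i = 0; i < NGROUPS; i++)
        pthread_join(tid[i], NULL);
    int total = 0;
    for (int i = 0; i < NGROUPS; i++)
        total += groups[i].initiated;
    return total;   /* no updates lost, despite no locks */
}
```

Because no group is ever shared between threads, the count is deterministic even though the library itself takes no locks.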
With a full userspace stack in place, we require a way to start host and device userspace applications. Starting host applications is trivial, given that the host is a standard Linux server. However, starting applications on the remote
device is not so straightforward. An intuitive method is to ssh
into the remote system and then start the device’s userspace
application. In order to do this, we have an additional kernel
device driver that essentially provides a virtual Ethernet channel over PCI Express. When this device is probed, a new network interface is instantiated, e.g. eth1, which can be configured using ifconfig. In order to avoid writing yet
another network device driver, we researched the feasibility of
using the virtio infrastructure [13] (particularly virtio_net) to
provide most of the interaction with the kernel’s networking
stack. Although virtio was originally intended for
virtualization, it proved suitable for our purpose. Internally,
this driver also uses the same communications protocol library
for packet flow as the userspace stack, thus providing the
possibility of having a device userspace to host kernel packet
flow.
Although many concepts about MMIO, receive and send
queue management, and RDMA have been well understood in
protocols and network research for decades, actual design and
implementation on new hardware can be a challenge. This is
especially true when working across endian boundaries. To
help facilitate our work on the PowerENTM platforms we used
a variety of pre-silicon environments, many of which are
described in [14]. In the next sections we describe how we
leveraged these tools and environments for PCI Express and
PowerENTM in order to meet our feature and performance
requirements within strict time to market constraints.
III. SOFTWARE DESIGN AND VERIFICATION
A. Shared memory platforms
Using memory areas shared between two processes is an effective way to mimic the behavior of PCI Express communications hardware without needing access to any specific hardware. By mapping shared memory areas into two separate processes running on a single system, with one process running the device-side code whilst the other runs the host-side code, both MMIO and DMA data transfer methods can be simulated on a single system.
Figure 2: A diagram showing the logical description of the HAL-d IOREMAP as described in the text for MMIO based data movement.
MMIO is the simplest to implement; two shared memory areas are created within a kernel module, a device-side MMIO receive area and a host-side MMIO receive area. The device-side area is mapped into the device process as the receive MMIO space, and into the host process as the send MMIO space. Similarly the host-side receive area is mapped to the host process as the receive MMIO space, and to the device process as the send MMIO space. This “IOREMAP” is shown in Figure 2. MMIO data communication is performed by, for example, the host process writing data to its send MMIO space, which will then be visible to the device process in its receive space.
DMA is more complex to simulate due to its asynchronous nature, and because on many systems arbitrary areas of memory can be used as DMA buffers once they have been suitably prepared. The libarb interface is the key to effectively simulating DMA with shared memory. Libarb places two key restrictions on DMA users (e.g. HAL-d):
1. All memory for DMA must be allocated through arb_allocate_buffer.
2. DMA transfers may only be initiated by calling arb_send.
Restriction 1 ensures that only shared memory areas are allocated to the user application as DMA buffers.
This restriction is also useful for the PCIe libarb, since it hides the complexities of creating DMA'able memory from the libarb user. To enable user-space memory copies, libarb also internally maps the "remote" DMA buffers, so when arb_send() is called it simply copies data between the "local" DMA buffer (mapped to the user-space process) and the "remote" DMA buffer, sets the completion struct and then returns. Due to restriction 2 this is invisible to the user-space process, so the behaviour of shared-memory libarb is identical to that of the libarb running on physical PCI Express PowerENTM hardware.
With libarb, a DMA command is issued with arb_send(), and command completion is checked using arb_check_completion(). When run on a standard processor architecture, however, such as a Power or x86 instruction set, the simulated DMA transfers are inherently synchronous; the transfer is performed in software within the calling process' thread of execution. As a result, shared memory on x86 does not properly test the DMA completion monitoring of applications using libarb, since arb_check_completion will always return true when called after arb_send(). On PowerENTM PCI Express hardware the transfer is performed asynchronously by the chip's PCI Express DMA engine, which notifies DMA completion by updating a struct held in the calling process' memory. DMAs may be completed out-of-order and at any time.
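The synchronous-completion behaviour can be illustrated with a minimal shared-memory libarb shim. The function names (arb_allocate_buffer, arb_send, arb_check_completion) follow the text, but the bodies are illustrative stand-ins, not IBM's implementation:

```c
/* Minimal shared-memory libarb shim, illustrating why completion
 * monitoring cannot be tested this way: the "DMA" is a memcpy done
 * in the caller's thread, and the completion struct is set before
 * arb_send() returns. Internals are illustrative stand-ins. */
#include <stdlib.h>
#include <string.h>

struct arb_completion { int done; };

/* In the real shim, buffers come from the kernel module's shared
 * area; plain malloc stands in, since only copy semantics matter. */
static void *arb_allocate_buffer(size_t len) { return malloc(len); }

static void arb_send(void *dst, const void *src, size_t len,
                     struct arb_completion *c) {
    memcpy(dst, src, len);  /* synchronous: done in caller's thread */
    c->done = 1;            /* completion already set on return */
}

static int arb_check_completion(const struct arb_completion *c) {
    return c->done;
}

int run_arb_demo(void) {
    char *local = arb_allocate_buffer(64);
    char *remote = arb_allocate_buffer(64);
    struct arb_completion c = { 0 };
    strcpy(local, "payload");
    arb_send(remote, local, 64, &c);
    /* Unlike real hardware, an in-flight DMA can never be observed. */
    return arb_check_completion(&c) && strcmp(remote, "payload") == 0;
}
```

Since the completion flag is set before arb_send() returns, a caller's completion-polling logic is never exercised, which motivates the ADM-based port described next.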
To provide a more complete test of the DMA path, and to verify the HAL-d stack on PowerENTM's A2 processor architecture, the x86 shared memory libarb was ported to an early version of PowerENTM hardware. This hardware did not have PCI Express endpoint functionality, so to test the DMA completion code arb_send was altered to use the asynchronous data mover (ADM). The ADM is a coprocessor on the PowerENTM chip that asynchronously moves data between locations in PowerENTM memory. Its mode of operation is very similar to that of the PCI Express DMA engine, except that both the source and destination addresses must be in PowerENTM memory. The shared memory drivers were ported to single-chip PowerENTM, and arb_send was enhanced to use the ADM. In this way libarb users can test their completion monitoring more completely, without needing full PowerENTM PCI Express hardware.
B. Hybrid Architecture Simulation
The shared memory device drivers and libarb layer are very useful for early development and test of the user-space software stack; however, they are of little use for pre-silicon development of the kernel-space device drivers. For this, a much more sophisticated simulation environment that simulates at a hardware register level is required. A combination of the PowerENTM version of the IBM Full System Simulator (IBM FSS or Mambo) [15] and an x86 simulator from Wind River called Simics [16] is used to provide this. Similar in concept to the IBM FSS, Simics speeds up system design, software development, deployment and test automation for hardware architectures such as embedded systems, single- and multicore CPUs, complex hybrid architectures and network-connected systems like clusters, racks and distributed systems. It supports several processor families, I/O devices and standard communication protocols. It runs, unmodified, the same binaries that work on real systems on modeled virtual hardware, enabling developers to program, debug and deploy firmware, device drivers, operating systems, middleware and application software.
Debugging and testing are simplified by a user-friendly interface to run, break and stop the execution of the simulator, inspect hardware faults, save the hardware state for later inspection, and capture output when driving the system in batch mode from automation tools.
By appropriately connecting the Simics x86 simulator to the IBM FSS PowerENTM simulator, the PCIe link between an endpoint-mode PowerENTM processor and a root-complex x86 processor was simulated with sufficient accuracy to allow pre-silicon design and implementation of the x86 and PowerENTM device drivers. We used this software to develop two PCI Express device drivers and a middleware layer on top of them. We built two very small images containing a BusyBox Linux OS and a customized Linux kernel, giving an agile environment for kernel development that allowed us to reboot, debug and modify the system quickly in order to investigate and help fix hardware and software issues. We developed a set of bash scripts to create these images automatically, and a set of Simics scripts to test our device drivers using either Linux or Bare Metal Applications (BMAs) [14] on the device.
Furthermore since the Simics/FSS environment runs Linux
on both sides, it was also suitable for test and verification of
the user-space software stack. Using the Simics File System
we were able to mount the partition on the real machine to run
user-space applications. Testing the user-space stack on
Simics/FSS uncovered a number of bugs and problems which
did not show up on the shared memory platform, in particular software bugs related to endian conversion. We were also able to discover bugs in the fully integrated software stack, since the Simics scripts provide a way to run the simulation in batch mode from an external automation tool, our test harness, for continuous integration: automating builds, deployments, unit tests and functional tests.
In particular, Simics provides a way to connect X terminals
using a virtual serial connections that allowed us to stream the
output of the host side and the device side terminal in log files
saved on the host machine and to monitor execution of the
tests to exit gracefully in case of errors during the executions.
The tests are called automatically from an external automation
tool and produce reports about the sanity of the software. The
FSS/Simics environment was our first system level pre-silicon
platform on which we developed and executed a
comprehensive test suite for the PCI Express software and
corresponding applications.
C. Prototype PowerPCTM - x86 Testbed
A third bring up platform was based on existing hardware,
namely the PowerXCellTM 8i based PCI Express accelerator
board, or PX-CAB [17]. The PowerXCellTM 8i is a Cell BE
based processor, and thus has a PowerPCTM based main
processor, just like PowerENTM. The PCI Express DMA
engine logic implemented as a separate ASIC on the PX-CAB
also resembles the PCI Express logic on PowerENTM. Given
these two features of PX-CAB, it is the most closely related
physical platform to our target platform.
There are several benefits to using real hardware for bring
up. Due to the fact that PX-CAB is a PowerPCTM system, its
use immediately highlights any endian correction issues when
attached to an Intel host. It is also an established and marketed
device, thus being a stable platform for test case generation
and execution, allowing us to use it to drive more complete
end-end test case scenarios, for example, using the PX-CAB as
a NIC and performing some packet processing prior to
forwarding packets to the host. Additionally, we were able to
re-use some portions of the existing PX-CAB device driver to
accelerate bring up.
Our PX-CAB stack is identical to our target software stack
at the upper stack layers, the main difference being the actual
device driver and an implementation of our libarb abstraction
layer to interface with it. Also, some development effort was required on the PX-CAB device driver to bring it up to the kernel version of our target software stack, from 2.6.26 to 2.6.32, which gave the development teams experience in moving our driver from one kernel version to another.
One particularly useful aspect of using PX-CAB was to be
able to stress test and performance tune the upper stack layers
relatively early. Especially in relation to our packet flow
communications library, we were able to stress the packet
queues and ensure that there was no unnecessary polling
across the PCI Express bus accessing either the send or receive
queues. By measuring the latency of direct data transfers, it was possible to identify any large discrepancy between these and transfers made using our communications library.
IV. HARDWARE VERIFICATION
A. HDL Simulator
The HDL which described the actual implementation of the
PowerENTM chip was regularly run on the hardware simulator
[14] to ensure the functionality and performance of the
resulting chip. As stated previously PowerENTM has two PCI
Express ports which could be operated in different modes.
This feature enabled us to create an easy test configuration
with only minor changes of the HDL. This was done by
connecting PCI Express Port 1 Root Complex (RC) to the PCI
Express Port 0 Endpoint Complex (EP). In the HDL this basically means connecting the TX lanes of one port to the RX lanes of the other. This configuration enabled a wide range of tests, from running the full initialization code, including the PCI Express scan, to extensive testing of PCI Express EP functionality such as DMA. We were able to find and resolve issues before the ‘real’ chip arrived, significantly improving the quality of the first chip samples.
The major development and testing effort in this environment was the firmware code responsible for the setup of the PCI Express EP and PCI Express RC. For the former a static setting has to be applied, whereas for the latter the initialization code is more complex by the nature of the logic: the PCI Express scan has to do a PCI Express bus walk and configure the devices found along the way. The wrap configuration of our model gave this code the opportunity to find a device, an SR-IOV capable PCI Express EP, which provides the most complex setup of existing PCI Express devices. With this we could not only find bugs in the newly developed SR-IOV capable scan code, but also give feedback to the chip team where the functionality was incorrect or the documentation lacked precision.
Another challenge we addressed in the loopback test setup was the complex process of PCI Express link training. Since the PHY was part of the simulated HDL and the PCI Express lanes were connected after the PHY, we were able to run the link training sequence, and we spent considerable time getting the link up and running. This not only exposed the firmware team to the complexity of link training, but also provided the opportunity, ahead of hardware availability, to have initialization and debug code ready for debugging on the real hardware.
The second task on this environment was the development
of testcases for the PCI Express EP functionality, e.g. DMA,
MMIO access, interrupt handling and SR-IOV. As part of the
firmware a special wrap test was created, which ensured the
functionality of the entire feature set listed above in the wrap
configuration. This test not only discovered bugs in the early phase but also gave us a first performance impression, since the HDL model is cycle accurate. Only through this effort were we able to ensure the full PCI Express functionality of the PowerENTM chip. As a side effect, we created a combined hardware and software team which fully understands the system architecture and was able to do the ‘real’ hardware bringup in a much shorter test cycle than our lab had previously experienced for full platform enablement.
B. FPGA Emulation of the PCI Express Platform
The PowerENTM FPGA Based Emulator (PFBE) architecture
[14] could be used to exercise unique unit functions by
reassigning the logic configurations assigned to the FPGA
units. For example, logic cards connected to the central bus could be assigned to hold additional processing nodes, or be reconfigured to hold multiple instances of either I/O or accelerator units. In one example, multiple instances of a
single accelerator were emulated in the system by re-assigning
multiple instances of the accelerator to FPGAs originally
provisioned to be used for other accelerator units. Having
multiple instances of the accelerator in the FPGA logic created
a platform that enabled the software team to exercise advanced
unit to unit SMP communication functions of the accelerator
engines well in advance of the availability of the final ASIC
mounted in an SMP configuration. In a second example, a
unique PCI Express to PCI Express wrap logic block was
created to bridge two controllers in the PCI Express logic
chiplet. This wrap logic block removed a dependency on any
external physical I/O by directly connecting the PIPE interface
of one PCI Express unit configured as a host controller, to the
PIPE interface of a PCI Express unit configured as an endpoint.
By directly connecting the FPGA logic units, without any
physical external connection, all of the logic could be run at
the same relative system speed, eliminating external real time
dependencies. This combination of a root complex and an endpoint wrapped together in the FPGA system
enabled the software development teams to develop the drivers
for early root and endpoint logic function. As a result of this
early pre-silicon work, the software team was able to
demonstrate PCI Express driver function on the chip within
one week of receiving the physical chip.
V. TEST DEVELOPMENT AND MULTI-PLATFORM EXECUTION
Given the variety of function capabilities in our pre-silicon
environment, we were able to develop a comprehensive
platform test suite to validate function, measure performance,
and stress both the hardware and software of the x86-host/PowerENTM-device environment. To achieve a broad range of
test execution on multiple pre-silicon platforms a common test
framework was developed which could be utilized across all
target test platforms. Our test framework began with the
various unit and function level testing of the A2 cores and
corresponding memory subsystem. For this, we leveraged the
Linux Test Program (LTP) suite [18], augmented with
lmbench [19], along with some microbenchmarks that we
developed internally which exercised some of the key features
of a highly multithreaded indexing code, e.g. a pointer chase
test. We then added component level testing for each of the
coprocessors to stress thread concurrency. Finally, at the chip
level we added user level networking. Where possible, similar
tests were run in the BMA environment to measure software overheads and determine whether any errors found were introduced by our own PBIC address management code.
As part of system level testing, test suites have been
developed to test the MMIO, DMA and virtual ethernet
channels of the PCI Express software stack for PowerENTM.
As it is difficult to predict how a user will utilize HAL-d, e.g.
what combinations of DMA vs. MMIO, data sizes, address
offsets, etc., we had to develop a test harness with which one
could easily change the data patterns and simulate multi-
threaded and multi-process use cases, as well as the standard
single-thread single-process use case. The test driver was
designed primarily to enable the easy, robust and extendable
testing of HAL-d functionality. It also provides monitoring and
reporting functionality. Monitoring is especially important in
utilities such as HAL-d where blocking communication plays a
role. Should a test fail during the communication or hang, the
monitoring function has the ability to terminate the test after a
user defined timeout. Defect identification is assisted by the
reporting functionality of the framework. This functionality
can provide detailed descriptions of the type of failure
encountered and also generate its own backtrace allowing for
quick identification of any issues should they arise.
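A monitoring function of this kind can be sketched as a simple watchdog. The sketch below is illustrative only; the class and method names are ours, not the actual HAL-d test driver API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative watchdog: runs a (possibly blocking) test body and
// terminates it after a user-defined timeout. Names are hypothetical.
public class TestMonitor {
    // Returns true if the test completed in time, false if it hung
    // (and was interrupted) or failed with an exception.
    public static boolean runWithTimeout(Runnable test, long timeoutMs) {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        Future<?> future = exec.submit(test);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true);   // interrupt the hung test thread
            return false;
        } catch (Exception e) {
            return false;          // the test body raised an error
        } finally {
            exec.shutdownNow();
        }
    }
}
```

A caller would wrap each blocking communication test in `runWithTimeout` and report a timeout as a distinct failure type for the defect report.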
This test framework provides uniform test execution across all target test platforms, simplifying results comparison, issue identification and status reporting. For
instance, we found several scenarios where the software tests
would pass in the functional simulator environment, but the
same software tests would fail on the PFBE “loopback”
system. Upon further inspection and discussions with the
hardware teams, we discovered that the problem was errors in
the hardware documentation when compared with the
implementation. Because we could isolate this to a functional
level on the PFBE system well ahead of silicon delivery, this
helped us considerably when actual hardware arrived. In
comparison to testing on the simulated hardware, in which test
execution can take much longer than on real hardware, shared
memory testing provides real world execution times of the PCI
Express software stack on the target architecture. This allows
for more efficient prototyping of the test framework as both
host and device processes are resident on a single machine and
the tests themselves are running at current microprocessor
speeds. Along with PX-CAB, which provided fast execution
time for developing tests around endian issues, these systems
provided the test team with a complete test execution
environment prior to the release of PowerENTM. Using the
different pre-silicon platforms, satisfactory test coverage of the
target architecture can be achieved.
We were also able to leverage the wide variety of our pre-
silicon platforms to create performance models of the PCI
Express hardware and software. As mentioned previously, the
hardware performance model was developed using the
awanstar cycle accurate simulator. On top of that we took
measurements on both shared memory x86 as well as the PX-
CAB environment to develop estimates of the HAL-d software
overhead. For PowerENTM, we took measurements for both
pure shared memory and the asynchronous data mover (ADM)
for DMA. Interestingly enough, these measurements showed
performance variations depending upon the size of individual
transfers as well as number of transfers before synchronizing
when using the ADM. Upon further investigation we found
that this was an artifact of the ADM hardware architecture and
not something we would necessarily find in the PCI Express
DMA engine. We adjusted the performance models
accordingly. In the end, the results of this suite and model have been used to identify performance bottlenecks and to tune the PCI Express software stack to remove them. Finally, to gauge the HAL-
d and PCI Express software stack performance both user space
and PCI Express link level applications were used.
Differences between the applications reflect the overhead of
HAL-d and the PCI Express software stack, allowing the
performance model to be created independent of the execution
platform.
VI. DISCUSSION AND FUTURE WORK
A PCI Express attached network processor to an x86 based
host allows for a variety of applications, ranging from
intrusion detection systems to financial market data feed
handlers to sensor network data aggregation and filtering.
Many of these applications target the hybrid architectures to
gain substantive performance benefit while maintaining the
general programmability and library availability of the host.
The hybrid approach also allows a development path in which
portions of the code can be ported while others remain on x86
host – allowing for greater experimentation as well as a
progressive and limited-risk process. For all the reasons noted above, the hybrid computing approach is attractive, especially one in which an integrated, highly programmable and powerful network processor such as PowerENTM is used.
In order to enable PowerENTM’s PCI Express hardware
capabilities and the corresponding high performance
applications, we implemented a low latency, high throughput
PCI Express software stack, containing both MMIO and
RDMA based programming interfaces. This software stack
was designed with the requirements of application data planes
in mind. Since many applications also require control plane
operations, such as logging, heartbeating, etc., we also
implemented a virtual Ethernet over PCI Express stack which
allowed for standard socket based programming between the
x86 host process and the PowerENTM based device process.
Therefore application developers have a choice of interfaces to
leverage when optimizing with the PowerENTM PCI Express
system.
To gain the greatest benefit from workload optimized
application development on combined general purpose and
purpose built systems, the software application and hardware
platform, including the processor, needs to be designed and
developed in an integrated process. This integrated
development process implies that iterative hardware designs
should be considered as applications are ported and tuned.
Therefore, the availability of system-level pre-silicon environments for hardware verification and test, along with software development, is crucial not just from a time-to-market perspective but also from a performance optimization perspective. In this paper we have presented a variety
of pre-silicon environments which were used to validate
hardware and software designs for the new PowerENTM
processor in a hybrid system.
Traditionally, pre-silicon environments have been used at
the microprocessor level. System level design and
development work, including system test, required the
existence of silicon and early hardware. Our goals were to
design and prototype at the system level from processor unit
verification all the way to system software design and
enablement to fully integrated stress tests. In order to
accomplish this we had to include standard software
instruction set simulators, HDL simulators, a novel approach
to using FPGAs to emulate actual chip logic VHDL at
microprocessor speeds, and even some early hardware
“prototype” environments. The various stages and
requirements of integrated development were carefully
reviewed and the various pre-silicon environments were
chosen for appropriate and efficient verification, development
and debugging. For instance, much of our PCI Express
software stack early design and development was done in
shared memory and then on prototypes mimicking tightly coupled
x86 and PowerPCTM systems. Once the PCI Express software
stack was available on the various pre-silicon environments,
build acceptance, functional verification, and even full system,
stress and performance tests were developed. This allowed for
both the full software stack and the various tests to be
available as soon as the real hardware arrived.
Our test suites have been developed not just to exercise the
PCI Express function, but also the entirety of the system used
by applications of interest, e.g. the HEA for packet processing,
the A2 PowerPCTM binaries, the PowerENTM coprocessor
along with moving data to and from the x86 host. We
are currently evaluating a variety of wire speed applications in
the areas mentioned previously in terms of their performance
characteristics and ability to stress the overall system to
integrate into our system test suite. As actual hardware
arrives, having already developed and debugged the core of
the system infrastructure will allow us to focus on the
performance optimizations required to reach new levels of
processing capability in network computing.
ACKNOWLEDGMENT
We are grateful to Nancy Greco and Heather Achilles for
their continued support and guidance throughout all phases of
these projects as well as the members of the Hybrid Systems
Lab, specifically, Heinz Baier, Thomas Hovarth, Ken Inoue,
and Steve Millman.
REFERENCES
[1] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, C. L. Johnson, "Introduction to the wire-speed processor and architecture", IBM Journal of Research and Development, vol. 54, no. 1, paper 3, January/February 2010.
[2] D. P. Lapotin, S. Daijavad, C. L. Johnson, S. W. Hunter, K. Ishizaki, H. Franke, H. D. Achilles, D. P. Dumarot, N. A. Greco, B. Davari, "Workload and network-optimized computing systems", IBM Journal of Research and Development, vol. 54, no. 1, paper 1, pp. 1-12, January/February 2010.
[3] M. Kistler, J. Gunnels, D. Brokenshire, B. Benton, "Petascale computing with accelerators", Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2009), Raleigh, NC, 2009, pp. 241-250.
[4] R. Stone and H. Xin, "Supercomputer leaves competition – and users – in the dust", Science, vol. 330, no. 6005, pp. 746-747, 5 November 2010.
[5] R. Sass, W. V. Kritikos, A. G. Schmidt, S. Beeravolu, P. Beeraka, "Reconfigurable Computing Cluster (RCC) Project: Investigating the Feasibility of FPGA-Based Petascale Computing", Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, IEEE Computer Society, Washington, DC, 2007, pp. 127-140.
[6] "The Unified Wire", Chelsio Communications whitepaper, http://www.chelsio.com/unifiedwire_eng.html.
[7] "Accelerators in Cray's Adaptive Supercomputing", http://rssi.ncsa.illinois.edu/2007/docs/industry/Cray_presentation.pdf.
[8] G. Valente, "Implementing Hardware Accelerated Applications For Market Data and Financial Computations", HPC on Wall Street, New York, NY, September 17, 2007, http://www.lighthouse-partners.com/highperformance/presentations07/Session-7.pdf.
[9] OpenCL, Khronos Group, http://www.khronos.org/opencl/.
[10] CUDA Zone, NVIDIA Corporation, http://www.nvidia.com/object/cuda_home.html.
[11] IBM Corporation, Data Communication and Synchronization Library Programmer's Guide and API Reference, http://publib.boulder.ibm.com/infocenter/systems/scope/syssw/topic/eicck/dacs/DaCS_Prog_Guide_API_v3.1.pdf.
[12] PCI Special Interest Group, "Single Root I/O Virtualization", http://www.pcisig.com/specifications/iov/single_root/.
[13] R. Russell, "virtio: Towards a De-Facto Standard For Virtual I/O Devices", http://portal.acm.org/citation.cfm?id=1400108.
[14] J. Aylward, C. Cox, C. H. Crawford, K. Inoue, S. Lekuch, K. Muller, M. Nutter, H. Penner, K. Schleupen, J. Xenidis, "A Review of Software and System Tools for Hardware Design, Verification and Software Enablement for System-on-a-Chip Architectures", IBM Research report, submitted.
[15] P. Bohrer, M. Elnozahy, A. Gheith, C. Lefurgy, T. Nakra, J. Peterson, R. Rajamony, R. Rockhold, H. Shafi, R. Simpson, E. Speight, K. Sudeep, E. Van Hensbergen, and L. Zhang, "Mambo – A Full System Simulator for the PowerPC Architecture", ACM SIGMETRICS Performance Evaluation Review, vol. 31, no. 4, pp. 8-12, March 2004.
[16] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg, F. Larsson, A. Moestedt, B. Werner, "Simics: A Full System Simulation Platform", IEEE Computer, February 2002.
[17] H. Penner, U. Bacher, J. Kunigk, C. Rund, H. J. Schick, "directCell: Hybrid systems with tightly coupled accelerators", IBM Journal of Research and Development, vol. 53, no. 5, paper 2, 2009.
[18] LTP: Linux Test Project, http://ltp.sourceforge.net/.
[19] LMbench, http://www.bitmover.com/lmbench/.
Rapid automotive bus system synthesis based on communication requirements
Matthias Heinz, Martin Hillenbrand, Kai Klindworth, K.-D. Mueller-Glaser
Karlsruhe Institute of Technology (KIT), Germany
Institute of Information Processing Technology
Email: {heinz, hillenbrand, kai.klindworth, klaus.mueller-glaser}@kit.edu
Abstract—The complexity of modern cars, and with it their electric/electronic architecture (EEA), has increased rapidly during the last years. New applications like driver assistance systems are highly distributed over the network of hardware components. More and more systems share common sensors placed in sensor clusters. This leads to a greater number of mutually connected electronic control units (ECUs) and bus systems. The traditional domain-specific approach of grouping connatural ECUs into one bus system does not necessarily lead to an overall optimal EEA design. We developed a method to automatically determine a network structure based on the communication requirements of ECUs. The EEA model, which is developed during the vehicle development life-cycle, provides all the information we need, such as cycle times and data widths, to build a network of automotive bus systems. We integrated our method into the EEA tool PREEvision to allow rapid investigation of realization alternatives. The relocation of functions from one ECU to another is ideally supported by our method, since we can generate a new network structure fitting the new communication demands within minutes.
I. INTRODUCTION
During the design of a vehicle, several thousand signals between up to 70 ECUs have to be considered [1]. Based on the given requirements, the network designer has to set up a compound of automotive bus systems. Arising from the communication requirements, he has to select the right bus systems for the given bus load. Although not all ECUs in a new vehicle model are newly introduced and former architectures can be taken as reference, the high innovation rate in vehicle electronics leads to numerous new functions in every new car. New concepts, like the use of sensor clusters sharing sensors among several ECUs, also lead to changed design requirements. Moving functions from one ECU to another may require a restructuring of bus systems. New technologies, like radar- or video-based driver assistance systems, impose high requirements on the required data rates. This additionally leads to new connections to the already established ECUs.
Currently the complete bus system architecture is designed by hand. Using a domain approach, where ECUs of a certain application area are bound together, designers try to handle the complexity. In the past, powertrain, chassis and body bus systems were installed to fulfill the communication needs. But new, highly distributed systems such as Adaptive Cruise Control (ACC) or lane keeping span a network of many ECUs and do not clearly fit into this fixed structure. So the grown domain-specific approach does not necessarily lead to the overall optimal design.
To avoid these problems and to speed up the prototyping and development process, we present a method that automatically generates automotive bus system connections between ECUs. The connections are based on the communication requirements of the provided ECUs. To get all necessary information, we employ the data provided in the electric/electronic architecture (EEA) model. Since the EEA model is a crucial part of the overall vehicle design process, the data provided by this model is always up to date and can ideally be taken as reference. This allows us to directly influence the network structure based on the current data requirements. Since our methodology has been implemented as a plugin for the Eclipse-based EEA modeling tool PREEvision, we can automatically create all generated bus systems with one click.
The partitioning of ECUs we use for building our networks has been the basis of hardware/software co-design approaches for a while. The place-and-route algorithms used in tools for reconfigurable hardware devices like Field Programmable Gate Arrays (FPGAs), and also in Very-Large-Scale Integration (VLSI) processes, make extensive use of partitioning techniques [2]. The literature provides many different algorithms for solving such problems. Basic algorithms for Electronic Design Automation (EDA) of electronic devices can be found, e.g., in [3]. We adapted several of these techniques for the partitioning problem described in this paper.
The remainder of this paper is organized in five sections. A short introduction to automotive bus systems and architecture modeling is given in Sections II and III. Section IV presents our approach to ECU partitioning, describing the clustering, the implemented nearness functions, partition optimization and partition merging. Section V describes the verification and test of our approach, followed by conclusions and outlook in Section VI.
II. AUTOMOTIVE BUS SYSTEMS
Current vehicles feature a number of different bus systems fulfilling the diverse communication requirements of the distributed network of interconnected sensors, ECUs and
actuators. Based on their communication bandwidth and application, they are separated into different classes. Currently deployed bus systems are listed in Table I. Since infotainment bus systems have not been designed for open- or closed-loop control, they form their own class.
Class          Data rate         Deployed buses
A              < 25 kbit/s       Local Interconnect Network (LIN) [4]
B              25-125 kbit/s     Controller Area Network (CAN) Class-B [5]
C              125-1000 kbit/s   Controller Area Network (CAN) Class-C [5]
D              > 1 Mbit/s        FlexRay [6]
Infotainment   > 10 Mbit/s       Media Oriented Systems Transport (MOST) [7]

TABLE I
OVERVIEW OF AUTOMOTIVE BUS SYSTEMS [8]
III. ELECTRIC/ELECTRONIC ARCHITECTURE MODELING
During the concept phase of a vehicle, the electric/electronic architecture (EEA) is designed. The modeling of such architectures allows balancing the possible realization alternatives and finding an overall design. The tool PREEvision, which is used by leading car manufacturers [9], allows designing such complex architectures, containing up to 800,000 elements for a premium car. To handle this complexity, different perspectives on the model are provided (Fig. 1). The EEA elements required for our method are located in the function network and the component network. Functions feature ports that implement interfaces. An interface describes the data elements exchanged between the participants. A communication requirement, which can be allocated to a port prototype, describes the cycle time of the exchanged data elements. In the component network, ECUs, bus connectors and bus systems are modeled.
IV. ECU PARTITIONING
The following information from the EEA model is utilized to start the ECU partitioning. Elements can be accessed directly in the model using Java code.
• Cycle time given by the communication requirement (PortCommunicationRequirement)
• Type and number of data elements out of interfaces (DataElement)
• Sending ECU (sender)
• Receiving ECU (receiver)
A. Representation as graph
Representing ECUs and their communication requirements as a graph allows solving the partitioning problem with the help of algorithms. In our case, edges represent the exchanged data while nodes represent ECUs. The algorithm partitions the nodes into single networks while trying to reduce the cutting costs between the partitions. Partitioning is a classical problem in computer science and is considered non-deterministic polynomial-time hard (NP-hard). There
Fig. 1. Layered EEA (model perspectives: requirements, function network, networking & communication, power distribution, component description, schematics, wiring harness, grounding concept, topology)
are some well-established algorithms to solve such problems [10], namely Kernighan-Lin, Fiduccia-Mattheyses, Simulated Annealing, Hierarchical Clustering, Evolutionary Algorithms, Integer Linear Programming and Tabu Search.
B. Hierarchical clustering
To build a set of partitions in the first place, we used the Hierarchical Clustering (HC) algorithm. While the nodes of the graph are ECUs, weighted edges indicate the nearness of the ECUs as given by their communication requirements. The edges are undirected because the direction of information exchange does not influence the result. The information flow in both directions between the ECUs is summed up in the weight of the edge. Hierarchical Clustering starts with one ECU in each cluster and merges the two partitions with the greatest nearness. This proceeds until only one partition is left. Each group of partitions covering all ECUs forms a possible solution, independent of the solution step in which it appears. This feature is utilized in the succeeding steps.
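The clustering step can be sketched as a plain agglomerative loop. The class below is our illustration, not the PREEvision plugin code; it uses a summed-weight nearness between clusters and records every intermediate partitioning as a candidate solution, as described above.

```java
import java.util.*;

// Minimal sketch of agglomerative clustering over ECUs (names are ours).
// nearness[i][j] holds the symmetric nearness between ECUs i and j.
public class HCluster {
    // Returns the merge history: each step merges the two nearest clusters,
    // and every merged cluster is recorded as a candidate partition.
    public static List<Set<Integer>> cluster(double[][] nearness) {
        List<Set<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < nearness.length; i++)
            clusters.add(new HashSet<>(Set.of(i)));
        List<Set<Integer>> history = new ArrayList<>();
        while (clusters.size() > 1) {
            int bi = 0, bj = 1;
            double best = -1;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double n = linkNearness(clusters.get(i), clusters.get(j), nearness);
                    if (n > best) { best = n; bi = i; bj = j; }
                }
            clusters.get(bi).addAll(clusters.remove(bj));
            history.add(new HashSet<>(clusters.get(bi)));
        }
        return history;
    }

    // Summed pairwise nearness between two clusters, mirroring the summed
    // bidirectional traffic on the undirected edges described in the text.
    static double linkNearness(Set<Integer> a, Set<Integer> b, double[][] w) {
        double s = 0;
        for (int x : a) for (int y : b) s += w[x][y];
        return s;
    }
}
```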
To find the best overall solution, the costs for each determined solution have to be calculated. The costs for one partition are composed of the following parts:
• Bus system costs: the costs for establishing a bus system
• Bus participant costs: calculated for each bus member
• Gateway costs: calculated if there is data transfer to other partitions
• Bytes/s of external data transfer: covers the costs for the data that must be transferred through the gateway
The overall cost of a partition is the cheapest sum of these single costs, calculated over all possible bus systems (LIN, CAN, FlexRay). If no bus system can fulfill the communication requirements, the algorithm returns an error message.
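Using the fictitious example costs given later in this section (High-speed CAN bus 200 and device 10, FlexRay bus 300 and device 25, gateway and external-transfer costs zero), the per-partition cost can be sketched as below. The 250 kbit/s feasibility bound for High-speed CAN is our assumption, chosen only so that this sketch reproduces the choices in Table II; the actual implementation derives feasibility from the communication requirements.

```java
// Sketch of the per-partition cost calculation, using the paper's
// fictitious example costs. Gateway and external costs are zero here,
// as in the example; the HS-CAN traffic bound is an assumption.
public class PartitionCost {
    static final double HS_CAN_MAX_KBIT = 250;   // assumed usable bus load

    // Cheapest feasible bus for a partition with the given device count
    // and aggregate traffic; FlexRay is assumed always feasible here.
    public static int cost(int devices, double trafficKbit) {
        int flexRay = 300 + 25 * devices;
        int hsCan = 200 + 10 * devices;
        return trafficKbit <= HS_CAN_MAX_KBIT ? Math.min(hsCan, flexRay) : flexRay;
    }
}
```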
Fig. 2. Partition tree after Hierarchical clustering (seven partitions; 660 and 410 kbit/s of traffic on FlexRay; 250, 180, 230, 100 and 150 kbit/s on High-speed CAN)
To find the overall best solution for the whole HC tree, the algorithm (Fig. 3) processes the tree, beginning at the top. In the first step, the costs for the currently selected partition are calculated. Afterwards, the costs for the child partitions are calculated and their sum is compared to the own costs. The cheaper solution is then taken. The algorithm thus steps recursively through the tree and determines the cheapest solution over all possible partitions.
Using the graph in Fig. 2, the algorithm returns the solution in Table II. To keep the overview simple, we set the gateway and external data costs to zero for this example. As fictitious costs, we set a High-speed CAN bus to 200, a HS-CAN device to 10, a FlexRay bus to 300, and a FlexRay device to 25. Partitions 3, 4 and 5 emerge as the cheapest solution.
Since the HC algorithm does not take the available data rate of a bus system into account during partitioning, inappropriate partition sizes can appear (Fig. 4). To overcome this issue, an additional bin packing algorithm has been implemented to merge underutilized partitions.
Data: Partition part
Result: costs c, list of partitions list
myCost := costOfPartition(part)
mySolution := {part}
childrenCost := 0
childrenSolution := ∅
while child := part.nextChild do
    childrenCost += cheapestSolution(child).c
    childrenSolution := childrenSolution ∪ cheapestSolution(child).list
if childrenSolution ≠ ∅ and childrenCost < myCost then
    return (childrenCost, childrenSolution)
else
    return (myCost, mySolution)

Fig. 3. Cheapest solution algorithm
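The recursion of Fig. 3 can be sketched in Java as follows. This is our illustration, not the plugin implementation: node costs are supplied directly (the Table II values) rather than computed from bus parameters.

```java
import java.util.*;

// Sketch of the recursive cheapest-solution traversal over the HC tree.
// Each node carries its own partition cost; the recursion keeps whichever
// is cheaper: the node itself or the sum of its children's best solutions.
public class CheapestSolution {
    static class Node {
        int id; int cost;
        List<Node> children = new ArrayList<>();
        Node(int id, int cost) { this.id = id; this.cost = cost; }
    }
    static class Result { int cost; List<Integer> parts = new ArrayList<>(); }

    public static Result cheapest(Node part) {
        Result mine = new Result();
        mine.cost = part.cost;
        mine.parts.add(part.id);
        if (part.children.isEmpty()) return mine;
        Result kids = new Result();
        for (Node c : part.children) {
            Result r = cheapest(c);
            kids.cost += r.cost;
            kids.parts.addAll(r.parts);
        }
        return kids.cost < mine.cost ? kids : mine;  // keep the cheaper cut
    }
}
```

On the Table II tree this recursion selects partitions 3, 4 and 5 at a total cost of 860.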
Partition   Bus system                   Own costs / Costs of children
1           FlexRay                      950 / 860
2           FlexRay                      525 / 490
3           High-speed CAN               370 / 570
4           High-speed CAN (5 devices)   250 / -
5           High-speed CAN (4 devices)   240 / -
6           High-speed CAN (8 devices)   280 / -
7           High-speed CAN (9 devices)   290 / -

TABLE II
CALCULATED COSTS FOR THE GRAPH IN FIGURE 2
Fig. 4. Inappropriate partitioning (Low-speed CAN buses at 90% and 30% bus load attached to High-speed CAN buses; bus load is approx. 50% of the 125 kbit/s data rate for Low-speed CAN)
C. Nearness function
To execute the HC algorithm, a nearness function has to be implemented. The obvious idea of taking the absolute data rate between the partitions turned out to be inapplicable. A typical network consists of unequally fast participants. An algorithm only taking the data rate into account would start by merging the fastest participants. A big partition dominating the others would emerge. Since it is very likely that leftover partitions featuring only one ECU also have a high nearness to this big partition, they would be added to it one after another. This leads to a degenerate HC tree. This behavior is not desired, since the slower participants have no chance to build their own network with minor bus requirements.
During development it turned out that different nearness functions lead to different results.
For the first nearness function, which we call "relative nearness", the data transfer of the current partition to another partition is divided by the overall transfer rate of the current partition. This shows the percentage of the communication going to other partitions (Fig. 5). It enables partitions communicating strongly with each other to have a high nearness, even if their data rates are low.
To avoid merging slow nodes with faster ones, the smaller of the two values for one connection is always used. In Fig. 5, a nearness of 0.05 would be taken for the shown connection. We call this "bothsided relative nearness".
Because the right node massively lowers the nearness of the connection, and hence the sum over all connections of the left node, for the "weighted nearness" the quotient of one connection to all other connections is calculated again (Fig. 6). The sum of
Fig. 5. Weighting of connections (relative nearness of 0.45 and 0.05 at the two ends of the same connection)
the connections in this example is 0.20 + 0.14 + 0.02 = 0.36. Dividing the single values by 0.36 yields 0.56, 0.39 and 0.06 as new values.
Fig. 6. Balanced weighting of connections (relative/weighted values of 0.20/0.56, 0.14/0.39 and 0.02/0.06 for the three connections)
It turned out that penalizing partitions with many connections leads to a better solution, so we divided the "weighted nearness" by the number of connections. We call this new function "shared weighted nearness".
The software we developed contains all the above-mentioned nearness functions, since the structure of future networks cannot be foreseen. Therefore all functions are calculated and the best solution is taken. Since the optimal solution for a given communication network is unknown, we used randomly generated networks to validate the concepts experimentally.
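The four nearness variants can be sketched as small pure functions. The method names are ours, not the plugin's API; edge weights are the summed bidirectional traffic and the totals are each partition's overall transfer rate, as described above.

```java
// Sketch of the nearness variants from this section (names are ours).
public class Nearness {
    // share of a partition's total traffic that goes over this edge
    public static double relative(double edge, double total) {
        return edge / total;
    }
    // use the smaller share so a slow node is not absorbed by a fast one
    public static double bothsidedRelative(double edge, double totalA, double totalB) {
        return Math.min(relative(edge, totalA), relative(edge, totalB));
    }
    // renormalize one edge's share against all of the node's edge shares
    public static double weighted(double share, double sumOfShares) {
        return share / sumOfShares;
    }
    // additionally penalize nodes with many connections
    public static double sharedWeighted(double share, double sumOfShares, int connections) {
        return weighted(share, sumOfShares) / connections;
    }
}
```

With the Fig. 6 values, `weighted(0.20, 0.36)` reproduces the 0.56 from the example.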
D. Partition optimization
Since the HC algorithm follows a greedy strategy and makes a locally optimal choice at each stage, it does not necessarily lead to the overall best solution. To improve the solution found by HC, we implemented a subsequent Fiduccia-Mattheyses (FM) algorithm. This iterative algorithm, featuring a complexity of O(n), helps to lower the cutting costs between partitions. Its advantage is that, unlike e.g. Kernighan-Lin, it does not require partitions of equal size. A balance criterion can be used to check that the balance between partitions is not too unequal and that the bus load of the bus system is not exceeded.
The implemented FM algorithm comprises the following steps:
1) The gain of all nodes for shifting between the partitions is calculated.
2) A node not violating the balance criterion and holding the highest gain is selected. If several nodes feature the same gain, the one best fitting the balance criterion is selected.
3) This node is added to a list and cannot be moved in further steps during this optimization pass.
4) While not all nodes are in this list, the gain of all connected nodes is recalculated and the algorithm continues with step 2.
5) The shifting sequence is then executed up to the point of highest aggregate gain. If the gain is negative for all steps, we stop and do not shift any nodes. Otherwise, the list of fixed elements is cleared and we start over with step 1.
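A compact single pass over a two-way partition can be sketched as follows. This is our simplification (a size-range balance check and a dense weight matrix), not the tool's implementation: gains are recomputed after each locked move, and only the move prefix with the highest positive cumulative gain is kept.

```java
import java.util.*;

// Sketch of one Fiduccia-Mattheyses pass over a bipartition (names ours).
public class FMPass {
    // w: symmetric edge weights; side[i]: 0/1 partition of node i.
    public static int[] onePass(double[][] w, int[] side, int minSize, int maxSize) {
        int n = w.length;
        boolean[] locked = new boolean[n];
        int[] cur = side.clone();
        List<Integer> moves = new ArrayList<>();
        List<Double> cumGain = new ArrayList<>();
        double total = 0;
        for (int step = 0; step < n; step++) {
            int best = -1;
            double bestGain = Double.NEGATIVE_INFINITY;
            for (int v = 0; v < n; v++) {
                if (locked[v] || violates(cur, v, minSize, maxSize)) continue;
                double g = gain(w, cur, v);
                if (g > bestGain) { bestGain = g; best = v; }
            }
            if (best < 0) break;          // no movable node left
            cur[best] ^= 1;
            locked[best] = true;          // fixed for the rest of this pass
            total += bestGain;
            moves.add(best);
            cumGain.add(total);
        }
        // keep the move prefix with the highest positive cumulative gain
        int cut = -1;
        double bestTotal = 0;
        for (int i = 0; i < cumGain.size(); i++)
            if (cumGain.get(i) > bestTotal) { bestTotal = cumGain.get(i); cut = i; }
        int[] result = side.clone();
        for (int i = 0; i <= cut; i++) result[moves.get(i)] ^= 1;
        return result;
    }

    // gain = external edge weight minus internal edge weight of v
    static double gain(double[][] w, int[] side, int v) {
        double g = 0;
        for (int u = 0; u < w.length; u++)
            if (u != v) g += (side[u] != side[v]) ? w[v][u] : -w[v][u];
        return g;
    }

    // moving v must not leave either side outside [minSize, maxSize]
    static boolean violates(int[] side, int v, int minSize, int maxSize) {
        int from = 0;
        for (int s : side) if (s == side[v]) from++;
        return from - 1 < minSize || (side.length - from) + 1 > maxSize;
    }
}
```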
The first pass of the given algorithm is depicted in Fig. 7. The shown nodes are ECUs, while the dashed lines indicate the limits of the partitions. The numbers below the nodes show the achievable gain when moving the node to the other partition.
Fig. 7. Example FM algorithm, first pass (balance criterion: 1 <= partition size <= 3; cumulative gains after steps 1-4 are 2, 2, 2 and 1; maximum gain at steps 1-3, with step 2 best fitting the balance criterion)
E. Partition merging
After executing the HC and FM algorithms, several partitions are often not filled to 100%. To lower the costs of the overall architecture it is reasonable to merge such partitions onto a single bus. Only partitions featuring the same bus type are merged, since the costs would rise if nodes were shifted to a faster and thus more expensive bus.
The merging of partitions can be considered a bin-packing problem [11]. Our goal is to maximize the filling level of the partitions. Several algorithms are available in the literature to solve the bin-packing problem exactly [12]. We solved the problem using the dynamic programming approach [3].
A challenge specific to this problem is that partitions change their weight when they are packed together. A simple example of this relationship is depicted in Fig. 8. Data transferred between two partitions through a gateway is counted for both partitions. If these partitions are merged, the data transfer is only counted once, since the ECUs can communicate directly, so the utilization of the merged bus is lower than the sum of both parts.
The bin-packing algorithm has been implemented separately for each bus system. Since the dynamic programming approach reuses
Fig. 8. Weight change arising from partition merging
previously calculated blocks to speed up the computation, it suffers from the partition-size problem described above. To minimize this drawback we implemented the following steps:
1) Generate a list of all partitions featuring the same bus system.
2) Create an empty bin featuring the capacity of that bus system.
3) Fill the bin with partitions. Delete used partitions from the list, and add the newly created bin to the list if more than one partition has been packed into it.
4) If unpacked partitions remain, restart with step 1.
Step 3 recalculates the size of the bins and allows another partition to be packed into a bin if there is enough space. This addresses the problem described above concerning size changes after merging partitions.
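The merge loop can be sketched as follows. The partition representation and the `load` helper are illustrative assumptions; the key point is that gateway traffic between merged partitions is counted only once, so a bin's load is recomputed after every merge.

```python
def merge_partitions(partitions, traffic, capacity):
    """Greedily merge partitions of the same bus type onto shared buses.

    partitions: list of sets of ECU names (assumed to fit individually);
    traffic: dict (ecu, ecu) -> data rate in kbit/s;
    capacity: bus capacity in kbit/s. Returns the merged groups.
    """
    def load(group):
        members = set().union(*group)
        # Every flow touching the merged group loads its bus exactly once,
        # so gateway traffic between merged partitions is no longer doubled.
        return sum(rate for (u, v), rate in traffic.items()
                   if u in members or v in members)

    todo = list(partitions)
    result = []
    while todo:
        bin_group = [todo.pop(0)]
        # Recompute the bin's load after every merge: the weights change.
        for part in todo[:]:
            if load(bin_group + [part]) <= capacity:
                bin_group.append(part)
                todo.remove(part)
        result.append(set().union(*bin_group))
    return result
```

On the Fig. 8 example (two single-ECU partitions exchanging 5 kbit/s), the merged group loads its bus with 5 kbit/s instead of loading two buses with 5 kbit/s each, so the merge succeeds whenever the bus capacity allows it.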
For a small number of bus systems it is also possible to use an exact algorithm, which uses more computing time and memory than the dynamic programming approach. The exact algorithm steps through every possible combination of partitions: it calls itself recursively once with the current partition added to the bin and once without it, thus checking all available solutions. The pseudocode is given in Fig. 9.
Data: list of old partitions oldpart, partition list allparts, list position it, used bus bus
Result: list of used partitions

part := oldpart ∪ {allparts(it)}
if it = allparts.last then
    if bus = checkBus(part) then   /* busload of the current bus not exceeded */
        return part
    else
        return oldpart
else
    if bus = checkBus(part) then
        return maxInternalTraffic(rucksack(part, it+1, bus), rucksack(oldpart, it+1, bus))
    else
        return rucksack(oldpart, it+1, bus)
Fig. 9. Exact bin packing algorithm
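The pseudocode of Fig. 9 translates roughly into the following Python. The `fits` and `score` callables are simplified placeholders for `checkBus` and `maxInternalTraffic`; the real busload and traffic computations are not reproduced here.

```python
def rucksack(oldpart, it, allparts, fits, score):
    """Exact recursive packing of partitions into one bus (cf. Fig. 9).

    fits(parts) stands in for checkBus (busload not exceeded);
    score(parts) stands in for the internal-traffic metric being maximized.
    """
    part = oldpart + [allparts[it]]
    if it == len(allparts) - 1:            # last list position
        return part if fits(part) else oldpart
    if not fits(part):                     # current partition cannot be added
        return rucksack(oldpart, it + 1, allparts, fits, score)
    # Branch twice: once with the current partition, once without,
    # and keep the combination with the higher score.
    with_it = rucksack(part, it + 1, allparts, fits, score)
    without_it = rucksack(oldpart, it + 1, allparts, fits, score)
    return max(with_it, without_it, key=score)
```

For partition sizes [3, 4, 2] and a capacity of 6, the recursion enumerates all feasible subsets and returns the best-scoring one, here {4, 2}.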
After finishing the bin packing, it makes sense to run the FM algorithm again, because merged partitions may have a higher nearness to nodes in other partitions. This relationship is depicted in Fig. 10. Before merging, shifting the gray node would cause a loss of 1; after merging the partitions on the right, a gain of 3 can be achieved.
Fig. 10. Additional execution of the FM-algorithm
The whole process now looks as follows:
1) Hierarchical clustering
2) Selecting the best solution in the HC tree
3) Optimizing the cutting costs using the FM algorithm
4) Merging of non-busy partitions using the bin-packing algorithm
5) Optimizing the cutting costs again using the FM algorithm

To enable a rapid exploration of different architecture prototypes, we implemented these steps in a customized metric block in PREEvision (Fig. 11). PREEvision is based on the Eclipse platform and can therefore easily be extended with custom metric blocks. Our metric block consists of Java code that is executed directly in the framework [13].
To start the calculation, we provide a list of ECUs that shall be partitioned and a folder containing the allowed bus systems. This makes it possible to exclude ECUs from bus generation, e.g. to meet non-technical requirements. The same holds for the list of allowed bus systems, which allows certain network types to be included or excluded.
Additional data, e.g. communication requirements, is read directly from the EEA model. This provides all necessary input data for the steps described above. The metric block automatically generates the determined bus connectors and bus systems in the architecture model.
Fig. 11. PREEvision block plugin implementation
V. RESULTS
Since the real communication structure between the functions and ECUs is strictly confidential knowledge of the car manufacturers, no data from a real car was available to
test our approach. As a workaround, we designed a customizable random network generator. This generator features the following settings:
• min/max number of ECUs
• min/max number of connections
• min/max distance between the min/max number of connections
• min/max of the minimum data rate of connections
• min/max of the maximum data rate of connections

Furthermore, we implemented a likelihood that an ECU connects to ECUs of the same block of ten. This means that ECU 16 has a higher likelihood of connecting to ECUs 10-19 than to all others. In addition, the user can set a different data rate for each of these blocks. This helps to verify whether the algorithm correctly detects the ECUs that belong together. The network generator also allows the group size of ECUs belonging together to be set, but this prevents identifying the ECUs that belong together. Another setting allows the groups of adjacent ECUs to be set by hand.
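Such a generator can be sketched in a few lines. The parameter names and the fixed bias value are our own assumptions; the block-of-ten preference is implemented as a simple weighted choice.

```python
import random

def generate_network(n_ecus, n_connections, rate_range, same_block_bias=0.8,
                     seed=None):
    """Random ECU network: connections carry a random data rate, and an ECU
    prefers partners from its own block of ten (ECU 16 favors ECUs 10-19)."""
    rng = random.Random(seed)
    traffic = {}
    while len(traffic) < n_connections:
        a = rng.randrange(n_ecus)
        block = (a // 10) * 10
        if rng.random() < same_block_bias:
            b = rng.randrange(block, min(block + 10, n_ecus))  # same block
        else:
            b = rng.randrange(n_ecus)                          # any ECU
        if a == b or (min(a, b), max(a, b)) in traffic:
            continue  # no self-loops, no duplicate connections
        traffic[(min(a, b), max(a, b))] = rng.uniform(*rate_range)
    return traffic
```

The returned dictionary maps ECU pairs to data rates and can be fed directly into a partitioning benchmark.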
We generated 100 different networks using our network generator. The results are depicted in Fig. 12. While the "shared weighted" nearness appears to be the overall winner of the benchmark, this is only the case for most of the networks. Looking at the standard deviation shows that other nearness functions can also lead to a better solution for a specific network. Because of this, the result of every nearness function is calculated and the best one is selected. The graph in Fig. 12 is based on the best solution found for each network; the deviation of each solution from the best solution found is depicted in percent.
[Bar chart comparing the nearness functions (equal distribution, normal, relative, both-sided relative, weighted, shared weighted) for three variants: without optimization, bin packing, and FM + bin packing.]
Fig. 12. Comparison of implemented nearness functions and algorithms
VI. CONCLUSION AND FUTURE WORK
Our method to automatically partition communicating ECUs onto automotive networks allows different design alternatives to be evaluated rapidly. During the design phase of a vehicle, different architectures are investigated. Automatic ECU partitioning can help the designer to quickly generate a new network prototype when moving function blocks from one ECU to another. With our approach, all bus-system requirements are met. Since a subset of ECUs can be selected, the partitioning can also be executed for a specific set of ECUs only.
Modifications to the automatically generated network may of course be necessary, since political decisions always have to be considered during the design. Nevertheless, our tool can provide a very good starting solution that meets all requirements concerning data rates. The cost function on which the decision for a specific network is based can be set individually by the designer and can thus match the specific calculations of different car manufacturers. The current approach can also easily be extended with new bus systems, since it does not depend on a certain kind of bus.
In the next steps, we will try to improve the selection of bus systems. Currently, a certain bus is selected by a fixed bandwidth value. This could be extended by an in-depth configuration and scheduling for the selected bus, possibly yielding a better bandwidth utilization.
REFERENCES
[1] J. Broy and K. D. Mueller-Glaser, "The impact of time-triggered communication in automotive embedded systems," in Proc. Int. Symposium on Industrial Embedded Systems (SIES '07), Jul. 2007, pp. 353-356.
[2] J. Teich and C. Haubelt, Digitale Hardware/Software-Systeme: Synthese und Optimierung, 2nd ed. Berlin: Springer, 2007.
[3] J. Lienig, Layoutsynthese elektronischer Schaltungen – Grundlegende Algorithmen für die Entwurfsautomatisierung. Berlin: Springer, 2006.
[4] LIN Consortium, LIN Specification Package, revision 2.1, Nov. 2006.
[5] Robert Bosch GmbH, CAN Specification, 2nd ed., Stuttgart, Sep. 1991. [Online]. Available: http://www.semiconductors.bosch.de/pdf/can2spec.pdf
[6] FlexRay Consortium, FlexRay Communications System – Protocol Specification, version 2.1 revision A, Dec. 2005.
[7] MOST Cooperation, MOST Specification, rev. 3.0 E2, Jul. 2010.
[8] W. Zimmermann and R. Schmidgall, Bussysteme in der Fahrzeugtechnik: Protokolle und Standards, 3rd ed. Vieweg+Teubner, Sep. 2008.
[9] aquintos GmbH, E/E-Architekturwerkzeug PREEvision, 2009.
[10] R. Xu and D. Wunsch, Clustering (IEEE Press Series on Computational Intelligence). New York: IEEE Press, 2009.
[11] H. Kellerer, U. Pferschy, and D. Pisinger, Knapsack Problems. Berlin: Springer, 2004.
[12] S. Martello and P. Toth, Knapsack Problems: Algorithms and Computer Implementations. New York: John Wiley & Sons, 1990.
[13] B. Daum, Java-Entwicklung mit Eclipse 3.3: Anwendungen, Plugins und Rich Clients, 5th ed. Heidelberg: dpunkt.verlag, 2008.
978-1-4577-0660-8/11/$26.00 ©2011 IEEE
Abstract — Non-uniform sampling has been shown in several works to be a better scheme than uniform sampling for low-activity signals. With such signals it generates fewer samples, which means less data to process and lower power consumption. In addition, it is well known that asynchronous logic is a low-power technology. This paper deals with the coupling of a non-uniform sampling scheme and an asynchronous design in order to implement a digital filter. It presents the first design of a micro-pipeline asynchronous FIR filter architecture coupled to a non-uniform sampling scheme. The implementation has been done on an Altera FPGA board.
Index Terms — Asynchronous logic, non-uniform sampling,
FIR filter, FPGA.
I. INTRODUCTION
With the increasing system-on-chip complexity, several problems become more and more critical and severely affect the performance of the system. These issues
can take different forms, such as power consumption, clock distribution, electromagnetic emission, etc. Synchronous logic seems to be reaching its technological limits when dealing with these problems, whereas asynchronous logic has proven that it can be a better alternative in many cases. It is well known to have many interesting properties, such as immunity to metastable states [2], low electromagnetic noise emission [15], low power consumption [11][12], high operating speed [13][14], and robustness towards variations in supply voltage, temperature, and fabrication process parameters [16].
Moreover, non-uniform sampling, and especially level-crossing sampling, becomes more interesting and beneficial for specific signals like temperature, pressure, electrocardiograms or speech, which evolve smoothly or sporadically. Indeed, these signals can remain constant over a long period and vary significantly during a short period of time.
Therefore, using the Shannon theory to sample such signals leads to useless samples, which artificially increases the computational load: classical uniform sampling takes samples even if no change occurs in the input signal. The authors in [5] and [6] show how using the non-uniform sampling technique in ADCs leads to drastic power savings compared to Nyquist ADCs.
A new class of ADCs, called asynchronous ADCs (A-
ADCs) has been developed by the TIMA Laboratory [7]. This
A-ADC is based on the combination of a level-crossing
sampling scheme and a dedicated asynchronous logic [8]. The
asynchronous logic only samples digital signals when an event
occurs, i.e. a sample is produced by the A-ADC which
delivers non-uniform data in time. This event-driven
architecture combined with the level-crossing sampling
scheme is able to significantly reduce the dynamic activity of
the signal processing chain.
Many publications on non-uniform sampling are available in the literature but, to the best of our knowledge, none relates to the coupling of event-driven (asynchronous) logic and FIR filter techniques applied to a non-uniform sampling scheme.
This paper presents an asynchronous FIR filter architecture based on a micro-pipeline asynchronous design style, and shows a successful implementation of this architecture on a commercial FPGA board from Altera. The second section of the paper is dedicated to asynchronous logic, and more precisely to one kind of asynchronous circuit: the micro-pipeline. Some details about the A-ADC as well as the non-uniform sampling scheme are shown in the third section. The fourth section handles the asynchronous FIR filter algorithm and architecture. Finally, the fifth section presents the implementation results of the proposed architecture on a DE1 Altera FPGA board.
II. PRINCIPLES OVERVIEW
A. Asynchronous logic
Asynchronous logic is well known for interesting properties that synchronous logic does not have, such as low electromagnetic emission, low power consumption, robustness, etc. [1]. It has been proven that this logic improves the performance of Nyquist ADCs in terms of immunity to metastable states [2], low electromagnetic emission [3] and low power consumption [4]. This section briefly presents the main asynchronous logic principles. It also shows how asynchronous micro-pipeline circuits are built from two distinct parts: the data path and the asynchronous control path.
An event-driven FIR filter: design and implementation
Taha Beyrouthy, Laurent Fesquet
TIMA Laboratory – Concurrent Integrated Systems Group, Grenoble, France
[email protected] – [email protected]
Asynchronous principles
Unlike synchronous logic, where synchronization is based on a global clock signal, asynchronous logic does not need a clock to maintain synchronization between its sub-blocks. It is considered a data-driven logic, where computation occurs only when new data arrives. Each part of an asynchronous circuit establishes a communication protocol with its neighbors in order to exchange data with them. This kind of communication protocol is known as a "handshake" protocol. It is a bidirectional protocol between two blocks, called Sender and Receiver, as shown in Figure 1.
The sender starts the communication cycle by sending a request signal "req" to the receiver. This signal means that data is ready to be sent. The receiver starts the new computation after detecting the "req" signal, and sends back an acknowledge signal "ack" to the sender, marking the end of the communication cycle so that a new one can start.
The main gate used in this kind of protocol is the "Muller" gate, also known as the C-element. Thanks to its properties, it detects a rendezvous between different signals. The C-element is in fact a state-holding gate; Table 1 shows its output behavior.
Consequently, when the output changes from '0' to '1', we may conclude that both inputs are '1'. Similarly, when the output changes from '1' to '0', we may conclude that both inputs are now set to '0'. This behavior can be interpreted as an acknowledgement indicating when both inputs are '1' or '0'. This is why the C-element is extensively used in asynchronous logic and is considered the fundamental component on which the communication protocols are based.
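The state-holding behavior of Table 1 can be modeled in a few lines. This is a behavioral sketch, not a gate-level description; the class name is our own.

```python
class CElement:
    """Muller C-element: the output copies the inputs when they agree and
    holds its previous value when they differ."""

    def __init__(self, initial=0):
        self.out = initial

    def step(self, a, b):
        if a == b:          # both '0' or both '1': copy the common value
            self.out = a
        return self.out     # otherwise: hold the previous output
```

Driving it with req/ack pairs shows the rendezvous behavior: the output rises only once both inputs are high and falls only once both are low.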
Asynchronous Micro-pipeline circuits
Many asynchronous logic styles exist in the literature, and the choice of style affects the circuit implementation (area, speed, power, robustness, etc.). One of the best-known styles is the micro-pipeline style. Among all the asynchronous circuit styles, the micro-pipeline most closely resembles synchronous circuit design, due to its extensive use of timing assumptions [5]. As in a synchronous pipeline circuit, the storage elements are controlled by control signals; nevertheless, there is no global clock. These signals are generated by the Muller gates in the pipeline controlling the storage elements, as shown in Figure 3. A simple asynchronous micro-pipeline circuit can be built by using transparent latches as storage elements, as shown in Figure 4.
The Muller gate pipeline is used to generate the local clocks. The clock pulse generated in a stage overlaps the pulses generated in the neighboring stages in a specific, controlled, interlocked manner (depending on the handshake protocol). This circuit can be seen as an asynchronous data-flow structure composed of two main blocks: the "data path", which is clocked by a distributed gated-clock driver, and the "control path".
Figure 1: Handshake protocol established between two sub-blocks of an asynchronous circuit that need to exchange data with each other
Figure 2: C-Element or Muller gate
Input1  Input2  Output
0       0       0
0       1       previous output
1       0       previous output
1       1       1

Table 1: Truth table of the C-element. The output copies the input value when both inputs are equal, and maintains its previous value when the inputs differ.
Figure 3: Muller pipeline, controlling Latch chain
Figure 4: Micro-pipeline asynchronous circuit
III. ASYNCHRONOUS ANALOG TO DIGITAL CONVERTER – 'A-ADC'
Most real-life signals are time-varying in nature. The spectral content of these signals varies with time, which is a direct consequence of the signal generation process [6]. Synchronous ADCs are based on Nyquist architectures: they do not exploit the input signal variations. Indeed, they sample the signal at a fixed rate, without taking the intrinsic signal nature into account. Moreover, they are highly constrained by the Shannon theory, especially in the case of low-activity sporadic signals like electrocardiograms, seismic signals, etc. This leads to capturing and processing a large number of samples without any relevant information, and a useless increase in system activity and power consumption.
The Asynchronous Analog-to-Digital Converter (AADC)
presented in [7] and [8] is based on a non-uniform sampling
scheme called level-crossing sampling [9]. This system is only
driven by the information present in the input signal. Indeed, it
only reacts to the analog input signal variations.
A. Non-uniform - level crossing sampling
The sampling process strongly affects the performance of the subsequent Digital Signal Processing (DSP) chain. The best performance can be achieved if the signal is efficiently sampled. Several ways exist to sample an analog signal. The classical uniform sampling is well developed and well adapted to existing signal processing devices. Although it covers all existing DSP areas, it is not the best choice for all of them. In many cases, non-uniform sampling can be a better candidate, providing advantages such as reduced system complexity, compression, smarter data transmission and acquisition, etc., which are not attainable with the uniform sampling process.
With our non-uniform sampling scheme, a sample is only
captured when the Continuous Time (CT) input signal x(t)
crosses one of the defined levels (Figure 5).
For an M-bit resolution, 2^M − 1 quantization levels are regularly disposed along the amplitude range of the signal. Unlike classical Nyquist sampling, the samples are not
uniformly spaced out in time, because they depend on the
signal variations. Thus, together with the value of the sample ax_n, the time dtx_n = tx_n − tx_{n−1} is defined. It corresponds to the time elapsed since the previous sample ax_{n−1}. A local timer of period Tc is dedicated to recording dtx_n and delivering it, when necessary, along with ax_n.
Contrary to the usual sampling technique, the amplitude of the sample is known exactly and the time elapsed between two samples is quantized by a timer. The Signal-to-Noise Ratio (SNR) depends on the timer period Tc, and not on the number of quantization levels [8]. Thus, for a given implementation of the non-uniform sampling A/D converter (a fixed number of quantization levels L = 2^M − 1), the SNR can be tuned externally by changing the period Tc of the timer. In theory, for level-crossing sampling, the SNR can be improved as far as needed by reducing Tc [10].
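The sampling scheme can be sketched as follows: a sample (level, dtx_n) is emitted each time the signal crosses a quantization level, with the elapsed time quantized by the timer period Tc. The function name and the discrete-time input representation are illustrative assumptions.

```python
def level_crossing_sample(x, t, q, tc):
    """Emit a sample (level, dtx) each time the signal crosses one of the
    quantization levels (spacing q); dtx is the time since the previous
    sample, quantized by the timer period tc."""
    samples = []
    level = round(x[0] / q)               # start on the nearest level
    last_time = t[0]
    for xi, ti in zip(x[1:], t[1:]):
        while xi >= (level + 1) * q:      # upward crossing
            level += 1
            samples.append((level, round((ti - last_time) / tc) * tc))
            last_time = ti
        while xi <= (level - 1) * q:      # downward crossing
            level -= 1
            samples.append((level, round((ti - last_time) / tc) * tc))
            last_time = ti
    return samples
```

A slow ramp produces one sample per crossed level, while a constant signal produces no samples at all, which is exactly the activity-driven behavior described above.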
B. A-ADC architecture
Let δ be the A-ADC processing delay for one sample. Then proper capture of the signal x(t) must satisfy the tracking condition given by equation (1):

|dx(t)/dt| ≤ q/δ   (1)

where q is the A-ADC quantum, defined by equation (2):

q = E / (2^M − 1)   (2)

where E represents the amplitude dynamics of the A-ADC, and M its resolution.
The output digital value Vnum is converted to Vref by the DAC and compared to the CT input signal x(t) (Figure 6). If the difference is greater than q/2, the counter is incremented; if it is lower than −q/2, the counter is decremented. In all other cases nothing is done, and the output Vnum remains constant.
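The tracking loop of Fig. 6 can be modeled behaviorally as follows; the DAC is simply Vnum·q here, and the function name is our own.

```python
def track(x_samples, q, vnum0=0):
    """Step the counter by +/-1 whenever the input deviates from the DAC
    output Vnum*q by more than half a quantum; otherwise hold Vnum."""
    vnum = vnum0
    trace = []
    for x in x_samples:
        err = x - vnum * q
        if err > q / 2:
            vnum += 1          # counter incremented
        elif err < -q / 2:
            vnum -= 1          # counter decremented
        trace.append(vnum)     # in other cases Vnum remains constant
    return trace
```

As long as the tracking condition (1) holds, Vnum stays within one quantum of the input, and a constant input produces no counter activity.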
The output signal is composed of couples (ax_n, dtx_n), where ax_n is the digital value of the sample and dtx_n the time elapsed
Figure 5: Level crossing sampling. ‘q’ is considered as the A-
ADC quantum.
Figure 6: Block diagram of the A-ADC
since the previous converted sample ax_{n−1}, given by the timer as said before. Since the architecture of the A-ADC is asynchronous, it uses an asynchronous communication protocol (based on 'req' and 'ack' signals) to exchange data with its environment.
IV. ASYNCHRONOUS FIR-FILTER
A. Principles & Algorithm
A synchronous Nth-order FIR filter based on a uniform sampling scheme computes a digital convolution product (equation (3)):

y(nT) = Σ_{k=0}^{N} h(kT) · x((n−k)T)   (3)

where T is the sampling period.

In the non-uniform sampling scheme, the sampling time of the kth sample of the impulse response h does not necessarily correspond to the sampling time of the (n−k)th sample of the input signal ax (equation (4)):

th_k ≠ tx_{n−k}   (4)

The product of two samples is thus meaningless. To bypass this issue, the impulse response h of the filter is resampled and interpolated, as is the input signal ax. The new convolution product is then processed between these new samples (Figure 7).
The new convolution product is an area computation. The easiest way to compute this area is the rectangle method, i.e. a zero-order interpolation. This method splits the area corresponding to the convolution product into a sum of rectangle areas with different widths (Figure 7).
In order to compute each rectangle area, an iterative loop can do the job [10]:

y_n = Σ dt_min · ax_{n−k} · ah_j

If dt_min = dtx_{n−k} then k = k + 1
If dt_min = dth_j then j = j + 1
If dt_min = dtx_{n−k} = dth_j then j = j + 1 and k = k + 1

where dt_min = min(dtx_{n−k}, dth_j).

An example illustrating these iterations is shown in Figure 8.
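The iterative loop can be written out as follows. This is a sketch of the zero-order resampling convolution for a single output value; the flat-list data layout (samples already ordered for this output) is an assumption of ours.

```python
def async_fir(ax, dtx, ah, dth):
    """One non-uniform convolution product using rectangle (zero-order)
    interpolation: walk the input samples (ax, dtx) and the impulse-response
    samples (ah, dth) together, each rectangle being dt_min wide."""
    y = 0.0
    j = k = 0
    rem_h, rem_x = dth[0], dtx[0]      # remaining widths of current intervals
    while j < len(ah) and k < len(ax):
        dt_min = min(rem_h, rem_x)
        y += dt_min * ax[k] * ah[j]    # rectangle area contribution
        rem_h -= dt_min
        rem_x -= dt_min
        if rem_h == 0:                 # coefficient interval consumed: j + 1
            j += 1
            if j < len(ah):
                rem_h = dth[j]
        if rem_x == 0:                 # input interval consumed: k + 1
            k += 1
            if k < len(ax):
                rem_x = dtx[k]
    return y
```

When both interval widths are exhausted at the same time, both indices advance, matching the third case of the loop above; with uniform widths the result reduces to the classical convolution sum.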
B. FIR filter asynchronous micro-pipeline architecture
General architecture
The previously proposed algorithm is implemented with the feedback structure presented in Figure 9. This structure
Figure 7: Principle of the resampling scheme used in the
irregular FIR computation. The continuous lines represent the
original samples, whereas the dashed lines correspond to the
new resampled interpolated samples.
Figure 8 : an example of an asynchronous convolution
product.
describes the architecture of the FIR filter.
The "Delay Line" gets the sampled signal data (ax_n, dtx_n) from the A-ADC. The communication between these two blocks is based on the handshake protocol. The "Delay Line" is the memory block of the filter: a shift register that stores the input samples (magnitudes and time intervals). The output of this register is connected to a multiplexer (not shown) that selects samples depending on the value of the selection input "k". A ROM and another multiplexer (not shown) are used to store the impulse response coefficients; the coefficients are selected by the signal "j".
The "MIN" block has multiple functionalities:
1- It determines the minimum time interval dt_min of (dtx_{n−k}, dth_j).
2- It generates the selection signals 'j' and 'k' that control the selection process in the Delay Line.
3- It detects the end of a convolution product round and allows a new one to start. This functionality is based on monitoring the signal 'k': if 'k' reaches its maximum, all filter coefficients have been used and the convolution product is done; the filter is ready to perform a new one. At this point the MIN block generates a reset signal to the other blocks, commanding them to start a new convolution.
4- Finally, at the end of each convolution product cycle, MIN generates a 'reset' signal and an 'enable' signal to reset the output of the Accumulator and enable the output of the Buffer.
Then the "Multiplier" computes all sub-area values (dt_min · ax_{n−k} · ah_j), which are accumulated in the "Accumulator" in order to compute the convolution product. So far, the structure is a simple and logical translation of the iterative function presented in the previous section. In the micro-pipeline architecture, this part of the circuit is considered the "data path".
The challenge begins with defining the control path of the micro-pipeline of the filter. The control path is in charge of synchronizing the communication between the different parts of the data path, so that they work in complete harmony while exchanging their data.
Once the data and control paths are described, the filter implementation starts on the FPGA. In order to implement the control path, a specific asynchronous library has to be specified. This library contains Muller gates, asynchronous controllers and some other asynchronous functions [17].
As shown in Section II, each functional block has its own controller. For clarity, not all the controllers are presented; as an illustration, the controller of the MIN block is studied below.
Control block of MIN
The simplest block specified for our asynchronous controllers is the 'Linear_control' (Figure 10). It ensures the rendezvous between two incoming signals, 'req' and 'ack'. It has two inputs and two delayed outputs.
The first input represents an input request signal coming from
a previous ‘P’ block connected to the inputs of the MIN block.
The ‘P’ block sends along with its data a request signal to the
MIN block.
The second input is used for the input acknowledge signal that comes from the following block 'F'. The 'F' block receives data to compute from the MIN block and sends back an acknowledge signal; the MIN block is then ready to receive new data.
The Linear Controller also has two outputs. The first one is an output request signal, indicating whether MIN has finished its computation and thus whether new data is ready to be sent. This signal is sent to the controller of
Figure 9: Iterative structure of the asynchronous FIR filter
the 'F' block. The second output is the output acknowledge signal; it is sent to the 'P' block to indicate whether MIN is ready to receive new data.
In the case of the MIN block, it receives inputs from only one 'P' block: the delay line. This means only one 'req' input is sent to its controller. However, the MIN output is connected to more than one block: it is connected to the 'Multiplier', which receives dt_min, and to the 'Accumulator'. The 'Accumulator' receives a 'reset' signal from MIN in order to restart a new accumulation (a new accumulation corresponds to a new convolution product, as mentioned in the description of the MIN functionality).
Finally, one of the MIN outputs is connected to the Buffer input. At the end of each convolution product, the buffer receives an 'enable' signal from MIN in order to transfer the new convolution product value to the output. In conclusion, the MIN outputs are connected to four 'F' blocks. This means that the MIN controller receives four incoming acknowledge signals, one from each 'F' block, and has to wait for these four signals in order to generate new output data. Thus, a rendezvous between these four signals has to be processed. This is done by a block called "join_4", which is implemented with three 2-input Muller gates connected to each other. The MIN block and its controller are shown in Figure 11.
In practice, the MIN block is more complex: it is divided into three sub-blocks, each processing one of the functions previously described, and each sub-block has its own controller. The problem that can appear with multiple blocks connected to each other in a non-linear pipeline is a deadlock. These problems are managed manually, because no asynchronous design tools are available. The controllers of the other blocks are designed following the same steps.
C. Asynchronous FIR-filter implementation results
The micro-pipeline asynchronous FIR filter architecture, as well as part of the A-ADC, is implemented on a synchronous FPGA board: the DE1 from Altera. As mentioned before, a dedicated library had to be specified, because synchronous commercial FPGAs do not natively support asynchronous circuits. Figure 12 shows the simulation of our asynchronous FIR filter after place and route on the Altera FPGA. It is a low-pass FIR filter of 15th order; an input signal varying from 1 kHz to 18 kHz has been injected at its input.
V. CONCLUSION
An asynchronous FIR filter architecture was presented in this paper, along with an asynchronous analog-to-digital converter (A-ADC). The FIR filter architecture is designed using the micro-pipeline asynchronous style. This architecture has been successfully implemented for the first time on a commercial FPGA board (Altera DE1). A specific library has also been designed for this purpose; it allows the synthesis of asynchronous primitive blocks (the control path in our case) on the synchronous FPGA. Simulation results of the FIR filter after place and route validate the implementation. This work is ongoing, in order to optimize the implementation: we expect a very low-power FIR filter, with a reduction of the total power consumption by one order of magnitude.
REFERENCES
[1] M. Renaudin, “Asynchronous Circuits and Systems: a
Promising Design Alternative”, Journal of
Microelectronic Engineering, Vol. 54, pp. 133-149, 2000.
[2] D. Kinniment et al., "Synchronous and Asynchronous A-D Conversion", IEEE Trans. on VLSI Syst., Vol. 8, no. 2, pp. 217-220, April 2000.
[3] D.J. Kinniment et al., “Low Power, Low Noise
Micropipelined Flash A-D Converter”, IEE Proc. On
Circ. Dev. Syst., Vol. 146, n° 5, pp. 263-267, Oct. 1999.
Figure 10: primitive Linear Controller. The delay value depends
on the propagation delay of the functional block.
Figure 11: MIN block and its controller
Figure 12: Asynchronous FIR Filter after P&R simulation
[4] L. Alacoque et al., “An Irregular Sampling and Local
Quantification Scheme A-DConverter”, IEE Electronics
Letters, Vol. 39, n° 3, pp. 263-264, Feb. 2003.
[5] J. Sparsø and S. Furber (Eds.), Principles of Asynchronous Circuit Design - A Systems Perspective, Technical University of Denmark and The University of Manchester, UK.
[6] L. Williams, "A Stereo 16-bit Delta-Sigma A/D Converter for Digital Audio", Ph.D. dissertation, Stanford University, 1993.
[7] E. Allier, L. Fesquet, G. Sicard, M. Renaudin, “Low
Power Asynchronous A/D Conversion”, Proceedings of
the 12th International Workshop on Power and Timing,
Modeling,Optimization and Simulation (PATMOS‟02),
September 11-13 2002, Sevilla, Spain.
[8] E. Allier, G. Sicard, L. Fesquet,M. Renaudin, “A New
Class of Asynchronous A/D Converters Based on Time
Quantization”, ASYNC Proceedings, pp. 197-205, May
12-16 2003, Vancouver, Canada.
[9] J.W.Marketal.,“ANonuniformSamplingApproach to
Data Compression”, IEEE Trans. on Communication.
Vol. COM-29, n° 4, pp. 24-32, Jan. 1981. W.-K. Chen,
Linear Networks and Systems (Book style). Belmont,
CA: Wadsworth, 1993, pp. 123–135.
[10] F. Aeschlimann, E. Allier, L. Fesquet, M. Renaudin,
"Asynchronous FIR Filters: Towards a New Digital
Processing Chain," Asynchronous Circuits and Systems,
International Symposium on, pp. 198-206, 10th IEEE
International Symposium on Asynchronous Circuits and
Systems (ASYNC'04), 2004
[11] S.B. Furber, J.D. Garside, S. Temple, J. Liu, P. Day, and
N.C. Paver.AMULET2e: An asynchronous embedded
controller. In Proc. International Symposium on advanced
Research in Asynchronous Circuits and Systems, pages
290–299. IEEE Computer Society Press, 1997.
[12] L.S. Nielsen. Low-power Asynchronous VLSI Design.
PhD thesis, Department of Information Technology,
Technical University of Denmark, 1997. IT-TR:1997-12.
[13] SPeedster, a very high speed FPGA by Achronix:
http://www.achronix.com/
[14] A.J. Martin, A. Lines, R. Manohar, M. Nystr¨om, P.
Penzes, R. Southworth, U.V. Cummings, and T.-K. Lee.
The design of an asynchronous MIPS R3000. In
Proceedings of the 17th Conference on Advanced
Research in VLSI, pages 164–181. MIT Press, September
1997.
[15] N.C. Paver, P. Day, C. Farnsworth, D.L. Jackson, W.A.
Lien, and J. Liu. A low-power, low-noise configurable
self-timed DSP. In Proc. International Symposium on
Advanced Research in Asynchronous Circuits and
Systems, pages 32–42, 1998.
[16] L.S. Nielsen, C. Niessen, J. Sparsø, and C.H. van Berkel.
Low-power operation using self-timed circuits and
adaptive scaling of the supply voltage. IEEE Transactions
on VLSI Systems, 2(4):391–397, 1994.
[17] Quoc Thai Ho, J.-B. Rigaud, L. Fesquet, M. Renaudin, R.
Rolland, "Implementing asynchronous circuits on LUT
based FPGAs", The 12th International Conference on
Field Programmable Logic and Applications (FPL),
September 2-4, 2002, Montpellier (La Grande-Motte),
France.
Session 3: Prototyping Radio Devices
Applying Graphics Processor Acceleration in a Software Defined Radio Prototyping Environment
William Plishker, George F. Zaki, Shuvra S. Bhattacharyya
Dept. of Electrical and Computer Engineering
and Institute for Advanced Computer Studies
University of Maryland
College Park, Maryland
{plishker,gzaki,ssb}@umd.edu
Charles Clancy, John Kuykendall
Laboratory for Telecommunications Sciences
College Park, Maryland, USA
{clancy, jbk}@ltsnet.net
Abstract—With higher bandwidth requirements and more complex protocols, software defined radio (SDR) has ever-growing computational demands. SDR applications have different levels of parallelism that can be exploited on multicore platforms, but design and programming difficulties have inhibited the adoption of specialized multicore platforms like graphics processors (GPUs). In this work we propose a new design flow that augments a popular existing SDR development environment (GNU Radio) with a dataflow foundation and a stand-alone GPU-accelerated library. The approach gives an SDR developer the ability to prototype a GPU-accelerated application and explore its design space quickly and effectively. We demonstrate this design flow on a standard SDR benchmark and show that deciding how to utilize a GPU can be non-trivial for even relatively simple applications.
I. INTRODUCTION
GNU Radio [1] is a software development framework that
provides software defined radio (SDR) developers a rich
library and a customized runtime engine to design and test
radio applications. GNU Radio is extensive enough to describe
audio radio transceivers, distributed sensor networks, and radar
systems, and fast enough to run such systems on off-the-shelf
radio hardware and general purpose processors (GPPs). Such
features have made GNU Radio an excellent rapid prototyping
system, allowing designers to come to an initial functional
implementation quickly and reliably.
GNU Radio was developed with general purpose pro-
grammable systems in mind. Often initial SDR prototypes
were fast enough to be deployed on general purpose processors
or needed few custom accelerators. As new generations of
processors were backwards compatible with software, GNU
Radio implementations could track with Moore’s Law. As a
result, programmable solutions have been competitive with
custom hardware solutions that required longer design time
and greater expense to port to the latest process generation.
But with the decline in frequency improvements of GPPs, SDR
solutions are increasingly in need of multicore acceleration,
such as that provided by graphics processors (GPUs). SDR
is well positioned to make use of them since many SDR
applications have abundant parallelism.
GPUs are starting to be employed in SDR solutions, but
their adoption has been inhibited by a number of difficul-
ties, including architectural complexity, new programming
languages, and stylized parallelism. While other research is
addressing these topics [5], [6], one of the primary barriers
in many domains is the ability to quickly prototype the
performance advantages of a GPU for a particular application.
The inability to assess the performance impact of a GPU with
an initial prototype leaves developers doubting whether the time
and expense of targeting a GPU is worth the potential benefit.
Many design decisions must be made before arriving at an initial
multicore prototype, including mapping tasks to processors and
data to distributed memories. Mapping SDR applications is
further complicated by application requirements. The amount
of parallelism present may be dictated by the application itself
based on its latency tolerances and available vectorization
of the kernels. More vectorization tends to lead to higher
utilization of the platform (and therefore higher throughput),
but often at the expense of increased latency and buffer
memory requirements. Also an accelerator typically requires
significant latency to move data to or from the host processor,
so sufficient data must be burst to the accelerator to amortize
such overheads.
Ideally, application designers would simply be presented
with a Pareto curve of latency versus vectorization trade-offs
so that an appropriate design point can be selected. However,
vectorization generally influences the efficiency of a given
mapping. Thus, to fully unlock the potential of heterogeneous
multiprocessor platforms for SDR, designers must be able to
arrive at a variety of solutions quickly, so that the design space
may be explored along such critical dimensions.
To enable developers to arrive at an initial prototype that
utilizes GPUs, we introduce a new SDR design flow, as shown
in Figure 1. We begin with a formal description of an SDR
application, which we extract from a GNU Radio specification.
Formalisms provide the design flow with a structured, portable
application description which can be used for vectorization,
978-1-4577-0660-8/11/$26.00 c© 2011 IEEE
Fig. 1. Dataflow-founded SDR design flow.
latency, and other design decisions. These design decisions can
ultimately be incorporated into an SDR application through
a GPU-specific library of SDR actors. For this work, we
have constructed GRGPU, which is such a library written
for GNU Radio. We demonstrate the value of this approach
with a GNU Radio benchmark on a platform with a GPU.
II. BACKGROUND
Dataflow graphs are widely used in the modeling of signal
processing applications. A dataflow graph G consists of a set
of vertices V and a set of edges E. The vertices, or actors,
represent computational functions, and edges represent FIFO
buffers that can hold data values, which are encapsulated as
tokens. Depending on the application and the required level
of model-based decomposition, actors may represent simple
arithmetic operations, such as multipliers, or more complex
operations, such as turbo decoders.
A directed edge e(v1, v2) in a dataflow graph is an ordered
pair of a source actor v1 = src(e) and a sink actor v2 = snk(e), where v1 ∈ V and v2 ∈ V. When a vertex v executes
or fires, it consumes zero or more tokens from each input
edge and produces zero or more tokens on each output edge.
Synchronous Data Flow (SDF) [8] is a specialized form of
dataflow where for every edge e ∈ E, a fixed number of
tokens is produced onto e every time src(e) is invoked,
and similarly, a fixed number of tokens is consumed from
e every time snk(e) is invoked. These fixed numbers are
represented, respectively, by prd(e) and cns(e). Homogeneous
Synchronous Data Flow (HSDF) is a restricted form of SDF
where prd(e) = cns(e) = 1 for every edge e.

Given an SDF graph G, a schedule for the graph is a
sequence of actor invocations. A valid schedule guarantees
that every actor is fired at least once, there is no deadlock
due to token underflow on any edge, and there is no net
change in the number of tokens on any edge in the graph
(i.e., the total number of tokens produced on each edge during
the schedule is equal to the total number consumed from the
edge). If a valid schedule exists for G, then we say that G is
consistent. For each actor v in a consistent SDF graph, there
is a unique repetition count q(v), which gives the number of
times that v must be executed in a minimal valid schedule (i.e.,
a valid schedule that involves a minimum number of actor
firings). In general, a consistent SDF graph can have many
different valid schedules, and these schedules can differ widely
in the associated trade-offs in terms of metrics such as latency,
throughput, code size, and buffer memory requirements [4].
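The consistency check and repetition counts described above reduce to solving the balance equations prd(e) · q(src(e)) = cns(e) · q(snk(e)) over all edges. The sketch below is our illustration (not part of the paper's DIF toolchain; the actor names and rates are made up) and computes the minimal integer repetition vector for a connected SDF graph using rational arithmetic:

```python
from fractions import Fraction
from math import lcm

def repetition_vector(actors, edges):
    """Solve prd(e) * q(src) == cns(e) * q(snk) for every edge.
    edges: list of (src, snk, prd, cns). Returns the minimal integer
    q per actor, or None if the graph is inconsistent (no valid
    schedule exists). Assumes the graph is connected."""
    q = {a: None for a in actors}
    q[actors[0]] = Fraction(1)
    changed = True
    while changed:
        changed = False
        for src, snk, prd, cns in edges:
            if q[src] is not None and q[snk] is None:
                q[snk] = q[src] * prd / cns
                changed = True
            elif q[snk] is not None and q[src] is None:
                q[src] = q[snk] * cns / prd
                changed = True
            elif q[src] is not None and q[snk] is not None:
                if q[src] * prd != q[snk] * cns:
                    return None  # inconsistent: tokens cannot balance
    # Scale rates to the smallest positive integer solution.
    scale = lcm(*(f.denominator for f in q.values()))
    return {a: int(f * scale) for a, f in q.items()}

# A consistent SDF edge: A produces 2 tokens per firing, B consumes 3.
print(repetition_vector(["A", "B"], [("A", "B", 2, 3)]))  # {'A': 3, 'B': 2}
```

For this two-actor graph, the minimal valid schedule fires A three times and B twice, which is exactly the repetition count q described above.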
III. RELATED WORK
Many models of computation have been suggested to de-
scribe software radio systems. In [2], the advantages and
drawbacks of various models are investigated. Also different
dataflow models that can be applied to various actors of an
LTE receiver are demonstrated.
Actor implementation on GPUs is discussed in [13]. A GPU
compiler is described that takes a naive actor implementation
written in CUDA [11] and generates an efficient kernel
configuration that enhances the load balance on the available
GPU cores, hides memory latency, and coalesces data move-
ment. This work can be used in our proposed framework to
enhance the implementation of individual software radio actors
on a GPU. Raising the abstraction of CUDA programming
through program analysis is the focus of Copperhead [6].
In [12], the authors present a multicore scheduler that maps
SDF graphs to a tile based architecture. The mapping process
is streamlined to avoid the derivation of equivalent HSDF
graphs, which can involve significant time and space over-
head. In more general work, MpAssign [5] employs several
heuristics and allows different cost functions and architectural
constraints to arrive at a solution.
In [15], a dynamic multiprocessor scheduler for SDR
applications is described. The basic platform consists of a
Universal Software Radio Peripheral (USRP) and a cluster of
GPPs. A flexible framework for dynamic mapping of SDR
components onto heterogeneous multiprocessor platforms is
described in [9].
Various heuristics and mixed linear programming models
have been suggested for scheduling task graphs on homoge-
neous and heterogeneous processors (e.g., see [10]). In these
works, the problem formulations are developed to address
different objective functions and target platforms for imple-
menting the input application graphs.
The focus of this work is to construct a backend capable
of integrating specialized multicore solutions into a domain
specific prototyping environment. This should facilitate the
previously described dataflow based design flow, but should
also enable these other works to be applied in the field of SDR.
Any solution targeting a complex multicore system is unlikely
to produce the optimal solution with its first implementation.
The ability to quickly generate and evaluate many solutions on
a multicore platform should improve the efficacy of the approach
and ultimately the quality of the final solution.
IV. SDR DESIGN FLOW FOR GPUS
We implemented the design flow proposed in Figure 1
by using GNU Radio as the SDR description and runtime
environment and the Dataflow Interchange Format (DIF) [7]
for the dataflow representation and associated tools. Our GPU
target was CUDA enabled NVIDIA GPUs. With these tools in
place the design flow proceeds as described in the following
steps:
1) Designers write their SDR application in GNU Radio
with no consideration for the underlying platform. As
GNU Radio has an execution engine and a library of
SDR components, designers can verify correct function-
ality of their application. For existing GNU Radio appli-
cations, nothing must be changed with the description
to continue with the design flow.
2) If actors of interest are not in the GPU accelerated li-
brary, a designer writes accelerated versions of the actors
in CUDA. The design focuses on exposing the parallelism
to match the GPU architecture in as parametrized a way
as possible.
3) Through automated or manual processes, instantiated
actors are either assigned to a GPU or designated
to remain on a GPP. With complex trade offs between
GPU and GPP assignments possible, this step may be
revisited often as part of a system level design space
exploration. Dataflow provides a platform independent
foundation for analytically determining good mappings,
but designer insight is also a valuable resource to be
utilized at this step.
4) The mapping result is utilized by augmenting the origi-
nal SDR application description environment. By lever-
aging a stand-alone library of CUDA accelerated actors
for GNU Radio, the designer can describe and run the
accelerated application description with existing design
flow properties.
The following sections cover these steps in detail, specif-
ically as they relate to our instance of the design flow that
utilizes CUDA, GNU Radio, and DIF.
A. Writing GPU Accelerated Actors
After the application graph is described in GNU Radio, ac-
tors are individually accelerated using GPU specific tools. If an
actor of interest is not present in the GPU accelerated library,
the developer switches to the GPU customized programming
environment, which in our case is CUDA. The designer is still
saddled with difficult design decisions, but these decisions are
localized to a single actor. System level design decisions are
orthogonal to this step of the design process. While we do
not aim to replace the programming approach of the actors
functionality, the following design strategy lends itself to later
design space exploration by the developer.
As with other GPU programming environments, in CUDA
a designer must divide their application into levels of par-
allelism: threads and blocks, where threads represent the
smallest unit of a sequential task to be run in parallel and
blocks are groups of threads. In our experience, SDR actors
vary in how to use thread level parallelism, but tend to
realize block level parallelism with parallelism at the sample
level. The ability to tightly couple execution between threads
within a block creates a host of possibilities for the basic
unit of work within a block, be it processing a code word,
multiplying and accumulating for a tap, or performing an
operation on a matrix. Because blocks are decoupled, only
fully independent tasks can be parallelized. For SDR those
situations tend to arise between channels or between samples
on a single channel. Some samples may overlap between
blocks to support the processing of a neighboring sample, but
this redundancy is often more than offset by the performance
benefits of parallelization.
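As a concrete sketch of this block-level strategy, the helper below splits the output samples across decoupled blocks and widens each block's input range by a halo of (taps − 1) delayed samples, so that boundary outputs need no inter-block communication. This is an illustrative model of the partitioning, not GRGPU code; the block and tap counts are arbitrary:

```python
def partition_samples(num_outputs, num_blocks, num_taps):
    """Split output samples across decoupled blocks. Each block also
    reads (num_taps - 1) earlier input samples (the halo) so that an
    FIR can compute its boundary outputs independently."""
    per_block = -(-num_outputs // num_blocks)  # ceiling division
    ranges = []
    for b in range(num_blocks):
        out_start = b * per_block
        out_end = min(out_start + per_block, num_outputs)
        if out_start >= out_end:
            break
        # Input range includes the halo of delayed samples.
        in_start = max(0, out_start - (num_taps - 1))
        ranges.append((out_start, out_end, in_start, out_end))
    return ranges

# 1000 output samples, 4 blocks, a 60-tap filter:
for r in partition_samples(1000, 4, 60):
    print(r)
```

Each tuple is (first output, last output, first input, last input); the duplicated halo samples are the overlap that the text argues is more than offset by the benefits of parallelization.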
The performance of this parallelization strategy is strongly
influenced by the number of channels or the size of a chunk
of samples that can be processed at one time. When the
application requests processing on a small chunk of samples,
there are few blocks to spread across a GPU, leaving it
underutilized, while large chunks enable high utilization. The
performance difference between small and large chunks is
non-linear due to the high fixed latency penalty that both
scenarios experience when transferring data to and from the
GPU and launching kernels. When chunks are small, GPU
time is dominated by transfer time, but when chunks are larger,
computation time of the kernel dominates, which amortizes the
fixed penalty delay. As the application dictates these values,
actors must be written in a parametrized way to accommodate
different size inputs.
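The non-linear effect of chunk size can be illustrated with a simple cost model: total GPU time is a fixed launch/transfer penalty plus per-sample transfer and compute costs, so effective throughput rises steeply and then saturates as chunks grow. The constants below are invented for illustration only and do not come from the paper's measurements:

```python
def gpu_time_us(chunk, fixed_us=50.0, xfer_us=0.01, compute_us=0.002):
    """Hypothetical time to process one chunk on the GPU: a fixed
    kernel-launch/transfer penalty plus per-sample costs."""
    return fixed_us + chunk * (xfer_us + compute_us)

def throughput(chunk):
    """Samples processed per microsecond for a given chunk size."""
    return chunk / gpu_time_us(chunk)

for chunk in (64, 1024, 65536):
    print(chunk, round(throughput(chunk), 2))
```

Small chunks are dominated by the fixed penalty, large chunks by kernel compute time, which is the amortization behavior the text describes.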
B. Partitioning, Scheduling, and Mapping
Once actors are written, system level design decisions must
be made, such as assigning which actors are to invoke GPU
acceleration. With some applications, the best solution may
be to offload every actor that is faster on the GPU than it
is on the GPP. But in some cases, this greedy strategy fails
to recognize the work that could occur simultaneously on
the GPP, while the host thread with the kernel call waits for
the GPU kernel to finish. A general solution to the problem
would consider application features such as rates of firings,
dependencies, and execution times on each platform of each
actor, as well as architectural features such as the number
and types of processing elements, memories, and topology.
To simplify the problem, designers can cluster certain actors
together so that they are assigned to the same processor. To promote this
clustering, designers may partition the application graph.
Multirate applications also need to be scheduled properly to
ensure that firing rates and dependencies are properly accounted for.
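Checking that firing rates and dependencies are respected amounts to simulating token counts along each edge: a schedule is invalid if any firing would underflow a FIFO, and a repeatable schedule must leave every buffer at its initial level. A minimal checker in the SDF notation of Section II (our illustration only):

```python
def is_valid_schedule(schedule, edges):
    """Check an SDF schedule: no firing may consume tokens that are
    not yet available, and every edge must show no net token change.
    edges: list of (src, snk, prd, cns); schedule: list of actor names."""
    tokens = [0] * len(edges)
    for actor in schedule:
        # Consume from every input edge first; fail on underflow.
        for i, (src, snk, prd, cns) in enumerate(edges):
            if snk == actor:
                if tokens[i] < cns:
                    return False
                tokens[i] -= cns
        # Then produce onto every output edge.
        for i, (src, snk, prd, cns) in enumerate(edges):
            if src == actor:
                tokens[i] += prd
    # No net change on any edge means the schedule can repeat forever.
    return all(t == 0 for t in tokens)

edges = [("A", "B", 2, 3)]
print(is_valid_schedule(["A", "A", "A", "B", "B"], edges))  # True
print(is_valid_schedule(["B", "A"], edges))                 # False
```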
Fig. 2. GRGPU: A GNU Radio integration of GPU accelerated actors.
When the application can be extracted into a formal dataflow
model, schedulers will not only respect these constraints but
are able to optimize for buffer assignments [3]. The applicability
of such techniques to specialized multicore platforms
is still open research, and this design flow enables greater
experimentation with them for SDR applications. Manual
scheduling and mapping is likely to continue to dominate
smaller, more homogeneous mappings, but a grounding in
dataflow opens the door for new automation techniques. In
this work we focus on the design flow, conventions for writing
SDR actors, and integrating GPU accelerated actors with GNU
Radio.
C. GRGPU: GPU Acceleration in GNU Radio
We developed GPU accelerated GNU Radio actors in a
separate, stand-alone library called GRGPU. GRGPU extends
GNU Radio’s build and install framework to link against
libraries in CUDA as shown in Figure 2. After building against
CUDA libraries, the resulting actors may be instantiated along-
side traditional GNU Radio actors, meaning that designers
may swap out existing actors for GRGPU actors to bring
GPU acceleration to existing SDR applications. The traditional
GNU Radio actors run unaffected on the host GPP, while
GRGPU actors utilize the GPU.
When writing a new GRGPU actor, application developers
start by writing a normal GNU Radio actor including a C++
wrapper that describes the interface to the actor. The GPU
kernels are written in CUDA in a separate file and tied back
to the C++ wrapper via C functions such as device_work().
Additional configuration information may be sent in through
the same mechanism. For example, the taps of a FIR filter
typically need to be updated only once or rarely during the
execution, so instead of passing the tap coefficients during
each firing of the actor (taps sent from work() to device_work()
to the kernel call), they could be loaded into device mem-
ory when the taps are updated in GNU Radio. The CUDA
compiler, NVCC, is invoked to synthesize C++ code which
contains binaries of the code destined for the GPU along with
glue code formatted for C++. By generating the C++ instead of an
object file directly, we are able to make use of the standard
GNU build process using libtool. Even though the original
application description was in a different language, the code
is wrapped and built in the GNU standard way giving it
compatibility with previous and future versions of GNU and
GNU Radio.
When a GNU Radio actor is instantiated, a new C++
object is created which stores and manages the state of the
actor. However, state in the CUDA file is not automatically
replicated, creating a conflict when more than one GRGPU
actor of the same type is instantiated. To work around this
issue, we save CUDA (both host and GPU) state inside the
C++ actor, which includes GPU memory pointers of data
already loaded to the GPU. The state from the GPU itself is
not saved inside the C++ object, but rather the pointers to the
device memory are. Data residing in the GPUs memory space
is explicitly managed on the host, so saving GPU pointers is
sufficient for keeping the state of the CUDA portion of an
actor.
To minimize the number of host-to-GPU and GPU-to-
host transfers, we introduce two actors, H2D and D2H, to
explicitly move data to and from the device in the flow graph.
This allows other GRGPU actors to contain only kernels that
produce and consume data in the GPU memory. If multiple
GPU operations are chained together, data is processed locally,
reducing redundant I/O between GPU and host as shown in
Figure 3. In GNU Radio, the host-side buffers still exist,
connecting the C++ objects that wrap the CUDA
kernels. Instead of carrying data, these buffers now carry
pointers to data in GPU memory. From a host perspective,
H2D and D2H transform host data to and from GPU pointers,
respectively.
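The pointer-passing convention above (host buffers carry device pointers between H2D and D2H while the samples stay resident in GPU memory) can be sketched with a toy host-side model. The classes and functions below are illustrative stand-ins, not GRGPU's actual API:

```python
class FakeDeviceMemory:
    """Stand-in for GPU memory: maps integer 'device pointers' to data."""
    def __init__(self):
        self._mem, self._next = {}, 0
    def alloc(self, data):
        ptr = self._next
        self._next += 1
        self._mem[ptr] = list(data)
        return ptr
    def read(self, ptr):
        return self._mem[ptr]

gpu = FakeDeviceMemory()

def h2d(samples):
    """Host-to-device actor: consumes host samples, produces a pointer."""
    return gpu.alloc(samples)

def gpu_scale(ptr, k):
    """A 'kernel' actor: consumes and produces device pointers only."""
    return gpu.alloc(x * k for x in gpu.read(ptr))

def d2h(ptr):
    """Device-to-host actor: turns a pointer back into host samples."""
    return gpu.read(ptr)

# Chained GPU actors pass only pointers through the host-side buffers;
# the data itself never leaves "device memory" between H2D and D2H.
out = d2h(gpu_scale(gpu_scale(h2d([1, 2, 3]), 2), 10))
print(out)  # [20, 40, 60]
```

Only h2d and d2h touch host data, which mirrors how GRGPU avoids redundant host/GPU transfers for chained GPU actors.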
While having both a host buffer and a GPU buffer introduces
some redundancy, it has a number of benefits which make this
an attractive solution. First, there is no change to the GNU
Radio engine. The GNU Radio engine still manages data being
produced and consumed by each actor, so decisions on chunk
size or invocation order do not need to be changed with the
use of GRGPU actors. Second, GPU buffers may be safely
managed by the GRGPU actors. With GPU pointers being
sent through host buffers, actors need only concern themselves
with maintaining their own input and output buffers. This
provides dynamic flexibility (actors can choose to create and
free memory for data as needed) or static performance tuning
(actors can maintain circular buffers to and from which they
read and write a fixed amount of data). Such schemes
require coordination between GRGPU actors and potentially
information regarding buffer sizing, but the designer does have
the power to manage these performance critical actions without
redesigning or changing GRGPU. Future versions of GRGPU
could provide designers with a few options regarding these
schemes and even make use of the dataflow schedule or
other analysis to make quality design decisions. Finally, no
extraneous transfers between GPU and host occur. While the
host and GPU buffers mirror each other, no transfers occur
between them, which avoids I/O latencies that can be the cause
of application bottlenecks.
Fig. 3. GRGPU actors between H2D and D2H communicate data using the GPU's memory, avoiding unnecessary host/GPU transfers.
Fig. 4. SDF graph of the mp-sched Benchmark.
V. EVALUATION
We have experimented with the proposed design flow us-
ing the mp-sched benchmark. Figure 4 shows the mp-sched
benchmark pictorially. Each of the actors after the distributor
performs FIR filtering. To provide flexibility for evaluating
different multicore platforms, it is configurable with the number
of chains of FIR filters (pipelines) and the depth of the chains
(stages). This benchmark describes a flow graph that consists
of a rectangular grid of FIR filters. The dimensions of this
grid are parametrized by the number of stages (STAGES) and the number of pipelines (PIPES). The total number of FIR
filters is thus equal to PIPES×STAGES. This benchmark
represents a non-trivial problem for the multiprocessor sched-
uler as all actors in different pipelines can be executed in
parallel. More information about the mp-sched benchmark can
be found in [1].
A. FIR Filter Design
In this implementation [14], we take advantage of data
parallelism between the filter output samples as well as
functional parallelism to calculate every sample. For relatively
large chunks of samples, the CUDA kernel is configured such
that the number of blocks is equal to double the number of
available streaming multiprocessors. With this configuration,
the first level of data parallelism is achieved by making
every CUDA block responsible for calculating a different set
of output samples. In other words, the required number of
output samples are evenly distributed across the CUDA
blocks. To overcome the inherent stateful property of the FIR
filter (i.e., consecutive output samples depend on some shared
input samples), the input of every block must contain an extra
set of delayed input samples equal to the number of taps.
To reduce the number of device memory accesses, initially
all of the threads will perform a load of a coalesced chunk of
input elements to the shared memory of a multiprocessor. Then
every thread will be responsible for calculating the product
of a filter tap coefficient with an input sample, and adding
this product to the partial sum of the previous stage. After
processing a set of inputs, the threads perform a block store
of the calculated results to the GPU device memory.
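The per-block computation described above can be emulated on the host to check correctness before tuning a CUDA kernel. This NumPy sketch is our illustration, not the paper's kernel: each block receives its output range plus a halo of (taps − 1) delayed input samples and computes its outputs independently, then the result is compared against a reference FIR:

```python
import numpy as np

def blocked_fir(x, taps, num_blocks):
    """Emulate the block-parallel FIR: each block computes a contiguous
    range of outputs from its own input slice plus a halo of
    (len(taps) - 1) delayed samples, with no inter-block communication."""
    n_taps = len(taps)
    x_pad = np.concatenate([np.zeros(n_taps - 1), x])  # zero initial state
    n_out = len(x)
    bounds = np.linspace(0, n_out, num_blocks + 1, dtype=int)
    out = np.empty(n_out)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        # Block input: the halo plus the samples for outputs [lo, hi).
        block_in = x_pad[lo : hi + n_taps - 1]
        for i in range(hi - lo):
            # Each output is a tap-by-tap multiply-accumulate, written
            # here as a dot product over the reversed tap vector.
            out[lo + i] = np.dot(block_in[i : i + n_taps], taps[::-1])
    return out

x = np.random.default_rng(0).standard_normal(256)
taps = np.array([0.25, 0.5, 0.25])
ref = np.convolve(x, taps)[: len(x)]        # reference FIR output
assert np.allclose(blocked_fir(x, taps, 4), ref)
```

The duplicated halo samples are the "extra set of delayed input samples equal to the number of taps" that the text describes.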
B. Empirical Results
We explored a variety of design points of mp-sched to evaluate
the utility of rapid prototyping with GRGPU. The target
platform was two GPPs (Intel Xeon CPUs, 3 GHz) and a GPU
(an NVIDIA GTX 260). The actors performed 60-tap FIR
filtering with either CUDA acceleration in the case of GPU
accelerated actors or SSE acceleration in the case of the GPP.
To minimize the latencies incurred by using H2D and D2H,
the GPU accelerated actors were clustered together leaving
remaining GPP actors similarly clustered.
In our exploration of the mp-sched implementation design
space, each pipeline was located in a separate
thread and the number of actors with GPU acceleration was
configurable. Mp-sched pipelines could run in parallel and
share the GPU as an acceleration resource during runtime.
Multiple pipelines with GPU accelerated actors were forced to
serialize their GPU accesses according to CUDA conventions.
For example, one possible solution to a 2x20 instance of
mp-sched is shown in Figure 5. The Gantt chart is not to
scale, but shows how the two different pipelines (one in red
and one in blue), are able to run in parallel on the two
GPPs, but must have exclusive access to the GPU when
running accelerated actors. While the cross thread sequencing
was not specified at runtime, GRGPU’s ability to specify
acceleration and clustering enables the creation of complete
multicore, GPU-accelerated solutions.
The problem for a designer is then to leverage GPPs
and GPU, weigh SSE acceleration and CUDA acceleration,
account for communication latencies between GPU and GPP
and thread to thread, and consider how all of this will occur
in parallel. Models and automated techniques should continue
to assist in providing good starting points, but a necessary
condition to arriving at a quality solution is still the ability to
try many points quickly.
To this end, we constructed an illustrative example that
produces an interesting set of design points: mp-sched with 20
stages and a varying number of pipelines. Figure 6 shows a
sub-sampling of the design space. “All GPP” means all stages
of all pipelines are assigned to the GPPs, while “All GPU”
means all stages of all pipelines are assigned to the GPU.
“3/4 GPP”, “Half GPP”, and “3/4 GPU” indicate that three
quarters, one half, or one quarter of the stages of all pipelines
are assigned to the GPP, respectively, while the remaining
actors use the GPU. For example, Figure 5 shows the 2x20
Half GPP solution. We also evaluated solutions in which
one of the pipelines was all GPP and the rest GPU (“One
GPP”) and the reverse (“One GPU”). In the case of only one
pipeline, these solutions were equivalent to an all GPP or all
GPU solution. We ran each solution for 200,000 samples and
recorded the execution time, including GNU Radio overheads,
communication overheads, etc.
Fig. 5. Gantt chart for the 2x20 mp-sched graph on 2 GPPs and 1 GPU. The blue and the red sets of blocks and arrows each represent one branch of the mp-sched instance.
For the 60 tap FIR filter, SSE acceleration performs well,
but is still somewhat slower than the GPU implementation, so
once a sufficient amount of computation is located on the
GPU, GPU weighted implementations tend to perform better.
But this graph does reveal that the GPU should be employed
in different ways depending on the number of pipelines. For
example, a single pipeline implies that there is not quite
enough computation present to merit GPU acceleration. How-
ever when 2 or more pipelines are used, the GPPs become
saturated to the point that GPU acceleration can improve upon
the result. When 4 pipelines are needed, one GPP-only pipeline
proves higher-performing than an all-GPU solution, indicating
that the GPU itself has become saturated with computation and
that employing more of the GPP is appropriate. In each of the
cases, retrospective reasoning gives us insight into improving
performance, but a change in GPU, communication latencies,
etc. would likely change this space again, leaving a designer
to re-explore the design space.
It should be possible to arrive at these solutions more
analytically to accelerate the design space exploration, but
inevitably a set of points will need to be evaluated to judge
the efficacy of any analytical assistance. GRGPU will continue
to provide value in such a scenario, feeding back empirical
solutions to the design space exploration engine.
VI. CONCLUSION AND FUTURE WORK
As SDR attempts to leverage more special purpose multi-
core platforms in complex applications, application developers
must be able to quickly arrive at an initial prototype to
understand the potential performance benefits. In this paper,
we have presented a design flow that extends a popular SDR
environment, lays the foundation for rigorous analysis from
formal models, and creates a stand-alone library of GPU
accelerated actors which can be placed inside of existing
applications. GPU integration into an SDR specific program-
ming environment allows application designers to quickly
evaluate GPU accelerated implementations and explore the
design space of possible solutions at a system level.
Useful directions for future work include new methods
for dealing with scheduling, partitioning, and mapping for
multicore systems along with evaluating existing automation
Fig. 6. A sampling of the design space of 1x20, 2x20, 3x20, and 4x20 mp-sched graphs on 2 GPPs and 1 GPU for different assignments.
solutions that have been developed. Also, GRGPU should
be able to extend to multi-GPU platforms by customizing
GRGPU actors to communicate and launch on a specific GPU.
Acknowledgments
This research was sponsored in part by the Laboratory for
Telecommunication Sciences, and Texas Instruments.
REFERENCES
[1] GNU Radio. http://gnuradio.org/redmine/wiki/gnuradio, Nov. 2010.
[2] H. Berg, C. Brunelli, and U. Lucking. Analyzing models of computation for software defined radio applications. In Proc. IEEE International Symposium on System-on-Chip, pages 1–4, Nov. 2008.
[3] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. Software Synthesis from Dataflow Graphs. Kluwer Academic Publishers, 1996.
[4] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. Synthesis of embedded software from synchronous dataflow specifications. Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 21(2):151–166, June 1999.
[5] Y. Bouchebaba, P. Paulin, A. E. Ozcan, B. Lavigueur, M. Langevin, O. Benny, and G. Nicolescu. MpAssign: A framework for solving the many-core platform mapping problem. In Rapid System Prototyping (RSP), June 2010.
[6] B. Catanzaro, M. Garland, and K. Keutzer. Copperhead: Compiling an embedded data parallel language. Technical Report UCB/EECS-2010-124, EECS Department, University of California, Berkeley, Sep. 2010.
[7] C. Hsu, I. Corretjer, M. Ko, W. Plishker, and S. S. Bhattacharyya. Dataflow interchange format: Language reference for DIF language version 1.0, user's guide for DIF package version 1.0. Technical Report UMIACS-TR-2007-32, Institute for Advanced Computer Studies, University of Maryland at College Park, June 2007. Also Computer Science Technical Report CS-TR-4871.
[8] E. A. Lee and D. G. Messerschmitt. Synchronous dataflow. Proceedings of the IEEE, 75(9):1235–1245, September 1987.
[9] V. Marojevic, X. R. Balleste, and A. Gelonch. A computing resource management framework for software-defined radios. IEEE Transactions on Computers, 57:1399–1412, 2008.
[10] R. Niemann and P. Marwedel. Hardware/software partitioning using integer programming. In Proc. of the European Design and Test Conference, pages 473–479, Mar. 1996.
[11] NVIDIA. CUDA C programming guide version 3.1.1, July 2010.
[12] S. Stuijk, T. Basten, M. C. W. Geilen, and H. Corporaal. Multiprocessor resource allocation for throughput-constrained synchronous dataflow graphs. In Proc. of the 44th Annual Design Automation Conference, DAC '07, pages 777–782, June 2007.
[13] Y. Yang, P. Xiang, J. Kong, and H. Zhou. A GPGPU compiler for memory optimization and parallelism management. In Proc. of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2010.
[14] G. Zaki, W. Plishker, T. O'Shea, N. McCarthy, C. Clancy, E. Blossom, and S. S. Bhattacharyya. Integration of dataflow optimization techniques into a software radio design framework. In Proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, pages 243–247, Pacific Grove, California, November 2009. Invited paper.
[15] K. Zheng, G. Li, and L. Huang. A weighted-selective scheduling scheme in an open software radio environment. In IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, pages 561–564, Aug. 2007.
Validation of Channel Decoding ASIPs: A Case Study
Christian Brehm, Norbert WehnMicroelectronic Systems Design Research Group
University of Kaiserslautern, Germany{brehm, wehn}@eit.uni-kl.de
Sacha Loitz, Wolfgang KunzElectronic Design Automation Research Group
University of Kaiserslautern, Germany{loitz, kunz}@eit.uni-kl.de
Abstract— It is well known that validation and verification are the most time-consuming steps in complex System-on-Chip design. Thus, different validation and verification approaches and methodologies for various implementation styles have been devised and adopted by industry. Application-specific instruction-set processors (ASIPs) are an emerging implementation technology to solve the energy efficiency/flexibility trade-off in baseband processing for wireless communication, where multiple standards have to be supported at a very low power budget and a small silicon footprint. In order to balance these contrary aims, ASIPs for these application domains have a restricted functionality tailored to a specific class of algorithms compared to traditional ASIPs. The downside of this outstanding efficiency/flexibility ratio is the combination of unfavorable attributes for validation. Compared to standard processors, these ASIPs often have a very complex instruction set architecture (ISA) due to the tight coupling between the instructions and the optimized micro-architecture, requiring new validation concepts.
This paper sensitizes the reader to the distinctiveness and complexity of validating ASIPs tailored to channel decoding. In a case study, a composite approach comprising formal methods as well as simulations and rapid prototyping is applied to validate an existing channel decoding ASIP and transfer it into an industrial product.
I. INTRODUCTION
Today's and future wireless communication networks require flexible modem architectures to support seamless services between different network standards. Next-generation handsets will have to support multiple standards, such as UMTS, LTE, DVB-SH, or WiMax. This creates the demand for the design of flexible, yet power- and area-efficient solutions for baseband signal processing, which is one of the most computation-intensive tasks in mobile wireless devices [1].
Application-specific instruction set processors are a very promising candidate for this task, as they promise a much higher flexibility than dedicated architectures and a better energy efficiency than general purpose processors (GPPs) [2]. For many applications, efficient ASIP designs are best derived from standard processor pipelines in a top-down manner. This is done by adding functionality and instructions for the most common kernel operations of the targeted algorithms, such as an FFT.
Also in the field of channel decoding ASIPs are verypopular, as they are seen as an elegant way to cope with the
978-1-4577-0660-8/11/$26.00 c© 2011 IEEE
vast amount of different coding schemes and their parameters, e.g. [3]–[7]. However, these ASIP designs often have little in common with an enhanced standard processor pipeline. Energy and area efficiency demand distributed memories embedded into the pipeline, which are typical for many state-of-the-art decoding schemes, as well as the unification of the commonalities of several dedicated architectures. Fully customized deep pipelines with non-standard memory interfaces and instructions tailored to the targeted algorithms are the consequence. Minimal support for flow control operations is added, resulting in a weakly programmable architecture that offers no more than the desired flexibility. We denote this type of ASIP as Weakly Programmable IP Cores (WPIPs).
While WPIPs combine many advantages of standard IP block design and programmable architectures, they also inherit the drawbacks of the respective implementation styles w.r.t. validation. This has so far barely been addressed by the research community. Muller [8] and Alles [9] have presented rapid prototyping platforms for ASIPs, which can be used for testing purposes. But both approaches are far too inflexible, as they can only show the presence of errors, never their source or their absence.
The rest of the paper is structured as follows: we will
• illuminate the differences in the design flows for the various implementation styles (Section II) and the challenges in WPIP validation (Section III),
• quantify the effort required for different verification and validation tasks in order to underline their importance, and
• introduce our validation approach in a case study in which it was successfully applied to our FlexiTreP ASIP in order to bring it to product level (Section IV).
II. IMPLEMENTATION STYLES
Design methodologies for the implementation of digital signal processing systems consist of two phases. The goal of the first phase is to make all functional design decisions, from algorithm selection down to quantization. Purely functional, software-based system models are used at this stage to guarantee the desired functional behavior. For state-of-the-art communication systems this step is particularly challenging, as the communications performance of today's channel codes cannot be evaluated analytically. Instead, extensive Monte Carlo simulations have to be performed to determine the
Fig. 1. Implementation Styles [10]
bit-error or frame-error performance of every single design candidate. At the end of this iterative refinement and evaluation process stands the so-called Golden Reference Model, a functional software implementation of the system.
The second phase deals mostly with non-functional aspectsof the actual system implementation. Various implementationstyles exist and the right choice depends on flexibility andenergy or area efficiency requirements (cf. Figure 1). By farthe most challenging part at this stage is the validation andverification of the implementation against the Golden Refer-ence Model. Their properties are highlighted in the following.
• General Purpose Processors (GPP) offer the greatestflexibility of all implementation styles. It is also easilypossible to upgrade such systems to support new featuresor new standards by a simple update of the systemsoftware. Another advantage of this implementation styleis the comparatively low effort for system validation andverification. Given the correctness of the processor anda functional model of the instruction set architecture (socalled ISA model), the application software can be vali-dated independently from the underlying processor. Thecorrectness of the hardware is often proven with formalverification methods and guaranteed by the manufacturer.The big drawback of such platforms is their very low areaand energy efficiency.
• Dedicated, hardwired architectures in contrast offer thehighest implementation efficiency. For such architectures,traditional synthesis based design flows are widely used.As the RTL (register transfer level) hardware descrip-tion is typically derived at least in parts by iterativerefinement of the well elaborated golden reference model,the correctness of the implementation can be shown bysimulation or formal methods. The high implementationefficiency of dedicated architectures of course comes atthe cost of very limited flexibility.
• Application Specific Instruction Set Processors (ASIP)[11]–[13] try to close the gap between dedicated hard-wired and programmable off-the-shelf solutions. Typi-cally, the instruction set of a GPP is enhanced with
TABLE I
CHARACTERISTICS OF ASIP IMPLEMENTATION STYLES

  Top-Down Approach (classical ASIP)      | Bottom-Up Approach (WPIP)
  ----------------------------------------|----------------------------------------------
  Standard pipeline                       | Application-specific pipeline
  Standard memory access scheme           | Application-specific memory access and
                                          | organization
  Standard instruction set extended by    | Only application-specific instructions
  application-specific instructions       | defining the interplay of functional blocks
  Single-context instructions             | Multi-context instructions
special non-standard instructions to allow more efficient processing of the algorithms under consideration. These additional instructions, identified through detailed analysis and profiling of the algorithms, are supported by additional dedicated stages which are inserted into the processor pipeline. Thus, the original instructions of the GPP remain unchanged and the ISA is only extended. Concepts from standard processor validation are still applicable.
• Weakly programmable IPs (WPIPs), too, are ASIPs, but they are created in a bottom-up approach, starting from dedicated architectures rather than from a GPP. The commonalities of dedicated architectures with similar kernel operations and memory requirements are extracted and unified in a fully customized pipeline with a custom, scattered memory architecture offering exactly the required bandwidth and flexibility. The characteristic differences compared to traditional ASIPs are contrasted in Table I. The gain of this approach is a performance and energy efficiency very close to that of dedicated architectures while at the same time offering at least the minimum flexibility required from programmable architectures (see Figure 1). Thus, they are the preferable implementation style for upcoming multi-standard channel decoder implementations. The biggest challenge in WPIP design, however, is validation.
III. WPIP VALIDATION
While WPIPs inherit many desirable properties from dedicated as well as programmable architectures, this is not true for the ease of validation. The ISA of a WPIP is not designed explicitly, but merely emerges from the combination of architectures. Thus, there is no standard ISA model that can be used in tools for formal verification. The tight coupling of hardware and software hinders separate validation of the WPIP architecture without the applications running on it. Furthermore, the pipeline of a typical WPIP is very deep (e.g., 15 stages in the case of [14]) and contains a complex system of irregularly sized, distributed memories, which may even be accessed in an out-of-order fashion in several pipeline stages. This creates inter-instruction dependencies over many clock cycles, exceeding the capabilities of formal verification tools commonly available today. Taking these properties into account, the following approaches turned out to be potentially appropriate for WPIP validation.
TABLE II
RUNTIMES FOR SIMULATIONS AND VERIFICATION (FOR 10 k BLOCKS)

  Simulation                              | Runtime | Throughput
  ----------------------------------------|---------|--------------
  Property checking [16]                  | 18 h    | 83 properties
  Monte Carlo simulation (software):      |         |
    Viterbi, 1k info bits, w/ ASIP        | 0.7 h   | 3.8 kbps
    bTC UMTS, w/ ASIP                     | 10 h    | 1.4 kbps
    bTC UMTS, w/ SW reference decoder     | 15 min  | 58 kbps
  RTL sim., bTC UMTS, only ASIP           | 47 h    | 0.3 kbps
A. Formal Verification
Formal methods prove the absence of errors, while simulation can only ever show their presence.
Loitz et al. [15] have recently established a way to reduce the complexity of interval property checking for WPIPs by composing instructions of micro-operations, making the validation of complex WPIP instructions feasible. Their completeness approach enables a formal proof that each of the complex instructions behaves as intended. Although this does not necessarily prove the design as a whole to behave as expected, verification of each instruction can detect errors such as saturation or rounding issues. However, formal verification of the system behavior is as yet impossible and still under research.
B. Simulations
For programmable architectures based on a standard instruction set, verification of the instruction set guarantees that any arbitrary algorithm can be implemented. In contrast, for WPIPs the correctness of all instructions is not sufficient, since it does not guarantee that the intended system behavior can be implemented. Hence, despite the confidence that formal methods provide, there is still a demand for simulations. In particular, they are invaluable for analysis during the program development phase, which has turned into a challenging and error-prone task due to the optimized instruction set.
As WPIPs are designed to implement a large number of possible standards, a purely simulation-based validation approach needs to simulate every channel decoding application and compare a statistically significant number of computed values to the respective golden reference model or to the respective frame or bit error rate (FER, BER) point specified in the communication standards. Furthermore, WPIPs are not created by refinement from the golden reference model, as dedicated architectures are. Hence, there is no structural similarity between the two that could be exploited for validation. An approach to cope with this will be presented in Section IV.
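The statistical core of such a simulation-based check can be sketched as a Monte Carlo frame-error-rate estimate. The decoder, channel model, and all parameters below are illustrative stand-ins, not the actual golden reference setup:

```python
import random

def estimate_fer(decode, num_blocks, block_len, p_bit_flip, seed=0):
    """Estimate the frame error rate (FER) by Monte Carlo: push random
    frames through a binary symmetric channel and count the frames the
    decoder fails to reconstruct."""
    rng = random.Random(seed)
    frame_errors = 0
    for _ in range(num_blocks):
        tx = [rng.randint(0, 1) for _ in range(block_len)]
        rx = [bit ^ (rng.random() < p_bit_flip) for bit in tx]
        if decode(rx) != tx:
            frame_errors += 1
    return frame_errors / num_blocks

# Toy "decoder" with no error correction: every flipped bit stays, so
# the estimate approaches 1 - (1 - p)^block_len, roughly 1e-2 here.
fer = estimate_fer(lambda rx: rx, num_blocks=10_000,
                   block_len=100, p_bit_flip=1e-4)
```

Resolving an FER point near 10^-2 with reasonable confidence already takes on the order of 10^4 frames per operating point, which is why the per-block runtimes of Table II dominate the total validation effort.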
Simulation times of WPIPs exceed those of the golden reference or even an RTL model of a dedicated architecture by orders of magnitude (see Table II). It quickly becomes infeasible to perform statistical Monte Carlo simulations as the only means of implementation validation.
TABLE III
SIMULATIONS REQUIRED FOR MULTI-STANDARD SYSTEM VALIDATION

  Standard    | Code                 | Code types (enc. type, tailing, | Different blocksizes
              |                      | rates, polynomials)             | per type
  ------------|----------------------|---------------------------------|---------------------
  GSM/EDGE    | Viterbi, 256 states  | 3                               | 375
              | Viterbi, 64 states   | 8                               | 1000
              | Viterbi, 16 states   | 4                               | 724
  LTE         | bin. Turbo           | 1                               | 188
  UMTS/HSPA   | bin. Turbo           | 1                               | 5075
  CDMA2k      | bin. Turbo           | 1                               | 18
  WiMax       | duobin. Turbo        | 1                               | 17

  Overall different code blocks: 17,319
Finally, one of the biggest advantages designers hope for when choosing programmable architectures is the flexibility to easily extend the system to new standards. While this is feasible through software adjustments for GPPs or ASIPs based on standard ISAs, every change in the pipeline of a WPIP effectively poses a potential change to the functional behavior of every single application running on it. As there is no clear separation between the WPIP and the applications by means of a well-defined ISA, and hardware is shared among the supported algorithms, even small changes can require a complete re-validation of all implemented applications.
C. Rapid Prototyping
Simulations are a powerful validation method. The drawback is that simulations with a sufficient number of test vectors take very long, up to several days, depending on complexity and computational intensity. This problem can be attenuated by rapid prototyping: the simulation is transferred to an acceleration platform, usually an FPGA board, and run there. A sophisticated variant is often denoted as "hardware in the loop", where the testbench or simulation environment remains in software and the device under test is integrated as a real hardware component.
D. Combined Approach
The specific properties of WPIPs and the advantages and disadvantages of the validation methods described above imply that none of these common approaches for traditional implementation styles is applicable on its own. A mix of formal methods and simulation or emulation is appropriate.
In the next section we introduce this approach for WPIP validation using the example of a WPIP designed for industrial use and quantify the validation and verification effort.
IV. CASE STUDY: FLEXITREP VALIDATION
The WPIP for the case study is FlexiTreP [14], a Flexible Trellis Processing engine. With its capability of decoding binary and duobinary turbo and convolutional codes, it supports the most important wireless communication standards, among others UMTS, LTE, DVB-SH, and WiMax. It comprises 15 pipeline stages and seven memories that are accessed in different pipeline stages. The pipeline is dynamically reconfigurable in order to react to code changes. The pipeline is implemented in
Fig. 2. ASIP Design and Validation Flow
a high-level processor description language, LISA [17]. From this description, a cycle-accurate C++ simulation model as well as a synthesizable RTL model are generated using Synopsys Processor Designer [11].
For validation we applied a combined approach according to Figure 2: with the properties we gained during the implementation phase from our system specification and the existing implementation knowledge, we can formally verify that the instructions described in our high-level language work correctly by applying property checking to the RTL model (cf. left part of Figure 2). This can be done independently from application program development and excludes a wide range of errors such as memory access conflicts (e.g., from stall units), address range faults, or rounding and saturation faults.
In addition to the instruction verification, a huge amount of simulation remains to be done, as shown in the right part of Figure 2. Simulations are mandatory for two purposes. For channel decoding architectures like FlexiTreP, the algorithmic performance needs to be proven. This is only possible by Monte Carlo simulations comparing FER or BER against the specification in the respective standards with all their parameters. Therefore, the approach from [8] is not applicable, since only single blocks can be decoded with this platform. Rapid prototyping as introduced in [9] is an option for gathering BER/FER performance but lacks flexibility for analysis, debugging, and application development: usually the designer wants to check a modification as early as possible. Following traditional IP block design approaches, smaller functional parts of the pipeline are compared against the reference model. This shortens hardware as well as software development, which consumes a great amount of time due to the sophisticated ISA. For a first test of minor modifications, simulations on single blocks are perfectly suitable and save simulation time.
Table III shows that for the validation of FlexiTreP for the required standards, more than 17,000 different code blocks
Fig. 3. Generic System Simulation Chain
exist. Each of them can be combined with various typical parameters depending on the code (e.g., window size, block or acquisition length, . . . ), so that the theoretical number of possible combinations multiplies. However, many parameters can be considered constant for a given code, as the values providing the best properties are known. Nevertheless, for each code block, hundreds of thousands of bits have to be simulated in order to reach statistical significance. As an example, let us assume that 10,000 blocks (corresponding to a FER of 10^-2) per simulation were sufficient. Table II lists the simulation times for a single block for each case. A complete validation of FlexiTreP for the given standards would require computed simulation times of more than five years.
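A back-of-envelope check of this order of magnitude, using the figures from Tables II and III; the blocks-per-run and bits-per-frame values are assumptions for illustration:

```python
# Rough total for a purely software-based Monte Carlo validation on
# the cycle-accurate ASIP model (bTC UMTS row of Table II: 1.4 kbps).
code_blocks = 17_319        # different code blocks, Table III
runs_per_block = 10_000     # Monte Carlo frames per code block (FER 1e-2)
bits_per_frame = 1_000      # assumed average information block length
throughput_bps = 1_400      # cycle-accurate simulation throughput

total_bits = code_blocks * runs_per_block * bits_per_frame
years = total_bits / throughput_bps / (3600 * 24 * 365)
print(f"~{years:.1f} years of simulation")  # on the order of years
```

Under these assumptions the total already lands in the multi-year range, consistent with the infeasibility argued in the text.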
In order to reduce this simulation effort, we have set up a simulation environment modeling the channel encoding and decoding chain as depicted in Figure 3. It is arbitrarily configurable: modules can be exchanged, added, or removed according to the needs. For comparison of a design against a reference, it is possible to instantiate the design under test (DUT) and an arbitrary reference model in parallel. The deployed reference models are I/O-equivalent to the implementation; they are well elaborated and also proven by existing hardware implementations.
This enables debugging and program development in an environment supporting the full flexibility of the design. Functional simulations can be run without the fairly slow cycle-accurate model, solely on the basis of the well-elaborated reference models, which reduces simulation times by up to an order of magnitude, depending on the code. In conjunction with formal methods, we still obtain high quality while simulation times reduce to a few days. The additional time for property checking is negligible once the properties are set up. Verification can be done independently, in parallel to the simulations.
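The exchangeable-module structure of such a chain can be sketched as a small harness in which the design under test and a reference decoder see identical received values; every module below is a hypothetical placeholder, not an actual FlexiTreP component:

```python
def run_chain(encode, channel, decoders, info_bits):
    """Push one block through the encoder and channel once, then feed
    the identical received values to all instantiated decoders (e.g.
    DUT and golden reference) and collect their outputs by name."""
    received = channel(encode(info_bits))
    return {name: decode(received) for name, decode in decoders.items()}

# Stand-in modules: identity encoder/channel/decoders for illustration.
outputs = run_chain(
    encode=lambda bits: bits,
    channel=lambda symbols: symbols,
    decoders={"dut": lambda rx: list(rx), "reference": lambda rx: list(rx)},
    info_bits=[1, 0, 1, 1],
)
```

Comparing `outputs["dut"]` against `outputs["reference"]` block by block localizes the first diverging value, which is what makes debugging in such a setup practical.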
Additionally, we have added an interface from the simulation chain to an FPGA board over Ethernet. With this, the whole simulation chain runs on a standard PC, offering the full flexibility of simulation. Only the design under test is exchanged for the hardware implementation. This setup enables, from the same environment, debugging and analysis of single code blocks, performance simulations in software or emulation with the RTL design, and comparisons to an already verified reference. The emulation offers an acceleration of another order of magnitude.
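A hardware-in-the-loop link of this kind can be sketched as a small TCP client; the length-prefixed little-endian wire format, function names, and hard-decision payload are assumptions for illustration, not the platform's actual Ethernet protocol:

```python
import socket
import struct

def _recv_exactly(sock, n):
    """Read exactly n bytes from the socket (TCP may fragment)."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("FPGA link closed early")
        data += chunk
    return data

def fpga_decode(host, port, soft_values):
    """Hardware-in-the-loop stand-in: ship one block of 16-bit soft
    values to the FPGA board over TCP and read back the decoded bits,
    one byte per bit."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(struct.pack(f"<I{len(soft_values)}h",
                                 len(soft_values), *soft_values))
        (n_bits,) = struct.unpack("<I", _recv_exactly(sock, 4))
        return list(_recv_exactly(sock, n_bits))
```

Because only the decoder module is swapped out, the software testbench, channel model, and reference comparison stay exactly as in the pure-software runs.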
Our separation approach offers an additional big advantage: whenever the hardware is modified, it can be shown formally that unmodified instructions are not influenced by the modifications. Hence, by applying the existing (or only slightly modified) properties again, it is assured that programs that do not use any new or modified instructions still work as before and need not be simulated again. This reduces the additional validation time for an enhancement to a third of a full rerun of all simulations.
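The resulting regression policy amounts to a simple selection rule; the program and instruction names below are invented for illustration:

```python
def programs_to_resimulate(programs, changed_instructions):
    """After a hardware modification, property checking shows that
    untouched instructions still behave as before, so only programs
    that use a changed or new instruction need to be re-simulated.
    `programs` maps a program name to the set of instructions it uses."""
    return {name for name, used in programs.items()
            if used & changed_instructions}

programs = {
    "viterbi_gsm": {"branch_metric", "acs", "traceback"},
    "turbo_lte": {"branch_metric", "acs", "llr_calc", "interleave"},
}
# Only the Turbo program touches the modified llr_calc instruction.
assert programs_to_resimulate(programs, {"llr_calc"}) == {"turbo_lte"}
```

A change to a shared instruction such as `acs` would, by the same rule, put every program back into the simulation queue.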
V. CONCLUSIONS AND FUTURE WORK
Application-specific programmable architectures are spreading quickly in the field of channel decoding. In this paper we outlined the differences between various implementation styles, showed the advantages of ASIPs and in particular WPIPs for channel decoding, highlighted their disadvantages w.r.t. validation, and backed this up with concrete numbers. We validated our existing FlexiTreP ASIP with our flexible channel decoding simulation environment and formal verification methods. We showed that with deliberate simulations in combination with formal methods, the effort for validating a multi-standard architecture can be reduced from infeasible times in the range of several months to a few days, depending on the supported codes. Our validated ASIP was successfully produced in a 65 nm technology and integrated into a commercial product.
For the future, enhancements of the architecture and an extension to a multi-core system are planned. The validation environment proved to be perfectly suitable for debugging these, thanks to its modular character and configurability.
REFERENCES
[1] Y. Lin, H. Lee, M. Woh, Y. Harel, S. Mahlke, T. Mudge, C. Chakrabarti, and K. Flautner, "SODA: A Low-power Architecture For Software Radio," in Proc. 33rd International Symposium on Computer Architecture ISCA '06, 2006, pp. 89–101.
[2] C. Rowen, “Silicon-efficient dsps and digital architecture for lte base-band,” in 10th International Forum on Embedded MPSoC and Multicore,Gifu, Japan, Aug. 2010.
[3] B. Bougard, R. Priewasser, L. V. der Perre, and M. Huemer, “Algorithm-Architecture Co-Design of a Multi-Standard FEC Decoder ASIP,” inICT-MobileSummit 2008 Conference Proceedings, Stockholm, Sweden,Jun. 2008.
[4] S. Kunze, E. Matus, and G. P. Fettweis, “ASIP decoder architecture forconvolutional and LDPC codes,” in Proc. IEEE International Symposiumon Circuits and Systems ISCAS 2009, May 2009, pp. 2457–2460.
[5] F. Naessens, B. Bougard, S. Bressinck, L. Hollevoet, P. Raghavan,L. Van der Perre, and F. Catthoor, “A unified instruction set pro-grammable architecture for multi-standard advanced forward error cor-rection,” in Proc. IEEE Workshop on Signal Processing Systems SiPS2008, Oct. 2008, pp. 31–36.
[6] O. Muller, A. Baghdadi, and M. Jezequel, “From Parallelism Levels toa Multi-ASIP Architecture for Turbo Decoding,” IEEE Transactions onVery Large Scale Integration (VLSI) Systems, vol. 17, no. 1, pp. 92–102,Jan. 2009.
[7] M. Alles, T. Vogt, and N. Wehn, “FlexiChaP: A Reconfigurable ASIPfor Convolutional, Turbo, and LDPC Code Decoding,” in Proc. 5thInternational Symposium on Turbo Codes and Related Topics, Lausanne,Switzerland, Sep. 2008, pp. 84–89.
[8] O. Muller, A. Baghdadi, and M. Jezequel, “From Application to ASIP-based FPGA Prototype: a Case Study on Turbo Decoding,” IEEEInternational Workshop on Rapid System Prototyping, pp. 128–134, Jun.2008.
[9] M. Alles, T. Lehnigk-Emden, C. Brehm, and N. Wehn, “A RapidPrototyping Environment for ASIP Validation in Wireless Systems,” inProc. edaWorkshop 09, Dresden, Germany, May 2009, pp. 43–48.
[10] T. Noll, T. Sydow, B. Neumann, J. Schleifer, T. Coenen, and G. Kappen,“Reconfigurable Components for Application-Specific Processor Archi-tectures,” in Dynamically Reconfigurable Systems, M. Platzner, J. Teich,and N. Wehn, Eds. Springer Netherlands, 2010, pp. 25–49.
[11] “Synopsys Processor Designer,” June 2010. [Online]. Available:http://www.synopsys.com/Tools/SLD/ProcessorDev/
[12] "Target Compiler Technologies," http://www.retarget.com.
[13] "Tensilica Inc.," http://www.tensilica.com.
[14] T. Vogt and N. Wehn, "A Reconfigurable ASIP for Convolutional and Turbo Decoding in a SDR Environment," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, pp. 1309–1320, Oct. 2008. [Online]. Available: http://dx.doi.org/10.1109/TVLSI.2008.2002428
[15] S. Loitz, M. Wedler, C. Brehm, T. Vogt, N. Wehn, and W. Kunz,“Proving Functional Correctness of Weakly Programmable IPs - ACase Study with Formal Property Checking,” in Proc. Symposium onApplication Specific Processors SASP 2008, Anaheim, CA, USA, Jun.2008, pp. 48–54.
[16] S. Loitz, M. Wedler, D. Stoffel, C. Brehm, N. Wehn, and W. Kunz,“Complete Verification of Weakly Programmable IPs against TheirOperational ISA Model,” in FDL, A. Morawiec and J. Hinderscheit,Eds. ECSI, Electronic Chips & Systems design Initiative, 2010, pp.29–36.
[17] A. Hoffmann, O. Schliebusch, A. Nohl, G. Braun, O. Wahlen, andH. Meyr, “A methodology for the design of application specific in-struction set processors (ASIP) using the machine description languageLISA,” in Computer Aided Design, 2001. ICCAD 2001. IEEE/ACMInternational Conference on, Nov. 2001, pp. 625–630.
Area and Throughput Optimized ASIP for Multi-Standard Turbo Decoding
Rachid Al-Khayat, Purushotham Murugappa, Amer Baghdadi, Michel JezequelInstitut Telecom; Telecom Bretagne; UMR CNRS 3192 Lab-STICC
Electronics Department, Telecom Bretagne, Technopole Brest Iroise CS 83818, 29238 BrestUniversite Europeenne de Bretagne, France
E-mail: {firstname.surname}@telecom-bretagne.eu
Abstract—In order to address the large variety of channel coding options specified in existing and future digital communication standards, there is an increasing need for flexible solutions. Recently proposed flexible solutions in this context generally present a significant area overhead and/or throughput reduction compared to dedicated implementations. This is particularly true when adopting instruction-set programmable processors, including the recent trend toward the use of Application-Specific Instruction-set Processors (ASIPs). In this paper we illustrate how the application of adequate algorithmic- and architecture-level optimization techniques to an ASIP for turbo decoding can make it an attractive and efficient solution in terms of area and throughput. The proposed architecture integrates two ASIP components supporting binary/duo-binary turbo codes and combines several optimization techniques regarding pipeline structure, trellis compression (Radix-4), and memory organization. The logic synthesis results yield an overall area of 1.5mm2 using 90nm CMOS technology. Payload throughputs of up to 115.5Mbps in both double binary turbo code (DBTC) and single binary turbo code (SBTC) modes are achievable at 520MHz. The demonstrated results constitute a promising trade-off between throughput and occupied area compared with existing implementations.
Index Terms—SoC design, Embedded System Architecture, ASIP, Pipeline Processor, Turbo codes, WiMAX, 3GPP, LTE, DVB-RCS.
I. INTRODUCTION
Systems on chips (SoCs) in the field of digital communication are becoming more and more diversified and complex. In this field, performance requirements, like throughput and error rates, are becoming increasingly severe. To reduce the error rate (closer to the Shannon limit) at a lower signal-to-noise ratio (SNR), turbo (iterative) processing algorithms have been proposed [1] and adopted in emerging digital communication standards. These standards target different sectors: LTE and WiMAX cover metropolitan areas for voice and data applications with limited video service, while the DVB series targets video broadcasting. A selected list of current standards and their throughput requirements is given in Table I.

The user demands, on the other hand, require these applications to be supported on a single portable device, which calls for future wireless devices, such as PDAs and smart phones, to be multi-standard. The efficient implementation
  Standard           | Codes | Rates     | States | Blocksize  | Channel throughput
  -------------------|-------|-----------|--------|------------|-------------------
  IEEE802.16 (WiMax) | DBTC  | 1/2 - 3/4 | 8      | .. 4800 .. | 75 Mbps
  DVB-RCS            | DBTC  | 1/3 - 6/7 | 8      | .. 1728 .. | 2 Mbps
  3GPP-LTE           | SBTC  | 1/3       | 8      | .. 6144 .. | 150 Mbps

TABLE I: Selection of standards supporting turbo codes. DBTC: Double Binary Turbo Code, SBTC: Single Binary Turbo Code.
of the advanced channel decoder, which is among the most area-consuming and computationally intensive blocks in the baseband modem, therefore becomes more important.

Numerous research groups have come up with different architectures providing specific reconfigurability to support multiple standards on a single device. A majority of these works target channel decoding and particularly turbo decoding. The supported types of channel coding for turbo codes are usually Single Binary and/or Double Binary Turbo Codes (SBTC and DBTC). In this context, the work in [2] presents an ASIP-based (Application-Specific Instruction-set Processor) implementation with a flexible pipeline architecture that supports turbo decoding of SBTC and DBTC. The presented ASIP occupies a small area of 0.42mm2 in 65nm technology (0.84mm2 in 90nm); however, it achieves a limited throughput of 37.2Mbps in DBTC and 18.6Mbps in SBTC mode at 400MHz. Besides ASIP-based solutions, other flexible implementations are proposed using a parametrized dedicated architecture (not based on an instruction set), like the work presented in [3]. The proposed architecture supports DBTC and SBTC modes and achieves a high throughput of 187Mbps. However, the occupied area is large: 10.7mm2 in 130nm technology (5.35mm2 in 90nm).
On the other hand, several single-standard dedicated architectures exist. In this category we can cite the dedicated architectures presented in [4] and [5], which support only the SBTC mode (3GPP-LTE). In [4] a maximum throughput of 150 Mbps is achieved at the cost of a large area of 2.1 mm2 in 65 nm technology (4.2 mm2 in 90 nm), while in [5] a maximum throughput of 130 Mbps is achieved at the cost of 2.1 mm2 in 90 nm technology. Another example is the dedicated architecture proposed in [6], which supports only the DBTC mode (WiMAX). A limited throughput of 45 Mbps is achieved (not covering all WiMAX requirements) with an occupied area of 3.8 mm2 in 180 nm technology (0.95 mm2 in 90 nm).

978-1-4577-0660-8/11/$26.00 c©2011 IEEE
Analyzing the overall state of the art in this context, one can note that the proposed flexible solutions generally present a significant area overhead and/or throughput reduction compared to dedicated implementations. This is particularly true when adopting instruction-set programmable processors, including the recent trend toward the use of ASIPs. In this paper we illustrate how the application of adequate algorithmic- and architecture-level optimization techniques to an ASIP for turbo decoding can make it an attractive and efficient solution in terms of area and throughput. The considered initial ASIP for turbo decoding is the one proposed in [7]; the proposed optimizations reduce its area from 0.2 mm2 to 0.15 mm2 in 90 nm technology and increase its throughput from 50 Mbps to 115.5 Mbps in DBTC mode and from 25 Mbps to 115.5 Mbps in SBTC mode. These significant improvements are obtained by applying three levels of optimization: (1) architecture optimization, by re-arranging the pipeline and reducing the number of instructions used in the iterative loop to generate extrinsic information; (2) algorithmic optimization, by applying trellis compression (Radix-4) to double the throughput in SBTC mode; and (3) memory re-organization to optimize the area.

The proposed architecture yields a simple, lightweight 1×1 decoder system which achieves an excellent ratio between throughput and area while supporting SBTC/DBTC turbo decoding for an array of standards (WiMAX, LTE, DVB-RCS). The rest of the article is organized as follows: section II presents the decoding algorithms used in the proposed architecture. Section III explains in detail the proposed architecture of the decoder system and the proposed optimization techniques. The synthesis results and comparisons w.r.t. the state of the art are given in section IV, and finally the paper concludes with section V, giving some future perspectives.
II. DECODING ALGORITHMS
A. Turbo decoding

The typical system diagram for turbo decoding is shown in Fig. 1. It consists of two component decoders exchanging extrinsic information via interleaving (Π) and deinterleaving (Π−1) processes. Component decoder0 receives the log-likelihood ratio Λ_k (1) for each bit k of a frame of length N in the natural order, while component decoder1 is initialized in the interleaved order.

Λ_k = log( Pr{d_k = 0 | y_0..N−1} / Pr{d_k = 1 | y_0..N−1} )    (1)

For efficient hardware implementation the Max-Log MAP algorithm is used, as described in [7]. For DBTC, the three normalized extrinsic information values are defined by (2), where i ∈ {01, 10, 11} is the value of the kth symbol and s′ and s are the previous and current corresponding trellis states respectively.

Z^n.ext_k(d(s′, s) = i) = Z^ext_k(d(s′, s) = i) − Z^ext_k(d(s′, s) = 00)    (2)
Fig. 1: Turbo decoding system
The extrinsic information defined by (3) is calculated from the a posteriori probability given by (4), wherein α_k(s) and β_k(s) are the state metrics in the forward (5) and backward (6) recursions respectively and γ_k(s′, s) are the branch metrics (7). γ^sys_k(s′, s) and γ^par_k(s′, s) are the systematic and parity symbol LLRs. Finally, when the required number of iterations N_iter is completed, the hard decision is calculated as given by (9).

Z^ext_k(d(s′, s) = i) = Γ × (Z^apos_k(d(s′, s) = i) − γ^int_k(s′, s))    (3)

Z^apos_k(d(s′, s) = i) = max_{(s′,s)/d(s′,s)=i} (α_{k−1}(s′) + γ_k(s′, s) + β_k(s)),  i ∈ {00, 01, 10, 11}    (4)

α_k(s) = max_{s′} (α_{k−1}(s′) + γ_k(s′, s))    (5)

β_k(s) = max_{s′} (β_{k+1}(s′) + γ_{k+1}(s, s′))    (6)

γ_k(s′, s) = γ^int_k(s′, s) + γ^n.ext_k(s′, s)    (7)

γ^int_k(s′, s) = γ^sys_k(s′, s) + γ^par_k(s′, s)    (8)

Z^Hard.dec_k = sign(Z^apos_k)    (9)
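The Max-Log MAP recursions above can be sketched compactly. The following Python code runs the forward recursion (5), the backward recursion (6), and the max-based a posteriori LLR computation on a generic single-binary toy trellis; the trellis representation and metric values are illustrative assumptions, not the paper's DBTC datapath:

```python
import numpy as np

def max_log_map(gamma, transitions, n_states):
    """Max-Log MAP recursions on a toy single-binary trellis.

    gamma: (K, n_trans) branch metrics, one column per trellis transition
    transitions: list of (s_prev, s_cur, bit) indexing gamma's columns
    Returns per-step a posteriori LLRs; hard decision = sign(LLR), eq. (9).
    """
    K = gamma.shape[0]
    NEG = -1e9                      # stand-in for -infinity
    alpha = np.full((K + 1, n_states), NEG)
    alpha[0, 0] = 0.0               # trellis starts in state 0
    beta = np.full((K + 1, n_states), NEG)
    beta[K, :] = 0.0                # unterminated trellis: all end states equal
    # forward recursion, eq. (5)
    for k in range(K):
        for t, (sp, sc, _) in enumerate(transitions):
            alpha[k + 1, sc] = max(alpha[k + 1, sc], alpha[k, sp] + gamma[k, t])
    # backward recursion, eq. (6)
    for k in range(K - 1, -1, -1):
        for t, (sp, sc, _) in enumerate(transitions):
            beta[k, sp] = max(beta[k, sp], beta[k + 1, sc] + gamma[k, t])
    # a posteriori LLR: max over bit=1 transitions minus max over bit=0
    llr = np.empty(K)
    for k in range(K):
        best = [NEG, NEG]
        for t, (sp, sc, bit) in enumerate(transitions):
            best[bit] = max(best[bit], alpha[k, sp] + gamma[k, t] + beta[k + 1, sc])
        llr[k] = best[1] - best[0]
    return llr
```

On a 2-state accumulator trellis with branch metrics biased toward a known bit pattern, the signs of the returned LLRs recover that pattern, matching the hard decision rule (9).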
B. Radix4 decoding algorithm

For SBTC, the trellis length is reduced by half by applying the one-level look-ahead recursion [8]. The modified α and β state metrics for this Radix-4 optimization are given by (10) and (11), where γ_k(s′′, s) is the new branch metric for the combined two-bit symbol (u_{k−1}, u_k) connecting states s′′ and s.
Fig. 2: Trellis compression (Radix4)
α_k(s) = max_{s′′} (α_{k−2}(s′′) + γ_k(s′′, s))    (10)

β_k(s) = max_{s′′} (β_{k+2}(s′′) + γ_k(s′′, s))    (11)

γ_k(s′′, s) = γ_{k−1}(s′′, s′) + γ_k(s′, s)    (12)

The extrinsic information for u_{k−1} and u_k is computed as:

Z^n.ext_{k−1} = Γ × (max(Z^ext_10, Z^ext_11) − max(Z^ext_00, Z^ext_01))    (13)

Z^n.ext_k = Γ × (max(Z^ext_01, Z^ext_11) − max(Z^ext_00, Z^ext_10))    (14)
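The per-bit extrinsic separation of (13) and (14) amounts to a handful of max operations. A minimal sketch (the dictionary-based interface is an illustrative assumption; the Γ value of 0.5 for SBTC is the one the paper reports later in section III-B):

```python
def split_radix4_extrinsic(z_ext, gamma_sf=0.5):
    """Recover per-bit extrinsic LLRs from Radix-4 symbol-pair metrics.

    z_ext: dict mapping the two-bit symbol '00'..'11' to its extrinsic metric
    gamma_sf: scaling factor Gamma (0.5 for SBTC per the paper)
    Returns (Z_{k-1}, Z_k) following equations (13) and (14).
    """
    # eq. (13): bit u_{k-1} is the first bit of the pair
    z_prev = gamma_sf * (max(z_ext['10'], z_ext['11']) -
                         max(z_ext['00'], z_ext['01']))
    # eq. (14): bit u_k is the second bit of the pair
    z_cur = gamma_sf * (max(z_ext['01'], z_ext['11']) -
                        max(z_ext['00'], z_ext['10']))
    return z_prev, z_cur
```

For example, with pair metrics {'00': 0, '01': 1, '10': 4, '11': 2}, equation (13) gives 0.5 × (4 − 1) = 1.5 and equation (14) gives 0.5 × (2 − 4) = −1.0.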
C. QPP Interleaving

Interleaving/deinterleaving of the extrinsic information is a key issue when addressing multiple extrinsic values in the same cycle, because memory access contention may occur when the MAP decoders fetch/write extrinsic information from/to memory. The interleaving/deinterleaving addresses follow the LTE-standard QPP interleaver, which is contention-free and expressed by a simple mathematical formula. Let N be the number of data couples in each block at the encoder input:

For j = 0 .. N − 1,  I(j) = (F2·j² + F1·j) mod N    (15)

where F1 and F2 are constants defined in the standard and j is the index in the natural order. By definition, parameter F1 is always odd whereas F2 is always even. The QPP interleaver has many algebraic properties; an interesting one is that I(x) has the same even/odd parity as x:

I(2k) mod 2 = 0,  I(2k + 1) mod 2 = 1    (16)

This algebraic property is used in section III-C to design a memory system that addresses multiple extrinsic values without memory access contention.
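The parity property (16) is easy to verify numerically. The sketch below evaluates (15) for one LTE block size; the parameters N = 40, F1 = 3, F2 = 10 are taken from the 3GPP-LTE interleaver table (quoted here from memory, so treat them as an assumption):

```python
def qpp_interleave(j, n, f1, f2):
    """QPP interleaver address, eq. (15): I(j) = (f2*j^2 + f1*j) mod n."""
    return (f2 * j * j + f1 * j) % n

# LTE parameters for block size 40 (assumed for illustration)
N, F1, F2 = 40, 3, 10
addresses = [qpp_interleave(j, N, F1, F2) for j in range(N)]

# I is a permutation of 0..N-1 (contention-free addressing)
assert sorted(addresses) == list(range(N))
# Parity property (16): I(x) keeps the parity of x, since f2*j^2 is
# always even (f2 even) and f1*j has the parity of j (f1 odd).
assert all(a % 2 == j % 2 for j, a in enumerate(addresses))
```

This parity preservation is exactly what allows the odd/even extrinsic memory banking described in section III-C.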
III. DECODER SYSTEM ARCHITECTURE
The proposed decoder system architecture consists of 2 ASIPs interconnected directly, as shown in Fig. 3. The shuffled-decoding ASIPs [7] are configured to operate in a 1×1 mode, with one ASIP (ASIP0) processing the data in the natural order while the other (ASIP1) processes it in the interleaved order. The generated extrinsic information is exchanged between the two ASIP decoder components via a connection of buffers and multiplexers.
Fig. 3: 1× 1 Decoder system architecture
A. ASIP architecture and optimization levels
Fig. 7a illustrates the overall architecture of the optimized ASIP with the proposed memory structure and a pipeline organisation in 9 stages. The numbers in brackets indicate the equations (referred to in section II-A) mapped onto the corresponding pipeline stages. The extrinsic information format at the output of the ASIP is also depicted in the same figure for the two modes, SBTC and DBTC. The rest of this sub-section details the proposed architecture optimizations, classified in three levels, to achieve an efficient solution in terms of area and throughput.
1) Architecture level optimization: The initial decoding process of the ASIP proposed in [7] was performed through 8 pipeline stages and implemented the butterfly scheme for metric computation. During this process, the ASIP calculates the α metrics (forward recursion) and the β metrics (backward recursion) simultaneously until it reaches the middle of the processed sub-block (the left butterfly). In the left butterfly, the recursion units perform the state metric calculations in the first clock cycle and the max operators find the state metric maximum values in the second clock cycle, eq. (5) & eq. (6). While processing the other half of the sub-block (the right butterfly), besides finding the state metric values, another three clock cycles are required for the extrinsic-information additions and for finding the maximum a posteriori information, eq. (4). All these operations take place in the EX stage, so 7 clock cycles in total are required to generate the extrinsic information for two symbols.

The major architecture-level optimization re-arranges the pipeline by modifying the EX stage to place the recursion units and max operators in series, so that a single instruction calculates the state metrics and finds the maximum values, eq. (5) & eq. (6). Similarly, one instruction finds the maximum a posteriori information, eq. (4). In fact, finding the maximum a posteriori information is done in three cascaded stages of max operators (searching for the maximum among 8 metric values); placing them in series with the recursion units in one pipeline stage would increase the critical path (i.e. reduce the maximum clock frequency). To avoid that, a new pipeline stage (MAX stage) is added after the EX stage to distribute the max operators, as shown in Fig. 4.

During the decoding process in the left butterfly, the ACS (Add, Compare, Select) units perform the state metric calculations and find the state metric maximum values in the same clock cycle in the EX stage, while during the right butterfly, besides finding the state metric values, the ACS units perform the additions and find the maximum a posteriori information, eq. (4), in the MAX stage in one clock cycle. So 3 clock cycles in total are required to generate the extrinsic information for two symbols.

Another proposed optimization concerns the implementation
of windowing to process large block sizes, which is achieved by dividing the frame into N windows, where the maximum supported window size is 128 bits. Fig. 5 shows the window processing in the butterfly scheme: the ASIP calculates the α values (forward recursion) and β values (backward recursion) simultaneously, and when it reaches half of the processed window (left butterfly) and starts the other half (right butterfly), it can calculate the extrinsic information on the fly along with the α and β calculations. State initializations (α_init(w^i_{n−1}), β_init(w^i_{n−1})) of the α−β recursions across windows are done by message passing via a dedicated array of registers. Since the maximum window size is 128 bits, 48 windows are needed to cover all LTE block sizes, so a 48×96 array of registers is added.
2) Algorithmic level optimization: In the ASIP proposedin [7], SBTC throughput equals half of DBTC throughput
Fig. 4: Modified pipeline stages (EX, MAX)
Fig. 5: Windowing in butterfly computation scheme
because the decoded symbol is composed of 1 bit in SBTC while it is 2 bits in DBTC mode. Trellis compression is applied to overcome this bottleneck, as explained in section II-B. This makes the decoding calculation for SBTC similar to DBTC, as presented in eq. (10), eq. (11) and eq. (12), so no additional ACS units are added. The only extra calculation is to separate the extrinsic information into the corresponding symbols, as presented in eq. (13) and eq. (14), and the cost of its hardware implementation is very small. Fig. 6 depicts the butterfly scheme with Radix-4, where the numbers indicate the equations (referred to in section II-B). In this case four bits (single binary symbols) are decoded each time.
Fig. 6: Butterfly scheme with Radix4
3) Memory level optimization: Three major memory structure optimizations are implemented. The first one concerns the normalization of the extrinsic information as presented in eq. (2). This optimization reduces the extrinsic memory by 25% because the 00 component (γ^n.ext_00) is no longer stored. The second optimization restricts the support of the trellis definition to a limited number of standards (WiMAX, 3GPP) rather than all possible ones. Besides reducing the complex multiplexing logic, this optimization allows a 1-bit mode selection which is passed through the instruction set, and thus the configuration memories which store the trellis definition are eliminated.
The third optimization re-organizes the input and extrinsic memories. Input memories contain the channel LLRs Λn, which are quantized to 6 bits each. In the proposed organization for DBTC mode, the LLR values of the systematic bits (S^n_0, S^n_1) and parities (P^n_0, P^n_1) of the same double binary symbol are stored in the same input memory word. In SBTC mode, however, each input memory word stores the LLR values (S^n_0, S^{n+1}_0, P^n_0, P^{n+1}_0) of two consecutive bits (single binary symbols). The same approach is applied to the extrinsic memories. As normalized extrinsic values are quantized to 8 bits, in DBTC mode the values γ^n.ext_01, γ^n.ext_10, γ^n.ext_11 related to the same symbol are stored in the same memory word, while in SBTC mode each memory word stores the extrinsic values γ^n.ext_1, γ^{(n+1).ext}_1 of two consecutive bits. In this way the memory resources in the two turbo code modes (SBTC/DBTC) are efficiently re-utilized.

Memory sizes are dimensioned to support the maximum block size of the target standards (Table I). This corresponds to the 6144-bit frame size of the 3GPP-LTE standard, which results in a memory depth of (6144 + 3 tail bits + 1 unused) / ((N_sym = 2) × (N_mb = 2)) = 1537 words for both input and extrinsic memories, where N_sym is the number of symbols per memory word and N_mb is the number of memory blocks (N_mb = 2 as the butterfly scheme is adopted).

Table II presents the memories used in the proposed ASIP. It has 2 single-port input memories of size 24×1537 to store channel LLR values and 2 simple dual-port (one read port, one write port) extrinsic memories to save a priori information. Each extrinsic memory is split into two banks: odd 8×1537 and even 16×1537. Each ASIP is further equipped with a 128×16 cross-metric memory which implements buffers to store β and α during the left-butterfly calculation phase, re-utilized in the right-butterfly phase.
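The 1537-word memory depth quoted above is simple arithmetic; a quick check with the paper's numbers:

```python
block = 6144          # maximum 3GPP-LTE block size (bits)
tail, unused = 3, 1   # tail bits plus one unused entry
n_sym, n_mb = 2, 2    # symbols per memory word, memory banks (butterfly)

# Depth of both input and extrinsic memories, per the dimensioning above
depth = (block + tail + unused) // (n_sym * n_mb)
assert depth == 1537
```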
Memory name            #  Depth  Width
Program memory         1  64     16
Input memory           2  1537   24
Extrinsic memory odd   2  1537   8
Extrinsic memory even  2  1537   16
Cross-metric memory    1  16     128

TABLE II: Typical RAM configuration used for one ASIP decoder component
B. Assembly Code Example
An assembly code example of the proposed optimized ASIP in turbo mode is shown in Fig. 7b. First we initialize the ASIP mode (SBTC, DBTC) and the scaling factor Γ identified in eq. (3) and eq. (13); software simulations showed the best BER performance for Γ = 0.75 in DBTC and Γ = 0.5 in SBTC. We also initialize the current iteration number (iter = 0), the number of windows (N) per ASIP, the window length (L) and the length of the last window (L_last). The REPEAT instruction controls the number of iterations (ITER_MAX = 6). For the first iteration (i = 0) the ASIP starts with zero as the initial state metrics (α_init(w^{i=0}_n) = β_init(w^{i=0}_n) = 0). The ZOLB instruction controls instructions @10 and @12-13 to execute L times (or L_last times in the case of the last window). The DATA LEFT instruction @10 executes the left-butterfly recursion, calculating the α/β metrics and storing them in the cross-metric memory. The DATA RIGHT instruction executes the right-butterfly recursion, calculating α/β metrics that are used on the fly, along with the corresponding stored metrics from the cross-metric memory, by the next instruction EXTCALC, which calculates the extrinsic information (3) @12-13 and sends it to the other decoder component through the buffer and multiplexer; the extrinsic calculation thus requires two clock cycles. To avoid a conflict in the cross-metric memory when the ASIP finishes processing the left butterfly and starts the right butterfly, a NOP @11 is placed and executed once for a 1-clock-cycle delay. In SBTC mode, four extrinsic values are generated, one for each input LLR, while in DBTC, six extrinsic values are generated, three for each input LLR ((13), (14)). The EXCH WIN instruction forwards the last α^i_{(n)} values as α_init(w^i_{(n)}), initializes the state metrics of the next window with the β values of window n, and increments the current window counter (n = n + 1).
C. Addressing implementation
In DBTC turbo decoding, due to the use of the butterfly scheme, two symbols are decoded at the same time, so two extrinsic information values are generated simultaneously and must be addressed to the other component decoder. As explained in section III-B, two clock cycles are available during the right butterfly to generate the extrinsic information, so one value is addressed in the first clock cycle and the other is buffered to be addressed in the next clock cycle. In SBTC, Radix-4 decoding is adopted. Using this decoding with the butterfly scheme generates four extrinsic values simultaneously each time, which must be addressed and sent to the other decoder component in two clock cycles. To avoid collisions, QPP interleaving is applied as explained in section II-C. According to eq. (16), odd addresses in the natural domain are also odd in the interleaved domain, and the same holds for even addresses. The extrinsic memories have therefore been split into two banks (odd/even) to avoid memory conflicts. In fact, in the first clock cycle two extrinsic values (out of the four generated in SBTC mode), one with an odd and one with an even address, are sent, followed by the other two extrinsic values in the next clock cycle.
IV. SYNTHESIS RESULTS
The ASIP was modeled in the LISA language using CoWare's Processor Designer tool. The generated VHDL code was validated and synthesized using Synopsys tools and 90 nm CMOS technology. The obtained results demonstrate an area of 0.15 mm2 per ASIP with a maximum clock frequency of F_clk = 520 MHz. Thus, the proposed turbo decoder architecture with 2 ASIPs occupies a logic area of 0.3 mm2 with a total memory area of 1.2 mm2. With these results, the turbo decoder throughput can be computed through equation (17). An average of N_instr = 3 instructions per iteration is needed to generate the extrinsic information for N_sym = 2 symbols in DBTC mode, where a symbol is composed of Bits_sym = 2 bits. In SBTC mode, the same number of instructions is required for N_sym = 4 symbols, where a symbol is composed of Bits_sym = 1 bit. Considering N_iter = 6 iterations, the maximum throughput achieved is 115.5 Mbps in both modes.

Throughput = (N_sym × Bits_sym × F_clk) / (N_instr × N_iter)    (17)
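Equation (17) with the reported parameters reproduces the 115.5 Mbps figure in both modes; a quick check:

```python
def decoder_throughput_mbps(n_sym, bits_sym, f_clk_mhz, n_instr=3, n_iter=6):
    """Turbo decoder throughput per equation (17), in Mbps
    (f_clk in MHz, so the bits/cycle ratio comes out directly in Mbps)."""
    return n_sym * bits_sym * f_clk_mhz / (n_instr * n_iter)

# DBTC: 2 symbols of 2 bits per 3 instructions; SBTC (Radix-4): 4 symbols of 1 bit
dbtc = decoder_throughput_mbps(n_sym=2, bits_sym=2, f_clk_mhz=520)
sbtc = decoder_throughput_mbps(n_sym=4, bits_sym=1, f_clk_mhz=520)
assert abs(dbtc - 115.5) < 0.1 and abs(sbtc - 115.5) < 0.1
```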
Work       Standard compliant    Tech (nm)  Core area (mm2)  Norm. core area @90nm (mm2)  Throughput (Mbps)   Fclk (MHz)
This work  WiMAX, DVB-RCS, LTE   90         1.5              1.5                          115.5 @6 iter       520
[3]        WiMAX, LTE            130        10.7             5.35                         187 @8 iter         250
[2]        DBTC, SBTC            65         0.42             0.84                         18.6-37.2 @5 iter   400
[6]        WiMAX                 180        3.8              0.95                         45                  99
[4]        LTE                   65         2.1              4.2                          150 @6.5 iter       300
[5]        LTE                   90         2.1              2.1                          130 @8 iter         275

TABLE III: Comparison with state-of-the-art implementations
Table III compares the obtained results of the proposed architecture with other related works. The ASIP presented in [2] supports both turbo modes (DBTC, SBTC). Although it occupies almost half the area of our proposed ASIP, its throughput is 6 times lower in SBTC mode and 3 times lower in DBTC mode. The parametrized dedicated architecture in [3] supports both turbo modes (DBTC, SBTC) and achieves a higher throughput (1.6 times) at the cost of more than 3.5 times the area of this work. The SBTC-dedicated architecture proposed in [4] achieves a throughput 30% higher than the proposed work, but at the cost of almost 3 times the occupied area. Similarly, the SBTC-dedicated architecture proposed in [5] achieves a throughput 13% higher than the proposed work, but at the cost of almost 1.4 times the occupied area. The DBTC-dedicated architecture proposed in [6] occupies an area around 30% smaller than this work, but the achieved throughput is around 40% lower. This analysis demonstrates how the proposed optimized architecture constitutes a promising trade-off between throughput and occupied area compared with existing implementations.
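The normalized core areas in Table III are consistent with quadratic feature-size scaling. The sketch below applies the common (90 nm / tech)² rule; this rule is an assumption on our part, since the paper does not state its normalization, and its figures round some entries toward factor-of-two steps:

```python
def normalize_area_90nm(area_mm2, tech_nm):
    """Scale a core area to 90 nm assuming area scales as (feature size)^2."""
    return area_mm2 * (90.0 / tech_nm) ** 2

# Spot-checks against the normalized column of Table III (loose tolerances,
# since the paper appears to round scaling ratios to powers of two)
assert abs(normalize_area_90nm(10.7, 130) - 5.35) < 0.25   # [3]
assert abs(normalize_area_90nm(0.42, 65) - 0.84) < 0.05    # [2]
assert abs(normalize_area_90nm(3.8, 180) - 0.95) < 0.01    # [6] (exact)
assert abs(normalize_area_90nm(2.1, 65) - 4.2) < 0.2       # [4]
```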
(a) ASIP Pipeline Architecture
k   instruction
1   SET CONF double
2   SET SF 6
3   SET WINDOW ID 1          ; set num windows
4   SET WINDOW N 3           ; 1st and last window length
5   SET SIZE 32,8
                             ; repeat @11=41 if last window executed else
                             ; repeat @28-41, for 6*WINDOW N times
6   REPEAT until LOOP 6times
7   NOP
                             ; repeat 30-31, and 35-36 for CurrWindowLen times
8   ZOLB RW1, CW1, LW1
9   NOP
10  RW1: DATA LEFT add m column2
                             ; save last beta, load alpha init
11  CW1: NOP
12  DATA RIGHT add m column2
13  LW1: EXTCALC add i line2 EXT
                             ; save last alpha, load beta init if last window else
                             ; exch calculated alpha and beta
14  EXCH WIN
15  NOP
16  LOOP: NOP
(b) Turbo assembly code
Fig. 7: ASIP pipeline and execution schedule
V. CONCLUSION
In this paper, we have presented an area-efficient, high-throughput 1×1 decoder system based on an ASIP that supports turbo codes in both modes, DBTC (WiMAX, DVB-RCS) and SBTC (LTE). Three levels of optimization (architecture, algorithmic, memory) have been proposed and significant performance improvements have been demonstrated. The proposed contribution illustrates how the application of adequate optimization techniques to a flexible ASIP for turbo decoding can make it an attractive and efficient solution in terms of area and throughput. Future work targets the integration of low-power decoding techniques.

VI. ACKNOWLEDGMENT
This work was supported in part by UDEC and TEROPPprojects of the French National Research Agency (ANR).
REFERENCES
[1] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit error-correcting coding and decoding: Turbo-codes (1),” In Proc. IEEE International Conference on Communications, ICC'93, vol. 2, pp. 1064–1070, 1993.
[2] T. Vogt and N. Wehn, “A Reconfigurable Application Specific InstructionSet Processor for Viterbi and Log-MAP Decoding,” In Proc. IEEEWorkshop on Signal Processing Systems Design and Implementation,SIPS’06, pp. 142–147, 2006.
[3] J.-H. Kim and I.-C. Park, “A Unified Parallel Radix-4 Turbo Decoderfor Mobile WiMAX and 3GPP-LTE,” In Proc. IEEE Custom IntegratedCircuits Conference, CICC’09., pp. 487–490, 2009.
[4] M. May, T. Ilnseher, N. Wehn, and W. Raab, “A 150 Mbit/s 3GPP LTETurbo Code Decoder,” In Proc. Design, Automation and Test in EuropeConference & Exhibition, DATE’10, pp. 1420–1425, 2010.
[5] C.-C. Wong and H.-C. Chang, “Reconfigurable Turbo Decoder With Parallel Architecture for 3GPP LTE System,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 57, no. 7, pp. 566–570, 2010.
[6] H. Arai, N. Miyamoto, K. Kotani, H. Fujisawa, and T. Ito, “A WiMAXturbo decoder with tailbiting BIP architecture,” In Proc. IEEE Asian Solid-State Circuits Conference, SSCC’09., pp. 377–380, 2009.
[7] O. Muller, A. Baghdadi, and M. Jezequel, “From Parallelism Levels toa Multi-ASIP Architecture for Turbo Decoding,” IEEE Transactions onVery Large Scale Integration (VLSI) Systems, vol. 17, no. 1, pp. 92–102,2009.
[8] Y. Zhang and K. Parhi, “High-Throughput Radix-4 logMAP TurboDecoder Architecture,” In Proc. Asilomar Conference on Signals, Systems,and Computers, ACSSC’06., pp. 1711–1715, 2006.
Design of an Autonomous Platform forDistributed Sensing-Actuating Systems
Francois Philipp, Faizal A. Samman and Manfred Glesner
Microelectronic Systems Research Group
Technische Universitat Darmstadt
Merckstraße 25, 64283 Darmstadt, Germany
{francoisp,faizalas,glesner}(at)mes.tu-darmstadt.de
Abstract—A platform for the prototyping of distributed sensing and actuating applications is presented in this paper. By combining a low power FPGA and a System-on-Chip specialized in low power wireless communication, we enable the development of a large range of smart wireless networks and control systems. Thanks to multiple customization possibilities, the platform can be adapted to specific applications while providing high performance and consuming little energy. We present our approach to designing the platform and two application examples showing how it was used in practice within a research project on adaptronics.
Keywords-Smart structures, Control Systems, Wireless SensorNetworks, Reconfigurable Hardware
I. INTRODUCTION
Distributed sensing systems are nowadays a key element
for the development of intelligent environments and adaptive
structures. Information gathered by spatially distributed sen-
sors can either be used for passive monitoring of a system con-
dition or active real-time feedback control. In both cases, tiny
platforms that can be easily integrated on existing structures
are required. Using wireless sensor nodes, placement of power
and sensor cables along a construction is no longer necessary,
but new issues regarding synchronization, speed and autonomy
are appearing.
We introduce a platform combining low power consumption
for autonomous wireless sensor networks applications and
high performance for real-time distributed control systems.
The Hardware accelerated LOw Energy Wireless Embedded
Sensor-Actuator node (HaLOEWEn) relies on fine-grained
reconfigurable hardware to implement complex data processing
tasks. While FPGAs tend to replace microcontrollers and DSPs
for prototyping control systems, their introduction in the
design of very low power autonomous embedded systems is
recent. The new generation of FPGAs based on non volatile
memory is highly suitable for this range of application where
a switch between active and sleep periods is frequent. The
power consumption of these devices is also sufficiently low
to be integrated in systems intended to run on long-term
deployments with batteries.
In addition, monitoring and control of structures with wireless
sensor networks depend on high-bandwidth sensing. Large
amounts of data are generated by vibration or acceleration
sensors. Wireless transmission of raw data would have a non-
negligible impact on the energy consumption of the node.
Alternatively, local data preprocessing and in-network ag-
gregation algorithms can be implemented with high energy-
efficiency on FPGAs, improving significantly the lifetime of
the network.
The paper is organized as follows. After a short review of
related work in section II, the architecture of the developed
platform and our design concept are introduced in section III.
We then present in section IV-A and IV-B two Wireless Sensor
- Actuator Networks (WSANs) applications using HaLOEWEn
as a prototype.
II. RELATED WORK
Prototyping of wireless sensor nodes with reconfigurable
hardware was addressed by Hinkelmann et al. in [1]. A Xilinx
Spartan3E with 2000k gates was used as a prototyping chip
for emulating wireless sensor networks microcontrollers. A
sophisticated hardware / software debugging interface has been
additionally developed for precise internal debugging of the
design implemented on the FPGA. Although the platform pro-
vides enough flexibility to implement and test a large range of
wireless sensor networks applications, its power consumption
is too high for long-term deployments. The node still needs
a reliable power supply close at hand, or cables, making the
wireless communication feature less practical.
Reconfigurability of FPGAs was used by Portilla et al. [2]
to implement custom sensor interfaces. Based on a Spartan III
with 200k gates, the COOKIE platform can interface a large
range of analog and digital sensors thanks to a HDL interface
library. However, even if it reduces the power consumption, the
limited size of the FPGA does not allow the implementation
of complex data processing circuits required by our target
applications.
Following a similar approach to reduce energy consump-
tion by locally processing the data, the Imote 2 node [3]
developed by Intel includes a high speed multimedia DSP
coprocessor to handle high bandwidth sensing. Significant im-
provements in performance and energy-efficiency were shown
for various applications in comparison to standard nodes
including only a simple microcontroller. Now commercially
available with multiple sensor and power supply extensions,
the Imote 2 is an interesting alternative for rapid-prototyping of
wireless networks applications implicating complex data pro-
cessing. However, custom hardware implementations enabled
Low Power Mode                        Power Consumption
FPGA Flash & Freeze - RF SoC Idle     30.7 mW
FPGA deep sleep - RF SoC Idle         29.8 mW
FPGA deep sleep - RF SoC LPM1         2.1 mW
FPGA deep sleep - RF SoC deep sleep   50 µW

TABLE I: POWER CONSUMPTION OF THE PLATFORM DURING DIFFERENT SLEEP MODES
by FPGAs result in many cases in higher performance for a
larger range of applications.
III. PLATFORM ARCHITECTURE
For our design, we considered an Actel IGLOO FPGA
AGL1000V5 [4] with 1000k equivalent system gates as the
central unit of the system. It is by default extended by a Texas
Instruments CC2531 System-on-Chip (SoC), integrating an
IEEE 802.15.4 compliant 2.4 GHz transceiver and an 8051 CPU
core [5]. The SoC includes 256 kB of programmable Flash memory
and 8 kB of RAM. The system runs with a 32 MHz oscillator.
Any node can be connected to a PC via the integrated USB
port. Through such nodes acting as base stations, data can
be accumulated and visualized immediately and user requests
can be disseminated to the whole network. Debuggers can be
plugged to the board allowing simultaneous monitoring of the
FPGA and the RF SoC operation.
The FPGA and the RF SoC communicate via a dedicated
SPI bus. When running at the maximum frequency with a
DMA controller, datarate can reach 2Mbps. Both components
have different deep low power modes useful for applications
involving long sleeping periods. The FPGA has a so-called
Flash & Freeze Low-Power mode with internal SRAM re-
tention activated by a dedicated pin driven by the software
running on the RF SoC. The RF SoC is also able to switch off
the power supply of the FPGA resulting in a deep low power
mode with configuration retention since IGLOO FPGAs are
based on flash memory. Thus, FPGA functionality is quickly
and energy-efficiently recovered after sleeping periods. Power
consumption measured on the platform are summarized in
table I. The power consumption of the FPGA in active mode
depends on the implemented design and can range from 4 mW
to 120 mW. When the radio is activated, 47 mW have to be
added in listening mode and 72mW when transmitting at
maximum power output (10dBm).
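From these figures a rough duty-cycled energy budget can be derived; the workload split below (FPGA design power, radio listening share, duty cycle) is an invented example for illustration, not a measurement from the paper:

```python
def avg_power_mw(active_mw, sleep_mw, duty_cycle):
    """Average platform power for a periodic wake/sleep schedule."""
    return duty_cycle * active_mw + (1.0 - duty_cycle) * sleep_mw

# Assumed workload: a 20 mW FPGA design plus radio listening (47 mW)
# active 1% of the time; both chips in deep sleep (~0.05 mW) otherwise.
avg = avg_power_mw(active_mw=20 + 47, sleep_mw=0.05, duty_cycle=0.01)
assert abs(avg - 0.7195) < 1e-6   # sub-milliwatt average at 1% duty cycle
```

Such estimates are what make the deep-sleep modes in Table I decisive for long-term battery-powered deployments.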
The platform can be powered by an external power supply
available next to the node or by a battery. If the external
conditions are adequate, a hybrid energy harvesting circuit
combining power extracted from different sources has been
developed for this platform [6]. The latter solutions are well
suited for monitoring-only systems but they are inappropriate
when the application includes the control of actuators. Power
generated by batteries or energy harvesting is not sufficient
in these cases and the platform is likely to be supplied by an
external source.
As they do not require complex data processing, low rate
analog sensors are connected to the integrated ADC of the
Fig. 1. Schematic Side View of the Platform with Extension Boards
SoC. A temperature and a light sensor were placed by default
on the board. Further analog sensors may be attached through
an external connector.
Other parts of the platform are customizable modular
circuits which are connected to one of the three remaining FPGA I/O
banks. An FPGA with a relatively high number of I/Os (256)
has been chosen to maximize the connectivity of the platform.
Available extension boards include for example additional
volatile and non-volatile memory, Analog-to-Digital (ADC)
and Digital-to-Analog (DAC) converters, interfaces to other
boards, digital sensors, etc. Each available I/O bank has a
dedicated 50 pins header connector to plug the extensions.
The board is small enough (60 mm x 96 mm) to be easily
integrated in various environments.
A. Development Environment
We distinguished two main parts for typical distributed
sensing-actuating applications: communication and data pro-
cessing. The wireless communication with other nodes of the
network is handled by the microcontroller in software while
sensors and actuators are directly interfaced by the FPGA
(Fig. 3). Preprocessing of the sensor data is thus handled
by dedicated hardware circuits for enhanced energy-efficiency.
Similarly, actuator control is implemented on the FPGA for
fast and accurate operation. Communication between the
microcontroller and the FPGA is limited to updates of the
control parameters (feedback information from other nodes)
and to data extracted from the sensors.
The wireless communication should guarantee a very accu-
rate synchronization between nodes in the case of distributed
control systems in order to minimize delays in the feedback
loop. As the communication is independent from the sensor
and actuator data processing, both operations can run in
parallel. Thus, very accurate synchronization may be achieved
through frequent resynchronization phases without interfering
with the accelerator operation.
A direct implementation of the communication protocol on
the FPGA, as it was done in [7], is limited by the area available
on the FPGA. Complex synchronization or routing protocols
require control units and memory that cannot fit together with
the data processing blocks. Even if improvements in energy
consumption are possible, it is preferable to use software
implementations of the networking protocol for prototyping
purposes.

Fig. 2. Top and Bottom View of the HaLOEWEn platform
Power management is also handled in software: FPGA and
RF SoC operation can be shut down to reduce the power con-
sumption of the platform during idle periods. If the platform
uses energy harvesting, part of the power management control,
like the maximum-power-point-tracking algorithm, can also be
mapped onto the processor [6].
The communication protocol is programmed in the C lan-
guage and can be supported by well-known wireless sensor
network operating systems like Contiki [8]. The design imple-
mented on the FPGA is described with VHDL or Verilog and
Actel IP cores within the Libero IDE.
Fig. 3. Typical Application Mapping
Fig. 4. Wireless Sensor Network for Acoustic Source Localization
IV. PROTOTYPING APPLICATIONS
In this section, we present the details of two applications
illustrating how the platform is used to test and develop
distributed sensing-actuating systems within a research
project.
A. Acoustic localization
We first detail a setup for the prototyping of a wireless
sensor network used for acoustic localization [9] as illustrated
in Fig. 4. The purpose of this application is to identify the
source of a sound disturbance in a closed environment.
Each platform is extended with two sensors: an ultrasonic
transceiver and a low-cost MEMS microphone. The ultrasonic
transceiver is used to perform an accurate self-localization
of the nodes relative to each other. Ultrasonic signals are
exchanged at regular time intervals to determine the distances,
and then the positions, of the nodes with a multilateration
algorithm. Nodes may thus have arbitrary positions and can be
moved during operation without interfering with the correct
operation of the application.

Fig. 5. Time of Arrival Estimator
The sound disturbance localization process involves two
steps in which intensive computation is required. In the first
step, the time differences of arrival of the sound at the different
nodes must be estimated. The most common way to do this
is to use cross-correlation either with a reference signal or
among the sounds recorded by the nodes of the network. In
order to minimize delays due to communication overhead, the
first solution is preferred, although it limits the localization to
predefined sounds (a ringtone, for example).
t_arrival = t,  max{(record ⋆ reference)(t)} > threshold    (1)
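As a software illustration of Equ. (1), the threshold-gated correlation peak search can be sketched in plain C; the function name, the brute-force sliding-window loop, and all numeric values are illustrative assumptions — on the platform itself this computation runs in dedicated FPGA hardware:

```c
#include <stddef.h>

/* Cross-correlate a recorded window against a known reference and
 * return the lag of the correlation peak, or -1 if no lag exceeds
 * the detection threshold (no arrival detected). */
static int toa_detect(const float *record, size_t n_rec,
                      const float *reference, size_t n_ref,
                      float threshold)
{
    int best_lag = -1;
    float best = threshold;          /* the peak must exceed the threshold */
    for (size_t lag = 0; lag + n_ref <= n_rec; lag++) {
        float acc = 0.0f;
        for (size_t j = 0; j < n_ref; j++)
            acc += record[lag + j] * reference[j];
        if (acc > best) {
            best = acc;
            best_lag = (int)lag;
        }
    }
    return best_lag;                 /* timestamp = lag / sampling rate */
}
```

The returned lag, divided by the sampling rate, gives the local timestamp that the nodes exchange during Post-Facto Synchronization.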
Thanks to the FPGA implementation of the cross-
correlation based detection, a timestamp can be very quickly
extracted. The nodes can then synchronize with each other by
exchanging their local time references (Post-Facto Synchronization)
and compute time differences of arrival (TDOAs).
In a second step, the TDOAs and the spatial coordinates of the
nodes are combined using an algorithm based on a least-squares
estimation called spherical intersection [10]. The
complexity of this algorithm is O(n³) for three-dimensional
localization, where n denotes the number of available
measurements. When the network is large, hardware acceleration
is thus desirable to speed up the computation. However, an
implementation of this algorithm is only necessary on one node,
since the result is unique for the whole network. As it does not
fit together with the cross-correlator on a single FPGA, the
localization accelerator is only implemented on a reference node
that keeps track of the other nodes' positions and measurements.
The hardware architecture of the accelerator is depicted in Fig. 6.
Using FPGAs for this application has several advantages: it
first allows a fast estimation of the cross-correlation necessary
for precise synchronization of the nodes. When activated, sound
processing has to be performed continuously in order to detect
the sound on every node. This data streaming implies real-time
processing of the incoming data, which is only supported by
dedicated computation blocks. Additionally, the network traffic
generated by the application is greatly reduced: only
time-of-arrival and synchronization information needs to be
exchanged.

Fig. 6. Architecture of the Localization Accelerator
Secondly, it allows a fast estimation of the source po-
sition within the network. The system is then able to run
autonomously without support from an external computation
unit allowing deployments in harsh environments. Examples
of applications can be found in the military domain (Counter-
sniper project) [11], but also for structural health monitoring
based on acoustic emissions [12].
B. Wireless Distributed Control Systems
In our project, the platform will also be used for active
vibration and noise control systems, in which the platform
will be deployed as a wireless distributed controller. A
decentralized control strategy will be used, where a master
platform coordinates data synchronization and communications.
Fig. 7 presents an example of a distributed control system
for vibration control of a large plate. We assume that the plate
vibration is controlled by N local adaptive controllers. Each
small area i ∈ {1, 2, · · · , N} on the plate is controlled by one
adaptive controller. As shown in the figure, only two adaptive
controllers are presented for the sake of simplicity. The
objective of the vibration control system is as follows: the
vibration sensed at N points on the plate is to be minimized
subject to a force disturbance d at an arbitrary location on the
plate, i.e. the error signals e_i, i ∈ {1, 2, · · · , N}, measured
by the error sensors should be minimized.
A sensor–actuator pair is placed on each node i. Piezoelectric
patches can be used as both actuators and sensors. In some
cases, tuned-mass dampers can also be used to absorb
vibrations. The signal u_i is the actuating signal sent by the
controller to the actuator, while e_i and x_i are the perturbed
error signal and the reference signal, respectively. Both signals
are used by the controller parameter adaptation mechanism.
Fig. 7. Distributed parameter control for vibration control of a large plate.
Fig. 8. Block diagram of the control systems.
Some blocks of transfer functions are shown in Fig. 7.
Tdx is the transfer function from the disturbance signal d to
reference sensor signal x. Tde is the transfer function from
the disturbance signal d to error sensor signal e. Tux is the
transfer function from the control signal u to reference sensor
signal x. Tue is the transfer function from the control signal
u to error sensor signal e. Fig. 8 shows the block diagram of
the distributed adaptive control system. Because the location
of the disturbance signal d is not fixed, the parameter values
(and possibly also the structure) of Tdx and Tde change
accordingly. An adaptive control system is therefore used to
handle this situation: the parameters of the adaptive controller
can be adaptively tuned to compensate for the changes in the
parameters of the transfer functions Tdx and Tde.
The structure of the adaptive transversal filter is shown
in Fig. 9. The transversal filter and the parameter adaptation
algorithm will be implemented on the Actel IGLOO FPGA
mounted on the platform. The filter consists of three main
units, i.e. a multiplier, an adder and a delay unit (z⁻¹). The
filter output is described by Equ. (2), where u(k) is the control
signal, a_j(k) are the tunable controller parameters, and x(k) is
the reference signal.
Fig. 9. Adaptive Controller Architecture.
u(k) = ∑_{j=0}^{P} a_j(k) × x(k − j)    (2)
The controller parameters a_p, p ∈ {0, 1, · · · , P}, are
adaptively tuned using the well-known Least-Mean-Square
(LMS) algorithm [13] shown in Equ. (3). The parameters are
updated using the measurement of the mean square error of an
error sensor signal at a local point in the system. The constant
γ is the adaptation gain, which can be increased to accelerate
the parameter adaptation. However, a higher gain tends to
destabilize the system, so a suitable value must be chosen.
aj(k + 1) = aj(k) + γe(k)x(k − j) (3)
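One controller step combines the filter output of Equ. (2) with the LMS update of Equ. (3). A minimal C sketch follows; the filter order, function name and the way samples are passed in are illustrative assumptions — on the platform the filter runs on the IGLOO FPGA, not in software:

```c
#define P 3  /* illustrative filter order: P+1 taps */

/* One adaptive-filter step: compute the control output u(k) per
 * Equ. (2), then update the taps a_j with the LMS rule of Equ. (3)
 * using the measured error e(k) and adaptation gain gamma.
 * x holds the reference samples x(k), x(k-1), ..., x(k-P). */
static float lms_step(float a[P + 1], const float x[P + 1],
                      float e, float gamma)
{
    float u = 0.0f;
    for (int j = 0; j <= P; j++)
        u += a[j] * x[j];              /* Equ. (2): transversal filter */
    for (int j = 0; j <= P; j++)
        a[j] += gamma * e * x[j];      /* Equ. (3): LMS tap update */
    return u;
}
```

The serial FPGA architecture mentioned below performs the same multiply–accumulate loop with a single multiplier and (P + 1) storage registers.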
For a system with a single adaptive controller, the error
signal is e(k) = y(k) − z(k) = Tde(z)d(k) − Tue(z)u(k).
Using Equ. (2), we then have
e(k) = Tde(z)d(k) − Tue(z){∑_{j=0}^{P} a_j(k)x(k − j)},
or, based on Fig. 8, e(k) = Tde(z)d(k) − Tue(z)C(z)x(k).
From the figure, we see that x(k) = Tdx(z)d(k) − Tux(z)C(z)x(k),
and hence x(k) = [Tdx(z) / (1 + Tux(z)C(z))] d(k). Therefore,
in a system with a single controller, the adaptive control will
reach a steady state, i.e. e(k) → 0, when Equ. (4) is fulfilled.

Tde(z) = Tdx(z)Tue(z)C(z) / (1 + Tux(z)C(z))    (4)
For a system with N adaptive controllers, the steady-state
condition holds when Equ. (5) is fulfilled:

Tde,i(z) = Tdx(z)Tue,i(z)Ci(z) / (1 + ∑_{i=1}^{N} Tux,i(z)Ci(z)),    (5)

where i ∈ {1, 2, · · · , N}.

In order to optimize logic gate usage in the IGLOO FPGA, the
adaptive transversal filter structure presented in Fig. 9 can
also be implemented with a serial architecture and (P + 1)
storage registers for the shifted reference signals and
controller parameters.
Wireless communication has been an active research topic in
the area of networked control systems [14], [15]. Two control
strategies can be used on our platform, depending on the role
of the wireless communication. Firstly, the wireless
communication can be used by a master controller and the
local controllers to synchronize data sampling from the
reference and error sensors. In this case, a decentralized
control strategy is used, where the LMS adaptation algorithm
tunes the controller parameters.
Secondly, the wireless communication can be used to exchange
the actuator and sensor data so that a master controller can
perform online parameter identification. The online parameter
identification can also be performed in each local controller.
The identified parameters can then be used to reconfigure the
controller parameter values so as to meet the control objective.
However, this strategy requires a high-speed and
guaranteed-lossless communication infrastructure. To address
these issues, a predictive or a state-estimation control strategy
can be used to reduce data exchanges over the wireless
medium [16], [17].
V. CONCLUSION AND FUTURE WORK
A platform that can be employed in multiple distributed
sensing and actuating applications has been introduced in this
paper. The hardware platform is realized as a printed circuit
board (PCB) carrying two main computing elements, the
IGLOO FPGA and the CC253x RF SoC. By plugging multiple
independent extensions into the FPGA, the node can be easily
adapted to specific applications. The power consumption of
the whole system is low enough for autonomous operation
over long periods, while complex control and data-processing
algorithms can be implemented with high efficiency.
In future work, the performance and reliability of our platform
in the aforementioned applications will be precisely measured.
Based on comparisons with traditional motes, speedup and
energy consumption gains will be estimated.
Additionally, we will use the platform for further appli-
cations on smart structures. In particular, a structural health
monitoring network for a bridge will be deployed.
ACKNOWLEDGEMENTS
This work has been supported by the European FP7 project
Maintenance on Demand (MoDe), Grant FP7-SST-2008-RTD
233890, and by the Hessian Ministry of Science and Arts through
the project AdRIA (Adaptronik-Research, Innovation,
Application), Grant Number III L 4–518/14.004 (2008).
REFERENCES
[1] H. Hinkelmann, A. Reinhardt, and M. Glesner, “A Methodology forWireless Sensor Network Prototyping with Sophisticated DebuggingSupport,” in Proceedings of the 19th IEEE/IFIP International Symposium
on Rapid System Prototyping, 2008.[2] J. Portilla, A. de Castro, E. de la Torre, and T. Riesgo, “A Modular Ar-
chitecture for Nodes in Wireless Sensor Networks,” Journal of UniversalComputer Science, vol. 12, pp. 328 – 339, 2006.
[3] L. Nachman, J. Huang, J. Shahabdeen, and R. A. R. Kling, “IMOTE2:Serious Computation at the Edge,” in Proceedings of the International
Conference on Wireless Communications and Mobile Computing, 2008.[4] IGLOO Low-Power Flash FPGAs Datasheet, Actel.[5] CC253x System-on-Chip Solution for 2.4 GHz IEEE 802.15.4 and ZigBee
Applications User’s Guide, Texas Instruments.[6] F. Philipp, P. Zhao, F. A. Samman, and M. Glesner, “Demonstration :
Monitoring and Control of a Dynamically Reconfigurable Wireless Sen-sor Node Powered by Hybrid Energy Harvesting,” in Design, Automation
& Test in Europe (DATE), University Booth, 2011.
[7] L. A. Vera-Salasa, S. V. Moreno-Tapiaa, R. A. Osornio-Riosa, andR. de J. Romero-Troncosob, “Reconfigurable Node Processing UnitFor A Low-Power Wireless Sensor Network,” in Proceedings of theIntenational Conference on Reconfigurable Computing, 2010.
[8] A. Dunkels, B. Gronvall, and T. Voigt, “Contiki - A Lightweight andFlexible Operating System for Tiny Networked Sensors,” in Proceedings
of the 29th IEEE International Conference on Local Computer Networks,2004.
[9] F. Philipp, F. A. Samman, and M. Glesner, “Real-time Characterizationof Noise Sources with Computationally Optimised Wireless SensorNetworks,” in Proceedings of the 37th Annual Convention for Acoustics
(DAGA), 2011.[10] H. C. Schau and A. Z. Robinson, “Passive Source Localization Employ-
ing Intersecting Spherical Surfaces from Time-of-Arrival Differences,”in IEEE Transactions on Acoustics, Speech and Signal Processing, 1987.
[11] G. Simon, M. Marti, . Ldeczi, G. Balogh, B. Kusy, A. Ndas, G. Pap,J. Sallai, and K. Frampton, “Sensor Network-Based Countersniper Sys-tem,” in Proceedings of the 2nd International Conference on Embeddednetworked sensor systems, 2004.
[12] S. D. G. C. U. Grosse and M. Krger, “Initial Development of WirelessAcoustic Emission Sensor Motes for Civil Infrastructure State Monitor-ing,” Smart Structures and Systems, vol. 6, pp. 197 – 209, 2010.
[13] S. Haykin, Adaptive Filter Theory, 3rd ed. Prentice-Hall, 1996.[14] N. J. Ploplys, P. A. Kawka, and A. G. Alleyne, ““Closed-Loop Control
over Wireless Networks”,” IEEE Control Systems Magazine, vol. 24,no. 3, pp. 58–71, June 2004.
[15] H. A. Thompson, ““Wireless and internet communications technologiesfor monitoring and control”,” Elsevier J., Control Engineering Practice,vol. 12, no. 6, pp. 781–791, June 2004.
[16] J. K. Yook, D. M. Tilbury, and N. R. Soparkar, “Trading Computation forBandwidth: Reducing Communication in Distributed Control Systemsusing State Estimator,” IEEE Trans. Control Systems Technology, vol. 10,no. 4, pp. 503–518, July 2002.
[17] R. Wang, G.-P. Liu, W. Wang, D. Rees, and Y. B. Zhao, “GuaranteedCost Control for Networked Control Systems Based on an ImprovedPredictive Control Method,” IEEE Trans. Control Systems Technology,vol. 18, no. 5, pp. 1226–1232, Sep. 2010.
Session 4: Virtual Prototyping for MPSoC
A Novel Low-Overhead Flexible Instrumentation Framework for Virtual Platforms
Tennessee Carmel-Veilleux∗, Jean-François Boland∗ and Guy Bois†
∗ Dept. of Electrical Engineering, École de Technologie Supérieure, Montréal, Québec, Canada† Dept. of Software and Computer Engineering, École Polytechnique de Montréal, Montréal, Québec, Canada
Abstract—Instrumentation methods for code profiling, tracing and semihosting on virtual platforms (VP) and instruction-set simulators (ISS) rely on function call and system call interception. To reduce instrumentation overhead that can affect program behavior and timing, we propose a novel low-overhead flexible instrumentation framework called Virtual Platform Instrumentation (VPI). The VPI framework uses a new table-based parameter-passing method that reduces the runtime overhead of instrumentation to only that of the interception. Furthermore, it provides a high-level interface to extend the functionality of any VP or ISS with debugging support, without changes to their source code. Our framework unifies the implementation of tracing, profiling and semihosting use cases, while at the same time reducing detrimental runtime overhead on the target by as much as 90% compared to widely deployed traditional methods, without significant simulation time penalty.
Index Terms—Computer simulation, Software debugging, Software prototyping, System-level design
I. INTRODUCTION
With the advent of multiprocessor systems-on-chip (MPSoC) for consumer and networking applications, complexity has become a significant issue for system debugging and prototyping. Simulators and system-level modeling tools have become necessary to manage this complexity. Virtual platforms (VP) are system-level software tools combining instruction-set simulators (ISS) and peripheral models that are used to start software prototyping before availability of the final product. In the case of state-of-the-art MPSoCs, virtual platform models can even be used as the "golden model" provided to developers years before availability of final silicon [1]. The proliferation of SystemC-based design-space exploration tools (e.g. Platform Architect [2], ReSP [3], Space Studio [4], etc.) was also made possible by mature VP technology.
When using VPs for debugging or design-space exploration, software instrumentation methods can be used to obtain profiling data, execution traces or other introspective behavior. The runtime overhead (i.e. intrusiveness) of these instrumentation methods on the target is critical. It must be minimized to prevent interference with the strict timing constraints common in embedded software [5].
In this paper, we present a novel low-overhead flexible code instrumentation framework called Virtual Platform Instrumentation (VPI). The VPI framework can be used to extend existing virtual platforms with additional tracing, profiling and semihosting capabilities with minimal target code overhead and timing interference. Semihosting is a mechanism whereby a function's execution on the target is delegated to an external hosted environment, such as a VP.
The authors would like to acknowledge financial support from the Fonds québécois de la recherche sur la nature et les technologies (FQRNT), the École de technologie supérieure (ÉTS) and the Regroupement Stratégique en Microsystème du Québec (ReSMiQ) in the realization of this research work.
Semihosting is traditionally used to exploit the host's I/O, console and file system before support becomes available on the target [6].
Through our proposed framework, we make three main contributions.
Firstly, we describe a new mechanism for fully inlinable instrumentation insertion with table-driven parameter-passing between a simulated target and its host. Our method completely foregoes the function call parameter preparation overhead seen in traditional semihosting. In doing so, we reduce detrimental runtime overhead on the target by 2–10 times in comparison to traditional methods, while showing nearly identical simulation run times.
Secondly, we show that our framework can realize function semihosting, tracing and profiling tasks, thus unifying usually separate use cases.
Thirdly, we propose a generic high-level instrumentation handling interface for VPs which allows new instrumentation behavior to be added to existing tools without requiring modifications.
This paper is organized as follows: section II presents background information and related work about virtual platform code instrumentation, section III describes our proposed instrumentation framework, and section IV presents experimental case studies of semihosting and profiling, with conclusions and future work in section V.
II. BACKGROUND AND RELATED WORK
In this section we explore different instrumentation methods used for debugging, system prototyping and profiling on virtual platforms. This is followed by an overall comparison of the methods, including our proposed VPI framework.
For our purposes, we define virtual platforms as software environments that simulate a full target system on a host platform. Virtual platforms integrate instruction-set simulators as well as models of memories, system buses and peripherals to realize a full SoC simulator. The conceptual layering of a VP is shown in Figure 1. Through the development of our framework, we evaluated the features and mechanisms present in the Simics [7], Platform Architect [2], QEMU [8], ReSP [3] and OVPSim [9] virtual platforms. In our experimental case study of section IV, we concentrated on Simics and QEMU.
A. Instrumentation use cases overview
We define instrumentation as tools added to a program to aid in testing, debugging or measurements at run-time. These tools can be implemented as intrusive instrumentation functions in source code or as non-intrusive instrumentation functionality within a VP. In our context, intrusive means that target run time is affected in some way by the instrumentation. An instrumentation site refers to the location where instrumentation is inserted.
978-1-4577-0660-8/11$26.00 © 2011 IEEE
Figure 1. Virtual platform modeling layers: the host hardware (user's workstation) and host operating system (Windows, Linux) run the virtual platform model of the target system (e.g. a SystemC model), which contains the ISS cores, system bus, memory and peripheral models, as well as the Virtual Platform Instrumentation Interface (VPII).
Some examples of intrusive instrumentation use cases are:
• compile-time insertion of tracing or profiling calls at every function entry and exit point [10, p. 75];
• compile-time insertion of code coverage or other measurement statements in existing source code;
• insertion of probe points for fine-grained execution tracing at the OS kernel level (e.g. Kernel Markers [11] in the Linux kernel).
Conversely, examples of non-intrusive instrumentation include:
• insertion of breakpoints and watchpoints at runtime using a debugger to aid in tracing and debugging;
• interception of library function calls through their runtime address to emulate functionality or store profiling data [4], [3];
• runtime insertion of transparent user-defined instruments tied to program or data accesses, such as the probe, event and watch mechanisms of Avrora [12];
• storage of control-flow data in hardware trace buffers readable through specialized interfaces.
The instrumentation methods which implement these use cases differ significantly in how much they affect target and host run time (i.e. their intrusiveness) and in what they enable the user to do (i.e. their flexibility).
In terms of tracing instrumentation, many varieties exist which differ in semantic level. The VPs we evaluated each allowed straightforward dumping of a trace showing all instructions executed and every data access, with no context-related semantic information. Conversely, user-defined or compiler-inserted high-level tracing stores much less data, but with much higher semantic content (e.g. a list of task context switches in an OS [7]). When we refer to tracing in this paper, we are referring to the high-level tracing case.
In the next section we discuss semihosting, a method used to implement several of the aforementioned instrumentation tasks either intrusively or non-intrusively.
B. Semihosting function calls
In the general case, semihosting works by intercepting calls to specific function stubs in the target code. Instead of running the function on the target at these sites, the ISS forwards an event to the VP. The VP's semihosting implementation then examines the processor state in the ISS and emulates the function's behavior appropriately, with target time stopped.
With semihosting, mechanisms for call interception differ by implementation. Run-time and code size intrusiveness are directly linked to which interception mechanism is used. We distinguish three such mechanisms:
• "Syscall" interception: the "system call" instruction is diverted for use in interception. This is the traditional approach, as used by ARM tools [6] as well as in QEMU for PPC and ARM platforms. This approach may require an exception to be taken, with associated runtime overhead.
• "Simcall" interception: a specific instruction is diverted for use as an interception point. The instruction can be specific for that purpose, like the SIMCALL instruction in the Tensilica Xtensa architecture [13, p. 520], or it can be an architectural NO-OP, as in the Simics virtual platform, where it is called a "magic instruction" [7].
• Address interception: the entrypoints of all functions to be emulated are registered as implicit breakpoints in the VP. When the program counter (PC) reaches these breakpoints, interception occurs. This approach is used in tools such as Imperas OVPSim [9], ReSP [3] and Space Studio [4], amongst others. It can also be implemented in any VP with debugging support using watchpoints or breakpoints.
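As an illustration of the address-interception idea, a VP can test the program counter against a table of registered entrypoints before executing each instruction. The following C sketch is hypothetical — the names, handler signature and table layout are assumptions for illustration, not code from any of the cited tools:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handler type: a semihosted function emulated by the VP. */
typedef void (*semihost_handler)(void *cpu_state);

struct intercept {
    uint32_t addr;             /* entrypoint of the function to emulate */
    semihost_handler handler;  /* host-side emulation routine */
};

/* Consulted by the ISS before executing each instruction: returns the
 * registered handler when pc hits an intercepted entrypoint, or NULL
 * so that the instruction at pc is executed normally. */
static semihost_handler check_intercept(const struct intercept *tab,
                                        size_t n, uint32_t pc)
{
    for (size_t i = 0; i < n; i++)
        if (tab[i].addr == pc)
            return tab[i].handler;
    return NULL;
}

/* Example host-side handler (hypothetical): count calls to a stub. */
static int fopen_calls;
static void semihost_fopen(void *cpu_state)
{
    (void)cpu_state;  /* a real handler would read call arguments here */
    fopen_calls++;
}
```

A real VP would typically replace the linear search with a hash or an implicit-breakpoint mechanism so the per-instruction check stays cheap.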
With all traditional semihosting methods, context-specific parameters are passed using regular function parameter-passing. The preparation of semihosted function call parameters according to normal calling conventions accounts for most of the runtime overhead of these methods.
Since the emulated function is fully executed by VP host code, the entire state of the modeled system can be exploited. Using the example of an MPSoC model, this implies that the internal registers of all CPU cores can be accessed while processing the emulated function. This opens the possibility of emulating as much as a full OS system call, as done in [3], [14], or of providing "perfect" barrier synchronization primitives across the system [15].
Semihosting is increasingly used in the implementation of design-space exploration tools for hardware–software codesign. In that case, it enables developers to quickly assess the performance of an algorithm without having to deal with the accessory details of OS porting or adaptation early in the design phase [3], [4].
C. Comparison of instrumentation methods
In this section we compare different instrumentation methods commonly used under virtual platforms. The comparison matrix is shown in Table I. Although we have not yet detailed our Virtual Platform Instrumentation (VPI) method, it is included in the table for comparison purposes and fully described in section III.
Firstly, we evaluated intrusiveness in terms of code size, run time and features "lost" to the method (e.g. a system call no longer available). Secondly, we established whether the methods work without symbol information (i.e. even with a raw binary image) and whether they allow for inlining of the instrumentation. By symbol information, we mean the symbol table that links function names to their addresses, which is present in all object file formats. Finally, we determined whether the methods listed are suitable for the different use cases presented earlier. For the qualitative criteria, we evaluated the implementation source code or manuals of every method listed to determine the values shown.
Our VPI method appears to compare favorably with existing approaches. We contrast our method with other approaches and provide experimental results supporting these intrusiveness comparisons in section IV.
Table I
INSTRUMENTATION METHODS COMPARISON MATRIX
(The "Code", "Run-time" and "Features lost" columns rate intrusiveness; "Without symbols" and "Inline" whether the method works; the last three columns which use cases are supported.)

Method                                           | Code   | Run-time | Features lost | Without symbols | Inline | Tracing | Profiling | Syscall/OS emulation
Compiler-inserted profiling function calls       | High   | High     | No            | No              | No     | Yes     | Yes       | No
Traditional semihosting ("Syscall" interception) | Medium | Medium   | Yes           | Yes             | No     | Depends | Depends   | Yes
Traditional semihosting ("Simcall" interception) | Low    | Low      | Yes           | Yes             | No     | Depends | Depends   | Yes
Traditional semihosting (Address interception)   | Low    | None     | No            | No              | No     | Yes     | Yes       | Yes
Watchpoints / Breakpoints                        | None   | None     | No            | No              | N/A    | Yes     | No        | No
VPI (Proposed method)                            | Low    | None–Low | No            | Yes             | Yes    | Yes     | Yes       | Yes
III. DETAILS OF PROPOSED FRAMEWORK
Our code instrumentation framework (VPI) is composed of twosoftware elements:
1) an inline instrumentation insertion method with table-basedparameter-passing, implemented with inline assembler in Ccode;
2) a high-level virtual platform instrumentation interface (VPII)that handles interception of instrumentation sites by callingappropriate virtual platform instrumentation functions (VPIFs).
Combined, these two components form a low-overhead generic codeinstrumentation framework that can be implemented on any VP orISS with debugger support or extension capabilities.
For the purposes of this paper, the compiler's inline assembler extensions are those of the unmodified GCC version 4.5 C compiler [16]. However, the concepts behind our method are tool-agnostic and applicable to production-level compilers.
In the following subsections, we refer to the numbered markers in Figure 2 to illustrate the flow of instrumentation insertion from the initial source code to the compiler-generated assembler code. Marker 1 of Figure 2 will be listed as (Ê), marker 2 as (Ë), and so on. We will use the fopen() C library function as a semihosting example to illustrate instrumentation insertion.
A. Target-side instrumentation insertion
Instrumentation statements are inserted into target code by the developer using common C macros (Ê). They can refer to any program variable (Ë). Each instrumentation macro expands to inline assembler statements containing a semihosting interception block (Ì,Í) and a parameter-passing payload related to the desired instrumentation (Î). The entire instrumentation call site is inserted inline (i.e. in-situ).
At compile time, the interception block (Ð) and parameter-passing payload table (Ñ) are constructed from compiler-provided register and memory address allocations. This is done by accessing inline-assembler-specific placeholders (Ï) and pretending instructions are emitted from them.
When inline assembler is used within a function's body, placeholders referencing C variables in the assembler code are replaced by values from the compiler's internal register and memory address allocation algorithms.
We save these references out of band from the main code section (“.text”), in the read-only data section (“.rodata”). The choice of the “.rodata” section for the payload data table is deliberate, to prevent instruction cache interference by data that never gets read by user code. However, it is possible and sometimes required to use the “.text” section for the payload table. For example, if the target OS uses paged virtual memory, the interception block and payload table may need to be inlined in the code section. Otherwise, the table’s effective address range might not currently be mapped in by the OS, causing a data access exception at interception time.
Interception block
Although interception is still necessary with our method, we do not mandate the use of a specific mechanism. The interception block from our example of Figure 2 (Ì,Í) is composed of three parts: 1) “simcall” interception instruction (“rlwimi 0,0,0,0,9” in this case); 2) pointer-skipping branch; and 3) payload table pointer. The interception block shown is an arbitrary example. Any other interception mechanism described in Section II-B could be used, as long as it is supported by the VP.
Along with the interception block, a pointer to the parameter-passing payload table is used to link an instrumentation site with its parameters. An unconditional branch is added to the interception block to prevent the fetching and execution of the payload table pointer.
Parameter-passing payload table
The parameter-passing payload table serves as a link between the target program’s state and the high-level instrumentation interface running in the VP. For an instrumentation site, it both uniquely identifies the desired behavior and provides reference descriptors to the function parameters that should be passed to and from the handler. These reference descriptors allow a high-level instrumentation interface to both read data from, and write data back to, the target program’s state.
The format used for each payload table is as follows:
• Signature header (1 word), including a functional identifier (16 bits) and the quantity (from 0–15 each) of constants, input variable references and output variable references;
• Constants table (1 word each);
• Input variable references (fixed number of strings and/or instructions);
• Output variable references (fixed number of strings and/or instructions).
The signature header identifies the desired functional behavior (e.g. tracing, fopen(), printf(), etc.). For every functional identifier, it is possible to use more or fewer constants, inputs and outputs depending on the need. For instance, a “printf()” function could be implemented as 16 versions, covering the cases where 0 to 15 variables need to be formatted.
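As a concrete illustration of the signature header described above, the following Python sketch decodes one header word. The exact bit positions are not stated in the text, so the layout below is an assumption chosen to be consistent with the 0x0120000b example of Figure 2 and Table II (function 0x000b, 0 constants, 2 inputs, 1 output).

```python
# Hypothetical signature-header decoder. Assumed layout (not specified in
# the paper): low 16 bits = functional identifier; top three nibbles hold
# the 0-15 counts of constants, outputs and inputs respectively.

def decode_signature(word):
    """Split a 32-bit signature header into (func_id, constants, inputs, outputs)."""
    func_id = word & 0xFFFF
    n_constants = (word >> 28) & 0xF
    n_outputs = (word >> 24) & 0xF
    n_inputs = (word >> 20) & 0xF
    return func_id, n_constants, n_inputs, n_outputs

# The fopen() example from Figure 2: function 0x000b, 0 constants,
# 2 inputs, 1 output.
print(decode_signature(0x0120000B))  # → (11, 0, 2, 1)
```

Under this assumed layout, the decoded counts tell the VPII how many constant words and reference descriptors to read after the header.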
Constants are emitted from references known before runtime. For instance, our implementation of a semihosted “printf()” uses a
[Figure 2 listing: a C call f = VP_FOPEN(filename, "w+") is macro-expanded into inline assembler; after compilation, this yields the interception block (“rlwimi 0,0,0,0,9”, a branch skipping the payload table pointer, and the pointer itself: 3 words, 2 executed) and a “.rodata” parameter-passing payload (signature word 0x0120000b plus reference strings "11", "30" and "8(10)": 1 word and 3 strings, 16 bytes), annotated with markers Ê–Ñ.]
Figure 2. Overview of VPI instrumentation insertion in source code
constant slot for the pointer to the format string. The example of Figure 2 does not use any constants.
Input variable references and output variable references are compiler-provided data references that can be accessed by the VPIFs through the VPII.
Table II breaks down the payload table of our “fopen()” example from Figure 2. Again, this example is based on a PowerPC target, but equivalent content would be present for any architecture.
Although our example of Figure 2 uses only strings for references, both strings and instructions can be used, as long as the table format is understood by the VPII implementation. In the case where instructions are used, the VPII can disassemble them at runtime to decode the references they contain. To illustrate this, we show a store
Table II
DESCRIPTION OF FIGURE 2’S PAYLOAD TABLE

Compiled value            Description
0x0120000b                Signature header: function 0x000b; 0 constants; 2 inputs; 1 output
“11”                      Value of “retval” output variable is in GPR11
“30”                      Value of “filename” input variable is in GPR30
“8(10)” or stw 0,8(10)    Pointer to “mode” (“w+”) input variable is contents of GPR10 + 8
(“stw”) instruction that could replace the reference string of the last reference in the table.
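The string descriptors of Table II are compact enough for a VPII implementation to parse with a few lines of code. The following Python sketch, our own illustration rather than code from the framework, distinguishes a bare register number from a PowerPC-style offset(base) memory reference:

```python
import re

# Illustrative parser for the string reference descriptors of Table II:
# a bare number such as "11" names a general-purpose register, while
# "8(10)" uses PowerPC displacement syntax for "contents of GPR10 + 8".

def parse_reference(ref):
    m = re.fullmatch(r"(-?\d+)\((\d+)\)", ref)
    if m:  # memory reference: offset(base-register)
        return ("mem", int(m.group(2)), int(m.group(1)))
    return ("reg", int(ref), 0)  # plain register reference

print(parse_reference("11"))     # → ('reg', 11, 0)
print(parse_reference("8(10)"))  # → ('mem', 10, 8)
```

A VPIF handler would then read the value through the VP's register or memory accessors, depending on the returned kind.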
Remarks about insertion method construction
As far as we know, every other semihosting-based method is designed for source code equivalence: all instrumentation-calling code must remain identical after instrumentation is removed. This requirement has the advantage of allowing instrumentation to be included by simply linking with different versions of the libraries. However, parameter-passing becomes bound to the C calling conventions in effect on the target platform. We constructed our proposed instrumentation insertion method to overcome the artificial requirement of function call setup when running on virtual platforms.
In our case, where we know the instrumented binary will be run in a VP, it is only necessary to somehow tell the VP where to find function parameters after a call is intercepted. Function call preparation merely copies program variables into predetermined registers or stack frame locations. Since the VP can access all system state “in the background” without incurring instruction execution penalties, we replaced the function call and its associated execution overhead with a static parameter-passing table. Parameters can then be accessed by interpreting the table, rather than by reading predetermined registers or stack frame locations. The compiler guarantees the “reloading” from memory of any variable not locally available from registers or offsetable memory locations. This reloading overhead is, in all cases, a subset of standard function call overhead.
Another side effect of our method’s construction is that instrumentation insertion is always inlined. This has the desirable consequence of “following” other inlining done by the compiler. It then becomes trivial to instrument functions inlined by the compiler’s optimizer, and to identify them uniquely, without any special compiler support.
Finally, while optimizing compilers can reorder statements around sequence points in C code, some compiler-specific mechanisms can be used to guarantee the positioning of the inlined assembler blocks. During our tests with GCC 4.5 on ARM and PPC platforms, the use of volatile asm statements with a “memory” clobber prevented any instruction reordering from affecting the test result signatures at every optimization level.
B. Virtual Platform Instrumentation Interface
Within our framework, we propose that the VP be pre-configured to run a centralized instrumentation handler whenever interception occurs at an instrumentation point. A high-level, object-oriented virtual platform instrumentation interface (VPII) layer is used to interface between the VP and the instrumentation functions by providing abstract interfaces to the VP’s state and parameter-passing tables.
[Figure 3 diagram, bottom to top: Virtual Platform (VP) containing an Instruction Set Simulator (ISS); GDB + Python or an internal Python interface; Virtual Platform Instrumentation Interface (VPII); Virtual Platform Instrumentation Functions (VPIF).]
Figure 3. VPI framework implementation layers
[Figure 4 class diagram: abstract DebuggerInterface (eval_expr(), read_mem(), write_mem(), get_reg(), set_reg(), get_pc(), variable accessors) and ApplicationBinaryInterface (is_64_bits(), endianness and type-conversion helpers) classes, specialized as GdbDebuggerInterface, PpcApplicationBinaryInterface and PpcGdbDebuggerInterface; a VariableAccessor hierarchy with DirectAccessor, RegisterAccessor and SymbolicAccessor; and VPIInterface/VPITriggerMethod/VPIFunction classes (register_func(), get_payload(), process(), accessor_factory()), specialized down to PpcVPIInterface, PpcGdbVPIInterface and PpcGdbVPITriggerMethod, which access state through the debugger interface.]
Figure 4. Class hierarchy of a sample VPII implementation
It also executes the appropriate virtual platform instrumentation function (VPIF) handler on behalf of the target code. The layers forming this high-level interface are shown in Figure 3.
The VPII abstraction allows the VPIF handlers to access registers, memory and internal VP state using generic accessors that hide low-level platform interfaces. It also handles data conversion tasks related to a platform’s application binary interface (ABI).
The VPII is implemented using the high-level language (HLL) extension interfaces built into VPs. For instance, this could be an internal script interpreter, such as Python in Simics [7] or Tcl in Synopsys Platform Architect tools [2]. It could also be a C++ library built on top of a SystemC simulator. Alternatively, the GNU Debugger’s (GDB) Python interface can be used to implement a generic VPII suitable for existing ISS and VP implementations with GDB debugging support.
For our experimental implementation, we developed VPII and VPIF libraries supporting both Simics’ and GDB’s Python extension interfaces. The class hierarchy for our implementation of the VPII interface for PowerPC targets with GDB-based VP access is shown in Figure 4. In that example, the PpcGdbVPIInterface class is used as the focal point to register instrumentation behavior (VPIF handlers) and access the VP through GDB.
For testing, we also developed a sample library of VPIF handlers covering the common instrumentation tasks of I/O semihosting, tracing and code timing. New VPIF handlers can be registered and modified dynamically at run-time with our sample Python implementation. Handlers written generically, using only parameter accessors and no VP-specific functionality, can be reused on any supported architecture.
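A minimal sketch of this registration-and-dispatch pattern, borrowing the register_func()/process() method names from Figure 4, might look as follows. The payload dictionary and the handler body are hypothetical stand-ins for the real accessor-backed objects of the framework.

```python
# Minimal sketch of the dispatch role the VPII plays. Real handlers would
# read parameters through accessor objects backed by Simics or GDB; here
# a plain dict stands in for a decoded payload table.

class VPIInterface:
    def __init__(self):
        self._handlers = {}

    def register_func(self, func_id, handler):
        """Associate a functional identifier with a VPIF handler."""
        self._handlers[func_id] = handler

    def process(self, payload):
        """Called on interception: dispatch on the payload's function id."""
        return self._handlers[payload["func_id"]](payload)

vpii = VPIInterface()
vpii.register_func(0x000B, lambda p: "fopen(%s)" % p["inputs"][0])
print(vpii.process({"func_id": 0x000B, "inputs": ["testfile", "w+"]}))
# → fopen(testfile)
```

In the real implementation, process() would be invoked by the VP's interception hook, and the handler would write results back to the target through the output-variable descriptors.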
IV. EXPERIMENTAL CASE STUDY
In order to validate that the method we propose is flexible and has low overhead, we performed a comparative case study. We compared our experimental framework implementation to common instrumentation methods using controlled examples.
Source code for the case study, as well as for our VPI implementation, is available at http://tentech.ca/vp-instrumentation/ under a BSD open-source license.
A. Experimental setup
The case study was run on a standard PC running Windows 7 x64 Professional with an Intel Core 2 Duo P8400 with two 2.26 GHz cores. The toolchain and C libraries were from the Sourcery G++ 2010.09-53 release, based on GNU GCC 4.5.1 and GNU Binutils 2.20.51. The target was a PowerPC e600 single-core processor on the Wind River Simics 4.0.60 and QEMU PPC 0.11.50 virtual platforms. We used GDB 7.2.50 with Python support as the debugger.
We instrumented the “QURT” quadratic equation root-finding benchmark program from the SNU WCET suite [17] based on two instrumentation scenarios, which were run independently. These scenarios showcase the unification of profiling and semihosting use cases, since both are implemented using the same VPI framework functionality and insertion syntax.
Each scenario comprised a base non-instrumented case and four instrumented cases. The instrumented cases represent different combinations of instrumentation methods and VPI configurations. The VPI configurations were the following:
• Internal: the VPI handler is run internally on Simics’ Python interpreter with “simcall” interception.
• External: the VPI handler is run externally on GDB’s Python interpreter with debugger watchpoint interception, under either Simics or QEMU.
We compared the three following instrumentation methods:
• “VPI”: uses our inlined VPI instrumentation for each site;
• “Stub-call”: calls a C function stub at every site which wraps an inlined VPI instrumentation site, so that traditional semihosting function call overhead can be compared;
• “Full-code”: in the case of the printf() scenario, we run an optimized printf() implementation entirely on the target, with I/O redirected to a null device, so that “manual” non-semihosted instrumentation overhead can be compared.
For each run, we recorded binary section sizes, simulation times on the host and cycle counts on the target. Section sizes provide information about code size interference. Simulation times and cycle counts are used to compare runtime overhead. With the “stub-call” cases, the results in Tables III, IV and V are compensated by subtracting the wrapped VPI site contribution, which would have artificially inflated the results of those cases.
All results are from release-type builds with no debugging symbols and the “-O2” (“optimize more”) option on GCC. Host OS noise was quantified by executing 50 runs of each case.
B. Results of “printf()” semihosting scenario
The printf() semihosting scenario compares the space and time overhead of a printf() function semihosting use case. In this case, we inserted 3 instrumented sites to display the results of different loops of the QURT benchmark. Each loop ran 100 times, for a total of 300 calls. For all cases, the printf() implementation was functionally equivalent, with full float support. The printf() statement was printf("Roots: x1 = (%.6f%+.6fj) x2 = (%.6f%+.6fj)\n", x1[0], x1[1], x2[0], x2[1]).
Table III
TARGET RUNTIME OVERHEAD IN CYCLES FOR PRINTF() SCENARIO

Instrumentation case   CPU cycles   Total overhead   Per-call overhead   Overhead increase
None                   5 538 648    0                0                   N/A
Internal VPI           5 539 272    624±24           2                   ×1 (Base)
External VPI           5 539 872    1224±24          4                   ×2
Stub-call              5 545 848    6888±24          23                  ×11.5
Full-code              7 174 476    1 635 828±24     5453                ×2726.5
Table IV
BINARY SIZE OVERHEAD FOR PRINTF() SCENARIO

Instrumentation case   .text size   .data size   .rodata size   Total
None                   34 796       1864         1224           37 884
Internal VPI           +132         +0           +200           +332
External VPI           +168         +32          +200           +400
Stub-call              +320         +0           +80            +400
Full-code              +328         +0           +40            +368
We ran this scenario on both Simics and QEMU. QEMU only supports the external configuration without source code modifications. Since all binaries are identical between Simics and QEMU, the results of Tables III and IV apply equally to both VPs.
Simulated CPU runtime overhead results are detailed in Table III. Uncertainty on overhead was ±24 cycles because of the timing method. We observe that execution overhead per call for the VPI cases is only 2–4 cycles, depending on the configuration. The external VPI configuration—with watchpoint interception—requires twice as many instructions per call as the internal “simcall”-based VPI configuration. Overheads of the VPI cases are a significant 5–11 times reduction over traditional stub-call instrumentation. Function call preparation accounts for the higher overhead of the stub-call case. In contrast, even when excluding I/O cost, the full-code printf() case has 3 orders of magnitude higher runtime overhead than either VPI case.
Space overhead results are listed in Table IV, in comparison to the uninstrumented base case. Code section (.text) space overheads of the VPI cases are noticeably lower than those of the other cases. Through manual assembler code analysis, we confirmed that function call preparation accounted for the difference observed between the stub-call and full-code cases. As expected, the lower code section overheads of the VPI cases come at the cost of a larger constants section (.rodata), although total sizes are comparable.
In terms of simulation time, our VPI framework’s overhead depends considerably on whether an internal or external configuration is used. Simulation times for different scenarios under both Simics and QEMU are shown in Figure 5 (note the logarithmic scale on the simulation time axis). The “full-code” and “stub-call” cases in that figure do not have any interception methods enabled at runtime.
The internal uninstrumented (“Internal None”) case is shown to have no penalty on simulation time. Conversely, the external uninstrumented (“External None”) case—which uses watchpoint instead of “simcall” interception—causes some baseline interception overhead. Furthermore, the internal VPI-only case is shown to have no penalty over a traditional internal stub-call case.
With all instrumented cases, those using internal configurations
[Figure 5 content: bar chart of simulation times in seconds, logarithmic scale. Simics bars, in order: Base 0.247, Internal None 0.249, Full-code 0.285, Stub-call 0.292, Internal VPI 0.293, External None 0.353, External VPI 9.791. QEMU bars, in order: Base 0.304, External None 0.318, Full-code 0.353, Stub-call 150.0, External VPI 182.2.]
Figure 5. Simulation times for printf() scenario
display significantly better simulation performance than those using external configurations. Moreover, the internal VPI instrumented case is even faster than the external uninstrumented case under Simics. This shows that simulation time is practically unaffected by low instrumentation loads when an internal VPI framework configuration is used. The interception and VPII mechanisms appear to be much slower when going through the GDB interfaces used in all external cases. We determined that the slowdown was due to the overhead of both the GDB ASCII protocol and the context switches required to go back and forth between the GDB and VP processes. In contrast, the internal configuration has direct access to VP resources, which explains its better performance. In the case of QEMU, GDB communication overhead was prohibitive enough to prevent the use of our framework for non-trivial cases under that particular VP.
C. Results of “profiling” scenario
The profiling scenario compares the overhead of runtime profiling between stub-call (i.e. compiler-inserted) and inlined tracing/profiling instrumentation. In the stub-call cases, the -finstrument-functions option of GCC was used to automatically insert a call to instrumentation stubs at every function entry and exit point. For the VPI cases, we manually inserted the VPI tracing calls in the C source code at every function entry and exit point. In both cases, the instrumentation behavior involved recording execution tracing information to a file, as usually done by profiling tools. The tracing call was of the form vp_gcc_inst_trace("FUNC_ENTER", "NAME", __FILE__, __LINE__), where vp_gcc_inst_trace is a VPI instrumentation site insertion macro. There were 11 instrumentation sites, totalling 456 802 calls over a run and yielding a trace file over 23 megabytes long. This is more than a thousandfold increase in instrumentation calls over the printf() scenario. We did not run this scenario under QEMU in light of the prohibitive simulation times for the much simpler printf() scenario.
Results are detailed in Table V. We only present runtime and simulation time overhead results, since space overhead is negligible in runtime-dominated profiling use cases. As with the printf() semihosting scenario, large differences exist in the results depending on the configuration used. With internal configurations, the instrumentation calls penalized simulation time on the order of 150 µs per call. In contrast, the negative impact on simulation speed of accessing VP state through an external interface is clearly demonstrated by overhead results around 7 ms per call, which is close to 50 times worse than with internal cases. On the opposite end of the performance spectrum, the internal VPI instrumentation case displays significantly lower runtime overhead than the traditional stub-call approach, for a comparable simulation time.

Table V
OVERHEADS PER CALL FOR PROFILING SCENARIO

Instrumentation case   Total simulation time (s)   Runtime overhead (cycles)   Simulation overhead (s)
None                   0.250                       0                           0
Internal VPI           62.59                       2.13                        136.5 µ
Internal stub-call     69.88                       9.7                         152.4 µ
External VPI           3286                        4.06                        7193 µ
External stub-call     3299                        9.7                         7221 µ
In terms of target runtime overhead, a reduction of 2–5 times over the stub-call case is seen with the VPI cases. If complex behavior had been implemented in the instrumentation functions on the target instead of wrapping a VPI call, overhead would have increased proportionately over the simple stub-call cases shown.
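Both ratios can be checked directly against Table V. The following Python snippet reproduces the roughly 50-times simulation-overhead gap between external and internal configurations and the 2–5 times runtime-overhead reduction:

```python
# Cross-checking the profiling-scenario claims against Table V.

internal_vpi_us = 136.5   # Internal VPI simulation overhead per call (µs)
external_vpi_us = 7193.0  # External VPI simulation overhead per call (µs)
vpi_cycles = 2.13         # Internal VPI runtime overhead per call (cycles)
stub_cycles = 9.7         # stub-call runtime overhead per call (cycles)

# External vs. internal simulation overhead: "close to 50 times worse".
print(round(external_vpi_us / internal_vpi_us, 1))  # → 52.7

# Stub-call vs. VPI runtime overhead: within the stated 2-5x reduction.
print(round(stub_cycles / vpi_cycles, 1))  # → 4.6
```

The 4.6× figure corresponds to the internal case; the external case (9.7 / 4.06 ≈ 2.4) gives the lower end of the stated 2–5× range.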
V. CONCLUSION AND FUTURE WORK
Compared to existing semihosting and profiling instrumentation approaches, our contributed framework is shown to have lower runtime and space overhead on the target. In both case study scenarios, our method showed 2–11 times lower runtime interference compared to traditional methods. The lower overall target overhead and the construction of our VPI instrumentation insertion method enable the use of our framework to unify the implementation of previously separate semihosting and tracing/profiling use cases.
Because our method allows for inlining, is fully compatible with all optimization levels and has low target space and time overhead, it may remain in release code. With interception disabled in the VP, instrumented sites do not affect the runtime. This opens the possibility of distributing instrumented binaries which can later be pulled from the field for re-execution with instrumentation enabled under a VP.
In terms of simulation time, our VPI implementation has performance comparable to traditional stub-call semihosting when using the internal configuration.
We have also shown that our framework can be used to extend the instrumentation capabilities of existing VPs without changing their source code. This “add-on instrumentation” capability exploits scripting interfaces currently available in VPs and provides users with the option of reusing our sample implementation in their own environments.
While our results validate our assertions, we must also acknowledge that our prototype implementation suffers from some performance issues which are unrelated to the core VPI concepts presented in this paper.
Firstly, simulation time overhead is dominated by the choice of VPI configuration, with the external configuration executing as much as 50 times slower than internal configurations. In the case of our GDB-based external implementation, performance is limited by the communication and context switching overheads between GDB and the VP. These performance issues are due to the architecture of GDB and are shared by any tool employing GDB as a generic interface to a virtual platform.
Secondly, since our prototype implementation uses pure Python scripting code, it is at least an order of magnitude slower than what could be achieved using a native C/C++ implementation.
Future work includes implementing our VPI framework on a wider variety of VPs and architectures. Additional case studies and benchmarks could be beneficial in identifying more use cases where our method is an optimization of existing practices, while also serving as validation that inlined instrumentation is robust under more optimizations than those we validated.
ACKNOWLEDGEMENTS
We would like to thank L. Moss, J. Engblom, G. Beltrame and L. Fossati for providing us with valuable insights about code instrumentation on virtual platforms, which helped shape the construction of our framework and its presentation in this paper. We also wish to thank J-P. Oudet and the peer reviewers for helpful comments about the original manuscript.
REFERENCES
[1] Freescale Semiconductors, Inc. (2008, Jun.) Virtutech announces breakthrough hybrid simulation capability allowing mixed levels of model abstraction. Accessed 6/7/2010. [Online]. Available: http://goo.gl/UErXR
[2] Synopsys, Inc., CoWare Platform Architect Product Family: SystemC Debug and Analysis User’s Guide, v2010.1.1 ed., Jun. 2010.
[3] G. Beltrame, L. Fossati, and D. Sciuto, “ReSP: A nonintrusive transaction-level reflective MPSoC simulation platform for design space exploration,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 28, no. 12, pp. 1857–1869, Dec. 2009.
[4] L. Moss, M. de Nanclas, L. Filion, S. Fontaine, G. Bois, and M. Aboulhamid, “Seamless hardware/software performance co-monitoring in a codesign simulation environment with RTOS support,” in Proc. Design, Automation Test in Europe Conf. Exhibition (DATE), 2007, pp. 1–6.
[5] S. Fischmeister and P. Lam, “On time-aware instrumentation of programs,” in Proc. 15th IEEE Real-Time and Embedded Technology and Applications Symp. (RTAS), Apr. 2009, pp. 305–314.
[6] ARM Ltd, ARM Compiler Toolchain: Developing Software for ARM Processors, 2010, version 4.1, document number ARM DUI 0471B. [Online]. Available: http://goo.gl/qlKkO
[7] J. Engblom, D. Aarno, and B. Werner, Full-System Simulation from Embedded to High-Performance Systems. Springer US, 2010, ch. 3, pp. 25–45. [Online]. Available: http://dx.doi.org/10.1007/978-1-4419-6175-4_3
[8] QEMU open-source processor emulator. [Online]. Available: http://www.qemu.org
[9] Imperas Ltd. (2010) Technology: OVPsim. Accessed 12/15/2010. [Online]. Available: http://www.ovpworld.org/technology_ovpsim.php
[10] IBM Corporation, IBM XL C/C++ for Linux, V11.1, Optimization and Programming Guide, 2010, document number SC23-8608-00. [Online]. Available: http://goo.gl/e1Ri9
[11] J. Corbet. (2007, Aug.) Kernel markers. Accessed 12/10/2010. [Online]. Available: http://lwn.net/Articles/245671/
[12] B. L. Titzer and J. Palsberg, “Nonintrusive precision instrumentation of microcontroller software,” ACM SIGPLAN Not., vol. 40, pp. 59–68, Jun. 2005. [Online]. Available: http://doi.acm.org/10.1145/1070891.1065919
[13] Tensilica, Inc., Xtensa Instruction Set Architecture: Reference Manual, Santa Clara, CA, Nov. 2006, document number PD-06-0801-00.
[14] H. Shen and F. Petrot, “A flexible hybrid simulation platform targeting multiple configurable processors SoC,” in Proc. 15th Asia and South Pacific Design Automation Conf., Jan. 2010, pp. 155–160.
[15] N. Anastopoulos, K. Nikas, G. Goumas, and N. Koziris, “Early experiences on accelerating Dijkstra’s algorithm using transactional memory,” in Proc. IEEE Int. Symp. on Parallel Distributed Processing (IPDPS), May 2009, pp. 1–8.
[16] Free Software Foundation. The GNU C compiler. Accessed 11/1/2010. [Online]. Available: http://gcc.gnu.org
[17] S.-S. Lim. (1996) SNU-RT benchmark suite for worst case timing analysis. Original SNU site now down. [Online]. Available: http://www.cprover.org/goto-cc/examples/snu.html
Using Multiple Abstraction Levels to Speedup an MPSoC Virtual Platform Simulator
Joao Moreira∗, Felipe Klein∗, Alexandro Baldassin†, Paulo Centoducatte∗, Rodolfo Azevedo∗ and Sandro Rigo∗
∗Institute of Computing – University of Campinas (UNICAMP) – Brazil
[email protected], {klein, ducatte, rodolfo, sandro}@ic.unicamp.br
†IGCE/DEMAC – UNESP – Brazil
Abstract—Virtual platforms are of paramount importance for design space exploration, and their usage in early software development and verification is crucial. In particular, enabling accurate and fast simulation is especially useful, but such features are usually conflicting and tradeoffs have to be made. In this paper we describe how we integrated TLM communication mechanisms into a state-of-the-art, cycle-accurate, MPSoC simulation platform. More specifically, we show how we adapted ArchC fast functional instruction set simulators to the MPARM platform in order to achieve both fast simulation speed and accuracy. Our implementation led to a much faster hybrid platform, reaching speedups of up to 2.9x and 2.1x on average, with negligible impact on power estimation accuracy (average 3.26% and 2.25% of standard deviation).
I. INTRODUCTION
As new hardware architectures become increasingly complex, the need for tools to support their development becomes evident. The use of virtual platforms to enable design space exploration has proven to be an important procedure for accelerating the design of new hardware components, allowing early architectural exploration and verification.
Low power consumption is a key feature in hardware development, not only for embedded systems, extending battery lifetime, but also for hardware in general, reducing heat dissipation. The development of energy-aware systems is a hard task that can be assisted by virtual platforms, since they make it possible to trace the behavior of interconnected hardware components, allowing performance and power estimation.
Many approaches have been proposed for power estimation on single-core applications, but only a few options are available for the multi-core domain. Some simulation platforms [1], [2] with power analysis support have been published and are in use, but a need for more alternatives and resources, satisfying a wider range of testing possibilities, still exists.
Virtual platforms may be implemented using different abstraction levels. Cycle-accuracy provides precise simulations with highly trustworthy results, but this precision comes at the cost of complexity and, consequently, increased simulation time. This characteristic imposes hard performance limitations, sometimes making the execution of real-world applications unfeasible. The use of higher abstraction levels, such as functional simulators, reduces the simulation complexity, improving the platform’s time efficiency. However, since many hardware details are not taken into account, their results might be less precise compared to those generated with a cycle-accurate platform.
This paper is focused on how to improve the speed of a cycle-accurate platform by including a functional simulator while maintaining accuracy. We integrated functional simulators generated with ArchC [3] into the MPARM [1] platform, turning it into a faster hybrid platform. The contributions of this work are threefold. First, we introduce a new simulation resource into the MPARM platform to improve its speed by up to 2.9 times. Second, we present a detailed implementation description of a hybrid simulation platform, showing how the abstraction compatibility problems were fixed. Finally, we describe how we managed to statistically correct the precision loss introduced by the functional simulator.
This paper is organized as follows: Section 2 describes related work. Section 3 details the implementation of the platform, describing the interface of the functional simulator with the MPARM platform, techniques used to improve precision, and the verification process. Section 4 describes the experimental results, showing the obtained speedups and describing how power estimations were statistically corrected. Section 5 presents our conclusions.
II. RELATED WORK
MPARM [1] is a complete platform for Multi-Processor Systems-on-Chip (MPSoC) simulation. It is written in C++ and makes use of SystemC [4] as its simulation engine. The platform includes an implementation of a cycle-accurate ARM simulator called SWARM [5], AMBA buses, hierarchical memories and synchronization mechanisms. Cycle-accurate power models for many of the simulated devices are included in the MPARM platform, which makes it quite suitable for power estimation. MPARM is well known and has been used for power analysis in MPSoCs [6] and for testing hardware and software transactional memory systems [7], [8].
SimWattch [2] is a simulation tool based on Simics [9] and on Wattch [10], a power modeling extension present in SimpleScalar [11]. This tool has been designed to support microprocessor performance and power estimation, but no models are provided for other system components, such as external memories.
978-1-4577-0660-8/11/$26.00 © 2011 IEEE
Fig. 1. MPARM platform with ArchC generated cores
ArchC [3], [12] is an open-source SystemC-based architecture description language capable of generating fast, functional Instruction Set Simulators (ISSs). It is easy to modify an ArchC model to use its natively supported TLM [13] interfaces to communicate with external modules, providing seamless integration into virtual platforms. The ArchC ARMv5 model is a functional simulator, which means that there is no detailed pipeline simulation. This makes the simulator implementation very simple, turning it into a much faster option than the original SWARM simulator distributed with MPARM.
III. PROBLEM DESCRIPTION
Hardware research frequently requires the execution of large simulation sets, using multiple applications with varying numbers of configurations and hardware compositions. Simulation time is crucial in order to evaluate a specific design and, since the result of the simulation will probably require design modifications and further simulation, it becomes impractical to wait hours (or even days) for a single simulation to finish.
Cycle accuracy, as in MPARM, imposes hard restrictions on simulation sets, requiring the removal of heavyweight software or complex hardware from the workload. As a realistic example, consider the simulation of a lock-based version of the genome application, which is one of the fastest applications within the STAMP [14] benchmark. Running genome on MPARM with one core required 3:16 minutes. When the number of cores is increased to 2, 4, 8, and 16, the overall simulation time rises to 5:20, 8:47, 21:21, and 45:02 minutes, respectively. Simulating larger STAMP applications, such as Yada, takes around 31 hours with the 8-core configuration.
Another significant limitation imposed by MPARM is its lack of flexibility. The platform implementation tightly couples the processor with adjacent modules, making the exploration of different architectures a hard task. This lack of flexibility also occurs in other simulation platforms, but we focus on MPARM in this paper.
A. MPARM modifications
In order to improve performance, the cycle-accurate SWARM processor simulator was replaced by a functional ARMv5 simulator core generated with the ArchC toolset. This implementation led to a hybrid simulator, where all modules, except for the processor, are cycle-accurate. As will be seen ahead, power measurement was not seriously compromised and could be statistically estimated. We chose to modify the processor simulator because we are interested in platforms with a varying number of cores (from 1 up to 16) and our profile indicated that it is possible to get better results with a faster ISS (see Section III-E).
In the MPARM platform, each ISS module consists of a C/C++ implementation encapsulated in a SystemC wrapper. The wrapper is an interface between the processor core and the platform, handling all communication. This wrapper is also responsible for granting control to the core to execute a new cycle, which turns it into an efficient layer to enforce module synchronization on the platform.
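The cycle-stepping role of the wrapper can be sketched in plain C++. All class and method names below are illustrative stand-ins, not MPARM's actual interfaces; the point is only that the platform advances each core one cycle at a time through its wrapper, which keeps the modules in lockstep.

```cpp
#include <vector>

// Core: stand-in for an ISS; counts executed cycles.
struct Core {
    unsigned long cycles = 0;
    void execute_cycle() { ++cycles; }
};

// Wrapper: the interface layer between a core and the platform.
struct Wrapper {
    Core core;
    void clock_tick() { core.execute_cycle(); }  // grant the core one cycle
};

// Platform: advances every wrapper once per simulated cycle,
// which is what keeps the modules synchronized.
struct Platform {
    std::vector<Wrapper> cores;
    explicit Platform(int n) : cores(n) {}
    void run(unsigned long n_cycles) {
        for (unsigned long c = 0; c < n_cycles; ++c)
            for (Wrapper& w : cores) w.clock_tick();
    }
};
```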
Due to the loss of simulation accuracy implied by the higher abstraction level, some modifications needed to be applied to the model. The original ArchC ARM model, which was designed to use internal memory, was changed to use two TLM ports and the memory loader available in the platform. The platform was modified to correctly compile and instantiate the new processors. Functions to estimate core power and generate simulation reports were also created. Finally, some modifications were applied to the core signal data structure existing in the platform, making it compatible with SystemC 2.1. The implementation required 460 lines of code for the TLM interfaces and the SystemC wrapper. Around 200 lines of code were written in the original platform to correctly support the ArchC processor. The code can be easily reused to integrate any other ArchC processor model into MPARM, allowing the exploration of many architectures not yet supported by MPARM. This new feature can turn the platform, which was originally designed to evaluate embedded systems, into a more general one.
B. Model integration
Models written in the ArchC language may use internal memory or TLM interfaces to allow processor communication with external modules. When using TLM interfaces, every memory operation executed by the processor is forwarded to the TLM ports through a TLM packet. The TLM interface implements the SystemC signaling communication with external modules. Since it is a centralized communication channel, creating a memory hierarchy only required plugging the cache memories into the TLM interface code, correctly consulting and updating them on every operation. The cache memory implementation was the same originally used in MPARM. To support split data and instruction caches, two TLM ports were created in the ArchC processor model. One port is used exclusively for instruction fetching and the other for any other memory access. Each TLM port is connected to its own cache, which consequently becomes a data or instruction cache. A diagram with the detailed TLM interface implementation can be seen in Figure 1.
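The split-cache arrangement described above can be illustrated with a minimal sketch, assuming hypothetical class names (the real implementation reuses MPARM's cache code and SystemC/TLM machinery): each port is bound to exactly one cache, so the port a request arrives on alone decides whether the instruction or the data cache is consulted.

```cpp
#include <cstdint>

// Minimal cache stand-in that only counts accesses.
struct Cache {
    unsigned long accesses = 0;
    uint32_t read(uint32_t addr) { ++accesses; return addr; }
};

// A TLM port bound to exactly one cache: the port a request arrives on
// decides which cache is consulted.
struct TlmPort {
    Cache& cache;
    explicit TlmPort(Cache& c) : cache(c) {}
    uint32_t request(uint32_t addr) { return cache.read(addr); }
};

// Core with one port for instruction fetches and one for data accesses.
struct Core {
    Cache icache, dcache;
    TlmPort iport{icache}, dport{dcache};
    uint32_t fetch(uint32_t addr) { return iport.request(addr); }
    uint32_t load(uint32_t addr)  { return dport.request(addr); }
};
```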
MPARM does not support TLM by default. For this reason, once the packets reach the ArchC TLM interface, they are translated into an MPARM core signal, and a memory operation request is made to the bus master. These memory operations block the processor until a ready signal is received back
from the bus master. The MPARM platform also uses certain memory addresses to create a communication channel between the simulated application and the simulator itself. This is useful for allowing the use of the simulation support API, which provides calls for functionalities such as enabling and disabling power measurement or printing output messages.
The original AMBA bus was not modified, keeping its original cycle accuracy. When performing memory operations, this characteristic makes the processors wait the same number of cycles as cycle-accurate ones would. Being a blocking function, the TLM interface forces synchronization between the processor and the other modules in the platform.
To provide correct power estimation, calls to the measurement API were encapsulated in the TLM interface. Since the power estimation models were developed with a cycle-accurate simulation in mind, another strategy needed to be applied. The SWARM processor is modeled as a state machine with eight different states. Each of these states describes an operation mode of the processor, and its transitions are defined by the internal flow of events such as cache and memory operations. On every simulated cycle, the processor state is stored by measurement calls. At the end of the simulation, the number of cycles in each state is used to estimate power. Since the ArchC simulator is functional, this power estimation mechanism could not be directly applied. Instead of using a fixed state flow or value for each instruction, we placed the measurement calls in the TLM interface. Within the TLM interface we can easily trace all memory accesses, cache updates, and numbers of wait cycles, which allows us to dynamically reproduce the processor's state flow for each instruction. This approach led to a reduced precision loss in our power estimation model, making it very similar to the original one, as we show in Section IV. Figure 2 illustrates the measurement API calls in the TLM interface.
Fig. 2. TLM flowgraph
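The per-state cycle accounting underlying this scheme can be sketched as follows. State names and per-state power values are illustrative (SWARM's real machine has eight states, and the platform's actual power numbers are not reproduced here); the sketch only shows how counting cycles per state yields an average-power estimate at the end of a run.

```cpp
#include <array>

// Operation modes; names are illustrative stand-ins.
enum CoreState { RUN, WAIT_MEM, WAIT_BUS, IDLE, N_STATES };

struct PowerAccount {
    std::array<unsigned long, N_STATES> cycles{};  // cycles spent per state

    // Called once per simulated cycle (from the TLM interface here).
    void record(CoreState s) { ++cycles[s]; }

    // Average power = sum(cycles_i * P_i) / total cycles.
    double average_power(const std::array<double, N_STATES>& p) const {
        double energy = 0;
        unsigned long total = 0;
        for (int s = 0; s < N_STATES; ++s) {
            energy += cycles[s] * p[s];
            total  += cycles[s];
        }
        return total ? energy / total : 0.0;
    }
};
```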
C. ArchC ARMv5 model modifications
For the sake of speed, ArchC simulators use an instruction decode cache to avoid decoding the same instruction twice. Once an instruction is decoded, a data structure with the results is stored using its address as index. If the instruction is needed again, it is retrieved directly from the decode cache. Despite being essential for high performance, this mechanism imposes difficulties on statistics collection: for an instruction that was already decoded, no memory access is performed, leading to wrong memory and cache measurements. To fix this behavior, we modified the original decode cache code to perform a dummy memory access to the corresponding address in case of a cache hit. This solution forced the core to make the same memory accesses as the original, while still avoiding the need to decode the same instruction twice.
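The decode-cache fix can be sketched as below, with hypothetical names (`DecodedInsn`, `fetch`) standing in for the real ArchC structures: on a hit, a dummy read is issued so memory and cache statistics match a simulator that fetches every instruction, while the costly re-decode is still skipped.

```cpp
#include <cstdint>
#include <unordered_map>

struct DecodedInsn { uint32_t opcode; };  // decoded fields elided

struct DecodeCache {
    std::unordered_map<uint32_t, DecodedInsn> cache;  // indexed by address
    unsigned long mem_reads = 0;

    // Stand-in for the real instruction fetch through the TLM port.
    uint32_t fetch(uint32_t addr) {
        ++mem_reads;
        return addr;  // placeholder for memory contents
    }

    DecodedInsn decode(uint32_t addr) {
        auto it = cache.find(addr);
        if (it != cache.end()) {
            fetch(addr);        // dummy access: keeps memory statistics right
            return it->second;  // while skipping the costly re-decode
        }
        DecodedInsn d{fetch(addr)};  // real fetch + decode on a miss
        cache.emplace(addr, d);
        return d;
    }
};
```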
In a first comparison between the cycle-accurate and the functional simulations, a large difference in the number of memory read operations was noticed. This difference happened due to the lack of a pipeline in the functional simulation. With a pipeline, the execution of a branch may flush the instructions in the first and second pipeline stages. In the functional simulation, these two instructions would never be fetched from memory.
This inconsistency is imposed by the different abstraction levels between the processors and the rest of the platform. To cope with this problem, a branch detection mechanism was implemented in the ArchC simulator. If a branch is taken, this mechanism generates two dummy memory reads to the next addresses of the "not taken" program flow, corresponding to the instructions that would be in the pipeline but would be discarded in the cycle-accurate simulation. This mechanism drastically improved the precision of the functional simulation, allowing a much more reliable power measurement.
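A minimal sketch of this branch compensation, assuming 4-byte ARM instructions and an illustrative `issue_read` hook in place of the real TLM instruction port: on a taken branch, the functional core fetches the two sequential instructions that a cycle-accurate pipeline would have fetched and then flushed.

```cpp
#include <cstdint>
#include <vector>

struct BranchCompensator {
    std::vector<uint32_t> issued;  // addresses of emitted memory reads

    // Stand-in for a read through the TLM instruction port.
    void issue_read(uint32_t addr) { issued.push_back(addr); }

    // On a taken branch, fetch the two not-taken-path instructions that a
    // cycle-accurate pipeline would have fetched and then discarded.
    void on_branch(uint32_t pc, bool taken) {
        if (!taken) return;
        issue_read(pc + 4);
        issue_read(pc + 8);
    }
};
```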
D. Platform Verification Process
The new platform was tested using the STAMP benchmark [14], which is a set of applications targeting the evaluation of Software Transactional Memory (STM) systems. This benchmark proved to be very suitable for our purposes, since its inherent concurrency contributed to correctly evaluating the platform's communication through the TLM interfaces. The wide variety of algorithms in the STAMP applications presents a good robustness test, evaluating different aspects of the platform, such as varying bus contention, critical section lengths, and shared memory area sizes.
The platform verification comprised three main stages. In the first stage, different applications were executed and their outputs were compared for correctness. The tests started with micro-benchmark applications and were concluded after executing all applications in STAMP. At this point, all tests were executed with a single core instantiated on the platform.
The second stage consisted of comparing memory traces generated by the new platform with traces generated by the original one. This test allowed the verification of correct memory operations and address translation. After comparing the traces, the output reports were also checked, showing that both platforms performed the same number of memory accesses for each memory area while executing the same benchmark. Once again, only one core was instantiated on the platform.
The third validation stage consisted of executing the first and second stages with multiple cores (2, 4, and 8). It is worth mentioning that some of the STAMP programs run more than a billion instructions on each processor. Once the expected output for the application was found, memory traces were compared. Due to the different simulation abstractions, the memory traces were not exactly the same for both platforms.
A large difference in the number of memory reads to the processor's private memory area was noticed in the tests. By using the branch detection mechanism mentioned in Section III-C, this difference was drastically reduced, showing that it was an effect of the pipeline absence in the new abstraction, as described. In multi-core simulations, other memory areas also showed some differences, but none was significant. The nature of such differences is described in Section III-E.
An illustrative comparison between 4 applications can be seen in Table I. Values denoted with a + sign mean that the simulation made with ArchC executed a larger number of memory operations. Values describing multi-core simulations refer to the number of operations executed by the first core.
TABLE I
MEMORY OPERATION DIFFERENCE

Operation      genome   genome    Intruder  Intruder
               1 core   8 cores   1 core    8 cores
Private Rd     0.33%    9.16%     0.05%     14.04%
Private Wr     0%       0.2%      0%        +1.44%
Shared Rd      0%       0.08%     0%        2.89%
Shared Wr      0%       1.08%     0%        +1.54%
Semaphore Rd   0%       3.12%     0%        2.47%
Semaphore Wr   0%       0.49%     0%        +1.06%
E. Platform Profiling
After reaching the expected behavior with the STAMP applications, the original and the modified platforms were profiled in order to enable a better understanding of the differences imposed by changing the abstraction. Using the gprof tool, the execution of both platforms was profiled while running the application genome with 1, 2, 4, and 8 cores. The profiling allowed the verification of the time spent running each of the platform's modules. The results showed that the differences in the time spent by each of the modules were not restricted to the ISS code, indicating that the effects of changing its abstraction were propagated to other parts of the platform, such as buses and memories.
A deeper observation of the bus behavior revealed different contention on each platform, showing that the processor's abstraction had an influence on bus operations. In fact, the way each abstraction emits its instructions and its memory accesses is not the same. Only one cycle is required to run an instruction that does not require memory accesses on the functional simulator. If we consider an empty pipeline, at least three cycles would be required to run the same instruction on a cycle-accurate pipeline. Considering that the block being executed is already in the cache, if each operation needs to perform a memory access, the functional simulator would emit one memory access per cycle. In a similar situation, the cycle-accurate simulator would emit a number of memory accesses that is directly dependent on the number of stages in its pipeline, and it would never be equal to one memory access per cycle.
Since the functional simulator requires fewer cycles to execute an instruction, it also executes a code block in a smaller interval. This fact led to differences in the number of cycles spent running critical sections, which also had a major effect on the number of cycles and memory operations performed by the other processors on the platform while waiting for lock acquisition. Consequently, the bus contention of the two platforms is not the same. Considering the effects of the new abstraction on critical sections, and the fact that a single difference here may influence the whole program flow, not only are the different numbers of memory accesses explained, but also the varying times spent on modules other than the processor on each platform.
Table II summarizes the profile data. It presents the time spent by each processor implementation, in seconds, and the percentage of the whole execution time spent on the processors. Since the abstraction level influences the whole platform, it is not possible to make an absolute efficiency comparison between the two processors. However, the values show that, while using the ArchC simulator, the percentage of time spent on the processor in relation to the overall simulation time was reduced by at least 2.5x, reaching a reduction of 4x in the 8-core configuration. After these tests were complete, a series of experiments was performed to assess the performance and power estimation capabilities of our implementation. These experiments are discussed in the next section.
TABLE II
PROCESSOR PROFILING SUMMARY

Processor   1 core    2 cores   4 cores   8 cores
ArchC       6.87%     7.07%     7.61%     5.55%
SWARM       17.25%    20.29%    23.18%    22.59%
ArchC       5.15s     8.72s     16.9s     32.07s
SWARM       25.15s    51.83s    97.1s     180.29s
IV. EXPERIMENTAL RESULTS
The STAMP benchmark with lock-based synchronization was used to evaluate both performance and power estimation. The test consisted of executing all 8 applications available in the benchmark with 1, 2, 4, and 8 cores. Some of them were executed with more than one configuration, totaling 13 application variants. A sequential version of each variant was also executed. The whole test consisted of 65 simulations, which were executed on 2.4 GHz machines with 4 GB of RAM running Ubuntu Linux with kernel 2.6.9. Due to restrictions imposed by the source code of the platform, all the code was compiled with GCC 3.4 at the -O3 optimization level.
A. Performance Assessment
The result of each simulation was compared with a similar simulation on the original MPARM platform. A performance comparison can be seen in Figures 3 and 4. The speedup is shown as gray bars in the figures. The nomenclature for each
Fig. 3. Lock-based speedup / Simulated cycles
Fig. 4. Sequential speedup / Simulated cycles
simulation follows the one presented in the original STAMP paper [14]. As can be seen, our implementation reached at least a 1.8x speedup for each simulation, with a maximum speedup of 2.9x. The average speedup was 2.1x.
The absence of a pipeline reduced the overall number of simulated cycles for each execution. The number of cycles, normalized to the values obtained with the original platform, is identified by a black line in Figures 3 and 4. As shown, 70% of the original cycles were simulated with the sequential implementation. This value also stands for lock simulations with 1 core, but it increases as more cores are added to the platform. This is an effect of the larger bus contention generated by the addition of new cores to the platform. Since the bus is cycle-accurate, memory operations keep the cores blocked for a number of cycles equivalent to the cycle-accurate simulation, which makes the number of cycles increase towards the one obtained with the original platform. Simulating fewer cycles was not the only reason for the speedup. The functional ArchC simulator is simpler, and thus faster, than SWARM. On average, the ArchC model simulated 1.9x more cycles per second, reaching a maximum of 2.4x. A comparison of simulated cycles per second can be seen in Table III. Serially running these 65 simulations would take 187 hours and 30 minutes on the original MPARM with SWARM. The same batch of simulations was completed in 98 hours and 19 minutes on the new MPARM implementation using ArchC functional cores.
B. Energy and Power estimation
In MPARM, the total energy estimation is calculated based on the stored states of the processor, as described in Section III-B. At each cycle, the processor state is stored and, at the end, applied to the energy model present in the platform. As expected, the use of a higher abstraction introduced imprecision into the energy estimation. A scatter plot showing the error in the energy estimation obtained from the new platform can be seen in Figure 5, where the dots represent the obtained results and the line represents the values obtained with the original platform, used as the reference for correctness.
Fig. 5. Total energy measurement error
In the results presented in Figure 5, the applications Bayes, Labyrinth, Labyrinth+, and Yada showed a more significant error when executed with 4 cores. Since these applications have long critical sections, they are more susceptible to the effects of the new abstraction on bus contention. The upper limit of bus contention can be understood as all processors on the platform waiting for bus operations. For the mentioned applications, the 4-core simulation reached maximum contention in
TABLE III
NUMBER OF K CYCLES SIMULATED PER SECOND

               Sequential     1 core         2 cores        4 cores        8 cores
Application    ArchC  SWARM   ArchC  SWARM   ArchC  SWARM   ArchC  SWARM   ArchC  SWARM
kmeans-low     270    150     276    209     406    196     546    345     870    515
kmeans-high    275    203     279    216     407    282     587    360     938    359
yada           297    214     295    142     409    308     465    375     727    540
bayes          318    150     312    227     487    322     657    396     991    584
intruder       351    161     341    159     480    216     603    262     889    556
intruder+      345    160     350    160     488    325     590    261     894    562
labyrinth      319    236     317    149     461    204     575    243     892    356
labyrinth+     318    233     316    223     465    204     575    357     880    530
vacation-low   352    260     341    163     468    223     588    268     906    391
vacation-high  340    261     347    260     479    333     613    427     880    415
ssca2          326    254     339    244     502    324     710    408     1063   623
genome         324    248     322    232     519    337     711    431     1083   605
genome+        334    255     330    157     520    355     709    428     1075   464
average        320    214     320    195     468    279     609    350     929    500
the hybrid platform, but not in the original one, resulting in the observed error. The 8-core simulations reached maximum bus contention, in both platforms, during almost the entire runtime and, for this reason, the observed error was not significant.
The power estimation mechanism in MPARM uses the number of simulated cycles to estimate the consumption of each core. As the hybrid platform implementation trades off cycle precision for performance, an error margin was introduced into the power calculations due to the difference in the number of cycles. Raw results obtained with the modified platform had an average power estimation error of 21.2% with a 2.7% standard deviation (SD).
In order to improve these results, a model based on regression analysis was built using the least-squares method. To build the model, values measured on the original platform were used as the expected values. The coefficients obtained, when applied to the results of the hybrid platform, minimize the percentage of error. Since this first model was built from the results obtained with the whole simulation set, it was named the "general model". After applying the general model to the results obtained with the hybrid platform, the average error was reduced to 14.45%, with an SD of 4.3%.
TABLE IV
LINEAR REGRESSION COEFFICIENTS

cores           model
general model   0.9x − 3
1               0.71x + 0.57
2               0.8x
4               0.75x + 7.5
8               0.8x + 24
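A fit of this form can be computed with an ordinary least-squares sketch like the one below, where x is the raw hybrid estimate and y the reference value from the original platform; the function name and the exact fitting code are ours, not the paper's.

```cpp
#include <cstddef>
#include <vector>

struct Fit { double a, b; };  // y ~ a*x + b

// Ordinary least-squares fit of reference values y against raw hybrid
// estimates x; a and b play the role of the coefficients in Table IV.
Fit least_squares(const std::vector<double>& x, const std::vector<double>& y) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    const double n = static_cast<double>(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    const double a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    return {a, (sy - a * sx) / n};
}
```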
As shown in Figure 3 and explained in Section IV-A, the normalized number of simulated cycles is not the same for simulations with different numbers of cores. As this value is an important parameter for the power estimation, it turns out that using the general model described above is not the most appropriate choice. In order to achieve a higher accuracy level, we calculated new models for each number of cores. The core-specific coefficients obtained through the linear regression can be seen in Table IV, where x stands for the raw power estimation value obtained with the hybrid simulation. By using the core-specific models, the error margin was reduced to an average of 3.25% with an SD of 2.25%. A scatter plot showing the error of the final results can be seen in Figure 6, where dots represent the results obtained after applying the estimation models and the diagonal line represents the results obtained with the original platform.
Fig. 6. Total power measurement error
C. Regression Model validation
In order to correctly assess our models, a small test set, composed of STAMP applications executed with different inputs, was used. The general and the core-specific coefficients were applied to the results obtained after a hybrid simulation. The applications that composed the test set were Bayes, Intruder, and Intruder+ with a different random seed, and Labyrinth and Labyrinth+ with different mazes.
Fig. 7. Power measurement error on the validation set: (a) raw results, (b) with the general model applied, (c) with the core-specific model applied
By using the general model, the average error was reduced from 22.21% to 18.74%. The specific model reduced the average error to 5.85%. Despite being generally less accurate than the core-specific model, the general model was more precise when applied to the 8-core simulation. This behavior is due to the larger data set employed in the construction of the general model and the similarity in the number of cycles executed in the simulations. Plots with the errors originally obtained and reduced after applying the models are presented in Figures 7(a), 7(b), and 7(c).
V. CONCLUSION
We have introduced a new simulation resource into MPARM, turning it into a hybrid simulation platform with regard to model abstraction levels. By replacing the cycle-accurate processor with a functional one we have significantly increased performance, as a consequence of simulating more cycles per second and of reducing the overall number of simulated cycles. We have also introduced techniques to reduce the loss of precision in cycle/power estimates imposed by the higher abstraction level. The lack of precision introduced by the abstraction modification was discussed, highlighting the effects of bus contention when running simulations with a larger number of cores. Finally, we have suggested the use of regression analysis to improve power estimation results, defining a different correction model for each number of cores due to the variation in bus contention effects in each case. By reaching an average error of 3.26% we showed that our hybrid platform, while reaching speedups of up to 2.9x compared to the original MPARM, is able to generate power estimations with a very similar level of confidence in the results.
VI. ACKNOWLEDGEMENT
This work was partially supported by grants from FAPESP(2009/04707-6, 2009/08239-7, 2009/14681-4), CNPq, and
CAPES.
REFERENCES
[1] L. Benini, D. Bertozzi, A. Bogliolo, F. Menichelli, and M. Olivieri, "MPARM: Exploring the multi-processor SoC design space with SystemC," J. VLSI Signal Process. Syst., vol. 41, no. 2, pp. 169–182, 2005.
[2] J. Chen, M. Dubois, and P. Stenstrom, "Integrating complete-system and user-level performance/power simulators: the SimWattch approach," in ISPASS '03: Proceedings of the 2003 IEEE International Symposium on Performance Analysis of Systems and Software, 2003, pp. 1–10.
[3] S. Rigo, G. Araujo, M. Bartholomeu, and R. Azevedo, "ArchC: a SystemC-based architecture description language," in Computer Architecture and High Performance Computing, pp. 66–73, October 2004.
[4] D. C. Black and J. Donovan, SystemC: From the Ground Up, 2004.
[5] M. Dales, "SWARM 0.44 documentation," February 2003, www.cl.cam.ac.uk/~mwd24/phd/swarm.html.
[6] M. Loghi, M. Poncino, and L. Benini, "Cycle-accurate power analysis for multiprocessor systems-on-a-chip," in GLSVLSI '04: Proceedings of the 14th ACM Great Lakes Symposium on VLSI, 2004, pp. 401–406.
[7] C. Ferri, T. Moreshet, R. I. Bahar, L. Benini, and M. Herlihy, "A hardware/software framework for supporting transactional memory in an MPSoC environment," SIGARCH Comput. Archit. News, vol. 35, no. 1, pp. 47–54, 2007.
[8] A. Baldassin, F. Klein, G. Araujo, R. Azevedo, and P. Centoducatte, "Characterizing the energy consumption of software transactional memory," Computer Architecture Letters, vol. 8, no. 2, pp. 56–59, Feb. 2009.
[9] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A full system simulation platform," Computer, vol. 35, pp. 50–58, 2002.
[10] D. M. Brooks, P. Bose, S. E. Schuster, H. Jacobson, P. N. Kudva, A. Buyuktosunoglu, J.-D. Wellman, V. Zyuban, M. Gupta, and P. W. Cook, "Power-aware microarchitecture: Design and modeling challenges for next-generation microprocessors," IEEE Micro, vol. 20, pp. 26–44, 2000.
[11] T. Austin, E. Larson, and D. Ernst, "SimpleScalar: An infrastructure for computer system modeling," Computer, vol. 35, no. 2, pp. 59–67, 2002.
[12] R. Azevedo, S. Rigo, M. Bartholomeu, G. Araujo, C. Araujo, and E. Barros, "The ArchC architecture description language and tools," Int. J. Parallel Program., vol. 33, no. 5, pp. 453–484, 2005.
[13] F. Ghenassia, Transaction-Level Modeling with SystemC: TLM Concepts and Applications for Embedded Systems, 2006.
[14] C. Cao Minh, J. Chung, C. Kozyrakis, and K. Olukotun, "STAMP: Stanford transactional applications for multi-processing," in IISWC '08: Proceedings of the IEEE International Symposium on Workload Characterization, September 2008.
A non-intrusive simulation-based trace system to analyse Multiprocessor Systems-on-Chip software
Damien Hedde, Frédéric Pétrot
TIMA Laboratory
CNRS/Grenoble INP/UJF, Grenoble, France
{Damien.Hedde, Frederic.Petrot}@imag.fr
Abstract—Multiprocessor Systems-on-Chip (MPSoCs) are scaling in complexity. Most parts of an MPSoC are affected by this evolution: number of processors, memory hierarchy, interconnect systems, etc. Due to this increase in complexity and the debugging and monitoring difficulties it implies, developing software targeting these platforms is very challenging. Methods and tools to assist the development process of MPSoC software are therefore mandatory. Classical debugging and profiling tools are not suited for use in the MPSoC context, because they lack adaptability and awareness of the parallelism.
As virtual prototyping is today widely used in the development of MPSoC software, we advocate the use of simulation platforms for software analysis. We present a trace system that consists of tracing hardware events produced by models of multiprocessor platform components. The component models are modified in a non-intrusive way so that their behavior in simulation is not altered. Using these trace results allows running precise analyses, such as data race detection, targeting the software executed on the platform.
I. INTRODUCTION
The ever-increasing performance and flexibility demands for running applications on embedded platforms led to the emergence of Multiprocessor Systems-on-Chip (MPSoC) platforms a decade ago. Nowadays, these systems integrate very complex memory subsystems and interconnects. In its 2009 edition [1], the ITRS (International Technology Roadmap for Semiconductors) expects Systems-on-Chip in the portable consumer segment with more than 1000 processing elements in 2020.
Unfortunately, MPSoCs embed more and more elements while keeping few external debugging or monitoring capabilities. As a consequence, the observability of such SoCs is not increasing as their complexity does. By integrating elements that were previously outside the chip, it becomes almost impossible to observe their behavior and communication. Furthermore, the growing number of internal elements that need to be connected together leads to the saturation of classical interconnect systems such as buses. To replace them, scalable interconnects (i.e., Networks-on-Chip (NoCs)) are used. Although providing higher bandwidth, they do not have the same observability. To settle this general observability problem, Design for Debug (DfD) features ([2], [3], [4], [5]) are developed and integrated into SoCs.
This complexity raises problems not only for designing an MPSoC but also for the software running on its processors. The software has to target the processors and make them communicate in order to execute the required application. Depending on the MPSoC architecture, the software may have to handle multiple communication mechanisms (shared memory, mailboxes, DMA) and target different processor types (general-purpose processors, specialized processors). Analysing and debugging programs that concurrently use several processors is not a new problem. Many techniques were developed [6] during the 1980s, particularly targeting distributed systems programs. The difficulties encountered in debugging those systems are not far from the ones for MPSoCs. A very important one is the difficulty of getting the state of the global system. There is no real problem in getting the state of each node, but getting the state of every node at the same time is nearly impossible. With the development of GALS (Globally Asynchronous, Locally Synchronous) integrated systems, it becomes clearly unfeasible.
In this paper, we present a method for fine-grain analysis of software running on MPSoCs. This method uses simulation and consists of instrumenting component models. During the simulation, these models generate events which are collected for further analysis. Our approach relies on the fact that, when simulating an entire SoC platform, we can have access to everything that happens during the simulation. The ability to collect information is indeed not limited by the classical constraints a real SoC has: limited bandwidth to an external debugging device, limited observability, etc. Contrary to software instrumentation, information from the simulation can be collected without being intrusive: it can be obtained without changing the behavior of the simulated components. However, simulation relies on models that are often only approximate in timing. In our case this is not a problem, because we are mainly concerned with the order of memory events.
The produced trace mainly contains the instructions executed by the different processors. The related memory accesses are also traced, up to the memory, by every component relaying them. These memory accesses are used to recover the inter-processor instruction dependencies, allowing analysis of the detailed, low-level synchronization mechanisms between the different processors.
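The kind of trace record this implies can be sketched as follows. Field and type names are our assumptions for illustration, not the paper's actual trace format; the essential point is that each instruction and relayed memory access carries its producing processor and a sequencing value from which cross-processor orderings can be reconstructed offline.

```cpp
#include <cstdint>
#include <vector>

enum EventKind { INSN, MEM_READ, MEM_WRITE };

// One trace record (field names are illustrative assumptions).
struct TraceEvent {
    int       cpu;   // producing processor
    EventKind kind;
    uint32_t  addr;  // pc for INSN, target address otherwise
    uint64_t  seq;   // global order as observed by the traced component
};

struct Tracer {
    std::vector<TraceEvent> log;
    uint64_t next = 0;
    void emit(int cpu, EventKind k, uint32_t addr) {
        log.push_back({cpu, k, addr, next++});
    }
};
```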
The remainder of this article is organized as follows. Related work is reviewed in Section II. Section III describes the
trace mechanisms and the analysis method. Section IV presents experiments and results. We then conclude in Section V.
II. RELATED WORK
In recent years, many methods have been proposed to help debug and analyse programs running on MPSoCs. Due to the increasing SoC parallelism, an important effort has been made on communication monitoring: parallel program errors indeed come from erroneous interactions between the processing elements, and interactions mainly go through memory. Several solutions propose to integrate debugging features directly in the chips ([5], [7]), allowing run-time debugging.
Following another track, work has been done to improve debugging abilities through simulation techniques. An advantage of these methods is that they are truly non-intrusive. [8] proposes a solution using a virtual machine that exposes the data structures of the Operating System (OS) running inside the virtual machine to an external debugger. In [9], an Instruction Set Simulator (ISS) API is proposed for the simulation of processors in MPSoCs. This ISS has an interface allowing instrumentation tools to be added independently of the simulated processor. An implementation of a GDB server using this instrumentation ability has been made, allowing the set of processors of the MPSoC to be controlled. Although it allows every processor to be controlled and monitored, it does not address the communication mechanisms. In [10], a solution is proposed for verifying the shared memory mechanism. The method consists in recording memory operations and their order, and then checking whether a bug occurred.
Non-intrusive instrumentation in simulators already exists. In eSimu [11], a very low-level trace is generated by a cycle-accurate simulator and used for energy profiling. The trace contains instructions with cache-penalty information, peripheral state changes, and the evolution of data through FIFOs, in order to link energy consumption to the instruction that generates it. For example, sending data through a wireless device costs a lot of energy but is issued long after the related instruction, so the data must be tracked. This solution does not target platforms embedding multiple processors, and thus lacks the information needed to recover the order of memory accesses in multiprocessor platforms. It is also focused only on profiling.
In his thesis, D. Kranzlmueller [12] studies the modelling of concurrent programs using events, with debugging as the goal. His work mainly targets distributed programs at the application level, not the whole software stack. He records and analyses the main events of a concurrent program (mostly communication events). The event trace is intrusive, as it is generated through software instrumentation. Event relations and orders are analyzed in order to detect erroneous behaviors. Other works, like [13], focus on MPSoC software monitoring through a specific programming model with integrated observation abilities; in that work, a component-based approach is used where each component has observation interfaces.
III. CONCURRENT PROGRAM ANALYSIS
Several kinds of analysis can be performed on a trace: verification of a cache coherency protocol, verification of a memory consistency model, detection of data races, and so on. These analyses require building order relations between the memory access instructions. Section III-A below explains how to build this instruction trace for a multiprocessor platform using a sequentially consistent memory model [14]. The analysis then needs to build the software threads (which may have migrated between several processors in SMP architectures) and to identify synchronizations between the threads in order to highlight erroneous behaviours. We explain how to proceed in Section III-B, focusing on data race detection.
A. Recovering instruction scheduling
In order to analyse a concurrent program running on a multiprocessor shared-memory platform, we need to sort the concurrent program instructions that access the same memory. But the date of a memory access may be very different from the date of the instruction that generates it: because of communication interconnects and cache or buffer components, the date of an access might be significantly shifted from the instruction date.
This section describes the method we use to associate the right access date with each instruction accessing memory, using information provided by the simulation. The method works in two steps and has no impact on the program executed by the simulated platform.
1) Tracing hardware events
This first step takes place during the simulation of a multiprocessor platform executing a concurrent program. It consists in tracing all events related to the program execution and memory accesses. Several kinds of platform components are involved in this operation: processors, caches and memories. Traced events are stored for further use.
Each involved component traces events depending on the operations it performs. The traced events are:

• processor instructions,
• processor requests,
• cache acknowledgements,
• cache requests,
• memory acknowledgements.

Requests and acknowledgements represent memory accesses. The initiator of a memory access generates a request event and the target generates an acknowledgement event. An acknowledgement matches the action of the target on the memory array (mainly a read from or a write to it).
Events contain type-specific data. A request event contains the address, width and type of the access; example types are load, store, linked load and exclusive (for write-back policy caches). A processor request event also contains the data read or written by the access. An instruction event contains the instruction address and the processor state changes. A memory acknowledgement event contains only the date of the target's action. Because dates are only used to sort acknowledgements, they can be logical dates, unrelated to the simulated time.

Figure 1. Event dependency examples for a 2-processor platform with write-through policy caches and write buffers. An arrow denotes a dependency between two events: the event at the circle end contains the identifier of the event at the triangle end.
Events are generated by several components, and some of them need to be linked together (for example an acknowledgement with the corresponding request). To this end, each event is tagged with a unique identifier.
The identifier of an event can then be used by another event to indicate a relation between the two. This method is used in three cases:

• An acknowledgement event (issued by a cache or a memory) contains the identifier of the request event that led to this acknowledgement.
• A cache request contains the identifier of a processor request when the cache relays that request to the memory (for example when a processor load triggers the load of a whole cache line).
• A processor request event contains the identifier of the instruction that generates the memory access.
Figure 1 shows some typical examples of event dependencies. In this figure, processor 0 first performs a load, which leads its cache to load a line from memory. Processor 0 then performs a second load in the same line, which is handled by the cache. Processor 1 first performs two writes, which are gathered by the cache write buffer, and then an uncached read, which is handled directly by the memory, bypassing the cache.
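As an illustration, these event records and identifier links could be modeled as follows. This is a sketch: the class and field names are ours, not the paper's actual trace format.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List

_ids = count()  # global source of unique event identifiers


@dataclass
class Event:
    # every traced event is tagged with a unique identifier
    ident: int = field(default_factory=lambda: next(_ids), init=False)


@dataclass
class InstructionEvent(Event):
    pc: int = 0                 # instruction address


@dataclass
class RequestEvent(Event):
    address: int = 0
    width: int = 4
    kind: str = "load"          # load, store, linked load, exclusive, ...
    # identifiers of the events this request derives from: the instruction
    # for a processor request; one or more processor requests for a cache
    # request (several when a write buffer gathers writes)
    parents: List[int] = field(default_factory=list)


@dataclass
class AcknowledgementEvent(Event):
    request_id: int = 0         # identifier of the acknowledged request
    date: int = 0               # logical date of the target's action


# first access of Figure 1: a processor load relayed by the cache
instr = InstructionEvent(pc=0x5000)
preq = RequestEvent(address=0x2004, parents=[instr.ident])
creq = RequestEvent(address=0x2000, parents=[preq.ident])
ack = AcknowledgementEvent(request_id=creq.ident, date=1)
```

Replaying an access then amounts to chaining identifiers: the memory acknowledgement points at the cache request, which points at the processor request, which points at the instruction.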
As long as events are not linked by identifiers, generating them is straightforward: the processor, cache and memory components have to be modified to generate the events, but not their communication channels.
But as an event may need the identifier of another event, the communication channel between the two components that generate these events must be modified: the channel has to transport the identifier of an event along with the standard communication. Due to the direction of the links (an acknowledgement needs the identifier of the request event, but not the opposite), the response channel does not need to be modified; only the request channel is.
2) Associating dates to memory related instructions
The second step consists in building, for each processor, the thread of executed instructions with the proper dates associated to memory access instructions. These dates are later used to sort the instructions of the different processors so as to match the memory access order. An instruction that generates a memory access must be tagged with the date of the acknowledgement at the memory.
Note that the way write accesses are handled depends on the cache policy. Write-through caches do not generate acknowledgment events on write accesses, but relay them to the memory as shown in Figure 1. On the contrary, write-back caches can acknowledge write accesses when the corresponding line state allows it. When a write-through cache uses a write buffer, still no write acknowledgment is generated by the cache, but several identifiers (one for each initial request) are recorded in the resulting request event (see Figure 1).
In order to associate a date with an instruction that generates an access, the relations between events must be followed from the memory acknowledgement down to the instructions. But due to the presence of caches, each processor request is not directly linked to a memory acknowledgement. In the case of a read cache hit, the cache acknowledges the request, but the real read access up to the memory may have been done long before by the cache. A similar issue arises for a write access with a write-back cache policy (except that the real access happens after, not before).
It would be possible to assign the true date (which could be in the past or in the future) of the memory access to every processor request, but this is not what we need: the dates of memory accesses would then not be in increasing order inside a processor's instruction sequence. Without this property, the analysis of threads would be complicated: for example, it would be impossible to find the next access to a memory location without scanning the whole instruction sequence of each processor. We therefore have the following constraints:

1) Dates in a processor's instruction sequence must be in increasing order.
2) The dates of two memory accesses at the same address from two different processors must be in the proper order: the access that takes place first must have the smaller date.

Figure 2. Example of date associations to processor requests for a processor behind a write-back policy cache
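These two constraints can be expressed as a small checker. The sketch below is ours: it assumes each processor sequence is a list of (address, date) pairs in program order, and that the true per-address access order is available from the acknowledgement trace; ties in dates are tolerated, matching the non-strict dating used for cache-acknowledged accesses.

```python
def check_date_constraints(proc_seqs, true_order):
    """proc_seqs: {cpu: [(addr, date), ...]} in program order.
    true_order: {addr: [(cpu, index), ...]} true access order per address,
    where index points into that cpu's sequence."""
    # Constraint 1: dates never decrease along a processor sequence.
    for seq in proc_seqs.values():
        if any(d2 < d1 for (_, d1), (_, d2) in zip(seq, seq[1:])):
            return False
    # Constraint 2: for a given address, the assigned dates must not
    # invert the true access order.
    for addr, order in true_order.items():
        dates = [proc_seqs[cpu][i][1] for cpu, i in order]
        if any(d2 < d1 for d1, d2 in zip(dates, dates[1:])):
            return False
    return True


# example: cpu 0 reads 0x2000 then 0x2004; cpu 1 accesses 0x2000 afterwards
seqs = {0: [(0x2000, 1), (0x2004, 1)], 1: [(0x2000, 3)]}
ok = check_date_constraints(seqs, {0x2000: [(0, 0), (1, 0)]})
```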
Accesses can be classified in two categories: memory-acknowledged accesses and cache-acknowledged accesses. Figure 2 shows how accesses are dated. There is no difficulty in assigning a date to an access acknowledged by the memory: we keep the date of the acknowledgement and the constraints are met.
However, for accesses that are acknowledged by a cache, a date must be computed. In order to meet the first constraint, this date must lie between the dates of the previous and next memory-acknowledged accesses of the processor. The date of the corresponding memory acknowledgment cannot be used, since it would violate this constraint. For all cache-acknowledged accesses between two memory-acknowledged accesses, we use the date of the previous memory-acknowledged access. The first constraint is obviously met, although the processor order is no longer strict.
The second constraint is also met, because the cache-acknowledged access could have been performed at the date of the previous memory-acknowledged access without changing the results of the accesses at that address. There are two cases, the read case and the write case. Let T1 be the true date of the access and T2 the date of the previous memory-acknowledged access. Sequential consistency ensures that there is a total order of all memory accesses respecting each processor's access order.
• If the access is a read, then it is a cache hit and T1 <= T2. If a modification of the memory cell had occurred before T2, the cache would have received an invalidation before the previous memory acknowledgement, due to sequential consistency. So the line has not been written between T1 and T2. Other processors may have read the line, but reordering consecutive reads to the same address is not a problem.
• If the access is a write, then it is delayed by the write-back cache policy and T1 > T2. The cache therefore holds the accessed line in exclusive state, and no other processor can access the line (even for a read) without the cache first writing it back to the memory. So the line has not been accessed between T2 and T1.
The following algorithm tags each processor request with a proper date. Due to the direction of the links between events (acknowledgement to request), it is not easy to find the acknowledgment event starting from a request event. Conversely, starting from the acknowledgement event and finding the requests that correspond to it raises no difficulty. This is why the algorithm is driven by the components at the top of the memory hierarchy (i.e., the memories).
• Main: consume all events of the memories following date order. Memories should only have generated acknowledgment events.
• Memory: for each consumed event, identify the source component (processor or cache) from the identifier contained in the event, and consume the events of this component up to the one referred to by the identifier. The acknowledgement date is given to this source component.
• Cache: for each consumed event, if there is a source identifier, consume the events of the source component (lower-level cache or processor) up to the identified event. The date given to the source component is either the previous date, or the just-received date if the event is linked to the request being acknowledged by the memory (or higher-level cache).
• Processor: each consumed request is tagged with the currently received date.
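The steps above can be sketched for a single cache level as follows. This is our simplification, not the actual implementation: event queues are given per component, and the write-buffer case where one cache request carries several identifiers is ignored.

```python
def assign_dates(acks, events, issuer, below):
    """acks: memory acknowledgements as (date, request_id) pairs.
    events: {comp: [(ident, parent_ident or None), ...]} events generated
    by each component, in generation order.
    issuer: {request_id: component that issued the request}.
    below: {comp: next component down (cache -> processor) or None}.
    Returns {processor: [(request_ident, date), ...]}."""
    cursor = {c: 0 for c in events}   # next unconsumed event per component
    last = {c: 0 for c in events}     # previous date given to the component
    dated = {c: [] for c in events if below.get(c) is None}

    def consume(comp, upto, date):
        # Consume comp's events up to (and including) the one named `upto`.
        # Events before `upto` keep the component's previous date; the
        # event `upto` itself receives the just-received date.
        evs = events[comp]
        i = cursor[comp]
        while i < len(evs):
            ident, parent = evs[i]
            d = date if ident == upto else last[comp]
            if below.get(comp) is None:       # a processor: tag the request
                dated[comp].append((ident, d))
            elif parent is not None:          # a cache: follow the link down
                consume(below[comp], parent, d)
            i += 1
            if ident == upto:
                break
        cursor[comp] = i
        last[comp] = date

    for date, req in sorted(acks):            # memory events in date order
        consume(issuer[req], req, date)
    return dated


# example: processor "p" behind cache "c"; requests 10 and 12 are relayed
# to the memory (as cache requests 20 and 21), request 11 is a cache hit
events = {"c": [(20, 10), (21, 12)],
          "p": [(10, None), (11, None), (12, None)]}
dated = assign_dates(acks=[(5, 20), (9, 21)], events=events,
                     issuer={20: "c", 21: "c"}, below={"c": "p", "p": None})
```

In this example the cache-hit request 11 receives the date of the previous memory-acknowledged access (5), exactly as the dating rule of the previous section prescribes.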
This algorithm works as long as dates are consistent across all memory components. Depending on the simulator used, such a global time might not be available. The algorithm consumes the events of each component in the order they were generated; the intermediate storage of each component's event sequence can therefore be avoided, and the components can directly feed the algorithm. In that case, only the processor sequences are generated.
B. Software analysis: data races detection
The previously generated processor instruction sequences contain the executed instructions, associated with processor state changes and memory access acknowledgment dates. This information makes it possible to run several analyses on the executed concurrent software.
1) Building software structure: threads
Processor sequences contain information related to the hardware side of the execution: processor state changes and dates of memory accesses. These dates guarantee a proper interleaving of memory instructions. However, information on the software side is missing. In addition to the processor instruction sequences, the following analysis method needs the symbols, which allow, for example, instruction addresses to be matched to functions. A prerequisite of the analysis is that the symbols are known. The Application Binary Interface (ABI) is also needed.
The processor sequences are then analysed to generate the software structure. This operation respects the memory access order and may be seen as a kind of replay of the execution. The ABI allows function calls and returns to be detected when the symbols of these functions are known.
Additional data, such as DWARF debugging information, contains the parameter locations of function symbols and allows the parameters of function calls to be identified. Some analyses rely on the identification of specific function calls or returns.
Initially, one thread is associated with each processor and the call graphs are built. This leads to a correct result as long as no Operating System (OS) performs thread scheduling. If there is an OS, the thread creation and scheduling functions have to be detected in order to create new threads and to change the thread associated with a processor.
2) Adding synchronization points
Detecting any function is possible if its symbol information is known. In order to apply this mechanism to data race detection, the synchronization mechanisms of the software threads must first be detected: without taking synchronization constraints into account, the threads would be considered entirely concurrent.
A synchronization point is a point in a software thread that is linked with at least one point in a second software thread. A link states that one synchronization point occurs either after or before the other.
At the lowest level, atomic memory accesses (load linked, store conditional, test-and-set, compare-and-swap, etc.) may be considered as synchronization points, but some synchronization can be done without them. Considering every memory access as a synchronization is not a good solution either, as most accesses do not take part in any synchronization.
From a higher-level point of view, synchronizations are done through software functions. Synchronization points are then created when the execution of such a function is detected. Useless synchronization points generate additional constraints which might mask data races. To avoid them, synchronization points must be set only for the highest-level synchronization functions, as these may be built from several lower-level ones.
3) Checking data races
A data race corresponds to the case where multiple threads access the same memory location concurrently and the result of the accesses cannot be decided without knowing in which order they take place. Two cases exist: write-read and write-write. They can be reduced to a single one: a data race occurs if one thread writes to a memory location and another thread accesses the same location (read or write).
Figure 3 shows some software threads with their synchronization points. The parts of a thread between two synchronization points are called segments in the following. To find all data races, every pair of threads must be analyzed.
Figure 3. Example of threads with their synchronization points. Numbers represent synchronization points and letters represent segments. Dashed arrows represent indirect links.
In order to find the data races between two threads, the whole graph including all threads must be reduced. Although only two threads are concerned, the graph cannot be reduced by simply removing every other thread and the links not involving the two analysed threads: due to the transitivity of the links between synchronization points, some indirect links between the two studied threads may be inferred. For example, in Figure 3, point 1.3 happens before point 0.2 although there is no direct link between them.
Two segments of the two threads may be concurrent if no constraint forces one to be completely executed before the other starts. In Figure 3, the segments of thread 1 that are concurrent with segment 0.A of thread 0 are 1.A, 1.B and 1.C. Finally, the accesses of concurrent segments must be checked in order to find data races.
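This reduction and check could be sketched as follows. The sketch is ours: segment names follow Figure 3's convention, the access sets are hypothetical, and the happens-before relation is closed transitively before concurrent pairs are enumerated.

```python
def find_races(segments, edges, accesses):
    """segments: segment ids; edges: (a, b) pairs meaning segment a is
    entirely ordered before segment b (program order plus synchronization
    links); accesses: {segment: set of (addr, 'r' or 'w')}."""
    before = {s: set() for s in segments}
    for a, b in edges:
        before[a].add(b)
    changed = True
    while changed:                    # transitive closure of `before`
        changed = False
        for a in segments:
            extra = set()
            for b in before[a]:
                extra |= before[b]
            if not extra <= before[a]:
                before[a] |= extra
                changed = True
    races = []
    for i, a in enumerate(segments):
        for b in segments[i + 1:]:
            if b in before[a] or a in before[b]:
                continue              # ordered: not concurrent
            for addr1, k1 in accesses.get(a, ()):
                for addr2, k2 in accesses.get(b, ()):
                    if addr1 == addr2 and "w" in (k1, k2):
                        races.append((a, b, addr1))
    return races


# two threads of two segments each; a synchronization link orders 1.A
# before 0.B; the accesses to the illustrative address 0x100 are unprotected
segs = ["0.A", "0.B", "1.A", "1.B"]
edges = [("0.A", "0.B"), ("1.A", "1.B"), ("1.A", "0.B")]
acc = {"0.A": {(0x100, "w")}, "1.B": {(0x100, "r")}}
races = find_races(segs, edges, acc)
```

Here segments 0.A and 1.B are unordered with respect to each other, and one of them writes 0x100, so a write-read race is reported.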
IV. IMPLEMENTATION AND RESULTS
This section presents our implementation and the obtained results. We first describe the architecture of the software analysis program, then detail our experiments and results.
A. Analysis implementation
The two steps described in Sections III-A1 and III-A2 are not implemented in the same program. The first step is implemented in the component models used in the simulation. Events are stored into separate files (one per component); the storage is done in parallel with the simulation, in a separate thread, to limit the overhead. The second step is a separate program that generates one dated instruction sequence per processor.
The software thread analysis program is organized as follows. It takes as input the previously generated processor instruction sequences and the binary image of the software executed by the simulated platform. The software image is used to obtain the software symbols.
The main program core consumes the instructions following the date order. When an instruction is consumed, it is appended to the current software thread of its processor, and, using the software symbols and the ABI, the software thread histories (call graphs) are built. The user can register hook functions at the entry or exit of function symbols to perform specific tasks (for example changing the current thread of a processor, or adding synchronization points). A hook function is executed by the main core when the given symbol is detected. The program stores the software thread histories and the synchronization graph into files.
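The hook mechanism could look like the following sketch. The names are ours; in the real tool the callback would receive the call context recovered through the ABI (for instance the function's arguments), rather than the fixed thread id used here for illustration.

```python
class ReplayCore:
    """Consumes dated instructions and fires user hooks when a registered
    function symbol is entered or left."""

    def __init__(self):
        self.hooks = {}               # (symbol, kind) -> callback
        self.current_thread = {}      # cpu -> software thread id

    def register(self, symbol, kind, callback):
        # kind is 'entry' or 'exit'
        self.hooks[(symbol, kind)] = callback

    def feed(self, cpu, symbol, kind):
        # called by the replay loop when the ABI analysis detects a call
        # ('entry') or a return ('exit') of a known symbol on `cpu`
        cb = self.hooks.get((symbol, kind))
        if cb is not None:
            cb(self, cpu)


# example: switch the software thread of a processor on a context load;
# the thread id would normally be decoded from the function's arguments
def on_context_load(core, cpu):
    core.current_thread[cpu] = "decoder_1"  # illustrative value


core = ReplayCore()
core.register("cpu_context_load", "entry", on_context_load)
core.feed(0, "cpu_context_load", "entry")
```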
The thread histories can then be used to study the software behavior. A data race program has been implemented that detects data races between a pair of threads using their two histories and the synchronization graph.
B. Experimentation detail
We implemented the first step (Section III-A1) using the SoCLib framework, a library of SystemC components for MPSoC simulation. The SoCLib [15] library provides components such as processors, caches, memories and interconnects, and allows platforms to be built at the system level.
Figure 4. Simulated platform
We modified the CABA (Cycle Accurate, Bit Accurate) implementation of the components to build a platform containing several MIPS32 ISA processors with write-through caches and write buffers. Memory and caches communicate through an abstract network. The memory is kept coherent between the caches through a directory-based mechanism.
Figure 4 shows an overview of this platform. The platform also contains other peripheral components: a timer, an interrupt controller, a frame buffer, a serial output peripheral and a storage peripheral. The simulation was run with the SystemCASS [16] SystemC kernel.
The software running on this platform is a parallel MJPEG decoder on top of a small operating system called DNA [17]. The MJPEG decoder is organized as follows: a first thread is in charge of reading the file and dispatching JPEG blocks to several threads which do the decoding; the decoded parts of the images are then given to a last thread in charge of the display.
For the software analysis, several hooks have been registeredon DNA functions.
Hooks have been set on the thread context handlers (cpu_context_init, cpu_context_load and cpu_context_save) to handle thread creation and scheduling.
Hooks have been set on the lock functions (lock_acquire, lock_release) and on the operating system barrier (cpu_mp_proceed and cpu_mp_wait) to create synchronization points between the threads.
The insertion of synchronization points must be done carefully: missing some will lead to false data race detections, while inserting too many may prevent real races from being detected. For example, semaphores should not be considered as synchronization points, since they are not generally used to protect shared variables or memory.
Each thread was separated into an application level and a kernel level, and additional hooks have been set on several kernel functions to handle the switch between the two levels. This allows the kernel synchronization to be removed from the application part and vice versa.
C. Results
The platform was simulated with 1 to 16 processors. Figure 5 shows some numbers for a simulation of 100,000,000 cycles with 1, 2 and 8 processors. Sim. time is the time needed to simulate the platform and store the events. Overhead is the simulation time overhead compared with the initial simulation without event tracing. Ev. num and Ev. size are the number of events and their size. Inst. time is the total and user time needed to generate the dated processor instruction sequences from the events (the step described in Section III-A2). Inst. num and Inst. size are the number of instructions executed by all processors and their size. Soft. time is the total and user time used to build the software threads and the synchronization graph from the dated instruction sequences.
As these numbers show, the application does not scale well on the platform. This is due to the platform implementing sequential consistency, which is a very strict constraint on the memory hierarchy. Furthermore, we do not use any DMA transfers, because the trace system does not support them yet. As a consequence, performance is low when using several processors.
Processors                 1         2         8
Sim. time (seconds)        156s      395s      468s
Overhead                   6.3%      4.5%      3.5%
Ev. num (millions)         64.5      64.9      70.2
Ev. size                   1.3GB     1.3GB     1.4GB
Inst. time (total/user)    127s/12s  127s/13s  131s/14s
Inst. num                  31.4      31.6      34.6
Inst. size                 582GB     586GB     638GB
Soft. time (total/user)    43s/7s    44s/8s    48s/8s

Figure 5. Time spent in the different steps for different numbers of processors
The overhead of the trace system during the simulation is not very high. The time spent in the following steps is significant compared to the simulation time, but it is mostly spent in system calls (the user time is low), due to the large amount of data read from files in both steps. Most of that time could therefore be avoided by pipelining all the steps and not using intermediate files; this could be done without difficulty, as the two last steps read the results of the previous step linearly.
Using the thread call graphs and the synchronization graph generated by the last step, data race detection was run on pairs of threads that communicate together; this restriction is due to the complexity of the detection (each part of the first thread between two consecutive synchronization points must be checked against each part of the second thread that can be concurrent with it). No data races were detected in the MJPEG decoder using these synchronization points. In order to test our analysis, we removed some locks from the FIFO driver used by the threads of the MJPEG decoder to transfer data; this led to the detection of data races. However, a detected data race must be studied carefully, as it may be a false positive. For example, an increasing counter which is protected by a lock for updates but not for reads will cause false data races to be reported on the reads.
V. CONCLUSION
The generalization of multiprocessors in integrated circuits raises the issue of debugging parallel programs in embedded devices. Debugging can no longer be done step by step on a console, and relies more and more on trace analysis. We have shown that, using virtual prototypes producing non-intrusive traces, it is possible to perform complex analyses such as data race detection. The analysis can be done at different levels of abstraction or target different parts of the software; for example, the operating system kernel and the application can be analysed independently.
However, our method only analyses a given execution on a platform. It could be coupled with a stimulation of the simulation (for example by adding random delays in the communications) to try to change the execution at each simulation run and increase the test coverage.
We plan to extend this work to systems that do not follow the sequential consistency model, and to the verification of cache coherence protocol implementations.
ACKNOWLEDGMENT
This work is funded by the French Authorities in the framework of the Nano 2012 Program.
REFERENCES
[1] ITRS, "ITRS 2009 Edition," 2009, http://www.itrs.net.
[2] ARM, "Embedded trace macrocell architecture specification," http://www.arm.com.
[3] MIPS Technologies, "EJTAG Trace Control Block Specification," http://www.mips.com.
[4] A. B. Hopkins and K. D. McDonald-Maier, "Debug support strategy for systems-on-chips with multiple processor cores," IEEE Transactions on Computers, vol. 55, no. 2, pp. 174–184, February 2006.
[5] B. Vermeulen, K. Goossens, and S. Umrani, "Debugging distributed-shared-memory communication at multiple granularities in networks on chip," in Proceedings of the ACM/IEEE International Symposium on Networks-on-Chip, April 2008, pp. 3–12.
[6] C. E. McDowell and D. P. Helmbold, "Debugging concurrent programs," ACM Computing Surveys, vol. 21, no. 4, pp. 593–622, December 1989.
[7] C.-N. Wen, S.-H. Chou, T.-F. Chen, and A. P. Su, "NUDA: A non-uniform debugging architecture and non-intrusive race detection for many-core," in Proceedings of the 46th ACM/IEEE Design Automation Conference, July 2009, pp. 148–153.
[8] L. Albertsson, "Simulation-based debugging of soft real-time applications," in Proceedings of the 7th IEEE Real-Time Technology and Applications Symposium, May 2001, pp. 107–108.
[9] N. Pouillon, A. Becoulet, A. V. de Mello, F. Pecheux, and A. Greiner, "A generic instruction set simulator API for timed and untimed simulation and debug of MP2-SoCs," in Proceedings of the IEEE/IFIP International Symposium on Rapid System Prototyping, June 2009, pp. 116–122.
[10] S. Taylor, C. Ramey, C. Barner, and D. Asher, "A simulation-based method for the verification of shared memory in multiprocessor systems," in IEEE/ACM International Conference on Computer Aided Design, November 2001, pp. 10–17.
[11] N. Fournel, A. Fraboulet, and P. Feautrier, "eSimu: a fast and accurate energy consumption simulator for real embedded systems," in IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks, 2007, pp. 1–6.
[12] D. Kranzlmueller, "Event graph analysis for debugging massively parallel programs," Ph.D. dissertation, Johannes Kepler University of Linz, Austria, 2000.
[13] C. Prada-Rojas, V. Marangozova-Martin, K. Georgiev, J.-F. Mehaut, and M. Santana, "Towards a component-based observation of MPSoC," in Proceedings of the International Conference on Parallel Processing Workshops, 2009, pp. 542–549.
[14] L. Lamport, "How to make a multiprocessor computer that correctly executes multiprocess programs," IEEE Transactions on Computers, vol. C-28, no. 9, pp. 690–691, 1979.
[15] "SoCLib," http://www.soclib.fr.
[16] R. Buchmann, F. Pétrot, and A. Greiner, "Fast cycle accurate simulator to simulate event-driven behavior," in Proceedings of the 2004 International Conference on Electrical, Electronic and Computer Engineering (ICEEC'04), 2004, pp. 35–39.
[17] X. Guerin and F. Pétrot, "A system framework for the design of embedded software targeting heterogeneous multi-core SoCs," in Proceedings of the 20th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2009, pp. 153–160.
Embedded Virtualization for the Next Generation of Cluster-based MPSoCs
Alexandra Aguiar, Felipe G. de Magalhaes, Fabiano Hessel
Faculty of Informatics – PUCRS – Av. Ipiranga 6681, Porto Alegre, Brazil
[email protected], [email protected], [email protected]
Abstract
Classic MPSoCs tend to be fully implemented using a single communication approach. However, recent efforts have shown a promising new multiprocessor system-on-chip infrastructure: the cluster-based, or clustered, MPSoC. This infrastructure adopts hybrid interconnection schemes in which both buses and NoCs are used concomitantly. The main idea is to decrease the size and complexity of the NoC by using bus-based communication at each local port. For example, while in a classic approach a 16-processor NoC might be formed in a 4 x 4 arrangement, in a cluster-based MPSoC a 2 x 2 NoC is employed and each router's local port connects to a bus that carries 4 processors. Nevertheless, although good results have been achieved with this approach, implementing the wrappers that connect the local router port to the bus can be complex. Therefore, we propose in this work the use of embedded virtualization, another promising current technique, to achieve results similar to cluster-based MPSoCs without the need for wrappers, while also decreasing area usage.
1 Introduction
Embedded Systems (ES) have become a solid reality in people's lives. They are present in a broad range of products, such as entertainment devices (smart phones, video cameras, games, toys), medical equipment (dialysis machines, infusion pumps, cardiac monitors), the automotive business (engine control, security, ABS) and even the aerospace and defense fields (flight management, smart weaponry, jet engine control) [18].
Usually, these systems need powerful implementation solutions comprising several processing units, such as Multiprocessor Systems-on-Chip (MPSoCs) [11]. One of the most important issues regarding MPSoCs lies in the way communication is implemented. Initially, bus-based systems were the most common communication solution, since they are usually simpler to implement.

978-1-4577-0660-8/11/$26.00 2011 IEEE
On the other hand, buses have poor scalability, since only a few dozen processors can be placed on the same structure without prohibitive contention rates. Therefore, other communication solutions started being researched, the most prominent being the Network-on-Chip (NoC) approach [16].
NoCs are a widely accepted communication solution based on general-purpose network concepts. However, NoCs can present more complex communication protocols and, consequently, less predictability.
In this context, a recent idea known as Cluster-based MPSoCs has gained attention [7], [12]. This approach intends to bring together the best of both worlds: NoCs allow higher scalability, while buses keep the design simpler even with more processors in the system. To better understand the concept, Figure 1 depicts a 2x2 NoC which contains a bus located at each local port. Each bus carries four processors, which communicate in a simpler way inside the cluster and, if needed, can communicate with other clusters through the NoC. Dotted lines represent the wrappers needed to connect the bus to the NoC.
Figure 1. Cluster-based MPSoC concept
Another recent idea for embedded systems is the use of virtualization in their composition. Virtualization has several possible advantages, including decreased area, increased security levels and eased software design [10], [2], [3]. Virtualized systems are composed of a hypervisor that holds and controls the operational details of all virtual machines.
This paper proposes the unification of both concepts. Instead of using buses on each router of the NoC, we propose a single processor holding a hypervisor that emulates several virtual processors. Since buses are poorly scalable, hypervisors do not need to support more processors than a simple bus would. The main contribution of this proposal, named Virtual Cluster-based MPSoCs, is to provide multiprocessed systems with lower area occupation.
The remainder of the paper is organized as follows. Section 2 presents related work on cluster-based MPSoCs. Section 3 reviews basic concepts of embedded virtualization. Then, in Section 4, details of the Virtual Cluster-based MPSoC are discussed. Section 5 details motivational use cases and some initial experimental results. Finally, Section 6 concludes the paper and presents future work.
2 Cluster-based MPSoCs
It is widely known that several MPSoCs are bus-based architectures; systems such as the ARM MPCore [9], the Intel IXP2855 [6] and the Cell processor [13] are examples. Nevertheless, the need for more processing elements and growing system complexity have led to other approaches being researched.
Networks-on-Chip (NoCs) have arisen as the main communication infrastructure for complex MPSoCs. However, the design of NoC-based parallel applications is far more complex than that of bus-based systems [7].
Due to the lack of scalability in bus-based systems and the excessive application design complexity found in NoCs, cluster-based systems are becoming a possible alternative. These systems intend to achieve the advantages of both.
In [7], the authors propose a cluster-based MPSoC prototype design. They integrate 17 Nios II [14] cores, organized in four processing clusters and a central core. In every cluster, each core has its own local memory and communication is performed through a shared memory accessed from the bus. For inter-cluster communication, the cores have a shared network interface.
This system further proposes that a single processing element has access to external peripherals, such as SDRAM controllers. This central control unit is also responsible for managing the mapping of the parallel application onto the clusters as well as gathering the expected results. Figure 2 depicts the architecture proposed by [7]. In this figure, LM stands for Local Memory, CSM for Common Shared Memory, NI for Network Interface and SDRAM IF for SDRAM Interface.
Figure 2. The architecture of the Cluster-based MPSoC proposed by [7]
Results were taken considering two real applications: matrix chain multiplication and JPEG picture decoding, both implemented on an FPGA development board. The implementation resulted in speedup ratios above 15 times. The main drawback is that real-time applications are not addressed by the authors.
Figure 3 shows an example of a processing cluster that composes the cluster-based MPSoC, itself composed of four processor cores. Each processor core, a Nios II, contains its own Local Memory (LM in the figure) and a bridge to access the local bus. A Common Shared Memory (CSM in the figure), used to exchange data among the processors, is also connected to this bus. In addition, a semaphore register file is present, used to synchronize the processes accessing the shared memory. Finally, the cores also share a Network Interface (NI in the figure) which allows inter-cluster communication.
Figure 3. The architecture of each processing cluster proposed by [7]
Jin [12] proposes a cluster-based MPSoC using hierarchical on-chip buses, aiming to attack some of the problems pure NoC implementations can present to the components connected to the network. One of the main problems pointed out by the authors concerns real-time applications, where the NoC must provide highly efficient data exchange. In this approach, no NoC is adopted; therefore, the performance of the computation cluster is very important for the system as a whole.
The approach presented in [12] can be seen in Figure 4. The system adopts the AMBA-AHB protocol, a high-performance system bus that supports multiple bus masters and provides high-bandwidth operation. The authors also use a hierarchical bus architecture to obtain better performance, especially by decreasing bus collision rates, improving the speed of register configuration and avoiding shared-memory contention and bottlenecks.
Figure 4. The architecture of the cluster-based MPSoC proposed by [12]
The proposed solution is divided into inner buses, which are present in each SoC itself - forming each cluster - and the outer bus, which connects them to each other and to external peripherals.
The work proposed in [4] also targets pure NoC implementations, adding a bus-based interface to NoC routers. The main goal is to ease integration with bus-based IP components, which are more commonly found. Thus, the proposed NoC is able to integrate standard non-packet-based components, reducing design time.
Other approaches have also studied the use of buses in NoCs for different purposes [15], [20]. In our case, we still want to use the NoC infrastructure, but instead of adding another level of communication we propose the use of virtual domains.
The next section introduces some concepts of embedded virtualization.
3 Virtualization and Embedded Systems
Even in classic virtualization, whose concepts date back more than 30 years [8], the main component is the hypervisor. The hypervisor is responsible for managing the virtual machines (also known as virtual domains) by providing them the environment needed for their proper operation.
Two approaches are commonly used to implement the hypervisor, also known as Virtual Machine Monitor (VMM). A type 1 hypervisor, also known as hardware-level virtualization, can itself be considered an operating system, since it is the only piece of software that works in kernel mode, as depicted in Figure 5. Its main task is to manage multiple copies of the real hardware - the virtual boards (virtual machines or domains) - just like an OS manages multitasking.
Figure 5. Hypervisor Type 1
Type 2 hypervisors, also known as operating-system-level virtualization and depicted in Figure 6, are implemented such that the hypervisor itself can be compared to a user application that simply “interprets” the guest machine ISA.
Figure 6. Hypervisor Type 2
One of the most successful techniques to implement virtualized systems is known as para-virtualization. It replaces sensitive instructions of the original kernel code with explicit hypervisor calls (also known as hypercalls). Sensitive instructions belong to a classification of the instructions of an ISA (Instruction Set Architecture) into three different groups, proposed by Popek and Goldberg [19]:
1. privileged instructions: those that trap when used in user mode and do not trap if used in kernel mode;

2. control sensitive instructions: those that attempt to change the configuration of resources in the system; and

3. behavior sensitive instructions: those whose behavior or result depends on the configuration of resources (the content of the relocation register or the processor's mode).
The goal of para-virtualization is to reduce the problems encountered when dealing with different privilege levels. Usually, a scheme referred to as protection rings is used, which guarantees that the lower-level rings (Ring 0, for instance) hold the highest privileges. Thus, most OSs are executed in Ring 0 and are able to interact directly with the physical hardware.
When a hypervisor is adopted, it becomes the only piece of software executed in Ring 0, with severe consequences for the guest OSs: they are no longer executed in Ring 0 but instead run in Ring 1, with fewer privileges.
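As a minimal illustration of para-virtualization (all names below are hypothetical and not taken from any real hypervisor ABI), a sensitive operation in the guest kernel is replaced by an explicit call into the hypervisor, which performs the privileged work on the guest's behalf:

```c
/* Hypothetical hypercall numbers; a real hypervisor defines its own ABI. */
enum { HC_DISABLE_IRQ = 1, HC_ENABLE_IRQ = 2 };

static int irq_enabled = 1;  /* stands in for real interrupt-controller state */

/* Hypervisor side: runs with full (Ring 0) privileges. */
long hypercall(int number)
{
    switch (number) {
    case HC_DISABLE_IRQ: irq_enabled = 0; return 0;
    case HC_ENABLE_IRQ:  irq_enabled = 1; return 0;
    default:             return -1;  /* unknown hypercall */
    }
}

/* Guest side: where native code would execute a privileged instruction
 * (and trap), the para-virtualized kernel invokes the hypervisor directly. */
void guest_disable_interrupts(void)
{
    hypercall(HC_DISABLE_IRQ);
}
```

The guest source is modified at build time, so no trap-and-emulate machinery is needed for this instruction at run time.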
These concepts come from virtualization for general-purpose systems, but they are very important when dealing with the typical challenges of embedded systems. Next, some peculiarities of applying virtualization solutions to embedded systems are discussed.
3.1 Virtual-Hellfire Hypervisor
There are several hypervisors with an embedded systems focus [22], [10], [21]. In this work, we adopt the Virtual-Hellfire Hypervisor (VHH) [3], part of the Hellfire Framework. The main advantages of VHH are:
• temporal and spatial isolation among domains (each domain contains its own OS);
• resource virtualization: clock, timers, interrupts, memory;
• efficient context switch for domains;
• real-time scheduling policy for domain scheduling;
• deterministic hypervisor system calls (hypercalls).
VHH considers a domain to be an execution environment where a guest OS can be executed, and it offers the domain the virtualized services of the real hardware. In embedded systems where no hardware support is offered, para-virtualization tends to present the best performance results. Therefore, in VHH, domains need to be modified before being executed on top of it. As a result, they do not manage hardware interrupts directly. Instead, the guest OS must be modified to use the virtualized operations provided by the VHH (hypercalls).
Figure 7 depicts the Virtual-Hellfire Hypervisor structure. In this figure, the hardware continues to provide basic services such as timers and interrupts, but they are managed by the hypervisor, which provides hypercalls for the different domains, allowing them to perform privileged operations.
Figure 7. Virtual-Hellfire Hypervisor Domain structure
Thus, the Virtual-Hellfire Hypervisor is implemented based on the HellfireOS [1] and counts on the following layers:
• Hardware Abstraction Layer - HAL, responsible for implementing the set of drivers that manage the mandatory hardware, such as the processor, interrupts, clock and timers;
• Kernel API and Standard C Functions, which are not available to the partitions;
• Virtualization layer, which provides the services required to support virtualization and para-virtualization. The hypercalls are implemented in this layer.
Figure 8 depicts the architecture of the VHH, where the following modules can be found:
• domain manager, responsible for domain creation, deletion, suspension, etc.;
• domain scheduler, responsible for scheduling domains on a single processor;
• interrupt manager, which handles hardware interrupts and traps and is also in charge of triggering virtual interrupts and traps to domains; and
• hypercall manager, responsible for handling calls made from domains, analogous to system calls in conventional operating systems.
Figure 8. VHH System Architecture
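A hypercall manager of this kind can be sketched as a dispatch table indexed by hypercall number, much like a conventional system-call table. The handler names and numbering below are illustrative assumptions, not VHH code:

```c
#include <stddef.h>

typedef long (*hypercall_fn)(long arg0, long arg1);

/* Two stub handlers standing in for real hypervisor services. */
static long hc_yield(long a0, long a1)    { (void)a0; (void)a1; return 0; }
static long hc_get_time(long a0, long a1) { (void)a0; (void)a1; return 42; }

/* Dispatch table indexed by hypercall number. */
static const hypercall_fn hypercall_table[] = { hc_yield, hc_get_time };

#define NUM_HYPERCALLS \
    (sizeof hypercall_table / sizeof hypercall_table[0])

/* Entry point reached when a domain traps into the hypervisor. */
long hypercall_manager(unsigned num, long arg0, long arg1)
{
    if (num >= NUM_HYPERCALLS)
        return -1;                       /* invalid hypercall number */
    return hypercall_table[num](arg0, arg1);
}
```

Bounds-checking the number before indexing is the one non-negotiable step: a domain must never be able to make the hypervisor jump through an out-of-range table slot.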
4 Virtual Cluster-Based MPSoCs
This section describes the Virtual Cluster-Based MPSoC proposal. Initially, let us look at each cluster of the MPSoC.
Since our work is based on the Hellfire Project, we also use the Plasma [5] processor, a MIPS-like architecture. The VHH is therefore placed on a Plasma processor as the basis of our cluster. The VHH is responsible for managing several virtual domains; in our case, each VHH manages its own processing cluster and allows the internal communication of its processors through shared memory.
Figure 9 is divided into two parts. Part A shows the current memory division, which provides only a single memory partition per virtual domain; this partition is considered the local memory of that domain. In part B, an extra partition has been added: the shared partition. The idea here is to provide easy, low-overhead communication inside the cluster.
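The partitioned layout of scheme B can be sketched as a simple address-map computation: each domain gets a private partition, and one shared partition sits after them all. The sizes and the struct below are our assumptions for illustration; the paper does not give VHH's actual layout:

```c
#include <stddef.h>

#define DOMAINS      4
#define PRIVATE_SIZE (24 * 1024)  /* assumed local partition per domain */
#define SHARED_SIZE  (32 * 1024)  /* assumed shared partition per cluster */

struct domain_map {
    size_t priv_base;   /* start of this domain's private partition */
    size_t priv_size;
    size_t shared_base; /* identical for every domain in the cluster */
    size_t shared_size;
};

/* Private partitions are laid out back to back; the shared partition
 * (scheme B of Figure 9) follows the last private one. */
struct domain_map layout(int domain)
{
    struct domain_map m;
    m.priv_base   = (size_t)domain * PRIVATE_SIZE;
    m.priv_size   = PRIVATE_SIZE;
    m.shared_base = (size_t)DOMAINS * PRIVATE_SIZE;
    m.shared_size = SHARED_SIZE;
    return m;
}
```

Because `shared_base` is the same in every domain's map, any two domains of the cluster see the same shared window and can exchange data through it.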
The VHH was extended to allow communication at two levels. The first level, named intra-cluster communication, occurs through shared memory. Currently, this is not transparent to the user and a specific hypercall must be used. This hypercall takes a single CPU identification (CPU ID), meaning that sender and receiver belong to the same processing cluster.
Figure 9. VHH Memory for (A) Non-clustered systems (B) Clustered systems

These hypercalls are similar to the communication functions provided by the HellfireOS and have the following parameters: VHH_SendMessage (cpu_id, task_id, message, message_length), used to send a message through the shared memory, and VHH_ReceiveMessage (source_cpu_id, source_task_id, message, message_length), used to receive it.
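The shared-memory exchange behind these hypercalls can be sketched as a single-slot mailbox living in the shared partition. This is a hypothetical model of the mechanism; the paper does not describe VHH's internal data structures, and the `vhh_send`/`vhh_receive` names are ours:

```c
#include <string.h>

#define MAILBOX_CAP 64

/* One slot standing in for a mailbox in the cluster's shared partition. */
struct mailbox {
    int  full;                /* 1 while a message is pending */
    int  src_cpu, src_task;   /* identifies the sender */
    char data[MAILBOX_CAP];
    int  len;
};

static struct mailbox shared_box;   /* stands in for shared memory */

int vhh_send(int cpu_id, int task_id, const char *msg, int len)
{
    if (shared_box.full || len > MAILBOX_CAP)
        return -1;                  /* slot busy or message too large */
    shared_box.src_cpu  = cpu_id;
    shared_box.src_task = task_id;
    memcpy(shared_box.data, msg, (size_t)len);
    shared_box.len  = len;
    shared_box.full = 1;
    return 0;
}

int vhh_receive(int src_cpu, int src_task, char *msg, int *len)
{
    if (!shared_box.full ||
        shared_box.src_cpu != src_cpu || shared_box.src_task != src_task)
        return -1;                  /* nothing pending from that sender */
    memcpy(msg, shared_box.data, (size_t)shared_box.len);
    *len = shared_box.len;
    shared_box.full = 0;            /* free the slot for the next message */
    return 0;
}
```

In a real hypervisor the slot would be guarded against concurrent access (the paper's reference design uses a semaphore register file for exactly this purpose); that synchronization is omitted here for brevity.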
The second communication level is done among clusters, through the NoC. In our case, we use the HERMES NoC [17] and a MIPS-like processor in each router. We adopted a Network Interface (NI) as a wrapper which connects the NoC router to the processor located at its local port. This interface works in a similar way to the non-virtualized approach, which increases the possibility of using several NoC infrastructures as the underlying architecture. Figure 10 depicts this approach.
Figure 10. VHH Communication Infrastructure with NoC-based Systems
The wrapper is connected to the Plasma through specific read and write memory addresses. A VHH communication driver had to be written to allow the integration between the wrapper and the virtual cluster. Also, the hypercalls provided by the VHH allow a virtual processor to send or receive messages with an extra parameter: the Virtual CPU ID, which identifies the virtual CPU on a specific cluster.
Thus, the hypercalls used for inter-cluster communication are: VHH_SendMessageNoC (cpu_id, virtual_cpu_id, task_id, message, message_length), used to send a message through the NoC, and VHH_ReceiveMessage (source_cpu_id, source_virtual_cpu_id, source_task_id, message, message_length), used to receive it.
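The choice between the two levels can be sketched as a routing decision on CPU identifiers: if the destination virtual CPU sits on the same physical processor, the message stays in shared memory; otherwise it leaves through the NI onto the NoC. The global numbering scheme below is our assumption for illustration:

```c
enum route { ROUTE_SHARED_MEM, ROUTE_NOC };

#define DOMAINS_PER_CLUSTER 4

/* Assumed global numbering: virtual CPU v on physical processor c is
 * c * DOMAINS_PER_CLUSTER + v. */
static int cluster_of(int global_vcpu)
{
    return global_vcpu / DOMAINS_PER_CLUSTER;
}

/* Decide which communication level a message must take. */
enum route route_message(int src_vcpu, int dst_vcpu)
{
    if (cluster_of(src_vcpu) == cluster_of(dst_vcpu))
        return ROUTE_SHARED_MEM;  /* intra-cluster hypercall */
    return ROUTE_NOC;             /* inter-cluster hypercall via the NI */
}
```

This is why the inter-cluster hypercalls carry both a physical CPU ID (which cluster) and a virtual CPU ID (which domain inside it).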
The complete vision of the system is depicted in Figure 11. In the figure, VHH is the Virtual-Hellfire Hypervisor, LM stands for Local Memory, NI for Network Interface and PE for Processing Element. R represents each router of the NoC.
Figure 11. Virtual Cluster-Based MPSoC proposal
5 Use Cases and Experimental Results
In this section we highlight some possible use cases for Cluster-Based MPSoCs and some preliminary prototyping results.
The main use for a Cluster-based MPSoC is the possibility of field specialization. In this case, each cluster is responsible for executing a set of tasks with a common purpose. For instance, it is possible to execute a JPEG decoder in one cluster, an MPEG decoder in another, and so on. Here, the greatest advantage is simplified communication among similar tasks, since they share a given memory area, while still allowing a large number of processors and increasing system scalability through NoC usage. Figure 12 depicts an example of a cluster-based MPSoC with application specialization.
Figure 12. Virtual Cluster-Based MPSoC with Application Specialization

Another possible use of the Virtual Cluster-based MPSoC is when a decrease in area with guaranteed system scalability is needed. Scalability is assured by NoC usage, and the cluster-based MPSoC itself allows an easier use of real-time tasks with no extra communication penalties. Regarding area occupation, we prototyped some possible configurations to illustrate the benefits of our approach in this respect, using the Xilinx Virtex-5 XC5VLX330T FPGA.
First, when using the HellfireOS with a Plasma processor, we usually indicate a processor with at least 16KB of local memory. HellfireOS is a highly optimized kernel and, depending on the application, even such a small memory can fulfill the expected needs. When using the VHH, more memory is required and the total memory size depends especially on the number of virtual domains required. Although greater memory sizes imply more block RAMs, this does not affect the FPGA area measured in LUTs. In all experiments performed, the total system memory could be mapped to block RAMs.
We used three different MPSoC configurations, all with 16 processors (physical or virtual). First, we have a 16-processor MPSoC distributed in a 4x4 NoC where each router carries its own processor, known as the Pure 4x4 NoC approach.
The second MPSoC configuration uses a 2x2 NoC with a bus-based clustering system, known as the Bus Clustered approach. Here, each router has a wrapper to connect it to the cluster bus, and each bus carries four processors.
Finally, the last approach is the Virtual cluster-based one (V-Cluster 2x2 NoC), where a 2x2 NoC is used again and each router contains a single physical processor. This processor runs the VHH, where 4 virtual domains are emulated per cluster, totaling the 16 processors of the MPSoC.
In the first two solutions, each processor has 16KB of local memory. In the last, the virtual cluster approach, 4 processors with 128KB of memory each were employed. Table 1 shows the prototyping results for the three different MPSoCs.
Table 1. Area results for the MPSoC configurations

Configuration              Area occupation (LUTs)
Pure 4x4 NoC               60934
Bus Clustered 2x2 NoC      56099
V-Cluster 2x2 NoC          17179

These results show a decrease in area occupation of up to 70%, depending on the processor local memory configuration and the original MPSoC configuration. Also, depending on the bus structure used for the bus-based clustered version, the bus communication overhead is similar to the virtualization overhead.
6 Concluding Remarks and Future Work
This paper presents a new proposal for MPSoC configuration using virtualization with a cluster-based approach. For validation purposes, we use an extension of the Hellfire Framework and, to incorporate our virtualization methodology, the Virtual-Hellfire Hypervisor (VHH).
We use a HERMES NoC as the underlying architecture, where each processor runs the VHH, forming the processing clusters. We achieved a decrease of up to 70% in FPGA area occupation in our preliminary tests.
As future work, we intend to obtain comparative results for performance and overheads against other approaches. We also want to improve the proposal itself, especially regarding memory and I/O management.
Acknowledgment
The authors acknowledge the support granted by CNPq and FAPESP to the INCT-SEC (National Institute of Science and Technology Embedded Critical Systems Brazil), processes 573963/2008-8 and 08/57870-9. This work is also supported in the scope of the project SRAM by the Research and Projects Financing (FINEP) under Grant 0108031000.
References
[1] A. Aguiar, S. Filho, F. Magalhaes, T. Casagrande, and F. Hessel. Hellfire: A design framework for critical embedded systems' applications. In Quality Electronic Design (ISQED), 2010 11th International Symposium on, pages 730–737, 2010.
[2] A. Aguiar and F. Hessel. Embedded systems' virtualization: The next challenge? In Rapid System Prototyping (RSP), 2010 21st IEEE International Symposium on, pages 1–7, 2010.
[3] A. Aguiar and F. Hessel. Virtual Hellfire Hypervisor: Extending Hellfire framework for embedded virtualization support. In Quality Electronic Design (ISQED), 2011 12th International Symposium on, 2011.
[4] B. Ahmad, A. Ahmadinia, and T. Arslan. Dynamically reconfigurable NoC with bus based interface for ease of integration and reduced design time. IEEE, June 2008.
[5] OpenCores. Plasma - most MIPS I(TM) opcodes. http://www.opencores.org.uk/projects.cgi/web/mips/, accessed September 2009.
[6] Intel Corp. Intel IXP2855 network processor. Available at http://www.intel.com/, 2005.
[7] L.-F. Geng. Prototype design of cluster-based homogeneous multiprocessor system-on-chip. 2009 3rd International Conference on Anti-counterfeiting, Security, and Identification in Communication, pages 311–315, Aug. 2009.
[8] R. P. Goldberg. Survey of virtual machine research. Computer, pages 34–35, 1974.
[9] J. Goodacre and A. N. Sloss. Parallelism and the ARM instruction set architecture. Computer, 38:42–50, July 2005.
[10] G. Heiser. Hypervisors for consumer electronics. pages 1–5, Jan. 2009.
[11] A. Jerraya, H. Tenhunen, and W. Wolf. Multiprocessor systems-on-chips. Computer, 38(7):36–40, July 2005.
[12] X. Jin, Y. Song, and D. Zhang. FPGA prototype design of the computation nodes in a cluster based MPSoC. IEEE, July 2010.
[13] M. Kistler, M. Perrone, and F. Petrini. Cell multiprocessor communication network: Built for speed. IEEE Micro, 26:10–23, May 2006.
[14] Altera Ltd. Nios II processor reference. Available at http://www.altera.com/, 2009.
[15] R. Manevich, I. Walter, I. Cidon, and A. Kolodny. Best of both worlds: A bus enhanced NoC (BENoC). IEEE, Nov. 2010.
[16] G. D. Micheli and L. Benini. Networks on Chips: Technology and Tools (Systems on Silicon). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006.
[17] F. Moraes, N. Calazans, A. Mello, L. Moller, and L. Ost. Hermes: an infrastructure for low area overhead packet-switching networks on chip. Integr. VLSI J., 38(1):69–93, 2004.
[18] T. Noergaard. Embedded Systems Architecture: A Comprehensive Guide for Engineers and Programmers. Newnes, 2005.
[19] G. J. Popek and R. P. Goldberg. Formal requirements for virtualizable third generation architectures. Commun. ACM, 17(7):412–421, 1974.
[20] T. Richardson, C. Nicopoulos, D. Park, V. Narayanan, Y. Xie, C. Das, and V. Degalahal. A hybrid SoC interconnect with dynamic TDMA-based transaction-less buses and on-chip networks. In VLSI Design, 2006. Held jointly with 5th International Conference on Embedded Systems and Design, 19th International Conference on, page 8 pp., 2006.
[21] Wind River. Available at http://www.windriver.com/, accessed 2 Oct. 2010.
[22] XEN.org. Embedded Xen project. Available at http://www.xen.org/community/projects.html, accessed 10 Aug. 2010.
Session 5: Model Based System Design
Rapid Property Specification and Checking for Model-Based Formalisms
Daniel Balasubramanian, Gabor Pap, Harmon Nine, Gabor Karsai
ISIS / Vanderbilt University, Nashville, TN 37212
Email: {daniel.a.balasubramanian, gabor.pap, harmon.s.nine, gabor.karsai}@vanderbilt.edu

Michael Lowry, Corina Pasareanu, Tom Pressburger
NASA Ames Research Center, Moffett Field, CA 94035
Email: {michael.r.lowry, tom.pressburger, corina.s.pasareanu}@nasa.gov
Abstract—In model-based development, verification techniques can be used to check whether an abstract model satisfies a set of properties. Ideally, implementation code generated from these models can also be verified against similar properties. However, the distance between the property specification languages and the implementation makes verifying such generated code difficult. Optimizations and renamings can blur the correspondence between the two, further increasing the difficulty of specifying verification properties on the generated code. This paper describes methods for specifying verification properties on abstract models that are then checked on implementation-level code. These properties are translated by an extended code generator into implementation code and special annotations that are used by a software model checker.
I. INTRODUCTION
Model-based development (MBD) is a software and system design paradigm based on abstractions called models. Domain-specific modeling languages (DSMLs) [1] provide the ability to represent models that are specific to a particular problem domain. Cast in this light, Matlab/Simulink [2] can be viewed as a DSML for physical and embedded systems, as it allows modeling the (dynamics of the) physical plant as well as the behavior of its controller software. Once the model is created, the closed-loop system can be simulated, output traces observed, and the model modified as needed.
Simulation alone, however, cannot provide rigorous guarantees about a model's behavior. In order to prove exhaustively that a model's dynamic behavior always satisfies a set of properties, some sort of verification [3] must be performed. Typical properties include state reachability, deadlock-freedom and a wide range of temporal properties. In recent years, model-level verification tools have been developed that can check models for such properties. While these tools play an important role in MBD and can provide guarantees about a model's behavior, their use is often limited to a small portion of a complex system, i.e., key properties and algorithms.
One of the key goals of MBD is to gradually refine abstract, high-level models until they can be automatically synthesized into an implementation that runs on a non-ideal computational platform. However, one crucial problem is often ignored: how can one verify that the synthesized implementation code satisfies the same properties as the models from which it was generated? Without verifying the implementation, the guarantees provided by checking the abstract models are lost. Checking or proving the correctness of the synthesis (transformation) algorithms is an open problem. Further, if no verification is performed on high-level models, then verifying the implementation is the only way to prove properties about the system.
The major difficulty in verifying model-level properties on implementation-level code lies in the different levels of abstraction. Abstract models are developed by hand and designed with readability in mind, while automatically generated code can be difficult to read. Further, the correspondence between model elements and their generated code is not obvious. Renamings and optimizations make it difficult to understand how a particular model element is represented in the generated code. As a result, knowing where to place properties that are to be verified becomes a challenge.
Another difficulty lies in the mismatch between the input languages of verification tools used at the different levels of abstraction. Individual verification tools typically each use their own input language for defining properties, so that properties checked at the model level must be rewritten in a new syntax to be checked on the implementation-level code. This problem is exacerbated by the fact that code generators typically rename model elements in the generated code, so that, for instance, the names of variables in the generated code are not known at the model level. Without knowing the names of the variables, verification properties certainly cannot be defined.
We present in this paper a method for specifying properties on high-level models that are then used in the verification of the generated, implementation-level code. Properties are written in an intuitive way, directly on the model elements. As the model is translated into various intermediate forms and
Fig. 1. Overview of framework. Verification properties can be specified using observer automata or contracts.
ultimately into executable code, the user-defined properties are preserved and translated into implementation code and annotations that are checked by a software model checker. The translation is performed via a code generator that has been extended to handle the extra information. The results of the verification are then displayed to the user (in terms of the original high-level model). While we focus on Matlab/Simulink, we believe that our method of defining properties on the model level that are checked against a generated implementation can be generalized and leveraged in other MBD tools as well. This approach makes property-based verification an integral part of the development workflow. Note that the framework enables run-time verification in addition to model checking.
The remainder of the paper is organized as follows. Section II gives an overview of our approach and background, including a description of the tool-suite. Section III provides details on how the user annotates Simulink models with properties. Section IV presents an end-to-end example. We compare our approach with related work in Section V and conclude in Section VI.
II. OVERVIEW AND BACKGROUND
An overview of our approach is depicted in Figure 1 and consists of the following steps: (1) a Simulink model is defined, (2) the model is annotated with properties to verify, (3) the code generator is invoked to produce executable code, (4) the software model checker is executed on the code and properties, and (5) results about the verification process are reported.
The first and third steps are described in [4]; this paper concentrates on the other steps. The code generator produces restricted-form Java code and is the same code generator described in [4], but extended with features for generating annotations for verification. The main motivation for this choice of target language was that the software model checker used can work with Java programs. The code generated by our toolchain is completely sequential and does not use dynamic memory (after initialization), hence it is suitable for embedded applications. The code is also object-oriented (an increasing trend in embedded software): subsystems are translated into Java classes that are instantiated at initialization time. Our code generator actually uses a re-targetable back-end, such that either Java or C code can be produced from the same abstract syntax tree.
A. Property annotations

The second step in Figure 1 is annotating the Simulink model with properties to verify on the generated code. Since the development of model checking [5] in the early 1980s, a number of specification languages have been invented to formally define properties. Common ways of specifying these properties include regular expressions and temporal logic, such as LTL and CTL. However, the drawback to using temporal logics for property specification is their steep learning curve for industrial practitioners. Consequently, designers and developers will be less likely to use verification tools if they must devote large amounts of time to learning a specification language.
For this reason, we decided to take two approaches to property specification. The first uses the pattern-based system introduced in [6]. In that work, the authors studied a large body of existing property specifications and found that the majority of them were instances of a small set of parameterizable patterns: reusable solutions to recurring problems.
Patterns are entered into our system using a custom interface that we integrated directly into Simulink. After the parameters have been entered, our interface generates an observer automaton to represent an instance of that pattern. Formulation of assertions as Statechart observer automata has been described in Chapters 4 and 5 of [7]. Because we are in a Simulink context, it is natural to represent observer automata as Stateflow subsystems inserted in the Simulink diagram that implement the logic of the specification described by the pattern. They contain input signals corresponding to the variables and events under observation, and the internal states that implement the logic of the property. The generated observer automata are competitive in size with those coded by hand. Full details are given in Section III.
The second approach to property specification is based on contracts and is similar to the idea of programming by contract [8]. Programming by contract is a methodology for writing programs that use interface specifications on software components to define properties about their behavior. Typically, the specification on a component includes three elements: properties that must hold in order to use the component correctly (preconditions), properties that will hold when the component has finished executing (postconditions), and properties that must always be satisfied (invariants). We applied this idea of contracts to specifying properties for Simulink subsystems. On any subsystem, the user is allowed to write preconditions, postconditions and invariants that must be satisfied by that subsystem. During the code generation phase, the contracts on the various subsystems are translated into annotations on the methods and classes implementing these subsystems in the generated code. A thorough description is given in Section III.
B. Software model checking

Our generated code is verified using Java Pathfinder (JPF) [9], a software model checker for Java. We chose JPF for two reasons. First, our toolsuite was already configured to generate Java code. Second, JPF provides libraries supporting a number of verification features especially useful in our toolsuite: code contracts, monitoring execution for exceptions and numerical problems, as well as symbolic execution.
The code contract feature of JPF permits annotations for preconditions, postconditions and invariants to be written on classes and methods. JPF monitors these conditions at runtime and reports any violations. This feature allows the preconditions, postconditions and invariants that are defined on the Simulink model elements to be translated to the generated code in a straightforward manner by the code generator.
The symbolic execution [10] feature of JPF allows us to perform state reachability analysis and test case generation. The symbolic execution engine runs a program much like a normal execution, but it does not assign concrete values to program input variables. Instead, input variables are left as symbolic values. When input variables are used in a branching condition, a constraint solver attempts to find values for the symbolic variables that will allow both branches of the condition to be taken. This idea is explained further in [10]. In this paper, we do not concentrate on the symbolic execution aspect.
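As a simple illustration of the idea (our example, not taken from the paper's toolchain), consider a method with a single branch. With x symbolic, two path constraints arise, x > 0 and x <= 0, and a constraint solver would produce one concrete witness per path, e.g. x = 1 and x = 0:

```java
// Minimal illustration of what symbolic execution explores.
// Each return statement is reached under a distinct path constraint.
public class BranchExample {
    static int classify(int x) {
        if (x > 0) {
            return 1;   // path constraint: x > 0
        }
        return -1;      // path constraint: x <= 0
    }
}
```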
III. SPECIFICATION PATTERNS AND CONTRACTS
This section gives details on how properties are specified at the model level and then translated into generated code. We first describe the specification patterns, which can be attached to the model using a custom interface or from a supplied library. If the interface is used, a corresponding observer automaton is automatically generated from the specifications. The interface can be used to insert basic properties, but to describe more complex properties, the observer automata can be compositionally defined using the supplied library. We also describe the details of how contracts are written on the model and then translated into annotations on the generated code.
A. Specification patterns
Property specification patterns describe commonly observed requirements in a generalized manner. They capture a particular aspect of a system’s behavior as a sequence of state configurations. Note that the specifications can be state-based or event-based. In the discussion below we mention the state-based form, but the same approach applies to events as well.
To illustrate, consider the property that throughout a system’s execution the value of a certain variable should always be greater than zero. There are two basic parts to this property that commonly occur. The first tells when the property should hold (in this case, at all times during execution), and the second tells what condition should be satisfied during this time (here, the variable should be greater than zero).
A property consists of precisely those two pieces: a scope and a pattern. The scope defines when a particular property should hold during program execution, and the pattern defines the conditions that must be satisfied. There are five basic kinds of scopes: global (the entire execution), before (execution up to a given state), after (execution after a state), between (execution from one state to another) and until (execution from one state to another, even if the second state never occurs).
There are three categories of patterns: occurrence, order and compound. The occurrence group contains the absence (never true), universality (always true), existence (true at least once) and bounded existence (true for a finite number of times) patterns. The order group contains the response (a state must be followed by another state) and the precedence (a state must be preceded by another state) patterns, and the compound group contains the chain precedence and chain response patterns.

Fig. 2. Scope library.
Dwyer et al. [6] have shown how these scopes and patterns can be expressed in LTL, CTL, and other formalisms. However, the property specification patterns can also be easily expressed as parameterized observer automata, which is the approach we take. Note that many specifications can be added to a model, and each one is translated into a separate automaton. Additionally, the definition of a simple interface allows the composition of the scope and pattern aspects of the specification, represented as two distinct automata templates. Furthermore, using the Stateflow language allows the observer automata to be created inside Simulink diagrams. Statechart hierarchy is exploited in some of the examples in Chapters 4 and 5 of [7], and we make use of hierarchy in formulating each scope as a Stateflow diagram that contains a pattern submachine.
The Simulink model extended with the observer automata is then translated into the target language. Hence the generated, ’functional’ code is augmented with the code that implements the observer automata. Now the software model checker can monitor and verify the execution of the entire implementation, paying special attention to the error states and properties specified in the observer automata. As specifications are translated into executable code, the distance between code-level monitoring and software model checking on the one hand and model-level property specifications on the other is reduced.
Figure 2 shows the automata for three of the five scopes. We now briefly describe each of these.
The automaton for the global scope is shown at the top of Figure 2. This scope indicates that a property should hold during the entire system execution. Initially, the state labeled “Pattern” is entered. There are two transitions from this state to the state labeled “Error State”. The first is triggered by an event named “error event”. This event is generated by an enclosed property when that property has been violated. The second transition is triggered by an event named “end event” and a guard condition requiring the boolean value “propertyOK” to be false. The “end event” is generated upon system termination, and the “propertyOK” variable is set to false by the scope’s enclosed property if that property is violated. That is, the second transition is taken if the system terminates and the property enclosed by this scope has been violated.
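In code terms, the global scope's behavior can be sketched as a small monitor class (our illustration; the actual Stateflow-to-code translation differs). The enclosed pattern drives "propertyOK" and may raise the error event; the scope enters its error state either immediately on the error event or when the run ends with the property still unsatisfied:

```java
// Illustrative monitor for the global scope of Figure 2.
public class GlobalScopeMonitor {
    private boolean inErrorState = false;
    boolean propertyOK = true;      // written by the enclosed pattern

    // Transition 1: the enclosed pattern signals a violation immediately.
    void onErrorEvent() {
        inErrorState = true;
    }

    // Transition 2: system termination with the property unsatisfied.
    void onEndEvent() {
        if (!propertyOK) {
            inErrorState = true;
        }
    }

    boolean violated() {
        return inErrorState;
    }
}
```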
The automaton for the before scope is shown in the middle of Figure 2. This scope is used to express that a property should hold before some other condition is met. In the figure, the event named “Before” is used to represent the condition. Initially, the “Pattern” state is entered. If “end event” occurs (the system terminates) and the enclosed property has been violated (“propertyOK” is false), then the first transition is taken and the “Error State” is entered. If the “Before” event occurs and “propertyOK” is false, the second transition is taken and the “Error State” is entered. The state named “Safe State” is only entered if the “Before” event occurs and the enclosed property has not been violated (“propertyOK” is true). Note that, in general, a property is considered satisfied as long as the error state of the property’s scope automaton is not active.
The until scope captures the requirement that some condition should hold from one state to another, even if the second state never occurs. The bottom of Figure 2 shows the automaton for this scope. The two variables named “Before” and “After” are used to represent the two conditions in between which a property should hold. Upon entry, “Initial State” is entered. When the variable “After” becomes true, the transition to the “Pattern” state is taken. While in this state, the automaton waits for the property to be satisfied before the second condition is met. When the property is satisfied, the variable “propertyOK” becomes true. If, before “propertyOK” becomes true, either the “Before” condition becomes true or system execution ends (“end event” occurs), the transition to “Error State” occurs and signals an error to the user. Otherwise, if “propertyOK” is true (the property is satisfied) and the second condition is also satisfied (“Before” is true), the transition back to “Initial State” is taken, and the cycle repeats.
Figure 3 shows the automata for three of the patterns. At the top of the figure is the automaton for the existence pattern. This pattern states that a condition (represented in the automaton by the boolean variable “P1”) should occur during a specified scope. When the “Initial State” is entered, the “propertyOK” variable is set to false, indicating that the property is initially unsatisfied: P1 has not occurred. If “P1” does become true, then the transition to “P1 Encountered” is taken and “propertyOK” is set to true.
A simple pattern, absence, is shown in the middle portion of Figure 3. This pattern states that a condition (represented in the automaton by the boolean variable “P1”) should not occur during a specified scope. When the “Initial State” is entered, the “propertyOK” variable is set to true, indicating that the property is initially satisfied: P1 has not occurred. If “P1” does become true, then the transition to “Error State” is taken, “propertyOK” is set to false and the “error event” is emitted.

Fig. 3. Pattern library.

Fig. 4. Property describing that at some point, x should be greater than 0. Scope states are white and pattern states are shaded.
The automaton for the precedence pattern is at the bottom of Figure 3. This captures the property that some condition (“P2”) must be preceded by another condition (“P1”). Note that in this automaton, the initial state sets the “propertyOK” variable to true: the property is initially satisfied. If “P2” is true before “P1”, that is, the condition denoted by “P2” happens before the condition denoted by “P1” is met, then the transition to “Error State” is taken, “propertyOK” is set to false, and the “error event” is emitted. Otherwise, the overall precedence pattern is satisfied.
Scopes and patterns are combined to form property specifications. Consider the example in Figure 4, which specifies the following property: at some point during system execution, the input variable “x” should be greater than 0. Stated differently, throughout the entire system execution (i.e., global scope), x should be greater than 0 at least once (i.e., existence pattern). To define this property, the existence pattern shown in Figure 3 is inserted into the “Pattern” state of the global scope shown in Figure 2. The only difference is that the generic condition shown as “P1” in the basic existence pattern is replaced with the condition x > 0. Note that the “propertyOK” variable is set by the pattern and its value is used by the scope.
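A hand-written sketch of this composition (our illustration, not the generator's output) makes the division of labor concrete: the existence pattern sets propertyOK once x > 0 has been observed, and the global scope flags an error only if the run ends with propertyOK still false.

```java
// Illustrative composition of the global scope with the existence
// pattern for "x > 0 at least once". The pattern writes propertyOK;
// the scope reads it on the end event.
public class ExistenceGlobalMonitor {
    private boolean propertyOK = false;  // existence: initially unsatisfied
    private boolean error = false;

    // Pattern logic: observe one sample of x per execution step.
    void observe(double x) {
        if (x > 0) {
            propertyOK = true;           // "P1 Encountered"
        }
    }

    // Scope logic: on end_event, check the enclosed pattern's verdict.
    void onEndEvent() {
        if (!propertyOK) {
            error = true;
        }
    }

    boolean violated() {
        return error;
    }
}
```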
Fig. 5. Contract example.
Additionally, we developed a dedicated user interface that uses dialog forms for inputting property specifications. The dialogs capture both the kind of scope and pattern, as well as the parameters needed to instantiate and compose them. The user picks the scope and the pattern and enters the appropriate conditions. An automaton that composes an instance of both the scope and the pattern is then automatically generated. An example using these dialog forms is described in Section IV.
B. Contracts
The second method we use for describing verification properties is based on contracts. We extended Simulink with a custom interface that allows the user to annotate any subsystem with three additional items.
• Preconditions that the input signals to the subsystem must satisfy.
• Postconditions that the output signals of the subsystem must satisfy.
• Invariants that must always be satisfied by the subsystem.

Note that a subsystem translates into an executable function that is called periodically by some scheduler. Hence, the above conditions and invariants can be checked during execution of that function block.
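The runtime view of such a check can be sketched with plain assertions around a function block's execution (our illustration with a hypothetical block; the actual toolchain instead emits annotations that JPF monitors, as shown later in Listing 1):

```java
// Illustrative runtime contract checking for a periodically scheduled
// function block. The block itself is hypothetical: it clamps its
// input to the range [0, 100].
public class SaturatorBlock {
    static int step(int input) {
        // Precondition on the block's input signal.
        if (!(input >= -1000 && input <= 1000)) {
            throw new IllegalArgumentException("precondition violated");
        }
        int output = Math.max(0, Math.min(100, input));
        // Postcondition on the block's output signal.
        if (!(output >= 0 && output <= 100)) {
            throw new IllegalStateException("postcondition violated");
        }
        return output;
    }
}
```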
Figure 5 shows an example of specifying contracts on a subsystem block. The internal details of the subsystem are not important; rather, they serve to show how our approach allows the complexities of certain elements to be ignored when writing specifications. The subsystem in Figure 5 has two inputs, x and y, and one output, z. Suppose we wish to check the following property: either x is equal to 0 and y is between 0 and 10, or x is equal to 1 and y is between 10 and 20. Suppose we also wish to check that if x is 0, then the output z is greater than 0, and if x is 1, then the output z is less than 0. These requirements are attached to the subsystem using the dialog box shown at the top of Figure 5.
The contracts are added to the subsystem model as specially formatted descriptions (which are usually just unstructured text), using an XML-like syntax. The code generator parses these descriptions and, if they are syntactically correct, constructs the properly formatted strings (with variable names rewritten into their ’code’ equivalents) that are suitable for the software model checker.
A Java implementation of the subsystem in Figure 5 that is very similar to the code produced by our code generator is shown in Listing 1. Note that in the contract, the inputs and outputs of the subsystem are referred to by their names in the model. This is an important part of our approach: the user always refers to the model elements as they are written in the model. No knowledge of the code generation process is needed to write specifications. The contract specified in the model is generated in the Java code as annotations that automatically reference the correct variable names. These annotations are used by the software model checker to monitor the code execution.
Listing 1. Java implementation of the subsystem in Figure 5.

public class Subsystem15 {
  private int value1 = 0;
  private int value2 = 0;

  @Requires("(x13 == 0 && y25 > 0 && y25 < 10) ||
             (x13 == 1 && y25 > 10 && y25 < 20)")
  @Ensures("(x13 == 0 && z65 > 0) ||
            (x13 == 1 && z65 < 0)")
  public void Main23(int x13, int y25, int[] z65) {
    value1 = x13;
    value2 = y25;
    ... // Code implementing subsystem logic
  }
}
IV. EXAMPLE
This section shows how our framework can be applied to realistic models. The example we use is the Apollo Lunar Module digital autopilot model, which is included with the Matlab/Simulink distribution as an example. The full model includes a dynamic model of the plant (the Apollo Lunar Module) as well as a model of the Reaction Jet Controller (RJC); we focused on the embedded controller. A very high-level view is shown in Figure 6. The RJC receives attitude measurements and desired attitude values, and generates control signals to activate the yaw, pitch and roll thrusters.
A. Step 1: Define Property
The “Yaw Jets” output of the RJC block is a value from the set {-2, 0, 2}, indicating that the yaw thruster should have a negative thrust, no thrust or a positive thrust, respectively. Suppose we wish to verify the property that the “Yaw Jets” output can never go directly from -2 to 2 or directly from 2 to -2: at least one output of 0 must always occur in between. Section III showed how a property like this could be built manually using automata. Using the scope and pattern automata as building blocks, one could define this property directly in Stateflow.
As mentioned above, we have also developed a custom extension to the Simulink environment that allows properties to be entered in an easier way using dialog forms. These dialogs decompose the patterns detailed in Section III-A: the user selects a pattern, enters a scope and a property, and the equivalent automaton is generated, including input ports. Our first task is to decide which pattern we need to implement the property that the “Yaw Jets” value can never go directly from -2 to 2 or directly from 2 to -2. Part of the property states that we do not want the value of “Yaw Jets” to be -2 during a certain scope. The absence pattern fits this requirement, as it checks that some condition never occurs.
The dialog form for the absence pattern is shown in Figure 7. This dialog guides the user through the process of defining a property. After defining the condition that should never hold (Command == -2), we define the scope during which this check applies. In this example, we never want Command to go directly from 2 to -2, so the condition that Command should never be -2 must hold after Command is equal to 2 and before Command is equal to 0. The property that Command should never go directly from -2 to 2 is defined in an analogous way using the absence pattern dialog.
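The check just described can be sketched in plain Java (an illustration of the monitored logic, not the generated Stateflow automaton): the monitor is "in scope" after seeing Command == 2 and until seeing Command == 0, and flags a violation if -2 is observed while in scope.

```java
// Illustrative monitor for: Command must never go directly from 2 to -2
// (absence of Command == -2, after Command == 2 and before Command == 0).
public class YawJetsMonitor {
    private boolean inScope = false;
    private boolean violated = false;

    void observe(int command) {
        if (inScope && command == -2) {
            violated = true;          // absence pattern failed in scope
        }
        if (command == 2) {
            inScope = true;           // scope opens after Command == 2
        } else if (command == 0) {
            inScope = false;          // scope closes on Command == 0
        }
    }

    boolean isViolated() {
        return violated;
    }
}
```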
B. Step 2: Connect generated automata
After entering the parameters in the dialog form, the observer automaton monitoring the property is generated, as shown in Figure 8. The states representing the scope portion of the property are white, and the states representing the pattern are shaded. The transition from the initial state is taken when Command is 2, at which point we are “in scope” and want to verify the absence of the condition that Command is -2 before it is 0. If the value of Command is -2 before it is 0, the transition to the inner error state is taken, which sets the “propertyOK” variable to false and emits the “error event”. When “error event” is emitted, the outer transition to the error state is taken and the automaton remains in this state. Note that while the automaton is in scope, system termination (the “end event”) will not cause the property to be violated as long as Command has not been set to -2. The input port for Command is automatically generated, so the user must connect the “Yaw Jets” signal to the automaton so that it can be monitored. In Figure 6, the “Command Constraint” and “Command Constraint2” automata have already been connected to the “Yaw Jets” signal.
C. Step 3: Verification with JPF
The final step is to invoke the code generator and use JPF to verify our properties. There are two ways JPF can check the code for property violations. The first uses concrete inputs provided by the user. If this is done, JPF will perform a concrete system execution using those inputs and report any property violations in the form of stack traces. The second way uses the symbolic execution module. In this case, JPF will try to determine inputs to the system that will cause properties to be violated. With either method, property violations are reported to the user in the form of a stack trace showing the sequence of method invocations that led to an error state.

Fig. 6. High-level view of the Apollo Autopilot. The Command Constraint automaton was automatically generated using the property defined in Figure 7. The second automaton was also automatically generated.

Fig. 7. Property dialog. The property says that after the input variable “Command” becomes 2, it should never be equal to -2 before returning to 0.
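For readers unfamiliar with JPF, a run is typically driven by a small application properties file. The sketch below is our illustration only (key names as we understand them from the JPF and jpf-symbc documentation; file names and paths are hypothetical, not taken from the paper):

```properties
# Hypothetical application properties file (Harness.jpf).
# "target" names the class whose main() JPF model-checks.
target = Harness
classpath = build/generated

# With the symbolic execution extension (jpf-symbc), selected method
# arguments can be marked symbolic instead of supplying concrete inputs,
# e.g. (names illustrative):
# symbolic.method = RJC.step(sym#sym#sym)
```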
V. RELATED WORK
In more traditional forms of software development, verification is done in one of two ways: either an abstract model of the software is created and verified, or the executable code itself is verified. [11] discusses the ongoing trend towards placing verification efforts directly on the executable code rather than on models. In MBD, however, one intentionally begins with models and gradually refines them until they are synthesized into executable code, and ideally both artifacts can be verified. Our approach eases the burden of both specifying and checking properties on code generated during the MBD process.
A number of tools are available for verifying Simulink/Stateflow models. Simulink Design Verifier [12] and Reactis [13] are commercial tools for checking model properties. [14] describes an approach based on hybrid automata: models are translated from Simulink into a hybrid automata formalism, and existing techniques for checking hybrid automata can then be applied. Our approach is complementary to these methods and ensures that the properties proved by these tools also hold for the generated code.

Fig. 8. Generated observer automaton implementing the property specified in Figure 7. Scope states are white and pattern states are shaded.
Our approach to specifying properties through patterns is based on the work of Dwyer et al. in [6]. The pattern library described there contains a general description along with mappings into multiple formalisms, including LTL, CTL and quantified regular expressions. Our implementation uses dialog forms to choose and configure simple patterns from which observer automata are generated, and includes a library of observer automata for individual scopes and patterns from which more complex patterns can be defined.
Runtime monitoring [15] is a related area in which formally specified properties are typically translated into executable code that is used to check program properties during program execution. Recent work in this area includes optimizing such monitors through static analysis techniques [16]. Our approach translates properties specified using observer automata into executable code that is checked by a software model checker, and translates contracts on model elements into annotations that are used by the model checker.
VI. CONCLUSION
Checking model-level properties on implementation code is a useful approach for practical model-driven development. In this paper, we have shown how relevant properties can be specified at the model level and then translated into implementation code that can be verified with a software model checker. Our approach is a pragmatic realization of the work described in [6] in the context of the Simulink/Stateflow environment. We have shown how the specification patterns can be instantiated from observer automata templates for scopes and patterns, and how subsystem blocks can be annotated with preconditions, postconditions and invariants that are monitored by the software model checker. We have demonstrated the use of the approach on a realistic example.
Our approach allows two ways of specification: contracts, and property specifications based on patterns (which are translated into observer automata). For designers of embedded systems, two extensions would be very useful: (1) specifying real-time properties, and (2) dealing with concurrency. Translated Simulink subsystems are typically executed periodically, with a fixed rate. Timing properties can relate to a single execution run (e.g., the worst-case execution time of a function block) as well as to the temporal behavior of the system over multiple execution runs (e.g., the system reacts to a triggering event within a bounded number of execution runs). Translated Simulink subsystems are also completely sequential; they are usually translated to functions in an implementation language. In order to run them on an execution platform, they have to be embedded into OS processes, and their communication and synchronization implemented outside of Simulink. Hence, we need to model these embeddings and how the threads containing the function blocks communicate and synchronize. These topics are the subject of ongoing research.
VII. ACKNOWLEDGMENTS
The work described in this paper has been supported by NASA under Cooperative Agreement NNX09AV58A. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Aeronautics and Space Administration. The authors would also like to thank Michael Whalen for valuable discussions and feedback.
REFERENCES
[1] A. Ledeczi, A. Bakay, M. Maroti, P. Volgyesi, G. Nordstrom, J. Sprinkle, and G. Karsai, “Composing domain-specific design environments,” IEEE Computer, vol. 34, no. 11, pp. 44–51, 2001.
[2] MATLAB, version 7.10.0 (R2010a). Natick, Massachusetts: The MathWorks Inc., 2010.
[3] G. J. Holzmann and R. Joshi, “Model-driven software verification,” in SPIN, 2004, pp. 76–91.
[4] J. Porter, P. Volgyesi, N. Kottenstette, H. Nine, G. Karsai, and J. Sztipanovits, “An experimental model-based rapid prototyping environment for high-confidence embedded software,” in IEEE International Workshop on Rapid System Prototyping, 2009, pp. 3–10.
[5] E. M. Clarke, “The birth of model checking,” in 25 Years of Model Checking, 2008, pp. 1–26.
[6] M. B. Dwyer, G. S. Avrunin, and J. C. Corbett, “Patterns in property specifications for finite-state verification,” in ICSE, 1999, pp. 411–420.
[7] D. Drusinsky, Modeling and Verification Using UML Statecharts - A Working Guide to Reactive System Design, Runtime Monitoring and Execution-Based Model Checking. Elsevier, 2006.
[8] B. Meyer, Object-Oriented Software Construction, 1st edition. Prentice-Hall, 1988.
[9] W. Visser, K. Havelund, G. P. Brat, S. Park, and F. Lerda, “Model checking programs,” Automated Software Engineering (ASE), vol. 10, no. 2, pp. 203–232, 2003.
[10] J. C. King, “Symbolic execution and program testing,” Commun. ACM, vol. 19, no. 7, pp. 385–394, 1976.
[11] G. J. Holzmann, “Trends in software verification,” in FME, 2003, pp. 40–50.
[12] “Mathworks Inc. Simulink Design Verifier,” http://www.mathworks.com/products/sldesignverifier/.
[13] “Reactive Systems, Inc.” http://www.reactive-systems.com/.
[14] R. Alur, A. Kanade, S. Ramesh, and K. C. Shashidhar, “Symbolic analysis for improving simulation coverage of Simulink/Stateflow models,” in Proceedings of the 8th ACM International Conference on Embedded Software, ser. EMSOFT ’08. New York, NY, USA: ACM, 2008, pp. 89–98.
[15] S. Sankar and M. Mandal, “Concurrent runtime monitoring of formally specified programs,” IEEE Computer, vol. 26, no. 3, pp. 32–41, 1993.
[16] E. Bodden, L. J. Hendren, and O. Lhotak, “A staged static program analysis to improve the performance of runtime monitoring,” in ECOOP, 2007, pp. 525–549.
Automatic Generation of System-Level Virtual Prototypes from Streaming Application Models

Philipp Kutzer, Jens Gladigau, Christian Haubelt, and Jürgen Teich
Hardware/Software Co-Design, Department of Computer Science
University of Erlangen-Nuremberg, Germany
Email: {philipp.kutzer, jens.gladigau, haubelt, teich}@cs.fau.de
Abstract—Virtual prototyping is a more and more accepted technology to enable early software development in the design flow of embedded systems. Since virtual prototypes are typically constructed manually, their value during design space exploration is limited. On the other hand, system synthesis approaches often start from abstract and executable models, allowing for fast design space exploration, considering only predefined design decisions. Usually, the output of these approaches is an "ad hoc" implementation, which is hard to reuse in further refinement steps. In this paper, we propose a methodology for automatic generation of heterogeneous MPSoC virtual prototypes starting with models for streaming applications. The advantage of the proposed approach lies in the fact that it is open to subsequent design steps. The applicability of the proposed approach to real-world applications is demonstrated using a Motion JPEG decoder application that is automatically refined into several virtual prototypes within seconds, correct by construction, instead of using error-prone manual refinement, which typically requires several days.
I. INTRODUCTION
Today, modern Multi-Processor System-on-Chip (MPSoC) architectures consist of a mixture of microprocessors, digital signal processors (DSPs), memory subsystems, and hardware accelerators, as well as interconnect components. It is noticeable that the adoption of programmable logic in such electronic systems is steadily increasing. Driven by this rise, the process of software development becomes the dominating part of system design. In the course of software development, software engineers have to cope with operating systems, communication stacks, drivers, and so forth. In order to allow early software development, virtual prototyping is a more and more frequently used technology in Electronic System Level (ESL) design. There, the desired target platform is modeled as an abstract, executable, and often completely functional software model. Hence, the virtual prototype includes all functional properties of the target platform, while non-functional properties, such as timing behavior, are mostly disregarded.
In contrast to FPGA-based prototyping, virtual prototypesare deployed before architectural models on register-transfer-level are available. Due to this early availability, the overalltime spent on hardware and software design can be reduced,
Supported in part by the German Science Foundation(DFG Project HA 4463/3-1)
Source
Sink
c1
c8
Parser
MComp
c7 c6
c2
c5
Recon
IDCT
c3 c4
CPU HWMemory
Bus
Fig. 1. Application model of a Motion JPEG decoder, clustered and mappedto an architecture template. The architecture template consists of a CPU, ahardware accelerator (HW) and an external memory. All the components areconnected via a bus.
because software can be implemented, refined, tested, debugged, and verified on realistic hardware models in parallel to the hardware design process. Nevertheless, additional time is needed to implement such prototypes from the functional and desired architectural system specification. This drawback can be avoided by automatic virtual prototype generation, which further speeds up the design process and, in addition, avoids the errors often made in manual prototype construction.
Describing a complex application abstractly as an actor-oriented model [1] is an increasingly accepted approach in ESL design. Such models describe the functional behavior of the application. They consist of concurrently executing actors, which communicate over abstract channels. In our approach, communication takes place via channels with FIFO semantics. Fig. 1 shows a small actor-oriented model of a Motion JPEG decoder, which consists of the actors Source, Parser, Reconstruction (Recon), Inverse Discrete Cosine Transformation (IDCT), Motion Compensation (MComp), and Sink, as well as FIFO channels c1 to c8. In order to generate a virtual prototype from an actor-oriented model, additional information about the system architecture candidates and the
978-1-4577-0660-8/11/$26.00 c© 2011 IEEE
mapping possibilities of the functional components has to be specified. In the lower part of Fig. 1, a possible mapping to an architecture template is indicated by the dotted arrows.
In the following, we present a method for the automatic generation of MPSoC virtual prototypes from actor-oriented models. Our approach performs the virtual prototype generation in two steps: (i) Based on a given resource mapping, communication within the application model is refined to transactions in the virtual prototype, and controllers for intra-resource communication are generated. (ii) The virtual prototype is generated by assembling cycle-accurate processor models, memory models, and models for hardware accelerators using bus models, and by synthesizing the software for each processor according to the given mapping.
The remainder of this paper is structured as follows: Section II reviews related work. Section III gives a brief overview of our approach. Section IV describes application modeling. Section V discusses the automatic generation of architectural TLM models in more detail, and Section VI describes the architectural refinement. Section VII presents experimental results from applying the proposed prototype generation approach to a Motion JPEG decoder, a multimedia streaming application mapped onto an MPSoC architecture. Finally, conclusions are given in Section VIII.
II. RELATED WORK
As virtual prototypes are nowadays commonly used in system-level design flows, several commercial as well as freely available tools exist to build, simulate, and evaluate such prototypes. Most prominent are Platform Architect from CoWare, CoMET from VaST, and OVPsim [2] from Imperas. The first two tools were acquired by Synopsys [3] within the last year. Most existing virtual prototyping tools support the integration of transaction-level models written in SystemC [4] into the prototypes. However, none of them allows the automatic transformation of a formal description, such as an actor-oriented model, into a virtual prototype.
In general, mapping formal models to MPSoCs is a current research topic in system synthesis (e.g., see [5], [6]). Several system-level synthesis tools exist that automatically map formally described applications to an MPSoC target, such as Daedalus [7], Koski [8], and SystemCoDesigner [9]. All these approaches pursue a common goal: final product generation. This means they have to cover the complete design flow, from a high-level application specification down to the running system. As a consequence, their integration into existing design flows is hard to establish.
In contrast to system synthesis tools, our approach targets automatic virtual prototype generation. In this scenario, important design decisions are reflected in the generated prototype, while support for further manual refinement is retained. Hence, the product quality can still be influenced by a designer and, even more important, our proposed approach
[Figure 2: Application Model and Architectural Template feed into TLM Generation, producing an Architectural Model (TLM); Prototype Generation then yields a System-Level Virtual Prototype, followed by Software Refinement. TLM Generation and Prototype Generation together form the automatic 2-step prototype generation.]
Fig. 2. Design flow from an application model, represented by an abstract executable specification, to a virtual prototype. The flow includes automatic mapping of actor-oriented models to TLM architecture models, as well as virtual prototype generation.
can easily be integrated into established industrial design flows.
III. VIRTUAL PROTOTYPE GENERATION - OVERVIEW
The goal of our system-level design approach is to automatically implement abstract system descriptions written in SystemC as virtual MPSoC prototypes. The associated design flow is depicted in Fig. 2.
At the beginning of our ESL design process, an abstract model has to be derived for the desired application. In our approach, a distinction is drawn between the application model, which describes the functional behavior of the system, and the architecture template, which represents all architecture instances of the system.
The system behavior is modeled in the form of actor-oriented models, which consist only of actors and channels, as depicted in the Motion JPEG example from Fig. 1. Actors are the communicating entities, which are executed concurrently. For communication, tokens are produced and consumed by actors and transmitted via dedicated channels.
The architecture template of the system is represented by a heterogeneous MPSoC platform, which is specified by connected cores. Single actors or clusters of actors can be mapped either onto processor elements (CPU) or onto dedicated hardware accelerators (HW), as depicted in Fig. 1. Hardware accelerators are typically used for computationally intensive or time-critical parts of the application. In general, Systems-on-Chip include both processor elements and hardware accelerators. Depending on the actor mapping,
communication channels can be mapped either to the internal memory of data processing units (CPUs or HW accelerators) or to shared memory modules. In the Motion JPEG decoder example, all channels except c1 and c8 are mapped to the hardware accelerator, as their communication takes place internally. Channels c1 and c8 represent the communication between the CPU and the dedicated accelerator, and hence have to be mapped to the shared memory.
After modeling the application and the architecture template, and after defining a mapping of functional to structural elements, an architectural model is automatically generated. In this intermediate model, the actors are clustered according to their mapping onto architectural resources. Since virtual prototypes are usually implemented using transaction-level modeling (TLM), we use the OSCI TLM-2.0 [10] standard in our design flow.
For virtual prototyping, parts of the architectural model are subsequently replaced by the corresponding resources from a virtual component library, which consists of cycle-accurate processor models as well as models of communication entities. Besides the architectural refinement, software is generated and cross-compiled for each CPU to match its instruction set architecture (ISA).
The resulting virtual prototype can then be used for further software and communication refinement. Moreover, due to the cycle-accurate processor models, performance estimation becomes possible. The steps of architectural mapping and prototype generation are described later in more detail. First, our application modeling approach is described.
IV. APPLICATION MODEL
This section introduces our concept of actor-oriented modeling, which is necessary to understand our proposed mapping approach. In actor-oriented models, actors are potentially executed concurrently and communicate over dedicated abstract channels. Thereby, they produce and consume data (so-called tokens), which are transmitted over those channels. These models may be represented as bipartite graphs consisting of channels c ∈ C and actors a ∈ A. In the following, we use the term network graph for this kind of representation.
Definition 1 (network graph): A network graph is a directed bipartite graph Gn = (A, C, P, E), containing a set of actors A, a set of channels C, and a channel parameter function P : C → N∞ × V* that associates with each channel c ∈ C its buffer size n ∈ N∞ = {1, 2, 3, ..., ∞} and a possibly empty sequence v ∈ V* of initial tokens, where V* denotes the set of all possible finite sequences of tokens v ∈ V. Additionally, the network graph contains directed edges e ∈ E ⊆ (C × A.I) ∪ (A.O × C) between actor output ports o ∈ A.O and channels, as well as between channels and actor input ports i ∈ A.I. An example of a network graph is given in the upper part of Fig. 1.
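Definition 1 maps almost directly onto a data structure. The following plain C++ sketch is illustrative only: the type and function names, the buffer sizes, and the simplified Source-Parser-Sink topology are our assumptions, not SysteMoC code. It encodes actors, channels with buffer sizes and initial tokens, and the bipartite edge relation for a fragment of the Motion JPEG model:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Illustrative network-graph representation (Definition 1).
struct Channel {
    std::string name;
    std::size_t bufferSize;            // n in N_inf
    std::vector<double> initialTokens; // v in V*
};

struct Actor {
    std::string name;
    std::vector<std::string> inputs;  // channels feeding input ports A.I
    std::vector<std::string> outputs; // channels fed by output ports A.O
};

struct NetworkGraph {
    std::map<std::string, Actor> actors;     // A
    std::map<std::string, Channel> channels; // C

    // Directed edge (o in A.O) -> channel
    void connectOut(const std::string& a, const std::string& c) {
        actors[a].outputs.push_back(c);
    }
    // Directed edge channel -> (i in A.I)
    void connectIn(const std::string& c, const std::string& a) {
        actors[a].inputs.push_back(c);
    }
};

// Build a simplified fragment of the Fig. 1 network graph
// (buffer sizes chosen arbitrarily; Parser feeds Sink directly here).
NetworkGraph buildMJpegFragment() {
    NetworkGraph g;
    for (const char* a : {"Source", "Parser", "Sink"})
        g.actors[a] = Actor{a, {}, {}};
    g.channels["c1"] = Channel{"c1", 16, {}};
    g.channels["c8"] = Channel{"c8", 16, {}};
    g.connectOut("Source", "c1");
    g.connectIn("c1", "Parser");
    g.connectOut("Parser", "c8");
    g.connectIn("c8", "Sink");
    return g;
}
```

A clustered network graph (Definition 3) would add a tree over the keys of `actors` and `channels`.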
[Figure 3: actor Sorter with input port i1, output ports o1 and o2; guard gcheck (return i1[0] >= 0), actions fpositive (o1[0] = i1[0]) and fnegative (o2[0] = i1[0]); a single FSM state start with self-loop transitions i1(1)&gcheck&o1(1)/fpositive and i1(1)&¬gcheck&o2(1)/fnegative]
Fig. 3. Visual representation of an actor that sorts input data according to its algebraic sign. The actor consists of one input port i1 and two output ports o1 and o2.
Definition 2 (channel): A channel is a tuple c = (I, O, n, d) containing channel ports partitioned into a set of channel input ports I and a set of channel output ports O, its buffer size n ∈ N∞ = {1, 2, 3, ..., ∞}, and a possibly empty sequence d ∈ D* of initial tokens, where D* denotes the set of all possible finite sequences of tokens d ∈ D. In the following, we use SysteMoC [11], a SystemC [4] based library for modeling and simulating actor-oriented models. In the basic SysteMoC model, each channel is a unidirectional point-to-point connection between an actor output port and an actor input port, i.e., |c.I| = |c.O| = 1. The communication between actors is restricted to these abstract channels, i.e., actors are only permitted to communicate with each other via channels to which they are connected by ports.
In a SysteMoC actor, the communication behavior is separated from the functionality. The communication behavior is defined as a finite state machine (FSM); the functionality is a collection of functions that can access data on channels via ports. These functions are classified into actions and guards and are driven by the FSM. SysteMoC thus follows the FunState [12] (Functions driven by State machines) approach.
An action of an actor can access data on all channels to which the actor is connected and may manipulate the internal state of the actor, implemented by internal variables. In contrast, a guard function is only allowed to query, but not to alter, the internal state and the data on channels. A graphical representation of a SysteMoC actor is given in Fig. 3. The actor Sorter, which sorts input data tokens according to their algebraic sign, possesses one input port (i1) and two output ports (o1 and o2). Tokens from input port i1 are forwarded to output port o1 by the function fpositive if the activation pattern i1(1)&gcheck&o1(1) of the state transition from the state start back to the state start evaluates to true. This pattern determines under which conditions the transition may be taken. In SysteMoC, an activation pattern can depend on
class Sorter : public smoc_actor {
public:
  smoc_port_in<double>  i1;
  smoc_port_out<double> o1;
  smoc_port_out<double> o2;
  smoc_firing_state start;
  Sorter(sc_module_name name) : smoc_actor(name, start) {
    start =
        (i1(1) && GUARD(check) && o1(1)) >>
        CALL(positive) >> start
      |
        (i1(1) && !GUARD(check) && o2(1)) >>
        CALL(negative) >> start;
  }
private:
  bool check(void) const {
    return i1[0] >= 0;
  }
  void positive(void) {
    double in = i1[0];
    o1[0] = in;
  }
  void negative(void) {
    double in = i1[0];
    o2[0] = in;
  }
};
Listing 1. SysteMoC code for the actor Sorter. The FSM of the actor is defined in the constructor of the actor class, whereas the functionality is encoded as private member functions.
the internal state of the actor, on the availability and values of tokens on input channels, and on the availability of free space on output channels. In our example, the state transition is taken if at least one token is available on input port i1 (i1(1)), the guard gcheck evaluates to true (the data on the input channel has a positive algebraic sign), and output port o1 has space for at least one additional token (o1(1)). Analogously, the second transition is taken if the input data is negative. The corresponding SysteMoC code is given in Listing 1.
To summarize, the transition-based execution of SysteMoC actors can be divided into four steps: (i) evaluation of all activation patterns k of all outgoing state transitions in the current state qc ∈ Q; (ii) non-deterministic selection and taking of one activated transition t ∈ T; (iii) execution of the corresponding action f ∈ a.F; (iv) notification of token consumption/production on channels connected to the corresponding actor input and output ports after completion of the action, as well as transition to the next state.
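These four steps can be mimicked in a few lines of plain C++. The sketch below is a deliberately simplified stand-in for the SysteMoC runtime, not its real API (all names are ours): activation patterns become plain predicates, and one `step()` evaluates, selects, fires, and advances the state.

```cpp
#include <cassert>
#include <cstdlib>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Bounded FIFO channel with availability checks (simplified).
struct Fifo {
    std::deque<double> tokens;
    std::size_t capacity = 4;
    bool canRead(std::size_t n) const { return tokens.size() >= n; }
    bool canWrite(std::size_t n) const { return tokens.size() + n <= capacity; }
};

struct Transition {
    std::function<bool()> pattern; // step (i): activation pattern k
    std::function<void()> action;  // step (iii): action f
    std::string next;              // step (iv): next state
};

struct Actor {
    std::string state;
    std::vector<std::pair<std::string, Transition>> fsm; // (state, transition)

    // One execution step: evaluate, select, fire, advance.
    bool step() {
        std::vector<const Transition*> enabled;
        for (auto& t : fsm)                               // (i) evaluate
            if (t.first == state && t.second.pattern())
                enabled.push_back(&t.second);
        if (enabled.empty()) return false;
        const Transition* t =
            enabled[std::rand() % enabled.size()];        // (ii) select
        t->action();                                      // (iii) execute
        state = t->next;                                  // (iv) advance
        return true;
    }
};

// Demo: a Sorter-like actor routing tokens by algebraic sign (cf. Fig. 3).
std::pair<std::vector<double>, std::vector<double>>
sortDemo(const std::vector<double>& in) {
    Fifo i1, o1, o2;
    i1.capacity = o1.capacity = o2.capacity = in.size() + 1;
    i1.tokens.assign(in.begin(), in.end());

    Actor sorter{"start", {}};
    sorter.fsm.push_back({"start", Transition{
        [&]{ return i1.canRead(1) && i1.tokens.front() >= 0 && o1.canWrite(1); },
        [&]{ o1.tokens.push_back(i1.tokens.front()); i1.tokens.pop_front(); },
        "start"}});
    sorter.fsm.push_back({"start", Transition{
        [&]{ return i1.canRead(1) && i1.tokens.front() < 0 && o2.canWrite(1); },
        [&]{ o2.tokens.push_back(i1.tokens.front()); i1.tokens.pop_front(); },
        "start"}});

    while (sorter.step()) {}
    return {{o1.tokens.begin(), o1.tokens.end()},
            {o2.tokens.begin(), o2.tokens.end()}};
}
```

For the Sorter, at most one transition is enabled per step, so the non-deterministic selection in step (ii) does not affect the result.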
During system synthesis from actor-oriented models, the actors a ∈ A and the communication channels c ∈ C are mapped to components of a system architecture. To reflect the architectural structure in network graphs after mapping, nodes can be clustered. To represent clustering, we define a clustered network graph.
Definition 3 (clustered network graph): A clustered network graph Gcn = (Gn, T) consists of a network graph Gn and a rooted tree T such that the leaves of T are exactly the vertices of Gn. Each node x of T represents a cluster X(x) of the vertices of the network graph Gn that are leaves of the subtree rooted at x.

[Figure 4: rooted tree with root x4 and child nodes x1, x2, x3]

Fig. 4. Clustered network graph of the Motion JPEG example. The cluster X(x1) represents the CPU, X(x2) the communication bus, and X(x3) the hardware accelerator. Cluster X(x4) represents the whole system.

The representation as a tree illustrates the hierarchical structure of the system. This means the root of T represents the whole system, whereas nodes x ∈ T with height(x) = 1 represent the components of the system. As reuse of parts of models is common in the design process, hierarchical structures with more than two levels are possible. The clustered network graph of the example from Fig. 1 is depicted in Fig. 4.
Although we use SysteMoC, our approach is not restricted to this framework and can be adapted to other frameworks for actor-oriented design, e.g., pure SystemC FIFO channel communication. A deeper insight into SysteMoC is given in [11].
V. GENERATING THE TLM ARCHITECTURE
Transaction-level modeling (TLM) with SystemC has emerged as the de facto industry standard for virtual prototyping and architectural modeling [13], [14]. These models are characterized by an encapsulation of low-level communication details. Due to this abstraction, very fast simulation speed can be achieved: details of bus-based communication protocol signaling are replaced with single transactions. In the course of releasing a TLM standard (OSCI TLM-2.0) to enforce interoperability of models, the Open SystemC Initiative defined two coding styles [15]: the loosely-timed (LT) and the approximately-timed (AT) coding style. The loosely-timed coding style allows only two timing points to be associated with each transaction, namely its start and its end. This timing granularity of communication is sufficient for software development using a virtual prototype model of an MPSoC. A transaction in an approximately-timed model is broken down into multiple phases, with timing points marking the transition between two consecutive phases. Due to the finer timing granularity, approximately-timed models are typically used in architectural exploration and performance analysis. As our approach targets software development, or more precisely the refinement of
parts of the application in software by means of virtual prototyping, the loosely-timed coding style is adequate [15].
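For illustration, the loosely-timed style can be reduced to its essence: a single blocking call per transaction whose delay argument accumulates the target's latency between the two timing points. The sketch below deliberately avoids the real SystemC/TLM-2.0 types; `Payload` and `Memory` are our simplified stand-ins for `tlm_generic_payload` and a `b_transport` target, and the 10 ns latency is an arbitrary assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Simplified mimic of a loosely-timed TLM-2.0 transaction (not the
// real OSCI API): the whole transfer happens in one blocking call.
struct Payload {
    enum Cmd { READ, WRITE } cmd;
    std::uint64_t address;
    std::vector<std::uint8_t> data;
};

struct Memory {
    std::vector<std::uint8_t> mem = std::vector<std::uint8_t>(1024, 0);
    unsigned latencyNs = 10; // assumed per-access latency

    // Analogous to b_transport(trans, delay): executes the transaction
    // and annotates the target latency onto the caller's delay.
    void b_transport(Payload& p, unsigned& delayNs) {
        if (p.cmd == Payload::WRITE)
            std::memcpy(&mem[p.address], p.data.data(), p.data.size());
        else
            std::memcpy(p.data.data(), &mem[p.address], p.data.size());
        delayNs += latencyNs;
    }
};
```

An AT model would instead split each access into request/response phases, each with its own timing point.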
As described, actors a ∈ A and communication channels c ∈ C are partitioned into clusters X(x) and mapped to components of a system architecture. Due to the mapping, channel communication can either be internal, in case both communicating actors aa and ab are mapped onto the same resource (aa ∈ X(xy) and ab ∈ X(xy)), or external, in case the communication crosses cluster boundaries (aa ∈ X(xy) and ab ∉ X(xy)). For intra-resource communication, FIFOs can be placed in the private memory of the architectural component, whereas FIFOs for inter-resource communication, such as c1 and c8 from Fig. 1, have to be placed in an external memory model. Either way, the actor communication semantics through ports are not altered, in order to reuse the existing actors written in SystemC-based SysteMoC. So, the challenge of this step in the design flow is to map the FIFO-based communication via dedicated channels to a memory-mapped, bus-based communication with global and local shared memory. Since our abstract communication semantics (read, write, commit) calls for uniform channel access, access transparency has to be ensured after mapping to the architectural template, resulting in the transaction-level architectural model. As communicating actors on different resources are executed concurrently, simultaneous access to FIFO storage has to be avoided. This means that memory coherence as well as cache coherence has to be guaranteed. To cope with actor clustering and to ensure synchronized channel access, independent of the communication mapping, we use aggregators and adapters that implement a suitable communication protocol [16]. Adapters, by which the SysteMoC ports (i ∈ A.I and o ∈ A.O) are substituted, serve as links between the actors and the transaction level. Due to the fact that more than one actor can be mapped to one resource, and actors can possess multiple ports, an aggregator is needed for each transaction-level component (X(xi) : height(xi) = 1) to encapsulate the desired number of adapters.
These aggregators perform transaction-level communication and implement the interface of the component to the rest of the architectural model. There is no need to connect adapters for internal channels to the aggregator, because no communication will take place across component boundaries. In Fig. 1, the communication between the actors Parser, Recon, MComp, and IDCT is internal and can be implemented using, e.g., internal memory. In our approach, we use a transaction-level memory model for each communication channel. In the following, we describe the functionality of adapters and aggregators in more detail.
A. Adapter
An adapter translates between transactions in the virtual prototype and the asynchronous FIFO channel communication used in the application model. Hence, the communication adapter implements two different interfaces. The interface towards the actor is equivalent to the abstract channel, which has to be
[Figure 5: actors Parser and Recon connected through adapters Out c2 and In c2; the virtual channel c2 is backed by a TLM memory model]
Fig. 5. Mapping of parts of cluster X(x3) from the model depicted in Fig. 1 to an architectural component. The internal communication takes place over a virtual channel, which substitutes the abstract channel. Adapters translate between the abstract model and the transaction-level model. The FIFO queue semantics are implemented using a TLM memory model.
replaced. To sustain the abstract communication semantics, the adapter needs to access tokens in a random-access manner and to commit completed transitions via this interface. Therefore, a conversion of the token data type (e.g., serialization and deserialization) has to be performed in the adapters. An adapter also has to respect the abstract channel synchronization mechanism. This means the adapter has to provide an interface through which it can be notified when tokens on the channel are produced or consumed, respectively. This notification can be used to trigger the corresponding actor waiting for free space or tokens on the channel.
The transaction-level interface consists of three transaction-level communication sockets (see Fig. 5). One is used for data transmission: the actor connected to the adapter can read or write data from a memory through this socket. The other two sockets are needed to sustain the channel synchronization. For synchronization, the adapters communicate among each other over arbitrary TLM communication resources. Therefore, a dedicated address has to be assigned to each adapter.
Due to the fact that SysteMoC channels possess memory, the FIFO storages have to be mapped to resources. As different locations are possible, we allocate the storage in a memory to which the adapters are connected. For internal communication, the sockets of the adapters can be directly coupled with each other, as depicted in Fig. 5. The synchronization sockets of the two communicating adapters are directly coupled, whereas the data sockets are connected with the memory.
The memory for external communication is accessible over a bus system to which the aggregator is connected (see Fig. 6). Allocating the storage in one adapter, or splitting and distributing it over both communicating adapters, is also possible. Independent of the chosen implementation and mapping, each adapter needs to know to which address space its buffer is mapped, in order to read or write tokens.
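A minimal sketch of such an adapter, in plain C++, is shown below. The memory layout (head and tail counters stored in front of the token slots) and all names are our assumptions for illustration; the paper's actual protocol is the one described in [16].

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Word-addressable stand-in for a TLM memory model.
struct SharedMemory {
    std::vector<std::uint32_t> words = std::vector<std::uint32_t>(64, 0);
    std::uint32_t read(std::uint64_t addr) const { return words[addr]; }
    void write(std::uint64_t addr, std::uint32_t v) { words[addr] = v; }
};

// Illustrative adapter: maps abstract FIFO access (read, write, commit)
// onto a memory-mapped buffer. Assumed layout at 'base':
//   [head counter, tail counter, slot 0 .. slot cap-1]
class FifoAdapter {
public:
    FifoAdapter(SharedMemory& m, std::uint64_t base, std::uint32_t capacity)
        : mem(m), base(base), cap(capacity) {}

    // Random access to the i-th available token (no consumption yet).
    std::uint32_t peek(std::uint32_t i) const {
        std::uint32_t head = mem.read(base);
        return mem.read(base + 2 + (head + i) % cap);
    }
    bool canRead(std::uint32_t n) const { return fill() >= n; }
    bool canWrite(std::uint32_t n) const { return cap - fill() >= n; }

    void push(std::uint32_t v) {              // write one token
        std::uint32_t tail = mem.read(base + 1);
        mem.write(base + 2 + tail % cap, v);
        mem.write(base + 1, tail + 1);        // commit production
    }
    void commitRead(std::uint32_t n) {        // commit consumption
        mem.write(base, mem.read(base) + n);
    }
private:
    std::uint32_t fill() const { return mem.read(base + 1) - mem.read(base); }
    SharedMemory& mem;
    std::uint64_t base;
    std::uint32_t cap;
};
```

Two adapters sharing the same base address then see one consistent FIFO; in the real design, the event-based synchronization sockets would additionally notify the peer on each commit.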
[Figure 6: clusters X(x1) and X(x3) with adapters Out c8/In c1 and In c8/Out c1 behind their aggregators; both aggregators are connected to a TLM bus model, which also connects a TLM memory model]
Fig. 6. Mapping of the cross-component communication between HW and CPU from Fig. 1. For the sake of clarity, the internal communication structure is omitted.
B. Aggregator
As real computational resources such as CPUs or DSPs have a limited number of connection pins, each node x ∈ T besides the root node needs a mechanism that aggregates the children connected to x. For nodes that represent data-transferring units, such as buses (x2), this is done by arbitration and address translation. Unlike the communication resources (data-transferring units), the computational resources (data-processing units) need an aggregator for this purpose. The aggregators contain TLM ports to perform transaction-level cross-component communication, and they implement the communication protocol for the connected adapters at the transaction level. For communication, the aggregators communicate among each other over arbitrary TLM communication resources. For this purpose, each aggregator is assigned a dedicated address range, whose size depends on the number of adapters registered with the aggregator. Each adapter is thus assigned a single address at which it is accessible for event-based synchronization. Besides its own address range, each aggregator has to know the addresses of the peer adapters, which are associated with its registered adapters, and the addresses of the corresponding FIFOs in memory.
VI. VIRTUAL PROTOTYPE GENERATION
In the final step of our automatic design flow, a virtual prototype is generated based on the transaction-level architectural model.
A. Architectural Refinement
In order to allow for early software development, parts of the architecture have to be substituted by virtual component models. In our approach, all resources except the hardware accelerators are replaced. As our approach focuses on software
TABLE I
MEASUREMENT TERMS OF THE DIFFERENT VIRTUAL PROTOTYPES.

VP   Instructions   Simulation           VP Performance
                    Host [s]   VP [ms]   CPI    MIPS
I    4944835683     1997       44285     1.79   111.66
II   5319738192     521        30494     1.15   174.45
III  5726625319     1791       29222     1.02   195.97
IV   5765601708     660        26993     0.94   213.59
V    6188808202     1760       7224      0.23   856.66
VI   3492102237     550        30870     1.77   113.12
development, the inserted processor models must provide an instruction set simulator in order to simulate, and furthermore debug, the software running on the models. We therefore use a commercial virtual component library [3], which provides the opportunity to integrate TLM. This feature is necessary to couple the hardware accelerators with the virtual components. In order to sustain the abstract channel synchronization mechanism, an interrupt controller is added for each processor element. Through this controller, the processor element can be informed about channel data modification by another processor or a hardware accelerator.
B. Target Software Generation
During target software generation, the actor description in SystemC is transformed into standard C/C++ code. The communication ports of the actor are replaced by pointers to FIFO interfaces, and the finite state machine is encoded as a switch-case statement. The FIFO interfaces represent the communication interface equivalent to the TLM communication adapters described in Section V. Moreover, scheduling strategies have to be implemented in case multiple actors are mapped onto the same processor element.
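What such generated code could look like for the Sorter actor from Listing 1 is sketched below. This is a hypothetical output: the FIFO interface and the structure of the generated class are our illustration, not the tool's actual generated code.

```cpp
#include <cassert>
#include <deque>

// Stand-in for the FIFO interface the generated code is linked against.
struct Fifo {
    std::deque<double> q;
    bool available(int n) const { return (int)q.size() >= n; }
    bool space(int) const { return true; } // unbounded, for simplicity
    double get() { double v = q.front(); q.pop_front(); return v; }
    void put(double v) { q.push_back(v); }
};

enum State { START };

// Hypothetical generated code: ports have become pointers to FIFO
// interfaces, the FSM of Listing 1 has become a switch-case.
struct SorterGenerated {
    Fifo *i1, *o1, *o2;
    State state = START;

    // One scheduler invocation on this actor; returns true if it fired.
    bool fire() {
        switch (state) {
        case START:
            if (i1->available(1) && i1->q.front() >= 0 && o1->space(1)) {
                o1->put(i1->get());      // action f_positive
                state = START;
                return true;
            }
            if (i1->available(1) && i1->q.front() < 0 && o2->space(1)) {
                o2->put(i1->get());      // action f_negative
                state = START;
                return true;
            }
            return false;
        }
        return false;
    }
};
```

A round-robin scheduler on a processor element would simply call `fire()` on each mapped actor until none can fire.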
VII. EXPERIMENTAL RESULTS
In order to show the applicability of our approach, we present our first results on generating virtual prototypes from an actor-oriented Motion JPEG model. We use a more fine-grained model than the one given in Fig. 1, consisting of 19 actors interconnected by a total of 56 FIFO channels. Table I presents the results of several test cases with different mappings. Since the architecture template contains 19 processors, 19 hardware accelerators, and a shared memory, all connected by a bus, many architecture instances exist. With our approach, it is possible to generate virtual prototypes for all of them. To show the applicability, we consider only a few mappings serving as representatives.
Our first prototype (I) consists of a single processor (ARM926), onto which all actors are mapped. For the next two test cases, two processors are allocated and connected via a bus. For this architecture instance, two mappings are tested: (i) the IDCT actors are mapped to one processor and all remaining actors to the other (II); (ii) the actors are mapped to the processors alternately, i.e., the
[Figure 7: bar chart of generation and compilation times in seconds (0 to 60) for prototypes I to VI]
Fig. 7. Times measured for generation and compilation of the different configurations.
neighbor of each actor in the decoding pipeline is mapped to a different processor than the actor itself (III). For the FIFO communication between the two processors, a memory is additionally allocated and connected to the bus. In the fourth prototype (IV), three processors and a memory are allocated. Here, the actors Source and Sink are clustered onto one processor, IDCT is mapped to the second, and the remaining actors are mapped to the third. To take full advantage of pipelined execution, 19 processors are allocated in the fifth prototype (V). In the last test case (VI), one processor and one hardware accelerator are allocated. This test case is analogous to the second prototype, except that the functionality of the IDCT actors is moved to the hardware accelerator.
Figure 7 shows the time needed for prototype generation and compilation. It can be seen that the time spent on prototype generation is nearly independent of the mapping, whereas the compilation time depends on the components of the prototype. On the one hand, the more processors are allocated, the more time is needed for compiling. On the other hand, the code for the transaction-level hardware accelerators is more complex than the code running on processors, so compiling hardware accelerators takes more time. In summary, all virtual prototypes have been generated within seconds instead of hours. In the following, five measurement terms are evaluated for decoding 10 images (176x144): total instructions executed, cycles per instruction (CPI), million instructions per second (MIPS), simulation time (host time), and simulated time. In order to make a statement about system performance, not simulator performance, the terms CPI and MIPS relate to the simulated time. The corresponding values are given in Table I.
It can be seen that the performance of the prototypes behaves as expected. The more processors are allocated, the better the pipeline of the decoder can be exploited. This means fewer cycles are needed per instruction, which results in a higher MIPS rate and a lower CPI. The small difference between II and III stems from a better workload distribution.
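The CPI and MIPS columns of Table I can be reproduced from the instruction counts and the simulated time, assuming a 200 MHz processor clock (our inference from the table; the paper does not state the clock frequency): CPI = f_clk * t_sim / instructions and MIPS = instructions / t_sim / 10^6.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Perf { double cpi; double mips; };

// Recompute CPI and MIPS with respect to simulated time.
// The 200 MHz default clock is an assumption inferred from Table I.
Perf evaluate(std::uint64_t instructions, double simulatedMs,
              double clockHz = 200e6) {
    double t = simulatedMs / 1000.0;                 // simulated seconds
    return { clockHz * t / (double)instructions,     // cycles per instruction
             (double)instructions / t / 1e6 };       // million instr. per second
}
```

For prototype I (4944835683 instructions, 44285 ms simulated), this yields CPI ≈ 1.79 and MIPS ≈ 111.66, matching the table.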
As different developer teams implement different parts of the application, it is often unnecessary to refine all components of the TLM architectural model to virtual processor models. Prototype VI shows that there is no appreciable difference in simulated and host time compared to the completely refined model (II).
VIII. CONCLUSION
In this paper, we have presented a two-step methodology for automatically generating virtual system-level prototypes from an abstract system specification. Our main goal was to provide a methodology that removes the dependency on hardware availability, needed for software development, in an early phase of a design flow that starts with an abstract and executable application model. For this purpose, design decisions are first represented in SystemC TLM, which is typically supported by all commercial virtual prototyping tools. Second, the TLM generation is used to assemble the virtual prototype and generate the embedded software. To show the applicability of our approach to real-world applications, we presented first simulation results for an actor-oriented Motion JPEG model.
REFERENCES
[1] E. A. Lee, "Overview of the Ptolemy Project, Technical Memorandum No. UCB/ERL M03/25," Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA, USA, Tech. Rep., Jul. 2004.
[2] OVPworld, http://www.ovpworld.org.
[3] Synopsys, http://www.synopsys.com.
[4] T. Grötker, S. Liao, G. Martin, and S. Swan, System Design with SystemC. Norwell, MA, USA: Kluwer Academic Publishers, 2002.
[5] O. Moreira, F. Valente, and M. Bekooij, "Scheduling multiple independent hard-real-time jobs on a heterogeneous multiprocessor," in Proceedings of EMSOFT, 2007, pp. 57–66.
[6] P. K. F. Hölzenspies, J. L. Hurink, J. Kuper, and G. J. M. Smit, "Run-time spatial mapping of streaming applications to a heterogeneous multiprocessor system-on-chip (MPSoC)," in Proceedings of DATE, 2008, pp. 212–217.
[7] M. Thompson, T. Stefanov, H. Nikolov, A. D. Pimentel, C. Erbas, S. Polstra, and E. F. Deprettere, "A framework for rapid system-level exploration, synthesis, and programming of multimedia MP-SoCs," in Proceedings of CODES+ISSS, 2007, pp. 9–14.
[8] T. Kangas et al., "UML-based multi-processor SoC design framework," ACM TECS, vol. 5, no. 2, pp. 281–320, May 2006.
[9] J. Keinert, M. Streubühr, T. Schlichter, J. Falk, J. Gladigau, C. Haubelt, J. Teich, and M. Meredith, "SystemCoDesigner - An Automatic ESL Synthesis Approach by Design Space Exploration and Behavioral Synthesis for Streaming Applications," TODAES, vol. 14, no. 1, pp. 1–23, 2009.
[10] Open SystemC Initiative (OSCI), "OSCI SystemC TLM 2.0," http://www.systemc.org/downloads/standards/tlm20/.
[11] J. Falk, C. Haubelt, and J. Teich, "Efficient representation and simulation of model-based designs in SystemC," in Proceedings of FDL, Sep. 2006, pp. 129–134.
[12] L. Thiele, K. Strehl, D. Ziegenbein, R. Ernst, and J. Teich, "FunState - an internal design representation for codesign," in Proceedings of ICCAD. Piscataway, NJ, USA: IEEE Press, 1999, pp. 558–565.
[13] F. Ghenassia, Transaction-Level Modeling with SystemC. Dordrecht: Springer, 2005.
[14] B. Bailey and G. Martin, ESL Models and their Application. Dordrecht: Springer, 2010.
[15] OSCI TLM-2.0 User Manual, Open SystemC Initiative, Jun. 2008.
[16] J. Gladigau, C. Haubelt, B. Niemann, and J. Teich, "Mapping actor-oriented models to TLM architectures," in Proceedings of the Forum on Specification and Design Languages (FDL), Barcelona, Spain, Sep. 2007, pp. 128–133.
134
An Automated Approach to SystemC/Simulink Co-Simulation

F. Mendoza and C. Köllner
FZI Research Center for Information Technology
Dept. of Embedded Systems and Sensors Engineering (ESS)
Haid-und-Neu-Str. 10-14, D-76131 Karlsruhe, Germany
Email: {mendoza|koellner}@fzi.de

J. Becker and K. D. Müller-Glaser
Institute for Information Processing Technology
Karlsruhe Institute of Technology, Karlsruhe, Germany
Email: {becker|klaus.mueller-glaser}@kit.edu
Abstract—We present a co-simulation framework which enables rapid elaboration, architectural exploration and verification of virtual platforms made up of SystemC and Simulink components. We exploit the benefits of Simulink's graphical environment and simulation engine to instantiate, parametrize and bind SystemC modules which reside inside single or multiple component servers. Any set of SystemC module implementations can be easily added to a component server controlled by Simulink through a set of well-defined interfaces and simulation synchronization schemes. The complexity of our approach is hidden by the automated framework, which enables a designer to focus on the creation and verification of SystemC models and not on the intricate low-level aspects of the co-simulation between different simulation engines.
I. INTRODUCTION
The increasing complexity of embedded systems has constantly triggered the creation of tools and methodologies that can aid in the different stages of their design and verification. Traditional approaches to the design of embedded systems are based on common practices, such as the creation of specifications and modeling guidelines, simulation of key concepts and algorithms, and the implementation of hardware prototypes. Though widespread, such approaches are not well suited to complex embedded systems, especially when it comes to the implementation of hardware prototypes, where up to 70% of a project's design time is invested in costly functional verification and redesign cycles [1]. There is an evident need for new approaches that can improve design efficiency and the quality of embedded systems.
In recent years, System Level Design (SLD) methodologies have gained popularity in the electronic design automation market. SLD was created to cope with the increasing complexity of embedded systems and to enhance the productivity of designers. Regardless of the definition given by each author, the goals of SLD are to enable new levels of design and reuse through higher levels of modeling abstraction, and to enable HW and SW co-design [2].
The motivation of our work is to incorporate SLD methodologies into the development flow of embedded systems. In the automotive and industrial automation fields, for example, Simulink is the most widely accepted simulation and model-driven prototyping tool for continuous- and discrete-time dataflow designs. It is here that the functionality of algorithms is tested and where SLD methodologies can be seamlessly integrated. By adding SLD support to Simulink we can enable rapid architectural exploration in early stages of a design. Our approach uses the right tool for the right job: Simulink for the creation of functional models and test benches, and SystemC for the creation of system-level models of hardware implementation solutions. A designer is able to investigate different architectural partitions of a design that can be tested along with sensors/actuators, controllers, and embedded software. This provides a better understanding of the functionality and interactions between the different components of a system. The acquired knowledge can then be used for the selection of an appropriate hardware prototype implementation, whose functionality can later be verified against the available simulation results.
Our work uses S-Functions developed in C++ as the common principle to extend Simulink's functionality. An S-Function is essentially the source code that describes the behavior of a user-defined Simulink block. S-Functions have access to Simulink's simulation engine through a set of defined function calls. Using these function calls and the expressive power of C++, we are able to instantiate, connect, parameterize and simulate SystemC models inside Simulink. This automated approach to the co-simulation of SystemC and Simulink is explained in detail in this paper. Additionally, we present its application to the verification of a DSP algorithm.
The challenges involved in the time synchronization between the simulation models of Simulink and SystemC are discussed. Simulink uses a continuous-time simulation model, while SystemC uses a discrete-time, event-driven simulation model. In a continuous simulation model, time is discretized into fixed or variable time steps, also called integration steps, according to the numerical solver used by the simulation engine. In a discrete-time, event-driven simulation model, time steps are inherently variable and are calculated according to events scheduled in a queue. Delta cycles are used to update all processes running concurrently in the same time step. Only when the event queue for that time step is empty can the time be updated to the next scheduled event.

978-1-4577-0660-8/11/$26.00 © 2011 IEEE
II. RELATED WORK
A list of available commercial mixed-language simulation tools for the creation of virtual platforms is presented below. SystemVision [3] from Mentor Graphics enables the interaction of SPICE, C/C++, SystemC and Verilog-AMS for creating simulations of analog and digital components. System Generator [4] from Xilinx provides a library of their DSP IP blocks translated into Simulink components. It enables co-simulation of these IP DSP blocks with standard Simulink components, with the advantage of being synthesizable into Xilinx FPGAs. In the System-on-Chip area, tools for the simulation of multi-processor systems are common, for example VaST from Synopsys and Seamless from Mentor Graphics.
A common feature found in mixed-language simulation tools is the use of simulation wrappers to adapt and connect different abstraction levels and simulation models. The use of wrappers is commonly found in the simulation of multiprocessor systems, such as [5] and [6], where processor models are wrapped and connected to a SystemC backplane. Further formal approaches for the generation of simulation wrappers are presented in [7] and [8].
The mixed-language simulation approach we focus on is the co-simulation between Simulink continuous models and SystemC discrete models. A systematic analysis of continuous and discrete simulation models along with their respective triggering mechanisms is presented in [9]. We have classified the available co-simulation approaches into two variants, according to the simulation engine that takes control of the whole simulation. In the first case, where SystemC takes control of the simulation, two synchronization schemes can be identified. A basic, though effective, synchronization scheme is presented in [1] and [10], where a SystemC application synchronizes with a Simulink model at fixed time intervals. A more efficient approach based on SystemC's event-driven scheduling mechanism is presented in [9]. The authors use SystemC's event queue to determine the required synchronization points with Simulink. They additionally include the possibility for a Simulink model to trigger additional synchronization points. A continuation of their research is given in [8], where better-defined interfaces between Simulink and SystemC are presented. In the second approach, where Simulink takes control of the simulation, it is possible to synchronize with one or more SystemC kernels via user-defined S-Functions triggered at fixed or variable sampling times. The authors in [11] use a synchronization scheme based on variable sampling times extracted from SystemC's event queue; however, no technical details are given on its implementation. Our work takes a similar approach, where the time synchronization scheme is controlled by Simulink and allows for a truly event-based simulation inside the SystemC sub-system. Each SystemC module instance corresponds to a Simulink block with appropriate input and output connectors. Therefore, a SystemC sub-system can consist of any number of arbitrarily interconnected module instances.
Our contribution differs from the above in that we exploit the benefits of Simulink's graphical interface to instantiate, parameterize and bind SystemC modules which reside inside single or multiple component servers. The benefit of our approach is increased usability. A hardware designer is able to rearrange the overall system structure of a virtual platform in order to explore several design aspects and realization alternatives. We thus simplify system composition and simulation control without the need to manually edit SystemC source code. Furthermore, the proposed approach enables the designer to dynamically create as many instances of SystemC modules as desired (including multiple instances of the same module) inside a Simulink model, and to interconnect them with other SystemC module instances and native Simulink blocks.
III. THE CO-SIMULATION INTEGRATION FRAMEWORK
Figure 1. Overview showing the design flow for creating a component server.
A. Overview
Figure 1 shows an overall view of our co-simulation integration framework. During a design entry step, the developer defines and creates the SystemC modules which will be available in the repository of a component server. These modules are then compiled along with a SystemC kernel and infrastructure code in order to build a component server and client application. The component server allows for the dynamic instantiation and interconnection of the SystemC modules inside its repository. The SystemC kernel built inside the component server is able to execute a dynamically created SystemC model. With the help of a set of well-defined interfaces and synchronization schemes, a client encapsulates the functionality to connect to the component server and control the data exchange.
Three variants of component servers with their respective clients are available, differing only in the middleware used to connect them. This gives the designer the liberty of deciding where a component server will be located: running inside the Simulink process, locally in another process, or on an external server connected via network. In the first variant, the SystemC kernel and the Simulink solver engine are both executed in the same process, using the same address space. This approach has the advantage of high simulation performance, but has an impact on robustness. If a software bug inside a module implementation crashes the component server, it will also crash the Simulink environment. Furthermore, debugging the design is tedious compared to debugging a standalone application. In the second and third variants, the component server and client are executed as different processes communicating through shared memory or TCP/IP inter-process communication (IPC). Both variants provide better isolation between processes and a convenient way to co-simulate Simulink with one or more SystemC kernels running concurrently.
B. Usage
Our approach separates the task of SystemC module development from the complex code infrastructure required to interface SystemC and Simulink. A SystemC module designer is provided with a small set of preprocessor macros which, when inserted inside a module class declaration, automatically register that class with a component repository. All the designer has to do is compile the module implementations along with the SystemC library and our infrastructure library. Build parameters determine which of the three variants (see Figure 1) will be created.
The component server is displayed in Simulink as an S-Function block. All modules present in the server's repository can be instantiated in arbitrary quantities using such component server blocks (which all refer to the same S-Function). Parameters specify which module to create and, if necessary, its constructor arguments. Each block automatically adopts the interface of the underlying SystemC module instance in terms of input and output ports. For enhanced usability, the user can create a Simulink block library which hides the details of the S-Function parameterization.
IV. IMPLEMENTATION
A. Infrastructure/Component Server
Figure 2 shows the class diagram of the infrastructure code. Throughout this subsection, we will focus on the key concepts required to achieve the level of integration and usability described in Section III-B.

Figure 2. Class diagram of the component server infrastructure.
1) Structural Analysis: By structural analysis we understand the process of determining the interface of a SystemC module. This includes the set of its ports along with their names and type information. The interface description is required by both the diagram editor and the solver engine. In the first case, it is needed to present a meaningful graphical representation of the module and to check the type compatibility of diagram connections. In the second case, it is required to prepare the data structures needed to run the simulation.
Approaches that try to reveal the structure of SystemC models by source code analysis are presented in [12] and [13]. These approaches require sticking to certain coding standards. PINAPA [14] is a hybrid approach where the elaboration phase of a SystemC model is virtually executed in order to determine module hierarchy and interface descriptions. We chose a simpler approach where the analysis is done at runtime. The SystemC base class sc_object implements two methods, get_child_objects and kind, which let the user enumerate dependent objects such as ports and processes. We store the processed information in instances of ModuleInstanceDescriptor and PortInstanceDescriptor. SystemC provides four basic port types, which are described using the EKind enumeration (Table I).

Table I
THE EKIND ENUMERATION

Literal   SystemC port type    Description
DataIn    sc_in<DataType>      Inbound dataflow
DataOut   sc_out<DataType>     Outbound dataflow
Import    sc_import<IfType>    Required interface
Export    sc_export<IfType>    Provided interface

Usually, it is the compiler's responsibility to apply type checking to a SystemC model to ensure that all module interconnections are correct. For example, ports of the data types sc_in<X> and sc_out<Y> may only be bound to an sc_signal<Z> if the types X, Y and Z match. Our framework allows for instantiation and interconnection of ports at runtime. This means that type checking has to be performed by the runtime infrastructure. The problem is solved using C++ run-time type information (RTTI). We apply the dynamic_cast operator to perform type compatibility checking and typeid to obtain textual type information.
2) Automation: Many higher-level languages, such as Java, C# and Objective-C, support reflection features. Reflection is a powerful metaprogramming paradigm which allows a program to examine its own structure and behavior at runtime, or even to alter its behavior. C++ RTTI can be considered a very basic and strongly limited implementation of reflection. One application of reflection is to resolve a class name (given as a string) into a class type descriptor. This way, an instance of the class can be constructed indirectly, without specifying its type at the source code level. As RTTI does not support this feature, we implemented a small meta-language which imitates that behavior by allowing the designer to enable selected classes for indirect instantiation. This is done by inserting preprocessor macros that declare a class as “automatable”. It is also possible to describe the set of constructor arguments with respect to their types and names. When the infrastructure code initializes, the class registers itself with the Repository (which follows the singleton design pattern).
3) Simulation Control: An appropriate interface supports synchronization and data exchange between the SystemC kernel and Simulink. The SimulationContext class exposes functionality to control the simulation. A Reset method resets the SystemC kernel to its initial state: all module instances are destroyed and the simulation time is reset to 0. RunSimulation executes the simulation for a specified amount of time. The GetTimeOfNextEvent method returns the point in time at which the next process inside SystemC's process queue is scheduled (or ∞ if no process is currently pending).
B. Client S-Function
The client S-Function acts as a mediator between a component server and Simulink. It synchronizes both simulator kernels and enables the exchange of signal values.
1) Signal Data Exchange: CoSimImport and CoSimExport (see Figure 2) are specializations of sc_channel which are designed to transfer signal data into and out of a SystemC simulation. Both can be interfaced with Simulink. We distinguish four kinds of connections which can occur in a Simulink model:
• A Simulink/Simulink connection models a dataflow dependency between two native Simulink blocks. This type of connection does not need any further consideration, as it is handled by the Simulink solver.
• A Simulink/SystemC connection links a Simulink signal to a so-called import gateway block. This block maps to the client S-Function, which creates an instance of CoSimImport in order to handle the data exchange from Simulink to SystemC. It is important to mention that the import gateway block is the only block which supports this connection type. It is not possible to link a Simulink signal directly to an arbitrary SystemC module (see Figure 5).
• A SystemC/Simulink connection links an export gateway block to a Simulink signal. In this case, the S-Function creates an instance of CoSimExport. Again, the export gateway is the only block supporting this type of connection.
• A SystemC/SystemC connection links two blocks, each representing a SystemC module instance. This connection type is realized using a propagate-and-bind scheme which is explained below.
Figure 3. Two connected Simulink blocks and their internal representation
A simple solution to realize SystemC/SystemC connections would be to create and bind instances of CoSimImport and CoSimExport. In this case, Simulink transfers the actual signal data according to Figure 3. To ensure that no signal value change is missed, it is necessary to carefully choose the sample times of both blocks. Setting them too high causes data loss and unintuitive behavior. Setting them too low results in poor simulation performance, since the frequent context changes prevent the SystemC simulation kernel from skipping unnecessary simulation cycles.
There are applications where the simulation performance is affected in such a way that the approach becomes completely intractable, for example if SystemC is applied to analyze packet-based data [13]. In the considered application, packets are recorded by a data logger and processed by a SystemC-based data analysis framework. The framework synchronizes the simulation time with the receive timestamp of each currently processed packet. All timestamps possess a resolution of 100 ns. However, the time lag between two consecutive packets usually lies several orders of magnitude above this resolution. Given moderate traffic, it is possible to run the data analysis faster than real time on a standard PC. Obviously, a context switch every 100 ns would result in non-viable analysis performance.
We decided to implement a different approach which allows the SystemC sub-model to be executed at a much higher rate (or truly event-based) than the rest of the model. Synchronization is only necessary at transitions between Simulink and SystemC blocks, which are modeled explicitly by import and export gateways. The approach imitates the standard way of constructing SystemC models, where a module connection is realized by binding the involved module ports to the same instance of an sc_channel (in most cases sc_signal, a specialization of sc_channel).
For each SystemC/SystemC connection inside the Simulink block diagram, the client S-Function creates an appropriate sc_signal instance and binds it to the underlying port instances. Unfortunately, there is no elegant way of extracting the set of diagram connections from inside the S-Function. However, the information is gained implicitly using a propagate-and-bind scheme. As soon as the simulation is running, Simulink provides the S-Function with buffers which are used to store its input and output values. The idea is not to store an actual signal value inside a buffer, but a reference (or pointer) to the signal instance. During the first Simulink simulation cycle, Simulink propagates the references in order to complete the binding of the whole SystemC sub-model.
On each simulation cycle, Simulink passes (amongst others) through a calculate outputs phase which instructs each block to update its outputs. The computation may involve block inputs, provided that they are marked as having a direct feedthrough property [15]. Our implementation marks every input as direct feedthrough in order to gain access to it during the calculate outputs phase. This leads to the following algorithm:
1) When a block is created: Instantiate the appropriate module class, then create and bind an sc_signal instance for each data output port. Leave all data input ports unbound.
2) When entering calculate outputs for the first time:
   • Store the references (or pointers) to all signals created in step 1 in the appropriate output data buffers (provided by Simulink).
   • Fetch all references from the input data buffers (provided by Simulink) and bind all data input ports to the corresponding signal references.
3) When all blocks have passed calculate outputs: The model is ready to elaborate; start the SystemC simulation.
Figure 4a shows the internal representation of the Simulink model shown in Figure 5 before the simulation is started (step 1). After propagating all signal references, the binding is completed (step 2, see Figure 4b). Elaboration and start of the SystemC simulation (step 3) still take place during the very first Simulink solver step, so even at simulation time 0 no information is lost. Prior to the simulation, Simulink analyzes all data dependencies in the model and computes an appropriate block execution order which ensures that the inputs of each block are computed before that block enters the calculate outputs phase. However, the propagation scheme is only viable for SystemC sub-models without loops. Simulink would recognize each loop as being algebraic and report an error, regardless of whether a register within the underlying behavior actually breaks that loop or not.
2) Time Synchronization Algorithm: Our time synchronization algorithm is controlled by Simulink, as opposed to [9], where the SystemC event queue is used to control the synchronization intervals. If the Simulink model refers to multiple component servers, it is possible to have more than one SystemC kernel. In that case, the SystemC kernels run independently of each other, though each is controlled by and synchronized with Simulink simulation time. There is no direct data exchange between modules belonging to different component servers; instead, gateway blocks have to be used. It is up to the designer to establish an appropriate sampling time for each gateway block. Setting a sampling rate too low can lead to loss of data; setting it too high will affect the simulation performance due to oversampling.
The involved co-simulation algorithm (Algorithm 1) is quite simple. The Simulink solver triggers each gateway block at its specified sampling time. This happens when entering the calculate outputs phase, which instructs SystemC to synchronize with Simulink's simulation time. If the block is an import gateway, the input signal value is transferred into the SystemC model. If the block is an export gateway, a number of single delta cycle simulations follow until no processes are pending for the current simulation time. This step accounts for combinational computation paths inside the modules and ensures that all module outputs have stable values. Afterwards, the output signal value is transferred to Simulink.
Algorithm 1 When entering calculate outputs: Synchronize Simulink and SystemC

now := CurrentSimulinkTime
Δt := now − CurrentSystemCTime
if Δt > 0 then
  RunSimulation(Δt)
end if
if current block is import gateway then
  Transfer Simulink input signal value to SystemC
else if current block is export gateway then
  while GetTimeOfNextEvent() = now do
    RunSimulation(0) {Executes a single delta cycle}
  end while
  Transfer SystemC signal value to Simulink
end if
Figure 4. Internal representation of Simulink blocks representing SystemC modules (a) during model construction and (b) after binding is completed
C. Middleware
The communication between a component server and a client is done via a middleware. In the case where the server and client are compiled together, no additional middleware software is used and communication is done by sharing pointers into the same memory space. For TCP/IP communication, an open source project called Remote Call Framework (RCF) [16] was used. Shared memory communication was implemented with the open source Boost IPC library [17]. Both middleware implementations provide convenient and powerful function calls for inter-process communication.
V. RESULTS
We used the co-simulation framework for the verification of a variable-length FIR filter, a building block commonly used in DSP applications. The filter was modeled as a SystemC module with one input and one output port. The length and coefficients of the filter must be given as parameters when the class is instantiated. Our model is approximately timed in the sense that we assume data is processed at a constant rate. The SystemC model could later be refined by adding timing information to the model.
The FIR SystemC module, with the required preprocessor macros declaring it as “automatable” (see Section IV-A2), was compiled into the repository of a component server. Three variants of the component server were generated according to Figure 1.
Figure 5 shows how the verification of the SystemC FIR filter was performed. As a reference, we used the Digital Filter Design block from Simulink's Signal Processing Toolbox to generate a 16-tap passband filter along with its coefficients. The coefficients were saved in an array and given as parameters to the SystemC model. We simulated the three component server variants and performed verification by inspecting the spectra calculated by the FFT blocks. We were able to easily verify our SystemC application in a couple of minutes. This process would have required considerably more time and effort had the designer manually coded SystemC test benches.
Figure 5. Verification of a FIR filter developed in SystemC.

A certain simulation time overhead is expected due to the numerous synchronizations that must be performed between Simulink and SystemC. The total number of synchronizations in a simulation is calculated from the number of input/output gateways and their sampling rates. For our tests, we reused the three component server variants used for the FIR validation shown in Figure 5. The simulation time for each component server variant was measured and its performance calculated as the ratio of simulation time per synchronization event. In the case of TCP/IP communication, the component server was first run on the local host and later on a remote host connected to our LAN. In all cases a standard desktop computer (Intel Core2 Quad CPU) running Windows 7 was used.
The performance results are shown in Figure 6. The results are presented as the simulation time in seconds per synchronization, in relation to the total number of synchronizations in a simulation. In all cases, the performance of a simulation improves (meaning less time per synchronization) as the total number of synchronizations increases, eventually stabilizing at a constant value. Our results can help a designer decide which communication scheme to use according to the total number of synchronizations expected in a simulation. The performance of the single address space variant departs from the rest after 100 synchronizations and reaches its maximum, approximately 20 times faster than the other variants, after 100k synchronizations. To our surprise, the performance of the shared memory and TCP/IP localhost variants is almost identical. We believe this is because the Boost [17] library implementation used for shared memory IPC is not efficient; we would expect better results if native Windows functions were used for shared memory IPC instead. Finally, the simulation performance of the TCP/IP remote host variant is naturally lower and may be affected by delays in the network.
Figure 6. Simulation performance results according to the number of synchronizations between Simulink and SystemC.
VI. CONCLUSIONS AND OUTLOOK
Our work shows that, thanks to the open source nature of SystemC, the principles and benefits of SLD, which have proven effective in the SoC market, can also be applied to the traditional design of embedded systems in order to rapidly create virtual platforms. Our work demonstrates that it is possible, from a designer's point of view, to seamlessly create and verify SystemC models within Simulink. The complexity of our approach is hidden by an automated framework that generates servers providing a library of SystemC modules, and clients attached to Simulink which control them.
In our current framework version, Simulink does not allow signal loops over SystemC blocks. Loops are allowed only if they are broken by at least one block with delaying behavior, for example a register, or, in Simulink terms, if at least one of the involved ports is declared as non-direct feedthrough. The challenge is to determine when it is safe to declare a port as non-direct feedthrough. A heuristic solution would be to analyze the sensitivity lists of all processes of a SystemC module. Ports that trigger any process, for example a clock signal, would be defined as direct feedthrough and the rest as non-direct feedthrough. As the latter case implies register behavior, the module outputs would not be immediately affected. Another solution would be to oblige the SystemC module designer to mark non-direct feedthrough ports with special meta tags. However, this topic requires further consideration.
Our approach could easily be extended to support parallelized, distributed simulation of SystemC models as done in [18]. In this way, we could increase simulation performance by distributing the simulation across multiple CPU cores. Further work includes support for TLM 2.0 interfaces, which should be possible since the reference propagation scheme can equally be applied to TLM interfaces.
REFERENCES
[1] J.-F. Boland, C. Thibeault, and Z. Zilic, "Using Matlab and Simulink in a SystemC verification environment," in Proc. Design and Verification Conference (DVCon'05), 2005.
[2] K. Keutzer, A. R. Newton, J. M. Rabaey, and A. Sangiovanni-Vincentelli, "System-level design: Orthogonalization of concerns and platform-based design," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 19, no. 12, pp. 1523–1543, 2000.
[3] SystemVision. [Online]. Available: www.mentor.com/systemvision
[4] System Generator. [Online]. Available: http://www.xilinx.com/tools/sysgen.htm
[5] P. Gerin, S. Yoo, G. Nicolescu, and A. A. Jerraya, "Scalable and flexible cosimulation of SoC designs with heterogeneous multi-processor target architectures," in Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC 2001), 2001, pp. 63–68.
[6] N. Pouillon, A. Becoulet, A. V. de Mello, F. Pecheux, and A. Greiner, "A generic instruction set simulator API for timed and untimed simulation and debug of MP2-SoCs," in Proc. IEEE/IFIP Int. Symp. Rapid System Prototyping (RSP '09), 2009, pp. 116–122.
[7] G. Nicolescu, S. Yoo, A. Bouchhima, and A. A. Jerraya, "Validation in a component-based design flow for multicore SoCs," in Proc. 15th Int. System Synthesis Symp., 2002, pp. 162–167.
[8] F. Bouchhima, M. Briere, G. Nicolescu, M. Abid, and E. Aboulhamid, "A SystemC/Simulink co-simulation framework for continuous/discrete-events simulation," in Proc. IEEE Int. Behavioral Modeling and Simulation Workshop, 2006, pp. 1–6.
[9] F. Bouchhima, G. Nicolescu, M. Aboulhamid, and M. Abid, "Discrete-continuous simulation model for accurate validation in component-based heterogeneous SoC design," in Proc. 16th IEEE Int. Workshop Rapid System Prototyping (RSP 2005), 2005, pp. 181–187.
[10] W. Hassairi, M. Bousselmi, M. Abid, and C. Sakuyama, "Using Matlab and Simulink in SystemC verification environment by JPEG algorithm," in Proc. 16th IEEE Int. Conf. Electronics, Circuits, and Systems (ICECS 2009), 2009, pp. 912–915.
[11] K. Hylla, J.-H. Oetjens, and W. Nebel, "Using SystemC for an extended Matlab/Simulink verification flow," in Proc. Forum on Specification, Verification and Design Languages (FDL 2008), 2008, pp. 221–226.
[12] D. Berner, J.-P. Talpin, H. Patel, D. A. Mathaikutty, and S. Shukla, "SystemCXML: An extensible SystemC front end using XML," in Proc. Forum on Specification and Design Languages (FDL 2005), 2005.
[13] C. Kollner, G. Dummer, A. Rentschler, and K. Muller-Glaser, "Designing a graphical domain-specific modelling language targeting a filter-based data analysis framework," in Proc. IEEE Int. Symp. Object/Component/Service-Oriented Real-Time Distributed Computing Workshops, 2010, pp. 152–157.
[14] M. Moy, F. Maraninchi, and L. Maillet-Contoz, "Pinapa: An extraction tool for SystemC descriptions of systems-on-a-chip," in Proc. EMSOFT, Sep. 2005, pp. 317–324.
[15] Matlab Simulink. [Online]. Available: http://www.mathworks.com/help/toolbox/simulink/
[16] RCF - Interprocess Communication for C++. [Online]. Available: http://www.codeproject.com
[17] Boost C++ Libraries. [Online]. Available: www.boost.org
[18] K. Huang, I. Bacivarov, F. Hugelshofer, and L. Thiele, "Scalably distributed SystemC simulation for embedded applications," in Proc. Int. Symp. Industrial Embedded Systems (SIES 2008), 2008, pp. 271–274.
Extension of Component-Based Models for Controland Monitoring of Embedded Systems at Runtime
Tobias Schwalb and Klaus D. Müller-Glaser, Karlsruhe Institute of Technology, Institute for Information Processing Technologies, Germany
Email: {tobias.schwalb, klaus.mueller-glaser}@kit.edu
Abstract—Component-based system development is widely used today to enable rapid, abstract development and reuse in embedded systems. However, control and monitoring at runtime, for adjustment and error identification, usually take place in different domains or tools. These are generally more concrete, so the user needs a deeper understanding of the system. In contrast, this paper presents a continuous concept that raises the abstraction level by integrating runtime control and monitoring into component-based models. The concept is based on an extended component-based meta model and libraries, which describe the available components together with their interfaces and parameters. At design time, source code is generated from the model built by the user. At runtime, control commands are sent to the embedded target according to user modifications in the model, and acquired monitoring data is back-annotated and displayed on model level. The concept is demonstrated and evaluated using a reconfigurable hardware platform.
I. INTRODUCTION
The development of embedded systems becomes more and more complex due to increasing demands and the pressure to meet productivity targets. Dannenberg et al. [1], for example, predict a growth of 150% for the market of electric/electronic automotive components, up to a total of 316 billion Euro in 2015, with an exponential rate in the future. To manage this complexity, many systems are built using component-based design methodologies, which allow easy reuse of existing design parts. However, while this method supports a rapid and abstract design, control and monitoring for runtime adjustment and error identification normally take place on a more detailed level using specific tools. Therefore, the user needs a deeper understanding of the system, increasing costs and development time.
In this paper we present a continuous concept to extend component-based models for runtime control and monitoring, supporting abstract adjustment and error identification. The method is based on an extended meta model for component-based systems and on libraries, which store predefined components. In contrast to current methods (see Section II), it allows the user to work on the same abstract level at design time and runtime. Following the design phase, source code is automatically generated for the embedded target. During runtime, the components of the embedded target can be controlled and monitored using special parameters. Other parameters display the monitored status of the embedded system, both within the same abstract component-based model. Therefore, the user does not need a detailed understanding of the individual components for fast prototyping.
In this context, we first describe the state of the art in model-based design, configuration, control and monitoring in Section II. The next section gives an overview of the concept and describes the flow of the method. Section IV illustrates the developed meta model, while Section V describes the actions for configuration, control and monitoring and presents our implemented model-based development environment. An example, based on the use of reconfigurable hardware, is presented in Section VI, including practical tests and results in Section VII. We close with conclusions and an outlook on future work in Section VIII.
II. STATE OF THE ART
In this section we concentrate on specific model-based design methods as part of the V-Model [3] for embedded system development. We describe the state of the art concerning system design, with a focus on component-based design methods and their possibilities. Further, we present current methods for model-based control and monitoring, because we integrate these into component-based models.
For system planning and design in the early development phases of embedded systems, the Systems Modeling Language (SysML) [4] or the Modeling and Analysis of Real Time and Embedded systems profile (MARTE) [5] are used. Both are based on the Unified Modeling Language (UML) [6] and allow abstract specification, analysis and design of complex real-time embedded systems. However, SysML and MARTE do not support runtime functionalities. Compared to component-based design, SysML and MARTE are positioned earlier in the design process.
Component-based design [7] is located in the implementation phase and well known in the software domain. It describes a concept of separating systems into components. Thereby, an individual component is often regarded as a software module that encapsulates a set of related functions (or data). Components communicate with each other via interfaces and are configured using parameters. Components are normally stored in libraries to allow rapid reuse. The design can be performed text-based or model-based; in the latter, the components are displayed as graphical objects.
The main usage of components is to enable reuse of already implemented functionality in different versions and configurations of a product as well as in other projects [8]. Therefore, it reduces the time-to-market and increases the quality, because
Fig. 1. Concept of Design, Control and Monitoring (left) and relating Meta Object Facility (MOF) Levels [2] (right)
already used components are mostly well engineered and tested. Different requirements have to be considered when using component-based design methodologies, including scalability, maintainability and interoperability as well as applicability to real systems. These also apply to the embedded systems domain; more details are outlined in [9].
Once an embedded system is implemented, the developer normally needs separate tools for runtime control and monitoring. These tools allow adjusting the behavior of the implemented software, for example an offset correction. Others are used for debugging purposes, monitoring internal signals or variables. These tools normally perform control and monitoring on a detailed level compared to the abstract component-based design. Furthermore, they are often specific to the embedded target and sometimes adapted to the implemented system.
A further domain are rich component models [10], used as a uniform representation of different design entities to support the management of functional and non-functional parts in the development process. A component-based hierarchical software platform for automotive electronics is shown in [11]. It provides a series of tools for model-driven development, visual configuration and automatic code generation. In [12], Gu et al. present an end-to-end tool chain for model-based design and analysis of component-based embedded real-time software in the avionics domain. It includes the configuration of source code as well as runtime instrumentation and statistics that are fed back into models. The management of distributed systems and their configuration is discussed in [13]. A concept based on design-time modeling, model transformation and management policies is presented.
A framework for design and runtime debugging of component-based systems is presented in [14]. It enables validating the interactions between components by automatic propagation of checks from the specifications to application code. The usage of different MDE tools at runtime is described in [15], where different runtime models are discussed and a tool to design them is presented. Monitoring of embedded systems in terms of model-based debugging is presented in [16], [17] and [18]. These concentrate on real-time monitoring of functional models (e.g. statecharts).
In our concept we follow the known issues and methods of component-based design, i.e. reuse, source code generation, interfaces, parameters, etc. However, this first version does not consider special techniques, e.g. product line techniques, analysis or distributed systems. In contrast to current methods, we integrate control and monitoring directly into component-based models, without using functional models, specific runtime models or external (low-level) tools. Thereby, the component-based model refers to the architecture of the system and does not represent its functionality. As a result, the control and monitoring possibilities are limited and need to be considered already during the design of the components.
III. CONCEPT
In our concept we extend component-based models to include runtime control and monitoring of embedded systems. Thereby, we follow the Meta Object Facility (MOF) of the Object Management Group (OMG) [2]. The flow of our method is depicted in Figure 1 on the left; the related levels of the MOF are shown on the right.
The platform-independent component-based meta model, which corresponds to the M2-Level of the MOF, characterizes a system comprising components with different attributes, interfaces and parameters. It has been extended with special parameters concerning configuration, control and monitoring (for details see Section IV). According to this meta model, libraries store the implemented platform-dependent components. The user uses these components to build his system in a component-based model, which corresponds to the M1-Level of the MOF. Thereby, the user also connects components using their interfaces and adjusts them according to their parameters.
In the next step, based on the component-based model, the source code of the system is generated using templates. The code generation includes the components, their connections and adjustments according to the set parameter values. Additionally, the templates integrate functionality for control and monitoring at runtime, including further components which handle the communication between the embedded target and a design tool on a PC. This generated
source code is between the M1- and M0-Level, because it is normally high-level code (VHDL, C, ...), which is not used to directly program the embedded target. For example, VHDL code is used in a further step to generate a platform-specific binary file. This implementation (M0-Level) is integrated into the embedded target.
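Template-based generation of this M1/M0 source code can be approximated with a short sketch. The paper uses xPand templates; the Python stand-in below is purely illustrative, and all names (the `component` dict layout, `SensorPreprocessing`, the generic names) are assumptions, not taken from the actual tool.

```python
# Hypothetical sketch of template-based code generation: one model-level
# component instance is expanded into a VHDL component instantiation,
# with configuration parameter values fixed as generics at generation time.
from string import Template

# Simplified component instance as it might appear in the M1-level model.
component = {
    "id": "sensor_pre_1",
    "type": "SensorPreprocessing",
    "params": {"input_protocol": "I2C", "input_address": "0x48"},
}

VHDL_TEMPLATE = Template(
    "  $id: entity work.$type\n"
    "    generic map ($generics)\n"
    "    port map (clk => clk, rst => rst);\n"
)

def generate(comp):
    generics = ", ".join(
        f'{name} => "{value}"' for name, value in comp["params"].items()
    )
    return VHDL_TEMPLATE.substitute(
        id=comp["id"], type=comp["type"], generics=generics
    )

print(generate(component))
```

A real xPand template additionally emits the top-level structural description that wires the instantiated components together, as described in Section VI.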
To control the embedded target, the user modifies the component-based model (M1-Level). According to his actions and general information gained from the components, commands are generated and sent to the embedded target via the integrated communication components. For control, the user cannot change every part at model level; only the values of specific control parameters, integrated in the components, can be modified. For monitoring, using the same communication components, information read from the embedded target is transferred to a PC. On the PC, this information, which corresponds to the M0-Level, is interpreted using mapping information and displayed in the component-based model (M1-Level). Thereby, the information is associated with the corresponding monitoring parameters.
As a result, during design, configuration, control and monitoring, the user works on the same abstract level using the same component-based model. There are only different views on the model at design time and runtime. At design time, algorithms map the interactions on model level to the generated source code for the embedded target. At runtime, the algorithms generate control commands according to user modifications in the model and interpret received data to display it on model level. More details on the mechanisms for design, configuration, code generation, control and monitoring are given in Section V.
The component-based meta model is based on the Ecore meta meta model (M3-Level), because it offers a general description and our model-based development environment (see also Section V) is built using the Eclipse Modeling Framework [19].
IV. COMPONENT-BASED META MODEL
The extended platform-independent component-based meta model is depicted in Figure 2. The model instantiated from this meta model later stores information from different perspectives (e.g. design time and runtime). Therefore, the meta model has to be able to, on the one hand, describe a system comprising components with individual interfaces and parameters (including their attributes). On the other hand, it holds the information the user adds during assembling, design, configuration and control, as well as the data received during monitoring of the embedded target.
The classes on the top right of the meta model describe a simplified standard component-based meta model. They outline a system composed of components and connections. Thereby, the attribute id of the class Component is unique and allows identifying a single component. The type and version specify the type of component, which is later used in correspondence with the libraries. In addition, a component has interfaces, which are modeled as Input and
Fig. 2. Extended Component-Based Meta Model
Output classes. Thereby, the attribute type is used to clarify which output can be connected to corresponding inputs. The boolean attribute fixed signals whether an output or input is compulsory, i.e. whether it needs a connection. Outputs and inputs of the components are connected with each other as source and target using connections.
The remaining meta model describes three different kinds of component parameters: for configuration, control and monitoring. The parameters have been split to allow a clear differentiation and to implement different functionality in the development environment. The class ConfigurationParameter describes parameters intended for configuring a component during design time. The class has two attributes, which describe the name and value of the parameter. In addition, the top class has two child classes, which describe different parameter forms, i.e. a numerical or a text-based parameter.
The ...Number class forms a parameter in a numerical format, i.e. an integral or floating point number. The min- and max-attributes are the limits for the value. The ...List class describes text-based parameters, whose value is selected from a list of predefined values. These possible values are stored in the ...ListValue class. In this context, a distinction is made between the displayed value on model level and the coded value used in conjunction with the embedded target. These two forms have been chosen because they represent the most used parameters in general embedded applications. Furthermore, the parameters allow abstract and easy adjustment of the embedded system from the user perspective, as well as limit the possible inputs and avoid failures.
The ControlParameter class describes parameters used at runtime to control a component on the embedded target or, respectively, adjust its behavior. The layout of this class and its subclasses is similar to the classes for configuration. The only difference is the additional command attribute, which
Fig. 3. Integration Flow for Design, Configuration, Control and Monitoring
stores the command sent to the embedded target to modify the parameter at runtime. The MonitoringParameter class describes parameters monitored during runtime. The layout of the classes for monitoring is similar to the classes for control. The only difference is that there are no minimum and maximum limits for monitored numerical parameters, as the user cannot modify the value of these parameters (it is read from the embedded target). The classes for monitoring parameters are designed in the same way to allow displaying numerical values as well as interpreting received data and displaying its abstract representation on model level.
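The three parameter kinds of the meta model can be sketched as plain classes. This is a hedged illustration: class and attribute names follow the description of Figure 2, but the validation and encoding logic shown is an assumption about how a development environment might enforce the described rules.

```python
# Illustrative sketch of the three parameter kinds from the meta model:
# configuration (design time, range-limited), control (runtime, with a
# command), and monitoring (runtime, read-only, coded -> displayed).
from dataclasses import dataclass, field

@dataclass
class ConfigurationParameterNumber:
    name: str
    value: float
    min: float
    max: float

    def set(self, new_value):
        # Configuration values may only be set within the predefined limits.
        if not (self.min <= new_value <= self.max):
            raise ValueError(f"{self.name}: {new_value} outside limits")
        self.value = new_value

@dataclass
class ControlParameterList:
    name: str
    command: str                                 # command sent to the target
    values: dict = field(default_factory=dict)   # displayed -> coded value
    value: str = ""

    def set(self, displayed):
        if displayed not in self.values:
            raise ValueError(f"{self.name}: '{displayed}' not in list")
        self.value = displayed
        # Return the string a mapping algorithm might transmit.
        return f"{self.command} {self.values[displayed]}"

@dataclass
class MonitoringParameterList:
    name: str
    values: dict = field(default_factory=dict)   # coded -> displayed value
    value: str = ""

    def update(self, coded):
        # Monitored values are read back and decoded for display;
        # the user cannot set them directly.
        self.value = self.values.get(coded, f"unknown ({coded})")
```

The split mirrors the paper's rationale: limits and predefined lists constrain user input, while the coded/displayed distinction lets the model show "sensor failure" instead of a raw error code.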
V. CONFIGURATION, CONTROL AND MONITORING
The flow of design, configuration, control and monitoring in combination with our implemented development environment and the embedded target is depicted in Figure 3. In the first step, the development environment loads the libraries with the component descriptions and source code templates. In the next step, the user assembles and designs the system on model level in a component-based model. For assembling, he uses predefined components described in the libraries. These create objects in the component-based model that are instances of classes in the meta model. The library, for example, stores a multiplexer component, which can directly be used and inserted in the model. Thereby, the component is automatically instantiated and displayed with all its interfaces and parameters. A component may be integrated multiple times. During the design, the user also connects the components to each other using their interfaces.
In the third step, the components are configured according to their parameters. Thereby, the user adjusts the values of the configuration parameters of the components in the model. The value of a numerical parameter can only be set within its limits (predefined in the component description). For text-based parameters, only an element from the predefined
list can be chosen. These parameters adjust the behavior of the individual component and can only be modified during design time, because they influence the generated source code of the component.
After system design and configuration are completed, the components and their connections are checked before the source code is generated. The generated source code is in general split into multiple files according to the individual components and the structure of the system. Additionally, communication components are integrated to allow controlling and monitoring the system at runtime.
After integration on the embedded target, in the last step the user controls and monitors the system during runtime using the same component-based model he used to design the system. For controlling, the user modifies the values of the control parameters. Thereby, the predefined restrictions on numerical and text-based parameters also apply. According to the modifications, algorithms generate commands and send them to the embedded target. The values of the control parameters can also be adjusted during design time and thereby form predefined values for runtime execution. For monitoring at runtime, the values of the monitoring parameters are periodically read back from the embedded target, back-annotated and displayed in the model. Thereby, the respective interpretation and coding of the parameters is used.
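The runtime step above — one command per control-parameter change, plus periodic read-back of monitoring parameters — can be sketched as follows. The `Target` class is a hypothetical stand-in for the real serial link; the command-string layout is assumed for illustration.

```python
# Illustrative sketch of the runtime loop: control changes become
# commands, monitoring parameters are polled periodically.
import time

class Target:
    """Fake embedded target; a real implementation would talk RS232."""
    def __init__(self):
        self.state = {("01", "STATUS"): "00"}

    def send(self, command_string):
        pass  # would write the command string to the serial port

    def query(self, component_id, command):
        return self.state.get((component_id, command), "00")

def control(target, component_id, command, coded_value):
    # One user modification in the model becomes one command string.
    target.send(f"{component_id};{command};{coded_value}")

def monitor(target, component_id, commands, cycles=3, period=0.0):
    # Periodic read-back of one component's monitoring parameters; the
    # returned coded values would be back-annotated into the model.
    samples = []
    for _ in range(cycles):
        samples.append({c: target.query(component_id, c) for c in commands})
        time.sleep(period)
    return samples
```

In the real environment this polling runs in a background thread of the IDE, which is also why, as Section VII notes, response times are not deterministic.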
We integrated the functionality for design, configuration, control and monitoring in a model-based integrated development environment (IDE), depicted in Figure 4. The development environment is based on the open source platform Eclipse [19]. For model-based design, the Eclipse Modeling Framework (EMF) and the Graphical Modeling Framework (GMF) are used. The generation of the source code is performed with the xPand framework, and the checks in the model use the integrated Check language.
The component-based meta model (see Section IV) is integrated as an Ecore model in EMF. According to this model, we created three models in GMF for the graphical model-level editor. The first model describes the palette in the editor, i.e. the tools available to build the model. The gmfgraph model describes the graphical representation of the elements in the model, i.e. their shape, color, etc. The third model lays out a mapping between the three models, i.e. it creates relations between elements in the meta model, the palette and the graphical representations. After creation of these models, a model-based IDE can be generated by the framework. The result is depicted in Figure 4. The modeling area can be seen in the middle, with the tool palette on the right (which already includes components from libraries). The project management with projects and corresponding files is on the left side. The window at the bottom in the middle shows model-based properties of the currently selected element. The additional functionality to support all steps of configuration, control and monitoring is integrated in the window on the bottom right, which has been manually implemented as an Eclipse plug-in.
The window is used to load the libraries, access the different parameters of a component, generate the source code and
Fig. 4. Integrated Model-Based Developing Environment (1 - Modeling Area; 2 - Component Palette; 3 - Properties Window; 4 - Parameter Dialog)
communicate with the embedded target. A library is loaded from an XML file, whereby the components get directly integrated into the palette. When a component is drawn in the model, it is automatically instantiated with all its parameters and interfaces, which can directly be used for connections to other components. If a component is selected in the model, its parameters are displayed in the table and the user can modify them. The numerical parameters can be specified directly; the text-based parameters are displayed with a drop-down menu that offers the predefined values (monitoring parameters cannot be changed).
A hardware connection to the embedded system can be established after specifying the connection settings. If the connection is active and the user changes a control parameter in the table, corresponding commands are automatically generated and transmitted. Additionally, the monitoring parameters of the selected component are periodically read back, interpreted and displayed. The configuration parameters cannot be changed during an active connection, because this would require a regeneration and re-integration of the system. For easier handling, the window offers additional filters and search functions to locate parameters more easily.
VI. FPGA INTEGRATION
To demonstrate the functionality and integrity of the concept and the IDE, we implemented it along with a system for design, configuration, control and monitoring of Field Programmable Gate Array (FPGA) systems. FPGAs were chosen because of their high computing power, the possibility to run processes in parallel and their easy extensibility.
In the example, the components used for the component-based design of the system are integrated in two libraries. The
first library describes general hardware components which are often used in FPGA systems, for example multiplexers, timers, AND gates, OR gates, etc. External inputs and outputs are also integrated as components to allow connections to external peripherals. A second library describes specific components for sensor and actuator control.
Using these components, we built as an example a small cooling control system (depicted in Figure 5), which reads data from a temperature and a humidity sensor and controls a cooling fan. A sensor is connected to the system using an External Input component associated with a Sensor Preprocessing component, which is used to read the sensor value and correct it if necessary. The humidity sensor is integrated twice and connected to a Multiplexer and a Sensor Check component to switch automatically if a sensor fails. The Cooling Algorithm component takes the preprocessed sensor values and controls a fan using an additional Actuator Control component. In the libraries, the components are designed with compatible interfaces and communication protocols, or adapt automatically (e.g. the multiplexer) during code generation according to the connected components.
The respective component-based model of the system is shown in the modeling area of the IDE in Figure 4. In comparison to the embedded system, the component-based model does not include the additional components and buses for runtime control and monitoring, which are described in the next paragraph. All other components are described with their interfaces and parameters. As an example, the parameters of the Sensor Preprocessing component for the temperature sensor are depicted in the table of the Parameter window in Figure 4. For configuration, the Input Protocol and Input
Fig. 5. FPGA System Integration
Address can be modified. During runtime, the Offset and Slope Correction parameters can be controlled, and the Status, Sensor Value and Output Value of the component are monitored. The Input Address, for example, is a List Parameter, so the user can only choose from a list of predefined addresses. The parameter Status is also a List Parameter, so the coded value read from the embedded target is interpreted and displayed in common language (instead of complex error codes). All components are integrated on the embedded target using VHDL templates in xPand. The system architecture (i.e. connections and buses) is generated by an additional xPand template as a top-level VHDL structural description file.
In addition to the library components, further components and buses are automatically integrated during generation of the system to allow control and monitoring at runtime. These components include an RS232 interface for communication with a PC and an 8-bit microprocessor for processing commands. The microprocessor is connected to the components with three 8-bit buses. The first bus is used to identify the component, the second for sending commands and the last one for reading the status of the components. The buses are separated from the connections between the components and can be used independently; therefore they do not influence the functionality of the system.
To use the described model-based development environment (see Section V) for the FPGA integration, the communication interface is adjusted to communicate over an RS232 connection. Furthermore, compatible mapping algorithms are integrated to generate commands for control and to interpret received monitoring data. The commands are split into three parts: the first part is the ID of the component, the second is the command associated with the actual parameter, and the third is the value (for monitoring commands the third part is empty). These parts are sent together in one string via the interface to the microprocessor inside the FPGA.
The microprocessor separates these parts and sets the ID on the first bus to address the respective component. In the next
step, the command and value are sent on the command bus. If a monitoring command was sent, the answer of the component is read from the corresponding bus and sent to the PC.
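The three-part command protocol and the microprocessor's dispatch can be sketched as a round trip. Note the separator character and field contents are assumptions; the paper only states that ID, command and value are sent together in one string over RS232.

```python
# Hedged sketch of the command protocol (component ID; command; value)
# and the on-chip microprocessor's dispatch over the three buses.

SEP = ";"  # assumed field separator, not specified in the paper

def encode(component_id, command, value=""):
    # Monitoring commands leave the value part empty.
    return f"{component_id}{SEP}{command}{SEP}{value}"

def microprocessor_handle(message, components):
    """Split the message, select the component via the ID bus, drive the
    command bus, and for monitoring commands read the status bus."""
    component_id, command, value = message.split(SEP)
    target = components[component_id]   # ID bus: address the component
    if value:                           # control command: write the value
        target[command] = value
        return None
    return target.get(command)          # monitoring: read status for the PC

# Hypothetical component state table standing in for the hardware.
components = {"01": {"STATUS": "00"}}
```

A control message such as `encode("01", "OFFSET", "05")` writes a value into component 01, while `encode("01", "STATUS")` triggers a read-back whose answer is returned to the PC.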
VII. TESTS AND RESULTS
Different tests are carried out to evaluate the functionality and integrity of the concept and the developed IDE. The tests are mainly performed using a XUP Virtex-II Pro development system, including a Xilinx Virtex-II Pro FPGA [20] as well as interfaces for communication and programming.
The size and speed of the implemented system depend on the type and number of components and connections. The test system (see Section VI) uses around 6% of the logic resources of the FPGA and runs at a frequency of 100 MHz, limited by the layout of the used components. Thereby, the resources of the additional microprocessor and communication components are fixed at approx. 1%. The resources for the additional buses as well as the functions for control and monitoring depend on the number and layout of the components. The maximum speed is in general limited by the integrated microprocessor to approx. 150 MHz, because the components communicate directly with the microprocessor and therefore need the same clock signal. The communication could also be designed independently to allow different clock frequencies, but this would increase the logic resources for buses and interfaces.
In tests, the communication, including processing modifications and sending commands as well as receiving monitored data and displaying it in the model, worked as described in Section V. There is only a time delay of up to approx. 250 ms between a change in the model and the reaction of the embedded target, as well as between a change in the embedded target and its display in the model. The reasons are the slow RS232 communication interface and the mapping algorithms. In addition, as the development environment is designed multi-threaded, it cannot be determined when the thread responsible for communication or processing is executed. Therefore, while the hardware runs in real time, real-time control and monitoring are not possible.
As a result, the component-based model allows a rapid design of the system and reuse of existing components. After assembly and configuration of the components, the generated VHDL code can be directly integrated using the Xilinx IDE tools. During runtime, the components can be controlled and monitored using the implemented IDE to support adjustment and monitoring on an abstract level.
The functionality for configuration, control and monitoring of the individual components already needs to be considered during the design of the component. This is a challenge, because in the component-based design process only existing parameters can be used on the abstract level. For example, if a parameter is not integrated in the design of a component, the user needs to work on the low level, manually adjusting the source code or using standard methods for control and monitoring. Regarding control and monitoring, there is an additional consideration, because every parameter normally increases the size and complexity of the component and may reduce its speed. However, with regard to rapid prototyping systems, size and speed are not directly critical, as the system is integrated on a high-power computing platform and not on the final target platform.
Furthermore, the tests showed that, besides the dependencies concerning the interfaces, there are further dependencies between different components and also between parameters of the same component. For example, one parameter can influence the value or the availability of another parameter. Currently, these dependencies are manually implemented according to the individual component and checked mainly using the Check language. This is error-prone and does not follow the concept of a continuous model-based approach. Moreover, the tests showed that expandability, in terms of new components, is not comfortable, because the library and templates need to be modified manually.
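A declarative rule table is one way such inter-parameter dependencies could be made less error-prone than hand-written checks. This sketch only illustrates the idea; the parameter names and the rule are invented and are not part of the presented framework.

```python
# Hedged illustration: dependencies between parameters of a component,
# expressed as declarative rules instead of hand-written Check-language
# code. All names below are invented for illustration.

def check_dependencies(params, rules):
    """Return the names of parameters whose dependency rule is violated."""
    return [name for name, rule in rules.items() if not rule(params)]

# example: the validity of one parameter depends on the value of another
params = {"filter_enabled": True, "filter_order": 0}
rules = {
    # if the filter is enabled, its order must be at least 1
    "filter_order": lambda p: not p["filter_enabled"] or p["filter_order"] >= 1,
}
```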
VIII. CONCLUSION AND OUTLOOK
In this paper, we presented a concept for expanding component-based models for runtime control and monitoring, to support abstract adjustment and debugging of embedded systems. Thereby, the practicability of a continuous abstract development has been increased. In comparison to existing techniques, the user does not only design on the model level, but also controls and monitors the system from the same abstract component-based model and does not need to use low-level domains or tools. A single model describes the structure of the system, allows adjustment and shows the status of its components during design time and runtime. All intermediate steps, from the model level to the embedded target and vice versa, are carried out by algorithms in the background. By generating the appropriate source code, the concept can be applied to rapid prototyping systems and allows adjusting, controlling and monitoring systems at runtime.
The proposed meta-model is capable of abstractly describing different aspects of a system and its components. The user uses libraries with predefined components to rapidly build the system. The components are connected using their interfaces and configured according to their parameters. During runtime, the components are controlled and monitored using different parameters in the same model. The integrated development environment allows performing all steps of design, configuration, control and monitoring on the model level. The concept has been implemented along with an FPGA integration to show functionality and feasibility on real systems. Different tests have been carried out to evaluate size, speed and maintainability.
In the future, the control and monitoring parameters will become optional for implementation, so that the user can decide on their usage and the additional resources. In addition, the IDE will be expanded to allow an easier specification of new libraries, as this is currently performed manually. In this context, the automatic integration of external modules as black-box objects will also be added. Additionally, a method will be implemented to check whether the system and components on the embedded target match the component-based model in the IDE. Moreover, the concept will be applied to other platforms and more complex systems to evaluate scalability and performance. The current meta-model will be enhanced with respect to dependencies of components and parameters as well as possibilities for hierarchical structures.
REFERENCES
[1] J. Dannenberg and C. Kleinhans, "The coming age of collaboration in the automotive industry," Mercer Management Journal, vol. 17, pp. 88–97, 2004.
[2] Object Management Group (OMG), "Meta Object Facility (MOF) 2.0 Core Specification," 2004.
[3] iABG, "V-Model," 1997. [Online]. Available: http://www.v-modell.iabg.de/vm97.html
[4] A. Korff, Modellierung von eingebetteten Systemen mit UML und SysML. Spektrum Akademischer Verlag, 2008.
[5] Object Management Group (OMG), "UML Profile for MARTE: Modeling and Analysis of Real-Time Embedded Systems, Specification, Version 1.0," 2009. [Online]. Available: http://www.omgmarte.org/
[6] Object Management Group (OMG), "Unified Modeling Language (UML) Specification, Version 2.2," 2008. [Online]. Available: http://www.uml.org/
[7] G. Heineman and W. T. Councill, Component-Based Software Engineering. Addison-Wesley Longman, Amsterdam, 2001.
[8] J. Kalaoja, E. Niemela, and H. Perunka, "Feature modelling of component-based embedded software," in Software Technology and Engineering Practice, 1997. Proceedings., Eighth IEEE International Workshop on [incorporating Computer Aided Software Engineering], 1997, pp. 444–451.
[9] D. Hammer and M. Chaudron, "Component-based software engineering for resource-constraint systems: what are the needs?" in Object-Oriented Real-Time Dependable Systems, 2001. Proceedings. Sixth International Workshop on, 2001, pp. 91–94.
[10] W. Damm, A. Votintseva, E. Metzner, and B. Josko, "Boosting re-use of embedded automotive applications through rich components abstract," Proceedings, FIT 2005 - Foundations of Interface Technologies, 2005.
[11] H. Li, P. Lu, M. Yao, and N. Li, "SmartSAR: A Component-Based Hierarchy Software Platform for Automotive Electronics," in Embedded Software and Systems, 2009. ICESS '09. International Conference on, 2009, pp. 164–170.
[12] Z. Gu, S. Wang, S. Kodase, and K. Shin, "Multi-view modeling and analysis of embedded real-time software with meta-modeling and model transformation," in High Assurance Systems Engineering. Proceedings. Eighth IEEE International Symposium on, 2004, pp. 32–41.
[13] S. Illner, A. Pohl, H. Krumm, I. Luck, D. Manka, and T. Sparenberg, "Automated runtime management of embedded service systems based on design-time modeling and model transformation," in Industrial Informatics, 2005. INDIN '05. 2005 3rd IEEE International Conference on, 2005, pp. 134–139.
[14] G. Waignier, S. Prawee, A.-F. Le Meur, and L. Duchien, "A Framework for Bridging the Gap Between Design and Runtime Debugging of Component-Based Applications," in 3rd International Workshop on Models@run.time, Toulouse, France, 2008.
[15] H. Song, G. Huang, F. Chauvel, and Y. Sun, "Applying MDE Tools at Runtime: Experiments upon Runtime Models," in Proceedings of the 5th International Workshop on Models at Run Time, Oslo, Norway, 2010.
[16] P. Graf and K. D. Muller-Glaser, "ModelScope: inspecting executable models during run-time," in ICSE Companion '08: Companion of the 30th International Conference on Software Engineering. New York, NY, USA: ACM, 2008, pp. 935–936.
[17] T. Schwalb, P. Graf, and K. D. Muller-Glaser, "Architektur für das echtzeitfähige Debugging ausführbarer Modelle auf rekonfigurierbarer Hardware," in Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen. Universitätsbibliothek Berlin, 2009, pp. 127–137.
[18] T. Schwalb, P. Graf, and K. D. Mueller-Glaser, "Monitoring Executions on Reconfigurable Hardware at Model Level," in 5th International MODELS Workshop on Models@run.time, Oslo, Norway, Oct. 2010.
[19] Eclipse Foundation, "Eclipse Modeling Project," 2010. [Online]. Available: http://www.eclipse.org/modeling/
[20] Xilinx, Virtex-II Pro and Virtex-II Pro X FPGA User Guide, v4.2, November 2007.
A model-driven based framework for rapid parallel SoC FPGA prototyping
Mouna Baklouti†∗, Manel Ammar†, Philippe Marquet∗, Mohamed Abid† and Jean-Luc Dekeyser∗
∗LIFL, Univ. Lille 1, INRIA Lille Nord Europe, UMR 8022, CNRS, F-59650, Villeneuve d'Ascq, France
Email: {mouna.baklouti,philippe.marquet,jean-luc.dekeyser}@lifl.fr
†CES Laboratory, Univ. Sfax, ENIS School, BP 1173, Sfax 3038, Tunisia
Email: [email protected], [email protected]
Abstract—Model-Driven Engineering (MDE) based approaches have been proposed as a solution to cope with the inefficiency of current design methods. In this context, this paper presents an MDE-based framework for rapid SIMD (Single Instruction Multiple Data) parametric parallel SoC (System-on-Chip) prototyping, to deal with the ever-growing complexity of the design process of such embedded systems. The design flow covers the design phases from system-level modeling to FPGA prototyping. The proposed framework allows the designer to easily and automatically generate a VHDL parallel SoC configuration from a high-level system specification model using the MARTE (Modeling and Analysis of Real-Time and Embedded systems) standard profile. It is based on an IP (Intellectual Property) library and a basic parallel SoC model. The generated parallel configuration can be adapted to the data-parallel application requirements. In an experimental setting, four steps are needed to generate a parallel SoC: data-parallel programming, SoC modeling, deployment and the generation process. Experimental results for a video application validate the approach and demonstrate that the proposed framework facilitates parallel SoC exploration.
I. INTRODUCTION
With the rising complexity of multimedia and radar/sonar signal processing applications, parallel programming techniques and multi-core Systems-on-Chip (SoC) are more and more used. Single Instruction Multiple Data (SIMD) systems have been shown to be powerful executors of data-intensive applications [1], especially in the pixel processing domain [2]. Many SIMD on-chip architectures, in particular based on FPGA (Field Programmable Gate Array) devices, have emerged to accelerate specific applications [3]–[6]. Compared to ASICs (Application Specific Integrated Circuits), FPGA devices are characterized by an increased capacity, smaller non-recurring engineering costs, and programmability [7]. Facing the ever-growing challenge of parallel SoC design, most of the proposed SIMD solutions are application-specific SoCs which lack flexibility: changing a SoC configuration may necessitate extensive redesign. While these specific systems provide good performance, they require long design cycles. The size of a parallel SoC and the complexity involved in its design are continuously outpacing designer productivity. An important challenge is to find adequate design methodologies that
efficiently address the issues of large and complex SoCs.

Nowadays, Computer-Aided Design tools are imperative to automate complex SoC design and reduce the time-to-market. Two approaches have been proposed to cope with this problem. Firstly, IP (Intellectual Property) reuse and platform-based design [8] are used to maximize the reuse of pre-designed components and to allow the customization of the system according to system requirements. Secondly, the Model-Driven Engineering (MDE) [9] approach has been introduced to raise the design abstraction level and to reduce design complexity. It stresses the use of models in the embedded systems development life cycle and argues for automation via model transformation and code generation techniques. Complex systems can be easily understood thanks to such abstract and simplified representations. Approaches based on MDE have been proposed as an efficient methodology for embedded systems design [10], [11]. An interesting model specification language is UML (Unified Modeling Language) [12], which proposes general concepts for expressing both behavioral and structural aspects of a system. The latest release of UML (2.0) has support for profiles that enable the language to be applied to particular application and platform domains with sophisticated extension mechanisms. As an example, the MARTE (Modeling and Analysis of Real-Time and Embedded systems) standard profile [13] is proposed by the OMG to add capabilities to UML for model-driven development of real-time and embedded systems. The MARTE profile enhances the possibility to model SW, HW and the relations between them.
Using the proposed framework, the designer focuses on modeling his needed SIMD configuration and not on how to implement it, since the system modeling is independent of any implementation detail. The model is specified using a unified language. The presented design flow is a library-based method that hides unnecessary details from the high-level design phases and provides an automated path from UML design entry to FPGA prototyping. So, it can be easily used by designers who are not experts in on-chip HW implementation. This makes our approach preferable to hand-crafted VHDL coding.
System concerns are represented in separate dimensions:
data-parallel coding, SoC modeling, IP selection and implementation. The implementation is performed via the generation tool, based on a model-to-text transformation using Acceleo [14]. The framework uses an IP library with various components (processors, memories, interconnection networks...) that can be selected in the deployment process to generate the needed SIMD configuration. The modeled SoC has to conform to a basic parallel SoC model, proposed in previous work [15], which is parametric, flexible and programmable.
In an experimental setting that validates our approach, we consider a video color conversion application, where we explore different parallel system configurations and decide on the best one to run the application. Experimental results show that the proposed framework considerably reduces design costs and facilitates modifying the system model and regenerating the implementation without relying on costly re-implementation cycles. Using the framework, we can create SIMD implementations that are fast enough to meet demanding processing requirements, are automatically generated from a high-level specification model to meet time-to-market constraints, and can easily be updated to provide different functionality.
The remainder of this paper is organized as follows. Section 2 discusses related work on model-based approaches to generate on-chip multi-processor or massively parallel systems. Section 3 presents the proposed MDE framework. A case study, which illustrates and validates the framework, is described in Section 4. The FPGA platform is chosen as the target platform since it is a good alternative to test and implement various parallel SoC configurations. Finally, Section 5 draws the main conclusions and proposes future research directions.
II. RELATED WORK
The high-level SoC design methodology is a rapidly emerging research area. There are many recent research efforts on embedded systems design using an MDE approach. In this context, different high-level synthesis approaches are currently being studied for different specification languages. For example, xtUML [11] defines an executable and translatable UML subset for embedded real-time systems, allowing the simulation of UML models and code generation for C oriented to different microcontroller platforms. In [16], an approach using VHDL synthesis from UML behavioral models is presented. The UML models are first translated into textual code in a language called SMDL, which can then be compiled into a target language such as VHDL. The translation from UML models to SMDL is performed using the aUML toolkit. In [17], a transformation tool called MODCO is presented, which takes a UML state diagram as input and generates HDL output suitable for use in FPGA circuit design. A HW/SW co-design is performed based on the MDA approach. XML is used to generate HDL from high-level UML diagrams. In these two works, only state-machine HW designs are described. In [18], a UML-based multiprocessor SoC design framework called Koski is described. An automated architecture exploration based on the system models in UML, as well as the automatic back and forward annotation of information in the design flow, can be performed. The proposed design flow provides an
Fig. 1. Parallel SIMD SoC configuration: 4 PEs, a 2D mesh neighboringnetwork and a crossbar based mpNoC
automated path from UML design entry to FPGA prototyping. The final implementation is application-specific. The proposed approach is based on synthesizable library components that are automatically tuned for a specific application according to the results of the architecture exploration.
Our approach is related to the design of massively parallel SoC and covers the design phases from system-level modeling and parallel programming to FPGA prototyping using the notion of transformations between models. The DaRT [10] (Data Parallelism to Real Time) project also proposes an MDA-based approach for SoC design that has many similarities with our approach in terms of the use of meta-modeling concepts. The DaRT work defines MOF-based meta-models to specify the application, the architecture, and the SW/HW association, and uses transformations between models as code transformations to optimize an association model. In DaRT, no data-parallel coding is specified and the code generation at RT (Register Transfer) level is dedicated to specific HW accelerators.
The proposed framework, presented in this paper, takes advantage of the MDE notion of transformation between models to generate a complete SIMD parallel SoC at RT level dedicated to computing data-intensive applications. Our approach is based on synthesizable library components and a few model transformations to generate the synthesizable VHDL code of the modeled SIMD SoC.
III. SIMD FRAMEWORK
The proposed framework is dedicated to generating different SIMD configurations derived from the basic parallel SoC model [15]. These configurations can then be directly simulated using available simulation tools or prototyped on FPGA devices using appropriate synthesis tools. Figure 1 illustrates a SIMD parallel SoC configuration composed of four Processing Elements (PE) connected in a 2D mesh topology. To handle parallel I/O transfers and point-to-point communications, a crossbar-based mpNoC (massively parallel Network on Chip) [19] is integrated. To accelerate and facilitate the design of a SIMD configuration, a model-driven framework is proposed. The framework allows the designer to model his needed configuration, derived from the basic provided SIMD SoC model. The designer
Fig. 2. Framework concepts
has to specify the system's parameters (number of PEs, memory size, neighboring topology) and the different components that will be integrated (mpNoC, neighborhood network, devices). The designer also has to code his data-parallel program using the specified data-parallel instruction set, depending on the chosen processor IP. A help manual is provided to the designer to facilitate the parallel programming and describe the different instructions to use according to the chosen processor.
The framework, in particular the deployment phase, is based on an IP library which contains dedicated IPs that can be directly integrated in the system. Providing an extensive library requires a significant effort. Currently, the IP library contains processors (MIPS, OpenRisc, NIOS II), networks (crossbar, shared bus and multi-stage networks), memories and some devices. To add new IP resources, the IP provider must adapt the IP to the architecture-specific interface (described in the help manual). Thus, a new component can be put into the library by following the interface format requirements. To assemble processors in the SIMD design, we distinguish two methodologies: reduction and replication. Reduction consists in reducing an available processor in order to build a PE with a small size that can be fitted in large quantities into an FPGA device. Replication consists in implementing the ACU as well as the PEs with the same processor IP, so that the design process is faster. There is clearly a compromise between the design time and the number of integrated PEs in the SIMD configuration, depending on the applied design methodology. The designer can select the suitable methodology according to his application constraints. The three processors of the IP library are provided with both methodologies.
At this step, the designer can generate different implementations while integrating different IPs. The deployment is also responsible for loading the binary data-parallel program into the ACU instruction memory. The SIMD generation approach is depicted in Figure 2. This approach allows a flexible and rapid platform development and increases platform end-user productivity.
To generate a SIMD configuration at RT level, an MDE-based design flow, presented in Figure 3, has been developed. The proposed flow uses two meta-models: the MARTE meta-model and the Deployed meta-model. All meta-model concepts are specified as UML classes and then converted into Eclipse/EMF models [20]. The generation process is based on model transformations implemented as QVT (Query, Views, Transformations) resources, standardized by the OMG.

Fig. 3. MDE-based design flow
The designer can generate a SIMD massively parallel SoC configuration in four steps: data-parallel programming, SoC modeling, deployment and then implementation generation.
A. Data-parallel programming
The designer has to write his data-parallel program using the provided data-parallel instruction set. Based on the available processor compilers (miniMIPS, OpenRisc 1200 and NIOS II) in the IP library and the developed special parallel instructions, the designer can generate his parallel program binary. For the miniMIPS processor, an extended parallel MIPS assembly language [21] is developed. For the OpenRisc and NIOS processors, high-level asm macros are defined; they can be used in any C program for control and communication instructions. The NIOS II IDE (Integrated Development Environment) and the OR1Ksim [22] tools are used with the NIOS and OpenRisc processors, respectively. The developed SW chain is a multi-compiler chain that is responsible for generating the SW code depending on the specified target processor.
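The idea of such a multi-compiler chain can be sketched as a simple dispatch on the target processor. The tool names and flags below are placeholders for illustration, not the actual commands of the NIOS II IDE, OR1Ksim or the extended MIPS assembler.

```python
# Sketch of a multi-compiler SW chain: pick the toolchain from the processor
# IP selected at deployment. Command names and flags are assumptions.

TOOLCHAINS = {
    "miniMIPS": ["pmips-asm"],          # extended parallel MIPS assembler
    "OpenRisc": ["or32-gcc", "-O2"],    # C with l.* asm macros, run on OR1Ksim
    "NIOS II": ["nios2-gcc", "-O2"],    # C with IOWR/IORD macros, NIOS II IDE
}

def build_command(processor, source, binary):
    """Assemble the (hypothetical) compiler command line for one target."""
    if processor not in TOOLCHAINS:
        raise ValueError("no toolchain for " + processor)
    return TOOLCHAINS[processor] + [source, "-o", binary]
```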
Some particular instructions are specified to be used in the programs as delimiters for parallel and sequential code. Table I shows three examples of instructions from the provided data-parallel instruction set. Clearly, these instructions depend on the processor instruction set. At this step, a SW library is provided. It includes pre-implemented application algorithms such as matrix multiplication, an FIR (Finite Impulse Response) filter, a reduction algorithm, image rotation, color conversion (RGB to YIQ, RGB to CMYK), etc.
After generating the executable SW, the second step consists in modeling the HW system.
B. SoC modeling
The designer must specify the architecture models using any UML 2.0 compliant tool, applying the MARTE profile. The most important UML diagrams used in our approach to specify the system are the Class, Composite Structure and Deployment diagrams. The modeling of SIMD SoC configurations relies on the use of UML and the MARTE profile. Three MARTE packages are used: the Hardware Resource Modeling (HRM), the Repetitive Structure Modeling (RSM) and the Generic Component Model (GCM) packages [23]. The HRM is intended to describe the HW platform by specifying its different elements. At the end, the modeled HW resources represent the whole system. In our approach, only the HRM HW Logical
TABLE I
SIMD PARALLEL MACROS

P REG SEND (reg,dir,dis,adr)
  Description: Neighboring SEND: send data (in reg) from source to destination via the neighboring network.
  miniMIPS: p addi r1,r0,dir; p addi r1,r1,dis; p addi r1,r1,adr; p SW reg,0(r1)
  OpenRisc: l.addi r1,r0,dir; l.addi r1,r1,dis; l.addi r1,r1,adr; l.sw 0x0(r1),reg
  NIOS: IOWR(WRP B, addr, data), where addr(11)='0', addr(10:3)=dis and addr(2:0)=dir

P REG REC (reg,dir,dis,adr)
  Description: Neighboring RECEIVE: receive data (in reg) from the source.
  miniMIPS: p addi r1,r0,dir; p addi r1,r1,dis; p addi r1,r1,adr; p LW reg,0(r1)
  OpenRisc: l.addi r1,r0,dir; l.addi r1,r1,dis; l.addi r1,r1,adr; l.lwz reg,0x0(r1)
  NIOS: data = IORD(WRP B, addr), where addr(11)='0', addr(10:3)=dis and addr(2:0)=dir

P GET IDENT (reg)
  Description: read identity
  miniMIPS: p lui r1,0x2; p ori r1,r1,0; p LW reg,0(r1)
  OpenRisc: l.movhi r1,0x2; l.lwz reg,0x0(r1)
  NIOS: NIOS2 READ CPUID(id)
Fig. 4. PU modeling in the case of a linear configuration
sub-package is used. It allows describing information about the kind of components (HwRAM, HwProcessor, HwBus, etc.), their characteristics, and how they are connected to each other. The architecture is graphically specified at a high abstraction level with HRM. Multidimensional data arrays and powerful constructs of data dependencies are managed thanks to the RSM package. It defines stereotypes and notations to describe in a compact way the regularity of a system's structure or topology. The structures considered are composed of repetitions of structural elements interconnected via a regular connection pattern. It provides the designer a way to efficiently and explicitly express models with a high number of identical components. The concepts found in this package allow concisely modeling large regular HW architectures such as multi-processor architectures. Finally, the GCM package is used to specify the nature of the flow-oriented communication paradigm between SoC components.
The modeling process is done in an incremental way. The designer begins by modeling the elementary components: PE, ACU, memories, mpNoC and I/O device. Then, the whole configuration is modeled through successive compositions. Figure 4 illustrates the elementary processing unit (PU). It is composed of a PE and its local data memory. The class named "Elementary processor" is stereotyped HwResource in the case of the reduction methodology, or HwProcessor in the case of the replication methodology. It has a bidirectional port, stereotyped FlowPort, to connect the data memory. The class "Local memory" is stereotyped HwMemory with a parametric tagged value addressSize. In the same manner, the ACU memories have a parametric size. The PU has one port to communicate with the ACU and a number of neighboring ports equal to the number of its neighboring connections. In Figure 4, it has two neighboring ports, since each PE can communicate with its neighbor in the east or west direction. If the designer
Fig. 5. 1D configuration modeling
chooses to integrate the mpNoC in the SIMD configuration, he must add two ports, "mpNoC in" and "mpNoC out", to ensure the communications through the mpNoC.
We distinguish between 1D and 2D mppSoC configurations. They differ in the modeling of the interconnections between PUs. In the case of a 1D configuration, the number of PEs is equal to the tagged value Shape of the stereotype Shaped applied on the PU class. To model a linear neighboring network, the interconnection link between the East and West ports is stereotyped InterRepetition. Since the PU on one edge is not connected to the PU on the opposite edge, the tagged value isModulo is set to false. The repetitionSpaceDependence attribute is used to specify the position of the neighbor of the element on which the inter-repetition dependency is defined. In this case, its value is equal to {1}, since each PE[i] is connected to PE[i+1]. Figure 5 shows the mppSoC configuration modeling integrating a linear neighboring network and the mpNoC. The link connector, stereotyped Reshape, between the PU and the ACU shows that each PU is connected to the ACU in order to receive the execution orders. To connect the PUs with the mpNoC, two Reshape connectors are expressed between the two ports of each PU and the corresponding ports of the mpNoC. The latter has a multiplicity equal to 1. The repetitionSpace tag is equal to the number of PEs. The patternShape tag is equal to 1, indicating that the mpNoC port is distributed among the ports of the PEs. The same modeling is followed in the case of a ring neighboring network; the only difference is the isModulo tagged value, which is set to true.
In the same manner, we can model a 2D SIMD configuration. We just need to know how to model the neighboring links based on the MARTE profile. Figure 6 presents a configuration
Fig. 6. 2D configuration modeling (with a mesh neighboring network)
modeling integrating a 2D mesh neighboring network. We notice that the PU class is modeled with 4 ports dedicated to inter-PE communications in the east, west, north and south directions. In this case, the repetitionSpaceDependence tagged value is equal to {1,0}, indicating that each PE[i,j] is connected to its neighbor PE[i+1,j] to ensure the east and west links. In addition, this tagged value is equal to {0,1} for the north and south links, to ensure that each PE[i,j] is connected to its neighbor PE[i,j+1]. For a mesh topology, the tagged value isModulo is set to false, since there are no connections on the edges. However, it is set to true in the case of a torus topology. The Xnet network is modeled like the 2D mesh; the designer just has to model the links on the diagonals.
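The neighbor resolution implied by the InterRepetition stereotype can be sketched as follows: repetitionSpaceDependence supplies the offset vector and isModulo decides whether edges wrap (ring, torus) or not (linear, mesh). This is an illustrative model of the semantics, not code from the framework.

```python
# Illustrative model of InterRepetition neighbor resolution:
# index  = position of the PE in the repetition space
# shape  = Shape of the repeated PU
# dependence = repetitionSpaceDependence offset vector
# is_modulo  = isModulo tagged value

def neighbor(index, shape, dependence, is_modulo):
    """Return the neighbor of PE[index], or None on a non-wrapping edge."""
    out = []
    for i, s, d in zip(index, shape, dependence):
        n = i + d
        if is_modulo:
            n %= s               # ring / torus: edges wrap around
        elif not 0 <= n < s:
            return None          # linear / mesh: edge PE has no neighbor
        out.append(n)
    return tuple(out)
```

For example, in a linear network of 4 PEs with dependence {1}, PE[3] has no east neighbor, while with isModulo set to true (ring) it wraps to PE[0]; in a 2D mesh with dependence {1,0}, PE[1,2] resolves to PE[2,2].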
C. Deployment
As described in the previous subsection, a SIMD configuration can be modeled at a high abstraction level. To generate an executable low-level model, the elementary modeled components should be associated with an existing implementation based on the provided IP library. The deployment allows moving from a general platform (Platform Independent Model) to a specific platform (Platform Specific Model), according to the MDA approach. At this step, the designer can generate and evaluate different configurations. In fact, the deployment makes it possible to specify an implementation for each elementary concept among a set of possibilities. It concerns the processor IP, the instruction memory, the mpNoC interconnection network and the I/O device IP if it exists. At this stage, the binary data-parallel program is specified as the memory initialisation file of the main instruction memory. In fact, in our case we deal with a single data-parallel program (one of the advantages of a SIMD architecture), so no mapping of tasks needs to be performed. Thus, the mapping of the application to the hardware architecture is systematic. Figure 7 expresses the deployment of a "hardwareIP" on the "Elementary processor". The concept of codeFile is used to specify the code.
A final transformation chain, MARTE2VHDL, is developed to generate the synthesizable VHDL implementation of the modeled SIMD configuration.
D. Implementation generation
The MARTE2VHDL transformation is based on the Deployed model and the IP library to generate the corresponding
Fig. 7. Deployment of the PE
synthesizable VHDL implementation, depending on the modeled configuration. A model conforming to the Deployed meta-model is generated via the UML2MARTE transformation. This model is then analysed in order to deduce the specified parameters. The number of PEs, the memory size, the processor design methodology and the topology of the neighboring network are extracted from the UML diagrams. The other configurable components (processor IP, mpNoC interconnection network, etc.) are specified in the deployment step. The developed model-to-text transformation is based on templates. It uses the Acceleo tool [14], which is part of the Eclipse Model to Text project and provides an implementation of the MOF Model to Text OMG standard. The following code example illustrates how the type of the processor is deduced (getPECodeFile) in the generation step:
[query public getPECodeFile(m : Model) : CodeFile =
  self.ownedElement
    ->select(oclIsKindOf(CodeFile) and name = 'PEImpl codefile')
    ->asOrderedSet()->first()]
Using this MDE-based framework, SIMD SoC design is accelerated: the VHDL implementation is automatically generated through model transformations, and the SoC model remains independent of any implementation detail, which makes the design flow easy to use. The proposed framework also facilitates SoC exploration and helps the user choose the best configuration for a given application. The next section illustrates the use of this framework in a real application context.
IV. CASE STUDY
An RGB-to-CMYK color conversion application, widely used in color printers and extracted from the EEMBC benchmark [24], has been developed with the provided data-parallel instruction set. The program is written using high-level macros (Table I). The binary is then generated for the selected processor by invoking the corresponding compiler. The proposed framework allowed us to generate different suitable SIMD configurations, and an FPGA is used for real experiments.
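The paper does not reproduce the benchmark source, but the kernel at the heart of this application is the classic naive RGB-to-CMYK conversion. The following C sketch is our own illustration (neither the EEMBC code nor the framework's data-parallel macros): it shows the per-pixel computation that each PE would apply to its own slice of the image.

```c
#include <stdint.h>

/* Illustrative naive RGB-to-CMYK kernel (a sketch, not the EEMBC or
 * framework code). In the SIMD system, each PE would run this same
 * scalar body on its own slice of the pixels. */
static uint8_t min3(uint8_t a, uint8_t b, uint8_t c) {
    uint8_t m = a < b ? a : b;
    return m < c ? m : c;
}

void rgb_to_cmyk(const uint8_t *rgb, uint8_t *cmyk, int npixels) {
    for (int i = 0; i < npixels; i++) {
        uint8_t c = 255 - rgb[3 * i];      /* cyan    = 255 - R */
        uint8_t m = 255 - rgb[3 * i + 1];  /* magenta = 255 - G */
        uint8_t y = 255 - rgb[3 * i + 2];  /* yellow  = 255 - B */
        uint8_t k = min3(c, m, y);         /* black: gray part  */
        cmyk[4 * i]     = c - k;           /* remove the gray   */
        cmyk[4 * i + 1] = m - k;           /* component from    */
        cmyk[4 * i + 2] = y - k;           /* the color planes  */
        cmyk[4 * i + 3] = k;
    }
}
```

Each pixel is processed independently, which is what makes the application a natural fit for the SIMD configurations explored here.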
A. HW platform
The development board is the Altera DE2-70 [25], equipped with a Cyclone II EP2C70F896C6 FPGA offering 68,416 Logic Elements (LEs) and 250 M4K RAM blocks. The software tools are Quartus II v9.0, which allows synthesizing
TABLE II
SYNTHESIS RESULTS

PE   Proc. IP   Design   LEs (%)   ACU mem.   PE mem.   Memory (%)
                                   (bytes)    (bytes)
 8   miniMIPS   rep.       71        4096       1024       18
32   miniMIPS   red.       93        4096       2048       66
 8   OpenRisc   rep.       91        4096       1024       22
16   OpenRisc   red.       98        4096       4096       36
48   NIOS       rep.       79        8192        512       87
and prototyping the design on the FPGA, and ModelSim-Altera v6.4a, which allows simulating and debugging it. To test the color-conversion application, two peripherals are used: a 1-Mpixel TRDB D5M camera and an 800×RGB×480-pixel TRDB LTM LCD display. Two external memories, an SDRAM and an SRAM, are also used: the implemented VHDL camera driver stores the captured data directly in the SDRAM, to be read by the PEs as required, and a VHDL SRAM controller stores the processed data in the SRAM and fetches it as required by the LCD.
B. SIMD configurations
For the tested application, only the mpNoC has been integrated in the system model (no neighboring network), since parallel data transfers must be guaranteed: all PEs need to read data from the SDRAM and then write data to the SRAM. In this example, the processing of each pixel must not exceed 10.42 ns to ensure real-time processing, so that an 800×480-pixel frame is processed within 4 ms. The same system model is used for all implementation generations. It is described in a composite structure diagram, as illustrated in Figure 8, which models all the hardware components composing the system as well as their connections. Only the deployment diagram changes from one configuration to another, in order to use different processors, memories and interconnection networks. Changing from one SIMD configuration to another takes just a few milliseconds, and the re-generation process is performed rapidly. The low-level synthesizable models from the IP library are used for the final implementation. The generated configurations can be directly simulated to measure execution times and assess the performance of the modeled SIMD systems.
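As a quick sanity check of the quoted figures (an arithmetic restatement of the text, not part of the authors' toolchain), the per-pixel deadline follows directly from the frame size and the 4 ms frame budget:

```c
/* Sanity check of the timing budget quoted above: an 800x480 frame
 * processed within 4 ms leaves 4 ms / 384000 pixels, i.e. about
 * 10.42 ns of processing time per pixel. */
double per_pixel_budget_ns(int width, int height, double frame_ms) {
    double pixels = (double)width * (double)height;
    return frame_ms * 1e6 / pixels;  /* 1 ms = 1e6 ns */
}
```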
Table II shows the synthesis results obtained by varying the SIMD parameters and components while integrating the maximum number of PEs on the targeted Cyclone II FPGA. All these configurations integrate a crossbar-based mpNoC, since the crossbar allows the fast, non-blocking parallel data transfers necessary for real-time image processing applications. We clearly notice that the reduction methodology allows integrating a larger number of PEs on the chip than the replication methodology: since the miniMIPS is smaller than the OpenRisc, we can reach 32 PEs on the FPGA, compared to 16 PEs when using the OpenRisc IP. The implementation results also show that the NIOS processor is optimized for the Altera FPGA: more than 48 PEs can be integrated on the chip.
Figure 9 shows the execution time results obtained when prototyping the generated configurations on the Cyclone II
Fig. 9. Execution times for different SIMD configurations
TABLE III
DIFFERENCES BETWEEN TWO DESIGN SOLUTIONS

                            Generic implem. with   Generic implem. with
SIMD config.                a reduced processor    a replicated processor
Design time using
the framework               15 minutes             40 seconds
Design time without
using the framework         1 month                7 days
FPGA. These times are measured by running the color-conversion application on the parallel FPGA-based configurations. The different SIMD SoC configurations achieve better results as the number of PEs working in parallel increases. The performance of the system is also closely related to the processor type and the design methodology. The experimental results show that a SIMD configuration composed of more than 8 PEs is needed to ensure real-time processing. According to these results, we can choose the best configuration: the proposed approach easily allows the exploration of several platform architecture alternatives.
To illustrate the efficiency of the model-based framework, Table III compares the implementation design time using the framework with results obtained by the same designer using a conventional manual implementation method, without any framework. The design time measured for the second configuration (replication methodology) is just the time needed to modify the first configuration (reduction methodology). The results in Table III show that, according to the estimated design times, the proposed framework is a far better solution for accelerating the design of specific SIMD parallel SoCs than manual design: two months were necessary to reduce an open-source processor to obtain a small PE (with only execution units) [21]. From these results, we can conclude that the model-based design framework allows a very fast SIMD implementation.
This case study illustrates a design framework that facilitates SIMD SoC implementation for data-parallel applications. Through the Model-Driven Engineering approach for parallel SoC design presented in this work, a designer can specify the needed SIMD configuration using UML models and the MARTE profile at a high abstraction level and automatically generate its implementation at RT level. The designer can easily and rapidly generate different SoC configurations
[Figure 8: the Main_architecture composite structure, with the ACU and its instruction memory (ACU_memory), the «shaped» PU array, the mpNoC_router, and two I/O devices (device: Device, device2: Device2), connected through mpNoC input/output ports and «reshape» links.]
Fig. 8. SIMD configuration composite structure diagram
to look for the best alternative for a given application.
V. CONCLUSIONS AND FUTURE WORK
A Model-Driven Engineering (MDE) approach for SIMD SoC design was presented. The proposed design flow is composed of four steps: application programming, system modeling, deployment and implementation generation. The fundamental MDE notion of transformation between models is used to generate a SIMD configuration at register transfer level from its model at a high abstraction level. The framework facilitates exploration by rapidly generating different SoC configurations, in order to choose the one that best fulfills the application requirements. Experimental results show that the proposed framework strongly contributes to increasing the designer's productivity. The case study with a video processing application proved that the presented design flow can facilitate the design of parallel SIMD SoC systems and reduce implementation costs. Besides, the use of UML and MDE promotes the reusability of application and high-level system models.
One future direction is the modeling of data-parallel applications. We also intend to develop a high-level exploration step to automatically generate the most suitable application-specific SIMD SoC configuration.
REFERENCES
[1] W. C. Meilander, J. W. Baker, and M. Jin, "Importance of SIMD Computation Reconsidered," in International Parallel and Distributed Processing Symposium, 2003.
[2] R. Kleihorst et al., "An SIMD smart camera architecture for real-time face recognition," in Abstracts of the SAFE & ProRISC/IEEE Workshops on Semiconductors, Circuits and Systems and Signal Processing, 2003.
[3] R. Rosas, A. de Luca, and F. Santillan, "SIMD Architecture for Image Segmentation using Sobel Operators Implemented in FPGA Technology," in Proc. of the 2nd International Conference on Electrical and Electronics Engineering (ICEEE'05), 2005.
[4] P. Bonnot, F. Lemonnier, G. Edelin, G. Gaillat, O. Ruch, and P. Gauget, "Definition and SIMD implementation of a multi-processing architecture approach on FPGA," in Proc. of DATE, 2008.
[5] F. Schurz and D. Fey, "A Programmable Parallel Processor Architecture in FPGA for Image Processing Sensors," in Integrated Design and Process Technology, IDPT, 2007.
[6] X. Xizhen and S. G. Ziavras, "H-SIMD machine: configurable parallel computing for matrix multiplication," in International Conf. on Computer Design: VLSI in Computers and Processors, 2005, pp. 671–676.
[7] P. Paulin, "DATE panel: Chips of the future: soft, crunchy or hard?" in Proc. Design, Automation and Test in Europe, 2004, pp. 844–849.
[8] A. Sangiovanni-Vincentelli, L. Carloni, F. D. Bernardinis, and M. Sgroi, "Benefits and challenges for platform-based design," in Proc. DAC, 2004, pp. 409–414.
[9] D. Schmidt, "Model-driven Engineering," IEEE Computer, vol. 39, no. 2, 2006.
[10] C. D. L. Bond and J.-L. Dekeyser, "Metamodels and MDA transformations for embedded systems," in FDL'04, Lille, France, 2004.
[11] S. Mellor and M. Balcer, Executable UML: A Foundation for Model-Driven Architecture. Boston: Addison-Wesley, 2002.
[12] O. M. Group. (2004, October) UML 2 superstructure (available specification). [Online]. Available: http://www.omg.org/cgi-bin/doc?ptc
[13] L. Rioux, T. Saunier, S. Gerard, A. Radermacher, R. de Simone, T. Gautier, Y. Sorel, J. Forget, J.-L. Dekeyser, A. Cuccuru, C. Dumoulin, and C. Andre, "MARTE: A new profile RFP for the modeling and analysis of real-time embedded systems," in UML-SoC'05, DAC 2005 Workshop UML for SoC Design, Anaheim, CA, June 2005.
[14] Acceleo. (2009). [Online]. Available: http://www.acceleo.org
[15] M. Bakouti, P. Marquet, M. Abid, and J.-L. Dekeyser, "IP based configurable SIMD massively parallel SoC," in PhD Forum of 20th International Conference on Field Programmable Logic and Applications (FPL), Milano, Italy, August 2010.
[16] D. Bjorklund and J. Lilius, "From UML Behavioral Models to Efficient Synthesizable VHDL," in 20th IEEE NORCHIP Conference, Copenhagen, Denmark, November 2002.
[17] F. P. Coyle and M. A. Thornton, "From UML to HDL: a Model Driven Architectural Approach to Hardware-Software Co-Design," Information Systems: New Generations Conference (ISNG), pp. 88–93, April 2005.
[18] T. Kangas, P. Kukkala, H. Orsila, E. Salminen, M. Hannikainen, and T. Hamalainen, "UML-based multiprocessor SoC design framework," ACM Trans. Embedded Computing Systems (TECS), vol. 5, no. 2, pp. 88–93, May 2006.
[19] M. Bakouti, Y. Aydi, P. Marquet, M. Abid, and J.-L. Dekeyser, "Scalable mpNoC for Massively Parallel Systems - Design and Implementation on FPGA," Journal of Systems Architecture (JSA), vol. 56, pp. 278–292, 2010.
[20] EMF. Eclipse Modeling Framework. [Online]. Available: http://www.eclipse.org/emf
[21] M. Bakouti, P. Marquet, M. Abid, and J.-L. Dekeyser, "A design and an implementation of a parallel based SIMD architecture for SoC on FPGA," in Conference on Design and Architectures for Signal and Image Processing DASIP'08, Bruxelles, Belgium, November 2008.
[22] OpenCores. OR1200 OpenRISC processor. [Online]. Available: http://opencores.org/openrisc,or1200
[23] O. M. Group. UML Profile for MARTE: Modeling and Analysis of Real-Time Embedded Systems, version 1.0. [Online]. Available: http://www.omg.org/spec/MARTE/1.0/PDF/
[24] EEMBC. (2010) The Embedded Microprocessor Benchmark Consortium. [Online]. Available: http://www.eembc.org/home.php
[25] Terasic. (2010) Altera DE2-70 Board. [Online]. Available: http://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&No=226
A State-Based Modeling Approach for Fast Performance Evaluation of Embedded System Architectures
Abstract— Abstract models help system architects evaluate hardware/software architectures and thereby cope with the ever-increasing complexity of embedded systems. Efficient methods are necessary to correctly model system architectures and to enable early performance evaluation and fast exploration of the design space. In this paper, we present a specific modeling approach to improve the evaluation of non-functional properties of embedded systems. The contribution is a computation method defined to improve the modeling of the properties used for the assessment of architecture performance. This method favors the creation of abstract transaction-level models and significantly reduces simulation time while preserving the accuracy of results. The benefits of the proposed approach for the performance evaluation of system architectures are highlighted through the analysis of two case studies.
I. INTRODUCTION

High performance applications supported by modern embedded devices imply the definition of heterogeneous multiprocessor platforms. The process of system architecting consists of optimally defining the organization and performance of such platforms in terms of processing, communication and memory resources, according to functional and non-functional requirements. Typical non-functional requirements considered for embedded systems are timing constraints, power consumption and cost. Fast exploration of the design space and evaluation of non-functional properties early in the development process have thus become mandatory to avoid costly iterations [1]. In this context, abstract models are needed to cope with the ever-increasing complexity of embedded systems.
As reported in [2], models for performance evaluation are usually created by applying the principles of the Y-chart model. Following this approach, a model of the system application is mapped onto a model of the considered platform, and the resulting description is then analyzed analytically or by simulation. Modeling of computation and modeling of communication can be strictly separated and defined at various abstraction levels on both the application and platform sides [3]. By raising the level of design abstraction above the Register Transfer Level (RTL), Transaction Level Modeling (TLM) offers a good trade-off between modeling accuracy and simulation speed, and has therefore recently emerged in the design process of embedded systems [4]. However, the achievable simulation speed of transaction-level models is limited by the number of required transactions, and the integration of non-functional properties can significantly reduce simulation speed. Therefore, specific modeling techniques are required to correctly abstract non-functional properties and improve simulation efficiency.
In this paper, an approach for the creation of efficient transaction-level models for performance evaluation of system architectures is presented. The contribution is a specific computation method proposed to improve the expression of the non-functional properties assessed for performance evaluation. The proposal is based on the distinction between the description of system evolution, driven by transactions, and the description of non-functional properties. This separation of concerns reduces the number of events in transaction-level models and favors the creation of abstract models. Simulation speed-up is achieved thanks to a significant reduction of the required transactions, while the method preserves accuracy in the evaluation of performance. The method has been validated through the use of a specific modeling framework based on the SystemC language [5]. The proposed approach provides fast performance evaluation and allows efficient exploration of different architecture configurations. Its benefits are highlighted through two case studies: the created models are simulated to evaluate performance in terms of processing resources and memory cost, in order to correctly fix platform properties.
The remainder of this paper is structured as follows. Section II analyzes related modeling and simulation approaches used for performance evaluation of embedded systems. In Section III, the proposed modeling approach is presented. In Section IV, we detail the computation method used to improve the simulation speed of models. In Section V, we describe the implementation of the proposed approach in the considered modeling environment. Section VI highlights the benefits of the approach through two case studies. Finally, conclusions are drawn in Section VII.
II. RELATED WORK

Performance evaluation of embedded systems has been approached in many ways at different levels of abstraction. A good survey of various methods, tools and environments for
Sébastien Le Nours, Anthony Barreteau, Olivier Pasquier Univ Nantes, IREENA, EA1770, Polytech-Nantes, rue C. Pauc, Nantes, F-44000 France
{sebastien.le-nours, anthony.barreteau, olivier.pasquier}@univ-nantes.fr
978-1-4577-0660-8/11/$26.00 ©2011 IEEE
early design space exploration is presented in [6]. Typically, performance models capture characteristics of system architectures and are used to obtain reliable data on resource usage. For this purpose, performance evaluation can be performed without considering a complete description of the system functionalities. This abstraction enables efficient simulation speed and favors early performance evaluation. Workload models are then defined to represent the computation and communication loads that applications cause on platforms when executed; they are mapped onto platform models, and the resulting architecture models are simulated to obtain performance data. Among simulation-based approaches, TLM has recently received wide interest in the industrial and research communities as a way to improve system design and productivity. Transaction-level models make it possible to hide unnecessary details of communication and computation. Formally, a transaction has been defined in [4] as the data transfer or synchronization between two modules at an instant determined by the hardware/software system specification. The different levels of abstraction considered in TLM approaches are classified according to the granularity of computation and communication and the time accuracy [3][4]. TLM is supported by languages such as SystemC [5] and SystemVerilog [7], notably through the TLM-2.0 standard promoted by OSCI [8]. The work presented in [9] gives a quantitative analysis of the speed/accuracy trade-off in transaction-level models. Typically, with the SystemC language, simulation speed is related to the number of thread context switches, which usually grows with the number of modules. The approach presented in [10] attempts to transform the structural description of designs by aligning concurrency along modules in order to minimize switches; the transformation technique re-assigns the concurrency along the dataflow while keeping the functionality of the model unchanged.
In [11], a method is presented to minimize the number of synchronization points in a system description by optimizing the granularity of transactions. In the following, we adopt a similar approach in order to reduce the number of events required in transaction-level models created for the evaluation of non-functional properties.
Among existing approaches for performance evaluation of embedded systems, different optimization objectives are addressed in order to help designers fix platform parameters early in the development process. The design framework presented in [12] supports system modeling at different levels of abstraction; its architecture exploration step mainly focuses on the optimization of allocation and partitioning. The system architecture consists of processing elements and memories, selected by the system designer as part of the decision making. In [13], the proposed methodology allows architectural exploration at different levels of abstraction: candidate architectures are selected using analytical modeling and multi-objective optimization, taking into account parameters such as processing capacities, power consumption and cost, and potential solutions are then simulated at transaction level using SystemC. In [14], performance evaluation is performed by a combined simulation, associating functionalities and timing in one single simulation model. The performance of each feasible implementation is then assessed with respect to a given set of stimuli, by means of average
latency and average throughput. The design framework proposed in [1] aims at evaluating non-functional properties such as power consumption and temperature. In this approach, the application is described through a model called a communication dependency graph, completed by SystemC models of non-functional properties. Simulation is then performed to evaluate the achieved power consumption. The approaches presented in [15] and [16] both combine UML2 descriptions for application modeling with platform modeling in SystemC for performance evaluation. Applications are modeled in terms of services required from the underlying platform. Workload models are defined to express the processing and communication load an application causes on a platform when executed. These workload models do not contain timing information; it is left to the platform model to determine how long it takes to process the workloads. Our approach mainly differs from the above in the way the system architecture is modeled and the workload models are defined. Besides, we pay specific attention to the optimization of models in order to improve simulation speed.
III. CONSIDERED MODELING APPROACH

The considered modeling approach aims at creating approximately-timed models used for the evaluation of properties related to system architectures. It is based on a single view that combines a structural description of the system under study with the non-functional properties relevant to the considered hardware and software resources. This approach is illustrated in Figure 1.
[Figure 1: the lower part shows the considered system architecture (functions F11 and F12 on processor P1, F2 on P2, a communication node and the memory Mem1); the upper part shows the model of the system architecture, in which activities A1 and A2 exchange transactions M and Q, and the state machine of A2 (states s0 to s3, time conditions t = Tj, Tk, Tl) updates the properties CcA2 and McA2.]
Figure 1. Considered modeling approach for evaluation of non-functional properties of system architectures.
The lower part of Figure 1 depicts a typical platform made of communication nodes, memories and processing resources. Processing resources are classified as processors and dedicated hardware resources. In Figure 1, F11, F12 and F2 represent functions of the system application; they are allocated on the processing resources P1 and P2 to form the system architecture. For clarity, the communications and memory accesses
induced by this allocation are not represented. The upper part of the figure depicts the model of the system architecture. This model exhibits the transactions exchanged between activities and the utilization of the platform resources. It is based on an activity diagram notation inspired from [17]: single-arrow links correspond to transactions exchanged between activities, and communication conforms to the rendezvous protocol. The behavior of each activity exhibits waiting conditions on input transactions and the production of output transactions; it also expresses the use of processing and memory resources, considering the allocation of functions. Transitions between states are expressed as waiting transactions, time conditions or logical conditions on internal variables. In Figure 1, the use of processing resources due to the execution of function F2 on P2 is modeled by the evolution of the parameter denoted CcA2. In this simple example, Ccs1 operations are first executed for a duration set to Tj after reception of transaction M. The production of transaction Q is done once state s3 is finished. The parameter McA2 describes the evolution of the amount of memory required during the execution of activity A2. The internal variables related to each activity can be influenced by the data associated with the input transaction M.
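A minimal sketch of how such an activity could be encoded is given below. This is our own illustration, not the authors' implementation: each state carries a duration and the associated CcA2/McA2 loads, and a step function walks the states from the arrival of M to the production of Q.

```c
/* Hedged sketch (an illustration, not the authors' code) of an
 * activity such as A2. Each state carries a duration (Tj, Tk, ...)
 * and the loads the text denotes CcA2 (processing) and McA2 (memory). */
typedef struct {
    double duration;  /* time spent in the state                  */
    double cc;        /* processing load CcA2 during the state    */
    double mc;        /* memory usage McA2 during the state       */
} State;

/* Walk the states from the arrival time of transaction M, recording
 * one (timestamp, CcA2, McA2) observation per state; the returned
 * instant is when the output transaction Q is produced. */
double run_activity(double t_m, const State *states, int nstates,
                    double *t_trace, double *cc_trace, double *mc_trace) {
    double t = t_m;
    for (int i = 0; i < nstates; i++) {
        t_trace[i]  = t;
        cc_trace[i] = states[i].cc;
        mc_trace[i] = states[i].mc;
        t += states[i].duration;
    }
    return t;  /* production instant of Q */
}
```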
Following this approach, the resulting model incorporates quantitative properties defined analytically and relevant to the use of processing resources, communication nodes and memories. These analytical expressions of quantitative properties, and the related time properties, are directly influenced by the characteristics of the resources considered to support function execution; they are provided by estimations and measurements. Using languages such as SystemC, the created models can then be simulated to evaluate the time evolution of the performance obtained for a given set of stimuli. Various platform configurations and function allocations can be compared by considering different descriptions of the behavior of activities. In the following, the presented contribution concerns the optimization of the descriptions of activities in order to improve the simulation time of such models.
IV. PROPOSED COMPUTATION METHOD OF NON-FUNCTIONAL PROPERTIES OF SYSTEM ARCHITECTURES
As previously discussed, the simulation speed of transaction-level models can be significantly improved by avoiding context switches between threads. The proposed computation method relies on the same principle as the temporal decoupling supported by the loosely-timed coding style defined by OSCI. Using this coding style, parts of the model are permitted to run ahead in a local time until they reach the point where they need to synchronize with the rest of the model. The proposed method can be seen as an application of this principle to the creation of models for the evaluation of architecture performance: it aims at minimizing the number of transactions required for the description of the properties assessed for performance evaluation. Figure 2 illustrates the application of the proposed computation method.
Figure 2. Comparison of two modeling approaches in order to minimize the amount of required transactions in models used for performance evaluation.
Figure 2 depicts the transactions exchanged between two activities and the behavior of the receiving activity, denoted A2. The upper part of the figure corresponds to a description with 4 successive transactions. The durations between the successive transactions are denoted Δt1, Δt2 and Δt3; they are relevant to the communication node used for the transfer of data. In this transaction-based modeling approach, the considered property cA2 evolves each time a transaction is received. The lower part of the figure considers the description of activity A2 at a higher abstraction level: only one transaction occurs, and its content is defined at a higher granularity. However, the evolution of the property cA2 can be preserved by separating it from the evolution of the activity behavior. In that case, the duration Ts corresponds to the time elapsed between the first and the last transaction considered in the upper part of the figure. It is computed locally, relative to the arrival time of the input transaction M, and it defines the next output event; in Figure 2, this is denoted by the action ComputeAfterM. The time condition is evaluated during state s0 according to the evolution of the simulation time, denoted ts. Besides, the evolution of the property cA2 between two external events is also computed during state s0: the successive values, denoted cs0, are evaluated in zero time with respect to the simulation time. This means that no SystemC wait primitives are used, leading to no thread context switches. The resulting observations correspond to the values cs0 and the associated timestamps To; timestamp values are considered relative to what we call the observed time, denoted to. Using this technique, the evolution of the considered property can be computed locally between external transactions. Compared to the transaction-based approach, this second modeling approach with the related computation technique can be considered a state-based approach.
Non-functional properties are then computed locally in the same state, which reduces the number of required transactions. Figure 3 represents the time evolution of property cA2 for the two modeling approaches illustrated in Figure 2.
Figure 3. Evolution of property cA2 considering, (a), a transaction-based modeling approach and, (b), the proposed state-based modeling approach.
The upper part of Figure 3 illustrates the time evolution of property cA2 with 4 successive input transactions. During simulation of the model, each transaction implies a thread context switch between activities, and cA2 evolves according to the simulation time. In the lower part of the figure, the successive values of property cA2 and the associated timestamps are computed at the reception of transaction M; their evolution is depicted according to the observed time to. Improved simulation time is achieved thanks to the context switches avoided. More generally, when the number of transactions is reduced by a factor of N, a simulation speed-up by roughly the same factor can be expected. This computation method favors the creation of abstract models, and the utilization of platform resources can be computed at a finer level with little influence on simulation time. We have implemented the proposed computation method in a specific modeling framework in order to analyze its influence on the simulation time of models.
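The following sketch (ours, not the paper's SystemC code) illustrates the state-based computation: the N per-transaction updates of cA2 are replaced by one local loop that produces the same (timestamp, value) observations without intermediate synchronizations, returning the offset Ts of the single remaining output event.

```c
/* Illustration of the state-based method: property values and
 * timestamps are computed locally, in zero simulation time, instead
 * of one simulator event per transaction. */
typedef struct {
    double t;      /* observed time "to" of the observation */
    double value;  /* value of property cA2                 */
} Observation;

/* dt[0..n-2] are the inter-transaction delays (Delta-t1, Delta-t2, ...)
 * and c[0..n-1] the successive property values; all n observations are
 * produced relative to the arrival time of transaction M, and the
 * return value is Ts, the offset of the single output event. */
double compute_after_m(double t_arrival, const double *dt,
                       const double *c, int n, Observation *obs) {
    double to = t_arrival;            /* local "observed time" */
    for (int i = 0; i < n; i++) {
        obs[i].t = to;
        obs[i].value = c[i];
        if (i < n - 1) to += dt[i];   /* advance locally, no wait() */
    }
    return to - t_arrival;            /* Ts */
}
```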
V. IMPLEMENTATION OF THE COMPUTATION METHOD IN A SPECIFIC FRAMEWORK
The proposed computation method has been implemented in the framework CoFluent Studio [18]. This environment supports the creation of transaction-level models of system applications and architectures. The captured graphical models and the associated code are automatically generated in a SystemC description and simulated to analyze model execution and to assess performance. We used the so-called Timed-Behavioral Modeling part of this framework to create models according to the considered approach. Figure 4 illustrates the graphical modeling adopted in CoFluent Studio to implement the proposed computation method. It corresponds to the specific case illustrated in Figure 2, with one input transaction and one output transaction.
Figure 4. Graphical modeling in the CoFluent Studio framework to implement the proposed computation method.
In Figure 4, the function denoted A2 is activated once the input transaction M has been received. The production instant of the output transaction Q is computed in the operation denoted OpPerformanceAnalysis, whose duration corresponds to the duration Ts defined in Figure 2. The other operations, OpInit and OpUpdating, are executed in zero time with respect to the simulation time. The loop with a boolean condition on the internal variable Wait_Input is added to manage possible output transactions produced successively before waiting for a new input transaction. The operation OpPerformanceAnalysis is described in sequential C/C++ code to define the computation of the properties and their display. The example given below corresponds to the instructions required to obtain the observations depicted in the lower part of Figure 3.
{
  To = CurrentUserTime(ns);
  CofDisplay("to=%f ns, cA2=%f op/s", To, c1);
  To = To + t1;
  CofDisplay("to=%f ns, cA2=%f op/s", To, c2);
  To = To + t2;
  CofDisplay("to=%f ns, cA2=%f op/s", To, c3);
  To = To + t3;
  CofDisplay("to=%f ns, cA2=%f op/s", To, c4);
  To = To + Tl;
  CofDisplay("to=%f ns, cA2=%f op/s", To, 0);
  OpDuration = To - CurrentUserTime(ns);
}
The procedure CurrentUserTime returns the current simulation time in CoFluent Studio; here, it obtains the reception time of input transactions and serves to compute the values of the durations To and Ts. The procedure CofDisplay plots variables in a Y = f(X) chart; here, it displays the studied properties against the observed time. The keyword OpDuration defines the duration of the operation OpPerformanceAnalysis, evaluated in simulation time. Successive values of cA2 and their timestamps are provided by estimations; they could also be computed from data associated with the input transaction M. This model has been extended to the case of functions with multiple input and output transactions. In the following, we use this implementation of the proposed method to create executable models, which are then simulated to assess the performance of the considered architectures.
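The logic of this code fragment can be sketched in a framework-independent way. The following minimal C++ sketch reproduces the trace construction (successive complexity values at computed timestamps, a final drop to zero after the latency Tl, and the resulting operation duration); the struct and function names are ours for illustration, not CoFluent APIs:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One piecewise-constant complexity trace, as displayed by
// OpPerformanceAnalysis: (timestamp, value) points plus the total
// operation duration (Ts in the paper).
struct ComplexityTrace {
    std::vector<std::pair<double, double>> points; // (time ns, op/s)
    double opDuration;                             // Ts
};

// durations = {t1, t2, t3, ...}, values = {c1, c2, c3, c4, ...},
// latency = trailing Tl after which complexity falls to zero.
ComplexityTrace buildTrace(double start,
                           const std::vector<double>& durations,
                           const std::vector<double>& values,
                           double latency) {
    ComplexityTrace tr;
    double t = start;
    for (std::size_t i = 0; i < values.size(); ++i) {
        tr.points.push_back({t, values[i]});  // display ci at time t
        if (i < durations.size()) t += durations[i];
    }
    t += latency;
    tr.points.push_back({t, 0.0});            // complexity drops to zero
    tr.opDuration = t - start;                // Ts, as in the fragment
    return tr;
}
```
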
VI. CASE STUDIES
A. Modeling of a pipeline architecture
The first case study illustrates the proposed modeling approach and the simulation speed-up obtained with the computation method presented in Section IV, through a didactic example. The considered application is a Fast Fourier Transform (FFT), an algorithm widely used in digital signal processing. A pipeline architecture based on hardware resources is analyzed. To keep the illustration simple, an 8-point FFT is considered. The modeling approach is used to estimate resource utilization, and the computation method is used to reduce the simulation time of the performance model. Figure 5 illustrates the considered pipeline architecture and the related performance model.
Figure 5. Modeling of a 3-stage pipeline architecture.
The lower part of Figure 5 describes a typical pipeline architecture as implemented in most commercial FFT IPs. This architecture makes it possible to simultaneously perform transform calculations on the current frame of 8 complex symbols, load input data for the next frame, and unload the 8 complex output symbols of the previous frame. Each stage is made of processing (adders, multipliers) and memory resources. The upper part gives the structural description of the associated model. The behaviors of the activities Stage1, Stage2, and Stage3 describe the utilization of processing resources each time an input transaction is received. The behavior of each activity is described following the modeling approach presented in Section III.
The architecture was first modeled following a transaction-based modeling approach in which each input transaction carries a single complex symbol; eight input transactions are then required to process one iteration of the FFT algorithm. This model has been captured in the CoFluent Studio framework following the previously presented modeling approach. The created model makes it possible to analyze the use of processing resources according to the rate of input transactions. Figure 6 depicts observations obtained with the CoFluent Studio simulation tool. In the considered example, input transactions are received with a period of 0.125 ms.
Figure 6. Time evolution of computational complexity (in KOPS) of the considered system architecture.
The upper part of Figure 6 shows the time evolution of the global computational complexity per time unit required for the complete architecture over three successive executions. The lower part illustrates the processed input and output transactions as observed in the timeline view of the CoFluent Studio simulation tool. For clarity, only the first input transactions and the last output transactions are depicted. The behavior of each activity reflects the architecture of each stage and the time constraints allocated to process each complex symbol. In the considered configuration, the estimated computational complexity per time unit is 120 KOPS.
Considering the state-based modeling approach depicted in Figure 3, we defined a model of the system with the same structure but a coarser data granularity: transactions carry eight complex symbols, and a single transaction triggers one iteration of the complete architecture. The start time corresponds to the reception of the first complex symbol, and the other instants are computed locally relative to this value. The evolution of the processing-resource load for each transaction is computed with the method presented above, and a similar observation of computational complexity is obtained. The measured average simulation speed-up is about 7.62, against a theoretical factor of 8, which shows that the computation method itself has only a weak influence on the simulation-time improvement. Similar observations were obtained when increasing the number of stages in the pipeline architecture.
B. Modeling of a communication receiver
The second case study concerns the creation of a transaction-level model for the analysis of the processing functions involved at the physical layer of the 3GPP LTE protocol [19]. The aim of the model is to study the required computational complexity and memory cost according to the various parameters associated with each function. In the following, we consider the reception part of a downlink transmission in a single-input single-output (SISO) configuration. The structural representation of this system is given in Figure 7.
Figure 7. Activity diagram of the studied communication receiver.
In the configuration depicted in Figure 7, input transactions are received every 1 ms. Each carries 14 OFDM symbols whose size can vary according to the considered throughput. Based on a detailed analysis of the processing and memory resources required for each function [20], we defined analytical expressions for each activity. These expressions relate functional parameters to the resulting computational complexity, in terms of arithmetic operations, and to the required memory resources. For example, the number of sub-carriers directly influences the computational complexity of the OFDM demodulator function. We used the proposed modeling approach to describe each elementary activity depicted in the lower part of Figure 7. The behavior of each activity exhibits the way processing and memory resources are used, and the computation method locally computes the time evolution of the computational complexity and memory cost of each activity. The time properties defined for each activity depend on the evaluated architecture. In the following, results are presented for a platform made of dedicated hardware resources implementing each function.
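The paper's actual analytical expressions are derived from [20] and are not reproduced here, but an illustrative expression of the same kind can be sketched for the OFDM demodulator: assuming a radix-2 N-point FFT with N/2 · log2(N) butterflies and roughly 10 real operations per butterfly (both assumptions are ours), the operation count grows with the FFT size parameter NFFT as follows:

```cpp
#include <cmath>

// Illustrative complexity expression in the spirit of Section VI.B:
// real-operation count of one radix-2 nFft-point FFT, assuming
// (nFft/2)*log2(nFft) butterflies and ~10 real operations each.
// These constants are assumptions, not the expressions used in [20].
double fftOperations(int nFft) {
    double butterflies = (nFft / 2.0) * std::log2(static_cast<double>(nFft));
    return 10.0 * butterflies;
}
```

Multiplying such a per-invocation count by the invocation rate yields the operations-per-second figures plotted in Figure 8.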
We captured the complete model with the CoFluent Studio tool; each activity follows the approach illustrated in Figure 4. The captured LTE receiver model represents 3850 lines of SystemC code, 22% of which were automatically generated by CoFluent Studio; the rest is the sequential C/C++ code defined to compute and display the studied non-functional properties. The model makes it possible to observe the evolution of the computational complexity per time unit for each activity and for the complete architecture. Figure 8 shows the obtained evolution of the computational complexity for various configurations of the input frames.
Figure 8. Observation of estimated computational complexity (in GOPS) of the receiver architecture according to various configurations of LTE sub-frames.
In Figure 8, we observe the evolution of the computational complexity during the reception of successive LTE sub-frames. The system configuration evolves during execution according to several parameters: the number of blocks of data allocated per user (NbRB), the size of the OFDM symbol (NFFT), and the number of iterations of the channel decoder (NbIterTurboDecod). These parameters vary from one frame to the next, and the demodulation scheme can also change during system execution; in Figure 8, the modulation schemes are QPSK, 64QAM, and 16QAM. We observe that the global computational complexity varies strongly during system execution, with an estimated maximum of 70 giga operations per second (GOPS) over the three evaluated configurations. Most of the computational complexity is due to the channel decoder function. The same model is used to evaluate the memory cost of the receiver system; this observation is given in Figure 9.
Figure 9. Observation of the estimated memory cost (in KByte) of the receiver architecture according to various configurations of LTE sub-frames.
Figure 9 illustrates the evolution of the memory cost during the computation of successive LTE sub-frames; the maximum value is estimated at 570 KBytes. The observations given in Figures 8 and 9 are used to estimate the resources expected for the architecture. Simulating the created model for 1000 input frames took 11 s on a 2.66 GHz Intel Core 2 Duo machine, which is fast enough to evaluate performance and to simulate multiple architecture configurations. The time properties and quantitative properties defined for each activity can easily be modified to evaluate various configurations of the architecture. We also used this approach to evaluate the properties of a heterogeneous architecture made of dedicated hardware resources and one processor core.
VII. CONCLUSION
The creation of abstract models is a reliable way to master the design complexity of embedded systems and to enable the architecting of complex hardware and software resources. In this paper, we presented an approach for the creation of transaction-level models for performance evaluation. In this approach, the system architecture is modeled as an activity diagram, and the description of activities incorporates properties relevant to resource usage. The contribution is a specific computation method that favors the creation of more abstract transaction-level models: simulation speed-up is achieved through a significant reduction of the number of transactions in models, and architecture properties are computed in zero simulation time. This method significantly increases the simulation speed of models while preserving the accuracy of observations. The method was demonstrated with the CoFluent Studio framework, but the presented modeling approach is not limited to this environment and could be applied to other SystemC-based frameworks. Further research is directed towards applying the same modeling principle to other non-functional properties, such as dynamic power consumption.
REFERENCES
[1] A. Viehl, B. Sander, O. Bringmann, and W. Rosenstiel, "Integrated requirement evaluation of non-functional system-on-chip properties", in Proceedings of the Forum on specification and Design Languages (FDL'08), Stuttgart, Germany, September 2008.
[2] D. Densmore, R. Passerone, and A. Sangiovanni-Vincentelli, "A platform-based taxonomy for ESL design", IEEE Design and Test of Computers, vol. 23, no. 5, pp. 359-374, September/October 2006.
[3] L. Cai and D. Gajski, "Transaction level modeling: an overview", in Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS'03), Newport Beach, October 2003.
[4] F. Ghenassia, Transaction-Level Modeling with SystemC: TLM Concepts and Applications for Embedded Systems, Springer, 2005.
[5] Open SystemC Initiative (OSCI), "Functional specification for SystemC 2.0", http://www.systemc.org
[6] M. Gries, "Methods for evaluating and covering the design space during early design development", Integration, the VLSI Journal, vol. 38, no. 2, pp. 131-183, 2004.
[7] SystemVerilog, http://www.systemverilog.org
[8] Open SystemC Initiative TLM Working Group, Transaction Level Modeling Standard 2 (TLM 2), June 2008.
[9] G. Schirner and R. Dömer, "Quantitative analysis of the speed/accuracy trade-off in transaction level modeling", ACM Transactions on Embedded Computing Systems, vol. 8, no. 4, pp. 1-29, 2008.
[10] N. Savoiu, S. K. Shukla, and R. K. Gupta, "Automated concurrency re-assignment in high level system models for efficient system-level simulation", in Proceedings of Design, Automation and Test in Europe (DATE'02), 2002.
[11] J. Cornet, F. Maraninchi, and L. Maillet-Contoz, "A method for the efficient development of timed and untimed transaction-level models of systems-on-chip", in Proceedings of Design, Automation and Test in Europe (DATE'08), Munich, Germany, March 2008.
[12] R. Dömer, A. Gerstlauer, J. Peng, et al., "System-on-chip environment: a SpecC-based framework for heterogeneous MPSoC design", EURASIP Journal on Embedded Systems, vol. 2008, 2008.
[13] A. D. Pimentel, C. Erbas, and S. Polstra, "A systematic approach to exploring embedded system architectures at multiple abstraction levels", IEEE Transactions on Computers, vol. 55, no. 2, pp. 99-111, 2006.
[14] C. Haubelt, J. Falk, J. Keinert, et al., "A SystemC-based design methodology for digital signal processing systems", EURASIP Journal on Embedded Systems, vol. 2007, 2007.
[15] J. Kreku, M. Hoppari, T. Kestilä, et al., "Combining UML2 application and SystemC platform modelling for performance evaluation of real-time embedded systems", EURASIP Journal on Embedded Systems, vol. 2008, 2008.
[16] T. Arpinen, E. Salminen, T. Hämäläinen, and M. Hännikäinen, "Performance evaluation of UML2-modeled embedded streaming applications with system-level simulation", EURASIP Journal on Embedded Systems, vol. 2009, 2009.
[17] J. P. Calvez, Embedded Real-Time Systems: A Specification and Design Methodology, John Wiley & Sons, May 1993.
[18] CoFluent Design, http://www.cofluentdesign.com/
[19] E. Dahlman, S. Parkvall, J. Sköld, and P. Beming, 3G Evolution: HSPA and LTE for Mobile Broadband, Academic Press, 2008.
[20] J. Berkmann, C. Carbonelli, F. Dietrich, C. Drewes, and W. Xu, "On 3G LTE terminal implementation – standard, algorithms, complexities and challenges", in Proceedings of the International Wireless Communications and Mobile Computing Conference (IWCMC'08), August 2008.
Session 6: Software for Embedded Devices
978-1-4577-0660-8/11/$26.00 ©2011 IEEE
Task Mapping on NoC-Based MPSoCs with Faulty Tiles: Evaluating the Energy Consumption and the Application Execution Time
Alexandre M. Amory, César A. M. Marcon, Fernando G. Moraes
FACIN – Faculdade de Informática, PUCRS Catholic University
Porto Alegre, Brazil
{alexandre.amory, cesar.marcon, fernando.moraes}@pucrs.br

Marcelo S. Lubaszewski
PPGC – Instituto de Informática, UFRGS Federal University
Porto Alegre, Brazil
[email protected]
Abstract— The use of spare tiles in a network-on-chip-based multi-processor chip can improve the yield, reducing the cost of the chip and maintaining the system functionality even if the chip is defective. However, the impact of this approach on application characteristics, such as energy consumption and execution time, is not documented. For instance, on the one hand, the application tasks might be mapped onto any tile of a defect-free chip; on the other hand, a chip with a defective tile needs a special task mapping that avoids the faulty tiles. This paper presents a task mapping aware of faulty tiles, where an alternative task mapping can be generated and evaluated in terms of energy consumption and execution time. The results show that faults on tiles have, on average, a small effect on energy consumption and no significant effect on execution time. This demonstrates that spare tiles can improve yield with a small impact on the application requirements.
Keywords: MPSoC, task mapping, yield, energy consumption, execution time.
I. INTRODUCTION
A multiprocessor system-on-chip (MPSoC) is typically a very large scale integrated system that incorporates most or all of the components necessary for an application, including multiple processors [1]. A network-on-chip (NoC) is the preferred intrachip communication infrastructure for MPSoCs due to its superior performance, scalability, and modularity. MPSoCs that use NoCs as the communication infrastructure are also called NoC-based MPSoCs.
NoCs can consume more than one third of the total chip energy [2][3]. On the other hand, the shrinking feature sizes of newer technologies and supply voltage scaling [4][5] increase the defect rate in chip manufacturing and reduce the yield. High manufacturability, low latency, and low energy consumption are conflicting design goals, so all these requirements have to be evaluated jointly to optimize a NoC-based MPSoC design.
The task mapping problem consists in finding an association of each application task to a tile that minimizes some given cost function. This paper presents a tool that finds an optimized task mapping in terms of energy consumption and application execution time, given a set of tiles with manufacturing defects. This way, even chips with defects can be sold, perhaps with some performance degradation, targeting low-end markets.
The goals of this paper are to present the aforementioned task mapping tool and to investigate the energy consumption and application execution time degradations for different application classes. The contributions of this paper are (i) a task mapping tool for NoC-based MPSoCs that considers faulty tiles when performing the mapping; (ii) the evaluation of energy consumption and application execution time under the presence of faulty tiles; (iii) a statistical method to generate fault scenarios for very large SoCs.
The paper is organized as follows: Section II presents the motivation, the usage of the proposed approach, and the main assumptions. Section III describes the related work. Section IV describes the task mapping tool and its models. Section V describes the experimental setup, the evaluated applications, and the fault scenarios. Section VI discusses the results. Section VII concludes the paper.
II. PRELIMINARIES
A. System Model and Assumptions
This paper assumes that the target MPSoC consists of a set of identical (homogeneous) tiles connected by a mesh-based NoC with the XY routing algorithm. Each tile contains three main components: a network interface, a processor, and a memory block. A tile supports only one task (no multitasking). This system model is equivalent, for instance, to the underlying model of the HeMPS MPSoC [6] with the Hermes NoC [7].
The present work assumes faults only in the tiles, since we assume that the tile accounts for at least 90% of the combined tile-and-router area. Therefore, the communication infrastructure is assumed fault-free. A faulty tile is completely shut down, so it neither consumes energy nor generates traffic in the network.
The faults result from defects created during chip manufacturing. Such defects are expected to become more common with the evolution of deep submicron technologies, so multiple faults per chip are considered. The proposed task mapping is executed at design time for several fault scenarios, so that an overall picture of the relationship between fault location and the performance metrics can be drawn.
B. Motivating Example
Redundant hardware is commonly used to tackle the yield problem. It has been successfully applied to all sorts of regular and repetitive hardware, such as different types of memories, programmable logic arrays, field-programmable gate arrays, and recently MPSoCs [5]. In the context of MPSoCs, the application task located on a faulty tile can be mapped (at design time) or migrated (at run time) to a spare tile, keeping the chip functional.
Shamshiri and Cheng [5] proposed a yield and cost analysis framework to evaluate the use of spare tiles in MPSoCs, which can be used to determine the amount of redundancy required to achieve a minimum cost. For instance, given the input parameters detailed in [5], the yield of a block is 94% and that of a NoC link is 72%, resulting in a system yield of just 21% for a 3x3 mesh NoC, i.e., there is a 79% probability of having at least one faulty block in the system. By adding three spare tiles to the system, increasing the number of tiles from 9 to 12, the system yield increases to 99%, since only 9 of the 12 tiles are actually required for a functional system. Moreover, the manufacturing cost is 3.2 times lower than for the original system, since the additional silicon area of the spare tiles is compensated by the increased yield.
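The 9-versus-12-tile comparison follows the standard m-out-of-n redundancy formula. The sketch below computes the system yield from the per-tile yield alone (the full framework in [5] also models link yield and cost, which is why the paper's exact numbers differ slightly):

```cpp
#include <cmath>

// Binomial coefficient C(n, k), computed iteratively in doubles.
static double binom(int n, int k) {
    double r = 1.0;
    for (int i = 1; i <= k; ++i)
        r = r * (n - k + i) / i;
    return r;
}

// Yield of an n-tile system that works as long as at least m tiles are
// defect-free (n - m spares), given an independent per-tile yield y:
// sum over i = m..n of C(n, i) * y^i * (1 - y)^(n - i).
double systemYield(int n, int m, double y) {
    double total = 0.0;
    for (int i = m; i <= n; ++i)
        total += binom(n, i) * std::pow(y, i) * std::pow(1.0 - y, n - i);
    return total;
}
```

With y = 0.94, systemYield(9, 9, y) is about 57%, while systemYield(12, 9, y) exceeds 99%, matching the trend the paper reports.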
Given these motivating results, we decided to investigate the use of spare tiles by evaluating the side effects of multiple faulty tiles on energy consumption and application execution time.
C. Usage of the Proposed Approach
Figure 1 illustrates the proposed test approach, which starts as soon as the chip is manufactured. If the tested chip fails, a diagnosis step is performed to locate the faulty tiles.
Let n be the number of system tiles and m the number of tiles necessary to implement the system's functionality, so that n - m is the number of spare tiles. If the number of faulty tiles is lower than or equal to n - m, the location of these faulty tiles is sent to the task mapping tool; otherwise, the faulty chip is discarded. The task mapping tool, presented in Section IV, loads a NoC model and the application task graph to determine a new task mapping that avoids the faulty tiles. Finally, the tool estimates the energy consumption and the application execution time of the resulting task mapping. Depending on the resulting overhead, chips with up to n - m faulty tiles can still be sent to the market, perhaps targeting low-end markets.
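The accept/discard decision of this flow can be sketched as follows (the names are ours for illustration):

```cpp
#include <cstddef>
#include <vector>

enum class ChipFate { HighEnd, LowEnd, Discarded };

// Dispatch a tested chip following the flow of Figure 1: n total
// tiles, m tiles needed by the application, so n - m spares are
// available. faultyTiles holds the tile indices found by diagnosis.
ChipFate dispatchChip(std::size_t n, std::size_t m,
                      const std::vector<std::size_t>& faultyTiles) {
    if (faultyTiles.empty())
        return ChipFate::HighEnd;      // passes the manufacturing test
    if (faultyTiles.size() <= n - m)
        return ChipFate::LowEnd;       // remappable around the faults
    return ChipFate::Discarded;        // not enough spare tiles
}
```
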
Figure 1. Proposed test flow for NoC-based MPSoCs with spare tiles.
III. RELATED WORK
There are several papers presenting approaches to improve the reliability of NoC-based SoCs. They can be broadly classified (the classes are not exhaustive) into: (i) fault-tolerant circuitry for NoCs and MPSoCs [8][9]; (ii) fault-tolerant NoC routing algorithms that explore different packet routes in case of network faults [10]; (iii) system-level reliability assessment [5]; (iv) system-level reliability co-optimization [11].
This paper fits best in the system-level reliability co-optimization category, where two main approaches are found: dynamic approaches executed at run time, and static approaches executed at design time. Dynamic system-level reliability co-optimization is commonly based on on-line task mapping and task migration to better accommodate new incoming tasks on the fly, assuming the chip might have faults. It can also be used to react, for instance, to a run-time fault generated by transient effects or by permanent faults due to wear-out or aging [15][16]; in this case, the tasks located at the faulty resources are moved at run time to healthy resources. These approaches are out of the scope of this paper, since our goal is to improve the yield of chip manufacturing: manufacturing defects are not dynamic and do not appear at run time.
For this reason, this paper is best related to static system-level reliability co-optimization approaches, based on static task scheduling executed at design time. These approaches are typically used in design space exploration targeting the optimization of metrics such as application execution time, latency, thermal constraints, and energy consumption [12][13][14]. Recently, these approaches have also begun to co-optimize reliability-related metrics.
Manolache et al. [11] address the reliability problem at the application level. They propose a way to combine spatially and temporally redundant message transmission, minimizing the energy and latency overheads.
Tornero et al. [17] propose a multi-objective optimization strategy that minimizes energy consumption and maximizes a robustness index, called path diversity, which explores the multiple paths between a pair of nodes. In case of a faulty link, a NoC with adaptive or source-based routing algorithms can explore these multiple paths, improving the chip's robustness.
Choudhury et al. [18] introduce a new task mapping whose objective is to minimize the variance of the system power and latency when faults occur, and to maximize the probability that the actual system will work when deployed.
Huang et al. [19] argue that some processors might age much faster than others, reducing the system's lifetime. They propose an analytical model to estimate the lifetime reliability of MPSoCs, integrated into a task mapping algorithm that minimizes the energy consumption of the system while satisfying a system lifetime reliability constraint. Huang and Xu [20] extend their previous task mapping tool [19] to support multi-mode embedded systems. In [21], they argue that an exponential lifetime distribution can be inaccurate, and further refine the lifetime reliability model to support arbitrary lifetime distributions, improving the accuracy of the simulation results.
IV. TASK MAPPING AWARE OF FAULTY TILES
The CAFES task mapping framework [22] is composed of high-level models, algorithms, and tools whose goal is to map application tasks onto the target architecture tiles so as to save energy and minimize the execution time. Figure 2 illustrates a partial mapping flow and the main elements used here.
Figure 2. Mapping flow used to obtain optimized application mappings.
Based on the description of an application already partitioned into tasks ti, the designer may extract the relevant computation and communication aspects.
The Communication Dependence and Computation Graph (CDCG) is the model used to describe the application. Each CDCG vertex models a communication, with its source and target tasks, the communication volume, and the computation time, i.e., the period between the moment all dependences are solved and the beginning of the communication. CDCG edges represent communication dependences: vertices are connected by an edge for each dependence. The CDCG is similar to a schedule graph, but it focuses on communication aspects instead of computation, which makes it easy to explore several requirements of the communication architecture.
Figure 3 depicts a small CDCG example containing three communications {C1, C2, C3}. C1 and C3 are concurrent communications and have no dependences, since the dependences on the Start vertex (dStart_1 and dStart_3) are always solved. Thus, C1 and C3 start immediately after their respective computation times of 10 and 20 clock cycles. Communication C1 states that t1 sends 100 bytes to t2, and C3 states that t3 sends 100 bytes to t1. As soon as the last byte of C1 is inserted into the NoC, d1_2 is solved; on the other hand, d3_2 is solved only when the last byte of C3 arrives at the processor where t1 is mapped.
Figure 3. CDCG example.
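A plausible in-memory form of a CDCG, populated with the Figure 3 example, can be sketched as follows (the struct and field names are ours, not from CAFES; C2's volume of 50 bytes and computation time of 25 cycles are taken from the figure):

```cpp
#include <string>
#include <vector>

// Hypothetical representation of a CDCG vertex: one communication
// with its source and target tasks, its volume, the computation time
// that elapses once all dependences are solved, and the vertices it
// depends on.
struct CdcgVertex {
    std::string source;        // producing task, e.g. "t1"
    std::string target;        // consuming task, e.g. "t2"
    int volumeBytes;           // communication volume
    int computationCycles;     // cycles before the communication starts
    std::vector<int> deps;     // indices of vertices this one depends on
};

// The Figure 3 example: C1 and C3 have no dependences (the Start
// dependences are always solved); C2 depends on both C1 (d1_2) and
// C3 (d3_2).
std::vector<CdcgVertex> figure3Example() {
    return {
        {"t1", "t2", 100, 10, {}},      // C1
        {"t1", "t3",  50, 25, {0, 2}},  // C2
        {"t3", "t1", 100, 20, {}},      // C3
    };
}
```
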
The target architecture topology is modeled by a Communication Resource Graph (CRG), whose nodes are tiles and whose edges are links. The energy and execution time parameters are extracted from the target architecture synthesized for a given technology. The faulty tile list is generated by the diagnostic flow presented in Figure 1. From the application description, the NoC energy and execution-time parameters, the NoC topology, and the faulty tile list, the task mapping tool estimates the NoC energy consumption and the application execution time of different mappings, enabling the evaluation of the impact of faulty tiles. The next sections detail the underlying algorithms and the timing and energy models.
A. Mapping Algorithm
As stated before, the mapping problem consists in finding an association of each application task to a given processor, placed in a given tile, that minimizes the global energy consumption and the application execution time. Let n be the number of tiles; this problem allows n! possible solutions. Given that future MPSoCs may contain hundreds of tiles, an exhaustive search of the solution space is clearly unfeasible, and optimized implementations of such SoCs require the development of efficient mapping heuristics.
Exhaustive analyses of some small applications mapped onto NoC-based MPSoCs show that task mapping is a problem with self-similar [23] behavior: there are several very different mappings with the same cost, i.e., the same energy consumption and execution time. Therefore, exploring not all mappings, but a set of very different random mappings, each followed by some refinements (new mappings with few changes), normally results in an optimized solution. With its two nested loops (an external one that looks for very different solutions and an internal one that looks for a local minimum), Simulated Annealing (SA) is an algorithm well suited to finding solutions for self-similar problems.
Our SA mapping algorithm searches for mappings that yield an MPSoC with minimum energy consumption and low execution time. To combine these requirements in a single cost function, the execution time requirement is expressed in terms of energy consumption: the static power dissipation is multiplied by the application execution time (texec), giving the static portion of the energy consumption, as detailed in Section IV.B. As a result, both dynamic and static energy consumption are considered in the mapping cost function.
To improve the yield, the SA algorithm searches for mappings with minimum cost while avoiding the tiles marked as spare or faulty. When a tile is marked as faulty, the algorithm replaces it with a fault-free spare tile.
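The two nested SA loops described above can be sketched as follows. This is an illustrative sketch under our own assumptions (swap-based neighbor moves, geometric cooling, a fixed seed), not the actual CAFES algorithm; 'cost' is assumed to already fold execution time into energy as described in the previous paragraph, and 'usableTiles' is assumed to exclude faulty tiles:

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Simplified simulated-annealing mapping loop. Requires
// nTasks <= usableTiles.size(); task i initially sits on the i-th
// usable tile, and neighbor moves swap the tiles of two random tasks.
template <typename CostFn>
std::vector<int> saMap(std::size_t nTasks,
                       const std::vector<int>& usableTiles, CostFn cost,
                       double t0 = 100.0, double cooling = 0.95,
                       int movesPerTemp = 50, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::vector<int> mapping(usableTiles.begin(),
                             usableTiles.begin() + nTasks);
    std::vector<int> best = mapping;
    double curCost = cost(mapping), bestCost = curCost;
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::uniform_int_distribution<std::size_t> pick(0, nTasks - 1);
    for (double t = t0; t > 0.1; t *= cooling) {         // outer loop
        for (int m = 0; m < movesPerTemp; ++m) {         // inner loop
            std::vector<int> cand = mapping;
            std::swap(cand[pick(rng)], cand[pick(rng)]); // neighbor move
            double c = cost(cand);
            // Metropolis criterion: always accept downhill moves,
            // accept uphill moves with probability exp(-delta / t).
            if (c < curCost || uni(rng) < std::exp((curCost - c) / t)) {
                mapping = std::move(cand);
                curCost = c;
                if (c < bestCost) { best = mapping; bestCost = c; }
            }
        }
    }
    return best;
}
```

The outer temperature loop plays the role of the "very different solutions" search, while the inner loop refines toward a local minimum.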
B. Timing Model
The total packet delay (dijq) under wormhole routing is composed of the routing delay (dRijq) and the packet delay (dPijq) of the remaining flits. The routing delay is the time necessary to create the communication path, determined during the traversal of the packet header. The packet delay depends on the number of remaining flits. Let nabq be the number of flits of the q-th packet from pa to pb, obtained by dividing wabq by the link width. Let λ be the period of a clock cycle, tr the number of cycles needed to route a packet inside a router, and tl the number of cycles needed to transmit a flit through a link (between tiles, or between a processor and a router). The routing delay (dRijq) and the packet delay (dPijq) of the q-th packet from tile τi to tile τj are given by Equations (1) and (2), considering that the packet goes through η routers without contention; contentions can only be determined at execution time.
dRijq = (η × (tr + tl) + tl) × λ (1)
dPijq = (tl × (nabq - 1)) × λ (2)
Equation (3) expresses the total packet delay (dijq), i.e., the packet latency, obtained as the sum of dRijq and dPijq.
dijq = (η × (tr + tl) + tl × nabq) × λ (3)
For example, applying Equation (3) to a packet with 10 flits (nabq = 10) sent from tile τ1 to tile τ2 (two neighboring tiles, i.e., η = 2), and considering λ = 1 ns, tr = 3, and tl = 1 clock cycle, the packet latency is 18 ns.
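Equation (3) and the worked example translate directly into code (the function name is ours):

```cpp
// Packet latency per Equation (3): eta routers traversed, tr cycles to
// route inside a router, tl cycles to transmit a flit over a link,
// nFlits flits in the packet, and lambda the clock period (ns here).
double packetLatency(int eta, int tr, int tl, int nFlits, double lambda) {
    return (eta * (tr + tl) + tl * nFlits) * lambda;
}
```

With eta = 2, tr = 3, tl = 1, nFlits = 10, and lambda = 1 ns, this reproduces the 18 ns of the example above.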
The application execution time (texec) depends on both the application computation and communication. However, a simple equation cannot express texec, since several communications and computations often run in parallel. In addition, some communications may compete for the same communication resource (e.g. links and buffers) at the same time, which may cause contentions that increase the overall execution time; contentions also make a single closed-form equation impractical. Therefore, texec is computed during the mapping algorithm execution, which repeatedly uses the dijq values and the time expended in each computation.
C. Energy Model
The dynamic energy consumption is modeled using the concept of bit energy (EBit), similarly to the model described in [24]. For several communication architectures, EBit can be expressed as a function of four quantities, as depicted by Equation (4).

EBit = function(Es, Eb, Ec, El) (4)

Es is the dynamic energy consumption of a single bit on the wires and logic gates of each router. Eb is the bit dynamic energy consumption on router buffers. Ec is the dynamic energy consumption of a single bit on the links between a router and its local module. El is the bit dynamic energy consumption on the links between routers.
Equation (5) illustrates how EBit models a 2D direct mesh NoC. It computes the dynamic energy consumed by a bit traveling through such a NoC from tile i (τi) to tile j (τj), where ηij corresponds to the number of routers that the bit traverses.
EBitij = ηij × (Es+Eb) + 2 × Ec + (ηij – 1) × El (5)
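Equation (5) translates directly into code. The sketch below uses made-up per-bit energy values purely for illustration; the real parameters come from the library characterization described in Section IV.D:

```python
def ebit_ij(eta_ij, es, eb, ec, el):
    """Dynamic energy of one bit from tile i to tile j (Equation (5)):
    switch + buffer energy in each of the eta_ij routers, two local links
    (source and destination), and eta_ij - 1 router-to-router links."""
    return eta_ij * (es + eb) + 2 * ec + (eta_ij - 1) * el

# Illustrative per-bit energies in picojoules (assumed, not calibrated).
print(ebit_ij(3, es=0.5, eb=0.8, ec=0.2, el=0.4))  # 3*1.3 + 2*0.2 + 2*0.4 = 5.1
```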
Let wabq be the total number of bits of a packet pabq going from pa to pb (i.e. processors a and b), which are mapped on tiles τi and τj, respectively. Then, the dynamic energy consumed by all k packets of the pa → pb communications is given by Equation (6).
EBitab = ∑q=1..k (wabq × EBitij) (6)
Hence, Equation (7) gives the total dynamic energy consumed by the NoC (EDyNoC), where y represents the total number of communications between different processor pairs pa and pb.
EDyNoC = ∑i=1..y (EBitab)i, ∀ pa, pb ∈ processors set (7)
The static power dissipation of each router (PRouter) is proportional to the number of gates that compose the router, and it can be estimated by electrical simulation. With n representing the number of tiles, Equation (8) computes the NoC static power dissipation (PNoC).
PNoC = n × PRouter (8)
Using the texec defined in Section IV.B, Equation (9) computes the NoC static energy consumption (EsNoC).
EsNoC = PNoC × texec (9)
Finally, Equation (10) gives the overall energy consumption of the NoC (ENoC), considering both static and dynamic effects, which the SA algorithm uses as the cost function to search for optimal mappings.
ENoC = EsNoC + EDyNoC (10)
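Putting Equations (6)–(10) together, the cost function reduces to a sum over packets plus a static term. A minimal sketch, with all numeric inputs assumed for illustration:

```python
def total_noc_energy(packets, n_tiles, p_router, t_exec):
    """ENoC = EsNoC + EDyNoC (Equation (10)).
    packets: iterable of (w_abq, ebit_ij) pairs, i.e. packet size in bits and
    the per-bit energy of its route (Equations (6)-(7)).
    Static part: n * P_router * t_exec (Equations (8)-(9))."""
    e_dynamic = sum(w * e for w, e in packets)   # Equations (6)-(7)
    e_static = n_tiles * p_router * t_exec       # Equations (8)-(9)
    return e_static + e_dynamic

# Assumed numbers: two packets, 12 tiles, 1 mW static per router, 10 us run.
print(total_noc_energy([(1024, 5.1e-12), (2048, 3.9e-12)],
                       n_tiles=12, p_router=1e-3, t_exec=10e-6))
```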
D. Model Calibration
The Hermes NoC [7], configured with a 16-bit phit and input buffers with four positions, was used to validate the timing and energy models. The Hermes VHDL description was synthesized to an ASIC standard cell library. The library also supplies energy values for the cells, which are used to extract the energy parameters.
The synthesis result is a logic gate netlist. This netlist is associated with a customized VHDL library, which enables fast and accurate energy consumption and timing estimations. A testbench applies both random and typical traffic to the netlist, and the results achieved by VHDL simulation are compared to those obtained from the high-level mapping tool. Our experiments showed average errors below 30.5% and 14% for the energy consumption and execution time estimations, respectively.
V. EXPERIMENTAL SETUP
This section presents the methods used to generate the combinations of faulty tiles, called fault scenarios. The first method is exhaustive and is used for small NoCs; the second is a statistical method used for larger NoCs. Finally, we present the application classes evaluated in this paper.
A. Exhaustive Fault Generation Method
Faulty tiles are exhaustively generated for all combinations of faulty tile locations, assuming a system with 1 to 3 faulty tiles. Thus, Equation (11) defines the total number of injected fault scenarios as the sum of all combinations of 1 to nfaults faults in x × y tiles. For instance, a 3 × 4 mesh NoC requires 298 fault scenarios (12 single faults, 66 double faults, and 220 triple faults).
scens(x × y, nfaults) = (x·y choose 1) + (x·y choose 2) + … + (x·y choose nfaults) (11)
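Equation (11) is just a sum of binomial coefficients. The helper below (its name is ours) reproduces the scenario counts quoted in this section:

```python
from math import comb

def scens(n_tiles, nfaults):
    """Total number of fault scenarios (Equation (11)): all combinations
    of 1, 2, ..., nfaults faulty tiles among n_tiles = x * y tiles."""
    return sum(comb(n_tiles, k) for k in range(1, nfaults + 1))

print(scens(3 * 4, 3))   # 12 + 66 + 220 = 298 (the 3x4 example above)
print(scens(5 * 5, 3))   # 2625 executions for a 5x5 mesh
print(scens(5 * 5, 4))   # 15275 with up to 4 simultaneous faults
```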
B. Statistical Fault Generation Method
The exhaustive fault generation method is precise; however, it might not be possible to perform exhaustive fault simulation due to the long CPU time. The main reason is that the total number of required executions, defined in Equation (11), grows exponentially with the NoC size (x × y) and the maximum number of simultaneous faults (nfaults). Moreover, the CPU time of a single execution of the task mapping tool grows with the NoC size.
For instance, a 3 × 4 mesh NoC with up to 3 simultaneous faults requires 298 task mapping executions (about 3 minutes of CPU time) to perform exhaustive fault simulation. However, a bigger NoC, such as a 5 × 5 mesh with up to 3 faults, requires 2625 executions, taking about 60 hours of CPU time. The same 5 × 5 mesh NoC with up to 4 simultaneous faults requires 15275 executions, which we estimate would require about 14 days of CPU time.
Even though the economic motivation for spare tiles is appealing, it might be unfeasible to perform an exhaustive fault simulation, since the CPU time becomes an issue for bigger NoCs with multiple faults. This section presents a statistical approach, called sample size estimation [25], used to determine the minimal number of fault scenarios required to obtain satisfactory results, i.e. results close to the ones achieved by the exhaustive approach. This way, the CPU time can be drastically reduced while the results remain accurate. Moreover, this method enables trading off CPU time against result accuracy.
Before executing the sample size estimation, a pilot simulation is performed with a small sample. A sample represents a set of executions of the task mapping tool, where each execution assumes that the faulty tiles are randomly selected. Each execution of this pilot results in a different mapping with different energy consumption and execution time. If the energy consumption is the value to be estimated, then this pilot gives the population's estimated standard deviation s of the energy consumed in the presence of randomly located faulty tiles. The population in this context represents the entire set of fault scenarios, as determined by Equation (11).
The goal of the sample size estimation is to estimate the population average (μ), i.e. the average energy consumption of the entire population of fault scenarios. Equation (12) is typically used for this purpose, where s is the estimated standard deviation of the sample, and (x − μ) is the difference between the estimated sample average (x) and μ, which represents the acceptable error between the sample and the population. tα,df is the value from Student's t-distribution table [25], where (1 − α) is the confidence level and df is the degree of freedom, defined as df = n − 1.
n = (s × tα,df)² / (x − μ)² (12)
Since n is unknown, one can select an initial value of n to obtain tα,df. This value is used in Equation (12) to find a new n and a new tα,df. This calculation is performed iteratively until the value of n stabilizes. The stable value of n is the minimal sample size required to estimate the population average μ with the expected accuracy.
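The fixed-point iteration on Equation (12) can be sketched as follows. The abridged t-table and the starting value are our own simplifications, so the resulting sample size only approximates what a full t-table would give:

```python
import math

# Abridged two-tailed Student's t critical values for alpha = 0.05
# (coarse illustration; a real implementation would use a full table).
T_TABLE = {5: 2.571, 10: 2.228, 20: 2.086, 30: 2.042, 60: 2.000, 120: 1.980}

def t_critical(df):
    """Nearest-df lookup in the abridged table."""
    return T_TABLE[min(T_TABLE, key=lambda d: abs(d - df))]

def sample_size(s, max_error, n0=30, max_iter=100):
    """Iterate Equation (12), n = (s * t_{alpha,df})^2 / (x - mu)^2,
    with df = n - 1, until n stabilizes."""
    n = n0
    for _ in range(max_iter):
        new_n = math.ceil((s * t_critical(n - 1)) ** 2 / max_error ** 2)
        if new_n == n:
            break
        n = new_n
    return n

# Pilot standard deviation s = 3.9 (%), acceptable error 4% (situation (i) below).
print(sample_size(3.9, 4.0))
```

With the coarse table the iteration settles on a sample size close to, but not identical with, the value a complete t-table yields.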
C. System Application
We explore several parallel applications with distinct features, aiming to determine what kind of application increases the overhead in energy and execution time in the presence of faulty tiles. A synthetic application generator, detailed in [22], is used to create random CDCGs.
This synthetic application generator can build several application classes by varying parameters such as: (i) the number of processors, which allows investigating different target architecture dimensions; (ii) the number of graph levels, which specifies the number of dependent communications an application has; (iii) the dependence degree, which defines the probability that a vertex has more than one dependence, keeping in mind that dependent communications cannot compete for NoC resources; (iv) the probability of end meeting, which defines whether a vertex has dependences or is a final communication; (v) the computation time, i.e. the period, associated with each source task, between the resolution of all its dependences and the start of its communication; (vi) the communication volume, i.e. the number of bytes transmitted in each communication; and (vii) the number of parallel communications, which describes the minimum number of parallel communications an application has.
For instance, by varying the relation between computation time and communication volume, an application may change from I/O-bound to CPU-bound; by varying the relation between the number of graph levels and the dependence degree, an application may be dataflow or concurrent.
We built 39 synthetic applications, which enable exploring applications classified as (i) I/O- or CPU-bound; (ii) dataflow with different levels of parallelism; and (iii) strongly parallel or sequential, with different levels of concurrency on the communication architecture.
VI. EXPERIMENTAL RESULTS
This section evaluates (i) the application execution time under exhaustive fault scenarios, (ii) the average energy consumption under exhaustive fault scenarios, and (iii) the proposed statistical fault generation method used to estimate energy consumption, comparing it to the exhaustive method.
A. Evaluating the Application Execution Time
All application classes have been evaluated in terms of execution time using the exhaustive fault generation method.
The result is that, independently of the application class, the execution time is not affected by the presence of faulty tiles. On average, the variation in execution time between the fault-free chip and the chips with up to 3 faulty tiles is close to 0%.
The reason lies in the timing model, presented in Section IV.B, more specifically in Equation (3). The total application time consists of the computation time plus the communication time. The communication time consists of the routing delay, which depends on the distance between the communicating elements, plus the packet delay, which depends on the packet size.
If the computation time is much greater than the communication time, then the task mapping has very little influence on the application execution time. Even if the computation and communication times are equivalent, if the packet size is large (hundreds of flits), the routing delay has a very small impact on the communication time (since the NoC works as a pipeline), and thus also a small impact on the application execution time.
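This pipeline argument is easy to check numerically with the terms of Equation (3); the cycle counts and hop count below are assumed values, not measurements:

```python
# Share of the routing delay in the total packet latency (Equation (3))
# for growing packet sizes; tr, tl and the hop count eta are assumed.
tr, tl, eta = 3, 1, 4
for n_flits in (1, 10, 100, 500):
    routing = eta * (tr + tl) + tl    # header routing portion, in cycles
    payload = tl * (n_flits - 1)      # remaining flits, in cycles
    share = routing / (routing + payload)
    print(f"{n_flits:4d} flits: routing delay is {share:6.1%} of the latency")
```

For single-flit packets the routing delay dominates, but with a few hundred flits it falls below a few percent of the total latency, matching the observation above.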
Since in typical scenarios an application has more computation than communication, and applications use packets with hundreds of flits, the impact of the routing delay on the overall application execution time is almost negligible. This claim can be demonstrated with the following example.
Let us assume a given application on a 3×4 mesh NoC, whose normal behavior is to have more computation than communication. This application is modified into three variations: low communication (packets of one flit), low computation (CPU time of 1 clock cycle), and both low communication and low computation. Exhaustive fault generation is performed for these cases, generating Figure 4, which shows the difference between the average execution time of the population of faulty chips and the execution time of the fault-free chip.
Figure 4. The average overhead of application execution time of faulty chips.
This figure demonstrates that faulty tiles have a significant influence on the chip execution time only if both the computation and the communication are low, which is not the typical situation. Most actual applications have larger packet sizes and more computation than communication.
B. Evaluating the Energy Consumption
The energy consumption is evaluated for each class of application described in Section V.C. The result is that, regardless of the application class, only the proportion of good to faulty tiles affects the energy consumption. For instance, a chip with 15 tiles where two of them are faulty consumes more energy than the same chip with only one faulty tile. These results are illustrated in Figure 5 for a 3×5 mesh and an application with 12 tasks and 3 spare tiles. The average impact of a faulty tile on energy consumption is worse in the center of the NoC, and it increases if there are more faulty tiles in the chip (Figure 5(a)). This impact gradually decreases as the distance from the center tiles increases (Figure 5(b)).
However, if we map the same application on a 4×4 mesh NoC, then there are 12 tasks and 4 spare tiles. Figure 6 compares the energy profile of this application on a 3×5 against a 4×4 mesh NoC, assuming three faults in each of them. It can be observed that the energy overhead in the 4×4 mesh is lower. The reason is the proportion of good to faulty tiles: in a 3×5 mesh with 3 faults the proportion is 15/3, while in a 4×4 mesh it is 16/3. This extra tile in the 4×4 mesh gives the task mapping tool more freedom to determine a good scheduling, improving the effect of self-similarity (Section IV.A) and resulting in a better task mapping.
Figure 5. The average energy consumption overhead of faulty chips.
Figure 6. The energy overhead with three faults on a 3x5 and a 4x4 mesh.
C. Evaluating the Statistical Fault Generation Method
This section demonstrates the fault generation method proposed in Section V.B. For this experiment, we assume a small NoC, such as a 3×4 mesh, because the total CPU time for both the statistical and exhaustive fault generation methods is not too high. An application with 9 tasks is used for this experiment, although all other applications presented very similar results. Let us assume that the goal of this experiment is to estimate the average energy overhead when a fault hits a given tile, considering scenarios with 3 simultaneous faults.
First, the exhaustive method is executed, running all combinations of up to 3 faults in 12 tiles, i.e. scens(3 × 4, 3) = 298 fault scenarios (Equation (11)). It took about 3 minutes of CPU time to execute them. These results are considered the target results, i.e. the results we want to approach with the statistical method.
The second step is to execute a pilot experiment with a small number of randomly selected fault scenarios per router. This pilot experiment is used solely to extract the standard deviation of the energy consumed by chips with three randomly placed faulty tiles. The estimated standard deviation is 3.9% of the energy consumption.
The proposed sample size estimation approach is executed assuming two situations: (i) standard deviation of 3.9, confidence interval of 95%, and maximum error of 4%; and (ii) standard deviation of 3.9, confidence interval of 98%, and maximum error of 2%. The estimated sample sizes are 8 and 23, respectively, meaning that each tile must appear in at least 8 or 23 fault scenarios. From now on, the first situation is called sample8 and the second sample23. TABLE 1 presents the obtained results in terms of CPU time, total number of scenarios, and the maximum error observed for each tile.

TABLE 1. RESULTS FOR THE STATISTICAL FAULT GENERATION METHOD.

             CPU time (s)   # scenarios   max obs. error (%)
Exhaustive       192            220              -
Sample8           27             39             2.4
Sample23          63            101             0.8
Figure 7 illustrates the three situations and their respective heat charts, representing the energy overhead when a fault is found at each tile. Each square represents the average energy consumption for each tile.
It can be observed that the exhaustive method produces the expected results (the energy gradually decreases from the center to the borders). Sample23 produces almost the same results as the exhaustive method, with a small error but much less CPU time. Sample8 produces large errors, indicating that the sample size is not sufficient to accurately estimate the energy overhead for each router.
Even if the exhaustive results are not available, it is still possible to check the accuracy of a sample by visually analyzing the heat chart, as demonstrated in Figure 7. For instance, the expected appearance of a good heat chart is like that of the exhaustive test set, even for NoCs of different sizes and different applications. Note that the heat chart for sample8 deviates from the expected appearance, indicating that one should increase the sample size, if possible, to increase the accuracy of the results.
Figure 7. Visual analysis of the statistical fault generation method (heat charts for exhaustive, sample8, and sample23).
Figure 8 overlaps the average results for the three situations.
By comparing the exhaustive set with the other test sets, it can be seen that the biggest error for sample8, located at tile [2, 1], is 2.4% (see 1), which is below the maximum error stipulated for this set of experiments (4%). The biggest errors for sample23, located at tiles [1, 1] and [2, 0] (see 2), are around 0.8%, which is below the maximum error stipulated for this set of experiments (2%).
Figure 8. Close analysis of the resulting error by overlapping the average results for exhaustive, sample8, and sample23.
The example presented in this section demonstrates that the proposed fault generation approach enables trading off CPU time against result accuracy by selecting different values for the acceptable difference (x − μ) and the confidence level (1 − α).
VII. FINAL REMARKS
Previous papers have demonstrated that the use of spare tiles can significantly improve the yield and reduce the manufacturing cost of NoC-based MPSoCs. The tool presented in this paper determines task mappings for NoC-based MPSoCs with faulty tiles, minimizing the energy consumption and the application execution time. This way, defective chips can still execute the application, perhaps with some performance degradation, and can at least be sold to a lower-end market.
This paper evaluates the energy consumption and application execution time of faulty chips compared to fault-free chips. We evaluated several different classes of applications to check whether any particular application feature could affect the energy consumption or application execution time under faulty tiles. The results show that the spare tile approach has a small impact on energy consumption, and this impact is even smaller when the proportion of good to faulty tiles is higher. The existence of faulty tiles on the chip has, on average, no significant influence on the application execution time. Based on these results, we conclude that the spare tile approach can increase yield and reduce cost with small penalties on the application requirements.
Finally, this paper also proposed a statistical fault generation approach targeting very large MPSoCs. This approach demonstrates that a small sample of fault scenarios is sufficient for a reasonably accurate estimation of energy consumption, and it enables trading off CPU time against result accuracy.
VIII. ACKNOWLEDGMENT
Alexandre is supported by postdoctoral scholarships from Capes-PNPD and FAPERGS-ARD, grant numbers 02388/09-0 and 10/0701-2, respectively. Fernando Moraes is supported by CNPq and FAPERGS, projects 301599/2009-2 and 10/0814-9, respectively. Cesar Marcon and Marcelo Lubaszewski are partially supported by CNPq scholarships, grant numbers 308924/2008-8 and 478200/2008-0, respectively.
IX. REFERENCES
[1] Wolf, W.; Jerraya, A. A.; Martin, G. Multiprocessor system-on-chip (MPSoC) technology. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 27(10), pp. 1701-1713, 2008.
[2] Kahng, A.; et al. ORION 2.0: a fast and accurate NoC power and area model for early-stage design space exploration. DATE, pp. 423-428, 2009.
[3] Lee, S. E. et al. A high level power model for network-on-chip (NoC) router. Computers & Electrical Engineering, 35(6), 2009.
[4] Refan, F. et al. Reliability in application specific mesh-based NoC architectures. IEEE International On-Line Testing Symposium, pp. 207-212, 2008.
[5] Shamshiri, S.; Cheng, K-T. Yield and Cost Analysis of a Reliable NoC. VLSI Test Symposium, pp. 173-178, 2009.
[6] Carara E. A. et al. HeMPS - a framework for NoC-based MPSoC generation. ISCAS, pp. 1345–1348, 2009.
[7] Moraes, F. et al. HERMES: an infrastructure for low area overhead packet-switching networks on chip. Integration, the VLSI Journal, 38(1), pp. 69-93, 2004.
[8] Bertozzi, D.; Benini, L.; De Micheli, G. Error control schemes for on-chip communication links: the energy-reliability tradeoff. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 4(6), pp. 818-831, 2005.
[9] Ejlali, A. et al. Performability/energy tradeoff in error-control schemes for on-chip networks. IEEE Transactions on Very Large Scale Integration Systems, 18(1), pp. 1-14, 2010.
[10] Zhang, Z.; Greiner, A.; Taktak, S. A reconfigurable routing algorithm for a fault-tolerant 2D-mesh network-on-chip. DAC, pp. 441-446, 2008.
[11] Manolache, S.; Eles, P.; Peng, Z. Fault and energy-aware communication mapping with guaranteed latency for applications implemented on NoC. DAC, pp. 266-269, 2005.
[12] Hu, J.; Marculescu, R. Energy-aware communication and task scheduling for network-on-chip architectures under real-time constraints. DATE, pp. 234-239, 2004.
[13] Lei, T.; Kumar, S. A two-step genetic algorithm for mapping task graphs to a network on chip architecture. Euromicro Symposium on Digital System Design, pp. 180-187, 2003.
[14] Murali, S. et al. Mapping and configuration methods for multi-use-case networks on chips. ASP-DAC, pp. 146-151, 2006.
[15] Lee, C. et al. A task remapping technique for reliable multi-core embedded systems. CODES/ISSS, pp. 307-316, 2010.
[16] Ababei, C.; Katti, R. Achieving network on chip fault tolerance by adaptive remapping. International Symposium on Parallel & Distributed Processing, pp. 1-4, 2009.
[17] Tornero, R. et al. A multi-objective strategy for concurrent mapping and routing in networks on chip. International Symposium on Parallel & Distributed Processing, pp. 1-8, 2009.
[18] Choudhury, A. et al. Yield enhancement by robust application-specific mapping on network-on-chips. NoCArc, pp. 37-42, 2009.
[19] Huang, L. et al. Lifetime reliability-aware task allocation and scheduling for MPSoC platforms. DATE, pp. 51-56, 2009.
[20] Huang, L.; Xu, Q. Energy-efficient task allocation and scheduling for multi-mode MPSoCs under lifetime reliability constraint. DATE, pp. 1584-1589, 2010.
[21] Huang, L; Xu, Q. AgeSim: A simulation framework for evaluating the lifetime reliability of processor-based SoCs, DATE, pp. 51-56, 2010.
[22] Marcon, C. et al. CAFES: a framework for intrachip application modeling and communication architecture design. Journal of Parallel and Distributed Computing, 71(5), pp. 714-728, 2011.
[23] Mandelbrot, B. How long is the coast of Britain? Statistical self-similarity and fractional dimension. Science, 156(3775), pp. 636-638, 1967.
[24] Ghadiry, M.; Nadi, M.; Rahmati, D. New approach to calculate energy on NoC. International Conference on Computer and Communication Engineering, pp. 1098-1104, 2008.
[25] Hill, T.; Lewicki, P. Statistics: methods and applications: a comprehensive reference for science, industry, and data mining. StatSoft, 832 p., 2006.
Me3D: A Model-driven Methodology Expediting Embedded Device Driver Development
Hui Chen, Guillaume Godet-Bar, Frederic Rousseau, and Frederic Petrot
TIMA Laboratory (CNRS – Grenoble INP – UJF), 46 av. Felix Viallet, 38031 Grenoble, France
{hui.chen, guillaume.godet-bar, frederic.rousseau, frederic.petrot}@imag.fr
Abstract—Traditional development of reliable device drivers for multiprocessor system on chip (MPSoC) is a complex and demanding process, as it requires interdisciplinary knowledge in the fields of hardware and software. This problem can be alleviated by an advanced driver generation environment. We have achieved this by systematically synthesizing drivers from a device features model and specifications of hardware and in-kernel interfaces, thereby lessening the impact of human error on driver reliability and reducing the development costs. We present the methodology, called Me3D, and confirm the feasibility of the driver generation environment by manually converting sources of information captured in different formalisms into a Multimedia Card Interface (MCI) driver for a real MPSoC under a lightweight operating system (OS).
I. INTRODUCTION
Nowadays, a typical multiprocessor system on chip (MPSoC) project takes place under ever-increasing time-to-market pressure. Hardware and software are regularly re-designed for new versions of a product. As is well acknowledged, the software development cycle consumes considerable time and effort.
On the software side, device driver development causes a serious bottleneck. It is intrinsically complex and error-prone due to the necessity of interdisciplinary knowledge in the fields of engineering and computer science. In other words, device driver developers require in-depth understanding of the innumerable peripherals that exist in a typical embedded system, programming tools, operating systems (OSes), bus protocols, network programming, system management [1], etc.
Delivering a high-quality and thoroughly tested device driver is laborious. For instance, the LH7A404 system on chip (SoC) from NXP Semiconductors contains 16 peripherals, and the corresponding drivers have more than 78,000 physical source lines of code (SLOC) [2] (requiring around 19.6 person-years of development effort, estimated with SLOCCount1). Software re-usability and automation methods are hence eagerly required to reduce design effort and improve productivity.
The difficulty in designing and implementing reliable device drivers is notorious. Drivers in the Linux kernel 2.6.9 account for 53% of its bugs [3]. Similarly, 85% of unexpected system crashes originate from driver problems, according to a recent report from Microsoft [4]. With this in mind, a new methodology addressing reliability is strongly needed.
The contribution presented in this paper is a flexible device driver generation environment, able to produce the final C code of a software driver, starting from a device features model.
1SLOCCount v2.26 by David A. Wheeler, www.dwheeler.com/sloccount/
Fig. 1. Device driver as a low-level module in the OS structure
This environment is composed of a device driver generation tool and a validation flow. To evaluate the generation environment, we conducted a case study on the Multimedia Card Interface (MCI) device driver for the Atmel D940 [5] MPSoC. We created a features model for the MCI device, specifications of the MCI device and the D940 board, and an in-kernel interface specification for an ad-hoc OS, all of which are then systematically converted into the MCI driver. Afterwards, we validated the generated device driver with a validation flow. The experimental results demonstrate the feasibility of implementing the generation environment, and the expected efficiency in developing device drivers for MPSoC.
The paper is organized as follows. Section II presents the anatomy of a device driver. Section III reviews related work. Section IV introduces our methodology for accelerating embedded device driver development. Section V evaluates the proposed methodology. The last section concludes the paper and identifies future work based on the findings provided here.
II. DEVICE DRIVER OVERVIEW
The term device, as used in this paper, does not refer to the primary central processing unit (CPU) or main memory, but to a specific hardware resource for a dedicated task. The device is either attached to or embedded in a computer system architecture and can interact with the CPU and other hardware resources in the system via a single system bus or through a bus hierarchy.
A device driver is a low-level software component in the OS, which allows upper-level software to interact with a device. It can be considered, from an abstract point of view, as a brick in the OS chart (Fig. 1). The device drivers mentioned here mostly target embedded systems, which differ from personal computers (PCs) in their broader adoption of SoCs and greater variety of buses.
978-1-4577-0660-8/11/$26.00 © 2011 IEEE
A device driver implements an interface to the kernel and/or application developers for an underlying device, and provides a lower-level communication channel to the device. It acts as a translator from the kernel interface to the hardware interface. As a form of communication, it requires kernel services, and often also offers services to other kernel components.
The communication channels to the device can be provided by lower-level drivers. This leads to cascaded drivers [6]. The upper-level drivers provide an abstract view of the execution platform, while the lower-level drivers are more concrete and provide transparent communication to the devices' interfaces. An example is the Inter-Integrated Circuit (I2C) driver stack in the Linux kernel 2.6 [7].
The device driver interface can be separated into four parts (Fig. 1): a) the driver requires kernel services such as memory allocation, and also offers services (e.g., hardware initialization) to the kernel; b) the user application sends general commands to the driver using the exported driver interface; c) libraries provide the driver with services such as string manipulation; and d) lastly, the hardware abstraction layer (HAL) accommodates hardware access methods, which are used by the driver.
One of the most elementary pieces of information about a device and the driver that manages it is what function the device accomplishes. Different devices carry out different tasks. There are devices that play sound samples, devices that read and write data on a magnetic disk, devices that display graphics on a video screen, and so on.
For each type of functionality, there may be many different devices that carry out similar tasks. For instance, when displaying graphical information on a video device, the display controller may be a simple Video Graphics Array (VGA) controller, or it may be a modern video card running on Peripheral Component Interconnect Express (PCIe), with several gigabytes of graphics memory. Nevertheless, in each case, the high-level purpose of the device is the same.
The device driver organization involves a set of driver entry points, a number of data structures, and possibly also global symbols and constants. A typical driver entry point encompasses the hardware programming part (gray blocks in Fig. 2) and the kernel-driver interaction part. Composing a driver entry point requires up to eight pieces of information: 1) HAL-related (e.g., register access primitives), 2) platform-related (e.g., device base address), 3) device-related (e.g., register and bit field offsets), 4) device features (e.g., register programming sequences), 5) kernel-driver interface (e.g., return type and argument list of the driver entry point), 6) kernel services (e.g., memory allocator), 7) device class-related (e.g., access protocols), and 8) libraries-related (e.g., string manipulators).
Thus, because of the various and interrelated sources of information, driver generation is intrinsically complex.
III. RELATED WORK
We briefly discuss related work in the area of device driver development methodology, which can be classified into three categories: device driver synthesis methodologies, device interface languages, and hardware specification languages.
Early device driver synthesis methods, as part of hardware/software co-design efforts, attempt to synthesize OS-based device drivers for embedded systems [8]. However, these devices are different from those targeted by Me3D, as they have a simple internal structure and a small set of input/output (I/O) signals. Moreover, the synthesized driver only runs with a platform-specific real-time operating system (RTOS). Therefore, these approaches do not take on some of the issues addressed here, including the separation of in-kernel interface and hardware specifications.
Fig. 2. Looking into the driver entry points for an OS written in C
Wang et al. propose a tool [9] for synthesizing embedded device drivers. This approach does not separate the in-kernel and hardware interfaces of the driver, forcing the driver developer to detail the complete driver behavior for every device. In addition, they assume that the driver functionality can be split into non-overlapping control and data parts. This holds for some simple drivers; in more complex drivers, the control and data paths are tightly interleaved.
Termite [10] synthesizes device drivers by merging two state machines, of the OS and of the device. This may unavoidably lead to state explosion and a large final code size. This paper addresses these limitations. In addition, we believe a device class specification, which solely defines a set of events shared between the OS and the device specification, is not necessary.
Bombieri et al. [11] propose a device driver generation methodology based on the register transfer level (RTL) test bench of an intellectual property (IP). However, device drivers cannot be generated unless the code of the RTL test bench is available. In contrast, we propose a methodology that is applicable without involving RTL test benches.
Languages such as Laddie [12] and NDL [13] offer some constructs to describe a device interface. However, the first approach does not really deal with device driver problems, being limited to generating register access functions. The NDL approach requires changing how device drivers are written, but does not offer a solution for legacy drivers.
Hardware specification languages such as IP-XACT and the Unified Modeling Language (UML) MARTE profile are able to describe some parts of electronic components and designs. UML MARTE is widely used, while IP-XACT is an IEEE standard that describes not only structural information about hardware devices and designs, but also some contents such as a register map. To demonstrate the feasibility of our methodology, we have modeled the device and the hardware platform in IP-XACT.
Fig. 3. Abstract view of the Me3D methodology
Fig. 4. Basic library
IV. DRIVER GENERATION ENVIRONMENT
Device drivers such as upper-level driver stacks in cascaded drivers may not have direct access to the underlying hardware. In the remainder of this paper, it is assumed that a driver sits directly above the HAL.
Fig. 3 shows an abstract view of the Me3D methodology. The generation environment requires a device features model, hardware specifications, an in-kernel interface specification, libraries, and driver configuration parameters to produce device drivers. The device driver, in binary format, is validated on a real MPSoC or on virtual platforms. During the validation phase, performance results can be extracted, as shown in [14], for tuning the driver configuration parameters. With modified parameters, a new version of the driver is then generated.
A. Basic and HAL libraries
The basic library contains an abstraction layer for common data manipulation methods (Fig. 4). For instance, the StrCpy primitive (Fig. 4) may be linked to the strcpy function of a standard C library implementation, such as Newlib [15] or uClibc [16], or to the dna_strcpy function of an ad-hoc C library. Introducing these primitives allows the exploration of memory footprint and performance through the selection of different C library implementations. The basic library may contain source and/or object files.
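As an illustration of this abstraction, a compile-time binding of StrCpy could look like the following sketch; the USE_ADHOC_LIB switch and the dna_strcpy body are our assumptions, not the actual Me3D basic library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical ad-hoc C library implementation (stand-in for dna_strcpy). */
static char *dna_strcpy(char *dst, const char *src) {
    char *ret = dst;
    while ((*dst++ = *src++) != '\0')
        ;
    return ret;
}

/* The basic library binds the abstract StrCpy primitive to one concrete
 * implementation at generation time; flipping USE_ADHOC_LIB explores a
 * different memory footprint / performance trade-off. */
#ifdef USE_ADHOC_LIB
#define StrCpy(dst, src) dna_strcpy((dst), (src))
#else
#define StrCpy(dst, src) strcpy((dst), (src)) /* e.g., Newlib or uClibc */
#endif
```

A generated driver then calls StrCpy(...) without knowing which C library implementation ends up linked in.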
The HAL library contains the implementation of low-level hardware access primitives (e.g., primitives to read from or write to registers). It allows the development and integration of support for new hardware architectures to proceed separately from the generation tools, thereby increasing the flexibility of the environment and the reusability of the components. It may include source and/or object files.
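A HAL read/write primitive pair can be sketched as volatile accesses; the names below are illustrative, and on a real platform addr would be a memory-mapped I/O address rather than ordinary memory.

```c
#include <stdint.h>

/* Low-level HAL primitives: volatile prevents the compiler from caching
 * or eliding the accesses, which matters for memory-mapped registers. */
static inline uint32_t cpu_read_uint32(uintptr_t addr) {
    return *(volatile uint32_t *)addr;
}

static inline void cpu_write_uint32(uintptr_t addr, uint32_t val) {
    *(volatile uint32_t *)addr = val;
}
```

Shipping these as a separate library is what lets a new architecture port swap in, say, primitives with explicit memory barriers without touching the generator.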
B. In-kernel interface specification
In order to reflect in-kernel interface changes during kernel evolution, and to differentiate kernel-driver interaction among differing device classes, we propose an in-kernel interface specification dedicated to a certain device class for a given kernel.
The in-kernel interface specification mainly contains the kernel data structures (if any) to be used, software events, and transitions. The latter define the driver's desired reactions to requests, concerning hardware events that must happen before the driver sends a notification of completion to the kernel.
C. Device and platform specifications
Hardware vendors often release user manuals that describe the interface and operations of a device and the architecture of a hardware board. Such documentation is intended to provide sufficient information for driver developers. However, it is usually informal and written in natural language. To automate driver development, we require device and hardware board specifications, which provide not only structural information but also some contents like a register map. Specifications in a format such as IP-XACT and UML MARTE are available from some hardware vendors or can be derived from informal device or board documentation.
A device specification describes the following driver-related properties of a device: i) device name and ID, ii) register file information (e.g., register widths and offsets, bit field widths and masks, register/bit field accessibilities, reset values, etc.), and iii) port information.
A platform specification provides some driver-related information as well, i.e., i) device instantiations, ii) I/O offsets, iii) interrupt connections (which indicate whether the interrupt pin of the target device is used or not), iv) bus (e.g., bus clock, bus type, data bus width, data transfer type, device access type, transport mode), and v) processor (e.g., byte ordering, clock frequency, name, word length).
In general, IP-XACT includes most of the features mentioned above, although, to the best of our knowledge, it still lacks some information such as the data transfer type (e.g., x8, x16).
D. Device features model
Reading from or writing to a certain register may cause a side effect. For instance, writing a value to the length register of a given direct memory access (DMA) controller may start the DMA transfer. Often, some other registers must be programmed before the side effect takes place. For instance, before writing the length register of this DMA controller, the source address register and the destination address register must be set with the desired values so as to achieve successful DMA operations. Such a register programming sequence needs to be modeled to ensure correct device operation. Hence, we introduce a device features model to capture the way of interacting with the device.
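The ordering constraint can be made concrete with a hypothetical DMA controller whose register file is simulated in plain memory; all names below are invented for the sketch, the point being that the side-effecting length write must come last.

```c
#include <stdint.h>

/* Simulated register file of a hypothetical DMA controller. */
enum { DMA_SRC, DMA_DST, DMA_LEN, DMA_NREGS };
static uint32_t dma_regs[DMA_NREGS];
static int dma_started;

static void dma_reg_wr(int reg, uint32_t val) {
    dma_regs[reg] = val;
    if (reg == DMA_LEN)       /* side effect: writing LEN starts the transfer */
        dma_started = 1;
}

/* Correct programming sequence: source and destination addresses are set
 * before the length register triggers the transfer. */
static void dma_start_transfer(uint32_t src, uint32_t dst, uint32_t words) {
    dma_reg_wr(DMA_SRC, src);
    dma_reg_wr(DMA_DST, dst);
    dma_reg_wr(DMA_LEN, words);   /* must come last */
}
```

It is exactly this kind of "SRC and DST before LEN" sequence that the device features model is meant to capture explicitly.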
The device features model contains a set of predefined device features, such as init, read, write, etc. This model can be translated to C functions. The translation process is explained in more detail in the following section.
E. Driver generation
The driver generation flow is broken down into four steps. This section explains each of them.
Step 1: Parsing and inline functions generation. The device features model, along with the hardware specifications and the HAL library, is mainly used to generate bit field access functions (Fig. 5). These inline functions, containing bitwise operations (e.g., not, bit shift), are responsible for accessing the bit fields.

static inline uint32_t registerA_bitfieldX_rd() {
  return <read primitive> (IO_BASE + REGISTER_A_OFFSET) & BIT_FIELD_X_MASK;
}
...

Fig. 5. Step 1: Parameters parsing and inline functions generation
It is not difficult to generate these bit field access functions. An example of a bit field read function is shown in Fig. 5.a. Producing a bit field read function requires the names of the register and of the bit field that appear in the device features model, a read primitive from the HAL library, an I/O base address from the platform specification, and the offsets and widths of the register and the bit field from the device specification. If the HAL of the OS provides bit-level manipulation functions, then our code simply calls them.
There are two reasons for generating bit field access functions. Firstly, low-level bit operations account for a large share of the bugs in manual driver development. Secondly, introducing bit field access functions enhances readability to some extent.
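To complement the read accessor of Fig. 5.a, a matching generated write accessor could be sketched as follows; the mask, shift, and the simulated register are illustrative, not taken from a real device specification.

```c
#include <stdint.h>

/* Simulated 32-bit register standing in for a memory-mapped register. */
static uint32_t registerA = 0u;

#define BIT_FIELD_X_SHIFT 4u
#define BIT_FIELD_X_MASK  (0x3u << BIT_FIELD_X_SHIFT)   /* 2-bit field */

/* Generated read accessor: mask out the field and shift it down. */
static inline uint32_t registerA_bitfieldX_rd(void) {
    return (registerA & BIT_FIELD_X_MASK) >> BIT_FIELD_X_SHIFT;
}

/* Generated write accessor: read-modify-write so that neighboring bit
 * fields in the same register are preserved. */
static inline void registerA_bitfieldX_wr(uint32_t val) {
    registerA = (registerA & ~BIT_FIELD_X_MASK)
              | ((val << BIT_FIELD_X_SHIFT) & BIT_FIELD_X_MASK);
}
```

Getting the read-modify-write and the masking right by hand for dozens of fields is precisely the error-prone work the generator takes over.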
Apart from these bit field access functions and string constant macros, this step parses out device-related parameters (e.g., device cardinality) as well.
Step 2: Device features generation. This step makes use of the products (inline functions and parameters) of the previous step and requires a device features model. The reason for using this model is explained in subsection IV-D.
The device features model reflects product dependencies (if any), hardware configurations, and device operations (e.g., data transfer operations, command/response operations) described in natural language or as functional flow charts, which are traditionally provided by hardware vendors as part of the user manual. Thus, hardware vendors are expected to write the device features model.
The device features model can be written in C or in an alternative tiny language, DFDL (Device Features Description Language). Our experience shows that the latter is easier to interpret because it has simpler semantics. Instead of writing for (int i = 0; i < 5; i++) in C/C++, one just writes foreach i (0, 5) in DFDL; the foreach loop in DFDL is simpler and less error-prone.
The in-house language-based device features model is only used to define register programming sequences, not to write device drivers; this model could evolve to an intermediate format or even be eliminated, once a device specification is capable of capturing these sequences.
Using the products (inline functions and parameters) of Step 1, the device features model is translated to device functionalities (Fig. 6) in C. This translation is feasible, as the tiny language only uses high-level constructs for logical control; e.g., the await construct (see subsection V-D) maps to the do-while structure in C.

void access_hardware(...) {
  registerB_bitfieldY_wr($VAL);
  ...
  $VAL2 = registerA_bitfieldX_rd();
}
...

Fig. 6. Step 2: Device features generation

Fig. 7. Step 3: a) Driver and b) Makefile generation

Step 3: Driver source and Makefile generation. This
step makes use of the product (device features) of Step 2, and requires the basic and HAL libraries, driver configuration parameters, hardware specifications, and an in-kernel interface specification.
The libraries are used to produce some #include directives (Fig. 7). The in-kernel interface specification describes how the driver interacts with the kernel and adjacent drivers, while the driver configuration parameters determine tunable elements such as the synchronization method (interrupt or polling). With the in-kernel interface specification and the driver configuration parameters, it is easy to synthesize an extended finite state machine (EFSM) after dependency computation. The hardware events in the synthesized EFSM are then mapped to device features (generated in the previous step). Afterwards, the EFSM is translated into device driver code in C.
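One plausible shape for the translated EFSM is a state variable plus a message dispatch, sketched here loosely after the kernel-driver interaction of Fig. 11; the enum names and the switch-based dispatch style are our assumption, not the generator's actual output.

```c
/* States and messages loosely mirroring the EFSM of Fig. 11. */
typedef enum { S_START, S_INIT_HW, S_IDLE, S_READ, S_END } state_t;
typedef enum { M_INIT_HW, M_READ, M_READ_DONE, M_EXIT } msg_t;

static state_t state = S_START;

/* One EFSM step: each case encodes a transition; at the marked points a
 * real generated driver would invoke the mapped device features. */
static void driver_step(msg_t m) {
    switch (state) {
    case S_START:
        if (m == M_INIT_HW) state = S_INIT_HW;  /* program the device */
        break;
    case S_INIT_HW:
        state = S_IDLE;                         /* report status_ok */
        break;
    case S_IDLE:
        if (m == M_READ)      state = S_READ;   /* run the read feature */
        else if (m == M_EXIT) state = S_END;
        break;
    case S_READ:
        if (m == M_READ_DONE) state = S_IDLE;   /* send status to kernel */
        break;
    default:
        break;
    }
}
```

Flattening the EFSM into such explicit per-state code, rather than composing full OS and device state machines, is what keeps the generated driver small.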
In addition, the in-kernel interface specification and the platform specification define the compiler flavor and the processor type respectively, allowing the compilation environment to produce the Makefile.
Step 4: Source code compilation. In this step the make command is iteratively executed for a certain CPU architecture, using the previously generated Makefile as input. If the HAL and basic libraries are provided in the form of source files, they are also compiled.
F. Driver configuration and space exploration
Device driver development involves a series of decision-making processes. For instance, a write Application Programming Interface (API) may be implemented as either synchronous or asynchronous. Likewise, a DMA driver can use either a circular buffer or a linked list. Different decisions result in diverse C code and differing driver performance. We call this "driver space exploration".
Fig. 8. Code generation possibilities
Conventional driver space exploration refers to iteratively refactoring the driver code. The choice of the driver version is usually driven by performance, power consumption, or binary size.
In order to explore the driver space efficiently and effectively, we use high-level configuration parameters (Fig. 8) for the driver generation environment. A change in a design decision requires only the modification of an attribute of the driver configuration; a new version of the driver can then be generated again.
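As a rough illustration of how one configuration attribute reshapes the driver code, consider the choice between polling and interrupt synchronization; Me3D regenerates the source rather than using the preprocessor, so the #ifdef selection and all names below are purely didactic.

```c
#include <stdint.h>

static volatile uint32_t mci_status;   /* simulated status register */

/* Variant emitted when the configuration selects polling: busy-wait. */
static void wait_rxrdy_polling(void) {
    while (mci_status != 1u)
        ;   /* spin on the status register */
}

/* Variant emitted when the configuration selects interrupts: the driver
 * would sleep until an IRQ handler signals readiness; simulated here by
 * setting the status directly. */
static void wait_rxrdy_interrupt(void) {
    while (mci_status != 1u)
        mci_status = 1u;   /* stand-in for "sleep until IRQ fires" */
}

/* A single configuration attribute decides which variant the generated
 * driver contains. */
#ifdef SYNC_INTERRUPT
#define wait_rxrdy wait_rxrdy_interrupt
#else
#define wait_rxrdy wait_rxrdy_polling
#endif
```

Flipping one attribute and regenerating replaces the busy-wait with the sleeping variant throughout the driver, which is the essence of the exploration loop around Fig. 8.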
G. Driver validation
In our experimentation, we used a real board to validate the driver (along with an application program, an OS, and the hardware platform's HAL).
If a real board is not available, we propose two simulation models to validate the driver. The functionality is validated with an abstract SystemC simulation model, called transaction accurate. The performance of the driver is validated on a low-abstraction SystemC simulation model, called cycle accurate bit accurate. Due to limited space, these simulation models are not presented in this paper.
V. EVALUATION
In this section, we evaluate the applicability of Me3D as a methodology for expediting embedded device driver development, on the Atmel D940 [5] MPSoC.
A. Evaluation points
The points of evaluation are: 1) the feasibility of describing device features and in-kernel interfaces; 2) the feasibility of systematically converting the device features model and the specifications to a device driver.
To evaluate point 1, we chose a well-adopted specification language for the hardware platform and devices (here IP-XACT, though the methodology is not limited to it), specified in-kernel interfaces with our in-house IISL (In-kernel Interface Specification Language), and modeled device features with DFDL (Device Features Description Language).
To evaluate point 2, we manually converted the device features model, the hardware specifications, and the in-kernel interface specification to the device driver according to the proposed methodology. The open-source DNA-OS [17] is used by the converted software.
B. Hardware specifications
Device specifications. In order to bring the Multimedia Card Interface (MCI) (Fig. 9) into operation, one must configure some registers of the power management controller (PMC),
Fig. 9. MCI device and its neighborhood
Fig. 10. In-kernel interfaces for a MCI driver
of the programmable input output (PIO), and optionally of the programmable DMA controller (PDC). In other words, we need information about the register layouts of these devices. Hence, we have modeled the specifications for the MCI, the PMC, and the PIO (in IP-XACT for this experimentation). The PDC specification is not modeled, because the native MCI driver does not use DMA.
Platform specification. The D940 MPSoC contains many peripheral devices. It is not necessary to specify the whole platform. In practice, we have only modeled the parts related to device instantiations, interrupt numbering, etc.
C. In-kernel interface specification
In this paper, we will not introduce the grammar of our in-house IISL language. However, we briefly present what the in-kernel interface specification covers. It specifies the driver interfaces (Fig. 10) toward the application, the kernel, and a generic MultiMediaCard (MMC) module. The generic MMC module is responsible for card properties discovery and MMC protocol implementation. It offers some services (denoted by the lollipop connectors on the module side) to the MCI driver, and defines APIs (e.g., read_low).
The in-kernel interface specification presents a partial EFSM (Fig. 11) describing the interactions between the driver, its adjacent drivers (if any), and the kernel. To describe these interactions, we use messages. A message is a token sent to the driver from its adjacent drivers or the kernel, or vice versa. The former, called an "inbound message", can be a kernel request (e.g., initialize hardware, publish device, etc.), a DMA request, etc., whereas the latter, called an "outbound message", is the driver's response to the sender of the token. In Fig. 11, downward dashed arrows denote inbound messages, whereas upward dashed arrows represent outbound messages.
As shown in Fig. 11, starting from the OS booting (the start state), the device driver receives inbound messages from the kernel in succession, which bring the driver from one state to another sequentially, until it reaches the idle state. In the INIT_HW (hardware initialization) state, the driver sends back the status_ok message to the OS in the case of a successful initialization. When the driver is in the idle state, it waits for an inbound message. A read inbound message sends the driver to the READ state for reading data from the MMC card. At the end of the read, the driver sends a message with the read status to the kernel and returns to the idle state. An exit inbound message brings the driver to the end state.

Fig. 11. EFSM of kernel-driver interaction

1 read {
2   in void *buffer
3   in int32_t word_count
4
5   foreach i (0, word_count):
6     await (MCI_SR.RXRDY == 1)
7       ((uint32_t *)buffer)[i] = MCI_RDR
8 }
9 ...

Fig. 12. MCI device features model
D. MCI device features model
The device features model for the MCI device consists of seven features (e.g., read, write, etc.) and the definition of the block length. It is described with our in-house language, DFDL. Fig. 12 presents the read feature. This feature has two incoming arguments, i.e., the buffer pointer and the word count. It waits until the RXRDY (receive ready) bit field of the MCI_SR (status register) equals 1, then stores the value of the MCI_RDR (read register) to a specified buffer. The process above iterates word_count times.
DFDL is a tiny ad-hoc language with some constructs tailored for modeling device features. The await construct in Fig. 12 simplifies the do-while loop of C/C++. A C/C++ equivalent to line 6 of Fig. 12 (waiting for a bit field to reach a value) would contain several statements: reading the register value and logically ANDing it with a bit field mask, comparing the masked result with a specified value, and iterating these steps until the bit field value equals the given one.
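Using the MCI names that appear later in Fig. 13, such an expansion could be sketched as follows; the status register is simulated in plain memory (and flips RXRDY on after the first read) so that the fragment stays runnable.

```c
#include <stdint.h>

#define MCI_SR_RXRDY (0x1u << 1)   /* bit field mask, as in Fig. 13 */

static uint32_t mci_sr;            /* simulated MCI status register */

static uint32_t mci_sr_read(void) {
    /* A real driver would use the HAL read primitive on MCI_BASE + MCI_SR;
     * here we read a plain variable and then set RXRDY so the loop ends. */
    uint32_t v = mci_sr;
    mci_sr |= MCI_SR_RXRDY;
    return v;
}

/* Expansion of `await (MCI_SR.RXRDY == 1)`: read the register, mask the
 * bit field, shift it down, compare with the awaited value, iterate. */
static void await_rxrdy(void) {
    while (((mci_sr_read() & MCI_SR_RXRDY) >> 1) != 1u)
        ;
}
```

One DFDL line thus hides a mask, a shift, a comparison, and a loop, each of which is an opportunity for a hand-written bug.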
E. Conversion
At first we analyzed which registers and bit fields appear in the MCI device features model. Then we parsed out the offsets and widths of these registers and bit fields (Fig. 13.a) from the device specifications, and the MCI base address from the D940 platform specification. Information related to the MCI is gathered in the d940_mci.h file. Likewise, information related to the PIO and the PMC is collected in d940_pio.h and d940_pmc.h respectively. In addition, inline functions for accessing some bit fields are also produced (Fig. 13.b).
Afterwards, the macros and inline functions, along with the MCI device features model, are used to produce the MCI device features (Fig. 14). It must be mentioned that although the read device feature looks like a C function, it is not, as it is not qualified with a return type.

#define MCI_BASE 0xFFFA8000
...
#define MCI_SR 0x40
#define MCI_SR_RXRDY (0x1 << 1)
...

static inline uint32_t MCI_SR_RXRDY_rd() {
  return cpu_read_UINT32(MCI_BASE + MCI_SR) & MCI_SR_RXRDY;
}
...

Fig. 13. Step 1: Macros and inline functions generation

read(void *buffer, int32_t word_count)
{
  for (int32_t i = 0; i < word_count; ++i) {
    while (MCI_SR_RXRDY_rd() != 1);
    ((uint32_t *)buffer)[i] = MCI_RDR_rd();
  }
}
...

Fig. 14. Step 2: MCI device features generation

/* Header inclusions */
status_t read_low(void *buffer, int32_t word_count)
{
  for (int32_t i = 0; i < word_count; ++i) {
    while (MCI_SR_RXRDY_rd() != 1);
    ((uint32_t *)buffer)[i] = MCI_RDR_rd();
  }
  return DNA_OK;
}
...

/* In-Kernel Interface Specification */
...
process CHOICES {
  ...
  || read_low; dev.read; read_low-done[$status == DNA_OK];
}
...

/* Driver Config. Parameters */
INTERRUPT
...

Fig. 15. Step 3.a: Synthesize EFSM and derive driver functionalities
Finally, the driver configuration parameters (Fig. 15.a), the MCI device features, the libraries, and the in-kernel interface specification (Fig. 15.b) are used to synthesize the EFSM and derive the driver functionalities. The INTERRUPT parameter (Fig. 15.a) selects interrupt as the synchronization mechanism. In the in-kernel interface specification there exists a process describing how the driver will interact with neighboring components when it is in the idle state and receives a message. For instance, when the driver is in the idle state and receives a read_low message, it will wait until the read operation terminates, then return a DNA_OK status to the kernel in the case of a successful read. The dev.read hardware event is mapped to the read device feature. We must note that the || construct (Fig. 15.b) is used to separate different transition conditions.
Table I summarizes the conversion results. The SLOC column refers to the source code size (excluding debugging functions and statements) of the native and converted MCI drivers, while the last column shows the size of the binary drivers.

TABLE I
SOURCE LINES OF CODE, BINARY SIZES OF THE NATIVE AND GENERATED MCI DRIVERS (EXCLUDING DEBUGGING FUNCTIONS AND STATEMENTS)

            SLOC   Binary (contains application, OS, & HAL) in KB
Native      407    421.9
Generated   362    421.4

TABLE II
EFFORT FOR MCI DRIVER DEVELOPMENT WITH AND WITHOUT ME3D

                                     Effort in person-days   SLOC
Device specifications                2                       2573
Platform specification               1                       219
In-kernel interface specification    2                       188
Device features model                1                       64
Total effort using Me3D              6                       -
Total effort without Me3D            21                      -
We can notice that the systematically generated driver source is slightly smaller than the native one. The reason is that the native MCI driver is written without optimizations, whereas the generation optimizes code size. For instance, the native driver represents registers as union structures; in contrast, the generated code only defines the offsets of the bit fields actually used. A disadvantage of the union structure is that reserved bit fields have to be specified too. Though there might be advantages to unions in terms of code review, more efficient validation is feasible by checking the high-level intermediate models.
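The two styles can be contrasted on a hypothetical two-field control register; the layout below is invented, and note that C bit field ordering is implementation-defined (the masks match the usual little-endian GCC layout), which is itself a further argument against the union style.

```c
#include <stdint.h>

/* Union/struct style (native driver): every bit field, including the
 * reserved ones, must be declared so the layout stays correct. */
typedef union {
    struct {
        uint32_t enable   : 1;    /* bit 0 on little-endian GCC */
        uint32_t reserved : 30;   /* must be spelled out */
        uint32_t ready    : 1;    /* bit 31 */
    } bits;
    uint32_t raw;
} ctrl_reg_t;

/* Offset/mask style (generated driver): only the bit fields actually
 * used by the driver are defined. */
#define CTRL_ENABLE_MASK (0x1u)
#define CTRL_READY_MASK  (0x1u << 31)
```

Both notations address the same bits; the generated style simply omits everything the driver never touches, which is where the SLOC saving in Table I comes from.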
Table II shows the results of generating the MCI driver using Me3D versus developing it manually. We find that using the Me3D methodology for generating drivers results in a 350% improvement in productivity. Considering that the native driver was developed by a highly experienced kernel developer with about 21 person-days of effort (around 26 to 40 person-days of effort using the Intermediate COCOMO² formula with coefficients for embedded software projects and typical values for an effort adjustment factor), we can expect the acceleration of driver development to be greater than 350%. It must be noted that some specifications are suitable for reuse: the platform specification is usable for all devices, and the in-kernel interface specification for a certain device class. Incidentally, around 500 lines from the IP-XACT specifications (totaling 18%) of the devices and the platform are used for the driver generation.
F. Performance
We analyzed the performance of the native MCI driver for DNA-OS against that of the generated one. Performance values were captured using a Secure Digital (SD) card (Kingston SD/1GB). We measured a benchmark that performs a sequence of unbuffered reads from the SD card connected to the MCI device. As a result, the transfer rate and CPU utilization achieved by the generated and native drivers are similar.

² Intermediate COCOMO by B. Boehm, en.wikipedia.org/wiki/COCOMO
VI. CONCLUSION AND FUTURE WORK
Device drivers are crucial software elements with considerable impact on both design productivity and quality. Device driver development has traditionally been error-prone and quite time-consuming. On that account, we propose an advanced device driver generation environment to shorten driver development time and improve driver quality. An experiment generating a Multimedia Card Interface (MCI) driver for the Atmel D940 multiprocessor system-on-chip (MPSoC) achieves favorable results regarding code size.
In the future, we plan to evaluate the methodology on several OSes, introduce an intermediate format for device driver generation and validation, and develop an automatic tool for driver generation. We will also study optimization issues such as performance and power consumption, and consider other constraints (e.g., upper-bound timing imposed by critical or real-time systems) as future research subjects.
ACKNOWLEDGMENT
The authors would like to thank the MEDEA+ and CATRENE offices and the French Ministry of Industry for supporting this work via the MEDEA+/CATRENE SoftSoC project.
REFERENCES
[1] Hewlett-Packard Company. (2010) HP Tru64 UNIX Operating System Version 5.1B-6. [Online]. h18004.www1.hp.com/products/quickspecs/13868_div/13868_div.pdf
[2] NXP Semiconductors. (2007) LH7A404 Board Support Package V1.01. [Online]. ics.nxp.com/support/documents/microcontrollers/zip/code.package.lh7a404.sdk7a404.zip
[3] Coverity, Inc. (2005) Analysis of the Linux Kernel. [Online]. www.coverity.com/library/pdf/linux_report.pdf
[4] N. Ganapathy. (2008) Introduction to "Developing Drivers with the Windows Driver Foundation". [Online]. www.microsoft.com/whdc/driver/wdf/wdfbook_intro.mspx
[5] Atmel Corp. (2008) DIOPSIS 940HF AT572D940HF Preliminary. [Online]. www.atmel.com/dyn/resources/prod_documents/doc7010.pdf
[6] K. J. Lin and J. T. Lin, "Automated development tools for Linux USB drivers," in 14th ISCE, Braunschweig, Germany, 2010, pp. 1–4.
[7] G. Kroah-Hartman, "I2C Drivers, Part II," Linux Journal, Feb. 2004.
[8] M. O'Nils and A. Jantsch, "Device Driver and DMA Controller Synthesis from HW/SW Communication Protocol Specifications," Design Automation for Embedded Systems, vol. 6, no. 2, pp. 177–205, 2001.
[9] S. Wang, S. Malik, and R. A. Bergamaschi, "Modeling and Integration of Peripheral Devices in Embedded Systems," in DATE'03, Munich, Germany, 2003, pp. 10136–10141.
[10] L. Ryzhyk, P. Chubb, I. Kuz, E. Le Sueur, and G. Heiser, "Automatic device driver synthesis with Termite," in 22nd SOSP, Big Sky, MT, 2009.
[11] N. Bombieri, F. Fummi, G. Pravadelli, and S. Vinco, "Correct-by-construction generation of device drivers based on RTL testbenches," in DATE'09, Nice, France, 2009, pp. 1500–1505.
[12] L. Wittie, C. Hawblitzel, and D. Pierret, "Generating a statically-checkable device driver I/O interface," in Workshop on Automatic Program Generation for Embedded Systems, Salzburg, Austria, 2007.
[13] C. L. Conway and S. A. Edwards, "NDL: A Domain-Specific Language for Device Drivers," SIGPLAN Not., vol. 39, pp. 30–36, Jun. 2004.
[14] X. Guerin, K. Popovici, W. Youssef, F. Rousseau, and A. Jerraya, "Flexible Application Software Generation for Heterogeneous Multi-Processor System-on-Chip," in 31st COMPSAC, Beijing, China, 2007.
[15] Red Hat, Inc. (2010) Newlib. [Online]. sources.redhat.com/newlib
[16] E. Andersen. (2011) uClibc. [Online]. www.uclibc.org/downloads
[17] X. Guerin and F. Petrot, "A System Framework for the Design of Embedded Software Targeting Heterogeneous Multi-core SoCs," in 20th ASAP, Boston, MA, USA, 2009, pp. 153–160. [Online]. tima-sls.imag.fr/viewgit/apes
Session 7: Tools and Designs for Configurable Architectures
Schedulers-Driven Approach for Dynamic Placement/Scheduling of Multiple DAGs onto SoPCs

Ikbel Belaid, Fabrice Muller
University of Nice Sophia-Antipolis, LEAT-CNRS, France
e-mail: {Ikbel.Belaid, Fabrice.Muller}@unice.fr

Maher Benjemaa
National Engineering School of Sfax, University of Sfax, Tunisia
e-mail: [email protected]
Abstract—With the advent of System on Programmable Chips (SoPCs), there is a serious need for placing and scheduling algorithms that can allow multiple Directed Acyclic Graph (DAG) structured applications to compete for the computational resources provided by SoPCs. A runtime scheme for distributed scheduling and placement of DAG-based real-time tasks on SoPCs is described in this paper. In the proposed distributed approach, called Schedulers-Driven, each scheduler associated to a DAG makes its own placement/scheduling decisions and collaborates with the available placers corresponding to the SoPCs in the system. The placers focus on managing the free resource space for the requirements of elected tasks. Schedulers-Driven aims at optimizing the DAG slowdowns and reducing the rejection ratio of real-time DAGs. Other important goals are attained by this approach: the reduction of placement and scheduling overheads, ensured by the techniques of prefetch and reuse, and the efficiency of resource utilization, guaranteed by the reuse technique and the slickness of the placement method.

Keywords—real-time DAGs; Schedulers-Driven placement/scheduling; reuse; prefetch; heterogeneous device; run-time reconfiguration.
I. INTRODUCTION
In recent years, reconfigurable computing has
advanced at a phenomenal rate. This paradigm has given
rise to SoPCs, which aim to satisfy the demands of
embedded-system designers working under many tight
constraints. SoPCs combine two parts: general-purpose
processors and reconfigurable hardware resources. Despite
their flexibility and high performance, SoPCs raise a
number of challenges that must be addressed. One of them
is the dynamic scheduling of parallel real-time jobs
modeled by directed acyclic graphs (DAGs) onto the
reconfigurable resources. It is therefore reasonable to
envisage a scenario where more than one DAG competes
at the same time to be scheduled onto a high density of
reconfigurable resources. The purpose of our work is to
provide dynamic scheduling and placement for DAGs as
they arrive at a heterogeneous system. The objective of
dynamic placement/scheduling is i) to fit the tasks within
DAGs efficiently onto reconfigurable units partitioned on
the SoPCs, respecting their heterogeneity and taking
advantage of the run-time reconfiguration mechanism, and
ii) to order their execution so that task precedence and
real-time requirements are satisfied.

Many dynamic scheduling schemes have been
introduced in parallel computing systems. One simple and
efficient type of scheduling method is to dynamically
construct a combined DAG, composed of DAGs arrived at
the system and then to schedule the composite DAG by
one among efficient single-DAG algorithms in the
literature. Some methods which fall into this category
include those presented by Zhao and Sakellariou in [1],
who focus on achieving a certain level of quality of service
for the given DAGs defined by the slowdown that each
DAG would experience. The idea of combining dynamic
DAGs is also proposed in [2]. This paper develops Serve
On Time and First Come First Serve algorithms that
schedule each arrived DAG with the unfinished DAGs.
The objective of these algorithms is to properly add the
new submitted DAG into running DAGs, forming a new
integrated DAG. In [3], the dynamic DAGs are scheduled
with periodic real-time jobs running on the heterogeneous
system. The proposed scheduling scheme introduces
admission control for DAGs and schedules globally the
tasks of each arrived DAG by modeling the spare
capability left by the periodic jobs in the system. Then,
each scheduled task is received by a machine where it will
be scheduled locally by the EDF algorithm. [4] presents a
hierarchical matching and scheduling framework to
execute multiple DAGs on computational resources. Based
upon a client-server model and DHS algorithm, each DAG
is associated with a client machine and independently
determines when a scheduling decision should be made.
Through load estimates, each client machine matches its
tasks to a suitable group of server machines. When the
application chooses a particular group of servers to execute
a given task, the low-level scheduler determines the most
appropriate member of the group to execute the received
task. [5] deals with parallel jobs arriving at the system
following a Poisson process and takes into account the
reliability measure as well as the overheads of scheduling
and dispatching tasks to processors. Using admission
control for real-time jobs, the paper presents DAEAP,
978-1-4577-0660-8/11/$26.00 ©2011 IEEE
179
DALAP and DRCD scheduling algorithms to enhance the
reliability of the system.
Several researchers have developed dynamic placement
methods of tasks on reconfigurable devices. The placement
in [6] is considered the baseline placement algorithm. The
placement is based on KAMER method that partitions the
free space into Maximal Empty Rectangles (MER) and
employs the bin-packing rules to fit tasks into MERs. [7]
presents an on-the-fly partitioning approach. [8] employs the
staircase method to manage the free space. Unlike the
previous works, [9] manages the occupied space instead of
the free space and proposes Nearest Possible Position
algorithm to fit tasks while optimizing inter-task
communication.
To the best of our knowledge, none of these existing
methods of placement and scheduling is suitable for the
environment used in this paper, as most of them target
a purely software context or are not applicable
to real-time DAGs. In this paper, a new
dynamic competitive placement/scheduling approach is
proposed to execute real-time DAGs on SoPCs. The
remainder of this paper is organized as follows. Section 2
details our proposed approach of placement/scheduling
DAGs onto SoPCs. The experimental results are given in
Section 3 followed by conclusions in Section 4.
II. SCHEDULERS-DRIVEN
PLACEMENT/SCHEDULING
Throughout the paper, the Xilinx heterogeneous
column-based FPGA was used as a reference for the SoPC.
The heterogeneous system is composed of n SoPCs. Each
one contains a set of reconfigurable hardware
resources denoted {RBk}, where k identifies the resource
type; there are NP types of reconfigurable
resources in the SoPCs. As shown in Fig. 1, the execution
system is constituted by a set of m local schedulers (Sched
i) associated with the arrived DAGs. The local schedulers
communicate with n placers. Each placer is assigned to a
SoPC and makes its own decision in managing
reconfigurable resource space. Besides the m local
schedulers and n placers, we introduce two other structures
in the system: Recover and Pending. In the distributed
Schedulers-Driven placement/scheduling, all the
structures: m local schedulers, n placers, Recover and
Pending operate to make decisions about scheduling and
placement of real-time tasks. The real-time DAGs are
submitted dynamically and periodically according to a
fixed inter-arrival interval. A real-time DAG is defined by
the pair (N,E). N is the set of nodes representing non-
preemptive tasks in the DAG and E is the set of edges
linking the dependent tasks. Each real-time task in the
DAG is characterized by its worst case execution time
(CA), its relative deadline (DA) and its release time (RA).
The release time is the time when the task is ready for
execution, having received all its required data from its
predecessors. RA is determined according to the arrival
time of the DAG to which the task belongs and to the time
of execution achievement of its predecessors. Moreover,
each task (A) is presented as a set of reconfigurable
resources (RBk) which are required to achieve its execution
on the SoPCs and defines the RB-model of the task as
expressed in (1).
(1)
Figure 1. System overview: local schedulers Sched 1 … Sched m with List_scheduler, List_recover and List_pending, placers 1 … n attached to SoPC1 … SoPCn, and the Recover and Pending structures.
Under a hardware environment, the placement and
scheduling problems are highly interlinked. Indeed,
the placers must satisfy the resource requirements of each
task elected by the schedulers, and the scheduler decisions
must be made according to the ability of the placers to
provide sufficient RBs for tasks while respecting their
precedence and real-time constraints. Thus, the major
challenge in this environment is to reduce the rejection
rate as much as possible. The two following sections detail
our proposed algorithms for placing/scheduling DAGs on
reconfigurable devices (SoPCs).
A. On-line Placement Algorithm
The placement problem consists of two sub-functions: i)
partitioning, which handles the free space of resources in
the SoPC and identifies the Maximal Empty Rectangles
(MERs) enabling task execution (MERs are the empty
rectangles that are not contained within any other empty
rectangle), and ii) fitting, which selects the best feasible
placement solution within the MERs while maintaining
resource efficiency. As stated above and as shown in Fig. 2,
we rely on 2D column-based architectures represented by
a matrix (Yi,j), where LineNumber denotes the number of
lines in the SoPC and ColumnNumber denotes its number
of columns.
(2)
To achieve the partitioning sub-function, we define
Max_widthi,j and Max_heighti,j for each Yi,j.
Max_widthi,j is the number of free RBs found along
the line of Yi,j, starting from Yi,j, without crossing an
occupied RB. Max_heighti,j is the number of free RBs
counted from Yi,j down its column until the first
occupied RB. Max_widthi,j and Max_heighti,j are null for
occupied RBs (Yi,j = 0). The search for MERs also
requires the search for key RBs. Key RBs are the free RBs
which provide the upper left vertices of MERs. A key RB
is an RB (Yi,j) that has an occupied RB on its left (Yi,j-1),
or whose free left neighbor has a Max_heighti,j-1 lower
than that of the RB itself. Moreover, a key RB must have
an occupied RB above it (Yi-1,j), or its free upper neighbor
must have a Max_widthi-1,j lower than that of the RB
itself. In Fig. 2, the RBs in the SoPC marked with a star
symbol are the key RBs, and the values in parentheses are
their Max_width and Max_height.

Figure 2. Key RB and MER search (column-based SoPC with 4 RB types: RB1, RB2, RB3, RB4; the figure labels MER1 and MER2 at Y2,2, MER3 at Y1,2, MER4 at Y2,3, and a MER nested in MER1 at Y2,3).
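The Max_width/Max_height bookkeeping and the key-RB test just described can be sketched as follows. This is a simplified model of our own (the grid only records free/occupied and ignores RB types; the device border counts as occupied), not the paper's implementation.

```python
def max_width_height(grid):
    """grid[i][j] = 1 if the RB is free, 0 if occupied.
    Returns (MW, MH): the number of free RBs rightward / downward from each cell."""
    rows, cols = len(grid), len(grid[0])
    MW = [[0] * cols for _ in range(rows)]
    MH = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in reversed(range(cols)):
            if grid[i][j]:
                MW[i][j] = 1 + (MW[i][j + 1] if j + 1 < cols else 0)
    for j in range(cols):
        for i in reversed(range(rows)):
            if grid[i][j]:
                MH[i][j] = 1 + (MH[i + 1][j] if i + 1 < rows else 0)
    return MW, MH

def key_rbs(grid):
    """Key RBs provide the upper-left vertices of MERs: the left neighbour is
    occupied (or has a smaller Max_height) AND the upper neighbour is occupied
    (or has a smaller Max_width). The border is treated as occupied."""
    MW, MH = max_width_height(grid)
    keys = set()
    for i in range(len(grid)):
        for j in range(len(grid[0])):
            if not grid[i][j]:
                continue
            left_ok = j == 0 or not grid[i][j - 1] or MH[i][j - 1] < MH[i][j]
            up_ok = i == 0 or not grid[i - 1][j] or MW[i - 1][j] < MW[i][j]
            if left_ok and up_ok:
                keys.add((i, j))
    return keys

demo = [[1, 1, 0],
        [1, 1, 1],
        [0, 1, 1]]
print(sorted(key_rbs(demo)))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

On this 3x3 grid, only the four upper-left free cells qualify: every other free cell inherits its rectangle widths from a key RB to its left or above.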
1) Partitioning: Based on key RB, Max_width and
Max_height of RBs, this first sub-function of placement
problem consists in extracting MERs to enable the
placement of elected tasks on reconfigurable device.
Partitioning is conducted through Algorithm 1. Algorithm
1 deals with each key RB independently. At the beginning
of Algorithm 1, to avoid MER nesting, line 5 and the test
ensured by line 10 select the RBs throughout the
Max_height of the key RB to be considered for SCAN
function (line 12). These selected RBs must provide
Max_width greater than that of RB above the current key
RB (line 5). In fact the Max_widths of RBs which do not
satisfy this condition are inevitably taken by the previous
key RB. Throughout the Max_height of each key RB (line
6), Algorithm 1 scans all the RBs and each time, it takes
the Max_width of the current RB (lines 7,8) as the current
width of a new MER: MER_width (line 11) and checks
whether there are RBs above this current RB and below the
current key RB having Max_width inferior or equal to that
of the current RB (lines 12-21). Should this be the case, the
current RB would not be considered. For example, in Fig.
2, for the key RB Y1,2, the Max_widths 3 given by Y2,2 and
Y3,2 and the Max_width 2 given by Y4,2 are not considered
as Y1,2 above these RBs has a Max_width of 1. This test
avoids the duplication of MERs as well as it checks the
feasibility of MER construction.

Algorithm 1. MER search.

If the Max_width of the
current RB is accepted, Algorithm 1 determines the height
of the MER by booking MER_width RBs on all the lines
between the current key RB and the last RB having
Max_width superior to MER_width (lines 22-27). Once the
construction of the MER is finished (line 28), further tests
of MER nesting are performed by line 29. The first test
Validity_Left searches the MERs added by the key RBs
situated on the same line as the current key RB on its left.
If one of these old MERs has the same height as the new
MER and if the upper right and the bottom right vertices of
the old MER are greater than or equal to the upper left and
the bottom left vertices of the new MER, the new MER is
necessarily encapsulated in the old MER. In this case, the
new MER will not be inserted. For example, in Fig. 2, the
gray MER added by Y2,3 is nested in MER1. MER1 is
created by the key RB Y2,2 on the left of the key RB Y2,3,
both situated on the same line. As both MERs have the
same height 2, and the upper left and the bottom left
vertices of this new MER are inferior to the bottom right
and the upper right vertices of MER1, the new MER of Y2,3
is deleted. Similarly, the second test Validity_Up avoids
the insertion of new MER having the same width as an old
MER provided by a key RB located above in the same
column as the current key RB and its bottom left and
181
bottom right vertices are greater than or equal to the upper
left and the upper right vertices of the new MER.
Consequently, Algorithm 1 guarantees the discovery of all
possible MERs in the SoPCs without duplication or
nesting.
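Algorithm 1 itself is not reproduced here. As a reference point, the set of MERs it must discover (all empty rectangles contained in no other empty rectangle) can be obtained by the following brute-force enumeration, which is far slower than the key-RB scan but useful for cross-checking a fast implementation:

```python
def all_free(grid, r1, r2, c1, c2):
    """True if every cell of the rectangle (inclusive corners) is free."""
    return all(grid[i][j] for i in range(r1, r2 + 1) for j in range(c1, c2 + 1))

def maximal_empty_rectangles(grid):
    """Return all empty rectangles (r1, c1, r2, c2), inclusive corners,
    that are not contained in any other empty rectangle."""
    rows, cols = len(grid), len(grid[0])
    empties = [(r1, c1, r2, c2)
               for r1 in range(rows) for r2 in range(r1, rows)
               for c1 in range(cols) for c2 in range(c1, cols)
               if all_free(grid, r1, r2, c1, c2)]
    def contained(a, b):
        # a is strictly inside b (same rectangle does not count)
        return (a != b and b[0] <= a[0] and b[1] <= a[1]
                and b[2] >= a[2] and b[3] >= a[3])
    return [a for a in empties if not any(contained(a, b) for b in empties)]

demo = [[1, 1, 0],
        [1, 1, 1],
        [0, 1, 1]]
mers = maximal_empty_rectangles(demo)
print(sorted(mers))  # [(0, 0, 1, 1), (0, 1, 2, 1), (1, 0, 1, 2), (1, 1, 2, 2)]
```

The demo grid yields exactly four MERs, matching the kind of result the key-RB scan must produce without duplication or nesting.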
2) Fitting: During scheduling, each elected task from
each local scheduler will be fitted by a placer. Thus,
according to the MERs found in their corresponding SoPCs,
the n placers provide the best Reconfigurable Physical
Blocks (RPBs) for the elected tasks in the SoPCs. The
placers search all the valid MERs for the tasks and provide
the closest RPBs in order to minimize internal
fragmentation. A valid MER must include all the RB types
required by the task, in the numbers specified in the
RB-model of the task, to enable its execution. Based
on the column-based architecture, our proposed best fitting
for a given task A and by a given placer is described by
Algorithm 2.
Algorithm 2 starts RPB search from the upper left
vertex of each valid MER. Algorithm 2 relies heavily on
the column-based architecture and only scans the first line
of the valid MER. It searches the first column in the MER
containing an RBk included in A_RB and not yet scanned
(line 7). From this current first column (line 8), it scans the
whole MER line horizontally to search the remaining RB
types required by A_RB (lines 10-22). Max_RB represents
the height of the RPB according to the required number of
RBs in the hardware task and the height of the valid MER.
If the required number of one RB type exceeds the height
of the valid MER, Max_RB is equal to the MER_height
(line 18) and the remaining number of this RB type (line
17) will be searched in the following columns of the valid
MER. Otherwise, the required number of the current RB
type is attained (line 14) and the Max_RB is adjusted to the
last highest value (line 15). Then, Algorithm 2 checks
whether all the RB types included in A_RB are found and
their required number are achieved starting from this
current first column (line 23). Should this be the case, it
books the computed necessary number (Max_RB) (line 25)
for this new possible RPB. Among all possible RPBs
(Possible_RPB) extracted by scanning all the columns
of all valid MERs in a given SoPC by a given placer, the closest one to the RB-model of A will be
selected as the best fitting for A in the SoPC (line 31).
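The fitting criterion (among the candidate RPBs, pick the one closest to the task's RB-model so as to minimize internal fragmentation) can be sketched as follows. This is our own simplified model, in which each candidate RPB is summarized only by the count of each RB type it contains:

```python
def best_fit(task_rb_model, candidate_rpbs):
    """task_rb_model: required RBs, e.g. {"RB1": 2, "RB3": 1}.
    candidate_rpbs: list of dicts giving the RB counts of each possible RPB.
    Returns the valid candidate with the least excess resources, or None."""
    def valid(rpb):
        # the RPB must cover every required RB type in sufficient number
        return all(rpb.get(k, 0) >= n for k, n in task_rb_model.items())
    def excess(rpb):
        # internal fragmentation: RBs allocated but not needed by the task
        return sum(rpb.values()) - sum(task_rb_model.values())
    valid_rpbs = [r for r in candidate_rpbs if valid(r)]
    return min(valid_rpbs, key=excess) if valid_rpbs else None

need = {"RB1": 2, "RB3": 1}
rpbs = [{"RB1": 4, "RB3": 2},            # valid, excess 3
        {"RB1": 2, "RB3": 1, "RB2": 1},  # valid, excess 1 -> best fit
        {"RB1": 1, "RB3": 5}]            # invalid: not enough RB1
print(best_fit(need, rpbs))  # {'RB1': 2, 'RB3': 1, 'RB2': 1}
```

Choosing the candidate with minimal excess is one way to read "closest to the RB-model"; the paper's Algorithm 2 additionally exploits the column order of the architecture when building the candidates.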
B. On-line Scheduling Algorithm
In this section, based on the on-line placement
presented in the previous section, we define our proposed
Schedulers-Driven placement/scheduling. Fig. 3 illustrates
the possible states for tasks in the arrived DAGs.
Schedulers-Driven placement/scheduling is performed by
means of Algorithm 3 and 4. Every tick (T time units),
Algorithm 3 uses all the previous algorithms to move tasks
between the various states. We assume that there
are DAG_number DAGs arriving at the system with a fixed
inter-arrival interval.

Algorithm 2. Best fitting of task A.

The arrived DAGs are assigned to the
idle local schedulers. The Schedulable tasks in each DAG
are fetched by its Local_scheduler and inserted in its
List_scheduler. A task in a DAG is considered schedulable
if either the task has no predecessors or if all its
predecessors have been placed/scheduled. A task is
accepted by an SoPC if its deadline and RB requirements
for that SoPC remain guaranteed. If a task is not accepted
by any SoPC during its laxity time then it is rejected.
Consequently, the DAG that the task belongs to is rejected.
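The schedulability and rejection rules just stated can be sketched as follows, under hypothetical names and assuming the relative deadline is counted from the release time (the paper does not state this explicitly):

```python
def is_schedulable(task, placed):
    """A task becomes schedulable once it has no predecessors or all of its
    predecessors have been placed/scheduled."""
    return all(p in placed for p in task["predecessors"])

def must_reject(task, now):
    """A task not accepted by any SoPC during its laxity time is rejected:
    past its latest start time (absolute deadline minus execution time),
    no placement can still meet the deadline."""
    latest_start = task["ra"] + task["da"] - task["ca"]
    return now > latest_start

t = {"predecessors": ["t1"], "ra": 10, "ca": 5, "da": 20}
print(is_schedulable(t, placed={"t1"}))  # True
print(must_reject(t, now=26))            # True: latest start is 25
```

Rejecting a task then cascades to its whole DAG, which is why the approach works so hard (recovery, RPB reuse) to find a placement before the laxity expires.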
At the beginning, Algorithm 3 checks if the tasks in
List_scheduler, List_recover and List_pending still
guarantee their deadlines (lines 19-20). If a task misses its
deadline, it is transferred to the Rejected state. A DAG is
accepted only if all its composite tasks are acceptable.
When a task is rejected, all the schedulable tasks in the
List_scheduler of the rejected task, all the recovered,
pending and placed/scheduled tasks that belong to the
DAG of the rejected task are deleted from their housed
lists and from their assigned SoPCs (line 21). Then the
Schedulers-Driven detects the SoPCs that have sustained
MER modification after task completion or task rejection
(line 25). The current time t is kept if the SoPC has
experienced MER modification (line 26). When some
deleted tasks were scheduled and placed as the last tasks to
be executed in RPBs, their elimination from the system
could enable the placement/scheduling of pending tasks. In
addition, the completion of the last tasks in RPBs frees
additional resources in SoPCs which could allow the
placement/scheduling of pending tasks. Thus, in these
cases, the pending tasks are transmitted to List_recover by
saving the time of their recovering (lines 29-32) and their
states become Recovered. In the case where the rejected
tasks are not the last tasks to execute in the RPBs (line 33),
Schedulers-Driven checks the possibility of replacing some
of these rejected tasks by pending tasks while respecting
182
Figure 3. Task states (Schedulable, Selected, Placed and scheduled, Pending, Recovered, Rejected). Transition conditions: (1) valid MER OR (valid occupied RPBs && Ts respects deadline); (2) no valid MER && (valid occupied RPBs and Ts do not respect deadline, or invalid occupied RPBs); (3) (valid MER on End_task or Reject_last) OR (valid occupied RPBs && Ts respects deadline on Reject_last). A task is Selected on the earliest deadline; a missed deadline leads to Rejected and to a rejected DAG; Reject_last/end tasks and Reject_middle trigger the recovery of pending tasks.
their release times, their deadlines and their RB-models
(line 34). If such a replacement is feasible for some tasks,
their states change to Placed/Scheduled and their successors
are searched to become the new schedulable tasks, inserted
in the list of the Local_scheduler to which these substitute
tasks belong. Then, each Local_scheduler and the Recover
picks the schedulable task with the earliest deadline from
its list List_scheduler and List_recover (lines 36-38). The
state of elected tasks is changed to Selected. When the
selected task is taken from List_recover, only the placers
whose MER modification occurred at a time greater than
or equal to the Recover_time of the elected task are
selected to deal with this task (lines 39-40); otherwise, all
the n placers are considered to place and schedule this task
(line 42). Then, each Local-scheduler and Recover calls
the on-line placers described by Algorithm 4 and
performed by the selected placers (line 44). Each selected
placer manages its free space by Algorithm 1 detailed in
the previous section (lines 56-59). If the selected placer
affords valid MERs for the selected task, it searches its
fittest RPB in its free RB space by means of Algorithm 2
and the start time of the task in its associated SoPC is
obtained by the maximum between the release time of the
task and the current time (lines 60-63). In the case that the
SoPC does not include valid MERs for the selected task, it
attempts to place and schedule it in its occupied RPBs (line
65). If the possible start time (Ts) provided by an occupied
RPB is greater than the release time of the task (line 66), the
corresponding placer checks if this start time maintains the
deadline of the selected task, should this be the case, it
verifies if this occupied RPB satisfies the RB requirements
of the task. If the occupied RPB respects the real-time
requirements and RB-model of the task (line 67), it is
accepted (lines 68-69). When the Ts of the occupied RPB
is lower than the release time of the task (line 71), only the RB
requirements are checked (line 72) as the start time of the
selected task in this occupied RPB will be its release time
(lines 73-74). Among all the accepted occupied RPBs, the
earliest start time for the task is chosen and the
corresponding RPB is selected (lines 77-78). When several
RPBs ensure the earliest start time, the fittest RPB is kept.
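The selection rule among occupied RPBs (start at max(Ts, release time), keep only RPBs that meet the deadline and the RB-model, prefer the earliest start, break ties by fit) can be sketched as follows, with our own data layout:

```python
def pick_occupied_rpb(task, occupied_rpbs):
    """task: dict with 'ra' (release), 'ca' (execution), 'deadline' (absolute)
    and 'rb_model'. occupied_rpbs: list of dicts with 'ts' (time the RPB
    becomes available) and 'rbs' (its RB counts).
    Returns (start_time, rpb) or None."""
    def satisfies(rpb):
        return all(rpb["rbs"].get(k, 0) >= n for k, n in task["rb_model"].items())
    accepted = []
    for rpb in occupied_rpbs:
        start = max(rpb["ts"], task["ra"])  # a task cannot start before release
        if start + task["ca"] <= task["deadline"] and satisfies(rpb):
            accepted.append((start, rpb))
    if not accepted:
        return None
    # earliest start first; among equal starts, the fittest (least total RBs)
    return min(accepted, key=lambda sr: (sr[0], sum(sr[1]["rbs"].values())))

task = {"ra": 4, "ca": 3, "deadline": 12, "rb_model": {"RB1": 2}}
rpbs = [{"ts": 2, "rbs": {"RB1": 5}},   # start max(2, 4) = 4, fits
        {"ts": 6, "rbs": {"RB1": 2}},   # start 6, fits but later
        {"ts": 1, "rbs": {"RB1": 1}}]   # insufficient RB1
start, chosen = pick_occupied_rpb(task, rpbs)
print(start)  # 4
```

Reusing an already-configured RPB this way is what lets the approach skip the placement step entirely for such tasks.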
The placer data are sent to the Local_schedulers and
the Recover, which make the final decision for their
selected tasks.

Algorithm 3. On-line Schedulers-Driven placer/scheduler.
Algorithm 4. On-line placers.

If the placers provide feasible placement for
a selected task by guaranteeing its real-time constraints,
among all possible RPBs, its Local_scheduler or the
Recover picks the fittest RPB that enables the earliest start
time for task execution and the new possible schedulable
tasks are searched and inserted in the list of
Local_scheduler of the selected task (lines 45-46). The
state of the selected tasks is then set to Placed/Scheduled.
If there are no available RPBs for a selected task, it will
be transmitted to Pending state (line 48), as some other
task rejections or completions could allow its
placement/scheduling in SoPCs. Schedulers-Driven
achieves the operating phase by updating the limit that controls
the existence of unscheduled DAGs (line 51).
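Each Local_scheduler (and the Recover) elects the schedulable task with the earliest deadline. With a priority queue this is a standard EDF pick; a minimal sketch of such a list, under hypothetical names:

```python
import heapq

class ListScheduler:
    """Minimal List_scheduler sketch: holds schedulable tasks, elects by EDF."""
    def __init__(self):
        self._heap = []

    def insert(self, deadline, name):
        # the heap orders entries by deadline, earliest first
        heapq.heappush(self._heap, (deadline, name))

    def elect(self):
        """Pop the schedulable task with the earliest deadline (-> Selected)."""
        return heapq.heappop(self._heap)[1] if self._heap else None

sched = ListScheduler()
sched.insert(40, "B")
sched.insert(25, "A")
sched.insert(60, "C")
print(sched.elect())  # A  (earliest deadline, 25)
```

Each elect() corresponds to one Schedulable-to-Selected transition in Fig. 3; the elected task is then handed to the placers.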
III. SIMULATION RESULTS
In order to analyze the feasibility of our Schedulers-
Driven placement/scheduling approach and to assess its
performance, several simulation experiments were
conducted. 10 DAG sets were generated by means of the
TGFF3-5 tool [10]. The DAG set features are described in
TABLE I. In each DAG set the inter-arrival interval of
DAGs is fixed to 50 T time units. The empirically chosen
values for the local scheduler number and placer number
are m=6 and n=4. n should be smaller than m in order to
create low-cost designs with high resource efficiency: we
cannot produce a number of SoPCs as great as the number
of arrived DAGs to satisfy their physical requirements.
However, the number of local schedulers should be as
large as possible in order to place and schedule several
DAGs simultaneously and to exploit the SoPC resources
as much as possible. We created 4 heterogeneous column-
based SoPCs of 6 lines and 7 columns having 4 RB types.
In all DAG sets, the average RB heterogeneity rate of the
DAG tasks is 2.31 (i.e., each task uses on average 2.31 of
the 4 RB types).
Fig. 4 shows the run time of Schedulers-Driven
placement/scheduling for the 10 DAG sets. DAG_SET6,
DAG_SET8 and DAG_SET9 give the highest run times as
they are composed of the largest numbers of DAGs (30,
24, 27). Moreover, they also produce high average
execution times (23-25 T), which explains their long run
times. Indeed, due to the long execution times of tasks, the
occupied RPBs remain in execution for a long time, and
the lateness of their release drives tasks to List_pending
many times. Thus, the DAGs stay in the system for a long
time.
The slowdown of one DAG is defined by:
Slowdown(DAG) = Msingle(DAG)/Mmultiple(DAG), where
Msingle is the makespan of the DAG up to its last placed/
scheduled task by Schedulers-Driven and when it has the
available SoPCs on its own, and Mmultiple is the current
makespan of the same DAG when it is placed/scheduled by
Schedulers-Driven onto SoPCs along with all
the other DAGs.

TABLE I. FEATURES OF DAG SETS

            DAG     Average size/DAG  Average       Average
            number  (Task)            deadline (T)  execution (T)
DAG_SET1    20      8.7               68.91         20.79
DAG_SET2    11      15.8              62.91         21.86
DAG_SET3    15      10.53             69.04         17.66
DAG_SET4    22      11.27             71.37         21.58
DAG_SET5    12      14.16             65.31         22.03
DAG_SET6    30      12.76             76.53         23.41
DAG_SET7    18      14.5              88.15         30.99
DAG_SET8    24      16.33             76.41         25.07
DAG_SET9    27      14.5              78.10         25.14
DAG_SET10   10      16.7              92.71         27.66

Figure 4. Run time measurements.

The comparison of DAG_SET slowdowns under the
Schedulers-Driven placer/scheduler is shown in Fig. 5. For
better performance in the system, the slowdown should be
close to 1. As expected, the DAG_SETs having lower
DAG numbers, smaller average sizes and shorter
executions afford more fairness to their constituent DAGs,
such as DAG_SETs 1-4, and consequently they result in
smaller slowdowns. DAG_SETs 7, 8, 9 and 10 also
produce small slowdowns thanks to RPB reuse and
placement efficiency. DAG_SETs 5 and 6 show the highest
slowdowns since they are built from large numbers of
DAGs with high average sizes and a raised heterogeneity
rate, which causes conflicts between DAGs for the use of
the SoPCs and increases their slowdowns.
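The slowdown metric above can be computed directly from the two makespans:

```python
def slowdown(makespan_single, makespan_multiple):
    """Slowdown(DAG) = Msingle / Mmultiple: the DAG's makespan with the SoPCs
    to itself, divided by its makespan when competing with the other DAGs.
    Values closer to 1 mean the DAG suffers less from the competition."""
    return makespan_single / makespan_multiple

def average_slowdown(dag_set):
    """Average slowdown over a DAG set, as plotted per DAG_SET in Fig. 5."""
    return sum(slowdown(ms, mm) for ms, mm in dag_set) / len(dag_set)

# (Msingle, Mmultiple) pairs for three hypothetical DAGs
print(average_slowdown([(50, 50), (40, 80), (60, 75)]))  # about 0.767
```

A DAG whose makespan is unchanged by the competition contributes a slowdown of exactly 1 to the average.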
Our real-time DAG-based placement/scheduling on the
heterogeneous SoPCs suffers from task rejection due to
missed deadlines and the lack of free RB space for a given
selected task. Fig. 6 presents the guarantee ratio (i.e., the
percentage of DAGs guaranteed to meet their deadlines)
measured for the DAG_SETs. For all DAG_SETs, Fig. 6
shows a guarantee ratio above 51 %. A highly relaxed
average deadline, combined with a low average execution
time and a small average size within a DAG_SET, has a
noticeable impact on increasing the guarantee ratio. We
observe 100 % of DAGs accepted in DAG_SETs 1, 3 and
4, as these parameters are suitably chosen there. In [5],
using 8 homogeneous processors, the attained guarantee
ratio for 5 DAGs of 20 tasks is 70 %.
Our approach outperforms [5]: for the DAG_SET nearly
similar to that studied in [5], namely DAG_SET5,
composed of 12 DAGs with an average size of 14.16 tasks
each, Schedulers-Driven can place and schedule 83 % of
the real-time DAGs in the system.

Figure 5. Average slowdown measurements.
Under strict physical resource constraints, Schedulers-
Driven placement/scheduling often predicts the placement
and scheduling of tasks before their release times. As
shown in Fig. 7, this property of our proposed approach
allows up to 91 % of the placement/scheduling phases
across all DAG_SETs to prefetch the schedule and
placement of tasks before their release times. These
remarkable prefetch ratios greatly reduce the placement
and scheduling overheads. Thanks to the prefetch
technique, almost all the configuration operations are
hidden, which improves the overall system performance.
For better placement quality in the system, the resource
efficiency should be close to 1. Thanks to the tightness of
our placement method, we reached a resource efficiency of
0.6 in all DAG_SETs. This resource efficiency shows how
close the used RPBs, where tasks are fitted, are to their
RB-models. In addition, for all DAG_SETs, based on the
run-time reconfiguration mechanism, reusing the occupied
RPBs in 45-75 % of placement/scheduling phases
completely eliminates the placement overhead, greatly
reduces the configuration overhead and enhances resource
efficiency by freeing more RB space for future arriving
DAGs.
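The paper does not give a formula for resource efficiency; a plausible definition consistent with the text (how close the used RPBs are to their RB-models) is the ratio of required to allocated RBs, sketched below. The definition and the data layout are our assumptions, not the paper's.

```python
def resource_efficiency(placements):
    """placements: list of (rb_model, rpb) pairs, each a dict of RB counts.
    Assumed definition: total RBs actually required divided by total RBs
    allocated in the used RPBs; 1.0 means zero internal fragmentation."""
    required = sum(sum(model.values()) for model, _ in placements)
    allocated = sum(sum(rpb.values()) for _, rpb in placements)
    return required / allocated

placements = [({"RB1": 2}, {"RB1": 3}),                       # 2 of 3 RBs used
              ({"RB1": 1, "RB2": 2}, {"RB1": 1, "RB2": 3})]   # 3 of 4 RBs used
print(resource_efficiency(placements))  # 5/7, about 0.71
```

Under this reading, the reported 0.6 means that roughly 60 % of the RBs allocated to RPBs are actually consumed by the fitted tasks.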
IV. CONCLUSION AND FUTURE WORK
This paper presents a novel placement/scheduling
approach for real-time DAGs with non-deterministic
behavior on heterogeneous SoPCs. We regard this paper as
only an initial study of heuristics for multiple-DAG
placement/scheduling onto SoPCs. It addresses some of
the most challenging problems in embedded systems:
achieving high performance, expressed by run time and
slowdown, reaching high resource efficiency and reducing
configuration overhead. Further research will focus on
approaches with task preemption and on other notions of
quality of service, exploiting the unused middle slot times
within the RPBs.
REFERENCES
[1] H. Zhao, and R. Sakellariou, “Scheduling multiple DAGs onto
heterogeneous systems,” Parallel and Distributed Processing
Symposium, p. 130, Apr. 2006.
Figure 6. Guarantee ratio measurements.
Figure 7. Placement/Scheduling prefetch measurements.
[2] L. Zhu, Z. Sun, W. Guo, Y. Jin, W. Sun and W. Hu, "Dynamic
Multi DAG Scheduling Algorithm for Optical Grid
Environment," Proceedings of SPIE, Vol. 6784, Part 1,
p. 67841F, 2007.
[3] L. He, S. Jarvis, D. Spooner and G. Nudd, "Dynamic,
capability-driven scheduling of DAG-based real-time jobs in
heterogeneous clusters" International Journal of High
Performance Computing and Networking, Vol 2, pp. 165-177,
March. 2004.
[4] M. Iverson, and F. Ozguner, "Hierarchical, competitive
scheduling of multiple DAGs in a dynamic heterogeneous
environment," Distributed Systems Engineering journal, Vol 6,
No 3, pp. 112-120, July. 1999.
[5] X. Qin, and H. Jiang, "Dynamic, reliability-driven scheduling of
parallel real-time jobs in heterogeneous systems," International
Conference on Parallel Processing, pp. 113-122, 2001.
[6] K. Bazargan, R. Kastner, and M. Sarrafzadeh, "Fast Template
Placement for Reconfigurable Computing Systems," IEEE
Design and Test, Vol. 17, pp 68-83, January. 2000.
[7] C. Steiger, H. Walder, M. Platzner, and L. Thiele, "Online
scheduling and placement of real-time tasks to partially
reconfigurable devices," International Real-Time Systems
Symposium, pp. 224-235, Dec. 2003.
[8] M. Handa, and R. Vemuri, "An Efficient Algorithm for Finding
Empty Space for Online FPGA Placement," Design Automation
Conference, pp. 960-965, June. 2004.
[9] A. Ahmadinia, C. Bobda, M. Bednara, and J. Teich, "A New
Approach for On-line Placement on Reconfigurable Devices,"
International Parallel and Distributed Processing Symposium,
p. 134, April. 2004.
[10] http://ziyang.eecs.umich.edu/~dickrp/tgff/
Generation of emulation platforms for NoC exploration on FPGA
Junyan TAN, Virginie FRESSE Hubert Curien Laboratory UMR CNRS 5516
University of Jean Monnet-University of Lyon 18 Rue du Professeur Benoît Lauras
42000 Saint-Etienne, FRANCE [email protected]
Frédéric ROUSSEAU TIMA Laboratory, UJF/CNRS/Grenoble INP, SLS Group
46, Avenue Félix Viallet 38000 Grenoble, FRANCE [email protected]
Abstract-NoC (Network on Chip) architecture exploration is
a timely problem for today's multimedia applications and platforms. The presented methodology gives a solution to easily evaluate timing and resource performance while tuning several architectural parameters, in order to find the appropriate NoC architecture with a single emulation platform. In this paper, a design flow that generates NoC-based emulation platforms on FPGA is presented. From specified traffic scenarios, our tool automatically inserts the appropriate IP blocks (emulation blocks and routing algorithm) and generates an RTL NoC model with specific, tunable components that is synthesized on the FPGA.
I. INTRODUCTION
Systems-on-Chip (SoC) based on Networks-on-Chip (NoC) architectures are one of the most appropriate solutions for media-processing embedded applications. With the growing complexity of consumer embedded systems, emerging SoC architectures integrate numerous components such as memories, DSPs, specialized processors, microcontrollers and IPs. The ever-increasing number of components and the amount and size of data to transfer required by the algorithm lead to the design of efficient ad hoc NoC architectures according to the algorithm specifications. An ad hoc NoC offers high bandwidth and high scalability at low power and low complexity [1]. However, designing an ad hoc NoC means making several architectural choices, such as buffer sizing, flow control policies, topology selection and routing algorithm selection. These choices must be made at design time, keeping in mind that the final NoC must fulfill a set of critical constraints which depend on the target application, such as latency, energy consumption and design time. The design space being very wide, automation of the design flow must be considered to ensure a rapid evaluation and test of each solution.
In recent years, several approaches [2][3] were proposed to automate the design space exploration of the architecture. These approaches can be categorized into two types: formal and experimental. A formal approach aims to construct a mathematical formulation to predict the NoC behavior. An experimental approach uses either simulation or emulation tools. In approaches that use software simulation, the NoC can be modeled at different levels of abstraction, the abstraction level being a tradeoff
between the desired accuracy and validation speed. FPGAs (Field Programmable Gate Arrays) are commonly used as reconfigurable devices for emulation and test. FPGAs are programmable logic devices used in various applications requiring rapid prototyping of digital electronics (telecommunications, image processing, aeronautics…). Modern FPGAs are now able to host processor cores and DSPs, as well as several IP blocks, to perform efficient prototyping of embedded systems.
Today, several NoC-based emulation platforms on FPGA have been proposed, such as [4][5][6]. Nevertheless, these emulation platforms are not adapted to image and signal processing applications: their emulation blocks cannot emulate all data transfers, such as data transmissions from one initiator to several destinations with automatic variation of the data injection rate.
In this paper, we propose a generic design flow for the emulation of large NoC-based MPSoCs on FPGA platforms. This design flow automatically builds the emulation architecture from the NoC architecture, the type of emulation and the routing algorithm. According to the requirements of the application, the emulation architecture can emulate data transmissions from one or several initiators to one or several destinations with automatic data rate injection. It is implemented on an FPGA platform and supplies a statistics report for the future design of the whole system. The whole emulation architecture is designed as a hierarchical, fully synthesizable and FPGA-independent VHDL description.
The remainder of this paper is organized as follows. Section 2 describes related work. Section 3 introduces the generic design flow for the automatic generation of the emulation platform on FPGA and details all the components inserted in the design flow. Section 4 presents the design flow adapted to the Hermes NoC on a Xilinx platform; experiments are presented and analyzed in this section. Section 5 concludes the paper.
II. RELATED WORK
During the last years, NoC architectures for embedded platforms have evolved impressively. Yet existing NoCs do not solve one of the principal challenges of these communication architectures: finding the optimum, or a set of optimal, NoC architectures for a target application. Several simulation and emulation models have been proposed at different abstraction levels, but these models do not permit design space exploration: exploration remains a manual task requiring the experience of the designer.
Today, several NoC architectures have been successfully implemented on FPGA devices, such as Hermes [7], SoCIN [8], PNoC [9], HIBI [10] and Extended Mesh [11]. The placed and routed (P&R) architecture for the FPGA implementation is generated by the design flow associated with each NoC. These FPGA-based tools and environments rely on simulation in VHDL, SystemC, or a combination of specification, simulation, analysis and generation of NoCs at different levels of abstraction.
In [14], a SystemC-based platform for modeling, simulation and evaluation of an MPSoC NoC including a real-time operating system is presented. In [15], a mixed design flow called NoCGen is proposed, based on SystemC simulation and a VHDL implementation of the NoC structure; it uses a template router to simulate several interconnection networks in SystemC. In [17], a SystemC modeling environment for custom NoC topologies is described. However, these approaches are limited in their estimation accuracy and in the share of the platform that can be synthesized on the FPGA. Increasing the accuracy significantly increases the simulation time: such simulations have a much larger execution time than a NoC emulated on an FPGA device. The simulation time with SystemC or ModelSim for 10⁹ packets can reach from 5 days to 36 days [4]. Moreover, only the NoC structure is implemented on the FPGA; the emulation platform itself cannot be. Emulation on FPGA is therefore proposed to obtain faster simulation times together with the higher accuracy of functional validation.
In [4], the authors present a mixed HW-SW NoC emulation platform implemented on FPGA. The VHDL-based NoC is implemented on a Virtex-II FPGA. The architecture contains a communication network, traffic generators, traffic receptors and a control module. A hard-core processor (PowerPC) is connected to the hardware emulation platform as a global controller that defines the emulation parameters. A fast network-on-chip emulation framework on a Virtex-II FPGA is presented in [16]; it speeds up the synthesis process by using several hard cores and partial reconfiguration. These frameworks integrate one or several hard-core processors used for emulation only: they control the communication architecture at the cost of the limited resources available on the FPGA.
All the emulation platforms presented above support only one-to-one or many-to-one communication. Image and signal processing algorithms, however, require sending data to multiple destinations, which existing emulation platforms do not support.
To solve these problems, the proposed NoC emulation platform consists of data emulation blocks (traffic generators and traffic receptors) and a synthesizable NoC architecture. This platform can emulate any traffic required by image and signal processing applications.
III. DESIGN FLOW FOR THE GENERATION OF THE FPGA EMULATION PLATFORM
A generic design flow is proposed to generate the emulation architecture for an FPGA platform. The design flow takes as inputs:
• the NoC architecture, with several varying parameters,
• the routing algorithms,
• the emulation blocks (traffic generators and traffic receptors), according to the type of emulation,
• the initiators and receptors, with their data transfer specifications.
1. Design Flow
The design flow depicted in Figure 1 automatically generates the emulation architecture for the FPGA platform. It is organized as a hierarchical VHDL structure to allow component instantiation at each level. Several packages (routing, data_transfers) are provided for the parameterization of the architectures, and generic IP blocks are provided for component insertion. These components are the routing components, the traffic generators (TG) and the traffic receptors (TR).
Figure 1. Design flow for the generation of the emulation platform on FPGA.
The designer first selects, in the package, the NoC structure that he wants to explore, specifying the number of switches and the width of the bus. The design flow builds on an existing NoC structure implemented on an FPGA platform and on its associated design flow. Any existing parameterized synchronous NoC structure can be used as input as long as its HDL description is available. The NoC structure should contain the switches, buffers, links and flow control protocol. The design flow takes the VHDL description of this NoC structure as input.
The first step concerns the routing algorithm. This choice is made by the designer, who selects in the routing package the routing algorithm to be used with the NoC structure: the selected algorithm is set to 1 and the unused routing algorithms are set to 0, as depicted in Figure 2.
Routing_XY:=1; Routing_NFNM:=0; Routing_WFM:=0;
Figure 2. Example of routing algorithm selection.

The design flow inserts the corresponding routing IP component from the routing IP block library. We assume that this library contains the VHDL functions of the routing algorithms. The routing algorithm is inserted by instantiating the appropriate routing function in the switch control block of the switch architecture. The design flow adds these routing IP blocks to all switches of the NoC structure to obtain the complete NoC architecture. At this level, the communication architecture is complete and the nodes can be inserted. The parameterized emulation blocks are then added to the platform in the emulation block insertion step, according to the type of emulation. Emulation blocks are traffic generators and traffic receptors designed as VHDL IP blocks; their parameters are specified in the Data_Transfer package. The emulation block insertion step is a generic VHDL component instantiation that uses the Data_Transfer package to allocate the generic VHDL IP blocks. The complete emulation architecture for the FPGA platform is generated as an HDL description, then synthesized and implemented with the synthesis and place-and-route tools of the target platform.
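The selection convention of Figure 2 can be stated precisely: exactly one routing flag is set to 1, and the flow instantiates the corresponding routing function in every switch. The following Python sketch models this one-hot check; only the flag names come from Figure 2, the helper function is our illustration, not part of the design flow.

```python
# Sketch of the one-hot routing selection of Figure 2. The flag names follow
# the figure; the checking helper itself is an illustrative assumption.

routing_flags = {"Routing_XY": 1, "Routing_NFNM": 0, "Routing_WFM": 0}

def selected_routing(flags):
    """Return the single routing algorithm whose flag is set to 1."""
    chosen = [name for name, flag in flags.items() if flag == 1]
    if len(chosen) != 1:
        raise ValueError("exactly one routing algorithm must be selected")
    return chosen[0].removeprefix("Routing_")

algo = selected_routing(routing_flags)  # the flow then instantiates this
                                        # routing function in every switch
```

Setting two flags to 1 (or none) would be rejected, which mirrors the fact that the flow inserts exactly one routing IP block per switch.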
In the presented design flow, the designer tunes the emulation platform by setting the routing IP components in the routing IP block library and by specifying the scenarios in the Data_Transfer package.
Two inputs of the design flow are reused from existing designs: the NoC structure and the routing algorithm. They can be reused immediately provided they have been previously described in a hardware description language for synthesis purposes; they are briefly described in the following sections. The emulation blocks and the data transfer specifications are inputs designed specifically for this flow; their description and parameterization are detailed in the following sections.
2. NoC architecture
NoC architectures [2] are communication architectures that improve the flexibility of the communication subsystem of a SoC, offering a highly scalable, high-performance and energy-efficient customized solution. A NoC architecture is composed of several basic elements: network interfaces (NI), switches, links and resources. These basic elements are connected according to a topology to constitute the NoC architecture.
Data transmitted over a NoC are sent as messages. Several data items can be carried by one message, and one data item can be spread over several messages. A message is a set of packets and a packet is a set of flits (FLow control unITs); the flit is the basic unit transferred over the NoC. The designer selects the size of the flits, the number of flits per packet and the size of the packets. Packets are sent according to the data injection rate, defined as the ratio between the amount of data injected and the capacity of the link to carry data: a 50% data injection rate indicates that the packets use 50% of the bandwidth.
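The message/packet/flit hierarchy described above can be illustrated with a short behavioral sketch (Python is used here only as executable pseudocode; the function name is ours):

```python
import math

# Illustrative decomposition of a message into packets and flits; the sizes
# are designer-chosen parameters, as described in the text.

def split_message(message_flits, flits_per_packet):
    """A message is a set of packets; a packet is a set of flits."""
    n_packets = math.ceil(message_flits / flits_per_packet)
    sizes = [flits_per_packet] * (message_flits // flits_per_packet)
    remainder = message_flits % flits_per_packet
    if remainder:
        sizes.append(remainder)  # last packet carries the leftover flits
    return n_packets, sizes

# A 100-flit message with 15 flits per packet needs 7 packets.
n, sizes = split_message(100, 15)
```

Evaluating the timing and resource cost of such a decomposition, for different flit and packet sizes, is exactly what the generated emulation platform is meant to do.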
Sizing the NoC has a direct impact on timing performance and resource usage. It is therefore important to evaluate the performance of the NoC according to the size of the flits, packets and messages.
Any NoC structure containing all components except the routing algorithm can be used as an input of the design flow, provided the structure is described in a Hardware Description Language (HDL).
3. Routing algorithm
Several routing algorithms have been implemented on FPGA and ASIC devices. A routing algorithm defines the path taken by a packet between the source and target switches. Three types of routing algorithms exist: deterministic, partially adaptive and fully adaptive [3]. 2D meshes and k-ary n-cubes are popular on FPGA, as their regular topologies simplify routing. In a 2D mesh, there are four directions, eight 90-degree turns, and two abstract cycles of four turns.
The most commonly used algorithms are deterministic, because of their simplicity and low resource usage; XY is a widely used deterministic routing algorithm. Other algorithms are semi-deterministic (or semi-adaptive), such as the west-first algorithm. Fully adaptive routing algorithms have also been proposed; they prevent livelock and deadlock [20], but they are generally not used for FPGA implementation because they require more resources and are more complex than the other two types.
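As an illustration of deterministic routing, the XY algorithm routes a packet fully along the X dimension first, then along Y. The sketch below is a behavioral model only; the port names and the axis orientation are our assumptions, not those of a specific NoC:

```python
# Minimal behavioral sketch of deterministic XY routing on a 2D mesh:
# route along X until the column matches, then along Y.
# Port names (EAST/WEST/NORTH/SOUTH/LOCAL) and the convention that NORTH
# increases y are illustrative assumptions.

def xy_route(src, dst):
    """Return the ordered list of output ports a packet takes from src to dst."""
    (x, y), (dx, dy) = src, dst
    ports = []
    while x != dx:                      # X dimension first
        ports.append("EAST" if dx > x else "WEST")
        x += 1 if dx > x else -1
    while y != dy:                      # then Y dimension
        ports.append("NORTH" if dy > y else "SOUTH")
        y += 1 if dy > y else -1
    ports.append("LOCAL")               # deliver to the attached core
    return ports

path = xy_route((3, 0), (0, 2))         # right-to-left, then upward
```

Because the path depends only on source and destination, two packets between the same pair of nodes always take the same route, regardless of traffic: this is precisely what makes the algorithm deterministic.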
Any routing algorithm can be inserted in the design flow provided it is described in an HDL. Some routing IP blocks have already been developed in VHDL and inserted in the design flow.
4. Traffic generators
For the emulation of the NoC, the IP blocks or other components connected to the NoC are replaced by traffic generators (TG) designed as parameterized VHDL entities. Deterministic traffic generators are widely used in NoC emulation. These traffic generators reproduce the traffic flow between IP blocks inside the NoC: they generate a stochastic traffic distribution (packet size, injection time, idle interval duration and packet destination) in order to mimic the behavior of a real IP block for a given application. Several TGs have been designed previously [4][5][6][7], but the format of the packets they send is not suitable for image and signal processing applications. For example, existing TGs can send packets to one destination node only [4][7] or in broadcast in . For most TGs, the receiving node gets no information about the address of the source node. Image processing applications, however, require unicast, multicast and broadcast transfers. They also require the address of the source node to be available when data are received: most nodes perform computations on two types of data coming from two different nodes, so the destination node must be able to extract the source address of incoming data. The proposed packet format also carries timing information (latency) in order to implement a complete synchronous emulation platform, as depicted in Figure 3.
Figure 3. Format of packets generated by TGs.
In our emulation platform, each packet contains a header part and a data part with the following information:
• Address of the destination cores (Dest): any initiator core can send data to one or several destination cores.
• Address of the initiator core (Source).
• Init clock (Clk_init): this flit is reserved for the latency evaluation; when the packet is sent, the sending time is loaded into it.
• Size of the transmitted packet (Sz_pckt).
• Number of packets (Nb_pckt).
The generic TG is depicted in Figure 4. A TG generates the control signals and the packet on the data_in output, whose width is equal to the flit size. The parameters of a TG are the address of the destination node (Address), the coordinates of the source node (IP_address_X and IP_address_Y), the size and number of packets (Size_packet, Nbre_packet) and the data injection rate between packets (Idle_packet). All this information is used to build the packet format depicted in Figure 3. The global clock of the NoC is connected to the clk signal of the TG block, constituting a synchronous platform.
Figure 4. Signals and parameters for generic Traffic Generators.
The number and format of the packets depend on the traffic scenario specified in the data_transfer package.
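The header fields listed above can be modeled behaviorally as follows; representing each field as a single flit and fixing Nb_pckt to 1 are simplifications of ours:

```python
# Behavioral sketch of the packet format of Figure 3: header flits carrying
# the destination, source, a timestamp for latency evaluation, packet size
# and packet count, followed by the payload flits.
# One list element per flit and Nb_pckt = 1 are illustrative simplifications.

def build_packet(dest, source, clk_init, payload):
    """Assemble one packet as [Dest, Source, Clk_init, Sz_pckt, Nb_pckt, data...]."""
    header = [dest, source, clk_init, len(payload), 1]
    return header + list(payload)

# A TG at switch (3,4) sending 3 payload flits to switch (0,0) at cycle 120.
pkt = build_packet(dest=(0, 0), source=(3, 4), clk_init=120, payload=[7, 8, 9])
```

Keeping the source address and the sending time inside the packet is what later allows a destination node to distinguish incoming streams and a traffic receptor to compute end-to-end latencies.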
5. Traffic Receptors
The traffic flows produced by the traffic generators are sent through the NoC and received by traffic receptors (TR). The proposed traffic receptors are parameterized VHDL entities. A TR analyzes the received packets and extracts the latencies of the NoC. Two types of traffic receptors exist. The first type performs, in hardware, a global analysis and statistics on the executed emulation: it checks every packet and extracts the latency of each received packet (the latency is computed online from the timestamp inserted in the 3rd flit of the packet). The latency and the global analysis are sent to a single LCD. The second type only generates a continuous report of the received traces, with detailed values, on the LCD available on the FPGA board. As the emulation platform is designed to emulate as precisely as possible the behavior of the final system, the output components are restricted to the LCD. The designer can add components or interfaces for analysis, but should keep in mind that doing so modifies the emulated structure (and hence the performance of the system).
Both traffic receptors are parameterized VHDL blocks to ensure an automatic generation of the emulation platform.
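The latency extraction performed by the first type of TR can be sketched behaviorally: the receiver subtracts the timestamp found in the 3rd flit (Clk_init) from its own clock at reception time. The names below are illustrative:

```python
# Behavioral sketch of the first TR type: check each received packet and
# extract its latency from the timestamp in the 3rd flit (Clk_init).
# Function names and the (packet, receive_clk) pairing are our assumptions.

def extract_latency(packet, receive_clk):
    clk_init = packet[2]            # 3rd flit: sending time loaded by the TG
    return receive_clk - clk_init

def average_latency(received):
    """Global analysis: average latency over all (packet, receive_clk) pairs."""
    latencies = [extract_latency(p, rx) for (p, rx) in received]
    return sum(latencies) / len(latencies)

avg = average_latency([
    ([(0, 0), (3, 4), 100, 2, 1, 5, 6], 140),   # received 40 cycles later
    ([(0, 0), (3, 4), 200, 2, 1, 5, 6], 260),   # received 60 cycles later
])
```

This only works because the whole platform is synchronous: the TG and the TR timestamps refer to the same global clock, as required by the design flow.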
6. Data transfer specification
TG and TR blocks are inserted according to the data transfer specification given in the Data_Transfer package. Data transfers are specified at the highest level of the emulation architecture description (top_NoC) with generic values. The designer specifies the size (size_packet) and number (nb_packet) of packets sent by each TG, together with the data injection rate (the idle time between packets, expressed with idle_packet). Data have the same format for a given TG. The designer indicates all destination nodes with the destination value: 1 indicates that the node is a TR, and 0 that the node does not receive any data. last_destination indicates the number of TRs for every TG, and total_packet gives the number of packets received by every TR. Finally, the links between TGs and TRs are given with destination_links.
The example depicted in Figure 5 indicates that switches (3,4) and (4,4) are traffic generators (called TG1 and TG2); both send 10 packets of 15 flits (the flit size is automatically extracted from the NoC structure), with an idle time of 20 clock cycles between packets. Switches (0,0), (1,0), (2,0) and (0,1) receive packets. TG1 sends packets to 2 TRs and TG2 to 3 TRs. Switch (0,0) receives 20 packets, and every other TR receives 10: TG1 sends packets to switches (0,0) and (2,0), while TG2 sends packets to switches (0,0), (1,0) and (0,1).
Figure 5. Initiator and receptor with data traffic specification
According to the data transfer specification, the design flow inserts the corresponding emulation blocks from the TG and TR library. A TG is attached to a switch if the associated node is an initiator (i.e. it sends at least one data item to another node). A TR is attached to a switch if the associated node is a receptor (i.e. it receives at least one data item from an initiator). In all other cases, no TG or TR is instantiated in the NoC architecture. The design flow can insert any other type of emulation block (random accesses, random sizes, parameterized sending latencies…); such blocks simply have to be described in an HDL and inserted in the emulation block library.
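This insertion rule can be illustrated by reading a destination_links-style specification (the dictionary layout below is our assumption, not the package's exact syntax); with the Figure 5 scenario, switch (0,0) ends up with total_packet = 20, as in the text:

```python
# Illustrative reading of a destination_links-style specification: a switch
# gets a TG if it initiates at least one transfer, and a TR if it receives
# at least one packet. The dictionary layout is an assumption for the sketch.

destination_links = {                    # initiator -> list of receptors
    (3, 4): [(0, 0), (2, 0)],            # TG1
    (4, 4): [(0, 0), (1, 0), (0, 1)],    # TG2
}

def instantiate_blocks(links, nb_packet):
    tgs = set(links)                                 # switches needing a TG
    total_packet = {}                                # packets per receptor
    for dests in links.values():
        for dst in dests:
            total_packet[dst] = total_packet.get(dst, 0) + nb_packet
    return tgs, total_packet                         # TRs = keys of total_packet

tgs, total_packet = instantiate_blocks(destination_links, nb_packet=10)
```

Switches absent from both sets get neither block, which keeps the generated platform as close as possible to the final system.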
TGs generate several packets that are sequentially sent to the NoC architecture. These packets are generated according to the data traffic specification given in the top_NoC package. If data are sent to several destination nodes, the TG generates one packet per node. TGs can send data with different packet sizes and numbers of packets to one or more destination nodes. As the designer cannot always know the data injection rate of a TG, two types of TGs are proposed:
• TGs that generate packets with a varying data injection rate. The data injection rate is automatically and dynamically swept from a 0% to a 100% load. It corresponds to the idle_packet parameter of the TG, automatically computed in the top_NoC entity from load (the data injection rate) and nbcyclesflit (the number of cycles needed to transfer one flit).
• TGs that generate packets with a given data injection rate, specified by the designer as a constant value for each TG.
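A plausible form of the idle_packet computation (an assumption of ours, not the paper's verbatim equation): if load is the fraction of the bandwidth that the packets must occupy, the idle time between packets follows directly from the packet transmission time.

```python
# Hedged reconstruction of the idle_packet computation. The formula is an
# assumption, chosen to be consistent with the definition that a 50% data
# injection rate means the packets use 50% of the bandwidth.

def idle_packet(load, size_packet, nbcyclesflit):
    """Idle cycles between packets for a target data injection rate.

    load: data injection rate, as a fraction in (0, 1].
    size_packet: number of flits per packet.
    nbcyclesflit: number of cycles needed to transfer one flit.
    """
    send_cycles = size_packet * nbcyclesflit        # cycles spent transmitting
    return send_cycles * (1.0 - load) / load        # cycles left idle

# At a 50% rate the link idles exactly as long as it transmits.
half = idle_packet(0.5, 500, 1)
```

Sweeping load from near 0 to 1 reproduces the behavior of the first TG type, whose injection rate varies automatically from a 0% to a 100% load.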
IV. DESIGN FLOW FOR THE HERMES NOC
The design flow presented above is adapted to the Hermes NoC and its associated design tools, on a Xilinx FPGA device. The generated emulation platforms are used to explore the design space of Hermes NoC architectures, as presented in the following sections. The experimental study evaluates the performance of the Hermes NoC according to the routing algorithms and to the position of the nodes. The experimental platform is an ML506 evaluation board containing a Virtex-5 XC5VSX50 FPGA; the development tools are Xilinx ISE 10.1 with Precision RTL Synthesis. The experiments focus on the average latency and the number of resources.
1. Hermes NoC and ATLAS tool
Hermes is a NoC created at the Catholic University of Rio Grande do Sul (Porto Alegre, Brazil) [7]. It is a packet-switched 2D mesh whose main components are the Hermes switch and the IP cores. The Hermes switch has routing control logic and five bi-directional ports; the local port connects the switch to its local IP core, and all ports possess input buffers for temporary storage. Hermes uses wormhole flow control. ATLAS is an open-source environment that automates the generation of the Hermes VHDL structure. Several features can be parameterized in ATLAS: flit size, buffer depth, number of virtual channels and flow control strategy. These parameters are easily set by the designer to match the specifications of the algorithm. In the design flow of Figure 1, ATLAS is the NoC generation tool and the VHDL IP blocks of the Hermes NoC are the NoC VHDL IP blocks.
The 4x4 mesh NoC architecture with a 1 initiator (I) to 3 receptors (R) scheme is used for the experiments.
2. Routing algorithms
The routing algorithms used are XY, West-First (WFM), North-Last (NLNM) and Negative-First (NFNM). All of them are designed as VHDL IP blocks for immediate insertion in the switches.
3. Impact of the routing algorithm on the number of resources
The first experiment, depicted in Figure 6, shows the number of LUTs according to the size of the NoC for several routing algorithms. The resource counts concern the NoC only; the emulation blocks are not included. The number of LUTs is almost identical for all routing algorithms. Experiments were also made on the number of registers (not depicted in this paper): the number of registers is likewise almost identical whatever the routing algorithm. The first observation is therefore that the number of resources (LUTs and registers) depends mainly on the size of the NoC.
An in-depth analysis of the numbers of LUTs and registers is depicted in Figures 7 and 8 for the XY and NLNM algorithms, chosen because they use respectively the lowest and the highest number of resources. The biggest difference in the number of LUTs is 71 for a 4-node NoC and 850 for a 36-node NoC. This number seems high but remains insignificant compared to the number of resources required by the NoC itself (respectively 3.2% and 3.4% of added LUTs).
Figure 6. Number of LUTs according to the size of the NoC.
Figure 7. Difference of LUTs for XY and NLNM routing algorithms.
Figure 8. Difference of registers for XY and NLNM routing algorithms.
The same analysis is made for the number of registers, as depicted in Figure 8. The difference in the number of registers is lower than for the LUTs (18 for a 4-node NoC and 196 for a 36-node NoC); these extra registers represent between 2.79% and 3.65% of the registers required by the NoC architecture itself. Therefore, the difference in the numbers of LUTs and registers is not significant when choosing the routing algorithm for the Hermes NoC implemented on FPGA. Functionality matters more than resource optimization in this choice: it is wiser to implement a routing algorithm that avoids deadlocks and livelocks than to save a few resources.
4. Impact of the emulation blocks and routing algorithms on the timing performances
For the following experiments, 4 TGs (TG1 to TG4) and 3 TRs (TR1 to TR3) are used, as depicted in Figure 9. Data transfers consist of 40 packets of 500 flits each, with 16-bit flits. The XY, NFM and NFNM routing algorithms are used. The position of the nodes is chosen to force data transfers from right to left, so that the routing algorithms can be compared (NFNM behaves like XY for left-to-right data transfers).
The design flow immediately and automatically generates the three emulation platforms. Each platform contains its routing algorithm (switches with the routing IP block are marked R) as depicted in Figure 9.
Figure 9. Emulation NoC platform generated by the design flow.
The first exploration compares the XY and NFNM routing algorithms. TG1 sends all its data to the three TRs; then TG2, TG3 and TG4 successively follow the same scenario, with a 25% data injection rate. The total latencies are depicted in Figure 10: the selected routing algorithm does not affect the number of cycles required to transmit the data when the traffic is low compared to the capacity of the communication architecture.
Figure 10. Total latency (nb of cycles) for the TG1-3TRs scheme according to the position of the TR (with a 25% data injection rate).
The second exploration evaluates the end-to-end latency when all TGs send data to all TRs (Figure 9), again with a 25% data injection rate. The latency depends on the routing algorithm and on the position of the TRs. The results depicted in Figure 11 show that XY gives a lower latency for TR2 and TR3, while NFM is more efficient for TR1; at this injection rate, both routing algorithms can be used.
Figure 11. End-to-end latency for three TRs with a 25% data injection rate, for three routing algorithms.
Figure 12. End-to-end latency (nb of cycles) for a 3-3 scheme according to the position of the TG and the data injection rate.
The last exploration measures the impact of the data injection rate. Figure 12 shows the end-to-end latency for two injection rates (50% and 75%). XY is the best suited for sending data to TR1 at a 50% data injection rate, but the least suited for sending data to TR3 at a 75% rate. Depending on the positions of the TRs and TGs (more precisely, on the number of hops required), no single routing algorithm always gives the best latency. These experiments highlight the need for an exploration-aid tool, as online emulations of all scenarios are required to evaluate the timing performance and select the most appropriate routing algorithm. Such a tool helps designers quickly build emulation platforms to evaluate different scenarios.
V. CONCLUSION
This paper presents a generic design flow for the automatic generation of NoC exploration platforms on FPGA. Starting from an existing NoC structure, all the required components are inserted by the design flow. Appropriate emulation blocks and traffic scenario specifications are proposed to target a wide range of image and signal processing applications. The designer can easily generate and implement several emulation platforms and explore the NoC structure in a short time. Thanks to the immediate generation of emulation platforms designed for design space exploration on FPGA, the designer can explore many architectural solutions and specify or modify the number and position of the initiators and receptors, in order to extract the best timing performance according to the routing algorithm, the data injection rate and the position of the initiators and receptors. The experiments show that the performance of the final system significantly depends on all these parameters.
REFERENCES
[1] B. M. Al-Hashimi: "System-on-Chip: Next Generation Electronics". Circuits, Devices and Systems, 2006.
[2] L. Benini: "Application Specific NoC Design", in DATE, 2006.
[3] J. Chan, S. Parameswaran: "NoCGEN: A Template Based Reuse Methodology for Networks on Chip Architecture". In Proc. 17th Int. Conference on VLSI Design, 2004, pp. 717-720.
[4] N. Genko, D. Atienza, G. De Micheli et al.: "A Complete Network-On-Chip Emulation Framework". In DATE, 2005.
[5] Y. E. Krasteva, F. Criado et al.: "A Fast Emulation-based NoC Prototyping Framework". Int. Conference on Reconfigurable Computing and FPGAs, 2008, pp. 211-216.
[6] P. Liu, C. Xiang et al.: "A NoC Emulation/Verification Framework". Sixth Int. Conference on Information Technology: New Generations, 2009, pp. 859-864.
[7] F. Moraes, A. Mello, N. Calazans: "HERMES: an Infrastructure for Low Area Overhead Packet-switching Networks on Chip", Integration, the VLSI Journal, vol. 38, no. 1, Oct. 2004.
[8] C. A. Zeferino, A. A. Susin: "SoCIN: A Parametric and Scalable Network-on-Chip", Proc. 16th Symposium on Integrated Circuits and Systems Design, 2003, pp. 169-174.
[9] C. Hilton, B. Nelson: "PNoC: A Flexible Circuit-switched NoC for FPGA-based Systems", in Field Programmable Logic, Aug. 2005.
[10] E. Salminen et al.: "HIBI Communication Network for System-on-Chip", Journal of VLSI Signal Processing Systems, vol. 43, issue 2-3, June 2006, pp. 185-205.
[11] U. Y. Ogras et al.: "Communication Architecture Optimization: Making the Shortest Path Shorter in Regular Networks-on-Chip", in DATE, 2006.
[12] OPNET, www.opnet.com
[13] J. Chan, S. Parameswaran: "NoCGEN: A Template Based Reuse Methodology for Network on Chip", VLSI Design, 2004.
[14] S. Mahadevan, K. Virk, J. Madsen: "ARTS: A SystemC-based Framework for Modelling Multiprocessor Systems-on-Chip", Design Automation of Embedded Systems, 2006.
[15] J. Chan et al.: "NoCGen: A Template Based Reuse Methodology for NoC Architecture". In Proc. ICVLSI, 2004.
[16] Y. E. Krasteva, F. Criado et al.: "A Fast Emulation-based NoC Prototyping Framework", in Reconfigurable Computing and FPGAs, 2008, pp. 211-216.
[17] A. Jalabert et al.: "Xpipes Compiler: A Tool for Instantiating Application Specific Networks on Chip", in DATE, 2004.
[18] U. Y. Ogras et al.: "Communication Architecture Optimization: Making the Shortest Path Shorter in Regular Networks-on-Chip", in DATE, 2006.
[19] OPNET, www.opnet.com
[20] J. Liang, S. Swaminathan, R. Tessier: "aSOC: A Scalable, Single-Chip Communications Architecture", in IEEE Int. Conference on Parallel Architectures and Compilation Techniques, Oct. 2000, pp. 37-46.
!"#$%"&%$'()&(*)+',%$(-)./0&1%)'()2'3)456$-())
!"#$%&'(&)$*+%$,&-+#.*&/(&)(&).*0$%,&1+2&3(&4(&-.5.6.%#,&7+*%.%"$&8(&)$*.+#&7.095:2&$;&'%;$*<.:=0#,&>?-@A,&>$*:$&/5+B*+,&C*.6=5&
D+"#$%(<$*+%$,&0+#.*(<.*0$%,&%+2(0.5.6.%#,&;+*%.%"$(<$*.+#EFG90*#(H*&)!
!"#$%&'$()!"#$!%&'($)*%&+!&,-.$(!/0!1(/'$**%&+!$2$-$&3*!1)'4$5!%&*%5$! %&3$+()3$5! '%(',%3*! ($6,%($*! '/--,&%')3%/&! )('#%3$'3,($*!*,'#! )*! )! 7$38/(4*9/&9:#%1! ;7/:*<! 3/! 5$)2! 8%3#! *')2).%2%3=>!.)&58%53#! )&5! $&$(+=! '/&*,-13%/&! +/)2*?!@)&=! 5%00$($&3! 7/:!)('#%3$'3,($*!#)A$!.$$&!1(/1/*$5>!)&5!*$A$()2!$B1$(%-$&3*!($A$)2!3#)3!(/,3%&+!)&5!)(.%3()3%/&!*'#$-$*!)($!4$=!5$*%+&!0$)3,($*!0/(!7/:! 1$(0/(-)&'$?! "#$($0/($>! 3#%*! 8/(4! 1(/1/*$*! )! (/,3%&+!*'#$-$!')22$5!12)&&$5!*/,('$!(/,3%&+>!8#%'#!%*!%-12$-$&3$5!%&!)!7/:!)('#%3$'3,($!8%3#!5%*3(%.,3$5!)(.%3()3%/&!')22$5!C$(-$*9DE?!"#$! 1)1$(! '/-1)($*! C$(-$*9DE! 3/! 3#$! C$(-$*! 7/:! 3#)3! $-912/=*! 5%*3%&'3! )(.%3()3%/&! )&5! (/,3%&+! -$'#)&%*-*! )&5! )2+/9(%3#-*?!F&$! *$3! /0! $B1$(%-$&3*! $&).2$*! 3/! '/&0(/&3! 5$*%+&! 3%-$!12)&&$5! */,('$! (/,3%&+! )&5! (,&3%-$! 5%*3(%.,3$5! (/,3%&+?! G55%93%/&)22=>!3#$!1)1$(!1($*$&3*!3#$!)5A)&3)+$*!/0!,*%&+!5$)52/'4!0($$!)5)13%A$! (/,3%&+! )2+/(%3#-*! )*! .)*%*! 0/(! .)2)&'%&+! 3#$! /A$()22!'/--,&%')3%/&!2/)5!%&!./3#!(/,3%&+!-$'#)&%*-*?!G&/3#$(!$B1$9(%-$&3!($A$)2*! 3#$! 3()5$/00*!.$38$$&!,*%&+!'$&3()2%H$5!/(!5%*3(%9.,3$5! )(.%3()3%/&?! G! 2)*3! $A)2,)3%/&! $B1/*$*! 3#$! 1$(0/(-)&'$!)5A)&3)+$*! /0! '/-.%&%&+! 5%*3(%.,3$5! )(.%3$(*! 8%3#! 12)&&$5!*/,('$! (/,3%&+?!E$*,23*! $&0/('$! 3#)3! 5$*%+&! 3%-$!12)&&$5! */,('$!(/,3%&+!3$&5*!3/!)A/%5!7/:!'/&+$*3%/&!)&5!'/&3(%.,3$*!0/(!)A$(9)+$! 2)3$&'=! ($5,'3%/&>! 8#%2$! 5%*3(%.,3$5! )(.%3()3%/&! /13%-%H$*!7/:!*)3,()3%/&!0%+,($*?!
I. INTRODUCTION

The growing density of transistors per silicon area enables the implementation of a complete system on a single die, the so-called System-on-a-Chip (SoC). SoCs target high performance with small footprint and low energy consumption when compared to the same system implemented by an equivalent set of chips. A SoC is composed of a possibly large amount of processing elements (PEs), dozens or even hundreds, interconnected by an on-chip communication architecture. As the complexity of applications fitting inside a single SoC raises, scalability and flexibility are achieved through the use of multiprocessor systems on chip (MPSoCs), a special case of SoCs where most or all PEs are programmable processors [1], increasing the SoC architecture flexibility.

SoC and MPSoC designs rely on the massive reuse of predesigned PEs. The communication architecture, on the other hand, is specifically built to fulfill the application requirements, making the design of these components communication centric. Traditional communication architectures such as shared busses and those based on dedicated point-to-point interconnections do not scale well with the ever-growing amount of parallel data transmission [2]. Dedicated point-to-point links lead to communication architectures that are difficult to reuse and enhance in subsequent design revisions. Busses may become bottlenecks, increasing latency and power dissipation. Although hierarchical busses do support parallel communications, scalability suffers and contention increases when communication between PEs located at different sides of a bridge is needed. NoCs are currently considered a better approach for enhancing scalability and power dissipation efficiency [3].

The design of NoC-based MPSoCs must take into account several communication architecture aspects, including topology, buffer dimensioning, output selection (i.e. routing algorithms) and input selection (i.e. arbitration algorithms). In this paper, a NoC is a communication architecture composed of a set of switching elements called routers that employ packet switching communication. Routers interconnected by links form the NoC topology. Some or all routers may also connect to PEs through a network interface, which is not itself considered part of the NoC. The basic function of a router is to monitor incoming packets from its input ports, select one output port and forward packets through some internal path. Arbitration and routing orchestrate the access to and priority over router internal resources and the direction decisions.

The routing behavior may be either a deterministic or an adaptive function. Deterministic routing defines the output port that a packet will take based on its source and destination, irrespective of traffic characteristics. An example of a NoC applying runtime deterministic routing is Hermes [3]. Adaptive routing allows considering more than one candidate output port for a given input port at each router, which can be important for performance, congestion control and fault tolerance, as decisions may consider the instantaneous network status [4]. An example of a NoC architecture proposing the use of runtime adaptive routing is DyAD [5].

Since each router deals with several simultaneous requests to forward packets, an arbitration strategy is necessary. Two choices are to deal with one request at a time, called centralized arbitration, or to deal with a set of requests in parallel, called distributed arbitration. These choices lead to the generic router architectures depicted in Figure 1. In both arbitration strategies competition for router resources may occur.
)!"# !$# )
I%+,($!J!K!"8/!+$&$(%'!7/:!(/,3$(!)('#%3$'3,($*!.)*$5!/&!3#$!)(.%3()3%/&!'#/%'$L!;)<!'$&3()2%H$5!)(.%3()3%/&M!;.<!5%*3(%.,3$5!)(.%3()3%/&?!
Centralized arbitration produces routers which are simpler, while distributed arbitration trades increased router complexity for enhanced performance [6]. Centralized arbitration usually implies that the router contains only one single routing unit, for which all input ports compete. Arbitration and routing define a connection between an input and an output port, after which transmission in that connection starts and the routing unit is released to serve other pending input port requests. Distributed arbitration, on the other hand, implies that competition for resources occurs only at the output ports. This requires some hardware unit replication at the input and output ports (routing and arbiters, respectively), but may increase performance dramatically. The results in this paper support this statement.

The automated design of NoCs may quickly produce a generic communication architecture solution. However, NoC/SoC silicon area, power dissipation and performance may be optimized if architecture configuration and usage are planned [7]. This is especially true when the application communication patterns are known in advance. Considering this situation and restricting attention to path selection among communicating pairs for source routing NoCs, it is usual to employ bandwidth limits as a criterion to define a set of communication routes [8] [9]. Bandwidth limits are an efficient way to spread the overall link load. However, using only this criterion may result in communication loads that are badly distributed in time, which may in turn compromise the overall NoC performance. Moreover, the advantage of design time path selection when compared to runtime distributed routing algorithms is not clear. This paper reports isolated and joint comparisons between source versus distributed routing and centralized versus distributed arbitration strategies.

Remember that most, if not all, routing algorithms used in NoCs are deadlock free because they preclude the use of some routes from source to destination. Often, the justification to use adaptive routing algorithms in place of deterministic ones comes from the capacity of the former to avoid contention by using alternate routes when a conflict occurs at runtime. This paper values another important property of adaptive routing algorithms, which is the richer set of possible routes a packet can use to go from source to destination. This property makes it sound to combine source routing and adaptive routing mechanisms when traffic patterns are known at design time. The richer set of routes facilitates overall load balancing in the communication architecture, while deadlock freedom guarantees the integrity of operation of the communication architecture.

The remainder of this paper is organized as follows. Section II presents the process adopted for route mapping coupled to the Hermes-SR NoC. Section III depicts the Hermes-SR architecture. Section IV describes the experimental setup and results, while Section V displays a set of conclusions and directions for future work.
II. ROUTE MAPPING

The definition of communication routes may be guided by different requirements, aiming at power dissipation, area or performance. This work considers communication performance as the key requirement, measured by the reduction of potential congestion and evaluated by average packet latency. Basically, congestion is detected when the amount of incoming data (or requests for incoming data) is larger than the outgoing data from a given communication element (e.g. a router). A reason for this is a bad distribution of communication flows, which may imply overloaded channels, called hotspots. To avoid hotspots, it is mandatory to: (i) explore alternative paths for each pair of communicating modules, (ii) combine paths of communicating pairs in a traffic scenario and (iii) evaluate route mappings for each traffic scenario.
A. Exploring Alternative Paths

A path is defined here as the sequence of router output ports used to transmit packets from a source to a destination. Depending on the routing algorithm, more than one alternative path may exist between a given source and destination. The exploration of alternative paths has to guarantee that at least one path exists for each communicating pair and that no deadlock will occur when paths are combined into traffic. There are two ways to obtain deadlock freedom: (i) through formal verification or (ii) through the adoption of deadlock-free routing algorithms as basis for path computation. The present work uses the second approach. It employs four different routing algorithms, Pure XY (XY), and the three turn model variations: Negative First (NF), West First (WF) and North Last (NL) [4]. The first is deterministic, while the remaining are adaptive algorithms implemented in two flavors: minimal (NFM, WFM, NLM) and non minimal (NFNM, WFNM, NLNM). For XY routing, exactly one path exists between two communicating entities. For minimal adaptive algorithms, the number of distinct paths (npaths) is either 1 or is defined by Equation (1), where x and y represent the distance in hops along the corresponding axis (x or y) between source and destination.
npaths = (|x| + |y|)! / (|x|! × |y|!)        (1)
For non minimal adaptive algorithms, npaths depends on the relative source and destination positions and on the routing algorithm rules. For example, for the non minimal negative first algorithm (NFNM), it is possible to use Equation (2).
npaths = Σ_{xt=0}^{xs} Σ_{yt=0}^{ys} (|xtd| + |ytd|)! / (|xtd|! × |ytd|!)        (2)
Equation (2) is valid when the destination of the packet is at a point (xd, yd) above and to the right of the source coordinates (xs, ys). In this equation, the pair (xt, yt) represents the position resulting from the displacement from the source to the most negative position in the network. Also, xtd and ytd represent the distance in hops along the x (resp. y) axis between position (xt, yt) and the destination, i.e. xtd = xt − xd and ytd = yt − yd. For the other non minimal adaptive routing algorithms similar equations and considerations apply.
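Equation (1) is the standard count of monotone (minimal) paths on a mesh, which can be cross-checked by brute force. The sketch below is illustrative only — the function names are ours, not part of Hermes-SR:

```python
from itertools import permutations
from math import factorial

def npaths_minimal(x, y):
    # Equation (1): number of distinct minimal paths when the destination
    # is |x| hops away along one axis and |y| hops along the other.
    x, y = abs(x), abs(y)
    return factorial(x + y) // (factorial(x) * factorial(y))

def npaths_enumerated(x, y):
    # Brute-force cross-check: count the distinct orderings of the
    # |x| horizontal ('X') and |y| vertical ('Y') hops of a minimal path.
    return len(set(permutations("X" * abs(x) + "Y" * abs(y))))
```

For a displacement of 2 hops in x and 3 in y, both functions count 10 minimal paths, matching (2+3)!/(2!·3!).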
B. Combining Paths of Communicating Pairs

A route mapping is defined when, for all pairs of communicating modules, one and only one path is chosen for each source and destination pair (represented by the iteration variable i). The amount of possible route mappings depends on the number of communicating pairs and the routing algorithm. The greater the number of alternative paths per communicating pair, the greater the number of achievable route mappings. Equation (3) defines the maximum number of route mappings (nmapRM), where i identifies a communicating pair, npairs is the total number of communicating pairs and npaths(i) is the number of alternative paths for i. As an example, nmapRM is always equal to 1 when XY routing is adopted, since there is only one possible path for each communicating pair.
nmapRM = Π_{i=1}^{npairs} npaths(i)        (3)
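Equation (3) is a plain product over the communicating pairs; a one-line Python sketch (illustrative, names are ours):

```python
from math import prod

def nmap_rm(npaths_per_pair):
    # Equation (3): the maximum number of route mappings is the product
    # of the alternative-path counts over all communicating pairs.
    return prod(npaths_per_pair)
```

With XY routing every pair contributes a single path, so three pairs give nmap_rm([1, 1, 1]) = 1, while two pairs with 2 and 3 alternatives already allow 6 distinct route mappings.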
C. Evaluation of Route Mappings

The evaluation of route mappings is based on: (i) the communication characteristics, modeled by all communications of the application in terms of source module, target module and transmission rate, (ii) the alternative paths for each communicating pair, modeled by a graph containing all possible paths from each source to each destination module, and (iii) the selected cost function, which considers the average rate of path occupancy, the peak usage of the path and the path length. Smaller values for these aspects lead to better paths.

Initially, a valid path is randomly assigned to each communicating pair. In this step, no additional care is taken for path binding. The only guarantee offered by this assignment process is the existence of the path on the list of alternative paths for the given communicating pair. After a path has been assigned to each communicating pair, NoC occupation is estimated by accumulating the transmission rate of each communicating pair. The next steps seek to optimize this initial route mapping.

The route mapping optimization is carried out by varying the paths of each communicating pair, trying all alternative paths. When a communicating pair is being evaluated, the remaining pairs have their paths fixed.

A new path for a communicating pair is assumed if, compared to the current route mapping: (i) the average rate of path occupancy is lower and its peak usage is lower or equal, or (ii) the average rate of path occupancy is equal and the peak usage is lower, or (iii) the average rate of path occupancy is equal, the peak usage is equal and the path length is shorter. The first rule guarantees a better distribution of NoC communicating flows, equalizing the communication channel occupation. The second rule guarantees that if a low congestion zone cannot be found, at least hotspots are avoided, bringing peak usage down. Finally, when using a non minimal routing algorithm, the third rule guarantees that if the same average rate of path occupancy can be found in a shorter path, the opportunity for lower power dissipation is not overlooked.

The process of route mapping optimization finishes in three possible situations: (i) when there is only one possible route mapping (e.g. when using XY routing), (ii) after reaching a given number of tries, which is parameterizable, and (iii) when no further optimization can be obtained after all communicating pairs have been evaluated at least once. The resulting route mapping is called planned source routing (PSR).
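The acceptance rules (i)-(iii) above form a simple comparison that the optimization loop applies to each candidate path. A minimal sketch, assuming a hypothetical representation where each path is summarized by its average occupancy rate, peak usage and length:

```python
def accept_new_path(current, candidate):
    # current / candidate: (avg_occupancy, peak_usage, length) tuples.
    cur_avg, cur_peak, cur_len = current
    new_avg, new_peak, new_len = candidate
    if new_avg < cur_avg and new_peak <= cur_peak:
        return True   # rule (i): lower average, peak not worse
    if new_avg == cur_avg and new_peak < cur_peak:
        return True   # rule (ii): same average, lower peak
    if new_avg == cur_avg and new_peak == cur_peak and new_len < cur_len:
        return True   # rule (iii): tie on load, shorter path
    return False
```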
III. THE HERMES-SR ARCHITECTURE

Before describing Hermes-SR, it is necessary to approach two other NoC architectures used here as a basis for comparisons, Hermes and Hermes-M. Hermes and Hermes-M are 2D-mesh topology NoCs with routers using credit based flow control, input buffering, wormhole packet switching and centralized arbitration (round-robin algorithm). However, while the routing scheme of Hermes is distributed, Hermes-M uses a planned source routing scheme.

Hermes-SR differs from Hermes-M because it employs a distributed arbitration scheme with a first come-first served (FCFS) algorithm, which guarantees in-order packet servicing.

Depending on the PE mapping, on the specific (adaptive) routing algorithm and on some application characteristics, routing may overload several network regions, implying a decrease of the communication architecture efficiency due to the increase of packet latencies.

The predefinition of paths supported by Hermes-SR source routing guarantees a known worst case for link loads. Also, it may help optimizing NoC area through the elimination of unused links and through buffer dimensioning, both of which are outside the scope of this work.

All NoCs in this work assume a simple packet structure, composed of a header containing destination and size information and a payload. All three NoCs support arbitrary flit sizes, although all experiments here use only 16-bit flits. The NoC packets slightly differ in their header structure. In Hermes, the 2-flit packet header stores the destination router address as first flit, followed by a flit with the size of the payload. In Hermes-SR and Hermes-M, the header starts with an in-order sequence of the output ports necessary to arrive at the destination, followed by the single-flit payload size. While the Hermes header occupies exactly two flits, the variable size headers of the other two NoCs comprise at least three flits, due to the use of an additional flit as route terminator flag.

Concerning arbitration in Hermes-SR, each router input port directly notifies the desired output port to transmit a packet. As illustrated in Figure 1(b), this approach enables serving multiple requests to distinct ports in parallel. Transmission requests are stored and served in arrival order by each output port. Figure 1(a) illustrates the other approach, used in Hermes and Hermes-M, which employs centralized round-robin arbitration. Here, each input port requests routing from a control unit and waits either for an output port assignment or for a denial of assignment, if the requested port is already busy. In any case, if arbitration serves a port, that port loses its priority. Because of this, input ports may suffer from starvation, depending on the network load and the amount of competition among communication flows.
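The FCFS queue that each Hermes-SR output port keeps can be captured with a toy model (Python, illustrative only — the class and port names are ours), which makes clear why requests to distinct ports are served in parallel:

```python
from collections import deque

class OutputPortArbiter:
    # One instance per output port: requests are queued and granted
    # strictly in arrival order (first come-first served).
    def __init__(self):
        self._queue = deque()

    def request(self, input_port):
        self._queue.append(input_port)

    def grant(self):
        # Returns the next input port to serve, or None when idle.
        return self._queue.popleft() if self._queue else None
```

Since every output port owns its own arbiter, a request for the East port never waits behind an unrelated request for the Local port, unlike the single round-robin unit of Figure 1(a).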
IV. EXPERIMENTAL SETUP AND RESULTS

Hermes-SR, Hermes and Hermes-M were described in synthesizable RTL VHDL with fixed dimensioning (5x5), flit size (16 bits) and a 50MHz operating frequency, resulting in a bandwidth of 800Mbps per link. The experiments employed various routing algorithm and buffer size combinations.

Several synthetic and real traffic patterns allowed evaluating arbitration and routing schemes. While real traffic scenarios allow assessing the behavior of specific applications, synthetic traffic scenarios enable exploring the limits of the NoCs, such as saturation and behavior under congestion.

A set of text files describes each traffic pattern as a set of packets (header + payload) and the ideal injection moment for each packet. This is an integer informing the number of clock cycles after simulation start. Injection sources are implemented by cycle- and pin-accurate SystemC input modules, responsible for interpreting traffic files and injecting packets into the NoC. SystemC output modules grab packets from the NoC outputs, storing their contents and arrival moment for statistical evaluation.

Latency values presented here are not limited to the NoC transmission delay. Figure 2 differentiates transmission latencies based on injection and reception distributions. A planned injection distribution is defined in the traffic scenario text files, and depicts the ideal injection moment for each packet i. The accomplished injection distribution considers the actual packet insertion moment into the NoC, which can be delayed by contention at the packet source. The ideal reception distribution represents the expected delivery moments of packets, taking network status into account or not. The accomplished reception distribution represents the real moment where packets are delivered to their destination. No contention at the destination is considered.
"#$%&'()#!
*%&%+'()#!
%&"''()
*)("&+,"-('./+
011&2."-23'+,"-('./+4(-5367+,"-('./+
+,&-%'!(./! +,&-%'!(! +,&-%'!(0/!
0..381&29:()
*)("&
0..381&29:() )I%+,($!N!K:/--,&%')3%/&!2)3$&'=!3=1$*?!
A hypothetical distribution of such injection and reception scenarios is illustrated in Figure 2. Ideal latency is the minimum number of cycles a packet needs to reach its destination. This is based on the difference between the ideal injection moment and the expected delivery moment. Network latency is the delay verified by the packet during its traffic from source to destination, which may be influenced by competition for NoC resources (e.g. links, buffers, arbitration, routing). Application latency normally brings the most important impact on the ideal communication performance. This is computed as the difference between the ideal injection moment of packets and their effective delivery moment at the destination. Application latency is the value assumed for comparison in the next experiments.
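Given the four moments defined above for one packet, the three latency figures reduce to simple differences; a sketch with illustrative names:

```python
def latency_figures(ideal_injection, accomplished_injection,
                    ideal_reception, accomplished_reception):
    # All arguments are clock-cycle counts for a single packet.
    ideal = ideal_reception - ideal_injection                   # ideal latency
    network = accomplished_reception - accomplished_injection   # network latency
    application = accomplished_reception - ideal_injection      # application latency
    return ideal, network, application
```

For example, a packet planned for cycle 100, actually injected at 110, expected at 150 and delivered at 180 has an ideal latency of 50, a network latency of 70 and an application latency of 80 cycles.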
A. Evaluating Performance under an All-to-All Traffic Pattern

Two classes of experiments were performed to evaluate performance optimization. The first class compares the usage of PSR against distributed routing (DR), with the results summarized in Figure 3. The second class focuses on centralized versus distributed arbitration, with results depicted in Figure 4.

The comparison values were captured from five distinct traffic scenarios, with injection rates of 10%, 20%, 30%, 40% and 50% of the channel bandwidth capacity, corresponding to absolute rates of respectively 80Mbps, 160Mbps, 240Mbps, 320Mbps and 400Mbps. The temporal distribution of packet injection is uniform. Packets for Hermes have 20 flits, while for the Hermes-SR and Hermes-M NoCs the size varies around this value, depending on the amount of hops to reach the destination. The spatial distribution is one-to-all from each injection source, producing an all-to-all NoC traffic pattern.

The first experiment (Figure 3) assumed a traffic scenario where sources inject 30% of the channel bandwidth capacity. It mainly compares Hermes running DR and Hermes-M using PSR with different routing algorithms. Both architectures employ a centralized arbitration scheme (round-robin).
Figure 3 – Latency results obtained when comparing distributed routing (DR) versus planned source routing (PSR) approaches.
Figure 3 depicts that PSR led to increased latency when compared to DR for NFNM and NFM, except for one case, which resulted in a small gain (1.12% – around 350 clock cycles faster in average). However, for the remaining routing algorithms latencies were reduced in all cases for PSR when compared to DR.

The behavior of WF and NL has three explanations. The first is the degree of freedom provided by WF and NL when compared to NF for route mapping exploration. This can be formally demonstrated but, intuitively speaking, WF and NL determine a single direction that must be employed at the start (WF) or end (NL) of the routing process, while NF determines that only two directions (the negative ones) can be taken at the start of the routing process. The second explanation is the global knowledge of channel loads when adopting planned routing. The third is the bad decisions that DR may take, since judgments are made based on locally available information only to resolve congestion, possibly deviating packets to other congested regions.

During simulation, DR always achieved lower latencies when comparing WFNM to NLNM and WFM to NLM. However, when using PSR, latencies are always lower when comparing NLNM to WFNM and NLM to WFM. These results show that the choice of routing algorithm strongly depends on the choice of routing strategy.

Figure 4 shows a second evaluation experiment that assumes traffic scenarios with packet injection rates varying from 10% to 50%. This evaluation essentially compares centralized (Hermes or Hermes-M) and distributed (Hermes-SR) arbitration schemes.

Routes were defined with the XY algorithm, guaranteeing the same packet distribution for all NoCs. No significant difference is observed up to a 20% injection rate. At the 30% injection rate and above, it is noticeable that distributed arbitration can reduce the router control congestion, since latencies are significantly reduced when compared to a centralized approach. Additionally, it is observable that the bigger the buffer sizes, the lower the average latency, in all cases. However, it can also be observed that at each injection rate, the lowest average latency obtained when centralized arbitration employs the biggest buffer size (32-flit buffer) is greater than the average latency obtained with the same injection rate for distributed arbitration using the smallest buffer size (4-flit buffer), except at a 30% injection rate. This case is not really relevant because the difference is slight (around 10 clock cycles in average – less than 10% of difference).
[Charts: Centralized vs Distributed Arbitration (XY Algorithm), panels for 10%-50% injection rates]

Figure 4 – Average latencies for centralized (Hermes and Hermes-M) and distributed arbitration (Hermes-SR). Injection rates vary from (a) 10%-20%, (b) 30% to (c) 40%-50%.
B. Evaluating Performance under a Hotspot Traffic Pattern

This experiment allows evaluating the distributed or centralized arbitration architectural choices, and distributed or source routing decisions. Also, it enables measuring the effectiveness of statically planned routing. A hotspot traffic scenario is used, where two nodes concentrate all packet destinations on the NoC. A 100 Mbps injection rate with uniform temporal distribution per source is used. Table 1 presents the average latency computed during simulation. In the XY algorithm line of Table 1 only architectural decisions can be analyzed, since the same routes are used for all NoCs. Comparing the Hermes and Hermes-M columns, it can be seen that no gain is achieved when adopting distributed or planned source routing in the XY line. However, the comparison of the Hermes-SR column with the two previous ones shows that distributed arbitration led to more than 40% of latency reduction when compared to the centralized approach.
").2$!J!K7/:!)A$()+$!2)3$&'%$*!'/-1)(%*/&!;%&!'2/'4!'='2$*<?!
!
1)2! 3%45%6! 3%45%6.7! 3%45%6.8*
*)9'(#:!6&;%5%! <(6'4(=9'%<! 6)94&%!>?8*@!
A4=('4,'()#!6&;%5% &%#'4,B(C%<! <(6'4(=9'%<!
*)9'(#:!,B:)4(';5!
DE! /FGHIJKHL! /FGHIJKHL MGJILKNL
1O! PGPLIKFL! /GMNIKFJ FMQKQN
1O7 IGIIPKNI! /GLNJKQL FLJKHL
RO! FG/IMKII! PGLJLKJM /GHNFKIP
RO7! //GHM/KIL! FGPIIKJJ QGPMQKQI
1S QGJPHKMF! QGHFHK/H JFIK/M
1S7! MGHFPK//! QG/HPKNN JPPKHL)
When comparing the Hermes and Hermes-M columns of Table 1 – the same arbitration and different routing schemes – for all routing algorithms (except XY) the effectiveness of PSR stands out. The experiment showed latency reductions from 24.93% (NL) up to 73.87% (NLM) in these cases.

When comparing the Hermes-M and Hermes-SR columns of Table 1 – the same routing and different arbitration schemes – the effectiveness of distributed arbitration is highlighted. The experiment showed an average latency reduction of 53.29%.

Additionally, the benefit of combining planned source routing and distributed arbitration, as supported by Hermes-SR, is noticeable. Comparing the first and last columns, the average latency is up to 11 times smaller (NLM) and the average latency reduction for all cases is 70.20%.

Figure 5 illustrates the link workload estimation when different routing algorithms are used as basis for route mappings in Hermes-SR. The XY picture presents two peaks of load concentration, resulting in high competition and consequent performance degradation. NF, NFM, NL and NLM are the most suitable algorithms to distribute the workload over the NoC links, followed by WF and WFM.
Figure 5 – Link workload estimation when employing distinct routing algorithms for path selection in planned source routing applied to a 5x5 Hermes-SR NoC. The peak at the top of each square represents the maximum workload reached during simulation.
C. Evaluating Area Costs

Table 2 depicts the area usage of Hermes and Hermes-SR central routers (with 5 input and output ports) in a XC5VLX30 Virtex 5 FPGA. Area is expressed in terms of number of LUTs and flip-flops for four configurations of buffer size. Synthesis results come from the use of the XST tool, part of the Xilinx ISE 9.2i toolset. Nothing but default parameters were assumed in the synthesis tool. Hermes-M is not explicitly referenced in Table 2, since its implementation is quite similar to Hermes. The difference implied by the routing mechanism implementation has no significant impact in terms of area.

Table 2 – Hermes and Hermes-SR area comparison (5-port router).
Buffer size | Hermes LUTs | Hermes Flip-flops | Hermes-SR LUTs | Hermes-SR Flip-flops
4           | 1064        | 212               | 1437           | 240
8           | 1128        | 235               | 1505           | 260
16          | 1227        | 254               | 1634           | 280
32          | 1532        | 266               | 1915           | 300
* results for a 5-port router

Although Hermes-SR improves performance, it clearly penalizes area, since it presents an average of 31.7% more LUTs and 11.7% more flip-flops than Hermes. However, Figure 6 (extracted from Figure 4) points out that distributed arbitration NoCs, even implemented with 4-flit buffers, achieve better performance than centralized arbitration NoCs implemented with 32-flit buffers for all injection rates, except 30%. Comparing a 4-flit buffer Hermes-SR NoC with a 32-flit buffer Hermes, the average latency for all injection rates reduces by approximately 35.2%.
Figure 6 – Average latencies when comparing distributed arbitration in 4-flit buffer Hermes-SR to centralized 32-flit buffer Hermes for injection rates varying from 10% to 50%.
When considering an implementation of a 4-flit buffer Hermes-SR and an implementation of a 32-flit buffer Hermes, it is visible that choosing Hermes-SR implies area consumption reduction (6.2% less LUTs and 9.8% less flip-flops). Since the average latency is also reduced in the Hermes-SR implementation, even with such a small buffer size, it is easy to conclude that distributed arbitration with planned source routing is a good design choice, reducing size and latency. Moreover, some works show that buffer size strongly contributes to power dissipation [10], while others show that NoC buffers may account for around 90% of the power dissipation in a router [11]. Therefore, the reduction from 32-flit buffers to 4-flit buffers probably also implies significant energy savings.
V. CONCLUSIONS AND ONGOING WORK

In this paper, the main contributions come from the performance evaluation of routing and arbitration architectural decisions and their impact on area costs. A secondary contribution is the proposition of a method for path computation, based on previously known application traffic behavior, which guarantees absence of deadlock and a more balanced communication load.

The Hermes-SR NoC architecture was implemented to explore the use of distributed arbitration and source routing. This NoC served to compare the trade-offs involved in deciding about the use of source versus distributed routing strategies, as well as about the use of centralized versus distributed arbitration strategies. Additionally, the paper proposes a route mapping process that can be advantageous for enabling the estimation of link occupancy in NoCs and the consequent performance improvement. This allows addressing congestion mitigation through hotspot avoidance. The results point to a NoC design with performance improvement, power dissipation reduction and area savings.

Ongoing work centers on dynamic traffic scenarios, where applications are loaded at runtime and communication requirements are requested on the fly. Self adaptive NoCs based on global information knowledge are an interesting choice, since results definitely showed that decisions based on locally acquired information may lead to bad performance results.
ACKNOWLEDGEMENTS

The authors acknowledge the support of CNPq through research grants 141247/2005-3, 308924/2008-8, 309255/2008-2, 301599/2009-2 and of the FAPERGS grant 10/0814-9.
REFERENCES
[1] Wolf, W. et al. "Multiprocessor System-on-Chip (MPSoC) Technology". IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 27(10), Oct. 2008, pp. 1701-1713.
[2] Pasricha, S.; Dutt, N. "On-Chip Communication Architectures – System on Chip Interconnect". Morgan Kaufmann Science, 2008, 544p.
[3] Moraes, F. et al. "Hermes: an infrastructure for low area overhead packet-switching networks on chip". Integration VLSI Journal, 38(1), Oct. 2004, pp. 69-93.
[4] Glass, C.; Ni, L. "The Turn Model for Adaptive Routing". Journal of the Association for Computing Machinery, 41(5), Sep. 1994, pp. 874-902.
[5] Hu, J.; Marculescu, R. "DyAD - Smart Routing for Networks-on-Chip". In: DAC'04, 2004, pp. 260-263.
[6] Kulmala, A. et al. "Distributed bus arbitration algorithm comparison on FPGA-based MPEG-4 multiprocessor system on chip". IET Computers & Digital Techniques, Jul. 2008, 2(4), pp. 314-325.
[7] Bertozzi, D.; Benini, L. "Xpipes: A Network-on-chip Architecture for Gigascale Systems-on-Chip". IEEE Circuits and Systems Magazine, 4(2), 2004, pp. 18-31.
[8] Fen, G.; Ning, W. "A Minimum-Path Mapping Algorithm for 2D mesh Network on Chip Architecture". In: APCCAS'08, 2008, pp. 1542-1545.
[9] Bolotin, E. et al. "Routing Table Minimization for Irregular Mesh NoC". In: DATE'07, 2007, pp. 942-947.
[10] Banerjee, N. et al. "A Power and Performance Model for Network-on-Chip Architectures". In: DATE'04, 2004, pp. 1250-1255.
[11] Palma, J. et al. "Mapping Embedded Systems onto NoCs – The Traffic Effect on Dynamic Energy Estimation". In: SBCCI'05, 2005, pp. 196-201.
)
On-Chip Efficient Round-Robin Scheduler for High-Speed Interconnection
Pongyupinpanich Surapong and Manfred Glesner
Microelectronic Systems Research Group,
Technische Universität Darmstadt, Darmstadt, Germany
Email: {surapong; glesner}@mes.tu-darmstadt.de
Abstract—Due to the simplicity of scheduling, the buffered crossbar is becoming attractive for high-speed communication systems. Although the previously proposed Round-Robin algorithms achieve 100% throughput under uniform traffic, they cannot achieve satisfactory performance under non-uniform traffic. In this paper, we propose an efficient Round-Robin scheduling algorithm based on a binary-tree scheme where a service policy is applied to improve Quality-of-Service. With the proposed scheduling algorithm, a searching time-complexity of O(1) (one clock cycle) and 100% throughput under non-uniform traffic can be obtained. Based on a binary-tree structure, the design achieves high data rates in the Tbps range and a simple design with combinational circuits. The design has been simulated on both FPGA-based (Virtex 5) and silicon-based (0.18 µm) technology. The synthesis results show that the consumed resources vary from 11 to 533 slices and from 46 to 1686 2-NAND gates for crossbars of size 4×4 to 128×128. Critical path delays from 0.72 to 4.52 ns for the FPGA-based and from 1.33 to 4.0 ns for the silicon-based design were obtained.
I. INTRODUCTION
Performance and efficiency of a generic buffered crossbar depend on input-, internal-, and output-scheduling mechanisms [1]. It is composed of three main structures: input ports, output ports and a switch fabric interconnecting the input and the output ports. The complexity of the schedulers located at all crosspoints to manage the data queues is O(log N²), where N is the number of input ports, based on a symmetrical structure [1]. Thus, improving the performance and efficiency of a scheduler is attractive for interconnection designers.
Scheduling schemes are divided into two main categories: weighted algorithms and Round-Robin algorithms. T. Javadi et al. [2] and M. Nabeshima [3] introduced LQF-RR and OCF-OCF to match inputs to outputs. Since their basic building blocks for the matching operations are integer comparators and multiplexers, their complexities are O(N log N). To reduce this complexity, the internal information structure SCBF [4] was proposed with O(log N). However, it has unstable regions for the states of the input virtual output queues (VOQs), and its complexity is still too sensitive to the crossbar size. Therefore, schedulers based on weighted algorithms have limitations for building high-speed and large-capacity crossbars.
Due to its simplicity, fairness, 100% throughput and contention-free operation, a Round-Robin-based mechanism was proposed as RR-RR [5]. It has been improved with DRR [6] and DRR-k [1]. These two versions applied a double-pointer updating mechanism to overcome the limited performance of the Round-Robin scheme. However, since the position of the double pointer has to be updated as fast as possible, their design structure based on comparator and counter functions is too complex to support data rates up to Terabits per second. Chao [7] proposed a structure based on a binary-tree arbiter which can perform the arbitration in a fast and efficient way. However, this framework cannot guarantee fairness to all inputs during non-uniform traffic.
In this paper, we explore the design of an efficient Round-Robin scheduler based on a binary-tree structure which guarantees fairness and 100% throughput, without contention, under non-uniform traffic. The design achieves very high data rates with low time-complexity (O(1)). A service function has been included to improve Quality-of-Service (QoS) for all input ports.
The rest of this paper is organized as follows: the efficient Round-Robin algorithm is explained in section II. Section III introduces the hardware implementation of an 8×8 efficient Round-Robin scheduler based on an 8×8 buffered crossbar. Performance and efficiency of the design are reported in section IV and compared with the related work. Finally, conclusions are presented in section V.
II. AN EFFICIENT ROUND-ROBIN ALGORITHM
TABLE I: Binary-tree selection on a Leaf- and Root-Node [9].
State  PL  RL  PR  RR  Leaf-Node  Root-Node
1      0   0   0   0   -          -
2      0   0   0   1   right      -
3      0   0   1   0   right      right
4      0   0   1   1   right      right
5      0   1   0   0   left       -
6      0   1   0   1   left       -
7      0   1   1   0   left       left
8      0   1   1   1   right      right
9      1   0   0   0   left       left
10     1   0   0   1   right      right
11     1   0   1   0   -          -
12     1   0   1   1   -          -
13     1   1   0   0   left       left
14     1   1   0   1   left       left
15     1   1   1   0   -          -
16     1   1   1   1   -          -
A. Binary-Tree algorithm
A time-optimized Round-Robin algorithm can be realized by
applying the binary-tree arbitration [7]. With an N×N buffered crossbar structure, the binary-tree level equals log(N + 1), as shown in [8]. Since the basic element of any node of the binary tree has four inputs and two outputs [9], the outputs can be defined by the association of input priority and input request on either the left or the right side, respectively. Assuming that P(L or R) and R(L or R) are the input priority and input request, table I determines
the selective states of the Leaf-Node and the Root-Node under their possible actions.
For example, suppose that we have four inputs comprising priorities and requests, with PL3RL3PR2RR2 and PL1RL1PR0RR0 respectively equal to 0101 and 0011. Conforming to the binary-tree selection table I, RR0 is selected, as shown in figure 1.
Fig. 1: A binary-tree selection conforming to table I, where node PR0RR0 is selected when PL3RL3PR2RR2=0101 and PL1RL1PR0RR0=0011.
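The worked example can be replayed with a small software model of Table I. The Python sketch below is an illustration only: the dictionaries transcribe Table I, the function and variable names are my own, and it assumes every node resolves to a side, as in the example above.

```python
# Software model of the binary-tree selection of Table I (a sketch, not the
# RTL). Keys are (P_left, R_left, P_right, R_right); values say which side
# a node forwards, or None for the table's "-" entries.
LEAF = {
    (0, 0, 0, 0): None,    (0, 0, 0, 1): "right",
    (0, 0, 1, 0): "right", (0, 0, 1, 1): "right",
    (0, 1, 0, 0): "left",  (0, 1, 0, 1): "left",
    (0, 1, 1, 0): "left",  (0, 1, 1, 1): "right",
    (1, 0, 0, 0): "left",  (1, 0, 0, 1): "right",
    (1, 0, 1, 0): None,    (1, 0, 1, 1): None,
    (1, 1, 0, 0): "left",  (1, 1, 0, 1): "left",
    (1, 1, 1, 0): None,    (1, 1, 1, 1): None,
}
# The Root-Node column of Table I differs only in states 2, 5 and 6 ("-").
ROOT = {**LEAF, (0, 0, 0, 1): None, (0, 1, 0, 0): None, (0, 1, 0, 1): None}

def arbitrate4(pr3, pr2, pr1, pr0):
    """Arbitrate four (priority, request) pairs; return the winning index.
    Assumes each node resolves to a side, as in the paper's example."""
    left_sel = LEAF[pr3 + pr2]        # leaf node over inputs 3 and 2
    right_sel = LEAF[pr1 + pr0]       # leaf node over inputs 1 and 0
    left_win = (3, pr3) if left_sel == "left" else (2, pr2)
    right_win = (1, pr1) if right_sel == "left" else (0, pr0)
    # The root node arbitrates between the two leaf winners.
    root_sel = ROOT[left_win[1] + right_win[1]]
    return (left_win if root_sel == "left" else right_win)[0]

# The worked example: PL3 RL3 PR2 RR2 = 0101 and PL1 RL1 PR0 RR0 = 0011.
print(arbitrate4((0, 1), (0, 1), (0, 0), (1, 1)))  # -> 0, i.e. RR0 is granted
```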
B. Round-Robin-based scheduling mechanism with service function
In this section, we introduce a Round-Robin algorithm based on a binary-tree searching scheme where a service function is applied to all input ports in order to improve QoS under the non-uniform traffic defined by [1].
TABLE II: A Round-Robin-based scheduling mechanism with a service function to improve QoS under non-uniform traffic.
Mechanism: Round-Robin-based scheduling mechanism
Input: Credit Buffer (CBi), Request (Reqi), service-ratio (Servicei)
Output: Grant (Granti)
Internal: Priority (Prii), counter, start, enable, i
1)  Beginning: Pri0 = 1, counter = 0, i = 0, enable = 1, start = 0
2)  while (1) loop
3)    if enable = 1 then
4)      Reading all CBs, Reqs and Services, start = 0
5)      Grant = binary-tree function(Req, CB, Pri)
6)      if Grant > 0 then
7)        enable = 0, start = 1
8)        for j = 1 to N
9)          i = j where Grantj = 1
10)       end for
11)     end if
12)   end if
13)   if counter = Servicei then
14)     counter = 0, enable = 1
15)     Pri = cyclic shift left(Grant)
16)   elsif CBi > 0 and start = 1 then
17)     counter++
18)   end if
19) end loop
At the beginning, the internal parameters are set as in Line 1). Within the loop, if enable is in the enabled status, the algorithm reads the CB, Req and Service information. Meanwhile, start is set to the disabled status. Grant is computed by the binary-tree function in Line 5). In Line 6), if Grant is greater than zero, enable and start are set to the disabled and enabled status, respectively. Afterward, i is determined corresponding to Grant's value. Between Line 13) and Line 18), counter counts up while start is in the enabled status. When counter reaches Servicei, enable returns to the enabled status; afterward the loop restarts.
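The loop above can be exercised in software. The following is a behavioural Python rendering of the Table II mechanism, a sketch under stated assumptions: all names are mine, and the binary-tree circuit is stood in for by a simple round-robin search from the current priority position. It models one grant decision per simulated clock cycle.

```python
def rr_schedule(reqs, cbs, service, cycles):
    """Behavioural model of the Table II loop. reqs: request bit per port;
    cbs: credit count per port; service: service-ratio per port. Returns
    the granted port for each simulated clock cycle (None while idle)."""
    n = len(reqs)
    pri, counter, enable, start, i = 0, 0, True, False, 0
    grants = []
    for _ in range(cycles):
        if enable:
            start = False  # Line 4): reading inputs, start disabled
            # Stand-in for Grant = binary-tree function(Req, CB, Pri):
            # round-robin search starting at the priority position.
            for off in range(n):
                j = (pri + off) % n
                if reqs[j] and cbs[j] > 0:
                    i, enable, start = j, False, True  # Lines 6)-9)
                    break
        grants.append(i if start else None)
        if counter == service[i]:          # Lines 13)-15)
            counter, enable = 0, True
            pri = (i + 1) % n              # cyclic shift left of the grant
        elif cbs[i] > 0 and start:         # Lines 16)-17)
            counter += 1
    return grants

# Four ports, all requesting, service-ratios all 1: the grant rotates.
print(rr_schedule([1, 1, 1, 1], [3, 3, 3, 3], [1, 1, 1, 1], 8))
# -> [0, 0, 1, 1, 2, 2, 3, 3]
```

With a larger service-ratio on one port, that port holds the grant proportionally longer, which is the QoS effect the service function is meant to provide.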
Fig. 2: Design entity of the efficient Round-Robin scheduler for eight requests.
Fig. 3: Block diagram of the efficient Round-Robin scheduler, where the service-ratios (S), the 8-bit VOQ request and the credit buffers (CB) are its inputs.
Based on this mechanism and under non-uniform traffic (hot-spot and unbalanced data rates), the service-ratio (Servicei) for each input port i comes from the data rate itself. Assuming that Vi is the data rate at input port i, Servicei = Vi / Min(V1, ..., VN), where N is the number of input ports. For example, if the data rates of four input ports are 50, 100, 150 and 200 KBit/sec, their service-ratios will be 1, 2, 3 and 4 respectively.
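The service-ratio rule reduces to a one-liner; this small helper (the name is mine) reproduces the example's numbers, assuming each rate is an integer multiple of the minimum rate:

```python
def service_ratios(rates):
    """Service-ratio per port: its data rate divided by the minimum rate."""
    vmin = min(rates)
    return [v // vmin for v in rates]  # integer division; assumes multiples

print(service_ratios([50, 100, 150, 200]))  # -> [1, 2, 3, 4]
```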
III. HARDWARE IMPLEMENTATION
A. Design Architecture
The efficient Round-Robin algorithm proposed in Section II has been implemented in hardware. Since our goal is an optimal time-complexity, combinational circuits are applied as much as possible to reduce the processing time of the search and grant states of the binary-tree mechanism. Shift registers are used to maintain the pointer information.
For simplicity, an 8×8 buffered crossbar with four credit levels per internal crosspoint buffer (CB) has been used as the design case, where eight requests become the inputs of the scheduler. Figures 2 and 3 depict the design entity and a block diagram of the expected design. The circuit has three inputs and one output. The inputs are: 1) an 8-bit VOQ request (Req) vector containing the input requests; 2) a 10-bit vector array representing the service-ratios (S); 3) a 2-bit vector array called credit (CB) containing the level of each internal crosspoint buffer (00=full, 11=empty). The output of the circuit is an 8-bit vector containing the grant decision (GRANT).
B. Searching Mechanism Architecture
Fig. 4 shows the details of the Searching Mechanism block diagram with eight requests (Reqs), eight credit buffer arrays (CBs) and eight grants (GRANTs). In the CMP block, the 1-bit request
Fig. 4: Simple block diagram of the binary-tree searching mechanism for eight requests.
Fig. 5: Block diagram of a Leaf-Node based on two multiplexers and an L-Circuit module.
(Reqi) and the 2-bit CBi operate by this function:
Outi = 1 when (CBi > 00) and (Reqi = 1), else 0    (1)
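Equation (1) acts as a request qualifier: a request enters the tree only while its crosspoint buffer still has credit. A minimal model (the function name is mine):

```python
def cmp_out(cb: int, req: int) -> int:
    """Out_i of equation (1): forward Req_i only if CB_i > 0 (not '00')."""
    return 1 if cb > 0 and req == 1 else 0

print(cmp_out(3, 1), cmp_out(0, 1), cmp_out(2, 0))  # -> 1 0 0
```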
With the possible actions reported in table I, the notations right, left and "-" are mapped to the values 01, 10 and 00 respectively. Therefore, the combinational logic circuit of a Leaf-Node can be optimized by logic minimization techniques. The input and output relations of a Leaf-Node are specified by the boolean equations 2 to 5. Figures 5 and 6 show the block diagram of the Leaf-Node and its combinational logic circuit.
Gout0 = Gin Sel(0)    (2)
Gout1 = Gin Sel(1)    (3)
Sel(0) = Rin1 Pin0 Rin0 + Pin1 Pin0 (Rin1 + Rin0)    (4)
Sel(1) = Rin1 Pin0 + Rin0 (Pin1 Pin0 + Pin1 Rin1)    (5)
In the same way as for the Leaf-Node, the combinational logic circuit of the Root-Node can be specified by the following boolean functions, Equ. 6 and Equ. 7, and is illustrated in Fig. 7.
Gout0 = Pin1 Pin0 (Rin1 + Rin0) + Pin1 Rin1 Pin0 Rin0    (6)
Gout1 = Rin1 (Pin1 Pin0 Rin0 + Pin1 Pin0) + Pin1 Pin0 Rin0    (7)
C. Timing Diagram
Figure 8 illustrates the timing diagram of the efficient
Round-Robin scheduling algorithm architecture depicted in
Fig. 6: Combinational logic circuit of an L-Circuit module comprising 8 ANDs, 4 ORs and 4 NOTs.
Fig. 7: Combinational logic circuit of a root node comprising 6 ANDs, 4 ORs and 4 NOTs.
figure 3 under non-uniform traffic, where the 5th input port has a higher data rate. We assume that the packets within all crosspoint buffers can be selected in any time-slot; thus, all CBs equal 11 (3 decimal). The service-ratios of the input ports are 1, but the service-ratio of the 5th input port is 1111111111 (1023 decimal). The priority of the 1st input port is set to 1, and all the others are 0. After the 8-bit VOQ request, 1101 1001, has arrived, the binary tree searches and generates the GRANT corresponding to the priorities, service-ratios, CBs and VOQ requests. According to this figure, input ports 1 and 4 are granted for the next two clock cycles; afterwards the service is occupied by input port 5 for 1024 clock cycles, and then by input port 7 in the next clock cycle.
Fig. 8: Timing diagram of the efficient Round-Robin scheduler.
IV. PERFORMANCE AND COMPARISON
In this section, we present the synthesis results of the proposed Round-Robin structure based on the two most commonly used technologies: FPGA-based and silicon-based.
A. FPGA-based Technology
We implement SA according to the figures of [10] and synthesize it using the Xilinx ISE tool, targeting the Xilinx Virtex 5
device. Table III shows the synthesis results of the proposed structure and SA in terms of slices and critical path delay (ns) for N = 4, 8, 16, 32, 64, and 128.
TABLE III: Synthesis results in terms of slices and critical path delay (ns) of the efficient Round-Robin scheduler on a 5vlx330ff device.
Design    Report  N=4   N=8   N=16  N=32  N=64  N=128
Proposed  Slices  11    25    62    130   264   533
          ns      0.72  1.42  2.10  2.80  3.60  4.52
SA [10]   Slices  124   192   476   1137  8781  16527
          ns      4.19  6.45  6.94  7.33  15.7  22.23
As shown in table III, the proposed design, conforming to the binary-tree structure, was implemented with combinational circuits; therefore, the consumed slices are significantly lower than for SA. The critical path delay of the proposed design, optimized by the Xilinx ISE tool, varies from 0.72 to 4.52 ns for N = 4, 8, 16, 32, 64, and 128, because of the combinational circuits, which the synthesizer can simply map to logic elements.
B. Silicon-based Technology
The previous works ERR [10], PRRA [9], IPRRA [9],
PPE [10], PPA [10], and SA [10] had been analyzed and synthesized based on a 0.18 um standard-cell silicon technology under the same operating conditions and area optimization. For fairness, we also analyze and synthesize our design in the same environment.
Table IV shows the critical path delays (in nanoseconds) of these designs. Table V shows the area cost in numbers of two-input NAND gates for N = 4, 8, 16, 32, 64, and 128. Although the results depend on the standard-cell library used, they present the relative performance of these designs.
TABLE IV: Critical path delay (ns) of PPE, PPA, SA, PRRA, IPRRA, and the proposed design.
Design    N=4   N=8   N=16  N=32  N=64  N=128
PPE       1.67  2.73  3.8   5.07  6.31  7.2
PPA       1.7   2.53  3.66  4.54  5.67  6.54
SA        1.36  1.51  1.79  2.26  2.72  3.35
PRRA      1.47  2.52  3.58  4.63  5.68  6.74
IPRRA     1.29  1.89  2.68  3.68  4.56  5.01
Proposed  1.33  1.40  1.93  2.10  2.95  4.0
TABLE V: Area results of PPE, PPA, SA, PRRA, IPRRA, and the proposed design (number of NAND2 gates).
Design    N=4  N=8  N=16  N=32  N=64  N=128
PPE       53   150  349   812   1826  4010
PPA       63   143  313   644   1316  2649
SA        89   292  641   1318  2372  4780
PRRA      31   72   155   320   651   1312
IPRRA     31   82   173   356   723   1455
Proposed  46   112  255   576   867   1686
As shown in table IV, the critical path delays of SA and the proposed design grow with log4 N, while the critical path delays of PPE, PPA, PRRA and IPRRA grow with log2 N, which is consistent with the analysis of these designs. SA and the proposed design operate the fastest, with the shortest levels of basic components and combinational circuits synthesized by the Synopsys tool. However, the proposed design guarantees fairness with the service function under non-uniform traffic, while SA cannot. For comparison purposes, consider a buffered crossbar of size N = 128 and assume that the cell size is 64 bytes, where the line rate is determined by 64 × 8. The line rates that schedulers using SA and the proposed design can support are 15.2 Tbps and 12.8 Tbps, respectively.
The area results of all designs grow linearly with N, as shown in Table V. The proposed design consumes significantly fewer 2-NAND gates than PPE, PPA and SA, but more 2-NAND gates than PRRA and IPRRA over the whole range of N. Compared with its critical path delay advantage, the slightly larger area of the proposed design is negligible.
V. CONCLUSION
In this paper, we propose an efficient Round-Robin scheduling algorithm based on a binary-tree scheme where QoS is improved by applying a service policy. The proposed scheduling algorithm achieves a searching time-complexity of O(1) under non-uniform traffic. By using the service policy, 100% throughput can be attained, corresponding to an improvement of the scheduling performance. The design has been simulated on both FPGA-based (Virtex 5) and silicon-based (0.18 µm) technology. The synthesis results show that the consumed resources vary from 11 to 533 slices and from 46 to 1686 2-NAND gates for crossbars of size 4×4 to 128×128. Critical path delays from 0.72 to 4.52 ns for the FPGA-based and from 1.33 to 4.0 ns for the silicon-based design were obtained.
REFERENCES
[1] Y. Zheng, C. Shao, An Efficient Round-Robin Algorithm for Combined Input-Crosspoint Queued Switches, IEEE ICAS, 2005.
[2] T. Javadi, R. Magill, and T. Hrabik, A high-throughput algorithm for buffered crossbar switch fabric, in Proc. IEEE ICC, June 2001, pp. 1581-1591.
[3] M. Nabeshima, Performance evaluation of combined input- and crosspoint-queued switch, IEICE Trans. Commun., Vol. E83-B, no. 3, Mar. 2000.
[4] X. Zhang and L. N. Bhuyan, An efficient algorithm for combined input-crosspoint-queued (CICQ) switches, IEEE Globecom, 2004, pp. 1168-1173.
[5] R. Rojas-Cessa, E. Oki, Z. Jing and H. J. Chao, CIXB-1: Combined input one-cell-crosspoint buffered switch, Proc. 2001 IEEE WHPSR, pp. 324-329.
[6] J. Z. Luo, Y. Lee, J. Wu, DRR: A fast high-throughput scheduling algorithm for combined input-crosspoint-queued (CICQ) switches, IEEE MASCOTS, 2005, pp. 329-332.
[7] H. J. Chao, C. H. Lam, X. Guo, Fast ping-pong arbitration for input-output queued packet switches, International Journal of Communication Systems, 2001, pp. 663-678.
[8] H. J. Chao, C. H. Lam, X. Guo, Fast fair arbitration design in packet switches, IEEE, 2005, pp. 472-476.
[9] S. Q. Zheng, M. Yang, Algorithm-Hardware Codesign of Fast Parallel Round-Robin Arbiters, IEEE Transactions on Parallel and Distributed Systems, 2007, pp. 84-94.
[10] P. Gupta, N. McKeown, Designing and Implementing a Fast Crossbar Scheduler, IEEE Micro, vol. 19, no. 1, 1999, pp. 20-29.
Author Index
Abid, Mohamed, 149; Aguiar, Alexandra, 113; Alkhayat, Rachid, 79
Ammar, Manel, 149; Amory, Alexandre, 164; Azevedo, Rodolfo, 99
Baghdadi, Amer, 79; Baklouti, Mouna, 149; Balasubramanian, Daniel, 121
Baldassin, Alexandro, 99; Barreteau, Anthony, 156; Becker, Juergen, 135
Belaid, Ikbel, 179; Benjemaa, Maher, 179; Beyrouthy, Taha, 59
Bhattacharyya, Shuvra, 67; Bobda, Christophe, 16; Bochem, Alexander, 9, 30
Bois, Guy, 92; Boland, Jean-François, 92; Brehm, Christian, 74
Calazans, Ney, 193; Callanan, Owen, 45; Carmel-Veilleux, Tennessee, 92
Castelfranco, Antonino, 45; Centoducatte, Paulo, 99; Champagne, David, 38
Chan, King, 38; Chen, Hui, 171; Chen, Yu-Yuan, 38
Cheung, Ray, 38; Chowdhury, Sazzadur, 2; Clancy, Charles, 67
Cox, Charles, 45; Crawford, Catherine, 45; Dekeyser, Jean-Luc, 149
Deschenes, Justin, 30; Eoin, Creedon, 45; Fesquet, Laurent, 59
Fresse, Virginie, 186; Gladigau, Jens, 128; Glesner, Manfred, 199, 85
Godet-Bar, Guillaume, 171; Gohring De Magalhaes, Felipe, 113; Großhans, Michael, 16
Gu, Zonghua, 23; Haubelt, Christian, 128; Hedde, Damien, 106
Heinz, Matthias, 53; Herpers, Rainer, 9; Hessel, Fabiano, 113
Hillenbrand, Martin, 53; Jezequel, Michel, 79; Karsai, Gabor, 121
Kent, Kenneth, 9, 30; Klein, Felipe, 99; Klindworth, Kai, 53
Koellner, Christian, 135; Kutzer, Philipp, 128; Kuykendall, John, 67
Lal, Sundeep, 2; Le Nours, Sebastien, 156; Lee, Ruby, 38
Lekuch, Scott, 45; Li, Will, 38; Losier, Yves, 30
Lowry, Michael, 121; Lubaszewski, Marcelo, 164; Marcon, Cesar, 193
Marcon, César, 164; Marquet, Philippe, 149; Mendoza, Francisco, 135
Moraes, Fernando, 164, 193; Moreira, João, 99; Moreno, Edson, 193
Muller, Fabrice, 179; Muller, Kay, 45; Murugappa, Purushotham, 79
Muscedere, Roberto, 2; Mühlbauer, Felix, 16; Müller-Glaser, Klaus D., 53, 135, 142
Nine, Harmon, 121; Nutter, Mark, 45; Pap, Gabor, 121
Pasareanu, Corina, 121; Pasquier, Olivier, 156; Penner, Hartmut, 45
Philipp, François, 85; Plishker, William, 67; Pongyupinpanich, Surapong, 199
Pressburger, Tom, 121; Purcell, Brian, 45; Purcell, Mark, 45
Pétrot, Frédéric, 106, 171; Rigo, Sandro, 99; Rousseau, Frédéric, 171, 186
Samman, Faizal, 85; Schwalb, Tobias, 142; Szefer, Jakub, 38
Tan, Junyan, 186; Teich, Jürgen, 128; Wehn, Norbert, 74
Williams, Jeremy, 30; Xenidis, Jimi, 45; Zaki, George, 67
Zhang, Ming, 23; Zhang, Wei, 38