VIRTUTECH WHITE PAPER SIMICS AND MULTICORE SYSTEMS DEVELOPMENT MARCH 2009 JAKOB ENGBLOM WWW.VIRTUTECH.COM

SIMICS AND MULTICORE SYSTEMS DEVELOPMENT

INTRODUCTION

Virtualized systems development is a development methodology where the actual hardware of a system is augmented by a virtual platform, a simulation model of the hardware running on a workstation or PC. The virtual platform can run the same binary software as the physical hardware, fast enough to be used as an alternative to, and an augmentation of, physical hardware for software development.

The hardware shift to multicore processors and multiprocessor systems calls for new software and systems development tools to help developers transform their code into parallel applications and have systems take full advantage of multicore processors and distributed systems. Performance increases for an application or system application set will only come from taking advantage of multicore processors and parallel processing, and there is no more “free lunch” from waiting for faster single-core processors to appear. Thus, most systems developers will at some point have to confront the issue of how to create software and architect systems that can use parallel hardware efficiently.

Traditional debugging techniques and debugging tools do not work very well on an inherently nondeterministic system such as a hardware multiprocessor or multicore processor. Virtual platforms can reintroduce control and repeatability in the software debug and systems analysis process, by virtue of providing a level of indirection between hardware and software.

PAGE 2 / http://www.virtutech.com Virtutech 2009

A virtual platform also provides benefits like freedom from physical constraints, arbitrary configurability, checkpointing and restart at any point, superior convenience and stability, access to the target long before prototype hardware, and the ability to test faults and boundary cases with complete control and precision.

Current and future systems will look much like the template shown in Figure 1: software stacks running on small (or large) shared-memory nodes, communicating with other nodes over various types of interconnects, and building a system-level software abstraction on top of the multicore and distributed hardware.

Figure 1. Multicore Systems Template

[Figure: two multicore nodes, each containing several CPUs with private L1 caches, a shared L2 cache, RAM, and devices (network, timer, serial, etc.) in one shared memory space. The nodes are connected by a network (Ethernet, PCIe, RapidIO, etc.) with local memory in each node. Each node runs its own software stack, with system-level software abstractions layered on top.]

Note that a software structure like that shown in Figure 1 can also be used within a single multicore hardware device. The trend towards virtualization and hypervisors in multicore hardware makes it possible to have several isolated groups of cores each run their own shared-memory abstraction, looking at other groups of cores as remote network nodes, even if they physically exist within the same silicon package.

THE SOFTWARE PROBLEM

Doing multicore hardware right is not easy, but it is certainly easier than doing multicore software right. The move to multicore hardware that began in earnest in 2004 finally pushed parallel software into the mainstream, and the software development world is still catching up. There are three main problems:

• Ensuring existing software keeps working (without taking advantage of multicore)

• Parallelizing existing software to get the performance and power consumption benefits of multicore parallel execution

• Creating new software that is parallel from the beginning

The Virtual Platform Solution

Virtualized systems development offers an approach to creating and porting multicore applications that has many compelling advantages compared with simply attempting to use a multicore chip directly for debugging and testing.

Virtual platforms in Virtutech Simics provide several crucial features for testing and debugging parallel code, and especially multicore implementations of parallel systems:

• Repeatability. Rerunning a test case in Simics will give precisely the same execution, even when simulating with multiple processor cores and multiple processes on each core. Thus, the worst problem in multiprocessor debugging is removed: the difficulty of reproducing failing conditions. It makes debugging a multiprocessor as easy as debugging a single program on a single processor. If an error is found in the simulator, it can be reproduced many times over.

• Multicore debugging. Simics can attach a debugger to each processor core in a multiprocessor machine simultaneously. There is no limitation to the access to the state of the machine due to packaging.

• OS-aware debugging. Simics provides the ability (with the proper OS awareness package) to debug a single process running on a multiprocessor machine, using GDB or other debuggers. Any number of such process-specific debuggers can be set up. Furthermore, this makes it possible to assess the time spent inside an OS kernel as opposed to time spent in user code.

• Global stop and step. When single-stepping code on a specific processor in a multiprocessor with Simics, everything else in the system also proceeds step-wise. This is not possible on real hardware where the best that can be provided is limited-skid breakpoints on a single multicore chip.

• Test scale-out. With Simics, there is no limit to the number of parallel cores. It is therefore possible to test what happens when multicore goes from two to four to eight and on, even if only two-way processors are available in hardware today. Remember, future performance increases will largely arrive as an increase in the number of cores rather than an increase in the clock rate.

• Reversibility. The reversible execution and debugging features of Simics work with multiprocessors and multicore processors. When Simics reverses the system, the entire system, including all the cores, is reversed. This provides a powerful capability for tracking down hard-to-find bugs such as lock conflicts, deadlock and priority starvation, which will all be more common in a multicore environment.

• Efficient error provocation. Simics provides several mechanisms to help you provoke errors in multiprocessor systems. First, you can vary the speed of processors in the system, which is a good way to stress parallel cores. Second, you can make data writes take very long to propagate between processors. This has the nice property of efficiently provoking errors in parallel codes, much more efficiently than on real hardware. Third, you can script and inject targeted system behavior at any point in time. Note that real hardware will sometimes provoke errors by virtue of its inherent randomness, but that same randomness prevents replication and thus diagnosis of the problem.

In the following, we will describe these abilities with some concrete examples.

EXPOSING BUGS IN SOFTWARE

A truly concurrent multiprocessor environment presents a number of new types of potential software errors, as well as making classic concurrency problems worse.

In particular, existing software that works just fine on a single-core multi-tasking setup often breaks on multicore devices. Figure 2 shows a simple example of how moving from a single-core to a multicore device makes a system much more likely to exhibit a parallel programming problem. We test the same program across a range of processor core speeds, in both a single-core and a dual-core configuration.

Figure 2. Single-Core vs Dual-Core Execution Example

[Chart: percentage of runs triggering the race (0% to 100%) plotted against clock frequency (MHz, from 1 to 10000), with one series for the single-core (1 CPU) configuration and one for the dual-core configuration.]

The program is multithreaded and has a race condition. Note that on a single-core (1 CPU) processor, this problem only manifests itself occasionally, and for high clock frequencies, very rarely (see footnote 1). On the other hand, when using a dual-core setup, every single execution triggers the bug. If this bug had been less aggressive, it would more likely have manifested itself once a year on a single-core device, making it just a rare glitch. Once the system had moved to multicore, it would have manifested itself once every week, making it a real problem. With a virtual platform, such issues can be smoked out early, by testing software over a range of configurations, including those not yet available in hardware (see Scalability and Configuration Testing below).
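The single-core behavior can be illustrated with a toy interleaving model. This is a minimal sketch and not the program measured in Figure 2: two simulated threads increment a shared counter with an unprotected load followed by a store, and a round-robin scheduler on one core switches threads every `slice_len` steps. The names `run_once`, `race_fraction`, `slice_len`, `offset`, and `work` are hypothetical and chosen only for illustration.

```python
def run_once(slice_len, offset, iterations=50, work=7):
    """Run two threads on one simulated core and return the final counter.

    Each thread repeats: `work` private steps, one load of the shared
    counter, one store of the loaded value plus one. The first time slice
    lasts `offset` steps, every later slice lasts `slice_len` steps.
    """
    if slice_len < 1:
        raise ValueError("slice_len must be at least 1")
    counter = 0
    pc = [0, 0]     # position of each thread inside its loop body
    reg = [0, 0]    # private register holding the loaded counter value
    done = [0, 0]   # completed increments per thread
    current = 0
    budget = offset
    while done[0] < iterations or done[1] < iterations:
        if done[current] >= iterations or budget == 0:
            current = 1 - current   # thread switch
            budget = slice_len
            continue
        if pc[current] < work:
            pc[current] += 1                # private work
        elif pc[current] == work:
            reg[current] = counter          # load shared counter
            pc[current] += 1
        else:
            counter = reg[current] + 1      # store, possibly using a stale value
            pc[current] = 0
            done[current] += 1
        budget -= 1
    return counter


def race_fraction(slice_len, iterations=50, work=7):
    """Fraction of starting offsets for which at least one update is lost."""
    expected = 2 * iterations
    offsets = range(1, slice_len + 1)
    lost = sum(1 for off in offsets
               if run_once(slice_len, off, iterations, work) != expected)
    return lost / len(offsets)
```

A run whose final counter is lower than the total number of increments has lost an update. Sweeping `slice_len` plays the role of changing the clock frequency in the footnote's argument: a longer slice means fewer thread switches, and therefore fewer chances for a switch to land between the load and the store.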

DEBUGGING PARALLEL SOFTWARE

A primary benefit of a virtual platform is that it provides superior debug and analysis features compared to the physical hardware. Anyone who has ever developed code for an embedded board will appreciate the convenience of a virtual environment. You get a system that is not randomly flaky, better control over the target, faster communication, and conveniences like unlimited numbers of breakpoints. If the target freezes completely, you can stop it and check what happened. You can change system parameters like core counts, clock speeds and memory size, as well as network setups, with complete freedom and ease.

To give you an idea of how this works in practice, here is a real-world example of debugging a multiprocessor system with Virtutech Simics:

A virtual platform based on Simics was used to port a popular real-time operating system to the Freescale MPC8641D multicore processor. In one test, the clock frequency of the target system was changed from 800 MHz to 833 MHz, and suddenly the system froze early in the boot process. The system was completely unresponsive, with no input or output.

1 This is expected, since at a high clock frequency the operating system will execute more instructions in a program between each process/thread switch. Thus, the chance of being in a critical section exactly when a process switch occurs is lower, and the program has longer sequential uninterrupted execution slices for each thread.

A preliminary investigation revealed that the problem only occurred between 829.9 and 833.3 MHz. Thus, it had not been seen before since the clock frequency used to be 800 MHz.

Thanks to the repeatability of a virtual platform, the bug was trivial to reproduce. Each time the virtual platform was booted with one of the bad clock frequencies, the same crash happened at the same time. Unlike the real world where all we would have had was an unresponsive brick, the virtual platform made it possible to examine the state of the processor, memory, and software at the point where it froze.

To home in on the problem, we used reverse execution and interrupt tracing on the serial port, the interrupt controller, and the processor cores. With this, we could pin down the exact cycle where the problem occurred, and the sequence of events leading up to it. We took stack backtraces at the critical point to determine the locations in the operating system where the freeze occurred.

In the end, it was determined that the problem was that an interrupt service routine was attempting to lock a kernel spinlock, before re-enabling interrupts. In the case that froze, the lock was already taken when the service routine was entered, and with no interrupts enabled there was no way for any other code to run to release it.
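The failure mode described above can be sketched with a very small model. This is an illustration under stated assumptions and not the operating system's actual code: `run_isr`, `HOLDER_WORK`, and `max_cycles` are hypothetical names, and the model simply assumes that the code holding the lock can only make progress while interrupts are enabled.

```python
def run_isr(lock_taken_on_entry, enable_interrupts_first, max_cycles=1000):
    """Model an interrupt service routine (ISR) spinning on a kernel spinlock.

    Returns ("acquired", cycle) if the lock is obtained, or
    ("frozen", max_cycles) if the spin never ends within the cycle budget.
    """
    HOLDER_WORK = 3  # assumed cycles the lock holder needs before releasing
    # Interrupts are masked on ISR entry. The corrected variant re-enables
    # them before spinning; the buggy variant spins with them still masked.
    interrupts_enabled = enable_interrupts_first
    lock_taken = lock_taken_on_entry
    holder_remaining = HOLDER_WORK
    for cycle in range(max_cycles):
        if not lock_taken:
            return ("acquired", cycle)
        # Spinning. Assumption: the holder only runs when interrupts are
        # enabled, so that other code can be scheduled to release the lock.
        if interrupts_enabled:
            holder_remaining -= 1
            if holder_remaining == 0:
                lock_taken = False
    return ("frozen", max_cycles)
```

In this model the buggy ordering is harmless whenever the lock happens to be free on entry, which is why such a bug can stay hidden until a timing change makes the ISR arrive while the lock is held.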

The bug was only found because the virtual platform ran the complete real software stack, including interrupt handlers and hardware drivers. It was triggered by changing the system configuration, demonstrating the value of configurability of a virtual platform, and the power of configuration changes to provoke errors in complex code. Thanks to repeatability, bug reproduction was trivial; in a physical system, it would only have happened occasionally, if ever again. The ability to trace and inspect any part of the state was crucial in understanding what happened and in which order. With reverse execution, we simply backed across the freeze to inspect the path the system took to get there.

Figure 3. Repeatability and Reverse Execution

[Figure panels: "On hardware, only some runs reproduce an error" and "On a virtual platform, errors are repeatable, and reverse execution offers a powerful way to diagnose and fix them."]

A key benefit of a virtual platform is repeatability and reversibility, as illustrated in Figure 3. The key problem in finding and fixing software bugs in parallel software is the lack of determinism in the execution of the software system. Every run of a program will exhibit a different order of events in the program, and even very small changes to the system state or timing result in very different program execution (as illustrated on the left in Figure 3). This complicates debugging greatly, as the very act of debugging a parallel program will make timing-sensitive bugs such as race conditions disappear or appear in a different place.

A virtual platform provides determinism and repeatability. The simulator has explicit control over the execution of instructions and propagation of information between processors, and can thus impose a repeatable behavior on the software running on a multicore processor. Note that this property, usually known as determinism, does not mean that the behavior of a software program is always identical. It just means that when running the same software from the same initial state with the exact same sequence of asynchronous inputs, the same execution sequence is seen.
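This notion of determinism can be sketched as a tiny discrete-event simulation. The code below is only an illustration and does not describe how Simics is implemented: the function `simulate`, the per-core `periods`, and the `inputs` list are hypothetical. The point is that the trace is a pure function of the initial configuration and the timed asynchronous inputs.

```python
import heapq


def simulate(periods, inputs, steps=6):
    """Return the event trace of a toy multicore system.

    periods: clock period of each core, in abstract time units.
    inputs:  list of (time, value) asynchronous external inputs.
    The trace depends only on these arguments, never on host timing.
    """
    shared = 0
    trace = []
    # Events are (time, kind, ident); kind 0 is a core step, kind 1 an input.
    queue = [(period, 0, core) for core, period in enumerate(periods)]
    queue += [(time, 1, value) for time, value in inputs]
    heapq.heapify(queue)
    executed = 0
    while queue and executed < steps:
        time, kind, ident = heapq.heappop(queue)
        if kind == 1:
            shared += ident                     # external input arrives
            trace.append((time, "input", shared))
        else:
            shared = shared * 2 + ident         # core updates shared state
            trace.append((time, f"cpu{ident}", shared))
            heapq.heappush(queue, (time + periods[ident], 0, ident))
            executed += 1
    return trace


first = simulate([3, 5], [(4, 10)])
again = simulate([3, 5], [(4, 10)])
changed = simulate([3, 4], [(4, 10)])
print(first == again)    # same initial state and inputs
print(first == changed)  # one core's clock period changed
```

Running the same configuration twice gives identical traces, while changing a single clock period reorders the events and produces a different trace.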

If anything is changed, a different behavior is seen. Figure 4 shows an example of this, where the same intentionally buggy program is run several times on two different simulated multicore machines. Each run gets a different result since they are run from different initial states. But the simulator can go back and reproduce each run precisely, which is not possible on physical hardware.

Figure 4. Repeatability and Multicore

Another benefit of a virtual platform for multicore debugging is that the simulator can stop the execution of the entire system at any point in time. This means that it is possible to single-step code where processors communicate with each other without changing the behavior of the code, and that code running on other processors cannot swamp a stopped processor with data to process.

It is also possible to build extra statistics into the virtual platform, to assess fairness and accessibility and to catch starvation of different parts of the system.

SCALABILITY AND CONFIGURATION TESTING

Virtual platforms are also eminently suited to testing the scalability of systems and software designs. On physical hardware, you are necessarily limited to the core counts and configurations shipping today. On a virtual platform, it is easy to do what-if analysis and test software on arbitrarily wide systems, for example to check whether and how a software stack created for a dual-core platform scales up (or even keeps working) as the system moves to triple-core, quad-core, dozen-core, hundred-core, and beyond in future hardware generations.

It is also possible to insert delays into the system and model contention in critical places, to see whether the combined hardware-software system is sensitive to certain system configurations or setups.

Figure 5. Scaling Beyond Physical Hardware

In Figure 5, there is an example of scaling beyond the physical limits of current hardware. Here, we have three machines based on the physically dual-core Freescale MPC8641D SoC, where two of the machines have configurations with three and eight cores, respectively. Note how the software stack was not quite designed for this, since the /proc/interrupts output for the eight-core machine does not look very pretty. This is an admittedly trivial example of how a virtual machine can reveal software issues by scaling systems beyond what is possible with physical hardware.

Figure 6 shows a different use of a virtual platform, this time to assess the scalability of parallel code. Here, we have fed a series of packets to be processed through a parallel set of worker nodes, and measured the end-to-end execution time on the virtual platform. The hardware has been configured with one to nine worker nodes, and the software configured for each hardware configuration. Additionally, we add a simple contention and delay model to the shared global memory used for communication between the cores, and set its delays to 100, 200, and 500 cycles. The purpose is not to model the precise performance of the software on any particular target platform, but to assess how well it handles increased communications latencies. There are also two different communications variants used in the software: single-packet and quad-packet transfers to the worker nodes.

Figure 6. Scalability Testing Example

This simple experiment tells us that the software should use quad-packet transfers to ensure scalability, since even with perfect memory the performance of single-packet transfers starts to degrade at seven worker nodes. It also tells us that with slow memory and single-packet transfers, adding worker nodes beyond five or six has no real benefit. Thus, we can tell that even a program that is computationally “conveniently concurrent” can stop scaling as communications overhead enters the picture. Thanks to Simics, we can investigate this long before the hardware becomes available, and plan and adjust the systems design accordingly.
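The reason communication overhead caps scaling can be seen in a back-of-the-envelope bound. The sketch below is not the model or the data behind Figure 6: the function `end_to_end_cycles`, its parameters, and all numbers used with it are invented for illustration. It assumes that transfers over the shared memory are serialized while packet processing is spread evenly over the workers.

```python
import math


def end_to_end_cycles(workers, packets, compute_cycles, mem_delay, batch):
    """Lower bound on run time for a dispatcher feeding parallel workers.

    batch is the number of packets moved per transfer (1 for single-packet,
    4 for quad-packet transfers). All values are illustrative assumptions.
    """
    transfers = math.ceil(packets / batch)
    communication = transfers * mem_delay       # serialized on shared memory
    computation = math.ceil(packets / workers) * compute_cycles
    return max(communication, computation)


# Hypothetical workload: 64 packets, 1000 compute cycles each, 200-cycle memory.
for workers in (1, 2, 4, 8):
    single = end_to_end_cycles(workers, 64, 1000, 200, batch=1)
    quad = end_to_end_cycles(workers, 64, 1000, 200, batch=4)
    print(workers, single, quad)
```

Once the communication term dominates, adding workers no longer reduces the bound, and batching several packets per transfer moves the point at which that happens to a higher worker count.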

SYSTEM EXECUTION INSIGHT

A virtual platform can inspect and trace the execution of a software stack with no intrusion effect at all. Not having to instrument the program code or run a heavy-duty profiler on the target means that, just as with debugging on a virtual platform, there are no probe effects from profiling and tracing. Thus, if a software load is behaving strangely, we can apply checkpointing and reverse execution to get back to where it started, and profile or trace the exact execution that was causing problems. Another advantage of a virtual platform is that we have perfect synchronization between the cores, and all traces can be time-stamped without worrying about hardware jitter or clocks being out of sync.

Tracing can be applied at a number of levels, from hardware-level tracing of memory operations to determine data accesses and possible races, to profiling at the operating-system thread level, to profiling the execution of a multiple-board multi-processor distributed system to determine load balance.

Figure 7. What Runs Where Profiling Example

Figure 7 shows an example of a profiling run. We are comparing two different ways to parallelize a packet-processing application running on a Linux-based quad-core platform. We start the measurements from a saved checkpoint, to ensure that the system state is the same when measurements begin, and apply Simics Linux OS awareness to determine which threads are running on which cores.

[Figure 7 panels: "Four identical worker threads", showing execution for tid 50 through tid 54 and the total, broken down by core (cpu0 to cpu3); and "Pipelined Asymmetric Solution", showing execution for tid 50 through tid 53 and the total, broken down by core.]

In the program under test, there is one thread that starts off the worker threads (the lowest-numbered thread, tid 50), and then either four symmetric threads (each processing one packet at a time through three steps) or three pipelined threads (each implementing one of the three processing steps, and handling all packets in sequence).

In the pipelined setup, we have observed massive packet losses, and this profile clearly shows why. The pipelined setup achieves no parallelism at all, and the OS mostly runs it on a single core for the duration of the execution. In contrast, there is ample parallelism in the symmetric setup, and the OS spreads out the execution on all four cores.
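A "what runs where" profile of this kind is essentially an aggregation over trace records. The sketch below assumes a hypothetical trace format of `(cpu, tid, cycles)` tuples; it is not the output format of Simics OS awareness, and the sample records are made up purely to show the shape of the computation.

```python
from collections import defaultdict


def profile(trace):
    """Sum cycles per thread and per core from (cpu, tid, cycles) records."""
    table = defaultdict(lambda: defaultdict(int))
    for cpu, tid, cycles in trace:
        table[tid][cpu] += cycles
    return {tid: dict(per_cpu) for tid, per_cpu in table.items()}


def cpus_used(table):
    """Set of cores on which any profiled thread executed."""
    return {cpu for per_cpu in table.values() for cpu in per_cpu}


# Made-up records: symmetric workers spread out, pipelined stages stacked.
symmetric = [("cpu0", 51, 100), ("cpu1", 52, 100), ("cpu2", 53, 100),
             ("cpu3", 54, 100), ("cpu0", 51, 50)]
pipelined = [("cpu0", 51, 100), ("cpu0", 52, 100), ("cpu0", 53, 100)]

print(len(cpus_used(profile(symmetric))))
print(len(cpus_used(profile(pipelined))))
```

Counting the cores that actually ran worker threads is a quick first check for the lack of parallelism seen in the pipelined case.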

VIRTUAL PLATFORMS FOR DETAILED HARDWARE DESIGN

Virtual platforms and the simulation of key components are important tools for computer hardware designers, both for processor cores and for entire System-on-Chip (SoC) designs. Here, we move away from the exclusive use of fast loosely-timed models, and go into the clock-cycle (CC) level of detail.

In processor design, detailed simulation models of the pipeline and memory system of a processor core have been the mainstay architecture tool for many years. Every microarchitectural idea is first evaluated with the help of a detailed simulator before being used in an actual design. When creating new cores or new variants of cores, the design teams use their own detailed simulators to assess performance, power consumption, and other metrics.

Moving from processor cores to complete SoC designs, virtual platforms are used to evaluate interconnect architectures, required bus widths, and other performance factors. They are also used to verify that the design works as intended when isolated devices and processors are combined to form a whole system. For multicore designs in particular, you have to validate cache-coherency protocols and check that systems scale to the number of cores desired.

Detailed Timing Models for Software Developers

Usually, computer architecture simulators are internal engineering tools. However, for targets such as the Freescale QorIQ P4080, the detailed performance models will be made available to end users integrated in the fast Virtutech Simics model of the P4080. See the Virtutech website for more information.

The fast virtual platform will be used to boot operating systems and position workloads at interesting locations, and then the simulation will switch over to the performance models to allow detailed studies of system performance. In essence, the resulting combined solution lets users zoom in on performance details when and where they need to, without compromising on the ability to run large workloads. Figure 8 shows the work style and an example of a mixed detailed and fast virtual platform.

Figure 8. Zoom in with Detailed Models

[Figure: a timeline for a virtual board on which functional simulation drops into detailed simulation at interesting points, and a virtual model of a new SoC (CPUs, pattern matching, crypto, TCP offload, buffer memories, flow control, Ethernet ports, timer, interrupt controller, memory controller, UART, RT clock, RAM, and flash) in which a packet processing pipeline and the processor cores involved are modeled in detail while the rest of the virtual platform is functional.]

Since detailed CC-level models obviously provide more information about the target than fast loosely timed models as usually employed with Simics, one might ask why they are not the norm. There are three main reasons:

• CC-level models run many orders of magnitude slower than fast models.

• It takes much longer to build a CC-level model than a fast model.

• CC-level models by necessity encode only a small part of the possible space of configurations of a system. A fast functional model is much easier to vary and test configuration parameters in, since the parameters are expressed in a much more abstract way.

Execution speed is the most important reason. With multiple processors running at gigahertz level and above, a few seconds of software execution on a multicore device quickly requires tens of billions of instructions to be simulated in the virtual platform. A virtual platform operating at a few million instructions per second, or the few hundred thousand typically seen in CC-level simulations, will be too slow to run interesting workloads and provide insight across the whole execution.

Just booting the example shown in Figure 5 takes some 100 billion instructions. The zoom paradigm discussed above is a proven way to work with detailed models. The fast models are a necessary enabler, along with checkpointing to store the state of a booted or configured system to disk and bring it up instantly later (there is no point in repeating those 100 billion instructions more than once as long as nothing has changed in the hardware and operating system setup).
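The arithmetic behind this argument is simple to work through. In the sketch below, the 100 billion instruction boot comes from the text, while the two simulation rates are assumed values picked to stand for "a few million" and "a few hundred thousand" instructions per second; the function name `simulation_seconds` is hypothetical.

```python
def simulation_seconds(instructions, instructions_per_second):
    """Host time needed to simulate a given number of target instructions."""
    if instructions_per_second <= 0:
        raise ValueError("simulation rate must be positive")
    return instructions / instructions_per_second


BOOT_INSTRUCTIONS = 100e9  # "some 100 billion instructions" for the boot

# Assumed rates, chosen only to illustrate the orders of magnitude involved.
for label, rate in [("3 million instructions/s", 3e6),
                    ("300 thousand instructions/s", 3e5)]:
    hours = simulation_seconds(BOOT_INSTRUCTIONS, rate) / 3600
    print(f"{label}: {hours:.1f} hours")
```

Under these assumed rates the boot alone takes roughly 9 hours at the higher rate and roughly 93 hours at the lower one, which is why a fast model is used to reach the interesting point and a checkpoint is used to avoid repeating the boot.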

Thus, for modern multicore chips, we need virtual platforms to be as fast as possible in order to run the complete software in reasonable time. Reducing the detail level in order to do this is a good trade-off. It is better to cover the whole problem in some detail, than a tiny part of the problem in great detail. The vast majority of software development can be performed on a fast virtual platform with loose timing.

SUMMARY

Multicore hardware is here today and here to stay, and building reliable and high-performance systems out of multicore processors and parallel software stacks is a tough problem for systems developers across the globe.

Virtualized systems development offers a key tool to help make this simpler, faster, and less risky. Using virtual platforms to augment physical hardware, more information and insight into the system workings can be obtained, debugging made faster, and system and software architectures explored.

CONTACT INFORMATION

http://www.virtutech.com

© Copyright 2009 Virtutech, Inc. All Rights Reserved.

TRADEMARKS. Virtutech, Simics, Hindsight, and the logos thereof, are trademarks or registered trademarks of Virtutech, Inc. and/or its subsidiaries, in the United States and/or other countries.

THIS PUBLICATION IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT.

THIS PUBLICATION COULD INCLUDE TECHNICAL INACCURACIES OR TYPOGRAPHICAL ERRORS. CHANGES ARE PERIODICALLY ADDED TO THE INFORMATION HEREIN; THESE CHANGES WILL BE INCORPORATED IN NEW EDITIONS OF THE PUBLICATION. VIRTUTECH MAY MAKE IMPROVEMENTS AND/OR CHANGES IN THE PRODUCT(S) AND/OR THE PROGRAM(S) DESCRIBED IN THIS PUBLICATION AT ANY TIME.

Virtutech, Inc., 2001 Gateway Place, Suite 201E, San Jose, CA 95110, USA