
EVALUATION OF LOW POWER DESIGN TECHNIQUES ON AN MPEG2 VIDEO DECODER PLATFORM

Yang Yun Ju
University of Campinas - Unicamp
Av. Albert Einstein, 1251
Campinas, SP, Brazil
[email protected]

Guido Araujo
University of Campinas - Unicamp
Av. Albert Einstein, 1251
Campinas, SP, Brazil
[email protected]

ABSTRACT

In this paper, we discuss the design and evaluation of low-power optimization techniques on an industrial-strength MPEG-2 SoC platform. The design was performed using a TSMC90/LP (90nm) process and an ASIC Cadence tool-chain. Several power reduction techniques (clock gating, multi-threshold voltage, operand isolation and multiple supply voltage) were applied to evaluate their power reduction efficiency and their impact on the final chip area, performance and testing. An MPEG-2 video decoder application was used as a benchmark to generate stimulus vectors and switching activity. The combination of all techniques resulted in a 45% power reduction, with a 14% silicon area impact and no performance impact. Details of the methodology and relevant design issues are provided to help designers pursue their own power optimization strategies.

Keywords: low power, power reduction, SoC, Leon3, ASIC, MPEG-2 video decoder, TSMC90LP.

1. INTRODUCTION

With the continuous growth in the demand for power-efficient solutions, industry has been increasingly applying low-power optimization techniques at the device, circuit and system levels.

In the context of the design flow, each power reduction technique offers different capabilities to reduce power consumption at different levels of abstraction. Each low-power technique is associated with a specific power intent and has some impact on the design, which needs to be carefully evaluated.

Unfortunately, most of the interesting information in this area remains covered by confidentiality agreements or industrial protection. Moreover, low-power design flows and methodologies are company specific, and such expertise is typically not available to the whole design community.

Using an open-source industrial-strength SoC platform (Leon3 SoC) [1], we performed the logic and physical design on a TSMC90G/LP process and created a low-power design flow to optimize and estimate the power consumption of an MPEG-2 video decoder. The low-power techniques used in the design flow are clock gating, operand isolation, multiple threshold voltage and multiple supply voltage.

This paper reports the implementation of these low-power techniques and their corresponding impact in terms of area, gate count, power consumption, clock speed and testing.

A detailed description of the design flow can be found in [2], which is provided to enable designers to reuse the methodology in their own design flows.

This paper is organized as follows. Section 2 describes the SoC platform used in this work. Section 3 gives an overview of the MPEG2 video decoder standard. Section 4 lists the selected power optimization techniques. Finally, Section 5 describes the design results and Section 6 concludes the work.

Figure 1: Leon3 Platform Architecture Configuration

2. LEON3 SOC PLATFORM

The Leon3 platform is an industrial-strength synthesizable SoC platform from Gaisler Research [1]. The central processing unit is a 32-bit processor compliant with the Sparc V8 architecture. Since it is an open-source distribution, we were free to customize the implementation and build an architecture more suitable to our needs. Figure 1 shows how the Leon3 platform was customized to be used as an MPEG2 decoder SoC.

The Sparc V8 core is composed of a 7-stage pipeline unit, 8-register window, a hardware multiplier, one divider and a MAC unit. It also has 8 Kbits instruction and data caches. The AMBA 2.0 AHB on-chip bus connects the SDRAM controller to the SparcV8 core. The low-speed APB bus connects the SRAM controller, 4 on-chip timers, a generic I/O controller, a UART interface and an IRQ controller. Using the Leon3 and the ASIC Cadence tool-chain, a multimedia SoC platform capable of running at a 300MHz clock rate was synthesized (Figure 3).

Figure 2 : MPEG-2 Standard Encoding/Decoding Flow

3. THE MPEG2 VIDEO DECODER

The MPEG-2 standard [3] extends the basic MPEG system and provides compression support for TV-quality digital video transmission. It defines a series of standards for video compression algorithms, which exploit redundancy in the video stream.

Temporal redundancy arises when successive frames of the video display images of the same scene. In this (very frequent) case, the contents of the scene remain fixed, or change only slightly, from one frame to the next. Spatial redundancy occurs because parts of the picture are often replicated (with minor changes) within a single frame of video.

As shown in Figure 2, the input video stream is captured in the form of an analog signal. This is followed by the discrete cosine transform and the redundancy elimination algorithms. The redundancy-reduced stream is then quantized and encoded into a new compressed video format using Huffman encoding.

The decoding process can be thought of as the reverse of the encoding. The first stage reconstructs the data from the Huffman encoding. Then, the motion vectors are parsed from the data stream and fed into the motion compensator. The quantized DCT coefficients are also extracted from the data stream and passed to the inverse quantizer, which performs the de-quantization. After that, the IDCT (Inverse Discrete Cosine Transform) transforms the de-quantized data back into the spatial domain, thus reconstructing the video.
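The last two decoding stages can be illustrated with a minimal sketch. It assumes a single uniform quantization step (the `step` parameter is hypothetical; MPEG-2 itself uses per-coefficient weight matrices and a quantiser scale) and a direct, unoptimized 8x8 inverse DCT. It is not the decoder that ran on the Leon3 platform.

```python
import math

N = 8  # MPEG-2 transforms operate on 8x8 blocks


def dequantize(levels, step):
    """Simplified inverse quantization: scale every level by one step size.

    Assumption: a single uniform step is used for illustration only.
    """
    return [[level * step for level in row] for row in levels]


def idct_2d(coeffs):
    """Direct 8x8 inverse DCT: frequency coefficients to spatial samples."""
    def c(k):
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    block = [[0.0] * N for _ in range(N)]
    for x in range(N):
        for y in range(N):
            total = 0.0
            for u in range(N):
                for v in range(N):
                    total += (c(u) * c(v) * coeffs[u][v]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            block[x][y] = total / 4.0  # 2/N normalization for N = 8
    return block


# A block whose only non-zero level is the DC term decodes to a flat block.
levels = [[0] * N for _ in range(N)]
levels[0][0] = 10
pixels = idct_2d(dequantize(levels, step=8))
```

With only the DC coefficient set, every reconstructed sample has the same value, which is a quick sanity check for the normalization constants.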

4. POWER REDUCTION TECHNIQUES AND IMPLEMENTATION FLOW

4.1. Power reduction techniques

In this design, the following set of low-power techniques [3,4,5,6] was evaluated to identify their impact and relevance when used in a multimedia SoC platform.

Figure 3: Leon3 SoC layout

Clock Gate: tries to eliminate unnecessary clock toggle activity in storage elements (e.g. flip-flops). Industrial reports indicate that it can reduce dynamic power by about 15-20% [5], but it has no effect on leakage. Although clock gating has only a small impact on area and performance, it can considerably reduce the test coverage of the DFT and scan-chain circuitry.

Operand Isolation: reduces the dynamic dissipation of combinational circuits by selectively blocking redundant switching activity and preventing its propagation into the data-path circuitry. As with clock gating, it does not affect leakage dissipation.

Multiple Threshold Voltage: vendors provide library cells at different threshold voltages for various operating conditions. With different threshold voltages (high Vth and low Vth in most cases), it is possible to trade off leakage power against performance during physical design. High Vth cells reduce leakage dissipation, but have a significant impact on system performance. Low Vth cells, on the other hand, provide excellent speed but have high leakage dissipation, and are thus reserved for critical paths.

Multiple Supply Voltage: CMOS dynamic power dissipation is proportional to Vdd² (Vdd: gate supply voltage). By dividing the design into different voltage islands and supplying each block with a distinct voltage, one can significantly reduce the total dissipation. Nevertheless, this technique can also impact the area, performance and physical implementation of the chip.
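The dynamic-power side of these techniques can be sketched with the classic first-order CMOS model, P = activity x C x Vdd² x f. In the sketch below, the capacitance, activity factors and supply voltages are hypothetical values chosen only for illustration, and the function name is our own. None of them come from the Leon3 design.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """First-order CMOS switching power: P = activity * C * Vdd^2 * f."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


# Hypothetical block: 1 nF of switched capacitance at a 300 MHz clock.
BASE = dict(activity=0.2, capacitance_f=1e-9, vdd_v=1.2, freq_hz=300e6)

baseline = dynamic_power(**BASE)
# Clock gating and operand isolation lower the effective activity factor.
gated = dynamic_power(**{**BASE, "activity": 0.16})
# A lower-voltage island scales power with the square of Vdd.
low_vdd = dynamic_power(**{**BASE, "vdd_v": 1.0})

print(f"baseline : {baseline * 1e3:.1f} mW")
print(f"gated    : {gated * 1e3:.1f} mW ({1 - gated / baseline:.0%} lower)")
print(f"low Vdd  : {low_vdd * 1e3:.1f} mW ({1 - low_vdd / baseline:.0%} lower)")
```

The model covers only switching power. Leakage, which the multiple threshold voltage technique targets, is not represented.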

Figure 4: Simulation Environment

4.2. Implementation and power estimation flow

The baseline power dissipation of the Leon3 SoC was estimated without any power optimization technique. Logical and physical synthesis of the entire system was performed using the TSMC 90nm process.

RTL software/hardware co-simulation was used to ensure that the hardware configuration was compatible with the embedded software. First, the entire design was validated using an Altera StratixII FPGA design kit and logic simulators from Mentor Graphics and Cadence Design Systems. The simulation host was an Intel Quad Core processor running at 2.4 GHz with 4 GB of DDR2 RAM.

Logic simulation for the entire flow required 16 hours, with the simulation clock set to 300MHz. According to our measurements, this corresponds to 550ms in real time, enabling us to decode 18 frames per second on the target hardware. Logic synthesis was performed using Cadence RTL Compiler and took at least 1 hour. Encounter Test was used to generate ATPG vectors and achieved 95% coverage. Physical synthesis followed the Cadence flow. The final baseline chip used approximately 300K gates (Table 1) within a 0.9 mm² area and dissipated 155mW.
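The relation between these figures can be checked with a few lines of arithmetic. The sketch assumes that the 550ms interval covers the same 10-frame stream used in the post-layout runs. The text does not state this explicitly, so the frame count is an assumption.

```python
CLOCK_HZ = 300e6      # simulated clock rate (from the text)
REAL_TIME_S = 0.550   # decoded interval in target time (from the text)
SIM_WALL_HOURS = 16   # wall-clock time of the logic simulation (from the text)
FRAMES = 10           # assumption: same 10-frame stream as the post-layout run

cycles = CLOCK_HZ * REAL_TIME_S                 # clock cycles simulated
fps = FRAMES / REAL_TIME_S                      # frames per second on target
slowdown = SIM_WALL_HOURS * 3600 / REAL_TIME_S  # simulation vs. real time

print(f"simulated cycles : {cycles:.3e}")
print(f"decode rate      : {fps:.1f} frames/s")
print(f"slowdown factor  : {slowdown:.0f}x")
```

Under that assumption the decode rate comes out at roughly 18 frames per second, in line with the figure quoted above.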

After final layout (Figure 3), the corresponding parasitic-aware netlist was dumped and post-layout simulations were performed to extract switching activity.

The MPEG-2 video decoder and a small video-stream were cross-compiled using the BCC compiler (a C/C++ cross-compiler system based on gcc) and stored into the SDRAM simulation model, which communicates with the Leon3 SoC through an external memory controller and an AMBA on-chip bus. Each simulation run took 240 hours to decode 10 video frames.

After the SDRAM information was loaded into the SoC, video decoding was monitored, and switching activity was collected in a TCF file (Toggle Count Format, from Cadence). Switching activity, with the parasitic-aware layout, was used to estimate the baseline power dissipation, which served as the reference point for the power reduction techniques. Figure 4 summarizes the simulation environment for the extraction of switching activity during baseline power computation.

Each one of the techniques from Section 4.1 was applied to the SoC, using a TSMC90LP low-power library, without changing the embedded software or simulation environment. Switching activities were collected for each power optimization, and the new dissipation estimated and compared to the baseline power profile. With these comparisons, the efficiency and impact of each technique were evaluated.

5. EXPERIMENTAL RESULTS

Baseline power dissipation of the SoC was estimated without any power improvement technique (Table 1). Cache memories and register files were implemented using foundry macro cells (single port static RAM memory), and are responsible for 26% of the power dissipation and 40% of the final die area.

The use of macro cells from the foundry considerably simplified the implementation of the cache system and register file. However, these cells could not benefit from any of the selected power optimization techniques. In such cases, two alternative approaches can be used: choose another SRAM model with a smaller nominal consumption, or modify the cache access policy to reduce the switching activity.

Item              Leon3 SoC
----------------  ------------------------
Technology        TSMC 90G (Typical case)
Internal Power    121500 µW
Switching Power   28650 µW
Leakage Power     6315.26 µW
Total Power       155969 µW
Gate Number       302158
Silicon Area      916567.22 µm²
Estimated Speed   300 MHz (Typical case)

Table 1: Results from a baseline implementation
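As a reading aid, the baseline breakdown in Table 1 can be recomputed in a few lines. The sketch only reuses the table values, and the variable names are our own. Note that the three components sum to slightly more than the reported total (by about 0.3%). The sketch makes no attempt to explain that difference.

```python
# Baseline power components from Table 1, in microwatts.
baseline_uw = {"internal": 121500.0, "switching": 28650.0, "leakage": 6315.26}
REPORTED_TOTAL_UW = 155969.0  # "Total Power" row of Table 1

component_sum = sum(baseline_uw.values())
dynamic_uw = baseline_uw["internal"] + baseline_uw["switching"]
shares = {name: value / component_sum for name, value in baseline_uw.items()}

for name, share in shares.items():
    print(f"{name:<10s}: {share:6.2%} of the component sum")
print(f"dynamic (internal + switching): {dynamic_uw:.0f} uW")
print(f"component sum - reported total: {component_sum - REPORTED_TOTAL_UW:.2f} uW")
```

The breakdown shows that dynamic power dominates the baseline, which is consistent with the larger total-power gains later reported for the techniques that target dynamic dissipation.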

The extraction of switching activity was the most time-consuming task. In our project, with only about 1.2 million transistors and a small number of stimulus vectors, about 240 hours were required to dump the switching activity for the decoding of 10 video frames.

Based on the experience acquired from this project, we conclude that the time to collect the switching activity can become a serious issue when a project involves more than 10 million transistors.

By using the baseline power dissipation as a reference, the efficiency of each optimization technique was evaluated. Table 2 summarizes the improvement for each optimization with respect to the baseline implementation, in terms of power, area, gate count and performance.

Technique           Dynamic Power  Leakage Power  Total Power  Area     Gate Number  Speed
------------------  -------------  -------------  -----------  -------  -----------  -----
Clock Gate          -20.14%        -6.75%         -19.39%      +1%      +3%          0%
Operand Isolation   0%             0%             0%           0%       0%           0%
Multiple Threshold  -2.40%         -46.70%        -3.70%       -1.20%   -1.60%       0%
Multiple Supply     -46.48%        -6.63%         -44.68%      +14.6%   +6.63%       0%

Table 2: Power optimization results
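To make the percentages in Table 2 more tangible, the sketch below applies the total-power and area changes to the baseline figures from Table 1. It is a simple rescaling of the reported numbers, not a re-run of the power estimation, and the helper name `apply_change` is our own.

```python
BASELINE_TOTAL_UW = 155969.0   # Table 1, total power
BASELINE_AREA_UM2 = 916567.22  # Table 1, silicon area

# (total power change, area change) in percent, taken from Table 2.
techniques = {
    "Clock Gate": (-19.39, 1.0),
    "Operand Isolation": (0.0, 0.0),
    "Multiple Threshold": (-3.70, -1.20),
    "Multiple Supply": (-44.68, 14.6),
}


def apply_change(value, percent):
    """Scale a baseline value by a signed percentage change."""
    return value * (1.0 + percent / 100.0)


estimates = {
    name: (apply_change(BASELINE_TOTAL_UW, d_power),
           apply_change(BASELINE_AREA_UM2, d_area))
    for name, (d_power, d_area) in techniques.items()
}

for name, (power_uw, area_um2) in estimates.items():
    print(f"{name:<19s} {power_uw / 1e3:7.1f} mW  {area_um2 / 1e6:.3f} mm^2")
```

Each row is evaluated in isolation, as in Table 2. The sketch does not model how the techniques interact when combined.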

From Table 2, the impact of each power optimization technique was evaluated separately, as follows:

(a) Unlike other reports from industry, not much improvement resulted from operand isolation; we credit this to the fact that the Leon3 SoC is a platform which lacks repetitive DSP-based operations.

(b) Clock gating reduced power dissipation by 19.39% without any performance impact, though it considerably increased DFT and scan chain generation costs, as many more iterations were required to achieve an acceptable fault coverage (92%).

(c) Multiple-threshold voltage optimization reduced leakage power dissipation by about 47% without any performance or testing overhead; critical STA¹ issues had to be addressed to ensure that performance was met on all critical paths of the circuit.

(d) Multiple supply voltage optimization showed a significant reduction in dynamic dissipation (46.48%), contributing to a reduction of 44.68% in the overall power. This was achieved at the expense of a 14.6% impact in area, caused mainly by special library cells, which are responsible for the isolation of supply voltage domains, and decoupling cells, which prevent crowbar (short-circuit) current. Due to the different supply voltage domains, STA was considerably harder, particularly during clock tree synthesis.

As discussed above, no power optimization technique can alone meet the power goal within the given performance and area constraints. Combining several techniques and establishing a trade-off between power, speed, area and implementation effort are essential steps. Thus, it is fundamental that the design team has a very good understanding of the overall platform architecture and of the underlying design and manufacturing technologies.

¹ STA: Static Timing Analysis.

6. CONCLUSIONS AND FUTURE WORK

From the experience gained in this project, several new R&D directions can be pursued.

First, a better approach to extract the switching activity is necessary. As mentioned before, when the design has large transistor counts and stimulus vector sets, the time and amount of storage required to collect the switching activity become a bottleneck in the power simulation and optimization process. Eventually, this could make it impractical to correctly characterize the impact of the selected power optimization techniques.

Second, most techniques used for power optimization are applied at the backend stage of the ASIC design flow. They are not appropriate to optimize or estimate power dissipation at the early stages of the design. Specifically, they cannot be used when a project requires ESL design or if it is architected using behavioral modeling. High-level power optimization techniques have been shown to considerably improve power consumption, and thus an evaluation of our platform under such techniques would be a relevant next step.

We also noticed that the traditional functional verification methodology for RTL design shows no concern for power dissipation. In this project, assuming the RTL code is acceptable, and using the power intent file, a set of EDA tools was linked to achieve the desired power goal. Unfortunately, power intent was not used to verify the power optimization techniques during RTL design. One task left for future work is to combine some of the latest functional verification methodologies (e.g. VMM² or OVM³) with the power intent file, so as to enable the analysis of the power behavior of the system at the early design stages.

7. ACKNOWLEDGMENTS

We would like to thank CNPq for funding this work, and the researchers and staff of the Brazil-IP Network for their help and support all these years.

8. REFERENCES

[1] Gaisler Research, GRLIB Product Brief, 2010.

[2] Yang Yun Ju. The Impact of Design Techniques in the Reduction of Power Consumption of ASICs. MSc Dissertation, Institute of Computing, UNICAMP, 2011.

[3] International Standard ISO/IEC DIS 13818: Generic Coding of Moving Pictures and Associated Audio, Part 2 (Video), 2007.

[4] Devadas, Srinivas and Malik, Sharad. A Survey of Optimization Techniques Targeting Low Power VLSI Circuits, DAC '95 Proceedings of the 32nd annual ACM/IEEE Design Automation Conference, pp. 242-247, 1995.

[5] Flynn, D., Aitken, R., Gibbons, A. and Shi, Kaijian. Low Power Methodology Manual: For System-on-Chip Design, New York, NY: Springer New York, 2008.

[6] Rabaey J. M., Low Power Design Essentials, Series on Integrated Circuits and Systems, New York, NY: Springer New York, 2009.

² VMM: Verification Methodology Manual.
³ OVM: Open Verification Methodology.