
Northeastern University - A University Like No Other


optimizes both input and output of the Sobel Core.

[Bar chart; y-axis: Million Cycles (0 to 50); series: Total Run Time and Communication Delay]

Solution                          Total Run Time   Communication Delay
Pure SW Solution                  45.6             0
HW-SW Co-design (no opt)          20.3             5.8
HW-SW input pack/unpack           21.7             3.6
HW-SW output pack/unpack          17.9             0.91
HW-SW input/output pack/unpack    17               0.58

Figure 13: Sobel Edge Detect Performance

The baseline is the pure SW solution, which maps the whole Simulink model onto the DSP. It results in the longest execution time, about 45.6 Mcycles. HW-SW [no-opt] maps the most computationally intensive block, the Sobel Core, to the FPGA, with the rest running on the DSP, all in a pipelined fashion. The total execution time drops to 20.3 Mcycles, yielding a 2.25x speedup. However, HW-SW [no-opt] includes a communication overhead across DSP and FPGA of 22.2% of the total execution time.

Optimizing the path from SW to HW, the input pack/unpack solution reduces traffic overhead slightly but yields a longer total execution time than HW-SW [no-opt]. The overhead of executing pack (in SW) outweighs the communication performance gain, which is small due to the low input concatenation ratio of 2 : 1 (pack 2 pixels). Conversely, when only the path from HW to SW is optimized in the [output pack/unpack] solution, the overall performance increases, as the output concatenation ratio of 16 : 1 is much higher than the input concatenation ratio.
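The effect of the concatenation ratio can be illustrated with a minimal sketch. It assumes a 16-bit bus word, 8-bit input pixels and 1-bit edge results; these widths are not stated in the text and are chosen only because they reproduce the 2 : 1 and 16 : 1 ratios above. The function names are hypothetical and do not reflect the generated pack/unpack logic.

```python
def concat_ratio(bus_bits, elem_bits):
    """Number of elements that fit into one bus word."""
    return bus_bits // elem_bits


def pack(elems, bus_bits, elem_bits):
    """Concatenate elements into bus words, first element in the low bits."""
    ratio = concat_ratio(bus_bits, elem_bits)
    mask = (1 << elem_bits) - 1
    words = []
    for i in range(0, len(elems), ratio):
        word = 0
        for slot, elem in enumerate(elems[i:i + ratio]):
            word |= (elem & mask) << (slot * elem_bits)
        words.append(word)
    return words


def unpack(words, count, bus_bits, elem_bits):
    """Inverse of pack: recover the first `count` elements from bus words."""
    ratio = concat_ratio(bus_bits, elem_bits)
    mask = (1 << elem_bits) - 1
    elems = []
    for word in words:
        for slot in range(ratio):
            elems.append((word >> (slot * elem_bits)) & mask)
    return elems[:count]


BUS_BITS = 16                      # assumed bus width
pixels = [12, 200, 37, 255]        # assumed 8-bit input pixels
edges = [1, 0] * 16                # assumed 1-bit edge results (32 values)

pixel_words = pack(pixels, BUS_BITS, 8)   # 2 : 1, transactions halved
edge_words = pack(edges, BUS_BITS, 1)     # 16 : 1, transactions cut 16-fold

print(len(pixels), "pixels ->", len(pixel_words), "bus words")
print(len(edges), "edge bits ->", len(edge_words), "bus words")
```

Under these assumptions, packing halves the number of input transactions but cuts the output transactions 16-fold, which is consistent with the output path benefiting far more from pack/unpack than the input path.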

Finally, optimizing both paths, SW to HW and HW to SW, achieves a 2.68x speedup over the pure software solution. The total communication time of HW-SW input/output pack/unpack (0.58 Mcycles) is 10-fold lower than that of the unoptimized HW-SW [no-opt] solution (5.8 Mcycles). At 3.4% of the total execution time, communication is no longer a significant delay contributor.
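As a quick check, the speedups and the communication share quoted above can be recomputed from the cycle counts. The sketch below uses the values from the text and from Figure 13 (17 Mcycles total for the fully optimized solution); the dictionary keys are only labels chosen for this example.

```python
# Cycle counts in Mcycles, taken from the text and Figure 13.
total = {
    "Pure SW": 45.6,
    "HW-SW [no-opt]": 20.3,
    "HW-SW input/output pack/unpack": 17.0,
}
comm = {
    "HW-SW [no-opt]": 5.8,
    "HW-SW input/output pack/unpack": 0.58,
}

baseline = total["Pure SW"]

# Speedup of each solution relative to the pure software baseline.
speedup = {name: baseline / cycles for name, cycles in total.items()}

# Reduction in communication time achieved by packing both paths.
comm_reduction = comm["HW-SW [no-opt]"] / comm["HW-SW input/output pack/unpack"]

# Share of the optimized run time spent on communication, in percent.
opt = "HW-SW input/output pack/unpack"
comm_share_pct = 100.0 * comm[opt] / total[opt]

for name, value in speedup.items():
    print(f"{name}: {value:.2f}x")
print(f"communication reduction: {comm_reduction:.1f}x")
print(f"communication share (optimized): {comm_share_pct:.1f}%")
```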

To assess power efficiency, we measure the board-level power of our platform. It remains fairly constant at around 680 mW regardless of load and FPGA usage. As the DSP runs at a fixed frequency, this indicates that the FPGA (whose load changes) is a minor contributor to the total dynamic power. Nonetheless, HW/SW co-design and further communication optimization shorten the total execution time in the heterogeneous execution. Because power stays nearly constant, energy efficiency increases linearly with performance speedup: our optimized HW/SW solution is 2.68x more energy efficient.
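The linear relation between speedup and energy efficiency follows from energy = power x time when power is constant. A minimal sketch, assuming the 680 mW figure holds for every solution; the clock frequencies below are hypothetical placeholders (the text does not give one) and cancel out of the ratio.

```python
POWER_W = 0.680  # board-level power from the text, assumed constant


def energy_joules(mcycles, clock_hz, power_w=POWER_W):
    """Energy = power * time, with time = cycles / clock frequency."""
    seconds = (mcycles * 1e6) / clock_hz
    return power_w * seconds


def energy_gain(baseline_mcycles, optimized_mcycles, clock_hz):
    """Ratio of baseline energy to optimized energy at the same clock."""
    return (energy_joules(baseline_mcycles, clock_hz)
            / energy_joules(optimized_mcycles, clock_hz))


# Hypothetical clock frequencies: the gain is the same for any fixed clock.
for clock_hz in (100e6, 400e6):
    gain = energy_gain(45.6, 17.0, clock_hz)
    print(f"{clock_hz / 1e6:.0f} MHz: {gain:.2f}x less energy per run")
```

Since both power and clock frequency cancel, the energy gain equals the ratio of cycle counts, i.e. the speedup.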

Table 2: FPGA Utilization of HW/SW Optimized Solution

Slice                 Total   Application   Proxy+Glue   Database Pack+Unpack
Usage (out of 9312)   547     170           177          200
Utilization           5.8%    1.8%          1.9%         2.1%

In the HW/SW optimized solution, the FPGA utilization of the Spartan 3E is only 5.874%, as shown in Table 2. The generated Proxy, pack, unpack and other glue logic (bus IF) occupy 32%. The application-specific Sobel Core is small in this example. This indicates significant room for other implementations, e.g. duplicating the Sobel Core on the FPGA. However, this algorithm optimization is outside the scope of SimSH. On the other hand, the DSP is fully utilized at nearly 100% for all five implementations due to the overlapped HW/SW execution.
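The utilization figures can be recomputed from the slice counts in Table 2. The sketch below is only arithmetic on the table values; it also reports each component's share of the used slices, where the Proxy+Glue column (177 of 547) appears to match the 32% quoted above.

```python
TOTAL_SLICES = 9312  # slices available, per Table 2

# Slice usage per component, per Table 2.
usage = {
    "Application": 170,
    "Proxy+Glue": 177,
    "Database Pack+Unpack": 200,
}

used = sum(usage.values())
device_pct = 100.0 * used / TOTAL_SLICES

print(f"Total: {used} slices, {device_pct:.3f}% of the device")
for name, slices in usage.items():
    of_device = 100.0 * slices / TOTAL_SLICES
    of_used = 100.0 * slices / used
    print(f"{name}: {of_device:.1f}% of device, {of_used:.0f}% of used slices")
```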

5. CONCLUSION

This paper introduces the Simulink-based SimSH to bridge the gap between algorithm design in Simulink and its implementation on a heterogeneous platform. Given an allocation and a mapping decision, SimSH automatically synthesizes the Simulink model onto the heterogeneous target and refines the synchronization and communication across processing elements. Furthermore, SimSH optimizes communication by detecting an underutilized bus and concatenating transactions accordingly. As a result, it allows the developer to focus on algorithm exploration and tuning and to rapidly prototype the algorithm on a heterogeneous target platform.

We have demonstrated the benefits using Sobel Edge Detection [9], targeting a heterogeneous architecture with a Blackfin processor and a Spartan3E FPGA. With communication optimization, SimSH achieves up to a 2.68x improvement in both speed and energy efficiency over a pure software solution. In future work, we will investigate automatic mapping decisions for a given platform.

6. REFERENCES

[1] Analog Devices, Inc. (ADI). ADSP-BF52x Blackfin Processor Hardware Reference, February 2013. Rev. 1.2.

[2] V. Berman. Standards: The P1685 IP-XACT IP Metadata Standard. Design & Test of Computers, IEEE, 23(4):316–317, April 2006.

[3] L. Cai, et al. Retargetable profiling for rapid, early system-level design space exploration. In Proceedings of the Design Automation Conference (DAC), San Diego, CA, June 2004.

[4] S.-I. Han, et al. Simulink-based Heterogeneous Multiprocessor SoC Design Flow For Mixed Hardware/Software Refinement And Simulation. Integration, the VLSI Journal, 42(2):227–245, Feb. 2009.

[5] International Organization for Standardization (ISO). Reference Model of Open System Interconnection (OSI), second edition, 1994. ISO/IEC 7498 Standard.

[6] K. Popovici et al. Simulink Based Hardware-Software Codesign Flow For Heterogeneous MPSoC. In Proceedings of the 2007 Summer Computer Simulation Conference, pages 497–504. Society for Computer Simulation International, 2007.

[7] K. M. Popovici. Multilevel Programming Environment for Heterogeneous MPSoC Architectures. PhD thesis, Grenoble Institute of Technology, 2008.

[8] F. Robino et al. From Simulink to NoC-based MPSoC on FPGA. In DATE, pages 1–4, 2014.

[9] I. Sobel. Neighborhood Coding Of Binary Images For Fast Contour Following And General Binary Array Processing. Computer Graphics and Image Processing, 8(1):127–135, 1978.

[10] The MathWorks, Inc. Embedded coder ref. http://www.mathworks.com/products/embedded-coder/, 2014a.

[11] The MathWorks, Inc. Hdl coder ref. http://www.mathworks.com/products/hdl-coder/, 2014a.

[12] The MathWorks Inc. MATLAB and Simulink, 1993-2014.

[13] Xilinx. Xilinx Command Line Tools User Guide, October 2013. Version 14.7.

[14] J. Zhang et al. Joint Algorithm Developing and System-Level Design: Case Study on Video Encoding. In Embedded Systems: Design, Analysis and Verification, pages 26–38. Springer, 2013.

[15] L. Zhang, et al. Bridging Algorithm And ESL Design: Matlab/Simulink Model Transformation And Validation. In Specification Design Languages (FDL), 2013 Forum on, pages 1–8, Sept 2013.