



www.symparallel.com

Efficient and Scalable OpenMP-based System-Level Design

syMParallel turns parallel programmers into hardware designers

Blending parallel programming with ESL design


syMParallel is an innovative design framework for massively parallel applications, blending the OpenMP programming paradigm with electronic system-level (ESL) design methodologies. A de-facto standard for parallel programming, OpenMP is the most popular approach to shared-memory parallel applications in a number of different application domains, ranging from financial applications to scientific computing and graphics processing [1]. OpenMP naturally meets the characteristics of current multi-processor systems-on-chip (MPSoCs), in that it provides enough semantics to express the parallelism of typical data-intensive applications. The syMParallel design flow enables the automated translation of high-level OpenMP programs into an FPGA-based multi-processor system-on-chip highly customized for the target application, possibly accelerated by hardware cores inferred from the source code through a high-level synthesis process.

A few works in the literature address the description of hardware accelerators through OpenMP. They focus either on the integration of hardware accelerators into essentially software OpenMP programs [6][2], or on pure hardware translation [3]-[5]. When supporting OpenMP-based hardware design, they impose drastic restrictions on the constructs actually available to parallel programmers, effectively preventing the reuse of legacy OpenMP programs and kernels. Other limitations include centralized mechanisms for controlling interactions among threads, which cause scalability issues, limited support for external memory and for efficiency-critical mechanisms such as caching, and several unsupported runtime library routines.

The syMParallel design flow enables the generation of heterogeneous systems, including one or more processors and dedicated hardware components, where OpenMP threads can be mapped to either software threads or hardware blocks. The control-intensive facets that make full OpenMP difficult to implement in hardware are managed in software, while the data-intensive part of the OpenMP application is handled by dedicated software/hardware parallelism. Hardware threads are generated by means of high-level synthesis tools, which fit the structure of an OpenMP program well, since the application logic is still described in plain C/C++ code. This approach provides full support for standard-compliant OpenMP applications, as well as fundamental "general-purpose" characteristics such as memory hierarchies and memory management.
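As a concrete illustration of the design entry accepted by such a flow, the following is a minimal, generic C/OpenMP kernel of the kind discussed above: plain code whose threads can be mapped either to software or to HLS-generated hardware. It is an illustrative sketch, not code shipped with syMParallel.

#include <omp.h>

#define N 1024

/* Plain C/OpenMP design entry: a data-parallel loop that a design flow of
 * this kind can map onto software threads or HLS-generated hardware blocks. */
void scale_and_offset(float *dst, const float *src, float a, float b)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        dst[i] = a * src[i] + b;
}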

The figure below shows the typical architecture of a system generated by syMParallel. Each subsystem represents an OpenMP thread, or a group of threads. Software subsystems are processors executing compiled C code. Hardware subsystems are either blocks generated through HLS, equipped with DMA capabilities and synchronization ports, or ordinary peripherals such as timers or ad-hoc processing components. For the generation of these subsystems, syMParallel relies on third-party back-ends. Namely, the framework currently supports Xilinx FPGA devices and MPSoC architectures [10], and exploits Impulse CoDeveloper for high-level synthesis [9].

syMParallel Eclipse plug-in

The design environment exposes a Graphical User Interface (GUI) seamlessly integrated into the Eclipse IDE as an external plug-in. Programmers are provided with a quick, interactive, and user-friendly interface to control the whole design cycle, from functional simulation to system implementation and execution. The system-level design methodology underpinned by the syMParallel environment is mainly based on a high-level top-down strategy in which each step can be controlled by the designer through the GUI.

Applications: Graphics Processing with syMParallel

Graphics Processing is an important application domain for parallel computing, due to the massive thread-level parallelism inherently present in most of its algorithms. A common example of graphics processing applications, the JPEG algorithm is used for the compression and decompression of bitmap images. According to the JPEG standard, the first stages of image compression/decompression, i.e. the Discrete Cosine Transform, Quantization, and their inverse transformations, are applied independently to all 8x8 pixel sub-blocks in the image, enabling massive algorithm-level parallelism. We used syMParallel to implement the above steps of the JPEG standard, both for compression and decompression. Relying on the possibility of mapping the shared memory onto a frame-buffer device, we could also validate the results graphically through a 640x480 VGA external display. During the Design Space Exploration stage, we chose to allocate the maximum number of hardware accelerators meeting the area constraints of the FPGA device, yielding a very fast compression/decompression cycle. The chart on the left exemplifies the trade-offs explored by means of the area estimation tool. Notice that, by retaining a soft-core embedded processor in the final system architecture, we could adopt a dynamic scheduling strategy when coding the for loops of the OpenMP application, i.e. we could use a fully-fledged OpenMP application as the design entry of the flow.
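The sketch below illustrates how the block-level parallelism described above can be expressed in plain OpenMP. The dct_8x8() and quantize_8x8() kernels are hypothetical placeholders, not the actual functions of our JPEG application.

#include <stdint.h>

void dct_8x8(int16_t *block, int stride);      /* hypothetical DCT kernel   */
void quantize_8x8(int16_t *block, int stride); /* hypothetical quantizer    */

/* Every 8x8 sub-block is transformed and quantized independently, so the
 * block loop can be distributed among OpenMP threads with dynamic scheduling. */
void compress_blocks(int16_t *img, int width, int height)
{
    int blocks_x = width / 8, blocks_y = height / 8;

    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int by = 0; by < blocks_y; by++) {
        for (int bx = 0; bx < blocks_x; bx++) {
            int16_t *block = img + (by * 8) * width + (bx * 8);
            dct_8x8(block, width);       /* forward DCT on one 8x8 block   */
            quantize_8x8(block, width);  /* quantization of the same block */
        }
    }
}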

Applications: Financial Computing with syMParallel

Computational finance provides numerical methods to tackle difficult problems in financial applications. Coding these numerical algorithms can be greatly simplified by parallel programming paradigms like OpenMP, e.g. to exploit task-level parallelism in financial simulations. Monte Carlo Option Price is a numerical method often used in computational finance to calculate the value of an option with multiple sources of uncertainty and random features, such as changing interest rates, stock prices, or exchange rates. Monte Carlo simulations can be easily parallelized: the tasks are largely independent, so they can be partitioned among different computation units, while communication requirements are limited. The parallelization with syMParallel was done by partitioning the Monte Carlo iterations among all threads with a dynamic strategy, hence balancing the load in a completely distributed fashion. As soon as the independent tasks are complete, a reduction operation allows the thread-safe merging of all partial results, letting the main thread compute the final result. During the Design Space Exploration step, the lowest-latency implementation was chosen. It retains a software thread, used to perform control-related and miscellaneous operations such as random number generation. In conclusion, syMParallel offered a direct path from parallel code to a parallel hardware engine for the acceleration of Monte Carlo Option Price simulations, able to run many times faster than a software implementation.
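A minimal sketch of the parallelization pattern described above is shown below, assuming a simple European call payoff. The payoff model and the per-thread rand_r() generator are simplifications for illustration and do not reproduce the application's actual code or its random-number scheme.

#include <math.h>
#include <stdlib.h>
#include <omp.h>

/* Monte Carlo paths are distributed with a dynamic schedule and the partial
 * sums are merged with a reduction, mirroring the pattern described above. */
double mc_call_price(double s0, double strike, double r, double sigma,
                     double t, long n_paths)
{
    const double two_pi = 6.283185307179586;
    double sum = 0.0;

    #pragma omp parallel reduction(+:sum)
    {
        unsigned int seed = 1234u + (unsigned int)omp_get_thread_num();

        #pragma omp for schedule(dynamic, 1024)
        for (long i = 0; i < n_paths; i++) {
            /* crude uniform-to-gaussian draw, for illustration only */
            double u1 = (rand_r(&seed) + 1.0) / (RAND_MAX + 2.0);
            double u2 = (rand_r(&seed) + 1.0) / (RAND_MAX + 2.0);
            double z  = sqrt(-2.0 * log(u1)) * cos(two_pi * u2);

            double st = s0 * exp((r - 0.5 * sigma * sigma) * t
                                 + sigma * sqrt(t) * z);
            double payoff = st > strike ? st - strike : 0.0;
            sum += payoff;
        }
    }
    /* discounted average payoff = option value estimate */
    return exp(-r * t) * (sum / (double)n_paths);
}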


[Design-flow figure: C/OpenMP-based system description → functional simulation → hardware/software partitioning → high-level synthesis and software code compilation → platform-based system composition → hardware synthesis and place & route, drawing on an IP-core library (e.g. timers, video controllers, ...).]


syMParallel relies on state-of-the-art ESL design techniques that help reduce the design cycle for building a complete system. After the initial coding in C/OpenMP, a functional simulation is performed to verify the correctness of the application. The high-level executable specification is key, here, to enabling a fast, purely software simulation. Starting from the high-level specification, the syMParallel ad-hoc compiler generates all the files needed to build the heterogeneous MPSoC, completely hiding the underlying technology and the back-end synthesis tools.

To fully support the OpenMP specification, a library of IP cores is incorporated into the environment and automatically used by the compiler when specific OpenMP directives are detected. The bifurcation point in the flow corresponds to the hardware/software partitioning stage, where the physical architecture of the system is defined. syMParallel technology enables a fast evaluation of all costs associated with a specific implementation. The area occupation is estimated by means of a suitable statistical analysis, while the evaluation of the overall latency relies on a cycle-accurate Instruction Set Simulator for the embedded processors and on timing analysis for the hardware accelerators. Subsequently, two distinct branches of the flow cover software compilation and high-level synthesis concurrently. They re-join at the system composition step, where the whole system is automatically built, usually by assembling library hardware/software components and the application-specific components generated by the previous steps, before the actual hardware synthesis takes place.

Results: efficient and scalable design solutions

[Charts a) and b): normalized overheads of the private, firstprivate, dynamic, static, and critical constructs on the syMParallel FPGA MPSoC and on an Intel i7, as the number of threads increases. Charts c) and d): speed-up and clock-frequency comparison between our approach and Ref. [3].]

syMParallel was benchmarked by measuring the implementation overhead caused by the support for OpenMP constructs. Precisely, we measured the overhead/execution-time ratio, i.e. the normalized overhead, relying on the well-known EPCC benchmarks [8]. We compared the normalized overhead with that of an OpenMP implementation for a Windows 7 OS on an Intel i7 processor at 1.8 GHz running the same benchmarks.

Chart a) presents the normalized overheads as the number of threads is increased. The important clue provided by the plot is that, in addition to being low in absolute value, the overhead tends to stay flat. This shows the effects of the distributed architecture and synchronization mechanisms implemented by syMParallel and their impact on the efficiency and the scalability of the resulting MPSoC.

Chart b) summarizes the overhead trends for the measured constructs (private, firstprivate, dynamic, static, and critical), depicting the average slopes of the normalized overhead measured during our experiments. For example, the chart tells us that, on average, the overhead of the #pragma omp parallel firstprivate(var) construct grows by 0.006 per thread, while it grows by 0.147 per thread for the i7 implementation. Again, this provides a convincing demonstration of the scalability of the syMParallel approach compared to the pure software implementation.
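The following simplified sketch illustrates the kind of measurement such microbenchmarks perform, timing a #pragma omp parallel firstprivate(var) region against a serial reference. It is an illustration of the principle, not the EPCC code.

#include <omp.h>
#include <stdio.h>

#define REPS 10000

/* Time a reference serial loop, then the same body wrapped in the construct
 * under test, and report the extra time per repetition (the overhead). */
int main(void)
{
    volatile double var = 1.0;
    double t0, t_ref, t_par;

    t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++)
        var = var * 1.000001;                 /* reference body              */
    t_ref = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++) {
        #pragma omp parallel firstprivate(var)
        { var = var * 1.000001; }             /* same body inside the region */
    }
    t_par = omp_get_wtime() - t0;

    printf("overhead per parallel/firstprivate region: %g us\n",
           1e6 * (t_par - t_ref) / REPS);
    return 0;
}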

We also compared syMParallel with [3], the only solution for OpenMP-to-hardware translation presenting a working tool and some performance figures. Charts c) and d) compare our results with [3] in terms of performance scalability and system frequency. Chart c) refers to a program implementing the Sieve of Eratosthenes algorithm, while chart d) refers to a program implementing an infinite impulse response filter, as in [3].
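For reference, the following is one natural OpenMP formulation of the Sieve of Eratosthenes when used as a scalability benchmark; the exact kernel used in [3] and in our comparison may differ.

#include <string.h>

/* is_composite[] must hold n+1 bytes; it is zero-initialized here. The
 * crossing-off of multiples and the final count are the parallel parts. */
long sieve_count_primes(unsigned char *is_composite, long n)
{
    long count = 0;
    memset(is_composite, 0, (size_t)(n + 1));

    for (long p = 2; p * p <= n; p++) {
        if (is_composite[p])
            continue;
        /* crossing off multiples of p is independent work: parallelize it */
        #pragma omp parallel for schedule(static)
        for (long m = p * p; m <= n; m += p)
            is_composite[m] = 1;
    }

    #pragma omp parallel for reduction(+:count)
    for (long i = 2; i <= n; i++)
        if (!is_composite[i])
            count++;

    return count;
}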

The plots confirm the effectiveness of syMParallel, which ensures a satisfactory level of efficiency and scalability, concerning both the application speed-up and the complexity of the generated hardware, which directly affects the clock frequency.

Fast estimates of latency and area occupation for the chosen implementation can be displayed in the form of reports and graphical diagrams, helping the designer visualize and understand the actual system requirements. This allows design choices to be made as early as possible in the development cycle, enabling a fast and effective Design Space Exploration (DSE) process. After the complete hardware platform has been generated, the early estimation parameters can be validated against the actual synthesis results, again relying on an intuitive and user-friendly graphical interface. Then, the designer can finally program the FPGA, download the code, and run the application on the resulting custom MPSoC.

Throughout the design process, the syMParallel dedicated console shows the state of the third-party back-end tools, reporting errors and warnings without blocking the Eclipse graphical interface. After coding and compiling the OpenMP source code in the Eclipse text editor, users can execute and debug it on the host machine in order to complete the required functional simulation. Then, they can generate the source files taken as input by the next steps by simply launching the syMParallel compiler. Several architectural constraints can be specified for the system being built. For example, users can determine the mapping of a given parallel thread by binding it to an embedded processor or to a dedicated, faster hardware accelerator. If an image or video application is being developed, the designer can map a specific portion of the shared memory onto a frame-buffer device: the syMParallel compiler will automatically generate a system with an integrated display controller that can be used to drive an external display.

syMParallel technology

The parallel heterogeneous architecture is defined in such a way as to orchestrate its hardware/software components in a distributed fashion, avoiding centralized hardware elements. The memory components may be implemented in different technologies: they may be mapped to FPGA internal RAM blocks or to off-chip SDRAM memory, enabling the synthesis of real-world systems working on large amounts of data. Shared memory areas are accessed by all subsystems. Each hardware subsystem has its own local memory corresponding to the synthesized registers, while software subsystems can also have a local memory corresponding to the cache levels of the processor's memory hierarchy. The atomic registers are the fundamental block for the implementation of mutual-exclusion mechanisms, as they provide specific hardware support for them.

Data scoping and initialization mechanisms are very important to OpenMP, and are implemented by syMParallel as efficiently as possible. As an example, the overhead caused by the OpenMP firstprivate clause depends only on the number of listed variables and does not increase with the number of threads, preserving the application's scalability. Another essential aspect of OpenMP is the dynamic scheduling strategy for partitioning the iterations of the #pragma omp for directive. Supporting this strategy efficiently is vital for load balancing in heterogeneous systems. syMParallel supports it in a completely distributed fashion. The following is a simplified version of the algorithm executed independently by each processing element:

    while (iterations are not finished) {
        critical {
            read the iteration_counter
            re-check that iterations are not finished
            update the iteration_counter with (iteration_counter + chunk_size)
        }
        execute the claimed chunk of iterations
    }

As shown above, the support is completely distributed. Consequently, the threads execute a number of iterations determined at run time, according to the computational power of the unit they are allocated to and the load they happen to handle, fully implementing the semantics of the dynamic clause.
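To make the pseudocode concrete, the following C/OpenMP sketch expresses the same chunk-claiming scheme with an ordinary critical section over a shared counter. It illustrates the idea only; it is not syMParallel's actual runtime support, which relies on the atomic registers described above.

#include <omp.h>

/* Each thread repeatedly claims the next chunk of iterations from a shared
 * counter inside a critical section and then executes it. */
void run_dynamic(long n_iters, long chunk_size, void (*body)(long))
{
    long next = 0;                       /* shared iteration counter       */

    #pragma omp parallel shared(next)
    {
        for (;;) {
            long start;

            #pragma omp critical(iter_counter)
            {
                start = next;            /* read the iteration counter     */
                if (start < n_iters)     /* re-check before claiming       */
                    next = start + chunk_size;
            }
            if (start >= n_iters)
                break;                   /* no iterations left to claim    */

            long end = start + chunk_size < n_iters ? start + chunk_size
                                                     : n_iters;
            for (long i = start; i < end; i++)
                body(i);                 /* execute the claimed chunk      */
        }
    }
}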

[Architecture figure: a typical syMParallel system, with software subsystems (each with a local memory), hardware subsystems (HLS-generated blocks with local memories, or peripherals such as timers), memory subsystems holding the stack, heap, and text/global segments of the software subsystems SW S1-S3, atomic registers, and the communication infrastructure with distributed synchronization.]

References

[1] OpenMP Architecture Review Board. (2011) OpenMP application program interface, v3.1. [Online]. Available: www.openmp.org
[2] W.-C. Jeun and S. Ha, "Effective OpenMP implementation and translation for multiprocessor System-on-Chip without using OS," in Proceedings of the 2007 Asia and South Pacific Design Automation Conference - ASP-DAC '07, Jan. 2007, pp. 44-49.
[3] Y. Leow, C. Ng, and W. Wong, "Generating hardware from OpenMP programs," in IEEE International Conference on Field Programmable Technology (FPT 2006), Dec. 2006, pp. 73-80.
[4] P. Dziurzanski and V. Beletskyy, "Defining synthesizable OpenMP directives and clauses," in Proceedings of the 4th International Conference on Computational Science - ICCS 2004, ser. LNCS, vol. 3038. Springer, 2004, pp. 398-407.
[5] P. Dziurzanski, W. Bielecki, K. Trifunovic, and M. Kleszczonek, "A system for transforming an ANSI C code with OpenMP directives into a SystemC description," in Design and Diagnostics of Electronic Circuits and Systems, 2006. IEEE, Apr. 2006, pp. 151-152.
[6] D. Cabrera, X. Martorell, G. Gaydadjiev, E. Ayguade, and D. Jimenez-Gonzalez, "OpenMP extensions for FPGA accelerators," in International Symposium on Systems, Architectures, Modeling, and Simulation, 2009 - SAMOS '09, Jul. 2009, pp. 17-24.
[7] P. Coussy and A. Morawiec (Eds.), High-Level Synthesis: From Algorithm to Digital Circuit. Springer, 2008.
[8] EPCC. (2012) EPCC OpenMP benchmarks. [Online]. Available: http://www.epcc.ed.ac.uk/software-products/epcc-openmp-benchmarks/
[9] Impulse Accelerated Technologies. (2012) Impulse CoDeveloper. [Online]. Available: http://www.impulseaccelerated.com
[10] Xilinx. (2012) Platform Studio and the Embedded Development Kit (EDK). [Online]. Available: http://www.xilinx.com/tools/platform.htm

Any ideas? Contact us! [email protected]