
www.symparallel.com

Efficient and Scalable OpenMP-based System-Level Design

syMParallel turns parallel programmers into hardware designers

Blending parallel programming with ESL design

Applications: Graphics Processing with syMParallel

Graphics processing is an important application domain for parallel computing due to the massive thread-level parallelism inherently present in most algorithms. A common example of graphics processing applications, the JPEG algorithm is used for compression and decompression of bitmap images. According to the JPEG standard, the first stages of image compression/decompression, i.e. the Discrete Cosine Transform, Quantization, and their inverse transformations, are applied independently to all 8x8 pixel sub-blocks in the image, enabling massive algorithm-level parallelism. We used syMParallel to perform the above steps of the JPEG standard, both for compression and decompression. Relying on the possibility of mapping the shared memory to a frame-buffer device, we could also validate the results graphically through a 640x480 VGA external display. During the Design Space Exploration stage, we chose to allocate the maximum number of hardware accelerators meeting the constraints of the FPGA device area, yielding a very fast compression/decompression cycle. The chart on the left exemplifies the trade-offs explored by means of the area estimation tool. Notice that, by retaining a soft-core embedded processor in the final system architecture, we could adopt a dynamic scheduling strategy when coding the loops of the OpenMP application, i.e. we could use a fully-fledged OpenMP application as the design entry of the flow.
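As an illustration of the loop structure involved (a minimal sketch, not the actual syMParallel benchmark code), the following OpenMP fragment applies a per-block transformation to every 8x8 sub-block of a 640x480 frame with a dynamic schedule; the process_block body is a placeholder standing in for the DCT and quantization kernels:

#include <omp.h>
#include <stdlib.h>

#define W 640   /* frame width, matching the VGA display mentioned above */
#define H 480   /* frame height                                          */

/* Placeholder for the real per-block work (DCT + quantization);
   here it only copies the block so the sketch is self-contained.  */
static void process_block(const unsigned char *in, int stride, int *out)
{
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            out[y * 8 + x] = in[y * stride + x];
}

int main(void)
{
    unsigned char *frame  = calloc((size_t)W * H, 1);
    int           *coeffs = calloc((size_t)W * H, sizeof(int));

    /* Every 8x8 sub-block is independent, so the two nested loops can be
       collapsed and the chunks distributed dynamically among the threads. */
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int by = 0; by < H; by += 8)
        for (int bx = 0; bx < W; bx += 8) {
            int blk = (by / 8) * (W / 8) + (bx / 8);
            process_block(&frame[by * W + bx], W, &coeffs[blk * 64]);
        }

    free(frame);
    free(coeffs);
    return 0;
}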


syMParallel is an innovative design framework for massively parallel applications, blending the OpenMP programming paradigm with electronic system-level (ESL) design methodologies. A de-facto standard for parallel programming, OpenMP is the most popular approach to shared-memory parallel applications in a number of different application domains, ranging from financial applications to scientific computing and graphics processing [1]. OpenMP naturally meets the characteristics of current multi-processor systems-on-chip (MPSoCs), in that it provides enough semantics to express the parallelism of typical data-intensive applications. The syMParallel design flow enables the automated translation of high-level OpenMP programs down to an FPGA-based multi-processor system-on-chip highly customized for the target application, possibly accelerated by hardware cores inferred from the source code through a high-level synthesis process.

A few works in the literature address the description of hardware accelerators through OpenMP. They focus either on the integration of hardware accelerators in essentially software OpenMP programs [6][2], or on pure hardware translation [3]-[5]. When supporting OpenMP-based hardware design, they impose drastic restrictions on the constructs actually available to parallel programmers, effectively preventing the reuse of legacy OpenMP programs and kernels. Other limitations include the use of centralized mechanisms for controlling interactions among threads, causing scalability issues, the limited support for external memory and for efficiency-critical mechanisms such as caching, and several unsupported runtime library routines.

The syMParallel design flow enables the generation of heterogeneous systems, including one or more processors and dedicated hardware components, where OpenMP threads can be mapped to either software threads or hardware blocks. The control-intensive facets that make full OpenMP difficult to implement in hardware are managed in software, while the data-intensive part of the OpenMP application is addressed by dedicated software/hardware parallelism. Hardware threads are generated by means of high-level synthesis tools that perfectly fit the structure of an OpenMP program, where the application logic is still described by means of plain C/C++ code. This approach provides full support for standard-compliant OpenMP applications, as well as fundamental "general-purpose" characteristics such as memory hierarchies and their management.


The figure below shows the typical architecture of a system generated by syMParallel. Each subsystem represents an OpenMP thread, or a group of threads. Software subsystems are processors executing compiled C code. Hardware subsystems are either blocks generated through HLS, equipped with DMA capabilities and synchronization ports, or ordinary peripherals such as timers or ad-hoc processing components. For the generation of these subsystems, syMParallel relies on third-party back-ends. Namely, the framework currently supports Xilinx FPGA devices and MPSoC architectures [10], and exploits Impulse CoDeveloper for high-level synthesis [9].


The syMParallel design environment exposes a Graphical User Interface (GUI) seamlessly integrated into the Eclipse IDE as an external plug-in. Programmers are provided with a quick, interactive, and user-friendly interface to control the whole design cycle, from functional simulation to system implementation and execution. The system-level design methodology underpinned by the syMParallel environment is mainly based on a high-level top-down strategy where each step can be controlled by the designer through the GUI.


Applications: Financial Computing with syMParallel

Computational finance provides numerical methods to tackle difficult problems in financial applications. Coding these numerical algorithms can be greatly simplified by using parallel programming paradigms like OpenMP, e.g. to exploit task-level parallelism for financial simulations.

Monte Carlo Option Price is a numerical method often used in computational finance to calculate the value of an option with multiple sources of uncertainty and random features, such as changing interest rates, stock prices, or exchange rates. Monte Carlo simulations can be easily parallelized, since the tasks are largely independent and can be partitioned among different computation units, while communication requirements are limited. The parallelization with syMParallel was done by partitioning the Monte Carlo iterations among all threads with a dynamic strategy, hence balancing the load in a completely distributed fashion. As soon as the independent tasks are complete, a reduction operation allows the thread-safe merging of all results, letting the main thread compute the final outcome. During the Design Space Exploration step, the lowest-latency implementation was chosen. It retains a software thread, used to perform control-related and miscellaneous operations such as random number generation. In conclusion, syMParallel offered a direct path from parallel code to a parallel hardware engine for the acceleration of Monte Carlo Option Price simulations, able to run many times faster than a software implementation.
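The structure of such a kernel can be sketched as a plain OpenMP program. The snippet below is a minimal illustration, assuming a European call option priced under geometric Brownian motion; the parameters, the payoff, and the simple portable random number generator are illustrative choices, not the benchmarked application itself:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Illustrative parameters: spot, strike, rate, volatility, maturity, paths. */
#define S0     100.0
#define K      105.0
#define R      0.02
#define SIGMA  0.25
#define T      1.0
#define NPATHS 1000000

int main(void)
{
    const double PI = 3.14159265358979323846;
    double payoff_sum = 0.0;

    /* The iterations are independent, so they are partitioned dynamically
       among the threads; the reduction merges the partial sums safely.    */
    #pragma omp parallel for schedule(dynamic, 1024) reduction(+ : payoff_sum)
    for (long i = 0; i < NPATHS; i++) {
        /* Per-iteration random numbers (Box-Muller on rand_r for portability;
           a real implementation would use a better generator).               */
        unsigned int seed = (unsigned int)(i + 1);
        double u1 = (rand_r(&seed) + 1.0) / ((double)RAND_MAX + 2.0);
        double u2 = (rand_r(&seed) + 1.0) / ((double)RAND_MAX + 2.0);
        double z  = sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);

        /* Terminal price under geometric Brownian motion and call payoff. */
        double st = S0 * exp((R - 0.5 * SIGMA * SIGMA) * T + SIGMA * sqrt(T) * z);
        payoff_sum += st > K ? st - K : 0.0;
    }

    printf("Estimated option price: %f\n", exp(-R * T) * payoff_sum / NPATHS);
    return 0;
}

Here the reduction(+ : payoff_sum) clause corresponds to the thread-safe merging of results mentioned above, while schedule(dynamic, 1024) partitions the iterations in chunks claimed at run time.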

References

[1] OpenMP Architecture Review Board. (2011) OpenMP application program interface, v3.1. [Online]. Available: www.openmp.org
[2] W.-C. Jeun and S. Ha, “Effective OpenMP implementation and translation for multiprocessor System-on-Chip without using OS,” in Proceedings of the 2007 Asia and South Pacific Design Automation Conference - ASP-DAC ’07, Jan. 2007, pp. 44–49.
[3] Y. Leow, C. Ng, and W. Wong, “Generating hardware from OpenMP programs,” in IEEE International Conference on Field Programmable Technology (FPT 2006), Dec. 2006, pp. 73–80.
[4] P. Dziurzanski and V. Beletskyy, “Defining synthesizable OpenMP directives and clauses,” in Proceedings of the 4th International Conference on Computational Science - ICCS 2004, ser. LNCS, vol. 3038. Springer, 2004, pp. 398–407.
[5] P. Dziurzanski, W. Bielecki, K. Trifunovic, and M. Kleszczonek, “A system for transforming an ANSI C code with OpenMP directives into a SystemC description,” in Design and Diagnostics of Electronic Circuits and Systems, 2006. IEEE, Apr. 2006, pp. 151–152.
[6] D. Cabrera, X. Martorell, G. Gaydadjiev, E. Ayguade, and D. Jimenez-Gonzalez, “OpenMP extensions for FPGA accelerators,” in International Symposium on Systems, Architectures, Modeling, and Simulation, 2009 - SAMOS ’09, Jul. 2009, pp. 17–24.
[7] P. Coussy and A. Morawiec (Eds.), High-Level Synthesis: from Algorithm to Digital Circuit. Springer, 2008.
[8] EPCC. (2012) EPCC OpenMP benchmarks. [Online]. Available: http://www.epcc.ed.ac.uk/software-products/epcc-openmp-benchmarks/
[9] Impulse Accelerated Technologies. (2012) Impulse CoDeveloper. [Online]. Available: http://www.impulseaccelerated.com
[10] Xilinx. (2012) Platform Studio and the Embedded Development Kit (EDK). [Online]. Available: http://www.xilinx.com/tools/platform.htm

Any ideas? Contact us! info@symparallel.com

[Design flow figure: C/OpenMP-based system description → functional simulation → hardware/software partitioning → high-level synthesis and software code compilation → platform-based system composition, drawing on an IP-core library (e.g. timers, video controllers, …) → hardware synthesis and place & route.]

[Charts a)–d): normalized overhead versus number of threads for the private, firstprivate, dynamic, static, and critical constructs, comparing the FPGA implementation of our approach with an Intel i7 software implementation, and comparing our approach with Ref. [3].]


syMParallel relies on state-of-the-art ESL design techniques, helping reduce the design cycle for building a complete system. After the initial coding in C/OpenMP, a functional simulation is performed to verify the correctness of the application. The high-level executable specification, here, is key to enabling a fast, purely software simulation. Starting from the high-level specification, the ad-hoc syMParallel compiler generates all the files needed to build the heterogeneous MPSoC, completely hiding the underlying technology and the back-end synthesis tools.

To fully support the OpenMP specification, a library of IP cores is incorporated into the environment and automatically used by the compiler when specific OpenMP directives are detected. The bifurcation point in the flow corresponds to the hardware/software partitioning stage, where the physical architecture of the system is defined. syMParallel technology enables a fast evaluation of all the costs associated with a specific implementation. The area occupation is estimated by means of a suitable statistical analysis, while the evaluation of the overall latency relies on a cycle-accurate Instruction Set Simulator for the embedded processors and on timing analysis for the hardware accelerators. Subsequently, two distinct branches in the flow cover software compilation and high-level synthesis concurrently. They re-join at the system composition step, where the whole system is automatically built, usually assembling library hardware/software components and application-specific components generated by the previous steps, before the actual hardware synthesis takes place.


Results: efficient and scalable design solutions

syMParallel was benchmarked by measuring the implementation overhead caused by the support for OpenMP constructs. Precisely, we measured the overhead/execution time ratio, i.e. the normalized overhead, relying on the well-known EPCC benchmarks [8]. We compared the normalized overhead with that of an OpenMP implementation for a Windows 7 OS on an Intel i7 processor at 1.8 GHz running the same benchmarks.

Chart a) presents the normalized overheads as the number of threads is increased. The important clue provided by the plot is that, in addition to being low in absolute value, the overhead tends to be flat. This shows the effects of the distributed architecture and synchronization mechanisms implemented by syMParallel, and their impact on the efficiency and scalability of the resulting MPSoC.

Chart b) summarizes the overhead trends for the above constructs, depicting the average slopes of the normalized overhead measured during our experiments. For example, the figure tells us that, on average, the overhead of the #pragma omp parallel firstprivate(var) clause grows by 0.006 per thread, while it grows by 0.147 per thread in the i7 implementation. Again, this provides a convincing demonstration of the scalability of the syMParallel approach compared to the pure software implementation.


We also compared syMParallel with [3], the only solution for OpenMP-to-hardware translation presenting a working tool and some performance figures. Charts c) and d) compare our results with [3] in terms of performance scalability and system frequency. Chart c) refers to a program implementing the Sieve of Eratosthenes algorithm, while Chart d) refers to a program implementing an infinite impulse response filter, as in [3].


The plots confirm the effectiveness of syMParallel, ensuring a satisfactory level of efficiency and scalability, concerning both the application speed-up and the complexity of the generated hardware, which directly affects the clock frequency.


Fast estimates of latency and area occupation for the chosen implementation can be displayed in the form of reports and graphical diagrams, helping the designer visualize and understand the actual system requirements. This allows design choices to be made as early as possible in the development cycle, enabling a fast and effective Design Space Exploration (DSE) process. After the complete hardware platform is generated, the early estimation parameters can be validated against the actual synthesis results, again relying on an intuitive and user-friendly graphical interface. Then, the designer can finally burn the FPGA chip, download the code, and run the application on his custom MPSoC.


Throughout the design process, the dedicated syMParallel console shows the state of the third-party back-end tools, notifying errors and warnings without blocking the Eclipse graphical interface. After coding and compiling OpenMP source code in the Eclipse text editor, users can execute and debug it on the host machine in order to complete the required functional simulation. Then, they can generate the source files taken as input by the next steps by just launching the syMParallel compiler. Several architectural constraints can be specified for the system being built. For example, users can determine the mapping of a given parallel thread by binding it to an embedded processor or to a dedicated, faster hardware accelerator. If an image or video application is being developed, the designer can map a specific portion of the shared memory onto a frame-buffer device: the compiler will automatically generate a system with an integrated display controller that can be used to drive an external display.


The parallel heterogeneous architecture is defined in such a way as to orchestrate its hardware/software components in a distributed fashion, avoiding centralized hardware elements. The memory components may be implemented in different technologies, and indeed they may be mapped to FPGA internal RAM blocks or to off-chip SDRAM memory, enabling the synthesis of real-world systems working on large amounts of data. Shared memory areas are accessed by all subsystems. Each hardware subsystem has its own local memory corresponding to the synthesized registers, while software subsystems can also have a local memory corresponding to the processor cache levels of the memory hierarchy. The atomic registers are the fundamental building block for the implementation of mutual exclusion mechanisms, as they provide specific hardware support for it.
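The exact interface of the atomic registers is not detailed here; purely as a conceptual illustration, the sketch below models a test-and-set register with C11 atomics and uses it to build a simple lock. The atomic_register_t type, the lock functions, and the host-side emulation are assumptions, not the documented hardware interface:

#include <stdatomic.h>
#include <stdio.h>

/* Host-side model of an atomic register used as a test-and-set lock.
   On the real MPSoC this would be a memory-mapped hardware register
   shared by the software and hardware subsystems.                     */
typedef struct {
    atomic_flag flag;
} atomic_register_t;

static void lock_acquire(atomic_register_t *r)
{
    /* Spin until the test-and-set succeeds: exactly one contender at a
       time observes the register as previously clear.                  */
    while (atomic_flag_test_and_set_explicit(&r->flag, memory_order_acquire))
        ; /* busy-wait */
}

static void lock_release(atomic_register_t *r)
{
    atomic_flag_clear_explicit(&r->flag, memory_order_release);
}

static atomic_register_t shared_lock = { ATOMIC_FLAG_INIT };

int main(void)
{
    int counter = 0;

    /* Two host threads emulate two subsystems contending for the lock;
       each runs the whole loop, so the expected final value is 200000. */
    #pragma omp parallel num_threads(2)
    for (int i = 0; i < 100000; i++) {
        lock_acquire(&shared_lock);
        counter++;                     /* critical section */
        lock_release(&shared_lock);
    }

    printf("counter = %d\n", counter);
    return 0;
}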

Data scoping and initialization mechanisms are very important to OpenMP, and are implemented by syMParallel as efficiently as possible. Just as an example, the overhead caused by the OpenMP firstprivate clause only depends on the number of listed variables and does not increase with the number of threads, preserving the application scalability. Another essential aspect of OpenMP is the dynamic scheduling partitioning strategy of the #pragma omp for directive. Supporting this strategy efficiently is vital for load balancing in heterogeneous systems. syMParallel supports it in a completely distributed fashion. Following is a simplified version of the algorithm executed independently by each processing element:

while (iterations are not finished) {
    critical {
        read the iteration_counter
        re-check that iterations are not finished
        update the iteration_counter with (iteration_counter + chunk_size)
    }
    execute the claimed iterations
}

As shown above, the support is completely distributed. Consequently, the threads execute a number of iterations determined at run time, according to the computational power of the unit they are allocated to and the different load they happen to handle, fully implementing the semantics of the dynamic clause.
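A host-side C rendering of the same algorithm is sketched below, assuming the iteration counter lives in shared memory and using an OpenMP critical section in place of the atomic-register-based mutual exclusion of the generated hardware; the loop bound, chunk size, and placeholder work are illustrative:

#include <stdio.h>
#include <omp.h>

#define N          10000   /* total iterations of the parallel loop */
#define CHUNK_SIZE 64      /* iterations claimed per request        */

int main(void)
{
    int iteration_counter = 0;   /* shared: next iteration to be claimed */

    #pragma omp parallel
    {
        for (;;) {
            int start;

            /* Claim the next chunk under mutual exclusion, re-checking the
               termination condition exactly as in the pseudocode above.    */
            #pragma omp critical
            {
                start = iteration_counter;
                if (start < N)
                    iteration_counter = start + CHUNK_SIZE;
            }
            if (start >= N)
                break;

            int end = start + CHUNK_SIZE < N ? start + CHUNK_SIZE : N;
            for (int i = start; i < end; i++) {
                /* execute iteration i (placeholder work) */
            }
        }
    }

    printf("all %d iterations claimed\n", N);
    return 0;
}

Each thread keeps claiming chunk_size iterations until the counter passes the loop bound, so faster units naturally end up executing more chunks.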

[Architecture figure: typical syMParallel system, with software and hardware subsystems, each having its own local memory, plus hardware subsystem peripherals (e.g. a timer), all connected through the communication infrastructure with distributed synchronization and atomic registers; dedicated memory subsystems hold the stack, heap, and text/global segments of each software subsystem (SW S1–S3).]
