3rd 3DDRESD: Polaris

BY

Massimo Morandi

[email protected]

3D DRESD – 29/07/08

Runtime Core Allocation Management for 2D Self Partially and Dynamically Reconfigurable Systems

Rationale and Innovation

Problem statement:
- Providing runtime management support for 2D self partial and dynamic reconfiguration, in particular for Core placement decisions

Innovative contributions:
- A fast and flexible solution
- Low complexity, to avoid introducing too much overhead at runtime
- Support for different scenarios and placement policies, according to user needs
- The possibility of exploiting multiple shapes per Core, through integration with the area constraints definition

Aims

Our proposed solution must support different scenarios, placement policies and intervention from the designer

It must be fast when compared to related solutions existing in literature

The quality of the placement choices must be high, in terms of percentage of placement success, global application completion time or other metrics, as defined by the user

Outline

- Context Definition
- Motivations and Goals
- Specific Contributions to Polaris
- Area Constraints Definition
- Proposed Solution
- Runtime Core Allocation Management
- Features and Structure of an Allocation Manager
- Relevant Works
- Proposed Solution
- Results
- Conclusions and Future Work

Context Definition

Reconfigurable hardware: has the capability of changing its configuration (functionality) according to user needs

Self reconfiguration: the system must be completely autonomous at runtime

Partial reconfiguration: the changes can also involve fractions of the device

Dynamic reconfiguration: if a part of the hardware is reconfigured, the rest can continue its computation

2D reconfiguration: arbitrary rectangular slots can be dynamically reconfigured, as opposed to arbitrary columns in 1D

A bit of Terminology

Motivations and goals

The creation and management of a self partially and dynamically reconfigurable system is a complex problem

- This is even more critical when exploiting the 2D reconfiguration paradigm
- More issues arise in the definition of area constraints and in the core allocation decisions
- Since the system must be autonomous, it also needs runtime management functionalities

Need for automation in those processes:
- to reduce the workload on the designer
- to improve the efficiency of the final reconfigurable system

Motivations and goals

Creation of an automated workflow to generate a self dynamically reconfigurable architecture that:

- Has “good” area constraints assigned to cores
- Is autonomous in performing 2D runtime core allocation decisions
- Exploits relocation to ensure that the system can obtain the configuration bitstreams it needs at runtime
- Supports intervention from the designer, to guide or constrain the decisions
- Keeps high flexibility and generality

Specific Contributions to Polaris

Solution identification phase of the flow:
- The definition of area constraints for Cores, when the user does not specify them
- The creation of Core Allocation Management solutions, able to efficiently manage runtime Core placement

This last task includes:
- Offering high versatility, supporting different placement policies and different scenarios
- Keeping low complexity, to avoid too much overhead in the running time of the system
- Experimenting with techniques to improve the efficiency, for example allowing multiple shapes per Core

Area Constraints Definition

The designer can choose to specify or not the AC for each Core in the application

If not specified, they are automatically computed

The designer can also choose whether to allow multiple shapes per Core (and how many)

Finally, the last parameter represents the tightness of the constraints that will be defined. It impacts:
- the feasibility of the implementation
- the performance of the RFU

CORE RFU (or set of RFUs)

The constraints are defined with a simple heuristic

First a square-like constraint is defined, using these formulae:

where H is the height and W is the width (both in slices), S is the number of slices of the Core and m is the tightness

Then, the constraints are converted from slices to slots

Where Vg is a granularity parameter, Vslices is the number of vertical slices in the device and avgH is the average height of all the RFUs defined with the square-like formula

Finally, the constraints (in slots) are iteratively altered to horizontally or vertically stretch the Core and obtain multiple RFUs
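The stretching step can be illustrated with a small sketch. The slides do not state the actual stretching rule, so the doubling rule below, the function name `stretched_shapes` and its parameters are all assumptions made only for illustration; device bounds are not checked.

```python
import math


def stretched_shapes(h, w, count):
    """Return up to `count` (height, width) shapes, in slots, for one Core.

    Assumed rule (not from the slides): starting from the square-like shape
    (h, w), alternately double the height or the width and shrink the other
    side, rounding up so that every shape covers at least h * w slots.
    """
    area = h * w
    shapes = [(h, w)]
    factor = 2
    while len(shapes) < count and factor <= area:
        tall = (h * factor, math.ceil(area / (h * factor)))   # vertical stretch
        wide = (math.ceil(area / (w * factor)), w * factor)   # horizontal stretch
        for shape in (tall, wide):
            if shape not in shapes and len(shapes) < count:
                shapes.append(shape)
        factor *= 2
    return shapes


# Three shapes for a 4x4-slot Core: the original, one tall, one wide.
print(stretched_shapes(4, 4, 3))
```

Each returned shape would correspond to one RFU of the same Core, so the allocation manager can pick whichever fits the free area at runtime.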

Runtime Core Allocation Management

The Problem:
- Choose where to place new cores on the reconfigurable area
- In an online scenario: self partial and dynamic reconfiguration

The Goal:
- Allow efficient usage of the FPGA area
- Critical in the 2D reconfiguration case

This requires the creation of a solution for allocation management and suitable policies

17

Allocation Manager Desired Allocation Manager Desired FeaturesFeatures

Low Core Rejection Rate (CRR): percentage of cores that are not successfully placed in time

Fast application completion time: time from the arrival of the first Core to the completion of the last

Low fragmentation grade: fraction of area that is unusable because too sparse

Small management overhead: we want a lightweight solution to run inside the system

High routing efficiency: if interacting cores are clustered, the system is more efficient

Need to find a good compromise between them
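As a rough illustration of the first two metrics, the sketch below computes CRR and application completion time from a list of per-Core outcomes. The `PlacementRecord` type and its fields are hypothetical and not part of the presented solution; fragmentation and routing efficiency are left out because the slides do not give a precise formula for them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PlacementRecord:
    """Outcome for one Core (hypothetical record type, for illustration only)."""
    arrival: int              # time instant at which the Core arrived
    finished: Optional[int]   # completion instant, or None if the Core was rejected


def core_rejection_rate(records: List[PlacementRecord]) -> float:
    """Percentage of Cores that were not successfully placed in time."""
    if not records:
        return 0.0
    rejected = sum(1 for r in records if r.finished is None)
    return 100.0 * rejected / len(records)


def completion_time(records: List[PlacementRecord]) -> int:
    """Time from the arrival of the first Core to the completion of the last placed one."""
    done = [r.finished for r in records if r.finished is not None]
    if not done:
        return 0
    return max(done) - min(r.arrival for r in records)


records = [
    PlacementRecord(arrival=0, finished=5),
    PlacementRecord(arrival=1, finished=None),  # rejected
    PlacementRecord(arrival=2, finished=9),
    PlacementRecord(arrival=3, finished=7),
]
print(core_rejection_rate(records), completion_time(records))
```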

Example: 2D fragmentation

The 2D-fragmentation problem:
- Area generally more fragmented
- Can nullify the area optimizations obtained

Example: Core Rejection

Bad choices can lead to performance loss and rejection

A: Core C is successfully placed at step 2

B: Core C is delayed (possibly rejected, if deadline=2)

Considered Scenarios

Dynamic Schedule:
- Cores can arrive at any time
- Cores have an ASAP and an ALAP time (dependencies)
- Rejection: failure to respect ALAP for a Core
- Goal: respect the schedule; CRR is the most important metric and should tend to zero

Blind Schedule:
- Cores can be either available from the start or arrive at different times, no dependencies assumed
- No ASAP; Cores can optionally have a deadline
- If a Core is not placed, retry later
- Goal: the application must complete as fast as possible; rejection is not the main issue, total time is
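The difference between the two scenarios can be summarised as a small status check. This is only a sketch: the function names and the returned labels are hypothetical, and treating a placement exactly at the ALAP time (or deadline) as still valid is an assumption.

```python
from typing import Optional


def dynamic_schedule_status(now: int, asap: int, alap: int) -> str:
    """Dynamic schedule: a Core may only start within [ASAP, ALAP]."""
    if now < asap:
        return "wait"        # dependencies not yet satisfied
    if now <= alap:
        return "placeable"   # assumption: placing exactly at ALAP still respects it
    return "rejected"        # ALAP missed: counts towards CRR


def blind_schedule_status(now: int, deadline: Optional[int] = None) -> str:
    """Blind schedule: no ASAP; a Core without a deadline is simply retried later."""
    if deadline is not None and now > deadline:
        return "rejected"
    return "placeable"


print(dynamic_schedule_status(now=3, asap=2, alap=4))  # inside the window
print(blind_schedule_status(now=100))                  # no deadline, never rejected
```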

Allocation Manager Creation

Choose how to maintain information on empty space

- Keep all information (expensive but more accurate)
- Heuristically prune information (cheaper)

Which placement policy to choose:
- General (First Fit, Best Fit, Worst Fit…)
- Focused (Fragmentation Aware, Routing Aware…)

Define in which scenario(s) the manager will work

It can also be useful to consider and exploit different shapes of a Core (multiple RFUs per Core scenario)
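The general policies named above can be sketched over a plain list of empty rectangles. The tuple layout `(x, y, w, h)` and the use of rectangle area as the "best" and "worst" criterion are assumptions for illustration; the slides do not specify them.

```python
from typing import List, Optional, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, w, h) in slots; layout is an assumption


def fits(rect: Rect, w: int, h: int) -> bool:
    """True if a w x h Core fits inside the empty rectangle."""
    return rect[2] >= w and rect[3] >= h


def first_fit(empty: List[Rect], w: int, h: int) -> Optional[Rect]:
    """First empty rectangle, in list order, that can host the Core."""
    return next((r for r in empty if fits(r, w, h)), None)


def best_fit(empty: List[Rect], w: int, h: int) -> Optional[Rect]:
    """Smallest-area empty rectangle that can host the Core."""
    candidates = [r for r in empty if fits(r, w, h)]
    return min(candidates, key=lambda r: r[2] * r[3], default=None)


def worst_fit(empty: List[Rect], w: int, h: int) -> Optional[Rect]:
    """Largest-area empty rectangle that can host the Core."""
    candidates = [r for r in empty if fits(r, w, h)]
    return max(candidates, key=lambda r: r[2] * r[3], default=None)


empty = [(0, 0, 6, 4), (6, 0, 3, 3), (0, 4, 10, 6)]
print(first_fit(empty, 3, 2), best_fit(empty, 3, 2), worst_fit(empty, 3, 2))
```

All three policies need a single scan of the empty rectangles, which is why they pair well with a lightweight manager.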

Relevant Works

Maintain complete information on empty space:

KAMER: K. Bazargan, R. Kastner and M. Sarrafzadeh, ''Fast template placement for reconfigurable computing systems'', IEEE Design and Test of Computers, Vol.17, 2000.

- Keep All Maximally Empty Rectangles
- Apply a general placement policy

CUR: A. Ahmadinia and C. Bobda and S. P. Fekete and J. Teich and J. v.d. Veen, ''Optimal Routing-Conscious Dynamic Placement for Reconfigurable Devices'', Field-Programmable Logic and Applications (FPL'04), 2004.

- Maintain the Contour of a Union of Rectangles
- Apply a focused placement policy

Relevant Works

Heuristically prune part of the information:

KNER: K. Bazargan, R. Kastner and M. Sarrafzadeh, ''Fast template placement for reconfigurable computing systems'', IEEE Design and Test of Computers, Vol.17, 2000.

- Keep Non-overlapping Empty Rectangles
- Apply a general placement policy

2D-HASHING: H. Walder and C. Steiger and M. Platzner, ''Fast Online Task Placement on FPGAs: Free Space Partitioning and 2D-Hashing'', International Parallel and Distributed Processing Symposium (IPDPS'03), 2003.

Keep Non-overlapping Empty Rectangles in an optimized data structure

Apply (exclusively) a general placement policy
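The trade-off between keeping all maximal empty rectangles and keeping only non-overlapping ones can be seen on a toy case: a w x h Core placed in the bottom-left corner of an otherwise empty W x H device. The corner placement, the `(x, y, w, h)` tuple layout and the function names are assumptions made for this sketch; it is not the code of the cited works.

```python
def maximal_empty_rectangles(W, H, w, h):
    """Complete information: the two maximal rectangles overlap in the top-right corner."""
    return [(w, 0, W - w, H), (0, h, W, H - h)]


def non_overlapping_split(W, H, w, h, vertical=True):
    """Pruned information: cut the L-shaped free area into two disjoint rectangles."""
    if vertical:
        return [(w, 0, W - w, H), (0, h, w, H - h)]   # full-height strip on the right
    return [(w, 0, W - w, h), (0, h, W, H - h)]       # full-width strip on top


def fits_any(rects, cw, ch):
    """True if a cw x ch Core fits in at least one rectangle."""
    return any(rw >= cw and rh >= ch for (_, _, rw, rh) in rects)


# 10x10 device with a 4x4 Core already placed; a new 8x5 Core arrives.
print(fits_any(maximal_empty_rectangles(10, 10, 4, 4), 8, 5))         # complete info
print(fits_any(non_overlapping_split(10, 10, 4, 4, True), 8, 5))      # vertical cut
print(fits_any(non_overlapping_split(10, 10, 4, 4, False), 8, 5))     # horizontal cut
```

With complete information the 8x5 Core is always found a place; with the pruned representation the outcome depends on which cut was chosen, which is the price paid for the cheaper bookkeeping.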

Example: Empty Space Information

Evaluation

- The solutions with higher placement quality also have higher complexity
- The fastest solution cannot exploit focused policies, for example routing aware, and adds the overhead of maintaining the 2D hashing structure
- CUR does not support all general policies, for example Best Fit is not allowed

Proposed Approach

Choice driven by:
- Need for a low complexity solution, to introduce low overhead at runtime in the self reconfigurable system
- Desire to keep high flexibility, to suit user needs also in terms of placement policies

For these reasons we propose a heuristic (KNER-like) empty space manager:

- Supporting general and focused placement policies (in particular, First Fit, Best Fit and Routing Aware)
- Suitable for both dynamic schedule and blind schedule scenarios
- Exploiting multiple RFUs per Core, to improve results

Data Representation

Core, defined by:
- Arrival time
- Set of RFUs, each one with H, W, Latency
- Optional set of communicating Cores (if using the Routing Aware policy)
- ASAP and ALAP (if in dynamic schedule scenario)

Two queues:
- one for new Cores
- one for Cores that were not successfully placed and need reexamination
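A minimal sketch of these records, assuming Python dataclasses and `collections.deque` for the two queues. The field names, the `name` identifier and the example values are hypothetical; only the listed attributes come from the slide.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional


@dataclass
class RFU:
    """One possible shape of a Core."""
    h: int        # height, in slots
    w: int        # width, in slots
    latency: int  # execution latency when this shape is used


@dataclass
class Core:
    name: str                 # hypothetical identifier, not listed on the slide
    arrival: int              # arrival time
    rfus: List[RFU]           # one or more alternative shapes
    communicating: List[str] = field(default_factory=list)  # only for Routing Aware
    asap: Optional[int] = None  # only in the dynamic schedule scenario
    alap: Optional[int] = None  # only in the dynamic schedule scenario


new_cores: Deque[Core] = deque()    # Cores not yet examined
retry_cores: Deque[Core] = deque()  # Cores that failed placement and need reexamination

# Example values are made up: a Core with two shapes arrives and fails its first attempt.
new_cores.append(Core(name="jpeg", arrival=0, rfus=[RFU(4, 4, 10), RFU(8, 2, 12)]))
core = new_cores.popleft()
placed = False  # stand-in for the result of a placement attempt
if not placed:
    retry_cores.append(core)
```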

Data Representation

Reconfigurable Device, represented as:
- Binary Tree structure: each node is a Rectangle, each leaf is an empty Rectangle
- Navigation through:
  - pointers to left child, right child, next leaf
  - a function to find the previous leaf (used for bookkeeping after rectangle split and merge operations)

Rectangle, defined by:
- Coordinates on device: X, Y
- Size: H, W
- Initially one, the root, with (X,Y)=(0,0), H=FPGA Rows, W=FPGA Cols
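The tree with its leaf chain can be sketched as follows. This is an illustrative reading of the slide, not the actual implementation: the class and method names, the corner placement, the fixed vertical cut and the linear search for the previous leaf are assumptions, and the merge operation is omitted.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(eq=False)
class Rectangle:
    x: int
    y: int
    h: int
    w: int
    left: Optional["Rectangle"] = None
    right: Optional["Rectangle"] = None
    next_leaf: Optional["Rectangle"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


class Device:
    def __init__(self, rows: int, cols: int) -> None:
        # Initially a single empty rectangle: the root covers the whole FPGA.
        self.root = Rectangle(x=0, y=0, h=rows, w=cols)
        self.first_leaf: Optional[Rectangle] = self.root

    def leaves(self) -> Iterator[Rectangle]:
        """Walk the empty rectangles by following the next-leaf pointers."""
        node = self.first_leaf
        while node is not None:
            yield node
            node = node.next_leaf

    def previous_leaf(self, leaf: Rectangle) -> Optional[Rectangle]:
        """Find the leaf preceding `leaf` in the chain (None if it is the first)."""
        prev = None
        for node in self.leaves():
            if node is leaf:
                return prev
            prev = node
        raise ValueError("rectangle is not a leaf of this device")

    def split(self, leaf: Rectangle, h: int, w: int) -> None:
        """Occupy an h x w area at (leaf.x, leaf.y); the rest becomes two child leaves.

        Zero-area children are kept for simplicity; they never host a Core.
        """
        if h > leaf.h or w > leaf.w:
            raise ValueError("Core does not fit in this rectangle")
        prev = self.previous_leaf(leaf)
        leaf.left = Rectangle(x=leaf.x + w, y=leaf.y, h=leaf.h, w=leaf.w - w)
        leaf.right = Rectangle(x=leaf.x, y=leaf.y + h, h=leaf.h - h, w=w)
        # Bookkeeping: the two children replace the old leaf in the chain.
        leaf.left.next_leaf = leaf.right
        leaf.right.next_leaf = leaf.next_leaf
        leaf.next_leaf = None
        if prev is None:
            self.first_leaf = leaf.left
        else:
            prev.next_leaf = leaf.left


device = Device(rows=10, cols=8)
device.split(device.root, h=4, w=3)
print([(r.x, r.y, r.h, r.w) for r in device.leaves()])
```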

The Online Placement Algorithm

The whole processing of a Core is completed in linear time
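One way to see why a single pass can suffice: with a fixed, small number of shapes per Core, scanning the empty rectangles once and checking every shape against each costs time linear in the number of empty rectangles. The sketch below is only a possible reading, not the presented algorithm; the function name, the `(x, y, w, h)` layout and the leftover-area (Best Fit) criterion are assumptions.

```python
from typing import List, Optional, Tuple


def place_core(
    empty: List[Tuple[int, int, int, int]],   # empty rectangles as (x, y, w, h)
    shapes: List[Tuple[int, int]],            # alternative Core shapes as (w, h)
) -> Optional[Tuple[int, int]]:
    """Return (rectangle index, shape index) with the least leftover area, or None."""
    best = None  # (leftover_area, rectangle index, shape index)
    for i, (_, _, rw, rh) in enumerate(empty):        # one pass over the empty space
        for j, (sw, sh) in enumerate(shapes):         # bounded number of shapes
            if sw <= rw and sh <= rh:
                leftover = rw * rh - sw * sh
                if best is None or leftover < best[0]:
                    best = (leftover, i, j)
    return None if best is None else (best[1], best[2])


empty = [(0, 0, 2, 8), (2, 0, 5, 3)]
print(place_core(empty, [(4, 4)]))                  # the square shape alone is rejected
print(place_core(empty, [(4, 4), (2, 8), (8, 2)]))  # a stretched shape fits
```

The example also hints at why multiple shapes help CRR: the same Core that is rejected as a square can be placed once a stretched variant is available.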

Evaluation of the proposed solution

To evaluate the quality of the proposed approach in various scenarios and with different metrics, three kinds of experiments were performed:

1) A comparison against the presented literature solutions

- In a dynamic schedule scenario
- With a Routing Aware placement policy
- Measuring CRR (and indirectly fragmentation), routing costs and computational overhead
- Results published in:

M. MORANDI, M. Novati, M. D. Santambrogio, D. Sciuto, “Core allocation and relocation management for a self dynamically reconfigurable architecture”, IEEE Computer Society Annual Symposium on VLSI, 2008

Evaluation of the proposed solution

2) A measure of application completion time

- Applications composed of real Cores used as benchmarks
- In a blind schedule scenario
- Directly measuring application completion time, gaining some insight on CRR and fragmentation

3) Evaluation of the multiple shapes per Core approach

- Comparison between our solution with multiple shapes and KNER (adapted to the blind schedule scenario)
- In a mixed scenario (blind schedule with deadlines and variable arrival times)
- Using both First Fit and Best Fit
- Measure of CRR and running time

Experiment 1: Routing Aware

Version of our general solution:
- Tailored to minimize routing paths
- Compared with close solutions from literature
- Named RALP (Routing Aware Linear Placer) in the table

Benchmark of 100 randomly generated tasks:
- Size from 5% to 20% of the FPGA, randomly interconnected

Experiment 2: Application Completion Time

Benchmark applications composed of cores taken from opencores.org like JPEG, AES, 3DES

Measure the time instants needed to complete the applications with different amounts of resources

The infinite-resources case is also shown, to compare against the lower bound

Experiment 3: Multiple Shapes

- Similar benchmark, but Cores have deadlines (for CRR)
- Shapes defined using the heuristic described previously

- Runtime is on average 30% higher with 3 shapes and 40% higher with 5 shapes, with respect to 1 shape
- CRR is more than halved, often reduced to one third

Numerical Example

To give an idea of the goodness of the obtained results, it is useful to give some numerical values for reconfiguration

Let us consider a JPEG Core, described by a 690 Kb configuration bitstream for a V4 device and using about 10% of the total area

- Reconfiguration time: 150 ms
- Relocation time: 90 ms
- Placement time: 0.4 ms

The placement time is low and suitable for actual use in a real system
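To put the three figures in proportion, the short calculation below compares the placement time with the other two. Summing the three times as if the steps ran strictly one after the other is an assumption made only for this comparison.

```python
reconfiguration_ms = 150.0
relocation_ms = 90.0
placement_ms = 0.4

# Placement time relative to reconfiguration alone.
share_of_reconfiguration = placement_ms / reconfiguration_ms

# Placement time relative to all three steps, assuming they run sequentially.
total_ms = reconfiguration_ms + relocation_ms + placement_ms
share_of_total = placement_ms / total_ms

print(f"{100 * share_of_reconfiguration:.2f}% of the reconfiguration time")
print(f"{100 * share_of_total:.2f}% of the total")
```

Under this assumption the placement decision accounts for well under 1% of the time needed to bring the JPEG Core onto the device.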

Concluding Remarks

The proposed solution offers:
- High versatility, supporting different placement policies and scenarios, designer intervention, multiple shapes
- Low overhead, always processing a Core in linear time and obtaining good results compared with literature
- Good CRR, especially when exploiting multiple shapes
- Fast application completion time, as shown by experiment 2
- Effective routing cost reduction, when used in conjunction with a Routing Aware policy (experiment 1)

The original goals were met

Under Review:

S. Corbetta, M. MORANDI, M. Novati, M. D. Santambrogio, D. Sciuto, P. Spoletini, “Internal and External Bitstream Relocation for Partial Dynamic Reconfiguration”, IEEE Transactions on VLSI (2nd review)

Future Work

Future work will be in the direction of integration with the rest of the workflow that was briefly introduced

The parts that were described achieved good results as stand-alone components in the runtime management of the reconfigurable system; it is important to evaluate them also inside the complete workflow

The final goal is to achieve complete automation in the creation process of a self dynamically reconfigurable architecture, from user specification up to bitstreams and processor code generation

Questions
