
A methodology and a tool for source level timing/energy estimation of software

Prof. William Fornaciari, [email protected], home.dei.polimi.it/fornacia
Project Manager and Contact

Prof. Carlo Brandolese, [email protected]
Developer

Politecnico di Milano – DEI Milano, Italy

Outline

Motivation
– Software energy and timing estimation
– Goals

Modeling approach & Toolchains
– Source‐level models
– Dynamic models
– Target processor characterization models

Results
– Estimation
– Analysis

Low power and energy aware design

Practical market issue
• Increasing market share of mobile devices, asking for longer battery life
• Limitations of battery technology

Economic issue
• Reducing packaging costs and achieving energy savings

Technology issue
• Enabling the realization of high‐density chips (heat poses severe constraints to reliability)

Motivation ‐ Application scenarios

[Figure: application scenarios placed among three design targets (high performance, low power, low energy). Scenario labels: large data centers or data‐intensive; general purpose and multimedia; reliability and battery‐supplied; smartphones, Wireless Sensor Networks, mobile multimedia]

Software energy and timing estimation

Evaluation of the performance of an embedded system
– Must consider the specific contributions related to software execution
  • User application
  • Operating System & BSPs
  • User data
– Should be performed as early as possible
  • To allow exploring different implementation alternatives
– Should not depend on the availability of the target platform
  • ISS‐based solutions do exist
  • Better if performed directly on the host machine
– Should provide data at the level at which the application is conceived
  • The source code level
– Should be as fast and accurate as possible
  • A fast estimation toolchain allows automated design space exploration

Goals

The SWAT Project(*) has three main goals
– Define a general estimation approach
  • Execution time models
  • Energy consumption models
– Define low‐level, target‐independent software metrics
  • To enable a more comprehensive analysis of the application
  • To support the designer in optimizing the application
– Implement tools to support the models
  • As modular and flexible as possible

(*) The SWAT Project is part of the EU‐Funded Project IP 247999 Project COMPLEX: "COdesign and power Management in PLatform‐based design space Exploration"

Modeling approach

Energy consumption and execution time of an embedded application depend on
– The structure of the application
  • Its source code and, consequently, its assembly translation
– The stimuli
  • Specific data that the application processes
– The execution platform
  • The target microprocessor
  • The operating system, if any
  • The third‐party libraries

All these aspects
– Are complex to model as a whole with sufficient generality
– Should be "decoupled" and modelled independently

Modeling approach

The SWAT methodology decouples the different aspects

Source code model
– Captures the structure of the code at a high level of abstraction
– No dependence on data or platform is accounted for

Stimuli model
– Collects a set of representative execution profiles

Execution platform
– The target microprocessor is modeled at ISA level
– The operating system and all third‐party libraries are statistically characterized at binary level, as source code may not be available

Source‐level models

To model the source code, several representations have been used in the literature
– Concrete and abstract syntax trees
– Static metrics borrowed from software engineering analyses
– Black‐box function‐level analysis
– Intermediate representation

SWAT models the source
– Using the LLVM Compiler Infrastructure
– Starting from the LLVM intermediate representation
– Adapting it to the specific purpose
  • For static modelling, several semantic aspects are irrelevant and the original LLVM code is simplified
  • For dynamic modelling, the LLVM code is augmented with suitable probes

Source‐level models

The simplified LLVM code is called the "basic‐block model"

f.c

    int sumsqr( int* x, int n ) {
        int i, t = 0;
        for( i = 0; i < n; i++ )
            t += x[i] * x[i];
        return t;
    }

llvm-gcc produces f.ll

    define i32 @sumsqr(i32* %x, i32 %n) nounwind {
    ...
    bb7:
      %8 = load i32* %i, align 4
      %9 = load i32** %1, align 8
      %10 = getelementptr inbounds i32* %9, i64 %9
      ...
    }

swat-bbmodel produces f.bbmodel (basic‐block model)

    sumsqr:6:22:27:25:bb7:332:124:158::-:-:-:(25,load,-,-,-)(25,load,-,-,-)(25,getelementprt,-,-,-)...

Each record carries back‐annotation info, followed by the static & dynamic energy and timing fields (not yet available at this stage, hence the "-" placeholders).

Source‐level models

Formally, the basic‐block model is expressed by a representation matrix, where
– Each row corresponds to a basic block
– Each column corresponds to an LLVM instruction

Each element of the representation matrix
– Indicates how many LLVM instructions of type j are present in basic block i
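A minimal sketch of how such a representation matrix could be built, assuming each basic block is already available as a list of LLVM opcode names. The block contents, the variable names and the function `representation_matrix` are illustrative assumptions and are not part of the SWAT tools.

```python
from collections import Counter

# Hypothetical basic blocks: block name -> list of LLVM opcodes (illustrative data only).
basic_blocks = {
    "entry": ["alloca", "store", "br"],
    "bb7": ["load", "load", "getelementptr", "load", "mul", "add", "store", "br"],
    "exit": ["load", "ret"],
}

def representation_matrix(blocks):
    """Return (block_names, opcodes, R), where R[i][j] is the number of
    instructions of opcode j found in basic block i."""
    block_names = list(blocks)
    # Columns: every opcode seen in any block, in alphabetical order.
    opcodes = sorted({op for ops in blocks.values() for op in ops})
    matrix = []
    for name in block_names:
        counts = Counter(blocks[name])
        matrix.append([counts[op] for op in opcodes])
    return block_names, opcodes, matrix

block_names, opcodes, R = representation_matrix(basic_blocks)
for name, row in zip(block_names, R):
    print(name, row)
```

Rows follow the order of the basic blocks and columns follow the alphabetical order of the opcodes, so the matrix stays purely static: it depends neither on input data nor on the target processor.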

Dynamic model

Dynamic models account for data dependence
– Collect profiling information by executing the code
  • On the host platform, since the LLVM code can be targeted to the host
  • On a target ISS, if available (very slow)
  • On the actual target, using in‐circuit debugging facilities

For OS‐less applications, host execution is preferable
– Must provide stubs for emulating devices

The result of execution is a basic‐block trace
– The trace can be stored for an analysis over time
– The trace can be compacted to provide execution counts only
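Compacting a trace into execution counts can be sketched as follows, assuming the trace is simply the sequence of basic‐block names in execution order. The trace contents and the function name `compact_trace` are assumptions made for illustration; the actual SWAT trace format is not described here.

```python
from collections import Counter

def compact_trace(trace):
    """Collapse a time-ordered basic-block trace into per-block execution counts.
    The ordering information is lost; only the counts are kept."""
    return Counter(trace)

# Hypothetical trace: one pass through 'entry', ten loop iterations, one 'exit'.
trace = ["entry"] + ["bb7"] * 10 + ["exit"]
counts = compact_trace(trace)
print(dict(counts))
```

The full trace is needed only for analyses over time; for energy and timing totals, the compacted counts are sufficient and far smaller.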

Dynamic model

SWAT provides very flexible tracing support
– Can trace basic blocks, functions, values, stack, ...

Tracing is based on a three‐step process
– Meta‐instrumentation
  • Adds tags to the LLVM code containing structured information about it
  • Tags are added as comments, so the resulting code is still LLVM‐compliant
  • This phase is independent of the information to be traced
– Instrumentation expansion
  • Expands tags into specific probes (function calls)
  • Uses a custom, awk‐like set of expansion rules
– Execution
  • Produces the dynamic information for the specific stimulus or set of stimuli
  • Data can be dumped as a time trace or collected as counts in memory

Dynamic model

The dynamic model "enriches" the basic‐block model

[Figure: dynamic modelling flow. f.ll → swat-minstr → swat-mexpand → f.i.ll → compile/link → execute → a.trace; swat-todyn combines a.trace with f.bbmodel to produce the enriched f.bbmodel]

f.bbmodel (basic‐block model with instruction‐level energy and timing figures)

    sumsqr:6:22:27:25:bb7:332:124:158::20.377:13.239:10:203.774:132.393:(25,load,-,4.331,2.973)(25,load,-,4.331,2.973)(25,getelementprt,-,6.104,4.610)...

Each record carries back‐annotation info, followed by the static & dynamic energy and timing figures.

Dynamic model

Formally, the basic‐block execution counts are represented as a vector in which each element is associated with a basic block.

The resulting LLVM instruction counts are obtained by combining this vector with the representation matrix: each element of the result is the cumulative execution count of a specific LLVM instruction.
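The relation between basic‐block counts and LLVM instruction counts can be written compactly as follows. The symbol names ($\mathbf{n}$, $R$, $\mathbf{c}$, $B$, $L$) are assumptions introduced for this sketch and are not necessarily the authors' notation.

```latex
% B = number of basic blocks, L = number of LLVM instruction types (assumed names)
% r_{ij} = number of LLVM instructions of type j in basic block i
\[
\mathbf{n} = (n_1, \dots, n_B)^{T}, \qquad
R = [\, r_{ij} \,] \in \mathbb{N}^{B \times L}, \qquad
\mathbf{c} = R^{T}\mathbf{n}, \qquad
c_j = \sum_{i=1}^{B} n_i \, r_{ij}, \quad j = 1, \dots, L
\]
```

Here $n_i$ is the execution count of basic block $i$ and $c_j$ is the cumulative execution count of LLVM instruction $j$.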

Dynamic model

To target the model to the specific processor we define a "translation matrix", where
– Each row corresponds to an LLVM instruction
– Each column corresponds to an instruction of the target set

Each element of the translation matrix
– Indicates how many target instructions of type j are statistically used by the compiler to translate an LLVM instruction of type i

Dynamic model

The energy and timing model of an LLVM instruction for the specific target instruction set can thus be expressed in terms of
– The LLVM instruction
– The target instructions used to translate it
– The estimated execution time of each target instruction
– The estimated energy of each target instruction
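One way to write this model down, assuming the estimates of an LLVM instruction are the translation‐weighted sums of the target instruction figures. The symbols $T_{ij}$, $t_j$, $e_j$, $\hat{t}_i$, $\hat{e}_i$ and $K$ are assumed names for this sketch.

```latex
% i = LLVM instruction, j = target instruction, K = number of target instructions
% T_{ij} = translation matrix coefficient (row: LLVM instruction, column: target instruction)
% t_j, e_j = execution time and energy of target instruction j
\[
\hat{t}_i = \sum_{j=1}^{K} T_{ij} \, t_j, \qquad
\hat{e}_i = \sum_{j=1}^{K} T_{ij} \, e_j
\]
```

$\hat{t}_i$ and $\hat{e}_i$ are the estimated execution time and energy of LLVM instruction $i$ on the chosen target.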

Dynamic model

Combining all the equations described so far yields the total estimated energy and execution time of the code

In other words, combining
– Execution count per each basic block
– LLVM instructions per each basic block
– Target instructions per each LLVM instruction
– Energy and execution time per each target instruction

...leads to the overall estimates
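The combination of the four ingredients can be sketched in a few lines, assuming the estimates are plain weighted sums. All numbers, the instruction classes and the function names below are made up for illustration; they are not SWAT data or SWAT APIs.

```python
def transpose(matrix):
    """Return the transpose of a list-of-rows matrix."""
    return [list(column) for column in zip(*matrix)]

def mat_vec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def estimate(block_counts, R, T, target_time, target_energy):
    """Total time and energy from block counts, representation matrix R,
    translation matrix T and per-target-instruction figures."""
    llvm_counts = mat_vec(transpose(R), block_counts)  # executions per LLVM instruction
    llvm_time = mat_vec(T, target_time)                # time per LLVM instruction
    llvm_energy = mat_vec(T, target_energy)            # energy per LLVM instruction
    total_time = sum(c * t for c, t in zip(llvm_counts, llvm_time))
    total_energy = sum(c * e for c, e in zip(llvm_counts, llvm_energy))
    return total_time, total_energy

# Hypothetical data: 2 basic blocks, 3 LLVM instructions (load, add, br),
# 2 target instruction classes (memory op, ALU op).
block_counts = [1, 10]
R = [[1, 0, 1],
     [2, 1, 1]]
T = [[1.0, 0.5],   # load -> 1 memory op + 0.5 ALU op on average
     [0.0, 1.0],   # add  -> 1 ALU op
     [0.0, 1.0]]   # br   -> 1 ALU op
target_time = [2.0, 1.0]
target_energy = [3.0, 1.0]

total_time, total_energy = estimate(block_counts, R, T, target_time, target_energy)
print(total_time, total_energy)
```

Because every step is a matrix or vector product, changing the stimuli only changes `block_counts`, and retargeting only changes `T`, `target_time` and `target_energy`, which mirrors the decoupling the methodology aims for.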

Target processor characterization

As we have seen
– The model of the source code is expressed as LLVM code
– The actual execution involves instructions of the target processor

The translation matrix links these two levels
– It expresses the "average" way the compiler translates an LLVM instruction into one or more target instructions

Target processor characterization

The translation of an instance of an LLVM instruction
– Depends on the context and on the optimizations selected

The coefficients collected in the translation matrix
– Can only represent an "average" translation
– Should be derived by analyzing a very large training set

The characterization procedure is structured in steps
– Select a suitably large set of source code benchmarks
– Translate each source into LLVM
  • Count LLVM instruction frequency
– Translate each source into the target assembly
  • Count target instruction frequency
– Build a multilinear problem and solve it in the least‐squares sense

Target processor characterization

Formally, consider
– The set of training source codes
– The count of LLVM instructions of type j in source code i
– The count of target instructions of type k in source code i

The desired multilinear model expresses each target instruction count as a linear combination of the LLVM instruction counts of the same source code, and can be written in compact matrix form.
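A possible formalization of this multilinear model, with assumed symbol names ($S$, $a_{ij}$, $b_{ik}$, $T_{jk}$, $A$, $B$, $L$) that are not necessarily those used by the authors:

```latex
% S = set of training source codes, i \in S
% a_{ij} = count of LLVM instructions of type j in source code i
% b_{ik} = count of target instructions of type k in source code i
% T_{jk} = translation coefficient (row: LLVM instruction j, column: target instruction k)
\[
b_{ik} \approx \sum_{j=1}^{L} a_{ij} \, T_{jk}
\qquad \Longleftrightarrow \qquad
B \approx A \, T,
\qquad
T = \arg\min_{T} \; \lVert A\,T - B \rVert_F^{2}
\]
```

The minimization is the unconstrained least‐squares fit; it allows every LLVM instruction to contribute to every target instruction.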

Target processor characterization

This model is, however, too general
– It assumes that the translation of each LLVM instruction may potentially involve all target instructions!

Realistically, the translation will only use a subset of the target instructions
– Knowing the semantics of both instruction sets, it is easy to define a reasonable subset of target instructions

This is modeled by a set of binary vectors, where a 1 indicates that
– LLVM instruction j
– Might potentially be translated using also target instruction k

Target processor characterization

To integrate these hypotheses into the model, only the coefficients allowed by the hypothesis vectors are retained.

From a "physical" point of view
– Negative coefficients have no meaning
– Coefficients must be constrained to be positive or null

The problem is thus a constrained least‐squares problem. Each solution vector collects the coefficients of the model of LLVM instruction j, and the translation matrix is simply obtained by collecting these solution vectors.
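A minimal sketch of one such constrained fit, for the counts of a single target instruction. It uses projected gradient descent, a generic technique chosen here for brevity and not the authors' solver (the toolchain relies on octave/matlab). The training counts, the hypothesis mask and the function name are assumptions made for illustration.

```python
def constrained_least_squares(A, b, mask, iterations=2000):
    """Minimise ||A x - b||^2 subject to x[j] >= 0, with x[j] forced to 0
    wherever mask[j] == 0, using projected gradient descent."""
    rows, cols = len(A), len(A[0])
    # The squared Frobenius norm of A bounds the largest eigenvalue of A^T A,
    # so 1 / that bound is a safe (conservative) step size.
    step = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * cols
    for _ in range(iterations):
        residual = [sum(A[i][j] * x[j] for j in range(cols)) - b[i]
                    for i in range(rows)]
        gradient = [sum(A[i][j] * residual[i] for i in range(rows))
                    for j in range(cols)]
        # Gradient step, then projection onto the feasible set.
        x = [max(0.0, x[j] - step * gradient[j]) if mask[j] else 0.0
             for j in range(cols)]
    return x

# Hypothetical training set: 4 source codes x 3 LLVM instructions (load, add, br).
A = [[4, 2, 1],
     [2, 6, 3],
     [8, 1, 2],
     [3, 3, 5]]
# Counts of one target instruction in the same 4 source codes,
# generated here from the coefficients (1.0, 0.5, 0.0).
b = [5.0, 5.0, 8.5, 4.5]
# Hypothesis vector: this target instruction may come from load or add, never from br.
mask = [1, 1, 0]

coefficients = constrained_least_squares(A, b, mask)
print([round(c, 6) for c in coefficients])
```

Repeating the fit for every target instruction and collecting the resulting coefficients gives all the entries of the translation matrix. A dedicated non‐negative least‐squares solver would normally replace this loop on realistic training sets.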

Target processor characterization

This characterization process
– Is almost totally automatic
  • Only the translation "hypothesis" vectors must be coded manually
– Has been implemented in the SWAT toolchain

[Figure: characterization flow. The training source set (*.c) is compiled with llvm-gcc and target-gcc to obtain llvm.count and target.count; swat-characterize combines them with the manual input model (llvm.translations), using octave/matlab, to produce target.model]

The SWAT toolchain

SWAT collects a set of analysis tools
– Organized into several toolchains or flows

In particular:
– Target characterization flow
– Estimation & back‐annotation flow
– Analysis flow
– Generic tracing toolchain

A few prototype optimization tools and flows have also been implemented
– Operating modes optimizer
– Application parameters exploration and optimization
– High‐level transformation "hint" generator

The SWAT toolchain

Toolchains are organized as follows
– Front‐End: Source code modelling
– Core: Performs specific analyses
– Back‐End: Post‐processing & report generation

[Figure: SOURCES → SWAT Front‐End → Source Code Models → Core Flow 1, Core Flow 2, ..., Core Flow N → Estimates & Reports]

Estimation results

Energy estimation errors (WCET Benchmarks)

[Figure: estimation error (%) per benchmark; average: ‐2.32]

[Figure: estimation error (%, absolute value) per benchmark; average: +5.98]

Benchmarks: cover, bs, bsort100, fibcall, matmult, cnt, compress, recursion, edn, expint, lcdnum

Estimation results

Data dependency of energy estimation errors (WCET Benchmarks)
– Same benchmark, different data, namely the size of the array to be compressed

[Figure: estimation error (%) over a horizontal axis ranging from 0 to 160; average: ‐3.74]

Analysis Results

– Application summary & structure
– Function models
– Estimates overview & detailed charts
– Dynamic analysis: Basic‐block execution count
– Trace analysis: Stack size over time

Work in progress

SWAT opt. Toolchain (1)
– Based on the same front‐end as the estimation flow and using the results extracted from the dynamic models and the execution traces, the main optimization flow applies a set of fuzzy rules to selected portions of the application to suggest the most promising transformations to apply.

[Figure: sources and cpu → SWAT Front‐End → static & dynamic models → build fitness function → fitness functions → fuzzy inferential engine (with fuzzy rules) → optimization hints]

Work in progress

SWAT opt. Toolchain (2)
– A second optimization flow explores compiler optimization options in order to determine the transformation mix that best fits the specific application. To this purpose, the MOST design space exploration engine is used to generate sets of transformation mixes for the LLVM optimizer until the best optimization recipe is found.

[Figure: sources and cpu → SWAT Front‐End → llvm‐opt (driven by the optimization mix) → SWAT Back‐End → energy and timing estimates → MOST DSE engine (COMPLEX design space exploration) → optimization recipe]

Moving up to system level: RTRM (BBQ)

[Figure: hierarchical Run‐Time Resource Manager (BBQ). Labels: scenarios, requirements aggregation, resources accounting, P P P (symmetric), D1 D2 DN (asymmetric)]

Conclusions

Software characterization is of paramount importance for embedded mobile applications
There is room for optimization at source level
Pressure is also for moving up the perspective
• Time adaptability
• Retargetability
• System‐wide resources management

For more info
• Prof. William Fornaciari
• [email protected]
• home.dei.polimi.it/fornacia

Thank you!
