Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
A methodology and a tool for source level timing/energy estimation of software
P f Willi F i i f i @ l li i i h d i li i i /f iProf. William Fornaciari, [email protected], home.dei.polimi.it/fornaciaProject Manager and Contact
Prof Carlo Brandolese brandole@elet polimi itProf. Carlo Brandolese, [email protected] Developer
Politecnico di Milano – DEI Milano, Italy
OutlineMotivation– Software energy and timing estimationgy g– Goals
Modeling approach & Toolchains– Source‐level modelsSource‐level models– Dynamic models– Target processor characterization models– Target processor characterization models
Res ltsResults– Estimation
l– Analysis
2
Low power and energy aware designPractical market issue• Increasing market share of mobile, gasking for longer cruising life
• Limitations of battery technology
Economic issue• Reducing packaging costs and achieving energy savings
Technology issue• Enabling the realization of high‐density chips g g y p(heat poses severe constraints to reliability)
3
Motivation ‐ Application scenarios
Large data centers, or data‐intensive
High performance Low power
Low energy
General purpose, and multimedia Reliability, and
battery‐supplied
Smartphones, Wireless Sensor Networks
4
pmobile multimedia
Software energy and timing estimationEvaluation of the performance of an embedded system– Must consider specific contributions realted to software executionp
• User application• Operating System & BSPs• User data
– Should be performed as early as possibleT ll l i diff i l i l i• To allow exploring different implementation alternatives
– Should not depend on the availability of the target platform• ISS based solutions do exist• ISS‐based solutions do exist• Better if performed directly on the host machine
– Should provide data at level at which the application is conceivedShould provide data at level at which the application is conceived• The source code level
– Should be as fast and accurate as possiblep• A fast estimation toolchain allows automated design space exploration
5
GoalsThe SWAT Project(*) aims at three main goals– Defines a general estimation approachg pp– Defines
• Execution time models• Energy consumption models
– Defines low‐level, target independent software metrics• To enable a more comprehensive analysis of the application • To support the designer in optimizing the application
I l t t l t t th d l– Implements tools to support the models• As modular and flexible as possible
(*) The SWAT Project is part of the EU‐Funded Project IP 247999 Project COMPLEX:"COdesing and power Management in PLatform‐based design space Exploration"
6
Modeling approachEnergy consumption and execution time of an embedded application depend on– The structure of the application
• Its source code and, consequently, its assembly translation
– The stimuli• Specific data that the application processes
– The execution platform• The target microprocessor
h i if• The operating system, if any• The third‐party libraries
All these aspectsAll these aspects– Are complex to model as a whole with sufficient generalitySh ld b "d l d" d d ll d i d d l– Should be "decoupled" and modelled independently
7
Modeling approachThe SWAT methodology decouples the different aspectsSource code modelSource code model– Captures the structure of the code at a high level of abstraction– No dependence on data or platform is accounted forNo dependence on data or platform is accounted for
Stimuli modelCollects set of representative execution profiles– Collects set of representative execution profiles
Execution platform– The target microprocessor is modeled at ISA level– The operating system and all third‐party are statistically
h i d i bi l l dcharacterized operating at binary level, as source codemay not be available
8
Source‐level modelsTo model the source code, several representations have been used in literature– Concrete and abstract syntax trees– Static metrics mutuated from software engineering analysesg g y– Black‐box function‐level analysis– Intermediate representationp
SWAT models the source– Using the LLVM Compiler Infrastructure– Using the LLVM Compiler Infrastructure – Starting from the LLVM intermediate representationAdapting it to the specific purpose– Adapting it to the specific purpose• For static modelling, several semantic aspects are irrelevant and the originlLLVM code is simplified
• For dynamic modelling, the LLVM code is augmented with suitable probes
9
Source‐level modelsThe simplified LLVM code is called "Basic‐block model"
int sumsqr( int* x, int n ) {
f.c
int i, t = 0;for( i = 0; i < n; i++ )
t += x[i] * x[i]return t;
}
llvm-gccdefine i32 @sumsqr(i32* %x, i32 %n) nounwind {
...bb7:
}
f.ll
bb7:%8 = load i32* %i, align 4 %9 = load i32** %1, align 8 %10 = getelementptr inbounds i32* %9, i64 %9 ...
swat-bbmodel
sumsqr:6:22:27:25:bb7:332:124:158::-:-:
}
Back‐annotation info
St ti & d i
f.bbmodel
: : ::-:-:-:(25,load,-,-,-)(25,load,-,-,-)(25,getelementprt,-,-,-)
Static & dynamic energyand timing (not yet available)
Basic‐block model
10
...]
Source‐level modelsFormally the basic‐block model is expressed by the representation matrix:
Where– Each row corresponds to a basic‐block– Each column corresponds to an LLVM instruction
The element of the representation matrixp– Indicate how many LLVM instructions of type j are present in basic block i
Dynamic modelDynamic models accont for data dependence– Collect profiling information by executing the codep g y g
• On the host platform, since the LLVM code can be targeted to the host • On an targtet ISS, if available (very slow)• On the actual target, using in‐circuit debugging facilities
For OS‐less application host execution is preferrable– Must provide stubs for emulating devices
The result of execution is a basic‐block traceThe result of execution is a basic‐block trace– The trace can be stored for an analysis over timeThe trace can be compacted to provide execution counts only– The trace can be compacted to provide execution counts only
12
Dynamic modelSWAT provides a very flexible tracing support– Can trace basic‐blocks, functions, values, stack, ...
Tracing is based on a three‐steps process– Meta‐instrumentationMeta instrumentation
• Add tags to the LLVM code containing structured information about it• Tags are added as comments, so the resulting code is still LLVM‐compliant• This phase is independent from the information to be traced
– Instrumentation expansion• Expands tags into specific probes (function calls)• Uses a custom, awk‐like, set of expansion rules
E i– Execution• Produces the dynamic information for the specific simulus or set of stimuli• Data can be dumped in time‐trace or collected as counts in memory• Data can be dumped in time‐trace or collected as counts in memory
13
Dynamic modelThe dynamic model "enriches" the basic‐block model
f.ll f.bbmodelf.ll
swat-minstr
f.bbmodel
swat-mexpand
f i llf.i.ll
compilelink
sumsqr:6:22:27:25:bb7:332:124:158::20.377:13.239:
Back‐annotation info
St ti & d i
execute
a.trace: . : . ::10:203.774:132.393:(25,load,-,4.331,2.973)(25,load,-,4.331,2.973)(25,getelementprt,-,6.104,4.610)
Static & dynamicenergy and timing
Basic‐block model withinstruction‐level energyand timing figures
swat-todyn
14
...]
g g
f.bbmodel
Dynamic modelFormally the basic‐block counts are represented as
Wh h l t i i t d t b i bl kWhere each element is associated to a basic blockThe resulting LLVM instruction counts are thus:
Where each element is the cumulative execution count of a specific LLVM instructionof a specific LLVM instruction
15
Dynamic modelTo target the model to the specific processor we define a "translation matrix"
Where– Each row corresponds to an LLVM instruction– Each column corresponds to an instruction of the target set
The element of the representation matrixp– Indicate how many target instructions of type j are statistically used by the compiler to translate an LLVM instruction of type i
16
Dynamic modelThe energy and timing model of LLVM istruction for the specific target instruction set can thus be expressed as
WhereWhere – is an LLVM instruction
i t t i t ti– is a target instruction– indicates the estimated execution time
i di h i d– indicates the estimated energy
17
Dynamic modelCombining all the equations described so far yields the total estimated energy and execution time of the code
In other words, combining, g– Excution count per each basic‐block– LLVM instruction per each basic‐blockLLVM instruction per each basic block– Target instruction per each LLVM instruction– Energy and execution time per each target instructionEnergy and execution time per each target instruction
...leads to the overall estimates
18
Target processor characterizationAs we have seen– The model of the source code is expressed as LLVM code– The actual execution involves instructions of the target processor
The translation matrix links these two levels– It expresses the "average" way the compiler translates
LLVM i t t t i t tian LLVM into one or more target instruction
As an exampleAs an example:
19
Target processor characterizationThe translation of an instance of an LLVM instruction– Depends on the context and optimizations selected p p
The coefficients collects in the translation matrix– Can only represent an "average" translationCan only represent an average translation– Should be derived analyzing a very large training set
The characterization procedure is structured in stepsThe characterization procedure is structured in steps– Select a suitably large set of source code benchmarksT l t h i t LLVM– Translate each source into LLVM• Count LLVM instruction frequency
Transalte each source into the target assembly– Transalte each source into the target assembly• Count LLVM instruction frequency
– Bulid a multilinear problem and solve it in least‐square senseBulid a multilinear problem and solve it in least square sense
20
Target processor characterizationFormally, let– be the set of training source codesg– the count of LLVM instruction of type j
in the source code i– the count of target instruction of type k
in the source code i
The desired multilinear model can thus be expressed as
I t t i fIn compact matrix form:
21
Target processor characterizationThis model is though too general– It assumes that the translation of each LLVM instruction may ypotentially involve all target instructions!
Realistically, the translation will only use a subset of y, ytarget instruction– Knowing the semantics of both instruction sets, is easy tog , ydefine a reasonable subset of target instructions
This is modeled by the set of vectorsy
where a 1 indicates thatwhere a 1 indicates that – LLVM instruction jMi h b i ll l d i l i i k– Might be potentially traslated using also target instruction k
22
Target processor characterizationTo integrate these hypotheses into the model:
From a "physical" point of viewNegative coefficients have no meaning– Negative coefficients have no meaning
– Coefficients must be constrained to be positive or null
Th t i d bl i th dThe constrained problem is thus expressed as
Each solution vector collects the coefficients of the model of LVVM instruction jjThe translation matrix is thus simply given by
23
Target processor characterizationThis characterization process – Is almost totally automaticy
• Only the translation “hypothesis " vectors must be coded manually
– Has been implemented in the SWAT toolchain
*.c
Training source set
llvm-gccllvm.translations
Manual input model
target-gcc
llvm.count target.count
t h t i t / tl bswat-characterize
target.model
octave/matlab
target.model
The SWAT toolchainSWAT collects a set of tools analysis– Organized into several toolchains or flowsg
In particular:– Target characterization flowTarget characterization flow– Estimation & back‐annotation flow– Analysis flowAnalysis flow– Generic tracing toolchain
A few prototyping optimization tools and flows have alsoA few prototyping optimization tools and flows have also been implementedO ti d ti i– Operating modes optimizer
– Application parameters exploration and optimizationHi h l l f i "hi "– High‐level transformation "hint" generator
25
The SWAT toolchainToolchains are organized as follows– Front‐End: Source code modellingg– Core: Perform specific analyses– Back‐End: Post‐processing & Report generationp g p g
SOURCES
SWAT Front‐End
Source Code Models
Estimates & Reports
Core Flow NCore Flow 1 Core Flow 2
26
Estimates & Reports
Estimation resultsEnergy estimation errors (WCET Benchmarks)
Estimation error (%)
0
5
10
‐10
‐5
0
Average: ‐2.32
‐15cover bs bsort100 fibcall matmult cnt compress recursion edn expint lcdnum
Estimation error (%, absolute value)
810
1214
( , )
2
46
8Average: +5,98
27
0cover bs bsort100 fibcall matmult cnt compress recursion edn expint lcdnum
Estimation resultsData dependency of energy estimation errors (WCET Benchmarks)– Same benchmark different data, namely the size of the array to be compressed
Estimation error (%)
2
4
6
6
‐4
‐2
0
Average: ‐3,74
‐10
‐8
‐6
0 20 40 60 80 100 120 140 160
28
Analysis ResultsApplication summary & structure
29
Analysis ResultsFunction models
30
Analysis ResultsEstimates overview & detailed charts
31
Analysis ResultsDynamic analysis: Basic‐block execution count
32
Analysis ResultsTrace analysis: Stack size over time
33
Work in progressSWAT opt. Toolchain (1)– Based on the same front-end of
sources
Based on the same front end ofthe estimation flow and usingthe results extracted from the
SWAT Front‐Endcpu
dynamic models and theexecution traces, the mainoptimization flow apply a set of
static & dynamic models
optimization flow apply a set offuzzy rules on selected portionsof the application to suggest the
build fitness function
of the application to suggest themost promising transformationsto apply.
fitness functions
inferential enginefuzzy fuzzy rules
optimization hints
34
Work in progressSWAT opt. Toolchain (2)– A second optimization flow
sources
A second optimization flowexplores compileroptimization options in order
SWAT Front‐Endcpu
to determine thetransformation mix that bestfits the specific application To
llvm‐opt optimization mix
fits the specific application. Tothis purpose, the MOSTdesign space exploration
SWAT Back‐End
energy and timing
MOSTDSE engine(COMPLEX) design space exploration
engine is used to generatesets of transformation mixes
energy and timingestimates
for the LLVM optimizer untilthe best optimization recipe isfound
optimizationrecipe
found.
35
Moving up to system level: RTRM (BBQ)
Scenarios
RequirementsAggregationhi
cal
Run‐Time Resource Manager (BBQ)
gg gPg
Hierarch
A t i
ResourcesAccounting
P P P
Symmetric
D1 D2 DNAsymmetric
36
Retargetability
ConclusionsSoftware characterization is of paramount importance for embedded mobile applicationsThere is room for optimization at source levelPressure is also for moving up the perspectivePressure is also for moving up the perspective• Time adaptability• Retargetability• Retargetability• System‐wide resources management
• For more info Thank you!
• Prof. William Fornaciari• [email protected]• home.dei.polimi.it/fornacia
37