Embedded Systems Seminar

Heterogeneous Memory Management for Embedded Systems
By O. Avissar, R. Barua and D. Stewart.
Presented by Kumar Karthik.
Heterogeneous Memory
Heterogeneous = different types of…
Embedded systems come with a small amount of on-chip SRAM, a moderate amount of off-chip SRAM, a considerable amount of off-chip DRAM and large amounts of EEPROM (Flash memory).
Relative RAM Costs and Latencies
Latency: on-chip SRAM < off-chip SRAM < on-chip DRAM < off-chip DRAM
Cost: on-chip SRAM > off-chip SRAM > on-chip DRAM > off-chip DRAM
Caches in Embedded Chips
Caches are power hungry.
Cache miss penalties make it hard to give real-time performance guarantees.
Solution: do away with caches and create a non-overlapping address space for systems with heterogeneous memory units (DRAM, SRAM, EEPROM).
Memory Allocation in ES
Memory allocation for program data is done by the embedded systems programmer, in software, as current compilers are not capable of doing it over heterogeneous memory units.
Code is written in assembly: tedious and non-portable.
Solution: an intelligent compilation strategy that can achieve optimal memory allocation in ES.
Memory Allocation Example
The Need for Profiling
Recall: RAM latencies.
Allocation is optimal if the most frequently accessed code sections are stored in the memory unit with the lowest latency.
Access frequencies of memory references need to be measured.
Solution: profiling.
Intelligent Compilers
The intelligent compiler must be able to:
1. Optimally allocate memory to program data
2. Base memory allocation on frequency estimates collected through profiling
3. Correlate memory accesses with the variables they access
Task 3 demands inter-procedural pointer analysis, which is costly.
Profiling
Instead of pointer analysis, a more efficient statistical method is used: each accessed address is checked against a table of address ranges for the different variables.
This provides exact statistics, as opposed to pointer analysis.
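A minimal sketch of this profiling idea (the variable names, address ranges and trace below are illustrative assumptions, not data from the paper): every simulated access is matched against the address range of the variable it falls into, giving an exact per-variable access count.

```python
# Illustrative sketch: count accesses per variable by matching each
# accessed address against a table of (start, end) address ranges.
from collections import Counter

# Hypothetical address ranges (start inclusive, end exclusive).
ranges = {
    "buf":   (0x1000, 0x1400),
    "coeff": (0x1400, 0x1440),
    "state": (0x1440, 0x1460),
}

def profile(accesses):
    """Return exact per-variable access counts for a trace of addresses."""
    counts = Counter()
    for addr in accesses:
        for var, (lo, hi) in ranges.items():
            if lo <= addr < hi:
                counts[var] += 1
                break
    return counts

trace = [0x1004, 0x1008, 0x1402, 0x1004, 0x1450]
print(profile(trace))  # e.g. Counter({'buf': 3, 'coeff': 1, 'state': 1})
```

Because every address is resolved against the actual ranges, the counts are exact rather than the conservative approximation pointer analysis would give.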
Memory Access Times
The total access time (the sum over all the memory accesses in the program) needs to be minimized.
The formulation is first defined for global variables and then extended to heap and stack variables.
Formulation for Global Variables
Key terms:
T_rj · N_r(v_i) – total time taken for N_r reads of variable i stored on memory unit j.
T_wj · N_w(v_i) – total time taken for N_w writes of variable i stored on memory unit j.
I_j(v_i) – the set of 0/1 integer variables.
Formulation for Global Variables
Total access time
= Σ (j = 1 to U) Σ (i = 1 to G) I_j(v_i) [ T_rj · N_r(v_i) + T_wj · N_w(v_i) ]
U = number of memory units
G = number of variables
T_rj · N_r(v_i) + T_wj · N_w(v_i) contributes to the inner sum only if variable i is stored in memory unit j (if not, I_j(v_i) = 0 and the whole term is 0).
0/1 Integer Linear Program Solver
The 0/1 integer linear program solver tries out all combinations of the summation to arrive at the lowest total memory access time and returns this solution to the compiler.
The solution is the optimal memory allocation.
MATLAB is used as the solver in this paper.
Constraints
The following constraints also hold:
The embedded processor allows at most one memory access per cycle; overlapping memory latencies are not considered.
Every variable is allocated to only one memory unit.
The sum of the sizes of all the variables allocated to a particular memory unit must not exceed the size of the unit.
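As a rough sketch of the objective and the last two constraints (not the paper's MATLAB formulation — the unit latencies, sizes and access counts here are made up), the 0/1 choice can be enumerated by brute force: each variable goes to exactly one unit, capacities must hold, and the cheapest feasible assignment wins.

```python
# Illustrative brute-force version of the 0/1 allocation problem.
# The paper solves the same objective with an integer linear program.
from itertools import product

units = {"sram": {"latency": 1,  "size": 64},     # fast, small
         "dram": {"latency": 10, "size": 1024}}   # slow, large

# Per-variable size and profiled read+write counts (invented numbers).
variables = {"hot":  {"size": 32,  "accesses": 1000},
             "warm": {"size": 40,  "accesses": 100},
             "cold": {"size": 200, "accesses": 10}}

def best_allocation():
    names, unit_names = list(variables), list(units)
    best, best_cost = None, float("inf")
    # Assign each variable to exactly one unit (the I_j(v_i) = 1 choice).
    for assign in product(unit_names, repeat=len(names)):
        used = {u: 0 for u in unit_names}
        cost = 0
        for var, u in zip(names, assign):
            used[u] += variables[var]["size"]
            cost += variables[var]["accesses"] * units[u]["latency"]
        # Capacity constraint: variables must fit in their unit.
        if all(used[u] <= units[u]["size"] for u in unit_names):
            if cost < best_cost:
                best, best_cost = dict(zip(names, assign)), cost
    return best, best_cost

alloc, cost = best_allocation()
print(alloc, cost)  # the hot variable claims the scarce SRAM
```

Brute force is exponential in the number of variables, which is exactly why the paper hands the same 0/1 problem to an ILP solver instead.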
Stack Variables
Extending the formulation to local variables, procedure parameters and return variables (collectively known as stack variables).
Stacks are sequentially allocated abstractions, much like arrays.
Distributing stacks over heterogeneous memory units optimizes memory allocation.
Stack Split Example
Distributed Stacks
Multiple stack pointers: in the example, 2 stack pointers have to be incremented on entry (one for each split of the stack) and 2 have to be decremented on leaving the procedure.
This induces overhead whenever 2 stack pointers have to be maintained.
Distributed Stacks
The software overhead is tolerated for long-running procedures, and eliminated for short procedures by allocating each stack frame to one memory unit (one stack pointer per procedure).
Distributed stacks are implemented by the compiler for ease of use; the abstraction of the stack as a contiguous data structure is maintained for the programmer.
Comparison to Globals
Stack variables have limited lifetimes compared to globals. They are 'live' while a particular procedure is executing and can be garbage collected once the procedure is exited.
Hence variables with non-overlapping lifetimes can share the same address space, and their total size can be larger than that of the memory unit they are stored in.
Formulation for Stack Frames
There are 2 ways of extending the method to handle stack variables.
First alternative: each procedure's stack frame is stored in a single memory unit.
No multiple stack pointers.
Still a distributed stack, as different stack frames may be allocated to different memory units.
Stack-extended Formulation
Total access time = time taken to access global variables + time taken to access stack variables.
The f_i's refer to the functions in the program (as each function has a stack frame).
Constraints
Each stack frame may be stored in at most one memory unit.
The stack reaches its maximum size when a call-graph leaf node is reached.
A call-graph leaf node is the deepest nested procedure called. The program's allocation will fit into memory if all paths to leaf nodes on the call graph fit into memory.
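The leaf-path constraint can be sketched as a walk over the call graph (the function names and frame sizes below are invented for illustration): worst-case stack usage is the costliest root-to-leaf path, so that is the quantity that must fit in memory.

```python
# Worst-case stack usage = deepest-cost path to a call-graph leaf.
frames = {"main": 64, "f": 32, "g": 128, "h": 16}   # frame sizes in bytes
calls  = {"main": ["f", "g"], "f": ["h"], "g": [], "h": []}

def max_stack(fn):
    """Maximum stack bytes live on any path from fn to a leaf."""
    children = calls[fn]
    if not children:                      # call-graph leaf node
        return frames[fn]
    return frames[fn] + max(max_stack(c) for c in children)

print(max_stack("main"))  # 192: main -> g dominates main -> f -> h (112)
```

This assumes an acyclic call graph; recursion would make the maximum unbounded and needs separate handling.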
Stack-extended Formulation
2nd alternative:
Stack variables from the same procedure can be mapped to different memory units.
Stack variables are thus treated like globals, with the total access time given by the same summation.
However, memory requirements are relaxed as in the stack-frame case, based on the disjoint lifetimes of the stack variables.
Heap-extended Formulation
Heap data cannot be allocated statically, as the allocation frequencies and block sizes are unknown at compile time.
Calls such as malloc( ) fall into this category.
Allocation has to be estimated using a good heuristic.
Each static heap allocation site is treated as a variable v in the formulation.
Heap-extended Formulation
The number of references to each site is counted through profiling.
The variable size is bounded as a finite multiple of the total size of memory allocated at that site.
Example: if a malloc( ) site allocates 20 bytes 8 times over in a program, 160 bytes is the size of v, which is multiplied by a safety factor of 2 to give 320 bytes as the allocation size for this site.
Heap-extended Formulation
This optimizes for the common case.
Calls like malloc( ) are cloned for each memory level, each of which maintains a free list.
If the allocation size is exceeded at runtime (the maximum size is passed as a parameter for each call site), a memory block from a slower and larger memory is returned.
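A rough sketch of this fallback behaviour (the level names, capacities and budget below are illustrative assumptions, not the paper's implementation): a call site allocates from its preferred fast level until its byte budget is exhausted, after which requests spill to the next slower, larger level.

```python
# Illustrative tiered allocator: a site's requests are served from its
# preferred (fast) level; overflow falls back to slower, larger levels.
LEVELS = ["sram", "dram", "eeprom"]                 # fastest to slowest
capacity = {"sram": 320, "dram": 10_000, "eeprom": 100_000}
used = {lvl: 0 for lvl in LEVELS}

def site_malloc(size, preferred="sram"):
    """Allocate `size` bytes, starting at `preferred` and falling back."""
    start = LEVELS.index(preferred)
    for lvl in LEVELS[start:]:
        if used[lvl] + size <= capacity[lvl]:
            used[lvl] += size
            return lvl                              # level that served it
    raise MemoryError("all levels exhausted")

# With the 320-byte SRAM budget from the slide's example, the first
# 16 x 20-byte requests fit in SRAM; the 17th spills to DRAM.
levels = [site_malloc(20) for _ in range(17)]
print(levels[-2], levels[-1])  # sram dram
```

Note how the worst case stays bounded: a request never does worse than the slowest level, which matches the latency bound on the next slide.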
Heap-extended Formulation
Latency would be ≤ the latency of the slowest memory.
If real-time guarantees are needed, all heap allocation must be assumed to go to the slowest memory.
Experiment
This compiler was implemented as an extension to the commonly used GCC cross-compiler, targeting the Motorola M-Core processor.
The benchmarks used represent code in typical applications.
The runtimes were normalized against using only the fastest memory type (SRAM), and then slower memories were introduced in subsequent tests to measure runtimes.
Results
Results
Using 20% SRAM and the rest DRAM still produces runtimes close to the all-SRAM case: cheaper, and without much of a performance loss.
This indicates that (at least for the benchmark programs) memory allocation is optimal. FIB, which uses a linear recurrence to compute Fibonacci numbers, is an exception, with an equal number of accesses to all variables.
Experiment 2
Enough DRAM and EEPROM was provided, while the SRAM size was varied for each of the benchmark programs.
This helps determine the minimum amount of SRAM needed to keep performance reasonably close to the 100% SRAM case.
FIR Benchmark

Matrix Multiplication Benchmark

Fibonacci Series Benchmark

Byte-to-ASCII Converter
Results
It is clear that the most frequently accessed code is between 10-20% of the entire program.
This portion of the code is successfully put in SRAM through profile-based optimizations.
Comparing Stack Frames and Stack Variables
Results
The BMM benchmark is used as it has the largest number of functions/procedures (hence the largest number of stack frames/variables).
Allocating stack variables to different units performs better in theory, due to the finer granularity and thus a more custom allocation. The difference is apparent for the smaller SRAM sizes.
Applications
The approach in the paper can be used to determine an optimal trade-off between minimum SRAM size and meeting performance requirements.
Adapting to Pre-emption
In context-switching environments, all data has to be live at any given time on some live memory.
The variables of all the live programs are combined, and the formulation is solved after multiplying the relative frequencies of the contexts with their respective variables' access counts. An optimal allocation is achieved in this case as well.
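The weighting step can be sketched in a few lines (the context names, frequencies and counts are made up for illustration): each context's profiled access counts are scaled by how often that context runs, and the combined counts then feed the usual 0/1 formulation.

```python
# Combine per-context profiles into one weighted access count per variable.
profiles = {
    "ctx_a": {"x": 1000, "y": 10},   # access counts while ctx_a runs
    "ctx_b": {"x": 50,   "y": 500},  # access counts while ctx_b runs
}
freq = {"ctx_a": 0.7, "ctx_b": 0.3}  # relative run frequency of each context

weighted = {}
for ctx, counts in profiles.items():
    for var, n in counts.items():
        # Scale each count by its context's relative frequency.
        weighted[var] = weighted.get(var, 0.0) + freq[ctx] * n

print(weighted)  # {'x': 715.0, 'y': 157.0}
```

Variable x dominates the weighted counts even though y is hotter in ctx_b, because ctx_a runs far more often; that is exactly the effect the frequency weighting is meant to capture.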
Summary
A compiler method to distribute program data efficiently among heterogeneous memories.
Caching hardware is not used.
Static allocation of memory units.
Stack distribution.
Optimality guarantee.
Runtime depends on relative access frequencies.
Related Work
Not much work exists on cache-less embedded chips with heterogeneous memory units.
The memory allocation task is usually left to the programmer.
A compiler method is better for larger, more complex programs.
It is error-free and is also portable to different systems with minor modifications to the compiler.
Related Work
Panda et al. and Sjodin et al. have researched memory allocation in cached embedded chips.
Cached systems spend more effort on minimizing cache misses than on minimizing memory access times, and offer no optimality guarantee.
Earlier studies only take into account 2 memory levels (SRAM and DRAM), while this formulation can be extended to N levels of memory.
Related Work
Dynamic allocation strategies are also possible but are not explored here.
Software caching (emulation of a cache in fast memory) is an option.
Methods to overcome its software overhead need to be devised.
Its inability to provide real-time guarantees should be addressed.

THE END