Embedded Systems Seminar

Heterogeneous Memory Management for Embedded Systems
By O. Avissar, R. Barua and D. Stewart.
Presented by Kumar Karthik.
Heterogeneous Memory
Heterogeneous = different types of…
Embedded systems come with a small amount of on-chip SRAM, a moderate amount of off-chip SRAM, a considerable amount of off-chip DRAM and large amounts of EEPROM (Flash memory).
Relative RAM Costs and Latencies
Latency: on-chip SRAM < off-chip SRAM < on-chip DRAM < off-chip DRAM
Cost: on-chip SRAM > off-chip SRAM > on-chip DRAM > off-chip DRAM
Caches in Embedded Chips
Caches are power hungry.
Cache miss penalties make it hard to give real-time performance guarantees.
Solution: do away with caches and create a non-overlapping address space for systems with heterogeneous memory units (DRAM, SRAM, EEPROM).
Memory Allocation in ES
Memory allocation for program data is done by the embedded systems programmer, in software, as current compilers are not capable of doing it over heterogeneous memory units.
Code is written in assembly: tedious and non-portable.
Solution: an intelligent compilation strategy that can achieve optimal memory allocation in ES.
Memory Allocation Example
The Need for Profiling
Recall: RAM latencies.
Allocation is optimal if the most frequently accessed code sections are stored in the memory unit with the lowest latency.
Access frequencies of memory references need to be measured.
Solution: profiling.
Intelligent Compilers
The intelligent compiler must be able to:
1. Optimally allocate memory to program data
2. Base memory allocation on frequency estimates collected through profiling
3. Correlate memory accesses with the variables they access
Task 3 demands inter-procedural pointer analysis, which is costly.
Profiling
Instead of pointer analysis, a more efficient statistical method is used: each accessed address is checked against a table of address ranges for the different variables.
This provides exact statistics, as opposed to pointer analysis.
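A minimal sketch of this profiling idea (the variable names, address ranges and trace below are illustrative assumptions, not data from the paper): every simulated access is matched against the address range of the variable it falls into, giving an exact per-variable access count.

```python
# Illustrative sketch: count accesses per variable by matching each
# accessed address against a table of (start, end) address ranges.
from collections import Counter

# Hypothetical address ranges (start inclusive, end exclusive).
ranges = {
    "buf":   (0x1000, 0x1400),
    "coeff": (0x1400, 0x1440),
    "state": (0x1440, 0x1460),
}

def profile(accesses):
    """Return exact per-variable access counts for a trace of addresses."""
    counts = Counter()
    for addr in accesses:
        for var, (lo, hi) in ranges.items():
            if lo <= addr < hi:
                counts[var] += 1
                break
    return counts

trace = [0x1004, 0x1008, 0x1402, 0x1004, 0x1450]
print(profile(trace))  # e.g. Counter({'buf': 3, 'coeff': 1, 'state': 1})
```

Because every address is resolved against the actual ranges, the counts are exact rather than the conservative approximation pointer analysis would give.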
Memory Access Times
The total access time (the sum over all the memory accesses in the program) needs to be minimized.
The formulation is first defined for global variables and then extended to heap and stack variables.
Formulation for Global Variables
Key terms:
T_rj · N_r(v_i) – total time taken for N_r reads of variable i stored on memory unit j.
T_wj · N_w(v_i) – total time taken for N_w writes of variable i stored on memory unit j.
I_j(v_i) – the set of 0/1 integer variables.
Formulation for Global Variables
Total access time
= Σ (j = 1 to U) Σ (i = 1 to G) I_j(v_i) [ T_rj · N_r(v_i) + T_wj · N_w(v_i) ]
U = number of memory units
G = number of variables
T_rj · N_r(v_i) + T_wj · N_w(v_i) contributes to the inner sum only if variable i is stored in memory unit j (if not, I_j(v_i) = 0 and the whole term is 0).
0/1 Integer Linear Program Solver
The 0/1 integer linear program solver tries out all combinations of the summation to arrive at the lowest total memory access time and returns this solution to the compiler.
The solution is the optimal memory allocation.
MATLAB is used as the solver in this paper.
Constraints
The following constraints also hold:
The embedded processor allows at most one memory access per cycle; overlapping memory latencies are not considered.
Every variable is allocated to only one memory unit.
The sum of the sizes of all the variables allocated to a particular memory unit must not exceed the size of the unit.
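As a rough sketch of the objective and the last two constraints (not the paper's MATLAB formulation — the unit latencies, sizes and access counts here are made up), the 0/1 choice can be enumerated by brute force: each variable goes to exactly one unit, capacities must hold, and the cheapest feasible assignment wins.

```python
# Illustrative brute-force version of the 0/1 allocation problem.
# The paper solves the same objective with an integer linear program.
from itertools import product

units = {"sram": {"latency": 1,  "size": 64},     # fast, small
         "dram": {"latency": 10, "size": 1024}}   # slow, large

# Per-variable size and profiled read+write counts (invented numbers).
variables = {"hot":  {"size": 32,  "accesses": 1000},
             "warm": {"size": 40,  "accesses": 100},
             "cold": {"size": 200, "accesses": 10}}

def best_allocation():
    names, unit_names = list(variables), list(units)
    best, best_cost = None, float("inf")
    # Assign each variable to exactly one unit (the I_j(v_i) = 1 choice).
    for assign in product(unit_names, repeat=len(names)):
        used = {u: 0 for u in unit_names}
        cost = 0
        for var, u in zip(names, assign):
            used[u] += variables[var]["size"]
            cost += variables[var]["accesses"] * units[u]["latency"]
        # Capacity constraint: variables must fit in their unit.
        if all(used[u] <= units[u]["size"] for u in unit_names):
            if cost < best_cost:
                best, best_cost = dict(zip(names, assign)), cost
    return best, best_cost

alloc, cost = best_allocation()
print(alloc, cost)  # the hot variable claims the scarce SRAM
```

Brute force is exponential in the number of variables, which is exactly why the paper hands the same 0/1 problem to an ILP solver instead.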
Stack Variables
Extending the formulation to local variables, procedure parameters and return variables (collectively known as stack variables).
Stacks are sequentially allocated abstractions, much like arrays.
Distributing stacks over heterogeneous memory units optimizes memory allocation.
Stack Split Example
Distributed Stacks
Multiple stack pointers: in the example, 2 stack pointers have to be incremented on entry (one for each split of the stack) and 2 have to be decremented on leaving the procedure.
This induces overhead whenever 2 stack pointers have to be maintained.
Distributed Stacks
The software overhead is tolerated for long-running procedures, and eliminated for short procedures by allocating each stack frame to one memory unit (one stack pointer per procedure).
Distributed stacks are implemented by the compiler for ease of use; the abstraction of the stack as a contiguous data structure is maintained for the programmer.
Comparison to Globals
Stack variables have limited lifetimes compared to globals. They are 'live' while a particular procedure is executing and can be garbage collected once the procedure is exited.
Hence variables with non-overlapping lifetimes can share the same address space, and their total size can be larger than that of the memory unit they are stored in.
Formulation for Stack Frames
There are 2 ways of extending the method to handle stack variables.
First alternative: each procedure's stack frame is stored in a single memory unit.
No multiple stack pointers.
Still a distributed stack, as different stack frames may be allocated to different memory units.
Stack-extended Formulation
Total access time = time taken to access global variables + time taken to access stack variables.
The f_i's refer to the functions in the program (as each function has a stack frame).
Constraints
Each stack frame may be stored in at most one memory unit.
The stack reaches its maximum size when a call-graph leaf node is reached.
A call-graph leaf node is the deepest nested procedure called. The program's allocation will fit into memory if all paths to leaf nodes on the call graph fit into memory.
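The leaf-path constraint can be sketched as a walk over the call graph (the function names and frame sizes below are invented for illustration): worst-case stack usage is the costliest root-to-leaf path, so that is the quantity that must fit in memory.

```python
# Worst-case stack usage = deepest-cost path to a call-graph leaf.
frames = {"main": 64, "f": 32, "g": 128, "h": 16}   # frame sizes in bytes
calls  = {"main": ["f", "g"], "f": ["h"], "g": [], "h": []}

def max_stack(fn):
    """Maximum stack bytes live on any path from fn to a leaf."""
    children = calls[fn]
    if not children:                      # call-graph leaf node
        return frames[fn]
    return frames[fn] + max(max_stack(c) for c in children)

print(max_stack("main"))  # 192: main -> g dominates main -> f -> h (112)
```

This assumes an acyclic call graph; recursion would make the maximum unbounded and needs separate handling.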
Stack-extended Formulation
2nd alternative:
Stack variables from the same procedure can be mapped to different memory units.
Stack variables are thus treated like globals, with the total access time given by the same summation.
However, memory requirements are relaxed as in the stack-frame case, based on the disjoint lifetimes of the stack variables.
Heap-extended Formulation
Heap data cannot be allocated statically, as the allocation frequencies and block sizes are unknown at compile time.
Calls such as malloc( ) fall into this category.
Allocation has to be estimated using a good heuristic.
Each static heap allocation site is treated as a variable v in the formulation.
Heap-extended Formulation
The number of references to each site is counted through profiling.
The variable size is bounded as a finite multiple of the total size of memory allocated at that site.
Example: if a malloc( ) site allocates 20 bytes 8 times over in a program, 160 bytes is the size of v, which is multiplied by a safety factor of 2 to give 320 bytes as the allocation size for this site.
Heap-extended Formulation
This optimizes for the common case.
Calls like malloc( ) are cloned for each memory level, each of which maintains a free list.
If the allocation size is exceeded at runtime (the maximum size is passed as a parameter for each call site), a memory block from a slower and larger memory is returned.
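A rough sketch of this fallback behaviour (the level names, capacities and budget below are illustrative assumptions, not the paper's implementation): a call site allocates from its preferred fast level until its byte budget is exhausted, after which requests spill to the next slower, larger level.

```python
# Illustrative tiered allocator: a site's requests are served from its
# preferred (fast) level; overflow falls back to slower, larger levels.
LEVELS = ["sram", "dram", "eeprom"]                 # fastest to slowest
capacity = {"sram": 320, "dram": 10_000, "eeprom": 100_000}
used = {lvl: 0 for lvl in LEVELS}

def site_malloc(size, preferred="sram"):
    """Allocate `size` bytes, starting at `preferred` and falling back."""
    start = LEVELS.index(preferred)
    for lvl in LEVELS[start:]:
        if used[lvl] + size <= capacity[lvl]:
            used[lvl] += size
            return lvl                              # level that served it
    raise MemoryError("all levels exhausted")

# With the 320-byte SRAM budget from the slide's example, the first
# 16 x 20-byte requests fit in SRAM; the 17th spills to DRAM.
levels = [site_malloc(20) for _ in range(17)]
print(levels[-2], levels[-1])  # sram dram
```

Note how the worst case stays bounded: a request never does worse than the slowest level, which matches the latency bound on the next slide.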
Heap-extended Formulation
Latency would be ≤ the latency of the slowest memory.
If real-time guarantees are needed, all heap allocation must be assumed to go to the slowest memory.
Experiment
This compiler was implemented as an extension to the commonly used GCC cross-compiler, targeting the Motorola M-Core processor.
The benchmarks used represent code in typical applications.
The runtimes were normalized against using only the fastest memory type (SRAM), and then slower memories were introduced in subsequent tests to measure runtimes.
Results
Results
Using 20% SRAM and the rest DRAM still produces runtimes close to the all-SRAM case: cheaper, and without much of a performance loss.
This indicates that (at least for the benchmark programs) memory allocation is optimal. FIB, which uses a linear recurrence to compute Fibonacci numbers, is an exception, with an equal number of accesses to all variables.
Experiment 2
Enough DRAM and EEPROM was provided, while the SRAM size was varied for each of the benchmark programs.
This helps determine the minimum amount of SRAM needed to keep performance reasonably close to the 100% SRAM case.
FIR Benchmark

Matrix Multiplication Benchmark

Fibonacci Series Benchmark

Byte-to-ASCII Converter
Results
It is clear that the most frequently accessed code is between 10-20% of the entire program.
This portion of the code is successfully put in SRAM through profile-based optimizations.
Comparing Stack Frames and Stack Variables
Results
The BMM benchmark is used as it has the largest number of functions/procedures (hence the largest number of stack frames/variables).
Allocating stack variables to different units performs better in theory, due to the finer granularity and thus a more custom allocation. The difference is apparent for the smaller SRAM sizes.
Applications
The approach in the paper can be used to determine an optimal trade-off between minimum SRAM size and meeting performance requirements.
Adapting to Pre-emption
In context-switching environments, all data has to be live at any given time on some live memory.
The variables of all the live programs are combined, and the formulation is solved after multiplying the relative frequencies of the contexts with their respective variables' access counts. An optimal allocation is achieved in this case as well.
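The weighting step can be sketched in a few lines (the context names, frequencies and counts are made up for illustration): each context's profiled access counts are scaled by how often that context runs, and the combined counts then feed the usual 0/1 formulation.

```python
# Combine per-context profiles into one weighted access count per variable.
profiles = {
    "ctx_a": {"x": 1000, "y": 10},   # access counts while ctx_a runs
    "ctx_b": {"x": 50,   "y": 500},  # access counts while ctx_b runs
}
freq = {"ctx_a": 0.7, "ctx_b": 0.3}  # relative run frequency of each context

weighted = {}
for ctx, counts in profiles.items():
    for var, n in counts.items():
        # Scale each count by its context's relative frequency.
        weighted[var] = weighted.get(var, 0.0) + freq[ctx] * n

print(weighted)  # {'x': 715.0, 'y': 157.0}
```

Variable x dominates the weighted counts even though y is hotter in ctx_b, because ctx_a runs far more often; that is exactly the effect the frequency weighting is meant to capture.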
Summary
A compiler method to distribute program data efficiently among heterogeneous memories.
Caching hardware is not used.
Static allocation of memory units.
Stack distribution.
Optimality guarantee.
Runtime depends on relative access frequencies.
Related Work
Not much work exists on cache-less embedded chips with heterogeneous memory units.
The memory allocation task is usually left to the programmer.
A compiler method is better for larger, more complex programs.
It is error-free and is also portable to different systems with minor modifications to the compiler.
Related Work
Panda et al. and Sjodin et al. have researched memory allocation in cached embedded chips.
Cached systems spend more effort on minimizing cache misses than on minimizing memory access times, and offer no optimality guarantee.
Earlier studies only take into account 2 memory levels (SRAM and DRAM), while this formulation can be extended to N levels of memory.
Related Work
Dynamic allocation strategies are also possible but are not explored here.
Software caching (emulation of a cache in fast memory) is an option.
Methods to overcome its software overhead need to be devised.
Its inability to provide real-time guarantees should be addressed.

THE END