Internal Memory & Cache
Chapter 15
C6000 Integration Workshop
Copyright © 2005 Texas Instruments. All rights reserved.
Technical Training Organization
Outline Why Cache? Cache Basics Cache Example (Direct-Mapped) C6211/C671x Internal Memory ‘C64x Internal Memory Overview Additional Memory/Cache Topics Using the C Optimizer Lab 15
Parking Dilemma
Sports
Arena
Close Parking 0 minute walk 10 spaces $100/space
Distant Parking-Ramp
10 minute walk
10 minute walk 1000 spaces $5/space
Parking Choices: 0 minute walk @ $100 for close-in parking 10 minute walk @ $5 for distant parking
or … Valet parking: 0 minute walk @ only $6.00
How does this compare to cache memory?
Why Cache?
Sports
Arena
Bulk Memory
Fast Memory
Memory Choices: Small, fast memory Large, slow memory
or … Use Cache: combines the advantages of both; like the valet, data movement is automatic
Fast and small, but works like big, fast memory
Outline Why Cache? Cache Basics Cache Example (Direct-Mapped) C6211/C671x Internal Memory ‘C64x Internal Memory Overview Additional Memory/Cache Topics Using the C Optimizer Lab 15
CPU
InternalRAM
EDMA
Using Internal RAM
External Memory
EMIF
Program: func1, func2, func3
0x8000
Before executing functions (e.g. func1), they must be transferred to internal memory
The programmer has to set this up
If all functions can't fit at once, it becomes more complicated (i.e., overlays)
CPU
Cache
Cache H/W
Using Cache Memory
External Memory
EMIF
Program: func1, func2, func3
0x8000
Cache hardware automatically transfers code/data to internal memory, as needed
Addresses in the Memory Map are associated with locations in cache
Cache locations do not have their own addresses
Let's start with the basic concepts of a cache …
Cache: Block, Line, Index
External Memory
0x8000
Conceptually, a cache divides the entire memory into blocks equal to its size
0x8010
0x8020
Block
Cache
0
...
0xF
A cache is divided into smaller storage locations called lines
The term index (or line number) is used to specify a specific cache line
How do we know which block is cached?
Cache Tags
External Memory
0x8000
A Tag value keeps track of which block is associated with a cache line
0x8010
0x8020
Index: 0, Tag: 800
...
Index: 0xF
What could be better than one tag per block?
Each line has its own tag; thus, the whole cache won't be erased when lines from different memory blocks need to be cached simultaneously
Index: 0, Tag: 800
Index: 1, Tag: 801
...
Index: 0xF
How do we know a cache line is valid (or not)?
Valid Bits
External Memory
0x8000
A Valid bit keeps track of which lines contain “real” information
They are set by the cache hardware whenever new code or data is stored
0x8010
0x8020
Index: 0, Tag: 800, Valid: 1
Index: 1, Tag: 801, Valid: 1
...
Index: 0xF, Tag: 721, Valid: 0
This type of cache is called ...
Direct-Mapped Cache
External Memory
0x8000
0x8010
0x8020
Index Cache
0
...
0xF
Block
A direct-mapped cache associates an address within each block with one cache line
Thus, there will be only one unique cache index for any address in the memory map
Only one block can have information in a cache line at any given time
Let's look at an example ...
Outline Why Cache? Cache Basics Cache Example (Direct-Mapped) C6211/C671x Internal Memory ‘C64x Internal Memory Overview Additional Memory/Cache Topics Using the C Optimizer Lab 15
Direct-Mapped Cache Example
External Memory
0x8000
0x8010
0x8020
0x8030
Address fields: Tag = bits 15:4, Index = bits 3:0
Let's examine an arbitrary direct-mapped cache example: a 16-line, direct-mapped cache requires a 4-bit index
If our example µP used 16-bit addresses, this leaves us with a 12-bit tag
Arbitrary Direct-Mapped Cache Example
The following example uses: a 16-line cache, 16-bit addresses, and stores one 32-bit instruction per line
C6000 caches have different cache and line sizes than this example; it is only intended as a simple example to reinforce cache concepts
Conceptual Example Code

Address  Code
0003h    L1  LDH
0004h        MPY
0005h        ADD
0006h        B L2
0026h    L2  ADD
0027h        SUB cnt
0028h        [!cnt] B L1

Address fields: Tag = bits 15:4, Index = bits 3:0
Direct-Mapped Cache Example

Tracing the example code through the cache:
Fetch 0003h (LDH): index 3 is empty, a Compulsory Miss; LDH is cached at index 3 with tag 000, valid = 1
Fetch 0004h-0006h (MPY, ADD, B L2): three more compulsory misses fill indexes 4-6 with tag 000
Fetch 0026h (L2: ADD): index 6 already holds tag 000 (the B L2 instruction), a Conflict Miss; the line is evicted and refilled with tag 002
Fetch 0027h-0028h (SUB, [!cnt] B L1): indexes 7-8 fill with tag 002
Branch back to L1: 0003h-0005h hit, but 0006h (B L2) finds tag 002 at index 6, another conflict miss; the line thrashes back to tag 000, and each pass around the loop repeats the 0006h/0026h conflict
Notes: This example was contrived to show how cache lines can thrash
Code thrashing is minimized on the C6000 due to relatively large cache sizes
Keeping code in contiguous sections also helps to minimize thrashing
Let's review the two types of misses that we encountered
Types of Misses
Compulsory: a miss when first accessing a new address
Conflict: a line is evicted upon access of an address whose index is already cached. Solutions: change the memory layout; allow more lines for each index
Capacity (we didn't see this in our example): a line is evicted before it can be re-used because the capacity of the cache is exhausted. Solution: increase cache size
How can we visualize these misses?
CacheTune
(Display: memory locations on the vertical axis vs. time (# of instructions executed) on the horizontal axis, with cache hits and cache misses color-coded)
Outline Why Cache? Cache Basics Cache Example (Direct-Mapped) C6211/C671x Internal Memory
L1 Program Cache L1 Data Cache L2 Memory
‘C64x Internal Memory Overview Additional Memory/Cache Topics Using the C Optimizer Lab 15
C6211/C671x Internal Memory
(Diagram: CPU with Program Cache (L1P) and Data Cache (L1D) at Level 1, Internal RAM or Cache (L2) at Level 2, and external memory via the EMIF at Level 3)
We often refer to a system’s memory in hierarchical levels
Higher levels (L1) are closer to the CPU
CPU always requests from highest level memory …
… If address isn’t present in L1, cache h/w gets it from lower level
Looking more closely at L1P …
L1P Cache
External Memory
CPU
ProgramCache (L1P)
L2 EMIF
for( i = 0; i < 10; i++ ) { sum += x[i] * y[i];}
Cache is always on; direct-mapped cache
Works exceptionally well for DSP code (which tends to have many loops)
Can be placed to minimize thrashing
How big is the cache?
The cache is 4K bytes; each line stores 16 instructions (linesize = 16)
What does linesize mean?
New Term: Linesize
External Memory
0x8000
0x8010
0x8020
Block
Cache
0
...
0xF
In our earlier cache example: size = 16 bytes, linesize = 1 byte, # of indexes = 16
How else could it be configured?
We have now changed it to: size = 16 bytes, linesize = 2 bytes, # of indexes = 8
What's the advantage of a greater linesize? Speed! When the cache retrieves one item, it gets another at the same time.
Outline Why Cache? Cache Basics Cache Example (Direct-Mapped) C6211/C671x Internal Memory
L1 Program Cache L1 Data Cache L2 Memory
‘C64x Internal Memory Overview Additional Memory/Cache Topics Using the C Optimizer Lab 15
Caching Data
for( i = 0; i < 4; i++ ) { sum += x[i] * y[i];}
One instruction may access multiple data elements:
Tag Data Cache
0
4K
What would happen if x and y ended up at the following addresses?
x = 0x8000y = 0x9000
External Memory
x
y
They would end up overwriting each other in the cache; this is called thrashing
Increasing the associativity of the cache will reduce this problem
How do you increase associativity?
Increased Associativity
External Memory
0x08000
0x10800
0x11000
0x11800
Way 0 (2K): Valid, Tag, Data
Way 1 (2K): Valid, Tag, Data
Split a direct-mapped cache in half; each half is called a cache way
Multiple ways make data caches more efficient
C671x/C621x L1D dimensions: 4K bytes, 2 ways, 32-byte linesize
What is a Set?
External Memory
0x8000
The lines from each way that map to the same index form a set
0x8008
0x8010
0x8018
Data Cache
0
0
the set at index zero, i.e., Set 0
The number of lines per set defines the cache as an N-way set-associative cache
Set 1
Now, how many unique locations are there for each memory address?
With 2 ways, there are now 2 unique cache locations for each memory address
Outline Why Cache? Cache Basics Cache Example (Direct-Mapped) C6211/C671x Internal Memory
L1 Program Cache L1 Data Cache L2 Memory
‘C64x Internal Memory Overview Additional Memory/Cache Topics Using the C Optimizer Lab 15
'11/'12 Internal Memory
Level 1 caches: single-cycle access; always enabled; L2 accessed on a miss
Level 2 memory: unified (program or data); configure L2 as cache or addressable RAM
L2 to L1D delivers 32 bytes in 4 cycles; L2 to L1P delivers 16 instructions in 5 cycles
(Diagram: CPU with L1 Program (4KB) and L1 Data (4KB), L2 Program & Data (64K bytes), and an 8/16/32/64-bit EMIF)
If cache is so great, why is L2 configurable?
(C6713: L2 memory is 256K bytes)
If L2 didn't have addressable RAM: peripheral data would require external storage, and both the EDMA and CPU must tie up the EMIF to store and retrieve it
Cache
External Memory
Enhanced DMA (EDMA), EMIF
Peripheral Port
C6000 Level 2 - Flexible & Efficient
Mapped as RAM
Cache
External Memory, EMIF
Peripheral Port
Enhanced DMA (EDMA)
Configure L2 as cache and/or mapped RAM; this allows peripheral data, or critical code and data, to be stored on-chip
L2 Memory Configuration
Four 16KB blocks: configure each as cache or addressable RAM
Each additional cache block provides another cache way
L2 is unified memory and can hold program or data
C6713: still has 4 configurable 16KB RAM/cache blocks; the remaining 192KB is always RAM
The possible configurations:
RAM 0 / RAM 1 / RAM 2 / RAM 3 (hardware default)
RAM 0 / RAM 1 / RAM 2 / Way 1 (dsk6711.cdb template default)
RAM 0 / RAM 1 / Way 1 / Way 2
RAM 0 / Way 1 / Way 2 / Way 3
Way 1 / Way 2 / Way 3 / Way 4
Configuring L2 Cache with CDB
Outline Why Cache? Cache Basics Cache Example (Direct-Mapped) C6211/C671x Internal Memory
L1 Program Cache L1 Data Cache L2 Memory
‘C64x Internal Memory Overview Additional Memory/Cache Topics Using the C Optimizer Lab 15
C64x Internal Memory
(Diagram: CPU with L1 Program (16KB) and L1 Data (16KB), L2 Program & Data (1M bytes), and an 8/16/32/64-bit EMIF)
L1 Program Cache: direct-mapped (1 way), single-cycle access, size = 16K bytes, linesize = 8 instructions
L1 Data Cache: 2-way cache, single-cycle access, size = 16K bytes, linesize = 64 bytes
Level 2 Memory: C6414/15/16 = 1M byte; C6411/DM642 = 256K bytes
C64x L2 Memory Configuration
When cache is enabled, it's always 4-way (this differs from C671x)
Linesize = 128 bytes (same linesize as C671x)
Performance: L2 to L1P takes 1-8 cycles; L2 to L1D: L2 SRAM hit = 6 cycles, L2 cache hit = 8 cycles, pipelined = 2 cycles
L2 ways are configurable in size: 0, 32K, 64K, 128K, 256K
Outline Why Cache? Cache Basics Cache Example (Direct-Mapped) C6211/C671x Internal Memory ‘C64x Internal Memory Overview Additional Memory/Cache Topics
Data Mem Bank Cache Optimization Cache Coherency MAR Registers
Using the C Optimizer Lab 15
‘C641x L1D Memory Banks
(Diagram: eight banks of 512x32 RAM; consecutive data interleaves across banks 0-7)
#pragma DATA_MEM_BANK(a, 0);
#pragma DATA_MEM_BANK(x, 4);
Only one access allowed per bank per cycle
Use DATA_MEM_BANK to make sure that arrays that will be accessed in parallel start in different banks
DATA_MEM_BANK(var, 0 or 2 or 4 or 6)

#pragma DATA_MEM_BANK(a, 0);
short a[256] = {1, 2, 3, … };
#pragma DATA_MEM_BANK(x, 4);
short x[256] = {256, 255, 254, … };

#pragma UNROLL(2);
#pragma MUST_ITERATE(10, 100, 2);
for(i = 0; i < count; i++) {
    sum += a[i] * x[i];
}

A memory-bank-aware specialization of DATA_ALIGN: it optimizes variable placement to account for the way internal memory is organized
Outline Why Cache? Cache Basics Cache Example (Direct-Mapped) C6211/C671x Internal Memory ‘C64x Internal Memory Overview
Data Mem Bank Cache Optimization Cache Coherency MAR Registers
Using the C Optimizer Lab 15
Cache Optimization
Optimize for Level 1: multiple ways and wider lines maximize efficiency (we did this for you!)
Main goal: maximize line reuse before eviction
Algorithms can be optimized for cache
"Touch loops" can help with compulsory misses
Up to 4 write misses can happen sequentially, but the next read or write will stall
Be smart about data output by one function then read by another (touch it first)
Updated Cache Documentation
Cache Reference Guides for C621x/C671x (SPRU609) and C64x (SPRU610): replaces the "Two-Level Internal Memory" chapter in the Peripherals Reference Guide; more comprehensive description of C6000 cache; revised terminology for cache coherence operations
Cache User's Guide for C6000 (SPRU656): cache basics, using C6000 cache, optimizing for cache performance
Outline Why Cache? Cache Basics Cache Example (Direct-Mapped) C6211/C671x Internal Memory ‘C64x Internal Memory Overview Additional Memory/Cache Topics
Data Mem Bank Cache Optimization Cache Coherency MAR Registers
Using the C Optimizer Lab 15
Coherency Example: Description
(Diagram: CPU with L1D and L2, and external memory holding RcvBuf and XmtBuf; the EDMA moves data in and out)
For this example, L2 is set up as cache
Example's data flow: the EDMA fills RcvBuf; the CPU reads RcvBuf, processes the data, and writes to XmtBuf; the EDMA moves data from XmtBuf (e.g., to a D/A converter)
EDMA Writes Buffer
Buffer (in external memory) written by the EDMA
CPU Reading Buffers
CPU reads the buffer for processing This read causes a cache miss in L1D and L2
RcvBuf is added to both caches: space is allocated in each cache, and RcvBuf data is copied to both caches
Where Does EDMA Read From?
The EDMA reads from the external memory address where XmtBuf is located
Will the EDMA get the right data? Where does the new data reside?
Where Does the CPU Write To?
After processing, the CPU writes to XmtBuf
Write misses to L1D are written directly to the next level of memory (L2); thus, the write does not go directly to external memory
Cache lines are allocated in: L1D on reads only; L2 on reads or writes
A Coherency Issue
The EDMA is set up to transfer the buffer from external memory
But the buffer resides in cache, not in external memory
So the EDMA transfers whatever is in external memory, which is probably not what you wanted
There are three solutions to data coherency ...
Solution 1: Flush & Clear the Cache
When the CPU is finished with the data (and has written it to XmtBuf in L2), it can be sent to ext. memory with a cache writeback
A writeback is a copy operation from cache to memory
CSL (Chip Support Library) provides an API for writeback:
CACHE_wbL2((void *)XmtBuf, bytecount, CACHE_WAIT);
Before looking at the next solution, there's one other coherency issue ...
Another Coherency Issue
The EDMA writes a new RcvBuf buffer to external memory
When the CPU reads RcvBuf, a cache hit occurs, since the buffer (with old data) is still valid in cache
Thus, the CPU reads the old data instead of the new
Another Coherency Solution
To get the new data, you must first invalidate the old data before trying to read the new data (clears cache line’s valid bits)
CSL provides an API to writeback with invalidate: it writes back modified (i.e., dirty) data, then invalidates the cache lines containing the buffer
CACHE_wbInvL2((void *)RcvBuf, bytecount, CACHE_WAIT);
A quick summary of L2 writeback/invalidate functions …
L2 Cache Coherence Operations
For block operations, only the lines in L1D or L1P with addresses corresponding to the addresses of L2 operations are affected
Careful: the cache always invalidates/writes back whole lines. To avoid unexpected coherency problems, align buffers on a cache-line-sized boundary and make each buffer's size a multiple of the cache line size.
Type / Scope: Invalidate Block
L2 CSL function: CACHE_invL2(ext memory base addr, byte count, wait)
L2 cache operation: lines invalidated
Effect on L1 caches: corresponding lines invalidated in L1D & L1P; any L1D updates discarded

Type / Scope: Writeback Block
L2 CSL function: CACHE_wbL2(ext memory base addr, byte count, wait)
L2 cache operation: dirty lines written back; lines remain valid
Effect on L1 caches: L1D: updated data written back, then corresponding lines invalidated; L1P: no effect

Type / Scope: Writeback with Invalidate Block
L2 CSL function: CACHE_wbInvL2(ext memory base addr, byte count, wait)
L2 cache operation: dirty lines written back; lines invalidated
Effect on L1 caches: L1D: updated data written back, then corresponding lines invalidated; L1P: corresponding lines invalidated

Type / Scope: Writeback All
L2 CSL function: CACHE_wbAllL2(wait)
L2 cache operation: updated lines written back; all lines remain valid
Effect on L1 caches: L1D: updated data written back, then all lines invalidated; L1P: no effect

Type / Scope: Writeback with Invalidate All
L2 CSL function: CACHE_wbInvAllL2(wait)
L2 cache operation: updated lines written back; all lines invalidated
Effect on L1 caches: L1D: updated data written back, then all lines invalidated; L1P: all lines invalidated
When to Use Coherency Functions?
Use when the CPU and EDMA share a cacheable region in external memory
Safest: use L2 Writeback-Invalidate All before any EDMA transfer to/from external memory. Disadvantage: larger overhead
Reduce overhead by only operating on buffers used for EDMA, and distinguishing between three possible scenarios:
1. EDMA reads data written by the CPU: Writeback before EDMA
2. EDMA writes data to be read by the CPU: Invalidate before EDMA*
3. EDMA modifies data written by the CPU that is to be read back by the CPU: Writeback-Invalidate before EDMA
* For C6211/6711, use Writeback-Invalidate before EDMA
Solution 2: Keep Buffers in L2
Configure some of L2 as RAM and locate the buffers in this RAM space
Coherency issues do not exist between L1 and L2
Outline Why Cache? Cache Basics Cache Example (Direct-Mapped) C6211/C671x Internal Memory ‘C64x Internal Memory Overview Additional Memory/Cache Topics
Data Mem Bank Cache Optimization Cache Coherency MAR Registers
Using the C Optimizer Lab 15
"Turn Off" the Cache (MAR)
The Memory Attribute Registers (MARs) enable/disable caching for a memory range
Turning off the cache can solve coherency issues, but without cache, access to external memory is slow
Looking more closely at the MAR registers ...
Memory Attribute Regs (MAR)
MAR bit values: 0 = Not cached, 1 = Cached
(Diagram: MAR bits per CE space; e.g., MAR4 = 0, MAR5 = 1, MAR6 = 1, MAR7 = 1; some bits reserved)
Use MAR registers to enable/disable caching of external address ranges
Useful when external data is modified outside the scope of the CPU
You can specify MAR values in the Config Tool
C671x: 16 MARs, 4 per CE space, each handling 16MB
C64x: 256 MARs, 16 per CE space, each handling 16MB (on current C64x devices, some are reserved)
Setting MARs in CDB files ...
Setting MARs in CDB (C67x)
MAR bit values: 0 = Not cached, 1 = Cached
MAR0 00000001
MAR1 00000000
MAR2 00000000
MAR3 00000000
…
…
MAR15 00000000
Finally, for more info ...
Setting MARs in CDB (C64x)
MAR bit values: 0 = Not cached, 1 = Cached
Outline Why Cache? Cache Basics Cache Example (Direct-Mapped) C6211/C671x Internal Memory ‘C64x Internal Memory Overview Additional Memory/Cache Topics Using the C Optimizer Lab 15
Compiler Build Options
Nearly one hundred compiler options are available to tune your code's performance, size, etc.
The following table lists the most commonly used options:
Option: Description
-mv6700: Generate 'C67x code ('C62x is default)
-mv67p: Generate 'C672x code
-mv6400: Generate 'C64x code
-mv6400+: Generate 'C64x+ code
-fr <dir>: Directory for object/output files
-fs <dir>: Directory for assembly files
Debug:
-g: Enables source-level symbolic debugging
-ss: Interlist C statements into assembly listing
Optimize (release):
-o3: Invoke optimizer (-o0, -o1, -o2/-o, -o3)
-k: Keep asm files, but don't interlist
To make options easier, TI recommends using …
-g -fr"$(Proj_dir)\Debug" -d"_DEBUG" -mv6700
-o3 -k -fr"$(Proj_dir)\Release" -mv6700
Default Build Configurations
For new projects, CCS automatically creates two build configurations: Debug (unoptimized) and Release (optimized)
Use the drop-down to quickly select a build configuration
Add/remove build configurations with the Project Configurations dialog (on the Project menu)
To edit a configuration: 1. Set it active 2. Modify build options (shown next) 3. Save the project
For more help …
Optimizing C Performance Compiler Tutorial (in CCS Help & SPRU425a.pdf)
C6000 Programmer’s Guide (SPRU198)
Chapter 4: “Optimizing C Code”
C6000 Optimizing C Compiler User's Guide (SPRU187)
Lab 15/15A
LAB 15: Move buffers off-chip; turn on L2 cache; investigate MAR bits; solve coherency issues with writeback/invalidate; use cache debug techniques
LAB 15A: Use the Release configuration; benchmark performance and code size
Outline Why Cache? Cache Basics Cache Example (Direct-Mapped) C6211/C671x Internal Memory ‘C64x Internal Memory Overview Additional Memory/Cache Topics Using the C Optimizer Lab 15 Optional Topics
‘0x Memory Summary ‘0x Data Memory Optimization
Lab 15 – Internal Mem. and Cache
Lab 15a – Find Memory Bank Conflicts
C64x Users Only
Lab 15b – Compare the performance:
1. Code/Data Off-Chip with No Cache
2. Code/Data Off-Chip with Cache
3. Code/Data On-Chip
Lab 15c – Exploring CacheTune
Outline Why Cache? Cache Basics Cache Example (Direct-Mapped) C6211/C671x Internal Memory ‘C64x Internal Memory Overview Additional Memory/Cache Topics Using the C Optimizer Lab 15 Optional Topics
‘0x Memory Summary & Optimization
‘0x Internal Memory
‘C6x01/04/05 (1M bit total): Program: Cache/RAM 64K bytes; Data: 64K bytes
‘C6202 (3M bit total): Program: Cache/RAM 128K bytes + RAM 128K bytes; Data: 128K bytes
‘C6203 (7M bit total): Program: Cache/RAM 128K bytes + RAM 256K bytes; Data: 512K bytes
Program memory: only program cache (no data cache); configurable as cache or RAM; single-cycle access
‘C6201 Internal Data: split into 4 banks; dual accesses to two different banks complete in 1 cycle; dual accesses to one bank result in a one-cycle delay
(Diagram: four banks of 8Kx16 RAM, each with its own address and data buses)
‘C6201 Internal Data
Banks are interleaved
Byte ordering: bank 0 holds bytes 0-1 and 8-9; bank 1 holds 2-3 and A-B; bank 2 holds 4-5 and C-D; bank 3 holds 6-7 and E-F
‘C6201 Internal Data
Banks are interleaved
How many cycles would these two LDW accesses take? 1 cycle: the two loads shown fall in different banks
‘C6201 Internal Data
Now, how many cycles would it take for these two LDW's? 2 cycles: both loads hit the same bank, and only one access per bank is allowed per cycle
‘0x Memory Optimization
Basic Memory Layout
Improving Performance: Offset Arrays, Unroll Loop
Aren't There Two Blocks?
Solution 1: Offset Arrays
Offset the accesses:

#pragma DATA_ALIGN(x, 8);
#pragma DATA_ALIGN(a, 8);
int x[40] = {1, 2, 3, … };
int a[41] = {0, 40, 39, 38, … };
int *xp = &x[0];
int *ap = &a[1];

x0 = x[0] = 1; a0 = a[1] = 40 (the dummy a[0] = 0 shifts a's useful data by one word, so corresponding elements of x and a fall in different banks)
Solution 2: Unroll Loop
Offset the accesses and unroll the loop: read two values from each array in parallel, then perform two calculations

LDW from x0 || LDW from x1
LDW from a0 || LDW from a1
ADD x0, a0  || ADD x1, a1
‘0x Memory Optimization
Basic Memory Layout
Improving Performance: Offset Arrays, Unroll Loop
Aren't There Two Blocks?
Two Blocks of Memory (4 banks each)
(Diagram: two blocks, each with four 4Kx16 interleaved banks, bytes 0-F ordered as before)
Why use offset arrays or loop unrolling if there are two blocks? Because this leaves a block free, allowing the DMA unrestricted access to internal memory
Outline Why Cache? Cache Basics Cache Example (Direct-Mapped) C6211/C671x Internal Memory ‘C64x Internal Memory Overview Additional Memory/Cache Topics Using the C Optimizer Lab 15 Optional Topics
‘0x Memory Summary & Optimization