Outline
• Memory System Overview
• Related work
• Experiment setup
• Page level access measurements
• Solution
• Expected Speedup
Processor-Memory Gap
µProc: 60% / year. Doubles every 1.5 years
DRAM: 9% / year. Doubles every 10 years
Processor-Memory Performance Gap: Grows 50% / year
http://www.e-insite.net/ednmag
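As a quick check on the growth figures above, the following sketch computes doubling times and the yearly gap growth from the two annual rates. It assumes simple compound growth; the function and variable names are illustrative only.

```python
import math

def doubling_time(annual_growth):
    """Years needed to double at a fixed compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)

cpu_growth = 0.60   # µProc: 60% per year (from the slide)
dram_growth = 0.09  # DRAM: 9% per year (from the slide)

cpu_doubling = doubling_time(cpu_growth)
dram_doubling = doubling_time(dram_growth)

# The gap grows by the ratio of the two yearly growth factors.
gap_growth = (1 + cpu_growth) / (1 + dram_growth) - 1

print(f"CPU doubles every {cpu_doubling:.2f} years")
print(f"DRAM doubles every {dram_doubling:.2f} years")
print(f"Gap grows {gap_growth:.1%} per year")
```

Under pure compounding, 60% per year doubles in about 1.5 years, 9% per year doubles in about 8 years, and the gap grows by about 47% per year, so the slide's numbers are best read as rounded.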
Memory Access Time
[Figure: CPU memory hierarchy: Core, L1, L2, MC, DRAM]
Level   Access Time (cycles)
L1      3
L2      8
DRAM    181
Data for 1.8 GHz Opteron (www.aceshardware.com/)
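To put the cycle counts in wall-clock terms, this small sketch converts them to nanoseconds at the stated 1.8 GHz clock. The helper name `cycles_to_ns` is made up for illustration.

```python
CLOCK_HZ = 1.8e9  # 1.8 GHz Opteron (from the slide)

# Access times in cycles (from the slide)
access_cycles = {"L1": 3, "L2": 8, "DRAM": 181}

def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to nanoseconds at the given clock rate."""
    return cycles / clock_hz * 1e9

access_ns = {level: cycles_to_ns(c) for level, c in access_cycles.items()}

# How many times slower a DRAM access is than an L1 hit.
dram_vs_l1 = access_cycles["DRAM"] / access_cycles["L1"]

for level, ns in access_ns.items():
    print(f"{level}: {ns:.1f} ns")
```

At this clock rate a DRAM access comes to roughly 100 ns, about 60 times the L1 latency.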
Large Size Memory Accesses
• Applications
– Initialization
– Data Movement
– Stream operations
• Operating System
– Task Creation
– System Calls
– Page Allocation, Management
• Functions that would use them
– Memset, Clear User
– Memcpy, Copy from User, Copy To User
Experiment Setup
• Workstation based
– 2.4 GHz P4 (Wonko)
– 750 MHz PIII (Majikthise)
– 900 MHz PIII (Jaleel)
• Bochs x86 emulator
• Operating System
– Linux Kernel v2.4.19
• Applications
– SPEC2000 Integer benchmarks using glibc-2.2.5
Memset : % Overhead
[Figure: % Memset Time per SPEC2000 integer benchmark (vortex, gcc, gzip, perlbmk, twolf, crafty, vpr, bzip2, mcf, parser); y-axis: % Overhead, 0 to 25]
Memcpy : Access Size
[Figure: Average Length and Maximum Length of memcpy accesses per benchmark (vortex, gcc, gzip, perlbmk, twolf, crafty, vpr, bzip2, mcf, parser); log-scale y-axis, 1.0E+00 to 1.0E+09]
OS : Memset / Clear User Real-Time Plot
• Behavior over Time
• Frequency of operation
• Access Size
• Operation Duration
• Averages
OS : Memcpy / Copy User Real-Time Plot
• Behavior over Time
• Frequency of operation
• Access Size
• Operation Duration
• Averages
Page based Commands
• Set Page
– A ← constant
• Copy Page
– A ← B
• Page level Arithmetic operations
– A ← B + C
– A ← B - C
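The commands above can be modeled in software to make their semantics concrete. This is only a sketch over a `bytearray` standing in for DRAM: the function names, the page-alignment check, and the byte-wise wrap-around arithmetic are all assumptions, since the slides do not specify operand width or overflow behavior.

```python
PAGE_SIZE = 4096  # 4 kB page, as used on the slides

def _check_aligned(addr):
    # Assumption: page commands only accept page-aligned addresses.
    if addr % PAGE_SIZE != 0:
        raise ValueError("page commands need a page-aligned address")

def set_page(mem, a, value):
    """A <- constant"""
    _check_aligned(a)
    mem[a:a + PAGE_SIZE] = bytes([value]) * PAGE_SIZE

def copy_page(mem, a, b):
    """A <- B"""
    _check_aligned(a)
    _check_aligned(b)
    mem[a:a + PAGE_SIZE] = mem[b:b + PAGE_SIZE]

def _combine(mem, a, b, c, op):
    # Byte-wise operation, wrapping modulo 256 (assumed semantics).
    for addr in (a, b, c):
        _check_aligned(addr)
    src_b = mem[b:b + PAGE_SIZE]
    src_c = mem[c:c + PAGE_SIZE]
    mem[a:a + PAGE_SIZE] = bytes(op(x, y) & 0xFF for x, y in zip(src_b, src_c))

def add_page(mem, a, b, c):
    """A <- B + C"""
    _combine(mem, a, b, c, lambda x, y: x + y)

def sub_page(mem, a, b, c):
    """A <- B - C"""
    _combine(mem, a, b, c, lambda x, y: x - y)
```

For example, after `set_page(mem, 0, 7)` and `set_page(mem, PAGE_SIZE, 3)`, calling `add_page(mem, 2 * PAGE_SIZE, 0, PAGE_SIZE)` fills the third page with the value 10.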
Page based Commands Issue
[Figure: SETPAGE ZERO, 0x04000; labels: DRAM (4 kB), Cache (128 bytes)]
How do we ensure Memory and Cache Consistency?
How much data is actually in the cache?
Function                % Hit Rate (Boot + Halt)   % Hit Rate (SPEC workload)
Memset                  7.23%                      0.23%
Memcpy (Source)         7.88%                      10.53%
Memcpy (Destination)    < 0.01%                    < 0.01%
Page based Commands Issue
[Figure: SETPAGE ZERO, 0x04000; labels: DRAM, 4 kB, 4 kB]
DRAM level Page Fragmentation
Maximum number of rows a page can occupy is 2
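The two-row bound can be illustrated with a little address arithmetic. The sketch below assumes a DRAM row is at least as large as a 4 kB page and that a page maps to contiguous row addresses; the 8 kB `ROW_SIZE` is a hypothetical value, not one given on the slides.

```python
PAGE_SIZE = 4096  # 4 kB page, as used on the slides
ROW_SIZE = 8192   # hypothetical DRAM row size, for illustration only

def rows_spanned(start_addr, length, row_size):
    """Number of DRAM rows touched by [start_addr, start_addr + length)."""
    first_row = start_addr // row_size
    last_row = (start_addr + length - 1) // row_size
    return last_row - first_row + 1

# Try page start offsets at 64-byte granularity across several rows.
max_rows = max(
    rows_spanned(start, PAGE_SIZE, ROW_SIZE)
    for start in range(0, 4 * ROW_SIZE, 64)
)
print(max_rows)
```

Whenever the row is no smaller than the page, a page can straddle at most one row boundary, so the maximum found by the scan is 2.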
Expected Speedup I
Current Implementation
Memset(Address, Length, SetValue)
    EndAddr ← Address + Length
    While (Address < EndAddr)
        Mem[Address] ← SetValue
        Address ← Address + 1
Proposed Implementation
While (Length >= PageSize)
    SetPage(SetValue, Address)
    Length ← Length - PageSize
    Address ← Address + PageSize
Call Memset(Address, Length, SetValue)
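A minimal software model of the two implementations is sketched below, with a `bytearray` standing in for memory and `set_page` standing in for the proposed SetPage command. It assumes the start address is page-aligned and advances the address by one page per SetPage command; the operation counters are added only to show how much work moves from byte stores to page commands.

```python
PAGE_SIZE = 4096  # 4 kB page, as used on the slides

def set_page(mem, value, address):
    """Stand-in for the proposed SetPage command (one command per page)."""
    mem[address:address + PAGE_SIZE] = bytes([value]) * PAGE_SIZE

def memset_bytewise(mem, address, length, value):
    """Current implementation: one store per byte. Returns the store count."""
    end_addr = address + length
    stores = 0
    while address < end_addr:
        mem[address] = value
        address += 1
        stores += 1
    return stores

def memset_paged(mem, address, length, value):
    """Proposed implementation: SetPage for whole pages, then the
    byte-wise memset for the remainder. Assumes a page-aligned address.
    Returns (page_commands, byte_stores)."""
    page_commands = 0
    while length >= PAGE_SIZE:
        set_page(mem, value, address)
        length -= PAGE_SIZE
        address += PAGE_SIZE  # advance by one page
        page_commands += 1
    byte_stores = memset_bytewise(mem, address, length, value)
    return page_commands, byte_stores
```

Setting two pages plus 100 bytes this way issues 2 page commands and 100 byte stores, where the byte-wise loop alone would perform 8292 stores.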
Expected Speedup II
• Current Memset Time for a page: 4 µs
• Expected Memset Time for a page
= # Rows in a page * Time to read a Row + Cache Coherence Logic + Misc
= 2 * 100 ns + X
= 200 ns + X
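The estimate above can be turned into a speedup figure once values are chosen for the unknowns. The sketch below treats the current per-page time as 4 µs, which is an assumed reading of the slide's figure, and tries two values for the overhead X: zero as a best case and a purely hypothetical 200 ns.

```python
def expected_page_time_ns(rows_per_page, row_time_ns, overhead_ns):
    """Rows in a page * time to read a row + X (coherence logic + misc)."""
    return rows_per_page * row_time_ns + overhead_ns

def speedup(current_ns, expected_ns):
    """Ratio of current to expected time for setting one page."""
    return current_ns / expected_ns

CURRENT_NS = 4000.0  # assumed: 4 µs per page with the current memset

# Best case: X = 0
best_case = speedup(CURRENT_NS, expected_page_time_ns(2, 100, 0))

# Hypothetical overhead: X = 200 ns
with_overhead = speedup(CURRENT_NS, expected_page_time_ns(2, 100, 200))

print(best_case, with_overhead)
```

Under these assumptions the speedup is 20x with no overhead and 10x when X equals the row access time itself, so the value of X largely determines the benefit.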
Related Work
• IRAM
– On-chip DRAM
– Advantage: bigger storage, eliminates much of the off-chip memory access, energy efficient
– Disadvantage: not much performance increase, doesn't work with conventional microprocessors
• Active Pages
– Bring computation to DRAM
– Break the memory into fixed page-size blocks and add reconfigurable logic to DRAM
• Heap paper shows some memory accesses that can be eliminated entirely