View
16
Download
0
Category
Preview:
DESCRIPTION
The Fresh Breeze Memory Model Status: Linear Algebra and Plans. Jack Dennis. Guang R. Gao. Xiaoxuan Meng. Joshua Slocum. Brian Lucas. MIT CSAIL. University of Delaware. Funded in part by NSF HECURA Grant CCF-0937832. Problem: I/O Performance. P. $. File. Main. Memory. L3 $. P. $. - PowerPoint PPT Presentation
Citation preview
The Fresh Breeze Memory ModelStatus: Linear Algebra and Plans
Guang R. GaoJack Dennis
MIT CSAIL University of Delaware
Funded in part by NSF HECURA Grant CCF-0937832
Joshua Slocum Xiaoxuan MengBrian Lucas
Problem: I/O Performance
Culprits:
• Large Units of Data Transfer
• Operating System Overhead/Noise
Managed by hardwareManaged by software (OS)
$Main
Memory
L3
$
P
Disk
Other
FileMemory
• Few Concurrent Transfers (OS Limits) 2
$P
$P
Solution: Integrated Memory Hierarchy
Features:
• Many Concurrent Transactions – High Bandwidth
• Global Virtual Store
Managed by hardware
$Main
Memory
L3
$
P
Disk
Other
FileMemory
• Superior Basis for Security / Privacy 3
$P
$P
The Fresh Breeze Memory Model
• Fan-out as large as 16
Data
Chunks
e.g. 128 Bytes
Master
Chunk
• Write-Once then Read Only
Cycle-Free Heap Arrays as Trees of Chunks
4
• Arrays: Three levels yields 4096 elements (longs)
Fresh Breeze System Vision
Many Core Processing Chips
Switch Switch
Main Memory:Associative Directoriesand DRAM Chips
Backing Store:Access Controllersand Flash Devices
AD DRAM
Switch
AD DRAMAD DRAM
AC Flash AC Flash AC Flash AC FlashAC Flash
L2 Cache
PL1
PL1
PL1
PL1
L2 Cache
PL1
PL1
PL1
PL1
L2 Cache
PL1
PL1
PL1
PL1
Unique Features of the Processor
• Cache Lines are 128-byte Chunks.• Registers are tagged to indicate those
holding handles of chunks.• Hardware task scheduler: Active Task
List and Pending Task Queue.
6
Spawn and Join
7
Master Task
spawn (n)
Join_fetch Join
Worker 0 Worker n-1
join_update join_update
Spawned Worker Tasks
Continuation Task
The Dot Product
A
B
* Sum
A B
5 levels:Vector length =165 = 1,048,576
* +
scalar result
* *
Each Leaf Task: Dot Product of two 16-element vectors: 16 multiplies; 15 adds
Linear Algebra: Three Algorithms
• Dot Product• Matrix Multiply• Fast Fourier Transform
9
Let’s consider the special characteristics of each.
Dot Product
16
10
SegmentA
15
Multiplies
Adds
31 Operations
• No data reuse
• No intermediate data
• No chunks written
• Large volume of input data
Leaf Task: Dot Product of 16-element segments A and B
SegmentB
+*
Matrix Multiply
64
11
48
112
Multiplies
Adds
Operations
• Each input chunk used many times
• Result chunks written to memory
• No chunks written
• Relatively small input data
Leaf Task: Product of two 4-by-4 matrices
16 dot products of four-element vectors
+*
Fast Fourier Transform
4
EightData
Samples
FourTwiddleFactors
6
Multiplies
Adds
10Operations
• Log2 (n) stages
• Intermediate data
• Chunks written and read
Leaf Task: Group of Four Butterfly Computations
BFLYEight
ResultsBFLY
BFLY
BFLY
16
40
One Butterfly Four Butterflies
24
Simulated System Model
L1P
Memory
Switch
Processors – 16, 24, 32, 40
Up to 64IndependentStorage Units
L1P
Simulation Modes
Non-Blocking:
A task executing a read simply waits for operation to complete.
Models an L2 cache with short access time.
Models a main memory with long access time.
Blocking:
A task executing a read suspends to permit other tasks to run.
Estimated PerformanceL2 Cache Model - NonBlocking
Dot ProductA: 163
B: 164
C: 165
Matrix MultiplyA: 16 x 16B: 32 x 32C: 48 x 48
Fourier TransformA: 1024B: 2048C: 4096
AA
A
A
A A A A
A A A AB
B
B
B
B
B
BB
B B B B
C
C
C
C C
C
C
C
C C C C
0
2
4
6
8
10
12
16 24 32 40 16 24 32 40 16 24 32 40
Number of Processors
To
tal P
erfo
rman
ce -
GF
LO
PS
Estimated PerformanceL2 Cache Model - NonBlocking
Dot ProductA: 163
B: 164
C: 165
Matrix MultiplyA: 16 x 16B: 32 x 32C: 48 x 48
Fourier TransforA: 1024B: 2048C: 4096
0
2
4
6
8
10
12
16 24 32 40 16 24 32 40 16 24 32 40
Number of Processors
To
tal P
erfo
rman
ce -
GF
LO
PS
Findings
• Data locality of chunks works well.• L1 Cache is not very important; only a buffer
for input and result chunks of tasks.• Task switching for chunk reads is costly;
percolation would make a big difference.
17
Percolation: Ensuring that input chunks of a task have been retrieved before the task is scheduled.
Further Work
• Model a Three-Level or Four-Level Storage System
• Develop Hierarchical Work Stealing to improve Load Distribution for Massively Parallel Systems.
• Compiler Development: Automatic Mapping of Objects to Trees of Chunks.
• Expand Test Programs to include Transaction Processing and Database Operations .
18
Distributed Discrete-Event Simulation
• Accurate timing for interconnected communicating components.
• Packet Communication Architecture is a model for distributed systems especially amenable to efficient distributed simulation.
• UDel and MIT have begun a project to develop a PCA-based Simulation Sandbox for evaluating alternate PXMs for massively parallel computing.
19
Relevance to FSIO
• The Fresh Breeze Memory Model extends to 264 chunks, about 1021 chunks or 100,000 exabytes.
• The tree-structure is an excellent basis for advanced security and data object sharing.
• Can simplify check-point / restart.
• Seamless transition from “in-core” to “out-of-core” operation of arbitrary parallel programs.
• Provides ability to use any program as a component of new programs – including any parallel program.
20
Conclusion
• Making the best of many-core technology requires study of new program execution models (PXMs).
• The FSIO challenge can be met by integrating the file system into the system memory hierarchy as a global virtual store.
• The Fresh Breeze PXM demonstrates some of the benefits from departing from conventional system organization.
• More exploration of new PXMs is needed!
21
Recommended