2005 IT Roadmap: Semiconductors
[Figure: projected clock rate (GHz) from the 2005 roadmap vs. actual Intel single-core clock rates]

Change in IT Roadmap in 2 Years
[Figure: clock rate (GHz) projections from the 2005 vs. 2007 roadmaps; Intel single-core vs. Intel multicore actuals]
Shared Address Space Architectures
- Any core can directly reference any memory location.
- Communication between cores occurs implicitly as a result of loads and stores.

Memory Hierarchy and Cache Memories
- Review concepts assuming a single core.
- Introduce problems and solutions when these are used in multicore machines.
Single-Core Memory Hierarchy and Cache Memories
Programs tend to exhibit temporal and spatial locality:
- Temporal locality: once programs access data items or instructions, they tend to access them again in the near future.
- Spatial locality: once programs access data items or instructions, they tend to access nearby data items or instructions in the near future.
Because of the locality property of programs, memory is organized in a hierarchy.

Memory Hierarchy
[Figure: Core → L1 cache → L2 cache → main memory → magnetic disk; the thickness of the connecting lines depicts bandwidth in bytes/second]

Key observations:
- Access to the L1 cache is on the order of 1 cycle.
- Access to L2 is on the order of 1 to 10 cycles.
- Access to main memory is ~100s of cycles.
- Access to disk is ~1000s of cycles.
Slide depiction of Fig. 3.1 inspired by Software Optimization for High Performance Computing (HP Press) by Wadleigh & Crawford.
The thickness of the connecting lines depicts bandwidth. The key takeaway for students is that latency for the L1 cache is on the order of 1 cycle, for L2 on the order of 1 to 10 cycles, for an L2 miss forcing a read from main memory ~100s of cycles, and for disk access ~1000s of cycles. This is useful for explaining why effective cache utilization can be MORE important than utilizing multiple cores. BUT in many cases we can get good cache use AND use multiple cores to get huge performance gains (5-100X in aggregate for an 8-core system).
The following slides are from The Art of Multiprocessor Programming (Herlihy & Shavit, 2003).
Processor and Memory are Far Apart
[Diagram: processor and memory connected by an interconnect]
From our point of view, one architectural principle drives everything else: processors and memory are far apart.
Reading from Memory
[Animation: the processor sends an address over the interconnect, waits, and receives the value]
It takes a long time for a processor to read a value from memory: it has to send the address to the memory, wait for the message to be delivered, and wait for the response to come back.
Writing to Memory
[Animation: the processor sends the address and the new value, waits, and receives an acknowledgement]
Writing is similar, except the processor sends the address and the new value, waits, and then gets an acknowledgement that the new value was actually installed in the memory.
Cache: Reading from Memory
[Animation: a cache sits between the processor and main memory; a value read from memory is stored in the cache on its way to the processor]
We alleviate this problem by introducing one or more caches: small, fast memories situated between main memory and the processors. Now, when a processor reads a value from memory, the value is stored in the cache before being returned to the processor. Later, the processor may want to use the same data again.
Cache Hit
[Animation: the processor checks the cache and finds the data there]
When a processor wants to read a value, it first checks whether the data is present in the cache. If so, it reads directly from the cache, saving a long round-trip to main memory. We call this a cache hit.
Cache Miss
[Animation: the processor checks the cache, the data is not there, and the address must be sent to main memory]
Sometimes the processor doesn't find what it is looking for in the cache. We call this a cache miss.
Memory and Cache Performance Metrics
- Cache hit and miss: when the data is found in the cache we have a cache hit; otherwise it is a miss.
- Hit ratio, HR = fraction of memory references that hit.
  - Depends on the locality of the application.
  - A measure of the effectiveness of the caching mechanism.
- Miss ratio, MR = fraction of memory references that miss.
- HR = 1 - MR
Average Memory System Access Time
If all the data fits in main memory (i.e., ignoring disk access):
average access time = HR × cache access time + MR × main memory access time
Cache Line
- When there is a cache miss, a fixed-size block of consecutive data elements, or line, is copied from main memory to the cache.
- Typical cache line size is 4-128 bytes.
- Main memory can be seen as a sequence of lines, some of which can have a copy in the cache.
Memory Hierarchy and Bandwidth on Multicore
- Each core has its own private L1 cache to provide fast access, e.g. 1-2 cycles.
- L2 caches may be shared across multiple cores.
- In the event of a cache miss at both L1 and L2, the memory controller must forward a load/store request to the off-chip main memory.

High-Level Multicore Architectural View
Intel Core 2 Duo and Core 2 Quad Processors
[Diagram: per-core architectural state (A) and execution engine (E), with 2nd-level cache (C) and bus interface (B) connecting over 64-byte cache lines to memory. The dual core has a shared L2 cache; the quad core has both shared and separated L2 caches.]
A = Architectural State; E = Execution Engine & Interrupt
C = 2nd-Level Cache; B = Bus Interface connecting to main memory & I/O
Intel Core Microarchitecture Memory Sub-system
The main point to cover here is that false sharing is an issue for platforms with separated caches; we can alleviate false sharing by restructuring the data layout and data access patterns.
- A = architectural state: the contents or state of the FPU, MMX registers, and others.
- E = execution engine: functional units such as FP, ALU, SIMD, etc.
- C = 2nd-level cache memory, with latency on the order of 1 to 10 cycles compared to 100s of cycles for main memory.
- B = system bus interface that connects to main memory & I/O.
A cache line is the smallest unit of data that can be transferred to or from memory. When a single data element is requested by a program, say you need to read one variable of type float from memory, then that float and its 15 nearest neighbors in the same cache line (16 four-byte floats = 64 bytes) are brought into the faster cache memory for use by the processor.
Cache Line Ping-Ponging (the "Tennis Effect")
- One processor writes to a cache line, and then another processor writes to the same cache line but to a different data element.
- With a separated cache, in a separate-socket / separate-L2-cache environment, each core takes a HITM (HIT Modified) on the cache line, causing it to ship across the FSB (Front Side Bus) to memory.
- This increases FSB traffic and, even in good conditions, costs about as much as a memory access.

[Diagram: CPU1 and CPU2 with separate L2 caches on the Front Side Bus (FSB) to memory; shipping an L2 cache line costs roughly half a memory access]
Advantages of Shared Cache Using Advanced Smart Cache Technology
[Diagram: CPU1 and CPU2 sharing one L2 cache, connected by the Front Side Bus (FSB) to memory; L2 is shared, so there is no need to ship the cache line]
Shared L2: no need to ship the cache line; it just goes from exclusive to shared for the other core to read.

False Sharing
- A performance issue in programs where cores may write to different memory addresses BUT in the same cache line.
- Known as ping-ponging: the cache line is shipped between cores.
- False sharing is not an issue in a shared cache; it is an issue in separated caches.

[Timeline: Core 0 writes X[0] = 1 and later X[0] = 2 while Core 1 writes X[1] = 0 and then X[1] = 1; because X[0] and X[1] sit in the same cache line, every write forces the line to move between the cores]
From the Intel Press book Multi-Core Programming: Increasing Performance Through Software Multi-threading by Shameem Akhter and Jason Roberts.
False Sharing
The smallest unit of memory that two processors interchange is a cache line or cache sector. Two separate caches can share a cache line when they both need to read it, but if the line is written in one cache and read in another, it must be shipped between caches, even if the locations of interest are disjoint.

Like two people writing in different parts of a log book: the writes are independent, but unless the book can be ripped apart, the writers must pass the book back and forth. In the same way, two hardware threads writing to different locations contend for a cache sector to the point where it becomes a ping-pong game. In this ping-pong game there are two threads, each running on a different core. Each thread increments a different location belonging to the same cache line, but because the locations belong to the same cache line, the cores must pass the sector back and forth across the memory bus.

In order to avoid false sharing we need to alter either the algorithm or the data structure. We can add some padding to a data structure or array (just enough, generally less than the cache line size) so that threads access data from different cache lines. Or we can adjust the implementation of the algorithm (the loop stride) so that each thread accesses data in a different cache line.
Avoiding False Sharing
Change either:
- the algorithm: adjust the implementation (the loop stride) so that each thread accesses data in a different cache line, or
- the data structure: add some padding to a data structure or array (just enough, generally less than the cache line size) so that threads access data from different cache lines.