0907532 Special Topics in Computer Engineering: Multicore Architecture Basics

1-MulticoreArchitecture Basics.ppt


  • 2005 ITRS (International Technology Roadmap for Semiconductors)

    Figure: projected clock rate (GHz) over time, showing the 2005 roadmap projection alongside Intel single-core parts.

  • Change in the ITRS Roadmap in two years

    Figure: clock rate (GHz) over time, comparing the 2005 roadmap and the 2007 roadmap against Intel single-core and Intel multicore parts.

    Shared Address Space Architectures

    Any core can directly reference any memory location. Communication between cores occurs implicitly as a result of loads and stores (a minimal code sketch follows).
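    A minimal sketch of this idea (assuming C with POSIX threads; the variable names and the use of a mutex are illustrative choices, not from the slides): one thread stores to an ordinary shared variable and the main thread later loads it, with no explicit send or receive.

    /* compile: gcc -pthread shared_load_store.c */
    #include <pthread.h>
    #include <stdio.h>

    static int shared_value;                       /* lives in the single shared address space */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *producer(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        shared_value = 42;                         /* communication = an ordinary store */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        pthread_join(t, NULL);

        pthread_mutex_lock(&lock);
        printf("read %d from shared memory\n", shared_value);   /* ...and an ordinary load */
        pthread_mutex_unlock(&lock);
        return 0;
    }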

    Memory hierarchy and cache memories:

    Review the concepts assuming a single core

    Introduce the problems and solutions that arise when these mechanisms are used in multicore machines

    Single-core memory hierarchy and cache memories

    Programs tend to exhibit temporal and spatial locality:

    Temporal locality: once a program accesses a data item or instruction, it tends to access it again in the near future.

    Spatial locality: once a program accesses a data item or instruction, it tends to access nearby data items or instructions in the near future.

    Because of this locality property of programs, memory is organized in a hierarchy.
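    As a minimal illustration (an assumed example, not from the slides), the two loops below sum the same C array. The first walks memory contiguously and exploits spatial locality; the second strides across rows and touches a new cache line on almost every access. The accumulator s is reused on every iteration, which is temporal locality.

    #include <stddef.h>

    #define N 1024

    /* Good spatial locality: C stores a[][] row-major, so consecutive
     * j values read consecutive addresses within the same cache line. */
    double sum_rows(const double a[N][N])
    {
        double s = 0.0;                    /* s is reused every iteration: temporal locality */
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Poor spatial locality: each access jumps N * sizeof(double) bytes,
     * so nearly every access lands on a different cache line. */
    double sum_cols(const double a[N][N])
    {
        double s = 0.0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }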
  • Memory hierarchy

    Core → L1 cache (~1 cycle) → L2 cache (~1-10 cycles) → Main memory (~100s of cycles) → Magnetic disk (~1000s of cycles). Connecting-line thickness in the figure depicts bandwidth (bytes/second).

    Key observations: access to the L1 cache is on the order of 1 cycle, access to L2 on the order of 1 to 10 cycles, access to main memory ~100s of cycles, and access to disk ~1000s of cycles.

    Slide depiction (Fig. 3.1) inspired by Software Optimization for High Performance Computing (HP Press) by Wadleigh & Crawford. Connecting-line thickness depicts bandwidth.

    The key takeaway for students is that the latency of the L1 cache is on the order of 1 cycle, of L2 between 1 and 10 cycles, of an L2 miss forcing a read from main memory ~100s of cycles, and of a disk access ~1000s of cycles.

    This is useful for explaining why effective cache utilization can be MORE important than utilizing multiple cores. BUT in many cases we can get good cache use AND use multiple cores to get huge performance gains (5-100X in aggregate for an 8-core system).

  • Art of Multiprocessor Programming (the following material is adapted from slides by Herlihy and Shavit, 2003)

    Processor and Memory are Far Apart

    From our point of view, one architectural principle drives everything else: processors and memory are far apart, connected by an interconnect.

    Reading from Memory

    It takes a long time for a processor to read a value from memory: it has to send the address to the memory, wait for the message to be delivered, and then wait for the response (the value) to come back.

    Writing to Memory

    Writing is similar, except the processor sends the address and the new value, waits, and then gets an acknowledgement that the new value was actually installed in the memory.

    Cache: Reading from Memory

    We alleviate this problem by introducing one or more caches: small, fast memories situated between main memory and the processors. Now, when a processor reads a value from memory, the data is stored in the cache before being returned to the processor.

    Cache Hit

    Later, when the processor wants to use the same data, it first checks whether the data is present in the cache. If so, it reads directly from the cache, saving a long round trip to main memory. We call this a cache hit.

    Cache Miss

    Sometimes the processor does not find what it is looking for in the cache. We call this a cache miss; the request must then go out to main memory.

    Memory and cache performance metrics

    Cache hit and miss: when the data is found in the cache, we have a cache hit; otherwise it is a miss.

    Hit ratio HR = fraction of memory references that hit. It depends on the locality of the application and is a measure of the effectiveness of the caching mechanism.

    Miss ratio MR = fraction of memory references that miss, so HR = 1 - MR.
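    As a quick illustration (assumed numbers, not from the slides): if 95 out of every 100 memory references are found in the cache, then HR = 0.95 and MR = 1 - 0.95 = 0.05.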

    Average memory system access time

    If all the data fits in main memory (i.e., ignoring disk access):

    average access time = HR * cache access time + MR * main memory access time
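    Plugging in assumed numbers (not from the slides): with HR = 0.95, a cache access time of 2 cycles, and a main memory access time of 200 cycles, the average access time is 0.95 * 2 + 0.05 * 200 = 1.9 + 10 = 11.9 cycles, i.e., much closer to the cache latency than to the memory latency as long as the hit ratio stays high.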

    Cache line

    When there is a cache miss, a fixed-size block of consecutive data elements, called a line, is copied from main memory to the cache. Typical cache line sizes are 4-128 bytes. Main memory can be seen as a sequence of lines, some of which have a copy in the cache.
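    For example (assumed sizes, not from the slides): with a 64-byte cache line and 4-byte floats, a miss on x[0] copies x[0] through x[15] into the cache, so a sequential scan of the array incurs roughly one miss per 16 elements.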

    MEMORY HIERARCHY AND BANDWIDTH ON MULTICORE

    Each core has its own private L1 cache to provide fast access (e.g., 1-2 cycles). L2 caches may be shared across multiple cores. In the event of a cache miss at both L1 and L2, the memory controller must forward the load/store request to the off-chip main memory.
  • High-Level Multicore Architectural View: Intel Core Microarchitecture memory sub-system

    Diagram: an Intel Core 2 Duo processor and an Intel Core 2 Quad processor, each connected to memory with 64-byte cache lines. Legend: A = architectural state, E = execution engine & interrupt, C = 2nd-level cache, B = bus interface connecting to main memory & I/O.

    The dual core has a shared cache; the quad core has both shared and separated caches.

    The main point to cover here is that false sharing is an issue for platforms with separated caches; we can alleviate false sharing by restructuring the data layout and data-access patterns.

    A = architectural state: the contents or state of the FPU, MMX registers, and others.

    E = execution engine: the functional units such as FP, ALU, SIMD, etc.

    C = 2nd-level cache memory, with a latency on the order of a few cycles compared to 100s of clocks for main memory.

    B = system bus interface that connects to main memory & I/O.

    The cache line is the smallest unit of data that can be transferred to or from memory. When a single data element is requested by a program, say you need to read one variable of type float from memory, then that float and its 15 nearest neighbors (16 four-byte floats = 64 bytes in total, i.e., the same cache line) are brought into the faster cache memory for use by the processor.

    Cache-line ping-ponging, or the "tennis" effect

    One processor writes to a cache line and then another processor writes to the same cache line but to a different data element. When the cache line is in a separate-socket / separate-L2-cache environment, each core takes a HITM (HIT Modified) on the cache line, causing it to be shipped across the FSB (front-side bus) to memory. This increases FSB traffic, and even under good conditions it costs about the cost of a memory access.

  • With a separated cache

    Diagram (Intel Core Microarchitecture memory sub-system): CPU1 and CPU2, each with its own L2 cache, connected to memory over the front-side bus (FSB); shipping the L2 cache line between the caches costs roughly half of an access to memory.

  • Advantages of Shared Cache using Advanced Smart Cache Technology

    Diagram (Intel Core Microarchitecture memory sub-system): CPU1 and CPU2 share one L2 cache and connect to memory over the front-side bus (FSB).

    Because L2 is shared, there is no need to ship the cache line: the line simply goes from the Exclusive state to the Shared state for the other core to read.
  • False Sharing

    A performance issue in programs where cores write to different memory addresses BUT within the same cache line. Known as ping-ponging: the cache line is shipped back and forth between the cores.

    Timeline diagram: Core 0 repeatedly writes X[0] while Core 1 writes X[1]; because X[0] and X[1] lie in the same cache line, the line bounces between the two cores on every write.

    False sharing is not an issue in a shared cache; it is an issue in separated caches.

    From the Intel Press book Multi-Core Programming: Increasing Performance Through Software Multi-threading by Shameem Akhter and Jason Roberts:

    False Sharing
    The smallest unit of memory that two processors interchange is a cache line or cache sector. Two separate caches can share a cache line when they both need to read it, but if the line is written in one cache and read in another, it must be shipped between caches, even if the locations of interest are disjoint.

    Like two people writing in different parts of a log book, the writes are independent, but unless the book can be ripped apart, the writers must pass the book back and forth. In the same way, two hardware threads writing to different locations contend for a cache sector to the point where it becomes a ping-pong game.

    In this ping-pong game there are two threads, each running on a different core. Each thread increments a different location belonging to the same cache line. Because the locations belong to the same cache line, the cores must pass the sector back and forth across the memory bus (a minimal code sketch of this pattern follows).

    To avoid false sharing we need to alter either the algorithm or the data structure. We can add some padding to a data structure or array (just enough padding, generally less than a cache line) so that threads access data from different cache lines, or we can adjust the implementation of the algorithm (the loop stride) so that each thread accesses data in a different cache line.
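    A minimal sketch of this ping-pong pattern (assuming C with POSIX threads and a 64-byte cache line; the names and iteration count are illustrative, not from the book): two threads increment adjacent array elements that share a cache line, so every write invalidates the other core's copy of the line.

    /* compile: gcc -O2 -pthread false_sharing.c */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    #define ITERS 100000000L

    /* counters[0] and counters[1] are adjacent longs, so they fall in the
     * same 64-byte cache line; volatile keeps every increment in memory. */
    static volatile long counters[2];

    static void *worker(void *arg)
    {
        long idx = (long)(intptr_t)arg;
        for (long i = 0; i < ITERS; i++)
            counters[idx]++;               /* each store forces the line to bounce between cores */
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, worker, (void *)(intptr_t)0);
        pthread_create(&t1, NULL, worker, (void *)(intptr_t)1);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("%ld %ld\n", counters[0], counters[1]);
        return 0;
    }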

    Avoiding False Sharing

    Change either:

    the algorithm: adjust the implementation (e.g., the loop stride) so that each thread accesses data in a different cache line;

    or the data structure: add some padding to the data structure or arrays (just enough padding, generally less than a cache line) so that threads access data from different cache lines. A sketch of the padding approach follows.
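    A minimal sketch of the data-structure fix (assuming a 64-byte cache line; the struct name and the use of C11 alignas are illustrative choices, not from the slides): pad and align each counter so the two threads' data land in different cache lines.

    #include <stdalign.h>

    #define CACHE_LINE 64

    /* Each counter occupies, and is aligned to, its own 64-byte cache line,
     * so the two worker threads never write to the same line. */
    struct padded_counter {
        alignas(CACHE_LINE) volatile long value;
        char pad[CACHE_LINE - sizeof(long)];
    };

    static struct padded_counter counters[2];

    /* The worker threads from the previous sketch would now increment
     * counters[idx].value instead of counters[idx]; since the two values
     * live in different cache lines, the ping-ponging disappears. */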