0907532 Special Topics in Computer Engineering: Multicore Architecture Basics

1-MulticoreArchitecture Basics.ppt


  • 2005 ITRS (International Technology Roadmap for Semiconductors)

    Figure: projected clock rate (GHz) over time, showing the 2005 roadmap projection alongside Intel single-core parts.

  • Change in the ITRS Roadmap in two years

    Figure: clock rate (GHz) over time, comparing the 2005 roadmap and the 2007 roadmap against Intel single-core and Intel multicore parts.

    Shared Address Space Architectures

    Any core can directly reference any memory location. Communication between cores occurs implicitly as a result of loads and stores (a minimal code sketch follows).
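    A minimal sketch of this idea (assuming C with POSIX threads; the variable names and the use of a mutex are illustrative choices, not from the slides): one thread stores to an ordinary shared variable and the main thread later loads it, with no explicit send or receive.

    /* compile: gcc -pthread shared_load_store.c */
    #include <pthread.h>
    #include <stdio.h>

    static int shared_value;                       /* lives in the single shared address space */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *producer(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        shared_value = 42;                         /* communication = an ordinary store */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        pthread_join(t, NULL);

        pthread_mutex_lock(&lock);
        printf("read %d from shared memory\n", shared_value);   /* ...and an ordinary load */
        pthread_mutex_unlock(&lock);
        return 0;
    }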

    Memory hierarchy and cache memories:

    Review the concepts assuming a single core

    Introduce the problems and solutions that arise when these mechanisms are used in multicore machines

    Single-core memory hierarchy and cache memories

    Programs tend to exhibit temporal and spatial locality:

    Temporal locality: once a program accesses a data item or instruction, it tends to access it again in the near future.

    Spatial locality: once a program accesses a data item or instruction, it tends to access nearby data items or instructions in the near future.

    Because of this locality property of programs, memory is organized in a hierarchy.
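    As a minimal illustration (an assumed example, not from the slides), the two loops below sum the same C array. The first walks memory contiguously and exploits spatial locality; the second strides across rows and touches a new cache line on almost every access. The accumulator s is reused on every iteration, which is temporal locality.

    #include <stddef.h>

    #define N 1024

    /* Good spatial locality: C stores a[][] row-major, so consecutive
     * j values read consecutive addresses within the same cache line. */
    double sum_rows(const double a[N][N])
    {
        double s = 0.0;                    /* s is reused every iteration: temporal locality */
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Poor spatial locality: each access jumps N * sizeof(double) bytes,
     * so nearly every access lands on a different cache line. */
    double sum_cols(const double a[N][N])
    {
        double s = 0.0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }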
  • Memory hierarchy

    Core → L1 cache (~1 cycle) → L2 cache (~1-10 cycles) → Main memory (~100s of cycles) → Magnetic disk (~1000s of cycles). Connecting-line thickness in the figure depicts bandwidth (bytes/second).

    Key observations: access to the L1 cache is on the order of 1 cycle, access to L2 on the order of 1 to 10 cycles, access to main memory ~100s of cycles, and access to disk ~1000s of cycles.

    Slide depiction (Fig. 3.1) inspired by Software Optimization for High Performance Computing (HP Press) by Wadleigh & Crawford. Connecting-line thickness depicts bandwidth.

    The key takeaway for students is that the latency of the L1 cache is on the order of 1 cycle, of L2 between 1 and 10 cycles, of an L2 miss forcing a read from main memory ~100s of cycles, and of a disk access ~1000s of cycles.

    This is useful for explaining why effective cache utilization can be MORE important than utilizing multiple cores. BUT in many cases we can get good cache use AND use multiple cores to get huge performance gains (5-100X in aggregate for an 8-core system).

  • Art of Multiprocessor Programming (the following material is adapted from slides by Herlihy and Shavit, 2003)

    Processor and Memory are Far Apart

    From our point of view, one architectural principle drives everything else: processors and memory are far apart, connected by an interconnect.

    Reading from Memory

    It takes a long time for a processor to read a value from memory: it has to send the address to the memory, wait for the message to be delivered, and then wait for the response (the value) to come back.

    Writing to Memory

    Writing is similar, except the processor sends the address and the new value, waits, and then gets an acknowledgement that the new value was actually installed in the memory.

    Cache: Reading from Memory

    We alleviate this problem by introducing one or more caches: small, fast memories situated between main memory and the processors. Now, when a processor reads a value from memory, the data is stored in the cache before being returned to the processor.

    Cache Hit

    Later, when the processor wants to use the same data, it first checks whether the data is present in the cache. If so, it reads directly from the cache, saving a long round trip to main memory. We call this a cache hit.

    Cache Miss

    Sometimes the processor does not find what it is looking for in the cache. We call this a cache miss; the request must then go out to main memory.

    Memory and cache performance metrics

    Cache hit and miss: when the data is found in the cache, we have a cache hit; otherwise it is a miss.

    Hit ratio HR = fraction of memory references that hit. It depends on the locality of the application and is a measure of the effectiveness of the caching mechanism.

    Miss ratio MR = fraction of memory references that miss, so HR = 1 - MR.
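    As a quick illustration (assumed numbers, not from the slides): if 95 out of every 100 memory references are found in the cache, then HR = 0.95 and MR = 1 - 0.95 = 0.05.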

    Average memory system access time

    If all the data fits in main memory (i.e., ignoring disk access):

    average access time = HR * cache access time + MR * main memory access time
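    Plugging in assumed numbers (not from the slides): with HR = 0.95, a cache access time of 2 cycles, and a main memory access time of 200 cycles, the average access time is 0.95 * 2 + 0.05 * 200 = 1.9 + 10 = 11.9 cycles, i.e., much closer to the cache latency than to the memory latency as long as the hit ratio stays high.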

    Cache line

    When there is a cache miss, a fixed-size block of consecutive data elements, called a line, is copied from main memory to the cache. Typical cache line sizes are 4-128 bytes. Main memory can be seen as a sequence of lines, some of which have a copy in the cache.
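    For example (assumed sizes, not from the slides): with a 64-byte cache line and 4-byte floats, a miss on x[0] copies x[0] through x[15] into the cache, so a sequential scan of the array incurs roughly one miss per 16 elements.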

    MEMORY HIERARCHY AND BANDWIDTH ON MULTICORE

    Each core has its own private L1 cache to provide fast access (e.g., 1-2 cycles). L2 caches may be shared across multiple cores. In the event of a cache miss at both L1 and L2, the memory controller must forward the load/store request to the off-chip main memory.
  • High-Level Multicore Architectural View: Intel Core Microarchitecture memory sub-system

    Diagram: an Intel Core 2 Duo processor and an Intel Core 2 Quad processor, each connected to memory with 64-byte cache lines. Legend: A = architectural state, E = execution engine & interrupt, C = 2nd-level cache, B = bus interface connecting to main memory & I/O.

    The dual core has a shared cache; the quad core has both shared and separated caches.

    The main point to cover here is that false sharing is an issue for platforms with separated caches; we can alleviate false sharing by restructuring the data layout and data-access patterns.

    A = architectural state: the contents or state of the FPU, MMX registers, and others.

    E = execution engine: the functional units such as FP, ALU, SIMD, etc.

    C = 2nd-level cache memory, with a latency on the order of a few cycles compared to 100s of clocks for main memory.

    B = system bus interface that connects to main memory & I/O.

    The cache line is the smallest unit of data that can be transferred to or from memory. When a single data element is requested by a program, say you need to read one variable of type float from memory, then that float and its 15 nearest neighbors (16 four-byte floats = 64 bytes in total, i.e., the same cache line) are brought into the faster cache memory for use by the processor.

    Cache-line ping-ponging, or the "tennis" effect

    One processor writes to a cache line and then another processor writes to the same cache line but to a different data element. When the cache line is in a separate-socket / separate-L2-cache environment, each core takes a HITM (HIT Modified) on the cache line, causing it to be shipped across the FSB (front-side bus) to memory. This increases FSB traffic, and even under good conditions it costs about the cost of a memory access.

  • With a separated cache

    Diagram (Intel Core Microarchitecture memory sub-system): CPU1 and CPU2, each with its own L2 cache, connected to memory over the front-side bus (FSB); shipping the L2 cache line between the caches costs roughly half of an access to memory.

  • Advantages of Shared Cache using Advanced Smart Cache Technology

    Diagram (Intel Core Microarchitecture memory sub-system): CPU1 and CPU2 share one L2 cache and connect to memory over the front-side bus (FSB).

    Because L2 is shared, there is no need to ship the cache line: the line simply goes from the Exclusive state to the Shared state for the other core to read.
  • False Sharing

    A performance issue in programs where cores write to different memory addresses BUT within the same cache line. Known as ping-ponging: the cache line is shipped back and forth between the cores.

    Timeline diagram: Core 0 repeatedly writes X[0] while Core 1 writes X[1]; because X[0] and X[1] lie in the same cache line, the line bounces between the two cores on every write.

    False sharing is not an issue in a shared cache; it is an issue in separated caches.

    From the Intel Press book Multi-Core Programming: Increasing Performance Through Software Multi-threading by Shameem Akhter and Jason Roberts:

    False Sharing
    The smallest unit of memory that two processors interchange is a cache line or cache sector. Two separate caches can share a cache line when they both need to read it, but if the line is written in one cache and read in another, it must be shipped between caches, even if the locations of interest are disjoint.

    Like two people writing in different parts of a log book, the writes are independent, but unless the book can be ripped apart, the writers must pass the book back and forth. In the same way, two hardware threads writing to different locations contend for a cache sector to the point where it becomes a ping-pong game.

    In this ping-pong game there are two threads, each running on a different core. Each thread increments a different location belonging to the same cache line. Because the locations belong to the same cache line, the cores must pass the sector back and forth across the memory bus (a minimal code sketch of this pattern follows).

    To avoid false sharing we need to alter either the algorithm or the data structure. We can add some padding to a data structure or array (just enough padding, generally less than a cache line) so that threads access data from different cache lines, or we can adjust the implementation of the algorithm (the loop stride) so that each thread accesses data in a different cache line.
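    A minimal sketch of this ping-pong pattern (assuming C with POSIX threads and a 64-byte cache line; the names and iteration count are illustrative, not from the book): two threads increment adjacent array elements that share a cache line, so every write invalidates the other core's copy of the line.

    /* compile: gcc -O2 -pthread false_sharing.c */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    #define ITERS 100000000L

    /* counters[0] and counters[1] are adjacent longs, so they fall in the
     * same 64-byte cache line; volatile keeps every increment in memory. */
    static volatile long counters[2];

    static void *worker(void *arg)
    {
        long idx = (long)(intptr_t)arg;
        for (long i = 0; i < ITERS; i++)
            counters[idx]++;               /* each store forces the line to bounce between cores */
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, worker, (void *)(intptr_t)0);
        pthread_create(&t1, NULL, worker, (void *)(intptr_t)1);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("%ld %ld\n", counters[0], counters[1]);
        return 0;
    }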

    Avoiding False Sharing

    Change either:

    the algorithm: adjust the implementation (e.g., the loop stride) so that each thread accesses data in a different cache line;

    or the data structure: add some padding to the data structure or arrays (just enough padding, generally less than a cache line) so that threads access data from different cache lines. A sketch of the padding approach follows.
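    A minimal sketch of the data-structure fix (assuming a 64-byte cache line; the struct name and the use of C11 alignas are illustrative choices, not from the slides): pad and align each counter so the two threads' data land in different cache lines.

    #include <stdalign.h>

    #define CACHE_LINE 64

    /* Each counter occupies, and is aligned to, its own 64-byte cache line,
     * so the two worker threads never write to the same line. */
    struct padded_counter {
        alignas(CACHE_LINE) volatile long value;
        char pad[CACHE_LINE - sizeof(long)];
    };

    static struct padded_counter counters[2];

    /* The worker threads from the previous sketch would now increment
     * counters[idx].value instead of counters[idx]; since the two values
     * live in different cache lines, the ping-ponging disappears. */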