

    EEC 581

    Computer Architecture

    Multicore Architecture

    Department of Electrical Engineering and Computer Science

    Cleveland State University

Multiprocessor Architectures

Late 1950s - one general-purpose processor and one or more special-purpose processors for input and output operations

Early 1960s - multiple complete processors, used for program-level concurrency

Mid-1960s - multiple partial processors, used for instruction-level concurrency

Single-Instruction Multiple-Data (SIMD) machines

Multiple-Instruction Multiple-Data (MIMD) machines

A primary focus of this chapter is shared-memory MIMD machines (multiprocessors)


    Thread Level Parallelism (TLP)

    • Multiple threads of execution

    • Exploit ILP in each thread

    • Exploit concurrent execution across threads

Instruction and Data Streams

• Taxonomy due to M. Flynn

                          Data Streams
                          Single               Multiple
Instruction    Single     SISD:                SIMD:
Streams                   Intel Pentium 4      SSE instructions of x86
               Multiple   MISD:                MIMD:
                          No examples today    Intel Xeon e5345

• Example: Multithreading (MT) in a single address space


Recall The Executable Format

Object file sections: header, text, static data, reloc, symbol table, debug

Object file ready to be linked and loaded: Linker (with Static Libraries) → Loader → an executable instance, or Process

What does a loader do?

Process

• A process is a running program with state
  - Stack, memory, open files
  - PC, registers

• The operating system keeps track of the state of all processes
  - E.g., for scheduling processes

• There may be many processes for the same application
  - E.g., web browser

• See an operating systems class for details

Address space: Code, Static data, Heap, Stack, DLL's


Process Level Parallelism

[Figure: several independent processes running side by side]

• Parallel processes and throughput computing

• Each process itself does not run any faster

From Processes to Threads

• Switching processes on a core is expensive
  - A lot of state information to be managed

• If I want concurrency, launching a process is expensive

• How about splitting up a single process into parallel computations?
  → Lightweight processes, or threads!


Categories of Concurrency

Physical concurrency - Multiple independent processors (multiple threads of control)

Logical concurrency - The appearance of physical concurrency is presented by time-sharing one processor (software can be designed as if there were multiple threads of control)

Coroutines (quasi-concurrency) have a single thread of control

A thread of control in a program is the sequence of program points reached as control flows through the program

Motivations for the Use of Concurrency

Multiprocessor computers capable of physical concurrency are now widely used

Even if a machine has just one processor, a program written to use concurrent execution can be faster than the same program written for nonconcurrent execution

Involves a different way of designing software that can be very useful; many real-world situations involve concurrency

Many program applications are now spread over multiple machines, either locally or over a network

Introduction to Subprogram-Level Concurrency

A task (or process, or thread) is a program unit that can be in concurrent execution with other program units

Tasks differ from ordinary subprograms in that:
- A task may be implicitly started
- When a program unit starts the execution of a task, it is not necessarily suspended
- When a task's execution is completed, control may not return to the caller

Tasks usually work together

Two General Categories of Tasks

Heavyweight tasks execute in their own address space

Lightweight tasks all run in the same address space - more efficient

A task is disjoint if it does not communicate with or affect the execution of any other task in the program in any way

Task Synchronization

A mechanism that controls the order in which tasks execute

Two kinds of synchronization:
- Cooperation synchronization
- Competition synchronization

Task communication is necessary for synchronization, provided by:
- Shared nonlocal variables
- Parameters
- Message passing

Kinds of synchronization

Cooperation: Task A must wait for task B to complete some specific activity before task A can continue its execution, e.g., the producer-consumer problem

Competition: Two or more tasks must use some resource that cannot be used simultaneously, e.g., a shared counter

Competition synchronization is usually provided by mutually exclusive access (approaches are discussed later)
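A minimal C sketch of competition synchronization using POSIX Pthreads (one of the thread libraries named later in these slides): two tasks compete for a shared counter, and a mutex provides the mutually exclusive access. The two-thread, 100000-iteration setup is an arbitrary illustration, not part of the slide.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                      /* the shared, contended resource */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);            /* competition synchronization:  */
        counter++;                            /* only one task may update the  */
        pthread_mutex_unlock(&lock);          /* counter at any moment         */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);       /* always 200000 with the lock */
    return 0;
}

Without the mutex, the two increments race and the final count is unpredictable; that lost-update hazard is exactly what competition synchronization prevents.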


Thread Parallel Execution

[Figure: multiple threads executing in parallel within one process]

A Thread

• A separate, concurrently executable instruction stream within a process

• Minimum amount of state to execute on a core
  - Program counter, registers, stack
  - Remaining state shared with the parent process
    - Memory and files

• Support for creating threads

• Support for merging/terminating threads

• Support for synchronization between threads
  - In accesses to shared data

Our datapath so far!

TLP

ILP of a single program is hard

Large ILP is far-flung

We are human after all; we program with a sequential mind

Reality: running multiple threads or programs

    Thread Level Parallelism

    Time Multiplexing

    Throughput computing

    Multiple program workloads

    Multiple concurrent threads

    Helper threads to improve single program performance



    Single and Multithreaded Processes


    A Simple Example

    Data Parallel Computation

Thread Execution: Basics

[Figure: one process address space holding funcA(), funcB(), static data, heap, and one stack per thread; Thread #1 and Thread #2 each carry their own PC, registers, and stack pointer]

main: create_thread(funcA)
      create_thread(funcB)
      ...
      WaitAllThreads()

funcA() and funcB() run concurrently; each calls end_thread() when done
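The create_thread / WaitAllThreads / end_thread names in the figure are the slide's pseudocode, not a real API. A runnable POSIX Pthreads equivalent might look like this sketch, where pthread_join plays the role of WaitAllThreads and returning from the thread function is the end_thread:

#include <pthread.h>
#include <stdio.h>

static void *funcA(void *arg) {                /* Thread #1: own PC, registers, stack */
    (void)arg;
    printf("funcA running\n");
    return NULL;                               /* implicit end_thread() */
}

static void *funcB(void *arg) {                /* Thread #2 */
    (void)arg;
    printf("funcB running\n");
    return NULL;
}

int main(void) {
    pthread_t ta, tb;
    pthread_create(&ta, NULL, funcA, NULL);    /* create_thread(funcA) */
    pthread_create(&tb, NULL, funcB, NULL);    /* create_thread(funcB) */
    pthread_join(ta, NULL);                    /* WaitAllThreads() */
    pthread_join(tb, NULL);
    return 0;
}

Both threads share the process's static data and heap, as in the figure; only the stack and register state are private.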


Examples of Threads

A web browser
- One thread displays images
- One thread retrieves data from the network

A word processor
- One thread displays graphics
- One thread reads keystrokes
- One thread performs spell checking in the background

A web server
- One thread accepts requests
- When a request comes in, a separate thread is created to service it
- Many threads to support thousands of client requests

RPC or RMI (Java)
- One thread receives a message
- The message service uses another thread



Threads Execution on a Single Core

• Hardware threads
  - Each thread has its own hardware state

• Switching between threads on each cycle to share the core pipeline - why?

Pipeline: IF ID EX MEM WB

Thread #1:                        Thread #2:
  lw   $t0, label($0)               lw   $t3, 0($t0)
  lw   $t1, label1($0)              add  $t2, $t2, $t3
  and  $t2, $t0, $t1                addi $t0, $t0, 4
  andi $t3, $t1, 0xffff             addi $t1, $t1, -1
  srl  $t2, $t2, 12                 bne  $t1, $zero, loop
  ...                               ...

Cycle-by-cycle interleaving: the pipeline alternates an lw from Thread #1 with an lw from Thread #2, so each thread's dependent instruction (and, add) enters only after its load has produced its value.

No pipeline stall on load-to-use hazard!

Interleaved execution improves utilization!

An Example Datapath

From Poonacha Kongetira, "Microarchitecture of the UltraSPARC T1 CPU"


Execution Model: Multithreading

• Fine-grain multithreading
  - Switch threads after each cycle
  - Interleave instruction execution

• Coarse-grain multithreading
  - Only switch on a long stall (e.g., an L2-cache miss)
  - Simplifies hardware, but does not hide short stalls (e.g., data hazards)
  - If one thread stalls (e.g., on I/O), others are executed

Simultaneous Multithreading

• In multiple-issue, dynamically scheduled processors
  - Instruction-level parallelism across threads
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when function units are available

• Example: Intel Pentium 4 HT
  - Two threads: duplicated registers, shared function units and caches
  - Known as Hyperthreading in Intel terminology


Threads vs. Processes

Thread
- A thread has no data segment or heap
- A thread cannot live on its own; it must live within a process
- There can be more than one thread in a process; the first thread calls main and has the process's stack
- Inexpensive creation
- Inexpensive context switching
- If a thread dies, its stack is reclaimed by the process

Process
- A process has code/data/heap and other segments
- There must be at least one thread in a process
- Threads within a process share code/data/heap and share I/O, but each has its own stack and registers
- Expensive creation
- Expensive context switching
- If a process dies, its resources are reclaimed and all threads die
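A hedged micro-benchmark sketch of the creation-cost difference claimed above: it times N fork()+waitpid() pairs against N pthread_create()+pthread_join() pairs. The loop count and timing method are arbitrary choices, and absolute ratios vary by OS (the Solaris 30x figure quoted later is one historical data point).

#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define N 200

static void *noop(void *arg) { (void)arg; return NULL; }

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    double t0 = seconds();
    for (int i = 0; i < N; i++) {              /* heavyweight: new address space */
        pid_t pid = fork();
        if (pid == 0) _exit(0);                /* child exits immediately */
        waitpid(pid, NULL, 0);
    }
    double t1 = seconds();
    for (int i = 0; i < N; i++) {              /* lightweight: shared address space */
        pthread_t t;
        pthread_create(&t, NULL, noop, NULL);
        pthread_join(t, NULL);
    }
    double t2 = seconds();
    printf("fork:    %.3f s\npthread: %.3f s\n", t1 - t0, t2 - t1);
    return 0;
}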

Thread Implementation

Process defines the address space; threads share the address space

Process Control Block (PCB) contains process-specific info
- PID, owner, heap pointer, active threads, and pointers to thread info

Thread Control Block (TCB) contains thread-specific info
- Stack pointer, PC, thread state, registers, ...

[Figure: the process's address space holds CODE, initialized data, heap, a stack per thread (thread 1, thread 2), DLL's, and a reserved region; each thread's TCB holds $pc, $sp, state, and registers]
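A minimal C sketch of the PCB/TCB split just described; the field names follow the slide, while the types and the fixed-size thread array are illustrative assumptions, not any real kernel's layout.

#include <stdint.h>

#define MAX_THREADS 16

typedef enum { READY, RUNNING, BLOCKED, TERMINATED } thread_state_t;

/* Thread Control Block: per-thread execution state */
typedef struct tcb {
    uintptr_t      sp;              /* $sp: saved stack pointer */
    uintptr_t      pc;              /* $pc: saved program counter */
    thread_state_t state;
    uintptr_t      regs[32];        /* saved general-purpose registers */
} tcb_t;

/* Process Control Block: per-process state shared by all its threads */
typedef struct pcb {
    int    pid;
    int    owner;                   /* owning user id */
    void  *heap_ptr;                /* the shared heap */
    int    nthreads;                /* number of active threads */
    tcb_t *threads[MAX_THREADS];    /* pointers to thread info */
} pcb_t;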

Benefits

Responsiveness
- When one thread is blocked, your browser still responds
- E.g., download images while allowing your interaction

Resource Sharing
- Share the same address space
- Reduce overhead (e.g., memory)

Economy
- Creating a new process costs memory and resources
- E.g., in Solaris, creating a process is 30 times slower than creating a thread

Utilization of MP Architectures
- Threads can be executed in parallel on multiple processors
- Increase concurrency and throughput

User-level Threads

Thread management is done by a user-level threads library
- Similar to calling a procedure
- Thread management is done by the thread library in user space

User can control the thread scheduling (without disturbing the underlying OS scheduler)

No OS kernel support needed → more portable

Low overhead when thread switching

Three primary thread libraries:
- POSIX Pthreads
- Java threads
- Win32 threads

Kernel Threads

A.k.a. lightweight processes in the literature

Supported by the kernel

Thread scheduling is fairer

Examples: Windows XP/2000, Solaris, Linux, Tru64 UNIX, Mac OS X

    Multithreading Models

    Many-to-One

    One-to-One

    Many-to-Many

Many-to-One

Many user-level threads mapped to a single kernel thread

The entire process will block if a thread makes a blocking system call

Cannot run threads in parallel on multiprocessors

Examples: Solaris Green Threads, GNU Portable Threads

Many-to-One Model [figure]

One-to-One

Each user-level thread maps to a kernel thread

A blocking system call by one thread does not block the others

Enables parallel execution in an MP system

Downsides:
- Performance/memory overheads of creating kernel threads
- Restriction on the number of threads that can be supported

Examples: Windows NT/XP/2000, Linux, Solaris 9 and later

One-to-One Model [figure]

Many-to-Many Model

Allows many user-level threads to be mapped to many kernel threads

Allows the operating system to create a sufficient number of kernel threads

Threads are multiplexed to a smaller (or equal) number of kernel threads, which is specific to a particular application or a particular machine

Examples: Solaris prior to version 9, Windows NT/2000 with the ThreadFiber package

Many-to-Many Model [figure]

Pipeline Hazards

Multithreading

Multi-Tasking Paradigm

Virtual memory makes it easy

Context switch could be expensive or require extra HW
- VIVT cache
- VIPT cache
- TLBs

[Figure: a conventional single-threaded superscalar's issue slots (FU1-FU4) over time, with Threads 1-5 each given an execution time quantum and many slots left unused]

Multi-threading Paradigm

[Figure: execution time vs. issue slots (FU1-FU4) for Threads 1-5 under five paradigms: conventional superscalar (single-threaded), fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), simultaneous multithreading (SMT), and chip multiprocessor (CMP or multicore)]

Conventional Multithreading

Zero-overhead context switch

Duplicated contexts for threads: the register file holds one copy of r0-r7 per thread (0:r0-0:r7 through 3:r0-3:r7), selected by a context pointer (CtxtPtr); memory is shared by the threads
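A C model of the register-file organization in that figure: one bank of r0-r7 per thread context, with CtxtPtr selecting the active bank. The function names are illustrative; real hardware does this selection with a mux, not code.

#include <stdint.h>

#define NCTX  4   /* hardware thread contexts, as in the figure */
#define NREGS 8   /* r0-r7 per context */

static uint32_t regfile[NCTX][NREGS];  /* duplicated contexts: 0:r0 .. 3:r7 */
static int ctxt_ptr;                   /* CtxtPtr: selects the active context */

/* Zero-overhead context switch: only the context pointer moves;
   no register state is saved or restored. */
static inline void context_switch(void) { ctxt_ptr = (ctxt_ptr + 1) % NCTX; }

static inline uint32_t read_reg(int r)          { return regfile[ctxt_ptr][r]; }
static inline void write_reg(int r, uint32_t v) { regfile[ctxt_ptr][r] = v; }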


Cycle Interleaving MT

Per-cycle, per-thread instruction fetching

Examples: HEP, Horizon, Tera MTA, MIT M-machine

Interesting questions to consider
- Does it need a sophisticated branch predictor?
- Or does it need any speculative execution at all?
  - Get rid of "branch prediction"?
  - Get rid of "predication"?
- Does it need any out-of-order execution capability?

Block Interleaving MT

Context switch on a specific event (dynamic pipelining)
- Explicit switching: implementing a switch instruction
- Implicit switching: triggered when a specific instruction class is fetched

Static switching (switch upon fetching)
- Switch-on-memory-instructions: Rhamma processor
- Switch-on-branch or switch-on-hard-to-predict-branch
- Trigger can be an implicit or explicit instruction

Dynamic switching
- Switch-on-cache-miss (switch in a later pipeline stage): MIT Sparcle (MIT Alewife's node), Rhamma processor
- Switch-on-use (lazy strategy of switch-on-cache-miss): wait until the last minute
  - A valid bit is needed for each register: cleared when the load issues, set when the data returns (see the sketch below)
- Switch-on-signal (e.g., interrupt)
- Predicated switch instruction based on conditions

No need to support a large number of threads
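A hedged C sketch of the switch-on-use bookkeeping described above: each register carries a valid bit that is cleared when a load to it issues and set when the data returns, and the core switches threads only when an instruction actually tries to read a not-yet-valid register. All names are illustrative; real designs implement this in hardware.

#include <stdbool.h>

#define NREGS    32
#define NTHREADS 4

static bool valid[NTHREADS][NREGS];    /* per-thread register valid bits */

/* A load issues: its destination register is invalid until data returns. */
static void on_load_issue(int tid, int rd)  { valid[tid][rd] = false; }

/* Memory returns the data: the register becomes usable again. */
static void on_load_return(int tid, int rd) { valid[tid][rd] = true; }

/* Switch-on-use: the valid bit is consulted only when a source register
   is actually read (the lazy strategy). Returns the thread to run next. */
static int on_register_read(int tid, int rs) {
    if (valid[tid][rs])
        return tid;                    /* data ready: keep running this thread */
    return (tid + 1) % NTHREADS;       /* a stall would occur: switch threads  */
}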

Simultaneous Multithreading (SMT)

The SMT name was first used by UW; earlier versions came from UCSB [Nemirovsky, HICSS '91] and [Hirata et al., ISCA-92]

Intel's HyperThreading (2-way SMT)

IBM Power7 (4/6/8 cores, 4-way SMT); IBM Power5/6 (2 cores, each 2-way SMT, 4 chips per package): Power5 has OoO cores, Power6 in-order cores

Basic ideas: Conventional MT + simultaneous issue + sharing common resources

[Figure: SMT pipeline - per-thread PCs feed a shared fetch unit and I-CACHE; decode and per-thread register renamers feed the RS & ROB plus a physical register file; shared execution units (ALU 1, ALU 2, FAdd (2 cyc), FMult (4 cycles), unpipelined Fdiv (16 cycles), variable-latency Load/Store) access the D-CACHE; per-thread register files hold architected state]

Instruction Fetching Policy

FIFO, round-robin: simple, but may be too naive

Adaptive fetching policies (see the ICOUNT sketch below):

BRCOUNT (reduce wrong-path issuing)
- Count the # of branch instructions in the decode/rename/IQ stages
- Give top priority to the thread with the least BRCOUNT

MISSCOUNT (reduce IQ clog)
- Count the # of outstanding D-cache misses
- Give top priority to the thread with the least MISSCOUNT

ICOUNT (reduce IQ clog)
- Count the # of instructions in the decode/rename/IQ stages
- Give top priority to the thread with the least ICOUNT

IQPOSN (reduce IQ clog)
- Give lowest priority to threads whose instructions are closest to the head of the INT or FP instruction queues, since threads with the oldest instructions are most prone to IQ clog
- No counter needed
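A small C sketch of the ICOUNT policy from this list: each cycle, fetch from the thread with the fewest instructions in the decode/rename/IQ stages. The counter-update hooks are illustrative assumptions about where a pipeline would adjust the counts.

#define NTHREADS 4

static int icount[NTHREADS];   /* insts in decode/rename/IQ, per thread */

/* ICOUNT: fetch from the thread with the fewest in-flight front-end
   instructions, so no single thread can clog the instruction queue. */
static int pick_fetch_thread(void) {
    int best = 0;
    for (int t = 1; t < NTHREADS; t++)
        if (icount[t] < icount[best])
            best = t;
    return best;
}

/* Illustrative hooks: called as instructions enter and leave the
   decode/rename/IQ stages. */
static void on_dispatch(int tid) { icount[tid]++; }
static void on_issue(int tid)    { icount[tid]--; }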

Resource Sharing

Could be tricky when threads compete for the resources

Static partitioning
- Less complexity
- Could penalize threads (e.g., instruction window size)
- P4's HyperThreading

Dynamic sharing
- Complex
- What is fair? How to quantify fairness?
- A growing concern in multi-core processors
  - Shared L2, bus bandwidth, etc.

Issues: fairness, mutual thrashing

HyperThreading

• Implementing Hyper-threading adds less than 5% to the chip area

• Principle: share major logic components (functional units) and improve utilization

• Architecture state: all core pipeline resources needed for executing a thread

[Figure: 2 CPUs without Hyper-threading pair one architecture state with each set of processor execution resources; 2 CPUs with Hyper-threading expose two architecture states per set of shared execution resources]

Multithreading with ILP: Examples

P4 HyperThreading Resource Partitioning

Trace cache (TC, or UROM) is alternately accessed per cycle by each logical processor unless one is stalled due to a TC miss

µop queue (split in half) after fetch from the TC

ROB (126/2)

LB (48/2)

SB (24/2) (32/2 for Prescott)

General µop queue and memory µop queue (split in half)

TLB (split in half?) as there is no PID

Retirement: alternating between the 2 logical processors

Alpha 21464 (EV8) Processor

Technology
- Leading-edge process technology: 1.2 ~ 2.0 GHz, 0.125 µm CMOS
- SOI-compatible
- Cu interconnect
- Low-k dielectrics

Chip characteristics
- ~1.2 V Vdd
- ~250 million transistors
- ~1100 signal pins in flip-chip packaging

Alpha 21464 (EV8) Processor

Architecture
- Enhanced out-of-order execution (that giant 2Bc-gskew predictor we discussed before is here)
- Large on-chip L2 cache
- Direct RAMBUS interface
- On-chip router for system interconnect
- Glueless, directory-based ccNUMA for up to 512-way SMP
- 8-wide superscalar
- 4-way simultaneous multithreading (SMT)
- Total die overhead ~6% (allegedly)

SMT Pipeline

[Figure: Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire; per-thread PCs and register maps, with shared Icache, Dcache, and register files]

Source: a company once called Compaq

EV8 SMT

In SMT mode, it is as if there are 4 processors on a chip that share their caches and TLB

Replicated hardware contexts
- Program counter
- Architected registers (actually just the renaming table, since architected registers and rename registers come from the same physical pool)

Shared resources
- Rename register pool (larger than needed by 1 thread)
- Instruction queue
- Caches
- TLB
- Branch predictors

Deceased before seeing the light of day.

Reality Check, circa 200x

Conventional processor designs run out of steam
- Power wall (thermal)
- Complexity (verification)
- Physics (CMOS scaling)

[Figure: power density (W/cm²) on a log scale from 1 to 1000 for the i386, i486, Pentium, Pentium Pro, Pentium II, and Pentium III processors, rising past hot-plate levels toward nuclear-reactor, rocket-nozzle, and Sun's-surface levels]

"Surpassed hot-plate power density in 0.5 µm; not too long to reach nuclear reactor," former Intel Fellow Fred Pollack.

Latest Power Density Trend

Yeo and Lee, "Peeling the Power Onion of Data Centers," in Energy Efficient Thermal Management of Data Centers, Springer, to appear 2011

Reality Check, circa 200x

Conventional processor designs run out of steam
- Power wall (thermal)
- Complexity (verification)
- Physics (CMOS scaling)

Unanimous direction: multi-core
- Simple cores (in massive numbers)
- Keep wire communication on a leash
- Keep Gordon Moore happy (Moore's Law)

Architects' menace: kick the ball to the other side of the court?

What do you (or your customers) want?
- Performance (and/or availability)
- Throughput > latency (turnaround time)
- Total cost of ownership (performance per dollar)
- Energy (performance per watt)
- Reliability and dependability, SPAM/spy free


    Multi-core Processor Gala

Intel's Multicore Roadmap

To extend Moore's Law

To delay the ultimate limit of physics

By 2010, all Intel processors delivered will be multicore

Intel's 80-core processor (FPU array)

Source: adapted from Tom's Hardware

[Roadmap table, 2006 → 2007 → 2008:
Desktop processors: SC 1MB; DC 2MB; DC 2/4MB shared; DC 3MB/6MB shared (45nm)
Mobile processors: SC 512KB/1MB/2MB; DC 2/4MB; DC 2/4MB shared; DC 4MB; DC 3MB/6MB shared (45nm)
Enterprise processors: DC 2MB; DC 4MB; DC 16MB; QC 4MB; QC 8/16MB shared; 8C 12MB shared (45nm)]

Is a Multi-core Really Better Off?

Well, it is hard to say in the computing world

"If you were plowing a field, which would you rather use: two strong oxen or 1024 cores?" --- Seymour Cray

Q1. For a PIPT cache with virtual memory support, three possible events can be triggered during an instruction fetch: (1) a cache lookup, (2) a TLB miss, (3) a page fault. Please order these events in the correct order of their occurrence.

Answer: (2) (3) (1). In a PIPT cache, address translation takes place prior to the cache lookup. The hardware first searches for a match in the TLB, so a TLB miss (if any) takes place first. A page table walk is then initiated; if the page has not been allocated, a page fault follows. The OS then allocates the page, fills in the page table entry, and fills the translation into the TLB, followed by the cache lookup.
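A runnable C sketch of that ordering as straight-line control flow, with a toy one-entry TLB and a toy page table; the page size, table sizes, and function names are illustrative assumptions, not part of the question.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                 /* assumed 4 KB pages, for illustration */
#define NPAGES     1024               /* toy: only small virtual addresses */

static uint32_t tlb_vpn = ~0u, tlb_ppn;  /* toy one-entry TLB */
static uint32_t page_table[NPAGES];      /* 0 = page not yet allocated */
static uint32_t next_free_ppn = 1;

static uint32_t translate(uint32_t vaddr) {
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    if (tlb_vpn != vpn) {                      /* (2) TLB miss comes first      */
        if (page_table[vpn] == 0)              /* page walk finds no mapping... */
            page_table[vpn] = next_free_ppn++; /* (3) ...page fault: OS maps it */
        tlb_vpn = vpn;                         /* fill translation into the TLB */
        tlb_ppn = page_table[vpn];
    }
    return (tlb_ppn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
}

int main(void) {
    uint32_t paddr = translate(0x1234);        /* translate first...          */
    printf("cache lookup with paddr 0x%x\n",   /* (1) ...then the PIPT lookup */
           (unsigned)paddr);
    return 0;
}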

Q2. Given a 256Meg x4 DRAM chip which consists of 2 banks, with 14-bit row addresses. (256Meg indicates the number of addresses.) What is the row buffer size for each bank?

256M addresses → 28 address bits needed. One bit is used for the bank index; hence, the column address = 28 - 1 - 14 = 13 bits.

As the DRAM is a "x4" configuration, one row buffer of a bank = 2^13 * (x4 bits) = 32 Kbits = 4 KB.
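A small C check of that arithmetic; the helper name and parameters are just for this sketch.

#include <stdio.h>

/* Row buffer bits = 2^(column bits) * device width, where
   column bits = total address bits - bank bits - row bits. */
static long row_buffer_bits(int addr_bits, int bank_bits,
                            int row_bits, int width_bits) {
    int col_bits = addr_bits - bank_bits - row_bits;   /* 28 - 1 - 14 = 13 */
    return (1L << col_bits) * width_bits;
}

int main(void) {
    long bits = row_buffer_bits(28, 1, 14, 4);  /* 256Meg = 2^28 addresses, x4 */
    printf("%ld bits = %ld KB\n", bits, bits / 8 / 1024);  /* 32768 bits = 4 KB */
    return 0;
}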

Q3. Assume an Inverted Page Table (8-entry IPT) is used by a 32-bit OS. The memory page size is 256KB. The complete IPT content is shown below. The Physical Page Number (PPN) starts from 0 to 7 from the top of the table. There are three active processes, P1 (PID=1), P2 (PID=2), and P3 (PID=3), running in the system, and the IPT holds the translation for the entire physical memory. Answer the following questions.

Based on the size of the Inverted Page Table above, what is the size of the physical memory?

There are 8 entries in the IPT. As each page is 256KB, the size of the physical memory = 8 * 256KB = 2MB.

IBM Watson Jeopardy! Competition

POWER7 chips (2,880 cores) + 16 TB memory

Massively parallel processing

Combines processing power, natural language processing, AI, search, and knowledge extraction


Major Challenges for Multi-Core Designs

Communication
- Memory hierarchy
- Data allocation (you have a large shared L2/L3 now)
- Interconnection network
  - AMD HyperTransport
  - Intel QPI

Scalability
- Bus bandwidth: how to get there?

Power-Performance: win or lose?
- Borkar's multicore arguments: a 15% per-core performance drop buys a 50% power saving
- A giant single core wastes power when the task is small
- How about leakage?

Process variation and yield

Programming Model