ENGG 5101 Advanced Computer Architecture
Lecture 01 - Introduction
XU, Qiang (Johnny) 徐強
CUHK, Fall 2013

Course General Information

¨ Instructor: Qiang Xu
  * http://www.cse.cuhk.edu.hk/~qxu
  * Office hours: 1-3pm, Tuesday
¨ Course Info
  * http://www.cse.cuhk.edu.hk/engg5101
¨ TA: Zelong Sun
  * [email protected]
¨ Check student/faculty expectations on teaching and learning (available on course webpage)

Course Objective

¨ Learn the organizational paradigms that determine the capabilities, performance, power consumption and reliability of computer systems
  * The what, the how, and more importantly, the why
  * Processor microarchitecture
  * Memory hierarchies and cache coherence
¨ Focus on parallel organization and design, e.g., superscalar/VLIW and multiprocessors
¨ Learn how to read and evaluate research papers

Prerequisites

¨ Basic courses in
  * Digital Design
  * Hardware Organization/Computer Architecture

References

¨ Reference Books
  * M. Dubois, M. Annavaram, P. Stenstrom, Parallel Computer Organization and Design, Cambridge, 2012.
  * J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 5th ed., Morgan Kaufmann, 2012.
¨ Papers listed on the course webpage

Acknowledgement: Some slides adapted from reference slides of these books

Course Structure and Grading Scheme

¨ Lectures:
  * 2 weeks review of basic concepts and scalar processor
  * 2 weeks on advanced single-core processor
  * 1 week on memory systems
  * 1 week on power and reliability
  * 3 weeks on multiprocessor systems
  * 1 week on future trends
¨ What do you need to do?
  * Homework assignments: 20%
  * 2 research essays (by group): 30%
  * Midterm and final exam: 50%
  * Final exam grade must exceed 50 (out of 100) to pass!

This course is NOT for everyone!!!

What’re your Choices for Computing?

[Figure: energy efficiency (in MOPS/mW; ticks 0.1-1, 1-10, 10-100, 100-1000) versus flexibility or application scope (none, somewhat flexible, fully flexible). Design styles plotted: hardwired custom, configurable/parameterizable, domain-specific processor (e.g., GPU, DSP), microprocessor.]

Layered View of Computer Systems

Von Neumann Architecture

John von Neumann: “the last of the great mathematicians”

Alan Turing: “Father of Computer Science and AI”

Below the Program

¨ High-level language program (in C)

    void swap(int v[], int k)
    {
        int temp;
        temp = v[k];
        v[k] = v[k+1];
        v[k+1] = temp;
    }

  C compiler (one-to-many)

¨ Assembly language program (for MIPS)

    swap: sll $2, $5, 2      # $2 = k * 4 (byte offset of v[k])
          add $2, $4, $2     # $2 = address of v[k]
          lw  $15, 0($2)     # $15 = v[k]
          lw  $16, 4($2)     # $16 = v[k+1]
          sw  $16, 0($2)     # v[k] = $16
          sw  $15, 4($2)     # v[k+1] = $15
          jr  $31            # return to caller

  assembler (one-to-one)

¨ Machine (object) code (for MIPS)

    000000 00000 00101 0001000010000000
    000000 00100 00010 0001000000100000
    . . .
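The one-to-one assembler step can be sketched in a few lines of Python. This is only an illustration, assuming the standard MIPS R-type (opcode, rs, rt, rd, shamt, funct) and I-type (opcode, rs, rt, 16-bit immediate) field layouts; the helper names `encode_r` and `encode_i` are hypothetical and not part of the course material.

```python
def encode_r(rs, rt, rd, shamt, funct, opcode=0):
    """Pack an R-type MIPS instruction into a 32-bit binary string."""
    word = (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct
    return format(word, "032b")


def encode_i(opcode, rs, rt, imm):
    """Pack an I-type MIPS instruction into a 32-bit binary string."""
    word = (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)
    return format(word, "032b")


# The first three instructions of swap, one machine word per assembly line.
sll_word = encode_r(rs=0, rt=5, rd=2, shamt=2, funct=0x00)  # sll $2, $5, 2
add_word = encode_r(rs=4, rt=2, rd=2, shamt=0, funct=0x20)  # add $2, $4, $2
lw_word = encode_i(0x23, rs=2, rt=15, imm=0)                # lw  $15, 0($2)

for word in (sll_word, add_word, lw_word):
    print(word)
```

Each assembly line maps to exactly one 32-bit word, which is why the assembler step is labelled one-to-one while the compiler step is one-to-many.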

Input Device Inputs Object Code

[Figure: block diagram with Processor (Control, Datapath), Memory, and Devices (Input, Output, Network). The input device delivers the object code:]

    000000 00000 00101 0001000010000000
    000000 00100 00010 0001000000100000
    100011 00010 01111 0000000000000000
    100011 00010 10000 0000000000000100
    101011 00010 10000 0000000000000000
    101011 00010 01111 0000000000000100
    000000 11111 00000 0000000000001000

Object Code Stored in Memory

[Figure: block diagram with Processor (Control, Datapath), Memory, and Devices (Input, Output, Network). The seven instruction words of the object code now sit in Memory.]

Processor Fetches an Instruction

[Figure: block diagram with Processor (Control, Datapath), Memory, and Devices (Input, Output, Network). Memory holds the object code.]

Processor fetches an instruction from memory

Control Decodes the Instruction

[Figure: block diagram with Processor (Control, Datapath), Memory, and Devices (Input, Output, Network). The fetched instruction word is 000000 00100 00010 0001000000100000.]

Control decodes the instruction to determine what to execute

Datapath Executes the Instruction

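[Figure: block diagram with Processor (Control, Datapath), Memory, and Devices (Input, Output, Network). For the instruction word 000000 00100 00010 0001000000100000, the datapath adds the contents of Reg #4 to the contents of Reg #2 and puts the result in Reg #2.]

Datapath executes the instruction as directed by control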

What Happens Next?

[Figure: block diagram with Processor (Control, Datapath), Memory, and Devices (Input, Output, Network). Memory holds the object code, and the processor cycles through Fetch, Decode, and Exec.]

Processor Fetches the Next Instruction

[Figure: block diagram with Processor (Control, Datapath), Memory, and Devices (Input, Output, Network). Memory holds the object code.]

Processor fetches the next instruction from memory

How does it know which location in memory to fetch from next?
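One way to picture the fetch, decode, execute cycle is a toy interpreter for the seven instruction words of swap. The sketch below is an illustration under stated assumptions, not the course's own simulator: the function name `run_swap`, the base address 0x1000, and the sentinel return address are hypothetical, and only the five instructions used by swap are handled. The program counter (`pc`) answers the question above: it advances by 4 after each instruction unless a jump such as `jr` overwrites it.

```python
PROGRAM = [
    "000000 00000 00101 0001000010000000",  # sll $2, $5, 2
    "000000 00100 00010 0001000000100000",  # add $2, $4, $2
    "100011 00010 01111 0000000000000000",  # lw  $15, 0($2)
    "100011 00010 10000 0000000000000100",  # lw  $16, 4($2)
    "101011 00010 10000 0000000000000000",  # sw  $16, 0($2)
    "101011 00010 01111 0000000000000100",  # sw  $15, 4($2)
    "000000 11111 00000 0000000000001000",  # jr  $31
]


def run_swap(v, k, base=0x1000):
    """Interpret the swap object code on list v; return (new v, trace of PCs)."""
    code = [int(word.replace(" ", ""), 2) for word in PROGRAM]
    data = {base + 4 * i: x for i, x in enumerate(v)}  # data memory, 4-byte words
    regs = [0] * 32
    regs[4], regs[5] = base, k   # $4 = address of v, $5 = k
    return_address = -1          # hypothetical sentinel: "back in the caller"
    regs[31] = return_address
    pc, trace = 0, []
    while pc != return_address:
        word = code[pc // 4]     # fetch the word the PC points at
        trace.append(pc)
        next_pc = pc + 4         # default: next sequential instruction
        opcode = word >> 26      # decode the instruction fields
        rs, rt, rd = (word >> 21) & 31, (word >> 16) & 31, (word >> 11) & 31
        shamt, funct = (word >> 6) & 31, word & 63
        imm = word & 0xFFFF
        if imm & 0x8000:         # sign-extend the 16-bit immediate
            imm -= 0x10000
        if opcode == 0 and funct == 0x00:    # sll
            regs[rd] = (regs[rt] << shamt) & 0xFFFFFFFF
        elif opcode == 0 and funct == 0x20:  # add
            regs[rd] = (regs[rs] + regs[rt]) & 0xFFFFFFFF
        elif opcode == 0 and funct == 0x08:  # jr: the PC comes from a register
            next_pc = regs[rs]
        elif opcode == 0x23:                 # lw
            regs[rt] = data[regs[rs] + imm]
        elif opcode == 0x2B:                 # sw
            data[regs[rs] + imm] = regs[rt]
        else:
            raise ValueError("instruction not handled by this sketch")
        pc = next_pc
    return [data[base + 4 * i] for i in range(len(v))], trace


result, trace = run_swap([10, 20, 30, 40], 1)
print(result, trace)
```

The trace shows the PC stepping through addresses 0, 4, ..., 24 until `jr $31` redirects it to the return address.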

Output Data Stored in Memory

[Figure: block diagram with Processor (Control, Datapath), Memory, and Devices (Input, Output, Network). Memory now holds the data words:]

    00000100010100000000000000000000
    00000000010011110000000000000100
    00000011111000000000000000001000

At program completion the data to be output resides in memory

Output Device Outputs Data

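[Figure: block diagram with Processor (Control, Datapath), Memory, and Devices (Input, Output, Network). The output device outputs the data words held in Memory:]

    00000100010100000000000000000000
    00000000010011110000000000000100
    00000011111000000000000000001000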

What Differentiates Various Computer Architectures?

¨ The conceptual design and fundamental operational structure of a computer system
  * Instruction set architecture (ISA)
    » Programming model of a processor
    » Instructions, data types, registers, addressing modes, etc.
    » Not many ISAs survive over the years
  * Microarchitecture
    » How to implement the ISA at a high level
    » Pipelining, cache, branch prediction, superscalar, out-of-order execution, register renaming, multi-this, multi-that, etc.

Modern PC Architecture

Modern Smartphone Architecture

From: TI website

Generic Parallel Compute Architecture

Moore’s Law for CPUs and DRAMs

From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.

Main driver: device scaling ...

From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.

Secondary driver: Wafer size

From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.

Intel Core i7 Processor

45 nm technology, 18.9 mm x 13.6 mm, 0.73 billion transistors, 2008

Intel Core i7 Processor

Highest Clock Rate of Intel Processors

[Figure: highest clock rate of Intel processors over time.]

  » Due to process improvements
  » Deeper pipeline
  » Circuit design techniques

What if the exponential increase had kept up? Why not?

What will happen??

Power Density (if Increasing Clock Rate Exponentially as Before)

[Figure: power density (W/cm^2, log scale from 1 to 10000) versus year (1970 to 2010) for Intel processors 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® proc, and P6, with reference levels marked Hot Plate, Nuclear Reactor, and Rocket Nozzle. Annotation: power density too high to keep junctions at low temperature. Courtesy, Intel.]

POWER is the King Now!

¨ Total power = Dynamic power + Static power

$$P_{\text{dynamic}} = \alpha C V^{2} f$$

$$P_{\text{static}} = V I_{\text{sub}} \approx V e^{-K V_t / T}$$
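The dynamic-power formula can be explored numerically. The sketch below assumes the usual reading of the symbols, which the slide does not spell out: α is the switching activity factor, C the switched capacitance, V the supply voltage, and f the clock frequency. The numeric values are made up for illustration only.

```python
def dynamic_power(alpha, capacitance, voltage, frequency):
    """Dynamic power P = alpha * C * V^2 * f, in watts for SI inputs."""
    return alpha * capacitance * voltage ** 2 * frequency


# Hypothetical operating points: scale both voltage and frequency by 0.8.
base = dynamic_power(alpha=0.1, capacitance=1e-9, voltage=1.0, frequency=3.0e9)
scaled = dynamic_power(alpha=0.1, capacitance=1e-9, voltage=0.8, frequency=2.4e9)

ratio = scaled / base  # 0.8 ** 3, because power goes as V^2 * f
print(f"power drops to {ratio:.3f} of the baseline")
```

Lowering voltage and frequency together cuts dynamic power roughly cubically, which is one reason several slower cores can be more power-efficient than one fast core.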

Hitting “Power Wall” - Go for Multi-Core

P. Gargini Intel Developer’s Forum 2005

Parallel Computing for Higher Performance

¨ Classes of parallelism:
  * Instruction-Level Parallelism (ILP)
    » Pipelining, Superscalar, VLIW, EPIC
  * Data-Level Parallelism (DLP)
    » Vector architectures, GPU, SIMD extension for multimedia
  * Thread-Level Parallelism
    » SMT, Multiprocessor
  * Request-Level Parallelism
    » Warehouse-scale computing

Amdahl’s Law

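[Figure: execution time without enhancement E is split into 1-F and F; with E applied, it becomes 1-F and F/S.]

$$\text{Speedup} = \frac{1}{(1 - F) + \frac{F}{S}} < \frac{1}{1 - F}$$

Here F is the fraction of the original execution time that the enhancement applies to, and S is the speedup of that fraction.

¨ Lessons learned
  * Focus on the common case in design!!
  * The law of diminishing returns
¨ In practice, super-linear speedup is observed in some rare cases. How is that possible?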
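Amdahl's Law is easy to evaluate directly. The sketch below uses F for the enhanced fraction of execution time and S for the speedup of that fraction; the function name `amdahl_speedup` and the sample values are illustrative assumptions.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the time is sped up by a factor s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


# Diminishing returns: with F = 0.9 the speedup can never reach 1 / (1 - F) = 10.
for s in (2, 10, 100, 1000):
    print(s, round(amdahl_speedup(0.9, s), 3))
```

Even a thousandfold enhancement of 90% of the work stays below the bound of 10, which is the law of diminishing returns in concrete numbers.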

Gustafson’s Law

¨ When more cores are available, the workloads are also growing
  * Let us start with the execution time on the parallel machine with P processors: $T_P = s + p$
    » s is the time taken for serial code while p is the time taken for parallel code
  * Execution time on a single-core processor would be $T_1 = s + pP$
  * Let $F = p/(s+p)$. Then $S_P = T_1/T_P = (s + pP)/(s + p) = 1 - F + FP = 1 + F(P-1)$
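The scaled-speedup derivation can be checked numerically. The sketch below follows the slide's symbols (s, p, P, F); the function name `gustafson_speedup` and the sample timings are hypothetical.

```python
def gustafson_speedup(s, p, processors):
    """Scaled speedup T_1 / T_P with T_P = s + p and T_1 = s + p * P."""
    t_parallel = s + p
    t_single = s + p * processors
    return t_single / t_parallel


# Hypothetical timings: 1 unit of serial code, 9 units of parallel code per run.
s, p, P = 1.0, 9.0, 10
F = p / (s + p)
speedup = gustafson_speedup(s, p, P)
closed_form = 1 + F * (P - 1)  # same value via S_P = 1 + F(P - 1)
print(speedup, closed_form)
```

Unlike Amdahl's bound, the scaled speedup grows linearly with P because the parallel part of the workload grows with the machine.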

Challenges in Parallel Computing

¨ Parallel computing has existed for decades, but it entered the mainstream (even in your phone) only a few years ago. Why?
  * The design of parallel architecture is difficult, but it is not a roadblock
  * Parallel programming is hard!!!
    » The shift to multicore would not have happened if there were alternatives for improving performance without changing the programming model

Why Is Parallel Programming Hard?

¨ Programmers need to find parallelism in the algorithm
  * The good news is that emerging workloads usually have large data-level parallelism
¨ Programmers need to manage parallel overheads (e.g., communication and synchronization)
¨ Programmers often need to deal with the memory system explicitly
  * To run efficiently, a program should work on local data whenever possible
¨ Some of the above difficulties may be hidden in libraries, compilers and high-level languages, but there is still a long way to go

Memory Systems

¨ Growing gap between processor and memory speed, the so-called “Memory Wall”!
¨ One wants a memory system that is big, fast and cheap at the same time. How?

[Figure label: DRAM: 1.07 CGR]

Memory wall = memory_cycle / processor_cycle
  * In 1990, it was about 4 (25 MHz, 150 ns).
  * It grew exponentially to 200 until 2002.
  * It has tapered off since then.
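The 1990 figure can be reproduced from the two numbers given. The sketch below reads 25 MHz as the processor clock and 150 ns as the memory cycle time, which is how the ratio on the slide appears to be formed; the function name `memory_wall_ratio` is hypothetical.

```python
def memory_wall_ratio(clock_hz, memory_cycle_ns):
    """Memory cycle time expressed in processor cycles."""
    processor_cycle_ns = 1e9 / clock_hz  # e.g. 25 MHz gives 40 ns
    return memory_cycle_ns / processor_cycle_ns


ratio_1990 = memory_wall_ratio(clock_hz=25e6, memory_cycle_ns=150.0)
print(ratio_1990)  # 150 ns / 40 ns, i.e. "about 4"
```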

Memory Hierarchy

[Figure: on-chip components (Control, Datapath, RegFile, ITLB, DTLB, Instr Cache, Data Cache), followed by Second Level Cache (SRAM), eDRAM, Main Memory (DRAM), and Secondary Memory (Disk).]

From the fastest, smallest level to the slowest, largest level:

Speed (%cycles): ½’s, 1’s, 10’s, 100’s, 1,000’s
Size (bytes): 100’s, 10k’s, 100k’s, G’s, 100G’s

¨ By taking advantage of the principle of locality!!
  * Present the user with as much memory as is available in the cheapest technology
  * At the speed offered by the fastest technology
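A toy cache model shows why locality makes the hierarchy work. The sketch below assumes a direct-mapped cache with 64 lines of 64 bytes, parameters chosen only for illustration, and compares a sequential access pattern with a large-stride one; the function name `hit_rate` is hypothetical.

```python
def hit_rate(addresses, num_lines=64, line_bytes=64):
    """Fraction of accesses that hit in a direct-mapped cache."""
    stored_block = [None] * num_lines  # block number currently held by each line
    hits = 0
    for addr in addresses:
        block = addr // line_bytes
        index = block % num_lines
        if stored_block[index] == block:
            hits += 1
        else:
            stored_block[index] = block  # miss: bring the block into the cache
    return hits / len(addresses)


sequential = [4 * i for i in range(4096)]  # consecutive 4-byte words
strided = [4096 * i for i in range(4096)]  # every access lands in a new block

print(hit_rate(sequential), hit_rate(strided))
```

The sequential pattern reuses each fetched line for 16 consecutive words, so 15 of every 16 accesses hit. The strided pattern never touches the same block twice, so the small fast memory gives no benefit.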

What Is the Memory Wall, Really?

¨ Although still a big problem, the processor/memory speed gap stopped growing around 2002.
  * Growing on-chip cache size also mitigates the latency problem
¨ With multicore, it is the memory bandwidth wall!
  * Memory bandwidth is constrained by the limited IC pin count and I/O power.

From: Sandia National Lab.

Yet Another Challenge

¨ Hardware is NOT error-free in its lifetime and this problem is exacerbated with scaling!!!
  * Toyota blames soft error for sudden acceleration problem.

[Figure annotations: burn-in test less effective; higher random failure rate; faster wear-out.]

Engineering Design is about Tradeoff!

[Figure: Design at the center, surrounded by Performance, Power, Cost, Reliability/Availability, and Security.]

¨ This course is about how to achieve better tradeoffs at the architecture level
  * It used to be “It’s performance, stupid”
  * Power is often weighted more heavily than performance nowadays, especially for battery-powered systems
  * Reliability is becoming a first-class citizen

What would it be in the Next 10 Years

¨ We drop the ball??
  * Core number and transistor count stabilize at a certain point
¨ 100-billion transistor chip with 1000 cores??
¨ New process technology is invented for mainstream adoption??
¨ Domain-specific computing with lots of accelerators??