
S. Reda EN2910A FALL‘13

EN2910A: Advanced Computer Architecture Topic 01: Introduction to Quantitative Analysis

Prof. Sherief Reda School of Engineering

Brown University


Topic 01: Introduction to Quantitative Analysis

1.  Trends in computing systems

2.  Quantifying performance

3.  Quantifying power

4.  Role of simulators


Moore’s law

•  Started with a 10 µm feature size in 1970. We are now at a 22 nm feature size.

•  The number of transistors per unit area doubles every 2-3 years.

•  More transistors per die → more capable cores and/or more cores, GPUs, and accelerator functional units, as in SoCs.

•  Transistors can switch faster with new technology.
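As a rough sanity check on the doubling period, the two feature sizes quoted above can be turned into an average doubling time. This is only a sketch: it assumes transistor density scales as 1/(feature size)^2, and it takes 2013 (the term of these slides) as "now". The function name is made up for this note.

```python
import math

def years_per_density_doubling(old_nm, new_nm, years):
    """Average years per doubling of transistor density.

    Assumption (idealized): density scales as 1 / feature_size**2.
    """
    density_gain = (old_nm / new_nm) ** 2
    doublings = math.log2(density_gain)
    return years / doublings

# 10 um (10,000 nm) in 1970 to 22 nm in 2013.
period = years_per_density_doubling(10_000, 22, 2013 - 1970)
print(round(period, 1))  # roughly 2.4 years per doubling
```

Under these assumptions the result falls inside the 2-3 year range stated on the slide.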


Trends in single processor performance

1.  Process technology → more transistors per unit area → build more powerful cores (more ILP) and/or add more cores (TLP)

2.  Process technology → faster transistors → higher clock rate

3.  Deeper pipelines → higher clock rate

4.  Better circuit design techniques and better CAD tools

[patterson]


Evolution of clock rate

•  Dynamic power is proportional to $fV^2$.

•  Increasing the frequency with less-than-ideal voltage scaling leads to increased power density.

[Dubois]


Power wall

Economical heat-removal mechanisms (e.g., air and liquid cooling) limit the maximum power consumption.

[patterson]

Lack of parallelism wall

•  Programming parallel applications is hard.

•  Sometimes applications do not scale; synchronization can be a bottleneck.

•  Example: speedup of PARSEC benchmarks on an 8-core Xeon server

•  Circumventing the parallelism wall: incorporate heterogeneous functional units on die (e.g., GPUs, accelerators).

[weaver '09]


Memory wall

•  Unlike processor performance, DRAM performance improved by only about 7% a year → a memory wall.

•  DRAM density followed Moore's law.

•  Although still a big problem, the memory latency wall stopped growing around 2002.

•  With the advent of multicore microarchitectures, the memory problem has shifted from latency to bandwidth.

[Dubois]

•  Memory wall = memory_cycle/processor_cycle

•  In 1990, it was about 4 (25 MHz processor clock, 150 ns memory cycle). It grew to about 200 by 2002 but has tapered off since then.
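The 1990 figure can be reproduced directly from the definition above. A minimal sketch, with a helper name chosen for this note:

```python
def memory_wall(memory_cycle_ns, clock_mhz):
    """Memory cycle time expressed in processor clock cycles."""
    processor_cycle_ns = 1_000.0 / clock_mhz  # period of a 1 MHz clock is 1000 ns
    return memory_cycle_ns / processor_cycle_ns

# 1990 data point from the slide: 25 MHz processor, 150 ns memory cycle.
ratio_1990 = memory_wall(150, 25)
print(ratio_1990)  # 3.75, i.e. "about 4"
```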


Summary of current and future challenges to computing

•  Memory wall
    –  An increasing number of cores requires increased memory bandwidth; otherwise, starvation and stalling occur.

•  Parallelism wall
    –  Some applications lack enough ILP or TLP → not much benefit from aggressive superscalar or many-core designs.

•  Power wall
    –  Limits on heat removal impose a limit on power density and frequency of operation.

•  Power and parallelism walls → dark silicon (only a small portion of the chip is operational at any moment).

Performance metrics

1.  execution time, latency, response time: time to complete a task

2.  throughput: number of tasks (e.g., instructions, queries, frames rendered) completed per unit time.

•  Is throughput = 1/(average response time)?

    –  Only if there is NO overlap between tasks
    –  Otherwise, throughput > 1/(average response time)
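The relationship between throughput and response time can be illustrated with a simple pipelined model. This is a sketch under assumed numbers (4 tasks, 2 s latency each, a new task started every 0.5 s in the overlapped case); none of these values come from the slides.

```python
def pipelined_throughput(num_tasks, latency_s, start_interval_s):
    """Tasks completed per second when a new task starts every start_interval_s.

    Each task takes latency_s on its own; tasks overlap whenever
    start_interval_s < latency_s.
    """
    elapsed_s = latency_s + (num_tasks - 1) * start_interval_s
    return num_tasks / elapsed_s

latency = 2.0  # hypothetical response time of every task, in seconds
no_overlap = pipelined_throughput(4, latency, start_interval_s=latency)
overlapped = pipelined_throughput(4, latency, start_interval_s=0.5)

print(no_overlap, 1 / latency)   # 0.5 0.5 -> equal when tasks run back to back
print(overlapped > 1 / latency)  # True   -> overlap raises throughput
```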


Which benchmarks?

1.  Real programs (e.g., MPEG encoding)
2.  Synthetic benchmarks (e.g., measuring I/O storage bandwidth)
3.  Kernels
4.  Toy benchmarks (e.g., quicksort)
5.  Benchmark suites:
    •  SPEC: Standard Performance Evaluation Corporation
        –  SPEC CPU integer
        –  SPEC CPU floating point
        –  SPECpower_ssj (transactional)
        –  SPECviewperf for GPU performance
    •  PARSEC for multi-threaded applications
    •  Rodinia for GPGPU performance
    •  NASA's NAS Parallel Benchmarks (NPB) for clusters (e.g., FFT)
    •  HPC Challenge benchmarks (HPCC) for clusters (e.g., linear solver)

Examples of benchmark results


Runtime and average processor power for SPEC CPU2006 benchmarks using AMD Phenom II X4 965 Black edition at 3.4 GHz and 4GB DRAM running Linux 2.6.10.8


Reporting performance for a set of programs

•  Arithmetic mean: $\frac{1}{N}\sum_{i=1}^{N} T_i$ or, with weights, $\sum_{i=1}^{N} w_i T_i$

    The problem is that the programs with the longest execution times can dominate the result.

•  Reporting speedups (Why?):
    –  Speedup measures the advantage of a machine over a reference machine R for program i: $S_i = \frac{T_{R,i}}{T_i}$
    –  Arithmetic mean of speedups: $\frac{1}{N}\sum_{i=1}^{N} S_i$
    –  Geometric mean of speedups: $\sqrt[N]{\prod_{i=1}^{N} S_i}$
    –  Harmonic mean of speedups: $\frac{N}{\sum_{i=1}^{N} 1/S_i}$

What is the advantage?
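The three means of speedups can be computed in a few lines. A minimal sketch; the function names are chosen for this note, and the speedups of 10 and 100 are simply a two-program case that is easy to check by hand.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # N-th root of the product of the N values.
    return math.prod(xs) ** (1.0 / len(xs))

def harmonic_mean(xs):
    # N divided by the sum of reciprocals.
    return len(xs) / sum(1.0 / x for x in xs)

speedups = [10.0, 100.0]  # S_i for two programs
am = arithmetic_mean(speedups)
gm = geometric_mean(speedups)
hm = harmonic_mean(speedups)
print(round(am, 1), round(gm, 1), round(hm, 1))  # 55.0 31.6 18.2
```

The three means can differ widely for the same data, which is why the choice of mean matters when ranking machines.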


Example

                          Program A   Program B   Arithmetic   Harmonic   Geometric
Speedup wrt Reference 1
  Machine 1                      10         100         55         18.2        31.6
  Machine 2                     100          50         75         66.7        70.7
Speedup wrt Reference 2
  Machine 1                      10          10         10         10          10
  Machine 2                     100           5         52.5        9.5        22.4

Which is better, machine 1 or machine 2?

               Program A   Program B   Arithmetic mean   Ratio of means (ref 1)   Ratio of means (ref 2)
Machine 1         10 sec     100 sec            55 sec                     91.8                     10
Machine 2          1 sec     200 sec         100.5 sec                     50.2                      5.5
Reference 1      100 sec   10000 sec          5050 sec
Reference 2      100 sec    1000 sec           550 sec
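One property of the geometric mean that the numbers above exhibit can be written out as a short derivation. This is a standard algebraic identity rather than something stated on the slide; $T_{1,i}$ and $T_{2,i}$ (notation introduced for this note) are the execution times of Machine 1 and Machine 2 on program i.

```latex
\frac{\sqrt[N]{\prod_{i=1}^{N} T_{R,i}/T_{1,i}}}
     {\sqrt[N]{\prod_{i=1}^{N} T_{R,i}/T_{2,i}}}
  = \sqrt[N]{\prod_{i=1}^{N} \frac{T_{2,i}}{T_{1,i}}}
```

The reference times $T_{R,i}$ cancel, so the ratio of the two geometric means is the same for any reference machine. With the execution times in the table, $\sqrt{(1/10)(200/100)} \approx 0.447$, which matches both 31.6/70.7 and 10/22.4.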


When to use the harmonic mean?

•  Consider a processor that executes its first 10 billion instructions at a rate of 1 BIPS (billion instructions per second) and its next 10 billion instructions at a rate of 2 BIPS. What is the average instruction rate?

•  Average BIPS = (1 + 2)/2 = 1.5 WRONG

•  Average BIPS = (10 + 10)/(10/1 + 10/2) = 20/15 = 1.33

•  Harmonic mean of rates = $\frac{n}{\sum_{i=1}^{n} \frac{1}{\text{rate}(i)}}$

•  Use the HM if forced to start and end with rates (e.g., reporting CPI, miss rates, or branch misprediction rates).
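The BIPS example can be checked numerically: the true average rate is total work divided by total time, and for equal amounts of work per phase it coincides with the harmonic mean of the rates. A small sketch, with function names chosen for this note:

```python
def average_rate(phases):
    """phases: list of (work, rate) pairs; returns total work / total time."""
    total_work = sum(work for work, _ in phases)
    total_time = sum(work / rate for work, rate in phases)
    return total_work / total_time

def harmonic_mean(rates):
    return len(rates) / sum(1.0 / r for r in rates)

# 10 billion instructions at 1 BIPS, then 10 billion at 2 BIPS.
phases = [(10.0, 1.0), (10.0, 2.0)]
true_avg = average_rate(phases)
hm = harmonic_mean([rate for _, rate in phases])
print(round(true_avg, 2), round(hm, 2))  # 1.33 1.33
```

The agreement relies on each phase doing the same amount of work; with unequal work per phase, `average_rate` is still correct but the unweighted harmonic mean is not.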


Performance metrics for clusters

•  Supercomputers:
    –  Execution time
    –  FLOPS (FLOP/s): theoretical peak or measured using a standard benchmark (e.g., LINPACK is used for the Top-500 supercomputer ranking)

•  Warehouse scale:
    –  Latency is an important metric because it is seen by users
    –  Bing study: users search less as response time increases
    –  Service Level Objectives (SLOs)/Service Level Agreements (SLAs), e.g., 99% of requests must complete in under 100 ms
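An SLO of the form "99% of requests below 100 ms" can be checked against a set of measured latencies as follows. This is an illustrative sketch only: the function name and the latency samples are invented for this note and do not come from any real service.

```python
def meets_slo(latencies_ms, threshold_ms=100.0, target_fraction=0.99):
    """True if at least target_fraction of requests finished under threshold_ms."""
    within = sum(1 for t in latencies_ms if t < threshold_ms)
    return within / len(latencies_ms) >= target_fraction

# Hypothetical samples: mostly fast requests plus a few slow outliers.
good_day = [20.0] * 99 + [250.0]       # 99% under 100 ms
bad_day = [20.0] * 98 + [250.0] * 2    # only 98% under 100 ms

print(meets_slo(good_day))  # True
print(meets_slo(bad_day))   # False
```

Note that the average latency of both sample sets is well under 100 ms; the SLO is about the tail, not the mean.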


Amdahl’s law

•  Enhancement E accelerates a fraction F of a task by a factor of S.

[Figure: execution time without E is split into (1 − F) and F; with E it becomes (1 − F) and F/S.]

$\text{speedup} = \frac{T_{exe}(\text{without } E)}{T_{exe}(\text{with } E)} = \frac{1}{(1 - F) + \frac{F}{S}}$

•  The benefit of an enhancement is limited by the fraction of execution time that can't be enhanced → law of diminishing returns

•  Amdahl's law → optimize the common case

[Figure: plot for F=0.5] [Dubois]
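The diminishing returns are easy to see numerically. A minimal sketch of the speedup formula for F = 0.5; the function name and the particular values of S are chosen for this note.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the task is accelerated by a factor s."""
    return 1.0 / ((1.0 - f) + f / s)

for s in (2, 10, 100, 1_000_000):
    print(s, round(amdahl_speedup(0.5, s), 3))
# 2 1.333
# 10 1.818
# 100 1.98
# 1000000 2.0
```

With F = 0.5 the speedup approaches, but never reaches, 1/(1 − F) = 2, however large S becomes.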


Physical reasons for power consumption

•  If the transistor input voltage is above Vt → the transistor is ON (short circuit); otherwise, it is OFF (open circuit).

•  Dynamic power is consumed when transistors switch state.

•  Static or leakage power is consumed when there is no switching (historically negligible but growing in significance with nanoscale CMOS).

[Dubois]


1. Static (leakage) power

•  When the input voltage is less than Vt, the transistor should be off; however, some electrons are still able to go through because of the reduced threshold voltage (Vt) of recent technologies → static or leakage power consumption.

•  Leakage current depends exponentially on Vt as well as on the operating temperature (T):

    $P_{static} = V I_{sub} \propto V e^{-K V_t / T}$

•  As Vt decreases, static power increases exponentially → the switch to 3D transistors was mainly motivated by the need to control leakage power.

•  Noise also limits how far Vt can be reduced.


2. Dynamic power

•  $P_{dynamic} = \alpha f C V^2$

•  α is the fraction of clock cycles in which a gate switches.

•  For a given design and technology node, a higher frequency demands a higher voltage.

•  If chip size grows, then total power grows.

•  Non-ideal scaling of Vt → non-ideal scaling of V → non-ideal scaling of power.

•  Power dissipation leads to heat generation → when heat is not removed appropriately, it causes thermal hot spots → problems for reliability and leakage power.

[Reda et al, TComp 11]
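Because a higher frequency usually requires a higher voltage, dynamic power grows faster than linearly with clock rate. The sketch below evaluates the dynamic power formula for two operating points; all numeric values (activity factor, capacitance, frequencies, voltages) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def dynamic_power(alpha, f_hz, c_farads, v_volts):
    """P_dynamic = alpha * f * C * V^2, in watts."""
    return alpha * f_hz * c_farads * v_volts ** 2

# Hypothetical operating points: +20% frequency needs +10% voltage.
base = dynamic_power(alpha=0.1, f_hz=2.0e9, c_farads=1.0e-9, v_volts=1.0)
boosted = dynamic_power(alpha=0.1, f_hz=2.4e9, c_farads=1.0e-9, v_volts=1.1)

ratio = boosted / base
print(round(ratio, 3))  # 1.452 = 1.2 * 1.1**2
```

In this example a 20% frequency increase costs about 45% more dynamic power.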


Reasons for the power wall

•  Scaling rules from one technology node to the next:
    –  Area scales by $\frac{1}{2}$
    –  C scales by $\frac{1}{\sqrt{2}}$

•  Let N be the number of cores at 45 nm and p be the power per core. Total power = Np.

•  Assuming the same frequency and total chip area at 32 nm, power at 32 nm = $\frac{2Np}{\sqrt{2}\,S}$, where S = (old voltage/new voltage)^2.

•  45 nm voltage = 1.1 V, 32 nm voltage = 1.0 V → S = 1.21 → less than $\sqrt{2}$ → power density increases.
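The scaling argument can be checked numerically. The sketch below assumes, as on the slide, that the same chip area holds twice as many cores after the shrink, that capacitance per core scales by 1/√2, and that frequency is unchanged; the function name is chosen for this note.

```python
import math

def power_ratio_after_shrink(v_old, v_new):
    """Total chip power at the new node divided by total power at the old node."""
    s = (v_old / v_new) ** 2                   # S = (old voltage / new voltage)^2
    core_count_factor = 2.0                    # area per core halves, same chip area
    capacitance_factor = 1.0 / math.sqrt(2.0)  # C scales by 1/sqrt(2)
    return core_count_factor * capacitance_factor / s

ratio = power_ratio_after_shrink(v_old=1.1, v_new=1.0)
print(round(ratio, 2))  # 1.17: about 17% more power in the same area
```

The ratio exceeds 1 precisely because S = 1.21 is smaller than √2 ≈ 1.41, so power density rises from 45 nm to 32 nm.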


Combined metrics for performance and power

•  Energy: cost paid by the user, $E = \int_{start}^{finish} p(t)\,dt$

•  Joules per instruction (EPI)

•  Energy delay product (EDP)

•  MIPS/W and FLOPS/W

•  For clusters: Power Usage Effectiveness (PUE) = Total facility power / IT equipment power

[datacenterexperts.com]
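These metrics are straightforward to compute from measurements. The sketch below approximates the energy integral with the trapezoidal rule over sampled power readings; the sample values, the facility numbers, and the function names are hypothetical and used only for illustration.

```python
def energy_joules(times_s, power_w):
    """Trapezoidal approximation of E = integral of p(t) dt."""
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total

def energy_delay_product(energy_j, delay_s):
    return energy_j * delay_s

def pue(total_facility_w, it_equipment_w):
    return total_facility_w / it_equipment_w

# Hypothetical power trace sampled once per second over a 2 s run.
times = [0.0, 1.0, 2.0]
power = [10.0, 20.0, 10.0]

e = energy_joules(times, power)
edp = energy_delay_product(e, times[-1] - times[0])
print(e, edp)                # 30.0 60.0
print(pue(1500.0, 1000.0))   # 1.5
```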


Role of simulators

•  Computing systems are becoming more complex → simulators are needed to evaluate new ideas and explore the design space.

•  Simulation infrastructure should consider architectural issues (e.g., performance) and complex, intertwined physical phenomena (e.g., power, thermal behavior, and reliability).

•  Simulation is getting harder due to the need to simulate multi-threaded workloads on multicores → the simulator itself can be single-threaded or multi-threaded.

•  Simulator taxonomy:
    1.  User-level versus full-system
    2.  Functional versus cycle-accurate
    3.  Trace-driven versus execution-driven

1. User-level vs. full-system simulators

[Figure: user-level simulator vs. full-system simulator]

1.  User-level simulators: focus on simulating the microarchitecture, leaving out system components; system calls are treated as black boxes.

2.  Full-system simulators: model an entire computing system, including CPU, I/O, disks, and network.

[Dubois]


2. Functional vs. cycle-accurate simulators

[Figure: functionally accurate vs. cycle-accurate simulation] [Dubois]

•  Orthogonal classification to user-level and full-system.

•  Functionally accurate: the function of each instruction is executed without any microarchitectural detail. Fast but not accurate.

•  Cycle accurate: captures the details of all microarchitectural blocks and keeps track of timing. Accurate but slow.

•  Functional-first simulators.

3. Trace-driven vs. execution-driven simulators

•  Trace-driven simulation:
    –  The benchmark is first executed on an ISA-compatible processor
    –  Each executed instruction is logged into a trace file
    –  Architectural state can be logged before and after OS calls and interrupts
    –  The final trace is then fed into a cycle-accurate simulator

•  Execution-driven simulation:
    –  No trace file; the benchmark is fed directly to the simulator
    –  All timing and functional aspects of the machine must be reproduced faithfully
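The trace-driven flow can be illustrated with a toy example in which a previously recorded trace is replayed through a model. This is only a sketch: the "trace" is a hand-written list of memory addresses rather than a log from a real benchmark run, and the model is a tiny direct-mapped cache (4 lines of 16 bytes, both hypothetical) rather than a cycle-accurate simulator.

```python
def simulate_direct_mapped_cache(trace, num_lines=4, line_bytes=16):
    """Replay an address trace through a direct-mapped cache; return (hits, misses)."""
    tags = [None] * num_lines   # one stored tag per cache line
    hits = misses = 0
    for addr in trace:
        block = addr // line_bytes   # which memory block the address falls in
        index = block % num_lines    # cache line that block maps to
        tag = block // num_lines     # identifies the block within that line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag        # fill the line on a miss
    return hits, misses

# Hypothetical trace: addresses 0 and 64 conflict on cache line 0.
trace = [0, 4, 8, 64, 0, 4]
result = simulate_direct_mapped_cache(trace)
print(result)  # (3, 3)
```

Because the model only consumes the recorded trace, it can be rerun with different cache parameters without re-executing the benchmark, which is the main appeal of the trace-driven approach.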


Summary

1.  Trends: power wall, memory wall, parallelism wall. Frequency increases → power wall → multi-core → parallelism wall → fusion

2.  Quantifying performance: response time and throughput, computing means (arithmetic, geometric, harmonic)

3.  Quantifying power: static and dynamic, origins of the power wall.

4.  Role of simulators: user-level vs. full-system, functional vs. cycle-accurate.