Statistical Simulation of Superscalar Architectures using Commercial Workloads


Lieven Eeckhout and Koen De Bosschere

Dept. of Electronics and Information Systems (ELIS), Ghent University, Belgium

CAECW’01, January 21, 2001

Outline

• Introduction
• Statistical Simulation
  – Statistical profiling
  – Synthetic trace generation
• Methodology
• Evaluation
• Conclusion

Introduction

• Architectural simulation
  – trace-driven or execution-driven
  – accurate
  – long simulation times
  – long traces to be stored
• Need for fast simulation techniques
  – take part of a full trace
  – analytical modeling
  – trace sampling
  – statistical simulation

Goal

• Previous work used SPEC benchmarks to evaluate statistical simulation
• In this talk we use both commercial and scientific workloads
  – SPECint, SPECfp, system traces, multimedia, X graphics, database

Statistical Simulation

• Three steps:
  – extract statistical profile from a program execution
  – generate synthetic trace from it
  – simulate on a trace-driven simulator
• Two major advantages:
  – statistical profile is more compact than full trace
  – fast simulation due to statistical nature, enabling design space exploration in limited time

Statistical Simulation

[Figure: flow diagram. real trace (e.g. SPEC benchmark) → branch profiling, cache profiling, instruction profiling → branch statistics, cache statistics, instruction statistics (together the statistical profile) → synthetic trace generator → synthetic trace → trace-driven simulator]

Statistical Profiling

• Microarchitecture-independent statistics
  – instruction statistics
• Microarchitecture-dependent statistics
  – branch statistics
  – cache statistics
• Result: statistical simulation can only be used to explore design options of the processor core (cache and branch predictor are fixed)

Statistical Profiling: Instruction Statistics

• Instruction mix (13 classes)
• Number of register operands
• Age of register operands
  – probability that register operand was produced n instructions before it in the trace (only RAW)
• Memory dependencies
  – probability that load is memory-dependent on the n-th store before it in the trace (only RAW)
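The "age of register operands" statistic can be illustrated with a minimal sketch. It assumes a hypothetical trace format, a list of (destination register, source registers) tuples, which is not taken from the talk; it only counts read-after-write (RAW) distances and normalizes them into probabilities.

```python
from collections import Counter

def dependency_distance_profile(trace):
    """Estimate P(distance = n) for RAW register dependencies.

    `trace` is a list of (dest_reg, src_regs) tuples. This record layout is
    a hypothetical one chosen for illustration. dest_reg may be None for
    instructions that write no register.
    """
    last_writer = {}   # register name -> index of its most recent producer
    counts = Counter() # distance n -> number of operands with that age
    for i, (dest, srcs) in enumerate(trace):
        for reg in srcs:
            if reg in last_writer:            # RAW dependency found
                counts[i - last_writer[reg]] += 1
        if dest is not None:
            last_writer[dest] = i
    total = sum(counts.values())
    if total == 0:
        return {}
    return {n: c / total for n, c in sorted(counts.items())}

# Tiny hand-made trace (illustrative values only).
trace = [
    ("r1", []),            # 0: produces r1
    ("r2", ["r1"]),        # 1: reads r1 (distance 1)
    ("r3", ["r1", "r2"]),  # 2: reads r1 (distance 2) and r2 (distance 1)
    (None, ["r3"]),        # 3: reads r3 (distance 1)
]
profile = dependency_distance_profile(trace)
print(profile)
```

The memory-dependency statistic could be gathered the same way by tracking the last store to each address instead of the last writer of each register.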

Statistical Profiling: Branch Statistics

• Six branch types
  – conditional branch, unconditional branch, call with offset, indirect jump, indirect call, return
• Distinction
  – branch prediction accuracy: refill pipeline on branch misprediction
  – branch target prediction accuracy: single-cycle bubble in pipeline on correct branch prediction but target misprediction

Statistical Profiling: Cache Statistics

• D-cache statistics
  – L1 D-cache miss rate
  – L2 D-cache miss rate
• I-cache statistics
  – L1 I-cache miss rate
  – L2 I-cache miss rate
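As a rough illustration of how miss rates like these could drive a synthetic access, the sketch below draws a cache outcome from two probabilities. The function name, the rates, and the outcome labels are hypothetical, and it assumes the L2 miss rate is a local rate (the fraction of L1 misses that also miss in L2); the slides do not say which convention was used.

```python
import random

def sample_cache_outcome(l1_miss_rate, l2_miss_rate, rng):
    """Return 'L1 hit', 'L2 hit' or 'memory' for one access.

    Assumption: l2_miss_rate is local, i.e. it applies only to accesses
    that already missed in L1.
    """
    if rng.random() >= l1_miss_rate:
        return "L1 hit"
    if rng.random() >= l2_miss_rate:
        return "L2 hit"
    return "memory"

# Illustrative rates, not measurements from the talk.
rng = random.Random(42)
outcomes = [sample_cache_outcome(0.10, 0.30, rng) for _ in range(100_000)]
l1_misses = sum(o != "L1 hit" for o in outcomes)
observed_l1 = l1_misses / len(outcomes)
observed_l2 = sum(o == "memory" for o in outcomes) / l1_misses
print(f"L1 miss rate {observed_l1:.3f}, local L2 miss rate {observed_l2:.3f}")
```

Over many draws the observed rates approach the profiled ones, which is the property a synthetic trace needs.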

Synthetic Trace Generation

Instruction-by-instruction through random number generation

Determine
• instruction type
• number of operands
• age of register operands
• memory dependency
• branch behavior
• D-cache behavior
• I-cache behavior

[Figure: example synthetic trace with instructions st, add, ld, br; annotations: mispredicted, D-cache miss, I-cache miss]
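Instruction-by-instruction generation through random numbers can be sketched as follows. The profile contents, field names, and the four instruction classes are hypothetical (the talk uses 13 classes), and the sketch leaves out operand counts and memory dependencies for brevity; it is not the authors' generator.

```python
import random

# Hypothetical statistical profile with made-up probabilities.
PROFILE = {
    "mix": {"add": 0.50, "ld": 0.25, "st": 0.15, "br": 0.10},
    "dep_distance": {1: 0.5, 2: 0.3, 4: 0.2},
    "branch_mispredict_rate": 0.05,
    "l1_dcache_miss_rate": 0.08,
    "l1_icache_miss_rate": 0.02,
}

def draw(dist, rng):
    """Sample one key from a {value: probability} dictionary."""
    values = list(dist)
    return rng.choices(values, weights=[dist[v] for v in values], k=1)[0]

def generate_trace(profile, length, seed=0):
    """Build a synthetic trace of `length` instruction records."""
    rng = random.Random(seed)
    trace = []
    for _ in range(length):
        kind = draw(profile["mix"], rng)
        instr = {
            "type": kind,
            "dep_distance": draw(profile["dep_distance"], rng),
            "icache_miss": rng.random() < profile["l1_icache_miss_rate"],
        }
        if kind == "br":    # branch behavior only applies to branches
            instr["mispredicted"] = (
                rng.random() < profile["branch_mispredict_rate"])
        if kind == "ld":    # D-cache behavior modeled for loads only here
            instr["dcache_miss"] = (
                rng.random() < profile["l1_dcache_miss_rate"])
        trace.append(instr)
    return trace

synthetic = generate_trace(PROFILE, 10_000)
add_fraction = sum(i["type"] == "add" for i in synthetic) / len(synthetic)
print(f"fraction of add instructions: {add_fraction:.3f}")
```

Each record carries the outcome labels (mispredicted, cache miss) directly, so a trace-driven simulator consuming it would not need to model the branch predictor or caches itself.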

Methodology: microarchitecture

• Out-of-order processor
  – 8 and 16 issue
  – windows of 64 and 128 instructions
• McFarling branch predictor
• ‘small’ cache configuration
  – 8KB DM L1 I-cache, 8KB DM L1 D-cache, 64KB 2WSA unified L2 cache
• ‘large’ cache configuration
  – 32KB DM L1 I-cache, 64KB 2WSA L1 D-cache, 512KB 4WSA unified L2 cache
• Access time
  – L1 I-cache (1 cycle), L1 D-cache (2 cycles), L2 cache (10 cycles), main memory (80 cycles)

Methodology: benchmarks

• 8 SPECint95 benchmarks
• 5 SPECfp95 benchmarks (hydro2d, su2cor, swim, tomcatv, wave5)
• 8 IBS system traces (mpeg, jpeg, gs, verilog, gcc, sdet, nroff, groff)
• 4 MediaBench applications (g721, gs, gsm, mpeg2)
• 4 X graphics benchmarks (DooM, POVRay, Xanim, Quake)
• 2 TPC-D queries running on Postgres 6.3
• ~ 200 million instructions / trace

Evaluation

• IPC prediction error = (IPC real trace - IPC synthetic trace) / IPC real trace
• IPC real trace = IPC when running real trace on trace-driven simulator
• IPC synthetic trace = IPC when running synthetic trace generated from the statistical profile of the real trace
• Simulation speed: σIPC / x̄IPC (standard deviation of IPC over its mean) less than 1% after simulating 1 million instructions
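The error metric can be written as a small helper. This is a sketch that follows the definition as it appears on the slide, (real - synthetic) / real; the function name and the IPC values are hypothetical and are not results from the talk.

```python
def ipc_prediction_error(ipc_real, ipc_synthetic):
    """Relative IPC error as stated on the slide: (real - synthetic) / real."""
    if ipc_real <= 0:
        raise ValueError("ipc_real must be positive")
    return (ipc_real - ipc_synthetic) / ipc_real

# Made-up IPC values for illustration only.
error = ipc_prediction_error(ipc_real=4.0, ipc_synthetic=3.5)
print(f"IPC prediction error: {error:.1%}")
```

With this sign convention a positive error means the synthetic trace underestimates the IPC of the real trace.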

IPC prediction error (1)

[Figure: bar chart of IPC prediction error per benchmark; 16-issue, 128-entry window, ‘small’ cache configuration. y-axis: IPC prediction error, -30% to 40%. x-axis groups: SPECint95 (li, gcc, compress, go, ijpeg, vortex, m88ksim, perl), SPECfp95 (hydro2d, su2cor, swim, tomcatv, wave5), IBS (mpeg, jpeg, gs, verilog, real_gcc, sdet, nroff, groff), MediaBench (g721_e, gs, gsm_e, mpeg2), X graphics (xanim, xdoom, xpovray, xquake), TPC-D (tpc-d.17, tpc-d.2). Two off-scale values are labeled 157% and 135%; two annotations read "high D-cache miss rate".]

IPC prediction error (2)

[Figure: bar chart of IPC prediction error per benchmark; 16-issue, 128-entry window, ‘large’ cache configuration. y-axis: IPC prediction error, -30% to 30%. x-axis groups: SPECint95 (li, gcc, compress, go, ijpeg, vortex, m88ksim, perl), SPECfp95 (hydro2d, su2cor, swim, tomcatv, wave5), IBS (mpeg, jpeg, gs, verilog, real_gcc, sdet, nroff, groff), MediaBench (g721_e, gs, gsm_e, mpeg2), X graphics (xanim, xdoom, xpovray, xquake), TPC-D (tpc-d.17, tpc-d.2).]

IPC prediction error vs. static instruction count

[Figure: scatter plot. x-axis: static instruction count (number of instructions executed at least once), 0 to 160000. y-axis: IPC prediction error, -40% to 160%. Four series: w = 64, i = 8, 'small' cache; w = 128, i = 16, 'small' cache; w = 64, i = 8, 'large' cache; w = 128, i = 16, 'large' cache. Labeled points: DooM, Quake, gs (IBS), gcc, gcc (IBS), mpeg (IBS), groff, nroff, jpeg (IBS), verilog, sdet, TPC-D, vortex, go.]

Conclusion (1)

• Higher IPC prediction errors for applications with smaller static instruction count:
  – MediaBench applications
  – SPECfp95 benchmarks
  – 2 X graphics benchmarks (POVRay and Xanim)
  – 5 SPECint95 benchmarks

Conclusion (2)

• Smaller IPC prediction errors for applications with larger instruction footprint:
  – IBS system traces
  – TPC-D traces
  – 2 X graphics benchmarks (DooM and Quake)
  – 3 SPECint95 benchmarks (go, gcc, vortex)
• IPC prediction error between -1% and 25%

Conclusion (3)

• Statistical simulation is a useful fast simulation technique for commercial workloads
  – due to higher variability in instructions
  – since commercial workloads have larger instruction footprint
  – which makes a statistical technique more powerful
