Neural Methods for Dynamic Branch Prediction. Daniel A. Jiménez, Department of Computer Science, Rutgers University


Page 1: Neural Methods for Dynamic Branch Prediction

Neural Methods for Dynamic Branch Prediction

Daniel A. Jiménez

Department of Computer Science, Rutgers University

Page 2

The Context

I'll be discussing the implementation of microprocessors: microarchitecture

I study deeply pipelined, high clock frequency CPUs

The goal is to improve performance: make the program go faster

How can we exploit program behavior to make it go faster?

Remove control dependences

Increase instruction-level parallelism

Page 3

An Example

This C++ code computes a useful result: for each i, it adds either w[i] or its bitwise complement ~w[i] to the sum, depending on v[i]. The inner loop executes two statements each time through.

int foo (int w[], bool v[], int n) {
    int sum = 0;
    for (int i=0; i<n; i++) {
        if (v[i])
            sum += w[i];
        else
            sum += ~w[i];
    }
    return sum;
}

Page 4

An Example continued

This C++ code computes the same thing with three statements in the loop.

This version is 55% faster on a Pentium 4; the previous version incurred many mispredicted branch instructions.

int foo2 (int w[], bool v[], int n) {
    int sum = 0;
    for (int i=0; i<n; i++) {
        int a = w[i];
        int b = - (int) v[i];
        sum += ~(a ^ b);
    }
    return sum;
}

Page 5

How an Instruction is Processed

Processing can be divided into five stages:

Instruction fetch

Instruction decode

Execute

Memory access

Write back

Page 6

Instruction-Level Parallelism

To speed up processing, pipelining overlaps the execution of multiple instructions, exploiting the parallelism between them.

(Figure: the five pipeline stages overlapped across successive instructions)

Page 7

Control Hazards: Branches

Conditional branches create a problem for pipelining: the next instruction can't be fetched until the branch has executed, several stages later.


Page 8

Pipelining and Branches

Pipelining overlaps instructions to exploit parallelism, allowing the clock rate to be increased. Branches cause bubbles in the pipeline, where some stages are left idle.

(Figure: a five-stage pipeline stalled behind an unresolved branch instruction)

Page 9

Branch Prediction

A branch predictor allows the processor to speculatively fetch and execute instructions down the predicted path.

(Figure: speculative execution proceeding past a predicted branch)

Branch predictors must be highly accurate to avoid mispredictions!

Page 10

Branch Predictors Must Improve

The cost of a misprediction is proportional to pipeline depth. As pipelines deepen, we need more accurate branch predictors.

The Pentium 4 pipeline has 20 stages; future pipelines will have > 32 stages.

Simulations with SimpleScalar/Alpha

Deeper pipelines allow higher clock rates by decreasing the delay of each pipeline stage

Decreasing the misprediction rate from 9% to 4% results in a 31% speedup for a 32-stage pipeline.

Page 11

Overview

Branch prediction background

Applying machine learning to branch prediction

Results and analysis

Circuit-level implementation

Future work and conclusions

Page 12

Branch Prediction Background

Page 13

Branch Prediction Background

The basic mechanism: 2-level adaptive prediction [Yeh & Patt `91]

Uses correlations between branch history and outcome. Examples:

gshare [McFarling `93], agree [Sprangle et al. `97], hybrid predictors [Evers et al. `96]

This scheme is highly accurate in practice

Page 14

Branch Predictor Accuracy

Larger tables and smarter organizations yield better accuracy. Longer histories provide more context for finding correlations.

Table size is exponential in history length. The cost is increased access delay and chip area.

Page 15

Applying Machine Learning to Branch Prediction

Page 16

Branch Prediction is a Machine Learning Problem

So why not apply a machine learning algorithm? Replace 2-bit counters with a more accurate predictor

Tight constraints on prediction mechanism

Must be fast and small enough to work as a component of a microprocessor

Artificial neural networks: a simple model of the neural networks in brain cells

Learn to recognize and classify patterns

Most neural nets are slow and complex relative to tables

For branch prediction, we need a small and fast neural method

Page 17

A Neural Method for Branch Prediction

We investigated several neural methods

Most were too slow, too big, or not accurate enough

Our choice: The perceptron [Rosenblatt `62, Block `62]

Very high accuracy for branch prediction

Prediction and update are quick, relative to other neural methods

Sound theoretical foundation: the perceptron convergence theorem

Proven to work well for many classification problems

Page 18

Branch-Predicting Perceptron

Inputs (x’s) are from the branch history register

Weights (w’s) are small integers learned by on-line training

Output (y) gives the prediction: the dot product of the x’s and w’s

Training finds correlations between history and outcome

Page 19

Training Algorithm

Page 20

Organization of the Perceptron Predictor

Keeps a table of perceptrons, indexed by branch address

Inputs are from the branch history register

Predict taken if the output is ≥ 0, otherwise predict not taken

Key intuition: table size isn't exponential in history length, so we can consider much longer histories

Page 21

Results and Analysis for the Perceptron Predictor

Page 22

Experimental Evaluation

Execution- and trace-driven simulations: measure instruction throughput (IPC) and misprediction rates

SimpleScalar/Alpha [Burger & Austin `97]

Alpha 21264-like configuration:

4-wide issue, 64KB I-cache, 64KB D-cache, 512 entry BTB

SPECint 2000 benchmarks

Technological estimates: HSPICE for circuit delay estimates

Modified CACTI 2.0 [Agarwal 2000] for PHT delay estimates

Page 23

Results: Predictor Accuracy

The perceptron outperforms a competitive hybrid predictor by 36% at a ~4KB hardware budget: a 1.71% vs. 2.66% misprediction rate

Page 24

Results: Large Hardware Budgets

Multi-component hybrid was the most accurate fully dynamic predictor known in the literature [Evers 2000]

Perceptron predictor is even more accurate

Page 25

Delay-Sensitive Implementation

Even the relatively simple perceptron has high access delay

Our solution: An overriding perceptron predictor

First level is a single-cycle gshare

Second level is a 4KB, 23-bit history perceptron predictor

HSPICE total prediction delay estimates:

2 cycles at 833 MHz (like Alpha 21264)

4 cycles at 1.76 GHz (like Pentium 4)

Compare with 4KB hybrid predictor

Page 26

Results: IPC with high clock rate

Pentium 4-like: 20-cycle misprediction penalty, 1.76 GHz

15.8% higher IPC than gshare, 5.7% higher than hybrid

Page 27

Analysis: History Length

The fixed-length path branch predictor can also use long histories [Stark, Evers & Patt `98]

Page 28

Analysis: Training Times

Perceptron "warms up" faster

Page 29

Circuit-Level Implementation of a Neural Branch Predictor

Page 30

Circuit-Level Implementation

Example output computation: 12 weights, a Wallace tree of depth 6 followed by a 14-bit carry-lookahead adder

Delay is 2-4 cycles for longer histories

Carry-save adders have O(1) depth; the carry-lookahead adder has O(log n) depth

Page 31

HSPICE Perceptron Simulations

2 cycles at 833 MHz, 4 cycles at 1.76 GHz, 180 nm technology

Page 32

Future Work and Conclusions

Page 33

Future Work with Perceptron Predictor

Let's make the best predictor even better

Better representation

Better training algorithm

Latency is a problem

Crazy people are saying that overriding organizations don't work as well as simple but large predictors [Me, HPCA 2003]

How can we eliminate the latency of the perceptron predictor?

Page 34

Future Work with Perceptron Predictor

Value prediction

Predict value of a load to mitigate memory latency

Indirect branch prediction

Virtual dispatch

Switch statements in C

Exit prediction

Predict the taken exit from predicated hyperblocks

Page 35

Future Work: Characterizing Predictability

Branch predictability, value predictability

How can we characterize algorithms in terms of their predictability?

Given an algorithm, how can we transform it so that its branches and values are easier to predict?

How much predictability is inherent in the algorithm, and how much is an artifact of the program structure?

How can we compare different algorithms' predictability?

Page 36

Conclusions

Neural predictors can improve performance for deeply pipelined microprocessors

Perceptron learning is well-suited for microarchitectural implementation

There is still a lot of work left to be done on the perceptron predictor in particular and microarchitectural prediction in general

Page 37

The End