Compilation, Architectural Support, and Evaluation of SIMD Graphics Pipeline Programs on a General-Purpose CPU

Mauricio Breternitz Jr, Herbert Hum, Sanjeev Kumar
Microprocessor Research Labs, Intel Corporation

Parallel Architecture and Compilation Techniques, 2003

Graphics Applications

• Computationally intensive graphics applications are becoming increasingly popular
  ─ Computer-Aided Design: from airplanes to cars
  ─ Visualization of massive quantities of data
  ─ Visual simulators, e.g. training pilots
  ─ Fancier graphical user interfaces
  ─ And, of course, games
• This trend is continuing as high-end applications become more mainstream


Graphics Pipeline

[Figure: the scene flows from the 3D application through OpenGL or DirectX into the pipeline stages (Transform, Lighting, Clipping, Rasterization, Texture Mapping, Compositing) and on to the Display.]

• Vertex Shaders
  ─ Operate on every vertex in the scene
  ─ Effects like blur, diffuse and specular reflection
• Pixel Shaders
  ─ Operate on every pixel
  ─ Effects like texturing, fog blending


Vertex and Pixel Shaders

• Need to operate millions of times a second
• Small programs
• Typically run on the graphics card
  ─ However, most desktops do not have graphics cards that support programmable shaders
• This work focuses on running Vertex Shaders on the main CPU
  ─ Pixel shaders have very high computational and bandwidth requirements
  ─ Graphics applications are designed to adapt to the available features and performance


Goals

• Improve the performance of Vertex Shaders on the main CPU
  ─ Analyze the performance on today's CPUs
  ─ Better compiler optimizations
  ─ Additional architectural support
• Identify three architectural and compiler enhancements
  ─ Significant impact on performance: roughly a factor of 2


Outline

• Motivation
• Baseline Compiler
• Three Enhancements
• Performance Evaluation
• Conclusions


Vertex Shader Programs

• Small programs (at most 256 instructions)
• SIMD instructions with xyzw components
• Mask and swizzle on each instruction
• No state saved between vertices
  ─ Read-only memory and temporary registers
• Program cannot change control flow

Virtual Machine

[Figure: a SIMD ALU connected to the register files and memories listed below.]

• Vertex Input: 16 x 4 registers
• Vertex Output: 15 x 4 registers
• Constant Memory: 256 x 4
• Temporary Registers: 12 x 4
• Integer Registers: 84 x 1

Example program:

  dp4 oPos.x, v0, c[0]
  dp4 oPos.y, v0, c[1]
  dp4 oPos.z, v0, c[2]
  dp4 oPos.w, v0, c[3]
  mov oD0, c[4].wzyx
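To make the mask and swizzle semantics concrete, here is a minimal Python sketch that emulates the five-instruction example program on four-component registers. The helper names (`swizzle`, `write_masked`, `dp4`, `run_example`) are hypothetical and chosen for illustration; this is a toy model of the virtual machine, not the DirectX runtime or the authors' compiler.

```python
# Toy model of xyzw registers. All names here are illustrative assumptions.
COMPONENTS = "xyzw"

def swizzle(reg, pattern="xyzw"):
    """Return the 4 components of reg reordered by a swizzle such as 'wzyx'."""
    # A short pattern repeats its last letter, so 'x' behaves like 'xxxx'.
    pattern = pattern + pattern[-1] * (4 - len(pattern))
    return [reg[COMPONENTS.index(c)] for c in pattern]

def write_masked(dest, value, mask="xyzw"):
    """Return dest with only the components named in mask overwritten."""
    return [v if c in mask else d for c, d, v in zip(COMPONENTS, dest, value)]

def dp4(a, b):
    """4-component dot product, replicated into every component."""
    s = sum(x * y for x, y in zip(a, b))
    return [s, s, s, s]

def run_example(v0, c):
    """Emulate the example: four masked dp4 writes, then a swizzled mov."""
    o_pos = [0.0] * 4
    for i, comp in enumerate(COMPONENTS):
        # dp4 oPos.<comp>, v0, c[i]
        o_pos = write_masked(o_pos, dp4(v0, c[i]), comp)
    # mov oD0, c[4].wzyx
    o_d0 = swizzle(c[4], "wzyx")
    return o_pos, o_d0
```

With an identity matrix in `c[0]` to `c[3]`, the four `dp4` instructions simply copy `v0` into `oPos`, which shows that the program is a matrix-vector transform written one row at a time.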


Baseline Optimizing Compiler

• Implemented a compiler for Vertex Shaders
  ─ Input: Vertex Shader assembly
  ─ Output: optimized x86 (with SSE2)
• Started with the DirectX reference rasterizer (an interpreter)
  ─ Used it as the front end
• Uses the Olive pattern-matching code-generator generator
• Graph-coloring based register allocator
• Loop unrolling
• List scheduler
• About 70% faster than a naïve translator
  ─ Translate into C and feed it to a C compiler


Characteristics of Generated Code

• Mostly SIMD instructions (x86 with SSE2)
  ─ 83-99% of instructions
• Large basic blocks
  ─ Use of control flow is limited
  ─ Makes it easier to compile efficiently
• Vertex Shader assembly to x86 assembly
  ─ 10-20 times increase in the number of instructions
  ─ Example: mul r0.x_z_, v0.xyzz, v1.wwww




1. New Instructions

• Dot products are very common in shaders
• A dot product is expensive on x86
  ─ A sequence of 7 instructions in the simple case: 1 multiply, 2 add, 4 shuffle
• New dot-product instructions
  ─ Compute the dot product of two source operands and store it in each word of the destination operand
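The cost difference can be illustrated with a small Python model of four-lane SIMD operations. This is a sketch under simplifying assumptions: `shuffle` here is an idealized one-source permute into a fresh register, so the multi-instruction path needs only 5 operations, whereas the slide counts 7 (including 4 shuffles) for real x86 code, presumably because of constraints of the actual instruction set that this toy model ignores. All function names are hypothetical.

```python
def mulps(a, b):
    """Lane-wise multiply of two 4-lane vectors."""
    return [x * y for x, y in zip(a, b)]

def addps(a, b):
    """Lane-wise add of two 4-lane vectors."""
    return [x + y for x, y in zip(a, b)]

def shuffle(a, order):
    """Idealized one-source shuffle: result[i] = a[order[i]]."""
    return [a[i] for i in order]

def dot4_multi_instruction(a, b):
    """Dot product built from multiply, shuffle, and add; also logs the ops used."""
    ops = []
    t = mulps(a, b); ops.append("mul")
    s = shuffle(t, [1, 0, 3, 2]); ops.append("shuffle")
    t = addps(t, s); ops.append("add")        # lanes: [x+y, x+y, z+w, z+w]
    s = shuffle(t, [2, 3, 0, 1]); ops.append("shuffle")
    t = addps(t, s); ops.append("add")        # every lane holds x+y+z+w
    return t, ops

def dot4_single_instruction(a, b):
    """Model of the proposed instruction: one op, result in every lane."""
    s = sum(x * y for x, y in zip(a, b))
    return [s] * 4, ["dp"]
```

Both functions return the same value in all four lanes; the difference is the length of the operation log, which is the quantity the proposed instruction reduces.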


2. Mask Analysis Optimization

• Traditional optimizers keep track of liveness information on a per-register basis
  ─ Shaders: often only part of the SIMD register is live
  ─ Modify the analysis to track liveness for each word of the SIMD register
• Analysis phase
  ─ Annotate the IR with additional information
  ─ During live-variable analysis, propagate the liveness mask depending on the instructions
• Optimization phase
  ─ Identify dead code
  ─ Replace some shuffle/mask instructions with moves, which might get eliminated entirely during register allocation
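The analysis phase can be sketched as a backward pass that tracks a set of live components per register instead of a single live bit. The following Python is a minimal illustration for straight-line code only; the tuple-based IR, the function name `live_masks`, and the treatment of `dp4` as reading every lane its swizzle names are assumptions made for this example, not the authors' implementation. It covers dead-code detection and write-mask narrowing, not the shuffle-to-move rewrite.

```python
COMPONENTS = "xyzw"

def live_masks(program, live_out):
    """Per-component liveness over a straight-line program.

    program:  list of (op, dest, write_mask, sources),
              where sources is a list of (register, swizzle) pairs.
    live_out: dict mapping register name -> components live at exit.
    Returns (kept_instructions, live_in).
    """
    live = {reg: set(mask) for reg, mask in live_out.items()}
    kept = []
    for op, dest, mask, sources in reversed(program):
        needed = set(mask) & live.get(dest, set())
        if not needed:
            continue  # no written component is ever read: dead code
        # Components written here are no longer live above this point.
        live[dest] = live.get(dest, set()) - set(mask)
        for reg, swz in sources:
            if op == "dp4":
                used = set(swz)  # a dot product reads every lane it names
            else:
                # Lane-wise op: dest lane c reads the source lane swz selects for c.
                used = {swz[COMPONENTS.index(c)] for c in needed}
            live.setdefault(reg, set()).update(used)
        narrowed = "".join(c for c in COMPONENTS if c in needed)
        kept.append((op, dest, narrowed, sources))
    kept.reverse()
    return kept, live
```

For example, if only `oD0.xz` is live at exit, a full-width `mul` feeding it is narrowed to its `x` and `z` lanes, and a `mov` into a register that is never read is dropped. A per-register analysis would keep the full-width multiply.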


3. Number of Registers

• Spilling registers to memory can degrade performance
• Investigate the impact of increasing the number of registers from 8 to 16
• Why not more?
  ─ Trickier to encode in the ISA




Experimental Setup

• 10 Vertex Shaders
  ─ 8-84 instructions
  ─ Only 3 of them have loops (control)
• 2.2 GHz Pentium IV processor
  ─ Measure performance by using the generated code to process an array of vertices
  ─ Compute the average
• Instruction counts otherwise
  ─ Break down the instructions into categories


Evaluation

• New dot-product instructions: 27.4% on average (estimate)
  ─ Reduces the number of instructions by 24%
• Mask optimization: 19.5% on average
• Both: 42% on average

[Figure: normalized execution time (0 to 1) for the vertex shaders B, CTC, L, PS, PL, PE, R, T, TS, and W. Series: Base, New Instructions Only, Mask Optimization Only, Both.]
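As a quick consistency check on the averages above, the following Python sketch combines the two individual figures. It rests on two assumptions that the slides do not state: that each percentage is a reduction in execution time, and that the two reductions compose multiplicatively. The function name `combined_reduction` is hypothetical.

```python
def combined_reduction(*reductions):
    """Combine independent fractional time reductions multiplicatively."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r
    return 1.0 - remaining

# Averages quoted on the slide: new instructions and mask optimization.
both = combined_reduction(0.274, 0.195)

# Speedup implied by the reported 42% average for both enhancements.
speedup = 1.0 / (1.0 - 0.42)
```

Under these assumptions `both` comes out just under 0.42, which agrees with the reported figure for the two enhancements together, and the implied speedup from these two enhancements alone is about 1.7x.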


Evaluation Cont’d

• 16 registers reduce the number of instructions by 8% on average
  ─ Eliminates 35-100% of the spill instructions
• This understates the potential benefit
  ─ More registers allow more aggressive optimizations like instruction scheduling

[Figure: normalized instruction count (0 to 1) for the vertex shaders B, CTC, L, PS, PL, PE, R, T, TS, and W. Series: Base, 16 Registers.]


Outline

• Motivation
• Baseline Compiler
• Three Enhancements
• Performance Evaluation
• Conclusions


Conclusions & Future Work

• Implemented an optimizing compiler for Vertex Shaders
• Proposed and evaluated three enhancements
  ─ Compiler: mask optimization
  ─ Architectural: new instructions and more registers
  ─ Improve performance by roughly a factor of 2
• Shaders are evolving rapidly
  ─ More like general-purpose processors
  ─ More complex model
