Compilation, Architectural Support, and Evaluation of SIMD Graphics Pipeline Programs on a General-Purpose CPU
Mauricio Breternitz Jr, Herbert Hum, Sanjeev Kumar
Microprocessor Research Labs, Intel Corporation
Parallel Architecture and Compilation Techniques, 2003
Graphics Applications
• Computationally intensive graphics applications are becoming increasingly popular
  ─ Computer-Aided Design: from airplanes to cars
  ─ Visualization of massive quantities of data
  ─ Visual simulators, e.g. training pilots
  ─ Fancier graphical user interfaces
  ─ And, of course, games
• And this trend is continuing
  ─ As high-end applications become more mainstream
Graphics Pipeline
[Diagram: a 3D Application passes a Scene through OpenGL or DirectX to the pipeline stages Transform, Lighting, Clipping, Rasterization, Texture Mapping, Compositing, and Display]
Vertex Shaders
• Operate on every vertex in the scene
• Effects like
  ─ Blur
  ─ Diffuse and specular reflection
Pixel Shaders
• Operate on every pixel
• Effects like
  ─ Texturing
  ─ Fog blending
Vertex and Pixel Shaders
• Need to operate millions of times a second
  ─ Small programs
• Typically run on the graphics card
  ─ However, most desktops do not have graphics cards that support programmable shaders
• This work focuses on running Vertex Shaders on the main CPU
  ─ Pixel Shaders have very high computational and bandwidth requirements
  ─ Graphics applications are designed to adapt to the available features and performance
Goals
• Improve the performance of Vertex Shaders on the main CPU
  ─ Analyze the performance on today's CPUs
  ─ Better compiler optimizations
  ─ Additional architectural support
• Identify three architectural and compiler enhancements
  ─ Significant impact on performance: roughly a factor of 2
Outline
• Motivation
• Baseline Compiler
• Three Enhancements
• Performance Evaluation
• Conclusions
Vertex Shader Programs
• Small programs (at most 256 instructions)
• SIMD instructions with xyzw components
• Mask and swizzle on each instruction
• No state saved between vertices
  ─ Read-only memory and temporary registers
• Program cannot change control flow
Virtual Machine (diagram)
• Vertex Input: 16 x 4 registers
• Vertex Output: 15 x 4 registers
• SIMD ALU
• Constant Memory: 256 x 4
• Temporary Registers: 12 x 4
• Integer Registers: 84 x 1
Example program
    dp4 oPos.x, v0, c[0]
    dp4 oPos.y, v0, c[1]
    dp4 oPos.z, v0, c[2]
    dp4 oPos.w, v0, c[3]
    mov oD0, c[4].wzyx
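To make the example program concrete, the following is a minimal Python sketch of what the four dp4 instructions and the swizzled mov compute. It is only an illustration: the helper names (swizzle, dp4, run_example) and the sample constants are assumptions made here, not part of the original work, and registers are modeled as plain 4-element lists in x, y, z, w order.

```python
# Hypothetical model of the example program; not the authors' code.
COMPONENTS = "xyzw"

def swizzle(reg, pattern="xyzw"):
    """Return the components of reg rearranged according to pattern."""
    return [reg[COMPONENTS.index(c)] for c in pattern]

def dp4(a, b):
    """Four-component dot product of two registers."""
    return sum(x * y for x, y in zip(a, b))

def run_example(v0, c):
    """Run the five-instruction example for one vertex."""
    # dp4 oPos.{x,y,z,w}, v0, c[0..3]: one matrix row per output component
    o_pos = [dp4(v0, c[i]) for i in range(4)]
    # mov oD0, c[4].wzyx: copy a constant with its components reversed
    o_d0 = swizzle(c[4], "wzyx")
    return o_pos, o_d0

# Sample data (assumed): an identity transform and one color constant.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
constants = identity + [[0.1, 0.2, 0.3, 1.0]]
o_pos, o_d0 = run_example([2.0, 3.0, 4.0, 1.0], constants)
```

With the identity matrix in c[0] to c[3], the position passes through unchanged, and oD0 receives c[4] in reversed component order.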
Baseline Optimizing Compiler
• Implemented a compiler for Vertex Shaders
  ─ Input: Vertex Shader assembly
  ─ Output: optimized x86 (with SSE2)
• Started with the DirectX reference rasterizer, which is an interpreter
  ─ Used it as the front end
• Uses the Olive pattern-matching code-generator generator
• Graph-coloring based register allocator
• Loop unrolling
• List scheduler
• About 70% faster than a naïve translator
  ─ Naïve translator: translate into C and feed it to a C compiler
Characteristics of Generated Code
• Mostly SIMD instructions (x86 with SSE2)
  ─ 83-99% of instructions
• Large basic blocks
  ─ Use of control flow is limited
  ─ Makes it easier to compile efficiently
• Vertex Shader assembly to x86 assembly
  ─ 10-20 times increase in the number of instructions
  ─ Example of a single shader instruction:
    mul r0.x_z_, v0.xyzz, v1.wwww
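A small Python sketch can show why one shader instruction expands into several x86 instructions: the mul above bundles two source swizzles, a multiply, and a masked write, and SSE2 has no per-instruction swizzle or write mask. The function names and sample register values below are assumptions chosen for illustration only.

```python
# Hypothetical model of 'mul r0.x_z_, v0.xyzz, v1.wwww'; not the authors' code.
COMPONENTS = "xyzw"

def swizzle(reg, pattern):
    """Rearrange or replicate the components of reg according to pattern."""
    return [reg[COMPONENTS.index(c)] for c in pattern]

def masked_mul(dest, mask, src0, swz0, src1, swz1):
    """Return dest after 'mul dest.mask, src0.swz0, src1.swz1'.

    mask has four positions; '_' marks a component that keeps its old value.
    """
    a = swizzle(src0, swz0)            # step 1: shuffle first source
    b = swizzle(src1, swz1)            # step 2: shuffle second source
    product = [a[i] * b[i] for i in range(4)]  # step 3: SIMD multiply
    # step 4: merge the product into dest under the write mask
    return [product[i] if mask[i] != "_" else dest[i] for i in range(4)]

r0 = [9.0, 9.0, 9.0, 9.0]
v0 = [1.0, 2.0, 3.0, 4.0]
v1 = [5.0, 6.0, 7.0, 0.5]
r0 = masked_mul(r0, "x_z_", v0, "xyzz", v1, "wwww")
```

Only the x and z components of r0 are overwritten (with v0.x * v1.w and v0.z * v1.w); y and w keep their previous values.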
1. New Instructions
• Dot products are very common in shaders
• A dot product is expensive on x86
  ─ It translates into a sequence of 7 instructions in the simple case: 1 multiply, 2 add, 4 shuffle
• New dot product instructions
  ─ Compute the dot product of two source operands and store it in each word of the destination operand
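The cost difference can be sketched in Python by emulating 4-wide SIMD operations on lists. The shuffle-and-add reduction below is one possible way to form a dot product from multiply, shuffle, and add steps; it is not necessarily the exact 7-instruction sequence counted on the slide, since real two-operand SSE shuffles need extra instructions. All function names are hypothetical.

```python
# Hypothetical emulation of 4-wide SIMD operations; not the authors' code.

def mul4(a, b):
    """Lane-wise multiply."""
    return [x * y for x, y in zip(a, b)]

def add4(a, b):
    """Lane-wise add."""
    return [x + y for x, y in zip(a, b)]

def shuffle4(a, order):
    """Result lane i takes its value from lane order[i] of a."""
    return [a[i] for i in order]

def dot_by_shuffles(a, b):
    """Dot product built from multiply, shuffle, and add steps."""
    t = mul4(a, b)                           # lane products
    t = add4(t, shuffle4(t, (1, 0, 3, 2)))   # pairwise sums
    t = add4(t, shuffle4(t, (2, 3, 0, 1)))   # full sum in every lane
    return t

def dot_broadcast(a, b):
    """Model of the proposed instruction: result stored in each word."""
    d = sum(x * y for x, y in zip(a, b))
    return [d] * 4

a, b = [1, 2, 3, 4], [5, 6, 7, 8]
slow = dot_by_shuffles(a, b)
fast = dot_broadcast(a, b)
```

Both functions leave the same value in all four lanes; the proposed instruction does in one step what the emulated sequence does in several.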
2. Mask Analysis Optimization
• Traditional optimizers keep track of liveness information on a per-register basis
  ─ Shaders: often only part of the SIMD register is live
  ─ Modify the analysis to track liveness for each word of the SIMD register
• Analysis phase
  ─ Annotate the IR with additional information
  ─ During live variable analysis, propagate the liveness mask depending on the instructions
• Optimization phase
  ─ Identify dead code
  ─ Replace some shuffle/mask instructions with moves
  ─ The moves might get eliminated entirely during register allocation
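The analysis phase can be illustrated with a short Python sketch of backward liveness tracked per component rather than per register. This is a simplified illustration under stated assumptions, not the authors' implementation: it handles straight-line code only, and the instruction tuple format, the function name mask_liveness, and the sample program are all hypothetical.

```python
# Hypothetical per-component liveness sketch; not the authors' code.
COMPONENTS = "xyzw"

def mask_liveness(program, live_out):
    """Backward per-component liveness over straight-line code.

    program:  list of (op, dest, write_mask, sources); write_mask is a string
              of component letters, sources is a list of (register, swizzle).
    live_out: dict mapping register name to components live at exit.
    Returns (kept, dead): kept holds (index, needed components) for useful
    instructions, dead holds indices of instructions whose results are unused.
    """
    live = {reg: set(comps) for reg, comps in live_out.items()}
    kept, dead = [], []
    for index in range(len(program) - 1, -1, -1):
        op, dest, write_mask, sources = program[index]
        needed = set(write_mask) & live.get(dest, set())
        if not needed:
            dead.append(index)          # no written component is ever read
            continue
        # Components written here are not live before this instruction.
        live[dest] = live.get(dest, set()) - set(write_mask)
        for reg, swz in sources:
            if op == "dp4":
                used = set(swz)         # a dot product reads every lane
            else:                       # component-wise op: lane i feeds lane i
                used = {swz[COMPONENTS.index(c)] for c in needed}
            live.setdefault(reg, set()).update(used)
        kept.append((index, "".join(c for c in COMPONENTS if c in needed)))
    kept.reverse()
    dead.reverse()
    return kept, dead

program = [
    ("mul", "r0", "xyzw", [("v0", "xyzw"), ("v1", "xyzw")]),
    ("mov", "r1", "xyzw", [("c0", "xyzw")]),
    ("mul", "r2", "xy", [("r0", "xxyy"), ("c1", "xyzw")]),
    ("mov", "oPos", "xy", [("r2", "xyzw")]),
]
kept, dead = mask_liveness(program, {"oPos": "xy"})
```

In this sample, instruction 1 is dead because r1 is never read, and instruction 0 writes four components although only r0.x is ever consumed. A per-register analysis would miss the second fact.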
3. Number of Registers
• Spilling registers to memory can degrade performance
• Investigate the impact of increasing the number of registers from 8 to 16
• Why not more?
  ─ Trickier to encode in the ISA
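As a rough illustration of why the register count matters, the Python sketch below computes the peak number of simultaneously live values from a list of live intervals and derives a lower bound on how many values cannot stay in registers at that peak. The interval data and function names are invented for this example, and a real graph-coloring allocator makes more detailed decisions than this bound suggests.

```python
# Hypothetical register-pressure sketch; not the authors' allocator.

def max_pressure(intervals):
    """Largest number of values live at the same time.

    intervals: list of (first_def, last_use) instruction indices, inclusive.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))       # value becomes live
        events.append((end + 1, -1))    # value is dead after its last use
    events.sort()                       # at equal positions, ends sort first
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak

def min_values_in_memory(intervals, num_registers):
    """Lower bound on values that must live in memory at the peak."""
    return max(0, max_pressure(intervals) - num_registers)

# Assumed sample: 12 values defined one after another, all used at the end.
intervals = [(i, 20) for i in range(12)]
with_8 = min_values_in_memory(intervals, 8)
with_16 = min_values_in_memory(intervals, 16)
```

For this sample, 8 registers leave at least 4 values in memory at the peak, while 16 registers hold all 12.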
Experimental Setup
• 10 Vertex Shaders
  ─ 8-84 instructions
  ─ Only 3 of them have loops (Control)
• 2.2 GHz Pentium IV processor
  ─ Instruction counts otherwise
  ─ Break down the instructions into categories
• Measure performance by using the generated code to process an array of vertices
  ─ Compute the average
Evaluation
• New dot-product instructions: 27.4% on average (estimate)
  ─ Reduces the number of instructions by 24%
• Mask optimization: 19.5% on average
• Both: 42% on average
[Figure: normalized execution time (0 to 1) for each vertex shader (B, CTC, L, PS, PL, PE, R, T, TS, W); series: Base, New Instructions Only, Mask Optimization Only, Both]
Evaluation Cont'd
• 16 registers reduce the number of instructions by 8% on average
  ─ Removes 35-100% of the spill instructions
• This understates the potential benefit
  ─ More registers allow more aggressive optimizations like instruction scheduling
[Figure: normalized instruction count (0 to 1) for each vertex shader (B, CTC, L, PS, PL, PE, R, T, TS, W); series: Base, 16 Registers]
Conclusions & Future Work
• Implemented an optimizing compiler for Vertex Shaders
• Proposed and evaluated three enhancements
  ─ Compiler: mask optimization
  ─ Architectural: new instructions and more registers
  ─ Improve performance by roughly a factor of 2
• Shaders are evolving rapidly
  ─ More like general-purpose processors
  ─ More complex model