GDC09 Abrash Larrabee+Final

Rasterization on Larrabee:

A First Look at the Larrabee New Instructions (LRBni) in Action *

Michael Abrash

GDC, March 2009

* Warning: this is a pretty technical talk, so if you have no interest in processor architecture, instruction sets, or programming and were trying to decide between this talk and something about game design or the production pipeline, you might want to reconsider!

I never did get that lerp instruction!

Larrabee

What is Larrabee?

Better yet, why is Larrabee?

For decades, processors have gotten faster

Higher clock speeds

Smaller (== faster and more) gates

Bigger caches

Hardware that extracts more work per clock

Why Larrabee?

That process certainly hasn’t stopped

But it’s getting harder

Low-hanging fruit already taken

Running up against power budgets

Why Larrabee?

More recently, developments along another axis

Vector processing

And another

Multiple hardware threads

And yet another

Multiple cores

Performance can scale linearly with gates

More work per clock by moving burden of extracting parallelism to software

What is Larrabee?

Larrabee is the logical conclusion of these trends

Lots of power-efficient cores

In-order pipeline

Clocked at the power/performance sweet spot

4 threads per core

16-wide vector units

Streaming support

What is Larrabee?

Very powerful

Very power-efficient

Very highly parallel

Very dependent on software to make use of that parallelism

What is Larrabee?

Gets as much work out of each watt and each square millimeter as possible

Scales well far into the future

Massive potential performance; 1 teraflop & up

My favorite part

Excellent software control of performance

Relies heavily on software to get it to live up to its potential

Larrabee hardware architecture

Obligatory vague architecture slide

[Figure: block diagram. Multi-threaded wide-SIMD cores, each with its own I$ and D$, share an L2 cache; also shown are memory controllers, fixed-function texture logic, a display interface, and a system interface.]

Larrabee hardware architecture

Lots of enhanced in-order x86 cores

Fully capable of running an OS and apps

Great flexibility in graphics pipeline design

Can support a wide variety of software

4 threads per core

Hide pipeline latency, cache misses

Coherent caches, connected by a fast ring

Larrabee hardware architecture

Tried to maximize general usability of Larrabee hw

Fixed-function texture sampler units

Also a per-core cache-management unit

No other fixed-function units

Larrabee hardware architecture

These features boost performance via thread-level parallelism

Key element of Larrabee performance, but it’s not unique to Larrabee, so I’m not going to talk about it further today

Larrabee hardware architecture

I’m going to focus on per-thread (data-level, SIMD) parallelism

16-wide vector unit

Why 16-wide?

The wider the better – if it gets used

LRBni

Larrabee New Instructions

>100 new instructions

Mostly vector instructions

Architected in close collaboration with developers

Design philosophy

It’s hard to leverage data parallelism without all the right pieces in the instruction set

Enable generally-usable extraction of data-level parallelism

The fundamentals of LRBni

32 vector registers, each 512 bits wide

v0-v31

Full complement of vector instructions

Operate on int32, float32, float64

Mul, add, sub, adc, sbb, subr, and, or, xor, multiply-add, multiply-sub

Vector compares

Aligned and unaligned store/load

Gather/scatter

Bit manipulation: insert field, interleave, shuffle

The fundamentals of LRBni

Ternary (three-argument) operations

Load-operand – can read one arg from memory

No-cost type conversions on load/store

All math is 32- or 64-bit wide

Smaller data in memory to save bw and footprint

Common upconversions on load-operand

Upconversion and/or broadcast on memory load

Downconversion and/or selection on memory store

All common DX/OpenGL types including float16, unorm8, etc.

The fundamentals of LRBni

8 16-bit mask registers

Every vector instruction can do no-cost predication

Most often set by vector compares

Can be copied from scalar registers (eax, ebx, …)

Set of logical instructions that operate on masks

Mask tests allow real branches and loops
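To make predication concrete, here is a minimal Python model of a vector compare that produces a 16-bit mask and a masked add that writes only the selected lanes. The function names `compare_lt` and `masked_add` are made up for this sketch and are not LRBni mnemonics; vectors are plain lists of 16 values.

```python
LANES = 16

def compare_lt(a, b):
    """Model of a vector compare: bit i of the result is set when a[i] < b[i]."""
    mask = 0
    for i in range(LANES):
        if a[i] < b[i]:
            mask |= 1 << i
    return mask

def masked_add(dst, a, b, mask):
    """Model of a predicated add: lanes whose mask bit is 0 keep their old dst value."""
    return [a[i] + b[i] if (mask >> i) & 1 else dst[i] for i in range(LANES)]

a = list(range(LANES))          # 0, 1, ..., 15
b = [8] * LANES
k = compare_lt(a, b)            # lanes 0..7 pass the compare
result = masked_add([0] * LANES, a, b, k)
```

Only the lanes that passed the compare are written; the rest of `result` keeps the zeroes it started with.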

The fundamentals of LRBni

Parallel <=> serial conversion

Pack-store; load-unpack

Gather; scatter

Bit scan initialized

Streaming support

Prefetching

Cache control

Tim Sweeney on Larrabee

Quotes from Tim Sweeney on Larrabee:

Short version: Larrabee rocks!

Larrabee instructions are “vector complete”

More precisely: Any loop written in a traditional programming language can be vectorized, to execute 16 iterations of the loop in parallel on Larrabee vector units, provided the loop body meets the following criteria:

Its call graph is statically known.

There are no data dependencies between iterations.

Michael Abrash on Larrabee

“Tim’s absolutely right, but I’ll bet there’s still a lot of performance to be had from mucking around under the hood”

Sample LRBni code

kxnor k2, k2

vorpi v0, v2, v2

vorpi v1, v3, v3

vxorpi v4, v4, v4

mov ebx, 256

loop:

cmp ebx,0

jl endloop

dec ebx

vmulps v21 {k2}, v0, v1

vaddps v21 {k2}, v21, v21

vmadd213ps v0 {k2}, v0, v2

vmsub231ps v0 {k2}, v1, v1

vaddps v1 {k2}, v21, v3

vaddps v4 {k2}, v4, [ConstantFloatOne] {1to16}

vmulps v25 {k2}, v0, v0

vmadd231ps v25 {k2}, v1, v1

vcmpleps k2 {k2}, v25, [ConstantFloatOne] {1to16}

kortest k2, k2

jnz loop

endloop:

(this happens to be a Mandelbrot-set generator)

(thanks Dean for fixing this!)
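The control flow of the listing (16 points stepped together, a mask that switches off lanes as they finish, and an early exit once the mask is empty) can be modeled roughly in Python. This is a sketch, not a transliteration: the lane math is the textbook z = z^2 + c update, and `max_iter` and `limit` are assumed parameters (the listing compares against a constant it calls ConstantFloatOne).

```python
LANES = 16

def mandelbrot_counts(cr, ci, max_iter=256, limit=4.0):
    """Iterate 16 points at once; a lane stops updating once it leaves the mask."""
    x = list(cr)
    y = list(ci)
    count = [0] * LANES
    mask = (1 << LANES) - 1                 # all lanes start active
    for _ in range(max_iter):
        new_mask = 0
        for i in range(LANES):
            if not (mask >> i) & 1:
                continue                    # predicated off: lane keeps its state
            x[i], y[i] = x[i] * x[i] - y[i] * y[i] + cr[i], 2.0 * x[i] * y[i] + ci[i]
            count[i] += 1
            if x[i] * x[i] + y[i] * y[i] <= limit:
                new_mask |= 1 << i          # still bounded, stays active
        mask = new_mask
        if mask == 0:                       # no active lanes left, leave the loop
            break
    return count

cr = [0.0] * 15 + [2.0]
ci = [0.0] * 15 + [2.0]
counts = mandelbrot_counts(cr, ci)
```

Lanes at the origin never escape and run to the iteration cap, while the lane at 2+2i drops out of the mask after one step.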

LRBni examples

Ternary: start with a simple vector multiply

vmulps v0, v5, v6 ; v0 = v5 * v6

LRBni examples

Multiply-add: destination is also third source

vmadd231ps v0, v5, v6 ; v0 = v5 * v6 + v0

Operand 2 times operand 3 plus operand 1

LRBni examples

Predication: mask the writing of the elements

vmadd231ps v0 {k1}, v5, v6

LRBni examples

Load-operand: src2 is the memory location specified by rbx+rcx*4

vmadd231ps v0 {k1}, v5, [rbx+rcx*4]

LRBni examples

The operands can be plugged in differently

vmadd213ps v0 {k1}, v5, [rbx+rcx*4]

Operand 2 times operand 1 plus operand 3
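The digits in the mnemonic say which operand goes where. A small Python model of the two orderings described on these slides, with vectors as plain lists (the Python function names simply mirror the mnemonics and are otherwise an assumption of this sketch):

```python
def vmadd231(op1, op2, op3):
    """Operand 2 times operand 3 plus operand 1 (operand 1 is also the destination)."""
    return [b * c + a for a, b, c in zip(op1, op2, op3)]

def vmadd213(op1, op2, op3):
    """Operand 2 times operand 1 plus operand 3 (operand 1 is also the destination)."""
    return [b * a + c for a, b, c in zip(op1, op2, op3)]

v0 = [1.0] * 16
v5 = [2.0] * 16
mem = [3.0] * 16              # stands in for the load-operand

r231 = vmadd231(v0, v5, mem)  # 2 * 3 + 1 in every lane
r213 = vmadd213(v0, v5, mem)  # 2 * 1 + 3 in every lane
```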

LRBni examples

Broadcast: expand 4 (or 1) elements in memory

vmadd231ps v0 {k1}, v5, [rbx+rcx*4] {4to16}

LRBni examples

Conversion: upconvert from float16 format

vmadd231ps v0 {k1}, v5, [rbx+rcx*4] {float16}

LRBni examples

Pack-store

vcompressd [rbx] {k1}, v0

LRBni examples

Gather

vgatherd v1 {k1}, [rbx + v2*4]
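As a rough model of what pack-store and gather do, the Python below treats memory as a list of elements (so the *4 byte scaling disappears) and vectors as short lists. The names `pack_store` and `gather`, and the choice to pack from the lowest lane upward, are assumptions of this sketch.

```python
def pack_store(memory, base, vec, mask):
    """Write the mask-selected elements of vec contiguously from memory[base]; return how many."""
    out = base
    for i, value in enumerate(vec):
        if (mask >> i) & 1:
            memory[out] = value
            out += 1
    return out - base

def gather(dst, memory, base, indices, mask):
    """Load memory[base + indices[i]] into each mask-selected lane; other lanes keep dst."""
    return [memory[base + indices[i]] if (mask >> i) & 1 else dst[i]
            for i in range(len(dst))]

memory = [0] * 8
stored = pack_store(memory, 2, [10, 11, 12, 13], 0b1010)      # lanes 1 and 3 selected
loaded = gather([0, 0, 0, 0], [5, 6, 7, 8, 9], 1, [3, 0, 2, 1], 0b0111)
```

Pack-store turns a sparse set of lanes into a dense run in memory, which is the parallel-to-serial direction; gather pulls scattered elements back into lanes.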

The fundamentals of LRBni

All these instructions run at full speed!

I know this has been way too fast

There just isn’t time for an in-depth look

LRBni in more detail:

Tom Forsyth’s talk (10:30: Room 3002, West Hall)

Dr. Dobb’s Journal article in April

These slides (complete with notes)

LRBni in Action

Everything you need to run fully-parallel code well

General code running on a CPU

Can run anything

How well can it run less-than-perfectly-parallelizable code?

RAD has been working on the Larrabee D3D/OpenGL graphics pipeline

Pipeline is not an ideal candidate for parallelization

Retirement must be in order

Rasterization is not easy to vectorize efficiently

I’ll look at rasterization today

The process of determining which pixels are inside a triangle

Applying LRBni to the graphics pipeline

For the most part, this is easy

Z/stencil buffering, pixel shading, blending all fall out of processing 4x4 blocks

16 vertices can be shaded and cached in parallel

Vertex usage tends to be localized

Triangle set-up lends itself pretty well to vectorization

Some efficiency cost from culling

Rasterization is not easy to vectorize

At least not with usable performance

In fact, we were sure it couldn’t be done!

Forced to reexamine assumptions

So irritated at being asked for the hundredth time if it was possible that we sat down to prove it wasn’t

We failed

Dedicated hardware can do any given task more efficiently

In general, dedicated hardware will be able to do any given graphics task more efficiently per square millimeter than software

However, CPU flexibility can gain some or all of that back by applying square millimeters as needed

Hardware needs worst-case capacity for each component

Often partly or entirely idle

When little rasterization is needed (long shaders, say), CPUs can just use cycles for other purposes; ALUs are never idle

Can even have higher peak rates in many cases, because the whole chip can work on a single task if necessary

A quick refresher

Three edges per triangle, each defined by an equation Bx+Cy relative to any point on edge

Sign indicates in/out

X and y snapped to 15.8

Range [-16K,+16K]

Tested at pixel/sample centers

Fill rules must be observed

Must be exact (discrete math)

Must support multisampled antialiasing (MSAA)
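A minimal Python sketch of the refresher: vertices are snapped to 8 fractional bits, each edge is stored as B, C and a reference point on the edge, and a pixel is inside when all three edge values at its center are negative. The helper names and the vertex order (chosen so that inside comes out negative) are assumptions of this sketch, and fill rules are deliberately ignored.

```python
FRAC_BITS = 8
ONE = 1 << FRAC_BITS          # 1.0 in fixed point
HALF = ONE // 2               # offset from a pixel's corner to its center

def snap(v):
    """Snap a coordinate to fixed point with 8 fractional bits."""
    return int(round(v * ONE))

def make_edge(p0, p1):
    """Return (B, C, x0, y0) for the edge from p0 to p1."""
    (x0, y0), (x1, y1) = p0, p1
    return (y1 - y0, -(x1 - x0), x0, y0)

def edge_value(edge, x, y):
    """Evaluate Bx + Cy with (x, y) taken relative to a point on the edge."""
    b, c, x0, y0 = edge
    return b * (x - x0) + c * (y - y0)

def pixel_inside(edges, px, py):
    """Test the pixel center; negative on all three edges means inside."""
    cx, cy = px * ONE + HALF, py * ONE + HALF
    return all(edge_value(e, cx, cy) < 0 for e in edges)

verts = [(snap(0), snap(0)), (snap(4), snap(0)), (snap(0), snap(4))]
edges = [make_edge(verts[i], verts[(i + 1) % 3]) for i in range(3)]
```

Because everything is integer math, the result is exact; a real rasterizer would add the fill-rule adjustment for centers that land exactly on an edge.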

Pixel rasterization: example

[Figure: a triangle whose three edges are each labeled with a + side and a - side.]

Pixomatic 1 rasterization

Pixo 1 decomposed triangles into 1 or 2 trapezoids

Stepped down 2 edges at a time on pixel centers

Could do with only 1 branch, for loop

Branch misprediction is very expensive

Rasterization code tends to predict poorly

Pixomatic 1 rasterization

Problems with Pixo 1 rasterization

Required expensive IMUL per edge

Poorly suited to small triangles

Not well suited to MSAA sample jittering

Never could figure out how to vectorize it efficiently

Sweep rasterization

Generally well suited to vectorization

Problems for CPUs

Lots of badly-predicted branching

Significant work to figure out where to descend

Example of an approach that dedicated hardware can perform much more efficiently

Larrabee rasterization

CPU smarts are useful for rasterization

Not-rasterizing

CPU smarts are useful for rasterization

The Larrabee renderer uses chunking (binning, tiling)

For chunking, separate per-tile and intra-tile rasterization

Per-tile rasterization (triangle->tile assignment)

Bounding box tests for triangles up to 1 tile in size

Sweep or just walk bounding box for larger triangles

Tile-assignment time is insignificant for larger triangles

CPUs make it easy to do the 90% case well and the difficult 10% case adequately

If large-triangle tile assignment is important, could use a form of the approach used for intra-tile (discussed next)

A tiled render target

[Figure: a 256x256 render target, spanning (0,0) to (256,256), divided into four tiles: Tile 0, Tile 1, Tile 2, and Tile 3.]

Tile assignment test – trivial reject

Calculate value of each edge equation at its trivial reject tile corner

If any edge is non-negative, triangle does not intersect tile (<0 == inside so we can just test sign bit)
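The trivial reject test can be sketched in a few lines of Python. Here each edge is stored as (b, c, k) with value b*x + c*y + k, and the trivial reject corner is taken to be the tile corner where that value is most negative; the helper names, the integer pixel coordinates, and the vertex order are assumptions made for this sketch.

```python
def edge_from_points(p0, p1):
    """Return (b, c, k) so that b*x + c*y + k is zero on the line through p0 and p1."""
    (x0, y0), (x1, y1) = p0, p1
    b, c = y1 - y0, -(x1 - x0)
    return (b, c, -(b * x0 + c * y0))

def trivial_reject_corner(b, c, x0, y0, size):
    """Pick the tile corner where the edge value is most negative (most inside)."""
    x = x0 if b > 0 else x0 + size
    y = y0 if c > 0 else y0 + size
    return x, y

def tile_rejected(edges, x0, y0, size):
    """True when some edge is non-negative even at its trivial reject corner."""
    for b, c, k in edges:
        x, y = trivial_reject_corner(b, c, x0, y0, size)
        if b * x + c * y + k >= 0:
            return True
    return False

verts = [(0, 0), (100, 0), (0, 100)]
edges = [edge_from_points(verts[i], verts[(i + 1) % 3]) for i in range(3)]
```

With 128x128 tiles on the 256x256 target, only the tile at the origin survives for this triangle; the other three are rejected by the diagonal edge alone.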

Trivial reject: example

[Figure: the 256x256 render target, (0,0) to (256,256), divided into Tile 0 through Tile 3 and crossed by a black edge with a + (more positive) side and a - (more negative) side. The trivial reject corner of each tile for the black edge is marked. Callout for tile 0: if this point isn’t inside the black edge, no point in the tile can be inside the black edge.]

Tile assignment test – trivial accept

For each edge, sum trivial reject corner value with the equation step to opposite corner

If any edge is negative, the whole tile is trivially accepted for that edge

No need to consider it when rasterizing within tile

In general, only relevant edges need to be considered

Scissors, user clip
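Putting the reject and accept tests together gives a three-way tile classification. The Python below is a sketch under the same assumptions as before (edges stored as (b, c, k), negative means inside, hypothetical helper names); the accept value is the reject-corner value plus the step to the opposite corner, as described above.

```python
def edge_from_points(p0, p1):
    """Return (b, c, k) so that b*x + c*y + k is zero on the line through p0 and p1."""
    (x0, y0), (x1, y1) = p0, p1
    b, c = y1 - y0, -(x1 - x0)
    return (b, c, -(b * x0 + c * y0))

def classify_tile(edges, x0, y0, size):
    """Return ('reject' | 'accept' | 'partial', indices of edges that still matter)."""
    relevant = []
    for index, (b, c, k) in enumerate(edges):
        rx = x0 if b > 0 else x0 + size          # trivial reject corner
        ry = y0 if c > 0 else y0 + size
        reject_value = b * rx + c * ry + k
        if reject_value >= 0:
            return "reject", []                  # tile is entirely outside this edge
        # Step to the opposite (trivial accept) corner.
        accept_value = reject_value + (abs(b) + abs(c)) * size
        if accept_value >= 0:
            relevant.append(index)               # edge crosses the tile
    return ("accept", []) if not relevant else ("partial", relevant)

def triangle_edges(verts):
    return [edge_from_points(verts[i], verts[(i + 1) % 3]) for i in range(3)]

small = triangle_edges([(0, 0), (100, 0), (0, 100)])
huge = triangle_edges([(-100, -100), (1000, -100), (-100, 1000)])
```

An "accept" result corresponds to storing a draw-whole-tile command, and the list returned with "partial" is the set of edges that intra-tile rasterization still has to consider.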

Trivial accept: example

[Figure: the 256x256 render target, (0,0) to (256,256), divided into Tile 0 through Tile 3 and crossed by a black edge with + and - sides. The trivial accept corner of each tile for the black edge is marked. Callout for tile 0: if this point is inside the black edge, all points in the tile must be inside the black edge.]

Not-rasterizing

If all edges are negative at their trivial accept corners, the whole tile is inside the triangle

No further rasterization is needed

Can just store a “draw-whole-tile” command in bin

Back end can then effectively do two nested loops around shaders

Full-screen triangle rasterization speed ~= infinity

Intra-tile rasterization

Same principle, but vectorized

Assume tile is 64x64 pixels and vector width is 16

Vector-calculate the 16 trivial-reject and trivial-accept corners of the 16x16 blocks as a delta from the tile trivial-reject corner

Just two adds per edge, using tables generated at triangle set-up time

AND edge results together
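The same idea one level down, sketched in Python: a per-edge table of 16 steps from the tile trivial-reject corner to each 16x16 block's trivial-reject corner, one more constant step to reach the trivial-accept corners, and the per-edge results ANDed into masks. The row-major block numbering, the (b, c, k) edge form, and all function names are assumptions of this sketch; the inner loop over 16 entries stands in for one vector instruction.

```python
TILE, BLOCK = 64, 16
GRID = TILE // BLOCK                      # 4x4 blocks per tile

def edge_from_points(p0, p1):
    (x0, y0), (x1, y1) = p0, p1
    b, c = y1 - y0, -(x1 - x0)
    return (b, c, -(b * x0 + c * y0))

def block_step_table(b, c):
    """Edge steps from the tile reject corner to each block's reject corner."""
    table = []
    for by in range(GRID):
        for bx in range(GRID):
            dx = bx * BLOCK if b > 0 else (bx + 1 - GRID) * BLOCK
            dy = by * BLOCK if c > 0 else (by + 1 - GRID) * BLOCK
            table.append(b * dx + c * dy)
    return table

def block_masks(edges, x0, y0):
    """Return (trivially accepted blocks, partially covered blocks) as 16-bit masks."""
    not_rejected = accepted = (1 << GRID * GRID) - 1
    for b, c, k in edges:
        rx = x0 if b > 0 else x0 + TILE
        ry = y0 if c > 0 else y0 + TILE
        tile_value = b * rx + c * ry + k
        accept_step = (abs(b) + abs(c)) * BLOCK
        for i, step in enumerate(block_step_table(b, c)):
            value = tile_value + step                 # first add: reject corners
            if value >= 0:
                not_rejected &= ~(1 << i)
            if value + accept_step >= 0:              # second add: accept corners
                accepted &= ~(1 << i)
    return accepted, not_rejected & ~accepted

def triangle_edges(verts):
    return [edge_from_points(verts[i], verts[(i + 1) % 3]) for i in range(3)]

accept100, partial100 = block_masks(triangle_edges([(0, 0), (100, 0), (0, 100)]), 0, 0)
accept60, partial60 = block_masks(triangle_edges([(0, 0), (60, 0), (0, 60)]), 0, 0)
```

For an edge with negative B and C, the table's x offsets run from -48 to 0, matching the steps shown in the trivial reject figure.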

Vector rasterization to 16x16s: trivial reject example

[Figure: a 64x64 tile divided into sixteen 16x16 blocks and crossed by a black edge with + and - sides. White dots are the trivial reject corners of the 16x16s for the black edge; gray 16x16s are trivially rejected by the black edge. The tile trivial reject corner for the black edge is the value to step from. Example steps: to (-48, 0), step by B(-48) + C(0); to (-48, -48), step by B(-48) + C(-48).]

Vector rasterization to 16x16s: trivial accept example

[Figure: the same 64x64 tile. White dots are the trivial accept corners of the 16x16s for the black edge; gray 16x16s are trivially rejected by the black edge; pink 16x16s are trivially accepted by the black edge.]

Intra-tile rasterization

Vector-test sign of edge equations at trivial accept and trivial reject corners and AND together

Bit-scan through resulting masks to find trivially and partially accepted 16x16 blocks

Each trivial accept becomes a draw-block command

Again, no further rasterization needed for those pixels

Intra-tile rasterization

Edge #1 equation step values from tile trivial accept corner to trivial accept corners of 16x16s:

 3  2  1  0  2  1  0 -1  1  0 -1 -2  0 -1 -2 -3

+ Edge #1 equation value at tile trivial accept corner (added to every element):

 1

= Edge #1 equation values at trivial accept corners of 16x16s:

 4  3  2  1  3  2  1  0  2  1  0 -1  1  0 -1 -2

< Preset zeroes, in a register or broadcast from memory:

 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

= Bit mask, in mask register, for edge #1 trivial accept 16x16s:

 0  0  0  0  0  0  0  0  0  0  0  1  0  0  1  1

Intra-tile rasterization

Bit mask, in mask register, for edge #1 trivial accept 16x16s:

 0  0  0  0  0  0  0  0  0  0  0  1  0  0  1  1

& Bit mask, in mask register, for edge #2 trivial accept 16x16s:

 0  1  1  1  0  1  1  1  0  1  1  1  0  1  1  1

= Intermediate result:

 0  0  0  0  0  0  0  0  0  0  0  1  0  0  1  1

& Bit mask, in mask register, for edge #3 trivial accept 16x16s:

 0  0  1  1  0  0  1  1  0  0  1  1  0  0  0  1

= Composite trivial accept mask for 16x16 blocks:

 0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  1

Block #0 (the rightmost bit) found by first bit-scan

Block #4 found by second bit-scan

Bit-scan completed

Trivial accept for 16x16 blocks

; Step to edge values at 16 16x16 block trivial accept corners

vaddpi v0, v3, [rsi+Edge0TileTrivialAcceptCornerValue]{1to16}

vaddpi v1, v4, [rsi+Edge1TileTrivialAcceptCornerValue]{1to16}

vaddpi v2, v5, [rsi+Edge2TileTrivialAcceptCornerValue]{1to16}

; See if each trivial accept corner is inside all three edges

vcmplepi k1, v0, [rsi+ConstantZero]

vcmplepi k1 {k1}, v1, [rsi+ConstantZero]

vcmplepi k1 {k1}, v2, [rsi+ConstantZero]

; Get the mask; 1-bits are trivial accept corners that are

; inside all three edges

kmov eax, k1

; Loop through 1-bits, issuing a draw-16x16-block command

; for each trivially-accepted 16x16 block

bsf ecx, eax

jz TrivialAcceptDone ; bsf sets ZF when eax holds no 1-bits

TrivialAcceptLoop:

; <Store draw-16x16-block command, along with (x,y) location>

bsfi ecx, eax

jnz TrivialAcceptLoop

TrivialAcceptDone:
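The bit-scan loop at the end of the listing amounts to walking the set bits of the mask from the lowest upward. A small Python equivalent, where the command tuple and the row-major mapping from block number to (x, y) are assumptions made for illustration:

```python
def set_bits(mask):
    """Yield the indices of the 1-bits in mask, lowest first."""
    while mask:
        lowest = mask & -mask              # isolate the lowest set bit
        yield lowest.bit_length() - 1
        mask ^= lowest                     # clear it and continue the scan

def draw_block_commands(mask, tile_x, tile_y, block=16, grid=4):
    """One draw-block command per trivially accepted block (row-major numbering assumed)."""
    return [("draw_block", tile_x + (i % grid) * block, tile_y + (i // grid) * block)
            for i in set_bits(mask)]

commands = draw_block_commands(0b0000000000010001, 64, 0)
```

The loop does work only for blocks that are actually present in the mask, so an empty mask costs a single test.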

Trivial accept for 16x16 blocks

; Step to edge values at 16 16x16 block trivial accept corners &

; see if each trivial accept corner is inside all three edges

vaddsetspi v0, v3, [rsi+Edge0TileTrivialAcceptCornerValue]{1to16}

vaddsetspi v1, v4, [rsi+Edge1TileTrivialAcceptCornerValue]{1to16}

vaddsetspi v2, v5, [rsi+Edge2TileTrivialAcceptCornerValue]{1to16}

; Get the mask; 1-bits are trivial accept corners that are

; inside all three edges

kmov eax, k1

; Loop through 1-bits, issuing a draw-16x16-block command

; for each trivially-accepted 16x16 block

bsf ecx, eax

jz TrivialAcceptDone ; bsf sets ZF when eax holds no 1-bits

TrivialAcceptLoop:

; <Store draw-16x16-block command, along with (x,y) location>

bsfi ecx, eax

jnz TrivialAcceptLoop

TrivialAcceptDone:

Partially accepted 16x16 blocks

For each partial 16x16, descend again to 4x4s

Put trivially accepted 4x4s in bins

Partially accepted 4x4s need to be processed into pixel masks

Vector add of equation step for each edge

AND signs together to form pixel mask
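A Python sketch of the last level: for each edge, add a 16-entry table of steps to the value at a corner of the 4x4, take the sign of each result, and AND the three sign masks into one 16-bit pixel mask. For simplicity this sketch steps from the block's minimum corner rather than its trivial reject corner, uses 8 fractional bits, numbers pixels row-major, and ignores fill rules; those choices and the helper names are assumptions.

```python
ONE = 1 << 8                    # 1.0 in fixed point with 8 fractional bits
HALF = ONE // 2                 # corner-to-center offset

def edge_from_points(p0, p1):
    (x0, y0), (x1, y1) = p0, p1
    b, c = y1 - y0, -(x1 - x0)
    return (b, c, -(b * x0 + c * y0))

def pixel_center_table(b, c):
    """Steps from the 4x4's corner to its 16 pixel centers (built once per edge)."""
    return [b * (px * ONE + HALF) + c * (py * ONE + HALF)
            for py in range(4) for px in range(4)]

def pixel_mask(edges, block_x, block_y):
    """16-bit mask of pixels in the 4x4 at (block_x, block_y) that are inside all edges."""
    mask = 0xFFFF
    for b, c, k in edges:
        corner_value = b * block_x * ONE + c * block_y * ONE + k
        edge_mask = 0
        for i, step in enumerate(pixel_center_table(b, c)):
            if corner_value + step < 0:          # sign bit set: inside this edge
                edge_mask |= 1 << i
        mask &= edge_mask                        # AND the per-edge sign masks
    return mask

verts = [(0, 0), (3 * ONE, 0), (0, 3 * ONE)]
edges = [edge_from_points(verts[i], verts[(i + 1) % 3]) for i in range(3)]
mask = pixel_mask(edges, 0, 0)
```

For this small triangle only the pixels at (0,0), (1,0) and (0,1) have centers strictly inside, giving bits 0, 1 and 4.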

Pixel mask for partial 4x4: example

[Figure: a 4x4 block of pixels crossed by a black edge with + and - sides. White dots are pixel centers; blue pixels are inside the black edge and grey pixels are outside it. The trivial reject corner for the black edge for the 4x4 is marked.]

Resulting mask register: 1110111011001100

Partially accepted 4x4 blocks

; Store values at corners of 16 4x4s in 16x16 for indexing into

vstored [rsi+Edge0TrivialRejectCornerValues4x4], v0

vstored [rsi+Edge1TrivialRejectCornerValues4x4], v1

vstored [rsi+Edge2TrivialRejectCornerValues4x4], v2

; Load step tables from corners of 4x4s to pixel centers

vloadd v3, [rsi+Edge0PixelCenterTable]

vloadd v4, [rsi+Edge1PixelCenterTable]

vloadd v5, [rsi+Edge2PixelCenterTable]

; Loop through 1-bits from trivial reject test on 16x16 block,

; descending to rasterize each partially-accepted 4x4

kmov eax, k1

bsf ecx, eax

jz Partial4x4Done ; bsf sets ZF when eax holds no 1-bits

Partial4x4Loop:

; See if each of 16 pixel centers is inside all three edges

vcmpgtpi k2, v3, [rsi+Edge0TrivialRejectCornerValues4x4+rcx*4] {1to16}

vcmpgtpi k2 {k2}, v4, [rsi+Edge1TrivialRejectCornerValues4x4+rcx*4] {1to16}

vcmpgtpi k2 {k2}, v5, [rsi+Edge2TrivialRejectCornerValues4x4+rcx*4] {1to16}

; Store the mask

kmov edx, k2

mov [rbx], dx

; <Store the (x,y) location and advance rbx>

bsfi ecx, eax

jnz Partial4x4Loop

Partial4x4Done:

Partially accepted 4x4 blocks - MSAA

[Figure: a 4x4 block of pixels crossed by a black edge with + and - sides. Yellow dots are sample 2 centers; blue samples are inside the black edge and grey samples are outside it. The trivial reject corner for the black edge for the 4x4 is marked.]

Larrabee rasterization

Adaptive rasterization

Don’t have the luxury of custom data and ALU sizes

Do have the luxury of adapting to input data

Edge equations have to be evaluated with 48 bits in the worst case

We have to use 64 bits

Can use 32 bits for triangles that fit in 128x128 bounding boxes

90+% of triangles
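One way to see where 48 and 32 come from, as a back-of-the-envelope sketch rather than the talk's own derivation: an edge value is a cross product of two coordinate differences, so if the edge and the point being tested all lie within a box of a given extent, its magnitude is at most the square of that extent in fixed-point units. The function name and the assumption that every evaluated point stays inside the box are mine.

```python
def edge_bits_needed(extent_pixels, frac_bits=8):
    """Signed bits needed for an edge value when all points lie in a square box."""
    extent = extent_pixels << frac_bits          # box width in fixed-point units
    max_magnitude = extent * extent              # largest possible |Bx + Cy|
    return max_magnitude.bit_length() + 1        # plus one bit for the sign

full_range = edge_bits_needed(32 * 1024)         # [-16K, +16K] is 32K pixels wide
small_box = edge_bits_needed(128)                # 128x128 bounding box
```

Under those assumptions the full coordinate range needs 48 bits and a 128x128 box fits exactly in 32.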

Adaptive rasterization (cont.)

When we do have to do 64-bit edge evaluation

64-bit only required for tile assignment

Any tile up to 128x128 that’s not trivially accepted or rejected can be rasterized using 32 bits

Adaptive intra-tile rasterization

Triangles that fit in a 16x16 bounding box

One less level of descent, less set-up, no trivial accept test

Direct mask stamping for 4x4, 4x8, 8x4 bounding boxes

Non-rasterization-based z for small triangles

Easy to try things out

Implementation

Will still not match dedicated hardware peak rates per square millimeter on average

Efficient enough, and avoids dedicating area and design effort for a narrow purpose

Generality improves overall perf for a wide range of tasks

For example, can bring more mm^2 to bear – the whole chip!

Implementation

Texture sampling and filtering remain as significant challenges for software

Apart from that, everything can be implemented in software

Not always obvious how at first, but surprisingly often doable

Still evolving

A whole new way to think about optimization

Further Larrabee Information

Tom Forsyth’s talk

10:30: Room 3002, West Hall

SIGGRAPH paper

“Larrabee: A Many-Core x86 Architecture for Visual Computing,” Seiler et al

Just search on “Larrabee SIGGRAPH paper”

Dr. Dobb’s Journal article in April

www.intel.com/software/graphics

GDC Larrabee talks

C++ Larrabee Prototype Library
