
IEEE TRANSACTIONS ON MAGNETICS, VOL. 50, NO. 7, JULY 2014 3500904

An Efficient Rasterization Unit With Ladder Start Tile Traversal in 3-D Graphics Systems

Yeong-Kang Lai and Yu-Chieh Chung

Department of Electrical Engineering, National Chung Hsing University, Taichung 402, Taiwan

To render 3-D graphics efficiently, rasterization techniques have been developed. Traditional triangle traversal techniques using scan-line-based or edge-equation-based methods may cause potential instability from the division operations. This paper develops an efficient rasterization algorithm, a barycentric-based ladder start tile traversal that is division free. Throughout the process, no extra traversal position and context is produced, which reduces the number of pixel tests and improves the efficiency and stability of the graphic rendering. It also presents the architecture of a 300 MHz 3-D graphics rasterizer in a 65 nm 1P9M process with a core size of 0.537 mm² and an internal buffer of 3 K. The proposed ladder start tile traversal architecture can perform each tile intersect test in six cycles and tile interior traversal with barycentric tests at 2 pixels per cycle. In addition, the rasterizer throughput can achieve up to 50M triangles per second and 600M pixels per second.

Index Terms— 3-D graphics, rasterization, tile traversal.

I. INTRODUCTION

IN RECENT years, 3-D graphics applications have become increasingly popular, particularly in the consumer electronic market for 3-D graphics used in gaming applications on mobile devices. Notably, 3-D graphics-intensive applications are predicted to soon become widely available on a variety of portable mobile devices ranging from tablets to notebooks and smart phones. These devices feature multicore SoCs, higher resolution, and lower power requirements. Consequently, low-power design is an important strategy for becoming competitive in the arena of portable mobile consumer electronics.

In OpenGL ES 2.0, the fixed-function components of the rendering pipeline are two types of processing modules: the geometry module and the rendering module. The geometry module supports the primitive assembly, back-face culling, preclipping, and clipping functions. The rendering module supports the rasterizer (alpha, depth, and stencil) test, color buffer blending, and the dither process. To efficiently render 3-D graphics, rasterization techniques are developed.

Traditional triangle traversal techniques using scan-line-based or edge-equation-based methods may cause potential instability from the division operations. For 3-D graphics gaming applications requiring high resolution, cost-effective mobile device hardware is crucial and even necessary. This paper develops an efficient rasterization algorithm, a barycentric-based ladder start tile traversal which is division free. Throughout the process, no extra traversal position and context is produced, which reduces the number of pixel tests and improves the efficiency and stability of the graphic rendering. The proposed ladder start tile traversal architecture can perform each tile intersect test in six cycles and performs tile interior traversal with barycentric tests at 2 pixels per cycle.

The remainder of this paper is organized as follows. The related work of the rasterizing algorithm is discussed in Section II, and the concept creation and concept flow of the new tile traversal algorithm follow in Section III. The proposed barycentric-based rasterization and ladder start tile traversal are described in Section IV. The architecture of this 3-D graphics rasterization IP is shown in Section V. The implementation result is shown in Section VI. Finally, a conclusion is provided in Section VII.

Manuscript received October 11, 2013; revised December 26, 2013; accepted January 12, 2014. Date of current version July 7, 2014. Corresponding author: Y.-K. Lai (e-mail: [email protected]).

Digital Object Identifier 10.1109/TMAG.2014.2301880

II. RELATED WORKS

In the late 1980s, as edge functions were proposed, many new methods appeared. These included the zigzag traversal [1], bounding box traversal [2], midpoint traversal [3], [4], scanline-based traversal [5], and tiled traversal [6], [7]. The research work of Pineda [2] aimed at a parallel algorithm for the rasterization of polygons, representing each edge of a polygon by a linear edge function. It also provided several simple and smarter pixel traversal algorithms. To increase mobile-rendering capabilities, Akenine-Moller and Strom [1] proposed a hardware architecture for rasterizing textured triangles using a zigzag traversal culling scheme that avoids a significant amount of z-buffer reads. Ma et al. [3] and Wang et al. [4] used FPGA to implement and improve the central-line algorithm in the equality resource, in contrast to other triangle rasterization algorithms based on edge functions. Sun et al. [6] proposed a universal rasterizer (UR) with edge equations and a tile-scan triangle traversal algorithm for low-cost graphics rendering. Woo et al. [5] designed a low-power 3-D-rendering engine with two texture units, where shaping the triangle is accelerated in the TS stage to perform the scanline-based rasterization. The research of McCormack and McNamara [7] describes a polygon traversal algorithm that generates fragments in a tiled fashion. This requires several additional saved contexts (the values of all interpolator accumulators, such as Z-depth, red, green, and blue).

III. CONCEPT CREATION AND CONCEPT FLOW

A. Barycentric-Based Triangle Traversal

A key stage in the 3-D graphics pipeline is the one that maps each pixel to a given triangle. Triangle traversal is the procedure that identifies the pixels contained inside the triangle. An efficient algorithm is needed not only for minimal power consumption but also to provide the stability required for hardware implementation. Existing triangle traversal algorithms fall into two main categories: scanline-based and edge-equation-based algorithms. However, the scanline traversal has several problems [7]. Similarly, testing the fragments inside the triangle using the edge-equation-based triangle traversal may cause potential instability from the division operations. The accuracy of the division operation is based upon the reference table implemented by the hardware. Some pixels near the edge of the triangle will be lost or rendered redundantly because of the inaccuracy of the edge equation. Division operations are used for perspective correction and impact the coefficient accuracy of the edge equation. According to the concept from [8], we can do rasterization (including finding fragments inside the triangle and varying interpolation) without division. But when we started determining how to implement the division-free concept, we found it difficult to implement in our mobile graphic system. We initially came up with three candidate rendering pipeline solutions, but some made no sense or required too many additional calculations. The three candidates are described below.

0018-9464 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

First, we did rasterization in the homogeneous space, just as in [8]. We found it difficult and possibly a waste of computation time to do tile scanning in the homogeneous space, as a triangle may be huge in homogeneous space yet become very small after being projected onto a screen. For mobile devices, this solution is not suitable because too many extra calculations are required.

Next, we did tile scanning and found fragments inside the triangle in screen space. To avoid perspective correction, we conducted varying interpolation in a homogeneous space. However, we needed to calculate the homogeneous coordinate of every fragment inside the triangle from its screen coordinate, which required multiplying by the inverse matrix.

Finally, we attempted barycentric-based rasterization in screen space, testing whether a fragment is inside the triangle with a barycentric-based triangle traversal that is division free and relies only on comparators. This approach requires perspective correction in the pixel-rendering stage. However, the correction effort is still much less than that required in the previous two test solutions. This paper therefore uses this scenario and derives the proposed algorithm in the next section.
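The paper does not spell out the perspective correction applied in the pixel-rendering stage. As an illustrative sketch only, the following Python shows a commonly used formulation: attribute/w and 1/w are interpolated linearly with the screen-space barycentric coordinates, and a single division is performed per pixel. The function name `perspective_correct` and its argument layout are assumptions, not the authors' design.

```python
def perspective_correct(bary, attrs, ws):
    """Interpolate one per-vertex attribute with perspective correction.

    bary  -- screen-space barycentric coordinates (a1, a2, a3) of the pixel
    attrs -- attribute value at each of the three vertices
    ws    -- clip-space w at each vertex (assumed positive)
    """
    # Interpolate attribute/w and 1/w linearly in screen space.
    num = sum(a * v / w for a, v, w in zip(bary, attrs, ws))
    den = sum(a / w for a, w in zip(bary, ws))
    # One division per pixel recovers the perspective-correct value.
    return num / den
```

When all three w values are equal, the result reduces to plain linear interpolation, which is a quick sanity check for an implementation.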

IV. PROPOSED ALGORITHM

A. Barycentric-Based Rasterization in Screen Space

In the rasterization stage of a graphics pipeline, a fragment must be generated for each pixel position within a polygonal object. Rasterizing polygons with edge equation evaluation has been extensively investigated [2], [7]. In this approach, barycentric coordinates are first set up for each polygonal primitive. Triangle ABC has three vertices with screen coordinates, and P is a point with screen coordinates (x, y). The barycentric coordinates of P can be derived as in [8]:

$$a_1 = -\frac{b_1}{b_4}, \quad a_2 = -\frac{b_2}{b_4}, \quad a_3 = -\frac{b_3}{b_4} \tag{1}$$

$$b_i = \alpha_i x + \beta_i y + \gamma_i \tag{2}$$

Fig. 1. Barycentric-based tile scanning using the ladder start tile.

P(x, y) is inside triangle ABC if and only if $0 \le a_i \le 1$ for $i = 1, 2, 3$. The sign of $a_i$ can be determined without division, while $b_i$ can be derived from the determinant form [8, eq. (30)].
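As a minimal sketch of the division-free test, the Python below assumes one particular convention that satisfies (1) and (2): each $b_i$ is the standard edge function of the side opposite vertex $i$, and $b_4$ is the negated doubled signed area. The names `setup_barycentric` and `is_inside` are hypothetical and are not taken from the paper or from [8]. Under this convention the $a_i$ sum to 1, so checking $a_i \ge 0$ for all three is sufficient, and $a_i \ge 0$ holds exactly when $b_i$ is zero or has the opposite sign to $b_4$.

```python
def setup_barycentric(tri):
    """Per-triangle setup: coefficients (alpha_i, beta_i, gamma_i) and b4.

    Assumed convention: b_i is the edge function of the side opposite
    vertex i, and b4 = -(doubled signed area), so that a_i = -b_i / b4.
    """
    (x1, y1), (x2, y2), (x3, y3) = tri
    coeffs = [
        (y2 - y3, x3 - x2, x2 * y3 - x3 * y2),  # side BC, opposite A
        (y3 - y1, x1 - x3, x3 * y1 - x1 * y3),  # side CA, opposite B
        (y1 - y2, x2 - x1, x1 * y2 - x2 * y1),  # side AB, opposite C
    ]
    b4 = -sum(gamma for _, _, gamma in coeffs)
    return coeffs, b4


def is_inside(coeffs, b4, x, y):
    """Inside test that compares signs only and never divides."""
    if b4 == 0:
        return False  # degenerate (zero-area) triangle
    for alpha, beta, gamma in coeffs:
        b = alpha * x + beta * y + gamma
        # a_i < 0 exactly when b_i is nonzero and shares the sign of b4.
        if b != 0 and (b > 0) == (b4 > 0):
            return False
    return True
```

Because only signs are compared, the result does not depend on the vertex winding order, and points exactly on an edge are treated as inside in this sketch.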

Typically, we assume that a positive sign defines the interior half-space of a barycentric coordinate, because each coordinate represents a fraction of the triangle's area. Conceptually, a barycentric-based rasterizer simply tests the positions of all potential fragments against the barycentric coordinates for each triangle. However, Hakura's study shows the benefits of using a tile-scan rasterization order, in which the screen is tiled into rectangles related to the size of the texture cache [9]. To save the time spent traversing redundant tiles, Sun et al. [6] used a bounding box and executed double-sided tile traversal, moving backward and down to reach the next start tile. Our proposed method executes double-sided tile traversal according to the sign of the barycentric coefficient and travels down like a ladder to reach the next start tile, until testing shows that the whole tile lies outside the triangle on the lower side; it then decides whether to travel down from another ladder. The key point that we want to integrate into barycentric-based tile-scanning rasterization is how to check whether the whole tile is outside the triangle. Fig. 1 shows our barycentric-based tile-traversal algorithm, and the pseudocode using the ladder start tile is listed below.

    DecisionFirstTile
    From top to bottom in bounding box:
        Do GoTileTraversal(Right)    // scan tiles in the right direction
        Do GoTileTraversal(Left)     // scan tiles in the left direction
        Do GoLadderStartTile         // go to next row

    GoTileTraversal(dir) {
        If (dir == Left) {
            From current tile position to the leftmost tile position:
                If ((whole tile's barycentric coordinate1 < 0 && α1 > 0) ||
                    (whole tile's barycentric coordinate2 < 0 && α2 > 0) ||
                    (whole tile's barycentric coordinate3 < 0 && α3 > 0)) {
                    // no tile intersects the triangle from the current
                    // tile position to the leftmost tile position
                    Record Left Bound Position;
                    Break;
                }
                If (whole tile's barycentric coordinate1 < 0 ||
                    whole tile's barycentric coordinate2 < 0 ||
                    whole tile's barycentric coordinate3 < 0) {
                    Go to next left side tile
                } Else {
                    Current tile intersects the triangle
                    Go to next left side tile
                }
        } Else if (dir == Right) {
            From current tile position to the rightmost tile position:
                If ((whole tile's barycentric coordinate1 < 0 && α1 < 0) ||
                    (whole tile's barycentric coordinate2 < 0 && α2 < 0) ||
                    (whole tile's barycentric coordinate3 < 0 && α3 < 0)) {
                    // no tile intersects the triangle from the current
                    // tile position to the rightmost tile position
                    Record Right Bound Position;
                    Break;
                }
                If (whole tile's barycentric coordinate1 < 0 ||
                    whole tile's barycentric coordinate2 < 0 ||
                    whole tile's barycentric coordinate3 < 0) {
                    Go to next right side tile
                } Else {
                    Current tile intersects the triangle
                    Go to next right side tile
                }
        }
    }

    GoLadderStartTile {
        Go down one row
        If ((whole tile's barycentric coordinate1 < 0 && α1 < 0) ||
            (whole tile's barycentric coordinate2 < 0 && α2 < 0) ||
            (whole tile's barycentric coordinate3 < 0 && α3 < 0)) {
            Ladder Start Tile = RightBound - 1
            Go to Ladder Start Tile
        } Else if ((whole tile's barycentric coordinate1 < 0 && α1 > 0) ||
                   (whole tile's barycentric coordinate2 < 0 && α2 > 0) ||
                   (whole tile's barycentric coordinate3 < 0 && α3 > 0)) {
            Ladder Start Tile = LeftBound + 1
            Go to Ladder Start Tile
        } Else {
            Go to Ladder Start Tile
        }
    }

    DecisionFirstTile {
        If (top of triangle has two vertices) {
            FirstTile Position = Middle Position of the two vertices
        } Else {
            FirstTile Position = Vertex Position
        }
    }

Fig. 2. Tile intersect test by sign of the barycentric coordinates (a1, a2, a3).

Fig. 2 shows triangle ABC and four vertices representing the corners of the tile inside the triangle or around it. We can determine whether or not the given vertex is out of the directed edge by the sign of the barycentric coordinate. Then, we can easily check if one tile is wholly out of the triangle.

Fig. 3. Block diagram of the rasterizer function.
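The whole-tile test and the traversal pseudocode can be sketched in Python as follows. This is one simplified reading of the pseudocode, not the authors' hardware design. The assumptions are: integer screen coordinates, a non-degenerate triangle, a tile grid aligned to the origin, the same edge-function convention for $b_i$ and $b_4$ as a standard edge-function setup, and a ladder step that tests the tile directly below the previous start tile. All function names are hypothetical. A tile is treated as wholly outside when all four corners have a negative barycentric sign for the same edge, which is a conservative test: a tile near a triangle vertex can pass it without actually overlapping the triangle.

```python
def signed_edges(tri):
    """Edge functions scaled by +/-1 so each has the same sign as a_i.

    Assumed convention: b_i is the edge function of the side opposite
    vertex i and b4 = -(doubled signed area), so a_i = -b_i / b4.
    """
    (x1, y1), (x2, y2), (x3, y3) = tri
    edges = [
        (y2 - y3, x3 - x2, x2 * y3 - x3 * y2),
        (y3 - y1, x1 - x3, x3 * y1 - x1 * y3),
        (y1 - y2, x2 - x1, x1 * y2 - x2 * y1),
    ]
    b4 = -sum(g for _, _, g in edges)
    s = -1 if b4 > 0 else 1  # sign flip only, no division
    return [(s * a, s * b, s * g) for a, b, g in edges]


def outside_edges(edges, col, row, tw, th):
    """Edges for which all four corners of tile (col, row) are outside."""
    corners = [(x, y) for x in (col * tw, (col + 1) * tw)
               for y in (row * th, (row + 1) * th)]
    return [e for e in edges
            if all(e[0] * x + e[1] * y + e[2] < 0 for x, y in corners)]


def ladder_traversal(tri, tw=8, th=4):
    """Return tiles (col, row) that are not wholly outside any edge."""
    edges = signed_edges(tri)
    xs, ys = zip(*tri)
    c0, c1 = min(xs) // tw, max(xs) // tw
    r0, r1 = min(ys) // th, max(ys) // th
    # DecisionFirstTile: top vertex, or the middle of two top vertices.
    top = [x for x, y in tri if y == min(ys)]
    start = (sum(top) // len(top)) // tw
    hits = []
    for row in range(r0, r1 + 1):
        left_bound, right_bound = c0 - 1, c1 + 1
        for col in range(start, c1 + 1):          # GoTileTraversal(Right)
            out = outside_edges(edges, col, row, tw, th)
            if any(a < 0 for a, _, _ in out):
                right_bound = col  # a_i only decreases further right
                break
            if not out:
                hits.append((col, row))
        for col in range(start - 1, c0 - 1, -1):  # GoTileTraversal(Left)
            out = outside_edges(edges, col, row, tw, th)
            if any(a > 0 for a, _, _ in out):
                left_bound = col   # a_i only decreases further left
                break
            if not out:
                hits.append((col, row))
        if row < r1:                              # GoLadderStartTile
            out = outside_edges(edges, start, row + 1, tw, th)
            if any(a < 0 for a, _, _ in out):
                start = max(c0, right_bound - 1)
            elif any(a > 0 for a, _, _ in out):
                start = min(c1, left_bound + 1)
    return hits
```

For example, `ladder_traversal([(10, 2), (60, 30), (5, 50)])` returns the same set of 8 × 4 tiles as testing every tile in the bounding box with `outside_edges`, because each sweep stops only when every remaining tile in that direction is provably outside the same edge.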

If the current tile intersects the triangle, the tile interior traversal with the barycentric coordinate generation test must be performed. Four cases are distinguished to make the interior pixel traversal efficient: 1) when all four corners are inside the triangle, all pixels are valid, so the entire tile is used for pixel generation, starting from the top left and proceeding to the right, column by column; 2) if more than one corner is outside the right side of the triangle and the other corners are inside the triangle, the tile lies on the right boundary of the triangle and its right part may contain no pixels, so the pixel generation test proceeds from left to right, column by column, for cycle efficiency; 3) if more than one corner is outside the left side of the triangle and the other corners are inside the triangle, the tile lies on the left boundary of the triangle and its left part may contain no pixels, so the pixel generation test proceeds from right to left, column by column, for efficiency; and 4) if some corners are outside on the left and some are outside on the right, the location of the tile cannot be determined, so the entire tile must undergo the pixel generation test, processed from the top left to the bottom right, column by column.
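The four cases can be illustrated with a small Python sketch. It assumes that each edge is given as a sign-normalized function (alpha, beta, gamma) that is non-negative inside the triangle, and that an edge counts as a "right" boundary when alpha < 0 and as a "left" boundary when alpha > 0. It also treats any corner outside an edge as sufficient to trigger a case, and sends tiles that cross a horizontal edge to the full test. These rules, the function name `scan_order`, and the returned labels are all assumptions made for illustration.

```python
def scan_order(edges, x0, y0, w, h):
    """Pick the interior traversal order for the tile at (x0, y0), size w x h.

    edges -- sign-normalized edge functions (alpha, beta, gamma), each
             non-negative inside the triangle (assumed input format)
    """
    corners = [(x0, y0), (x0 + w, y0), (x0, y0 + h), (x0 + w, y0 + h)]
    sides = set()
    for alpha, beta, gamma in edges:
        for x, y in corners:
            if alpha * x + beta * y + gamma < 0:
                # Classify the violated edge by its slope along x.
                if alpha < 0:
                    sides.add("right")
                elif alpha > 0:
                    sides.add("left")
                else:
                    sides.add("other")  # horizontal edge
    if not sides:
        return "all_pixels_valid"   # case 1: no per-pixel rejection needed
    if sides == {"right"}:
        return "left_to_right"      # case 2: empty part is on the right
    if sides == {"left"}:
        return "right_to_left"      # case 3: empty part is on the left
    return "full_test"              # case 4: test the entire tile
```

With the triangle A(8, 0), B(0, 16), C(16, 16), the edges are (2, 1, -16) for the left side, (-2, 1, 16) for the right side, and (0, -1, 16) for the base; a 4 × 4 tile at (12, 8) then yields `"left_to_right"`, since only its right-hand corners fall outside.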

V. PROPOSED ARCHITECTURE

A functional block diagram of the developed rasterizer IP is depicted in Fig. 3. It integrates several dedicated modules to accelerate the fixed functions of the 3-D rendering pipeline. The rasterizer IP consists mainly of two major functional units: a geometry engine and a pixel-rendering engine. Specifically, the internal blocks are the main processing sub-modules, which can be efficiently partitioned into six large pipeline stages.

In the geometry engine, the first stage, the command FIFO, receives a job from the vertex scheduler and sends the required control signals to the data loader from the vertex buffer for vertex data loading. The vertex buffer stores the vertex transform and lighting variations generated from the vertex shader core. The second stage involves line/point preprocessing that partitions the line and points into two triangles based on line and point size. The third stage can be divided into two parallel parts: viewport transform processing, in which the polygon is offset to resolve the Z-fighting problem, and stage 1 of the barycentric coefficient (vertex level) or LOD calculation for supporting texture mip-map and cube-map level selection. The fourth stage can also be divided into two parallel parts. The first handles back-face culling, which rejects back-facing polygons before the raster pipeline, the clip-less edge equation, and bounding box finding, with cost-effective design in mind. The other is stage 2 of the barycentric coefficient (vertex level) or LOD calculation for supporting texture mip-map and cube-map level selection.

TABLE I

THROUGHPUT AND GATE COUNT COMPARISON

At this point, the process enters the pixel-rendering engine stage. When the primitive type is a point, the engine only checks whether or not this point/pixel is in the view volume and then sends the information to an unshading pixel buffer; barycentric coordinates do not need to be calculated for a point. If the primitive type is a triangle, there are two stages. The first involves an 8 × 4 sample tile traversal and sends available tiles to the next stage using the ladder start tile traversal algorithm. The second performs the generation test and the pixel-level barycentric coordinate calculation at 2 pixels per cycle; it carries out the interior traversal to test and generate pixels in the tile efficiently and then outputs the valid pixels.

VI. IMPLEMENTATION RESULT

To propose a cost-effective strategy for pixel rendering in rasterization, this paper develops an efficient rasterization algorithm, a barycentric-based ladder start tile traversal that is division free. The proposed ladder start tile traversal architecture does each tile intersect test in six cycles and performs the tile interior traversal with barycentric tests at 2 pixels per cycle. The proposed architecture includes barycentric coordinate pixel generation and ladder start tile traversal with a gate count of only 130k. Furthermore, it also presents the architecture of a 300 MHz 3-D graphics rasterizer in a 65 nm 1P9M process with a core size of 0.537 mm² and an internal buffer of 3 K. In addition, the rasterizer throughput can achieve up to 50M triangles per second and 600M pixels per second. Table I lists the throughput and gate counts as compared with traversal engines [4]–[7]. Table II lists the performance of rasterization as compared with traversal engines [2]–[4], [7], based on [4, Table II].

TABLE II

PERFORMANCE COMPARISON @ 100 MHz (RASTERIZATION)2
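The peak figures quoted above follow directly from the clock rate and the per-cycle rates. The short Python check below reproduces the arithmetic; reading the 50M triangles per second figure as one six-cycle tile intersect test per triangle is an interpretation made here, not a statement from the paper, and the variable names are illustrative.

```python
CLOCK_HZ = 300_000_000          # 300 MHz rasterizer clock
PIXELS_PER_CYCLE = 2            # interior traversal rate
CYCLES_PER_TILE_TEST = 6        # tile intersect test latency
TILE_WIDTH, TILE_HEIGHT = 8, 4  # sample tile size

# Peak pixel rate: 2 pixels every cycle.
pixels_per_second = CLOCK_HZ * PIXELS_PER_CYCLE

# Peak tile-test rate; equals the triangle rate if each triangle
# needs only one tile intersect test (assumed best case).
tile_tests_per_second = CLOCK_HZ // CYCLES_PER_TILE_TEST

# Cycles to run the pixel generation test over a full tile.
cycles_per_full_tile = (TILE_WIDTH * TILE_HEIGHT) // PIXELS_PER_CYCLE
```

Under these assumptions the values come to 600M pixels per second, 50M tile tests per second, and 16 cycles per fully covered 8 × 4 tile.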

VII. CONCLUSION

This paper develops a novel and efficient rasterization algorithm, a barycentric-based ladder start tile traversal in screen space, which uses the sign of the barycentric coefficient both to decide the start tile of the traversal and to test whether a fragment is inside the triangle, and which is division free. Compared with other tile-traversal algorithms, no extra traversal position and context are produced, which reduces the number of pixel tests and improves the efficiency and stability of the graphic rendering. Moreover, the proposed algorithm can be implemented with the barycentric parameter calculation in parallel with the triangle setup stage and is suitable for a low-cost hardware design. Thus, the rasterization algorithm is efficient and cost effective.

REFERENCES

[1] T. Akenine-Moller and J. Strom, “Graphics for the masses: A hardware rasterization architecture for mobile phones,” ACM Trans. Graph., vol. 22, no. 3, pp. 801–808, Jul. 2003.

[2] J. Pineda, “A parallel algorithm for polygon rasterization,” in Proc. 15th Annu. Conf. Comput. Graph. Interact. Tech., 1988, pp. 17–20.

[3] Y. Ma, X. Wang, M. Zhu, and W. Wan, “Rasterization of geometric primitive in graphics based on FPGA,” in Proc. ICALIP, Nov. 2010, pp. 1211–1216.

[4] X. Wang, F. Guo, and M. Zhu, “A more efficient triangle rasterization algorithm implemented in FPGA,” in Proc. ICALIP, Jul. 2012, pp. 1108–1113.

[5] R. Woo, S. Choi, J. Sohn, S. Song, Y. Bae, and H. Yoo, “A low-power 3-D rendering engine with two texture units and 29-Mb embedded DRAM for 3G multimedia terminals,” IEEE J. Solid-State Circuits, vol. 39, no. 7, pp. 1101–1109, Jul. 2004.

[6] C.-H. Sun, Y.-M. Tsao, K.-H. Lok, and S.-Y. Chien, “Universal rasterizer with edge equations and tile-scan triangle traversal algorithm for graphics processing units,” in Proc. ICME, Jul. 2009, pp. 1358–1361.

[7] J. McCormack and R. McNamara, “Tiled polygon traversal using half-plane edge functions,” in Proc. ACM SIGGRAPH/EUROGRAPHICS Workshop Graph. Hardw., 2000, pp. 15–21.

[8] V. Skala, “Barycentric coordinates computation in homogeneous coordinates,” Comput. Graph., vol. 32, no. 1, pp. 120–127, Feb. 2008.

[9] Z. S. Hakura and A. Gupta, “The design and analysis of cache architecture for texture mapping,” in Proc. 24th Int. Symp. Comput. Archit., Jun. 1997, pp. 108–120.