Learn iOS Game Optimization. Ultimate Guide by Dmitriy Vovk

Learn iOS Game Optimization. Ultimate Guide

by Dmitriy Vovk

Want to achieve the same level of technology speed? Welcome!

Image is used without any permissions

General Recommendations

What you might know

• Batch, Batch, Batch! http://ce.u-sys.org/Veranstaltungen/Interaktive%20Computergraphik%20(Stamminger)/papers/BatchBatchBatch.pdf

• Render from one thread only

• Avoid synchronizations:

1. glFlush/glFinish;

2. Querying GL states;

3. Accessing render targets;

Vertex Data Recommendations

What you might know

• Pixel-perfect HSR (Hidden Surface Removal)…

• …but you still need to sort opaque geometry!

• Avoid alpha test. Use alpha blending instead.

What you might not know

• HSR still requires vertices to be processed!

• …thus don’t forget to cull your geometry on CPU!

• Prefer stencil test over scissor test.

– Stencil test is performed in hardware on PowerVR GPUs, which results in dramatically better performance.

– A stencil can have any shape, whereas scissor is limited to a rectangle.
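
The CPU-side culling advice above can be sketched as a bounding-sphere test against the view frustum planes. This is a minimal sketch, not the author's code: the `Plane` and `Sphere` types and the `sphere_visible` name are assumptions made for illustration, and the planes are assumed to have unit-length normals pointing into the frustum.

```c
#include <stdbool.h>

/* Plane a*x + b*y + c*z + d = 0 with a unit-length normal that points
   towards the inside of the frustum (assumed convention). */
typedef struct { float a, b, c, d; } Plane;

/* Bounding sphere of a mesh, in the same space as the planes. */
typedef struct { float x, y, z, radius; } Sphere;

/* Returns true when the sphere is at least partly inside every plane,
   i.e. when the mesh still has to be submitted to the GPU. */
bool sphere_visible(const Plane *planes, int plane_count, const Sphere *s)
{
    for (int i = 0; i < plane_count; ++i) {
        float dist = planes[i].a * s->x + planes[i].b * s->y
                   + planes[i].c * s->z + planes[i].d;
        if (dist < -s->radius) {
            return false; /* completely behind this plane: cull it */
        }
    }
    return true;
}
```

Meshes for which `sphere_visible` returns false are never sent to the GPU, so their vertices are never processed.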

What you might not know

• Why no alpha test?!

o Alpha test/discard requires the fragment shader to run before visibility for the current fragment can be determined. This removes the benefits of HSR.

o Even more! If shader code contains discard, then any geometry rendered with this shader will suffer from the alpha test drawbacks. Even if the keyword is inside a condition, the USSE assumes that the condition may be hit.

o Move discard into a separate shader.

o Draw opaque geometry first, then alpha-tested geometry, and alpha-blended geometry last.
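
One way to enforce that draw order is to tag every draw call with its pass and sort before submission. The sketch below assumes a hypothetical `DrawCall` record and `Pass` enum; none of these names come from the slides.

```c
#include <stdlib.h>

/* Passes in the order they should be drawn. */
typedef enum {
    PASS_OPAQUE = 0,
    PASS_ALPHA_TEST = 1,
    PASS_ALPHA_BLEND = 2
} Pass;

/* Hypothetical draw call record; mesh_id stands in for real draw state. */
typedef struct {
    Pass pass;
    int mesh_id;
} DrawCall;

static int compare_pass(const void *lhs, const void *rhs)
{
    const DrawCall *a = lhs;
    const DrawCall *b = rhs;
    return (int)a->pass - (int)b->pass;
}

/* Orders calls as: opaque, then alpha tested, then alpha blended. */
void sort_draw_calls(DrawCall *calls, size_t count)
{
    qsort(calls, count, sizeof(DrawCall), compare_pass);
}
```

Note that `qsort` is not stable, so the order inside one pass is unspecified. If blended geometry needs a particular order, that needs a second sort key.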

What you might know

• Bandwidth matters

1. Use constant color per object, instead of per vertex

2. Simplify your models. Use smaller data types.

3. Use indexed triangles or non-indexed triangle strips

4. Use VBO instead of client arrays

5. Use VAO

What you might not know

– The VAO implementation on iOS 4.0, at least, actually hurt performance

– VBOs are allocated in multiples of the 4 KB page size. Be aware of that. A large number of small VBOs can fragment and waste your memory.
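
To get a feel for the cost of page-sized allocation, the rounding can be modelled in a few lines. This is only a back-of-the-envelope sketch based on the 4 KB figure quoted above; the function names are made up and the real driver may add further overhead.

```c
#include <stddef.h>

#define VBO_PAGE_SIZE 4096u /* 4 KB page size quoted on the slide */

/* Size the allocation is assumed to occupy after rounding up to whole pages. */
size_t vbo_allocated_size(size_t requested)
{
    return (requested + VBO_PAGE_SIZE - 1) / VBO_PAGE_SIZE * VBO_PAGE_SIZE;
}

/* Bytes reserved but never used by the vertex data. */
size_t vbo_wasted_bytes(size_t requested)
{
    return vbo_allocated_size(requested) - requested;
}
```

Under this model, 100 separate VBOs of 96 bytes each reserve 409,600 bytes, while one shared VBO holding the same 9,600 bytes reserves 12,288.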

What you might not know

• Updating your VBO data each frame:

1. glBufferSubData that updates a big part of the original data hurts performance. Try not to update a buffer that is currently in use.

2. glBufferData that completely overwrites the original data is OK. The driver orphans the old data and allocates storage for the new data.

3. glMapBuffer with a triple-buffered VBO is the preferred way to update your data.

4. EXT_map_buffer_range (iOS 6 only) is for when you need to update only a subset of a buffer object.

What you might not know

int bufferID = 0; // initialization

// BUFFER_SIZE is a placeholder for the size in bytes of one VBO
for (int i = 0; i < 3; ++i) // only allocate storage for 3 VBOs, do not upload data
{
    glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[i]);
    glBufferData(GL_ARRAY_BUFFER, BUFFER_SIZE, NULL, GL_DYNAMIC_DRAW);
}

//...

glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[bufferID]);
void* ptr = glMapBufferOES(GL_ARRAY_BUFFER, GL_WRITE_ONLY_OES);
// update data here
glUnmapBufferOES(GL_ARRAY_BUFFER);

++bufferID;
if (bufferID == 3) // cycling through 3 buffers
{
    bufferID = 0;
}

What you might not know

• This scheme gives you the best performance possible: the GPU never blocks the CPU (or vice versa), there are no redundant memcpy operations, and CPU load is lower. The cost is extra memory (note, though, that you need no extra temporary buffer to store your data before sending it to the VBO).

update(1), draw(1), gpuworking(................)

update(2), draw(2), gpuworking(................)

update(3), draw(3), gpuworking(................)
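
The update/draw schedule above boils down to a ring of buffer indices. The sketch below isolates just that bookkeeping so it can be read without any GL calls; `VboRing` and `vbo_ring_next` are illustrative names, not part of OpenGL ES or the original code.

```c
#define VBO_COUNT 3 /* triple buffering, as in the scheme above */

typedef struct {
    int current; /* index of the buffer the CPU writes during this frame */
} VboRing;

/* Returns the buffer index to map for this frame and advances the ring,
   so the next frame writes to a buffer the GPU is no longer reading. */
int vbo_ring_next(VboRing *ring)
{
    int index = ring->current;
    ring->current = (ring->current + 1) % VBO_COUNT;
    return index;
}
```

Each frame would bind and map `vertexBuffer[vbo_ring_next(&ring)]`. By the time an index comes around again, two other frames have been submitted in between.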

What you might not know

• Float is the GPU’s native type

• …that means any other type will be converted to float by the USSE

• …which costs a few additional cycles

• Thus the tradeoff between bandwidth/storage and additional cycles is your choice
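
As an illustration of the bandwidth side of that tradeoff, a value in [-1, 1] (a normal component, for example) can be stored as a normalized 16-bit integer instead of a float, halving its size. The helper names below are assumptions. On the GL side such an attribute would be declared with type `GL_SHORT` and the `normalized` argument of `glVertexAttribPointer` set to `GL_TRUE`.

```c
#include <stdint.h>

/* Packs a value from [-1, 1] into a signed 16-bit integer (2 bytes
   instead of the 4 bytes of a float). Out-of-range input is clamped. */
int16_t pack_snorm16(float v)
{
    if (v > 1.0f) v = 1.0f;
    if (v < -1.0f) v = -1.0f;
    float scaled = v * 32767.0f;
    /* round to nearest by adding half a step before truncation */
    return (int16_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}

/* CPU-side inverse, useful for checking the quantization error. */
float unpack_snorm16(int16_t s)
{
    return (float)s / 32767.0f;
}
```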

What you might know

• Use interleaved vertex data

– Align each vertex attribute to a 4-byte boundary

What you might not know

• Why do you have to do this?!

– You don’t. The driver can do it for you

– …resulting in slower performance.
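
A possible interleaved layout that keeps every attribute on a 4-byte boundary is sketched below. The attribute set and the `Vertex` name are assumptions chosen for illustration. The compile-time checks make the alignment rule explicit, so the driver has nothing to fix up.

```c
#include <stddef.h>
#include <stdint.h>

/* One interleaved vertex. Every attribute starts at a multiple of 4 bytes. */
typedef struct {
    float   position[3]; /* 12 bytes, offset 0 */
    int16_t normal[3];   /*  6 bytes, offset 12 */
    int16_t pad0;        /*  2 bytes of explicit padding up to offset 20 */
    uint8_t color[4];    /*  4 bytes, offset 20 */
    int16_t uv[2];       /*  4 bytes, offset 24 */
} Vertex;

_Static_assert(offsetof(Vertex, normal) % 4 == 0, "normal must be 4-byte aligned");
_Static_assert(offsetof(Vertex, color) % 4 == 0, "color must be 4-byte aligned");
_Static_assert(offsetof(Vertex, uv) % 4 == 0, "uv must be 4-byte aligned");
_Static_assert(sizeof(Vertex) == 28, "unexpected padding in Vertex");
```

`sizeof(Vertex)` is then the stride and `offsetof(Vertex, ...)` the per-attribute offset when the attribute pointers are set up.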

What you might know

• Split your vertex data into two parts:

1. Static VBO – one that will never be changed

2. Dynamic VBO – one that needs to be updated frequently

• Split your vertex data into a few VBOs when a few meshes share the same set of attributes

Texture Data Recommendations

What you might know

• Bandwidth matters

1. Use lower precision formats, e.g. RGB565

2. Use PVRTC compressed textures

3. Use atlases

4. Use mipmaps. They improve texture cache efficiency and quality.
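
As a small illustration of the first item, converting an 8-bit-per-channel color to RGB565 halves the storage of a 32-bit texel. The helper below is a sketch with an assumed name. The bit layout (red in the top five bits) is the one expected by `GL_UNSIGNED_SHORT_5_6_5`.

```c
#include <stdint.h>

/* Packs 8-bit R, G, B into one 16-bit RGB565 texel:
   bits 15..11 red, bits 10..5 green, bits 4..0 blue. */
uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

The low bits of each channel are simply dropped here. A real asset pipeline might add dithering to hide the banding.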

What you might not know

• iOS OpenGL ES drivers from version 4.0 up to (but not including) 6.0 have a bug: they ALWAYS reserve memory for mipmaps, regardless of whether you requested them or not. And you don’t need mipmaps for 2D graphics.

• …but there is one workaround: make your textures NPOT (non-power of two).
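
To estimate how much memory a reserved mip chain costs, the sizes of all levels can be summed. This is a rough sketch for uncompressed textures. `mip_chain_bytes` is an illustrative helper, and the driver's real allocation may differ.

```c
#include <stddef.h>

/* Total bytes of a full mip chain (base level down to 1x1)
   for an uncompressed texture. Returns 0 for an empty texture. */
size_t mip_chain_bytes(unsigned width, unsigned height, unsigned bytes_per_pixel)
{
    if (width == 0 || height == 0) {
        return 0;
    }
    size_t total = 0;
    for (;;) {
        total += (size_t)width * height * bytes_per_pixel;
        if (width == 1 && height == 1) {
            break;
        }
        if (width > 1) width /= 2;
        if (height > 1) height /= 2;
    }
    return total;
}
```

For a 1024x1024 RGBA8 texture this gives 5,592,404 bytes against 4,194,304 for the base level alone, so the unwanted mipmaps add roughly one third.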

What you might not know

• NPOT textures work only with the GL_CLAMP_TO_EDGE wrap mode

• POT textures are preferable, they give you the best performance possible

• Use NPOT textures with dimensions that are multiples of 32 pixels for best performance

• The driver will pad the data of your NPOT texture to match the size of the closest POT values.
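
The padding cost can be estimated with a short helper. The sketch assumes that "closest POT" means the next power of two that is greater than or equal to each dimension. The function names are illustrative.

```c
#include <stddef.h>

/* Smallest power of two >= v (assumes 1 <= v <= 2^31). */
unsigned next_pot(unsigned v)
{
    unsigned p = 1;
    while (p < v) {
        p <<= 1;
    }
    return p;
}

/* Bytes added when a w x h texture is padded to POT dimensions. */
size_t npot_padding_bytes(unsigned w, unsigned h, unsigned bytes_per_pixel)
{
    size_t used = (size_t)w * h * bytes_per_pixel;
    size_t padded = (size_t)next_pot(w) * next_pot(h) * bytes_per_pixel;
    return padded - used;
}
```

Under that assumption a 480x320 RGBA8 texture is padded to 512x512, which adds 434,176 bytes to the 614,400 bytes of actual image data.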

What you might not know

• Why do I have to use PVRTC? It looks ugly!

1. PVRTC provides great compression, resulting in smaller textures, better cache utilization, saved bandwidth and decreased power consumption

2. PVRTC stores pixel data in the GPU’s native order, i.e. BGRA instead of RGBA

What you might not know

• BGRA vs RGBA

1. RGBA:

• Requires pixel data to be shuffled by driver into BGRA

• Has options for RGB422, RGB565, RGBA4444, RGBA5551

2. BGRA:

• Stores data in GPU’s native order

• Has option only for BGRA8888 for upload and BGRA888, BGRA5551, BGRA4444 for ReadPixels
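
The shuffle that the driver has to perform for RGBA data amounts to swapping the red and blue bytes of every texel. The sketch below shows that swap done once on the CPU, for example in an offline asset pipeline. The function name is an assumption, and whether doing it offline pays off depends on the upload path.

```c
#include <stddef.h>
#include <stdint.h>

/* Converts tightly packed 8-bit RGBA texels to BGRA in place
   by swapping the first and third byte of each texel. */
void rgba_to_bgra(uint8_t *pixels, size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; ++i) {
        uint8_t *p = pixels + i * 4;
        uint8_t r = p[0];
        p[0] = p[2]; /* blue moves to the front */
        p[2] = r;    /* red moves to the third byte */
    }
}
```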

What you might not know

• Prefer OES_texture_half_float over OES_texture_float

• A texture read fetches only 32 bits per texel, thus an RGBA float texture (128 bits per texel) results in 4 texture reads

What you might know

• Prefer multitexturing instead of multiple passes

• Configure texture parameters before feeding image data to driver

What you might not know

• Texture uploading to the GPU is a mess!

• Usual way to do this:

1. Load texture to a temporary buffer in RAM

2. Feed this buffer to glTexImage2D

3. Draw!

• Looks simple and fast, right?

What you might not know

• …NO!

void* buf = malloc(TEXTURE_SIZE); // 4 MB for an RGBA8 1024x1024 texture
LoadTexture(textureName, buf); // placeholder loader, assumed to decode the file into buf
glBindTexture(GL_TEXTURE_2D, textureID);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, buf);
// buf is copied into an internal buffer created by the driver (that's obvious)
free(buf); // because the buffer can be freed immediately after glTexImage2D
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);
// the driver will do some additional work to fully upload the texture the first time it is actually used!

• Textures are fully uploaded only when they are used for the first time. So draw them off screen immediately after glTexImage2D

• A lot of redundant work!

What you might not know

• Jedi way to upload textures:

// requires <sys/mman.h>; fileHandle is an open file descriptor
void* ptr = mmap(NULL, TEXTURE_SIZE, PROT_READ, MAP_PRIVATE, fileHandle, 0); // file mapping
glBindTexture(GL_TEXTURE_2D, textureID);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, ptr);
// the mapped data is copied into an internal buffer created by the driver
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);
// the driver will do some additional work to fully upload the texture the first time it is actually used!
munmap(ptr, TEXTURE_SIZE);

• File mapping does not copy your file data into RAM up front! It loads file data page by page, as it is accessed.

• Thus we eliminated one redundant copy, dramatically decreased texture upload time and decreased memory fragmentation
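
The file-mapping part can be wrapped in a small helper with error handling, which the slide code leaves out. This is a sketch under POSIX assumptions. `MappedFile`, `map_file_readonly` and `unmap_file` are made-up names, and the texture file is assumed to contain raw texel data with no header.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {
    void  *data; /* NULL when the mapping failed */
    size_t size; /* size of the mapping in bytes */
} MappedFile;

/* Maps a whole file read-only. Pages are loaded lazily on first access. */
MappedFile map_file_readonly(const char *path)
{
    MappedFile result = { NULL, 0 };
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return result;
    }
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        void *ptr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (ptr != MAP_FAILED) {
            result.data = ptr;
            result.size = (size_t)st.st_size;
        }
    }
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return result;
}

/* Releases the mapping; safe to call on a failed mapping. */
void unmap_file(MappedFile *file)
{
    if (file->data != NULL) {
        munmap(file->data, file->size);
        file->data = NULL;
        file->size = 0;
    }
}
```

The `data` pointer would be passed to glTexImage2D in place of a malloc'ed buffer, and `unmap_file` called once the upload has been issued.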

What you might not know

• Always call glClear at the beginning of the frame…

• …and use EXT_discard_framebuffer at the end.

• PowerVR GPUs have a fast on-chip depth buffer for each tile. If you forget to clear/discard the depth buffer, its contents have to be copied between that on-chip memory and system memory

Shaders Best Practices

What you might know

• Be wise with precision hints

• Avoid branching

• Eliminate loops

• Avoid discard. If you have to use it, place the discard instruction as early as possible to avoid useless computations

What you might not know

• Code inside a dynamic branch (one whose condition is evaluated against a value calculated in the shader) will be executed anyway, and its result is then thrown away if the condition is false

What you might not know

• highp – 32-bit floating point value

• mediump – 16-bit floating point value in the range [-65520, 65520]

• lowp – 10-bit fixed point value in the range [-2, 2] with a step of 1/256

• Try to give the same precision to all your operands, because conversion takes some time
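
To see what the lowp range and step mean in practice, the quantization can be emulated on the CPU. This is only a model built from the numbers on the slide (range [-2, 2], step 1/256), not the hardware's actual rounding behaviour. `quantize_lowp` is an illustrative name.

```c
/* Number of lowp steps per unit: the slide quotes a step of 1/256. */
#define LOWP_STEPS_PER_UNIT 256

/* Clamps v to [-2, 2] and rounds it to the nearest multiple of 1/256. */
float quantize_lowp(float v)
{
    if (v > 2.0f) v = 2.0f;
    if (v < -2.0f) v = -2.0f;
    float scaled = v * LOWP_STEPS_PER_UNIT;
    /* round to nearest by adding half a step before truncation */
    int steps = (int)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
    return (float)steps / LOWP_STEPS_PER_UNIT;
}
```

Colors and normalized vectors survive this well. Texture coordinates and positions generally do not, because anything outside [-2, 2] is clamped.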

What you might not know

• highp values are calculated on a scalar processor on USSE1 only:

highp vec4 v1, v2;
highp float s1, s2;

// Bad
v2 = (v1 * s1) * s2;
// the scalar processor executes v1 * s1 (4 operations), and then the result is
// multiplied by s2 on the scalar processor again (4 additional operations)

// Good
v2 = v1 * (s1 * s2);
// s1 * s2 is 1 operation on the scalar processor; result * v1 is 4 operations on the scalar processor
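
The operation counts in those comments can be checked with a small CPU-side model that counts scalar multiplications. It is written in C rather than GLSL so that it can run anywhere. The `Vec4` type, the counter and the function names are all illustrative.

```c
typedef struct { float x, y, z, w; } Vec4;

static int g_mul_count = 0; /* number of scalar multiplications performed */

static float mul(float a, float b)
{
    ++g_mul_count;
    return a * b;
}

/* Vector-by-scalar product: 4 scalar multiplications. */
static Vec4 vec4_scale(Vec4 v, float s)
{
    Vec4 r = { mul(v.x, s), mul(v.y, s), mul(v.z, s), mul(v.w, s) };
    return r;
}

/* (v * s1) * s2: two vector-by-scalar products, 8 multiplications. */
Vec4 scale_bad(Vec4 v, float s1, float s2)
{
    return vec4_scale(vec4_scale(v, s1), s2);
}

/* v * (s1 * s2): one scalar product plus one vector product, 5 multiplications. */
Vec4 scale_good(Vec4 v, float s1, float s2)
{
    return vec4_scale(v, mul(s1, s2));
}
```

Both functions return the same vector for these inputs. Only the number of scalar operations differs: 8 against 5.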

Hardware features

What you might know

• Typical CPU found in iOS devices:

1. ARMv7 architecture

2. Cortex A8 / Cortex A9 / custom Apple cores

3. 600 – 1300 MHz

4. 1-2 cores

5. Thumb-2 instruction set

What you might not know

• ARMv7 has no hardware support for integer division

• VFPv3 FPU / VFPv4 on Apple A6 (rumored)

• NEON SIMD engine

• Unaligned access is done in software on Cortex A8. That means it is a hundred times slower

• Cortex A8 is an in-order CPU. Cortex A9 and later are out-of-order
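
Without a hardware divider, integer division in a hot loop is worth replacing where the divisor is known in advance. The sketch below shows two common replacements. Compilers usually apply them to constant divisors on their own, so this mainly illustrates what to aim for when the compiler cannot. The function names are assumptions.

```c
#include <stdint.h>

/* Exact x / 255 for any x in [0, 65535], using one multiply and one shift.
   Useful for 8-bit color blending, where products of two bytes are divided by 255. */
uint32_t div255(uint32_t x)
{
    return (x * 0x8081u) >> 23;
}

/* Division by a power of two (1 << shift) is a plain shift. */
uint32_t div_pow2(uint32_t x, unsigned shift)
{
    return x >> shift;
}
```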

What you might not know

• The Cortex A9 core has a full VFPv3 FPU, while Cortex A8 has VFPLite. That means that float operations take 1 cycle on A9 and 10 cycles on A8!

What you might not know

• NEON – 16 registers, each 128 bits wide. Supports operations on 8-, 16-, 32- and 64-bit integers and 32-bit float values

• NEON can be used for:

– Software geometry instancing;

– Skinning on ES 1.1;

– As a general vertex processor;

– Other, typical, applications for SIMD.

What you might not know

• The USSE1 architecture is scalar, while NEON is vector by nature. Move your vertex processing from the GPU to the CPU to speed up calculations*

• ???????

• PROFIT!!!111

• *NOTE. That doesn’t apply to USSE2 hardware

What you might not know

• The weakest side of mobile GPUs is fill rate. Fill rate is quickly killed by blending, and 2D games are heavy on this. The PowerVR USSE engine doesn’t care what it processes, vertices or fragments. Moving your vertex processing to the CPU (NEON) leaves more room for fragment processing. It will have more effect on USSE1, which is scalar hardware.

What you might not know

• There are 3 ways to use the NEON engine in your code:

1. Intrinsics

1.1 GLKMath

2. Handwritten NEON assembly

3. Autovectorization. Add -mllvm -vectorize -mllvm -bb-vectorize-aligned-only to Other C Flags in project settings and you are ready to go.

What you might not know

• Intrinsics:

What you might not know

• Assembly:

What you might not know

• Summary:

• Intrinsics got me 25% speedup over assembly. Let’s see the code!

• Note that the speed of intrinsics code varies from compiler to compiler.

                     Running time, ms   CPU usage, %
Intrinsics           2764               19
Assembly             3664               20
FPU                  6209               25-28
FPU autovectorized   5028               22-24

What you might not know

__attribute__((always_inline)) void Matrix4ByVec4(const float32x4x4_t* __restrict__ mat, const float32x4_t* __restrict__ vec, float32x4_t* __restrict__ result)

{

(*result) = vmulq_n_f32((*mat).val[0], (*vec)[0]);

(*result) = vmlaq_n_f32((*result), (*mat).val[1], (*vec)[1]);

(*result) = vmlaq_n_f32((*result), (*mat).val[2], (*vec)[2]);

(*result) = vmlaq_n_f32((*result), (*mat).val[3], (*vec)[3]);

}
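
For comparison, a plain scalar version of the same matrix-by-vector product is sketched below. It is not the author's "FPU" benchmark code, which is not shown in the slides, only a reference for what the intrinsics compute. It assumes a column-major matrix where `m[c]` plays the role of `val[c]`.

```c
/* Scalar reference: result = m * v for a column-major 4x4 matrix.
   m[c][r] is column c, row r, matching (*mat).val[c] in the NEON version. */
void matrix4_by_vec4_scalar(const float m[4][4], const float v[4], float result[4])
{
    for (int r = 0; r < 4; ++r) {
        result[r] = m[0][r] * v[0]
                  + m[1][r] * v[1]
                  + m[2][r] * v[2]
                  + m[3][r] * v[3];
    }
}
```

Each loop iteration computes one lane of the result. The NEON version computes all four lanes at once with one multiply and three multiply-accumulate instructions.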

What you might not know

__attribute__((always_inline)) void Matrix4ByMatrix4(const float32x4x4_t* __restrict__ m1, const float32x4x4_t* __restrict__ m2, float32x4x4_t* __restrict__ r)
{
#ifdef INTRINSICS
    // first term of every result column: column 0 of m1 times element 0 of the m2 column
    (*r).val[0] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[0], 0));
    (*r).val[1] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[1], 0));
    (*r).val[2] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[2], 0));
    (*r).val[3] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[3], 0));

    // accumulate the remaining three terms
    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[1], vgetq_lane_f32((*m2).val[0], 1));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[1], vgetq_lane_f32((*m2).val[1], 1));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[1], vgetq_lane_f32((*m2).val[2], 1));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[1], vgetq_lane_f32((*m2).val[3], 1));

    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[2], vgetq_lane_f32((*m2).val[0], 2));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[2], vgetq_lane_f32((*m2).val[1], 2));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[2], vgetq_lane_f32((*m2).val[2], 2));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[2], vgetq_lane_f32((*m2).val[3], 2));

    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[3], vgetq_lane_f32((*m2).val[0], 3));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[3], vgetq_lane_f32((*m2).val[1], 3));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[3], vgetq_lane_f32((*m2).val[2], 3));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[3], vgetq_lane_f32((*m2).val[3], 3));
#endif // INTRINSICS
}

What you might not know

__asm__ volatile

(

"vldmia %6, { q0-q3 } \n\t"

"vldmia %0, { q8-q11 }\n\t"

"vmul.f32 q12, q8, d0[0]\n\t"

"vmul.f32 q13, q8, d2[0]\n\t"

"vmul.f32 q14, q8, d4[0]\n\t"

"vmul.f32 q15, q8, d6[0]\n\t"

"vmla.f32 q12, q9, d0[1]\n\t"

"vmla.f32 q13, q9, d2[1]\n\t"

"vmla.f32 q14, q9, d4[1]\n\t"

"vmla.f32 q15, q9, d6[1]\n\t"

"vmla.f32 q12, q10, d1[0]\n\t"

"vmla.f32 q13, q10, d3[0]\n\t"

"vmla.f32 q14, q10, d5[0]\n\t"

"vmla.f32 q15, q10, d7[0]\n\t"

"vmla.f32 q12, q11, d1[1]\n\t"

"vmla.f32 q13, q11, d3[1]\n\t"

"vmla.f32 q14, q11, d5[1]\n\t"

"vmla.f32 q15, q11, d7[1]\n\t"

"vldmia %1, { q0-q3 } \n\t"

"vmul.f32 q8, q12, d0[0]\n\t"

"vmul.f32 q9, q12, d2[0]\n\t"

"vmul.f32 q10, q12, d4[0]\n\t"

"vmul.f32 q11, q12, d6[0]\n\t"

"vmla.f32 q8, q13, d0[1]\n\t"

"vmla.f32 q8, q14, d1[0]\n\t"

"vmla.f32 q8, q15, d1[1]\n\t"

"vmla.f32 q9, q13, d2[1]\n\t"

"vmla.f32 q9, q14, d3[0]\n\t"

"vmla.f32 q9, q15, d3[1]\n\t"

"vmla.f32 q10, q13, d4[1]\n\t"

"vmla.f32 q10, q14, d5[0]\n\t"

"vmla.f32 q10, q15, d5[1]\n\t"

"vmla.f32 q11, q13, d6[1]\n\t"

"vmla.f32 q11, q14, d7[0]\n\t"

"vmla.f32 q11, q15, d7[1]\n\t"

"vstmia %2, { q8 }\n\t"

"vstmia %3, { q9 }\n\t"

"vstmia %4, { q10 }\n\t"

"vstmia %5, { q11 }"

:

: "r" (proj), "r" (squareVertices), "r" (v1), "r" (v2), "r" (v3), "r" (v4), "r" (modelView)

: "memory", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15"

);

What you might not know

• For a detailed explanation of intrinsics/assembly see: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0491e/CIHJBEFE.html

Contact me

http://www.linkedin.com/in/dvovk/

http://nukecode.blogspot.com/