QuakeTM: Parallelizing a Complex Serial Application Using Transactional Memory Vladimir Gajinov 1,2 , Ferad Zyulkyarov 1,2 ,Osman S. Unsal 1 , Adrián Cristal 1 , Eduard Ayguadé 1,2 , Tim Harris 3 , Mateo Valero 1,2 1 Barcelona Supercomputing Center 2 Universitat Politècnica de Catalunya 3 Microsoft Research


QuakeTM: Parallelizing a Complex Serial Application Using Transactional Memory

Vladimir Gajinov 1,2, Ferad Zyulkyarov 1,2, Osman S. Unsal 1, Adrián Cristal 1, Eduard Ayguadé 1,2, Tim Harris 3, Mateo Valero 1,2

1 Barcelona Supercomputing Center
2 Universitat Politècnica de Catalunya
3 Microsoft Research


Outline

Introduction & motivation

Quake description

Parallelization

Results

Conclusion


Introduction

• Topic of this work: parallelization of the Quake server.
• What is Quake? A first-person shooter game.
• Requirements of a sequential game server:
  – Close to instantaneous control of player actions.
  – High degree of interaction among players in a detailed 3D virtual world.
• A sequential application: CPU processing is the bottleneck.
• Method: OpenMP + Transactional Memory.


Background

• OpenMP:
  – API for shared-memory parallel programming in C/C++ and Fortran.
  – Compiler directives and library routines.
  – Fork-join parallelism.
• Transactional Memory (TM):
  – Concurrency control mechanism.
  – A series of reads and writes to shared memory is handled atomically.
  – When successful, a transaction commits; otherwise it aborts.
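The commit-or-abort behavior can be illustrated with a much smaller mechanism than a full TM system. The sketch below is only an analogy, not an STM: it uses a C11 compare-and-swap retry loop on a single word, where a failed swap plays the role of an abort and a successful swap plays the role of a commit. The function name `optimistic_add` is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Adds delta to *counter optimistically.
   Returns the number of failed attempts ("aborts") before the
   update was finally published ("commit"). */
int optimistic_add(_Atomic long *counter, long delta) {
    assert(counter != NULL);
    int aborts = 0;
    long seen = atomic_load(counter);          /* read the shared state */
    /* If another thread changed the value (or the weak CAS fails
       spuriously), seen is refreshed and the update is retried. */
    while (!atomic_compare_exchange_weak(counter, &seen, seen + delta)) {
        aborts++;
    }
    return aborts;
}
```

A real TM system extends the same idea from one word to an arbitrary set of reads and writes, which is where the instrumentation cost discussed later comes from.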


Motivation

• Just a few TM applications available:
  – STAMP, Haskell STM benchmark, RMS-TM …
  – Clear need for more complex applications.
• Contribution: parallelization of a complex sequential application using TM.
• Question: is it possible to achieve fine-grained locking performance with the coarse-grained parallelization effort?
• Motivation, testing TM programmability:
  – Start with a coarse-grained approach.
  – Test the performance.
  – Determine the problems.
  – Compare with a fine-grained approach.



Quake Organization

Typical client-server architecture

Server
• Maintains the consistency of the game world.
• Handles the coordination among clients.

Clients
• Update graphics.
• Implement user-interface operations.


The Server

• The main server task: computing a new frame.
• We concentrate on the request processing stage.

[Figure: frame execution diagram. Labels: Physics Update, SELECT (Yes/No), Rx, Read, Process, Request Processing, Reply, Tx.]

[Figure: execution breakdown of sequential server execution with 8 connected clients. Chart labels: 2.1%, 87.8%, 3.1%.]


Quake Map

• 3D volume in a 3D coordinate space.
• Represented as a binary space partition tree.
• Fine-grained and inefficient.

[Figure: areanode tree (levels 1 to 5) next to a top view of the map.]

Areanode tree:
– Balanced binary tree.
– Each 3D point in the map must either be in an areanode that is a leaf or in a division plane.
– Areanodes maintain a list of game objects (entities).
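The areanode tree described on this slide can be sketched as follows. This is a simplified illustration, not the server's code: it works on the 2D top view only, the depth of 4 is an assumption chosen to match the five levels in the figure, and the names `areanode`, `create_areanode` and `find_areanode` are hypothetical. An object is attached to the deepest node whose region fully contains its bounding box; a box that crosses a division plane stays at the node owning that plane.

```c
#include <assert.h>
#include <stddef.h>

#define AREA_DEPTH 4   /* assumed depth: 5 levels, 31 nodes */
#define AREA_NODES 32

typedef struct areanode {
    int axis;                      /* 0 = x, 1 = y, -1 = leaf */
    float dist;                    /* position of the division plane */
    struct areanode *children[2];
    int num_entities;              /* stand-in for the entity list */
} areanode;

static areanode nodes[AREA_NODES];
static int num_nodes;

/* Recursively split the region [mins, maxs] along its longer axis. */
areanode *create_areanode(int depth, const float mins[2], const float maxs[2]) {
    assert(num_nodes < AREA_NODES);
    areanode *n = &nodes[num_nodes++];
    n->num_entities = 0;
    n->children[0] = n->children[1] = NULL;
    if (depth == AREA_DEPTH) {     /* leaf: no division plane */
        n->axis = -1;
        n->dist = 0.0f;
        return n;
    }
    n->axis = (maxs[0] - mins[0] > maxs[1] - mins[1]) ? 0 : 1;
    n->dist = 0.5f * (mins[n->axis] + maxs[n->axis]);
    float lo_maxs[2] = {maxs[0], maxs[1]};
    float hi_mins[2] = {mins[0], mins[1]};
    lo_maxs[n->axis] = n->dist;    /* lower half ends at the plane */
    hi_mins[n->axis] = n->dist;    /* upper half starts at the plane */
    n->children[0] = create_areanode(depth + 1, hi_mins, maxs);
    n->children[1] = create_areanode(depth + 1, mins, lo_maxs);
    return n;
}

/* Descend until the box crosses a division plane or a leaf is reached. */
areanode *find_areanode(areanode *n, const float bmins[2], const float bmaxs[2]) {
    while (n->axis != -1) {
        if (bmins[n->axis] > n->dist)      n = n->children[0];
        else if (bmaxs[n->axis] < n->dist) n = n->children[1];
        else break;                        /* box straddles the plane */
    }
    return n;
}
```

Objects near the root are shared by many lookups, which is one reason the tree becomes a point of contention once requests are processed in parallel.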



Parallelization

• Only the request processing stage is parallelized.
• OpenMP to start parallel execution.
• Transactions for synchronization.
• Coarse-grained approach.
• Comparison with the fine-grained implementation of Atomic Quake [PPoPP2009].
• Application characteristics:
  – Coarse-grained: 8 TM blocks, big read & write sets, long transactions, abort rate 35.3%.
  – Fine-grained: 58 TM blocks, abort rate 4.1%.


Shared Data

• Three types of shared data structures:
  – Areanode tree
  – Game objects
  – Message buffers:
    • Common global state buffer
    • Per-player reply buffers
• Most intensive sharing inside the request processing stage.


Client Requests

Two types of requests:

• Connection-related messages:
  – Associated with the connection or disconnection protocols, used when the client wants to join or leave the server game session, or with other facilities that do not affect gameplay.
• Gameplay messages:
  – Most important type of requests.
  – Model the player’s interaction with the game world.
  – The most used: MOVE COMMAND.


Pseudocode for the request processing stage

Sequential:

while (NET_GetPacket ()) {
    // Filter packets
    if (connection related packet) {
        SV_ConnectionlessPacket ();
        continue;
    }

    // game play packets
    for (i=0 ; i<MAX_CLIENTS ; i++) {
        // Do some checking here
        SV_ExecuteClientMessage ();
    }
}

Parallel:

// Stage 1: receive packets serially and queue the gameplay ones
while (NET_GetPacket ()) {
    // Filter packets
    if (connection related packet) {
        SV_ConnectionlessPacket ();
        continue;
    }

    AddPacketToList();
    CopyBuffer();
}

// Stage 2: process the queued packets, one task per packet
#pragma intel omp parallel taskq shared(packetlist, ...)
{
    while (packetlist != NULL) {
        #pragma intel omp task captureprivate(packetlist)
        {
            NET_Message_Init(..);
            // check for packets from connected clients
            for (i=0, cl=svs.clients ; i<MAX_CLIENTS ; i++,cl++) {
                // Do some checking here
                SV_ExecuteClientMessage (cl);
            }
        }
        packetlist = packetlist->next;
    }
}
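The parallel version above relies on the Intel-specific taskq extension. For readers without that compiler, the same one-task-per-packet pattern can be approximated with standard OpenMP tasks. The sketch below is an assumption-laden stand-in, not the QuakeTM code: the `packet` type and `execute_client_message` are hypothetical placeholders, and the stand-in handler touches only its own packet, so no transaction is needed here. Compiled without OpenMP support, the pragmas are ignored and the loop simply runs serially.

```c
#include <assert.h>
#include <stddef.h>

typedef struct packet {            /* hypothetical packet list node */
    int client_id;
    int handled;
    struct packet *next;
} packet;

/* Stand-in for SV_ExecuteClientMessage: modifies only its own packet. */
static void execute_client_message(packet *p) {
    assert(p != NULL);
    p->handled = 1;
}

/* One thread walks the list and spawns a task per packet.
   Returns the number of packets that were queued as tasks. */
int process_packets(packet *list) {
    int count = 0;
    #pragma omp parallel
    #pragma omp single
    {
        for (packet *p = list; p != NULL; p = p->next) {
            count++;
            #pragma omp task firstprivate(p)
            execute_client_message(p);
        }
    }   /* implicit barrier: all tasks have finished here */
    return count;
}
```

In the real server the handlers share the areanode tree, game objects and message buffers, which is why their bodies have to be wrapped in transactions or locks.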


The Move Command

Parameters:
• Player’s origin
• View angles
• Motion indicators
• Time to run

Execution:
• Construct the bounding box.
• Traverse the areanode tree.
• Find objects contained in the bounding box.
• Associate them with the command.
• Simulate the move.
• Remove the player from the old position.
• Add him to the new position.
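The first execution steps of the move command (constructing a bounding box around the player and finding the objects it contains) can be sketched as below. This is a minimal illustration under stated assumptions: the `bbox` type, the `reach` margin and both function names are hypothetical, and the real server derives the box from the move parameters rather than from a fixed margin.

```c
#include <assert.h>

typedef struct {
    float mins[3];
    float maxs[3];
} bbox;

/* Box around the player's origin; reach is a hypothetical margin
   covering the distance the player could travel in this move. */
bbox move_bounds(const float origin[3], float reach) {
    assert(reach >= 0.0f);
    bbox b;
    for (int i = 0; i < 3; i++) {
        b.mins[i] = origin[i] - reach;
        b.maxs[i] = origin[i] + reach;
    }
    return b;
}

/* Two boxes overlap when their extents intersect on every axis. */
int boxes_overlap(const bbox *a, const bbox *b) {
    for (int i = 0; i < 3; i++) {
        if (a->mins[i] > b->maxs[i] || a->maxs[i] < b->mins[i]) {
            return 0;
        }
    }
    return 1;
}
```

Every object whose box overlaps the move box is a candidate for collision, so all of them end up in the read set of the transaction that simulates the move.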


Move Command Execution

[Figure: call graph of Execute Move (ClientPhysics, ClientThink, PmoveInit, AddLinksToPmove, PlayerMove, LinkEntity, PlayerTouch) with four transactions, T1 to T4, marked.]

• ClientPhysics: client’s physics update.
• ClientThink: executes actions registered in previous frames.
• PmoveInit: pmove (player move) structure initialization.
• AddLinksToPmove: determines which entities could be affected by the current move command.
• PlayerMove: constructs a trajectory line and determines the client's final position.
• LinkEntity: re-links the player’s entity to the new position in the areanode tree.
• PlayerTouch: models the influence on the other game objects.


ReachPoints

int reachpoints[NumThreads][x*16];

TM_PURE
void PointReached(int check) {
    reachpoints[ThreadId][check]++;
}

int main () {
    . . .
    TRANSACTION
        PointReached (1);
        statement_1;
        PointReached (2);
    TRANSACTION_END
    . . .
}

Helps to:
• Identify thread private variables.
• Discover where transactions abort.
• Discover causes for the aborts.
• Discover TM false sharing conflicts (conflict management granularity).
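One plausible way to read the reach-point counters is to compare neighboring points. The sketch assumes that increments made inside a TM_PURE function are not rolled back when a transaction aborts, so a counter that is larger than the next one indicates executions that never got past the code in between. The constants, the explicit `tid` parameter and the helper `lost_between` are hypothetical; the TM macros are left out so the sketch stays plain C.

```c
#include <assert.h>

#define NUM_THREADS 8    /* assumed thread count */
#define NUM_POINTS 16    /* assumed number of reach points per thread */

static int reachpoints[NUM_THREADS][NUM_POINTS];

/* Record one pass through reach point `check` on thread `tid`. */
void point_reached(int tid, int check) {
    assert(tid >= 0 && tid < NUM_THREADS);
    assert(check >= 0 && check < NUM_POINTS);
    reachpoints[tid][check]++;
}

/* Executions on thread `tid` that reached point `from`
   but did not reach point `to`. */
int lost_between(int tid, int from, int to) {
    assert(tid >= 0 && tid < NUM_THREADS);
    assert(from >= 0 && from < NUM_POINTS);
    assert(to >= 0 && to < NUM_POINTS);
    return reachpoints[tid][from] - reachpoints[tid][to];
}
```

A nonzero result for points 1 and 2 would point at `statement_1` as the place where the transaction was interrupted.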



Evaluation

• TraceBot:
  – An automatic trace client.
  – Behavior is controlled by a finite state machine.
• VideoClient:
  – Normal graphical client for proving correctness.
  – For trace creation.
• The server runs on one machine, the clients on the other.
  – Server: 8 cores (4 x dual-core 64-bit Intel® Xeon™).
• Frame execution time as a performance measure.
• Prototype version 3.0 of the Intel STM C/C++ compiler.
  – In-place updates.
  – Cache line granularity conflict detection.
  – Transactions validate the read set at commit time and, if necessary, during the read operation.
  – Function annotations: tm_callable, tm_pure and tm_unknown.
  – Closed nesting: flattening.


Results - Normalized average frame execution times (coarse)

[Figure: normalized execution time vs. number of clients (1, 2, 4, 8, 16) for the serial, global_lock and TM_coarse configurations.]

The baseline is always the average frame execution time of the sequential server for the respective number of clients.

• TM version overhead: 3.5x – 6x.
• More than 85% of the time is spent in critical sections.
• Overhead is too high.


Results - performance of coarse-grained configurations

[Figure: Comparative performance of parallel configurations. Time [ms] vs. number of threads (1, 2, 4, 8) for 1, 2, 4, 8 and 16 clients; series: global_lock, TM_coarse.]

[Figure: Transactional server running with 16 clients (speedup & scalability). Speedup of TM_coarse with 2, 4 and 8 threads; average frame time [ms] vs. threads (1, 2, 4, 8) for global_lock and TM_coarse.]


Transactional statistics – coarse-grained

TM server running with 8 threads.

Clients  Transactions  Aborts  Abort rate [%]  Set     Mean [KB]  Max [KB]  Total [MB]
1        34754         0       0.0             Reads   3.0        104       105
                                               Writes  0.6        17        20
2        95980         1970    2.1             Reads   2.8        863       263
                                               Writes  0.6        164       55
4        179241        10820   6.0             Reads   3.4        1413      570
                                               Writes  0.6        269       108
8        364305        76560   21.0            Reads   4.2        1478      1207
                                               Writes  0.8        251       216
16       524561        184992  35.3            Reads   5.1        1704      1725
                                               Writes  0.9        262       296

The abort rate is significant.
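The abort rate column in the table is consistent with the number of aborts divided by the number of transactions. The helper below is a small sketch of that arithmetic; the function name is hypothetical, and the interpretation of the column is inferred from the reported numbers rather than stated on the slide.

```c
#include <assert.h>

/* Abort rate in percent, taken as aborts divided by transactions. */
double abort_rate_percent(long transactions, long aborts) {
    assert(transactions > 0);
    assert(aborts >= 0);
    return 100.0 * (double)aborts / (double)transactions;
}
```

For the 16-client row, 184992 aborts out of 524561 transactions gives roughly 35.3%, matching the table.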


The Overhead Breakdown

Multithreaded execution: 8 threads, 16 clients.

TM block  Total [10^9 cycles]  Instrumentation time        Abort overhead              Abort rate [%]
                               [10^9 cycles]  [%]          [10^9 cycles]  [%]
1         13.5                 10.3           75.8         3.3            24.2         19.5
2         9.5                  9.0            94.1         0.6            5.9          18.0
3         17.2                 15.1           87.9         2.1            12.1         52.7
4         11.6                 10.9           94.3         0.7            5.7          22.4
5         5.9                  3.2            53.7         2.8            46.3         61.1
overall   57.9                 48.5           83.8         9.4            16.2         35.2

• Profiling possibilities are limited.
• The TM instrumentation overhead appears to be the more important factor.


Results - Normalized average frame execution times (fine)

[Figure: normalized execution time vs. number of clients (1, 2, 4, 8, 16) for the serial, lock_fine and TM_fine configurations.]

• TM version overhead: 2.4x – 3x.


Results - performance of fine-grained configurations

[Figure: Comparative performance of parallel configurations. Time [ms] vs. number of threads (1, 2, 4, 8) for 1, 2, 4, 8 and 16 clients; series: lock_fine, TM_fine.]

[Figure: Transactional server running with 16 clients (speedup & scalability). Speedup of lock_fine and TM_fine with 2, 4 and 8 threads; average frame time [ms] vs. threads (1, 2, 4, 8) for global_lock, lock_fine, TM_coarse and TM_fine.]


Transactional statistics – fine-grained

TM server running with 8 threads.

Clients  Transactions  Aborts  Abort rate [%]  Set     Mean [B]  Max [B]  Total [MB]
1        190206        0       0.0             Reads   65.1      58511    12
                                               Writes  5.2       20102    1
2        367118        826     0.2             Reads   66.0      62728    25
                                               Writes  5.7       24397    2
4        655020        4165    0.6             Reads   83.7      80275    55
                                               Writes  8.2       39726    5
8        1439874       20593   1.4             Reads   102.5     102470   145
                                               Writes  9.6       57552    14
16       3226759       131814  4.1             Reads   133.3     231593   192
                                               Writes  15.5      211651   22



QuakeTM Characteristics

• 27,600 lines of code.
• 49 files.
• Configurable with macros:
  – Synchronization, granularity, nesting, TM implementation.
• Coarse-grained setup:
  – 8 critical regions (TM or global lock).
• Fine-grained setup:
  – 58 critical regions (TM or fine-grained locks).
• Available at www.bscmsrc.eu


Conclusion

• The transactional overhead is excessive:
  – 6x slowdown
  – 35.3% abort rate
• A coarse-grained approach is not a good option for the current STM systems.
• Significant programmer time investment (10 man-months).
• A fine-grained approach may be the only solution.


Questions?

Thank you!

Download QuakeTM

www.bscmsrc.eu


Intel Compiler

• Single lock atomicity semantics and weak atomicity guarantees.
  – Strongly atomic semantics, where non-transactional accesses are treated as implicit single-operation transactions.


Atomic Quake

• The main objective was to evaluate the effort of replacing locks with transactions.

• The lock parallelization is not block structured, which required code reorganization to adapt to the TM model.

• The second problem was avoiding I/O operations inside transactions, which is not an issue in a lock-based system.

• Finally, a big fraction of the development time was spent understanding how locks are associated with variables and getting to grips with the locking strategy.


Atomic Quake 2

• Thread private data: call to get_specific.
• Condition variables: no retry.
• I/O in transactions: tm_pure.
• Proposition for error handling:
  – When an error happens, commit the transaction and handle the error outside the atomic block.
• Privatization examples:
  – Custom memory manager allocates a block of memory for string operations.
• TM fits for guarding access to different shared data (separate locks).