Code Layout Optimization for Transaction Processing Workloads

2006/05/29

KINS

Kyuhwan Kim

Alex Ramirez, Luiz André Barroso, Kourosh Gharachorloo, Robert Cohn, Josep Larriba-Pey, P. Geoffrey Lowney, and Mateo Valero

Introduction

OLTP (OnLine Transaction Processing)

A form of transaction processing conducted over a computer network.

Examples: electronic banking, order processing, e-commerce.

A large number of clients continually access and update small portions of the database through short-running transactions.

Large memory stall time, caused by large instruction and data footprints and high communication miss rates.

Introduction (cont.)

Code Layout Optimization

Large applications have a particular problem: a lot of instructions.

The entire application cannot be held on-chip at any one time.

The processor stalls while waiting to fetch new instructions from memory.

Holding more useful instructions on-chip improves performance.

Outline

Introduction

Code Layout Optimizations

Methodology

Behavior of the Database Application in Isolation

Combined Database Application and O/S Behavior

Conclusion

Code Layout Optimizations

Spike

DTKS tool for performing code optimization after linking.

Profile-driven optimization.

Three parts of the Spike optimizer algorithm:

1. Basic Block Chaining

2. Fine-Grain Procedure Splitting

3. Procedure Ordering

Basic Block Chaining

Definition

Order the basic blocks within a procedure.

Algorithm

Simple greedy algorithm

1. Sort flow edges by weight.

2. Chain the two blocks joined by the heaviest remaining edge.

Gain

Improves instruction cache behavior.
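The two greedy steps above can be sketched in a few lines of Python. This is a minimal illustration, not Spike's implementation: the function name `chain_basic_blocks`, the edge-dictionary input, and the four-block example graph are all assumptions made for the example. An edge is used only when its source is currently the last block of a chain and its destination is the first block of a different chain, so every block keeps at most one layout predecessor and one layout successor.

```python
def chain_basic_blocks(edges):
    """Greedy basic block chaining (illustrative sketch).

    edges: dict mapping (src_block, dst_block) -> profiled edge weight.
    Returns a list of chains, each a list of block names in layout order.
    """
    blocks = sorted({b for edge in edges for b in edge})
    chain_of = {b: [b] for b in blocks}  # every block starts as its own chain

    # Step 1: sort flow edges by weight, heaviest first.
    by_weight = sorted(edges.items(), key=lambda kv: kv[1], reverse=True)

    # Step 2: chain the two blocks of each edge when that is still possible.
    for (src, dst), _weight in by_weight:
        head, tail = chain_of[src], chain_of[dst]
        if head is not tail and head[-1] == src and tail[0] == dst:
            head.extend(tail)
            for b in tail:
                chain_of[b] = head

    # Collect each distinct chain once, in a stable order.
    chains, seen = [], set()
    for b in blocks:
        chain = chain_of[b]
        if id(chain) not in seen:
            seen.add(id(chain))
            chains.append(chain)
    return chains


# Hypothetical control-flow graph: A branches to B (hot) or C (cold), both reach D.
example_edges = {("A", "B"): 6, ("A", "C"): 4, ("B", "D"): 6, ("C", "D"): 4}
example_chains = chain_basic_blocks(example_edges)
```

In this made-up graph the hot path A, B, D becomes one fall-through run, and the colder block C is left in a chain of its own.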

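Ex) Basic Block Chaining

[Figure: control-flow graph of basic blocks A1 to A8, shown before and after chaining. Legend: unconditional branch / fall-through edge; conditional branch edge; node weight (e.g. A1 has weight 10); branch probability (e.g. 0.6 / 0.4). Node weights shown: 10, 10, 10, 6, 4, 2.4, 7.6, 10. Branch probabilities shown: 0.6 / 0.4 and 0.4 / 0.6.]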

Fine-Grain Procedure Splitting

Definition

Divide each chain into multiple code segments, which become new procedures.

Algorithm

Fine-grain: end a segment at each unconditional branch or return. (studied only)

Hot/cold: split the procedure into a hot part and a cold part. (currently available)

Gain

Gives the procedure ordering algorithm an extra degree of flexibility.
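The fine-grain rule (cut after every unconditional branch or return) can be illustrated with a short Python sketch. The representation is an assumption made for the example: each block is a `(name, terminator)` pair, and the terminator labels `"cond"`, `"fall"`, `"br"`, and `"ret"` are hypothetical names, not Spike's.

```python
def split_chain(chain):
    """Split one chain of basic blocks into fine-grain segments (sketch).

    chain: list of (block_name, terminator) pairs in layout order, where
    terminator is "cond", "fall", "br" (unconditional branch) or "ret".
    A segment ends after a block that cannot fall through to its successor.
    """
    segments, current = [], []
    for name, terminator in chain:
        current.append(name)
        if terminator in ("br", "ret"):
            segments.append(current)
            current = []
    if current:  # trailing blocks with no closing branch or return
        segments.append(current)
    return segments


# Hypothetical chain: control never falls from b2 into b3, or from b3 into b4.
example_chain = [("b1", "cond"), ("b2", "br"), ("b3", "ret"),
                 ("b4", "fall"), ("b5", "ret")]
example_segments = split_chain(example_chain)
```

Each resulting segment can then be treated as a separate procedure by the ordering step, since no segment depends on falling through into the next one.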

Ex) Fine-Grain Procedure Splitting

[Figure: one chain split into Procedure 1 to Procedure 4. Procedure 1 ends at an unconditional branch; Procedures 2, 3, and 4 each end at a subroutine return (RET).]

Procedure Ordering

Definition

Place related procedures near one another.

Algorithm

1. Build the call graph and weight each edge by its number of calls.

2. Select the most heavily weighted edge and merge its two nodes.

3. When merging, use the edge weights of the original graph to decide the order of the merged procedures.

4. Iterate until the graph is reduced to a single node.

Gain

Improves instruction cache behavior.
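The four steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than Spike's code: the function name `order_procedures` and the tie-breaking are hypothetical. The example graph uses the five procedures A to E and the weights 4, 10, 8, 1, 3, 1 from this deck's example, but which edge carries which weight is my reading of the flattened figure and should be treated as an assumption.

```python
from itertools import combinations


def order_procedures(call_edges):
    """Greedy call-graph merging (illustrative sketch).

    call_edges: dict mapping (caller, callee) -> number of calls.
    Returns one layout order containing every procedure.
    """
    # Step 1: undirected weights of the ORIGINAL call graph.
    weight = {}
    for (u, v), w in call_edges.items():
        key = frozenset((u, v))
        weight[key] = weight.get(key, 0) + w
    chains = [[p] for p in sorted({p for edge in call_edges for p in edge})]

    def between(c1, c2):
        # Weight of a merged edge: sum of original edges crossing both groups.
        return sum(weight.get(frozenset((a, b)), 0) for a in c1 for b in c2)

    # Step 4: iterate until a single node (one chain) remains.
    while len(chains) > 1:
        # Step 2: select the most heavily weighted edge in the current graph.
        i, j = max(combinations(range(len(chains)), 2),
                   key=lambda ij: between(chains[ij[0]], chains[ij[1]]))
        left, right = chains[i], chains[j]
        # Step 3: pick the orientation whose junction has the heaviest
        # original edge, so the closest pair of procedures ends up adjacent.
        options = [(l, r) for l in (left, left[::-1])
                   for r in (right, right[::-1])]
        l, r = max(options,
                   key=lambda lr: weight.get(frozenset((lr[0][-1], lr[1][0])), 0))
        chains = [c for k, c in enumerate(chains) if k not in (i, j)] + [l + r]
    return chains[0]


# Edge endpoints below are inferred from the slide figure (assumption).
example_calls = {("A", "B"): 4, ("A", "C"): 10, ("B", "D"): 8,
                 ("B", "C"): 3, ("C", "E"): 1, ("D", "E"): 1}
example_order = order_procedures(example_calls)
```

Under these assumptions the merges follow the same sequence as the example: A with C (10), B with D (8), then the two groups (4 + 3 = 7), joined so that B sits next to A. E is attached last with total weight 2; because its two original edges tie at weight 1, it may land next to either C or D, so the result can be E,D,B,A,C or an equally weighted variant, possibly mirrored.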

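Ex) Procedure Ordering

[Figure: call graph with procedures A, B, C, D, E and edge weights 4, 10, 8, 1, 3, 1, reduced step by step: merge A and C (the edge from B to A,C becomes 7); merge B and D; merge the two groups into D,B,A,C (remaining weight to E is 2); final order E,D,B,A,C.]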


Methodology

OLTP Workload

TPC-B, Oracle 8.0.4

Collecting Profiles

OLTP profile data: Pixie.

Kernel profile: Tru64 Unix kprofile tool.

Hardware and Simulation Platforms

SimOS-Alpha environment


Behavior of the DB App. Only

Instruction cache misses

[Figure: X-axis: cache line size; Y-axis: number of instruction cache misses; series: baseline OLTP binary and optimized OLTP binary.]

Reduction of misses is 55~65%.

Experiment (cont.)

Impact of the different code layout optimizations

Procedure ordering increases cache misses.

The largest benefit comes from basic block chaining.

Procedure ordering after splitting improves performance further.

Experiment (cont.)

Sequentially executed instructions

With the optimized binary, the number of sequentially executed instructions rises from 7.3 to over 10.

Temporal locality

Metric: number of instructions reused before eviction.

The optimized binary increases the number of instructions reused.


Behavior of Combined DB App. & OS

Instruction cache misses

Reduction of misses is 45~60%, compared with 55~65% for the app. in isolation.


Experiment (cont.)

Interference between App. and OS

The majority of app. misses arise from self-interference.

The kernel interferes very little with itself.


Conclusion

Profile-driven compiler optimization improves code layout in OLTP workloads.

App. in isolation: instruction cache misses are reduced by 55~65%.

With the OS: instruction cache misses are reduced by 45~60%.

Overall, these optimizations yield a performance improvement of 1.33 times.
