Using ATLAS for Performance Tuning and Debugging

Sewook Wee and Njuguna Njoroge
Computer Systems Laboratory
Stanford University
http://tcc.stanford.edu/prototypes



Tutorial Set-up

Wireless router access
• SSID: RAMP-DEMO
• Passphrase: rampramp

Team setup (Extreme Programming)
• One member = driver, who codes on his or her laptop
• Rest of team = passengers, who review and help the driver

Server connection
• ssh 10.0.0.2
• Username and password are on your desk

Environment variables check
• Check $BEE2_BOARD, $VACATION, and $DLL

echo $BEE2 $VACATION $DLL

Make sure that your favorite text editor is working properly

VNC viewer
• VNC Viewer executable: http://www.realvnc.com/cgi-bin/download.cgi
• Open VNC in shared mode


Transactionalizing vacation

vacation – Part of TCC group’s STAMP benchmark suite

• STAMP = Stanford’s Transactional Applications for Multi-Processing

• http://stamp.stanford.edu

• Modeled after SPECjbb2000

About vacation …

• Implements travel reservation system powered by database

• Workload consists of clients interacting with DB manager

• Four tables in DB: cars, rooms, flights, and customers
    • The customers table tracks the reservations and total price
    • The tables are implemented as red-black trees


Running vacation …

Let’s run vacation in its original form

ACTION!

%> cd $VACATION
%> make run_seq


vacation pseudocode – main

In vacation.c, function MAIN, starting on line 340:

initializeManager;

initializeClients;

PROFILER_ON;

client_run;

PROFILER_OFF;


vacation pseudocode – client work

In client.c, function client_run, starting on line 109:

for (i = 0; i < numOperation; i++) {
    action = selectAction;
    switch (action) {
        case ACTION_MAKE_RESERVATION:   // 1st case
            for (j = 0; j < numQueries; j++)
                switch (query_type) { … }
        case ACTION_BILL_CUSTOMER:      // 2nd case
        case ACTION_UPDATE_TABLES:      // 3rd case
            for (j = 0; j < numUpdates; j++)
                switch (update_type) { … }
    }
}
...


Verify success of sequential run

Initializing manager... done.

Manager Stats are initialized

Initializing clients... done.

Transactions = 1024

Clients = 1

Transactions/client = 1024

Queries/transaction = 1

Relations = 4096

Query percent = 99

Query range = 4055

Percent user = 80

Running clients... done.

Checking tables... done.

Deallocating memory...

Number of total adds = 24700

Number of total deletes = 56

Number of total queries = 3341

Number of total reservations = 1618

Number of total cancellations = 0

Done.

ACTION!

%> more trace/0/atlas.stdout


Quick overview of the TCC API

TM_PARALLEL(function_ptr, arg_ptr, numThreads)
• function_ptr = pointer to the parallel function
• arg_ptr = pointer to the function’s arguments
• numThreads = for TCC, the number of CPUs

TM_BEGIN(), TM_END()
• Indicate the start and end of a transaction

TM_GET_THREAD_ID(), TM_GET_NUM_THREAD()
• Retrieve the thread’s ID and the number of threads

A high-level language, OpenTM (resembles OpenMP), is in the works
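The call shapes above can be sketched in a small, self-contained example. The macros below are hypothetical stand-ins, not the real tm.h: they run the "threads" one after another so the sketch compiles off ATLAS, whereas the real API runs them in parallel. The names work_t, sum_worker, and parallel_sum are invented for illustration, and the void return type of the worker is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for tm.h (assumption: sequential execution).
 * On ATLAS the real macros start parallel threads and transactions. */
static int tm_thread_id;
static int tm_num_threads;
#define TM_BEGIN()          ((void)0)
#define TM_END()            ((void)0)
#define TM_GET_THREAD_ID()  (tm_thread_id)
#define TM_GET_NUM_THREAD() (tm_num_threads)
#define TM_PARALLEL(fn, arg, n)                                 \
    do {                                                        \
        tm_num_threads = (n);                                   \
        for (tm_thread_id = 0; tm_thread_id < tm_num_threads;   \
             tm_thread_id++)                                    \
            (fn)(arg);                                          \
    } while (0)

#define N_ITEMS 16

/* Illustrative shared work: an input array and a shared running total. */
typedef struct {
    int  data[N_ITEMS];
    long total;
} work_t;

/* Parallel function: each thread handles every n-th element. */
void sum_worker(void *arg)
{
    work_t *w = arg;
    int id = TM_GET_THREAD_ID();
    int n  = TM_GET_NUM_THREAD();

    for (int i = id; i < N_ITEMS; i += n) {
        TM_BEGIN();                 /* start of transaction */
        w->total += w->data[i];     /* update to shared data */
        TM_END();                   /* end of transaction */
    }
}

/* Launch the worker on num_threads threads and return the total. */
long parallel_sum(work_t *w, int num_threads)
{
    assert(num_threads > 0);
    w->total = 0;
    TM_PARALLEL(sum_worker, (void *)w, num_threads);
    return w->total;
}
```

Every transaction in this sketch updates the same shared total, which is the kind of access pattern that produces conflicts between transactions on real hardware.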


Transactionalizing vacation – Step 1

OPEN vacation.c, CHANGE line 362 to:

358 MEMORY_INIT();
359 PROFILER_ON();
360
361 /* Run transactions */
362 TM_PARALLEL(client_run, (void*)clients, global_params[PARAM_CLIENTS]);
363

Note: global_params[PARAM_CLIENTS] = number of processors
• The command-line parsing code sets this value


Transactionalizing vacation – Step 2

OPEN client.c

• ADD #include "tm.h" on line 19

16 #include "client.h"
17 #include "manager.h"
18 #include "reservation.h"
19 #include "tm.h"
20

• ADD int myId = TM_GET_THREAD_ID(); on line 113 and CHANGE line 115 from clients[0] to clients[myId]

112 int i;
113 int myId = TM_GET_THREAD_ID();
114 client_t** clients = (client_t**)(args);
115 client_t* clientPtr = clients[myId];


Transactionalizing vacation – Step 3

Still in client.c, ADD TM_BEGIN(); on line 129

124 for (i = 0; i < numOperation; i++) {
125
126     int r = random_generate(randomPtr) % 100;
127     action_t action = selectAction(r, percentUser);
128
129     TM_BEGIN();
130
131     switch (action) {

Still in client.c, ADD TM_END(); on line 242

239     } /* switch (action) */
240     formattingAndProtocol(&i);
241
242     TM_END();
243

ACTION!

%> make run_par


Profiling File Output Format

While vacation is running, open the sequential run’s stats

ACTION!

%> more trace/0/atlas.log

1-way ATLAS Polling BRAM...

*****************************
Profiling Info from TCC[0/1]
*****************************
TOTAL: 1339369387
PERF: 595385277
BUSY: 592922718
L1_MISS: 2451797
ARBIT: 1703
COMMIT: 9059
SYNC: 0
VIOL: 0
MISC: 0
...
OVFL CYCLE: 592071059
# OVFL: 98
# LRU OVFL: 98
# READ: 246260
# R-MISS: 10793
# WRITE: 108005
# W-MISS: 1866
# Inst.: 271417058
# Trans: 3
# Violation: 0
# ITLBMISS: 595
# DTLBMISS: 4775
# DStorage: 0
# SC: 0
ITLBCYCLE: 546605
DTLBCYCLE: 8177964
DS CYCLE: 0
SC CYCLE: 0
# SYS Inst.: 1554382
# SYS CYCLE: 8724569
# Timeout: 1394
# TimeoutL: 0
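For comparing runs, it can help to pull individual counters out of the profiling output. The following is a minimal sketch that assumes each counter sits on its own line in the form "NAME: value", optionally prefixed with "# ". The function name parse_counter is invented for illustration and is not part of the ATLAS tools.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Parse one "NAME: value" line from the profiling output.
 * A leading "# " is dropped from the name.
 * Returns 1 on success, 0 if the line is not a counter line
 * (for example the banner) or the name does not fit in name[]. */
int parse_counter(const char *line, char *name, size_t cap,
                  unsigned long long *value)
{
    assert(line != NULL && name != NULL && value != NULL);

    const char *colon = strchr(line, ':');
    if (colon == NULL || cap == 0)
        return 0;

    /* Skip the optional '#' marker and any leading blanks. */
    const char *start = line;
    while (*start == '#' || isspace((unsigned char)*start))
        start++;
    if (start >= colon)
        return 0;

    size_t len = (size_t)(colon - start);
    if (len >= cap)
        return 0;

    /* strtoull skips the blank after the colon; end == num means
     * that no digits were found. */
    const char *num = colon + 1;
    char *end;
    unsigned long long v = strtoull(num, &end, 10);
    if (end == num)
        return 0;

    memcpy(name, start, len);
    name[len] = '\0';
    *value = v;
    return 1;
}
```

Feeding the log line by line through this function gives name/value pairs that can be diffed between, say, the 1-processor and 8-processor traces.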


Analyzing scalability of vacation

Look at the reported speedup
• It gets slower when we add more processors!!

Violations dominate PERF time!

ACTION!

%> less trace/8/atlas.log


Reading TAPE-violation report

Let’s look at the violation log file

ACTION!

%> less trace/8/viol.log

Read_PC   Object_Addr  Occurence  Loss     Write_Proc  Line
10001500  100830e0     32         1265341  3           ..//vacation/manager.c:134
10001448  100830e0     29         766816   4           ..//vacation/manager.c:134
10001390  100830e0     30         6446858  1           ..//vacation/manager.c:134
10005f4c  304492e4     3          750669   6           ..//lib/rbtree.c:105
10005f4c  304492e4     3          750669   6           ..//lib/rbtree.c:105

• The report says … lines 132 to 135 of manager.c should be the largest offenders

In manager.c, go to those lines and examine the code

• Function increment_stats reads and writes global variables, causing lots of conflicts between transactions

• Incrementing these stats causes many violations
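One way to rank offenders in a longer report is to total the Occurence and Loss columns per source line. This is a rough sketch that assumes the six whitespace-separated columns shown in the report. The names struct site, add_row, and worst_site are made up for illustration and are not part of TAPE.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_SITES 64

/* Totals for one source location named in the report's Line column. */
struct site {
    char where[256];
    unsigned long occurrences;
    unsigned long loss;
};

/* Fold one report row into sites[0..n). Returns the new site count.
 * Rows that do not parse (such as the header) leave it unchanged. */
int add_row(struct site *sites, int n, const char *row)
{
    unsigned long pc, addr, occ, loss;
    int proc;
    char where[256];

    assert(sites != NULL && row != NULL);
    if (sscanf(row, "%lx %lx %lu %lu %d %255s",
               &pc, &addr, &occ, &loss, &proc, where) != 6)
        return n;

    for (int i = 0; i < n; i++) {
        if (strcmp(sites[i].where, where) == 0) {
            sites[i].occurrences += occ;
            sites[i].loss += loss;
            return n;
        }
    }
    if (n == MAX_SITES)
        return n;                   /* table full: drop the row */
    strcpy(sites[n].where, where);
    sites[n].occurrences = occ;
    sites[n].loss = loss;
    return n + 1;
}

/* Index of the site with the largest total loss, or -1 if empty. */
int worst_site(const struct site *sites, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++)
        if (best < 0 || sites[i].loss > sites[best].loss)
            best = i;
    return best;
}
```

Applied to the five rows above, the totals point at manager.c:134, which matches the report's conclusion.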


Fixing violations in vacation

The problem: violations on global stats variables

The fix: privatize the global variables

Simple privatization scheme

• Make an 8-element array for each stat variable, e.g. int num_adds; becomes int num_adds[MAX_CPUS];

• Each element is owned by one processor, i.e. num_adds[x] = Processor x’s element

• In the stats printing function, aggregate the array elements into one single variable
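The scheme can be sketched in isolation for a single counter. The helper names stats_init, count_add, and total_adds are illustrative only. In the tutorial code the index would come from TM_GET_THREAD_ID(); here it is passed in as a parameter so the sketch compiles on its own.

```c
#include <assert.h>

#define MAX_CPUS 8

/* One private slot per processor instead of a single shared counter. */
static int num_adds[MAX_CPUS];

/* Clear every processor's slot. */
void stats_init(void)
{
    for (int i = 0; i < MAX_CPUS; i++)
        num_adds[i] = 0;
}

/* Each processor touches only its own slot, so two transactions
 * never write the same counter. */
void count_add(int thread_id)
{
    assert(thread_id >= 0 && thread_id < MAX_CPUS);
    num_adds[thread_id]++;
}

/* Reduction: aggregate the private slots when printing the stats. */
int total_adds(void)
{
    int total = 0;
    for (int i = 0; i < MAX_CPUS; i++)
        total += num_adds[i];
    return total;
}
```

The reduction runs once at the end, so the only cost during the parallel phase is indexing by thread ID.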


Privatization of vacation – Step 1

OPEN manager.c, CHANGE lines 111-115 to:

110 #ifdef manager_stats
111 int num_adds[MAX_CPUS];
112 int num_deletes[MAX_CPUS];
113 int num_queries[MAX_CPUS];
114 int num_reservations[MAX_CPUS];
115 int num_cancels[MAX_CPUS];
116 #endif

Then CHANGE lines 132-136 to:

130 switch (stat)
131 {
132   case ADDS: num_adds[TM_GET_THREAD_ID()]++; break;
133   case DELETES: num_deletes[TM_GET_THREAD_ID()]++; break;
134   case QUERIES: num_queries[TM_GET_THREAD_ID()]++; break;
135   case RESERVATIONS: num_reservations[TM_GET_THREAD_ID()]++; break;
136   case CANCELS: num_cancels[TM_GET_THREAD_ID()]++; break;
137   default: break;
138 }


Privatization of vacation – Step 2

In function manager_initStats in manager.c, ADD lines 153-155 and CHANGE lines 156-161:

149 void
150 manager_initStats(void)
151 {
152 #ifdef manager_stats
153   int i;
154   for(i = 0; i < MAX_CPUS; i++)
155   {
156     num_adds[i] = 0;
157     num_deletes[i] = 0;
158     num_queries[i] = 0;
159     num_reservations[i] = 0;
160     num_cancels[i] = 0;
161   }
162 #endif
163
164   printf("Manager Stats are initialized\n");
165
166 }


Privatization of vacation – Step 3

In function manager_printStats in manager.c, CHANGE line 177 and lines 192-196:

176 #ifdef manager_stats
177 #if 1
178   int i;
179   int num_adds_t = 0, num_deletes_t = 0, num_queries_t = 0;
180   int num_reservations_t = 0, num_cancels_t = 0;
181   /* aggregate stats */
…
191   printf("\n");
192   printf("Number of total adds = %d\n", num_adds_t);
193   printf("Number of total deletes = %d\n", num_deletes_t);
194   printf("Number of total queries = %d\n", num_queries_t);
195   printf("Number of total reservations = %d\n", num_reservations_t);
196   printf("Number of total cancellations = %d\n", num_cancels_t);

ACTION!

%> make run_par


Summary of Transactional vacation

After ~2 minutes, observe that the run on 8 processors is approximately 6 times faster than the uniprocessor configuration
• Note: In OpenTM, privatization and reduction will be automated by flagging variables
    • The compiler will insert the privatization and reduction code for us

In this exercise, we demonstrated
• Ease of use of transactional memory
    • Intuitive coarse-grain parallelization
    • Did not require low-level understanding of the code
• Guided performance tuning
    • Identifies significant performance bottlenecks
    • Without the profiler and TAPE, finding such bottlenecks is like “looking for a needle in a haystack”!


Debugging Parallel Code

There are established techniques for debugging sequential code

• Standard debugger (e.g. GDB)

How about parallel code?

• Non-deterministic runtime behavior

• Sometimes you have to understand underlying architecture

How can transactional memory help?

• Atomicity & Isolation
    • No intrusion from other threads inside the transaction

• Deterministic replay

• Infinite data-watches

Our focus today => deterministic replay & GDB support


Functional Debugging of Transactional Apps

Once app is transactional, most common type of functional bug is atomicity violation

What are atomicity violations?

• In TM, the programmer splits an atomic region of code into two or more transactions
    • Intermediate values of shared data in one transaction are prematurely exposed to other transactions

• In fine-grain lock-based programming, it is much easier to introduce such violations

Challenge: Atomicity violations are non-deterministic and hard to regenerate


Fixing atomicity violations in ATLAS

In this session, you will debug an application with atomicity violations

ATLAS provides framework for deterministic replay

• 1st Step: Run a small application with atomicity violations

• 2nd Step: Deterministically regenerate the buggy execution

• 3rd Step: Add monitor code to identify origins of bugs

• 4th Step: Fix the code!


Example Code: Doubly Linked List

Toy example: Goal is to demonstrate the tool

Global doubly-linked-list queue

• Head and Tail pointers

• Each thread dequeues an item from the Head pointer and enqueues it back at the Tail

• Threads use dequeue and enqueue functions, each of which is individually synchronized using a transaction

Programmer’s intention:

• The order of items in the queue remains the same

• As if a single thread repeated dequeue and enqueue


High-level Pseudo code

run:
    for i = 0:NUM_ITERS
        item = atomic_dequeue();
        atomic_enqueue(item);
    end

atomic_dequeue:
    TM_BEGIN()
    item = dequeue_from_Head
    TM_END()
    return item

atomic_enqueue(item):
    TM_BEGIN()
    enqueue_to_Tail(item)
    TM_END()

[Figure: queue of numbered items between Head and Tail, with Thread A and Thread B operating on it]


Execution Step 1: Test Drive

The result from ATLAS may or may not meet the spec.

Here’s one possible example of an undesired execution result.

Try it several times. You will see different results.

ACTION!

%> cd $DLL
%> make run

[Figure: correct output vs. actual result; in the actual result, items 3 and 4 appear in swapped order]


Execution Step 2: Replay

Log files
• After the test drive, you will get atlas.stdout and commit.out
• atlas.stdout: standard output from the application
• commit.out: transaction order

Now we will replay the previous run with commit.out

ACTION!

%> make replay LOGFILE=commit.out

Unlike the previous step, you see the same behavior each time

TIP

%> make replay LOGFILE=commit.correct
%> make replay LOGFILE=commit.error
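The idea behind log-driven replay can be sketched as follows. This is not the ATLAS implementation, and the real format of commit.out is not described here. The sketch assumes the log reduces to a sequence of thread IDs in commit order. The names txn_fn, replay, mix_txn, and replay_mix are hypothetical.

```c
#include <assert.h>

/* A transaction body: runs the next transaction of one thread. */
typedef void (*txn_fn)(int thread_id, void *state);

/* Run one transaction per log entry, in the logged commit order.
 * Returns the number of entries replayed; stops early on an
 * out-of-range thread ID. */
int replay(const int *commit_log, int n_commits, int num_threads,
           txn_fn run_txn, void *state)
{
    for (int i = 0; i < n_commits; i++) {
        int tid = commit_log[i];
        if (tid < 0 || tid >= num_threads)
            return i;
        run_txn(tid, state);
    }
    return n_commits;
}

/* Example transaction whose result depends on commit order. */
void mix_txn(int thread_id, void *state)
{
    unsigned long *v = state;
    *v = *v * 31u + (unsigned long)(thread_id + 1);
}

/* Replay a whole log from a fresh state and return the final value. */
unsigned long replay_mix(const int *commit_log, int n_commits,
                         int num_threads)
{
    unsigned long v = 0;
    int done = replay(commit_log, n_commits, num_threads, mix_txn, &v);
    assert(done == n_commits);
    return v;
}
```

Replaying the same log always yields the same final state, while a log with a different commit order generally yields a different one, which is why a recorded buggy run can be regenerated on demand.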


Execution Step 3: Finding the bug

Hypothesis: Enqueuing order may be different from the dequeuing order.

For example,

Intended order:
• Thread A dequeue X
• Thread A enqueue X
• Thread B dequeue Y
• Thread B enqueue Y

Possible interleaving:
• Thread A dequeue X
• Thread B dequeue Y
• Thread B enqueue Y
• Thread A enqueue X

How to test the hypothesis?

• Let’s make it simple: add printf
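The two orderings above can be checked with a small single-threaded simulation. This is a sketch written for illustration, not the tutorial's dll.c: the list layout and the helper names fill, run_intended, run_interleaved, and snapshot are assumptions.

```c
#include <assert.h>
#include <stddef.h>

typedef struct item { int id; struct item *prev, *next; } item_t;
typedef struct { item_t *head, *tail; } queue_t;

/* Append an item at the Tail. */
void enqueue(queue_t *q, item_t *it)
{
    it->next = NULL;
    it->prev = q->tail;
    if (q->tail != NULL)
        q->tail->next = it;
    else
        q->head = it;
    q->tail = it;
}

/* Remove and return the item at the Head (NULL if empty). */
item_t *dequeue(queue_t *q)
{
    item_t *it = q->head;
    if (it == NULL)
        return NULL;
    q->head = it->next;
    if (q->head != NULL)
        q->head->prev = NULL;
    else
        q->tail = NULL;
    it->prev = it->next = NULL;
    return it;
}

/* Build a queue holding items[0..n) in order, with ids 0..n-1. */
void fill(queue_t *q, item_t *items, int n)
{
    q->head = q->tail = NULL;
    for (int i = 0; i < n; i++) {
        items[i].id = i;
        enqueue(q, &items[i]);
    }
}

/* Intended order: each thread finishes its pair before the next starts. */
void run_intended(queue_t *q)
{
    item_t *x = dequeue(q);     /* Thread A dequeue X */
    assert(x != NULL);
    enqueue(q, x);              /* Thread A enqueue X */
    item_t *y = dequeue(q);     /* Thread B dequeue Y */
    assert(y != NULL);
    enqueue(q, y);              /* Thread B enqueue Y */
}

/* Interleaving: B's pair commits between A's two transactions. */
void run_interleaved(queue_t *q)
{
    item_t *x = dequeue(q);     /* Thread A dequeue X */
    item_t *y = dequeue(q);     /* Thread B dequeue Y */
    assert(x != NULL && y != NULL);
    enqueue(q, y);              /* Thread B enqueue Y */
    enqueue(q, x);              /* Thread A enqueue X */
}

/* Copy ids from Head to Tail into out[]; returns how many were written. */
int snapshot(const queue_t *q, int *out, int cap)
{
    int n = 0;
    for (const item_t *it = q->head; it != NULL && n < cap; it = it->next)
        out[n++] = it->id;
    return n;
}
```

Starting from items 0, 1, 2, 3, the intended order leaves the queue as 2, 3, 0, 1, while the interleaving leaves 2, 3, 1, 0: the two items that were in flight end up swapped.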


Execution Step 3: Finding the bug

Let’s add monitoring code

• printf in the transaction will not affect the transaction order

• Therefore, you will get exactly the same behavior

EDIT dll.c

96 Head = next;
97 printf("Dequeue(%d)\n", item->id);
98

108
109 printf("\t\tEnqueue(%d)\n", item->id);
110 item->prev = NULL;

ACTION!

%> make replay LOGFILE=commit.out

You will see that your hypothesis is right.


Execution Step 4: Fix it

Make dequeue/enqueue one atomic block
• Untransactionalize the individual dequeue and enqueue functions
• Transactionalize the dequeue/enqueue pair

EDIT dll.c

84 //TM_BEGIN();
...
99 //TM_END();
...
107 //TM_BEGIN();
...
122 //TM_END();

73 for (i = 0; i < NUM_ITERS; i+= TM_GET_NUM...
74
75     TM_BEGIN();
76
77     item = dequeue();
78     enqueue(item);
79
80     TM_END();
81
82 }

ACTION!

%> make run


Replay on Local Machine

Runs sequentially following LOGFILE

Faster execution

GDB support already exists

Does not support machine-specific code

ACTION!

%> make replay_local LOGFILE=commit.out


GDB and Replay

ACTION!

%> echo "commit_in commit.error" > config.tcc
%> gdb --args ./dll_local 8

About to dequeue:

(gdb) break dll.c:88
(gdb) condition 1 Head->id==3
(gdb) run
(gdb) p Head->id
(gdb) p Tail->id
(gdb) p myid

About to enqueue:

(gdb) break dll.c:110
(gdb) condition 2 myid==3
(gdb) continue
(gdb) p myid
(gdb) p Tail->id


Debugging Conclusion

Deterministic replay

• Provides regeneration of buggy scenario

• Allows embedding monitoring code without contaminating the buggy scenario

The “All Transactions All The Time” concept helps with debugging parallel code

• Easier deterministic replay

• Easier GDB support