Extending Open64 with Transactional Memory features

Jiaqi Zhang, Tsinghua University

Contents

• Background
• Design
• Implementation
• Optimization
• Experiment
• Conclusion

Transactional Memory Background

• Trend toward concurrent programming
• Current solution:
  – Lock
  – Flaws:
    • Association between locks and data
    • Deadlock
    • Not composable


class Account {
    int balance;
    lock mylock;
    bool credit(int amount);
    bool debit(int amount);
};

bool credit(int amount) {
    acquire(mylock);
    balance += amount;
    release(mylock);
    return true;
}

bool debit(int amount) {
    acquire(mylock);
    balance -= amount;
    release(mylock);
    return true;
}

Composing the two operations with locks:

transfer(Account a, Account b, int amount) {
    acquire(a.mylock);
    acquire(b.mylock);
    a.credit(amount);
    /* without the outer locks, an inconsistent state is visible here */
    b.debit(amount);
    release(a.mylock);
    release(b.mylock);
}

• Poor abstraction of class Account
• Deadlock
• Exposed implementation details

Composing the two operations with a transaction:

transfer(Account a, Account b, int amount) {
    atomic {
        a.credit(amount);
        b.debit(amount);
    }
}
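The lock-based composition above can be made deadlock-free only if every caller agrees on a locking order. The following is a minimal sketch of that workaround, assuming POSIX mutexes; the names `Account`, `lock_pair`, `unlock_pair` and `transfer` are illustrative and not taken from any library. It shows how the caller has to know that accounts are protected by locks, which is the abstraction leak the slide lists.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical account type; the fields mirror the slide's pseudocode. */
typedef struct {
    int balance;
    pthread_mutex_t lock;
} Account;

/* Take both locks in address order so that two opposite transfers
 * running concurrently cannot wait on each other forever. */
static void lock_pair(Account *a, Account *b) {
    if ((uintptr_t)a < (uintptr_t)b) {
        pthread_mutex_lock(&a->lock);
        pthread_mutex_lock(&b->lock);
    } else {
        pthread_mutex_lock(&b->lock);
        pthread_mutex_lock(&a->lock);
    }
}

static void unlock_pair(Account *a, Account *b) {
    pthread_mutex_unlock(&a->lock);
    pthread_mutex_unlock(&b->lock);
}

/* Credit a and debit b as one step. Threads that also go through
 * lock_pair never observe the state between the two updates. */
void transfer(Account *a, Account *b, int amount) {
    assert(a != NULL && b != NULL);
    if (a == b) {
        return; /* credit and debit on the same account cancel out */
    }
    lock_pair(a, b);
    a->balance += amount;
    b->balance -= amount;
    unlock_pair(a, b);
}
```

With an atomic block none of this ordering logic would be written by the programmer; it is shown here only to make the "not composable" flaw concrete.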


• Current Implementations
  – TM libraries
    • DSTM
    • DracoSTM
    • TL2
    • TinySTM
    • …

Function calls:
    TM_INIT() / TM_SHUTDOWN()
    TM_ATOMIC_BEGIN() / TM_ATOMIC_END()
    TM_SHARED_READ() / TM_SHARED_WRITE()

Explicit Transaction
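To illustrate what "explicit transaction" means for the programmer, here is a small sketch in which every shared access is wrapped by hand. The macro names come from the slide, but their argument lists and the stand-in definitions below are assumptions made so the example compiles without a TM library; a real library would start, log, validate and commit at these points.

```c
#include <assert.h>

/* Stand-in definitions (assumed shapes, not a real library's headers). */
#define TM_INIT()               ((void)0)
#define TM_SHUTDOWN()           ((void)0)
#define TM_ATOMIC_BEGIN()       do {
#define TM_ATOMIC_END()         } while (0)
#define TM_SHARED_READ(var)     (var)
#define TM_SHARED_WRITE(var, v) ((var) = (v))

static int balance_a = 100;
static int balance_b = 50;

/* Every read and write of shared data is marked by the programmer.
 * Forgetting a single macro would silently bypass the TM runtime. */
void transfer_explicit(int amount) {
    assert(amount >= 0);
    TM_ATOMIC_BEGIN();
    TM_SHARED_WRITE(balance_a, TM_SHARED_READ(balance_a) + amount);
    TM_SHARED_WRITE(balance_b, TM_SHARED_READ(balance_b) - amount);
    TM_ATOMIC_END();
}

int total_balance(void) {
    return balance_a + balance_b;
}
```

The manual wrapping is the burden that compiler support is meant to remove.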


• Current Implementations
  – Compilers
    • Intel C++ STM Compiler
    • Tanger
    • OpenTM
    • GCC

Design

• Programming Interfaces

#pragma tm atomic [clause]
    structured block

  Clauses:
    readonly
    private(var list)
    shared(var list)

#pragma tm abort

#pragma tm function
    function declaration

#pragma tm waiver
    function declaration
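A short sketch of how these directives might appear in user code. The placement of the pragmas follows the slide; the reading of `waiver` as "callable inside a transaction without instrumentation" is an assumption by analogy with other TM compilers, and all function and variable names are made up. A compiler without TM support ignores the unknown pragmas, so the example still runs sequentially.

```c
#include <assert.h>

static int hits;    /* shared counter updated inside transactions */
static int log_len; /* bookkeeping assumed not to need instrumentation */

/* Assumed usage: request an instrumented clone of this function. */
#pragma tm function
void record_hit(void);

/* Assumed usage: allow calls from a transaction without instrumentation. */
#pragma tm waiver
void note_call(void);

void record_hit(void) { hits++; }
void note_call(void)  { log_len++; }

void update(void) {
    #pragma tm atomic
    {
        record_hit();
        note_call();
    }
}

int snapshot(void) {
    int seen;
    /* readonly clause: the block performs no shared writes */
    #pragma tm atomic readonly
    {
        seen = hits;
    }
    assert(seen >= 0); /* a hit counter is never negative */
    return seen;
}
```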

Design

• TM runtime interfaces (TL2)

Interface                                             Description
Thread* TxNewThread()                                 Allocate a new Thread structure to keep logs
TxStart(Thread* Self, jmp_buf* buf, int flags)        Start a new transaction for the current thread
TxCommit(Thread* Self)                                Commit the current transaction
TxLoad(Thread* Self, void* addr)                      Perform a synchronized load from the given memory address
TxStore(Thread* Self, void* addr, intptr_t val)       Perform a synchronized store to the given memory address
TxStoreLocal(Thread* Self, void* addr, intptr_t val)  Perform a locally logged store to the given memory address
TxAbort(Thread* Self)                                 Abort the current transaction and re-execute
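To make the roles of start, load, store, commit and abort concrete, the following toy runtime buffers stores in a per-thread log and applies them at commit. It is a single-threaded illustration only: it is not TL2's code, it performs no locking or read validation, and every `toy_` name and `ToyThread` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_WRITES 16

/* Toy per-thread descriptor: a write log and nothing else. */
typedef struct {
    intptr_t *addr[TOY_MAX_WRITES];
    intptr_t  val[TOY_MAX_WRITES];
    size_t    nwrites;
} ToyThread;

void toy_start(ToyThread *self) { self->nwrites = 0; }

/* Buffer the store in the log instead of touching memory. */
void toy_store(ToyThread *self, intptr_t *addr, intptr_t val) {
    for (size_t i = 0; i < self->nwrites; i++) {
        if (self->addr[i] == addr) {
            self->val[i] = val; /* overwrite the pending value */
            return;
        }
    }
    assert(self->nwrites < TOY_MAX_WRITES);
    self->addr[self->nwrites] = addr;
    self->val[self->nwrites] = val;
    self->nwrites++;
}

/* A load must see this transaction's own pending stores first. */
intptr_t toy_load(const ToyThread *self, const intptr_t *addr) {
    for (size_t i = 0; i < self->nwrites; i++) {
        if (self->addr[i] == addr) {
            return self->val[i];
        }
    }
    return *addr;
}

/* Commit publishes the buffered stores to memory. */
void toy_commit(ToyThread *self) {
    for (size_t i = 0; i < self->nwrites; i++) {
        *self->addr[i] = self->val[i];
    }
    self->nwrites = 0;
}

/* Abort simply discards the log; memory was never modified. */
void toy_abort(ToyThread *self) { self->nwrites = 0; }
```

Because stores only reach memory at commit, an abort leaves shared data untouched, which is what makes re-execution safe in this sketch.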

Design

• Wrapper functions
  – To ease the process of integrating new TM libraries

Called by programmers:
    tm_init() / tm_finalize()
    tm_thread_start() / tm_thread_end()

Inserted by compiler:
    __tm_atomic_begin() / __tm_atomic_end()
    __tm_shared_read() / __tm_shared_read_float()
    __tm_shared_write() / __tm_shared_write_float()
    __tm_local_write() / __tm_local_write_float()

More wrapper functions are needed for other data types and for additional TM semantics.
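One way such a wrapper layer could keep the TM library replaceable is to route every wrapper through a table of function pointers. The sketch below assumes that design; it is not taken from the Open64 implementation. The `wrap_` functions stand in for the `__tm_` wrappers named above (the double-underscore prefix is avoided here because such identifiers are reserved in user C code), and `TmBackend` and the `direct_` backend are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical backend table: one entry per operation the wrappers need. */
typedef struct {
    void     (*begin)(void);
    void     (*end)(void);
    intptr_t (*read)(intptr_t *addr);
    void     (*write)(intptr_t *addr, intptr_t val);
} TmBackend;

/* Trivial backend for the sketch: plain memory accesses plus a counter. */
static int direct_txn_count;
static void     direct_begin(void) { direct_txn_count++; }
static void     direct_end(void) {}
static intptr_t direct_read(intptr_t *addr) { return *addr; }
static void     direct_write(intptr_t *addr, intptr_t val) { *addr = val; }

static const TmBackend direct_backend = {
    direct_begin, direct_end, direct_read, direct_write
};

static const TmBackend *backend = &direct_backend;

/* Swapping the TM library means installing a different table. */
void wrap_set_backend(const TmBackend *b) {
    assert(b != NULL);
    backend = b;
}

/* The compiler-visible wrappers never name a concrete TM library. */
void     wrap_atomic_begin(void) { backend->begin(); }
void     wrap_atomic_end(void)   { backend->end(); }
intptr_t wrap_shared_read(intptr_t *addr) { return backend->read(addr); }
void     wrap_shared_write(intptr_t *addr, intptr_t val) {
    backend->write(addr, val);
}
```

Generated code then depends only on the wrapper names, so a different runtime can be plugged in without changing the compiler's output.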

Design

• Optimization
  – Eliminate redundant calls to runtime libraries

Implementation

• General Transformation

Implementation

• General Transformation
  – #pragma tm atomic
  – simple statements
  – control flow statements
    • IF
    • WHILE_DO

#pragma tm atomic:

    setjmp();
    __tm_atomic_begin();

Simple statement:

    a = b+c;

    PARM #address of c
    CALL <__tm_shared_read>
    LDID <return_offset>
    STID #tm_preg_num_0
    PARM #address of b
    CALL <__tm_shared_read>
    LDID <return_offset>
    STID #tm_preg_num_1
    LDID #tm_preg_num_0
    LDID #tm_preg_num_1
    ADD
    PARM
    PARM #address of a
    CALL <__tm_shared_write>

Control flow statement:

    for(;i<10;i++){}

    PARM #address of i
    CALL <__tm_shared_read>
    LDID <return_offset>
    STID #tm_preg_num_0
    WHILE_DO
      LDID #tm_preg_num_0
      INTCONST 9
      LE
    BODY
      BLOCK
        …
        PARM #address of i
        CALL <__tm_shared_read>
        LDID <return_offset>
        STID #tm_preg_num_0
      END_BLOCK

Implementation

• General Transformation

Source:

    int i = 0;
    #pragma tm atomic
    {
        int j = 0;
        for(i=0;i<20;i++)
        {
            for(j=0;j<10;j++)
            {
                result++;
            }
        }
    }

Transformed:

    int i = 0;
    jmp_buf jbuf;
    _setjmp(jbuf);
    TxStart(Self, jbuf);
    TxStore(Self, &j, 0);
    for (TxStore(Self, &i, 0); TxLoad(Self, &i)<20;
         TxStore(Self, &i, TxLoad(Self, &i)+1)){
        for(TxStore(Self, &j, 0); TxLoad(Self, &j)<10;
            TxStore(Self, &j, TxLoad(Self, &j)+1)){
            TxStore(Self, &result, TxLoad(Self, &result)+1);
        }}
    TxCommit(Self);
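The `_setjmp` call placed before `TxStart` in the transformed code is what lets an aborted transaction re-execute from the top. The sketch below shows that control flow in isolation, assuming a stand-in runtime that reports a fixed number of simulated conflicts at commit time; `toy_begin`, `toy_abort`, `toy_commit` and `run_transaction` are hypothetical names, and no real conflict detection is performed.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf retry_point;    /* where an aborted transaction restarts */
static int     attempts;       /* how many times the body was entered */
static int     conflicts_left; /* simulated conflicts still to occur */

static void toy_begin(void) { attempts++; }

/* Stand-in for a runtime abort: jump back to the saved start point. */
static void toy_abort(void) { longjmp(retry_point, 1); }

/* Stand-in for commit: validation "fails" while conflicts remain. */
static void toy_commit(void) {
    if (conflicts_left > 0) {
        conflicts_left--;
        toy_abort();
    }
}

int run_transaction(int simulated_conflicts) {
    static int shared_result; /* static, so it is well defined after longjmp */
    assert(simulated_conflicts >= 0);
    attempts = 0;
    conflicts_left = simulated_conflicts;

    setjmp(retry_point); /* re-execution resumes right after this call */
    toy_begin();
    shared_result = 0;   /* the body recomputes everything on each attempt */
    for (int i = 0; i < 20; i++) {
        for (int j = 0; j < 10; j++) {
            shared_result++;
        }
    }
    toy_commit();        /* may jump back to retry_point */
    return shared_result;
}
```

Each simulated conflict costs one full re-execution of the body, so a call with two conflicts enters the body three times.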

Implementation

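• Functions
  – clone and instrument

A function marked with the pragma:

    #pragma tm function
    void calculate(){}

exists in two versions:

    void calculate()
    __tm_cloned__calculate()   //instrumented

Calls inside a transaction are redirected to the clone:

    #pragma tm atomic
    {
        calculate();
    }

becomes

    #pragma tm atomic
    {
        __tm_cloned__calculate();
    }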
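The effect of cloning can be imitated by hand: the original function stays barrier-free for callers outside transactions, and a second version routes its shared accesses through the runtime. In this sketch `tm_read`, `tm_write` and `tm_cloned_calculate` are hypothetical stand-ins (the clone the compiler generates is named `__tm_cloned__calculate` on the slide), and the barriers only count calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static intptr_t total;         /* shared data touched by calculate() */
static int      barrier_calls; /* number of instrumented accesses so far */

/* Stand-in barriers: count the call, then access memory directly. */
static intptr_t tm_read(intptr_t *addr) {
    assert(addr != NULL);
    barrier_calls++;
    return *addr;
}

static void tm_write(intptr_t *addr, intptr_t val) {
    assert(addr != NULL);
    barrier_calls++;
    *addr = val;
}

/* Original version: used by callers outside any transaction. */
void calculate(void) { total = total + 1; }

/* Hand-written stand-in for the instrumented clone. */
void tm_cloned_calculate(void) {
    tm_write(&total, tm_read(&total) + 1);
}

/* Non-transactional call site keeps calling the original. */
void caller_outside(void) { calculate(); }

/* Transactional call site after redirection to the clone. */
void caller_inside(void) {
    /* transaction begin would go here */
    tm_cloned_calculate();
    /* transaction commit would go here */
}
```

Keeping both versions means code outside transactions pays no instrumentation cost.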

Implementation

• Optimization

Starting point (same source as before, every access instrumented):

    int i = 0;
    jmp_buf jbuf;
    _setjmp(jbuf);
    TxStart(Self, jbuf);
    TxStore(Self, &j, 0);
    for (TxStore(Self, &i, 0); TxLoad(Self, &i)<20;
         TxStore(Self, &i, TxLoad(Self, &i)+1)){
        for(TxStore(Self, &j, 0); TxLoad(Self, &j)<10;
            TxStore(Self, &j, TxLoad(Self, &j)+1)){
            TxStore(Self, &result, TxLoad(Self, &result)+1);
        }}
    TxCommit(Self);

Transaction local variables (here j): detected by the frontend

Implementation



After removing the barriers on the transaction local variable j:

    int i = 0;
    jmp_buf jbuf;
    _setjmp(jbuf);
    TxStart(Self, jbuf);
    j=0;
    for (TxStore(Self, &i, 0); TxLoad(Self, &i)<20;
         TxStore(Self, &i, TxLoad(Self, &i)+1)){
        for(j=0; j<10; j++){
            TxStore(Self, &result, TxLoad(Self, &result)+1);
        }}
    TxCommit(Self);

Barrier free variables (here i): detected according to their storage class

Implementation



After replacing the barriers on the barrier free variable i with locally logged stores:

    int i = 0;
    jmp_buf jbuf;
    _setjmp(jbuf);
    TxStart(Self, jbuf);
    j=0;
    for (; i<20; TxStoreLocal(Self, &i, i+1)){
        for(j=0; j<10; j++){
            TxStore(Self, &result, TxLoad(Self, &result)+1);
        }}
    TxCommit(Self);

Implementation

• Optimization
  – Optimization opportunities detection strategy
    • Pthread parallel task
      – transaction local: declared in tm atomic scope
      – barrier free: auto variables
    • Cloned transactional function
      – transaction local: declared in the function
    • OpenMP parallel task
      – transaction local: declared in tm atomic scope
      – barrier free: declared in micro task, or marked in an OpenMP private clause
    • Checking readonly transactions
  – Limitation
    • Conservative design for pointers
    • Needs programmers to participate in optimization
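A rough way to see what the optimization saves is to count runtime calls for the nested-loop example with and without barriers on the thread-private loop variables. The sketch below uses counting stand-ins `tx_load` and `tx_store` (hypothetical, not the TL2 functions) and simplifies the optimized case by dropping all instrumentation on i and j, whereas the slides still log the store to i through TxStoreLocal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static long barrier_calls; /* runtime calls made by the current variant */

static intptr_t tx_load(intptr_t *addr) {
    assert(addr != NULL);
    barrier_calls++;
    return *addr;
}

static void tx_store(intptr_t *addr, intptr_t val) {
    assert(addr != NULL);
    barrier_calls++;
    *addr = val;
}

/* Every access to i, j and result goes through a barrier,
 * as in the general transformation. */
long naive_barriers(intptr_t *result) {
    intptr_t i, j;
    barrier_calls = 0;
    tx_store(&j, 0);
    for (tx_store(&i, 0); tx_load(&i) < 20; tx_store(&i, tx_load(&i) + 1)) {
        for (tx_store(&j, 0); tx_load(&j) < 10; tx_store(&j, tx_load(&j) + 1)) {
            tx_store(result, tx_load(result) + 1);
        }
    }
    return barrier_calls;
}

/* i and j are private to the thread, so only the shared
 * variable result keeps its load and store barriers. */
long optimized_barriers(intptr_t *result) {
    barrier_calls = 0;
    for (int i = 0; i < 20; i++) {
        for (int j = 0; j < 10; j++) {
            tx_store(result, tx_load(result) + 1);
        }
    }
    return barrier_calls;
}
```

In the optimized variant the only remaining calls are the 200 loads and 200 stores on result; the naive variant additionally pays for every loop-control access.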

Preliminary Experiments

• Compare with fine-grained lock based application

Preliminary Experiments

• Compare with manually instrumented application

Preliminary Experiments

#pragma tm atomic
{
    int j;
    (*new_centers_len[index])++;
    for(j=0;j<nfeatures;j++){
        new_centers[index][j]+=feature[i][j];
    }
}

Added clause: private(feature)

Conclusion & Future work

• An infrastructure for TM on Open64
  – Replaceable TM implementation
  – Optimization
• More experiments on non-trivial applications are desired
• Nested transaction
• Signal processing
• Event handler
• Indirect calls
• Dealing with legacy code
• …

FastDB: 8 out of 75 critical regions contain nested transactions
FastDB: 28 out of 75 critical regions contain signal processing
PARSEC: 20 out of 55 critical regions contain signal processing

Thanks
