April 15 2008 Thesis Defense Talk ATLAS Software Development Environment for Hardware Transactional Memory Sewook Wee Computer Systems Lab Stanford University


April 15 2008 Thesis Defense Talk

ATLAS Software Development Environment for Hardware Transactional Memory

Sewook Wee

Computer Systems Lab, Stanford University

The Parallel Programming Crisis

Multi-cores for scalable performance
  No faster single core any more

Parallel programming is a must, but still hard
  Multiple threads access shared memory
  Correct synchronization is required

Conventional: lock-based synchronization
  Coarse-grain locks: serialize the system
  Fine-grain locks: hard to get correct

Alternative: Transactional Memory (TM)

Memory transactions [Knight’86][Herlihy & Moss’93]
  An atomic & isolated sequence of memory accesses
  Inspired by database transactions

Atomicity (all or nothing)
  At commit, all memory updates take effect at once
  On abort, none of the memory updates appear to take effect

Isolation
  No other code can observe memory updates before commit

Serializability
  Transactions seem to commit in a single serial order

Advantages of TM

As easy to use as coarse-grain locks
  Programmer declares the atomic region
  No explicit declaration or management of locks

As good performance as fine-grain locks
  System implements synchronization
  Optimistic concurrency [Kung’81]
  Slow down only on true conflicts (R-W or W-W)
  Fine-grain dependency detection

No trade-off between performance & correctness

Implementation of TM

Software TM [Harris’03][Saha’06][Dice’06]
  Versioning & conflict detection in software
  No hardware change, flexible
  Poor performance (up to 8x slowdown)

Hardware TM [Herlihy & Moss’93][Hammond’04][Moore’06]
  Modifying data cache hardware
  High performance
  Correctness: strong isolation

Software Environment for HTM

Programming language [Carlstrom’07]
  Parallel programming interface

Operating system
  Provides virtualization, resource management, …
  Challenges for TM
    Interaction of active transaction and OS

Productivity tools
  Correctness and performance debugging tools
  Build on TM features


Contributions

An operating system for hardware TM

Productivity tools for parallel programming

Full-system prototyping & evaluation


Agenda

Motivation

Background

Operating System for HTM

Productivity Tools for Parallel Programming

Conclusions

TCC: Transactional Coherence/Consistency

A hardware-assisted TM implementation
  Avoids overhead of software-only implementation
  Semantically correct TM implementation

A system that uses TM for coherence & consistency
  Use TM to replace MESI coherence
    Other proposals build TM on top of MESI
  All transactions, all the time

TCC Execution Model

[Figure: timeline of CPU 0, CPU 1, and CPU 2. Each CPU repeats the phases Execute Code, Arbitrate, Commit. One CPU commits a store to 0xcccc; a CPU that had already loaded 0xcccc undoes its transaction and re-executes its code.]

See [ISCA’04] for details

CMP Architecture for TCC

[Figure: TCC processor and data cache. Components: register checkpoint, load/store address path, 2-ported TAG array with V, R7:0, and W7:0 bits, single-ported DATA array, store address FIFO, snoop control, violation signal, commit control, and commit address/data ports on the commit bus and refill bus.]

Transactionally Read Bits: ld 0xdeadbeef
Transactionally Written Bits: st 0xcafebabe
Conflict Detection: compare incoming address to R bits
Commit: read pointers from Store Address FIFO, flush addresses with W bits set

See [PACT’05] for details
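The conflict detection and commit steps above can be sketched in software. This is a minimal Python model, not the ATLAS hardware: the class `TccCache` and its method names are hypothetical, addresses are tracked individually rather than per cache line, and the R and W bits are modeled as a set and a dictionary.

```python
class TccCache:
    """Toy model of one CPU's speculative state (hypothetical names)."""

    def __init__(self):
        self.read_set = set()    # addresses whose R bit is set
        self.write_buffer = {}   # address -> speculative value (W bit set)

    def load(self, memory, addr):
        # A load served from the local write buffer has no remote dependency.
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        self.read_set.add(addr)  # set the R bit
        return memory.get(addr, 0)

    def store(self, addr, value):
        # Stores stay local until commit.
        self.write_buffer[addr] = value

    def commit(self, memory):
        # Flush every address with the W bit set and report the addresses
        # so that other caches can snoop the commit.
        committed = list(self.write_buffer)
        memory.update(self.write_buffer)
        self.reset()
        return committed

    def snoop(self, committed_addrs):
        # Conflict detection: compare incoming addresses to the R bits.
        return any(addr in self.read_set for addr in committed_addrs)

    def reset(self):
        # Clear speculative state (after a commit or on a violation).
        self.read_set.clear()
        self.write_buffer.clear()


memory = {0xCCCC: 1}
cpu0, cpu1 = TccCache(), TccCache()
stale = cpu1.load(memory, 0xCCCC)           # cpu1 reads 0xcccc speculatively
cpu0.store(0xCCCC, 2)
violated = cpu1.snoop(cpu0.commit(memory))  # cpu0 commits, cpu1 snoops
if violated:
    cpu1.reset()                            # undo, then re-execute
fresh = cpu1.load(memory, 0xCCCC)
```

The usage lines mirror the execution model slide: the committing CPU never rolls back, and only the CPU that read a committed address has to re-execute.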

ATLAS Prototype Architecture

Goal
  Provide a convincing proof-of-concept of TCC
  Experiment with software issues

[Figure: CPU0 through CPU7, each with a TCC cache, connected by a coherent bus with a commit token arbiter to main memory & I/O.]

Mapping to BEE2 Board

[Figure: four switches, each connecting two CPUs with TCC caches (eight CPUs in total), linked to a central switch with the arbiter and memory.]


Agenda

Motivation

Background

Operating System for HTM

Productivity Tools for Parallel Programming

Conclusions

Challenges in OS for HTM

What should we do if the OS needs to run in the middle of a transaction?

Challenges in OS for HTM

Loss of isolation at exception
  Exception info is not visible to OS until commit
  E.g., faulting address in TLB miss

Loss of atomicity at exception
  Some exception services cannot be undone
  E.g., file I/O

Performance
  OS preempts user thread in the middle of a transaction
  E.g., interrupts

Practical Solutions

Performance
  A dedicated CPU for the operating system
  No need to preempt user thread in the middle of a transaction

Loss of isolation at exception
  Mailbox: separate communication layer between application and OS

Loss of atomicity at exception
  Serialize system for irrevocable exceptions

Architecture Update

[Figure: the BEE2 mapping before and after the update. Before: eight CPUs with TCC caches behind four switches, plus the arbiter, central switch, and memory. After: every CPU has a cache ($) and a mailbox (M), and one more CPU is attached next to the arbiter. Labels: Linux, proxy kernel.]

Execution overview (1) - Start of an application

[Figure: application CPU and OS CPU, each with a processor (P), cache ($), and mailbox (M). Labels: operating system, bootloader, ATLAS core, initial context, TM application.]

ATLAS core
  A user-level program that runs on the OS CPU
  Same address space as the TM application
  Starts the application & listens to requests from apps

Initial context
  Registers, PC, PID, …

Execution overview (2) - Exception

[Figure: application CPU and OS CPU, each with a processor (P), cache ($), and mailbox (M). The proxy kernel and TM application are on the application CPU; the ATLAS core and operating system are on the OS CPU. Exception information flows to the OS CPU and the exception result flows back.]

Proxy kernel forwards the exception information to the OS CPU
  Fault address for TLB misses
  Syscall number and arguments for syscalls

OS CPU services the request and returns the result
  TLB mapping for TLB misses
  Return value and error code for syscalls
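The request/result exchange on this slide can be sketched as two queues. This is only an illustration in Python, not the ATLAS mailbox interface: `Mailbox`, `forward_exception`, `service_requests`, the 4 KiB page size, and the toy syscall table are all assumptions made for the example.

```python
from collections import deque

PAGE_SHIFT = 12  # assumption: 4 KiB pages


class Mailbox:
    """Two queues standing in for the hardware mailbox (hypothetical)."""

    def __init__(self):
        self.requests = deque()  # application CPU -> OS CPU
        self.results = deque()   # OS CPU -> application CPU


def forward_exception(mailbox, kind, *info):
    # Proxy kernel side: post the exception information.
    mailbox.requests.append((kind, info))


def service_requests(mailbox, page_table, syscall_table):
    # OS CPU side: service each request and post the result.
    while mailbox.requests:
        kind, info = mailbox.requests.popleft()
        if kind == "tlb_miss":
            (fault_addr,) = info
            vpn = fault_addr >> PAGE_SHIFT
            mailbox.results.append(("tlb_mapping", vpn, page_table[vpn]))
        elif kind == "syscall":
            number, args = info
            ret, err = syscall_table[number](*args)
            mailbox.results.append(("syscall_result", ret, err))
        else:
            raise ValueError(f"unknown exception kind: {kind}")


mailbox = Mailbox()
forward_exception(mailbox, "tlb_miss", 0x10002ABC)
forward_exception(mailbox, "syscall", 4, (1, "hi"))
service_requests(
    mailbox,
    page_table={0x10002: 0x7F},
    syscall_table={4: lambda fd, text: (len(text), 0)},  # toy "write"
)
```

Because the exchange goes through the mailbox rather than transactional memory, the OS CPU can see the fault address or syscall arguments before the application's transaction commits.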

Operating System Statistics

Strategy: localize modifications
  Minimize the work needed to track mainstream kernel development

Linux kernel (version 2.4.30)
  Device driver that provides user-level access to privilege-level information
  ~1000 lines (C, ASM)

Proxy kernel
  Runs on application CPU
  ~1000 lines (C, ASM)

A full workstation from the programmer’s perspective

System Performance

Total execution time scales
  OS time scales, too

[Figure: scalability averaged over 10 benchmarks. Normalized execution time (0.0 to 1.2), split into OS and user time, for 1p, 2p, 4p, and 8p. X axis: number of processors.]

Scalability of OS CPU

Single CPU for operating system
  Eventually, it will become a bottleneck as the system scales
  Multiple CPUs for OS will need to run an SMP OS

Micro-benchmark experiment
  Simultaneous TLB miss requests
  Controlled injection ratio
  Looking for the number of application CPUs that saturates the OS CPU

Experiment results

Average TLB miss rate = 1.24%
  Starts to congest from 8 CPUs

With victim TLB (average TLB miss rate = 0.08%)
  Starts to congest from 64 CPUs


Agenda

Motivation

Background

Operating System for HTM

Productivity Tools for Parallel Programming

Conclusions

Challenges in Productivity Tools for Parallel Programming

Correctness
  Nondeterministic behavior
    Related to a thread interleaving
  Need to track an entire interleaving
    Very expensive in time/space

Performance
  Detailed information of the performance bottleneck events
  Light-weight monitoring
    Do not disturb the interleaving

Opportunities with HTM

TM already tracks all reads/writes
  Cheaper to record memory access interleaving

TM allows non-intrusive logging
  Software instrumentation in TM system
  Not in user’s application

All transactions, all the time
  Everything in transactional granularity

Tool 1: ReplayT

Deterministic Replay

Deterministic Replay

Challenges in recording an interleaving
  Record every single memory access
  Intrusive
  Large footprint

ReplayT’s approach
  Record only a transaction interleaving
  Minimal overhead: 1 event per transaction
  Footprint: 1 byte per transaction (thread ID)

ReplayT Runtime

[Figure: two timelines, Log Phase and Replay Phase, for threads T0, T1, and T2. In the log phase the commit order is written to the LOG (T0 T1 T2); in the replay phase the commit protocol replays the logged commit order.]
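The log and replay phases can be sketched as a commit arbiter that either records one thread ID per commit or enforces a previously recorded order. This is a minimal Python illustration under stated assumptions: `CommitArbiter` and its methods are hypothetical names, thread IDs are assumed to fit in one byte, and a simple retry loop stands in for hardware arbitration.

```python
class CommitArbiter:
    """Grants commits; logs the order, or enforces a logged order."""

    def __init__(self, replay_log=None):
        self.log = bytearray()        # 1 byte per committed transaction
        self.replay_log = replay_log  # None means log phase
        self.next_index = 0

    def may_commit(self, thread_id):
        if self.replay_log is None:
            return True               # log phase: any requester may commit
        if self.next_index >= len(self.replay_log):
            return False              # logged order is exhausted
        return self.replay_log[self.next_index] == thread_id

    def commit(self, thread_id):
        if not self.may_commit(thread_id):
            raise RuntimeError("commit would violate the logged order")
        self.log.append(thread_id)    # thread ID must fit in one byte
        self.next_index += 1


# Log phase: record whatever order the threads happen to commit in.
logger = CommitArbiter()
for tid in (0, 1, 2, 2, 0):
    logger.commit(tid)

# Replay phase: threads retry until it is their turn in the log.
replayer = CommitArbiter(replay_log=bytes(logger.log))
pending = {0: 2, 1: 1, 2: 2}   # commits each thread still has to make
while any(pending.values()):
    for tid in (2, 1, 0):       # deliberately different arrival order
        if pending[tid] and replayer.may_commit(tid):
            replayer.commit(tid)
            pending[tid] -= 1
```

A thread that has to wait for its turn in the replay phase spends longer in arbitration, which is one way to picture why replay mode costs more than log mode.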

Runtime Overhead

Average on 10 benchmarks
  7 STAMP, 3 SPLASH/SPLASH2

Minimal time & space overhead
  Less than 1.6% overhead for logging
  More overhead in replay mode
    Longer arbitration time
  1 B per 7119 instructions

[Figure legend: B: baseline, L: log mode, R: replay mode]

Tool 2. AVIO-TM

Atomicity Violation Detection

Atomicity Violation

Problem: programmer breaks an atomic task into two transactions

Correct version:

ATMDeposit:
  atomic {
    t = Balance
    Balance = t + $100
  }

Broken version, with another thread running directDeposit:

ATMDeposit:
  atomic {
    t = Balance
  }
  atomic {
    Balance = t + $100
  }

directDeposit:
  atomic {
    t = Balance
    Balance = t + $1,000
  }
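The lost update in this example can be reproduced by running the transactions in two different commit orders. The sketch below is a plain Python simulation, not TM code: each function stands for one whole transaction, and the function names, the `run` helper, and the starting balance of 0 are assumptions for illustration.

```python
def run(schedule, start_balance=0):
    """Run whole transactions one after another in the given order."""
    state = {"balance": start_balance, "t": 0}

    def atm_read():        # first transaction of the split ATMDeposit
        state["t"] = state["balance"]

    def atm_write():       # second transaction of the split ATMDeposit
        state["balance"] = state["t"] + 100

    def direct_deposit():  # one transaction covering the whole task
        state["balance"] = state["balance"] + 1000

    transactions = {
        "atm_read": atm_read,
        "atm_write": atm_write,
        "direct_deposit": direct_deposit,
    }
    for name in schedule:
        transactions[name]()
    return state["balance"]


# directDeposit commits after both ATM transactions: both deposits survive.
good = run(["atm_read", "atm_write", "direct_deposit"])

# directDeposit commits between them: the $1,000 deposit is overwritten.
bad = run(["atm_read", "direct_deposit", "atm_write"])
```

Every transaction here is atomic and isolated on its own, so the TM system reports no conflict; the bug is that the task was split across two transactions.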

Atomicity Violation Detection

AVIO [Lu’06]
  Atomic region = no unserializable interleavings
  Extracts a set of atomic regions from correct runs
  Detects unserializable interleavings in buggy runs

Challenges of AVIO
  Need to record all loads/stores in global order
    Slow (28x)
    Intrusive: software instrumentation
    Storage overhead
  Slow analysis
    Due to the large volume of data

My Approach: AVIO-TM

Data collection in deterministic rerun
  Captures original interleavings

Data collection at transaction granularity
  Eliminates repeated logging for the same address (10x)
  Lower storage overhead

Data analysis at transaction granularity
  Fewer possible interleavings, so faster extraction
  Less data, so faster analysis
  More accurate with complementary detection tools


Tool 3. TAPE

Performance Bottleneck Monitor

TM Performance Bottlenecks

Dependency conflicts
  Aborted transactions waste useful cycles

Buffer overflows
  Speculative states may not fit into cache
  Serialization

Workload imbalance

Transaction API overhead

Dependency Conflicts

[Figure: timeline of T0 and T1 with phases Useful, Arbitration, Commit, and Abort. T0 writes X and T1 reads X.]

Useful cycles are wasted in T1

TAPE on ATLAS

TAPE [Chafi, ICS2005]
  Light-weight runtime monitor for performance bottlenecks

Hardware
  Tracks information of performance bottleneck events

Software
  Collects information from hardware for events
  Manages them throughout the execution

TAPE Conflict

[Figure: T0 reads X, Thread 1 commits X, and T0 restarts and reads X again. Per transaction: Object: X, Writing Thread: 1, Wasted cycles: 82,402, Read PC: 0x100037FC. Per thread: Read PC: 0x100037FC, …, Occurrence: 34.]

TAPE Conflict Report

Read_PC   Object_Addr  Occurence  Loss     Write_Proc  Read in source line
10001390  100830e0     30         6446858  1           ..//vacation/manager.c:134
10001500  100830e0     32         1265341  3           ..//vacation/manager.c:134
10001448  100830e0     29         766816   4           ..//vacation/manager.c:134
10005f4c  304492e4     3          750669   6           ..//lib/rbtree.c:105

Now, programmers know
  Where the conflicts are
  What the conflicting objects are
  Who the conflicting threads are
  How expensive the conflicts are

Productive performance tuning!
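A report like the one above could be produced by aggregating individual conflict events. A minimal sketch, assuming each event is a tuple of (read PC, object address, wasted cycles, writing processor); the function name `build_report` and the sample events are invented for illustration and are not TAPE's actual data format or measurements.

```python
from collections import defaultdict


def build_report(events):
    """Aggregate conflict events into rows sorted by total loss."""
    totals = defaultdict(lambda: [0, 0])  # key -> [occurrences, loss]
    for read_pc, obj_addr, wasted_cycles, write_proc in events:
        entry = totals[(read_pc, obj_addr, write_proc)]
        entry[0] += 1
        entry[1] += wasted_cycles
    rows = [
        (read_pc, obj_addr, occurrences, loss, write_proc)
        for (read_pc, obj_addr, write_proc), (occurrences, loss)
        in totals.items()
    ]
    # Most expensive conflicts first, as in the report on the slide.
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


# Invented sample events, not measurements from ATLAS.
events = [
    (0x10001390, 0x100830E0, 500, 1),
    (0x10005F4C, 0x304492E4, 300, 6),
    (0x10001390, 0x100830E0, 700, 1),
]
report = build_report(events)
for read_pc, obj_addr, occurrences, loss, write_proc in report:
    print(f"{read_pc:08x} {obj_addr:08x} {occurrences} {loss} {write_proc}")
```

Sorting by loss rather than by occurrence count puts the conflicts that waste the most cycles at the top, which is the ordering the slide's report appears to use.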

Runtime Overhead

Base overhead
  2.7% for 1p

Overhead from real conflicts
  A configuration with more CPUs has a higher chance of conflicts
  Max. 5% in total

Conclusion

An operating system for hardware TM
  A dedicated CPU for the operating system
  Proxy kernel on application CPU
  Separate communication channel between them

Productivity tools for parallel programming
  ReplayT: Deterministic replay
  AVIO-TM: Atomicity violation detection
  TAPE: Runtime performance bottleneck monitor

Full-system prototyping & evaluation
  Convincing proof-of-concept

RAMP Tutorial

ISCA 2006 and ASPLOS 2008

Audience of >60 people (academia & industry)
  Including faculty from Berkeley, MIT, and UIUC

Parallelized, tuned, and debugged apps with ATLAS
  From speedup of 1 to ideal speedup in a few minutes
  Hands-on experience with real system

“most successful hands-on tutorial in last several decades” - Chuck Thacker (Microsoft Research)

Acknowledgements

My wife So Jung and our baby (coming soon)
My parents who have supported me for last 30 years
My advisors: Christos Kozyrakis and Kunle Olukotun
My committee: Boris Murmann and Fouad A. Tobagi
Njuguna Njoroge, Jared Casper, Jiwon Seo, Chi Cao Minh, and all other TCC group members
RAMP community and BEE2 developers
Shan Lu from UIUC
Samsung Scholarship
All of my friends at Stanford & my Church


Backup Slides

Single core’s Performance Trend

[Figure: performance (vs. VAX-…) on a log scale from 1 to 10000, plotted for 1978 to 2006, with growth-rate annotations 25%/year, 52%/year, and ??%/year?]

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006

TAPE Conflict

[Figure: timeline in which T0 writes X and T1 reads X. Tracked fields: Object (X), Shooting Thread ID (T0), Read PC (0x100037FC), Occurrence (4), Wasted cycles (2,453). Labels: TCC cache, PowerPC, software counter.]


Memory transaction vs. Database transaction


TLB miss handling

Syscall Handling

ReplayT Extensions

Unique replay
  Problem: maximize usefulness of test runs
  Approach: shuffle commit order to generate unique scenarios

Replay with monitoring code
  Problem: replay accuracy after recompilation
  Approach: faithfully repeat commit order if binary changes
    E.g., printf statements inserted for monitoring purposes

Cross-platform replay
  Problem: debugging on multiple platforms
  Approach: support for replaying log across platforms & ISAs

Integration with GDB

Breakpoints
  Software breakpoint == self-modifying code
  Breakpoints may be buffered in the TCC $ until the end of transactions, so it is better to set them in the OS core

Traps
  Stop all threads: controlling token arbiter
  Debug only committable transaction: acquiring commit token

Stepping
  Backward stepping using abort & restart

Data-watch

Intermediate Write Analyzer

Intermediate write
  A write that is overwritten by a local or remote thread before it is read by a remote thread

Intermediate writes in the correct runs
  Potential bugs: they can be read by a remote thread at some point
  Analyze the buggy run to see whether any intermediate write is read by remote threads

Why in TM?
  At the granularity of single memory accesses, there would be too many intermediate writes that are actually safe
  Too high a false positive rate

Buffer Overflow

[Figure: timeline of T0 and T1 with phases Computation, Arbitration, Commit, and Token Hold. Marked events: Overflow, Overflow, Commit, Commit.]

Mis-speculation wastes computation cycles in T1

TAPE Overflow

[Figure: an overflow followed by a commit. Tracked fields: Overflowed PC (0x10004F18), Type (LRU overflow), Occurrence (4), Duration (35,072 cycles). Labels: TCC cache, PowerPC, software counter.]

ATLAS’ Contribution on TAPE

Evaluation on real hardware
  “In theory, there is no difference between theory and practice. But, in practice, there is.” - Jan van de Snepscheut

Optimization
  Minimizes HW modification from original proposal
  Eliminates some information to track
    Runtime overhead vs. usefulness of the information


Why not SMP kernel?


What is strong isolation?

TCC vs. SLE

Speculative Lock Elision (SLE) [Rajwar & Goodman’01]
  Speculate through locks
  If a conflict is detected, it aborts ALL involved threads
  No guarantee of forward progress

TLR: Transactional Lock Removal [Rajwar & Goodman’02]
  Extended from SLE
  Guarantees forward progress by giving priority to the oldest thread

TCC vs. TLS

TLS (Thread-level speculation)
  Maintains serial execution order
  Forwards speculative states from less speculative threads to more speculative threads

Programming with TM

void deposit(account, amount)
synchronized(account) {
  int t = bank.get(account);
  t = t + amount;
  bank.put(account, t);
}

void withdraw(account, amount)
synchronized(account) {
  int t = bank.get(account);
  t = t - amount;
  bank.put(account, t);
}

void deposit(account, amount)
atomic {
  int t = bank.get(account);
  t = t + amount;
  bank.put(account, t);
}

void withdraw(account, amount)
atomic {
  int t = bank.get(account);
  t = t - amount;
  bank.put(account, t);
}

Declarative synchronization
  Programmers say what but not how
  No explicit declaration or management of locks

System implements synchronization
  Typically with optimistic concurrency
  Slow down only on true conflicts (R-W or W-W)

AVIO’s serializability analysis

[Figure: eight interleaving cases built from read (R) and write (W) accesses. Four cases are marked OK and four are marked BUG1 to BUG4.]

* OK, if interleaved access is serializable
* Possibly atomicity violation, if unserializable
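The OK/BUG split can be written down as a small lookup. The sketch below assumes the classification used by AVIO [Lu’06], where each case is a triple of (previous local access, interleaved remote access, current local access) and four of the eight triples are unserializable; the names `UNSERIALIZABLE` and `classify` are hypothetical, and the mapping of triples to the labels BUG1 to BUG4 is not attempted.

```python
from itertools import product

# Assumption: the four unserializable triples as classified by AVIO.
# Each triple is (previous local, interleaved remote, current local).
UNSERIALIZABLE = {
    ("R", "W", "R"),  # the two local reads see different values
    ("W", "W", "R"),  # the local read misses the local write
    ("W", "R", "W"),  # the remote read sees an intermediate value
    ("R", "W", "W"),  # the local write is based on a stale read
}


def classify(prev_local, remote, curr_local):
    """Return "OK" if the interleaving is serializable, else "BUG"."""
    triple = (prev_local, remote, curr_local)
    return "BUG" if triple in UNSERIALIZABLE else "OK"


# Enumerate all eight cases.
table = {
    triple: classify(*triple)
    for triple in product("RW", repeat=3)
}
for triple, verdict in table.items():
    print("-".join(triple), verdict)
```

In AVIO-TM the same check would be applied at transaction granularity rather than to individual loads and stores, which is what keeps the amount of data to analyze small.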