
Lock Reservation: Java Locks Can Mostly Do Without Atomic Operations

Kiyokuni Kawachiya, Akira Koseki, Tamiya Onodera
IBM Research, Tokyo Research Laboratory
1623-14, Shimotsuruma, Yamato, Kanagawa 242-8502, Japan
{kawatiya,akoseki,tonodera}@jp.ibm.com

ABSTRACT
Because of the built-in support for multi-threaded programming, Java programs perform many lock operations. Although the overhead has been significantly reduced in the recent virtual machines, one or more atomic operations are required for acquiring and releasing an object’s lock even in the fastest cases.

This paper presents a novel algorithm called lock reservation. It exploits thread locality of Java locks, which claims that the locking sequence of a Java lock contains a very long repetition of a specific thread. The algorithm allows locks to be reserved for threads. When a thread attempts to acquire a lock, it can do without any atomic operation if the lock is reserved for the thread. Otherwise, it cancels the reservation and falls back to a conventional locking algorithm.

We have evaluated an implementation of lock reservation in IBM’s production virtual machine and compiler. The results show that it achieved performance improvements up to 53% in real Java programs.

Categories and Subject Descriptors
D.3.4 [Programming Languages]: Processors - optimization

General Terms
Languages, Algorithms, Performance, Measurement, Experimentation

Keywords
Java, synchronization, monitor, lock, reservation, thread locality, atomic operation

1. INTRODUCTION
One important characteristic of the Java programming language [17] is the built-in support for multi-threaded programming. For synchronization between independently executing threads, Java adopts semantics based on monitors [11, 18], and has monitors associated with objects.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
OOPSLA’02, November 4-8, 2002, Seattle, Washington, USA.
Copyright 2002 ACM 1-58113-417-1/02/0011 ...$5.00.

The language constructs for synchronization are synchronized methods and blocks. When a thread executes a synchronized method against an object or a synchronized block with an object, the thread acquires the object’s lock before the execution and releases the lock after the execution. Thus, at most one thread can execute the synchronized method or the synchronized block.

Because of the built-in support for multi-threaded programming, libraries in Java tend to be designed to be thread-safe, containing many methods declared as synchronized. As a result, Java applications perform a significant number of lock operations. It was reported that 19% of the total execution time was wasted by thread synchronization in an early version of the Java virtual machine [4].

Many techniques have since been proposed for optimizing locks in Java, which can be divided into two categories, runtime techniques and compile-time techniques. The former attempt to make lock operations cheaper [2, 6, 13, 34], while the latter attempt to eliminate lock operations [3, 9, 10, 12, 38, 44].

Almost all the runtime techniques follow the principle of optimizing common cases. They exploit the observation that Java locks are normally not contended, and optimize the uncontended cases. These techniques allow a lock to be acquired and released with only a few machine instructions in the absence of contention. However, the instruction sequence inevitably contains one or more compound atomic operations such as compare_and_swap. Considering that atomic operations are especially expensive in modern architectures, the synchronization has not yet become sufficiently light, though the overhead has been significantly reduced.

This paper proposes a new runtime technique called lock reservation. It also follows the principle of optimizing common cases. The observation exploited is the biased distribution of lockers called thread locality. That is, for a given object, the lock tends to be dominantly acquired and released by a specific thread, which is obviously the case in single-threaded applications¹.

The key idea is to allow a lock to be reserved for a thread. When a thread attempts to acquire an object’s lock, the acquisition is ultra-fast if the lock is reserved for the thread. In particular, it does not require any atomic operation. On the other hand, if the lock is reserved for another thread, the reservation must first be canceled, and the acquisition falls back to an existing algorithm.

¹ Java virtual machines may create internal helper threads, so Java programs can never be single-threaded in the strict sense.

Table 1: Benchmark programs

Program name      Multi-threaded?   Description
SPECjvm98                           Run each program three times in the application mode.
  _202_jess       No                Expert shell system solving a set of puzzles
  _201_compress   No                LZW compression and decompression
  _209_db         No                Perform database functions on memory resident database
  _222_mpegaudio  No                Decompress MP3 audio files
  _228_jack       No                Parser generator generating itself
  _213_javac      No                Java source-to-bytecode compiler from the JDK 1.0.2
  _227_mtrt       Yes               Two-threaded ray tracer
SPECjbb2000       Yes               Simulate the operations of a TPC-C like business logic, run for 8 warehouses.
Volano Server     Yes               Chat room simulator
Volano Client     Yes               Chat client, creating 200 connections and sending 100 messages per connection.

[Figure 1: General thread locality and exploitable thread locality. Timelines of two objects, each from creation to garbage collection, where a letter X denotes that thread X acquires the lock. One locking sequence is labeled "Exploitable locality" and the other "Difficult-to-exploit locality".]

As we see later, lock reservation can be built on any existing locking algorithm, as long as it uses a word or field in the object header and has one available bit. This bit is used for representing the reservation status. When the status bit is set, the meaning of the rest of the bits is defined by our lock reservation algorithm, while when the bit is not set, the meaning is defined by the underlying algorithm.

The rest of the paper is organized as follows. Section 2 shows the thread locality of locks in real Java programs. Section 3 describes the algorithm of lock reservation. Section 4 presents performance results, while Section 5 discusses the related work. Finally, Section 6 offers conclusions.

2. THREAD LOCALITY OF JAVA LOCKS
This section studies the thread locality of Java locks, which we exploit for reducing the synchronization overhead of Java programs. Thread locality of a lock is defined in terms of the locking sequence, the sequence of threads (in temporal order) that acquire the lock. The general form of thread locality is stated as follows. For a given lock, if its locking sequence contains a very long repetition of a specific thread, the lock is said to exhibit thread locality, while the specific thread is said to be the dominant locker.

However, the general form of thread locality is not easy to exploit, since we consider adaptive optimization of locks rather than static optimization using off-line profiles. When the locking sequence of a lock is currently being constructed, it is very hard for the runtime system to cheaply determine whether the lock exhibits thread locality or whether the current locker is the dominant locker.

Table 2: Exploitable thread locality of Java locks

Program name      Number of        Number of         Ratio of lock ops.
                  sync'd objects   lock operations   in 1st repetitions
SPECjvm98
  _202_jess              21278          14646978     99.993%
  _201_compress           2135             28895     97.211%
  _209_db                66592         162117521     99.9998%
  _222_mpegaudio          1620             27168     98.108%
  _228_jack            1635497          38570415     99.998%
  _213_javac           1192734          47062772     99.974%
  _227_mtrt               3020           3522926     99.557%
SPECjbb2000²           2077210         102282147     79.392%
Volano Server             7279           7244208     75.983%
Volano Client             4102          10419671     84.270%

Thus, a stronger form of thread locality is considered for exploitability, which is described as follows. For a given lock, if the locking sequence starts with a very long repetition of a specific thread, the lock is said to show exploitable thread locality. When the lock exhibits exploitable thread locality, the initial locker is the dominant locker. Figure 1 shows two objects, one with general but not exploitable locality, and the other with exploitable locality.
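The definition of exploitable thread locality can be made concrete with a small sketch. It is not taken from the paper's instrumentation: the function names and the representation of a locking sequence as an array of numeric thread IDs are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the leading run of acquisitions by the initial locker.
 * seq[i] is a hypothetical numeric ID of the thread that performed
 * the i-th acquisition of one lock, in temporal order. */
size_t first_repetition_length(const int *seq, size_t n)
{
    assert(seq != NULL || n == 0);
    if (n == 0) return 0;
    size_t len = 1;
    while (len < n && seq[len] == seq[0]) len++;
    return len;
}

/* Fraction of a lock's acquisitions that fall in that leading run. */
double first_repetition_ratio(const int *seq, size_t n)
{
    if (n == 0) return 0.0;
    return (double)first_repetition_length(seq, n) / (double)n;
}
```

A lock whose sequence is 3, 3, 3, 3, 2, 3 has a leading run of length 4, whereas the sequence 1, 3, 2, 2, 2, 2, 2 has a long repetition of thread 2 but a leading run of length 1, so its locality is general but not exploitable in the sense defined above.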

To investigate how many objects show exploitable thread locality in real programs, we gathered locking statistics using an instrumented version of the IBM Development Kit for Windows, Java Technology Edition, Version 1.3.1 [20]. We measured the Java programs listed in Table 1: the seven programs of SPECjvm98 [40], SPECjbb2000 [39] for eight warehouses, and the server and client programs of the Volano Mark [43]. Among these programs, _227_mtrt, SPECjbb2000, and the Volano Mark are multi-threaded programs. We ran these programs with the JIT compiler disabled, since some locks would otherwise be optimized away by compiler optimizations.

The focus in our measurements is the first repetition in the locking sequence of each lock. This is the beginning subsequence consisting only of the initial locker³. If the first repetition of a lock is very long, the lock shows exploitable thread locality. Table 2 presents the results⁴, including the total number of synchronized objects, the total number of lock operations, and the ratios of lock operations in the first repetitions. As shown in the table, the vast majority of lock operations are performed by the initial lockers. Even for multi-threaded programs, more than 75% of the lock operations were performed by the initial lockers in the first repetitions. Thus, we can draw the conclusion that a significant number of objects exhibit exploitable thread locality.

² The total number of locks for SPECjbb2000 varies depending on the execution speed.
³ The length of the first repetition may be one. Also, the initial locker may appear again after the first repetition.
⁴ The results shown here are for the complete execution of each program, including lock operations during the program startup and shutdown.

Figure 2: Lockword structure and semantics

Lockword structure: [ tid (thread ID) | rcnt (recursion count) | LRV bit ]
Reserve mode (LRV bit = 1):
  (a) [ A | 0  | 1 ]  Reserved for Thread A, but not held
  (b) [ A | >0 | 1 ]  Reserved for and held by Thread A
  (c) [ 0 | 0  | 1 ]  Reserved anonymously (will be reserved by the initial locker)
Base mode (LRV bit = 0): the remaining bits are defined by the base lock

Notice that the ratios in the last column are not 100% even for single-threaded programs, since the virtual machine creates system threads for internal tasks such as finalization. We also note that the initial locker of an object is not necessarily the creator of the object. This happens in the Volano Mark programs, where a single thread is dedicated to creating objects and passing them to worker threads that actually use the objects.

3. LOCK RESERVATION
This section presents a new locking algorithm called lock reservation. It exploits the observation that Java locks show thread locality, as discussed in the previous section. The key idea is to reserve locks for threads. When a thread attempts to acquire an object’s lock, one of the following actions is taken:

1. If the object’s lock is reserved for the thread, the runtime system allows the thread to acquire the lock with a few instructions involving no atomic operation.

2. If the object’s lock is reserved for another thread, the runtime system cancels the reservation, and falls back to a conventional algorithm for further processing.

3. If the object’s lock is not reserved, the runtime system uses a conventional algorithm.

Our algorithm can be built on any existing locking algorithm, as long as it uses a lockword⁵, a word in the object header for locking, and allows one bit to be available in the lockword. The bit is used for representing the lock reservation status, and hence named the LRV bit. When the LRV bit is set, the lockword is in the reserve mode, and the structure is defined by our algorithm. When the bit is not set, the lockword is in the base mode, and the structure is defined by the underlying algorithm that the runtime system falls back to after canceling the reservation.

⁵ Actually, we don’t need the whole 32 bits of the word, and could put in the word other information unrelated to locking. However, for the sake of explanation, we assume that the whole word is used for locking.

[Figure 3: Lock state transitions. State diagram with a reserve-mode side and a base-mode side. Object creation yields the anonymously reserved state [0:0:1]. The initial acquire moves the lockword to Acquired [A:1:1]; subsequent acquire and release operations move it among Reserved for Thread A [A:0:1], Acquired [A:1:1], Recursively acquired [A:2:1], and so on. An unreserve transition leads from each reserve-mode state to the corresponding base-mode state (LRV bit 0), which is handled by the base locking algorithm.]

3.1 Lockword Structure
Figure 2 shows the structure of the lockword. When the LRV bit is set, the lockword is in the reserve mode, and is further divided into the thread identifier (tid) field and the recursion count (rcnt) field. The former field contains an identifier of the owner thread, for which the lock is reserved, while the latter field keeps the lock recursion level. When the rcnt field is zero, the lock is reserved but not held by any thread (Figure 2(a)). When the field is non-zero, the lock is held by the owner thread (Figure 2(b)). As we will see later, the owner thread can acquire the lock by simply incrementing the rcnt field, with no atomic operation.
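The reserve-mode layout can be sketched with explicit shifts and masks. The paper leaves the field widths open (N and M in Figure 4), so the 23-bit tid and 8-bit rcnt below, the bit positions, and all identifiers are assumptions chosen only to make the example concrete.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for this sketch only: [ tid:23 | rcnt:8 | LRV:1 ]. */
#define LRV_BIT    0x1u
#define RCNT_SHIFT 1
#define RCNT_MASK  0xFFu
#define TID_SHIFT  9

typedef uint32_t lockword_t;

/* Build a reserve-mode lockword. tid 0 means "anonymously reserved". */
lockword_t make_reserved(uint32_t tid, uint32_t rcnt)
{
    assert(tid < (1u << 23));
    assert(rcnt <= RCNT_MASK);
    return (tid << TID_SHIFT) | (rcnt << RCNT_SHIFT) | LRV_BIT;
}

uint32_t lw_tid(lockword_t w)  { return w >> TID_SHIFT; }
uint32_t lw_rcnt(lockword_t w) { return (w >> RCNT_SHIFT) & RCNT_MASK; }

enum lock_state { BASE_MODE, ANONYMOUS, RESERVED_NOT_HELD, RESERVED_HELD };

/* Map a lockword to the states described for Figure 2. */
enum lock_state classify(lockword_t w)
{
    if ((w & LRV_BIT) == 0) return BASE_MODE;          /* defined by base lock */
    if (lw_tid(w) == 0)     return ANONYMOUS;          /* Figure 2(c) */
    if (lw_rcnt(w) == 0)    return RESERVED_NOT_HELD;  /* Figure 2(a) */
    return RESERVED_HELD;                              /* Figure 2(b) */
}

/* Owner-only acquisition: bump rcnt with ordinary arithmetic. */
lockword_t owner_acquire(lockword_t w)
{
    assert(lw_rcnt(w) < RCNT_MASK);   /* overflow is a special case */
    return w + (1u << RCNT_SHIFT);
}
```

With this layout, a lockword reserved for thread 5 and not held becomes held after a single addition, which is the property the owner's non-atomic fast path relies on.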

The rcnt field is also intended for recursive locking, which is fairly common in Java. The owner thread acquires the lock recursively by simply incrementing the rcnt field, in just the same manner as it initially acquires the lock. We must maintain the recursion count of a lock since Java does not allow a thread to release a lock more times than it acquires the lock. The virtual machine must detect such an illegal state and raise an instance of IllegalMonitorStateException.

When an object is created, the lock is anonymously reserved. That is, the lockword is in the reserve mode, but not reserved for or held by any particular thread (Figure 2(c)). This is because the thread for which the lock should be reserved is normally not known at the time of creation.

In general, a reservation policy determines when and for which thread a lock is reserved. Since we base our algorithm on exploitable thread locality from the previous section, we use the initial-locker policy in our algorithm. That is, when an object is locked for the first time by a thread, we reserve the object’s lock for that thread.
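The initial-locker policy amounts to one compare-and-swap from the anonymous state. The following is a minimal sketch using C11 atomics rather than the paper's own compare_and_swap primitive; the [tid:23 | rcnt:8 | LRV:1] layout and the function name are assumptions for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout for this sketch only: [ tid:23 | rcnt:8 | LRV:1 ]. */
#define ANONYMOUS_WORD 0x1u   /* tid = 0, rcnt = 0, LRV = 1 */

/* Try to turn an anonymous reservation into a reservation for my_tid,
 * acquiring the lock at the same time (rcnt = 1). Returns true only for
 * the single thread whose compare-and-swap succeeds. */
bool reserve_for_initial_locker(_Atomic uint32_t *lockword, uint32_t my_tid)
{
    uint32_t expected = ANONYMOUS_WORD;
    uint32_t desired  = (my_tid << 9) | (1u << 1) | 1u;
    return atomic_compare_exchange_strong(lockword, &expected, desired);
}
```

If two threads race on a fresh object, exactly one call returns true; the loser observes a lockword that is no longer anonymous and has to re-examine it, as the retry in the full algorithm does.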

When the reservation is canceled, the LRV bit is reset, and the lockword is put in the base mode. The structure is completely defined by the base algorithm. As we will see later, canceling a reservation is the most challenging part of our algorithm, requiring the owner thread to be suspended. The cancellation replaces the lockword in the reserve mode with the corresponding state in the base algorithm.
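What "the corresponding state in the base algorithm" means depends entirely on the base lock, which this section does not specify. Purely as an illustration, the sketch below assumes a hypothetical thin-lock-style base layout in which a zero word means unlocked and a nesting field stores the number of acquisitions minus one; the layout and the function name are not from the paper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layouts, for illustration only:
 *   reserve mode: [ tid:23 | rcnt:8 | 1 ]  rcnt = number of acquisitions
 *   base mode:    [ tid:23 | nest:8 | 0 ]  all-zero word = unlocked,
 *                                          nest = acquisitions - 1      */
uint32_t base_equivalent_sketch(uint32_t reserved)
{
    assert(reserved & 1u);                    /* must be in reserve mode */
    uint32_t tid  = reserved >> 9;
    uint32_t rcnt = (reserved >> 1) & 0xFFu;
    if (rcnt == 0)
        return 0;                             /* reserved, not held -> unlocked */
    return (tid << 9) | ((rcnt - 1) << 1);    /* held -> held by the same thread */
}
```

The point of the mapping is that cancellation preserves who holds the lock and how many times; only the reservation itself is dropped.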

Figure 3 depicts the state transitions of the lockword in our algorithm.


3.2 Algorithm
Figure 4 shows the algorithm of lock reservation⁶. A thread attempting to acquire an object’s lock calls the acquire() function, where it reads the lockword, and performs four checks to see if it is not in a special state (lines 21–24). If it passes all the checks, the lock is in the most common state where the thread owns the lock’s reservation. It completes the lock acquisition by simply incrementing the rcnt field (line 28).

Similarly, a thread attempting to release an object’s lock calls the release() function, where it first reads the lockword, and performs three checks to see if it is not in a special state (lines 52–54). When it passes all the checks, the function finishes the lock release by simply decrementing the rcnt field (line 58). Thus, it only takes a few non-atomic instructions to acquire and release a lock in the most common case when the thread owns the reservation.

There are three special cases in the acquire() function. First, when the lock is anonymously reserved (line 22), the function attempts to make it specifically reserved by using compare_and_swap (line 33). Second, when the lock is reserved for another thread (line 23), the thread calls the unreserve() function to cancel the reservation (line 37), and falls back to the base algorithm. This second special case also results when the thread owns the reservation but the recursion count has reached the maximum value (line 24). Third, when the lockword is not in the reserve mode (line 21), the thread executes the corresponding function of the base algorithm (line 40).

There is only one legal special case in the release() function. That is, when the lockword is not in the reserve mode (line 52), the function invokes the corresponding function in the base algorithm (line 65). The Java specification [17] requires that, when a thread attempts to release a lock, the thread actually holds the lock. Otherwise the runtime system must raise an instance of IllegalMonitorStateException. The checks in lines 53 and 54 detect the illegal state in the reserve mode.

We now explain cancellation of a reservation, the most complicated part of our algorithm, which the unreserve() function is responsible for. Basically, a thread calls the function when the thread attempts to acquire a lock which is reserved for another thread⁷. The calling thread atomically replaces the lockword in the reserve mode with the equivalent state in the base algorithm. In doing so, it first suspends the owner thread (line 74), modifies the lockword using the atomic operation (line 80), and resumes the suspended thread (line 90).

Special care must be taken when the owner thread is in the middle of the acquire() or release() functions, more specifically, when it is in one of the unsafe regions which are between the read and write of the lockword in the acquire() (lines 18–28) and release() functions (lines 49–58). To avoid a data race condition, the unreserve() function obtains the execution context of the suspended thread (line 83) to see whether the thread is in one of the unsafe regions. If it is in an unsafe region, the function modifies the program counter with the address of the corresponding retry point (line 17 or 48). Notice that each unsafe region was carefully made restartable by preventing any side effects from occurring.

⁶ For readability, the code shown here is slightly different from the actual code. For instance, the condition checks in the beginning of the acquire() and release() functions are merged into two checks in the actual code. Also, the base_acquire() and base_release() functions are tightly coupled with the acquire() and release() functions, respectively.
⁷ The unreserve() function is also called when the rcnt is about to overflow or when the wait() method is called.

Finally, after a lock’s reservation is canceled, our algorithm does not return the lock back to the reserve mode. The algorithm supporting repeated reservation would become too complicated, while it might result in more cancellations and degrade performance. In addition, the investigations in the previous section show that most lock operations can be performed in the reserve mode even without repeated reservation.

3.3 Correctness
We now discuss the correctness of our algorithm. As we have shown, a thread does not have to execute any atomic operation in acquiring and releasing a lock when it owns the reservation. In other words, the owner thread can read-modify-write the lockword without atomic operations. Thus, when a different thread attempts to change the lockword between the read and the write, special care must be taken to prevent the modification from being lost. The lock state would otherwise become inconsistent.

When a thread does not own a lock’s reservation, our algorithm requires the thread to call the unreserve() function, where the thread without the reservation modifies the lockword after suspending the owner thread. When the latter thread is suspended in the middle of an unsafe region, it is forced to restart the unsafe region, detecting that it no longer has the reservation. This prevents the thread from continuing the execution based on the no-longer-valid assumption that the thread still owns the reservation.

The owner thread may have already completed the computation and ceased to exist when another thread attempts to cancel a reservation. Although the unreserve() function must also handle this case properly, there is no risk of a data race condition involving the owner thread.

More than one thread may simultaneously try to make an anonymous reservation specific (line 33) or try to convert the lockword in the reserve mode to the base mode (line 80). However, it is guaranteed that only one thread succeeds since atomic operations are used in both cases.

Once the reservation is canceled, the lockword will never be reserved again. Thus, after the cancellation, our algorithm behaves in exactly the same manner as the base algorithm, and the correctness is ensured by the correctness of the base algorithm.

3.4 Discussion
This section considers the performance characteristics of lock reservation, discusses in detail how to determine whether a thread has been suspended in the middle of an unsafe region and how to cancel reservations, and explains multiprocessor issues.

Performance Characteristics
Our algorithm is strongly expected to reduce the synchronization overhead when the reservation succeeds, since the owner thread can acquire and release the lock by simply reading and writing the lockword without any atomic operations.

 1 : // Lockword structure in each object header
 2 : struct Object {
 3 :   :
 4 :   struct lockword {            // [tid:rcnt:R]
 5 :     unsigned int tid : N;      // Thread ID of the owner thread.
 6 :     unsigned int rcnt : M;     // Recursion count. Non-zero denotes that the lock is acquired.
 7 :     unsigned int reserve : 1;  // LRV bit. One denotes that the lock is reserved.
 8 :   } lockword;
 9 :   :
10 : };
11 :
12 : int acquire(struct Object *obj)
13 : {
14 :   struct lockword l1, l2;
15 :   int myTID = thread_id();
16 :
17 : retry_acquire:
18 :   l1 = obj->lockword;                                          // read the lockword --(1), unsafe region begins
19 :
20 :   // check special cases
21 :   if (l1.reserve == 0) goto base_acquire;                      // [xxxxxx:0] not reserved
22 :   if (l1.tid == 0) goto make_specific;                         // [0:0:1] anonymously reserved
23 :   if (l1.tid != myTID) goto unreserve_and_base_acquire;        // [other:xxx:1] reserved for another thread
24 :   if (l1.rcnt == RCNT_MAX) goto unreserve_and_base_acquire;    // [myTID:max:1] rcnt reached the maximum
25 :
26 :   // reserved for me, and rcnt does not reach the maximum
27 :   l2 = l1; l2.rcnt++;                                          // [myTID:rcnt:1] -> [myTID:rcnt+1:1]
28 :   obj->lockword = l2;                                          // write the lockword --(2), unsafe region ends
29 :   return SUCCESS;
30 :
31 : make_specific:
32 :   l2 = l1; l2.tid = myTID; l2.rcnt = 1;
33 :   if (compare_and_swap(&obj->lockword, l1, l2) != SUCCESS) goto retry_acquire; // [0:0:1] -> [myTID:1:1]
34 :   return SUCCESS;
35 :
36 : unreserve_and_base_acquire:
37 :   unreserve(obj, l1.tid, myTID);                               // [xxx:xxx:1] -> [xxxxxx:0]
38 :
39 : base_acquire:
40 :   return base_acquire(obj);                                    // if not reserved, call the function for the base mode
41 : }
42 :
43 : int release(struct Object *obj)
44 : {
45 :   struct lockword l1, l2;
46 :   int myTID = thread_id();
47 :
48 : retry_release:
49 :   l1 = obj->lockword;                                          // read the lockword --(1), unsafe region begins
50 :
51 :   // check special cases
52 :   if (l1.reserve == 0) goto base_release;                      // [xxxxxx:0] not reserved
53 :   if (l1.tid != myTID) goto illegal_state;                     // [other:xxx:1] reserved for another thread
54 :   if (l1.rcnt == 0) goto illegal_state;                        // [myTID:0:1] rcnt is zero
55 :
56 :   // reserved for and held by me
57 :   l2 = l1; l2.rcnt--;                                          // [myTID:rcnt:1] -> [myTID:rcnt-1:1]
58 :   obj->lockword = l2;                                          // write the lockword --(2), unsafe region ends
59 :   return SUCCESS;
60 :
61 : illegal_state:
62 :   return IllegalMonitorStateException;
63 :
64 : base_release:
65 :   return base_release(obj);                                    // if not reserved, call the function for the base mode
66 : }
67 :
68 : void unreserve(struct Object *obj, int ownerTID, int myTID)
69 : {
70 :   struct lockword l1, l2;
71 :   struct Context context;
72 :
73 :   if (ownerTID == myTID) ownerTID = 0;                         // don’t suspend myself
74 :   thread_suspend(ownerTID);                                    // no-op when the target thread does not exist
75 :
76 : retry_unreserve:
77 :   l1 = obj->lockword;
78 :   if (l1.reserve == 0) goto already_unreserved;                // already unreserved by someone
79 :   l2 = base_equivalent_lockword(l1);                           // create the equivalent lock state in the base mode
80 :   if (compare_and_swap(&obj->lockword, l1, l2) != SUCCESS) goto retry_unreserve; // [xxx:xxx:1] -> [xxxxxx:0]
81 :
82 :   // modify the owner thread’s context if it is in an unsafe region
83 :   if (thread_get_context(ownerTID, &context) == SUCCESS) {
84 :     if (in_unsafe_region(context.pc)) {                        // check if (1) < next PC <= (2)
85 :       context.pc = retry_point(context.pc);                    // move the PC to the corresponding retry point
86 :       thread_set_context(ownerTID, &context);
87 :   } }
88 :
89 : already_unreserved:
90 :   thread_resume(ownerTID);
91 : }

Note. Each of the thread-manipulating functions (thread_suspend(), thread_resume(), thread_get_context(), and thread_set_context()) does nothing and returns FAIL if the target thread does not exist. The thread_suspend() function can be called multiple times, where the target thread will be resumed after thread_resume() is called the same number of times. Note that the thread_suspend() and thread_resume() functions are unrelated to the deprecated Java methods suspend() and resume() in the java.lang.Thread class.

Figure 4: Algorithm of lock reservation

When a lock is not reserved, our algorithm falls back to the base algorithm with almost no additional overhead. It simply requires two additional checks, one in the acquire() function (line 21) and the other in the release() function (line 52). However, depending on the details of the base algorithm, we can completely eliminate the additional overhead. That is, if the base algorithm starts the lock acquisition and the lock release by testing one or more bits in the lockword, we may be able to merge the additional checks of our algorithm into the testing. Actually, this is the case in our implementation which we present in Section 4.

The greatest concern in terms of performance is reservation cancellation in the unreserve() function, which relies on expensive system calls such as thread_suspend() and thread_get_context(). However, since we do not reserve locks repeatedly, the cancellation occurs at most once during the lifetime of an object. As we will show in the next section, the ratios of cancellations to lock operations are less than 0.05% in actual lock-intensive programs. Thus, we believe that performance loss from cancellation does not offset the performance gain by reservation success except for artificially created pathological benchmarks.
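One rough way to read this claim is a per-operation break-even condition. The model and its symbols are assumptions introduced here, not part of the paper: let $s$ be the average time saved by a lock operation executed in the reserve mode, $c$ the cost of one cancellation, $p$ the fraction of lock operations executed in the reserve mode, and $r$ the ratio of cancellations to lock operations.

```latex
\[
  p\,s - r\,c > 0
  \quad\Longleftrightarrow\quad
  \frac{c}{s} < \frac{p}{r}
\]
```

With $r < 0.05\% = 5 \times 10^{-4}$ as quoted above and, purely as an illustrative value, $p = 0.75$, the bound is $c/s < 1500$. Under these assumptions a single cancellation could cost on the order of a thousand times the per-operation saving before reservation stops paying off.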

Unsafe Regions

If a thread always acquires and releases an object’s lock by calling the runtime functions acquire() and release(), respectively, we have only two unsafe regions in the virtual machine. The in_unsafe_region() function (line 84) then has only to perform two range checks, which is easy to implement.

However, the JIT compiler may inline the synchronization operations into the generated code. This results in many unsafe regions in the virtual machine, which we must register in a data structure with the corresponding retry addresses. Given a program counter, the in_unsafe_region() function searches the data structure to see if the program counter points to any unsafe region.
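One way to organize such a registry is a sorted array searched by binary search. The following is a minimal sketch under stated assumptions, not the data structure of our implementation: the struct layout and the names unsafe_region, register_unsafe_regions(), and find_unsafe_region() are hypothetical, and only in_unsafe_region() and retry_point() correspond to names in Figure 4. The range test follows the comment on line 84, (1) < PC <= (2).

```c
#include <stddef.h>
#include <stdint.h>

/* One registered unsafe region (hypothetical layout, for illustration). */
struct unsafe_region {
    uintptr_t begin;  /* address (1): the region covers begin < pc <= end */
    uintptr_t end;    /* address (2) */
    uintptr_t retry;  /* address at which the owner thread restarts */
};

/* Table assumed to be sorted by address and free of overlaps. */
static const struct unsafe_region *region_table;
static size_t region_count;

/* Hypothetical registration hook, e.g. called by the JIT compiler. */
static void register_unsafe_regions(const struct unsafe_region *table, size_t count)
{
    region_table = table;
    region_count = count;
}

/* Binary search for the region containing pc; returns NULL if there is none. */
static const struct unsafe_region *find_unsafe_region(uintptr_t pc)
{
    size_t lo = 0, hi = region_count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct unsafe_region *r = &region_table[mid];
        if (pc <= r->begin)
            hi = mid;          /* pc lies before this region */
        else if (pc > r->end)
            lo = mid + 1;      /* pc lies after this region */
        else
            return r;          /* begin < pc <= end */
    }
    return NULL;
}

static int in_unsafe_region(uintptr_t pc)
{
    return find_unsafe_region(pc) != NULL;
}

/* Returns the retry address, or pc itself when pc is in no unsafe region. */
static uintptr_t retry_point(uintptr_t pc)
{
    const struct unsafe_region *r = find_unsafe_region(pc);
    return r ? r->retry : pc;
}
```

With n registered regions the lookup costs O(log n) comparisons, which is small next to the system call that obtains the program counter in the first place.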

Alternatively, we could use the designated sequences approach by Bershad et al. [8]. That is, the JIT compiler embeds some landmark no-op around each unsafe region, while the in_unsafe_region() function compares the instruction stream of a suspended thread against the landmark no-op to determine if the thread is in an unsafe region.

Whatever techniques are used, we need to obtain the program counter of a suspended thread by invoking an appropriate system call, which is expensive in most operating systems. Thus, it is desirable to reduce the number of calls to the thread_get_context() function. If the virtual machine provides a fast way to see if the thread is in the module of compiled code, we can reduce the number by creating unsafe regions only in the compiled code.

Some virtual machines allow us to cheaply determine, without the program counter, whether a thread has been suspended within the module of compiled code. For instance, the virtual machine we use in Section 4 maintains a thread local variable for the thread’s execution mode. The variable takes values such as EXECUTING_COMPILED_CODE, COMPILING, and INTERPRETING. We can thus know if the thread is in the module of compiled code by simply checking the current value of the thread local variable.

On the other hand, we can confine unsafe regions to the module of compiled code as follows. In general, we can convert an unsafe region into a safe region by modifying the lockword with a compare_and_swap even in the reserve mode (lines 28 and 58). Although we should not make such conversions for frequently executed unsafe regions, it is reasonable to convert the unsafe regions in the Java bytecode interpreter and other performance insensitive components.

Putting these two things together, we can, in our virtual machine, use the following sequence in the unreserve() function.

82 : // modify the owner thread’s context if necessary
| 82a: if (get_exec_mode(ownerTID) == EXECUTING_COMPILED_CODE) {
83 :   if (thread_get_context(ownerTID, &context) == SUCCESS) {
84 :     if (in_unsafe_region(context.pc)) {
85 :       context.pc = retry_point(context.pc);
86 :       thread_set_context(ownerTID, &context);
87 :   } }
| 87a: }
88 :

The quick check in line 82a is expected to filter out many uninteresting cases, resulting in many fewer calls to thread_get_context().

Reservation Cancellation

The essential property in the unreserve() function is to prevent the owner thread from changing the lockword while another thread is canceling the reservation. As long as this property is satisfied, we could implement the unreserve() function in different ways. We show two variations of the function.

First, we could use signals as provided in Unix operating systems. In this variation, the thread without the reservation requests the cancellation, while the owner thread actually does the cancellation. More concretely, the thread without the reservation sends a signal to the owner thread, and waits until the latter has completed the processing. In the signal handler, the owner thread cancels the reservation, checks with the saved program counter to see if it has been interrupted in an unsafe region, and, if so, modifies the program counter to the corresponding retry address.

Second, we could exploit predicated stores (footnote 8), which are available, for instance, on Intel’s IA-64 processors [22]. We dedicate one predicate register to lock reservation. We initialize it to TRUE before reading the lockword in acquire() and release(), while we write into the lockword in the reserve mode with a predicated store qualified by the predicate register. In the unreserve() function, we set the value of the predicate register of the owner thread to FALSE. This prevents the owner thread from changing the lockword inconsistently (footnote 9).

Footnote 8: In the IA-64, most operations can be qualified by a one-bit predicate register to indicate whether it is actually executed or not. The execution of a predicated store consists of checking the predicate register and conditionally performing the store, and cannot be interrupted in the middle.

Footnote 9: Hudson et al. [19] proposed a similar technique for object allocation which utilizes dedicated predicate registers set and reset by the context switcher.

Multiprocessor Considerations

The Java language specification [17] describes the Java memory model in Chapter 17. According to rules about the interaction of locks and variables, we cannot move before a lock acquisition the load operations that follow the acquisition or move after a lock release the store operations that precede the release. Therefore, when we implement lock reservation on a multiprocessor system with a relaxed memory model [1], we need to issue appropriate types of memory barriers in the functions of lock acquisition and release. More concretely, the lfence (load fence) and sfence (store fence) instructions must be inserted at lock acquisition and release points, respectively, on the Pentium 4 [21], while the ld.acq (load acquire) and st.rel (store release) instructions must be used at lock acquisition and release points, respectively, on the IA-64 [22]. Memory barriers are normally much cheaper than atomic operations such as compare_and_swap. However, an older processor may not support memory barriers at all, so that an expensive instruction must be used to meet the requirements.

[Figure 5 is a state-transition diagram; only its labels survive. Lockword fields: thread ID (tid), recursion count (rcnt), LRV bit (R), and shape bit (S). States: reserve mode (anonymously reserved at object creation, then reserved for thread A) and base mode (tasuki lock), the latter with a flat mode (not acquired, or acquired by thread A or B with a recursion count) and an inflated mode (monitor ID referring to a heavyweight monitor). Transitions: acquire (including the initial synchronization), release, unreserve, inflate, deflate, and rcnt overflow.]

Figure 5: Complete lock state transitions when lock reservation is coupled with tasuki lock

Practically speaking, we believe that these memory barriers are unnecessary in the reserve mode, since no other thread can be trying to execute the critical region. We can take care of the necessary synchronizations when the reservation is canceled and while the owner thread is suspended. Finally, we note that Pugh [36] pointed out flaws in the Java memory model, and that the revision is being discussed under Java Specification Request 133 [24].
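To make the placement of the barriers concrete, the following sketch shows a reserve-mode fast path written with C11 atomics. It is an illustration under stated assumptions rather than the code of Figure 4: the lockword layout, the macro names, and the functions reserved_acquire() and reserved_release() are hypothetical, and the C11 acquire and release fences stand in for the lfence and sfence (or ld.acq and st.rel) instructions named above. Relaxed loads and stores are used for the lockword itself, so no compare_and_swap is issued on this path.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed lockword layout (illustrative only): [tid:16 | rcnt:8 | unused:7 | LRV:1]. */
#define LRV_BIT    0x1u
#define RCNT_ONE   0x100u
#define RCNT_MASK  0xFF00u
#define TID_SHIFT  16

/* Fast path of acquisition: returns 1 on success, 0 if the caller must take
 * the slow path (reservation cancellation or the base algorithm, not shown). */
static int reserved_acquire(_Atomic uint32_t *lockword, uint32_t my_tid)
{
    uint32_t w = atomic_load_explicit(lockword, memory_order_relaxed);
    if (!(w & LRV_BIT) || (w >> TID_SHIFT) != my_tid)
        return 0;                               /* not reserved for this thread */
    if ((w & RCNT_MASK) == RCNT_MASK)
        return 0;                               /* recursion count would overflow */
    /* Ordinary store, no compare_and_swap. The window between the load above
     * and this store is an unsafe region that cancellation must handle. */
    atomic_store_explicit(lockword, w + RCNT_ONE, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);  /* barrier at lock acquisition */
    return 1;
}

/* Fast path of release: returns 1 on success, 0 if the slow path is needed. */
static int reserved_release(_Atomic uint32_t *lockword, uint32_t my_tid)
{
    uint32_t w = atomic_load_explicit(lockword, memory_order_relaxed);
    if (!(w & LRV_BIT) || (w >> TID_SHIFT) != my_tid || (w & RCNT_MASK) == 0)
        return 0;                               /* not held in the reserve mode */
    atomic_thread_fence(memory_order_release);  /* barrier at lock release */
    atomic_store_explicit(lockword, w - RCNT_ONE, memory_order_relaxed);
    return 1;
}
```

If the barriers are indeed unnecessary in the reserve mode, as argued above, the two fences can be dropped from this path and the required ordering established during cancellation instead.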

4. PERFORMANCE MEASUREMENTS

This section evaluates the effectiveness of lock reservation with the IBM Development Kit for Windows, Java Technology Edition, Version 1.3.1 [20] and its JIT compiler [23, 41]. We ran all of the benchmark programs under Windows 2000 SP2 on an unloaded IBM IntelliStation M Pro containing two 1.7-GHz Pentium 4 Xeon processors with 1024 megabytes of main memory.

We implemented lock reservation on top of the existing algorithm in the development kit, which is called tasuki lock [34], one of the fastest locking algorithms for Java. Tasuki lock is an improved version of thin lock [6], and both use one word (footnote 10) in an object header for representing the lock state. The lockword contains a mode flag called the shape bit, which distinguishes between the two modes of tasuki lock. When the shape bit is zero, it is in the flat mode. Otherwise, it is in the inflated mode.

Footnote 10: Actually, the lowest eight bits in the lockword are used for other states unrelated to locks.

As long as contention does not occur, the lock is in the flat mode. The lockword in the flat mode is further divided into the tid field and the rcnt field, as in our algorithm. In this mode, the lock can be acquired by a compare_and_swap, and released by a simple store (footnote 11).
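The flat-mode protocol can be sketched as follows. This is a simplified illustration, not the tasuki lock implementation: the lockword layout, the macro names, and the functions flat_acquire() and flat_release() are assumptions, thread IDs are assumed to be nonzero, and inflation and contention handling are omitted. The release-ordered store plays the role of the simple store followed by a memory barrier.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed flat-mode layout (illustrative only): [tid:16 | rcnt:8 | other:8],
 * with an all-zero word meaning "not acquired". */
#define FLAT_RCNT_ONE   0x100u
#define FLAT_RCNT_MASK  0xFF00u
#define FLAT_TID_SHIFT  16

/* Returns 1 if the lock was acquired in the flat mode, 0 if the caller must
 * fall back to inflation (lock held by another thread, or rcnt overflow). */
static int flat_acquire(_Atomic uint32_t *lockword, uint32_t my_tid)
{
    uint32_t expected = 0;
    uint32_t mine = my_tid << FLAT_TID_SHIFT;
    /* Outermost acquisition: one compare_and_swap from "not acquired". */
    if (atomic_compare_exchange_strong_explicit(lockword, &expected, mine,
                                                memory_order_acquire,
                                                memory_order_relaxed))
        return 1;
    /* Recursive acquisition: only the owner writes here, so a store suffices. */
    if ((expected >> FLAT_TID_SHIFT) == my_tid &&
        (expected & FLAT_RCNT_MASK) != FLAT_RCNT_MASK) {
        atomic_store_explicit(lockword, expected + FLAT_RCNT_ONE,
                              memory_order_relaxed);
        return 1;
    }
    return 0;
}

/* Returns 1 on success, 0 if the calling thread does not own the lock. */
static int flat_release(_Atomic uint32_t *lockword, uint32_t my_tid)
{
    uint32_t w = atomic_load_explicit(lockword, memory_order_relaxed);
    if ((w >> FLAT_TID_SHIFT) != my_tid)
        return 0;
    uint32_t next = (w & FLAT_RCNT_MASK) ? w - FLAT_RCNT_ONE : 0;
    atomic_store_explicit(lockword, next, memory_order_release);  /* simple store */
    return 1;
}
```

Only the outermost acquisition issues a compare_and_swap in this sketch, which is the operation that lock reservation aims to remove.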

When contention happens, the lockword is converted to the inflated mode, where a heavyweight monitor is created and the reference to the monitor is stored in the lockword. The lock remains in the inflated mode until contention ceases.

Although our lock reservation can be built upon any algorithm, tasuki lock is a very natural fit since the lockword structure in the flat mode is almost the same as the structure in the reserve mode. This allows lock operations to be highly efficient in terms of both space and time. Figure 5 shows all of the state transitions when lock reservation is coupled with tasuki lock.

We took a simple approach to implementing checks for unsafe regions. Our virtual machine includes two sets of implementations of acquire() and release() functions, one pair in the module of the JIT runtime code and the other pair in the module of the interpreter. The former is called from but not inlined into the JIT generated code. The latter, written in C, is called from the interpreter, and implemented without unsafe regions as described in Section 3.4. This means we only have two unsafe regions in our virtual machine. To make the comparison exact, we disabled inlining of the lock acquisition and release code in the original virtual machine.

Finally, in order to comply with the current Java memory model, we inserted the lfence and sfence instructions into the functions for lock acquisition and release, respectively.

Footnote 11: The simple store must be followed by a memory barrier in a multiprocessor system.


4.1 Micro-Benchmarks

We show the results of two micro-benchmarks.

PrimitiveTest

The PrimitiveTest is intended for measuring the cost of synchronization, that is, of acquiring and releasing a lock, in different lock states. We measured the following two cases in three lock states: reserved, not reserved (flat), and inflated.

• Outermost: Acquire and release a lock using a synchronized block n times, and measure the elapsed time.

• Recursive: Perform the same measurement inside another synchronized block.

To calculate the cost of acquiring and releasing a lock in each state, we created a special virtual machine that performs nothing on lock acquisition or release, and calculated the differences between the times of the normal and special virtual machines.

TransitionTest

The TransitionTest measures the cost of transitions of lock states unique to lock reservation. We created a total of n objects, and forced them to make the following two transitions.

• Anonymous-to-specific: Acquire and release the lock of each object, making the anonymously reserved lock specifically reserved.

• Reserved-to-base: Cancel the lock reservation for each object by creating another thread and having this second thread acquire and release the lock.

To calculate the cost of each transition, we took the differences from the times of lock acquisition and release without the transition.

We confirmed in both tests that the relevant methods were compiled to native code, and that the synchronization operations within the methods were not optimized away by the JIT compiler. We also verified that garbage collection did not occur during the measurements.

Table 3 shows the results for the PrimitiveTest. For comparison, the table also contains the numbers for the original tasuki algorithm. When the reservation succeeded, the cost of the outermost synchronization was reduced by more than 70%. On the other hand, after the reservation is canceled, the cost of the synchronization is almost the same as in the original algorithm.

Table 3: Synchronization costs in lock reservation
The cost of acquiring and releasing a lock is shown for three lock states of our algorithm and two lock states of the original algorithm.

Lockword state          Outermost     Recursive
Reserved                61.4 nsec     61.4 nsec
Not reserved            229.5 nsec    61.4 nsec
Inflated                335.5 nsec    155.8 nsec
Flat in original        228.9 nsec    62.2 nsec
Inflated in original    330.3 nsec    150.0 nsec

Table 4: Costs of lock state transitions
The times spent on lock acquisition and release are not included.

State transition                  Time
Anonymous-to-specific             89.0 nsec
Reserved-to-base (faster case)    6741 nsec
Reserved-to-base (slower case)    18986 nsec

Table 4 presents the results for the TransitionTest. Since we found that the cost of cancellation heavily depends on whether or not thread_get_context() (line 83 in Figure 4) is actually executed, we show two cases for cancellation in the table, the faster case in which the function is not executed and the slower case in which the function is executed. As the table shows, the cost of making an anonymous reservation specific is very small and negligible, while the cost of reservation cancellation is very large, as expected. The cost of cancellation is 20 to 60 times larger than the cost of synchronization in the inflated mode. One reason for this is that getting a thread context (by calling GetThreadContext()) is very slow in Windows. Although we do not believe that this badly influences the performance of real programs, it is important to reduce the number of expensive system calls, for instance by using the quick check as described in Section 3.4.
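The percentages and ratios quoted for Tables 3 and 4 follow from simple arithmetic, which the sketch below reproduces. The constants are copied from the two tables; the function and constant names are introduced here only for illustration.

```c
/* Costs copied from Tables 3 and 4, in nanoseconds. */
static const double RESERVED_OUTERMOST_NS      = 61.4;
static const double FLAT_ORIGINAL_OUTERMOST_NS = 228.9;
static const double INFLATED_OUTERMOST_NS      = 335.5;
static const double CANCEL_FASTER_NS           = 6741.0;
static const double CANCEL_SLOWER_NS           = 18986.0;

/* Relative reduction, in percent, when a cost drops from before to after. */
static double reduction_percent(double before, double after)
{
    return 100.0 * (before - after) / before;
}

/* How many times larger a is than b. */
static double times_larger(double a, double b)
{
    return a / b;
}
```

For example, reduction_percent(FLAT_ORIGINAL_OUTERMOST_NS, RESERVED_OUTERMOST_NS) is about 73, and times_larger() applied to the two cancellation costs and the inflated outermost cost gives about 20 and about 57, respectively.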

4.2 Macro-Benchmarks

We now show the performance improvement in real programs. We measured the performance of the same set of programs as in our investigation in Section 2 (listed in Table 1). We ran each program several times with two virtual machines, one with the original algorithm and the other with lock reservation, and compared the best scores. We took the measurements with the JIT compiler enabled.

Figure 6 shows the results. Lock reservation improved the performance of all programs except _201_compress and _222_mpegaudio, both of which perform very few lock operations. We observed especially significant improvements of more than 30% in _209_db, _228_jack, and _213_javac. As a result, lock reservation improved the geometric mean of the SPECjvm98 programs by 18.13%. Furthermore, we observed improvements of 5% to 10% even in the multi-threaded programs, SPECjbb2000 and the Volano Mark.
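The geometric mean can be recomputed from the per-program improvements plotted in Figure 6. The sketch below assumes that the seven SPECjvm98 values appear in the same order as the program labels of the figure; the helper names are hypothetical, and the n-th root is computed with Newton's method only to keep the example free of library dependencies.

```c
#include <stddef.h>

/* Improvements in percent for _202_jess, _201_compress, _209_db,
 * _222_mpegaudio, _228_jack, _213_javac, and _227_mtrt (from Figure 6). */
static const double specjvm98_improvement[] = {
    11.59, 0.25, 52.76, 0.83, 33.14, 37.81, 1.55
};

/* n-th root of p (p > 0) by Newton's method on f(y) = y^n - p. */
static double nth_root(double p, int n)
{
    double y = 1.0;
    for (int i = 0; i < 100; i++) {
        double yn1 = 1.0;                  /* y^(n-1) */
        for (int k = 0; k < n - 1; k++)
            yn1 *= y;
        y -= (yn1 * y - p) / (n * yn1);    /* Newton step */
    }
    return y;
}

/* Geometric mean of the speed-up factors (1 + improvement), as a percentage. */
static double geomean_improvement_percent(const double *pct, size_t n)
{
    double product = 1.0;
    for (size_t i = 0; i < n; i++)
        product *= 1.0 + pct[i] / 100.0;
    return (nth_root(product, (int)n) - 1.0) * 100.0;
}
```

Applied to the seven values above, geomean_improvement_percent() returns approximately 18.13, in agreement with the figure quoted in the text.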

Table 5 shows lock statistics in the actual environment, which we measured separately. As the table shows, even when the JIT compiler is enabled, many lock operations are performed. The table also shows the ratios of lock operations accelerated by our implementation of lock reservation. Note that these numbers do not include synchronizations performed inside the interpreter or performed recursively in the compiled code, even if the reservations were successful. Because of this, most of the lock operations were not accelerated in _201_compress and _222_mpegaudio, since they were not in hot methods and were executed by the interpreter rather than compiled by the JIT. For the other, lock-intensive programs, more than 58% of the lock operations were accelerated by lock reservation.


[Figure 6 is a bar chart of the performance improvement per program, with a vertical axis from 0.00% to 50.00%: _202_jess 11.59%, _201_compress 0.25%, _209_db 52.76%, _222_mpegaudio 0.83%, _228_jack 33.14%, _213_javac 37.81%, _227_mtrt 1.55%, SPECjbb2000 9.79%, Volano Mark 5.45%.]

Figure 6: Performance improvements

Table 5: Lock statistics

Program name                 Number of lock    Ratios of accelerated    Ratios of reservation
                             operations        lock ops.                cancellations
SPECjvm98
  _202_jess                  14585409          99.289%                  0.00125%
  _201_compress              29150             31.547%                  0.419%
  _209_db                    162079177         99.963%                  0.0000296%
  _222_mpegaudio             27480             35.837%                  0.313%
  _228_jack                  35207339          91.947%                  0.000395%
  _213_javac                 43510883          99.402%                  0.00403%
  _227_mtrt                  3523262           99.035%                  0.00284%
SPECjbb2000 (footnote 12)    335718621         58.544%                  0.0535%
Volano Server                6862014           79.755%                  0.0248%
Volano Client                10381000          84.333%                  0.0138%

4.3 Possible Extensions

As the results of the micro- and macro-benchmarks show, the implementation of lock reservation significantly improves performance if the reservation succeeds, while it maintains comparable performance if the reservation fails. The only problem is a relatively high cost in canceling a reservation, which occurs when a thread acquires an object’s lock reserved for another thread. However, as Table 5 shows, canceling a reservation rarely happens in real programs. Although the locks are initially put in the reserve mode in our implementation, less than 0.05% of the lock operations caused reservations to be canceled in the lock-intensive benchmarks.
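A rough way to see why rare cancellations do not erase the gain is to weigh the two effects per lock operation. The following sketch is a deliberately simplified model and not part of our measurements: it assumes that every accelerated operation saves the difference between the outermost costs of the not-reserved and reserved states in Table 3 (229.5 - 61.4 = 168.1 nsec), that every cancellation costs the slower case of Table 4 (18986 nsec), and it ignores recursive locks, cache effects, and everything else that distinguishes real programs from micro-benchmarks. The function names are hypothetical.

```c
/* Estimated net saving per lock operation, in nanoseconds.
 * accelerated_ratio and cancel_ratio are fractions (not percentages). */
static double net_saving_ns(double accelerated_ratio, double cancel_ratio,
                            double saving_per_op_ns, double cancel_cost_ns)
{
    return accelerated_ratio * saving_per_op_ns - cancel_ratio * cancel_cost_ns;
}

/* Cancellation ratio at which the saving would be fully consumed. */
static double break_even_cancel_ratio(double accelerated_ratio,
                                      double saving_per_op_ns,
                                      double cancel_cost_ns)
{
    return accelerated_ratio * saving_per_op_ns / cancel_cost_ns;
}
```

With the SPECjbb2000 row of Table 5 (58.544% accelerated, 0.0535% canceled), this model gives a net saving of roughly 88 nsec per lock operation, and the break-even cancellation ratio is roughly 0.5%, about ten times the measured ratio.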

There might be pathological programs in which reservations are canceled more frequently. It may be important to lower the cost of a cancellation, and to reduce the number of cancellations by refining the reservation policy. For instance, if dynamic profiles of cancellations reveal that reservations are frequently canceled for objects of specific classes or created at specific execution points, we should initially put them into the base mode. Also, we may be able to predict which thread is likely to initially acquire an object’s lock, using dynamic profiles or static analysis. Finally, if we can reduce the cost of a cancellation, it could become worthwhile to pursue an algorithm allowing repeated reservations.

Footnote 12: Again, the total number of locks for SPECjbb2000 is not very meaningful because it varies with the execution speed.

5. RELATED WORK

There is a significant body of literature on locks. Here we mainly focus on Java locks and locks without atomic operations.

5.1 Improvements of Java Locks

As we mentioned in Section 1, synchronization operations tend to be very frequent in Java, so many techniques have been proposed to optimize them.

The early versions of virtual machines from Sun allocate monitors separately from objects, and maintain the mapping from objects to monitors in a data structure called the monitor cache [47]. While this does not require any bit in an object header for synchronization, it suffers from slow performance and bad scalability, since the monitor cache must be synchronized. A similar technique is also used in Kaffe [26].

Onodera [33] proposed a simple and space-efficient way of implementing Java locks. The method directly stores a reference to a monitor in a rarely used field in an object header, displacing the original value of the field into the monitor. Furthermore, it does so for only heavily synchronized objects in order to reduce the space overhead. While it mostly eliminated the need to synchronize the monitor cache, it could not drastically reduce the synchronization overhead, since it still used heavyweight monitors.

Bacon et al. [6] proposed a locking algorithm for Java, called thin lock, exploiting the observation that most locks are not contended in Java. An object’s lock operates in one of two modes, the flat mode and the inflated mode. They reserve a 24-bit field in an object header, which has one of two structures depending on the current operating mode of the lock; the two structures are distinguished by one bit, called the shape bit. Initially, each lock is in the flat mode, and remains in this mode as long as contention does not occur. Acquiring and releasing a lock in the flat mode is highly efficient, executing only a few machine instructions. In particular, the instruction sequence for acquisition includes only one atomic operation, while the instruction sequence for release contains no atomic operations. When contention is detected, the lock changes to the inflated mode, and falls back to the heavyweight monitor. Once a lock is put in the inflated mode, thin lock keeps the lock in this mode for the rest of its life, resulting in all the subsequent synchronizations being performed through the heavyweight monitor.

Onodera and Kawachiya [34] discovered that most contentions are temporary in Java, and proposed an enhanced algorithm, named tasuki lock, which supports deflation to recover the higher performance of the flat mode. SableVM [16] employs a variation of the tasuki lock.

Agesen et al. [2] proposed another locking algorithm, called meta lock. While it needs only two bits in a header for synchronization, it requires two atomic operations in acquiring and releasing a lock. Thus, it is not as time-efficient as thin lock and tasuki lock. Recently, Dice [13] proposed a modified version of meta lock named relaxed lock.

Although the details are significantly different, these fast algorithms can acquire and release a lock with a small number of machine instructions containing one or two atomic operations. Lock reservation attempts to further reduce the overhead by completely eliminating the atomic operations that are now becoming more and more expensive in modern architectures. It exploits the observation that most locks are not only uncontended in Java, but also dominantly acquired by a specific thread. As we already described, if implemented on top of tasuki lock, it only requires one additional bit in the header to represent the reservation status, while it attains an unprecedented level of performance for synchronization when reservation succeeds. We note that Bacon and Fink [7] independently proposed a similar idea of eliminating atomic operations in Java locks.

5.2 Elimination of Java Locks

Another approach to improve the synchronization performance is to eliminate locks altogether rather than to reduce the cost of the locks.

Using escape analysis [35], we can find objects accessible only by their creator threads, and eliminate all the synchronization operations for the locks of such non-escaping objects [3, 9, 10, 12, 38, 44]. However, these techniques are most effective in a static compiler that can perform whole program analysis, while they provide only limited benefits for a dynamic language such as Java. When applying escape analysis to Java, many more objects must conservatively be judged as escaping, and their locks cannot be optimized away. Whaley [45] recently proposed partial method compilation for improving effectiveness of escape analysis for a dynamic language.

The IBM JIT compiler eliminates some of the recursive locks [27]. This can happen when the compiler inlines one synchronized method into another synchronized method. If the compiler determines that the receiver objects of the two methods are identical, it then eliminates the lock operations for the inlinee.

Since lock reservation is a runtime technique, it is basically complementary to compiler optimizations such as escape analysis and recursive lock elimination. It can speed up the locks of escaping objects and outermost locks as long as they show thread locality.

Bacon [5] attempted to eliminate all of the synchronization overhead from single-threaded executions. As long as the system creates and runs only one thread, nothing is done for lock acquisition and release. When the running program attempts to create a second thread, the system scans the stack frames and properly recovers the lock states. Muller [32] also briefly mentioned a similar idea. Ruf [38] proposed whole-program analysis to determine if the program does not create a second thread. Unfortunately, these ideas cannot be used in most of the commercial virtual machines, since they always create a couple of helper threads, besides the main thread, at start-up time.

5.3 Other Lock Optimizations

Some of the locking algorithms provide mutual exclusion without atomic operations such as compare_and_swap and test_and_set.

Bershad et al. [8] proposed a unique locking algorithm that closely cooperates with the operating system’s scheduler. When a thread is preempted in one of the critical sections, it is forcibly restarted from the entry point of the section. To determine whether a thread is suspended in such a restartable atomic sequence, they mark each atomic sequence with a designated sequence. As we mentioned in Section 3.4, we can apply the technique to mark the unsafe regions in our algorithm. By extending Bershad’s idea, Johnson et al. [25] proposed interruptible critical sections, which support the modification of multiple data objects.

When a virtual machine is built on a user-level thread package, we can use scheduler-based techniques to implement locks. Actually, both CACAO [28] and LaTTe [46] implement locks by inhibiting thread switches inside the critical sections. However, scheduler-based locks are only effective on a uniprocessor system. Moreover, they may cause starvation when a foreign function is called through the Java Native Interface, and the foreign function attempts to acquire a system-level lock. On the contrary, lock reservation works properly on a multiprocessor system and under the system-level, preemptive scheduler.

Dijkstra [14] and Lamport [30] presented complex algorithms for mutual exclusion which do not rely on compound atomic operations. However, to the best of our knowledge, they have never been used in practical systems, because of their subtlety and lack of generality.

The communities of database systems [42] and distributed file systems invented many optimization techniques based on access locality, which is similar to our thread locality. Kung and Robinson [29] proposed an optimistic concurrency control for database systems, which speculatively executes critical regions without acquiring locks and commits the changes if there is no contention. Rajwar and Goodman [37] recently proposed a technique to implement a similar idea at the micro-architectural level. Microsoft’s CIFS distributed file system includes a file-locking mechanism called opportunistic locks or oplocks [15, 31]. When a client is granted an exclusive oplock for a file, it can cache the file data for better performance. If another client attempts to open the file, the server sends the client holding the oplock an oplock break request to return the cached data. This resembles reservation cancellation in our algorithm.

6. CONCLUDING REMARKS

We have presented a new locking algorithm, lock reservation, which optimizes Java locks by exploiting thread locality. The algorithm allows locks to be reserved for threads, and runs in either reserve mode or base mode. When a thread attempts to acquire a lock in the reserve mode, it can do so extremely quickly without any atomic operation if the lock is reserved for the thread. If the lock is not reserved for the thread, it cancels the reservation and falls back to the base mode.

We have defined thread locality of locks, which claims that the locking sequence of a lock contains a very long repetition of a specific thread, and confirmed that the vast majority of Java locks exhibit the thread locality.

We have evaluated an implementation of lock reservation in IBM’s production virtual machine and compiler. The results of micro-benchmarks show that we could reduce the locking overhead by more than 70% when the reservation succeeded. The results of macro-benchmarks show that lock reservation sped up more than 58% of the lock operations, and achieved performance improvements of up to 53% in real Java applications.

ACKNOWLEDGMENTS

We thank the members of the Network Computing Platform group in IBM Tokyo Research Laboratory, who gave us valuable suggestions.


REFERENCES

[1] S. V. Adve and K. Gharachorloo. Shared Memory Consistency Models: A Tutorial. IEEE Computer, 29(12), 66–76, 1996.

[2] O. Agesen, D. Detlefs, A. Garthwaite, R. Knippel, Y. S. Ramakrishna, and D. White. An Efficient Meta-lock for Implementing Ubiquitous Synchronization. Proceedings of ACM OOPSLA ’99, 207–222, 1999.

[3] J. Aldrich, C. Chambers, E. G. Sirer, and S. Eggers. Static Analyses for Eliminating Unnecessary Synchronization from Java Programs. Proceedings of the 6th Int’l Static Analysis Symposium (SAS ’99), 19–38, 1999.

[4] E. Armstrong. HotSpot: A New Breed of Virtual Machine. http://www.javaworld.com/jw-03-1998/jw-03-hotspot.html, 1998.

[5] D. F. Bacon. Fast and Effective Optimization of Statically Typed Object-Oriented Languages. Ph.D. Thesis UCB/CSD-98-1017, University of California, 1997.

[6] D. F. Bacon, R. Konuru, C. Murthy, and M. Serrano. Thin Locks: Featherweight Synchronization for Java. Proceedings of ACM PLDI ’98, 258–268, 1998.

[7] D. F. Bacon and S. Fink. Personal Communication.

[8] B. N. Bershad, D. D. Redell, and J. R. Ellis. Fast Mutual Exclusion for Uniprocessors. Proceedings of ACM ASPLOS V, 223–233, 1992.

[9] B. Blanchet. Escape Analysis for Object-Oriented Languages: Application to Java. Proceedings of ACM OOPSLA ’99, 20–34, 1999.

[10] J. Bogda and U. Holzle. Removing Unnecessary Synchronization in Java. Proceedings of ACM OOPSLA ’99, 35–46, 1999.

[11] P. A. Buhr, M. Fortier, and M. H. Coffin. Monitor Classification. ACM Computing Surveys, 27(1), 63–107, 1995.

[12] J.-D. Choi, M. Gupta, M. Serrano, V. C. Sreedhar, and S. Midkiff. Escape Analysis for Java. Proceedings of ACM OOPSLA ’99, 1–19, 1999.

[13] D. Dice. Implementing Fast Java Monitors with Relaxed-Locks. Proceedings of USENIX JVM ’01, 79–90, 2001.

[14] E. W. Dijkstra. Solution of a Problem in Concurrent Programming and Control. Communications of the ACM, 8(9), 569, 1965.

[15] R. Eckstein, D. Collier-Brown, and P. Kelly. Using Samba. O’Reilly, 1999. http://www.oreilly.com/catalog/samba/chapter/book/ch05_05.html.

[16] E. M. Gagnon and L. J. Hendren. SableVM: A Research Framework for the Efficient Execution of Java Bytecode. Proceedings of USENIX JVM ’01, 27–39, 2001.

[17] J. Gosling, B. Joy, and G. Steele. The Java Language Specification. Addison Wesley, 1996.

[18] C. A. R. Hoare. Monitors: An Operating System Structuring Concept. Communications of the ACM, 17(10), 549–557, 1974.

[19] R. L. Hudson, J. E. B. Moss, S. Subramoney, and W. Washburn. Cycles to Recycle: Garbage Collection on the IA-64. Proceedings of the 2nd ACM Int’l Symposium on Memory Management (ISMM ’00), 101–110, 2000.

[20] IBM developerWorks Java Technology Zone. http://www.ibm.com/developerworks/java/.

[21] Intel Corporation. IA-32 Intel Architecture Software Developer’s Manual Vol. 1–3. http://developer.intel.com/design/Pentium4/manuals/.

[22] Intel Corporation. Intel Itanium Architecture Software Developer’s Manual Vol. 1–3. http://developer.intel.com/design/itanium/manuals/.

[23] K. Ishizaki, M. Kawahito, T. Yasue, M. Takeuchi, T. Ogasawara, T. Suganuma, T. Onodera, H. Komatsu, and T. Nakatani. Design, Implementation, and Evaluation of Optimizations in a Just-In-Time Compiler. Proceedings of ACM Java Grande ’99, 119–128, 1999.

[24] Java Community Process. JSR 133: Java Memory Model and Thread Specification Revision. http://jcp.org/jsr/detail/133.jsp.

[25] T. Johnson and K. Harathi. Interruptible Critical Sections. Technical Report TR94007, University of Florida, 1994.

[26] Kaffe.org. Developing Kaffe. http://www.kaffe.org/develop.html.

[27] M. Kawahito. Personal Communication.

[28] A. Krall and M. Probst. Monitors and Exceptions: How to Implement Java Efficiently. Proceedings of ACM Workshop on Java for High-Performance Network Computing, 15–24, 1998.

[29] H. T. Kung and J. T. Robinson. On Optimistic Methods for Concurrency Control. ACM Transactions on Database Systems, 6(2), 213–226, 1981.

[30] L. Lamport. A Fast Mutual Exclusion Algorithm. ACM Transactions on Computer Systems, 5(1), 1–11, 1987.

[31] P. Leach and D. Perry. CIFS: A Common Internet File System. http://www.microsoft.com/mind/1196/cifs.asp, 1996.

[32] G. Muller, B. Moura, F. Bellard, and C. Consel. Harissa: A Flexible and Efficient Java Environment Mixing Bytecode and Compiled Code. Proceedings of the 3rd USENIX Conference on Object Oriented Technologies and Systems (COOTS ’97), 1–20, 1997.

[33] T. Onodera. A Simple and Space-Efficient Monitor Optimization for Java. IBM Research Report RT0259, IBM, 1998.

[34] T. Onodera and K. Kawachiya. A Study of Locking Objects with Bimodal Fields. Proceedings of ACM OOPSLA ’99, 223–237, 1999.

[35] Y. G. Park and B. Goldberg. Escape Analysis on Lists. Proceedings of ACM PLDI ’92, 116–127, 1992.

[36] W. Pugh. Fixing the Java Memory Model. Proceedings of ACM Java Grande ’99, 89–98, 1999.

[37] R. Rajwar and J. R. Goodman. Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution. Proceedings of the 34th ACM/IEEE MICRO 34, 294–305, 2001.


[38] E. Ruf. Effective Synchronization Removal for Java. Proceedings of ACM PLDI ’00, 208–218, 2000.

[39] Standard Performance Evaluation Corporation. SPEC JBB2000. http://www.spec.org/osg/jbb2000/.

[40] Standard Performance Evaluation Corporation. SPEC JVM98 Benchmarks. http://www.spec.org/osg/jvm98/.

[41] T. Suganuma, T. Ogasawara, M. Takeuchi, T. Yasue, M. Kawahito, K. Ishizaki, H. Komatsu, and T. Nakatani. Overview of the IBM Java Just-in-Time Compiler. IBM Systems Journal, 39(1), 175–193, 2000.

[42] A. Thomasian. Concurrency Control: Methods, Performance, and Analysis. ACM Computing Surveys, 30(1), 70–119, 1998.

[43] Volano LLC. Volano Benchmarks. http://www.volano.com/benchmarks.html.

[44] J. Whaley and M. Rinard. Compositional Pointer and Escape Analysis for Java Programs. Proceedings of ACM OOPSLA ’99, 187–206, 1999.

[45] J. Whaley. Partial Method Compilation using Dynamic Profile Information. Proceedings of ACM OOPSLA ’01, 166–179, 2001.

[46] B.-S. Yang, J. Lee, J. Park, S.-M. Moon, K. Ebcioglu, and E. Altman. Lightweight Monitor for Java VM. ACM SIGARCH Computer Architecture News, 27(1), 35–38, 1999.

[47] F. Yellin and T. Lindholm. Java Runtime Internals. Presentation in JavaOne ’96, http://java.sun.com/javaone/javaone96/pres/Runtime.pdf, 1996.
