Intel’s ‘cmpxchg’ instruction How does the Linux kernel’s ‘cmos_lock’ mechanism work?

Intel’s ‘cmpxchg’ instruction

How does the Linux kernel’s ‘cmos_lock’ mechanism work?

Review of the i386 registers

EAX

EBX

ECX

EDX

ESI

EDI

EBP

ESP

General Registers (32-bits)

CS

DS

ES

FS

GS

SS

Segment Registers (16-bits)

EIP

EFLAGS

Program Control and Status Registers (32 bits)

The x86 ‘system’ registersCR0

CR1

CR2

CR3

CR4

CR5

CR6

CR7

Control Registers (32-bits)

means ‘unimplemented’

DR0

DR1

DR2

DR3

DR4

Debug Registers (32-bits)

DR5

DR6

DR7LDTR

TR

GDTR

IDTR

(16-bits)

(48-bits)

How often is ‘cmpxchg’ used?$ cat vmlinux.asm | grep cmpxchg

c01046de: f0 0f b1 15 3c 99 30 lock cmpxchg %edx,0xc030993cc0105591: f0 0f b1 15 3c 99 30 lock cmpxchg %edx,0xc030993cc01055d9: f0 0f b1 15 3c 99 30 lock cmpxchg %edx,0xc030993cc010b895: f0 0f b1 11 lock cmpxchg %edx,(%ecx)c010b949: f0 0f b1 0b lock cmpxchg %ecx,(%ebx)c0129a9f: f0 0f b1 0b lock cmpxchg %ecx,(%ebx)c0129acf: f0 0f b1 0b lock cmpxchg %ecx,(%ebx)c012d377: f0 0f b1 0e lock cmpxchg %ecx,(%esi)c012d41a: f0 0f b1 0e lock cmpxchg %ecx,(%esi)c012d968: f0 0f b1 16 lock cmpxchg %edx,(%esi)c012e568: f0 0f b1 2e lock cmpxchg %ebp,(%esi)c012e57a: f0 0f b1 2e lock cmpxchg %ebp,(%esi)c012e58a: f0 0f b1 2e lock cmpxchg %ebp,(%esi)c012e83f: f0 0f b1 13 lock cmpxchg %edx,(%ebx)c012e931: f0 0f b1 0a lock cmpxchg %ecx,(%edx)c012ea94: f0 0f b1 11 lock cmpxchg %edx,(%ecx)c012ecf4: f0 0f b1 13 lock cmpxchg %edx,(%ebx)c012f08e: f0 0f b1 4b 18 lock cmpxchg %ecx,0x18(%ebx)c012f163: f0 0f b1 11 lock cmpxchg %edx,(%ecx)c013cb60: f0 0f b1 0e lock cmpxchg %ecx,(%esi)c0148b3c: f0 0f b1 29 lock cmpxchg %ebp,(%ecx)c0150d0f: f0 0f b1 3b lock cmpxchg %edi,(%ebx)c0150d87: f0 0f b1 31 lock cmpxchg %esi,(%ecx)c0199c5e: f0 0f b1 0b lock cmpxchg %ecx,(%ebx)c024b06f: f0 0f b1 0b lock cmpxchg %ecx,(%ebx)c024b2fe: f0 0f b1 51 18 lock cmpxchg %edx,0x18(%ecx)c024b321: f0 0f b1 51 18 lock cmpxchg %edx,0x18(%ecx)c024b34b: f0 0f b1 4b 18 lock cmpxchg %ecx,0x18(%ebx)c024b960: f0 0f b1 53 18 lock cmpxchg %edx,0x18(%ebx)

Here’s the occurrence that we studied in the ‘rtc_cmos_read()’ kernel-function…

…plus 28 other times!

Intel’s documentation

• You can find out what any of the Intel x86 instructions does by consulting the official software developer’s manual, online at:http://www.intel.com/products/processor/manuals/index.htm

• Our course-webpage has a link to this site that you can just click (under ‘Resources’)

• The instruction-set reference is two parts:– Volume 2A: for opcodes A through M– Volume 2B: for opcodes N through Z

http://www.intel.com/products/processor/manuals/index.htm

http://www.intel.com/products/processor/manuals/index.htm

Example: ‘cmpxchg’

• Operation of the ‘cmpxchg’ instruction is described (on 3 pages) in Volume 2A

• There’s an English-sentence description, and also a description in ‘pseudo-code’

• You probably do not want to print out this complete volume (.pdf) – over 700 pages!

• (You could order a printed copy from Intel)

Instruction format

• Intel’s assembly language syntax differs from the GNU/Linux syntax (known as ‘AT&T syntax’ with roots in UNIX history)

• When AT&T syntax is used, the ‘cmpxchg’ instruction has this layout:[lock] cmpxchg reg, reg/mem

optional ‘prefix’ (used for SMP)

mnemonic opcode

source operand

destination operand

‘effects’ and ‘affects’

• According to Intel’s manual, the ‘cmpxchg’ instruction also uses two ‘implicit’ operands (i.e., operands not mentioned in the instruction) – The CPU’s accumulator register– The CPU’s EFLAGS register

• The accumulator-register (EAX) is both a source-operand and a destination-operand

• The six status-bits in the EFLAGS register will get modified, as a ‘side-effect’ this instruction

‘cmpxchg’ description

• This instruction compares the accumulator with the destination-operand (so the ZF-bit in EFLAGS gets assigned accordingly)

• Then:– If (accumulator == destination)

{ ZF 1; destination source; }

– If (accumulator != destination){ ZF 0; accumulator destination; }

An instruction-instance

• In our recent disassembly of Linux’s kernel function ‘rtc_cmos_read()’, this ‘cmpxchg’ instruction-instance was used:

lock cmpxchg %edx, cmos_lock

prefix opcode source-operand destination-operand

Note: Keep in mind that the accumulator %eax will affect what happens! So we need to consider this instruction within it’s surrounding context

The complete functionc0105574 <rtc_cmos_read>:c0105574: 53 push %ebxc0105575: 9c pushfc0105576: 5b pop %ebxc0105577: fa clic0105578: 64 8b 15 08 20 30 c0 mov %fs:0xc0302008,%edxc010557f: 0f b6 c8 movzbl %al,%ecxc0105582: 42 inc %edxc0105583: c1 e2 08 shl $0x8,%edxc0105586: 09 ca or %ecx,%edxc0105588: a1 3c 99 30 c0 mov 0xc030993c,%eaxc010558d: 85 c0 test %eax,%eaxc010558f: 75 f7 jne c0105588 <rtc_cmos_read+0x14>c0105591: f0 0f b1 15 3c 99 30 lock cmpxchg %edx,0xc030993cc0105598: c0c0105599: 85 c0 test %eax,%eaxc010559b: 75 eb jne c0105588 <rtc_cmos_read+0x14>c010559d: 88 c8 mov %cl,%alc010559f: e6 70 out %al,$0x70c01055a1: e6 80 out %al,$0x80c01055a3: e4 71 in $0x71,%alc01055a5: e6 80 out %al,$0x80c01055a7: c7 05 3c 99 30 c0 00 movl $0x0,0xc030993cc01055ae: 00 00 00c01055b1: 53 push %ebxc01055b2: 9d popfc01055b3: 0f b6 c0 movzbl %al,%eaxc01055b6: 5b pop %ebxc01055b7: c3 ret

The ‘preparation’ steps

• The instructions that preceed ‘cmpxchg’ will setup register EDX (source operand) and register EAX (the x86 ‘accumulator’)

• Several instructions are used to set up a value in EDX, and result in this layout:

The current processor’s value for‘per_cpu__cpu_number’

plus 1EDX:

CMOSregister’s

index

31 8 7 0

this might be zero… …but this part is guaranteed to be non-zero!

The ‘cmos_lock’ variable

• This global variable is initialized to zero, meaning that access to CMOS memory locations is not currently ‘locked’

• If some CPU stores a non-zero value in this variable’s memory-location, it means that access to CMOS memory is ‘locked’

• The kernel needs to insure that only one CPU at a time can set this ‘lock’

The ‘most likely’ senario

• One of the CPUs wishes to access CMOS memory – so it needs to test ‘cmos_lock’ to be sure that access is now ‘unlocked’ (i.e., cmos_lock == 0 is true)

• The CPU copies the ‘cmos_lock’ variable into the EAX, where it can then be tested using the ‘test %eax, %eax’ instruction

• A conditional-jump follows the test

The ‘busy-wait’ loop

# Here is a ‘busy-wait’ loop, used to wait for the CMOS access to be ‘unlocked’ spin: mov cmos_lock, %eax # copy lock-variable to accumulator

test %eax, %eax # was CMOS access ‘unlocked’?jnz spin # if it wasn’t, then check it again

# A CPU will fall through to here if ‘unlocked’ access was detected, # and that CPU will now attempt to set the ‘lock’ – in other words, it# will try to assign a non-zero value to the ‘cmos_lock’ variable.

# But there’s a potential ‘race’ here – the ‘cmos_lock’ might have been# zero when it was copied, but it could have been changed by now…# … and that’s why we need to execute ‘lock cmpxchg’ at this point

Busy-waiting will be brief

spin: # see if the lock-variable is clearmov cmos_lock, %eax

test %eax, %eax jnz spin

# ok, now we try to grab the lock lock cmpxchg %edx, cmos_lock

# did another CPU grab it first? test %eax, %eaxjnz spin

If our CPU wins the ‘race’, the (non-zero) value from source-operand EDX will have been stored into the (previously zero) ‘cmos_lock’ memory-location, butthe (previously zero) accumulator EAX will not have been modified; hence ourCPU will not jump back, but will fall through and execute the ‘critical section’ ofcode (just a few instructions), then will promptly clear the ‘cmos_lock’ variable.

The ‘less likely’ case

spin: # see if the lock-variable is clearmov cmos_lock, %eax

test %eax, %eax jnz spin

# ok, now we try to grab the lock lock cmpxchg %edx, cmos_lock

# did another CPU grab it first? test %eax, %eaxjnz spin

If our CPU loses the ‘race’, because another CPU changed ‘cmos_lock’ to somenon-zero value after we had fetched our copy of it, then the (now non-zero) valuefrom the ‘cmos_lock’ destination-operand will have been copied into EAX, and sothe final conditional-jump shown above will take our CPU back into the spin-loop,where it will resume busy-waiting until the ‘winner’ of the race clears ‘cmos_lock’.

yes

Setup nonzerovalue in EDX

EAX cmos_lock EAX is zero?

no

EAX is zero?

EAX equals cmos_lock ?

ZF 1cmos_lock EDX

ZF 0EAX cmos_lock

yes no

no criticalsection

yes

cmos_lock 0

start

finish

flowchart

‘btr’/’bts’ versus ‘cmpxchg’

• In an earlier lesson we used the ‘btr’/’bts’ instructions to achieve ‘mutual exclusion’, whereas Linux uses ‘cmpxchg’ to do that

• We think ‘btr’/’bts’ is easier to understand, so why do you think the Linux developers would prefer to use ‘cmpxchg’ instead?

<allow some class discussion here>

In-class exercise #1

• Was it really necessary to insert a second ‘test %eax, %eax’ following ‘cmpxchg’?

• Can you design a simple LKM that would verify your answer to that question?

• The Intel documentation does not state precisely how other EFLAGS status-bits (besides ZF) are affected by ‘cmpxchg’, only that they reflect the comparison of ‘accumulator’ and ‘destination’ operands

• Usually the CPU implements comparison- of-operands by performing a subtraction

EFLAGS

CF

1PF

0AF

0ZF

SF

TF

IF

DF

OF

IOPLNT

RF

VM

AC

31 11 10 9 8 7 6 5 4 3 2 1 0

In-class exercise #2

• Can you decide what Intel means by “the comparison operation”, by writing suitable code that examines the effect on EFLAGS of ‘cmpxchg opnd1, opnd2’ and these two plausable alternatives:

cmp opnd1, opnd2

cmp opnd2, opnd1

Documents

Intel’s ‘cmpxchg’ instruction How does the Linux kernel’s ‘cmos_lock’ mechanism work?