Lynx: Using OS and Hardware Support for Fast Fine-Grained Inter-Core Communication

Konstantina Mitropoulou, Vasileios Porpodas, Xiaochun Zhang and Timothy M. Jones

Computer Laboratory

UKMAC 2016, Edinburgh

slide 1 of 30 http://www.cl.cam.ac.uk/~km647/


Outline

• Background:
  • Lamport’s queue
  • Multi-section queue

• Lynx queue

• Performance evaluation


Lamport’s Queue Bottlenecks

[Figure: circular queue buffer with dequeue_ptr and enqueue_ptr]

The enqueue thread spins while the queue is full:

    while (next_enqueue_ptr == dequeue_ptr) { ; }

Performance degradation due to:

• Frequent thread synchronisation

• Cache ping-pong
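To make the two bottlenecks concrete, here is a minimal sketch of a single-producer/single-consumer ring buffer in the style of Lamport's queue. The class name, the `try_` interface (which returns `false` where the slide's code spins) and the index-based layout are assumptions made for illustration; they are not taken from the slides.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative SPSC ring buffer; one slot is always left empty so that
// "full" and "empty" can be told apart.
template <std::size_t kCapacity>
class LamportQueue {
public:
    // Producer side. Returns false when the queue is full.
    bool try_enqueue(std::uint64_t value) {
        const std::size_t tail = enqueue_idx_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % kCapacity;
        // The consumer's index is read on every call: this per-element
        // synchronisation is what bounces the cache line between cores.
        if (next == dequeue_idx_.load(std::memory_order_acquire)) return false;
        buffer_[tail] = value;
        enqueue_idx_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns false when the queue is empty.
    bool try_dequeue(std::uint64_t& out) {
        const std::size_t head = dequeue_idx_.load(std::memory_order_relaxed);
        if (head == enqueue_idx_.load(std::memory_order_acquire)) return false;
        out = buffer_[head];
        dequeue_idx_.store((head + 1) % kCapacity, std::memory_order_release);
        return true;
    }

private:
    std::uint64_t buffer_[kCapacity] = {};
    std::atomic<std::size_t> enqueue_idx_{0};
    std::atomic<std::size_t> dequeue_idx_{0};
};
```

Every call touches the index owned by the other thread, so both the synchronisation and the cache-line transfer happen once per element.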


Cache Ping-Pong

[Figure: core 1 and core 2, each with its own L1 and L2 cache, above a shared L3 cache; dequeue_ptr is shown at core 1 and enqueue_ptr at core 2]

    while (next_enqueue_ptr == dequeue_ptr) { ; }

• Queue pointers ping-pong across the cache hierarchy


Cache Ping-Pong

The dequeue thread spins on the mirrored condition:

    while (next_dequeue_ptr == enqueue_ptr) { ; }

• Queue pointers ping-pong across the cache hierarchy


Multi-Section Queue (MSQ): state-of-the-art

[Figure: queue split into section 1 and section 2, with dequeue_ptr and enqueue_ptr]

• Each section is used exclusively by one thread at a time


Multi-Section Queue (MSQ): state-of-the-art

[Figure: two-section queue with enqueue_ptr and dequeue_ptr]

• Enqueue thread cannot access section 1 because dequeue thread still uses it

• Enqueue thread waits (spins) at the end of section 2


Multi-Section Queue (MSQ): state-of-the-art

[Figure: two-section queue with dequeue_ptr and enqueue_ptr]

• Dequeue thread reached the end of section 1

• Enqueue thread enters section 1


Multi-Section Queue (MSQ): state-of-the-art

[Figure: two-section queue with dequeue_ptr and enqueue_ptr]

Performance optimisations:

• Infrequent boundary checks (less frequent synchronisation)

• Reduced cache ping-pong
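The section hand-over described on the MSQ slides can be sketched as follows. This is a simplified illustration, not the MSQ implementation: the counters `filled_` and `drained_`, the `try_` interface (returning `false` where the slides spin) and the rule that a section becomes visible to the consumer only once it is completely filled are all assumptions of this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative two-section SPSC queue: shared state is touched only at
// section boundaries, everything else uses thread-private counters.
template <std::size_t kSectionSize, std::size_t kNumSections = 2>
class MultiSectionQueue {
public:
    bool try_enqueue(std::uint64_t value) {
        if (write_count_ % kSectionSize == 0) {
            // Boundary check: the section being entered must be drained.
            const std::size_t section = write_count_ / kSectionSize;
            if (section >= drained_.load(std::memory_order_acquire) + kNumSections)
                return false;  // the slides spin here instead
        }
        buffer_[write_count_ % kSlots] = value;
        ++write_count_;
        if (write_count_ % kSectionSize == 0)  // section complete: publish it
            filled_.store(write_count_ / kSectionSize, std::memory_order_release);
        return true;
    }

    bool try_dequeue(std::uint64_t& out) {
        if (read_count_ % kSectionSize == 0) {
            // Boundary check: the section being entered must be filled.
            const std::size_t section = read_count_ / kSectionSize;
            if (section >= filled_.load(std::memory_order_acquire)) return false;
        }
        out = buffer_[read_count_ % kSlots];
        ++read_count_;
        if (read_count_ % kSectionSize == 0)  // section drained: release it
            drained_.store(read_count_ / kSectionSize, std::memory_order_release);
        return true;
    }

private:
    static constexpr std::size_t kSlots = kSectionSize * kNumSections;
    std::uint64_t buffer_[kSlots] = {};
    std::size_t write_count_ = 0;          // producer-private
    std::size_t read_count_ = 0;           // consumer-private
    std::atomic<std::size_t> filled_{0};   // sections completely written
    std::atomic<std::size_t> drained_{0};  // sections completely read
};
```

With `kSectionSize` elements per section, the shared counters are read and written once per section rather than once per element, which is where the less frequent synchronisation and the reduced cache ping-pong come from.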


MSQ Control-Flow Graph and Internals

[Figure: control-flow graphs of the dequeue function (basic blocks 1 to 5) and the enqueue function (basic blocks 1 to 6)]


MSQ Control-Flow Graph and Internals

[Figure: control-flow graph of the enqueue function, basic blocks 1 to 6, with the following annotations]

• enqueue

• synchronisation code

• checks if next section is free

• spin loop

• update local variables

• update shared variable

• join basic-block


MSQ Control-Flow Graph and Internals

[Figure: two-section queue (section 1, section 2) with dequeue_ptr and enqueue_ptr, annotated with "synchronisation code"]


MSQ Control-Flow Graph and Internals

[Figure: control-flow graph of the enqueue function, basic blocks 1 to 6, with the enqueue block and the synchronisation code marked]

Enqueue block, annotated:

    lea  rax, [rdx+8]            ; incr pointer
    mov  QWORD PTR [rdx], rcx    ; store
    mov  rdx, rax                ; compiler's copy
    and  rdx, ROTATE_MASK        ; rotate pointer
    test eax, SECTION_MASK       ; end of section
    jne  .L2                     ; skip sync code
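The six instructions above correspond roughly to the following C++. This is a hedged reconstruction: it works on byte offsets rather than raw pointers, and the sizes and the definitions of `kRotateMask` and `kSectionMask` (standing in for `ROTATE_MASK` and `SECTION_MASK`) are assumed values, since the slides do not give them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed sizes for illustration only.
constexpr std::size_t kQueueBytes   = 64;                 // power of two
constexpr std::size_t kSectionBytes = kQueueBytes / 2;    // two sections
constexpr std::size_t kRotateMask   = kQueueBytes - 1;    // wraps the offset
constexpr std::size_t kSectionMask  = kSectionBytes - 1;  // detects a boundary

struct EnqueueState {
    std::uint64_t slots[kQueueBytes / sizeof(std::uint64_t)] = {};
    std::size_t offset = 0;  // byte offset of the next free slot
};

// Returns true when the pointer has just reached a section boundary, i.e.
// when the synchronisation code (not shown) would have to run.
inline bool enqueue(EnqueueState& q, std::uint64_t value) {
    const std::size_t next = q.offset + sizeof(std::uint64_t);  // incr pointer
    q.slots[q.offset / sizeof(std::uint64_t)] = value;          // store
    q.offset = next & kRotateMask;                              // copy + rotate pointer
    return (next & kSectionMask) == 0;                          // end of section?
}
```

Only the increment and the store do useful work; the mask, the test and the branch are bookkeeping that runs on every enqueue.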


Optimal Queue

[Figure: queue buffer with dequeue_ptr and enqueue_ptr]

Optimal queue features:

• infinite size

• 2 instructions overhead:
  1. pointer increment
  2. store into the queue


Lynx: Just 2 instructions overhead

[Figure: control-flow graph of the enqueue function, basic blocks 1 to 6, with the synchronisation code marked]

Lynx takes part of the enqueue (the boundary checks) and all the synchronisation overhead off the critical path


Lynx (1): H/W-triggered Synchronisation

[Figure: control-flow graph of the enqueue function, basic blocks 1 to 6, with the enqueue block and the synchronisation code marked]

    lea  rax, [rdx+8]
    mov  QWORD PTR [rdx], rcx
    mov  rdx, rax
    and  rdx, ROTATE_MASK
    test eax, SECTION_MASK
    jne  .L2


Lynx (1): H/W-triggered Synchronisation

[Figure: section 1 and section 2 of the queue, with two red zones labelled SSRZ]

• A red zone is a region of memory that can be neither read nor written

• SSRZ: Section Synchronisation Red-Zone
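On Linux and other POSIX systems, one way to obtain such a region is to map pages and then revoke all access to one of them with `mprotect(..., PROT_NONE)`. The sketch below assumes that approach; the slides do not say which system calls Lynx uses, and `GuardedBuffer`, `make_guarded_buffer` and `destroy_guarded_buffer` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

struct GuardedBuffer {
    void* base = nullptr;         // start of the usable read/write region
    void* red_zone = nullptr;     // start of the inaccessible page
    std::size_t usable_bytes = 0;
    std::size_t page_bytes = 0;
};

// Maps `pages` read/write pages followed by one PROT_NONE page.
// On failure the returned buffer has base == nullptr.
inline GuardedBuffer make_guarded_buffer(std::size_t pages) {
    GuardedBuffer b;
    const long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || pages == 0) return b;
    b.page_bytes = static_cast<std::size_t>(page);
    b.usable_bytes = pages * b.page_bytes;
    void* p = mmap(nullptr, b.usable_bytes + b.page_bytes,
                   PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return GuardedBuffer{};
    char* red = static_cast<char*>(p) + b.usable_bytes;
    // Any load or store to this page now raises SIGSEGV.
    if (mprotect(red, b.page_bytes, PROT_NONE) != 0) {
        munmap(p, b.usable_bytes + b.page_bytes);
        return GuardedBuffer{};
    }
    b.base = p;
    b.red_zone = red;
    return b;
}

inline void destroy_guarded_buffer(GuardedBuffer& b) {
    if (b.base) munmap(b.base, b.usable_bytes + b.page_bytes);
    b = GuardedBuffer{};
}
```

Because protection is applied per page, a red zone built this way is at least one page long and page-aligned.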


Lynx (1): H/W-triggered Synchronisation

[Figure: two-section queue with two SSRZ red zones, dequeue_ptr and enqueue_ptr]

Lynx’s handler checks:

• whether the SIGSEGV comes from the queue or from the rest of the system

• which thread raised the exception

• whether the thread is in section 1 or 2

• whether the next section is free
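The handler's checks can be sketched as a pure function over the faulting address. The layout assumed here, `[section][red zone][section][red zone]` starting at `base`, and the names `QueueLayout`, `Decision` and `decide` are hypothetical; identifying which thread faulted is left out. In a real handler installed with `SA_SIGINFO`, the faulting address would come from `siginfo_t::si_addr`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct QueueLayout {              // assumed layout, for illustration only
    std::uintptr_t base;          // address of section 0
    std::size_t section_bytes;    // size of each section
    std::size_t red_zone_bytes;   // size of the red zone after each section
};

struct Decision {
    bool from_queue;   // false: not ours, hand over to normal SIGSEGV handling
    int section;       // 0 or 1: section whose red zone was hit, -1 otherwise
    bool may_advance;  // true if the next section is free
};

inline Decision decide(const QueueLayout& q, std::uintptr_t fault_addr,
                       const bool section_in_use[2]) {
    const std::size_t stride = q.section_bytes + q.red_zone_bytes;
    // Check 1: does the fault come from the queue at all?
    if (fault_addr < q.base || fault_addr - q.base >= 2 * stride)
        return {false, -1, false};
    const std::size_t off = fault_addr - q.base;
    if (off % stride < q.section_bytes)  // inside section data, not a red zone
        return {false, -1, false};
    // Check 3: which section did the thread run off the end of?
    const int section = static_cast<int>(off / stride);
    // Check 4: is the next section free?
    const int next = 1 - section;
    return {true, section, !section_in_use[next]};
}
```

A fault that `decide` does not attribute to the queue should be treated as a genuine segmentation fault.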


Lynx (1): H/W-triggered Synchronisation

[Figure: two-section queue with its SSRZ red zones plus one newly added red zone, dequeue_ptr and enqueue_ptr]

• The dequeue thread still uses the first section

• The enqueue thread waits at the end of the second section and adds a new red zone

• The new red zone is part of the synchronisation and is added only temporarily


Lynx (1): H/W-triggered Synchronisation

[Figure: two-section queue with two SSRZ red zones, enqueue_ptr and dequeue_ptr]

• The dequeue thread has finished with the first section

• The enqueue thread removes the second red zone and enters the first section
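The underlying mechanism, a store that faults on a red zone, a handler that removes the protection, and the store then completing, can be demonstrated in isolation. The sketch below is Linux-oriented and is not Lynx's handler: it skips the waiting and section bookkeeping, all names are hypothetical, and it calls `mprotect` from a signal handler, which works on Linux in practice but is not on the POSIX list of async-signal-safe functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>

namespace redzone_demo {

static char* g_zone = nullptr;           // start of the protected page
static std::size_t g_page = 0;           // page size in bytes
static volatile sig_atomic_t g_faults = 0;

// If the fault hit our red zone, lift the protection; the faulting store is
// re-executed when the handler returns. Otherwise restore default handling.
static void on_segv(int, siginfo_t* info, void*) {
    char* addr = static_cast<char*>(info->si_addr);
    if (g_zone && addr >= g_zone && addr < g_zone + g_page) {
        g_faults = g_faults + 1;
        mprotect(g_zone, g_page, PROT_READ | PROT_WRITE);
        return;
    }
    signal(SIGSEGV, SIG_DFL);
}

// Stores one value into a PROT_NONE page. Returns the number of faults the
// handler absorbed, or -1 if setup failed or the store did not take effect.
inline int run_demo() {
    g_page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    void* p = mmap(nullptr, g_page, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    g_zone = static_cast<char*>(p);
    g_faults = 0;

    struct sigaction sa = {}, old = {};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) { munmap(p, g_page); return -1; }

    volatile std::uint64_t* slot =
        reinterpret_cast<volatile std::uint64_t*>(g_zone);
    *slot = 7;                          // faults, handler runs, store retries
    const bool stored = (*slot == 7);

    sigaction(SIGSEGV, &old, nullptr);  // restore the previous handler
    munmap(p, g_page);
    g_zone = nullptr;
    return stored ? static_cast<int>(g_faults) : -1;
}

}  // namespace redzone_demo
```

Calling `redzone_demo::run_demo()` exercises one fault-and-retry cycle. The enqueue code itself contains no check: the cost is paid only when the red zone is actually hit.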


Lynx (2): H/W-triggered Pointer Rotation

[Figure: control-flow graph of the enqueue function, basic blocks 1 to 6, with the enqueue block and the synchronisation code marked]

    lea  rax, [rdx+8]
    mov  QWORD PTR [rdx], rcx
    mov  rdx, rax
    and  rdx, ROTATE_MASK
    test eax, SECTION_MASK
    jne  .L2


Lynx (2): H/W-triggered Pointer Rotation

[Figure: two-section queue with two SSRZ red zones and one PRRZ red zone, enqueue_ptr and dequeue_ptr]

• SSRZ: Section Synchronisation Red-Zone

• PRRZ: Pointer Rotation Red-Zone


Lynx (2): H/W-triggered Pointer Rotation

[Figure: two-section queue with two SSRZ red zones and one PRRZ red zone, dequeue_ptr and enqueue_ptr]

Two types of red-zones:

1. moving red-zone: SSRZ (Section Synchronisation Red-Zone)

2. fixed red-zone: PRRZ (Pointer Rotation Red-Zone)


Experimental Setup

• Implementation in C++ with inline assembly

• Evaluation on a wide range of machines: from embedded SOCs to server CPUs

• Throughput experiments for a wide range of queue sizes

• Absolute throughput performance in GB/s
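As a small worked example of the metric, throughput in GB/s can be derived from the number of elements transferred, their size, and the elapsed time. The helper below is a hypothetical sketch; it assumes 1 GB = 10^9 bytes, since the slides do not say which convention the measurements use.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

// Hypothetical helper. Assumption: 1 GB = 10^9 bytes (not stated in the slides).
double throughput_gb_per_s(std::uint64_t elements, std::size_t element_bytes,
                           std::chrono::duration<double> elapsed) {
    const double bytes =
        static_cast<double>(elements) * static_cast<double>(element_bytes);
    return bytes / elapsed.count() / 1e9;
}
```

For instance, moving 10^9 64-bit (8-byte) values through the queue in one second corresponds to 8 GB/s under this convention.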


Throughput (GB/s) on Intel core-i5

[Figure: Throughput for 64bit Memory Instr. (Core-i5 4570). x-axis: queue size, 64KB to 256MB; y-axis: GB/s, 0 to 14; series: MSQ-mov, Lynx-mov.]


Breakdown of Lynx Overheads

[Figure: breakdown of execution time by queue size. x-axis: queue size, 64KB to 256MB; y-axis: % Execution Time, 0 to 100; legend entries: real kernel sync handler other.]


Throughput (GB/s) on Various Machines

[Figure: four panels titled "Throughput for 64bit Memory Instr.", each plotting GB/s against queue size (64KB to 256MB) for MSQ-mov and Lynx-mov. Panels and y-axis ranges: Xeon E5-2667v2 (0 to 14), Opteron 6376 (0 to 6), Core-i3 2367M (0.0 to 5.0), Celeron J1900 (0.0 to 4.0).]


Real World Applications on Intel Xeon

[Figure: block diagrams of three queue-based applications (SRMT, SD3, NetworkAnalyser), each with a bar chart comparing MSQ and Lynx. SRMT and SD3 charts: BT, CG, EP, IS, LU, MG, SP, Geo (y-axis ranges 0.80 to 1.40 and 0.80 to 1.25). NetworkAnalyser chart: 1T to 6T, Geo (y-axis range 0.96 to 1.18).]

• The best queue configuration with Lynx is better than the best with MSQ


Conclusion

• Proposed Lynx: a lock-free SP/SC software queue with just 2 instructions overhead

• Relies on existing commodity H/W and O/S support for memory protection

• The overhead of synchronisation and boundary checking is moved to the exception handler

• Throughput increases by up to 57%
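For contrast, the per-operation checks that Lynx moves into the exception handler are visible in a conventional Lamport-style SP/SC ring buffer. The class below is a generic textbook-style sketch using `std::atomic`, with hypothetical names; it is not the MSQ or Lynx implementation and was not part of the evaluation.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical baseline: a Lamport-style single-producer/single-consumer queue.
// One slot is kept empty so that "full" and "empty" can be told apart.
class LamportQueue {
public:
    explicit LamportQueue(std::size_t capacity)
        : buf_(capacity + 1), dequeue_idx_(0), enqueue_idx_(0) {}

    // Producer side: every call pays for a boundary check and a full check.
    bool try_enqueue(std::uint64_t value) {
        const std::size_t e = enqueue_idx_.load(std::memory_order_relaxed);
        const std::size_t next = (e + 1 == buf_.size()) ? 0 : e + 1;  // boundary check
        if (next == dequeue_idx_.load(std::memory_order_acquire)) {
            return false;                                             // queue full
        }
        buf_[e] = value;
        enqueue_idx_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side: every call pays for an empty check and a boundary check.
    bool try_dequeue(std::uint64_t &out) {
        const std::size_t d = dequeue_idx_.load(std::memory_order_relaxed);
        if (d == enqueue_idx_.load(std::memory_order_acquire)) {
            return false;                                             // queue empty
        }
        out = buf_[d];
        dequeue_idx_.store((d + 1 == buf_.size()) ? 0 : d + 1,
                           std::memory_order_release);
        return true;
    }

private:
    std::vector<std::uint64_t> buf_;
    std::atomic<std::size_t> dequeue_idx_;  // written only by the consumer
    std::atomic<std::size_t> enqueue_idx_;  // written only by the producer
};
```

Each `try_enqueue` here reads the consumer's index and tests for wrap-around on every element, which is the work the conclusion describes as being removed from the fast path.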


Source Code

https://www.cl.cam.ac.uk/~km647/papers/lynx/lynxQ.tar.bz2

or

https://www.repository.cam.ac.uk/handle/1810/254651

LYNX QUEUE
