A Study of Garbage Collector Scalability on Multicores
Lokesh Gidra, Gaël Thomas, Julien Sopena and Marc Shapiro
INRIA/University of Paris 6



Garbage collection on multicore hardware

14 of the 20 most popular languages have GC, but GC doesn't scale on multicore hardware.

Parallel Scavenge/HotSpot scalability on a 48-core machine

[Figure: GC throughput (GB/s, higher is better) vs. number of GC threads; SPECjbb2005 with 48 application threads / 3.5 GB. Throughput degrades after 24 GC threads.]

Lokesh Gidra, A Study of the Scalability of Garbage Collectors on Multicores

Scalability of GC is a bottleneck

With more cores, the application creates more garbage per unit of time.
Without GC scalability, the share of time spent in GC increases.

[Figure: Lusearch [PLOS'11]: ~50% of the time is spent in the GC at 48 cores.]

Where is the problem?

Probably not related to GC design: the problem exists in ALL the GCs of HotSpot 7 (both stop-the-world and concurrent GCs).

What has really changed: multicores are now distributed architectures, no longer centralized ones.

From centralized architectures to distributed ones

A few years ago: uniform memory access machines. All cores reach a single memory through the system bus.

Now: non-uniform memory access machines. Each node (Node 0 to Node 3 in the diagram) has its own cores and its own memory, and the nodes are linked by an interconnect.

From centralized architectures to distributed ones

Our machine: AMD Magny-Cours with 8 nodes and 48 cores
12 GB per node, 6 cores per node

Local access: ~130 cycles
Remote access: ~350 cycles


From centralized architectures to distributed ones

On this machine: time to perform a fixed number of reads in parallel, with #cores = #threads.

[Figure: completion time (ms, lower is better) vs. number of threads, for three access patterns: local access, random access, and single node access.]
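As a rough illustration of why random access is slower than local access, the sketch below computes an average read latency from the ~130 and ~350 cycle figures quoted for this machine. It assumes reads are spread uniformly over the 8 nodes and ignores contention on the interconnect and memory controllers, so it says nothing about the single node access case; the function name is an illustrative choice.

```python
LOCAL_CYCLES = 130   # approximate local access latency from the slides
REMOTE_CYCLES = 350  # approximate remote access latency from the slides
NUM_NODES = 8


def expected_latency(local_fraction):
    """Average cycles per read for a given fraction of local reads.

    Simplifying assumption: latency only, no contention or bandwidth limits.
    """
    return local_fraction * LOCAL_CYCLES + (1 - local_fraction) * REMOTE_CYCLES


local_only = expected_latency(1.0)
uniform_random = expected_latency(1 / NUM_NODES)  # 1/8 local, 7/8 remote
print(local_only, uniform_random)
```

Under these assumptions, uniformly random reads cost about 2.5 times as many cycles as purely local reads.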


Parallel Scavenge Heap Space

Parallel Scavenge relies on the kernel's lazy first-touch page allocation policy: a page of the virtual address space is physically mapped on the node of the thread that first touches it.
⇒ The initial sequential phase (initial application thread) maps most pages on its node.

But during the whole execution, the mapping remains as it is (the virtual space is reused by the GC).

A severe problem for generational GCs
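The first-touch behaviour described above can be mimicked with a toy model. This is only a sketch: the page count, the 95% initialization share, and the way later touches are spread over nodes are assumptions made for illustration, and the model does not call any real kernel interface.

```python
from collections import Counter

NUM_NODES = 8
NUM_PAGES = 1000  # hypothetical heap size in pages


def first_touch(touches):
    """Map each page to the node of the first thread that touches it.

    touches: iterable of (page, node) pairs in program order.
    Later touches never move a page, mirroring the fixed mapping.
    """
    mapping = {}
    for page, node in touches:
        mapping.setdefault(page, node)
    return mapping


# Assumed sequential phase: one thread on node 0 touches 95% of the pages.
init_phase = [(page, 0) for page in range(950)]
# Assumed parallel phase: threads on all nodes touch every page.
parallel_phase = [(page, page % NUM_NODES) for page in range(NUM_PAGES)]

mapping = first_touch(init_phase + parallel_phase)
pages_per_node = Counter(mapping.values())
share_node0 = pages_per_node[0] / NUM_PAGES
print(pages_per_node, share_node0)
```

Because the GC reuses the same virtual pages, the parallel phase in this model changes nothing for pages already touched, and node 0 keeps at least 95% of them.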

Parallel Scavenge Heap Space

Parallel Scavenge: first-touch allocation policy; 95% on a single node; bad balance, bad locality.

[Figure: GC throughput (GB/s, higher is better) vs. GC threads on SPECjbb, for PS.]

NUMA-aware heap layouts

Parallel Scavenge: first-touch allocation policy; 95% on a single node; bad balance, bad locality.
Interleaved: round-robin allocation policy; targets balance.
Fragmented: node-local object allocation and copy; targets locality.

[Figure: GC throughput (GB/s, higher is better) vs. GC threads on SPECjbb, for PS.]

Interleaved heap layout analysis

Parallel Scavenge: first-touch allocation policy; 95% on a single node; bad balance, bad locality.
Interleaved: round-robin allocation policy; 7/8 remote accesses; perfect balance, bad locality.
Fragmented: node-local object allocation and copy.

[Figure: GC throughput (GB/s, higher is better) vs. GC threads on SPECjbb, for PS and Interleaved.]
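A minimal sketch of the interleaved layout, assuming pages are assigned to the 8 nodes strictly round-robin by page index and that a thread touches every page equally often. The function names are illustrative and not taken from HotSpot.

```python
from collections import Counter
from fractions import Fraction

NUM_NODES = 8


def interleaved_node(page_index):
    """Round-robin placement: page i lives on node i mod NUM_NODES."""
    return page_index % NUM_NODES


def remote_fraction(thread_node, page_indices):
    """Share of the given pages that are remote for a thread on thread_node."""
    pages = list(page_indices)
    remote = sum(1 for page in pages if interleaved_node(page) != thread_node)
    return Fraction(remote, len(pages))


pages = range(8 * 1000)  # hypothetical heap of 8000 pages
balance = Counter(interleaved_node(page) for page in pages)
print(balance)
print(remote_fraction(3, pages))
```

Every node holds the same number of pages (perfect balance), while a thread on any node finds 7 out of 8 pages remote (bad locality).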

Fragmented heap layout analysis

Parallel Scavenge: first-touch allocation policy; 95% on a single node; bad balance, bad locality.
Interleaved: round-robin allocation policy; 7/8 remote accesses; perfect balance, bad locality.
Fragmented: node-local object allocation and copy; 7/8 remote scans, 100% local copies; good balance (bad balance if a single thread allocates for the others), average locality.

[Figure: GC throughput (GB/s, higher is better) vs. GC threads on SPECjbb, for PS, Interleaved, and Fragmented.]
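The fragmented layout can be pictured with a toy model, shown below under strong assumptions: one heap fragment per node, live objects spread evenly over the nodes, and GC work spread evenly over GC threads regardless of where an object lives. The names `allocate` and `collect` are hypothetical and do not correspond to HotSpot functions.

```python
from collections import Counter
from fractions import Fraction

NUM_NODES = 8


def allocate(fragments, thread_node, count=1):
    """Node-local allocation: objects land in the allocating thread's fragment."""
    fragments[thread_node] += count


def collect(live_objects, gc_node_of):
    """Copy each live object into the fragment of the GC thread handling it.

    live_objects: list giving the node that currently holds each object.
    gc_node_of: function mapping an object index to the GC thread's node.
    Returns (fraction of remote scans, objects per fragment after the copy).
    """
    remote_scans = 0
    fragments = Counter()
    for index, source_node in enumerate(live_objects):
        gc_node = gc_node_of(index)
        if source_node != gc_node:
            remote_scans += 1      # the scan reads from another node
        fragments[gc_node] += 1    # the copy always writes locally
    return Fraction(remote_scans, len(live_objects)), fragments


# Every (source node, GC node) pair occurs exactly once.
live_objects = [index // NUM_NODES for index in range(NUM_NODES * NUM_NODES)]
remote_scan_fraction, fragments = collect(live_objects, lambda i: i % NUM_NODES)

# Caveat from the slide: one thread allocating for all the others.
skewed = Counter()
allocate(skewed, thread_node=0, count=64)
print(remote_scan_fraction, fragments, skewed)
```

In this model the copies are local by construction and the scans are remote 7 times out of 8, while a single allocating thread fills only its own fragment and loses the balance.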

Synchronization optimizations

Removed a barrier between the GC phases.
Replaced the GC task-queue with a lock-free one.

The synchronization optimizations have an effect under high contention.

[Figure: GC throughput (GB/s, higher is better) vs. GC threads on SPECjbb, for PS, Interleaved, Fragmented, and Fragmented + synchro.]

Effect of Optimizations on the App (GC excluded)

A good balance greatly improves application time.
Locality has only a marginal effect on the application, even though the fragmented space increases locality for the application over the interleaved space (recently allocated objects are the most accessed).

[Figure: application time (lower is better) vs. GC threads, XML Transform from SPECjvm, for PS and the other heap layouts.]

Overall effect (both GC and application)

Optimizations double the application throughput of SPECjbb.
Pause time is more than halved (105 ms to 49 ms).

[Figure: application throughput (ops/ms, higher is better) vs. GC threads on SPECjbb, for PS, Interleaved, Fragmented, and Fragmented + synchro.]

GC scales well with memory-intensive applications

[Figure: PS vs. Fragmented + synchro; panel labels: 3.5GB, 1GB, 2GB, 512MB, 1GB, 2GB.]

Take Away

Previous GCs do not scale because they are not NUMA-aware.
Existing mature GCs can scale with standard parallel programming techniques.
Using NUMA-aware memory layouts should be useful for all GCs (concurrent GCs included).

Most important NUMA effects

1. Balancing memory access

2. Memory locality only helps at high core count


Thank You