23
A Study of Garbage Collector Scalability on Multicores LokeshGidra , Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6

A Study of Garbage Collector Scalability on Multicores

  • Upload
    totie

  • View
    68

  • Download
    0

Embed Size (px)

DESCRIPTION

A Study of Garbage Collector Scalability on Multicores. LokeshGidra , Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6. Garbage collection on multicore hardware. 14/20 most popular languages have GC but GC doesn’t scale on multicore hardware. - PowerPoint PPT Presentation

Citation preview

Page 1: A Study of Garbage Collector Scalability  on  Multicores

A Study of Garbage Collector Scalability on Multicores

LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro

INRIA/University of Paris 6

Page 2: A Study of Garbage Collector Scalability  on  Multicores

2

14/20 most popular languages have GCbut GC doesn’t scale on multicore hardware

Garbage collection on multicore hardware

Lokesh Gidra

Parallel Scavenge/HotSpot scalability on a 48-core machine

GC threads

GC

Thr

ough

put (

GB

/s)

SPECjbb2005 with48 application threads/3.5GB

A Study of the Scalability of Garbage Collectors on Multicores

Degrades after 24 GC threads

Better ↑

Page 3: A Study of Garbage Collector Scalability  on  Multicores

3

Scalability of GC is a bottleneckBy adding new cores, application creates more garbage per time unit

And without GC scalability, the time spent in GC increases

Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores

Lusearch[PLOS’11]

~50% of the time spent in the GCat 48 cores

Page 4: A Study of Garbage Collector Scalability  on  Multicores

4

Where is the problem?

Probably not related to GC design: the problem exists in ALL the GCs of HotSpot 7(both, stop-the-world and concurrent GCs)

What has really changed:Multicores are distributed architectures, not centralized anymore

Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores

Page 5: A Study of Garbage Collector Scalability  on  Multicores

A Study of the Scalability of Garbage Collectors on Multicores 5

From centralized architectures to distributed ones

Lokesh Gidra

A few years ago…

Uniform memory access machines

Now…

Inter-connectNode 0 Node 1

Node 2 Node 3Cores

Non-uniform memory access machines

Cores

Memory

System Bus

Mem

ory

Mem

ory

Mem

ory

Mem

ory

Page 6: A Study of Garbage Collector Scalability  on  Multicores

A Study of the Scalability of Garbage Collectors on Multicores 6

From centralized architectures to distributed onesOur machine: AMD Magny-Cours with 8 nodes and 48 cores

12 GB per node 6 cores per node

Lokesh Gidra

Node 0 Node 1

Node 2 Node 3M

emor

y

Mem

ory

Mem

ory

Mem

ory

Local access: ~ 130 cyclesRemote access: ~350 cycles

Page 7: A Study of Garbage Collector Scalability  on  Multicores

A Study of the Scalability of Garbage Collectors on Multicores 7

Our machine: AMD Magny-Cours with 8 nodes and 48 cores 12 GB per node 6 cores per node

From centralized architectures to distributed ones

Lokesh Gidra

Node 0 Node 1

Node 2 Node 3M

emor

y

Mem

ory

Mem

ory

Mem

ory

Local access: ~ 130 cyclesRemote access: ~350 cycles

#cores = #threadsBetter↓

Com

plet

ion

time

(ms)

Time to perform a fixednumber of reads in //

Page 8: A Study of Garbage Collector Scalability  on  Multicores

A Study of the Scalability of Garbage Collectors on Multicores 8

Our machine: AMD Magny-Cours with 8 nodes and 48 cores 12 GB per node 6 cores per node

From centralized architectures to distributed ones

Lokesh Gidra

Node 0 Node 1

Node 2 Node 3M

emor

y

Mem

ory

Mem

ory

Mem

ory

Local access: ~ 130 cyclesRemote access: ~350 cyclesBetter↓

Com

plet

ion

time

(ms)

Local Access

Time to perform a fixednumber of reads in //

#cores = #threads

Page 9: A Study of Garbage Collector Scalability  on  Multicores

A Study of the Scalability of Garbage Collectors on Multicores 9

Our machine: AMD Magny-Cours with 8 nodes and 48 cores 12 GB per node 6 cores per node

From centralized architectures to distributed ones

Lokesh Gidra

Node 0 Node 1

Node 2 Node 3M

emor

y

Mem

ory

Mem

ory

Mem

ory

Local access: ~ 130 cyclesRemote access: ~350 cyclesBetter↓

Com

plet

ion

time

(ms)

Random access

Time to perform a fixednumber of reads in //

#cores = #threads

Local Access

Page 10: A Study of Garbage Collector Scalability  on  Multicores

A Study of the Scalability of Garbage Collectors on Multicores 10

Our machine: AMD Magny-Cours with 8 nodes and 48 cores 12 GB per node 6 cores per node

From centralized architectures to distributed ones

Lokesh Gidra

Node 0 Node 1

Node 2 Node 3M

emor

y

Mem

ory

Mem

ory

Mem

ory

Local access: ~ 130 cyclesRemote access: ~350 cyclesBetter↓

Com

plet

ion

time

(ms)

Time to perform a fixednumber of reads in //

Single node access

Local Access

#cores = #threads

Random access

Page 11: A Study of Garbage Collector Scalability  on  Multicores

11

Parallel Scavenge Heap Space

Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores

Kernel’s lazy first-touch page allocation policyFirst-touch allocation

policy

Virtual address space

Parallel Scavenge

Page 12: A Study of Garbage Collector Scalability  on  Multicores

12

Parallel Scavenge Heap Space

Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores

Kernel’s lazy first-touch page allocation policy ⇒ initial sequential phase maps most pages on first node

Initial application

thread

First-touch allocation policy

Parallel Scavenge

Page 13: A Study of Garbage Collector Scalability  on  Multicores

13

Parallel Scavenge Heap Space

Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores

Kernel’s lazy first-touch page allocation policy ⇒ initial sequential phase maps most pages on its node

Initial application

thread

First-touch allocation policy

But during the whole execution, the mapping remains as it is

(virtual space reused by the GC)

Parallel Scavenge

A severe problem for generational GCs

Page 14: A Study of Garbage Collector Scalability  on  Multicores

14

Parallel Scavenge Heap Space

Lokesh Gidra

Bad balanceBad locality

First-touch allocation policy

95% on a single node

PS

SpecJBB

GC threads

GC

Thr

ough

put (

GB

/s)

Better ↑

Parallel Scavenge

A Study of the Scalability of Garbage Collectors on Multicores

Page 15: A Study of Garbage Collector Scalability  on  Multicores

15

NUMA-aware heap layouts

Lokesh Gidra

Bad balanceBad locality

First-touch allocation policy

Round-robin allocation policy

Node local objectallocation and copy

95% on a single node

Targets balance Targets locality

Parallel Scavenge Interleaved Fragmented

A Study of the Scalability of Garbage Collectors on Multicores

PS

SpecJBB

GC threads

GC

Thr

ough

put (

GB

/s)

Better ↑

Page 16: A Study of Garbage Collector Scalability  on  Multicores

16

Interleaved heap layout analysis

Lokesh Gidra

Bad balance Perfect balanceBad locality Bad locality

First-touch allocation policy

Round-robin allocation policy

Node local objectallocation and copy

95% on a single node 7/8 remote accesses

PS

Interleaved

SpecJBB

GC threads

GC

Thr

ough

put (

GB

/s)

Better ↑

Parallel Scavenge Interleaved Fragmented

A Study of the Scalability of Garbage Collectors on Multicores

Page 17: A Study of Garbage Collector Scalability  on  Multicores

17

Fragmented heap layout analysis

Lokesh Gidra

Bad balance Perfect balance Good balanceBad locality Bad locality Average locality

Parallel Scavenge Interleaved FragmentedFirst-touch allocation

policyRound-robin

allocation policyNode local object

allocation and copy

95% on a single node 7/8 remote accesses Bad balance if a singlethread allocates for the others

PS

Interleaved

FragmentedSpecJBB7/8 remote scans

100% local copies

GC threads

GC

Thr

ough

put (

GB

/s)

Better ↑

A Study of the Scalability of Garbage Collectors on Multicores

Page 18: A Study of Garbage Collector Scalability  on  Multicores

18

Synchronization optimizationsRemoved a barrier between the GC phasesReplaced the GC task-queue with a lock-free one

Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores

GC

Thr

ough

put (

GB

/s)

PS

Interleaved

FragmentedSpecJBB

Fragmented + synchro

Synchro optimization

has effect with high contention

GC threads

Better ↑

Page 19: A Study of Garbage Collector Scalability  on  Multicores

19

Effect of Optimizations on the App (GC excluded)

A good balance improves a lot application timeLocality has only a marginal effect on application

While fragmented space increases locality for application over interleaved space

(recently allocated objects are the most accessed)Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores

App

licat

ion

time PS

Other heap layouts

XML Transform from SPECjvm

GC threads

Better↓

Page 20: A Study of Garbage Collector Scalability  on  Multicores

20

Overall effect (both GC and application)

Optimizations double the app throughput of SPECjbbPause time divided in half (105ms to 49ms)

Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores

App

licat

ion

thro

ughp

ut (o

ps/m

s)

PS

Fragmented

SpecJBB

Interleaved

Fragmented + synchro

GC threads

Better ↑

Page 21: A Study of Garbage Collector Scalability  on  Multicores

21

GC scales well with memory-intensive applications

Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores

3.5GB 1GB 2GB

512MB1GB2GB

PS Fragmented + synchro

Page 22: A Study of Garbage Collector Scalability  on  Multicores

22

Take Away

Previous GCs do not scale because they are not NUMA-aware Existing mature GCs can scale with standard // programming techniques Using NUMA-aware memory layouts should be useful for all GCs

(concurrent GCs included)

Most important NUMA effects1. Balancing memory access2. Memory locality only helps at high core count

Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores

Page 23: A Study of Garbage Collector Scalability  on  Multicores

23

Take Away

Previous GCs do not scale due to NUMA obliviousness Existing mature GCs can scale with standard // programming techniques Using NUMA-aware memory layouts should be useful for all GCs

(concurrent GCs included)

Most important NUMA effects1. Balancing memory access2. Memory locality at high core count

Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores

Thank You