A Study of Garbage Collector Scalability on Multicores


A Study of Garbage Collector Scalability on Multicores

Lokesh Gidra, Gaël Thomas, Julien Sopena and Marc Shapiro

INRIA/University of Paris 6

2

14 of the 20 most popular languages have a GC, but GC doesn't scale on multicore hardware

Garbage collection on multicore hardware

Lokesh Gidra, A Study of the Scalability of Garbage Collectors on Multicores

Parallel Scavenge/HotSpot scalability on a 48-core machine

[Figure: GC throughput (GB/s, higher is better) vs. number of GC threads; SPECjbb2005 with 48 application threads and a 3.5 GB heap. Throughput degrades beyond 24 GC threads.]

3

Scalability of GC is a bottleneck: by adding new cores, the application creates more garbage per time unit, and without GC scalability, the time spent in GC increases.

Lusearch [PLOS'11]: ~50% of the time is spent in the GC at 48 cores.

4

Where is the problem?

Probably not related to GC design: the problem exists in ALL the GCs of HotSpot 7 (both stop-the-world and concurrent GCs).

What has really changed: multicores are now distributed architectures, no longer centralized ones.

5

From centralized architectures to distributed ones

A few years ago: uniform memory access (UMA) machines, where all cores reach a single memory through the system bus.

Now: non-uniform memory access (NUMA) machines, where several nodes, each with its own cores and local memory, are linked by an interconnect.

[Figure: a UMA machine (cores, system bus, memory) contrasted with a NUMA machine (Node 0 to Node 3, each with cores and memory, joined by an interconnect).]

6

From centralized architectures to distributed ones

Our machine: an AMD Magny-Cours with 8 nodes and 48 cores (12 GB and 6 cores per node).

Local access: ~130 cycles. Remote access: ~350 cycles.

[Figure: the machine's nodes, each with its cores and local memory, connected by the interconnect.]
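The latency gap translates directly into expected access cost. A sketch of the arithmetic, using the approximate figures above (the 7-of-8 remote fraction used in the example corresponds to an 8-node machine with pages spread evenly across nodes; it is an example input, not a measurement):

```c
#include <assert.h>

#define LOCAL_CYCLES  130   /* ~cost of a local memory access */
#define REMOTE_CYCLES 350   /* ~cost of a remote memory access */

/* Expected cycles per access when `remote` out of `total` accesses
 * touch another node's memory (integer average). */
static int expected_cycles(int remote, int total) {
    return (remote * REMOTE_CYCLES + (total - remote) * LOCAL_CYCLES) / total;
}
```

With all accesses local the average stays at 130 cycles; with 7 of 8 accesses remote it rises to about 322 cycles, roughly a 2.5x slowdown per access.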

7

From centralized architectures to distributed ones

Our machine: an AMD Magny-Cours with 8 nodes and 48 cores (12 GB and 6 cores per node). Local access: ~130 cycles. Remote access: ~350 cycles.

[Figure, built up over slides 7-10: completion time (ms, lower is better) of a fixed number of parallel reads vs. number of cores (#cores = #threads), with three curves: local access, random access, and single-node access.]

11

Parallel Scavenge Heap Space

The kernel uses a lazy first-touch page allocation policy.

[Figure: the Parallel Scavenge virtual address space under the first-touch allocation policy.]

12

Parallel Scavenge Heap Space

The kernel's lazy first-touch page allocation policy ⇒ the initial sequential phase maps most pages on the first node.

[Figure: the initial application thread first-touching the heap, so pages are mapped on its node.]

13

Parallel Scavenge Heap Space

The kernel's lazy first-touch page allocation policy ⇒ the initial sequential phase maps most pages on its node, and the mapping then remains as it is for the whole execution (the virtual space is reused by the GC).

A severe problem for generational GCs.

14

Parallel Scavenge Heap Space

Parallel Scavenge uses the first-touch allocation policy: 95% of the pages end up on a single node, giving bad balance and bad locality.

[Figure: GC throughput (GB/s, higher is better) vs. GC threads for Parallel Scavenge (PS) on SPECjbb.]

15

NUMA-aware heap layouts

Parallel Scavenge: first-touch allocation policy; 95% of pages on a single node; bad balance and bad locality.

Interleaved: round-robin allocation policy; targets balance.

Fragmented: node-local object allocation and copy; targets locality.

[Figure: GC throughput (GB/s, higher is better) vs. GC threads for PS on SPECjbb.]

16

Interleaved heap layout analysis

Parallel Scavenge: first-touch allocation policy; 95% of pages on a single node; bad balance, bad locality.

Interleaved: round-robin allocation policy; perfect balance, but still bad locality: 7/8 of the accesses are remote.

[Figure: GC throughput (GB/s, higher is better) vs. GC threads on SPECjbb, PS vs. Interleaved.]
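The 7/8 figure follows directly from the round-robin placement rule: with pages interleaved across 8 nodes, any given node owns only 1 page in 8. A small sketch of the policy (the modulo mapping is the standard interleave rule; the function names are illustrative):

```c
#include <assert.h>

#define NODES 8   /* node count of our Magny-Cours machine */

/* Interleaved layout: page i is placed on node i mod NODES. */
static int page_node(long page_index) {
    return (int)(page_index % NODES);
}

/* Number of pages (out of npages) that are remote for a thread running
 * on `my_node`, assuming uniform access to all pages. */
static int remote_pages(int my_node, long npages) {
    int remote = 0;
    for (long i = 0; i < npages; i++)
        if (page_node(i) != my_node)
            remote++;
    return remote;
}
```

On our 8-node machine, 7 out of every 8 uniformly distributed accesses are remote, whichever node the thread runs on: balance is perfect, locality is not.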

17

Fragmented heap layout analysis

Parallel Scavenge: first-touch allocation policy; 95% of pages on a single node; bad balance, bad locality.

Interleaved: round-robin allocation policy; perfect balance, bad locality (7/8 remote accesses).

Fragmented: node-local object allocation and copy; good balance (degrading only if a single thread allocates for the others) and average locality: 100% local copies, but 7/8 remote scans.

[Figure: GC throughput (GB/s, higher is better) vs. GC threads on SPECjbb: PS, Interleaved, Fragmented.]
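The fragmented layout can be sketched as one bump-pointer allocation space per node: a GC thread copies surviving objects into a chunk on its own node, so copies are always local, and balance suffers only when a single thread allocates on behalf of the others. A minimal sketch of the bookkeeping (the `Arena` type and chunk size are illustrative, and `malloc` stands in for a node-local allocation such as libnuma's `numa_alloc_onnode`):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK 4096

/* One bump-pointer arena per node; a GC thread only copies into the
 * arena of the node it runs on, so every copy is node-local. */
typedef struct {
    unsigned char *base;
    size_t used;
} Arena;

static void arena_init(Arena *a) {
    a->base = malloc(CHUNK);   /* stand-in for a node-local chunk */
    a->used = 0;
}

/* Bump allocation; a real GC would grab a fresh local chunk on NULL. */
static void *arena_alloc(Arena *a, size_t n) {
    if (a->base == NULL || a->used + n > CHUNK)
        return NULL;
    void *p = a->base + a->used;
    a->used += n;
    return p;
}
```

Scans still follow references into other threads' arenas (hence the 7/8 remote scans), but the copy itself, the dominant write traffic, stays local.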

18

Synchronization optimizations: removed a barrier between the GC phases, and replaced the GC task queue with a lock-free one.

The synchronization optimization has an effect under high contention.

[Figure: GC throughput (GB/s, higher is better) vs. GC threads on SPECjbb: PS, Interleaved, Fragmented, Fragmented + synchro.]
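One standard way to obtain a lock-free task container is a Treiber stack, where a compare-and-swap on the head replaces the lock. This is a sketch of the idea, not necessarily the structure used in HotSpot, and it ignores the ABA problem that a production queue must handle (e.g. with version counters or safe memory reclamation):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct Task {
    int id;
    struct Task *next;
} Task;

static _Atomic(Task *) head = NULL;

/* Lock-free push: retry the CAS until our node becomes the new head.
 * On failure, atomic_compare_exchange_weak reloads `old` for us. */
static void push(Task *t) {
    Task *old = atomic_load(&head);
    do {
        t->next = old;
    } while (!atomic_compare_exchange_weak(&head, &old, t));
}

/* Lock-free pop: returns NULL when the stack is empty. */
static Task *pop(void) {
    Task *old = atomic_load(&head);
    while (old != NULL &&
           !atomic_compare_exchange_weak(&head, &old, old->next))
        ;
    return old;
}
```

Under high contention, threads that lose the race simply retry the CAS instead of sleeping on a lock, which is where the gain shown above comes from.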

19

Effect of Optimizations on the App (GC excluded)

A good balance improves application time a lot, while locality has only a marginal effect on the application, even though the fragmented space gives the application better locality than the interleaved space (recently allocated objects are the most accessed).

[Figure: application time (lower is better) vs. GC threads, PS vs. the other heap layouts; XML Transform from SPECjvm.]

20

Overall effect (both GC and application)

The optimizations double the application throughput of SPECjbb, and the pause time is divided by two (105 ms to 49 ms).

[Figure: application throughput (ops/ms, higher is better) vs. GC threads on SPECjbb: PS, Interleaved, Fragmented, Fragmented + synchro.]

21

GC scales well with memory-intensive applications

[Figure: GC throughput of PS vs. Fragmented + synchro for heap sizes of 512 MB, 1 GB, 2 GB, and 3.5 GB.]

22

Take Away

Previous GCs do not scale because they are not NUMA-aware. Existing mature GCs can scale with standard parallel-programming techniques, and NUMA-aware memory layouts should be useful for all GCs (concurrent GCs included).

Most important NUMA effects: 1. balancing memory accesses; 2. memory locality, which only helps at high core count.


Thank You
