Polar Opposites: Next Generation Languages & Architectures

Kathryn S McKinley
The University of Texas at Austin

Collaborators

• Faculty
  – Steve Blackburn, Doug Burger, Perry Cheng, Steve Keckler, Eliot Moss,
• Graduate Students
  – Xianglong Huang, Sundeep Kushwaha, Aaron Smith, Zhenlin Wang (MTU)
• Research Staff
  – Jim Burrill, Sam Guyer, Bill Yoder

Computing in the Twenty-First Century

• New and changing architectures
  – Hitting the microprocessor wall
  – TRIPS - an architecture for future technology
• Object-oriented languages
  – Java and C# becoming mainstream
• Key challenges and approaches
  – Memory gap, parallelism
  – Language & runtime implementation efficiency
  – Orchestrating a new software/hardware dance
  – Break down artificial system boundaries

Technology Scaling Hitting the Wall

[Figure: chip with a 20 mm edge at 130 nm, 100 nm, 70 nm, and 35 nm; panels: Analytically …, Qualitatively …]

Either way … partitioning for on-chip communication is key

End of the Road for Out-of-Order SuperScalars

• Clock ride is over
  – Wire and pipeline limits
  – Quadratic out-of-order issue logic
  – Power, a first order constraint
• Major vendors ending processor lines
• Problems for any architectural solution
  – ILP - instruction level parallelism
  – Memory latency

Where are Programming Languages?

• High Productivity Languages
  – Java, C#, Matlab, S, Python, Perl
• High Performance Languages
  – C/C++, Fortran
• Why not both in one?
  – Interpretation/JIT vs compilation
  – Language representation
    • Pointers, arrays, frequent method calls, etc.
  – Automatic memory management costs

Obscure ILP and memory behavior

Outline

• TRIPS
  – Next generation tiled EDGE architecture
  – ILP compilation model
• Memory system performance
  – Garbage collection influence
  – The GC advantage
    • Locality, locality, locality
    • Online adaptive copying
  – Cooperative software/hardware caching

TRIPS

• Project Goals
  – Fast clock & high ILP in future technologies
  – Architecture sustains 1 TRIPS in 35 nm technology
  – Cost-performance scalability
  – Find the right hardware/software balance
    • New balance reduces hardware complexity & power
  – New compiler responsibilities & challenges
• Hardware/Software Prototype
  – Proof-of-concept of scalability and configurability
  – Technology transfer

TRIPS Prototype Architecture

[Figure: execution substrate. Labels: execution array of execution nodes (columns 0 to 3), register banks, I-cache 0 to 3, I-cache H, D-cache/LSQ 0 to 3, global control, branch predictor.]

Interconnect topology & latency exposed to compiler scheduler

Large Instruction Window

[Figure: execution node with control, router, and instruction buffers (opcode, src1, src2). Out-of-order instruction buffers form a logical “z-dimension” in each node: 4 logical frames of 4 x 4 instructions.]

• Instruction buffers add depth to execution array
  – 2D array of ALUs; 3D volume of instructions
• Entire 3D volume exposed to compiler

Execution Model

• SPDI - static placement, dynamic issue
  – Dataflow within a block
  – Sequential between blocks
• TRIPS compiler challenges
  – Create large blocks of instructions
    • Single entry, multiple exit, predication
  – Schedule blocks of instructions on a tile
  – Resource limitations
    • Registers, Memory operations

Block Execution Model

• Program execution
  – Fetch and map block to TRIPS grid
  – Execute block, produce result(s)
  – Commit results
  – Repeat
• Block dataflow execution
  – Each cycle, execute a ready instruction at every node
  – Single read of registers and memory locations
  – Single write of registers and memory locations
  – Update the PC to successor block
• TRIPS core may speculatively execute multiple blocks (as well as instructions)
• TRIPS uses branch prediction and register renaming between blocks, but not within a block
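
The block execution steps above can be sketched roughly in Python: each instruction fires once its operands have arrived, and the block's register writes are committed together at the end. The instruction format, the `run_block` name, and the example block are hypothetical illustrations, not the TRIPS ISA, and the sketch ignores placement, speculation, and the one-instruction-per-node limit.

```python
import operator

# Hypothetical block format: name -> (op, sources). A source is either
# ("reg", r) for a block input read from the register file, or
# ("inst", name) for a value forwarded from another instruction.
OPS = {"add": operator.add, "mul": operator.mul, "sub": operator.sub}

def run_block(block, outputs, regs):
    """Execute one block in dataflow order and commit its register writes.

    block   : dict name -> (op, [sources])
    outputs : dict register -> instruction whose value is written at commit
    regs    : dict register -> value
    Returns a list of "cycles", each a sorted list of instructions fired.
    """
    values = {}              # temporaries that never touch the register file
    pending = dict(block)
    cycles = []
    while pending:
        # An instruction is ready when every forwarded operand has arrived.
        ready = [n for n, (_, srcs) in pending.items()
                 if all(kind == "reg" or s in values for kind, s in srcs)]
        if not ready:
            raise ValueError("block has a dependence cycle")
        results = {}
        for n in ready:
            op, srcs = pending[n]
            args = [regs[s] if kind == "reg" else values[s] for kind, s in srcs]
            results[n] = OPS[op](*args)
        values.update(results)   # results become visible in the next cycle
        for n in ready:
            del pending[n]
        cycles.append(sorted(ready))
    # Commit: all register writes happen together, after the block completes.
    for r, n in outputs.items():
        regs[r] = values[n]
    return cycles

regs = {"r1": 2, "r2": 3, "r3": 4}
block = {
    "i0": ("add", [("reg", "r1"), ("reg", "r2")]),    # 2 + 3
    "i1": ("mul", [("reg", "r3"), ("reg", "r3")]),    # 4 * 4
    "i2": ("sub", [("inst", "i1"), ("inst", "i0")]),  # 16 - 5
}
cycles = run_block(block, {"r4": "i2"}, regs)
print(cycles, regs["r4"])   # [['i0', 'i1'], ['i2']] 11
```

In this sketch `i0` and `i1` fire in the same cycle because they are independent, while `i2` waits for both; the intermediate values stay in `values` and only `r4` is written at commit.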

Just Right Division of Labor

• TRIPS architecture
  – Eliminates short-term temporaries
  – Out-of-order execution at every node in grid
  – Exploits ILP, hides unpredictable latencies
    • without superscalar quadratic hardware
    • without VLIW guarantees of completion time
• Scale compiler - generate ILP
  – Large hyperblocks - predicate, unroll, inline, etc.
  – Schedule hyperblocks
    • Map independent instructions to different nodes
    • Map communicating instructions to same or close nodes
  – Let hardware deal with unpredictable latencies (loads)

Exploits Hardware and Compiler Strengths
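
To make the scheduling bullets concrete, here is a minimal greedy placement sketch in Python. It is not the Scale compiler's scheduler: the `place` function, the one-instruction-per-node simplification, and the Manhattan-distance cost are assumptions chosen only to show how communicating instructions can be pulled toward their producers on a grid.

```python
from itertools import product

def place(instrs, deps, rows=4, cols=4):
    """Greedy placement of a block's instructions onto a rows x cols grid.

    instrs : instruction names in dependence (topological) order
    deps   : dict name -> list of producer names
    Each node holds one instruction in this sketch. An instruction goes to
    the free node with the smallest total Manhattan distance to the nodes
    of its already placed producers.
    """
    free = list(product(range(rows), range(cols)))   # row-major node order
    where = {}
    for name in instrs:
        if not free:
            raise ValueError("block larger than the grid")
        producers = [where[p] for p in deps.get(name, [])]

        def cost(node):
            return sum(abs(node[0] - r) + abs(node[1] - c) for r, c in producers)

        best = min(free, key=cost)   # ties resolved by row-major order
        free.remove(best)
        where[name] = best
    return where

# a and b are independent; c consumes both; d consumes c.
deps = {"c": ["a", "b"], "d": ["c"]}
layout = place(["a", "b", "c", "d"], deps)
print(layout)
```

Independent instructions `a` and `b` land on different nodes, and `d` ends up adjacent to its producer `c`. A real scheduler would also weigh register bank and cache positions and multiple frames per node.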

High Productivity Programming Languages

• Interpretation/JIT vs compilation
• Language representation
  – Pointers, arrays, frequent method calls, etc.
• Automatic memory management costs

MMTk in IBM Jikes RVM – ICSE’04, SIGMETRICS’04
  – Memory Management Toolkit for Java
  – High Performance, Extensible, Portable
  – Mark-Sweep, Copying SemiSpace, Reference Counting
  – Generational collection, Beltway, etc.

Allocation Choices

• Bump-Pointer
  – Fast (increment & bounds check)
  – Can't incrementally free & reuse: must free en masse
• Free-List
  – Relatively slow (consult list for fit)
  – Can incrementally free & reuse cells

Allocation Choices

• Bump pointer
  – ~70 bytes IA32 instructions, 726MB/s
• Free list
  – ~140 bytes IA32 instructions, 654MB/s
• Bump pointer 11% faster in tight loop
  – < 1% in practical setting
  – No significant difference (?)
• Second order effects?
  – Locality??
  – Collection mechanism??
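
The two allocation disciplines can be contrasted with a small Python sketch. The class names and the first-fit policy are illustrative assumptions and do not reflect MMTk's implementation; the free list here also does no coalescing.

```python
class BumpAllocator:
    """Contiguous allocation: advance a cursor and check the limit."""

    def __init__(self, size):
        self.cursor, self.limit = 0, size

    def alloc(self, nbytes):
        if self.cursor + nbytes > self.limit:    # bounds check
            return None
        addr = self.cursor
        self.cursor += nbytes                    # increment
        return addr

    def release_all(self):
        self.cursor = 0                          # can only free en masse


class FreeListAllocator:
    """First-fit free list of (address, size) cells."""

    def __init__(self, size):
        self.free = [(0, size)]

    def alloc(self, nbytes):
        for i, (addr, size) in enumerate(self.free):   # consult list for fit
            if size >= nbytes:
                if size == nbytes:
                    del self.free[i]
                else:
                    self.free[i] = (addr + nbytes, size - nbytes)
                return addr
        return None

    def release(self, addr, nbytes):
        self.free.append((addr, nbytes))         # cell is reusable right away


bump = BumpAllocator(64)
a, b = bump.alloc(16), bump.alloc(16)

fl = FreeListAllocator(64)
x, y, z = fl.alloc(16), fl.alloc(16), fl.alloc(32)
fl.release(y, 16)
reused = fl.alloc(16)    # takes the cell that y just gave back
print(a, b, x, y, z, reused)   # 0 16 0 16 32 16
```

The bump allocator places consecutive allocations next to each other, which is where the locality question comes from; the free list can recycle an individual cell immediately. The quoted throughputs are consistent with the 11% figure, since 726 / 654 is about 1.11.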

Implications for Locality

• Compare SS & MS mutator
  – Mutator time
  – Mutator memory performance: L1, L2 & TLB

[Figures: MarkSweep vs SemiSpace for javac, pseudojbb (jbb), and db. Panels per benchmark: mutator time, L1 misses, L2 misses, TLB misses. x-axis: Normalized Heap Size, ticks from 1 to 6.]

Locality & Architecture

[Figures: MS/SS Crossover, built up machine by machine: 1.6GHz PPC, 1.9GHz AMD, 2.6GHz P4, 3.2GHz P4, with SemiSpace and MarkSweep curves for each. x-axis: Heap Size Relative to Minimum (1 to 6); y-axis: Normalized Total Time. The final plot marks the crossover for each clock speed and carries the labels "locality" and "space".]

Locality in Memory Management

• Explicit memory management on its way out
  – Key GC vs Explicit MM insights 20 yrs old
  – Technology has changed and is changing
• Generational and Beltway Collectors
  – Significant collection time benefits over full heap collectors
  – Collect young objects
  – Infrequently collect old space
  – Copying nursery attains similar locality effects as full heap

Where are the Misses?

[Figure: _209_db with the Generational Copying Collector. Hits and misses by space: Boot Image, Immortal, LOS, Older Gen, Nursery. Axis: Total Accesses (in millions), 0 to 2000.]

Copy Order

• Static copy orders
  – Breadth first - Cheney scan
  – Depth first, hierarchical
  – Problem: one size does not fit all
• Static profiling per class
  – Inconsistent with JIT
• Object sampling
  – Too expensive in our experience
• OOR - Online Object Reordering
  – OOPSLA’04
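
The difference between the two static copy orders shows up on even a tiny object graph. The sketch below only computes the order in which objects would be copied; the function names and the example tree are illustrative assumptions, and no actual copying or forwarding pointers are modelled.

```python
from collections import deque

def breadth_first_order(roots, children):
    """Copy order of a Cheney-style scan: referents are copied as each
    already copied object is scanned, in first-in first-out order."""
    order, seen, queue = [], set(), deque()
    for r in roots:
        if r not in seen:
            seen.add(r)
            order.append(r)
            queue.append(r)
    while queue:
        obj = queue.popleft()
        for c in children.get(obj, []):
            if c not in seen:
                seen.add(c)
                order.append(c)
                queue.append(c)
    return order

def depth_first_order(roots, children):
    """Copy order when each object's referents are followed immediately."""
    order, seen = [], set()

    def visit(obj):
        if obj in seen:
            return
        seen.add(obj)
        order.append(obj)
        for c in children.get(obj, []):
            visit(c)

    for r in roots:
        visit(r)
    return order

# A small binary tree: 1 -> 2, 3; 2 -> 4, 5; 3 -> 6, 7.
tree = {1: [2, 3], 2: [4, 5], 3: [6, 7]}
bfs = breadth_first_order([1], tree)
dfs = depth_first_order([1], tree)
print(bfs)   # [1, 2, 3, 4, 5, 6, 7]
print(dfs)   # [1, 2, 4, 5, 3, 6, 7]
```

Breadth first places siblings together and separates parents from grandchildren; depth first keeps one parent-to-child path contiguous. Which is better depends on how the program traverses the structure, which is the "one size does not fit all" problem.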

OOR Overview

• Records object accesses in each method (excludes cold basic blocks)
• Finds hot methods by dynamic sampling
• Reorders objects with hot fields in higher generation during GC
• Copies hot objects into separate region

Static Analysis Example

Method Foo {
  Class A a;
  try {
    …=a.b;
    …
  } catch(Exception e){
    …a.c
  }
}

• Compiler, hot BB: collect access info
• Compiler, cold BB: ignore
• Access List: 1. A.b 2. …. ….

Adaptive Sampling

• Same method Foo as in the static analysis example
• Adaptive sampling: Foo is hot
• Foo Accesses: 1. A.b 2. …. ….
• A.b is hot
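
The way the static access lists and the sampler combine can be sketched in a few lines of Python. The `hot_fields` function, the fixed sample threshold, and the extra method `Bar` with field `C.d` are hypothetical; the slides do not say how Jikes RVM decides that a method is hot.

```python
from collections import Counter

def hot_fields(samples, access_lists, threshold):
    """Return the fields accessed by methods sampled at least `threshold` times.

    samples      : iterable of method names observed by the sampler
    access_lists : dict method -> fields accessed in its hot basic blocks
    """
    counts = Counter(samples)
    hot_methods = [m for m, n in counts.items() if n >= threshold]
    hot = set()
    for m in hot_methods:
        hot.update(access_lists.get(m, []))
    return hot

# Foo's list holds A.b only: the access to A.c sits in a cold catch block
# and was never recorded by the compiler.
access_lists = {"Foo": ["A.b"], "Bar": ["C.d"]}
samples = ["Foo", "Foo", "Bar", "Foo"]
result = hot_fields(samples, access_lists, threshold=3)
print(result)   # {'A.b'}
```

Only `Foo` reaches the threshold, so only its recorded access `A.b` is reported as hot.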

Advice Directed Reordering

• Example
  – Assume (1,4), (4,7) and (2,6) are hot field accesses
  – Order: 1,4,7,2,6 : 3,5
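
One simple traversal that reproduces the order in this example is sketched below. It is an assumption made for illustration: the slide gives neither the full object graph nor the collector's actual traversal, so the `advised_order` function simply follows hot edges first and leaves every other object for a second group.

```python
def advised_order(objects, hot_edges):
    """Split objects into a hot-first copy order and a remaining cold order.

    objects   : object ids in their default visiting order
    hot_edges : (source, target) pairs reported as hot field accesses
    """
    succ = {}
    for src, dst in hot_edges:
        succ.setdefault(src, []).append(dst)

    hot_order, placed = [], set()

    def follow(obj):
        if obj in placed:
            return
        placed.add(obj)
        hot_order.append(obj)
        for nxt in succ.get(obj, []):    # chase hot references immediately
            follow(nxt)

    for obj in objects:
        if obj in succ:                  # start only from hot sources
            follow(obj)
    cold_order = [o for o in objects if o not in placed]
    return hot_order, cold_order

hot, cold = advised_order(range(1, 8), [(1, 4), (4, 7), (2, 6)])
print(hot, cold)   # [1, 4, 7, 2, 6] [3, 5]
```

Objects connected by hot accesses end up adjacent (1,4,7 and 2,6), while 3 and 5 follow afterwards.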

OOR System Overview

[Figure: components: Source Code, Baseline Compiler, Executing Code, Adaptive Sampling, Optimizing Compiler, Hot Methods, Access Info Database, GC: Copies Objects. Edge labels: Register Hot Field Accesses, Adds Entries, Look Up, Advice, Affects Locality. Legend: OOR addition, Jikes RVM, Input/Output.]

Cost of OOR

Benchmark    Default      OOR   Difference
jess            4.39     4.43      0.84%
jack            5.79     5.82      0.57%
raytrace        4.63     4.61     -0.59%
mtrt            4.95     4.99      0.70%
javac          12.83    12.70     -1.05%
compress        8.56     8.54     -0.20%
pseudojbb      13.39    13.43      0.36%
db             18.88    18.88     -0.03%
antlr           0.94     0.91     -2.90%
gcold           1.21     1.23      1.49%
hsqldb        160.56   158.46     -1.30%
ipsixql        41.62    42.43      1.93%
jython         37.71    37.16     -1.44%
ps-fun        129.24   128.04     -1.03%
Mean                              -0.19%

[Figures: Performance for db, jython, and javac]

Software is not enough
Hardware is not enough

• Problem: inefficient use of cache
• Hardware limitations: set associativity, cannot predict the future
• Cooperative Software/Hardware Caching
  – Combines high level compiler analysis with dynamic miss behavior
• Lightweight ISA support conveys compiler’s global view to hardware
  – Compiler-guided cache replacement (evict-me)
  – Compiler-guided region prefetching
  – ISCA’03, PACT’02
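
A toy model of compiler-guided replacement is sketched below. The slides name evict-me but do not describe the mechanism, so the details are assumptions: one evict-me bit per cache line, set by a hinted access, with plain LRU as the fallback when no line in the set is marked. The `EvictMeSet` class is hypothetical.

```python
from collections import OrderedDict

class EvictMeSet:
    """One cache set with LRU replacement plus a per-line evict-me bit."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()    # tag -> evict-me bit, oldest first

    def access(self, tag, evict_me=False):
        """Touch `tag`; return the evicted tag, or None if nothing was evicted."""
        victim = None
        if tag in self.lines:
            del self.lines[tag]                   # hit: re-insert as most recent
        elif len(self.lines) == self.ways:        # miss in a full set
            marked = [t for t, bit in self.lines.items() if bit]
            # Prefer a line the compiler marked; otherwise fall back to LRU.
            victim = marked[0] if marked else next(iter(self.lines))
            del self.lines[victim]
        self.lines[tag] = evict_me
        return victim

s = EvictMeSet(ways=2)
s.access("A")
s.access("B", evict_me=True)    # hint: B is not expected to be reused soon
victim = s.access("C")          # plain LRU would evict A; the hint evicts B
print(victim, list(s.lines))    # B ['A', 'C']
```

The hint lets software override the hardware's recency guess in exactly the case where the compiler knows more about future reuse than the cache does.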

Exciting Times

• Dramatic architectural changes
  – Execution tiles
  – Cache & Memory tiles
• Next generation system solutions
  – Moving hardware/software boundaries
  – Online optimizations
  – Key compiler challenges (same old…): ILP and Cache Memory Hierarchy
