A High-Performance Java Dialect

Kathy Yelick, Luigi Semenzato, Geoff Pike, Carleton Miyamoto, Ben Liblit, Arvind Krishnamurthy, Paul Hilfinger, Susan Graham, David Gay, Phil Colella, and Alex Aiken

Computer Science Division, University of California at Berkeley
and Lawrence Berkeley National Laboratory



What is Titanium?

• A practical language and system for high-performance parallel scientific computing
  – both shared and distributed-memory architectures
  – based on Java
• A platform for compiler and language experiments
  – parallel and cache optimizations
  – domain-specific language extensions

• Future directions for Java?

Practical Language Design

• Leverage existing culture
  – C-like languages
  – FORTRAN arrays
• Leverage existing design
• Small language, small compiler
  – no interpreter
  – compile into C
• No heroism
  – rely on well-understood techniques
  – treat advanced optimizations as a convenience rather than a necessity

[Figure labels: Java, Titanium, other high-performance languages]

Priorities

• Performance
  – consistently close to C/FORTRAN + MPI
    • currently: 10%-80% slower
    • aiming for 10%-20%
• Safety
  – as safe as Java
    • ease of programming
    • better optimizations
• Expressiveness
  – add small set of essential features
• Compatibility, interoperability, etc.
  – no gratuitous departures from Java standard

New Language Features

• Scalable parallelism
  – SPMD model of execution with global address space
• Multidimensional arrays
  – also: points and index sets as first-class values
  – multidimensional iterators
• Memory management
  – semi-automated zone-based allocation
• Other
  – Immutable classes
  – Operator overloading

Model of Parallelism

• Single Program, Multiple Data
  – fixed number of processes
  – each process has own local data
  – global synchronization (barrier)

[Figure: n processes run from start to end, synchronizing at each barrier]
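The SPMD model above can be imitated in plain Java as a rough sketch. Here threads stand in for Titanium processes and `java.util.concurrent.CyclicBarrier` stands in for the global barrier; the class name `SpmdSketch` and the two-phase workload are assumptions made for illustration, not part of Titanium.

```java
import java.util.Arrays;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

class SpmdSketch {
    // Waits at the barrier, converting checked exceptions so the lambda stays short.
    static void await(CyclicBarrier barrier) {
        try {
            barrier.await();
        } catch (InterruptedException | BrokenBarrierException e) {
            throw new IllegalStateException(e);
        }
    }

    // Runs the same program on n threads (hypothetical stand-ins for processes).
    static int[] run(int n) {
        CyclicBarrier barrier = new CyclicBarrier(n);
        int[] local = new int[n];    // slot i stands in for process i's local data
        int[] result = new int[n];
        Thread[] procs = new Thread[n];
        for (int i = 0; i < n; i++) {
            final int me = i;
            procs[i] = new Thread(() -> {
                local[me] = me * me;                 // phase 1: purely local work
                await(barrier);                      // all processes finish phase 1
                result[me] = local[(me + 1) % n];    // phase 2: read a neighbor's value
                await(barrier);                      // all processes finish phase 2
            });
            procs[i].start();
        }
        for (Thread t : procs) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(4)));  // prints [1, 4, 9, 0]
    }
}
```

Every thread executes the same code and passes the same two barriers, which is the property the slides go on to enforce statically.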

Global Synchronization Analysis

• In Titanium, processes must synchronize at the same textual instances of barrier()

    doThis();
    barrier();
    boolean x = someCondition();   // x may differ from process to process
    if (x) {
        doThat();
        barrier();                 // reached only by processes where x is true
    }
    doSomeMore();
    barrier();


• Singleness analysis statically guarantees correctness by restricting the values of variables that control program flow

    doThis();
    barrier();
    boolean single x = someCondition();
    if (x) {
        doThat();
        barrier();
    }
    doSomeMore();
    barrier();
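To see why the controlling value has to be the same on every process, the following Java sketch records which textual barriers one process passes for a given value of x, then checks whether all processes agree. It models only the control flow of the example above; `BarrierAlignment`, `trace`, and `aligned` are hypothetical names, and the check runs at run time, whereas Titanium's singleness analysis is static.

```java
import java.util.ArrayList;
import java.util.List;

class BarrierAlignment {
    // Textual barrier sites one process passes, numbered in program order.
    static List<Integer> trace(boolean x) {
        List<Integer> sites = new ArrayList<>();
        sites.add(1);          // barrier after doThis()
        if (x) {
            sites.add(2);      // barrier inside the if
        }
        sites.add(3);          // barrier after doSomeMore()
        return sites;
    }

    // True when every process passes the same barrier sites in the same order.
    static boolean aligned(boolean[] xPerProcess) {
        List<Integer> first = trace(xPerProcess[0]);
        for (boolean x : xPerProcess) {
            if (!trace(x).equals(first)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(aligned(new boolean[] { true, true, true }));   // prints true
        System.out.println(aligned(new boolean[] { true, false, true }));  // prints false
    }
}
```

When x is identical everywhere, as a `single` value is, the traces always match; when it differs, one process waits at a barrier the others never reach.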

Global Address Space

• Each process has its own heap
• References can span process boundaries

    class T { … }

    T gv;
    T lv = null;
    if (thisProc() == 0) {
        lv = new T();             // allocate locally
    }
    gv = broadcast lv from 0;     // distribute
    … gv.field ...

[Figure: Process 0 and the other processes each hold lv and gv; each process has its own local heap]
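The broadcast pattern above can be approximated in Java as a loose sketch. Threads again stand in for processes, and an `AtomicReference` plus a barrier stands in for `broadcast lv from 0`; these choices, the class name `BroadcastSketch`, and the field value 42 are assumptions for illustration. Java threads share a single heap, so this shows only how one reference becomes visible to all processes, not Titanium's per-process heaps.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicReference;

class BroadcastSketch {
    static class T {
        int field = 42;   // arbitrary value, chosen for the example
    }

    static void await(CyclicBarrier barrier) {
        try {
            barrier.await();
        } catch (InterruptedException | BrokenBarrierException e) {
            throw new IllegalStateException(e);
        }
    }

    // Each of the n threads returns what it reads through its copy of gv.
    static int[] run(int n) {
        CyclicBarrier barrier = new CyclicBarrier(n);
        AtomicReference<T> slot = new AtomicReference<>();  // stand-in for the broadcast
        int[] seen = new int[n];
        Thread[] procs = new Thread[n];
        for (int i = 0; i < n; i++) {
            final int me = i;
            procs[i] = new Thread(() -> {
                T lv = null;
                if (me == 0) {
                    lv = new T();     // only "process 0" allocates
                    slot.set(lv);
                }
                await(barrier);       // the value is available once everyone arrives
                T gv = slot.get();    // every process now refers to process 0's object
                seen[me] = gv.field;
            });
            procs[i].start();
        }
        for (Thread t : procs) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        for (int v : run(3)) {
            System.out.println(v);
        }
    }
}
```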

Global vs. Local References

• Global references may be slow
  – distributed memory: overhead of a few instructions when using a global reference to access a local object
  – shared memory: no performance implications
• Solution: use local qualifier
  – statically restrict references to local objects
  – example: T local lv = null;
  – use only in critical sections

Arrays, Points, Domains

• Fast, expressive arrays
  – multidimensional
  – lower bound, upper bound, stride
  – concise indexing: A[p] instead of A(i, j, k)
• Points
  – tuple of integers as primitive type
• Domains
  – sets of points
    • rectangular (bounds and stride)
    • general (arbitrary set)
• Multidimensional iterators

Example: Point, RectDomain, Array

    Point<2> lb = [1, 1];
    Point<2> ub = [10, 20];
    RectDomain<2> R = [lb : ub : [2, 2]];
    double [2d] A = new double[R];    // (no distributed arrays)
    …
    foreach (p in A.domain()) {
        A[p] = B[2 * p];
    }

Standard optimizations:
• strength reduction
• common subexpression elimination
• invariant code motion
• removing bounds checks from body
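As a rough illustration of what the `foreach` over a strided rectangular domain visits, this Java sketch enumerates the points of `[lb : ub : stride]` in two dimensions. It assumes both bounds are inclusive; `RectDomainSketch` and `points` are hypothetical names, not Titanium library calls.

```java
import java.util.ArrayList;
import java.util.List;

class RectDomainSketch {
    // Enumerates the points of [lb : ub : stride], assuming inclusive bounds.
    static List<int[]> points(int[] lb, int[] ub, int[] stride) {
        List<int[]> pts = new ArrayList<>();
        for (int i = lb[0]; i <= ub[0]; i += stride[0]) {
            for (int j = lb[1]; j <= ub[1]; j += stride[1]) {
                pts.add(new int[] { i, j });
            }
        }
        return pts;
    }

    public static void main(String[] args) {
        // Same bounds and stride as R in the example above.
        List<int[]> r = points(new int[] { 1, 1 }, new int[] { 10, 20 }, new int[] { 2, 2 });
        System.out.println(r.size());  // prints 50
    }
}
```

With stride 2 the first coordinate takes the values 1, 3, 5, 7, 9 and the second takes 1, 3, ..., 19, which gives 5 × 10 = 50 points.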

Example: Domain

    Point<2> lb = [0, 0];
    Point<2> ub = [6, 4];
    RectDomain<2> R = [lb : ub : [2, 2]];
    …
    Domain<2> red = R + (R + [1, 1]);
    foreach (p in red) {
        …
    }

[Figure: R spans (0, 0) to (6, 4); R + [1, 1] spans (1, 1) to (7, 5); red spans (0, 0) to (7, 5). Use: Gauss-Seidel relaxation with red-black ordering]
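The construction of `red` can be checked with a small Java sketch that builds R and R + [1, 1] as sets of points and takes their union. It assumes inclusive bounds and reads `+` between two domains as set union and `+` with a point as translation; `RedDomainSketch`, `rect`, and `red` are names invented for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RedDomainSketch {
    // Points of [lb : ub : stride] translated by (dx, dy); bounds assumed inclusive.
    static Set<List<Integer>> rect(int[] lb, int[] ub, int[] stride, int dx, int dy) {
        Set<List<Integer>> pts = new HashSet<>();
        for (int x = lb[0]; x <= ub[0]; x += stride[0]) {
            for (int y = lb[1]; y <= ub[1]; y += stride[1]) {
                pts.add(List.of(x + dx, y + dy));
            }
        }
        return pts;
    }

    // Union of R and R shifted by [1, 1], using the bounds from the example above.
    static Set<List<Integer>> red() {
        int[] lb = { 0, 0 };
        int[] ub = { 6, 4 };
        int[] stride = { 2, 2 };
        Set<List<Integer>> red = rect(lb, ub, stride, 0, 0);   // R
        red.addAll(rect(lb, ub, stride, 1, 1));                // R + [1, 1]
        return red;
    }

    public static void main(String[] args) {
        System.out.println(red().size());  // prints 24
    }
}
```

Every point in the result has an even coordinate sum, which is the checkerboard half of the grid that a red-black ordering updates in one sweep.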

Memory Management

• Distributed GC
  – too unpredictable
• Zone-based memory management
  – extends existing model
  – good performance
  – safe
  – easy to use

Zone-Based Memory Management

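    Zone Z1 = new Zone();
    Zone Z2 = new Zone();
    T x = new(Z1) T();
    T y = new(Z2) T();
    x.field = y;
    x = y;
    delete Z1;
    delete Z2;    // error: x and y still refer to the object in Z2

[Figure: zones Z1 and Z2 holding the objects referenced by x and y]

• Allocate objects in zones
• Release zones manually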
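One plausible way to model the safety check behind `delete` is to count the references that point into a zone from outside it and refuse the deletion while that count is nonzero. The Java sketch below replays the statements above with hand-maintained counts. It is an assumption-laden illustration, not a description of how Titanium implements zones, and `ZoneSketch`, `addRef`, and `dropRef` are hypothetical names.

```java
class ZoneSketch {
    static class Zone {
        private int externalRefs = 0;    // references into this zone from outside it
        private boolean deleted = false;

        void addRef()  { externalRefs++; }
        void dropRef() { externalRefs--; }

        // Refuses to free the zone while something outside still points into it.
        boolean delete() {
            if (externalRefs > 0) {
                return false;
            }
            deleted = true;
            return true;
        }

        boolean isDeleted() { return deleted; }
    }

    // Replays the example; returns {delete Z1 succeeded, delete Z2 succeeded}.
    static boolean[] replay() {
        Zone z1 = new Zone();
        Zone z2 = new Zone();
        z1.addRef();                  // T x = new(Z1) T();  x points into Z1
        z2.addRef();                  // T y = new(Z2) T();  y points into Z2
        z2.addRef();                  // x.field = y;        an object in Z1 points into Z2
        z1.dropRef();                 // x = y;              x no longer points into Z1 ...
        z2.addRef();                  //                     ... and now points into Z2
        boolean d1 = z1.delete();     // nothing outside Z1 points into it
        if (d1) {
            z2.dropRef();             // freeing Z1 removes its pointer into Z2
        }
        boolean d2 = z2.delete();     // x and y still point into Z2
        return new boolean[] { d1, d2 };
    }

    public static void main(String[] args) {
        boolean[] r = replay();
        System.out.println(r[0] + " " + r[1]);  // prints true false
    }
}
```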

Immutable Classes

• User-definable “primitive” type
  – same reason for primitive types in Java: performance
• No inheritance
  – does not inherit from Object
  – final
  – all (non-static) fields are final
• Example: complex numbers
• Used internally for Point<n>
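The complex-number example can be approximated in standard Java with a final class whose fields are all final. This is only a sketch of the idea: unlike a Titanium immutable class, the Java version still inherits from Object and its instances are ordinary heap objects. The method names `plus` and `times` are assumptions, standing in for the overloaded operators Titanium would allow.

```java
final class Complex {
    final double re;
    final double im;

    Complex(double re, double im) {
        this.re = re;
        this.im = im;
    }

    // Returns a new value; existing instances never change.
    Complex plus(Complex o) {
        return new Complex(re + o.re, im + o.im);
    }

    Complex times(Complex o) {
        return new Complex(re * o.re - im * o.im, re * o.im + im * o.re);
    }

    public static void main(String[] args) {
        Complex p = new Complex(1, 2).times(new Complex(3, 4));
        System.out.println(p.re + " " + p.im);  // prints -5.0 10.0
    }
}
```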

Other Features

• Operator overloading
  – useful to scientific programmers
• Parameterized types
  – will conform to standard

Implementation

• Strategy
  – compile Titanium into C (currently C++)
  – POSIX threads for SMPs (currently Solaris threads)
  – Libsplit-c for communication
    • Active Messages
• Status
  – runs on Sun Enterprise 8-way SMP
  – runs on Berkeley NOW
  – trivial ports to half a dozen other architectures
  – tuning for sequential performance

Applications

• Three-D AMR Poisson Solver (AMR3D)
  – block-structured grids
  – 2000-line program
  – algorithm not yet fully implemented in other languages
  – tests performance and effectiveness of language features
• Three-D Electromagnetic Waves (EM3D)
  – unstructured grids
• Several smaller benchmarks

Current Performance

Sequential performance

                C/C++/FORTRAN   Java Arrays   Titanium Arrays   Overhead
DAXPY           1.4s            6.8s          1.5s              7%
3D multigrid    12s                           22s               83%
2D multigrid    5.4s                          6.2s              15%
EM3D            0.7s            1.8s          1.0s              42%

Parallel performance (speedups)

number of processors   1   2     4     8
EM3D                   1   2     4     8
AMR3D                  1   1.8   2.6   3.9
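The overhead figures are consistent with dividing the Titanium time by the C/C++/FORTRAN time and subtracting one. The Java sketch below recomputes them from the rounded times, assuming the pairing DAXPY 1.4s/1.5s, 3D multigrid 12s/22s, 2D multigrid 5.4s/6.2s, and EM3D 0.7s/1.0s. `OverheadSketch` and its methods are hypothetical names. Because the inputs are rounded, the recomputed values can differ from the quoted percentages by up to about one point.

```java
class OverheadSketch {
    // Percentage by which the Titanium time exceeds the baseline time.
    static double overheadPercent(double baseSeconds, double titaniumSeconds) {
        return (titaniumSeconds / baseSeconds - 1.0) * 100.0;
    }

    // Parallel efficiency: speedup divided by the number of processors.
    static double efficiency(double speedup, int processors) {
        return speedup / processors;
    }

    public static void main(String[] args) {
        System.out.printf("DAXPY        %.1f%%%n", overheadPercent(1.4, 1.5));
        System.out.printf("3D multigrid %.1f%%%n", overheadPercent(12.0, 22.0));
        System.out.printf("2D multigrid %.1f%%%n", overheadPercent(5.4, 6.2));
        System.out.printf("EM3D         %.1f%%%n", overheadPercent(0.7, 1.0));
        System.out.printf("AMR3D efficiency on 8 processors: %.2f%n", efficiency(3.9, 8));
    }
}
```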

Conclusions

• Java is a good base language
  – easily extended
  – compilation reasonably simple
• High performance is possible
  – explicit parallelism
  – advanced array features
  – rely on simple, well-understood optimizations
• Essence of Java is preserved
  – small
  – safe


Incompatibilities

• Threads
  – no threads for the time being
    • coexisting threads and processes are difficult to design
• Exceptions
  – run-time errors such as out-of-bounds indexing halt the program instead of throwing an exception
    • throwing exceptions prevents optimizations that reorder code