A High-Performance Java Dialect
Kathy Yelick, Luigi Semenzato, Geoff Pike, Carleton Miyamoto, Ben Liblit, Arvind Krishnamurthy, Paul Hilfinger,
Susan Graham, David Gay, Phil Colella, and Alex Aiken
Computer Science Division, University of California at Berkeley
and
Lawrence Berkeley National Laboratory
What is Titanium?
• A practical language and system for high-performance parallel scientific computing
  – both shared- and distributed-memory architectures
  – based on Java
• A platform for compiler and language experiments
  – parallel and cache optimizations
  – domain-specific language extensions
• Future directions for Java?
Practical Language Design
• Leverage existing culture
  – C-like languages
  – FORTRAN arrays
• Leverage existing design
• Small language, small compiler
  – no interpreter
  – compile into C
• No heroism
  – rely on well-understood techniques
  – treat advanced optimizations as a convenience rather than a necessity
[Figure labels: Java, Titanium, other high-performance languages]
Priorities
• Performance
  – consistently close to C/FORTRAN + MPI
    • currently: 10%-80% slower
    • aiming for 10%-20%
• Safety
  – as safe as Java
    • ease of programming
    • better optimizations
• Expressiveness
  – add small set of essential features
• Compatibility, interoperability, etc.
  – no gratuitous departures from Java standard
New Language Features
• Scalable parallelism
  – SPMD model of execution with global address space
• Multidimensional arrays
  – also: points and index sets as first-class values
  – multidimensional iterators
• Memory management
  – semi-automated zone-based allocation
• Other
  – Immutable classes
  – Operator overloading
Model of Parallelism
• Single Program, Multiple Data
  – fixed number of processes
  – each process has own local data
  – global synchronization (barrier)
[Figure: n processes run from start to end, meeting at a sequence of barriers along the way]
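The SPMD model above can be imitated in plain Java. The following is a minimal sketch, not Titanium code and not its runtime: it assumes threads stand in for processes, uses `java.util.concurrent.CyclicBarrier` in place of `barrier()`, and the class name `SpmdSketch` and its `run` method are hypothetical. Every thread runs the same code, writes only its own slot in phase 1, and reads all slots in phase 2 after the barrier.

```java
import java.util.Arrays;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Hypothetical illustration of SPMD execution with global barriers.
class SpmdSketch {
    // Runs the same program on n threads and returns the sum each one computed.
    static long[] run(int n) {
        final long[] local = new long[n];   // slot i stands in for process i's local data
        final long[] seen = new long[n];    // sum computed by each process in phase 2
        final CyclicBarrier barrier = new CyclicBarrier(n);
        Thread[] procs = new Thread[n];
        for (int i = 0; i < n; i++) {
            final int me = i;
            procs[i] = new Thread(() -> {
                try {
                    local[me] = me + 1;     // phase 1: write own data only
                    barrier.await();        // all processes finish phase 1 first
                    long sum = 0;
                    for (long v : local) {  // phase 2: read everyone's data
                        sum += v;
                    }
                    seen[me] = sum;
                    barrier.await();        // end of phase 2
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new RuntimeException(e);
                }
            });
            procs[i].start();
        }
        for (Thread t : procs) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(4)));  // prints [10, 10, 10, 10]
    }
}
```

The number of threads is fixed when the barrier is created, which mirrors the "fixed number of processes" bullet.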
Global Synchronization Analysis
• In Titanium, processes must synchronize at the same textual instances of barrier()
doThis();
barrier();
boolean x = someCondition();
if (x) {
    doThat();
    barrier();
}
doSomeMore();
barrier();
• Singleness analysis statically guarantees correctness by restricting the values of variables that control program flow
doThis();
barrier();
boolean single x = someCondition();
if (x) {
    doThat();
    barrier();
}
doSomeMore();
barrier();
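Java has no `single` qualifier, so the property can only be illustrated, not checked. The sketch below is an assumption-laden analogue: the hypothetical `SingleSketch.run` passes the same `x` to every thread, so all threads agree on whether to take the conditional barrier, and each thread records which textual barrier it passed. If `x` could differ between threads, some threads would wait at the conditional barrier while others waited at the final one, which is the mismatch the singleness analysis is meant to rule out statically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Hypothetical illustration: a branch guarded by a value that is the same on every thread.
class SingleSketch {
    // Returns, for each thread, the labels of the barriers it passed, in order.
    static List<List<String>> run(int n, final boolean x) {
        final CyclicBarrier barrier = new CyclicBarrier(n);
        final List<List<String>> traces = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            traces.add(new ArrayList<String>());
        }
        Thread[] procs = new Thread[n];
        for (int i = 0; i < n; i++) {
            final List<String> trace = traces.get(i);  // each thread owns one trace
            procs[i] = new Thread(() -> {
                try {
                    barrier.await();
                    trace.add("b1");
                    if (x) {                 // x is identical everywhere, like a single value
                        barrier.await();
                        trace.add("b2");
                    }
                    barrier.await();
                    trace.add("b3");
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new RuntimeException(e);
                }
            });
            procs[i].start();
        }
        for (Thread t : procs) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return traces;
    }

    public static void main(String[] args) {
        System.out.println(run(3, true));
        System.out.println(run(3, false));
    }
}
```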
Global Address Space
• Each process has its own heap
• References can span process boundaries

class T { … }

T gv;
T lv = null;
if (thisProc() == 0) {
    lv = new T();           // allocate locally
}
gv = broadcast lv from 0;   // distribute
… gv.field ...
[Figure: variables lv and gv on Process 0 and on the other processes; each process has its own LOCAL HEAP]
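The effect of the broadcast can be mimicked in Java, with one important caveat: all threads in a JVM share a single heap, so this sketch reproduces only the visible behavior (every process ends up holding a reference to the object that process 0 allocated), not the distributed heaps. The `BroadcastSketch` class, the `AtomicReference` used as the broadcast slot, and the value 42 are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration of "gv = broadcast lv from 0" using threads.
class BroadcastSketch {
    static class T {
        int field;
    }

    // Returns the value of gv.field as observed by each of the n threads.
    static int[] run(int n) {
        final AtomicReference<T> slot = new AtomicReference<>();  // stands in for the broadcast
        final CyclicBarrier barrier = new CyclicBarrier(n);
        final int[] observed = new int[n];
        Thread[] procs = new Thread[n];
        for (int i = 0; i < n; i++) {
            final int me = i;
            procs[i] = new Thread(() -> {
                try {
                    T lv = null;
                    if (me == 0) {
                        lv = new T();        // allocated by process 0 only
                        lv.field = 42;
                        slot.set(lv);        // publish the reference
                    }
                    barrier.await();         // everyone waits for the published value
                    T gv = slot.get();       // every process now refers to the same object
                    observed[me] = gv.field;
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new RuntimeException(e);
                }
            });
            procs[i].start();
        }
        for (Thread t : procs) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return observed;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(4)));  // prints [42, 42, 42, 42]
    }
}
```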
Global vs. Local References
• Global references may be slow
  – distributed memory: overhead of a few instructions when using a global reference to access a local object
  – shared memory: no performance implications
• Solution: use local qualifier
  – statically restrict references to local objects
  – example: T local lv = null;
  – use only in performance-critical sections
Arrays, Points, Domains
• Fast, expressive arrays
  – multidimensional
  – lower bound, upper bound, stride
  – concise indexing: A[p] instead of A(i, j, k)
• Points
  – tuple of integers as primitive type
• Domains
  – sets of points
    • rectangular (bounds and stride)
    • general (arbitrary set)
• Multidimensional iterators
Example: Point, RectDomain, Array
Point<2> lb = [1, 1];
Point<2> ub = [10, 20];
RectDomain<2> R = [lb : ub : [2, 2]];
double [2d] A = new double[R];   // (no distributed arrays)
…
foreach (p in A.domain()) {
    A[p] = B[2 * p];
}

Standard optimizations:
• strength reduction
• common subexpression elimination
• invariant code motion
• removing bounds checks from body
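To make the rectangular domain concrete, here is a rough Java analogue of `[lb : ub : [2, 2]]`. It is a sketch under stated assumptions: both bounds are treated as inclusive, points are visited in row-major order (the slide does not specify an order), and the names `RectDomainSketch`, `points`, and `flatIndex` are hypothetical. The `flatIndex` helper shows how an array over such a domain could map a point to a slot in dense storage using the lower bound and stride.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a strided 2D rectangular domain.
class RectDomainSketch {
    // Enumerates the points of [lb : ub : stride], treating both bounds as inclusive.
    static List<int[]> points(int[] lb, int[] ub, int[] stride) {
        List<int[]> out = new ArrayList<>();
        for (int i = lb[0]; i <= ub[0]; i += stride[0]) {
            for (int j = lb[1]; j <= ub[1]; j += stride[1]) {
                out.add(new int[] {i, j});
            }
        }
        return out;
    }

    // Maps a point of the domain to a slot in a dense row-major backing array.
    static int flatIndex(int[] p, int[] lb, int[] ub, int[] stride) {
        int cols = (ub[1] - lb[1]) / stride[1] + 1;
        return ((p[0] - lb[0]) / stride[0]) * cols + (p[1] - lb[1]) / stride[1];
    }

    public static void main(String[] args) {
        int[] lb = {1, 1};
        int[] ub = {10, 20};
        int[] stride = {2, 2};
        List<int[]> domain = points(lb, ub, stride);
        double[] a = new double[domain.size()];
        for (int[] p : domain) {             // plays the role of foreach (p in A.domain())
            a[flatIndex(p, lb, ub, stride)] = p[0] + p[1];
        }
        System.out.println(domain.size() + " points");  // prints "50 points"
    }
}
```

With a stride of 2 the first dimension contributes 1, 3, 5, 7, 9 and the second 1, 3, ..., 19, hence 5 × 10 = 50 points.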
Example: Domain
Point<2> lb = [0, 0];
Point<2> ub = [6, 4];
RectDomain<2> R = [lb : ub : [2, 2]];
…
Domain<2> red = R + (R + [1, 1]);
foreach (p in red) {
    …
}

[Figure: R spans (0, 0) to (6, 4); R + [1, 1] spans (1, 1) to (7, 5); their union, red, spans (0, 0) to (7, 5)]
Gauss-Seidel relaxation with red-black ordering
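The construction of `red` can be checked with ordinary Java collections. This sketch assumes that adding a point to a domain translates it and that adding two domains takes their union, as the figure suggests; `RedDomainSketch` and its methods are hypothetical names. Every point in the result has an even coordinate sum, which is what makes it one color of a red-black (checkerboard) ordering.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of red = R + (R + [1, 1]).
class RedDomainSketch {
    // Points of a strided rectangle with inclusive bounds.
    static Set<List<Integer>> rect(int[] lb, int[] ub, int[] stride) {
        Set<List<Integer>> d = new LinkedHashSet<>();
        for (int x = lb[0]; x <= ub[0]; x += stride[0]) {
            for (int y = lb[1]; y <= ub[1]; y += stride[1]) {
                d.add(Arrays.asList(x, y));
            }
        }
        return d;
    }

    // Translates every point of d by (dx, dy).
    static Set<List<Integer>> shift(Set<List<Integer>> d, int dx, int dy) {
        Set<List<Integer>> out = new LinkedHashSet<>();
        for (List<Integer> p : d) {
            out.add(Arrays.asList(p.get(0) + dx, p.get(1) + dy));
        }
        return out;
    }

    // Union of R and R shifted by (1, 1), with R = [(0, 0) : (6, 4) : (2, 2)].
    static Set<List<Integer>> red() {
        Set<List<Integer>> r = rect(new int[] {0, 0}, new int[] {6, 4}, new int[] {2, 2});
        Set<List<Integer>> red = new LinkedHashSet<>(r);
        red.addAll(shift(r, 1, 1));
        return red;
    }

    public static void main(String[] args) {
        System.out.println(red().size() + " red points");  // prints "24 red points"
    }
}
```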
Memory Management
• Distributed GC
  – too unpredictable
• Zone-based memory management
  – extends existing model
  – good performance
  – safe
  – easy to use
Zone-Based Memory Management
Zone Z1 = new Zone();
Zone Z2 = new Zone();
T x = new(Z1) T();
T y = new(Z2) T();
x.field = y;
x = y;
delete Z1;
delete Z2;   // error

[Figure: zones Z1 and Z2, with x and y referring to the objects allocated in them]

• Allocate objects in zones
• Release zones manually
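The slide does not say how the erroneous `delete Z2` is detected. One way to picture it, purely as an assumption, is a count of live variables that refer into each zone: a zone may be deleted only when that count is zero. The toy model below replays the slide's sequence with explicit `retain` and `release` calls; the `ZoneSketch` class and all of its methods are hypothetical and do not describe Titanium's implementation.

```java
// Hypothetical toy model of zone deletion guarded by a reference count.
class ZoneSketch {
    static class Zone {
        private int externalRefs = 0;   // live variables pointing into this zone
        private boolean deleted = false;

        void retain() {
            externalRefs++;
        }

        void release() {
            externalRefs--;
        }

        void delete() {
            if (externalRefs > 0) {
                throw new IllegalStateException("zone is still referenced");
            }
            deleted = true;
        }

        boolean isDeleted() {
            return deleted;
        }
    }

    // Replays the slide's sequence; true if deleting Z1 succeeds and deleting Z2 is rejected.
    static boolean replay() {
        Zone z1 = new Zone();
        Zone z2 = new Zone();
        z1.retain();                    // T x = new(Z1) T();
        z2.retain();                    // T y = new(Z2) T();
        // x.field = y;  this pointer lives inside Z1 and disappears together with Z1
        z1.release();                   // x = y;  x no longer refers into Z1 ...
        z2.retain();                    //         ... and now refers into Z2
        z1.delete();                    // fine: nothing refers into Z1 any more
        try {
            z2.delete();                // x and y still refer into Z2
            return false;
        } catch (IllegalStateException expected) {
            return z1.isDeleted() && !z2.isDeleted();
        }
    }

    public static void main(String[] args) {
        System.out.println(replay());   // prints true
    }
}
```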
Immutable Classes
• User-definable “primitive” type
  – same reason for primitive types in Java: performance
• No inheritance
  – does not inherit from Object
  – final
  – all (non-static) fields are final
• Example: complex numbers
• Used internally for Point<n>
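The complex-number example can be approximated in standard Java with a final class whose fields are all final. This is only an approximation of the properties listed above: unlike the immutable classes described here, the Java class below still inherits from Object and its instances are ordinary heap objects. The class and method names are chosen for illustration.

```java
// Approximation in standard Java: final class, all fields final, value-style arithmetic.
final class Complex {
    final double re;
    final double im;

    Complex(double re, double im) {
        this.re = re;
        this.im = im;
    }

    // Returns a new value; the operands are never modified.
    Complex plus(Complex o) {
        return new Complex(re + o.re, im + o.im);
    }

    Complex times(Complex o) {
        return new Complex(re * o.re - im * o.im, re * o.im + im * o.re);
    }

    @Override
    public String toString() {
        return re + " + " + im + "i";
    }

    public static void main(String[] args) {
        Complex a = new Complex(1, 2);
        Complex b = new Complex(3, 4);
        System.out.println(a.times(b));   // prints -5.0 + 10.0i
    }
}
```

With operator overloading, the calls `a.plus(b)` and `a.times(b)` could be written as `a + b` and `a * b`.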
Other Features
• Operator overloading
  – useful to scientific programmers
• Parameterized types
  – will conform to standard
Implementation
• Strategy
  – compile Titanium into C (currently C++)
  – Posix threads for SMPs (currently Solaris threads)
  – Libsplit-c for communication
    • Active Messages
• Status
  – runs on Sun Enterprise 8-way SMP
  – runs on Berkeley NOW
  – trivial ports to half a dozen other architectures
  – tuning for sequential performance
Applications
• Three-D AMR Poisson Solver (AMR3D)
  – block-structured grids
  – 2000-line program
  – algorithm not yet fully implemented in other languages
  – tests performance and effectiveness of language features
• Three-D Electromagnetic Waves (EM3D)
  – unstructured grids
• Several smaller benchmarks
Current Performance
Sequential performance

               C/C++/FORTRAN   Java Arrays   Titanium Arrays   Overhead
DAXPY          1.4s            6.8s          1.5s              7%
3D multigrid   12s             -             22s               83%
2D multigrid   5.4s            -             6.2s              15%
EM3D           0.7s            1.8s          1.0s              42%
Parallel performance (speedups)

Number of processors   1   2     4     8
EM3D                   1   2     4     8
AMR3D                  1   1.8   2.6   3.9
Conclusions
• Java is a good base language
  – easily extended
  – compilation reasonably simple
• High performance is possible
  – explicit parallelism
  – advanced array features
  – rely on simple, well-understood optimizations
• Essence of Java is preserved
  – small
  – safe