Safe and Efficient Cluster Communication in Java using Explicit Memory Management

Chi-Chao Chang
Dept. of Computer Science, Cornell University
Goal

High-performance cluster computing with safe languages: parallel and distributed applications.

Use off-the-shelf technologies:
- Java
  - safe: a "better C++"
  - "write once, run everywhere"
  - growing interest for high-performance applications (Java Grande)
- User-level network interfaces (UNIs)
  - direct, protected access to network devices
  - prototypes: U-Net (Cornell), Shrimp (Princeton), FM (UIUC)
  - industry standard: Virtual Interface Architecture (VIA)
  - cost-effective clusters: new 256-processor cluster @ Cornell TC
Java Networking

Traditional "front-end" approach:
- pick your favorite abstraction (sockets, RMI, MPI) and Java VM
- write a Java front-end to custom or existing native libraries
- good performance, re-use of proven code
- magic in native code, no common solution

Interface Java with network devices:
- bottom-up approach
- minimizes the amount of unverified code
- focus on fundamental data transfer inefficiencies due to:
  1. Storage safety
  2. Type safety

[Figure: software stack: Apps; RMI, RPC; Sockets; Active Messages, MPI, FM; UNI; networking devices; split between Java and C layers]
Outline

- Thesis Overview: GC/native heap separation, object serialization
- Experimental Setup: VI Architecture and Marmot
- Part I: Array Transfers
  (1) Javia-I: Java interface to the VI Architecture (respects heap separation)
  (2) Jbufs: safe and explicit management of buffers (Javia-II, matrix multiplication, Active Messages)
- Part II: Object Transfers
  (3) A case for specialization (micro-benchmarks, RMI using Javia-I/II, impact on application suite)
  (4) Jstreams: in-place de-serialization (micro-benchmarks, RMI using Javia-III, impact on application suite)
- Conclusions
(1) Storage Safety

Java programs are garbage-collected:
- no explicit de-allocation: the GC tracks and frees garbage objects
- programs are oblivious to the GC scheme used: non-copying (e.g. conservative) or copying
- no control over the location of objects

Modern network and I/O devices:
- DMA directly from/into user buffers
- native code is necessary to interface with hardware devices
(1) Storage Safety

[Figure: application memory with GC heap, native heap, NI, and RAM under two schemes: (a) hard separation, copy-on-demand (copy + pin, GC stays ON); (b) optimization, pin-on-demand (pin only, GC turned OFF)]

Pin-on-demand only works for send/write operations; for receive/read operations, GC must be disabled indefinitely...

Result: hard separation between the GC and native heaps.
(1) Storage Safety: Effect

[Figure: throughput (MB/s) vs. transfer size (0-32 KB) for C raw, Java copy, and Java pin]

Best-case scenario: a 10-40% hit in throughput. Pick your favorite JVM, your fastest network interface, and a pair of 450 MHz P-IIs with a commodity OS: pinning on demand is expensive...
(2) Type Safety

Cannot forge a reference to a Java object. Given b, an array of 1024 bytes:

in C:
    double *data = (double *)b;

in Java:
    double[] data = new double[1024/8];
    for (int i = 0, off = 0; i < 1024/8; i++, off += 8) {
      int upper = (((b[off]   & 0xff) << 24) +
                   ((b[off+1] & 0xff) << 16) +
                   ((b[off+2] & 0xff) << 8) +
                    (b[off+3] & 0xff));
      int lower = (((b[off+4] & 0xff) << 24) +
                   ((b[off+5] & 0xff) << 16) +
                   ((b[off+6] & 0xff) << 8) +
                    (b[off+7] & 0xff));
      data[i] = Double.longBitsToDouble((((long)upper) << 32) +
                                        (lower & 0xffffffffL));
    }
(2) Type Safety

Objects have meta-data: runtime safety checks (array-bounds, array-store, casts).

In C:
    struct Buffer {
      int len;
      char data[1];
    };
    struct Buffer *b = malloc(sizeof(struct Buffer) + 1024);
    b->len = 1024;

In Java:
    class Buffer {
      int len;
      byte[] data;
      Buffer(int n) { data = new byte[n]; len = n; }
    }
    Buffer b = new Buffer(1024);

[Figure: in-memory layout of b: Buffer vtable, lock, len = 1024, plus a separate byte[] object with its own vtable, lock, and length 1024]
(2) Type Safety

Result: Java objects need to be serialized and de-serialized across the network.

[Figure: application memory with GC heap and native heap; objects are serialized, copied, and pinned before DMA by the NI]
(2) Type Safety: Effect

Performance hit of one order of magnitude: pick your favorite high-level communication abstraction (e.g. Remote Method Invocation), your favorite JVM, your fastest network interface, and a pair of 450 MHz P-IIs.

[Figure: round-trip latency (us) vs. transfer size (0-8 KB) for C raw, Java copy, and Java RMI copy]
Thesis

Use explicit memory management to improve Java communication performance:
- Jbufs: safe and explicit management of Java buffers
  - softens the GC/native heap separation
  - preserves type and storage safety
  - "zero-copy" array transfers
- Jstreams: extends Jbufs to optimize serialization in clusters
  - "zero-copy" de-serialization of arbitrary objects

[Figure: application memory with a user-controlled boundary between the GC heap and the native heap; pinned buffers are DMAed directly by the NI]
Outline

- Thesis Overview: GC/native heap separation, object serialization
- Experimental Setup: Giganet cluster and Marmot
- Part I: Array Transfers
  (1) Javia-I: Java interface to the VI Architecture (respects heap separation)
  (2) Jbufs: safe and explicit management of buffers (Javia-II, matrix multiplication, Active Messages)
- Part II: Object Transfers
  (3) A case for specialization (micro-benchmarks, RMI using Javia-I/II, impact on application suite)
  (4) Jstreams: in-place de-serialization (micro-benchmarks, RMI using Javia-III, impact on application suite)
- Conclusions
Giganet Cluster

Configuration:
- 8 P-II 450 MHz, 128 MB RAM
- 8 1.25 Gbps Giganet GNN-1000 adapters
- one Giganet switch

GNN-1000 adapter: user-level network interface; the Virtual Interface Architecture implemented as a library (Win32 DLL).

Baseline point-to-point performance: 14 us round-trip latency (16 us with the switch); over 100 MB/s peak (85 MB/s with the switch).
Marmot

Java system from Microsoft Research:
- not a VM
- static compiler: bytecode (.class) to x86 (.asm)
- linker: asm files + runtime libraries -> executable (.exe)
- no dynamic loading of classes
- most Dragon-book optimizations, some OO and Java-specific optimizations

Advantages:
- source code available
- good performance
- two types of non-concurrent GC (copying, conservative)
- native interface "close enough" to JNI
Outline

- Thesis Overview: GC/native heap separation, object serialization
- Experimental Setup: Giganet cluster and Marmot
- Part I: Array Transfers
  (1) Javia-I: Java interface to the VI Architecture (respects heap separation)
  (2) Jbufs: safe and explicit management of buffers (Javia-II, matrix multiplication, Active Messages)
- Part II: Object Transfers
  (3) A case for specialization (micro-benchmarks, RMI using Javia-I/II, impact on application suite)
  (4) Jstreams: in-place de-serialization (micro-benchmarks, RMI using Javia-III, impact on application suite)
- Conclusions
Javia-I

Basic architecture:
- respects heap separation
- buffer management in native code
- Marmot as an "off-the-shelf" system: copying GC disabled while in native code
- primitive array transfers only

Send/recv API:
- non-blocking and blocking
- bypass ring accesses
- pin-on-demand
- alloc-recv: allocates a new array on demand
- cannot eliminate copying during recv

[Figure: Java-side Vi object with a send/recv ticket ring and byte array references in the GC heap; C-side send/recv queues, descriptors, and buffers over VIA]
Javia-I: Performance

[Figure: round-trip latency (us) and bandwidth (MB/s) vs. transfer size for raw, copy(s), pin(s), copy(s)+alloc(r), and pin(s)+alloc(r)]

Basic costs (PII-450, Windows 2000 beta 3): pin + unpin = (10 + 10) us, or ~5000 machine cycles.
Marmot: native call = 0.28 us, locks = 0.25 us, array alloc = 0.75 us.

Latency (N = transfer size in bytes):
  raw               16.5 us + (25 ns) * N
  pin(s)            38.0 us + (38 ns) * N
  copy(s)           21.5 us + (42 ns) * N
  copy(s)+alloc(r)  18.0 us + (55 ns) * N

BW: 75% to 85% of raw for 16 KB.
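The linear cost models above can be evaluated directly to see how the fixed overhead and the per-byte cost trade off at a given transfer size. A small sketch using the slide's coefficients:

```java
public class LatencyModel {
    // Latency model from the slide: fixed overhead (us) + per-byte cost (ns) * N.
    static double latencyUs(double baseUs, double perByteNs, int n) {
        return baseUs + perByteNs * n / 1000.0;
    }

    public static void main(String[] args) {
        int n = 8 * 1024; // 8 KB transfer
        System.out.printf("raw      %.1f us%n", latencyUs(16.5, 25, n)); // 221.3
        System.out.printf("pin(s)   %.1f us%n", latencyUs(38.0, 38, n));
        System.out.printf("copy(s)  %.1f us%n", latencyUs(21.5, 42, n));
    }
}
```

At large transfers the per-byte slope dominates, which is why copy(s), with the steepest slope, falls furthest behind raw despite a modest fixed overhead.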
jbufs

Goal: provide buffer management capabilities to Java without violating its safety properties. Re-use is important: it amortizes the high pinning costs.

A jbuf exposes communication buffers to Java programmers:
1. lifetime control: explicit allocation and de-allocation
2. efficient access: direct access as primitive-typed arrays
3. location control: safe de-allocation and re-use by controlling whether or not a jbuf is part of the GC heap

The heap separation becomes soft and user-controlled.
jbufs: Lifetime Control

1. jbuf allocation does not result in a Java reference to it: the jbuf cannot be accessed from the wrapper object.
2. A jbuf is not automatically freed when there are no Java references to it: free has to be called explicitly.

public class jbuf {
  public static jbuf alloc(int bytes); /* allocates a jbuf outside of the GC heap */
  public void free() throws CannotFreeException; /* frees the jbuf if it can */
}

[Figure: a handle outside the GC heap pointing to the jbuf]
jbufs: Efficient Access

3. (Storage safety) A jbuf remains allocated as long as there are array references to it. When can we ever free it?
4. (Type safety) A jbuf cannot have two differently typed references to it at any given time. When can we ever re-use it (e.g. change its reference type)?

public class jbuf {
  /* alloc and free omitted */
  public byte[] toByteArray() throws TypedException; /* hands out byte[] ref */
  public int[] toIntArray() throws TypedException;   /* hands out int[] ref */
  . . .
}

[Figure: a Java byte[] reference from the GC heap into the jbuf]
jbufs: Location Control

Idea: use the GC to track references.

unRef: the application claims it has no references into the jbuf; the jbuf is added to the GC heap; the GC verifies the claim and notifies the application through a callback; the application can then free or re-use the jbuf.

Required GC support: change the scope of the GC heap dynamically.

public class jbuf {
  /* alloc, free, toArrays omitted */
  public void unRef(CallBack cb); /* app intends to free/re-use jbuf */
}

[Figure: three stages: the jbuf referenced from outside the GC heap; after unRef, the jbuf is added to the GC heap; after the GC callBack, the jbuf is detached again]
jbufs: Runtime Checks

Type safety: the ref and to-be-unref states are parameterized by primitive type.

The GC* transition depends on the type of garbage collector:
- non-copying: transition only if all refs to the array are dropped before the GC
- copying: transition occurs after every GC

[State diagram: states unref, ref<p>, and to-be-unref<p>; transitions alloc, to<p>Array, GC, unRef, GC*, and free]
Javia-II

Exploiting jbufs:
- explicit pinning/unpinning of jbufs
- non-blocking send/recvs only

[Figure: Java-side Vi object with a send/recv ticket ring, jbuf state, and array refs in the GC heap; C-side send/recv queues and descriptors over VIA]
Javia-II: Performance

Basic jbuf costs: allocation = 1.2 us, to*Array = 0.8 us, unRef = 2.3 us, GC degradation = 1.2 us/jbuf.

Latency (N = transfer size in bytes):
  raw      16.5 us + (25 ns) * N
  jbufs    20.5 us + (25 ns) * N
  pin(s)   38.0 us + (38 ns) * N
  copy(s)  21.5 us + (42 ns) * N

BW within 1% of raw.

[Figure: round-trip latency (us) and bandwidth (MB/s) vs. transfer size for raw, jbufs, copy, and pin]
MM: Communication

[Figure: pMM communication time (comm + barrier, msecs) on 8 processors for 64x64 and 256x256 matrices, comparing copy-alloc, copy-async, pin-alloc, pin-async, jbufs, jdk copy-alloc, and jdk copy-async; bars annotated with relative percentages]

pMM over Javia-II/jbufs spends at least 25% less time in communication for 256x256 matrices on 8 processors.
MM: Overall

[Figure: pMM MFLOPS for 64x64 and 256x256 matrices on 2, 4, and 8 processors, comparing copy-alloc, copy-async, pin-alloc, pin-async, jbufs, jdk copy-alloc, and jdk copy-async]

Cache effects: better communication performance does not always translate into better overall performance.
Active Messages

Exercising jbufs:
- the user supplies a list of jbufs
- upon message arrival, a jbuf is passed to the handler
- unRef is invoked after the handler returns
- if the pool is empty, existing jbufs are reclaimed
- copying is deferred to GC time, and only if needed

class First extends AMHandler {
  private int first;
  void handler(AMJbuf buf, …) {
    int[] tmp = buf.toIntArray();
    first = tmp[0];
  }
}

class Enqueue extends AMHandler {
  private Queue q;
  void handler(AMJbuf buf, …) {
    int[] tmp = buf.toIntArray();
    q.enq(tmp);
  }
}
AM: Performance

Latency about 15 us higher than Javia: synchronized access to the buffer pool, endpoint header, flow-control checks, handler id lookup.

BW within 10% of peak for 16 KB messages.

[Figure: round-trip latency (us) and bandwidth (MB/s) vs. transfer size for raw, jbufs, AM jbuf, AM copy, AM pin, and AM copy-alloc]
Jbufs: Experience

Efficient access through arrays is useful:
- no indirect access via method invocation
- promotes code re-use of large numerical kernels
- leverages compiler infrastructure for eliminating safety checks

Limitations:
- still not as flexible as C buffers
- stale references may confuse programmers

Discussed in thesis: the necessity of explicit de-allocation; the implementation of Jbufs in Marmot's copying collector; the impact on conservative and generational collectors; an extension to JNI to allow "portable" implementations of Jbufs.
Outline

- Thesis Overview: GC/native heap separation, object serialization
- Experimental Setup: VI Architecture and Marmot
- Part I: Array Transfers
  (1) Javia-I: Java interface to the VI Architecture (respects heap separation)
  (2) Jbufs: safe and explicit management of buffers (Javia-II, matrix multiplication, Active Messages)
- Part II: Object Transfers
  (3) A case for specialization on homogeneous clusters (micro-benchmarks, RMI using Javia-I/II, impact on application suite)
  (4) Jstreams: in-place de-serialization (micro-benchmarks, RMI using Javia-III, impact on application suite)
- Conclusions
Object Serialization and RMI

Standard JOS protocol:
- "heavy-weight" class descriptors are serialized along with objects
- type-checking: classes need not be "equal", just "compatible"
- the protocol allows for user extensions

Remote Method Invocation:
- object-oriented version of Remote Procedure Call
- relies on JOS for argument passing
- an actual parameter object can be a sub-class of the formal parameter class

[Figure: writeObject copies objects out of the sender's GC heap onto the network; readObject copies them into the receiver's GC heap]
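The writeObject/readObject path pictured above is standard java.io serialization. A minimal round trip makes the costs the thesis targets visible: writeObject copies the object (plus its class descriptor) into a byte stream, and readObject allocates a brand-new object and copies the bytes back in.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class JosRoundTrip {
    // writeObject: copies the object (plus a class descriptor) into a byte stream.
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(o);
        }
        return bos.toByteArray();
    }

    // readObject: allocates a new object and copies the bytes back in,
    // exactly the allocation + copy that in-place de-serialization avoids.
    static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        double[] payload = {1.0, 2.5, -3.0};
        double[] copy = (double[]) deserialize(serialize(payload));
        System.out.println(copy != payload); // true: a distinct, newly allocated array
        System.out.println(copy[1]);         // 2.5
    }
}
```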
JOS Costs

1. Overheads in the tens or hundreds of us: send/recv overheads = ~3 us, memcpy of 500 bytes = ~0.8 us.
2. double[] is 50% more expensive than a byte[] of similar size.
3. Overheads grow as object sizes grow.

[Figure: writeObject and readObject costs (us) in jview, jdk, and marmot for byte[] (100 and 500 elements), double[] (12 and 62), complex[] (per element), and list (per element); off-scale bars annotated with values up to ~275 us]
Impact of Marmot

Impact of Marmot's optimizations:
- method inlining: up to 66% improvement (already deployed)
- no synchronization whatsoever: up to 21% improvement
- no safety checks whatsoever: up to 15% combined

Better compilation technology is unlikely to reduce overheads substantially.
Impact on RMI

An order of magnitude worse than Javia-I/II:
- round-trip latency drops to about 30 us in a null RMI: no JOS!
- peak bandwidth of 22 MB/s, about 25% of raw

[Figure: round-trip latency (us) vs. transfer size for raw, jbufs, RMI jbufs, RMI copy+alloc, RMI pin, and jdk RMI copy; 4-byte round-trip latencies range from 150.4 us to 520.1 us across the RMI/sockets implementations]
Impact on Applications

Application    % comm time (est.)   % total time (est.)
SOR            11.76%               2.73%
FFT arrays     10.90%               5.22%
FFT complex    14.28%               13.73%
EM3D arrays    1.42%                1.37%
pMM            7.64%                5.20%

A case for specializing serialization for cluster applications:
- overheads an order of magnitude higher than send/recv and memcpy
- RMI performance degraded by one order of magnitude
- 5-15% "estimated" impact on applications
- old adage: "specialize for the common case"
Optimizing De-serialization

"In-place" object de-serialization: specialization for homogeneous clusters and JVMs.

Goal: eliminate copying and allocation of objects.

Challenges:
- preserve the integrity of the receiving JVM
- permit de-serialization of arbitrary Java objects, with unrestricted usage and without special annotations
- independent of a particular GC scheme

[Figure: writeObject copies objects out of the sender's GC heap; on the receiver, objects are used in place]

Jstreams: write

writeObject:
- deep-copies objects: maintains the in-memory layout
- deals with cyclic data structures
- swizzles pointers: offsets to a base address
- replaces object meta-data with a 64-bit class descriptor
- optimization: primitive-typed arrays in jbufs are not copied

public class Jstream extends Jbuf {
  public void writeObject(Object o) /* serializes o onto the stream */
    throws TypedException, ReferencedException;
  public void writeClear() /* clears the stream for writing */
    throws TypedException, ReferencedException;
}
Jstreams: read

readObject:
- replaces class descriptors with meta-data
- unswizzles pointers, array-bounds checking
- after the first readObject, the jstream is added to the GC heap
- tracks references coming out of read objects
- unRef: the user is willing to free or re-use the stream

public class Jstream extends Jbuf {
  public Object readObject() throws TypedException; /* de-serialization */
  public boolean isJstream(Object o); /* checks if o resides in the stream */
}

[Figure: read objects referenced from the GC heap; after unRef and the GC callBack, the jstream is detached again]
jstreams: Runtime Checks

Modification to Javia-II: prevent DMA from clobbering de-serialized objects; receive posts are not allowed while a jstream is in read mode; no changes to the Javia-II architecture.

[State diagram: states unref, write mode, read mode, and to-be-unref; transitions alloc, writeObject, writeClear, readObject, GC, unRef, GC*, and free]
jstream: Performance

De-serialization costs are constant w.r.t. object size: 2.6 us for arrays, 3.3 us per list element.

[Figure: readObject costs (us) for JOS jdk, JOS marmot, jstreams marmot, and jstreams (C) on byte[] 100, byte[] 500, double[] 62, and lists (4 and 160 elements); off-scale JOS bars annotated]
jstream: Impact on RMI

4-byte round-trip latency of 45 us (25 us higher than Javia-II); 52 MB/s for 16 KB arguments.

[Figure: bandwidth (MB/s) vs. transfer size for raw, javia-II, AM javia-II, RMI javia-III, and RMI javia-I]
jstream: Impact on Applications

- 3-10% improvement in SOR, EM3D, and FFT
- 10% hit in pMM performance: over 22,000 incoming RMIs with 1000 jstreams in the receive pool cause ~26 garbage collections, 15% of total execution time in GC
  - generational collection would alleviate GC costs substantially
  - the receive pool size is hard to tune: tradeoffs between GC and locality

Application    JOS comm (s)  JOS total (s)  jstreams comm (s)  jstreams total (s)  % improv. comm  % improv. total  % improv. comm (est.)  % improv. total (est.)
SOR            4.59          19.78          3.99               19.08               13.20%          3.52%            11.76%                 2.73%
FFT arrays     2.20          4.60           1.99               4.37                9.50%           4.85%            10.90%                 5.22%
FFT complex    18.30         19.03          16.16              17.26               11.70%          9.30%            14.28%                 13.73%
EM3D arrays    14.82         15.36          14.29              14.83               3.57%           3.40%            1.42%                  1.37%
pMM            190.58        280.00         170.91             307.80              10.32%          -9.93%           7.64%                  5.20%
Jstreams: Experience

Implementation:
- readObject and writeObject integrated into the JVM
- the protocol is JVM-specific
- a native implementation is faster

Limitations:
- not as flexible as Java streams: cannot read and write at the same time
- no "extensible" wire protocols

Discussed in thesis: the implementation of Jstreams in Marmot's copying collector; support for polymorphic RMI (minor changes to the stub compiler); JNI extensions to allow "portable" implementations of Jstreams.
Related Work

Microsoft J-Direct:
- "pinned" arrays defined using source-level annotations
- the JIT produces code to "redirect" array access: expensive
- Berkeley's Jaguar: efficient code generation with JIT extensions; security concern: JIT "hacks" may break Java or byte-code

Custom JVMs:
- many "tricks" are possible (e.g. pinned array factories, pinned and non-pinned heaps), but they depend on a particular GC scheme
- Jbufs: isolates the minimal support needed from the GC

Memory management: Safe Regions (Gay and Aiken): reference counting, no GC.

Fast serialization and RMI:
- KaRMI (Karlsruhe): fixed JOS, ground-up RMI implementation
- Manta (Vrije U): fast RMI, but a Java dialect
Summary

Use of explicit memory management to improve Java communication performance in clusters:
- softens the GC/native heap separation
- preserves type and storage safety
- independent of the GC scheme
- jbufs: zero-copy array transfers
- jstreams: zero-copy de-serialization of arbitrary objects

A framework for building communication software and applications in Java:
- Javia-I/II
- parallel matrix multiplication
- Jam: active messages
- Java RMI
- cluster applications: TSP, IDA, SOR, EM3D, FFT, and MM