Safe and Efficient Cluster Communication in Java using Explicit Memory Management

Chi-Chao Chang
Dept. of Computer Science, Cornell University
Goal

High-performance cluster computing with safe languages: parallel and distributed applications.

Use off-the-shelf technologies:
- Java
  - safe: a "better C++"
  - "write once, run everywhere"
  - growing interest for high-performance applications (Java Grande)
- User-level network interfaces (UNIs)
  - direct, protected access to network devices
  - prototypes: U-Net (Cornell), Shrimp (Princeton), FM (UIUC)
  - industry standard: Virtual Interface Architecture (VIA)
  - cost-effective clusters: new 256-processor cluster @ Cornell TC
Java Networking

Traditional "front-end" approach:
- pick your favorite abstraction (sockets, RMI, MPI) and Java VM
- write a Java front-end to custom or existing native libraries
- good performance, re-use of proven code
- magic in native code, no common solution

Interface Java with network devices:
- bottom-up approach
- minimizes the amount of unverified code
- focus on fundamental data transfer inefficiencies due to:
  1. Storage safety
  2. Type safety

[Figure: software stack: Apps; RMI, RPC; Sockets; Active Messages, MPI, FM; UNI; networking devices; split between Java and C layers]
Outline

- Thesis Overview: GC/native heap separation, object serialization
- Experimental Setup: VI Architecture and Marmot
- Part I: Array Transfers
  (1) Javia-I: Java interface to the VI Architecture (respects heap separation)
  (2) Jbufs: safe and explicit management of buffers (Javia-II, matrix multiplication, Active Messages)
- Part II: Object Transfers
  (3) A case for specialization (micro-benchmarks, RMI using Javia-I/II, impact on application suite)
  (4) Jstreams: in-place de-serialization (micro-benchmarks, RMI using Javia-III, impact on application suite)
- Conclusions
(1) Storage Safety

Java programs are garbage-collected:
- no explicit de-allocation: the GC tracks and frees garbage objects
- programs are oblivious to the GC scheme used: non-copying (e.g. conservative) or copying
- no control over the location of objects

Modern network and I/O devices:
- DMA directly from/into user buffers
- native code is necessary to interface with hardware devices
(1) Storage Safety

[Figure: application memory with GC heap, native heap, NI, and RAM under two schemes: (a) hard separation, copy-on-demand (copy + pin, GC stays ON); (b) optimization, pin-on-demand (pin only, GC turned OFF)]

Pin-on-demand only works for send/write operations; for receive/read operations, GC must be disabled indefinitely...

Result: hard separation between the GC and native heaps.
(1) Storage Safety: Effect

[Figure: throughput (MB/s) vs. transfer size (0-32 KB) for C raw, Java copy, and Java pin]

Best-case scenario: a 10-40% hit in throughput. Pick your favorite JVM, your fastest network interface, and a pair of 450 MHz P-IIs with a commodity OS: pinning on demand is expensive...
(2) Type Safety

Cannot forge a reference to a Java object. Given b, an array of 1024 bytes:

in C:
    double *data = (double *)b;

in Java:
    double[] data = new double[1024/8];
    for (int i = 0, off = 0; i < 1024/8; i++, off += 8) {
      int upper = (((b[off]   & 0xff) << 24) +
                   ((b[off+1] & 0xff) << 16) +
                   ((b[off+2] & 0xff) << 8) +
                    (b[off+3] & 0xff));
      int lower = (((b[off+4] & 0xff) << 24) +
                   ((b[off+5] & 0xff) << 16) +
                   ((b[off+6] & 0xff) << 8) +
                    (b[off+7] & 0xff));
      data[i] = Double.longBitsToDouble((((long)upper) << 32) +
                                        (lower & 0xffffffffL));
    }
(2) Type Safety

Objects have meta-data: runtime safety checks (array-bounds, array-store, casts).

In C:
    struct Buffer {
      int len;
      char data[1];
    };
    struct Buffer *b = malloc(sizeof(struct Buffer) + 1024);
    b->len = 1024;

In Java:
    class Buffer {
      int len;
      byte[] data;
      Buffer(int n) { data = new byte[n]; len = n; }
    }
    Buffer b = new Buffer(1024);

[Figure: in-memory layout of b: Buffer vtable, lock, len = 1024, plus a separate byte[] object with its own vtable, lock, and length 1024]
(2) Type Safety

Result: Java objects need to be serialized and de-serialized across the network.

[Figure: application memory with GC heap and native heap; objects are serialized, copied, and pinned before DMA by the NI]
(2) Type Safety: Effect

Performance hit of one order of magnitude: pick your favorite high-level communication abstraction (e.g. Remote Method Invocation), your favorite JVM, your fastest network interface, and a pair of 450 MHz P-IIs.

[Figure: round-trip latency (us) vs. transfer size (0-8 KB) for C raw, Java copy, and Java RMI copy]
Thesis

Use explicit memory management to improve Java communication performance:
- Jbufs: safe and explicit management of Java buffers
  - softens the GC/native heap separation
  - preserves type and storage safety
  - "zero-copy" array transfers
- Jstreams: extends Jbufs to optimize serialization in clusters
  - "zero-copy" de-serialization of arbitrary objects

[Figure: application memory with a user-controlled boundary between the GC heap and the native heap; pinned buffers are DMAed directly by the NI]
Outline

- Thesis Overview: GC/native heap separation, object serialization
- Experimental Setup: Giganet cluster and Marmot
- Part I: Array Transfers
  (1) Javia-I: Java interface to the VI Architecture (respects heap separation)
  (2) Jbufs: safe and explicit management of buffers (Javia-II, matrix multiplication, Active Messages)
- Part II: Object Transfers
  (3) A case for specialization (micro-benchmarks, RMI using Javia-I/II, impact on application suite)
  (4) Jstreams: in-place de-serialization (micro-benchmarks, RMI using Javia-III, impact on application suite)
- Conclusions
Giganet Cluster

Configuration:
- 8 P-II 450 MHz, 128 MB RAM
- 8 1.25 Gbps Giganet GNN-1000 adapters
- one Giganet switch

GNN-1000 adapter: user-level network interface; the Virtual Interface Architecture implemented as a library (Win32 DLL).

Baseline point-to-point performance: 14 us round-trip latency (16 us with the switch); over 100 MB/s peak (85 MB/s with the switch).
Marmot

Java system from Microsoft Research:
- not a VM
- static compiler: bytecode (.class) to x86 (.asm)
- linker: asm files + runtime libraries -> executable (.exe)
- no dynamic loading of classes
- most Dragon-book optimizations, some OO and Java-specific optimizations

Advantages:
- source code available
- good performance
- two types of non-concurrent GC (copying, conservative)
- native interface "close enough" to JNI
Outline

- Thesis Overview: GC/native heap separation, object serialization
- Experimental Setup: Giganet cluster and Marmot
- Part I: Array Transfers
  (1) Javia-I: Java interface to the VI Architecture (respects heap separation)
  (2) Jbufs: safe and explicit management of buffers (Javia-II, matrix multiplication, Active Messages)
- Part II: Object Transfers
  (3) A case for specialization (micro-benchmarks, RMI using Javia-I/II, impact on application suite)
  (4) Jstreams: in-place de-serialization (micro-benchmarks, RMI using Javia-III, impact on application suite)
- Conclusions
Javia-I

Basic architecture:
- respects heap separation
- buffer management in native code
- Marmot as an "off-the-shelf" system: copying GC disabled while in native code
- primitive array transfers only

Send/recv API:
- non-blocking and blocking
- bypass ring accesses
- pin-on-demand
- alloc-recv: allocates a new array on demand
- cannot eliminate copying during recv

[Figure: Java-side Vi object with a send/recv ticket ring and byte array references in the GC heap; C-side send/recv queues, descriptors, and buffers over VIA]
Javia-I: Performance

[Figure: round-trip latency (us) and bandwidth (MB/s) vs. transfer size for raw, copy(s), pin(s), copy(s)+alloc(r), and pin(s)+alloc(r)]

Basic costs (PII-450, Windows 2000 beta 3): pin + unpin = (10 + 10) us, or ~5000 machine cycles.
Marmot: native call = 0.28 us, locks = 0.25 us, array alloc = 0.75 us.

Latency (N = transfer size in bytes):
  raw               16.5 us + (25 ns) * N
  pin(s)            38.0 us + (38 ns) * N
  copy(s)           21.5 us + (42 ns) * N
  copy(s)+alloc(r)  18.0 us + (55 ns) * N

BW: 75% to 85% of raw for 16 KB.
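The linear cost models above can be evaluated directly to see how the fixed overhead and the per-byte cost trade off at a given transfer size. A small sketch using the slide's coefficients:

```java
public class LatencyModel {
    // Latency model from the slide: fixed overhead (us) + per-byte cost (ns) * N.
    static double latencyUs(double baseUs, double perByteNs, int n) {
        return baseUs + perByteNs * n / 1000.0;
    }

    public static void main(String[] args) {
        int n = 8 * 1024; // 8 KB transfer
        System.out.printf("raw      %.1f us%n", latencyUs(16.5, 25, n)); // 221.3
        System.out.printf("pin(s)   %.1f us%n", latencyUs(38.0, 38, n));
        System.out.printf("copy(s)  %.1f us%n", latencyUs(21.5, 42, n));
    }
}
```

At large transfers the per-byte slope dominates, which is why copy(s), with the steepest slope, falls furthest behind raw despite a modest fixed overhead.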
jbufs

Goal: provide buffer management capabilities to Java without violating its safety properties. Re-use is important: it amortizes the high pinning costs.

A jbuf exposes communication buffers to Java programmers:
1. lifetime control: explicit allocation and de-allocation
2. efficient access: direct access as primitive-typed arrays
3. location control: safe de-allocation and re-use by controlling whether or not a jbuf is part of the GC heap

The heap separation becomes soft and user-controlled.
jbufs: Lifetime Control

1. jbuf allocation does not result in a Java reference to it: the jbuf cannot be accessed from the wrapper object.
2. A jbuf is not automatically freed when there are no Java references to it: free has to be called explicitly.

public class jbuf {
  public static jbuf alloc(int bytes); /* allocates a jbuf outside of the GC heap */
  public void free() throws CannotFreeException; /* frees the jbuf if it can */
}

[Figure: a handle outside the GC heap pointing to the jbuf]
jbufs: Efficient Access

3. (Storage safety) A jbuf remains allocated as long as there are array references to it. When can we ever free it?
4. (Type safety) A jbuf cannot have two differently typed references to it at any given time. When can we ever re-use it (e.g. change its reference type)?

public class jbuf {
  /* alloc and free omitted */
  public byte[] toByteArray() throws TypedException; /* hands out byte[] ref */
  public int[] toIntArray() throws TypedException;   /* hands out int[] ref */
  . . .
}

[Figure: a Java byte[] reference from the GC heap into the jbuf]
jbufs: Location Control

Idea: use the GC to track references.

unRef: the application claims it has no references into the jbuf; the jbuf is added to the GC heap; the GC verifies the claim and notifies the application through a callback; the application can then free or re-use the jbuf.

Required GC support: change the scope of the GC heap dynamically.

public class jbuf {
  /* alloc, free, toArrays omitted */
  public void unRef(CallBack cb); /* app intends to free/re-use jbuf */
}

[Figure: three stages: the jbuf referenced from outside the GC heap; after unRef, the jbuf is added to the GC heap; after the GC callBack, the jbuf is detached again]
jbufs: Runtime Checks

Type safety: the ref and to-be-unref states are parameterized by primitive type.

The GC* transition depends on the type of garbage collector:
- non-copying: transition only if all refs to the array are dropped before the GC
- copying: transition occurs after every GC

[State diagram: states unref, ref<p>, and to-be-unref<p>; transitions alloc, to<p>Array, GC, unRef, GC*, and free]
Javia-II

Exploiting jbufs:
- explicit pinning/unpinning of jbufs
- non-blocking send/recvs only

[Figure: Java-side Vi object with a send/recv ticket ring, jbuf state, and array refs in the GC heap; C-side send/recv queues and descriptors over VIA]
Javia-II: Performance

Basic jbuf costs: allocation = 1.2 us, to*Array = 0.8 us, unRef = 2.3 us, GC degradation = 1.2 us/jbuf.

Latency (N = transfer size in bytes):
  raw      16.5 us + (25 ns) * N
  jbufs    20.5 us + (25 ns) * N
  pin(s)   38.0 us + (38 ns) * N
  copy(s)  21.5 us + (42 ns) * N

BW within 1% of raw.

[Figure: round-trip latency (us) and bandwidth (MB/s) vs. transfer size for raw, jbufs, copy, and pin]
MM: Communication

[Figure: pMM communication time (comm + barrier, msecs) on 8 processors for 64x64 and 256x256 matrices, comparing copy-alloc, copy-async, pin-alloc, pin-async, jbufs, jdk copy-alloc, and jdk copy-async; bars annotated with relative percentages]

pMM over Javia-II/jbufs spends at least 25% less time in communication for 256x256 matrices on 8 processors.
MM: Overall

[Figure: pMM MFLOPS for 64x64 and 256x256 matrices on 2, 4, and 8 processors, comparing copy-alloc, copy-async, pin-alloc, pin-async, jbufs, jdk copy-alloc, and jdk copy-async]

Cache effects: better communication performance does not always translate into better overall performance.
Active Messages

Exercising jbufs:
- the user supplies a list of jbufs
- upon message arrival, a jbuf is passed to the handler
- unRef is invoked after the handler returns
- if the pool is empty, existing jbufs are reclaimed
- copying is deferred to GC time, and only if needed

class First extends AMHandler {
  private int first;
  void handler(AMJbuf buf, …) {
    int[] tmp = buf.toIntArray();
    first = tmp[0];
  }
}

class Enqueue extends AMHandler {
  private Queue q;
  void handler(AMJbuf buf, …) {
    int[] tmp = buf.toIntArray();
    q.enq(tmp);
  }
}
AM: Performance

Latency about 15 us higher than Javia: synchronized access to the buffer pool, endpoint header, flow-control checks, handler id lookup.

BW within 10% of peak for 16 KB messages.

[Figure: round-trip latency (us) and bandwidth (MB/s) vs. transfer size for raw, jbufs, AM jbuf, AM copy, AM pin, and AM copy-alloc]
Jbufs: Experience

Efficient access through arrays is useful:
- no indirect access via method invocation
- promotes code re-use of large numerical kernels
- leverages compiler infrastructure for eliminating safety checks

Limitations:
- still not as flexible as C buffers
- stale references may confuse programmers

Discussed in thesis: the necessity of explicit de-allocation; the implementation of Jbufs in Marmot's copying collector; the impact on conservative and generational collectors; an extension to JNI to allow "portable" implementations of Jbufs.
Outline

- Thesis Overview: GC/native heap separation, object serialization
- Experimental Setup: VI Architecture and Marmot
- Part I: Array Transfers
  (1) Javia-I: Java interface to the VI Architecture (respects heap separation)
  (2) Jbufs: safe and explicit management of buffers (Javia-II, matrix multiplication, Active Messages)
- Part II: Object Transfers
  (3) A case for specialization on homogeneous clusters (micro-benchmarks, RMI using Javia-I/II, impact on application suite)
  (4) Jstreams: in-place de-serialization (micro-benchmarks, RMI using Javia-III, impact on application suite)
- Conclusions
Object Serialization and RMI

Standard JOS protocol:
- "heavy-weight" class descriptors are serialized along with objects
- type-checking: classes need not be "equal", just "compatible"
- the protocol allows for user extensions

Remote Method Invocation:
- object-oriented version of Remote Procedure Call
- relies on JOS for argument passing
- an actual parameter object can be a sub-class of the formal parameter class

[Figure: writeObject copies objects out of the sender's GC heap onto the network; readObject copies them into the receiver's GC heap]
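The writeObject/readObject path pictured above is standard java.io serialization. A minimal round trip makes the costs the thesis targets visible: writeObject copies the object (plus its class descriptor) into a byte stream, and readObject allocates a brand-new object and copies the bytes back in.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class JosRoundTrip {
    // writeObject: copies the object (plus a class descriptor) into a byte stream.
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(o);
        }
        return bos.toByteArray();
    }

    // readObject: allocates a new object and copies the bytes back in,
    // exactly the allocation + copy that in-place de-serialization avoids.
    static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        double[] payload = {1.0, 2.5, -3.0};
        double[] copy = (double[]) deserialize(serialize(payload));
        System.out.println(copy != payload); // true: a distinct, newly allocated array
        System.out.println(copy[1]);         // 2.5
    }
}
```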
JOS Costs

1. Overheads in the tens or hundreds of us: send/recv overheads = ~3 us, memcpy of 500 bytes = ~0.8 us.
2. double[] is 50% more expensive than a byte[] of similar size.
3. Overheads grow as object sizes grow.

[Figure: writeObject and readObject costs (us) in jview, jdk, and marmot for byte[] (100 and 500 elements), double[] (12 and 62), complex[] (per element), and list (per element); off-scale bars annotated with values up to ~275 us]
Impact of Marmot

Impact of Marmot's optimizations:
- method inlining: up to 66% improvement (already deployed)
- no synchronization whatsoever: up to 21% improvement
- no safety checks whatsoever: up to 15% combined

Better compilation technology is unlikely to reduce overheads substantially.
Impact on RMI

An order of magnitude worse than Javia-I/II:
- round-trip latency drops to about 30 us in a null RMI: no JOS!
- peak bandwidth of 22 MB/s, about 25% of raw

[Figure: round-trip latency (us) vs. transfer size for raw, jbufs, RMI jbufs, RMI copy+alloc, RMI pin, and jdk RMI copy; 4-byte round-trip latencies range from 150.4 us to 520.1 us across the RMI/sockets implementations]
Impact on Applications

Application    % comm time (est.)   % total time (est.)
SOR            11.76%               2.73%
FFT arrays     10.90%               5.22%
FFT complex    14.28%               13.73%
EM3D arrays    1.42%                1.37%
pMM            7.64%                5.20%

A case for specializing serialization for cluster applications:
- overheads an order of magnitude higher than send/recv and memcpy
- RMI performance degraded by one order of magnitude
- 5-15% "estimated" impact on applications
- old adage: "specialize for the common case"
Optimizing De-serialization

"In-place" object de-serialization: specialization for homogeneous clusters and JVMs.

Goal: eliminate copying and allocation of objects.

Challenges:
- preserve the integrity of the receiving JVM
- permit de-serialization of arbitrary Java objects, with unrestricted usage and without special annotations
- independent of a particular GC scheme

[Figure: writeObject copies objects out of the sender's GC heap; on the receiver, objects are used in place]

Jstreams: write

writeObject:
- deep-copies objects: maintains the in-memory layout
- deals with cyclic data structures
- swizzles pointers: offsets to a base address
- replaces object meta-data with a 64-bit class descriptor
- optimization: primitive-typed arrays in jbufs are not copied

public class Jstream extends Jbuf {
  public void writeObject(Object o) /* serializes o onto the stream */
    throws TypedException, ReferencedException;
  public void writeClear() /* clears the stream for writing */
    throws TypedException, ReferencedException;
}
Jstreams: read

readObject:
- replaces class descriptors with meta-data
- unswizzles pointers, array-bounds checking
- after the first readObject, the jstream is added to the GC heap
- tracks references coming out of read objects
- unRef: the user is willing to free or re-use the stream

public class Jstream extends Jbuf {
  public Object readObject() throws TypedException; /* de-serialization */
  public boolean isJstream(Object o); /* checks if o resides in the stream */
}

[Figure: read objects referenced from the GC heap; after unRef and the GC callBack, the jstream is detached again]
jstreams: Runtime Checks

Modification to Javia-II: prevent DMA from clobbering de-serialized objects; receive posts are not allowed while a jstream is in read mode; no changes to the Javia-II architecture.

[State diagram: states unref, write mode, read mode, and to-be-unref; transitions alloc, writeObject, writeClear, readObject, GC, unRef, GC*, and free]
jstream: Performance

De-serialization costs are constant w.r.t. object size: 2.6 us for arrays, 3.3 us per list element.

[Figure: readObject costs (us) for JOS jdk, JOS marmot, jstreams marmot, and jstreams (C) on byte[] 100, byte[] 500, double[] 62, and lists (4 and 160 elements); off-scale JOS bars annotated]
jstream: Impact on RMI

4-byte round-trip latency of 45 us (25 us higher than Javia-II); 52 MB/s for 16 KB arguments.

[Figure: bandwidth (MB/s) vs. transfer size for raw, javia-II, AM javia-II, RMI javia-III, and RMI javia-I]
jstream: Impact on Applications

- 3-10% improvement in SOR, EM3D, and FFT
- 10% hit in pMM performance: over 22,000 incoming RMIs with 1000 jstreams in the receive pool cause ~26 garbage collections, 15% of total execution time in GC
  - generational collection would alleviate GC costs substantially
  - the receive pool size is hard to tune: tradeoffs between GC and locality

Application    JOS comm (s)  JOS total (s)  jstreams comm (s)  jstreams total (s)  % improv. comm  % improv. total  % improv. comm (est.)  % improv. total (est.)
SOR            4.59          19.78          3.99               19.08               13.20%          3.52%            11.76%                 2.73%
FFT arrays     2.20          4.60           1.99               4.37                9.50%           4.85%            10.90%                 5.22%
FFT complex    18.30         19.03          16.16              17.26               11.70%          9.30%            14.28%                 13.73%
EM3D arrays    14.82         15.36          14.29              14.83               3.57%           3.40%            1.42%                  1.37%
pMM            190.58        280.00         170.91             307.80              10.32%          -9.93%           7.64%                  5.20%
Jstreams: Experience

Implementation:
- readObject and writeObject integrated into the JVM
- the protocol is JVM-specific
- a native implementation is faster

Limitations:
- not as flexible as Java streams: cannot read and write at the same time
- no "extensible" wire protocols

Discussed in thesis: the implementation of Jstreams in Marmot's copying collector; support for polymorphic RMI (minor changes to the stub compiler); JNI extensions to allow "portable" implementations of Jstreams.
Related Work

Microsoft J-Direct:
- "pinned" arrays defined using source-level annotations
- the JIT produces code to "redirect" array access: expensive
- Berkeley's Jaguar: efficient code generation with JIT extensions; security concern: JIT "hacks" may break Java or byte-code

Custom JVMs:
- many "tricks" are possible (e.g. pinned array factories, pinned and non-pinned heaps), but they depend on a particular GC scheme
- Jbufs: isolates the minimal support needed from the GC

Memory management: Safe Regions (Gay and Aiken): reference counting, no GC.

Fast serialization and RMI:
- KaRMI (Karlsruhe): fixed JOS, ground-up RMI implementation
- Manta (Vrije U): fast RMI, but a Java dialect
Summary

Use of explicit memory management to improve Java communication performance in clusters:
- softens the GC/native heap separation
- preserves type and storage safety
- independent of the GC scheme
- jbufs: zero-copy array transfers
- jstreams: zero-copy de-serialization of arbitrary objects

A framework for building communication software and applications in Java:
- Javia-I/II
- parallel matrix multiplication
- Jam: active messages
- Java RMI
- cluster applications: TSP, IDA, SOR, EM3D, FFT, and MM