Review of Two Papers on Performance of Remote Procedure Calls: "Performance of Firefly RPC" and "Lightweight Remote Procedure Call"


Quentin Fennessy qfennessy@gmail.com

Unpublished, originally written June 26, 1996

This review covers two papers that discuss performance problems and solutions in remote procedure calls (RPC). The first paper [1] is a detailed report on RPC performance on the Firefly multiprocessor system and includes precise measurements of the latency in the system RPC. The paper goes to great lengths to account for time spent in RPCs, breaking them down into packet creation, sending and receiving, and packet reception. The authors also estimate the improvement that certain proposed changes would yield. The second paper [2] is a report on a highly optimized pseudo-RPC called LRPC (Lightweight RPC) as implemented on Taos (an operating system also running on the Firefly). Lightweight RPC optimizes performance for RPC calls that do not cross machine boundaries and do not carry large or complicated data structures.

These two papers describe two approaches to the same problem: RPC performance optimization. Firefly RPC (in [1]) is traditional RPC with stub compilation in a high-level language. The RPCs in Firefly handle arbitrarily complex data structures and are semantically consistent for both local and remote calls. Both Firefly RPC and LRPC are optimized -- that is, the implementation is not straightforward but sacrifices benefits such as security and portability in the quest for high performance. LRPC (in [2]) is a more exotic implementation -- RPC so highly optimized for common cases that it barely deserves the name. LRPC does not handle inter-machine communication and handles only simple data structures. The authors of [2] present a good case that most RPCs are actually local calls with very simple data structure requirements.

The Firefly RPC paper [1] includes remarkably precise and detailed measurements of RPC latency and throughput. It is very interesting to see both the fixed and variable delays involved in interprocess communication. The authors baselined their timing with null RPCs and compared those times to largest-packet-sized RPCs.
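The baselining technique -- timing a large batch of null calls and dividing by the count, so that clock-read overhead is amortized away -- can be sketched as follows. This is an illustrative Python sketch, not the authors' code; the names `null_rpc` and `time_per_call` are hypothetical.

```python
import time

def null_rpc():
    # Stand-in for a null RPC: no arguments, no results.
    # A real null RPC would still pay stub, kernel, and transport costs.
    pass

def time_per_call(fn, iterations=10_000):
    """Estimate the average latency of fn by timing one batch of
    `iterations` calls and dividing the elapsed time by the count."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iterations

# The null-call time serves as the fixed per-call baseline; timing a
# maximum-size call and subtracting this baseline isolates the
# size-dependent (variable) cost.
baseline = time_per_call(null_rpc)
```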
Some timings were done by sending 10,000 packets and dividing the elapsed time by 10,000, and some were done by counting the machine instructions involved and summing the times those instructions take (from a processor reference manual). Not surprisingly, a large part of the cost was variable and depended on the size of the RPC packet (Ethernet latency, UDP checksum calculation, and system bus latency). Among the interesting aspects of the Firefly optimization were the awakening of threads to handle received packets from within interrupt routines and the sharing of address space between the Ethernet driver and all processes using RPC. The authors admit that these performance improvements "collapse layers of abstraction" and acknowledge the security implications of shared buffer space.

The Lightweight RPC paper [2] (also about the Firefly system, running the Taos operating system) discusses a more radical RPC implementation. RPCs traditionally look like normal function calls but are actually synchronous communication mechanisms between distinct remote or local processes. The authors argue that simple, local RPCs deserve optimization because they constitute the bulk of interprocess communication. Accordingly, they implemented LRPC (Lightweight RPC) with four new techniques. First, the control transfer between client and server is simplified: the client directly executes the requested procedure in the server's address space. Second, client and server share an argument stack. Third, LRPC uses simple stubs that preclude sending complex data structures. Fourth and finally, LRPC avoids shared data structure bottlenecks and can take advantage of free processors in the Firefly multiprocessor system.

These four techniques present interesting tradeoffs. As in [1], there are security implications in the optimizations that share data space between client and server. These are handled in several ways. Client binding to servers is handled carefully: clients cannot communicate without objects that identify them to servers, and client calls are rigorously checked before being mapped and executed in server space. RPC stubs are generated in Firefly machine language; in the homogeneous Firefly environment this is not an issue.
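The first two techniques -- direct control transfer and a shared argument stack (A-stack) -- can be illustrated with a toy single-address-space simulation. This is only a sketch of the idea under simplifying assumptions; `Binding`, `lrpc_call`, and `server_add` are invented names, and a real LRPC crosses protection domains via the kernel rather than an ordinary function call.

```python
class Binding:
    """A client-server binding: holds the server procedure plus an
    argument stack (A-stack) shared by both sides, so arguments are
    written once and never copied into a message."""
    def __init__(self, procedure, astack_slots=4):
        self.procedure = procedure
        self.astack = [None] * astack_slots  # shared argument area

def lrpc_call(binding, *args):
    # Client stub: place simple by-value arguments on the shared A-stack.
    binding.astack[:len(args)] = args
    # "Kernel" dispatch: execute the server procedure directly in the
    # caller's thread, reading from the shared stack -- control transfer
    # without a separate message send, receive, or data copy.
    return binding.procedure(binding.astack[:len(args)])

def server_add(astack):
    a, b = astack
    return a + b

b = Binding(server_add)
# lrpc_call(b, 3, 4) -> 7
```

The point of the sketch is the tradeoff the paper describes: by restricting calls to simple, copyable arguments, the stub can be reduced to a stack write and a direct jump.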
These stubs are up to four times faster than compiled Modula-2+ stubs. LRPC stubs are invoked directly by the Firefly kernel, thus avoiding data copying and message checking in user space. LRPC is optimized for multiprocessor use by avoiding shared data structures: shared argument stacks are locked individually, and queuing on these stacks takes less than 2% of call time.

The first paper discusses optimization of traditional RPCs. These optimizations are easily described, but they increase risk in execution and probably increase the difficulty of maintaining kernel code in Firefly. Firefly RPC performance is compared with that of other distributed systems such as Sprite, Amoeba, V, Cedar, and UNIX. Although Firefly is the only VAX-based system, the absolute performance numbers are interesting to compare. Firefly RPC latency (at about 2.7 ms/call) is within 0.2 ms of the fastest RPC implementation (in V). Firefly RPC throughput (at 4.6 Mbit/sec) is above the median of the compared systems but not quite as fast as that of Sprite (at 5.6 Mbit/sec).

The second paper optimizes Firefly RPCs for the simple cases -- local calls and simple data structures. LRPC is demonstrably lightweight: a null LRPC adds only 48 usec to the minimum time for the operation, for a total of 157 usec. LRPC at 157 usec compares very favorably with the Firefly null RPC at 464 usec (a 3:1 difference!). Larger calls show almost the same ratio: LRPC 200-byte calls take 227 usec versus 636 usec for Firefly RPC. The multiprocessor optimizations scale well with the number of processors: Firefly RPC throughput plateaus at two processors, while LRPC scales linearly at least to four.

These two papers agree in several areas on RPC optimization. Unfortunately, high-level language implementations, nicely layered designs, clearly distinguished protection domains, and arbitrarily complex data structures are all sacrificed to the need for speed. RPC optimization is critical to RPC acceptance, as otherwise programmers will work around the system. Fortunately, the dirty details of these optimizations can to a large degree be hidden from programmers and users, allowing higher-level software engineering techniques in user code.
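The multiprocessor optimization described above -- giving each argument stack its own lock rather than funneling all callers through one shared structure -- can be illustrated with a toy allocator. This is an illustrative sketch only; `AStackPool` and its methods are invented names, and the real LRPC mechanism works on kernel data structures, not Python objects.

```python
import threading

class AStackPool:
    """Each binding owns several A-stacks, each with its own private
    lock, so concurrent callers on different processors usually find a
    free stack without contending on any global lock or queue."""
    def __init__(self, n_stacks=4):
        self.stacks = [{"lock": threading.Lock(), "slots": [None] * 8}
                       for _ in range(n_stacks)]

    def acquire(self):
        # Probe each A-stack's private lock without blocking; a caller
        # waits only when every stack is simultaneously busy.
        while True:
            for s in self.stacks:
                if s["lock"].acquire(blocking=False):
                    return s

    def release(self, s):
        s["lock"].release()

pool = AStackPool()
s = pool.acquire()   # grab a free A-stack without touching a global lock
pool.release(s)
```

This is the structural reason for the scaling result quoted above: with no single point of serialization, adding processors adds usable call capacity.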

Works Cited

[1] M. D. Schroeder and M. Burrows, "Performance of Firefly RPC," ACM Transactions on Computer Systems, vol. 8, no. 1, pp. 1-17, Feb. 1990.

[2] B. N. Bershad, T. E. Anderson, E. D. Lazowska and H. M. Levy, "Lightweight Remote Procedure Call," ACM Transactions on Computer Systems, vol. 8, no. 1, pp. 37-55, Feb. 1990.
