
Virtual-Memory-Mapped Network Interfaces

In today's multicomputers, software overhead dominates the message-passing latency cost. We designed two multicomputer network interfaces that significantly reduce this overhead. Both support virtual-memory-mapped communication, allowing user processes to communicate without expensive buffer management and without making system calls across the protection boundary separating user processes from the operating system kernel. Here we compare the two interfaces and discuss the performance trade-offs between them.

Matthias A. Blumrich

Cezary Dubnicki

Edward W. Felten

Kai Li

Malena R. Mesarina

Princeton University

In Princeton's SHRIMP (Scalable High-Performance Really Inexpensive Multiprocessor) project, we are working to develop high-performance communication mechanisms that will integrate commodity desktop computers such as PCs and workstations into inexpensive, high-performance multicomputers. Our primary performance metrics are the end-to-end latency and bandwidth available to user processes. Our goal is to provide a low-latency, high-bandwidth communication mechanism whose performance is competitive with or better than that of mechanisms used in specially designed multicomputers.

The network interfaces of existing multicomputers and workstation networks require a significant amount of software overhead at the operating system and user levels to provide protection, buffer management, and message-passing protocols. In fact, message-passing primitives on many multicomputers, such as the csend/crecv of Intel's NX/2,¹ often execute more than 1,000 instructions to send and receive a message. By comparison, the hardware overhead of data transfer is negligible. For example, sending and receiving a message on Intel's Delta multicomputer requires 67 μs, of which less than 1 μs is due to time on the wire.² Other recent multicomputers, such as Intel's Paragon,³ Meiko's CS-2, and TMC's CM-5,⁴ have lower message-passing latencies than Delta.

These designs treat communication as a service of the operating system. The challenge in designing network interfaces is to provide appropriate hardware support to achieve minimal software message-passing overhead, to accommodate multiprogramming under a variety of scheduling policies without sacrificing protection, and to overlap communication with computation.

As the first step of our research, we developed an idea we call virtual-memory-mapped communication. This approach allows programs to pass messages directly between user processes without crossing the protection boundary to the operating system kernel, thus reducing software message-passing overhead significantly. Implementation of this approach requires network interface support.

We designed two network interfaces for the SHRIMP multicomputer, which uses Pentium PCs and an Intel Paragon routing network. Our first design makes minimal modifications to the traditional DMA-based network interface design, while implementing virtual-memory mapping in software. The design requires a system call to initiate outgoing data transfer, but its virtual-memory-mapped communication can reduce the send latency overhead by as much as 78 percent. The interface transfers received messages directly to memory, typically reducing the receive software overhead to only a few instructions.

Our second design implements virtual-memory mapping completely in hardware.⁵ This approach provides fully protected, user-level message passing, and it allows user programs to initiate an outgoing block data transfer with a single memory store instruction.

Figure 1. Virtual-memory mapping.

Virtual-memory-mapped communication

Figure 1 illustrates the basic idea of virtual-memory-mapped communication: Applications create a mapping between two virtual-memory address spaces over the network. That is, the user maps a piece of the sender's virtual memory to an equal-size piece of the receiver's virtual memory across the network. The mapping operation requires a system call to provide protection between users and processes in a multiprogrammed environment. But once the mapping is established, the sending and receiving processes can use the mapped memory as send and receive buffers, and can communicate without kernel involvement.

Virtual-memory-mapped communication has several advantages over traditional, kernel-dispatch-based message passing. One is that virtual-memory-mapped communication incurs low overhead, since data can move between user processes without context switching and message dispatching.

Another advantage is that virtual-memory-mapped communication moves memory buffer management to the user level. Applications or libraries can manage their communication buffers directly, without the expensive overhead of the unnecessary context switches and protection boundary crossings that traditional implementations commonly incur. Recent studies indicate that moving communication buffer management out of the kernel to the user level can greatly reduce the software overhead of message passing. The use of a compiled, application-tailored runtime library can improve the latency of multicomputer message passing by about 30 percent.⁶

In addition, virtual-memory-mapped communication takes advantage of the protection provided by virtual-memory systems. Since mappings are established at the virtual-memory level, virtual-address translation hardware guarantees that an application can use only mappings created by itself. This eliminates the per-message software protection checking of traditional message-passing implementations.

Several ways of implementing virtual-memory mapping are possible. To achieve a simple, low-cost design, we investigated various combinations of hardware and software. Our results are the SHRIMP-I network interface, designed to provide minimal hardware support, and the SHRIMP-II network interface, intended to provide as much hardware support as needed to minimize communication latency.

SHRIMP-I network interface

Our design goal for the SHRIMP-I network interface was to start with a traditional, DMA-based network interface and add the minimal hardware support needed for implementing virtual-memory-mapped communication. The resulting network interface supports the DMA-based model and optionally implements virtual-memory-mapped communication with some software assistance.

Figure 2 shows a block diagram of the SHRIMP-I network interface data path. The card uses DMA transactions to interface between the EISA (Extended Industry Standard Architecture) bus of a Pentium PC and a network interface chip connected to an Intel Paragon routing network. DMA transactions are limited to the size of a memory page and cannot cross page boundaries, since pages are the unit of protection. The card provides control through a set of memory-mapped registers, which device driver programs use to compose packets, initiate packet transfers, examine interface status, and set up receiving memory addresses. Incoming packets optionally can generate interrupts to the host processor. The arbiter controls sharing of the bidirectional data path to the network interface chip, giving incoming data priority over outgoing data.

The hardware supports physical-memory mapping for incoming data. That is, each packet carries a receive destination physical-memory address in its packet header, and the hardware automatically initiates a DMA transfer to this address upon packet arrival, without host CPU intervention.

The beginning of every packet carries a header of two 64-bit words. The first 64-bit word contains routing information for the Paragon network and is stripped by the network hardware. The second 64-bit word is the SHRIMP-I packet header, containing four fields: version, destination address, packet size, and action. The version field identifies the version of the network interface that generated the packet. The destination address specifies a physical base address on the destination machine to receive the packet's data. The packet size field specifies the number of 32-bit data words in the body of the packet. The action field tells the receiving network interface how to handle the packet.
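As a rough sketch in C, the header word might be modeled as follows. The article fixes only the four fields and the 64-bit total, so the individual field widths here are our assumptions:

    /* Sketch of the second 64-bit header word of a SHRIMP-I packet.
       Only the four fields and the 64-bit total come from the text;
       the individual field widths are assumptions. */
    struct shrimp1_header {
        unsigned version   :  8;  /* interface version that built the packet   */
        unsigned action    :  8;  /* how the receiver should handle the packet */
        unsigned size      : 16;  /* number of 32-bit data words in the body   */
        unsigned dest_addr : 32;  /* physical base address on the destination  */
    };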


Writing a packet header to the send registers initiates a send operation. That starts the send state machine, which builds a network packet and transfers the data directly from memory via the network interface chip. When the packet arrives at the destination, its header is stored in the receive registers. By default, the packet's data is delivered to the physical memory indicated by the destination address field in the packet header. Optionally, the packet's action field can instruct the receiving logic to deliver the data to a physical address provided by the receiver (in a memory-mapped register). In addition, the action field can cause an interrupt to the receiving host processor immediately after packet delivery. An interrupt freezes the incoming data path until the host processor explicitly restarts it by writing to a special control register.
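A device driver's send path might then look roughly like the sketch below, building on the header structure above. The register layout and the SEND_REG_BASE and ACTION_DEFAULT constants are hypothetical; the article states only that writing a packet header to the send registers starts the send state machine.

    /* Hypothetical driver-level send: fill in the memory-mapped send
       registers; the header write starts the send state machine. */
    #define SEND_REG_BASE  0xC0000000UL       /* hypothetical address     */
    #define ACTION_DEFAULT 0                  /* hypothetical action bits */

    static volatile struct shrimp1_header *const send_reg =
        (volatile struct shrimp1_header *) SEND_REG_BASE;

    static void shrimp1_send(unsigned dest_phys, unsigned nwords)
    {
        send_reg->dest_addr = dest_phys;       /* physical base on receiver */
        send_reg->size      = nwords;          /* 32-bit words in the body  */
        send_reg->action    = ACTION_DEFAULT;  /* no interrupt requested    */
        /* The state machine now builds the packet and DMAs the data
           from memory through the network interface chip. */
    }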

For software flexibility and debugging support, the user can program the receiving logic to override the actions indicated in the packet's action field. Specifically, the receive control register can be set to ignore the action field and use the physical address from the special receive register as the destination address for the next incoming packet. The user can also program the receive control register to interrupt the CPU after every packet (or never to interrupt). Finally, the user can program the receive logic to freeze the incoming data path after each packet arrival, an action useful for debugging.

The SHRIMP-I network interface supports both traditional message passing and virtual-memory-mapped communication. In traditional message passing, the receiver provides the destination address, and an interrupt is raised upon message arrival (as indicated by action bits in the message header). This option allows the operating system kernel to manage memory buffers and dispatch messages.

Before virtual-memory-mapped communication can take place, a mapping operation to map a user-level send buffer to a user-level receive buffer is necessary. The mapping operation pins both buffers in physical memory. Once a mapping is established, we can use it to send messages without interrupting the receiving processor. That is, we can perform a receive operation entirely at the user level, without making a system call.

Virtual-memory-mapped communication is an optimization that reduces software message-passing overhead at the expense of additional mapping steps and increased consumption of physical memory caused by the pinning of send and receive buffers. If physical memory becomes scarce, one can always use traditional message passing with kernel-allocated memory buffers instead of virtual-memory-mapped communication.

SHRIMP-II network interface

For the SHRIMP-II network interface, we wanted to provide hardware support for protected, low-latency, user-level message passing to minimize software overhead.

Figure 2. SHRIMP-I network interface data path.

The design shares the main idea of the SHRIMP-I network interface design: supporting virtual-memory-mapped communication. The principal difference is that the SHRIMP-II network interface implements virtual-memory mapping in hardware, allowing programs to perform message passing completely at the user level with full protection.

Figure 3 shows the data path of the SHRIMP-II network interface, which connects to both the EISA bus and the Xpress memory extension connector of the Intel Xpress memory bus. The network interface uses the connection with the Xpress bus to "snoop" (detect) ordinary memory write transactions and to filter outgoing data destined for other nodes. It uses the connection with the EISA bus to transfer DMA bulk data between the local memory and the network.


Figure 3. SHRIMP-II network interface data path.

The key component that allows the SHRIMP-II network interface to support virtual-memory mapping in hardware is the network interface page table (NIPT). This table has an entry for each physical page of main memory. Each NIPT entry contains information about whether and how a page is mapped, specifies the destination node and the physical page number that is mapped to, and includes various control information for sending and receiving data.
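A minimal C sketch of an NIPT entry follows; the text specifies only the kinds of information each entry holds, so the layout and field widths are our assumptions:

    /* Sketch of one network interface page table entry; there is one
       entry per physical page of local memory. Widths are assumptions. */
    struct nipt_entry {
        unsigned mapped_out  :  1;  /* page is mapped to a remote page   */
        unsigned mapped_in   :  1;  /* page may receive incoming packets */
        unsigned auto_update :  1;  /* automatic (vs. deliberate) update */
        unsigned dest_node   : 13;  /* destination node of the mapping   */
        unsigned dest_page   : 16;  /* physical page number on that node */
        /* ...plus further send and receive control information */
    };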

The SHRIMP-II network interface supports two update strategies: automatic update and deliberate update. A user program selects an update strategy for the mapped-out pages at the time a mapping is created. The mapping system call uses a write-through strategy to cache pages mapped for automatic update.

To initiate an automatic update operation, the source process writes to mapped memory; this write takes place on the Xpress bus. It is convenient to think of the address of this write as a physical page number and an offset on that page. While the write is updating main memory, the network interface snoops it and directly indexes into the NIPT, using the page number, to obtain the mapping information. If the page is mapped out for automatic update, the network interface constructs a packet header using the destination and physical mapping information from the NIPT entry, along with the original offset from the write address. The written data is appended to this header, and the now-complete packet goes into the outgoing FIFO buffer. When it eventually reaches the head of the FIFO, the network interface chip injects it into the network.
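In rough C pseudocode, the snoop path behaves as follows. All names here are ours, building on the nipt_entry sketch above; the packet layout and FIFO helper are hypothetical:

    /* Model of the automatic-update snoop path: index the NIPT by the
       written page and, if the page is mapped out for automatic update,
       wrap the written word in a packet and queue it for the network. */
    #define PAGE_SIZE 4096UL

    struct packet {                        /* hypothetical packet layout */
        unsigned      route;               /* routing word (dest node)   */
        unsigned long dest_addr;           /* destination physical addr  */
        unsigned      size;                /* 32-bit words in the body   */
        unsigned      data[1];             /* body: the snooped word     */
    };

    extern struct nipt_entry nipt[];       /* one entry per local page   */
    extern void fifo_put(struct packet *); /* append to outgoing FIFO    */

    void snoop_xpress_write(unsigned long addr, unsigned data)
    {
        unsigned long page   = addr / PAGE_SIZE;
        unsigned long offset = addr % PAGE_SIZE;
        struct nipt_entry *e = &nipt[page]; /* direct index by page no.  */

        if (e->mapped_out && e->auto_update) {
            struct packet p;
            p.route     = e->dest_node;     /* Paragon routing info       */
            p.dest_addr = (unsigned long) e->dest_page * PAGE_SIZE + offset;
            p.size      = 1;                /* one word was written       */
            p.data[0]   = data;
            fifo_put(&p);                   /* injected at head of FIFO   */
        }
    }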

When the packet arrives at the destination processor, the network interface chip puts it into the incoming FIFO buffer. Once the packet reaches the head of this FIFO, the interface again uses the page number to index into the NIPT and determines whether that page has been mapped in. If it has, the EISA DMA logic uses the destination address from the packet to transfer the data directly to main memory. The snooping architecture of the PC system ensures that the caches remain consistent with main memory during this transfer. Therefore, a SHRIMP system can use regular, cacheable DRAM as send and receive buffers for message passing without special hardware.

User programs select deliberate update to obtain the highest transfer bandwidth. Data written to a deliberate-update page does not automatically transfer to the destination node; it transfers only when the user-level application issues an explicit send command. The send command initiates an EISA DMA transfer to move data from memory to the outgoing FIFO and then to the network. Therefore, deliberate-update pages can be cached with a write-back strategy but must be consistent with the cache at the time the send is initiated.

To allow an application to issue user-level commands to control some operations of the network interface without involving the kernel, we provide a mechanism called virtual-memory-mapped commands. The network interface decodes command memory, located in the node's physical-address space but not corresponding to actual RAM. References to command memory simply transmit information to or from the network interface at user level.

The current network interface supports one command memory space the same size as the actual physical memory and associates a unique command page with each page of physical memory. Since the two address spaces are linear and of equal size, simply adding or subtracting a fixed offset determines the association.
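As a sketch, with a hypothetical base offset for the command space:

    /* The command space mirrors physical memory at a fixed offset in
       the physical-address space; the constant here is hypothetical. */
    #define CMD_SPACE_OFFSET 0x40000000UL

    #define CMD_ADDR(phys)  ((phys) + CMD_SPACE_OFFSET)  /* page -> command page */
    #define PHYS_ADDR(cmd)  ((cmd) - CMD_SPACE_OFFSET)   /* command page -> page */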


The operating system kernel gives a user-level process access to a command page by mapping that command page into the process's virtual-memory space. For example, if physical page p currently holds the contents of some virtual page of process X, the kernel can give X access to the command pages that control p. This allows X to tell the network interface how to operate with p directly from user level. If the kernel later decides to reallocate p to another process, it can revoke X's right to access the command pages corresponding to p.

The command memory mechanism uses physical-address space (but not physical memory) to achieve low-overhead control of the network interface. The amount of physical-address space it consumes is a small constant times that of the local physical memory.

We currently use the command space to implement the send command for deliberate updates.

System software support

An advantage of virtual-memory-mapped communication is the diversity of communication models it can support. We designed a simple communication model for supporting multicomputer programs; we do not describe models suitable for other application classes. For all models, the virtual-memory-mapped communication for the network interface requires system calls to create mappings and primitives to send messages using mapped memory.

Our multicomputer interface uses two system calls for mapping creation: map-send and map-recv. The first, similar in spirit to the NX/2 csend and crecv calls, is

mapid = map-send(node-id, process-id, bind-id, mode, sendbuf, size)

where node-id is the network address of the receiving node, process-id indicates the receiving process, and bind-id is a binding identifier (whose function is similar to the message type in the NX/2 send and receive primitives). Mode indicates whether the mapping should be an automatic or deliberate update (meaningless for the SHRIMP-I network interface, which does not support automatic update). Sendbuf is the starting address of the send buffer, and size is the number of words in the send buffer. This call is used on the sender's side to establish a mapping. It returns a mapid, which identifies this mapping for send operations. For SHRIMP-I, mapid is just an index into a kernel-level mapping table specific to the calling process. For SHRIMP-II, mapid is the virtual address in the command space corresponding to sendbuf.

The second system call is

map-recv(bind-id, recvbuf, size, ihandler)

Here, bind-id is the binding identifier to match the mapping request by the map-send call, recvbuf is the starting address of the receive buffer, and size is the number of words in the receive buffer. If ihandler is nonnull, it specifies a user-level interrupt handler that will be called for every message received for this mapping. The map-recv call provides the mapping identified by bind-id with a receiving physical-memory address so that the sender's side can create a physical-memory mapping for the virtual-memory mapping.
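A typical pairing of the two calls might look like the sketch below, written in C with underscores in place of the hyphens so the names are valid identifiers; the shrimp.h header, the node and pid arguments, and the MODE_DELIBERATE constant are hypothetical:

    #include "shrimp.h"          /* hypothetical header for the two calls */

    int recvbuf[1024];
    int sendbuf[1024];

    void receiver_setup(void)
    {
        /* Expose a 1,024-word buffer under binding identifier 42;
           a null ihandler asks for no per-message interrupt handler. */
        map_recv(42, recvbuf, 1024, NULL);
    }

    void sender_setup(int node, int pid)
    {
        /* Map a same-size buffer onto the receiver's buffer; the
           returned handle selects this mapping for send operations. */
        long mapid = map_send(node, pid, 42, MODE_DELIBERATE,
                              sendbuf, 1024);
        /* ... later, e.g.: send(mapid, 0, 1024); */
    }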

The mapping calls will pin the memory pages of both the send and receive buffers into physical memory to create a stable physical-memory mapping. This enables data transfers on both sending and receiving sides without CPU involvement on SHRIMP-II and with minimal sender-side overhead on SHRIMP-I.

Every mapping is unidirectional and asymmetric, from the source (sending buffer) to the destination (receiving buffer). A mapping can be established only if the receive buffer's size is the same as the send buffer's size. Mapid can be viewed as a handle to select a mapping for a send operation. SHRIMP-I uses it to provide multiple and overlapped mappings for the same memory. SHRIMP-II uses it to calculate the base address in the command space.

For security, we must verify that the sending process has permission to transmit data to the receiving process. In our multicomputer programming model, only objects owned by processes belonging to the same task group can be mapped to each other. The operating system fully controls a process's membership in a given task group, so all processes within a task group trust each other. For example, processes cooperating on the execution of a given multicomputer program will usually belong to the same task group.

Both SHRIMP-I and SHRIMP-II support the following send operation:

send(mapid, send-offset, size)

For SHRIMP-I, this operation is a system call that builds a packet for each memory page. It simply looks up the mapid in the mapping table, finds the destination physical address, builds a packet header, and initiates the outgoing data transfer. This call returns immediately after the data is sent out to the network. For SHRIMP-II, the send operation is a deliberate-update macro. If the data to be sent resides within one page, this macro executes a user-level store of size to the address mapid+send-offset in the command space. The network interface decodes this write as a command to initiate the requested transfer from the corresponding physical-memory page. For a message spanning multiple pages, one store is issued for each page.
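For the single-page case, the deliberate-update send might reduce to a macro like this sketch; the actual macro is not given in the article, and names are adapted to C:

    /* Deliberate-update send for data within one page: a single
       user-level store of `size` to mapid+send_offset in the command
       space tells the interface to start the DMA transfer. For a
       multipage message, one such store is issued per page. */
    #define SEND(mapid, send_offset, size) \
        (*(volatile unsigned *)((mapid) + (send_offset)) = (unsigned)(size))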

Since a destination object is allocated in user space, both SHRIMP-I and SHRIMP-II can deliver data directly to the user memory without a receive interrupt.


Table 1. Message-passing overhead for three kinds of network interface designs.

                                                 Traditional   SHRIMP-I   SHRIMP-II
Sender      Send system call                          X            X
software    Send argument processing                  X            X
            Verify/allocate receive buffer            X
            Preparing packet descriptors              X            X
            Initiation of send                        X            X           X

Hardware    DMA data via I/O bus                      X            X           X
            Data transfer over network                X            X           X
            Data transfer via I/O bus                 X            X           X

Receiver    Interrupt service                         X         Optional    Optional
software    Receive buffer management                 X
            Message dispatch                          X
            Copy data to user space                   X
            Receive system call                       X

The user process can observe the message delivery, for example, by polling a flag located at the end of the message buffer. Thus, it can implement user-level buffer management and avoid the overhead of kernel buffer management, message dispatching, interrupts, and receive system calls.
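For instance, a receiver might spin on a flag stored as the last word of the message, roughly as below; the flag convention and the buffer size (reusing recvbuf from the mapping sketch above) are our assumptions:

    /* User-level receive by polling: the sender writes a nonzero flag
       as the last word of the message, so no system call or interrupt
       is needed. The flag convention is assumed, not from the text. */
    extern int recvbuf[1024];               /* mapped receive buffer      */

    void wait_for_message(void)
    {
        volatile int *flag = (volatile int *) &recvbuf[1023];

        while (*flag == 0)
            ;                               /* data arrives via DMA       */
        /* recvbuf[0..1022] now holds the message body */
        *flag = 0;                          /* rearm for the next message */
    }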

Cost and performance

Both the SHRIMP-I and SHRIMP-II network interfaces support traditional, DMA-based message passing and provide the option of virtual-memory-mapped communication. This option can eliminate a large amount of software overhead in traditional message passing, such as buffer management and message dispatching.

Compared to traditional network interface designs, the SHRIMP-I interface adds only the destination physical address of a packet in its header, with receiving logic to deliver data accordingly. This simple change makes the network interface very flexible. Although the interface requires a system call to send a message, it requires no CPU involvement to receive and dispatch messages.

The SHRIMP-II network interface supports protected, user-level virtual-memory-mapped communication. Automatic-update mode allows a single store instruction to initiate a send with only the local write-buffer latency. Deliberate-update mode requires a few user-level instructions to send up to a page of data.

Table 1 shows the overhead components of message passing on three kinds of network interfaces: traditional, SHRIMP-I, and SHRIMP-II.⁵ The SHRIMP-II network interface implements virtual-memory-mapping translation in hardware so that the send operation can proceed at the user level.

Although the SHRIMP-I network interface requires a system call to send a message, it provides virtual-memory-mapped communication with very little additional hardware over the traditional design.

We implemented the send operation for both the SHRIMP-I and SHRIMP-II network interfaces using Pentium-based PCs and compared the cost of our send with that of the csend/crecv primitives of NX/2. For passing a small message (less than 100 bytes), the software overhead of a send for the SHRIMP-I network interface is 117 instructions plus the system call overhead and, optionally, an interrupt. For SHRIMP-II, the overhead of a send implemented as a macro is only 15 user-level instructions, with no system call necessary. In contrast, the software overhead of a csend and a crecv in NX/2 is 483 instructions plus two system calls (csend and crecv) and an interrupt.

For passing a large message, the primitives for the SHRIMP-I network interface require only 26 additional instructions for each additional page transferred. For SHRIMP-II, this overhead is only eight user-level instructions. NX/2's csend and crecv require additional network transactions to allocate receive buffer space on the receiving side, and they must prepare data descriptors needed by the network interface to initiate a send.

The cost of mapping on both SHRIMP-I and SHRIMP-II is similar to that of passing a small message using csend and crecv in NX/2. For applications that have static communication patterns, the amortized overhead of creating a mapping can be negligible.⁵

We should point out that the semantics of NX/2's csend/crecv primitives are richer than the virtual-memory-mapped communication supported by the SHRIMP interfaces. Our comparison shows that rich semantics often come with substantial overhead. Since both SHRIMP network interfaces support traditional message passing and virtual-memory-mapped communication, they allow user programs to optimize for common cases.

Related work

Traditional network interface design is based on DMA data transfer. Recent examples include the NCube⁷ and iPSC/860.⁸ In this scheme an application sends messages by making operating system calls to initiate DMA data transfers. The network interface initiates an incoming DMA data transfer when a message arrives and interrupts the local processor when the transfer completes so that it can dispatch the arriving message. The main disadvantage of traditional network interfaces is that message passing usually takes thousands of CPU cycles.

One solution to the problem of software overhead is to add a separate processor on every node just for message passing.⁹,¹⁰ Recent examples of this approach are the Intel Paragon and the Meiko CS-2, mentioned earlier. The basic idea is for the "compute" processor to communicate with the "message" processor either through mailboxes in shared memory or through closely coupled data paths. The compute and message processors can then work in parallel to overlap communication and computation. In addition, the message processor can poll the network device, eliminating interrupt overhead. This approach, however, does not eliminate the overhead of the software protocol on the message processor, which is still hundreds of CPU instructions. In addition, the node is complex and expensive to build.

Several projects have reduced communication latency by bringing the network all the way into the processor and mapping the network interface FIFOs to special processor registers.¹¹⁻¹³ Writing and reading these registers queues and dequeues data from the FIFOs. While this is efficient for fine-grained, low-latency communication, it requires the use of a nonstandard CPU and does not support the protection of multiple contexts in a multiprogramming environment.

An alternative approach employs memory-mapped network interface FIFOs. In this scheme, the controller has no DMA capability. Instead, the host processor communicates with the network interface by reading or writing special memory locations that correspond to the FIFOs. This approach results in good latency for short messages. However, for longer messages the DMA-based controller is preferable because it uses the bus burst mode, which is much faster than processor-generated single-word transactions.

Among commercially available massively parallel processors, the machine with the lowest latency is the Cray T3D, which supports shared memory without caching. The T3D requires a large amount of custom hardware design, and it is not clear whether the overhead from sharing remote memories without caching degrades the performance of message-passing applications.

MINIMAL ADDITIONS TO THE TRADITIONAL network interface design can reduce software overhead by up to 78 percent. With more hardware support, the software overhead for sending a message can be reduced to a single user-level instruction. Although virtual-memory-mapped communication requires the use of map system calls, it can avoid a receive system call and a receive interrupt. For multicomputer programs that exhibit static communication patterns (that is, transfers from a given send buffer to a fixed destination buffer), the net gain can be substantial.

Both the SHRIMP-I and SHRIMP-II network interfaces provide users with flexible functionality, including the specification of data delivery locations and optional receiver interrupt generation. In addition, a good part of both designs consists of programmable logic, allowing for experimentation with hardware protocols at various stages of data transfer.

We built a simulator for software development, and we are constructing 16-node prototypes of SHRIMP-I and SHRIMP-II systems. We expect the SHRIMP-I system to be operational sometime during the second quarter of 1995. At that time we also expect to have an operational two-node SHRIMP-II system.

Acknowledgments

We thank Otto Anshus, Doug Clark, Liviu Iftode, and Jonathan Sandberg for numerous discussions of the SHRIMP-I and SHRIMP-II network interface designs. We also thank David DeWitt and Jeffrey Naughton for several discussions of the SHRIMP project, and Michael Carey for his innovative suggestion of using SHRIMP as our project name.

References

1. P. Pierce, "The NX/2 Operating System," Proc. Third Conf. Hypercube Concurrent Computers and Applications, ACM, New York, 1988, pp. 384-390.

2. R.J. Littlefield, "Characterizing and Tuning Communications Performance for Real Applications," presentation at the First Intel Delta Applications Workshop, Tech. Report CCSF-14-92, California Institute of Technology, Pasadena, Calif., 1992, pp. 179-190.

3. Paragon XP/S Product Overview, Intel Corp., Santa Clara, Calif., 1991.

4. C. Leiserson et al., "The Network Architecture of the Connection Machine CM-5," Proc. Fourth ACM Symp. Parallel Algorithms and Architectures, ACM, 1992, pp. 272-285.

5. M. Blumrich et al., "A Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer," Proc. 21st Int'l Symp. Computer Architecture, IEEE Computer Society Press, Los Alamitos, Calif., 1994, pp. 142-153.

6. E.W. Felten, Protocol Compilation: High-Performance Communication for Parallel Programs, PhD thesis, available as Tech. Report 93-09-09, Dept. of Computer Science and Engineering, Univ. of Washington, Seattle, 1993.

7. J. Palmer, "The NCube Family of High-Performance Parallel Computer Systems," Proc. Third Conf. Hypercube Concurrent Computers and Applications, 1988, pp. 845-851.


8. iPSC/860 Technical Reference Manual, Intel Corp., Santa Clara, Calif., 1991.

9. R. Nikhil, G. Papadopoulos, and Arvind, "*T: A Multithreaded Massively Parallel Architecture," Proc. 19th Int'l Symp. Computer Architecture, ACM, 1992, pp. 156-167.

10. J.-M. Hsu and P. Banerjee, "A Message-Passing Coprocessor for Distributed Memory Multicomputers," Proc. Supercomputing, CS Press, 1990, pp. 720-729.

11. S. Borkar et al., "Supporting Systolic and Memory Communication in iWarp," Proc. 17th Int'l Symp. Computer Architecture, CS Press, 1990, pp. 70-81.

12. D.S. Henry and C.F. Joerg, "A Tightly-Coupled Processor-Network Interface," Proc. Fifth Int'l Conf. Architectural Support for Programming Languages and Operating Systems, ACM, 1992, pp. 111-122.

13. W.J. Dally, "The J-Machine System," Artificial Intelligence at MIT: Expanding Frontiers, P. Winston and S. Shellard, eds., MIT Press, Cambridge, Mass., 1990, pp. 558-580.


Matthias A. Blumrich is a PhD candidate at Princeton University, where he is implementing the network interface for SHRIMP-II. His research centers on high-performance computer systems, with a preference for hardware design and implementation. He has worked on a variety of R&D projects, both in academia and industry, ranging from multiprocessor architecture simulation to printed circuit board layout. Blumrich received his BS in electrical engineering from the State University of New York at Stony Brook and his MA from Princeton.

Cezary Dubnicki, a research staff member at Princeton University, designs and implements systems software for SHRIMP. His research interest is architectural and systems support for parallel and distributed programming, including network interface design, high-performance message passing, and distributed shared memory. He received an MS in computer science from Warsaw University and a PhD in computer science from the University of Rochester.

Edward W. Felten is an assistant professor in the Department of Computer Science at Princeton. His research interests include parallel and distributed systems, scientific computing, and operating systems. He received the BS in physics from the California Institute of Technology and the PhD in computer science from the University of Washington. He is a member of the IEEE Computer Society.

Kai Li is an associate professor in Princeton's Department of Computer Science. His research interests are operating systems, computer architecture, fault tolerance, and parallel computing. He is an editor of the IEEE Transactions on Parallel and Distributed Systems and a member of the editorial board of the International Journal of Parallel Programming. He received his PhD from Yale University and is a member of the IEEE Computer Society.

Malena R. Mesarina, a technical staff member in Princeton's Department of Computer Science, is participating in the hardware design of a parallel computer network. Previously, she worked on microprocessor embedded systems as a hardware design engineer at Mayflower Communications, a satellite communications company in Massachusetts. Mesarina holds a bachelor's degree in electrical engineering from Boston University and is a member of the IEEE and the Society of Women Engineers.

Direct questions to the authors at Princeton University, Dept. of Computer Science, 35 Olden St., Princeton, NJ 08544; [email protected].
