Playing Distributed Systems with Memory-to-Memory Communication
Liviu Iftode
Department of Computer Science
University of Maryland
Outline
The M2M Game
M2M Toys: VIA, InfiniBand, DAFS
Playing with M2M
  Software DSM
  Intra-Server Communication
  Fault Tolerance and Availability
  TCP Offloading
Conclusions
Most of this work has been done in the Distributed Computing (Disco) Lab at Rutgers University, http://discolab.rutgers.edu
How it all started...
Cost-effective alternative to multicomputers: commodity networks of high-volume uniprocessor or multiprocessor systems
  track technology best
  low cost/performance ratio
Networking became the headache of this approach: large software overheads
Multicomputers → Clusters of computers
Too much OS...
[Diagram: on both send and receive paths, data is copied between the application, the OS, and the network interface]
Applications interact with the network interface through the OS: exclusive access, protection, buffering, etc.
OS involvement increases latency and overhead
Multiple copies (App → OS, OS → App) reduce effective bandwidth
User-Level Protected Communication
[Diagram: the application sends and receives directly through the NIC; the OS sits off the data path]
Application has direct access to the network interface
OS involved only in connection setup to ensure protection
Performance benefits: zero-copy, low overhead
Special support in the network interface
Two User-Level Communication Models
Active Messages: send(local_buffer, remote_handler)
  [Diagram: sender posts a send; the receiving NIC invokes a handler in the remote application]
Memory-to-Memory: send(local_buffer, remote_buffer)
  [Diagram: sender posts a send; data lands directly in a remote application buffer]
Memory-to-Memory Communication
Receive operation not required
Also called: (virtually) mapped comm, send-controlled comm, deliberate update, remote write, remote DMA, non-intrusive/silent comm
Application buffers must be (pre)registered with the NIC
[Diagram: receiver runs export(rem_buf); sender runs Rid = import(rem_buf), then send(local_buf1, Rid) and send(local_buf2, Rid) through the NICs, with no receive posted]
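The export/import/send sequence above can be sketched as a toy simulation. This is a minimal model with hypothetical names (not the real VIA or InfiniBand API): the receiver exports a registered buffer, the sender imports its id and writes into it, and the receiver never posts a receive.

```python
class NIC:
    """Simulated network interface holding buffer registrations."""
    def __init__(self):
        self.exports = {}      # region id -> exported receiver buffer
        self.next_id = 0

    def export_buf(self, buf):
        # Receiver side: register the buffer and hand out an id.
        rid = self.next_id
        self.next_id += 1
        self.exports[rid] = buf
        return rid

    def send(self, local_buf, rid, offset=0):
        # Sender side: data lands directly in the receiver's
        # registered buffer -- the receiver does nothing.
        remote = self.exports[rid]
        remote[offset:offset + len(local_buf)] = local_buf

nic = NIC()
recv_buf = bytearray(16)             # receiver's registered memory
rid = nic.export_buf(recv_buf)       # export(rem_buf)
nic.send(b"hello", rid)              # send(local_buf1, Rid)
nic.send(b"world", rid, offset=5)    # send(local_buf2, Rid)
```

The "silent" property is visible here: the receiver's only involvement was the one-time export.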
M2M Communication History
Started both in universities (SHRIMP-Princeton, U-Net-Cornell) and in industry (Hamlyn-HP, Memory Channel-DEC)
First application: high-performance computing
  Software DSM: HLRC (Princeton), Cashmere (Rochester)
  Lightweight message-passing libraries
Lightweight transport layer for cluster-based servers and storage
Industrial standards
  Virtual Interface Architecture (VIA)
  InfiniBand I/O Architecture
  Direct Access File System (DAFS) protocol
Outline
The M2M Game
M2M Toys: VIA, InfiniBand, DAFS
Playing with M2M
  Software DSM
  Intra-Server Communication
  Fault Tolerance and Availability
  TCP Offloading
Conclusions
What is VIA?
M2M communication architecture similar to U-Net and VMMC/SHRIMP
Standard initiated by Compaq, Intel, and Microsoft in 1997 for cluster interconnects
Point-to-point, connection-oriented protocol
Two communication models
  send/receive: a pair of descriptor queues
  M2M: RDMA write and RDMA read
Virtual Interface Architecture
[Diagram: application over the VI User Library; SEND, RECV, and COMP queues serviced by the VI NIC; Kernel Agent handles setup and memory registration]
Data transfer at user level
Polling or interrupt for completions
Setup and memory registration through the kernel
InfiniBand: An I/O Architecture with M2M
Point-to-point, switch-based I/O interconnect to replace the bus-based I/O architecture for servers
  more bandwidth
  more protection
Trade association founded by Compaq, Dell, HP, IBM, Intel, Microsoft and Sun in 1999
M2M communication similar to VIA
  RDMA write, RDMA read
  Remote atomic operations
InfiniBand I/O Architecture
[Diagram: processor and memory attached through a Host Channel Adapter (HCA) to a switched I/O fabric connecting I/O modules with Target Channel Adapters (TCAs)]
Hardware protocols for message passing between devices implemented in channel adapters
A channel adapter (CA) is a programmable DMA engine with special protection features that allow DMA operations to be initiated locally and remotely
M2M Communication in InfiniBand
Memory region: virtually contiguous area of memory registered with the channel adapter (L_key)
Memory window: protected remote access to a specified area of the memory region (R_key)
Remote DMA Read/Write {L_key, R_key}
[Diagram: RDMA between a local memory region (Local_key) and a memory window over the remote memory region (Remote_key), each backed by physical memory]
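The region/window protection model can be illustrated with a small sketch. Field and method names here are hypothetical simplifications, not the InfiniBand verbs API: registering a region yields an L_key, binding a window over part of it yields an R_key, and remote writes must fit inside the window or they are rejected.

```python
class ChannelAdapter:
    """Toy channel adapter enforcing L_key/R_key protection."""
    def __init__(self):
        self.regions = {}   # l_key -> backing bytearray
        self.windows = {}   # r_key -> (l_key, start, end)
        self._key = 0

    def register_region(self, size):
        self._key += 1
        self.regions[self._key] = bytearray(size)
        return self._key                      # L_key

    def bind_window(self, l_key, start, end):
        self._key += 1
        self.windows[self._key] = (l_key, start, end)
        return self._key                      # R_key

    def rdma_write(self, r_key, offset, data):
        # Remote access is checked against the window bounds,
        # not the whole region.
        l_key, start, end = self.windows[r_key]
        if start + offset + len(data) > end:
            raise PermissionError("access outside memory window")
        region = self.regions[l_key]
        region[start + offset:start + offset + len(data)] = data

ca = ChannelAdapter()
l_key = ca.register_region(64)          # memory region (L_key)
r_key = ca.bind_window(l_key, 16, 32)   # window over bytes 16..32
ca.rdma_write(r_key, 0, b"remote")      # lands at region offset 16
```

Only the window, not the full region, is exposed to the remote side, which is the protection point of the two-key scheme.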
InfiniBand Work Queue Operations
InfiniBand Communication Stack
Direct Access File System
Lightweight remote file access protocol designed to take advantage of M2M interconnect technologies
DAFS Collaborative, a group including 85 companies, proposed the standard in 2001
High performance
  Optimized for high throughput and low latency
  Transfers directly to/from user buffers
  Efficient file sharing using lock caching
Network-attached storage solution for data centers
DAFS Model
[Diagram: the application uses a file access API against the DAFS client (application buffers, DAFS client library, VIPL in user space; VI NIC driver in the kernel); the client talks over VI NICs to the DAFS server (DAFS file server, buffers, KVIPL, VI NIC driver)]
DAFS vs Traditional File Access Methods
M2M Product Market
VIA: Emulex (formerly Giganet)
InfiniBand: Mellanox
DAFS software distributions: Duke, Harvard, British Columbia, Rutgers (soon)
Outline
The M2M Game
M2M Toys: VIA, InfiniBand, DAFS
Playing with M2M
  Software DSM
  Intra-Server Communication
  Fault Tolerance and Availability
  TCP Offloading
Conclusions
Software DSM over VIA
Execution model: one application process on each node of the cluster
Invalidation-based memory coherence at page granularity using VM page protection
Data and synchronization traffic over VIA
[Diagram: a shared virtual address space spanning the code/data of every node across the VIA interconnect]
Home-based Data Coherency using VIA
[Diagram: Node 1 writes page A and RDMAs a diff to the home of A; the home RDMAs a page invalidation to Node 2; when Node 2 later reads A, the whole page is RDMAed back]
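The "RDMA diff" step above can be sketched in a few lines. This is a simplified model of home-based diff propagation (byte-granularity instead of word-granularity, and names are illustrative): the writer compares its dirty page against a clean twin and sends only the changed locations to the page's home.

```python
PAGE = 8  # toy page size in bytes

def make_diff(twin, page):
    """List of (offset, new_byte) pairs -- the 'RDMA diff' payload."""
    return [(i, page[i]) for i in range(PAGE) if page[i] != twin[i]]

def apply_diff(home_page, diff):
    # The home applies the diff in place; no full-page transfer.
    for i, b in diff:
        home_page[i] = b

home = bytearray(b"AAAAAAAA")   # home node's copy of the page
twin = bytes(home)              # clean copy taken at the first write
page = bytearray(home)          # writer's working copy
page[2] = ord("X")              # write(A)
page[5] = ord("Y")

diff = make_diff(twin, page)    # only 2 of 8 bytes travel
apply_diff(home, diff)
```

Because the diff lands directly in the home's registered page, no receive-side processing is needed, which is exactly the "silent communication" the slide credits to M2M.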
Lessons about M2M from DSM
Silent communication: used for diff propagation
Low latency: 75% of messages are small
Copy avoidance: not always possible
Useful but not available:
  Scatter-gather support
  Remote read (RDMA Read)
  Broadcast support
Scatter-Gather Support
[Diagram: what VIA supports (one contiguous source buffer to one destination); what we do (two separate messages); what we need (gather scattered source fragments into a single message)]
"True" scatter-gather can avoid multiple message latencies
Potential gain of 5-10%
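The cost argument can be made concrete with a toy model. The latency figure below is illustrative only (not a measurement from the talk): without gather, k scattered fragments pay k per-message latencies; with gather they travel as one message carrying the same bytes.

```python
def send_without_gather(fragments, per_msg_latency=30):
    # One message per fragment: pay the latency k times.
    return len(fragments) * per_msg_latency, b"".join(fragments)

def send_with_gather(fragments, per_msg_latency=30):
    # NIC gathers all fragments into a single message:
    # one latency, same payload delivered.
    return per_msg_latency, b"".join(fragments)

frags = [b"hdr", b"body", b"crc"]       # scattered source fragments
cost_k, data_k = send_without_gather(frags)
cost_1, data_1 = send_with_gather(frags)
```

The alternative the slide mentions, copying fragments into one contiguous staging buffer, trades the extra latencies for an extra copy, which is why "true" gather in the NIC is the preferred fix.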
RDMA Read
Allows fetching of data without involving the processor of the remote node
Potential gain of 10-20%
[Diagram: without RDMA Read, a page request incurs scheduling delay plus handling time on the remote node before the page is returned; with RDMA Read, the page is fetched directly]
Broadcast Support
Useful for the software DSM protocol
  Eager invalidation propagation
  Eager update of data
Previous research (Cashmere '00) speculates a gain of 10-15% from the use of broadcast
Outline
The M2M Game
M2M Toys: VIA, InfiniBand, DAFS
Playing with M2M
  Software DSM
  Intra-Server Communication
  Fault Tolerance and Availability
  TCP Offloading
Conclusions
TCP is really bad for Intra-Server Communication
Can easily steal 30-50% of the host cycles: of a 1 GHz processor, only 500-700 MHz are available to the application
Processor saturates before the NIC
TCP Offload Engine (TOE) solves the problem only partially
  without TOE: 90% of a dual 1 GHz processor required to achieve 2×875 Mb/s bandwidth
  with TOE: 52% of dual 1 GHz processors required to obtain 1.9 Gb/s Ethernet bandwidth
With M2M (Mellanox InfiniBand): 90% of the 3.8 Gb/s bandwidth using only 7% of an 800 MHz processor
Distributed Intra-Cluster Protocols using M2M
Direct Access File System (DAFS): network-attached storage over VIA/IB
Sockets Direct Protocol (SDP): lightweight transport protocol over VIA/IB
SCSI RDMA Protocol (SRP): connects servers to storage area networks over VIA/IB
Ongoing industry debate: "TCP or not TCP?" = "IP or M2M network?"
Distributed Intra-Cluster Server Applications using M2M
Cluster-based web servers
Storage servers
Distributed file systems
Cluster-based Web Server: Press
Location-aware web server with request forwarding and load balancing (Bianchini et al., Rutgers)
[Diagram: clients reach the cluster over TCP via eth0; inside the cluster, the main, disk, and fs components exchange send/recv traffic over VIA on cLAN]
Performance of VIA-based Press Web Server
[Chart: throughput on the Clarknet, Forth, Nasa, and Rutgers traces for TCP/FE, TCP/cLAN, and VIA/cLAN configurations]
[Carrera et al., HPCA'02]
Lessons about M2M from Web Servers
M2M/VIA used for small messages (requests, cache summaries, load) and large messages (files)
Low overhead is the most beneficial feature
Trading off transparency for performance is necessary
Zero copy is sometimes traded off for number of messages (in the absence of scatter-gather)
VI-Attached Storage Server
M2M for the database-storage interconnect (Zhou et al.)
[Diagram: a database server connected over a VI network to storage servers, each with its own local disks and VI interface]
Database Performance with VI-Attached Storage Server
FC driver highly optimized by vendor
cDSA outperforms by 18%
[Chart: normalized TPC-C transaction rate for Fibre Channel, kDSA, wDSA, and cDSA]
[Zhou et al., ISCA'02]
Lessons about M2M from Storage Servers
Zero copy, low overhead: most beneficial
Trade off transparency for performance
  extend the I/O API (asynchronous I/O, buffer registration) and/or relax I/O semantics (I/O completion)
  requires application modifications
Missed VIA features
  no flow control
  no buffer management
Serious competition: iSCSI
Federated File System (FedFS)
Global file namespace for distributed applications built on top of autonomous local file systems
[Diagram: applications A1-A3 run over FedFS, which is layered on each node's local file system across an M2M interconnect]
Location-Independent Global Naming
Virtual directory (VD): union of local directories, created on demand (dirmerge) and volatile
Directory table: local cache of VDs (analogous to a TLB)
[Diagram: local directories usr/file1 and usr/file2 on different nodes merge into the virtual directory usr containing file1 and file2]
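The dirmerge idea can be sketched directly. Function and class names here are hypothetical stand-ins for the FedFS mechanisms: a virtual directory is built on demand as the union of same-named local directories, and the directory table caches the result the way a TLB caches translations.

```python
def dirmerge(local_fss, path):
    """Virtual directory: union of `path` entries across nodes."""
    merged = set()
    for fs in local_fss:
        merged |= set(fs.get(path, []))
    return sorted(merged)

class DirectoryTable:
    """Cache of virtual directories, analogous to a TLB."""
    def __init__(self, local_fss):
        self.local_fss = local_fss
        self.cache = {}

    def lookup(self, path):
        if path not in self.cache:        # miss: build the VD on demand
            self.cache[path] = dirmerge(self.local_fss, path)
        return self.cache[path]

# Two nodes, each with its own local usr directory.
node1 = {"/usr": ["file1"]}
node2 = {"/usr": ["file2"]}
dt = DirectoryTable([node1, node2])
merged = dt.lookup("/usr")
```

Since cached VDs are volatile, the real system must keep the directory table coherent with the local directories, which is where the M2M interconnect comes in on the next slide.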
Role of M2M in FedFS
Directory table / virtual directory coherency
Cooperative caching
File migration
DAFS + VIA/IP = FedFS over the Internet
[Diagram: applications A1-A3 over FedFS instances connected by DAFS running on VIA/IP]
Outline
The M2M Game
M2M Toys: VIA, InfiniBand, DAFS
Playing with M2M
  Software DSM
  Intra-Server Communication
  Fault Tolerance and Availability
  TCP Offloading
Conclusions
M2M for Fault Tolerance and Availability
Use RDMA write to efficiently mirror an application's virtual address space in remote memory: Fast Cluster Failover (Zhou et al.)
  fast checkpointing
  fast failover
Use RDMA read for "silent" state migration: Migratory TCP
  extract checkpoints from overloaded servers with zero overhead
TCP-based Internet Services
Adverse conditions affect service availability
  internetwork congestion or failure
  servers overloaded, failed, or under DoS attack
TCP has one response: network delays => packet loss => retransmission
TCP limitations
  early binding of service to a server
  client cannot dynamically switch to another server for sustained service
The Migratory TCP Model
[Diagram: a client's live connection migrates from Server 1 to Server 2]
Migratory TCP: At a Glance
Migratory TCP's solution to network delays: migrate the connection to a "better" server
Migration mechanism is generic (not application specific), lightweight (fine-grain migration of per-connection state), and low-latency
Requires changes to the server application but is totally transparent to the client application
Interoperates with existing TCP
Per-connection State Transfer
[Diagram: per-connection application and M-TCP state is moved from Server 1 to Server 2 via RDMA, one connection at a time]
Application / M-TCP "Contract"
Server application
  Define per-connection application state
  During connection service, export snapshots of per-connection application state when consistent
  Upon acceptance of a migrated connection, import per-connection state and resume service
Migratory TCP
  Transfer per-connection application and protocol state from the old to the new server and synchronize (here is where VIA/IP can help!)
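The contract above can be sketched as a toy snapshot/import cycle. All names are hypothetical illustrations, not the M-TCP API: the server exports a consistent snapshot after each unit of work, and a migrated-to server imports it and resumes exactly where service left off.

```python
class Connection:
    """Toy per-connection application state under the M-TCP contract."""
    def __init__(self, conn_id):
        self.conn_id = conn_id
        self.bytes_served = 0
        self._snapshot = None

    def serve(self, n):
        self.bytes_served += n
        # Export a snapshot only at a consistent point,
        # i.e. after the unit of work completes.
        self._snapshot = {"conn_id": self.conn_id,
                          "bytes_served": self.bytes_served}

    def export_snapshot(self):
        return dict(self._snapshot)

    @classmethod
    def import_snapshot(cls, snap):
        # The new server rebuilds the connection and resumes service.
        conn = cls(snap["conn_id"])
        conn.bytes_served = snap["bytes_served"]
        return conn

old = Connection(conn_id=7)          # connection on Server 1
old.serve(4096)                      # work done before migration
migrated = Connection.import_snapshot(old.export_snapshot())
```

The point of exporting only at consistent points is that the snapshot can then be pulled lazily (e.g. by RDMA read) without coordinating with the old server's execution.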
Lazy Connection Migration
[Diagram: the client holds connection C to Server 1 (0); it sends <SYN C,...> to Server 2 (1); Server 2 sends a <State Request> to Server 1 (2); Server 1 returns a <State Reply> (3), to be replaced by an RDMA read in the future; Server 2 completes the handshake with <SYN + ACK> (4) and continues the connection as C']
Future Work: Connection Migration using M2M
[Diagram: the same migration, with the state transfer done by RDMA read (lazy) or RDMA write (eager) between Server 1 and Server 2]
Stream Server Experiment
Effective throughput is close to the average rate seen before server performance degrades (without VIA)
Outline
The M2M Game
M2M Toys: VIA, InfiniBand, DAFS
Playing with M2M
  Software DSM
  Intra-Server Communication
  Fault Tolerance and Availability
  TCP Offloading
Conclusions
TCP Servers: TCP Offloading for Cluster-Based Servers
[Diagram: the application on the host issues socket API calls; network processing runs on a dedicated TCP server node reached over VIA]
Implementation Details
[Diagram: socket calls from the host application are tunneled over VIA to the TCP server, which executes them against its BSD sockets and OS stack and faces the WAN through its NIC]
Sockets, VI Channels, and Buffers
[Diagram: each socket buffer is mapped onto a VI channel's SEND, RECEIVE, and RDMA operations]
Extended API
Standard API: tcps_socket, tcps_send, tcps_recv, ...
Extended API
  tcps_register_memory, tcps_deregister_memory (memory registration)
  tcps_send_async_registered, tcps_io_done, tcps_io_wait (asynchronous send)
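The intended usage pattern of the extended API can be sketched with a stub. The tcps_* names above are from the slides; this toy merely mimics the call sequence (register once, post asynchronous sends, then collect completions) and is not the real implementation.

```python
class TCPServerStub:
    """Toy stand-in for the offloaded TCP server's send path."""
    def __init__(self):
        self.registered = set()
        self.pending = []
        self.completed = []

    def register_memory(self, buf_id):
        # tcps_register_memory: one-time cost, paid before sending.
        self.registered.add(buf_id)

    def send_async_registered(self, buf_id):
        # tcps_send_async_registered: host returns immediately.
        assert buf_id in self.registered, "buffer not registered"
        self.pending.append(buf_id)

    def io_wait(self):
        # tcps_io_wait: the TCP server drains the queue and the
        # host collects completions in one step in this toy model.
        self.completed.extend(self.pending)
        self.pending.clear()
        return list(self.completed)

srv = TCPServerStub()
srv.register_memory("reply1")
srv.register_memory("reply2")
srv.send_async_registered("reply1")   # both sends overlap with
srv.send_async_registered("reply2")   # application processing
done = srv.io_wait()
```

The win over the synchronous path is that the host keeps serving requests between the posts and the wait, which is what the timeline slide below quantifies.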
Asynchronous Send Processing
[Diagram: the host pipelines requests to the TCP server over a VI; exported buffers move by RDMA write; results of previous sends and flow-control information return with each exchange while the server processes a request]
Send in TCP Server Architecture
[Timeline in microseconds, host application vs TCP server over VIA: tcps_send() issued at 0; preprocessing until 3; send posted at 5; asynchronous send returns at 14; request received by the TCP server at 32; the server's send() returns at 61; receive wait at 89; synchronous send completes at 90]
HTTP/1.0 Static Workloads
[Chart: throughput (replies/sec, 0-900) vs offered load (400-1000 requests/sec) for Regular, Sync, AsyncSend, ERecv, AsyncSend+EAccept, and AsyncSend+ERecv+EAccept]
Mixed Loads - Throughput
[Chart: throughput (replies/sec, 0-500) vs offered load (200-700 requests/sec) for the same configurations]
HTTP/1.1 Throughput
[Chart: throughput (replies/sec, 0-1400) vs offered load (800-1600 requests/sec) for Regular, Standard API, Sync, AsyncSend, and AsyncSend+EAccept]
Traditional Computer System
Intelligent host and passive I/O devices
OS executed exclusively on the host, along with applications
I/O devices communicate only through host memory
[Diagram: processor and memory on the host running applications, file system, and network protocols in the OS; storage controller and network adapter hang off the I/O bus]
The Cost of OS-Application Co-habitation
OS "steals" compute cycles and memory from applications
  Two protection modes: switching overhead
  OS executed asynchronously: interrupt processing overhead, internal synchronization on multiprocessor servers
  Cache pollution
Host involved in "service work"
  TCP packet retransmission
  TCP ACK processing
  ARP request service
Extreme cases are even worse
  Receive livelocks
  Denial-of-service (DoS) attacks
Host mediates data transfer between devices
[Diagram: data moves from disk to file buffer to application buffer to network buffer to network interface, all through host memory]
Server = Cluster of Intelligent Devices
[Diagram: host CPUs and memory connected over InfiniBand (IB) to an intelligent storage device (I-STORAGE) and an intelligent NIC (I-NIC), each with its own CPU and memory]
Split-OS Idea
[Diagram: the application and core OS stay on the host; the file system is offloaded to I-STORAGE and TCP/IP to I-NIC, connected by remote DMA]
Networking in Conventional OS
[Diagram: application and OS on the host; send/receive buffers are DMAed to and from the network interface; network packets and ACKs generate interrupts into the OS]
Split-Networking
[Diagram: the application and backup buffers stay on the host; send/receive buffers and packet processing live on the I-NIC, connected to the host over InfiniBand]
Split-Networking
Minimum overhead on the host: only communication between the application and the network interface
Retransmission and ACK processing handled in the intelligent network interface
Interrupts are eliminated
  Receive livelock is avoided (no interrupts)
  DoS attacks can be absorbed in the network interface
Send and receive buffers kept in the network interface as long as possible
  an optimal replacement policy can be implemented
  retransmission buffers may be evicted and written back to the host (non-intrusively, using RDMA)
  receive buffers can be eagerly transferred to the host, or discarded on overflow
Direct Device-to-Device Communication
[Diagram: the host issues 1. Bind(socket, VI_channel) and 2. Bind(file, VI_channel); I-STORAGE then 3. RDMA_writes the file directly to the I-NIC]
Transfer a file to a socket bypassing the host: transfer(file, socket, size)
OS creates a channel and binds the socket and the file to it
Direct D2D conflicts with caching in host memory
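The bind-then-transfer pattern can be sketched as follows. All names are hypothetical illustrations of the idea: the host's only job is to bind a file and a socket to a channel; the data itself then moves device-to-device without touching host memory in this model.

```python
class Channel:
    """Toy VI channel binding a file source to a socket sink."""
    def __init__(self):
        self.file = None
        self.socket_buf = None

def bind(channel, file_data, socket_buf):
    # Host-side setup (steps 1 and 2): create the binding,
    # then get out of the way.
    channel.file = file_data
    channel.socket_buf = socket_buf

def transfer(channel, size):
    # Step 3, device-to-device: I-STORAGE pushes file bytes
    # straight into the I-NIC's buffer; no host copy.
    channel.socket_buf.extend(channel.file[:size])

ch = Channel()
wire = bytearray()                     # stands in for the I-NIC buffer
bind(ch, b"<html>index</html>", wire)  # bind(file, socket, channel)
transfer(ch, 6)                        # transfer(file, socket, size)
```

The caching conflict noted on the slide shows up here too: because the bytes bypass the host, any file cache the host keeps never sees this transfer.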
Outline
The M2M Game
M2M Toys: VIA, InfiniBand, DAFS
Playing with M2M
  Software DSM
  Intra-Server Communication
  Fault Tolerance and Availability
  TCP Offloading
Conclusions
Why is M2M Good?
Low overhead for the sender
Zero overhead for the receiver (with RDMA)
Low latency: especially good for small messages
Zero copy from/to registered buffers
M2M Pitfalls
No flow control
  not necessary for round-trip or overwrite-type messages
Registration: expensive
Zero copy
  limited by registration capacity
  copying is better when several small messages must be sent
Best performance requires an M2M-aware application/protocol
What would be good to have
Remote read to silently fetch remote data
Remote atomics to allow buffer sharing
Scatter-gather
Hardware flow control
Open Questions
Blocking vs. spinning on I/O completion
User vs. kernel implementation
VIA/IP & IB/IPv6 vs. IP with TOE
Storage networking: IB vs. iSCSI