1
Improving Server Efficiency by Op3mizing TCP Pure ACK Receive Processing Mo3va3ons Servers today spend significant por2on of cycles on TCP/IP stack Impact of future network technologies such as 40GE and 100GE Servers handle more concurrent clients; bigger network processing burden on CPU Problems with current network stacks: Expensive packet buffer management, many layers of processing Many servers are bulk TCP data senders, receiving mostly pure TCP ACKs, e.g. Web servers, file repositories Pure ACK RX processing is simple; can be op2mized to: 1. Consume fewer cycles per ACK 2. Improve ACK throughput Michael Chan and David Cheriton [email protected] , [email protected] Op3mizing pure ACK RX processing Detect pure TCP ACK by device and driver Parse TCP header fields Seq #, ACK #, receiver window size, 2mestamps Deliver TCP header fields directly to TCP layer Reuse the same packet buffer for next receive opera2on Implementa3on in Linux Kernel Removed all layers between driver and TCP ACK processing LRO, network taps, IP, Ne[ilter, other TCP code Removed many memory management opera2ons No need to deallocate and reallocate packet buffers 40line ACK header parser, 400line TCP Pure ACK fast path Evalua3on Two machines: Xeon Nehalem 2.66GHz, Intel 82574L Gigabit LanonMotherboard, Neterion X3120 10GE NIC Network Device RX Desc Ring RX Packet Ring Network Device Driver Ethernet IP TCP Pure ACK Fast Path BSD Socket Packet buffer allocator Slow path: Packet buffers Packet read RX Descriptor r/w Packet RX DMA RX Descriptor r/w Op3mized Path: TCP seq #, ACK #, Receiver window size, Timestamps User Applica2on Microbenchmark Flood a single core with 100K pure ACKs; Bulk Data Transfer Send bulk data via 10GE NIC to 20 clients concurrently # CPU cycles saved PerACK processing 2me when TX is involved bare minimum ACK processing 2me (ACKs do not cause TX of new segments): Original: 1269ns ± 66ns Op2mized: 457ns ± 7ns 40% reduc2on in perACK processing 2me 14% fewer CPU cycles required for data transfer

Improving*Server*Efficiencyby Op3mizing*TCP*Pure… · Network&Device& RX Desc&Ring& RX Packet Ring& Network&Device&Driver& Ethernet IP& TCP& Pure&ACK&Fast Path& BSD&Socket Packet

  • Upload
    hoangtu

  • View
    219

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Improving*Server*Efficiencyby Op3mizing*TCP*Pure… · Network&Device& RX Desc&Ring& RX Packet Ring& Network&Device&Driver& Ethernet IP& TCP& Pure&ACK&Fast Path& BSD&Socket Packet

Improving  Server  Efficiency  by  Op3mizing  TCP  Pure  ACK  Receive  Processing  

Mo3va3ons  •  Servers  today  spend  significant  por2on  of  cycles  on  TCP/IP  stack  •  Impact  of  future  network  technologies  such  as  40GE  and  100GE  

•  Servers  handle  more  concurrent  clients;  bigger  network  processing  burden  on  CPU  •  Problems  with  current  network  stacks:  Expensive  packet  buffer  management,  many  layers  of  processing  •  Many  servers  are  bulk  TCP  data  senders,  receiving  mostly  pure  TCP  ACKs,  e.g.  Web  servers,  file  repositories  •  Pure  ACK  RX  processing  is  simple;  can  be  op2mized  to:  

1.  Consume  fewer  cycles  per  ACK  2.  Improve  ACK  throughput  

Michael  Chan  and  David  Cheriton  [email protected],  [email protected]    

Op3mizing  pure  ACK  RX  processing  •  Detect  pure  TCP  ACK  by  device  and  driver  •  Parse  TCP  header  fields  

•  Seq  #,  ACK  #,  receiver  window  size,  2mestamps  •  Deliver  TCP  header  fields  directly  to  TCP  layer  •  Reuse  the  same  packet  buffer  for  next  receive  opera2on  

Implementa3on  in  Linux  Kernel  •  Removed  all  layers  between  driver  and  TCP  ACK  processing  

•  LRO,  network  taps,  IP,  Ne[ilter,  other  TCP  code  •  Removed  many  memory  management  opera2ons  

•  No  need  to  deallocate  and  reallocate  packet  buffers  •  40-­‐line  ACK  header  parser,  400-­‐line  TCP  Pure  ACK  fast  path  

Evalua3on  Two  machines:  Xeon  Nehalem  2.66GHz,  Intel  82574L  Gigabit  Lan-­‐on-­‐Motherboard,  Neterion  X3120  10GE  NIC  

 

Network  Device  

RX  Desc  Ring  

RX  Packet  Ring  

Network  Device  Driver  

Ethernet  

IP  

TCP  Pure  ACK  Fast  

Path  

BSD  Socket  

Packet  buffer  

allocator  

Slow  path:  Packet  buffers  

Packet  read  RX  Descriptor  r/w  

Packet  RX  DMA  RX  Descriptor  r/w  

Op3mized  Path:  TCP  seq  #,  ACK  #,  

Receiver  window  size,  Timestamps  

User  Applica2on  

Micro-­‐benchmark  

Flood  a  single  core  with  100K  pure  ACKs;  

   

Bulk  Data  Transfer  

•  Send  bulk  data  via  10GE  NIC  to  20  clients  concurrently  •  #  CPU  cycles  saved  •  Per-­‐ACK  processing  2me  when  TX    is  involved  

bare  minimum  ACK  processing  2me    (ACKs  do  not  cause  TX  of  new  segments):                        

 Original:  1269ns  ±  66ns  

                     Op2mized:  457ns  ±  7ns  

   

40%  reduc2on  in  per-­‐ACK  processing  2me  

14%  fewer  CPU  cycles  required  for  data  transfer