iSwitch: Accelerating Distributed Reinforcement
Learning with In-Switch Computing
Jian Huang, Youjie Li, Iou-Jen Liu, Yifan Yuan, Deming Chen, Alexander Schwing
University of Illinois at Urbana-Champaign
2
AI Applications are Increasingly Operating in Dynamic Environments
Autonomous Driving, Games, Robotics
Reinforcement Learning Empowers AI Applications to Take Real-Time Intelligent Actions
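3
What is Reinforcement Learning?
Diagram: the Agent (Model) and the Environment form a loop. The agent sends an Action; the environment returns a Reward and the Next State; Training produces a Gradient that updates the Model.
Train a Typical RL Agent on a Single GPU = 8 Days*
*Mnih, ICML’16
RL Requires Distributed Training for Improved Performance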
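4
Centralized Distributed RL Training: Parameter-Server Based
Diagram: Workers send their Gradients through the Switch to the Parameter Server. The Parameter Server computes the Sum, performs the Weight Update, and sends the Model back to the Workers.
Drawbacks: Central Bottleneck, Multiple Network Hops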
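The parameter-server round on this slide (workers push gradients, the server sums them, updates the weights, and returns the model) can be sketched as below. This is a single-process illustration only: the function name `parameter_server_step`, the learning rate, and the plain SGD update on the averaged gradient are assumptions, not details given in the slides.

```python
from typing import List


def parameter_server_step(weights: List[float],
                          worker_grads: List[List[float]],
                          lr: float = 0.1) -> List[float]:
    """One hypothetical PS round: sum worker gradients, then update weights."""
    # Sum: element-wise aggregation of the gradients pushed by all workers.
    summed = [sum(g[i] for g in worker_grads) for i in range(len(weights))]
    # Update Weight: plain SGD on the averaged gradient (an assumed update rule).
    n = len(worker_grads)
    return [w - lr * s / n for w, s in zip(weights, summed)]


# Three workers push gradients for a two-parameter model.
new_weights = parameter_server_step([1.0, 2.0],
                                    [[0.3, 0.0], [0.3, 0.6], [0.3, 0.0]])
# The server would now send new_weights back to every worker.
```

Every worker's gradient and every model copy passes through the single server, which is where the central bottleneck named on the slide comes from.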
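5
Decentralized Distributed RL Training: AllReduce Based
Diagram: Workers are connected through a Switch and organized as a Ring-AllReduce. Each worker holds a Model and a Gradient. Partial Sums are passed from worker to worker until every worker holds the Full Aggregated Gradient (Aggregation Complete!).
Drawback: Excessive Network Hops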
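The ring exchange pictured on this slide can be simulated in a few lines. The sketch below assumes the standard two-phase Ring-AllReduce (a sum phase followed by a distribution phase) and, for brevity, a gradient with exactly one chunk per worker; the function name `ring_allreduce` is illustrative.

```python
from typing import List


def ring_allreduce(grads: List[List[float]]) -> List[List[float]]:
    """Simulate Ring-AllReduce over N workers, each holding N gradient chunks."""
    n = len(grads)
    data = [list(g) for g in grads]  # data[w][c]: chunk c as held by worker w
    # Phase 1 (Sum): after N-1 steps, worker w owns the full sum of chunk (w+1) % n.
    for step in range(n - 1):
        sends = [(w, (w - step) % n, data[w][(w - step) % n]) for w in range(n)]
        for w, c, value in sends:
            data[(w + 1) % n][c] += value
    # Phase 2 (Full): circulate the completed chunks for another N-1 steps.
    for step in range(n - 1):
        sends = [(w, (w + 1 - step) % n, data[w][(w + 1 - step) % n])
                 for w in range(n)]
        for w, c, value in sends:
            data[(w + 1) % n][c] = value
    return data


result = ring_allreduce([[1.0, 2.0, 3.0],
                         [10.0, 20.0, 30.0],
                         [100.0, 200.0, 300.0]])
```

Each worker ends with the same aggregated gradient, but only after 2(N-1) sequential exchange steps, which is why the hop count grows with the number of workers.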
6
Network Communication is the Bottleneck in Distributed RL Training
Centralized Design (Parameter Server, Switch, Workers; Gradients cross the Switch): Network Hops = 4
Decentralized Design (Ring-AllReduce, Switch, Workers): Network Hops = 4N - 4
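One way to arrive at the two hop counts is sketched below. It assumes that N is the number of workers and that every transfer between two end hosts crosses the switch and therefore counts as two hops; both are readings of the diagram rather than statements from the slide.

```python
def ps_hops() -> int:
    """Hops per aggregation with a parameter server."""
    # worker -> switch -> server (2 hops), then server -> switch -> worker (2 hops)
    return 2 + 2


def ring_allreduce_hops(n_workers: int) -> int:
    """Hops per aggregation with Ring-AllReduce."""
    # 2 * (N - 1) sequential ring steps, each worker -> switch -> worker (2 hops)
    return 2 * (n_workers - 1) * 2


for n in (4, 6, 9, 12):
    print(f"N={n}: PS hops={ps_hops()}, AllReduce hops={ring_allreduce_hops(n)}")
```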
7
The Unique Characteristic of Distributed RL Training: Latency Critical

RL Benchmark          DQN-Atari   A2C-Atari   PPO-MuJoCo   DDPG-MuJoCo
Gradient Size         6 MB        3 MB        40 KB        158 KB
Training Iterations   200 M       2 M         0.2 M        3 M

DNN Benchmark         AlexNet-ImageNet   ResNet50-ImageNet   VGG16-ImageNet   MLP-MNIST
Gradient Size         250 MB             100 MB              525 MB           4 MB
Training Iterations   320 K              600 K               370 K            10 K

Compared with DNN training, RL training has an 88x Smaller Gradient Size and 158x More Iterations.
Distributed RL Training is Latency Critical
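The two ratios quoted on this slide can be approximated from the table values. The sketch below assumes the comparison uses arithmetic means over the four benchmarks in each table; the slide does not state how the ratios were computed. With the rounded table values the iteration ratio comes out at about 158x, while the gradient-size ratio comes out somewhat above the slide's 88x, so that figure was presumably derived from unrounded sizes or a different averaging.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Gradient sizes in MB (KB entries converted with 1 MB = 1000 KB, an assumption).
rl_grad_mb = [6, 3, 40 / 1000, 158 / 1000]      # DQN, A2C, PPO, DDPG
dnn_grad_mb = [250, 100, 525, 4]                # AlexNet, ResNet50, VGG16, MLP

# Training iterations.
rl_iters = [200e6, 2e6, 0.2e6, 3e6]
dnn_iters = [320e3, 600e3, 370e3, 10e3]

iter_ratio = mean(rl_iters) / mean(dnn_iters)
size_ratio = mean(dnn_grad_mb) / mean(rl_grad_mb)

print(f"RL needs about {iter_ratio:.0f}x more iterations")
print(f"RL gradients are about {size_ratio:.0f}x smaller (rounded table values)")
```

Small messages sent very many times make per-message latency, rather than bandwidth, the dominant cost.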
8
Quantifying the Network Overhead in Distributed RL Training
[Figure: share of training time (0% to 100%) split into Local Computation (Compute) and Grad Aggregation (Network) for DQN, A2C, PPO, and DDPG. Two panels: Parameter Server and AllReduce.]
Gradient Aggregation over the Network Dominates the Training Time (50~83%)
9
In-Switch Acceleration: A New Distributed Computing Paradigm
Programmable Switch + Aggregation Accelerator
Performance: Reduce End-to-End Network Latency
Programmability: Hardware-Algorithm Co-Design
Scalability: Scale Training at Rack Scale
10
Challenges of In-Switch Acceleration
Limited On-Chip Resources
No Impact on Regular Switch Functions
Scale with More Switches and Nodes
11
Basics of Programmable Switch
Control Plane: Forwarding Control, System Configuration, …
Data Plane: Packet Forwarding
Data Plane modules, from Input Ports to Output Ports: four Receivers (each with an RxQ), Input Arbiter, Packet Process (Output Port Lookup), four Transmitters (each with a TxQ)
Packet: Header + Data
12
Integrating Aggregation Accelerator into the Programmable Switch
Data Plane modules: four Receivers (each with an RxQ), Input Arbiter, Packet Process (Output Port Lookup), Accelerator, four Transmitters (each with a TxQ)
The diagram marks the existing packet-processing path as the Core of Regular Functions.
The Input Arbiter separates packets by Header: Regular Traffic follows the regular path, Gradient Traffic goes to the Accelerator.
Hardware Acceleration Isolated From Regular Switch Function
13
Developing Light-Weight Accelerator for Aggregation
Gradient Vector: Seg 0, Seg 1, …, Seg i, …, Seg N. Packet Pkt i carries segment Seg i.
In-Switch Accelerator modules, as labeled in the diagram:
Separator: splits the incoming Pkt i into Header and Payload
Parser: extracts the Seg Idx from the Header
Slicer: cuts the Payload into Elements
Buffer Module: holds the partially aggregated segment selected by Seg Idx
Counter Module: tracks aggregation progress per segment against a Threshold
Output Module: emits the aggregated Pkt i
Accelerator Resource Consumption: extra 18.6% of LUT, 17.3% of FF, and 17 DSP
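The interplay of the buffer, counter, threshold, and output modules can be illustrated in software. The sketch below is a toy model under stated assumptions: the class and method names are hypothetical, the threshold is taken to be the number of packets to aggregate per segment before the result is emitted, and the sequential loop stands in for hardware that would process elements in parallel.

```python
from typing import Dict, List, Optional


class AggregationAccelerator:
    """Toy model of the in-switch accelerator: per-segment buffer plus counter."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold                 # packets to aggregate per segment
        self.buffers: Dict[int, List[float]] = {}  # Buffer Module: Seg Idx -> partial sum
        self.counters: Dict[int, int] = {}         # Counter Module: Seg Idx -> packets seen

    def on_packet(self, seg_idx: int, payload: List[float]) -> Optional[List[float]]:
        """Accumulate one gradient packet; return the summed segment once complete."""
        buf = self.buffers.setdefault(seg_idx, [0.0] * len(payload))
        for i, element in enumerate(payload):      # Slicer: payload split into elements
            buf[i] += element
        self.counters[seg_idx] = self.counters.get(seg_idx, 0) + 1
        if self.counters[seg_idx] < self.threshold:
            return None                            # still waiting for other workers
        # Output Module: emit the aggregated segment and free its buffer and counter.
        del self.counters[seg_idx]
        return self.buffers.pop(seg_idx)


acc = AggregationAccelerator(threshold=2)
first = acc.on_packet(0, [1.0, 2.0])    # first worker's Seg 0: nothing emitted yet
second = acc.on_packet(0, [3.0, 4.0])   # second worker's Seg 0: sum is emitted
```

Because state is kept per segment index, segments from different workers can arrive interleaved and each one completes independently.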
14
Aggregating Gradient at Packet-Level for Improved Parallelism
[Figure: timelines comparing Conventional Vector-Level Aggregation with Packet-Level Aggregation in Our iSwitch, each ending in Sum and Result.]
Further Reduce Aggregation Time
15
Extending Network Protocol for In-Switch Computing
Regular Packet: ETH | IP | UDP | Application Data
Data Packet of iSwitch: ETH | IP | UDP | Seg | Gradient (identified through the Type-of-Service Field of the IP header)
Control Packet of iSwitch: ETH | IP | UDP | Action | Value (optional)

Action   Description
Join     Join the training job
Leave    Leave the training job
Reset    Clear the accelerator on the switch
SetH     Set aggregation threshold H on switch
FBcast   Force broadcast a segment on switch
Help     Request a lost data packet for a worker
Ack      Confirm the success of some actions

iSwitch extension will NOT affect regular network functions
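The two payload layouts can be made concrete with Python's `struct` module. Everything numeric here is an assumption for illustration: the slides do not give field widths, action codes, or the Type-of-Service value, so the 32-bit segment index, float32 gradient elements, one-byte action code, and `ISWITCH_TOS` constant are hypothetical. The optional value is simplified to an always-present field that defaults to zero.

```python
import struct
from typing import List, Tuple

ISWITCH_TOS = 0x10  # hypothetical Type-of-Service value marking iSwitch packets

# Hypothetical numeric codes for the control actions listed on the slide.
ACTIONS = {"Join": 0, "Leave": 1, "Reset": 2, "SetH": 3,
           "FBcast": 4, "Help": 5, "Ack": 6}


def pack_data_packet(seg_idx: int, gradient: List[float]) -> bytes:
    """UDP payload of a data packet: segment index, then float32 gradient values."""
    return struct.pack(f"!I{len(gradient)}f", seg_idx, *gradient)


def unpack_data_packet(payload: bytes) -> Tuple[int, List[float]]:
    """Inverse of pack_data_packet."""
    count = (len(payload) - 4) // 4      # 4-byte index, then 4 bytes per element
    seg_idx, *values = struct.unpack(f"!I{count}f", payload)
    return seg_idx, list(values)


def pack_control_packet(action: str, value: int = 0) -> bytes:
    """UDP payload of a control packet: action code, then a value field."""
    return struct.pack("!BI", ACTIONS[action], value)


data = pack_data_packet(7, [0.5, -1.25])
ctrl = pack_control_packet("SetH", 4)    # ask the switch to use threshold H = 4
```

On Linux, a sender could tag its UDP socket with `sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, ISWITCH_TOS)` so that a switch can tell these packets apart from regular traffic without parsing the payload.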
16
Supporting Different (Sync & Async) Training Execution Modes
Synchronous Distributed Training (Programmable Switch with Aggregation Accelerator): In-Switch Acceleration Directly Applies
Asynchronous Distributed Training (Programmable Switch with Aggregation Accelerator): workers Keep Computing while the switch Keeps Aggregating
HW/Algo Co-Design For Improved Parallelism
17
Scaling In-Switch Computing in Rack-Scale Data Centers
The Typical Network Architecture at Data Center: Racks of Servers, Top-of-Rack Switches, “Aggregate” Switches, Core Switches
The Hierarchical Aggregation of iSwitch: gradient packets (Grad Pkt) from the servers are aggregated level by level on the way up (Top-of-Rack, then “Aggregate”, then Core Switches), and the aggregated packets are then sent back down to all servers.
No Additional Cost or Topology Change for Scaling In-Switch Computing
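Hierarchical aggregation relies on summation being associative: summing per rack first and then summing the rack results gives the same total as summing all workers at once. The sketch below shows this for two levels only; the function names are illustrative, and real switches would do this per packet rather than on whole lists.

```python
from typing import List


def aggregate(packets: List[List[float]]) -> List[float]:
    """Element-wise sum of gradient packets, as one switch would compute it."""
    return [sum(values) for values in zip(*packets)]


def hierarchical_aggregate(racks: List[List[List[float]]]) -> List[float]:
    """Each Top-of-Rack switch sums its rack; an upper-level switch sums those."""
    tor_sums = [aggregate(rack) for rack in racks]   # one partial sum per rack
    return aggregate(tor_sums)                       # upper-level switch


racks = [
    [[1.0, 2.0], [3.0, 4.0]],    # rack 0: gradients from two workers
    [[5.0, 6.0], [7.0, 8.0]],    # rack 1: gradients from two workers
]
total = hierarchical_aggregate(racks)
flat = aggregate([g for rack in racks for g in rack])  # single-switch reference
```

Each upper-level switch sees only one partial sum per rack below it, so traffic toward the core shrinks at every level.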
18
In-Switch Computing Implementation: NetFPGA-SUME Board, GPU Cluster
RL Training Benchmarks: DQN, A2C, PPO, DDPG
19
Reducing the End-to-End Training Time with iSwitch
[Figure: Average Episode Reward (-25 to 25) versus Training Time (min) of DQN (0 to 2000) for Parameter Server (PS), AllReduce (AR), and iSwitch (iSW). Annotations: 3.7x Speedup, 1.9x.]
20
Performance Breakdown for Each Training Iteration
[Figure: stacked bars of Training Time (Norm), 0 to 1.2, for PS, AR, and iSW on DQN, A2C, PPO, and DDPG. Components: Agent Action, Environment, Buffer Sampling, Memory Alloc, Forward Pass, Backward Pass, GPU Copy, Grad Aggregation, Weight Update, Others.]
21
Improved Training Scalability with In-Switch Computing
[Figure: Speedup (1 to 3) versus Number of Worker Nodes (4, 6, 9, 12). Panel "Synchronous Training of PPO": PS, AR, iSW, Ideal. Panel "Asynchronous Training of PPO": PS, iSW, Ideal.]
Close-to-Linear Speedup for Both Training Modes
22
In-Switch Computing Summary
Programmable Switch + Aggregation Accelerator
3.7x Speedup for Both Sync/Async Training
Scales at Rack-Scale Clusters
Thanks!
Jian Huang, Youjie Li, Iou-Jen Liu, Yifan Yuan, Deming Chen, Alexander Schwing
University of Illinois at Urbana-Champaign