
Page 1

LOAD BALANCING SWITCH

By: Maxim Fudim, Oleg Schtofenmaher
Supervisor: Walter Isaschar

PROJECT POSTER

Winter - Spring 2008

Page 2

Abstract

Software solutions for real-time processing are too slow
Power dissipation limits working frequencies
Greater computing power is needed
H/W accelerators may improve S/W processes
Multi-core, multi-threaded systems are the future

Page 3

Project Goals

Multiprocessor environment for parallel processing of a vector data stream
Maximal throughput
Configurable hardware
Expandable design
Statistics reporting

Page 4

System specifications

SW over transparent HW
Interface over PCI
1 Mbit/s input stream
Vectors of 8–1024 chunks
Variable number of processors
System spread over multiple FPGAs

Page 5

Problem

How to manage the data stream?
How to manage multiple parallel units?
How to achieve full and effective utilization of resources?

Page 6

Solution (Top Level)

Board-level Load Balancing Switch
One system input and output to PCI
Distribute vectors among classes
Local buffers for chip data

Page 7

Solution (Chip Level)

Chip-level Load Balancing Switch
Converting shared resources into “personal” work spaces
Cluster-organized VPUs
Monitoring of each unit’s load
Smart arbitration
Flexible and easy configuration

Page 8

Solution – Tree Distribution Switch

[Diagram: the SW/HW interface feeds a Class of Service Distribution stage; each class routes through an LBS Arbitration stage that balances vectors across its clusters of VPUs.]

Page 9

Three-level Architecture

A level for packet management (Classes): type, size, and priority of data
A level for organizing the various processing units (Clusters): speed, quantity, and resources of processors
A level for fine tuning (VPUs): algorithm, HW acceleration
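The three-level dispatch described above can be modeled as a short sketch. All names and the selection policies here (size-based classification, round-robin over clusters, least-loaded VPU) are illustrative assumptions, not the project's actual implementation:

```python
# Illustrative model of the Classes -> Clusters -> VPUs hierarchy.

class VPU:
    def __init__(self, vpu_id):
        self.vpu_id = vpu_id
        self.queued_chunks = 0          # current load, in chunks

    def accept(self, vector):
        self.queued_chunks += vector["chunks"]

class Cluster:
    """Level 2: organizes VPUs; here the arbiter picks the least-loaded one."""
    def __init__(self, vpus):
        self.vpus = vpus

    def dispatch(self, vector):
        target = min(self.vpus, key=lambda v: v.queued_chunks)
        target.accept(vector)
        return target.vpu_id

class ServiceClass:
    """Level 1: packet management; vectors of one class share its clusters."""
    def __init__(self, name, clusters):
        self.name = name
        self.clusters = clusters
        self._rr = 0                    # round-robin position over clusters

    def dispatch(self, vector):
        cluster = self.clusters[self._rr % len(self.clusters)]
        self._rr += 1
        return cluster.dispatch(vector)

def classify(vector, classes):
    """Top level: pick a class by vector size (type or priority would also work)."""
    return classes["short"] if vector["chunks"] <= 64 else classes["long"]

classes = {
    "short": ServiceClass("short", [Cluster([VPU(0), VPU(1)])]),
    "long":  ServiceClass("long",  [Cluster([VPU(2), VPU(3)])]),
}
vid = classify({"chunks": 8}, classes).dispatch({"chunks": 8})  # lands on VPU 0
```

Keeping the three levels separate mirrors the poster's point: classes, clusters, and VPUs can each be tuned independently.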

Page 10

Implementation

Multi-chip system connected over two busses
Input and controls over the Main Bus
Output via streamed neighbor busses
Local FIFOs for every chip/class
Classifier for packet management
SW-configurable controls
Cluster-organized VPUs with in/out arbitration
Watchdogs & statistics gathering

Page 11

Board Level Diagram

[Diagram: an S/W emulator or H/W DSP system supplies input vectors and receives output reports over the PCI Bus. A Classifier feeds four LBS FPGAs (LBS1–LBS4), each a Stratix II 180 on the PROCStar II board with two DDR2 banks and per-LBS registers. The Main Bus carries data in and controls; Ring Busses chain neighboring FPGAs for output.]

Page 12

Single FPGA Top Diagram

[Diagram: one Stratix II 180 FPGA (any of LBS 1–4) containing the Load Balancing Switch (LBS), an I/O–LBS Control Block, a Bus Control Block, DDR2 controllers for Banks A and B, and sixteen NIOS clusters; arrows mark the data flow.]

Page 13

Data Packet Format

Packet layout: Header | Data: 1 to N 32-bit words | Tail

Header fields:
Version – 4-bit
SW/HW Control – 1-bit
Type (Data/Command) – 1-bit
Vector ID/Command Type – 16-bit
Nios Number – 8-bit
Data Length N – 32-bit
Unused – remaining bits

Tail: sync data
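As an illustration, the header fields above can be packed into 32-bit words. The field widths follow the slide, but the bit positions and word order are assumptions, since the poster does not specify the exact layout:

```python
# Hypothetical packing of the packet header into two 32-bit words.
# Widths from the slide: Version 4, SW/HW Control 1, Type 1,
# Vector ID/Command Type 16, Nios Number 8, Data Length N 32.
# Bit positions and word order are assumed for illustration only.

def pack_header(version, swhw, msg_type, vector_id, nios_number, data_length):
    # Word 0: Version | SW/HW Control | Type | Vector ID | Nios Number | unused
    word0 = ((version     & 0xF)    << 28 |
             (swhw        & 0x1)    << 27 |
             (msg_type    & 0x1)    << 26 |   # 0 = Data, 1 = Command
             (vector_id   & 0xFFFF) << 10 |
             (nios_number & 0xFF)   << 2)     # low 2 bits unused
    # Word 1: Data Length N (number of 32-bit data words)
    word1 = data_length & 0xFFFFFFFF
    return word0, word1

def unpack_header(word0, word1):
    return {
        "version":     (word0 >> 28) & 0xF,
        "swhw":        (word0 >> 27) & 0x1,
        "type":        (word0 >> 26) & 0x1,
        "vector_id":   (word0 >> 10) & 0xFFFF,
        "nios_number": (word0 >> 2)  & 0xFF,
        "data_length": word1,
    }

w0, w1 = pack_header(version=1, swhw=0, msg_type=0,
                     vector_id=0x1234, nios_number=5, data_length=1024)
```

The fields occupy disjoint bit ranges, so `unpack_header` round-trips whatever `pack_header` produced.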

Page 14

LBS Class Top Level View

[Diagram: inside the Stratix II FPGA, a Main Controller unit supervises an Input Reader (fed from the FIFO Input Port over the input data bus) and an Output Writer (driving the FIFO Output Port over a muxed output data bus). Several Cluster Arbiter / NIOS II System blocks sit between them, each with its own control lines; a Statistics Reporter and a Busses Control Block provide control and status.]

Page 15

Organization of VPUs (Vector Processing Units)

NIOS VPUs joined into clusters
Constant number of clusters
Parametric number of NIOS VPUs per cluster
Parametric control & distribution logic
Various configurations of NIOS
Static/dynamic priority arbitration
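The two arbitration policies named above can be sketched as grant functions over a request bit-vector (bit i set means VPU i requests service). This is an illustrative software model, not the project's RTL:

```python
# Minimal models of static-priority and round-robin arbitration.

def static_priority_grant(requests, n):
    """Fixed priority: the lowest-numbered requesting VPU always wins."""
    for i in range(n):
        if requests & (1 << i):
            return i
    return None                      # no requests pending

def round_robin_grant(requests, n, last_grant):
    """Rotating priority: the search starts just after the previous winner."""
    for offset in range(1, n + 1):
        i = (last_grant + offset) % n
        if requests & (1 << i):
            return i
    return None

# Usage: VPUs 1 and 3 request. Static priority always grants VPU 1,
# while round-robin alternates between them across cycles.
reqs = 0b1010
g0 = static_priority_grant(reqs, 4)            # grants 1
g1 = round_robin_grant(reqs, 4, last_grant=1)  # grants 3
g2 = round_robin_grant(reqs, 4, last_grant=3)  # grants 1 again
```

Round-robin avoids starving high-numbered VPUs under constant load, which is why it appears as the baseline policy in the performance tables later in the poster.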

Page 16

LBS Units Description – VPUs: NIOS System

Single processor with in/out buffers
HW-accelerated system
Shared-resources system with mutex
Multi-processor system with a number of ports to the cluster

Page 17

Resource Usage

Resource usage for a 6-VPU system:

Module                                | Logic utilization | %   | Memory (M4K) | %
Peripheral IPs (MegaFIFO, PLLs, etc.) | 3,100             | 2   | 16           | 2
User System (All VPUs + LBS)          | 42,000            | 30  | 675          | 88
Single VPU                            | 6,775             | 4.7 | 112          | 15
LBS Logic                             | 1,350             | 1   | 3            | 0.5
Total usage of chip resources         | 45,896            | 32  | 691          | 90
Total available                       | 143,000           | 100 | 768          | 100

VPU resource usage is measured for basic VPUs and may be reduced by advanced configurations and policies.

Page 18

Performance of LBS

Theoretical throughput: 100 MHz × 64 bit = 6.4 Gbit/s
Arbitration and routing latency: 2–4 cycles on average
Effective bandwidth utilization: 60% for short vectors, up to 98% for long vectors
Real throughput: 1 Mbit/s – 400 Mbit/s
Bottlenecks: PCI and slow algorithms
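The bandwidth figures above follow from simple arithmetic, reproduced here for reference — a 64-bit bus at 100 MHz, derated by the utilization quoted for short and long vectors:

```python
# Theoretical and effective throughput of the LBS interconnect.

clock_hz = 100e6                               # 100 MHz switch clock
bus_bits = 64                                  # 64-bit data bus

theoretical_bps = clock_hz * bus_bits          # 6.4 Gbit/s peak

short_vector_bps = theoretical_bps * 0.60      # 60% utilization (short vectors)
long_vector_bps = theoretical_bps * 0.98       # up to 98% (long vectors)
```

Short vectors pay the 2–4 cycle arbitration latency per vector more often, which is where the utilization gap between 60% and 98% comes from.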

Page 19

Performance for short vectors

Time and throughput for 1000 vectors of 4 chunks each:

System                 | Time of Service [sec] | Throughput [Mbit/s] | Improvement
SW (on Core2Duo E6600) | 0.1                   | 3.2                 | –
6 VPUs                 | 0.00209               | 122                 | 38
2 Classes of 6 VPUs    | 0.00134               | 191                 | 60
3 Classes of 6 VPUs    | 0.00086               | 297                 | 93
4 Classes of 6 VPUs    | 0.00064               | 400                 | 125

VPU performance is measured with basic VPUs and RR arbitration, and may be increased for a given workload by defining advanced configurations and policies after performance analysis.

Page 20

Performance for medium vectors

Time and throughput for 1000 vectors of 200 chunks each:

System                 | Time of Service [sec] | Throughput [Mbit/s] | Improvement
SW (on Core2Duo E6600) | 2.9                   | 2.3                 | –
6 VPUs                 | 0.28                  | 23.4                | 10
2 Classes of 6 VPUs    | 0.15                  | 43.5                | 18.5
3 Classes of 6 VPUs    | 0.10                  | 66                  | 28.7
4 Classes of 6 VPUs    | 0.074                 | 88                  | 38

VPU performance is measured with basic VPUs and RR arbitration, and may be increased for a given workload by defining advanced configurations and policies after performance analysis.

Page 21

Performance for long vectors

Time and throughput for 100 vectors of 1000 chunks each:

System                 | Time of Service [sec] | Throughput [Mbit/s] | Improvement
SW (on Core2Duo E6600) | 1.1                   | 2.9                 | –
One VPU                | 1.224                 | 2.62                | 0.89
6 VPUs                 | 0.208                 | 15.43               | 5.3
2 Classes of 6 VPUs    | 0.11                  | 29.1                | 10
3 Classes of 6 VPUs    | 0.074                 | 43.69               | 14.8
4 Classes of 6 VPUs    | 0.061                 | 52.46               | 18

VPU performance is measured with basic VPUs and RR arbitration, and may be increased for a given workload by defining advanced configurations and policies after performance analysis.
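As a sanity check on the long-vector figures, throughput follows from data volume divided by service time, and improvement is the ratio to the SW baseline (assuming a chunk is one 32-bit data word, as in the packet format):

```python
# Reproducing the long-vector table's throughput and improvement columns.

vectors, chunks, bits_per_chunk = 100, 1000, 32
total_mbit = vectors * chunks * bits_per_chunk / 1e6   # 3.2 Mbit total

sw_time = 1.1                                          # SW baseline [sec]
sw_mbps = total_mbit / sw_time                         # ~2.9 Mbit/s

six_vpu_time = 0.208                                   # 6-VPU system [sec]
six_vpu_mbps = total_mbit / six_vpu_time               # ~15.4 Mbit/s
improvement = six_vpu_mbps / sw_mbps                   # ~5.3x over SW
```

The same arithmetic reproduces the other rows, which confirms the table's internal consistency.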

Page 22

System performance – missing TOAs

[Chart: processing time [sec] (0.000–0.500) vs. number of missing TOAs (0–20), comparing Hardware and Software.]

Page 23

System performance – noise levels

[Chart: processing time [sec] (0.000–0.600) vs. noise percentage (0–50%), comparing Hardware and Software.]