Low-Latency Virtual-Channel Routers for On-Chip Networks. Robert Mullins, Andrew West, Simon Moore. Presented by Sailesh Kumar

Virtual Channel Router

  • Low-Latency Virtual-Channel Routers for On-Chip Networks

    Robert Mullins, Andrew West, Simon Moore

    Presented by Sailesh Kumar

  • Outline

    Motivation

      Why Network-on-Chip (NoC)

    Comparison to packet networks

      Similarities
      Differences
      Design constraints

    Topology and routing/switching techniques for NoC

      Mesh, fat tree, honeycomb
      Greedy, deflection, wormhole, virtual channels

    Start with the paper: design of a low-latency virtual-channel router

  • Why NoC

    The billion-transistor era has arrived

      Several such SoCs are in the pipeline; interconnection is critical

    A generic interconnection architecture ensures

      Reduced design time
      IP reuse
      Predictable backend (versus ad-hoc wiring)

    Bus-based interconnects were sufficient until now, but not any more

      A shared bus is slow (it arbitrates between several requesters)
      More components increase loading => speed drops further
      Ad-hoc routing of wires results in backend complications, lower performance and higher power consumption

  • Why NoC (cont)

    Recently Dally proposed an idea

      Route packets, not wires, as in data networks
      Point-to-point communication
      Point-to-point links are faster

    Create a chip-wide network (like a regular IP WAN)

      A router at every node
      Links connecting all routers
      Messages encapsulated in packets, which are routed

    Challenges

      Topologies, routing protocol
      Network and router design with small footprint and low latency

  • Some more motivations

    The need to put repeaters into long wires allows us to add the switching needed to implement a network at little additional cost

    Makes efficient use of critical global wiring resources by sharing them across different senders and receivers

    Simplifies overall design

      Design a single router and copy-paste it in both dimensions

  • A typical NoC

    Layered design of reconfigurable micronetworks
    Exploits methods and tools used for general networks
    Micronetworks based on the ISO/OSI model

    NoC architecture consists of Physical, Data link, and Network layers

    Implemented in cores, enables end-to-end reliable transport
    Multi-hop route setup, packet addressing, etc.
    Contention issues, reliability issues, grouping of physical layer bits, e.g. flits

  • A NoC topology

    Cores communicate with each other using the NoC
    The NoC consists of Routers (R) and Network Interfaces (NI)
    An NI is linked to a Router by non-pipelined wires
    One or more cores are connected to an NI

  • Other NoC topologies

    Fat tree
    Mesh

  • Routing protocols

    We will only consider the mesh topology

    Objective is to find a path from a source to a destination

    Greedy algorithms (deterministic)

      Choose the shortest path (e.g. X-Y)

    Adaptive routing

      If there is congestion, choose an alternative path
      Deflection routing

    Is adaptive better than greedy? => NOT REALLY (when only local information is used)

      Adaptive routing can also result in livelock
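
    A minimal sketch of X-Y (dimension-ordered) routing on a mesh, the deterministic greedy scheme mentioned above. The function name `xy_route` and the (x, y) coordinate convention are assumptions made for illustration; they are not taken from the paper.

```python
def xy_route(src, dst):
    """Return the mesh nodes visited from src to dst (both inclusive).

    X-Y routing: first correct the X coordinate, then the Y coordinate.
    Nodes are (x, y) tuples on a 2D mesh.
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    while x != dst_x:                  # travel along X until the column matches
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:                  # then travel along Y
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))   # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

    The path is fully determined by source and destination, which is why this kind of routing can be computed one hop ahead of time.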

  • Switching techniques

    Circuit switching: a control message is sent from source to destination and a path is reserved. Communication starts. The path is released when communication is complete.

    Store-and-forward policy (packet switching): each switch waits for the full packet to arrive before sending it to the next switch

    Cut-through routing or wormhole routing: the switch examines the header, decides where to send the message, and then starts forwarding it immediately

      In wormhole routing, when the head of a message is blocked, the message stays strung out over the network, potentially blocking other messages (needs to buffer only the piece of the packet that is sent between switches)
      Cut-through routing lets the tail continue when the head is blocked, storing the whole message in an intermediate switch (needs a buffer large enough to hold the largest packet)
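
    To make the latency difference concrete, here is a first-order, contention-free model in Python. It is a textbook approximation and not a result from the paper: it assumes one flit crosses a link per cycle, and the function names and the `router_delay` parameter are hypothetical.

```python
def store_and_forward_latency(hops, packet_flits, router_delay=1):
    """Cycles to deliver a packet when every switch buffers the whole packet."""
    # The full packet is received (packet_flits cycles) at each hop
    # before it is forwarded to the next switch.
    return hops * (router_delay + packet_flits)


def cut_through_latency(hops, packet_flits, router_delay=1):
    """Cycles to deliver a packet with cut-through or wormhole switching."""
    # Only the header pays the per-hop delay; the body streams behind it.
    return hops * router_delay + packet_flits


print(store_and_forward_latency(hops=4, packet_flits=8))  # 36
print(cut_through_latency(hops=4, packet_flits=8))        # 12
```

    Without blocking, cut-through and wormhole behave the same in this model; they differ only in how much buffering is needed when the head is blocked.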

  • Wormhole Routing: Good fit for NoC

    Wormhole routing is good for NoC

      Low latency
      Less buffering requirements

    However, it suffers from deadlock

  • Adding Virtual Channels

    With virtual channels, deadlock can be avoided

      Move message and reply on different channels => there will never be a loop on a single channel

  • Designing Virtual Channel Routers

    Design constraints in NoC

      Minimize latency
      Minimize buffering
      Minimal footprint

    Can exploit a far greater number of pins and wires

      May use fat data and flow control wires

    Objective: design routers with minimal latency

      This will also result in smaller buffers

    This paper presents the design of a low-latency router

      Cycle time of 12 FO4
      Single-cycle routing/switching

  • A Virtual Channel Router

  • Designing Virtual Channel Routers

    Arriving flits are placed into the buffers of the corresponding VC

      Every VC of every input port has buffers to hold arriving flits

    Routing logic assigns the set of outgoing VCs on which a flit can go

    VC allocation arbitrates between competing input VCs and allocates an output VC

    Switch allocation matches successful input ports (those with an allocated VC) to output ports

    Flits at input VCs that receive grants are passed to the output VCs

  • Routing Logic

    Three possibilities

      Return a single VC
      Return a set of VCs on a single port
      Return any VCs

    Look-ahead routing

      Routing is performed at the previous router
      Good for X-Y deterministic (non-adaptive) routing
      An SGI routing chip first implemented it

  • VC Allocation

    Complexity of VC allocation depends on the routing range (P = number of ports, V = number of VCs per port)

    Routing returns a single VC

      Needs a P×V-input arbiter for every outgoing VC

    Routing returns multiple VCs at a single port

      Additional V:1 arbiter at every input VC to reduce the potential outgoing VCs to 1

    Routing returns any set of VCs

      Needs two cascaded P×V-input arbiters

    We consider the case of multiple VCs at a single port

  • VC Allocation Logic

    At every outgoing VC, the following logic is needed

  • Switch Allocation

    Individual flits at input VCs arbitrate for access to the crossbar ports

    Arbitration can be performed in two stages

    First stage

      One VC among the V possible VCs at every input port is selected
      V:1 arbiter at every input port

    Second stage

      The winning VC at every input port is matched to the output port
      P:1 arbiter at every output port

    This scheme doesn't guarantee a maximal/maximum/good matching, but it is simple to implement
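
    The two-stage scheme can be sketched in Python as below. This is an illustrative model only: the round-robin helper `pick`, the `requests` encoding, and the pointer arguments are assumptions, and the pointers are not updated after a grant to keep the sketch short.

```python
def pick(candidates, pointer, size):
    """Round-robin pick: first candidate at or after `pointer`, wrapping."""
    for offset in range(size):
        idx = (pointer + offset) % size
        if idx in candidates:
            return idx
    return None


def switch_allocate(requests, num_outputs, vc_ptr, port_ptr):
    """requests[p][v] is the output port wanted by VC v of input p, or None.

    Returns {output_port: (input_port, vc)} for the granted flits.
    """
    num_inputs = len(requests)
    # Stage 1: a V:1 arbiter at every input port selects one requesting VC.
    winners = {}
    for p, vcs in enumerate(requests):
        asking = {v for v, out in enumerate(vcs) if out is not None}
        v = pick(asking, vc_ptr[p], len(vcs))
        if v is not None:
            winners[p] = v
    # Stage 2: a P:1 arbiter at every output port selects one input port.
    grants = {}
    for out in range(num_outputs):
        asking = {p for p, v in winners.items() if requests[p][v] == out}
        p = pick(asking, port_ptr[out], num_inputs)
        if p is not None:
            grants[out] = (p, winners[p])
    return grants


# Input 0 has flits for outputs 0 and 1; input 1 has a flit for output 0.
reqs = [[0, 1], [0, None]]
print(switch_allocate(reqs, 2, vc_ptr=[0, 0], port_ptr=[0, 0]))  # {0: (0, 0)}
```

    In the example only one flit is granted even though two could have been (input 0 to output 1, input 1 to output 0), which shows why the matching is not guaranteed to be maximal.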

  • Switch Allocation

  • Issues

    VC allocation and switch allocation are serialized

      Either a flit takes 2 clocks to get through
      Or else the clock speed will be low

    Solution: speculative switch allocation

  • Speculative Switch Allocation

    Dally proposed speculative switch allocation

      Perform switch and VC allocation in parallel
      Assume that a VC participating in switch allocation will get the output VC
      If not, then the cycle is wasted

    An even better idea is to perform speculative and non-speculative switch allocation in parallel

      Non-speculative allocation has higher priority

    Note that non-speculative allocation is done for input VCs which have already been allocated an output VC

    Mostly one cycle delay under light load

      Speculative will work

    Mostly one cycle delay under heavy load

      Non-speculative will work
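
    A rough Python sketch of how the results of the two parallel allocators might be merged, with non-speculative grants taking priority. The function `combine_grants`, its arguments, and the (port, vc) encoding are hypothetical; the real router does this in combinational logic.

```python
def combine_grants(nonspec, spec, vc_alloc_ok):
    """Merge grants from the non-speculative and speculative allocators.

    nonspec, spec: {output_port: (input_port, vc)} proposed by each allocator.
    vc_alloc_ok:   set of (input_port, vc) whose VC allocation succeeded
                   in this same cycle.
    Returns (final_grants, wasted_outputs).
    """
    busy_inputs = {port for port, _vc in nonspec.values()}
    final = dict(nonspec)               # non-speculative grants always stand
    wasted = []
    for out, (port, vc) in sorted(spec.items()):
        if out in nonspec or port in busy_inputs:
            continue                    # masked by a higher-priority grant
        if (port, vc) in vc_alloc_ok:
            final[out] = (port, vc)     # speculation paid off
        else:
            wasted.append(out)          # VC allocation failed: wasted slot
    return final, wasted


final, wasted = combine_grants(
    nonspec={0: (1, 0)},
    spec={0: (2, 1), 1: (3, 0), 2: (0, 1)},
    vc_alloc_ok={(3, 0)},
)
print(final)   # {0: (1, 0), 1: (3, 0)}
print(wasted)  # [2]
```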

  • Further Enhancement

    Is it possible to have zero-cycle VC/switch allocation?

    YES, most of the time. That's what this paper is about!

  • Idea 1: Free Virtual Channel Queue

    Keep a queue of free VCs at every outgoing port

      Also a bit mask with one set bit

    Thus the first stage of VC allocation, where an output VC is selected, will be removed
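
    A small Python sketch of the free-VC queue for one output port. The class name `FreeVCQueue`, its methods, and the point at which a VC is released are assumptions for illustration; the one-hot mask corresponds to the "bit mask with one set bit".

```python
from collections import deque


class FreeVCQueue:
    """Free virtual channels of one output port, kept in FIFO order."""

    def __init__(self, num_vcs):
        self.free = deque(range(num_vcs))

    def head_mask(self):
        """One-hot bit mask naming the VC at the head of the queue (0 if none)."""
        return (1 << self.free[0]) if self.free else 0

    def allocate(self):
        """Hand out the head VC, or None when every VC is busy."""
        return self.free.popleft() if self.free else None

    def release(self, vc):
        """Return a VC to the queue (assumed: once its packet has drained)."""
        self.free.append(vc)


q = FreeVCQueue(num_vcs=2)
print(bin(q.head_mask()))  # 0b1  (VC 0 is offered next)
print(q.allocate())        # 0
print(bin(q.head_mask()))  # 0b10 (VC 1 is offered next)
```

    Because the next free VC is already known, an input VC no longer has to choose among several candidate output VCs.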

  • Idea 2: Pre-computing arbitration decisions

    Suppose you somehow know the arbitration results before flits actually arrive and fight for the VC and switch

      That is, every arbitration decision
      VC allocation
      Switch allocation
      Etc.

    Then the router can be made to run in zero cycles

      Arriving flits are routed/switched in the same clock in which they arrive

    Also, clock speed may be pretty good

      Data path and control path are no longer in series

    That's what Idea 2 is.

  • Some preliminaries before going into detail

    Tree Arbiters

      Implement large arbiters using a tree of small arbiters

    Matrix Arbiters

      Fair and fast arbiter implementation

  • VC allocation using a Tree Arbiter
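
    A minimal Python sketch of a two-level tree arbiter. It uses a fixed-priority `small_arbiter` at every node purely to keep the example short (an assumption; a real design would use a fair arbiter at each node), and the function names are hypothetical.

```python
def small_arbiter(requests):
    """Fixed-priority arbiter: index of the first asserted request, or None."""
    for i, r in enumerate(requests):
        if r:
            return i
    return None


def tree_arbiter(requests, group_size):
    """Arbitrate len(requests) inputs with a two-level tree of small arbiters."""
    groups = [requests[i:i + group_size]
              for i in range(0, len(requests), group_size)]
    # Leaf level: one small arbiter per group picks a local winner.
    leaf_winners = [small_arbiter(g) for g in groups]
    # Root level: one small arbiter picks among groups that have a winner.
    root_winner = small_arbiter([w is not None for w in leaf_winners])
    if root_winner is None:
        return None
    return root_winner * group_size + leaf_winners[root_winner]


# 3 ports x 4 VCs = 12 requesters, built from three 4:1 arbiters and one 3:1.
reqs = [0, 0, 0, 0,  0, 1, 0, 1,  1, 0, 0, 0]
print(tree_arbiter(reqs, group_size=4))  # 5
```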

  • A Matrix Arbiter
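
    A behavioural Python sketch of a matrix arbiter, assuming the usual least-recently-served formulation: a priority bit is kept for every pair of inputs, and a granted input drops to the lowest priority. The class and method names are hypothetical.

```python
class MatrixArbiter:
    """Least-recently-served arbiter built on a pairwise priority matrix."""

    def __init__(self, n):
        self.n = n
        # prio[i][j] is True when input i currently beats input j.
        self.prio = [[i < j for j in range(n)] for i in range(n)]

    def grant(self, requests):
        """Return the index of the granted input, or None if nobody asks."""
        for i in range(self.n):
            beaten = any(requests[j] and self.prio[j][i]
                         for j in range(self.n) if j != i)
            if requests[i] and not beaten:
                self._demote(i)
                return i
        return None

    def _demote(self, i):
        """Give input i the lowest priority after it has been served."""
        for j in range(self.n):
            if j != i:
                self.prio[i][j] = False
                self.prio[j][i] = True


arb = MatrixArbiter(3)
print([arb.grant([True, True, True]) for _ in range(4)])  # [0, 1, 2, 0]
```

    With all inputs requesting, grants rotate through every input, which is the fairness property the slide refers to.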

  • Pre-computing arbitration decisions

    An alternative arbiter design

      Generate grant enables one cycle prior and latch them
      Grants are the product (AND) of the latched enables and the requests

    Grants are generated in the same clock as the request arrives, if at least one request remains pending

    However, when no request remains, it is difficult to generate grant enables

  • Generating grant enables

    Safe environment

      Only one request may arrive in a cycle
      Thus it is safe to assert all grant enables
      Thus a grant can still be generated in the same cycle

    Unsafe environment

      Multiple requests may arrive in the same cycle
      Can still assert all grant enables
      But need to abort when multiple requests arrive in the same cycle

    All first-stage V:1 arbiters operate in a safe environment

      However, the P:1 arbiters don't

  • Generating grant enables

    Even in unsafe environments, assert all grant enables

      May need to abort when multiple requests arrive
      Note that after an abort, a correct arbitration is ensured in the next cycle

    Why will it work?

      In a lightly loaded network, multiple requests for the same VC/port will rarely arrive (few aborts)
      In a heavily loaded network, flits will remain buffered and non-speculative arbitration (higher priority) will happen most of the time (few aborts again)
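
    The behaviour described above can be modelled cycle by cycle in Python. This is a simplified sketch and not the paper's circuit: the class `PrecomputedArbiter`, the round-robin choice of the next winner, and the assumption that an ungranted request is presented again in the next cycle are all illustrative.

```python
class PrecomputedArbiter:
    """Arbiter whose grant enables are computed one cycle ahead and latched."""

    def __init__(self, n):
        self.n = n
        self.enables = [True] * n   # nothing pending yet: enable everyone
        self.next_first = 0         # round-robin pointer (an assumption)

    def cycle(self, requests):
        """Return (grants, aborted) for this cycle's request vector."""
        # Fast path: grants are the latched enables ANDed with the requests.
        grants = [e and r for e, r in zip(self.enables, requests)]
        aborted = sum(grants) > 1   # unsafe case: several requests got through
        if aborted:
            grants = [False] * self.n
        # Off the critical path: prepare the enables for the next cycle.
        remaining = [r and not g for r, g in zip(requests, grants)]
        if any(remaining):
            winner = next((self.next_first + k) % self.n
                          for k in range(self.n)
                          if remaining[(self.next_first + k) % self.n])
            self.enables = [i == winner for i in range(self.n)]
            self.next_first = (winner + 1) % self.n
        else:
            self.enables = [True] * self.n   # no request remains: enable all
        return grants, aborted


arb = PrecomputedArbiter(3)
print(arb.cycle([False, True, False]))  # ([False, True, False], False)
print(arb.cycle([True, False, True]))   # ([False, False, False], True)
print(arb.cycle([True, False, True]))   # ([True, False, False], False)
```

    The second call shows an abort when two requests arrive together with all enables asserted; the third call shows the correct single grant in the following cycle.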

  • I will skip the design details now

    Since they are confusing and complex

    Will jump to critical path analysis

  • Analysis of critical path

    Generates VC/switch grants from pre-computed grant enables
    Crossbar traversal is aborted once invalid grants are detected
    In case of an abort, the correct control signals are ensured in the next cycle

  • Final design

    Control path critical delay is 12 FO4

      Until now, the best design had a delay of 20 FO4

    They have sampled a NoC-based ASIC last week using this idea

      Runs at several GHz

    Note that the fast cycle time is possible by

      Running VC allocation and switch allocation in parallel
      Must use speculation, else the delay will be higher (1 more cycle)

  • Simulation results

  • If (doubts) Then
      Ask;
    Else
      Thank you;
      Goto Discussion;
    End if;
