The Virtex FPGA and Introduction to design techniques

SMD098 Computation Structures, Lecture 6


Simple Programmable Logic Devices

Programmable Array Logic (PAL)

AND-OR arrays are common blocks in SPLD and CPLD architectures.

A PAL implements two-level logic functions like:

F = ABC + BD + DC


[Figure: Vantis PALV16V8 block diagram (a 32 x 64 programmable AND array feeding eight macrocells, MC0 to MC7) and macrocell detail]


Complex Programmable Logic Devices

CPLDs have much higher capacity than SPLDs, but the architecture is similar.

[Figure: Xilinx XC9500 architecture. Function blocks with 18 macrocells each connect to the I/O blocks and to each other through the FastCONNECT switch matrix. Detail views show a function block (programmable AND array and product term allocators) and a macrocell.]


Field Programmable Gate Arrays - Xilinx XC4000


Virtex Architecture

[Figure: Virtex architecture overview with CLBs, block RAMs (BRAMs), DLLs, IOBs, and the VersaRing]

• SRAM based, needs external configuration memory

• Two main configurable elements: configurable logic blocks (CLBs) and input/output blocks (IOBs)

• CLBs interconnect through a general routing matrix (GRM).

• The VersaRing™ I/O interface provides additional routing resources around the periphery of the device.

• The Virtex architecture also includes the following circuits that connect to the GRM.

– Dedicated block memories of 4096 bits each

– Clock DLLs for clock-distribution delay compensation and clock domain control

– 3-State buffers (BUFTs) associated with each CLB that drive dedicated segmentable horizontal routing resources


Virtex routing resources

A view from FPGA Editor. Blue boxes are slices (2 slices = 1 CLB). Grey lines are local interconnect. Red lines are long lines. Green lines are pin wires. There are three switch boxes per CLB.


Virtex clock distribution

There are four primary global clock nets, driven by four global buffers. If these clock nets are used, clock skew will not be a problem.

[Figure: global clock distribution network. Four global clock pads (GCLKPAD0 to GCLKPAD3) and buffers (GCLKBUF0 to GCLKBUF3) drive the global clock spine, the global clock rows, and the global clock columns.]


Virtex IOB

The Virtex IOBs are configurable to support several different high-speed I/O standards.

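[Figure: Virtex IOB block diagram with three registers (each with clock enable and set/reset), the input buffer (IBUF) with Vref and programmable delay, the 3-state output buffer (OBUFT), the weak keeper, and the pad]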


Virtex CLB

Xilinx definitions:

• Logic cell (LC) - 4-input LUT, carry logic and a storage element
• A slice consists of two LCs
• A CLB consists of two slices and is counted as 4.5 LCs. The extra 1/2 LC comes from the fact that some additional logic is available for implementing functions with more than 4 inputs

[Figure: 2-slice Virtex CLB (Slice 1 and Slice 0). Each slice has two LUTs (inputs F1 to F4 and G1 to G4), carry and control logic (CIN, COUT), and two storage elements (outputs X, XB, XQ and Y, YB, YQ).]


Virtex slice - detailed view

The “additional logic” consists of the F5 and F6 multiplexers.

[Figure: detailed view of a Virtex slice, showing the two LUTs, the carry logic (CY), the F5 and F6 multiplexers, and the two storage elements]


Virtex - look-up tables

The Virtex LUTs can be configured to implement:

• 4-input LUTs
• 16x1-bit synchronous RAM
• Two LUTs in one slice can be combined to implement
  – 16x2-bit or 32x1-bit synchronous RAM
  – 16x1-bit dual-port synchronous RAM
• 16-bit shift register
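The 16x1-bit synchronous RAM configuration can be described behaviorally as in the sketch below. This is only an illustration: the entity name `Ram16x1` and its port names are made up for this example, and whether a given synthesis tool maps the description onto a single LUT depends on the tool and its settings. The write is clocked and the read is combinational.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 16x1-bit RAM with synchronous write and combinational read.
-- Names are illustrative and not taken from the lecture.
entity Ram16x1 is
  port (
    Clk  : in  std_logic;
    We   : in  std_logic;              -- write enable
    Addr : in  unsigned(3 downto 0);   -- 4 address bits = the 4 LUT inputs
    Din  : in  std_logic;
    Dout : out std_logic);
end Ram16x1;

architecture RTL of Ram16x1 is
  -- The 16 storage bits correspond to the 16 configuration bits of one LUT
  signal Mem : std_logic_vector(15 downto 0) := (others => '0');
begin

  Write : process(Clk)
  begin
    if rising_edge(Clk) then
      if We = '1' then
        Mem(to_integer(Addr)) <= Din;
      end if;
    end if;
  end process;

  -- Reading is an ordinary LUT look-up
  Dout <= Mem(to_integer(Addr));

end RTL;
```

The point of the sketch is that the address plays the role of the four LUT inputs, so a small memory costs no more logic than a 4-input function.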


Virtex slice - FPGA Editor view


Example 1

library ieee;
use ieee.std_logic_1164.all;

entity Example1 is
  port (
    A, B, C, D     : in  std_logic;   -- Inputs
    Reset, Clk, En : in  std_logic;   -- Reset, Clock, Clock enable
    Y              : out std_logic);  -- Output
end Example1;

architecture RTL of Example1 is

begin  -- RTL

  process(Clk)
  begin
    if rising_edge(Clk) then
      if Reset = '1' then
        Y <= '0';
      elsif En = '1' then
        Y <= A xor B xor C xor D;
      end if;
    end if;
  end process;

end RTL;

How will this be implemented? How many slices?


Example 1
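One way to reason about the question is to rewrite Example 1 the way a logic cell sees it: a 16-entry truth table followed by a flip-flop with clock enable and synchronous reset. The sketch below is an assumption about how the design could map, not the output of any tool. The entity name `Example1Lut` and the constant `XOR4_INIT` are made up for illustration; the value x"6996" is simply the truth table of a 4-input XOR (bit i is the parity of i).

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative rewrite of Example 1 as "one LUT plus one flip-flop".
entity Example1Lut is
  port (
    A, B, C, D     : in  std_logic;
    Reset, Clk, En : in  std_logic;
    Y              : out std_logic);
end Example1Lut;

architecture LutView of Example1Lut is
  -- Truth table of A xor B xor C xor D, indexed by D & C & B & A
  constant XOR4_INIT : std_logic_vector(15 downto 0) := x"6996";
  signal Sel    : std_logic_vector(3 downto 0);
  signal LutOut : std_logic;
begin

  -- The four inputs address the look-up table
  Sel    <= D & C & B & A;
  LutOut <= XOR4_INIT(to_integer(unsigned(Sel)));

  -- Storage element with synchronous reset and clock enable
  process(Clk)
  begin
    if rising_edge(Clk) then
      if Reset = '1' then
        Y <= '0';
      elsif En = '1' then
        Y <= LutOut;
      end if;
    end if;
  end process;

end LutView;
```

Under this view the whole design needs one LUT and one storage element, that is one logic cell or half a slice, provided the tool uses the flip-flop's own reset and clock enable inputs.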


Example 2

8-bit adder with carry input and output

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity Example2 is
  port (
    A, B : in  unsigned(7 downto 0);
    Cin  : in  std_logic;
    R    : out unsigned(7 downto 0);
    Cout : out std_logic);
end Example2;

architecture RTL of Example2 is
begin  -- RTL

  process(A, B, Cin)
    variable r_tmp   : unsigned(8 downto 0);
    variable cin_tmp : integer range 0 to 1;
  begin
    if Cin = '0' then
      cin_tmp := 0;
    else
      cin_tmp := 1;
    end if;
    r_tmp := ('0' & A) + B + cin_tmp;
    R     <= r_tmp(7 downto 0);
    Cout  <= r_tmp(8);
  end process;

end RTL;

How can this be implemented in a Virtex? How many slices?


Example 2

Four slices - the carry chain is the highlighted (red) net.

Next slide shows this slice.


Example 2

One full adder per logic cell (two per slice)
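The four-slice result can be made plausible by writing the adder bit by bit in the form the carry logic supports: the LUT computes the propagate signal, a multiplexer selects the outgoing carry, and an XOR forms the sum. The sketch below is a behavioral illustration of that structure under these assumptions; the entity name `CarryChainAdder` and the signals `P` and `C` are made up, and it does not instantiate vendor primitives (in the Xilinx library the corresponding elements are, to my knowledge, MUXCY and XORCY).

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Illustrative 8-bit ripple adder written in "LUT + carry mux + XOR" form.
entity CarryChainAdder is
  port (
    A, B : in  std_logic_vector(7 downto 0);
    Cin  : in  std_logic;
    R    : out std_logic_vector(7 downto 0);
    Cout : out std_logic);
end CarryChainAdder;

architecture Structural of CarryChainAdder is
  signal P : std_logic_vector(7 downto 0);  -- propagate, one LUT per bit
  signal C : std_logic_vector(8 downto 0);  -- the carry chain
begin

  C(0) <= Cin;

  Bits : for i in 0 to 7 generate
    -- LUT: propagate is high when exactly one operand bit is set
    P(i) <= A(i) xor B(i);
    -- Carry multiplexer: pass the incoming carry when propagating,
    -- otherwise A(i) = B(i) and the carry out equals A(i)
    C(i + 1) <= C(i) when P(i) = '1' else A(i);
    -- Sum bit
    R(i) <= P(i) xor C(i);
  end generate Bits;

  Cout <= C(8);

end Structural;
```

Each generated bit uses one LUT plus the carry logic of one logic cell, so eight bits occupy eight logic cells, which matches the four slices stated above.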


Example 3 - shift register

library ieee;
use ieee.std_logic_1164.all;

entity Example3 is
  port (
    A          : in  std_logic;
    Clk, Reset : in  std_logic;
    Y1, Y2     : out std_logic);
end Example3;

architecture RTL of Example3 is
  signal S1, S2 : std_logic_vector(15 downto 0);
begin  -- RTL

  -- Shift register with asynchronous reset
  Shift1 : process(Clk, Reset)
  begin
    if Reset = '1' then
      S1 <= (others => '0');
    elsif rising_edge(Clk) then
      S1 <= S1(14 downto 0) & A;
    end if;
  end process;

  -- Shift register without reset
  Shift2 : process(Clk)
  begin
    if rising_edge(Clk) then
      S2 <= S2(14 downto 0) & A;
    end if;
  end process;

  Y1 <= S1(15);
  Y2 <= S2(15);

end RTL;

[Figure: synthesized schematic of Example 3. Shift2 (no reset) is implemented as one SRL16 followed by one flip-flop (FD) driving Y2: 1/2 slice. Shift1 (asynchronous reset) is implemented as a chain of 16 flip-flops with asynchronous clear (FDC), s1[0] to s1[15], driving Y1: 16 FFs → 8 slices.]


Virtex Block RAM

• Each Block RAM is a synchronous dual-ported 4096-bit RAM with independent control signals for each port

• Data widths may be configured independently

[Figure: RAMB4_S#_S# symbol. Port A: WEA, ENA, RSTA, CLKA, ADDRA[#:0], DIA[#:0], DOA[#:0]. Port B: WEB, ENB, RSTB, CLKB, ADDRB[#:0], DIB[#:0], DOB[#:0].]

Virtex Device   # of Blocks   Total Block SelectRAM Bits
XCV50            8             32,768
XCV100          10             40,960
XCV150          12             49,152
XCV200          14             57,344
XCV300          16             65,536
XCV400          20             81,920
XCV600          24             98,304
XCV800          28            114,688
XCV1000         32            131,072

You have actually already used the block RAM in one lab.
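A behavioral description with clocked write and clocked read on separate ports is sketched below. It is a minimal sketch only: the entity name `DualPortRam256x16`, the port names, and the 256 x 16 organization (which happens to give 4096 bits) are assumptions for this illustration, and whether a tool maps it to a block RAM rather than to LUTs and flip-flops depends on the tool.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative 256 x 16 (= 4096 bit) memory: port A writes, port B reads.
entity DualPortRam256x16 is
  port (
    ClkA, ClkB : in  std_logic;
    WeA        : in  std_logic;
    AddrA      : in  unsigned(7 downto 0);
    DiA        : in  std_logic_vector(15 downto 0);
    AddrB      : in  unsigned(7 downto 0);
    DoB        : out std_logic_vector(15 downto 0));
end DualPortRam256x16;

architecture RTL of DualPortRam256x16 is
  type ram_t is array (0 to 255) of std_logic_vector(15 downto 0);
  signal Mem : ram_t := (others => (others => '0'));
begin

  -- Port A: synchronous write
  PortA : process(ClkA)
  begin
    if rising_edge(ClkA) then
      if WeA = '1' then
        Mem(to_integer(AddrA)) <= DiA;
      end if;
    end if;
  end process;

  -- Port B: synchronous (registered) read with its own clock
  PortB : process(ClkB)
  begin
    if rising_edge(ClkB) then
      DoB <= Mem(to_integer(AddrB));
    end if;
  end process;

end RTL;
```

The registered read is the important detail: a combinational read would not match a synchronous memory and would push the design into LUT-based storage instead.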


Virtex DLLs

[Figure: two chips (Chip 1 and Chip 2) sharing a clock and exchanging data, each with a DLL in its clock path; and a DLL block diagram in which a comparator measures the error between the incoming clock and the distributed clock and adjusts a delay element]

A Delay-Locked Loop (DLL) can align internal and external clocks. This effectively eliminates on-chip clock distribution delay, which maximizes the achievable I/O speed.

Virtex devices have four DLLs. The DLLs can also be used to divide or double the incoming clock frequency internally. The output of a DLL can drive the global clock routing resources, and clock skew can be eliminated.


Virtex compared to Virtex-E

Virtex

Device    System Gates  CLB Array  Logic Cells  Maximum Available I/O  Block RAM Bits  Maximum SelectRAM+™ Bits
XCV50     57,906        16x24      1,728        180                    32,768          24,576
XCV100    108,904       20x30      2,700        180                    40,960          38,400
XCV150    164,674       24x36      3,888        260                    49,152          55,296
XCV200    236,666       28x42      5,292        284                    57,344          75,264
XCV300    322,970       32x48      6,912        316                    65,536          98,304
XCV400    468,252       40x60      10,800       404                    81,920          153,600
XCV600    661,111       48x72      15,552       512                    98,304          221,184
XCV800    888,439       56x84      21,168       512                    114,688         301,056
XCV1000   1,124,022     64x96      27,648       512                    131,072         393,216

Virtex-E

Device    System Gates  Logic Gates  CLB Array  Logic Cells  Differential I/O Pairs  User I/O  Block RAM Bits  Distributed RAM Bits
XCV50E    71,693        20,736       16 x 24    1,728        83                      176       65,536          24,576
XCV100E   128,236       32,400       20 x 30    2,700        83                      196       81,920          38,400
XCV200E   306,393       63,504       28 x 42    5,292        119                     284       114,688         75,264
XCV300E   411,955       82,944       32 x 48    6,912        137                     316       131,072         98,304
XCV400E   569,952       129,600      40 x 60    10,800       183                     404       163,840         153,600
XCV600E   985,882       186,624      48 x 72    15,552       247                     512       294,912         221,184
XCV1000E  1,569,178     331,776      64 x 96    27,648       281                     660       393,216         393,216
XCV1600E  2,188,742     419,904      72 x 108   34,992       344                     724       589,824         497,664
XCV2000E  2,541,952     518,400      80 x 120   43,200       344                     804       655,360         614,400
XCV2600E  3,263,755     685,584      92 x 138   57,132       344                     804       753,664         812,544
XCV3200E  4,074,387     876,096      104 x 156  73,008       344                     804       851,968         1,038,336


How to find the “best” implementation?

• You have to know the target architecture in order to make efficient design implementations

• Synthesis tools will not always provide the optimal solution. Structural coding can aid the synthesis tool, provided that the designer knows a better solution

• Use vendor-specific module generation tools, such as Xilinx CoreGenerator. CoreGenerator can generate optimized cores such as arithmetic functions, FFTs, FIR filters, etc.


CoreGenerator flow

[Figure: CORE Generator design flow. The CORE Generator (or IP install) produces an EDIF netlist (EDN) for the implementation tools, a symbol for the schematic editor, and VHDL and Verilog instantiation templates (VHO, VEO) for the HDL editor and synthesizer. The functional simulation flow uses behavioral simulation models (Xilinx CoreLib, vendor CoreLib, Unisim) with an HDL test bench; the timing simulation flow uses simprim gate-level models (VITAL, Verilog) with SDF.]


What is “best” - what are the requirements?

Some requirements can be:

• Short time to market
• Low resource usage - area
• High operating frequency
• Low power consumption (Mikael will talk about this next lecture)

Depending on which requirement is most important, different design solutions will be optimal.


Time to market

If time to market is the most important requirement, your boss will not be satisfied if you try to optimize other requirements that are already met. You will not get a raise if you manage to save 5 CLBs because you spent two days optimizing a counter.

This is probably how most of you work in the lab. You try to meet the lab requirements before the deadline but don't care much if your solution is the most efficient in terms of speed or area. Am I right?


Resource usage

If you are optimizing for area you should consider

• Sequential execution instead of parallel execution
• Bit serial implementation of data paths
• Scheduling of data paths, interleaving of resources in time
• Choosing the algorithm that minimizes area...
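As an illustration of the bit serial idea, the sketch below adds two operands one bit per clock cycle, least significant bit first, using a single full adder and one carry flip-flop regardless of the word length. It is a minimal sketch under assumptions: the entity name `BitSerialAdder` and the `Start` signal (asserted together with the first bit pair to clear the stored carry) are made up for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Illustrative bit serial adder: one full adder reused for every bit.
entity BitSerialAdder is
  port (
    Clk   : in  std_logic;
    Start : in  std_logic;   -- '1' while the least significant bits are applied
    A, B  : in  std_logic;   -- operand bits, LSB first
    S     : out std_logic);  -- sum bits, LSB first, one cycle later
end BitSerialAdder;

architecture RTL of BitSerialAdder is
  signal Carry : std_logic := '0';  -- carry saved between cycles
  signal Cin   : std_logic;
begin

  -- A new addition starts with carry in = 0
  Cin <= '0' when Start = '1' else Carry;

  process(Clk)
  begin
    if rising_edge(Clk) then
      S     <= A xor B xor Cin;
      Carry <= (A and B) or (A and Cin) or (B and Cin);
    end if;
  end process;

end RTL;
```

An N-bit addition takes N clock cycles here instead of one, which is the usual trade: area is saved by spending time.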


Speed

If you are optimizing for speed you should consider

• Parallel execution
• Pipelining
• Choosing the fastest algorithm...

In the next and last lecture I will give you a practical example of how one algorithm, FIR filtering, can be implemented in hardware. We will optimize it for area and for speed, and we will come up with two separate implementations.
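To make the pipelining point concrete, the sketch below sums four 8-bit values in two register stages. It is only an illustration and not the FIR example announced for the next lecture; the entity name `PipelinedSum4` and the signal names are made up. The two first-stage additions run in parallel, and the register between the stages means the clock period has to cover one adder instead of two in series.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative two-stage pipelined sum of four operands.
entity PipelinedSum4 is
  port (
    Clk        : in  std_logic;
    A, B, C, D : in  unsigned(7 downto 0);
    Sum        : out unsigned(9 downto 0));  -- valid two clock cycles after the inputs
end PipelinedSum4;

architecture RTL of PipelinedSum4 is
  signal AB, CD : unsigned(8 downto 0);  -- pipeline registers after stage 1
begin

  process(Clk)
  begin
    if rising_edge(Clk) then
      -- Stage 1: two additions in parallel, widened by one bit to keep the carry
      AB <= resize(A, 9) + resize(B, 9);
      CD <= resize(C, 9) + resize(D, 9);
      -- Stage 2: add the registered partial sums from the previous cycle
      Sum <= resize(AB, 10) + resize(CD, 10);
    end if;
  end process;

end RTL;
```

The cost is two extra sets of registers and two cycles of latency, while a new set of operands can still be accepted every clock cycle.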


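Final question

Which of these two implementations is optimal?

[Figure: two implementations of the same function F of the inputs A, B, C and D, each with a decoder controlled by S and built from blocks annotated with delay (1T or 2T) and area (1A or 3A). First implementation: critical path = 3T, area = 8A. Second implementation: critical path = 4T, area = 6A.]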
