SMD098 Computation Structures Lecture 6 1
The Virtex FPGA and Introduction to design techniques
Simple Programmable Logic Devices

Programmable Array Logic (PAL)

AND-OR arrays are common blocks in SPLD and CPLD architectures.

They implement two-level logic functions like:
$F = ABC + BD + CD$
Simple Programmable Logic Devices

[Figure: Vantis PALV16V8. A 32 x 64 programmable AND array feeds eight macrocells (MC0 to MC7); a second panel shows the macrocell detail.]
Complex Programmable Logic Devices

CPLDs have much higher capacity than SPLDs, but the architecture is similar.

[Figure: Xilinx XC9500 architecture. Panels: device overview (JTAG and in-system programming controllers, I/O blocks, FastCONNECT switch matrix, function blocks 1 to N with 18 macrocells each), function block, and macrocell.]
Field Programmable Gate Arrays - Xilinx XC4000
Virtex Architecture

[Figure: Virtex architecture overview showing CLBs, BRAMs, IOBs, the VersaRing, and DLLs.]
• SRAM based, needs external configuration memory
• Two main configurable elements: configurable logic blocks (CLBs) and input/output blocks (IOBs)
• CLBs interconnect through a general routing matrix (GRM).
• The VersaRing™ I/O interface provides additional routing resources around the periphery of the device.
• The Virtex architecture also includes the following circuits that connect to the GRM.
  – Dedicated block memories of 4096 bits each
  – Clock DLLs for clock-distribution delay compensation and clock domain control
  – 3-State buffers (BUFTs) associated with each CLB that drive dedicated segmentable horizontal routing resources
Virtex routing resources

A view from FPGA Editor. Blue boxes are slices (2 slices = 1 CLB). Grey lines are local interconnect. Red lines are long lines. Green lines are pin wires. Three switch boxes per CLB.
Virtex clock distribution

There are four primary global clock nets that are driven by four global buffers. If these clock nets are used, clock skew will not be a problem.

[Figure: global clock distribution network. GCLKPAD0 to GCLKPAD3 drive GCLKBUF0 to GCLKBUF3, which feed the global clock spine, global clock rows, and global clock columns.]
Virtex IOB

The Virtex IOBs are configurable to support several different high-speed I/O standards.

[Figure: Virtex IOB block diagram: pad with weak keeper, input buffer (IBUF) with Vref and programmable delay, 3-state output buffer (OBUFT), and three registers with clock enable and set/reset.]
Virtex CLB

Xilinx definitions:
• Logic cell (LC) - 4-input LUT, carry logic and a storage element
• A slice consists of two LCs
• A CLB consists of two slices and is counted as 4.5 LCs. The extra 1/2 LC comes from the fact that some additional logic is available for implementing functions with more than 4 inputs
[Figure: 2-slice Virtex CLB. Each slice (Slice 0, Slice 1) has two LUTs with inputs F1-F4 and G1-G4, carry and control logic (CIN, COUT), and two storage elements, with outputs X, XB, XQ and Y, YB, YQ.]
Virtex slice - detailed view

The “additional logic” consists of the F5 and F6 multiplexers.
[Figure: detailed view of a Virtex slice, showing the two LUTs, the carry logic (CY), the F5 and F6 multiplexers, and the two storage elements.]
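To illustrate what the F5 multiplexer is for, here is a minimal sketch of a 5-input function written as two 4-input functions selected by the fifth input. The entity name Func5 and the two functions chosen are made up for illustration; whether a synthesis tool actually maps this onto two LUTs plus the F5 multiplexer is tool dependent.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: a 5-input function split so that it fits
-- two 4-input LUTs and one 2:1 multiplexer.
entity Func5 is
  port (
    I   : in  std_logic_vector(3 downto 0);  -- inputs shared by both LUTs
    Sel : in  std_logic;                     -- fifth input, used as mux select
    Y   : out std_logic);
end Func5;

architecture RTL of Func5 is
  signal f_lut, g_lut : std_logic;
begin
  -- Each of these depends on four inputs only, so each fits one LUT.
  f_lut <= I(0) and I(1) and I(2) and I(3);  -- value of Y when Sel = '0'
  g_lut <= I(0) xor I(1) xor I(2) xor I(3);  -- value of Y when Sel = '1'

  -- 2:1 multiplexer selecting between the two LUT outputs
  Y <= g_lut when Sel = '1' else f_lut;
end RTL;
```

Any function of five inputs can be split this way: fix the fifth input to '0' and to '1' to get the two 4-input functions.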
Virtex - look-up tables

The Virtex LUTs can be configured to implement:
• 4-input LUTs
• 16x1-bit synchronous RAM
• Two LUTs in one slice can be combined to implement
  – 16x2-bit or 32x1-bit synchronous RAM
  – 16x1-bit dual-port synchronous RAM
• 16-bit shift register
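As a rough sketch of the 16x1-bit RAM configuration, the description below has a synchronous write and a combinational read, which is the style synthesis tools commonly recognise as LUT-based RAM. The entity name Ram16x1 and its port names are made up for illustration, and whether a given tool infers a LUT RAM from it is an assumption.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: 16x1-bit RAM with synchronous write.
entity Ram16x1 is
  port (
    Clk  : in  std_logic;
    We   : in  std_logic;             -- write enable
    Addr : in  unsigned(3 downto 0);  -- the four LUT inputs act as address
    Din  : in  std_logic;
    Dout : out std_logic);
end Ram16x1;

architecture RTL of Ram16x1 is
  type ram_t is array (0 to 15) of std_logic;
  signal mem : ram_t := (others => '0');
begin
  -- Writes happen on the rising clock edge only.
  process(Clk)
  begin
    if rising_edge(Clk) then
      if We = '1' then
        mem(to_integer(Addr)) <= Din;
      end if;
    end if;
  end process;

  -- Reading is combinational, like an ordinary LUT lookup.
  Dout <= mem(to_integer(Addr));
end RTL;
```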
Virtex slice - FPGA Editor view
Example 1

library ieee;
use ieee.std_logic_1164.all;

entity Example1 is
  port (
    A, B, C, D     : in  std_logic;   -- Inputs
    Reset, Clk, En : in  std_logic;   -- Reset, Clock, Clock enable
    Y              : out std_logic);  -- Output
end Example1;

architecture RTL of Example1 is
begin  -- RTL

  process(Clk)
  begin
    if rising_edge(Clk) then
      if Reset = '1' then
        Y <= '0';
      elsif En = '1' then
        Y <= A xor B xor C xor D;
      end if;
    end if;
  end process;

end RTL;

How will this be implemented? How many slices?
Example 1
Example 2

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity Example2 is
  port (
    A, B : in  unsigned(7 downto 0);
    Cin  : in  std_logic;
    R    : out unsigned(7 downto 0);
    Cout : out std_logic);
end Example2;

architecture RTL of Example2 is
begin  -- RTL

  process(A, B, Cin)
    variable r_tmp   : unsigned(8 downto 0);
    variable cin_tmp : integer range 0 to 1;
  begin
    if Cin = '0' then
      cin_tmp := 0;
    else
      cin_tmp := 1;
    end if;
    r_tmp := ('0' & A) + B + cin_tmp;
    R     <= r_tmp(7 downto 0);
    Cout  <= r_tmp(8);
  end process;

end RTL;

8-bit adder with carry input and output

How can this be implemented in a Virtex? How many slices?
Example 2

Four slices. The carry chain is the highlighted (red) net.

The next slide shows this slice.
Example 2

One full adder per logic cell, i.e. two per slice (8 bits in four slices)
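The way one adder bit fits into a logic cell can be sketched as follows: the LUT computes the propagate signal, a carry multiplexer passes or generates the carry, and a dedicated XOR forms the sum. The entity name FullAdderBit is made up for illustration; MUXCY and XORCY in the comments are the Xilinx primitive names for the carry multiplexer and the carry XOR, and the code itself is plain behavioural VHDL rather than primitive instantiation.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: one adder bit split into LUT and carry logic.
entity FullAdderBit is
  port (
    A, B, Cin : in  std_logic;
    S, Cout   : out std_logic);
end FullAdderBit;

architecture RTL of FullAdderBit is
  signal p : std_logic;  -- propagate
begin
  p    <= A xor B;                  -- computed in the 4-input LUT
  S    <= p xor Cin;                -- dedicated XOR (XORCY)
  -- Carry multiplexer (MUXCY): if A and B differ the incoming carry is
  -- propagated, otherwise A = B and the carry out equals A.
  Cout <= Cin when p = '1' else A;
end RTL;
```

Because the carry only passes through the dedicated multiplexers, it never has to leave the carry chain and use general routing, which is what makes the chain fast.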
Example 3 - shift register

library ieee;
use ieee.std_logic_1164.all;

entity Example3 is
  port (
    A          : in  std_logic;
    Clk, Reset : in  std_logic;
    Y1, Y2     : out std_logic);
end Example3;

architecture RTL of Example3 is
  signal S1, S2 : std_logic_vector(15 downto 0);
begin  -- RTL

  -- Shift register with asynchronous reset
  Shift1 : process(Clk, Reset)
  begin
    if Reset = '1' then
      S1 <= (others => '0');
    elsif rising_edge(Clk) then
      S1 <= S1(14 downto 0) & A;
    end if;
  end process;

  -- Shift register without reset
  Shift2 : process(Clk)
  begin
    if rising_edge(Clk) then
      S2 <= S2(14 downto 0) & A;
    end if;
  end process;

  Y1 <= S1(15);
  Y2 <= S2(15);

end RTL;
[Figure: synthesized schematic. The Y2 path (Shift2, no reset) is an SRL16 followed by an FD flip-flop: 1/2 slice. The Y1 path (Shift1, with reset) is a chain of 16 FDC flip-flops s1[0] to s1[15]: 16 FFs → 8 slices.]
Virtex Block RAM

• Each Block RAM is a synchronous dual-ported 4096-bit RAM with independent control signals for each port
• Data widths may be configured independently
[Figure: RAMB4_S#_S# symbol. Port A: WEA, ENA, RSTA, CLKA, ADDRA[#:0], DIA[#:0], DOA[#:0]. Port B: WEB, ENB, RSTB, CLKB, ADDRB[#:0], DIB[#:0], DOB[#:0].]
Virtex Device   # of Blocks   Total Block SelectRAM Bits
XCV50            8             32,768
XCV100          10             40,960
XCV150          12             49,152
XCV200          14             57,344
XCV300          16             65,536
XCV400          20             81,920
XCV600          24             98,304
XCV800          28            114,688
XCV1000         32            131,072
You have actually already used the block RAM in one lab.
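A single 4096-bit block can, for example, be organised as 512 words of 8 bits. The sketch below describes such a memory behaviourally with one port, a synchronous write and a registered read. The entity name Ram512x8 and its ports are made up for illustration. Whether a synthesis tool maps this onto a block RAM, and what the output shows during a write to the address being read, depends on the tool and the device, so treat both as assumptions to check.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: 512 x 8 = 4096 bits, single port, fully synchronous.
entity Ram512x8 is
  port (
    Clk  : in  std_logic;
    We   : in  std_logic;
    Addr : in  unsigned(8 downto 0);
    Din  : in  std_logic_vector(7 downto 0);
    Dout : out std_logic_vector(7 downto 0));
end Ram512x8;

architecture RTL of Ram512x8 is
  type ram_t is array (0 to 511) of std_logic_vector(7 downto 0);
  signal mem : ram_t := (others => (others => '0'));
begin
  process(Clk)
  begin
    if rising_edge(Clk) then
      if We = '1' then
        mem(to_integer(Addr)) <= Din;  -- synchronous write
      end if;
      Dout <= mem(to_integer(Addr));   -- registered read (old data on a write)
    end if;
  end process;
end RTL;
```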
Virtex DLLs

[Figure: Chip 1 and Chip 2, each with a DLL and a flip-flop, driven by a common clock.]

A Delay-Locked Loop (DLL) can align internal and external clocks. It effectively eliminates on-chip clock distribution delay. This maximizes the achievable I/O speed.

[Figure: DLL principle: a delay element, the clock distribution network, and a comparator producing an error signal.]

Virtex devices have four DLLs. The DLLs can also be used to divide or double the incoming clock frequency internally. The output of a DLL can drive the global clock routing resources, and clock skew can be eliminated.
Virtex compared to Virtex-E

Virtex

Device    System Gates  CLB Array  Logic Cells  Maximum Available I/O  Block RAM Bits  Maximum SelectRAM+™ Bits
XCV50           57,906  16x24            1,728                    180          32,768                    24,576
XCV100         108,904  20x30            2,700                    180          40,960                    38,400
XCV150         164,674  24x36            3,888                    260          49,152                    55,296
XCV200         236,666  28x42            5,292                    284          57,344                    75,264
XCV300         322,970  32x48            6,912                    316          65,536                    98,304
XCV400         468,252  40x60           10,800                    404          81,920                   153,600
XCV600         661,111  48x72           15,552                    512          98,304                   221,184
XCV800         888,439  56x84           21,168                    512         114,688                   301,056
XCV1000      1,124,022  64x96           27,648                    512         131,072                   393,216

Virtex-E

Device    System Gates  Logic Gates  CLB Array  Logic Cells  Differential I/O Pairs  User I/O  Block RAM Bits  Distributed RAM Bits
XCV50E          71,693       20,736  16 x 24          1,728                      83       176          65,536                24,576
XCV100E        128,236       32,400  20 x 30          2,700                      83       196          81,920                38,400
XCV200E        306,393       63,504  28 x 42          5,292                     119       284         114,688                75,264
XCV300E        411,955       82,944  32 x 48          6,912                     137       316         131,072                98,304
XCV400E        569,952      129,600  40 x 60         10,800                     183       404         163,840               153,600
XCV600E        985,882      186,624  48 x 72         15,552                     247       512         294,912               221,184
XCV1000E     1,569,178      331,776  64 x 96         27,648                     281       660         393,216               393,216
XCV1600E     2,188,742      419,904  72 x 108        34,992                     344       724         589,824               497,664
XCV2000E     2,541,952      518,400  80 x 120        43,200                     344       804         655,360               614,400
XCV2600E     3,263,755      685,584  92 x 138        57,132                     344       804         753,664               812,544
XCV3200E     4,074,387      876,096  104 x 156       73,008                     344       804         851,968             1,038,336
How to find the “best” implementation?

• You have to know the target architecture in order to make efficient design implementations
• Synthesis tools will not always provide the optimal solution. Structural coding can aid the synthesis tool, provided that the designer knows a better solution
• Use vendor-specific module generation tools, such as Xilinx CoreGenerator. CoreGenerator can generate optimized cores such as arithmetic functions, FFTs, FIR filters, etc.
CoreGenerator flow

[Figure: CORE Generator design flow (X8974). HDL and schematic entry, CORE Generator outputs (VHO/VEO instantiation templates, EDIF/EDN netlists, schematic symbols), synthesizer, implementation tools, and the functional and timing simulation flows using the Xilinx CoreLib, Unisim, and simprim libraries.]
What is “best” - what are the requirements?

Some requirements can be:
• Short time to market
• Low resource usage - area
• High operating frequency
• Low power consumption (Mikael will talk about this next lecture)

Which design solution is optimal depends on which requirement is most important.
Time to market

If time to market is the most important requirement, your boss will not be satisfied if you try to optimize other requirements that are already met. You will not get a raise if you manage to save 5 CLBs because you spent two days optimizing a counter.

This is probably how most of you work in the lab. You try to meet the lab requirements before the deadline but don’t care much if your solution is the most efficient in terms of speed or area. Am I right?
Resource usage

If you are optimizing for area you should consider
• Sequential execution instead of parallel execution
• Bit serial implementation of data paths
• Scheduling of data paths, interleaving of resources in time
• Choosing the algorithm that minimizes area...
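As a small illustration of a bit serial data path, the sketch below adds two operands one bit per clock cycle, LSB first, using a single full adder and one flip-flop for the carry. An N-bit addition then takes N clock cycles instead of N full adders. The entity name SerialAdder and the Start signal convention are made up for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: bit serial adder, one full adder reused every cycle.
entity SerialAdder is
  port (
    Clk   : in  std_logic;
    Start : in  std_logic;   -- '1' in the cycle where the LSBs are applied
    A, B  : in  std_logic;   -- operand bits, LSB first
    S     : out std_logic);  -- sum bit, LSB first
end SerialAdder;

architecture RTL of SerialAdder is
  signal carry : std_logic := '0';  -- carry saved from the previous bit
  signal c_in  : std_logic;
begin
  -- A new addition starts with carry-in 0, otherwise use the saved carry.
  c_in <= '0' when Start = '1' else carry;

  -- Sum bit for the current bit position
  S <= A xor B xor c_in;

  -- Save the carry out for the next bit position.
  process(Clk)
  begin
    if rising_edge(Clk) then
      carry <= (A and B) or (A and c_in) or (B and c_in);
    end if;
  end process;
end RTL;
```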
Speed

If you are optimizing for speed you should consider
• Parallel execution
• Pipelining
• Choosing the fastest algorithm...
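A minimal sketch of parallel execution combined with pipelining: four operands are summed with two adders working in parallel in the first stage and one adder in the second stage, with a register after each stage. The longest combinational path is then one adder instead of two, at the cost of two clock cycles of latency. The entity name PipeSum4 and the 8-bit operand width are made up for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: two-stage pipelined sum of four 8-bit operands.
entity PipeSum4 is
  port (
    Clk        : in  std_logic;
    A, B, C, D : in  unsigned(7 downto 0);
    Sum        : out unsigned(9 downto 0));
end PipeSum4;

architecture RTL of PipeSum4 is
  signal ab, cd : unsigned(8 downto 0);  -- stage 1 pipeline registers
begin
  process(Clk)
  begin
    if rising_edge(Clk) then
      -- Stage 1: two additions in parallel, results registered
      ab <= ('0' & A) + B;
      cd <= ('0' & C) + D;
      -- Stage 2: adds the values registered in the previous cycle
      Sum <= ('0' & ab) + cd;
    end if;
  end process;
end RTL;
```

New operands can be applied every clock cycle; each result appears on Sum two rising edges after its operands were sampled.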
In the next and last lecture I will give you a practical example of how one algorithm, FIR filtering, can be implemented in hardware. We will optimize it for area and for speed, and we will come up with two separate implementations.