Lecture 10: Logic Emulation October 8, 2013 ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

ECE 636

Reconfigurable Computing

Lecture 13

Logic Emulation

Overview

• Background

• Rent’s Rule

• Overcoming pin limitations through scheduling

• “Virtual wires” implementation

• Veloce 2

• DEEP

The Challenge

• Making a large multi-FPGA system is easy. Making it programmable is hard.

• New approach is a software technology that facilitates hardware implementation.

• Effectively make a large number of discrete devices look like one large one.

• Leads to low-cost, scalable multi-FPGA substrate.

Logic Emulation

• Emulation takes a sizable amount of resources

• Compilation time can be large due to FPGA compiles

• Emulation is one application; it also has direct ties to other FPGA computing applications.

Are Meshes Realistic?

• The number of wires leaving a partition grows with Rent’s Rule

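$P = K G^{B}$

(P = pin count, G = gate count, K = Rent coefficient, B = Rent exponent)

• Perimeter grows as $G^{0.5}$, but unfortunately the pin demand of most circuits grows as $G^{B}$ where $B > 0.5$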

• Effectively, devices are highly pin limited

• What does this mean for meshes?
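
The gap between pin demand and pin supply can be made concrete with a small sketch. The constants `K`, `B`, and `EDGE` below are hypothetical values picked only for illustration; they are not taken from the lecture. The point is that the demand-to-supply ratio grows as $G^{B-0.5}$ whenever $B > 0.5$.

```python
def rent_pins(gates: int, k: float, b: float) -> float:
    """Pin demand of a partition with `gates` gates under Rent's Rule, P = K * G**B."""
    return k * gates ** b


def perimeter_pins(gates: int, pins_per_unit_edge: float) -> float:
    """Pin supply of a square device whose perimeter scales as G**0.5."""
    return 4 * pins_per_unit_edge * gates ** 0.5


# Hypothetical constants, chosen only for illustration.
K, B, EDGE = 2.5, 0.6, 1.0

for g in (1_000, 10_000, 100_000, 1_000_000):
    demand = rent_pins(g, K, B)
    supply = perimeter_pins(g, EDGE)
    print(f"G={g:>9,}  demand={demand:>8.0f}  supply={supply:>6.0f}  ratio={demand / supply:.2f}")
```

With these assumed constants the ratio rises steadily with G, which is the sense in which larger partitions become more pin limited.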

Possible Device Scenarios

• Rent’s Rule indicates that the pin-limited situation is getting worse.

• Frequently some logic must be left unused leading to limited utilization

• Perhaps this logic can be “reclaimed”

Partition vs FPGA Pin Count

• FPGAs don’t have enough pins

• Problem may or may not get worse depending on “structured” design.

Virtual Wires

• Overcome pin limitations by multiplexing pins and signals

• Schedule when communication will take place.
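
A minimal software model of pin multiplexing, assuming a simple round-robin slot assignment (the lecture does not specify the assignment policy, and the function names here are hypothetical). Each logical signal gets a (microcycle, pin) slot; the sender drives the pins according to the schedule and the receiver latches each signal from its slot.

```python
from math import ceil


def build_pin_schedule(signals, num_pins):
    """Assign each logical signal a (microcycle, pin) slot, round-robin."""
    return {name: divmod(i, num_pins) for i, name in enumerate(signals)}


def transmit(values, schedule, num_pins):
    """Sender side: produce one list of pin values per microcycle."""
    cycles = ceil(len(schedule) / num_pins)
    frames = [[0] * num_pins for _ in range(cycles)]
    for name, (cycle, pin) in schedule.items():
        frames[cycle][pin] = values[name]
    return frames


def receive(frames, schedule):
    """Receiver side: latch each signal from its scheduled slot."""
    return {name: frames[cycle][pin] for name, (cycle, pin) in schedule.items()}


# Five logical signals share two physical pins over three microcycles.
values = {"a": 1, "b": 0, "c": 1, "d": 1, "e": 0}
schedule = build_pin_schedule(values, num_pins=2)
frames = transmit(values, schedule, num_pins=2)
print(len(frames), receive(frames, schedule) == values)  # 3 True
```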

Virtual Wires Software Flow

• Global router enhanced to include scheduling and embedding.

• Multiplexing logic synthesized from FPGA logic.

A Simple Example

[Figure: four-FPGA example showing FPGA 1, FPGA 2, FPGA 3, and FPGA 4]

Clocking Strategy

• Evaluation and communication split into phases

• Longest dependency path determines number of phases

- And therefore overall emulation performance
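
One way to picture the phase count is as the length of the longest dependency chain among inter-FPGA signals. The sketch below assumes the dependencies are given as an acyclic mapping; the data structure and the name `count_phases` are illustrative, not part of the lecture.

```python
from functools import lru_cache


def count_phases(depends_on):
    """Length of the longest dependency chain among inter-FPGA signals.

    `depends_on` maps each signal to the signals that must arrive before
    it can be computed. The mapping is assumed to be acyclic.
    """
    @lru_cache(maxsize=None)
    def depth(sig):
        return 1 + max((depth(p) for p in depends_on.get(sig, ())), default=0)

    return max((depth(s) for s in depends_on), default=0)


# Hypothetical example: c needs a and b, and b needs a, so three phases.
deps = {"a": (), "b": ("a",), "c": ("a", "b"), "d": ()}
print(count_phases(deps))  # 3
```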

Example Scheduling

• Initial phase requires one uClk for computation, one for communication.

• Second phase requires 2 communication uClks due to through hop.

• Note this example assumed needed bandwidth was available.

Routing Algorithm

• For each phase, only some internal signals are ready for routing.

• Routing resources between FPGAs may be considered channels.

• Solution: route signals using a maze router in each phase.

• If sufficient bandwidth is not available, delay signals until later phases.
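
The routing steps above can be sketched as follows. This is a simplified model under stated assumptions, not the lecture's actual router: every directed link between FPGAs is a channel with the same `width`, capacity resets at the start of each phase, and the maze search is a plain breadth-first search. The 2x2 mesh and all names are hypothetical.

```python
from collections import deque


def maze_route(src, dst, free):
    """Breadth-first maze search over links that still have free capacity.

    Returns the links on a shortest path, or None if no path exists.
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while parent[node] is not None:
                path.append((parent[node], node))
                node = parent[node]
            return path[::-1]
        for (a, b), slots in free.items():
            if a == node and slots > 0 and b not in parent:
                parent[b] = a
                queue.append(b)
    return None


def route_by_phase(ready, links, width):
    """Route the signals that become ready in each phase; defer the rest."""
    routed, pending, phase = {}, [], 0
    while phase < len(ready) or pending:
        if phase < len(ready):
            pending = pending + list(ready[phase])
        free = {link: width for link in links}  # channel capacity resets each phase
        routed[phase], deferred = [], []
        for src, dst in pending:
            path = maze_route(src, dst, free)
            if path is None:
                deferred.append((src, dst))  # no bandwidth left: try a later phase
                continue
            for link in path:
                free[link] -= 1
            routed[phase].append(((src, dst), path))
        if deferred and not routed[phase]:
            raise ValueError(f"unroutable signals: {deferred}")
        pending = deferred
        phase += 1
    return routed


# Hypothetical 2x2 mesh of FPGAs 0..3 with one wire per directed link.
MESH = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 3), (3, 1), (2, 3), (3, 2)]
result = route_by_phase([[(0, 3), (0, 3), (0, 3)]], MESH, width=1)
print({p: len(r) for p, r in result.items()})  # {0: 2, 1: 1}
```

Two of the three signals fit in the first phase (one around each side of the mesh); the third is delayed to the next phase.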

Worst Case Microcycle Count

• Most designs dominated by latency bound.

• If original design has been pipelined this is less of an issue

$V \geq \max\left(L \cdot D,\ \frac{P_C}{P_f}\right)$

L = critical path length

D = network diameter

$P_C$ = max circuit partition pin count

$P_f$ = FPGA pin count
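
The bound can be evaluated directly. In the sketch below the pin-count term is rounded up, which is an added assumption (microcycles are whole clock ticks); the input numbers are hypothetical.

```python
from math import ceil


def min_microcycles(critical_path_len, network_diameter, partition_pins, fpga_pins):
    """Lower bound on the microcycle count: max(L * D, PC / Pf)."""
    latency_bound = critical_path_len * network_diameter
    pin_bound = ceil(partition_pins / fpga_pins)  # rounding up is an assumption
    return max(latency_bound, pin_bound)


# Latency-dominated case: L=6, D=2, PC=800, Pf=100.
print(min_microcycles(6, 2, 800, 100))   # 12
# Pin-dominated case: L=2, D=2, PC=1000, Pf=100.
print(min_microcycles(2, 2, 1000, 100))  # 10
```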

Improved Scheduling

• Overlap computation and communication.

• Effectively create a “data flow” of information

• Schedule communication to happen as soon as possible

- No need for phases.

[Figure; label: uCLK]
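
A rough sketch of as-soon-as-possible scheduling without phases. It assumes a single shared link with `pins` wires, transfers listed in dependency order, and that a dependent value can be sent one microcycle after its inputs arrive; all of these are simplifications made for illustration.

```python
from collections import defaultdict


def asap_schedule(transfers, pins):
    """Give each transfer the earliest microcycle with a free wire.

    `transfers` is a list of (name, deps) pairs in dependency order.
    Returns {name: microcycle}.
    """
    if pins < 1:
        raise ValueError("need at least one pin")
    used = defaultdict(int)  # wires already taken in each microcycle
    slot = {}
    for name, deps in transfers:
        t = max((slot[d] + 1 for d in deps), default=0)
        while used[t] >= pins:
            t += 1
        used[t] += 1
        slot[name] = t
    return slot


# Hypothetical example: d depends on a; two wires are available.
plan = asap_schedule([("a", []), ("b", []), ("c", []), ("d", ["a"])], pins=2)
print(plan)  # {'a': 0, 'b': 0, 'c': 1, 'd': 1}
```

Here `d` is sent in microcycle 1 alongside `c`, rather than waiting for a separate phase to begin.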

Physical Implementation

• Small finite state machine encoded and placed in each FPGA

• Current implementation is one-hot encoding.
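
As an illustration of one-hot encoding, the sequencer below keeps exactly one state bit set and rotates it each microcycle. This is a software model under the assumption that the state machine simply steps through the microcycle slots in order; the actual FSM in the lecture's system may differ.

```python
from itertools import islice


def one_hot_sequencer(num_states):
    """Yield successive one-hot state codes, wrapping after the last state."""
    mask = (1 << num_states) - 1
    state = 1
    while True:
        yield state
        # Rotate the single set bit left by one position.
        state = ((state << 1) | (state >> (num_states - 1))) & mask


def active_slot(state):
    """Index of the single set bit, i.e. which microcycle slot is active."""
    return state.bit_length() - 1


codes = list(islice(one_hot_sequencer(4), 5))
print([format(c, "04b") for c in codes])  # ['0001', '0010', '0100', '1000', '0001']
```

One-hot encoding spends one flip-flop per state but needs no decode logic, since each state bit can directly enable its slot.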

System Implementation

• Low cost hardware

• So simple a graduate student can build it

Benchmark Designs

• Sparcle – modified Sparc processor

- 17K gates

- 4,352 bits of memory

- Emulated in circuit.

• CMMU – cache controller for scalable multiprocessor system

- 85K gates

- Designed as gate array and optimized with SIS

• Palindrome

- 14K gates

- systolic

Emulation Results

• At least 31 FPGAs needed for HW full connectivity (>100 for torus)

• Some degradation in overall system performance.

Device Utilization

• Approximately 45% of CLBs used for design logic.

• ~10% virtual wires overhead

Utilizations

• As devices scale projected utilization increases

• Hardwired approach doesn’t scale

• Equation ->

Veloce 2

• Emulates up to 2 billion logic gates using 128 boards of FPGAs

• VirtuaLAB environment

- Allows emulator sharing across a company (128 users)

- Allows for emulation of popular interfaces (Ethernet, USB, PCIe)

• Significant company investment

• 40 MG per hour compile

• 11 kW power consumption

Veloce 2 Quattro

Source: Mentor Graphics Veloce2 brochure

Veloce 2 Execution Environment

• Emulation environment contains three parts

- Generation of test vectors and other stimulus

- Design under test (DUT) and other support tools

- Design checking resources (response checker)

Source: Mentor Graphics Veloce2 brochure

Veloce 2

• Emulator must interact with a number of interfaces

Source: Mentor Graphics Veloce2 brochure

A Contemporary Example: DEEP

• Developed at U. Delaware in 2011

• Ten FPGA processor boards

• Backplane used for communication in conjunction with switch boards

• Ethernet interface

• Somewhat limited scalability

A Contemporary Example: DEEP

• Used to emulate a multi-threaded processor

• Subblocks assigned to specific FPGAs

• The same logic is repetitively used for different threads

• Hand-partitioning of subblocks

• Bottom line: 80K cycles per second

Summary

• Virtual wires overcome pin limitations by intelligently multiplexing I/O signals

• The key CAD step is scheduling, which simplifies routing and partitioning.

• Latest push is towards incremental compilation

• Commercialized by Mentor Graphics – Veloce2

• Contemporary multi-FPGA systems often contain many boards

- Some require multiplexing of subblocks