Lecture 18: Case Study of SoC Design

ECE 412: Microcomputer Laboratory

Outline

• Web server example
• MP3 example

Example: Embedded web server application

• Basic web server capable of responding to simple HTTP requests

• Simple CGI requests for dynamic HTML

• Read a timer peripheral before, during, and after servicing an HTTP request to log throughput calculations, which are then displayed on a dynamically generated web page

• Simple read-only file system was implemented using flash memory to store static web pages and JPEG images

Throughput calculations

• Transmission throughput
  – Reflects the latency between starting to send the first TCP packet containing the HTTP response until the file was completely sent
  – Could theoretically reach a maximum of 10 Mbps
    • Raw network speed that the CPU and TCP/IP stack are capable of sustaining

• HTTP server throughput
  – Takes into account all delay between the incoming HTTP connection request and file send completion
    • Includes the transmission latency above
    • Also measures the time the HTTP server took to open a TCP connection to the host
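As a rough sketch of how the two metrics could be derived from timer snapshots, the Python below converts tick counts into megabits per second. The function names and the three-snapshot layout are hypothetical, and the 33 MHz timer clock is an assumption borrowed from the clock speed quoted later in the lecture.

```python
# Assumed timer frequency: the 33 MHz system clock mentioned later in the lecture.
CLOCK_HZ = 33_000_000


def throughput_mbps(num_bytes, start_tick, end_tick, clock_hz=CLOCK_HZ):
    """Return throughput in megabits per second between two timer snapshots."""
    elapsed_s = (end_tick - start_tick) / clock_hz
    if elapsed_s <= 0:
        raise ValueError("end_tick must be later than start_tick")
    return num_bytes * 8 / elapsed_s / 1e6


def measure(num_bytes, t_request, t_first_packet, t_done):
    """Compute both metrics from three hypothetical timer snapshots.

    t_request:      tick at the incoming HTTP connection request
    t_first_packet: tick when the first TCP packet of the response is sent
    t_done:         tick when the file has been completely sent
    """
    return {
        "transmission_mbps": throughput_mbps(num_bytes, t_first_packet, t_done),
        "http_server_mbps": throughput_mbps(num_bytes, t_request, t_done),
    }
```

For instance, taking 60 KB as 60,000 bytes and a 156 ms transmission latency (5,148,000 ticks at 33 MHz), `throughput_mbps(60_000, 0, 5_148_000)` gives roughly 3.08 Mbps. Because the HTTP server window starts earlier, its throughput is never higher than the transmission throughput for the same file.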

Baseline system

• Web server put to test to serve up JPEG images of varying sizes across the LAN to a host PC

• During each transfer several snapshots of the timer peripheral were taken

Baseline system dataflow

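Figure: baseline system block diagram. The Nios CPU's instruction master and data master connect through the Avalon bus to the UART, I/O, timer, and other peripherals, SRAM, flash, and the Ethernet MAC; the data flow path is marked.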

The Nios CPU’s data master port is used to read data memory (SRAM) and write to the Ethernet MAC. This would occur for each packet transmitted in the baseline system.

Performance optimization

• Using a DMA to transfer data from incoming packets into memory without the intervention of the microprocessor

• The use of a custom peripheral to do the checksum calculation

• The combination of the two

• Optimization of the slave-arbitration priority for the memories to provide maximum data throughput

Dataflow enhancement with DMA

• Using DMA to transfer packets between Ethernet MAC and data memory
• CPU higher priority for any conflicts with the DMA
• During DMA, CPU is free to access other peripherals
• For access to the shared SRAM, arbitration is performed

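Figure: DMA-enhanced system block diagram. The Nios CPU (instruction master, data master) and the DMA controller (read master, write master) share the Avalon bus, with an arbitrator in front of the shared memory; the slaves are the UART, I/O, timer, and other peripherals, SRAM, flash, and the Ethernet MAC. The data flow path is marked.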
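To make the arbitration rule concrete, here is a toy Python model of a shared SRAM port on which the CPU wins every conflict. It is a sketch under simplifying assumptions that are not from the slides: one word moves per cycle, the DMA requests the memory on every cycle until it is done, and the function name `dma_cycles` is hypothetical. It does not model the actual Avalon bus arbitration logic.

```python
def dma_cycles(words, cpu_busy_cycles):
    """Count cycles until the DMA has moved `words` words through the SRAM.

    cpu_busy_cycles is a finite set of cycle numbers on which the CPU also
    accesses the SRAM. On those cycles the CPU is granted the memory
    (fixed priority) and the DMA stalls for one cycle.
    """
    cycle = 0
    moved = 0
    while moved < words:
        if cycle not in cpu_busy_cycles:
            moved += 1  # DMA is granted the SRAM this cycle
        cycle += 1
    return cycle
```

With no CPU traffic, `dma_cycles(4, set())` takes 4 cycles; if the CPU uses the SRAM on cycles 1 and 2, `dma_cycles(4, {1, 2})` takes 6. Each conflicting CPU access delays the DMA by one cycle, while CPU accesses to other peripherals cost the DMA nothing.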

Performance improvement

Transmission throughput is doubled compared to baseline

The entire HTTP server throughput is about 2.5X that of the baseline

36% increase of logic resource usage (3600 logic elements)

TCP checksum

• Checksum calculations can be regarded as a necessary evil in dataflow-sensitive applications
  – For a 1300-byte payload, it takes 33,000 clock cycles
  – At a 33 MHz clock speed it requires 1 ms latency for each maximum-size packet

• In the benchmark, the largest file (60 KB) breaks down into 46 maximum-sized packets
  – 46 ms out of 156 ms transmission latency in the baseline

• The inner loop of TCP/IP stack checksum performs a 16-bit one’s complement checksum calculation
  – Adding up data repeatedly is a simple task for hardware
  – A Verilog implementation can be designed
  – The checksum peripheral operates by
    • Reading the payload contents directly out of data memory
    • Performing the checksum calculation
    • Storing the result in a CPU-addressable register
  – It takes 386 clock cycles now
  – Speedup of 90X over the software version
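For reference, the 16-bit one's complement sum that the peripheral offloads can be sketched in Python as below. This is an illustrative software model in the style of the standard Internet checksum, not the lecture's TCP/IP stack code or its Verilog design; the function name is hypothetical, and the TCP pseudo-header that a real stack also sums is left out.

```python
def ones_complement_checksum(data: bytes) -> int:
    """Return the 16-bit one's complement checksum of `data`.

    The data is treated as big-endian 16-bit words; an odd-length input
    is padded with one zero byte.
    """
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries above bit 15 back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF
```

The loop body is a single add per 16-bit word, which is why it maps so easily onto an adder that streams words straight out of data memory. A handy self-check: appending the checksum to the data and summing again yields 0.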

Checksum peripheral

• Again, for access to the shared SRAM, arbitration is performed

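Figure: system block diagram with the checksum peripheral. The Nios CPU (instruction master, data master) and the checksum peripheral (read master) share the Avalon bus, with an arbitrator in front of the shared memory; the slaves are the UART, I/O, timer, and other peripherals, SRAM, and flash. The data flow path is marked.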

Performance boost

Transmission latency decreased by 44ms

Average transmission throughput increase of 40% and average HTTP throughput increase of 25% over the baseline

Resource usage 22% increase over the baseline (3250 logic elements)
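The latency saving can be cross-checked against the cycle counts quoted for the checksum. The short Python calculation below uses only figures from the lecture (33 MHz clock, 46 packets, 33,000 software cycles and 386 hardware cycles per packet); the variable and function names are made up for illustration.

```python
CLOCK_HZ = 33_000_000   # 33 MHz system clock
PACKETS = 46            # maximum-sized packets in the largest file
SW_CYCLES = 33_000      # software checksum, cycles per packet
HW_CYCLES = 386         # checksum peripheral, cycles per packet


def checksum_ms(cycles_per_packet):
    """Total checksum time in milliseconds for the whole file."""
    return PACKETS * cycles_per_packet / CLOCK_HZ * 1e3


sw_ms = checksum_ms(SW_CYCLES)   # about 46 ms
hw_ms = checksum_ms(HW_CYCLES)   # about 0.54 ms
saved_ms = sw_ms - hw_ms         # about 45.5 ms
```

The estimate of roughly 45.5 ms saved is close to the 44 ms reported. The slides do not explain the difference; overhead such as setting up the peripheral or waiting on SRAM arbitration is one plausible cause, but that is a guess.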

Putting it all together

Embedded uP systems in Xilinx FPGA

Traditional embedded microprocessor system as implemented on a platform FPGA

Co-processor Architecture with multiple hardware accelerators

1. Start with developing for the first architecture

2. Automatically generate the second architecture under the control of the user

Profiling results

DCT32 and IMDCT36 perform the discrete cosine transform and inverse discrete cosine transform respectively. The other functions are multiply-accumulate functions of various sizes.

These functions comprise over 90% of the total application execution time on the host.
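Amdahl's law gives a quick bound on what accelerating these functions can buy. The Python sketch below takes the 90% figure from the profile; the accelerator speedup values passed in are hypothetical, since the lecture does not state them.

```python
def amdahl_speedup(accelerated_fraction, accelerator_speedup):
    """Overall speedup when a fraction of run time is sped up by a given factor."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if accelerator_speedup <= 0:
        raise ValueError("accelerator_speedup must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / accelerator_speedup)
```

If exactly 90% of the run time is accelerated, the remaining 10% caps the overall gain at 10X however fast the accelerators are. A hypothetical 10X accelerator would give `amdahl_speedup(0.9, 10)`, about 5.3X overall.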

Design automation

• Implement co-processor accelerators to meet performance requirements.

• Using the tagging facilities in the Xilinx design environment to mark the functions for hardware acceleration.

• ‘Compile for target’
  – The tool chain will create an implementation that includes a MicroBlaze processor and interfaces the same as before
  – Augmented with three hardware accelerators that implement the multiplications, DCT and inverse DCT.

• The creation of the hardware accelerator blocks is done automatically:
  – The use of an advanced C to hardware compiler optimized for Platform FPGAs.
  – The ‘stitching’ of the accelerators into the new co-processing architecture.
  – Handling the movement of the appropriate data to and from the accelerators.

New architecture

Final results

Enables the MP3 application to run in real time at a system clock rate of 67.5 MHz.

A simple summary

• Platform-based design involves hardware/software codesign

• The right design decisions can provide a significant performance improvement

• Need careful tradeoff between performance, resource usage, cost and design time

• Platform FPGAs are a convenient, low-cost platform for such a task

Overview of the Rest of the Semester

• This is the last formal lecture
  – If we haven’t covered it already, we can’t really expect you to use it on your projects

• Quiz 2. Next Thursday. No class next Tuesday.

• Final project proposal is 4/13 and 4/15.
  – 2 teams each day. Each team has 20 minutes
  – Proposal presentations can be sent to me through email before class or brought in using a flash memory

• Initial report due on 4/20 (new due date)
  – Three pages (four at most)
  – May contain: introduction, background, motivation, impact, block diagram, and workload partition among team members
  – Goal: give us enough information that we can provide feedback about project complexity and suggestions

• From now on, I’ll have office hours during class meeting times to discuss final project-related issues

• Final Project Presentation: 5/12
• Final Project Report/Demo: Due 5/14
• For details, refer to Lecture 14

Next time

• Quiz 2 (next Thursday, 4/8)
