Interfacing Processor and Peripherals

Gary Marsden Slide 1University of Cape Town

Interfacing Processor and Peripherals

Overview

Mainmemory

I/Ocontroller

I/Ocontroller

I/Ocontroller

Disk Graphicsoutput

Network

Memory– I/O bus

Processor

Cache

Interrupts

Disk


Introduction

I/O often viewed as second class to processor design– Processor research is cleaner– System performance given in terms of processor

– Courses often ignore peripherals– Writing device drivers is not fun

This is crazy - a computer with no I/O is pointless


Peripheral design

As with processors, characteristics of I/O driven by technology advances– E.g. properties of disk drives affect how they should be connected to the processor

– PCs and super computers now share the same architectures, so I/O can make all the difference

Different requirements from processors– Performance– Expandability– Resilience


Peripheral performance

Harder to measure than for the processor– Device characteristics

• Latency / Throughput

– Connection between system and device– Memory hierarchy– Operating System

Assume 100 secs to execute a benchmark– 90 secs CPU and 10 secs I/O– If processors get 50% faster per year for the next 5 years, what is the impact?


Relative performance

CPU time + IO time = total time (% of IO time)

Year 0: 90 + 10 = 100 (10%)Year 1: 60 + 10 = 70 (14%):Year 5: 12 + 10 = 22 (45%)!


IO bandwidth

Measured in 2 ways depending on application– How much data can we move through the system in a given time• Important for supercomputers with large amounts of data for, say, weather prediction

– How many IO operations can we do in a given time• ATM is small amount of data but need to be handled rapidly

So comparison is hard. Generally– Response time lowered by handling early– Throughput increased by handling multiple requests together


I/O Performance Measures

I/O bandwidth (throughput) – amount of information that can be input (output) and communicated across an interconnect (e.g., a bus) to the processor/memory (I/O device) per unit time1. How much data can we move through the system in

a certain time?2. How many I/O operations can we do per unit

time?

I/O response time (latency) – the total elapsed time to accomplish an input or output operation– An especially important metric in real-time

systems

Many applications require both high throughput and short response times


I/O System Performance

Designing an I/O system to meet a set of bandwidth and/or latency constraints means

Finding the weakest link in the I/O system – the component that constrains the design– The processor and memory system– The underlying interconnection (e.g., bus)– The I/O controllers– The I/O devices themselves

(Re)configuring the weakest link to meet the bandwidth and/or latency requirements

Determining requirements for the rest of the components and (re)configuring them to support this latency and/or bandwidth


I/O System Performance Example A disk workload consisting of 64KB reads and writes where

the user program executes 200,000 instructions per disk I/O operation and– a processor that sustains 3 billion instr/s and averages 100,000 OS instructions to handle an I/O operation

– a memory-I/O bus that sustains a transfer rate of 1000 MB/s

The maximum I/O rate of the processor is

-------------------------- = ------------------------ = 10,000 I/Os/sec

Instr execution rate 3 x 109Instr per I/O (200 + 100) x

103

Each I/O reads/writes 64 KB so the maximum I/O rate of the bus is

---------------------- = ----------------- = 15,625 I/O’s/sec

Bus bandwidth 1000 x 106

Bytes per I/O 64 x 103


Input and Output Devices I/O devices are incredibly diverse with respect to

– Behavior – input, output or storage– Partner – human or machine– Data rate – the peak rate at which data can be

transferred between the I/O device and the main memory or processor

Device Behavior Partner Data rate (Mb/sec)

Keyboard input human 0.0001

Mouse input human 0.0038

Laser printer output human 3.2000

Graphics display

output human 800.0000-8000.0000

Network input or output

machine 100.0000-1000.0000

Magnetic disk storage machine 240.0000-2560.0000

8 orders of magnitude

range


Mouse

Communicates with– Pulses from LED– Increment / decrement counters

Mice have at least 1 button– Need click and hold

Movement is smooth, slower than processor– Polling– No submarining– Software configuration

Initialposition

of mouse+20 in X– 20 in X

+20 in Y+20 in Y+20 in X

+20 in Y– 20 in X

– 20 in Y– 20 in Y+20 in X

– 20 in Y– 20 in X


Mouse guts

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.


Hard disk

Rotating rigid platters with magnetic surfaces

Data read/written via head on armature– Think record player

Storage is non-volatileSurface divided into tracks

– Several thousand concentric circles

Track divided in sectors– 128 or so sectors per track


Diagram

Platter

Track

Platters

Sectors

Tracks


Access time

Three parts1. Perform a seek to position arm over

correct track2. Wait until desired sector passes under

head. Called rotational latency or delay

3. Transfer time to read information off disk – Usually a sector at a time at 2~4 Mb / sec

– Control is handled by a disk controller, which can add its own delays.


Calculating time

Seek time:– Measure max and divide by two– More formally: (sum of all possible seeks)/number of possible seeks

Latency time:– Average of complete spin– 0.5 rotations / spin speed (3600~5400 rpm)

– 0.5/ 3600 / 60– 0.00083 secs– 8.3 ms


Comparison

Currently, 7.25 Gb (7,424,000) per inch squared

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.


More faking

Disk drive hides internal optimisations from external world

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.


Disk Latency & Bandwidth Milestones

Disk latency is one average seek time plus the rotational latency. Disk bandwidth is the peak transfer time of formatted data from the media (not from the cache).In the time that the disk bandwidth doubles the latency improves by a factor of only 1.2 to 1.4

CDC Wren SG ST41 SG ST15 SG ST39 SG ST37

RSpeed (RPM) 3600 5400 7200 10000 15000

Year 1983 1990 1994 1998 2003

Capacity (Gbytes)

0.03 1.4 4.3 9.1 73.4

Diameter (inches)

5.25 5.25 3.5 3.0 2.5

Interface ST-412 SCSI SCSI SCSI SCSI

Bandwidth (MB/s)

0.6 4 9 24 86

Latency (msec)

48.3 17.1 12.7 8.8 5.7


Media Bandwidth/Latency Demands Bandwidth requirements

– High quality video• Digital data = (30 frames/s) × (640 x 480 pixels) × (24-b color/pixel) = 221 Mb/s (27.625 MB/s)

– High quality audio• Digital data = (44,100 audio samples/s) × (16-b audio samples) × (2 audio channels for stereo) = 1.4 Mb/s (0.175 MB/s)

– Compression reduces the bandwidth requirements considerably

Latency issues– How sensitive is your eye (ear) to variations in video (audio) rates?

– How can you ensure a constant rate of delivery?– How important is synchronizing the audio and video streams?• 15 to 20 ms early to 30 to 40 ms late tolerable


Magnetic Disk Examples (www.seagate.com)

Characteristic Seagate ST37 Seagate ST32 Seagate ST94

Disk diameter (inches)

3.5 3.5 2.5

Capacity (GB) 73.4 200 40

# of surfaces (heads)

8 4 2

Rotation speed (RPM) 15,000 7,200 5,400

Transfer rate (MB/sec)

57-86 32-58 34

Minimum seek (ms) 0.2r-0.4w 1.0r-1.2w 1.5r-2.0w

Average seek (ms) 3.6r-3.9w 8.5r-9.5w 12r-14w

MTTF (hours@25oC) 1,200,000 600,000 330,000

Dimensions (inches) 1”x4”x5.8” 1”x4”x5.8” 0.4”x2.7”x3.9”

GB/cu.inch 3 9 10

Power: op/idle/sb (watts)

20?/12/- 12/8/1 2.4/1/0.4

GB/watt 4 16 17

Weight (pounds) 1.9 1.4 0.2


Buses: Connecting I/O devices

Interfacing subsystems in a computer system is commonly done with a bus: “a shared communication link, which uses one set of wires to connect multiple sub-systems”


Why a bus?

Main benefits:– Versatility: new devices easily added– Low cost: reusing a single set of wires many ways

Problems:– Creates a bottleneck– Tries to be all things to all subsystems

Comprised of– Control lines: signal requests, acknowledgements and to show what type of information is on the

– Data lines:data, destination / source address


Controlling a bus

As the bus is shared, need a protocol to manage usage

Bus transaction consists of– Sending the address– Sending / receiving the data

Note than in buses, we talk about what the bus does to memory– During a read, a bus will ‘receive’ data


Bus transaction 1 - disk write

97018/PattersonFig. 8.07

Memory Processor

Control lines

Data lines

Disks

Memory Processor

Control lines

Data lines

Disks

Processor

Control lines

Data lines

Disks

a.

b.

c.

Memory


Bus transaction 2 - disk read

Memory Processor

Control lines

Data lines

Disks

Processor

Control lines

Data lines

Disks

a.

b.

Memory

97108/PattersonFig. 8.08


Types of Bus

Processor-memory bus– Short and high speed– Matched to memory system (usually Proprietary)

I/O buses– Lengthy,– Connected to a wide range of devices– Usually connected to the processor using 1 or 3

Backplane bus– Processors, memory and devices on single bus– Has to balance proc-memory with I/O-memory– Usually requires extra logic to do this


Bus type diagramProcessor Memory

Backplane bus

a. I/O devices

Processor MemoryProcessor-memory bus

b.

Busadapter

Busadapter

I/Obus

I/Obus

Busadapter

I/Obus

Processor MemoryProcessor-memory bus

c.

Busadapter

Backplanebus

Busadapter

I/O bus

Busadapter

I/O bus


Synchronous and Asynchronous buses

Synchronous bus has a clock attached to the control lines and a fixed protocol for communicating that is relative to the pulse

Advantages– Easy to implement (CC1 read, CC5 return value)

– Requires little logic (FSM to specify)

Disadvantages– All devices must run at same rate– If fast, cannot be long due to clock skew

Most proc-mem buses are clocked


Asynchronous buses

No clock, so it can accommodate a variety of devices (no clock = no skew)

Needs a handshaking protocol to coordinate different devices– Agreed steps to progress through by sender and receiver

– Harder to implement - needs more control lines


Example handshake - device wants a word from memory


FSM control

1Record fromdata linesand assert

Ack

ReadReq

ReadReq________

ReadReq

ReadReq

3, 4Drop Ack;

put memorydata on datalines; assert

DataRdy

Ack

Ack

6Release data

lines andDataRdy

________

___

Memory

2Release data

lines; deassertReadReq

Ack

DataRdy

DataRdy

5Read memorydata from data

lines;assert Ack

DataRdy

DataRdy

7Deassert Ack

I/O device

Put addresson data

lines; assertReadReq

________

Ack___

________

New I/O request

New I/O request


Increasing bus bandwidth

Key factors– Data bus width: Wider = fewer cycles for transfer

– Separate vs Multiplexed, data and address lines• Separating allows transfer in one bus cycle

– Block transfer: Transfer multiple blocks of data in consecutive cycles without resending addresses and control signals etc.


Obtaining bus access

Need one, or more, bus masters to prevent chaos

Processor is always a bus master as it needs to access memory– Memory is always a slave

Simplest system as a single master (CPU)Problems

– Every transfer needs CPU time– As peripherals become smarter, this is a waste of time

But, multiple masters can cause problems


Bus Arbitration

Deciding which master gets to go next– Master issues ‘bus request’ and awaits ‘granted’

Two key properties– Bus priority (highest first)– Bus fairness (even the lowest get a go, eventually)

Arbitration is an overhead, so good to reduce it– Dedicated lines, grant lines, release lines etc.


Different arbitration schemes

Daisy chain: Bus grant line runs through devices from highest to lowest

Very simple, but cannot guarantee fairness

Device n

Lowest priority

Device 2Device 1

Highest priority

Busarbiter

Grant

Grant Grant

Release

Request


Centralised Arbitration

Centralised, parallel: All devices have separate connections to the bus arbiter– This is how the PCI backplane bus works (found in most PCs)

– Can guarantee fairness– Arbiter can become congested


Distributed

Distributed arbitration by self selection:

Each device contains information about relative importance

A device places its ID on the bus when it wants access

If there is a conflict, the lower priority devices back down

Requires separate lines and complex devices

Used on the Macintosh II series (NuBus)


Collision detection

Distributed arbitration by collision detection:

Basically ethernetEveryone tries to grab the bus at once

If there is a ‘collision’ everyone backs off a random amount of time


Bus standards

To ensure machine expansion and peripheral re-use, there are various standard buses– IBM PC-AT bus (de-facto standard)– SCSI (needs controller)– PCI (Started as Intel, now IEEE)– Ethernet

Bus bandwidth depends on size of transfer and memory speed


PCI

Type Backplane

Data width 32-64

Address/data Multiplexed

Bus masters Multiple

Arbitration Central parallel

Clocking Synch. 33-66 Mhz

Theoretical Peak 133-512 MB/sec

Achievable peak 80 MB/sec

Max devices 1024

Max length 50 cm

Bananas none


My Old Macintosh

Mainmemory

I/Ocontroller

I/Ocontroller

Graphicsoutput

PCI

CDROM

Disk

Tape

I/Ocontroller

Stereo

I/Ocontroller

Serialports

I/Ocontroller

Appledesktop bus

Processor

PCIinterface/memory controller

EthernetSCSI bus

outputinput


Example: The Pentium 4’s Buses

System Bus (“Front Side Bus”): 64b x 800 MHz (6.4GB/s), 533 MHz, 400 MHz

2 serial ATAs: 150

MB/s

8 USBs: 60 MB/s

2 parallel ATA: 100

MB/s

Hub Bus: 8b x 266 MHz

Memory Controller Hub (“Northbridge”)

I/O Controller Hub (“Southbridge”)

Gbit ethernet: 0.266 GB/s

DDR SDRAM Main Memory

Graphics output: 2.0 GB/s

PCI: 32b x

33 MHz


Buses in TransitionCompanies are transitioning from synchronous, parallel, wide buses to asynchronous narrow buses

– Reflection on wires and clock skew makes it difficult to (synchronously) use 16 to 64 parallel wires running at a high clock rate (e.g., ~400 MHz) so companies are transitioning to buses with a few one-way, asynchronous wires running at a very high clock rate (~2 GHz)

PCI PCIexpress

ATA Serial ATA

Total # wires

120 36 80 7

# data wires 32 – 64 (2-way)

2 x 4 (1-way)

16 (2-way)

2 x 2 (1-way)

Clock (MHz) 33 – 133 635 50 150

Peak BW (MB/s)

128 – 1064

300 100 150


ATA Cable Sizes

Serial ATA cables (red) are much thinner than parallel ATA cables (green)


Giving commands to I/O devices

Processor must be able to address a device– Memory mapping: portions of memory are allocated to a device (Base address on a PC)• Different addresses in the space mean different things

• Could be a read, write or device status address

– Special instructions: Machine code for specific devices• Not a good idea generally


Communicating with the Processor

Polling– Process of periodically checking the status bits to see if it is time for the next I/O operation

– Simplest way for device to communicate (via a shared status register

– Mouse– Wasteful of processor time


Interrupts

Notify processor when a device needs attention (IRQ lines on a PC)

Just like exceptions, except for– Interrupt is asynchronous with program execution• Control unit only checks I/O interrupt at the start of each instruction execution

– Need further information, such as the identity of the device that caused the interrupt and its priority• Remember the Cause Register?


Interrupt-Driven I/O

With I/O interrupts– Need a way to identify the device generating the interrupt

– Can have different urgencies (so may need to be prioritized)

Advantages of using interrupts– Relieves the processor from having to continuously poll for an I/O event; user program progress is only suspended during the actual transfer of I/O data to/from user memory space

Disadvantage – special hardware is needed to– Cause an interrupt (I/O device) and detect an interrupt and save the necessary information to resume normal processing after servicing the interrupt (processor)


Interrupt-Driven Input

memory

userprogram

1. input interrupt

2.1 save state

Processor

ReceiverMemory

addsubandorbeq

lbusb...jr

2.2 jump to interruptservice routine

2.4 returnto user code

Keyboard

2.3 service interrupt

inputinterruptserviceroutine


Interrupt-Driven Output

Processor

TrnsmttrMemory

Display

addsubandorbeq

lbusb...jr

memory

userprogram

1.output interrupt

2.1 save state

outputinterruptserviceroutine

2.2 jump to interruptservice routine

2.4 returnto user code

2.3 service interrupt


Transferring Data Between Device and Memory

We can do this with Interrupts and Polling– Works best with low bandwidth devices and keeping cost of controller and interface

– Burden lies with the processor

For high bandwidth devices, we don’t want the processor worrying about every single block

Need a scheme for high bandwidth autonomous transfers


Direct Memory Access (DMA)

Mechanism for offloading the processor and having the device controller transfer data directly

Still uses interrupt mechanism, but only to communicate completion of transfer or error

Requires dedicated controller to conduct the transfer


Doing DMA

Essentially, DMA controller becomes bus master and sets up the transfer

Three steps– Processor sets up the DMA by supplying

• device identity• Operation on device• Memory Address (source or destination)• Amount to transfer

– DMA operates devices, supplies addresses and arbitrates bus

– On completion, controller notifies processor


DMA and the memory system

With DMA, the relationship between memory and processor is changed– DMA bypasses address translation and hierarchy

So, should DMA use virtual or physical addresses?– Virtual addresses: DMA must translate– Physical addresses: Hard to cross page boundary


DMA address translation

Can provide the DMA with a small address translation table for pages - provided by OS at transfer time

Get the OS to break the transfer into chunks, each chunk relating to a single page

Regardless, the OS cannot relocate pages during transfer


DMA Cache problems

DMA can create inconsistencies between cache and main memory– Called the ‘stale data’ or ‘coherency’ problem

Solve by– Route all I/O activity through cache (expensive)

– Have the whole cache flushed (easy and not too bad)

– Selectively flush cache (slightly more efficient but lots of control circuit needed)


Parallel Processors

Parallel processing machines are common– All current G5 Macs have two processors

Parallel categorisation (Flynn 1966)– Single Instruction Stream, Single Data Stream (SISD)

– Single Instruction, Multiple Data (SIMD - MMX)

– Multiple Instruction, Single Data (MISD - SuperScalar)

– Multiple Instructions, Multiple Data (MIMD - true parallelism)


Directions

Microprocessors get faster every year

Development costs are high– MIPS R4000: 30 engineers for 3 years; $30 million to develop; $10 million to fabricate; 50000 hours simulation

– High costs led companies to look at re-using existing chips in multi-processor machines

– This is possible due to improvements in memory and bus technology

– Still expensive to run these things


Evolution vs Revolution

Evolutionary approaches tend to be invisible to users except for– Lower cost and better performance

Revolutionary approaches require new languages and applications– Looks good on paper– Must be worth the effort– KCM

Documents

Interfacing Processor and Peripherals