Gary Marsden, University of Cape Town
Interfacing Processor and Peripherals
Overview
[Figure: the processor and its cache share a memory-I/O bus with main memory and three I/O controllers, which serve disks, graphics output and a network; the controllers signal the processor with interrupts]
Introduction
I/O often viewed as second class to processor design
– Processor research is cleaner
– System performance given in terms of processor
– Courses often ignore peripherals
– Writing device drivers is not fun
This is crazy - a computer with no I/O is pointless
Peripheral design
As with processors, characteristics of I/O driven by technology advances
– E.g. properties of disk drives affect how they should be connected to the processor
– PCs and supercomputers now share the same architectures, so I/O can make all the difference
Different requirements from processors
– Performance
– Expandability
– Resilience
Peripheral performance
Harder to measure than for the processor
– Device characteristics
• Latency / Throughput
– Connection between system and device
– Memory hierarchy
– Operating System
Assume 100 secs to execute a benchmark
– 90 secs CPU and 10 secs I/O
– If processors get 50% faster per year for the next 5 years, what is the impact?
Relative performance
CPU time + I/O time = total time (% of I/O time)
Year 0: 90 + 10 = 100 (10%)
Year 1: 60 + 10 = 70 (14%)
...
Year 5: 12 + 10 = 22 (45%)!
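The figures above follow from dividing the CPU time by 1.5 each year while the I/O time stays fixed at 10 secs. A minimal Python sketch of that arithmetic is given here; the function name `io_share` and its parameters are illustrative and not from the slides.

```python
def io_share(cpu_secs=90.0, io_secs=10.0, speedup_per_year=1.5, years=5):
    """Return (year, cpu_time, total_time, io_fraction) rows.

    Assumes only the CPU part speeds up; the I/O time never changes.
    """
    rows = []
    for year in range(years + 1):
        cpu = cpu_secs / speedup_per_year ** year
        total = cpu + io_secs
        rows.append((year, cpu, total, io_secs / total))
    return rows


for year, cpu, total, fraction in io_share():
    print(f"Year {year}: {cpu:.1f} + 10 = {total:.1f} ({fraction:.1%} I/O)")
```

The slide rounds the year 5 CPU time to 12 secs before taking the percentage, so the unrounded share printed here is slightly higher than 45%.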
IO bandwidth
Measured in 2 ways depending on application
– How much data can we move through the system in a given time
• Important for supercomputers with large amounts of data for, say, weather prediction
– How many I/O operations can we do in a given time
• An ATM transaction is a small amount of data but needs to be handled rapidly
So comparison is hard. Generally
– Response time lowered by handling each request as early as possible
– Throughput increased by handling multiple requests together
I/O Performance Measures
I/O bandwidth (throughput) – amount of information that can be input (output) and communicated across an interconnect (e.g., a bus) to the processor/memory (I/O device) per unit time
1. How much data can we move through the system in a certain time?
2. How many I/O operations can we do per unit time?
I/O response time (latency) – the total elapsed time to accomplish an input or output operation
– An especially important metric in real-time systems
Many applications require both high throughput and short response times
I/O System Performance
Designing an I/O system to meet a set of bandwidth and/or latency constraints means
Finding the weakest link in the I/O system – the component that constrains the design
– The processor and memory system
– The underlying interconnection (e.g., bus)
– The I/O controllers
– The I/O devices themselves
(Re)configuring the weakest link to meet the bandwidth and/or latency requirements
Determining requirements for the rest of the components and (re)configuring them to support this latency and/or bandwidth
I/O System Performance Example
A disk workload consisting of 64 KB reads and writes where the user program executes 200,000 instructions per disk I/O operation and
– a processor that sustains 3 billion instr/s and averages 100,000 OS instructions to handle an I/O operation
– a memory-I/O bus that sustains a transfer rate of 1000 MB/s
The maximum I/O rate of the processor is
$$\frac{\text{Instr execution rate}}{\text{Instr per I/O}} = \frac{3 \times 10^{9}}{(200 + 100) \times 10^{3}} = 10{,}000 \text{ I/Os/sec}$$
Each I/O reads/writes 64 KB so the maximum I/O rate of the bus is
$$\frac{\text{Bus bandwidth}}{\text{Bytes per I/O}} = \frac{1000 \times 10^{6}}{64 \times 10^{3}} = 15{,}625 \text{ I/Os/sec}$$
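Comparing the two rates identifies the weakest link. The following Python sketch repeats the example's arithmetic with the slide's numbers; the function and variable names are illustrative assumptions.

```python
def max_io_rates(instr_per_sec, user_instr_per_io, os_instr_per_io,
                 bus_bytes_per_sec, bytes_per_io):
    """Return (processor_limit, bus_limit) in I/O operations per second."""
    cpu_rate = instr_per_sec / (user_instr_per_io + os_instr_per_io)
    bus_rate = bus_bytes_per_sec / bytes_per_io
    return cpu_rate, bus_rate


# Numbers from the example: 3 billion instr/s, 200,000 user and 100,000 OS
# instructions per I/O, a 1000 MB/s bus, and 64 KB (taken as 64 x 10^3 bytes).
cpu_rate, bus_rate = max_io_rates(3e9, 200_000, 100_000, 1000e6, 64e3)
bottleneck = "processor" if cpu_rate < bus_rate else "bus"
print(cpu_rate, bus_rate, bottleneck)
```

With these numbers the processor saturates before the bus does, so the processor is the component that constrains the design.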
Input and Output Devices
I/O devices are incredibly diverse with respect to
– Behavior – input, output or storage
– Partner – human or machine
– Data rate – the peak rate at which data can be transferred between the I/O device and the main memory or processor

Device            Behavior         Partner  Data rate (Mb/sec)
Keyboard          input            human    0.0001
Mouse             input            human    0.0038
Laser printer     output           human    3.2
Graphics display  output           human    800-8000
Network           input or output  machine  100-1000
Magnetic disk     storage          machine  240-2560

Data rates span a range of 8 orders of magnitude
Mouse
Communicates with
– Pulses from LED
– Increment / decrement counters
Mice have at least 1 button
– Need click and hold
Movement is smooth, slower than processor
– Polling
– No submarining
– Software configuration
[Figure: counter changes for mouse movement in eight directions from the initial position: +20 in X; –20 in X; +20 in Y; –20 in Y; and the four diagonal combinations of +20 or –20 in X with +20 or –20 in Y]
Mouse guts
Hard disk
Rotating rigid platters with magnetic surfaces
Data read/written via head on armature
– Think record player
Storage is non-volatile
Surface divided into tracks
– Several thousand concentric circles
Track divided in sectors
– 128 or so sectors per track
Diagram
[Figure: a stack of platters; each platter surface is divided into concentric tracks, and each track into sectors]
Access time
Three parts
1. Perform a seek to position arm over correct track
2. Wait until desired sector passes under head. Called rotational latency or delay
3. Transfer time to read information off disk
– Usually a sector at a time at 2~4 Mb / sec
– Control is handled by a disk controller, which can add its own delays.
Calculating time
Seek time:
– Measure max and divide by two
– More formally: (sum of all possible seeks) / (number of possible seeks)
Latency time:
– On average, half a complete rotation
– 0.5 rotations / spin speed (3600~5400 rpm)
– At 3600 rpm: 0.5 / (3600 / 60)
– 0.0083 secs
– 8.3 ms
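The rotational latency calculation can be checked with a few lines of Python. This is a minimal sketch; the function name `avg_rotational_latency_ms` is an illustrative assumption.

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency in ms: the time for half a rotation."""
    rotations_per_sec = rpm / 60
    return 0.5 / rotations_per_sec * 1000


# The two spin speeds quoted on the slide.
for rpm in (3600, 5400):
    print(rpm, "rpm:", round(avg_rotational_latency_ms(rpm), 1), "ms")
```

At 3600 rpm this gives about 8.3 ms, matching the figure above; at 5400 rpm it drops to about 5.6 ms.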
Comparison
Currently, 7.25 Gb (7,424,000) per inch squared
More faking
Disk drive hides internal optimisations from external world
Disk Latency & Bandwidth Milestones
Disk latency is one average seek time plus the rotational latency. Disk bandwidth is the peak transfer rate of formatted data from the media (not from the cache).
In the time that the disk bandwidth doubles, the latency improves by a factor of only 1.2 to 1.4
                   CDC Wren  SG ST41  SG ST15  SG ST39  SG ST37
Speed (RPM)        3600      5400     7200     10000    15000
Year               1983      1990     1994     1998     2003
Capacity (Gbytes)  0.03      1.4      4.3      9.1      73.4
Diameter (inches)  5.25      5.25     3.5      3.0      2.5
Interface          ST-412    SCSI     SCSI     SCSI     SCSI
Bandwidth (MB/s)   0.6       4        9        24       86
Latency (msec)     48.3      17.1     12.7     8.8      5.7
Media Bandwidth/Latency Demands
Bandwidth requirements
– High quality video
• Digital data = (30 frames/s) × (640 x 480 pixels) × (24-b color/pixel) = 221 Mb/s (27.625 MB/s)
– High quality audio
• Digital data = (44,100 audio samples/s) × (16-b audio samples) × (2 audio channels for stereo) = 1.4 Mb/s (0.175 MB/s)
– Compression reduces the bandwidth requirements considerably
Latency issues
– How sensitive is your eye (ear) to variations in video (audio) rates?
– How can you ensure a constant rate of delivery?
– How important is synchronizing the audio and video streams?
• 15 to 20 ms early to 30 to 40 ms late tolerable
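The uncompressed bandwidth figures are plain products of the stream parameters. A minimal Python sketch of the same arithmetic follows; the function names are illustrative assumptions.

```python
def video_bits_per_sec(frames_per_sec, width, height, bits_per_pixel):
    """Uncompressed video data rate in bits per second."""
    return frames_per_sec * width * height * bits_per_pixel


def audio_bits_per_sec(samples_per_sec, bits_per_sample, channels):
    """Uncompressed audio data rate in bits per second."""
    return samples_per_sec * bits_per_sample * channels


video = video_bits_per_sec(30, 640, 480, 24)
audio = audio_bits_per_sec(44_100, 16, 2)
print(f"video: {video / 1e6:.1f} Mb/s, audio: {audio / 1e6:.2f} Mb/s")
```

The slide's MB/s figures come from dividing the already rounded Mb/s values by 8, so they differ slightly from dividing the exact bit rates.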
Magnetic Disk Examples (www.seagate.com)
Characteristic             Seagate ST37  Seagate ST32  Seagate ST94
Disk diameter (inches)     3.5           3.5           2.5
Capacity (GB)              73.4          200           40
# of surfaces (heads)      8             4             2
Rotation speed (RPM)       15,000        7,200         5,400
Transfer rate (MB/sec)     57-86         32-58         34
Minimum seek (ms)          0.2r-0.4w     1.0r-1.2w     1.5r-2.0w
Average seek (ms)          3.6r-3.9w     8.5r-9.5w     12r-14w
MTTF (hours at 25°C)       1,200,000     600,000       330,000
Dimensions (inches)        1 x 4 x 5.8   1 x 4 x 5.8   0.4 x 2.7 x 3.9
GB/cu.inch                 3             9             10
Power: op/idle/sb (watts)  20?/12/-      12/8/1        2.4/1/0.4
GB/watt                    4             16            17
Weight (pounds)            1.9           1.4           0.2
Buses: Connecting I/O devices
Interfacing subsystems in a computer system is commonly done with a bus: “a shared communication link, which uses one set of wires to connect multiple sub-systems”
Why a bus?
Main benefits:
– Versatility: new devices easily added
– Low cost: reusing a single set of wires many ways
Problems:
– Creates a bottleneck
– Tries to be all things to all subsystems
Composed of
– Control lines: signal requests and acknowledgements, and show what type of information is on the data lines
– Data lines: data, destination / source address
Controlling a bus
As the bus is shared, need a protocol to manage usage
Bus transaction consists of
– Sending the address
– Sending / receiving the data
Note that in buses, we talk about what the bus does to memory
– During a read, a bus will ‘receive’ data
Bus transaction 1 - disk write
[Figure (Patterson, Fig. 8.07): three steps, a to c, of a disk write over a bus whose control lines and data lines connect memory, processor and disks]
Bus transaction 2 - disk read
[Figure (Patterson, Fig. 8.08): two steps, a and b, of a disk read over a bus whose control lines and data lines connect memory, processor and disks]
Types of Bus
Processor-memory bus
– Short and high speed
– Matched to memory system (usually proprietary)
I/O buses
– Lengthy
– Connected to a wide range of devices
– Usually connected to the processor via a processor-memory bus or a backplane bus (types 1 or 3 in this list)
Backplane bus
– Processors, memory and devices on single bus
– Has to balance proc-memory with I/O-memory
– Usually requires extra logic to do this
Bus type diagram
[Figure: (a) a single backplane bus connecting processor, memory and I/O devices; (b) a processor-memory bus with bus adapters to separate I/O buses; (c) a processor-memory bus with a bus adapter to a backplane bus, which in turn has bus adapters to I/O buses]
Synchronous and Asynchronous buses
Synchronous bus has a clock in the control lines and a fixed protocol for communicating that is relative to the clock pulse
Advantages
– Easy to implement (e.g. read issued in clock cycle 1, value returned in clock cycle 5)
– Requires little logic (a small FSM specifies the protocol)
Disadvantages
– All devices must run at same rate
– If fast, cannot be long due to clock skew
Most proc-mem buses are clocked
Asynchronous buses
No clock, so it can accommodate a variety of devices (no clock = no skew)
Needs a handshaking protocol to coordinate different devices
– Agreed steps to progress through by sender and receiver
– Harder to implement - needs more control lines
Example handshake - device wants a word from memory
FSM control
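[Figure: two finite state machines for the handshake, one for the I/O device and one for memory; the numbers are the handshake steps]
I/O device
– Put address on data lines; assert ReadReq. Wait for Ack
– 2. Release data lines; deassert ReadReq. Wait for DataRdy
– 5. Read memory data from data lines; assert Ack. Wait for DataRdy to be deasserted
– 7. Deassert Ack. Wait for a new I/O request
Memory
– Wait for ReadReq
– 1. Record from data lines and assert Ack. Wait for ReadReq to be deasserted
– 3, 4. Drop Ack; put memory data on data lines; assert DataRdy. Wait for Ack
– 6. Release data lines and DataRdy. Wait for Ack to be deasserted, then for a new I/O request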
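The handshake steps can be traced in software. The sketch below is a sequential walk-through under simplifying assumptions, not a model of real asynchronous hardware: each side acts only after the signal it waits for has changed, and the `bus` dictionary and function name are illustrative.

```python
def handshake_read(memory, address):
    """Trace the asynchronous read handshake; return (word, log)."""
    bus = {"ReadReq": False, "Ack": False, "DataRdy": False, "data": None}
    log = []

    # I/O device: put the address on the data lines and assert ReadReq.
    bus["data"], bus["ReadReq"] = address, True
    log.append("device: address on data lines, assert ReadReq")

    # 1. Memory sees ReadReq, records the address and asserts Ack.
    latched_address = bus["data"]
    bus["Ack"] = True
    log.append("1 memory: record address, assert Ack")

    # 2. Device sees Ack, releases the data lines and deasserts ReadReq.
    bus["data"], bus["ReadReq"] = None, False
    log.append("2 device: release data lines, deassert ReadReq")

    # 3, 4. Memory sees ReadReq low, drops Ack, drives the data, asserts DataRdy.
    bus["Ack"] = False
    bus["data"], bus["DataRdy"] = memory[latched_address], True
    log.append("3,4 memory: drop Ack, data on data lines, assert DataRdy")

    # 5. Device sees DataRdy, reads the data and asserts Ack.
    word = bus["data"]
    bus["Ack"] = True
    log.append("5 device: read data, assert Ack")

    # 6. Memory sees Ack, releases the data lines and DataRdy.
    bus["data"], bus["DataRdy"] = None, False
    log.append("6 memory: release data lines and DataRdy")

    # 7. Device sees DataRdy low and deasserts Ack; the bus is idle again.
    bus["Ack"] = False
    log.append("7 device: deassert Ack")

    assert not (bus["ReadReq"] or bus["Ack"] or bus["DataRdy"])
    return word, log


word, log = handshake_read({0x10: 0xBEEF}, 0x10)
print(hex(word))
print("\n".join(log))
```

Every control signal ends deasserted, which is what lets the next I/O request start from the same idle state.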
Increasing bus bandwidth
Key factors
– Data bus width: Wider = fewer cycles for transfer
– Separate vs Multiplexed, data and address lines
• Separating allows transfer in one bus cycle
– Block transfer: Transfer multiple blocks of data in consecutive cycles without resending addresses and control signals etc.
Obtaining bus access
Need one, or more, bus masters to prevent chaos
Processor is always a bus master as it needs to access memory
– Memory is always a slave
Simplest system has a single master (CPU)
Problems
– Every transfer needs CPU time
– As peripherals become smarter, this is a waste of time
But, multiple masters can cause problems
Bus Arbitration
Deciding which master gets to go next
– Master issues ‘bus request’ and awaits ‘granted’
Two key properties
– Bus priority (highest first)
– Bus fairness (even the lowest get a go, eventually)
Arbitration is an overhead, so good to reduce it
– Dedicated lines, grant lines, release lines etc.
Different arbitration schemes
Daisy chain: Bus grant line runs through devices from highest to lowest
Very simple, but cannot guarantee fairness
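[Figure: the bus arbiter's grant line is chained through Device 1 (highest priority), Device 2, and so on to Device n (lowest priority); all devices share the request and release lines back to the arbiter]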
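Why a daisy chain cannot guarantee fairness shows up in a few lines of Python. This is a minimal sketch under the assumption that index 0 is the device nearest the arbiter; the function name is illustrative.

```python
def daisy_chain_grant(requests):
    """Return the index of the device that keeps the grant, or None.

    requests[i] is True if device i wants the bus. The grant enters at
    device 0 (highest priority) and is passed on only by devices that
    are not requesting.
    """
    for device, wants_bus in enumerate(requests):
        if wants_bus:
            return device  # grant is intercepted here and goes no further
    return None


# Device 0 requests every round, so device 2 never sees the grant.
rounds = [daisy_chain_grant([True, False, True]) for _ in range(5)]
print(rounds)
```

As long as a higher-priority device keeps requesting, devices further down the chain are starved.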
Centralised Arbitration
Centralised, parallel: All devices have separate connections to the bus arbiter
– This is how the PCI backplane bus works (found in most PCs)
– Can guarantee fairness
– Arbiter can become congested
Distributed
Distributed arbitration by self selection:
Each device contains information about relative importance
A device places its ID on the bus when it wants access
If there is a conflict, the lower priority devices back down
Requires separate lines and complex devices
Used on the Macintosh II series (NuBus)
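Self selection can be sketched as every requester comparing its own ID with the IDs visible on the bus. The Python below assumes that a larger ID means higher priority and that every device can see all IDs at once; both are simplifications for illustration, not details from the slides.

```python
def self_select(requesting_ids):
    """Return the ID of the device that wins the bus, or None if idle.

    Each requester drives its ID onto the bus, then backs down if it
    sees any ID that outranks its own (assumed: larger ID wins).
    """
    ids_on_bus = set(requesting_ids)
    winners = [dev for dev in ids_on_bus
               if all(dev >= other for other in ids_on_bus)]
    return winners[0] if winners else None


print(self_select([3, 7, 5]))
```

No central arbiter is involved: each device reaches the same decision from what it observes on the bus.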
Collision detection
Distributed arbitration by collision detection:
Basically Ethernet
Everyone tries to grab the bus at once
If there is a ‘collision’ everyone backs off a random amount of time
Bus standards
To ensure machine expansion and peripheral re-use, there are various standard buses
– IBM PC-AT bus (de-facto standard)
– SCSI (needs controller)
– PCI (started at Intel, now maintained by the PCI-SIG)
– Ethernet
Bus bandwidth depends on size of transfer and memory speed
PCI
Type              Backplane
Data width        32-64 bits
Address/data      Multiplexed
Bus masters       Multiple
Arbitration       Central parallel
Clocking          Synch. 33-66 MHz
Theoretical peak  133-512 MB/sec
Achievable peak   80 MB/sec
Max devices       1024
Max length        50 cm
Bananas           none
My Old Macintosh
[Figure: the processor connects through a PCI interface/memory controller to main memory and to a PCI bus; I/O controllers on the PCI bus serve graphics output, Ethernet, a SCSI bus (CD-ROM, disk, tape), stereo input and output, serial ports and the Apple desktop bus]
Example: The Pentium 4’s Buses
[Figure: block diagram of the Pentium 4's buses, with these labels]
– System Bus (“Front Side Bus”): 64b x 800 MHz (6.4 GB/s), 533 MHz, 400 MHz
– Memory Controller Hub (“Northbridge”)
– DDR SDRAM main memory
– Graphics output: 2.0 GB/s
– Gbit Ethernet: 0.266 GB/s
– Hub Bus: 8b x 266 MHz
– I/O Controller Hub (“Southbridge”)
– 2 serial ATAs: 150 MB/s
– 2 parallel ATA: 100 MB/s
– 8 USBs: 60 MB/s
– PCI: 32b x 33 MHz
Buses in Transition
Companies are transitioning from synchronous, parallel, wide buses to asynchronous narrow buses
– Reflection on wires and clock skew makes it difficult to (synchronously) use 16 to 64 parallel wires running at a high clock rate (e.g., ~400 MHz) so companies are transitioning to buses with a few one-way, asynchronous wires running at a very high clock rate (~2 GHz)

                PCI            PCI Express    ATA         Serial ATA
Total # wires   120            36             80          7
# data wires    32-64 (2-way)  2 x 4 (1-way)  16 (2-way)  2 x 2 (1-way)
Clock (MHz)     33-133         635            50          150
Peak BW (MB/s)  128-1064       300            100         150
ATA Cable Sizes
Serial ATA cables (red) are much thinner than parallel ATA cables (green)
Giving commands to I/O devices
Processor must be able to address a device
– Memory mapping: portions of memory are allocated to a device (Base address on a PC)
• Different addresses in the space mean different things
• Could be a read, write or device status address
– Special instructions: Machine code for specific devices
• Not a good idea generally
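Memory mapping can be illustrated by routing ordinary loads and stores by address. In the Python sketch below, the classes, the three-register layout and the base address are hypothetical choices made for illustration, not taken from any real device.

```python
class MemoryMappedDevice:
    """Hypothetical device occupying three addresses starting at `base`."""
    DATA, STATUS, COMMAND = 0, 1, 2  # register offsets chosen for illustration

    def __init__(self, base):
        self.base = base
        self.registers = {self.DATA: 0, self.STATUS: 1, self.COMMAND: 0}

    def owns(self, address):
        return self.base <= address < self.base + len(self.registers)


class Bus:
    """Routes each load or store to RAM or to the device by address."""

    def __init__(self, ram_size, device):
        self.ram = [0] * ram_size
        self.device = device

    def store(self, address, value):
        if self.device.owns(address):
            self.device.registers[address - self.device.base] = value
        else:
            self.ram[address] = value

    def load(self, address):
        if self.device.owns(address):
            return self.device.registers[address - self.device.base]
        return self.ram[address]


bus = Bus(ram_size=16, device=MemoryMappedDevice(base=0x100))
bus.store(3, 42)         # ordinary memory write
bus.store(0x102, 0x01)   # same kind of store, but it lands in the command register
print(bus.load(3), bus.load(0x101), bus.load(0x102))
```

The processor needs no special instructions here: the address alone decides whether a store reaches memory or a device register.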
Communicating with the Processor
Polling
– Process of periodically checking the status bits to see if it is time for the next I/O operation
– Simplest way for device to communicate (via a shared status register)
– Example: the mouse
– Wasteful of processor time
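A polling loop, and the processor time it wastes, can be sketched as follows. The `Mouse` class is a hypothetical stand-in for a device with a status bit and movement counters; nothing here talks to real hardware.

```python
class Mouse:
    """Hypothetical device: a ready bit plus X/Y movement counters."""

    def __init__(self, events):
        self._events = list(events)  # (dx, dy) movements not yet read

    def status_ready(self):
        return bool(self._events)

    def read_counters(self):
        return self._events.pop(0)


def poll(mouse, max_polls):
    """Poll the status bit max_polls times; return (position, wasted_polls)."""
    x = y = 0
    wasted = 0
    for _ in range(max_polls):
        if mouse.status_ready():
            dx, dy = mouse.read_counters()
            x, y = x + dx, y + dy
        else:
            wasted += 1  # the processor checked but there was nothing to do
    return (x, y), wasted


position, wasted = poll(Mouse([(20, 0), (0, -20)]), max_polls=10)
print(position, wasted)
```

Most of the polls find nothing to do, which is the cost that interrupts are meant to avoid.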
Interrupts
Notify processor when a device needs attention (IRQ lines on a PC)
Just like exceptions, except for
– Interrupt is asynchronous with program execution
• Control unit only checks I/O interrupt at the start of each instruction execution
– Need further information, such as the identity of the device that caused the interrupt and its priority
• Remember the Cause Register?
Interrupt-Driven I/O
With I/O interrupts
– Need a way to identify the device generating the interrupt
– Can have different urgencies (so may need to be prioritized)
Advantages of using interrupts
– Relieves the processor from having to continuously poll for an I/O event; user program progress is only suspended during the actual transfer of I/O data to/from user memory space
Disadvantage – special hardware is needed to
– Cause an interrupt (I/O device)
– Detect an interrupt and save the necessary information to resume normal processing after servicing the interrupt (processor)
Interrupt-Driven Input
[Figure: the keyboard's receiver interrupts the processor while it runs a user program (add, sub, and, or, beq, ...) from memory]
1. Input interrupt
2.1 Save state
2.2 Jump to interrupt service routine
2.3 Service interrupt (input interrupt service routine: lbu, sb, ..., jr)
2.4 Return to user code
Interrupt-Driven Output
[Figure: the display's transmitter interrupts the processor while it runs a user program (add, sub, and, or, beq, ...) from memory]
1. Output interrupt
2.1 Save state
2.2 Jump to interrupt service routine
2.3 Service interrupt (output interrupt service routine: lbu, sb, ..., jr)
2.4 Return to user code
Transferring Data Between Device and Memory
We can do this with Interrupts and Polling
– Works best with low bandwidth devices, where keeping the cost of the controller and interface low matters most
– Burden lies with the processor
For high bandwidth devices, we don’t want the processor worrying about every single block
Need a scheme for high bandwidth autonomous transfers
Direct Memory Access (DMA)
Mechanism for offloading the processor and having the device controller transfer data directly
Still uses interrupt mechanism, but only to communicate completion of transfer or error
Requires dedicated controller to conduct the transfer
Doing DMA
Essentially, DMA controller becomes bus master and sets up the transfer
Three steps
– Processor sets up the DMA by supplying
• Device identity
• Operation on device
• Memory address (source or destination)
• Amount to transfer
– DMA operates devices, supplies addresses and arbitrates bus
– On completion, controller notifies processor
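The three steps map onto a small simulation. In the Python sketch below, `DmaRequest` holds the four items the processor supplies and `DmaController` stands in for the hardware; both classes, and the use of lists for memory and device buffers, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DmaRequest:
    device: str     # device identity
    operation: str  # "read" (device to memory) or "write" (memory to device)
    address: int    # memory address (destination for read, source for write)
    count: int      # amount to transfer, in words


class DmaController:
    def __init__(self, memory, devices):
        self.memory = memory    # list of words
        self.devices = devices  # device name -> list of words
        self.interrupts = []    # completion notices for the processor

    def start(self, request):
        """Step 2: move the data without involving the processor."""
        buffer = self.devices[request.device]
        for i in range(request.count):
            if request.operation == "read":
                self.memory[request.address + i] = buffer[i]
            else:
                buffer.append(self.memory[request.address + i])
        # Step 3: notify the processor that the transfer has completed.
        self.interrupts.append(("done", request.device))


memory = [0] * 8
dma = DmaController(memory, {"disk": [1, 2, 3]})
# Step 1: the processor only describes the transfer.
dma.start(DmaRequest(device="disk", operation="read", address=2, count=3))
print(memory, dma.interrupts)
```

The processor's involvement is limited to building the request and handling the single completion notice.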
DMA and the memory system
With DMA, the relationship between memory and processor is changed
– DMA bypasses address translation and hierarchy
So, should DMA use virtual or physical addresses?
– Virtual addresses: DMA must translate
– Physical addresses: Hard to cross page boundary
DMA address translation
Can provide the DMA with a small address translation table for pages - provided by OS at transfer time
Get the OS to break the transfer into chunks, each chunk relating to a single page
Regardless, the OS cannot relocate pages during transfer
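Breaking a transfer into per-page chunks is simple address arithmetic. The Python sketch below assumes 4096-byte pages and an illustrative function name; the OS would still have to translate each chunk's start address separately.

```python
def split_by_page(virtual_address, length, page_size=4096):
    """Split a transfer into (start_address, size) chunks, one per page."""
    chunks = []
    while length > 0:
        room_in_page = page_size - virtual_address % page_size
        size = min(room_in_page, length)
        chunks.append((virtual_address, size))
        virtual_address += size
        length -= size
    return chunks


print(split_by_page(4000, 5000))
```

No chunk crosses a page boundary, so each one can be handed to the DMA controller with a single physical address.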
DMA Cache problems
DMA can create inconsistencies between cache and main memory
– Called the ‘stale data’ or ‘coherency’ problem
Solve by
– Route all I/O activity through cache (expensive)
– Have the whole cache flushed (easy and not too bad)
– Selectively flush cache (slightly more efficient but lots of control circuit needed)
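The stale data problem, and the selective flush that fixes it, can be reproduced with a toy cache. The `SimpleCache` class below is a deliberately minimal assumption (read-only, one word per line) and not a model of any real cache.

```python
class SimpleCache:
    """Toy read cache in front of a list used as main memory."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # address -> cached value

    def read(self, address):
        if address not in self.lines:
            self.lines[address] = self.memory[address]  # miss: fetch from memory
        return self.lines[address]

    def invalidate(self, address):
        """Selective flush: drop one line so the next read refetches it."""
        self.lines.pop(address, None)


memory = [0] * 4
cache = SimpleCache(memory)
cache.read(1)          # processor caches the old value
memory[1] = 99         # DMA input writes main memory directly, bypassing the cache
stale = cache.read(1)  # processor still sees the old value
cache.invalidate(1)
fresh = cache.read(1)  # after the selective flush the new value is visible
print(stale, fresh)
```

Flushing the whole cache would also work but throws away every other line along with the stale one.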
Parallel Processors
Parallel processing machines are common
– All current G5 Macs have two processors
Parallel categorisation (Flynn 1966)
– Single Instruction Stream, Single Data Stream (SISD)
– Single Instruction, Multiple Data (SIMD - MMX)
– Multiple Instruction, Single Data (MISD - SuperScalar)
– Multiple Instructions, Multiple Data (MIMD - true parallelism)
Directions
Microprocessors get faster every year
Development costs are high
– MIPS R4000: 30 engineers for 3 years; $30 million to develop; $10 million to fabricate; 50,000 hours simulation
– High costs led companies to look at re-using existing chips in multi-processor machines
– This is possible due to improvements in memory and bus technology
– Still expensive to run these things
Evolution vs Revolution
Evolutionary approaches tend to be invisible to users except for
– Lower cost and better performance
Revolutionary approaches require new languages and applications
– Looks good on paper
– Must be worth the effort
– KCM