CS252 Graduate Computer Architecture
Lecture 22
I/O Continued
John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252
http://www-inst.eecs.berkeley.edu/~cs252
Review: A Little Queuing Theory
• Assumptions:
– System in equilibrium; no limit to the queue
– Time between successive arrivals is random and memoryless
• Parameters that describe our system:
– λ: mean number of arriving customers/second
– Tser: mean time to service a customer (“m1”)
– C: squared coefficient of variance = σ²/m1²
– μ: service rate = 1/Tser
– u: server utilization (0 ≤ u ≤ 1): u = λ/μ = λ × Tser
• Parameters we wish to compute:
– Tq: time spent in queue
– Lq: length of queue = λ × Tq (by Little’s law)
• Results:
– Memoryless service distribution (C = 1):
» Called an M/M/1 queue: Tq = Tser × u/(1 – u)
– General service distribution (no restrictions), 1 server:
» Called an M/G/1 queue: Tq = Tser × ½(1 + C) × u/(1 – u)
[Figure: arrival rate λ → queue → server, with service rate μ = 1/Tser]
A Little Queuing Theory: An Example
• Processor accesses memory over a “network”
• DRAM service properties
– Access time = 9 cycles + 2 cycles/word
– With 8-word cache lines: Tser = 25 cycles, μ = 1/25 = 0.04 ops/cycle
– Deterministic service time! (C = 0)
• Processor behavior
– CPI = 1, 40% memory instructions, 7.5% cache misses
– Rate: λ = 1 inst/cycle × 0.4 × 0.075 = 0.03 ops/cycle
• Notation:
λ: average number of arriving customers/cycle = 0.03
Tser: average time to service a customer = 25 cycles
u: server utilization (0..1): u = λ × Tser = 0.03 × 25 = 0.75
Tq: average time/customer in queue = Tser × u × (1 + C) / (2 × (1 – u)) = (25 × 0.75 × ½)/(1 – 0.75) = 37.5 cycles
Tsys: average time/customer in system: Tsys = Tq + Tser = 62.5 cycles
Lq: average length of queue: Lq = λ × Tq = 0.03 × 37.5 = 1.125 requests in queue
Lsys: average number of tasks in system: Lsys = λ × Tsys = 0.03 × 62.5 = 1.875
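As a cross-check (not part of the original slides), here is a minimal C sketch of the M/G/1 formula used above; with C = 1 it reduces to the M/M/1 case used later in the disk example.

```c
#include <stdio.h>

/* M/G/1 queueing formula from the slide:
 *   Tq = Tser * u * (1 + C) / (2 * (1 - u)),  with u = lambda * Tser.
 * For C = 1 this reduces to the M/M/1 result Tq = Tser * u / (1 - u). */
static double time_in_queue(double lambda, double t_ser, double c_sq) {
    double u = lambda * t_ser;                         /* server utilization */
    return t_ser * u * (1.0 + c_sq) / (2.0 * (1.0 - u));
}

int main(void) {
    /* Numbers from the memory-system example: lambda = 0.03 ops/cycle,
     * Tser = 25 cycles, deterministic service time (C = 0). */
    double lambda = 0.03, t_ser = 25.0, c_sq = 0.0;
    double tq   = time_in_queue(lambda, t_ser, c_sq);  /* 37.5 cycles */
    double tsys = tq + t_ser;                          /* 62.5 cycles */
    printf("u=%.2f Tq=%.1f Tsys=%.1f Lq=%.3f Lsys=%.3f\n",
           lambda * t_ser, tq, tsys, lambda * tq, lambda * tsys);
    return 0;
}
```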
Review: A Three-Bus System
• A small number of backplane buses tap into the processor-memory bus
– Processor-memory bus is only used for processor-memory traffic
– I/O buses are connected to the backplane bus
• Advantage: loading on the processor bus is greatly reduced
[Figure: processor (with a backside cache bus to the L2 cache) and memory on the processor-memory bus; bus adaptors connect the backplane bus and the I/O buses]
Main components of Intel Chipset: Pentium 4
• Northbridge:
– Handles memory
– Graphics
• Southbridge: I/O
– PCI bus
– Disk controllers
– USB controllers
– Audio
– Serial I/O
– Interrupt controller
– Timers
[Figure: device controller with a bus interface, read/write and control/status registers (e.g., port 0x20 or memory-mapped region 0x8f008020), addressable memory and/or queues, and the hardware controller itself]
How does the processor actually talk to the device?
• CPU interacts with a controller
– Contains a set of registers that can be read and written
– May contain memory for request queues or bit-mapped images
• Regardless of the complexity of the connections and buses, the processor accesses registers in two ways:
– I/O instructions: in/out instructions
» Example from the Intel architecture: out 0x21, AL
– Memory-mapped I/O: load/store instructions
» Registers/memory appear in the physical address space
» I/O accomplished with load and store instructions
[Figure: CPU, regular memory, and interrupt controller on the processor-memory bus (address + data and interrupt request lines); bus adaptors connect other devices or buses]
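As a concrete illustration of the two access styles above, here is a small C sketch (a sketch only, not from the lecture): a port-I/O write like the slide's `out 0x21, AL`, and a memory-mapped register store. The mapping step and the control-register address are assumptions for illustration, not details of any real device, and this code would only run in a privileged/kernel context.

```c
#include <stdint.h>

/* Port I/O (x86): a privileged 'out' instruction, as in the slide's
 * "out 0x21, AL" example. */
static inline void outb(uint16_t port, uint8_t val) {
    __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
}

/* Memory-mapped I/O: an ordinary store through a volatile pointer.
 * A real driver must first map the device's physical range (e.g., the
 * slide's hypothetical region at 0x8f008020) into its address space. */
static inline void mmio_write32(volatile uint32_t *reg, uint32_t val) {
    *reg = val;
}

void poke_device(volatile uint32_t *mapped_ctrl_reg) {
    outb(0x21, 0xFF);                  /* I/O-instruction style */
    mmio_write32(mapped_ctrl_reg, 1);  /* memory-mapped style   */
}
```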
Example: Memory-Mapped Display Controller
• Memory-mapped:
– Hardware maps control registers and display memory into the physical address space
» Addresses set by hardware jumpers or by programming at boot time
– Simply writing to display memory (also called the “frame buffer”) changes the image on screen
» Addr: 0x8000F000-0x8000FFFF
– Writing a graphics description to the command-queue area
» Say, enter a set of triangles that describe some scene
» Addr: 0x80010000-0x8001FFFF
– Writing to the command register may cause on-board graphics hardware to do something
» Say, render the above scene
» Addr: 0x0007F004
• Can protect with page tables
[Figure: physical address space layout: display memory at 0x8000F000, graphics command queue at 0x80010000 (up to 0x80020000), status register at 0x0007F000, command register at 0x0007F004]
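A minimal C sketch of how a driver might use this hypothetical layout: write pixels directly into the frame buffer, or queue a scene description and then poke the command register. The addresses are the illustrative ones from the slide (assumed already mapped); the command encoding is invented.

```c
#include <stdint.h>

/* Hypothetical layout from the slide:
 *   frame buffer:     0x8000F000-0x8000FFFF
 *   command queue:    0x80010000-0x8001FFFF
 *   command register: 0x0007F004                 */
#define FRAME_BUF ((volatile uint8_t  *)0x8000F000u)
#define CMD_QUEUE ((volatile uint32_t *)0x80010000u)
#define CMD_REG   ((volatile uint32_t *)0x0007F004u)

void draw(void) {
    /* Option 1: write display memory directly -- the image changes. */
    FRAME_BUF[0] = 0xFF;

    /* Option 2: describe a scene in the command queue ...           */
    CMD_QUEUE[0] = 3;        /* invented encoding: triangle count    */
    CMD_QUEUE[1] = 0x1234;   /* ... vertex data would follow ...     */

    /* ... then tell the on-board hardware to render it.             */
    *CMD_REG = 1;            /* invented "render scene" command      */
}
```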
Case for Storage
• Shift in focus from computation to communication and storage of information
– E.g., Cray Research/Thinking Machines vs. Google/Yahoo
– “The Computing Revolution” (1960s to 1980s); “The Information Age” (1990 to today)
• Storage emphasizes reliability and scalability as well as cost-performance
• What is the “software king” that determines which hardware features actually get used?
– Operating system for storage
– Compiler for processor
• Storage also has its own performance theory (queuing theory), which balances throughput vs. response time
Hard Disk Drives
[Photos: IBM/Hitachi Microdrive; Western Digital drive (http://www.storagereview.com/guide/); read/write head, side view]
Historical Perspective
• 1956 IBM RAMAC to early-1970s Winchester
– Developed for mainframe computers, proprietary interfaces
– Steady shrink in form factor: 27 in. to 14 in.
• Form factor and capacity drive the market more than performance
• 1970s developments
– 5.25-inch floppy disk form factor (microcode into mainframe)
– Emergence of industry-standard disk interfaces
• Early 1980s: PCs and first-generation workstations
• Mid 1980s: client/server computing
– Centralized storage on file servers
» Accelerates disk downsizing: 8 inch to 5.25 inch
– Mass-market disk drives become a reality
» Industry standards: SCSI, IPI, IDE
» 5.25-inch to 3.5-inch drives for PCs; end of proprietary interfaces
• 1990s: laptops => 2.5-inch drives
• 2000s: shift to perpendicular recording
– 2006: Seagate introduces 750 GB drive
– 2007: Seagate promises 1 TB drive by second quarter
Disk History
[Chart: data density (Mbit/sq. in.) and capacity of unit shown (MBytes)]
– 1973: 1.7 Mbit/sq. in., 140 MBytes
– 1979: 7.7 Mbit/sq. in., 2,300 MBytes
Source: New York Times, 2/23/98, page C3, “Makers of disk drives crowd even more data into even smaller spaces”
Disk History
– 1989: 63 Mbit/sq. in., 60,000 MBytes
– 1997: 1,450 Mbit/sq. in., 2,300 MBytes
– 1997: 3,090 Mbit/sq. in., 8,100 MBytes
Source: New York Times, 2/23/98, page C3, “Makers of disk drives crowd even more data into even smaller spaces”
Properties of a Hard Magnetic Disk
• Properties
– Independently addressable element: sector
» OS always transfers groups of sectors together (“blocks”)
– A disk can directly access any given block of information it contains (random access); any file can be accessed either sequentially or randomly
– A disk can be rewritten in place: it is possible to read/modify/write a block on the disk
• Typical numbers (depending on the disk size):
– 500 to more than 20,000 tracks per surface
– 32 to 800 sectors per track
» A sector is the smallest unit that can be read or written
• Zoned bit recording
– Constant bit density: more sectors on outer tracks
– Speed varies with track location
[Figure: platters divided into tracks and sectors]
MBits per square inch: DRAM as % of Disk over time
[Chart: DRAM areal density as a percentage of disk areal density (0%-50%), 1974-1998; sample points: 0.2 vs. 1.7, 9 vs. 22, and 470 vs. 3,000 Mbit/sq. in.]
Source: New York Times, 2/23/98, page C3, “Makers of disk drives crowd even more data into even smaller spaces”
Nano-layered Disk Heads
• Special sensitivity of the disk head comes from the “Giant Magneto-Resistive” (GMR) effect
• IBM is (was) the leader in this technology
– Same technology as the TMJ-RAM breakthrough
[Figure: head structure, with coil for writing]
Disk Figure of Merit: Areal Density
• Bits recorded along a track
– Metric is Bits Per Inch (BPI)
• Number of tracks per surface
– Metric is Tracks Per Inch (TPI)
• Disk designs brag about bit density per unit area
– Metric is Bits Per Square Inch: Areal Density = BPI x TPI

Year   Areal Density (Mbit/sq. in.)
1973   2
1979   8
1989   63
1997   3,090
2000   17,100
2006   130,000
2007   164,000
Newest technology: Perpendicular Recording
• In perpendicular recording:
– Bit densities are much higher
– Magnetic material placed on top of magnetic underlayer that reflects recording head and effectively doubles recording field
Seagate Barracuda
• 750 GB! 130 Gbit/in²
• 4 platters, 2 heads each
• 3.5" platters
• Perpendicular recording
• 7200 RPM
• 4.2 ms latency
• 100 MB/sec transfer speed
• 16 MB cache
• Technology demonstrations:
– 1 TB drive due in the next couple of months: 164 Gbit/in²
– Last year, Seagate demonstrated it could do 421 Gbit/in²!
Disk Device Terminology
• Several platters, with information recorded magnetically on both surfaces (usually)
• The actuator moves the head (at the end of an arm, one per surface) over a track (“seek”), selects a surface, waits for the sector to rotate under the head, then reads or writes
– “Cylinder”: all tracks under heads
• Bits are recorded in tracks, which are in turn divided into sectors (e.g., 512 bytes)
[Figure: platter with outer and inner tracks divided into sectors; actuator, arm, and head]
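Since the geometry terms above (cylinder, head/surface, sector) map onto a linear block address in the classic fixed-geometry view of a drive, here is a small illustrative C helper for that mapping; the geometry parameters are assumptions, not values from any real disk.

```c
#include <stdio.h>
#include <stdint.h>

/* Classic cylinder/head/sector -> logical block address mapping,
 * assuming a fixed geometry (pre-zoned-recording view). */
static uint64_t chs_to_lba(uint64_t cyl, uint64_t head, uint64_t sector,
                           uint64_t heads_per_cyl, uint64_t sectors_per_track) {
    /* sectors are conventionally numbered starting at 1 */
    return (cyl * heads_per_cyl + head) * sectors_per_track + (sector - 1);
}

int main(void) {
    /* e.g., cylinder 2, head 1, sector 3 on an assumed geometry of
     * 4 heads and 63 sectors per track */
    printf("LBA = %llu\n", (unsigned long long)chs_to_lba(2, 1, 3, 4, 63));
    return 0;
}
```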
Disk Time Example
• Disk parameters:
– Transfer size is 8 KB
– Advertised average seek is 12 ms
– Disk spins at 7200 RPM
– Transfer rate is 4 MB/sec
• Controller overhead is 2 ms
• Assume the disk is idle, so there is no queuing delay
• Disk latency = queuing time + seek time + rotation time + transfer time + controller time
• What is the average disk access time for a sector?
– Avg seek + avg rotational delay + transfer time + controller overhead
– 12 ms + 0.5/(7200 RPM/60) + 8 KB/(4 MB/s) + 2 ms
– 12 + 4.17 + 2 + 2 ≈ 20 ms
• Advertised seek time assumes no locality; real seeks are typically 1/4 to 1/3 of the advertised time, so the access time drops from 20 ms to roughly 12 ms
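The same arithmetic in a minimal C sketch (not from the lecture), using the parameters above:

```c
#include <stdio.h>

/* Disk access time = seek + rotational delay + transfer + controller overhead.
 * Parameters from the slide: 12 ms advertised seek, 7200 RPM, 4 MB/s
 * transfer rate, 8 KB transfer, 2 ms controller overhead. */
int main(void) {
    double seek_ms    = 12.0;
    double rpm        = 7200.0;
    double xfer_rate  = 4.0e6;     /* bytes per second */
    double xfer_bytes = 8.0e3;     /* 8 KB, as the slide rounds it */
    double ctrl_ms    = 2.0;

    double rot_ms  = 0.5 / (rpm / 60.0) * 1000.0;   /* half a rotation */
    double xfer_ms = xfer_bytes / xfer_rate * 1000.0;

    printf("access time = %.1f ms\n", seek_ms + rot_ms + xfer_ms + ctrl_ms);
    return 0;
}
```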
A Little Queuing Theory: An Example
• Processor sends 10 × 8 KB disk I/Os per second; requests and service times are exponentially distributed; average disk service time = 20 ms
• On average, how utilized is the disk?
– What is the number of requests in the queue?
– What is the average time spent in the queue?
– What is the average response time for a disk request?
• Notation:
λ: average number of arriving customers/second = 10
Tser: average time to service a customer = 20 ms (0.02 s)
u: server utilization (0..1): u = λ × Tser = 10/s × 0.02 s = 0.2
Tq: average time/customer in queue = Tser × u / (1 – u) = 20 × 0.2/(1 – 0.2) = 20 × 0.25 = 5 ms (0.005 s)
Tsys: average time/customer in system: Tsys = Tq + Tser = 25 ms
Lq: average length of queue: Lq = λ × Tq = 10/s × 0.005 s = 0.05 requests in queue
Lsys: average number of tasks in system: Lsys = λ × Tsys = 10/s × 0.025 s = 0.25
Alternative Data Storage Technologies: Early 1990s
Technology              Cap (MB)   BPI      TPI      BPI*TPI (Million)   Data Xfer (KB/s)   Access Time
Conventional tape:
  Cartridge (.25")      150        12,000   104      1.2                 92                 minutes
  IBM 3490 (.5")        800        22,860   38       0.9                 3,000              seconds
Helical scan tape:
  Video (8mm)           4,600      43,200   1,638    71                  492                45 secs
  DAT (4mm)             1,300      61,000   1,870    114                 183                20 secs
Magnetic & optical disk:
  Hard disk (5.25")     1,200      33,528   1,880    63                  3,000              18 ms
  IBM 3390 (10.5")      3,800      27,940   2,235    62                  4,250              20 ms
  Sony MO (5.25")       640        24,130   18,796   454                 88                 100 ms
Tape vs. Disk
• Longitudinal tape uses same technology as hard disk; tracks its density improvements
• Disk head flies above surface, tape head lies on surface
• Disk fixed, tape removable
• Inherent cost-performance based on geometries: fixed rotating platters with gaps (random access, limited area, 1 media / reader)vs. removable long strips wound on spool (sequential access, "unlimited" length, multiple / reader)
• New technology trend: helical scan (VCR, camcorder, DAT) spins the head at an angle to the tape to improve density
Current Drawbacks to Tape
• Tape wear-out:
– 100s of passes for helical, 1000s for longitudinal
• Head wear-out:
– 2,000 hours for helical
• Both must be accounted for in economic / reliability model
• Long rewind, eject, load, spin-up times; not inherent, just no need in marketplace (so far)
• Designed for archival
Future Disk Size and Performance
• Continued advance in capacity (60%/yr) and bandwidth (40%/yr)
• Slow improvement in seek, rotation (8%/yr)
• Time to read whole disk:

Year   Sequentially   Randomly (1 sector/seek)
1990   4 minutes      6 hours
2000   12 minutes     1 week (!)
2006   56 minutes     3 weeks (SCSI)
2006   171 minutes    7 weeks (SATA)
Use Arrays of Small Disks?
[Figure: conventional designs use four disk form factors (14", 10", 5.25", 3.5") from low end to high end; a disk array uses a single 3.5" disk design]
• Katz and Patterson asked in 1987: can smaller disks be used to close the gap in performance between disks and CPUs?
Replace Small # of Large Disks with Large # of Small Disks! (1988 Disks)
                  IBM 3390 (K)   IBM 3.5" 0061   x70 (array)
Data Capacity     20 GBytes      320 MBytes      23 GBytes
Volume            97 cu. ft.     0.1 cu. ft.     11 cu. ft.
Power             3 KW           11 W            1 KW
Data Rate         15 MB/s        1.5 MB/s        120 MB/s
I/O Rate          600 I/Os/s     55 I/Os/s       3,900 I/Os/s
MTTF              250 KHrs       50 KHrs         ??? Hrs
Cost              $250K          $2K             $150K

Disk arrays have potential for:
– large data and I/O rates
– high MB per cu. ft., high MB per KW
– reliability?
Advantages of Small Form-Factor Disk Drives
• Low cost/MB
• High MB/volume
• High MB/watt
• Low cost/actuator
Cost and Environmental Efficiencies
Array Reliability
• Reliability of N disks = Reliability of 1 Disk ÷ N
50,000 Hours ÷ 70 disks = 700 hours
Disk system MTTF: Drops from 6 years to 1 month!
• Arrays (without redundancy) too unreliable to be useful!
Hot spares support reconstruction in parallel with access: very high media availability can be achieved
Redundant Arrays of Disks
• Files are "striped" across multiple spindles
• Redundancy yields high data availability
– Disks will fail
– Contents are reconstructed from data redundantly stored in the array
» Capacity penalty to store it
» Bandwidth penalty to update it
• Techniques:
– Mirroring/shadowing (high capacity cost)
– Horizontal Hamming codes (overkill)
– Parity & Reed-Solomon codes
– Failure prediction (no capacity overhead!): VaxSimPlus; technique is controversial
Redundant Arrays of Disks
RAID 1: Disk Mirroring/Shadowing
• Each disk is fully duplicated onto its "shadow": very high availability can be achieved
• Bandwidth sacrifice on write: Logical write = two physical writes
• Reads may be optimized
• Most expensive solution: 100% capacity overhead
• Targeted for high-I/O-rate, high-availability environments
[Figure: each disk and its shadow form a recovery group]
Redundant Arrays of Disks RAID 5+: High I/O Rate Parity
• A logical write becomes four physical I/Os
• Independent writes are possible because of interleaved parity
• Reed-Solomon codes ("Q") for protection during reconstruction
D0 D1 D2 D3 P
D4 D5 D6 P D7
D8 D9 P D10 D11
D12 P D13 D14 D15
P D16 D17 D18 D19
D20 D21 D22 D23 P
(pattern continues down the array)
[Figure: data and parity striped across the disk columns; logical disk addresses increase down the stripes; a row of stripe units forms a stripe]
• Targeted for mixed applications
Problems of Disk Arrays: Small Writes
[Figure: RAID-5 small write. To replace old data D0 with new data D0' in stripe D0 D1 D2 D3 P: (1) read the old data D0, (2) read the old parity P, XOR old data and old parity with the new data to form P', then (3) write D0' and (4) write P', leaving stripe D0' D1 D2 D3 P']
RAID-5: Small Write Algorithm
1 Logical Write = 2 Physical Reads + 2 Physical Writes
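A minimal C sketch of the parity arithmetic behind the figure (I/O scheduling and disk layout omitted): the new parity is computed from only the old data, old parity, and new data, which is why a small write costs two reads plus two writes.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* RAID-5 small-write parity update: P' = P xor D_old xor D_new.
 * Only the old data block and the old parity block need to be read. */
static void raid5_small_write_parity(const uint8_t *old_data,
                                     const uint8_t *new_data,
                                     const uint8_t *old_parity,
                                     uint8_t *new_parity, size_t len) {
    for (size_t i = 0; i < len; i++)
        new_parity[i] = old_parity[i] ^ old_data[i] ^ new_data[i];
}

int main(void) {
    uint8_t d0_old[4] = {1, 2, 3, 4}, d0_new[4] = {9, 8, 7, 6};
    uint8_t p_old[4]  = {0xA, 0xB, 0xC, 0xD}, p_new[4];

    raid5_small_write_parity(d0_old, d0_new, p_old, p_new, sizeof p_new);
    /* The caller would now write d0_new and p_new back to their disks
     * (physical writes 3 and 4 in the figure). */
    printf("new parity: %02x %02x %02x %02x\n",
           p_new[0], p_new[1], p_new[2], p_new[3]);
    return 0;
}
```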
Subsystem Organization
[Figure: host → host adapter → array controller → several single-board disk controllers]
– Host adapter: manages the interface to the host, DMA
– Array controller: control, buffering, parity logic
– Single-board disk controller: physical device control; often piggy-backed in small-format devices
• Striping software is off-loaded from the host to the array controller
– No application modifications
– No reduction of host performance
System Availability: Orthogonal RAIDs
[Figure: an array controller connected to several string controllers, each driving a string of disks; data recovery groups cut across the strings]
Data Recovery Group: unit of data redundancy
Redundant Support Components: fans, power supplies, controller, cables
End to End Data Integrity: internal parity protected data paths
System-Level Availability
[Figure: fully dual-redundant configuration: two hosts, two I/O controllers, two array controllers, with recovery groups spanning the duplicated paths]
Goal: no single points of failure
• With duplicated paths, higher performance can be obtained when there are no failures
OceanStore: Global-Scale Persistent Storage
Utility-based Infrastructure
[Figure: OceanStore spanning a confederation of providers, e.g., Pac Bell, Sprint, IBM, AT&T, Canadian OceanStore]
• Service provided by a confederation of companies
– Monthly fee paid to one service provider
– Companies buy and sell capacity from each other
Important P2P Technology (Decentralized Object Location and Routing)
[Figure: a DOLR overlay routes messages to objects by GUID (GUID1, GUID2)]
Peer-to-peer systems can be very stable
(May 2003: 1.5 TB over 4 hours)
(In JSAC, to appear)
The Path of an OceanStore Update
[Figure: clients send updates to the inner-ring servers; results propagate through multicast trees to second-tier caches]
Archival Dissemination of Fragments
Aside: Why Erasure Coding? High durability/overhead ratio!
• Exploit the law of large numbers for durability!
• With a 6-month repair epoch, Fraction of Blocks Lost Per Year (FBLPY):
– Replication: 0.03
– Fragmentation (erasure coding): 10^-35
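To see where the "law of large numbers" advantage comes from, here is a small C sketch that compares the block-loss probability of replication vs. an erasure code with the same storage overhead. The per-fragment survival probability and the (4-way vs. 16-of-32) parameters are illustrative assumptions, not the ones behind the FBLPY numbers above.

```c
#include <stdio.h>
#include <math.h>

/* Probability that a block is lost, i.e., fewer than m of its n fragments
 * survive, when each fragment independently survives with probability p. */
static double block_loss_prob(int n, int m, double p) {
    double loss = 0.0;
    for (int k = 0; k < m; k++) {   /* k = number of surviving fragments */
        double log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1);
        loss += exp(log_comb + k * log(p) + (n - k) * log(1.0 - p));
    }
    return loss;
}

int main(void) {
    double p = 0.9;  /* assumed per-fragment survival probability per epoch */
    /* 4-way replication (need 1 of 4) vs. a 16-of-32 erasure code:
     * both have 4x storage overhead, but very different loss rates. */
    printf("replication (1 of 4):  %.3e\n", block_loss_prob(4, 1, p));
    printf("erasure    (16 of 32): %.3e\n", block_loss_prob(32, 16, p));
    return 0;
}
```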
The Berkeley PetaByte Archival Service
• OceanStore concepts applied to tape-less backup
– Self-replicating, self-repairing, self-managing
– No need for actual tape in the system
» (Although it could be there to keep with tradition)
Summary
• Disk industry growing rapidly; improves:
– bandwidth 40%/yr
– areal density 60%/yr; $/MB faster?
• Disk access time = queue + controller + seek + rotate + transfer
• The advertised average seek time benchmark is much greater than the average seek time in practice
• Redundancy useful to gain reliability
– Redundant disks + controllers + etc. (RAID)
– Geographical-scale systems (OceanStore)
• Queueing theory: Tq = Tser × ½(1 + C) × u/(1 – u) in general (M/G/1); for C = 1 (M/M/1), Tq = Tser × u/(1 – u)