1/35
Stonehenge: Multi-Dimensional Storage Virtualization
Lan Huang, IBM Almaden Research Center
Joint work with Gang Peng and Tzi-cker Chiueh
SUNY Stony Brook
June, 2004
2/35
Introduction
Storage growth is phenomenal: new hardware
Isolated storage: resource waste
Management
[Figure: clients access a database server and a file server over an IP LAN/MAN/WAN.]
[Figure: disk areal density growth, 1970-2000, on a log scale; data after Patterson'98.]
Huge amounts of data on heterogeneous devices, spread out everywhere.
3/35
Storage Virtualization
Hide physical details from high-level applications. Examples: LVM, xFS, StorageTank.
[Figure: the storage virtualization layer plays the role the OS plays for applications: storage management sees an abstract interface of virtual disks, while the layer manages the underlying hardware resources (disks, controllers, physical disks) on behalf of clients.]
4/35
Storage Virtualization
Storage consolidation
A VD should be as tangible as a PD: capacity, throughput, latency
Resource efficiency E_i
5/35
Stonehenge Overview
Input: VD specifications (B, C, D, E); Output: VDs with performance guarantees
High-level goals: storage consolidation, performance isolation, efficiency, performance
[Figure: clients reach a database server and a file server over an IP LAN/MAN/WAN; the servers share a Stonehenge storage cluster on a LAN.]
6/35
Hardware Organization
[Figure: hardware organization. Client machines run applications and an in-kernel storage clerk (file interface); storage servers with disk arrays export an object interface; a storage manager exchanges control messages while data and commands travel over a gigabit network between clients and servers.]
7/35
Key Issues in Stonehenge
How to ease the task of storage management? Centralization, virtualization, consolidation
How to achieve performance isolation among virtual disks? Run-time QoS guarantees
How to do it efficiently? Efficiency-aware algorithms and dynamic adaptive feedback
8/35
Key Components
Mapper, CVC scheduler, and the feedback path between them
9/35
Virtual to Physical Disk Mapping
Multi-dimensional disk mapping is NP-complete
Goal: maximize resource utilization
Heuristic: maximize a goal function G [Toyoda75] (a greedy sketch follows below)
Input: VDs, PDs; Objective: max(G); Output: a VD-to-PD mapping
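A minimal sketch of a greedy goal-function mapping heuristic in the spirit of [Toyoda75]; the two resource dimensions (capacity, IOPS), the goal function, and all names are illustrative assumptions, not Stonehenge's actual mapper.

```python
def residual(pd, placed):
    """Remaining (capacity, IOPS) on a physical disk after the VDs already placed on it."""
    cap = pd["capacity"] - sum(vd["capacity"] for vd in placed)
    iops = pd["iops"] - sum(vd["iops"] for vd in placed)
    return cap, iops

def goal(pds, mapping):
    """Goal function G: utilization averaged over both dimensions and all disks."""
    total = 0.0
    for i, pd in enumerate(pds):
        cap, iops = residual(pd, mapping[i])
        total += (1 - cap / pd["capacity"]) + (1 - iops / pd["iops"])
    return total / (2 * len(pds))

def map_vds(vds, pds):
    """Greedily place each VD on the PD that maximizes G, largest VDs first."""
    mapping = {i: [] for i in range(len(pds))}
    for vd in sorted(vds, key=lambda v: v["capacity"], reverse=True):
        best, best_g = None, -1.0
        for i, pd in enumerate(pds):
            cap, iops = residual(pd, mapping[i])
            if vd["capacity"] <= cap and vd["iops"] <= iops:  # feasible in every dimension
                mapping[i].append(vd)
                g = goal(pds, mapping)
                if g > best_g:
                    best, best_g = i, g
                mapping[i].pop()
        if best is None:
            raise ValueError("no PD can host this VD")
        mapping[best].append(vd)
    return mapping

# Example usage:
# pds = [{"capacity": 100, "iops": 200}, {"capacity": 100, "iops": 200}]
# vds = [{"capacity": 60, "iops": 90}, {"capacity": 50, "iops": 80}]
# print(map_vds(vds, pds))
```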
10/35
Islands Effect
[Figure: VDs mapped onto PDs 1-4, illustrating the islands effect.]
11/35
Key Components
Mapper, CVC scheduler, and the feedback path between them
12/35
Requirements of Real-time Disk Scheduling
Disk-specific: improve disk bandwidth utilization (SATF, CSCAN, etc.)
Non-disk-specific: meet real-time requests' deadlines; allocate disk bandwidth fairly among virtual disks (virtual clock scheduling)
Key: bandwidth guarantee
[Figure: components of disk service time: seek, rotation, transfer, other.]
13/35
CVC Algorithm
Two queues per VD(m): a finish-tag (FT) queue and an LBA queue
Finish tags: FT(i) = max(FT(i-1), real time) + 1/IOPS_m
The LBA queue is used only if the FT queue's slack time allows it:
real time + service time(R) < start deadline of the next FT request
[Figure: the CVC scheduler draws from per-VD FT and LBA queues; a sketch follows below.]
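A minimal sketch of the two-queue finish-tag idea described on this slide; the data structures, lazy removal, and the service-time estimate are illustrative assumptions, not Stonehenge's kernel code.

```python
import heapq
import time

class CVCQueue:
    """Per-VD queues for a sketch of the CVC idea: an FT (finish-tag) queue ordered by
    virtual-clock finish tags and an LBA queue ordered by disk address."""

    def __init__(self, iops_reservation):
        self.iops = iops_reservation   # IOPS_m reserved for this VD
        self.last_ft = 0.0
        self.ft_queue = []             # (finish_tag, seq, request)
        self.lba_queue = []            # (lba, seq, request)
        self.done = set()              # requests already dispatched (lazy removal)
        self.seq = 0

    def submit(self, request, lba):
        # FT(i) = max(FT(i-1), real time) + 1 / IOPS_m
        now = time.monotonic()
        ft = max(self.last_ft, now) + 1.0 / self.iops
        self.last_ft = ft
        self.seq += 1
        heapq.heappush(self.ft_queue, (ft, self.seq, request))
        heapq.heappush(self.lba_queue, (lba, self.seq, request))

    def _pop(self, queue):
        while queue:
            _, seq, req = heapq.heappop(queue)
            if seq not in self.done:
                self.done.add(seq)
                return req
        return None

    def next_request(self, est_service_time):
        """Prefer the seek-friendly LBA order only when slack allows it:
        real time + estimated service time < finish tag of the next FT request."""
        while self.ft_queue and self.ft_queue[0][1] in self.done:
            heapq.heappop(self.ft_queue)   # drop entries already served via the LBA queue
        now = time.monotonic()
        if self.ft_queue and now + est_service_time < self.ft_queue[0][0]:
            req = self._pop(self.lba_queue)
            if req is not None:
                return req
        return self._pop(self.ft_queue)
```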
14/35
Real-life Deployment
Dispatch the next N requests from the LBA queue
The next batch is not issued until the previous batch is done
[Figure: the CVC scheduler feeds batches of requests to the storage controller's on-disk scheduler.]
16/35
CVC Performance
3 VDs driven by real-life traces: video streaming, web, financial, TPC-C
The traces touch 40% of the storage space
[Figure: results for video streams and for mixed traces.]
17/35
Impact of Disk I/O Time Estimate
Can disk I/O time be modeled? For ATA disks it is impossible [ECSL TR-81]; for SCSI disks, perhaps
Instead, measure it at run time: P(I/O time)
18/35
CVC Latency Bound
If the traffic generated within the period [0, t] satisfies V(t) <= T + r*t, then:
(1) D <= (T + L_max)/B_i + L_max/C
(2) Storage system: D <= ((N+1)*k*C + T + L_max)/B_i + (k*C + L_max)/C
(3) Stonehenge: D <= (N+1)/IOPS_i + 1/IOPS_max
[Figure: a VD(m) FT queue holding N requests (T bytes) served at IOPS(m) out of IOPS(max); disk service time components: seek, rotation, transfer, other.]
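For reference, the three bounds restated as small functions; the symbols follow the slide's notation, and the numeric example is purely illustrative.

```python
def bound_1(T, L_max, B_i, C):
    """(1) D <= (T + L_max)/B_i + L_max/C, for traffic with V(t) <= T + r*t."""
    return (T + L_max) / B_i + L_max / C

def bound_storage_system(N, k, C, T, L_max, B_i):
    """(2) D <= ((N+1)*k*C + T + L_max)/B_i + (k*C + L_max)/C."""
    return ((N + 1) * k * C + T + L_max) / B_i + (k * C + L_max) / C

def bound_stonehenge(N, iops_i, iops_max):
    """(3) D <= (N+1)/IOPS_i + 1/IOPS_max."""
    return (N + 1) / iops_i + 1.0 / iops_max

# Example: 8 queued requests on a VD reserved at 100 IOPS, on a disk whose worst-case
# rate is 200 IOPS, gives D <= (8+1)/100 + 1/200 = 0.095 s.
print(bound_stonehenge(N=8, iops_i=100, iops_max=200))
```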
19/35
Key Components
Mapper, CVC scheduler, and the feedback path between them
Relaxing the worst-case service time estimate
VD multiplexing effect
20/35
Empirical Latency vs Worst Case
Approximate P(service time, N) with P(service time, N-1); Q is the inverse of P (see the sketch below)
D <= (Q(0.95) + s) * [(N+1)/IOPS_i + 1/IOPS_max]
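A sketch of replacing the worst-case service time with a measured 95th percentile. The reading below treats (Q(0.95) + s) as a fraction of the worst-case service time, as the percentages in Table 2 suggest; that interpretation, the sampling, and the slack s are assumptions.

```python
def q_service(samples, p=0.95):
    """Empirical p-quantile of measured per-request service times."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(p * len(ordered)))]

def empirical_latency_bound(samples, worst_case_service, N, iops_i, iops_max, s=0.05):
    """D <= (Q(0.95) + s) * [(N+1)/IOPS_i + 1/IOPS_max], with Q(0.95) expressed as a
    fraction of the worst-case service time and s a small safety slack."""
    q_frac = q_service(samples) / worst_case_service
    return (q_frac + s) * ((N + 1) / iops_i + 1.0 / iops_max)
```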
21/35
Bursty I/O Traffic and Pspare
I/O traffic is self-similar; multiplexing bursty VDs leaves spare capacity, characterized by P_spare(x) (sketched below)
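A sketch of turning measured aggregate I/O rates into the spare-capacity quantile used later for admission, following the later slide's definition Q_spare(x) = IOPS_max - Q_I/O_rate(x); the measurement window and names are assumptions.

```python
def q_io_rate(rate_samples, x):
    """Q_I/O_rate(x): x-quantile of the measured aggregate I/O rate (IOPS) of the
    already-admitted VDs over a recent measurement window."""
    ordered = sorted(rate_samples)
    return ordered[min(len(ordered) - 1, int(x * len(ordered)))]

def q_spare(rate_samples, iops_max, x):
    """Q_spare(x) = IOPS_max - Q_I/O_rate(x): capacity left over thanks to the
    multiplexing of bursty, self-similar per-VD traffic."""
    return iops_max - q_io_rate(rate_samples, x)
```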
22/35
Latency or Throughput Bound
Given a VD specification (B_throughput, C, D, E), the deadline D maps to an equivalent rate B_latency, giving (B_throughput, C, B_latency, E)
B_throughput >= B_latency: the VD is throughput bound
B_throughput < B_latency: the VD is latency bound
Reserve B_throughput, B_latency, or even less?
23/35
MBAC for Latency Bound VDs
When the jth VD with requirements (D_j, IOPS''_j, C_j, E) arrives:
1. For each 0 < i <= j, convert D_i to IOPS'_i via D_i <= (Q_service(0.95) + s) * [(N+1)/IOPS'_i + 1/IOPS_max], and let IOPS_i = max(IOPS'_i, IOPS''_i)
2. If sum(IOPS_i) < IOPS_max, accept the new VD; otherwise, reject it (sketched below)
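A sketch of this admission test; deadline_to_iops inverts the empirical bound above, and the VD representation and parameter names are assumptions.

```python
def deadline_to_iops(D, q95_plus_s, N, iops_max):
    """Smallest IOPS' satisfying D <= (Q_service(0.95) + s) * [(N+1)/IOPS' + 1/IOPS_max]."""
    slack = D / q95_plus_s - 1.0 / iops_max
    if slack <= 0:
        return float("inf")          # deadline unreachable even with the whole disk
    return (N + 1) / slack

def admit_latency_bound(admitted_vds, new_vd, q95_plus_s, N, iops_max):
    """Step 1: convert each VD's deadline D_i into an IOPS requirement and take the max
    with its explicit throughput requirement IOPS''_i.  Step 2: accept the new VD only
    if the total reservation still fits within IOPS_max."""
    total = 0.0
    for vd in list(admitted_vds) + [new_vd]:     # vd = {"D": deadline_s, "iops": IOPS''}
        iops_from_deadline = deadline_to_iops(vd["D"], q95_plus_s, N, iops_max)
        total += max(iops_from_deadline, vd["iops"])
    return total < iops_max
```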
24/35
MBAC Performance Pservice
Table 1. Maximum number of VDs accepted.

        VD Type    Probability   Deterministic   MBAC   Oracle
Run 1   Financial  95%           7               20     22
Run 2   Mixed      95%           7               14     14
Run 3   Mixed      85%           7               17     17

Table 2. Resource reservation.

Number of VDs      7     9     10    11    13    14    15
Q_service(0.95)    11%   15%   19%   24%   37%   49%   -
MBAC               N/A   38%   43%   47%   55%   67%   95%
Deterministic      90%   -     -     -     -     -     -
25/35
MBAC for Throughput Bound VDs
When the jth VD (D_j, IOPS''_j, C_j, E) arrives:
1. Convert D_j to IOPS'_j via D_j <= (Q_service(0.95) + s) * [(N+1)/IOPS'_j + 1/IOPS_max], and let IOPS_j = max(IOPS'_j, IOPS''_j)
2. If IOPS_j < Q_spare(E), admit the new VD; otherwise, reject it
26/35
MBAC Performance Pspare
[Figure: traces used: VD 0 = TPC-C, VD 1 = financial, VD 2 = web search.]
27/35
Measurement-based Admission Control (MBAC)
When the jth VD with requirements (D_j, IOPS''_j, C_j, E) arrives:
1. For each 0 < i <= j, convert D_i to IOPS'_i via D_i <= (Q_service(0.95) + s) * [(N+1)/IOPS'_i + 1/IOPS_max], and let IOPS_i = max(IOPS'_i, IOPS''_i)
2. Group the VDs into two sets: a throughput-bound set T and a latency-bound set L
3. For the throughput-bound VDs, compute the combined Q_I/O_rate and let Q_spare(x) = IOPS_max - Q_I/O_rate(x)
4. If sum(IOPS(L)) < Q_spare(E), accept the new VD; otherwise, reject it (sketched below)
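A sketch of steps 1-4 as a single admission check, mirroring the deadline conversion and Q_spare helpers sketched earlier but inlined so the function stands alone; all names and the VD representation are assumptions.

```python
def admit_vd(admitted_vds, new_vd, q95_plus_s, N, iops_max, io_rate_samples, E):
    """Convert deadlines to IOPS, split VDs into throughput-bound and latency-bound
    sets, compute Q_spare(E) from the measured aggregate rate of the throughput-bound
    VDs, and admit only if the latency-bound reservations fit in the spare capacity."""

    def deadline_to_iops(D):
        # invert D <= (Q_service(0.95) + s) * [(N+1)/IOPS' + 1/IOPS_max]
        slack = D / q95_plus_s - 1.0 / iops_max
        return float("inf") if slack <= 0 else (N + 1) / slack

    def q_spare(x):
        # Q_spare(x) = IOPS_max - x-quantile of the measured aggregate I/O rate
        ordered = sorted(io_rate_samples)
        return iops_max - ordered[min(len(ordered) - 1, int(x * len(ordered)))]

    latency_sum = 0.0
    for vd in list(admitted_vds) + [new_vd]:          # vd = {"D": deadline_s, "iops": IOPS''}
        iops_from_deadline = deadline_to_iops(vd["D"])
        if iops_from_deadline > vd["iops"]:           # latency bound: the deadline dominates
            latency_sum += max(iops_from_deadline, vd["iops"])
        # throughput-bound VDs are already reflected in io_rate_samples
    return latency_sum < q_spare(E)
```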
28/35
Issues with Measurement
Stability: the I/O rate pattern is stable; boundary cases arise for P_service
The overhead of monitoring is trivial
Window size of the measurements
29/35
Putting It All Together: Stonehenge
Functionality: a general-purpose IP storage cluster
Performance: scheduling
Efficiency: measurement
30/35
Software Architecture
[Figure: software architecture. Clients run an iSCSI initiator in the kernel; Stonehenge storage servers run, in the kernel, the FETD (front-end target driver), virtual and physical mapping tables (V. Table, P. Table), traffic shaping, schedulers with per-VD queues, and the IDE mid-layer driver; the disk mapper and admission controller round out the system.]
31/35
Effectiveness of QoS Guarantees in Stonehenge
[Figure: (a) CVC, (b) CSCAN, (c) deadline violation percentage.]
32/35
Impact of Leeway Factor
[Figure: overload probability and violation percentage as the leeway factor varies.]
33/35
Overall System Performance and Latency Breakdown
Hardware: 1 GHz CPU; IBM 7200 RPM ATA disk array; Promise IDE controllers; 64-bit, 66 MHz PCI bus; Intel Gigabit NICs

Software module    Average latency (usec)
iSCSI client       57
iSCSI server       507
Disk access        1360
Central            50
Network delay 1    574
Network delay 2    2

A maximum of 55 MB/sec per server.
34/35
Related Work
Storage management: Minerva and related work at HP Labs
Efficiency-aware disk schedulers: Cello, Prism, YFQ
Run-time QoS guarantees: web servers, video servers, network QoS
IP storage
35/35
Conclusion
An IP storage cluster consolidates storage and reduces fragmentation by 20-30%.
The efficiency-aware CVC real-time disk scheduler, with dynamic I/O time estimation, provides performance guarantees and good disk head utilization.
Measurement feedback effectively remedies over-provisioning:
Latency: P_service (2-3x); Throughput: P_spare (20%); I/O time estimate: P_I/O_time; Load imbalance: P_leeway