Stonehenge: Multi-Dimensional Storage Virtualization
Lan Huang, IBM Almaden Research Center
Joint work with Gang Peng and Tzi-cker Chiueh, SUNY Stony Brook
June, 2004


Introduction

Storage growth is phenomenal: new hardware
Isolated storage: resource waste
Management

[Figure: clients, IP LAN/MAN/WAN, database server, file server.]
[Figure: areal density (log scale, 1 to 10000) versus year, 1970 to 2000. Source: [Patterson’98]]

Huge amounts of data on heterogeneous devices, spread out everywhere.

Storage Virtualization

Examples: LVM, xFS, StorageTank
Hide physical details from high-level applications

[Figure: layered diagram. Labels: application, storage management, O.S., storage virtualization, abstract interface, hardware resources (disks, controllers); clients, virtual disks, physical disks.]

Storage Virtualization

Storage consolidation
VD as tangible as PD:
  Capacity
  Throughput
  Latency
Resource efficiency Ei

Stonehenge Overview

Input: VD (B, C, D, E)
Output: VDs with performance guarantee

High-level goals:
  Storage consolidation
  Performance isolation
  Efficiency
  Performance

[Figure: clients, IP LAN/MAN/WAN, database server, file server, Stonehenge (LAN).]
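As a minimal sketch of the input tuple, a virtual-disk request could be represented as below. The slides give only the letters B, C, D, E; reading them as bandwidth (throughput), capacity, delay (latency), and efficiency is an inference from the neighboring slides, and the class name, field names, and units are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualDiskSpec:
    """Hypothetical container for the VD (B, C, D, E) request tuple."""
    bandwidth_iops: float  # B: requested throughput (assumed unit: I/Os per second)
    capacity_gb: float     # C: requested capacity (assumed unit: GB)
    delay_s: float         # D: requested latency bound (assumed unit: seconds)
    efficiency: float      # E: assumed to be a probability in (0, 1]

    def __post_init__(self):
        # Reject tuples that cannot describe a real reservation.
        if min(self.bandwidth_iops, self.capacity_gb, self.delay_s) <= 0:
            raise ValueError("B, C and D must be positive")
        if not 0.0 < self.efficiency <= 1.0:
            raise ValueError("E must lie in (0, 1]")


spec = VirtualDiskSpec(bandwidth_iops=200, capacity_gb=50,
                       delay_s=0.05, efficiency=0.95)
```

Treating E as a probability matches its later use as the argument of Qspare(E), but the slides do not define it explicitly.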

Hardware Organization

[Figure: hardware organization. Labels: storage manager; clients (client application, kernel, storage clerk) with a file interface; storage servers, each with a disk array and an object interface; gigabit network; control messages; data/commands.]

Key Issues in Stonehenge

How to ease the task of storage management:
  Centralization
  Virtualization
  Consolidation
How to achieve performance isolation among virtual disks?
  Run-time QoS guarantee
How to do it efficiently?
  Efficiency-aware algorithms
  Dynamic adaptive feedback

Key Components

Mapper
CVC scheduler
Feedback path between them

Virtual to Physical Disk Mapping

Multi-dimension disk mapping: NP-complete
Goal: maximize resource utilization
Heuristics: maximize goal function [toyota75]
  Input: VDs, PDs
  Goal function G: max(G)
  Output: VD-to-PD mapping

Islands Effect

[Figure: VDs placed on PDs 1 to 4.]

Key Components

Mapper
CVC scheduler
Feedback path between them

Requirements of Real-time Disk Scheduling

Disk specific
  Improve disk bandwidth utilization: SATF, CSCAN, etc.
Non disk specific
  Meet real-time request’s deadline
  Fair disk bandwidth allocation among virtual disks (Virtual Clock scheduling)
Key: bandwidth guarantee

[Figure: I/O time breakdown into seek, rotation, transfer, other.]

CVC Algorithm

Two queues:
  FT: FT(i) = max(FT(i-1), real time) + 1/IOPSm
  LBA
LBA queue is used only if FT’s slack time allows it:
  real time + service time(R) < starting deadline of next request

[Figure: requests of VD(m) enter the CVC scheduler’s FT and LBA queues.]
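The two-queue rule above can be sketched as follows. This is an illustration under stated assumptions, not the Stonehenge implementation: the class and method names are hypothetical, the LBA queue is approximated by a CSCAN-like "smallest LBA at or beyond the head" choice, and the "starting deadline" of the most urgent request is assumed to be its finish time FT minus its estimated service time.

```python
from dataclasses import dataclass


@dataclass
class Request:
    vd: int           # virtual disk that issued the request
    lba: int          # logical block address
    ft: float = 0.0   # finish time, assigned on arrival


class CVCScheduler:
    def __init__(self, iops):
        self.iops = dict(iops)                    # reserved IOPS per VD
        self.last_ft = {vd: 0.0 for vd in iops}   # FT(i-1) per VD
        self.pending = []

    def submit(self, req, now):
        # FT(i) = max(FT(i-1), real time) + 1/IOPSm
        req.ft = max(self.last_ft[req.vd], now) + 1.0 / self.iops[req.vd]
        self.last_ft[req.vd] = req.ft
        self.pending.append(req)

    def dispatch(self, now, head_lba, service_time):
        """Pick the next request; service_time(req) is a caller-supplied estimate."""
        if not self.pending:
            return None
        urgent = min(self.pending, key=lambda r: r.ft)        # head of FT queue
        ahead = [r for r in self.pending if r.lba >= head_lba]
        nearest = min(ahead or self.pending, key=lambda r: r.lba)  # head of LBA queue
        # Assumption: the urgent request must start by FT minus its service time.
        start_deadline = urgent.ft - service_time(urgent)
        # Use the LBA queue only if the FT queue's slack allows it.
        if now + service_time(nearest) < start_deadline:
            chosen = nearest
        else:
            chosen = urgent
        self.pending.remove(chosen)
        return chosen


sched = CVCScheduler({0: 100, 1: 50})
sched.submit(Request(vd=0, lba=900), now=0.0)   # FT = 1/100 s
sched.submit(Request(vd=1, lba=100), now=0.0)   # FT = 1/50 s
first = sched.dispatch(now=0.0, head_lba=0, service_time=lambda r: 0.002)
second = sched.dispatch(now=0.002, head_lba=100, service_time=lambda r: 0.002)
```

With a 2 ms service estimate there is enough slack to serve the nearby LBA 100 first; with a much larger estimate the scheduler falls back to finish-time order.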

Real-life Deployment

Dispatch the next N requests from the LBA queue
The next batch will not be issued until the previous batch is done.

[Figure: requests of VD(m) pass through the CVC scheduler (FT and LBA queues) to the storage controller and the on-disk scheduler.]

CVC Performance

3 VDs with real-life traces: video stream, web, financial, TPC-C
Touch 40% of the storage space

[Figure: two panels, "Video Streams" and "Mixed Traces".]

Impact of Disk I/O Time Estimate

Model disk I/O time?
  ATA disk: impossible [ECSL TR-81]
  SCSI disk: possible?
Run-time measurement: P(I/O time)

CVC Latency Bound

If the traffic generated within the period [0, t] satisfies V(t) <= T + r*t, then
  D <= (T + Lmax)/Bi + Lmax/C                            (1)
Storage system:
  D <= ((N+1)*k*C + T + Lmax)/Bi + (k*C + Lmax)/C        (2)
Stonehenge:
  D <= (N+1)/IOPSi + 1/IOPSmax                           (3)

[Figure: FT queue of VD(m) holding N requests (T bytes), with rates IOPS(m) and IOPS(max); I/O time breakdown into seek, rotation, transfer, other.]
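For a feel of the numbers, bound (3) can be evaluated directly. The function name is hypothetical, and N is assumed here to be the number of requests already queued for the VD, which is how the figure's "N req" label reads.

```python
def stonehenge_latency_bound(n_queued, iops_i, iops_max):
    """Bound (3): D <= (N+1)/IOPSi + 1/IOPSmax, returned in seconds."""
    if iops_i <= 0 or iops_max <= 0:
        raise ValueError("IOPS values must be positive")
    return (n_queued + 1) / iops_i + 1.0 / iops_max


# Example: 4 queued requests, 100 IOPS reserved, 500 IOPS disk maximum.
bound = stonehenge_latency_bound(n_queued=4, iops_i=100, iops_max=500)
print(f"{bound * 1000:.1f} ms")  # prints "52.0 ms"
```

The first term dominates: the bound is governed mostly by the VD's own reserved rate, and the disk-wide maximum only adds one worst-case request time.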

Key Components

Mapper
CVC scheduler
Feedback path between them
  Relaxing worst-case service time estimate
  VD multiplex effect

Empirical Latency vs Worst Case

Approximate P(service time, N) with P(service time, N-1)
Q is P’s inverse function
D <= (Q(0.95) + s) * [(N+1)/IOPSi + 1/IOPSmax]

[Figure: plot with axes labeled x and y.]
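A small sketch of how the relaxed bound could be computed from measurements. It assumes that P is the empirical distribution of service time expressed as a fraction of the worst case (the percentages reported later for Qservice(0.95) suggest this), that s is an additive safety margin, and that Q(p) is taken as a plain empirical quantile; the function names are hypothetical.

```python
import math


def empirical_quantile(samples, p):
    """Q(p): smallest sample x such that at least a fraction p of samples are <= x."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    k = min(len(ordered), max(1, math.ceil(p * len(ordered))))
    return ordered[k - 1]


def relaxed_latency_bound(samples, s, n_queued, iops_i, iops_max, p=0.95):
    """D <= (Q(p) + s) * [(N+1)/IOPSi + 1/IOPSmax]."""
    worst_case = (n_queued + 1) / iops_i + 1.0 / iops_max
    return (empirical_quantile(samples, p) + s) * worst_case


# Measured service times as fractions of the worst case (made-up values).
measured = [0.1] * 9 + [0.5]
relaxed = relaxed_latency_bound(measured, s=0.1, n_queued=4,
                                iops_i=100, iops_max=500)
```

Here the scaling factor is 0.5 + 0.1 = 0.6, so the 52 ms worst-case bound shrinks to about 31 ms.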

Bursty I/O Traffic and Pspare

Self-similar
Multiplexing effect: Pspare(x)

[Figure: plot with axes labeled x and y.]

Latency or Throughput Bound

(Bthroughput, C, D, E)
D --> Blatency
(Bthroughput, C, Blatency, E)
  Bthroughput >= Blatency: throughput bound
  Bthroughput < Blatency: latency bound

[Figure labels: Bthroughput, Blatency.]
Or even less?
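The slide does not state how D is converted to Blatency. One plausible reading, assumed here, is to invert the CVC latency bound D <= f * [(N+1)/IOPS + 1/IOPSmax] for IOPS, where f is 1 for the worst case or Qservice(0.95) + s for the measured case. The function names and the `n_queued` parameter are hypothetical.

```python
def delay_to_iops(delay, n_queued, iops_max, factor=1.0):
    """Smallest rate B with factor * ((N+1)/B + 1/IOPSmax) <= delay."""
    budget = delay / factor - 1.0 / iops_max
    if budget <= 0:
        return float("inf")  # no finite rate can meet this delay
    return (n_queued + 1) / budget


def classify(b_throughput, delay, n_queued, iops_max, factor=1.0):
    """Return ('throughput bound' | 'latency bound', Blatency)."""
    b_latency = delay_to_iops(delay, n_queued, iops_max, factor)
    kind = "throughput bound" if b_throughput >= b_latency else "latency bound"
    return kind, b_latency


# A 52 ms delay with 4 queued requests on a 500-IOPS disk needs about 100 IOPS.
kind_a, b_lat = classify(b_throughput=150, delay=0.052, n_queued=4, iops_max=500)
kind_b, _ = classify(b_throughput=50, delay=0.052, n_queued=4, iops_max=500)
```

A VD asking for 150 IOPS is throughput bound (its own rate already implies the delay), while one asking for only 50 IOPS is latency bound and needs the larger rate reserved.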

MBAC for Latency Bound VDs

When the jth VD with requirements (Dj, IOPS''j, Cj, E) comes:
1. For 0 < i <= j:
   Convert Di to IOPS'i: Di <= (Qservice(0.95) + s) * [(N+1)/IOPS'i + 1/IOPSmax]
   Let IOPSi = max(IOPS'i, IOPS''i)
2. If sum(IOPSi) < IOPSmax, accept the new VD; otherwise, reject.
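The two steps above can be sketched as a single admission test. This is a sketch under assumptions: each VD is reduced to a (delay, requested IOPS) pair, `factor` stands for Qservice(0.95) + s and is supplied by the caller, and `n_queued` (the N of the formula) is a hypothetical parameter whose value the slides do not give.

```python
def required_iops(delay, iops_requested, n_queued, iops_max, factor):
    """Step 1: IOPSi = max(IOPS'i, IOPS''i), with IOPS'i derived from Di."""
    budget = delay / factor - 1.0 / iops_max
    iops_from_delay = float("inf") if budget <= 0 else (n_queued + 1) / budget
    return max(iops_from_delay, iops_requested)


def admit(existing, new_vd, iops_max, factor, n_queued=1):
    """Step 2: accept the new VD only if sum(IOPSi) < IOPSmax."""
    vds = list(existing) + [new_vd]
    total = sum(required_iops(d, r, n_queued, iops_max, factor) for d, r in vds)
    return total < iops_max


vd = (0.01, 100)  # 10 ms delay bound, 100 IOPS requested
measured_ok = admit([vd] * 8, vd, iops_max=1000, factor=0.3)   # 9 VDs in total
worst_case_ok = admit([vd] * 4, vd, iops_max=1000, factor=1.0)  # 5 VDs in total
```

With the measured factor 0.3 each VD needs only its requested 100 IOPS, so nine fit on a 1000-IOPS disk; with the deterministic worst case (factor 1.0) each VD needs about 222 IOPS and the fifth is rejected.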

MBAC Performance: Pservice

Run     VD Type     Probability   Deterministic   MBAC   Oracle
Run 1   Financial   95%           7               20     22
Run 2   Mixed       95%           7               14     14
Run 3   Mixed       85%           7               17     17

Table 1. Maximum number of VDs accepted.

Number of VDs     7     9     10    11    13    14    15
Qservice(0.95)    11%   15%   19%   24%   37%   49%   -
MBAC              N/A   38%   43%   47%   55%   67%   95%
Deterministic     90%   -     -     -     -     -     -

Table 2. Resource Reservation

MBAC for Throughput Bound VDs

When the jth VD (Dj, IOPS''j, Cj, E) comes:
  Convert Dj to IOPS'j: Dj <= (Qservice(0.95) + s) * [(N+1)/IOPS'j + 1/IOPSmax]
  Let IOPSj = max(IOPS'j, IOPS''j)
  If IOPSj < Qspare(E), admit the new VD; otherwise, reject it.

MBAC Performance: Pspare

VD 0 - TPC-C
VD 1 - financial
VD 2 - web search

Measurement-based Admission Control (MBAC)

When the jth VD with requirements (Dj, IOPS''j, Cj, E) comes:
1. For 0 < i <= j:
   Convert Di to IOPS'i: Di <= (Qservice(0.95) + s) * [(N+1)/IOPS'i + 1/IOPSmax]
   Let IOPSi = max(IOPS'i, IOPS''i)
2. Group VDs into two sets: throughput-bound set T and latency-bound set L.
3. For the throughput-bound VDs, calculate the combined QI/O_rate. Let Qspare(x) = IOPSmax - QI/O_rate(x).
4. If sum(IOPS(L)) < Qspare(E), accept the new VD; otherwise, reject.
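The four steps can be put together in one illustrative routine. Several things here are assumptions rather than statements from the slides: VDs are (delay, requested IOPS) pairs, `factor` stands for Qservice(0.95) + s, `n_queued` is the unspecified N, QI/O_rate(E) is taken as the empirical E-quantile of the measured combined I/O rate of the throughput-bound VDs, and a newly arriving throughput-bound VD is assumed to be already reflected in those measurements (a simplification).

```python
import math


def quantile(samples, p):
    """Empirical quantile: smallest x with at least a fraction p of samples <= x."""
    ordered = sorted(samples)
    k = min(len(ordered), max(1, math.ceil(p * len(ordered))))
    return ordered[k - 1]


def mbac_admit(vds, rate_samples, iops_max, factor, efficiency, n_queued=1):
    """vds: (delay, requested IOPS) pairs, the new VD included.
    rate_samples: measured combined I/O rate of the throughput-bound VDs."""
    latency_set_iops = 0.0
    for delay, requested in vds:
        # Step 1: derive IOPS' from the delay requirement.
        budget = delay / factor - 1.0 / iops_max
        from_delay = float("inf") if budget <= 0 else (n_queued + 1) / budget
        # Step 2: a VD is latency bound when IOPS' exceeds its requested rate.
        if requested < from_delay:
            latency_set_iops += from_delay
    # Step 3: Qspare(E) = IOPSmax - QI/O_rate(E).
    used = quantile(rate_samples, efficiency) if rate_samples else 0.0
    q_spare = iops_max - used
    # Step 4: accept only if the latency-bound set fits in the spare rate.
    return latency_set_iops < q_spare


throughput_vd = (0.01, 100)   # IOPS' is about 62, below the requested 100
latency_vd = (0.002, 50)      # IOPS' is about 353, above the requested 50
rates = [120, 130, 150, 140, 160, 110, 125, 135, 145, 155]  # made-up samples
two_ok = mbac_admit([throughput_vd] * 2 + [latency_vd] * 2, rates,
                    iops_max=1000, factor=0.3, efficiency=0.95)
three_ok = mbac_admit([throughput_vd] * 2 + [latency_vd] * 3, rates,
                      iops_max=1000, factor=0.3, efficiency=0.95)
```

The throughput-bound VDs are charged by what they were measured to use (160 IOPS at the 0.95 quantile) rather than by their nominal 200 IOPS, which is where the multiplexing gain comes from; two latency-bound VDs then fit in the remaining 840 IOPS and a third does not.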

Issues with Measurement

Stability
  I/O rate pattern is stable
  Boundary case for Pservice
Overhead of monitoring: trivial
Window size

Putting It All Together: Stonehenge

Functionality: a general-purpose IP storage cluster
Performance: scheduling
Efficiency: measurement

Software Architecture

[Figure: software architecture. Labels: user level and kernel; iSCSI initiator; F.E.T.D.; V. Table; P. Table; IDE mid-layer driver; disk mapper; admission controller; traffic shaping; scheduler; queues; Stonehenge.]

FETD: front end target driver

Effectiveness of QoS Guarantees in Stonehenge

[Figure: three panels: (a) CVC, (b) CSCAN, (c) deadline violation percentage.]

Impact of Leeway Factor

[Figure: two panels: overload probability, violation percentage.]

Overall System Performance and Latency Breakdown

1 GHz CPU
IBM 7200 ATA disk array
Promise IDE controllers
64-bit 66 MHz PCI bus
Intel gigabit NICs

Software module     Average latency (usec)
iSCSI client        57
iSCSI server        507
Disk access         1360
Central             50
Network delay 1     574
Network delay 2     2

A maximum of 55 MB/sec per server.

Related Work

Storage management: Minerva etc. at HPL
Efficiency-aware disk scheduler: Cello, Prism, YFQ
Run-time QoS guarantee: web server, video server, network QoS
IP storage

Conclusion

The IP storage cluster consolidates storage and reduces fragmentation by 20-30%.

The efficiency-aware CVC real-time disk scheduler, with a dynamic I/O time estimate, provides performance guarantees and good disk head utilization.

Measurement feedback effectively remedies over-provisioning:
  Latency: Pservice, 2-3 fold
  Throughput: Pspare, 20%
  I/O time estimate: PI/O time
  Load imbalance: Pleeway