Slide 1 Breaking databases for fun and publications: availability benchmarks Aaron Brown UC Berkeley ROC Group HPTS 2001

Slide 1

Breaking databases for fun and publications: availability benchmarks

Aaron Brown, UC Berkeley ROC Group

HPTS 2001

Slide 2

Motivation

• Drinking the availability Kool-Aid
  – availability is the key metric for modern apps
• Database stack’s availability is especially important
  – guardians of the world’s hard state
  – almost any user’s request for electronic information hits a database stack
    » web services, directories, enterprise apps, ...
• Can we trust database software stacks in the face of failure?

Slide 3

Availability benchmarking 101

• Availability benchmarks quantify system behavior under failures, maintenance, recovery
• They require
  – a realistic workload for the system: TPC-C
  – quality of service metrics: txn rates, OK and aborted
  – fault injection to simulate failures: single-disk errors

[Figure: QoS over time, annotated with normal behavior (99% conf.), failure, QoS degradation, and repair time]
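The figure's quantities (normal-behavior band, QoS degradation, repair time) can be sketched in a few lines. This is a minimal illustration, not the analysis used for the talk: it assumes the "99% conf." band is the mean of fault-free samples plus or minus 2.576 standard deviations, and the names `normal_band`, `degraded_intervals`, and `repair_time` are hypothetical.

```python
from statistics import mean, stdev

Z_99 = 2.576  # two-sided 99% quantile of the standard normal distribution


def normal_band(baseline, z=Z_99):
    """Return (low, high) bounds for 'normal behavior' from fault-free samples.

    Assumption: band = mean +/- z * sample standard deviation.
    """
    m, s = mean(baseline), stdev(baseline)
    return m - z * s, m + z * s


def degraded_intervals(samples, band):
    """Return the indices of samples that fall outside the band."""
    low, high = band
    return [i for i, v in enumerate(samples) if not (low <= v <= high)]


def repair_time(samples, band, fault_index, interval_s):
    """Seconds from fault injection until the last out-of-band sample ends.

    Returns 0 if QoS never leaves the band after the fault, and None if the
    run ends while QoS is still out of band (no recovery observed).
    """
    bad = [i for i in degraded_intervals(samples, band) if i >= fault_index]
    if not bad:
        return 0
    if bad[-1] == len(samples) - 1:
        return None
    return (bad[-1] + 1 - fault_index) * interval_s
```

For example, with one throughput sample per minute, a fault injected at sample 2, and three out-of-band samples before QoS returns to the band, `repair_time(..., fault_index=2, interval_s=60)` gives 180 seconds.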

Slide 4

Well, what happens?

• Setup
  – 3-tier: Microsoft SQL Server/COM+/IIS & bus. logic
  – TPC-C-like workload; faults injected into DB data & log
• Results
  – DBMS tolerates transient and recoverable failures, reflecting errors back via transaction aborts
  – middleware highly unstable: degrades or crashes when DBMS fails or undergoes lengthy recovery

[Figure: two panels, "Disk hang during write to data disk" and "Sticky uncorrectable write error, log disk". Annotations: middleware causes degraded performance; database recovers; database fails, middleware degrades; middleware crashes]

Slide 5

Summary

• Database is pretty resilient
  – transaction abort == good error-reflection mechanism
• Middleware/applications suck (well, at least this instance of them)
• Robustness is end-to-end
  – user cannot distinguish DBMS and middleware failures
  – failure recovery must go beyond the DBMS
• Achievable Grand Challenges?
  – build and run availability benchmarks on your systems
  – tolerate and recover from non-failstop system-level faults

Does performance matter?

Slide 6

Backup slides

Slide 7

Experimental setup

• Database
  – Microsoft SQL Server 2000, default configuration
• Middleware/front-end software
  – Microsoft COM+ transaction monitor/coordinator
  – IIS 5.0 web server with Microsoft’s tpcc.dll HTML terminal interface and business logic
  – Microsoft BenchCraft remote terminal emulator
• TPC-C-like OLTP order-entry workload
  – 10 warehouses, 100 active users, ~860 MB database
• Measured metrics
  – throughput of correct NewOrder transactions/min
  – rate of aborted NewOrder transactions (txn/min)
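The two measured metrics can be derived from a transaction log by bucketing NewOrder outcomes per minute. The sketch below assumes a hypothetical log format of `(timestamp_seconds, txn_type, status)` tuples with status `"ok"` or `"aborted"`; neither the format nor the function name comes from the slides.

```python
from collections import Counter


def per_minute_rates(events):
    """Count correct and aborted NewOrder transactions per minute.

    `events` is an iterable of (timestamp_seconds, txn_type, status) tuples
    (hypothetical format). Returns {minute: (ok_count, aborted_count)} for
    every minute from 0 through the last minute seen in the log.
    """
    ok, aborted = Counter(), Counter()
    last_minute = -1
    for ts, txn_type, status in events:
        minute = int(ts // 60)
        last_minute = max(last_minute, minute)
        if txn_type != "NewOrder":
            continue  # only NewOrder transactions are measured
        if status == "ok":
            ok[minute] += 1
        elif status == "aborted":
            aborted[minute] += 1
    # Counter returns 0 for minutes with no NewOrder activity.
    return {m: (ok[m], aborted[m]) for m in range(last_minute + 1)}
```

Minutes with no NewOrder activity are reported as `(0, 0)` rather than omitted, so a hung DBMS shows up as a run of zeros instead of a gap.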

Slide 8

Experimental setup (2)

• Database installed in one of two configurations:
  – data on emulated disk, log on real (IBM) disk
  – data on real (IBM) disk, log on emulated disk

[Diagram: testbed of three machines]
• Front End: Intel P-III/450, 256 MB DRAM, Windows 2000 AS; MS BenchCraft RTE, IIS + MS tpcc.dll, MS COM+; SCSI system disk
• DB Server: AMD K6-2/333, 128 MB DRAM, Windows 2000 AS; SQL Server 2000; IDE system disk; Adaptec 3940; DB data/log disks (IBM 18 GB 10k RPM disk and emulated disk)
• Disk Emulator: Intel P-II/300, 128 MB DRAM, Windows NT 4.0; Adaptec 2940; AdvStor ASC-U2W UltraSCSI; ASC VirtualSCSI lib.; SCSI system disk; emulator backing disk (NTFS, IBM 18 GB 10k RPM)
• Links: 100 Mb Ethernet; Fast/Wide SCSI bus, 20 MB/sec

Slide 9

Results

• All results are from single-fault micro-benchmarks
• 14 different fault types
  – injected once for each of data and log partitions
• 4 categories of behavior detected
  1) normal
  2) transient glitch
  3) degraded
  4) failed

Slide 10

Type 1: normal behavior

• System tolerates fault
• Demonstrated for all sector-level faults except:
  – sticky uncorrectable read, data partition
  – sticky uncorrectable write, log partition

Slide 11

Type 2: transient glitch

• One transaction is affected, aborts with error
• Subsequent transactions using same data would fail
• Demonstrated for one fault only:
  – sticky uncorrectable read, data partition

Slide 12

Type 3: degraded behavior

• DBMS survives error after running log recovery
• Middleware partially fails, results in degraded perf.
• Demonstrated for one fault only:
  – sticky uncorrectable write, log partition

Slide 13

Type 4: failure

• DBMS hangs or aborts all transactions
• Middleware behaves erratically, sometimes crashing
• Demonstrated for all fatal disk-level faults
  – SCSI hangs, disk power failures
• Example behaviors (10 distinct variants observed)

[Figure: two panels, "Disk hang during write to data disk" and "Simulated log disk power failure"]
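The four behavior types can be told apart mechanically from the per-minute metrics. The rules below are illustrative assumptions only, since the slides describe the categories qualitatively and give no thresholds; `classify_run` and its parameters are hypothetical names.

```python
def classify_run(ok_per_min, aborted_per_min, band_low, fault_minute):
    """Label one single-fault run with one of the four behavior types.

    Illustrative rules (assumptions, not from the slides), applied to the
    samples from the fault minute onward:
      failed           - throughput is 0 in the final minute
      degraded         - final-minute throughput is above 0 but below band_low
      transient glitch - final-minute throughput is at or above band_low,
                         but at least one transaction aborted after the fault
      normal           - final-minute throughput is at or above band_low
                         and nothing aborted after the fault
    """
    after_ok = ok_per_min[fault_minute:]
    after_aborts = aborted_per_min[fault_minute:]
    if after_ok[-1] == 0:
        return "failed"
    if after_ok[-1] < band_low:
        return "degraded"
    if sum(after_aborts) > 0:
        return "transient glitch"
    return "normal"
```

A run whose throughput drops to zero and stays there is labeled "failed", while one that settles at a reduced but nonzero rate is labeled "degraded". A stricter version would also look at the middleware's state, which throughput alone cannot show.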
