Slide 1
Breaking databases for fun and publications: availability benchmarks
Aaron Brown
UC Berkeley ROC Group
HPTS 2001
Slide 2
Motivation
• Drinking the availability Kool-Aid
– availability is the key metric for modern apps
• Database stack’s availability is especially important
– guardians of the world’s hard state
– almost any user’s request for electronic information hits a database stack
» web services, directories, enterprise apps, ...
• Can we trust database software stacks in the face of failure?
Slide 3
Availability benchmarking 101
• Availability benchmarks quantify system behavior under failures, maintenance, recovery
• They require
– a realistic workload for the system: TPC-C
– quality of service metrics: txn rates, OK and aborted
– fault-injection to simulate failures: single-disk errors
[Figure: QoS over time, showing the normal-behavior band (99% conf.), the failure, the resulting QoS degradation, and the repair time]
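A minimal sketch of how the quantities on this slide could be computed from throughput samples. The band model (mean plus or minus 2.576 standard deviations over fault-free samples), the function names `normal_band` and `repair_time`, and the one-sample-per-minute interval are assumptions for illustration, not the method used in the talk.

```python
from statistics import mean, stdev

Z_99 = 2.576  # two-sided 99% normal quantile (assumed band model)


def normal_band(fault_free_runs):
    """Return (low, high) throughput bounds from fault-free runs.

    fault_free_runs: list of runs, each a list of per-interval throughputs.
    """
    samples = [x for run in fault_free_runs for x in run]
    mu, sigma = mean(samples), stdev(samples)
    return mu - Z_99 * sigma, mu + Z_99 * sigma


def repair_time(trace, band, fault_index, interval_s=60):
    """Seconds from fault injection until throughput is back in the band for good.

    Returns 0 if throughput never drops below the band after the fault,
    and None if it is still below the band at the end of the trace.
    """
    low, _ = band
    below = [i for i in range(fault_index, len(trace)) if trace[i] < low]
    if not below:
        return 0
    last_bad = below[-1]
    if last_bad == len(trace) - 1:
        return None
    return (last_bad + 1 - fault_index) * interval_s


# Hypothetical numbers, txn/min per one-minute interval.
band = normal_band([[100, 102, 98, 101], [99, 100, 103, 97]])
print(repair_time([100, 101, 40, 10, 60, 99, 100], band, fault_index=2))
```

QoS degradation is then any interval whose throughput falls below the lower bound of the band.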
Slide 4
Well, what happens?
• Setup
– 3-tier: Microsoft SQL Server/COM+/IIS & bus. logic
– TPC-C-like workload; faults injected into DB data & log
• Results
– DBMS tolerates transient and recoverable failures, reflecting errors back via transaction aborts
– middleware highly unstable: degrades or crashes when DBMS fails or undergoes lengthy recovery
[Figure: two panels, "Disk hang during write to data disk" and "sticky uncorrectable write error, log disk". Annotations: middleware causes degraded performance; database recovers; database fails, middleware degrades; middleware crashes]
Slide 5
Summary
• Database is pretty resilient
– transaction abort == good error-reflection mechanism
• Middleware/applications suck (well, at least this instance of them)
• Robustness is end-to-end
– user cannot distinguish DBMS and middleware failures
– failure recovery must go beyond the DBMS
• Achievable Grand Challenges?
– build and run availability benchmarks on your systems
– tolerate and recover from non-failstop system-level faults
Does performance matter?
Slide 6
Backup slides
Slide 7
Experimental setup
• Database
– Microsoft SQL Server 2000, default configuration
• Middleware/front-end software
– Microsoft COM+ transaction monitor/coordinator
– IIS 5.0 web server with Microsoft’s tpcc.dll HTML terminal interface and business logic
– Microsoft BenchCraft remote terminal emulator
• TPC-C-like OLTP order-entry workload
– 10 warehouses, 100 active users, ~860 MB database
• Measured metrics
– throughput of correct NewOrder transactions/min
– rate of aborted NewOrder transactions (txn/min)
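The two measured metrics can be sketched as a per-minute tally over a transaction log. The event format `(timestamp_s, txn_type, status)`, the status strings, and the function name `per_minute_rates` are hypothetical; the slides do not describe how the emulator logged transactions.

```python
from collections import Counter


def per_minute_rates(events):
    """Tally NewOrder outcomes per minute.

    events: iterable of (timestamp_s, txn_type, status) tuples (assumed format),
            where status is "ok" or "aborted".
    Returns {minute: (ok_new_orders, aborted_new_orders)}.
    """
    ok, aborted = Counter(), Counter()
    for ts, txn_type, status in events:
        if txn_type != "NewOrder":
            continue  # only NewOrder transactions count toward the metrics
        minute = int(ts // 60)
        if status == "ok":
            ok[minute] += 1
        elif status == "aborted":
            aborted[minute] += 1
    minutes = sorted(set(ok) | set(aborted))
    return {m: (ok[m], aborted[m]) for m in minutes}


# Hypothetical log excerpt.
log = [
    (5, "NewOrder", "ok"),
    (30, "NewOrder", "aborted"),
    (59.9, "Payment", "ok"),
    (61, "NewOrder", "ok"),
    (119, "NewOrder", "ok"),
]
print(per_minute_rates(log))
```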
Slide 8
Experimental setup (2)
• Database installed in one of two configurations:
– data on emulated disk, log on real (IBM) disk
– data on real (IBM) disk, log on emulated disk
[Figure: testbed diagram with three machines.
Front End: Intel P-III/450, 256 MB DRAM, Windows 2000 AS; MS BenchCraft RTE, IIS + MS tpcc.dll, MS COM+; SCSI system disk.
DB Server: AMD K6-2/333, 128 MB DRAM, Windows 2000 AS; SQL Server 2000; IDE system disk; Adaptec 3940; DB data/log disks (IBM 18 GB 10k RPM disk and emulated disk).
Disk Emulator: Intel P-II/300, 128 MB DRAM, Windows NT 4.0; Adaptec 2940; SCSI system disk; emulator backing disk (NTFS), IBM 18 GB 10k RPM; AdvStor ASC-U2W UltraSCSI; ASC Virtual SCSI lib.
Links: 100mb Ethernet; Fast/Wide SCSI bus, 20 MB/sec]
Slide 9
Results
• All results are from single-fault micro-benchmarks
• 14 different fault types
– injected once for each of data and log partitions
• 4 categories of behavior detected
1) normal
2) transient glitch
3) degraded
4) failed
Slide 10
Type 1: normal behavior
• System tolerates fault
• Demonstrated for all sector-level faults except:
– sticky uncorrectable read, data partition
– sticky uncorrectable write, log partition
Slide 11
Type 2: transient glitch
• One transaction is affected, aborts with error
• Subsequent transactions using same data would fail
• Demonstrated for one fault only:
– sticky uncorrectable read, data partition
Slide 12
Type 3: degraded behavior
• DBMS survives error after running log recovery
• Middleware partially fails, results in degraded perf.
• Demonstrated for one fault only:
– sticky uncorrectable write, log partition
Slide 13
Type 4: failure
• DBMS hangs or aborts all transactions
• Middleware behaves erratically, sometimes crashing
• Demonstrated for all fatal disk-level faults
– SCSI hangs, disk power failures
• Example behaviors (10 distinct variants observed)
[Figure: two panels, "Disk hang during write to data disk" and "Simulated log disk power failure"]
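As a rough illustration of the four behavior types (normal, transient glitch, degraded, failed), a run could be labeled from its post-fault trace as below. The decision rules, the threshold `band_low`, and the function name `classify_run` are assumptions made for this sketch; the slides do not say how the categories were assigned. The sketch looks only at the final interval and at total aborts, so a dip that fully recovers without aborts is labeled normal.

```python
def classify_run(ok_rates, abort_counts, band_low, fault_index):
    """Label a single-fault run with one of the four behavior types.

    ok_rates:     correct NewOrder txn/min per interval
    abort_counts: aborted NewOrder txn per interval
    band_low:     lower bound of normal throughput (assumed threshold)
    fault_index:  interval in which the fault was injected
    """
    ok_after = ok_rates[fault_index:]
    aborts_after = abort_counts[fault_index:]
    if not ok_after:
        raise ValueError("no samples after fault injection")
    if ok_after[-1] == 0:
        return "failed"            # no correct work at end of run
    if ok_after[-1] < band_low:
        return "degraded"          # still running, but below normal
    if sum(aborts_after) > 0:
        return "transient glitch"  # aborts seen, throughput back to normal
    return "normal"                # fault tolerated with no visible effect


# Hypothetical traces, fault injected at interval 2.
print(classify_run([100, 100, 0, 0], [0, 0, 50, 50], band_low=95, fault_index=2))
print(classify_run([100, 100, 99, 100], [0, 0, 1, 0], band_low=95, fault_index=2))
```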