Slide 1
ISTORE: Introspective Storage for Data-Intensive Network Services

Aaron Brown, David Oppenheimer, Jim Beck, Kimberly Keeton, Rich Martin, Randi Thomas, John Kubiatowicz, David Patterson, and Kathy Yelick

Computer Science Division
University of California, Berkeley
http://iram.cs.berkeley.edu/istore/

1999 Summer IRAM Retreat
Slide 2
ISTORE Philosophy
• Traditional systems research has focused on peak performance and cost
Traditional Research Priorities
1) Performance
1') Cost
3) Scalability
4) Availability
5) Maintainability
Slide 3
ISTORE Philosophy: SAM
• In reality, scalability, availability, and maintainability (SAM) are equally important
  – performance & cost mean little if the system isn't working
Traditional Research Priorities
1) Performance
1') Cost
3) Scalability
4) Availability
5) Maintainability
ISTORE Priorities
1) Maintainability
2) Availability
3) Scalability
4) Performance
4') Cost
Slide 4
ISTORE Philosophy: Introspection
• ISTORE's solution is introspective systems
  – systems that monitor themselves and automatically adapt to changes in their environment and workload
  – introspection enables automatic self-maintenance and self-tuning
• ISTORE vision: a framework that makes it easy to build introspective systems
• ISTORE target: high-end servers for data-intensive infrastructure services
  – single-purpose systems managing large amounts of data for large numbers of active network users
Slide 5
Outline

• Motivation for Introspective Systems
• ISTORE Research Agenda and Architecture
  – Hardware
  – Software
• Policy-driven Introspection Example
• Research Issues, Status, and Discussion
Slide 6
Motivation: Service Demands

• Emergence of a true information infrastructure
  – today: e-commerce, online database services, online backup, search engines, and web servers
  – tomorrow: more of the above (with ever-growing datasets), plus thin-client/PDA infrastructure support
• Infrastructure users expect "always-on" service and constant quality of service
  – infrastructure must provide fault-tolerance and performance-tolerance
  – failures and slowdowns have major business impact
    » e.g., recent eBay, E*Trade, Schwab outages
Slide 7
Motivation: Service Demands (2)

• Delivering 24x7 fault- and performance-tolerance requires:
  – a robust hardware platform
  – fast adaptation to failures, load spikes, changing access patterns
  – easy incremental scalability when existing resources stop providing desired quality of service
  – self-maintenance: the system handles problems as they arise, automatically
    » can't rely on human intervention to fix problems or to tune performance
    » humans are too expensive, too slow, prone to mistakes
• Introspective systems are self-maintaining
Slide 8
Motivation: System Scaling

• Infrastructure services are growing rapidly
  – more users, more online data, higher access rates, more historical data
  – bigger and bigger back-end systems are needed
    » O(300)-node clusters deployed now; thousands of nodes not far off
  – techniques for maintenance and administration must scale with the system to 1000s of nodes
• Today's administrative approaches don't scale
  – systems will be too big to reason about, monitor, or fix
  – failures and load variance will be too frequent for static solutions to work
• Introspective, reactive techniques are required
Slide 9
ISTORE Research Agenda

• ISTORE goal = create a hardware/software framework for building introspective servers
  – Hardware: plug-and-play intelligent devices with integrated self-monitoring, diagnostics, and fault injection hardware
    » intelligence used to collect and filter monitoring data
    » diagnostics and fault injection enhance robustness
    » networked to create a scalable shared-nothing cluster
  – Software: toolkit that allows programmers to easily define the system's adaptive behavior
    » provides abstractions for manipulating and reacting to monitoring data
Slide 10
Hardware Requirements for Self-Maintaining Servers
• Redundant components that fail fast
  – no single point of failure anywhere
• Tightly-integrated device monitoring
  – low-level HW diagnostics to detect impending failure
  – device "health," performance data, access patterns, environmental info, ...
• Automatic preventive maintenance
  – predictive failure analysis based on diagnostic data
  – continual "scrubbing" and in situ stress testing of all components, new and old
• Self-characterizing, plug-and-play hardware
Slide 11
ISTORE-1 Hardware Prototype
[Diagram: an Intelligent Chassis (scalable, redundant switching, power, environmental monitoring) housing Intelligent Disk "Bricks", each with disk, CPU, memory, diagnostic processor, and redundant NICs.]

• Based on intelligent disk bricks
  – each brick is one ISTORE node
  – ISTORE-1 will have 64 bricks/nodes
Slide 12
ISTORE-1 Hardware Design

• Brick
  – processor board
    » mobile Pentium-II, 128MB SODRAM
    » PCI and ISA busses/controllers, SuperIO (serial ports)
    » Flash BIOS
    » 4x100Mb Ethernet interfaces
    » Adaptec Ultra2-LVD SCSI interface
  – disk: one 18.2GB low-profile SCSI disk
  – diagnostic processor
  – OS: several UNIX™-like OSs supporting the Linux ABI (Linux, NetBSD, FreeBSD, Solaris x86?)
Slide 13
ISTORE-1 Hardware Design (2)

• Network
  – primary data network
    » hierarchical, highly-redundant switched Ethernet
    » uses 16 20-port 100Mb switches at the leaves
      • each brick connects to 4 independent switches
    » root switching fabric is two ganged 25-port Gigabit switches (PacketEngines PowerRails)
  – diagnostic network
Slide 14
Diagnostic Processor

• Each brick has a diagnostic processor
  – Goal: small, independent, trusted piece of hardware running hand-verifiable monitoring/control software
    » monitoring: connects to motherboard SMBus, CAN bus
      • environmental monitor, CPU watchdog
    » control
      • reboot/power-cycle main CPU
      • inject simulated faults: power, bus transients, memory errors, network interface failure, ...
• Not-so-small embedded Motorola 68k processor
  – provides the flexibility needed for a research prototype
  – can still run just a small, simple monitoring and control program if desired (no OS, networking, etc.; sketched below)
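To make the diagnostic processor's role concrete, here is a minimal C sketch of the kind of standalone monitoring/control loop such a processor could run. The routine names, the SMBus/CAN access details, and the temperature threshold are hypothetical placeholders, not the actual ISTORE firmware interfaces.

    /* Hypothetical hardware-access stubs; on the real board these would be
       SMBus/CAN sensor reads and power-control writes. */
    extern int  read_temperature_c(void);     /* chassis temperature sensor  */
    extern int  cpu_watchdog_expired(void);   /* main CPU failed to check in */
    extern void power_cycle_main_cpu(void);
    extern void report_to_shelf_gateway(const char *event);

    #define TEMP_LIMIT_C 55                   /* illustrative threshold      */

    /* Minimal monitoring/control loop: no OS, no networking, just polling. */
    void diagnostic_main(void)
    {
        for (;;) {
            if (read_temperature_c() > TEMP_LIMIT_C)
                report_to_shelf_gateway("overtemp");

            if (cpu_watchdog_expired()) {
                report_to_shelf_gateway("cpu-hung");
                power_cycle_main_cpu();
            }
            /* ... poll fan RPM, inject requested test faults, etc. ... */
        }
    }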
Slide 15
Diagnostic Network

• Separate "diagnostic network" connects the diagnostic processors of each brick
  – provides independent network path to diagnostic CPU
    » works when brick CPU is powered off or has failed
    » separate failure modes from Ethernet interfaces
• CAN (Controller Area Network) diagnostic interconnect
  – CAN connects directly to environmental monitoring sensors (temperature, fan RPM, ...)
  – one brick per "shelf" of 8 acts as gateway from CAN to the redundant switched Ethernet fabric
Slide 16
ISTORE-1 Hardware Prototype

• Meets requirements for a robust HW platform
  – fast embedded CPU performs local monitoring tasks
  – diagnostic hardware enables low-level diagnostic monitoring, fail-fast behavior, and fault injection
  – highly-redundant system design
    » redundant data network and interfaces at all levels
    » separate diagnostic network
    » redundant backup power
  – powerful preventive maintenance
    » each brick can be periodically taken offline and stress-tested/scrubbed using the diagnostic processor's fault injection capabilities
Slide 17
ISTORE Research Agenda

• ISTORE goal = create a hardware/software framework for building introspective servers
  – Hardware
  – Software: toolkit that allows programmers to easily define the system's adaptive behavior
    » provides abstractions for manipulating and reacting to monitoring data
Slide 18
A Software Framework for Introspection
• ISTORE hardware provides device monitoring
  – application programmers could write ad-hoc code to collect, process, and react to monitoring data
• ISTORE software framework should simplify writing introspective applications
  – rule-based adaptation engine encapsulates the mechanisms of collecting and processing monitoring data
  – policy compiler and mechanism libraries help turn application adaptation goals into rules & reaction code
  – these provide a high-level, abstract interface to the system's monitoring and adaptation mechanisms
Slide 19
Rule-based Adaptation

• ISTORE's adaptation framework is built on the model of an active database
  – the "database" includes:
    » hardware monitoring data: device status, access patterns, performance stats
    » software monitoring data: app-specific quality-of-service metrics, high-level workload patterns, ...
  – applications define views and triggers over the DB
    » views select and aggregate data of interest to the app
    » triggers are rules that invoke application-specific reaction code when their predicates are satisfied
  – SQL-like declarative language used to specify views and trigger rules
Slide 20
Benefits of Views and Triggers

• Allow applications to focus on adaptation, not monitoring
  – hide the mechanics of gathering and processing monitoring data
  – can be dynamically redefined as the situation changes, without altering adaptation code
• Can be implemented without a real database
  – views and triggers implemented as device-local and distributed filters and reaction rules (see the sketch below)
  – defined views and triggers control the frequency, granularity, and types of data gathered by HW monitoring
  – no materialized database necessary
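To illustrate how a view can be realized as a device-local filter rather than a materialized table, here is a C sketch that keeps a running average of a brick's disk queue length and forwards an update only when the value moves significantly. The sampling and forwarding calls, the smoothing factor, and the 10% threshold are all hypothetical, not part of the actual ISTORE toolkit.

    #include <math.h>

    extern double sample_queue_length(void);          /* local HW monitoring       */
    extern void   forward_view_update(double value);  /* to the distributed filter */

    /* Called periodically on the brick; maintains part of a view locally. */
    void queue_length_filter_tick(void)
    {
        static double avg = 0.0, last_sent = -1.0;
        const double alpha = 0.1;                     /* smoothing factor */

        avg = (1.0 - alpha) * avg + alpha * sample_queue_length();

        /* Push an update only when the local aggregate moves by more than 10%. */
        if (last_sent < 0.0 || fabs(avg - last_sent) > 0.1 * last_sent) {
            forward_view_update(avg);
            last_sent = avg;
        }
    }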
Slide 21
Raising the Level of Abstraction: Policy Compiler and Mechanism Libraries

• Rule-based adaptation doesn't go far enough
  – application designer must still write views, triggers, and adaptation code by hand
    » but the designer thinks in terms of system policies
• Solution: designer specifies policies to the system; the system implements them
  – policy compiler automatically generates views, triggers, and adaptation code
  – uses preexisting mechanism libraries to implement adaptation algorithms
  – claim: feasible for the common adaptation mechanisms needed by data-intensive network service apps
Slide 22
Adaptation Policies

• Policies specify system states and how to react to them
  – high-level specification: independent of the system's "schema" and of object/node identity
    » that knowledge is encapsulated in the policy compiler
• Examples
  – self-maintenance and availability
    » if overall free disk space is below 10%, compress all but one replica/version of the least-frequently-accessed data
    » if any disk reports more than 5 errors per hour, migrate all data off that disk and shut it down (sketched below)
    » invoke the load-balancer when a new disk is added to the system
  – performance tuning
    » place large, sequentially-accessed objects on the outer tracks of fast disks as space becomes available
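As a rough sketch of what the policy compiler might produce for the disk-error policy above, here is illustrative C-style reaction code. The view variables (errors_last_hour, num_disks) and the mechanism-library calls (migrate_all_objects, shut_down_disk) are hypothetical, since the generated code and library interfaces are themselves open research issues.

    #define MAX_ERRORS_PER_HOUR 5

    /* Assumed to be kept up to date by the monitoring views. */
    extern int num_disks;
    extern int errors_last_hour[];            /* per-disk error counts */

    /* Assumed mechanism-library entry points. */
    extern void migrate_all_objects(int from_disk);
    extern void shut_down_disk(int disk);

    /* Invoked when the corresponding trigger fires. */
    void handle_failing_disks(void)
    {
        for (int d = 0; d < num_disks; d++) {
            if (errors_last_hour[d] > MAX_ERRORS_PER_HOUR) {
                migrate_all_objects(d);
                shut_down_disk(d);
            }
        }
    }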
Slide 23
Software Structure
[Diagram: a policy is used as input to the policy compiler, which produces views, triggers, and adaptation code; the adaptation code calls the mechanism libraries.]
Slide 24
Detailed Adaptation Example

• Policy: quench hot spots by migrating objects

    while ((average queue length for any disk D) >
           (120% of average for whole system))
        migrate hottest object on D to disk with
            shortest average queue length
Slide 25
Example: View Definition (for the hot-spot policy above)

    DEFINE VIEW (
        average_queue_length = SELECT AVG(queue_length) FROM disk_stats,
        queue_length[]       = SELECT queue_length FROM disk_stats,
        disk_id[]            = SELECT disk_id FROM disk_stats,
        short_disk           = SELECT disk_id FROM disk_stats
                               WHERE queue_length = SELECT MIN(queue_length)
                                                    FROM disk_stats
    )
Slide 26
Example: Trigger (for the hot-spot policy above)

    foreach disk_id from_disk {
        if (queue_length[from_disk] > 1.2 * average_queue_length)
            user_migrate(from_disk, short_disk)
    }
Slide 27
Example: Adaptation Code (for the hot-spot policy above)

    user_migrate(from_disk, to_disk) {
        diskObject x;
        x = find_hottest_obj(from_disk);
        migrate(x, to_disk);
    }
Slide 28
Example: Mechanism Library Calls (for the hot-spot policy above)

The adaptation code's calls to find_hottest_obj() and migrate() are supplied by the mechanism libraries.
Slide 29
Mechanism Libraries

• Unify existing techniques/services found in single-node OSs, DBMSs, distributed systems
  – distributed directory services
  – replication and migration
  – data layout and placement
  – distributed transactions
  – checkpointing
  – caching
  – administrative (human UI) tasks
  (select key mechanisms for data-intensive network services)
• Provide a place for higher-level monitoring
• Simplify creation of adaptation code (an illustrative interface is sketched below)
  – for humans using the rule system directly
  – for the policy compiler auto-generating code
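To suggest what such a library boundary could look like, here is a hypothetical C header sketch for a replication/migration mechanism library. find_hottest_obj and migrate echo the calls used in the adaptation example; the remaining names and signatures are purely illustrative, not actual ISTORE interfaces.

    /* Hypothetical mechanism-library interface (illustrative only). */

    typedef struct disk_object disk_object;   /* opaque handle to a stored object */

    /* data layout and placement */
    disk_object *find_hottest_obj(int disk_id);
    int          place_object(disk_object *obj, int disk_id);

    /* replication and migration */
    int migrate(disk_object *obj, int to_disk);
    int replicate(disk_object *obj, int n_copies);

    /* distributed directory service */
    int locate_replicas(const char *obj_name, int *disk_ids, int max_ids);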
Slide 30
Open Research Issues

• Defining appropriate software abstractions
  – how should views and triggers be declared?
  – what should the policy language look like?
  – what functions should mechanism libraries provide?
  – what is the system's "schema"?
    » how should heterogeneous hardware be integrated?
    » can it be extended by the user to include new types and statistics?
  – what level of policies can be expressed?
    » how much of the implementation can the system figure out automatically?
    » to what extent can the system reason about policies and their interactions?
Slide 31
More Open Research Issues

• Implementing an introspective system
  – what default policies should the system supply?
  – what are the internal and external interfaces?
  – debugging
    » visualization of states, triggers, ...
    » simulation/coverage analysis of policies, adaptation code
  – appropriate administrative interfaces
• Measuring an introspective system
  – what are the right benchmarks for scalability, availability, and maintainability (SAM)?
• O(>=1000)-node scalability
  – how to write applications that scale and run well despite a continual state of partial failure?
Slide 32
Related Work

• Hardware:
  – CMU and UCSB Active Disks
• Software:
  – adaptive databases: MS AutoAdmin, Informix NoKnobs
  – adaptive OSs: MS Millennium, adaptive VINO
  – adaptive storage: HP AutoRAID, attribute-managed storage
  – active databases: UFL Gator, TriggerMan
• ISTORE unifies many of these techniques in a single system
Slide 33
Related Work: Ninja

• Ninja: composable Internet-scale services
  – some ISTORE runtime software services provided using the Ninja programming platform?
  – provides
    » some fault tolerance
    » a framework for automatic service discovery
    » incremental s/w upgrades
Slide 34
Related Work: Telegraph

• Universal system for information
• Four layers
  – query, browse, mine
  – global agoric federation
  – continuously reoptimizing query processor + adaptive data placement
  – storage manager
• Relationship to ISTORE
  – continuous online reoptimization
  – adaptive data placement
  – indexing, other operations on disk CPU
Slide 35
Related Work: OceanStore

• Global-scale persistent storage
• Nomadic, highly-available data
• Federation of data storage providers
• Investigate global-scale SAM
  – also naming, indexability, consistency
• Relationship to ISTORE
  – investigating similar concepts, but on a global scale
  – converse: ISTORE as "Internet in a box"
Slide 36
Related Work: Endeavour

• Endeavour: new research project at UCB
  – goal: "enhancing human understanding through information technology"
• ISTORE's potential contributions:
  – ISTORE is building adaptive, scalable, self-maintaining back-end servers for storage-intensive network services
    » can be part of Endeavour's back-end infrastructure
  – software research
    » using policies to guide a system's adaptive behavior
    » providing QoS under degraded conditions
  – application platform
    » process and store streams of sensor data
Slide 37
Status and Conclusions

• ISTORE's focus is on introspective systems
  – a new perspective on systems research priorities
• Proposed framework for building introspection
  – intelligent, self-monitoring plug-and-play hardware
  – software that provides a higher level of abstraction for the construction of introspective systems
    » flexible, powerful rule system for monitoring
    » policy specification automates generation of adaptation code
• Status
  – ISTORE-1 hardware prototype being constructed now
  – software prototyping just starting
Slide 38
ISTORE Short-Term Plans

• Solidify/begin implementing benchmarking ideas
  – run on existing systems to characterize and compare them with respect to SAM
• Assemble ISTORE-0 system
  – 6 PCs with configurations similar to the ISTORE-1 bricks
  – 100 Mb/s switched Ethernet
  – gain experience running multiple OSes
• Investigate implementation options for the monitoring database, views, and triggers
• Study data-intensive network service applications
  – to guide development of the policy language
  – to determine what types of adaptation will help
Slide 39
ISTORE: Introspective Storage for Data-Intensive Network Services

For more information:
http://iram.cs.berkeley.edu/istore/
Slide 40
Backup Slides
Slide 41
ISTORE-1 Hardware Design

• Brick
  – processor board
    » mobile Pentium-II, 366 MHz, 128MB SODRAM
    » PCI and ISA busses/controllers, SuperIO (serial ports)
    » Flash BIOS
    » 4x100Mb Ethernet interfaces
    » Adaptec Ultra2-LVD SCSI interface
  – disk: one 18.2GB 10,000 RPM low-profile SCSI disk
  – diagnostic processor
    » Motorola MC68376, 2MB Flash or NVRAM
    » serial connections to CPU for console and monitoring
    » controls power to all parts on board
    » CAN interface
Slide 42
ISTORE-1 Hardware Design (2)

• Network
  – primary data network
    » hierarchical, highly-redundant switched Ethernet
    » uses 16 20-port 100Mb switches at the leaves
      • each brick connects to 4 independent switches
    » root switching fabric is two ganged 25-port Gigabit switches (PacketEngines PowerRails)
  – diagnostic network
    » point-to-point CAN network connects bricks in a shelf
    » Ethernet fabric described above is used for shelf-to-shelf communication
    » console I/O from each brick can be routed through the diagnostic network
Slide 43
Motivation: Technology Trends

• Disks, systems, switches are getting smaller
• Convergence on "intelligent" disks (IDISKs)
  – MicroDrive + system-on-a-chip => tiny IDISK nodes
• Inevitability of enormous-scale systems
  – by 2006, an O(10,000) IDISK-node cluster with 90TB of storage could fit in one rack

[Photos: IBM MicroDrive (340MB, 5MB/s); World's Smallest Web Server (486/66, 16MB RAM, 16MB ROM).]
Slide 44
Disk Limit

• Continued advance in capacity (60%/yr) and bandwidth (40%/yr)
• Slow improvement in seek, rotation (8%/yr)
• Time to read whole disk (rough check below):

  Year   Sequentially   Randomly (1 sector/seek)
  1990   4 minutes      6 hours
  2000   12 minutes     1 week (!)

• Does the 3.5" form factor make sense in 5-7 years?
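As a rough check of the year-2000 row (using assumed, representative figures rather than numbers from the talk): a ~36 GB disk read sequentially at ~50 MB/s takes about 720 seconds, roughly 12 minutes, while reading it one 512-byte sector at a time at ~8 ms per seek+rotation is roughly 70 million accesses, about 6.5 days, i.e. on the order of a week.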
Slide 45
ISTORE-II Hardware Vision

• System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk
• Target for +5-7 years:
  • 1999 IBM MicroDrive:
    – 1.7" x 1.4" x 0.2" (43 mm x 36 mm x 5 mm)
    – 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
  • 2006 MicroDrive?
    – 9 GB, 50 MB/s (1.6X/yr capacity, 1.4X/yr BW)
Slide 46
2006 ISTORE

• ISTORE node
  – add 20% pad to MicroDrive size for packaging, connectors
  – then double thickness to add IRAM
  – 2.0" x 1.7" x 0.5" (51 mm x 43 mm x 13 mm)
• Crossbar switches growing by Moore's Law
  – 2x/1.5 yrs => 4X transistors/3yrs
  – crossbars grow by N^2 => 2X switch size/3yrs
  – 16 x 16 in 1999 => 64 x 64 in 2005
• ISTORE rack (19" x 33" x 84") (480 mm x 840 mm x 2130 mm)
  – 1 tray (3" high) => 16 x 32 => 512 ISTORE nodes
  – 20 trays + switches + UPS => 10,240 ISTORE nodes (!)
Slide 47
Benefits of Views and Triggers (2)
• Equally useful for performance and failure management
  – Performance tuning example: DB index management
    » View: access patterns to tables, query predicates used
    » Trigger: access rate to table above/below average
    » Adaptation: add/drop indices based on query stream
  – Failure management example: impending disk failure
    » View: disk error logs, environmental conditions
    » Trigger: frequency of errors, unsafe environment
    » Adaptation: redirect requests to other replicas, shut down disk, generate new replicas, signal operator
Slide 48
More Adaptation Policy Examples

• Self-maintenance and availability
  – maintain two copies of all dirty data stored only in volatile memory
  – if a disk fails, restore original redundancy level for objects stored on that disk
• Performance tuning
  – if accesses to a read-mostly object take more than 10ms on average, replicate the object on another disk
• Both (like HP AutoRAID)
  – if an object is in the top 10% of frequently-accessed objects and there is only one copy, create a new replica; if an object is in the bottom 90%, delete all replicas and stripe the object across N disks using RAID-5
Slide 49
Mechanism Library Benefits

• Programmability
  – libraries provide high-level abstractions of services
  – code using the libraries is easier to reason about, maintain, customize
• Performance
  – libraries can be highly-optimized
  – optimization complexity is hidden by abstraction
• Reliability
  – libraries include code that's easy to forget or get wrong
    » synchronization, communication, memory allocation
  – debugging effort can be spent getting libraries right
    » library users inherit the verification effort