Theory of System Administration, DANSS Seminar, Feb 23rd, 2003, Elliot Jaffe

Danss - Theory of SysAdmin


Theory of System Administration

DANSS Seminar
Feb 23rd, 2003
Elliot Jaffe


Outline

What is System Administration
Problems in System Administration
Theory overview
Results
Research directions


What is System Administration?

What do you think?


What is System Administration

In computer technology, a set of functions that provides support services, ensures reliable operations, promotes efficient use of the system, and ensures that prescribed service-quality objectives are met.

Synonym: system management.

US Federal Standard 1037C


System Administration is

The function that provides:

Reliability – Stable, consistent service

Efficiency – Performance

Predictability – Service Level Agreement


CS HUJI System Administration

Infrastructure
Operating Systems
Networking
Account Administration
Software Licensing, Installation and Support
Education


What you don’t see

Budgets
Cost-benefit analysis
Vendor selection
Service contracts
Long-term planning
Policy creation


Problems in Sys Admin

Strategic

Tactical


Strategic Problems

Economic cost/benefit analysis
  How much disk space should be purchased in the next year?
  Should we buy one new router, or do we need a fail-over pair?
  If we get 25% additional students, what resources will we need?


Strategic Problems #2

What is the right level of disk space quotas?

Should we use a VLAN to localize network traffic?


Tactical Problems

What is the best way to maintain multiple systems?
How do we apply patches?
How should we roll out an OS change?
How do we support multiple configurations?
How many configurations should we support?
How do we use version control as part of system administration?


A complete theory should enable

Policy determination and evaluation
Strategic decisions about resource usage and allocation
Interactions between users and system for resources
Productivity considerations (economics of the system)
Empirical verification of strategies and policies
Efficiency of policy and its implementation
Efficiency of the system in doing its job


Theory of System Administration

A group of computers is an evolving, stochastic system viewable at multiple levels of detail.


Configuration Space

The memory state of the computer.

The set of bits that define the computer state.

Example: The state of the bits in primary memory and on secondary media (disks).


Time

Time is a discrete value. For averaging purposes, we allow it to take on real values.

Example: The system clock is discrete, having values that are multiples of the clock period Tc: t = 0, Tc, 2Tc, …, nTc.


Configuration

A pattern of values associated with each point on the configuration space.

Example: The state of all bits in main memory at time t. This pattern changes over time.


Averaging

Over time scales much larger than Tc, the average properties of the system can be treated as a continuum approximation, i.e. as real functions of time.

Example: The number of non-zero bits at any real value of time.
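The averaging step can be sketched in Python as follows. This is a minimal illustration, not part of the talk: the helper names `count_set_bits` and `windowed_average`, the two-byte configurations, and the window of 4 ticks are all assumptions chosen for the example. It counts non-zero bits at each discrete tick, then averages over blocks of ticks to get a real-valued quantity on a coarser time scale.

```python
def count_set_bits(config: bytes) -> int:
    """Number of non-zero bits in a configuration (a byte string)."""
    return sum(bin(b).count("1") for b in config)


def windowed_average(samples, window):
    """Mean of each consecutive, non-overlapping block of `window` samples."""
    return [
        sum(samples[i:i + window]) / window
        for i in range(0, len(samples) - window + 1, window)
    ]


# Hypothetical configurations observed at t = 0, Tc, 2Tc, ... (one per tick).
configs = [bytes([n % 256, (3 * n) % 256]) for n in range(8)]
per_tick = [count_set_bits(c) for c in configs]   # integer-valued, one per tick
averaged = windowed_average(per_tick, window=4)   # real-valued, coarser scale
```

The per-tick counts are integers tied to the clock; the windowed means are real numbers that only make sense on time scales much larger than Tc.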


Scales

Transition from low-level to high-level

Group objects together to form new objects

Refer to state of object over time

Level  Example
6      LANs
5      Users, VMs
4      Files
3      int, float, char
2      bytes, words
1      bits
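The move from one level to the next can be sketched for the lowest three levels. This is an illustration only: the function names and the choice of unsigned big-endian integers are assumptions, not something the talk specifies.

```python
def bits_to_bytes(bits):
    """Level 1 -> level 2: group a flat list of bits into bytes (8 bits each)."""
    if len(bits) % 8 != 0:
        raise ValueError("expected a whole number of bytes")
    return bytes(
        int("".join(str(b) for b in bits[i:i + 8]), 2)
        for i in range(0, len(bits), 8)
    )


def bytes_to_int(raw):
    """Level 2 -> level 3: read the bytes as one unsigned big-endian integer."""
    return int.from_bytes(raw, "big")


bits = [0, 0, 0, 0, 0, 0, 0, 1,   0, 0, 0, 0, 0, 0, 1, 0]
raw = bits_to_bytes(bits)      # two bytes: 0x01, 0x02
value = bytes_to_int(raw)      # one integer built from those bytes
```

Each level groups objects from the level below into a new object whose state can then be followed over time.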


Closed Dynamical Systems

A closed dynamical system consists of a configuration space, an initial configuration and a rule for subsequent time development

Closed dynamical systems are deterministic.

Example: A standalone computer without any external input is a closed dynamical system.
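A closed dynamical system can be sketched as a starting configuration plus an update rule applied repeatedly. The 8-bit configuration space and the rule `lcg_rule` below are hypothetical, chosen only to show that the same three ingredients always give the same trajectory.

```python
def evolve(initial, rule, steps):
    """Return the trajectory of a closed system: no input besides `initial`."""
    trajectory = [initial]
    for _ in range(steps):
        trajectory.append(rule(trajectory[-1]))
    return trajectory


def lcg_rule(state):
    """Hypothetical update rule on an 8-bit configuration space (0..255)."""
    return (5 * state + 3) % 256


run_a = evolve(initial=7, rule=lcg_rule, steps=10)
run_b = evolve(initial=7, rule=lcg_rule, steps=10)
# Same configuration space, initial configuration and rule: same trajectory.
```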


Interactions

An interaction between two systems is an endomorphism on the combined systems such that both systems determine the time developments of one another.

Example: Two standalone computers connected via a network and synchronizing system times.
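The clock example can be sketched as a single update map on the combined state of both machines, so that each clock's next value depends on the other's. The correction rule and the `coupling` parameter are assumptions made for illustration; the talk does not say how the clocks synchronize.

```python
def sync_step(state, coupling=0.25):
    """One step of the combined system (clock_a, clock_b) -> (clock_a, clock_b).

    Each clock ticks by one unit and then corrects toward the other clock.
    """
    a, b = state
    return (a + 1 + coupling * (b - a), b + 1 + coupling * (a - b))


def run(state, steps, coupling=0.25):
    """Apply `sync_step` repeatedly and return the final combined state."""
    for _ in range(steps):
        state = sync_step(state, coupling)
    return state


clock_a, clock_b = run((100.0, 108.0), steps=20)
offset = abs(clock_a - clock_b)   # shrinks toward zero while the coupling is on
```

With `coupling=0.0` the two clocks evolve independently and keep their initial offset, which is the non-interacting case.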


Environment

An ensemble of mutually interacting systems.

Example: A user interacting with a computer. People are not standalone!


Open Dynamical System

Projection of an ensemble of interacting systems onto the state of a given system.

The configuration state of an open system is unpredictable over any interval dt ~ Tc.

Does this mean that all is lost?


Stability

Assume that there exists some time scale on which it is possible to predict the average state of the systems in question.

We are not interested in managing systems which cannot achieve a minimal level of stability, since these systems cannot perform any reliable function.


Multiple Time Scales

Short term: Tc, the computer clock

Medium term: human time > 10^7 Tc

Long term: months and years > 10^7 human time


Components of System State

The state of a system at any given time is composed of a slowly varying local average and a rapidly fluctuating stochastic remainder.

Are these systems stable?

[Figure: system state plotted against time]
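The split into a slowly varying local average and a fluctuating remainder can be sketched with a centred moving average. The function name `decompose`, the `half_width` parameter, and the synthetic series (a slow linear drift plus an alternating term) are assumptions for illustration.

```python
def decompose(series, half_width):
    """Split a series into a centred moving average and the remainder.

    Near the ends, the average uses only the samples that are available.
    """
    average = []
    for i in range(len(series)):
        lo = max(0, i - half_width)
        hi = min(len(series), i + half_width + 1)
        average.append(sum(series[lo:hi]) / (hi - lo))
    remainder = [x - m for x, m in zip(series, average)]
    return average, remainder


# Synthetic state: slow drift (0.1 per tick) plus a fast alternating term.
series = [0.1 * t + (1.0 if t % 2 else -1.0) for t in range(20)]
local_average, fluctuation = decompose(series, half_width=2)
```

By construction, the two parts add back up to the original series at every tick.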


Tasks

A task is a representation of an autonomous process executed on related sets of state.

A task is closed if after execution, it returns the system to the original state.

A task is open if after execution, it has changed the overall system state.
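The closed/open distinction can be sketched by comparing the state before and after a task runs. The dictionary model of system state, the file paths, and both example tasks are hypothetical.

```python
import copy


def is_closed(task, state):
    """True if running `task` leaves the state equal to what it was before."""
    before = copy.deepcopy(state)
    working = copy.deepcopy(state)   # run on a copy; the caller's state is untouched
    task(working)
    return working == before


def compile_in_scratch(state):
    """Hypothetical closed task: creates a scratch file, then removes it."""
    state["files"]["/tmp/build.o"] = 4096
    del state["files"]["/tmp/build.o"]


def append_to_log(state):
    """Hypothetical open task: the log file is larger afterwards."""
    state["files"]["/var/log/app.log"] += 100


system = {"files": {"/var/log/app.log": 1000}}
```

Here `is_closed(compile_in_scratch, system)` holds and `is_closed(append_to_log, system)` does not, because only the second task changes the overall state.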


Maintenance Tasks

A maintenance task is a task that reduces the total rate of change of the average configuration state.

Example: Deletion of accumulated garbage


Policy

A policy is an average specification of equivalent system behaviors.

A set of system states that are equivalent over the given time period.

A policy is neither good nor bad. It does not necessarily lead to stability or chaos.


Policy - Examples

Users are restricted to a known quota of file system space.

All computers must run Microsoft Office.

Only port 80 will be open on network servers.

SSH will be used for all remote computer access.
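One way to read a policy as a set of equivalent states is as a predicate: every state that satisfies it counts as equivalent. The sketch below encodes two of the example policies this way. The state layout, the 500 MB quota, and the function name are assumptions for illustration.

```python
def satisfies_policy(state, quota_mb=500, allowed_ports=frozenset({80})):
    """True if every user is within quota and no port outside the allowed set is open."""
    within_quota = all(used <= quota_mb for used in state["usage_mb"].values())
    ports_ok = set(state["open_ports"]) <= allowed_ports
    return within_quota and ports_ok


# Two different concrete states that the policy treats as equivalent.
state_a = {"usage_mb": {"alice": 120, "bob": 480}, "open_ports": [80]}
state_b = {"usage_mb": {"alice": 300, "bob": 10}, "open_ports": [80]}
# A state outside the policy: one user over quota and an extra open port.
state_c = {"usage_mb": {"alice": 900, "bob": 10}, "open_ports": [80, 8080]}
```

The predicate says nothing about whether the policy is good or bad; it only separates states that comply from states that do not.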


Convergence

A convergent average policy is one whose tasks result in an equivalent configuration for all sufficiently large time scales.

A convergent average policy is one whose average behavior in time ends in a fixed average state between two sufficiently different time values.


Convergence - Example

Deleting temporary files on a regular basis is a convergent policy since it returns the system to a known state (i.e. a given amount of free file system space).
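The temporary-file example can be simulated in a few lines. The capacity, growth rate, and cleanup interval are made-up numbers, and steady linear growth of temporary files is a simplifying assumption.

```python
def simulate_free_space(capacity_mb, temp_growth_mb, cleanup_every, ticks):
    """Free space per tick when temp files grow steadily and are purged periodically."""
    temp_mb = 0
    free = []
    for t in range(1, ticks + 1):
        temp_mb += temp_growth_mb
        if t % cleanup_every == 0:
            temp_mb = 0          # the maintenance task: delete temporary files
        free.append(capacity_mb - temp_mb)
    return free


free = simulate_free_space(capacity_mb=1000, temp_growth_mb=10, cleanup_every=5, ticks=20)
# Average free space over each cleanup cycle of 5 ticks.
cycle_means = [sum(free[i:i + 5]) / 5 for i in range(0, 20, 5)]
# The same system with no cleanup inside the simulated interval.
no_cleanup = simulate_free_space(capacity_mb=1000, temp_growth_mb=10, cleanup_every=21, ticks=20)
```

With regular cleanup, free space returns to the same value at the end of every cycle and the per-cycle average stays constant; without it, free space only decreases.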


Persistent State

A persistent state is a configuration for which the probability of returning to an equivalent configuration at a later time is 1.

Persistence is reflected in the property that the rate of change of the average state is much slower than the rate of change of fast moving variations.


Persistent States

The fast variations extend over several complete cycles before any appreciable change in the average is seen.

[Figure: system state plotted against time]


Theorem

In an open system, a policy specifies a class of equivalent persistent states if and only if the policy exhibits average convergence.

You can maintain the state of the system if and only if your policy consistently returns the system to a similar state, i.e. the average resource usage is constant over the policy's time scale.


Implications

System Administration is the development, specification and implementation of environments and maintenance tasks with the goal of creating a persistent average state.


Strategy

Type I - Stochastic models

Type II - Semantic models


Type I - Stochastic models

Analyze what is happening on multiple time scales
Describe locally averaged states
Model known boundary conditions
Empirical measurements of existing systems
Predictive modeling of systems based on measurements


Problems with Stochastic Models

Statistical measurements are rare
No experimental repeatability
Conditions of measurements are constantly changing
Absolute definitions are impossible
People cannot be described by a small number of characteristics


Stochastic modeling -- Uses

Strategic planning
  Do we need to buy more file servers?

Problem identification
  Why is user X using 300% of the normal disk quota?
  Why is computer Y rebooting twice a week when all other systems are stable for months?
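Problem identification of the "user X" kind can be sketched as a comparison against typical usage. Treating "normal" as the median across users and using a factor of 3 (300%) are assumptions for this example, as are the user names and numbers.

```python
from statistics import median


def unusual_users(usage_mb, factor=3.0):
    """Users whose usage is at least `factor` times the median usage."""
    typical = median(usage_mb.values())
    return sorted(u for u, used in usage_mb.items() if used >= factor * typical)


# Hypothetical measurements of disk usage per user, in MB.
usage = {"alice": 100, "bob": 120, "carol": 90, "x": 400}
flagged = unusual_users(usage)
```

The sketch only flags the outlier; explaining why that user differs still needs the measurements and models discussed above.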


Strategic models

Analyze what might be changed in a system.

Formulate as a game of strategy.

Achieve larger goals than just maintaining a persistent state.


Strategic Goals

Sys Admin: Keep the system alive and running so that users can perform a maximum amount of work

Benign User: Produce useful work using the system (consumes resources)

Malicious User: Maximize control of system resources


Strategic tools

Game Theory
  Contests between System Administrator and malicious users.

System Downtime: Mean time to repair / Mean time before failure
  Minimize MTTR or maximize MTBF?

Levels of monitoring: At what point does the cost of monitoring overwhelm the benefit?
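The downtime ratio given above, MTTR divided by MTBF, can be computed directly to compare the two options. The hour figures are made up, and the function name is an assumption.

```python
def downtime_ratio(mttr_hours, mtbf_hours):
    """System downtime as defined above: mean time to repair / mean time before failure."""
    return mttr_hours / mtbf_hours


baseline = downtime_ratio(mttr_hours=4, mtbf_hours=1000)
halved_mttr = downtime_ratio(mttr_hours=2, mtbf_hours=1000)    # repair twice as fast
doubled_mtbf = downtime_ratio(mttr_hours=4, mtbf_hours=2000)   # fail half as often
```

Under this ratio, halving MTTR and doubling MTBF give the same result, so the choice between them turns on what each costs to achieve, which this sketch does not model.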


Current research

Recovering file space
System upgrades
Quota systems


Recovering File Space

How do you clean unused files?

Competition between users and admins

Trade-off between
  having enough space to operate
  users recreating temp files that were deleted
  users “grabbing” space for later use


Patch Application

How do you apply changes to a distributed system?

Divergence
Convergence
Congruence


Quota application

What is the correct way to set file system quotas?

By category
Dynamically assign users to groups
Set group to lowest maximal value


Bibliography

Burgess, M. 2003. On the Theory of System Administration. Journal of the ACM.

Traugott, S. and Brown, L. 2002. Why Order Matters: Turing Equivalence in Automated Systems Administration. LISA 2002.

Gilfix, M. 2002. Holistic Quota Management: The Natural Path to a Better, More Efficient Quota System. LISA 2002.