Scalable Computing on Open Distributed Systems Jon Weissman University of Minnesota National E-Science Center CLADE 2008


What is the Problem?

• Open distributed systems
  – Tasks are submitted to the “system” for execution
  – Workers do the computing: execute a task, return an answer
• The Challenge
  – Computations that are erroneous or late are less useful
  – Nodes fail, make errors, get hacked, or are misconfigured
  – Unpredictable time to return answers
• Both local- and wide-area systems
  – Focus on volunteer wide-area systems

Shape of the Solution

• Replication
• Works for all sources of unreliability
  – computation and data
• How to do this intelligently and scalably?

Replication Challenges

• How many replicas?
  – too many: waste of resources
  – too few: application suffers
• Most approaches assume ad-hoc replication
  – under-replicate: task re-execution (↑ latency)
  – over-replicate: wasted resources (↓ throughput)
• Using information about the past behavior of a node, we can intelligently size the amount of redundancy

Problems with ad-hoc replication

[Figure: task x sent to group A, task y sent to group B; legend: unreliable node, reliable node]

System Model

[Figure: nodes labeled with reputation ratings 0.9, 0.4, 0.8, 0.8, 0.7, 0.8, 0.8, 0.7, 0.4, 0.3]

• Reputation rating r_i
  – degree of node reliability
• Dynamically size the redundancy based on r_i
• Note: variable-sized groups
• Assume no correlated errors; relax later

Smart Replication

• Rating based on past interaction with clients
  – probability r_i over a window
    • correct/total or timely/total
  – extend to a worker group (assuming no collusion) => likelihood of correctness (LOC)
• Smarter Redundancy
  – variable-sized worker groups
  – intuition: higher-reliability clients => smaller groups
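The windowed rating described above (correct/total or timely/total over recent interactions) can be sketched as follows. This is a minimal illustration, not the talk's implementation: the class name `Reputation`, the window length, and the `prior` used before any history exists are all assumptions.

```python
from collections import deque


class Reputation:
    """Sliding-window rating: fraction of the last `window` results that were good."""

    def __init__(self, window=20, prior=0.5):
        # True = the result was correct (or timely); old entries fall off the left.
        self.outcomes = deque(maxlen=window)
        # Assumed rating for a worker with no history yet (hypothetical choice).
        self.prior = prior

    def record(self, good):
        """Store the outcome of one interaction with this worker."""
        self.outcomes.append(bool(good))

    @property
    def rating(self):
        """Return r_i = good / total over the current window."""
        if not self.outcomes:
            return self.prior
        return sum(self.outcomes) / len(self.outcomes)
```

The same class covers both rating styles: pass `good=True` for a correct answer when rating correctness, or for an answer that arrived before the deadline when rating timeliness.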

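LOC of a group of 2k+1 workers under majority voting (ε_i = 1 if worker i returns a correct answer, 0 otherwise):

$$\mathrm{LOC} = \sum_{m=k+1}^{2k+1} \; \sum_{\{\varepsilon \,:\, \sum_{i=1}^{2k+1} \varepsilon_i = m\}} \; \prod_{i=1}^{2k+1} r_i^{\varepsilon_i} \, (1 - r_i)^{1 - \varepsilon_i}$$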
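As a worked instance of the group likelihood of correctness, assume a three-worker group (k = 1) of independent workers with ratings r_1 = r_2 = r_3 = 0.8. These values are illustrative and not taken from the slides. Under majority voting, at least two of the three workers must be correct:

$$\mathrm{LOC} = r_1 r_2 r_3 + r_1 r_2 (1 - r_3) + r_1 (1 - r_2) r_3 + (1 - r_1) r_2 r_3 = 0.512 + 3(0.128) = 0.896$$

A single worker with rating 0.9 already yields an LOC of 0.9, which illustrates why more reliable workers allow smaller groups.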

Terms

• LOC (Likelihood of Correctness) of a group g
  – the ‘actual’ probability of getting a correct or timely answer from a group g of clients
• Target LOC (target)
  – the success rate that the system tries to ensure while forming client groups

Scheduling Metrics

• Guiding metrics
  – throughput: the number of tasks successfully completed in an interval
  – success rate s: ratio of throughput to number of tasks attempted

Algorithm Space

• How many replicas?
  – algorithms compute how many replicas are needed to meet a success threshold
• How to reach consensus?
  – Majority (better for Byzantine threats)
  – M-1 (better for timeliness)
  – M-2 (2 matching)
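The consensus options listed above can be sketched as two small functions. This reads "M-1" and "M-2" as accepting an answer once 1 or 2 matching results have arrived; that reading, and the function names `majority` and `m_first`, are assumptions rather than definitions taken from the slides.

```python
from collections import Counter


def majority(answers, group_size):
    """Return the answer reported by more than half of the group, else None."""
    if not answers:
        return None
    value, count = Counter(answers).most_common(1)[0]
    return value if count > group_size // 2 else None


def m_first(answers, m):
    """Scan answers in arrival order and return the first value seen m times."""
    seen = Counter()
    for answer in answers:
        seen[answer] += 1
        if seen[answer] >= m:
            return answer
    return None
```

With `m=1` the first answer to arrive is accepted without waiting for the rest of the group, which favors timeliness; majority voting waits for more results but tolerates a minority of wrong answers.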

One Scheduling Algorithm

Evaluation

• Baselines
  – Fixed algorithm: statically sized, equal groups; uses no reliability information
  – Random algorithm: forms groups by randomly assigning nodes until the target is reached
• Simulated a wide variety of node reliability distributions

Experimental Results: correctness

Simulation: byzantine behavior only … majority voting

Role of target

• Key parameter
  – hard to specify
• Too large
  – groups will be too large (low throughput)
• Too small
  – groups will be too small (low success rate)
• Instead, adaptively learn it
  – bias toward throughput, success rate s, or both
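One simple way to learn the target from feedback is a step rule driven by the observed success rate. The sketch below is only an illustration of that idea: the function name `adapt_target`, the `desired_success` goal, the step size, and the clamping bounds are all hypothetical, and the slides do not specify the adaptive algorithm's actual update rule.

```python
def adapt_target(target, succeeded, attempted,
                 desired_success=0.9, step=0.02, lo=0.5, hi=0.999):
    """Adjust the target LOC after one scheduling interval.

    Returns (new_target, observed_success_rate). With no tasks attempted the
    target is left unchanged and the success rate is reported as None.
    """
    if attempted == 0:
        return target, None
    success = succeeded / attempted
    if success < desired_success:
        target += step   # too many failures: ask for more redundancy
    else:
        target -= step   # success is high enough: shrink groups for throughput
    return min(hi, max(lo, target)), success
```

Raising `desired_success` biases the rule toward success rate, while lowering it biases the rule toward throughput.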

Adaptive Algorithm

What about time?

• Timeliness
• A result returned after time T is less (or not) useful
  – (1) soft deadlines
    • user interaction, visualization output from computation
  – (2) hard deadlines
    • need to get X results done before the HPDC/NSDI/… deadline
• Live experimentation on PlanetLab
• Real application: BLAST

Some PL data

Computation

- both across and within nodes

Communication

- both across and within nodes

Temporal variability

PL Environment

Ridge is our live system that implements reputation

120 wide-area nodes, fully correct, M-1 consensus

3 Timeliness environments based on deadlines

D=120s D=180s D=240s

Experimental Results: timeliness

Best BOINC (BOINC*), conservative (BOINC-) vs. RIDGE

Makespan Comparison

Collusion

• Suppose errors are correlated?
• How?
  – Widespread bug (hardware or software)
  – Misconfiguration
  – Virus
  – Sybil attack
  – Malicious group
• With Emmanuel Jeannot (Inria)

Key Ideas

• Execute a task => answer groups
  – A1, A2, … Ak
  – For each Ai there are associated workers Wi1, Wi2, … Win
  – Pcollusion(workers in Ai)
• Learn probability of correlated errors
  – Pcollusion(W1, W2)
• Estimate probability of group correlated errors
  – Pcollusion(G), G = [W1, W2, W3, …], via f{Pcollusion(Wi, Wj), for all i, j}
• Rank and select answer
  – Pcollusion(G) and |G|
  – Update matrix: Pcollusion(W1, W2)
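The learn, estimate, rank, and update steps above can be sketched with a pairwise matrix stored as counters. Several choices here are assumptions made only for illustration: the class name `CollusionTracker`, the use of the mean pairwise estimate as the function f, the lexicographic ranking (lowest estimated collusion first, then largest group), and the rule that a pair counts as colluding when both workers backed a rejected answer.

```python
from collections import defaultdict
from itertools import combinations


class CollusionTracker:
    def __init__(self):
        self.hits = defaultdict(int)    # pair -> times both backed a rejected answer
        self.total = defaultdict(int)   # pair -> times both worked on the same task

    def p_pair(self, a, b):
        """Estimate of Pcollusion(a, b); 0.0 when the pair has no shared history."""
        key = frozenset((a, b))
        return self.hits[key] / self.total[key] if self.total[key] else 0.0

    def p_group(self, workers):
        """Estimate of Pcollusion(G): mean over all pairs (assumed form of f)."""
        pairs = list(combinations(workers, 2))
        if not pairs:
            return 0.0
        return sum(self.p_pair(a, b) for a, b in pairs) / len(pairs)

    def select(self, answer_groups):
        """answer_groups maps answer -> list of workers that returned it."""
        return min(
            answer_groups,
            key=lambda ans: (self.p_group(answer_groups[ans]), -len(answer_groups[ans])),
        )

    def update(self, answer_groups, chosen):
        """Update the pairwise matrix once `chosen` has been accepted."""
        everyone = [w for group in answer_groups.values() for w in group]
        for a, b in combinations(everyone, 2):
            self.total[frozenset((a, b))] += 1
        for ans, group in answer_groups.items():
            if ans != chosen:
                for a, b in combinations(group, 2):
                    self.hits[frozenset((a, b))] += 1
```

Once a pair has been seen agreeing on a rejected answer, a later answer group containing that pair is ranked below a smaller group with no collusion history, so the largest group no longer wins automatically. A real ranking rule would likely weigh |G| more carefully than this lexicographic ordering does.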

Bootstrap Problem

• Building the collusion matrix
• Must first “bait” colluders
  – Over-replicate such that the majority group is still correct, to expose colluders
  – Parameters: the probability of worker collusion and the probability that colluders fool the system

• Given group size k
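The over-replication idea can be illustrated with a binomial calculation: pick a group size large enough that colluders are unlikely to form the majority. The sketch below assumes each worker colludes independently with probability `p_collude` and accepts a group size once the chance of a colluding majority drops to `bound` or below. Both parameter names, the independence assumption, and the odd-size search are hypothetical simplifications, since the slide's own symbols and formula are not recoverable.

```python
from math import comb


def p_colluders_win(k, p_collude):
    """P(colluders form a strict majority of k workers), assuming independence."""
    return sum(
        comb(k, m) * p_collude**m * (1 - p_collude) ** (k - m)
        for m in range(k // 2 + 1, k + 1)
    )


def bootstrap_group_size(p_collude, bound, max_k=99):
    """Smallest odd k whose colluding-majority probability is <= bound, else None."""
    for k in range(1, max_k + 1, 2):
        if p_colluders_win(k, p_collude) <= bound:
            return k
    return None
```

For example, with 30% colluders and a 10% tolerance for being fooled, this model asks for groups of nine workers during bootstrapping.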

4. 1 group, 30% colluders, always collude
5. Same group, colludes 30% of the time
7. 2 groups (40%, 30% colluders)

[Figure panels: correctness, throughput]

Summary

• Reliable, scalable computing
  – correctness and timeliness
• Future work
  – combined models and metrics
  – workflows: coupling data and computation reliability

Visit ridge.cs.umn.edu to learn more
