Evaluating Provider Reliability in Risk-aware Grid Brokering
Iain Gourlay

Slide 2: Outline
- AssessGrid background
- Problem Statement
- Basic Reliability
- Analysis of behaviour
- Stationarity Problem
- Weighted Reliability
- Simulations and Results
- What if a provider is unreliable?
- Alternative: Bayesian Inference
- Summary and Conclusions

Slide 3: AssessGrid Background
AssessGrid addresses Risk Management in the Grid.
This is a necessity in the drive towards commercialisation of Grid technology.
- The goal is to move beyond best-effort, using SLAs to specify an agreed-upon level of service.
However,
- For resource providers, offering an SLA with service guarantees and penalties is a business risk!
- For end-users, agreeing to an SLA is a business risk!
A large part of AssessGrid is concerned with supporting providers with tools and methods to:
- Monitor and collect useful data.
- Assess the risk associated with accepting an SLA request, based on this data.

Slide 4: What is risk?
Risk is "hazard, danger, exposure to mischance or peril" (Oxford English Dictionary).
Risk Management is a discipline that addresses the possibility that future events may cause adverse effects.
- Economics, Operations Research, Engineering, Gambling
In Risk Management, risk is quantified with two parameters:
Risk = Probability of Occurrence x Impact
Grid computing: Event is SLA failure!

Slide 5: Scenario

Slide 6: Role of the Broker
Key role: finding and negotiating with providers on behalf of end-users.
Broker can also act as an independent party:
- Providers may have motivation to lie!
- Providers may have unidentified problems in their infrastructure.
Here, we assume the broker is independent and honest.
Broker can give a second opinion on risk assessments.
Broker can agree its own SLAs (virtual provider).

Slide 7: Problem Statement: What do we mean by reliability?
A provider makes an SLA offer:
- includes an estimate of the Probability of Failure (PoF).
Each time an offer is accepted, the details are stored in a database, including:
- Final status (Success/Fail)
- Offered PoF
The problem is: given a provider's past data, can their risk assessments be considered reliable?

Slide 8: What is reliable?
Considering only systematic errors!
Assume s SLAs in the database for the same provider.
- Offered PoFs,
Assume number of fails ~
We define a reliable provider as one that does not systematically underestimate or overestimate the PoF, so that:

Slide 9: Is it normal?

Slide 10: Is it normal? (2)

Slide 11: Basic Reliability: Identifying Systematic Errors
Using the provider's offered PoFs:
The evaluation is based on the following measure:

Slide 12: Basic Reliability: Identifying Systematic Errors (2)

Slide 13: Basic Reliability: Identifying Systematic Errors (3)
We note that and recall the condition, leading to

Slide 14: Analysis: How does the measure behave?
Simple example: m SLAs in the database.
Offered PoF is constant, p.
There is a systematic overestimation/underestimation of the PoF, such that:

Slide 15: Analysis (2)

Slide 16: Stationarity Problem
Conditions are not static!
- Example: 60 red balls in a bag. 40 blue balls in the same bag.
You try to estimate the number of red balls by taking a ball out and replacing it, repeating this 50 times.
Someone is secretly removing a red ball and replacing it with a blue one after every sample.
E(red) = 17.5
Number of reds = 10!

Slide 17: Stationarity Problem (2)
A provider's behaviour could change as a consequence of a variety of factors, e.g.
- A provider's infrastructure is updated.
- A provider's risk assessment methodology or model parameterisation may change.
- A provider's policy may change, for example due to economic considerations.

Slide 18: Weighted Reliability
Use a weighted average, ensuring that more recent SLAs have a larger influence.
A total of mk SLAs is split into k categories of m SLAs each, with the k-th category consisting of the most recent SLAs.
Here, R_i is the basic measure R over the i-th category.

Slide 19: Simulations
A database of SLAs is generated:
- Each SLA object has an offered PoF, a true PoF and a final status.
Reliability is computed.
The process is repeated 10000 times for each scenario.
Simple case considered here:
- Offered PoF is fixed and true PoF is fixed.

Slide 20: Results

Slide 21: Results (2)

Slide 22: Results (3)

Slide 23: Results (4)

Slide 24: Results (5)

Slide 25: What if the provider is unreliable?
Discrete approximation: when an SLA offer is received with an offered PoF of p, estimate the PoF by looking at the failure rate for all SLAs with an offered PoF of ~p.
Then,
  If (|reliability measure| < threshold)
    Believe provider.
  Else
    PoF estimate = numFails(PoF~p) / numSLAs(PoF~p)
Use all SLAs with an offered PoF within x% of the offered PoF in the current SLA.

Slide 26: Weighted Average risk assessment
Split km SLAs into k categories.
Compute the estimated PoF for each category, i = 0, ..., k-1.

Slide 27: Never Trust Doctors
You are tested for a disease, which 2% of the population has.
The test never gives a false negative.
If you are clear, there is still a 5% chance of a false positive.
You test positive. What is the probability you have the disease?

Slide 28: Alternative Approach: Bayesian Inference
The provider offers a linguistic risk assessment, e.g. the failure probability is:
- extremely low: 50%
If the broker or end-user requests the exact PoF value, it can be provided.

Slide 29: Alternative Approach: Bayesian Inference (2)
The broker does not consider the provider's reliability directly. Instead it takes the following approach:
- Having received a linguistic risk assessment for a new SLA, the broker first computes a prior distribution for the PoF, given the linguistic category, by considering data across all other providers.
- The broker computes a posterior distribution, based on the failure rate observed in past SLAs from the same provider with the same linguistic risk assessment.
- The broker returns an object which contains: (PoF_broker, confidence)

Slide 30: Alternative Approach: Bayesian Inference (3)

Slide 31: Summary/Conclusions
A detailed analysis has been carried out for a method to identify providers who are systematically unreliable.
The stationarity problem has been addressed.
- Weighted Average
- Results indicate good performance relative to the basic measure and a moving average.
This can be extended to other measures for non-systematic errors.
A Bayesian approach has also been considered and is promising.
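
Illustrative sketch (not from the talk): one way to make the prior/posterior steps of the Bayesian approach concrete is a Beta-Binomial model for a single linguistic category. Everything here is an assumption made for illustration: the Beta prior, the `prior_strength` pseudo-count, the function and field names, and in particular the definition of `confidence` as the weight that the provider's own data receives in the posterior mean. The slides do not say how the broker actually computes either value.

```python
from dataclasses import dataclass


@dataclass
class BrokerAssessment:
    pof_broker: float  # posterior mean of the PoF
    confidence: float  # hypothetical: weight of the provider's own data, in [0, 1)


def assess(other_fails, other_slas, own_fails, own_slas, prior_strength=10.0):
    """Beta-Binomial sketch for one linguistic category (all names hypothetical)."""
    if other_slas <= 0 or prior_strength <= 0:
        raise ValueError("need other-provider data and a positive prior strength")
    if not 0 <= other_fails <= other_slas or not 0 <= own_fails <= own_slas:
        raise ValueError("fail counts must lie between 0 and the SLA counts")

    # Prior: Beta(a, b) centred on the pooled failure rate of all other
    # providers; prior_strength is an assumed pseudo-count, not a slide value.
    prior_mean = other_fails / other_slas
    a = prior_mean * prior_strength
    b = (1.0 - prior_mean) * prior_strength

    # Posterior after own_fails failures in own_slas past SLAs from this
    # provider in the same category: Beta(a + fails, b + successes).
    post_a = a + own_fails
    post_b = b + (own_slas - own_fails)
    pof = post_a / (post_a + post_b)

    # Assumed confidence: share of the posterior mean driven by own data.
    confidence = own_slas / (own_slas + prior_strength)
    return BrokerAssessment(pof_broker=pof, confidence=confidence)


if __name__ == "__main__":
    # Other providers: 20 failures in 200 SLAs; this provider: 3 in 10.
    print(assess(other_fails=20, other_slas=200, own_fails=3, own_slas=10))
```

With no past SLAs from the provider, the sketch returns the pooled failure rate of the other providers with a confidence of 0; as the provider's own history grows, the estimate moves towards its observed failure rate.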