Evaluation: Personal evaluation, Software validation, Software evaluation


  • Slide 1: Evaluation
  • Slide 2
    - Personal evaluation
    - Software validation
    - Software evaluation
  • Slide 3: Personal evaluation
    - What have I achieved? Have I achieved what I set out to achieve?
    - Where have I fallen short? Why? What could I have done better?
    - Assumes an a priori statement of what you hope/expect/intend to achieve
  • Slide 4: Self-evaluation in your dissertation
    - Dissertation plan: Introduction; Background; Success criteria; Design; Realisation; Evaluation/Testing; Conclusions & Further Work
    - Ch 3 lays out the success criteria by which the success of the project is to be judged
    - Ch 6 will review the work done in Ch 5 against these criteria, including reflection on the overall validity of the approach
    - But this is not software evaluation
  • Slide 5: Program validation
    - Systematically check all functions in your program/application
    - Systematically check all sequences of inputs, etc.
    - Does your program/application do what you think it is supposed to do?
    - This is important, but it is not software evaluation
  • Slide 6: Software evaluation
    - Note: we are using the term "software" in a very loose sense: it could include a program, a web application, or any sort of implementation that does something
    - Evaluate the appropriateness of the software with respect to its intended use
    - A large range of aspects of software can be evaluated
  • Slide 7: Evaluation evaluation
    - In your dissertation you are asked to evaluate what you have achieved
    - Your research could (should?) include an evaluation element
    - So you will need to evaluate your evaluation
    - Your evaluation might have negative results, but still be an informative experiment which you can evaluate positively
    - Your research could even be to compare evaluation schemes!
  • Slide 8: A case study
    - Last year a student of mine did a project which was a comparative evaluation of a number of speech synthesis devices
    - His dissertation discussed:
      - factors in setting up a comparative evaluation
      - a description of the actual evaluation
      - a discussion of the results
    - His personal evaluation then considered how well the experiment (i.e. the evaluation) had been conducted
  • Slide 9: Software evaluation
    - Functionality: does it do what it is supposed to do?
    - Reliability: does it do the same thing under the same conditions?
    - Usability: is it user-friendly?
    - Efficiency: cost, speed, etc.
    - Maintainability: can you modify it? Is it robust?
    - Portability: can it be transferred from one environment/platform to another?
  • Slide 10: Software evaluation
    - Evaluating commercial software is different from evaluating something you have constructed, even if you have constructed it from commercially available components
    - Again, note the difference between validation and evaluation, especially concerning functionality
    - Also, evaluation is not the same as a software review, as found e.g. in a magazine
  • Slide 11: Stakeholders
    - Developers: researchers, commercial developers
    - End-users: actual end-users (is this a single type?), their managers (buyers)
    - Vendors
    - Investors
  • Slide 12: Evaluation types
    - Feasibility/suitability evaluation: for any of the above stakeholders
    - Internal evaluation: for development; iterative testing, to evaluate progress
    - Adequacy evaluation
    - Diagnostic evaluation (debugging)
    - Black-box vs. glass-box evaluation
  • Slide 13: Evaluation types
    - Declarative evaluation: how well does it perform?
      - comparison with a gold standard (ideal performance)
      - comparison with a baseline ("wooden block")
    - Usability evaluation
      - How long does each step take?
      - Is it natural, intuitive?
      - Is it easy to learn to use?
      - Is it well documented?
  • Slide 14: Evaluation types
    - Operational evaluation
      - return on investment (ROI)
      - compatibility with other software
      - consistency of interfaces: internal, and with respect to standards (e.g. Microsoft)
      - failsofts
      - role of humans: preparation, throughput, correction, output
      - backup
      - documentation
      - support
      - corporate situation of the provider
  • Slide 15: Framework for evaluation
    - Definition of the relevant quality characteristics: what is it you want to evaluate? Be specific
    - Definition of attributes pertinent to this quality
    - Definition of a measure able to provide values for these attributes
    - Definition of a method whereby the measurement can be made
  • Slide 16: Framework for evaluation
    - It is important to be sure that:
      - the quality to be evaluated is genuinely a quality that is claimed of the software
      - the attribute to be measured does reflect the quality in question
      - the measure does genuinely measure that attribute (and not some other one)
      - the method is sufficient to deliver a meaningful measure
    - (See the sketch below for how these four definitions fit together)
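A minimal sketch of how the four definitions might fit together as one object. Nothing here is from the lecture: the class and field names (`EvaluationSpec`, `quality`, `attribute`, `measure`, `method`) are illustrative assumptions.

```python
# Illustrative only: a measure maps observations to a value, a method
# produces the observations. All names are assumptions, not from the slides.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class EvaluationSpec:
    quality: str                     # the quality claimed of the software
    attribute: str                   # an observable property reflecting it
    measure: Callable[[Any], float]  # turns observations into a value
    method: Callable[[], Any]        # procedure that produces observations

    def run(self) -> float:
        return self.measure(self.method())
```

Slide 17 instantiates exactly this pattern for a spell checker.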
  • Slide 17: Example: spell checker
    - Function: (a) identify wrongly-spelled words; (b) suggest an appropriate correction (among other features)
    - Quality: ability to do (a)
    - Attribute: success rate in performance of that task
    - Measure: precision, the percentage of wrongly-spelled words correctly identified in a document
    - Method: give it a text with some wrongly-spelled words and count how many it spots (sketched below)
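A sketch of this measure and method, assuming a hypothetical `flag_words(text)` function exposed by the spell checker and a test text whose misspellings are known in advance:

```python
def spelling_precision(flag_words, text, known_misspellings):
    """Percentage of the known wrongly-spelled words that the checker spots.

    flag_words is a hypothetical API: it returns the words the checker
    flags as misspelled in the given text.
    """
    flagged = set(flag_words(text))
    spotted = flagged & set(known_misspellings)
    # assumes at least one known misspelling in the test text
    return 100.0 * len(spotted) / len(known_misspellings)
```

Note that this counts only hits against the known misspellings; as the next slide points out, words wrongly flagged by the checker never enter the calculation.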
  • Slide 18: Example: spell checker
    - A good evaluation, but not A*
    - Success means:
      - identifying misspelled words (true positives)
      - ignoring correctly spelled words (true negatives)
    - So is the measure really appropriate? We are only counting true positives and false negatives: we are not giving credit for the true negatives, nor penalising false positives (see the sketch below)
    - The method is underspecified: how much text? What sort of text?
    - Should we take into account what we know about spell checking (a certain class of error is very hard to detect)?
    - Should we classify misspellings and measure different classes separately?
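One way to repair the measure, sketched under the same hypothetical `flag_words` assumption: compute the full confusion matrix, so that true negatives earn credit and false positives cost something.

```python
def confusion_metrics(flagged, misspelled, all_words):
    """Precision/recall/accuracy from the full confusion matrix.

    flagged: words the checker reported; misspelled: the gold list;
    all_words: every distinct word in the test text.
    """
    flagged, misspelled = set(flagged), set(misspelled)
    total = len(set(all_words))
    tp = len(flagged & misspelled)        # misspellings correctly flagged
    fp = len(flagged - misspelled)        # correct words wrongly flagged
    fn = len(misspelled - flagged)        # misspellings missed
    tn = total - tp - fp - fn             # correct words rightly left alone
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }
```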
  • Slide 19: Attributes
    - Different attribute types imply different measures/methods
    - Example: dish-washers (see the sketch after the table)

      | Name | Racks | Options* | Water consumption | Noise level | Cleanliness |
      |------|-------|----------|-------------------|-------------|-------------|
      | ABC  | 2     | a, b     | 10                | noisy       | ***         |
      | EFG  | 3     | b        | 6                 | quiet       | *           |
      | PQR  | 2     | a        | 5                 | very noisy  | **          |

      * a = pre-wash rinse cycle; b = independent rinse cycle
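A sketch of why attribute types matter, using the table above: numeric, set-valued and ordinal attributes each need their own encoding before they can be compared. The encodings (`NOISE_ORDER`, `STARS`) are my assumptions, not from the slides.

```python
# Ordinal attributes need an explicit ordering before they can be compared.
NOISE_ORDER = {"quiet": 0, "noisy": 1, "very noisy": 2}   # assumed encoding
STARS = {"*": 1, "**": 2, "***": 3}                       # assumed encoding

dishwashers = {
    "ABC": {"racks": 2, "options": {"a", "b"}, "water": 10, "noise": "noisy", "clean": "***"},
    "EFG": {"racks": 3, "options": {"b"}, "water": 6, "noise": "quiet", "clean": "*"},
    "PQR": {"racks": 2, "options": {"a"}, "water": 5, "noise": "very noisy", "clean": "**"},
}

# numeric: compare directly; ordinal: compare via rank; set-valued: membership tests
thriftiest = min(dishwashers, key=lambda n: dishwashers[n]["water"])              # PQR
quietest = min(dishwashers, key=lambda n: NOISE_ORDER[dishwashers[n]["noise"]])   # EFG
cleanest = max(dishwashers, key=lambda n: STARS[dishwashers[n]["clean"]])         # ABC
```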
  • Slide 20: Methods and measures
    - Objective measures
      - measuring, counting, timing
      - doing a specific task
      - for usability issues, you need to evaluate with a number of subjects (not just do it yourself)
    - Comparison against a gold standard
      - precision
      - recall
      - other measures that also take false positives and negatives into account
  • Slide 21: Methods and measures
    - Subjective measures
      - interview after use
      - feedback questionnaire
        - rating scales (usually 5 or 7 points, plus DK and N/A options)
        - open-ended questions?
        - questions should relate to some specific point
        - repeat (some) questions in a disguised way (see the scoring sketch below)
    - Performance analysis
      - video the session, analyse it afterwards
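A sketch of scoring such a questionnaire, assuming a 5-point scale where some questions are disguised (reverse-worded) repeats. The question ids and pairings are invented for illustration.

```python
SCALE_MAX = 5
REVERSE_WORDED = {"q3", "q7"}                    # hypothetical disguised repeats
DISGUISED_PAIRS = [("q1", "q3"), ("q2", "q7")]   # hypothetical pairings

def score(responses):
    """responses: question id -> 1..5, or 'DK'/'NA' (excluded from scoring)."""
    scores = {}
    for qid, answer in responses.items():
        if answer in ("DK", "NA"):
            continue
        # flip reverse-worded items so all scores point the same way
        scores[qid] = SCALE_MAX + 1 - answer if qid in REVERSE_WORDED else answer
    return scores

def inconsistency(scores, pair):
    """Gap between a question and its disguised repeat (0 = consistent)."""
    a, b = pair
    return abs(scores[a] - scores[b]) if a in scores and b in scores else None
```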
  • Slide 22: Methods and measures
    - Don't try to measure too many different things with the same instrument
      - though this can be possible to some extent
      - but extraneous factors need to be controlled carefully
    - Problem of statistical significance: do you have enough subjects to know that the differences (and similarities) are not just random fluctuations? (see the sketch below)
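A sketch of that significance check using Welch's t-test from SciPy (one standard choice among several); the ratings are made-up illustrative data.

```python
from scipy import stats

# made-up satisfaction ratings from two groups of subjects
device_a = [4, 5, 3, 4, 4, 5, 3]
device_b = [3, 3, 4, 2, 3, 3, 4]

# Welch's t-test: does not assume equal variances between the groups
t_stat, p_value = stats.ttest_ind(device_a, device_b, equal_var=False)
print(f"p = {p_value:.3f}: "
      + ("likely a real difference" if p_value < 0.05
         else "could be random fluctuation"))
```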
  • Slide 23: Example
    - Simulated doctor-patient interviews with patients with limited English, using a computer-based communication device with symbols and digitised speech
    - two devices (laptop + mousepad, tablet + stylus)
    - doctors and nurses
    - literate and illiterate patients
  • Slide 24: (image only; no recoverable text)
  • Slide 25: Example
    - General question: could they get to the end of the consultation? (How did we measure this?)
    - Objective measures
      - How long did it take?
      - How many questions did they ask?
      - How many answers were (apparently) correctly understood?
    - Subjective measures
      - feedback questionnaire with satisfaction ratings
      - open-ended questions about specific issues
  • Slide 26: Subjects
    - Many types of evaluation require volunteers
      - How many do you need?
      - Where will you get them from?
      - Are they suitable? Exclusion factors: e.g. prior familiarity with your topic
      - Need to control for irrelevant differences in their profiles
      - How will you guarantee their cooperation?
    - Ethical issues
      - Officially, you need ethics clearance for any experiment involving living beings!
      - In any case, it is important that volunteers know what they are letting themselves in for
      - It is also important that you don't waste people's time, e.g. by having them evaluate a useless task (for example as a baseline)
  • Slide 27: Summary
    - What are you trying to evaluate? Be specific, not general (e.g. avoid "What do you think of this interface?")
    - What is the best way to measure what you are interested in?
    - How feasible is it to do what you want?
    - [After Easter]: how to write it all up!
  • Slide 28: Next session
    - No class next week
    - First week after Easter (19 Apr): no class on Thursday
    - Instead, practical sessions on Library Resources with Barry White
      - choose one of three sessions, each at 2pm-4pm: Wed 18, Thu 19 or Fri 20 April, in the Joule Library
    - Do we need a sign-up sheet?