Transcript

1

Evaluation

2

Evaluation

• Personal evaluation

• Software validation

• Software evaluation

3

Personal evaluation

• What have I achieved?

• Have I achieved what I set out to achieve?

• Where have I fallen short?

• Why?

• What could I have done better?

• Assumes an a priori statement of what you hope/expect/intend to achieve

4

Self evaluation in your dissertation

Dissertation plan
• Introduction
• Background
• Success criteria
• Design
• Realisation
• Evaluation/Testing
• Conclusions & Further Work

• Ch 3 lays out success criteria by which success of project is to be judged

• Ch 6 will review work done in Ch 5 with respect to these criteria, including reflection on overall validity of the approach

• But this is not “software evaluation”

5

Program validation

• Systematically check all functions in your program/application

• Systematically check all sequences of inputs etc.

• Does your program/application do what you think it is supposed to do?

• This is important, but ...

• This is not “software evaluation”

6

Software evaluation

Note: We are using the term “software” in a very broad sense: it could include a program, a web application, or any sort of implementation that does something

• Evaluate the appropriateness of the software with respect to its intended use

• Large range of aspects of software that can be evaluated

7

Evaluation evaluation

• In your dissertation you are asked to evaluate what you have achieved

• Your research could (should?) include an evaluation element

• So you will need to evaluate your evaluation
• Your evaluation might have negative results, but still be an informative experiment which you can evaluate positively

• Your research could even be to compare evaluation schemes!

8

A case study

• Last year a student of mine did a project which was a comparative evaluation of a number of speech synthesis devices

• His dissertation discussed
  – Factors in setting up a comparative evaluation
  – A description of the actual evaluation
  – A discussion of the results

• His personal evaluation then considered how well the experiment (i.e. the evaluation) had been conducted

9

Software evaluation

• Functionality – does it do what it is supposed to do?

• Reliability – does it do the same thing under the same conditions?

• Usability – is it user-friendly?
• Efficiency – cost, speed, etc.
• Maintainability – can you modify it? Is it robust?
• Portability – can it be transferred from one environment/platform to another?

10

Software evaluation

• Evaluating commercial software is different from evaluating something you have constructed
  – Even if you have constructed it from commercially available components

• Again, note the difference between validation and evaluation
  – Especially concerning “functionality”

• Also, evaluation is not the same as a software review, as found eg in a magazine

11

Stakeholders

• Developers
  – Researchers
  – Commercial developers
• End-users
  – Actual end-users (is this a single type?)
  – Their managers (buyers)

• Vendors

• Investors

12

Evaluation types

• Feasibility / Suitability
  – For any of the above stakeholders
• Internal evaluation
  – For development
  – Iterative testing, to evaluate progress
  – Adequacy evaluation
  – Diagnostic evaluation (debugging)
  – Black box vs. glass box evaluation

13

Evaluation types

• Declarative evaluation
  – How well does it perform?
  – Comparison with a “gold standard” ideal performance
  – Comparison with a baseline “wooden block”
• Usability evaluation
  – How long does each step take?
  – Is it “natural”, intuitive?
  – Is it easy to learn to use?
  – Is it well documented?

14

Evaluation types

• Operational evaluation
  – ROI
  – Compatibility with other software
  – Consistency of interfaces
    • Internal
    • With respect to “standards” (eg Microsoft)
  – Failsofts
  – Role of humans
  – Preparation, throughput, correction, output
  – Backup
  – Documentation
  – Support
  – Corporate situation of provider

15

Framework for evaluation

• Definition of the relevant quality characteristics – what is it you want to evaluate? Be specific

• Definition of attributes pertinent to this quality

• Definition of a measure able to provide values for these attributes

• Definition of a method whereby the measure can be made

16

Framework for evaluation

Important to be sure that
• The quality to be evaluated is genuinely a quality that is claimed of the software
• The attribute to be measured does reflect the quality in question
• The measure does genuinely measure that attribute (and not some other one)
• The method is sufficient to deliver a meaningful measure

17

Example: spell checker

• Function:
  – (a) identify wrongly-spelled words
  – (b) suggest an appropriate correction
  – (among other features)
• Quality: ability to do (a)
• Attribute: success rate in performance of that task
• Measure: “Precision”: percentage of wrongly-spelled words correctly identified in a document
• Method: give it a text with some wrongly-spelled words and count how many it spots

18

Example: spell checker

• Good evaluation, but not A*
• Success means
  – Identifying misspelled words (true positives)
  – Ignoring correctly spelled words (true negatives)
• So is the measure really appropriate? We are only counting true positives and false negatives: we are not giving credit for the true negatives, nor penalising false positives (see the sketch below)
• The method is underspecified:
  – How much text?
  – What sort of text?
  – Should we take into account what we know about spell checking (a certain class of error is very hard to detect)?
  – Should we classify misspellings and measure different classes separately?
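To make the point about true negatives and false positives concrete, here is a small sketch that is not part of the original slides: the checker names and all counts are invented. It uses the standard confusion-matrix definitions, under which the measure described on the previous slide (the percentage of wrongly-spelled words correctly identified) is the recall figure.

```python
# Hypothetical spell-checker evaluation. All counts are invented for illustration.

def confusion_metrics(tp, fp, tn, fn):
    """Standard measures from a 2x2 confusion matrix.

    tp: misspelled words the checker flagged        (true positives)
    fp: correctly spelled words it wrongly flagged  (false positives)
    tn: correctly spelled words it left alone       (true negatives)
    fn: misspelled words it missed                  (false negatives)
    """
    recall    = tp / (tp + fn)                     # share of real misspellings found
    precision = tp / (tp + fp)                     # share of flagged words that were really misspelled
    accuracy  = (tp + tn) / (tp + fp + tn + fn)    # also gives credit for true negatives
    return recall, precision, accuracy

# Two imaginary checkers applied to a 1000-word text containing 50 misspellings.
# Both find the same 45 misspellings, but B wrongly flags far more correct words.
for name, fp in [("checker A", 5), ("checker B", 200)]:
    r, p, a = confusion_metrics(tp=45, fp=fp, tn=950 - fp, fn=5)
    print(f"{name}: recall={r:.2f}  precision={p:.2f}  accuracy={a:.2f}")
```

Both imaginary checkers tie on recall (0.90), so the slide-17 measure cannot tell them apart; only measures that also count false positives and true negatives, such as precision and accuracy, expose the difference.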

19

Attributes

• Different types imply different measures/methods

• Example: dish-washers

Name   Racks   Options*   Water consumption   Noise level   Cleanliness
ABC    2       a,b        10                  noisy         ***
EFG    3       b          6                   quiet         *
PQR    2       a          5                   very noisy    **

* a = pre-wash rinse cycle; b = independent rinse cycle

20

Methods and measures

• Objective measures
  – Measuring, counting, timing
  – Doing a specific task
  – In case of usability issues, need to evaluate with a number of subjects (not just do it yourself)
  – Comparison against a gold standard
    • Precision: P = correct / total
    • Recall: R = correct / possible
    • Other measures also considering false positives and negatives (see the sketch below)
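As a concrete illustration of the two formulas above, here is a minimal sketch; the gold-standard and system-output sets are invented, and any resemblance to a real spell checker is incidental.

```python
# Minimal sketch (invented data): scoring a system's output against a gold standard
# using the formulas above, P = correct / total and R = correct / possible.

gold   = {"thier", "recieve", "seperate", "accomodate"}   # everything the system should find
system = {"thier", "recieve", "filename"}                 # everything the system actually reported

correct  = len(gold & system)    # reported items that are in the gold standard
total    = len(system)           # all items the system reported
possible = len(gold)             # all items it could have found

precision = correct / total      # 2 / 3 ≈ 0.67
recall    = correct / possible   # 2 / 4 = 0.50
print(f"P = {precision:.2f}, R = {recall:.2f}")
```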

21

Methods and measures

• Subjective measures
  – Interview after use
  – Feedback questionnaire
    • Rating scales (usually 5 or 7 points, + DK, N/A)
    • Open-ended questions?
    • Questions should relate to some specific point
    • Repeat (some) questions in a disguised way (see the sketch below)
  – Performance analysis
    • Video the session, analyse afterwards
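A small sketch of how such questionnaire ratings might be summarised; the question names and ratings below are invented, and the consistency check between a question and its disguised repeat is just one possible analysis.

```python
# Sketch of analysing feedback-questionnaire ratings (all data invented).
from statistics import mean, correlation   # statistics.correlation needs Python 3.10+

# 7-point ratings from six hypothetical subjects
ratings = {
    "easy_to_learn":          [6, 5, 7, 4, 6, 5],
    "easy_to_learn_reworded": [6, 4, 7, 5, 6, 5],   # same point, asked in disguise
    "well_documented":        [3, 2, 4, 3, 2, 3],
}

for question, scores in ratings.items():
    print(f"{question}: mean rating = {mean(scores):.2f}")

# If subjects answer consistently, a question and its disguised repeat should agree.
r = correlation(ratings["easy_to_learn"], ratings["easy_to_learn_reworded"])
print(f"consistency (Pearson r) between paired questions: {r:.2f}")
```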

22

Methods and measures

• Don’t try to measure too many different things with the same instrument

• Though this can be possible to some extent
• But extraneous factors need to be controlled carefully
• Problem of statistical significance:
  – Do you have enough subjects to know that the differences (and similarities) are not just random fluctuations? (see the sketch below)
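One common way to check this is a significance test. The sketch below, which is not from the original slides, runs a two-sample t-test on invented task times for two small groups; it assumes SciPy is available, and the device labels are hypothetical.

```python
# Sketch (invented data): is the difference in task times between two groups of
# subjects real, or could it just be a random fluctuation? A two-sample t-test
# is one common check; with very few subjects, even real differences may not
# reach significance.
from scipy import stats   # assumes SciPy is installed

group_a = [210, 195, 240, 225, 205, 230]   # task times (seconds), subjects using device A
group_b = [185, 200, 190, 170, 195, 180]   # task times (seconds), subjects using device B

result = stats.ttest_ind(group_a, group_b)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")
# A small p-value suggests the difference is unlikely to be chance alone.
```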

23

Example

• Simulated doctor-patient interviews with patients with limited English, using computer-based communication device with symbols and digitised speech
  – two devices (laptop+mousepad, tablet+stylus)
  – doctors and nurses
  – literate and illiterate patients

24

25

Example

• General question: could they get to the end of the consultation? (How did we “measure” this?)

• Objective measures
  – How long did it take?
  – How many questions did they ask?
  – How many answers were (apparently) correctly understood?
• Subjective measures
  – Feedback questionnaire with satisfaction ratings
  – Open-ended questions about specific issues

26

Subjects

• Many types of evaluation require volunteers
  – How many do you need?
  – Where will you get them from?
  – Are they suitable?
    • Exclusion factors: eg prior familiarity with your topic
    • Need to control for irrelevant differences in their profile
  – How will you guarantee their cooperation?
  – Ethical issues
    • Officially, you need ethics clearance for any experiments involving living beings!
    • In any case, important that volunteers know what they are letting themselves in for
    • Also important that you don’t waste people’s time, eg evaluating a useless task (for example as a baseline)

27

Summary

• What are you trying to evaluate?
  – Be specific, not general, eg “What do you think of this interface?”

• What is the best way to measure what you are interested in?

• How feasible is it to do what you want?

• [After Easter]: How to write it all up!

28

Next session

• No class next week
• First week after Easter (19 Apr)
  – No class on Thursday
  – Instead, practical sessions on Library Resources with Barry White
  – choose one of three sessions
    • each at 2pm-4pm
    • Wed 18, Thur 19 or Fri 20 April
    • in the Joule Library
    • Do we need a sign-up sheet?

