Transcript
Page 1

3. Sharing data and evaluating performance

I. Datasets  II. Evaluation

Thad, Jamie, Oliver, Sourav

Page 2

I. a. Pre-collected dataset

Pre-collected
Standard sensor systems (iPhone, indoor motion)
From real-life activity (everything is 'garbage' / NULL)
(Everyday Gesture Library)

→ “NULL Benchmark” dataset. Use as test set for new algorithms

Caveat: for ADL, detailed ground truth is required

Page 3

I. b. Actively collected dataset

Active collection
Prompted activity
Co-articulation / transitions
Pre-segmented

Precision / Recall, F1, AUC, Accuracy, Confusion matrix
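
A minimal sketch of how these metrics might be reported together for a pre-segmented dataset, assuming scikit-learn and one label per segment; the labels, predictions, and scores below are made up for illustration:

```python
# Sketch: standard metrics for a pre-segmented dataset.
# One ground-truth and one predicted label per segment, plus per-class
# scores for AUC. All data here is illustrative only.
import numpy as np
from sklearn.metrics import (precision_recall_fscore_support, accuracy_score,
                             confusion_matrix, roc_auc_score)

labels = ["walk", "sit", "stand"]
y_true = np.array(["walk", "sit", "sit", "stand", "walk", "stand"])
y_pred = np.array(["walk", "sit", "stand", "stand", "walk", "sit"])
# Per-segment class scores (rows sum to 1), e.g. classifier posteriors.
y_score = np.array([[0.80, 0.10, 0.10],
                    [0.20, 0.70, 0.10],
                    [0.10, 0.30, 0.60],
                    [0.10, 0.20, 0.70],
                    [0.90, 0.05, 0.05],
                    [0.20, 0.50, 0.30]])

prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average="macro", zero_division=0)
acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score, labels=labels, multi_class="ovr")
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(f"precision={prec:.2f} recall={rec:.2f} F1={f1:.2f} "
      f"accuracy={acc:.2f} AUC={auc:.2f}")
print("confusion matrix (rows = true, cols = predicted):")
print(cm)
```

Reporting the confusion matrix alongside the scalar scores keeps per-class behaviour visible, which matters once classes are imbalanced.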

Page 4

I. c. Passively collected dataset

Real-world collection
In-the-wild
HCI-style user study

Self-reporting / retrospective review
Subjective measurements: pattern recognition metrics don't work as well here
NASA TLX, Likert scales, questionnaires
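
For the subjective side, a rough sketch of how such responses might be aggregated; only the NASA TLX subscale names and the unweighted "raw TLX" average are standard, while the participant data below is hypothetical:

```python
# Sketch: aggregating subjective measures from a user study.
# The responses are made up; only the NASA TLX subscale names and the
# "raw TLX = mean of the six subscales" shortcut are standard practice.
import statistics

TLX_SUBSCALES = ["mental demand", "physical demand", "temporal demand",
                 "performance", "effort", "frustration"]

# One dict of 0-100 ratings per participant (hypothetical data).
tlx_responses = [
    {"mental demand": 60, "physical demand": 20, "temporal demand": 55,
     "performance": 30, "effort": 50, "frustration": 40},
    {"mental demand": 70, "physical demand": 25, "temporal demand": 60,
     "performance": 45, "effort": 65, "frustration": 55},
]

raw_tlx = [statistics.mean(r[s] for s in TLX_SUBSCALES) for r in tlx_responses]
print(f"raw TLX per participant: {raw_tlx}")
print(f"mean={statistics.mean(raw_tlx):.1f} stdev={statistics.stdev(raw_tlx):.1f}")

# Likert items (1-5) are ordinal: report the median and the distribution,
# not just a mean.
likert_item = [4, 5, 3, 4, 2, 5, 4]
distribution = {v: likert_item.count(v) for v in range(1, 6)}
print(f"Likert median={statistics.median(likert_item)}, distribution={distribution}")
```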

Page 5

I. Choice of datasets for making a contribution

New domain: collect data; show results of a sample technique

New technique: compare to existing techniques using publicly available datasets (can be your own released data)

Demonstrate usability & usefulness: user study (define "user", describe domain and techniques, present metrics and results)
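
A hedged sketch of the "new technique" route: comparing a proposed classifier against a baseline on the same shared data under cross-validation so that variance can be reported; load_shared_dataset() is a hypothetical placeholder for whichever public dataset is actually used.

```python
# Sketch: comparing a proposed technique against a baseline on shared data.
# load_shared_dataset() is a placeholder; the two classifiers stand in for
# the "existing" and "new" techniques being compared.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

def load_shared_dataset():
    # Placeholder: swap in features/labels from the released dataset.
    return make_classification(n_samples=300, n_features=20, n_classes=3,
                               n_informative=8, random_state=0)

X, y = load_shared_dataset()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

techniques = {
    "baseline (kNN)": KNeighborsClassifier(n_neighbors=5),
    "proposed (random forest)": RandomForestClassifier(random_state=0),
}

for name, clf in techniques.items():
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1_macro")
    # Report spread across folds, not just a single headline number.
    print(f"{name}: macro-F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```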

Page 6

II. Choosing evaluation methods

Domain (what we care about) × Analysis procedure (frames, events, etc.) × Evaluation metrics (PR, F1, ROC, AUC, Acc, EDD, etc.)

Page 7

II. Choosing the right analysis

What is important: instances (events), temporal alignment (frames) or both?

Is it multi-class? What is the class ratio? How big is your NULL class? What types of errors are harmful?
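
To make the frames-versus-events distinction concrete, a minimal sketch that scores one made-up prediction both ways; the "any temporal overlap counts as a hit" matching rule is a deliberate simplification of fuller event-based analyses (which also distinguish fragmentation, merges, overfill, and so on):

```python
# Sketch: the same prediction scored frame-wise and event-wise.
# Labels are per-frame strings; "NULL" marks the background class.
import itertools

def events(frames):
    """Collapse a frame sequence into (label, start, end) events, skipping NULL."""
    evs, start = [], 0
    for label, group in itertools.groupby(frames):
        length = len(list(group))
        if label != "NULL":
            evs.append((label, start, start + length))
        start += length
    return evs

truth = ["NULL"] * 4 + ["walk"] * 6 + ["NULL"] * 3 + ["sit"] * 5 + ["NULL"] * 2
pred  = ["NULL"] * 5 + ["walk"] * 4 + ["NULL"] * 6 + ["sit"] * 3 + ["NULL"] * 2

# Frame-based view: how many individual frames agree?
frame_acc = sum(t == p for t, p in zip(truth, pred)) / len(truth)

# Event-based view: how many events are matched by an overlapping
# event of the same class on the other side?
def overlaps(a, b):
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

true_events, pred_events = events(truth), events(pred)
recall = sum(any(overlaps(t, p) for p in pred_events) for t in true_events) / len(true_events)
precision = sum(any(overlaps(p, t) for t in true_events) for p in pred_events) / len(pred_events)

print(f"frame accuracy = {frame_acc:.2f}")
print(f"event precision = {precision:.2f}, event recall = {recall:.2f}")
```

In this toy example every event is detected (event precision and recall of 1.0) while a fifth of the frames disagree (frame accuracy 0.80), which is exactly the gap the frame/event question is meant to expose.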

Page 8

II. Guide to Choosing the Right Metrics

Choose metrics that matter to end users
Identify end users early on

Choose metrics that honestly represent the output
Use multiple metrics – multiple views
Statistical relevance reporting – at least variance, etc.
Give reasoning for their use

Publish time-series results (raw output of the recognizer) as well as the dataset
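
One possible reading of that last point, sketched with illustrative column names and values: release the recognizer's raw per-frame output in a simple, tool-agnostic format alongside the aggregate numbers, so others can recompute their own metrics.

```python
# Sketch: dumping raw recognizer output as a time-series CSV.
# Timestamps, labels, and scores are illustrative; the column layout is
# just one reasonable convention, not a fixed format.
import csv

rows = [
    # (timestamp_s, ground_truth, predicted, confidence)
    (0.00, "NULL", "NULL", 0.91),
    (0.05, "walk", "NULL", 0.55),
    (0.10, "walk", "walk", 0.78),
    (0.15, "walk", "walk", 0.83),
]

with open("recognizer_output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp_s", "ground_truth", "predicted", "confidence"])
    writer.writerows(rows)
```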

Page 9

To-do

Metric choice flowchart
Tools for generating metrics and comparing two techniques
Tools for visualising results

Upload (existing) toolsets to webpage; links to datasets; EGL

Tutorial for ISWC

Page 10

Page 11

Dataset types

a. Pre-collected
Standard sensor systems (iPhone, indoor motion)
From real-life activity (everything is 'garbage' / NULL)
(Everyday Gesture Library)
→ Benchmark NULL dataset. Use as test set only.
Caveat: for ADL, very detailed ground truth is required.

b. Active collection
Prompted activity
Co-articulation / transitions
Pre-segmented
Precision / Recall, F1, AUC, Accuracy, Confusion matrix

c. Passive collection
In-the-wild
HCI-style user study
Self-reporting / retrospective review
Subjective measurements: pattern recognition metrics don't work as well here
NASA TLX, Likert scales, questionnaires