3. Sharing data and evaluating performance I. Datasets II. Evaluation Thad, Jamie, Oliver, Sourav

Page 1:

3. Sharing data and evaluating performance

I. Datasets
II. Evaluation

Thad, Jamie, Oliver, Sourav

Page 2:

I. a. Pre-collected Dataset

Pre-collected
Standard sensor systems (iPhone, indoor motion)
From real-life activity (everything is 'garbage' / NULL) (Everyday Gesture Library)

→ “NULL Benchmark” dataset. Use as test set for new algorithms

Caveat: for ADL, detailed ground truth is required
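As a minimal sketch of how an all-NULL benchmark such as the Everyday Gesture Library might be used: since every frame is NULL, anything the recognizer fires is a false positive, which is often reported per hour of recording. The loader and recognizer names below are hypothetical placeholders, not part of the dataset.

```python
def false_positives_per_hour(num_detections, num_frames, frame_rate_hz):
    """Every event fired on an all-NULL benchmark is a false positive."""
    hours = num_frames / frame_rate_hz / 3600.0
    return num_detections / hours

# Hypothetical usage -- `load_egl_recording` and `my_recognizer` stand in for
# your own data loader and technique:
# frames = load_egl_recording("egl_day1")   # every frame labelled NULL
# events = my_recognizer(frames)            # detected gesture events
# print(false_positives_per_hour(len(events), len(frames), 50))
```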

Page 3:

I. b. Actively collected dataset

Active collection
Prompted activity
Co-articulation / transitions
Pre-segmented

Precision / Recall, F1, AUC, Accuracy, Confusion matrix
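A minimal sketch of computing these instance-level metrics, assuming scikit-learn and pre-segmented, labelled instances (the label lists are invented; AUC would additionally need classifier scores rather than hard labels):

```python
from sklearn.metrics import (precision_recall_fscore_support,
                             accuracy_score, confusion_matrix)

# Hypothetical per-instance labels for a pre-segmented, prompted-activity set.
y_true = ["walk", "sit", "walk", "stand", "sit", "walk"]
y_pred = ["walk", "walk", "walk", "stand", "sit", "sit"]

prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print("precision", prec, "recall", rec, "F1", f1)
print("accuracy", accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred, labels=["walk", "sit", "stand"]))
```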

Page 4:

I. c. Passively collected dataset

Real-world collection
In-the-wild, HCI-style user study

Self-reporting / retrospective review
Subjective measurements: pattern recognition metrics don't work as well here. NASA TLX, Likert scale, questionnaire
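These subjective measures are usually summarised with simple statistics rather than recognition metrics. A small sketch, assuming the common unweighted ("raw") TLX convention of averaging the six subscales and basic descriptive statistics for Likert items (all numbers are invented):

```python
from statistics import mean, median

# Hypothetical NASA TLX subscale ratings (0-100) for one participant:
# mental, physical, temporal demand, performance, effort, frustration.
tlx_subscales = [55, 20, 60, 30, 50, 40]
raw_tlx = mean(tlx_subscales)          # unweighted ("raw") TLX score

# Hypothetical 5-point Likert responses to one questionnaire item.
likert = [4, 5, 3, 4, 4, 2, 5]
print("raw TLX:", raw_tlx)
print("Likert median:", median(likert), "mean:", mean(likert))
```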

Page 5:

I. Choice of datasets for making a contribution

New domain: collect data; show results of a sample technique

New technique: compare to existing techniques using publicly available datasets (can be your own released data); see the cross-validation sketch below

Demonstrate usability & usefulness: user study; define “user”, describe domain and techniques, present metrics and results
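For the "new technique" case, a minimal sketch of such a comparison, assuming scikit-learn cross-validation on a shared feature matrix; X, y, and the two classifiers are stand-ins for a public benchmark and your own baseline and proposed techniques:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature matrix / labels standing in for a public benchmark.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
y = rng.integers(0, 3, size=200)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
baseline = KNeighborsClassifier()                    # an existing technique
proposed = RandomForestClassifier(random_state=0)    # the new technique

for name, clf in [("baseline", baseline), ("proposed", proposed)]:
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1_macro")
    print(f"{name}: F1 {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the per-fold spread (here the standard deviation) alongside the mean gives the "at least variance" statistical reporting mentioned later in the deck.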

Page 6:

II. Choosing evaluation methods

Domain (what we care about)

×

Analysis procedure (frames, events, etc.; illustrated below)

×

Evaluation metrics (PR, F1, ROC, AUC, Acc, EDD, etc.)
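To make the frames-vs-events distinction concrete, a small sketch with invented label sequences and a deliberately simple rule (an event counts as detected if any of its frames carries the correct prediction); the same output scores 0.60 frame-wise but 2/2 event-wise:

```python
# Hypothetical frame-level labels at a fixed sampling rate ("" = NULL class).
truth = ["", "", "walk", "walk", "walk", "", "sit", "sit", "", ""]
pred  = ["", "walk", "walk", "walk", "", "", "", "sit", "sit", ""]

# Frame-based analysis: fraction of frames carrying the correct label.
frame_acc = sum(t == p for t, p in zip(truth, pred)) / len(truth)

def events(labels):
    """Yield (label, start, end) for contiguous non-NULL runs."""
    start = current = None
    for i, lab in enumerate(labels + [""]):
        if lab and start is None:
            start, current = i, lab
        elif start is not None and lab != current:
            yield current, start, i
            start, current = (i, lab) if lab else (None, None)

# Event-based analysis: an event is detected if at least one of its frames
# is predicted with the correct label.
gt_events = list(events(truth))
hits = sum(any(pred[i] == lab for i in range(s, e)) for lab, s, e in gt_events)

print(f"frame accuracy: {frame_acc:.2f}")        # 0.60
print(f"events detected: {hits}/{len(gt_events)}")  # 2/2
```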

Page 7:

II. Choosing the right analysis

What is important: instances (events), temporal alignment (frames) or both?

Is it multi-class? What is the class ratio? How big is your NULL class? What types of errors are harmful?
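A quick illustration, with invented numbers, of why the NULL class ratio matters when picking metrics: with a dominant NULL class, a recognizer that never fires looks excellent on accuracy but useless on recall and F1.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical frame labels: 95% NULL, 5% target activity.
y_true = ["NULL"] * 95 + ["gesture"] * 5
y_pred = ["NULL"] * 100          # a recognizer that never fires

print("accuracy:", accuracy_score(y_true, y_pred))   # 0.95, looks great
print("F1 (gesture):", f1_score(y_true, y_pred,
                                pos_label="gesture", zero_division=0))  # 0.0
```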

Page 8:

II. Guide to Choosing the Right Metrics

Choose metrics that matter to end users; identify end users early on

Choose metrics that honestly represent the output
Use multiple metrics (multiple views)
Report statistical relevance: at least variance, etc.
Give reasoning for the metrics used

Publish time-series results (the raw output of the recognizer) as well as the dataset
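One way to publish raw, frame-level recognizer output alongside the dataset is a plain per-frame file. A sketch follows; the column names and CSV layout are just one possible convention, not something the slides prescribe:

```python
import csv

# Hypothetical frame-level output: timestamp, ground truth, predicted label,
# and the recognizer's confidence score, so others can recompute any metric.
rows = [
    (0.00, "NULL", "NULL", 0.91),
    (0.02, "walk", "NULL", 0.48),
    (0.04, "walk", "walk", 0.77),
]

with open("recognizer_output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["time_s", "ground_truth", "prediction", "confidence"])
    writer.writerows(rows)
```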

Page 9:

To-do

Metric choice flowchart
Tools for generating metrics and comparing two techniques
Tools for visualising results

Upload (existing) toolsets to webpage; links to datasets; EGL

Tutorial for ISWC

Page 10:
Page 11:

Dataset types

a. Pre-collected
Standard sensor systems (iPhone, indoor motion)
From real-life activity (everything is 'garbage' / NULL) (Everyday Gesture Library)
→ Benchmark NULL dataset. Use as test set only.
Caveat: for ADL, very detailed ground truth is required

b. Active collection
Prompted activity
Co-articulation / transitions
Pre-segmented
Precision / Recall, F1, AUC, Accuracy, Confusion matrix

c. Passive collection
In-the-wild, HCI-style user study
Self-reporting / retrospective review
Subjective measurements: pattern recognition metrics don't work as well here. NASA TLX, Likert scale, questionnaire