3. Sharing data and evaluating performance
I. Datasets
II. Evaluation
Thad, Jamie, Oliver, Sourav
I. a. Pre-collected Dataset
Pre-collected: standard sensor systems (iPhone, indoor motion), recorded from real-life activity where everything is 'garbage' / NULL (e.g. the Everyday Gesture Library)
→ “NULL Benchmark” dataset: use as a test set for new algorithms.
Caveat: for ADLs, detailed ground truth is required.
I. b. Actively collected dataset
Active collection: prompted activity; co-articulation / transitions; pre-segmented
Precision / Recall, F1, AUC, Accuracy, Confusion matrix
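For pre-segmented labels these metrics fall out of simple counts; a minimal sketch in plain Python (the activity labels and data are invented for illustration):

```python
from collections import Counter

def one_vs_rest_metrics(y_true, y_pred, positive):
    """Precision, recall, F1, and overall accuracy, treating one
    class as positive and everything else as negative."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return precision, recall, f1, accuracy

def confusion_counts(y_true, y_pred):
    """Confusion matrix as a Counter of (true, predicted) pairs."""
    return Counter(zip(y_true, y_pred))

# Hypothetical pre-segmented activity labels.
y_true = ["walk", "walk", "sit", "walk", "sit", "sit"]
y_pred = ["walk", "sit", "sit", "walk", "walk", "sit"]
prec, rec, f1, acc = one_vs_rest_metrics(y_true, y_pred, positive="walk")
print(f"P={prec:.2f} R={rec:.2f} F1={f1:.2f} Acc={acc:.2f}")
```

(AUC additionally needs the recognizer's scores, not just hard labels, so it is omitted here.)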
I. c. Passively collected dataset
Passive collection: real-world, in-the-wild, HCI-style user study
Self-reporting / retrospective review; subjective measurements. Pattern recognition metrics don't work as well here; use NASA TLX, Likert scales, questionnaires.
I. Choice of datasets for making a contribution
New domain: collect data; show results of a sample technique
New technique: compare to existing techniques using publicly available datasets (these can be your own released data)
Demonstrate usability & usefulness: run a user study; define “user”, describe the domain and techniques, present metrics and results
II. Choosing evaluation methods
Domain (what we care about) × Analysis procedure (frames, events, etc.) × Evaluation metrics (PR, F1, ROC, AUC, Acc, EDD, etc.)
II. Choosing the right analysis
What is important: instances (events), temporal alignment (frames), or both?
Is it multi-class? What is the class ratio? How big is your NULL class? What types of errors are harmful?
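The NULL-class question matters because a large NULL class makes raw accuracy misleading; a quick sketch on synthetic labels (5% gesture frames, invented for illustration):

```python
# Synthetic frame stream: 5% gesture, 95% NULL. A degenerate
# recognizer that always outputs NULL looks excellent on accuracy.
y_true = ["gesture"] * 5 + ["NULL"] * 95
y_pred = ["NULL"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(t == p == "gesture" for t, p in zip(y_true, y_pred))
fn = sum(t == "gesture" and p != "gesture" for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn) if tp + fn else 0.0

# Accuracy looks great while the recognizer detects nothing.
print(f"Accuracy={accuracy:.2f}, gesture recall={recall:.2f}")
```

This is why class ratio and NULL-class size should drive the metric choice: per-class recall (or F1) exposes the failure that accuracy hides.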
II. Guide to Choosing the Right Metrics
Choose metrics that matter to end users; identify end users early on.
Choose metrics that honestly represent the output. Use multiple metrics for multiple views. Report statistical relevance (at least variance). Give the reasoning for your choices.
Publish time-series results (raw output of the recognizer) as well as the dataset.
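One benefit of publishing the raw time-series output is that readers can recompute both frame-wise and event-wise views from it. A sketch of the two views on an invented label stream ('G' = gesture, 'N' = NULL; data hypothetical):

```python
def runs(labels, cls):
    """(start, end) index spans of contiguous runs of cls."""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab == cls and start is None:
            start = i
        elif lab != cls and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans

def event_recall(y_true, y_pred, cls):
    """An event counts as detected if any predicted frame overlaps it."""
    events = runs(y_true, cls)
    hits = sum(any(p == cls for p in y_pred[s:e]) for s, e in events)
    return hits / len(events) if events else 0.0

# Raw per-frame recognizer output, as it might appear in a published log.
y_true = list("NNGGGNGGNN")
y_pred = list("NGGNNNGGGN")

frame_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"frame accuracy={frame_acc:.1f}, "
      f"event recall={event_recall(y_true, y_pred, 'G'):.1f}")
```

Here the same output is mediocre frame-wise yet perfect event-wise, which is exactly why multiple views (and the raw series behind them) should be reported.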
To-do
Metric choice flowchart
Tools for generating metrics and comparing two techniques
Tools for visualising results
Upload (existing) toolsets to webpage; links to datasets; EGL
Tutorial for ISWC