3. Sharing data and evaluating performance
I. Datasets
II. Evaluation
Thad, Jamie, Oliver, Sourav
I. a. Pre-collected Dataset
Pre-collected: standard sensor systems (iPhone, indoor motion), recorded from real-life activity where everything is 'garbage' / NULL (e.g. the Everyday Gesture Library)
→ “NULL Benchmark” dataset: use as a test set for new algorithms.
Caveat: for ADLs, detailed ground truth is required.
I. b. Actively collected dataset
Active collection: prompted activity; co-articulation / transitions; pre-segmented
Precision / Recall, F1, AUC, Accuracy, Confusion matrix
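For pre-segmented labels these metrics fall out of simple counts; a minimal sketch in plain Python (the activity labels and data are invented for illustration):

```python
from collections import Counter

def one_vs_rest_metrics(y_true, y_pred, positive):
    """Precision, recall, F1, and overall accuracy, treating one
    class as positive and everything else as negative."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return precision, recall, f1, accuracy

def confusion_counts(y_true, y_pred):
    """Confusion matrix as a Counter of (true, predicted) pairs."""
    return Counter(zip(y_true, y_pred))

# Hypothetical pre-segmented activity labels.
y_true = ["walk", "walk", "sit", "walk", "sit", "sit"]
y_pred = ["walk", "sit", "sit", "walk", "walk", "sit"]
prec, rec, f1, acc = one_vs_rest_metrics(y_true, y_pred, positive="walk")
print(f"P={prec:.2f} R={rec:.2f} F1={f1:.2f} Acc={acc:.2f}")
```

(AUC additionally needs the recognizer's scores, not just hard labels, so it is omitted here.)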
I. c. Passively collected dataset
Passive collection: real-world, in-the-wild, HCI-style user study
Self-reporting / retrospective review; subjective measurements. Pattern recognition metrics don't work as well here; use NASA TLX, Likert scales, questionnaires.
I. Choice of datasets for making a contribution
New domain: collect data; show results of a sample technique
New technique: compare to existing techniques using publicly available datasets (these can be your own released data)
Demonstrate usability & usefulness: run a user study; define “user”, describe the domain and techniques, present metrics and results
II. Choosing evaluation methods
Domain (what we care about) × Analysis procedure (frames, events, etc.) × Evaluation metrics (PR, F1, ROC, AUC, Acc, EDD, etc.)
II. Choosing the right analysis
What is important: instances (events), temporal alignment (frames), or both?
Is it multi-class? What is the class ratio? How big is your NULL class? What types of errors are harmful?
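The NULL-class question matters because a large NULL class makes raw accuracy misleading; a quick sketch on synthetic labels (5% gesture frames, invented for illustration):

```python
# Synthetic frame stream: 5% gesture, 95% NULL. A degenerate
# recognizer that always outputs NULL looks excellent on accuracy.
y_true = ["gesture"] * 5 + ["NULL"] * 95
y_pred = ["NULL"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(t == p == "gesture" for t, p in zip(y_true, y_pred))
fn = sum(t == "gesture" and p != "gesture" for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn) if tp + fn else 0.0

# Accuracy looks great while the recognizer detects nothing.
print(f"Accuracy={accuracy:.2f}, gesture recall={recall:.2f}")
```

This is why class ratio and NULL-class size should drive the metric choice: per-class recall (or F1) exposes the failure that accuracy hides.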
II. Guide to Choosing the Right Metrics
Choose metrics that matter to end users; identify end users early on.
Choose metrics that honestly represent the output. Use multiple metrics for multiple views. Report statistical relevance (at least variance). Give the reasoning for your choices.
Publish time-series results (raw output of the recognizer) as well as the dataset.
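One benefit of publishing the raw time-series output is that readers can recompute both frame-wise and event-wise views from it. A sketch of the two views on an invented label stream ('G' = gesture, 'N' = NULL; data hypothetical):

```python
def runs(labels, cls):
    """(start, end) index spans of contiguous runs of cls."""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab == cls and start is None:
            start = i
        elif lab != cls and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans

def event_recall(y_true, y_pred, cls):
    """An event counts as detected if any predicted frame overlaps it."""
    events = runs(y_true, cls)
    hits = sum(any(p == cls for p in y_pred[s:e]) for s, e in events)
    return hits / len(events) if events else 0.0

# Raw per-frame recognizer output, as it might appear in a published log.
y_true = list("NNGGGNGGNN")
y_pred = list("NGGNNNGGGN")

frame_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"frame accuracy={frame_acc:.1f}, "
      f"event recall={event_recall(y_true, y_pred, 'G'):.1f}")
```

Here the same output is mediocre frame-wise yet perfect event-wise, which is exactly why multiple views (and the raw series behind them) should be reported.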
To-do
Metric choice flowchart
Tools for generating metrics and comparing two techniques
Tools for visualising results
Upload (existing) toolsets to webpage; links to datasets; EGL
Tutorial for ISWC