Quantifying reflection: Creating a gold-standard for evaluating automated reflection detection
Thomas Ullmann, Fridolin Wild, Peter Scott, Knowledge Media Institute, The Open University
Outline
• A model for reflection
• Related work on quantification of reflection
• Methodology
• Data collected
• Results and discussion
• Outlook
Reflection is creative sense-making of the past
State of the art in quantifying reflection
Reference | Scales | Unit of analysis | Findings
Dyment & O’Connell (2011) | Depth of reflection | Studies (writings) | Meta review: five studies low, four medium, two high levels of reflection
Wong et al. (1995) | Depth of reflection: habitual to critical | 45 students | Content analysis and interviews: 76% reflectors, 11% critical reflectors
Wald et al. (2012) | Reflective to non-reflective | 93 writings (2nd year students, self-selected best of reflective field notes) | 30% critically reflective, 11% transformative reflective
Plack et al. (2005) | Frequencies of elements and depth of reflection | 43 journals | 43% reflection, 42% critical reflection; frequencies see next slide
Hatton & Smith (1995) | Units of reflection; dialogic versus descriptive | ‘Units’ (in writings of 60 students) | After instruction: 30% dialogic reflection; 19 reflective units on average per 8-12 pages
Ross (1989) | Depth of reflection | 134 papers of 25 students | 22% highly reflective, 34% moderately reflective
Williams et al. (2002) | Action classification | 56 student journals | 23% verify learning, 36% new understanding, 39% future behaviour
[Figures: frequencies (%) reported by Plack et al. and by Williams et al.]
Summary: Related work
• More research on level than on elements
• Wide range for ‘level of depth’
• Measurements on students or writings/journals level
• Mostly in the context of instructed reflective writing
• Typically: Mapping from evidence to depth/breadth
=> No re-usable instrument to measure reflection
The dimensions of reflection
Ullmann, Wild, Scott (2012): Comparing automatically detected reflective texts with human judgements. http://ceur-ws.org/Vol-931/paper8.pdf
Documentation of insights, plans, and intentions.
Switch point of view.
Argumentation and reasoning.
Identification of a conflict.
Awareness building over affective factors.
Explication of self-awareness, e.g., inner monologues, description of feelings.
Example accounts (anonymised)
Dim: Type | Example
SA: Identification of a conflict. | “[Victor] and [Morgan], you are right that I should have applied better my own learning instead of using the Uni ones.”
CA: Reasoning. | “I imagine this is probably in order to have a focus and provide enough detail rather than skim over the whole area.”
TP: Switch point of view. | “When I am doing FRT work, I often think about how the parents view me when they know I haven’t got children!”
OD: Documentation of an insight. | “After I saw how this lifted her mood and eased her anxiety, I will remember that what we can view sometimes to be small can actually make a significant difference.”
OD: Intention. | “I would like to be involved in helping with the site, too - although I’m a novice! I imagine this is probably in order to have a focus and provide enough detail rather than skim over the whole area.”
OD: New understanding. | “This has helped me reflect on my own life and experiences whilst allowing me to empathise with others in their own circumstances; I feel proud of what I have achieved so far as the work/life/study balance is always difficult to navigate, but I’m lucky that I have a supportive family to help.”
None | “Bye the way, Audacity is also run under the CC Attribution.”
Methodology: creating a gold standard
• Corpus selection: mid range length postings; OU LMS forum posts; 4 subjects, 2 years
• Sanitize: de-identification
• Chunking (for cues): sentence level
• Sample: 1000 random; 500 pers., 500 non-pers.
• Batching: expand grid, 10 batches
• Crowdsourcing: control questions; 5 raters each
• ‘Spam’ filtering: justification valid; ‘gold questions’ passed
• Objectification: ‘majority vote’; interrater reliability
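The sample and batching steps can be sketched as follows. This is only a minimal illustration, assuming the personal and non-personal sentences are already available as two lists; the function name `sample_and_batch`, the seed, and the round-robin split are hypothetical choices and not taken from the slides.

```python
import random


def sample_and_batch(personal, non_personal, per_group=500, n_batches=10, seed=0):
    """Draw a balanced random sample and split it into near-equal batches.

    Assumption: 500 personal + 500 non-personal sentences in 10 batches,
    as listed on the slide; how sentences are classed as personal is not shown.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(personal, per_group) + rng.sample(non_personal, per_group)
    rng.shuffle(sample)  # mix both groups before batching
    # Deal sentences round-robin so batch sizes differ by at most one.
    return [sample[i::n_batches] for i in range(n_batches)]


# Usage with placeholder sentences (hypothetical data).
personal = [f"personal sentence {i}" for i in range(2000)]
non_personal = [f"non-personal sentence {i}" for i in range(2000)]
batches = sample_and_batch(personal, non_personal)
```

Each batch can then be posted as one crowdsourcing job.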
Crowdsourcing
• Crowdflower: the ‘virtual pedestrian area’
• Pre-tests showed:
– Really simple questions needed for HITs
– But: Quick answer options increase spam
– Short texts easier than long texts (less spam, smaller costs)
– Shuffling of answers to avoid artefacts
• Check: larger than usual number of raters (5+) to see how reliable judgements are
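Two of the anti-spam measures above, shuffling answer options and checking raters against control (‘gold’) questions, might look roughly like this. It is a sketch under assumptions: the data layout, the function names, and the 0.7 accuracy threshold are hypothetical and do not describe Crowdflower's own mechanism.

```python
import random


def shuffled_options(options, rng):
    """Return the answer options in random order, leaving the input untouched."""
    shuffled = list(options)
    rng.shuffle(shuffled)
    return shuffled


def trusted_raters(gold_answers, responses, min_accuracy=0.7):
    """Return the raters whose accuracy on gold questions meets the threshold.

    gold_answers: {question_id: correct_label}
    responses:    {rater_id: {question_id: label}}
    min_accuracy: hypothetical cut-off, not a value from the slides.
    """
    trusted = set()
    for rater, answers in responses.items():
        judged = [q for q in answers if q in gold_answers]
        if not judged:
            continue  # no gold question answered: treat as untrusted
        correct = sum(answers[q] == gold_answers[q] for q in judged)
        if correct / len(judged) >= min_accuracy:
            trusted.add(rater)
    return trusted


# Usage with made-up raters and labels.
gold = {"g1": "reflective", "g2": "not reflective"}
responses = {
    "rater_a": {"g1": "reflective", "g2": "not reflective", "s1": "reflective"},
    "rater_b": {"g1": "not reflective", "g2": "reflective", "s1": "reflective"},
    "rater_c": {"s1": "not reflective"},
}
kept = trusted_raters(gold, responses)
options = shuffled_options(["yes", "no", "unsure"], random.Random(1))
```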
Example questionnaire
OU Forum Corpus
Countries (origin of request)
• In total 411 raters
• Most of them from the USA (N=202)
• GB (N=94)
• India (N=45)
• 14 other nations (N=70)
Across batches (3M)
Frequency distribution (3M)
Frequencies by courses (3M)
Interrater Reliability
– Raw data
  • Baseline: control questions: Krippendorff’s α = 0.43
  • Control questions + survey data: α = 0.32
  • Survey data: α = 0.22
– ‘Objectified’ data
  • Majority vote of 3 to all raters agree
    – Survey data: α = 0.36 (623 out of 1,000 sentences)
  • Majority vote of 4 to all agree
    – Survey data: α = 0.581 (301 sentences)
  • Majority vote of 5 (to all) agree
    – α = 0.98 (with outliers) (107 out of 1,000 sentences)
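For readers who want to reproduce this kind of analysis, the sketch below computes Krippendorff's α for nominal labels (α = 1 - D_o / D_e, observed over expected disagreement) and applies a majority-vote threshold. It assumes each sentence is represented by the list of labels its raters chose; the function names and toy data are illustrative, and the slides do not state which tooling produced the reported values.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data; units is a list of label lists."""
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a unit with a single rating contains no pairable values
        for a, b in permutations(ratings, 2):  # ordered pairs of distinct raters
            coincidences[(a, b)] += 1 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    if n <= 1:
        return float("nan")  # not enough pairable values
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return float("nan")  # only one category in use: alpha is undefined
    return 1 - observed / expected


def majority_label(ratings, min_votes=3):
    """Return the winning label if it has at least min_votes and no tie, else None."""
    counts = Counter(ratings).most_common()
    if not counts:
        return None
    label, votes = counts[0]
    tied = len(counts) > 1 and counts[1][1] == votes
    return None if tied or votes < min_votes else label


# Toy data: five ratings per sentence ("r" = reflective, "n" = not reflective).
ratings = [["r", "r", "r", "n", "n"], ["n", "n", "n", "n", "n"], ["r", "r", "r", "r", "n"]]
alpha_all = krippendorff_alpha_nominal(ratings)
agreed_by_4 = [u for u in ratings if majority_label(u, min_votes=4) is not None]
alpha_4 = krippendorff_alpha_nominal(agreed_by_4)
```

Raising `min_votes` keeps fewer sentences but tends to raise α on the retained subset, which mirrors the trade-off in the figures above.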
Discussion
• Agreement of 5 of course increases IRR
  – to 0.98 unfiltered
  – when omitting ‘over answering’: to 1.0
  – But: reduces to single category sentences
• Agreement of 3 deemed good enough
  – since questions were single choice, whereas multiple answers are correct
• Sentences are a reduction, but allow zooming in on markers
• Context: Forum texts
• Personal vs. non-personal sentences
Questions? Answers?
bit.ly/tel-advances