Quantifying reflection

Quantifying reflection: Creating a gold standard for evaluating automated reflection detection

Thomas Ullmann, Fridolin Wild, Peter Scott, Knowledge Media Institute, The Open University

Outline

• A model for reflection
• Related work on quantification of reflection
• Methodology
• Data collected
• Results and discussion
• Outlook

Reflection is creative sense-making of the past

State of the art in quantifying reflection

| Reference | Scales | Unit of analysis | Findings |
|---|---|---|---|
| Dyment & O’Connell (2011) | Depth of reflection | Studies (writings) | Meta review: five studies low; four medium; two studies high levels of reflection |
| Wong et al. (1995) | Depth of reflection: habitual to critical | 45 students | Content analysis and interviews: 76% reflectors, 11% critical reflectors |
| Wald et al. (2012) | Reflective to non-reflective | 93 writings | 2nd year students, self-selected best of reflective field notes: 30% critically reflective, 11% transformative reflective |
| Plack et al. (2005) | Frequencies of elements and depth of reflection | 43 journals | 43% reflection, 42% critical reflection; frequencies see next slide |
| Hatton & Smith (1995) | Units of reflection; dialogic versus descriptive | ‘units’ (in writings of 60 students) | After instruction: 30% dialogic reflection; 19 reflective units on average per 8-12 pages |
| Ross (1989) | Depth of reflection | 134 papers of 25 students | 22% highly reflective, 34% moderately reflective |
| Williams et al. (2002) | Action classification | 56 student journals | 23% verify learning, 36% new understanding, 39% future behaviour |

[Figure: frequencies (%) reported by Plack et al. and by Williams et al.]

Summary: Related work

• More research on level than on elements
• Wide range for ‘level of depth’
• Measurements at the level of students or of writings/journals
• Mostly in the context of instructed reflective writing
• Typically: mapping from evidence to depth/breadth

=> No re-usable instrument to measure reflection

The dimensions of reflection

Ullmann, Wild, Scott (2012): Comparing automatically detected reflective texts with human judgements. http://ceur-ws.org/Vol-931/paper8.pdf

Documentation of insights, plans, and intentions.

Switch point of view.

Argumentation and reasoning.

Identification of a conflict.

Awareness building over affective factors.

Explication of self-awareness, e.g., inner monologues, description of feelings.

Example accounts (anonymised)

| Dim: Type | Example |
|---|---|
| SA: Identification of a conflict. | “[Victor] and [Morgan], you are right that I should have applied better my own learning instead of using the Uni ones.” |
| CA: Reasoning. | “I imagine this is probably in order to have a focus and provide enough detail rather than skim over the whole area.” |
| TP: Switch point of view. | “When I am doing FRT work, I often think about how the parents view me when they know I haven’t got children!” |
| OD: Documentation of an insight. | “After I saw how this lifted her mood and eased her anxiety, I will remember that what we can view sometimes to be small can actually make a significant difference.” |
| OD: Intention. | “I would like to be involved in helping with the site, too - although I’m a novice! I imagine this is probably in order to have a focus and provide enough detail rather than skim over the whole area.” |
| OD: New understanding. | “This has helped me reflect on my own life and experiences whilst allowing me to empathise with others in their own circumstances; I feel proud of what I have achieved so far as the work/life/study balance is always difficult to navigate, but I’m lucky that I have a supportive family to help.” |
| None | “Bye the way, Audacity is also run under the CC Attribution.” |

Methodology: creating a gold standard

• Corpus selection: OU LMS forum posts; 4 subjects, 2 years; mid-range length postings
• Sanitize: de-identification
• Chunking (for cues): sentence level
• Sample: 1000 random sentences (500 personal, 500 non-personal)
• Batching: expand grid, 10 batches
• Crowdsourcing: control questions; 5 raters each
• ‘Spam’ filtering: justification valid; ‘gold questions’ passed
• Objectification: ‘majority vote’; interrater reliability
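The sampling and batching steps can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: it assumes the sentences are already split into two lists (personal and non-personal), and the function name `sample_and_batch`, the seed, and the round-robin batch assignment are all hypothetical choices.

```python
import random


def sample_and_batch(personal, non_personal, per_group=500, n_batches=10, seed=0):
    """Draw an equal random sample from each group and split it into batches.

    Assumption: `personal` and `non_personal` are lists of sentences that
    have already been classified; how that classification was done is not
    covered here.
    """
    rng = random.Random(seed)
    # Sample without replacement from each group (500 + 500 = 1000 by default).
    sample = rng.sample(personal, per_group) + rng.sample(non_personal, per_group)
    # Shuffle so that each batch mixes personal and non-personal sentences.
    rng.shuffle(sample)
    # Round-robin split into n_batches batches of (nearly) equal size.
    return [sample[i::n_batches] for i in range(n_batches)]
```

With the default arguments and at least 500 sentences per group, this yields 10 batches of 100 sentences each.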

Crowdsourcing

• Crowdflower: the ‘virtual pedestrian area’
• Pre-tests showed:
– Really simple questions needed for HITs
– But: quick answer options increase spam
– Short texts easier than long texts (less spam, smaller costs)
– Shuffling of answers to avoid artefacts

• Check: larger than usual number of raters (5+) to see how reliable judgements are
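One way to picture the ‘spam’ filtering with gold (control) questions is the sketch below. It is only an illustration under stated assumptions: the data layout (a dict of answers per rater), the function names, and the 0.7 accuracy threshold are hypothetical and not taken from the study or from the Crowdflower platform.

```python
def passed_gold(answers, gold, min_accuracy=0.7):
    """Return True if a rater answered enough gold questions correctly.

    answers: dict mapping question id -> the rater's answer
    gold:    dict mapping gold question id -> known correct answer
    min_accuracy: hypothetical threshold, chosen here only for illustration
    """
    checked = [(q, a) for q, a in answers.items() if q in gold]
    if not checked:
        # A rater who saw no gold question cannot be verified.
        return False
    correct = sum(1 for q, a in checked if gold[q] == a)
    return correct / len(checked) >= min_accuracy


def filter_raters(answers_by_rater, gold, min_accuracy=0.7):
    """Keep only the raters whose gold-question accuracy meets the threshold."""
    return {
        rater: answers
        for rater, answers in answers_by_rater.items()
        if passed_gold(answers, gold, min_accuracy)
    }
```

Only the judgements of the retained raters would then feed into the agreement analysis.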

Example questionnaire

OU Forum Corpus

Countries (origin of request)

• In total 411 raters
• Most of them from the USA (N=202)
• GB (N=94)
• India (N=45)
• 14 other nations (N=70)

Across batches (3M)

Frequency distribution (3M)

Frequencies by courses (3M)

Interrater Reliability

– Raw data
  • Baseline: control questions: Krippendorff’s α = 0.43
  • Control questions + survey data: α = 0.32
  • Survey data: α = 0.22
– ‘Objectified’ data
  • Majority vote, from 3 up to all raters agree
    – Survey data: α = 0.36 (623 out of 1,000 sentences)
  • Majority vote, from 4 up to all raters agree
    – Survey data: α = 0.581 (301 sentences)
  • Majority vote, from 5 up to all raters agree
    – α = 0.98 (with outliers) (107 out of 1,000 sentences)
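The ‘objectification’ step and the reliability measure can be sketched as below. This is a minimal illustration assuming nominal (single-choice) labels and one list of ratings per sentence; the function names are hypothetical, and the exact way α was computed on the filtered sentences in the study is not specified here.

```python
import math
from collections import Counter
from itertools import permutations


def majority_label(ratings, min_agree):
    """Return the most frequent label if at least `min_agree` raters chose it."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if count >= min_agree else None


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists, each inner list holding the labels that the
    raters assigned to one sentence. Sentences with fewer than two
    ratings are not pairable and are skipped.
    """
    # Coincidence matrix: every ordered pair of ratings within a unit
    # contributes 1 / (m - 1), where m is the number of ratings in that unit.
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    # Marginal totals per label and overall number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    if n <= 1:
        return math.nan  # not enough pairable ratings
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return math.nan  # only one label was ever used; alpha is undefined
    return 1.0 - observed / expected
```

Filtering with, for example, `[u for u in units if majority_label(u, 4) is not None]` keeps only the sentences on which at least four raters agree, which is why the number of retained sentences shrinks as the required agreement rises.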

Discussion

• Agreement of 5 of course increases IRR
– (to 0.98 unfiltered)
– when omitting ‘over answering’: to 1.0
– But: reduces to single-category sentences
• Agreement of 3 deemed good enough
– since questions were single choice, whereas multiple answers are correct
• Sentences are a reduction, but allow zooming in on markers
• Context: forum texts
• Personal vs. non-personal sentences

Questions? Answers?

bit.ly/tel-advances
