1. Quantifying reflection:
Creating a gold-standard for evaluating
automated reflection detection
Thomas Ullmann, Fridolin Wild, Peter Scott,
Knowledge Media Institute, The Open University
2. Outline
• A model for reflection
• Related work on the quantification of reflection
• Methodology
• Data collected
• Results and discussion
• Outlook
4. State of the art
in quantifying reflection
| Reference | Scale | Unit of analysis | Findings |
|---|---|---|---|
| Dyment & O’Connell (2011) | Depth of reflection | Studies (writings) | Meta-review: five studies found low, four medium, and two high levels of reflection |
| Wong et al. (1995) | Depth of reflection: habitual to critical | 45 students | Content analysis and interviews: 76% reflectors, 11% critical reflectors |
| Wald et al. (2012) | Reflective to non-reflective | 93 writings | 2nd-year students, self-selected best of their reflective field notes: 30% critically reflective, 11% transformative reflective |
| Plack et al. (2005) | Frequencies of elements and depth of reflection | 43 journals | 43% reflection, 42% critical reflection; for frequencies see next slide |
| Hatton & Smith (1995) | Units of reflection; dialogic versus descriptive | ‘Units’ (in writings of 60 students) | After instruction: 30% dialogic reflection; 19 reflective units on average per 8-12 pages |
| Ross (1989) | Depth of reflection | 134 papers of 25 students | 22% highly reflective, 34% moderately reflective |
| Williams et al. (2002) | Action classification | 56 student journals | 23% verify learning, 36% new understanding, 39% future behaviour |
7. Summary: Related work
• More research on level than on elements
• Wide range for ‘level of depth’
• Measurements at the student or writing/journal level
• Mostly in the context of instructed reflective writing
• Typically: Mapping from evidence to depth/breadth
=> No re-usable instrument
to measure reflection
8. The dimensions of reflection
Ullmann, Wild, Scott (2012): Comparing automatically detected reflective texts with human judgements. http://ceur-ws.org/Vol-931/paper8.pdf
• Documentation of insights, plans, and intentions.
• Switch of point of view.
• Argumentation and reasoning.
• Identification of a conflict.
• Awareness building over affective factors.
• Explication of self-awareness, e.g., inner monologues, description of feelings (see the sketch after this list).
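One way to make the scheme concrete is to write it down as a small label table for annotation. A minimal Python sketch; the expansions of the abbreviations SA, CA, TP, OD used on the next slide are my reading of the model, not spelled out on the slides:

```python
# Sketch of the annotation scheme as a label table. The expansions of
# SA/CA/TP/OD are assumptions based on the model, not quoted from the slides.
REFLECTION_LABELS = {
    "SA": "Self-awareness: inner monologues, feelings, identification of a conflict",
    "CA": "Critical analysis: argumentation and reasoning",
    "TP": "Taking perspectives: switch of point of view",
    "OD": "Outcome documentation: insights, plans, and intentions",
    "NONE": "No reflective element",
}
```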
9. Example accounts (anonymised)
| Dim: Type | Example |
|---|---|
| SA: Identification of a conflict | “[Victor] and [Morgan], you are right that I should have applied better my own learning instead of using the Uni ones.” |
| CA: Reasoning | “I imagine this is probably in order to have a focus and provide enough detail rather than skim over the whole area.” |
| TP: Switch point of view | “When I am doing FRT work, I often think about how the parents view me when they know I haven’t got children!” |
| OD: Documentation of an insight | “After I saw how this lifted her mood and eased her anxiety, I will remember that what we can view sometimes to be small can actually make a significant difference.” |
| OD: Intention | “I would like to be involved in helping with the site, too - although I’m a novice! I imagine this is probably in order to have a focus and provide enough detail rather than skim over the whole area.” |
| OD: New understanding | “This has helped me reflect on my own life and experiences whilst allowing me to empathise with others in their own circumstances; I feel proud of what I have achieved so far as the work/life/study balance is always difficult to navigate, but I’m lucky that I have a supportive family to help.” |
| None | “Bye the way, Audacity is also run under the CC Attribution.” |
10. Methodology:
creating a gold standard
Pipeline:
• Corpus selection: OU LMS forum posts; 4 subjects, 2 years; mid-range-length postings
• Sanitize: de-identification
• Chunking (for cues): at sentence level
• Sample: 1,000 random sentences (500 personal, 500 non-personal); see the sketch after this list
• Batching: expand grid, 10 batches
• Crowdsourcing: control questions; 5 raters each
• ‘Spam’ filtering: justification valid; ‘gold questions’ passed
• Objectification: ‘majority vote’; interrater reliability
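A minimal sketch of the corpus-preparation steps (length filter, sentence chunking, random sampling) in Python, assuming NLTK as the sentence splitter; the character bounds come from the speaker notes at the end, while the function name and seed are illustrative:

```python
# Sketch of corpus selection, chunking, and sampling as described above.
# Assumes posts are already de-identified strings; NLTK is one possible
# sentence splitter, not necessarily the authors' tooling.
import random
import nltk

nltk.download("punkt", quiet=True)  # sentence-tokenizer models

def prepare_sample(posts, n_sentences=1000, seed=0):
    # Corpus selection: keep mid-range-length postings
    # (1,500-3,500 characters, per the speaker notes).
    mid_range = [p for p in posts if 1500 <= len(p) <= 3500]

    # Chunking: split each post into sentences, the unit of analysis.
    sentences = [s for post in mid_range for s in nltk.sent_tokenize(post)]

    # Sampling: draw 1,000 random sentences (the 500/500 split into
    # personal and non-personal sentences is omitted here).
    random.seed(seed)
    return random.sample(sentences, n_sentences)
```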
11. Crowdsourcing
• CrowdFlower: the ‘virtual pedestrian area’
• Pre-tests showed:
– Really simple questions are needed for HITs
– But: quick answer options increase spam
– Short texts are easier than long texts
(less spam, lower cost)
– Answer options must be shuffled to avoid position artefacts (sketched below)
• Check: a larger-than-usual number of raters (5+) to see
how reliable the judgements are
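To illustrate the shuffling point, a minimal sketch of building one rating task with a randomised answer order; the category names and the dictionary layout are illustrative, not CrowdFlower’s actual job format:

```python
# Illustrative only: randomise the order of answer options per task so
# raters cannot develop a click-position habit; not CrowdFlower's real API.
import random

OPTIONS = ["self-awareness", "conflict", "reasoning",
           "point of view", "insight/plan/intention", "none"]

def build_task(sentence, rng=random):
    options = OPTIONS[:]          # copy so the master list stays ordered
    rng.shuffle(options)          # shuffle answers to avoid artefacts
    return {"sentence": sentence, "options": options}
```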
18. Interrater Reliability
– Raw data
• Baseline (control questions): Krippendorff’s α = 0.43
• Control questions + survey data: α = 0.32
• Survey data: α = 0.22
– ‘Objectified’ data: majority vote (sketched below)
• At least 3 of the raters agree:
– Survey data: α = 0.36 (623 of 1,000 sentences)
• At least 4 agree:
– Survey data: α = 0.58 (301 sentences)
• All 5 agree:
– α = 0.98 (with outliers; 107 of 1,000 sentences)
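A hedged sketch of both computations: majority-vote objectification and Krippendorff’s α, here via the open-source `krippendorff` package (`pip install krippendorff`); the slides do not say which implementation was used, and the example ratings below are made up.

```python
# Sketch of majority-vote objectification and Krippendorff's alpha.
# Uses the open-source `krippendorff` package; the data is invented.
from collections import Counter
import numpy as np
import krippendorff

def majority_vote(labels, threshold=3):
    """Keep a sentence's label only if at least `threshold` raters agree."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= threshold else None

# One row per rater, one column per sentence; entries are category codes
# (use np.nan where a rater left an item unanswered).
ratings = np.array([
    [0, 1, 2, 2, 0],
    [0, 1, 2, 1, 0],
    [0, 1, 2, 2, 1],
    [0, 2, 2, 2, 0],
    [0, 1, 2, 2, 0],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha = {alpha:.2f}")
```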
19. Discussion
• Agreement of all 5 of course increases IRR
– to 0.98 unfiltered
– when omitting ‘over-answering’: to 1.0
– But: reduces the data to single-category sentences
• Agreement of 3 deemed good enough,
since the questions were single choice,
whereas multiple answers can be correct
• Sentences are a reduction, but allow
zooming in on markers
• Context: forum texts
• Personal vs. non-personal sentences
Speaker notes:
• Majority vote; the examples are from the result data set.
• The number of posts results from filtering by length (1,500-3,500 characters, yielding ~1,600 texts) out of about 16,000 texts, which means most texts are actually shorter than 1,500 characters (‘me, too’, ‘yes’, …).
• After filtering: 623 sentences. Not clear whether …
• Application-scenario question: if we know the expected frequency in a large body of texts, can we use it to spot differences between courses? (el1/el2 = e-learning, with 2 the year before; swl2 = social work; sci = science)
• Alpha okay, but not great; therefore an analysis of whether the values change across the 3 batches.
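The application-scenario question in the notes can be made concrete with a goodness-of-fit test: compare a course’s observed label counts against the counts expected from a reference frequency. A sketch with invented numbers, assuming SciPy:

```python
# Sketch of the application scenario from the notes: does one course's
# share of reflective sentences differ from a reference frequency taken
# from a large body of texts? All numbers here are made up.
from scipy.stats import chisquare

observed = [140, 860]     # reflective vs. non-reflective sentences in a course
reference_rate = 0.10     # assumed frequency from the large reference corpus
n = sum(observed)
expected = [n * reference_rate, n * (1 - reference_rate)]

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p:.4f}")
```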