Slides accompanying the paper: Simon Knight, Andrea Leigh, Yvonne C. Davila, Leigh J. Martin & Daniel W. Krix, Assessment and Evaluation in Higher Education. https://doi.org/10.1080/02602938.2019.1570483
In calibration tasks students assess exemplar texts using criteria against which their own work will be assessed. Typically these tasks are used in the context of training for peer assessment. Little research has been conducted on the benefits of calibration tasks, such as benchmarking, as learning opportunities in their own right. This paper examines a dataset from a long-running benchmarking task (~500 students per semester, for four semesters). We investigate the relationship of benchmarking performance to other student outcomes, including the ability to self-assess accurately. We show that students who complete the benchmarking perform better, that there is a relationship between benchmarking performance and self-assessment performance, and that students appreciate the support for learning that benchmarking tasks provide. We discuss implications for teaching and learning, flagging the potential of calibration tasks as an under-explored tool.
4. How do we give feedback at scale?
• Assess infrequently
• Make assessments that are easy to give feedback on (quizzes, etc.)
• Use peer and self-assessment
• Ensure teams of tutors (and students) can give quality feedback
• Provide opportunities for whole-cohort practice
6. Benchmarking: how do we give feedback at scale?
Benchmarking tasks require students to give feedback on previously marked exemplars, typically of varying quality.
Why?
1. Students engage with criteria & their use
2. Students critically assess exemplars
3. Students & academics see how well students apply the criteria (how they're calibrated) – feedback opportunity
8. How do we measure impact?
Measuring the impact of teaching innovation is hard.
Typically there are lots of semester-to-semester changes (including the students!).
Impact is often measured via 'happy sheets' and a few enthusiastic learners.
9. Our context: how do we measure impact?
~500 students in first-year Biocomplexity.
Since 2012 they have done a benchmarking task via SPARKPlus + self-assessment.
We analysed the 2012–15 data.
11. Q1: How accurate are students in their self-assessments, and what is the relationship of this to their grade?
[Figure: histogram comparing the distribution of staff marks to student self-assessments]
Students over-estimate their grades; those who over-estimate do worse.
There is a strong correlation, i.e. a relationship between over-estimating and having a lower mark (and vice versa); r(2012) = .68, p < .0001.
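A minimal sketch of this kind of distance-and-correlation check; the DataFrame, column names, and values below are hypothetical illustrations, not the study's data or code:

```python
import pandas as pd
from scipy import stats

# Hypothetical marks; the real analysis used the full cohort data.
marks = pd.DataFrame({
    "staff_mark": [72, 65, 80, 58, 90, 61, 77],
    "self_mark":  [78, 75, 82, 70, 88, 74, 75],
})

# Distance score: actual (staff) mark minus self-assessed mark, so
# over-estimators get negative distances and under-estimators positive ones.
marks["distance"] = marks["staff_mark"] - marks["self_mark"]

# Correlate the distance with the staff mark; the slide reports
# r(2012) = .68, p < .0001 for the real 2012 cohort.
r, p = stats.pearsonr(marks["distance"], marks["staff_mark"])
print(f"r = {r:.2f}, p = {p:.4f}")
```

The same pearsonr call covers the Q5 analysis further on, where benchmarking distance scores are correlated with self-assessment distances instead of final marks.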
12. Q2: Do students who complete the benchmarking perform better in their assessment than those who do not?
Students who do not complete the benchmarking perform worse (ignoring students who dropped out):
Did the task: M = 74.14 (SD = 9.28, N = 1979)
Did not do the task: M = 68.24 (SD = 12.16, N = 129)
t(137.88) = 5.41, p < .0001; d = 0.62 (medium effect).
Students who didn't do the task also varied more in their criterion-level marks.
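A minimal sketch of this kind of two-group comparison, assuming the marks sit in two arrays; the data below are synthetic draws matching the reported group statistics, and none of this is the authors' actual code:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic marks drawn to match the reported group means and SDs.
did_task = rng.normal(74.14, 9.28, size=1979)
did_not  = rng.normal(68.24, 12.16, size=129)

# Welch's t-test (unequal variances), consistent with the fractional
# degrees of freedom reported on the slide, t(137.88).
t, p = stats.ttest_ind(did_task, did_not, equal_var=False)

# Cohen's d using the pooled standard deviation; this formulation
# reproduces the reported d = 0.62 from the slide's means and SDs.
def cohens_d(a, b):
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.std(ddof=1) ** 2 +
                      (nb - 1) * b.std(ddof=1) ** 2) / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

print(f"t = {t:.2f}, p = {p:.4g}, d = {cohens_d(did_task, did_not):.2f}")
```

The same Welch comparison applies to the Q4 result below, where the groups' self-assessment distance scores are compared instead of their overall marks.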
13. Q3: Is accuracy on the benchmarking predictive of final mark?
No evidence of a link between benchmarking accuracy and final mark.
14. Q4: Are students who complete the benchmarking significantly more accurate in their self-assessment?
Students who benchmark are better self-assessors. Comparing distances (i.e. the mark they gave themselves, subtracted from their actual mark):
Did the task: M = 1.15 (SD = 16.13, N = 1979)
Did not do the task: M = 8.13 (SD = 27.47, N = 129)
t(133.84) = 3.00, p = .0032; a medium effect (d = .62).
15. Q5: Is accuracy on the benchmarking related to self-assessment accuracy?
Yes: students who are more accurate at benchmarking are more accurate self-assessors. There is a small significant relationship between the benchmarking distance scores and student self-assessment distances; r(1887) = .10, p < .0001.
16. Q6: What are student perceptions of feedback structures to support their assignment completion?
Students think the task is valuable.
In 2012 & 2013, 430 students (~45% of the cohort) completed a feedback survey. The feedback from these cohorts was generally positive (>75% agree or strongly agree on all questions):
• The SPARK benchmarking process (week 4) helped me to engage early with the report assessment criteria
• The report assessment criteria helped me to understand what was expected in my report
• I followed the assessment criteria closely when writing my report
• I understood how each assessment criterion contributed to a particular Graduate Attribute
• Self-assessing my report helped me to critically evaluate my own academic performance in this task
• I have a better understanding of why scientific writing skills are important for a scientific career
• Overall I was satisfied with the report-writing learning process
17. For example…
"Benchmarking helped me to understand what level of writing was expected for each grade. The feedback and re-submission really helped me to better my writing and to understand how I could improve."
"That it forced me to be familiar with the marking criteria BEFORE writing the assignment. Usually I look at the criteria after writing the assignment and seeing whether it met the criteria, but with this method I made sure to incorporate the points whilst writing."
"Having previous reports to look at and gain understanding how to write and what the markers are looking for."
20. How do we give feedback at scale?
• Is the quality of students' written comments in benchmarking related to their other learning outcomes? (ongoing analysis)
• If students learn from giving feedback, how do we build that capacity?
• How do we support students to understand the feedback they receive, and to make sure they get consistently good feedback from tutors?
• Created a tutor & student feedback guide
23. Feedback on the "References" criterion from Example B (examples 1–5)
“It has got 6 references which is good number of credible references.”
"In this report, the person did demonstrate a well knowledge about referencing."
“Harvard style referencing was well used. Next time include volumes/editions/page
number to specify what section of the book/journal information was obtained. The
quality of paraphrasing is really only of a credit standard - it’s evident you’ve used
some secondary sources, but the in-text referencing style needs a little work; to
improve, have a look at the rubric given, and familiarise yourself with the resources
provided on “in-text” referencing.”
"The citation was well presented and done correctly however the reference list was poorly set out. Out of the four resources on the reference list only one contained authors. The others were scientific journals and books and therefore needed authors on this list. The second resource lacks volume and page numbers whilst the third lacks a sub heading reference that the second has. Lack of consistency is found throughout this reference list and needs more work. Lastly only four references is not enough to validate the argument. Very little citation of these references are found in the discussion and therefore isn't linking the work of the valid resources to the reasons within the experiment. Great improvement needed."
“Referencing was okay but infrequent and some references were ancient.”
Rank these pieces of feedback in order of “most useful” (1) to “least useful” (5)
25. Thank you
Simon Knight, Andy Leigh, Yvonne Davila, Leigh Martin
@sjgknight
Thanks to Dan Krix and Alex Thompson for their work on the benchmarking project, and to other academics and students who have contributed to the benchmarking development.
Thanks to Shirley Alexander for VCLT funding in support of this project.
Draft paper available on request.
https://tinyurl.com/BenchmarkingGuide
26. 2012–15 data analysed
Data from 2012–15 of this innovation was analysed to investigate the relationship between the accuracy of student self-assessments and learning outcomes, and to understand the features of quality feedback in these tasks. Analysis indicates that:
• students who complete the benchmarking task perform better
• students who are more accurate self-assessors perform better
• students who are more accurate in the benchmarking task are also more accurate in the self-assessment task
• students are overwhelmingly positive about the task, and are able to articulate its key intended learning outcomes
Editor's Notes
One approach – assess infrequently…the worst of the approaches.
Easy auto-grading quizzes can be quite a good method to introduce practice opportunities, but have limited scope.
So peer and self-assessment is interesting, and there is good evidence that it leads to positive outcomes.
students who overestimate their mark (i.e., have negative distances) are more likely to have lower overall marks, while students who underestimate their mark significantly (i.e. have positive distances), are more likely to have higher overall marks
There was a significant difference in the overall marks of students such that those who completed the benchmarking task scored higher (M = 74.14, SD = 9.28, N = 1979), compared to those who did not (M = 68.24, SD = 12.16, N = 129); t(137.88) = 5.41, p < .0001. d = 0.62. In addition, students who completed the benchmarking task had significantly lower mark variability among criteria (computed by calculating the standard deviation of their marks across the criteria) (M = 5.54, SD = 3.28, N = 1972), than those who did not (M = 6.73, SD = 5.59, N = 129); t(133.82) = 2.41, p = .01734, d = 0.20. That is, students who did not complete the benchmarking task performed significantly poorer overall, and achieved less consistent marks across the criteria, implying a poorer ability to calibrate against these criteria. d (or Cohen’s d) is an effect size measure representing the difference between the two group means divided by the average of their standard deviations, thus a d of 1 represents that the two groups differ by 1 SD, .5 by half an SD, etc., with .2 considered small, .5 medium, and .8 large
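As a worked formula: the note above describes d loosely as the mean difference over the "average" of the SDs, but the pooled-SD formulation below is the standard one, and it reproduces the reported d = 0.62 from the group statistics given:

$$d = \frac{M_1 - M_2}{s_p}, \qquad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

$$s_p = \sqrt{\frac{1978 \cdot 9.28^2 + 128 \cdot 12.16^2}{2106}} \approx 9.48, \qquad d = \frac{74.14 - 68.24}{9.48} \approx 0.62$$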
Next phase is to look at the written comments
That is, (in)accuracy in the benchmarking task and (in)accuracy in self-assessment are related, such that those who were more inaccurate in the benchmarking were also more inaccurate in their self-assessment.