This chapter discusses aligning language tests to standards. It defines standard setting as the process of establishing cut scores on examinations, and standards-based assessment as using tests to assess learner performance against an absolute standard. Various standard-setting methodologies are presented, including test-centered methods such as Angoff, Ebel, Nedelsky, and Bookmark, and examinee-centered methods such as borderline group and contrasting groups. Effective standard setting requires expert judgment, training of judges, and evaluation of the process and results both internally and externally. While standards can promote harmonization, issues such as unintended consequences, lack of context specificity, and forced consensus must be considered. Ultimately, uncertainty remains in standard setting.
2. Content of this chapter
It’s as old as the hills
The definition of ‘standards’
The uses of standards
Unintended consequences revisited
Using standards for harmonization and identity
How many standards can we afford?
Performance level descriptors (PLDs) and test scores
Some initial decisions
Standard-setting methodologies
Evaluating standard setting
Training
The special case of the CEFR
You can always count on uncertainty
3. It’s as old as the hills
Standard setting = The process of establishing
one or more cut scores on examinations
Standards-based assessment = Using tests to
assess learner performance and achievement in
relation to an absolute standard
Standards-based assessment is a development of criterion-referenced testing, but one that uses large-scale standardized tests; the practice of standard setting itself pre-dates the criterion-referenced testing movement.
4. Definition of ‘standard’
Standard = a level of performance
required or experienced (Davies et al.,
1999).
Example: The standard required for entry
to the university is an A in English.
5. The uses of standards
Educational purposes: (achievement tests)
Professional purposes: (certification of aircraft
engineers)
Political purposes: (e.g., No Child Left Behind (NCLB) and Adequate Yearly Progress (AYP) requirements in the US)
Immigration policy purposes
6. Unintended consequences
In the case of NCLB: the English language learner (ELL) group always falls below the standard, and resources are not channeled to where they are most needed.
Mandatory use of English in tests of content subjects puts pressure on indigenous peoples to abandon education in their own language.
The use of language tests for immigration leads to
fraudulent practices & short-term paper marriages
7. Using standards for harmonization & identity
To enforce conformity to a single model that helps
to create and maintain political unity and identity.
Examples:
Carolingian empire of Charlemagne (CE 800–
814)
CEFR (Now)
8. Carolingian empire of Charlemagne
Within the empire of Charlemagne, in Central and Western
Europe, various groups followed different calendars, and the
main Christian festivals fell on different dates.
In order to bring uniformity, Charlemagne set a new standard for 'computists', who worked out the times of festivals. They were required to pass a test in order to get their certificate.
There are no ‘correct answers’ for the questions in the
Carolingian test, they are scored as ‘correct’ because they
are defined as such by the standard, and the standard is
arbitrarily chosen with the intention of harmonizing practice.
9. CEFR (Common European Framework of Reference)
CEFR = a set of standards (six-level scales and their descriptors) that provides a European model for language testing and learning to enhance European identity and harmonization.
Teachers should align their curricula and tests to CEFR standards ('linking'); otherwise many European institutions will not recognize the certificates they award.
10. Problems with CEFR
It drains teachers' creativity.
The same set of standards is used for all people across different contexts, with different purposes.
Validation is reduced to linking the test to the CEFR, which runs counter to current validity theory.
The use of standards and tests for harmonization
ultimately leads to a desire for more control.
11. How many standards can we afford?
The number of performance levels depends on the goals and
the use of the test.
Choosing the fewest performance levels (e.g., pass/fail) is preferable: the more numerous the categories, the greater the danger that small differences in marks place test takers into different classes.
The Index of Separation estimates the number of performance levels into which a test can reliably place test takers (see the sketch below).
Sometimes, however, numerous categories are needed to motivate young learners.
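The slide above names the Index of Separation without giving a formula. A minimal sketch of one common formulation from the Rasch measurement literature (often attributed to Wright and Masters), assuming `reliability` is a person-reliability estimate for the test:

```python
import math

def separation_strata(reliability: float) -> float:
    """Estimate how many statistically distinct performance levels
    a test can support, given a reliability estimate R.

    Separation index: G = sqrt(R / (1 - R))
    Distinct strata:  H = (4 * G + 1) / 3
    """
    g = math.sqrt(reliability / (1.0 - reliability))
    return (4.0 * g + 1.0) / 3.0

# A test with reliability 0.90 can support roughly four distinct
# performance levels; reporting more levels than this is unreliable.
print(round(separation_strata(0.90), 1))  # 4.3
```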
12. Performance level descriptors (PLDs) & test scores
PLDs are often developed using intuitive and experiential methods, and their labels and descriptors are simple reflections of the values of policy makers.
There are typically four levels: 'advanced – proficient – basic – below basic'.
The PLDs provide a conceptual hierarchy of performance that is
an indication of the ability or knowledge of the test taker.
Standard setting is the process of deciding on a cut score for a test to mark the boundary between two PLDs. If we have two performance levels (pass and fail), we need a single cut score; in general, n levels require n − 1 cut scores.
13. Standards-based tests, CRT & scoring rubrics
It is said that tests used in standards-based testing are criterion-referenced, yet for Glaser the criterion was the domain of knowledge or skill, not a cut score; it had nothing to do with standard setting and classification.
The standards-based testing movement has reinterpreted 'criterion' to mean 'standard'.
The focus within PLDs is on the general levels of
competence, proficiency, or performance while scoring
rubrics address only single items.
14. Some initial decisions
All standard-setting methods involve expert judgmental decision making at some level (Jaeger, 1979).
Decision 1: Compensatory or non-compensatory marking? In compensatory marking, strength in one area 'compensates' for weakness in another; in non-compensatory marking, every area must be passed separately (see the sketch after this list).
Decision 2: What classification errors can you tolerate?
Decision 3: Are you going to allow test takers who ‘fail’
a test to retake it? If so, what time lapse is required to
retake the test?
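To make Decision 1 concrete, here is a minimal sketch of the two marking models; the section names and cut scores are hypothetical, not taken from the chapter:

```python
def passes(subscores: dict[str, float],
           cuts: dict[str, float],
           compensatory: bool) -> bool:
    """Compensatory: only the total matters, so strength in one area
    can offset weakness in another.
    Non-compensatory (conjunctive): every section cut must be met."""
    if compensatory:
        return sum(subscores.values()) >= sum(cuts.values())
    return all(subscores[s] >= cuts[s] for s in cuts)

scores = {"reading": 28, "writing": 17}  # strong reading, weak writing
cuts   = {"reading": 20, "writing": 20}
print(passes(scores, cuts, compensatory=True))   # True: reading compensates
print(passes(scores, cuts, compensatory=False))  # False: writing cut not met
```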
16. Classification of standard-setting methods
[Diagram: standard-setting methods classified as test-centered vs. examinee-centered and as criterion-referenced vs. norm-referenced]
17. Common process of standard setting
Select an appropriate standard setting method depending upon the
purpose of the standard setting, available data, and personnel.
Select a panel of judges based upon explicit criteria.
Prepare the PLDs and other materials as appropriate.
Train the judges to use the selected method.
Rate items or persons, collect and store data.
Provide feedback on rating and initiate discussion for judges to
explain their ratings, listen to others, and revise their views or
decisions, before another round of judging.
Collect final ratings and establish cut scores.
Ask the judges to evaluate the process.
Document the process in order to justify the conclusions reached.
18. Test-centered methods
The judges are presented with individual
items or tasks and required to make a
decision about the expected performance on
them by a test taker who is just below the
border between two standards.
19. Angoff method
Experts are given a set of items and they need to
rate the probability that a hypothetical learner
(who is on the borderline) would answer each test
item correctly.
Each judge's probabilities are summed across items to give that judge's recommended cut score, and the average of these sums across judges or raters is the final cut score.
If the test contains polytomous items or tasks, the
proportion of the maximum score is used instead
of the probability (modified Angoff).
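A minimal sketch of the Angoff computation just described, with hypothetical judge ratings; for the modified Angoff, the probabilities would simply be replaced by judged proportions of the maximum score:

```python
from statistics import mean

# judges[j][i] = judge j's estimated probability that a borderline
# test taker answers item i correctly (hypothetical figures).
judges = [
    [0.80, 0.60, 0.40, 0.70, 0.55],  # judge 1
    [0.75, 0.65, 0.35, 0.60, 0.50],  # judge 2
    [0.85, 0.55, 0.45, 0.65, 0.60],  # judge 3
]

# Summing a judge's probabilities gives the borderline candidate's
# expected raw score, i.e. that judge's recommended cut score;
# the panel's cut score is the mean across judges.
per_judge = [sum(ratings) for ratings in judges]
print(per_judge)                  # [3.05, 2.85, 3.1]
print(round(mean(per_judge), 2))  # 3.0 out of 5 items
```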
20. Advantages & disadvantages
Advantages (+): clarity and simplicity.
Disadvantages (−): the cognitive difficulty of all judges conceptualizing the borderline learner in precisely the same way.
21. Ebel method
Two rounds. Experts independently classify test items by:
I. Level of difficulty: easy, medium, hard
II. Level of relevance: essential, important, acceptable, questionable
22. Ebel method
The judges estimate the percentage of items a borderline
test taker would get correct for each cell.
Then the percentage for each cell is multiplied by the number of items in that cell: if the 'easy/essential' cell has 20 items and the judged percentage is 85, then 20 × 85 = 1,700.
These numbers for each of the 12 cells are added up and
then divided by the total number of items to give the cut
score for a single judge.
Finally, these are averaged across judges to give a final
cut score.
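A minimal sketch of the Ebel arithmetic for a single judge, using hypothetical cell counts and percentages (not the figures from the worked example on the next slide):

```python
# For each difficulty-by-relevance cell: the number of items in it (A)
# and the judged % of those items a borderline test taker would get
# correct (B). Only four of the twelve cells are filled here.
cells = [
    # (A, B)
    (20, 85),  # easy / essential
    (10, 60),  # medium / essential
    (5,  30),  # hard / essential
    (8,  70),  # easy / important
]

total_items = sum(a for a, _ in cells)      # 43
weighted    = sum(a * b for a, b in cells)  # sum of A*B products

# Dividing the summed products by the number of items gives this
# judge's cut score as a percentage; these are then averaged across
# judges (not shown) for the final cut score.
print(weighted / total_items)  # 70.0
```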
23. All items can be classified into 12 cells in a 3×4 grid defined by the three difficulty and four relevance categories, as in the example below (A = number of items in a category; B = % of items a borderline test taker would perform correctly; only the 'essential' and 'questionable' rows are shown):

                     Expert 3          Expert 4          Expert 5
                     A    B    A×B     A    B    A×B     A    B    A×B
Essential
  Easy               11   60   660     10   70   700     13   75   975
  Medium             1    25   25      3    25   75      1    0    0
  Hard               0    10   0       1    0   0        0    0    0
Questionable
  Easy               0    0    0       0    0   0        0    0    0
  Medium             0    0    0       0    0   0        0    0    0
  Hard               0    0    0       0    0   0        0    0    0
Mean                 25.1              26.7              35
Mean for all experts: 28
Cut-score: 12
…
24. Problems with the Ebel method
The complex cognitive requirements of classifying items according to two criteria in relation to an imagined borderline student may be challenging for the judges.
The assumption that some items have only questionable relevance to the construct of interest implicitly throws into doubt the rigor of the test development process and its validity arguments.
25. Nedelsky method (multiple choice)
The experts estimate which options of each multiple-choice item a borderline test taker would be able to eliminate as incorrect.
In a four-option item with three distractors, if a candidate can eliminate all three distractors, the chance of getting the item right is 1 (100%); if the candidate can rule out only one distractor, the chance of answering correctly is 1 in 3 (33%).
These probabilities are averaged across all items for each
judge, and then across all judges to arrive at a cut score.
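A minimal sketch of the Nedelsky computation for one judge, with hypothetical elimination counts. Summing the per-item probabilities gives the judge's expected borderline score; this equals the slide's averaged probability multiplied by the number of items:

```python
# Per four-option item: how many of the three distractors the judge
# believes a borderline test taker could eliminate (hypothetical).
eliminated = [3, 1, 2, 0, 3]

# Chance of a correct answer = 1 / (number of options remaining).
probs = [1.0 / (4 - e) for e in eliminated]
print([round(p, 2) for p in probs])  # [1.0, 0.33, 0.5, 0.25, 1.0]

# This judge's expected score for the borderline candidate:
print(round(sum(probs), 2))  # 3.08 out of 5 items
```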
26. Problems with Nedelsky method
It assumes that test takers answer multiple choice items by
eliminating the options that they think are distractors and then
guessing randomly between the remaining options. However, it
is highly unlikely that test takers answer items in this way.
The Nedelsky method tends to produce lower cut scores than other methods and is therefore likely to increase the number of false positives.
28. Standard setting: basic steps of the procedure
Round I: Experts are informed of the number of cut scores to be established; they work in small groups, and all the essential material is introduced to them.
Round II: The cut scores established by every expert are reviewed, and the procedure from Round I is repeated.
Round III: The percentage of students falling into each performance level and each median cut score from Round II are presented; after discussion, experts finalize their individual judgments.
29. Procedures in the Bookmark method
Judges are presented with the necessary materials (an item booklet ordered from easiest to hardest).
Then they are asked to keep in mind a borderline student and place a 'bookmark' between two items, such that the candidate is likely to answer the items before the bookmark correctly and the items after it incorrectly.
The bookmarks are discussed in groups, and finally the median of the bookmarks for each cut point is taken as that group's recommendation for that cut point.
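The slide specifies only that the median bookmark is taken; turning a bookmark into a reportable cut score depends on the scaling model. A minimal sketch assuming a Rasch-scaled ordered item booklet and the common RP67 convention (a borderline candidate answers a "mastered" item with probability 2/3); both assumptions go beyond the slide itself:

```python
import math
from statistics import median

# Item difficulties of the ordered booklet, easiest to hardest.
difficulties = [-1.8, -1.1, -0.6, -0.2, 0.3, 0.9, 1.4]

# Each judge's bookmark = the number of items before it, i.e. the
# items the borderline candidate is expected to master (hypothetical).
bookmarks = [4, 5, 4]

k = int(median(bookmarks))  # the group's recommended bookmark

# Under a Rasch model, P(correct) = 2/3 on the last mastered item
# when ability = item difficulty + ln(2).
theta_cut = difficulties[k - 1] + math.log(2)
print(round(theta_cut, 2))  # 0.49 on the ability (theta) scale
```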
30. Examinee-centered methods
The judges make decisions about whether
individual test takers are likely to be just
below a particular standard; the test is
then administered to the test takers to
discover where the cut score should lie.
31. Borderline group method
The judges define what borderline candidates are
like, and then identify borderline candidates who fit
the definition.
Once the students have been placed into groups the
test can be administered. The median score for a
group defined as borderline is used as the cut score.
The main problem: the cut score is dependent upon
the group being used in the study.
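A minimal sketch of the borderline-group calculation, with hypothetical scores:

```python
from statistics import median

# Test scores of the students whom the judges identified as
# borderline (hypothetical); the group's median becomes the cut score.
borderline_scores = [18, 21, 22, 24, 25, 27, 30]
print(median(borderline_scores))  # 24
```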
32. Method of contrasting groups
The procedure involves testing two groups of examinees:
• The examinees must first be classified (e.g., as competent or not competent) using independent criteria, such as teacher judgments.
• The test is then given, and the score distributions of the two groups are calculated. There are likely to be overlaps in the distributions.
• The cut score is placed in the region of overlap, typically at the point where the two distributions intersect, so as to minimize misclassification.
33. [Figure: overlapping score distributions of the 'competent' and 'non-competent' groups, with the cut score in the region of overlap]
34. Which method is the ‘best’?
It depends on what kind of judgments you can get for your
standard-setting study, and the quality of the judges that you
have available.
However, the contrasting-groups approach is recommended where possible, because it is the only method that allows the calculation of likely decision errors (false positives and false negatives) for cut scores (see the sketch below).
The practical problem is obtaining judgments from a sufficient number of people about a large group of individual test takers.
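A minimal sketch of the decision-error calculation that the contrasting-groups design makes possible, using hypothetical scores and independent (e.g., teacher) classifications:

```python
# Scores of examinees independently judged competent / not competent.
competent     = [22, 24, 26, 27, 29, 30, 31, 33]
non_competent = [15, 17, 18, 20, 22, 23, 25, 26]

def decision_errors(cut: int) -> tuple[int, int]:
    """False negatives: competent examinees below the cut.
    False positives: non-competent examinees at or above the cut."""
    fn = sum(1 for s in competent if s < cut)
    fp = sum(1 for s in non_competent if s >= cut)
    return fn, fp

# Try each candidate cut in the overlap region and keep the one with
# the fewest total misclassifications (ties resolved by the lower cut).
best = min(range(22, 28), key=lambda c: sum(decision_errors(c)))
print(best, decision_errors(best))  # 24 (1, 2)
```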
35. Evaluating standard setting (Kane, 1994)
Procedural evidence
• What procedures were used for the standard setting to ensure that the process is systematic?
• Were the judges properly trained in the methodology and allowed to express their views freely?
Internal evidence
• Deals with the consistency of the results arising from the procedure.
• It also estimates the extent of agreement between judges (e.g., Cohen's kappa; see the sketch after this slide).
External evidence
• Correlation of the scores of learners in a borderline-group study with some other test of the same construct.
• High correlation = the established cut scores are defensible.
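As an illustration of the internal evidence above, a minimal sketch of Cohen's kappa for two judges' pass/fail classifications of the same examinees (hypothetical data):

```python
from collections import Counter

def cohens_kappa(r1: list[str], r2: list[str]) -> float:
    """Chance-corrected agreement: kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n      # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[k] * c2[k] for k in c1) / n ** 2      # expected by chance
    return (p_o - p_e) / (1 - p_e)

j1 = ["pass", "pass", "fail", "pass", "fail",
      "pass", "fail", "fail", "pass", "pass"]
j2 = ["pass", "fail", "fail", "pass", "fail",
      "pass", "fail", "pass", "pass", "pass"]
print(round(cohens_kappa(j1, j2), 2))  # 0.58
```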
36. Training: a critical part of standard setting
Training activities include familiarization with the PLDs and
the test, looking at the scoring keys, making practice
judgments, and getting feedback.
Different views may lead to disagreements among the judges.
Training should not be designed to eliminate these variations
but to allow free discussion among judges. If the judges do not
converge, the outcome should be accepted by the researchers.
The training process should not force agreement (cloning)
because removing their individuality and inducing agreement is
a threat to validity.
37. The special case of the CEFR
• The CEFR Manual contains performance level descriptors for
standard setting in order to introduce a common language and a single
reporting system into Europe.
• It recommends five processes to 'relate' language examinations to the CEFR (Common European Framework of Reference for Languages: Learning, Teaching, Assessment): familiarization, specification, standardization training/benchmarking, standard-setting, and validation.
• Familiarization, standard-setting, and validation are uncontentious because they reflect common international assessment practice that is not unique to Europe; the other two processes (specification and standardization training/benchmarking) are problematic.
38. PLDs in the CEFR & in other standards-based systems
In the CEFR:
• The use of PLDs is institutionalized, and their meaning is generalized across nations.
• Standardization facilitates 'the implementation of a common understanding' of the CEFR, and training is cloning rather than familiarization.
• Benchmarking = the process of rating individual performance samples using the CEFR PLDs.
• Standard-setting = 'mapping' the existing cut scores from tests onto CEFR levels.
In other standards-based systems:
• PLDs are evaluated in terms of their usefulness and meaningfulness; they can be discarded or changed.
• Standardization and training ensure that everyone understands the standard-setting method, yet judgments are freely made.
• Benchmarking = the typical performances identified after standard setting.
• Standard-setting = establishing cut scores on tests.
39. You can always count on uncertainty
Standards-based testing can be positive if people can
reach a consensus, rather than being forced to see the
world through a single lens. Used in this way,
standards are never fixed, monolithic edifices. They
are open to change, and even rejection, in the service
of language education.
Standards-based testing fails if it is used as a policy
tool to achieve control of educational systems with the
intention of imposing a single acceptable teaching and
assessment discourse upon professionals.