2. STEPS IN TEST DEVELOPMENT
• Test Conceptualization
• Test Construction
• Test Tryout
• Item Analysis
• Test Revision
3. STEP 1: TEST CONCEPTUALIZATION
• The process can often be traced to a thought:
• “There ought to be a test designed to measure _____ in
such and such way”
• An emerging phenomenon or pattern of behavior might
serve as the stimulus for test conceptualization
• Pilot Work: the generalized term for preliminary research
surrounding the creation of the test prototype
• Items must be subjected to pilot studies to evaluate whether or not they should be included in the final form of the test
4. STEP 1: TEST CONCEPTUALIZATION
• Criterion-Referenced: based on the amount of knowledge and/or the level of competence; employed in licensing
• Norm-Referenced: based on the performance of a specific group; employed in educational contexts to gauge mastery of material and an existing base of knowledge and skills
5. STEP 2: TEST CONSTRUCTION
• Scaling
• setting rules for assigning numbers in measurement
• the process by which a measuring device is designed and calibrated, and by which numbers (scale values) are assigned to different amounts of the trait, attribute, or characteristic being measured
6. STEP 2: TEST CONSTRUCTION
• Scaling Methods
• Rankings of Experts
• A panel of experts ranks the behavioral indicators and provides a meaningful numerical score
• Method of Equal-Appearing Intervals
• Developed by L. L. Thurstone (1929)
• A large number of true–false statements reflecting positive and negative attitudes is assembled
• The resulting items lie on an interval scale
• Reliability and validity analyses are important to determine the scale's appropriateness and usefulness
• An item with a large standard deviation (i.e., one on which judges disagree) would be dropped
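The judge-rating step above can be sketched in Python. This is a minimal illustration with hypothetical judge ratings, assuming each statement is rated on an 11-point favorability scale, the median rating serves as the item's scale value, and items with a large standard deviation (judge disagreement) are dropped:

```python
import statistics

# Hypothetical judge ratings (1-11 favorability) for three attitude statements.
judge_ratings = {
    "item_a": [2, 3, 2, 3, 2],    # judges agree: low SD, keep
    "item_b": [6, 6, 7, 6, 7],    # judges agree: keep
    "item_c": [1, 6, 11, 3, 9],   # judges disagree: high SD, drop
}

def scale_items(ratings, max_sd=2.0):
    """Return scale values (median judge rating) for items whose spread is acceptable."""
    kept = {}
    for item, rs in ratings.items():
        if statistics.stdev(rs) <= max_sd:
            kept[item] = statistics.median(rs)
    return kept

print(scale_items(judge_ratings))  # item_c is dropped for judge disagreement
```

The `max_sd` cutoff is an arbitrary illustration value; in practice the most ambiguous items are dropped relative to the rest of the pool.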
7. STEP 2: TEST CONSTRUCTION
• Scaling Methods
• Method of Absolute Scaling
• Obtaining a measure of absolute item difficulty based
on results for different age groups of testtakers
• Commonly used in group achievement and aptitude
testing
• Likert Scale
• Consists of ordered responses along a continuum
• Total score is obtained by adding the scores from
individual items
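Likert summation can be sketched as follows; the item names, responses, and the set of reverse-keyed items are hypothetical, assuming a 5-point scale in which negatively worded items are reverse-scored before summing:

```python
# Hypothetical 5-point Likert responses (1 = strongly disagree ... 5 = strongly agree).
responses = {"q1": 4, "q2": 5, "q3": 2, "q4": 1}
reverse_keyed = {"q3", "q4"}  # assumed negatively worded items

def likert_total(responses, reverse_keyed, scale_max=5):
    """Sum item scores, reverse-scoring negatively worded items."""
    total = 0
    for item, score in responses.items():
        if item in reverse_keyed:
            score = (scale_max + 1) - score  # reverse-score: 1<->5, 2<->4
        total += score
    return total

print(likert_total(responses, reverse_keyed))  # 4 + 5 + 4 + 5 = 18
```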
8. STEP 2: TEST CONSTRUCTION
• Scaling Methods (cont’d)
• Guttman Scales
• Respondents who endorse a stronger statement will also endorse the milder ones
• Method of Empirical Keying
• Test items are selected based entirely on how well they discriminate a criterion group from a normative sample
9. STEP 2: TEST CONSTRUCTION
• Scaling Methods (cont’d)
• Method of Rational Scaling
• All scale items correlate positively with each other and
with the total score for each scale
• Method of Paired Comparisons
• Testtakers are presented with pairs of stimuli which they
will be asked to compare
• Categorical Scaling
• Stimuli are placed into one of two or more alternative
categories that differ quantitatively with respect to
some continuum.
10. STEP 2: TEST CONSTRUCTION
• Writing Items
• Define clearly what you want to measure
• Generate an item pool
• Avoid exceptionally long items
• Keep the level of difficulty appropriate for those who will take the test
• Avoid double-barreled items that convey two or more
ideas at the same time
• Consider mixing positively and negatively worded items
11. STEP 2: TEST CONSTRUCTION
• Approaches to Test Construction:
• Rational (Theoretical) Approach
• Reliance on reason and logic over data collection for
statistical analysis
• Empirical Approach
• Reliance on data gathering to identify items that relate to the
construct
• Bootstrap
• Combination of the rational and empirical approaches: items are first written based on a theory, then an empirical approach is used to identify the items most highly related to the construct
12. STEP 2: TEST CONSTRUCTION
• Item Format: form, plan, structure, arrangement, and
layout of individual test items
• Multiple choice
• Matching
• Binary-choice (e.g., true–false)
• Short Answer
13. STEP 2: TEST CONSTRUCTION
• Scoring Models
• Cumulative
• the score is the number of responses that match the key, representing the amount of the construct being measured
• Class/Category
• the placement of an individual into a particular class for description or prediction
• Ipsative
• the indication of how an individual performed on one scale relative to other scales within the same test
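Cumulative scoring can be sketched as a simple match-against-key count; the answer key and the testtaker's responses below are hypothetical:

```python
# Cumulative scoring: the score is the count of responses matching the key.
answer_key = ["b", "d", "a", "c", "b"]   # hypothetical keyed answers
responses  = ["b", "d", "c", "c", "a"]   # one testtaker's answers

score = sum(given == keyed for given, keyed in zip(responses, answer_key))
print(score)  # 3 responses match the key
```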
14. STEP 3: TEST TRYOUT
• The test should be tried out on people who are similar in critical respects to the people for whom the test was designed
n = A × 5 to 10
A = number of items on the questionnaire
n = number of participants
• For validation purposes, there must be at least 20
participants each
• A good test helps in discriminating testtakers
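The A × 5 to 10 rule of thumb above translates directly into code; the item count used here is hypothetical:

```python
# Tryout sample-size rule of thumb: n = A x 5 to 10,
# where A is the number of items on the questionnaire.
def tryout_sample_size(n_items, per_item=5):
    """Suggested number of tryout participants for a questionnaire with n_items items."""
    return n_items * per_item

print(tryout_sample_size(30))      # lower bound: 150 participants
print(tryout_sample_size(30, 10))  # upper bound: 300 participants
```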
15. STEP 4: ITEM ANALYSIS
• Item-Difficulty Index
• Calculation of the proportion of the total number of testtakers who answered the item correctly
• The difficulty of the test can be found by averaging the
item-difficulty indices
• Item-Reliability Index
• Indication of the test’s internal consistency
• Use factor analysis
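The item-difficulty index and the averaging step above can be sketched as follows, with a hypothetical 0/1 score matrix (rows = testtakers, columns = items):

```python
# Item-difficulty index p = proportion of testtakers answering the item correctly.
scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
]

n_takers = len(scores)
# One p value per item (column); higher p means an easier item.
p_values = [sum(col) / n_takers for col in zip(*scores)]
print(p_values)                       # difficulty index for each item
print(sum(p_values) / len(p_values))  # average p = overall test difficulty
```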
16. STEP 4: ITEM ANALYSIS
• Item-Validity Index
• Indicates the degree to which a test item measures what the test intends to measure
• Can be calculated from the item-score standard deviation and the correlation between the item score and the criterion score
• Item-Discrimination Index
• How well an item discriminates between high scorers and low scorers
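A common formula for the item-discrimination index is d = (U − L) / n, where U and L are the numbers of testtakers in the upper and lower scoring groups who answered the item correctly and n is the size of each group. A sketch with hypothetical counts:

```python
# Item-discrimination index d = (U - L) / n.
def discrimination_index(upper_correct, lower_correct, group_size):
    """d near 1 favors high scorers; d near 0 means no discrimination;
    a negative d (low scorers outperform high scorers) flags a flawed item."""
    return (upper_correct - lower_correct) / group_size

print(discrimination_index(9, 3, 10))  # d = 0.6: item separates the groups well
print(discrimination_index(5, 5, 10))  # d = 0.0: item does not discriminate
```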
17. STEP 4: ITEM ANALYSIS
• Considerations:
• Guessing
• Item fairness
• Speed Tests
• Qualitative Item Analysis
• Comparison of individual test items with one another and
the test as a whole
18. STEP 4: ITEM ANALYSIS
• “Think Aloud” Test Administration
• Innovative approach to cognitive assessment by
having respondents verbalize thoughts as they occur
• Expert Panels
• Sensitivity Review
• Testtakers could be interviewed
19. STEP 5: TEST REVISION
• Popular culture changes
• Adequacy of test norms
• Changes in reliability or validity
• Theoretical modifications
20. STEP 5: TEST REVISION
• Cross-Validation
• Revalidation of a test on a sample of testtakers other
than those on whom test performance was originally
found to be a valid predictor of some criterion
• Co-validation
• A validation process conducted on two or more tests
using the same sample of testtakers
21. STEP 5: TEST REVISION
• Quality Assurance
• Anchor Protocol
• Produced by a highly authoritative scorer; designed to model scoring and to resolve any discrepancies that go along with it
• Scoring Drift
• Discrepancy between scoring in an anchor protocol and
another protocol
• Evaluate properties of existing tests and guide in revisions
• Determine measurement equivalence across populations
• Development of item banks