The Basics of IRT
         by
     AK Dhamija©
   (dhamija.ak@gmail.com)
www.geocities.com/a_k_dhamija

September 9, 2009
Coverage

• What is IRT?

• The Item Characteristic Curve

• Item Characteristic Curve Models

• Estimating Item Parameters

• The Test Characteristic Curve

• Estimating an Examinee's Ability

• The Information Function

• Test Calibration

• Characteristics of a Test

• Computer Adaptive Test
What is IRT?
• Classical Test Theory

   • Item statistics depend on the sample of respondents

   • Respondents' scores depend on the choice of items

   • Assumes equal errors of measurement at all levels of ability

   • No modeling of data at the item level

   • Items and respondents are on different scales

   • Scores are difficult to compare across two different tests because they are
     not on the same scale
What is IRT?
• IRT

   • Links observable respondent performance to unobservable traits

   • The theory is general

        • One or more abilities or traits

        • Various assumptions / models

        • Binary / polytomous data

   • Fit can be assessed at the level of a specific model

   DIF/DTF          LOGISTIC / MULTIDIMENSIONAL MODEL
What is IRT?
• Specific IRT model assumptions

   • Dominant first factor (multidimensional models exist too)

   • No dependency between items

   • A mathematical form of the ICC linking performance on items to the trait
     measured by the instrument
What is IRT?
• IRT benefits

   • Item parameter estimation is independent of the respondent sample

   • Trait estimation is independent of the particular choice of items
     (an invaluable property for CAT)

   • An error of measurement for each respondent

   • Item-level modeling allows for "optimal assessments"

   • Items and respondents are calibrated on the same scale
What is IRT?
• IRT Limitations

   • Practitioners lack expertise

   • IRT software is not straightforward to use

   • Large samples are needed for stable estimation

   • Doesn't address construct definition / domain convergence
What is IRT?
• Many choices of IRT Models

   • 1PL, 2PL, 3PL and Normal Ogive models (0-1 data)

   • Partial Credit, Generalized Partial Credit, Graded Response, Nominal
     Response models (polytomous data)

   • Multidimensional Logistic models (0-1 data)

• Estimation

   • Marginal MLE, Bayesian, etc.

• Software

   • Bilog-MG, Parscale, Multilog, ConQuest, Winsteps
The Item Characteristic Curve
• Unobservable, or latent, trait

• Generic term 'ability' in IRT

• Interval scale of ability

   • Theoretical values: [-∞, +∞]
   • Practical values:   [-3, +3]

• Ideally free-response items

   • Difficult to score reliably
The Item Characteristic Curve
• Data Preparation

   • Raw data recoded

   • Dichotomously scored items
     (for polytomous data: Samejima's Graded Response model)

   • Investigating dimensionality (a dominant first factor)

        • Examine the eigenvalues following Principal Axis Factoring (PAF)

        • If the data are dichotomous, factor analyze tetrachoric correlations

        • Assume a continuum underlies item responses
The Item Characteristic Curve
• Classical test theory

   • Raw score is the sum of the item scores

• IRT

   • Emphasis on individual items

   • Assumption: an ability score (θ)

   • P(θ) is the probability of a correct answer at ability θ
The Item Characteristic Curve
• P(θ) is small for low-ability examinees and large for high-ability examinees

• A smooth S-shaped curve

• The item characteristic curve (ICC)

   • A basic building block of IRT

• Two technical properties

   • Item difficulty
   • Item discrimination
The Item Characteristic Curve
• Item difficulty

   • A location index

• Three ICCs with the same discrimination but different levels of difficulty
The Item Characteristic Curve
• Item discrimination

   • The slope of the ICC at θ = b (the item's location)

• Differentiates between abilities below the item location and those above
  the item location

• The steepness of the ICC in its middle section

• The steeper the curve, the better the item can discriminate
The Item Characteristic Curve
• Caution

   • These two properties
        • say nothing about the validity of the item
        • simply describe the form of the ICC

   • Figures only show the range -3 to +3

   • All ICCs become asymptotic to a probability of zero at one tail and to 1.0
     at the other tail
The Item Characteristic Curve
• Item with perfect discrimination

   • The item discriminates perfectly between those above and below an
     ability score of 1.5

   • The item is useless for discriminating at other ability levels
The Item Characteristic Curve
• Difficulty will have the following levels:
        very easy
        easy
        medium
        hard
        very hard

• Discrimination will have the following levels:
        none
        low
        moderate
        high
        perfect
The Item Characteristic Curve

[Figure: example item characteristic curves]
The Item Characteristic Curve

                                   Recap

1. When the item discrimination is less than moderate, the item
   characteristic curve is nearly linear and appears rather flat.

2. When discrimination is greater than moderate, the item characteristic
   curve is S-shaped and rather steep in its middle section.

3. When the item difficulty is less than medium, most of the item
   characteristic curve has a probability of correct response that is greater
   than .5.

4. When the item difficulty is greater than medium, most of the item
   characteristic curve has a probability of correct response less than .5.



The Item Characteristic Curve


                                    Recap

5. Regardless of the level of discrimination, item difficulty locates the item
   along the ability scale. Therefore item difficulty and discrimination are
   independent of each other.

6. When an item has no discrimination, all choices of difficulty yield the
   same horizontal line at a value of P(θ ) = .5. This is because the value of
   the item difficulty for an item with no discrimination is undefined.




Item Characteristic Curve Models
• Enough of being intuitive; let's see the rigor needed by a theory

• Three mathematical models for the ICC

• The Logistic Function: 2 Parameter Model (2PL)

        P(θ) = 1 / (1 + e^(-L)) = 1 / (1 + e^(-a(θ - b)))

where:
        e is the constant 2.718
        b is the difficulty parameter
        a is the discrimination parameter
        L = a(θ - b) is the logistic deviate (logit), and
        θ is an ability level.
Item Characteristic Curve Models
• The difficulty parameter (b) is defined as the point on the ability scale at
which the probability of correct response to the item is .5.

• Range of b: theoretically [-∞, +∞]; practically [-3, +3]

• The discrimination parameter is proportional to the slope of the ICC at
θ = b.

• The actual slope at θ = b is a/4, but taking it as 'a' makes interpretation
easier

• Range of a: theoretically [-∞, +∞]; practically [-2.80, +2.80]

• The normal ogive model has a different interpretation
Item Characteristic Curve Models
        • Computational Example

        • b = 1.0, a = .5

                θ      Logit    EXP(-L)   1+EXP(-L)   P(θ)

               -3      -2.0      7.389      8.389      .12
               -2      -1.5      4.482      5.482      .18
               -1      -1.0      2.718      3.718      .27
                0       -.5      1.649      2.649      .38
                1       0.0      1.000      2.000      .50
                2        .5       .607      1.607      .62
                3       1.0       .368      1.368      .73
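A minimal Python sketch (the function name `icc_2pl` is my own, not from the text) reproduces the table above directly from the 2PL formula:

```python
import math

def icc_2pl(theta, b, a):
    """Two-parameter logistic ICC: P(theta) = 1 / (1 + exp(-a(theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Reproduce the computational example: b = 1.0, a = .5
for theta in range(-3, 4):
    logit = 0.5 * (theta - 1.0)                 # L = a(theta - b)
    p = icc_2pl(theta, b=1.0, a=0.5)
    print(f"{theta:3d}  {logit:5.1f}  {p:.2f}")
```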
Item Characteristic Curve Models
• The Rasch, or One-Parameter, Logistic Model (1PL)

    • a = 1.0 for all items

        P(θ) = 1 / (1 + e^(-L)) = 1 / (1 + e^(-(θ - b)))

• Birnbaum (1968) Three-Parameter Model (3PL)

        The probability of correct response includes a small component that is
        due to guessing.

        P(θ) = c + (1 - c) / (1 + e^(-a(θ - b)))
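As a sketch of the 3PL form (again with an assumed function name), note that the curve now has a floor of c:

```python
import math

def icc_3pl(theta, b, a, c):
    """Three-parameter logistic ICC with a guessing floor of c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Illustrative values: at theta = b the probability is (1 + c)/2
print(icc_3pl(theta=1.0, b=1.0, a=1.5, c=0.2))   # 0.6 = (1 + .2)/2
```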
Item Characteristic Curve Models
• c is the guessing parameter: the probability of getting the item correct by
guessing alone

• The value of c does not vary as a function of the ability level

• Range of c: theoretically [0, 1]; practically [0, .35]

• The definition of the difficulty parameter is changed in the 3PL

• c defines a floor to P(θ)

• At θ = b, P(θ) = (1 + c)/2
Item Characteristic Curve Models
• Under the 3PL, the slope of the item characteristic curve at θ = b is
  a(1 - c)/4

        - the slope can still be taken as proportional to a
Item Characteristic Curve Models
          • Negative Discrimination

               • P(θ) decreases as θ increases

               • Can occur in two ways

                    • The incorrect response to a two-choice item will have a
                      negative discrimination if the correct response has a
                      positive one

                    • Something is wrong with the item

               • The two curves have the same b and |a|
Item Characteristic Curve Models
• Interpreting Item Parameter Values

•   a: (divide by 1.7 to interpret in the normal ogive model)

        Verbal label     Range of values
        none             0
        very low         .01 - .34
        low              .35 - .64
        moderate         .65 - 1.34
        high             1.35 - 1.69
        very high        > 1.70
        perfect          +infinity

    b: a difficult job, since 'easy' and 'hard' are relative terms
Item Characteristic Curve Models


• In classical test theory, ‘b’ was defined relative to a group of
examinees

• Same item could be easy for one group and hard for another group.

•Under IRT, ‘b’ is θ where P(θ ) is .5 for 1-PL & 2-PL models and (1 + c)/2
for a 3-PL model.

c is interpreted directly as a probability.
         c = .12 means that at all θ , P(θ ) by guessing alone is .12.




Item Characteristic Curve Models
• The verbal labels for difficulty are relative to the midpoint of the ability scale

• Item difficulty tells where the item functions on the ability scale

• The slope of the ICC ('a') is at its maximum at an ability level corresponding
to the item difficulty

• The item does best in distinguishing between examinees in the
neighborhood of this ability level

• An item whose difficulty is -1 functions among the lower-ability examinees

• A value of +1 denotes an item that functions among the higher-ability
examinees

• So item difficulty is a location parameter
Item Characteristic Curve Models
                                      RECAP

1. Under the 1-PL model, the slope is always the same; only      the
   location of the item changes.

2. Under the 2-PL & 3-PL models, the value of a must become quite large
   (>1.7) before the curve is very steep.

3. Under the 1-PL and 2-PL with a large positive value of b, the lower tail
   approaches zero. Under the 3-PL it approaches c.

4. c is not apparent when b < 0 and a < 1.0.

5. Under all models, curves with a negative value of a are the mirror
    image of curves with positive value of a.




Item Characteristic Curve Models
                                            RECAP

6.   When b = -3.0, only the upper half of the item characteristic curve
     appears on the graph. When b = +3.0, only the lower half of the curve
     appears on the graph.

7.   The slope of the item characteristic curve is the steepest at the ability
     level corresponding to the item difficulty. Thus, the difficulty
     parameter b locates the point on the ability scale where the item
     functions best.

8.   Under IRT, ‘b’ is θ where P(θ ) is .5 for 1-PL & 2-PL models and (1 +
     c)/2 for a 3-PL model . Only when c = 0 are these two definitions
     equivalent.




Estimating Item parameters
• IRT Analysis

       - M examinees, divided into J ability groups (θj), respond to the N items

       - mj examinees are within group j, where j = 1, 2, ..., J

       - In the jth group, rj examinees answer the item correctly, so the
         observed proportion correct is p(θj) = rj/mj
Estimating Item parameters
• Find the ICC that best fits the observed proportions of correct response.

• First, select a model for the curve to be fitted.

• Let's fit the 2PL (a chi-square goodness-of-fit index assesses the fit)

• The procedure is Maximum Likelihood Estimation (initial values b = 0, a = 1)
Estimating Item parameters

        Χ² = Σ(j=1..J) mj [p(θj) - P(θj)]² / [P(θj) Q(θj)] = 28.88,
        and the criterion value was 45.91

The two-parameter model with b = -.39 and a = 1.27 was a good fit to the
observed proportions of correct response.
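A short sketch of this fit computation (the grouped data below are hypothetical, invented only to show the mechanics; function names are my own):

```python
import math

def icc_2pl(theta, b, a):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def chi_square_fit(groups, b, a):
    """Chi-square agreement between observed group proportions and a fitted 2PL.
    groups: list of (theta_j, m_j, r_j) = ability level, group size, number correct."""
    x2 = 0.0
    for theta_j, m_j, r_j in groups:
        p_obs = r_j / m_j                 # observed proportion correct, p(theta_j)
        P = icc_2pl(theta_j, b, a)        # model proportion, P(theta_j)
        x2 += m_j * (p_obs - P) ** 2 / (P * (1.0 - P))
    return x2

groups = [(-2, 40, 6), (-1, 40, 14), (0, 40, 25), (1, 40, 33), (2, 40, 38)]
print(chi_square_fit(groups, b=-0.39, a=1.27))
```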
Estimating Item parameters
• The Group Invariance of Item Parameters

• Two groups

• Group 1 [ability -3 to -1]        Group 2 [ability +1 to +3]
Estimating Item parameters

• The Group Invariance of Item Parameters

• b(1) = b(2) = -.39 and a(1) = a(2) = 1.27

• Combined analysis




Estimating Item parameters
• Group invariance is a powerful feature of IRT

   • The values of the item parameters are a property of the item, not of the
     group that responded to the item.

   • Under classical test theory, just the opposite holds. The item difficulty of
     classical theory is the overall proportion of correct response.

   • An item with b = 0 may have a classical item difficulty index of .3 in a
     low-ability group and .8 in a high-ability group.

   • Clearly, the value of the classical item difficulty index is not group
     invariant.

   • Item difficulty has a consistent meaning in IRT
Estimating Item parameters
• Caution

       The obtained numerical values are subject to variation due to sample
       size, how well structured the data are, and the goodness-of-fit of the
       curve to the data.

       The item must be used to measure the same latent trait for both groups.

       An item's parameters do not retain group invariance when taken out of
       context, i.e., when used to measure a different latent trait or with
       examinees from a population for which the test is inappropriate.
Estimating Item parameters
                                       RECAP

1. The curve defined by the estimated item parameters was usually a good
   overall fit to the observed proportions of correct response.

2. When two groups are employed, the same item characteristic curve will be
   fitted, regardless of the range of ability

3. The number of examinees at each level does not affect the group-invariance
   property.

4. For the item of positive discrimination, the low-ability group involves the
   lower left tail of ICC, and the high-ability group involves the upper right tail.

5. The item parameters were group invariant whether or not the ability
   ranges of the two groups overlapped.




Estimating Item parameters
                                     RECAP


6. It made no difference which group was the high-ability group. Thus,
   group labeling is not a consideration.

7. The group-invariance principle holds for all three item characteristic curve
   models.

8. Item parameter estimates are subject to sampling variation.




Test Characteristic Curve
• An examinee's raw test score is obtained by adding up the item scores:

        TSj = Σ(i=1..N) Pi(θj)

• e.g., for a 4-item test: TS = .73 + .57 + .69 + .62 = 2.61

• The procedure is the same for all three models
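A small sketch of the TCC computation; the four (b, a) pairs below are hypothetical, and the function names are my own:

```python
import math

def icc_2pl(theta, b, a):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def true_score(theta, items):
    """Test characteristic curve: the sum of the item ICCs at ability theta."""
    return sum(icc_2pl(theta, b, a) for b, a in items)

items = [(-1.0, 1.0), (0.0, 1.2), (0.5, 0.8), (1.0, 1.0)]  # hypothetical (b, a)
for theta in (-3, -1, 0, 1, 3):
    print(theta, round(true_score(theta, items), 2))
```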
Test Characteristic Curve
• For the 1-PL and 2-PL, the left tail approaches zero and the right tail
  approaches N

• TS = 0 (1-PL & 2-PL) or Σ ci (3-PL) => θ = -∞
• TS = N => θ = +∞
Test Characteristic Curve
The TCC transforms ability scores to true scores.

The TCC is a monotonically increasing function (though shapes may differ:
S-shaped, or increase-plateau-increase).

In all cases, it will be asymptotic to a value of N in the upper tail.

The TCC depends upon a number of factors, including the number of items, the
ICC model employed, and the values of the item parameters.

Caution:
        The TCC (like the ICC) does not depend on the distribution of scores.
Test Characteristic Curve
Interpretation

      The ability level corresponding to TS = N/2 locates the test along the
      ability scale.

      There is no explicit formula for the TCC, so there are no parameters for
      the curve.
Test Characteristic Curve
                                   RECAP

1.    Relation of the true score and the ability level.

      a.      Given an ability level, the corresponding true score can be
              found via the test characteristic curve.

      b.      Given a true score, the corresponding ability level can be
              found via the test characteristic curve.

      c.      Both the true scores and ability are continuous variables.

2.    Shape of the test characteristic curve.

      a.      When N = 1, the true score ranges from 0 to 1 and TCC = ICC




Test Characteristic Curve
                         RECAP

 b.   TCC may not be similar to ICC (due to regions of varying
       steepness and plateaus). It reflects a mixture of
      item parameter values

 c.   The ability level at which the mid-true score (N/2) occurs
      is an indicator of where the test functions on the ability
      scale.

 d.   When the values of the item difficulties have a
      limited range, the steepness of the TCC depends primarily
      upon the average value of the item discrimination
      parameters.

 e.   When the values of the item difficulties are spread widely
      over the ability scale, the steepness of the TCC will be
      reduced.


Test Characteristic Curve
                                 RECAP

      f.      Under a three-parameter model, the lower limit of the true
              scores is Σ ci, the sum of the item guessing parameters.

      g.      The shape of the test characteristic curve depends upon the
              number of items, the ICC model and the mix of values of the
              item parameters.

3.    It would be possible to construct a test characteristic curve that
      decreases as ability increases. To do so would require items with
      negative discrimination for the correct response to the items. Such a
      test would not be considered a good test because the higher an
      examinee's ability level, the lower the score expected for the
      examinee.
Estimating Examinee’s Ability
• To locate that person on the ability scale (θ = ?)

    • Individual’s ability
    • Comparisons among individual’s ability

• The list of 1’s and 0’s for the N items is called the examinee’s item
response vector.

•Use this item response vector and the known item parameters to
estimate the examinee’s θ .

• A maximum likelihood procedure is used to estimate an examinee's ability.

         It begins with some a priori value for the ability of the examinee and
         the known values of the item parameters. These are used to compute
         the probability of correct response to each item for that examinee.
Estimating Examinee's Ability
• This procedure is based upon an approach that treats each examinee
separately

        θ(s+1) = θs + [ Σ(i=1..N) ai (ui - Pi(θs)) ] / [ Σ(i=1..N) ai² Pi(θs) Qi(θs) ]

    where:  θs is the estimated ability of the examinee within iteration s
            ai is the discrimination parameter of item i, i = 1, 2, ..., N
            ui is the response made by the examinee to item i: 1 or 0
            Pi(θs) is the probability of correct response to item i, under the
            given item characteristic curve model, at ability level θs within
            iteration s
            Qi(θs) = 1 - Pi(θs)
Estimating Examinee's Ability
• A three-item test

        Item    b       a
        1      -1.0    1.0
        2       0.0    1.2
        3       1.0     .8

• The examinee's item responses were:

        Item    Response
        1       1
        2       0
        3       1

• The a priori estimate of the examinee's ability is set to θs = 1.0
Estimating Examinee's Ability
First iteration:

        Item    u     P      Q      a(u-P)    a²PQ
        1       1    .88    .12     .119      .105
        2       0    .77    .23    -.922      .255
        3       1    .50    .50     .400      .160
                     sum           -.403      .520

        ∆θs = -.403/.520 = -.773,    θ(s+1) = 1.0 - .773 = .227

Second iteration:
        ∆θs = .066/.674 = .097,    θ(s+1) = .227 + .097 = .324

Third iteration:
        ∆θs = .0006/.6615 = .0009,    θ(s+1) = .324 + .0009 = .3249

At this point, the process is terminated because the value of the adjustment
(.0009) is very small. Thus, the examinee's estimated ability is .32.
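A minimal sketch of this iteration (all names are my own); it reproduces the worked example above and the standard error discussed on the next slide:

```python
import math

def icc_2pl(theta, b, a):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(items, u, theta=1.0, tol=0.001, max_iter=25):
    """Maximum likelihood (Newton-Raphson) ability estimation for 2PL items.
    items: list of (b, a); u: list of 0/1 responses; theta: a priori estimate."""
    den = 1.0
    for _ in range(max_iter):
        P = [icc_2pl(theta, b, a) for b, a in items]
        num = sum(a * (ui - Pi) for (b, a), ui, Pi in zip(items, u, P))
        den = sum(a * a * Pi * (1 - Pi) for (b, a), Pi in zip(items, P))
        delta = num / den                  # the adjustment to the estimate
        theta += delta
        if abs(delta) < tol:               # terminate when the adjustment is tiny
            break
    se = 1.0 / math.sqrt(den)              # standard error of the estimate
    return theta, se

items = [(-1.0, 1.0), (0.0, 1.2), (1.0, 0.8)]
theta, se = estimate_ability(items, u=[1, 0, 1])
print(round(theta, 2), round(se, 2))       # 0.32 and 1.23
```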
Estimating Examinee's Ability
Unfortunately, there is no way to know the examinee's actual ability parameter.
The best one can do is estimate it.

        SE(θ) = 1 / sqrt( Σ ai² Pi(θs) Qi(θs) )

        SE(θ) = 1/sqrt(.6615) = 1.23

The examinee's ability was not estimated very precisely because the standard
error is very large. This is primarily due to the fact that only three items were
used.
Estimating Examinee’s Ability
 Two cases where MLE procedure fails

       when an examinee answers none of the items correctly, the
       corresponding ability estimate is negative infinity.

       when an examinee answers all the items in the test correctly,
       the corresponding ability estimate is positive infinity.

In both of these cases it is impossible to obtain an ability estimate for
the examinee (the computer literally cannot compute a number as big
as infinity).




Estimating Examinee’s Ability
 Item Invariance of an Examinee’s Ability Estimate

       Different sets of items should yield the same ability estimate,
       within sampling variation. In each set, a different part of the
       ICC is involved


This principle rests upon two conditions:

        all the items measure the same underlying latent trait

       the values of all the item parameters are in a common metric.




Estimating Examinee’s Ability
Implication of Item Invariance of an Examinee’s Ability Estimate

        A test located anywhere along the ability scale can be used to
        estimate an examinee’s ability.

        An examinee could take a test that is “easy” or a test that is “hard”
        and obtain, on the average, the same estimated ability.

This is in sharp contrast to classical test theory, where such an examinee
would get a high test score on the easy test and vice versa

Under item response theory, the examinee’s ability is fixed and invariant
with respect to the items used to measure it.




Estimating Examinee’s Ability
Caution :

An examinee’s ability is fixed relative to a given context.


       However, if the examinee received remedial instruction
       between testing or if there were carryover (memorizing)
       effects, the examinee’s underlying ability level would be
       different for each testing.

The item invariance of an examinee’s ability and the group invariance
of an item’s parameters are two facets of what is referred to,
generically, as the invariance principle of item response theory.




Estimating Examinee’s Ability
                                   RECAP

1.    Distribution of estimated ability.

      a.      The standard error of the estimates can be quite large when
              the items are not located near the ability of the examinee.

      b.      When the values of the item discrimination indices are large,
              the standard error of the ability estimates is small. When the
              item discrimination indices are small, the standard error of
              the ability estimates is large.

      c.      The optimum set of items for estimating an examinee's
              ability would have all its item difficulties equal to the
              examinee's ability parameter and have items with large
              values for the item discrimination indices.




Estimating Examinee’s Ability
                                  RECAP

2.    Item invariance of the examinee’s ability.

      a.      The different sets of items yielded values of estimated
              ability that were near the examinee’s actual ability level.

      b.      The mean value of these estimates generally was a close
              approximation of the examinee’s ability parameter.

3.    Each examinee has an ability score (parameter value) that locates
      that person (estimated ability) on the scale.




The Information Function
The statistical meaning of information is credited to Sir R.A. Fisher, who
defined information as the precision with which a parameter could be
estimated: the reciprocal of the variance.

        I = 1 / σ²        where σ ~ SE(θ)
The Information Function
I(θ) is maximum at θ = -1.0 and is about 3 over the ability range -2 <= θ <= 0.
Within this range, ability is estimated with some precision.

I(θ) does not depend upon the distribution of examinees over the ability
scale.

Ideal I(θ) would be a horizontal line at some large value : hard to achieve

Precision with which an examinee’s ability is estimated depends upon where
the examinee’s ability is located on the ability scale. – Implication for test
construction




The Information Function
Item information function: Ii(θ), where i indexes the item.

It will be small, since only a single item is involved.
The Information Function
An item measures ability with greatest precision at the ability level
corresponding to the item's difficulty parameter.

The amount of item information decreases as the ability level departs from
the item difficulty and approaches zero at the extremes of the ability scale.

Test Information Function

        I(θ) = Σ(i=1..N) Ii(θ)
The Information Function
I(θ) is much higher than Ii(θ)

The longer the test, the higher I(θ) and the greater the precision in measuring
an examinee's ability, compared with shorter tests.

Thus, ability is estimated with some precision near the center of the ability
scale.

I(θ) tells how well the test is doing in estimating ability over the whole range
of ability scores.

While the ideal test information function often may be a horizontal line, it may
not be the best for a specific purpose.
The Information Function
For example, if you were interested in constructing a test to award
scholarships, this ideal might not be optimal. In this situation, you would like
to measure ability with considerable precision at ability levels near the ability
used to separate those who will receive the scholarship from those who do
not.

The best test information function in this case would have a peak at the
cutoff score.

Other specialized uses of tests could require other forms of the
test information function.




The Information Function
Information Functions

2PL:
        Ii(θ) = ai² Pi(θ) Qi(θ)
        Pi(θ) = 1/(1 + EXP(-ai(θ - bi)))
        Qi(θ) = 1 - Pi(θ)

1PL:
        Ii(θ) = Pi(θ) Qi(θ)
        Pi(θ) = 1/(1 + EXP(-(θ - bi)))
        Qi(θ) = 1 - Pi(θ)

3PL:
        Ii(θ) = ai² [Qi(θ) / Pi(θ)] [(Pi(θ) - c)² / (1 - c)²]
        Pi(θ) = c + (1 - c) (1/(1 + EXP(-ai(θ - bi))))
        Qi(θ) = 1 - Pi(θ)
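A sketch computing the 3PL item information (assumed function names); it reproduces the example on the next slide:

```python
import math

def icc_3pl(theta, b, a, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_info_3pl(theta, b, a, c):
    """3PL item information: a^2 (Q/P) ((P - c)/(1 - c))^2."""
    P = icc_3pl(theta, b, a, c)
    return a * a * ((1.0 - P) / P) * ((P - c) / (1.0 - c)) ** 2

# Example from the next slide: b = 1.0, a = 1.5, c = .2
for theta in range(-3, 4):
    print(theta, round(item_info_3pl(theta, 1.0, 1.5, 0.2), 3))
```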
The Information Function
Information Functions Example: 3PL (b = 1.0, a = 1.5, c = .2)

 θ      L      Pi(θ)   Qi(θ)   Qi(θ)/Pi(θ)   (Pi(θ) - c)²   Ii(θ)

-3    -6.0    .20      .80     3.950          .000          .000
-2    -4.5    .21      .79     3.785          .000          .001
-1    -3.0    .24      .76     3.202          .001          .016
 0    -1.5    .35      .65     1.890          .021          .142
 1     0.0    .60      .40      .667          .160          .375
 2     1.5    .85      .15      .171          .428          .257
 3     3.0    .96      .04      .040          .581          .082

The general level of the values for the amount of information is lower (because
of the presence of the terms (1 - c) and (Pi(θ) - c))

The maximum occurred at an ability level slightly higher than the value of b
The Information Function
When they share common values of a and b, the information functions will be the
  same when c= 0

When c > 0, the three-parameter model will always yield less information

Thus, the item information function under a two-parameter model defines the upper
   bound for the amount of information under a three-parameter model

This is reasonable, because getting the item correct by guessing should not
   enhance the precision with which an ability level is estimated.




The Information Function
Computing a Test Information Function: a five-item test under the 2PL

        Item         b         a

        1           -1.0      2.0
        2           -0.5      1.5
        3            0.0      1.5
        4            0.5      1.5
        5            1.0      2.0

  θ       1       2       3       4       5      Test Information

 -3     .071    .051    .024    .012    .001      .159
 -2     .420    .194    .102    .051    .010      .777
 -1    1.000    .490    .336    .194    .071     2.091
  0     .420    .490    .563    .490    .420     2.383
  1     .071    .194    .336    .490   1.000     2.091
  2     .010    .051    .102    .194    .420      .777
  3     .001    .012    .024    .051    .071      .159
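A sketch reproducing the test information column above by summing the 2PL item information functions (the function name is my own):

```python
import math

def item_info_2pl(theta, b, a):
    """2PL item information: a^2 P(theta) Q(theta)."""
    P = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * P * (1.0 - P)

items = [(-1.0, 2.0), (-0.5, 1.5), (0.0, 1.5), (0.5, 1.5), (1.0, 2.0)]
for theta in range(-3, 4):
    info = sum(item_info_2pl(theta, b, a) for b, a in items)
    print(theta, round(info, 3))   # peaks at 2.383 for theta = 0
```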
The Information Function
The five item discriminations had a symmetrical distribution around a value of 1.5

The five item difficulties had a symmetrical distribution about an ability level of zero

Because of this, the test information function also is symmetric about an ability of
   zero




The Information Function
        σ ~ SE(θ) = 1/sqrt(I(θ))

A peaked I(θ) measures ability with unequal precision along the ability scale.

   It is best for estimating the ability of examinees whose abilities fall near
   the peak of the test information function.

In some tests, I(θ) is rather flat over some region of the ability scale.

   Such a test is desirable for examinees whose ability falls in the given range.

The maximum amount of test information was 2.383 at an ability level of 0.
This translates into a standard error of .65.

   Roughly 68 percent of the estimates of this ability level fall between -.65
   and +.65 (the ability level is estimated with a modest amount of precision).
The Information Function
                                      RECAP

1. The general level of the test information function depends upon:

   a.    The number of items in the test.
   b.    The average value of the discrimination parameters of the
         test items.
   c.    Both of the above hold for all three item characteristic curve
         models.

2. The shape of the test information function depends upon:

   a.    The distribution of the item difficulties over the ability scale.
   b.    The distribution and the average value of the discrimination
         parameters of the test items.

3. When the item difficulties are clustered closely around a given value,
       the test information function is peaked at that point on the ability
       scale. The maximum amount of information depends upon the
       values of the discrimination parameters.
The Information Function
                                     RECAP

4. When the item difficulties are widely distributed over the ability scale, the
   test information function tends to be flatter than when the difficulties are
   tightly clustered.

5. Values of a < 1.0 result in a low general level of the amount of test
   information.

6. Values of a > 1.7 result in a high general level of the amount of test
   information.

7. Under a three-parameter model, values of the guessing parameter c
   greater than zero lower the amount of test information at the low-ability
   levels. In addition, large values of c reduce the general level of the
   amount of test information.

8. It is difficult to approximate a horizontal test information function. To do
   so, the values of b must be spread widely over the ability scale, and the
   values of a must be in the moderate to low range and have a U-shaped
   distribution.
Test Calibration
Test constructors know beforehand:
   • what trait they want the item to measure, and
   • the ability level of the examinees where the item is designed to function

But:
   • it is not possible to determine the values of the item's parameters a priori
   • when a test is administered, the latent trait of the examinees is not known

As a result, a major task is to determine the values of the item parameters and
examinee abilities in a metric for the underlying latent trait.

In IRT, this task is called test calibration.
Test Calibration
The Test Calibration Process (Birnbaum, 1968)

Two stages of maximum likelihood estimation

Stage 1: The parameters of the N items in the test are estimated

•        The estimated ability of each examinee is treated as if it is expressed in
         the true metric of the latent trait.

•        Then the parameters of each item in the test are estimated via the
         maximum likelihood procedure.

•        Each item's parameters are estimated one item at a time (independence
         assumption).
Test Calibration
The Test Calibration Process (Birnbaum, 1968)

Two stages of maximum likelihood estimation

Stage 2: The ability parameters of the M examinees are estimated

•        Taking these item parameter estimates, the ability of each examinee is
         estimated using the maximum likelihood procedure presented earlier.

•        The ability estimates are obtained one examinee at a time (independence
         assumption).

The two-stage process is repeated until some suitable convergence criterion is
met.

The overall effect is that the parameters of the N test items and the ability levels
of the M examinees have been estimated simultaneously.
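A compact numpy sketch of this two-stage paradigm for the Rasch model (one Newton step per stage per cycle; all names are my own, and the response matrix is assumed to already have extreme rows and columns removed):

```python
import numpy as np

def rasch_jml(U, max_iter=100, crit=0.01):
    """Birnbaum-style joint estimation sketch for the Rasch model.
    U: 0/1 response matrix (examinees x items)."""
    M, N = U.shape
    theta = np.zeros(M)                        # ability estimates
    b = np.zeros(N)                            # item difficulty estimates
    for _ in range(max_iter):
        b_old = b.copy()
        for i in range(N):                     # stage 1: item difficulties
            P = 1 / (1 + np.exp(-(theta - b[i])))
            b[i] -= np.sum(U[:, i] - P) / np.sum(P * (1 - P))
        b -= b.mean()                          # anchor the metric: mean difficulty 0
        for j in range(M):                     # stage 2: examinee abilities
            P = 1 / (1 + np.exp(-(theta[j] - b)))
            theta[j] += np.sum(U[j, :] - P) / np.sum(P * (1 - P))
        if np.abs(b - b_old).sum() < crit:     # convergence check on difficulties
            break
    return b, theta
```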
Test Calibration
•   It does not yield a unique metric for the ability scale.

•   The midpoint and the unit of measurement of the obtained ability scale
    are indeterminate

•   The metric is unique up to a linear transformation

•   It is necessary to “anchor” the metric via arbitrary rules for determining
    the midpoint and unit of measurement of the ability scale.

•   It is not possible to obtain estimates of the examinee’s ability and of the
    item’s parameters in the true metric of the underlying latent trait.

•   The best we can do is obtain a metric that depends upon a particular
    combination of examinees and test items.




Test Calibration
With three different ICC models to choose from, there are several different
customised ways to implement the Birnbaum paradigm.

You have to devise your own implementation.

Illustration: the BICAL program, as implemented by Benjamin Wright and his
co-workers, for the 1PL Rasch model

   A ten-item test administered to a group of 16 examinees
Test Calibration
ITEM RESPONSES

         1   2   3   4   5   6   7   8   9    10     RS
01               1                   1               2
02       1       1                                   2
03       1   1   1       1       1                   5
04       1   1   1       1                           4
05                       1                           1
06       1   1       1                               3
07       1                   1 1     1               4
08       1               1   1           1           4
09       1       1           1           1           4
10       1               1                       1   3
11       1   1     1     1   1 1     1    1      1   9
12       1   1   1 1     1   1 1     1    1          9
13       1   1   1       1     1                 1   6
14       1   1   1 1     1   1 1     1    1          9
15       1   1     1     1   1 1     1    1 1        9
16       1   1   1 1     1   1 1     1    1 1        10

Test Calibration
Examinee 16 answered all of the items correctly, so that record was removed from the data set

If item is answered correctly by all of the examinees or by none of the examinees,
     its item difficulty parameter cannot be estimated.


FREQUENCY COUNTS FOR EDITED DATA

SCORE                               ITEM                                Row
    1 2 3 4 5                       6 7 8 9 10                          Total
1                       1                                               1
2 1           2                               1                         4
3 2 1              1 1                             1                    6
4 4 1 2                        2 3 1 1 2                                16
5 1 1 1                        1         1                              5
6 1 1 1                        1         1              1               6
9 4 4 2 4 4                         4 4 4 4 2                           36
-----------------------------------------------------------------------------
COL 13 8 8 5 10 7 7 6 7 3                                               74

Test Calibration
•   Given the two frequency vectors (rows and columns), the estimation process
    can be implemented

•   Under the Rasch model, the anchoring procedure takes advantage of a = 1
         - The unit of measurement for estimated ability is set at 1
         - Each item's difficulty is estimated

•   To set the midpoint of the ability scale, the mean item difficulty is subtracted
    from the value of each item's difficulty estimate
         - The midpoint of the item difficulties becomes 0, and the same holds
           for the ability estimates

•   The ability estimate corresponding to each raw test score is obtained in the
    second stage using the rescaled item difficulties (as if they were the difficulty
    parameters) and the vector of row marginal totals
         - The metric is also changed to that of the rescaled ability parameters

•   The output of this stage is an ability estimate for each raw test score in the
    data set
Test Calibration
At this point, the convergence of the overall iterative process is checked.

   Wright summed the absolute differences between the values of the item
        difficulty parameter estimates for two successive iterations of the
        paradigm.

   If this sum was less than .01, the estimation process was terminated.

   If it was greater than .01, then another iteration was performed and the
          two stages were done again

Thus, the process of stage one, anchor the metric, stage two, and check for
   convergence is repeated until the criterion is met.

When this happens, the current values of the item and ability parameter estimates
  are accepted and an ability scale metric has been defined.




Test Calibration
ITEM PARAMETER ESTIMATES

Item     Difficulty
1        -2.37
2        -0.27
3        -0.27
4        +0.98
5        -1.00
6        +0.11
7        +0.11
8        +0.52
9        +0.11
10       +2.06

Because of the anchoring procedures used, these values are actually relative to
   the average item difficulty of the test for these examinees.

You can verify that the sum of the item difficulties is zero (within rounding errors)


Test Calibration
ABILITY ESTIMATION
Examinee      Obtained           Raw Score
1             -1.50                     2
2             -1.50                     2
3             +0.02                     5
4             -0.42                     4
5             -2.37                     1
6             -0.91                     3
7             -0.42                     4
8             -0.42                     4
9             -0.42                     4
10            -0.91                     3
11            +2.33                     9
12            +2.33                     9
13            +0.46                     6
14            +2.33                     9
15            +2.33                     9
16            *****                     10

All examinees with the same raw score obtained the same ability estimate
Test Calibration
•   This unique feature is a direct consequence of fixing a =1 for all items

         - intuitive appeal of Rasch model


•   When 2-PL or 3-PL ICC are used, an examinee’s ability estimate depends upon
    the particular pattern of item responses rather than the raw score.

         - Under these models, examinees with the same item response pattern
           will obtain the same ability estimate.

         - examinees with the same raw score could obtain different ability
           estimates if they answered different items correctly.




Test Calibration
Illustration:

Three different ten-item tests measuring the same latent trait will be used. A
   common group of 16 examinees will take all three of the tests.

The tests were created so that the average difficulty of the first test was
  matched to the mean ability of the common group of examinees.

The second test was created to be an easy test for this group.

The third test was created to be a hard test for this group. Each of these test-
  group combinations will be subjected to the Birnbaum paradigm and
  calibrated separately.

Each test calibration yields a unique metric for the ability scale.

Due to the anchoring process, all three test calibrations yielded a mean item
  difficulty of zero.


Test Calibration
Within each calibration, examinees with the same raw test score obtained the
   same estimated ability.
However, a given raw score will not yield the same estimated ability across the
   three calibrations.

The value of the mean estimated abilities is expressed relative to the mean item
   difficulty of the test.

Thus, the mean difficulty of the easy test should result in a positive mean ability.
   The mean ability on the hard test should have a negative value. The mean
   ability on the matched test should be near zero.

Test results can be placed on a common ability scale.




Test Calibration
Putting the Three Tests on a Common Ability Scale (Test Equating)

The principle of the item invariance of an examinee’s ability indicates that an
   examinee should obtain the same ability estimate regardless of the set of items
   used

However, in the three test calibrations done above, this did not hold.
The problem is not in the invariance principle, but in the test calibrations

The invariance principle assumes that the values of the item parameters of the
several sets of items are all expressed in the same ability-scale metric. In the
   present situation, there are three different ability scales, one from each of the
   calibrations.

The average difficulties of these tests were intended to be different, but the
   anchoring process forced each test to have a mean item difficulty of zero.



                                                                                     88
                                    The Basics of IRT                 September 9, 2009
Test Calibration
The mean estimated ability of the common group was .062 for the matched test, .444
   for the easy test, and -.111 for the hard test.

This tells us that the mean ability from the matched test is about what it should be.

The mean from the easy test tells us that the average ability is above the mean
   item difficulty of the test

The mean ability from the hard test is below the mean item difficulty.




                                                                                   89
                                   The Basics of IRT                September 9, 2009
Test Calibration
We can use the mean abilities to position the tests on a common scale.

But which test calibration should serve as the baseline?

The matched test is the natural choice: this calibration yielded a mean ability of
   .062 and a mean item difficulty of zero.

Because the Rasch model was used, the unit of measurement for all three
   calibrations is unity. Therefore, to bring the easy and hard test results to the
   baseline metric only involved adjusting for the differences in midpoints.




                                                                                    90
                                    The Basics of IRT                September 9, 2009
Test Calibration
Easy Test

The shift factor needed is the difference between the mean estimated ability of the
   common group on the easy test (.444) and on the matched test (.062),
which is .382.

To convert the values of the item difficulties for the easy test to baseline metric, one
   simply subtracts .382 from each item difficulty.

Similarly, each examinee’s ability can be expressed in the baseline metric by
   subtracting .382 from it.
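
As a sketch, the whole conversion to the baseline metric is a single subtraction
(the arrays are illustrative, back-computed from a few of the tabled values):

# Shift factor: mean ability of the common group on the easy test
# minus its mean ability on the matched (baseline) test.
shift = 0.444 - 0.062                    # = 0.382

easy_b = [-1.110, -1.110, -1.740]        # easy-test difficulties, original metric
easy_theta = [-2.518, -0.390]            # easy-test ability estimates

baseline_b = [b - shift for b in easy_b]          # -> -1.492, -1.492, -2.122
baseline_theta = [t - shift for t in easy_theta]  # -> -2.900, -0.772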




                                                                                     91
                                    The Basics of IRT                 September 9, 2009
Test Calibration
Hard Test

The hard test results can be expressed in the baseline metric by using the
  differences in mean ability. The shift factor is -.111 - .062 = -.173.

Again, subtracting this value from each of the item difficulty estimates puts
  them in the baseline metric.

The ability estimates of the common group yielded by the hard test can be
  transformed to the baseline metric of the matched test by using the same
  shift factor




                                                                              92
                                The Basics of IRT              September 9, 2009
Test Calibration
Item difficulties in the baseline metric

Item     Easy test     Matched test     Hard test

1         -1.492         -2.37           *****
2         -1.492          -.27           -.037
3         -2.122          -.27           -.497
4          -.182           .98           -.497
5          -.562         -1.00            .963
6           .178           .11           -.497
7           .528           .11            .383
8           .582           .521           .533
9           .880           .11            .443
10          .880          2.06           *****
Mean       -.285          0.00            .224




                                                                          93
                                  The Basics of IRT        September 9, 2009
Test Calibration
Ability estimates of the common group in the baseline metric

Examinee       Easy test       Matched test     Hard test

1               -2.900           -1.50           *****
2                -.772           -1.50           *****
3               -1.96              .02           *****
4                -.292            -.42           -.877
5                -.292           -2.37           -.877
6                 .168            -.91           *****
7                1.968            -.42          -1.637
8                 .168            -.42           -.877
9                 .638            -.42          -1.637
10                .638            -.91           -.877
11                .638            2.33            .153
12               1.188            2.33            .153
13               -.292             .46            .153
14               1.968            2.33           2.003
15               *****            2.33           1.213
16               *****           *****           2.003
Mean              .062             .062            .062
Std. Dev.        1.344            1.566           1.413
                                                                              94
                               The Basics of IRT               September 9, 2009
Test Calibration
After transformation:

•        The matched test has its mean item difficulty at the midpoint of the
         baseline ability scale.
•        The easy test has a negative mean item difficulty.
•        The hard test has a positive mean item difficulty.

•   The average difficulty of the easy and hard tests is about the same
    distance from the middle of the scale.

In technical terms we have "equated" the tests, i.e., put them on a common
   scale.




                                                                              95
                                 The Basics of IRT             September 9, 2009
Test Calibration
•   The mean estimated ability of the common group was the same for all
    three tests.

•   The standard deviations of the ability estimates were nearly the same for
    the easy and hard tests, and that for the matched test was “in the
    ballpark.”

•   Although the summary statistics were quite similar for all three sets of
    results, the ability estimates for a given examinee varied widely.

•   The invariance principle has not gone awry; what you are seeing is
    sampling variation.

Given the small size of the data sets, it is quite amazing that the results came
   out as nicely as they did.

This demonstrates rather clearly the powerful capabilities of the Rasch
   model and Birnbaum’s MLE paradigm.


                                                                              96
                                 The Basics of IRT             September 9, 2009
Test Calibration
•   After equating, the numerical values of the item parameters can be used
    to compare where different items function on the ability scale.

•   The examinees’ estimated abilities also are expressed in this metric and
    can be compared.

•   It is also possible to compute the test characteristic curve and the test
    information function for the easy and hard tests in the baseline metric.

•   Technically speaking, the tests were equated using the common group
    approach with tests of different difficulty.

•   The ease with which test equating can be accomplished is one of the
    major advantages of item response theory over classical test theory.




                                                                               97
                                 The Basics of IRT              September 9, 2009
Test Calibration
                                RECAP

1. The end product of the test calibration process is the definition of an
    ability scale metric.

2. Under the Rasch model, this scale has a unit of measurement of 1 and
   a midpoint of zero.

3. However, it is not the metric of the underlying latent trait. The
   obtained metric depends upon the item responses yielded by a
   particular combination of examinees and test items being subjected
   to the Birnbaum paradigm.

4. Since the true metric of the underlying latent trait cannot be
   determined, the metric yielded by the Birnbaum paradigm is used as
   if it were the true metric. The obtained item difficulty values and the
   examinee’s ability are interpreted in this metric. Thus, the test has
   been calibrated.


                                                                          98
                            The Basics of IRT              September 9, 2009
Test Calibration

5. The outcome of the test calibration procedure is to locate each
   examinee and item along the obtained ability scale.

6. In the present example, item 5 had a difficulty of -1 and examinee 10
   had an ability estimate of -.91. Therefore, the probability of examinee
   10 answering item 5 correctly is approximately .5:
   P = 1/(1 + e^-(-.91 - (-1.0))) = 1/(1 + e^-.09) ≈ .52.

7. The capability to locate items and examinees along a
    common scale is a powerful feature of item response theory:
    a single, consistent framework.




                                                                          99
                            The Basics of IRT              September 9, 2009
Specifying the characteristics
    of a test
During this transitional period in testing practices, many tests have been
  designed and constructed using classical test theory principles but have
  been analyzed via item response theory procedures.

This lack of congruence between the construction and analysis procedures
  has kept the full power of item response theory from being exploited.

In order to obtain the many advantages of item response theory, tests
   should be designed, constructed, analyzed, and interpreted within the
   framework of the theory.




                                                                          100
                               The Basics of IRT            September 9, 2009
Specifying the characteristics
     of a test
•   Under item response theory, a well-defined set of procedures (item
    banking) is used to establish and maintain item pools.

•   Item parameters are expressed in a known ability-scale metric.

•   It is possible to select items from the item pool and determine the major
    technical characteristics of a test before it is administered.

•   If the test characteristics do not meet the design goals, selected items
    can be replaced by other items from the item pool until the desired
    characteristics are obtained.

•   In this way, considerable time and money that would ordinarily be devoted
    to piloting the test are saved.




                                                                             101
                                 The Basics of IRT             September 9, 2009
Specifying the characteristics
     of a test
•   First, define the latent trait the items are to measure, write items to
    measure this trait, and pilot test the items to weed out poor ones.

•   This large set of items is then administered to a large group of examinees.

•   An item characteristic curve model is selected, the examinees’ item
    response data are analyzed via the Birnbaum paradigm, and the test is
    calibrated.

•   The ability scale resulting from this calibration is considered to be the
    baseline metric of the item pool.

•   From a test construction point of view, we now have a set of items whose
    item parameter values are known; in technical terms, a “precalibrated
    item pool” exists.




                                                                                102
                                  The Basics of IRT               September 9, 2009
Specifying the characteristics
     of a test
              Developing a Test From a Precalibrated Item Pool

Since the items in the precalibrated item pool measure a specific latent trait,
tests constructed from it will also measure this trait.

Alternate forms are routinely needed to maintain test security, and special
   versions of the test can be used to award scholarships.

In such cases, items would be selected from the item pool on the basis of
    their content and their technical characteristics to meet the particular
    testing goals.

The advantage of having a precalibrated item pool is that the parameter
  values of the items included in the test can be used to compute the test
  characteristic curve and the test information function before the test is
  administered.


                                                                             103
                                 The Basics of IRT             September 9, 2009
Specifying the characteristics
    of a test
This is possible because neither of these curves depends upon the
   distribution of examinee ability scores over the ability scale.

Given these two curves, the test constructor has a very good idea of how the
   test will perform before it is given to a group of examinees.

In addition, when the test has been administered and calibrated, test
   equating procedures can be used to express the ability estimates of the
   new group of examinees in the metric of the item pool.




                                                                          104
                               The Basics of IRT            September 9, 2009
Specifying the characteristics
     of a test
Screening tests

to distinguish rather sharply between examinees whose abilities are just
   below a given ability level and those who are at or above that level.

   - used to assign scholarships
   - and to assign students to specific instructional programs

Broad-ranged tests

to measure ability over a wide range of underlying ability scale.

   - Tests measuring reading or mathematics are typically broad-range tests.




                                                                             105
                                The Basics of IRT              September 9, 2009
Specifying the characteristics
      of a test
Peaked tests

to measure ability well in a range of ability that is wider than that of a
   screening test, but not as wide as that of a broad-range test.

Some Ground Rules

a. It is assumed that the items would be selected on the basis of
   content as well as parameter values.
b. No two items in the item pool possess exactly the same combination
   of item parameter values.
c. The item parameter values are subject to the following constraints:
   -3 ≤ b ≤ +3,   .50 ≤ a < 2.00,   0 ≤ c ≤ .35
The values of the discrimination parameter have been restricted to reflect the
   range of values usually seen in well-maintained item pools.




                                                                                 106
                                   The Basics of IRT               September 9, 2009
Specifying the characteristics
      of a test
Example Case

You are to construct a ten-item screening test that will separate examinees
  into two groups: those who need remedial instruction and those who
  don’t, on the ability measured by the items in the item pool. Students
  whose ability falls below a value of -1 will receive the instruction.

Solution:

Item 1:  b = -1.8, a = 1.2        Item 2:  b = -1.6, a = 1.4
Item 3:  b = -1.4, a = 1.1        Item 4:  b = -1.2, a = 1.3
Item 5:  b = -1.0, a = 1.5        Item 6:  b =  -.8, a = 1.0
Item 7:  b =  -.6, a = 1.4        Item 8:  b =  -.4, a = 1.2
Item 9:  b =  -.2, a = 1.1        Item 10: b =  0.0, a = 1.3

The logic underlying these choices was one of centering the
difficulties on the cutoff level of -1 and using moderate values of
    discrimination.
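
A minimal sketch (Python; logistic 2-PL with c = 0 and no D scaling constant assumed)
of how the TCC and TIF for these ten items can be previewed before administration:

import math

items = [(-1.8, 1.2), (-1.6, 1.4), (-1.4, 1.1), (-1.2, 1.3), (-1.0, 1.5),
         (-0.8, 1.0), (-0.6, 1.4), (-0.4, 1.2), (-0.2, 1.1), (0.0, 1.3)]  # (b, a)

def tcc_and_tif(theta):
    true_score, info = 0.0, 0.0
    for b, a in items:
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))  # 2-PL ICC
        true_score += p              # TCC = sum of the ICCs
        info += a * a * p * (1 - p)  # TIF = sum of the item informations
    return true_score, info

print(tcc_and_tif(-1.0))  # true score near the mid-true score of 5;
                          # information near its peak at the cutoff of -1.0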

                                                                                 107
                                  The Basics of IRT                September 9, 2009
Specifying the characteristics
     of a test




The mid-true score corresponded to an ability level of -1.0.

The test characteristic curve was not particularly steep at the cutoff level,
  indicating that the test lacked discrimination.

The peak of the information function occurred at an ability level of -1.0, but
  the maximum was a bit small.

The results suggest that the test was
properly positioned on the ability scale but that a better set of items could be
   found.
                                                                              108
                                 The Basics of IRT              September 9, 2009
Specifying the characteristics
     of a test
The following changes would improve the test’s characteristics:

First, cluster the values of the item difficulties nearer the cutoff level;

second, use larger values of the discrimination parameters.

These two changes should steepen the test characteristic curve and
increase the maximum amount of information at the ability level of
-1.0.




                                                                                109
                                  The Basics of IRT               September 9, 2009
Specifying the characteristics
     of a test
                                    RECAP

1. Screening tests.

1.      The desired test characteristic curve has the mid-true score
        at the specified cutoff ability level. The curve should be as
        steep as possible at that ability level.

2.      The test information function should be peaked, with its
        maximum at the cutoff ability level.

3.      The optimal case is where all item difficulties are at the cutoff point
        and the item discriminations are large.

4.      Select items that yield the maximum amount of information at the
        cutoff point.


                                                                              110
                                The Basics of IRT               September 9, 2009
Specifying the characteristics
     of a test
2. Broad-range tests.
1.     The desired test characteristic curve has its mid-true score
       at an ability level corresponding to the midpoint of the range
       of ability of interest.

2.      Most often this is an ability level of zero.

3.      The test characteristic curve should be linear for most of its range.

4.      The desired test information function is horizontal over the
        widest possible range.

5.       The maximum amount of information should be as large as
         possible.

6.      The values of the item difficulty parameters should be spread
        uniformly over the ability scale and as widely as practical.

                                                                             111
                                 The Basics of IRT             September 9, 2009
Specifying the characteristics
     of a test
3. Peaked tests.

1. The desired TCC has its mid-true score in the middle of the ability range
   of interest. The curve should have a moderate slope at that ability level.

2. The test information function should be rounded in appearance over the
   ability range of most interest.

3. The item difficulties should be clustered around the midpoint of the ability
   range of interest, but not as tightly as in the case of a screening test.

4. The values of the discrimination parameters should be large.

5. Items whose difficulties are within the ability range of interest should
   have larger discrimination values than other items.




                                                                             112
                                The Basics of IRT              September 9, 2009
Specifying the characteristics
     of a test
4. Role of item characteristic curve models.

1. With 'a' fixed at 1.0, the Rasch model has a limit placed upon the
   maximum amount of information that can be obtained.

2. The maximum amount of item information is .25, since Pi(θ)Qi(θ) = .25
   when Pi(θ) = .5. The maximum amount of information for a test under the
   Rasch model is therefore .25 times the number of items (a worked check
   follows this list).

3. Due to the presence of the guessing parameter, a 3-PL model will yield a
   more linear test characteristic curve and a test information function with a
   lower general level than a 2-PL model with the same 'a' and 'b' values.

4. The information function under a 2-PL model is the upper bound for the
   information function under a 3-PL model when the values of 'b' and 'a' are
   the same.
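
A worked check of point 2: for a single Rasch item, I(θ) = Pi(θ)Qi(θ), and
P(1 - P) reaches its maximum of .25 at P = .5. A ten-item Rasch test can
therefore yield at most 10 × .25 = 2.5 units of test information, so the
standard error of ability can never fall below 1/√2.5 ≈ .63.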


                                                                             113
                                The Basics of IRT              September 9, 2009
Specifying the characteristics
     of a test
5. Role of the number of items.

Increasing the number of items has little impact upon the general form of the
   TCC if the distribution of the sets of item parameters remains the same.

Increasing the number of items in a test has a significant impact upon the
   general level of the test information function (TIF).

Select items having high values of the ‘a’ and a distribution of item
   difficulties consistent with the testing goals.

Pairing of item parameters is vital, e.g., a highly discriminating item whose
   difficulty is not in the range of interest does little for the test
   information function or the slope of the TCC.

It is important to examine each item's ICC and item information function (IIF)
    to ascertain the item's contribution to the TCC and to the TIF.
                                                                               114
                                  The Basics of IRT              September 9, 2009
Computer Adaptive Test
•   The computer can update the estimate of the examinee's ability after
    each item; the updated estimate is then used to select the next item.

•   With the right item bank and a high examinee ability variance, CAT
    can be much more efficient than a traditional paper-and-pencil test.

•   Paper-and-pencil tests

     •   "fixed-item" tests

     •   Everyone takes every item

•   Items that are far too easy or too hard for an examinee act like constants
    added to the score and provide little information about the examinee's
    ability level.

     •   Large numbers of items and examinees are needed to obtain a
         modest degree of precision.

                                                                            115
                               The Basics of IRT              September 9, 2009
Computer Adaptive Test
Computer adaptive tests

•   the examinee's ability level relative to a norm group can be iteratively
    estimated.

•   items can be selected based on the current ability estimate.

•    Examinees can be given the items that maximize the information
    (within constraints) about their ability levels from the item responses.

•   Examinees will receive few items that are very easy or very hard for
    them (such items yield little information).

•   Reduced standard errors (SE = 1/√I) and greater precision are achieved
    with only a handful of properly selected items.




                                                                           116
                              The Basics of IRT              September 9, 2009
Computer Adaptive Test
The CAT algorithm is usually an iterative process

Step 1: All the items that have not yet been administered are evaluated to
   determine which will be the best one to administer next given the
   currently estimated ability level

Step 2: The "best" next item (providing the most information) is
   administered and the examinee responds

Step 3: A new ability estimate is computed based on the responses to all
   of the administered items.

Steps 1 through 3 are repeated until a stopping criterion is met.

Similar to Newton-Raphson iterative method for solving equations
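
A minimal sketch of this loop under the Rasch model (maximum-information
selection; the function names and the stopping rule are illustrative):

import math

def next_item(theta, pool, used):
    # Step 1: under the Rasch model, item information P*Q is largest when
    # the difficulty b is closest to the current ability estimate.
    candidates = [i for i in range(len(pool)) if i not in used]
    return min(candidates, key=lambda i: abs(pool[i] - theta))

def update_theta(theta, b_administered, responses):
    # Step 3: one Newton-Raphson step toward the ML ability estimate.
    p = [1.0 / (1.0 + math.exp(-(theta - b))) for b in b_administered]
    info = sum(pi * (1 - pi) for pi in p)
    return theta + (sum(responses) - sum(p)) / info, info

# Steps 1-3 repeat until a stopping criterion is met, e.g. until the
# standard error 1 / sqrt(info) falls below a target value.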




                                                                          117
                             The Basics of IRT              September 9, 2009
Computer Adaptive Test
                     Reliability and standard error

• In classical measurement, with a test reliability of 0.90, the
  standard error of measurement for the test is about .33 of the
  standard deviation of examinee test scores, since
  SEM = SD × √(1 - 0.90) ≈ .32 SD.

• In item response theory-based measurement, when ability
  scores are scaled to a mean of zero and a standard deviation of
  one (which is common), this level of reliability corresponds to a
  standard error of about .33 and test information of about 10,
  since SE = 1/√I, so I = 1/SE² ≈ 10.

• Thus, it is common in practice to design CATs so that the standard
  errors are about .33 or smaller (or, correspondingly, so that test
  information exceeds 10).




                                                                       118
                            The Basics of IRT            September 9, 2009
Computer Adaptive Test
                     Potential of computer adaptive tests

• Tests are given "on demand" and scores are available immediately.

• Neither answer sheets nor trained test administrators are needed.

• Test administrator differences are eliminated as a factor in
  measurement error.

• Tests are individually paced.

• Test security may be increased.

• Computerized testing offers a number of options for timing and
  formatting.


                                                                           119
                                  The Basics of IRT          September 9, 2009
Computer Adaptive Test
• CATs can reduce testing time by more than 50% while maintaining the
  same level of reliability. Shorter testing times also reduce fatigue, a
  factor that can significantly affect an examinee's test results.

• CATs can provide accurate scores over a wide range of abilities while
  traditional tests are usually most accurate for average examinees.

   CAT and IRT are a powerful combination




                                                                          120
                               The Basics of IRT            September 9, 2009
Computer Adaptive Test
                             Limitations of CAT

• CATs are not applicable for all subjects and skills.

• Most CATs are based on an IRT model, yet IRT is not applicable to all
  skills and item types.

• Hardware limitations

• Items involving detailed art work and graphs or extensive reading
  passages, for example, may be hard to present.

• CATs require careful item calibration.

• CATs are only manageable if a facility has enough computers for a large
  number of examinees and the examinees are at least partially
  computer-literate. This can be a big limitation.
                                                                         121
                                 The Basics of IRT         September 9, 2009
Computer Adaptive Test
The test administration procedures are different. This may cause problems
  for some examinees.

With each examinee receiving a different set of questions, there can be
   perceived inequities.

Examinees are not usually permitted to go back and change answers.




                                                                          122
                               The Basics of IRT            September 9, 2009
References
Baker, F.B. The Basics of Item Response Theory. ERIC Clearinghouse on Assessment and
    Evaluation, 2001.

Birnbaum, A. "Some latent trait models and their use in inferring an examinee's ability."
    Part 5 in F.M. Lord and M.R. Novick, Statistical Theories of Mental Test Scores.
    Reading, MA: Addison-Wesley, 1968.

Hambleton, R.K., and Swaminathan, H. Item Response Theory: Principles and Applications.
    Hingham, MA: Kluwer-Nijhoff, 1984.

Hulin, C.L., Drasgow, F., and Parsons, C.K. Item Response Theory: Application to Psychological
    Measurement. Homewood, IL: Dow Jones-Irwin, 1983.

Lord, F.M. Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ:
    Erlbaum, 1980.

Mislevy, R.J., and Bock, R.D. PC-BILOG 3: Item Analysis and Test Scoring with Binary Logistic
    Models. Mooresville, IN: Scientific Software, Inc., 1986.

Wright, B.D., and Mead, R.J. BICAL: Calibrating Items with the Rasch Model. Research
    Memorandum No. 23. Statistical Laboratory, Department of Education, University of Chicago,
    1976.

Wright, B.D., and Stone, M.A. Best Test Design. Chicago: MESA Press, 1979.

NCME Instructional Modules


                                                                                                123
                                           The Basics of IRT                      September 9, 2009
Thank You

                                    124
  The Basics of IRT   September 9, 2009
What is IRT? – DIF/DTF
• Classical test theory methods confound "bias" with true mean
differences; IRT does not. In IRT terminology, item/test bias is
referred to as DIF/DTF (Differential Item/Test Functioning).

• DIF refers to a difference in the probability of endorsing an item for
members of a reference group (e.g., US workers) and a focal group
(e.g., Chinese workers) having the same standing on theta.

• DTF refers to a difference in the test characteristic curves, obtained
by summing the item response functions for each group.

• DTF is perhaps more important for selection because decisions are
made based on test scores, not individual item responses.

• If DIF is detected, IRT can control for item bias when estimating
scores.
                                                                        125
                              The Basics of IRT           September 9, 2009
What is IRT? – DIF Examples
[Figure: Uniform DIF against the focal group. The reference group's ICC lies
above the focal group's ICC at all levels of theta, so the reference group is
favored at all levels.]

[Figure: Nonuniform (crossing) DIF. The two ICCs cross, so the focal group is
favored at low theta and the reference group is favored at high theta.]

                                                                                                                                             126
                                                                                      The Basics of IRT                        September 9, 2009
What is IRT? – DIF Detection
• DIF
  – Parametric
     • Lord's Chi-Square
     • Likelihood Ratio Test
     • Signed and Unsigned Area Methods (sketched below)
  – Nonparametric
     • SIBTEST
     • Mantel-Haenszel
• DTF
  – Parametric
     • Raju's DFIT Method
  – Nonparametric
     • SIBTEST
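
As a sketch of the area methods listed above, the signed and unsigned areas
between the reference and focal ICCs can be approximated numerically (the 2-PL
parameters here are hypothetical):

import math

def icc(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def area_dif(ref, foc, lo=-4.0, hi=4.0, n=800):
    step = (hi - lo) / n
    signed = unsigned = 0.0
    for k in range(n):
        t = lo + k * step
        gap = icc(t, *ref) - icc(t, *foc)
        signed += gap * step         # cancels out for crossing (nonuniform) DIF
        unsigned += abs(gap) * step  # still flags crossing DIF
    return signed, unsigned

print(area_dif(ref=(1.2, 0.0), foc=(1.2, 0.5)))  # uniform DIF: focal item harder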

                                                              127
                            The Basics of IRT   September 9, 2009
What is IRT ? - Some simple models
Let θj represent the latent (factor) score for individual j. Let πij be
the probability that individual j responds correctly to item i.
Then a simple item response model is

        logit(πij) = ai + bi θj

This is just a 'binary response' factor analysis model.




                                                                          128
                                The Basics of IRT           September 9, 2009
What is IRT ? Some simple models ?
 Lawley (1944) really started it off

 Lord (1980) promoted the term ‘item response theory’ as opposed to
 ‘classical item analysis’

 Technical elaborations include:

         ‘parameters’ for ‘guessing’
         Partial credit (degrees of correctness) responses
         Multidimensional models

 BUT the ‘workhorse’ is still the Lord model (with the factor assumed
 to be a random rather than fixed variable), as follows:




                                                                      129
                                The Basics of IRT       September 9, 2009
What is IRT ? - Classical to IRT

• Classical test theory is really an item response model
  (IRM) –

A reasonable (consistent) estimate of θj (a random variable) is given by
  the 'raw score', i.e., the percentage (or total) of correct
  items.
• A somewhat more efficient estimate is given by
  a weighted percentage, using the bi as weights.
• The Lord model is simply:

        logit(πij) = ai + bi θj
                                                        130
                     The Basics of IRT    September 9, 2009
What is IRT? Item response Relationship
The sigmoid function is a type of logistic function. The general form of the
logistic function is

        f(x) = a(1 + m e^(-x/τ)) / (1 + n e^(-x/τ))

The special case of the logistic function with a = 1, m = 0, n = 1, τ = 1 is

        P(t) = 1 / (1 + e^(-t))

The logistic function is the inverse of the natural logit function and so
can be used to convert the logarithm of odds into a probability;
the conversion from the log-likelihood ratio of two alternatives also takes
the form of a sigmoid curve.

For a single item in a test:

        P(θ) = 1 / (1 + e^(-(ai + bi θ)))
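
A two-line check of the logit-to-probability conversion, using only the
standard library:

import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))  # inverse of the logit

log_odds = math.log(0.8 / 0.2)  # logit of p = .8
print(sigmoid(log_odds))        # recovers 0.8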




                                                                                      131
                                     The Basics of IRT           September 9, 2009
What is IRT? Rasch Model

        logit(πij) = ai + θj

Here the 'discrimination' bi is assumed to be the same for each item.

The resulting (maximum likelihood) factor score estimates are then a 1 –
  1 transformation of the raw scores.

The Rasch model is a special case and will often not fit the data very well.

We can add further predictors, for example social background, that may
  mediate the relationships.

Item response practitioners go further:

   If we assume that the item parameters ( ai , bi ) are the same across
       populations, and, for example, tests, then we can form common
       scales for different populations and different tests.
                                                                         132
                               The Basics of IRT           September 9, 2009
What is IRT? Multidimensional Model
We can generalise our logistic model as follows:

        logit(πij) = ai + b1i θ1j + b2i θ2j
Adding a further factor allows an individual to be characterised by
two underlying traits.

A sensible analysis will explore the dimensionality structure of a set
of item responses

Assumptions are needed, for example that factors are independent,
or alternatively that they are correlated but each item has a non-zero
coefficient (loading) on only 1 factor – or an intermediate assumption.

What are the consequences of a more complex structure?

                                                                       133
                             The Basics of IRT           September 9, 2009
What is IRT? Multidimensional Model
It allows a more faithful representation of multi-faceted achievement.

It allows the (multidimensional) structure of achievement to be compared
    among groups or populations, in the following ways:

  The correlations between factors can vary
  The values of loadings can vary
  The factor scores can be allowed to depend on further variables such
  as gender, and the resulting 'regressions' may vary. For example:

        θ1j = β0 + β1 (gender)j + uj
With extensions to multilevel modelling etc. – a 'structural equation
  model'.



                                                                         134
                                The Basics of IRT          September 9, 2009
Partial Credit Model Variation
  Central line = scale at interval level of measurement

     Object          |           Item 1      Item 2

    Person A   >>> |        << Item 1.3

                     |      << Item 1.2 <<<<Item 2.2

                     |

    Person B   >>> |        << <<<<<<        <<<<Item 2.1

                     |      << Item 1.1




                                                                     135
                         The Basics of IRT             September 9, 2009
Partial Credit Model Variation
Items :
          Item 1.3 refers to Item 1 Category 3
          Item 1.2 refers to Item 1 Category 2 and so on.
          The difficulty associated with Category 3 of Item 1 is greater than the
          difficulty associated with Category 2 of Item 1, and so on (ordered
          categories)
          The location of Item 1.3 on the scale indicates the ability associated with a
          50% probability of passing Category 3 of Item 1 (or of any of the lower
          categories).

Persons :

          The location of person A on the scale indicates his ability.
          The probability of Person A passing categories at a lower level of difficulty
          is more than 50%.
          The probability of Person A passing categories at a higher level of
          difficulty is less than 50%.
          The probability of Person A passing categories at a level of difficulty that
          is the same as his ability is 50%.
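
A minimal sketch of these category probabilities under the partial credit
model; the step difficulties (the deltas) are illustrative stand-ins for the
ordered categories Item 1.1 to 1.3 above:

import math

def pcm_probs(theta, deltas):
    # Category x accumulates the sum over k <= x of (theta - delta_k);
    # category 0 contributes an empty sum of 0.
    scores = [0.0]
    for d in deltas:
        scores.append(scores[-1] + (theta - d))
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(pcm_probs(theta=0.0, deltas=[-1.0, 0.0, 1.0]))  # P(category 0..3)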

                                                                                   136
                                    The Basics of IRT                September 9, 2009

More Related Content

What's hot

Item and Distracter Analysis
Item and Distracter AnalysisItem and Distracter Analysis
Item and Distracter AnalysisSue Quirante
 
Classical Test Theory and Item Response Theory
Classical Test Theory and Item Response TheoryClassical Test Theory and Item Response Theory
Classical Test Theory and Item Response Theorysaira kazim
 
Characteristics of a good test
Characteristics of a good testCharacteristics of a good test
Characteristics of a good testcyrilcoscos
 
Item Response Theory in Constructing Measures
Item Response Theory in Constructing MeasuresItem Response Theory in Constructing Measures
Item Response Theory in Constructing MeasuresCarlo Magno
 
The application of irt using the rasch model presnetation1
The application of irt using the rasch model presnetation1The application of irt using the rasch model presnetation1
The application of irt using the rasch model presnetation1Carlo Magno
 
Presentation Validity & Reliability
Presentation Validity & ReliabilityPresentation Validity & Reliability
Presentation Validity & Reliabilitysongoten77
 
Qualities of a Good Test
Qualities of a Good TestQualities of a Good Test
Qualities of a Good TestDrSindhuAlmas
 
Validity and reliability in assessment.
Validity and reliability in assessment. Validity and reliability in assessment.
Validity and reliability in assessment. Tarek Tawfik Amin
 
Topic 8b Item Analysis
Topic 8b Item AnalysisTopic 8b Item Analysis
Topic 8b Item AnalysisYee Bee Choo
 
Stanford binet-5pptx (1)
Stanford binet-5pptx (1)Stanford binet-5pptx (1)
Stanford binet-5pptx (1)AtoZAll1
 
Types of Scores & Types of Standard Scores
Types of Scores & Types of Standard ScoresTypes of Scores & Types of Standard Scores
Types of Scores & Types of Standard ScoresCRISALDO CORDURA
 
A visual guide to item response theory
A visual guide to item response theoryA visual guide to item response theory
A visual guide to item response theoryahmad rustam
 
Reliability and validity ppt
Reliability and validity pptReliability and validity ppt
Reliability and validity pptsurendra poudel
 
Test Reliability and Validity
Test Reliability and ValidityTest Reliability and Validity
Test Reliability and ValidityBrian Ebie
 

What's hot (20)

Item and Distracter Analysis
Item and Distracter AnalysisItem and Distracter Analysis
Item and Distracter Analysis
 
Item Response Theory
Item Response TheoryItem Response Theory
Item Response Theory
 
Classical Test Theory and Item Response Theory
Classical Test Theory and Item Response TheoryClassical Test Theory and Item Response Theory
Classical Test Theory and Item Response Theory
 
Characteristics of a good test
Characteristics of a good testCharacteristics of a good test
Characteristics of a good test
 
Item Response Theory in Constructing Measures
Item Response Theory in Constructing MeasuresItem Response Theory in Constructing Measures
Item Response Theory in Constructing Measures
 
The application of irt using the rasch model presnetation1
The application of irt using the rasch model presnetation1The application of irt using the rasch model presnetation1
The application of irt using the rasch model presnetation1
 
Item Analysis
Item AnalysisItem Analysis
Item Analysis
 
Presentation Validity & Reliability
Presentation Validity & ReliabilityPresentation Validity & Reliability
Presentation Validity & Reliability
 
Qualities of a Good Test
Qualities of a Good TestQualities of a Good Test
Qualities of a Good Test
 
Item analysis
Item analysisItem analysis
Item analysis
 
Validity and reliability in assessment.
Validity and reliability in assessment. Validity and reliability in assessment.
Validity and reliability in assessment.
 
Topic 8b Item Analysis
Topic 8b Item AnalysisTopic 8b Item Analysis
Topic 8b Item Analysis
 
Test validity
Test validityTest validity
Test validity
 
Stanford binet-5pptx (1)
Stanford binet-5pptx (1)Stanford binet-5pptx (1)
Stanford binet-5pptx (1)
 
Types of Scores & Types of Standard Scores
Types of Scores & Types of Standard ScoresTypes of Scores & Types of Standard Scores
Types of Scores & Types of Standard Scores
 
A visual guide to item response theory
A visual guide to item response theoryA visual guide to item response theory
A visual guide to item response theory
 
Item analysis
Item analysisItem analysis
Item analysis
 
Reliability and validity ppt
Reliability and validity pptReliability and validity ppt
Reliability and validity ppt
 
Validity
ValidityValidity
Validity
 
Test Reliability and Validity
Test Reliability and ValidityTest Reliability and Validity
Test Reliability and Validity
 

More from Ajay Dhamija

fm3-05-301 (copy).pdf
fm3-05-301 (copy).pdffm3-05-301 (copy).pdf
fm3-05-301 (copy).pdfAjay Dhamija
 
Ethical hacking & Information Security
Ethical hacking & Information SecurityEthical hacking & Information Security
Ethical hacking & Information SecurityAjay Dhamija
 
Karmarkar's Algorithm For Linear Programming Problem
Karmarkar's Algorithm For Linear Programming ProblemKarmarkar's Algorithm For Linear Programming Problem
Karmarkar's Algorithm For Linear Programming ProblemAjay Dhamija
 
Verizon - A Case Study
Verizon - A Case StudyVerizon - A Case Study
Verizon - A Case StudyAjay Dhamija
 
Dabur India Ltd - A Case Study
Dabur India Ltd  - A Case StudyDabur India Ltd  - A Case Study
Dabur India Ltd - A Case StudyAjay Dhamija
 
Non Banking Financial Company
Non Banking Financial CompanyNon Banking Financial Company
Non Banking Financial CompanyAjay Dhamija
 
The Financial Sector Reforms in India
The Financial Sector Reforms in IndiaThe Financial Sector Reforms in India
The Financial Sector Reforms in IndiaAjay Dhamija
 
Hosting Inviting Introduction Guest Relations
Hosting Inviting Introduction Guest RelationsHosting Inviting Introduction Guest Relations
Hosting Inviting Introduction Guest RelationsAjay Dhamija
 
Global Fiancial Meltdown of 2007
Global Fiancial Meltdown of 2007Global Fiancial Meltdown of 2007
Global Fiancial Meltdown of 2007Ajay Dhamija
 
Goody Research - Research Methods Flaws
Goody Research - Research Methods FlawsGoody Research - Research Methods Flaws
Goody Research - Research Methods FlawsAjay Dhamija
 
Randomization Tests
Randomization Tests Randomization Tests
Randomization Tests Ajay Dhamija
 
Power Analysis and Sample Size Determination
Power Analysis and Sample Size DeterminationPower Analysis and Sample Size Determination
Power Analysis and Sample Size DeterminationAjay Dhamija
 

More from Ajay Dhamija (15)

fm3-05-301 (copy).pdf
fm3-05-301 (copy).pdffm3-05-301 (copy).pdf
fm3-05-301 (copy).pdf
 
Carbon Finance
Carbon FinanceCarbon Finance
Carbon Finance
 
Ethical hacking & Information Security
Ethical hacking & Information SecurityEthical hacking & Information Security
Ethical hacking & Information Security
 
Karmarkar's Algorithm For Linear Programming Problem
Karmarkar's Algorithm For Linear Programming ProblemKarmarkar's Algorithm For Linear Programming Problem
Karmarkar's Algorithm For Linear Programming Problem
 
Verizon - A Case Study
Verizon - A Case StudyVerizon - A Case Study
Verizon - A Case Study
 
Dabur India Ltd - A Case Study
Dabur India Ltd  - A Case StudyDabur India Ltd  - A Case Study
Dabur India Ltd - A Case Study
 
Non Banking Financial Company
Non Banking Financial CompanyNon Banking Financial Company
Non Banking Financial Company
 
The Financial Sector Reforms in India
The Financial Sector Reforms in IndiaThe Financial Sector Reforms in India
The Financial Sector Reforms in India
 
Hosting Inviting Introduction Guest Relations
Hosting Inviting Introduction Guest RelationsHosting Inviting Introduction Guest Relations
Hosting Inviting Introduction Guest Relations
 
TRIZ
TRIZ TRIZ
TRIZ
 
Global Fiancial Meltdown of 2007
Global Fiancial Meltdown of 2007Global Fiancial Meltdown of 2007
Global Fiancial Meltdown of 2007
 
Goody Research - Research Methods Flaws
Goody Research - Research Methods FlawsGoody Research - Research Methods Flaws
Goody Research - Research Methods Flaws
 
Randomization Tests
Randomization Tests Randomization Tests
Randomization Tests
 
Power Analysis and Sample Size Determination
Power Analysis and Sample Size DeterminationPower Analysis and Sample Size Determination
Power Analysis and Sample Size Determination
 
Research Design
Research DesignResearch Design
Research Design
 

Recently uploaded

8447779800, Low rate Call girls in Saket Delhi NCR
8447779800, Low rate Call girls in Saket Delhi NCR8447779800, Low rate Call girls in Saket Delhi NCR
8447779800, Low rate Call girls in Saket Delhi NCRashishs7044
 
Youth Involvement in an Innovative Coconut Value Chain by Mwalimu Menza
Youth Involvement in an Innovative Coconut Value Chain by Mwalimu MenzaYouth Involvement in an Innovative Coconut Value Chain by Mwalimu Menza
Youth Involvement in an Innovative Coconut Value Chain by Mwalimu Menzaictsugar
 
NewBase 19 April 2024 Energy News issue - 1717 by Khaled Al Awadi.pdf
NewBase  19 April  2024  Energy News issue - 1717 by Khaled Al Awadi.pdfNewBase  19 April  2024  Energy News issue - 1717 by Khaled Al Awadi.pdf
NewBase 19 April 2024 Energy News issue - 1717 by Khaled Al Awadi.pdfKhaled Al Awadi
 
Global Scenario On Sustainable and Resilient Coconut Industry by Dr. Jelfina...
Global Scenario On Sustainable  and Resilient Coconut Industry by Dr. Jelfina...Global Scenario On Sustainable  and Resilient Coconut Industry by Dr. Jelfina...
Global Scenario On Sustainable and Resilient Coconut Industry by Dr. Jelfina...ictsugar
 
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCRashishs7044
 
8447779800, Low rate Call girls in Rohini Delhi NCR
8447779800, Low rate Call girls in Rohini Delhi NCR8447779800, Low rate Call girls in Rohini Delhi NCR
8447779800, Low rate Call girls in Rohini Delhi NCRashishs7044
 
The-Ethical-issues-ghhhhhhhhjof-Byjus.pptx
The-Ethical-issues-ghhhhhhhhjof-Byjus.pptxThe-Ethical-issues-ghhhhhhhhjof-Byjus.pptx
The-Ethical-issues-ghhhhhhhhjof-Byjus.pptxmbikashkanyari
 
Kenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith PereraKenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith Pereraictsugar
 
8447779800, Low rate Call girls in Uttam Nagar Delhi NCR
8447779800, Low rate Call girls in Uttam Nagar Delhi NCR8447779800, Low rate Call girls in Uttam Nagar Delhi NCR
8447779800, Low rate Call girls in Uttam Nagar Delhi NCRashishs7044
 
Digital Transformation in the PLM domain - distrib.pdf
Digital Transformation in the PLM domain - distrib.pdfDigital Transformation in the PLM domain - distrib.pdf
Digital Transformation in the PLM domain - distrib.pdfJos Voskuil
 
Pitch Deck Teardown: Geodesic.Life's $500k Pre-seed deck
Pitch Deck Teardown: Geodesic.Life's $500k Pre-seed deckPitch Deck Teardown: Geodesic.Life's $500k Pre-seed deck
Pitch Deck Teardown: Geodesic.Life's $500k Pre-seed deckHajeJanKamps
 
Fordham -How effective decision-making is within the IT department - Analysis...
Fordham -How effective decision-making is within the IT department - Analysis...Fordham -How effective decision-making is within the IT department - Analysis...
Fordham -How effective decision-making is within the IT department - Analysis...Peter Ward
 
Call US-88OO1O2216 Call Girls In Mahipalpur Female Escort Service
Call US-88OO1O2216 Call Girls In Mahipalpur Female Escort ServiceCall US-88OO1O2216 Call Girls In Mahipalpur Female Escort Service
Call US-88OO1O2216 Call Girls In Mahipalpur Female Escort Servicecallgirls2057
 
Buy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail AccountsBuy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail AccountsBuy Verified Accounts
 
Cyber Security Training in Office Environment
Cyber Security Training in Office EnvironmentCyber Security Training in Office Environment
Cyber Security Training in Office Environmentelijahj01012
 
MAHA Global and IPR: Do Actions Speak Louder Than Words?
MAHA Global and IPR: Do Actions Speak Louder Than Words?MAHA Global and IPR: Do Actions Speak Louder Than Words?
MAHA Global and IPR: Do Actions Speak Louder Than Words?Olivia Kresic
 
Kenya’s Coconut Value Chain by Gatsby Africa
Kenya’s Coconut Value Chain by Gatsby AfricaKenya’s Coconut Value Chain by Gatsby Africa
Kenya’s Coconut Value Chain by Gatsby Africaictsugar
 
APRIL2024_UKRAINE_xml_0000000000000 .pdf
APRIL2024_UKRAINE_xml_0000000000000 .pdfAPRIL2024_UKRAINE_xml_0000000000000 .pdf
APRIL2024_UKRAINE_xml_0000000000000 .pdfRbc Rbcua
 

Recently uploaded (20)

8447779800, Low rate Call girls in Saket Delhi NCR
8447779800, Low rate Call girls in Saket Delhi NCR8447779800, Low rate Call girls in Saket Delhi NCR
8447779800, Low rate Call girls in Saket Delhi NCR
 
Youth Involvement in an Innovative Coconut Value Chain by Mwalimu Menza
Youth Involvement in an Innovative Coconut Value Chain by Mwalimu MenzaYouth Involvement in an Innovative Coconut Value Chain by Mwalimu Menza
Youth Involvement in an Innovative Coconut Value Chain by Mwalimu Menza
 
NewBase 19 April 2024 Energy News issue - 1717 by Khaled Al Awadi.pdf
NewBase  19 April  2024  Energy News issue - 1717 by Khaled Al Awadi.pdfNewBase  19 April  2024  Energy News issue - 1717 by Khaled Al Awadi.pdf
NewBase 19 April 2024 Energy News issue - 1717 by Khaled Al Awadi.pdf
 
Global Scenario On Sustainable and Resilient Coconut Industry by Dr. Jelfina...
Global Scenario On Sustainable  and Resilient Coconut Industry by Dr. Jelfina...Global Scenario On Sustainable  and Resilient Coconut Industry by Dr. Jelfina...
Global Scenario On Sustainable and Resilient Coconut Industry by Dr. Jelfina...
 
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
 
8447779800, Low rate Call girls in Rohini Delhi NCR
8447779800, Low rate Call girls in Rohini Delhi NCR8447779800, Low rate Call girls in Rohini Delhi NCR
8447779800, Low rate Call girls in Rohini Delhi NCR
 
The-Ethical-issues-ghhhhhhhhjof-Byjus.pptx
The-Ethical-issues-ghhhhhhhhjof-Byjus.pptxThe-Ethical-issues-ghhhhhhhhjof-Byjus.pptx
The-Ethical-issues-ghhhhhhhhjof-Byjus.pptx
 
Kenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith PereraKenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith Perera
 
Enjoy ➥8448380779▻ Call Girls In Sector 18 Noida Escorts Delhi NCR
Enjoy ➥8448380779▻ Call Girls In Sector 18 Noida Escorts Delhi NCREnjoy ➥8448380779▻ Call Girls In Sector 18 Noida Escorts Delhi NCR
Enjoy ➥8448380779▻ Call Girls In Sector 18 Noida Escorts Delhi NCR
 
8447779800, Low rate Call girls in Uttam Nagar Delhi NCR
8447779800, Low rate Call girls in Uttam Nagar Delhi NCR8447779800, Low rate Call girls in Uttam Nagar Delhi NCR
8447779800, Low rate Call girls in Uttam Nagar Delhi NCR
 
Digital Transformation in the PLM domain - distrib.pdf
Digital Transformation in the PLM domain - distrib.pdfDigital Transformation in the PLM domain - distrib.pdf
Digital Transformation in the PLM domain - distrib.pdf
 
Pitch Deck Teardown: Geodesic.Life's $500k Pre-seed deck
Pitch Deck Teardown: Geodesic.Life's $500k Pre-seed deckPitch Deck Teardown: Geodesic.Life's $500k Pre-seed deck
Pitch Deck Teardown: Geodesic.Life's $500k Pre-seed deck
 
Fordham -How effective decision-making is within the IT department - Analysis...
Fordham -How effective decision-making is within the IT department - Analysis...Fordham -How effective decision-making is within the IT department - Analysis...
Fordham -How effective decision-making is within the IT department - Analysis...
 
Corporate Profile 47Billion Information Technology
Corporate Profile 47Billion Information TechnologyCorporate Profile 47Billion Information Technology
Corporate Profile 47Billion Information Technology
 
Call US-88OO1O2216 Call Girls In Mahipalpur Female Escort Service
Call US-88OO1O2216 Call Girls In Mahipalpur Female Escort ServiceCall US-88OO1O2216 Call Girls In Mahipalpur Female Escort Service
Call US-88OO1O2216 Call Girls In Mahipalpur Female Escort Service
 
Buy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail AccountsBuy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail Accounts
 
Cyber Security Training in Office Environment
Cyber Security Training in Office EnvironmentCyber Security Training in Office Environment
Cyber Security Training in Office Environment
 
MAHA Global and IPR: Do Actions Speak Louder Than Words?
MAHA Global and IPR: Do Actions Speak Louder Than Words?MAHA Global and IPR: Do Actions Speak Louder Than Words?
MAHA Global and IPR: Do Actions Speak Louder Than Words?
 
Kenya’s Coconut Value Chain by Gatsby Africa
Kenya’s Coconut Value Chain by Gatsby AfricaKenya’s Coconut Value Chain by Gatsby Africa
Kenya’s Coconut Value Chain by Gatsby Africa
 
APRIL2024_UKRAINE_xml_0000000000000 .pdf
APRIL2024_UKRAINE_xml_0000000000000 .pdfAPRIL2024_UKRAINE_xml_0000000000000 .pdf
APRIL2024_UKRAINE_xml_0000000000000 .pdf
 

IRT - Item response Theory

• 9. The Item Characteristic Curve • An unobservable, or latent, trait • Generic term "ability" in IRT • Interval scale of ability: theoretical values [-∞, +∞], practical values [-3, +3] • Ideally free-response items, but these are difficult to score reliably

• 10. The Item Characteristic Curve • Data preparation • Raw data are recoded • Dichotomously scored items (for polytomous data: Samejima's Graded Response model) • Investigating dimensionality (a dominant first factor): examine the eigenvalues following Principal Axis Factoring (PAF); if the data are dichotomous, factor-analyze tetrachoric correlations • Assume a continuum underlies the item responses

• 11. The Item Characteristic Curve • Classical item analysis: the raw score is the sum of the item scores • IRT: emphasis on individual items • Assumption: an ability score θ exists • P(θ) is the probability of a correct answer at ability θ

• 12. The Item Characteristic Curve • P(θ) is small for low-ability examinees and large for high-ability examinees • Plotted against θ, it forms a smooth S-shaped curve: the item characteristic curve (ICC), a basic building block of IRT • Two technical properties: item difficulty and item discrimination
• 13. The Item Characteristic Curve • Item difficulty • A location index • Three ICCs with the same discrimination but different levels of difficulty

• 14. The Item Characteristic Curve • Item discrimination • The slope of the ICC in its middle section (at θ = b) • Differentiates between abilities below the item location and those above it • The steeper the curve, the better the item discriminates

• 15. The Item Characteristic Curve • Caution • These two properties say nothing about the validity of the item; they simply describe the form of the ICC • Figures only show the range -3 to +3 • All ICCs become asymptotic to a probability of zero at one tail and to 1.0 at the other

• 16. The Item Characteristic Curve • An item with perfect discrimination • The item discriminates perfectly between those above and below an ability score of 1.5 • The item is useless for discriminating at other ability levels

• 17. The Item Characteristic Curve • Difficulty has the following levels: very easy, easy, medium, hard, very hard • Discrimination has the following levels: none, low, moderate, high, perfect
• 18. The Item Characteristic Curve (figure slide: example curves only)
• 19. The Item Characteristic Curve • Recap
1. When item discrimination is less than moderate, the ICC is nearly linear and appears rather flat.
2. When discrimination is greater than moderate, the ICC is S-shaped and rather steep in its middle section.
3. When item difficulty is less than medium, most of the ICC has a probability of correct response greater than .5.
4. When item difficulty is greater than medium, most of the ICC has a probability of correct response less than .5.

• 20. The Item Characteristic Curve • Recap
5. Regardless of the level of discrimination, item difficulty locates the item along the ability scale; item difficulty and discrimination are therefore independent of each other.
6. When an item has no discrimination, all choices of difficulty yield the same horizontal line at P(θ) = .5, because the difficulty of an item with no discrimination is undefined.
• 21. Item Characteristic Curve Models • Enough intuition; now for the rigor a theory requires • Three mathematical models for the ICC • The logistic function: the two-parameter model (2PL):

P(θ) = 1 / (1 + e^(-L)) = 1 / (1 + e^(-a(θ - b)))

where e is the constant 2.718, b is the difficulty parameter, a is the discrimination parameter, L = a(θ - b) is the logistic deviate (logit), and θ is an ability level.
• 22. Item Characteristic Curve Models • The difficulty parameter b is defined as the point on the ability scale at which the probability of correct response to the item is .5 • Range of b: [-∞, +∞] theoretically, [-3, +3] in practice • The discrimination parameter is proportional to the slope of the ICC at θ = b • The actual slope at θ = b is a/4, but taking it as a makes interpretation easier • Range of a: [-∞, +∞] theoretically, [-2.80, +2.80] in practice • The normal ogive model has a different interpretation

• 23. Item Characteristic Curve Models • Computational example (b = 1.0, a = .5):

θ     Logit L   exp(-L)   1 + exp(-L)   P(θ)
-3    -2.0      7.389     8.389         .12
-2    -1.5      4.482     5.482         .18
-1    -1.0      2.718     3.718         .27
 0     -.5      1.649     2.649         .38
 1     0.0      1.000     2.000         .50
 2      .5       .607     1.607         .62
 3     1.0       .368     1.368         .73
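The table above is easy to reproduce programmatically. The following is a minimal sketch in Python; the function name p_2pl is illustrative, not from any IRT package.

```python
import math

def p_2pl(theta, b, a):
    """2PL ICC: P(theta) = 1 / (1 + e^(-a(theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Reproduce the computational example: b = 1.0, a = .5
b, a = 1.0, 0.5
for theta in range(-3, 4):
    L = a * (theta - b)                      # the logit
    print(f"{theta:3d} {L:6.2f} {math.exp(-L):7.3f} "
          f"{1 + math.exp(-L):9.3f} {p_2pl(theta, b, a):5.2f}")
```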
• 24. Item Characteristic Curve Models • The Rasch, or one-parameter, logistic model (1PL): a = 1.0 for all items, so

P(θ) = 1 / (1 + e^(-L)) = 1 / (1 + e^(-(θ - b)))

• Birnbaum (1968) three-parameter model (3PL): the probability of correct response includes a small component due to guessing:

P(θ) = c + (1 - c) / (1 + e^(-a(θ - b)))

• 25. Item Characteristic Curve Models • c is the guessing parameter: the probability of getting the item correct by guessing alone • The value of c does not vary as a function of ability level • Range: theoretically [0, 1], practically [0, .35] • The definition of the difficulty parameter changes under the 3PL: c defines a floor to P(θ), and b becomes the point at which P(θ) = (1 + c)/2

• 26. Item Characteristic Curve Models • Discrimination parameter: the slope of the ICC at θ = b is a(1 - c)/4, which can still be taken as proportional to a

• 27. Item Characteristic Curve Models • Negative discrimination • P(θ) decreases as θ increases • Can occur in two ways: the incorrect response to a two-choice item will have a negative discrimination if the correct response has a positive one, or something is wrong with the item • The two curves have the same b and |a|
• 28. Item Characteristic Curve Models • Interpreting item parameter values • a (divide by 1.7 to interpret in the normal ogive metric):

Verbal label   Range of values
none           0
very low       .01 to .34
low            .35 to .64
moderate       .65 to 1.34
high           1.35 to 1.69
very high      > 1.70
perfect        +infinity

• b: a difficult job, since "easy" and "hard" are relative terms
• 29. Item Characteristic Curve Models • In classical test theory, b was defined relative to a group of examinees, so the same item could be easy for one group and hard for another • Under IRT, b is the θ at which P(θ) = .5 for the 1PL and 2PL models, and (1 + c)/2 for the 3PL model • c is interpreted directly as a probability: c = .12 means that at all θ, the probability of a correct response by guessing alone is .12

• 30. Item Characteristic Curve Models • The verbal labels reflect only the midpoint of the scale • Item difficulty tells where the item functions on the ability scale • The slope of the ICC (a) is at its maximum at the ability level corresponding to the item difficulty • The item does best at distinguishing between examinees in the neighborhood of that ability level • An item whose difficulty is -1 functions among lower-ability examinees; a value of +1 denotes an item that functions among higher-ability examinees • So item difficulty is a location parameter

• 31. Item Characteristic Curve Models • Recap
1. Under the 1PL model, the slope is always the same; only the location of the item changes.
2. Under the 2PL and 3PL models, the value of a must become quite large (> 1.7) before the curve is very steep.
3. Under the 1PL and 2PL with a large positive value of b, the lower tail approaches zero; under the 3PL, it approaches c.
4. c is not apparent when b < 0 and a < 1.0.
5. Under all models, curves with a negative value of a are the mirror image of curves with a positive value of a.

• 32. Item Characteristic Curve Models • Recap
6. When b = -3.0, only the upper half of the ICC appears on the graph; when b = +3.0, only the lower half appears.
7. The slope of the ICC is steepest at the ability level corresponding to the item difficulty; thus, the difficulty parameter b locates the point on the ability scale where the item functions best.
8. Under IRT, b is the θ at which P(θ) = .5 for the 1PL and 2PL models and (1 + c)/2 for the 3PL model; only when c = 0 are these two definitions equivalent.
• 33. Estimating Item Parameters • IRT analysis • M examinees in J ability groups (θ_j) respond to the N items • m_j examinees within group j, j = 1, 2, ..., J • In the jth group, r_j examinees answer the item correctly • The observed proportion of correct response is p(θ_j) = r_j / m_j

• 34. Estimating Item Parameters • Find the ICC that best fits the observed proportions of correct response • First select a model for the curve to be fitted • Let's fit the 2PL, with a chi-square goodness-of-fit index to assess fit • The procedure is maximum likelihood estimation (initial values b = 0, a = 1)

• 35. Estimating Item Parameters • The fit statistic is

χ² = Σ_{j=1..J} m_j [p(θ_j) - P(θ_j)]² / [P(θ_j) Q(θ_j)] = 28.88,

and the criterion value was 45.91. The two-parameter model with b = -.39 and a = 1.27 was a good fit to the observed proportions of correct response.
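A sketch of this fit statistic in code, with hypothetical group data (the slide's raw proportions are not shown); only the fitted parameters b = -.39 and a = 1.27 come from the example.

```python
import math

def p_2pl(theta, b, a):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def chi_square_fit(groups, b, a):
    """Chi-square fit: sum over groups of m_j (p_j - P_j)^2 / (P_j Q_j),
    where p_j = r_j / m_j is the observed proportion correct in group j."""
    total = 0.0
    for theta_j, m_j, r_j in groups:
        P = p_2pl(theta_j, b, a)
        total += m_j * (r_j / m_j - P) ** 2 / (P * (1.0 - P))
    return total

# Hypothetical groups as (theta_j, m_j, r_j) triples
groups = [(-2, 40, 5), (-1, 40, 12), (0, 40, 22), (1, 40, 32), (2, 40, 37)]
print(round(chi_square_fit(groups, b=-0.39, a=1.27), 2))
```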
• 36. Estimating Item Parameters • The group invariance of item parameters • Two groups: Group 1 spans ability levels -3 to -1, Group 2 spans +1 to +3

• 37. Estimating Item Parameters • The group invariance of item parameters • b(1) = b(2) = -.39 and a(1) = a(2) = 1.27 • Combined analysis gives the same curve

• 38. Estimating Item Parameters • Group invariance is a powerful feature of IRT • The values of the item parameters are a property of the item, not of the group that responded to it • Under classical test theory, just the opposite holds: the item difficulty of classical theory is the overall proportion of correct response • An item with b = 0 may give a classical difficulty index of .3 for a low-ability group and .8 for a high-ability group • Clearly, the classical item difficulty index is not group invariant • Item difficulty has a consistent meaning in IRT

• 39. Estimating Item Parameters • Caution • The obtained numerical values are subject to variation due to sample size, how well structured the data are, and the goodness-of-fit of the curve to the data • The item must measure the same latent trait for both groups • An item's parameters do not retain group invariance when taken out of context, i.e., when used to measure a different latent trait or with examinees from a population for which the test is inappropriate

• 40. Estimating Item Parameters • Recap
1. The estimated item parameters usually gave a good overall fit to the observed proportions of correct response.
2. When two groups are employed, the same item characteristic curve will be fitted, regardless of the range of ability.
3. The number of examinees at each level does not affect the group-invariance property.
4. For an item with positive discrimination, the low-ability group involves the lower left tail of the ICC, and the high-ability group the upper right tail.
5. The item parameters were group invariant whether or not the ability ranges of the two groups overlapped.

• 41. Estimating Item Parameters • Recap
6. It made no difference which group was the high-ability group; group labeling is not a consideration.
7. The group-invariance principle holds for all three item characteristic curve models.
8. Item parameter estimates are subject to sampling variation.
• 42. Test Characteristic Curve • An examinee's true test score is obtained by adding up the item probabilities:

TS_j = Σ_{i=1..N} P_i(θ_j)

• E.g., for a four-item test: TS = .73 + .57 + .69 + .62 = 2.61 • The procedure is the same for all three models
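The true-score computation is just a sum of ICC values. A minimal sketch follows, with invented item parameters for a four-item test (the example's own parameters are not given on the slide).

```python
import math

def p_3pl(theta, b, a, c=0.0):
    """3PL ICC; c = 0 gives the 2PL, and a = 1 with c = 0 gives the 1PL."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def true_score(theta, items):
    """TS(theta) = sum of P_i(theta) over the N items of the test."""
    return sum(p_3pl(theta, **item) for item in items)

# Invented four-item test, for illustration only
items = [dict(b=-1.0, a=1.0), dict(b=-0.5, a=1.2),
         dict(b=0.0, a=0.8), dict(b=0.5, a=1.0)]
print(round(true_score(0.5, items), 2))   # a value between 0 and N = 4
```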
• 43. Test Characteristic Curve • For the 1PL and 2PL, the left tail approaches zero and the right tail approaches N • TS = 0 (or Σ c_k under the 3PL) corresponds to θ = -∞ • TS = N corresponds to θ = +∞
• 44. Test Characteristic Curve • The TCC transforms ability scores to true scores • The TCC is a monotonically increasing function (though shapes may differ: S-shaped, or increasing with plateaus) • In all cases it is asymptotic to a value of N in the upper tail • The TCC depends upon a number of factors, including the number of items, the ICC model employed, and the values of the item parameters • Caution: the TCC (like the ICC) does not depend on the distribution of ability scores

• 45. Test Characteristic Curve • Interpretation • The ability level corresponding to TS = N/2 locates the test along the ability scale • There is no explicit formula for the TCC, so there are no parameters for the curve

• 46. Test Characteristic Curve • Recap
1. Relation of the true score and the ability level:
a. Given an ability level, the corresponding true score can be found via the TCC.
b. Given a true score, the corresponding ability level can be found via the TCC.
c. Both true scores and ability are continuous variables.
2. Shape of the TCC:
a. When N = 1, the true score ranges from 0 to 1 and the TCC equals the ICC.

• 47. Test Characteristic Curve • Recap
b. The TCC may not resemble an ICC (due to regions of varying steepness and plateaus); it reflects a mixture of item parameter values.
c. The ability level at which the mid-true score (N/2) occurs indicates where the test functions on the ability scale.
d. When the item difficulties have a limited range, the steepness of the TCC depends primarily upon the average value of the item discrimination parameters.
e. When the item difficulties are spread widely over the ability scale, the steepness of the TCC is reduced.

• 48. Test Characteristic Curve • Recap
f. Under a three-parameter model, the lower limit of the true scores is Σ c_k, not zero.
g. The shape of the TCC depends upon the number of items, the ICC model, and the mix of item parameter values.
3. It would be possible to construct a TCC that decreases as ability increases; this would require items with negative discrimination for the correct response. Such a test would not be a good test, because the higher an examinee's ability level, the lower the expected score.
• 49. Estimating an Examinee's Ability • The goal is to locate the person on the ability scale (θ = ?) • An individual's ability • Comparisons among individuals' abilities • The list of 1's and 0's for the N items is called the examinee's item response vector • This vector and the known item parameters are used to estimate the examinee's θ • Maximum likelihood procedures are used: begin with some a priori value for the examinee's ability and the known item parameter values, and compute the probability of correct response to each item for that examinee
• 50. Estimating an Examinee's Ability • The procedure treats each examinee separately. The iterative estimate is

θ_{s+1} = θ_s + [ Σ_{i=1..N} a_i (u_i - P_i(θ_s)) ] / [ Σ_{i=1..N} a_i² P_i(θ_s) Q_i(θ_s) ]

where θ_s is the estimated ability of the examinee within iteration s; a_i is the discrimination parameter of item i, i = 1, 2, ..., N; u_i is the response made by the examinee to item i (1 or 0); P_i(θ_s) is the probability of correct response to item i, under the given ICC model, at ability level θ_s within iteration s; and Q_i(θ_s) = 1 - P_i(θ_s).
• 51. Estimating an Examinee's Ability • A three-item test:

Item   b      a
1      -1.0   1.0
2       0.0   1.2
3       1.0    .8

The examinee's item responses were 1, 0, 1 for items 1, 2, 3. The a priori estimate of the examinee's ability is set to θ_s = 1.0.

• 52. Estimating an Examinee's Ability • First iteration:

Item   u    P     Q     a(u - P)   a²PQ
1      1    .88   .12   .119       .105
2      0    .77   .23   -.922      .255
3      1    .50   .50   .400       .160
Sum                     -.403      .520

Δθ_s = -.403 / .520 = -.773, so θ_{s+1} = 1.0 - .773 = .227
Second iteration: Δθ_s = .066 / .674 = .097, θ_{s+1} = .227 + .097 = .324
Third iteration: Δθ_s = .0006 / .6615 = .0009, θ_{s+1} = .324 + .0009 = .3249
At this point the process is terminated because the adjustment (.0009) is very small. Thus, the examinee's estimated ability is .33.
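The whole iteration can be reproduced in a few lines. A sketch assuming the 2PL model of the example; estimate_theta is an illustrative name, not a library function.

```python
import math

def p_2pl(theta, b, a):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_theta(items, u, theta=1.0, tol=0.001, max_iter=25):
    """Newton-Raphson MLE of ability given known item parameters.
    items: list of (b, a); u: list of 0/1 responses; theta: a priori value."""
    for _ in range(max_iter):
        num = den = 0.0
        for (b, a), ui in zip(items, u):
            P = p_2pl(theta, b, a)
            num += a * (ui - P)              # the a(u - P) column
            den += a * a * P * (1.0 - P)     # the a^2 P Q column
        delta = num / den
        theta += delta
        if abs(delta) < tol:
            break
    return theta, 1.0 / math.sqrt(den)       # estimate and its standard error

items = [(-1.0, 1.0), (0.0, 1.2), (1.0, 0.8)]
theta, se = estimate_theta(items, u=[1, 0, 1])
print(round(theta, 2), round(se, 2))          # about .33 and 1.23
```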
• 53. Estimating an Examinee's Ability • Unfortunately, there is no way to know the examinee's actual ability parameter; the best one can do is estimate it. The standard error of the estimate is

SE(θ) = 1 / sqrt( Σ a_i² P_i(θ_s) Q_i(θ_s) ) = 1 / sqrt(.6615) = 1.23

The examinee's ability was not estimated very precisely, because the standard error is very large. This is primarily due to the fact that only three items were used.

• 54. Estimating an Examinee's Ability • Two cases where the MLE procedure fails: when an examinee answers none of the items correctly, the corresponding ability estimate is negative infinity; when an examinee answers all the items correctly, the estimate is positive infinity. In both cases it is impossible to obtain an ability estimate for the examinee (the computer literally cannot compute a number as big as infinity).
• 55. Estimating an Examinee's Ability • Item invariance of an examinee's ability estimate • Different sets of items should yield the same ability estimate, within sampling variation • In each set, a different part of the ICC is involved • This principle rests upon two conditions: all the items measure the same underlying latent trait, and the values of all the item parameters are in a common metric

• 56. Estimating an Examinee's Ability • Implication of item invariance • A test located anywhere along the ability scale can be used to estimate an examinee's ability • An examinee could take an "easy" test or a "hard" test and obtain, on average, the same estimated ability • This is in sharp contrast to classical test theory, where such an examinee would get a high test score on the easy test and a low score on the hard test • Under item response theory, the examinee's ability is fixed and invariant with respect to the items used to measure it

• 57. Estimating an Examinee's Ability • Caution: an examinee's ability is fixed only relative to a given context • If the examinee received remedial instruction between testings, or if there were carryover (memorization) effects, the underlying ability level would differ at each testing • The item invariance of an examinee's ability and the group invariance of an item's parameters are two facets of what is referred to, generically, as the invariance principle of item response theory

• 58. Estimating an Examinee's Ability • Recap
1. Distribution of estimated ability:
a. The standard error of the estimates can be quite large when the items are not located near the ability of the examinee.
b. When the item discrimination indices are large, the standard error of the ability estimates is small; when they are small, the standard error is large.
c. The optimum set of items for estimating an examinee's ability has all item difficulties equal to the examinee's ability parameter and large values for the item discrimination indices.

• 59. Estimating an Examinee's Ability • Recap
2. Item invariance of the examinee's ability:
a. The different sets of items yielded estimated abilities near the examinee's actual ability level.
b. The mean of these estimates was generally a close approximation of the examinee's ability parameter.
3. Each examinee has an ability score (parameter value), and estimation locates that person on the scale.
• 60. The Information Function • The statistical meaning of information is credited to Sir R.A. Fisher, who defined information as the precision with which a parameter can be estimated, the reciprocal of the variance:

I = 1 / σ², where σ ≈ SE(θ)
• 61. The Information Function • I(θ) is maximum at θ = -1.0 and is about 3 over the ability range -2 ≤ θ ≤ 0; within this range, ability is estimated with some precision • I(θ) does not depend upon the distribution of examinees over the ability scale • The ideal I(θ) would be a horizontal line at some large value, which is hard to achieve • The precision with which an examinee's ability is estimated depends upon where that ability is located on the ability scale, an implication for test construction

• 62. The Information Function • The item information function is I_i(θ), where i indexes the item • It will be small, as a single item is involved

• 63. The Information Function • An item measures ability with greatest precision at the ability level corresponding to the item's difficulty parameter • The amount of item information decreases as the ability level departs from the item difficulty, approaching zero at the extremes of the ability scale • Test information function:

I(θ) = Σ_{i=1..N} I_i(θ)

• 64. The Information Function • I(θ) is much higher than any single I_i(θ) • The longer the test, the higher I(θ) and the greater the precision in measuring an examinee's ability compared with shorter tests • Thus, ability is estimated with some precision near the center of the ability scale • I(θ) tells how well the test estimates ability over the whole range of ability scores • While the ideal test information function may often be a horizontal line, that may not be best for a specific purpose

• 65. The Information Function • For example, if you were constructing a test to award scholarships, this ideal might not be optimal • In that situation, you would like to measure ability with considerable precision near the ability level that separates those who receive the scholarship from those who do not • The best test information function in this case has a peak at the cutoff score • Other specialized uses of tests could require other forms of the test information function
• 66. The Information Function • Information functions:

2PL: I_i(θ) = a_i² P_i(θ) Q_i(θ), with P_i(θ) = 1 / (1 + e^(-a_i(θ - b_i))) and Q_i(θ) = 1 - P_i(θ)
1PL: I_i(θ) = P_i(θ) Q_i(θ)
3PL: I_i(θ) = a_i² [Q_i(θ) / P_i(θ)] [(P_i(θ) - c)² / (1 - c)²], with P_i(θ) = c + (1 - c) / (1 + e^(-a_i(θ - b_i)))

• 67. The Information Function • Example: 3PL (b = 1.0, a = 1.5, c = .2)

θ     L      P_i(θ)   Q_i(θ)   Q/P     (P - c)²   I_i(θ)
-3    -6.0   .20      .80      3.950   .000       .000
-2    -4.5   .21      .79      3.785   .000       .001
-1    -3.0   .24      .76      3.202   .001       .016
 0    -1.5   .35      .65      1.890   .021       .142
 1     0.0   .60      .40       .667   .160       .375
 2     1.5   .85      .15       .171   .428       .257
 3     3.0   .96      .04       .040   .481       .082

The general level of the information values is lower (because of the terms (1 - c) and (P_i(θ) - c)), and the maximum occurs at an ability level slightly higher than b.
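The information column can be checked directly from the 3PL formula above. A minimal sketch:

```python
import math

def p_3pl(theta, b, a, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_info_3pl(theta, b, a, c):
    """3PL item information: I(theta) = a^2 (Q/P) (P - c)^2 / (1 - c)^2."""
    P = p_3pl(theta, b, a, c)
    return a * a * ((1.0 - P) / P) * (P - c) ** 2 / (1.0 - c) ** 2

# b = 1.0, a = 1.5, c = .2, as in the table above
for theta in range(-3, 4):
    print(theta, round(item_info_3pl(theta, 1.0, 1.5, 0.2), 3))
```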
• 68. The Information Function • When items share common values of a and b, the 2PL and 3PL information functions are the same when c = 0 • When c > 0, the three-parameter model will always yield less information • Thus, the item information function under a two-parameter model defines the upper bound for the amount of information under a three-parameter model • This is reasonable, because getting the item correct by guessing should not enhance the precision with which an ability level is estimated

• 69. The Information Function • Computing a test information function: a five-item test under the 2PL.

Item   b      a
1      -1.0   2.0
2      -0.5   1.5
3       0.0   1.5
4       0.5   1.5
5       1.0   2.0

θ     Item 1   Item 2   Item 3   Item 4   Item 5   Test information
-3    .071     .051     .024     .012     .001     .159
-2    .420     .194     .102     .051     .010     .777
-1    1.000    .490     .336     .194     .071     2.091
 0    .420     .490     .563     .490     .420     2.383
 1    .071     .194     .336     .490     1.000    2.091
 2    .010     .051     .102     .194     .420     .777
 3    .001     .012     .024     .051     .071     .159
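The test information column is just the sum of the five 2PL item information values at each θ. A minimal sketch reproducing the totals:

```python
import math

def item_info_2pl(theta, b, a):
    """2PL item information: I(theta) = a^2 P Q."""
    P = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * P * (1.0 - P)

# The five items from the table above, as (b, a) pairs
items = [(-1.0, 2.0), (-0.5, 1.5), (0.0, 1.5), (0.5, 1.5), (1.0, 2.0)]
for theta in range(-3, 4):
    total = sum(item_info_2pl(theta, b, a) for b, a in items)
    print(theta, round(total, 3))   # peaks at about 2.383 at theta = 0
```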
• 70. The Information Function • The five item discriminations had a symmetrical distribution around a value of 1.5 • The five item difficulties had a symmetrical distribution about an ability level of zero • Because of this, the test information function is also symmetric about an ability of zero

• 71. The Information Function • σ ≈ SE(θ) = 1 / sqrt(I(θ)) • A peaked I(θ) measures ability with unequal precision along the ability scale; it is best for estimating the ability of examinees whose abilities fall near the peak • In some tests, I(θ) is rather flat over some region of the ability scale; such a test is desirable for examinees whose ability falls in that range • The maximum amount of test information above was 2.383 at an ability level of 0, which translates into a standard error of .65 • Roughly 68 percent of the estimates at this ability level fall between -.65 and +.65, so the ability level is estimated with a modest amount of precision

• 72. The Information Function • Recap
1. The general level of the test information function depends upon: a. the number of items in the test; b. the average value of the discrimination parameters of the test items; c. both of the above hold for all three ICC models.
2. The shape of the test information function depends upon: a. the distribution of the item difficulties over the ability scale; b. the distribution and average value of the discrimination parameters.
3. When the item difficulties are clustered closely around a given value, the test information function is peaked at that point on the ability scale; the maximum amount of information depends upon the values of the discrimination parameters.

• 73. The Information Function • Recap
4. When the item difficulties are widely distributed over the ability scale, the test information function tends to be flatter than when they are tightly clustered.
5. Values of a < 1.0 result in a low general level of the amount of test information.
6. Values of a > 1.7 result in a high general level of the amount of test information.
7. Under a three-parameter model, values of the guessing parameter c greater than zero lower the amount of test information at low ability levels; large values of c also reduce the general level of test information.
8. It is difficult to approximate a horizontal test information function: the values of b must be spread widely over the ability scale, and the values of a must be in the moderate to low range with a U-shaped distribution.
• 74. Test Calibration • Test constructors know beforehand what trait they want the items to measure and the ability levels of the examinees for whom the items are designed to function • But it is not possible to determine the values of an item's parameters a priori, and when a test is administered, the latent trait levels of the examinees are not known • A major task, therefore, is to determine the values of the item parameters and examinee abilities in a metric for the underlying latent trait • In IRT, this task is called test calibration

• 75. Test Calibration • The test calibration process (Birnbaum, 1968): two stages of maximum likelihood estimation • Stage 1: the parameters of the N items in the test are estimated • The estimated ability of each examinee is treated as if it were expressed in the true metric of the latent trait • Then the parameters of each item are estimated via the maximum likelihood procedure, one item at a time (independence assumption)

• 76. Test Calibration • Stage 2: the ability parameters of the M examinees are estimated • Taking the item parameter estimates as known, the ability of each examinee is estimated using the maximum likelihood procedure presented earlier, one examinee at a time (independence assumption) • The two-stage process is repeated until a suitable convergence criterion is met • The overall effect is that the parameters of the N test items and the abilities of the M examinees are estimated simultaneously

• 77. Test Calibration • The process does not yield a unique metric for the ability scale • The midpoint and the unit of measurement of the obtained ability scale are indeterminate: the metric is unique only up to a linear transformation • It is necessary to "anchor" the metric via arbitrary rules for determining the midpoint and unit of measurement of the ability scale • It is not possible to obtain estimates of the examinee's ability and the item's parameters in the true metric of the underlying latent trait • The best we can do is obtain a metric that depends upon a particular combination of examinees and test items

• 78. Test Calibration • With three different ICC models to choose from, there are several customised ways to implement the Birnbaum paradigm; you have to devise your own implementation • Illustration: the BICAL program, as implemented by Benjamin Wright and his co-workers for the 1PL Rasch model • A ten-item test administered to a group of 16 examinees
• 79. Test Calibration • Item responses of the 16 examinees to the 10 items (raw scores, RS):

Examinee   RS        Examinee   RS
01         2         09         4
02         2         10         3
03         5         11         9
04         4         12         9
05         1         13         6
06         3         14         9
07         4         15         9
08         4         16         10

• 80. Test Calibration • Examinee 16 answered all of the items correctly and so was removed from the data set • If an item is answered correctly by all of the examinees or by none of them, its item difficulty parameter cannot be estimated • Frequency counts for the edited data: the item (column) totals were 13, 8, 8, 5, 10, 7, 7, 6, 7, 3 for items 1 through 10, with a grand total of 74 correct responses
• 81. Test Calibration • Given the two frequency vectors (row and column totals), the estimation process can be implemented • Under the Rasch model, the anchoring procedure takes advantage of a = 1: the unit of measurement for estimated ability is set at 1, and each item's difficulty is estimated • To set the midpoint of the ability scale, the mean item difficulty is subtracted from each item's difficulty estimate, so the midpoint of item difficulty becomes 0, and the same holds for the ability estimates • The ability estimate corresponding to each raw test score is obtained in the second stage using the rescaled item difficulties (as if they were the difficulty parameters) and the vector of row marginal totals; the metric is thereby changed to that of the rescaled parameters • The output of this stage is an ability estimate for each raw test score in the data set

• 82. Test Calibration • At this point, the convergence of the overall iterative process is checked • Wright summed the absolute differences between the item difficulty estimates from two successive iterations of the paradigm • If this sum was less than .01, the estimation process was terminated; if it was greater than .01, another iteration was performed and the two stages were done again • Thus, the cycle of stage one, anchoring the metric, stage two, and checking for convergence is repeated until the criterion is met • When this happens, the current values of the item and ability parameter estimates are accepted, and an ability scale metric has been defined
• 83. Test Calibration • Item parameter estimates:

Item   Difficulty
1      -2.37
2      -0.27
3      -0.27
4      +0.98
5      -1.00
6      +0.11
7      +0.11
8      +0.52
9      +0.11
10     +2.06

Because of the anchoring procedures used, these values are relative to the average item difficulty of the test for these examinees. You can verify that the sum of the item difficulties is zero (within rounding error).

• 84. Test Calibration • Ability estimation:

Examinee   Estimate   Raw score
1          -1.50      2
2          -1.50      2
3          +0.02      5
4          -0.42      4
5          -2.37      1
6          -0.91      3
7          -0.42      4
8          -0.42      4
9          -0.42      4
10         -0.91      3
11         +2.33      9
12         +2.33      9
13         +0.46      6
14         +2.33      9
15         +2.33      9
16         *****      10

All examinees with the same raw score obtained the same ability estimate.
• 85. Test Calibration • This unique feature is a direct consequence of fixing a = 1 for all items, and is part of the intuitive appeal of the Rasch model • When 2PL or 3PL ICCs are used, an examinee's ability estimate depends upon the particular pattern of item responses rather than the raw score • Under those models, examinees with the same item response pattern will obtain the same ability estimate, but examinees with the same raw score could obtain different ability estimates if they answered different items correctly

• 86. Test Calibration • Illustration: three different ten-item tests measuring the same latent trait, with a common group of 16 examinees taking all three • The first test was created so that its average difficulty matched the mean ability of the common group; the second was created to be an easy test for this group; the third to be a hard test • Each test-group combination is subjected to the Birnbaum paradigm and calibrated separately • Each test calibration yields a unique metric for the ability scale • Due to the anchoring process, all three calibrations yielded a mean item difficulty of zero

• 87. Test Calibration • Within each calibration, examinees with the same raw test score obtained the same estimated ability • However, a given raw score will not yield the same estimated ability across the three calibrations • The mean estimated ability is expressed relative to the mean item difficulty of the test • Thus, the easy test should result in a positive mean ability, the hard test in a negative value, and the matched test in a mean near zero • Test results can nevertheless be placed on a common ability scale

• 88. Test Calibration • Putting the three tests on a common ability scale (test equating) • The principle of item invariance indicates that an examinee should obtain the same ability estimate regardless of the set of items used, yet in the three calibrations above this did not hold • The problem is not in the invariance principle, but in the test calibrations • The invariance principle assumes that the item parameters of the several sets of items are all expressed in the same ability-scale metric • In the present situation, there are three different ability scales, one from each calibration • The average difficulties of the tests were intended to be different, but the anchoring process forced each test to have a mean item difficulty of zero

• 89. Test Calibration • The mean ability of the common group was .06 on the matched test, .44 on the easy test, and -.11 on the hard test • The mean from the matched test is about what it should be • The mean from the easy test tells us that average ability is above the mean item difficulty of that test • The mean from the hard test is below the mean item difficulty

• 90. Test Calibration • We can use the mean abilities to position the tests on a common scale • But which calibration should be the baseline? The matched test: this calibration yielded a mean ability of .062 and a mean item difficulty of zero • Because the Rasch model was used, the unit of measurement for all three calibrations is unity • Therefore, bringing the easy and hard test results to the baseline metric only involves adjusting for the differences in midpoints

• 91. Test Calibration • Easy test • The shift factor needed is the difference between the common group's mean estimated ability on the easy test (.444) and on the matched test (.062), which is .382 • To convert the easy-test item difficulties to the baseline metric, simply subtract .382 from each • Similarly, each examinee's ability can be expressed in the baseline metric by subtracting .382 from it

• 92. Test Calibration • Hard test • The hard-test results are expressed in the baseline metric using the difference in mean ability: the shift factor is -.111 - .062 = -.173 • Again, subtracting this value from each item difficulty estimate puts it in the baseline metric • The common group's ability estimates from the hard test are transformed to the baseline (matched-test) metric with the same shift factor
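Under the Rasch model the equating step is a single additive shift. A small sketch; the first five easy-test raw difficulties here are recovered from the baseline table that follows by adding back the .382 shift.

```python
def to_baseline(values, group_mean, baseline_mean):
    """Common-group Rasch equating: subtract the difference between the
    common group's mean ability on this test and on the baseline test."""
    shift = group_mean - baseline_mean
    return [round(v - shift, 3) for v in values]

# First five easy-test difficulties (baseline values + .382)
easy_b = [-1.110, -1.110, -1.740, 0.200, -0.180]
print(to_baseline(easy_b, group_mean=0.444, baseline_mean=0.062))
# -> [-1.492, -1.492, -2.122, -0.182, -0.562], the baseline-metric values
```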
• 93. Test Calibration • Item difficulties in the baseline metric:

Item   Easy test   Matched test   Hard test
1      -1.492      -2.37          *****
2      -1.492      -.27           -.037
3      -2.122      -.27           -.497
4      -.182       .98            -.497
5      -.562       -1.00          .963
6      +.178       .11            -.497
7      .528        .11            .383
8      .582        .52            .533
9      .880        .11            .443
10     .880        2.06           *****
Mean   -.285       0.00           .224

• 94. Test Calibration • Ability estimates of the common group in the baseline metric:

Examinee    Easy test   Matched test   Hard test
1           -2.900      -1.50          *****
2           -.772       -1.50          *****
3           -1.962      .02            *****
4           -.292       -.42           -.877
5           -.292       -2.37          -.877
6           .168        -.91           *****
7           1.968       -.42           -1.637
8           .168        -.42           -.877
9           .638        -.42           -1.637
10          .638        -.91           -.877
11          .638        2.33           .153
12          1.188       2.33           .153
13          .292        .46            .153
14          1.968       2.33           2.003
15          *****       2.33           1.213
16          *****       *****          2.003
Mean        .062        .062           .062
Std. dev.   1.344       1.566          1.413
• 95. Test Calibration • After transformation: • The matched test has its mean at the midpoint of the baseline ability scale • The easy test has a negative mean difficulty and the hard test a positive one • The average difficulty of both tests is about the same distance from the middle of the scale • In technical terms, we have "equated" the tests, i.e., put them on a common scale

• 96. Test Calibration • The mean estimated ability of the common group was the same for all three tests • The standard deviations of the ability estimates were nearly the same for the easy and hard tests, and that for the matched test was in the ballpark • Although the summary statistics were quite similar for all three sets of results, the ability estimates for a given examinee varied widely • The invariance principle has not gone awry; what you are seeing is sampling variation • Given the small size of the data sets, it is quite remarkable that the results came out as nicely as they did, a clear demonstration of the capabilities of the Rasch model and Birnbaum's MLE paradigm

• 97. Test Calibration • After equating, the numerical values of the item parameters can be used to compare where different items function on the ability scale • The examinees' estimated abilities are expressed in the same metric and can be compared • It is also possible to compute the test characteristic curve and the test information function for the easy and hard tests in the baseline metric • Technically speaking, the tests were equated using the common-group approach with tests of different difficulty • The ease with which test equating can be accomplished is one of the major advantages of item response theory over classical test theory

• 98. Test Calibration • Recap
1. The end product of the test calibration process is the definition of an ability scale metric.
2. Under the Rasch model, this scale has a unit of measurement of 1 and a midpoint of zero.
3. However, it is not the metric of the underlying latent trait; the obtained metric depends upon the item responses of the particular combination of examinees and items subjected to the Birnbaum paradigm.
4. Since the true metric of the underlying latent trait cannot be determined, the metric yielded by the Birnbaum paradigm is used as if it were the true metric; the obtained item difficulties and examinee abilities are interpreted in this metric. Thus, the test has been calibrated.

• 99. Test Calibration • Recap
5. The outcome of the test calibration procedure is to locate each examinee and each item along the obtained ability scale.
6. In the present example, item 5 had a difficulty of -1 and examinee 10 an ability estimate of -.91; therefore, the probability of examinee 10 answering item 5 correctly is approximately .5.
7. The capability to locate items and examinees along a common scale is a powerful feature of item response theory: a single, consistent framework.
• 100. Specifying the Characteristics of a Test • During this transitional period in testing practice, many tests have been designed and constructed using classical test theory principles but analyzed via item response theory procedures • This lack of congruence between the construction and analysis procedures has kept the full power of item response theory from being exploited • To obtain the many advantages of IRT, tests should be designed, constructed, analyzed, and interpreted within the framework of the theory

• 101. Specifying the Characteristics of a Test • Under IRT, a well-defined set of procedures (item banking) is used to establish and maintain item pools • Item parameters are expressed in a known ability-scale metric • It is therefore possible to select items from the item pool and determine the major technical characteristics of a test before it is administered • If the test characteristics do not meet the design goals, selected items can be replaced by other items from the pool until the desired characteristics are obtained • In this way, considerable time and money that would ordinarily be devoted to piloting the test are saved

• 102. Specifying the Characteristics of a Test • First define the latent trait the items are to measure, write items to measure this trait, and pilot-test the items to weed out poor ones • This large set of items is then administered to a large group of examinees • An ICC model is selected, the examinees' item response data are analyzed via the Birnbaum paradigm, and the test is calibrated • The ability scale resulting from this calibration is taken as the baseline metric of the item pool • From a test-construction point of view, we now have a set of items whose parameter values are known: in technical terms, a "precalibrated item pool" exists

• 103. Specifying the Characteristics of a Test • Developing a test from a precalibrated item pool • Since the items in the pool measure a specific latent trait, tests constructed from it will also measure this trait • Alternate forms are routinely needed to maintain test security, and special versions of the test can be used, for example, to award scholarships • In such cases, items are selected from the pool on the basis of their content and technical characteristics to meet the particular testing goals • The advantage of a precalibrated item pool is that the parameter values of the items included in a test can be used to compute the test characteristic curve and the test information function before the test is administered

• 104. Specifying the Characteristics of a Test • This is possible because neither of these curves depends upon the distribution of examinee ability scores over the ability scale • Given these two curves, the test constructor has a very good idea of how the test will perform before it is given to a group of examinees • In addition, once the test has been administered and calibrated, test equating procedures can express the ability estimates of the new group of examinees in the metric of the item pool

• 105. Specifying the Characteristics of a Test • Screening tests distinguish rather sharply between examinees whose abilities are just below a given ability level and those at or above it; they are used to award scholarships and to assign students to specific instructional programs • Broad-range tests measure ability over a wide range of the underlying ability scale; tests of reading or mathematics are typically broad-range tests

• 106. Specifying the Characteristics of a Test • Peaked tests measure ability well over a range of ability wider than that of a screening test, but not as wide as that of a broad-range test • Some ground rules: a. it is assumed that items are selected on the basis of content as well as parameter values; b. no two items in the item pool possess exactly the same combination of parameter values; c. the parameter values are subject to the constraints -3 ≤ b ≤ +3, .50 ≤ a < 2.00, 0 ≤ c ≤ .35 • The values of the discrimination parameter are restricted to reflect the range usually seen in well-maintained item pools
• 107. Specifying the Characteristics of a Test • Example case: construct a ten-item screening test that separates examinees into two groups, those who need remedial instruction and those who don't, on the ability measured by the items in the pool • Students whose ability falls below a value of -1 will receive the instruction • Solution:
1. b = -1.8, a = 1.2
2. b = -1.6, a = 1.4
3. b = -1.4, a = 1.1
4. b = -1.2, a = 1.3
5. b = -1.0, a = 1.5
6. b = -.8, a = 1.0
7. b = -.6, a = 1.4
8. b = -.4, a = 1.2
9. b = -.2, a = 1.1
10. b = 0.0, a = 1.3
The logic underlying these choices was to center the difficulties on the cutoff level of -1 and use moderate values of discrimination (a quick computational check of this design appears below).
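A quick way to check such a design before administration is to compute the true score and the test information at the cutoff from the chosen parameters. A sketch under the 2PL, using the item parameters as reconstructed above:

```python
import math

def p_2pl(theta, b, a):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# The ten screening-test items, as (b, a) pairs
items = [(-1.8, 1.2), (-1.6, 1.4), (-1.4, 1.1), (-1.2, 1.3), (-1.0, 1.5),
         (-0.8, 1.0), (-0.6, 1.4), (-0.4, 1.2), (-0.2, 1.1), (0.0, 1.3)]

cutoff = -1.0
true_score = sum(p_2pl(cutoff, b, a) for b, a in items)
info = sum(a * a * p_2pl(cutoff, b, a) * (1 - p_2pl(cutoff, b, a))
           for b, a in items)
print(round(true_score, 2))   # near the mid-true score N/2 = 5
print(round(info, 2))         # the information available at the cutoff
```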
• 108. Specifying the Characteristics of a Test • The mid-true score corresponded to an ability level of -1.0 • The test characteristic curve was not particularly steep at the cutoff level, indicating that the test lacked discrimination • The peak of the information function occurred at an ability level of -1.0, but the maximum was a bit small • The results suggest that the test was properly positioned on the ability scale, but that a better set of items could be found

• 109. Specifying the Characteristics of a Test • Two changes would improve the test's characteristics: first, cluster the item difficulties nearer the cutoff level; second, use larger values of the discrimination parameters • These changes would steepen the test characteristic curve and increase the maximum amount of information at the ability level of -1.0

• 110. Specifying the Characteristics of a Test • Recap • 1. Screening tests: a. the desired TCC has its mid-true score at the specified cutoff ability level and should be as steep as possible there; b. the test information function should be peaked, with its maximum at the cutoff ability level; c. the optimal case has all item difficulties at the cutoff point and large item discriminations; d. select items that yield the maximum amount of information at the cutoff point

• 111. Specifying the Characteristics of a Test • 2. Broad-range tests: a. the desired TCC has its mid-true score at the midpoint of the ability range of interest, most often an ability level of zero, and should be linear for most of its range; b. the desired test information function is horizontal over the widest possible range, with the maximum amount of information as large as possible; c. the item difficulty parameters should be spread uniformly over the ability scale and as widely as practical

• 112. Specifying the Characteristics of a Test • 3. Peaked tests: a. the desired TCC has its mid-true score in the middle of the ability range of interest, with a moderate slope at that level; b. the test information function should be rounded in appearance over the ability range of most interest; c. the item difficulties should be clustered around the midpoint of the range, but not as tightly as for a screening test; d. the discrimination parameters should be large, and items whose difficulties fall within the range of interest should have larger discriminations than the other items

• 113. Specifying the Characteristics of a Test • 4. Role of ICC models: a. with a fixed at 1.0, the Rasch model places a limit on the maximum amount of obtainable information; the maximum item information is .25, since P_i(θ) Q_i(θ) = .25 when P_i(θ) = .5, so the maximum test information under the Rasch model is .25 times the number of items; b. due to the presence of the guessing parameter, a 3PL model will yield a more linear TCC and a test information function with a lower general level than a 2PL model with the same a and b; c. the information function under a 2PL model is the upper bound for the information function under a 3PL model when the values of b and a are the same

• 114. Specifying the Characteristics of a Test • 5. Role of the number of items: increasing the number of items has little impact on the general form of the TCC if the distribution of the item parameters remains the same, but it has a significant impact on the general level of the test information function (TIF) • Select items having high values of a and a distribution of item difficulties consistent with the testing goals • The pairing of item parameters is vital: e.g., choosing a high-a item whose difficulty is not in the range of interest does little for the TIF or the slope of the TCC • It is important to examine both the ICC and the item information function to ascertain an item's contribution to the TCC and the TIF • (Partial Credit Model)
• 115. Computer Adaptive Test • The computer can update the estimate of the examinee's ability after each item and use it to select the subsequent item • With the right item bank and a high examinee ability variance, CAT can be much more efficient than a traditional paper-and-pencil test • Paper-and-pencil "fixed-item" tests: everyone takes every item; easy and hard items act like constants added to the score and provide little information about the examinee's ability level; large numbers of items and examinees are needed to obtain even a modest degree of precision

• 116. Computer Adaptive Test • In a computer adaptive test: the examinee's ability level relative to a norm group can be estimated iteratively; items can be selected based on the current ability estimate; examinees can be given the items that maximize the information (within constraints) about their ability levels; examinees receive few items that are very easy or very hard for them (little information is gained from those) • The result is reduced standard errors (SE = 1/sqrt(I)) and greater precision with only a handful of properly selected items

• 117. Computer Adaptive Test • The CAT algorithm is usually an iterative process • Step 1: all items not yet administered are evaluated to determine which is the best one to administer next, given the current ability estimate • Step 2: the "best" next item (providing the most information) is administered and the examinee responds • Step 3: a new ability estimate is computed from the responses to all administered items • Steps 1 through 3 are repeated until a stopping criterion is met • The process is similar to the Newton-Raphson iterative method for solving equations
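A minimal sketch of this loop under the 2PL, with maximum-information item selection, one Newton-Raphson update per step (as in the ability-estimation section), and a stopping rule of SE(θ) < .33; the item pool and examinee here are simulated, and administer_cat is an illustrative name.

```python
import math, random

def p_2pl(theta, b, a):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def administer_cat(pool, true_theta, se_target=0.33, max_items=30):
    """Iterative CAT: pick the most informative remaining item at the
    current estimate, score a simulated response, update theta, and stop
    once SE(theta) reaches the target (i.e., information exceeds ~10)."""
    theta, taken, responses = 0.0, [], []
    while len(taken) < min(max_items, len(pool)):
        # Steps 1-2: best next item at the current estimate, then respond
        def info(j):
            b, a = pool[j]
            P = p_2pl(theta, b, a)
            return a * a * P * (1.0 - P)
        nxt = max((j for j in range(len(pool)) if j not in taken), key=info)
        taken.append(nxt)
        b, a = pool[nxt]
        responses.append(1 if random.random() < p_2pl(true_theta, b, a) else 0)
        # Step 3: one Newton-Raphson update from all responses so far
        num = den = 0.0
        for j, u in zip(taken, responses):
            b, a = pool[j]
            P = p_2pl(theta, b, a)
            num += a * (u - P)
            den += a * a * P * (1.0 - P)
        # Clamp: all-correct / all-wrong patterns have no finite MLE
        theta = max(-3.0, min(3.0, theta + num / den))
        if 1.0 / math.sqrt(den) < se_target:   # stopping criterion
            break
    return theta, len(taken)

pool = [(b / 10.0, 1.2) for b in range(-30, 31, 2)]   # simulated item bank
print(administer_cat(pool, true_theta=0.8))
```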
  • 118. Computer Adaptive Test Reliability and standard error • In classical measurement, with a test reliability of 0.90, the standard error of measurement for the test is about .33 of the standard deviation of examinee test scores. • In item response theory-based measurement, and when ability scores are scaled to a mean of zero and a standard deviation of one (which is common), this level of reliability corresponds to a standard error of about .33 and test information of about 10. • Thus, it is common in practice, to design CATs so that the standard errors are about .33 or smaller (or correspondingly, test information exceeds 10. 118 The Basics of IRT September 9, 2009
• 119. Computer Adaptive Test

Potential of computer adaptive tests

• Tests are given "on demand" and scores are available immediately.

• Neither answer sheets nor trained test administrators are needed.

• Test administrator differences are eliminated as a factor in measurement error.

• Tests are individually paced.

• Test security may be increased.

• Computerized testing offers a number of options for timing and formatting.
• 120. Computer Adaptive Test

• CATs can reduce testing time by more than 50% while maintaining the same level of reliability. Shorter testing times also reduce fatigue, a factor that can significantly affect an examinee's test results.

• CATs can provide accurate scores over a wide range of abilities, while traditional tests are usually most accurate only for average examinees.

• CAT and IRT are a powerful combination.
• 121. Computer Adaptive Test

Limitations of CAT

• CATs are not applicable to all subjects and skills. Most CATs are based on an IRT model, yet IRT is not applicable to all skills and item types.

• Hardware limitations: items involving detailed artwork, graphs, or extensive reading passages, for example, may be hard to present.

• CATs require careful item calibration.

• CATs are only manageable if a facility has enough computers for a large number of examinees and the examinees are at least partially computer-literate. This can be a big limitation.
• 122. Computer Adaptive Test

• The test administration procedures are different, which may cause problems for some examinees.

• With each examinee receiving a different set of questions, there can be perceived inequities.

• Examinees are not usually permitted to go back and change answers.
• 123. References

Baker, F.B. The Basics of Item Response Theory.

Birnbaum, A. "Some latent trait models and their use in inferring an examinee's ability." Part 5 in F.M. Lord and M.R. Novick, Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley, 1968.

Hambleton, R.K., and Swaminathan, H. Item Response Theory: Principles and Applications. Hingham, MA: Kluwer-Nijhoff, 1984.

Hulin, C.L., Drasgow, F., and Parsons, C.K. Item Response Theory: Application to Psychological Measurement. Homewood, IL: Dow Jones-Irwin, 1983.

Lord, F.M. Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Erlbaum, 1980.

Mislevy, R.J., and Bock, R.D. PC-BILOG 3: Item Analysis and Test Scoring with Binary Logistic Models. Mooresville, IN: Scientific Software, Inc., 1986.

Wright, B.D., and Mead, R.J. BICAL: Calibrating Items with the Rasch Model. Research Memorandum No. 23. Chicago: Statistical Laboratory, Department of Education, University of Chicago, 1976.

Wright, B.D., and Stone, M.A. Best Test Design. Chicago: MESA Press, 1979.

NCME Instructional Modules.
• 124. Thank You
• 125. What is IRT? – DIF/DTF

• Classical test theory methods confound "bias" with true mean differences; IRT does not.

• In IRT terminology, item/test bias is referred to as DIF/DTF (Differential Item/Test Functioning).

• DIF refers to a difference in the probability of endorsing an item between members of a reference group (e.g., US workers) and a focal group (e.g., Chinese workers) having the same standing on theta.

• DTF refers to a difference in the test characteristic curves, obtained by summing the item response functions for each group.

• DTF is perhaps more important for selection, because decisions are made on the basis of test scores, not individual item responses.

• If DIF is detected, IRT can control for item bias when estimating scores.
• 126. What is IRT? – DIF Examples

[Two panels plotting the probability of a positive response against theta (−3 to 3) for a reference group and a focal group.]

• Uniform DIF against the focal group: the reference group is favored at all levels of theta.

• Nonuniform (crossing) DIF: the focal group is favored at low theta, the reference group at high theta.
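A sketch reproducing the two patterns above with made-up parameters: uniform DIF modeled as a difficulty shift (the curves never cross) and nonuniform DIF as a discrimination difference (the curves cross):

```python
import numpy as np

def icc(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 13)

# Uniform DIF: same a, harder b for the focal group -> reference favored everywhere
ref_u, foc_u = icc(theta, 1.2, 0.0), icc(theta, 1.2, 0.7)
print(bool(np.all(ref_u > foc_u)))                             # -> True

# Nonuniform (crossing) DIF: different a -> the advantage reverses across theta
ref_n, foc_n = icc(theta, 1.5, 0.0), icc(theta, 0.6, 0.0)
print(np.sign(ref_n - foc_n)[0], np.sign(ref_n - foc_n)[-1])   # -> -1.0 1.0
```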
• 127. What is IRT? – DIF Detection

• DIF
   – Parametric: Lord's Chi-Square; Likelihood Ratio Test; Signed and Unsigned Area Methods
   – Nonparametric: SIBTEST; Mantel-Haenszel

• DTF
   – Parametric: Raju's DFIT Method
   – Nonparametric: SIBTEST
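As one concrete example from this list, a minimal sketch of the Mantel-Haenszel common odds ratio, computed across matched score strata k (the counts here are made up):

```python
# Per stratum k: A_k = reference correct, B_k = reference incorrect,
#                C_k = focal correct,     D_k = focal incorrect
A = [30, 40, 50]; B = [20, 15, 10]
C = [25, 35, 45]; D = [25, 20, 15]

# alpha_MH = sum_k(A_k*D_k/N_k) / sum_k(B_k*C_k/N_k), with N_k the stratum total
num = sum(a_ * d_ / (a_ + b_ + c_ + d_) for a_, b_, c_, d_ in zip(A, B, C, D))
den = sum(b_ * c_ / (a_ + b_ + c_ + d_) for a_, b_, c_, d_ in zip(A, B, C, D))
alpha_mh = num / den      # ~1 means no DIF; >1 means the reference group is favored
print(round(alpha_mh, 2))
```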
• 128. What is IRT? – Some simple models

Let θ_j represent the latent (factor) score for individual j, and let π_ij be the probability that individual j responds correctly to item i. Then a simple item response model is

   logit(π_ij) = a_i (θ_j − b_i)

This is just a 'binary response' factor analysis model.
• 129. What is IRT? – Some simple models

• Lawley (1944) really started it off.

• Lord (1980) promoted the term 'item response theory' as opposed to 'classical item analysis'.

• Technical elaborations include: 'parameters' for 'guessing'; partial credit (degrees of correctness) responses; multidimensional models.

• BUT the 'workhorse' is still the Lord model (with the factor assumed to be a random rather than a fixed variable):

   P(x_ij = 1 | θ_j) = 1 / (1 + e^(−a_i(θ_j − b_i))),   θ_j random
• 130. What is IRT? – Classical to IRT

• Classical test theory is really an item response model (IRM): a reasonable (consistent) estimate of θ_j (a random variable) is given by the 'raw score', i.e., the percentage (or total) of correct items.

• A somewhat more efficient estimate is given by a weighted percentage, using the item discriminations as weights.

• The Lord model is simply the two-parameter logistic given above, with θ_j treated as random.
• 131. What is IRT? – Item response relationship

• The sigmoid function used in IRT is a type of logistic function. The general form of the logistic function is:

   f(t) = a (1 + m e^(−t/τ)) / (1 + n e^(−t/τ))

• The special case with a = 1, m = 0, n = 1, τ = 1 is:

   f(t) = 1 / (1 + e^(−t))

• The logistic function is the inverse of the natural logit function and so can be used to convert the logarithm of odds into a probability; the conversion from the log-likelihood ratio of two alternatives also takes the form of a sigmoid curve.

• For a single item in a test:

   P_i(θ) = 1 / (1 + e^(−a_i(θ − b_i)))
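A quick numerical check (with an arbitrary probability) that this special case inverts the natural logit, turning log-odds back into a probability:

```python
import math

def logistic(t):                 # the special case a=1, m=0, n=1, tau=1
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):                    # natural log of the odds
    return math.log(p / (1.0 - p))

p = 0.73
print(abs(logistic(logit(p)) - p) < 1e-12)   # -> True
```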
• 132. What is IRT? – Rasch Model

   P_i(θ) = 1 / (1 + e^(−(θ − b_i)))

• Here the 'discrimination' is assumed to be the same for each item.

• The resulting (maximum likelihood) factor score estimates are then a one-to-one transformation of the raw scores (see the sketch below).

• The Rasch model is a special case and will often not fit the data very well.

• We can add further predictors, for example social background, that may mediate the relationships.

• Item response practitioners go further: if we assume that the item parameters (a_i, b_i) are the same across populations and, for example, tests, then we can form common scales for different populations and different tests.
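A sketch illustrating the one-to-one claim (item difficulties made up): under the Rasch model, two different response patterns with the same number correct yield the same maximum likelihood theta estimate, because the raw score is a sufficient statistic:

```python
import numpy as np

b = np.array([-1.0, -0.3, 0.2, 0.9])   # hypothetical Rasch difficulties

def theta_mle(x, b, iters=50):
    """Newton-Raphson MLE for theta under the Rasch model (a_i = 1 for all items)."""
    theta = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        theta -= np.sum(x - p) / -np.sum(p * (1 - p))
    return theta

# Both patterns have raw score 2 but get different items right
print(round(theta_mle(np.array([1, 1, 0, 0]), b), 4))
print(round(theta_mle(np.array([0, 0, 1, 1]), b), 4))   # -> same estimate
```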
• 133. What is IRT? – Multidimensional Model

• We can generalise the logistic model by adding a further factor, allowing an individual to be characterised by two underlying traits, e.g.

   logit(π_ij) = a_i1 θ_1j + a_i2 θ_2j − b_i

• A sensible analysis will explore the dimensionality structure of a set of item responses.

• Assumptions are needed: for example, that the factors are independent, or alternatively that they are correlated but each item has a non-zero coefficient (loading) on only one factor – or an intermediate assumption.

• What are the consequences of a more complex structure?
• 134. What is IRT? – Multidimensional Model

• It allows a more faithful representation of multi-faceted achievement.

• It allows the (multidimensional) structure of achievement to be compared among groups or populations, in the following ways:

   • The correlations between factors can vary.

   • The values of the loadings can vary.

   • The factor scores can be allowed to depend on further variables, such as gender, and the resulting 'regressions' may vary.

• With extensions to multilevel modelling etc., this becomes a 'structural equation model' (a sketch of the two-factor item model follows).
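A minimal sketch of the two-factor (compensatory) logistic item model from the previous slide, with all trait scores, loadings, and difficulties made up: the logit is a weighted sum of two trait scores, so an item may load on one factor or on both:

```python
import numpy as np

def p_md(thetas, a_vec, b):
    """P(correct) = logistic(a1*theta1 + a2*theta2 - b)."""
    return 1.0 / (1.0 + np.exp(-(np.dot(a_vec, thetas) - b)))

person = [1.0, -0.5]                          # trait scores on factors 1 and 2

# A "simple structure" item loading only on factor 1, and a cross-loading item
print(round(p_md(person, [1.2, 0.0], 0.3), 3))
print(round(p_md(person, [0.8, 0.7], 0.3), 3))
```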
• 135. Partial Credit Model Variation

[Person–item map (figure): the central vertical line is the measurement scale, at an interval level. Persons are located to the left of the line and item categories to the right. Person A sits above Person B; Item 1's categories (1.1, 1.2, 1.3) and Item 2's categories (2.1, 2.2) are ordered on the scale with higher categories higher up, Item 1.3 being the hardest category shown.]
• 136. Partial Credit Model Variation

Items:

• Item 1.3 refers to Item 1, Category 3; Item 1.2 refers to Item 1, Category 2; and so on.

• The difficulty associated with Category 3 of Item 1 is greater than the difficulty associated with Category 2 of Item 1, and so on (ordered categories).

• The location of Item 1.3 on the scale indicates the ability associated with a 50% probability of passing Category 3 of Item 1 (or any of the lower categories).

Persons:

• The location of Person A on the scale indicates his ability.

• The probability of Person A passing categories at a lower level of difficulty is more than 50%.

• The probability of Person A passing categories at a higher level of difficulty is less than 50%.

• The probability of Person A passing categories at a level of difficulty equal to his ability is exactly 50%.
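A minimal sketch of partial credit model category probabilities for a single item with three ordered categories (the two step difficulties are made up), showing how the most likely category moves up as theta increases:

```python
import numpy as np

def pcm_probs(theta, deltas):
    """P(X = k | theta) for k = 0..m under the PCM, with step difficulties delta_1..delta_m."""
    steps = np.concatenate(([0.0], theta - np.asarray(deltas)))
    log_num = np.cumsum(steps)             # sum_{j<=k} (theta - delta_j), k = 0..m
    num = np.exp(log_num - log_num.max())  # numerically stabilised
    return num / num.sum()

deltas = [-0.5, 0.8]                       # hypothetical ordered step difficulties
for th in (-1.5, 0.0, 1.5):
    print(th, np.round(pcm_probs(th, deltas), 2))
# Low theta -> category 0 most likely; mid theta -> category 1; high theta -> category 2
```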