Classical test theory:

  • Classical test theory (CTT) incorporates terms such as ‘observed scores’ and ‘true scores’
  • There is a substantial emphasis on the description and estimation of the reliability of scores
  • Observed scores are considered the sum of true scores and error scores: in CTT, a person’s observed score on a test is their true score plus error (X = T + E)
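A minimal Python sketch of this decomposition (all values are made up for illustration); reliability then falls out as the ratio of true-score variance to observed-score variance:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1000
true_scores = rng.normal(loc=50, scale=10, size=n)   # T: never observed directly
error_scores = rng.normal(loc=0, scale=5, size=n)    # E: random error, mean zero, independent of T
observed_scores = true_scores + error_scores         # X = T + E

# CTT reliability: proportion of observed-score variance that is true-score variance
reliability = true_scores.var() / observed_scores.var()
print(round(reliability, 2))  # ~0.80 with these illustrative SDs (100 / 125)
```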

Modern test theory:

  • An alternative to CTT, more commonly referred to as item response theory (IRT)
  • More complicated than CTT, but its benefits are great enough to outweigh the drawbacks
  • The main practical difference and benefit of IRT over CTT is that it enables computer adaptive testing (CAT); in conventional fixed-form testing, the two approaches actually produce relatively similar results

IRT:

  • Operates under the premise that the response to any given item is influenced by two factors:

    o Qualities of the individual
    o Qualities of the item

  • There are three well-established models of IRT
  • In the most basic model:

    o Only one item characteristic is taken into consideration

  • The most common characteristic is item difficulty
  • That is, the probability that a person will answer the question correctly
  • e.g. on a five-item test of mathematical ability:
    o The likelihood that a person will respond correctly to any item will be affected by both their level of mathematical ability and the item’s difficulty

IRT and self-report questionnaires:

  • Although IRT is typically used in the context of intelligence or educational testing, the basic IRT model can be extended to personality-type questionnaires
  • The principles are effectively the same:
    o How much of the trait does the person possess?
    o How likely is it that someone would endorse or agree with the item?

Trait level:

  • Across all models of IRT, the notion of a person’s trait level is fundamental
  • People are considered to possess certain levels of a trait

    o Effectively includes all psychological constructs (personality, intelligence, attitudes, etc.)

Item difficulty:

  • An item’s level of difficulty is a factor affecting an individual’s probability of responding in a particular way
    o e.g. 2 + 2 versus the square root of 10,000
    o e.g. ‘I enjoy socialising with groups of people’ (high probability of endorsement) or ‘I enjoy speaking before large audiences’ (low probability)
    o e.g. ‘My job is OK’ (high probability of endorsement) or ‘I enjoy my job’ (moderate probability)

Trait level and item difficulty:

  • Trait level and item difficulty are intrinsically connected concepts in IRT
  • A difficult item requires a relatively high trait level in order to be answered correctly

IRT metric:

  • Trait levels and item difficulties are usually scored on a standardised metric
    o Mean = 0, SD = 1
  • Therefore a person with a trait level of 0 is average
  • Similarly, an item with a difficulty level of 0 is considered of average difficulty
  • An item’s difficulty is defined as the trait level required to have a .50 probability of answering the item correctly
    o If an item has a difficulty level of 1.5, then it takes a trait level of 1.5 to have a 50% chance of responding correctly
    o If an item has a difficulty of -1.5, then a person with a trait level of 1.0 would have a greater than 50% chance of answering correctly
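A quick numeric check of this definition, using the standard one-parameter (Rasch) logistic form; the function name here is just illustrative:

```python
import math

def p_correct(theta, b):
    """Rasch (1PL) probability of a correct response.
    theta = person's trait level, b = item difficulty, both on the z-score metric."""
    return 1 / (1 + math.exp(-(theta - b)))

print(p_correct(1.5, 1.5))             # 0.5  -> trait level equal to difficulty gives a 50% chance
print(round(p_correct(1.0, -1.5), 2))  # 0.92 -> well above 50% for an item easier than the person's level
```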

Item discrimination:

  • In addition to different levels of difficulty, items can also be differentiated with respect to how sharply they distinguish between trait levels, known as item discrimination
  • It is similar to the item-total correlation that we came across in the context of classical test theory
  • An item’s discrimination value indicates the relevance of the item to the trait being measured by the test (similar to component loadings)
  • Items that have large, positive discrimination values are good
  • Items have poor discrimination values when they are not particularly relevant to the trait of interest

IRT models:

  • Variety of models developed from the IRT perspective
  • The main way the models differ from each other is with respect to the nature and number of parameters they include

 

  • Three common models:
    o One-parameter logistic model
    o Two-parameter logistic model
    o Three-parameter logistic model

One-parameter logistic model:

  • Also known as the Rasch model
  • In addition to an individual’s trait level, it states that an individual’s response to an item is determined by:

    o The difficulty of the item (this is the one parameter)

Two-parameter logistic model:

  • In addition to an individual’s trait level, it states that an individual’s response to an item is determined by:

    o The difficulty of the item (parameter 1)
    o The discrimination of the item (parameter 2)

  • Not surprisingly, the two-parameter model is much more useful than the one-parameter model
  • Probably the most commonly used IRT model
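A hedged sketch of the two-parameter form, P = 1 / (1 + exp(-a(θ - b))), where a is the discrimination parameter; the function name is illustrative:

```python
import math

def p_correct_2pl(theta, b, a):
    """2PL probability of a correct response: b = difficulty, a = discrimination (slope)."""
    return 1 / (1 + math.exp(-a * (theta - b)))

# A highly discriminating item separates low and high trait levels much more
# sharply than a weakly discriminating item of the same difficulty:
for a in (0.5, 2.0):
    print(a, round(p_correct_2pl(-1, 0, a), 2), round(p_correct_2pl(1, 0, a), 2))
# a=0.5 -> 0.38 vs 0.62 (shallow curve); a=2.0 -> 0.12 vs 0.88 (steep curve)
```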

Three-parameter logistic model:

  • In addition to an individual’s trait level, it states that an individual’s response to an item is determined by:
    o The difficulty of the item (parameter 1)
    o The discrimination of the item (parameter 2)
    o The probability that the question can be answered correctly by guessing (parameter 3)
  • Especially useful in multiple-choice testing, but not commonly used (there is rarely much benefit to the third parameter)
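The third parameter adds a lower asymptote, so the curve bottoms out at the guessing probability rather than zero; a sketch, again with illustrative names:

```python
import math

def p_correct_3pl(theta, b, a, c):
    """3PL probability: c = pseudo-guessing parameter (lower asymptote),
    e.g. roughly 0.25 for a four-option multiple-choice item."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Even a very low-ability respondent retains roughly a one-in-four chance:
print(round(p_correct_3pl(-3, 0, 1.0, 0.25), 2))  # ~0.29
```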

IRT – dichotomous and polytomous items:

  • IRT is typically discussed in the context of dichotomously scored items (correct/incorrect)
  • However, there are models that can accommodate items responded to on rating scales

Item characteristic curve (ICC):

  • People who work with IRT often evaluate the quality of an item using a graph known as an item characteristic curve
  • It reflects the probability with which individuals across a range of trait levels are likely to answer the item correctly
  • The logistic formula and the parameters included in the model are used to predict the probabilities
    o Similar to using a regression equation
  • The lowest curve suggests that even a person three standard deviations above the mean has only an 80% chance of answering correctly (a very hard item)
  • The top curve suggests that even a person three standard deviations below the mean has a 30% chance of answering correctly (a very easy item)
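A sketch of how such curves could be drawn with matplotlib; the two difficulty values are assumptions, chosen only to roughly reproduce the 80% and 30% figures above:

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(-3, 3, 200)  # trait levels in SD units

def icc(theta, b, a=1.0):
    """Item characteristic curve under a logistic model."""
    return 1 / (1 + np.exp(-a * (theta - b)))

# Illustrative items: b = 1.6 gives ~80% correct at theta = +3 (very hard);
# b = -2.2 gives ~30% correct at theta = -3 (very easy)
for b, label in [(1.6, "very hard item"), (-2.2, "very easy item")]:
    plt.plot(theta, icc(theta, b), label=label)

plt.xlabel("Trait level (z-score metric)")
plt.ylabel("P(correct response)")
plt.legend()
plt.show()
```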

IRT and reliability:

  • You cannot calculate something like coefficient alpha in an IRT model
  • In CTT, reliability represents the level of true score variance associated with a group of composite scores
  • The reliability estimate is assumed to apply across all levels of composite scores (from low to high)
  • In IRT this notion is abandoned
    o For a given test, different levels of measurement error are acknowledged across different levels of a particular trait
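The notes don’t name the mechanism, but the standard way this is formalised is the test information function: under a 2PL model, I(θ) = Σ a²P(1 - P) summed over items, and the standard error of the trait estimate is 1/√I(θ), so precision varies with trait level. A sketch:

```python
import numpy as np

def se_of_theta(theta, difficulties, discriminations):
    """Standard error of the trait estimate at a given trait level,
    from the 2PL test information I(theta) = sum(a^2 * P * (1 - P))."""
    b = np.asarray(difficulties)
    a = np.asarray(discriminations)
    p = 1 / (1 + np.exp(-a * (theta - b)))
    return 1 / np.sqrt(np.sum(a**2 * p * (1 - p)))

# Items clustered around average difficulty measure average trait levels
# precisely but extreme trait levels poorly:
items_b = [-0.5, 0.0, 0.0, 0.5]   # made-up difficulties
items_a = [1.2, 1.0, 1.5, 1.0]    # made-up discriminations
for theta in (-2, 0, 2):
    print(theta, round(se_of_theta(theta, items_b, items_a), 2))
```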

Applications of IRT:

  • Test development and improvement
  • Differential item functioning (occurs when people from different groups with the same trait have a different probability of giving a certain response)
  • Person fit (for example, looking at the characteristics of a person and a job, and trying to match the right person to the right job)
  • Computer adaptive testing

CTT- testing example:

  • Generate 30 items, order them in terms of difficulty
  • Administer the items to participants in order of item difficulty
    o Either administer all of the items
    o Or stop after the participant makes three errors in a row

  • This is the conventional, static approach
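A minimal sketch of that static rule, assuming the items are already sorted by difficulty and that a hypothetical get_response(item) callable stands in for the respondent:

```python
def administer_static(items_by_difficulty, get_response):
    """Present items easiest-first; stop early after three errors in a row."""
    score = 0
    consecutive_errors = 0
    for item in items_by_difficulty:
        if get_response(item):       # True if answered correctly
            score += 1
            consecutive_errors = 0
        else:
            consecutive_errors += 1
            if consecutive_errors == 3:
                break
    return score
```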

Computer adaptive testing (CAT):

  • A dynamic approach to measuring a construct
  • CAT involves an algorithm which selects items to present to respondents based on their responses to previous items

  • In practice, no two people (or very few) who take a CAT would be expected to respond to the same items

The approach to CAT:

  • A large pool of items (say 300) is tested on a large normative sample (say 500 people)
  • Using the data from the normative sample, each item’s difficulty can be estimated
  • Items are then placed into categories of difficulty (up to 20 categories)
  • For the first item administered, the computer randomly selects an item with a difficulty level of 0
  • If the person gets the first item correct, the computer moves them up a difficulty level and selects a random item from that level
  • The computer will keep increasing difficulties until the person makes a mistake
  • As the computer continues to give the respondent items around this difficulty level, the standard error associated with the latent trait estimate continues to narrow
  • Based on an arbitrary cut-off point, the CAT terminates once the confidence interval around the trait estimate has narrowed sufficiently
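A rough sketch of that adaptive loop. It is simplified in two labelled ways: get_response is a hypothetical stand-in for the respondent, and a fixed max_items stands in for the standard-error stopping rule (a real CAT would re-estimate the trait and its standard error after every response):

```python
import random

def run_cat(item_bank, get_response, n_levels=20, max_items=30):
    """Simplified CAT. item_bank maps a difficulty level (0..n_levels-1) to a
    list of items; get_response(item) returns True for a correct answer."""
    level = n_levels // 2            # start near average difficulty (~0 on the z metric)
    administered = []
    for _ in range(max_items):       # stand-in for the standard-error cut-off
        item = random.choice(item_bank[level])
        administered.append(item)
        if get_response(item):
            level = min(level + 1, n_levels - 1)  # correct -> harder items
        else:
            level = max(level - 1, 0)             # error -> easier items
    return administered
```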

Advantages of CAT:

  • Testing can take as much as 50% less time
  • More accurate scores, because respondents answer more items in ‘their area of difficulty’
    o In CTT testing, many respondents waste time answering really easy items
    o In CTT testing, typically only a handful of items fall in the respondent’s area of difficulty

Disadvantages of CAT:

  • More time/money to develop
  • Participants don’t ‘trust’ it (how can you measure me in 10 minutes?)