Classical test theory (CTT):
- CTT incorporates terms such as 'observed scores' and 'true scores'
- There is a substantial emphasis on the description and estimation of the reliability of scores
- In CTT, a person's observed score on a test is considered a function of the sum of that person's true score and an error score (observed = true + error)
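A minimal simulation sketch of this decomposition (the score distributions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

true = rng.normal(50, 10, n)   # true scores (T)
error = rng.normal(0, 5, n)    # random error (E), uncorrelated with T
observed = true + error        # observed scores: X = T + E

# Reliability = true-score variance / observed-score variance (~0.80 here)
print(true.var() / observed.var())
```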
Modern test theory:
- An alternative to CTT, more commonly referred to as item response theory (IRT)
- More complicated than CTT, but its benefits are generally considered great enough to outweigh the drawbacks
- The main difference and benefit over CTT is that IRT enables computer adaptive testing (CAT); in terms of scores, the two approaches actually produce relatively similar results
IRT:
- Operates under the premise that the response to any given item is influenced by two factors:
o Qualities of the individual
o Qualities of the item
- There are three well-established models of IRT
- In the most basic model:
o Only item characteristics are taken into consideration
- Most common characteristic is item difficulty
- That is, how likely it is that a person will answer the item correctly
- e.g. on a five-item test of mathematical ability:
o The likelihood that a person will respond correctly to any item will be affected by their level of mathematical ability and by the item's difficulty
IRT and self-report questionnaires:
- Although IRT is typically used in the context of intelligence or educational testing, the basic IRT model can be extended to personality-type questionnaires
- The principles are effectively the same:
o How much of the trait does the person possess?
o How likely is it that someone would endorse or agree with the item?
Trait level:
- Across all models of IRT, the notion of a person’s trait level is fundamental
- People are considered to possess certain levels of a trait
- Effectively includes all psychological constructs (personality, intelligence, attitudes, etc.)
Item difficulty:
- An item's level of difficulty is a factor affecting an individual's probability of responding in a particular way
o e.g. 2 + 2 versus the square root of 10 000
- e.g. 'I enjoy socializing with groups of people' (high probability of endorsement) versus 'I enjoy speaking before large audiences' (low probability)
- e.g. 'My job is OK' (high probability of endorsement) versus 'I enjoy my job' (moderate probability)
Trait level and item difficulty:
- Trait level and item difficulty are intrinsically connected concepts in IRT
- A difficult item requires a relatively high trait level in order to be answered correctly
IRT metric:
- Trait levels and item difficulties are usually scored on a standardised metric:
o Mean = 0, SD = 1
- Therefore a person with a trait level of 0 is average
- Similarly, an item with a difficulty level of 0 is considered to be of average difficulty
- An item's difficulty is defined as the trait level required to have a .50 probability of answering the item correctly
o If an item has a difficulty level of 1.5, then it takes a trait level of 1.5 to have a 50% chance of responding correctly
o If an item has a difficulty of -1.5, then a person with a trait level of 1.0 would have a greater than 50% chance of answering correctly
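A minimal numeric sketch of this definition, assuming the standard one-parameter logistic form P(correct) = 1 / (1 + e^-(theta - b)):

```python
import math

def p_correct(theta, b):
    """Probability of a correct response: theta = trait level, b = item difficulty."""
    return 1 / (1 + math.exp(-(theta - b)))

print(p_correct(1.5, 1.5))    # 0.50: trait level equals the item's difficulty
print(p_correct(1.0, -1.5))   # ~0.92: trait level well above the difficulty
```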
Item discrimination:
- In addition to different levels of difficulty, items can also be differentiated with respect to how well they distinguish between trait levels, known as item discrimination
- It is similar to the item-total correlation that we came across in the context of classical test theory
- An item's discrimination value indicates the relevance of the item to the trait being measured by the test (similar to component loadings)
- Items that have large, positive discrimination values are good
- Items have poor discrimination values when they are not particularly relevant to the trait of interest
IRT models:
- Variety of models developed from the IRT perspective
- The main way the models differ from each other is with respect to the nature and number of parameters they include
- Three common models:
o One-parameter logistic model
o Two-parameter logistic model
o Three-parameter logistic model
One-parameter logistic model:
- Also known as the Rasch model
- In addition to an individual's trait level, it states that an individual's response to an item is determined by:
o The difficulty of the item (this is the one parameter)
Two-parameter logistic model:
- In addition to an individual's trait level, it states that an individual's response to an item is determined by:
o The difficulty of the item (parameter 1)
o The discrimination of the item (parameter 2)
- Not surprisingly, the two-parameter model is much more useful than the one-parameter model
- Probably the most commonly used IRT model
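A minimal sketch extending the earlier logistic function with a discrimination parameter (assuming the standard 2PL form):

```python
import math

def p_correct_2pl(theta, a, b):
    """2PL model: a = item discrimination (slope), b = item difficulty."""
    return 1 / (1 + math.exp(-a * (theta - b)))

# Higher discrimination -> a steeper curve around the difficulty level
print(p_correct_2pl(0.5, a=0.5, b=0.0))   # ~0.56 (weakly discriminating item)
print(p_correct_2pl(0.5, a=2.0, b=0.0))   # ~0.73 (strongly discriminating item)
```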
Three-parameter logistic model:
- In addition to an individual's trait level, it states that an individual's response to an item is determined by:
o The difficulty of the item (parameter 1)
o The discrimination of the item (parameter 2)
o The probability with which the item can be answered correctly by guessing (parameter 3)
- Especially useful in multiple-choice testing, but not commonly used (there is rarely much benefit to the third parameter)
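A sketch assuming the standard 3PL form, where the guessing parameter c acts as a lower asymptote:

```python
import math

def p_correct_3pl(theta, a, b, c):
    """3PL model: c = pseudo-guessing parameter (lower asymptote)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Even a very low-ability person keeps roughly a one-in-four chance
# on a four-option multiple-choice item
print(p_correct_3pl(-3.0, a=1.0, b=0.0, c=0.25))   # ~0.29
```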
IRT – dichotomous and polytomous items:
- IRT is typically discussed in the context of dichotomously scored items (correct/incorrect)
- However, there are models that can accommodate items responded to on rating scales
Item characteristic curve (ICC):
- People who work with IRT often evaluate the quality of an item using a graph known as an item characteristic curve
- Reflects the probability with which individuals across a range of trait levels are likely to answer each item correctly
- The logistic formula and the parameters included in the model are used to predict the probabilities
o Similar to using a regression equation
- The lowest curve suggests that a person three standard deviations above the mean has an 80% chance of answering correctly (a very hard item)
- The top curve suggests that a person three standard deviations below the mean has a 30% chance of answering correctly (a very easy item)
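A minimal plotting sketch of ICCs (assuming the 2PL form; the item parameters are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(-3, 3, 200)   # range of trait levels

# Illustrative items: (discrimination a, difficulty b)
items = {"easy item": (1.0, -2.0), "average item": (1.0, 0.0), "hard item": (1.0, 2.0)}

for label, (a, b) in items.items():
    p = 1 / (1 + np.exp(-a * (theta - b)))   # probability of a correct response
    plt.plot(theta, p, label=label)

plt.xlabel("Trait level (theta)")
plt.ylabel("P(correct)")
plt.title("Item characteristic curves")
plt.legend()
plt.show()
```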
IRT and reliability:
- You cannot calculate something like coefficient alpha in an IRT model
- In CTT, reliability represents the level of true score variance associated with a group of composite scores
- The reliability estimate is assumed to apply across all levels of composite scores (from low to high)
- In IRT this notion is abandoned:
o For a given test, different levels of measurement error are acknowledged across different levels of the trait
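A minimal sketch of this idea using the standard 2PL item information function, I(theta) = a^2 * P * (1 - P); the standard error of the trait estimate at a given level is 1/sqrt(test information), so precision varies across trait levels (the item parameters below are illustrative):

```python
import math

def item_information(theta, a, b):
    """2PL item information: I(theta) = a^2 * P * (1 - P)."""
    p = 1 / (1 + math.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

# Hypothetical five-item test: (discrimination a, difficulty b) pairs
items = [(1.2, -1.0), (1.0, -0.5), (1.5, 0.0), (1.0, 0.5), (1.2, 1.0)]

for theta in (-2.0, 0.0, 2.0):
    info = sum(item_information(theta, a, b) for a, b in items)
    print(f"theta={theta:+.1f}  SE={1 / math.sqrt(info):.2f}")
# SE is smallest near theta = 0, where these items are concentrated
```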
Applications of IRT:
- Test development and improvement
- Differential item functioning (occurs when people from different groups with the same trait have a different probability of giving a certain response)
- Person fit (for example, looking at the characteristics of a person and of a job, and trying to match the right person to the right job)
- Computer adaptive testing
CTT- testing example:
- Generate 30 items, order them in terms of difficulty
- Administer the items to participants in order of item difficulty:
o Either administer all of the items
o Or stop after the participant makes three errors in a row
- This is the conventional, static approach
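A minimal sketch of this static procedure (the item list and the answer function are hypothetical stand-ins):

```python
def administer_static(items_by_difficulty, answer):
    """Present items easiest-first; stop after three consecutive errors.
    answer(item) returns True for a correct response."""
    score, consecutive_errors = 0, 0
    for item in items_by_difficulty:
        if answer(item):
            score += 1
            consecutive_errors = 0
        else:
            consecutive_errors += 1
            if consecutive_errors == 3:
                break   # conventional discontinue rule
    return score
```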
Computer adaptive testing (CAT):
- A dynamic approach to measuring a construct
- CAT involves an algorithm which selects items to present to respondents based on their responses to previous items
- In practice, no two people (or very few) who take a CAT would be expected to respond to the same items
The approach to CAT:
- A large pool of items (say 300) is tested on a large normative sample (say 500 people)
- Using the data from the normative sample, each item's difficulty can be estimated
- Items are then placed into categories of difficulty (up to 20 categories)
- For the first item administered, the computer randomly selects an item with a difficulty level of 0
- If the person gets the first item correct, the computer moves them up a difficulty level and randomly selects an item from that level
- The computer keeps increasing the difficulty until the person makes a mistake
- As the computer continues to give the respondent items at this particular difficulty level, the standard error of estimate associated with the latent trait estimate continues to narrow
- Based on an arbitrary cut-off point, the CAT terminates once the confidence interval based on the standard error of estimate has narrowed sufficiently
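A simplified sketch of this adaptive loop (the pool structure, step rule, and stopping threshold are all assumptions for illustration):

```python
import random

def run_cat(item_pool, answer, se_threshold=0.3, max_items=30):
    """item_pool maps a difficulty level (-3..3) to a list of items;
    answer(item) returns True for a correct response."""
    level = 0                              # start at average difficulty
    administered = []
    while len(administered) < max_items:
        candidates = [i for i in item_pool.get(level, []) if i not in administered]
        if not candidates:
            break
        item = random.choice(candidates)   # random item at the current level
        administered.append(item)
        # Step the difficulty up after a correct answer, down after an error
        level = min(level + 1, 3) if answer(item) else max(level - 1, -3)
        # Crude stand-in for the standard error of the trait estimate:
        # it shrinks as more items are administered
        se = 1 / len(administered) ** 0.5
        if se < se_threshold:
            break                          # precise enough; terminate the test
    return level, administered
```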
Advantages of CAT:
- Testing can take as much as 50% less time
- More accurate scores, because respondents answer more items in 'their area of difficulty':
o In CTT testing, many respondents waste time answering really easy items
o In CTT testing, there are typically only a handful of items in a respondent's area of difficulty
Disadvantages of CAT:
- More time/money to develop
- Participants don’t ‘trust’ it (how can you measure me in 10 minutes?)