- Biases systematically and unjustifiably obscure or cause differences among the test scores of groups of respondents
- It is important that test scores do not discriminate unjustifiably against any particular group
- In this context, ‘group’ almost invariably refers to gender and/or race
Gender differences in mathematics:
- Females on average tend to score lower on mathematics in comparison to males
- However, is it possible that conventional mathematics tests discriminate against females? If so, then the differences should not be interpreted as a construct-level difference
Construct bias:
- Concerns the relationship of observed scores to true scores on a psychological test
o Different factorial validity between groups
- If the relationship is systematically different for different groups, then we might conclude that the test is biased
o In some cases, two people with different observed scores may actually have the same level of the construct
o Even when the two group means are equal, construct bias must still be tested (bias may be what produces the equal means)
Predictive bias:
- Occurs when a test’s use has different implications for two or more groups
- e.g. using the ATAR to predict first-year university performance
o If it were discovered that the ATAR was differentially predictive of academic performance for males in comparison to females, then that would be an example of predictive bias
- Construct bias does not necessarily imply predictive bias
Identifying test score bias:
- There are at least two categories of methods that can be used to identify test score bias
o Internal methods to identify construct bias
o External methods to identify predictive bias
- No single method can be used to establish bias
- This is analogous to establishing validity for test scores
Test bias is about interpreting scores (validity): are the interpretations of those results justifiable across all groups?
Test score bias vs. true group differences:
- Simply because two groups differ in their mean scores on a test does not imply that the test suffers from test score bias
- If women score higher on a scale that putatively measures openness to feelings, then it is entirely possible that they are, as a group, more open to feelings than men
- Similarly, a weighing scale would be expected to demonstrate that on average, males are heavier than females
- This does not mean that the scale is biased
Detecting construct bias- internal evaluation:
- Construct bias can only be detected through internal evaluation
- Evaluating construct bias typically involves conducting analyses at the item level
An item on a test is biased if:
o Persons belonging to different groups respond in different ways to the item, and
o It can be shown that the differing responses are not related to group differences in the psychological attribute putatively measured by the test
i.e. variance not related to the true construct
- In plain language, an item is considered biased if two people (e.g. a male and a female) who have the same level of an attribute (e.g. openness to emotions) nevertheless tend to respond differently to that particular item
o Construct bias is present when different groups perceive the item differently for some reason
Approach to evaluating item bias:
- The procedure focuses on the internal structure of the test
o No need to collect data from other measures/occasions
- Internal structure refers to the
o Pattern of correlations among items and/or
o The correlations between each item and the total test score
- If the two groups exhibit different internal structures to their test responses, then we can conclude that the test is likely to suffer from construct bias
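One way to look at internal structure is to compute, within each group separately, the correlation between each item and the total of the remaining items. The sketch below is only an illustration: the function names and the response data are hypothetical, and it uses the "corrected" item-total correlation (item removed from the total), which is one common choice rather than something these notes prescribe.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_total_correlations(responses):
    """Correlate each item with the sum of the other items.

    `responses` is a list of respondents, each a list of item scores.
    """
    n_items = len(responses[0])
    result = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]  # total without item j
        result.append(pearson(item, rest))
    return result


# Hypothetical data: rows are respondents, columns are three items (1-5 scale).
group_a = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 4, 5], [5, 5, 4]]
group_b = [[1, 2, 5], [2, 1, 4], [3, 4, 3], [4, 5, 1], [5, 4, 2]]

r_a = item_total_correlations(group_a)
r_b = item_total_correlations(group_b)

for j, (ra, rb) in enumerate(zip(r_a, r_b), start=1):
    print(f"item {j}: group A r = {ra:.2f}, group B r = {rb:.2f}")
```

In this made-up data the third item relates positively to the other items in group A but negatively in group B, which is the kind of difference in internal structure that would prompt a closer look at that item.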
Factor analysis and bias:
- Factor analysis (or the closely related principal components analysis, PCA) is a data reduction technique used to
o Determine the number of dimensions associated with a collection of items (or subscales)
o Define the nature of each dimension based on the factor loadings
Construct bias is present if:
- A different number of dimensions is associated with the test scores when a factor analysis is performed on the groups separately
o e.g. when analysed separately, one dimension is found for men and two dimensions are found for women
- More subtle construct bias can be present even when the groups are associated with the same number of dimensions; this happens more often, and the bias shows up as a different pattern of factor loadings between the groups
- We estimate this with a factor congruence coefficient
- The coefficient should be positive and large to indicate the absence of construct bias
- It is obtained by correlating the items’ component loadings from one group (e.g. males) with the corresponding loadings from the other group (e.g. females)
Factor congruence coefficient:
- There are no universally agreed guidelines to help interpret factor congruence coefficients
- The correlation, however, should at least be positive
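As a rough illustration of how such a coefficient can be computed, the sketch below uses Tucker's congruence coefficient (the sum of cross-products of the two loading vectors divided by the square root of the product of their sums of squares). The notes do not say which formula is intended, so this choice is an assumption, and the loadings are invented.

```python
from math import sqrt


def congruence(loadings_a, loadings_b):
    """Tucker's congruence coefficient between two loading vectors.

    Both vectors must list the same items in the same order.
    """
    cross = sum(a * b for a, b in zip(loadings_a, loadings_b))
    ss_a = sum(a * a for a in loadings_a)
    ss_b = sum(b * b for b in loadings_b)
    return cross / sqrt(ss_a * ss_b)


# Hypothetical loadings of four items on the first component.
men = [0.70, 0.65, 0.80, 0.60]
women_similar = [0.72, 0.60, 0.78, 0.65]
women_different = [0.70, -0.10, 0.15, 0.60]

phi_similar = congruence(men, women_similar)
phi_different = congruence(men, women_different)

print(f"similar pattern:   {phi_similar:.2f}")
print(f"different pattern: {phi_different:.2f}")
```

A value close to 1 means the two groups show nearly the same loading pattern; the second comparison gives a clearly lower value because two of the items load very differently for the two groups.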
Differential item functioning (DIF) analysis:
- DIF is a sophisticated approach to the assessment of construct bias developed within the context of item response theory (IRT)
- Theoretically, DIF assumes that we can estimate a person’s trait level on a particular dimension via the test data
- This is true for all groups
- We wish to determine whether the trait levels and the item responses match up in the same way for both groups
- If they don’t, we have evidence for construct bias
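The matching idea can be illustrated without fitting a full IRT model. The sketch below swaps in the Mantel-Haenszel common odds ratio, a non-IRT DIF statistic that uses total-score bands as a stand-in for trait level; it is not the IRT procedure these notes refer to, and the counts and function name are hypothetical.

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio for one item across matched score bands.

    Each stratum is a tuple of counts for one total-score band:
    (reference_correct, reference_incorrect, focal_correct, focal_incorrect).
    """
    numerator = 0.0
    denominator = 0.0
    for ref_ok, ref_bad, foc_ok, foc_bad in strata:
        n = ref_ok + ref_bad + foc_ok + foc_bad
        if n == 0:
            continue  # skip empty score bands
        numerator += ref_ok * foc_bad / n
        denominator += ref_bad * foc_ok / n
    return numerator / denominator


# Hypothetical counts for low, medium and high total-score bands.
no_dif_item = [(10, 30, 10, 30), (20, 20, 20, 20), (30, 10, 30, 10)]
dif_item = [(20, 20, 10, 30), (30, 10, 20, 20), (36, 4, 30, 10)]

or_no_dif = mantel_haenszel_odds_ratio(no_dif_item)
or_dif = mantel_haenszel_odds_ratio(dif_item)

print(f"odds ratio, item without DIF: {or_no_dif:.2f}")
print(f"odds ratio, item with DIF:    {or_dif:.2f}")
```

An odds ratio near 1 means that respondents matched on total score answer the item correctly at similar rates in both groups; a value well away from 1 means that one group finds the item easier even at the same overall level.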
Rank order consistency:
- A simple way to test for evidence of construct bias
o Calculate the mean of each item separately for each group
o Then calculate Spearman’s rank correlation (rho) between the two sets of item means (i.e. item difficulties)
- If the correlation between item means is low, then we would have some evidence to suggest construct bias
- A correlation between item means of less than .90 is suggested as low, i.e. as indicating construct bias
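The rank order check can be sketched in a few lines. The item means below are invented, the function names are hypothetical, and Spearman's rho is computed here as the Pearson correlation of the ranks (with tied values sharing their average rank).

```python
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


# Hypothetical item means (proportion correct) for five items.
male_means = [0.90, 0.75, 0.60, 0.45, 0.30]
female_means_same_order = [0.85, 0.70, 0.65, 0.40, 0.20]
female_means_reordered = [0.85, 0.30, 0.65, 0.40, 0.70]

rho_same = spearman(male_means, female_means_same_order)
rho_reordered = spearman(male_means, female_means_reordered)

print(f"same difficulty order:      rho = {rho_same:.2f}")
print(f"different difficulty order: rho = {rho_reordered:.2f}")
```

In the first comparison the items are ordered from easiest to hardest in the same way for both groups, so rho is 1. In the second, two items swap their relative difficulty for one group and rho falls far below the .90 benchmark mentioned above.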
Detecting predictive bias- external evaluation:
- Predictive bias can only be detected through external evaluation
- Construct bias tends to matter more in pure research contexts; predictive bias, by contrast, is a more applied consideration
- Specifically, do the scores from a particular test predict a criterion (or outcome) with equal accuracy for two or more groups?
- e.g. government employees who create the tests behind the ATAR should make sure that the scores have equal predictive validity with respect to university grade performance across all groups
Predictive bias analyses:
- Involves two steps:
- Need to determine whether the test scores predict the dependent variable to start with
- Need to determine whether the test scores predict the dependent variable equally across groups
- Both steps involve conducting a regression analysis
Bivariate regression review:
- Is an extension of Pearson correlation
- Consists of using one continuously measured variable to predict another continuously measured variable
- The main difference is that bivariate regression includes additional terms which can be used to create a regression equation
Intercept:
- The expected value of Y when X = 0
- The point at which the regression line crosses the Y-axis
Slope:
- The expected increase or decrease in the dependent variable (Y) for a one-unit increase in the independent variable (X)
Predicting people’s scores:
- The intercept is the predicted dependent score when the independent score is zero
- The slope is the change in the predicted dependent score for each one-unit increase in the independent score
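A minimal least-squares sketch of the intercept and slope described above. The data are invented so that the outcome lies exactly on a line, and `fit_line` is a hypothetical helper name.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for predicting y from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical test scores (X) and first-year grades (Y).
test_scores = [50, 60, 70, 80, 90]
grades = [55, 60, 65, 70, 75]

intercept, slope = fit_line(test_scores, grades)

# Predicted grade for a new person with a test score of 76.
predicted = intercept + slope * 76

print(f"Y = {intercept:.1f} + {slope:.2f} * X")
print(f"predicted grade for X = 76: {predicted:.1f}")
```

Here the fitted equation is Y = 30 + 0.5X, so each extra point on the test corresponds to half a point on the predicted grade, and a test score of 76 gives a predicted grade of 68.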
One size fits all:
- The assumption that different groups share a common regression equation is based on the idea that ‘one size fits all’
- When this is observed, there is no predictive bias associated with the test scores
- The regression equation derived from the total sample is referred to as the ‘common regression equation’
- We also use the terms ‘common intercept’ and ‘common slope’
Next we estimate separate regression equations for each group
We then compare the group-level regression equations with the common regression equation
- If the group-level estimates do not match the common regression equation, then you might suspect that your test scores are biased
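The comparison of group-level and common regression equations can be sketched as follows. The data and names are hypothetical, and the example only compares the estimates by eye; a formal analysis would also test whether the differences are statistically significant.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for predicting y from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical test scores (X) and outcome scores (Y) for two groups.
male_x = [50, 60, 70, 80, 90]
male_y = [60, 65, 70, 75, 80]
female_x = [50, 60, 70, 80, 90]
female_y = [55, 60, 65, 70, 75]

# Common regression equation: everyone pooled into one sample.
common = fit_line(male_x + female_x, male_y + female_y)

# Group-level regression equations.
male = fit_line(male_x, male_y)
female = fit_line(female_x, female_y)

for label, (b0, b1) in [("common", common), ("male", male), ("female", female)]:
    print(f"{label:>6}: intercept = {b0:.1f}, slope = {b1:.2f}")
```

In this invented example all three slopes agree, but the group intercepts sit either side of the common intercept, so the common equation would under-predict for one group and over-predict for the other.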
Testing predictive bias statistically- intercept bias:
- If the slopes are equal for both groups, there is no slope bias
- What if the intercepts are different?
o This suggests intercept bias, e.g. if the male intercept is higher, males are rated (predicted) higher than females with the same test score
- If the lines run parallel, there is no indication of slope bias
Slope bias:
- Suppose two groups had similar intercepts, but their slopes differed in magnitude
- This implies that we would expect different predicted scores for males and females with the same test score
Intercept and slope bias:
- There are occasions where both intercept and slope bias co-occur
- It is rare to have one of slope bias or intercept bias without the other
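The practical difference between the two kinds of bias shows up in the gap between the groups' predicted scores. The sketch below uses invented regression lines: with intercept bias only, the gap is the same at every test score, whereas with slope bias the gap changes with the test score.

```python
def predict(intercept, slope, x):
    """Predicted outcome for test score x from a regression line."""
    return intercept + slope * x


# Hypothetical group-level regression lines as (intercept, slope).
intercept_bias = {"male": (35.0, 0.5), "female": (30.0, 0.5)}  # parallel lines
slope_bias = {"male": (30.0, 0.6), "female": (30.0, 0.4)}      # same intercept


def gap(lines, x):
    """Male minus female predicted outcome at the same test score x."""
    return predict(*lines["male"], x) - predict(*lines["female"], x)


scores = (50, 70, 90)
gaps_intercept = [gap(intercept_bias, x) for x in scores]
gaps_slope = [gap(slope_bias, x) for x in scores]

for x, gi, gs in zip(scores, gaps_intercept, gaps_slope):
    print(f"X = {x}: gap with intercept bias = {gi:.1f}, with slope bias = {gs:.1f}")
```

With these made-up lines the intercept-bias gap stays at 5 points across the range, while the slope-bias gap grows from 10 to 18 points as the test score rises.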
Outcome score bias:
- Thus far, the discussion of bias has focused on the predictor variable (e.g. test/questionnaire)
- It is entirely possible that the outcome variable scores are the ones that are biased (e.g. level of relationships, tester’s mood, supervisor ratings)
- For a full analysis of bias, you would want to examine both possibilities: predictor bias and outcome bias
The effect of reliability:
- There are occasions where one group will have less reliable scores than the other group (reliabilities are unlikely to ever be exactly equal)
- The differences in reliabilities can cause differences in slopes and intercepts
- Differences in reliability are a form of construct bias that can have an impact on predictive bias
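Why reliability affects slopes and intercepts can be illustrated with the classical attenuation result: when the predictor contains random measurement error, the expected observed slope equals the true slope multiplied by the predictor's reliability. The sketch below assumes that model (error uncorrelated with true scores and with the outcome, observed and true predictor means equal); the numbers and the function name are hypothetical.

```python
def attenuated_line(true_intercept, true_slope, reliability, mean_x):
    """Expected observed regression line when X is measured with random error.

    Assumes classical test theory: the observed slope is the true slope
    times the reliability of X, and the line still passes through the means.
    """
    observed_slope = true_slope * reliability
    mean_y = true_intercept + true_slope * mean_x
    observed_intercept = mean_y - observed_slope * mean_x
    return observed_intercept, observed_slope


# Both groups share the same true relationship: Y = 30 + 0.5 * X, mean X = 70.
group_a = attenuated_line(30.0, 0.5, reliability=0.90, mean_x=70.0)
group_b = attenuated_line(30.0, 0.5, reliability=0.70, mean_x=70.0)

print(f"group A (reliability .90): intercept = {group_a[0]:.1f}, slope = {group_a[1]:.2f}")
print(f"group B (reliability .70): intercept = {group_b[0]:.1f}, slope = {group_b[1]:.2f}")
```

Even though the true relationship is identical, the group with the less reliable scores ends up with a flatter slope and a higher intercept, which could be mistaken for predictive bias.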
Keep in mind:
- The absence of the difference between two means might be caused by construct bias
- We should probably always test the possibility of construct bias when we test the difference between two means based on composite scores