- An index of the item's discrimination

**The Item-Difficulty Index**

- An index of an item's difficulty is obtained by calculating the proportion of the total number of testtakers who answered the item correctly.
- *p* = item difficulty
- A subscript refers to the item number, e.g. p₁
- The value of an item-difficulty index can theoretically range from 0 (if no one got the item right) to 1 (if everyone got the item right), e.g. if 50/100 people got item 2 right, p₂ = .5
- The larger the item-difficulty index, the easier the item
- Equivalent to the percent of people passing the item
- In other contexts, such as personality testing, the item-difficulty index is called the *item-endorsement index*: the statistic provides a measure not of the percent of people passing the item but of the percent of people who said yes to, agreed with, or endorsed the item.
- An index of the difficulty of the average test item for a particular test can be calculated by averaging the item-difficulty indices for all the test's items.
  - Sum the item-difficulty indices for all test items, then divide by the total number of items on the test
\n- For maximum discrimination among the abilities of the testtakers, the optimal average item difficulty is approx. .5, with individual items on the test ranging in difficulty from about .3 to .8.<\/li>\n
- Possible effect of guessing must be considered for selected-response tests<\/li>\n
- With this type of item, the optimal average item difficulty is usually the midpoint between 1.00 and the chance success proportion<\/strong> (probability of answering correctly by random guessing)\n
\n- g. in true-false, chance success proportion is 1\/2, so optimal item difficulty is half way between .50 and 1.00.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n
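A minimal sketch of these calculations in Python (the response matrix and values below are hypothetical, not from the notes):

```python
import numpy as np

# Hypothetical scored responses: rows = testtakers, columns = items (1 = correct)
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 0, 0],
])

# Item-difficulty index p_j: proportion of testtakers answering item j correctly
p = responses.mean(axis=0)
print("item difficulties:", p)

# Average item difficulty for the test: mean of the item-difficulty indices
print("average difficulty:", p.mean())

# Optimal difficulty for a selected-response item: midpoint between 1.00
# and the chance success proportion g
def optimal_difficulty(n_options: int) -> float:
    g = 1 / n_options            # probability of a correct random guess
    return (1.0 + g) / 2

print(optimal_difficulty(2))     # true-false: (1 + .50) / 2 = .75
print(optimal_difficulty(4))     # four-option multiple choice: (1 + .25) / 2 = .625
```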
**Differential Item Functioning (DIF):** an item functions differently in one group of testtakers as compared to another group of testtakers known to have the same or similar level of the underlying trait

- Instruments containing such items may have reduced validity for between-group comparisons because their scores may indicate a variety of attributes other than those the scale is intended to measure.
- **DIF analysis:** test developers scrutinise group-by-group item response curves, looking for DIF items.
- **DIF items:** items that respondents from different groups at the same level of the underlying trait have different probabilities of endorsing as a function of their group membership
- DIF analysis has been used to evaluate measurement equivalence in item content across groups that vary by culture, gender and age; one common procedure is sketched below
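The notes do not name a particular statistic, but one widely used DIF procedure is the Mantel-Haenszel method: testtakers are matched on total score, and the odds of answering the studied item correctly are compared across groups within each score stratum. A minimal sketch (all data and variable names are hypothetical):

```python
import numpy as np

def mantel_haenszel_odds_ratio(item_correct, group, total_score):
    """Mantel-Haenszel common odds ratio for one studied item.

    item_correct : 0/1 scores on the studied item
    group        : 0 = reference group, 1 = focal group
    total_score  : matching variable, e.g. total test score

    A value near 1.0 suggests no DIF; values far from 1.0 suggest the item
    favours one group at matched levels of the underlying trait.
    """
    item_correct = np.asarray(item_correct)
    group = np.asarray(group)
    total_score = np.asarray(total_score)
    num = den = 0.0
    for s in np.unique(total_score):      # stratify by matched total score
        k = total_score == s
        a = np.sum((group[k] == 0) & (item_correct[k] == 1))  # reference, correct
        b = np.sum((group[k] == 0) & (item_correct[k] == 0))  # reference, incorrect
        c = np.sum((group[k] == 1) & (item_correct[k] == 1))  # focal, correct
        d = np.sum((group[k] == 1) & (item_correct[k] == 0))  # focal, incorrect
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den if den > 0 else float("nan")
```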
**Developing item banks**

- Each of the items assembled as part of an item bank, whether taken from an existing test or written especially for the item bank, has undergone rigorous qualitative and quantitative evaluation
- All items available for use, as well as new items created especially for the item bank, constitute the item pool
- The item pool is evaluated by content experts, potential respondents and survey experts using a variety of qualitative and quantitative methods
- Individual items in an item pool may be evaluated by cognitive testing procedures → an interviewer conducts one-on-one interviews with respondents in an effort to identify any ambiguities associated with the items.
- Item pools may also be evaluated by groups of respondents, which allows for discussion of the clarity and relevance of each item, among other item characteristics → items that make the cut after such scrutiny constitute the preliminary item bank
- The next step in creating the item bank is the administration of all the questionnaire items to a large and representative sample of the target population → for ease of data analysis, group administration by computer is preferable.
- However, depending upon the content and method of administration required by the items, the questionnaire (or portions of it) may be administered individually using paper-and-pencil methods.
- After administration of the preliminary item bank to the entire sample of respondents, responses to the items are evaluated with regard to several variables such as validity, reliability, domain coverage and differential item functioning
- The final item bank will consist of a large set of items all measuring a single domain (single trait or ability)
- A test developer will use the banked items to create one or more tests with a fixed number of items, e.g. a teacher may create two different versions of a math test in order to minimise efforts by testtakers to cheat
- When used within a CAT environment, a testtaker's response to an item may automatically trigger which item is presented to the testtaker next
- The software has been programmed to present next the item that will be most informative with regard to the testtaker's standing on the construct being measured; one common selection rule is sketched below
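The notes do not specify the selection algorithm, but a common CAT rule is to administer the unused item with maximum Fisher information at the current ability estimate. A minimal sketch under an assumed two-parameter logistic (2PL) model, with hypothetical item parameters:

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1 / (1 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1 - p)

# Hypothetical banked items: (discrimination a, difficulty b)
bank = [(1.2, -0.5), (0.8, 0.0), (1.5, 0.7), (1.0, 1.2)]
administered = {0}        # indices of items already presented
theta_hat = 0.4           # current estimate of the testtaker's ability

# Present next the item that is most informative at theta_hat
candidates = [i for i in range(len(bank)) if i not in administered]
next_item = max(candidates, key=lambda i: item_information(theta_hat, *bank[i]))
print("next item:", next_item)
```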
**The Item-Discrimination Index**

- If most of the high scorers fail a particular item, these testtakers may be making an alternative interpretation of a response intended to serve as a distractor.
- In such a case, the test developer should interview the examinees to understand better the basis for the choice and then appropriately revise (or eliminate) the item.
- Item-discrimination index = *d*
- An estimate of item discrimination compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores
- The optimal boundary lines for what we refer to as the "upper" and "lower" areas of a distribution of scores demarcate the upper and lower 27% of the distribution, provided the distribution is normal.
- As the distribution of test scores becomes more platykurtic (flatter), the optimal boundary line for defining upper and lower increases to near 33%.
- For most applications, any percentage between 25 and 33 will yield similar estimates.
- **The item-discrimination index is a measure of the difference between the proportion of high scorers answering an item correctly and the proportion of low scorers answering the item correctly**
- A higher value of *d* means a greater proportion of high scorers than low scorers answered the item correctly
- A negative *d* value is a red flag → it indicates that low-scoring examinees are more likely to answer the item correctly than high-scoring examinees.
- The highest possible value of *d* is +1.00 → all members of the Upper group (top 27%) answered correctly and all members of the Lower group (bottom 27%) answered incorrectly
- If the same proportion of members of the U and L groups pass the item, the item is not discriminating between testtakers at all and *d* will be 0.
- The lowest value is *d* = −1.00 → all members of the U group failed the item and all members of the L group passed it. (The calculation is sketched below.)
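A minimal sketch of the calculation, using the 27% rule and a hypothetical 0/1 response matrix:

```python
import numpy as np

def item_discrimination(responses, item, fraction=0.27):
    """d = proportion correct in the Upper group minus proportion correct
    in the Lower group, with groups defined by total test score."""
    totals = responses.sum(axis=1)
    order = np.argsort(totals)               # testtakers sorted by total score
    n = max(1, int(round(fraction * len(totals))))
    lower = responses[order[:n], item]       # bottom 27% of scorers
    upper = responses[order[-n:], item]      # top 27% of scorers
    return upper.mean() - lower.mean()       # d, between -1.00 and +1.00

# Hypothetical scored responses: rows = testtakers, columns = items
rng = np.random.default_rng(0)
responses = (rng.random((50, 10)) > 0.4).astype(int)
print(item_discrimination(responses, item=0))
```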
**Analysis of item alternatives**

- The quality of each alternative within a multiple-choice item can be readily assessed with reference to the comparative performance of upper and lower scorers.
- No formulas or statistics are required
- By charting the number of testtakers in the U and L groups who chose each alternative, the test developer can get an idea of the effectiveness of a distractor by means of a simple eyeball test → look at the distribution of answers for each item (an example follows).
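A hypothetical example of such a chart for one item (keyed answer = B); the counts below are invented for illustration:

```python
from collections import Counter

# Hypothetical alternative choices on one multiple-choice item (key = "B")
upper_choices = ["B"] * 20 + ["A"] * 2 + ["C"] * 1 + ["D"] * 6 + ["E"] * 1
lower_choices = ["B"] * 9 + ["A"] * 8 + ["C"] * 5 + ["D"] * 4 + ["E"] * 4

for label, choices in [("Upper 27%", upper_choices), ("Lower 27%", lower_choices)]:
    counts = Counter(choices)
    print(label, {alt: counts.get(alt, 0) for alt in "ABCDE"})

# Eyeball test: distractor D attracts a fair number of Upper-group testtakers,
# which may signal an alternative interpretation worth investigating.
```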
**Item characteristic curves (ICCs)**

- Item characteristic curve: a graphic representation of item difficulty and discrimination
- A steeper slope indicates greater item discrimination
- Items may vary in terms of difficulty level → an easy item shifts the ICC to the left along the ability axis; a difficult item shifts the ICC to the right along the horizontal axis
- For a difficult item, it takes a high ability level for a person to have a high probability of their response being scored as correct (see the sketch below)
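The notes do not give a formula, but the same assumed 2PL form used in the CAT sketch above produces these curves: P(θ) = 1 / (1 + e^(−a(θ − b))), where a sets the slope (discrimination) and b the horizontal position (difficulty). A minimal sketch with hypothetical parameters:

```python
import numpy as np

def icc(theta, a, b):
    """2PL item characteristic curve: probability of a correct response."""
    return 1 / (1 + np.exp(-a * (theta - b)))

theta = 0.0                        # a testtaker of average ability
print(icc(theta, a=1.0, b=-1.0))   # easy item (curve shifted left): ~.73
print(icc(theta, a=1.0, b=1.0))    # difficult item (curve shifted right): ~.27
print(icc(theta, a=2.0, b=0.0))    # steeper slope → sharper discrimination near b
```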
**Other considerations in Item Analysis**

Guessing

Criteria that any correction for guessing must meet (the classic correction is sketched after this list):
1. It must recognise that when a respondent guesses at an answer on an achievement test, the guess is not typically made on a totally random basis. It is more reasonable to assume that the testtaker's guess is based on some knowledge of the subject matter and the ability to rule out one or more of the distractor alternatives. However, the individual testtaker's amount of knowledge of the subject matter will vary from one item to the next.
2. It must deal with the problem of omitted items. Sometimes, instead of guessing, the testtaker will simply omit a response to an item. Should the omitted item be scored wrong? Should it be excluded? etc.
3. It must address the fact that some testtakers may be luckier than others in guessing the choices that are keyed correct.
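The notes do not state the formula, but the classic correction for guessing (against which these criteria are usually weighed) assumes purely random guessing, which is exactly the assumption criterion 1 questions. A minimal sketch:

```python
def corrected_score(num_right: int, num_wrong: int, n_options: int) -> float:
    """Classic correction for guessing: R - W / (n - 1).

    Omitted items are excluded from W here; as criterion 2 notes,
    how to handle omissions is itself a contested choice."""
    return num_right - num_wrong / (n_options - 1)

# e.g. 60 right, 20 wrong on four-option items → 60 - 20/3 ≈ 53.33
print(corrected_score(60, 20, 4))
```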
- No solution to the problem of guessing has been deemed entirely satisfactory
- Responsible test developers address the problem of guessing by including in the test manual:
  1. Explicit instructions regarding this point for the examiner to convey to the examinees
  2. Specific instructions for scoring and interpreting omitted items
**Item fairness**

- The degree to which a test item is biased
- *Biased test item:* an item that favours one particular group of examinees in relation to another when differences in group ability are controlled
- Item characteristic curves can be used to identify biased items
- Specific items are identified as biased in a statistical sense if they exhibit differential item functioning
- *Differential item functioning:* exemplified by different shapes of item characteristic curves for different groups (e.g. men and women) when the two groups do not differ in total test score.
- If an item is to be considered fair to different groups of testtakers, the item characteristic curves for the different groups should not be significantly different.
- Establishing the presence of differential item functioning requires a statistical test of the null hypothesis of no difference between the item characteristic curves of the two groups
- Items exhibiting a significant difference in item characteristic curves must be revised or eliminated

**Speed tests**
- Item analyses of tests taken under speed conditions yield misleading or uninterpretable results.
- The closer an item is to the end of the test, the more difficult it may appear to be → testtakers may not get to items near the end of the test before time runs out
- Measures of item discrimination may be artificially high for late-appearing items → testtakers who know the material better may work faster and are thus more likely to answer the later items
- Items appearing late in a speed test are consequently more likely to show positive item-total correlations because of the select group of examinees reaching those items.
- So, how can items on a speed test be analysed? → the most obvious solution is to restrict the item analysis to the items completed by each testtaker → not recommended, for three reasons:
  1. Item analyses of the later items would be based on a progressively smaller number of testtakers, yielding progressively less reliable results
  2. If the more knowledgeable examinees reach the later items, then part of the analysis is based on all testtakers and part is based on a selected sample
  3. Because the more knowledgeable testtakers are more likely to score correctly, their performance will make items occurring toward the end of the test appear easier than they are
**Qualitative Item Analysis**

- **Qualitative methods:** techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures
- Encouraging testtakers to discuss aspects of their test-taking experience is one way of eliciting/generating such data
- These data may be used to improve the test
- **Qualitative item analysis:** a general term for various nonstatistical procedures designed to explore how individual test items work.
- The analysis compares individual test items to each other and to the test as a whole
- Involves exploration of the issues through verbal means such as interviews and group discussions conducted with testtakers and other relevant parties
- Caution → the process may be abused: respondents may be disgruntled for any number of reasons, from failure to prepare adequately for the test to disappointment in their test performance.
**"Think aloud" test administration**

- Having respondents verbalise their thoughts as they occur
- Sheds light on the testtaker's thought processes during the administration of a test
- Conducted on a one-to-one basis
**Expert panels**

- *Sensitivity review:* a study of test items, typically conducted during the test development process, in which items are examined for fairness to all prospective testtakers and for the presence of offensive language, stereotypes, or situations
- Possible forms of content bias that may appear in an achievement test:
  - Status
  - Stereotype
  - Familiarity
  - Offensive choice of words
  - Other
**TEST REVISION**

**Test Revision as a Stage in New Test Development**