Computer-based exams have long favored multiple-choice question formats because of technological restrictions, but considerable advances in testing software now allow more innovative item types to be used, such as multiple select, drag and drop, reorder, matching, true/false, and even long-answer essay questions. These item types offer different ways to assess candidates, especially on licensure and certification examinations. An important question is how these innovative item types actually perform compared with typical multiple-choice examinations.
Case Study
With the permission of an organization that evaluates candidates in a high-stakes healthcare profession in Canada, five different scoring models were applied to an examination containing innovative item types: a Dichotomous model, a Partial Credit (Polytomous) model, two Modified Dichotomous models, and a Trap Door model. The effect of these scoring models on the overall criterion-referenced psychometrics of this examination is determined and described via Classical Test Theory (CTT) and Item Response Theory (IRT).
Scoring Models
Item scoring models can be classified into two broad categories: dichotomous (i.e., right/wrong) and polytomous (i.e., partial credit) (Parshall & Harmes, 2007). Depending on the characteristics of the item type, the number and type of scoring models within each category can vary. A research study on a high-stakes exam was used to test how innovative item types, scored under several different models, would affect the passing rate and the amount of information obtained about the candidates themselves.
For Multiple Choice (MC) questions, the dichotomous scoring model is consistent with the organization's existing examination policy (no negative marks for incorrect responses). Like the MC items, Multiple Select (MS), Reorder, and Matching items are scored dichotomously. However, for MS, Reorder, and Matching items there is the potential to obtain richer information by using partial credit scoring models. The ICE Research and Development Committee (2017) has noted that, “for items that require more than one response or action, it is often a reasonable scoring strategy to award some level of partial credit for each response or action that the examinee performs successfully.”
For certification and licensure examinations, we want to identify which candidates are minimally competent to perform the job-related tasks required of them before they receive their certificate or license to practice.
Dichotomous Scoring Model
All items were scored dichotomously, and each item was worth a maximum of 1 mark.
Multiple Select Response
Candidates receive 1 point if they select all of the correct (or best) options. They receive 0 points if their selections do not exactly match the key (e.g., they select an incorrect option or fail to answer). No negative marks are given for incorrect selections.
Matching
Candidates receive 1 point if they match all of the options correctly. They receive 0 points if they match one or more options incorrectly. No negative marks are given for incorrectly matched options.
Reorder
Candidates receive 1 point if they order all of the options correctly. They receive 0 points if they order one or more options incorrectly. No negative marks are given for incorrectly ordered options.
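To make the all-or-nothing rule concrete, the sketch below shows one way the dichotomous model could be implemented for multiple select, matching, and reorder items; the function names and data structures are illustrative and not the organization's actual scoring engine.

```python
def score_selection_dichotomous(selected, key):
    """Multiple select: 1 point only if the selected options exactly match the key, else 0."""
    return 1 if set(selected) == set(key) else 0

def score_sequence_dichotomous(response, key):
    """Matching/reorder: 1 point only if every pairing or position is correct, else 0."""
    return 1 if list(response) == list(key) else 0

# Key for a multiple select item: A, B, F are the correct options.
key = ["A", "B", "F"]
print(score_selection_dichotomous(["A", "B", "F"], key))       # 1 (exact match)
print(score_selection_dichotomous(["A", "B", "C", "F"], key))  # 0 (an incorrect option was selected)
print(score_selection_dichotomous(["B", "F"], key))            # 0 (a correct option was omitted)
```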
Modified Dichotomous Model A
As an alternative to the dichotomous model, candidates receive 1 point if they achieve a score of 50% or higher on an item scored with partial credit, and 0 points if they achieve a score below 50%. Hence, candidates attain a score of either 1 or 0 on an item.
Modified Dichotomous Model B
This model is the same as Modified Dichotomous Model A except that candidates receive 1 point if they achieve a score of 75% or higher on an item scored with partial credit, and 0 points if they achieve a score below 75%. Hence, candidates attain a score of either 1 or 0 on an item.
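Both modified models simply collapse a partial-credit score to 0 or 1 at a threshold. Only the 50% and 75% cut-offs come from the descriptions above; the function below is a hypothetical sketch.

```python
def modified_dichotomous(partial_score, threshold):
    """Collapse a partial-credit score (expressed as a proportion, 0.0-1.0) to 1 or 0."""
    return 1 if partial_score >= threshold else 0

partial = 2 / 3  # e.g., a candidate earned two of three marks on a multiple select item
print(modified_dichotomous(partial, 0.50))  # 1 under Modified Dichotomous Model A (50% threshold)
print(modified_dichotomous(partial, 0.75))  # 0 under Modified Dichotomous Model B (75% threshold)
```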
Polytomous Scoring Model
Under the polytomous scoring model, multiple select, matching, and reorder items are awarded partial credit; traditional multiple-choice items, which have a single correct response, remain scored as right or wrong.
Multiple Select Response
For MS, candidates receive partial marks for each correct (or best) option selected.
Example where A, B, and F are correct options and C, D, and E are incorrect options.
| Examinee | Selections | Correct Options Selected | Options Not Selected | Item Score |
|----------|------------|--------------------------|----------------------|------------|
| August   | A, B, C, F | 3                        | 2                    | 3/3 = 1    |
| John     | A, B, F    | 3                        | 3                    | 3/3 = 1    |
| Jane     | B, F       | 2                        | 4                    | 2/3 ≈ 0.67 |
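The worked example can be reproduced with a short partial-credit routine; as stated above, incorrect selections carry no penalty under this model (the function name is illustrative).

```python
def score_multiple_select_partial(selected, key):
    """Partial credit: proportion of key options selected; incorrect selections are not penalized."""
    return len(set(selected) & set(key)) / len(set(key))

key = {"A", "B", "F"}
print(score_multiple_select_partial({"A", "B", "C", "F"}, key))   # 1.0    (August)
print(score_multiple_select_partial({"A", "B", "F"}, key))        # 1.0    (John)
print(round(score_multiple_select_partial({"B", "F"}, key), 4))   # 0.6667 (Jane)
```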
Matching
Candidates receive partial credit for each option matched correctly. No negative marks are given for incorrectly matched options.
Reorder
Candidates receive partial credit for each option ordered correctly. No negative marks are given for incorrectly ordered options.
Trap Door Model
Multiple Select Response
Candidates receive partial credit for each correct (or best) option selected, and full credit for selecting all of the correct options; however, no credit is awarded if any incorrect option is selected. This model is well suited to the “fatal flaw” concept, in which choosing a clearly incorrect option forfeits the item.
Example where A, B, and F are correct options and C, D, and E are incorrect options.
| Examinee | Selections | Correct Options Selected | Incorrect Option Selected? | Item Score |
|----------|------------|--------------------------|----------------------------|------------|
| August   | A, B, C, F | 3                        | Yes                        | 0          |
| John     | A, B, F    | 3                        | No                         | 3/3 = 1    |
| Jane     | B, F       | 2                        | No                         | 2/3 ≈ 0.67 |
| April    | A, B, D    | 2                        | Yes                        | 0          |
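The same example can be sketched under the trap door rule: any incorrect selection forfeits the item, otherwise partial credit is awarded as in the polytomous model (names again illustrative).

```python
def score_multiple_select_trap_door(selected, key, distractors):
    """Trap door: selecting any incorrect option scores 0; otherwise award partial credit."""
    selected = set(selected)
    if selected & set(distractors):  # the "fatal flaw": one wrong choice zeroes the item
        return 0.0
    return len(selected & set(key)) / len(key)

key, distractors = {"A", "B", "F"}, {"C", "D", "E"}
print(score_multiple_select_trap_door({"A", "B", "C", "F"}, key, distractors))   # 0.0  (August)
print(score_multiple_select_trap_door({"A", "B", "F"}, key, distractors))        # 1.0  (John)
print(round(score_multiple_select_trap_door({"B", "F"}, key, distractors), 2))   # 0.67 (Jane)
print(score_multiple_select_trap_door({"A", "B", "D"}, key, distractors))        # 0.0  (April)
```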
Matching
Candidates receive partial credit for each option matched correctly, but no credit is awarded if any option is matched incorrectly. When a candidate attempts to match every option, this effectively reduces to the dichotomous model.
Reorder
Same as dichotomous scoring model.
Results
Using CTT, candidate scores and test-level metrics were calculated for each of the five scoring models applied to the exam.
Table 1: Test-Level Metrics

| Metric | Dichotomous | Modified Dichotomous A | Modified Dichotomous B | Polytomous | Trap Door |
|---|---|---|---|---|---|
| Number of Candidates | 177 | 177 | 177 | 177 | 177 |
| Number of Items | 180 | 180 | 180 | 180 | 180 |
| Average Score | 134.37 | 140.60 | 137.67 | 139.48 | 137.46 |
| Standard Deviation | 12.33 | 11.61 | 12.01 | 11.73 | 12.29 |
| Minimum Score | 95.00 | 104.00 | 99.00 | 102.08 | 101.00 |
| Maximum Score | 161.00 | 166.00 | 164.00 | 165.08 | 165.00 |
| Average Item Difficulty | 0.7407 | 0.7722 | 0.7581 | 0.7685 | 0.7569 |
| Average Discrimination | 0.1798 | 0.1776 | 0.1787 | 0.1808 | 0.1802 |
| Reliability | 0.8237 | 0.8196 | 0.8196 | 0.8215 | 0.8256 |
| Passing Rate | 0.79 | 0.91 | 0.84 | 0.87 | 0.82 |
From Table 1 we can see that candidates achieved the highest scores under the Modified Dichotomous A model and the lowest scores under the Dichotomous model. Given the intended use of the examination, the lower scores under the Dichotomous model were not unexpected and, overall, would be a truer reflection of a candidate’s performance.
Reliability
Test reliability for the exam was calculated with Cronbach’s alpha coefficient. For high-stakes examinations, a commonly recommended benchmark for test reliability using the alpha coefficient is a value greater than 0.80 (Nunnally, 1978; Finch & French, 2019). The higher the estimated coefficient alpha, the more confidence one can have that the discrimination between candidates at different score levels on the test represents stable differences. The Trap Door model produced the highest reliability (0.8256). The other four models yielded lower reliability, with the lowest values coming from the two Modified Dichotomous models (both 0.8196).
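Coefficient alpha can be computed directly from a candidates-by-items score matrix; the sketch below is a generic implementation that uses simulated data in place of the operational responses, which are not reproduced here.

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for a candidates x items score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated dichotomous responses with roughly the study's dimensions (177 candidates x 180 items).
rng = np.random.default_rng(0)
ability = rng.normal(size=(177, 1))
difficulty = rng.normal(loc=-1.0, size=(1, 180))
responses = (rng.random((177, 180)) < 1 / (1 + np.exp(-(ability - difficulty)))).astype(int)
print(round(cronbach_alpha(responses), 4))
```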
Item Discrimination
In addition to item difficulty, an important metric for assessing item performance is discrimination. Discrimination refers to an item’s ability to differentiate between those who know the content and those who do not: items with higher discrimination do a good job of separating high-ability candidates from low-ability candidates. For this study, the point-biserial correlation (RPB) was used. This measure produces an index that ranges from -1 (perfect negative discrimination) to 1 (perfect positive discrimination), with an index of 0 indicating no discrimination. Items that are very hard (item difficulty < 0.30) or very easy (item difficulty > 0.90) usually have lower discrimination values than items of medium difficulty. For high-stakes examinations such as the one used in this study, it is often recommended to flag items with an RPB < 0.20; however, because operational item pools seldom meet that theoretical standard, an RPB > 0.10 was deemed acceptable and anything lower than 0.10 was flagged for low discrimination. Based on the average discrimination of the items under each model, the partial credit (polytomous) model was the most effective at differentiating between high-ability and low-ability candidates at 0.1808, while the Modified Dichotomous A model yielded the least discriminating items at 0.1776.
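The discrimination index described here can be computed as the correlation between each item score and the candidate’s total score; the sketch below uses the corrected (item-rest) form, one common convention, with a small made-up score matrix rather than the study data.

```python
import numpy as np

def point_biserial(scores):
    """Corrected item-total (point-biserial) correlation for each item in a
    candidates x items matrix of 0/1 scores; the item is removed from its own total."""
    scores = np.asarray(scores, dtype=float)
    rpb = []
    for j in range(scores.shape[1]):
        rest = scores.sum(axis=1) - scores[:, j]  # total score excluding item j
        rpb.append(np.corrcoef(scores[:, j], rest)[0, 1])
    return np.array(rpb)

demo = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [1, 1, 0]])
print(np.round(point_biserial(demo), 3))
# In this study, items with RPB < 0.10 would be flagged for low discrimination.
```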
The passing rate ranged from 79% for the Dichotomous model to 91% for the Modified Dichotomous A model. This was initially somewhat worrisome, as it suggested that a different passing score might be needed for each model. The 60 anchor items, which form the anchor test embedded in the exam for test equating, were summed for each candidate. A one-way ANOVA was performed on these anchor-item totals across candidates, at an alpha of 0.05, to determine whether there were significant differences in anchor scores across the scoring models and therefore whether a different cut score would be required for each model. The result indicated that the pass mark could be held constant across all scoring models.
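As a sketch of this comparison, assuming a vector of 60-item anchor totals is available for each candidate under each scoring model, a one-way ANOVA can be run with scipy; the data below are simulated placeholders, not the study’s anchor scores.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
models = ["Dichotomous", "Modified A", "Modified B", "Polytomous", "Trap Door"]
# Hypothetical anchor-test totals (out of 60) for 177 candidates under each scoring model.
anchor_totals = {m: rng.normal(loc=45, scale=5, size=177) for m in models}

f_stat, p_value = stats.f_oneway(*anchor_totals.values())
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
# A p-value at or above 0.05 would indicate no significant difference in anchor scores
# across models, supporting a single pass mark for all five scoring models.
```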
Test Information Function
Table 2: IRT Rasch Model – Test Information Function

| Model | Theta at Maximum TIF | Maximum TIF | cSEM | Theta 95% CI (Lower) | Theta 95% CI (Upper) | Confidence Interval Range | Total Items for Analysis |
|---|---|---|---|---|---|---|---|
| Dichotomous | -1.2457 | 33.0591 | 0.1739 | -1.5866 | -0.9048 | 0.6818 | 176 |
| Modified Dichotomous A (Mod 50) | -1.4189 | 32.1782 | 0.1763 | -1.7644 | -1.0734 | 0.6910 | 173 |
| Modified Dichotomous B (Mod 75) | -1.3466 | 33.2418 | 0.1734 | -1.6865 | -1.0066 | 0.6799 | 175 |
| Polytomous | -1.2793 | 33.3475 | 0.1732 | -1.6187 | -0.9399 | 0.6788 | 175 |
| Trap Door | -1.3478 | 33.5521 | 0.1726 | -1.6862 | -1.0094 | 0.6767 | 175 |
The Rasch model is an IRT model that constrains the discrimination parameter (i.e., ‘a’) to a constant of 1.00 for all items, while the difficulty parameter (i.e., ‘b’) is freely estimated; the model thus expresses the probability of a specified response (i.e., a right/wrong answer) as a function of candidate ability and item difficulty. Reliability for candidate scores on the exam under IRT was evaluated using the Test Information Function (TIF), an IRT method that is comparable to the alpha coefficient in CTT. The TIF from the Rasch model indicates that the more information there is in the curve at a given ability level (θ), the more we know about a candidate’s latent trait at that level, while minimizing as much measurement error as possible.
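For the dichotomous Rasch model, the probability of a correct response is P(θ) = 1 / (1 + exp(-(θ - b))), each item contributes information P(θ)(1 - P(θ)), the TIF is the sum of item information across items, and the cSEM at θ is 1/√TIF(θ). The sketch below illustrates these relationships with hypothetical item difficulties rather than the calibrated values from this exam.

```python
import numpy as np

def rasch_tif(theta, difficulties):
    """Test information for the dichotomous Rasch model: sum over items of P(1 - P),
    where P = 1 / (1 + exp(-(theta - b)))."""
    theta = np.asarray(theta, dtype=float).reshape(-1, 1)
    b = np.asarray(difficulties, dtype=float).reshape(1, -1)
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return (p * (1.0 - p)).sum(axis=1)

# Hypothetical difficulties for 176 items; locate the TIF peak and the cSEM at that point.
b = np.random.default_rng(2).normal(loc=-1.0, scale=1.0, size=176)
grid = np.linspace(-4.0, 4.0, 801)
tif = rasch_tif(grid, b)
peak_theta = grid[np.argmax(tif)]
print(f"theta at max TIF = {peak_theta:.3f}, max TIF = {tif.max():.3f}, "
      f"cSEM = {1 / np.sqrt(tif.max()):.4f}")
```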
The trap door scoring model had the highest maximum TIF of the five models at 33.55 and the lowest conditional standard error of measurement (cSEM) at 0.1726. Because the cSEM reflects the precision of the estimation process at a given ability level (i.e., measurement error is not constant across the theta scale), it varies with the level of the latent trait; in Table 2, the cSEM is calculated at the peak TIF for each scoring model. Examining the 95 percent confidence intervals around the theta at maximum TIF, we can see in Table 2 that the trap door model has the narrowest range.
Running the scoring models through a Rasch model for comparison with CTT further supports the idea that a partial credit model improves how accurately we classify candidates as minimally competent on a criterion-referenced examination. The trap door model once again outperformed the other models, with the highest maximum TIF value and the lowest cSEM at that point. For criterion-referenced examinations, we also seek the most information at each theta score, and the trap door model does this well when we examine the 95 percent confidence interval range: it has the narrowest TIF curve, which gives us the most information about candidates at the maximum value, a point that could potentially serve as our cut score. With criterion-referenced exams, we want the most information about candidates around the passing mark. Hence, using a trap door scoring model with an IRT Rasch model (rather than CTT) would provide higher precision around the cut score and more information about minimally competent candidates.
Conclusion
Testing different scoring models for innovative item types has yielded a great deal of information in the context of criterion-referenced examinations. With computer-based testing (CBT) becoming the de facto way of testing candidates, many organizations and stakeholders for licensure and certification exams are looking to innovative item types to test candidates and further refine minimally-competent status, so that a candidate may practice in their field of choice knowing they have the required knowledge to perform their job.
With recent advances in assessment and statistical analysis technology, it is now possible to use a variety of innovative item types to test candidates. The CTT analyses showed that the partial credit models (i.e., the polytomous and trap door models) yielded scores that best matched a true representation of a candidate’s ability, displayed evidence of higher test reliability, and produced the best average indices of item discrimination. In addition, the partial credit model does not penalize candidates for guessing. The trap door model, while more difficult than the polytomous model, had the best overall test-level metrics of all five models.
In general, much of the evidence in this research study supports the use of the partial credit model, and specifically the trap door model, especially in high-stakes examinations. As technology continues to evolve, certification and licensure organizations look to build exams with more than just standard multiple-choice items, testing various forms of job-specific tasks to enhance their exams and obtain more information about minimal competence. Educational measurement must progress with technology to help organizations find new and robust ways of testing candidates, and it is an exciting time to advance this knowledge so that society can have greater confidence that certified and licensed candidates are performing their tasks at a safe and competent level.
References
Finch, W. H., & French, B. F. (2019). Educational and psychological measurement. Routledge.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). McGraw-Hill.
Parshall, C. G., & Harmes, J. C. (2007). Designing templates based on a taxonomy of innovative items.