Computer-based exams have long favored multiple-choice question formats because of technological restrictions, but considerable advances in testing software now allow more innovative item types to be used, such as multiple select, drag and drop, reorder, matching, true/false, and even long-answer essay questions. These item types offer different ways to assess candidates, especially on licensure and certification examinations. An important question is how these innovative item types actually perform compared with typical multiple-choice examinations.
Case Study
With the permission of an organization that evaluates candidates in a high-stakes healthcare profession in Canada, five different scoring models were applied to an examination containing innovative item types: a Dichotomous model, a Partial Credit (Polytomous) model, two Modified Dichotomous models, and a Trap Door model. The effect of these scoring models on the overall criterion-referenced psychometrics of this examination is determined and described via Classical Test Theory (CTT) and Item Response Theory (IRT).
Scoring Models
Item scoring models can be classified into two broad categories: dichotomous (i.e., right/wrong) and polytomous (i.e., partial credit) (Parshall & Harmes, 2007). Depending on the characteristics of the item type, the number and type of scoring models within each category can vary. A research study on a high-stakes exam was used to test how innovative item types, scored under several different models, would affect the passing rate and the amount of information obtained about the candidates themselves.
For Multiple Choice (MC) questions, the dichotomous scoring model is consistent with the organization's existing examination policy (no negative marks for incorrect responses). Like the MC items, Multiple Select (MS), Reorder, and Matching items are scored dichotomously. However, for MS, Reorder, and Matching items there is the potential to obtain richer information by using partial credit scoring models. The ICE Research and Development Committee (2017) has noted that, “for items that require more than one response or action, it is often a reasonable scoring strategy to award some level of partial credit for each response or action that the examinee performs successfully.”
For certification and licensure examinations, we want to identify which candidates are minimally competent to perform the job-related tasks required of them before they receive their certificate or license to practice.
Dichotomous Scoring Model
All items were scored dichotomously, and each item was worth a maximum of 1 mark.
Multiple Select Response
Candidates receive 1 point if they select all of the correct (or best) options. They receive 0 points if their selections do not exactly match the key (e.g., they select an incorrect option or fail to answer). No negative marks are given for incorrect selections.
Matching
Candidates receive 1 point if they match all of the options correctly. They receive 0 points if they match one or more options incorrectly. No negative marks are given for incorrectly matched options.
Reorder
Candidates receive 1 point if they order all of the options correctly. They receive 0 points if they order one or more options incorrectly. No negative marks are given for incorrectly ordered options.
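To make the all-or-nothing rule concrete, the sketch below shows one way the dichotomous model could be implemented for multiple select, matching, and reorder items; the function names and data structures are illustrative and not the organization's actual scoring engine.

```python
def score_selection_dichotomous(selected, key):
    """Multiple select: 1 point only if the selected options exactly match the key, else 0."""
    return 1 if set(selected) == set(key) else 0

def score_sequence_dichotomous(response, key):
    """Matching/reorder: 1 point only if every pairing or position is correct, else 0."""
    return 1 if list(response) == list(key) else 0

# Key for a multiple select item: A, B, F are the correct options.
key = ["A", "B", "F"]
print(score_selection_dichotomous(["A", "B", "F"], key))       # 1 (exact match)
print(score_selection_dichotomous(["A", "B", "C", "F"], key))  # 0 (an incorrect option was selected)
print(score_selection_dichotomous(["B", "F"], key))            # 0 (a correct option was omitted)
```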
Modified Dichotomous Model A
As an alternative to the dichotomous model, candidates receive 1 point if they achieve a score of 50% or higher on an item scored with partial credit, and 0 points if they achieve a score below 50%. Hence, candidates attain a score of either 1 or 0 on an item.
Modified Dichotomous Model B
This model is the same as Modified Dichotomous Model A except that candidates receive 1 point if they achieve a score of 75% or higher on an item scored with partial credit, and 0 points if they achieve a score below 75%. Hence, candidates attain a score of either 1 or 0 on an item.
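Both modified models simply collapse a partial-credit score to 0 or 1 at a threshold. Only the 50% and 75% cut-offs come from the descriptions above; the function below is a hypothetical sketch.

```python
def modified_dichotomous(partial_score, threshold):
    """Collapse a partial-credit score (expressed as a proportion, 0.0-1.0) to 1 or 0."""
    return 1 if partial_score >= threshold else 0

partial = 2 / 3  # e.g., a candidate earned two of three marks on a multiple select item
print(modified_dichotomous(partial, 0.50))  # 1 under Modified Dichotomous Model A (50% threshold)
print(modified_dichotomous(partial, 0.75))  # 0 under Modified Dichotomous Model B (75% threshold)
```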
Polytomous Scoring Model
Under the polytomous scoring model, multiple select, matching, and reorder items are awarded partial credit; traditional multiple-choice items, which have a single correct response, remain scored as right or wrong.
Multiple Select Response
For MS, candidates receive partial marks for each correct (or best) option selected.
Example where A, B, and F are correct options and C, D, and E are incorrect options.
| Examinee | Selections | Correct Options Selected | Options Not Selected | Item Score |
|----------|------------|--------------------------|----------------------|------------|
| August   | A, B, C, F | 3                        | 2                    | 3/3 = 1    |
| John     | A, B, F    | 3                        | 3                    | 3/3 = 1    |
| Jane     | B, F       | 2                        | 4                    | 2/3 ≈ 0.67 |
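The worked example can be reproduced with a short partial-credit routine; as stated above, incorrect selections carry no penalty under this model (the function name is illustrative).

```python
def score_multiple_select_partial(selected, key):
    """Partial credit: proportion of key options selected; incorrect selections are not penalized."""
    return len(set(selected) & set(key)) / len(set(key))

key = {"A", "B", "F"}
print(score_multiple_select_partial({"A", "B", "C", "F"}, key))   # 1.0    (August)
print(score_multiple_select_partial({"A", "B", "F"}, key))        # 1.0    (John)
print(round(score_multiple_select_partial({"B", "F"}, key), 4))   # 0.6667 (Jane)
```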
Matching
Candidates receive partial credit for each option matched correctly. No negative marks are given for incorrectly matched options.
Reorder
Candidates receive partial credit for each option ordered correctly. No negative marks are given for incorrectly ordered options.
Trap Door Model
Multiple Select Response
Candidates receive partial credit for each correct (or best) option selected, and full credit for selecting all of the correct options; however, no credit is awarded if any incorrect option is selected. This model is well suited to the “fatal flaw” concept, in which choosing a clearly incorrect option forfeits the item.
Example where A, B, and F are correct options and C, D, and E are incorrect options.
| Examinee | Selections | Correct Options Selected | Incorrect Option Selected? | Item Score |
|----------|------------|--------------------------|----------------------------|------------|
| August   | A, B, C, F | 3                        | Yes                        | 0          |
| John     | A, B, F    | 3                        | No                         | 3/3 = 1    |
| Jane     | B, F       | 2                        | No                         | 2/3 ≈ 0.67 |
| April    | A, B, D    | 2                        | Yes                        | 0          |
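The same example can be sketched under the trap door rule: any incorrect selection forfeits the item, otherwise partial credit is awarded as in the polytomous model (names again illustrative).

```python
def score_multiple_select_trap_door(selected, key, distractors):
    """Trap door: selecting any incorrect option scores 0; otherwise award partial credit."""
    selected = set(selected)
    if selected & set(distractors):  # the "fatal flaw": one wrong choice zeroes the item
        return 0.0
    return len(selected & set(key)) / len(key)

key, distractors = {"A", "B", "F"}, {"C", "D", "E"}
print(score_multiple_select_trap_door({"A", "B", "C", "F"}, key, distractors))   # 0.0  (August)
print(score_multiple_select_trap_door({"A", "B", "F"}, key, distractors))        # 1.0  (John)
print(round(score_multiple_select_trap_door({"B", "F"}, key, distractors), 2))   # 0.67 (Jane)
print(score_multiple_select_trap_door({"A", "B", "D"}, key, distractors))        # 0.0  (April)
```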
Matching
Candidates receive partial credit for each option matched correctly, but no credit is awarded if any option is matched incorrectly. When a candidate attempts to match every option, this effectively reduces to the dichotomous model.
Reorder
Same as dichotomous scoring model.
Results
Using CTT, candidate scores and test-level metrics were calculated for each of the five scoring models applied to the exam.
Table 1: Test-Level Metrics

| Metric | Dichotomous | Modified Dichotomous A | Modified Dichotomous B | Polytomous | Trap Door |
|---|---|---|---|---|---|
| Number of Candidates | 177 | 177 | 177 | 177 | 177 |
| Number of Items | 180 | 180 | 180 | 180 | 180 |
| Average Score | 134.37 | 140.60 | 137.67 | 139.48 | 137.46 |
| Standard Deviation | 12.33 | 11.61 | 12.01 | 11.73 | 12.29 |
| Minimum Score | 95.00 | 104.00 | 99.00 | 102.08 | 101.00 |
| Maximum Score | 161.00 | 166.00 | 164.00 | 165.08 | 165.00 |
| Average Item Difficulty | 0.7407 | 0.7722 | 0.7581 | 0.7685 | 0.7569 |
| Average Discrimination | 0.1798 | 0.1776 | 0.1787 | 0.1808 | 0.1802 |
| Reliability | 0.8237 | 0.8196 | 0.8196 | 0.8215 | 0.8256 |
| Passing Rate | 0.79 | 0.91 | 0.84 | 0.87 | 0.82 |
From Table 1 we can see that candidates achieved the highest scores under the Modified Dichotomous A model and the lowest scores under the Dichotomous model. Given the intended use of the examination, the lower scores under the Dichotomous model were not unexpected and, overall, would be a truer reflection of a candidate’s performance.
Reliability
Test reliability for the exam was calculated with Cronbach’s alpha coefficient. For high-stakes examinations, a commonly recommended benchmark for test reliability using the alpha coefficient is a value greater than 0.80 (Nunnally, 1978; Finch & French, 2019). The higher the estimated coefficient alpha, the more confidence one can have that the discrimination between candidates at different score levels on the test represents stable differences. The Trap Door model produced the highest reliability (0.8256). The other four models yielded lower reliability, with the lowest values coming from the two Modified Dichotomous models (both 0.8196).
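Coefficient alpha can be computed directly from a candidates-by-items score matrix; the sketch below is a generic implementation that uses simulated data in place of the operational responses, which are not reproduced here.

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for a candidates x items score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated dichotomous responses with roughly the study's dimensions (177 candidates x 180 items).
rng = np.random.default_rng(0)
ability = rng.normal(size=(177, 1))
difficulty = rng.normal(loc=-1.0, size=(1, 180))
responses = (rng.random((177, 180)) < 1 / (1 + np.exp(-(ability - difficulty)))).astype(int)
print(round(cronbach_alpha(responses), 4))
```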
Item Discrimination
In addition to item difficulty, an important metric for assessing item performance is discrimination. Discrimination refers to an item’s ability to differentiate between those who know the content and those who do not: items with higher discrimination do a good job of separating high-ability candidates from low-ability candidates. For this study, the point-biserial correlation (RPB) was used. This measure produces an index that ranges from -1 (perfect negative discrimination) to 1 (perfect positive discrimination), with an index of 0 indicating no discrimination. Items that are very hard (item difficulty < 0.30) or very easy (item difficulty > 0.90) usually have lower discrimination values than items of medium difficulty. For high-stakes examinations such as the one used in this study, it is often recommended to flag items with an RPB < 0.20; however, because operational item pools seldom meet that theoretical standard, an RPB > 0.10 was deemed acceptable and anything lower than 0.10 was flagged for low discrimination. Based on the average discrimination of the items under each model, the partial credit (polytomous) model was the most effective at differentiating between high-ability and low-ability candidates at 0.1808, while the Modified Dichotomous A model yielded the least discriminating items at 0.1776.
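The discrimination index described here can be computed as the correlation between each item score and the candidate’s total score; the sketch below uses the corrected (item-rest) form, one common convention, with a small made-up score matrix rather than the study data.

```python
import numpy as np

def point_biserial(scores):
    """Corrected item-total (point-biserial) correlation for each item in a
    candidates x items matrix of 0/1 scores; the item is removed from its own total."""
    scores = np.asarray(scores, dtype=float)
    rpb = []
    for j in range(scores.shape[1]):
        rest = scores.sum(axis=1) - scores[:, j]  # total score excluding item j
        rpb.append(np.corrcoef(scores[:, j], rest)[0, 1])
    return np.array(rpb)

demo = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [1, 1, 0]])
print(np.round(point_biserial(demo), 3))
# In this study, items with RPB < 0.10 would be flagged for low discrimination.
```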
The passing rate ranged from 79% for the Dichotomous model to 91% for the Modified Dichotomous A model. This was initially somewhat worrisome, as it suggested that a different passing score might be needed for each model. The 60 anchor items, which form the anchor test embedded in the exam for test equating, were summed for each candidate. A one-way ANOVA was performed on these anchor-item totals across candidates, at an alpha of 0.05, to determine whether there were significant differences in anchor scores across the scoring models and therefore whether a different cut score would be required for each model. The result indicated that the pass mark could be held constant across all scoring models.
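As a sketch of this comparison, assuming a vector of 60-item anchor totals is available for each candidate under each scoring model, a one-way ANOVA can be run with scipy; the data below are simulated placeholders, not the study’s anchor scores.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
models = ["Dichotomous", "Modified A", "Modified B", "Polytomous", "Trap Door"]
# Hypothetical anchor-test totals (out of 60) for 177 candidates under each scoring model.
anchor_totals = {m: rng.normal(loc=45, scale=5, size=177) for m in models}

f_stat, p_value = stats.f_oneway(*anchor_totals.values())
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
# A p-value at or above 0.05 would indicate no significant difference in anchor scores
# across models, supporting a single pass mark for all five scoring models.
```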
Test Information Function
Table 2: IRT Rasch Model – Test Information Function

| Model | Theta at Maximum TIF | Maximum TIF | cSEM | Theta 95% CI (Lower) | Theta 95% CI (Upper) | Confidence Interval Range | Total Items for Analysis |
|---|---|---|---|---|---|---|---|
| Dichotomous | -1.2457 | 33.0591 | 0.1739 | -1.5866 | -0.9048 | 0.6818 | 176 |
| Modified Dichotomous A (Mod 50) | -1.4189 | 32.1782 | 0.1763 | -1.7644 | -1.0734 | 0.6910 | 173 |
| Modified Dichotomous B (Mod 75) | -1.3466 | 33.2418 | 0.1734 | -1.6865 | -1.0066 | 0.6799 | 175 |
| Polytomous | -1.2793 | 33.3475 | 0.1732 | -1.6187 | -0.9399 | 0.6788 | 175 |
| Trap Door | -1.3478 | 33.5521 | 0.1726 | -1.6862 | -1.0094 | 0.6767 | 175 |
The Rasch model is an IRT model that constrains the discrimination parameter (i.e., ‘a’) to a constant of 1.00 for all items, while the difficulty parameter (i.e., ‘b’) is freely estimated; the model thus expresses the probability of a specified response (i.e., a right/wrong answer) as a function of candidate ability and item difficulty. Reliability for candidate scores on the exam under IRT was evaluated using the Test Information Function (TIF), an IRT method that is comparable to the alpha coefficient in CTT. The TIF from the Rasch model indicates that the more information there is in the curve at a given ability level (θ), the more we know about a candidate’s latent trait at that level, while minimizing as much measurement error as possible.
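For the dichotomous Rasch model, the probability of a correct response is P(θ) = 1 / (1 + exp(-(θ - b))), each item contributes information P(θ)(1 - P(θ)), the TIF is the sum of item information across items, and the cSEM at θ is 1/√TIF(θ). The sketch below illustrates these relationships with hypothetical item difficulties rather than the calibrated values from this exam.

```python
import numpy as np

def rasch_tif(theta, difficulties):
    """Test information for the dichotomous Rasch model: sum over items of P(1 - P),
    where P = 1 / (1 + exp(-(theta - b)))."""
    theta = np.asarray(theta, dtype=float).reshape(-1, 1)
    b = np.asarray(difficulties, dtype=float).reshape(1, -1)
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return (p * (1.0 - p)).sum(axis=1)

# Hypothetical difficulties for 176 items; locate the TIF peak and the cSEM at that point.
b = np.random.default_rng(2).normal(loc=-1.0, scale=1.0, size=176)
grid = np.linspace(-4.0, 4.0, 801)
tif = rasch_tif(grid, b)
peak_theta = grid[np.argmax(tif)]
print(f"theta at max TIF = {peak_theta:.3f}, max TIF = {tif.max():.3f}, "
      f"cSEM = {1 / np.sqrt(tif.max()):.4f}")
```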
The trap door scoring model had the highest maximum TIF of the five models at 33.55 and the lowest conditional standard error of measurement (cSEM) at 0.1726. Because the cSEM reflects the precision of the estimation process at a given ability level (i.e., measurement error is not constant across the theta scale), it varies with the level of the latent trait; in Table 2, the cSEM is calculated at the peak TIF for each scoring model. Examining the 95 percent confidence intervals around the theta at maximum TIF, we can see in Table 2 that the trap door model has the narrowest range.
Running the scoring models through a Rasch model for comparison with CTT further supports the idea that a partial credit model improves how accurately we classify candidates as minimally competent on a criterion-referenced examination. The trap door model once again outperformed the other models, with the highest maximum TIF value and the lowest cSEM at that point. For criterion-referenced examinations, we also seek the most information at each theta score, and the trap door model does this well when we examine the 95 percent confidence interval range: it has the narrowest TIF curve, which gives us the most information about candidates at the maximum value, a point that could potentially serve as our cut score. With criterion-referenced exams, we want the most information about candidates around the passing mark. Hence, using a trap door scoring model with an IRT Rasch model (rather than CTT) would provide higher precision around the cut score and more information about minimally competent candidates.
Conclusion
Testing different scoring models for innovative item types has yielded a great deal of information in the context of criterion-referenced examinations. With computer-based testing (CBT) becoming the de facto way of testing candidates, many organizations and stakeholders for licensure and certification exams are looking to innovative item types to test candidates and further refine minimally-competent status, so that a candidate may practice in their field of choice knowing they have the required knowledge to perform their job.
With recent advances in assessment and statistical analysis technology, it is now possible to use a variety of innovative item types to test candidates. The CTT analyses showed that the partial credit models (i.e., the polytomous and trap door models) yielded scores that best matched a true representation of a candidate’s ability, displayed evidence of higher test reliability, and produced the best average indices of item discrimination. In addition, the partial credit model does not penalize candidates for guessing. The trap door model, while more difficult than the polytomous model, had the best overall test-level metrics of all five models.
In general, much of the evidence in this research study supports the use of the partial credit model, and specifically the trap door model, especially in high-stakes examinations. As technology continues to evolve, certification and licensure organizations look to build exams with more than just standard multiple-choice items, testing various forms of job-specific tasks to enhance their exams and obtain more information about minimal competence. Educational measurement must progress with technology to help organizations find new and robust ways of testing candidates, and it is an exciting time to advance this knowledge so that society can have greater confidence that certified and licensed candidates are performing their tasks at a safe and competent level.
References
Finch, W. H., & French, B. F. (2019). Educational and psychological measurement. Routledge.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). McGraw-Hill.
Parshall, C. G., & Harmes, J. C. (2007). Designing templates based on a taxonomy of innovative items.