
Journal of Education for Business

ISSN: 0883-2323 (Print) 1940-3356 (Online) Journal homepage: http://www.tandfonline.com/loi/vjeb20

Acceptance and Accuracy of Multiple Choice,
Confidence-Level, and Essay Question Formats for
Graduate Students
Stephen M. Swartz
To cite this article: Stephen M. Swartz (2006) Acceptance and Accuracy of Multiple Choice,
Confidence-Level, and Essay Question Formats for Graduate Students, Journal of Education for
Business, 81:4, 215-220, DOI: 10.3200/JOEB.81.4.215-220
To link to this article: http://dx.doi.org/10.3200/JOEB.81.4.215-220

Published online: 07 Aug 2010.


Acceptance and Accuracy of Multiple
Choice, Confidence-Level, and Essay
Question Formats for Graduate Students
STEPHEN M. SWARTZ
UNIVERSITY OF NORTH TEXAS
DENTON, TEXAS

ABSTRACT. The confidence-level (information-referenced testing; IRT) design is an attempt to improve upon the multiple choice format by allowing students to express a level of confidence in the answers they choose. In this study, the author evaluated student perceptions of the ease of use of, accuracy of, and general preference for traditional multiple choice, confidence-level, and essay format questions. The author also estimated the relative accuracy of traditional multiple choice versus confidence level against both the essay results and students' self-reported mastery of knowledge domains. Student acceptance of the new format was equal to, and its accuracy better than, the traditional format.
Copyright © 2006 Heldref Publications

The assessment of student learning is an important issue for educators. The history of the development of assessment tools and techniques indicates a high level of emphasis on the accuracy and efficiency of testing methods (Madaus & O’Dwyer, 1999). Since
the early 1900s, traditional multiple
choice (MC) item formats have
achieved a position of dominance in
learning assessment, mainly due to the
prima facie objectivity and the efficiency
of administration this format represents.
However, the popularity of the MC format has come under scrutiny for some
applications where accuracy of assessment, particularly for complex knowledge domains, has greater importance
than efficiency (Becker & Johnston,
1999; Bennett, Rock, & Wang, 1991).
Traditional MC testing formats offer
efficiency, objectivity, simplicity, and
ease of use for the assessment of student
knowledge, but are subject to many
sources of interpretation error. Essay
format questions, while inefficient and
difficult to grade objectively, offer a
potentially higher level of information
quality.

Purpose
The purpose of this study was to evaluate graduate student perceptions of the ease of use of, accuracy of, and general preference for traditional MC, confidence level (CL), and constructed response (CR; essay or short answer) format questions. I also estimated the relative accuracy of traditional MC versus CL against both the CR results and the students' self-reported posttest mastery of knowledge domains.
Literature Review and Related
Research
Knowledge Assessment: CR Versus MC
Educators have sought a better compromise between the richness and depth
of the CR format and the simplicity and
efficiency of the MC format. Classroom
testing procedures are used to assess a
range of student attributes from simple
right or wrong recall of factual material

to the demonstration of synthesized
knowledge applied correctly to new or
unique problems. Testing procedures
can be classified roughly into two sets:
CR and MC formats (Haladyna, 1999).
The CR format includes the assessment
of student attributes through critiques,
demonstrations, essays, experiments,
interviews, oral reports, portfolios, projects, and research papers. CR tools provide students with prompts, and students are required to construct
responses. MC format tools generally
present students with a prompt, then
offer alternatives from which the students choose the correct response. True
or false, matching, and the traditional
MC (i.e., selecting from among alternative responses) are all forms of the MC category. It has become generally
accepted that trade-offs exist between
CR and MC measurement tools. While
CR formats are more difficult to administer and evaluate objectively and precisely, they provide the opportunity to
assess more complex student attributes
and higher levels of attribute achievement (Conderman, 2001; Powell, 1989).
By contrast, traditional MC formats are
easy to administer and use and provide
inherent objectivity in grading, but they
measure only superficial binary outcomes and promote rote learning
(Miller, Williams, & Haladyna, 1978;
Rogers & Ndalichako, 1997). Also, the
traditional MC formats are unable to
distinguish between right answers
resulting from students knowing the
answer and those resulting from students guessing the answer (Rogers &
Ndalichako, 2000). While a variety of
approaches have been tried in an effort
to improve MC formats, including the
addition of essay questions linked to MC questions (Wood, 1998), many of
these compromises could be considered
as adjunct approaches in addition to the
MC format and do not represent direct
improvements on the MC format itself.
Issues in MC Formats
All MC questions consist of a stem or
prompt (the question) and several alternative responses. The alternative
responses generally include a single
correct response and multiple plausible,
but incorrect, choices (Hansen, 1997). If
the purpose of the examination instrument is to measure the level or amount
of knowledge in a domain, this format
reflects only the binary outcomes of students knowing or guessing the correct
answer versus students not knowing the
answer or guessing incorrectly.
One way to add additional precision
to MC format items is to allow for more
than one correct answer and offer a
range of credit for more complete versus less complete responses. Pomplun and Omar (1997) reported on the use of
such multiple-mark MC format items.
With this format, multiple correct
responses are offered and students are
directed to select every correct
response. Full credit is offered for a perfect selection and varying levels of partial credit are given for less than perfect
selections.
This format appears to offer two main
advantages over traditional MC formats
(Pomplun & Omar, 1997). First, administrators believe that the multipleresponse format is more realistic
because, for many knowledge domains,
more than one right answer naturally
exists. Only infrequently does a single
right answer for many knowledge areas
exist. Second, the method is believed to
reduce the bias introduced by guessing.

By giving students a more exhaustive
list of choices, the likelihood that at
least some of the choices fall into the
students’ knowledge base is higher.
Finally, the multiple-response option is
easily accommodated into the existing
bubble sheet optical reader technology
already in use.
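The partial-credit rule for the multiple-mark format described above is not specified here; as one illustrative possibility (an assumption of this sketch, not necessarily the scoring Pomplun and Omar evaluated), credit can be awarded for the fraction of options a student classifies correctly:

    def score_multiple_mark(selected: set, keyed_correct: set, all_options: set) -> float:
        """Partial credit for a multiple-mark MC item: the fraction of options the
        student classified correctly (marked when keyed correct, left unmarked
        otherwise). This is one illustrative rule, not necessarily the scoring
        Pomplun and Omar (1997) evaluated."""
        correctly_classified = sum(
            (opt in selected) == (opt in keyed_correct) for opt in all_options
        )
        return correctly_classified / len(all_options)

    # Example: options A-E, with A, C, and D keyed correct.
    print(score_multiple_mark({"A", "C", "D"}, {"A", "C", "D"}, set("ABCDE")))  # 1.0 (perfect selection)
    print(score_multiple_mark({"A", "C"}, {"A", "C", "D"}, set("ABCDE")))       # 0.8 (one correct option missed)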
Another critical issue in designing,
using, and interpreting traditional MC
format instruments is the question of the
optimal number of choices offered.
While the multiple-response format
would require the use of many alternatives (several right and several wrong
choices are required), the use of three
through five alternatives for binaryoutcome measurement has become a de
facto standard. However, research suggests that the use of three choices may
be superior to any larger number. As
early as 1964, Tversky showed that, given a fixed total number of choices, the use of three alternatives per item actually maximized the discriminability and statistical power of the instrument. In 1994,
Sidick, Barrett, and Doverspike investigated the use of three versus five choice
items used in public sector employment
tests. They concluded that the psychometric properties of the three alternative
MC items were comparable to the five
choice items, making the potential
development and administration simplicity gains preferable. In 1995, Bruno
and Dirkzwager applied the information
theoretic perspective to this problem of
optimal number of choices in MC format. The starting assumption was that
the amount of information extracted
from a test item will increase with the
number of offered choices but that this is not a perfectly linear relationship
because the marginal increase in information extracted tapers as the number
of choices increases. Indeed, too many
choices begin to introduce a certain amount of distraction, because equally
informed students may select different
marginal choices from a large number
of alternatives (Bruno & Dirkzwager).
The researchers derived a formula representing the amount of information per
alternative reflected in the number of
options and found the optimal whole
number of three choices to yield the
maximum amount of information provided per choice, which was considered
to be ideal. Rogers and Harley (1999) obtained similar findings. Educators
reported that, in many instances, the
development of a fourth alternative
often resulted in writing a throwaway
choice that added no value to the item.
Also, the information gained from three
choice items was at least equivalent to
four choice items, and the bias induced
by guessing (test-wiseness) was reduced
(Rogers & Harley).
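Bruno and Dirkzwager's derivation is not reproduced in this article, but the shape of the information-per-choice argument can be sketched under a simple assumption: if a correctly resolved n-choice item yields about log2(n) bits, then the information per offered alternative is log2(n)/n, which peaks at the continuous optimum n = e (about 2.72) and therefore at the whole number n = 3:

    import math

    def bits_per_alternative(n: int) -> float:
        # Illustrative assumption (not Bruno & Dirkzwager's exact formula):
        # an n-choice item resolves log2(n) bits, spread over n alternatives.
        return math.log2(n) / n

    for n in range(2, 7):
        print(n, round(bits_per_alternative(n), 3))
    # Output: 2 0.5, 3 0.528, 4 0.5, 5 0.464, 6 0.431.
    # The per-choice information is maximized at the whole number n = 3.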
The literature presented seems to suggest that essay questions are preferred
for the amount of information about student knowledge they provide, particularly in terms of the ability to assess
dimensionality of knowledge beyond
simple right versus wrong determinations. MC questions are efficient to
administer and evaluate and reduce
potential evaluator bias, but are inferior
in their ability to measure multiple
dimensions of knowledge. By adding
additional choices, the ability to discriminate between levels of knowledge
is improved, but, for many applications,
a smaller number of choices is preferred.
Multidimensional Testing With CL
The inclusion of the dimension of relative certainty to the existing dimension
of rightness provides useful information
for the educator (Hassman & Hunt,
1994). The measurement of students’
confidence in their answers, combined
with whether the answer is correct, both
reduces the guessing effect and provides
some diagnostic feedback to the learning process. The development of information-referenced testing (IRT), or CL testing (Bruno, 1986; Bruno, Holland, & Ward, 1988), allows this kind of measurement.
The IRT format proposes to capture
the dimension of student certainty in the
answer selected. The advantage to the
educator is that, by taking student confidence in the answer into account, intermediate assessment between fully
informed students (i.e., students who
know and are confident in the correct
answer) and misinformed students (i.e.,
students who are confident in their
choice, but answer incorrectly) can be
achieved (Bruno, 1986) by offering
choices in three levels with each level
representing a different degree of confidence. An example of this question format follows:
1) 1 + 2 = ?

A. 2.717
B. 3
C. 3.141

D. A or B
E. B or C
F. A or C

G. I don’t know

At the first level, three alternatives
are presented, with one correct and two
incorrect choices (e.g., choices A, B, C).
By choosing an alternative at this level,
students exhibit a high level of confidence in their knowledge. By selecting
the right answer, students demonstrate
that they are fully informed and confident in their knowledge. The correct
response (B in the example given)
would be graded at full credit. By
choosing a wrong answer at this level
(A or C), the student is demonstrating
that he or she is confident in the wrong
knowledge and is, therefore, misinformed. At this point, an incorrect
response would be given zero credit.
The second level of alternatives (e.g.,
D, E, F) presents Boolean “or” choices
among alternative combinations of the
first-level options. By selecting a choice
at this level, students demonstrate that
they are either partially informed by
selecting a correct choice (in the example, both D and E include the right
answer) or misinformed by selecting the
wrong choice (F in the example). Correct answers at this level would be
awarded half credit. At this level, students trade half of the available credit to
avoid the risk of being forced to choose
between two of the three alternatives,
which would earn them a 50% score for
random guessing. Wrong choices at this
level are again scored zero credit.
Finally, at the third level, students are
afforded the opportunity to admit to
being uninformed (e.g., choosing G in
the above example) and possessing a
lack of knowledge. Here, the student is
rewarded with one third of the credit,
which represents the fair value of
attempting random guessing from
among any of the three first-level choices. The feedback quality for the educator as a result of analyzing student
responses along this spectrum allows
for a wider range of (and more appropriate) pedagogical responses. The IRT
response model is summarized in Table
1 (from Larson, 2003).
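A minimal sketch of the scoring rule just described, keyed to the seven-option example above; the function name and option layout are this sketch's own, not code from the study:

    def score_cl_item(selected: str, correct: str) -> float:
        """Score one confidence-level (IRT) item with first-level options A, B, C,
        second-level 'or' pairs D, E, F, and an 'I don't know' option G.
        Credit values follow the model summarized in Table 1."""
        first_level = {"A", "B", "C"}
        second_level = {"D": {"A", "B"}, "E": {"B", "C"}, "F": {"A", "C"}}

        if selected in first_level:                      # confident first-level choice
            return 1.0 if selected == correct else 0.0   # fully informed vs. misinformed
        if selected in second_level:                     # partially informed ("A or B", ...)
            return 0.5 if correct in second_level[selected] else 0.0
        if selected == "G":                              # admits being uninformed
            return 1.0 / 3.0                             # one third credit (shown as 0.3 in Table 1)
        raise ValueError(f"unknown option {selected!r}")

    # Calibration check, matching the text's reasoning: a random guess among the
    # three first-level options has expected credit (1/3)(1.0) = 1/3, the same as
    # honestly choosing "I don't know"; a student who has narrowed the answer to
    # two options and guesses between them expects (1/2)(1.0) = 0.5, the same as
    # a correct second-level choice.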
Although the additional information
provided by the implementation of IRT could be of value to the educator, concern
may exist regarding the implementation
cost. First, there are direct costs associated with securing seven-item bubble
sheets, making changes to associated
software for analyzing results, and making changes to the testing process. However, the greater concern could be
whether or not students accustomed to
the traditional MC format would accept
the more complicated (and perhaps difficult to understand) format and be able to
perform comfortably on instruments with
items of this type. The “costs” associated
with the CL or IRT format may outweigh
the additional information provided to
the educator.
METHOD
In this study, I attempted to answer
two research questions:
1. How do graduate students perceive
the relative ease of use and measurement accuracy of and general preference for traditional MC, IRT, and essay
or short-answer formats for assessing
student knowledge?
2. Which MC format (traditional vs.
CL) provided better accuracy in terms
of association with self-reported mastery and the answers to CR (essay or
short answer) format questions?
I was interested in both the acceptance level and accuracy of the proposed
CL format to more thoroughly assess its
suitability for use in the classroom.

TABLE 1. Summary of Information-Referenced Testing Model

Student action | Root cause | Diagnosis | Credit earned | Pedagogical response
Chooses correct option from first level | Student confidently comprehended the objective. | Student is "fully informed." | 1.0 | None.
Chooses correct option from second level | Student is not confident or comprehends only part of the objective. | Student is "partially informed." | 0.5 | Adjust the scope of instruction and study to "fill in the gaps."
Chooses "I don't know" | Student cannot answer the test item. | Student is "uninformed." | 0.3 | Cover the material again fully and increase confidence.
Chooses incorrect option from first or second level | Student is confident, but wrong. | Student is "misinformed." | 0.0 | Reevaluate learning; use alternative methods of instruction to correct the problem.
To test student perception of the three
question formats, I surveyed two groups
of students on the ease of use of, accuracy of measurement of, and general
preference for traditional MC, CL, and
CR questions. The two groups included
students from a Master of Business
Administration (MBA) program at a
private Midwest college and students
from a Master of Science (MS) program
at a government-run graduate school.
The two groups took very similar (e.g.,
same textbook, same professor, same
exams) sections of a graduate Introduction to Supply Chain Management
course. The sections were the same size
(18 students) and were very similar in
demographic composition. The MBA
students were slightly older, had a wider
range of industry experience, and
attended the class at night once a week.
The MS students were, on average, 3–5 years younger, had a very similar range of experiences, and took the class during the day, twice a week, as part of a full-time cohort.
Three exams were administered during each course (in addition to case
work and student projects) and each
exam contained a mix of MC, CL, and
CR questions from the same knowledge
domains. Students received familiarization training on the CL format prior to
the first exam, consisting of a complete
description 2 weeks before the exam,
and another presentation including a
practice quiz the week prior to the
exam. For Exam I, the questions were
tightly coupled in that I maintained the
question wording as identical as possible among the three formats. Exams II
and III had reduced coupling, so that by
Exam III the questions were recognizably different, but they were from the
same knowledge domain as much as
possible. Each exam consisted of 10
knowledge domains, measured by 10
MC, 10 CL, and 5 CR questions. Therefore, 5 of the knowledge domains were
tested across all three formats. Immediately prior to and following the exam,
students were asked to self-evaluate
their knowledge in each of the tested
domains. In addition, immediately following each exam, the students were
asked to evaluate the overall ease of use
of, measurement accuracy of, and general preference for each format. Ease,
Accuracy, and Preference were evaluated on a 7-point Likert scale ranging from 1 (strongly disagree) to 7 (strongly agree), requiring students to respond to
statements such as, “Multiple choice
questions were easy to understand and
use.” I assessed student knowledge represented through the three formats (i.e.,
MC, CL, and CR) and captured the data
in the same SPSS (version 12.0.2 for
Windows, Chicago, IL) dataset.
RESULTS
To address the research questions, I
performed several statistical analyses.
First, I used difference of means tests to
determine to what degree student preferences between the three formats were
different or separable. I ran these tests
for all groups combined, then ran them
again across the demographic categories
for MS versus MBA, and finally for
Exams I, II, and III. I also ran a regression model to assess whether the various demographic factors had an effect
on student preference. The second
analysis involved the use of correlation
models to measure degree of association
between the knowledge as represented
by the two MC formats (traditional and
CL) and assumed true values as represented by the essay (CR) answers and
self-reported mastery.
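The study's analyses were run in SPSS; purely as an illustration, a comparable paired-difference comparison could be computed as follows (the ratings below are hypothetical, not the study's data):

    import numpy as np
    from scipy import stats

    # Hypothetical per-student ratings (1-7 Likert) of one criterion, e.g. ease
    # of use, for each question format; each student rates all three formats.
    cr = np.array([6, 5, 6, 7, 5, 6, 6, 5, 7, 6], dtype=float)  # constructed response
    mc = np.array([5, 5, 4, 6, 5, 5, 4, 5, 6, 5], dtype=float)  # multiple choice
    cl = np.array([5, 4, 5, 6, 4, 5, 5, 4, 6, 5], dtype=float)  # confidence level

    # Paired-difference t tests between formats, mirroring the difference-of-means
    # comparisons described above (formats judged separable when p < .10).
    for label, a, b in (("CR vs. MC", cr, mc), ("CR vs. CL", cr, cl), ("MC vs. CL", mc, cl)):
        result = stats.ttest_rel(a, b)
        print(f"{label}: t = {result.statistic:.2f}, p = {result.pvalue:.3f}")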
Preference
The students demonstrated a strong,
consistent preference for the CR format on all three dimensions of ease of
use, measurement accuracy, and likability (see Table 2). Paired-difference t tests were run in SPSS (version 12.0 for Windows). Preferences were organized from most preferable to least
preferable and grouped by whether
there were statistically significant differences (α < .10) between the means.
Table 2 (results for all groups and
exams) shows that for ease of use, CR
was statistically significantly preferred
over both MC and CL, which could not
be separated statistically from each
other. General likability had similar
results. For measurement accuracy, CR
was still superior to both MC and CL,
but CL was statistically separable from
MC as well.
Next, I ran the paired difference t
tests within the subgroups of the MS
students and MBA students separately.
The results were very similar between
the two groups. For ease of use, students
in both groups preferred the CR format
and were indifferent in their preference
for MC and CL (not statistically significant). For measurement accuracy, the
MS group favored CR over CL and CL
over MC, while the MBA group preferred CR over both CL and MC.
Results for the overall preference (likability) were identical, which suggests
that the type of program may have some
effect on students’ perceptions of testing
formats.
The third set of difference of means
tests contrasted the results between the
three exams. Note that two separate
effects are being picked up by this contrast: The coupling of the question formats (identical vs. similar questions)
and the learning effect over time as students develop familiarity with each successive test. No attempt was made to
directly account for these potentially
interacting effects; however, ease of use alone could be considered a surrogate measure for increasing familiarity between the three exams. If ease of use scores for CL increase over repeated tests, this should indicate changes in the learning effect.

TABLE 2. Paired t-Test Results for Master of Business Administration and Master of Science Students' Preferences for Question Format: Overall Ease, Accuracy, and Likability

Measure | Constructed response | Multiple choice | Confidence level
Ease of use | 5.61+ | 5.03 | 4.98
Measurement accuracy | 5.53+ | 4.52 | 4.74
Likability | 5.02+ | 4.03 | 4.16

+p < .10.
The results appeared consistent
across the three exams. For ease of use,
students consistently preferred CR over
MC and CL, while no statistically significant differences existed between
MC and CL. An interesting finding is
that the means decreased over time for
the CR and CL formats. I found similar
results for measurement accuracy. For
overall likability, the first and second
exams reflected the same CR over CL
and MC pattern as ease of use. However, for the third exam, the CL preference
increased to statistically tie with CR, so
two groups (CR and CL; CL and MC)
were formed.
Overall, the results of the difference
of means tests indicated a fairly consistent student preference for CR format
questions on all three criteria. This pattern did not change according to type of
program (i.e., MS or MBA) or across
the three exams. The CL format seemed
fairly even with (statistically inseparable from) the MC format, except for the
MS group and for the third exam, where
it seemed to have some desirability over
MC in terms of measurement accuracy
and overall likability. A regression
model predicting preference with either
MBA versus MS or degree of coupling
provided neither statistical nor practical
significance, supporting the results of
the difference of means tests.
Accuracy
I performed biserial correlations
(both parametric and nonparametric) to
assess the relationships between the
question formats (MC and CL) and
some presumed true value of knowledge. The answers to CR items and students’ posttest self-reported mastery
(PoSM) represented the presumed true
value of knowledge in the individual
domains tested. All relationships were
found to be statistically significant at
the α < .05 level (see Table 3).
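As an illustration of the kind of association being measured, a point-biserial (parametric) and Spearman (nonparametric) correlation between a binary item outcome and a continuous knowledge surrogate can be computed as follows (hypothetical data, not the study's):

    import numpy as np
    from scipy import stats

    # Hypothetical item-level records for one knowledge domain: a binary
    # right/wrong MC outcome and a continuous surrogate for true knowledge
    # (a CR score or a self-reported mastery rating).
    mc_correct = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
    true_knowledge = np.array([0.9, 0.4, 0.7, 0.8, 0.3, 0.6, 0.5, 0.9, 0.7, 0.2])

    r_pb, p_parametric = stats.pointbiserialr(mc_correct, true_knowledge)  # parametric
    rho, p_nonparametric = stats.spearmanr(mc_correct, true_knowledge)     # nonparametric
    print(f"point-biserial r = {r_pb:.3f} (p = {p_parametric:.3f}); "
          f"Spearman rho = {rho:.3f} (p = {p_nonparametric:.3f})")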
Using PoSM as a surrogate for true knowledge, we see that the traditional MC format has a slight advantage over CL using both parametric (.198 vs. .109) and nonparametric (.215 vs. .186) correlations. However, when the CR answers are used as a surrogate for measuring true knowledge, the opposite is true. CL shows an advantage over MC by a slightly wider margin on both parametric (.297 vs. .165) and nonparametric (.378 vs. .146) tests of association. This difference in result could be explained by several theories. The PoSM measure of true knowledge is subject to much bias and potentially overlaps with student preference, while the CR measure is not subject to student (self-report) issues. Because CR responses have prima facie validity and general acceptance in the literature as being the best measure of true knowledge, I preferred to use this measure. The findings, therefore, indicate improved accuracy of the CL format over the MC.
Efficacy
Another interesting question involves the assessment of student perceptions of student–instructor competence in learning or teaching in the individual knowledge domains. Table 4 presents the results of biserial correlations between student-reported pretest subject mastery (PrSM), posttest self-reported mastery (PoSM), pretest instructor performance (PrIP), and posttest instructor performance (PoIP).
The results indicated strong agreement between pre- and posttest subject mastery as reported by students (PrSM–PoSM at .618). Similar strong agreement existed between pre- and posttest instructor performance (PrIP–PoIP at .648). However, slightly less agreement existed between instructor performance and subject mastery. Both before and after taking the exam, IP and SM associated at relatively high levels (pretest at .513; posttest at .593). Finally, it is interesting to note that the association between instructor performance and true knowledge (represented by CR) was statistically significant for either pre- or posttest measures (not shown in Table 4).

TABLE 3. Pearson Parametric (r) and Spearman Nonparametric (rs) Correlations for Accuracy of Multiple Choice and Confidence Level Testing

Variable | 1 | 2 | 3 | 4
1. Multiple choice | — | .249a | .198a | .165a
2. Confidence level | .356a | — | .109b | .297a
3. Self-reported mastery | .215a | .186a | — | .190a
4. Constructed response | .146c | .378a | .159c | —

Note. Pearson r values appear above the diagonal; Spearman rs values appear below the diagonal.
a p = .000; b p = .039; c p = .002.
*p < .05.

TABLE 4. Pearson Parametric (r) and Spearman Nonparametric (rs) Correlations for Pretest Versus Posttest Knowledge Assessment

Variable | 1 | 2 | 3 | 4
1. Pretest subject mastery | — | .618a | .513a | .400a
2. Posttest subject mastery | .589a | — | .376b | .593a
3. Pretest instructor performance | .509a | .366a | — | .648a
4. Posttest instructor performance | .391c | .621a | .613c | —

Note. Pearson r values appear above the diagonal; Spearman rs values appear below the diagonal.
a p = .000; b p = .039; c p = .002.

DISCUSSION
The overall research questions guiding
this investigation were “How do graduate
students perceive the relative ease of use
of, accuracy of the measure of, and general preference for traditional MC, IRT,
and CR formats for assessing student
knowledge?” and “Which MC format
(traditional vs. CL) provided better accuracy in terms of association with self-reported mastery and the answers to CR (essay) format questions?” I was interested in both the acceptance level and the relative accuracy of the proposed CL format in order to more thoroughly assess its suitability for use in the classroom.
Student feedback indicated a strong
preference for CR format questions
across all three criteria and a lower preference for traditional MC and CL formats. These preferences do not seem to
be dependent on type of program (i.e.,
MS or MBA) or the degree of coupling
of the questions (high, medium, or low).
Students demonstrated consistent
ambivalence between traditional MC
and the proposed IRT formats.
However, student preference is only
one factor for educators to consider
when employing a testing format. The
accuracy of the assessment instrument
and the value of the information provided are also important considerations. Evidence indicates that the CL
format offers advantages over traditional MC in terms of measurement
accuracy, as reflected in both student
opinion and comparison with CR
items. In addition, the information provided by the CL items offers a pedagogical advantage over traditional MC
formats in the quality and richness of feedback provided. The results of this
initial study suggest that, among these
students, acceptance of the CL format
was at least as high as the level of
acceptance for the traditional MC format and is therefore not an adoption concern. The decision to employ the CL
format then depends on the trade-off
between the value of the improved
information and accuracy and the
administrative burden of rewriting
questions and modifying procedures.
NOTE
Correspondence concerning this article should
be addressed to Stephen M. Swartz, Assistant Professor of Logistics Management, Department of
Marketing and Logistics, University of North
Texas, Denton, TX. E-mail: swartzs@unt.edu
REFERENCES
Becker, W. E., & Johnston, C. (1999). The relationship between multiple choice and essay
response questions in assessing economics
understanding. The Economic Record, 75,
348–357.
Bennett, R. E., Rock, D. A., & Wang, M. (1991).
Equivalence of free-response and multiple
choice items. Journal of Educational Measurement, 28(1), 77–92.
Bruno, J. E. (1986). Assessing the knowledge base
of students: An information theoretic approach
to testing. Measurement and Evaluation in
Counseling and Development, 18, 116–130.
Bruno, J. E., Holland, J. R., & Ward, J. W. (1988).
Enhancing academic support services for special action students: An application of information referenced testing. Measurement and Evaluation in Counseling and Development, 21(1),
5–13.
Bruno, J. E., & Dirkzwager, A. (1995). Determining the optimal number of alternatives to a multiple-choice test item: An information theoretic
perspective. Educational and Psychological
Measurement, 55, 959–966.
Conderman, G. (2001). Program evaluation:
Using multiple assessment methods to promote
authentic student learning and curricular change.
Teacher Education and Special Education, 24,
391–394.
Haladyna, T. M. (1999). Developing and validating multiple-choice test items. Mahwah, NJ: Lawrence Erlbaum Associates.
Hansen, J. D. (1997). Quality multiple-choice test
questions: Item-writing guidelines and an
analysis of auditing testbanks. Journal of Education for Business, 73, 94–97.
Hassman, P., & Hunt, D. P. (1994). Human self-assessment in multiple-choice testing. Journal
of Educational Measurement, 31, 149–160.
Larson, E. D. (2003). An analysis of information
referenced testing as an air force assessment
tool. Unpublished master’s thesis, Air Force
Institute of Technology, Dayton, OH.
Madaus, G. F., & O’Dwyer, L. M. (1999). A short
history of performance assessment: Lessons
learned. Phi Delta Kappan, 80, 688–695.
Miller, H. G., Williams, R. G., & Haladyna, T. M.
(1978). Beyond facts: Objective ways to measure thinking. Englewood Cliffs, NJ: Educational Technology Publications.
Pomplun, M., & Omar, M. D. H. (1997). Multiple-mark items: An alternative objective item
format? Educational and Psychological Measurement, 57, 949–962.
Powell, J. L. (1989). How well do tests measure
real reading? Bloomington, IN: ERIC Clearinghouse on Reading and Communication Skills.
(ERIC Document Reproduction Service No.
ED 306552)
Rogers, W. T., & Harley, D. (1999). An empirical
comparison of three and four choice items and
tests: Susceptibility to testwiseness and internal
consistency reliability. Educational and Psychological Measurement, 59, 234–247.
Rogers, W. T., & Ndalichako, J. (1997). Comparison of finite state score theory, classical test
theory, and item response theory in scoring
multiple-choice items. Educational and Psychological Measurement, 57, 580–589.
Rogers, W. T., & Ndalichako, J. (2000). Number-right, item-response, and finite-states scoring:
Robustness with respect to lack of equally classifiable option and item option independence.
Educational and Psychological Measurement,
60, 5–9.
Sidick, J. T., Barrett, G. V., & Doverspike, D.
(1994). Three-alternative multiple choice tests:
An attractive option. Personnel Psychology, 47,
829–835.
Tversky, A. (1964). On the optimal number of
alternatives of a choice point. Journal of Mathematical Psychology, 1, 386–391.
Wood, W. C. (1998). Linked multiple-choice
questions: The tradeoff between measurement
accuracy and grading time. Journal of Education for Business, 74, 83–86.