
An Empirical Assessment of Measurement
Error in Health-Care Survey Research
Debi Prasad Mishra
STATE UNIVERSITY OF NEW YORK AT BINGHAMTON

This article estimates the average amount of error variance in a typical
health-care measure by reanalyzing published multitrait–multimethod
(MTMM) matrices via the confirmatory factor analysis (CFA) procedure.
For the set of studies under consideration, a typical health-care measure
contains a relatively high proportion (64%) of error variance (method
and random). The magnitude of this error variance estimate seems to be
higher than corresponding variance estimates reported in the other social
sciences. Using principles from psychometric theory, we note how measurement error can confound validity, and we provide directions to researchers
for improving the quality of survey-based health-care measures. J Busn Res 2000. 48.193–205. © 2000 Elsevier Science Inc. All rights reserved.

Address correspondence to Debi Prasad Mishra, School of Management, State University of New York, Binghamton, NY 13902, USA.

In recent years, the use of paper and pencil items for measuring theoretical concepts has become widespread in the
health-care discipline. For example, researchers have used
survey-based questionnaires for gauging theoretical constructs involving patient satisfaction (Singh, 1991), depression (Beck,
1967), sleep patterns (Fontaine, 1989), juvenile health (Greenbaum, Dedrick, Prange, and Friedman, 1994; Marsh and
Gouvernet, 1989), family health (Sidani and Jones, 1995),
caregiving (Phillips, Rempusheski, and Morrison, 1989), nursing (Gustafson, Sainfort, Van Konigsveld, and Zimmerman,
1990), and total quality management in hospitals (Lammers,
Cretin, Gilman, and Calingo, 1996), among others.
Although the use of subjective paper and pencil measures
has advanced our understanding of theoretical concepts in a
number of health-care areas, it has also been recognized that
the presence of error in such measures can seriously undermine the reliability and validity of concepts being measured
(Bagozzi, Yi, and Phillips, 1991; Bagozzi and Yi, 1990; Cole,
Howard, and Maxwell, 1981; Cote and Buckley, 1987;
Churchill, 1979; Figuerdo, Ferketich, and Knapp, 1991; Ferketich, Figuerdo, and Knapp, 1991; Lowe and Ryan-Wenger,
1992; Waltz, Strickland, and Lenz, 1984; Williams, Cote, and Buckley, 1989). Commenting on the deleterious impact of
measurement error on validity, Lowe and Ryan-Wenger (1992)
note that “the discipline of nursing needs to be more sophisticated in its study of construct validity,” and that “the questions
of convergence and discrimination are central to the validity
of the instruments being developed” (p. 74). Likewise, Cote
and Buckley (1987) observe that “measures used in social
science research have a large error component” (p. 318), and
go on to say that “in the future, researchers must be more
resolute in their desire to develop construct measures that are
valid and free of measurement error” (p. 318).
Given the potential of measurement error to confound the
reliability and validity of constructs seriously, we would expect
that summary psychometric analyses (cf. Cote and Buckley,
1987) of paper and pencil items used in health-care research
would be readily available. However, a careful scrutiny of the
literature belies this expectation and underscores the paucity
of research on measurement and validity issues in health-care
research. For example, in the context of nursing, Gustafson,
Sainfort, Van Konigsveld, and Zimmerman (1990) note that “there have been few detailed evaluations of measures of quality of care in nursing homes” and that “much research on
factors affecting nursing home quality has used measures of
questionable reliability and validity . . . some measures currently in use have been developed using methodologies not
based on solid conceptual grounds, offering little reason to
expect them to have much internal or external validity” (p. 97).
Our objective in this paper is twofold: (1) to provide empirical estimates of the amount of measurement error in paper
and pencil health-care measures; and in a related vein, (2) to
illustrate the use of the confirmatory factor analysis (CFA)
method for modeling measurement error in health-care research. By conducting such an empirical analysis, we seek to
address three important issues in the health-care field. First,
by ascertaining the pervasiveness of measurement error and
understanding its deleterious impact upon construct validity,
health-care researchers will be in a position to interpret properly the results of statistical analyses conducted to uncover
relationships among theoretical constructs. For example, Cote


Figure 1. Process for establishing construct validity of latent constructs.

and Buckley (1987) found that on average, measures in the
social sciences contained 41.7% trait variance, 26.3% method
variance, and 32% error variance. Consequently, using these
estimates, Cote and Buckley reported that “the true value for
the slope in a two-variable (regression) case would be 2.4
times the estimated value for measures containing 50% error
variance” (p. 318). In other words, interpreting statistical results without assessing measurement error can seriously undermine construct validity. Second, much of the concern over
an appropriate regulatory role in health care has focused on
efforts to define, measure, and ensure the quality of care of
nursing homes and hospitals (Lemke and Moos, 1986). Given
that quality of care measures are often of the paper and pencil
type, greater attention to measurement issues is warranted.
Finally, there is a growing need for health-care institutions to
adopt “total quality management” (TQM) practices for managing the efficient delivery of health care (Casalou, 1991; Lammers, Cretin, Gilman, and Calingo, 1996). The central theme of TQM is patient satisfaction, which is a latent construct that
cannot be measured directly. In other words, researchers must
rely upon valid measures of patient satisfaction for implementing TQM practices. Again, the presence of measurement error
may seriously impinge upon efforts to implement, analyze,
and modify TQM efforts in health care.
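Returning to the Cote and Buckley example above, the sketch below makes the attenuation arithmetic concrete by applying the textbook correction for attenuation. It is a simplified illustration rather than Cote and Buckley's exact computation (their 2.4 figure also reflects correlated method variance), and all numbers are invented.

```python
def disattenuated_slope(observed_slope: float, reliability_x: float) -> float:
    """Classic errors-in-variables result: random error in the predictor
    shrinks the observed regression slope by the predictor's reliability,
    so the true slope is recovered as b_true = b_obs / reliability."""
    return observed_slope / reliability_x

# A measure with 50% error variance has reliability .50, so under this
# simple model the true slope is twice the estimated one (invented numbers).
print(disattenuated_slope(observed_slope=0.30, reliability_x=0.50))  # 0.6
```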
In keeping with our objectives, this paper is organized as
follows. The first section offers a perspective on construct
validity and ties in the notion of measurement error with
related psychometric concepts (i.e., scaling, validity, reliability). The next section details the plan of data analysis. Specifically, the nested confirmatory factor analysis approach of
Widaman (1985) based on MTMM (multitrait–multimethod)
matrices (Campbell and Fiske, 1959) is outlined. The third
section describes those health-care studies from which MTMM
data were reanalyzed. This is followed by a discussion of the
results of our analysis. Finally, we describe the limitations of
our study together with some implications for future research.

A Psychometric Perspective on
Construct Validity

According to Churchill (1979), “Construct validity, which lies
at the very heart of the scientific process, is most directly related to the question of what the instrument is in fact measuring—what construct, trait, or concept underlies a person’s
performance or score on a measure” (p. 70). In other words,
constructs are latent entities constructed by researchers for
describing and understanding an underlying phenomenon of
interest. Because these constructs are latent, they can only be
indirectly determined by using a set of observable measures.
The fundamental construct validity question can be phrased
as follows. Given that a set of items purportedly measures a latent
construct, how confident is the researcher about a correspondence
between these measures and the construct in question? For example, when a researcher measures depression by using physiological indicators (e.g., heartbeat), can he or she assert that heartbeat is in fact a valid measure of anxiety?
Although the fundamental premise of construct validity is
rather simple, investigators must pay careful attention to the
underlying psychometric processes that are related to validity.
More specifically, the relationship between scaling, reliability,
measurement error, and validity must be fully explored by researchers. The following paragraphs discuss the interrelationship between validity and other psychometric concepts
(scaling, reliability, measurement error). Let us consider the
conceptual process involved in measuring a latent construct
as depicted in Figure 1.
As a first step, a researcher specifies the domain of the
construct from which items are sampled to form a scale. The
process of selecting a relevant subset of items from a universal
set of items is termed “domain sampling” (Nunnally, 1978).
Note that this universal set is an infinite set of all possible
items that tap the latent construct. Because of constraints of
time and costs, a researcher typically selects a few representative items from the universal set to form a “scale” where respondents assign numbers to questions. This process of transforming responses into scores is called scaling. Scores obtained
from scales are called observed scores. Before proceeding further, researchers determine the relationship of the observed
score to the score that would have been obtained had all
items from the infinite set (i.e., the construct domain) been
administered to an infinite set of respondents across an infinite
set of situations. The score on such an infinite set (of items,
situations, and respondents) is called the true score.
The relationship between the observed score (from a finite
set) and the true score (from an infinite set) is called reliability.
Mathematically, reliability is calculated as the ratio of the true-score variance to the observed-score variance (Churchill, 1979). Note
that high reliability for a set of items merely implies that
the correlations among items constituting a scale reflect an
underlying latent factor. In other words, if one administered
a scale with high reliability across multiple situations and
respondents, the amount of variance in the observed score
would be low; that is, the scale would be consistent and repeatable. Whether a scale exhibiting high reliability actually measures the underlying construct of interest is a question of
validity. For example, as Nunnally (1978) illustrates, “how far stones were tossed on one occasion might correlate highly
with how far they were tossed on another occasion, and thus,
being repeatable, the measure would be highly reliable; but
obviously the tossing of stones would not constitute a valid
measure of intelligence” (pp. 191–192). In other words, “reliability is a necessary but not sufficient condition for validity”
(Nunnally, 1978, p. 192).
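A minimal simulation makes the true-score logic concrete. Assuming the observed score is simply a true score plus independent random error, reliability is the share of observed-score variance attributable to the true score; all names and numbers below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

true_scores = rng.normal(loc=50.0, scale=10.0, size=100_000)  # latent trait
errors = rng.normal(loc=0.0, scale=5.0, size=100_000)         # random error
observed = true_scores + errors                               # observed score

# Reliability: true-score variance as a proportion of observed variance.
reliability = true_scores.var() / observed.var()
print(f"reliability = {reliability:.2f}")  # about 100 / (100 + 25) = 0.80
```

The same arithmetic holds for Nunnally's stone-tossing example: the tosses can be highly repeatable (reliable) while measuring nothing about intelligence (invalid).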


Measurement Error and Validity
According to classical test theory (Gulliksen, 1950), the variance can be partitioned into trait variance and error variance.
Furthermore, error variance can be subdivided into method
variance and random error variance. Mathematically, this relationship is expressed as Equation (1):
σo = σt + σm + σr                                        (1)

where σo is the observed variance, σt the trait (or valid) variance, σm the method variance, and σr the random error variance. Construct validity entails explaining the trait variance
only. Note that method variance and random error variance
may serve to attenuate or inflate theoretical relationships
among measures (Cote and Buckley, 1988). To this extent, a
rigorous estimation of method variance and random error
variance should necessarily precede any attempt at investigating construct validity.
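As a worked instance of Equation (1), standardize the observed variance to 1 and plug in the average shares that Cote and Buckley (1987) report for social science measures:

```latex
\sigma_o = \underbrace{0.417}_{\text{trait}}
         + \underbrace{0.263}_{\text{method}}
         + \underbrace{0.320}_{\text{random error}} = 1.00
```

Under this decomposition, less than half of what a typical instrument records is valid trait variance.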
The effect of measurement error (method variance and
random error variance) can be highlighted with the following
example. Let us assume that a researcher administers the
Beck Depression Inventory (BDI; Beck, 1967) to respondents.
Furthermore, let us assume that approximately half the items

on the BDI are negatively scored. It is entirely possible that
some items on the scale may correlate highly (and consequently load on a latent factor) just because of the particular
method (i.e., negatively worded items). In such a situation,
it is erroneous to conclude that the set of negatively worded
items measure a trait factor. On the other hand, these items
merely tap a method factor. In a similar vein, we can visualize
how random error (respondent fatigue, length of questions,
format) may also produce a spurious factor. In sum, measurement error seriously compromises the construct validity of
measures.
Given the potential of measurement error to undermine
construct validity severely, researchers first must estimate the
amount of error variance (both method variance and random
error variance) in their measures before addressing the question of validity. Such an exercise will help investigators ascertain the degree to which measurement error bias may be
present. Furthermore, health-care researchers will also be
compelled to make use of such statistical estimation techniques as latent variables structural equations modeling (Bentler, 1992), which explicitly permits the estimation of measurement error while studying theoretical relationships among
constructs.


Figure 2. Nested confirmatory factor analysis models for analyzing MTMM data. ξ1, ξ2, ξ3, ξ4 = trait factors; ξ5, ξ6 = method factors; λ = factor loading; φ = factor correlation; x = measured variable; δ = random error.

Method
The Use of MTMM Matrices for
Estimating Variance Components
Per conventional research practice (Bagozzi and Yi, 1991;
Bentler, 1992; Byrne, 1994; Cote and Buckley, 1987, 1988;
Williams, Cote, and Buckley, 1989), variance components
of measures can be estimated by analyzing MTMM matrices
(Campbell and Fiske, 1959) by means of CFA. Briefly, an MTMM matrix represents intercorrelations of more than one
trait measured by more than one method. In its original formulation, Campbell and Fiske (1959) assumed that methods
were maximally dissimilar and that traits and methods were
not correlated.
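To fix ideas, the sketch below lays out a hypothetical 2-trait × 2-method MTMM correlation matrix and reads off the entries Campbell and Fiske (1959) focus on; all correlations are invented for illustration.

```python
import numpy as np

# Rows/columns: trait 1 by method 1 (T1M1), then T2M1, T1M2, T2M2.
mtmm = np.array([
    [1.00, 0.30, 0.55, 0.20],  # T1M1
    [0.30, 1.00, 0.25, 0.50],  # T2M1
    [0.55, 0.25, 1.00, 0.28],  # T1M2
    [0.20, 0.50, 0.28, 1.00],  # T2M2
])

# Convergent validities ("validity diagonal"): same trait, different methods.
print("T1 across methods:", mtmm[2, 0])  # 0.55
print("T2 across methods:", mtmm[3, 1])  # 0.50

# Heterotrait-monomethod correlations: different traits, same method. Large
# values here relative to the validity diagonal signal method variance.
print("within method 1:", mtmm[1, 0])    # 0.30
print("within method 2:", mtmm[3, 2])    # 0.28
```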
Despite the widespread use of MTMM matrices for assessing
construct validity of measures (Fiske and Campbell, 1992),
researchers (Lowe and Ryan-Wenger, 1992; Marsh, 1989;
Marsh and Hocevar, 1988; Schmitt and Stutts, 1986) have

pointed out several problems with Campbell and Fiske’s
(1959) method (see Schmitt and Stutts, 1986 for a good
discussion). Most of the criticism against the MTMM technique
has been directed at its underlying assumption of the use
of “maximally dissimilar methods” (Marsh, 1989), because
methods are rarely maximally dissimilar in practice. In this
vein, Williams, Cote, and Buckley (1989) note that “if maximally dissimilar methods are not used in MTMM analysis, it
becomes difficult to identify the existence of methods effects,”
(p. 462). Furthermore, the inability of the MTMM technique
to provide meaningful statistical estimates of different variance
components of measures (i.e., trait, method, and random error) provides, at best, an incomplete assessment of validity
(Schmitt and Stutts, 1986).
In contrast to the well-documented shortcomings of the
MTMM technique, CFA of MTMM data represents a viable
alternative for assessing the construct validity of measures and
also for partitioning the over-all variance of a measure into its various components. The present study empirically assesses
the amount of measurement error in measures by using CFA
to reanalyze published MTMM matrices in health care. The
nested CFA approach that formed the basis for analyzing
MTMM data is described next.

Nested CFA Approach
The nested CFA approach subsumes two interrelated techniques for partitioning an item’s total variance: (1) estimating
a sequence of successively restricted nested models; and (2)
comparing these nested models to determine incremental
goodness-of-fit.
For each of the studies in the health-care area that used
the MTMM approach, the amount of measurement error in
indicators of latent constructs was assessed by estimating a set
of four hierarchically nested models per guidelines originally
offered by Widaman (1985) and subsequent modifications
suggested in recent research that uses the MTMM technique
(Bagozzi and Yi, 1990, 1991; Bagozzi, Yi, and Phillips, 1991;
Williams, Cote, and Buckley, 1989). To implement the nested
approach, four confirmatory factor analysis models were estimated and compared to yield meaningful tests of hypotheses.
The nested models that show the possible sources of variation
in measures and their representation in a confirmatory factor
analytic model are graphically depicted in Figure 2 and described in more detail below.
Model 1 (M1) is a “null” model that assumes the intercorrelations among measures can be explained by random error only and that no trait or method factors are present. This null model is depicted in Panel A of Figure 2.
Model 2 (M2) is a “trait-only” model that hypothesizes variation in measures can be explained completely by traits assumed to be correlated and by random error. Panel B in Figure 2 depicts the trait-only model.
Model 3 (M3), depicted in Panel C of Figure 2, is a “method-only” model that hypothesizes variation in measures can be completely explained by correlated method factors plus random error.
Model 4 (M4) is a “trait and method” model where variance among measures is explained completely by traits, methods, and random error, with freely estimated correlations among trait factors and correlated method factors. Panel D in Figure 2 depicts the trait and method model.
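As a concrete sketch of M4's structure, the snippet below builds the model-implied covariance matrix under the standard CFA decomposition Σ = ΛΦΛ′ + Θ for two traits and two methods; the loadings, factor correlations, and error variances are invented for illustration.

```python
import numpy as np

# Factor loadings; columns are trait 1, trait 2, method 1, method 2.
lam = np.array([
    [0.7, 0.0, 0.4, 0.0],  # x1: trait 1 measured by method 1
    [0.0, 0.6, 0.5, 0.0],  # x2: trait 2 measured by method 1
    [0.6, 0.0, 0.0, 0.5],  # x3: trait 1 measured by method 2
    [0.0, 0.7, 0.0, 0.4],  # x4: trait 2 measured by method 2
])

phi = np.eye(4)                            # factor correlation matrix
phi[0, 1] = phi[1, 0] = 0.3                # correlated traits
phi[2, 3] = phi[3, 2] = 0.2                # correlated methods
theta = np.diag([0.35, 0.39, 0.39, 0.35])  # random error variances

sigma = lam @ phi @ lam.T + theta          # model-implied covariance matrix
print(np.round(sigma, 3))                  # unit diagonal with these values
```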
To implement the nested approach and determine the best fitting model, a sequence of χ² difference tests was conducted. First, Model 2 (M2) was compared with the null model (M1) to determine whether trait factors were present or not. Model comparison entails computing the χ² difference between two models and calculating the corresponding difference in the degrees of freedom. A rejection of the null hypothesis (H0: the compared model offers no incremental fit of model to data) indicates that the compared model affords a better fit to the data than the baseline model. In the second step, M4
was compared with M3. A significant χ² difference test further
indicates that trait factors are present. Likewise, the presence
of method factors was determined by comparing M3 with M1 and M4 with M2, respectively. Once trait and method factors
were detected, a confirmatory factor analysis model was estimated and different variance components calculated per standard procedures (Cote and Buckley, 1987, 1988; Williams,
Cote, and Buckley, 1989).
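A minimal sketch of the χ² difference test follows, using the M2 versus M1 figures reported later for the Saylor, Finch, Baskin, Furey, and Kelly (1984) data (Table 1); scipy is assumed to be available.

```python
from scipy.stats import chi2

def chi_square_difference(chi2_restricted: float, df_restricted: int,
                          chi2_general: float, df_general: int):
    """Difference in chi-square and df between two nested models, with the
    p-value for the null hypothesis of no incremental fit."""
    delta_chi2 = chi2_restricted - chi2_general
    delta_df = df_restricted - df_general
    return delta_chi2, delta_df, chi2.sf(delta_chi2, delta_df)

# Null model M1: chi2(28) = 233; trait-only model M2: chi2(19) = 77.
print(chi_square_difference(233, 28, 77, 19))  # Δχ²(9) = 156, p << .0001
```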
While estimating the variance components of health-care
measures, we assume that measure variation is a linear combination of traits, methods, and error. In other words, we assume that trait and method factors do not interact. In contrast, recent
research (Bagozzi and Yi, 1991; Browne, 1989; Kim and Lee,
1997) using the direct product (DP) model permits the estimation of models where trait and method factors interact. The
multiplicative effect of trait and method factors on measure
variation may be justified when differential augmentation or
differential attenuation is expected to be present among trait
and method factors (Bagozzi and Yi, 1991). However, in the
absence of compelling reasons for considering these interactions, using multiplicative models may not be justified. Based
upon our interpretation of the studies from which data have been reanalyzed, we did not find any strong rationale for including trait and method interactions. Consequently, we have
refrained from specifying and estimating DP models.
Recall that from the set of hierarchically nested models
described earlier, M1 represents a situation where intercorrelations among measures are caused by random error alone. It
is highly unlikely that M1 alone can explain the data very
well, because various measures used in a study are likely to
stem from a conceptual framework and are expected to contain
some amount of trait and method variances as well. Likewise,
M3 holds that method factors and error variances alone can
explain all the variation in measures. This situation of method
and error variances accounting for all the variance in measures
does not seem reasonable, because a researcher who employs
the MTMM technique has likely used some theoretical guidelines to develop measures. Whereas M1 and M3 are not
related to theory at all, M2 suggests that measure variance is
a function of trait factors and error variance. The situation
depicted by M2 is also unlikely, because paper and pencil
measures used to measure latent constructs are expected to
contain some degree of method variance. Hence, M4, which
is a trait and method model, is expected to outperform M1,
M2, and M3. In other words, M4 is more representative of
actual data than the other nested models. Given these observations we hypothesize that:
H1: For the studies under consideration, the trait and
method model (M4), which holds that the variance
in measures of latent constructs can be explained by
traits, methods, and random error, will outperform
other model specifications represented by M1 (error only), M2 (trait plus error), and M3 (method plus error).

Figure 3. CFA model of MTMM data in Saylor, Finch, Baskin, Furey, and Kelly's (1984) study. ξ1, ξ2 = trait factors; ξ3, ξ4, ξ5, ξ6 = method factors; λ = factor loading; φ = factor correlation; x = measured variable; δ = random error.

Data Collection
To identify studies in the health-care area that have used
MTMM matrices for assessing the validity of measures, we
turned to several published sources and implemented a systematic search process along the lines followed by Cote and
Buckley (1987). First, we examined Turner’s (1981) article,
which lists 70 studies in which the MTMM method was used.
Second, we examined the PSYCHLIT database for the 1979–
1994 time period and searched for studies in the health-care
area by using “MTMM,” “multitrait,” and “multimethod” as
keywords. In addition, we also examined the MEDLINE and
ABI-INFORM databases for the 1980–1995 time period to
identify published studies that used the MTMM technique.
Finally, we visually examined all articles that appeared in
major health-care journals between 1990 and 1994 for evidence of use of the MTMM procedure. The major journals
that we examined were Research in Nursing and Health, Medical
Care, Heart and Lung, Journal of Health Care Marketing, Advances in Nursing Science, Journal of Consulting and Clinical
Psychology, Nursing Research, Journal of the American Medical
Association, Western Journal of Nursing Research, International Journal of Nursing Studies, and American Journal of Public
Health.
Based upon our search, we uncovered 19 studies that
claimed to have used the MTMM method for examining the
convergent and discriminant validity of measures. Of these
19 studies, 15 articles were excluded, because either they did
not publish the MTMM matrices at all (e.g., Greenbaum,
Dedrick, Prange, and Friedman, 1994) or published a partial
MTMM matrix (e.g., Sidani and Jones, 1995) that could not
be used for reanalyzing the data. Hence, our empirical analysis
is based upon correlation data provided in four different studies that have used the MTMM design.
Although the small number of studies that have used the MTMM method
in health-care research might seem atypical, a closer scrutiny
of available evidence from other streams in the social sciences
suggests that this is not so. For example, Bagozzi and Yi (1991)
reported that only four studies that used the MTMM design
were published over a 17-year period in the Journal of Consumer Research. Explaining the relatively infrequent use of the
MTMM method in social science research, Bagozzi and Yi
(1991) note that “one explanation [for this paucity of published studies] is the difficulty in obtaining multiple measures
of each construct in one’s theory and using different methods
to do so” (p. 427). In a similar vein, Lowe and Ryan-Wenger

Measurement Error in Health-Care Survey Research

(1992) in a review article note that over a 10-year period,
only six studies that reported the use of the MTMM procedure
were published in the major nursing journals. Given the relatively infrequent use of MTMM matrices for construct validation across a number of disciplines, the present dataset seems
adequate to estimate empirically the extent to which
measurement error is present in health-care research measures.

Description of Studies
A few studies in health-care research (e.g., Marsh and Gouvernet, 1989; Saylor, Finch, Baskin, Furey, and Kelly, 1984;
Sidani and Jones, 1995; Wolfe, Gentile, Michienzi, Sas, and
Wolfe, 1991) have used MTMM matrices for examining the
construct validity of measures. Specifically, the Saylor and
colleagues study investigates the construct validity for measures of “childhood depression” using an MTMM approach,
and the Wolfe and colleagues study develops and validates a
scale for assessing the impact of sexual abuse from the child’s
perspective. In both these articles, a reasonable degree of
“convergent” and “discriminant” validity was reported by the
authors. However, these studies did not use a CFA framework,
rendering the observed results prone to measurement error
biases. In fact, Wolfe and colleagues explicitly acknowledge
that “of concern is the magnitude of error variance . . . which
means that scores on the tests are more subject to unknown
sources of variance than they are to the recognized sources,”
and go on to say that “these results call for a subsequent
confirmatory factor analysis of the items” (p. 382). We hope
that the results of our analysis will answer some of their
concerns. We describe these aforementioned studies more
fully in the following paragraphs.
SAYLOR, FINCH, BASKIN, FUREY, AND KELLY (1984). This study
investigated the construct validity of childhood depression
measures. Specifically, two trait factors, anger and depression,
were measured by eight items (four each for anger and depression) and four methods. The CFA model pertaining to this
study is depicted in Figure 3. The first method used was a
“Children’s Depression and Anger” inventory (Kovacs, 1981;
Nelson and Finch, 1978). The second format was the “Peer
Nomination Inventory of Depression and Anger” (Finch and
Eastman, 1983; Lefkowitz and Tesiny, 1980). The last two
methods were the “Teacher Nominated Inventory of Depression and Anger” (Saylor, Finch, Baskin, Furey, and Kelly,
1984), and the “AML Rating Form” (Cowen, Door, Clarfield,
and Kreling, 1973), respectively. Construct validity of measures was investigated by means of an 8 × 8 multitrait (2 traits)–multimethod (4 methods) matrix based on 133 responses.
Interestingly, although the authors found some support for
convergent and discriminant validity, they also expressed concern over measurement error, which was an important source
of variance in the analysis of variance (ANOVA) model. Note,
however, that the authors did not use a CFA approach. Consequently, no estimate of the amount of measurement error
could be obtained.


WOLFE (1991). The authors developed the “Children's Impact of Traumatic Events Scale (CITES)” for assessing the impact of sexual abuse from the child's perspective. Seventy-eight scale items were used to measure four traits: PTSD (a composite factor of intrusive thoughts, avoidance, hyperarousal, and sexual anxiety), Social Reactions, Abuse Attributions, and Eroticism. Two methods were used for construct validation: (1) the CITES-R scale; and (2) a combined scale comprising items from a number of scales (i.e., SAFE, CBCL-PTSD, etc.; Wolfe, Gentile, Michienzi, Sas, and Wolfe, 1991). In a sample size of 60 respondents, the researchers found some evidence of convergent and discriminant validity but acknowledged that their results were prone to measurement error biases. The CFA model for the Wolfe study is shown in Figure 4.

MARSH AND GOUVERNET (1989). The construct validity of
children’s responses to two multidimensional self-concept
scales (methods) for measuring four traits (physical skill, social
skill, general skill, and cognitive skill) was assessed by means
of an MTMM analysis based on 508 responses. The authors
of this study found “strong support for both the convergent
and discriminant validity of responses to these two multidimensional self-concept instruments” (p. 63). Figure 5 illustrates the CFA model pertaining to the Marsh and Gouvernet
(1989) study.

SINGH (1991). In a study involving 368 subjects, the author used a pseudo-MTMM approach, where three different traits (i.e., physician satisfaction, hospital satisfaction, and insurance satisfaction) were measured by two different methods; namely, the over-all item and multi-item ratings. Convergent and discriminant validity of measures was evaluated by means of CFA. Although the MTMM matrix was not reported in the study, estimates of trait and method loadings were provided by the author, which, in turn, permits us to estimate the amount of trait, method, and error variances in the measures. Note that because Singh's (1991) study does not publish the pseudo-MTMM matrix, we do not reanalyze the MTMM matrix by CFA. On the other hand, we use the factor loadings (for traits and methods) reported in the study to compute variance estimates for the measures. Figure 6 depicts the CFA model for the Singh (1991) study.

MTMM matrices reported in the studies described above (with the exception of Singh, 1991) were reanalyzed by using the CFA procedure (Bentler, 1992). Specifically, the nested approach of Widaman (1985) described earlier was used to estimate a set of hierarchical (nested) factor analytic models. The results of our analysis are discussed in the next section.

Results
Model Estimation Results
We estimated four nested models: the null model (M1); the
trait model (M2); the method model (M3); and the trait and
method model (M4) using the EQS computer program (Bentler,
1992). The results of our analysis are depicted in Table 1.


Figure 4. CFA model of MTMM data in Wolfe, Gentile, Michienzi, Sas, and Wolfe's (1991) study. ξ1, ξ2, ξ3, ξ4 = trait factors; ξ5, ξ6 = method factors; λ = factor loading; φ = factor correlation; x = measured variable; δ = random error.

EQS was preferred over Lisrel VII (Jöreskog and Sörbom, 1989), because the EQS program automatically imposes
bounds on negative error variances (Heywood cases), which
are constrained at the lower limit of “0,” yielding stable parameter estimates. As Williams, Cote, and Buckley (1989) note
“Heywood cases are quite common in MTMM analyses using
confirmatory factor analyses” (p. 463). By using the EQS software we attempted to minimize estimation problems associated with negative error variances.
For each dataset we analyzed, the trait and method model
(M4) provides the best fit of data to theory, thereby supporting
H1. Specifically, different model comparisons yield significant
values of the χ² difference test for the various nested model comparisons. For example, in the Saylor, Finch, Baskin, Furey, and Kelly (1984) study, each model comparison is significant (M2–M1: Δχ² = 156, df = 9, p < .0001; M4–M3: Δχ² = 133, df = 14, p < .0001; M3–M1: Δχ² = 95, df = 9, p < .01; M4–M2: Δχ² = 72, df = 14, p < .01). Furthermore, for the Saylor study, the high (0.978) comparative fit index (CFI)
implies a good model fit (values of CFI closer to 1 are desirable;
Bentler, 1990). Similar results suggesting that intercorrelations
among measures are best explained by correlated traits, correlated methods, and random error were obtained for the Wolfe,
Gentile, Michienzi, Sas, and Wolfe (1991) and the Marsh and
Gouvernet (1989) studies. In other words, for the set of studies considered here, intercorrelations among items
can be best explained by a combination of random error,
method factors, and trait factors.
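For reference, the comparative fit index can be computed from the target and null model χ² values; the sketch below follows the usual noncentrality formula attributed to Bentler (1990).

```python
def comparative_fit_index(chi2_model: float, df_model: int,
                          chi2_null: float, df_null: int) -> float:
    """CFI = 1 - model noncentrality / null noncentrality, where each
    noncentrality is (chi2 - df) floored at zero."""
    model_nc = max(chi2_model - df_model, 0.0)
    null_nc = max(chi2_null - df_null, model_nc)
    return 1.0 if null_nc == 0 else 1.0 - model_nc / null_nc

# Marsh and Gouvernet (1989), trait-only model vs. null model (Table 1):
print(round(comparative_fit_index(415, 16, 2240, 28), 3))  # 0.82
```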
For estimating the amount of trait, method, and random
error variance in each measure, we squared the loading of
each item in the final model (M4) on its corresponding trait
and method factor. Error variances were freely estimated by
the EQS program. Note that for some items, error variances
had to be fixed at the lower bound of zero for model identification (Dillon, Kumar, and Mulani, 1987) and for avoiding
Heywood cases. The results of variance partitioning are detailed in Table 2.
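The partitioning step itself is simple arithmetic: with standardized measures, an item's trait variance is its squared trait loading, its method variance is its squared method loading, and the remainder is random error. The sketch below uses invented loadings.

```python
# item name: (trait loading, method loading) -- illustrative values only
items = {
    "item_1": (0.74, 0.45),
    "item_2": (0.39, 0.87),
}

for name, (trait_loading, method_loading) in items.items():
    trait_var = trait_loading ** 2
    method_var = method_loading ** 2
    # EQS bounds error variances at zero, so negative remainders are clipped.
    error_var = max(1.0 - trait_var - method_var, 0.0)
    print(f"{name}: trait {trait_var:.0%}, method {method_var:.0%}, "
          f"error {error_var:.0%}")
```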

Variance Partitioning Results
For the Saylor, Finch, Baskin, Furey, and Kelly (1984) study,
estimates of trait variance vary from a low of 7% to a high of
55%. Method variances are particularly high, ranging from
0.5 to 76%. Finally, error variances vary between 44 and
66% (not including those estimates fixed at zero for model
identification). In the Wolfe, Gentile, Michienzi, Sas, and
Wolfe (1991) study, a similar pattern of variance partitioning
is observed. Specifically, trait variances vary between 0.1 and
85%, and the range of method variance (7–79%) and random
error variance (1–81%) also seems to be high. In the Marsh
and Gouvernet (1989) study, trait variances vary from a low of 21% to a high of 77%, and both the method (17–50%) and error variance (1–34%) estimates seem moderately high.

Figure 5. CFA model of MTMM data in Marsh and Gouvernet (1989). ξ1, ξ2, ξ3, ξ4 = trait factors; ξ5, ξ6 = method factors; λ = factor loading; φ = factor correlation; x = measured variable; δ = random error.
For the Singh (1991) study, estimates of trait variances vary
from a low of 2% to a high of 74%, and the method (25–64%)
and error variance (2–53%) values seem to be high. Across
all studies (including two published studies that were not
reanalyzed), the mean values of trait (35.6%), method (30%),
and random error (34.4%) variances suggest the widespread
prevalence of measurement error in health-care measures. We
discuss these results in greater detail in the following sections.

Conclusions
This paper has two main objectives: (1) to estimate the amount
of measurement error in health-care research; and (2) to
illustrate the use of the CFA technique as a viable method
for modeling measurement error in health-care research. We
hoped to estimate empirically the degree of measurement error
in health-care measures so that researchers would be sensitized
to the interaction between construct validity and error variance.
The results of our reanalysis underline the presence of
high levels of measurement error in health-care research. In
particular, our findings suggest that paper and pencil measures

in health-care research seem to perform more poorly than
their social science counterparts in measuring trait variance.
Specifically, although the typical health-care measure in this
study contains an average of only 35.6% trait variance, Cote
and Buckley (1987) reported that, on average, measures in
the social sciences contained 41.7% trait variance.
We can only speculate on some of the possible reasons
for the presence of high measurement error in health-care
research. First, because some subdisciplines within health-care research (e.g., patient satisfaction, TQM) are relatively
new, a number of new instruments are being developed by
researchers to understand phenomena without paying adequate attention to the construct validity of these instruments.
Consequently, measures have not been successively refined
and validated. For example, many studies in the Journal of
Clinical Psychology represent attempts at developing new
scales. Such an emphasis on new scales is understandable
given the relative “youth” of the discipline where new concepts
and scales have to be developed for tackling the impact of
changing social mores on everyday life (e.g., industrialization
on depression; alternative life styles on AIDS). Although the
direction of research effort toward the development of new
scales is certainly justified, researchers should also pay greater


Figure 6. CFA model for MTMM data in Singh (1991). ξ1, ξ2, ξ3 = trait factors; ξ4, ξ5 = method factors; λ = factor loading; φ = factor correlation; x = measured variable; δ = random error.

attention to refining existing scales. To this extent, scale purification efforts may be greatly aided by placing a relatively
greater emphasis on replication studies (Hubbard and Armstrong, 1994).
Another reason for the presence of high amounts of measurement error seems to be the dominance of a paradigm that
has historically emphasized the assessment of technical aspects
of health-care quality. For example, a historical quality measure for nursing homes may involve computing composite
scores on such tangible and objective entities as room odor,
rodent control, and soiled linen handling (Gustafson, Sainfort,
Van Konigsveld, and Zimmerman, 1990). In contrast to the
use of such direct measures, today’s health-care quality measures are often more perceptual in nature (e.g., patient satisfaction). Consequently, while making a transition from the use of
direct measures to more perceptual measures, construct validation efforts have lagged behind.
How should health-care researchers improve upon the relatively poor quality of measures? We suggest a multipronged
approach to this problem. First, irrespective of the chronological stage of the discipline, health-care researchers should pay
more attention to construct validity issues. Specifically, rigorous and systematic attempts to validate measures must be

undertaken in line with guidelines established by psychometric theory (Nunnally, 1978). Several excellent illustrations of
the use of proper procedures for developing better measures
of latent constructs exist in the literature (Churchill, 1979;
Gerbing and Anderson, 1988; Kim and Lee, 1997; Singh
and Rhoads, 1991; Suen, 1990). Second, researchers should
undertake more replication studies in the future (Hubbard and
Armstrong, 1994). Specifically, researchers may use existing
scales as a basis for replication and consequent purification
of measures. Finally, in undertaking statistical analyses, researchers may make use of such techniques as covariance
structure modeling, which allows the decomposition and estimation of total variance in measures into its various error
components. Covariance structure modeling is easily implemented by using such a technique as EQS (Bentler 1992).
Although covariance structure models do not eliminate measurement error in measures, they provide less biased estimates
of true relationships among various theoretical constructs.

Limitations
The results of our study must be viewed against certain limitations. First, we obtained variance estimates from four MTMM matrices. As a result, our findings may not be sufficiently generalizable. However, as Bagozzi and Yi (1991) note, there is a paucity of published MTMM matrices in many disciplines.

Table 1. Statistics for Nested Models and Model Differences

Saylor, Finch, Baskin, Furey, and Kelly (1984)
  Null model (M1):                            χ²(28) = 233,   p < .001
  Trait only (M2):                            χ²(19) = 77,    p < .001, CFI = 0.668
  Method only (M3):                           χ²(19) = 138,   p < .001, CFI = 0.408
  Trait and method (M4):                      χ²(5) = 5,      p = 0.6,  CFI = 0.978
  Trait only vs. null (M2–M1):                Δχ²(9) = 156,   p < .001
  Trait and method vs. method only (M4–M3):   Δχ²(14) = 133,  p < .001
  Method only vs. null (M3–M1):               Δχ²(9) = 95,    p < .01
  Trait and method vs. trait only (M4–M2):    Δχ²(14) = 72,   p < .01

Wolfe, Gentile, Michienzi, Sas, and Wolfe (1991)
  Null model (M1):                            χ²(28) = 111,   p < .001
  Trait only (M2):                            χ²(23) = 81,    p < .001, CFI = 0.276
  Method only (M3):                           χ²(19) = 67,    p < .01,  CFI = 0.450
  Trait and method (M4):                      χ²(10) = 8,     p = .68,  CFI = 0.933
  Trait only vs. null (M2–M1):                Δχ²(5) = 30,    p < .01
  Trait and method vs. method only (M4–M3):   Δχ²(9) = 59,    p < .01
  Method only vs. null (M3–M1):               Δχ²(9) = 94,    p < .01
  Trait and method vs. trait only (M4–M2):    Δχ²(13) = 74,   p < .01

Marsh and Gouvernet (1989)
  Null model (M1):                            χ²(28) = 2240,  p < .001
  Trait only (M2):                            χ²(16) = 415,   p < .001, CFI = 0.820
  Method only (M3):                           χ²(21) = 1570,  p < .001, CFI = 0.300
  Trait and method (M4):                      χ²(7) = 32,     p < .01,  CFI = 0.989
  Trait only vs. null (M2–M1):                Δχ²(12) = 1825, p < .001
  Trait and method vs. method only (M4–M3):   Δχ²(14) = 1538, p < .001
  Method only vs. null (M3–M1):               Δχ²(7) = 67,    p < .01
  Trait and method vs. trait only (M4–M2):    Δχ²(9) = 383,   p < .01

Note: All models were estimated by the EQS software package; CFI per Bentler (1990).

To the extent that many studies have not measured traits
using different methods, we suspect that the overall picture
in terms of measurement error may be even worse. However,
the results of our analyses are in line with extant literature on
measurement error and validity in the social sciences (Blalock,
1982; Venkatraman and Grant, 1986), and may serve as a
starting point for calling more attention to construct validity
issues in health-care research.

Table 2. Variance Partitioning Results

Study (year) and item                    Trait (%)   Method (%)   Error (%)

Saylor, Finch, Baskin, Furey, and Kelly (1984)
  CIA                                        7           41          52
  CDI                                       15           29          56
  PNIA                                      54            2          44
  PNID                                      24           76          ——a
  TNIA                                      47            0.5        52.5
  TNID                                      27           73          ——a
  AML-A                                     55           45          ——a
  AML-M                                     24           10          66

Marsh and Gouvernet (1989)
  PHYSICAL 1                                64           20          16
  SOCIAL 1                                  49           20          31
  GENERAL 1                                 21           50          29
  COGNITIVE 1                               65           34           1
  PHYSICAL 2                                49           17          34
  SOCIAL 2                                  64           26          10
  GENERAL 2                                 77           22           1
  COGNITIVE 2                               21           50          29

Singh (1991)
  SAT 1                                     49           49           2
  SAT 2                                      2           49          49
  SAT 3                                     38           25          37
  SAT 4                                     74           49          ——a
  SAT 5                                     22           25          53
  SAT 6                                     36           25          39
  SAT 7                                     41           64          ——a
  SAT 8                                     49           49           2
  SAT 9                                     64           25          11

Wolfe et al. (1991)
  PTSD 1                                    46           52           2
  SOCRE 1                                    0.1         79          20.9
  ATTRIB 1                                  31           68           1
  EROTIC 1                                  13           12          75
  PTSD 2                                     8           11          81
  SOCRE 2                                   85           15          ——a
  ATTRIB 2                                   2           28          70
  EROTIC 2                                   7            7          76

Cole et al. (1981)
  SELFR 1                                   44           44          12
  SIGOT 1                                   38           41          21
  ROLEP 1                                   24           22          54
  INVIV 1                                   10            4          86
  SELFR 2                                   28           55          17
  SIGOT 2                                   45           41          14
  ROLEP 2                                   18           22          60
  INVIV 2                                   24            4          72

Fontaine (1989)b
  Overall                                   22            4          74

Over-all estimatesc                         35.6         30          34.4

a Value of this estimate was fixed at zero for identification purposes.
b Over-all value estimated by ANOVA by Lowe and Ryan-Wenger (1992).
c Fontaine (1989) excluded.

References

Bagozzi, Richard P., and Yi, Youjae: Assessing Method Variance in Multitrait–Multimethod Matrices: The Case of Self-Reported Affect and Perceptions at Work. Journal of Applied Psychology 75 (1990): 547–560.
Bagozzi, Richard P., and Yi, Youjae: Multitrait–Multimethod Matrices in Consumer Research. Journal of Consumer Research 17 (March 1991): 426–439.
Bagozzi, Richard P., Yi, Youjae, and Phillips, Lynn W.: Assessing Construct Validity in Organizational Research. Administrative Science Quarterly 36 (September 1991): 421–458.
Beck, A. T.: Depression: Clinical, Experimental, and Theoretical Aspects. Hoeber, New York. 1967.
Bentler, P. M.: Comparative Fit Indexes in Structural Models. Psychological Bulletin 107 (1990): 238–246.
Bentler, P. M.: EQS: Structural Equations Program Manual. BMDP Statistical Software, Los Angeles, CA. 1992.
Blalock, Hubert M.: Conceptualization and Measurement in the Social Sciences. Sage, Thousand Oaks, CA. 1982.
Browne, Michael W.: Relationship Between an Additive Model and a Multiplicative Model for Multitrait–Multimethod Matrices, in Multiway Data Analysis, R. Coppi and S. Bolasco, eds., North-Holland, Amsterdam. 1989, pp. 507–520.
Byrne, Barbara: Structural Equation Modeling with EQS and EQS/Windows. Sage, Thousand Oaks, CA. 1994.
Campbell, Donald T., and Fiske, Donald W.: Convergent and Discriminant Validation by the Multitrait–Multimethod Matrix. Psychological Bulletin 56 (March 1959): 81–105.
Casalou, Robert F.: Total Quality Management in Health Care. Hospital and Health Services Administration 36(1) (Spring 1991): 134–146.
Churchill, Gilbert A., Jr.: A Paradigm for Developing Better Measures of Marketing Constructs. Journal of Marketing Research 16 (February 1979): 64–72.
Cole, D. A., Howard, G. S., and Maxwell, S. E.: Effects of Mono versus Multiple Operationalization in Construct Validation Efforts. Journal of Consulting and Clinical Psychology 49 (1981): 395–405.
Cote, Joseph A., and Buckley, M. Ronald: Estimating Trait, Method,

and Error Variance: Generalizing Across 70 Construct Validation
Studies. Journal of Marketing Research 24 (August 1987): 315–318.
Cote, Joseph A., and Buckley, M. Ronald: Measurement Error and
Theory Testing in Consumer Research: An Illustration of the
Importance of Construct Validation. Journal of Consumer Research
14 (March 1988): 579–582.
Cowen, E., Door, D., Clarfield, S., and Kreling, B.: The AML: A Quick-Screening Device for Early Identification of School Maladaptation.
American Journal of Community Psychology 1 (1973): 12–35.
Dillon, William R., Kumar, A., and Mulani, N.: Offending Estimates
in Covariance Structure Analysis: Comments on the Causes of
and Solutions to Heywood Cases. Psychological Bulletin 101(1)
(1987): 126–135.
Ferketich, Sandra L., Figuerdo, Aurelio J., and Knapp, Thomas R.:
The Multitrait–Multimethod Approach to Construct Validity. Research in Nursing and Health 14 (August 1991): 315–320.
Figuerdo, Aurelio J., Ferketich, Sandra L., and Knapp, Thomas R.:
More on MTMM: The Role of Confirmatory Factor Analysis.
Research in Nursing and Health 14 (October 1991): 387–391.
Finch, A. J., and Eastman, E. S.: A Multimethod Approach to Measuring Anger in Children. Journal of Psychology 115 (1983): 55–60.
Fiske, Donald W., and Campbell, Donald T.: Citations Do Not Solve
Problems. Psychological Bulletin 112(3) (1992): 393–395.
Fontaine, D. K.: Measurement of Nocturnal Sleep Patterns in Trauma
Patients. Heart and Lung 18 (1989): 402–410.
Gerbing, David W., and Anderson, James C.: An Updated Paradigm
for Scale Development Incorporating Unidimensionality and Its
Assessment. Journal of Marketing Research 25 (May 1988): 186–
192.
Greenbaum, Paul E., Dedrick, Robert F., Prange, Mark E., and Friedman, Robert M.: Parent, Teacher, and Child Ratings of Problem
Behaviors of Youngsters With Serious Emotional Disturbances.
Psychological Assessment 6(2) (1994): 141–148.
Gulliksen, H.: Theory of Mental Tests, Wiley, New York. 1950.
Gustafson, David H., Sainfort, F. C., Van Konigsveld, Richard, and
Zimmerman, David R.: The Quality Assessment Index (QAI) for
Measuring Nursing Home Quality. Health Services Research 25(1)
(1990): 97–127.
Hubbard R., and Armstrong, J. S.: Replications and Extensions in
Marketing: Rarely Published but Quite Contrary. International
Journal of Research in Marketing 11 (1994): 233–248.
Jöreskog, Karl G., and Sörbom, Dag: Lisrel 7: User's Reference Guide. Scientific Software, Mooresville, IN. 1989.
Kim, Chankon, and Lee, Hanjon: Development of Family Triadic
Measures for Children’s Purchase Influence. Journal of Marketing
Research 35 (August 1997): 307–321.
Kovacs, M.: Rating Scales To Assess Depression in School-Age Children. Acta Paedopsychiatry 23 (1981): 437–457.
Lammers, John C., Cretin, Shan, Gilman, Stuart, and Calingo, E.:
Total Quality Management in Hospitals: The Contributions of
Commitment, Quality Councils, Teams, Budgets, and Training
to Perceived Improvement at Veterans Health Administration Hospitals. Medical Care 34(5) (1996): 463–478.
Lefkowitz, M. M., and Tesiny, E. P.: Assessment of Childhood Depression. Journal of Consulting and Clinical Psychology 48 (1980):
43-50.
Lemke, S., and Moos, R.: Quality of Residential Settings for Elderly
Adults. Journal of Gerontology 4(2) (1986): 268–276.
Lowe, Nancy K., and Ryan-Wenger, Nancy K.: Beyond Campbell and Fiske: Assessment of Convergent and Discriminant Validity. Research in Nursing and Health 15 (1992): 67–75.
Marsh, Herbert W.: Confirmatory Factor Analyses of Multitrait–
Multimethod Data: Many Problems and a Few Solutions. Applied
Psychological Measurement 13(4) (December 1989): 335–361.
Marsh, Herbert W., and Hocevar, Dennis: A New More Powerful
Approach to Multitrait–Multimethod Analyses: Application of Second-Order Confirmatory Factor Analysis. Journal of Applied Psychology 73(1) (1988): 107–117.
Marsh, Herbert W., and Gouvernet, Paul J.: Multidimensional SelfConcepts and Perceptions of Control: Construct Validation of Responses by Children. Journal of Educational Psychology 81(1) (1989):
57–69.
Nelson, W. M., and Finch, A. J.: The Children’s Inventory of Anger
(CIA). Unpublished manuscript, Medical University of South Carolina. 1978.
Nunnally, Jum C.: Psychometric Theory. McGraw-Hill, New York. 1978.
Phillips, L. R., Rempusheski, V. F., and Morrison, E.: Developing and
Testing the Beliefs about Caregiving Scale. Research in Nursing and
Health 12 (1989): 207–220.
Saylor, Conway Fleming, Finch, A. J. Jr., Baskin, Cathy Haas, Furey,
William, and Kelly, Mary Margaret: Construct Validity for Measures
of Childhood Depression: Application of Multitrait–Multimethod
Methodology. Journal of Consulting and Clinical Psychology 52(6)
(1984): 977–985.
Schmitt, Neal, and Stutts, Daniel M.: Methodology Review: Analysis of
Multitrait–Multimethod Matrices. Applied Psychological Measurement
10(1) (1986): 1–22.


Sidani, Souraya, and Jones, Elaine: Use of Multitrait–Multimethod
(MTMM) to Analyze Family Relational Data. Western Journal of
Nursing Research 17(5) (1995): 556–570.
Singh, Jagdip: U