Methods of Measuring Health-Care Service Quality
Hanjoon Lee
Linda M. Delene
Mary Anne Bunda
WESTERN MICHIGAN UNIVERSITY
Chankon Kim
SAINT MARY’S UNIVERSITY
Service quality is an elusive and abstract construct to measure, and extra
effort is required to establish a valid measure. This study investigates the
psychometric properties of three different measurements of health-care
service quality as assessed by physicians. The multitrait-multimethod
approach revealed that convergent validity was established for measures
based on the single-item global rating method and multi-item rating
method. On the other hand, almost no evidence of convergent validity
was found for the measures based on the constant-sum rating method.
Furthermore, discriminant validity for the seven health-care service quality dimensions measured by the three methods was not well established.
The high levels of interdimensional correlations found suggested that the
service quality dimensions may not be separable in a practical sense. The
study suggested an ongoing effort is needed to develop a new service
quality scale suitable to this unique service industry. J BUSN RES 2000.
48.233–246. © 2000 Elsevier Science Inc. All rights reserved.
The health-care delivery system has been undergoing
formidable challenges in the 1990s. Rapid movement
toward systems of managed care and integrated delivery networks has led health-care providers to recognize real
competition. To be successful or even survive in this hostile
environment, it is crucial to provide health-care recipients
with service that meets or exceeds their expectations. At the
same time, it is important to know which dimensions of
health-care services physicians believe are necessary to constitute excellent service. It is crucial to have a better understanding of service quality perceptions possessed by both recipients
and providers when shaping the health-care delivery system.
The traditional medical model has focused on the technical
nature of health-care events; the focus has been on the training
and updated skills of the physicians and the nature of the
Address correspondence to Hanjoon Lee, Marketing Department, Haworth
College of Business, Western Michigan University, Kalamazoo, Michigan
49008, USA.
actual medical outcome (O’Connor, Shewchuk, and Carney,
1994). A stream of services marketing research, however, has
looked at the relationship between the services expected and
the service actually perceived as received by recipients (Carman, 1990; Finn and Lamb, 1991; Parasuraman, Zeithaml,
and Berry, 1985, 1988; Zeithaml, Parasuraman, and Berry,
1988). The services marketing approach places an emphasis
on quality evaluation from the recipients’ perspectives, but
ignores the necessity for including an evaluation of the technical skill of the provider and the nature of the medical outcome.
Especially in the area of health-care service, the services marketing approach seems to neglect the important role of physicians in shaping patients’ service expectations. A balanced
approach, therefore, utilizing aspects of service quality from
both the services marketing and health-care approaches may
be required. At the same time, the physicians’ view toward
the quality of their own services needs more research attention.
For the success of health-care organizations, accurate measurement of health-care service quality is as important as
understanding the nature of the service delivery system. Without a valid measure, it would be difficult to establish and
implement appropriate tactics or strategies for service quality
management. The most widely known and discussed scale
for measuring service quality is SERVQUAL (Parasuraman,
Zeithaml, and Berry, 1988). Since the scale was developed, various
researchers have applied it across such different fields as securities brokerage, banks, utility companies, retail stores, and
repair and maintenance shops. The scale has also been applied
to the health-care field in numerous studies (Babakus and
Mangold, 1992; Brown and Swartz, 1989; Carman, 1990;
Headley and Miller, 1993; O’Connor, Shewchuk, and Carney,
1994; Walbridge and Delene, 1993). However, with a few
exceptions, they did not systematically examine the psychometric properties of their scale, because these studies dealt
with pragmatic and managerial issues for health-care services.
Validity of the SERVQUAL scale seems not to be fully established. A more stringent psychometric test has been recommended for the improvement of the service quality measurement (for a recent review, please see Asubonteng, McCleary, and Swan, 1996).
In this study, we sought to examine rigorously the psychometric properties pertaining to alternative methods of measuring health-care service quality as perceived by physicians.
Specifically, physicians were asked to assess health-care service quality along the seven dimensions of a modified SERVQUAL scale. Dimensional responses were collected using three
measurement methods: single-item global rating method, constant-sum rating method and multi-item rating method, thus
resulting in multitrait-multimethod (MTMM) data. Based on
the results of construct validation conducted on the MTMM
data, we reported findings regarding the convergent validity
of the three methods and the discriminant validity of the seven
service quality dimensions as measured by the three methods.
Previous Research
Two Approaches in Health-Care Service Quality
Service quality is an elusive and abstract concept because
of its “intangibility” as well as its “inseparability of production
and consumption” (Parasuraman, Zeithaml, and Berry, 1985).
Various approaches have been suggested regarding how to
define and measure service quality. The services marketing
literature has defined service quality in terms of what service
recipients receive in their interaction with the service providers
(i.e., technical, physical, or outcome quality), and how this
technical quality is provided to the recipients (i.e., functional,
interactive, or process quality) (Grönroos, 1988; Lehtinen and
Lehtinen, 1982; Berry, Zeithaml, and Parasuraman, 1985).
Parasuraman, Zeithaml, and Berry (1985) asserted that consumers perceive service quality in terms of the gap between
received service and expected service. They identified 10 dimensions of service quality: access, communication, competence, courtesy, security, tangibles, reliability, responsiveness,
credibility, and understanding or caring. They then classified
10 dimensions into three categories: search properties (credibility and tangibles; dimensions that consumers can evaluate
before purchase), experience properties (reliability, responsiveness, accessibility, courtesy, communication, and understanding/knowing the consumer; dimensions that can be
judged during consumption or after purchase), and credence
properties (competence and security; dimensions that a consumer finds hard to evaluate even after purchase or consumption).
In the area of traditional health-care research, the quality
of health care has been viewed from a different perspective.
Quality has been defined as “the ability to achieve desirable objectives using legitimate means” (Donabedian, 1988,
p. 173), where the desirable objective implied “an achievable
state of health.” Thus, quality is ultimately attained when a
physician properly helps his or her patients to reach an achievable level of health, and they enjoy a healthier life. One of
the most widely used quality assessment approaches has been
proposed in the structure-process-outcome model of Donabedian (1980). In this model, the structure indicates the settings
where the health care is provided, the process indicates how
care is technically delivered; whereas, the outcome indicates
the effect of the care on the health or welfare of the patient.
In the structure-process-outcome model, quality was viewed
as technical in nature and assessed from the physicians’ point
of view. It is well known that physicians pay significantly
more attention to the technical and functional dimensions of
health-care service (Donabedian, 1988; O’Connor, Shewchuk,
and Carney, 1994). This tendency might be attributable to physician education and training. Considering the potentially fatal
and irrevocable consequences of poor medical quality (malpractice) in health care, in contrast to other service industries,
it would be logical and desirable for physicians to hold such
an attitude.
A difference has been observed between the services marketing approach emphasizing recipients' perspectives and the
traditional health-care approach honoring physicians’ concerns. Both patient groups and physician groups are important
constituents of the health-care system. However, it has been
found that health-care recipients have difficulty in evaluating
medical competence and the security dimensions (i.e., credence properties) considered to be the primary determinant
of service quality (Bopp, 1990; Hensel and Baumgarten,
1988). This inability to assess the technical
quality received in health-care service leads patients to rely
more heavily on other dimensions, such as credibility or tangibility (i.e., search properties) when inferring the quality of
health-care service (Bowers, Swan, and Taylor, 1994). This
lack of patient ability to make a proper evaluation raises a
question regarding the gap analysis paradigm suggested by
Parasuraman, Zeithaml, and Berry (1985). If customers in
the health-care delivery system cannot evaluate the important
service dimensions, can they have a reasonable expectation
about services they will receive? If they cannot, the contribution of the health-care recipients’ views in influencing the
design of an efficient system may not be as significant as we
formerly thought.
If the health-care service industry were similar to other
industries that provide services for their customers, a patient
could choose among many physicians who offer different
prices, and provide service that differs in terms of medical
technical quality (i.e., competence and security) or other service-related dimensions. The reality in the health-care industry
is different. Patients do not have enough information about
their physicians. Even if more information were available and
accessible, patients probably could not weigh the information
properly. Physician choice is often made not by the patients
themselves, but through referral from the patient’s primary
doctor, from his or her health maintenance organization (HMO), and/or
from friends. Although service recipients' perceptions of service are valuable for improving health-care service quality,
it is as crucial to understand physicians’ perceptions of service
quality when designing and improving the health-care delivery
system. Therefore, this study placed its focus on how physicians perceive health-care service quality.
Measurement Issues in
Health-Care Service Quality
A system cannot be designed and operated effectively unless
the quality of the product or service can be understood or
correctly measured. One major stride toward developing
quantitative measures of service quality was made by Parasuraman, Zeithaml, and Berry (1985), and the SERVQUAL scale
was the consequence of this effort (Parasuraman, Zeithaml,
and Berry, 1988). The 10 dimensions discussed in the 1985
study were reduced to five dimensions in SERVQUAL after
an empirical test. Their original objective was to discover dimensions that were generic to all services. If this assumption
is correct, dimensional patterns for service quality should be
similar across different service industries. Several researchers
have since examined the stability of SERVQUAL dimensions
(Asubonteng, McCleary, and Swan, 1996; Babakus and Boller,
1992; Carman, 1990; Dabholkar, Thorpe, and Rentz, 1996).
Carman (1990) found that the numbers of service quality
dimensions were not stable across different services in his
factor analysis results. He also found that, among the five
dimensions, items measuring “tangibles” and “reliability” consistently loaded on the expected factors across different services. However, items tapping “assurance” and “empathy”
broke into different factors. A similar finding was reported
by Babakus and Boller (1992). There seems to be a consensus
that SERVQUAL is not a generic measure for all service industries and that service-specific dimensions other than those
suggested in SERVQUAL may be needed to understand service
quality perceptions fully.
Although these studies have generated insight into the
measurement properties of SERVQUAL, their measurement
analyses, which were aimed primarily at checking dimensionality, were inadequate for testing the construct validity of the
scale. Construct validity is defined as the degree of correspondence between constructs and their measures (Peter, 1981).
A systematic and rigorous construct validation requires multitrait-multimethod data, which is the correlation matrix for
two or more traits where each trait is measured by two or
more methods. Demonstration of construct validity requires
evidence of convergent validity and discriminant validity
(Campbell and Fiske, 1959).
The two main sources of variance in measures of a construct
are the construct or trait being measured and measurement
error. Measurement error can be divided further into random
error and systematic error (e.g., method variance). Single measures do not allow us to make an assessment of measurement
error. With a single method, we cannot separate trait variance
from unwanted method variance (Bagozzi, Yi, and Phillips,
1991). Thus, construct validation is a process of separating
the confounding effects of random and systematic errors from
trait variance. Without disentangling the variation in measures
attributable to the trait, we cannot assess the extent of the
true relationship between the measures and traits (i.e., the
convergent validity) or the true relationships between traits
(the discriminant validity).
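To make this decomposition concrete, a standardized measure of trait i obtained with method j is commonly written under the additive MTMM factor model as follows (this is our rendering of the standard model, not a formula taken from the authors):

\[
x_{ij} = \lambda_{T_i}\,T_i + \lambda_{M_j}\,M_j + \varepsilon_{ij},
\qquad
\operatorname{Var}(x_{ij}) = \lambda_{T_i}^{2} + \lambda_{M_j}^{2} + \theta_{ij},
\]

where the trait and method factors are standardized and mutually uncorrelated, so the squared loadings give the proportions of trait and method variance, and \(\theta_{ij}\) gives the random-error variance.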
Despite its seminal role in the understanding and assessment of construct validity, the original Campbell and Fiske
(1959) approach to MTMM analyses has limitations. Most
notably, it prescribes no precise standards for determining
evidence of construct validity. Furthermore, the procedure
does not yield specific estimates of trait, method, and random
variance. Several alternative procedures have been proposed
for analyzing MTMM data (for a review, see Bagozzi, 1993).
The construct validation process in this study utilized two of
these alternative MTMM approaches; namely, application of
the confirmatory factor analysis (CFA) model (Jöreskog and
Sörbom, 1993) and the correlated uniqueness (CU) model
(Marsh, 1989).
Research Design
Design of the MTMM Study
Previous studies have indicated that SERVQUAL must be modified for each unique service sector (Carman, 1990; Babakus
and Boller, 1992). Haywood-Farmer and Stuart (1988) empirically tested SERVQUAL and found it did not encompass all
the dimensions of professional service quality. They suggested
that service dimensions for core service, service customization,
and knowledge and information be added to the five dimensions of SERVQUAL. Of these additional dimensions, core
service was found to be the most important factor not represented in the SERVQUAL instrument. Related research of
professional service quality perception was done by Brown
and Swartz (1989). This study found that “professionalism”
and “professional competence” were significant factors for
both providers and patients in the evaluation of service quality.
The modified SERVQUAL approach utilized in this research, therefore, included the five dimensions of SERVQUAL
(Parasuraman, Zeithaml, and Berry, 1985), as well as the “core
medical service” (Haywood-Farmer and Stuart, 1988) and the
“professionalism/skill” (Brown and Swartz, 1989) dimensions.
The latter two dimensions were included to measure the technical aspects of health-care service. These same service quality
dimensions were also used in the earlier research of Walbridge
and Delene (1993), which involved physician attitudes toward
service quality. The seven dimensions, their origins, and their
definitions can be found in Table 1.
It is well known that measurement methods can affect the
nature of a respondent’s evaluation (Kumar and Dillon, 1992;
Phillips, 1981). Of the various methods used in measurement,
three were selected for this research: single-item global rating
method, constant-sum rating method, and multi-item rating
method.

Table 1. Service Quality Attributes

Assurance (Parasuraman, Zeithaml, and Berry, 1988): Courtesy displayed by physicians, nurses, or office staff and their ability to inspire patient trust and confidence.
Empathy (Parasuraman, Zeithaml, and Berry, 1988): Caring, individualized attention provided to patients by physicians and their staffs.
Reliability (Parasuraman, Zeithaml, and Berry, 1988): Ability to perform the expected service dependably and accurately.
Responsiveness (Parasuraman, Zeithaml, and Berry, 1988): Willingness to provide prompt service.
Tangibles (Parasuraman, Zeithaml, and Berry, 1988): Physical facilities, equipment, and appearance of contact personnel.
Core medical service (Haywood-Farmer and Stuart, 1988): The central medical aspects of the service: appropriateness, effectiveness, and benefits to the patient.
Professionalism/skill (Swartz and Brown, 1989): Knowledge, technical expertise, amount of training, and experience.

The single-item global rating method provided the
respondent with dimensions and definitions of each service
dimension. With this method, the respondent reported his or
her evaluation rating on each dimension—without evaluating
the multiple indicators (components) of each dimension and
without comparing it to other service dimensions. The constant-sum rating method, in contrast, is comparative in nature,
requiring the respondent to allocate a given number of “importance points” among various dimensions. In this method,
respondents were forced to think about the relative importance of each service dimension. In the multi-item rating
method, multiple indicators were developed that were intended to capture each of seven service quality dimensions.
It is generally accepted that the multi-item rating method
can provide a better sampling of the domain of content than
the single-item global rating method (Bagozzi, 1980). Thus,
content validity can be enhanced with multiple-item measures.
They also have the advantage of allowing the computation
of reliability coefficients (e.g., Cronbach’s alpha, [Cronbach,
1951]). Reliability assessment with the single-item global rating
method is a problem in typical survey research studies, because
measurement error cannot be estimated with a single item.
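For reference, Cronbach's (1951) alpha for a composite of k items can be stated as:

\[
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_{Y_i}^{2}}{\sigma_{X}^{2}}\right),
\]

where \(\sigma_{Y_i}^{2}\) is the variance of item i and \(\sigma_{X}^{2}\) is the variance of the total (summed) score; with a single item the coefficient is undefined, which is why reliability cannot be estimated from one item.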
However, a drawback in using the multi-item rating
method in place of the single-item global rating method is
the tendency toward questionnaire length along with possible
detrimental effects on response rate and respondent fatigue.
In other words, the single-item global rating method has the
potential advantage of parsimony for the respondent. Therefore, in areas where there is little or no difference between
the explanatory power of single- and multi-item methods, the
single-item global rating method may be preferable in studies
where parsimony is important.
In studying service quality dimensions, it may be helpful
to gather information regarding the relative importance of
each dimension. One way to do this is through the use of
constant-sum rating method. The constant-sum rating method
forces respondents to identify the comparative importance of
each service dimension. In health-care research, the constant-sum method has been used to examine determinant dimensions in hospital preference (Woodside and Shinn, 1988). The constant-sum method also tends to eliminate individual response styles
of “nay-saying” and the “halo effects,” which cause respondents
to carry over their judgments from one dimension to another
(Churchill, 1991). In an earlier, related study (Walbridge and
Delene, 1993), it was believed that physicians may be reluctant
to rate any service quality dimension as unimportant. Thus,
the constant-sum rating method was employed in this research
to determine its applicability as an efficient measurement
method of health-care service quality where physicians’ perceptions were surveyed.
There are a few drawbacks to using the constant-sum
method. The first is its inherent increase in task complexity for
respondents. It requires more mental effort from the individual
than either the single-item or multi-item methods. Each rating
decision affects other ratings because of the constraints imposed by the nature of the measurement process. As the number of attributes increases, respondents become more taxed
(Aaker, Kumar, and Day, 1994; Malhotra, 1995). This increase
in complexity may lead the subject to use a subset of the
dimensions instead of including all of them in his or her
evaluation (Churchill, 1991). This effect may be heightened
if the subject does not view the dimensions as being completely
independent. This lack of independence has sometimes been found to produce spurious correlations (Kerlinger, 1973).
In this study, we asked physicians how they perceived the
seven dimensions of health-care service quality as measured
by three different methods: single-item global rating method,
multi-item rating method, and constant-sum rating method.
Questionnaire Development
A panel of physicians was consulted on questionnaire design
and semantics, and input was also received from a state university hospital. The questionnaire was divided into four sections,
with one section for each of the three measurement methods
and the last section containing demographic questions. Section One utilized the single-item global rating method. The
subjects were given the name and definition of each dimension
indicated in Table 1 and asked to rate the importance of each
dimension on a seven-point scale. Pretesting with physicians
showed that a conventional scale using the two bipolar adjectives “unimportant” and “important” was inappropriate. Physicians were reluctant to rate any of the dimensions as “unimportant” or “less important.” Further pretesting results suggested
the use of “Important” for the low end (one) and “Critical”
for the high end (seven) in a seven-point scale.
Section Two was a constant-sum rating method that asked
the subjects to distribute a fixed number of “importance
points” among the seven dimensions. This led respondents
to rate the comparative importance of each service dimension
relative to the others. The same names and definitions of the
dimensions were used as in Section One.
Section Three consisted of forty-three (43) “practice characteristics.” Placed in random order, each practice characteristic
corresponded with one of the seven service quality dimensions, with between five and seven characteristics pertaining
to each service quality dimension based on a previous study
(Walbridge and Delene, 1993) (please see Appendix A). In
this section, physicians evaluated the practice characteristics,
without referring to the names or definitions of the pertinent
service quality dimensions. Practice characteristics were evaluated using the same “important–critical” dichotomy used in
Section One. Respondents then answered questions related
to demographic variables in the last section.
Sampling
Physicians (1,428) were randomly selected by a commercial
mail-order vendor from a national database leased from the
American Medical Association. Some professional categories
were eliminated to remove nonphysicians from the list, as
well as specialties considered divergent from the mainstream
of health-care service (for a listing of the specialties used,
please see Appendix B). The four-page, self-administered
questionnaire was mailed to physicians. To attain a higher
response rate, physicians received a “warm-up” postcard announcing the arrival of the questionnaire within the next week.
The initial mailing of the questionnaire included a cover letter
explaining the purpose of the research and the confidentiality
of responses. Approximately 6 weeks later, a follow-up mailing
of 1,200 questionnaires was sent to physicians who had not
yet responded.
Of the original 1,428 addresses, 72 were invalid. Six returned questionnaires were unusable. A total of 348 responses
were received from the two mailings with an effective response
rate of 24.4%. Demographic characteristics of our sample were
compared with those of the physician population in the United
States in Table 2. The similarities become apparent through
simple visual inspection. The population of physicians in the
United States is 16.4% female; whereas, the sample was 18.7%
female. The age distribution of the sample was also somewhat
similar, especially for physicians under the age of 65, which
accounted for about 90%. The sample was similar to the
population on the basis of practice specialty. Goodness-of-fit tests were performed for sex, age, and specialty group categories. The results were χ² = 0.4 (DF = 1, p = 0.729) for sex, χ² = 14.4 (DF = 4, p < 0.001) for age, and χ² = 4.8 (DF = 3, p = 0.084) for specialty group. These results
suggest that the sample reflected the population’s sex and
specialty group compositions, but consisted of physicians who
were somewhat older than the population.
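As a sketch of how such a goodness-of-fit comparison can be computed, the following uses scipy; the specialty-group counts are hypothetical reconstructions from the rounded percentages in Table 2 and the 348 usable responses, not the authors' raw data:

```python
from scipy.stats import chisquare

# Hypothetical observed specialty-group counts, back-calculated from the
# rounded sample percentages in Table 2 (39.4, 10.3, 12.9, 33.3) and n = 348.
observed = [137, 36, 45, 116]

# AMA population percentages for the same groups (34.5, 12.9, 10.8, 22.5),
# rescaled so the expected counts sum to the observed total, as chisquare()
# requires.
population = [34.5, 12.9, 10.8, 22.5]
expected = [sum(observed) * p / sum(population) for p in population]

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.1f} (DF = {len(observed) - 1}), p = {p:.3f}")
```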
Analysis
Instrument Reliability for the Multi-Item
Rating Method
It is necessary to derive a composite score for each of the
seven service quality dimensions measured by the multi-item
rating method. For this purpose, the level of internal consistency was checked as a way of assessing the homogeneity
of items comprising each dimension. The Cronbach’s alpha
indices for the seven dimensions ranged from 0.80 to 0.90,
with a mean of 0.85. This high degree of internal consistency
(Nunnally, 1978) allowed us to sum the ratings to get composite scores for each of the seven dimensions. Each composite
score indicated a measure of each service quality dimension
obtained by the multi-item rating method. These composite
scores were used for the MTMM analysis along with the other
scores assessed by the single-item global rating method and the constant-sum rating method.

Table 2. Demographics: Population vs. Sample (%)

                        Population*   Sample
Age
  Under 35                 24.4        20.3
  35–44                    39.8        32.1
  45–54                    21.6        18.5
  55–64                    12.5        17.1
  65 and over               1.7        12.1
Gender
  Male                     83.6        81.0
  Female                   16.4        18.7
Specialty group
  Primary care             34.5        39.4
  Surgical                 12.9        10.3
  Hospital based           10.8        12.9
  Other specialties        22.5        33.3

* Source: American Medical Association, 1990.
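A minimal sketch of the internal-consistency check and composite scoring described in this section, assuming the multi-item ratings sit in a pandas DataFrame; the column names and the item-to-dimension mapping are hypothetical, not the authors' coding scheme:

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a block of item columns (one row per respondent)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical mapping of practice-characteristic columns to dimensions;
# the study used between five and seven items per dimension.
dimension_items = {
    "reliability": ["rel_1", "rel_2", "rel_3", "rel_4", "rel_5"],
    "tangibles": ["tan_1", "tan_2", "tan_3", "tan_4", "tan_5"],
    # ... and so on for the remaining five dimensions
}

def composite_scores(responses: pd.DataFrame) -> pd.DataFrame:
    """Sum the item ratings within each dimension, as done in the study."""
    return pd.DataFrame(
        {dim: responses[cols].sum(axis=1) for dim, cols in dimension_items.items()}
    )
```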
Construct Validation of the Modified
SERVQUAL Scale
Our investigation of the construct validity of the modified
SERVQUAL involved CFA of the multitrait–multimethod
(MTMM) data. CFA allows methods to affect measures of traits
in different degrees; whereas, methods are assumed to covary
freely among themselves. CFA then provides assessments of
over-all goodness of fit for the variable specification of the
given MTMM data, while enabling the partition of variance in
measures into trait, method, and error components. Trait
variance reflects the shared variation for measures of a common trait and can be used to assess convergent validity. Discriminant validity among traits is indicated by intertrait correlations significantly lower than unity (Bagozzi and Yi, 1991).
As suggested by Bagozzi (1993) and Widaman (1985), we
first tested a CFA model based on the hypothesis that the
variation in measures can be explained by traits and random
error (i.e., the trait-only model). In this model, there are seven
traits (i.e., seven service quality dimensions), and each trait
is indicated by three measures. Each of three measures is related
to its own rating method (i.e., single-item global rating
method, constant-sum rating method, etc.). This model resulted in poor fit, as indicated by χ² (168) = 1935.04, p =
.00. A probable cause for the trait-only model’s poor fit was the
presence of method factors as important sources of variation in
the measures (Bagozzi, 1993; Widaman, 1985). Subsequently,
another CFA model that incorporated trait and method factors
was tested. Estimation of this trait-method model was not
possible, however, because iterations failed to converge. Such an occurrence in the confirmatory factor analysis of a trait-method model is not uncommon (Marsh and Bailey, 1991;
Bagozzi, 1993; Van Driel, 1978). Also, frequently found in
the CFA solution are improper parameter estimates, such as
negative variances. In all these instances, the confirmatory
factor analysis model is construed as an inappropriate specification of the variable structure and must be rejected (Bagozzi
and Yi, 1991).
In view of these problems that frequently accompany the
application of CFA models to MTMM data, Marsh (1989)
proposed the CU model as an alternative. The CU model
differs from the CFA model primarily in the interpretation of
method effects. In the CFA model, method effects are inferred
by squared method factor loadings. In contrast, the CU model
specification does not include method factors. Instead, method
effects are depicted as and inferred from correlations among
error terms corresponding to the measures based on common
method. This depiction of method effect is the main reason
why the CU model seldom produces an ill-defined solution
(Marsh, 1989, p. 341). Another difference between the two
approaches rests on the assumption regarding method correlations. Whereas the CU model assumes that methods are uncorrelated, no such assumption is necessary for the CFA model.
Finally, both CFA and CU models are premised on the additive
effects of traits and methods on measures.
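As an illustration of how the two specifications differ, here is a hedged sketch of a CU model in lavaan-style syntax (which the Python package semopy, for example, accepts); the variable names are our own shorthand for the global (_g), constant-sum (_c), and multi-item (_m) measures, and only two of the seven traits are written out:

```python
# Correlated uniqueness specification: trait factors only; method effects
# enter as covariances among the residuals of same-method measures.
cu_model_desc = """
Assurance =~ asr_g + asr_c + asr_m
Empathy   =~ emp_g + emp_c + emp_m

asr_g ~~ emp_g   # correlated uniqueness within the single-item global method
asr_c ~~ emp_c   # ... within the constant-sum method
asr_m ~~ emp_m   # ... within the multi-item method
"""

# A CFA trait-method specification would instead add method factors, e.g.:
#   GlobalMethod =~ asr_g + emp_g + ...
# with the residuals left uncorrelated.

# Fitting with semopy might look like:
# from semopy import Model
# model = Model(cu_model_desc)
# model.fit(data)  # data: DataFrame holding the observed measures
```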
Given its robust nature, the CU model is an attractive
alternative when the CFA model results in an ill-defined solution or nonconvergence (Bagozzi, 1993; Marsh, 1989). We
subsequently tested the CU model’s fit to our MTMM data
(see Figure 1 for the diagram of the CU model). Another
application of the CU model in a similar situation can be
found in Kim and Lee’s (1997) construct validation study
involving measures of children's influences on family decisions. The CU model's fit as indicated by the χ² test result (χ² (105) = 311.62, p = .00) was unsatisfactory. However, because of the χ² test's sensitivity to sample size, some researchers (Bentler, 1990; Bagozzi and Yi, 1991) have suggested
fit assessments based on other goodness-of-fit indices when
the sample size is suspected to be the cause for rejecting
the hypothesized model. One frequently used measure is the
comparative fit index (CFI), which evaluates the practical
significance of the variance explained by the model (for a
detailed discussion, see Bentler, 1990 and Bagozzi, Yi, and
Phillips, 1991). For our CU model, computation of the CFI
yielded .96. This is much greater than the .90 rule of thumb
suggested as the minimum acceptable level by Bentler (1990).
Therefore, the CU model captures a significant proportion of
variance of our MTMM data from a practical point of view;
hence, little variance remains to be accounted for.
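For reference, the CFI (Bentler, 1990) compares the fitted model's noncentrality with that of a baseline model of independent variables; in the usual notation:

\[
\mathrm{CFI} = 1 - \frac{\max\!\left(\chi^{2}_{M} - df_{M},\; 0\right)}{\max\!\left(\chi^{2}_{0} - df_{0},\; 0\right)},
\]

where \((\chi^{2}_{M}, df_{M})\) belong to the fitted model and \((\chi^{2}_{0}, df_{0})\) to the baseline model; a value of .96 thus says that the CU model removes roughly 96% of the baseline model's noncentrality.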
Table 3 presents the estimated factor loadings for the CU
model. Significant trait factor loadings (t > 2.0) establish the
convergent validity of the measures (Widaman, 1985; Bagozzi
and Yi, 1991). Although the trait factor loadings for all the
measures based on global single-item method and multi-item
method are significant, only three of the seven constant-sum
measures were significant. The three dimensions of health-care
service quality for which the constant-sum measure exhibited
convergent validity are assurance, responsiveness, and tangibles. An assessment of the extent of convergence shown by
each measure requires a decomposition of the total variance
into proportions attributable to the corresponding trait and
random error. As in the CFA model, the amount of trait
variance in a measure is inferred by the squared trait factor
loading for that measure. For all seven dimensions of health-care service quality, trait variances for constant-sum measures
were extremely low, with a range between 0.00 and 0.03 (or
0 and 3%). The best results were found for the global single-item measures. Their trait variances ranged between 0.45 and
0.83, with a mean level of .57. The seven multi-item measures
showed levels of trait variance generally lower than the global
single-item measures. Trait variances for these measures
ranged from 0.30 to 0.49, with a mean of 0.39. According to
Bagozzi and Yi (1991), strong (weak) evidence for convergent
validity is achieved when at least (less than) half of the total
variation in a measure is caused by trait. According to this
rule of thumb, there is strong evidence for convergent validity
for most of our global single-item measures (5 out of 7). Trait variances for all seven multi-item measures fall below the level of 0.5. Therefore, evidence for convergent validity is weak for these measures using the multi-item rating method; whereas, the constant-sum measures exhibit little or no convergent validity.

Figure 1. Correlated uniqueness model for the MTMM data.

Table 3. Summary of Parameter Estimates for the Correlated Uniqueness Model: Trait Factor Loadings

Trait                    Single-Item Global   Constant-Sum    Multi-Item
Assurance                0.69 (0.10)          -0.18 (0.06)    0.63 (0.09)
Core medical service     0.67 (0.10)          -0.01 (0.06)    0.55 (0.09)
Empathy                  0.75 (0.10)           0.05 (0.06)    0.70 (0.09)
Professionalism/skills   0.71 (0.10)           0.06 (0.06)    0.62 (0.09)
Reliability              0.71 (0.10)           0.10 (0.06)    0.66 (0.09)
Responsiveness           0.83 (0.11)           0.16 (0.06)    0.60 (0.09)
Tangibles                0.91 (0.11)           0.16 (0.06)    0.62 (0.09)

Each measure was specified to load only on its own trait; all remaining loadings were fixed at zero. Standard errors of estimates are shown in parentheses.
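The convergence assessment above is simple arithmetic on the squared loadings; a short sketch using the values transcribed from Table 3:

```python
# Squared standardized trait loadings = proportion of trait variance.
loadings = {
    "single_item": [0.69, 0.67, 0.75, 0.71, 0.71, 0.83, 0.91],
    "constant_sum": [-0.18, -0.01, 0.05, 0.06, 0.10, 0.16, 0.16],
    "multi_item": [0.63, 0.55, 0.70, 0.62, 0.66, 0.60, 0.62],
}

for method, lams in loadings.items():
    trait_var = [round(lam ** 2, 2) for lam in lams]
    mean_tv = sum(trait_var) / len(trait_var)
    # Bagozzi and Yi (1991): trait variance of at least 0.5 = strong evidence.
    strong = sum(tv >= 0.5 for tv in trait_var)
    print(f"{method}: trait variances {trait_var}, "
          f"mean {mean_tv:.2f}, strong convergence for {strong} of 7 measures")
```

Run on the Table 3 loadings, this reproduces the figures in the text: mean trait variances of roughly 0.57, 0.01, and 0.39, with strong convergence for five of the seven global single-item measures and for none of the others.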
As noted before, the effects of methods under the CU model
are represented as correlations among error (uniqueness)
terms. Although the CFA model enables the separation of the
variance portion that is caused by method bias, we can only
infer the significance and size of the method bias in the CU
model analysis based on examination of the estimated uniqueness correlations. Table 4(a), 4(b), and 4(c) display the estimated error variances and covariances for single-item global
measures, constant-sum measures, and multi-item measures,
respectively. For the single-item measures, significant covariances between error terms were found in 14 of the 21 possible cases (see Table 4(a)). When these covariances were converted
into correlations, the values ranged from 0.28 to 0.82, with
an average of 0.59. These levels of uniqueness correlations
demonstrate a considerable degree of method effect contained
in the measurement. Therefore, a substantial portion of the
variations in the global single-item measures can be attributed
to the measurement procedure.
For the constant-sum measures, 16 of the 21 uniqueness
covariances were significant (see Table 4(b)). Although this
indicates the existence of a significant method effect, the magnitudes of the uniqueness correlations (range: 0.03–0.36;
mean 0.19) suggest that the size of method effect is small.
The very large error variances shown in Table 4b demonstrate
that almost all the variations in the constant-sum measures
are attributable to random error. With regard to the multi-item measures, as can be seen in Table 4(c), all uniqueness
covariances are significant. Uniqueness correlations were also
generally high (range: 0.37–0.71; mean 0.59).
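Converting an error covariance to a uniqueness correlation is a one-line computation; a sketch using the assurance and core medical service entries of Table 4(a):

```python
from math import sqrt

# From Table 4(a): error variances 0.51 (assurance) and 0.53 (core medical
# service), with an error covariance of 0.37 between the two measures.
var_assurance, var_core, covariance = 0.51, 0.53, 0.37

# Correlation = covariance divided by the product of the two error SDs.
correlation = covariance / sqrt(var_assurance * var_core)
print(f"uniqueness correlation = {correlation:.2f}")  # about 0.71
```

The value of about 0.71 falls inside the 0.28 to 0.82 range reported above for the single-item global measures.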
Our next investigation focused on discriminant validity
among the seven dimensions of health-care service quality. It
consisted of verifying whether the correlations among the
seven dimensions (i.e., traits) as measured by three different
methods were significantly different from unity (+1 or −1)
(Widaman, 1985; Bagozzi, Yi, and Phillips, 1991). As shown
in Table 5, all of the correlations among the dimensions are
significant and very high (range: 0.69–0.99; mean: 0.84).
Seven of the 21 correlations were above the 0.90 level. Such
high correlations among service quality dimensions (range:
0.67–0.92; mean: 0.82) were also observed in the study conducted by Dabholkar, Thorpe, and Rentz (1996). It should
be noted, however, that these correlations are disattenuated
correlations (i.e., corrected for measurement error) and are
larger than those correlations among measures. Particularly
notable is the correlation between the dimensions of assurance
and empathy (0.99), which is near unity.

Table 4. Summary of Parameter Estimates for the Correlated Uniqueness Model: Error Variances and Covariances

(a) Error Variance and Covariance for Single-Item Global Measures

                        Assur        Core         Emp          Prof         Rel          Resp         Tang
Assurance               0.51 (0.12)
Core medical service    0.37 (0.11)  0.53 (0.12)
Empathy                 0.35 (0.13)  0.33 (0.12)  0.41 (0.14)
Professionalism/skills  0.35 (0.10)  0.34 (0.11)  0.28 (0.10)  0.51 (0.13)
Reliability             0.32 (0.11)  0.38 (0.11)  0.30 (0.11)  0.41 (0.12)  0.49 (0.13)
Responsiveness          0.27 (0.11)  0.23 (0.11)  0.22 (0.11)  0.26 (0.12)  0.26 (0.13)  0.31 (0.16)
Tangibles               0.14 (0.11)  0.09 (0.11)  0.11 (0.11)  0.08 (0.13)  0.10 (0.13)  0.09 (0.14)  0.17 (0.19)

(b) Error Variance and Covariance for Constant-Sum Measures

Assurance               0.99 (0.08)
Core medical service   -0.03 (0.06)  1.00 (0.08)
Empathy                 0.24 (0.06) -0.18 (0.06)  0.99 (0.08)
Professionalism/skills -0.35 (0.06) -0.14 (0.06) -0.13 (0.06)  0.99 (0.08)
Reliability            -0.32 (0.06) -0.21 (0.06) -0.21 (0.06)  0.20 (0.06)  0.99 (0.08)
Responsiveness         -0.16 (0.06) -0.20 (0.06) -0.15 (0.06) -0.16 (0.06)  0.27 (0.06)  0.97 (0.08)
Tangibles              -0.10 (0.06) -0.23 (0.06) -0.09 (0.06) -0.18 (0.06)  0.07 (0.06)  0.26 (0.06)  0.97 (0.08)

(c) Error Variance and Covariance for Multi-Item Measures

Assurance               0.58 (0.11)
Core medical service    0.35 (0.09)  0.68 (0.09)
Empathy                 0.34 (0.11)  0.36 (0.09)  0.49 (0.12)
Professionalism/skills  0.39 (0.08)  0.41 (0.09)  0.32 (0.09)  0.60 (0.11)
Reliability             0.32 (0.09)  0.32 (0.09)  0.28 (0.09)  0.31 (0.10)  0.55 (0.11)
Responsiveness          0.31 (0.08)  0.34 (0.08)  0.28 (0.08)  0.29 (0.08)  0.37 (0.09)  0.63 (0.10)
Tangibles               0.38 (0.08)  0.38 (0.08)  0.32 (0.08)  0.43 (0.09)  0.41 (0.09)  0.37 (0.08)  0.61 (0.10)

Standard errors of estimates are shown in parentheses. In the published table, estimates differing significantly from zero were underscored.

This high correlation
between the assurance dimension and the empathy dimension
seemed to be consistent with the findings of past studies that documented the dimensional instability of the SERVQUAL
scale (Babakus and Boller, 1992; Carman, 1990). A formal
test of discriminant validity was conducted by computing a
95% confidence interval (the estimated correlation ± twice its
standard error estimate) for each of the estimated correlations
among the seven dimensions. Despite the high levels of correlation observed between the dimensions, only one interval (that for the correlation between assurance and empathy) contained unity. Hence, from a strict statistical point of view, discriminant validity was established for every pair except assurance and empathy.
However, whether these dimensions are distinct from a practical standpoint is highly questionable.
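The formal test just described reduces to checking whether unity falls inside each interval; a sketch with two of the disattenuated correlations from Table 5:

```python
def interval_contains_unity(r: float, se: float) -> bool:
    """95% confidence interval taken as the estimate +/- twice its standard error."""
    return r - 2 * se <= 1.0 <= r + 2 * se

# Assurance-empathy: 0.99 (SE 0.02) -> interval [0.95, 1.03] contains unity,
# so discriminant validity fails for this pair.
print(interval_contains_unity(0.99, 0.02))  # True

# Core medical service-reliability: 0.90 (SE 0.03) -> interval [0.84, 0.96]
# excludes unity, so this pair passes the statistical criterion.
print(interval_contains_unity(0.90, 0.03))  # False
```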
In summary, the above results of the CU model analysis
of the MTMM data first led us to conclude that convergent
validity was established for two of the three measures, the
single-item global measure and multi-item measure. Based on
Bagozzi and Yi’s (1991) rule of thumb, only the single-item
global measure, which captured an average trait variance
greater than 0.50, demonstrated strong evidence of convergence; whereas, weak evidence of convergence was found for
the multi-item measure. For the constant-sum measure, on the
other hand, there was virtually no sign of convergence. Almost
all of the variance in the seven constant-sum measures (for the
seven service dimensions) was attributed to random error.
With respect to discriminant validity, from a strict statistical
viewpoint, discrimination was demonstrated among the seven
health-care service quality dimensions, except for one instance
(between “assurance” and “empathy”). That is, all intertrait
(or interdimensional) correlations except one were significantly less than unity. However, the magnitudes of the intertrait correlations were generally very high, with a mean
value of 0.84. Hence, the seven dimensions did not seem
separable in a practical sense. We should note, however, that
the interpretation of discriminant validity is meaningful only
when convergent validity is established (Bagozzi, 1993). Given
our finding that convergent validity was established for two
of the three types of measures tested, the evidence relating to
discriminant validity should be viewed with caution.
Table 5. Summary of Parameter Estimates for the Correlated Uniqueness Model: Trait Intercorrelations

                        Assur        Core         Emp          Prof         Rel          Resp         Tang
Assurance               1.00
Core medical service    0.94 (0.03)  1.00
Empathy                 0.99 (0.02)  0.91 (0.03)  1.00
Professionalism/skills  0.80 (0.04)  0.92 (0.03)  0.77 (0.04)  1.00
Reliability             0.89 (0.03)  0.90 (0.03)  0.83 (0.03)  0.95 (0.02)  1.00
Responsiveness          0.81 (0.04)  0.80 (0.05)  0.76 (0.04)  0.85 (0.04)  0.91 (0.03)  1.00
Tangibles               0.72 (0.05)  0.78 (0.05)  0.69 (0.05)  0.83 (0.04)  0.83 (0.03)  0.83 (0.03)  1.00

Standard errors of estimates are shown in parentheses.

Implications and Conclusion

One of the more pressing challenges health-care providers and researchers face is to develop a better understanding of the key
dimensions constituting health-care quality and valid approaches to their measurement. This research focused on conceptual and measurement issues relating to the study of health-care quality. In contrast to most of the past research in this
area, we took the physician’s (service provider’s) rather than
the patient’s (service recipient’s) perspective. This approach
is justified in view of the prevalent understanding that health-care recipients are often unable to evaluate key dimensions
of health-care service (Bopp, 1990; Hensel and Baumgarten,
1988), and, thus, may not have as much to contribute to the
design of an effective health-care system as providers. Another
contrast is found in methodological approach. Whereas past
studies that investigated the validity of the SERVQUAL scale
tended to lack methodological rigor and scope, our construct
validation procedure based on the MTMM data analysis allowed for a more systematic scrutiny of key measurement
properties of the scale (i.e., convergent validity, discriminant
validity, and method bias).
First, we compared the performance of the constant-sum
rating method, the single-item global rating method, and the
multi-item rating method in measuring health-care service
quality. All seven measures based on the constant-sum method
showed an almost complete lack of convergence with the measures based on the other methods. One plausible explanation for
this is the relatively high degree of complexity inherent in
the measures using the constant-sum method. This measure
requires more effort on the part of the respondents, and, thus,
is likely to create cognitive strains. Consequently, resulting
responses may not be as reliable as those obtained by other
methods. In fact, many physicians seemed to have difficulty
allocating the importance points among the seven categories.
In contrast to common expectation, the single-item global
measures performed better than the multi-item measures in
capturing the intended dimensions. An attempt to generalize
this finding beyond health-care providers may be inappropriate, because the result could have been caused by the high
level of familiarity that our physician respondents had with
the health-care service quality dimensions. A clear understanding of the issues involved in the questions reduces measurement error in responses. Thus, such an outcome may not
be obtained from health-care recipients, who may not possess
such a clear understanding. Nonetheless, this finding suggests
that single-item global measures may elicit responses that are
as reliable as the multi-item measures when knowledgeable
service providers are involved, and do so with greater parsimony. The single-item global rating method may be useful if
the goal of a study is to gain an understanding of the general
nature of health-care service issues. We should add, however,
that assessment of reliability level for single-item measures is
not possible in most cases. This remains a major problem for
the single-item global rating method.
When the research is to be diagnostic in nature, focusing
on specific characteristics of the service offering in an effort to
identify areas for improvement, the multi-item rating method
has greater utility. The multi-item rating method has the distinct advantage of being able to generate detailed information
on specific aspects of service quality that can be used as a
basis for action plans. As a caveat, it should be noted that
our recommendation regarding the use of the single-item
global rating method and the multi-item rating method is
limited to future research involving health-care service providers’ perceptions. For research involving the perceptions of
patients who do not understand the key dimensions of health-care service quality, the multi-item rating method seems to
be a better choice, because this method is less susceptible to
measurement error than the single-item global rating method.
In terms of the discriminant validity of the seven health-care
service quality dimensions, our results were not supportive. The computed magnitudes of interdimensional
correlations were very high. Although all correlations except
one satisfied the statistical criterion applied (i.e., significantly
less than unity), their magnitudes (ranging between 0.69–
0.99) cast much doubt on the separability of these dimensions
from a practical viewpoint. Considering that a similar finding
has been reported before (Dabholkar, Thorpe, and Rentz,
1996), caution is warranted in future applications of the
SERVQUAL scale or its modified versions in health-care service quality research. Because the validation of a measure is
an ongoing process, we suggest that more research be directed
toward producing a suitable adaptation of the SERVQUAL
scale. It is important for this research to take into consideration
the unique aspects of this particular service sector.
This study limited its scope to physicians' perceptions of health-care service quality. Under continuous quality improvement (CQI) or total quality management (TQM),
patients’ perceptions or evaluations of health-care services also
play a critical role. If health-care providers do not understand
how service recipients evaluate health-care services, it is difficult for providers to design or improve strategic planning
and marketing activities effectively. Therefore, research based
upon the patients’ perspective is necessary. Based upon the
perceptions of both parties in the health-care delivery system,
we can identify areas where mutual understanding exists,
means to inform and educate the public, and ways to improve
the current delivery system.
References
Aaker, David A., Kumar, V., and Day, George S.: Marketing Research.
John Wiley & Sons, Inc., New York, NY. 1995.
Asubonteng, Patrick, McCleary, Karl J., and Swan, John E.: SERVQUAL Revisited: A Critical Review of Service Quality. The Journal
of Services Marketing 10(6) (1996): 62–71.
Babakus, Emin, and Mangold, W. Glynn: Adapting the SERVQUAL
Scale to Hospital Services: An Empirical Investigation. Health
Services Research 26 (February 1992): 767–786.
Babakus, Emin, and Boller, Gregory W.: An Empirical Assessment
of the SERVQUAL Scale. Journal of Business Research 24(3) (1992):
253–268.
Bagozzi, Richard P.: Causal Models in Marketing. John Wiley & Sons, Inc., New York, NY. 1980.
Hanjoon Lee
Linda M. Delene
Mary Anne Bunda
WESTERN MICHIGAN UNIVERSITY
Chankon Kim
SAINT MARY’S UNIVERSITY
Service quality is an elusive and abstract construct to measure, and extra
effort is required to establish a valid measure. This study investigates the
psychometric properties of three different measurements of health-care
service quality as assessed by physicians. The multitrait-multimethod
approach revealed that convergent validity was established for measures
based on the single-item global rating method and multi-item rating
method. On the other hand, almost no evidence of convergent validity
was found for the measures based on the constant-sum rating method.
Furthermore, discriminant validity for the seven health-care service quality dimensions measured by the three methods was not well established.
The high levels of interdimensional correlations found suggested that the
service quality dimensions may not be separable in a practical sense. The
study suggested an ongoing effort is needed to develop a new service
quality scale suitable to this unique service industry. J BUSN RES 2000.
48.233–246. 2000 Elsevier Science Inc. All rights reserved.
T
he health-care delivery system has been undergoing
formidable challenges in the 1990s. Rapid movement
toward systems of managed care and integrated delivery networks has led health-care providers to recognize real
competition. To be successful or even survive in this hostile
environment, it is crucial to provide health-care recipients
with service that meets or exceeds their expectations. At the
same time, it is important to known which dimensions of
health-care services physicians believe are necessary to constitute excellent service. It is crucial to have a better understanding of service quality perceptions possessed by both recipients
and providers when shaping the health-care delivery system.
The traditional medical model has focused on the technical
nature of health-care events; the focus has been on the training
and updated skills of the physicians and the nature of the
Address correspondence to Hanjoon Lee, Marketing Department, Haworth
College of Business, Western Michigan University, Kalamazoo, Michigan
49008, USA.
Journal of Business Research 48, 233–246 (2000)
2000 Elsevier Science Inc. All rights reserved.
655 Avenue of the Americas, New York, NY 10010
actual medical outcome (O’Connor, Shewchuk, and Carney,
1994). A series of services marketing research, however, has
looked at the relationship between the services expected and
the service actually perceived as received by recipients (Carman, 1990; Finn and Lamb, 1991; Parasuraman, Zeithaml,
and Berry, 1985, 1988; Zeithaml, Parasuraman, and Berry,
1988). The services marketing approach places an emphasis
on quality evaluation from the recipients’ perspectives, but
ignores the necessity for including an evaluation of the technical skill of the provider and the nature of the medical outcome.
Especially in the area of health-care service, the services marketing approach seems to neglect the important role of physicians in shaping patients’ service expectations. A balanced
approach, therefore, utilizing aspects of service quality from
both the services marketing and health-care approaches may
be required. At the same time, the physicians’ view toward
the quality of their own services needs more research attention.
For the success of health-care organizations, accurate measurement of health-care service quality is as important as
understanding the nature of the service delivery system. Without a valid measure, it would be difficult to establish and
implement appropriate tactics or strategies for service quality
management. The most widely known and discussed scale
for measuring service quality is SERVQUAL (Parasuraman,
Zeithmal, Berry, 1988). Since the scale was developed, various
researchers have applied it across such different fields as securities brokerage, banks, utility companies, retail stores, and
repair and maintenance shops. The scale has also been applied
to the health-care field in numerous studies (Babakus and
Mangold, 1992; Brown and Swartz, 1989; Carman, 1990;
Headley and Miller, 1993; O’Connor, Shewchuk, and Carney,
1994; Walbridge and Delene, 1993). However, with a few
exceptions, they did not systematically examine the psychometric properties of their scale, because these studies dealt
with pragmatic and managerial issues for health-care services.
Validity of the SERVQUAL scale seems not to be fully established. A more stringent psychometric test has been recomISSN 0148-2963/00/$–See front matter
PII S0148-2963(98)00089-7
234
J Busn Res
2000:48:233–246
mended for the improvement of the service quality measurement (for a recent review, please see Asubonteng, McCleary,
and Swan, 1996).
In this study, we sought to examine rigorously the psychometric properties pertaining to alternative methods of measuring health-care service quality as perceived by physicians.
Specifically, physicians were asked to assess health-care service quality along the seven dimensions of a modified SERVQUAL scale. Dimensional responses were collected using three
measurement methods: single-item global rating method, constant-sum rating method and multi-item rating method, thus
resulting in multitrait-multimethod (MTMM) data. Based on
the results of construct validation conducted on the MTMM
data, we reported findings regarding the convergent validity
of the three methods and the discriminant validity of the seven
service quality dimensions as measured by the three methods.
Previous Research
Two Approaches in Health-Care Service Quality
Service quality is an exclusive and abstract concept because
of its “intangibility” as well as its “inseparability of production
and consumption” (Parasuraman, Zeithaml, and Berry, 1985).
Various approaches have been suggested regarding how to
define and measure service quality. The services marketing
literature has defined service quality in terms of what service
recipients receive in their interaction with the service providers
(i.e., technical, physical, or outcome quality), and how this
technical quality is provided to the recipients (i.e., functional,
interactive, or process quality) (Grönoos, 1988; Lehtinen and
Lehtinen, 1982; Berry, Zeithaml, and Parasuraman, 1985).
Parasuraman, Zeithaml, and Berry (1985) asserted that consumers perceive service quality in terms of the gap between
received service and expected service. They identified 10 dimensions of service quality: access, communication, competence, courtesy, security, tangibles, reliability, responsiveness,
credibility, and understanding or caring. They then classified
10 dimensions into three categories: search properties (credibility and tangibles; dimensions that consumers can evaluate
before purchase), experience properties (reliability, responsiveness, accessibility, courtesy, communication, and understanding/knowing the consumer; dimensions that can be
judged during consumption or after purchase), and credence
properties (competence and security; dimensions that a consumer finds hard to evaluate even after purchase or consumption).
In the area of traditional health-care research, the quality
of health care has been viewed from a different perspective.
Quality has been defined as “the ability to achieve desirable objectives using legitimate means” (Donabedian, 1988,
p. 173), where the desirable objective implied “an achievable
state of health.” Thus, quality is ultimately attained when a
physician properly helps his or her patients to reach an achievable level of health, and they enjoy a healthier life. One of
the most widely used quality assessment approaches has been
proposed in the structure-process-outcome model of Donabedian (1980). In this model, the structure indicates the settings
where the health care is provided, the process indicates how
care is technically delivered; whereas, the outcome indicates
the effect of the care on the health or welfare of the patient.
In the structure-process-outcome model, quality was viewed
as technical in nature and assessed from the physicians’ point
of view. It is well known that physicians pay significantly
more attention to the technical and functional dimensions of
health-care service (Donabedian, 1988; O’Connor, Shewchuk,
and Carney, 1994). This tendency might be attributable to physician education and training. Considering the potentially fatal
and irrevocable consequences of poor medical quality (malpractice) in health care, in contrast to other service industries,
it would be logical and desirable for physicians to hold such
an attitude.
A difference has been observed between the service marketing approach emphasizing recipients’ perspectives and the
traditional health-care approach honoring physicians’ concerns. Both patient groups and physician groups are important
constituents of the health-care system. However, it has been
found that health-care recipients have difficulty in evaluating
medical competence and the security dimensions (i.e., credence properties) considered to be the primary determinants
of service quality (Bopp, 1990; Hensel and Baumgarten,
1988). This inability or impossibility of assessing the technical
quality received in health-care service leads patients to rely
more heavily on other dimensions, such as credibility or tangibility (i.e., search properties) when inferring the quality of
health-care service (Bowers, Swan, and Taylor, 1994). This
lack of patient ability to make a proper evaluation raises a
question regarding the gap analysis paradigm suggested by
Parasuraman, Zeithaml, and Berry (1985). If customers in
the health-care delivery system cannot evaluate the important
service dimensions, can they have a reasonable expectation
about services they will receive? If they cannot, the contribution of the health-care recipients’ views in influencing the
design of an efficient system may not be as significant as we
formerly thought.
If the health-care service industry were similar to other
industries that provide services for their customers, a patient
could choose among many physicians who offer different
prices, and provide service that differs in terms of medical
technical quality (i.e., competence and security) or other service-related dimensions. The reality in the health-care industry
is different. Patients do not have enough information about
their physicians. Even if more information were available and
accessible, patients probably could not weigh the information
properly. Physician choice is often made not by the patients
themselves, but through referral from the patient’s primary
doctor, from his or her health organization (HMO), and/or
from friends. Although service recipients’ perceptions toward
service are valuable for improving health-care service quality,
it is as crucial to understand physicians’ perceptions of service
quality when designing and improving the health-care delivery
system. Therefore, this study placed its focus on how physicians perceive health-care service quality.
Measurement Issues in Health-Care Service Quality
A system cannot be designed and operated effectively unless
the quality of the product or service can be understood or
correctly measured. One major stride toward developing
quantitative measures of service quality was made by Parasuraman, Zeithaml, and Berry (1985), and the SERVQUAL scale
was the consequence of this effort (Parasuraman, Zeithaml,
and Berry, 1988). The 10 dimensions discussed in the 1985
study were reduced to five dimensions in SERVQUAL after
an empirical test. Their original objective was to discover dimensions that were generic to all services. If this assumption
is correct, dimensional patterns for service quality should be
similar across different service industries. Several researchers
have since examined the stability of SERVQUAL dimensions
(Asubonteng, McCleary, and Swan, 1996; Babakus and Boller,
1992; Carman, 1990; Dabholkar, Thorpe, and Rentz, 1996).
Carman (1990) found that the numbers of service quality
dimensions were not stable across different services in his
factor analysis results. He also found that, among the five
dimensions, items measuring “tangibles” and “reliability” consistently loaded on the expected factors across different services. However, items tapping “assurance” and “empathy”
broke into different factors. A similar finding was reported
by Babakus and Boller (1992). There seems to be a consensus
that SERVQUAL is not a generic measure for all service industries and that service-specific dimensions other than those
suggested in SERVQUAL may be needed to understand service
quality perceptions fully.
Although these studies have generated insight into the
measurement properties of SERVQUAL, their measurement
analyses, which were aimed primarily at checking dimensionality, were inadequate for testing the construct validity of the
scale. Construct validity is defined as the degree of correspondence between constructs and their measures (Peter, 1981).
A systematic and rigorous construct validation requires multitrait-multimethod data, which is the correlation matrix for
two or more traits where each trait is measured by two or
more methods. Demonstration of construct validity requires
evidence of convergent validity and discriminant validity
(Campbell and Fiske, 1959).
The two main sources of variance in measures of a construct
are the construct or trait being measured and measurement
error. Measurement error can be divided further into random
error and systematic error (e.g., method variance). Single measures do not allow us to make an assessment of measurement
error. With a single method, we cannot separate trait variance
from unwanted method variance (Bagozzi, Yi, and Phillips,
1991). Thus, construct validation is a process of separating
the confounding effects of random and systematic errors from
trait variance. Without disentangling the variation in measures
attributable to the trait, we cannot assess the extent of the
true relationship between the measures and traits (i.e., the
convergent validity) or the true relationships between traits
(the discriminant validity).
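In compact notation (ours, not the authors’; a sketch of the standard additive MTMM decomposition rather than a formula taken from the article), a standardized measure of trait i obtained by method j can be written as

    x_{ij} = \lambda_{T_i} T_i + \lambda_{M_j} M_j + e_{ij},
    \qquad
    \operatorname{Var}(x_{ij}) = \lambda_{T_i}^{2} + \lambda_{M_j}^{2} + \theta_{ij},

where \lambda_{T_i} and \lambda_{M_j} are the trait and method loadings and \theta_{ij} is the random error variance. Convergent validity concerns the size of the trait variance \lambda_{T_i}^{2}; the method variance \lambda_{M_j}^{2} is precisely the systematic component that a single method cannot isolate.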
Despite its seminal role in the understanding and assessment of construct validity, the original Campbell and Fiske
(1959) approach to MTMM analyses has limitations. Most
notably, it prescribes no precise standards for determining
evidence of construct validity. Furthermore, the procedure
does not yield specific estimates of trait, method, and random
variance. Several alternative procedures have been proposed
for analyzing MTMM data (for a review, see Bagozzi, 1993).
The construct validation process in this study utilized two of
these alternative MTMM approaches; namely, application of
the confirmatory factor analysis (CFA) model (Jöreskog and Sörbom, 1993) and the correlated uniqueness (CU) model
(Marsh, 1989).
Research Design
Design of the MTMM Study
Previous studies have indicated that SERVQUAL must be modified for each unique service sector (Carman, 1990; Babakus
and Boller, 1992). Haywood-Farmer and Stuart (1988) empirically tested SERVQUAL and found it did not encompass all
the dimensions of professional service quality. They suggested
that service dimensions for core service, service customization,
and knowledge and information be added to the five dimensions of SERVQUAL. Of these additional dimensions, core
service was found to be the most important factor not represented in the SERVQUAL instrument. Related research of
professional service quality perception was done by Brown
and Swartz (1989). This study found that “professionalism”
and “professional competence” were significant factors for
both providers and patients in the evaluation of service quality.
The modified SERVQUAL approach utilized in this research, therefore, included the five dimensions of SERVQUAL
(Parasuraman, Zeithaml, and Berry, 1985), as well as the “core
medical service” (Haywood-Farmer and Stuart, 1988) and the
“professionalism/skill” (Brown and Swartz, 1989) dimensions.
The latter two dimensions were included to measure the technical aspects of health-care service. These same service quality
dimensions were also used in the earlier research of Walbridge
and Delene (1993), which involved physician attitudes toward
service quality. The seven dimensions, their origins, and their
definitions can be found in Table 1.
It is well known that measurement methods can affect the
nature of a respondent’s evaluation (Kumar and Dillon, 1992;
Phillips, 1981). Of the various methods used in measurement,
three were selected for this research: single-item global rating
method, constant-sum rating method, and multi-item rating
method.

Table 1. Service Quality Attributes

Assurance: Courtesy displayed by physicians, nurses, or office staff and their ability to inspire patient trust and confidence (Parasuraman, Zeithaml, and Berry, 1988)
Empathy: Caring, individualized attention provided to patients by physicians and their staffs (Parasuraman, Zeithaml, and Berry, 1988)
Reliability: Ability to perform the expected service dependably and accurately (Parasuraman, Zeithaml, and Berry, 1988)
Responsiveness: Willingness to provide prompt service (Parasuraman, Zeithaml, and Berry, 1988)
Tangibles: Physical facilities, equipment, and appearance of contact personnel (Parasuraman, Zeithaml, and Berry, 1988)
Core medical service: The central medical aspects of the service: appropriateness, effectiveness, and benefits to the patient (Haywood-Farmer and Stuart, 1988)
Professionalism/skill: Knowledge, technical expertise, amount of training, and experience (Brown and Swartz, 1989)

The single-item global rating method provided the
respondent with dimensions and definitions of each service
dimension. With this method, the respondent reported his or
her evaluation rating on each dimension—without evaluating
the multiple indicators (components) of each dimension and
without comparing it to other service dimensions. The constant-sum rating method, in contrast, is comparative in nature,
requiring the respondent to allocate a given number of “importance points” among various dimensions. In this method,
respondents were forced to think about the relative importance of each service dimension. In the multi-item rating
method, multiple indicators were developed that were intended to capture each of seven service quality dimensions.
It is generally accepted that the multi-item rating method
can provide a better sampling of the domain of content than
the single-item global rating method (Bagozzi, 1980). Thus,
content validity can be enhanced with multiple-item measures.
They also have the advantage of allowing the computation
of reliability coefficients (e.g., Cronbach’s alpha [Cronbach,
1951]). Reliability assessment with the single-item global rating
method is a problem in typical survey research studies, because
measurement error cannot be estimated with a single item.
However, a drawback in using the multi-item rating
method in place of the single-item global rating method is
the tendency toward questionnaire length along with possible
detrimental effects on response rate and respondent fatigue.
In other words, the single-item global rating method has the
potential advantage of parsimony for the respondent. Therefore, in areas where there is little or no difference between
the explanatory power of single- and multi-item methods, the
single-item global rating method may be preferable in studies
where parsimony is important.
In studying service quality dimensions, it may be helpful
to gather information regarding the relative importance of
each dimension. One way to do this is through the use of
constant-sum rating method. The constant-sum rating method
forces respondents to identify the comparative importance of
each service dimension. In one health-care study, this constant-sum method was used to examine determinant dimensions in hospital preference (Woodside and Shinn, 1988). The constant-sum method also tends to eliminate individual response styles
of “nay-saying” and the “halo effects,” which cause respondents
to carry over their judgments from one dimension to another
(Churchill, 1991). In an earlier, related study (Walbridge and
Delene, 1993), it was believed that physicians may be reluctant
to rate any service quality dimension as unimportant. Thus,
the constant-sum rating method was employed in this research
to determine its applicability as an efficient measurement
method of health-care service quality where physicians’ perceptions were surveyed.
There are a few drawbacks to using the constant-sum
method. The first is its inherent increase in task complexity for
respondents. It requires more mental effort from the individual
than either the single-item or multi-item methods. Each rating
decision affects other ratings because of the constraints imposed by the nature of the measurement process. As the number of attributes increases, respondents become more taxed
(Aaker, Kumar, and Day, 1994; Malhotra, 1995). This increase
in complexity may lead the subject to use a subset of the
dimensions instead of including all of them in his or her
evaluation (Churchill, 1991). This effect may be heightened
if the subject does not view the dimensions as being completely
independent. This lack of independence was found to produce
spurious correlations sometimes (Kerlinger, 1973).
In this study, we asked physicians how they perceived the
seven dimensions of health-care service quality as measured
by three different methods: single-item global rating method,
multi-item rating method, and constant-sum rating method.
Questionnaire Development
A panel of physicians was consulted on questionnaire design
and semantics, and input was also received from a state university hospital. The questionnaire was divided into four sections,
with one section for each of the three measurement methods
and the last section containing demographic questions. Section One utilized the single-item global rating method. The
subjects were given the name and definition of each dimension
indicated in Table 1 and asked to rate the importance of each
dimension on a seven-point scale. Pretesting with physicians
showed that a conventional scale using the two bipolar adjectives “unimportant” and “important” was inappropriate. Physicians were reluctant to rate any of the dimensions as “unimportant” or “less important.” Further pretesting results suggested
the use of “Important” for the low end (one) and “Critical”
for the high end (seven) in a seven-point scale.
Section Two was a constant-sum rating method that asked
the subjects to distribute a fixed number of “importance
points” among the seven dimensions. This led respondents
to rate the comparative importance of each service dimension
relative to the others. The same names and definitions of the
dimensions were used as in Section One.
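Before analysis, constant-sum responses must be screened so that each allocation actually exhausts the fixed budget of points. A minimal Python sketch (the 100-point total and the sample allocation are ours, purely for illustration; the article does not state the number of points used):

    # Screen and normalize constant-sum allocations (hypothetical 100-point total).
    TOTAL_POINTS = 100
    DIMENSIONS = ["assurance", "core medical service", "empathy",
                  "professionalism/skill", "reliability", "responsiveness",
                  "tangibles"]

    def normalize(allocation):
        """Return importance proportions, or None for an unusable response."""
        if len(allocation) != len(DIMENSIONS) or sum(allocation) != TOTAL_POINTS:
            return None
        return [points / TOTAL_POINTS for points in allocation]

    # One fictitious physician's allocation of 100 importance points.
    print(normalize([20, 25, 10, 20, 10, 10, 5]))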
Section Three consisted of forty-three (43) “practice characteristics.” Placed in random order, each practice characteristic
corresponded with one of the seven service quality dimensions, with between five and seven characteristics pertaining
to each service quality dimension based on a previous study
(Walbridge and Delene, 1993) (please see Appendix A). In
this section, physicians evaluated the practice characteristics,
without referring to the names or definitions of the pertinent
service quality dimensions. Practice characteristics were evaluated using the same “important–critical” anchors used in
Section One. Respondents then answered questions related
to demographic variables in the last section.
Sampling
A total of 1,428 physicians were randomly selected by a commercial mail-order vendor from a national database leased from the
American Medical Association. Some professional categories
were eliminated to remove nonphysicians from the list, as
well as specialties considered divergent from the mainstream
of health-care service (for a listing of the specialties used,
please see Appendix B). The four-page, self-administered
questionnaire was mailed to physicians. To attain a higher
response rate, physicians received a “warm-up” postcard announcing the arrival of the questionnaire within the next week.
The initial mailing of the questionnaire included a cover letter
explaining the purpose of the research and the confidentiality
of responses. Approximately 6 weeks later, a follow-up mailing
of 1,200 questionnaires was sent to physicians who had not
yet responded.
Of the original 1,428 addresses, 72 were invalid, and six of the returned questionnaires were unusable. A total of 348 responses
were received from the two mailings with an effective response
rate of 24.4%. Demographic characteristics of our sample were
compared with those of the physician population in the United
States in Table 2. The similarities become apparent through
simple visual inspection. The population of physicians in the
United States is 16.4% female; whereas, the sample was 18.7%
female. The age distribution of the sample was also somewhat
similar, especially for physicians under the age of 65, which
accounted for about 90%. The sample was similar to the
population on the basis of practice specialty. Goodness-of-fit tests were performed for sex, age, and specialty group categories. The results were χ² = 0.4 (df = 1, p = 0.729) for sex, χ² = 14.4 (df = 4, p < 0.001) for age, and χ² = 4.8 (df = 3, p = 0.084) for specialty group. These results suggest that the sample reflected the population’s sex and specialty group compositions, but consisted of physicians who were somewhat older than the population.
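The goodness-of-fit computation itself is routine. The Python sketch below tests a sample’s sex composition against the population proportions of Table 2; the observed counts are hypothetical stand-ins, since the raw frequencies are not reported here:

    # Chi-square goodness-of-fit test of sample composition against
    # known population proportions (observed counts are hypothetical).
    from scipy.stats import chisquare

    observed = [282, 65]                # male, female counts (illustrative)
    population_share = [0.836, 0.164]   # population proportions from Table 2
    expected = [p * sum(observed) for p in population_share]

    stat, pvalue = chisquare(f_obs=observed, f_exp=expected)
    print(f"chi-square = {stat:.2f}, p = {pvalue:.3f}")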
Analysis
Instrument Reliability for the Multi-Item Rating Method
It is necessary to derive a composite score for each of the
seven service quality dimensions measured by the multi-item
rating method. For this purpose, the level of internal consistency was checked as a way of assessing the homogeneity
of items comprising each dimension. The Cronbach’s alpha
indices for the seven dimensions ranged from 0.80 to 0.90,
with a mean of 0.85. This high degree of internal consistency
(Nunnally, 1978) allowed us to sum the ratings to get composite scores for each of the seven dimensions. Each composite
score indicated a measure of each service quality dimension
obtained by the multi-item rating method. These composite
scores were used for the MTMM analysis along with the other
scores assessed by the single-item global rating method and the constant-sum rating method.

Table 2. Demographics: Population vs. Sample (%)

                        Population(a)   Sample
Age
  Under 35                  24.4         20.3
  35–44                     39.8         32.1
  45–54                     21.6         18.5
  55–64                     12.5         17.1
  65 and over                1.7         12.1
Gender
  Male                      83.6         81.0
  Female                    16.4         18.7
Specialty group
  Primary care              34.5         39.4
  Surgical                  12.9         10.3
  Hospital based            10.8         12.9
  Other specialties         22.5         33.3

(a) Source: American Medical Association, 1990.
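For reference, the internal-consistency index used above is straightforward to compute. A short Python sketch of the standard Cronbach (1951) formula, run on fabricated illustrative ratings rather than the study’s data:

    import numpy as np

    def cronbach_alpha(items: np.ndarray) -> float:
        """Cronbach's alpha for an (n_respondents x n_items) rating matrix."""
        k = items.shape[1]
        sum_item_vars = items.var(axis=0, ddof=1).sum()  # sum of item variances
        total_var = items.sum(axis=1).var(ddof=1)        # variance of sum scores
        return (k / (k - 1)) * (1 - sum_item_vars / total_var)

    # Illustrative data: 5 respondents rating 6 items on a 7-point scale.
    ratings = np.array([[7, 6, 7, 6, 7, 6],
                        [5, 5, 4, 5, 5, 4],
                        [6, 6, 6, 5, 6, 6],
                        [4, 5, 4, 4, 3, 4],
                        [7, 7, 6, 7, 7, 7]])
    print(f"alpha = {cronbach_alpha(ratings):.2f}")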
Construct Validation of the Modified SERVQUAL Scale
Our investigation of the construct validity of the modified
SERVQUAL involved CFA of the multitrait–multimethod
(MTMM) data. CFA allows methods to affect measures of traits
to different degrees; whereas, methods are assumed to covary
freely among themselves. CFA then provides assessments of
overall goodness of fit for the variable specification of the given MTMM data, while enabling the partition of variance in
measures into trait, method, and error components. Trait
variance reflects the shared variation for measures of a common trait and can be used to assess convergent validity. Discriminant validity among traits is indicated by intertrait correlations significantly lower than unity (Bagozzi and Yi, 1991).
As suggested by Bagozzi (1993) and Widaman (1985), we
first tested a CFA model based on the hypothesis that the
variation in measures can be explained by traits and random
error (i.e., the trait-only model). In this model, there are seven
traits (i.e., seven service quality dimensions), and each trait
is indicated by three measures. Each of three measures is related
to its own rating method (i.e., single-item global rating
method, constant-sum rating method, etc.). This model resulted in poor fit, as indicated by χ²(168) = 1935.04, p = .00. A probable cause for the trait-only model’s poor fit was the
presence of method factors as important sources of variation in
the measures (Bagozzi, 1993; Widaman, 1985). Subsequently,
another CFA model that incorporated trait and method factors
was tested. Estimation of this trait-method model was not
possible, however, because iterations failed to converge. Such
an occurrence in the confirmatory factor analysis of a trait-method model is not uncommon (Marsh and Bailey, 1991;
Bagozzi, 1993; Van Driel, 1978). Also, frequently found in
the CFA solution are improper parameter estimates, such as
negative variances. In all these instances, the confirmatory
factor analysis model is construed as an inappropriate specification of the variable structure and must be rejected (Bagozzi
and Yi, 1991).
In view of these problems that frequently accompany the
application of CFA models to MTMM data, Marsh (1989)
proposed the CU model as an alternative. The CU model
differs from the CFA model primarily in the interpretation of
method effects. In the CFA model, method effects are inferred
by squared method factor loadings. In contrast, the CU model
specification does not include method factors. Instead, method
effects are depicted as and inferred from correlations among
error terms corresponding to the measures based on a common
method. This depiction of method effect is the main reason
why the CU model seldom produces an ill-defined solution
(Marsh, 1989, p. 341). Another difference between the two
approaches rests on the assumption regarding method correlations. Whereas the CU model assumes that methods are uncorrelated, no such assumption is necessary for the CFA model.
Finally, both CFA and CU models are premised on the additive
effects of traits and methods on measures.
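For readers wishing to replicate this comparison, both specifications can be written in the lavaan-style model syntax accepted by structural equation modeling packages. The sketch below assumes the Python package semopy, shows only two traits for brevity, and uses variable names of our own invention:

    # Trait-only CFA vs. correlated uniqueness (CU) specifications for a
    # miniature MTMM design: two traits, each measured by three methods.
    trait_only = """
    Assurance =~ assur_single + assur_consum + assur_multi
    Empathy   =~ emp_single + emp_consum + emp_multi
    """

    # In the CU specification, method effects enter as correlated error
    # terms among measures sharing a method; no method factors are added.
    correlated_uniqueness = trait_only + """
    assur_single ~~ emp_single
    assur_consum ~~ emp_consum
    assur_multi ~~ emp_multi
    """

    # With a respondent-by-measure DataFrame df, fitting would look like:
    #   import semopy
    #   model = semopy.Model(correlated_uniqueness)
    #   model.fit(df)
    #   print(model.inspect())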
Given its robust nature, the CU model is an attractive
alternative when the CFA model results in an ill-defined solution or nonconvergence (Bagozzi, 1993; Marsh, 1989). We
subsequently tested the CU model’s fit to our MTMM data
(see Figure 1 for the diagram of the CU model). Another
application of the CU model in a similar situation can be
found in Kim and Lee’s (1997) construct validation study
involving measures of children’s influences on family decisions. The CU model’s fit as indicated by the χ² test result (χ²(105) = 311.62, p = .00) was unsatisfactory. However, because of the χ² test’s sensitivity to sample size, some researchers (Bentler, 1990; Bagozzi and Yi, 1991) have suggested
fit assessments based on other goodness-of-fit indices when
the sample size is suspected to be the cause for rejecting
the hypothesized model. One frequently used measure is the
comparative fit index (CFI), which evaluates the practical
significance of the variance explained by the model (for a
detailed discussion, see Bentler, 1990 and Bagozzi, Yi, and
Phillips, 1991). For our CU model, computation of the CFI
yielded .96. This is much greater than the .90 rule of thumb
suggested as the minimum acceptable level by Bentler (1990).
Therefore, the CU model captures a significant proportion of
variance of our MTMM data from a practical point of view;
hence, little variance remains to be accounted for.
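The CFI itself is simple arithmetic on the model and baseline (independence model) chi-square statistics (Bentler, 1990). In the sketch below, the CU model values are those reported above, while the baseline values are hypothetical placeholders, since the article does not report them:

    # CFI = 1 - max(chi2_model - df_model, 0) / max(chi2_base - df_base, 0)
    def cfi(chi2_m, df_m, chi2_b, df_b):
        return 1.0 - max(chi2_m - df_m, 0.0) / max(chi2_b - df_b, 0.0)

    # Reported CU model fit: chi2(105) = 311.62. Baseline values are
    # hypothetical; a 21-variable independence model has df = 21*20/2 = 210.
    print(round(cfi(311.62, 105, 5300.0, 210), 2))  # about 0.96 with this baseline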
Table 3 presents the estimated factor loadings for the CU
model. Significant trait factor loadings (t > 2.0) establish the
convergent validity of the measures (Widaman, 1985; Bagozzi
and Yi, 1991). Although the trait factor loadings for all the
measures based on global single-item method and multi-item
method are significant, only three of the seven constant-sum
measures were significant. The three dimensions of health-care
service quality for which the constant-sum measure exhibited
convergent validity are assurance, responsiveness, and tangibles. An assessment of the extent of convergence shown by
each measure requires a decomposition of the total variance
into proportions attributable to the corresponding trait and
random error. As in the CFA model, the amount of trait
variance in a measure is inferred by the squared trait factor
loading for that measure. For all seven dimensions of health-care service quality, trait variances for constant-sum measures were extremely low, with a range between 0.00 and 0.03 (or 0 and 3%). The best results were found for the global single-item measures. Their trait variances ranged between 0.45 and
0.83, with a mean level of .57. The seven multi-item measures
showed levels of trait variance generally lower than the global
single-item measures. Trait variances for these measures
ranged from 0.30 to 0.49, with a mean of 0.39. According to
Bagozzi and Yi (1991), strong (weak) evidence for convergent
validity is achieved when at least (less than) half of the total
variation in a measure is caused by trait. According to this
rule of thumb, there is strong evidence for convergent validity for most of our global single-item measures (5 out of 7). Trait variances for all seven multi-item measures fall below the level of 0.5. Therefore, evidence for convergent validity is weak for these measures using the multi-item rating method; whereas, the constant-sum measures exhibit little or no convergent validity.

[Figure 1. Correlated uniqueness model for the MTMM data.]
Table 3. Summary of Parameter Estimates for the Correlated Uniqueness Model: Trait Factor Loadings

                          Single-Item Global   Constant-Sum   Multi-Item
Assurance                    0.69 (0.10)       −0.18 (0.06)   0.63 (0.09)
Core medical service         0.67 (0.10)       −0.01 (0.06)   0.55 (0.09)
Empathy                      0.75 (0.10)        0.05 (0.06)   0.70 (0.09)
Professionalism/skill        0.71 (0.10)        0.06 (0.06)   0.62 (0.09)
Reliability                  0.71 (0.10)        0.10 (0.06)   0.66 (0.09)
Responsiveness               0.83 (0.11)        0.16 (0.06)   0.60 (0.09)
Tangibles                    0.91 (0.11)        0.16 (0.06)   0.62 (0.09)

Standard errors of the estimates are shown in parentheses. Each measure was allowed to load only on its own trait; all other trait loadings were fixed at zero.
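Concretely, this convergence assessment amounts to squaring each standardized trait loading in Table 3 and applying the 0.50 rule of thumb, as the brief Python sketch below does for the single-item global measures:

    # Trait variance = squared standardized trait loading (Table 3).
    # Strong convergent validity requires trait variance >= 0.50
    # (Bagozzi and Yi, 1991).
    loadings = {"assurance": 0.69, "core medical service": 0.67,
                "empathy": 0.75, "professionalism/skill": 0.71,
                "reliability": 0.71, "responsiveness": 0.83, "tangibles": 0.91}
    for trait, lam in loadings.items():
        trait_var = lam ** 2
        verdict = "strong" if trait_var >= 0.50 else "weak"
        print(f"{trait}: trait variance = {trait_var:.2f} ({verdict})")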
As noted before, the effects of methods under the CU model
are represented as correlations among error (uniqueness)
terms. Although the CFA model enables the separation of the
variance portion that is caused by method bias, we can only
infer the significance and size of the method bias in the CU
model analysis based on examination of the estimated uniqueness correlations. Tables 4(a), 4(b), and 4(c) display the estimated error variances and covariances for single-item global
measures, constant-sum measures, and multi-item measures,
respectively. For the single-item measures, a significant covariance between error terms was found in 14 of 21 possible
cases (see Table 4a). When these covariances were converted
into correlations, the values ranged from 0.28 to 0.82, with
an average of 0.59. These levels of uniqueness correlations
demonstrate a considerable degree of method effect contained
in the measurement. Therefore, a substantial portion of the
variations in the global single-item measures can be attributed
to the measurement procedure.
For the constant-sum measures, 16 of the 21 uniqueness
covariances were significant (see Table 4b). Although this
indicates the existence of a significant method effect, the magnitudes of the uniqueness correlations (range: 0.03–0.36;
mean 0.19) suggest that the size of method effect is small.
The very large error variances shown in Table 4b demonstrate
that almost all the variations in the constant-sum measures
are attributable to random error. With regard to the multi-item measures, as can be seen in Table 4c, all uniqueness
covariances are significant. Uniqueness correlations were also
generally high (range: 0.37–0.71; mean 0.59).
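Each uniqueness correlation is simply an error covariance rescaled by the two error standard deviations. For example, using the assurance and core medical service entries of Table 4(a):

    # Uniqueness correlation: r = cov(e1, e2) / sqrt(var(e1) * var(e2)).
    # Error variances and covariance taken from Table 4(a).
    import math

    var_assurance, var_core, cov = 0.51, 0.53, 0.37
    r = cov / math.sqrt(var_assurance * var_core)
    print(f"uniqueness correlation = {r:.2f}")  # about 0.71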
Our next investigation focused on discriminant validity
among the seven dimensions of health-care service quality. It
consisted of verifying whether the correlations among the
seven dimensions (i.e., traits) as measured by three different
methods were significantly different from unity (+1 or −1)
(Widaman, 1985; Bagozzi, Yi, and Phillips, 1991). As shown
in Table 5, all of the correlations among the dimensions are
significant and very high (range: 0.69–0.99; mean: 0.84).
Seven of the 21 correlations were above the 0.90 level. Such
high correlations among service quality dimensions (range:
0.67–0.92; mean: 0.82) were also observed in the study conducted by Dabholkar, Thorpe, and Rentz (1996). It should
be noted, however, that these correlations are disattenuated
correlations (i.e., corrected for measurement error) and are
larger than those correlations among measures. Particularly
notable is the correlation between the dimensions of assurance
and empathy (0.99), which is near unity. This high correlation between the assurance dimension and the empathy dimension seemed to be consistent with the findings of the past studies that discovered the dimensional instability of the SERVQUAL scale (Babakus and Boller, 1992; Carman, 1990). A formal test of discriminant validity was conducted by computing a 95% confidence interval (the estimated correlation ± twice its standard error estimate) for each of the estimated correlations among the seven dimensions. Despite the high levels of correlation observed between the dimensions, only one (that between assurance and empathy) fell within the interval. Hence, from a strict statistical point of view, discriminant validity was established, except between assurance and empathy. However, whether these dimensions are distinct from a practical standpoint is highly questionable.
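The interval test reduces to simple arithmetic on the disattenuated correlations and their standard errors from Table 5. A short sketch that flags any trait pair whose 95% interval covers unity:

    # Discriminant validity check: interval = correlation +/- 2 * SE;
    # validity fails when the interval contains 1.0. Values from Table 5.
    pairs = {("assurance", "empathy"): (0.99, 0.02),
             ("assurance", "tangibles"): (0.72, 0.05)}
    for (t1, t2), (r, se) in pairs.items():
        lower, upper = r - 2 * se, r + 2 * se
        distinct = not (lower <= 1.0 <= upper)
        print(f"{t1} vs {t2}: [{lower:.2f}, {upper:.2f}] -> distinct: {distinct}")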
Table 4. Summary of Parameter Estimates for the Correlated Uniqueness Model: Error Variances and Covariances

(a) Error Variance and Covariance for Single-Item Global Measures

Assurance               0.51 (0.12)
Core medical service    0.37 (0.11)   0.53 (0.12)
Empathy                 0.35 (0.13)   0.33 (0.12)   0.41 (0.14)
Professionalism/skills  0.35 (0.10)   0.34 (0.11)   0.28 (0.10)   0.51 (0.13)
Reliability             0.32 (0.11)   0.38 (0.11)   0.30 (0.11)   0.41 (0.12)   0.49 (0.13)
Responsiveness          0.27 (0.11)   0.23 (0.11)   0.22 (0.11)   0.26 (0.12)   0.26 (0.13)   0.31 (0.16)
Tangibles               0.14 (0.11)   0.09 (0.11)   0.11 (0.11)   0.08 (0.13)   0.10 (0.13)   0.09 (0.14)   0.17 (0.19)

(b) Error Variance and Covariance for Constant-Sum Measures

Assurance                0.99 (0.08)
Core medical service    −0.03 (0.06)   1.00 (0.08)
Empathy                  0.24 (0.06)  −0.18 (0.06)   0.99 (0.08)
Professionalism/skills  −0.35 (0.06)  −0.14 (0.06)  −0.13 (0.06)   0.99 (0.08)
Reliability             −0.32 (0.06)  −0.21 (0.06)  −0.21 (0.06)   0.20 (0.06)   0.99 (0.08)
Responsiveness          −0.16 (0.06)  −0.20 (0.06)  −0.15 (0.06)  −0.16 (0.06)   0.27 (0.06)   0.97 (0.08)
Tangibles               −0.10 (0.06)  −0.23 (0.06)  −0.09 (0.06)  −0.18 (0.06)   0.07 (0.06)   0.26 (0.06)   0.97 (0.08)

(c) Error Variance and Covariance for Multi-Item Measures

Assurance               0.58 (0.11)
Core medical service    0.35 (0.09)   0.68 (0.09)
Empathy                 0.34 (0.11)   0.36 (0.09)   0.49 (0.12)
Professionalism/skills  0.39 (0.08)   0.41 (0.09)   0.32 (0.09)   0.60 (0.11)
Reliability             0.32 (0.09)   0.32 (0.09)   0.28 (0.09)   0.31 (0.10)   0.55 (0.11)
Responsiveness          0.31 (0.08)   0.34 (0.09)   0.28 (0.08)   0.29 (0.08)   0.37 (0.09)   0.63 (0.10)
Tangibles               0.38 (0.08)   0.38 (0.08)   0.32 (0.08)   0.43 (0.09)   0.41 (0.09)   0.37 (0.08)   0.61 (0.10)

Diagonal entries are error variances; entries below the diagonal are error covariances, with rows and columns following the order of the dimensions listed. Standard errors of the estimates are shown in parentheses. In the original table, all estimates differing significantly from zero were underscored.
In summary, the above results of the CU model analysis
of the MTMM data first led us to conclude that convergent
validity was established for two of the three measures, the
single-item global measure and multi-item measure. Based on
Bagozzi and Yi’s (1991) rule of thumb, only the single-item
global measure, which captured an average trait variance
greater than 0.50, demonstrated strong evidence of convergence; whereas, weak evidence of convergence was found for
the multi-item measure. For the constant-sum measure, on the
other hand, there was virtually no sign of convergence. Almost
all of the variance in the seven constant-sum measures (for the
seven service dimensions) was attributed to random error.
With respect to discriminant validity, from a strict statistical
viewpoint, discrimination was demonstrated among the seven
health-care service quality dimensions, except for one instance
(between “assurance” and “empathy”). That is, all intertrait
(or interdimensional) correlations except one were significantly less than unity. However, the magnitudes of the intertrait correlations were generally very high, with a mean
value of 0.84. Hence, the seven dimensions did not seem
separable in a practical sense. We should note, however, that
the interpretation of discriminant validity is meaningful only
when convergent validity is established (Bagozzi, 1993). Given
our finding that convergent validity was established for two
of the three types of measures tested, the evidence relating to
discriminant validity should be viewed with caution.
Implications and Conclusion
One of the more pressing challenges health-care providers and
researchers face is to develop a better understanding of the key
dimensions constituting health-care quality and valid approaches to their measurement. This research focused on conceptual and measurement issues relating to the study of health-care quality. In contrast to most of the past research in this area, we took the physician’s (service provider’s) rather than the patient’s (service recipient’s) perspective. This approach is justified in view of the prevalent understanding that health-care recipients are often unable to evaluate key dimensions of health-care service (Bopp, 1990; Hensel and Baumgarten, 1988), and, thus, may not have as much to contribute to the design of an effective health-care system as providers. Another contrast is found in methodological approach. Whereas past studies that investigated the validity of the SERVQUAL scale tended to lack methodological rigor and scope, our construct validation procedure based on the MTMM data analysis allowed for a more systematic scrutiny of key measurement properties of the scale (i.e., convergent validity, discriminant validity, and method bias).
Table 5. Summary of Parameter Estimates for the Correlated Uniqueness Model: Trait Intercorrelations

Assurance               1.00
Core medical service    0.94 (0.03)   1.00
Empathy                 0.99 (0.02)   0.91 (0.03)   1.00
Professionalism/skills  0.80 (0.04)   0.92 (0.03)   0.77 (0.04)   1.00
Reliability             0.89 (0.03)   0.90 (0.03)   0.83 (0.03)   0.91 (0.03)   1.00
Responsiveness          0.81 (0.04)   0.80 (0.05)   0.76 (0.04)   0.83 (0.03)   0.83 (0.03)   1.00
Tangibles               0.72 (0.05)   0.78 (0.05)   0.69 (0.05)   0.95 (0.02)   0.85 (0.04)   0.83 (0.04)   1.00

Rows and columns follow the order of the dimensions listed. Standard errors of the estimates are shown in parentheses.
First, we compared the performance of the constant-sum
rating method, the single-item global rating method, and the
multi-item rating method in measuring the health-care service
quality. All seven measures based on the constant-sum method
showed almost complete lack of convergence with the measures based on other methods. One plausible explanation for
this is the relatively high degree of complexity inherent in
the measures using the constant-sum method. This measure
requires more effort on the part of the respondents, and, thus,
is likely to create cognitive strains. Consequently, resulting
responses may not be as reliable as those obtained by other
methods. In fact, many physicians seemed to have difficulty
allocating the importance points among the seven categories.
In contrast to common expectation, the single-item global
measures performed better than the multi-item measures in
capturing the intended dimensions. An attempt to generalize
this finding beyond health-care providers may be inappropriate, because the result could have been caused by the high
level of familiarity that our physician respondents had with
the health-care service quality dimensions. A clear understanding of the issues involved in the questions reduces measurement error in responses. Thus, such an outcome may not
be obtained from health-care recipients, who may not possess
such a clear understanding. Nonetheless, this finding suggests
that single-item global measures may elicit responses that are
as reliable as the multi-item measures when knowledgeable
service providers are involved, and do so with greater parsimony. The single-item global rating method may be useful if
the goal of a study is to gain an understanding of the general
nature of health-care service issues. We should add, however,
that assessment of reliability level for single-item measures is
not possible in most cases. This remains a major problem for
the single-item global rating method.
When the research is to be diagnostic in nature, focusing
on specific characteristics of the service offering in an effort to
identify areas for improvement, the multi-item rating method
has greater utility. The multi-item rating method has the distinct advantage of being able to generate detailed information
on specific aspects of service quality that can be used as a
basis for action plans. As a caveat, it should be noted that
our recommendation regarding the use of the single-item
global rating method and the multi-item rating method is
limited to future research involving health-care service providers’ perceptions. For research involving the perceptions of
patients who do not understand the key dimensions of health-care service quality, the multi-item rating method seems to
be a better choice, because this method is less susceptible to
measurement error than the single-item global rating method.
In terms of the discriminant validity of the seven health-care
service quality dimensions, our results were not supportive of
the validity. The computed magnitudes of interdimensional
correlations were very high. Although all correlations except
one satisfied the statistical criterion applied (i.e., significantly
less than unity), their magnitudes (ranging between 0.69 and
0.99) cast much doubt on the separability of these dimensions
from a practical viewpoint. Considering that a similar finding
has been reported before (Dabholkar, Thorpe, and Rentz,
1996), caution is warranted in future applications of the
SERVQUAL scale or its modified versions in health-care service quality research. Because the validation of a measure is
an ongoing process, we suggest that more research be directed
toward producing a suitable adaptation of the SERVQUAL
scale. It is important for this research to take into consideration
the unique aspects of this particular service sector.
This study limited its research scope to physicians’ perceptions toward health-care service quality. Under continuous quality improvement (CQI) or total quality management (TQM) programs,
patients’ perceptions or evaluations of health-care services also
play a critical role. If health-care providers do not understand
how service recipients evaluate health-care services, it is difficult for providers to design or improve strategic planning
and marketing activities effectively. Therefore, research based
upon the patients’ perspective is necessary. Based upon the
perceptions of both parties in the health-care delivery system,
we can identify areas where mutual understanding exists,
means to inform and educate the public, and ways to improve
the current delivery system.
References
Aaker, David A., Kumar, V., and Day, George S.: Marketing Research.
John Wiley & Sons, Inc., New York, NY. 1995.
Asubonteng, Patrick, McCleary, Karl J., and Swan, John E.: SERVQUAL Revisited: A Critical Review of Service Quality. The Journal
of Services Marketing 10(6) (1996): 62–71.
Babakus, Emin, and Mangold, W. Glynn: Adapting the SERVQUAL
Scale to Hospital Services: An Empirical Investigation. Health
Services Research 26 (February 1992): 767–786.
Babakus, Emin, and Boller, Gregory W.: An Empirical Assessment
of the SERVQUAL Scale. Journal of Business Research 24(3) (1992):
253–268.
Bagozzi, Richard P.: Causal Models in Marketing. John Wiley & Sons, New York, NY. 1980.