
JOURNAL OF EDUCATION FOR BUSINESS, 88: 26–35, 2013
Copyright © Taylor & Francis Group, LLC
ISSN: 0883-2323 print / 1940-3356 online
DOI: 10.1080/08832323.2011.633580

Initial Impressions and the Student Evaluation
of Teaching
Dennis E. Clayson


University of Northern Iowa, Cedar Falls, Iowa, USA

Correspondence should be addressed to Dennis E. Clayson, University of Northern Iowa, Department of Marketing, 344 CBB, Cedar Falls, IA 50614–0126, USA. E-mail: [email protected]


Do first impressions influence the final evaluations given in a class? The author looked at the
initial student perceptions and conditions of a class and compared these with conditions and
evaluations 16 weeks later at the end of the term. It was found that the first perceptions of the
instructor and the instructor’s personality were significantly related to the evaluations made at
the end of the semester. Implications for the validity and utilization of the student evaluation of instruction are discussed.
Keywords: confirmatory bias, faculty evaluation, initial perceptions, personality, student
evaluation of teaching

What influence do the initial impressions that students form
about an instructor have on the final evaluation of a class?
The answer to this question would add valuable information
to a long-lasting debate about the validity of the student
evaluation of teaching (SET).
The utilization of a SET process has become almost universal in modern universities and colleges (Clayson, 2009).
These instruments are not only utilized to improve instruction, but they are also extensively used to establish tenure,
promotion, merit pay, and public reputations. Consequently,
the SET process has been extensively debated and researched.
Even though the first published article on the evaluations appeared almost 85 years ago (Remmers & Brandenburg, 1927,
as cited in Kulik, 2001), little agreement has been reached about the validity of the instruments. This is primarily the result of two broad issues that have plagued SET research from
the start. First, there has been no generally accepted definition of good or effective teaching. Institutions have utilized
instruments to measure constructs that they have not definitively identified. Second, even without construct definitions,
there are aspects of pedagogical practice that would generally
be accepted as indicators of good instruction. For example,
many would assume that good or effective instruction would
lead to increased student learning, or that the personality of
the instructors would be tangentially related to the process, but not overwhelmingly so. However, the research findings
on these issues have proven to be ambiguous.
These problems raise an interesting methodological issue.
How could it be demonstrated that variables unrelated to
teaching influence the evaluations if good and/or effective
teaching has not been clearly defined? One solution would
be to search for influences on SET that could not be logically connected to what happens in a setting, either physical or
temporal, where instruction actually occurs. What influence,
for example, would a brief initial exposure to an instructor
have on the later evaluation of that instructor, especially if the
first exposure was made before any instruction took place?
Specifically, I looked at perceptions of students before instruction actually began and compared them with the SET
outcomes at the end of a 16-week term. No published literature could be found looking at this relationship so early in
a term. If SET validly measures the quality of instruction,
except for second order variables such as personality, initial
impressions before instruction begins should be unrelated to
the final evaluations.

REVIEW OF THE LITERATURE
SET research has been handicapped by a number of problems including strongly held opinions, statistical questions,
and methodological issues arising from gathering data from
anonymous sources. As suggested previously, however, the
fundamental problem has been the failure to define the construct underlying the measurement process. There has been no widely accepted definition of what good or effective teaching is (Adams, 1997; Clayson, 2009; Kulik, 2001). Attempts
to circumvent this failure have centered on several issues.


Learning
Many assume that any criteria to define a construct of good
and/or effective teaching would include some measure of
learning. Cohen (1981) affirmed, “Even though there is a
lack of unanimity on a definition of good teaching, most
researchers in this area agree that student learning is the
most important criterion of teaching effectiveness” (p. 283).
Although some early studies identified a negative relationship between learning and SET (Attiyeh & Lumsden,
1972; Rodin & Rodin, 1972), most studies found either no
relationship or a positive association (Baird, 1987; Cohen,
1981; Dowell & Neal, 1982; Lundsten, 1986; Marlin & Niss,
1980). In the last 20 years, however, there has been a shift in
the findings. A recent meta-analysis (Clayson, 2009) found
no published findings after 1990 that contained a significant
positive association between learning and the evaluations.

Further, the relationship between SET and learning became
increasingly neutral or negative as more statistical sophistication was utilized in studies, and measures of learning
became more objective. The research concluded that while
there was a relationship between SET and perceived learning, there was none between objective measures of learning
and the evaluations.
If learning is seen as an improvement in subsequent performance, then findings suggest that learning may actually be
negatively related to SET. In a study of accounting students, it
was found that a significant negative relationship existed between student evaluations of their instructors in introductory
classes and how well they performed in a subsequent class
(Yunker & Yunker, 2003). V. E. Johnson (2003), utilizing
a university-wide database, reported that “stringent grading
is associated with higher levels of achievement in follow-up
courses” (p. 161), but that stringent grading was strongly
associated with lower evaluations. At the U.S. Air Force
Academy, students in calculus classes in which learning can
be objectively measured gave higher evaluations to instructors of classes in which they were getting higher grades,
but lower evaluations to instructors who produced students
who did well in subsequent calculus classes. The authors
concluded, “the correlation between introductory calculus
professor value added in the introductory and follow-on

courses is negative. Students appear to reward contemporaneous course value added . . . but punish deep learning”
(Carrell & West, 2010, p. 429). Consistent with this, they
found that inexperienced instructors got better evaluations in
introductory classes than did more seasoned instructors, who
produced students who did better in subsequent classes.
Personality
A similar pattern of mixed findings has been found with
the influence of the instructor’s personality on SET. In general, researchers from colleges of education report few personality traits that correlate with student ratings (Boice,
1992; Braskamp & Ory, 1994; Centra, 1993). Yet, studies
that manipulated actual classroom conditions found positive
relationships (Naftulin, Ware, & Donnelly, 1973; Widmeyer
& Loy, 1988). Other studies have found associations
between personality variables and the evaluation outcomes
that accounted for 50–75% of the total variance of the evaluations (Erdle, Murray, & Rushton, 1985; Feldman, 1986; Marks, 2000; Murray, Rushton, & Paunonen, 1990;
Sherman & Blackburn, 1975).
In a study of business students, Clayson and Sheffet (2006)
compared change in the students’ perception of personality
with change in the evaluations in the last six weeks of the
term. Even after the midterm, changes in evaluations, negative and positive for individual instructors, were highly related to changes in the students’ perception of personality,
and in the same direction. The study ruled out the possibility
that the personality–evaluation association was a statistical
artifact resulting from insufficient control of secondary variables. Another earlier study of business students found that
each standard deviation change in personality resulted in a
0.83 standard deviation change in the evaluations. Personality was found to be significantly related to every other factor
in the study, including the students’ perception of the instructor’s knowledge and fairness. It was negatively related
to rigor, and positively related to the students’ perception of
how much they had learned (Clayson & Haley, 1990).
Validity Inconsistency
These and other problems have led some researchers to
question the validity of SET. After reviewing the results
of a study of over 2,000 business students, Marks (2000)
concluded that “student evaluations lack discriminant
validity. No matter how reliable the measures, student evaluations are no more than perceptions and impressions”
(p. 117). Greenwald and Gillmore (1997) previously pointed
out that while evaluations of instructors have convergent
validity, they lack discriminant validity. In other words, SET
are correlated with attributes that a concept of good teaching
would be expected to be related with, but they are neutral
or are correlated with numerous attributes with which they
should not be related. It is also claimed that the instruments
lack divergent and outcome validity (Onwuegbuzie, Daniel,
& Collins, 2009; Sproule, 2002). This partially results from,
and is complicated by, a considerable halo effect (Orsini,
1988). Convergent validity and discriminant and divergent
invalidity would be expected if the evaluations were measuring a global construct that the students have a tendency to
apply to whatever question is addressed (Langbein, 1994).
What would that global construct be? Some researchers
have concluded that the evaluations most likely create something that could be called a likeability scale (Clayson,
2009; Clayson & Haley, 1990; Marks, 2000; Tang & Tang,
1987). This interpretation answers numerous questions about apparently contradictory findings, including the high impact of instructor personality.
While most individuals have had experiences of learning
a great deal from thoroughly disliked instructors, most would
agree that instruction is facilitated when a teacher is liked.
Yet, as found in the lack of relationship of SET with learning,
being liked may not be related to what many educators would
consider to be good teaching. As Foote, Harmon, and Mayo
(2003) concluded after reviewing the literature and the results
of their own study, “those [instructors] who score highly on
evaluations may do so not because they teach well, but simply
because they get along well with students” (p. 17).

Initial Impressions
Due to serial learning effects, it would be expected that the initial exposure would have a strong impact on students’ perception of personality. More than half a century ago, Solomon
Asch (1946) found that the order of terms used to describe
a person made a difference in how that individual was perceived. When a person was described as envious, stubborn,
critical, impulsive, industrious, and intelligent, rather than
intelligent, industrious, impulsive, critical, stubborn, and envious, the second order produced higher personal ratings than
the first. In some cases, a brief initial experience seems to
create a perception that is only slightly modified by further
interactions.
It has been shown that when subjects are first introduced
to another person, they make judgments of attractiveness,
likeability, trustworthiness, competence, and aggressiveness
within one tenth of a second. Surprisingly, it has also been
shown that more extended exposure (beyond one half of a
second) simply boosted the confidence of judgments (Willis
& Todorov, 2006). These findings fall under the rubric
of the primacy effect and refer to the process by which
early information may alter the perception of subsequent
information. This is especially true if the initial information
has high relevance, but is less true if subsequent information
is stronger, the situation is more structured, or if subjects
have higher cognitive sophistication (Haugtvedt & Wegener,
1994; Krosnick & Alwin, 1987).
Moreover, observers have a tendency to look for, find, and
remember information that fits their preconceived expectations, while information that contradicts these expectations
may be dismissed, ignored, or distorted. This confirmatory
bias was found in early studies by Wason (1960), who showed
subjects a sequence of three numbers and then asked them
to find a rule and use that rule to create a new sequence of
numbers that would conform to the original set. After every
attempt, the subjects were told whether they were correct or incorrect. Wason found that subjects had a tendency to create
rules that were much more complex than warranted. Furthermore, they seemed to offer only positive tests for their
hypotheses, and did not attempt to falsify their rules.

In other words, the subjects chose to select evidence that
would confirm a prior hypothesis rather than disconfirm it.
Later research found that the retrieval of confirming evidence
actively inhibits the retrieval of disconfirming evidence, further strengthening bias (Davies, 2003). Rabin and Schrag
(1999) found that initially being wrong often only strengthened the original hypothesis, and that people could believe
with near certainty in a false hypothesis despite receiving
an infinite amount of information. Prior training, education,
and experience seem to have little effect on this tendency
(Mahoney & DeMonbreun, 2005).
In forming the initial impressions that could alter the final evaluations of a class, students can use only their past experience and a brief exposure to the instructor. Evidence
suggests that students do form initial impressions about personality that are long lasting and do affect their perception
of the instructor. For example, Widmeyer and Loy (1988)
conducted an experiment in which all students were exposed
to the same guest instructor, but before the class began half
received descriptions of the instructor indicating that he was
warm, and the other half that he was cold. After the instructional period, not only did the students in the warm group
rate the instructor higher on positive aspects of personality,
but they also rated the instructor previously defined as warm
as having more teaching ability.
Other evidence indicated that many students appear to
form an opinion of a class and the instructor very early in a
course, and subsequent class and learning experiences may
do little to change that opinion (Feldman, 1977; Ortinau &
Bush, 1987; Sauber & Ludlow, 1988). Harvard psychologists
(Ambady & Rosenthal, 1993) investigated students’ reactions to randomly selected 30-s clips of soundless videotapes
of actual classroom instruction and found them highly correlated with end-of-course evaluations. Evaluations based on
6-s exposures were as predictive as judgments based on 30-s clips. Not only were classroom and instructor evaluations similar, but personality traits identified by independent raters were also highly correlated with the evaluations. Their
findings have been replicated under actual instructional classroom conditions (Babad, Avni-Babad, & Rosenthal, 2004).

HYPOTHESES
Unlike earlier investigations (Kohlan, 1973), which took their
first measures after instruction began, here I compared initial impressions gathered after students were exposed to the
instructor, but before the syllabus was distributed and before
any actual instruction had taken place. The literature predicts
that initial impressions of personality may be long lasting.
Indeed, one study (Clayson & Sheffet, 2006) did report a
simple least-squares correlation between measures of personality taken before a class began and the final evaluation.
To the extent that student evaluations at the end of a period
of instruction reflect actual teaching practice, it would not be expected that the initial perceptions and impressions would be related to the final evaluations (Wallace et al., 2001).
Hypothesis 1: Student initial impressions of the instructor’s
personality would be related to the final student impressions of personality.
Hypothesis 2: An initial SET before instruction begins would
not be related to the final SET at the end of the instructional period.


METHOD
The study was made possible by mining an existing database.
During the spring semester of 2003, over 700 students in organizational management and principles of marketing classes
were followed for an entire semester. Longitudinal data was
gathered about the students and their perceptions of the class
and instructor periodically over a period of 16 weeks. Within
this data were measures of student perceptions before instruction actually began and corresponding perceptions in
the last week of the semester. These measures could be compared with data taken at the end of the 16 weeks in order to investigate the question
raised in this study. The portions of the original study that
are directly related to the present research issue, or that may
bias the findings, are outlined subsequently.
Utilized Variables
Eight instructors, who taught 13 sections of introductory undergraduate business classes (six sections of organizational
management, and seven sections of principles of marketing), gave permission for the study to be conducted in their
classes over the period of a semester. On the first meeting of
the class, the instructors introduced themselves, turned the
class over to a researcher, and left the room. At this point,
students had not seen the syllabus, and had an average of
about 5 min of exposure to the instructor. Due to the nature of class schedules and the physical facilities, a student
could be exposed to the instructor for no less than 1 min
and not more than 10, depending on how early the student
arrived. Students who signed a consent form were then asked
to complete a questionnaire containing the variables that are
outlined subsequently, plus a set of demographic questions.
Authorized consent procedures were utilized throughout the
study. Pertinent to this investigation, the class sections were
evaluated again at week 16 during a 16-week term. Because
each student was identified by a code, the last questionnaires
were identical to the one given before the class began except
that no demographic data was gathered.
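Because each student was identified by a code, the week-1 and week-16 questionnaires could be matched respondent by respondent. A minimal sketch of that pairing step follows, assuming the two waves live in flat files keyed by the anonymous student code; the file and column names are hypothetical illustrations, not the study's actual database.

```python
import pandas as pd

# Hypothetical sketch (not the study's code): pair each student's
# week-1 and week-16 questionnaires by the anonymous student code.
week1 = pd.read_csv("week01_responses.csv")    # given before class began
week16 = pd.read_csv("week16_responses.csv")   # given in the last week

paired = week1.merge(
    week16,
    on="student_code",
    how="inner",                       # keep students present at both waves
    suffixes=("_initial", "_final"),
)
# Students who skipped either questionnaire drop out here, mirroring
# the reduction from 737 initial respondents to the 567 analyzed.
print(len(paired))
```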
The initial database came from a total of 737 students.
Not all questions were answered by each student, and not
all students completed their enrolled course. Consequently,
the sample size for this study consisted of 567 students who responded both to the initial questionnaire and to the questionnaire 16 weeks later at the end of the semester.
Variables
Several demographics were gathered at the first class
meeting. The student’s gender (male = 51%, labeled as 0;
female = 49%, labeled as 1, utilized as a dummy variable)
was self-reported. In addition, the actual cumulative grade point average (GPA) of each student at the beginning of the class was obtained, with the student’s permission, from the university registrar (M = 3.03, SD = 0.47).
A number of questions were asked to establish initial class
and student conditions. Students reported whether they had
heard anything about the instructor’s grading policy before
the class began (0 = not heard, 69%; 1 = heard, 31%), and
to estimate how difficult they thought the class would be (0
= easy or average, 81%; 1 = hard, 19%). A preliminary
analysis indicated that easy and average estimates were not
significantly different on the major variables of the study.
Student grade-related expectations were also surveyed.
Respondents were asked, “What grade do you think you will
receive in this class?” and “What grade do you think you
will deserve to receive in this class?” A new variable was
created utilizing these two measures. When the expected grade (Exp Grade; M = 3.27, SD = 0.53) was the same as the deserved grade (M = 3.34, SD = 0.54), it could be assumed that the students expected to be treated fairly in grading; when the two measures did not match, the students apparently believed that they would not receive the grade they deserved. An initial analysis showed that whether the deserved grade was higher or lower than the expected grade made no significant difference on subsequent variables; consequently, fairness was dichotomized as a dummy variable (0 = fair [deserved = expected grade], 84%; 1 = unfair [deserved ≠ expected grade], 16%).
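As an illustration of the coding just described, a short sketch follows; the data frame and column names are assumptions for the example, not the study's actual coding script.

```python
import pandas as pd

# Illustrative construction of the dummy variables described above.
df = pd.DataFrame({
    "gender":         ["male", "female", "female", "male"],
    "heard_grading":  [False, True, False, True],
    "difficulty":     ["easy", "average", "hard", "average"],
    "expected_grade": [3.0, 4.0, 3.0, 2.0],   # 0-4 GPA metric
    "deserved_grade": [3.0, 4.0, 2.0, 2.0],
})

df["female"] = (df["gender"] == "female").astype(int)  # 0 = male, 1 = female
df["heard"] = df["heard_grading"].astype(int)          # 0 = not heard, 1 = heard
df["hard"] = (df["difficulty"] == "hard").astype(int)  # easy/average collapse to 0
# 0 = fair (deserved == expected), 1 = unfair (deserved != expected)
df["unfair"] = (df["deserved_grade"] != df["expected_grade"]).astype(int)
print(df[["female", "heard", "hard", "unfair"]])
```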
Student evaluation of the teaching was measured by using
the five questions on the student evaluation (SET) instrument actually used by the university. These five measures
were summed and averaged (the instructor: “Created an atmosphere conducive of learning,” “Instructor explains material appropriately,” “Instructor shows interest in student
learning,” “Instructor sets high but reasonable standards,”
and “Rate your satisfaction with your learning in this class”).
A second unambiguous SET measure, “What grade would
you give your instructor?” was also asked in all testing periods. The measures were similar with correlations above
0.80. Consequently, the two measures of evaluation were
summed to create a total evaluation measure called Evaluation (Cronbach’s α was .71 initially, and .93 for Week 16).
This measure is similar to the dependent variables utilized
in most SET studies (Feldman, 1986). The evaluation scale
ranged from 0 to 4 as in the classical GPA metric.
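The construction of the composite can be sketched briefly: the items are summed and averaged, and internal consistency is checked with Cronbach's alpha; the same sum-and-average logic applies to the personality composite described below. The data here are simulated, and only the 0–4 metric and the item counts follow the text.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) array."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# Simulated responses: five SET items on the 0-4 GPA metric, plus the
# single "grade your instructor" item. The numbers are random; only
# the structure mirrors the measures described in the text.
rng = np.random.default_rng(0)
set_items = rng.integers(0, 5, size=(567, 5)).astype(float)
instructor_grade = rng.integers(0, 5, size=567).astype(float)

set_mean = set_items.mean(axis=1)               # summed and averaged items
evaluation = (set_mean + instructor_grade) / 2  # combined Evaluation measure
print(cronbach_alpha(set_items), evaluation[:5])
```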
Students also evaluated the personality of the instructor at
each testing period. The Big Five personality inventory was utilized. Many personality theorists have concluded that an
adequate taxonomy for personality attributes could be created by five factors (Digman, 1990). This has been referred
to as the Big Five, or as the Five Factor Model of personality.
The factors have been found to be stable over long periods
of a person’s life (Soldz & Vaillant, 1999), and are largely
genetic (Jang et al., 1998). They seem to be unrelated to culture and have been found in societies as diverse as those in
Germany and China (McCrae & Costa, 1997). Because the
standard personality inventories can be too long and detailed for a
brief administration, the factors were measured by utilizing
a simple semantic 7-point scaling device. The question read,
“From what you know now, rate this instructor on the following dimensions.” The five dimensions were disagreeable–agreeable, not conscientious–conscientious, emotionally unstable–emotionally stable, introverted–extroverted, and unimaginative–imaginative.
When the larger data set from which this study was drawn was compiled, a validity check compared this shortened
personality inventory with the standardized inventory. The
shorter instrument was found to have both concurrent and
predictive validity. The five factors were summed and averaged to produce a compensatory, global measure of the overall negative-positive perception of personality. Cronbach’s
alpha initially was .91 and .83 at the end of the term. This
variable was called “personality.” The measure is not personality in the traditional sense, in which the construct is typically defined as a cluster of independent traits. Nevertheless, a student could,
for example, believe an instructor was positive on one or
several factors, but not on all, and still perceive the instructor
as having a good or a bad personality globally and independent of the perception of any specific factor. This measure is
consistent with many prior studies that did not utilize a personality inventory when measuring personality, but instead
relied on some global measure (Erdle et al., 1985; Murray
et al., 1990).

RESULTS
There were no significant differences between students from
different majors on any dependent variable. Hence, the data
were combined for analysis. There has been an ongoing debate on whether data from individual students or from class
means should be utilized when studying the effects of student
evaluation of instruction (Marsh & Roche, 1997; Stumpf &
Freedman, 1979). Because in this study I looked at student
perceptions and not at teacher characteristics, within-class
student data were utilized rather than between-class means
(Clayson, 2007; Stumpf & Freedman, 1979).
Effects of Initial Conditions
Table 1 shows the differences between the initial and the final measures of personality by the initial independent variables, and Table 2 shows the same comparisons for the initial and the final evaluations of the instructor. The statistical probability represents the probability of the null hypothesis assuming no differences. The column labeled “Adj.” gives the probability of the same variables controlled for class effects. There are several techniques that would allow an estimate of within-class effects controlled for class effects. The method utilized in this study was a main-effects analysis of covariance utilizing Type III sums of squares, which allows for a test of each variable in the model with all other variables simultaneously included in the analysis. Because I was not interested in class differences, the problem of using this technique without the assumption of homogeneity of group regression betas was minimal (Tatsuoka, 1971). Except for an estimate of total class-effect variance, the result of this model is identical to a linear regression utilizing the same variables.

TABLE 1
Initial and Final Personality Measure by Initial Variables

                                      Initial personality       Final personality
                                      Raw          Adj.a        Raw          Adj.
Dichotomous variables (means)
Gender
  Male (n = 285)                      4.75 (0.05)               5.31 (0.06)
  Female (n = 282)                    4.88 (0.06)               5.44 (0.06)
  Statistical probability             .121         .005*        .125         .006*
Heard
  Not heard (n = 392)                 4.74 (0.05)               5.31 (0.05)
  Heard (n = 175)                     4.98 (0.07)               5.52 (0.05)
  Statistical probability             .006*        .425         .026*        .491
Rigor
  Easy-average (n = 455)              4.81 (0.04)               5.41 (0.05)
  Hard (n = 112)                      4.81 (0.09)               5.23 (0.05)
  Statistical probability             .952         .543         .087         .897
Fair grading
  Fair (n = 474)                      4.84 (0.04)               5.45 (0.05)
  Unfair (n = 93)                     4.68 (0.09)               5.02 (0.10)
  Statistical probability             .129         .750         .000*        .009*
Continuous variables (correlations)
GPA                                   .098         .063         .057         .018
  Statistical probability             .020*        .135         .173         .671
Initial expected grade                .150         .129         .015         .022
  Statistical probability             .000*        .002*        .720         .594
Initial evaluations                   .462         .432         .228         .143
  Statistical probability             .000*        .000*        .000*        .001*
Initial personality measure                                     .246         .198
  Statistical probability                                       .000*        .000*

Note. Values in parentheses represent standard errors. aProbability was adjusted for class effects.
*p < .05.

As shown in Table 1, and consistent with the literature review, the initial expected grade and the initial SET were strongly related to the initial measure of personality. In addition, both the initial measure of personality and the initial SET were significantly associated with the final measure of personality, as was the initial perception that the grading would be fair. Note that the ordinal effects of the variables on the initial evaluation are identical to the ordinal relationship of the same variables on the final evaluations. The same pattern is shown in Table 2, which examines the effects on the SET measures, with one important exception: the initial SET was significantly related to the final SET, but the initial impression of personality was not significantly related to the final SET when controlled for class effects.
TABLE 2
Initial and Final Evaluations by Initial Variables

                                      Initial evaluation        Final evaluation
                                      Raw          Adj.a        Raw          Adj.
Dichotomous variables (means)
Gender
  Male (n = 285)                      3.15 (0.04)               2.89 (0.05)
  Female (n = 282)                    3.22 (0.03)               2.95 (0.05)
  Statistical probability             .130         .016*        .449         .019*
Heard
  Not heard (n = 392)                 3.18 (0.03)               2.87 (0.04)
  Heard (n = 175)                     3.20 (0.05)               3.04 (0.04)
  Statistical probability             .756         .045*        .031         .395
Rigor
  Easy-average (n = 455)              3.22 (0.03)               2.97 (0.04)
  Hard (n = 112)                      3.05 (0.06)               2.73 (0.09)
  Statistical probability             .011*        .074         .008*        .470
Fair grading
  Fair (n = 474)                      3.22 (0.03)               2.98 (0.03)
  Unfair (n = 93)                     3.01 (0.07)               2.61 (0.09)
  Statistical probability             .003*        .038*        .000*        .011*
Continuous variables (correlations)
GPA                                   .130         .106         .047         .001
  Statistical probability             .002*        .012*        .260         .982
Initial expected grade                .282         .268         .017         .023
  Statistical probability             .000*        .000*        .685         .578
Initial personality measure           .462         .415         .182         .089
  Statistical probability             .000*        .000*        .000*        .340
Initial evaluations                                             .229         .142
  Statistical probability                                       .000*        .001*

Note. Values in parentheses represent standard errors. aProbability was adjusted for class effects.
*p < .05.

As can be seen in Tables 3 and 4, with all variables included, the initial perception of personality was significantly associated with the final perception of personality, and the initial SET evaluation was significantly related to the final SET evaluation, as were sex and perceptions of fairness. Note that the initial impression of personality was not related to the final SET, and that the initial SET was not related to the final measure of personality. The collinearity measures are all well within the acceptable limits for the model, with the smallest tolerance resulting from the measures of the initial evaluation (rii = .68).

TABLE 3
Personality Regression: All Variables Included

Variable                        B          t          p
Student characteristics
  Sex                           .173       2.24*      .026
  GPA                           –.019      –0.21      .834
Initial impressions
  Heard                         –.073      –0.84      .400
  Rigor                         .046       0.48       .630
  Fair grading                  –.282      –2.70*     .007
  Expected grade                –.064      –0.82      .412
  Personality                   .115       3.45*      .001
  Evaluation                    .115       1.40       .110
Class effects                              3.52       .000
Intercept                                  13.10      .000
Corrected model                            3.43       .000

Note. Adjusted R2 = .28.
*p < .05.

TABLE 4
Evaluation Regression: All Variables Included

Variable                        B          t          p
Student characteristics
  Sex                           .141       2.16*      .031
  GPA                           –.044      –0.59      .554
Initial impressions
  Heard                         –.062      –0.84      .402
  Rigor                         –.020      –0.24      .806
  Fair grading                  –.213      –2.41*     .016
  Exp grade                     –.390      –0.60      .551
  Personality                   .290       0.76       .447
  Evaluation                    .148       2.43*      .015
Class effects                              3.88       .000
Intercept                                  8.84       .000
Corrected model                            3.61       .000

Note. Adjusted R2 = .30.
*p < .05.
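For readers who wish to reproduce this style of analysis, a minimal sketch follows: a main-effects ANCOVA with Type III sums of squares, with class section entered as a categorical factor so that each remaining variable is tested with all others simultaneously in the model. This is not the author's code; the file and column names are hypothetical stand-ins for the study's variables.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hedged sketch of the Type III ANCOVA described in the Results.
# Sum-to-zero contrasts, C(section, Sum), make Type III tests meaningful.
paired = pd.read_csv("paired_responses.csv")

model = smf.ols(
    "evaluation_final ~ C(section, Sum) + female + gpa + heard + hard"
    " + unfair + expected_grade_initial + personality_initial"
    " + evaluation_initial",
    data=paired,
).fit()

print(anova_lm(model, typ=3))                      # per-variable Type III tests
print(model.params, model.tvalues, model.pvalues)  # cf. Tables 3 and 4
```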

CONCLUSIONS

The first hypothesis that the initial impressions of the instructor’s personality would be related to the final student
impressions of personality could not be rejected. The second
hypothesis that an initial SET before instruction begins would
not be related to the final SET at the end of the instructional
period was rejected.
The initial SET evaluation, before any instruction took
place, was significantly related to the final SET evaluation
given 16 weeks later. Note that the initial belief that the student held about the instructor’s fairness in assigning grades
also influenced the final SET. Controlling for the students’
past performance as measured by GPA seemed to have no
effect on these relationships. It would appear that the very
best and the very worst students (as measured by previous grades) are reacting to the instructor in the same fashion. The
same pattern is found with student perceptions of personality.
In the regression analysis, the initial impressions of personality were not related to the final SET evaluation, and the
initial SET was not related to the final personality measure.
This is most likely the result of the two variables standing as
proxies for each other, and reinforces previous findings that
SET procedures essentially construct a personality measure
similar to the proposed likeability scale.


Limitations
It is possible that some of the findings in this study are a
result of unique conditions found only in the institution from
which the data was taken. Data from other sources might show stronger or weaker effects than those found here. It is also possible that
some of the effects were related to the nature of the classes.
These courses are introductory and contain little technical
and quantitative material.
Nevertheless, other research would indicate that the persistence of the initial perceptions of the instructor in the final results is most likely not a function of unique sampling, but could be generalized to other populations (for reviews of grade effects, see Clayson, 2004; V. E. Johnson, 2003; Marsh & Roche, 2000; Stumpf & Freedman, 1979; for reviews of personality effects, see Erdle et al., 1985; Marks, 2000; Murray et al., 1990).
SET Implications
As mentioned in the introduction, the utilization of some sort
of student evaluation of teaching has become almost universal. The instruments are used to make important decisions
that can result in major changes in an instructor’s career.
They are also utilized to make improvements in teaching
and hopefully to make the students’ experience more productive. Given the importance placed on the process, it is
essential that the instruments are valid measures of instruction. An attempt to establish or discredit this validity has been the aim of almost all of the hundreds of reports and publications pertaining to SET. It appears from this long process
that SET does have convergent validity, but is lacking divergent and discriminant validity (for reviews of this issue,
see Clayson, 2009; V. E. Johnson, 2003). In other words,
the evaluations appear to be related to what the construct
of good instruction should be associated with, but they are
also related to many factors that would not be logically associated with the construct. This discrepancy impairs the ability of SET to discriminate between a good teacher and a poor one, the very use to which the instruments are most often applied.
Finding that the instruments are influenced by factors that
are unrelated to actual instruction weakens arguments that the
evaluations can continue to be utilized as they are presently.
This study adds to this chorus by showing that student attitudes and perceptions developed before any instruction had taken place still influence the evaluations, which supposedly measure only instruction, after four months of instructional interaction.
For individual instructors, this study adds two warnings.
First, initial impressions are important. They create perceptions that are long lasting and continue to influence the students’ evaluation of the instructor far longer than would logically be expected. Second, instructors must be careful in
utilizing student evaluations to improve teaching. A strong
primacy effect, matched with a newly reported propensity
of students to purposefully falsify evaluations (Clayson &
Haley, 2011), requires that an instructor be judicious in accepting suggestions found in SET reports for instructional
improvement.
In this study, none of the students had seen a course syllabus, nor had students been exposed to any class instruction
when they made their initial evaluations. Finding an association between the evaluation of the class at the end of the
term and evaluations made within the first 10 min of
exposure, as well as corresponding persistence in perceived
grade fairness, indicates that SET instruments are biased toward student perceptions unrelated to the instructor’s actual
teaching style and abilities.
The results also help clarify several hypotheses made in
the literature that attempted to make a validity argument for
SET without including variables related to actual instruction.
For example, Erdle et al. (1985) maintained that instructors’
personalities are reflected in certain classroom teaching behaviors, which in turn are validly rated by students. The
findings of this study do not contradict this argument, but they render it unlikely, because students would have to be extremely keen observers of individual differences that would predict
future classroom behavior. This acuity is unlikely given that
the initial perception of personality was not related to the final
teaching evaluation when controlled for the initial evaluation
of instruction.
Almost 40 years ago, Kohlan (1973) found a significant
relationship between an initial evaluation made early in the
class (after instruction had begun) and a final evaluation made
by students. He suggested three possible explanations for his
findings: (a) the SET process which uses these assessments
cannot be valid, (b) very little new information about instructor behavior is presented after the first few classes, or
(c) there may be a primacy effect due to stereotyping. This
last explanation was not confirmed by this study. With class
effects controlled, which also controls for individual instructor effects, the initial evaluation was still related to the final
evaluation. Because the evaluation was made with less than 10 min of exposure to the instructor, Kohlan’s second explanation, that little information about instructor behavior is presented after the first few classes, cannot account for the findings of this study; no such information was yet available.
SET evaluation process is invalid was not contradicted by
this study.


The findings reported here reinforce Marks’s (2000) and
Greenwald and Gillmore’s (1997) contention that student
evaluations of teaching lack discriminant validity. Even after
16 weeks of personal face-to-face instruction, the students’
limited initial impressions of personality and of teaching can
still be found in the final evaluation. Irrespective of any definition of good teaching that includes actual instruction, this study indicates that the evaluations are biased.
Especially in the last decade, there have been numerous
research findings that have raised troubling questions about
the evaluation process. Rachel Johnson (2000) argued that
while the student evaluation of teaching has bureaucratic advantages, the system as used is detrimental to actual teaching, in both practice and theory. Theall and Franklin (2001)
stated, “Student ratings are only one source of information
about teaching, and teaching is only one aspect of faculty
performance. Never make the mistake of judging teaching or
overall performance on the basis of ratings alone” (p. 51).
The findings of this study reinforce their warning.
Research Implications
Some research has found that unmet student performance
expectations on exams may result in student dissatisfaction
(Grimes, 2002). It was found here that the students’ expectation of a fair grade, even before the class began, was significantly related to the final course evaluation. This could
suggest that the mere expectation of lower grades may influence the evaluations, especially if students are not accurately
estimating their future performance. Although the expected grade’s influence on the evaluations at the end of the term has been extensively studied, it is still unknown how those expectations are formed and how that process influences SET.
The unrealistic grade expectations of students at the beginning of the term raise another issue. Previous research has
indicated that students have difficulty estimating their own
academic performance (Clayson, 2005; Kennedy, Lawton, &
Plumlee, 2002; Williams & Ceci, 1997). Consistent with this
literature, the present data find that the initial expected grade,
while being significantly related to the initial SET, was not
associated with the final evaluation, and more surprisingly,
unrelated to the final class grade (r = .036, p = .393).
The students began their classes expecting a grade
significantly higher than the one actually received (3.27 vs.
2.77; t(566) = 21.59, p < .001), and even higher than their
own cumulative GPA. This was true even though these
classes reported grades well below the university average,
a fact regularly and publicly announced by the business
college. Furthermore, I found that prior grading information
about the class did not modify the exaggerated expectations.
These students were not inexperienced. Almost all were
juniors and seniors.
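The reported comparison is a paired t test of expected against received grades. A toy sketch follows; the data are simulated and only roughly calibrated to the means reported above (3.27 expected vs. 2.77 received), so it will not reproduce the study's t(566) = 21.59 exactly.

```python
import numpy as np
from scipy import stats

# Simulated expected and received grades on the 0-4 GPA metric,
# paired by student, for illustration only.
rng = np.random.default_rng(1)
expected = np.clip(rng.normal(3.27, 0.53, size=567), 0.0, 4.0)
received = np.clip(expected - rng.normal(0.50, 0.55, size=567), 0.0, 4.0)

t_stat, p_value = stats.ttest_rel(expected, received)  # paired t test
print(t_stat, p_value)
```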
There are two possible explanations for these paradoxical findings: (a) students may think their expectations will
be met because they believe that their performance will lift
their grades, or that the instructor will give a grade more lenient than earned; or (b) the students may be dishonest
or cavalier in their answers. Fortunately, there is a way to
test between these explanations. As indicated previously, research has found a persistent association between expected
grades and the evaluation (for detailed reviews, see Clayson,
Frost, & Sheffet, 2006; Greenwald & Gillmore, 1997; V. E.
Johnson, 2003). By inspecting the data from the last week of
the term, it would be expected that if (a) is true, then there
should be an association between the final expected grade and
the final evaluation, but not between the final grade (not yet
received) and the evaluation. If (b) is true, there should be an
association between the final course grade, which would be
highly related to their given grades by week 16, and the evaluation, but not between the final expected grade (capriciously
reported) and the evaluation. All three measures were correlated primarily because of GPA, so a regression was run with
the final evaluation as the dependent variable and expected
final grade, actual final grade, and GPA as independent variables. The result was highly significant, F(3, 566) = 23.50,
p < .0001, but the only significant variable loading was for
the expected grade (β = .384), t = 7.82, p < .0001. The final
grade was nonsignificant (β = –.073), t = –1.29, p = .197.
The students appeared to be giving an honest response from
their perspective, but how did they predict their grade if not
from performance?
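A sketch of the regression test just described follows, with the variables standardized so the coefficients are beta weights comparable to the reported β = .384 (expected grade) and β = –.073 (final grade). The data frame and column names are hypothetical assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Regress the final evaluation on the final expected grade, the actual
# final grade, and GPA, and ask which predictor carries the association.
paired = pd.read_csv("paired_responses.csv")
cols = ["evaluation_final", "expected_grade_final", "final_grade", "gpa"]
z = paired[cols].apply(lambda s: (s - s.mean()) / s.std(ddof=1))  # z-scores

fit = smf.ols(
    "evaluation_final ~ expected_grade_final + final_grade + gpa",
    data=z,
).fit()
print(fit.fvalue, fit.f_pvalue)  # overall test, cf. F(3, 566) = 23.50
print(fit.params, fit.pvalues)   # per-predictor beta weights and p values
```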
Because expected grades are related to the evaluations,
and faculty believe that this association lowers academic
standards (Simpson & Siguaw, 2000), this is a question that
needs to be addressed.

REFERENCES
Adams, J. V. (1997). Student evaluations: The rating game. Inquiry, 1(2),
10–16.
Ambady, N., & Rosenthal, R. (1993). Half a minute: Predicting teacher evaluations from thin slices of nonverbal behavior and physical attractiveness.
Journal of Personality and Social Psychology, 64, 431–441.
Asch, S. E. (1946). Forming impressions of personality. Journal of Abnormal
and Social Psychology, 41, 258–290.
Attiyeh, R., & Lumsden, K. G. (1972). Some modern myths in teaching economics: The UK experience. American Economic Review, 62, 429–433.
Babad, E., Avni-Babad, D., & Rosenthal, R. (2004). Prediction of students’
evaluations from brief instances of professors’ nonverbal behavior in defined instructional situations. Social Psychology of Education, 7, 3–33.
Baird, J. S. (1987). Perceived learning in relation to student evaluation of
university instruction. Journal of Educational Psychology, 79, 90–91.
Boice, R. (1992). Countering common misbeliefs about the student evaluation of teaching. ADE Bulletin, 101(Spring), 1–4.
Braskamp, L. A., & Ory, J. C. (1994). Assessing faculty work: Enhancing
individual and institutional performances. San Francisco, CA: Jossey-Bass.
Carrell, S. E., & West, J. E. (2010). Does professor quality matter? Evidence
from random assignment of students to professors. Journal of Political
Economy, 118, 409–432.
Centra, J. A. (1993). Reflective faculty evaluations: Enhancing teaching and
determining faculty effectiveness. San Francisco, CA: Jossey-Bass.
Clayson, D. E. (2004). A test of reciprocity effects in the student evaluation
of instructors in marketing classes. Marketing Education Review, 14(2),
11–21.
Clayson, D. E. (2005). Performance overconfidence: Metacognitive effects
or misplaced student expectations? Journal of Marketing Education, 27,
122–129.
Clayson, D. E. (2007). Conceptual and statistical problems of using betweenclass data in educational research. Journal of Marketing Education, 29(1),
1–5.
Clayson, D. E. (2009). Student evaluations of teaching: Are they related to
what students learn? A meta-analysis and review of the literature. Journal
of Marketing Education, 31(1), 16–30.
Clayson, D. E., Frost, T. F., & Sheffet, M. J. (2006). Grades and the student
evaluation of instruction: A test of the reciprocity effect. Academy of
Management Learning & Education, 5(1), 52–65.
Clayson, D. E., & Haley, D. A. (1990). Student evaluations in marketing:
What is actually being measured? Journal of Marketing Education, 12(3),
9–17.
Clayson, D. E., & Haley, D. A. (2011). Are students telling us the truth? A
critical look at the student evaluation of teaching. Marketing Education
Review, 21, 101–112.
Clayson, D. E., & Sheffet, M. J. (2006). Personality and the student evaluation of teaching. Journal of Marketing Education, 28, 149–160.
Cohen, P. A. (1981). Student ratings of instruction and student achievement:
A meta-analysis of multi-section validity studies. Review of Educational
Research, 51, 281–309.
Davies, M. F. (2003). Confirmatory bias in the evaluation of personality
descriptions: Possible test strategies and output interference. Journal of
Personality and Social Psychology, 85, 736–744.
Digman, J. M. (1990). Personality structure: Emergence of the five-factor model. Annual Review of Psychology, 41, 417–440.
Dowell, D. A., & Neal, J. A. (1982). A selective review of the validity of
student ratings of teaching. Journal of Higher Education, 53, 51–62.
Erdle, S., Murray, H. G., & Rushton, J. P. (1985). Personality, classroom
behavior and student ratings of college teaching effectiveness: A path
analysis. Journal of Educational Psychology, 77, 394–407.
Feldman, K. A. (1977). Consistency and variability among college students
in rating their teachers and courses: A review and analysis. Research in
Higher Education, 6, 223–274.
Feldman, K. A. (1986). The perceived instructional effectiveness of college teachers as related to their personality and attitudinal characteristics: A review and synthesis. Research in Higher Education, 24,
139–213.
Foote, D. A., Harmon, S. K., & Mayo, D. T. (2003). The impacts of instructional style and gender role attitude on students’ evaluation of faculty.
Marketing Education Review, 13(2), 9–19.
Greenwald, A. G., & Gillmore, G. M. (1997). Grading leniency is a removable contaminant of student ratings. American Psychologist, 52,
1209–1217.
Grimes, P. W. (2002). The overconfident principle of economics students:
An examination of a metacognitive skill. Journal of Economic Education,
33, 15–30.
Haugtvedt, C. P., & Wegener, D. T. (1994). Message order effect in persuasion: An attitude strength perspective. Journal of Consumer Research,
21, 205–218.
Jang, K. L., McCrae, R. R., Angleitner, A., Riemann, R., & Livesley, W. J.
(1998). Heritability of facet-level traits in a cross-cultural twin sample:
Support for a hierarchical model of personality. Journal of Personality
and Social Psychology, 74, 1556–1565.
Johnson, R. (2000). The authority of the student evaluation questionnaire.
Teaching in Higher Education, 5, 419–434.
Johnson, V. E. (2003). Grade inflation: A crisis in college education. New
York, NY: Springer.
Kennedy, E. J., Lawton, L., & Plumlee, E. L. (2002). Blissful ignorance:
The problem of unrecognized incompetence and academic performance.
Journal of Marketing Education, 24, 243–252.
Kohlan, R. G. (1973). A comparison of faculty evaluations early and late in
the course. Journal of Higher Education, 44, 587–595.

Krosnick, J. A., & Alwin, D. F. (1987). An evaluation of a cognitive theory
of response-order effects in survey measurement. The Public Opinion
Quarterly, 51, 201–219.
Kulik, J. A. (2001). Student ratings: Validity, utility, and controversy. New Directions for Institutional Research, 109, 9–25.
