Exploring Effects of Criteria and Multiple Graders on Case Grading

C. GOPINATH
Suffolk University, Boston, Massachusetts

Journal of Education for Business, 79(6), 317–322 (2004). DOI: 10.3200/JOEB.79.6.317-322

ABSTRACT. Written analyses of cases help the student develop the skills of logical analysis and written communication. However, students often question the reliability of the grades that they receive. In this study, the author used six criteria to evaluate case analyses, applied by two graders who were team teaching a course. Results show that even with predetermined criteria, the graders had areas of disagreement as a result of varied interpretation. Yet, student grades suggest that students benefit from a process that involves multiple cases and multiple graders. The author discusses the implications of these findings.

Case discussion is a popular pedagogical technique that is used in business courses. Cases provide rich descriptions of settings in which business decisions are required, and they provide students with the opportunity to apply analytical skills within a real context and to arrive at decisions and other recommendations.

Instructors use cases in several ways. Apart from engaging the whole class in a case discussion, the instructor also may ask students to prepare and present their analyses of the case or to submit a written report on it. Written analyses of cases are meant to help students develop written communication skills and to hone their ability to develop logical arguments. The written reports are graded and often go toward the overall evaluation of the student in the course.

The literature on essay grading in the field of higher education shows that it is an activity subject to several biases and errors. Although grading cases—an activity similar to essay grading—is undertaken extensively in business programs, the issues of bias, reliability, and consistency have not been examined in the business education literature.

In this article, I describe a study in which I examined the extent of agreement on case grades between dual graders when grading criteria are specified. My results show that even when criteria are specified, areas of misinterpretation that account for significant differences continue to arise among graders. However, students appear to benefit from a process that involves multiple cases and multiple graders.

Literature Review

Written case analysis is used extensively in business programs. In a survey of 177 faculty members who taught business policy/strategic management courses, Alexander, O’Neill, Snyder, and Townsend (1986) found that 84.2% of the respondents required individual written case analysis in this course. After class participation, these written analyses were the second most important factor in the determination of student grades. I believe that similarly high usage exists in other courses. However, my search revealed almost no studies on the grading of case analyses in business schools. Thus, I shall draw from studies conducted in other fields such as law, which makes use of cases, and the humanities, in which studies on essay marking (a close analogy to case grading) have been conducted.

Essays, as with case analyses, have no absolutely right or wrong answers but must show student comprehension of an advanced level of analysis. “An essay writer has to identify the problems beneath the question posed, he or she has to create a structure, display insight and provide a coherent argument” (Brown, 1997, p. 59). When a grader reads an essay, he or she has to determine whether it satisfies the requirement of a good discussion and reveals aspects of student learning. These include concerns of both the content and process of the discussion.

Grading a case analysis also requires consideration of content and process. Content issues include an evaluation of the grasp of the issues in the case, knowledge of the facts and their implications, and whether the student has understood the main question being raised. Process issues include concerns of presentation, whether arguments are logical and analytical, and quality of language expression.


The biases that work on the grader have attracted wide concern among scholars. These biases could arise out of the gender of the essay writer (Wright, 1996), personal knowledge of the student (Dennis, Newstead, & Wright, 1996), and the sequence of grading, in which a few consecutively good essays could bias the instructor to grade a weak essay particularly harshly (Spear, 1997). In addition to these biases, differences in marks may reflect the graders’ different philosophies of learning (Blanke, 1999). Bilimoria (1995) used the lens of modernism and postmodernism to illustrate approaches to grading. A modernist views evaluation as results oriented and meeting certain standards. Grades differentiate between accomplishments and indicate the extent to which students meet criteria. On the other hand, a postmodernist views evaluation as continuous, focused on making improvements, serving as feedback, and reflecting how learning opportunities have been used. Thus, a difference in ideology may influence the tendency of one grader to mark high and another to mark low.

In this study, I focus on the subset of the literature that deals with the use of criteria for grading a case or essay and the effects of disagreement, if any, in the interpretation of these criteria. Often, double marking, or the use of multiple graders to assess an essay, is an attempt to reduce the subjectivity in essay marking (Erskine, Leenders, & Maufette-Leenders, 1981; Partington, 1994). Thus, establishing criteria should allow for less subjectivity because both the student and the instructor can be guided by the same set of principles.

The belief that grading based on criteria can be standardized has led to its automation, which has added attraction for educators who deal with a large volume of tests. For them, automation and standardization bring speed, consistency, and a perceived measure of objectivity to the process. In several studies of grading essays and writing samples conducted as part of an extended project, Page (1994) found it possible to achieve a high level of correlation between grades assigned by a computer and those assigned by multiple human judges. The criteria were broken down into measurable variables for content traits and essay content. The Graduate Management Admission Council, which administers the GMAT test required by many business schools of their graduate program applicants, introduced computerized grading of essay answers in 1999 (Honan, 1999). The test involves two essay questions, which used to be graded by humans. Now, with about 400,000 test takers every year, both a human and a computerized essay scoring system grade GMAT essays. If the electronic grade differs from the human grade by more than one point, a third (human) expert assigns a final grade. The electronic system looks for the organization of ideas and syntactical structure. These include finding a subordinate clause, looking at where a discussion starts and ends, and examining vocabulary.
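The adjudication rule is concrete enough to state in code. The sketch below is a minimal Python illustration of the rule as just described, not GMAC's actual system; the request_expert_score callback and the policy of averaging two scores that agree within a point are assumptions made for the example.

def final_essay_score(machine: float, human: float,
                      request_expert_score=None) -> float:
    """Apply the two-grader rule: escalate to a third (human) expert
    when the machine and human scores differ by more than one point."""
    if abs(machine - human) <= 1:
        # Close enough: combine the two scores (averaging is an assumed policy).
        return (machine + human) / 2
    # Gap exceeds one point: the third expert's score is final.
    return request_expert_score()

print(final_essay_score(4.0, 4.5))                                    # -> 4.25
print(final_essay_score(4.0, 6.0, request_expert_score=lambda: 5.0))  # -> 5.0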

Some scholars have found high levels of student and instructor agreement on assessment of an essay exam and inferred that scoring standards can be communicated readily (Nealey, 1969). However, this issue has not been studied in regard to case analysis. Case analysis, apart from calling for good writing and logical development of an argument, also requires the specific application of theories or concepts to the case situation.

This discussion poses some questions of interest to us in our study. If the use of criteria does in fact reduce subjectivity in grading, there should not be significant differences in grades assigned by multiple graders working with specified criteria. Any bias in grading will find expression in differences in the marks assigned by the two graders. I explore this aspect in my first research question:

RQ1: What is the extent of agreement between two graders of a written case analysis when the criteria for grading have been specified?

When multiple instructors are grading cases, students receive a grade that represents the expertise of the many graders. This is true whether the multiple graders agree on the grade or not. When the graders disagree, they go through a process of reconciliation to give the student a single grade. In their study of English language essays, Wood and Quinn (1976) used correlation to show that having multiple graders improves reliability by reducing variation. Thus, they concluded that through a system of multiple grading, the effects of erratic marking are reduced and the students’ grades are less affected by who marked their papers. However, in a criteria-based evaluation scheme with subsequent reconciliation, the final grade could vary from the initial grade. The final grade may be higher or lower than the one that the student would have received if there had been only one grader. Thus, I formulated my second research question:

RQ2: Does double marking result in a different (higher or lower) grade for the student compared with a single marking when the criteria for grading have been specified?

The purpose of a grade and written comments is to evaluate and provide feedback to students. The learning process requires that students work to improve those areas in which they did not meet expectations. They also understand the criteria better through repeated attempts. When students submit multiple written reports, they have the opportunity to improve by working on their weak areas and demonstrating their understanding. Thus, I formulated a third research question:

RQ3: When grading criteria are held constant, are student grades higher on a second case compared with the first?

Method

The absence of previous research addressing related research questions led us to adopt an exploratory approach. Another instructor and I, both team teachers of an introductory general business course that all MBA students at our university are required to take in the first semester of their program, conducted the study. Both of us were present in the classroom throughout the semester and participated in class activities. The course was designed to introduce the students to (a) a set of skills that they would need in the program and in a management career (such as written analysis, presentation, and discussion skills) and (b) a set of perspectives, such as viewing the company as a whole, appreciating a globalized environment, and the impact of technology on business.

The students were required to submit written case analyses (WCA) individually on any two of the six cases that were discussed in the course during the semester. They were encouraged, but not required, to submit one case early in the semester and to do the second one after considering the feedback on the first. Each WCA carried a weight of 15% of the total grade for the course. The WCA had a 350-word limit. The format required them to (a) specify an issue or a problem, (b) analyze the situation by using a concept, theory, or model that had been discussed in any previous class session in the course, and (c) bring the discussion to a conclusion.

In the first semester that we taught this course, we arrived at four criteria for grading the cases and provided them to the students. As the semester progressed, we found several instances in which we disagreed on the interpretation or application of the criteria to the WCA under consideration. We discussed the criteria again before the start of the second semester and agreed to expand the list to the following six items to reduce the misinterpretation:

1. Question or issue was specified in the beginning.

2. Question or issue is relevant.

3. Question was answered/issue brought to conclusion.

4. There was good depth of analysis.

5. Appropriate theory/concepts was/were used.

6. Writing adhered to format (writing style, error-free writing, word limit, etc.).

Scores ranging from 1 (poor) to 5 (excellent) were given for each item. Our criteria, which we compiled based on our experience, parallel what is expected in other business courses. For example, conducting an analysis, applying business policy theory and concepts, and writing ability are among the top seven criteria used to grade cases in the business policy course (Alexander et al., 1986).

We collected data over the course of a semester and used the following procedure: When a student submitted a WCA, one instructor read it first and evaluated it by using the grading sheet. To eliminate bias, the instructor made no marks on the script (Murphy, 1979), which was then passed on to the second instructor, who also read the case and evaluated it separately. We then met and reconciled our evaluations of each WCA. The process of reconciliation came into effect when there was a difference between the individual grades that we each gave on a particular criterion. Each of us would then provide the reasons for our grades, and we would reread the WCA. Each item on the grading sheet on which there was a disagreement would be discussed and reconciled. There were three possible outcomes of this process. The final grade would be either (a) the grade given by one of the instructors, which would indicate that one of us had been able to convince the other; (b) the average of the two grades, which would reflect a compromise; or (c) a common grade different (higher or lower) from the original grade (if, in the process of reading and discussing, we decided that a different grade was justified).
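These three outcomes map onto the grade categories tabulated later in Table 2. As a minimal illustrative sketch (hypothetical code, not part of the study's procedure), a reconciled grade can be classified against the two initial grades as follows:

def classify_final(grade_a: float, grade_b: float, final: float) -> str:
    """Place a reconciled grade into one of the six categories of Table 2."""
    lo, hi = min(grade_a, grade_b), max(grade_a, grade_b)
    if grade_a == grade_b == final:
        return "full agreement between graders"
    if final < lo:
        return "lower than both grades"
    if final == lo:
        return "lower of two grades"
    if final == (grade_a + grade_b) / 2:
        return "mean of two grades"
    if final == hi:
        return "higher of two grades"
    if final > hi:
        return "higher than both grades"
    return "between the two grades"  # unequal grades, non-mean compromise

print(classify_final(4, 4, 4))  # -> full agreement between graders
print(classify_final(3, 5, 4))  # -> mean of two grades
print(classify_final(4, 5, 5))  # -> higher of two grades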

Then we set aside the individual scoring sheets with the independent grades and comments and entered the reconciled grade along with the grader comments on a third grading sheet, which was given to the student. The students were aware that both of us were involved in grading each WCA but were not told of the detailed grading process or of the study in progress. We collected data from 53 students in two sections of the course. Because two case reports were missing, we had a total sample of 104.

Results

Quantitative Analysis

We examined RQ1 by looking at both the extent of initial agreement and subsequent reconciliation between the two graders. There was a strong positive correlation (r = .46, p < .01) between our scores across all the criteria (see Table 1). There was full agreement between us in 71% of the cases (see Table 2). These results compare favorably with those of Page (1994), who noted that one U.S. state educational system required that interjudge agreement be at least 70% in a 4-point rating.
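Readers wishing to reproduce these two agreement measures can compute them from paired score lists. The sketch below uses invented scores for illustration, not the study's data.

# Percent full agreement and Pearson correlation between two graders.
from statistics import correlation, mean  # correlation requires Python 3.10+

grader_a = [5, 4, 4, 5, 3, 5, 4, 2]  # invented scores for illustration
grader_b = [5, 4, 3, 5, 3, 4, 4, 3]

full_agreement = mean(a == b for a, b in zip(grader_a, grader_b))
r = correlation(grader_a, grader_b)

print(f"full agreement: {full_agreement:.0%}")
print(f"Pearson r: {r:.2f}")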

Disagreement between us was a more complicated issue. Reconciliation came about in all cases of initial disagreement.

TABLE 1. Comparison of Instructors’ Grades

Number and criterion                                     Instructor A   Instructor B   Correlation        t

1. Question or issue was specified in the beginning          4.83           4.76          .45**        1.19
2. Question or issue is relevant                             4.76           4.88          .16*        –1.83*
3. Question was answered/issue brought to conclusion         4.61           4.71          .28**       –1.42
4. There was good depth of analysis                          4.28           4.14          .48**        1.80*
5. Appropriate theory/concepts was/were used                 4.21           3.97          .56**        2.87**
6. Writing adhered to format                                 4.76           4.75          .83**        0.28
Average                                                      4.58           4.54          .46**

Note. Columns show the average grade given by each instructor. *p < .10. **p < .01. N = 104.



In the process of reconciliation, a majority of the cases (27% out of the 29% that required reconciliation) were resolved with one grader convincing the other (columns 4 and 6, Table 2). In only 3% of the cases (column 5) did we resort to taking the mean. This percentage confirms the extensive discussions and review that accompanied reconciliation, without resorting to a quick compromise through settling for the mean. Moreover, very few students received grades that fell outside our initial range (columns 3 and 7, Table 2). This suggests that the initial two grades represented the possible range that the student could have received.

Looking at differences across specific criteria, we see that criteria 2, 4, and 5 (see Table 1) accounted for significant differences between the two graders. Although the correlation on criterion 2 was low, there was a high level of agreement between us. Criteria 4 and 5 present a different picture. These two criteria had about equal numbers of students whose final grades were equal to that of one or the other grader. On all the other criteria, by contrast, the final grade was more heavily weighted toward one or the other grader. Only about 12% (4 out of 43 and 6 out of 39) of the students received a mean grade. This suggests that we were adhering to our initial grades more strongly on these two criteria than on the others. An examination of the criteria themselves suggests that these two were subject to greater interpretation than the others.

To examine RQ2, we compared the final grade received with the higher and the lower of the two instructors’ individual grades and found no significant variation. The data show that in 13.8% of the cases, students received a grade higher than they would have under single marking (columns 6 and 7, Table 2). On the negative side, in 12.8% of the cases, students received a grade lower than under single marking (columns 3 and 4).
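Both percentages follow directly from the column averages reported in Table 2, as a quick arithmetic check confirms:

# Verifying the 13.8% and 12.8% figures from Table 2's column averages.
N = 104                # WCAs graded per criterion
higher = 14.2 + 0.2    # columns 6 + 7: higher of two, higher than both
lower = 12.8 + 0.5     # columns 3 + 4: lower of two, lower than both

print(f"higher than single marking: {higher / N:.1%}")  # -> 13.8%
print(f"lower than single marking: {lower / N:.1%}")    # -> 12.8%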

To address RQ3, we compared the grades (Table 3) received by the students on their second case with those received on the first. We found a significant difference in the case of criterion 5. On the others, there were either no differences or a marginal improvement. Because criterion 5 was also one of the two criteria on which there was the highest initial disagreement between us, the improvement could suggest either that the feedback helped the students understand the criterion better, or that it helped improve their application of theory.

To check whether there was a grader “learning” bias—that is, whether the graders were converging in their views over the semester—we compared the disagreements on a case-by-case basis (see Table 4). Of the six cases, there was a drop in the number of disagreements only in the second case. In all the other cases, the disagreements remained at around an average of 39, which suggests that there was little convergence effect.

Qualitative Analysis

The notes that we kept during the grading reconciliation process helped us identify the areas that resulted in disagreement:

1. Interpretation of the criteria. The criterion labeled “Question or issue is relevant” was interpreted by one grader broadly to mean that the writing focused on one or more of the issues in the case. The other grader was looking to see whether the student picked the more important among the issues. Another cause for disagreement in initial grading was confusion in classification. For instance, if the analysis did not deal with the question or issue that had been specified, were points to be taken off under the criterion regarding depth of analysis or the one regarding appropriate conclusion?

2. Grading philosophy. Although we agreed on the nature of the deficiency, we sometimes disagreed on the severity and therefore the penalty. Sometimes one of us took off more than the other did. This dealt directly with the concern leading to RQ2 and the grading philosophy of the instructor. One of us would argue that graduate students should know better, whereas the other would argue that the students need more encouragement at this stage, their first semester in the program.

TABLE 2. Grade Agreement and Reconciliation of Disagreement

Number and criterion (1)                                  (2)    (3)    (4)    (5)    (6)    (7)

1. Question or issue was specified in the beginning        84     —      6     —      14     —
2. Question or issue is relevant                           80     1      7     2      14     —
3. Question was answered/issue brought to conclusion       72     1     10     4      17     —
4. There was good depth of analysis                        55     1     24     4      19     1
5. Appropriate theory/concepts was/were used               59     —     20     6      19     —
6. Writing adhered to format                               91     —     10     1       2     —
Average                                                  73.5    .5   12.8   2.8    14.2    .2
                                                        (71%)

Column key: (2) full agreement between graders; (3) final grade lower than both grades; (4) final grade equal to the lower of the two grades; (5) final grade equal to the mean of the two grades; (6) final grade equal to the higher of the two grades; (7) final grade higher than both grades.

Notes. Figures represent the number of written case analyses (WCAs) that satisfied the specified condition. Columns 2 through 7 represent the full range of categories into which a WCA could fall; thus the numbers in each row add up to N = 104.




3. Relative grading. To assist the process of reconciliation, we often would compare our grading process for the case under discussion with how other students had been graded. We would go back and check whether we had penalized or credited another student on a similar issue and note the extent of that penalty. Thus, although not initially stipulated, consistency across a particular case became an objective.

4. Errors of omission. In some cases, we reached consensus easily because one of us had overlooked a deficiency initially and was convinced quickly when the other drew attention to it.

Discussion

I undertook this study to explore the effect of using multiple graders and their interpretation of criteria in evaluating written business case analyses. The literature on essay grading suggests that having clear criteria for grading helps to narrow the differences and results in a high level of agreement among multiple graders. My results show that the overall level of agreement found in this study is consistent with that found in the literature. However, a closer look at the criteria on which disagreement is greatest suggests cause for concern.

As my results show, the wording of the criteria may allow for multiple interpretations. Although my co-instructor and I were clear about our criteria at the time that we designed them, they were still open to diverse interpretation in implementation. Thus, I recommend that instructors and researchers be as precise as possible in laying out their expectations. For instance, the criterion labeled “The question or issue is relevant” could specify further whether the term “relevant” means “relevant to the decision makers in the case” or “relevant to the topics of the session.”

However, the following disturbing question arises: How can we expect students to understand the criteria when even instructors interpret them differently? Fortunately, we were not dealing with an examination situation, in which the possibility of repeat submission or appeal does not exist. Thus, my results suggest that although having criteria is better than not having them, there is plenty of room for misinterpretation. Instructors need to take care to spell out, in as much detail as possible, what they mean by their criteria and perhaps to spend time in class discussing them with the students before finalizing them. In addition, students may be encouraged to discuss the evaluations that they receive with the instructor if they are not clear about the message.

Written case analysis is used widely in business programs because instructors believe that it improves both written communication and analytical skills. Thus, the process of writing cases, grading them, and providing feedback is an important activity for the student and the instructor. It serves both to evaluate the student’s abilities and to assist the learning process by providing feedback. We found support for this process. Our grading form, apart from giving a numerical score representing our decision, also provided an explanation through written comments. When we felt that a student’s analysis was weak or involved poor application of theory, we gave examples of how he or she could have dealt with the case.

TABLE 3. Comparison of Student Grades Across the Two Cases

Number and criterion                                     Case 1   Case 2       t

1. Question or issue was specified in the beginning        4.8      4.8      .07
2. Question or issue is relevant                           4.8      4.9     –.99
3. Question was answered/issue brought to conclusion       4.7      4.7      .10
4. There was good depth of analysis                        4.2      4.2     –.06
5. Appropriate theory/concepts was/were used               3.9      4.3    –2.1*
6. Writing adhered to format                               4.7      4.8     –1.1

Note. Columns show the average grade on each case. *p < .05.

TABLE 4. Case-Wise Comparison of Agreement and Disagreement

                                        Criterion
Case      Agreement/disagreement    1     2     3     4     5     6

Lincoln   Agreement                67   100    67    67   100    67
          Disagreement             33     0    33    33     0    33
GE        Agreement                90    60    40    40    60    85
          Disagreement             10    40    60    60    40    15
Sony      Agreement                80    70    70    50    60    80
          Disagreement             20    30    30    50    40    20
SW        Agreement                83    78    78    61    52    91
          Disagreement             17    22    22    49    48     9
VV        Agreement                88    80    72    36    44    92
          Disagreement             12    20    28    64    56     8
KMP       Agreement                65    87    83    74    65    87
          Disagreement             35    13    17    26    35    13

Note. Agreement means that the final grade was the same as the individual grades given by the two instructors. Disagreement means that the final grade was different from at least one of the instructors’ grades. The numbers are percentages of the case write-ups that satisfied each criterion.



Thus, the student could develop a clearer understanding of the criteria or of the numerical score that he or she had misinterpreted.

We assumed that students read and make use of written feedback. In addition, we provided general comments on the written reports in a subsequent class. My analysis does not allow me to pinpoint which particular feedback was useful. The role of feedback, in general, should be explored further. Researchers need to examine how much feedback is optimal and whether feedback on content is more effective than feedback on process.

We found that having multiple graders benefited the students in multiple ways. Although the students did not get better grades than they would have with a single grader, we were able to check each other’s omissions, and the reconciliation process allowed for a measure of relative consistency. In addition, resolving our differences through discussion before we provided comments to the students resulted in more considered feedback. Although one can argue that reconciliation helps to reduce bias in grades, the different perspectives from multiple graders can also aid students’ learning process; this is an area in which further study is needed.

We realize that few schools have the resources to support multiple graders on a regular basis. Moreover, many instructors are not comfortable with having another instructor in the classroom or another person with whom they must share grading responsibility. One way around this is to involve students in the grading process, although one must take into account the difference in the quality of feedback that can be expected from a faculty member as opposed to a student. Peer evaluation and assessment can be important sources of feedback to students (Gopinath, 1999). Moreover, asking students to evaluate cases helps those providing the feedback develop analysis skills (Schroeder & Fitzgerald, 1984). Thus, instead of having a second instructor as a grader, the instructor can draw on students (individuals or groups) to serve as a second grader. An instructor who reconciles his or her views with the student peers’ comments before finalizing the grade would still capture some of the benefits that we observed in our study.

When dealing with large classes, instructors often use teaching assistants to grade cases according to a process involving criteria, followed by random examination of graded papers for consistency. However, this method may not allow multiple graders’ opinions to come into play, because the teaching assistant has to try to replicate the standard and expectations set by the instructor and may not have the expertise or experience to provide an alternative viewpoint. Whether the marginal benefits from multiple graders (in terms of learning value) exceed the marginal cost is another question that warrants further study.

Overall, it is important for instructors to consider carefully the emphasis placed on individual written analysis reports so that they can lessen the impact of varied interpretations of the analysis. In addition, instructors should allow students to challenge or question the comments received and the grade given. Instructors often have a mindset against changing a grade. We would suggest that the process of learning requires an instructor to be able to justify the decision that he or she has made about the quality of a student’s work. Thus, the instructor must be willing to defend his or her evaluation and be willing to change the grade, if necessary. In cases in which the final exam (with limited opportunity to discuss and revise the grade) is in the form of a written case analysis, the instructor would be well advised to give the student the benefit of the doubt.

ACKNOWLEDGMENT

I wish to thank Patricia Carlson for her assistance in data collection.

REFERENCES

Alexander, L. D., O’Neill, H. M., Snyder, N. H., & Townsend, J. B. (1986). How academy members teach the business policy/strategic management case course. Journal of Management Case Studies, 2, 334–344.

Bilimoria, D. (1995). Modernism, postmodernism, and contemporary grading practices. Journal of Management Education, 19, 440–458.

Blanke, H. G. (1999). Grading by theory. College Teaching, 47, 136–139.

Brown, G. (1997). Assessing student learning in higher education. London: Routledge.

Dennis, I., Newstead, S. E., & Wright, D. E. (1996). A new approach to exploring biases in educational assessment. British Journal of Psychology, 87, 515–535.

Erskine, J. A., Leenders, R. R., & Maufette-Leenders, L. A. (1981). Teaching with cases. London, Canada: University of Western Ontario.

Gopinath, C. (1999). Alternatives to instructor assessment of class participation. Journal of Education for Business, 75, 10–14.

Honan, W. H. (1999). High tech comes to the classroom: Machines that grade essays. New York Times, 148, p. B8.

Murphy, R. J. L. (1979). Removing marks from examination scripts before re-marking them: Does it make any difference? British Journal of Educational Psychology, 49, 73–78.

Nealey, S. M. (1969). Student-instructor agreement in scoring an essay examination. Journal of Educational Research, 63, 111–115.

Page, E. B. (1994). Computer grading of student prose, using modern concepts and software. Journal of Experimental Education, 62, 127–142.

Partington, J. (1994). Double-marking students’ works. Assessment and Evaluation in Higher Education, 19(1), 57–61.

Schroeder, H., & Fitzgerald, P. (1984). Peer evaluation in case analysis. Journal of Business Education, 60, 73–77.

Spear, M. (1997). The influence of contrast effects upon teachers’ marks. Educational Research, 39, 229–233.

Wood, R., & Quinn, B. (1976). Double impression marking of English language essay and summary questions. Educational Review, 28, 229–246.

Wright, D. E. (1996). A new approach to exploring biases in educational assessment. British Journal of Psychology, 87, 515–535.

