

TEACHING EXPERIMENTAL DESIGN TO BIOLOGISTS

James F. Zolman

Department of Physiology, University of Kentucky College of Medicine, Lexington, Kentucky 40536-0298


The teaching of research design and data analysis to our graduate students has been a persistent problem. A course is described in which students, early in their graduate training, obtain extensive practice in designing experiments and interpreting data. Lecture-discussions on the essentials of biostatistics are given, and then these essentials are repeatedly reviewed by illustrating their applications and misapplications in numerous research design problems. Students critique these designs and prepare similar problems for peer evaluation. In most problems the treatments are confounded by extraneous variables, proper controls may be absent, or data analysis may be incorrect. For each problem, students must decide whether the researchers' conclusions are valid and, if not, must identify a fatal experimental flaw. Students learn that an experiment is a well-conceived plan for data collection, analysis, and interpretation. They enjoy the interactive evaluations of research designs and appreciate the repetitive review of common flaws in different experiments. They also benefit from their practice in scientific writing and in critically evaluating their peers' designs.

AM. J. PHYSIOL. 277 (ADV. PHYSIOL. EDUC. 22): S111–S118, 1999.

Key words: biostatistics; designing experiments; analyzing and interpreting data; experimental flaws; writing abstracts; peer reviewing

Many assessment reviews have found that much of biological research is flawed because of incorrect conclusions drawn from confounded experimental designs and the misuse of simple statistical methods (1, 2, 8, 14, 22). These flaws are relatively simple mistakes such as not randomly allocating experimental units to treatments, neglecting to include a control group or condition, or misapplying statistical models to data (e.g., having correlated observations when the applied model requires independent observations). Along with a decline in the rigor and quality of research (3, 11, 19), these errors result in significant research costs. Money is spent and time is taken to collect data that may not be replicated because of extraneous confounds or may not be correctly interpreted because inappropriate statistical models have been unknowingly applied.
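The correlated-observations error is worth a concrete illustration. The short simulation below is a hypothetical sketch (the data, variable names, and numbers are invented for illustration, not drawn from the cited reviews): it analyzes the same paired measurements twice, once with an independent-samples t-test that wrongly assumes independent observations and once with the appropriate paired test.

```python
# Illustrative simulation only: correlated (paired) observations
# analyzed with a model that assumes independence. Requires numpy/scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 15 subjects measured before and after a treatment. Both measurements
# share a large subject effect, so the observations are correlated.
subject_effect = rng.normal(50.0, 10.0, size=15)
before = subject_effect + rng.normal(0.0, 2.0, size=15)
after = subject_effect + 3.0 + rng.normal(0.0, 2.0, size=15)  # true shift: +3.0

# Wrong model: the independent-samples t-test ignores the pairing, so the
# large between-subject variability swamps the treatment effect.
t_ind, p_ind = stats.ttest_ind(after, before)

# Appropriate model: the paired t-test works on within-subject differences.
t_rel, p_rel = stats.ttest_rel(after, before)

print(f"independent-samples t-test: t = {t_ind:.2f}, P = {p_ind:.3f}")
print(f"paired t-test:              t = {t_rel:.2f}, P = {p_rel:.3f}")
```

With numbers like these, the paired test typically detects the shift while the independent-samples test does not; the correct model follows from how the data were collected, not from the data themselves.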

Reasons for this persistent research problem include the nature of individuals learning biostatistics, the discipline of statistics, our erroneous separation of experimental design from statistics, and the limited experience our students receive in designing experiments. As examples of the first two reasons, it is very difficult for us to deal with uncertainty (probability), whether in the real world or in the domain of statistical inference. Also, statistics uses many ordinary words but gives very special meanings to such words as variables, errors, hypotheses, interactions, and significance. Furthermore, many abstract concepts such as random and systematic factors, probability distributions, random samples, and study and target populations are used to explain data variability and its generalization. An additional enigma is the expected mastery of the quirky logic of hypothesis testing in which an explanation based on chance factors (null hypothesis) must be rejected to assert a possible effect of a treatment (alternative hypothesis). These problems of uncertainty, terminology, abstraction, and quirky logic can only be minimally addressed by nonstatisticians (5, 21).
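That quirky logic can be made concrete in a few lines. In the hypothetical sketch below (simulated values only), a treatment effect is asserted indirectly, by showing that the observed difference would be improbable if chance alone (the null hypothesis) were operating.

```python
# Toy demonstration of null-hypothesis logic; all values are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
control = rng.normal(100.0, 15.0, size=12)
treated = rng.normal(112.0, 15.0, size=12)  # simulated true effect: +12

t, p = stats.ttest_ind(treated, control)
if p < 0.05:
    print(f"P = {p:.3f}: reject the chance-only (null) explanation")
else:
    print(f"P = {p:.3f}: the data remain consistent with chance alone")
```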

The other two major reasons for general research flaws that we, as scientists, can dramatically change are 1) our erroneous separation of data collection (what we wrongfully believe is a real experiment) from statistical analysis and interpretation (what we wrongfully assume are only computations of test statistics) and 2) the limited practice our students receive in designing experiments and critically evaluating the experiments of others. Too often, statistical methods are not included in the design of a study. In these cases data collection is usually completed before suitable data analysis procedures are even considered. In addition, we typically expect students in their research proposals to provide detailed descriptions of measurement techniques, some information on possible treatment confounds and their controls during data collection, and little, if any, information on suitable statistical procedures. As teachers and research mentors, we must, by example, convince students that an experiment is a well-conceived plan (design) for data collection, data analysis, and data interpretation. Also, early in their academic careers, we must provide them with numerous opportunities for designing experiments and analyzing and interpreting experimental data. These are the objectives of an annual research design course developed for second-year graduate students in the Department of Physiology at the University of Kentucky.

GENERAL PHILOSOPHY AND COURSE FORMAT

Biological scientists are concerned primarily with designing and interpreting their own experiments as well as critically evaluating articles and papers in their research field. Consequently, they must know whether an experiment is properly conceived, correctly controlled, adequately analyzed, and correctly interpreted. To accomplish these aims I focus on experimental design and statistical literacy and do not cover derivations of probability distributions or computational procedures. Lecture-discussions are given on the essentials of experimental design and statistical inference. These essentials then are reviewed by illustrating their applications and misapplications in numerous research design problems. Course participants critique these research design problems and prepare similar design problems for peer evaluation. In most of these problems the treatment conditions are confounded by subject-, environment-, or time-related extraneous variables, proper controls may be absent, or data analysis may be incorrect (7, 9, 15, 16, 23).

As researchers we strive to translate important questions into appropriate treatment conditions, obtain accurate and valid measurements of functional processes, do experiments that produce unequivocal data, and integrate our data with current knowledge. Therefore, throughout the course I repeatedly emphasize that an appropriate experimental design with proper statistical methods will not ensure correct scientific conclusions if a research hypothesis is not clearly defined, the applied treatment conditions are not appropriate, or the data collected are inaccurate or invalid. These latter requirements are based on specialized discipline knowledge. Also, I emphasize that the logic of experimental design can be learned and a critical perception can be cultivated so that important research hypotheses are more likely to be evaluated in well-designed experiments. Fortunately, in a well-designed experiment the included statistical methods are relatively easy to select (17, 20, 23). In short, I reassure students that their discipline-based ideas and techniques are extremely important but that any idea may be evaluated in either a well-designed or a poorly designed experiment. Consequently, they must not expect even an ingenious idea to be accepted when evaluated in a poorly designed experiment. They also should not expect a trivial idea to become important when evaluated in a well-designed experiment.

LECTURE-DISCUSSION SESSIONS

The class meets twice a week (1.5-hour sessions), and because active participation is required of all members, class enrollment is restricted (6–10 graduate students). Assessment procedures include student performance on one midterm examination covering fundamental material, four prepared research design problems and their critiques, five in-class critiques of design problems that I prepared, and one final problem set that substitutes for a final examination.



Topics covered in the initial 13 lecture-discussion sessions include an introduction to the design and analysis of experiments, confounding in experiments, factorial experiments with two or three factors, designs with repeated measures, interpretation of interactions, multiple comparison tests, and problems in generalizing experimental data (see Table 1). Students are expected to have some prior knowledge of common probability distributions and elementary test statistics. An introductory course in statistics is a prerequisite, and most students have completed a graduate-level course for scientists. The design knowledge of students is determined on a pretest entitled "PhD Qualifying Examination: Experimental Design and Statistical Inference" given at the first organizational meeting. This pretest consists of multiple-choice and short-answer questions. The multiple-choice questions are on very elementary concepts, and short answers are requested on questions such as the following: How many independent groups are necessary in a given between-group design with two factors? What are the degrees of freedom (df) in a simple two-group design? What is an interaction? Test results are reviewed at the next class, but tests are retained by me. Hence, for the last three years the same pretest has been used. Much more demanding multiple-choice and short-answer questions are asked on a midterm examination on material covered in the introductory lectures and assigned readings. As examples, students are expected to be able to reconstruct complete analysis of variance (ANOVA) tables (sources of variance, df, and F-ratios) for three-factor experimental designs with or without correlated (repeated) measures on some or all of the factors, to identify confounds common to different research designs, to know how to control for common confounds, and to know how to analyze and interpret two- and three-way interactions. The multiple-choice questions of the midterm examination also are reviewed only in class. The short-answer questions on experimental design change each year, and this section is returned to students.
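To show what such a reconstruction involves for the between-subject case, here is a minimal sketch (my own illustration; the function name and output format are invented, and the course itself assigns no programming) that prints the source-by-source degrees of freedom for a fully between-subject factorial design.

```python
# Hedged sketch: df bookkeeping for a fully between-subject factorial.
from itertools import combinations
from math import prod

def between_subject_df(levels, n_per_cell):
    """Print sources and df for factors given as {name: levels}
    with n_per_cell experimental units in every treatment cell."""
    names = list(levels)
    cells = prod(levels.values())        # number of treatment cells
    total_n = cells * n_per_cell         # total experimental units

    # Main effects and interactions: df is the product of (levels - 1).
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            df = prod(levels[f] - 1 for f in combo)
            print(f"{' x '.join(combo):<14} df = {df}")

    print(f"{'S/cells':<14} df = {total_n - cells}")  # within-cell error
    print(f"{'Total':<14} df = {total_n - 1}")

# A 2 x 3 x 2 design with 5 units per cell: error df = 60 - 12 = 48,
# and the df over all sources sum to the total df of 59.
between_subject_df({"A": 2, "B": 3, "C": 2}, n_per_cell=5)
```

The same bookkeeping answers the pretest question above: a simple two-group design is the one-factor case, with df = 1 for the group source and 2n - 2 for error.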

The performances of students (n = 19) in the last three classes on these two examinations are shown in Fig. 1. A significant improvement in test scores from the pretest to the midterm examination occurred (P < 0.001). Overall, the percentage of correct responses increased from 39.8 (±12.6 SD) to 79.8 (±8.4 SD). No significant differences were found among the three classes in the two test scores or in the increases across the two tests. Physiology students are encouraged to enroll in this course during the fall semester of their second year. Hence, students with limited backgrounds in statistics may do remedial work during the summer, by either taking a statistics course or self-study. However, even with this opportunity the pretest performance of students is disappointing because most of them do not understand elementary concepts of experimental design (see Fig. 1).

TABLE 1

Lecture-discussions on experimental design and statistical inference

Introduction to
1) Experimental Design and Statistical Inference
2) The Anatomy of an Experiment and of a Scientific Paper
3) The Logic of Experimental Design and Statistical Inference

Advantages, Disadvantages, and Confounds of
4) Between-Subject (experimental unit) Designs
5) Within-Subject Designs
6) Mixed Factorial Designs

Statistical Interpretation
7) Analysis of Variance (ANOVA) Models
8) Main Effects and Interactions in Factorial Designs
9) Interactions and Multiple Comparisons

Selected Statistical Topics
10) Regression and Correlation
11) Parametric and Nonparametric Tests
12) Correlated Measures
13) Data Analysis

FIG. 1.

Box and whisker plots of means ± SE (box) and standard deviations (whiskers) of pretest and midterm scores for the three classes. A 3 × 2 factorial ANOVA with repeated measures on the last factor was used to determine statistical significance levels.




SEMINARS ON RESEARCH DESIGN PROBLEMS AND THEIR CRITIQUES

In the second half of the course the essentials of experimental design and statistical inference are reviewed by illustrating their applications and misapplications in experimental briefs. One hundred research design problems and their critiques have been published, and these are used as introductory and reference material (23). Each year students prepare and critique about 40 new problems.

Each student must prepare four new research design problems and their critiques (about one problem a week). The design problems are distributed to the class during the meeting before their scheduled evaluations. Critiques are given only to me. Each research design problem is expected to describe concisely the significance of the research, the research hypothesis, the preparation used, the experimental methods, the findings (including descriptive statistics for each subgroup or condition, statistical tests, and significance levels), the interpretation of the findings, and the author's conclusions. Problems must avoid jargon, have specialized terms defined, and concentrate on general design and statistical errors (see Table 2). Essentially, a longer and more detailed abstract is expected. One of the four submitted problems must have no fatal experimental flaws.

The critiques of design problems must include a restatement of the research hypothesis, the ANOVA table, and whether the inferences of the researchers are appropriate. The submitted critique of a research design problem must be concise and cannot provide additional details of the experiment. The errors identified must be those that would make the author's conclusions not acceptable given the information provided in the research problem. If a general methodological error is not clearly described (e.g., all control samples measured first), evaluators are instructed to assume that the measurement methods and procedures were appropriate for all conditions.

Three or four of these new problems are reviewed during each class. I select the sequence of problems for review and inform the class that the problems have been graded and that their criticisms will not change the grades. If students find additional confounds or statistical errors (and to my delight they have), I acknowledge my oversight and congratulate them on their insightful evaluations. Also, I inform the group that the author of the problem will not answer questions—the problem must stand alone. Graded problems (and critiques) are returned at the end of each class. I facilitate discussion by asking students whether they accept the author's conclusions and, if not, why. In these discussions the emphasis is on design and inference errors, but a critical, not a cynical, perspective is encouraged. To be critical is not to deride or to demolish but to make exact and discriminating judgments. I emphasize that flaws may occur in biological studies but that every flaw is not fatal. Each student decides when to submit the requested problem without a fatal experimental flaw, and this uncertainty helps maintain the objectivity of the evaluators. In the APPENDIX are two representative design problems, their critiques, and issues that may be reviewed for each research problem.

Five critiques are also completed in class on problems that I prepared (<30 minutes for each problem). After the students' written critiques are submitted, potential confounds and statistical errors are discussed. At least one problem without a fatal experimental error is included in the assessment set. Three critique sessions are spaced throughout the second half of the course. The flawed problems have confounds and errors that have been repeatedly discussed but that now are embedded in different experimental contexts. The students' critiques are graded and returned during the next class, when discussion may continue. Also, students may be requested to critique posters mounted outside of laboratories and at regional or national meetings.

TABLE 2

Potential confounds in between-subject, within-subject, and mixed factorial designs

I. Between-Subject (experimental unit) Design
• Selective bias in assignment of subjects (units)
• Selective loss of subjects (units) because of treatment conditions
• Treatment condition associated with another potential independent variable

II. Within-Subject Design
• General carry-over effects across treatments
• Differential carry-over effects from one treatment to another
• Time-related effects
• Treatment condition associated with another potential independent variable

III. Mixed Factorial Design
• Between-subject component: Subject selection and treatment-subject loss
• Within-subject component: Carry-over and time-related effects associated with repeated (correlated) measurements
• Treatment condition associated with another potential independent variable




The final problem set, which substitutes for a final examination, is due at the last class meeting. The first problem in the two-problem set must be a two- or three-factor design with repeated measures on at least one factor and must have a serious design flaw or inappropriate statistical inference. The critique format of this problem is the same as the format for all flawed problems. The second problem must correct the serious error in the first problem and must extend the research question(s) by using a mixed factorial design. For this latter problem the experimental design, statistical inferences, and researcher’s conclusions must all be appropriate.

STUDENT FEEDBACK

Evaluation forms are distributed a week before the last class meeting, returned at the last class, and summarized by department staff after final grades are assigned. Also, to guarantee future anonymity, students are requested to type their answers. The evaluation form asks for agreement/disagreement rankings on general statements and short answers to three open-ended questions. For the last three classes (n = 19) the percentages of ranks of strongly agree (SA), agree (A), disagree (DA), and strongly disagree (SDA) on three statements were as follows. 1) The general organization of the course enhances conceptual learning: 50% SA, 50% A. 2) Class discussions enhanced and enriched the learning experience: 69% SA, 31% A. 3) Faculty-student interactions in and out of class provided a positive learning experience: 55% SA, 45% A. Other general statements were included on the text, assigned readings, level of difficulty of the course, and subsequent course recommendations.

The three open-ended questions were as follows: What did you like about the course? What did you like the least about the course? What would you change in the course? Most students enjoyed the interactive discussions on designs and potential confounds and appreciated the repetitive review of confounds and errors in different experimental problems. Many students requested more introductory lectures on applied statistics, some wanted more examples of data analysis, and a few requested calculation procedures and guided practice with computer statistics packages. To schedule these additions, most students recommended that the critiques completed in class on my design problems be reduced or eliminated. Some students were not comfortable with these assessments because they could not prepare for specific problems, had to decide relatively quickly to accept or reject the authors' conclusions, and had to provide valid reasons for decisions to reject. Surprisingly, none of the students objected to the open evaluations of their research designs or written presentations (see Ref. 12).

RECOMMENDATIONS

The major advantage of the course format is that students are exposed to all aspects of research, with an emphasis on whether an experiment is properly conceived, correctly controlled, adequately analyzed, correctly interpreted, and concisely presented. Students receive extensive practice in identifying serious design flaws, inappropriate statistical analyses, and invalid conclusions in many different experiments. Students also benefit from their practice in writing detailed abstracts and learn quickly to provide constructive criticism of the research designs and presentations of others. The long-term effects of these problem-solving exercises are difficult to quantify. However, informal assessments by mentors of the subsequent research efforts of their students indicate that significant improvement did occur. Also, evidence from cognitive science suggests that the skills of critical thinking must be practiced with a wide variety of problems for these skills to be learned and retained.

A general advantage of discussing specific research designs is that the experimental details can match the interests of the group. Research design problems can be tailored to reflect the research expertise of a seminar group, a department, or a college, and participants normally maintain their interest in identifying flaws in new problems. A continuing benefit is that in subsequent consulting sessions with students, both partners use the same design and statistics terminology, which produces more productive discussions. The major shortcomings of focusing on design problems are that students receive limited exposure to real data sets and no practice in analyzing data using computer statistics programs.

Significant modifications would, of course, be necessary if class size became larger than 10–12 students. With a large group, class evaluations of all submitted problems would not be feasible unless the number of requested problems was greatly reduced, thereby reducing the students' experience in designing experiments. Some of the teaching exercises, however, can be adapted for use with large classes. For example, in 1996 at Monash University (Australia), after introductory lectures were given by other faculty, design problems in behavioral science were distributed to over 40 undergraduate students. These psychology honors students were very active in trying to identify flaws in the study problems. On the basis of informal assessment by the Head of the Department, students enjoyed and profited from these interactive discussions, and a critique of a published article is now part of the promotion assessment for the honors degree in psychology.

Although many statistics books, like the design course at Kentucky, focus on statistical reasoning (see, e.g., Refs. 4, 6, 10, 13, 18), many biological scientists continue to use flawed designs and give elaborate interpretations of data that most likely were produced by chance processes. To improve our scientific output, we must become more proficient in designing our own experiments and in conveying this knowledge efficiently to our students. To achieve these objectives, the three intertwined experimental phases of data collection, analysis, and interpretation must be integrated in our research planning, our students' research proposals, and our research seminars. When discussing research articles we should critically evaluate all components—the research hypothesis, measurement methods and protocols, experimental design and statistical inferences, and authors' conclusions. In our department, first-year students are expected to read research articles in many areas of physiology. Obviously, given their performance on the design pretest (see Fig. 1), students cannot critically evaluate the experimental designs and statistical inferences in these articles. Hence, only research hypotheses and data collection methods can be adequately evaluated in seminar courses in the first year. This separation of discipline knowledge and protocols from experimental design perpetuates the myth of data collection as real science (an experiment) and data analysis and interpretation as only number crunching (statistics). If nothing else, we should immediately convey to our students that all planned experiments must be well-conceived blueprints for data collection, analysis, and interpretation. As a teaching tool the evaluation of research design problems is an excellent way to reinforce this important message. The more students are exposed, early in their academic careers, to both good and bad experimental designs, the better their scientific output should become.

APPENDIX

Examples of Research Design Problems, Their Critiques, and Issues Reviewed

The following two research design problems (RDP) illustrate the evaluative experiences obtained by students. Design problems are prepared and critiqued after lectures on the essentials of biostatistics (see Table 1). Consequently, for some readers these two relatively brief problems may contain some unfamiliar material such as ANOVA tables with their sources of variance, df, and error terms for the proper F-ratios. The first problem is similar to introductory problems that I use, and the second, more detailed design is similar to those expected from students. In the design course, submitted problems can be no more than four double-spaced pages and their critiques no more than three double-spaced pages. Detailed instructions for the preparation of different types of problems and their critiques are available on request. At the end of each of these selected problems are issues, discussed in the first part of the course, that may be reviewed. One hundred sample design problems and their critiques have been published (see Ref. 23).

RDP 1. Elevated levels of plasma high-density lipoprotein (HDL) cholesterol may be associated with a lowered risk of coronary heart disease. Several previous studies suggested that vigorous exercise may result in increased levels of HDL. To determine whether running is associated with an increase in HDL concentration, plasma HDL concentrations were measured in middle-aged (35–45 years old) marathon runners, joggers, and inactive men.



Mean Plasma HDL Concentrations and Standard Deviations (SD)

Inactive men (n = 30): 48.3 mg/dL, SD 10.2
Joggers (n = 30): 50.3 mg/dL, SD 19.6
Marathon runners (n = 30): 64.8 mg/dL, SD 11.9

A between-subject ANOVA was done, and there was a significant group effect, F(2,87) = 5.63, P < 0.01, and Scheffé's multiple comparison test indicated that the runners were significantly different from the inactive men (P < 0.05) but did not differ from the joggers and that the joggers and inactive men did not differ significantly. From these results, the researchers concluded that only very strenuous exercise will significantly increase HDL concentration; that is, jogging has no significant effect on HDL concentration.

Critique of RDP 1. The research hypothesis is whether HDL concentrations differ in middle-aged men who run, jog, or are inactive. The ANOVA table for this study is

Source     df   Error term
Group       2   S/group
S/group    87
Total      89

This study, however, was not an experiment and does not permit causal inferences to be made. Classification variables are the independent variables in this study, and classification variables are always subject confounded. Inactive men, joggers, and marathon runners probably differ in their diets, their responses to stress, their self-esteem, their body weights, and so forth. Because these subject-related confounding variables cannot be controlled by the researcher, cause-and-effect conclusions are impossible when only classification variables are used. The data presented also make the author's conclusions suspect. Inactive men and marathon runners are classes of individuals that are relatively easy to define. Inactive men do not walk, jog, or run. Marathon runners run in marathons (26 miles). However, joggers may jog very slowly once a week or rather briskly four to seven times a week. Notice that the standard deviation for the joggers is about twice that of the other two groups. Consequently, this group includes individuals who vary widely in their HDL levels and probably also in the amount of their weekly exercise. Individuals who jog briskly for 40 minutes at least 3 times a week may have the same HDL concentrations as many marathon runners. High HDL concentrations may, therefore, be associated with both moderate and strenuous exercise. Hence, the author's conclusion that only very strenuous exercise will significantly increase HDL concentration is not justified because classification variables were used. Furthermore, a conclusion of only an association between strenuous exercise and high HDL concentration would be suspect because of the probable measurement error. (A sketch showing how the reported group statistics can be checked computationally follows the issues list below.) Issues that may be reviewed in addition to correlation/classification versus experimental study:

•ANOVA, single factor

•Familywise error rate

•Multiple comparison tests

•Scheffé's test

•Power of test

•Standard deviation

•Heterogeneity of variances

•Measurement error

•ANOVA table, levels of a factor, definitions of a factor, df, and P value

•Assumptions of statistical test
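A between-subject ANOVA such as the one reported in RDP 1 can be reconstructed entirely from the published group sizes, means, and SDs, which makes this kind of problem convenient to audit. The sketch below (my own illustration; the course itself assigns no computing) applies that standard identity to the summary statistics given above.

```python
# Sketch: one-way between-subject ANOVA rebuilt from summary statistics.
# Group values are those reported in RDP 1; requires numpy and scipy.
import numpy as np
from scipy import stats

n = np.array([30, 30, 30])            # group sizes
mean = np.array([48.3, 50.3, 64.8])   # group means, mg/dL
sd = np.array([10.2, 19.6, 11.9])     # group standard deviations

grand_mean = np.sum(n * mean) / np.sum(n)

ss_between = np.sum(n * (mean - grand_mean) ** 2)   # Group
df_between = len(n) - 1                             # df = 2
ss_within = np.sum((n - 1) * sd ** 2)               # S/group
df_within = np.sum(n) - len(n)                      # df = 87

f_ratio = (ss_between / df_between) / (ss_within / df_within)
p_value = stats.f.sf(f_ratio, df_between, df_within)
print(f"F({df_between},{df_within}) = {f_ratio:.2f}, P = {p_value:.4f}")

# The joggers' SD (19.6) is roughly twice the others', so the equal-variance
# assumption behind this F-test is itself questionable (see the issues list).
```

Checking a reported F-ratio against the published summary statistics in this way is exactly the kind of exact, discriminating judgment the critique sessions are meant to cultivate.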

RDP 2. The mammalian hippocampus has been proposed to be an important brain region for memory processing; consequently, interfering with hippocampal functioning should produce a greater effect on complex tasks (requiring extensive memory processing) than on simple tasks. To test this hypothesis, a biologist prepared a new type of knockout mouse in which the N-methyl-D-aspartate (NMDA) receptor was deleted specifically from the pyramidal cells of the hippocampus in the adult mouse in some littermates but not in others. This novel procedure apparently eliminates both developmental differences and other brain changes between the knockout mice and their normal littermates. These same outcomes are assumed to have occurred in this experiment. The behavior of 20 of these adult knockout mice was compared with that of 20 normal adult littermates (only one littermate pair of male mice from each litter) on a simple two-choice T-maze and on a complex radial arm (RA) maze with multiple choice points. The order of maze testing was counterbalanced so that one-half of the mice in each group were trained first on the T-maze and subsequently on the RA maze. The other 10 mice in each group were trained in the reverse order. Food reward was used in both tasks, and all mice were maintained at 85% of their pretraining weight. For all mice the number of trials to reach 90% correct responding on each maze was recorded. A mixed factorial ANOVA indicated a significant group effect, F(1, 38) = 8.35, P < 0.01, and a significant test effect, F(1, 38) = 8.95, P < 0.01. The group × test interaction was not significant, F(1, 38) = 1.20. Both the knockout and normal mice required more trials to learn the RA maze (41.2 ± 3.5, mean ± SD) than the T-maze (19.8 ± 3.7). Also, the knockout mice required more averaged trials to learn the two mazes (39.8 ± 3.6) compared with their controls (21.2 ± 2.9). From these findings, the biologist concluded that the loss of NMDA receptors from the neurons of the hippocampus in the male mouse has a greater effect on the learning of complex than simple tasks, a conclusion that supports the hypothesis that the hippocampus is involved in memory processing.

Critique of RDP 2. The research hypothesis is whether the loss of NMDA receptors on pyramidal neurons would interfere more with a mouse's performance on an RA maze than on a T-maze. The general ANOVA for this mixed factorial design is



Source            df   Error term
Group              1   S/group
Test               1   Test × S/group
S/group           38
Group × test       1   Test × S/group
Test × S/group    38
Total             79

The statistical test selected was appropriate, but the biologist's conclusions were wrong. If the researcher's conclusions were correct, a significant group × test interaction would have been obtained. Subsequent analyses of this interaction should also have indicated that for the knockout mice the difference in errors made on the complex and simple tasks was greater than the trial error difference between these two behavioral tests for the normal mice. However, the group × test interaction, F(1, 38) = 1.20, was not significant; consequently, the only conclusions the biologist could make would be interpretations based on the two reported main effects, namely, that the knockout mice required significantly more trials to learn the two mazes compared with their control littermates and that mice in both groups required significantly more trials to learn the complex than the simple maze. Therefore, no evidence exists that a deficiency in hippocampal function affected the mouse's performance more on complex than on simple tasks. (A numerical sketch of this mixed-design decomposition follows the issues list below.) Issues that may be reviewed in addition to interaction interpretation:

•Mixed factorial design

•Main effects and interactions

•Between-subject and within-subject components

•Confounds, carryover effects

•Counterbalancing

•ANOVA table, levels of a factor, definitions of a factor, df, and P value

•Expected value of F-ratio

•Assumptions of statistical test
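For readers who want to see the error-term structure of the table above in action, the following is a minimal sketch of the split-plot decomposition behind a mixed factorial ANOVA. The raw data of RDP 2 are not given, so the trial counts below are simulated to mirror its reported pattern (two main effects, no interaction); the sources and error terms follow the critique's table.

```python
# Minimal sketch (simulated data, not the biologist's): split-plot
# decomposition for 2 groups (between) x 2 tests (within), 20 mice/group.
import numpy as np

rng = np.random.default_rng(1)
n_per_group = 20

# scores[group, subject, test]: trials to criterion. Simulated with
# group and test main effects but no group x test interaction.
group_eff = np.array([10.0, 0.0])   # knockout vs. control (invented)
test_eff = np.array([0.0, 20.0])    # T-maze vs. RA maze (invented)
scores = (20.0
          + group_eff[:, None, None]
          + test_eff[None, None, :]
          + rng.normal(0, 3, (2, n_per_group, 1))   # subject effect
          + rng.normal(0, 3, (2, n_per_group, 2)))  # residual

grand = scores.mean()
subj = scores.mean(axis=2)        # per-subject means
grp = scores.mean(axis=(1, 2))    # per-group means
tst = scores.mean(axis=(0, 1))    # per-test means
cell = scores.mean(axis=1)        # group x test cell means

ss_group = 2 * n_per_group * np.sum((grp - grand) ** 2)
ss_s_in_group = 2 * np.sum((subj - grp[:, None]) ** 2)   # between-subject error
ss_test = 2 * n_per_group * np.sum((tst - grand) ** 2)
ss_gxt = n_per_group * np.sum(
    (cell - grp[:, None] - tst[None, :] + grand) ** 2)
ss_txs = np.sum(                                          # within-subject error
    (scores - subj[:, :, None] - cell[:, None, :] + grp[:, None, None]) ** 2)

df_s, df_txs = 2 * (n_per_group - 1), 2 * (n_per_group - 1)  # 38 and 38
f_group = (ss_group / 1) / (ss_s_in_group / df_s)
f_test = (ss_test / 1) / (ss_txs / df_txs)
f_gxt = (ss_gxt / 1) / (ss_txs / df_txs)

print(f"Group:        F(1, {df_s}) = {f_group:.2f}")
print(f"Test:         F(1, {df_txs}) = {f_test:.2f}")
print(f"Group x test: F(1, {df_txs}) = {f_gxt:.2f}")
```

Note that the group effect is tested against S/group, whereas the test effect and the interaction are tested against Test × S/group; with no simulated interaction, the group × test F-ratio stays small, which is precisely the pattern that invalidates the biologist's conclusion in RDP 2.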

I am grateful to Brian Jackson, Daniel Richardson, and Eric Smart for helpful comments on this paper. I thank Kim Ng of Monash University for supporting the use of research design problems and their evaluations as a teaching method with a large class of advanced undergraduate students.

Address for reprint requests and other correspondence: J. F. Zolman, Dept. of Physiology, MS 508C Medical Center, College of Medicine, Univ. of Kentucky, Lexington, KY 40536-0298. (E-mail: [email protected]).

Received 18 November 1998; accepted in final form 25 August 1999.

References

1. Altman, D. G. Statistics in medical journals: developments in the 1980s. Stat. Med. 10: 1897–1913, 1991.
2. Altman, D. G. Statistical reviewing for medical journals. Stat. Med. 17: 2661–2674, 1998.
3. Bailar, J. C., III. The real threats to the integrity of science. Chronicle of Higher Education, April 21: B1–B2, 1995.
4. Bland, M. An Introduction to Medical Statistics (2nd ed.). Oxford, UK: Oxford Univ. Press, 1995.
5. Bradstreet, T. E. Teaching introductory statistics courses so that nonstatisticians experience statistical reasoning. American Statistician 50: 69–78, 1996.
6. Dawson-Saunders, B., and R. G. Trapp. Basic and Clinical Biostatistics (2nd ed.). East Norwalk, CT: Appleton and Lange, 1994.
7. Friedman, H. H., and L. W. Friedman. A new approach to teaching statistics: learning from misuses. New York Statistician 31: 1–3, 1980.
8. Glantz, S. A. Primer of Biostatistics (4th ed.). New York: McGraw-Hill, 1995.
9. Jaffe, A. J., and H. F. Spirer. Misused Statistics: Straight Talk for Twisted Numbers. New York: Marcel Dekker, 1987.
10. Keppel, G. Design and Analysis: A Researcher's Handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall, 1991.
11. Lederberg, J. Sloppy research extracts a greater toll than misconduct. The Scientist, February 20: 13, 1995.
12. Lightfoot, J. T. A different method of teaching peer review systems. Am. J. Physiol. 274 (Adv. Physiol. Educ. 19): S57–S61, 1998.
13. Munro, B. H. (Editor). Statistical Methods for Health Care Research (3rd ed.). Philadelphia, PA: Lippincott-Raven, 1997.
14. Nowak, R. Problems in clinical trials go far beyond misconduct. Science 264: 1541, 1994.
15. Riegelman, R. K., and R. P. Hirsch. Studying a Study and Testing a Test: How to Read the Health Science Literature (3rd ed.). Boston, MA: Little, Brown, 1996.
16. Scott, E. M., and J. M. Waterhouse. Physiology and the Scientific Method: The Design of Experiments in Physiology. Manchester, UK: Manchester Univ. Press, 1986.
17. Selwyn, M. R. Principles of Experimental Design for the Life Sciences. Boca Raton, FL: CRC, 1996.
18. Sokal, R. R., and F. J. Rohlf. Biometry: The Principles and Practice of Statistics in Biological Research (3rd ed.). New York: Freeman, 1995.
19. Swazey, J. P., M. S. Anderson, and K. S. Lewis. Ethical problems in academic research. Am. Sci. 81: 542–553, 1993.
20. Tweedie, R. L., K. L. Mengersen, and J. A. Eccleston. Garbage in, garbage out: can statisticians quantify the effects of poor data? Chance 7: 20–27, 1993.
21. Watts, D. G. Why is introductory statistics so difficult to learn? And what can we do to make it easier? American Statistician 45: 290–292, 1991.
22. Williamson, J. W., P. G. Goldschmidt, and T. Colton. The quality of medical literature: an analysis of validation assessments. In: Medical Uses of Statistics, edited by J. C. Bailar III and F. Mosteller. Cambridge, MA: N. Engl. J. Med., 1986, p. 370–391.
23. Zolman, J. F. Biostatistics: Experimental Design and Statistical Inference. New York: Oxford Univ. Press, 1993.


(1)

Topics covered in the initial 13 lecture-discussion

sessions include an introduction to the design and

analysis of experiments, confounding in experiments,

factorial experiments with two or three factors,

de-signs with repeated measures, interpretation of

inter-actions, multiple comparison tests, and problems in

generalizing experimental data (see Table 1). Students

are expected to have some prior knowledge of

com-mon probability distributions and elementary test

statistics. An introductory course in statistics is a

prerequisite, and most students have completed a

graduate level course for scientists. The design

knowl-edge of students is determined on a pretest entitled

‘‘PhD Qualifying Examination: Experimental Design

and Statistical Inference’’ given at the first

organiza-tional meeting. This pretest consists of

choice and short-answer questions. The

multiple-choice questions are on very elementary concepts,

and short answers are requested on questions such as

the following: How many independent groups are

necessary in a given between-group design with two

factors?; What are the degrees of freedom (df) in a

simple two-group design?; and What is an interaction?

Test results are reviewed at the next class, but tests are

retained by me. Hence, for the last three years the

same pretest has been used. Much more demanding

multiple-choice and short-answer questions are asked

on a midterm examination on material covered in the

introductory lectures and assigned readings. As

ex-amples, students are expected to be able to

recon-struct complete analysis of variance (ANOVA) tables

(sources of variance, df, and

F

-ratios) for three-factor

experimental designs with or without correlated

(re-peated) measures on some or all of the factors and to

identify confounds common to different research

designs, know how to control for common

con-founds, and know how to analyze and interpret

two-and three-way interactions. The multiple-choice

ques-tions of the midterm examination also are reviewed

only in class. The short-answer questions on

experi-mental design change each year, and this section is

returned to students.

The performances of students (

n

5

19) in the last

three classes on these two examinations are shown in

Fig. 1. A significant improvement in test scores from

the pretest to the midterm examination occurred (

P

,

0.001). Overall, the percentage of correct responses

increased from 39.8 (

6

12.6 SD) to 79.8 (

6

8.4 SD). No

significant differences were found among the three

classes in the two test scores or in the increases across

the two tests. Physiology students are encouraged to

enroll in this course during the fall semester of their

second year. Hence, students with limited

back-grounds in statistics may do remedial work during the

TABLE 1

Lectur e-discussions on ex perimental design and statistical infer ence

Introduction to

1) Experimental Design and Statistical Inference

2) The Anatomy of an Experiment and of a Scientific Paper

3) The Logic of Experimental Design and Statistical Inference Advantages, Disadvantages and Confounds of

4) Between-Subject (experimental unit) Designs

5) Within-Subject Designs

6) Mixed Factorial Designs Statistical Interpretation

7) Analysis of Variance (ANOVA) Models

8) Main Effects and Interactions in Factorial Designs

9) Interactions and Multiple Comparisons Selected Statistical Topics

10) Regression and Correlation

11) Parametric and Nonparametric Tests

12) Correlated Measures

13) Data Analysis

FIG. 1.

Box and whisker plots of means 6 SE (box ) and standar d deviations (whiskers) of pr etest (k) and midter m (n) scor es for the thr ee classes. A 3 3 2 factorial ANOVA with r epeated measur es on the last factor was used to deter mine statistical significance levels.


(2)

summer, by either taking a statistics course or

self-study. However, even with this opportunity the

pre-test performance of students is disappointing because

most of them do not understand elementary concepts

of experimental design (see Fig. 1).

SEMINARS ON RESEARCH DESIGN PROBLEMS

AND THEIR CRITIQUES

In the second half of the course the essentials of

experimental design and statistical inference are

re-viewed by illustrating their applications and

misappli-cations in experimental briefs. One hundred research

design problems and their critiques have been

pub-lished, and these are used as introductory and

refer-ence material (23). Each year students prepare and

critique about 40 new problems.

Each student must prepare four new research design

problems and their critiques (about one problem a

week). The design problems are distributed to the

class during the meeting before their scheduled

evalu-ations. Critiques are given only to me. Each research

design problem is expected to describe concisely the

significance of the research, the research hypothesis,

the preparation used, the experimental methods, the

findings (including descriptive statistics for each

sub-group or condition, statistical tests, and significance

levels), the interpretation of the findings, and the

author’s conclusions. Problems must avoid jargon,

have specialized terms defined, and concentrate on

general design and statistical errors (see Table 2).

Essentially, a longer and more detailed abstract is

expected. One of the four submitted problems must

have no fatal experimental flaws.

The critiques of design problems must include a

restatement of the research hypothesis, the ANOVA

table, and whether the inferences of the researchers

are appropriate. The submitted critique of a research

design problem must be concise and cannot provide

additional details of the experiment. The errors

identi-fied must be those that would make the author’s

conclusions not acceptable given the information

provided in the research problem. If a general

method-ological error is not clearly described (all control

samples measured first), evaluators are instructed to

assume that the measurement methods and

proce-dures were appropriate for all conditions.

Three or four of these new problems are reviewed

during each class. I select the sequence of problems

for review and inform the class that the problems have

been graded and that their criticisms will not change

the grades. If students find additional confounds or

statistical errors (and to my delight they have), I

acknowledge my oversight and congratulate them on

their insightful evaluations. Also, I inform the group

that the author of the problem will not answer

questions—the problem must stand alone. Graded

problems (and critiques) are returned at the end of

each class. I facilitate discussion by asking students

whether they accept the author’s conclusions and, if

not, why. In these discussions the emphasis is on

design and inference errors but a critical, not a

cynical, perceptive is encouraged. To be critical is not

to deride or to demolish but to make exact and

discriminating judgments. I emphasize that flaws may

occur in biological studies but that every flaw is not

fatal. Each student decides when to submit the

re-quested problem without a fatal experimental flaw,

and this uncertainty helps maintain the objectivity of

the evaluators. In the

APPENDIX

are two representative

design problems, their critiques, and issues that may

be reviewed for each research problem.

Five critiques are also completed in class on problems

that I prepared (

,

30 minutes for each problem). After

the students’ written critiques are submitted,

poten-TABLE 2

Potential confounds in between-subject, within-subject and mix ed factorial designs

I. Between-Subject (experimental unit) Design

• Selective bias in assignment of subjects (units)

• Selective loss of subjects (units) because of treatment condi-tions

• Treatment condition associated with another potential inde-pendent variable

II. Within-Subject Design

• General carry-over effects across treatments

• Differential carry-over effects from one treatment to another

• Time-related effects

• Treatment condition associated with another potential inde-pendent variable

III. Mixed Factorial Design

• Between-subject component: Subject selection and treatment-subject loss

• Within-subject component: Carry-over and time-related effects associated with repeated (correlated) measurements

• Treatment condition associated with another potential inde-pendent variable


(3)

tial confounds and statistical errors are discussed. At

least one problem without a fatal experimental error is

included in the assessment set. Three critique sessions

are spaced throughout the second half of the course.

The flawed problems have confounds and errors that

have been repeatedly discussed but that now are

embedded in different experimental contexts. The

students’ critiques are graded and returned during the

next class, when discussion may continue. Also,

students may be requested to critique posters mounted

outside of laboratories and at regional or national

meetings.

The final problem set, which substitutes for a final

examination, is due at the last class meeting. The first

problem in the two-problem set must be a two- or

three-factor design with repeated measures on at least

one factor and must have a serious design flaw or

inappropriate statistical inference. The critique format

of this problem is the same as the format for all flawed

problems. The second problem must correct the

serious error in the first problem and must extend the

research question(s) by using a mixed factorial design.

For this latter problem the experimental design,

statistical inferences, and researcher’s conclusions

must all be appropriate.

STUDENT FEEDBACK

Evaluation forms are distributed a week before the last

class meeting, returned at the last class, and

summa-rized by department staff after final grades are

as-signed. Also, to guarantee future anonymity, students

are requested to type their answers. The evaluation

form asks for agreement/disagreement rankings on

general statements and short answers to three

open-ended questions. For the last three classes (

n

5

19)

the percentages of ranks of strongly agree (SA), agree

(A), disagree (DA), and strongly disagree (SDA) on

three statements were as follows.

1

) The general

organization of the course enhances conceptual

learn-ing: 50% SA, 50% A.

2

) Class discussions enhanced and

enriched the learning experience: 69% SA, 31% A.

3

)

Faculty-student interactions in and out of class

pro-vided a positive learning experience: 55% SA, 45% A.

Other general statements were included on the text,

assigned readings, level of difficulty of the course, and

subsequent course recommendations.

The three open-ended questions were as follows:

What did you like about the course?; What did you like

the least about the course?; and What would you

change in the course? Most students enjoyed the

interactive discussions on designs and potential

founds and appreciated the repetitive review of

con-founds and errors in different experimental problems.

Many students requested more introductory lectures

on applied statistics, some wanted more examples of

data analysis, and a few requested calculation

proce-dures and guided practice with computer statistics

packages. To schedule these additions, most students

recommended that the critiques completed in class on

my design problems be reduced or eliminated. Some

students were not comfortable with these

assess-ments because they could not prepare for specific

problems, had to decide relatively quickly to accept or

reject the authors’ conclusions, and had to provide

valid reasons for decisions to reject. Surprisingly, none

of the students objected to the open evaluations of

their research designs or written presentations (see

Ref. 12).

RECOMMENDATIONS

The major advantage of the course format is that

students are exposed to all aspects of research with an

emphasis on whether an experiment is properly

conceived, correctly controlled, adequately analyzed,

correctly interpreted, and concisely presented.

Stu-dents receive extensive practice in identifying serious

design flaws, inappropriate statistical analyses, and

invalid conclusions in many different experiments.

Students also benefit from their practice in writing

detailed abstracts and learn quickly to provide

con-structive criticism of the research designs and

presen-tations of others. The long-term effects of these

problem-solving exercises are difficult to quantify.

However, informal assessments by mentors of the

subsequent research efforts of their students indicate

that significant improvement did occur. Also,

evi-dence from cognitive science suggests that the skills

of critical thinking must be practiced with a wide

variety of problems for these skills to be learned and

retained.

A general advantage of discussing specific research

designs is that the experimental details can match the

interests of the group. Research design problems can

be tailored to reflect the research expertise of a


(4)

seminar group, a department, or a college, and

partici-pants normally maintain their interest in identifying

flaws in new problems. A continuing benefit is that in

subsequent consulting sessions with students, both

partners use the same design and statistics

terminol-ogy, which produces more productive discussions.

The major shortcomings of focusing on design

prob-lems are that students receive limited exposure to real

data sets and no practice in analyzing data using

computer statistics programs.

Significant course modifications, of course, would be

necessary if class size became larger than 10–12

students. With a large group, class evaluations of all

submitted problems would not be feasible unless the

number of requested problems was greatly reduced,

thereby reducing the students’ experience in

design-ing experiments. Some of the teachdesign-ing exercises,

however, can be adapted for use with large classes.

For example, in 1996 at Monash University (Australia),

after introductory lectures were given by other

fac-ulty, design problems in behavioral science were

distributed to over 40 undergraduate students. These

psychology honor students were very active in trying

to identify flaws in the study problems. On the basis of

informal assessment by the Head of the Department,

students enjoyed and profited from these interactive

discussions, and a critique of a published article is

now part of the promotion assessment for the honors

degree in psychology.

Although many statistics books, like the design course

at Kentucky, focus on statistical reasoning (see e.g.,

Refs. 4, 6, 10, 13, 18), many biological scientists

continue to use flawed designs and give elaborate

interpretations of data that most likely were produced

by chance processes. To improve our scientific

out-put, we must become more proficient in the designing

of our own experiments and in conveying efficiently

this knowledge to our students. To achieve these

objectives, the three intertwined experimental phases

of data collection, analysis, and interpretation must be

integrated in our research planning, our students’

research proposals, and our research seminars. When

discussing research articles we should critically

evalu-ate all components—the research hypothesis,

measure-ment methods and protocols, experimeasure-mental design

and statistical inferences, and authors’ conclusions. In

our department, first-year students are expected to

read research articles in many areas of physiology.

Obviously, given their performance on the design

pretest (see Fig. 1), students cannot critically evaluate

the experimental designs and statistical inferences in

these articles. Hence, only research hypotheses and

data collection methods can be adequately evaluated

in seminar courses in the first year. This separation of

discipline knowledge and protocols from

experimen-tal design perpetuates the myth of data collection as

real science (an experiment) and data analysis and

interpretation as only number crunching (statistics).

If nothing else, we should immediately convey to our

students that all planned experiments must be

well-conceived blueprints for data collection, analysis, and

interpretation. As a teaching tool the evaluation of

research design problems is an excellent way to

reinforce this important message. The more students

are exposed, early in their academic careers, to both

good and bad experimental designs, the better their

scientific output should become.

APPENDIX

Ex amples of Resear ch Design Pr oblems, Their

Critiques, and Issues Reviewed

The following two research design problems (RDP) illustrate the evaluative experiences obtained by students. Design problems are prepared and critiqued after lectures on the essentials of biostatis-tics (see Table 1). Consequently, for some readers these two relatively brief problems may contain some unfamiliar material such as ANOVA tables with their sources of variances, df, and error terms for the properF-ratios. The first problem is similar to introductory problems that I use, and the second, more detailed design is similar to those expected from students. In the design course, submitted problems can be no more than four double-spaced pages and their critiques no more than three double-spaced pages. Detailed instruc-tions for the preparation of different types of problems and their critiques are available on request. At the end of each of these selected problems are issues, discussed in the first part of the course, that may be reviewed. One hundred sample design prob-lems and their critiques have been published (see Ref. 23). RDP 1.Elevated levels of plasma high-density lipoprotein (HDL) cholesterol may be associated with a lowered risk of coronary heart disease. Several previous studies suggested that vigorous exercise may result in increased levels of HDL. To determine whether running is associated with an increase in HDL concentration, plasma HDL concentrations were measured in middle-aged (35–45 years old) marathon runners, joggers, and inactive men.


(5)

Mean Plasma HDL Concentrations and Standard Deviations (SD) Inactive men (n530) 48.3 mg/dL 10.2 SD Joggers (n530) 50.3 mg/dL 19.6 SD Marathon runners (n530) 64.8 mg/dL 11.9SD A between-subject ANOVA was done, and there was a significant group effectF(2,87) 55.63, P,0.01, and a Scheffe´’s multiple comparison test indicated that the runners were significantly different from the inactive men (P,0.05) but did not differ from the joggers and that the joggers and inactive men did not differ significantly. From these results, the researchers concluded that only very strenuous exercise will significantly increase HDL concen-tration; that is, jogging has no significant effect on HDL concentra-tion.

Critique of RDP 1. The research hypothesis is whether HDL concentrations differ in middle-aged men who either run or jog, or are inactive. The ANOVA table for this study is

Source df Error term

Group 2 S/group

S/group 87

Total 89

This study, however, was not an experiment and does not permit causal inferences to be made. Classification variables are the independent variables in this study, and classification variables are always subject confounded. Inactive men, joggers, and marathon runners probably differ in their diets, their responses to stress, their self-esteem, their body weights, and so forth. Because these subject-related confounding variables cannot be controlled by the researcher, cause-and-effect conclusions are impossible when only classification variables are used. The data presented also make the author’s conclusions suspect. Inactive men and marathon runners are classes of individuals that are relatively easy to define. Inactive men do not walk, jog, or run. Marathon runners run in marathons (26 miles). However, joggers may jog very slowly once a week or rather briskly four to seven times a week. Notice that the standard deviation for the joggers is about twice that of the other two groups. Consequently, this group includes individuals who vary widely in their HDL levels and probably also in the amount of their weekly exercise. Individuals who jog briskly for 40 minutes at least 3 times a week may have the same HDL concentrations as many marathon runners. High HDL concentrations may, therefore, be associated with both moderate and strenuous exercise. Hence, the author’s conclusion that only very strenuous exercise will significantly increase HDL concentration is not justified because classification variables were used. Furthermore, a conclusion of only an associa-tion between strenuous exercise and high HDL concentraassocia-tion would be suspect because of the probable measurement error. Issues that may be reviewed in addition to correlation/classification versus experimental study:

• ANOVA, single factor

• Familywise error rate

• Multiple comparison tests

• Scheffé's test

• Power of test

• Standard deviation

• Heterogeneity of variances

• Measurement error

• ANOVA table, levels of a factor, definitions of a factor, df, and P value

• Assumptions of statistical test
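Because the doubled standard deviation in the jogger group bears directly on the heterogeneity-of-variances issue above, the following minimal Python sketch shows one way a student might check that assumption with Levene's test (the data are the same kind of hypothetical simulated values used in the earlier sketch):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical raw data matching the reported group means and SDs.
inactive = rng.normal(48.3, 10.2, 30)
joggers = rng.normal(50.3, 19.6, 30)
runners = rng.normal(64.8, 11.9, 30)

# Levene's test: a small P suggests the equal-variance assumption
# behind the between-subject ANOVA is doubtful.
w, p = stats.levene(inactive, joggers, runners)
print(f"Levene W = {w:.2f}, P = {p:.4f}")
```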

RDP 2. The mammalian hippocampus has been proposed to be an important brain region for memory processing; consequently, interfering with hippocampal functioning should produce a greater effect on complex tasks (requiring extensive memory processing) than on simple tasks. To test this hypothesis, a biologist prepared a new type of knockout mouse in which the N-methyl-D-aspartate (NMDA) receptor was deleted specifically from the pyramidal cells of the hippocampus in the adult mouse in some littermates but not in others. This novel procedure apparently eliminates both developmental differences and other brain changes between the knockout mice and their normal littermates; these same outcomes are assumed to have occurred in this experiment. The behavior of 20 of these adult knockout mice was compared with that of 20 normal adult littermates (only one littermate pair of male mice from each litter) on a simple two-choice T-maze and on a complex radial arm (RA) maze with multiple choice points. The order of maze testing was counterbalanced so that one-half of the mice in each group were trained first on the T-maze and subsequently on the RA maze; the other 10 mice in each group were trained in the reverse order. Food reward was used in both tasks, and all mice were maintained at 85% of their pretraining weight. For all mice the number of trials to reach 90% correct responding on each maze was recorded.

A mixed factorial ANOVA indicated a significant group effect, F(1, 38) = 8.35, P < 0.01, and a significant test effect, F(1, 38) = 8.95, P < 0.01. The group × test interaction was not significant, F(1, 38) = 1.20. Both the knockout and normal mice required more trials to learn the RA maze (41.2 ± 3.5, mean ± SD) than the T-maze (19.8 ± 3.7). Also, the knockout mice required more trials, averaged over the two mazes, to learn them (39.8 ± 3.6) compared with their controls (21.2 ± 2.9). From these findings, the biologist concluded that the loss of NMDA receptors from the neurons of the hippocampus in the male mouse has a greater effect on the learning of complex than simple tasks, a conclusion that supports the hypothesis that the hippocampus is involved in memory processing.

Critique of RDP 2. The research hypothesis is whether the loss of NMDA receptors on pyramidal neurons would interfere more with a mouse's performance on an RA maze than on a T-maze. The general ANOVA for this mixed factorial design is



Source            df    Error term

Group              1    S/group

Test               1    Test × S/group

S/group           38

Group × test       1    Test × S/group

Test × S/group    38

Total             79
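For readers who want to see how the sums of squares and F-ratios behind this table are assembled, here is a minimal numpy sketch. The data are hypothetical, simulated so that the cell means agree with the reported group and test means with no built-in interaction, so the printed Fs will differ from the problem's values:

```python
import numpy as np

rng = np.random.default_rng(2)
g, n, t = 2, 20, 2  # 2 groups x 20 mice per group x 2 maze tests

# Hypothetical trials-to-criterion scores, shape (group, subject, test);
# test 0 = T-maze, test 1 = RA maze. The cell means (29.5, 50.1, 10.1,
# 32.3) are invented but consistent with the reported marginal means.
y = np.stack([
    rng.normal([29.5, 50.1], 3.0, size=(n, t)),  # knockout mice
    rng.normal([10.1, 32.3], 3.0, size=(n, t)),  # normal littermates
])

gm = y.mean()
ss_total = ((y - gm) ** 2).sum()

# Between-subject partition: Group and S/group.
subj_means = y.mean(axis=2)                      # (group, subject)
ss_between_subj = t * ((subj_means - gm) ** 2).sum()
ss_group = n * t * ((y.mean(axis=(1, 2)) - gm) ** 2).sum()
ss_s = ss_between_subj - ss_group                # S/group

# Within-subject partition: Test, Group x test, Test x S/group.
ss_test = g * n * ((y.mean(axis=(0, 1)) - gm) ** 2).sum()
ss_cells = n * ((y.mean(axis=1) - gm) ** 2).sum()
ss_int = ss_cells - ss_group - ss_test
ss_txs = ss_total - ss_between_subj - ss_test - ss_int

df_s = g * (n - 1)                               # 38
df_txs = g * (n - 1) * (t - 1)                   # 38

# Each F uses the error term listed in the table above.
print(f"Group:        F(1, {df_s}) = {ss_group / (ss_s / df_s):.2f}")
print(f"Test:         F(1, {df_txs}) = {ss_test / (ss_txs / df_txs):.2f}")
print(f"Group x test: F(1, {df_txs}) = {ss_int / (ss_txs / df_txs):.2f}")
```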

The statistical test selected was appropriate, but the biologist's conclusions were wrong. If the researcher's conclusions were correct, a significant group × test interaction would have been obtained. Subsequent analyses of this interaction should also have indicated that the difference in trials to criterion between the complex and simple tasks was greater for the knockout mice than for the normal mice. However, the group × test interaction, F(1, 38) = 1.20, was not significant; consequently, the only conclusions the biologist could make would be interpretations based on the two reported main effects, namely, that the knockout mice required significantly more trials to learn the two mazes compared with their control littermates and that mice in both groups required significantly more trials to learn the complex than the simple maze. Therefore, no evidence exists that a deficiency in hippocampal function affected the mouse's performance more on complex than on simple tasks. (A sketch of an equivalent difference-score test of this interaction appears after the issue list below.)

Issues that may be reviewed in addition to interaction interpretation:

• Mixed factorial design

• Main effects and interactions

• Between-subject and within-subject components

• Confounds, carryover effects

• Counterbalancing

• ANOVA table, levels of a factor, definitions of a factor, df, and P value

• Expected value of F-ratio

• Assumptions of statistical test
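Because the entire research hypothesis rides on the interaction, it is worth noting that in a 2 (between) × 2 (within) design the group × test interaction can be tested equivalently as a two-sample t-test on each mouse's within-subject difference score (RA-maze trials minus T-maze trials), with F = t². A minimal Python sketch, regenerating the same hypothetical data as the previous sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)          # same seed as the previous sketch
n = 20
knockout = rng.normal([29.5, 50.1], 3.0, size=(n, 2))
normal = rng.normal([10.1, 32.3], 3.0, size=(n, 2))

# Within-subject difference score: RA-maze trials minus T-maze trials.
d_ko = knockout[:, 1] - knockout[:, 0]
d_nl = normal[:, 1] - normal[:, 0]

# Two-sample t-test on the difference scores; t^2 reproduces the
# group x test interaction F(1, 38) from the mixed ANOVA.
t_stat, p = stats.ttest_ind(d_ko, d_nl)
print(f"t(38) = {t_stat:.2f}, F = t^2 = {t_stat ** 2:.2f}, P = {p:.4f}")
```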

I am grateful to Brian Jackson, Daniel Richardson, and Eric Smart for helpful comments on this paper. I thank Kim Ng of Monash University for supporting the use of research design problems and their evaluations as a teaching method with a large class of advanced undergraduate students.

Address for reprint requests and other correspondence: J. F. Zolman, Dept. of Physiology, MS 508C Medical Center, College of Medicine, Univ. of Kentucky, Lexington, KY 40536-0298. (E-mail: [email protected]).

Received 18 November 1998; accepted in final form 25 August 1999.

References

1. Altman, D. G. Statistics in medical journals: developments in the 1980s. Stat. Med. 10: 1897–1913, 1991.

2. Altman, D. G. Statistical reviewing for medical journals. Stat. Med. 17: 2661–2674, 1998.

3. Bailar, J. C., III. The real threats to the integrity of science. Chronicle of Higher Education, April 21: B1–B2, 1995.

4. Bland, M. An Introduction to Medical Statistics (2nd ed.). Oxford, UK: Oxford Univ. Press, 1995.

5. Bradstreet, T. E. Teaching introductory statistics courses so that nonstatisticians experience statistical reasoning. American Statistician 50: 69–78, 1996.

6. Dawson-Saunders, B., and R. G. Trapp. Basic and Clinical Biostatistics (2nd ed.). East Norwalk, CT: Appleton and Lange, 1994.

7. Friedman, H. H., and L. W. Friedman. A new approach to teaching statistics: learning from misuses. New York Statistician 31: 1–3, 1980.

8. Glantz, S. A. Primer of Biostatistics (4th ed.). New York: McGraw-Hill, 1995.

9. Jaffe, A. J., and H. F. Spirer. Misused Statistics: Straight Talk for Twisted Numbers. New York: Marcel Dekker, 1987.

10. Keppel, G. Design and Analysis: A Researcher's Handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall, 1991.

11. Lederberg, J. Sloppy research extracts a greater toll than misconduct. The Scientist, February 20: 13, 1995.

12. Lightfoot, J. T. A different method of teaching peer review systems. Am. J. Physiol. 274 (Adv. Physiol. Educ. 19): S57–S61, 1998.

13. Munro, B. H. (Editor). Statistical Methods for Health Care Research (3rd ed.). Philadelphia, PA: Lippincott-Raven, 1997.

14. Nowak, R. Problems in clinical trials go far beyond misconduct. Science 264: 1538–1541, 1994.

15. Riegelman, R. K., and R. P. Hirsch. Studying a Study and Testing a Test: How to Read the Health Science Literature (3rd ed.). Boston, MA: Little, Brown, 1996.

16. Scott, E. M., and J. M. Waterhouse. Physiology and the Scientific Method: The Design of Experiments in Physiology. Manchester, UK: Manchester Univ. Press, 1986.

17. Selwyn, M. R. Principles of Experimental Design for the Life Sciences. Boca Raton, FL: CRC, 1996.

18. Sokal, R. R., and F. J. Rohlf. Biometry: The Principles and Practice of Statistics in Biological Research (3rd ed.). New York: Freeman, 1995.

19. Swazey, J. P., M. S. Anderson, and K. S. Lewis. Ethical problems in academic research. Am. Sci. 81: 542–553, 1993.

20. Tweedie, R. L., K. L. Mengersen, and J. A. Eccleston. Garbage in, garbage out: can statisticians quantify the effects of poor data? Chance 7: 20–27, 1993.

21. Watts, D. G. Why is introductory statistics so difficult to learn? And what can we do to make it easier? American Statistician 45: 290–292, 1991.

22. Williamson, J. W., P. G. Goldschmidt, and T. Colton. The quality of medical literature: an analysis of validation assessments. In: Medical Uses of Statistics, edited by J. C. Bailar III and F. Mosteller. Cambridge, MA: N. Engl. J. Med., 1986, p. 370–391.

23. Zolman, J. F. Biostatistics: Experimental Design and Statistical Inference. New York: Oxford Univ. Press, 1993.