[The Montana Professor 17.1, Fall 2006 <http://mtprof.msun.edu>]
Judith D. Fischer
Brandeis School of Law
University of Louisville
judith.fischer@louisville.edu
Suppose university students rate twenty-nine instructors' teaching on a numeric scale. Suppose further that the students' learning under those instructors is plotted on another scale, and that there is a positive correlation between the two. Thus it appears that the students' ratings say something valid about whether their instructors are good teachers. But now suppose that four instructors whom the students rated in the top half produced learning in the bottom half, while four who were rated in the bottom half produced learning in the top half. That is, eight of the twenty-nine instructors (27%) received student ratings that placed them in the wrong half as determined by student learning. Would it be a good idea to use the student rating scale to make personnel decisions about the instructors, with a cutoff at the median? Would that be fair to the instructors? Would it help the students?
The above data are not hypothetical; they appeared in the report of a study by Stapleton and Murkison./1/ And some universities do use student ratings (sometimes called "student evaluations"/2/) to draw precise lines among instructors./3/ Stapleton and Murkison dramatically illustrated a point that is sometimes overlooked when student ratings are used to make personnel decisions: a statistical generalization is not a universal statement./4/ For example, a particular study might disclose a positive correlation of .40 between student ratings and student learning. A casual reader might think this means the two variables are always found together. But a correlation of .40 has been described as "modest,"/5/ and with reason: it implies that the two variables share only about 16% of their variance (r² = .16), so the factors occur together only some of the time. In highlighting this point, the Stapleton and Murkison study is just one example of the recent scholarship that identifies problems with the widespread student ratings system.
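The arithmetic behind that point is easy to see in a short simulation. The following sketch uses invented numbers, not Stapleton and Murkison's data (only the sample size of twenty-nine is borrowed from their study): it generates ratings and learning scores with a modest positive correlation of about .40 and counts how many instructors land in the wrong half of a median split.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 29                                   # instructors, as in the study
    ratings = rng.normal(size=n)
    # Mix in independent noise so corr(ratings, learning) is roughly .40.
    learning = 0.4 * ratings + np.sqrt(1 - 0.4**2) * rng.normal(size=n)

    r = np.corrcoef(ratings, learning)[0, 1]
    top_by_rating = ratings >= np.median(ratings)
    top_by_learning = learning >= np.median(learning)
    misplaced = np.mean(top_by_rating != top_by_learning)
    print(f"correlation = {r:.2f}, misclassified = {misplaced:.0%}")

Repeated runs of this sketch typically misplace roughly a quarter to a third of the instructors, consistent with the 27% figure above: a genuinely positive correlation still puts many individuals on the wrong side of any cutoff line.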
Student ratings began to gain prominence in university personnel decisions in the 1970s./6/ Evidence of biases in the ratings existed even then,/7/ but it was often dismissed or ignored./8/ As the ratings continued to gain acceptance in the 1980s and 1990s, a common refrain echoed through much of the literature on the subject: student ratings are valid and reliable./9/ Most universities now require anonymous student ratings of instructors and use them for personnel purposes, making them of pivotal importance in many instructors' careers. But throughout this process, little thought was given to whether student ratings might have unintended negative effects on university education./10/
By now there is a wide body of scholarship about student ratings, with a notable portion of it defending them./11/ But in the past ten years, an increasing number of scholars have identified flaws in the student ratings system. This recent scholarship, including my own study,/12/ reveals two general categories of problems: (1) biases in the ratings, and (2) their unintended negative effects.
It is a truism in the literature that students' ratings correlate positively with students' expected grades./13/ Of course, this correlation might not indicate a bias; if highly rated teachers produce more student learning and therefore give appropriately higher grades, this correlation would tend to show the ratings' validity. In the late 1990s, two professors decided to look into the meaning of the correlation.
Greenwald and Gilmore asked students in hundreds of courses at the University of Washington to state on their rating forms what grades they expected. There was a positive correlation of .45 between the students' expected grades and their ratings of their instructors./14/ Given the number of participants in the study, this correlation was statistically significant at the .001 level. Greenwald and Gilmore also found that the positive grades-ratings correlation held within individual classes. Because all the students in a given class had the same instructor, a within-class correlation cannot be explained by differences in teaching quality; they therefore concluded that the overall grades-ratings correlation did not reflect the quality of the teaching./15/ Greenwald and Gilmore also found a negative correlation between workload and grades, and partly on that basis concluded that the grades-ratings correlation is not the result of high student engagement, but rather occurs because students "give high ratings in appreciation for lenient grading."/16/
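The within-class logic can be made concrete with another small simulation. In the hedged sketch below (invented numbers, not Greenwald and Gilmore's data), every student sits in the same class with the same instructor, so teaching quality is constant by construction; yet because the ratings are generated to track expected grades, a strong positive within-class correlation still appears.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 40                                    # students in one hypothetical class
    expected_grade = rng.normal(3.0, 0.5, n)  # one instructor for everyone
    # Ratings track expected grades plus noise; teaching quality never varies.
    rating = 2.0 + 0.6 * expected_grade + rng.normal(0, 0.3, n)

    r, p = stats.pearsonr(expected_grade, rating)
    print(f"within-class r = {r:.2f}, p = {p:.3g}")

Since nothing about the teaching differs from one student to the next, a correlation produced this way cannot be evidence that better teaching earned the higher grades, which is the force of Greenwald and Gilmore's argument.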
Johnson reached the same conclusion after collecting student rating forms from nearly 1,900 Duke University students./17/ He found that the students' ratings of their professors correlated positively with their grades,/18/ but he concluded that these data did not support the teaching-effectiveness explanation because the within-class correlations occurred among students who all had the same instructor./19/ He also found that some students' grades for prerequisite courses correlated negatively with their grades for related advanced courses, suggesting that "prerequisite courses in which professors grade more stringently are more effective in preparing students for advanced courses."/20/ Johnson concluded that "the sensitivity of global measures of teaching effectiveness to biases like...grading leniency, and other factors is...unacceptably high."/21/
Instructor expressiveness has also been shown to correlate positively with student ratings. This correlation, too, might support the ratings' validity if instructor expressiveness leads students to learn more. Williams and Ceci explored this issue when Ceci taught the fall and spring sections of a course in the same way except for being more expressive the second time./22/ In the spring, Ceci incorporated recommendations from a teaching skills course taught by a media consultant who was not an academic. Specifically, Ceci projected enthusiasm by varying his voice pitch and using hand gestures. He received significantly higher ratings from the spring students./23/ Yet Ceci's increased expressiveness did not affect learning as measured by exam scores: the fall and spring students' scores on the same examination were nearly identical./24/ At least in Ceci's case, then, expressiveness represented a bias in the ratings, not a catalyst of student learning.
Recent scholarship has also documented other biases. Timing can affect student ratings; ratings collected after students have received grades or during a final examination period tend to be lower./25/ Students also give higher ratings to courses in which they had a prior interest or which they take as electives, to upper-level courses, and to courses in certain academic fields./26/ Other factors that positively affect ratings include an instructor's attractiveness,/27/ dress,/28/ and sense of humor./29/ The latter trait was found to correlate positively with students' opinions on two unrelated points: whether the instructor was organized and whether he or she encouraged supplemental reading./30/ And although findings on the point are mixed,/31/ being female has been found to affect instructors' student ratings negatively./32/
Moreover, one scholar showed that students can be flatly wrong. They rated highly a speaker they had never heard and a film they had never seen./33/ As Stanley Fish observed, decisions about young scholars' careers have been based on "the ill-informed opinions of transient students with little or no stake in the enterprise...in full knowledge that nothing they said would ever be questioned or challenged."/34/ But unfairness to instructors is not the only harm student ratings cause. They negatively affect the entire educational enterprise and even their supposed beneficiaries, the students.
Grade inflation. The very correlation between grades and ratings suggests that instructors have an incentive to inflate grades in order to get the ratings they need to keep their jobs or advance professionally./35/ As former professor Peter Sacks said, "Placing significant weight on student evaluations produces the unwholesome incentive in all-too-human teachers to give out lots of good grades to make students happy."/36/ While Sacks believed that "most educators would never openly concede that this happens at their institutions,"/37/ he himself told of giving "outrageously good grades" in order to affect his student ratings, which did rise./38/ Some of his superiors even encouraged him to improve his ratings this way./39/ Another writing professor admitted giving higher grades to improve his ratings,/40/ and a law school dean attributed grade inflation partly to student ratings./41/
These anecdotal reports received support in a study in the late 1990s, in which 70% of faculty respondents reported that their university's reliance on student ratings was an incentive to give higher grades./42/
Dilution of rigor. Accumulating evidence shows that student ratings also lead to diluted rigor in university courses. As Trout wrote, "The most effective device for lowering standards is the numerical [student] evaluation form."/43/ Some instructors' efforts to obtain higher ratings will improve their teaching, but some of those efforts will be tangential or even antithetical to good teaching. Instructors may cater to students who would prefer not to be challenged./44/ Indeed, one professor's sardonic advice on how to get good ratings included suggestions to "Show lots of films" and "Teach what they want how they want it."/45/ As one commentator put it, "[i]t is very hard to educate people you have reason to fear; it is often possible to please them, and where the rewards for flattering them are great, it will be hard not to."/46/
Several university teachers have reported lowering course rigor to obtain better ratings. Peter Sacks described his "sandbox experiment," in which he lowered his standards in order to improve his ratings./47/ Another professor wrote that he decided to tell the students what they wanted to hear about their writing, "praising them however much they foundered." The result: his ratings improved./48/ Yet another professor reported changing his teaching "for the worse" in service of the ratings./49/
These anecdotal reports were supported by a recent survey of faculty members in one university, where 72% of the respondents said student ratings encouraged them to "water down" course content./50/
My own survey added some data on this point. In the spring of 2002, I sent a survey to the 300 members of the Association of Legal Writing Directors (ALWD)./51/ Members of that organization teach legal writing in law schools, and most supervise other teachers of legal writing. My questions about the effects of student ratings on course rigor generated telling responses. Of the fifty-two ALWD members who responded to the survey, 33% reported that they or teachers they supervise had refrained from doing something they believed pedagogically sound because it might negatively affect their student ratings. This percentage may be understated, given instructors' natural reluctance to report any lowering of standards. But even taken at face value, it documents a matter of concern: instructors are diluting course rigor out of worry about their student ratings.
This means that student ratings damage their purported beneficiaries, the students. Students who fail to learn are encouraged to blame their professors instead of themselves./52/ Weak students may thus acquire a false sense of competence, while students who do want high standards find their educations devalued by lowered rigor./53/ Indeed, the whole ratings process "profoundly reverses fundamental authority relations,"/54/ prompting Stanley Fish to ask in an article title, "Who's in charge here?"/55/ He noted that the ratings present students with "invitations to grind axes without fear of challenge or discovery."/56/ That does not have a salutary effect on students' character./57/ Moreover, the curriculum itself may be watered down if administrators make curricular decisions based on student ratings.
All of this makes the educational enterprise seem uncomfortably like "retail merchandising," in which it is "appropriate to cater to the consumer's interest."/58/ As one professor summarized this approach, "[I]t does not matter what the student learns, or even whether he learns anything; what matters is whether he is satisfied with the process. If the student likes the service he is getting, his teacher is good; otherwise, not."/59/ But this notion misperceives the role of the student, who is less the consumer than the product of education, while it shortchanges the students' future employers and clients. It also misperceives the role of the professor, who must sometimes challenge or correct students and at times is even "duty bound to displease."/60/
Why would universities continue to use student ratings and sometimes rely heavily on them for personnel purposes, in light of their demonstrated biases and negative effects? The answer may lie in their very ease of administration. All the university needs to do to collect student ratings is take about ten minutes out of each course to distribute and collect the forms and then compile them electronically. This produces numbers and percentages that have a seductive appearance of certainty. Although student ratings scholars have repeatedly stressed that the forms' data should not be used to support numeric comparisons among instructors,/61/ some administrators indulge in the "micrometer fallacy," the belief that simply because a number can be calculated it conveys meaningful information./62/ The numbers' seeming certainty lulls evaluators into believing that they need not invest significant time or consider the multidimensionality of teaching in order to make personnel decisions. And the numbers can be presented to a board or legislature as evidence that the university takes teaching seriously.
But mere ease of administration is no excuse for universities' continued reliance on a system that presents flawed data and causes grade inflation and lower standards. Student evaluations can be helpful when collected and maintained by individual professors for their own information. But they should not be used for personnel purposes. At the very least, if a university or department decides to continue using them, student ratings should be treated as only one piece of evidence among others, including peer evaluations and teaching portfolios./63/ And those who evaluate faculty members should view student ratings with the profound skepticism they deserve.
Notes