[The Montana Professor 17.1, Fall 2006 <http://mtprof.msun.edu>]
Judith D. Fischer
Brandeis School of Law
University of Louisville
judith.fischer@louisville.edu
Suppose university students rate twenty-nine instructors' teaching on a numeric scale. Suppose further that the students' learning under those instructors is plotted on another scale, and that there is a positive correlation between the two. Thus it appears that the students' ratings say something valid about whether their instructors are good teachers. But now suppose that four instructors whom the students rated in the top half produced learning in the bottom half, while four who were rated in the bottom half produced learning in the top half. That is, eight of the twenty-nine instructors (27%) received student ratings that placed them in the wrong half as determined by student learning. Would it be a good idea to use the student rating scale to make personnel decisions about the instructors, with a cutoff at the median? Would that be fair to the instructors? Would it help the students?
The above data are not hypothetical; they appeared in the report of a study by Stapleton and Murkison./1/ And some universities do use student ratings (sometimes called "student evaluations"/2/) to draw precise lines among instructors./3/ Stapleton and Murkison dramatically illustrated a point that is sometimes overlooked when student ratings are used to make personnel decisions: a statistical generalization is not a universal statement./4/ For example, a particular study might disclose a positive correlation of .40 between student ratings and student learning. A casual reader might think this means the two variables are always found together. But a correlation of .40 has been described as "modest,"/5/ and with reason: it implies that the two variables share only about 16% of their variance (r² = .16), so the factors occur together only some of the time. In highlighting this point, the Stapleton and Murkison study is just one example of the recent scholarship that identifies problems with the widespread student ratings system.
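The arithmetic behind that point is easy to see in a short simulation. The following sketch uses invented numbers, not Stapleton and Murkison's data (only the sample size of twenty-nine is borrowed from their study): it generates ratings and learning scores with a modest positive correlation of about .40 and counts how many instructors land in the wrong half of a median split.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 29                                   # instructors, as in the study
    ratings = rng.normal(size=n)
    # Mix in independent noise so corr(ratings, learning) is roughly .40.
    learning = 0.4 * ratings + np.sqrt(1 - 0.4**2) * rng.normal(size=n)

    r = np.corrcoef(ratings, learning)[0, 1]
    top_by_rating = ratings >= np.median(ratings)
    top_by_learning = learning >= np.median(learning)
    misplaced = np.mean(top_by_rating != top_by_learning)
    print(f"correlation = {r:.2f}, misclassified = {misplaced:.0%}")

Repeated runs of this sketch typically misplace roughly a quarter to a third of the instructors, consistent with the 27% figure above: a genuinely positive correlation still puts many individuals on the wrong side of any cutoff line.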
Student ratings began to gain prominence in university personnel decisions in the 1970s./6/ Evidence of biases in the ratings existed even then,/7/ but it was often dismissed or ignored./8/ As the ratings continued to gain acceptance in the 1980s and 1990s, a common refrain echoed through much of the literature on the subject: student ratings are valid and reliable./9/ Most universities now require anonymous student ratings of instructors and use them for personnel purposes, making them of pivotal importance in many instructors' careers. But throughout this process, little thought was given to whether student ratings might have unintended negative effects on university education./10/
By now there is a wide body of scholarship about student ratings, with a notable portion of it defending them./11/ But in the past ten years, an increasing number of scholars have identified flaws in the student ratings system. This recent scholarship, including my own study,/12/ reveals two general categories of problems: (1) biases in the ratings, and (2) their unintended negative effects.
It is a truism in the literature that students' ratings correlate positively with students' expected grades./13/ Of course, this correlation might not indicate a bias; if highly rated teachers produce more student learning and therefore give appropriately higher grades, this correlation would tend to show the ratings' validity. In the late 1990s, two professors decided to look into the meaning of the correlation.
Greenwald and Gilmore asked students in hundreds of courses at the University of Washington to state on their rating forms what grades they expected. There was a positive correlation of .45 between the students' expected grades and their ratings of their instructors./14/ Given the number of participants in the study, this correlation was statistically significant at the .001 level. Greenwald and Gilmore also found that the positive grades-ratings correlation held within individual classes. Because all the students in a given class had the same instructor, a within-class correlation cannot be explained by differences in teaching quality; they therefore concluded that the overall grades-ratings correlation did not reflect the quality of the teaching./15/ Greenwald and Gilmore also found a negative correlation between workload and grades, and partly on that basis concluded that the grades-ratings correlation is not the result of high student engagement, but rather occurs because students "give high ratings in appreciation for lenient grading."/16/
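The within-class logic can be made concrete with another small simulation. In the hedged sketch below (invented numbers, not Greenwald and Gilmore's data), every student sits in the same class with the same instructor, so teaching quality is constant by construction; yet because the ratings are generated to track expected grades, a strong positive within-class correlation still appears.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 40                                    # students in one hypothetical class
    expected_grade = rng.normal(3.0, 0.5, n)  # one instructor for everyone
    # Ratings track expected grades plus noise; teaching quality never varies.
    rating = 2.0 + 0.6 * expected_grade + rng.normal(0, 0.3, n)

    r, p = stats.pearsonr(expected_grade, rating)
    print(f"within-class r = {r:.2f}, p = {p:.3g}")

Since nothing about the teaching differs from one student to the next, a correlation produced this way cannot be evidence that better teaching earned the higher grades, which is the force of Greenwald and Gilmore's argument.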
Johnson reached the same conclusion after collecting student rating forms from nearly 1,900 Duke University students./17/ He found that the students' ratings of their professors correlated positively with their grades,/18/ but he concluded that these data did not support the teaching-effectiveness explanation because the within-class correlations occurred among students who all had the same instructor./19/ He also found that some students' grades for prerequisite courses correlated negatively with their grades for related advanced courses, suggesting that "prerequisite courses in which professors grade more stringently are more effective in preparing students for advanced courses."/20/ Johnson concluded that "the sensitivity of global measures of teaching effectiveness to biases like...grading leniency, and other factors is...unacceptably high."/21/
Instructor expressiveness has also been shown to correlate positively with student ratings. This correlation, too, might support the ratings' validity if instructor expressiveness leads students to learn more. Williams and Ceci explored this issue when Ceci taught the fall and spring sections of a course in the same way except for being more expressive the second time./22/ In the spring, Ceci incorporated recommendations from a teaching skills course taught by a media consultant who was not an academic. Specifically, Ceci projected enthusiasm by varying his voice pitch and using hand gestures. He received significantly higher ratings from the spring students./23/ Yet Ceci's increased expressiveness did not affect learning as measured by exam scores: the fall and spring students' scores on the same examination were nearly identical./24/ At least in Ceci's case, then, expressiveness represented a bias in the ratings, not a catalyst of student learning.
Recent scholarship has also documented other biases. Timing can affect student ratings; ratings collected after students have received grades or during a final examination period tend to be lower./25/ Students also give higher ratings to courses in which they had a prior interest or which they take as electives, to upper-level courses, and to courses in certain academic fields./26/ Other factors that positively affect ratings include an instructor's attractiveness,/27/ dress,/28/ and sense of humor./29/ The latter trait was found to correlate positively with students' opinions on two unrelated points: whether the instructor was organized and whether he or she encouraged supplemental reading./30/ And although findings on the point are mixed,/31/ being female has been found to affect instructors' student ratings negatively./32/
Moreover, one scholar showed that students can be flatly wrong. They rated highly a speaker they had never heard and a film they had never seen./33/ As Stanley Fish observed, decisions about young scholars' careers have been based on "the ill-informed opinions of transient students with little or no stake in the enterprise...in full knowledge that nothing they said would ever be questioned or challenged."/34/ But unfairness to instructors is not the only harm student ratings cause. They negatively affect the entire educational enterprise and even their supposed beneficiaries, the students.
Grade inflation. The very correlation between grades and ratings suggests that instructors have an incentive to inflate grades in order to get the ratings they need to keep their jobs or advance professionally./35/ As former professor Peter Sacks said, "Placing significant weight on student evaluations produces the unwholesome incentive in all-too-human teachers to give out lots of good grades to make students happy."/36/ While Sacks believed that "most educators would never openly concede that this happens at their institutions,"/37/ he himself told of giving "outrageously good grades" in order to affect his student ratings, which did rise./38/ Some of his superiors even encouraged him to improve his ratings this way./39/ Another writing professor admitted giving higher grades to improve his ratings,/40/ and a law school dean attributed grade inflation partly to student ratings./41/
These anecdotal reports received support in a study in the late 1990s, in which 70% of faculty respondents reported that their university's reliance on student ratings was an incentive to give higher grades./42/
Dilution of rigor. Accumulating evidence shows that student ratings also lead to diluted rigor in university courses. As Trout wrote, "The most effective device for lowering standards is the numerical [student] evaluation form."/43/ Some instructors' efforts to obtain higher ratings will improve their teaching, but some of those efforts will be tangential or even antithetical to good teaching. Instructors may cater to students who would prefer not to be challenged./44/ Indeed, one professor's sardonic advice on how to get good ratings included suggestions to "Show lots of films" and "Teach what they want how they want it."/45/ As one commentator put it, "[i]t is very hard to educate people you have reason to fear; it is often possible to please them, and where the rewards for flattering them are great, it will be hard not to."/46/
Several university teachers have reported lowering course rigor to obtain better ratings. Peter Sacks described his "sandbox experiment," in which he lowered his standards in order to improve his ratings./47/ Another professor wrote that he decided to tell the students what they wanted to hear about their writing, "praising them however much they foundered." The result: his ratings improved./48/ Yet another professor reported changing his teaching "for the worse" in service of the ratings./49/
These anecdotal reports were supported by a recent survey of faculty members in one university, where 72% of the respondents said student ratings encouraged them to "water down" course content./50/
My own survey added some data on this point. In the spring of 2002, I sent a survey to the 300 members of the Association of Legal Writing Directors (ALWD)./51/ Members of that organization teach legal writing in law schools, and most supervise other teachers of legal writing. My questions about the effects of student ratings on course rigor generated telling responses. Of the fifty-two ALWD members who responded to the survey, 33% reported that they or teachers they supervise had refrained from doing something they believed pedagogically sound because it might negatively affect their student ratings. This percentage may be understated, given instructors' natural reluctance to report any lowering of standards. But even taken at face value, it documents a matter of concern: instructors are diluting course rigor out of worry about their student ratings.
This means that student ratings damage their purported beneficiaries, the students. Students who fail to learn are encouraged to blame their professors instead of themselves./52/ Weak students may thus acquire a false sense of competence, while students who do want high standards find their educations devalued by lowered rigor./53/ Indeed, the whole ratings process "profoundly reverses fundamental authority relations,"/54/ prompting Stanley Fish to ask in an article title, "Who's in charge here?"/55/ He noted that the ratings present students with "invitations to grind axes without fear of challenge or discovery."/56/ That does not have a salutary effect on students' character./57/ Moreover, the curriculum itself may be watered down if administrators make curricular decisions based on student ratings.
All of this makes the educational enterprise seem uncomfortably like "retail merchandising," in which it is "appropriate to cater to the consumer's interest."/58/ As one professor summarized this approach, "[I]t does not matter what the student learns, or even whether he learns anything; what matters is whether he is satisfied with the process. If the student likes the service he is getting, his teacher is good; otherwise, not."/59/ But this notion misperceives the role of the student, who is less the consumer than the product of education, while it shortchanges the students' future employers and clients. It also misperceives the role of the professor, who must sometimes challenge or correct students and at times is even "duty bound to displease."/60/
Why would universities continue to use student ratings and sometimes rely heavily on them for personnel purposes, in light of their demonstrated biases and negative effects? The answer may lie in their very ease of administration. All the university needs to do to collect student ratings is take about ten minutes out of each course to distribute and collect the forms and then compile them electronically. This produces numbers and percentages that have a seductive appearance of certainty. Although student ratings scholars have repeatedly stressed that the forms' data should not be used to support numeric comparisons among instructors,/61/ some administrators indulge in the "micrometer fallacy," the belief that simply because a number can be calculated it conveys meaningful information./62/ The numbers' seeming certainty lulls evaluators into believing that they need not invest significant time or consider the multidimensionality of teaching in order to make personnel decisions. And the numbers can be presented to a board or legislature as evidence that the university takes teaching seriously.
But mere ease of administration is no excuse for universities' continued reliance on a system that presents flawed data and causes grade inflation and lower standards. Student evaluations can be helpful when collected and maintained by individual professors for their own information. But they should not be used for personnel purposes. At the very least, if a university or department decides to continue using them, student ratings should be treated as only one piece of evidence among others, including peer evaluations and teaching portfolios./63/ And those who evaluate faculty members should view student ratings with the profound skepticism they deserve.
Notes