In his Fall 1997 piece in The Montana Professor, "How to Improve Your Teaching Evaluation Scores Without Improving Your Teaching," Paul Trout advances the following critical arguments with respect to the use of student-generated numerical evaluation forms (SNEFs) in assessing faculty teaching:
Trout provides some interesting evidence for each of these claims, and almost anyone familiar with student evaluations will recognize that they contain a grain of truth. I contend here, however, that Trout has misinterpreted some of the evidence and ignored many of its important implications. The role of SNEFs in the evaluation of teaching certainly deserves critical examination, but not the trashing Trout gives it.
Many validity studies, which test whether SNEF scores predict measures of student achievement, have been done, with very diverse results. In some instances, SNEFs turn out to be modestly successful predictors of student achievement; in others, they are not. Researchers have attempted to draw reliable conclusions from these diverse and conflicting findings, both through formal meta-analyses (see Abrami et al., 1990, for a discussion of these) and through more seat-of-the-pants methods. But even these attempts have failed to produce a consensus. Trout cites authors who conclude that SNEFs are generally invalid, but others have found that the body of research suggests significant validity (see, for example, McKeachie, 1990, and Cohen, 1981). In general, it seems to me, what the research tells us about the validity of SNEFs is an open question (1).
A fundamental objection to all such studies of validity concerns the assumption that good teaching can be defined in terms of "student learning." Obviously, the object of the exercise is that students learn something. But what? Most teachers would probably argue that they want students to learn some specific course content (how Beckett uses language, what economies of scale are, when the Pre-Cambrian ended, etc.), but also some higher-order skills (how to calculate, think logically, write a clear paragraph, and so forth) and attitudes (curiosity, open-mindedness, intellectual enthusiasm, and the like). The latter outcomes are less well defined and measured than the former, and are rarely explicitly assessed by the standardized exams used as measures of student learning in SNEF validity studies. If validity studies employ invalid measures of learning, they are not telling us much.
Of course, it might plausibly be objected that standardized exams do tell us a good deal about student learning. We would not give finals or rely on ACTs or GREs unless we had some faith in that regard. But if we can measure student learning, and if we define good teaching in terms of student learning, we obviously have no need for SNEFs at all. Why not simply evaluate professors, and departments and whole institutions for that matter, by what students learn? That, after all, is what "outcomes-based assessment" is all about.
The problem here, of course, is that while good teachers have the responsibility to create the conditions under which learning can occur, whether it does occur depends on the abilities, motivations and attitudes of students. If a university becomes more selective in admissions, the learning of students will improve. Is this the simple route to better teaching? If my students in introductory economics, disgruntled by the fact that they must take my course and seeing no earthly reason to do so, stop attending and learn nothing, am I a bad teacher?
Trout, in particular, should be sensitive to this issue. He is in despair at contemporary students, who, he says, are "increasingly under-prepared for and disengaged from rigorous academic study" and want courses that have "few demands and high grades." He suggests that they are impressionable (they like it when professors use powerful vocabularies), shallow (they care about the clothes their professors wear), childish (they respond to nurturing), easily bored, lazy, politically biased, readily hoodwinked with flattery and sob stories, and handily subornable with good grades and cookies. If we mistrust their judgments on SNEFs, we surely do not want to measure our teaching ability by what this ragtag band is prepared to learn!
What students learn is obviously important, but it should not be the only element in a definition of good teaching. The things professors do in the classroom and office--the quality of their preparation and presentation, the validity of their examinations and grading, their understanding of how students learn, the helpfulness of their assignments--are all arguably aspects of good teaching regardless of how much students learn. Good teaching should be judged by both outcomes and process.
Lest I be accused of setting up a straw man here, I should make it clear that Trout does not explicitly endorse outcomes-based assessment of teaching, although that seems to be an inescapable implication of assuming that good teaching is defined by what students learn. On the other hand, if good teaching is to be judged by both outcomes and process, then validity studies based on outcomes alone, i.e., student learning, and not even the full range of outcomes at that, are not very useful.
The public relations strategies of institutions are obviously highly diverse, and so Trout may be correct in saying that some colleges and universities point to their student evaluation practices as evidence of their commitment to good teaching. But that's not likely to get them very far. For something like two decades now, education at all levels has been the subject of studies, reports, and analyses which generally lament the declining level of student achievement; higher education, in particular, has been attacked for what is seen as a lackluster commitment to teaching. This work has spawned a reform movement which emphasizes, above all else, outcomes, standards and accountability. As a result, most institutions try to demonstrate the quality of their teaching by what their students achieve, in GRE scores, scholarships, employment and so forth. In this environment, any institution that tried to defend its teaching by pointing to the fact that it permits students to evaluate faculty would be considered, frankly, laughable.
Do SNEF scores really determine the distribution of faculty rewards and thereby create incentives for faculty to cave in to the basest of student demands? Like public relations strategies, the reward structures of colleges and universities are no doubt highly diverse. But the most common complaint voiced about those structures, by students, faculty, and the public, is that they fail to reward good teaching properly (2). At research institutions, at any rate, woe betide the young assistant professor who goes up before the Tenure Committee with a handful of excellent undergraduate SNEF scores but less than the requisite number of books, articles and grants. At colleges and universities that identify themselves as teaching institutions, student evaluations presumably count for much more.
Whatever the faculty reward structure was in the past, educational reform, with its demand for outcomes and accountability, is pushing institutions away from reliance on SNEFs. Consider two examples of particular relevance in Montana.
The University of Montana is scheduled for a Northwest Accreditation Review two years from now. The University has already been told that standards for accreditation will shift from the measurement of educational inputs (library books, lab equipment, student-teacher ratios, etc.) to outcomes, measured, for example, by performance on GREs. As a first step in this process, faculty are being asked to include a statement on all course syllabi for the 1998 spring semester clearly indicating course goals and how student achievement of those goals will be tested. This may be unobjectionable as a general practice, but it seems clear that when student achievement becomes the gold standard of accreditation, it, and not SNEFs, will also be the gold standard of faculty retention, promotion, tenure and salary decisions.
A second example is the Western Governors' Conference Virtual University, a scheme by which students will take courses provided by various regional universities over the Internet, by TV, or through other distance-learning technologies. When Jeff Baker, the former Commissioner of Higher Education, first presented this proposal to the Montana higher education community, he represented it as a specific response to complaints from the "business community" that college and university graduates were poorly trained. The Virtual University would address this problem by providing students with courses of specific content, unmediated by the traditional student-teacher relationship, and evaluated strictly in terms of outcomes (3). If this experiment is successful, and the Virtual University begins to compete seriously for resources (including students), existing institutions will be forced to modify their practices in the direction of greater specificity of outcomes and accountability. I do not want to suggest here that Trout has taken a position on either of these initiatives, whatever their merits or demerits might be. The point is rather that they are evidence that the current public and political environment in which higher education operates is such that Trout's account of the role of SNEFs in institutional decision making is not very compelling. It's not the Sixties anymore.
First of all, if process, and not just outcome, is to be part of the definition of good teaching, then several of the things that Trout says teachers can do to improve their SNEF scores can clearly improve their teaching as well. Using a powerful vocabulary, speaking clearly, providing interesting and lively lectures, and establishing a sense of openness with and concern for students are all things that students like and rate highly, Trout tells us. If Trout's students learn as much as mine do (4), and Trout displays these characteristics while I don't (I'm bumbling, boring, cold and distant and use the vocabulary of a thirteen-year-old), might not one legitimately conclude (5) that Trout is a better teacher than I am?
Second, it is difficult to know what to make of the correlation that Trout reports between specific faculty behaviors and evaluations when those behaviors are interrelated in some unknown way. Consider, for example, Trout's claim that students' first, non-verbal impressions are extremely good predictors of how they will rate their instructors. The correlation, we are told, is "as near perfect as one finds in psychology." This is an unusual way of reporting a level of correlation, but let's assume that it means that, say, 80 percent of the variation in SNEF scores is explained by first, non-verbal impressions. In other words, before you even open your mouth on the first day of class, students form a relatively unshakable opinion about you that will surface on their SNEFs four months later. If this is true, how then are their evaluations to be influenced by your dress, affect, vocabulary, entertainment value, demands for work, political attitudes and grading practices, which Trout argues also have powerful effects? It's possible, and maybe even likely, that professors who are easy going and easy grading are also politically and personally congenial with their students, buy them pizzas, deliver knockout lectures and create a good first impression (6). Then all these behaviors would be correlated with each other and with good SNEF scores, but at that point it becomes difficult, statistically, to figure out just what it is the students are valuing. It is unclear whether the researchers Trout cites grappled with this problem.
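To see the statistical difficulty concretely, consider a minimal simulation. This is a sketch only: the variable names ("congeniality," "lively_lectures," and so on) and all the coefficients are illustrative assumptions of mine, not estimates from any study Trout cites. A single underlying trait drives several of the behaviors Trout lists as well as the SNEF score itself, and the result is exactly the interpretive muddle described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical model: one latent trait ("congeniality") drives several
# observed behaviors and the SNEF score. All names and coefficients are
# assumptions for illustration, not values from the cited research.
congeniality = rng.normal(size=n)
first_impression = congeniality + 0.3 * rng.normal(size=n)
easy_grading = congeniality + 0.3 * rng.normal(size=n)
lively_lectures = congeniality + 0.3 * rng.normal(size=n)
snef_score = congeniality + 0.5 * rng.normal(size=n)

# Each behavior, taken alone, correlates strongly with the SNEF score.
for name, x in [("first impression", first_impression),
                ("easy grading", easy_grading),
                ("lively lectures", lively_lectures)]:
    print(name, round(float(np.corrcoef(x, snef_score)[0, 1]), 2))

# But a joint regression splits the shared variance roughly evenly among
# the three predictors, and resampling reshuffles the split: the data
# cannot say which behavior the students are actually valuing.
X = np.column_stack([np.ones(n), first_impression,
                     easy_grading, lively_lectures])
beta, *_ = np.linalg.lstsq(X, snef_score, rcond=None)
print(beta)
```

The point is not that this particular causal story is the right one, but that strong pairwise correlations of the sort Trout reports are compatible with many such stories at once.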
Third, the evidence that Trout cites for some of his most disturbing contentions is weak. Consider, for example, the notion that flattering the political biases of students will improve SNEF scores. As evidence, Trout cites Stanley Coren, who "discovered that a quarter of the students were apt to interpret the [objective] presentation of [scientific] evidence about the genetic and racial differences in intelligence as motivated by racism, rendering the professor a racist for twenty-five percent of students." We are not told who "the students" are, and we are apparently expected to assume that the twenty-five percent will give the instructor a bad rating. Even if that is true, will flattering the biases of that twenty-five percent by, say, presenting the evidence in a negative and non-objective manner, raise the professor's SNEF score? Shouldn't he or she be at all worried about the reaction of the seventy-five percent who apparently recognize objectivity when they see it and who might become resentful (7) if their professor failed to practice it? Trout also cites Alan Dershowitz's conflict with his Harvard Law School students, who accused Dershowitz of sexism for his classroom treatment of rape. It is not clear what general conclusions we should draw from this incident. We have Trout's assurance that Dershowitz presented the material "dispassionately" (one assumes that this characterization comes from Dershowitz himself). Must we then assume that the entire "sizable group" of Harvard law students who found the presentation offensive were simply blinded by their biases or too dull-witted to recognize Dershowitz's use of the hypothetical? It could be, but it seems a little disingenuous of Trout to take Dershowitz's word for it.
Trout acknowledges that "researchers still debate whether or not instructors can out-and-out buy high ratings with high grades," but then argues that "ample evidence" suggests that they can. The problem is that researchers still debate this issue because there is ample evidence to the contrary, as well.
Trout, quoting Max Holcutt, describes the outcome of a study of several thousand courses at Holcutt's university: "students definitely give a higher rating to teachers who grade higher. The coefficient of correlation is a low but significant .38. This correlation might have been higher had the study considered not actual but expected grade." The difficulty in interpreting this correlation, and others Trout mentions, is that there are several possible causal structures that may account for it. One is, of course, the bribery explanation favored by Trout: reward the students more and they will reward you. But Howard (whom Trout cites) proposes two other possibilities: teach the students more, or admit more highly motivated, enthusiastic students. In either case, students will get better grades and give you higher ratings; grades and ratings will then be correlated but not causally linked. These three explanations are not mutually exclusive, of course, and the problem is to determine what part, if any, of the reported correlation is attributable to bribery. Howard reports two studies that use path analysis to sort out the bribery effect and find it quite small. Similarly, Seiver, using a different methodological approach (two-stage least squares regression), finds the bribery effect very small and statistically insignificant (8).
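A back-of-the-envelope sketch makes the point, with the caveat that the coefficients below are illustrative assumptions of mine, not estimates from Howard or Seiver. When a common cause such as teaching quality or student motivation drives both grades and ratings, a correlation of roughly Holcutt's .38 emerges with no bribery path at all:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Hypothetical model: a common cause (teaching quality, or student
# motivation) drives both grades and ratings. There is deliberately
# NO direct "bribery" path from grades to ratings here.
quality = rng.normal(size=n)
grades = 0.8 * quality + rng.normal(size=n)
ratings = 0.8 * quality + rng.normal(size=n)

r = np.corrcoef(grades, ratings)[0, 1]
print(f"r = {r:.2f}, r^2 = {r * r:.2f}")
# r comes out near .4 -- about Holcutt's .38 -- even though inflating a
# grade in this model would do nothing to the rating. Note too that
# r = .38 implies grades "explain" only about 14 percent (r squared) of
# the variance in ratings in the first place.
```

Path analysis and two-stage least squares are, in effect, attempts to impose enough structure on models like this one to separate the direct (bribery) path from the common-cause paths.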
Finally, Trout argues that you can improve your evaluations by "sucking up" to your students, i.e. letting them out early, canceling Friday classes, bringing in cookies or pizza, having a party, and the like. He admits there is no formal evidence that this will work, but it must "because some instructors invest a whole lot of money into bringing pizzas to class." In a more gracious age (perhaps when the readers of The Montana Professor were students themselves), it was customary for faculty to form a warm relationship with students, to know something about their personal lives and families, to display some concern for their well-being, to provide them with refreshments during a seminar break, to go out for a beer after an evening class, or even to bring pizza to class (9). It is distressing that Trout feels it safe to assume that any effort that faculty make in that direction these days is evidence of "sucking up."
Obviously, SNEF scores are not immune from contamination by judgments that students make about teaching on spurious, irrelevant or inappropriate grounds. Trout evidently feels that the contamination is complete and that student evaluations are therefore entirely invalid. But he misinterprets the evidence. Student evaluations do provide us with useful information regarding the quality of teaching and, used judiciously in conjunction with other types of information, they should play a role in the evaluation of faculty teaching performance.
Abrami, Philip C., Sylvia d'Apollonia, and Peter Cohen, "Validity of Student Ratings of Instruction: What We Know and What We Do Not," Journal of Educational Psychology, 82(2), 1990, 219-231.
Cohen, P.A., "Student Ratings of Instruction and Student Achievement: A Meta-Analysis of Multi-Section Validity Studies," Review of Educational Research, 51, 1981, 281-309.
Howard, George S. and Scott E. Maxwell, "Correlation Between Student Satisfaction and Grades: A Case of Mistaken Causation?", Journal of Educational Psychology, 72(6), 1980, 810-820.
McKeachie, Wilbert J., "Research on College Teaching: The Historical Background," Journal of Educational Psychology, 82(2), 1990, 189-200.
Seiver, David, "Evaluation and Grades: A Simultaneous Framework," Journal of Economic Education, 14(3), 1983, 32-39.
Sullivan, Arthur M. and Graham R. Skanes, "Validity of Student Evaluation of Teaching and the Characteristics of Successful Instructors," Journal of Educational Psychology, 66(4), 1974, 584-590.
Trout, Paul, "How To Improve Your Teaching Evaluation Score Without Improving Your Teaching!" The Montana Professor, 7(3), Fall 1997, 17-22.