[The Montana Professor 14.2, Spring 2004 <http://mtprof.msun.edu>]
Grade Inflation: A Crisis in College Education
Valen E. Johnson
New York: Springer-Verlag, 2003
262 pages, $34.95 hc
Warren Esty
Mathematical Sciences
MSU-Bozeman
What are the relationships among grades, student learning, teacher effectiveness, and student teaching evaluations? Are grades awarded by the same standards across departments? Does the perception that some courses or departments give higher grades influence which courses students decide to take, and possibly even influence their selection of a major?
Valen Johnson is a professor of biostatistics at the University of Michigan who wades into the analysis of dozens of statistical studies of these questions with impeccable credentials. He is a former professor of statistics and decision sciences at Duke and holds the high honor of having been elected a Fellow of the American Statistical Association.
His intent is to analyze the current impacts on students and faculty of current grading schemes, not to demonstrate that "inflation" has occurred. However, when he mentions that 44 to 46 percent of all grades awarded are As or A-minuses at some top schools like Dartmouth, Harvard, and Duke, he certainly expects the reader to react that those grades are high. He expects that the reader will have heard of the old theory that "C" means "average," but does not bother to document that it ever reflected reality, or even that grades are drifting upwards, because the past plays no role in his analysis. Awarding 45 percent As and A-minuses is incompatible with an average of C, so there is some connection with "inflation," and his title is not too misleading. In any case, it will attract the readership he wants--readers who are concerned about what we can simply call "high grades."
Johnson's concern is not high grades per se but rather the impact on students, and even faculty, of current grading schemes that are not equable. This might better be called "grade disparity." Grades influence students' choices of courses and majors, the jobs they can get, and the graduate schools they can get into. Grades influence teacher evaluations and course enrollments, causing faculty who give lower grades to be less likely to receive tenure, promotion, or raises. After analyzing the validity of studies that document grade disparities, Johnson quantifies the effects, finds they are surprisingly large, and concludes "Disparities in student grading have resulted in a general degradation of America's post-secondary educational system. Inconsistent grading standards result in unjust evaluation of students and faculty, and discourage students from taking those courses that would be of greatest benefit to them" (239)--the "Crisis in College Education" of his subtitle.
Johnson immediately acknowledges the "deep philosophical divisions that exist between members of the academic community regarding the meaning and purpose of grades" (3). He quotes attacks on high grades that others have articulated. "In many courses, faculty members are giving out relatively high grades for average or sub-par work" (4). "Grade inflation compresses all grades at the top, making it difficult to discriminate the best from the very good" (4). Some faculty give higher grades than students deserve in order to make their teacher evaluations better (49). Mean grades awarded in disciplines with less-able students are actually higher than mean grades awarded in disciplines with better students (199). The list of attacks is long.
On the other hand, every attack has prompted a corresponding defense. "The fact that the average student now receives a B instead of a C has not intrinsically devalued our educational system. A grade of C represents nothing more than an ordered categorical response" (196). "The global questioning of tenets once held to be singularly true allows a larger number of students to display with greater diversity a legitimate and appropriate grasp of a widened content" (8). Giving high grades does not influence teacher evaluations; on the contrary, good teaching leads to both high evaluations and high grades (10). "Discipline-specific abilities underlie observed differences in grading stringency between departments" (205). "The fields with 'easier' grading standards may have students who are more diligent, creative, or possess more of unmeasured traits which are positively related to GPA" (201). Some universities simply have better students (6).
Johnson considers, one after another, the numerous philosophical divisions and various defenses to attacks on high grades. They are often quoted at length so the arguments for awarding high grades are made clear. He observes that each defense relies on one or more assumptions, or assertions of supposed fact, without which the argument for the defense fails. Then five key "facts" for the defense are identified that are subject to statistical investigation. Instead of being facts, Johnson intends to show they are "myths."
The "myths" are:
"student grades do not bias student evaluations of teaching,
student evaluations of teaching provide reliable measures of instructional effectiveness,
high course grades imply high levels of student achievement,
student course selection decisions are unaffected by expected grading practices,
grades assigned in an unregulated academic environment have a consistent and objective meaning across classes, departments, and institutions" (9).
Statistically, he concludes the myths are false because the data from numerous investigations, including his own, prove them false; furthermore, the effects are not merely statistically discernible but actually large.
I think that no one statistical study could possibly convince critics. There are too many potential objections. Maybe high grades are going to better students. Maybe students are working harder for those grades. Maybe modern grading schemes elicit better work. However, Johnson does not have just one study to consider. If you can think of some reason why high grades might, or might not, be justified, so has some previous researcher who has designed an experiment to test that hypothesis. Johnson assembles the studies that address the questions above and analyzes their experimental designs, with potential objections in mind, to see if they really are suitable to their goals, and reports the results.
Furthermore, while at Duke he designed a large three-year study to gather sufficient data to address all the controversial questions above, with special attention to the relationship between grades, teaching evaluations, and courses selected by students. By allowing students, during registration, computer access to files about teacher evaluations and grades awarded by instructors, he was able to determine whether such data had any influence on courses selected. Much of the data collection could be regarded as intrusive. Student teacher evaluations were done on-line so they could be linked to other data about the student. The computer created a massive database by keeping track of the files each student visited and their subsequent course selections, and linked all that to the registrar's information about the student's gender, ethnic group, and course records.
Of course, there are faculty who do not want information about teacher evaluations and previous class GPAs available to students (that information is not available here at Montana State University) and their protests terminated his project after only one year. In particular, the math faculty protested the loudest, partially because of some low teacher evaluations, but mostly fearing that general knowledge of the low grades they tend to assign would drive students away. Johnson remarks that "they were likely naive in thinking they were not already suffering from exactly this phenomenon." Nevertheless, the truncated experiment compiled enough data to distinguish true causation from simple statistical correlation in many cases.
For example, it is possible to compare grades of individuals across departments to determine which departments grade stringently and which leniently, and to quantify the amounts. Consider all students at the university who take classes in both departments X and Y, possibly excluding departmental majors. If the average grade from department X is significantly higher than the average grade from department Y, that is some evidence that department X grades more leniently. Are there other factors that might explain the difference? Possibly. That is what "experimental design" is for--to figure out some way to draw conclusions with "all other factors being equal."
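To make the paired-comparison idea concrete, here is a minimal sketch in Python (mine, not Johnson's actual model), assuming a hypothetical list of (student, department, grade) records; the data, the function name, and the simple averaging are illustrative only, with none of the corrections for other factors that a real analysis would require.

    from collections import defaultdict
    from statistics import mean

    # Hypothetical records: (student id, department, grade on a 4.0 scale).
    records = [
        ("s1", "X", 3.7), ("s1", "Y", 3.0),
        ("s2", "X", 3.3), ("s2", "Y", 2.7),
        ("s3", "X", 4.0), ("s3", "Y", 3.7),
    ]

    def leniency_gap(records, dept_x, dept_y):
        # Group each student's grades by department.
        by_student = defaultdict(lambda: defaultdict(list))
        for student, dept, grade in records:
            by_student[student][dept].append(grade)
        # For students who took courses in both departments, compare each
        # student's own average grade in X with his or her own average in Y.
        gaps = [
            mean(depts[dept_x]) - mean(depts[dept_y])
            for depts in by_student.values()
            if dept_x in depts and dept_y in depts
        ]
        return mean(gaps) if gaps else None

    # A positive result is some evidence that department X grades more leniently.
    print(leniency_gap(records, "X", "Y"))  # about 0.53 for the toy data above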
Of interest to every faculty member will be the table summarizing three studies at different universities, which lists, department by department, how much the average grades of the same individual would differ from the norm (203). The effect, corrected for all other factors, is strong enough to move a student's GPA from the middle of the class to the top or bottom quartile depending on the department awarding the grades. A fascinating observation is that departments that attract the most capable students grade stringently, while departments with less-capable students grade more leniently.
Johnson's data often confirm common sense. For example, if, for a given course, students are aware that section 1 tends to award higher grades than section 2, they are much more likely to sign up for section 1. That suggests, but does not quite prove, that if grading policies between departments were similar, students might sign up for more natural science courses (in which grading is typically more stringent). I have a daughter who attended MSU in math, and I remember her friends discussing which courses they would take based on how "hard" they were. One said she would change majors to avoid the strict instructor of a course required that particular semester. But I admit it didn't occur to me at the time just how huge the impact of faculty or departments awarding low or high grades can be when extrapolated across a whole campus. Johnson estimates that, with grading standards equalized, the average undergraduate at Duke would take over 40 percent more natural science courses, an impact equivalent to adding 20 or more full-time science faculty to a total faculty of about 500. Furthermore, this refers only to selections of elective courses and does not factor in the additional majors a more lenient grading policy would attract.
The introduction is fascinating. He first notes just how high the grades have become at numerous top schools. (I was playing Trivial Pursuit recently, and one question was, "How many of every ten members of Harvard's class of 2002 graduated 'with honors'?" The answer was "Nine.") Then he quotes extensively from faculty who support awarding high grades, with the intention of designing an experiment to address the claims and extract the truth. For example, it will not surprise any reader that grades and student teaching evaluations are positively correlated. That fact is not in dispute, but its explanation is highly controversial. It may be that good teaching leads to both good grades and good evaluations (the "teacher-effectiveness theory"). On the other hand, it is well known that humans feel compelled to return favors (Influence: The Psychology of Persuasion, Robert B. Cialdini, 1993), and high teacher evaluations may be considered a proper return favor to the teacher for awarding high grades (the "grade-leniency theory").
These and other theories can be tested against the facts. Psychologists and statisticians have spent decades considering similar problems and designing experiments that are able to distinguish causation from correlation. For example, teaching evaluations are usually conducted before the student receives his or her grade. In that case evaluations could depend upon the expected grade, but not on the grade itself. This leaves open the possibility of manipulating expectations (without actually manipulating grades) for experimental purposes to see if teacher evaluations are affected. This has been done. Also, in some studies teacher evaluations have been given twice, both before and after receiving grades. Experimental design is an important area of statistics that Johnson has obviously mastered, and he wants the reader to be aware of the sometimes astounding arguments used to justify grading schemes so that his experiment can address them properly. He wants his critics to have no objection left unmet when he is finished.
Although not generally interested in changes in grades over the years, he does bemoan the fact that it has become impossible to reward outstanding students with outstanding grades, and that now even one low grade can ruin the GPA of a fine student at a school where most grades are high. A Montana high school teacher once remarked to me that the finest student he had taught in 10 years did not graduate in the top 10 percent of his class because he once got a B-plus in Band and more than 10 percent of his class graduated with a perfect 4.0.
The book bogs down severely in Chapter Two on the design of his complicated experiment at Duke. It is too detailed and heavy going for the general reader, who, unless statistically trained and highly motivated, may skip it and take my word that the statistical work has been done well and the standard statistical considerations about experimental design and objections about non-response have been handled properly.
However, the entertainment resumes in Chapter Three on the relationship between grades and student evaluations of teaching. Johnson discusses the amusing history of such investigations, and the sociological trends evidenced. His purpose is again to consider the design of experiments and what they can and cannot detect. He reviews the factors that have been said to "bias" investigations and considers the explanations propounded over the decades about why correlations might be positive or negative. Hundreds of studies spanning 70 years have addressed the issue, and their explanations are legion, ranging far beyond the "teacher-effectiveness" and "grade-leniency" theories. Johnson discusses which types of designs can answer which questions, and selects the three dozen "observational" studies that he finds have close to the proper design to factor out the effect of the teaching and isolate the effect of the grades. They average a significant positive correlation of 0.21, supporting the theory that grades influence teacher evaluations.
However, Johnson notes that "all observational studies are, in some sense, flawed" (71), and then considers "experimental" studies, which can actually be designed to "untangle the causal relationship between student grades and student evaluations of teaching" (75). Several published grade-manipulation experiments are analyzed and the following chapter shows how his Duke experiment was designed to handle remaining unresolved issues. The mean class grade turned out to be much less important as a predictor of student-evaluation item responses than were either individual student grades or prior student interest, facts which "contradict predictions generated under the teacher-effectiveness theory," but also "little support was gathered for the grade-leniency theory" (117). He found the "grade-attribution" theory the most useful: "Students attribute success in academic work to themselves, but attribute failure to external sources" (96). Regardless of the reason, the analysis provides "conclusive evidence of a biasing effect of student grades on student evaluations of teaching" (118).
One academic response to concerns about high and inequitable grades has been to deny the objective validity of grades and therefore to regard correlations between grades and other factors as irrelevant ("The Dangerous Myth of Grade Inflation," Alfie Kohn, Chronicle of Higher Education, 8 Nov. 2002). In this view, the pursuit of good grades is a hindrance to real learning, so grades are the problem, not the solution. Johnson wastes no time in berating such postmodern arguments as "bizarre."
Regardless of whether grades have objective meaning, they do have an impact, and Johnson's quest to reveal the impact is valid. The work reports common-sense relationships and gives solid evidence for them. Those invested in the current uneven system of grading will have a hard time deconstructing his case; more likely they will simply denigrate or ignore it. Statistics is not a subject in its infancy, and statisticians actually can design experiments to distinguish causation from correlation. There are so many interrelated issues that Johnson's work is interesting from the beginning to the very end, becoming unreadable only when he puts on his statistical-researcher hat and proves to his peers he knows his stuff, as if writing a scholarly article.
More often it is highly entertaining. For example, the "Fox effect" is documented in a beautiful experiment in which Dr. Fox, introduced as an authority on the application of mathematics to human behavior, gave a lecture on "Mathematical Game Theory as Applied to Physician Education" to various groups of professionals and graduate students. They rated the lecture very highly, in spite of the fact that Dr. Fox was actually an actor delivering a lecture designed to be without content. However, his delivery was enthusiastic and witty--which is apparently sufficient to make listeners appreciate, in the words of the original experimenters, "double talk, neologisms, non sequiturs, and contradictory statements."
The literature is full of articles demonstrating that such "seduction" can bias student evaluation of teaching. Overall, Johnson concludes that "tools for accurately evaluating effective teaching and student achievement remain elusive" and "current evaluation instruments provide little (legitimate) guidance to administrators in promotion, tenure, and salary decisions" (165).
We all know that good intentions may have unintended consequences. In my opinion a major contribution of this book is to clarify the negative unintended consequences of some modern (or post-modern) grading schemes. I enjoyed the thorough discussions of previously proposed potential explanations for statistical relationships of every kind between grades, evaluations, achievement, and student choices. Many of them I would never have thought of myself (I'd like to think because they are loony). In the end, readers should be convinced that he has demonstrated that grades are not awarded equably and that the negative effects of this are large.
But, what can be done about it? Johnson considers potential grade-adjustment schemes and the philosophical justifications for, and objections to, each. The BCS computer-ranking of college football teams is highly controversial. Imagine how impossible it would be to obtain consensus that any particular mathematical grade-adjustment formula would be universally fair.
So, what incentive could teachers be given to move their grade curves down? A proposal I found amusing was setting a goal for the percentage of As in a class and, if the goal is exceeded by x percentage points, removing the top x percent of that instructor's teacher evaluations before computing their summary statistics. Johnson supports constraints on mean grade distributions, something he says is not uncommon in graduate and professional schools. He considers the objections to various types of constraints and notes how they have worked (or not worked) in practice. For example, if the median is constrained to be at or below some grade, some faculty will assign 51 percent of their grades at or just below that level and 49 percent at high levels. Philosophically, some may justify this by asserting that students "need" high grades to get into a good graduate school because everybody else is giving high grades.
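To show the mechanics of that amusing trimming proposal, here is a small Python sketch (my illustration of the idea as described above, not a formula Johnson prescribes); the target share of As, the grade list, and the 1-to-5 rating scale are all hypothetical.

    def trimmed_evaluations(grades, evals, target_a_share=0.20):
        # Share of As (including A-minuses) actually awarded in the class.
        a_share = sum(1 for g in grades if g.startswith("A")) / len(grades)
        # x: how far the instructor exceeded the target, as a fraction.
        excess = max(0.0, a_share - target_a_share)
        # Drop the top x percent of evaluation scores before summarizing.
        keep = len(evals) - int(round(excess * len(evals)))
        return sorted(evals)[:keep]

    grades = ["A", "A", "A-", "A", "B+", "B", "B", "C+", "C", "B-"]  # 40% As
    evals = [5, 5, 4, 4, 4, 3, 5, 2, 4, 5]  # ratings on a 1-to-5 scale
    kept = trimmed_evaluations(grades, evals)
    print(sum(kept) / len(kept))  # mean after dropping the two highest ratings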
It seems to me that grading practices have gotten out of hand because they value the good of the individual (student or faculty member) at the expense of the common good. Only my allegiance to some abstract theory of what grades "should" mean prevents me from awarding higher grades for the same work. I can see incentives for giving higher grades, but where are the incentives for not giving them? Even universities are forced to think of themselves as individuals in a pool of universities, each of which can reward its students, relative to students of other schools, by giving higher grades. It seems that only national embarrassment can rein in the trend. Apparently Harvard has recently decided it will limit the number of students allowed to graduate "with honors" to 60 percent of a class. Some honor!
Johnson presents conclusive evidence that grades bias student evaluations of teaching; student evaluations do not provide reliable evidence of instructional effectiveness; high course grades do not imply high levels of student achievement; and grades do not have a consistent meaning across classes, departments, or institutions. Furthermore, student course and major selection are significantly affected by expected grading practices. Equally significantly, he shows that the impacts are damaging to the university community as a whole. Unfortunately, he is unable to show how to resurrect faculty commitment to the common good. Surely this must become a major academic concern.
Nevertheless, I expect few administrators will acknowledge the "crisis" Johnson perceives. Acknowledging a crisis would mean admitting something should be done, and that seems to me to be politically unlikely anytime soon. Does it matter if most grades are high? Does it matter if some faculty member or department gives higher grades than the rest? Does it matter if the teaching component of promotion and tenure decisions is based on (highly unreliable) student teacher evaluations? Johnson demonstrates that the answer to each is a resounding "Yes!" Is anyone willing to do something about it? Probably not.
[The Montana Professor 14.2, Spring 2004 <http://mtprof.msun.edu>]