
Friday, March 19, 2021

Measuring Teaching Quality in Higher Education

For every college professor, teaching is an important part of their job. For most college professors, who are not at the relatively few research-oriented universities, teaching is the main part of their job. So how can we evaluate whether teaching is being done well or poorly? This question applies both at the individual level and to bigger institutional questions: for example, are faculty with lifetime tenure, who were granted tenure in substantial part for their performance as researchers, better teachers than faculty with short-term contracts? David Figlio and Morton Schapiro tackle such questions in "Staffing the Higher Education Classroom" (Journal of Economic Perspectives, Winter 2021, 35:1, 143-62). 

The question of how to evaluate college teaching isn't easy. For example, there are no annual exams of the kind often given at the K-12 level, nor are certain classes followed by a common exam like the AP exams in high school. My experience is that faculty at colleges and universities are not especially good at self-policing teaching quality. In some cases, newly hired faculty get some feedback and guidance, and there are hallway discussions about especially awful teachers, but that's about it. Many colleges and universities have questionnaires on which students can evaluate faculty. This is probably a better method than throwing darts in the dark, but it is also demonstrably full of biases: students may prefer easier graders, classes that require less work, or classes with an especially charismatic professor. There is also a well-developed body of evidence that white American faculty members tend to score higher. Figlio and Schapiro write: 

Concerns about bias have led the American Sociological Association (2019) to caution against over-reliance on student evaluations of teaching, pointing out that “a growing body of evidence suggests that their use in personnel decisions is problematic” given that they “are weakly related to other measures of teaching effectiveness and student learning” and that they “have been found to be biased against women and people of color.” The ASA suggests that “student feedback should not be used alone as a measure of teaching quality. If it is used in faculty evaluation processes, it should be considered as part of a holistic assessment of teaching effectiveness.” Seventeen other scholarly associations, including the American Anthropological Association, the American Historical Association, and the American Political Science Association, have endorsed the ASA report ...
Figlio and Schapiro suggest two measures of effective teaching for intro-level classes: 1) how many students from a certain intro-level teacher go on to become majors in the subject, and 2) "deep learning," which is a combination of how many students in an intro-level class go on to take additional classes in the subject and whether students from a certain teacher tend to perform better in those follow-up classes. The authors are based at Northwestern University, and so they were able to obtain "registrar data on all Northwestern University freshmen who entered between fall 2001 and fall 2008, a total of 15,662 students, and on the faculty who taught them during their first quarter at Northwestern." 
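
To make these two measures concrete, here is a minimal sketch of how one might compute them from registrar-style data. Everything in it is an illustrative assumption on my part--the file name, the column names (student_id, dept, instructor_id, became_major, took_followup, followup_grade), and the simple department-mean adjustment--rather than Figlio and Schapiro's actual procedure, which uses richer controls for student background.

    import pandas as pd

    # Hypothetical registrar data: one row per (student, intro course taken in first quarter).
    # All column names are invented for illustration.
    df = pd.read_csv("registrar_intro_courses.csv")

    # Measure 1 ("inspiration"): share of an instructor's intro students
    # who later major in the subject.
    inspiration = df.groupby(["dept", "instructor_id"])["became_major"].mean()

    # Measure 2 ("deep learning"): among students who take a follow-up course in the
    # subject, how their follow-up grades compare with the department average.
    followups = df[df["took_followup"]]
    dept_mean = followups.groupby("dept")["followup_grade"].transform("mean")
    followups = followups.assign(grade_vs_dept=followups["followup_grade"] - dept_mean)
    deep_learning = followups.groupby(["dept", "instructor_id"])["grade_vs_dept"].mean()

    # Comparisons are within departments: an economics instructor's scores are only
    # meaningful relative to other economics instructors.
    measures = pd.concat({"inspiration": inspiration, "deep_learning": deep_learning}, axis=1)
    print(measures.groupby("dept").corr())  # within-department correlation of the two measures

The last line echoes the comparison discussed below: within a department, one can check whether the instructors who attract majors are also the ones whose students perform better later on.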

Of course, Figlio and Schapiro emphasize that their approach is focused on Northwestern students, who are not a random cross-section of college students. The methods they use may need to be adapted in other higher-education contexts. In addition, this focus on first-quarter teaching of first-year students is an obvious limitation in some ways, but given that the first quarter may also play an outsized role in the adaptation of students to college, it has some strengths, too. In addition, they focus on comparing faculty within departments, so that econ professors are compared to other econ professors, philosophy professors to other philosophy professors, and so on. But with these limitations duly noted, they offer what might be viewed as preliminary findings that are nonetheless worth considering. 

For example, it seems as if their two measures of teaching quality are not correlated: "That is, teachers who leave scores of majors in their wake appear to be no better or worse at teaching the material needed for future courses than their less inspiring counterparts; teachers who are exceptional at conveying course material are no more likely than others to inspire students to take more courses in the subject area. We would love to see if this result would be replicated at other institutions." This result may capture the idea that some teachers are "charismatic" in the sense of attracting students to a subject, but that those same teachers don't teach in a way that helps student performance in future classes.

They measure the quality of research done by tenured faculty using measures of publications and professional awards, but find: "Our bottom line is, regardless of our measure of teaching and research quality, there is no apparent relationship between teaching quality and research quality." Of course, this doesn't mean that top researchers on the tenure track are worse teachers; just that they aren't any better. They cite other research backing up this conclusion as well. 

This finding raises some awkward questions, as Figlio and Schapiro note: 
But what if state legislators take seriously our finding that while top teachers don’t sacrifice research output, it is also the case that top researchers don’t teach exceptionally well? Why have those high-priced scholars in the undergraduate classroom in the first place? Surely it would be more cost-efficient to replace them in the classroom either with untenured, lower-paid professors, or with faculty not on the tenure-line in the first place. That, of course, is what has been happening throughout American higher education for the past several decades, as we discuss in detail in the section that follows. And, of course, there’s the other potentially uncomfortable question that our analysis implies: Should we be concerned about the possibility that the weakest scholars amongst the tenured faculty are no more distinguished in the classroom than are the strongest scholars? Should expectations for teaching excellence be higher for faculty members who are on the margin of tenurability on the basis of their research excellence?
Figlio and Schapiro then extend their analysis to the teaching quality of non-tenure-track faculty. Their results here do need to be interpreted with care, given that non-tenure contract faculty at Northwestern often operate with three-year renewable contracts, and most faculty in this category are in their second or later contract. They write: 
Thus, our results should be viewed in the context of where non-tenure faculty at a major research university function as designated teachers (both full-time and part-time) with long-term relationships to the university. We find that, on average, tenure-line faculty members do not teach introductory undergraduate courses as well as do their (largely full-time, long-term) contingent faculty counterparts. In other words, our results suggest that on average, first-term freshmen learn more from contingent faculty members than they do from tenure track/tenured faculty. 
When they look more closely at the distribution of these results, they find that the overall average advantage of Northwestern's contingent faculty arises mainly because a certain number of tenured faculty at the bottom tail of the teaching distribution seem to be terrible at teaching first-year students. As Figlio and Schapiro point out, any contract faculty who were similarly terrible and at the bottom tail of the teaching distribution are likely to have been let go--and so they don't appear in the data. Thus, the lesson here would be that institutions should have greater awareness of the possibility that a small share of tenure-track faculty may be doing a terrible job in intro-level classes--and get those faculty reassigned elsewhere.

This study obviously leaves a lot of questions unanswered. For example, perhaps the skills needed to be a top teacher in an intro-level class are different from the skills needed to teach an advanced class. Maybe top researchers do better in teaching advanced classes? Or perhaps top researchers offer other benefits to the university (grant money, public recognition, connectedness to the frontier concepts in a field) that have additional value? But the big step forward here is to jumpstart more serious thinking about how to develop alternative quantitative measures of teacher quality that don't rely on subjective evaluations by other faculty members or on student questionnaires.

One other study I recently ran across along these lines uses data from the unique academic environment of the US Naval Academy, where students are required to take certain courses from randomly assigned faculty. Michael Insler, Alexander F. McQuoid, Ahmed Rahman, and Katherine Smith discuss their findings in "Fear and Loathing in the Classroom: Why Does Teacher Quality Matter?" (January 2021, IZA DP No. 14036).  They write: 

Specifically, we use student panel data from the United States Naval Academy (USNA), where freshmen and sophomores must take a set of mandatory sequential courses, which includes courses in the humanities, social sciences, and STEM disciplines. Students cannot directly choose which courses to take nor when to take them. They cannot choose their instructors. They cannot switch instructors at any point. They must take the core sequence regardless of interest or ability. In addition: 
Due to unique institutional features, we observe students’ administratively recorded grades at different points during the semester, including a cumulative course grade immediately prior to the final exam, a final exam grade, and an overall course grade, allowing us to separately estimate multiple aspects of faculty value-added. Given that instructors determine the final grades of their students, there are both objective and subjective components of any academic performance measure. For a subset of courses in our sample, however, final exams are created, administered, and graded by faculty who do not directly influence the final course grade. This enables us to disentangle faculty impacts on objective measures of student learning within a course (grade on final exam) from faculty-specific subjective grading practices (final course grade). Using the objectively determined final exam grade, we measure the direct impact of the instructor on the knowledge learned by the student.
To unpack this just a bit: the researchers can look at scores on the common final exam, which can be viewed as a "hard" measure of what students learned. But when instructors assign a grade for the class, they have some ability to add a subjective component in determining the final grade. For example, one can imagine that a certain student made great progress in improving their study skills, or that a student underperformed on the final for some reason (perhaps relative to earlier scores on classwork) and the professor did not want to penalize them too heavily. 
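
As a rough illustration of that decomposition, here is a stylized sketch. It is not the authors' value-added model (which exploits random assignment and the panel structure of the data); the data layout and column names are invented, and the "leniency" measure is simply the gap between the course grade an instructor assigns and what the common exam alone would predict.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical course-level data: one row per (student, core course).
    # Invented columns: student_id, course, instructor_id, exam_grade (common final,
    # graded outside the section), course_grade (set by the instructor), followup_grade.
    df = pd.read_csv("core_course_records.csv")

    # Instructor "objective" effect: average common-exam performance of the instructor's
    # students, relative to the course mean.
    df["exam_vs_course"] = df["exam_grade"] - df.groupby("course")["exam_grade"].transform("mean")
    exam_effect = df.groupby("instructor_id")["exam_vs_course"].mean().rename("exam_effect")

    # Instructor "subjective leniency": how much the assigned course grade exceeds
    # what the common exam alone would predict.
    grade_model = smf.ols("course_grade ~ exam_grade", data=df).fit()
    df["leniency_resid"] = grade_model.resid
    leniency = df.groupby("instructor_id")["leniency_resid"].mean().rename("leniency")

    # Relate follow-on performance to the two instructor characteristics.
    # (Real work would add instructor fixed effects, student controls, and
    # standard errors clustered by instructor; this only sketches the two channels.)
    instr = pd.concat([exam_effect, leniency], axis=1).reset_index()
    student = df.merge(instr, on="instructor_id")
    followup_model = smf.ols("followup_grade ~ exam_effect + leniency", data=student).fit()
    print(followup_model.summary())

The point of the sketch is only to show how an externally graded common exam lets one separate what students objectively learned from how generously their instructor graded, and then ask which of the two predicts performance in the follow-on course.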

One potential concern here is that some faculty might "teach to the test," in a way that makes the test scores of their students look good but doesn't do as much to prepare the students for follow-up classes. Another potential concern is that when faculty depart from the test scores in giving their final grades, they may be giving students a misleading sense of their skills and preparation in the field--and thus setting those students up for disappointing performance in the follow-up class. Here is the finding from Insler, McQuoid, Rahman, and Smith: 
We find that instructors who help boost the common final exam scores of their students also boost their performance in the follow-on course. Instructors who tend to give out easier subjective grades however dramatically hurt subsequent student performance. Exploring a variety of mechanisms, we suggest that instructors harm students not by “teaching to the test,” but rather by producing misleading signals regarding the difficulty of the subject and the “soft skills” needed for college success. This effect is stronger in non-STEM fields, among female students, and among extroverted students. Faculty that are well-liked by students—and thus likely prized by university administrators—and considered to be easy have particularly pernicious effects on subsequent student performance.

Again, this result is based on data from a nonrepresentative academic institution. But it does suggest some dangers of relying on contemporaneous popularity among students as a measure of teaching performance.