Comments on Babies Learning Language: N-best evaluation for hiring and promotion

Hey Rebecca, thanks very much for engaging (and as...

2017-07-06T09:09:55.151-07:00

Hey Rebecca, thanks very much for engaging (and as always great to hear from you)! There are a couple of points here:

1. Talks as evidence for future plans. I agree that we're trying to judge future success, but what makes us think that there is any correlation between talk quality and future productivity? I guess the thought is that we judge the ideas in the talk and decide if they are good. My worry here is that we can do that just as well from a research statement, but without the bias. And the research statement is much closer to the eventual evaluation metric for an academic, which is almost all written. If we want people who write good papers, grants, etc., then we should hire people who are good at precisely those things. Adding in the signal from the talks just adds bias, along with other skills (see below). (I do agree that letters can contextualize the candidate's contributions to prior work and their independence in doing that work; I think that's important.)

2. Communication skills. I agree that the best way to judge someone's communication skills is to see them communicate! But in principle we could actually separate that from research productivity, e.g. having people do a guest teaching slot on material that they didn't produce. That would perhaps be more probative about their general science communication abilities (and many teaching schools do something similar). The worry is that by conflating communication and research, we can often get worse research that's communicated better (perhaps by someone more charismatic and fitting with our own biases).

Overall, this argument comes from my own feeling that I'm a poor "judge of character" when it comes to research, and that others may not be much better. So I'm looking for something a little more structured than the "I'll know it when I see it" that feels like what we do now in our holistic evaluations.

Hi Mike. Fascinating as usual, and I agree with yo...

2017-07-06T06:41:40.677-07:00

Hi Mike. Fascinating as usual, and I agree with your description of some of the ills of our current system. But I'm not yet convinced by your proposed solution. Because for postdocs, and especially for tenure track hires, we are not trying to evaluate the quality of the work the candidate has done, but the quality of the work that the candidate will do. I think that's the reason for the premium on job talks, chalk talks, and letters of recommendation. Because the quality of past work is only noisily correlated with the quality of future work, and subject to major confounds, including the unknown contribution of the PI. So we look to letters to describe the independence of the candidate; and to the talks for evidence of the ability to craft and develop a future-oriented coherent scientific path. Skills as a communicator are not irrelevant, either; a PI needs to be able to communicate to their own lab, their broader audience, and the public, in order to succeed at their job. I am sure we do this evaluation noisily at best; but I'm not sure reading papers would help us do it any better.
--Rebecca Saxe

Thanks, that's fascinating context! I apprecia...

2017-06-20T16:58:52.481-07:00

Thanks, that's fascinating context! I appreciate you sharing. I would love to hear others' thoughts on how REF committees operate - perhaps there are even some thoughts about how evaluation results compare to traditional metrics (e.g., correlation, or lack thereof, with h-index and citation counts), though I'm sure data confidentiality is a huge issue.

Talking to REF panel members, what happens is that...

2017-06-20T09:23:56.535-07:00

Talking to REF panel members, what happens is that each member is assigned a stack of submitted outputs to read and assign star ratings to. Of course this isn't full peer review, and it's not blind, and we can't be sure that panel members won't be influenced by the prestige of publication outlets, citation counts, etc. However, in principle their remit is to judge the quality of the science underlying the submitted output, independently of where it has been published.

You are right, the REF is controversial for various reasons (it puts a huge burden on universities, it effectively creates a "transfer market" for top academics, departments try to game the system in various ways, "impact" is now a criterion in addition to excellence, etc.). The way panels operate, however, is not generally a controversial aspect of the REF.

Anyway, given that hiring committees in the UK are guided by REF results, this is in some ways a real-life experiment with n-best evaluation along the lines you suggest.

Frank, this is very interesting, thank you! I wa...

2017-06-19T09:53:57.986-07:00

Frank, this is very interesting, thank you!

I wasn't familiar with all the details of the REF, and I didn't know that the committees actually read the materials. (I knew that the assessments were controversial, though.)

I guess it's a separate question how often such assessments should be done and how funding should be allocated - but it would be interesting to hear people's experiences with the actual judgements that are made by the committees.

Hi Michael, this is an immensely sensible proposal...

2017-06-19T02:47:11.985-07:00

Hi Michael, this is an immensely sensible proposal, which would address many of the perverse incentives in the current system for evaluating academics for hiring and promotion.

It's interesting to note that the Research Excellence Framework (REF) in the UK uses almost exactly the system you propose. The REF is a 5-yearly exercise in which all research-active staff (faculty and senior researchers) at all UK universities are evaluated. Everyone submits their top 4 outputs in a specified 5-year window; these outputs are then evaluated (effectively peer-reviewed) by a panel of experts in the discipline, and a rating between 0 and 4 stars is assigned to each researcher based on this evaluation. The overall evaluation of a department is then computed as an aggregate of these individual star ratings. Crucially, government funding is then allocated based on this aggregate.

The next REF will happen in 2020, and most of the process will be the same as the one I just described, but the number of outputs evaluated per researcher will vary (probably between 2 and 6, with an average of 4). This is also something you suggest.

The important point is that the REF evaluation panels are explicitly discouraged from using proxy metrics such as impact factors and citation counts; they are instructed to evaluate each output on its quality. Also, outputs do not have to be publications; they can be preprints, but also datasets, software, patents, etc.

Needless to say, evaluating every researcher at every university is an immense effort, and it's a very costly exercise. However, it's one that has driven up overall research quality, in my view. Not only because it incentivises researchers to aim for quality rather than quantity, but also because hiring and promotion committees apply REF criteria in their decisions.

Hi Fiery, thanks very much for the comment and for...

2017-06-18T15:16:01.974-07:00

Hi Fiery, thanks very much for the comment and for engaging! I agree with you that context is important in assessing scientific quality. That's why I'm suggesting that letters and job talks be focused on providing that kind of context. Citation numbers can also be useful for this purpose when used appropriately.

But I disagree that current hiring/promotion looks like this. First, we tend to assess people, not products. This leads to bias at the level of gender, race, looks, etc. Second, letters and talks are not used to assess or contextualize individual pieces of work as consistently as they should be. Third, citation numbers are used inappropriately, with little awareness of the need for controls for subfield, publication date, etc.

I am responding to some generalizations from the replication crisis: people who have published highly impactful research can look like stars, but if you examine individual papers in a journal club, the research will come apart at the seams. Put a group of smart people in a room for a guided discussion of a couple of papers and - my contention is - you will often come out with a very different assessment of a body of work than if you read a CV, read some general praise in letters, and watch a nice, well-practiced talk.

Thanks for a very interesting post. I agree with ...

2017-06-16T09:27:49.603-07:00

Thanks for a very interesting post. I agree with the benefits you've described. I thought it would be worthwhile to point out a potential set of costs. Obviously a key question is who gets to decide what constitutes "good science". At one extreme, it might be the president of the university. A risk of adopting this procedure is that it will bias hiring towards fields (or individual research programs) that are easy to understand and appreciate, and away from those that are less accessible to a broad audience (but no less valuable). At the opposite extreme, evaluations of "good science" might be left to those individuals who are closest to the relevant field / program of research. But this approach introduces the risk that personal relationships, nepotism, rivalry, etc. will infect the hiring process. Hiring decisions will frequently be made by somebody who was an advisor to, collaborator with, competitor of, etc., the candidates in question.

This tension need not doom the N-best approach, and some potential solutions come to mind. There is a goldilocks approach where "good science" is evaluated by those who are close-but-not-too-close. There is a democratic approach where it is evaluated by a mixture of the close and the far. There is a recusal approach where people with prior cooperative or competitive relationships are not allowed to influence the process. Combine all of these, and what you get looks an awful lot like current tenure and promotion practices.

A second concern with the "best science" approach is that the answer to the question "is X better science than Y?" is relatively more likely to be infected by predictable biases (gender, race, nationality, halo, etc.) than the answer to the question "does X have more citations than Y?" (Of course, there are plenty of documented ways in which the number of citations is, itself, subject to the same pernicious biases.)

In any event, in light of the inherent tensions in deciding who gets to make the determination about what is "good science", and the potential role of bias, it becomes clearer why relatively objective, numeric measures based on citations would have been chosen: not merely to save time and mental energy on the part of evaluators, but because they could be viewed as a more objective and reliable process. Or at least as a check on the system that asks, "Does a relatively objective metric concur with our 'best science' evaluation?"

-Fiery Cushman