
Monday, February 8, 2021

Transparency and openness is an ethical duty, for individuals and institutions

(tl;dr: I wrote an opinion piece a couple of years ago – now rejected – on the connection between ethics and open science. Rather than letting it just get even staler than it was, here it is as a blog post.)

In the past few years, journals, societies, and funders have increasingly oriented themselves towards open science reforms, which are intended to improve reproducibility and replicability. Typically, transparency policies focus on open access to publications and the sharing of data, analytic code, and other research products. 

Many working scientists have a general sense that transparency is a positive value, but also have concerns about specific initiatives. For example, sharing data often carries confidentiality risks that can only be mitigated via substantial additional effort. Further, many scientists worry about personal or career consequences from being “scooped” or having errors discovered. And transparency policies sometimes require resources that are not available to researchers outside of rich institutions.

I argue below that despite these worries, scientists have an ethical duty to be open. Further, where this duty is in conflict with scientists' other responsibilities, we need to lobby our institutions – universities, journals, and funders – to mitigate the costs and risks of openness.

Thursday, June 15, 2017

N-best evaluation for hiring and promotion

How can we create incentive-compatible evaluation of scholarship? Here's a simple proposal, discussed around a year ago by Sanjay Srivastava and floated in a number of forms before that (e.g., here):
The N-Best Rule: Hiring and promotion committees should solicit a small number (N) of research products and read them carefully as their primary metric of evaluation for research outputs. 
I'm far from the first person to propose this rule, but I want to consider some implementational details and benefits that I haven't heard discussed previously. (And just to be clear, this is me describing an idea I think has promise – I'm not talking on behalf of anyone or any institution).

Why do we need a new policy for hiring and promotion? Hiring and promotion in academic settings is an incredibly tricky business. (I'm focusing here on evaluation of research, rather than teaching, service, or other aspects of candidates' profiles.) How do we identify successful or potentially successful academics, given the vast differences in research focus and research production between individuals and areas? How do two conference papers on neural networks for language understanding compare with five experimental papers exploring bias in school settings, or three infant studies on object categorization? Two different records of scholarship simply aren't comparable in any direct, objective way. The value of any individual piece of work is inherently subjective, and the problem of subjective evaluation is only compounded when an entire record is being compared.

To address this issue, hiring and promotion committees typically turn to heuristics like publication or citation numbers, or journal prestige. These heuristics are widely recognized to promote perverse incentives. The most common, counting publications, leads to an incentive to do low-risk research and "salami slice" data (publish as many small papers on a dataset as you can, rather than combining work to make a more definitive contribution). Counting citations or H indices is not much better – these numbers are incomparable across fields, and they lead to incentives for self-citation and predatory citation practices (e.g., requesting citation in reviews). Assessing impact via journal ranks is at best a noisy heuristic and rewards repeated submissions to "glam" outlets. Because they do not encourage quality science, these perverse incentives have been implicated as a major factor in the ongoing replicability/reproducibility issues that are facing psychology and other fields.

Thursday, June 1, 2017

Confessions of an Associate Editor

For the last year and a half I've been an Associate Editor at the journal Cognition. I joined up because Cognition is the journal closest to my core interests; I've published nine papers there, more than in any other outlet by a long shot. Cognition has been important historically, and it continues to publish influential papers. I was also excited about a new initiative by Steve Sloman (the EIC) to require authors to post raw data. Finally, I joined knowing that Cognition is currently an Elsevier journal. I – perhaps naively – hoped that like Glossa, Cognition could leave Elsevier (which has a very bad reputation, to say the least) and go open access. I'm stepping down as an AE in the fall because of family constraints and other commitments, so I wanted to take the opportunity to reflect on the experience and some lessons I've learned.

Be kind to your local editor. Editing is hard work done by ordinary academics, and it's work they do over and above all the typical commitments of non-editor academics. I was (and am) slow as an editor, and I feel very guilty about it. The guilt over not moving faster has been the hardest aspect of the job; often when I am doing some other work, I will be distracted by my slipping editorial responsibilities. Yet if I keep on top of them, I feel that I'm neglecting my lab or my teaching. As a result, I have major empathy now for other editors – and massive respect for the faster ones. Also, whenever someone complains about slow editors on Twitter, my first thought is "cut them some slack!"

Make data open (and share code too, while you're at it)! I was excited by Sloman's initiative for data openness when I first read about it. I'm still excited about it: It's the right thing to do. Data sharing is a prerequisite for ensuring the reproducibility of published results, and it enables reuse of data for folks doing meta-analysis, computational modeling, and other forms of synthetic theoretical work. It's also very useful for replication – students in my graduate class do replications of published papers and often learn a tremendous amount about the paradigm and analyses of the original experiment by looking at posted data when they are available. But sharing data is not enough. Tom Hardwicke, a postdoc in my lab and in the METRICS center at Stanford, is currently doing a study of the computational reproducibility of results published in Cognition – data are still coming in, but our first impression is that for a good number of papers it is difficult to reproduce the published findings from the raw data and the written description of the analyses. Cognition and other journals can do much more to facilitate posting of analytic code.

Open access is harder than it looks. I care deeply about open access – as both an ethical priority and a personal convenience. And the journal publishing model is broken. At the same time, my experiences have convinced me that it is no small thing to switch a major journal to a truly OA model. I could spend an entire blogpost on this issue alone (and maybe I will later), but the key issue here is money: where it comes from and where it goes. Running Cognition is a costly affair in its current form. There is an EIC, two senior AEs, and nine other AEs. All receive small but not insignificant stipends. There is also a part-time editorial assistant, and an editorial software platform. I don't know most of these costs, but my guess is that replicating this system as is – without any of the legal, marketing, and other infrastructure – would be minimally $150,000 USD/year (probably closer to 200k or more, depending on software).

Friday, January 20, 2017

How do you argue for diversity?

During the last couple of months I have been serving as a member of my department's diversity committee, charged with examining policies relating to diversity in graduate and faculty recruitment. I have always placed value on the personal diversity of the people I work with. But until this experience, I hadn't realized how unexamined my thinking on this topic was, and I hadn't explicitly tried to make the case for diversity in our student population. So I was unprepared for the complexity of this issue. As it turns out, different people have tremendously different intuitions on how to – and whether you should – argue for diversity in an educational setting.

In this post, I want to enumerate some of the arguments for diversity I've collected. I also want to lay out some of the conflicting intuitions about these arguments that I have encountered. But since diversity is an incredibly polarizing issue, I also want to be sure to give a number of caveats. First, this blogpost is about other people's responses to arguments for diversity; I'm not making any of these arguments myself here. I do personally care about diversity and find some of these arguments more compelling than others, but that's not what I'm writing about. Second, all of this discussion is grounded in the particular case of understanding diversity in the student body of educational institutions (especially in graduate education). I don't know enough about workplace issues to comment. Third, and somewhat obviously, I don't speak for anyone but myself. This post doesn't represent the views of Stanford, the Stanford psych department, or even the Stanford Psych diversity committee.

Monday, April 25, 2016

Misperception of incentives for publication

There's been a lot of conversation lately about negative incentives in academic science. A good example of this is Xenia Schmalz's nice recent post. The basic argument is that professional success comes from publishing a lot and publishing quickly, while scientific values are best served by slower, more careful work. There's perhaps some truth to this argument, but it overstates the misalignment between the incentives for scientific and professional success. I suspect that people believe quantity matters more than quality, even when the opposite is true.

Let's start with the (hopefully uncontroversial) observation that the number of publications will be correlated to some degree with scientific progress. That's because, for the most part, if you haven't done any research you're not likely to be able to publish, and if you have made a true advance it should be comparatively easy to publish. So there will be some correlation between publication record and theoretical advances.

Now consider professional success. When we talk about success, we're mostly talking about hiring decisions. Though there's something to be said about promotion, grants, and awards as well, I'll focus here on hiring. Getting a postdoc requires the decision of a single PI, while faculty hiring generally depends on committee decisions. It seems to me that many people believe these hiring decisions come down to the weight of the CV. That doesn't square with either my personal experience or the incentive structure of the situation. My experience suggests that the quality and importance of the research are paramount, not the quantity of publications. And more substantively, the incentives surrounding hiring also often favor good work.

At the level of hiring a postdoc, what I personally consider is the person's ideas, research potential, and skills. I will have to work with someone closely for the next several years, and the last person I want to hire is someone sloppy and concerned only with career success. Nearly all postdoc advisors that I know feel the same way, and that's because our incentive is to bring someone in who is a strong scientist. When a PI interviews for a postdoc, they talk to the person about ideas, listen to them present their own research, and read their papers. They may be impressed by the quantity of work the candidate has accomplished, but only in cases where that work is well-done and on an exciting topic. If you believe that PIs are motivated at all by scientific goals – and perhaps that's a question for some people at this cynical juncture, but it's certainly not one for me – then I think you have to believe that they will hire with those goals in mind.

Tuesday, July 14, 2015

Engineering the National Children's Study

The National Children's Study was a 100,000-child longitudinal study that would have tracked a cohort of children from birth to age 21, measuring environmental, family, genetic, and cognitive aspects of development at an unprecedented scale. Unfortunately, last year the NIH Director decided to shut the study down, following a highly critical report from the National Academy of Sciences that criticized a number of aspects of the study including its leadership and its sampling plan.

I got involved in the NCS about a year ago, when I was asked to be a part of the Cognitive Health team. Participating in the team has been an extremely positive experience, as I've had a chance to work with a great group of developmental researchers. We've met weekly for the past year, first to create plans for the cognitive portions of NCS, and later – after the study was cancelled – to discuss possible byproducts of the group's work. (Full disclosure: I am still a contractor for NCS and will be until the final windup is completed).

According to recent reports, though, NCS may be restarted by an act of Congress. As originally conceived, the study served a very valuable purpose: creating a sample large enough and diverse enough to allow analyses of rare outcomes, even for parts of the population that are often underrepresented in other cohorts. Other countries clearly think this is a good idea. According to one proposal, though, recruitment in the new study might piggyback on other ongoing studies. I'm not sure how this could work, given that different studies would likely have radically different measures, ages, and recruitment strategies. Even if some of these choices were coordinated, differences in implementation of the studies would make inferences from the data much more problematic.

I would love to see the original NCS vision carried to fruition. But even based on my limited perspective, I also understand why the project was extremely slow to start and ran into substantial cost obstacles. Creating such a massive design inevitably runs into problems of interlocking constraints, where decisions about recruitment depend on decisions about design and vice versa. Converging on the right measures is such a difficult process that by the time decisions are made, they are already out of date (a critique leveled also by the NAS report).

If the NCS is restarted, it will need a faster and cheaper planning process to have a chance of going forward to data collection. Here's my proposal: the NCS needs to work as if it's building a piece of software, not planning a conference. If you're planning a conference, you need to have stakeholders gradually reach consensus on details like the location, the program, and the events, before a single event occurs on a fixed timeline. But if you're building a software application, you need to respond to the constraints of your platform, adapt to your shifting user base, pilot test quickly and iteratively, and make sure that everything works before you release to market. This kind of agile optimization was missing from the previous iteration of the study. Here are three specific suggestions.

1. Iterative piloting. 

Nothing reveals the weaknesses of a study design like putting it into practice.  In a longitudinal study, the adoption of a bad measure, bad data storage platform, or bad sampling decision early on in the study will dramatically reduce the value of the subsequent data. It's a terrible feeling to collect data on a measure, knowing that the earlier baselines were flawed and the longitudinal analysis will be compromised.

The original NCS included a vanguard cohort of about 5,000 participants, mostly to test the recruitment strategy. (In fact, the costs of the vanguard study may have contributed to the cancellation of the main study.) But one pilot program is not enough. All aspects of the program need to be piloted so that the design can be adapted to the realities of the situation. From the length of the individual sessions to the reliability of the measures and the retention rate across different populations, critical parts of the study all need to be tested multiple times before they are adopted.

The revised NCS should create a staged series of pilot samples of gradually increasing size, whose timeline is designed to allow iteration and incorporation of insights from previous samples. For example, if NCS v2 launches in 2022, then create cohorts of 100, 200, 1000, and 2000 to launch in 2018–2021, respectively. Make the first samples longitudinal to test dropout (so the sampling design can be adjusted in the main study), and make the last sample cross-sectional so as to pilot test the precise measures that are planned for every age visit. Make it a rule: if any measure or decision is adopted in the full study, there must be data on its reliability in the current study context.

2. Early adoption of precise infrastructure standards.  

Here's a basic example of an interlocking constraint satisfaction problem. You need to present measures to parents and collect and store the data resulting from these measures in a coherent data-management framework. But the way you collect the data and the way you store them interact with what the measures are. You can't know exactly how data from a measure (even one as simple as a survey) will look until you know how it will be collected. But you want to design the infrastructure for data collection around the measures that you need.

One way to solve this kind of problem is to iterate gradually toward a solution. One committee discusses measures, a second discusses infrastructure. They discuss their needs, then meet, then discuss their needs again. Finally they converge and adopt a shared standard. This model can work well if the target you are optimizing toward is static, i.e., if the answer stays the same during your deliberations. The problem is that technical infrastructure doesn't stay the same while you work – the best infrastructure is constantly changing. Good ideas for data management from when the NCS began are no longer relevant. But if the infrastructure group is constantly changing the platform, then the folks creating the measures can never rely on particular functionality.

Software engineers solve this problem by creating design specifications that are implementation independent. In other words, everyone knows exactly what they need to deliver and what they can rely on others to deliver (and the under-the-hood details don't matter). Consider an API (application programming interface) for an eye-tracker. The experimenter doesn't know how the eye-tracker measures point of gaze, but she knows that if she calls a particular method, say getPointOfGaze, she will get back X and Y coordinates, accurate to some known tolerance. On the other end of the abstraction, the eye-tracker manufacturers don't need to know the details of the experiment in order to build the eye-tracker. They just need to make getPointOfGaze return quickly and accurately.
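To make this abstraction concrete, here is a minimal Python sketch of what such an implementation-independent interface could look like. Apart from the getPointOfGaze name used above, everything here (EyeTracker, GazePoint, VendorXTracker, the tolerance field) is a hypothetical illustration, not any real vendor's API:

```python
from abc import ABC, abstractmethod
from typing import NamedTuple


class GazePoint(NamedTuple):
    """Point of gaze in screen coordinates, with a known accuracy tolerance."""
    x: float
    y: float
    tolerance_deg: float  # accuracy in degrees of visual angle


class EyeTracker(ABC):
    """Implementation-independent specification: experimenters code against
    this interface; manufacturers supply the under-the-hood details."""

    @abstractmethod
    def getPointOfGaze(self) -> GazePoint:
        """Return the current point of gaze in screen coordinates."""


class VendorXTracker(EyeTracker):
    """One hypothetical vendor's implementation; the experiment never needs
    to know how gaze is actually estimated."""

    def getPointOfGaze(self) -> GazePoint:
        # A real device would query the hardware here; a fixed placeholder
        # value is returned purely to illustrate the contract.
        return GazePoint(x=512.0, y=384.0, tolerance_deg=0.5)


def run_trial(tracker: EyeTracker) -> None:
    """Experiment code depends only on the interface, not on the vendor."""
    gaze = tracker.getPointOfGaze()
    print(f"Gaze at ({gaze.x}, {gaze.y}), accurate to {gaze.tolerance_deg} deg")


if __name__ == "__main__":
    run_trial(VendorXTracker())
```

As long as both sides honor the published contract, the experiment code and the hardware can evolve independently.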

In a revised NCS, study architects should publish a technical design specification for all (behavioral) measures that is independent of method of administration. Such standards obviate the need to hire many layers of contractors to implement each set of measures separately. Instead, a single format-conversion step can be engineered. For example, a standard survey XML format would be translated into the appropriate presentation format (whether the survey is administered by phone, on a computer, or on a tablet). As in many modern content management systems, the users of a measure could rapidly view and iterate on its precise implementation, rather than having to work through intermediaries.
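As a sketch of that single format-conversion step, here is a hypothetical survey in a standard XML format, plus a tiny converter that renders it as plain text prompts. The format, element names, and question content are all invented for illustration; phone, web, or tablet renderers would simply be additional functions over the same standard:

```python
import xml.etree.ElementTree as ET

# A hypothetical survey specification in a shared XML standard; the same
# file could be rendered for phone, computer, or tablet administration.
SURVEY_XML = """
<survey id="demo-home-environment">
  <question id="q1" type="likert" scale="5">
    How often do you read to your child?
  </question>
  <question id="q2" type="free-text">
    What languages are spoken at home?
  </question>
</survey>
"""


def render_as_text(survey_xml):
    """One format-conversion step: translate the shared XML standard into
    simple text prompts (a stand-in for a phone, web, or tablet renderer)."""
    root = ET.fromstring(survey_xml)
    prompts = []
    for q in root.findall("question"):
        text = " ".join(q.text.split())  # normalize whitespace in the item text
        if q.get("type") == "likert":
            text += " (1-{})".format(q.get("scale"))
        prompts.append("[{}] {}".format(q.get("id"), text))
    return prompts


if __name__ == "__main__":
    for prompt in render_as_text(SURVEY_XML):
        print(prompt)
```

The measure is authored once, in one standard; each administration platform is just another renderer over the same file, which is what removes the need for separate layers of contractors.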

A further engineering trick that could be applied to this setup is the use of automated testing and test suites. Given a known survey format and a uniform standard, it would be far easier to create automated tools to estimate completion time, to test data storage and integrity, and to search for bugs. Imagine if the NCS looked like an open-source software project, in which each "build" of the study protocol would be forced to pass a set of automated tests prior to piloting...
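Continuing the sketch above, here is roughly what such an automated check might look like using Python's unittest. The per-item time estimates and the 20-minute session budget are assumed values for illustration, not anything from the NCS protocol:

```python
import unittest
import xml.etree.ElementTree as ET

# Assumed per-item completion-time estimates (seconds) and session budget.
SECONDS_PER_ITEM = {"likert": 10, "free-text": 30}
SESSION_BUDGET_SECONDS = 20 * 60  # hypothetical 20-minute visit budget


def estimated_completion_seconds(survey_xml):
    """Estimate total completion time from the shared survey standard."""
    root = ET.fromstring(survey_xml)
    return sum(SECONDS_PER_ITEM.get(q.get("type"), 60)
               for q in root.findall("question"))


class SurveyBuildChecks(unittest.TestCase):
    """Checks that every 'build' of the protocol must pass before piloting."""

    SURVEY_XML = """
    <survey id="demo-home-environment">
      <question id="q1" type="likert" scale="5">How often do you read to your child?</question>
      <question id="q2" type="free-text">What languages are spoken at home?</question>
    </survey>
    """

    def test_every_question_has_a_unique_id(self):
        root = ET.fromstring(self.SURVEY_XML)
        ids = [q.get("id") for q in root.findall("question")]
        self.assertTrue(all(ids))                  # no missing ids
        self.assertEqual(len(ids), len(set(ids)))  # no duplicate ids

    def test_session_fits_time_budget(self):
        self.assertLessEqual(estimated_completion_seconds(self.SURVEY_XML),
                             SESSION_BUDGET_SECONDS)


if __name__ == "__main__":
    unittest.main()
```

Checks like these catch problems (a survey that runs long, a malformed item, a missing identifier) before any family ever sits through a visit.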

3. Independence of measure development and measure adoption.

Other people's children are great, but we all love our own the best. That's why we don't review our own papers or hire our own PhD students to be our colleagues. The adoption of measures into a longitudinal study is no different. If we allow the NCS to engage in measure development – creating new ways of measuring a particular environmental, physiological, or psychological construct – rather than simply adopting pre-existing standards, we need to take care that these measures are only adopted if they are the best option for fulfilling the study's goals.

Fix this problem by barring NCS designers from being involved in the creation of measures that are then used in the NCS. If the design committee wants a new measure, they must solicit competitive outside bids to create it and then adopt the version that has the most data supporting it in a direct evaluation. To do otherwise risks the inclusion of measures with insufficient evidence of reliability and validity.

This recommendation is based directly on my own experiences in the Cognitive Health team. Over the course of the last year, I've been very pleased to be able to help this team in the development of a new set of measures for profiling infant cognition. Based on automated eye-tracking methods, these measures have the potential to be a ground-breaking advance in understanding individual differences in cognition during infancy. I'm now quite invested in their success and I hope to continue working on them regardless of the outcome of the NCS study.

That's precisely the problem. I am no longer an objective observer of these measures! Had the NCS gone forward, I would have pushed for their adoption into the main study, even if the data on their efficacy were far more limited than should be required for adoption at a national scale. I'm not suggesting that the NCS would adopt a really terrible measure. But given what we know about motivated cognition and the sunk cost fallacy, it's very likely that the bar would be lower for adopting an internally developed measure than an external one.

If the NCS acts as a developer of new measures, there is a temptation to continue working to get the perfect suite of measurements, rather than to stop development and run the study. This is the perfect being the enemy of the good. If the NCS is a consumer of others' measures – on some rare occasions, measures that it has commissioned and evaluated – then it can more dispassionately adopt the best available option that fits the constraints of the study.

Conclusions

My own experiences with the NCS – limited as they are – have been nothing but positive. I've gotten to work with some great people, seen the initial development of an exciting new tool, and glimpsed the workings of a much larger project. But as I read about the fate of the study as a whole, I worry that the independence that's made my little part of the project so fun to work on – developing standards, envisioning new measures – is precisely why the project as a whole did not move forward.

What I've suggested here is that a new version of the NCS could benefit from an engineering mindset. Having internal deadlines for pilot launches would constrain planning with interim goals. Adding precise technical specifications and the abstractions necessary to work with them would add certainty to the planning process and eliminate many redundant contractors; for example, our new measures would probably be off the table simply because they wouldn't fit into the existing infrastructure. And an adversarial review of measures would better allow designers to weigh independent evidence for adoption.

In sum: bring back the NCS! But run it like you're building an app: one that has to fulfill a set of functions, yes, but also one that has to scale quickly and cheaply to unprecedented size.

---
Thanks to Steve Reznick, my colleague on the Cognitive Health team, for valuable comments on a previous draft. Views and errors are my own.