
Wednesday, June 11, 2025

Two summer book recommendations

After a long stint primarily reading fiction, I've been on a non-fiction kick recently and just read two books that I would definitely recommend!

Persuasion in Parallel (2022) by Alexander Coppock, a political scientist at Yale, is a scholarly monograph on how political persuasion works. It's a delightful combination of large-scale replications, strong emphasis on effect estimation and causal inference, and really thoughtful discussion of mechanisms. It starts from a replication and re-analysis of Lord, Ross, and Lepper (1979), the seminal work on political persuasion, and goes on to replicate a whole host of more recent studies. Across all of them, the key take-home is that arguments about controversial topics (e.g., gun control, abortion, etc.) operate very similarly across people with very different views: they cause small changes in attitude in the direction of the arguments, regardless of the recipient's initial views. 

In statistical terms, there's very limited heterogeneity across groups in the effect of persuasive arguments. I really appreciated evidence on this heterogeneity question because often students' intuition in psychology is that everything differs based on sociodemographic characteristics; yet this intuition is rarely quantified or challenged. Coppock's analyses take a really important step in this direction. 

The book is short and quite readable (especially given how data-rich it is). It's also very up front about the limitations of the work. There's also a thought-provoking final chapter on Bayesian inference and rational models of belief change that makes a number of connections to computational cognitive science that I enjoyed. Despite being an academic, I am not the sort of person who will sit down on a weekend with a monograph from another discipline for fun; this book was an exception for me because of how interesting, important, and thorough the work is.

On a heavier note, Doctored (2025), by Charles Piller (an investigative reporter with Science), is a screed about scientific misconduct in Alzheimer's research. I'm intimately familiar with replication issues in psychology, but I was still totally horrified to read about the impacts of scientific fraud in the Alzheimer's field. Piller makes a very well-researched and thorough case, working with experts on fraud and scientific reviewers. While a critique of the book by an Alzheimer's authority questions how central the fraudulent work was to the field (Lancet review), I was convinced by the later chapters of Doctored that show how pervasive image falsification has been within the Alzheimer's research enterprise. It's just awful to think that many people have been in dangerous clinical trials due to research misconduct. 

The book was clearly written very fast as there is some redundancy between chapters and a bit of unnecessary stage-setting around various researchers' grandparents (perhaps reflecting a pivot from an earlier vision of the book where certain people were more important to the narrative). But the substance of the scientific critique is so compelling – and honestly terrifying – that I was more than happy to overlook a few minor weaknesses in the prose. Definitely recommend.

Monday, March 27, 2023

Domain-specific data repositories for better data sharing in psychology!

Data sharing is a critical part of ensuring a reproducible and robust research literature. It's also increasingly the law of the land, with new federal mandates taking effect in the US this year. How should psychologists and other behavioral scientists share their data? 

Repositories should clearly be FAIR - findable, accessible, interoperable, and reusable. But here's the thing - most data on a FAIR repository like the Open Science Framework (which is great, btw), will never be reused. It's findable and accessible, but it's not really interoperable or reusable. The problem is that most psychological data are measurements of some stuff in some experimental context. The measures we use are all over the place. We do not standardize our measures, let alone our manipulations. The metadata are comprehensible but not machine readable. And there is no universal ontology that lets someone say "I want all the measurements of self-regulation on children that are posted on OSF." 

What makes a dataset reusable really depends on the particular constructs it measures, which in turn depend on the subfield and community the data are being collected for. When I want to reuse data, I don't want data in general. I want data about a specific construct, from a specific instrument, with metadata particular to my field of use. Such data should be stored in repositories specific to that measure, construct, or instrument. Let's call these Domain-Specific Data Repositories (DSDRs). DSDRs are a way to make sure data actually are interoperable and actually do get reused by the target community.
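To make this a bit more concrete, here's a purely hypothetical sketch in R of what machine-readable, domain-specific metadata could buy us. The field names and URLs are invented for illustration; they're not an existing standard.

```r
# Hypothetical structured metadata records for a domain-specific repository
records <- list(
  list(construct   = "self-regulation",
       instrument  = "delay-of-gratification task",
       ages_months = c(36, 60),                      # children aged 3-5 years
       n           = 42,
       data_url    = "https://example.org/ds1.csv"),
  list(construct   = "vocabulary",
       instrument  = "MacArthur-Bates CDI",
       ages_months = c(16, 30),
       n           = 120,
       data_url    = "https://example.org/ds2.csv")
)

# With structured metadata, "all measurements of self-regulation in children"
# becomes a one-line query rather than a manual literature search.
hits <- Filter(function(r) r$construct == "self-regulation" &&
                             max(r$ages_months) < 18 * 12,
               records)
length(hits)  # 1
```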

Sunday, February 21, 2021

Methodological reforms, or, If we all want the same things, why can't we be friends?

(tl;dr: "Ugh, can't we just get along?!" OR "aspirational reform, meet actual policy" OR "whither metascience?")


This post started out as a thread about the tribes of methodological reform in psychology, all of whom I respect and admire. Then it got too long, so it became a blogpost. 

As folks might know, I think methodological reform in psychology is critical (some of my views have been formed by my work with the ManyBabies consortium). For the last ~2 years, I've been watching two loose groups of methodological reformers get mad at each other. It has made me very sad to see these conflicts because I like all of the folks involved. I've actually felt like I've had to take a twitter holiday several times because I can't stand to see some of my favorite folks on the platform yelling at each other. 

This post is my - perhaps misguided - attempt to express appreciation for everyone involved and try to spell out some common ground.

Monday, April 8, 2019

A (mostly) positive framing of open science reforms

I don't often get the chance to talk directly and openly to people who are skeptical of the methodological reforms that are being suggested in psychology. But recently I've been trying to persuade someone I really respect that these reforms are warranted. It's a challenge, but one of the things I've been trying to do is give a positive, personal framing to the issues. Here's a stab at that. 

My hope is that a new graduate student in the fields I work on – language learning, social development, psycholinguistics, cognitive science more broadly – can pick up a journal and choose a seemingly strong study, implement it in my lab, and move forward with it as the basis for a new study. But unfortunately my experience is that this has not been the case much of the time, even in cases where it should be. I would like to change that, starting with my own work.

Here's one example of this kind of failure: As a first-year assistant professor, a grad student and I tried to replicate one of my grad school advisors' well-known studies. We failed repeatedly – despite the fact that we ended up thinking the finding was real (eventually published as Lewis & Frank, 2016, JEP:G). The issue was likely that the original finding was an overestimate of the effect, because the original sample was very small. But converging on the truth was very difficult and required multiple iterations.

Sunday, December 9, 2018

How to run a study that doesn't replicate, experimental design edition

(tl;dr: Design features of psychology studies to avoid if you want to run a good study!)

Imagine reading about a psychology experiment in which participants are randomly assigned to one of two different short state inductions (say by writing a passage or unscrambling sentences), and then outcomes are measured via a question about an experimental vignette. The whole thing takes place in about 10 minutes and is administered through a survey, perhaps via Qualtrics.

The argument of this post is that this experiment has a low probability of replicating, and we can make that judgment purely from the experimental methods – regardless of the construct being measured, the content of the state induction, or the judgment that is elicited. Here's why I think so.

Friday was the last day of my graduate class in experimental methods. The centerpiece of the course is a replication project in which each student collects data on a new instantiation of a published experiment. I love teaching this course and have blogged before about outcomes from it. I've also written several journal articles about student replication in this model (Frank & Saxe, 2012; Hawkins*, Smith*, et al., 2018). In brief, I think this is a really fun way for students to learn about experimental design and data analysis, open science methods, and the importance of replication in psychology. Further, the projects in my course are generally pretty high quality: they are pre-registered confirmatory tests with decent statistical power, and both the paradigm and the data analysis go through multiple rounds of review by the TAs and me (and sometimes also get feedback from the original authors).

Every year I rate each student project on its replication outcomes. The scale runs from 0 to 1, with intermediate values indicating unclear results or partial patterns of replication (e.g., a significant key test but a different qualitative interpretation). The outcomes from the student projects this year were very disappointing. With 16/19 student projects finished, we have an average replication rate of .31. There were only 4 clear successes, 2 intermediate results, and 10 failures. Samples are small every year, but this rate was even lower than in our previous samples (2014-15: .57, N=38; 2016: .55, N=11).
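(For what it's worth, the arithmetic behind that number is simple. The sketch below assumes the two intermediate projects were each scored 0.5 – my guess, but it reproduces the reported rate.)

```r
# 16 finished projects: 4 clear successes, 2 intermediate (assumed 0.5), 10 failures
scores <- c(rep(1, 4), rep(0.5, 2), rep(0, 10))
mean(scores)  # 0.3125 -- the ~.31 replication rate reported above
```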

What happened? Many of the original experiments followed part or all of the schema described above, with a state induction followed by a question about a vignette. In other words, they were poorly designed.

Tuesday, January 16, 2018

MetaLab, an open resource for theoretical synthesis using meta-analysis, now updated

(This post is jointly written by the MetaLab team, with contributions from Christina Bergmann, Sho Tsuji, Alex Cristia, and me.)


A typical “ages and stages” ordering. Meta-analysis helps us do better.

Developmental psychologists often make statements of the form “babies do X at age Y.” But these “ages and stages” tidbits sometimes misrepresent a complex and messy research literature. In some cases, dozens of studies test children of different ages using different tasks and then declare success or failure based on a binary p < .05 criterion. Often only a handful of these studies – typically those published earliest or in the most prestigious journals – are used in reviews, textbooks, or summaries for the broader public. In medicine and other fields, it’s long been recognized that we can do better.

Meta-analysis (MA) is a toolkit of techniques for combining information across disparate studies into a single framework so that evidence can be synthesized objectively. The results of each study are transformed into a standardized effect size (like Cohen’s d) and are treated as a single data point for a meta-analysis. Each data point can be weighted to reflect a given study’s precision (which typically depends on sample size). These weighted data points are then combined into a meta-analytic regression to assess the evidential value of a given literature. Follow-up analyses can also look at moderators – factors influencing the overall effect – as well as issues like publication bias or p-hacking.* Developmentalists will often enter participant age as a moderator, since meta-analysis enables us to statistically assess how much effects for a specific ability increase as infants and children develop. 
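For readers who want to see what this looks like in practice, here's a minimal sketch in R using the metafor package with simulated data; MetaLab's actual analysis pipelines are more involved than this.

```r
library(metafor)

set.seed(1)
# Simulated literature: per-study effect sizes (Cohen's d), their sampling
# variances, and mean participant age in months as a candidate moderator.
k   <- 30
age <- runif(k, 4, 24)
vi  <- 1 / runif(k, 10, 40)               # smaller studies -> larger variance
yi  <- rnorm(k, mean = 0.2 + 0.02 * age, sd = sqrt(vi))

# Random-effects meta-regression: overall effect plus an age moderator
fit <- rma(yi = yi, vi = vi, mods = ~ age)
summary(fit)  # pooled estimate, age slope, and heterogeneity statistics
```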


An example age-moderation relationship for studies of mutual exclusivity in early word learning.

Meta-analyses can be immensely informative – yet they are rarely used by researchers. One reason may be that it takes a bit of training to carry them out or even understand them. Additionally, MAs go out of date as new studies are published.

To facilitate developmental researchers’ access to up-to-date meta-analyses, we created MetaLab. MetaLab is a website that compiles MAs of phenomena in developmental psychology. The site has grown over the last two years from just a small handful of MAs to 15 at present, with data from more than 16,000 infants. The data from each MA are stored in a standardized format, allowing them to be downloaded, browsed, and explored using interactive visualizations. Because all analyses are dynamic, curators or interested users can add new data as the literature expands.

Thursday, December 7, 2017

Open science is not inherently interesting. Do it anyway.

tl;dr: Open science practices themselves don't make a study interesting. They are essential prerequisites whose absence can undermine a study's value.

There's a tension in discussions of open science, one that is also mirrored in my own research. What I really care about are the big questions of cognitive science: what makes people smart? how does language emerge? how do children develop? But in practice I spend quite a bit of my time doing meta-research on reproducibility and replicability. I often hear critics of open science – focusing on replication, but also other practices – objecting that open science advocates are making science more boring and decreasing the focus on theoretical progress (e.g., Locke, Stroebe & Strack). The thing is, I don't completely disagree. Open science is not inherently interesting.

Sometimes someone will tell me about a study and start the description by saying that it's pre-registered, with open materials and data. My initial response is "ho hum." I don't really care if a study is preregistered – unless I care about the study itself and suspect p-hacking. Then the only thing that can rescue the study is preregistration. Otherwise, I don't care about the study any more; I'm just frustrated by the wasted opportunity.

So here's the thing: Although being open can't make your study interesting, the failure to pursue open science practices can undermine the value of a study. This post is an attempt to justify this idea by giving an informal Bayesian analysis of what makes a study interesting and why transparency and openness are then the key to maximizing a study's value.

Friday, November 10, 2017

Talk on reproducibility and meta-science

I just gave a talk at UCSD on reproducibility and meta-science issues. The slides are posted here.  I focused somewhat on developmental psychology, but a number of the studies and recommendations are more general. It was lots of fun to chat with students and faculty, and many of my conversations focused on practical steps that people can take to move their research practice towards a more open, reproducible, and replicable workflow. Here are a few pointers:

Preregistration. Here's a blogpost from last year on my lab's decision to preregister everything. I also really like Nosek et al's Preregistration Revolution paper. AsPredicted.org is a great gateway to simple preregistration (guide).

Reproducible research. Here's a blogpost on why I advocate for using RMarkdown to write papers. The best package for doing this is papaja (pronounced "papaya"). If you don't use RMarkdown but do know R, here's a tutorial.
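For a flavor of what this looks like, here's a minimal, untested sketch of a papaja manuscript; the real template that papaja provides includes more YAML fields than I'm showing, so treat this as illustrative rather than copy-paste ready.

````markdown
---
title  : "A minimal reproducible manuscript"
author :
  - name        : "First Author"
    affiliation : "1"
affiliation:
  - id          : "1"
    institution : "Example University"
output : papaja::apa6_pdf
---

```{r analysis}
# Analyses live in chunks; their results are woven into the prose below.
fit <- t.test(extra ~ group, data = sleep)
```

The groups differed by `r round(diff(fit$estimate), 2)` units,
t(`r round(fit$parameter, 1)`) = `r round(fit$statistic, 2)`.
````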

Data sharing. Just post it. The Open Science Framework is an obvious choice for file sharing. Some nice video tutorials make it easy to get started.

Friday, October 6, 2017

Introducing childes-db: a flexible and reproducible interface to CHILDES

Note: childes-db is a project that is a collaboration between Alessandro Sanchez, Stephan Meylan, Mika Braginsky, Kyle MacDonald, Dan Yurovsky, and me; this blogpost was written jointly by the group.

For those of us who study child development – and especially language development – the Child Language Data Exchange System (CHILDES) is probably the single most important resource in the field. CHILDES is a corpus of transcripts of children, often talking with a parent or an experimenter, and it includes data from dozens of languages and hundreds of children. It’s a goldmine. CHILDES has also been around since way before the age of “big data”: it started with Brian MacWhinney and Catherine Snow photocopying transcripts (and then later running OCR to digitize them!). The field of language acquisition has been a leader in open data sharing largely thanks to Brian’s continued work on CHILDES.

Despite these strengths, using CHILDES can sometimes be challenging, especially for the most casual and the most in-depth uses. Simple analyses like estimating word frequencies can be done using CLAN – the major interface to the corpora – but these require more comfort with command-line interfaces and programming than can be expected in many classroom settings. On the other end of the spectrum, many of us who use CHILDES for in-depth computational studies like to read in the entire database, parse out many of the rich annotations, and get a set of flat text files. But doing this parsing correctly is complicated, and often small decisions in the data-processing pipeline can lead to different downstream results. Further, it can be very difficult to reconstruct a particular data prep in order to do a replication study. We've been frustrated several times when trying to reproduce others' modeling results on CHILDES, not knowing whether our implementation of their model was wrong or whether we were simply parsing the data differently.

To address these issues and generally promote the use of CHILDES in a broader set of research and education contexts, we're introducing a project called childes-db. childes-db aims to provide both a visualization interface for common analyses and an application programming interface (API) for more in-depth investigation. Casual users can explore the data with Shiny apps – browser-based interactive graphs that supplement CHILDES's online transcript browser. More intensive users can get direct access to pre-parsed text data using our API: an R package called childesr, which allows users to subset the corpora and get processed text. The backend of all of this is a MySQL database that's populated using a publicly-available – and hopefully definitive – CHILDES parser, to avoid some of the issues caused by different processing pipelines.
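To give a flavor of the childesr side, here's a hedged sketch. I'm writing the argument and column names from memory, so check the package documentation for the exact interface.

```r
library(childesr)  # install.packages("childesr")

# Pull all utterances from the Brown corpus (North American English collection)
utts <- get_utterances(collection = "Eng-NA", corpus = "Brown")

# Word tokens produced by the target children in the same corpus
toks <- get_tokens(collection = "Eng-NA", corpus = "Brown",
                   role = "target_child", token = "*")

# A quick-and-dirty word frequency estimate over the children's productions
head(sort(table(tolower(toks$gloss)), decreasing = TRUE), 10)
```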

Thursday, June 1, 2017

Confessions of an Associate Editor

For the last year and a half I've been an Associate Editor at the journal Cognition. I joined up because Cognition is the journal closest to my core interests; I've published nine papers there, more than in any other outlet by a long shot. Cognition has been important historically, and continues to publish recent influential papers as well. I was also excited about a new initiative by Steve Sloman (the EIC) to require authors to post raw data. Finally, I joined knowing that Cognition is currently an Elsevier journal. I – perhaps naively – hoped that like Glossa, Cognition could leave Elsevier (which has a very bad reputation, to say the least) and go open access. I'm stepping down as an AE in the fall because of family constraints and other commitments, and so I wanted to take the opportunity to reflect on the experience and some lessons I've learned.

Be kind to your local editor. Editing is hard work done by ordinary academics, and it's work they do over and above all the typical commitments of non-editor academics. I was (and am) slow as an editor, and I feel very guilty about it. The guilt over not moving faster has been the hardest aspect of the job; often when I am doing some other work, I will be distracted by my slipping editorial responsibilities.1 Yet if I keep on top of them I feel that I'm neglecting my lab or my teaching. As a result, I have major empathy now for other editors – and massive respect for the faster ones. Also, whenever someone complains about slow editors on twitter, my first thought is "cut them some slack!"

Make data open (and share code too, while you're at it)! I was excited by Sloman's initiative for data openness when I first read about it. I'm still excited about it: It's the right thing to do. Data sharing is a prerequisite for ensuring the reproducibility of results in papers, and enables reuse of data for folks doing meta-analysis, computational modeling, and other forms of synthetic theoretical work. It's also very useful for replication – students in my graduate class do replications of published papers and often learn a tremendous amount about the paradigm and analyses of the original experiment by looking at posted data when they are available. But sharing data is not enough. Tom Hardwicke, a postdoc in my lab and in the METRICS center at Stanford, is currently doing a study of computational reproducibility of results published in Cognition – data are still coming in, but our first impression is that for a good number of papers it's difficult to reproduce the findings based on the raw data and the written description of the analyses. Cognition and other journals can do much more to facilitate posting of analytic code.

Open access is harder than it looks. I care deeply about open access – as both an ethical priority and a personal convenience. And the journal publishing model is broken. At the same time, my experiences have convinced me that it is no small thing to switch a major journal to a truly OA model. I could spend an entire blogpost on this issue alone (and maybe I will later), but the key issue here is money: where it comes from and where it goes. Running Cognition is a costly affair in its current form. The journal has an EIC, two senior AEs, and nine other AEs, all of whom receive small but not insignificant stipends, plus a part-time editorial assistant and an editorial software platform. I don't know most of these costs, but my guess is that replicating this system as is – without any of the legal, marketing, and other infrastructure – would be minimally $150,000 USD/year (probably closer to 200k or more, depending on software).

Wednesday, February 15, 2017

Damned if you do, damned if you don't

Here's a common puzzle that comes up all the time in discussions of replication in psychology. I call it the stimulus adaptation puzzle. Someone is doing an experiment with a population and they use a stimulus that they created to induce a psychological state of interest in that particular population. You would like to do a direct replication of their study, but you don't have access to that population. You have two options: 1) use the original stimulus with your population, or 2) create a new stimulus designed to induce the same psychological state in your population.

One example of this pattern comes from RPP, the study of 100 independent replications of psychology studies from 2008. Nosek and E. Gilbert blogged about one particular replication, in which the original study was run with Israelis and used as part of its cover story a description of a leave from a job, with one reason for the leave being military service. The replicators were faced with the choice of using the military service cover story in the US, where their participants (UVA undergrads) mostly wouldn't have the same experience, or modifying it to create a more population-suitable cover story. Their replication failed. D. Gilbert et al. then responded that the UVA modification, a leave due to a honeymoon, was probably responsible for the difference in findings. Leaving aside the other questions raised by the critique (which we responded to), let's think about the general stimulus adaptation issue.

If you use the original stimulus with a new population, it may be inappropriate or incongruous. So a failure to elicit the same effect is explicable that way. On the other hand, if you use a new stimulus, perhaps it is unmatched in some way and fails to elicit the intended state as effectively. In other words, when it comes to cultural adaptation of stimuli for replication, you're damned if you do and damned if you don't. How do we address this issue?

Thursday, January 26, 2017

Paper submission checklist


It's getting to be CogSci submission time, and this year I am thinking more about trying to set uniform standards for submission. Following my previous post on onboarding, here's a pre-submission checklist that I'm encouraging folks in my lab to follow. Note that, as described in that post, all our papers are written in RStudio using R Markdown, so the paper should be a single document that compiles all analyses and figures into a single PDF. This process helps deal with much of the error-checking of results that used to be the bulk of my presubmission checking.

Paper writing*

  • Is the first paragraph engaging and clear to an outsider who doesn't know this subfield?
  • Are multiple alternative hypotheses stated clearly in the introduction and linked to supporting prior literature?
  • Does the paragraph before the first model/experiment clearly lay out the plan of the paper?
  • Does the abstract describe the main contribution of the paper in terms that are accessible to a broad audience?
  • Does the first paragraph of the general discussion clearly describe the contributions of the paper to someone who hasn't read the results in detail? 
  • Is there a statement of limitations on the work (even a few lines) in the general discussion?

Tuesday, January 3, 2017

Onboarding

Reading twitter this morning I saw a nice tweet by Page Piccinini on the topic of organizing project folders. This is exactly what I do and ask my students to do, and I said so. I got a thoughtful reply from my old friend Adam Abeles, and he's exactly right: I need some kind of onboarding guide. Since I'm going to have some new folks joining my lab soon, no time like the present. Here's a brief checklist for what to expect from a new project.
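As a purely illustrative sketch of the folder-organization idea (the folder names here are invented, and may not match what ends up in my actual onboarding guide), here's one way a new project could be laid out from R:

```r
# Hypothetical project skeleton -- names are for illustration only
project <- "my_new_project"
dirs <- file.path(project, c(
  "data/raw",        # untouched data, exactly as collected
  "data/processed",  # derived, analysis-ready files
  "experiment",      # experiment code and stimuli
  "analysis",        # R Markdown analysis documents
  "writeup"          # paper source (e.g., R Markdown / papaja)
))
invisible(lapply(dirs, dir.create, recursive = TRUE))
```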

Friday, July 22, 2016

Preregister everything

Which methodological reforms will be most useful for increasing reproducibility and replicability? I've gone back and forth on this blog about a number of possible reforms to our methodological practices, and I've been particularly ambivalent in the past about preregistration, the process of registering methodological and analytic decisions prior to data collection. In a post from about three years ago, I worried that preregistration was too time-consuming for small-scale studies, even if it was appropriate for large-scale studies. And last year, I worried whether preregistration validates the practice of running (and publishing) one-offs, rather than running cumulative study sets. I think these worries were overblown, and resulted from my lack of understanding of the process.

Instead, I want to argue here that we should be preregistering every experiment we do. The cost is extremely low and the benefits – both to the research process and to the credibility of our results – are substantial. Over the past few months, my lab has begun to preregister every study we run. You should too.

The key insights for me were:
  1. Different preregistrations can have different levels of detail. For some studies, you write down "we're going to run 24 participants in each condition, and exclude them if they don't finish." For others you specify the full analytic model and the plots you want to make. But there is no study for which you know nothing ahead of time. 
  2. You can save a ton of time by having default analytic practices that don't need to be registered every time. For us these live on our lab wiki (which is private but I've put a copy here).  
  3. It helps me get confirmation on what's ready to run. If it's registered, then I know that we're ready to collect data. I especially like the interface on AsPredicted, which asks coauthors to sign off before the registration goes through. (This also incidentally makes some authorship assumptions explicit.) 

Tuesday, June 21, 2016

Reproducibility and experimental methods posts

In celebration of the third anniversary of this blog, I'm collecting some of my posts on reproducibility. I didn't initially anticipate that methods and the "reproducibility crisis" in psychology would be my primary blogging topic, but it's become a huge part of what I write about on a day-to-day basis.

Here are my top four posts in this sequence:


Then I've also written substantially about a number of other topics, including publication incentives and the file-drawer problem:


The blog has been very helpful for me in organizing and communicating my thoughts, as well as for collecting materials for teaching reproducible research. Hoping to continue thinking about these topics in the future, even as I move back to discussing more developmental and cognitive science topics. 

Sunday, June 5, 2016

An adversarial test for replication success

(tl;dr: I argue that the only way to tell if a replication study was successful is by considering the theory that motivated the original.)

Psychology is in the middle of a sea change in its attitudes towards direct replication. Despite the value of direct replications in providing evidence for the reliability of a particular experimental finding, incentives for conducting them have typically been limited. Increasingly, however, journals and funding agencies value these sorts of efforts. One major challenge has been evaluating the success of direct replication studies. In short, how do we know if the finding is the same?

There has been limited consensus on this issue, so many projects have used a diversity of methods. The RP:P 100-study replication project reports several indicators of replication success, including 1) the statistical significance of the replication, 2) whether the original effect size lies within the confidence interval of the replication, 3) the relationship between the original and replication effect sizes, 4) the meta-analytic estimate of effect size combining both, and 5) a subjective assessment of replication by the team. Mostly these indicators hung together, though there were numerical differences.
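To make a few of these indicators concrete, here's a back-of-the-envelope sketch in R with made-up effect sizes (the actual RP:P analyses are more careful than this):

```r
# Hypothetical original/replication pair, as standardized effect sizes
d_orig <- 0.45; se_orig <- 0.20
d_rep  <- 0.15; se_rep  <- 0.08

# (1) Is the replication statistically significant?
z_rep   <- d_rep / se_rep
sig_rep <- 2 * pnorm(-abs(z_rep)) < .05

# (2) Does the original effect fall inside the replication's 95% CI?
ci_rep     <- d_rep + c(-1, 1) * 1.96 * se_rep
orig_in_ci <- d_orig >= ci_rep[1] && d_orig <= ci_rep[2]

# (4) Fixed-effect meta-analytic estimate combining both studies
w      <- 1 / c(se_orig, se_rep)^2
d_meta <- sum(w * c(d_orig, d_rep)) / sum(w)

c(sig_rep = sig_rep, orig_in_ci = orig_in_ci, d_meta = round(d_meta, 2))
```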

Several of these criteria are flawed from a technical perspective. As Uri Simonsohn points out in his "Small Telescopes" paper, as the power of the replication study goes to infinity, the replication will always be statistically significant, even if it's finding a very small effect that's quite different from the original. And similarly, as N in the original study goes to zero (if it's very underpowered), it gets harder and harder to differentiate its effect size from any other, because of its wide confidence interval. So both statistical significance of the replication and comparison of effect sizes have notable flaws.*

Monday, April 25, 2016

Misperception of incentives for publication

There's been a lot of conversation lately about negative incentives in academic science. A good example of this is Xenia Schmalz's nice recent post. The basic argument is that professional success comes from publishing a lot and publishing quickly, but scientific values are best served by doing slower, more careful work. There's perhaps some truth to this argument, but it overstates the misalignment in incentives between scientific and professional success. I suspect that people think that quantity matters more than quality, even if the facts are the opposite.

Let's start with the (hopefully uncontroversial) observation that number of publications will be correlated at some magnitude with scientific progress. That's because for the most part, if you haven't done any research you're not likely to be able to publish, and if you have made a true advance it should be relatively easier to publish.* So there will be some correlation between publication record and theoretical advances.

Now consider professional success. When we talk about success, we're mostly talking about hiring decisions. Though there's something to be said about promotion, grants, and awards as well, I'll focus here on hiring.** Getting a postdoc requires the decision of a single PI, while faculty hiring generally depends on committee decisions. It seems to me that many people believe these hiring decisions come down to the weight of the CV. That doesn't square with either my personal experience or the incentive structure of the situation. My experiences suggest that the quality and importance of the research is paramount, not the quantity of publications. And more substantively, the incentives surrounding hiring also often favor good work.***

At the level of hiring a postdoc, what I personally consider is the person's ideas, research potential, and skills. I will have to work with someone closely for the next several years, and the last person I want to hire is someone sloppy and concerned only with career success. Nearly all postdoc advisors that I know feel the same way, and that's because our incentive is to bring someone in who is a strong scientist. When a PI interviews for a postdoc, they talk to the person about ideas, listen to them present their own research, and read their papers. They may be impressed by the quantity of work the candidate has accomplished, but only in cases where that work is well-done and on an exciting topic. If you believe that PIs are motivated at all by scientific goals – and perhaps that's a question for some people at this cynical juncture, but it's certainly not one for me – then I think you have to believe that they will hire with those goals in mind.

Monday, March 28, 2016

Should we always bring out our nulls?

tl;dr: Thinking about projects that aren't (and may never be) finished. Should they necessarily be published?

So, the other day there was a very nice conversation on twitter, started by Micah Allen and focusing on people clearing out their file-drawers and describing null findings. The original inspiration was a very interesting paper about one lab's file drawer, in which we got insight into the messy state of the evidence the lab had collected prior to its being packaged into conventional publications.

The broader idea, of course, is that – since they don't fit as easily into conventional narratives of discovery – null findings are much less often published than positive findings. This publication bias then leads to an inflation of effect sizes, with many negative consequences downstream. And the response to the problem of publication bias then appears to be simple: publish findings regardless of statistical significance, removing the bias in the literature. Hence, #bringoutyernulls.

This narrative is a good one and an important one. But whenever the publication bias discussion comes up, I have a contrarian instinct that I have a hard time suppressing. I've written about this issue before, and in that previous piece I tried to articulate the cost-benefit calculation: while suppressing publication has a cost in terms of bias, publication itself also has a very significant cost to both authors (in writing, revising, and even funding publication) and readers (in sorting through and interpreting the literature). There really is junk, the publication of which would be a net negative – whether because of errors or irrelevance. But today I want to talk about something else that bothers me about the analysis of publication bias I described above.

Thursday, March 10, 2016

Limited support for an app-based intervention

tl;dr: I reanalyzed a recent field-trial of a math-learning app. The results differ by analytic strategy, suggesting the importance of preregistration.

Last year, Berkowitz et al. published a randomized controlled trial of a learning app. Children were randomly assigned to math and reading app groups; their learning outcomes on standardized math and reading tests were assessed after a period of app usage. A math anxiety measure was also collected from children's parents. The authors wrote that:

The intervention, short numerical story problems delivered through an iPad app, significantly increased children’s math achievement across the school year compared to a reading (control) group, especially for children whose parents are habitually anxious about math.
I got excited about this finding because I have recently been trying to understand the potential of mobile and tablet apps for intervention at home, but when I dug into the data I found that not all views of the dataset supported the success of the intervention. That's important because this was a well-designed, well-conducted trial. But the basic randomization to condition did not produce differences in outcome, as you can see in the main figure of my reanalysis.



My extensive audit of the dataset is posted here, with code and their data here. (I really appreciate that the authors shared their raw data so that I could do this analysis – this is a huge step forward for the field!). Quoting from my report: 
In my view, the Berkowitz et al. study does not show that the intervention as a whole was successful, because there was no main effect of the intervention on performance. Instead, it shows that – in some analyses – more use of the math app was related to greater growth in math performance, a dose-response relationship that is subject to significant endogeneity issues (because parents who use math apps more are potentially different from those who don’t). In addition, there is very limited evidence for a relationship of this growth to math anxiety. In sum, this is a well-designed study that nevertheless shows only tentative support for an app-based intervention.
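To make the distinction in that summary concrete, here's a schematic of the two comparisons using simulated stand-in data; the variable names are invented and are not the columns in the authors' shared dataset.

```r
set.seed(1)
# Invented stand-in data, not the actual Berkowitz et al. variables
kids <- data.frame(
  condition    = rep(c("math", "reading"), each = 100),  # random assignment
  app_sessions = rpois(200, 20),                         # how much the app was used
  math_gain    = rnorm(200)                              # growth in math performance
)

# (1) Intent-to-treat: does random assignment alone predict math gains?
summary(lm(math_gain ~ condition, data = kids))

# (2) Dose-response: is more usage associated with larger gains? This comparison
#     is observational, so it is subject to endogeneity -- families who use the
#     math app more may differ in other ways from families who use it less.
summary(lm(math_gain ~ app_sessions, data = kids))
```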
Here's a link to my published comment (which came out today), and here's Berkowitz et al.'s very classy response. Their final line is:
We welcome debate about data analysis and hope that this discussion benefits the scientific community.

Monday, December 14, 2015

The ManyBabies Project

tl;dr: Introducing and organizing a ManyLabs study for infancy research. Please comment or email me (mcfrank (at) stanford.edu) if you would like to join the discussion list or contribute to the project. 

Introduction

The last few years have seen increasing acknowledgement that there are flaws in the published scientific literature – in psychology and elsewhere (e.g., Ioannidis, 2005). Even more worrisome is that self-corrective processes are not as fast or as reliable as we might hope. For example, in the reproducibility project, which was published this summer (RPP, project page here), 100 papers were sampled from top journals, and one replication of each was conducted. This project revealed a disturbingly low rate of success for seemingly well-powered replications. And even more disturbing, although many of the target papers had a large impact, most still had not been replicated independently seven years later (outside of RPP). 

I am worried that the same problems affect developmental research. The average infancy study – including many I've worked on myself – has the issues we've identified in the rest of the psychology literature: low power, small samples, and undisclosed analytic flexibility. Add to this the fact that many infancy findings are never replicated, and even those that are replicated may show variable results across labs. All of these factors lead to a situation where many of our empirical findings are too weak to build theories on.

In addition, there is a second, more infancy-specific problem that I am also worried about. Small decisions in infancy research – anything from the lighting in the lab to whether the research assistant has a beard – may potentially affect data quality, because of the sensitivity of infants to minor variations in the environment. In fact, many researchers believe that there is huge intrinsic variability between developmental labs, because of unavoidable differences in methods and populations (hidden moderators). These beliefs lead to the conclusion that replication research is more difficult and less reliable with infants, but we don't have data that bear one way or the other on this question.