Friday, November 4, 2016

Don't bar barplots, but use them cautiously

Should we outlaw the commonest visualization in psychology? The hashtag #barbarplots has been introduced as part of a systematic campaign to promote a ban on bar graphs. The argument is simple: barplots mask the distributional form of the data, and plenty of more flexible and precise alternatives exist, including boxplots, violin plots, and scatter plots. All of these show the distributional characteristics of a dataset more effectively than a bar plot.
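To make that concrete, here's a minimal sketch (in Python with matplotlib, using simulated data rather than anything from the #barbarplots materials) of the comparison the campaign has in mind: the same two groups shown as a bar graph of means and as violin plots with the raw points overlaid.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data: two groups with the same mean but very different distributions.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=1.0, scale=0.3, size=50)              # unimodal
group_b = np.concatenate([rng.normal(0.2, 0.1, 25),
                          rng.normal(1.8, 0.1, 25)])            # bimodal, mean ~1.0

fig, (ax_bar, ax_violin) = plt.subplots(1, 2, figsize=(8, 3))

# Bar plot of means: the two groups look identical.
ax_bar.bar([0, 1], [group_a.mean(), group_b.mean()], tick_label=["A", "B"])
ax_bar.set_title("Bar plot of means")

# Violin plot plus raw points: the bimodality in group B is obvious.
ax_violin.violinplot([group_a, group_b], positions=[0, 1], showmeans=True)
for pos, data in zip([0, 1], [group_a, group_b]):
    jitter = rng.normal(0, 0.03, data.size)
    ax_violin.scatter(np.full(data.size, float(pos)) + jitter, data, s=8, alpha=0.5)
ax_violin.set_xticks([0, 1])
ax_violin.set_xticklabels(["A", "B"])
ax_violin.set_title("Violins + raw data")

plt.tight_layout()
plt.show()
```

The point isn't the particular plotting library; it's that the bar panel hides exactly the structure (here, bimodality) that the other panel makes obvious.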

Every time the issue gets discussed on twitter, I get a little bit rant-y; this post is my attempt to explain why. It's not because I fundamentally disagree with the argument. Barplots do mask important distributional facts about datasets. But there's more we have to take into account.

Friday, July 22, 2016

Preregister everything

Which methodological reforms will be most useful for increasing reproducibility and replicability? I've gone back and forth on this blog about a number of possible reforms to our methodological practices, and I've been particularly ambivalent in the past about preregistration, the process of registering methodological and analytic decisions prior to data collection. In a post from about three years ago, I worried that preregistration was too time-consuming for small-scale studies, even if it was appropriate for large-scale ones. And last year, I worried about whether preregistration validates the practice of running (and publishing) one-off studies, rather than cumulative study sets. I now think these worries were overblown, and that they resulted from my lack of understanding of the process.

Instead, I want to argue here that we should be preregistering every experiment we do. The cost is extremely low and the benefits – both to the research process and to the credibility of our results – are substantial. Over the past few months, my lab has begun to preregister every study we run. You should too.

The key insights for me were:
  1. Different preregistrations can have different levels of detail. For some studies, you write down "we're going to run 24 participants in each condition, and exclude them if they don't finish." For others you specify the full analytic model and the plots you want to make. But there is no study for which you know nothing ahead of time. 
  2. You can save a ton of time by having default analytic practices that don't need to be registered every time. For us, these live on our lab wiki (which is private, but I've put a copy here); a toy sketch of what one such default might look like appears after this list.
  3. It helps me get confirmation on what's ready to run. If a study is registered, then I know we're ready to collect data. I especially like the interface on AsPredicted, which asks coauthors to sign off before the registration goes through. (This also, incidentally, makes some authorship assumptions explicit.)
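As promised above, here's a toy sketch of the kind of default analysis that might live in such a document. This is not our actual lab code; the column names (condition, finished, rt) are made-up placeholders for whatever a particular study uses.

```python
import pandas as pd
from scipy import stats

def default_analysis(df: pd.DataFrame) -> pd.Series:
    """A hypothetical 'default' pipeline: apply the standing exclusion rule,
    then run the standing confirmatory test on the primary measure."""
    # Default exclusion rule: drop participants who did not finish.
    complete = df[df["finished"]]

    # Default confirmatory test: two-sample comparison of the dependent measure.
    a = complete.loc[complete["condition"] == "experimental", "rt"]
    b = complete.loc[complete["condition"] == "control", "rt"]
    t, p = stats.ttest_ind(a, b)

    return pd.Series({"n_excluded": len(df) - len(complete), "t": t, "p": p})
```

Once defaults like these are written down somewhere stable, the per-study registration only needs to say what deviates from them.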

Tuesday, July 12, 2016

Minimal nativism

(After blogging a little less in the last few months, I'm trying out a new idea: I'm going to write a series of short posts about theoretical ideas I've been thinking about.)

Is human knowledge built from a set of perceptual primitives combined according to the statistical structure of the environment, or does it instead rest on a foundation of pre-existing, universal concepts? The question of innateness is likely the oldest and most controversial in developmental psychology (think Plato vs. Aristotle, Locke vs. Descartes). In modern developmental work, this question so bifurcates the research literature that it can often feel like scientists are playing for different "teams," with incommensurable assumptions, goals, and even methods. But these divisions have a profoundly negative effect on our science. Throughout my research career, I've bounced back and forth between research groups and even institutions that are often seen as playing on different teams (even if the principals involved personally hold much more nuanced positions). Yet it seems obvious that neither side has sole claim to the truth. What does a middle position look like?

One possibility is a minimal nativist position. This term is developed in Noah Goodman and Tomer Ullman's work, showing up first in a very nice paper called Learning a Theory of Causality.* In that paper, they write:
... this [work] suggests a novel take on nativism—a minimal nativism—in which strong but domain-general inference and representational resources are aided by weaker, domain-specific perceptual input analyzers.
This statement comes in the context of the authors' proposal that infants' theory of causal reasoning – often considered a primary innate building block of cognition – could in principle be constructed by a probabilistic learner. But that learner would still need some starting point; in particular, here the authors' learner had access to 1) a logical language of thought and 2) some basic information about causal interventions, perhaps from the infant's innate knowledge about contact causality or the actions of social agents (these are the "input analyzers" in the quote above).
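To make the division of labor concrete, here's a deliberately tiny sketch – mine, not Goodman and Ullman's actual model – of what "strong but domain-general inference" amounts to: generic Bayesian updating over a handful of candidate hypotheses, where the observations fed to the learner stand in for whatever the input analyzers deliver.

```python
import numpy as np

# A toy, domain-general Bayesian learner: nothing here is specific to causality.
# Each hypothesis assigns a probability to observing a "positive" event;
# the prior over hypotheses is uniform.
hypotheses = {"h_weak": 0.2, "h_medium": 0.5, "h_strong": 0.8}
prior = np.full(len(hypotheses), 1.0 / len(hypotheses))

# Pretend these observations came from the input analyzers (1 = event observed).
observations = [1, 1, 0, 1, 1]

posterior = prior.copy()
for obs in observations:
    likelihood = np.array([p if obs else 1 - p for p in hypotheses.values()])
    posterior *= likelihood
    posterior /= posterior.sum()   # renormalize: Bayes' rule, applied trial by trial

for name, prob in zip(hypotheses, posterior):
    print(f"P({name} | data) = {prob:.2f}")
```

Everything domain-specific in this toy example is hidden in where the observations come from – which is roughly the role the quote assigns to the perceptual input analyzers.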

Tuesday, June 21, 2016

Reproducibility and experimental methods posts

In celebration of the third anniversary of this blog, I'm collecting some of my posts on reproducibility. I didn't initially anticipate that methods and the "reproducibility crisis" in psychology would be my primary blogging topic, but it's become a huge part of what I write about on a day-to-day basis.

Here are my top four posts in this sequence:


Then I've also written substantially about a number of other topics, including publication incentives and the file-drawer problem:


The blog has been very helpful for me in organizing and communicating my thoughts, as well as for collecting materials for teaching reproducible research. Hoping to continue thinking about these topics in the future, even as I move back to discussing more developmental and cognitive science topics. 

Sunday, June 5, 2016

An adversarial test for replication success

(tl;dr: I argue that the only way to tell if a replication study was successful is by considering the theory that motivated the original.)

Psychology is in the middle of a sea change in its attitudes towards direct replication. Despite the value of direct replications in providing evidence for the reliability of a particular experimental finding, the incentives to conduct them have typically been limited. Increasingly, however, journals and funding agencies value these sorts of efforts. One major challenge has been evaluating the success of direct replication studies. In short, how do we know if the finding is the same?

There has been limited consensus on this issue, so different projects have used a diversity of methods. The RP:P 100-study replication project reports several indicators of replication success, including 1) the statistical significance of the replication, 2) whether the original effect size lies within the confidence interval of the replication, 3) the relationship between the original and replication effect sizes, 4) the meta-analytic estimate of effect size combining both, and 5) a subjective assessment of replication by the team. Mostly these indicators hung together, though there were numerical differences.

Several of these criteria are flawed from a technical perspective. As Uri Simonsohn points out in his "Small Telescopes" paper, as the power of the replication study goes to infinity, the replication will essentially always be statistically significant, even if it's finding a very small effect that's quite different from the original. And similarly, as N in the original study goes to zero (if it's very underpowered), it gets harder and harder to differentiate its effect size from any other, because of its wide confidence interval. So both the statistical significance of the replication and the comparison of effect sizes have notable flaws.*
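As a concrete illustration of criterion (2) above – and of why it degrades when the original study is small – here's a toy calculation. It's a sketch with made-up numbers, using a standard approximation to the standard error of Cohen's d, not any project's actual code.

```python
import numpy as np
from scipy import stats

def d_confidence_interval(d, n1, n2, level=0.95):
    """Approximate confidence interval for Cohen's d from a two-group design."""
    se = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    z = stats.norm.ppf(1 - (1 - level) / 2)
    return d - z * se, d + z * se

# Hypothetical numbers: original d = 0.60 with n = 20 per group;
# replication d = 0.15 with n = 200 per group.
original_d = 0.60
rep_lo, rep_hi = d_confidence_interval(d=0.15, n1=200, n2=200)
print(f"Replication 95% CI: [{rep_lo:.2f}, {rep_hi:.2f}]")
print("Original estimate inside replication CI:", rep_lo <= original_d <= rep_hi)

# The converse check shows the problem with tiny originals: with n = 20 per group,
# the original's CI is so wide that it's consistent with almost any effect.
orig_lo, orig_hi = d_confidence_interval(d=0.60, n1=20, n2=20)
print(f"Original 95% CI: [{orig_lo:.2f}, {orig_hi:.2f}]")
```

Plugging in different sample sizes shows the pattern described above: the replication's interval shrinks as its sample grows, while a small original's interval is wide enough to be consistent with almost anything.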

Monday, April 25, 2016

Misperception of incentives for publication

There's been a lot of conversation lately about negative incentives in academic science. A good example of this is Xenia Schmalz's nice recent post. The basic argument is that professional success comes from publishing a lot and publishing quickly, but scientific values are best served by doing slower, more careful work. There's perhaps some truth to this argument, but it overstates the misalignment between the incentives for scientific and professional success. I suspect that people believe quantity matters more than quality, even if the facts are the opposite.

Let's start with the (hopefully uncontroversial) observation that number of publications will be correlated at some magnitude with scientific progress. That's because for the most part, if you haven't done any research you're not likely to be able to publish, and if you have made a true advance it should be relatively easier to publish.* So there will be some correlation between publication record and theoretical advances.

Now consider professional success. When we talk about success, we're mostly talking about hiring decisions. Though there's something to be said about promotion, grants, and awards as well, I'll focus here on hiring.** Getting a postdoc requires the decision of a single PI, while faculty hiring generally depends on committee decisions. It seems to me that many people believe these hiring decisions come down to the weight of the CV. That doesn't square with either my personal experience or the incentive structure of the situation. My experience suggests that the quality and importance of the research are paramount, not the quantity of publications. And more substantively, the incentives surrounding hiring also often favor good work.***

At the level of hiring a postdoc, what I personally consider is the person's ideas, research potential, and skills. I will have to work with someone closely for the next several years, and the last person I want to hire is someone sloppy and concerned only with career success. Nearly all postdoc advisors that I know feel the same way, and that's because our incentive is to bring someone in who is a strong scientist. When a PI interviews for a postdoc, they talk to the person about ideas, listen to them present their own research, and read their papers. They may be impressed by the quantity of work the candidate has accomplished, but only in cases where that work is well-done and on an exciting topic. If you believe that PIs are motivated at all by scientific goals – and perhaps that's a question for some people at this cynical juncture, but it's certainly not one for me – then I think you have to believe that they will hire with those goals in mind.

Thursday, April 14, 2016

Was Piaget a Bayesian?

tl;dr: Analogies between Piaget's theory of development and formal elements in the Bayesian framework.


Intro

I'm co-teaching a course with Alison Gopnik at Berkeley this quarter. It's called "What Changes?" and the goal is to revisit some basic ideas about what drives developmental change. Here's the syllabus, if you're interested. As part of the course, we read the first couple of chapters of Flavell's brilliant book, "The Developmental Psychology of Jean Piaget." I had come into contact with Piagetian theory before, of course, but I'd never spent that much time engaging with the core ideas. In fact, I don't actually teach Piaget in my intro to developmental psychology course. Although he's clearly part of the historical foundations of the discipline, to a first approximation, a lot of what he said turned out to be wrong.

In my own training and work, I've been inspired by probabilistic models of cognition and cognitive development. These models use the probability calculus to represent degrees of belief in different hypotheses, and have been influential in a wide range of domains from perception and decision-making to communication and social cognition.1 But as I have gotten more interested in the measurement of developmental change (e.g., in Wordbank or MetaLab, two new projects I've been involved in recently), I've become a bit more frustrated with these probabilistic tools, since there hasn't been as much progress in using them to understand children's developmental change (in contrast to progress characterizing the nature of particular representations). Hence my desire to teach this course and understand what other theoretical frameworks had to contribute.
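For readers who haven't seen the formalism, the core machinery of these models is just Bayes' rule, which turns a prior degree of belief in a hypothesis h into a posterior degree of belief after observing data d:

```latex
P(h \mid d) = \frac{P(d \mid h)\, P(h)}{\sum_{h'} P(d \mid h')\, P(h')}
```

The prior P(h) encodes what the learner brings to the task, and the likelihood P(d | h) encodes how strongly the data bear on each hypothesis.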

Despite the seeming distance between the modern Bayesian framework and Piaget, reading Flavell's synthesis I was surprised to see that many of the key Piagetian concepts actually had nice parallels in Bayesian theory. So this blogpost is my attempt to translate some of these key theoretical concepts into a Bayesian vocabulary.2 It owes a lot to our class discussion, which was really exciting. For me, the translation highlights significant areas of overlap between Piagetian and Bayesian thinking, as well as some nice places where the Bayesian theory could grow.

Monday, March 28, 2016

Should we always bring out our nulls?

tl;dr: Thinking about projects that aren't (and may never be) finished. Should they necessarily be published?

So, the other day there was a very nice conversation on twitter, started by Micah Allen and focusing on people clearing out their file drawers and describing null findings. The original inspiration was a very interesting paper about one lab's file drawer, which gave insight into the messy state of the evidence the lab had collected before it was packaged into conventional publications.

The broader idea, of course, is that – since they don't fit as easily into conventional narratives of discovery – null findings are much less often published than positive findings. This publication bias then leads to an inflation of effect sizes, with many negative consequences downstream. And the response to the problem of publication bias then appears to be simple: publish findings regardless of statistical significance, removing the bias in the literature. Hence, #bringoutyernulls.
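To see why selective publication inflates effect sizes, here's a toy simulation – a sketch with made-up numbers, not data from any of the studies discussed. It repeatedly runs an underpowered two-group experiment with a small true effect and compares the average estimated effect across all runs with the average among only the "publishable" (p < .05) runs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_d, n_per_group, n_sims = 0.2, 20, 5000

estimates, significant = [], []
for _ in range(n_sims):
    control = rng.normal(0, 1, n_per_group)
    treatment = rng.normal(true_d, 1, n_per_group)
    t, p = stats.ttest_ind(treatment, control)
    # Estimated effect size (Cohen's d, using the pooled SD).
    pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
    estimates.append((treatment.mean() - control.mean()) / pooled_sd)
    significant.append(p < .05)

estimates, significant = np.array(estimates), np.array(significant)
print(f"True effect:                     d = {true_d:.2f}")
print(f"Mean estimate, all studies:      d = {estimates.mean():.2f}")
print(f"Mean estimate, significant only: d = {estimates[significant].mean():.2f}")
```

With settings like these, the significant-only average comes out several times larger than the true effect – exactly the inflation that the file-drawer argument worries about.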

This narrative is a good one and an important one. But whenever the publication bias discussion comes up, I have a contrarian instinct that I have a hard time suppressing. I've written about this issue before, and in that previous piece I tried to articulate the cost-benefit calculation: while suppressing publication has a cost in terms of bias, publication itself also has a very significant cost to both authors (in writing, revising, and even funding publication) and readers (in sorting through and interpreting the literature). There really is junk, the publication of which would be a net negative – whether because of errors or irrelevance. But today I want to talk about something else that bothers me about the analysis of publication bias I described above.

Thursday, March 10, 2016

Limited support for an app-based intervention

tl;dr: I reanalyzed a recent field-trial of a math-learning app. The results differ by analytic strategy, suggesting the importance of preregistration.

Last year, Berkowitz et al. published a randomized controlled trial of a learning app. Children were randomly assigned to a math or a reading app group; their learning outcomes on standardized math and reading tests were assessed after a period of app usage. A math anxiety measure was also collected from children's parents. The authors wrote that:

The intervention, short numerical story problems delivered through an iPad app, significantly increased children’s math achievement across the school year compared to a reading (control) group, especially for children whose parents are habitually anxious about math.
I got excited about this finding because I have recently been trying to understand the potential of mobile and tablet apps for intervention at home. But when I dug into the data, I found that not all views of the dataset supported the success of the intervention. That's important because this was a well-designed, well-conducted trial – yet the basic randomization to condition did not produce differences in outcome, as you can see in the main figure of my reanalysis.



My extensive audit of the dataset is posted here, with code and their data here. (I really appreciate that the authors shared their raw data so that I could do this analysis – this is a huge step forward for the field!) Quoting from my report:
In my view, the Berkowitz et al. study does not show that the intervention as a whole was successful, because there was no main effect of the intervention on performance. Instead, it shows that – in some analyses – more use of the math app was related to greater growth in math performance, a dose-response relationship that is subject to significant endogeneity issues (because parents who use math apps more are potentially different from those who don’t). In addition, there is very limited evidence for a relationship of this growth to math anxiety. In sum, this is a well-designed study that nevertheless shows only tentative support for an app-based intervention.
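For readers who want to check claims like "no main effect of the intervention" against the posted data themselves, here's a minimal sketch of the intention-to-treat comparison involved. This is not the published reanalysis code, and the file and column names (group, math_pre, math_post) are placeholders for whatever the released dataset actually uses.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names; substitute the ones from the released data.
df = pd.read_csv("berkowitz_data.csv")

# Did random assignment to the math app (vs. the reading app) affect math outcomes,
# controlling for where children started? Unlike usage-based analyses, this
# comparison doesn't depend on how much families chose to use the app.
model = smf.ols("math_post ~ math_pre + C(group)", data=df).fit()
print(model.summary())
```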
Here's a link to my published comment (which came out today), and here's Berkowitz et al.'s very classy response. Their final line is:
We welcome debate about data analysis and hope that this discussion benefits the scientific community.

Thursday, February 25, 2016

Town hall on methodological issues

Our department just had its first ever town hall event. The goal was to have an open discussion of issues surrounding reproducibility and other methodological challenges. Here's the announcement: 
Please join us for a special Psychology Colloquium event: Town Hall on Contemporary Methodological Issues in Psychological Science.

Professors Lee Ross, Mike Frank, and Russ Poldrack will each give a ten-minute talk, sharing their perspectives on contemporary methodological issues within their respective fields. There will be opportunities for both small and large group discussion.
I gave a talk on my evolving views on reproducibility, many of them summarized here, focusing specifically on the issue that individual studies tend not to be definitive. I advocated for a series of changes to our default practice, including:
  1. Larger Ns
  2. Multiple internal replications
  3. Measurement and estimation, rather than statistical significance
  4. Experimental “debugging” tools (e.g., manipulation checks, negative/positive controls)
  5. Preregistration where appropriate 
  6. Everything open – materials, data, code – by default
I then illustrated these points with a couple of recent examples of work I've been involved in. If you're interested in seeing the presentation, my slides are available here. Overall, the town hall was a real success, with a lot of lively discussion and plenty of students voicing their concerns.

Thursday, February 18, 2016

Explorations in hierarchical drift diffusion modeling

tl;dr: Adventures in using different platforms/methods to fit drift diffusion models to data. 

The drift diffusion model (DDM) is increasingly a mainstay of research on decision-making, both in neuroscience and cognitive science. The classic DDM describes decisions as a noisy evidence-accumulation process – essentially a random walk toward one of two response boundaries – that yields a joint distribution over both accuracies and reaction times. This kind of joint distribution is really useful for capturing tasks where there could be speed-accuracy tradeoffs, and hence where classic univariate analyses are uninformative. Here's the classic DDM picture, this version from Vandekerckhove, Tuerlinckx, & Lee (2010), who have a nice tutorial on hierarchical DDMs:


We recently started using DDM to try and understand decision-making behavior in the kinds of complex inference tasks that my lab and I have been studying for the past couple of years. For example, in one recently-submitted paper, we use DDM to look at decision processes for inhibition, negation, and implicature, trying to understand the similarities and differences in these three tasks:


We had initially hypothesized that performance in the negation and implicature tasks (our target tasks) would correlate with inhibition performance. It didn't, and what's more, the data seemed to show very different patterns across the three tasks. So we turned to DDM to understand a bit more about the decision process for each of these tasks.* Also, in a second submitted paper, we looked at decision-making during "scalar implicatures," the inference that "I ate some of the cookies" implies that I didn't eat all of them. In both of these cases, we wanted to know what was going on in these complex, failure-prone inferences.

An additional complexity was that we are interested in the development of these inferences in children. DDM has not been used much with children, usually because of the large number of trials that it seems to require. But we were inspired by a recent paper by Ratcliff (one of the key figures in the DDM literature), which used DDMs on data from elementary-school-aged children. And since we have been using iPad experiments to get RTs and accuracies from preschoolers, we thought we'd try to do these analyses with data from both kids and adults.

But... it turns out that it's not trivial to fit DDMs (especially the more interesting variants) to data, so I wanted to use this blogpost to document my process in exploring different ecosystems for DDM and hierarchical DDM.
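As background for that exploration, here's what the generative process itself looks like: a minimal simulation of a basic DDM (drift rate, boundary separation, starting point, and non-decision time), written from the standard description of the model rather than taken from any particular package.

```python
import numpy as np

def simulate_ddm(n_trials, drift=0.3, boundary=1.0, start=0.5,
                 non_decision=0.3, noise=1.0, dt=0.001, max_time=5.0, seed=None):
    """Simulate reaction times and choices from a basic drift diffusion model.

    Evidence starts at start * boundary and accumulates with mean `drift` and
    standard deviation `noise` per unit time until it hits 0 or `boundary`.
    """
    rng = np.random.default_rng(seed)
    rts, responses = [], []
    for _ in range(n_trials):
        x, t = start * boundary, 0.0
        while 0.0 < x < boundary and t < max_time:
            x += drift * dt + noise * np.sqrt(dt) * rng.standard_normal()
            t += dt
        rts.append(t + non_decision)
        responses.append(1 if x >= boundary else 0)  # 1 = upper-boundary response
    return np.array(rts), np.array(responses)

rts, responses = simulate_ddm(n_trials=1000, drift=0.3, seed=1)
print(f"Mean RT: {rts.mean():.2f} s, upper-boundary proportion: {responses.mean():.2f}")
```

Fitting is the inverse problem – recovering parameters like drift and boundary from observed RTs and choices – and that's what the rest of this post is about.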