Monday, March 28, 2016

Should we always bring out our nulls?

tl;dr: Thinking about projects that aren't (and may never be) finished. Should they necessarily be published?

So, the other day there was a very nice conversation on twitter, started by Micah Allen and focusing on people clearing out their file-drawers and describing null findings. The original inspiration was a very interesting paper about one lab's file drawer, in which we got insight into the messy state of the evidence the lab had collected prior to its being packaged into conventional publications.

The broader idea, of course, is that – since they don't fit as easily into conventional narratives of discovery – null findings are much less often published than positive findings. This publication bias then leads to an inflation of effect sizes, with many negative consequences downstream. And the response to the problem of publication bias then appears to be simple: publish findings regardless of statistical significance, removing the bias in the literature. Hence, #bringoutyernulls.
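To make the inflation concrete, here is a minimal simulation sketch. It is not from the original discussion, and every number in it (a true effect of 0.2 standard deviations, 20 participants per group, a z-test with known unit variance) is an assumption chosen for illustration. It compares the average effect across all simulated studies with the average across only the "significant" ones.

```python
import random
import statistics

def simulate_study(true_d, n, rng):
    """Simulate one two-group study; return (observed effect, significant?)."""
    treat = [rng.gauss(true_d, 1.0) for _ in range(n)]
    ctrl = [rng.gauss(0.0, 1.0) for _ in range(n)]
    diff = statistics.fmean(treat) - statistics.fmean(ctrl)
    # Variance is known to be 1 here, so the SE of the difference is sqrt(2/n).
    z = diff / (2.0 / n) ** 0.5
    return diff, abs(z) > 1.96

def mean_effects(true_d=0.2, n=20, n_studies=5000, seed=1):
    """Return (mean effect over all studies, mean effect over significant ones)."""
    rng = random.Random(seed)
    results = [simulate_study(true_d, n, rng) for _ in range(n_studies)]
    all_effects = [d for d, _ in results]
    published = [d for d, sig in results if sig]
    return statistics.fmean(all_effects), statistics.fmean(published)

all_mean, published_mean = mean_effects()
print(f"all studies: {all_mean:.2f}, significant only: {published_mean:.2f}")
```

In this setup the full set of studies recovers the true effect, while the significant-only subset overstates it considerably, because with small samples only unusually large observed differences clear the significance threshold.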

This narrative is a good one and an important one. But whenever the publication bias discussion comes up, I have a contrarian instinct that I have a hard time suppressing. I've written about this issue before, and in that previous piece I tried to articulate the cost-benefit calculation: while suppressing publication has a cost in terms of bias, publication itself also has a very significant cost to both authors (in writing, revising, and even funding publication) and readers (in sorting through and interpreting the literature). There really is junk, the publication of which would be a net negative – whether because of errors or irrelevance. But today I want to talk about something else that bothers me about the analysis of publication bias I described above.

Thursday, March 10, 2016

Limited support for an app-based intervention

tl;dr: I reanalyzed a recent field-trial of a math-learning app. The results differ by analytic strategy, suggesting the importance of preregistration.

Last year, Berkowitz et al. published a randomized controlled trial of a learning app. Children were randomly assigned to math and reading app groups; their learning outcomes on standardized math and reading tests were assessed after a period of app usage. A math anxiety measure was also collected for children’s parents. The authors wrote that:

The intervention, short numerical story problems delivered through an iPad app, significantly increased children’s math achievement across the school year compared to a reading (control) group, especially for children whose parents are habitually anxious about math.
I got excited about this finding because I have recently been trying to understand the potential of mobile and tablet apps for intervention at home, but when I dug into the data I found that not all views of the dataset supported the success of the intervention. That's important because this was a well-designed, well-conducted trial. But the basic randomization to condition did not produce differences in outcome, as you can see in the main figure of my reanalysis.



My extensive audit of the dataset is posted here, with code and their data here. (I really appreciate that the authors shared their raw data so that I could do this analysis – this is a huge step forward for the field!) Quoting from my report:
In my view, the Berkowitz et al. study does not show that the intervention as a whole was successful, because there was no main effect of the intervention on performance. Instead, it shows that – in some analyses – more use of the math app was related to greater growth in math performance, a dose-response relationship that is subject to significant endogeneity issues (because parents who use math apps more are potentially different from those who don’t). In addition, there is very limited evidence for a relationship of this growth to math anxiety. In sum, this is a well-designed study that nevertheless shows only tentative support for an app-based intervention.
Here's a link to my published comment (which came out today), and here's Berkowitz et al.'s very classy response. Their final line is:
We welcome debate about data analysis and hope that this discussion benefits the scientific community.

Thursday, February 25, 2016

Town hall on methodological issues

Our department just had its first ever town hall event. The goal was to have an open discussion of issues surrounding reproducibility and other methodological challenges. Here's the announcement: 
Please join us for a special Psychology Colloquium event: Town Hall on Contemporary Methodological Issues in Psychological Science.

Professors Lee Ross, Mike Frank, and Russ Poldrack will each give a ten-minute talk, sharing their perspectives on contemporary methodological issues within their respective fields. There will be opportunities for both small and large group discussion.
I gave a talk on my evolving views on reproducibility, many summarized here, specifically focusing on the issue that individual studies tend not to be definitive. I advocated for a series of changes to our default practice, including: 
  1. Larger Ns
  2. Multiple internal replications
  3. Measurement and estimation, rather than statistical significance
  4. Experimental “debugging” tools (e.g., manipulation checks, negative/positive controls)
  5. Preregistration where appropriate 
  6. Everything open – materials, data, code – by default
I then illustrated this with a couple of recent examples of work I've been involved in. If you're interested in seeing the presentation, my slides are available here. Overall, the town hall was a real success, with a lot of lively discussion and plenty of student voices discussing their concerns. 

Thursday, February 18, 2016

Explorations in hierarchical drift diffusion modeling

tl;dr: Adventures in using different platforms/methods to fit drift diffusion models to data. 

The drift diffusion model (DDM) is increasingly a mainstay of research on decision-making, both in neuroscience and cognitive science. The classic DDM defines a pseudo random-walk decision process that describes a distribution on both accuracies and reaction times. This kind of joint distribution is really useful for capturing tasks where there could be speed-accuracy tradeoffs, and hence where classic univariate analyses are uninformative. Here's the classic DDM picture, this version from Vandekerckhove, Tuerlinckx, & Lee (2010), who have a nice tutorial on hierarchical DDMs:


We recently started using DDM to try and understand decision-making behavior in the kinds of complex inference tasks that my lab and I have been studying for the past couple of years. For example, in one recently-submitted paper, we use DDM to look at decision processes for inhibition, negation, and implicature, trying to understand the similarities and differences in these three tasks:


We had initially hypothesized that performance in the negation and implicature tasks (our target tasks) would correlate with inhibition performance. It didn't, and what's more the data seemed to show very different patterns across the three tasks. So we turned to DDM to understand a bit more of the decision process for each of these tasks.* Also, in a second submitted paper, we looked at decision-making during "scalar implicatures," the inference that "I ate some of the cookies" implies that I didn't eat all of them. In both of these cases, we wanted to know what was going on in these complex, failure-prone inferences.

An additional complexity was that we are interested in the development of these inferences in children. DDM has not been used much with children, usually because of the large number of trials that DDM seems to require. But we were inspired by a recent paper by Ratcliff (one of the important figures in DDMs), which used DDMs for data from elementary-aged children. And since we have been using iPad experiments to get RTs and accuracies for preschoolers, we thought we'd try and do these analyses with data from both kids and adults.

But... it turns out that it's not trivial to fit DDMs (especially the more interesting variants) to data, so I wanted to use this blogpost to document my process in exploring different ecosystems for DDM and hierarchical DDM.
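As a point of reference for the model described above, here is a minimal sketch of the classic (non-hierarchical) DDM as a simulated random walk between two boundaries. This is only a forward simulation, not a fitting procedure. The parameter values, the unit noise scale, and the function name `simulate_ddm_trial` are all assumptions made for illustration.

```python
import random

def simulate_ddm_trial(drift, boundary, start_frac=0.5, non_decision=0.3,
                       noise_sd=1.0, dt=0.002, max_time=5.0, rng=random):
    """Simulate one trial; return (choice, rt).

    choice is 1 (upper boundary), 0 (lower boundary), or None if no
    boundary was reached within max_time seconds of decision time.
    """
    x = start_frac * boundary          # starting point of the evidence
    t = 0.0
    step_sd = noise_sd * dt ** 0.5     # noise scales with sqrt(dt)
    while 0.0 < x < boundary and t < max_time:
        x += drift * dt + rng.gauss(0.0, step_sd)
        t += dt
    if x >= boundary:
        choice = 1
    elif x <= 0.0:
        choice = 0
    else:
        choice = None
    return choice, non_decision + t

rng = random.Random(0)
trials = [simulate_ddm_trial(drift=1.0, boundary=1.5, rng=rng) for _ in range(1000)]
finished = [(c, rt) for c, rt in trials if c is not None]
accuracy = sum(c for c, _ in finished) / len(finished)
mean_rt = sum(rt for _, rt in finished) / len(finished)
print(f"accuracy = {accuracy:.2f}, mean RT = {mean_rt:.2f} s")
```

Each simulated trial yields both a choice and a reaction time, which is the joint distribution that makes the model useful when speed and accuracy trade off. Raising `boundary` in this sketch produces slower but more accurate responses.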

Monday, December 14, 2015

The ManyBabies Project

tl;dr: Introducing and organizing a ManyLabs study for infancy research. Please comment or email me (mcfrank (at) stanford.edu) if you would like to join the discussion list or contribute to the project. 

Introduction

The last few years have seen increasing acknowledgement that there are flaws in the published scientific literature – in psychology and elsewhere (e.g., Ioannidis, 2005). Even more worrisome is that self-corrective processes are not as fast or as reliable as we might hope. For example, in the reproducibility project, which was published this summer (RPP, project page here), 100 papers were sampled from top journals, and one replication of each was conducted. This project revealed a disturbingly low rate of success for seemingly well-powered replications. And even more disturbing, although many of the target papers had a large impact, most still had not been replicated independently seven years later (outside of RPP). 

I am worried that the same problems affect developmental research. The average infancy study – including many I've worked on myself – has the issues we've identified in the rest of the psychology literature: low power, small samples, and undisclosed analytic flexibility. Add to this the fact that many infancy findings are never replicated, and even those that are replicated may show variable results across labs. All of these factors lead to a situation where many of our empirical findings are too weak to build theories on.

In addition, there is a second, more infancy-specific problem that I am also worried about. Small decisions in infancy research – anything from the lighting in the lab to whether the research assistant has a beard – may potentially affect data quality, because of the sensitivity of infants to minor variations in the environment. In fact, many researchers believe that there is huge intrinsic variability between developmental labs, because of unavoidable differences in methods and populations (hidden moderators). These beliefs lead to the conclusion that replication research is more difficult and less reliable with infants, but we don't have data that bear one way or the other on this question.

Wednesday, November 25, 2015

Preventing statistical reporting errors by integrating writing and coding

tl;dr: Using RMarkdown with knitr is a nice way to decrease statistical reporting errors.

How often are there statistical reporting errors in published research? Using a new automated method for scraping APA-formatted stats out of PDFs, Nuijten et al. (2015) found that over 10% of p-values were inconsistent with the reported details of the statistical test, and 1.6% were what they called "grossly" inconsistent, i.e., the difference between the p-value and the test statistic meant that one implied statistical significance and the other did not (another summary here). Here are two key figures, first for proportion inconsistent by article and then for proportion of articles with an inconsistency:


These graphs are upsetting news. Around half of articles had at least one error by this analysis, which is not what you want from your scientific literature.* Daniel Lakens has a nice post suggesting that three errors account for many of the problems: incorrect use of < instead of =, use of one-sided tests without clear reporting as such, and errors in rounding and reporting.
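The underlying check is simple to sketch: recompute the p-value from the reported test statistic and compare it with the reported p-value. The version below is a simplified illustration, not statcheck's actual implementation. It handles only a two-sided z-test and ignores rounding of the test statistic itself. The function names and the rounding tolerance are assumptions.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def check_reported_p(z, reported_p, decimals=2, alpha=0.05):
    """Compare a reported p-value with the one implied by z.

    Returns (recomputed p, inconsistent?, grossly inconsistent?).
    """
    p = two_sided_p_from_z(z)
    # Allow for the reported value having been rounded to `decimals` places.
    inconsistent = abs(p - reported_p) > 0.5 * 10 ** -decimals
    # "Gross" means the two values fall on opposite sides of alpha.
    gross = inconsistent and ((p < alpha) != (reported_p < alpha))
    return p, inconsistent, gross

# Hypothetical reported results, for illustration only.
for z, reported in [(2.10, 0.04), (1.70, 0.04)]:
    p, bad, gross = check_reported_p(z, reported)
    print(f"z = {z:.2f}, reported p = {reported}: recomputed p = {p:.3f}, "
          f"inconsistent = {bad}, gross = {gross}")
```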

Speaking for myself, I'm sure that some of my articles have errors of this type, almost certainly from copying and pasting results from an analysis window into a manuscript (say Matlab in the old days or R now).**  The copy-paste thing is incredibly annoying. I hate this kind of slow, error-prone, non-automatable process.

So what are we supposed to do? Of course, we can and should just check our numbers, and maybe run statcheck (the R package Nuijten et al. created) on our own work as well. But there is a much better technical solution out there: write statistics into the manuscript in one executable package that automatically generates the figures, tables, and statistical results. In my opinion, doing this used to be almost as much of a pain as doing the cutting and pasting (and this is spoken as someone who writes academic papers in LaTeX!). But now the tools for writing text and code together have gotten so good that I think there's no excuse not to. 
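The core idea, independent of any particular tool, is that the reported numbers are generated from the analysis rather than typed by hand. Here is a minimal sketch of that idea in plain Python. It is an analogy to what RMarkdown's inline code does, not a description of knitr itself, and the z value and helper names are hypothetical.

```python
import math

def format_p(p):
    """Format a p-value APA-style: three decimals, no leading zero."""
    if p < 0.001:
        return "p < .001"
    return "p = " + f"{p:.3f}".lstrip("0")

def report_z_test(z):
    """Build the reported string directly from the computed statistic."""
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return f"z = {z:.2f}, {format_p(p)}"

z_observed = 2.10  # hypothetical result; in practice, computed from the data
sentence = f"The groups differed reliably ({report_z_test(z_observed)})."
print(sentence)
```

Because the sentence is rebuilt every time the analysis runs, the statistic and the p-value cannot drift apart the way copy-pasted numbers can.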


Thursday, November 5, 2015

A conversation about scale construction

(Note: this post is joint with Brent Roberts and Michael Kraus, and is cross-posted on their blogs - MK and BR).

MK: Twitter recently rolled out a polling feature that allows its users to ask and answer questions of each other. The poll feature allows polling with two possible response options (e.g., Is it Fall? Yes/No). Armed with snark and some basic training in psychometrics and scale construction, I thought it would be fun to pose the following as my first poll:



Said training suggests that, all things being equal, some people are more “Yes” or more “No” than others, so having response options that include more variety will capture more of the real variance in participant responses. To put that into an example, if I ask you whether you agree with the statement “I have high self-esteem,” a yes/no two-option response won’t capture all the true variance in people’s responses that might otherwise be captured by six options ranging from strongly disagree to strongly agree. MF/BR, is that how you would characterize your own understanding of psychometrics?

MF: Well, when I’m thinking about dependent variable selection, I tend to start from the idea that the more response options for the participant, the more bits of information are transferred. In a standard two-alternative forced-choice (2AFC) experiment with balanced probabilities, each response provides 1 bit of information. In contrast, a 4AFC provides 2 bits, an 8AFC provides 3, etc. So on this kind of reasoning, the more choices the better, as illustrated by this table from Rosenthal & Rosnow’s classic text:



For example, in one literature I am involved in, people are interested in the ability of adults and kids to associate words and objects in the presence of systematic ambiguity. In these experiments, you see several objects and hear several words, and over time the idea is that you build up some kind of links between objects and words that are consistently associated. In these experiments, initially people used 2 and 4AFC paradigms. But as the hypotheses about mechanism got more sophisticated, people shifted to using more stringent measures, like a 15AFC, which was argued to provide more information about the underlying representations.
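The bits-per-response arithmetic can be sketched in a few lines. With equally likely alternatives, the information in one response is log2 of the number of alternatives. With unbalanced probabilities it is the Shannon entropy, which is lower. The function names here are illustrative, and these figures are upper bounds that assume the participant's response actually carries signal.

```python
import math

def bits_per_response(n_alternatives):
    """Information in one response when all alternatives are equally likely."""
    return math.log2(n_alternatives)

def entropy_bits(probs):
    """Shannon entropy in bits for an arbitrary response distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

for n in (2, 4, 8, 15):
    print(f"{n}AFC: {bits_per_response(n):.2f} bits")

# A 2AFC in which one option is chosen 90% of the time carries less than 1 bit.
print(f"unbalanced 2AFC: {entropy_bits([0.9, 0.1]):.2f} bits")
```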

On the other hand, getting more information out of such a measure presumes that there is some underlying signal. In the example above, the presence of this information was relatively likely because participants had been trained on specific associations. In contrast, in the kinds of polls or judgment studies that you’re talking about, it’s more unknown whether participants have the kind of detailed representations that allow for fine-grained judgements. So if you’re asking for a judgment in general (like in #TwitterPolls or classic likert scales), how many alternatives should you use?

MK: Right, most or all of my work (and I imagine a large portion of survey research) involves subjective judgments where it isn’t known exactly how people are making their judgments and what they’d likely be basing those judgments on. So, to reiterate your own question: How many response alternatives should you use?

MF: Turns out there is some research on this question. There’s a very well-cited paper by Preston & Coleman (2000), who ask about a service rating scale for restaurants. Not the most psychological example, but it’ll do. They present different participants with different numbers of response categories, ranging from 2 to 101. Here is their primary finding:



In a nutshell, the reliability is pretty good for two categories, but it gets somewhat better up to about 7-9 options, then goes down somewhat. In addition, scales with more than 7 options are rated as slower and harder to use. Now this doesn’t mean that all psychological constructs have enough resolution to support 7 or 9 different gradations, but at least simple ratings or preference judgements seem like they might.

MK: This is great stuff! But if I’m being completely honest here, I’d say the reliabilities for just two response categories, even though they aren’t as good as they are at 7-9 options, are good enough to use. BR, I’m guessing you agree with this because of your response to my Twitter Poll:



BR: Admittedly, I used to believe that when it came to response formats, more was always better. I mean, we know that dichotomizing continuous variables is bad, so how could it be that a dichotomous rating scale (e.g., yes/no) would be as good if not superior to a 5-point rating scale? Right?

Two things changed my perspective. The first was precipitated by being forced to teach psychometrics, which is minimally on the 5th level of Dante’s Hell teaching-wise. For some odd reason at some point I did a deep dive into the psychometrics of scale response formats and found, much to my surprise, a long and robust history going all the way back to the 1920s. I’ll give two examples. Like the Preston & Coleman (2000) study that Michael cites, some old old literature had done the same thing (god forbid, replication!!!). Here’s a figure showing the test-retest reliability from Matell & Jacoby (1971), where they varied the response options from 2 to 19 on measures of values:



The picture is a little different from the internal consistencies shown in Preston & Coleman (2000), but the message is similar. There is not a lot of difference between 2 and 19. What I really liked about the old school researchers is they cared as much about validity as they did reliability--here’s their figure showing simple concurrent validity of the scales:



The numbers bounce a bit because of the small samples in each group, but the obvious take away is that there is no linear relation between scale points and validity.

The second example is from Komorita & Graham (1965). These authors studied two scales, the evaluative dimension from the Semantic Differential and the Sociability scale from the California Psychological Inventory. The former is really homogeneous, the latter quite heterogeneous in terms of content. The authors administered 2 and 6 point response formats for both measures. Here is what they found vis a vis internal consistency reliability:



This set of findings is much more interesting. When the measure is homogeneous, the rating format does not matter. When it is heterogeneous, having 6 options leads to better internal consistency. The authors’ discussion is insightful and worth reading, but I’ll just quote them for brevity: “A more plausible explanation, therefore, is that some type of response set such as an “extreme response set” (Cronbach, 1946; 1950) may be operating to increase the reliability of heterogeneous scales. If the reliability of the response set component is greater than the reliability of the content component of the scale, the reliability of the scale will be increased by increasing the number of scale points.”

Thus, the old-school psychometricians argued that increasing the number of scale point options does not affect test-retest reliability or validity. It does marginally increase internal consistency, but most likely because of “systematic error” such as response sets (e.g., consistently using extreme options or not) that add some additional internal consistency to complex constructs.

One interpretation of our modern love of multi-option rating scales is that it leads to better internal consistencies which we all believe to be a good thing. Maybe it isn’t.

MK: I have three reactions to this: First, I’m sorry that you had to teach psychometrics. Second, it’s amazing to me that all this work on scale construction and optimal item amount isn’t more widely known. Third, how come, knowing all this as you do, this is the first time I have heard you favor two-option response formats?

BR: You might think that I would have become quite the zealot for yes/no formats after coming across this literature, but you would be wrong. I continued pursuing my research efforts using 4 and 5 point rating scales ad nauseam. Old dogs and new tricks and all of that.

The second experience that has turned me toward using yes/no more often, if not by default, came as a result of working with non-WEIRD [WEIRD = Western, Educated, Industrialized, Rich, and Democratic] samples and being exposed to some of the newer, more sophisticated approaches to modeling response information in Item Response Theory. For a variety of reasons our research of late has been in samples not typically employed in most of psychology, like children, adolescents, and less literate populations than elite college students. In many of these samples, the standard 5-point likert rating of personality traits tends to blow up (psychometrically speaking). We’ve considered a number of options for simplifying the assessment to make it less problematic for these populations to rate themselves, one of which is to simplify the rating scale to yes/no.

It just so happens that we have been doing some IRT work on an assessment experiment we ran on-line where we randomly assigned people to fill out the NPI in one of three conditions--the traditional paired-comparison, 5-point likert ratings of all of the stems, and a yes/no rating of all of the NPI item stems (here’s one paper from that effort). I assumed that if we were going to turn to a yes/no format that we would need more items to net the same amount of information as a likert-style rating. So, I asked my colleague and collaborator, Eunike Wetzel, how many items you would need using a yes/no option to get the same amount of test information from a set of likert ratings of the NPI. IRT techniques allow you to estimate how much of the underlying construct a set of items captures via a test information function. What she reported back was surprising and fascinating. You get the same amount of information out of 10 yes/no ratings as you do out of 10 5-point likert scale ratings of the NPI.
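For readers who haven't met a test information function, here is a minimal sketch for the simplest relevant case: a two-parameter logistic (2PL) model for yes/no items. The choice of model and the item parameters are assumptions made for illustration. The discussion above doesn't say which IRT model was fit to the NPI data, and likert items would need a polytomous model instead.

```python
import math

def p_endorse(theta, a, b):
    """2PL probability of a 'yes' given trait level theta,
    discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of one 2PL item at trait level theta."""
    p = p_endorse(theta, a, b)
    return a * a * p * (1.0 - p)

def total_information(theta, items):
    """Test information: the sum of item information across items."""
    return sum(item_information(theta, a, b) for a, b in items)

# Hypothetical (discrimination, difficulty) pairs for four yes/no items.
items = [(1.2, -1.0), (1.0, 0.0), (1.5, 0.5), (0.8, 1.0)]

for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"theta = {theta:+.1f}: information = {total_information(theta, items):.2f}")
```

Each item is most informative near its own difficulty, so a test's information curve shows where on the trait continuum the scale measures precisely. Comparing two response formats amounts to comparing these curves.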

So, Professor Kraus, this is the source of the pithy comeback to your tweet. It seems to me that there is no dramatic loss of information, reliability, or validity when using 2-point rating scales. If you consider the benefits gained--responses will be a little quicker, fewer response set problems, and the potential to be usable in a wider population, there may be many situations in which a yes/no is just fine. Conversely, we may want to be cautious about the gain in internal consistency reliability we find in highly verbal populations, like college students, because it may arise through response sets and have no relation to validity.

MK: I appreciate this really helpful response (and that you address me so formally). Using a yes/no format has some clear advantages: it forces people to fall on one side of a scale or the other, it is quicker to answer than questions that rely on 4-7 Likert options, and it sounds (from your work, BR) like it allows scales to hold up better for non-WEIRD populations. MF, what is your reaction to this work?

MF: This is totally fascinating. I definitely see the value of using yes/no in cases where you’re working with non-WEIRD populations. We are just in the middle of constructing an instrument dealing with values and attitudes about parenting and child development and the goal is to be able to survey broader populations than the university-town parents we often talk to. So I am certainly convinced that yes/no is a valuable option for that purpose and will do a pilot comparison shortly.

On the other hand, I do want to push back on the idea that there are never cases where you would want a more graded scale. My collaborators and I have done a bunch of work now using continuous dependent variables to get graded probabilistic judgments. Two examples of this work are Kao et al. (2014) – I’m not an author on that one but I really like it – and Frank & Goodman (2012). To take an example, in the second of those papers we showed people displays with a bunch of shapes (say a blue square, blue circle, and green square) and asked them, if someone used the word “blue,” which shape do you think they would be talking about?

In those cases, using sliders or “betting” measures (asking participants to assign dollar values between 0 and 100) really did seem to provide more information per judgement than other measures. I’ve also experimented with using binary dependent variables in these tasks, and my impression is that they both converge to the same mean, but that the confidence intervals on the binary DV are much larger. In other words, if we hypothesize in these cases that participants really are encoding some sort of continuous probability, then querying it in a continuous way should yield more information.
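That intuition can be illustrated with a small simulation. This is a sketch under strong assumptions, not a reanalysis of any of the data mentioned. It assumes every participant encodes the same underlying probability, that slider responses are that probability plus modest Gaussian noise, and that binary responses are a coin flip weighted by it. The noise level, sample size, and function names are all hypothetical.

```python
import random
import statistics

def ci_half_width(values):
    """Approximate 95% CI half-width for the mean (normal approximation)."""
    return 1.96 * statistics.stdev(values) / len(values) ** 0.5

def simulate_responses(p_true=0.7, slider_noise=0.15, n=200, seed=0):
    """Return (slider responses, binary responses) for n participants."""
    rng = random.Random(seed)
    # Slider: noisy report of the underlying probability, clipped to [0, 1].
    slider = [min(1.0, max(0.0, rng.gauss(p_true, slider_noise))) for _ in range(n)]
    # Binary: each participant picks "yes" with probability p_true.
    binary = [1.0 if rng.random() < p_true else 0.0 for _ in range(n)]
    return slider, binary

slider, binary = simulate_responses()
print(f"slider: mean = {statistics.fmean(slider):.2f} +/- {ci_half_width(slider):.3f}")
print(f"binary: mean = {statistics.fmean(binary):.2f} +/- {ci_half_width(binary):.3f}")
```

Both measures center on the same value, but the binary one has a wider interval because each yes/no response is a much noisier sample of the underlying probability. If slider responses were far noisier than assumed here, the advantage would shrink or disappear.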

So Brent, I guess I’m asking you whether you think there is some wiggle room in the results we discussed above – for constructs and participants where scale calibration is a problem and psychological uncertainty is large, we’d want yes/no. But for constructs that are more cognitive in nature, tasks that are more well-specified, and populations that are more used to the experimental format, isn’t it still possible that there’s an information gain for using more fine-grained scales?

BR: Of course there is wiggle room. There are probably vast expanses of space where alternatives are more appropriate. My intention is not to create a new “rule of thumb” where we only use yes/no responses throughout. My intention was simply to point out that our confidence in certain rules of thumb is misplaced. In this case, the assumption that likert scales are always preferable is clearly wrong. On the other hand, there are great examples where a single, graded dimension is preferable--we just had a speaker discussing political orientation which was rated from conservative to moderate to liberal on a 9-point scale. This seems entirely appropriate. And, mind you, I have a nerdly fantasy of someday creating single-item personality Behaviorally Anchored Rating Scales (BARS). These are entirely cool rating scales where the items themselves become anchors on a single dimension. So instead of asking 20 questions about how clean your room is, I would anchor the rating points from “my room is messier than a suitcase packed by a spider monkey on crack” to “my room is so clean they make silicon memory chips there when I’m not in”. Then you could assess the Big Five or the facets of the Big Five with one item each. We can dream can’t we?

MF: Seems like a great dream to me. So - it sounds like if there’s one take-home from this discussion, it’s “don’t always default to the seven-point likert scale.” Sometimes such scales are appropriate and useful, but sometimes you want fewer – and maybe sometimes you’d even want more.