tag:blogger.com,1999:blog-42972429174190892612024-03-16T11:50:11.999-07:00Babies Learning LanguageThoughts on language learning, child development, and fatherhood; experimental methods, reproducibility, and open science; theoretical musings on cognitive science more broadly. Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.comBlogger115125tag:blogger.com,1999:blog-4297242917419089261.post-34637504612583957792023-03-27T13:14:00.013-07:002023-04-03T10:14:47.877-07:00Domain-specific data repositories for better data sharing in psychology! <div>Data sharing is a critical part of ensuring a reproducible and robust research literature. It's also increasingly the law of the land, with <a href="https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html">new federal mandates</a> taking effect in the US this year. How should psychologists and other behavioral scientists share their data? </div><div><br /></div><div>Repositories should clearly be <a href="https://www.go-fair.org/fair-principles/">FAIR</a> - findable, accessible, interoperable, and reusable. But here's the thing - most data on a FAIR repository like the Open Science Framework (which is great, btw), will <i>never be reused. </i>It's findable and accessible, but it's not really interoperable or reusable. The problem is that most psychological data are measurements of some stuff in some experimental context. <a href="https://journals.sagepub.com/doi/full/10.1177/2515245920952393">The measures we use are all over the place</a>. We do not standardize our measures, let alone our manipulations. The metadata are comprehensible but not machine readable. And there is no universal ontology that lets someone say "I want all the measurements of self-regulation on children that are posted on OSF." </div><div><br /></div><div>What makes a dataset reusable really depends on the particular constructs that it measures, which in turn depends on the subfield and community those data are being collected for. When I want to reuse data, I don't want data <i>in general</i>. I want data <i>about</i> a specific construct, <i>from </i>a specific instrument, with metadata particular to my use field. Such should be stored in repositories specific to that measure, construct, or instrument. Let's call these <u><i>Domain Specific Data Repositories</i></u> (DSDRs). DSDRs are a way to make sure data actually are interoperable and actually do get reused by the target community.</div><span><a name='more'></a></span><div><br /></div><h4 style="text-align: left;">Put data in DSDRs</h4><div>Suppose I'm doing a project on executive function in early childhood. Wouldn't it be nice if I could download raw or aggregated data from the various tasks that people had used to measure executive function? Or suppose I'm now interested in complex sentence structure and psycholinguistics. Wouldn't it be nice to be able to download data from the hundreds of experiments on word-by-word reading time for sentences of different types? Data on both these questions exist, but they are spread out across repositories for individual papers, formatted differently in every case. Putting together more than one dataset is typically a nightmare of data harmonization and meta-data guesswork. </div><div><br /></div><div>Neuroimaging folks get this. You don't post fMRI images to Zenodo or OSF or another repository of this type. You post them to <a href="https://openneuro.org">OpenNeuro</a> - a domain-specific repository for neuroimaging. 
fMRI data have specific standards for metadata and particular affordances in terms of preprocessing, aggregation, and analysis. OpenNeuro is designed around these ideas. </div><div><br /></div><div>Similarly, the <a href="https://childes.talkbank.org">Child Language Data Exchange System</a> (CHILDES) has known this basic fact for years. They established a common schema for transcripts of parent-child conversations (the CHAT standard). Now everyone in the field of child language posts their data to CHILDES in this format, and so when you want to learn about kids' use of the word "and", you can search <i>every</i> major transcribed corpus of child language in a single archive. My group has done the same kind of thing with data around children's vocabulary, with <a href="http://wordbank.stanford.edu">Wordbank</a> archiving parent reports about child language from dozens of languages and tens of thousands of kids. </div><div><br /></div><div>To make high-value, reusable datasets, it is critical to aggregate the data around a common data standard that is specific to a particular instrument or construct, and that connects with the agenda of a particular research community. These tools can even help catalyze research communities to work together around a shared agenda. They can also increase data quality by putting into place domain-specific quality controls.</div><div> </div><h4 style="text-align: left;">We need more DSDRs</h4><div>The trouble is, making these domain-specific repositories is expensive and complicated. We've now made four: <a href="http://wordbank.stanford.edu">Wordbank</a>, <a href="http://childes-db.stanford.edu">childes-db</a>, <a href="http://peekbank.stanford.edu">Peekbank</a>, and <a href="http://metalab.stanford.edu">Metalab</a>. Each of them has their own web hosting framework (similar but different) as well as their own underlying database schema, visualization apps, and application programming interface (API) for downloading the data. Even though they are structurally similar, they are not the same, and each was made as a one-off. </div><div><br /></div><div>As a result, we now struggle under the burden of maintaining and updating these repositories, and it's not likely we can do too many more without abandoning some of them. Every time one breaks, I get lots of email. Every year I have to beg RStudio (now Posit) for free licenses to keep our visualization interface going. And it goes without saying that there is no funding for long-term maintenance of such repositories.</div><div><br /></div><div>But maybe we could automate and centralize the construction of such repositories and host them jointly in the cloud, rather than creating wholly separate resources each time. 
</div><div><br /></div><div>At the core of each repository is a database schema, like the schema for Peekbank:</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigFrcZKirjxUoT6wuy-I2WKI5v1aO0O8mUYe2Q5ZPoNr-BzjFu3IZEpNwJ0QQnoJ_GB4PxSiNWNjUsb1Piqf6QSd75DzmAcbr-IHvCKwYnitLyBGgU_pOVF_AE2TStja8F16t2z2VX1udJ_cAhzvJ0vjMOfqq7hpkUyHc_l7CXRfPIotEBYmI_gYOh1g/s1117/schema_3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1012" data-original-width="1117" height="290" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigFrcZKirjxUoT6wuy-I2WKI5v1aO0O8mUYe2Q5ZPoNr-BzjFu3IZEpNwJ0QQnoJ_GB4PxSiNWNjUsb1Piqf6QSd75DzmAcbr-IHvCKwYnitLyBGgU_pOVF_AE2TStja8F16t2z2VX1udJ_cAhzvJ0vjMOfqq7hpkUyHc_l7CXRfPIotEBYmI_gYOh1g/s320/schema_3.png" width="320" /></a></div><br /><div>Designing this kind of schema requires a clear understanding of the ontology for the kind of data you want to archive – it's surprisingly tricky (the Peekbank one took us many meetings over several years!). But once you have such a schema, it is straightforward to create an API to get data out of a database with this schema. And with a good API, it is surprisingly easy to define visualizations of data in the schema. People are often surprised that the interactive visualizations in something like Wordbank are the easy part!</div><div><br /></div><div>The only pain point is importing new datasets into the schema – typically this work requires writing custom data-munging code for each dataset to define the relation between the incoming data format and the specific tables required in the schema. For Wordbank we even defined an intermediate abstract layer for defining the mapping between incoming data and our schema. </div><div><br /></div><div>In principle, all of this work could be wrapped in a sufficiently general framework to make it unnecessary to create a custom hosting solution. Each database could be an instance of a broader database type, or even inside a giant wrapper database. And each API could be generated automatically from the database schema. You could even imagine a world where these DSDRs were created automatically out of an app like AirTable. There's some serious design work to do to describe the scope of such a system, but it is certainly not out of the realm of possibility.</div><div><br /></div><h4 style="text-align: left;">Challenges</h4><div style="text-align: left;">We have some work to do to make DSDRs like Wordbank the norm. At a minimum, we need:</div><div style="text-align: left;"><ul style="text-align: left;"><li>Credit assignment: robust norms for giving contributors credit when their data are used. At the moment, Wordbank and CHILDES simply ask folks to cite the contributors' paper (e.g., <a href="http://wordbank.stanford.edu/contributors">http://wordbank.stanford.edu/contributors</a>) but in the long term, datasets should have DOIs that are downloaded with the data and associated to the paper DOI automagically.</li><li>Dataset use tracking: repositories also need DOIs and methods for tracking their use and impact beyond citations of papers about the repository, which are often out of date and which split impact across multiple products.</li><li>Effective data versioning solutions: we need easy tools for using historical snapshots of repositories so that analyses of DSDR data are reproducible. 
We have hand engineered this for some of our DSDRs, but we need to be able to roll out this functionality with limited extra effort. Right now some key repositories like CHILDES have no accessible version control, meaning analyses can break down the line and users will not know why. </li><li>Mechanisms for ensuring the longevity of DSDRs: we need to ensure that DSDRs don't just rely on single investigators for maintenance and updates, perhaps through partnerships with libraries and cloud providers.</li></ul></div><p style="text-align: left;">There's a lot to do.</p><h4 style="text-align: left;">Conclusion</h4><div>Lost in many discussions of data sharing is that data shared in individual packages fosters reproducibility but often not interoperability and reuse. <b>Reuse comes when data are organized around specific disciplinary constructs, frameworks, and measurements</b>. And reuse value grows further as the size and diversity of the datasets in a domain-specific repository increase. We need more of domain-specific data repositories to catalyze research communities, especially in smaller fields where no such data resource exists. To create these, we will need new technical tools for rapidly and sustainably spinning up new repositories. These tools should be a development priority.</div>Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-37767989093515030802023-03-27T09:37:00.004-07:002023-03-27T09:37:34.623-07:00Why do LLMs learn so much slower than humans? <div>[<a href="https://twitter.com/mcxfrank/status/1640379247373197313">repost from twitter</a>]</div><div><br /></div>How do we compare the scale of language learning input for large language models vs. humans? I've been trying to come to grips with recent progress in AI. Let me explain two illustrations I made to help.<br /><br />Recent progress in AI is truly astonishing, though somewhat hard to interpret. I don't want to reiterate recent discussion, but <a href="https://twitter.com/spiantado">@spiantado</a> has a good take in the first part of <a href="https://lingbuzz.net/lingbuzz/007180">lingbuzz.net/lingbuzz/007180</a>; l like this thoughtful piece by <a href="https://twitter.com/MelMitchell1">@MelMitchell1</a> as well: <a href="https://www.pnas.org/doi/10.1073/pnas.2300963120">https://www.pnas.org/doi/10.1073/pnas.2300963120</a>.<br /><br />Many caveats still apply. LLMs are far from perfect, and I am still struggling with their immediate and eventual impacts on science (see <a href="https://twitter.com/mcxfrank/status/1638589956238225408">prior thread</a>). My goal in the current thread is to think about them as cognitive artifacts instead. <br /><br />For cognitive scientists interested in the emergence of intelligent behavior, LLMs suggest that some wide range of interesting adaptive behaviors can emerge given enough scale. Obviously, there's huge debate over what counts as intelligent, and I'm not going to solve that here. <div><br />But: for my money, we start seeing *really* interesting behaviors at the scale of GPT3. Prompting for few shot tasks felt radically unexpected and new, and suggested task abstractions underlying conditional language generation. At what scale do you see this? </div><div><br />GPT-3 was trained on 500 billion tokens (= .75 words). So that gives us ~4e11 words. PaLM and Chinchilla are both trained on around 1e12 words. We don't know the corpus size for GP4-4 (!?!). How do these numbers compare with humans? 
</div><div><br />Let’s start with an upper bound. A convenient approximation is 1e6 words per month for an upper bound on spoken language to a kid (<a href="https://arxiv.org/pdf/1607.08723.pdf">arxiv.org/pdf/1607.08723…</a>, appendix A or <a href="https://www.pnas.org/doi/abs/10.1073/pnas.1419773112">pnas.org/doi/abs/10.107…</a>). That's 2e8 words for a 20 year old. How much could they read?<br /><br />Assume they start reading when they’re 10, and read a 1e5-word book/week. That’s an extra 5e6 million words per year. Double that to be safe and it still only gets us to 3e8 words over 10 years. <br />Now let's do a rough lower bound. Maybe 1e5 words per month for kids growing up in a low-SES environment with limited speech to children (<a href="https://onlinelibrary.wiley.com/doi/epdf/10.1111/desc.12724">onlinelibrary.wiley.com/doi/epdf/10.11…</a>). We don't get much of a literacy boost. So that gives us 5e6 by age 5 and 2e7 by age 20. </div><div><br /></div><div>That "lower bound" five year old can still reason about novel tasks based on verbal instructions - especially once they start kindergarten! </div><div><br /></div><div>The take-home here is that we are off by 4-5 orders of input magnitude in the emergence of adaptive behaviors.<br /></div><div><div style="text-align: center;"><a href="https://pbs.twimg.com/media/FsPMQppaAAANvQX.jpg"><img height="299" src="https://pbs.twimg.com/media/FsPMQppaAAANvQX.jpg" width="400" /></a></div><div><br /></div><div><div><br /></div><div>The big cognitive science question is - which factors account for that gap? I'll think about four broad ones. </div></div><div><br /></div>Factor 1: innate knowledge. Humans have SOME innate perceptual and/or conceptual foundation. The strongest version posits "core knowledge" of objects, agents, events, sets, etc. which serve to bootstrap further learning. People disagree about whether this is true. <br /><br /></div><div>Factor 2: multi-modal grounding. Human language input is (often) grounded in one or more perceptual modalities, especially for young children. This grounding connects language to rich information for world models that can be used for broader reasoning. <br /><br />Factor 3: active, social learning. Humans learn language in interactive social situations, typically curricularized to some degree by the adults around them. After a few years, they use conversation to elicit information relevant to them. <br /><br /></div><div>Factor 4: evaluation differences. We're expecting chatGPT to reason about/with all the internet's knowledge, and a five year old just understand a single novel theory of mind or causal reasoning task. Is comparison even possible? <br /><br /></div><div>S<span style="text-align: center;">o of course I don't know the answer! But here are a few scenarios for thinking this through. Scenario 1 is classic nativist dev psych: innate endowment plus input make the difference. You use core knowledge to bootstrap concepts from your experience. </span></div><div><div style="text-align: center;"><br /></div><div style="text-align: center;"><a href="https://pbs.twimg.com/media/FsPMRwVakAAcOgk.jpg"><img height="340" src="https://pbs.twimg.com/media/FsPMRwVakAAcOgk.jpg" width="400" /></a></div><div><br /></div>Scenario 2 is more like modern rational constructivism. Grounded experience plus a bunch of active and social learning allow kids to learn about the structure of the world even with limited innate knowledge. 
<br /><br /></div><div>I hear more about Scenario 3 in the AI community - once we ground these models in perceptual input, it's going to be easier for them to do common-sense reasoning with less data. And finally, of course, we could just be all wrong about the evaluation (Scenario 4). <br /><br /></div><div>As I said, I don't know the answer. But this set of questions is precisely why challenges like BabyLM are so important (<a href="https://babylm.github.io/">babylm.github.io</a>).</div>Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-43917739262834189122023-03-27T09:32:00.002-07:002023-03-27T09:32:21.002-07:00AI for psychology workflows hackathon - a report[<a href="https://twitter.com/mcxfrank/status/1638589956238225408">reposted from twitter</a>]<br /><br /> My lab held a hackathon yesterday to play with places where large language models could help us with our research in cognitive science. The mandate was, "how can these models help us do what we do, but better and faster."<br /><br />Some impressions:🧵 <br /><br /><div>Whatever their flaws, chat-based LLMs are astonishing. My kids and I used ChatGPT to write birthday poems for their grandma. I would have bet money against this being possible even ten years ago. <br /><br />But can they be used to improve research in cognitive science and psychology? <br /><br /></div><div>1. Using chat-based agents to retrieve factual knowledge is not effective. They are not trained for this and they do it poorly (the "hallucination problem"). Ask ChatGPT for a scientist bio, and the result will be similar but with random swaps of institutions, dates, facts, etc. <br /><br /></div><div>2. A new generation of retrieval-based agents are on their way but not here yet. These will have a true memory where they can look up individual articles, events, or entities rather than predicting general gestalts. Bing and Bard might be like this some day, but they aren't now. <br /><br /></div><div>3. Chat-based agents can accomplish pretty remarkable text formatting and analysis, which has applications in literature reading and data munging. E.g., they can pull out design characteristics from scientific papers, reformat numbers from tables, etc. Cool opportunities. These functions are critically dependent on long prompt windows. Despite GPT-4's notionally long prompt length, in practice we couldn't get more than 1.5k tokens consistently. That meant that pre-parsing inputs was critical, and this took too much manual work to be very useful. </div><div><br />4. A massive weakness for scientific use is that cutting-edge agents cannot easily be placed in a reproducible scientific pipeline. Pasting pasting text into a window is not a viable route for science. You can get API access but without random seeds, this is not enough. (We got a huge object lesson in this reproducibility issue yesterday when OpenAI declared that they are retiring Codex, a model that is the foundation of a large number of pieces of work on code generation in the past year. This shouldn't happen to our scientific workflows.) Of course we could download Alpaca or some other open model, set it up, and run it as part of a pipeline. But we are cognitive scientists, not LLM engineers. We don't want to do that just to make our data munging slightly easier! <br /><br /></div><div>5. Chat agents are not that helpful in breaking new ground. 
The problem is that, if you don't know the solution for a problem, then you can't tell whether the AI did it right, or even is going in the right direction! Instead, the primary use case seems to be helping people accomplish tasks they *already know how to do*, but to do them more effectively and faster. If you can check the answer, then the AI can produce a candidate answer to check. <br /><br /></div><div>6. It was very easy for us to come up with one-off use-cases that could be very helpful (e.g., help me debug this function, help me write this report or letter), and surprisingly hard to come up with cases that could benefit with creating automated workflows. At small scale, using chat AI to automate research tasks is trading one task (e.g., annotating data) for more menial and annoying ones (prompt engineering and data reformatting so that the AI can process it). This is ok for large problems, but not small and medium ones. <br /><br /></div><div>7. Confidence rating is a critical functionality that we couldn't automate reliably. We need AI to tell us when a particular output is low confidence so that it can be rechecked. <br /><br /></div><div>In sum: Chat AI is going to help us be faster at many tasks we already know how to do, and there are a few interesting scientific automation applications that we found. But for LLMs to change our research, we need better engineering around reliability and reproducibility. </div>Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-21364761737112966922023-02-16T16:37:00.007-08:002023-02-16T16:37:46.819-08:00Why do hybrid meetings suck? <p>I tried rendering this post in Quarto, which is not blogger-compatible, but I'm including the link here: <a href="http://rpubs.com/mcfrank/hybrid">rpubs.com/mcfrank/hybrid</a>.</p>Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-91145383143398804002021-02-21T20:31:00.007-08:002021-02-22T11:11:08.531-08:00Methodological reforms, or, If we all want the same things, why can't we be friends?<p> <i>(tl;dr: "Ugh, can't we just get along?!" OR "aspirational reform meet actual policy?" OR "whither metascience?")</i></p><br />This post started out as a thread about the tribes of methodological reform in psychology, all of whom I respect and admire. Then it got too long, so it became a blogpost. <br /><br />As folks might know, I think methodological reform in psychology is critical (some of <a href="https://psyarxiv.com/27b43/">my views</a> have been formed by my work with the ManyBabies consortium). For the last ~2 years, I've been watching two loose groups of methodological reformers get mad at each other. It has made me very sad to see these conflicts because I like all of the folks involved. I've actually felt like I've had to take a twitter holiday several times because I can't stand to see some of my favorite folks on the platform yelling at each other. <div><br /></div><div>This post is my - perhaps misguided - attempt to express appreciation for everyone involved and try to spell out some common ground.</div><span><a name='more'></a></span><div><br /><h3>What do the centrists and the radicals think?</h3><br />One thread that catalyzed my thinking about this discussion was the "far left" and "center left" comparison that Charlie Ebersole proposed. Following that thread, I'll call these groups the centrists and the radicals. 
</div><div><br /><blockquote class="twitter-tweet"><p dir="ltr" lang="en">I'm definitely not the first to notice this, but it bears repeating: The gender imbalance between prominent "mainstream" open science folks and those critiquing it from the methodological "left" is striking and concerning. 1/3</p>— Charlie Ebersole (@CharlieEbersole) <a href="https://twitter.com/CharlieEbersole/status/1355231317001199617?ref_src=twsrc%5Etfw">January 29, 2021</a></blockquote></div><div><br />Centrist reforms are things like preregistration, transparency guidelines, and tweaks to hypothesis testing (e.g., p-value thresholds, equivalence testing, or Bayesian hypothesis testing). There's no consensus "platform" for reforms, but a <a href="https://psyarxiv.com/ksfvq/">recent review</a> summarizes the state of things quite well. Just to be clear, a number of authors of this article are collaborators and friends, and I think it's on the whole a really good article.</div><div><br /><blockquote class="twitter-tweet"><p dir="ltr" lang="en">10 years of replication and reform in psychology. What has been done and learned?<br /><br />Our latest paper prepared for the Annual Review summarizes the advances in conducting and understanding replication and the reform movement that has spawned around it.<a href="https://t.co/i5GQRPGzIa">https://t.co/i5GQRPGzIa</a> <br /><br />1/ <a href="https://t.co/yIYzUCaGE0">pic.twitter.com/yIYzUCaGE0</a></p>— Brian Nosek (@BrianNosek) <a href="https://twitter.com/BrianNosek/status/1359118772972507143?ref_src=twsrc%5Etfw">February 9, 2021</a></blockquote><div><br /></div>In contrast to the centrists, radicals start with the critical importance of theory building, often via computational models. On this view, no matter how well planned a test is, if it's not posed as part of a comparison of theories, you are playing 20 questions with nature (<a href="http://chil.rice.edu/tambo/teaching/psyc101GL/Newell%20%281973%29.pdf">as Newell said</a>), and you probably won't win. Here's a nice guide to some of the work in this tradition:</div><div> <blockquote class="twitter-tweet"><p dir="ltr" lang="en">I want to highlight some non-mainstream work on reproducibility, open science, replication crisis, meta-science by women. Reading and drawing from a diverse set of authors and ideas will help push this stream of work forward and help make science more open and inclusive.</p>— Berna Devezer (@zerdeve) <a href="https://twitter.com/zerdeve/status/1234906668036608000?ref_src=twsrc%5Etfw">March 3, 2020</a></blockquote><script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script><div><br /></div><div>In this debate, the rubber really hits the road in the discussion around preregistration. Preregistration is a critical part of centrist reforms (e.g., through registered reports) but is "redundant at best" in much of the more radical views (e.g., <a href="https://psyarxiv.com/wxn58">this really nice post by Danielle Navarro</a>).<br /><br /><h3>I'm a centrist and a radical</h3><br />Here's the thing. These views are not inconsistent! It's just that the implicit contexts of application are different. Centrists are trying to make <i>broad policy recommendations</i> for funders/journals/training programs; radicals are thinking about <i>ideal scientific structures</i>. Both viewpoints resonate with my personal experience. </div><div><br /></div><div>In my lab, I try to do science that conforms to the radical vision of ideal scientific structures! 
In much of my work, we do the kind of computational theory building that lets us make quantitative predictions in advance and test them using precise measurements. This kind paradigm obviates simple NHST p-values, though sometimes we include them anyway because reviewers. We do typically preregister this work though, to keep from fooling ourselves about our predictions. Here's an example:<br /><br /><blockquote class="twitter-tweet"><p dir="ltr" lang="en">Preregistration and iterative statistical modeling go hand in hand. [THREAD]<br /><br />I'll illustrate via a new preprint from my lab that I'm very excited about, "Polite speech emerges from competing social goals" (w/ <a href="https://twitter.com/EricaYoon4?ref_src=twsrc%5Etfw">@EricaYoon4</a>, <a href="https://twitter.com/mhtessler?ref_src=twsrc%5Etfw">@mhtessler</a>, and Noah Goodman): <a href="https://t.co/LvUf3Pecns">https://t.co/LvUf3Pecns</a> /1</p>— Michael C. Frank (@mcxfrank) <a href="https://twitter.com/mcxfrank/status/1064665177201631232?ref_src=twsrc%5Etfw">November 19, 2018</a></blockquote><script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script><br />On the other hand, I also teach experimental methods to psychology graduate students. In my teaching I'm much more of a centrist. In this context, I see lots of "garden variety" psych research on the topics that students are interested in. Much of it is not easily amenable to computational theory. (<a href="http://babieslearninglanguage.blogspot.com/2018/12/how-to-run-study-that-doesnt-replicate.html">Here's a sample of the perspective I've developed in that course</a>). <br /><br />From the radicals, there's lots of interest in computational theory building and some very nice guides/explainers (e.g. <a href="https://journals.sagepub.com/doi/full/10.1177/1745691620970585">this one by Olivia Guest and Andrea Martin</a>, EDIT: these authors are just trying to help people understand modeling and want to be clear that they feel there is a place for qualitative theory and don't subscribe to a "radical" position). The radical tradition is what I was trained in and what I do. I love this kind of work. But psych is a <i>VERY big place</i> (TM). It feels to me like hubris to say to a student who does educational mindsets work, or emotion regulation, or longitudinal development of racial identity – "don't even bother unless you have my kind of computational theory." Maybe that's not what they want as an outcome from their research, and maybe they are right and I am wrong!<br /><br />(As an aside: models and data go hand and hand, and it's not actually that clear to me that moving to computational theory is right in areas where there are no precise empirical measurements to explain. In 2013 I taught a fun class trying to make models of social behavior with Jamil Zaki and Noah Goodman. We made lots of models but had no reliable quantitative measurements to use to fit the models. So we had some pretty great computational theory – in my humble opinion – but we were still nowhere.)<br /><br />So based on these musings, in my experimental methods class, I make more minimal recommendations to the students. To evaluate the effect of an intervention, plan your sample size and preregister the statistical test. Don't p-hack. Go ahead and explore your data but don't pretend p-values from that exploration are a sound basis for strong conclusions. Try to make good plots of your raw data. 
Again, these sound pretty centrist, even though like I said, in my own lab I'm much more of a radical!<br /><br />The methodological practices that I recommend in class don't necessarily result in a robust body of theory. But at the same time, I have a strong conviction that they are a first step towards keeping people from tricking themselves while they stare at noise. Random promotion of noise to signal is rampant in the literature - we see it all the time when we try to replicate findings in class that are clearly the basis of post-hoc selection of significant p-values. So simply blocking this kind of noise promotion is an important first step. <br /><br /><h3>Contexts for everything</h3></div><div><br /></div><div>I'm arguing that one difference between centrists and radicals is what the context of the claim is. The centrist in me says: "it's really easy to tell NSF/NIH to add preregistration, sample size planning, and data sharing, to the merit review criteria (think <a href="http://clinicaltrials.gov">clinicaltrials.gov</a>)." In contrast, I don't think anyone would even know what you meant if you said: "all grants need to have sound computational theory."<br /><br />Danielle Navarro make the general case wonderfully in the piece I linked above: "advocating preregistration as a solution to p-hacking (or its Bayesian equivalent) is deeply misguided because we should never have been relying on these tools as a proxy for scientific inference." I basically agree with this point completely. <i>For my own research.</i><br /><br />But I'm <i>also</i> worried that applying this standard as a blanket policy intervention across all of psychology (plus the other behavioral sciences, to say nothing of the clinical sciences) would be a disaster for everyone involved. What would people do when they didn't have computational theory or adequate statistical models but got asked by funders and journals to provide such theory? My guess is that they'd make it up in a way that satisfied the policy hoop they'd been asked to jump through and then would continue p-hacking. <br /><br />Here are a few ideas about consensus metascience directions for both groups. Centrists should consider how they want to tweak policies to encourage cumulative science in the form of quantitative theory. How could we study the effects of quantitative theory on the robustness of empirical findings? I've got one idea: seems like <a href="http://babieslearninglanguage.blogspot.com/2015/09/descriptive-vs-optimal-bayesian-modeling.html">literatures that test quantitative theories presuppose precise and replicable measurements</a>; this is a testable correlational claim at least. I've also wondered about encouraging dose-response designs as a potential intervention on the standard 2x2 design that gets (over-)used in much of the psychology literature. </div><div><br /></div><div>On the other side, though, methodological radicals should take a look at the metascience policy intervention literature - where <a href="https://osf.io/preprints/metaarxiv/39cfb/">something actually gets changed in an official policy and then you measure the outcome</a>. Through my collaborations with Tom Hardwicke, I've become convinced that this kind of work can make us clearer about our desired endpoints as science policy-makers – what counts as success when we propose methodological reforms? </div><div><br /></div><div>One final comment. 
Another dynamic in this whole conversation is the failure – perceived and actual – of some centrist voices to engage constructively with the more radical critiques. As has been pointed out several times (as in the Ebersole tweet above), this lack of engagement may have to do with the gender distribution - more male voices in the center, more women on the radical side. These dynamics aren't good and this behavior is not OK. Leaders in the centrist parts of the field need to address the more radical critiques, especially those that come from folks who are deeply knowledgeable about the philosophical and statistical issues. The radical critiques of preregistration sometimes may get mistakenly written off as being part of a different genre of knee-jerk response to methodological reforms from less thoughtful corners of the field. This is sloppy. The radical work needs to be cited and discussed – and if, as I've suggested here, there's a response to the critiques based on pragmatics and policy issues, then that response needs to be articulated. </div><div><br /></div><h3>Conclusions</h3><div><br /></div><div>OK, in sum: Maybe this is part of being an official old person (TM) but, why can't we all just get along? Let's have radical ambitions for the future while taking well-scoped, pragmatic policy positions in the short term. </div><div><br /></div></div>Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-23148175875115873372021-02-08T17:38:00.005-08:002021-02-09T08:42:13.127-08:00Transparency and openness is an ethical duty, for individuals and institutions<i>(tl;dr: I wrote an opinion piece a couple of years ago - now rejected - on the connection between ethics and open science. Rather than letting it just get even staler than it was, here it is as a blog post.)</i><br /><br /><div>In the past few years, journals, societies, and funders have increasingly oriented themselves towards open science reforms, which are intended to improve reproducibility and replicability. Typically, transparency policies focus on open access to publications and the sharing of data, analytic code, and other research products. </div><div><br /></div><div>Many working scientists have a general sense that <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5688730/">transparency is a positive value</a>, but also have <a href="https://www.americanscientist.org/article/open-science-isnt-always-open-to-all-scientists">concerns about specific initiatives</a>. For example, sharing data often carries confidentiality risks that can only be mitigated via substantial additional effort. Further, many scientists worry about personal or career consequences from being “scooped” or having errors discovered. And transparency policies sometimes require resources that are not be available to researchers outside of rich institutions. </div><div><br /></div><div>I argue below that despite these worries, scientists have an ethical duty to be open. Further, where this duty is in conflict with scientists' other responsibilities, we need to lobby our institutions – universities, journals, and funders – to mitigate the costs and risks of openness.</div><span><a name='more'></a></span><div><br /></div><div><h4 style="text-align: left;">Scientists have an ethical duty to be open</h4>Openness is definitional to the scientific enterprise. 
The sociologist Robert Merton (1942) described a set of norms that science is assumed to follow: communism – that scientific knowledge belongs to the community; universalism – that the validity of scientific results is independent of the identity of the scientists; disinterestedness – that scientists and scientific institutions act for the benefit of the overall enterprise; and organized skepticism – that scientific findings must be critically evaluated prior to acceptance. The choice to be a scientist constitutes acceptance of these norms.<br /><br />For individual scientists to adhere to these norms, the products of research must be open. To contribute to the communal good, papers must be available so they can be read, evaluated, and extended. And to be subject to skeptical inquiry, experimental materials, research data, analytic code, and software must be all available so that analytic calculations can be verified and experiments can be reproduced. Otherwise, evaluators must accept arguments on the authority of the reporter rather than by virtue of the materials and data, an alternative that is inimical to the norm of universalism. For many scientists, the situation is neatly summarized by the motto of the Royal Society: “Nullius in verba,” <a href="https://science-sciencemag-org.stanford.idm.oclc.org/content/251/4990/142.2">often loosely translated as</a> “on no one’s word”.<br /><br />Beyond its centrality to science, openness also carries benefits, both to science and to scientists. Open access to the scientific literature <a href="https://www.bmj.com/content/323/7321/1103.short">increases the impact of publications</a>, which in turn increases the pace of discovery. Openly accessible data <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4973366/#bib105">increases the potential for citation and reuse</a>, and maximizes the chances that errors are found and corrected. These benefits accrue not just to the scientific ecosystem at large but also to individual scientists, who gain via citations, media impact, collaborations, and funding opportunities. <br /><br />Some responsibilities follow from these benefits. Because openness maximizes the impact of research and its products, researchers have a responsibility to their funders to pursue open practices so as to seek the maximal return on funders’ investments. And by the same logic, if research participants contribute their time to scientific projects, the researchers also owe it to these participants to maximize the impact of their contributions, <a href="https://www.ncbi.nlm.nih.gov/pubmed/23466937">as my colleague Russ Poldrack has argued</a>.<br /><br />For all of these reasons, individual scientists have a duty to be open – scientific institutions have a duty to promote transparency in the science they support and publish.</div><div><br /><h4 style="text-align: left;">The negatives of openness</h4>Scientists have many other ethical duties beyond openness, however. They have obligations to their collaborators and trainees. They have committed to funders to complete specific studies. And in biomedical and social science fields, they have duties to preserve the welfare of their research participants as well. Conflicts with these duties are often the source of researchers’ hesitance to embrace openness. <br /><br />Transparency policies also carry costs in terms of time and effort. For example, some routes to open access publication require authors to pay substantial publication costs (i.e., author processing charges). 
Organizing materials and data for sharing as well as providing support to dataset users can also be time-consuming, especially for larger datasets. <br /><br />Maintaining participant confidentiality is a major source of both cost and risk for biomedical and other human subjects research. Loss of confidentiality by research participants can have big negative consequences for health, employment, and well-being. While ensuring that tabular data does not contain identifying information is <a href="https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html">often relatively straightforward</a>, other types of data can be tricky and expensive to anonymize. For example, removing identifying information from video data requires considerable time and expertise. And certain types of dense or narrative data simply may not be de-identifiable due to aspects of the data or the participants’ identities. <br /><br />Transparency can even be a source of risk – actual or perceived – to researchers themselves. Effort spent pursuing open practices may not be seen as compatible with other career incentives. For example, learning technical tools to facilitate code and data sharing could take away from time to pursue new research. Disclosure of high value datasets prior to publication could in principle lead to opportunities for “scooping” – though it turns out that there are very few documented cases of pre-emption as a result of data sharing. Finally, open sharing of research products prior to and during peer review might carry greater risk for junior researchers and for researchers from disadvantaged groups, because of their greater vulnerability to critiques or negative attention.</div><div><br /><h4 style="text-align: left;">Individuals should consider openness as a default</h4>In the face of competing duties as well as potential negatives to openness, what should individual researchers do? First, because of the ethical duty to openness for every scientist, open practices should be a default in cases where risks and costs are limited. For example, the vast majority of journals allow authors to post accepted manuscripts in their untypset form to an open repository. This route to “green” open access is easy, cost free, and – because it comes only after articles are accepted for publication – confers essentially no risks of scooping. As a second example, the vast majority of analytic code can be posted as an explicit record of exactly how analyses were conducted, even if posting data is sometimes more fraught. These kinds of “incentive compatible” actions towards openness can bring researchers much of the way to a fully transparent workflow, and there is no excuse not to take them.<br /><br />For some researchers, however, there will be real negatives associated with one or more open practices. If they are not aware of the positive benefits of transparency and sharing for their work and the work of their trainees, they may consider open practices only as a necessary evil, rather than as opportunities to increase citations or build a reputation. But if they recognize the potential benefits of openness, researchers can ask whether there are steps that can be taken to realize some of those benefits while mitigating risks – for example, releasing only summary, tabular data rather than raw media data, or making use of a data sharing repository with robust access control.<br /><br />In some cases, researchers might decide not to share. 
One example of this kind of situation came up in my own work, when I was studying <a href="http://langcog.stanford.edu/papers_new/roy-2015-pnas.pdf">dense audio-video recordings of the private life of a single identified family</a>; these data are both sensitive and impossible to de-identify. The family decided not to share these data, and I support this decision, having seen how much the data would have compromised their family's privacy – though we did make tabular data available so that statistical results could be reproduced. A second more general case is archival data without consent for sharing where recontacting participants may be impossible or impractical. These cases are relatively rare, however; it is more common that sharing simply presents some potentially mitigable costs. It is precisely in these cases that institutions should step in.</div><div><br /><h4 style="text-align: left;">Institutions can mitigate the risks and costs of openness</h4>Given the ethical imperative towards openness, institutions like funders, journals, and societies need to use their role to promote open practices and to mitigate potential negatives. Scholarly societies have an important role to play in educating scientists about the benefits of openness and providing resources to steer their members towards best practices for sharing their publication and other research products. Similarly, journals can set good defaults, for example by requiring data and code sharing except in cases where a strong justification is given (equivalent to adopting the second highest level in the <a href="https://www.cos.io/initiatives/top-guidelines">Transparency and Openness Promotion</a> guidelines). I don't think the TOP guidelines are perfect, but I'm not sure why in this case we'd let the perfect be the enemy of the good.</div><div><br /></div><div>Departments and research institutes can also signal their interest in open practices in job advertisements and tenure/promotion guidelines. We did this the last time we had a search at Stanford Psych and it signaled our department's general interest in these practices, leading to some good conversations with candidates (and letting us notice explicitly if candidates weren't as interested as we were). In addition, by structuring graduate programs to provide training in tools and methods for data and code sharing, departments can educate grad students about producing reproducible and replicable research – this has been my hobby horse for quite a while (see <a href="http://langcog.stanford.edu/papers/FS-POPS2012.pdf">here</a> and <a href="https://psyarxiv.com/p73he/">here</a>). <br /><br />Institutional funders of research play the most important role, however. Most funders already signal an interest in openness through a required data management plan or similar document, and some (like the US NIH) mandate data sharing to the extent permissible given other regulatory constraints (e.g., institutional review, health or data privacy laws). These requirements, though laudable, don't really change the scientific incentives at play. Data sharing should not just be required: It should also be treated as part of the scientific merit of an application. Creating a sufficiently high value dataset should be itself meritorious enough to warrant funding. And on the opposite side of the calculus, funders should signal their willingness to support the effort required to mitigate data sharing costs. 
For example, this could take the form of extra budget supplements explicitly tied to sharing activities. <br /><br />More generally, funders and other institutional stakeholders need to act to change the incentive structure for individuals. For example, funding agencies could make it a priority to invest in creating technical tools and practice guidelines for human subject data anonymization. A small RFP for these could create huge value, making it much more straightforward to participate in data sharing. </div><div><br /></div><div><h4 style="text-align: left;">Conclusion</h4>Both advocates and critics of open practices often appear to be arguing about the merits of radical transparency, but this goal is often not achievable. Instead, individual researchers and institutions should proceed from both an understanding of the benefits of openness and an appreciation of the ethical duty to be open. These starting points lead naturally to a set of practices that are open by default, with exceptions in case of specific risks. </div><div><br /></div><div>When individual researchers can't mitigate the costs associated with openness, responsibility falls to institutional actors in the scientific ecosystem to help. We can all do our part in this by lobbying our journals scientific societies, institutions, and funders to support researchers in making the right decisions around transparency.<br /><span id="docs-internal-guid-a9fce7b4-7fff-f530-9546-002a82e12e8b"><br /></span></div>Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-1803920589231465772020-10-23T16:49:00.007-07:002020-10-23T16:50:32.779-07:00Against reference limitsMany academic conferences and journals have limits on the number of references you can cite. I want to argue here that <i>these limits make no sense and should be universally abolished</i>. <div><br /></div><div>To be honest, I kind of feel like I should be able to end this post here, since the idea seems so eminently sensible to me. But here's the positive case: If you are doing academic research of any type, you are not starting from scratch. It's critical to acknowledge antecedents and background so that readers can check assumptions. Some research has less antecedent work in its area, other research has more, and so a single limit for all articles doesn't make sense. More references allow readers to understand better where an article falls in the broader literature.</div><span><a name='more'></a></span><div><br /></div><div>Some objections and responses. </div><div><br /></div><div><b>Aren't there space limitations? </b>No, there aren't. Some journals still operate based on a set "page budget" that the publisher puts in place. This is silly as absolutely no one reads paper journals any more. If this weren't already clear before the pandemic, it's clear now. No one has sent an issue of <i>Cognition</i> or <i>Psych Science</i> to my house but life goes on. </div><div><br /></div><div><b>In my high profile, glossy journal, you should only cite important references and not try to be complete. </b>When you remove references from an article, typically you cut the three papers you might have cited to just one. That one is probably the original positive claim; it's more likely to be from a <a href="https://twitter.com/gershbrain/status/1319379132615168008">famous old guy</a> and it's less likely to be a newer finding, a meta-analysis, or a reference that provides additional context. 
This lack of context feeds the "rich get richer" cycle of citations and it hurts readers who should see multiple sources of evidence on an issue.</div><div><br /></div><div><b>My review journal is aimed at students and we don't want to overwhelm them with references. </b>I guess the argument is this: If you have 70 references and you cut them to 30 as a function of the journal limits, then students know what citation to look at. To me this seems crazy. First of all, no student is going to track down all 30 references; they are inevitably looking at a subset, probably the references for one claim. And for that one claim, they deserve the same context as a researcher does – don't just send them to the original paper without also giving them the critique, the meta-analysis, or the newer non-replication. If you want to curate, then have the bibliography be annotated (as, for example, <i>Nature Reviews Neuroscience</i> does). Let the author call out the important references, rather than removing dissent and diversity from the bibliography.</div><div><br /></div><div><b>It's only a conference paper/abstract, you don't need references and they count against the space limitations anyway.</b> Most computer science conferences now do not count references against page limits, and increasingly abstracts for developmental psych conferences do not either. Fundamentally, you are probably looking at conference papers or abstracts on the web – so the documents you look at should be able to have hyperlinks in them (and that's all references are, anyway, is hyperlinks to other papers). Let authors add a reference section! And while we're at it, we should have a technical solution (e.g., a regular expression) to count words outside of citations. Why dock people words for appropriate scholarly procedure?</div><div><br /></div><div><b>Unlimited references encourage (self-)citation packing. </b>It's true that if citations were unlimited, in principle you could pack the reference section with tons of irrelevant citations or, maybe more realistically, with self-citations. But first of all, most journals already have unlimited citations and no one does this (well, <a href="https://www.insidehighered.com/news/2018/04/30/prominent-psychologist-resigns-journal-editor-over-allegations-over-self-citation">almost no one</a>). Second, citation packing is something that can be dealt with by editors and reviewers. Finally, if someone is hell-bent on self-citation and you have a reference limit, they will use all of their references to cite themselves anyway. But if you give them unlimited references they might actually cite the relevant work in addition to their own. Self-citation is a real issue, but limiting references is the wrong policy tool to deal with this problem.</div><div><br /></div><div>OK, I hope I've convinced you. Let everyone cite to their heart's content. Don't limit references, and don't count them towards page and word limits in submissions. </div>Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-12499236340881135172020-03-02T16:12:00.003-08:002020-03-03T09:21:19.640-08:00Advice on reviewing<i>(Several people I work with have recently asked me for reviewing advice, so I thought I'd share my thoughts more broadly.)</i><br />
<i><br /></i>
Peer review – organized scrutiny of scientific work prior to publication – is a critical part of our current scientific ecosystem. But I have heard many of the peer review horror studies out there and experienced some myself. The peer review ecosystem could be improved – better tracking and sharing of peer review, better credit assignment, more fair allocations of review requests, better online systems for editors and reviewers, to name a few.*<br />
<br />
Should we have peer review at all? In my view, peer review is primarily a filter that limits the amount of truly terrible work that appears in reputable journals (e.g., society publications, high-ranked international outlets). Don't get me wrong: plenty of incorrect, irreproducible, and un-replicable science still appears in print! But there are certain minimal standards that peer review enforces – published work typically ends up conforming to the standards of its field, even if those standards themselves could be improved. Without peer review, more of this terrible work would appear and there would be even more limited cues for distinguishing the good from the bad.** To paraphrase, it's the worst solution to the problem of quality control in science – except for all the others!<br />
<br />
So all in all I'm an advocate for peer review.<br />
<br />
<a name='more'></a><br />
But for an early career researcher (say a grad student or postdoc especially), getting involved poses some tradeoffs. On the one hand, there are several positives. Being a reviewer helps you:<br />
<ul>
<li>learn about other new work in the field by engaging with it deeply, </li>
<li>calibrate your judgment to that of the editor and other reviewers, and</li>
<li>get credit from editors (and occasionally authors and other readers, in the case of open review) for contributing.</li>
</ul>
<div>
But it also can be time-consuming, especially at first. How do you decide when to review and when not to review? Here's my advice. </div>
<div>
<blockquote class="tr_bq">
<i>On average, try to review </i><i>about 2.5x as many papers as you submit as first author</i><i>. Try to do those reviews at the places you publish and want to publish. Be efficient with your reviewing.</i></blockquote>
</div>
I'll explain each part here in a bit more depth.<br />
<br />
<b>1. On average, not right now.</b> As my wife is fond of saying, we have <a href="https://en.wikipedia.org/wiki/Seasons_of_Giving">seasons of giving</a>. You don't have to do everything at once! This means, first, you should try not to have more than a few reviews out at a time. Otherwise it gets very overwhelming. So try to space things out: don't feel like you have to review continuously. Take a break from time to time, especially if family or career circumstances mean you have a lot on your plate. I did a ton of one-off reviewing for several years, then did a bunch of editorial service, got burned out – <a href="http://babieslearninglanguage.blogspot.com/2017/06/confessions-of-associate-editor.html">related confessional blogpost here</a> – then took a breather, and now am back doing a mix of editing and reviewing.<br />
<br />
<b>2. Review at the population replacement rate.</b> Most papers have 2–3 reviewers. So if everyone reviews 2–3 papers for each first-authored paper they submit, then the supply of reviews coming into the system will match the demand created by new submissions. But again, this doesn't have to be all at once! If you haven't submitted anything yet from your PhD, doing a lot of reviewing is not usually a great idea. I tend to suggest focusing on your own work until then. This is also not a hard and fast rule and it's great to be generous with reviewing if you have the curiosity and capacity. If you're submitting one paper this year, I think it's fine – maybe even good – to review more than two or three papers. But I wouldn't necessarily review ten unless you really want to.<br />
<br />
<b>3. Review at places you (want to) publish.</b> Peer review is an important part of socialization into a scientific community. It's one way our communities develop norms as to statistical or methodological standards. A lot has been said about the ways these norms are occasionally negative (e.g., requiring HARKing – "hypothesizing after the results are known"). Plenty of this socialization is good, though. For example, my recent reviewers have required more breadth in the cited literature, more reproducible code, and additional studies, among many other steps that have made my and my collaborators' work better.*** By participating in specific communities' review, you learn what they want from their contributors. You also have a chance to show editors your thoughtfulness and judgment. (This isn't a big motivator but it's not nothing.) So choosing outlets carefully helps you give back to the scholarly community you want to be part of and it also helps you learn about how that community works.<br />
<br />
<b>4. Spend time on reviews, but not too much time. </b>My first review ever (as a grad student) was eight pages long. I included information on every typo in the paper. I'm sure there was useful feedback in there, but as an editor, these kinds of over-the-top reviews don't actually help that much. And as an author, they are a pain – they are either "writing for the author" or nitpicking specific wording decisions. Authors should get some autonomy in what they write, provided the underlying research is sound. The advice I received from my advisor (after he had a nice chuckle about the length of my review) was: summarize the paper in no more than a paragraph, provide a small handful of major points that are critical to your evaluation of the paper, and if you feel it's appropriate, make a recommendation.**** Then you can list a few minor points that are helpful to the authors but don't themselves make or break the paper.<br />
<br />
Writing a review like this takes time, but not too much time. I recommend reading the paper through soon after you get it, making a few notes, thinking it over, and then coming back and writing the review as you reread. That way you can form an opinion and then check it. It's hard to say how long this process <i>should</i> take – everyone is different, and the process gets way faster with experience. But if a normal-length paper is taking more than 3–5 hours to review, I think that's probably too much, unless you are really taking time to check a specific calculation or analysis.<br />
<br />
Finally, what do you do if a particular reviewing opportunity just isn't right? Don't be afraid to say no. Editors are people too, and they will totally understand if you tell them how many reviews you already have outstanding or share that you are on leave or otherwise occupied (finishing your thesis, for example). Editors are generally fine with a quick and helpful decline, especially when you name other people who you think are qualified.***** You can always say "happy to help next time!"<br />
<br />
----<br />
* I won't talk about blinding vs. not blinding here, though I did share some thoughts <a href="http://babieslearninglanguage.blogspot.com/2017/06/confessions-of-associate-editor.html">elsewhere</a>.<br />
** In some fields, there aren't huge incentives for publishing random nonsense. Theoretical physics comes to mind – you can upload random junk to arXiv but it's not a huge deal, in the sense that it's just more spam that needs to be filtered out. In contrast, in biomedicine or even in psychology, publication in a strong journal can lead to positive commercial consequences. So we need significant filtering to prevent unscrupulous researchers from taking advantage of this route.<br />
*** They also of course misunderstood simple points; got the stats wrong; asked me to cite their own work; and said trenchant stuff about my writing that made me feel bad for days. Criticism is always a mixed bag.<br />
**** Some people say that reviewers should assess but not recommend. But most journals make you choose your recommendation from a dropdown menu so I don't know what that really means. I think that if you have a clear recommendation, you should state it in the review and argue for it. E.g., "for this paper to be acceptable, the authors would need to do X, Y, and Z."<br />
***** It's especially helpful to decline by suggesting early-career experts: most editors think of the same prominent researchers for reviews in a particular domain and then have trouble generating a broader reviewer pool for areas they don't know as well.<br />
<div>
<br /></div>
Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-46418151731552403132019-11-05T21:51:00.002-08:002019-11-05T21:52:07.357-08:00Letter of recommendation: Attack of the Psychometricians<i>(tl;dr: It's letter of recommendation season, and so I decided to write one to a paper that's really been influential in my recent thinking. Psychometrics, y'all.)</i><br />
<br />
To whom it may concern:<br />
<br />
I am writing to provide my strongest recommendation for the paper, "<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2779444/">Attack of the Psychometricians</a>" by Denny Borsboom (2006). Reading this paper oriented me to a rich tradition of psychometric modeling – but more than that, it changed my perspective on the relationship between psychological measurement and theory. (It also taught me to use the term "sumscore"* as an insult). I urge you to consider it for a position in your reading list, syllabus, or lab meeting.<br />
<br />
I first met AotP (or Attack!, as I like to call it) via a link on twitter. Not the most auspicious beginning, but from a quick skim on my phone, I could tell that this was a paper that needed further study.<br />
<br />
The paper presents and discusses what it calls the central insight of psychometrics: that "measurement does not consist of finding the right observed score to substitute for a theoretical attribute, but of devising a model structure to relate an observable to a theoretical attribute." In other words, the goal is to make models that link data to theoretical quantities of interest. What this means is that measurement is essentially continuous with theory construction. By creating and testing a good measurement model, you're creating and testing a key component of a good theory.<br />
<br />
<a name='more'></a><br />
Attack! has made me think about the origins of this situation. Here's my attempt at an origin story. In the olden times, all the psychologists went to the same conferences and worried about the same things. But then a split formed between different groups. Educational psychologists and psychometricians knew that different problems on tests had different measurement properties, and began exploring how to tell good items from bad, and how to estimate people's ability abstracted away from specific items. Cognitive psychologists, on the other hand, spurned this item-level variation and embraced the dogma of exchangeable experimental items. People did Lots Of Trials, all generated from the same basic template. The sumscore reigned supreme, and yielded important insight into Memory, Attention, and Reasoning (irrespective of what was being remembered, attended to, or reasoned about).<br />
<br />
Psychophysicists diverged from the cognitivist hierarchy. They always knew that they needed to infer a latent relationship (the psychometric curve). As they got better at doing this, they fit models that included parameters of the decision process – for example, a "lapse" parameter to capture inattention – as well as the quantities of interest. And because they typically fit these curves within individual subjects, these parameters were participant-level estimates. But the models that fit these curves were often specific to particular metric relationships and not appropriate for increasingly complicated domains.<br />
<br />
Now in modern cognitive science, we get work on sophisticated constructs – for example, in moral psychology or psycholinguistics – where experimenters break with the cognitivist dogma and use non-exchangeable items. Sometimes items are sentences or even whole vignettes. Yet for the most part these researchers have forgotten to model item variation (except occasionally using a random intercept for items in their linear mixed effects models). <a href="https://web.stanford.edu/~clark/1970s/Clark,%20H.H.%20_Language%20as%20fixed%20effect%20fallacy_%201973.pdf">Clark (1973)</a> scolded them about the problematic statistical inferences that could result from forgetting to model items and this guidance has reappeared in recent exhortations to <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3881361/">Keep It Maximal</a>! But as far as I can tell, no one really talks about modeling items in more detail <i>in order to learn more about what is in people's heads</i>.<br />
<br />
Attack! has infested my brain. Now when I see someone use differentiated items in their task yet use the sumscore as their measure of the latent trait of interest, I think, "you're just leaving information on the table." <a href="https://psyarxiv.com/42rza/">I suddenly want to fit psychometric models to everything</a>. Because, in the end, what do you want as a psychologist? A better understanding of the latent space that we're trying to theorize about. I used to think that this was called Theory and it was distinct from Data Analysis. Thanks to Attack! I now know that measurement and theory are (or at least should be) contiguous with one another.**<br />
<br />
On a personal note, Attack! is a great read and will play well with your interest in sociological biases that shape the structure of scientific inquiry. You shouldn't pass this paper up. Do not hesitate to contact me with questions or concerns.<br />
<br />
Sincerely,<br />
<br />
Michael C. Frank<br />
Internet Commentator<br />
<br />
---<br />
* For those of you not in the know, the sumscore is just what we normal psychologists call "percent correct" – treating the sum of your correct answers on the test as your score, as opposed to inferring the latent trait (ability) from the performance on the observed variables.<br />
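(Here's a minimal simulated sketch of the distinction – my own toy example, nothing from the paper – comparing sumscores with ability estimates from a Rasch-style model fit with lme4's glmer:)<br />
<pre>library(lme4)
set.seed(7)
n_subj <- 200; n_item <- 20
theta <- rnorm(n_subj)  # latent ability
b <- rnorm(n_item)      # item difficulty
d <- expand.grid(subj = factor(1:n_subj), item = factor(1:n_item))
d$correct <- rbinom(nrow(d), 1, plogis(theta[d$subj] - b[d$item]))

sumscores <- tapply(d$correct, d$subj, mean)  # "percent correct"

# Rasch-style model: fixed item parameters, random subject abilities
m <- glmer(correct ~ -1 + item + (1 | subj), data = d, family = binomial)
abilities <- ranef(m)$subj[, 1]
cor(sumscores, abilities)  # similar ranking, but the model also
                           # estimates the item parameters</pre>
<br />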
** This contiguity idea is interestingly related to the Bayesian Data Analysis turn in the Bayesian cognitive modeling world, where we now think about linking functions that relate models to data directly. In fact, I think these are really the same idea when you get down to it. Here's a great paper that describes this viewpoint: <a href="https://psycnet.apa.org/record/2017-14287-001">Tauber, Navarro, Perfors, & Steyvers (2017)</a>.Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-58982439542830422512019-10-08T21:34:00.001-07:002019-10-09T10:40:16.291-07:00Confounds and covariates<i>(tl;dr: explanation of confounding and covariate adjustment)</i><br />
<br />
Every year, one of the trickiest concepts for me to teach in my experimental methods course is the difference between experimental confounds and covariates. Although this distinction seems simple, it's pretty deeply related to the definition of what an experiment is and why experiments lead to good causal inferences. It's also caught up in a number of methodological problems that come up again and again in my class. This post is my attempt to explain the distinction and how it relates to different problems and cultural practices in psychology.<br />
<br />
Throughout this post, I'll use a silly example. My first year of graduate school, I got distracted from my actual research by the hypothesis that listening to music with lyrics decreased my ability to write papers for my classes. I'll call this the "Bob Dylan" hypothesis, since I was listening to a lot of Dylan at the time. Let's represent this by the following causal diagram.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1-X9DCZ1lTemFTzINBD4GkDEPislXY9vrabAHZZNr8A9CNZ0wlc9oGEyXS5DEV0dzkwOWh-3rtbtEIi6HIjoHCJCFFuq3jR6cg08J86f_x4a6tfdI077spgmTJFrVo6y92NvQW_i8m0Fs/s1600/Screen+Shot+2019-10-07+at+8.50.57+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="704" data-original-width="448" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1-X9DCZ1lTemFTzINBD4GkDEPislXY9vrabAHZZNr8A9CNZ0wlc9oGEyXS5DEV0dzkwOWh-3rtbtEIi6HIjoHCJCFFuq3jR6cg08J86f_x4a6tfdI077spgmTJFrVo6y92NvQW_i8m0Fs/s200/Screen+Shot+2019-10-07+at+8.50.57+PM.png" width="126" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
Our outcome is writing skill (Y) and our predictor is Dylan listening (X). The edge between them represents a hypothesized causal relationship. Dylan is hypothesized to affect writing skill, and not vice versa. (These kinds of diagrams are called <a href="https://en.wikipedia.org/wiki/Causal_graph">causal graphical models</a>*).<br />
<br />
<b>Observational Studies and Experiments</b><br />
<br />
Suppose we did an observational study where we measured each of these variables in a large population. Assume we came up with some way to sample people's writing, get a measure of whether they were or weren't listening to lyric-heavy music at the time, and assess the writing sample's quality. We might find that Y was correlated with X, but in a surprising direction: listening to Dylan would be related to <i>better</i> writing.<br />
<br />
Can we make a causal inference in this case? If so, we could get rich promoting a Dylan-based writing intervention. Unfortunately, we can't – correlation doesn't equal causation here, because there is (at least one) confounding third variable: age (Z). Age is positively related to both Dylan listening and writing skill in our population of interest. Older people tend to be good writers and also tend to be more into folk rockers; I'm not even going to put a question mark on this edge because I'm pretty sure this is true.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9qASfcND_jkm9-hP1I64ybQ1OYKnWw0iFfnqi6Do1TRQMqeM1p4HtqJnU9z3BsSjDkjlAPvPWZwIKQBOTjyhr4eqMEXYRwahH9wgmeSkoAgwMXA2akWdp-p5FS0V4ULBXJTbDgDpa7zJK/s1600/Screen+Shot+2019-10-07+at+8.51.03+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="970" data-original-width="704" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9qASfcND_jkm9-hP1I64ybQ1OYKnWw0iFfnqi6Do1TRQMqeM1p4HtqJnU9z3BsSjDkjlAPvPWZwIKQBOTjyhr4eqMEXYRwahH9wgmeSkoAgwMXA2akWdp-p5FS0V4ULBXJTbDgDpa7zJK/s200/Screen+Shot+2019-10-07+at+8.51.03+PM.png" width="145" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
But: the causal relationship of age to our other two variables means that variation in Z can induce a correlation between X and Y, even in the absence of a true causal link. We can say that age is a <b>confound</b> in estimating the Dylan-writing skill relationship: it's a variable that is correlated with both our predictor and our outcome variables.<br />
<br />
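We can see this in a quick simulation. Here's a minimal sketch in R (all numbers invented for illustration): Z (age) drives both X (Dylan listening) and Y (writing skill), X has no causal effect on Y, and yet X and Y end up correlated.<br />
<pre># age (Z) causes both dylan (X) and writing (Y);
# dylan has *no* causal effect on writing.
set.seed(42)
n <- 10000
age <- rnorm(n, mean = 40, sd = 10)             # Z
dylan <- rbinom(n, 1, plogis((age - 40) / 10))  # X: older folks listen more
writing <- 50 + 0.5 * age + rnorm(n, sd = 5)    # Y: driven by age alone

cor.test(writing, dylan)  # reliably positive, purely via the confound</pre>
<br />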
To get gold-standard evidence about causality, we need to do an experiment. (We won't discuss statistical techniques for inferring causality, which can be useful but don't give you gold standard evidence anyway; <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6558187/">review here</a>).<br />
<br />
Experiments are when we intervene on the world and measure the consequences. Here, this means forcing some people to listen to Dylan. In the language of graphical models, if <i>we </i>control the Dylan listening, that means that variable X is causally exogenous. (Exogenous means that it's not caused by anything else in the system). We "snipped" the causal link between age and Dylan listening.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyW_VW663pN55fyuAGk2CljhW3IwXKdbxVVFKlCfODwTuT29PSkSaCTBA9pqQJt62PyKZT3ocGJz-NJ2BvNHnlp_IbYeG9L6J4alhMdfBeGVVIIVM_KKU8B6sikphHtQ_ygL01bsEFLRfe/s1600/Screen+Shot+2019-10-07+at+8.51.14+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="964" data-original-width="736" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyW_VW663pN55fyuAGk2CljhW3IwXKdbxVVFKlCfODwTuT29PSkSaCTBA9pqQJt62PyKZT3ocGJz-NJ2BvNHnlp_IbYeG9L6J4alhMdfBeGVVIIVM_KKU8B6sikphHtQ_ygL01bsEFLRfe/s200/Screen+Shot+2019-10-07+at+8.51.14+PM.png" width="152" /></a></div>
<br />
So now we can "wiggle" the Dylan listening variable – change it experimentally – and see if we detect any changes in writing skill. We do this by randomly assigning individuals to listen to Dylan or not and then measuring writing during the assigned listening (or non-listening) period. This is a "between-subjects" design. We can use our randomized experiment to get a measure of the <a href="https://en.wikipedia.org/wiki/Average_treatment_effect">average treatment effect</a> of Dylan, the size of the causal effect of the intervention on the outcome. In this simple experiment, the ATE is estimated by the regression Y ~ X (for ease of exposition, I'm not going to discuss so-called mixed models, which model variation across subjects and/or experimental items). That's the elegant logic of randomized experiments: the difference between condition gives you the average effect.<br />
<br />
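Continuing the simulation sketch from above, randomization breaks the age–Dylan link, and the simple regression recovers the true effect (here I build in a true effect of -2):<br />
<pre># randomize dylan listening, snipping the age -> dylan edge
dylan_rct <- rbinom(n, 1, 0.5)  # coin-flip assignment
writing_rct <- 50 + 0.5 * age - 2 * dylan_rct + rnorm(n, sd = 5)

coef(lm(writing_rct ~ dylan_rct))["dylan_rct"]  # close to -2, unbiased</pre>
<br />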
<b>Confounds </b><br />
<br />
Let's consider an alternate experiment now. Suppose we did the same basic procedure, but now with a "within-subjects" design where participants do both the Dylan treatment and the control, in that order. This experiment is flawed, of course. If you observe a Dylan effect, you can't rule out the idea that participants got tired and wrote worse in the control condition because it always came second.<br />
<br />
Order (Dylan first vs. control first; notated X') is an <b>experimental confound</b>: a variable that is created in the course of the experiment that is both causally related to the predictor and potentially also related to the outcome. Here's how the causal model now looks:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6jG5pEYWe8VegGLlElpTPtkyc8Q3gArLBYHv8El7u7LCODmCc7Qhu4WSSVyCFJ1lRTX5v-gx7KkFyJ15tZ0bacKHqrfYBe0MR1IykcpOhtp93enWZ1u_2-EZaJEfsuihOxOe-S5KmWbKe/s1600/Screen+Shot+2019-10-07+at+9.43.35+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="796" data-original-width="808" height="196" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6jG5pEYWe8VegGLlElpTPtkyc8Q3gArLBYHv8El7u7LCODmCc7Qhu4WSSVyCFJ1lRTX5v-gx7KkFyJ15tZ0bacKHqrfYBe0MR1IykcpOhtp93enWZ1u_2-EZaJEfsuihOxOe-S5KmWbKe/s200/Screen+Shot+2019-10-07+at+9.43.35+PM.png" width="200" /></a></div>
<br />
<br />
We've reconstructed the same kind of confounding relationship we had with age, where we had a variable (X') that was correlated both with our predictor (X) and our outcome (Y)! So...<br />
<br />
<b>What should we do with our experimental confounds? </b><br />
<br />
<b>Option 1.</b> Randomize. Increasingly, this is my go-to method for dealing with any confound. Is the correct answer on my survey confounded with response side? Randomize what side the response shows up on! Is order confounded with condition? Randomize the order you present in! Randomization is much easier now that we program many of our experiments using software like Qualtrics or code them from scratch in JavaScript.<br />
<br />
The only time you really get in trouble with randomization is when you have a large number of options, a small number of participants, or some combination of the two. In this case, you can end up with unbalanced levels of the randomized factors (for example, ten answers on the right side and two on the left). Averaging across many experiments, this lack of balance will come out in the wash. But in a single experiment, it can really mess up your data – especially if your participants notice and start choosing one side more than the other <i>because</i> it's right more often. For that reason, when balance is critical, you want option 2.<br />
<br />
<b>Option 2.</b> Counterbalance. If you think a particular confound might have a significant effect on your measure, balancing it across participants and across trials is a very safe choice. That way, you are guaranteed to have no effect of the confound on your average effect. In a simple counterbalance of order for our Dylan experiment, we manipulate condition order between subjects. Some participants hear Dylan first and others hear Dylan second. Although technically we might call order a second "factor" in the experiment, in practice it's really just a nuisance variable, so we don't talk about it as a factor and we often don't analyze it (but see Option 3 below).<br />
<br />
In the causal language we have been using, counterbalancing allows us to snip out the causal dependency between order and Dylan. Now they are unconfounded (uncorrelated) with one another. We've "solved" a confound in our experimental design. Here's the picture:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidisBklxHw5tj4Hi6NYR4BECch8Cx-SYJe-LFhNYqFf1oboYEmHPPrVBFwcYYKwzOAHTsjSZarFVm6L43Khk3nDne-s7MWKyq6Fm3UaWGfOxfuUSYhISkxqa5gII6qTdGI5w0ZK5jkoL8W/s1600/Screen+Shot+2019-10-07+at+10.27.55+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><br class="Apple-interchange-newline" /><img border="0" data-original-height="812" data-original-width="790" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidisBklxHw5tj4Hi6NYR4BECch8Cx-SYJe-LFhNYqFf1oboYEmHPPrVBFwcYYKwzOAHTsjSZarFVm6L43Khk3nDne-s7MWKyq6Fm3UaWGfOxfuUSYhISkxqa5gII6qTdGI5w0ZK5jkoL8W/s200/Screen+Shot+2019-10-07+at+10.27.55+PM.png" width="194" /></a></div>
<br />
Counterbalancing doesn't always work, though. It gets trickier when you have too many levels on a variable (too many Dylan songs!) or multiple confounding variables. For example, if you have lots of different nuisance variables – say, condition order, what writing prompt you use for each order, which Dylan song you play – it may not be possible to do a fully-crossed counterbalance so that all combinations of these factors are seen by equal numbers of participants. In these kinds of cases, you may have to rely on partial counterbalancing schemes or <a href="http://compneurosci.com/wiki/images/9/98/Latin_square_Method.pdf">latin squares designs</a>, or you may have to fall back on randomization.<br />
<br />
<b>Option 3.</b> Do Options 1 and 2 and then model the variation. This option was never part of my training, but it's an interesting third option that I'm increasingly considering.** That is, we are often faced with the choice between A) a noisy between-participants design and B) a lower-noise within-participants design that nevertheless adds noise back in via some obvious order effect that you have to randomize or counterbalance. In a recent talk, Andrew Gelman suggested that we try to model these nuisance variables as covariates, to reduce noise. This seems like a pretty interesting suggestion, especially if the correlation between them and the outcome is substantial.***<br />
<br />
<b>Covariates</b><br />
<br />
Going back to our example, now we have two variables – age and order – that are no longer <i>confounded</i> with our primary relationship of interest (i.e., Dylan and writing). But they may still be related to our outcome measure. Here's what the picture looks like, repeated from above.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidisBklxHw5tj4Hi6NYR4BECch8Cx-SYJe-LFhNYqFf1oboYEmHPPrVBFwcYYKwzOAHTsjSZarFVm6L43Khk3nDne-s7MWKyq6Fm3UaWGfOxfuUSYhISkxqa5gII6qTdGI5w0ZK5jkoL8W/s1600/Screen+Shot+2019-10-07+at+10.27.55+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="812" data-original-width="790" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidisBklxHw5tj4Hi6NYR4BECch8Cx-SYJe-LFhNYqFf1oboYEmHPPrVBFwcYYKwzOAHTsjSZarFVm6L43Khk3nDne-s7MWKyq6Fm3UaWGfOxfuUSYhISkxqa5gII6qTdGI5w0ZK5jkoL8W/s200/Screen+Shot+2019-10-07+at+10.27.55+PM.png" width="194" /></a></div>
<br />
Even if they are not confounding our experimental manipulation, age and experimental condition order may still be correlated with our outcome measure, writing skill. How does this work? Well, the average treatment effect of Dylan on writing is still given by the regression Y ~ X. But we also know that there is some variance in Y that is due to X' and Z.<br />
<br />
That's because age and order are <b>covariates</b>: they may – by virtue of their potential causal links with the outcome variable – have some correlation with outcomes, even in a case where the predictor is experimentally manipulated. This should be intuitive for the external (age) covariate, but it's true for both: they may account for variance in Y over and above that controlled by the experimental manipulation of X.<br />
<br />
<b>What should we do about our covariates? </b><br />
<br />
<b>Option 1.</b> Nothing! We are totally safe in ignoring all of our covariates, regressing Y on X and treating the estimate as an unbiased estimate of the effect (the ATE). This is why randomization is awesome. We are guaranteed that, in the limit of many different experiments, even though people with different ages will be in the different Dylan conditions, this source of variation will be averaged out.<br />
<br />
The <b>first fallacy of covariates</b> is that, because you have a known covariate, you have to adjust for it. Not true. You can just ignore it and your estimate of the ATE is unbiased. This is the norm in cognitive psychology, for example: variation between individuals is treated as noise and averaged out. Of course, there are weaknesses in this strategy – you will not learn about the relationship of your treatment to those covariates! – but it is sound.<br />
<br />
<b>Option 2. </b>If you have a small handful of covariates that you believe are meaningfully related to the outcome, you can plan in advance to adjust for them in your regression. In our Dylan example, this would be a pre-registered plan to add Z as a predictor: Y ~ X + Z. If age (Z) is highly correlated with writing ability (Y), then this will give us a more precise estimate of the ATE, while remaining unbiased.<br />
<br />
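Here's a minimal simulation sketch of why adjustment helps (my own toy numbers, not the code linked below): both estimators are unbiased, but the adjusted one is less variable.<br />
<pre># simulate many small randomized experiments (true ATE = 1) and compare
# the spread of the unadjusted vs. covariate-adjusted estimates
set.seed(1)
estimates <- replicate(1000, {
  n <- 100
  z <- rnorm(n)                    # covariate, correlated with outcome
  x <- rbinom(n, 1, 0.5)           # randomized treatment
  y <- 1 * x + 0.8 * z + rnorm(n)
  c(unadj = coef(lm(y ~ x))["x"],
    adj = coef(lm(y ~ x + z))["x"])
})
apply(estimates, 1, mean)  # both are close to 1: unbiased either way
apply(estimates, 1, sd)    # the adjusted estimate is noticeably tighter</pre>
<br />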
When should we do this? Well, it turns out that you need a pretty strong correlation to make a big difference. There's some nice code to simulate the effects of covariate adjustment on precision in <a href="http://egap.org/methods-guides/10-things-know-about-covariate-adjustment">this useful blogpost on covariate adjustment</a>; <a href="https://gist.github.com/mcfrank/a165356fcff909fdc41fdf82c31fc277">I lightly adapted it</a>. Here's the result:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNvXC-s2FdHpe-CV11ka6-iAXRKCItZ4lpSBCSKc-I3WsxahDBkZ248_GkH7F83W3GGaUWq_Pr5IfYL_fwJGtD0my_e1h03gNcMyQ0sZufJgXphBQz3shOmwmKi9WSPtJllbRK4o1mKRuZ/s1600/Rplot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="389" data-original-width="584" height="266" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNvXC-s2FdHpe-CV11ka6-iAXRKCItZ4lpSBCSKc-I3WsxahDBkZ248_GkH7F83W3GGaUWq_Pr5IfYL_fwJGtD0my_e1h03gNcMyQ0sZufJgXphBQz3shOmwmKi9WSPtJllbRK4o1mKRuZ/s400/Rplot.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
Root mean squared error (RMSE; lower RMSE means greater precision, in other words) is plotted as a function of the sample size (N). Different colors show the increase in precision when you control for covariates with different levels of correlation with the outcome variable. For low levels of correlation with the covariate, you don't get much increase in precision (pink and red lines). Only when the correlation is .6 or above do we see noticeable increases in precision, and it only really makes a big difference with correlations in the range of .8.<br />
<br />
Considering these numbers in light of our Dylan study, I would bet that age and writing skill are not correlated at &gt; .8 (unless we're looking at ages from kindergarten to college!). I would guess that in an adult population this correlation would be much, much lower. So maybe it's not worth controlling for age in our analyses.<br />
<br />
And the same is probably true for order, our other covariate – although perhaps we do think that order has a strong correlation with our skill measure. For example, maybe our experiment is long and there are big fatigue effects. In that case, we would want to condition.<br />
<br />
So these are our options: if the covariate is known to be very strongly correlated with the outcome, we can condition. Otherwise we should probably not worry about it.<br />
<br />
<b>What <i>shouldn't</i> we do with our covariates?</b><br />
<b><br /></b>
<b>Don't c</b><b>ondition on lots and lots of covariates because you think they are theoretically important. </b>There are lots of things that people do with covariates that they shouldn't be doing. My personal hunch is that this is because a lot of researchers think that covariates (especially demographic ones like age, gender, socioeconomic status, race, ethnicity, etc.) are important. That's true: these are important variables. But that doesn't mean you need to control for them in every regression. This leads us to the second fallacy.<br />
<br />
The <b>second fallacy of covariates</b> is that, because you think covariates are in general meaningful, it is not harmful to control for them in your regression model. In fact, if you control for meaningless covariates in a standard regression model, you will on average reduce your ability to see differences in your treatment effect. Just by chance your noise covariates will "soak up" variation in the response, leaving less to be accounted for by the true treatment effect! Even if you strongly suspect something is a covariate, you should be careful before throwing it into your regression model.<br />
<br />
<b>Don't condition on covariates because your groups are unbalanced. </b>People often talk about "unhappy randomization": you randomize adults to the different Dylan groups, for example, but then it turns out the mean age is a bit different between groups. Then you do a <i>t</i>-test or some other statistical test and find out that you actually have a significant age difference. But this makes no sense: because you randomized, you <i>know</i> that the difference in ages occurred by chance, so why are you using a <i>t</i>-test to test <i>if</i> the variation is due to chance? In addition, if your covariate isn't highly correlated with the outcome, this difference won't matter (see above). Finally, if you adjust for this covariate because of such a statistical test, you can actually end up biasing estimates of the ATE across the literature. Here's a really useful blogpost from the <a href="https://blogs.worldbank.org/impactevaluations/should-we-require-balance-t-tests-baseline-observables-randomized-experiments">World Bank</a> that has more details on why you shouldn't follow this practice.<br />
<br />
<b>Don't condition on covariates post-hoc. </b>The previous example is a special case of a general practice that you shouldn't follow. Don't look at your data and then decide to control for covariates! Conditioning on covariates based on your data is an extremely common route for <i>p</i>-hacking; in fact, it's so common that it shows up in Simmons, Nelson, & Simonsohn's (2011) instant classic <a href="https://journals.sagepub.com/doi/10.1177/0956797611417632">False Positive Psychology</a> paper as one of the key ingredients of analytic flexibility. Data-dependent selection of covariates is a quick route to false positive findings that will be less likely to be replicable in independent samples.<br />
<b><br /></b>
<b><span class="il" style="background-color: white; font-family: inherit;">Don't condition</span><span style="background-color: white; font-family: inherit;"> on a post-</span><span class="il" style="background-color: white; font-family: inherit;">treatment</span></b><span style="background-color: white; font-family: inherit;"><b> variable. </b>As we discussed above, there are some reasons to condition on highly-correlated covariates in general. But there's an exception to this rule. There are some variables that are never OK to condition on – in particular, any variable that is collected after treatment. For example, we might think that another good covariate would be someone's enjoyment of Bob Dylan. So, after the writing measurements are done, we do a Dylan Appreciation Questionnaire (DAQ). The problem is, imagine that having a bad experience writing while listening to Dylan might actually change your DAQ score. So then people in the Dylan condition would have <i>lower</i> DAQ on average. If we control for DAQ in our regression (Y ~ X + DAQ), we then distort our estimate of the effects of Dylan. Because DAQ and X (Dylan condition) are correlated, DAQ will end up soaking up some variance that is actually due to condition. This is bad news. </span><span style="background-color: white; font-family: inherit;">Here's a nice </span><a data-saferedirecturl="https://www.google.com/url?q=http://www.dartmouth.edu/~nyhan/post-treatment-bias.pdf&source=gmail&ust=1570578022546000&usg=AFQjCNG45hySa0tLnhJAjhULDW_AEs9d8w" href="http://www.dartmouth.edu/~nyhan/post-treatment-bias.pdf" style="background-color: white; color: #1155cc; font-family: inherit;" target="_blank">paper</a> that explains this issue in more detail.<br />
<br />
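Here's a sketch of that bias in simulation (my own toy numbers): the DAQ score depends on the (Dylan-affected) writing experience, and conditioning on it distorts the estimate.<br />
<pre># daq is measured *after* treatment and is affected by the outcome
set.seed(123)
n <- 100000
dylan <- rbinom(n, 1, 0.5)
writing <- 50 - 2 * dylan + rnorm(n)  # true Dylan effect = -2
daq <- 0.5 * writing + rnorm(n)       # post-treatment variable

coef(lm(writing ~ dylan))["dylan"]        # close to -2: unbiased
coef(lm(writing ~ dylan + daq))["dylan"]  # close to -1.6: biased toward 0</pre>
<br />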
<b>Don't condition on a collider. </b>This issue is a little bit off-topic for the current post, since it's primarily an issue in observational designs, but here's a really good <a data-saferedirecturl="https://www.google.com/url?q=http://www.the100.ci/2017/03/14/that-one-weird-third-variable-problem-nobody-ever-mentions-conditioning-on-a-collider/&source=gmail&ust=1570578022546000&usg=AFQjCNEcBFxqRFGZ7jSV2vc_3O6EN-kVQg" href="http://www.the100.ci/2017/03/14/that-one-weird-third-variable-problem-nobody-ever-mentions-conditioning-on-a-collider/" style="background-color: white; color: #1155cc;" target="_blank">blogpost</a> about it.<br />
<br />
<b>Conclusions</b><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Covariates and confounds are some of the most basic concepts underlying experimental design and analysis in psychology, yet they are surprisingly complicated to explain. Often the issues seem clear until it comes time to do the data analysis, at which point different assumptions lead to different default analytic strategies. I'm especially concerned that these strategies vary by culture, for example with some psychologists always conditioning on confounders, and others never doing so. (</span><a href="http://www.the100.ci/2019/10/03/indirect-effect-ex-machina/" style="font-family: inherit;">We haven't even talked about mediation and moderation</a><span style="font-family: inherit;">!). Hopefully this post has been useful in using the vocabulary of causal models to explain some of these issues. </span><br />
<span style="font-family: inherit;"><br /></span>
---<br />
* <span style="font-family: inherit;">The definitive resource on causal graphical models is <a href="http://ftp.cs.ucla.edu/pub/stat_ser/r350.pdf">Pearl (2009)</a>. It's not easy going, but it's very important stuff.</span><span style="background-color: white; font-family: inherit;"> Even just starting to read it will strengthen your methods/stats muscles.</span><br />
<span style="background-color: white; font-family: inherit;">** Importantly, it's a lot like adding random effects to your model – you model sources of structure in your data so that you can better estimate the particular effects of interest. </span><br />
<span style="background-color: white; font-family: inherit;">*** The advice not to model covariates that aren't very correlated with your outcome is very frequentist, with the idea being that you lose power when you condition on too many things. In contrast, Gelman & Hill (2006) give more Bayesian advice: if you think a variable matters to your outcome, keep it in the model. This advice is consistent with the idea of modeling experimental covariates, even if they don't have a big correlation with the outcome. In the Bayesian framework, including this extra information should (maybe only marginally) improve your precision but you aren't "spending degrees of freedom" in the same way. </span>Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com1tag:blogger.com,1999:blog-4297242917419089261.post-37837315784065996782019-07-23T20:58:00.001-07:002019-07-23T21:22:52.803-07:00An ethical duty for open science?Let's do a thought experiment. Imagine that you are the editor of a top-flight scientific journal. You are approached by a famous researcher who has developed a novel molecule that is a cure for a common disease, at least in a particular model organism. She would like to publish in your journal. Here's the catch: her proposed paper describes the molecule and asserts its curative properties. You are a specialist in this field, and she will personally show you any evidence that you need to convince you that she is correct – including allowing you to administer this molecule to an animal under your control and allowing you to verify that the molecule is indeed the one that she claims it is. But she will not put any of these details in the paper, which will contain only the factual assertion.<br />
<br />
Here's the question: should you publish the paper?<br />
<br />
<a name='more'></a><br />
If you publish it quickly, you will ensure that the molecule is known quickly and hence that translational research in humans will commence as soon as possible. This step will likely save many lives. In addition, the article is likely to be well-cited (since, as we have stipulated, it is correct). So publication should be assured, right?<br />
<br />
On the other hand, perhaps you share some reservations about publication. This paper doesn't look like a traditional scientific paper: it provides no methods or data; it only asserts a conclusion. There is no way for a reader to reproduce the experiments that led to the assertion, since no experiments are even mentioned. That doesn't feel like science. Maybe it's worthy of being published in the newspaper. But not in a scientific journal.<br />
<br />
Further, you might be worried about the precedent set by this individual decision. Isn’t this person saying that the editor is the sole arbiter of the work they’ve done? How will this work out in the hands of other editors? Also, who gets to write such an article? You paid attention to this person because she was already famous – but you probably wouldn’t have taken time to verify the work of an unknown scientist so it could be published in this way.<br />
<br />
I posted a version of this thought experiment as a twitter poll, and – with 1,025 respondents – saw only 6% recommending publication:<br />
<blockquote class="twitter-tweet">
<div dir="ltr" lang="en">
Thought experiment. You're editor at a journal. A famous scientist submits an important medical discovery. She proves to you that she's right (stipulation: she's right), but her paper will contain *only* the assertion of the discovery, no methods/data. Should you publish?</div>
— Michael C. Frank (@mcxfrank) <a href="https://twitter.com/mcxfrank/status/1153415839363690496?ref_src=twsrc%5Etfw">July 22, 2019</a></blockquote>
<script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script>
Even though this was a self-selected group of respondents, that’s as strong an intuition as you’d pretty much ever find.<br />
<br />
Remember, if we were purely utilitarian in our treatment of scientific reporting standards, this would be an obvious case: we should publish the paper. Perhaps we could make an argument about the long term utility of the precedent, but that’s an analysis of <i>future</i> rule-making, not the logic of this particular case. Thus, the intuition that the paper shouldn’t be published stems from something <i>other</i> than the immediate utility of the situation.<br />
<br />
Perhaps, like me, you think that the essence of science is verifiability – so if others can’t check your work, you are not contributing to science.<br />
<br />
This thought experiment demonstrates that we feel that scientists have a duty to report their methods and data to the community as part of reporting their findings. What is the nature of this duty? It is not based on the utility of any individual instance of publication. Is it a conventional norm that we can violate? In other words, is it like wearing pajamas to work – we don't happen to do that around here, but if we did it would be OK?<br />
<br />
Let's try a further thought experiment (based on <a href="https://files.eric.ed.gov/fulltext/ED142299.pdf">Nucci & Turiel, 1978</a>). Imagine now that there were a journal where people just published assertions, and didn't have to report methods or results. (In this further thought experiment, we don't stipulate that the assertions are correct). Would this still be a <i>real</i> scientific journal? I think the judgment is pretty clear that it wouldn't be. It would be an opinion magazine <i>on the topic of science</i> but it wouldn't be science. So the intuition that we shouldn't publish the assertion paper is not an intuition about social conventions or norms.<br />
<br />
Instead, the duty has the force of a moral or ethical norm, something that is in force regardless of what some particular community's norms are. In other words, more like stealing and hurting people – wrong pretty much always – than like wearing pajamas to work. Consider the idea of "pseudoscience": this is a word that refers precisely to communities or people who say they are doing science but are actually violating the principles of science!<br />
<br />
This ethical norm emerges from concerns about benefits to a broader community (the scientific enterprise as a whole) rather than from concerns for the individual researcher. And it feels tied up in concerns about fairness or universality as well. A scientist getting to publish because she's famous and maybe more likely to be believed (or perhaps even more likely to be right) doesn't feel like a fair way for science to work. You might even say that the norm of reporting information for verification and assessment of scientific findings is a deontological norm: one that is designed so that it can appropriately or fairly be held by the entire community.<br />
<br />
No one has to write scientific papers, of course, but if they choose to, they have to report sufficient information for another researcher to verify their conclusions. An assertion just won’t do.<br />
<br />
I’ll just end with a question. Why does the ethical duty to provide verification information stop at the conventional reporting standards of a scientific paper, which – as many people have observed – are insufficient for fully reproducing the data analysis or independently replicating the data collection?<br />
<br />
<i>(Thanks to Tom Hardwicke for discussion.)</i><br />
<div>
<br /></div>
Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-48733526806343856272019-05-06T08:53:00.000-07:002019-05-06T08:53:32.773-07:00It's the random effects, stupid!<i>(tl;dr: wonky post on statistical modeling)</i><br />
<br />
I fit linear mixed effects models (LMMs) for most of the experimental data I collect. My data are typically repeated observations nested within subjects, and often have crossed effects of items as well; this means I need to account for this nesting and crossing structure when estimating the effects of various experimental manipulations. For the last ten years or so, I've been fitting these models in <span style="font-family: "courier new" , "courier" , monospace;">lme4</span> in R, a popular package that allows quick specification of complex models.<br />
<br />
One question that comes up frequently regarding these models is what random effects structure to include. I typically follow the advice of <a href="https://www.sciencedirect.com/science/article/pii/S0749596X12001180">Barr et al. (2013)</a>, who recommend "maximal" models – models that include random effects for all the fixed effects that have repeated observations within a random grouping factor. So for example, if you have observations for both conditions for each subject, fit random condition effects by subject. This approach contrasts, however, with the <a href="https://arxiv.org/abs/1506.04967">"parsimonious" approach of Bates et al.</a>,* who argue that such models can be over-parameterized relative to variability in the data. The issue of choosing an approach is further complicated by the fact that, in practice, <span style="font-family: "courier new" , "courier" , monospace;">lme4</span> can almost never fit a completely maximal model and instead returns convergence warnings. So then you have to make a bunch of (perhaps ad-hoc) decisions about what to prune or how to tweak the optimizer.<br />
<br />
Last year, responding to this discussion, I posted a blogpost that became surprisingly popular, <a href="http://babieslearninglanguage.blogspot.com/2018/02/mixed-effects-models-is-it-time-to-go.html">arguing for the adoption of Bayesian mixed effects models</a>. My rationale was not mainly that Bayesian models are interpretively superior – which they are, IMO – but just that they allow us to fit the random effect structure that we want without doing all that pruning business. Since then, we've published a few papers (e.g. <a href="https://psyarxiv.com/8p67h/">this one</a>) using Bayesian LMMs (mostly without anyone even noticing or commenting).**<br />
<br />
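Concretely, the swap is tiny. Here's a sketch using the brms package (about which more below); the dataset d is a hypothetical placeholder, with variable names echoing the ManyBabies model that follows:<br />
<pre>library(lme4)
library(brms)

# frequentist fit: the maximal structure often fails to converge
m_freq <- lmer(log_lt ~ trial_type + (trial_type | subid), data = d)

# Bayesian fit: same formula syntax, default priors, sampled via Stan
m_bayes <- brm(log_lt ~ trial_type + (trial_type | subid), data = d)</pre>
<br />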
In the meantime, I was working on the ManyBabies project. We finally completed data collection on our first study, a 60+ lab consortium study of babies' preference for infant-directed speech! This is exciting and big news, and I will post more about it shortly. But in the course of data analysis, we had to grapple with this same set of LMM issues. In our pre-registration (which, for what it's worth, was written before I really had tried the Bayesian methods), we said we would try to fit a maximal LMM with the following structure. It doesn't really matter what all the predictors are, but <span style="font-family: "courier new" , "courier" , monospace;">trial_type</span> is the key experimental manipulation:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;"><b>M1)</b> log_lt ~ trial_type * method +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> trial_type * trial_num +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> age_mo * trial_num +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> trial_type * age_mo * nae +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> (trial_type * trial_num | subid) +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> (trial_type * age_mo | lab) + </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> (method * age_mo * nae | item)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
Of course, we knew this model would probably not converge. So we preregistered a pruning procedure, which we followed during data analysis, leaving us with:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;"><b>M2)</b> log_lt ~ trial_type * method +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> trial_type * trial_num +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> age_mo * trial_num +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> trial_type * age_mo * nae +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> (trial_type | subid) +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> (trial_type | lab) + </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> (1 | item)</span><br />
<br />
We fit that model and report it in the (under review) paper, and we interpret the <i>p</i>-values as real <i>p</i>-values (well, as real as <i>p</i>-values can be anyway), because we are doing exactly the confirmatory thing we said we'd do. But in the back of my mind, I was wondering if we shouldn't have fit the whole thing with Bayesian inference and gotten the random effect structure that we hoped for.***<br />
<br />
So I did that. Using the amazing <span style="font-family: "courier new" , "courier" , monospace;">brms</span> package, all you need to do is replace "lmer" with "brm" (to get a default prior model with default inference).**** Fitting the full LMM on my MacBook Pro takes about 4hrs/chain with completely default parameters, so 16 hrs total – though if you do it in parallel you can fit all four at once. I fit M1 (the maximal model, called "bayes"), M2 (the pruned model, "bayes_pruned"), and for comparison the frequentist (also pruned, called "freq") model. Then I plotted coefficients and CIs against one another for comparison. There are three plots, corresponding to the three pairwise comparisons (brms M1 vs. lme4 M2, brms M1 vs. brms M2, and brms M2 vs. lme4 M2). (So as not to muddy the interpretive waters for ManyBabies, I'm just showing the coefficients without labels here). Here are the results.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHhbk7Ozmd4RMmj_6tBd3o7UjrbM7IfvHmK-29zkaYuxI_64pwJmN9OP45ksdrViLLVWKZ0z81grFV47fi-tug_1cHgZTAQWvcmSwVxIffuxjxZ5-RkCGANjgZ55uy8jb-DJUdyXJwUPgx/s1600/bayes.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="507" data-original-width="702" height="459" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHhbk7Ozmd4RMmj_6tBd3o7UjrbM7IfvHmK-29zkaYuxI_64pwJmN9OP45ksdrViLLVWKZ0z81grFV47fi-tug_1cHgZTAQWvcmSwVxIffuxjxZ5-RkCGANjgZ55uy8jb-DJUdyXJwUPgx/s640/bayes.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
As you can see, to a first approximation, there are not huge differences in coefficient magnitudes, which is good. But, inspecting the top row of plots, you can see that the full Bayesian M1 does have two coefficients that are different from both the Bayesian M2 <i>and</i> the frequentist M2. In other words, the fitting method didn't matter with this big dataset – but the random effects structure did! Further, if you dig into the confidence intervals, they are again similar between fitting methods but different between random effects structures. Here's a pairs plot of the correlation between upper CI limits (note that .00 here means a correlation of 1.00!):<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQn251ULhkK96qP4NQDfrFgWSFAqCX0RJROsh6mRrxgmYa4GjaHvlKARFGU106iW0-uLHxKxASju3w2L9aKNMAKj-UXB-q3aOmc0-dUHGUbrcyw8TfpZI-ncohbHbxfOG1JHXJAoSQuhNW/s1600/bayes_upper_ci.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="507" data-original-width="702" height="460" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQn251ULhkK96qP4NQDfrFgWSFAqCX0RJROsh6mRrxgmYa4GjaHvlKARFGU106iW0-uLHxKxASju3w2L9aKNMAKj-UXB-q3aOmc0-dUHGUbrcyw8TfpZI-ncohbHbxfOG1JHXJAoSQuhNW/s640/bayes_upper_ci.png" width="640" /></a></div>
<br />
Not huge differences, but they track with random effect structure again, not with the fitting method.<br />
<br />
In sum, in one important practical case, we see that fitting the maximal model structure (rather than the maximal <i>convergent</i> model structure) seems to make a difference to model fit and interpretation. This evidence to me supports the Bayesian approach that I recommended in my prior post. I don't know that M1 is the <i>best</i> model – I'm trusting the "keep it maximal" recommendation on that point. But to the extent that I should be able to fit all the models I want to try, then using brms (even if it's slower) seems important. So I'm going to keep using this fitting procedure in the immediate future.<br />
<br />
----<br />
* This approach seems very promising, but also a bit tricky to implement. I have to admit, I am a bit lazy and it is really helpful when software provides a solution for fitting that I can share with people in my lab as standard practice. A collaborator and I tried someone else's implementation of parsimonious models and it completely failed, and then we gave up. If someone wants to try it on this dataset I'd be happy to share!<br />
<br />
** An aside: after I posted, Doug Bates kindly engaged and encouraged me to adopt Julia, rather than R, for model fitting, if it was fitting that I wanted and not Bayesian inference. We did experiment a bit with this, and Mika Braginsky wrote the j<a href="https://github.com/mikabr/jglmm">glmm package</a> to use Julia for fitting. This experiment resulted in her <a href="https://psyarxiv.com/cg6ah/">in-press paper</a> using Julia for model fits, but also with us recognizing that 1) Julia is TONS faster than R for big mixed models, which is a win, but 2) Julia can't fit some of the baroque random effects structures that we occasionally use, and 3) installing Julia and getting everything working is very non-trivial, meaning that it's hard to recommend for folks just getting started.<br />
<br />
** Jake Westfall, back in 2016 when we were planning the study, said we should do this, and I basically told him that I thought that developmental psychologists wouldn't agree to it. But I think he was probably right.<br />
<br />
*** Code for this post is <a href="https://gist.github.com/mcfrank/f20b11aa84fdb35de9d6fcc2d589468a">on github</a>.<br />
<br />
<h4 style="text-align: left;">A (mostly) positive framing of open science reforms</h4>
I don't often get the chance to talk directly and openly to people who are skeptical of the methodological reforms that are being suggested in psychology. But recently I've been trying to persuade someone I really respect that these reforms are warranted. It's a challenge, but one of the things I've been trying to do is give a positive, personal framing to the issues. Here's a stab at that.
<br />
<br />
My hope is that a new graduate student in the fields I work on – language learning, social development, psycholinguistics, cognitive science more broadly – can pick up a journal and choose a seemingly strong study, implement it in my lab, and move forward with it as the basis for a new study. But unfortunately my experience is that this has not been the case much of the time, even in cases where it should be. I would like to change that, starting with my own work.<br />
<br />
Here's one example of this kind of failure: When I was a first-year assistant professor, a grad student and I tried to replicate a well-known study by one of my grad school advisors. We failed repeatedly – despite the fact that we ended up thinking the finding was real (eventually published as <a href="http://langcog.stanford.edu/papers_new/lewis-2016-jepg.pdf">Lewis & Frank, 2016</a>, JEP:G). The issue was likely that the original finding was an overestimate of the effect, because the original sample was very small. But converging on the truth was very difficult and required multiple iterations.<br />
<br />
<a name='more'></a><br />
This kind of thing happens to me quite a lot. I run a class in which first-year PhD students in my department try to replicate the published literature, often articles from Psych Science and other top journals. I've blogged about this course (e.g., <a href="http://babieslearninglanguage.blogspot.com/2018/12/how-to-run-study-that-doesnt-replicate.html">here</a>) and published on outcomes from it as well (<a href="https://psyarxiv.com/p73he/">Hawkins, Smith et al., 2018</a>, AMPPS). More than half of the time, these replication studies fail, roughly consistent with estimates from larger meta-science projects like RPP and the more recent (and higher-quality) ManyLabs 2 and Social Science Replication projects.<br />
<br />
The reasons for this failure are not always clear, and we don't always do the extensive followup work necessary to "debug" the experiment. But over time I have tried to identify a number of reasons for failures and use them as guides to the way I run my lab and provide methodological training for students. I also have advocated for journals and funders to adopt these reforms. Most are about transparency, and some are about good design practices. These reforms have been a win-win for my lab. They improve the clarity, impact, and validity of our work – mostly while speeding things up! Here they are.<br />
<div>
<br /></div>
<b>Share code and data</b>. Several studies, including ours (<a href="https://osf.io/preprints/metaarxiv/39cfb/">Hardwicke et al., 2018</a>, Royal Soc Open Science) show that MOST published journal articles contain some statistical errors, ranging from the trivial to the extreme. In reproducing the analytic calculations from a number of prominent papers (which would only be possible through data sharing), we have found major errors requiring correction in quite a few. Creating clear sharing pipelines leads to cleaner, easier-to-check papers.<br />
<br />
<b>Use a reproducible workflow</b>. Technical tools like git, RMarkdown, and Jupyter help students and other researchers report results whose provenance and relationship to the underlying data are known. These tools also speed up research dramatically, letting you share and reuse code much more effectively and auto-generate tables, graphs, and other elements of reports. They also decrease copy/paste errors in reporting! And for me as a PI, I love being able to "audit" the work that folks in my lab do, and to quickly and easily pull in figures, data, or other excerpts from github when I need to add them to a talk.<br />
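As a toy illustration (not my actual workflow), here's the kind of RMarkdown snippet that keeps reported statistics chained to the raw data – when the data change, the numbers in the text change with them:
<pre>
```{r}
# computed from the raw data at knit time ("data/exp1.csv" is a made-up path)
d <- read.csv("data/exp1.csv")
test <- t.test(rt ~ condition, data = d)
```

Participants were faster in the primed condition
(t = `r round(test$statistic, 2)`, p = `r round(test$p.value, 3)`).
</pre>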
<br />
<b>Preregister</b>. Everything in my lab is preregistered. All this means is that people in my lab need to write down what they are going to do (sample size, main analysis) before they do it. <a data-saferedirecturl="https://www.google.com/url?q=https://docs.google.com/document/d/1hQxvrrcVQUKkyjaotTPVaGeGD-6j9oLM90EDQCQ56AM/edit?usp%3Dsharing&source=gmail&ust=1554506197301000&usg=AFQjCNGe2BAK_NDD98Anz6xe-oeVbhq8_A" href="https://docs.google.com/document/d/1hQxvrrcVQUKkyjaotTPVaGeGD-6j9oLM90EDQCQ56AM/edit?usp=sharing" style="color: #1155cc;" target="_blank">Here's a sample</a>. If we have talked things through enough, writing the registration often takes 30 minutes; of course for more complex projects, more thought is needed (and it's a good thing to do that thinking ahead of time!). This process is not binding – we routinely violate our registration, and report our violation – and takes very little time. It just makes us transparently report what we knew <i>before</i> doing the study. As an added bonus, if you care about p < .05 results (I mostly don't), these are really only valid in the case of a preregistered hypothesis. There's what I think is a pretty good explanation of this perspective in our transparency guide from last year (<a data-saferedirecturl="https://www.google.com/url?q=https://www.collabra.org/article/10.1525/collabra.158/&source=gmail&ust=1554506197301000&usg=AFQjCNH-DUd1bHsQouKvLU6dPHyqTzCT6w" href="https://www.collabra.org/article/10.1525/collabra.158/" style="color: #1155cc;" target="_blank">Klein et al., 2018</a>, Collabra).<br />
<br />
<b>Follow best practices in experimental design</b>. That means thinking about reliability and validity, and using a psychometric perspective (e.g., including sampling multiple experimental items). It also means planning a sample size that is sufficient to get precise enough measures to make quantitative predictions. There is a huge body of knowledge about how to do good experiments from Rosenthal and Rosnow onward – but often we rely on lab lore and implicit learning.<br />
<br />
In sum, my worries about the literature have led me to a set of practices that – I think – have enhanced the research we do and made it more reproducible and replicable, while not slowing us down or making our workflow more onerous.<br />
<br />
<h4 style="text-align: left;">Nothing in childhood makes sense except in the light of continuous developmental change</h4>
I'm awestruck by the processes of development that operate over children's first five years. My daughter M is five and my newborn son J is just a bit more than a month old. J can't yet consistently hold his head up, and he makes mistakes even in bottle feeding – sometimes he continues to suck but forgets to swallow, so that milk pours out of his mouth until his clothes are soaked. I remember this kind of thing happening with M as a baby ... and yet voila, five years later, you have someone who is <a href="https://twitter.com/mcxfrank/status/1055857077707362305">writing text messages to grandma</a> and <a href="https://twitter.com/mcxfrank/status/1091120221803360261">illustrating new stories about Spiderman</a>. How could you possibly get from A to B (or in my case, from J to M)? The immensity of this transition is perhaps the single most important challenge for theories of child development.<br />
<br />
As a field, we have bounced back and forth between continuity and discontinuity theories to explain these changes. Continuity theories posit that infants' starting state is related to our end state, and that changes are gradual, not saltatory; discontinuity theories posit stage-like transitions. Behaviorist learning theory was fundamentally a continuity hypothesis – the same learning mechanisms (plus experience) underlie all of behavior, and change is gradual. In contrast, Piagetian stage theory was fundamentally about explaining behavioral discontinuities. As the pendulum swung, we got core knowledge theory, a continuity theory: innate foundations are "revised but not overthrown" (paraphrasing <a href="https://psycnet.apa.org/fulltext/1993-05134-001.html">Spelke et al. 1992</a>). Gopnik and Wellman's "<a href="http://alisongopnik.com/Papers_Alison/ChomskyFinal.pdf">Theory theory</a>" is a discontinuity theory: intuitive theories of domains like biology or causality are discovered like scientific theories. And so on.<br />
<br />
For what it's worth, my take on the "modern synthesis" in developmental psychology is that development is <a href="https://mitpress.mit.edu/books/beyond-modularity">domain-specific</a>. Domains of development – perception, language, social cognition, etc. – progress on their own timelines, determined by experience, maturation, and other constraining factors. And my best guess is that some domains develop continuously (especially motor and perceptual domains) while others, typically more "conceptual" ones, show more saltatory progress associated with stage changes. But – even though it would be really cool to be able to show this – I don't think we have the data to do so.<br />
<br />
<b>The problem is that we are not thinking about – or measuring – development appropriately.</b> As a result, what we end up with is a theoretical mush. We talk as though everything is discrete, but that's mostly a function of our measurement methods. Instead, everything is at rock bottom continuous, and the question is how steep the changes are.<br />
<br />
<b>We talk as though everything is discontinuous all the time. </b>The way we know how to describe development verbally is through what I call "milestone language." We discuss developmental transitions by (often helpful) age anchors, like "children say their first word around their first birthday," or "preschoolers pass the Sally-Ann task at around 3.5 years." When summarizing a study, we* assert that "by 7 months, babies can segment words from fluent speech," even if we know that this statement describes the fact that the mean performance of a group is significantly different from zero in a particular paradigm instantiating this ability, and even if we know that babies might show this behavior a month earlier if you tested enough of them! But it's a lot harder to say "early word production emerges gradually from 10–14 months (in most children)."<br />
<br />
Beyond practicalities, one reason we use milestone language is that our measurement methods are only set up to measure discontinuities. First, our methods have poor reliability: we typically don't learn very much about any one child, so we can't say conclusively whether they truly show some behavior or not. In addition, <a href="http://doi.org/10.1111/cdev.13079">most developmental studies are severely underpowered</a>, just like <a href="https://rdcu.be/bkajT">most studies in neuroscience and psychology in general</a>. So our estimates of a behavior, even for groups of children, are noisy. To get around this problem, we use null hypothesis significance tests – and when the result is p < .05, we declare that development has happened. But of course we will see discrete changes in development if we use a discrete statistical cutoff!<br />
<br />
And finally, we tend to stratify our samples into discrete age bins (which is a good way to get coverage), e.g. recruiting 3-month-olds, 5-month-olds, and 7-month-olds for a study. But then, we use these discrete samples as three separate analytic groups, ignoring the continuous developmental variation between them! This practice reduces statistical power substantially, <a href="https://www.jstor.org/stable/30038865">much like taking median splits on continuous variables</a> (taking a median split on average is like <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1458573/">throwing away a third of your sample</a>!). In sum, even in domains where development is continuous, our methods guarantee that we get binary outcomes. We don't try to estimate continuous functions, even when our data afford them.<br />
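To see the cost, here's a quick simulation sketch (all the numbers are invented): the same gradual developmental change, analyzed once with age as a continuous predictor and once with age cut into three discrete groups.
<pre>
# Power to detect continuous developmental change under two analyses.
set.seed(1)
sim_once <- function(n = 60) {
  age <- runif(n, 3, 7)                    # continuous ages
  y <- 0.3 * age + rnorm(n)                # gradual change plus noise
  p_cont <- summary(lm(y ~ age))$coefficients["age", 4]
  p_bin <- anova(lm(y ~ cut(age, 3)))[["Pr(>F)"]][1]
  c(continuous = p_cont < .05, binned = p_bin < .05)
}
rowMeans(replicate(2000, sim_once()))      # proportion of significant runs
</pre>
In simulations like this one, the binned analysis detects the (real, continuous) change noticeably less often than the continuous one.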
<div>
<br /></div>
<b>The truth is, when you scratch the surface in development, everything changes continuously.</b> Even the stuff that's not supposed to change still changes. I saw this in one of my very first studies, when I was a lab manager for <a href="https://www.psych.ucla.edu/faculty/page/scott.johnson">Scott Johnson</a> and we accidentally found ourselves <a href="http://langcog.stanford.edu/papers/FVJ-cognition.pdf">measuring 3- to 9-month-olds' face preferences</a>. Though I had learned from the literature that <a href="https://www.ncbi.nlm.nih.gov/pubmed/1786670">infants had an innate face bias</a>, I was surprised to find that the magnitude of face looking was changing dramatically across the range I was measuring. (Later we found that this change was <a href="http://langcog.stanford.edu/papers/FAJ-JECP2014.pdf">related to the development of other visual orienting skills</a>). Of course "it's not surprising" that some complex behavior goes up with development, says reviewer 3. But it is <i>important</i>, and the ways we talk about and analyze our data don't reflect the importance of quantifying continuous developmental change.<br />
<br />
One reason that it's not surprising to see developmental change is that everything that children do is at its heart a skill. Sucking and swallowing is a skill. Walking is a skill. Recognizing objects is a skill. Recognizing words is a skill too – <a href="https://doi.org/10.1016/j.cobeha.2018.04.001">so too is the rest of language</a>, at least according to some folks. Thinking about other people's thoughts is a skill. So that means that everything gets better with practice. It will – to a first approximation – follow a classic logistic curve like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkAX6hHKVBIjsoKGS1EymuynC92fIa-nwTFr_DE4hQ022F_UnEQeyXhutS6czwEXisy3CjahbRzVajFW_jSaHi08PVsdIC3DNRmTF3YI3hXBnyNvH1BgzG711XPUhsRKuxFDq4O48ZEBqS/s1600/download.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="961" data-original-width="1344" height="285" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkAX6hHKVBIjsoKGS1EymuynC92fIa-nwTFr_DE4hQ022F_UnEQeyXhutS6czwEXisy3CjahbRzVajFW_jSaHi08PVsdIC3DNRmTF3YI3hXBnyNvH1BgzG711XPUhsRKuxFDq4O48ZEBqS/s400/download.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
Most skills get better with practice, and the ones described above are no exception. But developmental progress also happens in the absence of practice, due to physiological maturation – older children's brains process information faster and more accurately, even for skills that haven't been practiced. So samples of this behavior should look like these red lines:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLgBbkHjWnDUJ-Xe7JCUHDfuRY3-xs1RmmFo1Ys38DHOstk55GhkGdk63wVVyvz3VYqjBayBHwzIexvxXYQ8zuriqCwQqyAS01_wkvlvjVt8NP0LyJm21GIBhalRiiyV8bMKSDYr36GjGS/s1600/download-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="961" data-original-width="1344" height="285" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLgBbkHjWnDUJ-Xe7JCUHDfuRY3-xs1RmmFo1Ys38DHOstk55GhkGdk63wVVyvz3VYqjBayBHwzIexvxXYQ8zuriqCwQqyAS01_wkvlvjVt8NP0LyJm21GIBhalRiiyV8bMKSDYr36GjGS/s400/download-1.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
But here's the problem. If you have a complex behavior, it's built of simple behaviors, which are themselves skills. To get the probability of success on one of those complex skills, you can – as a first approximation – multiply together the probabilities of success for each of the components (assuming they're independent). That process yields logistic curves that look like these (color indicating the number of components):<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqNdobfBBNXxRhjND6eNXlhQTXTHUsbEoMlMGZgdYyl6uSr6pFkSovbXh4b891KYsg_QTIuQOQXacegxOuEsufTuQ0cR4HZ3UOrkTaBZVyKWw7VFEyzILOv12cd7c_EiBIifudP0vJL97y/s1600/download-2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="961" data-original-width="1344" height="285" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqNdobfBBNXxRhjND6eNXlhQTXTHUsbEoMlMGZgdYyl6uSr6pFkSovbXh4b891KYsg_QTIuQOQXacegxOuEsufTuQ0cR4HZ3UOrkTaBZVyKWw7VFEyzILOv12cd7c_EiBIifudP0vJL97y/s400/download-2.png" width="400" /></a></div>
<br />
And samples from a process with many components look even more discrete, because the logistic is steeper!<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXIg6di7UVEtzKuKYmF_ZMI-PDemm8hQeAEj1P83CFQJjZeTiPYgnjKTSkVBzMA5u5DjXod8i3U_1YEx9jy9m_oF3H09MsbkG6ewMhSWLuX6TKACC5xVvpA8qnF-YcjlA66G35rpASrwIH/s1600/download-3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="961" data-original-width="1344" height="285" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXIg6di7UVEtzKuKYmF_ZMI-PDemm8hQeAEj1P83CFQJjZeTiPYgnjKTSkVBzMA5u5DjXod8i3U_1YEx9jy9m_oF3H09MsbkG6ewMhSWLuX6TKACC5xVvpA8qnF-YcjlA66G35rpASrwIH/s400/download-3.png" width="400" /></a></div>
<br />
<br />
Given this kind of perspective, we should expect complex behaviors to emerge relatively suddenly, even if they are simply the product of a handful of continuously changing processes.<br />
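Here's a small sketch of the simulation behind these intuitions (my illustration, with arbitrary parameters – nothing here is fit to data):
<pre>
# Each component skill improves logistically with age; a complex behavior
# succeeds only if all of its (independent) components succeed.
logistic <- function(age, mid = 2.5, slope = 2) {
  1 / (1 + exp(-slope * (age - mid)))
}
ages <- seq(0, 5, by = 0.01)
plot(ages, logistic(ages), type = "l", ylim = c(0, 1),
     xlab = "age (years)", ylab = "p(success)")
for (k in 2:5) {
  lines(ages, logistic(ages)^k, col = k)   # k components: multiply
}
</pre>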
<br />
This means, from a theoretical standpoint, we need stronger baselines. Our typical baseline at the moment is the null hypothesis of no difference; but that's a terrible baseline! Instead, we need to be comparing to a null hypothesis of "developmental business as usual." To show discontinuity, we need to take into account the continuous changes that a particular behavior will inevitably be undergoing. And then, we need to argue that the <i>rate</i> of developmental change that a particular process is undergoing is faster than we should expect based on simple learning of that skill. Of course to make these kinds of inferences requires far more data about individuals than we usually gather.<br />
<div>
<br /></div>
<div>
In a conference paper that I'm still quite proud of, <a href="http://langcog.stanford.edu/papers_new/frank-2016-cogsci.pdf">we tried to create this sort of baseline for early word learning</a>. Arguably, early word learning is a domain where there likely aren't huge, discontinuous changes – instead kids gradually get faster and more accurate in learning new words until they are learning several new words per day. We used meta-analysis to estimate developmental increases in two component processes of novel word mapping: auditory word recognition and social cue following. Both of these got faster and more accurate over the first couple of years. When we put these increases together, we found that together they created really substantial changes in how much input would be needed to map a new word. (Of course what we haven't done in the three years since we wrote that paper is actually measure the parameters on the process of word mapping developmentally – maybe that's for a subsequent ManyBabies study...). Overall, this baseline suggests that even in the absence of discontinuity, continuous changes in many small processes can produce dramatic developmental differences.<br />
<br />
In sum: sometimes developmental psychologists don't take the process of developmental change seriously enough. To do better, we need to start analyzing change continuously; measuring with sufficient precision to estimate rates of change; and creating better continuous baselines before we make claims about discrete change or emergence. </div>
<br />
---<br />
* I definitely do this too!<br />
<br />
<h4 style="text-align: left;">How to run a study that doesn't replicate, experimental design edition</h4>
<i>(tl;dr: Design features of psychology studies to avoid if you want to run a good study!)</i><br />
<br />
Imagine reading about a psychology experiment in which participants are randomly assigned to one of two different short state inductions (say by writing a passage or unscrambling sentences), and then outcomes are measured via a question about an experimental vignette. The whole thing takes place in about 10 minutes and is administered through a survey, perhaps via <a href="https://www.qualtrics.com/">Qualtrics</a>.<br />
<br />
The argument of this post is that <u>this experiment has a low probability of replicating</u>, and we can make that judgment purely from the experimental methods – regardless of the construct being measured, the content of the state induction, or the judgment that is elicited. Here's why I think so.<br />
<div>
<br /></div>
Friday was the last day of my graduate class in experimental methods. The centerpiece of the course is a replication project in which each student collects data on a new instantiation of a published experiment. I love teaching this course and have <a href="http://babieslearninglanguage.blogspot.com/2015/03/estimating-preplication-in-practical.html">blogged before about outcomes from it</a>. I've also written several journal articles about student replication in this model (<a href="http://langcog.stanford.edu/papers/FS-POPS2012.pdf">Frank & Saxe, 2012</a>; <a href="https://osf.io/8t2x5/">Hawkins*, Smith*, et al., 2018</a>). In brief, I think this is a really fun way for students to learn about experimental design and data analysis, open science methods, and the importance of replication in psychology. Further, the projects in my course are generally pretty high quality: they are pre-registered confirmatory tests with decent statistical power, and both the paradigm and the data analysis go through multiple rounds of review by the TAs and me (and sometimes also get feedback from the original authors).<br />
<br />
Every year I rate each student project on its replication outcomes. The scale is from 0 to 1, with intermediate values indicating unclear results or partial patterns of replication (e.g., significant key test but different qualitative interpretation). The outcomes from the student projects this year were very disappointing. With 16/19 student projects finished, we have an average replication rate of .31. There were only 4 clear successes, 2 intermediate results, and 10 failures. Samples are small every year, but this rate was even lower than we saw in previous samples (2014–15: .57, N=38) and another one-year sample (2016: .55, N=11).<br />
<br />
What happened? Many of the original experiments followed part or all of the schema described above, with a state induction followed by a question about a vignette. <u>In other words, they were poorly designed.</u><br />
<br />
<a name='more'></a><br />
There's now a <a href="https://www.theatlantic.com/science/archive/2018/08/scientists-can-collectively-sense-which-psychology-studies-are-weak/568630/">strong meta-scientific literature</a> suggesting that prediction markets can accurately guess which studies will not replicate. Some of this effect is likely due to general plausibility of study results – the general correlation of prior and posterior probabilities of effects. There are also general statistical predictors of failures to replicate – small samples, small effect sizes, and p-values relatively close to the .05 boundary. Over the past 5-6 years, the community has received a real education about these issues. In my class, we try to spot effects with these sorts of issues and sometimes now ask students not to select projects with statistical red flags. Further, within the constraints of our class budget (which is limited), we try to recruit decent sample sizes.*<br />
<br />
This year, however, I think experimental design was the culprit for many of our failed replications. Further, I suspect that many of the prediction markets are picking up on problematic design features as well as the statistical issues mentioned above. Here are the experimental design features that appear – both in my experience and, in some cases, in the broader literature – to be related to replication failure. These "negative features" shape my defaults about how to design a study.<br />
<br />
<b>Single-question DVs. </b>Psychological measurements are noisy. If you have high noise, you will have low signal to detect the effect of even a strong manipulation. One way to reduce noise is to measure many times and combine those measurements. Papers that fail to take advantage of this strategy dramatically reduce their ability to find effects of their manipulation. Yet it is striking how many of the findings we look at have a single "key question" that is supposed to detect their manipulation. From an <a href="https://en.wikipedia.org/wiki/Item_response_theory">item response theory</a> perspective, even if you found the perfect item (optimal discrimination) for a particular population, that item is still likely to be suboptimal and yield under-informative estimates about other populations. This means that your design is unlikely to be replicable in a different context, just because your item isn't designed to measure people in that context.<br />
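The payoff of combining measurements is easy to see with the Spearman-Brown prophecy formula. Here's a sketch, assuming a made-up single-item reliability of .3:
<pre>
# Reliability of a k-item composite given the reliability of a single item.
spearman_brown <- function(k, r_item) {
  k * r_item / (1 + (k - 1) * r_item)
}
round(sapply(c(1, 4, 8, 16), spearman_brown, r_item = .3), 2)
# 0.30 0.63 0.77 0.87 -- one question leaves most of the variance as noise
</pre>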
<div>
<br /></div>
<div>
<b>Single-item manipulations. </b>The counterpart to single-question DVs is single-item manipulations, e.g. instantiations of a particular theoretical contrast in a particular experimental vignette or stimulus. Even if an effect induced via a particular item is replicable, it is likely not easily generalizable to a larger population of experimental items (as has been noted since <a href="https://web.stanford.edu/~clark/1970s/Clark,%20H.H.%20_Language%20as%20fixed%20effect%20fallacy_%201973.pdf">Clark, 1973</a>). But in addition, if you have only a single stimulus of interest, the chance of variation in response to this stimulus – due to sample differences including demographic variation or overall cohort change – is very high; this is exactly the same point as is made above about the DV, now made about the IV. Further, there is a substantial threat to internal validity if this stimulus is used by any other psychologists (as frequently happens with popular tasks – <a href="https://doi.org/10.1038/ncomms1442">e.g., the prisoner's dilemma</a>).</div>
<div>
<b><br /></b></div>
<b>Between-subjects designs. </b>Variation between people is a huge source of the total variation in psychological measurements. By subtracting out this variance, within-subjects designs dramatically decrease the variance in the measurement of some manipulation. As a result, between-subjects effects tend to replicate less (unless their original samples were really huge). This effect shows up in the original <a href="http://science.sciencemag.org/content/349/6251/aac4716">OSC 2015</a> replication sample, and it also shows up in our previous class sample. In our 16-project sample so far this year, the replication rate for the between-subjects experiments was .21 (2.5 successes out of 12), vs. .625 (2.5 out of 4) for the within-subjects ones.<br />
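Here's a toy simulation of why (all numbers invented): stable person-to-person differences cancel out of a within-subjects difference score, but stay in the error term of a between-subjects comparison.
<pre>
# Same effect, same n, measured between- vs. within-subjects.
set.seed(1)
n <- 30
effect <- 0.5
trait <- rnorm(n, sd = 2)    # large stable differences between people

# within-subjects: each person does both conditions, so the trait
# term subtracts out of the difference score
diffs <- (trait + effect + rnorm(n)) - (trait + rnorm(n))
t.test(diffs)

# between-subjects: different people per condition, so the trait
# variance stays in the noise
g1 <- rnorm(n, sd = 2) + effect + rnorm(n)
g2 <- rnorm(n, sd = 2) + rnorm(n)
t.test(g1, g2)
</pre>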
<br />
<b>Short state induction manipulations.</b> It's hard to change people's state in a significant way during a very short experiment, at least given the tools available to ethical psychologists. If you want to make someone feel powerful, or greedy, or afraid, or anxious, there's only so much you can do by showing them images on a computer screen, making them read words, or making them reflect on their experience by writing a short paragraph. And if you make even a moderate change to someone's state, they are extremely likely to reflect on this experience in the context of the experiment in some very substantial ways (see Task demands, below). It's hard, but probably not impossible, to do these kinds of manipulations right; there are likely manipulations of this type that can and do work.** But think about the counterfactual world where experimenters really could push people's feelings around quite flexibly and easily – we'd be constantly bent to our environment, pushed one way or the other by the precise stimuli we came into contact with, with the attendant policy implications (Hal Pashler and Andrew Gelman have both made this point previously in several different ways).<br />
<br />
<b>Task demands. </b>When I was an undergraduate, my girlfriend – now my wife – and I used to walk over to the business school and do experimental studies for fun (they paid better than psychology). After we were done, we'd walk out and compare notes on what the point of each study was, as well as what condition we thought we were in. MTurk workers are just the same – probably better, because many of them have done more studies. Participants will be thinking about what your study is about, and reacting based on some complex combination of that guess (correct or not) and their desired self-presentation and feelings about that goal. It is remarkable how many studies do not consider this issue. Two-stage studies like the one I described at the beginning are extremely vulnerable to this kind of reasoning: if your survey consists only of a state induction and a vignette, it is a guarantee that people will read the two together and then think about the connection. Hmm, I wonder what my feeling of powerlessness has to do with my reading about moral judgements? I wonder what reading a news article about the environment has to do with my judgements about future planning? This kind of design (especially without a good cover story) is a recipe for including participants' interpretive thinking in your pattern of results. Yet most of these paradigms do not even include strategies like a funnel debrief to detect such issues.***<br />
<br />
<b>No manipulation checks. </b>Manipulation checks are tricky in state-induction experiments. Because they often directly refer to the construct of interest ("how powerful do you feel?") they can increase task demands and explicit reasoning. They are also often single items themselves, and aren't necessarily psychometrically valid measures of the precise construct of interest. That said: without a manipulation check, if your experiment fails in the type of design we're considering, there is typically no signal for understanding what went wrong. In classic perception, memory, and learning experiments there are usually correct answers, allowing the experimenter to think about whether participants understood the task and were at floor or at ceiling in their performance. In contrast, in judgement studies of the type I'm writing about here, there is not typically any calibration of the measurement. In many experiments without manipulation checks, there is no signal (beyond a difference on the key DV) that allows experimenters (or readers) to verify that the participants understood the materials and were affected by the manipulation.<br />
<br />
<div>
A subtitle to this post could well be "revenge of the psychometricians." (<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2779444/">They already attacked us once</a>). Many of the problematic practices I see come down to poor measurement: single items for the DV measure, single items for the IV manipulation, lack of within-subjects design. All of these are places where experimenters can reduce measurement variation in easy ways. It is not that experiments like the one I've described here are impossible to do right, or that they <i>never</i> replicate. (<a href="https://osf.io/wx7ck/">ManyLabs 1</a> and <a href="https://osf.io/8cd4r/">ManyLabs 2</a> each included both replicable and non-replicable examples of such experiments.) It's that there are so many lost opportunities to do better. </div>
<div>
<br /></div>
<div>
---</div>
<div>
* We probably don't have the power to detect small effects in the cases where the authors initially reported large ones, however.</div>
<div>
** Some good ones likely take advantage of apparent task demands to cause deeper reasoning about the state induction. </div>
<div>
*** Surprisingly I couldn't find a good description of this strategy online. In brief, ask successively more specific questions to try to elicit how much participants knew about the manipulation, e.g. "what did you think this experiment was about? what did you think about the other person in the experiment? did you notice anything odd about him? did you know he was a confederate?"<br />
<br />
[Correction: w/in subjects designs <i>decrease</i> variance, thanks Yoel Sanchez-Araujo]</div>
<h4 style="text-align: left;">Scale construction, continued</h4>
For psychometrics fans: I helped out with a post by Brent Roberts, "<a href="https://pigee.wordpress.com/2018/09/07/yes-or-no-2-0-are-likert-scales-always-preferable-to-dichotomous-rating-scales/">Yes or No 2.0: Are Likert scales always preferable to dichotomous rating scales?</a>" This post is a continuation of <a href="http://babieslearninglanguage.blogspot.com/2015/11/a-conversation-on-scale-construction.html">our earlier conversation on scale construction</a> and continues to examine the question of whether – and if so, when – it's appropriate to use a Likert scale vs. a dichotomous scale. Spoiler: in some circumstances it's totally safe, while in others it is a disaster!<br />
<br />
<h4 style="text-align: left;">Three (different) questions about development</h4>
<i>(tl;dr: Some questions I'm thinking about, inspired by the idea of studying the broad structure of child development through larger-scale datasets.)</i><br />
<br />
My daughter, M, started kindergarten this month. I began this blog when I was on paternity leave after she was born; watching her grow over these past five years has been an adventure, and a revolution for my understanding of development.* Perhaps the most astonishing feature of the experience is how continuous, incremental changes lead to what seem like qualitative revolutions. There is of course no moment in which she became the sort of person she is now: the kind of person who can tell a story about an adventure in which two imaginary characters encounter one another for the first time,** but some set of processes led us to this point. How do you uncover the psychological factors that contribute to this kind of growth and change?<br />
<br />
My lab does two kinds of research. In both my hope is to contribute to this kind of understanding by studying the development of cognition and language in early childhood. The first kind of work we do is to conduct series of experiments with adults and children, usually aimed at getting answers to questions about representation and mechanism in early language learning in social contexts. The second kind of work is a larger-scale type of resource-building, where we create datasets and accompanying tools like <a href="http://wordbank.stanford.edu/">Wordbank</a>, <a href="http://metalab.stanford.edu/">MetaLab</a>, and <a href="http://childes-db.stanford.edu/">childes-db</a>. The goal of this work is to make larger datasets accessible for analysis – as testbeds for reproducibility and theory-building.<br />
<br />
Each of these activities connects to the project of understanding development at the scale of an entire person's growth and change. In the case of small-scale language learning experiments, the inference strategy is pretty standard. We hypothesize the operation of some mechanism or the utility of some information source in a particular learning problem (say, <a href="http://langcog.stanford.edu/papers/FG-cogpsych2014.pdf">the utility of pragmatic inference in word learning</a>). Then we carry out a series of experiments that shows a proof of concept that children can use the hypothesized mechanism to learn something in a lab situation, along with control studies that rule out other possibilities. When done well, these studies can give you pretty good traction on individual learning mechanisms. But they can't tell you that these mechanisms are used by children consistently (or even at all) in their actual language learning.<br />
<br />
In contrast, when we work with large-scale datasets, we get a whole-child picture that isn't available in the small studies. In our Wordbank work, for example, we get a global picture of the child's vocabulary and linguistic abilities, for many children across many languages. The trouble is, it's very hard or even impossible to find answers to smaller-scale questions (say, about <a href="http://langcog.stanford.edu/papers_new/macdonald-2018-cogsci.pdf">information seeking from social partners</a>) in datasets that represent global snapshots of children's experience or outcomes. Both methods – the large-scale and the small-scale – are great. The problem is that the questions don't necessarily line up. Instead, larger datasets tend to direct you towards different questions. Here are three.<br />
<a name='more'></a><br />
<b>1. How do you connect small mechanisms to big changes?</b><br />
<br />
An individual child's vocabulary is made up of hundreds or thousands of individual words, each of which has its own natural history – how and when it was learned, what information was used, what inferences were made. For example, M figured out that "parchment" is a kind of paper because Harry Potter was always writing on it. But this is true for any other piece of knowledge (or for that matter, any other skill) as well – it has its own learning history that is contributed to in different ways and to different extents by particular processes and experiences. These individual contributions are typically the object of study for small-scale experimental studies, but in larger-scale observations we only see the result of these – the accreted strata of experience as fossilized by learning.<br />
<br />
The problem is that paleontology in this situation isn't straightforward. We don't have a good sense of what it would look like if words – or for that matter, any other kind of skill or knowledge – were learned exclusively via a particular route. The best work of this type that I know about is a slightly esoteric but cool line of computational investigations of word learning (<a href="https://www.ncbi.nlm.nih.gov/pubmed/21564227">example 1</a>, <a href="https://onlinelibrary.wiley.com/doi/full/10.1111/j.1551-6709.2009.01071.x">example 2</a>) that ask what vocabularies look like – in terms of their growth, composition, and learning times – under different assumptions about the mechanisms in operation.<br />
<br />
Relatively little work has tried to connect this kind of theorizing to empirical datasets, however. In <a href="https://psyarxiv.com/cg6ah/">one very recent preprint</a> we've tried to take a first step in this direction by asking about the effects of different predictors on the composition of children's early vocabulary (e.g., does word frequency in the input predict which words are learned earlier, or does conceptual concreteness predict better?). But lots of work is still needed to connect actual mechanistic proposals about in-the-moment learning mechanisms to larger-scale datasets that characterize what children's knowledge looks like.<br />
<br />
Even if you have proposals about learning mechanisms, how do you verify that they add up to the kind of child you see in the aggregate measures?<br />
<br />
<b>2. Does development mostly hang together or is it many different things?</b><br />
<br />
Piaget's developmental theorizing offered at least two things. The first is an account of how knowledge grows and changes – the relationship between assimilation and accommodation. This account feels very modern to me, as I wrote about a while back ("<a href="http://babieslearninglanguage.blogspot.com/2016/04/was-piaget-bayesian.html">Was Piaget a Bayesian?</a>"). The other part of the story was an elaborate theory about global, stage-based transitions in children's development. This second part, the stage theory – while on the whole still taught and tested in textbooks of developmental psychology – has fallen into disrepute in terms of its empirical validity. My favorite critique is <a href="http://internal.psychology.illinois.edu/infantlab/articles/gelman_baillargeon_1983.pdf">Gelman & Baillargeon (1983)</a>. But the particular stages posited by Piaget don't need to be right for us to consider the factor structure of development more broadly.<br />
<br />
Another way of looking at this. My grandmother (who worked as a research assistant at the Yale Child Studies Center in the 50s and 60s) apparently used to say that kids "either walk or talk," meaning that they would achieve one milestone or the other first. This is a multi-factorial view of development, in which language and locomotor development are two different capacities that are in fact anti-correlated.*** Actually it seems like walking and vocabulary growth are <a href="https://www.ncbi.nlm.nih.gov/pubmed/23750505">positively correlated</a>. This is a small case study, but it raises the question of how the different features of global developmental progress relate to one another.<br />
<br />
Intelligence is defined psychometrically via little <i>g</i>, the first factor in a factor analysis of many tests of cognitive ability. The empirical regularity is that <i>g</i> usually accounts for a substantial amount of variance across cognitive tasks – though that <a href="http://bactra.org/weblog/523.html">doesn't necessarily mean it's a unitary construct</a>. One analogous question you could ask is about development in early childhood. <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3412562/">Early language hangs together astonishingly well</a>, but does early language relate to motor development, for example? There are some reviews that <a href="https://www.cambridge.org/core/journals/journal-of-child-language/article/developing-language-in-a-developing-body-the-relationship-between-motor-development-and-language-development/88F68BD4D8F3524F5FAD9387A29C0FE8">argue that it does</a>, but I'm not aware of a comprehensive analysis of children's trajectories through both that dissociates shared variation due to age.<br />
<br />
More generally, is there a little <i>d</i>, that – beyond age – explains global developmental advancement or delay? Statistically, there must be, but how much of the variance does it explain, and what capacities are most tightly related to one another?<br />
<br />
<b>3. What's variable and what's consistent?</b><br />
<br />
Finally, how universal are developmental trajectories, across children and across cultures? Imagine having some arbitrary estimate of locomotor development that assigned a number on some (hypothetically) reliable and valid scale. We could ask about the variance of this measure for a particular age group, but that would be largely meaningless without any units or comparison. But by comparing that variation to developmental variation, we can reason about how consistent individuals' development is. This variation is argued to be small for <a href="http://www.pnas.org/content/77/9/5572.short">stereoacuity of depth perception</a>, for example, while it is <a href="https://www.ncbi.nlm.nih.gov/pubmed/7845413">much larger for vocabulary</a>.<br />
<br />
Neither of these cases makes apples-to-apples comparisons, however. To be precise, units of variation would have to be defined in terms of the ratio of individual variance to developmental variance (as a function of either absolute age or percentage age). Using this approach, you could begin to ask: is variation across individuals larger for particular aspects of development than others? Or is variability itself standard across developmental phenomena?<br />
<br />
One further addition is the application of these ideas to cultural variance in skills. Once we have comparable units for a particular skill, we can ask about the relative variability across individuals vs. variability across cultures. What proportion of total variance is due to cultural variability vs. idiosyncrasies of individuals' development? This variance-partitioning approach is in some sense a statistical answer to old questions about universals and variation in language development (and in other domains).<br />
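Statistically, this is a variance-components question. Here's a sketch of how the analysis might look, assuming a hypothetical dataset with repeated measurements of some skill per child and children nested in cultures (all the names here are invented):
<pre>
# Partition variance in a skill measure into culture-level and child-level
# components, after removing developmental (age) variance.
library(lme4)
m <- lmer(score ~ age + (1 | culture) + (1 | child), data = d)
# (assumes child ids are unique across cultures)
vc <- as.data.frame(VarCorr(m))
# proportion of the remaining variance at each level
setNames(vc$vcov / sum(vc$vcov), vc$grp)
</pre>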
<br />
<b>Conclusions</b><br />
<br />
Bigger datasets shouldn't lead us to abandon our questions. Nor should they lead us to forget basic statistical facts – e.g., the problematic nature of correlational studies or inferences from convenience samples. But in pursuing the kinds of answers they can give, they sometimes lead back in interesting ways to prior theoretical developments; some of these feel almost forgotten in our current emphasis on small-scale, tightly controlled experiments.<br />
<br />
---<br />
* That's just on the professional side. Being a parent has changed me profoundly as a person – I hope for the better.<br />
** Harry Potter, of course, and Hiccup from How To Train Your Dragon.<br />
*** I don't know if she'd endorse this view more generally – she passed away before I was born, and this anecdote is related by my dad.<br />
<br />
<h4 style="text-align: left;">Where does logical language come from? The social bootstrapping hypothesis</h4>
<i>(Musings on the origins of logical language, inspired by work done in my lab by <a href="https://www.snhu.edu/student-experience/campus-experience/campus-academics/faculty/psychology">Ann Nordmeyer</a>, <a href="https://web.stanford.edu/~masoudj/">Masoud Jasbi</a>, and others.)</i><br />
<br />
For the last couple of years I've been part of a group of researchers who are interested in where logic comes from. While formal <a href="https://en.wikipedia.org/wiki/Boolean_algebra">boolean logic</a> is a human discovery*, all human languages appear to have methods for making logical statements. We can negate a statement ("No, I didn't eat your dessert while you were away"), quantify ("I ate all of the cookies"), and express conditionals ("if you finish early, you can join me outside."). While boolean logic doesn't offer a good description of these connectives, natural language still has some logical properties. How does this come about? Because I study word learning, I like to think about logic and logical language as a word learning problem. What is the initial meaning that "no" gets mapped to? What about "and", "or", or "if"?<br />
<br />
Perhaps logical connectives are learned just like other words. When we're talking about object words like "ball" or "dog," a common hypothesis is that children have object categories as the possible meanings of nouns. These object categories are given to the child by perception*** in some form or other. Then, kids hear their parents refer to individual objects ("look! a dog! [POINTS TO DOG]"). The point allows the determination of reference; the referent is identified as an instance of a category, and – modulo some <a href="http://psycnet.apa.org/record/2007-05396-002">generalization</a> and <a href="https://www.sciencedirect.com/science/article/pii/S0010027715300391">statistical inference</a> – the word is learned, more or less.****<br />
<br />
So how does this process work for logical language? There are plenty of linguistic complexities for the learner to deal with: Most logical words simply don't make sense on their own. You can't just turn to your friend and say "or" (at least not without a lot of extra context). So any inference that a child makes about the meaning of the word will have to involve disentangling that from the meaning of the sentence as a whole. But beyond that, what are the potential targets for the meaning of these words? There's nothing you can point to out in the world that is an "if," an "and," or even a "no."<br />
<br />
<a name='more'></a><br />
For many folks this boils down to a classic <a href="https://en.wikipedia.org/wiki/Poverty_of_the_stimulus">argument from the poverty of the stimulus</a>: there must be some innate logical concepts that underlie the ability to acquire logical language. Let's call this idea "logical nativism." These innate logical concepts need not look like boolean primitives, but they should at least form some kind of basis for inducing a more complex semantics and making lexical mappings. To the extent that you can find <a href="http://science.sciencemag.org/content/359/6381/1263">evidence for logical reasoning in infants before they can talk</a>*****, this would constitute evidence for the logical nativist perspective.<br />
<br />
Others would deny this kind of innate structure. There are lots of reasons to be skeptical of strong nativist claims, whether because you think logic isn't the kind of thing that brains represent innately or because you believe such structures could be learned from input (relatedly, here's my take on "<a href="http://babieslearninglanguage.blogspot.com/2016/07/minimal-nativism.html">minimal nativism</a>."). But if you make this sort of claim, then you are responsible for characterizing how children come to learn these words and use them correctly. Even if you skirt around <a href="https://plato.stanford.edu/entries/language-thought/#NatLOT">Fodor's problem</a> by assuming that children have access to a space of concepts expressive enough to discover these logical operators, you still might want to ask how they do so.<br />
<br />
One possible learning theory is that children build the logical operators directly (<a href="https://colala.bcs.rochester.edu/papers/piantadosi2016representation.pdf">perhaps through some kind of probabilistic induction</a>). But I want to sketch the beginnings of a different acquisition theory here. On this theory – let's call it the <i>social bootstrapping</i> hypothesis – children begin by mapping logical words to speech acts with specifically social functions like rejection, offer, or threat. They then gradually generalize the broader logical functions of these words by noticing similarities between social uses of the words and other more abstract uses.<br />
<br />
This post is a way of writing down my own speculations, and is not fully worked out. Probably someone has said something like this before – perhaps Liz Bates or Lois Bloom – I'm not sure, and that's why this is a blog post rather than a paper. That said, here are a couple of examples.<br />
<br />
<b>Negation</b><br />
<b><br /></b>
"No" is often <a href="http://langcog.stanford.edu/papers/SYF-cogsci2015.pdf">one of children's very first words</a>. (In some unpublished data, we even saw that this was especially true for second children – presumably they were saying to their sibling "don't DO that!") Consistent with this idea, early negation has been glossed as having the meaning "rejection" – something like "I don't want that" (lit review and up to date coding in <a href="http://mindmodeling.org/cogsci2018/papers/0416/0416.pdf">this recent paper</a> by Ann Nordmeyer and me). Some other early negations are used for nonexistence ("no cookies") which is a bit different, both syntactically – functioning as a determiner – and semantically. But it's been claimed that you see less early use of negation as what have been called "denials," where a proposition is being negated and the intended meaning is "it is not true that X."<br />
<br />
Ann's study suggests that it's true you don't see these early propositional denials as often, but she did find denials from some children – often during book reading, where parents would ask polar questions like "is that a dog [pointing to a bird]?" and children would say "no!" It seemed like, while these utterances were technically logical denials, they were more straightforwardly denying a name rather than a proposition. Further, they seemed like they made sense in those contexts and were being uttered by pretty young children.<br />
<br />
More broadly: I wonder if the relevant target for initial mapping of "no" is essentially the <i>social act</i> of rejection – the head shake when a new food is offered, meaning "don't put that in my mouth." Then once this initial mapping is made, from a very salient and present social impulse (parents rejecting kids' behavior <i>and</i> kids rejecting parents' behavior), this meaning can be generalized to other cases. In particular, the trajectory from "no! don't do that" to "no! don't (you) say that" to "no! don't (you) think that" to "no! not true!" doesn't feel too implausible to me. This would especially be an easy conflation to make under a pre-theory of mind, naïve-realist viewpoint in which what I think is what you think is what is true of the world. It would also explain why the early denials that Ann saw were possible – they're very transparently instances of "rejection of a name" even though they look like "denial of a proposition" on the earlier analysis.<br />
<br />
One much-discussed example of early negation is the utterance "no mummy do it" (<a href="https://www.researchgate.net/profile/Ken_Drozd/publication/14412544_Child_English_pre-sentential_negation_as_metalinguistic_exclamatory_sentence_negation/links/574ca30a08ae061b3301d1d8/Child-English-pre-sentential-negation-as-metalinguistic-exclamatory-sentence-negation.pdf">see Drozd, 1995</a>), which means something like "I don't want mummy to do it." Drozd then presents the utterance "no Nathaniel a king," (Nathaniel is the kid here, who's speaking) which alternatively means something like "I don't want you to say that Nathaniel's [I am] a king" or "Nathaniel's not a king." You see how there is a pretty small step from <i>rejecting</i> <i>an action</i> to <i>rejecting a proposition</i>.<br />
<br />
Related to this bootstrapping account is the <a href="https://web.stanford.edu/~cgpotts/papers/potts-salt20-negation.pdf">persistent negativity of negation</a> – in corpora, negative terms carry negative valence. To be fair, the account given in that paper notes that these effects may be pragmatic in nature. But the paper did lead me to a hypothesis related to my social bootstrapping idea, namely that negation is "learned early on with the association of 'unpleasant feelings'" (<a href="https://www.scribd.com/document/177136824/1-A-Natural-History-of-Negation-Laurence-R-Horn-pdf">from Bertrand Russell</a> originally). I think that's probably right, although I'm arguing that the negativity of negation is not a direct <i>affective</i> mapping; it's instead a mapping to the <i>social negativity</i> of rejection.<br />
<br />
<b>Disjunction</b><br />
<div>
<br /></div>
<div>
In contrast to "no," "or" is a bit of a mess in acquisition. <a href="http://langcog.stanford.edu/papers_new/jasbi-2018-cogsci.pdf">Children <i>say</i> "or" pretty early</a>, but who knows what they mean? One big issue is that they hear disjunctions that seem to mean logical OR ("[waiter:] you can order dinner or drinks" - true if one is true, the other is true, or both), but they also hear some that appear to be XOR ("[waiter:] you can have dessert or the check" - true if one is true, or the other is true, but NOT both). What could be the target for mapping for this word?<br />
<br />
Well, one part of the puzzle comes from <a href="http://langcog.stanford.edu/papers_new/jasbi-2018-cogsci.pdf">Masoud Jasbi's paper</a>, which is that these different uses have different prosody: the second one has a more distinctive rise/fall/rise pattern than the first. (Also, typically the disjuncts are logically inconsistent in XOR cases.) But there's a more general issue: how do you even think of OR and XOR as possible meanings?<br />
<br />
Again, my suggestion is that the initial target is a social meaning: <i>offer</i>. Under this story, "X or Y" as a construction initially means "offer." This probably comes up especially in the context of food offers. The exclusivity of this offer (can you take both, or only one?) is then a secondary concern that can be worked out from context. But again, you can see the progression from "would you like carrots or string cheese" -> offer(X,Y) to "is john home or at school" -> offer(john at home, john at school). The key step is, again, the move from <i>offering an action</i> to <i>offering a proposition</i>.<br />
<i><br /></i>
Furthermore, as Masoud's dissertation uncovers, there are a host of other meanings for "or" that don't fit well at all with the basic boolean OR vs. XOR idea. For example, "I'm a wine-lover, or oenophile" (definitional disjunction) doesn't fit. And we constantly correct ourselves using disjunction, e.g. "I think it's in the closet. [observes that's not the case] Or under the piano." These broader meanings feel like they might be different classes of social meanings that map onto the lexical item in specific pragmatic and prosodic frames.<br />
<i><br /></i>
<b>Implication (and Conclusion)</b></div>
<br />
Before I wrap up I just want to mention "if," where I think there is a possible story. <i>Threat </i>seems like a clear candidate as a target for mapping. "If you dump that out, you won't get any more" feels to me like a prototypical example of a child-directed utterance where the causal interpretation could eventually get generalized into <a href="https://plato.stanford.edu/entries/logic-conditionals/">whatever your semantics is for conditionals</a>. Note here that in this case again there's a reversal. The Gricean pragmatics that is assumed on conventional accounts to be <i>built out of </i>a logical semantics actually becomes on this account the place where acquisition starts! So rather than causality being an implicature from the conditional, it's actually the starting point for mapping and generalization. I don't have data on this, but I'd be interested in investigating...<br />
<br />
Hopefully, in this post, I've planted the idea that social meanings could be the roots of logical word learning. There are of course many obstacles to realizing this kind of account – first of all, specifying the relationship between the different semantic entities that can be acted on (from objects to actions to propositions). Further, it's not as clear how this would work for "and" or quantifiers like "some." But as I observe children's interactions and think about the way their pragmatic competence supports word learning, this is the sort of constructivist account that feels like the most plausible response to logical nativism.<br />
<br />
---<br />
* Or invention. I won't get all philosophy of math on this right now.<br />
** Of course, the logic of natural language is contaminated constantly with pragmatic inference – <a href="http://langcog.stanford.edu/papers_new/goodman-2016-tics.pdf">that's what I spend most of my time studying</a>.<br />
*** We'll ignore here both <a href="https://www.sciencedirect.com/science/article/pii/S001002858571016X">reciprocal effects of language on category formation</a> <i>and</i> <a href="https://www.nature.com/articles/nn1199_1019">the difficulty of object recognition</a>.<br />
**** By "more or less" here I mean this is actually a major topic of study for a whole subfield. So there is a lot to learn. But at a high level <a href="https://web.stanford.edu/~masoudj/">this kind of social learning view is not terrible</a>.<br />
***** I have some criticisms of the inferences from this paper, but the experimental designs are extremely clever.<br />
<br />
<i>(Thanks very much to Chris Potts for helpful comments). </i>Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com3tag:blogger.com,1999:blog-4297242917419089261.post-77964637038020544372018-06-18T10:56:00.001-07:002018-06-18T10:57:08.126-07:00What does it mean to get a degree in psychology these days? <i>(I was asked to give a speech yesterday at Stanford's Psychology commencement ceremony. Here is the text). </i><br />
<i><br /></i>
1. Chair, Colleagues, graduates of the class of 2018 – undergraduates and graduate students – family members, and friends. It’s a pleasure to be here today with all of you. Along with honoring our graduates, we especially honor all the wonderful speakers today for their accomplishments – MH for his excellence in research and teaching, Angela for her deep engagement with the department community. You could be forgiven for thinking that there was some special achievement that brought me here as well. In fact, by tradition, faculty take turns addressing the graduating class and it is my turn this year. It’s a real pleasure to have one last chance to address you.<br />
<br />
Two weeks ago, my daughter Madeline graduated from preschool. There was cake; photos were taken. They broke a piñata. It was a big deal! Several of her friends will be going to different schools, some moving away to other states or even other countries. This is one of the biggest changes she’s ever experienced. I’m already worried about what happens next. Parents, I can only imagine what you are going through today – but at least you know that your kids made it through the first day of kindergarten.<br />
<br />
Graduates - Your graduation from Stanford today is a really big deal. You also get to have cake and photos. If you’re very lucky, some special person has even bought you a piñata. But more importantly, just like for Madeline this is a time of transitions. You may be moving somewhere new. Even if you are staying here, friends will be further away than the next dorm or the next office. So do not hesitate to take a little extra time today to celebrate with the people you love and who love you.<br />
<br />
Congratulations.<br />
<br />
2. I want to take a little time now to think about what it means to get a degree in psychology from Stanford.<br />
<br />
When you sit next to someone on an airplane and tell them you are studying psychology, perhaps they ask you if you are reading their mind. Perhaps they wonder if you are studying Freudian analysis and have thoughts about their unconscious, or their relationship with their mother. Or maybe they are more up to date and wonder if you study psychological disorders as they manifest themselves in the clinic. But the truth is, knowing what you’ve done in your degrees here at Stanford, you probably haven’t done too much Freud. Or too much mind-reading. And although you may be interested in clinical work (and this is laudable), that’s not the core of what we teach here.<br />
<br />
Gaining a degree in psychology also means that you have gone to many classes in psychology and learned about many studies – from social influence to stereotype threat, from mental rotation to marshmallow tests. Although this body of knowledge is a lovely thing to have come into contact with (and I hope that you continue to deepen your knowledge), knowing this content is also not the core of what it means to receive your degree.<br />
<br />
What you have learned instead are tools – a specific kind of tool, namely tools for thought. These tools can be used to approach problems and construct solutions. This is what it means for psychology to be an academic discipline: a discipline denotes a particular mental toolbox. The university is the intellectual equivalent of a construction firm – different departments have the tools to solve different sorts of problems.<br />
<br />
3. Like nearly all ideas, “cognitive tools” seem obvious – after you are used to them. Let’s take one example, a foundational cognitive tool that we use every single day: numbers. Because we are so numerate, a lot of people have the idea that numbers are easy and straightforward. But they aren’t.<br />
<br />
Take the preschoolers in Madeline’s old classroom. Nearly all of them can count, at least to ten and maybe higher. But if you probe a bit more deeply, it all falls apart. If, at snack time, you ask someone to give you exactly four cheerios, she’s liable to hand you seven, or a whole handful. Even when a child knows that “one” means exactly 1, it takes quite a few months for them to figure out that “two” means exactly 2, and more months for 3. When they finally figure out how the whole system works, it enables so many new things! Madeline owes all of her dessert-negotiation prowess to her abilities with numbers. Seven gummi bears? No. How about six? This idea of exact comparison is a skill – even though it makes for tiresome after-dinner conversation.<br />
<br />
Numbers are an invented, culturally-transmitted tool. In graduate school I worked with an Amazonian indigenous group, the Pirahã, who have no words for numbers. They are bright, sophisticated people who love a good practical joke. Many Pirahã can shoot a fish with an arrow while standing in a canoe. Yet because their language does not have these particular words in it – words like “seven” – and because they do not go through that laborious period of practice that Madeline and other kids learning languages like English do – they can’t remember that it’s exactly seven gummi bears. To them, six or eight seems like the same amount. They simply don’t have the tool.<br />
<br />
4. So what are the tools of the psychologist?<br />
<br />
There’s one tool that qualifies as the hammer of psychology – the single tool you can use to frame an entire house. That’s the experiment. The fundamental insight of all of modern psychology is that the puzzles of the human mind can be understood as objects of scientific study if we can design appropriately controlled experiments. As complicated and unpredictable as people are (especially when they are integrated into complex cultural systems), we can still learn about their inner workings via experiments.<br />
<br />
This insight has spread far outside of psychology and far outside of the academy. Nowadays, Facebook runs a hundred experiments a day on you. Governments and political campaigns, startups and not-for-profits are all constantly experimenting to try to understand how to achieve their goals. There is a good chance that in the next few years of your professional life you will face a complicated human problem with an unknown solution. The psychologist’s approach will serve you well: formulate a hypothesis about how you should manipulate the world; then assess whether the manipulation has changed your measurement of interest. This strategy is shockingly effective.<br />
<br />
But the serious carpenter has other, more specialized tools in the toolkit – the plane, awl, rasp, drawknife, jigsaw, bevel. Let me mention two more.<br />
<br />
The first is the idea that our knowledge is not just a set of facts, but is organized into theories that help us understand the world. We call these theories intuitive theories – they are the explanatory frameworks that people carry with them to understand why things happen. What follows from this idea is that when you want to change people’s behavior, you can’t just tell them to change or tell them different facts. You need to change their theory. When I want Madeline to eat her vegetables, it turns out just telling her to “eat broccoli” doesn’t work very well – even if she does eat the broccoli, she won’t know what else to eat or why to eat it. And of course the well-known idea about fostering a growth mindset is precisely this kind of implicit theory: it’s a theory of whether ability is fixed or whether it can be improved with hard work.<br />
<br />
The second idea I want to share is that our judgment is systematically biased. It’s biased by our own beliefs. Our minds are wonderful, efficient systems that deal with uncertainty – we piece together a sentence even in a noisy restaurant using our expectations about what that person might be trying to say to us. In most cases, this is an amazing feature of our own cognition, letting us operate flexibly using limited data. But this reliance on our own beliefs also has negative consequences: it leads us to stereotype, and to engage in confirmation bias, looking for evidence that further supports our own beliefs. Understanding these sources of bias can help us avoid falling into this trap. A good grounding in psychology, in other words, helps us be more aware of our own limitations.<br />
<br />
I’d love to tell you about more ideas. Every woodworker loves to show off their workbench. And the wonderful thing about tools is that when you use them together you can create new tools, in the same way the carpenter can first build a jig that makes a difficult cut easier. I could go on, but hopefully I’ve piqued your curiosity – and you have lots more to do today.<br />
<br />
5. So. Make sure that you celebrate! Eat some cake, smash a piñata, and most of all, say your "thank you"s to the people who have supported you during your time here at Stanford. I speak for all of them when I say that we are very proud of you and cannot wait to see what you accomplish.<br />
<br />
As this weekend passes and you head off for other things, it is all but certain that you will find yourself in new situations facing challenges that you have not considered before. (Life would not be fun without them!) But I am confident that your tools will be sufficient for the job. Keep them sharp and they will serve you well.<br />
<br />
<br />
<div>
<br /></div>
Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-46436308880553538472018-05-05T15:52:00.000-07:002018-05-05T15:52:49.533-07:00nosub: a command line tool for pushing web experiments to Amazon Mechanical Turk<div>
<i>(This post is co-written with <a href="http://zx.gd/academic/">Long Ouyang</a>, a former graduate student in our department, who is the developer of nosub, and <a href="http://stanford.edu/~bohn/">Manuel Bohn</a>, a postdoc in my lab who has created a minimal working example). </i></div>
<div>
<br /></div>
Although my lab focuses primarily on child development, our typical workflow is to refine experimental paradigms via working with adults. Because we treat adults as a convenience population, Amazon Mechanical Turk (AMT) is a critical part of this workflow. AMT allows us to pay an hourly wage to participants all over the US who complete short experimental tasks. (<a href="http://babieslearninglanguage.blogspot.com/2013/10/randomization-on-mechanical-turk_10.html">Some background</a> from an old post).<br />
<div>
<br /></div>
<div>
Our typical workflow for AMT tasks is to create custom websites that guide participants through a series of linguistic stimuli of one sort or another. For simple questionnaires we often use Qualtrics, a commercial survey product, but most tasks that require more customization are easy to set up as free-standing javascript/HTML sites. These sites then need to be pushed to AMT as "external HITs" (Human Intelligence Tasks) so that workers can find them, participate, and be compensated. </div>
<div>
<br /></div>
<div>
<a href="https://github.com/longouyang/nosub">nosub</a> is a simple tool for accomplishing this process, building on earlier tools used by my lab.* The idea is simple: you customize your HIT settings in a configuration file and type</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">nosub upload</span></div>
<div>
<br /></div>
<div>
to upload your experiment to AMT. Then you can type</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">nosub download</span></div>
<div>
<br /></div>
<div>
to fetch results. Two nice features of nosub from a psychologist's perspective are: 1. worker IDs are anonymized by default so you don't need to worry about privacy issues (but they are deterministically hashed so you can still flag repeat workers), and 2. nosub can post HITs in batches so that you don't get charged Amazon's surcharge for tasks with more than 9 hits. </div>
<div>
<br /></div>
<div>
All you need to get started is to install <a href="https://nodejs.org/">Node.js</a>; installation instructions for nosub are available in the <a href="https://github.com/longouyang/nosub">project repository</a>.<br />
<br />
Once you've run nosub, you can download your data in JSON format, which can easily be parsed into R. We've put together a <a href="https://github.com/manuelbohn/nosub_example">minimal working example</a> of an experiment that can be run using nosub and a <a href="https://github.com/manuelbohn/nosub_example/blob/master/nosub_example.Rmd">data analysis script in R</a> that reads in the data. </div>
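<div>
<br /></div>
<div>
As a minimal sketch of that last step, here is how the downloaded JSON can be read into R with the <span style="font-family: "courier new" , "courier" , monospace;">jsonlite</span> package. The file name is illustrative – see the working example linked above for the real analysis script:</div>
<div>
<br /></div>
<pre style="line-height: 1.45; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">library(jsonlite)

# flatten = TRUE unnests the JSON structure into ordinary columns
results <- fromJSON("results.json", flatten = TRUE)
head(results)
</span></pre>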
<div>
<br /></div>
<div>
---</div>
<div>
* <a href="https://psiturk.org/">psiTurk</a> is another framework that provides a way of serving and tracking HITs. psiTurk is great and we have used it for heavier-weight applications where we need to track participants, but can be tricky to debug and is not always compatible with some of our light-weight web experiments.</div>
Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-67070199225069663202018-02-26T23:18:00.002-08:002018-03-01T09:03:25.301-08:00Mixed effects models: Is it time to go Bayesian by default?<div>
<i>(tl;dr: Bayesian mixed effects modeling using <span style="font-family: "courier new" , "courier" , monospace;"><a href="https://github.com/paul-buerkner/brms">brms</a></span> is really nifty.)</i></div>
<div>
<b><br /></b></div>
<div>
<b>Introduction: Teaching Statistical Inference?</b></div>
<div>
<br /></div>
How do you reason about the relationship between your data and your hypotheses? <a href="https://en.wikipedia.org/wiki/Bayesian_inference">Bayesian inference</a> provides a way to make normative inferences under uncertainty. As scientists – or even as rational agents more generally – we are interested in knowing the probability of some hypothesis given the data we observe. As a cognitive scientist I've long been interested in using Bayesian models to describe cognition, and that's what I did much of my graduate training in. These are custom models, sometimes fairly difficult to write down, and they are an area of active research. That's not what I'm talking about in this blogpost. Instead, I want to write about the basic practice of statistics in experimental data analysis.<br />
<div>
<br /></div>
<div>
Mostly when psychologists do and teach "stats," they're talking about frequentist statistical tests. Frequentist statistics are the standard kind people in psych have been using for the last 50+ years: t-tests, ANOVAs, regression models, etc. Anything that produces a p-value. P-values represent the probability of the data (or any more extreme) under the null hypothesis (typically "no difference between groups" or something like that). The problem is that <a href="http://psycnet.apa.org/fulltext/1995-12080-001.html">this is not what we really want to know as scientists</a>. We want the opposite: the probability of the hypothesis given the data, which is what Bayesian statistics allow you to compute. You can also compute the relative evidence for one hypothesis over another (the Bayes Factor). </div>
<div>
<br /></div>
<div>
<div>
Now, the best way to set psychology twitter on fire is to start a holy war about who's actually right about statistical practice, Bayesians or frequentists. There are lots of arguments here, and I see some merit on both sides. That said, there is lots of evidence that <a href="http://repository.cmu.edu/psychology/968/">much of our implicit statistical reasoning is Bayesian</a>. So I tend towards the Bayesian side on the balance <ducks head>. But despite this bias, I've avoided teaching Bayesian stats in my classes. I've felt like, even with their philosophical attractiveness, actually computing Bayesian stats had too many very severe challenges for students. For example, in previous years you might run into major difficulties inferring the parameters of a model that would be trivial under a frequentist approach. I just couldn't bring myself to teach a student a philosophical perspective that – while coherent – wouldn't provide them with an easy toolkit to make sense of their data. </div>
</div>
<div>
<br /></div>
<div>
The situation has changed in recent years, however. In particular, the <a href="http://bayesfactorpcl.r-forge.r-project.org/">BayesFactor R package</a> by Morey and colleagues makes it extremely simple to do basic inferential tasks using Bayesian statistics. This is a huge contribution! Together with <a href="https://jasp-stats.org/">JASP</a>, these tools make the Bayes Factor approach to hypothesis testing much more widely accessible. I'm really impressed by how well these tools work. </div>
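<div>
<br /></div>
<div>
To give a flavor of how simple this is, here's a sketch of a Bayesian two-sample comparison with the BayesFactor package, using simulated data and the package's default priors:</div>
<div>
<br /></div>
<pre style="line-height: 1.45; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">library(BayesFactor)

# simulated data for two groups with a modest true difference
x <- rnorm(30, mean = 0.5)
y <- rnorm(30, mean = 0)

# Bayes Factor comparing the alternative (a group difference) to the null
ttestBF(x = x, y = y)
</span></pre>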
<div>
<br /></div>
<div>
All that said, my general approach to statistical inference tends to rely less on inference about a particular hypothesis and more on parameter estimation – following the spirit of folks like <a href="http://www.stat.columbia.edu/~gelman/arm/">Gelman & Hill (2007)</a> and <a href="http://journals.sagepub.com/doi/abs/10.1177/0956797613504966">Cumming (2014)</a>. The basic idea is to fit a model whose parameters describe substantive hypotheses about the generating sources of the dataset, and then to interpret these parameters based on their magnitude and the precision of the estimate. (If this sounds vague, don't worry – the last section of the post is an example). The key tool for this kind of estimation is not tests like the t-test or the chi-squared. Instead, it's typically some variant of regression, usually mixed effects models. </div>
<div>
<br /></div>
<div>
<div>
<b>Mixed-Effects Models</b></div>
</div>
<div>
<b><br /></b></div>
<div>
Especially in psycholinguistics where our experiments typically show many people many different stimuli, mixed effects models have rapidly become the <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2613284/">de facto standard for data analysis</a>. These models (also known as hierarchical linear models) let you estimate sources of random variation ("random effects") in the data across various grouping factors. For example, in a reaction time experiment some participants will be faster or slower (and so all data from those particular individuals will tend to be faster or slower in a correlated way). Similarly, some stimulus items will be faster or slower and so all the data from these groupings will vary. The <span style="font-family: "courier new" , "courier" , monospace;">lme4</span> package in R was a game-changer for using these models (in a frequentist paradigm) in that it allowed researchers to estimate such models for a full dataset with just a single command. For the past 8-10 years, nearly every paper I've published has had a linear or generalized linear mixed effects model in it. </div>
<div>
<br /></div>
<div>
Despite the simplicity of fitting them, the biggest problem with mixed effects models (from an educational point of view, especially) has been figuring out how to write consistent model specifications for random effects. Often there are many factors that vary randomly (subjects, items, etc.) and many other factors that are nested within those (e.g., each subject might respond differently to each condition). Thus, it is not trivial to figure out what model to fit, even if fitting the model is just a matter of writing a command. Even in a reaction-time experiment with just items and subjects as random variables, and one condition manipulation, you can write</div>
<div>
<br /></div>
<div style="text-align: center;">
<span style="font-family: inherit;">(1)</span><span style="font-family: "courier new" , "courier" , monospace;"> rt ~ condition + (1 | subject) + (1 | item)</span></div>
<div style="text-align: center;">
<br /></div>
<div>
for just random intercepts by subject and by item, or you can nest condition (fitting a random slope) for one or both:</div>
<div>
<br /></div>
<div style="text-align: center;">
<span style="font-family: inherit;">(2)</span><span style="font-family: "courier new" , "courier" , monospace;"> rt ~ condition + (condition | subject) + (condition | item)</span></div>
<div style="text-align: center;">
<br /></div>
<div>
and you can additionally fiddle with covariance between random effects for even more degrees of freedom!</div>
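<div>
<br /></div>
<div>
For concreteness, here is what fitting (2) looks like – a sketch, assuming a data frame <span style="font-family: "courier new" , "courier" , monospace;">d</span> with rt, condition, subject, and item columns – along with lme4's double-bar syntax for dropping those covariance parameters:</div>
<div>
<br /></div>
<pre style="line-height: 1.45; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">library(lme4)

# model (2): random slopes and intercepts by subject and by item
m <- lmer(rt ~ condition + (condition | subject) + (condition | item),
          data = d)

# the double-bar syntax fits the same random effects but without
# the random-effect covariance parameters
m_nocov <- lmer(rt ~ condition + (condition || subject) + (condition || item),
                data = d)
</span></pre>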
<div>
<br /></div>
<div>
Luckily, a number of years ago, a powerful and clear simulation paper by <a href="https://www.sciencedirect.com/science/article/pii/S0749596X12001180">Barr et al. (2013)</a> came out. They argued that there was a simple solution to the specification issue: use the "maximal" random effects structure supported by the design of the experiment. This meant adding any random slopes that were actually supported by your design (e.g., if condition was a within-subject variable, you could fit condition by subject slopes). While this suggestion was <a href="https://www.sciencedirect.com/science/article/pii/S0749596X16302467">quite controversial</a>,* Barr et al.'s simulations were persuasive evidence that this suggestion led to conservative inferences. In addition, having a simple guideline to follow eliminated a lot of the worry about analytic flexibility in random effects structure. If you were "keeping it maximal" that meant that you weren't intentionally – or even inadvertently – messing with your model specification to get a particular result. </div>
<div>
<br /></div>
<div>
Unfortunately, a new problem reared its head in <span style="font-family: "courier new" , "courier" , monospace;">lme4</span>: convergence. With very high frequency, when you specify the maximal model, the approximate inference algorithms that search for the maximum likelihood solution for the model will simply not find a satisfactory solution. This outcome can happen even in cases where you have quite a lot of data – in part because the number of parameters being fit is extremely high. In the case above, not counting covariance parameters, we are fitting a slope and an intercept across participants, plus a slope and intercept for <i>every participant</i> and for <i>every item</i>. </div>
<div>
<br /></div>
<div>
To deal with this, people have developed various strategies. The first is to do some black magic to try and change the optimization parameters (e.g., following <a href="https://rstudio-pubs-static.s3.amazonaws.com/33653_57fc7b8e5d484c909b615d8633c01d51.html">these helpful tips</a>). Then you start to prune random effects away until your model is "less maximal" and you get convergence. But these practices mean you're back in flexible-model-adjustment land, and vulnerable to all kinds of charges of post-hoc model tinkering to get the result you want. We've had to specify lab best-practices about the <a href="https://osf.io/zqzsu/wiki/Standard%20Analytic%20Procedures/">order for pruning random effects</a> – kind of a guide to "tinkering until it works," which seems suboptimal. In sum, the models are great, but the methods for fitting them don't seem to work that well. </div>
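<div>
<br /></div>
<div>
For the record, the most common bit of black magic is a control argument on the model – switching the optimizer and raising its evaluation cap before resorting to pruning. A sketch, reusing the assumed data frame from above:</div>
<div>
<br /></div>
<pre style="line-height: 1.45; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;"># try a different optimizer and a higher evaluation limit before
# pruning any random effects
m <- lmer(rt ~ condition + (condition | subject) + (condition | item),
          data = d,
          control = lmerControl(optimizer = "bobyqa",
                                optCtrl = list(maxfun = 2e5)))
</span></pre>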
<div>
<br /></div>
<div>
Enter Bayesian methods. For several years, it's been possible to fit Bayesian regression models using <a href="http://mc-stan.org/">Stan</a>, a powerful probabilistic programming language that interfaces with R. Stan, building on BUGS before it, has put Bayesian regression within reach for someone who knows how to write these models (and interpret the outputs). But in practice, when you could fit an <span style="font-family: "courier new" , "courier" , monospace;">lmer</span> in one line of code and five seconds, it seemed like a bit of a trial to hew the model by hand out of solid Stan code (which looks a little like <span style="font-family: "courier new" , "courier" , monospace;">C</span>: you have to declare your variable types, etc.). We have done it <a href="https://github.com/jasbi/cogsci2017">sometimes</a>, but typically only for models that you couldn't fit with <span style="font-family: "courier new" , "courier" , monospace;">lme4</span><span style="font-family: inherit;"> (e.g., an ordered logit model). So I still don't teach this set of methods, or advise that students use them by default. </span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: inherit;"><b>brms?!? A worked example</b></span></div>
<div>
<span style="font-family: inherit;"><b><br /></b></span></div>
<div>
In the last couple of years, the package <span style="font-family: "courier new" , "courier" , monospace;"><a href="https://github.com/paul-buerkner/brms">brms</a></span> has been in development. <span style="font-family: "courier new" , "courier" , monospace;">brms</span> is essentially a front-end to Stan, so that you can write R formulas just like with <span style="font-family: "courier new" , "courier" , monospace;">lme4</span> but fit them with Bayesian inference.** This is a game-changer: all of a sudden we can use the same syntax but fit the model we want to fit! Sure, it takes 2-3 minutes instead of 5 seconds, but the output is clear and interpretable, and we don't have all the specification issues described above. Let me demonstrate. </div>
<div>
<br /></div>
<div>
The dataset I'm working on is an unpublished set of data on kids' pragmatic inference abilities. It's similar to many that I work with. We show children of varying ages a set of images and ask them to choose the one that matches some description, then record whether they do so correctly. Typically some trials are control trials where all the child has to do is recognize that the image matches the word, while others are inference trials where they have to reason a little bit about the speaker's intentions to get the right answer. Here are the data from this particular experiment:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv7030Uw3qdzf94sFoiJAMa-2HMJCt6Epd4Q_NBPW0MugHKkessjY9KYJxag4YDn9XxHxdNwvEWqsMDsnCl0aH4TMgJVw-zC34i5pWelAp62X5smSIx14ghqO09aWsJNppvF2xJW5I9PQC/s1600/Rplot02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="362" data-original-width="555" height="260" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv7030Uw3qdzf94sFoiJAMa-2HMJCt6Epd4Q_NBPW0MugHKkessjY9KYJxag4YDn9XxHxdNwvEWqsMDsnCl0aH4TMgJVw-zC34i5pWelAp62X5smSIx14ghqO09aWsJNppvF2xJW5I9PQC/s400/Rplot02.png" width="400" /></a></div>
<div>
<br /></div>
<div>
I'm interested in quantifying the relationship between participant age and the probability of success in pragmatic inference trials (vs. control trials, for example). My model specification is:</div>
<div>
<br /></div>
<div>
<div style="text-align: center;">
<span style="font-family: inherit;">(3)</span><span style="font-family: "courier new" , "courier" , monospace;"> correct ~ condition * age + (condition | subject) + (condition | stimulus)</span></div>
</div>
<div>
<br /></div>
<div>
So I first fit this with <span style="font-family: "courier new" , "courier" , monospace;">lme4</span>. Predictably, the full desired model doesn't converge, but here are the fixed effect coefficients: </div>
<div>
<br /></div>
<div>
<pre style="line-height: 1.45; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">              beta stderr     z    p
intercept     0.50   0.19  2.65 0.01
condition     2.13   0.80  2.68 0.01
age           0.41   0.18  2.35 0.02
condition:age -0.22   0.36 -0.61 0.54
</span></pre>
</div>
<div>
Now let's prune the random effects until the convergence warning goes away. In the simplified version of the dataset that I'm using here I can keep stimulus and subject intercepts and still get convergence when there are no random slopes. But in the larger dataset, the model won't converge unless I do <i>just</i> the random intercept by subject:</div>
<div>
<br /></div>
<div>
<pre style="line-height: 1.45; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">              beta stderr     z    p
intercept     0.50   0.21  2.37 0.02
condition     1.76   0.33  5.35 0.00
age           0.41   0.18  2.34 0.02
condition:age -0.25   0.33 -0.77 0.44
</span></pre>
</div>
<div>
<br /></div>
<div>
Coefficient values are decently different (but the p-values are not changed dramatically in this example, to be fair). More importantly, a number of fairly trivial things affect whether the model converges. For example, I can get one random slope in if I set the other level of the condition variable to be the intercept, but it doesn't converge with either in this parameterization. And in the full dataset, the model wouldn't converge at all if I didn't center age. And then of course I haven't tweaked the optimizer or messed with the convergence settings for any of these variants. All of this means that there are a <i>lot</i> of decisions about these models that I don't have a principled way to make – and critically, they need to be made conditioned on the data, because I won't be able to tell whether a model will converge <i>a priori</i>!</div>
<div>
<br /></div>
<div>
So now I switched to the Bayesian version using <span style="font-family: "courier new" , "courier" , monospace;">brms</span>, just writing <span style="font-family: "courier new" , "courier" , monospace;">brm()</span> with the model specification I wanted (3). I had to do a few tweaks: upping the number of iterations (suggested by the warning messages from the output) and changing to a Bernoulli model rather than binomial (for efficiency, again suggested by the error message), but this was very straightforward otherwise. For simplicity I've adopted all the default prior choices, but I could have gone more informative.</div>
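<div>
<br /></div>
<div>
Concretely, the call was essentially the following – a sketch, with <span style="font-family: "courier new" , "courier" , monospace;">d</span> as the assumed data frame and an illustrative iteration count:</div>
<div>
<br /></div>
<pre style="line-height: 1.45; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">library(brms)

# model (3) with a Bernoulli response and extra iterations,
# as the warning messages suggested
fit <- brm(correct ~ condition * age +
             (condition | subject) + (condition | stimulus),
           data = d, family = bernoulli(), iter = 4000)
summary(fit)
</span></pre>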
<div>
<br /></div>
<div>
Here's the summary output for the fixed effects:</div>
<div>
<br /></div>
<div>
<pre data-ordinal="1" style="line-height: 1.45; text-size-adjust: auto; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;"> estimate error l-95% CI u-95% CI
intercept 0.54 0.48 -0.50 1.69
condition 2.78 1.43 0.21 6.19
age 0.45 0.20 0.08 0.85
condition:age -0.14 0.45 -0.98 0.84
</span></pre>
</div>
<div>
<br /></div>
<div>
From this call, we get back coefficient estimates that are somewhat similar to the other models, along with 95% credible interval bounds. Notably, the condition effect is larger (probably corresponding to being able to estimate a more extreme value for the logit based on sparse data), and then the interaction term is smaller but has higher error. Overall, coefficients look more like the first non-convergent maximal model than the second converging one. </div>
<div>
<br /></div>
<div>
The big deal about this model is not that what comes out the other end of the procedure is radically different. It's that it's <i>not</i> different. I got to fit the model I wanted, with a maximal random effects structure, and the process was almost trivially easy. In addition, and as a bonus, the CIs that get spit out are actually credible intervals that we can reason about in a sensible way (as opposed to frequentist confidence intervals, <a href="https://link.springer.com/article/10.3758/s13423-015-0947-8">which are quite confusing</a> if you think about them deeply enough). </div>
<div>
<br /></div>
<div>
<b>Conclusion</b></div>
<div>
<b><br /></b></div>
<div>
Bayesian inference is a powerful and natural way of fitting statistical models to data. The trouble is that, up until recently, you could easily find yourself in a situation where there was a dead-obvious frequentist solution but off-the-shelf Bayesian tools wouldn't work or would generate substantial complexity. That's no longer the case. The existence of tools like BayesFactor and brms means that I'm going to suggest that people in my lab go Bayesian by default in their data analytic practice. </div>
<div>
<br /></div>
<div>
----<br />
<i>Thanks to Roger Levy for pointing out that model (3) above could include an </i>age | stimulus<i> slope to be truly maximal. I will follow this advice in the paper. </i><br />
<i><br /></i></div>
<div>
* Who would have thought that a paper about statistical models would be called "<a href="https://www.sciencedirect.com/science/article/pii/S0749596X16302467">the cave of shadows</a>"?</div>
<div>
** <a href="http://mc-stan.org/users/interfaces/rstanarm">Rstanarm</a> did this also, but it covered fewer model specifications and so wasn't as helpful. </div>
Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com23tag:blogger.com,1999:blog-4297242917419089261.post-20628455182544306412018-01-16T10:43:00.000-08:002018-01-16T10:43:15.125-08:00MetaLab, an open resource for theoretical synthesis using meta-analysis, now updated<div class="separator" style="clear: both; text-align: center;">
</div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: start;">
<i>(This post is jointly written by the MetaLab team, with contributions from Christina Bergmann, Sho Tsuji, Alex Cristia, and me.)</i></div>
<div dir="ltr" style="line-height: 1.38; margin-left: 1em; margin-right: 1em; margin-top: 0pt; text-align: center;">
<br /></div>
<div dir="ltr" style="line-height: 1.38; margin-left: 1em; margin-right: 1em; margin-top: 0pt;">
</div>
<div style="text-align: center;">
<img height="168" src="https://lh5.googleusercontent.com/4BjFFXavXONUE3PYgN_Vs75bIWceASz7Nf65otmmxalwD_AlrWHwXS-BR8en0kWJJ3wQt8TwJQzYlEsV7VlhvBl1m36K4KLsBC5NzWsqQmPYYNUJrKoKHKeK7UKn5p20glDAKgdn" width="200" /></div>
<i></i><br />
<div style="text-align: center;">
<i><i>A typical “ages and stages” ordering. Meta-analysis helps us do better.</i></i></div>
<i>
</i>
<div dir="ltr" style="line-height: 1.38; margin-left: 1em; margin-right: 1em; margin-top: 0pt;">
</div>
<div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: left;">
Developmental psychologists often make statements of the form “babies do X at age Y.” But these “ages and stages” tidbits sometimes misrepresent a complex and messy research literature. In some cases, dozens of studies test children of different ages using different tasks and then declare success or failure based on a binary p < .05 criterion. Often only a handful of these studies – typically those published earliest or in the most prestigious journals – are used in reviews, textbooks, or summaries for the broader public. In medicine and other fields, it’s long been recognized that we can do better.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Meta-analysis (MA) is a toolkit of techniques for combining information across disparate studies into a single framework so that evidence can be synthesized objectively. The results of each study are transformed into a standardized effect size (like Cohen’s d) and are treated as a single data point for a meta-analysis. Each data point can be weighted to reflect a given study’s precision (which typically depends on sample size). These weighted data points are then combined into a meta-analytic regression to assess the evidential value of a given literature. Follow-up analyses can also look at moderators – factors influencing the overall effect – as well as issues like publication bias or p-hacking.* Developmentalists will often enter participant age as a moderator, since meta-analysis enables us to statistically assess how much effects for a specific ability increase as infants and children develop. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: center;">
<img height="257" src="https://lh6.googleusercontent.com/OReQ4GgN-NwAQRkCNfvGJ6JFXeK8Yi-qJf3Iey21OC-CwnrHZHls7NTMzX6ZzJ2eB_kH5AplRHgzhJBFATvp0KhMoYRwku8bSamg6FqotJ6uJPMjhYu-lLGdIlDMPAC77kwie82u" width="400" /></div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: center;">
<i>An example age-moderation relationship for studies of mutual exclusivity in early word learning.</i></div>
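<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
For readers who want to try this at home: the moderated meta-analytic regression described above is a short call to the <span style="font-family: "courier new" , "courier" , monospace;">metafor</span> R package. This is a sketch assuming a data frame with one row per study, an effect size yi, its sampling variance vi, and a mean_age moderator column (illustrative names, not MetaLab's exact schema):</div>
<div style="text-align: left;">
<br /></div>
<pre style="line-height: 1.45; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">library(metafor)

# random-effects meta-analysis with mean participant age as a moderator;
# yi = standardized effect size, vi = its sampling variance
ma <- rma(yi, vi, mods = ~ mean_age, data = dat, method = "REML")
summary(ma)
</span></pre>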
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Meta-analyses can be immensely informative – yet they are rarely used by researchers. One reason may be that it takes a bit of training to carry them out or even understand them. Additionally, MAs go out of date as new studies are published. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
To facilitate developmental researchers’ access to up-to-date meta-analyses, we created <a href="http://metalab.stanford.edu/">MetaLab</a>. MetaLab is a website that compiles MAs of phenomena in developmental psychology. The site has grown over the last two years from just a small handful of MAs to 15 at present, with data from more than 16,000 infants. The data from each MA are stored in a standardized format, allowing them to be downloaded, browsed, and explored using interactive visualizations. Because all analyses are dynamic, curators or interested users can add new data as the literature expands.</div>
<div style="text-align: left;">
<a name='more'></a></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: center;">
<img height="295" src="https://lh6.googleusercontent.com/lL8YwYLN4xwK-AowgE-MGeKrmoUDc-cg1x_t8LGBsC6_48RlN8NDwuPwFEPS32AX0zf4cLph4_BztQbNNLBsLkJ--6m7k-vzo_o38S1b5l39ROKMNBFGIygnA_6p-gfKQYoHhcPw" width="400" /></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: center;">
<i>The main visualization app on MetaLab, showing a meta-analysis of infant-directed speech preference. The dataset of interest can be selected in the left upper corner to obtain standard meta-analytic visualizations. </i></div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: left;">
We thought it was time for a refresh of the site this fall, so we are launching a new version today: MetaLab 2.0.** If you have visited MetaLab before, you will notice a lot of changes. First and foremost, we’ve generalized our approach so that it is not specific to language development but can be used to explore MAs on other topics, which we hope to incorporate as they become available. There are new tutorials, more documentation and explanatory materials (including a <a href="https://www.youtube.com/watch?v=Omnq13QZ-3c&list=PLu8FqtGdUsEJUqHmhEo2Kq-e7qJ07ocUi">youtube video series</a>), and a host of other changes to make the site more intuitive to use.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
What can you do with MetaLab? During hypothesis generation, MAs can be a good way to get a comprehensive overview of a literature; we have always noted full references and many MAs even contain unpublished reports that would be difficult to locate otherwise. For instance, about half of the studies in MetaLab’s Sound Symbolism MA are unpublished. Integrating both published and unpublished records, the <a href="https://osf.io/wshdy/">forthcoming MA</a> by Sho and colleagues suggests that there is overall evidence for early sensitivity to sound symbolism, though it’s weaker than what’s represented in the published literature.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
If you are designing a study, MetaLab can help you choose a sample size or stimulus type. For example, Christina's <a href="https://osf.io/wpgjm/">new paper on vowel discrimination</a> uses stimuli that were thought to be appropriately difficult to avoid ceiling or floor effects – which worked! This selection was based on <a href="http://pubman.mpdl.mpg.de/pubman/item/escidoc:1836135/component/escidoc:1945376/Tsuj_cristia_2014.pdf">Sho's vowel discrimination MA</a>, which contains both acoustic information and effect sizes. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Even if there is not an MA for your particular phenomenon of interest, you can still learn about the average effect size for related phenomena and methods. And once you’ve finished your study, you can tell us about it so we add it to the appropriate meta-analysis in MetaLab, putting your study in the map of similar studies, and helping other researchers find it.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
If you want to learn more, check out our papers (<a href="https://osf.io/uhv3d/?view_only=None">Bergmann et al., in press, </a><a href="https://psyarxiv.com/htsjm/">Lewis et al., preprint</a>). Bergmann et al. (in press) focuses on study power and method choice, providing instructions how to make a priori sample size decisions to conduct appropriately powered studies. Lewis et al. (preprint) assesses publication bias (spoiler: it's not as bad as we feared, at least in studies of early language) and shows how all MAs on language development together can help us build data-driven theories.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
It’s more important than ever to create a cumulative research literature in which our theories rest on the sum of the available evidence and our new work is designed and powered appropriately to make a contribution. MetaLab is designed to help accomplish both of these goals. If you would like to contribute a MA, add an analysis or a datapoint, or simply comment on the site functionality, please reach out to us or <a href="https://github.com/langcog/metalab2">add a github issue</a>. And when someone asks you, “when do babies do X?,” you can look for answers based not on a handful of infants but on hundreds or thousands of them. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
---</div>
<div style="text-align: left;">
* We’re aware of the issues of meta-analysis with respect to understanding literatures that are deeply scarred by publication bias and p-hacking (e.g., <a href="http://datacolada.org/59">datacolada</a>, <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2659409">Inzlicht et al.</a>). We go into this a bit in our papers on the topic, but basically we think that – although publication bias and QRPs are a problem in our fields as they are everywhere – the literature is not fundamentally corrupted in the same way it is <a href="http://psycnet.apa.org/record/2016-17976-001">in some subfields</a>. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
** With generous support from <a href="http://www.bitss.org/projects/metalab-paving-the-way-for-easy-to-use-dynamic-crowdsourced-meta-analyses/">a SSMART grant</a> from the <a href="http://www.bitss.org/">Berkeley Initiative for Transparency in the Social Sciences</a> (BITSS) and some coding help from <a href="https://deanattali.com/shiny/">AttaliTech</a>.</div>
</div>
<br />
<br />
<div>
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;"><br /></span></div>
Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-36768911340292173282017-12-07T08:40:00.000-08:002017-12-07T08:40:15.702-08:00Open science is not inherently interesting. Do it anyway. <i>tl;dr: Open science practices themselves don't make a study interesting. They are essential prerequisites whose absence can undermine a study's value.</i><br />
<br />
There's a tension in discussions of open science, one that is also mirrored in my own research. What I really care about are the big questions of cognitive science: what makes people smart? how does language emerge? how do children develop? But in practice I spend quite a bit of my time doing meta-research on reproducibility and replicability. I often hear critics of open science – focusing on replication, but also other practices – objecting that open science advocates are making science more boring and decreasing the focus on theoretical progress (e.g., <a href="https://www.researchgate.net/profile/Edwin_Locke/publication/277087389_Theory_Building_Replication_and_Behavioral_Priming_Where_Do_We_Need_to_Go_From_Here/links/55d4b89708aef1574e975920.pdf">Locke</a>, <a href="http://journals.sagepub.com/doi/abs/10.1177/1745691613514450">Stroebe & Strack</a>). The thing is, I don't completely disagree. Open science is not inherently interesting.<br />
<br />
Sometimes someone will tell me about a study and start the description by saying that it's pre-registered, with open materials and data. My initial response is "ho hum." I don't really care if a study is preregistered – <i>unless </i>I care about the study itself and suspect p-hacking. Then the only thing that can rescue the study is preregistration. Otherwise, I don't care about the study any more; <a href="http://babieslearninglanguage.blogspot.com/2016/03/limited-support-for-app-based.html">I'm just frustrated by the wasted opportunity</a>.<br />
<br />
So here's the thing: Although being open can't make your study interesting, <i>the failure to pursue open science practices can undermine the value of a study.</i> This post is an attempt to justify this idea by giving an informal Bayesian analysis of what makes a study interesting and why transparency and openness is then the key to maximizing study value.<br />
<br />
<a name='more'></a><br />
<h4>
<b>What makes a scientific study interesting?</b> </h4>
I take a fundamentally Bayesian approach to scientific knowledge. If you haven't encountered Bayesian philosophy of science, <a href="http://www.strevens.org/research/simplexuality/Bayes.pdf">here's a nice introduction by Strevens</a>; I find this framework nicely fits my intuitions about scientific reasoning. The core assumption is that knowledge in a particular domain can be represented by a probability distribution over theoretical hypotheses, given the available evidence.* This distribution can be decomposed into the product of 1) the prior probability of each hypothesis and 2) the likelihood of the hypothesis given the available evidence. New evidence changes this posterior distribution, and the amount of change is quantified by <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">information gain</a>. Thus, an "interesting" study is simply one that leads to high information gain.<br />
<br />
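To make "information gain" concrete, here's a toy calculation in R, with two hypotheses, equal priors, and made-up likelihoods:<br />
<br />
<pre style="line-height: 1.45; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;"># two equally probable hypotheses; the data are much more likely
# under H1 (the likelihood values are invented for illustration)
prior <- c(H1 = 0.5, H2 = 0.5)
likelihood <- c(H1 = 0.9, H2 = 0.2)
posterior <- prior * likelihood / sum(prior * likelihood)

# information gain: the KL divergence from prior to posterior, in bits
sum(posterior * log2(posterior / prior))
</span></pre>
<br />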
Some good intuitions fall out of this definition. First, consider a study that decisively selects between two competing hypotheses that are equally likely based on prior literature; this study leads to high information gain and is clearly quite "theoretically interesting." Next, consider a study that provides strong support for a particular hypothesis, but the hypothesis is already deeply established in the literature; it's much less informative and hence much less interesting. Would you spend time conducting a large, high-powered test of <a href="https://en.wikipedia.org/wiki/Weber%E2%80%93Fechner_law">Weber's law</a>? Probably not – it would probably show the same regularity as the hundreds or thousands of studies before it. Finally, consider a study that collects a large amount of detailed data, but the design doesn't distinguish between hypotheses. Despite the amount of data, the theoretical progress is minimal and hence the study is not interesting.**<br />
<br />
Under this definition, an interesting study can't just have the potential to compare between hypotheses, it must provide evidence that changes our beliefs about which one is more probable.*** Larger samples and more precise measurements typically result in greater amounts of evidence, and hence lead to more important ("more interesting") studies. In the special case where the literature is consistent with two distinct hypotheses, evidence can be quantified by the <a href="https://en.wikipedia.org/wiki/Bayes_factor">Bayes Factor</a>. The bigger the Bayes Factor, the more evidence a study provides in favor of one hypothesis compared with the other, and the greater the information gain.<br />
<br />
<h4>
<b>How does open science affect whether a study is interesting?</b></h4>
Transparency and openness in science includes the sharing of code, data, and experimental materials as well as the sharing of protocols and analytic intentions (e.g., through preregistration). Under the model described above, none of these practices add to the informational value of a study. Having the raw data available or knowing that the inferential statistics are appropriate due to preregistration can't make a study better – the data are still the data, and the evidence is still the evidence.<br />
<i><br /></i>
<i>If there is uncertainty about the correctness of a result, the informational value of the study is decreased. </i>Consider a study that in principle decides between two hypotheses, but imagine the skeptical reader has no access to the data and harbors some belief that there has been a major analytic error. The reader can quantify her uncertainty about the evidential value of the study by assigning probabilities to the two outcomes: either the study is right, or else it's not. Integrating across these two outcomes, the value of the study is of course lower than if she knows the study has no error. Or similarly, imagine that the reader believes that another, different statistical test was equally appropriate but that the authors selected the one they report post hoc (leading to an inflation of their risk of a false positive).**** Again, uncertainty about the presence of p-hacking decreases the evidential value of the study, and the decrease is proportional to the strength of the belief. <br />
<br />
<i>Open science practices decrease the belief in p-hacking or error, and thus preserve the evidential value of the study.</i> If the skeptical reader has the ability to repeat the data analysis ("computational reproducibility"), the possibility of error is decreased. If she has access to the preregistration, the possibility of p-hacking is similarly minimized. Both of these steps mean that the informational value of the study is maintained rather than decreased.<br />
<br />
One corollary of this formulation is that replication can "rescue" particularly interesting research designs. A finding can – by virtue of its design – have the potential to be theoretically important yet carry limited evidential value, whether because of a small sample, imprecise measurements, or worries about error or p-hacking. In this case, a replication can substantially alter the theoretical landscape by adding evidence to the picture (a point made by <a href="http://psycnet.apa.org/record/2014-38072-004">Klein et al.</a> in their commentary on the ManyLabs studies). Replication in general, then, can be interesting <i>or</i> uninteresting – depending on the strength of the evidence for the original finding and its theoretical relevance. The most interesting replications target findings whose designs allow for high information gain but for which the evidence is weak.<br />
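<br />
One way to see the arithmetic of "adding evidence to the picture": for two point hypotheses and independent datasets, Bayes Factors multiply. A toy illustration:<br />
<pre>
bf_original    <- 3    # weak evidence from a small original study
bf_replication <- 8    # evidence from an independent replication
bf_combined    <- bf_original * bf_replication   # 24 -- now fairly strong

post_odds <- 1 * bf_combined    # posterior odds, starting from even odds
post_odds / (1 + post_odds)     # P(H1 | both datasets) = 0.96
</pre>
An original study that left readers ambivalent, plus a replication that is individually only moderately diagnostic, can jointly be quite convincing.<br />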
<br />
<h4>
<b>Conclusions</b></h4>
Open science practices won't make your study interesting or important by themselves. The only way to have an interesting study is the traditional way: create a strong experimental design grounded in theory, and gather enough evidence to force scientists to update their beliefs. But what a shame if you have gone this route and the value of your study is then undermined! Transparency is the only way to ensure that readers assign the maximal possible evidential value to your work.<br />
<br />
---<br />
* As a first approximation, let's be subjectively Bayesian, so that the distribution is in the heads of individual scientists and represents their beliefs. Of course, no scientist is perfect, but we're thinking about an idealized rational scientist who weighs the evidence and has reasonably fair subjective priors.<br />
** Advocates for hypothesis-neutral data collection argue that later investigators can bring their own hypotheses to a dataset. In the framework I'm describing here, you could think about the dataset having some latent value that isn't realized until the investigator comes along and considers whether the data are consistent with their particular hypotheses. Big multivariate datasets can be very informative in this way, even if they are not collected with any particular analysis in mind. But investigators always have to be on their guard to ensure that particular analyses aren't undermined by the post-hoc nature of the investigation. <br />
*** Even though evidence in this sense is computed after data collection, that doesn't rule out the prospective analysis of whether a study will be interesting. For example, you can compute the <i>expected information gain</i> using optimal experimental design. <a href="https://psyarxiv.com/h457v">Here's a really nice recent preprint</a> by Coenen et al. on this idea.<br />
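A minimal toy version of that computation (invented likelihoods, not Coenen et al.'s model): average the information gain over the outcomes a design could produce, weighted by how probable each outcome is under the prior.<br />
<pre>
entropy <- function(p) -sum(p * log2(p))   # Shannon entropy, in bits

prior <- c(0.5, 0.5)
# P(outcome | hypothesis): rows = hypotheses, columns = outcomes
lik <- rbind(H1 = c(success = 0.9, failure = 0.1),
             H2 = c(success = 0.3, failure = 0.7))

p_outcome <- colSums(prior * lik)   # marginal probability of each outcome
eig <- 0
for (o in colnames(lik)) {
  posterior <- prior * lik[, o] / p_outcome[o]   # Bayes' rule per outcome
  eig <- eig + p_outcome[o] * (entropy(prior) - entropy(posterior))
}
eig   # expected bits gained from running the study: ~0.3
</pre>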
**** I know that this use of the p-hacking framework mixes my Bayesian apples in with some frequentist pears. But you can just as easily do post-hoc overfitting of Bayesian models (see, e.g., the <a href="http://datacolada.org/13">datacolada post</a> on this topic).<br />
<br />
<h4><b>Talk on reproducibility and meta-science</b> (November 10, 2017)</h4>
I just gave a talk at UCSD on reproducibility and meta-science issues. <a href="https://figshare.com/articles/UCSD_Psych_Colloquium_11_9_17/5592460">The slides are posted here</a>. I focused somewhat on developmental psychology, but a number of the studies and recommendations are more general. It was lots of fun to chat with students and faculty, and many of my conversations focused on practical steps that people can take to move their research practice towards a more open, reproducible, and replicable workflow. Here are a few pointers:<br />
<br />
<b>Preregistration</b>. Here's a blogpost from last year on <a href="http://babieslearninglanguage.blogspot.com/2016/07/preregister-everything.html">my lab's decision to preregister everything</a>. I also really like Nosek et al.'s <a href="https://osf.io/2dxu5/">Preregistration Revolution</a> paper. <a href="http://aspredicted.org/">AsPredicted.org</a> is a great gateway to simple preregistration (<a href="http://datacolada.org/44">guide</a>).<br />
<br />
<b>Reproducible research</b>. Here's a blogpost on <a href="http://babieslearninglanguage.blogspot.com/2015/11/preventing-statistical-reporting-errors.html">why I advocate for using RMarkdown to write papers</a>. The best package for doing this is <a href="https://github.com/crsh/papaja">papaja</a> (pronounced "papaya"). If you don't use RMarkdown but do know R, <a href="https://github.com/mcfrank/rmarkdown-workshop">here's a tutorial</a>.<br />
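<br />
For those who haven't seen the workflow, here's roughly the idea – a minimal, made-up snippet (hypothetical data and variable names) in which the statistics reported in the text are recomputed from the data every time the document is compiled:<br />
<pre>
---
title: "A minimal reproducible writeup"
output: html_document
---

```{r analysis}
# made-up data, just for illustration
d <- data.frame(condition = rep(c("a", "b"), each = 20),
                rt = c(rnorm(20, 500, 50), rnorm(20, 530, 50)))
tt <- t.test(rt ~ condition, data = d)
```

Reaction times differed across conditions,
t(`r round(tt$parameter, 1)`) = `r round(tt$statistic, 2)`,
p = `r round(tt$p.value, 3)`.
</pre>
The payoff is that the numbers in your text can never drift out of sync with your analysis; papaja layers APA-formatted manuscripts on top of this same basic pattern.<br />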
<br />
<b>Data sharing</b>. <a href="http://pages.ucsd.edu/~cmckenzie/Simonsohn2013PsychScience.pdf">Just post it</a>. The <a href="http://osf.io/">Open Science Framework</a> is an obvious choice for file sharing. Some <a href="https://cos.io/our-services/training-services/cos-training-tutorials/">nice video tutorials</a> make it easy to get started.<br />
<br />
<h4><b>Co-work, not homework</b> (November 5, 2017)</h4>
Coordination is one of the biggest challenges of academic collaboration. You have two or more busy collaborators working asynchronously on a project. Either the collaboration ping-pongs back and forth with quick responses but limited opportunity for deeper engagement, or else one person digs in and really makes conceptual progress but then has to wait an excruciating amount of time for collaborators to get engaged, understand the contribution, and respond. What's more, there are major inefficiencies in loading the project back into memory each time you begin again. ("What was it we were trying to do here?")<br />
<br />
The "homework" model in collaborative projects is sometimes necessary, but often inefficient. This default means that we meet to discuss and make decisions, then assign "homework" based on that discussion and make a meeting to review the work and make a further plan. The time increments of these meetings are usually 60 minutes, with the additional email overhead for scheduling. Given the amount of time I and the collaborators will actually spend on the homework the ratio of actual work time to meetings is sometimes not much better than 2:1 if there are many decisions to be made on a project – as in design, analytic, and writeup stages.* Of course if an individual has to do data collection or other time-consuming tasks between meetings, this model doesn't hold!<br />
<div>
<br /></div>
Increasingly, my solution is co-work. The idea is that collaborators schedule time to sit together and do the work – typically writing code or prose, occasionally making stimuli or other materials – either in person or online. This model means that when conceptual or presentational issues come up, we can discuss them on the spot rather than waiting to resolve them by email or in a subsequent meeting.** As a supervisor, I love this model because I get to see how the folks I work with approach a problem and what their typical workflow is. This observation helps me give process-level feedback as I learn how people organize their projects. I also often learn new coding tricks this way.***<br />
<br />
The products of co-work are often stronger than drafts that come out of independent work. When we program or write by ourselves, we sometimes let bad sentences (or copy-and-pasted code) slide by – I certainly do. In contrast, when I'm working together with someone, I'm more conscious of working carefully and writing clearly. And as a supervisor, I like that this model allows us to discuss the strengths and weaknesses of something we've jointly produced – rather than having me critique the student's independent work.<br />
<br />
Co-working isn't always appropriate. If the amount to be done is too great, or the workload is not distributed evenly between collaborators (whether because of seniority, time, or skill), then it's not the right choice. But for the conceptually challenging bits of projects – say, coding the key data analysis or writing the intro or general discussion – co-working can be both an efficient way to get something done and a great way to learn and think together.****<br />
<br />
---<br />
* I also find that for me, given other academic constraints, "homework" often means "comes out of family time" (evenings and weekends).<br />
** Sometimes we work on different parts of the project, but in the same place, so that if questions come up we can interrupt and discuss.<br />
*** Of course, I recognize that this model presumes that supervisors have the time to co-work with trainees; sometimes making this time can be a hard ask. But "can you show me how you'd approach that task?" is often a reasonable question to pose to a supervisor! And of course this model works just as well – maybe even better – for collaborations between two people at the same career stage.<br />
**** In some sense it's amazing that I'm writing a blogpost about academics sitting in one place and working together, but that's really the culture we've got – almost every work situation I've been in has involved meetings for decision-making and then independent "homework" for the collaborators or the trainee.<br />