Tuesday, January 3, 2017

Onboarding

Reading Twitter this morning, I saw a nice tweet by Page Piccinini on the topic of organizing project folders. Her scheme is exactly what I do and ask my students to do, and I said so. I then got a thoughtful reply from my old friend Adam Abeles, and he's exactly right: I need some kind of onboarding guide. Since I'm going to have some new folks joining my lab soon, there's no time like the present. Here's a brief checklist for what to expect from a new project.

Experimental Design

We preregister everything. In practice, this means that your sample size and analytic strategy must be registered in some form before you collect any non-pilot data.

When possible, we do a power analysis to determine sample size, but effect sizes are often hard to estimate in advance. In those cases, we just plan a decent-sized sample and assume we'll replicate if things look interesting.
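As a concrete illustration, here's a minimal power-analysis sketch in R; the pwr package and the target effect size are illustrative choices, not lab standards:

```r
library(pwr)  # install.packages("pwr") if needed

# Participants per group needed to detect a medium effect (Cohen's d = .5)
# in a two-sample t-test with 80% power at alpha = .05
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80, type = "two.sample")
# returns n per group (about 64 here); round up when planning the sample
```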

Also, as I've learned, piloting can't really tell you about effect size, so we pilot only to figure out whether participants hate our procedure (which they often do, since they're preschoolers).

Collecting data

Always check with the lab manager to ensure that your experiment is covered under an existing IRB protocol and that you have up-to-date training before you collect any data.

Not all of our data are collected on Amazon Mechanical Turk (MTurk), but we do use it frequently. To write experiments, we use mmturkey with basic JavaScript and HTML, and cosub to submit HITs.

MTurk worker IDs should be anonymized before being pushed to a public repo.
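One way to do that, as a minimal R sketch (the workerid column name and the salting scheme are illustrative assumptions):

```r
library(digest)  # one-way cryptographic hashes

d <- read.csv("raw_data.csv", stringsAsFactors = FALSE)

# Replace each worker ID with a salted SHA-256 hash. The salt prevents
# IDs from being linked across labs; keep it out of the public repo.
salt <- Sys.getenv("LAB_SALT")
d$workerid <- vapply(d$workerid,
                     function(id) digest(paste0(salt, id), algo = "sha256"),
                     character(1))

write.csv(d, "anonymized_data.csv", row.names = FALSE)
```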

Analysis 

For each project, you should have a GitHub repository, with the analysis scripts in the main directory and experimental materials, data, and helper functions in subdirectories.
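Concretely, a layout along these lines (all names illustrative):

```
my-project/
  analysis.Rmd    # analysis scripts in the main directory
  experiments/    # experiment code and materials
  data/           # raw, anonymized data
  helpers/        # shared helper functions
```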

All data should be tidy. Data should be saved by default in a transparent, open format like CSV (though there are exceptions). On receiving data (assuming they are anonymized), commit them to your repo and do not alter them. If you need to alter data, for example to sanitize open-ended responses, create a new column or, in the most extreme case, a new copy of the data called "XYZ_sanitized."
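As an example of the new-column pattern, a minimal R sketch (file and column names are hypothetical):

```r
library(dplyr)

# raw data: committed once, never edited in place
raw <- read.csv("data/XYZ.csv", stringsAsFactors = FALSE)

# sanitize free-text responses into a *new* column; stripping email
# addresses is just an example of what sanitizing might mean
sanitized <- raw %>%
  mutate(response_clean = gsub("\\S+@\\S+", "[email removed]", response))

# in the most extreme case, write out a clearly labeled sanitized copy
write.csv(sanitized, "data/XYZ_sanitized.csv", row.names = FALSE)
```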

For analysis, we use R and R Markdown (my explanation for why). If you are sharing results with me, I'd typically like to look at a rendered markdown file published to rpubs.com, so I can see your code, text, and figures together on my computer or phone. Then, if I want to contribute, I can clone the repo, re-render, and push.
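The loop looks roughly like this (file name illustrative):

```r
library(rmarkdown)

# knit the analysis so code, text, and figures render together as one HTML file
render("analysis.Rmd", output_format = "html_document")

# from RStudio, the Publish button on the rendered preview pushes the
# document to rpubs.com, where collaborators can read it on any device
```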

We make everything open by default. We use the OSF as our backbone framework, but primarily work with git and GitHub (connected to an OSF project for registration and sharing purposes); here's our lab GitHub page. An easy way to register your analysis is to write it in your GitHub repo, link the repo to the OSF, and then register the repo.

We have an evolving list of standard analytic choices to help guide your analysis.

Writing

We use R Markdown for papers now (see my guide), with bibliographies in BibTeX. This way of writing creates clean, reproducible documents that integrate code and text.
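A minimal skeleton of such a paper might look like this (the citation key, file names, and chunk contents are hypothetical):

````
---
title: "My paper"
author: "Author One and Author Two"
output: pdf_document
bibliography: references.bib
---

As @smith2015 showed, values computed in code can flow straight into the text:

```{r, echo = FALSE}
n_participants <- 48  # in a real paper, computed from the data
```

We tested `r n_participants` children.
````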

When writing, please start with an outline, and make sure that the relevant collaborators have agreed on its main points before you flesh it out into a full document; this can save a lot of time later and leads to a more organized manuscript.

If you are writing an abstract, grant proposal, or other non-manuscript document, please use Google Docs. The comment functionality and mobile-friendliness make it easy for collaborators to work in parallel.

If you're revising a paper, the first thing you should do is paste the reviews into a Google Doc and break them up into specific comments. These comments then serve as a to-do list for the revision.

On submission, please post a preprint to PsyArXiv.

Authorship and submission

From our authorship policy: "An author on a paper is someone who has made a sustained, intellectual contribution." That means they've been around for a good chunk of the project and have done something more than just follow instructions.

You should assume that you are the lead author on any collaboration between us, unless it's explicitly stated otherwise (e.g., I ask for your help as a collaborator on a project I've initiated). But if there are more than two authors, we should figure out the order before we start doing the actual work. These conversations can be awkward, and we will often get excited about the work itself and forget to have them. But that's a mistake and can lead to sadness later.

All coauthors must approve all manuscripts, grants, abstracts, etc. prior to submission.

Conclusion

I'm sure I've missed a lot here, but it's been very helpful for me to write this down. I'd love to hear your feedback in the comments.

7 comments:

  1. Hi Michael. Greetings from Adelaide, Australia. I am a Psychology teacher in a high school here in Adelaide. Whilst I don't do anything nearly as complicated as what you are outlining in this post, I am nevertheless interested in the process you have described. Just one thing: do you have a glossary to explain some of the abbreviations, such as "MTurk", "R and R Markdown" and "OSF"? Perhaps I should know these; however, I don't. Also, do you have an example showing this process in action? If you don't have time to explain, I will understand. Thanks. Gerald

  2. Hi Gerald, glad you like the post! It might help to click the links to the OSF and R Markdown. MTurk (Amazon Mechanical Turk) is available at mturk.com. I don't have a worked example yet, but you are welcome to see our GitHub page (github.com/langcog) for some in-progress projects.

  3. Don't know if you've covered this elsewhere, but can you say a bit more about anonymizing MTurk IDs? I typically see only (to me) meaningless alphanumeric strings. Is the concern the potential to link these IDs across studies? Or do those codes contain stable information that I'm not aware of?

    1. Hi Melissa,

      Turker IDs are actually Amazon IDs that are used across many services (e.g., they can be linked to wish lists). It's good practice to anonymize them so that people can't be linked across studies. I was once contacted by a Turker who found their ID in our data repo (this was a mistake; we scrubbed it, of course).

    2. I didn't know this! This is very useful to know and will change some of my practices going forward...

  4. Thanks for sharing this information with us.

  5. Two things I would suggest are:

    1. A data management plan published with the preregistration (maybe you do that already).
    2. A fake-data simulation run before the experiment, expressing your expectations about the results, together with a complete analysis of the fake data.
