Babies Learning Language: Education

Wednesday, April 22, 2026

Datapages for reusable (and pretty!) data sharing

[This post is joint with Mika Braginsky, a long-time collaborator on data-sharing and data-viz.]

Data sharing is both a critical scientific need and, increasingly, a mandate by many research funders. The FAIR principles – that data should be findable, accessible, interoperable, and reusable – are a critical guide to how data are shared. Yet even FAIR-compliant datasets in approved repositories are often shared in ad-hoc formats that are hard to reuse or to integrate with other data. In contrast, the most impactful datasets tend to be disseminated thoughtfully through dataset-specific or community-specific platforms. These “domain-specific data repositories” (this was our term from a previous blogpost!) create opportunities to develop data standards and ontologies that fit the needs of a particular community, research problem, or instrument type. They also open the door to engagement through interactive visualizations. But custom repositories and pretty websites with nice visualizations are costly and complicated to create.

We are introducing a set of open-source tools and templates for easily creating datapages, interactive websites to disseminate data for broad reuse. Datapages are easy to deploy for a single project, but extensible enough to host large collections of related datasets. You can learn more and get started at https://datapages.github.io/.

An LLM-backed "socratic tutor" to replace reading responses

My hot take on college-level teaching is that reading responses are mostly a terrible assignment, and they're even worse in the age of AI. I'm piloting something a bit different with my co-instructor right now: a "socratic tutor" bot that asks students to answer open-ended socratic questions about a specific text and "passes" them when they show sufficient comprehension. Initial feedback from students in a first trial has been extremely positive, so I am thinking more about how this could be useful in the future, as well as some of the potential problems. LLMs are far from a panacea for education – they cause way more problems than they solve, at the moment! – but this might be an interesting use case.

As an instructor, one major challenge is that you want people to read the assigned reading and engage with it so that what you do in class can build on this content in a meaningful way; some students would prefer not to (or just don't have time, or whatever). How do you solve this problem? Weekly quizzes are possible but they're time-consuming to make and give and annoying to grade; plus they reinforce a memorization mindset, rather than inviting students to engage.

The humble reading response is a frequent alternative: you ask students to respond to, critique, or build on their readings, usually in a short response ranging from a paragraph to a page. At their best in a well-prepared seminar, the instructor reads these beforehand, synthesizes them, and calls on individual students to share their reactions. But in a larger course, often this synthesis is impossible – and so the reading response becomes an assignment that no one wants to write and that can be tedious to read at the level they deserve. Even worse, if you're not getting called out on your reaction, it's possible to "respond" to a reading without having read it. And that's even before you can ask an AI to write a response to a text that it has ingested at some point (or that you've pasted into its chat window). What do we do?

Advice on reviewing

(Several people I work with have recently asked me for reviewing advice, so I thought I'd share my thoughts more broadly.)

Peer review – organized scrutiny of scientific work prior to publication – is a critical part of our current scientific ecosystem. But I have heard many of the peer review horror stories out there and experienced some myself. The peer review ecosystem could be improved – better tracking and sharing of peer review, better credit assignment, fairer allocation of review requests, better online systems for editors and reviewers, to name a few.*

Should we have peer review at all? In my view, peer review is primarily a filter that limits the amount of truly terrible work that appears in reputable journals (e.g., society publications, high-ranked international outlets). Don't get me wrong: plenty of incorrect, irreproducible, and un-replicable science still appears in print! But there are certain minimal standards that peer review enforces – published work typically ends up conforming to the standards of its field, even if those standards themselves could be improved. Without peer review, more of this terrible work would appear and there would be even more limited cues for distinguishing the good from the bad.** To paraphrase, it's the worst solution to the problem of quality control in science – except for all the others!

So all in all I'm an advocate for peer review.

What does it mean to get a degree in psychology these days?

(I was asked to give a speech yesterday at Stanford's Psychology commencement ceremony. Here is the text.)

1. Chair, Colleagues, graduates of the class of 2018 – undergraduates and graduate students – family members, and friends. It’s a pleasure to be here today with all of you. Along with honoring our graduates, we especially honor all the wonderful speakers today for their accomplishments – MH for his excellence in research and teaching, Angela for her deep engagement with the department community. You could be forgiven for thinking that there was some special achievement that brought me here as well. In fact, by tradition, faculty take turns addressing the graduating class and it is my turn this year. It’s a real pleasure to have one last chance to address you.

Two weeks ago, my daughter Madeline graduated from preschool. There was cake; photos were taken. They broke a piñata. It was a big deal! Several of her friends will be going to different schools, some moving away to other states or even other countries. This is one of the biggest changes she’s ever experienced. I’m already worried about what happens next. Parents, I can only imagine what you are going through today – but at least you know that your kids made it through the first day of kindergarten.

Graduates – your graduation from Stanford today is a really big deal. You also get to have cake and photos. If you’re very lucky, some special person has even bought you a piñata. But more importantly, just like for Madeline, this is a time of transitions. You may be moving somewhere new. Even if you are staying here, friends will be further away than the next dorm or the next office. So do not hesitate to take a little extra time today to celebrate with the people you love and who love you.

Congratulations.

2. I want to take a little time now to think about what it means to get a degree in psychology from Stanford.

When you sit next to someone on an airplane and tell them you are studying psychology, perhaps they ask you if you are reading their mind. Perhaps they wonder if you are studying Freudian analysis and have thoughts about their unconscious, or their relationship with their mother. Or maybe they are more up to date and wonder if you study psychological disorders as they manifest themselves in the clinic. But the truth is, knowing what you’ve done in your degrees here at Stanford, you probably haven’t done too much Freud. Or too much mind-reading. And although you may be interested in clinical work (and this is laudable), that’s not the core of what we teach here.

Gaining a degree in psychology also means that you have gone to many classes in psychology and learned about many studies – from social influence to stereotype threat, from mental rotation to marshmallow tests. Although this body of knowledge is a lovely thing to have come into contact with (and I hope that you continue to deepen your knowledge), knowing this content is also not the core of what it means to receive your degree.

What you have learned instead are tools; a specific kind of tool, namely tools for thought. These tools can be used to approach problems and construct solutions. This is what it means for psychology to be an academic discipline: a discipline denotes a particular mental toolbox. The university is the intellectual equivalent of a construction firm – different departments have the tools to solve different sorts of problems.

3. Like nearly all ideas, “cognitive tools” seem obvious – after you are used to them. Let’s take one example, a foundational cognitive tool that we use every single day: numbers. Because we are so numerate, a lot of people have the idea that numbers are easy and straightforward. But they aren’t.

Take the preschoolers in Madeline’s old classroom. Nearly all of them can count, at least to ten and maybe higher. But if you probe a bit more deeply, it all falls apart. If at snack time, you ask someone to give you exactly four cheerios, she’s liable to hand you seven, or a whole handful. Even when a child knows that “one” means exactly 1, it takes quite a few months for them to figure out that “two” means exactly 2, and more months for 3. When they finally figure out how the whole system works it enables so many new things! Madeline owes all of her dessert-negotiation prowess to her abilities with numbers. Seven gummi bears? No. How about six? This idea of exact comparison is a skill – even though it makes for tiresome after-dinner conversation.

Numbers are an invented, culturally-transmitted tool. In graduate school I worked with an Amazonian indigenous group, the Pirahã, who have no words for numbers. They are bright, sophisticated people who love a good practical joke. Many Pirahã can shoot a fish with an arrow while standing in a canoe. Yet because their language does not have these particular words in it – words like “seven” – and because they do not go through that laborious period of practice that Madeline and other kids learning languages like English do – they can’t remember that it’s exactly seven gummi bears. To them, six or eight seems like the same amount. They simply don’t have the tool.

4. So what are the tools of the psychologist?

There’s one tool that qualifies as the hammer of psychology – the single tool you can use to frame an entire house. That’s the experiment. The fundamental insight of all of modern psychology is that the puzzles of the human mind can be understood as objects of scientific study if we can design appropriately controlled experiments. As complicated and unpredictable as people are (especially when they are integrated into complex cultural systems), we can still learn about their inner workings via experiments.

This insight has spread far outside of psychology and far outside of the academy. Nowadays, Facebook runs a hundred experiments a day on you. Governments and political campaigns, startups and not-for-profits are all constantly experimenting to try to understand how to achieve their goals. There is a good chance that in the next few years of your professional life you will face a complicated human problem with an unknown solution. The psychologist’s approach will serve you well: formulate a hypothesis about how you should manipulate the world; then assess whether the manipulation has changed your measurement of interest. This strategy is shockingly effective.

But the serious carpenter has other, more specialized tools in the toolkit – the plane, awl, rasp, drawknife, jigsaw, bevel. Let me mention two more.

The first is the idea that our knowledge is not just a set of facts, but is organized into theories that help us understand the world. We call these theories intuitive theories – they are the explanatory frameworks that people carry with them to understand why things happen. What follows from this idea is that when you want to change people’s behavior, you can’t just tell them to change or tell them different facts. You need to change their theory. When I want Madeline to eat her vegetables, it turns out just telling her to “eat broccoli” doesn’t work very well – even if she does eat the broccoli, she won’t know what else to eat or why to eat it. And of course the well-known idea about fostering a growth mindset is precisely this kind of implicit theory: it’s a theory of whether ability is fixed or whether it can be improved with hard work.

The second idea I want to share is that our judgment is systematically biased. It’s biased by our own beliefs. Our minds are wonderful, efficient systems that deal with uncertainty – we piece together a sentence even in a noisy restaurant using our expectations about what that person might be trying to say to us. In most cases, this is an amazing feature of our own cognition, letting us operate flexibly using limited data. But this reliance on our own beliefs also has negative consequences: it leads us to stereotype, and to engage in confirmation bias, looking for evidence that further supports our own beliefs. Understanding these sources of bias can help us avoid falling into this trap. A good grounding in psychology, in other words, helps us be more aware of our own limitations.

I’d love to tell you about more ideas. Every woodworker loves to show off their workbench. And the wonderful thing about tools is that when you use them together you can create new tools, in the same way the carpenter can first make a jig to make it easier to make a difficult cut. I could go on, but hopefully I’ve piqued your curiosity – and you have lots more to do today.

5. So. Make sure that you celebrate! Eat some cake, smash a piñata, and most of all, say your "thank you"s to the people who have supported you during your time here at Stanford. I speak for all of them when I say that we are very proud of you and cannot wait to see what you accomplish.

As this weekend passes and you head off for other things, it is all but certain that you will find yourself in new situations facing challenges that you have not considered before. (Life would not be fun without them!) But I am confident that your tools will be sufficient to the job. Keep them sharp and they will serve you well.

Monday, February 26, 2018

Mixed effects models: Is it time to go Bayesian by default?

(tl;dr: Bayesian mixed effects modeling using brms is really nifty.)

Introduction: Teaching Statistical Inference?

How do you reason about the relationship between your data and your hypotheses? Bayesian inference provides a way to make normative inferences under uncertainty. As scientists – or even as rational agents more generally – we are interested in knowing the probability of some hypothesis given the data we observe. As a cognitive scientist I've long been interested in using Bayesian models to describe cognition, and that's what I did much of my graduate training in. These are custom models, sometimes fairly difficult to write down, and they are an area of active research. That's not what I'm talking about in this blogpost. Instead, I want to write about the basic practice of statistics in experimental data analysis.

Mostly when psychologists do and teach "stats," they're talking about frequentist statistical tests. Frequentist statistics are the standard kind people in psych have been using for the last 50+ years: t-tests, ANOVAs, regression models, etc. Anything that produces a p-value. P-values represent the probability of the data (or data even more extreme) under the null hypothesis (typically "no difference between groups" or something like that). The problem is that this is not what we really want to know as scientists. We want the opposite: the probability of the hypothesis given the data, which is what Bayesian statistics allow you to compute. You can also compute the relative evidence for one hypothesis over another (the Bayes Factor).
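To make the inversion concrete, here is a minimal Python sketch of Bayes' rule for two competing hypotheses. All the numbers are hypothetical, and the function names are mine rather than from any package. Note also that a p-value is a tail probability rather than the likelihood of the exact data, so the likelihood values below are stand-ins chosen only to show how "probability of the data given the hypothesis" differs from "probability of the hypothesis given the data."

```python
def posterior_h1(prior_h1, p_data_given_h1, p_data_given_h0):
    """Return P(H1 | data) via Bayes' rule, assuming H0 and H1 are the only options."""
    prior_h0 = 1.0 - prior_h1
    evidence = prior_h1 * p_data_given_h1 + prior_h0 * p_data_given_h0
    return prior_h1 * p_data_given_h1 / evidence


def bayes_factor(p_data_given_h1, p_data_given_h0):
    """Relative evidence for H1 over H0: the ratio of the two likelihoods."""
    return p_data_given_h1 / p_data_given_h0


# Hypothetical likelihoods: the data are unlikely under the null (0.04)
# and more likely under the alternative (0.20).
P_DATA_H0 = 0.04
P_DATA_H1 = 0.20

print("Bayes factor (H1 vs. H0):", round(bayes_factor(P_DATA_H1, P_DATA_H0), 2))
for prior in (0.5, 0.1):
    post = posterior_h1(prior, P_DATA_H1, P_DATA_H0)
    print(f"prior P(H1) = {prior:.1f} -> posterior P(H1 | data) = {post:.3f}")
```

The same data yield quite different posteriors depending on the prior, which is exactly the information a p-value alone does not carry.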

Now, the best way to set psychology twitter on fire is to start a holy war about who's actually right about statistical practice, Bayesians or frequentists. There are lots of arguments here, and I see some merit on both sides. That said, there is lots of evidence that much of our implicit statistical reasoning is Bayesian. So I tend towards the Bayesian side on the balance <ducks head>. But despite this bias, I've avoided teaching Bayesian stats in my classes. I've felt like, even with their philosophical attractiveness, actually computing Bayesian stats had too many very severe challenges for students. For example, in previous years you might run into major difficulties inferring the parameters of a model that would be trivial under a frequentist approach. I just couldn't bring myself to teach a student a philosophical perspective that – while coherent – wouldn't provide them with an easy toolkit to make sense of their data.

The situation has changed in recent years, however. In particular, the BayesFactor R package by Morey and colleagues makes it extremely simple to do basic inferential tasks using Bayesian statistics. This is a huge contribution! Together with JASP, these tools make the Bayes Factor approach to hypothesis testing much more widely accessible. I'm really impressed by how well these tools work.

All that said, my general approach to statistical inference tends to rely less on inference about a particular hypothesis and more on parameter estimation – following the spirit of folks like Gelman & Hill (2007) and Cumming (2014). The basic idea is to fit a model whose parameters describe substantive hypotheses about the generating sources of the dataset, and then to interpret these parameters based on their magnitude and the precision of the estimate. (If this sounds vague, don't worry – the last section of the post is an example). The key tool for this kind of estimation is not tests like the t-test or the chi-squared. Instead, it's typically some variant of regression, usually mixed effects models.

Mixed-Effects Models

Especially in psycholinguistics where our experiments typically show many people many different stimuli, mixed effects models have rapidly become the de facto standard for data analysis. These models (also known as hierarchical linear models) let you estimate sources of random variation ("random effects") in the data across various grouping factors. For example, in a reaction time experiment some participants will be faster or slower (and so all data from those particular individuals will tend to be faster or slower in a correlated way). Similarly, some stimulus items will be faster or slower and so all the data from these groupings will vary. The lme4 package in R was a game-changer for using these models (in a frequentist paradigm) in that it allowed researchers to estimate such models for a full dataset with just a single command. For the past 8-10 years, nearly every paper I've published has had a linear or generalized linear mixed effects model in it.
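As a rough illustration of what "random effects" means here, the following Python sketch simulates a reaction time experiment in which every subject and every item gets its own hidden offset. It is a toy generative model under assumed values (a 600 ms baseline, a 50 ms condition effect, and the standard deviations below are all invented), not a fit to any real dataset.

```python
import random
import statistics


def simulate_rts(n_subjects=20, n_items=10, seed=1):
    """Simulate RTs with by-subject and by-item random intercepts (all values hypothetical)."""
    rng = random.Random(seed)
    subject_offset = [rng.gauss(0, 80) for _ in range(n_subjects)]  # some people are faster or slower
    item_offset = [rng.gauss(0, 40) for _ in range(n_items)]        # some stimuli are faster or slower
    rows = []
    for s in range(n_subjects):
        for i in range(n_items):
            condition = (s + i) % 2  # balanced assignment of the two conditions
            rt = (600 + 50 * condition          # fixed effects: baseline and condition
                  + subject_offset[s]           # random intercept for this subject
                  + item_offset[i]              # random intercept for this item
                  + rng.gauss(0, 30))           # trial-level noise
            rows.append({"subject": s, "item": i, "condition": condition, "rt": rt})
    return rows


rows = simulate_rts()

# Group trials by subject: all of one person's trials shift together.
by_subject = {}
for row in rows:
    by_subject.setdefault(row["subject"], []).append(row["rt"])
subject_means = [statistics.mean(rts) for rts in by_subject.values()]

print("fastest subject mean RT:", round(min(subject_means)))
print("slowest subject mean RT:", round(max(subject_means)))
```

The spread in subject means is the correlated variation that a by-subject random intercept is meant to absorb; the same logic applies to items.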

Despite their simplicity, the biggest problem with mixed effects models (from an educational point of view, especially) has been figuring out how to write consistent model specifications for random effects. Often there are many factors that vary randomly (subjects, items, etc.) and many other factors that are nested within those (e.g., each subject might respond differently to each condition). Thus, it is not trivial to figure out what model to fit, even if fitting the model is just a matter of writing a command. Even in a reaction-time experiment with just items and subjects as random variables, and one condition manipulation, you can write

(1) rt ~ condition + (1 | subject) + (1 | item)

for just random intercepts by subject and by item, or you can nest condition (fitting a random slope) for one or both:

(2) rt ~ condition + (condition | subject) + (condition | item)

and you can additionally fiddle with covariance between random effects for even more degrees of freedom!

Luckily, a number of years ago, a powerful and clear simulation paper by Barr et al. (2013) came out. They argued that there was a simple solution to the specification issue: use the "maximal" random effects structure supported by the design of the experiment. This meant adding any random slopes that were actually supported by your design (e.g., if condition was a within-subject variable, you could fit condition by subject slopes). While this suggestion was quite controversial,* Barr et al.'s simulations were persuasive evidence that this suggestion led to conservative inferences. In addition, having a simple guideline to follow eliminated a lot of the worry about analytic flexibility in random effects structure. If you were "keeping it maximal" that meant that you weren't intentionally – or even inadvertently – messing with your model specification to get a particular result.

Unfortunately, a new problem reared its head in lme4: convergence. With very high frequency, when you specify the maximal model, the approximate inference algorithms that search for the maximum likelihood solution for the model will simply not find a satisfactory solution. This outcome can happen even in cases where you have quite a lot of data – in part because the number of parameters being fit is extremely high. In the case above, not counting covariance parameters, we are fitting a slope and an intercept across participants, plus a slope and intercept for every participant and for every item.
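The counting in the previous paragraph can be written out as a small Python sketch. It follows the informal tally used here (one intercept and one slope overall, plus an intercept and slope per subject and per item) and ignores covariance parameters. Strictly speaking, lme4 estimates the variances of the random effects rather than a free parameter per level, so treat this as a picture of how many quantities the model has to pin down. The sample sizes are hypothetical.

```python
def count_maximal_quantities(n_subjects, n_items):
    """Tally intercepts and slopes for model (2), following the informal count in the text."""
    fixed = 2                    # overall intercept + overall condition slope
    by_subject = 2 * n_subjects  # intercept + condition slope for each subject
    by_item = 2 * n_items        # intercept + condition slope for each item
    return fixed + by_subject + by_item


# Hypothetical design: 40 participants, 24 stimulus items.
print(count_maximal_quantities(n_subjects=40, n_items=24))  # 2 + 80 + 48 = 130
```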

To deal with this, people have developed various strategies. The first is to do some black magic to try and change the optimization parameters (e.g., following these helpful tips). Then you start to prune random effects away until your model is "less maximal" and you get convergence. But these practices mean you're back in flexible-model-adjustment land, and vulnerable to all kinds of charges of post-hoc model tinkering to get the result you want. We've had to specify lab best-practices about the order for pruning random effects – kind of a guide to "tinkering until it works," which seems suboptimal. In sum, the models are great, but the methods for fitting them don't seem to work that well.

Enter Bayesian methods. For several years, it's been possible to fit Bayesian regression models using Stan, a powerful probabilistic programming language that interfaces with R. Stan, building on BUGS before it, has put Bayesian regression within reach for someone who knows how to write these models (and interpret the outputs). But in practice, when you could fit an lmer in one line of code and five seconds, it seemed like a bit of a trial to hew the model by hand out of solid Stan code (which looks a little like C: you have to declare your variable types, etc.). We have done it sometimes, but typically only for models that you couldn't fit with lme4 (e.g., an ordered logit model). So I still don't teach this set of methods, or advise that students use them by default.

brms?!? A worked example

In the last couple of years, the package brms has been in development. brms is essentially a front-end to Stan, so that you can write R formulas just like with lme4 but fit them with Bayesian inference.** This is a game-changer: all of a sudden we can use the same syntax but fit the model we want to fit! Sure, it takes 2-3 minutes instead of 5 seconds, but the output is clear and interpretable, and we don't have all the specification issues described above. Let me demonstrate.

The dataset I'm working on is an unpublished set of data on kids' pragmatic inference abilities. It's similar to many that I work with. We show children of varying ages a set of images and ask them to choose the one that matches some description, then record if they do so correctly. Typically some trials are control trials where all the child has to do is recognize that the image matches the word, while others are inference trials where they have to reason a little bit about the speaker's intentions to get the right answer. Here are the data from this particular experiment:

I'm interested in quantifying the relationship between participant age and the probability of success in pragmatic inference trials (vs. control trials, for example). My model specification is:

(3) correct ~ condition * age + (condition | subject) + (condition | stimulus)

So I first fit this with lme4. Predictably, the full desired model doesn't converge, but here are the fixed effect coefficients:

                      beta   stderr      z      p
intercept             0.50     0.19   2.65   0.01
condition             2.13     0.80   2.68   0.01
age                   0.41     0.18   2.35   0.02
condition:age        -0.22     0.36  -0.61   0.54

Now let's prune the random effects until the convergence warning goes away. In the simplified version of the dataset that I'm using here I can keep stimulus and subject intercepts and still get convergence when there are no random slopes. But in the larger dataset, the model won't converge unless I do just the random intercept by subject:

                      beta   stderr      z      p
intercept             0.50     0.21   2.37   0.02
condition             1.76     0.33   5.35   0.00
age                   0.41     0.18   2.34   0.02
condition:age        -0.25     0.33  -0.77   0.44

Coefficient values are decently different (but the p-values are not changed dramatically in this example, to be fair). More importantly, a number of fairly trivial things matter to whether the model converges. For example, I can get one random slope in if I set the other level of the condition variable to be the intercept, but it doesn't converge with either in this parameterization. And in the full dataset, the model wouldn't converge at all if I didn't center age. And then of course I haven't tweaked the optimizer or messed with the convergence settings for any of these variants. All of this means that there are a lot of decisions about these models that I don't have a principled way to make – and critically, they need to be made conditioned on the data, because I won't be able to tell whether a model will converge a priori!

So now I switched to the Bayesian version using brms, just writing brm() with the model specification I wanted (3). I had to do a few tweaks: upping the number of iterations (suggested by the warning messages from the output) and changing to a Bernoulli model rather than binomial (for efficiency, again suggested by the error message), but this was very straightforward otherwise. For simplicity I've adopted all the default prior choices, but I could have gone more informative.

Here's the summary output for the fixed effects:

                      estimate  error    l-95% CI u-95% CI
intercept             0.54      0.48    -0.50     1.69
condition             2.78      1.43     0.21     6.19
age                   0.45      0.20     0.08     0.85
condition:age        -0.14      0.45    -0.98     0.84

From this call, we get back coefficient estimates that are somewhat similar to the other models, along with 95% credible interval bounds. Notably, the condition effect is larger (probably corresponding to being able to estimate a more extremal value for the logit based on sparse data), and then the interaction term is smaller but has higher error. Overall, coefficients look more like the first non-convergent maximal model than the second converging one.
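Since these coefficients are on the logit scale, it can help to translate them into probabilities. The Python sketch below copies the estimates and interval bounds from the brms summary above and applies the inverse logit. It assumes that condition is dummy-coded (0 for the reference level, 1 for the other) and that age is centered, so the intercept describes the reference condition at the mean age; which condition is the reference level is not stated, so the labels in the code are placeholders.

```python
import math


def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def interval_excludes_zero(lower, upper):
    """True if the 95% credible interval lies entirely on one side of zero."""
    return lower > 0 or upper < 0


# (estimate, lower 95% bound, upper 95% bound), copied from the summary table above.
fixed_effects = {
    "intercept": (0.54, -0.50, 1.69),
    "condition": (2.78, 0.21, 6.19),
    "age": (0.45, 0.08, 0.85),
    "condition:age": (-0.14, -0.98, 0.84),
}

for name, (estimate, lower, upper) in fixed_effects.items():
    flag = "excludes zero" if interval_excludes_zero(lower, upper) else "includes zero"
    print(f"{name:<15} {estimate:>6.2f}  [{lower:.2f}, {upper:.2f}]  {flag}")

# Assumed coding: condition = 0 is the reference level, age = 0 is the mean age.
p_reference = inv_logit(fixed_effects["intercept"][0])
p_other = inv_logit(fixed_effects["intercept"][0] + fixed_effects["condition"][0])
print(f"P(correct), reference condition at mean age: {p_reference:.2f}")
print(f"P(correct), other condition at mean age:     {p_other:.2f}")
```

Under those coding assumptions, the large condition coefficient corresponds to accuracy near ceiling in the non-reference condition, which fits the remark above about estimating an extreme logit from sparse data.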

The big deal about this model is not that what comes out the other end of the procedure is radically different. It's that it's not different. I got to fit the model I wanted, with a maximal random effects structure, and the process was almost trivially easy. In addition, and as a bonus, the CIs that get spit out are actually credible intervals that we can reason about in a sensible way (as opposed to frequentist confidence intervals, which are quite confusing if you think about them deeply enough).

Conclusion

Bayesian inference is a powerful and natural way of fitting statistical models to data. The trouble is that, up until recently, you could easily find yourself in a situation where there was a dead-obvious frequentist solution but off-the-shelf Bayesian tools wouldn't work or would generate substantial complexity. That's no longer the case. The existence of tools like BayesFactor and brms means that I'm going to suggest that people in my lab go Bayesian by default in their data analytic practice.

----
Thanks to Roger Levy for pointing out that model (3) above could include an age | stimulus slope to be truly maximal. I will follow this advice in the paper.

* Who would have thought that a paper about statistical models would be called "the cave of shadows"?

** Rstanarm did this also, but it covered fewer model specifications and so wasn't as helpful.

Wednesday, February 15, 2017

Damned if you do, damned if you don't

Here's a common puzzle that comes up all the time in discussions of replication in psychology. I call it the stimulus adaptation puzzle. Someone is doing an experiment with a population and they use a stimulus that they created to induce a psychological state of interest in that particular population. You would like to do a direct replication of their study, but you don't have access to that population. You have two options: 1) use the original stimulus with your population, or 2) create a new stimulus designed to induce the same psychological state in your population.

One example of this pattern comes from RPP, the study of 100 independent replications of psychology studies from 2008. Nosek and E. Gilbert blogged about one particular replication, in which the original study was run with Israelis and used as part of its cover story a description of a leave from a job, with one reason for the leave being military service. The replicators were faced with the choice of using the military service cover story in the US where their participants (UVA undergrads) mostly wouldn't have the same experience, or modifying to create a more population-suitable cover story. Their replication failed. D. Gilbert et al. then responded that the UVA modification, a leave due to a honeymoon, was probably responsible for the difference in findings. Leaving aside the other questions raised by the critique (which we responded to), let's think about the general stimulus adaptation issue.

If you use the original stimulus with a new population, it may be inappropriate or incongruous. So a failure to elicit the same effect is explicable that way. On the other hand, if you use a new stimulus, perhaps it is unmatched in some way and fails to elicit the intended state as well. In other words, in terms of cultural adaptation of stimuli for replication, you're damned if you do and damned if you don't. How do we address this issue?

How do you argue for diversity?

During the last couple of months I have been serving as a member of my department's diversity committee, charged with examining policies relating to diversity in graduate and faculty recruitment. I have always put a value on the personal diversity of the people I worked with. But until this experience, I hadn't thought about how unexamined my thinking on this topic was, and I hadn't explicitly tried to make the case for diversity in our student population. So I was unprepared for the complexity of this issue.* As it turns out, different people have tremendously different intuitions on how to – and whether you should – argue for diversity in an educational setting.

In this post, I want to enumerate some of the arguments for diversity I've collected. I also want to lay out some of the conflicting intuitions about these arguments that I have encountered. But since diversity is an incredibly polarizing issue, I also want to be sure to give a number of caveats. First, this blogpost is about the topic of other people’s responses to arguments for diversity; I’m not myself making any of these arguments here. I do personally care about diversity and personally find some of these arguments more and less compelling, but that’s not what I’m writing about. Second, all of this discussion is grounded in the particular case of understanding diversity in the student body of educational institutions (especially in graduate education). I don’t know enough about workplace issues to comment. Third, and somewhat obviously, I don’t speak for anyone but myself. This post doesn’t represent the views of Stanford, the Stanford psych department, or even the Stanford Psych diversity committee.

Onboarding

Reading twitter this morning I saw a nice tweet by Page Piccinini, on the topic of organizing project folders:

@chbergma @JeffRouder @fusaroli TL;DR For any analysis in R I have the following folders: 1) data, 2) scripts, 3) figures, 4) write_up.
— Page Piccinini (@pageinini) January 3, 2017

This is exactly what I do and ask my students to do, and I said so. I got the following thoughtful reply from my old friend Adam Abeles:

@mcxfrank what's your onboarding process? You should have creating that structure then on the checklist to save the pain later.
— Adam Abeles (@aabeles) January 3, 2017

He's exactly right. I need some kind of onboarding guide. Since I'm going to have some new folks joining my lab soon, no time like the present. Here's a brief checklist for what to expect from a new project.
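As a minimal sketch of the folder convention from Page's tweet (not the onboarding checklist itself), here is one way to set up a new project. The four folder names come from the tweet; the function name `create_project_skeleton` and the use of Python are assumptions made for illustration.

```python
from pathlib import Path

# Folder names are taken from the tweet; everything else here is hypothetical.
PROJECT_FOLDERS = ("data", "scripts", "figures", "write_up")


def create_project_skeleton(root):
    """Create the four standard folders under `root` and return their paths."""
    root = Path(root)
    created = []
    for name in PROJECT_FOLDERS:
        folder = root / name
        # exist_ok=True makes the call safe to re-run on an existing project.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `create_project_skeleton("my_new_project")` on day one would give every project the same layout, which is the point of putting it on a checklist.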

Can we improve math education with a 5000-year-old technology?

(This post is written jointly by my collaborator David Barner and me; we're posting it to both his new blog, MeaningSeeds, and to mine).

The first calculating machines invented by humans – stone tablets with grooves that contained counting stones or "calculi" – are no match for contemporary computers in terms of computational power. But they and their descendants, in the form of the modern Soroban abacus, may have an edge on modern techniques when it comes to mathematics education. In a study about to appear in Child Development, co-authored with George Alvarez, Jessica Sullivan, and Mahesh Srinivasan, we investigated a recent trend in math education that emanates from these first counting boards: The use of "mental abacus."

The abacus, which originates from Babylonian counting boards dating back to at least 2700 BC, has been used in a dozen different cultures in different forms for tallying, accounting, and basic arithmetic procedures like addition, subtraction, multiplication and division. And recently, it has made a comeback in classrooms around the world, as a supplement to K-12 elementary mathematics. The most popular form of abacus – the Japanese Soroban (pictured below) – features a collection of beads arranged into vertical columns, each of which represents a place value – ones, tens, hundreds, thousands, etc. At the bottom of each column are four "earthly" beads, each of which represents a multiple of 1. On top is one "heavenly" bead, which represents a multiple of 5. When beads are moved toward the dividing beam, they are "in play", such that each column can represent a value up to 9.
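To make the bead arithmetic concrete, here is a small sketch of how a number maps onto Soroban columns as described above: one column per decimal digit, with a "heavenly" bead worth 5 and up to four "earthly" beads worth 1 each. The function names and the tuple layout are our own illustrative choices, not a standard notation.

```python
def soroban_columns(n):
    """Return one (heavenly, earthly) pair per decimal digit, most significant first.

    heavenly is 0 or 1 (the bead worth 5); earthly is 0-4 (the beads worth 1).
    """
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    columns = []
    for ch in str(n):
        digit = int(ch)
        # Split each digit into its five-part and its ones-part.
        heavenly, earthly = divmod(digit, 5)
        columns.append((heavenly, earthly))
    return columns


def soroban_value(columns):
    """Read a list of (heavenly, earthly) pairs back into an integer."""
    total = 0
    for heavenly, earthly in columns:
        # Each step left is one place value higher, hence the factor of 10.
        total = total * 10 + 5 * heavenly + earthly
    return total
```

For example, `soroban_columns(279)` gives `[(0, 2), (1, 2), (1, 4)]`: two earthly beads in the hundreds column, the heavenly bead plus two earthly beads in the tens column (5 + 2 = 7), and the heavenly bead plus four earthly beads in the ones column (5 + 4 = 9).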

When children learn mental abacus, they first are taught to represent numbers on the physical device, and then to add and subtract quantities by moving beads in and out of play. After some months of practice, they are then asked to do sums by simply imagining an abacus, rather than using the actual physical device. This mental version of the abacus has clear – and sometimes profound – computational benefits for some expert users. Highly trained users – called "masters" by those in the abacus world – can instantly encode and recall long strings of numbers, can add two-digit numbers as fast as they can be called out in sequence, and can compute square roots – and even cube roots – almost instantaneously, even for large numbers. Most startling of all, these techniques can be practiced while simultaneously talking, and can be mastered by children as young as 10 years of age with record-breaking results (see also here, here, and here). If you haven’t ever seen this phenomenon, take a look at the YouTube video below. It is truly remarkable stuff.

In our study we asked whether this technique can be mastered to good effect by ordinary school children, in big, busy, modern classrooms. We conducted the research in Vadodara, a medium-sized industrial town on the west coast of India, where abacus has recently become a popular supplement to standard math training in both after-school and standard K-12 settings. At the charitable school we visited, abacus training was already underway and was being taught to hundreds of children starting in Grade 2, in classrooms of 70 children per group. To see whether it was having a positive effect, we enrolled a new, previously untrained, cohort of roughly 200 Grade 2 kids and randomly assigned them to receive either abacus training from expert teachers or extra hours of standard math training, in addition to their regular math curriculum.

Even in these relatively large classrooms of children from low-income families, mental abacus technique edged out standard math. Though effects were modest in this group, they were reliable across multiple measures of math ability. Also, children attained the best mastery of mental abacus if they began the study with strong spatial working memory abilities (to get a sense of how we measured spatial working memory take a look at this video).

Why did abacus have this positive effect? One possibility is that learning a different way of representing numbers helped kids make generalizations about how numbers work. For example, the abacus – like other math manipulatives – provides a concrete representation of place value – i.e., the idea that the same digit can represent a different quantity depending on its position (e.g., the first and second 3 in “33” represent 30 and 3 respectively). This better representation might have helped kids understand the conceptual basis of arithmetic. Another possibility is that the edge was chiefly due to the highly procedural nature of mental abacus training. Operations are initially learned as sequences of hand movements, rather than as linguistic rules, and according to users can be performed almost automatically, without reflection. Finally, it's possible that it's this unique mix of conceptual concreteness and procedural efficacy that gives the abacus its edge. Children may not have to learn procedures and then separately learn how these operations relate to objects and sets in the world: Abacus may allow both to be learned at the same time, a welcome tonic to the ongoing math wars.

Right now it's uncertain why mental abacus helps kids, and whether the effects we've found will last beyond early elementary school. Also, the technique has yet to be rigorously tested on US shores, where it's currently being adopted by public schools in at least two states. This is the focus of a new study, currently underway, which will test whether this ancient calculation technique should be left in museums, or instead be widely adopted to boost math achievement in the 21st century.

Monday, March 2, 2020

Advice on reviewing

Monday, June 18, 2018