Babies Learning Language: Was Piaget a Bayesian?

tl;dr: Analogies between Piaget's theory of development and formal elements in the Bayesian framework.

Intro

I'm co-teaching a course with Alison Gopnik at Berkeley this quarter. It's called "What Changes?" and the goal is to revisit some basic ideas about what drives developmental changes. Here's the syllabus, if you're interested. As part of the course, we read the first couple of chapters of Flavell's brilliant book, "The Developmental Psychology of Jean Piaget." I had come into contact with Piagetian theory before of course, but I've never spent that much time engaging with the core ideas. In fact, I don't actually teach Piaget in my intro to developmental psychology course. Although he's clearly part of the historical foundations of the discipline, to a first approximation, a lot of what he said turned out to be wrong.

In my own training and work, I've been inspired by probabilistic models of cognition and cognitive development. These models use the probability calculus to represent degrees of belief in different hypotheses, and have been influential in a wide range of domains from perception and decision-making to communication and social cognition.¹ But as I have gotten more interested in the measurement of developmental change (e.g., in Wordbank or MetaLab, two new projects I've been involved in recently), I've become a bit more frustrated with these probabilistic tools, since there hasn't been as much progress in using them to understand children's developmental change (in contrast to progress characterizing the nature of particular representations). Hence my desire to teach this course and understand what other theoretical frameworks had to contribute.

Despite the seeming distance between the modern Bayesian framework and Piaget, reading Flavell's synthesis I was surprised to see that many of the key Piagetian concepts actually had nice parallels in Bayesian theory. So this blogpost is my attempt to translate some of these key concepts in Piaget's theory into a Bayesian vocabulary.² It owes a lot to our class discussion, which was really exciting. For me, the translation highlights significant areas of overlap between Piagetian and Bayesian thinking, as well as some nice places where the Bayesian theory could grow.

Organization: The growth of structured knowledge

The first key part of Piaget's theory is that knowledge is structured and organized, and that the particular structure of knowledge changes with experience. In domains like language, physical reasoning, or social cognition, learning happens in response to environmental inputs, and the knowledge being learned is fundamentally structured, rather than an undifferentiated mass of sense-data or associations. (For Piaget, these environmental inputs are often generated by actions on the part of the learner, but we'll leave this out for now and come back to it below.)

The important role of organization in Piaget is a good fit with Bayesian theories. The simplest Bayesian treatment of learning starts with the idea of accumulation of data changing the probability of a structured set of hypotheses. This set is typically called the "hypothesis space" H. Indeed, the first step in almost any Bayesian cognitive modeling project is to work out the hypothesis space, defining the set of possible representations that the model will consider.

In the simplest Bayesian model, learning then is assigning weights (posterior probabilities) to hypotheses in this space. For a particular hypothesis, the model starts with the learner's prior probability distribution: p(h). Then the target is to compute the updated posterior probability distribution, given the data: p(h | d). This distribution can be computed by Bayes' rule:

p(h | d) ∝ p(d | h) p(h)

Bayes' rule states that the posterior probability of a particular hypothesis given the data is proportional to the product of the prior probability of that hypothesis and the likelihood of the data given the hypothesis. Written as a generative graphical model – a description of the causal structure of the world – this model looks pretty simple:

h -> d

In other words, some hypothesis is the case, and that hypothesis leads to the observed data. Bayesian inference is used to recover the highest probability hypothesis.
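To make the mechanics concrete, here is a minimal sketch of Bayes' rule over a tiny discrete hypothesis space. The two hypotheses (a "fair" and a "biased" coin) and all of the numbers are made up for illustration; they are not taken from any of the models discussed in this post.

```python
def posterior(prior, likelihood):
    """Compute p(h | d) for every h, given p(h) and p(d | h)."""
    # Unnormalized posterior: p(d | h) * p(h) for each hypothesis.
    unnormalized = {h: likelihood[h] * prior[h] for h in prior}
    # Normalizing turns the proportionality in Bayes' rule into an equality.
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


# Hypothetical hypothesis space H with a flat prior p(h).
prior = {"fair": 0.5, "biased": 0.5}
# Hypothetical likelihoods p(d | h) for a single datum d = "heads".
likelihood = {"fair": 0.5, "biased": 0.9}

post = posterior(prior, likelihood)
# The highest-probability hypothesis after seeing the datum.
best = max(post, key=post.get)
```

The normalization step is what the ∝ sign hides: the products p(d | h) p(h) only become probabilities once they are divided by their sum over the whole hypothesis space.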

To see how this works applied to a developmental example, let's look at theory of mind. A classic view of theory of mind development – ironically stemming from Piagetian thinking and now likely to be false – was that kids started out with the view that everyone has the same beliefs (egocentric). They then gradually shift to the view that others have different desires (first) and then beliefs (second) from them.

Goodman et al. (2006) reported a simple and elegant model of this shift (albeit in a domain where the actual empirical data have gotten much more complicated in the intervening years). The space of "theories of mind" (different ways of thinking about others' beliefs) was the hypothesis space; in this case, they simply compared egocentric and non-egocentric belief models. The learner then was assumed to observe successively greater amounts of data about the behavior of agents, evaluating their posterior probability distribution over the two different theories along the way. With a small amount of data, the more parsimonious (higher prior probability) egocentric theory won out. But with more data, the model shifted to the more complex, but more adult-like non-egocentric theory.

This (early) model exhibits learning from data over structured hypotheses, but that's where the resemblance to Piaget ends. In particular, the transition between theories is not abrupt and stage-like but instead a gradual reweighting of hypotheses with respect to evidence.
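The gradual character of this kind of shift can be seen in a toy two-hypothesis example. This is not the Goodman et al. model, just a sketch under invented assumptions: a "simple" theory with a high prior, a "complex" theory with a low prior, and a stream of identical observations that are each somewhat more likely under the complex theory. The function name `p_complex` and every number are hypothetical.

```python
import math


def p_complex(n_obs, prior_complex=0.1, lik_simple=0.4, lik_complex=0.6):
    """Posterior probability of the complex theory after n_obs observations."""
    # Work in log-odds: prior odds plus one likelihood ratio per observation.
    log_odds = math.log(prior_complex / (1 - prior_complex))
    log_odds += n_obs * math.log(lik_complex / lik_simple)
    # Convert log-odds back to a probability.
    return 1 / (1 + math.exp(-log_odds))


# Posterior for the complex theory after 0, 1, ..., 20 observations.
trajectory = [p_complex(n) for n in range(21)]

# Size of the biggest jump between consecutive amounts of data.
largest_step = max(b - a for a, b in zip(trajectory, trajectory[1:]))
```

Under these assumptions the simple theory is favored at first and the complex theory takes over after a handful of observations, but no single observation moves the posterior very far: the curve is a smooth sigmoid in the amount of data rather than a step.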

Abrupt, stage-like transitions

Hierarchical Bayesian models (HBMs) present a framework for describing one of these challenges, namely abrupt, stage-like transitions between different knowledge structures. Hierarchical models posit a second level of generative structure in the environment, where hypotheses are generated by theories:

t -> h -> d

These theories describe different classes of possible hypotheses about the particulars of a system. Datapoints are generated by hypotheses about the particulars, but those particulars are generated by a framework theory. The mathematics of this type of model are similar, but with the addition of theories (t) on which hypotheses are conditioned.

p(h, t | d) ∝ p(d | h, t) p(h | t) p(t)

The critical feature of HBMs that produces stage-like changes is this: The weight assigned to individual hypotheses (the middle level) does not change linearly with the amount of evidence available. That's because the hypothesis is linked to a particular theory. If that theory is favored, then all of the hypotheses associated with it will be relatively high probability. But evidence inconsistent with any of the hypotheses related to a theory can make a theory relatively less likely. And if all of the hypotheses related to the theory become less likely, that leads to a shift in the overall probability of the theory (and hence even lower probability of each individual hypothesis). In other words, because hypotheses "belong to" theories, this couples them together and they may rise or fall in probability in a relatively more abrupt, stage-like way.
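Here is a small sketch of that coupling, again with invented ingredients rather than anything from a published model. Two hypothetical theories ("low" and "high") each license three hypotheses about a coin's bias, p(h | t) is uniform within a theory, and the likelihood depends only on the hypothesis, so p(d | h, t) = p(d | h).

```python
from math import comb

# Hypothetical theories: each has a prior p(t) and the coin biases it allows.
theories = {
    "low": {"prior": 0.8, "hypotheses": [0.1, 0.2, 0.3]},
    "high": {"prior": 0.2, "hypotheses": [0.7, 0.8, 0.9]},
}


def joint_posterior(heads, flips):
    """p(h, t | d), proportional to p(d | h) p(h | t) p(t)."""
    weights = {}
    for t, spec in theories.items():
        p_h_given_t = 1 / len(spec["hypotheses"])  # uniform within a theory
        for theta in spec["hypotheses"]:
            # Binomial likelihood of the observed flips under bias theta.
            lik = comb(flips, heads) * theta**heads * (1 - theta) ** (flips - heads)
            weights[(t, theta)] = lik * p_h_given_t * spec["prior"]
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()}


def theory_posterior(joint):
    """Marginalize hypotheses out of the joint to get p(t | d)."""
    out = {}
    for (t, _theta), p in joint.items():
        out[t] = out.get(t, 0.0) + p
    return out


before = theory_posterior(joint_posterior(0, 0))   # no data: just the prior
after = theory_posterior(joint_posterior(8, 10))   # 8 heads in 10 flips
```

With no data the marginal over theories is just the prior, which favors "low". After 8 heads in 10 flips, every hypothesis under "low" fits badly, so the whole family loses weight together and almost all of the posterior mass moves to "high" and its hypotheses.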

A nice example of an HBM with discrete transitions between theories is Perfors, Tenenbaum, & Regier (2011). They considered a set of distinct structural theories about language, including context-free grammars as well as simpler models like finite state grammars and even more basic memorization grammars. Each of these theories didn't itself predict data, but it defined a class of grammars (the hypotheses) that could be learned from data. In their simulations, they increased the amount of data available to their model and showed how it transitioned between different classes of grammars. With a small amount of data, the model learned a simple flat memorization grammar, but as the amount of data grew, it transitioned to qualitatively different representations, eventually converging on the context-free grammar.

Assimilation and accommodation: Dealing with data

Although the general apparatus of learning in Piaget's theory proceeds over the backdrop of organized knowledge, a major focus of the theory is on the tension between assimilation and accommodation. Assimilation describes the way that new data are processed with respect to the current knowledge state, while accommodation describes the way the knowledge state is adjusted with respect to new data. Their reciprocal relationship produces patterns of stability (assimilation) and change (accommodation).

This set of mechanisms doesn't fit that nicely into the view of Bayesian learning that I've sketched out in the previous sections. Accommodation certainly happens – that's the process of adjusting the probabilities of different hypotheses.³ But assimilation is nowhere to be found. The data are the data.

But this situation is completely standard in the tradition of probabilistic inferences with respect to perception. In perceptual environments, the key problem is that percepts are not veridical reflections of the external environment. They are instead corrupted by sensory noise and must be recovered. This recovery can be assisted by the perceiver's expectations – if the observations are consistent with those expectations – or harmed by them – in the case of something unexpected.

Formally, we have the same set of mechanics as in our simple model of p(h | d), except that we don't get to see the actual data. Instead, we observe d*, some noisy reflection of the data, and if we are lucky, we know something about the process by which d became d* (e.g., p(d* | d)). Then we can work backwards, summing over the values the unobserved d could have taken, and figure out that:

p(h | d*) ∝ p(h) Σ_d p(d* | d) p(d | h)
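As a rough sketch of how this plays out, here is a toy noisy-channel computation over words. The hypotheses ("dinner" vs. "garden" as the topic of conversation), the words, and every probability are invented for illustration. The learner hears d*, does not know the true word d, and so sums over the possibilities for d.

```python
# Hypothetical prior over conversation topics, p(h).
prior_h = {"dinner": 0.5, "garden": 0.5}

# Hypothetical p(d | h): which word a speaker would actually say under each topic.
p_d_given_h = {
    "dinner": {"peas": 0.99, "bees": 0.01},
    "garden": {"peas": 0.30, "bees": 0.70},
}

# Hypothetical noise model p(d* | d): outer key is the word said, inner key the word heard.
p_dstar_given_d = {
    "peas": {"peas": 0.8, "bees": 0.2},
    "bees": {"peas": 0.2, "bees": 0.8},
}


def posterior_over_h(d_star):
    """p(h | d*), summing over the unobserved true word d."""
    weights = {}
    for h, p_h in prior_h.items():
        marginal = sum(
            p_dstar_given_d[d][d_star] * p_d for d, p_d in p_d_given_h[h].items()
        )
        weights[h] = marginal * p_h
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


def posterior_over_d(d_star, h):
    """p(d | d*, h): what was probably said, given what was heard and the topic."""
    weights = {d: p_dstar_given_d[d][d_star] * p_d for d, p_d in p_d_given_h[h].items()}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}


said_at_dinner = posterior_over_d("bees", "dinner")
topic_given_bees = posterior_over_h("bees")
```

Under these made-up numbers, a listener who is confident the topic is dinner and hears "bees" should conclude that "peas" was almost certainly said: the expectation overrides the percept. A listener who is unsure of the topic instead shifts toward "garden".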

There are many examples of this treatment of perceptual uncertainty in vision research (e.g., seeing faint, slightly off-vertical lines as vertical because you have a prior that lines go up and down). But here's an example from higher-level cognition that I really like. Levy (2008) discusses a class of sentences, like this one:

(1) The coach smiled at the player tossed the frisbee.

In studies of these sorts of sentences, readers got completely sidetracked, even though the sentences themselves were syntactically unambiguous (Q: "who did the coach smile at?" A: "the player who was tossed the frisbee"). The key observation that Levy made was that people might assume that they had misread or that there was a typo, and go back to check whether they had seen:

(2) The coach smiled as the player tossed the frisbee.

Sentence (2) means something very different, but is an overall more likely sentence than the grammatical-but-awkward (1). Under Levy's "noisy channel" model, we use as evidence for the error the fact that, had it happened, the original sentence would have been much easier to understand. Under our current analogy, readers were not just accommodating the odd structure of (1), they were actually considering assimilating (1) into (2), which was more likely under their framework.⁴

Other constructs that don't translate as well

Egocentrism and equilibration. Early egocentrism was also a critical part of Piaget's theory. Assimilation and accommodation were initially the same for infants and then gradually grew to be in opposition to one another. I think the idea here is that for very young infants, the first time they have a particular experience, they are both assimilating (making sense of the experience) and accommodating (creating a schema around that single experience). There's a sense in which this developmental story sounds a bit like noisy-channel learning, starting from very vague priors. Early on, datapoints are extremely influential – you basically form your hypotheses and theory around the experiences you've had. Later on, however, it's harder to change a more established theory, and so datapoints get assimilated more and accommodated less. So that's at least a change in balance between assimilation and accommodation. On the other hand, I have no idea how to accommodate the Piagetian idea of not having boundaries between the agent and the world. That just seems like a different notion that doesn't have an easy place in the Bayesian worldview.

Vertical and horizontal décalage. Décalage (or displacement) refers to situations in the stage theory where either behaviors seen in previous stages reoccur (vertical) or where there is some heterogeneity in stage-consistent behaviors across domains (horizontal). I'm not sure how well these concepts translate into the Bayesian framework. Since nothing about the Bayesian approach, even with hierarchical models, implies any similarities in approach or structure across content domains, these particular concepts feel as though they might be more internal to the stage-theory aspects of Piaget, which overall are the ones that have held up least well over the years.

Conclusions

In this blogpost I've tried to draw out the analogies (and occasional dis-analogies) between Piagetian and Bayesian theory. Despite the massive distance in vocabulary between modern Bayesian work and Piaget, I was still surprised to see so much congruence. In particular, the relationship between hierarchical models / noisy channel models on the one hand and stage-transitions / assimilation-accommodation on the other seems quite striking.

On the other hand, I would not want to imply that the parallel between the two theories is complete. There are many areas of dissimilarity. Some of these relate to places where the Piagetian theory was found wanting empirically. For example, the influential critique by Gelman & Baillargeon (1983) describes many empirical issues with the idea of cross-cutting stages that share structural features across domains. In contrast, congruent with modern work in developmental psychology, Bayesian theories have primarily focused on what happens within content domains.

But some areas of dissimilarity highlight weaknesses of the Bayesian framework in its progress understanding development. For example, the Piagetian approach is very focused on the child as active learner, creating evidence. Although there's been some recent excitement about active learning in cognitive development, there is a long way to go both empirically and theoretically in integrating active learning into the Bayesian developmental worldview.

In addition, on Piaget's view, both play and imitation have important theoretical roles. Play is pure assimilation, where the child gets to sample from their own model of the world and explore the scope and limits of that model. Although there's been a lot of exciting work on play in development more generally, Bayesians are just beginning to consider the role of this kind of internal simulation. And imitation is pure accommodation, where the child stops trying to assimilate input into their own schemas. This kind of "suspension of the model" – where the agent considers the perceptual data unfiltered by their priors – hasn't really been seriously considered in Bayesian cognitive development as far as I know.

Bayesian models have been an influential tool for thinking about cognitive development in the last 10-15 years, but – as I suggested above – they have often been more focused on the form of children's knowledge than on the processes of how that knowledge changes. With that critique in mind, work in the Bayesian tradition could really benefit from connections to prior developmental theory.

¹ For a great introduction to the general ideas behind these models, Tenenbaum et al. (2011) lays out the structure and approach. And for a tutorial introduction to how Bayesian models are used in cognitive development, Perfors et al. (2011) walk through the fundamentals.
² Of course, I'm not trying to say that other traditions – connectionism, for example – don't also capture some of the same set of concepts. This is just an attempt to spell out one set of analogical mappings.
³ I don't love the turn of phrase "adjusting the probabilities" because it feels like it assumes either a modeler taking an action, or else a homunculus that's responsible for the adjustment process. The actual fact of the matter is just that the model provides a mathematical statement of the probabilities and the numbers come out differently in situations with different amounts of data. There's no "agent" inside the model, it's just a set of equations.
⁴ I especially like this example because my collaborator Dan Yurovsky and I have some evidence that even kids do this kind of correction: When you say "I had bees and carrots" for dinner, they use their expectations to decide whether you actually meant "peas."

1 comment:

Abdellah Fourtassi, April 18, 2016 at 2:43 PM
Very nice blog post, I enjoyed the reading!
I guess there is a way to integrate assimilation/accommodation even in the simpler case where only d and h are involved. While, as you mentioned, accommodation can be understood as adjusting probabilities of various hypotheses in our *space*, assimilation can be understood as the process that takes place in the *time* interval when data favor a particular hypothesis. Of course accommodation is active continuously through time, but so long as it does not radically change weights, we can say we are in a stage of assimilation.

Thursday, April 14, 2016