Monday, August 12, 2013

Simpson's paradox and age-of-acquisition for words



For a child, what makes a word easy to learn? One of the key things is that the word be frequent. A word that is heard frequently gives the child many more chances to remember its form and infer its meaning. This relationship between frequency and age of acquisition (AoA) is probably the most robust finding in predicting when words will be learned (although this is a dubious distinction because there aren't too many other robust predictors that have been studied across multiple contexts). But does this relationship hold up for all types of words? Or is it only (or primarily) true for nouns? In this post I'll discuss this relationship in the context of a statistical effect called Simpson's Paradox (illustrated above).

The first comprehensive study of the frequency/AoA relationship that I know of is by Huttenlocher et al. (1991). (Previous work by Brown and others had looked at these data for individual word classes.) Huttenlocher and colleagues examined the correlation between the frequency of word production in mothers' speech and the age at which children produced the same word, finding a striking -.65 correlation between log frequency and AoA (both estimated from a detailed set of transcripts of parent-child talk at sessions from 16 to 26 months). Note that a negative correlation means that if you hear the word more, you learn it earlier. Even more striking, within individual caregivers, the magnitude of the correlation was usually up in the range of .8! But in order to get this correlation, Huttenlocher et al. only looked at "content words," excluding articles and other closed class words as well as words that were heard infrequently in their sample. More recent studies haven't found correlations nearly this high. Why is that?

For example, a study by Goodman, Dale, and Li (2008) looked at correlations between corpus frequency in CHILDES and population production and comprehension norms from the MacArthur-Bates Communicative Development Inventory (CDI, a parent report measure). This approach averages across many different children, with the hope that the better measurement afforded by having more data cancels out the idiosyncratic differences you would normally find between caregiver-child dyads (e.g. some moms talk about cars, others talk about tea sets). Perhaps because of this move, Goodman et al. find fairly mixed results, with a positive relationship between the average frequency of words in different syntactic categories and their AoA (meaning more frequent words are learned later):


What's going on here? Here's one thought: Simpson's paradox is a statistical phenomenon that arises when you are interested in quantifying the relationship between two variables, and there is some grouping variable that is related to both of them. When you pool across groups, the overall trend can differ from, or even reverse, the trend within each group. So the trend you're looking for may actually be present in each subgroup, even if it is hidden in the pooled data. In the illustration at the top of this post, the relationship between x and y is overall negative, but positive within each group.
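Here is a minimal sketch of that reversal in Python. The points are made up for illustration (they are not from any of the studies discussed here), and pearson is a small helper written for this example rather than a library function. Each of the three groups has a rising trend, but the groups with larger x values sit lower on y, so the pooled correlation comes out negative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up (x, y) points: y rises with x inside every group,
# but groups further to the right start lower on y.
groups = {
    "a": ([1, 2, 3], [7, 8, 9]),
    "b": ([4, 5, 6], [4, 5, 6]),
    "c": ([7, 8, 9], [1, 2, 3]),
}

# Correlation computed separately inside each group.
within = {name: pearson(xs, ys) for name, (xs, ys) in groups.items()}

# Correlation computed after throwing all points into one pool.
pooled_x = [x for xs, _ in groups.values() for x in xs]
pooled_y = [y for _, ys in groups.values() for y in ys]
pooled = pearson(pooled_x, pooled_y)

print("within groups:", within)
print("pooled:", pooled)
```

With these toy numbers every within-group correlation is positive while the pooled correlation is negative, which is the same pattern as in the illustration.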

In the context of word learning, the hypothesis is this: AoA is negatively predicted by frequency within syntactic categories but the relationship is weaker or even positive across categories. That's because the most frequent words, closed class words like "the" and "of," are hard to learn and tend to be dropped from children's early telegraphic speech. To examine this feature of the frequency/AoA relationship, Brandon Roy, Deb Roy, and I have been using the Human Speechome Project (HSP), an ultra-dense set of videos and transcripts of the life of one child (Deb Roy's informative TED talk here). (NB: Brandon and Deb are not related.)

In our 2009 paper, we found a similar pattern to the previous reports: the frequency with which the child in HSP heard words was a predictor of AoA, both within and (more weakly) across syntactic categories. Here's a replot of the data from that paper, so that the axes are the same as the one above from the Goodman et al. paper:




You can already get a flavor for the fact that there's a Simpson's paradox effect happening here. For example, the closed class words show the same frequency/AoA relationship (as do all groups except maybe the verbs) but the closed class words are overall much more frequent than the nouns (they are shifted right). Now look at the means alone, where we see the same positive relationship as Goodman et al. do, in contrast to the negative relationship in each of the syntactic categories alone:
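To make the "means alone" comparison concrete, here is a rough sketch in Python. All of the numbers are invented for illustration and are not values from HSP or from Goodman et al.; the category names, the toy_words table, and the pearson helper are assumptions of this example. It computes the frequency/AoA correlation inside each category and then the correlation across the category means.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical (log frequency, AoA in months) pairs per category.
# Closed class words are shifted right (more frequent) and up (later).
toy_words = {
    "nouns":        [(2.0, 20.0), (3.0, 18.0), (4.0, 16.0)],
    "verbs":        [(3.0, 22.0), (4.0, 20.0), (5.0, 18.0)],
    "closed_class": [(5.0, 25.0), (6.0, 23.0), (7.0, 21.0)],
}

# Within each category: more frequent words have earlier AoAs.
within = {
    cat: pearson([f for f, _ in words], [a for _, a in words])
    for cat, words in toy_words.items()
}

# One (mean log frequency, mean AoA) point per category.
mean_freq = [sum(f for f, _ in words) / len(words) for words in toy_words.values()]
mean_aoa = [sum(a for _, a in words) / len(words) for words in toy_words.values()]
across_means = pearson(mean_freq, mean_aoa)

print("within categories:", within)
print("across category means:", across_means)
```

In this toy table the within-category correlations are all negative and the correlation across the three category means is positive, mirroring the qualitative pattern described above.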


The spread of AoAs is smaller than for Goodman et al., but the pattern is nearly identical. Why is the spread smaller? One simple reason is that our sample includes all kinds of idiosyncratic nouns like "cymbal" that aren't in the CDI and are lower frequency than the CDI words. Overall, this moves the average AoA for nouns later, closer to that of closed class words. (If we included words like "fluorocarbon" in the list, nouns would probably be learned on average after closed class words.) The HSP child also learns words a bit faster than the standard CDI children, so all the AoAs are shifted a bit younger than in the Goodman et al. study.

To summarize: This analysis suggests that the Goodman et al. study as well as our 2009 paper both observed a pretty clear Simpson's paradox effect. The strength of the effect is modulated by which words are chosen for the analysis as well as how they are distributed across categories. So, in the initial Huttenlocher et al. study, the magnitude of the correlation was much larger than in later studies because they dropped closed class words (the magnitude was probably further inflated by the dropping of low frequency words and the use of data from a single conversation session).

The upshot of this analysis is that there is likely a very strong relationship between frequency and AoA in child language acquisition, and that effect appears to hold across word classes. The more often words are heard, the earlier they are learned. Nevertheless, the precise magnitude of this correlation across the vocabulary depends strongly on the particular sample of words that are analyzed (and their syntactic category). Stay tuned for more updates from the Human Speechome Project (like this) on the microstructure of word learning.

(HT: Florian Jaeger, who used Simpson's Paradox in a recent article on linguistic typology.)
