Note: childes-db is a project that is a collaboration between Alessandro Sanchez, Stephan Meylan, Mika Braginsky, Kyle MacDonald, Dan Yurovsky, and me; this blogpost was written jointly by the group.
For those of us who study child development – and especially language development – the Child Language Data Exchange System (CHILDES) is probably the single most important resource in the field. CHILDES is a corpus of transcripts of children, often talking with a parent or an experimenter, and it includes data from dozens of languages and hundreds of children. It’s a goldmine. CHILDES has also been around since way before the age of “big data”: it started with Brian MacWhinney and Catherine Snow photocopying transcripts (and then later running OCR to digitize them!). The field of language acquisition has been a leader in open data sharing largely thanks to Brian’s continued work on CHILDES.
Despite these strengths, using CHILDES can sometimes be challenging, especially at the two extremes of use: casual exploration and in-depth computational work. Simple analyses like estimating word frequencies can be done using CLAN – the major interface to the corpora – but these require more comfort with command-line interfaces and programming than can be expected in many classroom settings. On the other end of the spectrum, many of us who use CHILDES for in-depth computational studies like to read in the entire database, parse out many of the rich annotations, and get a set of flat text files. But doing this parsing correctly is complicated, and small decisions in the data-processing pipeline can lead to different downstream results. Further, it can be very difficult to reconstruct a particular data prep in order to do a replication study. We've been frustrated several times when trying to reproduce others' modeling results on CHILDES, not knowing whether our implementation of their model was wrong or whether we were simply parsing the data differently.
To address these issues and generally promote the use of CHILDES in a broader set of research and education contexts, we’re introducing a project called childes-db. childes-db aims to provide both a visualization interface for common analyses and an application programming interface (API) for more in-depth investigation. For casual users, you can explore the data with Shiny apps, browser-based interactive graphs that supplement CHILDES’s online transcript browser. For more intensive users, you can get direct access to pre-parsed text data using our API: an R package called childesr, which allows users to subset the corpora and get processed text. The backend of all of this is a MySQL database that’s populated using a publicly-available – and hopefully definitive – CHILDES parser, to avoid some of the issues caused by different processing pipelines.
For online browsing of the database, we currently provide three visualizations: a word frequency browser (inspired by childfreq), a mean-length-of-utterance (MLU) browser, and a visualization that allows you to see the sizes of different corpora (a function I have often wanted when choosing corpora for studies).
Above is an example of the kind of visualization you can make using the frequency browser. The figure shows the increasing use of the determiner “the” by children in the Providence corpus. You can see that children start out using “the” infrequently, but their rate asymptotically approaches their mother’s rate by around 2 or 2.5 years old. This kind of plot, which formerly would have taken quite a bit of work, is now available after just a few clicks. We are hoping that users will tell us what other common visualizations they would find useful!
On the API side of things, our R package has calls for get_transcripts (to find out transcript names), get_participants (to get who is in particular transcripts), get_tokens and get_types (to get words matching a particular filter), and get_utterances (to get utterances from transcripts or speakers). All of these take a set of filters like collection (e.g., “English-North American”), corpus (“Brown”), child (“Adam”), age, and so forth. For example, if you want to know how many transcripts there are in the Brown corpus, you can just write:
# returns all transcripts in the Brown corpus
d_brown_transcripts <- get_transcripts(collection = NULL,
corpus = "Brown",
child = NULL)
# print the number of rows
nrow(d_brown_transcripts)
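The word-level calls follow the same filter pattern. As a sketch, here is how a query with get_tokens might look, combining the corpus and child filters with a word filter; the exact argument and column names here (in particular, token) are assumptions extrapolated from the filters described above and may differ slightly from the released childesr package:

```r
library(childesr)

# sketch: retrieve Adam's uses of "ball" in the Brown corpus
# (the token argument is assumed, following the filters above)
d_ball <- get_tokens(collection = NULL,
                     corpus = "Brown",
                     child = "Adam",
                     token = "ball")

# count the matching tokens
nrow(d_ball)
```

Because every call returns an ordinary data frame, the results plug directly into standard R tools (dplyr, ggplot2, and so on) for downstream analysis.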
Overall, the guiding philosophy of our API is to provide simple and consistent representations and operations, even if it means that we don't support all the same functions as CHILDES / CLAN. Our database is versioned, so that as CHILDES changes, you can reference a particular persistent version of the dataset. By default, you’ll always get the newest version of the data, but versioning means both that you can recreate others’ work more easily and that you have to worry less about database changes disrupting your own analyses. In future versions, we want to let you reference a particular version directly in your API calls, for perfect reproducibility of particular analyses.
childes-db is still very much a work in progress – it came together thanks to some concerted work by Alessandro, Stephan, and Mika, as well as a hackathon this summer. Everything is open source, and the repositories are linked on the website. We’re hoping to continue to build over the next year, adding more corpora (right now we only provide English but we'll be expanding soon); multiple annotation layers (e.g., parses from natural language processing tools); and more versioning tools. But we’re especially interested in hearing from users as to what functions they would like to see, so please get in touch with your feedback.
Check it out at http://childes-db.stanford.edu.