Language Researchers' Toolkit

Language research often involves the analysis of audio and written corpora drawn from recorded interactions between people, such as a mother and her child. This analysis frequently requires the transcription and coding of linguistic units in order to study how language changes over development. Carrying out this work manually is extremely time-consuming. This project will develop open access software for fast automatic recognition, coding and transcription of human speech and conversations.

Computational linguists have developed various algorithms that could facilitate the study of language development, but it is often difficult for child language researchers to apply them. We will be developing a Language Researchers' Toolkit, a set of open source tools that makes it easier to apply computational algorithms to the study of language development.

The Toolkit will use a common standard wrapper for corpora (e.g. HDF5) to allow audio and written corpora to be imported easily into various programs for analysis. Programs will be developed that allow linguistic labels created by child language researchers to be time-stamped and aligned with the corpus. Computational researchers can then use these labels to train and evaluate their algorithms.
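As an illustration of what aligning time-stamped labels with a corpus could involve, the following is a minimal sketch and not the Toolkit's actual design. The names `Utterance`, `Label` and `align_labels`, the example tags, and the use of seconds as the time unit are all assumptions made for this example. Storage in a wrapper such as HDF5 is left out, and plain in-memory objects stand in for the corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One transcribed utterance (hypothetical structure)."""
    speaker: str
    start: float  # seconds from the start of the recording (assumed unit)
    end: float
    text: str


@dataclass(frozen=True)
class Label:
    """A researcher-supplied linguistic label covering a time span."""
    start: float
    end: float
    tag: str


def align_labels(utterances, labels):
    """Attach each label to every utterance whose time span overlaps it."""
    aligned = {i: [] for i in range(len(utterances))}
    for label in labels:
        for i, utt in enumerate(utterances):
            # Two spans overlap when each one starts before the other ends.
            if label.start < utt.end and utt.start < label.end:
                aligned[i].append(label.tag)
    return aligned


# Invented example data, for illustration only.
utterances = [
    Utterance("MOT", 0.0, 1.8, "where is the ball"),
    Utterance("CHI", 2.1, 2.9, "ball gone"),
]
labels = [
    Label(0.0, 1.8, "wh-question"),
    Label(2.1, 2.9, "two-word utterance"),
]
aligned = align_labels(utterances, labels)
```

Keying the result by utterance index keeps the labels separate from the transcript itself, so the same corpus could carry several independent sets of labels.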

Computational researchers could also make their algorithms available in a common format, so that they can be applied easily to any corpus in the system without any need for programming experience. The goal is to support both child language researchers and computational linguists by providing a common computational framework that allows their findings to inform each other's work.
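One way a common format for algorithms might look is a shared calling convention plus a registry of named algorithms. The sketch below is an assumption for illustration only: the `REGISTRY` dictionary, the `register` decorator, the `run` function and the convention that an algorithm takes a list of utterance strings and returns one result per utterance are all hypothetical and do not come from the project.

```python
from typing import Callable, Dict, List

# Hypothetical contract: an algorithm maps a list of utterance strings
# to a list holding one result per utterance.
Algorithm = Callable[[List[str]], List[object]]

REGISTRY: Dict[str, Algorithm] = {}


def register(name: str):
    """Decorator that stores an algorithm in the registry under a name."""
    def decorator(func: Algorithm) -> Algorithm:
        REGISTRY[name] = func
        return func
    return decorator


@register("utterance_length")
def utterance_length(utterances: List[str]) -> List[object]:
    """Toy algorithm: number of whitespace-separated words per utterance."""
    return [len(u.split()) for u in utterances]


def run(name: str, utterances: List[str]) -> List[object]:
    """Apply a registered algorithm to a corpus by name."""
    return REGISTRY[name](utterances)


# Invented example corpus, for illustration only.
corpus = ["where is the ball", "ball gone"]
lengths = run("utterance_length", corpus)
```

Under this convention a user would only need to pick an algorithm name and a corpus, and a graphical front end could do so without exposing any code.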

Project Team: Franklin Chang (Lead), Elena Lieven, Julian Pine, Caroline Rowland and Anna Theakston

Start Date: September 2015

Duration: 2.5 years