Establishing Corpora From Existing Data Sources

doi:10.4324/9781315535258-42

ABSTRACT

Corpora are searchable collections of spoken and written language that can be used for linguistic analysis. The largest publicly available spoken corpus was the 10 million words of spoken English in the 100-million-word British National Corpus (BNC). Other important corpora of spoken English are the Cambridge and Nottingham Corpus of Discourse in English, the Cambridge North American Spoken Corpus, the Santa Barbara Corpus of Spoken American English, the Switchboard corpus, and the CallHome corpus. Because of issues with the pricing and (lack of) availability of these corpora, some researchers might consider creating their own. Unfortunately, it is almost prohibitively difficult for individual researchers to create large spoken corpora “from the ground up.” As a result, the most realistic alternative for most researchers is to create corpora from existing resources. This was, for example, the process followed in the creation of the Corpus of Contemporary American English.