Spoken Learner Corpus

About the Trinity Lancaster Corpus project

The Spoken Learner Corpus (SLC) Project is a collaboration between Trinity College London and the Centre for Corpus Approaches to Social Science (CASS) at Lancaster University.

The aim of the project has been to create a large corpus of learner (and examiner) speech that can be used in a wide range of research contexts, including Second Language Acquisition, language testing, and L2 pedagogy and materials development.

The corpus currently stands at over 4 million words. It has been created from recordings of Trinity’s Graded Examinations in Spoken English (GESE) across a range of grades from B1 to C2 on the CEFR scale. It represents language used in a variety of speaking tasks that reflect speech events in the world outside the test, and it covers a range of language backgrounds.

How can we use the corpus?

As a unique research resource, the Trinity Lancaster Corpus enables the investigation of learner speech at different proficiency levels (advanced, intermediate and lower intermediate/threshold) and the analysis of spoken learner production across different tasks (both monologic and interactive). The corpus samples the language of learners with a variety of L1 backgrounds, representing English speakers from Italy, Spain, Mexico, Argentina, Brazil, China, India, Sri Lanka and Russia. This will allow us to report back to those learners on their specific proficiencies and needs for development. It also facilitates the development of locally focused teaching materials and test support activities. See our current range of corpus-informed teaching resources.

Corpus analysis is likely to become more sophisticated in the future, especially with multiple layers of corpus annotation that allow searching according to different linguistic and background criteria. The Trinity Lancaster Corpus aspires to become a leading research tool in this respect.

What is a language corpus?

A language corpus is a collection of texts, either written or spoken, which is compiled digitally for the purpose of language analysis. Advances in computer technology mean that it is now possible to create very large corpora (millions of words), store them in digital form, and analyse them automatically or semi-automatically.

The recorded speech is entered and coded with a variety of tags so that users can examine all the texts in the corpus, or a sample of them, in order to determine how language is used in particular contexts (for example, in formal or informal situations), by specific groups of people (for example, different ages or different mother tongues) and for specific purposes (for example, academic or social purposes). The findings of such analyses can be used for many real-world purposes, such as devising teaching materials, constructing tests and other assessment procedures, compiling accurate dictionaries or improving communication amongst different social or cultural groups.
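As a rough illustration of what searching a tagged corpus can look like, the short Python sketch below selects texts by their background tags and then counts word forms in the selection. Everything in it is an assumption made for illustration: the records are invented toy examples, and the tag names (`l1`, `level`, `task`) and the helper functions `select` and `word_frequencies` are hypothetical. It does not show the actual format or tools of the Trinity Lancaster Corpus.

```python
from collections import Counter

# Invented toy records for illustration only; these are not real corpus data,
# and the tag names (l1, level, task) are hypothetical.
TOY_CORPUS = [
    {"l1": "Italian", "level": "B1", "task": "conversation",
     "text": "I think the film was very interesting"},
    {"l1": "Spanish", "level": "C1", "task": "presentation",
     "text": "I would argue that the film raises interesting questions"},
    {"l1": "Italian", "level": "C1", "task": "conversation",
     "text": "I suppose the ending was rather predictable"},
]


def select(corpus, **criteria):
    """Return the records whose tags match every given criterion."""
    return [rec for rec in corpus
            if all(rec.get(key) == value for key, value in criteria.items())]


def word_frequencies(records):
    """Count lower-cased word forms across the selected records."""
    counts = Counter()
    for rec in records:
        counts.update(rec["text"].lower().split())
    return counts


# Example query: all texts produced at C1 level, whatever the L1 or task.
c1_speakers = select(TOY_CORPUS, level="C1")
freqs = word_frequencies(c1_speakers)
print(freqs.most_common(3))
```

Criteria can be combined, for example `select(TOY_CORPUS, l1="Italian", level="C1")`, which mirrors the idea of examining a sample of texts defined by speaker group and context. Real corpus tools typically offer much richer queries than this exact-match filter.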

The nature of the GESE test, which focuses on communicative skills and allows test takers choice in their contributions, means that the Trinity Lancaster Corpus can offer unique insights into how learners choose to manage interaction and build meaning based on their own identity rather than being overly constrained by the test task.
