The development of the first ever large-scale collection of Welsh words that represents the full range of language used by people in everyday life is underway.
The CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes – The National Corpus of Contemporary Welsh) project was launched at a special event in Cardiff on 28th February 2017, attended by Alun Davies AM, Minister for Lifelong Learning and Welsh Language.
The project, a collaboration between Cardiff, Swansea, Lancaster and Bangor universities, is breaking new ground in creating a large-scale, open-access corpus of contemporary Welsh.
Backed by high-profile ambassadors - poet Damian Walford-Davies, musician and presenter Cerys Matthews, broadcaster Nia Parry and international rugby referee Nigel Owens - CorCenCC is community-driven and uses mobile and digital technologies to enable public collaboration.
The research team aim for the corpus to contain 10 million words of Welsh language, providing concrete evidence about modern Welsh language use for academic researchers, teachers, language learners, dictionary makers, translators, and anyone interested in the way Welsh is used across different speakers and genres.
Researchers at Lancaster University are developing a computer system to allow investigation of the 10-million-word corpus. The system builds on the grammatical analysis carried out in the project and automatically allocates words and phrases to 232 semantic groups. It will allow corpus users to examine key concepts in the texts and will support the teaching applications of CorCenCC by grouping words into themes for vocabulary profiling.
Dr Paul Rayson, Reader in Natural Language Processing at Lancaster University’s School of Computing and Communications, Director of the UCREL research centre, and co-investigator on the project, said: “Collecting a large-scale corpus of Welsh language used in everyday life, and involving Welsh writers and speakers in so many aspects of the project is very exciting. The corpus will also contribute to improving Welsh language technologies such as automatic translation, speech recognition and artificial intelligence.”
Dr Dawn Knight, project lead from Cardiff University’s School of English, Communication and Philosophy, said: “What we aim to achieve is the development of the first large-scale living and evolving corpus, representing the Welsh language across communication types and informed by real, current, users of the language. We will be engaging with the public in a number of ways, and using new technologies to do so, including the CorCenCC crowdsourcing app. The use of crowdsourced corpus data is relatively unheard of, and represents a new direction to complement more traditional language collection methods.”
Steve Morris, of Swansea University, added: “This is a project about the past, present and future use of the Welsh language and will inform us about variation and change in real language use, such as regional differences or use of mutations over time. By putting speakers themselves in charge of their contributions to the corpus, they can be sure that the recordings they share will be the most natural and accurate representation possible of their everyday Welsh.”
CorCenCC is funded by the Economic and Social Research Council and the Arts and Humanities Research Council. The project also involves the Welsh Government; the National Assembly for Wales; the National Library of Wales; WJEC-CBAC; Welsh for Adults; S4C; BBC; y Lolfa; SaySomethingin.com and the Dictionary of the Welsh Language. Additional funding for the launch was received from the British Council; the School of English, Communication and Philosophy (ENCAP), Cardiff University; and the Research Institute for Arts and Humanities (RIAH), Swansea University.