Corpws Electroneg o'r Gymraeg
Welsh is a P Celtic language closely related to Cornish and Breton, and is spoken by about 600,000 people in Wales. After more than a century of decline, the number of Welsh speakers is increasing again, especially among children between the ages of 5 and 16.
It is the most widely spoken Celtic language today, although unlike Irish, a member of the Q Celtic language group, it has no official recognition in the European Union.

In 1993 a small grant was obtained from the Higher Education Funding Council for Wales to develop a one million word electronic corpus of Welsh at the University of Wales, Bangor. This is called the CEG corpus (Corpws Electroneg o'r Gymraeg: Electronic Welsh Corpus). At the time this was an innovative venture for a small language. Before this project it had been difficult to obtain information on the frequencies of different word classes, inflections, mutations, and other grammatical features of Welsh, and every effort was made to design the corpus so that it would be accessible and useful to researchers for years to come.

A total of 500 text samples, each containing approximately 2000 words from various categories of contemporary factual and fiction prose, were collected and tagged [1]. With no daily newspaper published in Welsh, and use of electronic media still underdeveloped, it was difficult to get large enough samples of some categories. Some book extracts had to be scanned in order to create electronic versions, but a balanced sample was eventually achieved.

A part of speech tagger, initially developed for the CySill Welsh spelling and grammar checking programme, was used to help mark words according to their part of speech, and to strip the word forms back to the core forms (the forms usually used as headwords in dictionary entries). This was particularly important for a language such as Welsh, where changes to the first letter of words, verb conjugations and so on would make entering the core form manually a long and laborious task. For example, a noun like 'cath' (the Welsh word for 'cat') also has the mutated forms 'gath', 'chath' and 'nghath'.
With the plural form 'cathod' we have the mutated forms 'gathod', 'chathod' and 'nghathod', giving a total of 8 different forms of the noun. With verbs it is even more complicated, with over a hundred conjugated and mutated forms possible. A simple verb like 'cael' ('to have') has more than 147 different forms, and so it was essential to be able to mark all the related forms automatically.

Initial analyses provided the researchers with statistics that gave new insights into Welsh, including word frequency counts and the relative number of feminine and masculine nouns in the language (approximately 2/3 of the nouns were found to be masculine and 1/3 feminine).

The corpus has been used to help create new Welsh dictionaries, both to ensure that all the vocabulary items were included and to rank words by frequency so that the most frequent ones can be introduced early to language learners. The corpus was successfully used in this way to create the Lexicelt on-line Welsh/Irish dictionary and phrasebook [2]. The corpus was also made freely available on the web to other academic researchers.

In 2005 the WISPR (Welsh and Irish Speech Processing Resources) project at the University of Wales, Bangor used the corpus to develop a phrase-break model for Welsh text-to-speech synthesis [3]. Corrections were made, including the editing of erroneous part of speech tags caused by inconsistencies in the original work. The original authors estimate that two years would be required to raise tag quality from 96% to near-100% throughout the whole corpus [1], but the modified corpus is also freely available to download for non-commercial research [4].

Although a million word corpus seemed very big by the standards of the time, it now seems very small, and there are plans for a much more ambitious hundred million word corpus for Welsh.
This will record the extensive new vocabulary which has entered Welsh since the early 1990s, including all the language of the World Wide Web and multimedia applications. There are also plans to publish the first Welsh language daily newspaper from March 2008, and this will give access to new material for the corpus on a daily basis. The new bilingual policies of many public agencies in Wales also mean that the creation of a bilingual Welsh/English corpus is within reach, bringing with it possibilities for further avenues of research. These include comparative analyses of Welsh and English grammar and syntax, and even use as a resource to create machine translation systems.
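
The mutation example given earlier in this article ('cath', 'cathod' and their mutated forms) can be illustrated with a short Python sketch. It is not the CySill tagger or the method used to build the CEG corpus, whose internals are not described here. It simply applies the standard textbook tables for the soft, aspirate and nasal mutations to the first letter of a word, and then builds a reverse lookup from each surface form to its core form. The function names `mutated_forms` and `build_lemma_index` are hypothetical, chosen only for this illustration, and the sketch ignores the grammatical contexts that actually trigger each mutation.

```python
# Illustrative sketch only: standard textbook initial-consonant mutation
# tables for Welsh. Not the CySill/CEG tagger; all names are hypothetical.
SOFT = {"p": "b", "t": "d", "c": "g", "b": "f", "d": "dd",
        "g": "", "m": "f", "ll": "l", "rh": "r"}
ASPIRATE = {"p": "ph", "t": "th", "c": "ch"}
NASAL = {"p": "mh", "t": "nh", "c": "ngh", "b": "m", "d": "n", "g": "ng"}

# Welsh digraphs count as single letters, so test them before single letters.
DIGRAPHS = ("ch", "dd", "ff", "ng", "ll", "ph", "rh", "th")


def split_initial(word):
    """Split a word into its initial letter (possibly a digraph) and the rest."""
    for digraph in DIGRAPHS:
        if word.startswith(digraph):
            return digraph, word[len(digraph):]
    return word[:1], word[1:]


def mutated_forms(word):
    """Return the unmutated form followed by any soft, aspirate and nasal forms."""
    word = word.lower()
    initial, rest = split_initial(word)
    forms = [word]
    for table in (SOFT, ASPIRATE, NASAL):
        if initial in table:
            form = table[initial] + rest
            if form not in forms:
                forms.append(form)
    return forms


def build_lemma_index(lemma_to_forms):
    """Map every surface form (mutated or not) back to its core form(s)."""
    index = {}
    for lemma, base_forms in lemma_to_forms.items():
        for base in base_forms:
            for surface in mutated_forms(base):
                index.setdefault(surface, set()).add(lemma)
    return index
```

With this sketch, `mutated_forms("cath")` gives 'cath', 'gath', 'chath' and 'nghath', and `build_lemma_index({"cath": ["cath", "cathod"]})` contains the 8 forms of the noun listed in the article, each pointing back to the core form 'cath'. A real tagger also has to resolve ambiguous forms using context, which a lookup table like this cannot do.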
21.11.07
Delyth Prys
Delyth Prys is e-Gymraeg Team Leader. She works at Canolfan Bedwyr, University of Wales, Bangor. Canolfan Bedwyr's Language Technology Unit is a research unit that develops language resources for the Welsh language, the Celtic languages, and for multilingual situations in general.
References:
[1] Ellis, N. C., O'Dochartaigh, C., Hicks, W., Morgan, M., & Laporte, N. (2001). Cronfa Electroneg o Gymraeg (CEG): A 1 million word lexical database and frequency count for Welsh. http://www.bangor.ac.uk/ar/cb/ceg.php.en
[2] Prys, D., Evans, D. et al. (2005). Geiriadur a Llyfr Ymadroddion ar-lein / Foclóir agus Leabhrán Frásaí ar line. http://www.lexicelt.org/
[3] Jones, R.J. (2006). Changes to POS tagging in the CEG corpus. http://www.bangor.ac.uk/~cbs204/ceg/ceg_tagging_changes.html
[4] Language Technology Unit, Bangor University (2006). Modified CEG corpus. http://www.bangor.ac.uk/~cbs204/ceg/tag_corrected.tar.gz