The linguistic context
The Irish language is a minority language even within the island of Ireland. While 1.66 million people (of an overall population of about 5 million, including Northern Ireland) claim fluency in the language, it is estimated that only about 100,000 use Irish as their daily language of communication. One confounding feature of the linguistic situation is demographic: many native or fluent speakers reside in English-speaking areas (in Ireland, North and South) and thus do not have an everyday context in which to speak the language. The areas in which Irish is a daily community language form pockets, mainly in remote or coastal areas, from the extreme north-west of the island to the extreme south-west and south, with one exception in the midlands. Others who use Irish on a daily basis include urban dwellers, some of whom work for Irish-language institutions, are involved in Irish-medium education, or are in some other way proactive in preserving the language. Still others reside in relative isolation throughout the island.
While the conditions for the spoken language have become increasingly challenging over the years, resources relating to the written language have been developed with considerable dedication and enthusiasm. Over the past twenty years, much activity has focussed on the digitization of sources, with the result that we now have a digital continuum of textual sources going back to the period of Old Irish (AD 500–900).
The six corpora mentioned below are general language (LGP) corpora developed for lexicographical and research purposes. It will be seen that the corpora, taken collectively, cover the span from AD 700 to the present day. We do not, as yet, have corpora specifically designed for terminology work. However, a project is planned to develop a subcorpus of technical material for this purpose within the NCI (New Corpus for Ireland).
NCI: New Corpus for Ireland (Nua-Chorpas na hÉireann)
The NCI has been developed over the past few years in connection with a lexicographical project at Foras na Gaeilge (the compilation of a new English-Irish dictionary; see www.focloir.ie). It consists of two separate corpora, one containing 30 million words in Modern Irish (AD 1650+) and the other 25 million words in Hiberno-English (i.e., English as spoken in Ireland). About half of the material comes from books, both literary and informative. The remaining half is made up of newspapers, periodicals, official documents, Web sites and broadcasting material. The material included was produced during the period from 1883 to the present day.
The NCI is not currently freely available online but enquiries regarding access for research purposes may be sent to cconvery@forasnagaeilge.ie.
Corpas na Gaeilge
This is a monolingual corpus of 705 published texts from the period 1600-1882, covering the end of the Early Modern Irish period (AD 1200 – 1650) and the first half of the Modern Irish period (AD 1650 to the present day). It includes prose, poetry, religious texts, historical documents and translations, and is based on 1.2 million lines of running text. It was developed at the Royal Irish Academy (RIA) as part of a lexicographical project designed to compile a historical Irish-Irish dictionary of Modern Irish and to form a companion to the RIA's earlier work, the Dictionary of the Irish Language (see below). The corpus is not available online, but it is available on CD and may be ordered directly from the RIA as well as from other outlets.
(CD-ROM, Royal Irish Academy, 2004).
eDIL
Launched in June 2007, this is an online electronic dictionary of Old and Middle Irish (covering the period AD 700-1700) based on the Dictionary of the Irish Language (DIL); it also contains material from the Early Modern period and some later material. DIL is a scholarly dictionary that provides lexical and morphological information in English on each headword, together with citations from medieval texts that illustrate this information. It was originally published in 22 individual fasciculi by the RIA between 1913 and 1976.
(Royal Irish Academy, 1913-1976).
CELT (Corpus of Electronic Texts)
This searchable online corpus of multilingual texts of Irish literature and history contains 10.6 million words from 934 texts. Of the source documents, 47 were written in Irish, 380 in English, 17 in Latin, 2 in French and one in Spanish. Also included are 92 translations: 83 into English, 6 into German and 3 into French.
The Irish texts included are mainly from the Middle Irish and Early Modern Irish periods (AD 900 – 1650). (Some Irish text files contain an English translation.)
Several other corpora, initiated by private individuals, are available online. The most extensive of these are:
• Tobar na Gaedhilge: This is a monolingual corpus of literary texts in Modern Irish containing over 3 million words. The original texts date from the first half of the twentieth century.
• Corpas Comhthreomhar Gaeilge-Béarla: This is a parallel Irish/English corpus containing about 600,000 sentences from miscellaneous sources, including annual reports, legislation, online terminology databases, software localization, government Web sites, and published articles.