Construction of a Parallel Text Corpus: Encoding Primary Data

Academia No. 18 (March - June 1999)
by Johann Gamper

This article is a status report on an ongoing research project in corpus-based terminology acquisition at the European Academy of Bolzano/Bozen. We discuss the first module, the construction of a bilingual corpus of legal and administrative texts. The main focus is on annotating the texts with bibliographic and structural information.
Introduction

For a few years now, the European Academy of Bolzano/Bozen has been working on a legal and administrative terminology for South Tyrol [1]. A large part of this work consists of scanning large collections of bilingual text material for terms and their translations, a very labor-intensive and error-prone task. In [4] we presented the project CATEx (Computer Assisted Terminology Extraction), which aims to provide a computational framework to support terminology acquisition. CATEx adopts a new form of corpus-based terminology acquisition, which extends the traditional approach (characterized by manual scanning of printed documents) in at least two directions: (1) the corpus is a collection of language material in machine-readable form, and (2) sophisticated computer programs explore the corpus for terminologically relevant information and generate lists of term candidates, which are then post-edited by humans [2].

The basis for automatic term acquisition is a corpus of machine-readable texts. The accuracy of automatic term extraction can be improved if implicit knowledge in the texts is made explicit, a process known as text markup or annotation. Structural information, part-of-speech tags, lemmas and alignments are the most important pieces of information required for advanced term extraction.

This paper is a brief status report on the construction of a parallel text corpus in the CATEx project. The main focus is on the annotation of bibliographic and structural information.
Building the Text Corpus

The overall process of building the text corpus is shown in figure 1 and comprises the following tasks: corpus design, preprocessing, encoding primary data, and encoding linguistic information.

Corpus design selects the texts to be included in the corpus. An important design criterion is that the corpus is representative, i.e. that the texts form a realistic model of the language to be studied [2]. Italian laws and their German translations form the core of our corpus. Currently, the corpus contains ca. 5 million words, which makes it one of the largest special-language corpora.

In the preprocessing phase we correct errors in the raw text material (mainly OCR errors) and produce a unified electronic version, so that the programs used for subsequent annotation can be kept relatively simple.

A large part of the work is devoted to various annotation tasks that enrich the raw text material with interpretative, linguistic and non-linguistic information. Making such information explicit facilitates automatic text analysis.
Primary Data Encoding

For text annotation we apply the Corpus Encoding Standard (CES) [5], which provides a set of annotation guidelines tailored to language engineering applications. CES is an application of SGML, a very general, declarative markup language designed as a standard for interchanging textual information independently of any formatting information. The raw text material without any annotation is referred to as primary data. CES provides rules for encoding relevant information in the primary data as well as for encoding additional information that results from linguistic analyses of the primary data.

Primary data encoding comprises the markup of documentation and structural information. A CES-document consists of a header, which contains the documentation information, and a body, which contains the primary data marked up with structural information. In our corpus, each law is treated as a single CES-document.

Documentation information is global information that is valid for the whole document. It includes, among other things, a bibliographic description of the text (title, information about the publication of the printed version, author(s), etc.) and the encoding conventions (number of SGML elements actually tagged, character set used, language, etc.).

Regarding the annotation of structural information, CES distinguishes between gross structural markup and markup of sub-paragraph structures. At the level of gross structure, a text consists of possibly hierarchically ordered units of text such as chapters, sections, lists, notes, etc., down to the paragraph level. Relevant elements at the sub-paragraph level are abbreviations, names, dates, numbers, foreign words, etc.

Figure 2 shows a part of the "Italienisches Zivilgesetzbuch" in printed form. Figure 3 shows the same part annotated with structural information, where the SGML tags (the text in angle brackets) explicitly represent the interpretation of the enclosed text.
The first line starts a division of type buch, where id is a unique identifier and n is the number of the division. The second line marks the text "Buch I" as header information. The fourth line starts a new division of type titel which is enclosed in the previous one. In all we have the following divisions in the indicated hierarchical order:
1. Legge/Gesetz
2. Parte/Abteilung
3. Libro/Buch
4. Titolo/Titel
5. Capo/Abschnitt
6. Sezione/Teil
7. §/§
8. Articolo/Artikel
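The hierarchical order of the division types lends itself to an automatic consistency check. The following is a minimal sketch, not part of CATEx: it assumes that divisions are encoded as nested div elements with a type attribute, that the type values are the lowercase German names, and that "paragraph" stands in for "§". The fragments are written as well-formed XML so that Python's standard parser can read them; real SGML input would need an SGML-aware parser.

```python
import xml.etree.ElementTree as ET

# Division types in hierarchical order, taken from the list above.
# The lowercase spellings and "paragraph" for "§" are assumptions.
HIERARCHY = ["gesetz", "abteilung", "buch", "titel",
             "abschnitt", "teil", "paragraph", "artikel"]
RANK = {name: i for i, name in enumerate(HIERARCHY)}


def nesting_errors(div, parent_type=None):
    """Return (parent_type, child_type) pairs that violate the hierarchy."""
    errors = []
    div_type = div.get("type")
    # A nested division must rank strictly below its parent; levels may be skipped.
    if parent_type is not None and RANK[div_type] <= RANK[parent_type]:
        errors.append((parent_type, div_type))
    for child in div.findall("div"):
        errors.extend(nesting_errors(child, div_type))
    return errors


good = ET.fromstring(
    '<div type="buch"><div type="titel"><div type="artikel"/></div></div>')
bad = ET.fromstring('<div type="titel"><div type="buch"/></div>')

print(nesting_errors(good))  # []
print(nesting_errors(bad))   # [('titel', 'buch')]
```

The check allows levels to be skipped (for example, an article directly inside a title), since not every law uses all eight levels.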
The smallest divisions in our texts are articles, which in general contain several paragraphs and footnotes. Footnote references are coded by dedicated elements whose target attribute contains the unique identifier of the footnote. At the sub-paragraph level, figure 3 shows the annotation of dates and abbreviations.
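How a program can follow a footnote reference through its target attribute may be illustrated as follows. This is only a sketch under stated assumptions: the element names p, note, ptr, date and abbr, the identifiers, and all text content are hypothetical placeholders and are not copied from figure 3. The fragment is written as well-formed XML so that Python's standard library can parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element names, ids and text are placeholders.
FRAGMENT = """
<div type="artikel" id="art1" n="1">
  <p>First paragraph<ptr target="fn1"/> with a date <date>1992</date>.</p>
  <p>Second paragraph with an abbreviation <abbr>Art.</abbr></p>
  <note id="fn1">Text of the footnote.</note>
</div>
"""

article = ET.fromstring(FRAGMENT)

# Index the footnotes by their unique identifier.
notes = {note.get("id"): note.text for note in article.iter("note")}

# Resolve each footnote reference through its target attribute.
resolved = [(ptr.get("target"), notes[ptr.get("target")])
            for ptr in article.iter("ptr")]
print(resolved)  # [('fn1', 'Text of the footnote.')]

# Collect the sub-paragraph elements.
dates = [d.text for d in article.iter("date")]
abbrs = [a.text for a in article.iter("abbr")]
print(dates, abbrs)  # ['1992'] ['Art.']
```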
Concluding Remarks

Text corpora play a crucial role in all areas that deal with natural language processing in one form or another. Before a corpus can be explored by computer programs, the texts have to be enriched with various pieces of information. The type and amount of information to be annotated depends on the intended use of the corpus. While text annotation is a labor-intensive task, it makes a text corpus a valuable resource for various applications in natural language processing.

Encoding primary data explicitly enriches the text corpus with bibliographic and structural information. In the CATEx project, this information is required to extract the source of a term automatically. For example, the source of the term "Rechtsfähigkeit" in figure 3 can be automatically computed as "Buch I, Titel I, Artikel 1 des Italienischen Zivilgesetzbuches, Athesia, 1992". Moreover, structural information is a necessary ingredient for a browser that allows user-friendly navigation through the documents.

Future work includes the annotation of linguistic information. We are considering the recognition of tokens and sentences, a lexical analysis that assigns lemmas and part-of-speech tags, the disambiguation of the part-of-speech tags, and the alignment of the parallel texts. These pieces of information are useful for the recognition of terms and their translation equivalents [3].
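The automatic computation of a term's source mentioned in the concluding remarks can be sketched as a walk through the nested divisions that collects their headings. This is an illustrative sketch, not the CATEx implementation: the element names, identifiers and the function term_source are assumptions, the paragraph text is a placeholder, and the fragment is written as well-formed XML rather than SGML.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the description of figure 3;
# ids, element names and the paragraph text are assumptions.
DOC = """
<div type="buch" id="b1" n="1"><head>Buch I</head>
  <div type="titel" id="b1t1" n="1"><head>Titel I</head>
    <div type="artikel" id="b1t1a1" n="1"><head>Artikel 1</head>
      <p>... Rechtsfähigkeit ...</p>
    </div>
  </div>
</div>
"""


def term_source(div, term, path=()):
    """Return the headings of the divisions enclosing the first paragraph
    that contains `term`, or None if the term does not occur."""
    head = div.findtext("head")
    path = path + ((head,) if head else ())
    for p in div.findall("p"):
        if term in "".join(p.itertext()):
            return list(path)
    for child in div.findall("div"):
        found = term_source(child, term, path)
        if found is not None:
            return found
    return None


root = ET.fromstring(DOC)
source = term_source(root, "Rechtsfähigkeit")
print(", ".join(source))  # Buch I, Titel I, Artikel 1
```

The bibliographic part of the source (title of the law, publisher, year) would come from the document header rather than from the body, so a complete version would append those fields to the structural path.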
Dr. Johann Gamper, researcher in the section "Language and law" at the European Academy of Bolzano/Bozen
Johann.Gamper@eurac.edu

References

[1] Reiner Arntz and Felix Mayer. Vergleichende Rechtsterminologie und Sprachdatenverarbeitung - das Beispiel Südtirol. In Übersetzungswissenschaft im Umbruch, pages 117–129. Gunter Narr Verlag, Tübingen, 1996.
[2] Lynne Bowker. Towards a corpus-based approach to terminography. Terminology, 3(1):27–52, 1996.
[3] Ido Dagan and Kenneth W. Church. Termight: Coordinating humans and machines in bilingual terminology acquisition. Machine Translation, 12:89–107, 1997.
[4] Johann Gamper. CATEx - a project proposal. Academia, 14:10–12, 1998. European Academy Bolzano/Bozen.
[5] Nancy Ide, Greg Priest-Dorman, and Jean Véronis. Corpus encoding standard, 1996. See http://www.cs.vassar.edu/CES/