Academia 14
 


CATEx – A Project Proposal

by Johann Gamper

The following article briefly describes CATEx, a new project for Computer Assisted Terminology Extraction, which extends previous work and aims to give South Tyrol, at last, a coherent and comprehensive Italian/German legal and administrative terminology. CATEx adopts a corpus-based approach, in which terminology is extracted from a collection of machine-readable bilingual texts. The most recent advances in corpus encoding, automatic information extraction and management of terminological data are applied.

Introduction and Motivation

Because Italian and German have equal status in South Tyrol, a large share of legal and administrative documents has to be written in both languages: usually they are first drafted in Italian and then translated into German. A prerequisite for high-quality translations is a consistent and comprehensive terminology, i.e. a domain-specific dictionary, which does not yet exist (see note 1). Organized terminological activities were neglected for decades, while various institutions coined different German terms without prior consultation and created their own in-house dictionaries. The result has been "terminological chaos": many inconsistencies, duplication of effort, and poor-quality translations. Since 1994 the Commission for Terminology (TERKOM) has been working in cooperation with scientific area I of the European Academy of Bolzano (EAB) on a standardization of the legal and administrative terminology in South Tyrol (LA-terminology). The European Academy of Bolzano focuses on descriptive terminology: Italian/German term pairs are collected from bilingual texts, e.g. the Italian "codici" and their translations, and preferences for German terms are proposed. In particular cases further examinations are performed, such as a comparison with the Austrian, German and Swiss legal systems. On the basis of this work, TERKOM establishes one German term for each Italian term. A few years of experience at the European Academy of Bolzano have shown that an exhaustive analysis of the huge number of relevant texts and the extraction of the terminology (experts estimate 40,000-70,000 terms) require advanced computational methods: the CATEx project was born.

Objectives

Referring to figure 1, CATEx aims to provide a framework to support the tasks described in the boldface box. The main objectives are as follows:

  1. Development of a computational framework for corpus-based terminological work.
  2. Production of linguistic resources: a corpus of bilingual texts and an LA-terminology.
  3. Creation of a sophisticated interface for user access.
We follow the corpus-based approach in terminology [Bow96]. The terminological data are automatically extracted from a corpus of machine-readable texts and stored in a database. Both the corpus and the terminology will be made available to users. This leads to a computational framework with four modules (text database/corpus, data/knowledge extraction, terminological database, user interface) which are shown in figure 2 and will be discussed in more detail below.

Text Database/Corpus

The basis for collecting and describing LA-terminology is a domain-specific parallel corpus of representative Italian/German texts in machine-readable form, which cover the whole area of law and administration and show the use of the terms in various contexts (table 1). Some of these documents are widely accepted as standard and are used as reference books by translators. In order to facilitate various automatic text processing tasks, implicit information in the corpus has to be made explicit, a procedure generally known as text markup or text annotation. Four levels of markup can be distinguished: document-wide markup (bibliographic information, etc.), gross structural markup (chapter, section, etc.), sub-paragraph markup (sentences, words, etc.) and linguistic annotation (part-of-speech tags, alignment of parallel texts, etc.). Marking up the texts according to the EAGLES recommendations on corpus encoding, an application of SGML, facilitates the exchange of the corpus and the use of existing text processing software. The corpus will be implemented as a sophisticated text database.
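The four levels of markup can be illustrated with a minimal sketch. It uses XML built with Python's standard library rather than SGML, and the element names (text, header, div, s, w), the attribute names and the sample sentence are assumptions chosen for illustration; they are not the actual EAGLES tag set or CATEx data.

```python
import xml.etree.ElementTree as ET


def build_document(doc_id, lang, title, sentences):
    """Build one annotated text; `sentences` is a list of lists of (word, pos) pairs."""
    doc = ET.Element("text", {"id": doc_id, "lang": lang})
    # Level 1: document-wide markup (bibliographic information).
    header = ET.SubElement(doc, "header")
    ET.SubElement(header, "title").text = title
    # Level 2: gross structural markup (here a single section).
    section = ET.SubElement(doc, "div", {"type": "section", "n": "1"})
    for i, tokens in enumerate(sentences, start=1):
        # Level 3: sub-paragraph markup (sentences and words).
        sentence = ET.SubElement(section, "s", {"id": f"{doc_id}.s{i}"})
        for word, pos in tokens:
            # Level 4: linguistic annotation (part-of-speech tag per word).
            ET.SubElement(sentence, "w", {"pos": pos}).text = word
    return doc


# Invented sample data: one Italian sentence fragment and its German counterpart.
it_doc = build_document(
    "it1", "it", "Sample Italian text",
    [[("contratto", "NOUN"), ("di", "ADP"), ("lavoro", "NOUN")]],
)
de_doc = build_document(
    "de1", "de", "Sample German text",
    [[("Arbeitsvertrag", "NOUN")]],
)

# Level 4 also covers alignment of parallel texts: here a list of sentence-id links.
alignment = [("it1.s1", "de1.s1")]

print(ET.tostring(it_doc, encoding="unicode"))
```

Keeping the alignment as separate links between sentence identifiers, instead of interleaving the two languages in one file, is one possible design; it leaves each monolingual text readable by tools that know nothing about the other language.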

Data/Knowledge Extraction

Once the corpus of annotated texts is ready, automatic methods are needed to recognize and extract various pieces of data and knowledge. First of all, we want to extract terms and their translation equivalents. For that purpose, word sequences in the source language have to be identified as terms, and their translations in the target language have to be determined. Further, we want to extract conceptual characteristics, for example the relations of hyponymy (e.g. X is a kind of Y) and meronymy (e.g. X is part of Y). Such relations allow a hierarchical organization of the terminology, which facilitates the search process as well as the use of the terminology by other programs. Finally, we need additional information such as a definition of each term, a short context example, and the respective sources. Automating the data/knowledge extraction process requires a lot of information, including a definition of what a term is, which word patterns represent which relation, and the linguistic annotations explicitly encoded in the corpus (part-of-speech tags, alignments, etc.). Even where these tasks cannot be fully automated, the algorithms compute a list of candidates to be post-edited by a human. This is far more effective than manual extraction, where a person has to scan the entire document.
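The candidate-list idea can be sketched in a few lines. The sketch assumes that the corpus is already part-of-speech tagged and sentence-aligned, defines a "term" by a hypothetical tag pattern (a noun optionally followed by adjectives or "preposition + noun" groups), and ranks translation candidates by plain co-occurrence counts in aligned sentences. The pattern, the tag names and the sample sentences are illustrative assumptions, not the methods chosen for CATEx.

```python
import re
from collections import Counter

# One character per POS tag so that a regular expression can run over the tags.
TAG_CODES = {"NOUN": "N", "ADJ": "A", "ADP": "P"}
# Hypothetical term pattern: noun, then any number of adjectives or "prep + noun".
TERM_PATTERN = re.compile(r"N(?:A|PN)*")


def term_candidates(tagged):
    """Return the word sequences in `tagged` whose POS tags match TERM_PATTERN."""
    codes = "".join(TAG_CODES.get(pos, "x") for _, pos in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in TERM_PATTERN.finditer(codes)
    ]


def translation_candidates(aligned_pairs):
    """Count how often source and target candidates co-occur in aligned sentences."""
    counts = Counter()
    for source, target in aligned_pairs:
        for s in term_candidates(source):
            for t in term_candidates(target):
                counts[(s, t)] += 1
    return counts


# Invented, pre-tagged and pre-aligned sample sentences (Italian, German).
pairs = [
    ([("il", "DET"), ("contratto", "NOUN"), ("di", "ADP"), ("lavoro", "NOUN"),
      ("scade", "VERB")],
     [("der", "DET"), ("Arbeitsvertrag", "NOUN"), ("endet", "VERB")]),
    ([("contratto", "NOUN"), ("di", "ADP"), ("lavoro", "NOUN"), ("e", "CCONJ"),
      ("retribuzione", "NOUN")],
     [("Arbeitsvertrag", "NOUN"), ("und", "CCONJ"), ("Entlohnung", "NOUN")]),
]

# The ranked list is what a human terminologist would post-edit.
for (source_term, target_term), n in translation_candidates(pairs).most_common():
    print(n, source_term, "->", target_term)
```

Raw co-occurrence counts are only a stand-in for a real association measure; the point is that the program proposes ranked pairs and the human confirms or rejects them.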

Terminological Database

The starting point for the development of a new data model is the concept-oriented terminological database BLUTERM (http://www2.eurac.edu), which has been built up at the European Academy of Bolzano over the past few years. Each entry represents a concept named by one or more terms and contains additional information such as a definition and a short context. While these pieces of information are more or less standard, we plan some useful enhancements. Specialized dictionaries alone are not sufficient: what translators actually need is a combination of dictionaries and parallel texts showing the use of terms in various contexts. We will therefore provide links from the terms in the terminological database to the places in the text database where these terms appear, and vice versa. Moreover, we plan to build concept hierarchies along various relations. Such structural knowledge allows the user to navigate among hyponyms and hypernyms as well as to coordinate terms, and is a necessary ingredient for sophisticated information retrieval. These two extensions are shown in figure 3. Following [KT95], we also plan to enhance each entry with statistical and temporal information, such as the frequency of the terms in the corpus and the time when a certain translation was used; both contain valuable information for proposing preferences.
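A concept-oriented entry with the planned extensions might be modelled roughly as follows. This is a sketch under stated assumptions, not the BLUTERM schema: the class and field names (Term, Concept, TermBank, occurrences, hypernym), the concept identifiers, the definitions, the sentence ids and the years are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Term:
    text: str
    lang: str  # "it" or "de"
    # Links into the text database: (sentence id, year of the source text).
    occurrences: list = field(default_factory=list)

    @property
    def frequency(self):
        """Number of corpus occurrences, usable when proposing preferences."""
        return len(self.occurrences)


@dataclass
class Concept:
    concept_id: str
    definition: str
    terms: list = field(default_factory=list)
    hypernym: Optional[str] = None  # id of the broader concept, if any


class TermBank:
    def __init__(self):
        self.concepts = {}

    def add(self, concept):
        self.concepts[concept.concept_id] = concept

    def hyponyms(self, concept_id):
        """Ids of the concepts directly below `concept_id` in the hierarchy."""
        return [c.concept_id for c in self.concepts.values()
                if c.hypernym == concept_id]

    def coordinate_terms(self, concept_id):
        """Ids of the concepts that share a hypernym with `concept_id`."""
        parent = self.concepts[concept_id].hypernym
        if parent is None:
            return []
        return [cid for cid in self.hyponyms(parent) if cid != concept_id]


bank = TermBank()
bank.add(Concept("contract", "Invented definition of a contract."))
bank.add(Concept(
    "employment_contract", "Invented definition of an employment contract.",
    terms=[Term("contratto di lavoro", "it", [("it1.s1", 1990), ("it2.s7", 1995)]),
           Term("Arbeitsvertrag", "de", [("de1.s1", 1990)])],
    hypernym="contract",
))
bank.add(Concept("lease", "Invented definition of a lease.", hypernym="contract"))

print(bank.hyponyms("contract"))
print(bank.coordinate_terms("employment_contract"))
```

Storing occurrences as links rather than copied text keeps the terminological database and the text database separate while still allowing navigation in both directions, and the years attached to each link carry the temporal information.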

User Interface

The main distribution medium for the linguistic resources will be the WWW. An "intelligent" interface allowing user-friendly and transparent access to both the corpus and the terminological database has to be developed. Some advanced features will be the following:
  • Multilingual navigation aids including queries in the Italian and German languages
  • Display of parallel documents
  • Complex queries: words or compound words, word combinations in a context, etc.
  • Restriction of queries to parts of the corpus/terminological database
Other distribution media will be considered as well, e.g. CD-ROM or printed dictionaries. For that purpose we have to provide interfaces to export dictionaries directly from the terminological database.
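Two of the listed features, queries in either language with display of the parallel text and restriction of a query to part of the corpus, can be sketched as a simple search over aligned sentence pairs. The record layout, the function name `search`, the document ids and the sample sentences are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedSentence:
    doc_id: str  # the document this sentence pair comes from
    it: str      # Italian side
    de: str      # German side


def search(corpus, query, lang, doc_ids=None):
    """Return aligned pairs whose `lang` side ("it" or "de") contains `query`.

    Matching is case-insensitive; `doc_ids` optionally restricts the search
    to a subset of the corpus.
    """
    needle = query.lower()
    hits = []
    for pair in corpus:
        if doc_ids is not None and pair.doc_id not in doc_ids:
            continue
        if needle in getattr(pair, lang).lower():
            hits.append(pair)
    return hits


# Invented sample corpus with two documents.
corpus = [
    AlignedSentence("code", "Il contratto di lavoro è stipulato per iscritto.",
                    "Der Arbeitsvertrag wird schriftlich abgeschlossen."),
    AlignedSentence("decree", "Il contratto di locazione ha durata annuale.",
                    "Der Mietvertrag hat eine Laufzeit von einem Jahr."),
]

# Query in Italian, restricted to one document; both sides are displayed.
for hit in search(corpus, "contratto", "it", doc_ids={"decree"}):
    print(hit.it, "|", hit.de)
```

Because every hit is an aligned pair, the parallel display comes for free: the query language only decides which side is matched.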

Concluding Remarks

Machine-readable corpora are becoming increasingly important in all fields related to natural language processing. This is also true for terminology, where adopting a corpus-based approach opens new doors [Bow96]: increased speed and scope of research, the possibility of providing more contextual information, unrestricted access to the text, support for retrieving conceptual information, etc. Dagan and Church [DC97] report "that it took about 10 hours to construct a list of 1700 terms extracted from a 300,000-word document". Recent advances in text processing research (standards for corpus encoding, powerful algorithms for bilingual text alignment [Chu93], the extraction of terminological information [DC97], etc.) and the increased availability of machine-readable corpora are crucial for the success of automatic terminology acquisition. At the European Academy of Bolzano several projects deal with various aspects of terminology, including descriptive and comparative terminology, extraction of terms from texts, and modelling of terminological information in databases [AM96]. The knowledge acquired and the techniques developed provide valuable ingredients for the realization of the new project. Starting from our own experience and exploiting new computational methods in corpus-based research, the CATEx project aims to develop a framework for corpus-based research and to construct a special-language corpus of bilingual texts and a comprehensive collection of the terms used in South Tyrol in the past, which is a prerequisite for, and an important step towards, a standardization of the LA-terminology.

Dr. Johann Gamper, researcher in the section "Language and law" at the European Academy of Bolzano

References

[AM96] Reiner Arntz and Felix Mayer. Vergleichende Rechtsterminologie und Sprachdatenverarbeitung - das Beispiel Südtirol. In Übersetzungswissenschaft im Umbruch, pages 117-129. Gunter Narr Verlag Tübingen, 1996.

[Bow96] Lynne Bowker. Towards a corpus-based approach to terminography. Terminology, 3(1):27-52, 1996.

[Chu93] Kenneth Ward Church. Char_align: A program for aligning parallel texts at the character level. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 1-8, Columbus, Ohio, 1993.

[DC97] Ido Dagan and Ken Church. Termight: Coordinating humans and machines in bilingual terminology acquisition. Machine Translation, 12:89-107, 1997.

[KT95] Judith Klavans and Evelyne Tzoukermann. Combining corpus and machine-readable dictionary data for building bilingual lexicons. Machine Translation, 10:185-218, 1995.

Notes

1 Note that there are indeed a few glossaries. There are also Italian/German dictionaries, but they translate between different legal systems, e.g. the Italian and the German one, and therefore they necessarily differ from a dictionary for South Tyrol.

