Gymn@zilla
Gymn@zilla supports browsing a local document repository and the Internet by dynamically creating HTML and PDF documents and annotating them with open dictionary resources. Gymn@zilla is written in Perl and runs as an online application on a Linux web server, an architecture that lets it build on free and powerful modules. Its main submodules handle (1) the mirroring of web pages, (2) the linguistic processing, (3) the processing and selection of images and (4) the generation of exercises.
Web pages are mirrored using Perl's LWP modules. Hyperlinks in a mirrored page are rewritten to point to Gymn@zilla's own URL, with the original URL encoded as a CGI parameter, so that the user can keep browsing within Gymn@zilla. Links to multimedia documents such as audio, video and graphic files are preserved.
Once a document has been converted, its language is guessed and the best-matching support language (L1) is selected. The text is segmented into tokens, which is not trivial for East Asian languages. To annotate inflected word forms, these forms are stemmed using pattern-matching techniques. According to the user's preferences, the text is then annotated with translations and terminological information. Annotations are inserted as <a> tags with advanced JavaScript link titles; these carry the information that is shown when the user moves the mouse over the annotated word.
The dictionaries included in Gymn@zilla are mostly taken from the Internet (e.g. the Chinese CEDICT dictionary) or provided by our research partners (e.g. the Russian dictionary from the Laboratory of Computational Linguistics at IPPI of the Russian Academy of Sciences). All dictionaries are transformed into an XML structure that features the lemma and, optionally, grammatical indicators, the translation, pronunciation features and notes.
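The link rewriting used during mirroring can be illustrated with a minimal sketch. It is written in Python for brevity and is not Gymn@zilla's Perl/LWP code. The proxy address `PROXY_BASE`, the parameter name `url`, the list of media extensions and the function name `rewrite_links` are all assumptions made for the example. Making preserved media links absolute is likewise an assumption about how "preserved" links stay resolvable.

```python
import re
from urllib.parse import quote, urljoin, urlparse

# Hypothetical values for illustration; not taken from Gymn@zilla.
PROXY_BASE = "https://example.org/gymnazilla.cgi"
MEDIA_EXTENSIONS = (".mp3", ".wav", ".avi", ".mpg", ".png", ".jpg", ".gif")

# Matches the href attribute of an <a> tag (double-quoted values only).
HREF_PATTERN = re.compile(r'(<a\b[^>]*?\bhref=")([^"]*)(")', re.IGNORECASE)


def rewrite_links(html, page_url):
    """Point hyperlinks at the proxy, passing the original URL as a CGI parameter."""

    def replace(match):
        # Resolve relative links against the URL of the mirrored page.
        absolute = urljoin(page_url, match.group(2))
        if urlparse(absolute).path.lower().endswith(MEDIA_EXTENSIONS):
            # Multimedia links keep pointing at the original server.
            target = absolute
        else:
            # Percent-encode the whole original URL so it is a safe parameter value.
            target = PROXY_BASE + "?url=" + quote(absolute, safe="")
        return match.group(1) + target + match.group(3)

    return HREF_PATTERN.sub(replace, html)


if __name__ == "__main__":
    page = '<a href="/news.html">News</a> <a href="clip.mp3">Audio</a>'
    print(rewrite_links(page, "http://example.com/index.html"))
```

A regular expression keeps the sketch short. A production mirror would more likely use a real HTML parser, and would also need to handle single-quoted attributes, fragment links and non-HTTP schemes.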
To improve the quality of the annotation, future work will attempt to classify documents by comparing the character n-grams of a document with those of specific dictionaries. Part-of-speech tagging and word sense disambiguation will also be explored in order to avoid notoriously incorrect annotations.
Each user in Gymn@zilla is associated with a session. The session is used to maintain private, editable word lists in the form of simple XML documents. XSLT transformations are then used to generate quizzes for training.
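The planned n-gram comparison between a document and specific dictionaries could look roughly like the following Python sketch. The text above does not say which n-gram size or similarity measure is intended. Character trigrams and cosine similarity are assumptions chosen for illustration, as are the function names and the toy dictionaries.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count the character n-grams of a text after normalising case and whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count profiles (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(count * count for count in a.values()))
    norm_b = sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_dictionary(document, dictionaries):
    """Return the best-matching dictionary name and all similarity scores."""
    doc_profile = char_ngrams(document)
    scores = {
        name: cosine_similarity(doc_profile, char_ngrams(" ".join(entries)))
        for name, entries in dictionaries.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy specialised dictionaries (hypothetical data).
    dictionaries = {
        "medicine": ["diagnosis", "therapy", "patient", "surgery"],
        "law": ["contract", "plaintiff", "statute", "liability"],
    }
    name, scores = best_dictionary(
        "The patient received therapy after surgery", dictionaries
    )
    print(name, scores)
```

Character n-grams need no tokenisation, so the same comparison works for languages that are hard to segment.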
For more in-depth information on the technology behind Gymn@zilla, see our scientific publications about the project.
Last update: 16.10.2008