Programming tasks

Overview and intro

This document describes tasks for a programmer on the project.

Above all, the project needs a programmer who understands the project, who reads the documentation in the gt/doc directory, and who reads (large parts of) the Beesley/Karttunen book, which presents the framework the project builds on.

In exchange, the project offers insight into language technology, into practical applications of finite state transducers, and into the use of Perl, CVS, make and other practical Unix tools.

Corpus interface

The project has licensed corpus software from IMS Stuttgart (CQP, the query processor of the IMS Corpus Workbench). This software must be installed, along with the available corpus material. Parallel corpus texts will become available (starting with the Bible), and they must be aligned.

There is documentation available for the IMS corpus (cf. link above).

Cgi-bin scripts for the parsers

The parsers are currently available through a rather crude web interface (cf. the external web page of the project). It can be improved in several ways.

Reading code, evaluating parsers

The code of the parsers is written in three different formalisms (they are documented on the tools page).

A programmer should read the code and look for efficiency problems: is the code written in an optimal fashion?

A dictionary interface

One natural extension of the work on parsers is to make intelligent dictionaries. With a morphological system claiming that "boad1án" is first person singular present, and with a dictionary saying that its infinitive "boahtit" corresponds to "komme", we want to conclude that "boad1án" should be translated as "jeg kommer", and vice versa. We thus need a system for an intelligent dictionary, and a web interface for it.
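
The lookup described above can be sketched roughly as follows. This is only an illustration in Python, not project code: the tag names ("Prs", "Sg1"), the toy lexicon, the pronoun table and the rule "infinitive + r" for the Norwegian present tense are all assumptions made for the example. It covers only the direction from an analysed form to a translation.

```python
from typing import List, Optional

# Toy data: every entry here is an illustrative assumption, not project data.
LEXICON = {"boahtit": "komme"}  # infinitive in the source language -> Norwegian infinitive
PRONOUNS = {"Sg1": "jeg", "Sg2": "du"}  # hypothetical person tags -> Norwegian pronouns


def norwegian_present(infinitive: str) -> str:
    """Form the present tense of a regular Norwegian verb (infinitive + 'r')."""
    return infinitive + "r"


def translate(lemma: str, tags: List[str]) -> Optional[str]:
    """Translate one analysed present-tense verb form, or return None if unsupported."""
    target = LEXICON.get(lemma)
    person = next((t for t in tags if t in PRONOUNS), None)
    if target is None or person is None or "Prs" not in tags:
        return None
    return PRONOUNS[person] + " " + norwegian_present(target)


# The analysis of "boad1án" is assumed to come from the morphological parser.
print(translate("boahtit", ["V", "Ind", "Prs", "Sg1"]))  # prints "jeg kommer"
```

A real system would need generation on the target side from a proper morphological component rather than a suffix rule, and the reverse direction would need the same on the source side.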

A preprocessor for text

A preprocessor is a program that divides running text into sentences and words. The preprocessor is documented here. It should be improved.

In the development process, a Perl preprocessor is probably better than an xfst tokenize preprocessor. The compilation time for tok.txt is far too long for the development phase.
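
To make the task concrete, here is a minimal sketch of what a preprocessor does. It is written in Python for illustration only (the project's own preprocessor is not shown here), and the token pattern and the set of sentence-final characters are assumptions chosen for the example.

```python
import re

# One token is either a run of word characters (optionally with internal
# hyphens or apostrophes) or a single non-space punctuation character.
TOKEN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")
SENTENCE_END = {".", "!", "?"}


def preprocess(text):
    """Split running text into sentences, each a list of word and punctuation tokens."""
    sentences, current = [], []
    for token in TOKEN.findall(text):
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # trailing text without sentence-final punctuation
        sentences.append(current)
    return sentences


print(preprocess("One sentence. Another one!"))
# prints [['One', 'sentence', '.'], ['Another', 'one', '!']]
```

This naive version treats every full stop as a sentence boundary, so abbreviations such as "cf." split a sentence in the wrong place. Handling abbreviations, numbers and multi-word expressions is where the real work lies.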

Preprocessing the input to vislcg

Today, this is done by the lookup2cg script. This script has severe problems with compounds.

The issue needs serious thinking.

Examples:

"<strategiijaplánabargu>"
        "strategiija" N Sg Nom # plána N Sg Nom # bargu N Sg Nom
        "strategiija" N Sg Nom # plána N Sg Gen # bargu N Sg Nom
        "strategiija" N Sg Gen # plána N Sg Nom # bargu N Sg Nom
        "strategiija" N Sg Gen # plána N Sg Gen # bargu N Sg Nom
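
One possible way to handle such readings is sketched below in Python. It is not the lookup2cg script, and the strategy is an assumption: readings are split at the "#" compound boundary, the lemmas of the parts are joined, and only the tags of the last part are kept, so that readings differing only in the case of non-final parts collapse into one. The function names are hypothetical.

```python
def parse_reading(line):
    """Split one reading line into (lemma, tags) pairs, one per compound part."""
    parts = []
    for chunk in line.strip().split(" # "):
        lemma, *tags = chunk.split()
        parts.append((lemma.strip('"'), tuple(tags)))
    return parts


def collapse(readings):
    """Merge readings that differ only in the tags of non-final compound parts."""
    seen, result = set(), []
    for line in readings:
        parts = parse_reading(line)
        lemma = "".join(part[0] for part in parts)
        key = (lemma, parts[-1][1])  # joined lemma plus tags of the last part
        if key not in seen:
            seen.add(key)
            result.append('"%s" %s' % (lemma, " ".join(key[1])))
    return result


readings = [
    '"strategiija" N Sg Nom # plána N Sg Nom # bargu N Sg Nom',
    '"strategiija" N Sg Nom # plána N Sg Gen # bargu N Sg Nom',
    '"strategiija" N Sg Gen # plána N Sg Nom # bargu N Sg Nom',
    '"strategiija" N Sg Gen # plána N Sg Gen # bargu N Sg Nom',
]
print(collapse(readings))  # prints ['"strategiijaplánabargu" N Sg Nom']
```

Whether discarding the internal structure of the compound is acceptable for disambiguation is exactly the kind of question that needs to be decided before the script is rewritten.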


Trond Trosterud
Last modified: Thu Jan 15 13:07:45 GMT 2004