Programming tasks
Overview and intro
The present document specifies relevant tasks for a programmer in this
project.
Above all, the project needs a programmer who understands the
project, who reads the documentation in the gt/doc catalog, and who
reads (large parts of) the Beesley and Karttunen
book that presents the framework of the project.
In exchange, the project offers insight into language technology, into
practical applications of finite-state transducers, and into the use
of perl, cvs, make and other practical unix tools.
The project has licensed corpus software tools from IMS Stuttgart (the
CQP query processor of the IMS Corpus Workbench). This software must be
installed, and the available corpus material must be installed as
well. Parallel corpus texts will be available (starting with the
Bible); they must be aligned.
There is documentation available for the IMS corpus (cf. link
above).
The parsers are today available in a rather crude web interface
(cf. the external web
page of the project). It can be improved in several ways:
- The Sami letters should be made visible in the output (today they
appear as the ASCII substitutes c1, d1, n1, s1, t1, z1). The cgi-bin
code should be adjusted to generate Unicode html output.
- There are today three parsers available, differing only in the
language of the output. There should be a function for choosing the
output language (i.e., for choosing between the three parsers sme.fst,
n-sme.fst and s-sme.fst).
- A scheme should be devised to make the generation of word forms
graphically based instead of text-based, as it is today (cf. separate
document for drawing and explanation).
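The first improvement above, replacing the ASCII substitutes with real
Sami letters, amounts to a simple substitution table in the cgi-bin
code. The pairings below (c1 for č, d1 for đ, and so on) are an
assumption about the transliteration scheme, and the project's own
cgi-bin code is perl rather than Python, so this is only an
illustrative sketch:

```python
# Sketch: map the ASCII substitutes used in the parser output to the
# corresponding Unicode Sami letters. The pairings are an assumption
# based on the letters listed above; the real scheme may differ, and
# uppercase variants would be handled analogously.
SAMI_LETTERS = {
    "c1": "\u010d",  # č
    "d1": "\u0111",  # đ
    "n1": "\u014b",  # ŋ
    "s1": "\u0161",  # š
    "t1": "\u0167",  # ŧ
    "z1": "\u017e",  # ž
}

def to_unicode(text):
    """Replace every ASCII substitute with its Unicode letter."""
    for ascii_form, letter in SAMI_LETTERS.items():
        text = text.replace(ascii_form, letter)
    return text
```

With a table like this in place, the cgi-bin script only needs to
declare a Unicode charset in the generated html and filter the parser
output through the substitution.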
The code of the parsers is written in three different formalisms (they are documented on the tools page):
- twolc, a formalism for morphophonological rules
- lexc, a formalism representing the lexicon and the inflectional
morphology as a b-tree
- xfst, a tool for converting regular expressions into
finite-state automata. The xfst formalism is used to make a
preprocessor, a device that divides text into sentences and sentences
into words.
A programmer should read the code and have a look at optimality
issues: is the code written in an optimal fashion?
One natural extension of the work on parsers is to make intelligent
dictionaries. With a morphological system claiming that "boađán" is
present singular first person, and with a dictionary saying that its
infinitive "boahtit" equals "komme", we want to say that "boađán"
should be translated as "jeg kommer", and vice versa. We thus need a
system for an intelligent dictionary, and a web interface for it.
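The pipeline described above (analyse the inflected form, translate
the lemma, generate the corresponding target form) can be sketched
with toy lookup tables. In the real system the analyses would come
from sme.fst and the lemma pairs from a bilingual lexicon; all the
data and names below are hypothetical:

```python
# Toy morphological analyser output: word form -> (lemma, tags).
ANALYSES = {
    "boađán": ("boahtit", ("V", "Prs", "Sg1")),
}

# Toy bilingual lexicon: Sami lemma -> Norwegian lemma.
SME_NOB = {
    "boahtit": "komme",
}

# Toy Norwegian generator: (lemma, tags) -> inflected phrase.
GENERATE_NOB = {
    ("komme", ("V", "Prs", "Sg1")): "jeg kommer",
}

def translate(wordform):
    """Translate an inflected Sami word form into an inflected
    Norwegian phrase, or return None if any lookup fails."""
    analysis = ANALYSES.get(wordform)
    if analysis is None:
        return None
    lemma, tags = analysis
    target_lemma = SME_NOB.get(lemma)
    if target_lemma is None:
        return None
    return GENERATE_NOB.get((target_lemma, tags))
```

The point of the sketch is the composition of the three lookups; the
hard work lies in the generation step, which needs a morphological
transducer for the target language rather than a table.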
A preprocessor is a program that divides running text into sentences
and words. The preprocessor is documented here. It should be improved.
In the development process, a perl preprocessor is probably better
than an xfst tokenize preprocessor: the compilation time for tok.txt
is far too long during the development phase.
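A minimal sketch of such a preprocessor, splitting text into
sentences and sentences into words, is shown below. The project's own
version would be written in perl and must also handle abbreviations,
numbers and multiword expressions; this naive regex version is only
meant to show the intended input/output shape (one token per line, a
blank line between sentences):

```python
import re

def preprocess(text):
    """Toy preprocessor: split running text into sentences, then emit
    one token per line with a blank line between sentences. Sentence
    boundaries are naively taken to be ., ! or ? followed by
    whitespace, which a real preprocessor cannot assume."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    blocks = []
    for sentence in sentences:
        # Word characters form tokens; each punctuation mark is its
        # own token.
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        blocks.append("\n".join(tokens))
    return "\n\n".join(blocks)
```
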
Preprocessing the input to vislcg
Today, this is done by the lookup2cg script. This script has severe
problems with compounds:
- Cohorts with identical analyses should be unified, but they are not.
- The quotation marks are wrapped around the first part of the compound only.
- It is hard to recover the dictionary form of inflected compounds.
The issue needs serious thinking.
Examples:
""
"strategiija" N Sg Nom # plána N Sg Nom # bargu N Sg Nom
"strategiija" N Sg Nom # plána N Sg Gen # bargu N Sg Nom
"strategiija" N Sg Gen # plána N Sg Nom # bargu N Sg Nom
"strategiija" N Sg Gen # plána N Sg Gen # bargu N Sg Nom
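The first of the three problems, unifying cohorts with identical
analyses, amounts to deduplicating the reading lines of a cohort while
preserving their order. lookup2cg itself is a perl script; the sketch
below is an illustrative Python version of just that step:

```python
def unify_cohort(readings):
    """Remove duplicate analysis lines from one cohort, keeping the
    first occurrence of each line and the original order. This is the
    unification step that lookup2cg is missing."""
    seen = set()
    unified = []
    for reading in readings:
        if reading not in seen:
            seen.add(reading)
            unified.append(reading)
    return unified
```

The other two problems (quotation mark placement and recovering the
dictionary form of inflected compounds) are harder, since they require
reassembling the compound from its parts rather than filtering lines.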
Trond Trosterud
Last modified: Thu Jan 15 13:07:45 GMT 2004