Programming tasks
Overview and intro
The present document specifies relevant tasks for a programmer in this
project.
Above all, the project needs a programmer who understands the
project, who reads the documentation in the gt/doc catalog, and who
reads (large parts of) the Beesley and Karttunen
book that presents the framework of the project.
In exchange, the project offers insight into language technology, into
practical applications of finite-state transducers, and into the use
of perl, cvs, make and other practical unix tools.
The project has licensed corpus software tools from IMS Stuttgart (the
CQP query processor of the IMS Corpus Workbench). This software must be
installed, and the available corpus material must be installed as
well. Parallel corpus texts will be available (starting with the
Bible); they must be aligned.
There is documentation available for the IMS corpus (cf. link
above).
The parsers are today available in a rather crude web interface
(cf. the external web
page of the project). It can be improved in several ways:
- The Sami letters should be made visible in the output (today they
appear as the ASCII substitutes c1, d1, n1, s1, t1, z1). The cgi-bin
code should be adjusted to generate Unicode html output.
- There are today three parsers available, differing only in the
language of the output. There should be a function for choosing the
output language (i.e., for choosing between the three parsers sme.fst,
n-sme.fst and s-sme.fst).
- A scheme should be devised to make the generation of word forms
graphically based instead of text-based, as it is today (cf. separate
document for drawing and explanation).
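The first improvement above, replacing the ASCII substitutes with real
Sami letters, amounts to a simple substitution table in the cgi-bin
code. The pairings below (c1 for č, d1 for đ, and so on) are an
assumption about the transliteration scheme, and the project's own
cgi-bin code is perl rather than Python, so this is only an
illustrative sketch:

```python
# Sketch: map the ASCII substitutes used in the parser output to the
# corresponding Unicode Sami letters. The pairings are an assumption
# based on the letters listed above; the real scheme may differ, and
# uppercase variants would be handled analogously.
SAMI_LETTERS = {
    "c1": "\u010d",  # č
    "d1": "\u0111",  # đ
    "n1": "\u014b",  # ŋ
    "s1": "\u0161",  # š
    "t1": "\u0167",  # ŧ
    "z1": "\u017e",  # ž
}

def to_unicode(text):
    """Replace every ASCII substitute with its Unicode letter."""
    for ascii_form, letter in SAMI_LETTERS.items():
        text = text.replace(ascii_form, letter)
    return text
```

With a table like this in place, the cgi-bin script only needs to
declare a Unicode charset in the generated html and filter the parser
output through the substitution.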
The code of the parsers is written in three different formalisms (they are documented on the tools page):
- twolc, a formalism for morphophonological rules
- lexc, a formalism representing the lexicon and the inflectional
morphology as a b-tree
- xfst, a tool for converting regular expressions into
finite-state automata. The xfst formalism is used to make a
preprocessor, a device that divides text into sentences and sentences
into words.
A programmer should read the code and have a look at optimality
issues: is the code written in an optimal fashion?
One natural extension of the work on parsers is to make intelligent
dictionaries. With a morphological system claiming that "boađán" is
present singular first person, and with a dictionary saying that its
infinitive "boahtit" equals "komme", we want to say that "boađán"
should be translated as "jeg kommer", and vice versa. We thus need a
system for an intelligent dictionary, and a web interface for it.
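The pipeline described above (analyse the inflected form, translate
the lemma, generate the corresponding target form) can be sketched
with toy lookup tables. In the real system the analyses would come
from sme.fst and the lemma pairs from a bilingual lexicon; all the
data and names below are hypothetical:

```python
# Toy morphological analyser output: word form -> (lemma, tags).
ANALYSES = {
    "boađán": ("boahtit", ("V", "Prs", "Sg1")),
}

# Toy bilingual lexicon: Sami lemma -> Norwegian lemma.
SME_NOB = {
    "boahtit": "komme",
}

# Toy Norwegian generator: (lemma, tags) -> inflected phrase.
GENERATE_NOB = {
    ("komme", ("V", "Prs", "Sg1")): "jeg kommer",
}

def translate(wordform):
    """Translate an inflected Sami word form into an inflected
    Norwegian phrase, or return None if any lookup fails."""
    analysis = ANALYSES.get(wordform)
    if analysis is None:
        return None
    lemma, tags = analysis
    target_lemma = SME_NOB.get(lemma)
    if target_lemma is None:
        return None
    return GENERATE_NOB.get((target_lemma, tags))
```

The point of the sketch is the composition of the three lookups; the
hard work lies in the generation step, which needs a morphological
transducer for the target language rather than a table.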
A preprocessor is a program that divides running text into sentences
and words. The preprocessor is documented here. It should be improved.
In the development process, a perl preprocessor is probably better
than an xfst tokenize preprocessor: the compilation time for tok.txt
is far too long during the development phase.
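A minimal sketch of such a preprocessor, splitting text into
sentences and sentences into words, is shown below. The project's own
version would be written in perl and must also handle abbreviations,
numbers and multiword expressions; this naive regex version is only
meant to show the intended input/output shape (one token per line, a
blank line between sentences):

```python
import re

def preprocess(text):
    """Toy preprocessor: split running text into sentences, then emit
    one token per line with a blank line between sentences. Sentence
    boundaries are naively taken to be ., ! or ? followed by
    whitespace, which a real preprocessor cannot assume."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    blocks = []
    for sentence in sentences:
        # Word characters form tokens; each punctuation mark is its
        # own token.
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        blocks.append("\n".join(tokens))
    return "\n\n".join(blocks)
```
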
Preprocessing the input to vislcg
Today, this is done by the lookup2cg script. This script has severe
problems with compounds:
- Cohorts with identical analyses should be unified, but they are not.
- The quotation marks are wrapped around the first part of the compound only.
- It is hard to recover the dictionary form of inflected compounds.
The issue needs serious thinking.
Examples:
""
"strategiija" N Sg Nom # plána N Sg Nom # bargu N Sg Nom
"strategiija" N Sg Nom # plána N Sg Gen # bargu N Sg Nom
"strategiija" N Sg Gen # plána N Sg Nom # bargu N Sg Nom
"strategiija" N Sg Gen # plána N Sg Gen # bargu N Sg Nom
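The first of the three problems, unifying cohorts with identical
analyses, amounts to deduplicating the reading lines of a cohort while
preserving their order. lookup2cg itself is a perl script; the sketch
below is an illustrative Python version of just that step:

```python
def unify_cohort(readings):
    """Remove duplicate analysis lines from one cohort, keeping the
    first occurrence of each line and the original order. This is the
    unification step that lookup2cg is missing."""
    seen = set()
    unified = []
    for reading in readings:
        if reading not in seen:
            seen.add(reading)
            unified.append(reading)
    return unified
```

The other two problems (quotation mark placement and recovering the
dictionary form of inflected compounds) are harder, since they require
reassembling the compound from its parts rather than filtering lines.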
Trond Trosterud
Last modified: Thu Jan 15 13:07:45 GMT 2004