Sámi language technology project
Introduction
The project has an external home page, Sámi giellateknologija.
This file presents the files and resources of the project. At present,
the goal is to build a morphological parser for Northern Sámi,
and to build a parser for Southern Sámi nouns. During the
project period I will aim at creating a morphological disambiguator
for Northern Sámi.
Documentation files
Lexicon files and rule files
Northern Sámi
At present, the project is in a preparation phase, and it still has
not found its home. The linguistic groundwork for the Northern
Sámi project was done by Pekka Sammallahti in 1993; his input
can be found in the directory 93-originals/. At present, the Northern Sámi
files included are:
Pekka's original files were lexicon-saame.txt (a preliminary
lexicon file), LEXITWOL.doc (a slightly different version
of the same file, with more lexicon explanations; the two were unified
into the present files), and ADJ-TWOL.doc and NOMENAT.doc,
for nouns and adjectives. In December Pekka handed over the other
dictionary files: for nouns and adjectives (these files are approx. 1/4
larger than the ones in sme.save today), for verbs, adverbs,
and closed parts of speech. The December files are not in Xerox
format, so they must be converted, assigned to appropriate
sublexica, etc.
Southern Sámi
The original Moshagen and Trosterud files are found at Lingsoft, in
the sms directory. We wrote an article for NJL on
Later, we received some
comments from Lauri Karttunen that we are about to incorporate and
evaluate. These files have not yet been made available, but they will be
included eventually. As part of the evaluation of Karttunen's
comments, the Southern Sámi lexicon should be converted from
Lingsoft format to Xerox format.
Documentation files
TODO-list
The project still has no home, and hence no file structure, no CVS
repository, etc. Trond Trosterud is currently working on the project, but
the intention is to maintain it in a way that makes it possible to
include others as well. The project needs technical and linguistic
clarifications on the following points:
- The project needs a home. Negotiations with the university's
computer department are under way.
- The project needs a directory structure and a file
structure. Northern and Southern Sámi files should be stored
separately, as should the program files.
- The project needs CVS, and a structure that makes it possible to
add new workers.
- The localisation issue must be solved (see below).
- The 1993 files are not documented. The documentation process has started (see above), but there is still work to do.
Cooperation with support people is needed on all of the above points.
Tools
Xerox has delivered the following tools, under its normal non-commercial conditions:
- lexc
- lookup
- tokenize
- tokeniz.fst
- twolc
- xfst
They can at present be found on a local disc only.
The tools are documented in the forthcoming 600-page (!) Karttunen /
Beesley book Finite-State Morphology: Xerox Tools and
Techniques (available to project workers, contact Trond).
Should we be able to extend the project to practical
applications, such as spell-checkers and the like, it could eventually be
possible to use appropriate Lingsoft tools.
Localisation
At present (November 2001), the project is run in a 7-bit fashion, with
digraphs (a1, c1, d1, n1, s1, t1, z1) for the 7 Sámi
letters. This is an ad hoc solution. We hope to migrate either to
UTF-8 or to an 8-bit format, either ISO-IR 197 or Latin 4. Both
Linux localisers and the Xerox tools manuals boast UTF-8
compatibility, so in theory this should be possible. Still, Xerox advises us to use an 8-bit solution internally. This must be sorted out.
Should we go for UTF-8, the following must be in place:
- The Linux/Unix platform of the project must be UTF-8 enabled
- We must find out how the Xerox tools handle UTF-8 in practice.
- We must make a Northern Sámi keyboard for UTF-8
- Existing files must be converted to UTF-8
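The last point, converting the existing 7-bit files, could be sketched roughly as follows. This is only an illustration in Python, not a tool the project has. The mapping from digraphs to letters (a1 = á, c1 = č, d1 = đ, n1 = ŋ, s1 = š, t1 = ŧ, z1 = ž) and the uppercase digraphs (A1, C1, ...) are assumptions, and the function name digraphs_to_unicode is hypothetical.

```python
import re

# ASSUMPTION: the pairing of each digraph with a Sámi letter is inferred
# from the Northern Sámi alphabet; check it against the actual files.
LOWERCASE_DIGRAPHS = {
    "a1": "á", "c1": "č", "d1": "đ", "n1": "ŋ",
    "s1": "š", "t1": "ŧ", "z1": "ž",
}

def build_table(lowercase):
    """Return the lowercase mapping extended with assumed uppercase digraphs."""
    table = dict(lowercase)
    for digraph, letter in lowercase.items():
        table[digraph.upper()] = letter.upper()  # e.g. "C1" -> "Č"
    return table

TABLE = build_table(LOWERCASE_DIGRAPHS)
PATTERN = re.compile("|".join(re.escape(key) for key in TABLE))

def digraphs_to_unicode(text):
    """Replace every known digraph in text with its Unicode letter."""
    return PATTERN.sub(lambda match: TABLE[match.group(0)], text)

if __name__ == "__main__":
    word = digraphs_to_unicode("c1a1hci")
    print(word)                        # the converted word
    print(len(word.encode("utf-8")))   # its size in bytes once stored as UTF-8
```

A blind replacement like this would also rewrite any genuine letter-plus-digit sequence (for instance a lexicon name ending in a1), so real files would need to be checked for such cases before conversion.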
Latin 4 and ISO-IR 197, the two 8-bit code tables, are both supported by
iconv, and both contain the required symbols. Of the two, Latin 4
might be better supported, but ISO-IR 197 is a true superset of the
alphabetic repertoire of Latin 1, and should thus give no
compatibility problems with Latin 1 input. In the long run, Unicode
and UTF-8 are still the desired output, and migrating directly from 7
bit to UTF-8 seems a better solution. Support in Emacs,
shells, etc. is crucial.
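Whether a given code table contains the required symbols can be checked mechanically. The sketch below uses Python's built-in codecs rather than iconv, and the helper name missing_letters is hypothetical. ISO-IR 197 is left out because, to our knowledge, the Python standard library has no codec for it; it would have to be tested with iconv instead.

```python
# The seven special Northern Sámi letters, lowercase and uppercase.
SAMI_LETTERS = "áčđŋšŧž" + "ÁČĐŊŠŦŽ"

def missing_letters(codec, letters=SAMI_LETTERS):
    """Return the letters that cannot be encoded in the named codec."""
    missing = []
    for letter in letters:
        try:
            letter.encode(codec)
        except UnicodeEncodeError:
            missing.append(letter)
    return missing

if __name__ == "__main__":
    # "iso8859_4" is Python's name for Latin 4, "latin_1" for Latin 1.
    for codec in ("latin_1", "iso8859_4", "utf_8"):
        print(codec, "".join(missing_letters(codec)) or "(nothing missing)")
```

Latin 1 itself only covers á/Á, which is why one of the other code tables, or UTF-8, is needed.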
Trond.Trosterud@hum.uit.no
Last modified: Thu Dec 20 18:28:18 CET 2001