Sámi language technology project

Introduction

The project has an external home page, Sámi giellateknologija.

This file presents the files and resources of the project. At present, the goal is to build a morphological parser for Northern Sámi, and to build a parser for Southern Sámi nouns. During the project period I will aim at creating a morphological disambiguator for Northern Sámi.

Documentation files

The project application (for general background on the project; note that the present resources enable us to do only approx. 1/4 of what is sketched in that document).
Documentation of the Northern Sámi twol rule file twol-sme.txt.
Documentation of the Northern Sámi lexicon files *sme-lex.txt; the files are listed below.
A test diary, documenting test results
An article documenting the Southern Sámi parser, written by Sjur Moshagen and Trond Trosterud
The research project uses Xerox Finite-State technology

Lexicon files and rule files

Northern Sámi

At present, the project is in a preparation phase, and it still has not found its home. The linguistic groundwork for the Northern Sámi project was done by Pekka Sammallahti in 1993; his input can be found in the directory 93-originals/. At present, the Northern Sámi files included are:

sme-lex.txt, the grammatical lexica.
noun-sme-lex.txt, the nouns.
adj-sme-lex.txt, the adjectives.
verb-sme-lex.txt, the verbs.
twol-sme.txt, the twol rules file.

Pekka's original files were lexicon-saame.txt (a preliminary lexicon file), LEXITWOL.doc (another, slightly different version of the same file, with more lexicon explanations; the two were unified into the present files), and ADJ-TWOL.doc and NOMENAT.doc, the nouns and adjectives. In December Pekka handed over the other dictionary files: for nouns and adjectives (these files are approx. 1/4 larger than the ones in sme.save today), for verbs, for adverbs, and for closed parts of speech. The December files are not in Xerox format; they must thus be converted, assigned to appropriate sublexica, etc.

Southern Sámi

The original Moshagen and Trosterud files are found at Lingsoft, in the sms directory. We wrote an article for NJL on the parser (see the documentation files below). Later, we received some comments from Lauri Karttunen that we are about to evaluate and incorporate. The files have not yet been made available, but they will be included eventually. As part of the evaluation of Karttunen's comments, the Southern Sámi lexicon should be converted from Lingsoft format to Xerox format.

Documentation files

A Two-level Parser for Southern Sámi, the Moshagen and Trosterud article submitted to NJL, and the most complete documentation of the Southern Sámi analysis.
Documentation of the rules file
Documentation of the lexicon files

TODO-list

The project still has no home, and hence no file structure, no CVS repository, etc. Trond Trosterud is currently the only person working on the project, but the intention is to maintain it in a way that makes it possible to include others as well. The project needs technical and linguistic clarifications on the following points:

The project needs a home. Negotiations with the university's computer department are going on.
The project needs a directory structure and a file structure. Northern and Southern Sámi files should be stored separately, as should the program files.
The project needs CVS, and a structure that makes it possible to add new workers.
The localisation issue must be solved (see below).
The 1993 files are not documented. The documentation process has started (see above), but there is still work to do.

Cooperation with support people is needed on all of the above points.

Tools

Xerox has delivered the following tools, at normal non-commercial conditions:

lexc
lookup
tokenize
tokeniz.fst
twolc
xfst

At present they can be found on a local disk only.

The tools are documented in the forthcoming 600-page (!) Karttunen / Beesley book Finite-State Morphology: Xerox Tools and Techniques (available to project workers, contact Trond).

If we are able to extend the project to practical applications, such as spell-checkers and the like, it could eventually be possible to use appropriate Lingsoft tools.

Localisation

At present (November 2001), the project is run in a 7-bit fashion, with digraphs (a1, c1, d1, n1, s1, t1, z1) for the 7 Sámi letters. This is an ad hoc solution. We hope to migrate either to UTF-8 or to an 8-bit format, either ISO-IR 197 or Latin 4. Both Linux localisers and the Xerox tools manuals boast UTF-8 compatibility; thus, in theory, this should be possible. Still, Xerox advises us to use an 8-bit solution internally. This must be sorted out.

Should we go for UTF-8, the following must be in place:

The Linux/Unix platform of the project must be UTF-8 enabled
We must find out how the Xerox tools handle UTF-8 in practice.
We must make a Northern Sámi keyboard for UTF-8
Existing files must be converted to UTF-8
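
The last point, converting the existing digraph files, could be sketched in Python roughly as follows. Note that the mapping from digraphs to letters is an assumption: this page lists the digraphs but not the letters they stand for, so the table below simply pairs them with the seven Northern Sámi letters á, č, đ, ŋ, š, ŧ, ž. The function names are hypothetical as well, and the sketch assumes that a letter followed by the digit 1 never occurs in the files with any other meaning.

```python
import re

# ASSUMPTION: the page does not spell out this mapping; it pairs the seven
# digraphs with the seven Northern Sámi letters outside ASCII.
DIGRAPHS = {
    "a1": "á", "c1": "č", "d1": "đ", "n1": "ŋ",
    "s1": "š", "t1": "ŧ", "z1": "ž",
}
# Add upper-case variants (A1 -> Á, etc.).
DIGRAPHS.update({k.upper(): v.upper() for k, v in list(DIGRAPHS.items())})

# Matches any of the digraphs, e.g. "c1" or "D1".
DIGRAPH_RE = re.compile(r"[acdnstzACDNSTZ]1")


def digraphs_to_unicode(text):
    """Replace every 7-bit digraph in text by the corresponding letter."""
    return DIGRAPH_RE.sub(lambda m: DIGRAPHS[m.group(0)], text)


def convert_file(src_path, dst_path):
    """Read a 7-bit digraph file and write it back out as UTF-8."""
    with open(src_path, encoding="ascii") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(digraphs_to_unicode(text))
```

A call such as digraphs_to_unicode("c1a1hci") would then give "čáhci". Lexicon and rule files may well contain digit characters with other functions, so a real conversion would have to be checked against the actual files first.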

Latin 4 and ISO-IR 197, the two 8-bit code tables, are both supported by iconv, and both contain the required symbols. Of the two, Latin 4 might be better supported, but ISO-IR 197 is a true superset of the alphabetic repertoire of Latin 1, and should thus give no compatibility problems with Latin 1 input. In the long run, Unicode and UTF-8 are still the desired target, and migrating directly from 7 bit to UTF-8 seems a better solution. Support in Emacs, shells, etc. is crucial.
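
The code-table question can be checked with a small Python sketch. It tests whether the seven Sámi letters can be encoded in a given code table, and shows a Latin 4 to UTF-8 conversion comparable to what iconv would do. The helper names are hypothetical. Only Latin 4 (Python codec iso8859_4) is tested, since Python's standard library does not, as far as I know, include a codec for ISO-IR 197.

```python
# The seven Northern Sámi letters outside ASCII, lower and upper case.
SAMI_LETTERS = "áčđŋšŧžÁČĐŊŠŦŽ"


def can_encode(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def latin4_to_utf8(data):
    """Re-encode Latin 4 (ISO 8859-4) bytes as UTF-8 bytes."""
    return data.decode("iso8859_4").encode("utf-8")


if __name__ == "__main__":
    for codec in ("iso8859_4", "latin-1", "utf-8"):
        print(codec, can_encode(SAMI_LETTERS, codec))
```

Latin 1 fails this check, which is why it is not a candidate here. In Latin 4 each Sámi letter takes one byte, and in UTF-8 two bytes.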

Trond.Trosterud@hum.uit.no

Last modified: Thu Dec 20 18:28:18 CET 2001