Sámi language technology project

Introduction

The project has an external home page, Sámi giellateknologija.

This file presents the files and resources of the project. At present, the goal is to build a morphological parser for Northern Sámi and a parser for Southern Sámi nouns. During the project period, I also aim to create a morphological disambiguator for Northern Sámi.

Documentation files

Lexicon files and rule files

Northern Sámi

At present, the project is in a preparation phase, and it still has not found its home. The linguistic groundwork for the Northern Sámi project was done by Pekka Sammallahti in 1993; his input can be found in the directory 93-originals/. At present, the Northern Sámi files included are:

Pekka's original files were lexicon-saame.txt (a preliminary lexicon file), LEXITWOL.doc (a slightly different version of the same file, with more lexicon explanations; the two were unified into the present files), and ADJ-TWOL.doc and NOMENAT.doc (the nouns and adjectives). In December, Pekka handed over the other dictionary files: for nouns and adjectives (these files are approx. 1/4 larger than the ones in sme.save today), for verbs, for adverbs, and for closed parts of speech. The December files are not in Xerox format, so they must be converted, assigned to appropriate sublexica, etc.

Southern Sámi

The original Moshagen and Trosterud files are found at Lingsoft, in the sms directory. We wrote an article for NJL. Later, we received some comments from Lauri Karttunen, which we are about to evaluate and incorporate. These files have not yet been made available, but they will eventually be included. As part of the evaluation of Karttunen's comments, the Southern Sámi lexicon should be converted from Lingsoft format to Xerox format.

Documentation files

TODO-list

The project still has no home, and hence no file structure, no CVS repository, etc. Trond Trosterud is currently the only one working on the project, but the intention is to maintain it in a way that makes it possible to include others as well. The project needs technical and linguistic clarifications on the following points:

  1. The project needs a home. Negotiations with the university's computer department are ongoing.
  2. The project needs a directory structure and a file structure. Northern and Southern Sámi files should be stored separately, as should the program files.
  3. The project needs CVS, and a structure that makes it possible to add new workers.
  4. The localisation issue must be solved (see below).
  5. The 1993 files are not documented. The documentation process has started (see above), but there is still work to do.
Cooperation with support people is needed on all of the above points.

Tools

Xerox has delivered the following tools, at normal non-commercial conditions:

They can at present be found on a local disc only.

The tools are documented in the forthcoming 600-page (!) Karttunen / Beesley book Finite-State Morphology: Xerox Tools and Techniques (available to project workers, contact Trond).

If we are able to extend the project to practical applications such as spell-checkers, it may eventually be possible to use appropriate Lingsoft tools.

Localisation

At present (November 2001), the project is run in a 7-bit fashion, with digraphs (a1, c1, d1, n1, s1, t1, z1) for the 7 Sámi letters. This is an ad hoc solution. We hope to migrate either to UTF-8 or to an 8-bit format, either ISO-IR 197 or Latin 4. Both Linux localisers and the Xerox tools manuals boast UTF-8 compatibility, so in theory this should be possible. Still, Xerox advises us to use an 8-bit solution internally. This must be sorted out.
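The migration away from the digraphs can be sketched roughly as follows. This is only an illustration, not the project's actual conversion tool: the mapping from each digraph to a letter (a1 to á, c1 to č, and so on) is an assumption, as is the idea that capitals are written A1, C1, etc., and the function names are hypothetical.

```python
import re

# ASSUMPTION: the digraph-to-letter mapping below is a guess based on the
# seven Northern Sámi letters; the project files may use a different scheme.
DIGRAPHS = {
    "a1": "á", "c1": "č", "d1": "đ", "n1": "ŋ",
    "s1": "š", "t1": "ŧ", "z1": "ž",
}
# ASSUMPTION: capital letters follow the same pattern (A1, C1, ...).
DIGRAPHS.update({k.upper(): v.upper() for k, v in list(DIGRAPHS.items())})

# Reverse table for converting Unicode text back to the 7-bit notation.
LETTERS = {letter: digraph for digraph, letter in DIGRAPHS.items()}

_DIGRAPH_RE = re.compile("|".join(re.escape(d) for d in DIGRAPHS))


def digraphs_to_unicode(text: str) -> str:
    """Replace every known digraph with the corresponding Sámi letter."""
    return _DIGRAPH_RE.sub(lambda m: DIGRAPHS[m.group(0)], text)


def unicode_to_digraphs(text: str) -> str:
    """Replace every Sámi letter with its 7-bit digraph."""
    return "".join(LETTERS.get(ch, ch) for ch in text)


def digraphs_to_utf8(text: str) -> bytes:
    """Convert digraph text to UTF-8 encoded bytes."""
    return digraphs_to_unicode(text).encode("utf-8")
```

Note that a conversion like this cannot tell a digraph from an ordinary letter followed by the digit 1, so any such sequences in the source files would have to be checked by hand.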

Should we go for UTF-8, the following must be in place:

The two 8-bit code tables, Latin 4 and ISO-IR 197, are both supported by iconv, and both contain the required symbols. Of the two, Latin 4 might be better supported, but ISO-IR 197 is a true superset of the alphabetic repertoire of Latin 1, and should thus give no compatibility problems with Latin 1 input. In the long run, Unicode and UTF-8 are still the desired output, and migrating directly from 7-bit to UTF-8 seems a better solution. Crucial here is support in Emacs, shells, etc.
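Whether a code table covers the required symbols can be checked mechanically. The sketch below does this for Latin 4 and Latin 1 using Python's standard codecs, and shows a Latin 4 to UTF-8 conversion of the kind iconv would perform. The letter inventory and the function names are assumptions made for the illustration; ISO-IR 197 is left out because a built-in codec for it cannot be assumed.

```python
# ASSUMPTION: these are the seven Sámi letters in question, in both cases.
SAMI_LETTERS = "áčđŋšŧžÁČĐŊŠŦŽ"


def can_encode(text: str, codec: str) -> bool:
    """Return True if every character in text exists in the given code table."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def latin4_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin 4 (ISO 8859-4) bytes as UTF-8."""
    return data.decode("iso8859_4").encode("utf-8")


if __name__ == "__main__":
    for codec in ("iso8859_4", "latin_1"):
        print(codec, can_encode(SAMI_LETTERS, codec))
```

Latin 1 fails this check because it lacks letters such as č and ŋ, which is why an 8-bit solution would have to be Latin 4 or ISO-IR 197 rather than the more common Latin 1.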


Trond.Trosterud@hum.uit.no
Last modified: Thu Dec 20 18:28:18 CET 2001