General information for new users

The document index.html (the one that contained the link to this page) is the index to the documentation of this Sámi language technology project, conducted at the University of Tromsø. Read the documentation; it is useful. At present, the goal is to build a morphological parser for Northern Sámi, and to build basic parsers for Southern and Lule Sámi. The other Latin-based Sámi written languages will also be looked into. We have also started creating a morphological disambiguator for Northern Sámi, using Constraint Grammar technology. The project application contains some general background information on the project (note that present resources enable us to do approximately 1/4 of what is sketched in that document).

Directory structure

The project is located in the directory gt/ (an acronym for giellateknologiija, language technology). These are the subdirectories (the abbreviations for the different languages are in accordance with the ISO standard for language codes):

doc/ = documentation files
script/ = script files
smi/ = files relevant to all the languages, e.g. proper names
sme/ = Northern Sámi
smj/ = Lule Sámi
sma/ = Southern Sámi
smn/ = Inari Sámi
sms/ = Skolt Sámi
www/ = directory for web-related issues
tmp/ = directory for temporary storage of script files under compilation

Each language directory has the following subdirectories:

bin/, the program files (these are automatically generated by the make command),
src/, the source files, our crown jewels,
dev/, developer's files (store your own notes here if needed),
corp/, corpus files (cf. the README file in the corp/ directory), and
testing/, containing files for morphology testing (how to conduct such testing is explained on the Testing tools page).
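
The layout described above can be checked with a short script. What follows is only a minimal sketch, not part of the project's own tooling: the function names are hypothetical, smi/ is left out on the assumption that it is a shared directory rather than a language directory, and the checkout is assumed to live in ~/gt.

```python
from pathlib import Path

# Language codes and per-language subdirectories as listed above.
# smi/ is omitted: it is assumed to hold shared files, not a full language tree.
LANGUAGES = ["sme", "smj", "sma", "smn", "sms"]
LANGUAGE_SUBDIRS = ["bin", "src", "dev", "corp", "testing"]


def expected_language_dirs(root):
    """Return every per-language subdirectory expected under root."""
    root = Path(root)
    return [root / lang / sub for lang in LANGUAGES for sub in LANGUAGE_SUBDIRS]


def missing_dirs(root):
    """Return the expected subdirectories that do not exist under root."""
    return [p for p in expected_language_dirs(root) if not p.is_dir()]


if __name__ == "__main__":
    # Assumes the checkout lives in ~/gt; adjust the path if yours differs.
    for path in missing_dirs(Path.home() / "gt"):
        print("missing:", path)
```

Since bin/ is generated by the make command, it may legitimately be reported as missing before the first build.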

The gt/ directory is copied to the home directory of each user by the cvs program.

Project history

The linguistic groundwork for the Northern Sámi project was done by Pekka Sammallahti in 1993. His original 1993 files were twolrules-saame.txt (the twol rules), lexicon-saame.txt (a preliminary lexicon file), LEXITWOL.doc (a slightly different version of the same file, with more lexicon explanations; the two were unified into the present files), and ADJ-TWOL.doc and NOMENAT.doc (covering the nouns and adjectives). Pekka's input can be found in the directory 93-originals/ (these files are not included in the cvs repository; ask Trond for reference).

In December 2001 Pekka handed over raw dictionary files for nouns and adjectives, and for verbs, adverbs, and closed parts of speech. These files were converted to the lexc format.

Trond Trosterud

Last modified: Fri Feb 21 10:20:32 GMT 2003