General information for new users
The document index.html (the one that
contained the link to this page) is the index to the documentation of
this Sámi language technology project, conducted at the
University of Tromsø. Read the documentation; it is
useful. At present, the goal is to build a morphological parser for
Northern Sámi, and to build basic parsers for Southern and Lule
Sámi. The other Latin-based Sámi written languages will
also be looked into. We have also started creating a morphological
disambiguator for Northern Sámi, using Constraint Grammar
technology. The project
application contains some general background information on the
project (note that present resources enable us to do approximately 1/4 of
what is sketched in that document).
Directory structure
The project is located in the directory gt/ (an acronym for
giellateknologiija, language technology). These are the
subdirectories (the abbreviations for the different languages are in
accordance with the ISO 639 standard for language codes):
- doc/ = documentation files
- script/ = script files
- smi/ = files relevant to all the languages,
e.g. proper names
- sme/ = Northern Sámi
- smj/ = Lule Sámi
- sma/ = Southern Sámi
- smn/ = Inari Sámi
- sms/ = Skolt Sámi
- www/ = directory for web-related issues
- tmp/ = directory for temporary storage of script files
under compilation
Each language directory has the following
subdirectories:
- bin/, the program files (these are automatically
generated by the make command),
- src/, the source files, our crown jewels,
- dev/, developers' files (store your own notes here if
needed),
- corp/, corpus files (cf. the README file in the corp/
directory), and
- testing/, containing files for morphology testing (how to
conduct such testing is explained on the Testing tools page).
The gt/ directory is copied to the home directory of each user by the
cvs program.
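As a quick sanity check after checking out the project, the layout described above can be compared against what is actually on disk. The following is only a minimal sketch in Python, not part of the project's own tooling: the function names expected_paths and missing_paths are hypothetical, and the location ~/gt is an assumption based on the statement that gt/ is copied to each user's home directory.

```python
from pathlib import Path

# Top-level directories shared across languages, as listed above.
SHARED_DIRS = ["doc", "script", "smi", "www", "tmp"]
# Language directories named in the list above.
LANGUAGE_DIRS = ["sme", "smj", "sma", "smn", "sms"]
# Subdirectories each language directory is described as having.
LANGUAGE_SUBDIRS = ["bin", "src", "dev", "corp", "testing"]


def expected_paths(root):
    """Return every directory the description implies, located under root."""
    root = Path(root)
    paths = [root / name for name in SHARED_DIRS]
    for lang in LANGUAGE_DIRS:
        paths.append(root / lang)
        paths.extend(root / lang / sub for sub in LANGUAGE_SUBDIRS)
    return paths


def missing_paths(root):
    """List the expected directories that do not exist under root."""
    return [p for p in expected_paths(root) if not p.is_dir()]


if __name__ == "__main__":
    # Assumption: the checkout lives in ~/gt; adjust if yours is elsewhere.
    for path in missing_paths(Path.home() / "gt"):
        print("missing:", path)
```

Since bin/ is generated by the make command, it may legitimately be reported as missing before the first build.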
Project history
The linguistic groundwork for the Northern Sámi project was
done by Pekka Sammallahti in 1993. His original 1993 files were
twolrules-saame.txt (the twol rules), lexicon-saame.txt
(a preliminary lexicon file), LEXITWOL.doc (a slightly
different version of the same file, with more lexicon explanations;
the two were unified into the present files), ADJ-TWOL.doc and
NOMENAT.doc, the nouns and adjectives. Pekka's original files can be
found in the directory 93-originals/ (they are not included in the cvs
repository; ask Trond for reference).
In December 2001 Pekka handed over raw dictionary files for nouns and
adjectives, and for verbs, adverbs, and closed parts of speech. These
files were converted to the lexc format.
Trond Trosterud
Last modified: Fri Feb 21 10:20:32 GMT 2003