Documentation on Northern Sámi

Morphophonology and morphology

Documentation of the twol rule file:
- twol-sme.txt
Documentation of the lexicon files:
- the main continuation lexica for N, A, V
- the lexicon files for the various POS: nouns, adjectives, verbs, adverbs, pronouns, conjunctions, subjunctions, interjections and particles, adpositions, prepositions and postpositions, proper nouns, abbreviations and punctuation marks
- the use of flag diacritics for compounds

For Northern Sámi, we preprocess text with a Perl script, preprocess (cf. its documentation). The old xfst-based preprocessor, tok.txt, is documented separately; that documentation also contains a general discussion of preprocessing. We may return to using tokenize when the code is stable.
Documentation of case.regex (the file for initial capitalisation), allcaps.regex (the file for words written in all caps), and cap-sme (the lookup script for invoking allcaps.regex).

After having been piped from the preprocessor through lookup, the output may be postprocessed in different ways.

Lookup gives Xerox-style output, but we need vislcg-type input; the conversion is done with the script lookup2cg.
We guess the POS of unknown words in running text with the guesser (still under construction)
Incoming text contains foreign words. We have a very long list of wordforms in Swedish, Norwegian, Finnish, Danish and English, foreign.txt, cf. the documentation
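
The conversion from Xerox-style lookup output to vislcg-type input mentioned above can be illustrated with a minimal Python sketch. This is not the lookup2cg script itself. It assumes that lookup prints one tab-separated analysis per line (wordform, then lemma+Tag+Tag), marks unknown words with a final "+?" field, and separates wordforms with a blank line, and that the target format is a cohort line "<wordform>" followed by one indented reading per analysis. The function name lookup_to_cg and the sample analyses are hypothetical and for illustration only.

```python
def lookup_to_cg(lookup_output: str) -> str:
    """Convert Xerox-style lookup output to CG-style cohorts (illustrative sketch)."""
    cohorts = []
    # Assumption: analyses of one wordform form a block; blocks are separated by blank lines.
    for block in lookup_output.strip().split("\n\n"):
        wordform = None
        readings = []
        for line in block.splitlines():
            if not line.strip():
                continue
            fields = line.split("\t")
            wordform = fields[0]
            if fields[-1] == "+?":
                # Assumption: unknown words are marked with "+?"; keep the wordform as lemma.
                readings.append('\t"%s" ?' % wordform)
            else:
                # "lemma+N+Sg+Gen" becomes: "lemma" N Sg Gen
                lemma, *tags = fields[-1].split("+")
                readings.append('\t"%s" %s' % (lemma, " ".join(tags)))
        if wordform is not None:
            cohorts.append('"<%s>"\n%s' % (wordform, "\n".join(readings)))
    return "\n".join(cohorts)


if __name__ == "__main__":
    sample = "viesu\tviessu+N+Sg+Gen\nviesu\tviessu+N+Sg+Acc\n\nxyz\txyz\t+?\n"
    print(lookup_to_cg(sample))
```

Each wordform becomes one cohort, and every analysis becomes one reading line under it, which is the shape a constraint grammar parser expects for disambiguation. The real script may treat compounds, derivation tags and punctuation differently.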

The programs are compiled by running make in the src/ directory. The make command reads the Makefile.

Last modified: Mon Nov 1 22:23:42 2004