Documentation on Northern Sámi
Tags
Morphophonology and morphology
- **Documentation** of the twol rule file
- **Documentation** of the lexicon files for
  - the main continuation lexica for N, A, V
  - the lexicon files for the various POS: nouns, adjectives, verbs, adverbs, pronouns, conjunctions, subjunctions, interjections and particles, adpositions, prepositions and postpositions, proper nouns, abbreviations and punctuation marks
- The use of flag diacritics for compounds
Preprocessing
- For Northern Sámi, we use a Perl script, preprocess; cf. the documentation. Documentation of the old
xfst-based preprocessor tok.txt is found
here (that documentation also contains a general discussion of preprocessing). We may return to using tokenize
when the code is stable.
- Documentation of case.regex, the file for initial capitalisation; allcaps.regex, the file for words
written in all-caps; and cap-sme, the lookup script for invoking allcaps.regex
Postprocessing
After having been piped from the preprocessor through lookup, the
output may be postprocessed in different ways.
- Lookup gives Xerox-style output, but we need vislcg-type input. The conversion is done with the script lookup2cg
- We guess the POS of unknown words in running text with the guesser (still under construction)
- Incoming text contains foreign words. We keep a very long list
of Swedish, Norwegian, Finnish, Danish and English wordforms in foreign.txt; cf. the documentation
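The conversion from Xerox-style lookup output to vislcg-type input mentioned above can be illustrated with a minimal Python sketch. This is not the lookup2cg script itself. It assumes that lookup prints one tab-separated `wordform<TAB>lemma+Tag+Tag` line per analysis, separates words with blank lines, and marks unknown words with a final `+?` field. The function name `xerox_to_cg`, the `?` reading for unknown words, and the tags in the sample are all hypothetical, chosen for illustration only.

```python
def xerox_to_cg(lookup_output):
    """Convert Xerox-style lookup lines into CG-style cohorts (illustrative only)."""
    cohorts = []        # list of (wordform, [reading, ...]) in input order
    start_new = True    # True when the next analysis line opens a new cohort
    for line in lookup_output.splitlines():
        if not line.strip():
            start_new = True    # assumption: a blank line separates words
            continue
        fields = line.split("\t")
        wordform = fields[0]
        if len(fields) < 2 or fields[-1] == "+?":
            # Assumption: unknown words are marked with "+?"; emit a "?" reading.
            reading = f'"{wordform}" ?'
        else:
            lemma, *tags = fields[1].split("+")
            reading = " ".join([f'"{lemma}"'] + tags)
        if start_new:
            cohorts.append((wordform, []))
            start_new = False
        cohorts[-1][1].append(reading)

    out = []
    for wordform, readings in cohorts:
        out.append(f'"<{wordform}>"')                    # cohort header line
        out.extend("\t" + reading for reading in readings)  # one reading per line
    return "\n".join(out)


if __name__ == "__main__":
    # Hypothetical sample: one ambiguous word and one unknown word.
    sample = (
        "viesu\tviessu+N+Sg+Gen\n"
        "viesu\tviessu+N+Sg+Acc\n"
        "\n"
        "xyz\txyz\t+?\n"
    )
    print(xerox_to_cg(sample))
```

Grouping by blank lines rather than by wordform keeps two adjacent identical words in running text as separate cohorts, which a disambiguator needs in order to treat them independently.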
Disambiguation
Compiling
The programs are compiled (i.e. made) by
running make
from within the src/ directory. The
make command reads the Makefile there and builds the targets it defines.
Testing and bug reports
Last modified: Mon Nov 1 22:23:42 2004