How to use the Sámi morphological parsers

Setting up the environment

Log in with your own user name and password.
If you have been away from cochise for a long time, or if this is your first time, write "cvs co gt" and press the return key (from now on indicated by "RETURN"). By doing that, you check out whatever new catalogues or files that have been added since last time. In order to update already existing files, "cvs up" is enough. For more info on cvs and the messages it may give you, see Introduction to cvs.
Change to the directory of the language you are interested in ("cd gt/sme/src RETURN" for Northern Sámi, and correspondingly for gt/sma/src (Sourthern) and gt/smj/src (Lule Sámi).
When in the src directory, write "make RETURN" (in order to compile the last version of the parser). The machine will then for the next 30 minutes (depending upon how many parts of the parser it must rebuild) write cryptic messages on the screen, and finish with an optimistic "bye.". The other parts of the parser are compiled in a couple of minutes, but compiling the preprocessor is a really slow process. While waiting, open a new window and do something else (you may e.g. read this documentation)

Analysing and generating words

Letters: The Northern Sámi letters are rendered as á, c1, d1, n1, s1, t1, z1. Thus write mánná, but Kárás1johka (with "s1" for s-caron) for the place name. Lule and Southern Sámi are written with the letters found in the Lule and Southern Sámi alphabets (the Lule Sámi [ng] sound is written as ñ).

Analysing one word at a time:

Note that the source files are in src/, the binary files are in bin/. The exact commands depend upon where you are. In order to write make, you must be in src/, we assume that you have a separate window for analysis, and that you are in the sme/ (etc.) catalogue when you analyse.

For Northern, write "lookup -flags mbTT bin/sme.fst RETURN"
For Lule Sámi, write "lookup -flags TT bin/smj.fst RETURN".
For Southern Sámi, write "lookup -flags TT bin/sma.fst RETURN".
then write the words that shall be analysed, one word at a time, followed by RETURN.
To leave lookup mode, press "ctrl C".
The "-flags mb" part is required for Northern Sámi, because of the c1, d1, etc. digraphs. For the other languages, "-flags TT" is not required, but it gives a nicer output. See the documentation on the lookup program for details.

Generating words

Write exactly the same commands as you do when you analyse words, except that you change 'sme.fst' to 'isme.fst', 'sma.fst' to 'isma.fst', etc.
Then write Sami words in their dictionary forms, followed by grammatical information. The format is given in the table in the file The grammatical tags.Note that the Southern Sámi sma.fst handles capital letters and ī-i variation, but that it only accepts correct "īquot; when you write in the base forms in the generator.
Again, to leave lookup mode, press "ctrl C".

A good way of working is to have two windows open, one for analysing and one for generating (and probably also addidtional windows, for documentation, for the source files, etc.).

Analysing more than one word at a time

Write the following command (the string 'sentence here' should be replaced with the actual sentence, and the part following the command "lookup" varies according to language, of course). I again assume you stand in the sme/ (sma/ etc.) catalogue).

echo "sentence here" | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT bin/sme.fst

Analysing files:

For each of the languages, write the following line:

cat filename | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT bin/sme.fst | less
cat filename | preprocess --abbr=bin/abbr.txt | lookup -flags TT bin/smj.fst | less
cat filename | preprocess --abbr=bin/abbr.txt | lookup -flags TT bin/sma.fst | less

Note that new Northern Sámi testfiles must be converted to the á, c1, d1 etc. format (there is a perl script to do that, and a better preprocessor is on the TODO list). The sme, sma and smj directories all contain a subdirectory called corp (so far, only sme/corp has testfiles).

There are now preprocessors that handle various sámi encodings. (They exist in the gt/script directory). They convert the input to the databases internal format. The files utf8-, ws2-, linmac- and latin6-sme are lookup scripts that turn the input and output to and from the internal format, and could be used like this:

cat utf8-filename | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -f utf8-sme | less
cat ws2-filename | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -f ws2-sme | less
cat linmac-filename | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -f linmac-sme | less
cat latin6-filename | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -f latin6-sme | less

Instead of just showing the result on the screen as running text (as above), much can be done to manipulate it. Here are some examples, all the textstrings should replace the word "less" in the command above.

"grep '\?' | sort | uniq -c | sort -nr | less RETURN" (to get a frequency list of the words that the parser does not recognize)
"grep '+N+Pl' > plnouns" (to get all plural nouns and save them to the file "plnouns"
"grep -v '\?' | cut -f2 | sort | uniq -c | sort -nr | less RETURN" (to get a frequency list of the lexemes that the parser recognizes, note that this requires that the flag TT is turned off, i.e. not mentioned.)
"grep '\+\?' | sort | uniq -c | sort -nr | less RETURN" (to get a frequency list of the word forms that the parser does not recognize)

To analyse more files at the same time, write their names one after another after the "cat" command, e.g. "cat file1 file3 file3 | preprocess ..."

Trond Trosterud

Last modified: Wed Oct 8 00:03:22 2003