Test plan for sme

Testing corpus text

At regular intervals, the corpus should be inspected, e.g. with the following command (~/gt/sme/ as the working directory):

cat corp/* | preprocess abbr=bin/abbr.txt | lookup -flags mbTT bin/sme.fst | grep '\?' | grep -v CLB | sort | uniq -c | sort -nr | less

The result will also contain all non-Sami words in the corpus. To remove these foreign words from the list, the following command may be used:

cat corp/* | preprocess abbr=bin/abbr.txt | lookup -flags mbTT bin/sme.fst | grep '\?' | grep -v CLB | cut -f1 | lookup -flags mbTT bin/foreign.fst | grep '\?' | sort | uniq -c | sort -nr | less

The resulting list is an overview of the words not recognised by the parser. All-capital words should be ignored, or they can be tested separately with the command

... | lookup -flags mbTT -f bin/cap-sme | ...

With this script, words written in CAPITALS are analysed as well, but in this mode the parser is too slow to analyse the full one-million-word corpus.
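The tail of the pipelines above, and the separation of all-capital words, can be sketched as two small shell functions. This is only an illustration: the function names are hypothetical, the sketch assumes that lookup marks an unrecognised word with a question mark in tab-separated output, and the list of lowercase letters covers only ASCII and the Sami letters.

```shell
# Hypothetical helpers, not part of the project tools.

# Read lookup output on stdin and print "count word" for every unrecognised
# word, most frequent first. Assumes unrecognised words carry a "?" mark.
unknown_freq() {
    grep -F '?' | grep -v CLB | cut -f1 | sort | uniq -c | sort -nr |
        awk '{ print $1, $2 }'
}

# Lowercase letters are listed one by one so that the byte-wise match
# behaves the same in every locale (ASCII plus Sami letters only).
LOWER='[a-z]|á|č|đ|ŋ|š|ŧ|ž'

# Keep only entries that contain at least one lowercase letter,
# i.e. drop words written entirely in capitals.
without_allcaps() {
    LC_ALL=C grep -E "$LOWER" || true
}

# Keep only the entries without any lowercase letter.
only_allcaps() {
    LC_ALL=C grep -Ev "$LOWER" || true
}

# Possible use (working directory ~/gt/sme/):
#   cat corp/* | preprocess abbr=bin/abbr.txt | lookup -flags mbTT bin/sme.fst |
#       unknown_freq | without_allcaps | less
```

The all-capital entries from only_allcaps can then be saved and analysed separately, so that the slower capital-aware mode only has to process a short list.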

The remaining words should be inspected. Failure to recognise a word has one of three causes:

They are misspellings
They are missing from the lexicon
They are listed in the lexicon, but an error in the morphological or morphophonological system prevents the parser from recognising them.

In simple cases, errors should just be corrected. Otherwise they should be reported to the Bugzilla database. Misspellings may be ignored or, if they are frequent, added to the lexicon with a tag (at present the tag is "! XXX substandard"). When developing a spell checker, misspellings become interesting in their own right, but for the development of the disambiguator we are more interested in actually analysing the words than in pointing out that they are misspelled.

Clear formatting errors may be corrected in the corpus files, with the following command:

perl -i -pe 's/formatting_error/corrected_formatting/g' corp/filename

This should be done with care, and only when it is totally clear that the input string cannot be interpreted as anything other than a formatting error. The preferred way of dealing with formatting errors is to improve our conversion tools.

Words missing from the lexicon should be added, each to its proper sublexicon.

Words that are listed in the lexicon, but with one or more word forms not analysed, are the most challenging ones. Such cases imply an error in the morphophonological file twol-sme.txt or, more probably, in the morphological section (for nouns, verbs and adjectives this means sme-lex.txt). In case of morphological errors, the path through the morphological derivation should be traced and inspected. In case of morphophonological errors, there are procedures within twolc for detecting them (see the twolc manual).

Testing recall of texts

At regular intervals, new, previously unseen texts should be tested for type and token recall. The test procedure, as well as test results, are explained in the sme test diary.
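As an illustration of the two measures, the sketch below computes them from lookup output. It is not the procedure from the test diary. It assumes that token recall is the share of running words that get at least one analysis, that type recall is the same share counted over distinct word forms, that lookup separates the analyses of each token with an empty line, and that an unrecognised token is marked with "+?" in the last column. The function name is hypothetical.

```shell
# recall_stats: read lookup output on stdin, print token and type recall.
# Hypothetical helper; see the assumptions stated above.
recall_stats() {
    awk 'BEGIN { RS = ""; FS = "\t" }   # one record per token (paragraph mode)
    {
        word = $1
        known = (index($0, "\t+?") == 0)   # no "+?" mark: token was analysed
        tokens++
        if (known) known_tokens++
        if (!(word in seen)) {             # first occurrence of this word form
            seen[word] = 1
            types++
            if (known) known_types++
        }
    }
    END {
        if (tokens == 0) { print "no tokens"; exit }
        printf "token recall: %d/%d = %.3f\n", known_tokens, tokens, known_tokens / tokens
        printf "type recall: %d/%d = %.3f\n", known_types, types, known_types / types
    }'
}

# Possible use:
#   cat newtext.txt | preprocess abbr=bin/abbr.txt | lookup -flags mbTT bin/sme.fst | recall_stats
```

Punctuation tokens are counted like any other token in this sketch; whether they should be is a decision for the test procedure itself.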

Testing the parser

The parser should be tested for its output, via the testing tools.

Status quo and directions for actively testing the parser:

Testing the morphology

The best way of testing the morphology is perhaps the command make n-paradigm WORD=johka, as described in the testing tools. This method is fine for the inflection of nouns, verbs and adjectives. As of September 2004, the basic noun paradigms in Nickel have all been tested, as have the CG patterns. Priority should now be given to the adjectives and the verbs. The sublexica should all be run through the generator.

Testing the individual lexemes

Adjectives

As for the adjectives, there are several subtypes that are not covered by the existing lexica. One possible way of monitoring the situation would be to write a perl script (or shell script) that takes as input a list of adjectives, and gives their nom.sg., attributive form, gen.sg, comparative nominative, comparative genitive, superlative and superlative genitive forms, and then run representative lists of adjectives through the script.
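A shell version of such a script could start out as in the sketch below. It only builds the input for the generator: for each adjective it prints one line per requested form. The tag strings are placeholders chosen for illustration and must be replaced by the tags actually used in the lexicon files; the function name and the location of isme.fst are assumptions as well.

```shell
# Placeholder tag strings (assumptions, not taken from the lexicon files):
# nom.sg., attributive, gen.sg., comparative nom./gen., superlative nom./gen.
ADJ_TAGS='+A+Sg+Nom +A+Attr +A+Sg+Gen +A+Comp+Sg+Nom +A+Comp+Sg+Gen +A+Superl+Sg+Nom +A+Superl+Sg+Gen'

# adj_queries: read one adjective per line on stdin and print
# lemma+tags lines, seven per adjective, to be fed to the generator.
adj_queries() {
    while IFS= read -r lemma; do
        [ -n "$lemma" ] || continue      # skip empty lines
        for tags in $ADJ_TAGS; do
            printf '%s%s\n' "$lemma" "$tags"
        done
    done
}

# Possible use (generator path is an assumption):
#   adj_queries < adjectives.txt | lookup -flags mbTT bin/isme.fst | less
```

Lines that the generator answers with a question mark then point to adjective subtypes that the lexica do not yet cover.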

Verbs

As for the verbs, the verb file should be read through and checked for transitivity (the question is whether each verb is assigned to the correct sublexicon).

P-positions and adverbs

TODO for a person with Sami as mother tongue: Read through the pp-sme-lex.txt and adv-sme-lex.txt files and evaluate the division into prepositions, postpositions, adpositions and adverbs.

Pronouns

Perhaps a script could be made to run all pronouns through a test.

Numerals

The chapter on numerals is still not properly written. Testing should be postponed until the code is more stable.

Testing the correctness of the given analyses

When we test whether words are let through or not, we do not test whether the parser actually gives correct analyses. A word may thus be misanalysed, in two ways:

It is misspelled, but still given an (erroneous) analysis
It is correctly spelled, but given a grammatical analysis that it should not have had

The first issue is of major concern to the spell checker project, and will not be dealt with here.

The second issue is of great importance to the disambiguator, and to the form generator isme.fst. Errors of this type show up in two contexts: when the parser output is used as input to the disambiguator (and the correct reading is missing from the input), and when the analysis of a shorter, non-disambiguated text is read through, which should be done regularly.

Reading through the code

Although the parser might give correct output, the internal lexicon structure may not be optimal. At some point, the code should be read through with this in mind.

Trond Trosterud

Last modified: Tue Nov 9 10:54:56 2004