At regular intervals, the corpus should be inspected, e.g. with the following command (~/gt/sme/ as the working directory):
cat corp/* | preprocess abbr=bin/abbr.txt | lookup -flags mbTT bin/sme.fst | grep '\?' | grep -v CLB | sort | uniq -c | sort -nr | less
The result lists every word form the parser does not recognise, which includes all the non-Sami words in the corpus. To remove these foreign words from the list, the following command may be used:
cat corp/* | preprocess abbr=bin/abbr.txt | lookup -flags mbTT bin/sme.fst | grep '\?' | grep -v CLB | cut -f1 | lookup -flags mbTT bin/foreign.fst | grep '\?' | sort | uniq -c | sort -nr | less
The resulting list is an overview of the words not recognised by the parser. All-capital words should be ignored, or they can be tested separately with the command
... | lookup -flags mbTT -f bin/cap-sme | ...
With this script, words written in CAPITALS are analysed as well, but in this mode the parser is too slow to analyse the full one-million-word corpus.
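The filtering and counting stages shared by the pipelines above can be tried out on their own. The following is a minimal sketch, not part of the gt tools: the function name unknown_freq is hypothetical, and it assumes that lookup prints one tab-separated line per analysis and marks an unrecognised word with a final "+?" field.

```shell
# unknown_freq: hypothetical helper, not part of the gt tools.
# Reads lookup-style output on stdin (assumed format: tab-separated
# fields, unrecognised words ending in a "+?" field) and prints the
# unrecognised word forms with their frequencies, most frequent first.
unknown_freq() {
    # Keep lines whose last field is exactly "+?", skip clause
    # boundary lines (CLB), and print only the word form itself.
    awk -F '\t' '$NF == "+?" && $0 !~ /CLB/ { print $1 }' |
        sort | uniq -c | sort -nr
}
```

Under these assumptions the function could replace the grep, sort and uniq stages, e.g. cat corp/* | preprocess abbr=bin/abbr.txt | lookup -flags mbTT bin/sme.fst | unknown_freq | less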
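The remaining words should be inspected. A word fails to be recognised for one of three reasons: the input itself is wrong (a misspelling or a formatting error), the word is missing from the lexicon, or the word is listed in the lexicon but one or more of its forms are not analysed.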
In simple cases, errors should just be corrected. Otherwise they should be reported to the Bugzilla database. Misspellings may be ignored or, if they are frequent, added to the lexicon with a tag (at present the tag is "! XXX substandard"). When developing a spell checker, misspellings become interesting in their own right, but for the development of the disambiguator we are more interested in actually analysing the words than in pointing out that they are misspelled.
Clear formatting errors may be corrected in the corpus files, with the following command:
perl -i -pe 's/formatting_error/corrected_formatting/g' corp/filename
This should be done with care, and only when it is entirely clear that the input string cannot be interpreted as anything other than a formatting error. The preferred way of dealing with formatting errors is to improve our conversion tools.
Words missing from the lexicon should be added and assigned to their proper sublexicon.
Words listed in the lexicon, but with one or more word forms not analysed, are the most challenging ones. This implies that there is an error in the morphophonological file twol-sme.txt or more probably in the morphological section (for nouns, verbs and adjectives this means sme-lex.txt). In case of morphological errors, the path through the morphological derivation should be traced and inspected. In case of morphophonological errors there are procedures within twolc for detecting them (see the twolc manual).
At regular intervals, new, previously unseen texts should be tested for type and token recall. The test procedure, as well as the test results, are explained in the sme test diary.
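As an illustration of what such a recall figure measures, here is a minimal sketch. It is not the procedure from the test diary: the function name recall_report is hypothetical, recall is taken to mean the share of tokens (running words) and types (distinct word forms) that receive at least one analysis, and lookup is assumed to print the analyses of each token as a block followed by a blank line, with "+?" as the last field for an unrecognised word.

```shell
# recall_report: hypothetical helper, not part of the gt tools.
# Reads lookup-style output on stdin and reports token and type recall.
# Assumptions: one blank-line-separated block per token, tab-separated
# fields, and a final "+?" field when the word is not recognised.
recall_report() {
    awk 'BEGIN { RS = ""; FS = "\t" }   # RS="" : one record per block
    {
        tokens++
        if (!($1 in seen)) { seen[$1] = 1; types++ }
        if ($NF != "+?") {              # the token got an analysis
            ok_tokens++
            if (!($1 in ok_seen)) { ok_seen[$1] = 1; ok_types++ }
        }
    }
    END {
        if (tokens == 0) { print "no tokens"; exit 1 }
        printf "token recall: %d/%d = %.1f%%\n", ok_tokens, tokens, 100 * ok_tokens / tokens
        printf "type recall: %d/%d = %.1f%%\n", ok_types, types, 100 * ok_types / types
    }'
}
```

Reporting both figures is useful because a few frequent unrecognised words lower token recall much more than type recall, whereas many rare ones have the opposite effect.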
Status quo and directions for actively testing the parser:
The best way of testing the morphology is perhaps the command
make n-paradigm WORD=johka
as described in the testing tools. This method is fine for the inflection of nouns, verbs and adjectives. As of September 2004, the basic noun paradigms in Nickel have all been tested, as have the CG patterns. Priority should now be given to the adjectives and the verbs. The sublexica should all be run through the generator.
As for the adjectives, there are several subtypes that are not covered by the existing lexica. One possible way of monitoring the situation would be to write a perl script (or shell script) that takes as input a list of adjectives, and gives their nom.sg., attributive form, gen.sg, comparative nominative, comparative genitive, superlative and superlative genitive forms, and then run representative lists of adjectives through the script.
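A shell sketch of the first half of such a script is given below. It only builds the seven analysis strings per adjective, and the result would still have to be sent through the form generator. The function name adj_forms and the tag names (+A, +Attr, +Comp, +Superl and so on) are assumptions made for the illustration and must be adjusted to the tags actually used in sme-lex.txt.

```shell
# adj_forms: hypothetical helper, not part of the gt tools.
# Reads one adjective per line on stdin and prints one analysis string
# per form to be checked. The tag strings are illustrative assumptions
# and may differ from the tag set of the actual lexicon.
adj_forms() {
    while IFS= read -r adj; do
        [ -n "$adj" ] || continue       # skip empty lines
        for tags in +A+Sg+Nom +A+Attr +A+Sg+Gen \
                    +A+Comp+Sg+Nom +A+Comp+Sg+Gen \
                    +A+Superl+Sg+Nom +A+Superl+Sg+Gen; do
            printf '%s%s\n' "$adj" "$tags"
        done
    done
}
```

The output could then be piped into the generator, for instance with lookup -flags mbTT bin/isme.fst, assuming that isme.fst is kept under bin/ like the other transducers.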
As for the verbs, the verb file should be read through and checked for transitivity (the question is whether the verbs are assigned to the correct sublexicon).
TODO for a person with Sami as mother tongue: Read through the pp-sme-lex.txt and adv-sme-lex.txt files and evaluate the division into prepositions, postpositions, adpositions and adverbs.
Perhaps a script could be made to run all pronouns through a test.
The chapter on numerals is still not properly written. Testing it should wait until the code is more stable.
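When we test whether words are let through or not, we do not test whether the parser actually gives correct analyses. A word may thus be misanalysed in two ways: a misspelled or non-existent form may be accepted as a word, or a correct form may be given a wrong analysis or lack the correct one.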
The first issue is of major concern to the spell checker project, and will not be dealt with here.
The second issue is of great importance to the disambiguator and to the form generator isme.fst. Errors of this type turn up in two contexts: when the parser output is used as input to the disambiguator (and the correct reading is missing from that input), and when the analysis of a shorter, non-disambiguated text is read through, which should be done regularly.
Although the parser might give correct output, the internal lexicon structure may not be optimal. At some point, the code should be read through with this in mind.
Trond Trosterud
Last modified: Tue Nov 9 10:54:56 2004