cat corp/* | preprocess abbr=bin/abbr.txt | lookup -flags mbTT bin/sme.fst | grep '\?' | grep -v CLB | sort | uniq -c | sort -nr | less
The resulting list will also contain non-Sami words, which do not belong in the lexicon. In order to remove these foreign words from the list, the following command may be used:
cat corp/* | preprocess abbr=bin/abbr.txt | lookup -flags mbTT bin/sme.fst | grep '\?' | grep -v CLB | cut -f1 | lookup -flags mbTT bin/foreign.fst | grep '\?' | sort | uniq -c | sort -nr | less
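The tail of these pipelines (sort | uniq -c | sort -nr) turns the stream of unanalysed word forms into a descending frequency list. A minimal, self-contained illustration of the idiom (the words are placeholder lemmas, not real corpus output):

```shell
# Count duplicate lines and list the most frequent first:
printf '%s\n' guolli johka guolli biila guolli johka | sort | uniq -c | sort -nr
# prints the counts: 3 guolli, 2 johka, 1 biila
```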
The resulting list gives an overview of the words not recognised by the parser. All-capital words should be ignored, or they may be tested separately with the command
... | lookup -flags mbTT -f bin/cap-sme | ...
With this transducer, words written in CAPITALS are analysed as well, but run in this mode, the parser is too slow to analyse the full one-million-word corpus.
The remaining words should be inspected. A word may fail to be recognised for one of three reasons: it is misspelled, it contains a formatting error, or it is missing from (or wrongly entered in) the lexicon.
In simple cases, errors should simply be corrected; otherwise they should be reported to the Bugzilla database. Misspellings may be ignored, or, if they are frequent, they should be added to the lexicon with a tag (at present the tag is "! XXX substandard"). When developing a spell checker, misspellings become interesting in their own right, but for the development of the disambiguator we are more interested in actually analysing the words than in pointing out that they are misspelled.
Clear formatting errors may be corrected in the corpus files, with the following command:
perl -i -pe 's/formatting_error/corrected_formatting/g' corp/filename
This should be done with care, and only when it is completely clear that the input string cannot be interpreted as anything other than a formatting error. The preferred way of dealing with formatting errors is to improve our conversion tools.
Words missing from the lexicon should be added, under the appropriate sublexicon.
Words listed in the lexicon, but with one or more word forms left unanalysed, are the most challenging ones. Such failures imply an error either in the morphophonological file twol-sme.txt or, more probably, in the morphological section (for nouns, verbs and adjectives this means sme-lex.txt). In the case of morphological errors, the path through the morphological derivation should be traced and inspected. In the case of morphophonological errors, twolc has procedures for detecting them (see the twolc manual).
Status quo and directions for actively testing the parser:
Use make n-paradigm WORD=johka, as described in the testing
tools. This method is fine for the inflection of nouns, verbs and
adjectives. As of September 2004, the basic noun paradigms in Nickel
have all been tested, as have the CG patterns. Priority should now be
given to the adjectives and the verbs. The sublexica should all be run
through the generator.
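The paradigm test can be run over a whole word list with a small loop. A sketch, assuming the project Makefile with its n-paradigm target is in the current directory; the lemmas are examples, and the echo makes it a dry run (drop the echo to actually run the commands):

```shell
# Dry run: print the paradigm-test commands for a list of test nouns.
for w in johka guolli beana; do
  echo make n-paradigm WORD="$w"
done
```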
Testing the individual lexemes
Adjectives
As for the adjectives, there are several subtypes that are not covered
by the existing lexica. One possible way of monitoring the situation
would be to write a Perl script (or shell script) that takes a list of
adjectives as input and gives their nominative singular, attributive,
genitive singular, comparative nominative, comparative genitive,
superlative nominative and superlative genitive forms, and then to run
representative lists of adjectives through the script.
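Such a script could simply emit one lemma+tag string per requested form and hand them to the generator. A minimal shell sketch; the tag strings and the generator call in the comment are assumptions and must be adjusted to the real tag set:

```shell
# Emit one lemma+tag line per requested form, for every adjective on stdin.
# The tags below are illustrative placeholders, not the project's tag set.
adj_forms() {
  tags='+A+Sg+Nom +A+Attr +A+Sg+Gen +A+Comp+Sg+Nom +A+Comp+Sg+Gen +A+Superl+Sg+Nom +A+Superl+Sg+Gen'
  while read lemma; do
    for t in $tags; do
      printf '%s%s\n' "$lemma" "$t"
    done
  done
}
# Usage sketch: adj_forms < adjectives.txt | lookup -flags mbTT bin/isme.fst
```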
Verbs
As for the verbs, the verb file should be read through and checked for
transitivity (the question is whether each verb is assigned to the
correct sublexicon).
P-positions and adverbs
TODO for a person with Sami as their mother tongue: read through the
pp-sme-lex.txt and adv-sme-lex.txt files and evaluate the division into
prepositions, postpositions, adpositions and adverbs.
Pronouns
Perhaps a script could be made to run all pronouns through a test.
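One way to set such a test up is to generate all pronoun forms and diff them against a hand-checked gold list. A sketch with inline demo data; the file names and forms are placeholders, and the generated file would normally come from the generator rather than printf:

```shell
# Gold list of expected forms (normally written and checked by hand):
printf 'mun\ndon\nson\n' > pron-gold.txt
# Forms produced by the generator (normally: lookup ... > pron-out.txt):
printf 'mun\ndon\nson\n' > pron-out.txt
# Report any deviation from the gold list:
diff pron-gold.txt pron-out.txt && echo 'pronouns OK'
```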
Numerals
The chapter on numerals is still not properly written. Postpone
testing it until the code is more stable.
Testing the correctness of the given analyses
When we test whether words are let through or not, we do not test
whether the parser actually gives correct analyses. A word may thus be
misanalysed in two ways:
- It is misspelled, but still given an (erroneous) analysis
- It is correctly spelled, but given a grammatical analysis that
it should not have had
The first issue is of major concern to the spell checker project, and will not be dealt with here.
The second issue is of great importance to the disambiguator and to the
form generator isme.fst. Errors of this type turn up in two contexts:
when the parser output is used as input to the disambiguator (and the
correct reading is missing from the input), and when regularly reading
through the analysis of a shorter, non-disambiguated text.
Reading through the code
Although the parser might give correct output, the internal lexicon
structure may not be optimal. At some point, the code should be read
through with this in mind.
Trond Trosterud
Last modified: Tue Nov 9 10:54:56 2004