There are a number of different spellers that are supported (or on the way to being supported) in our infrastructure:

* fst-based spellers:
** zhfst files
** extensions for LibreOffice (oxt-files) based on LibreOffice-voikko
** foma spellers
* list-based spellers (support under development)
** PLX spellers (Sámi spellers for MS Word using closed-source technology)
** Hunspell files

!!!Speller configuration

The basic configuration for building spellers is:

{{{
./configure --with-hfst --enable-spellers
}}}

There is one optimisation flag that is turned on by default: {{--enable-minimised-spellers}}. For some languages this optimisation is counterproductive, causing the speller to become very slow and unresponsive. If this is the case, ''disable'' this optimisation as follows:

{{{
./configure --with-hfst --enable-spellers --disable-minimised-spellers
}}}

You should also experiment with the hyper-minimisation option described in the next section, and see which combination of optimisations yields the best performance.

!!!Fst optimisations

Some languages, notably Greenlandic ({{kal}}), compile into a very large net. Hfst supports something called ''hyper-minimisation'', in which paths are replaced with automatically generated flag diacritics, such that otherwise similar paths can be collapsed without changing the semantics of the language model.

This type of minimisation has a profound effect on some languages, and a minimal effect on others. In some cases it has even increased the size of the resulting fst. For Greenlandic the effect is stunning: from being a more or less unusable behemoth at 160 MB and more, the acceptor for the Greenlandic speller (when combined with minimised spellers as described above) is reduced to a mere 6.3 MB.

To turn on this type of fst size optimisation, configure as follows:

{{{
./configure --with-hfst --enable-spellers --enable-hyperminimisation
}}}

Whether this option helps or not must be tested for each language, and preferably documented.
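One way to document the effect is to record the speller file sizes after each build. The sketch below is not part of the infrastructure: {{speller_sizes}} is a hypothetical helper, and it assumes that the built {{zhfst}} files end up somewhere below {{tools/spellcheckers/}}.

```shell
# Hypothetical helper (not part of the infrastructure): list every *.zhfst
# file below the given directory together with its size in bytes, so that
# the results of different configure options can be compared.
speller_sizes() {
    find "$1" -type f -name '*.zhfst' | sort | while IFS= read -r f; do
        # wc -c may pad its output with spaces on some systems; strip them.
        printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done
}

# Example use (assumption: built spellers are found below tools/spellcheckers/):
#   ./configure --with-hfst --enable-spellers --enable-hyperminimisation
#   make
#   speller_sizes tools/spellcheckers > sizes.hyperminimised.txt
```

Running the helper once per configuration and keeping the output files gives a simple record that can be compared between languages.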
You can see how hyper-minimisation and the minimised-spellers option affect the speller file sizes for three languages ({{fin, kal, sme}}) [here|ExampleOfFileSizesWithOptimisations.html].

!!!Error model optimisations

The default error model has two important properties:

* alphabet size
* transition weights

Further details about the error model, its parts and its build configuration can be found on a [separate page|../../proof/TheSpellerErrorModel.html].

!!Alphabet size

The alphabet size has a huge impact on the size of the final error model fst, and with that also on the speed of creating suggestions. The smaller the alphabet, the smaller and speedier the fst. To ensure you have as small an alphabet as possible, add as many characters as possible to the exclusion list in the following file:

{{{
tools/spellcheckers/fstbased/hfst/editdist.default.txt
}}}

All other characters will be used to create a simple edit distance 1 error model (this model is concatenated with itself to enable corrections of edit distance 2).

Tip: use the terminal output of {{make}} in {{tools/spellcheckers/fstbased/hfst/}} (following the text ''... and base alphabet size NN'') as a starting point. Remove all regular alphabetic symbols, and what is left should be excluded by adding them to the file mentioned above.

!!Transition weights

The default error model created above is quite rough, as all transitions are equally likely. To improve this, you can specify weights for specific transition pairs (in the same file as above):

{{{
ø	ö	0.5
}}}

The default weight is 1.0, and the above line says that replacing ''ø'' with ''ö'' should only have a weight of 0.5, and thus be more likely than the default. The columns are TAB separated.

Using this system, it is possible to tune the default error model to improve the order of the suggestions by using general single-letter rules.
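Since the columns must be TAB separated, a weight line typed with spaces is an easy mistake to make. The following is a minimal sketch of a check, under the assumption that the weight lines in {{editdist.default.txt}} are exactly the lines that end in whitespace followed by a number. {{check_weight_lines}} is a hypothetical helper, not an existing tool.

```shell
# Hypothetical helper (not an existing tool): report lines that look like
# weight lines (they end in whitespace followed by a number) but do not
# consist of exactly three TAB-separated columns.
# Returns a non-zero exit status if any such line is found.
check_weight_lines() {
    awk -F '\t' '
        /[ \t][0-9]+(\.[0-9]+)?[ \t]*$/ && NF != 3 {
            printf "%s:%d: expected 3 TAB-separated columns, found %d\n", FILENAME, FNR, NF
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Example use:
#   check_weight_lines tools/spellcheckers/fstbased/hfst/editdist.default.txt
```

The check only looks at field counts, so it should not be applied to files with a different column layout.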
To enable the error model to correct longer sequences of letters, edit the file {{tools/spellcheckers/fstbased/hfst/strings.default.txt}}. Its structure is similar, but not identical, to that of the previous file:

{{{
øø:öö 0.2
ää:ææ 0.2
}}}

It is also possible to add whole-word replacements to the error model by editing the file {{tools/spellcheckers/fstbased/hfst/words.default.txt}}. Whole-word replacements are typically given the weight {{0.0}}, to ensure they are at the top of the suggestion list:

{{{
jih:jïh 0.0
}}}

In the future it will be possible to use a file of collected typos and their corrections as the basis for whole-word corrections.

!!!Fine tuning the suggestion order

In the previous section we looked at how we could fine-tune the suggestions based on the error, that is, on the type of changes needed to arrive at a correct word. This is good in itself, but it does not differentiate between two suggestions with the same weight where one is a frequent word and the other is not, or where one word is a compound and the other is not. Neither does it move rare word forms down the suggestion list. To add such behaviour, we need to add weights to the fst that will end up as the acceptor.

!!Morphology-based weighting

Morphology-based weighting is done by adding weights to the morphological or morphosyntactic tags in the analyser. You do this by modifying the file {{tools/spellcheckers/fstbased/desktop/weighting/tags.reweight}}. The file contains two TAB-separated columns:

# the tag itself
# the weight that should be given to the tag

Comments can be added as lines starting with __#__.
Below is an example of how this can be done, taken from {{sme}}:

{{{
+Cmp	+2
+Der	+1
+Der1	+1
+Der2	+1
+Der3	+1
+Der4	+1
+Der5	+1
+PxSg1	+3
+PxSg2	+3
+PxSg3	+3
+PxPl1	+3
+PxPl2	+3
+PxPl3	+3
+Use/SpellNoSugg	+10000
+Cmp/Hyph	+10000
+Cmp/SplitR	+10000
}}}

The weights are added to the other weights given to a word form, and should be chosen to align with the rest of the weights being used. Corpus weights are typically between {{6}} and {{12}} (but will vary depending on the size of the corpus), and the default weight for edit distance operations is {{10}}. Very high weights will cause a word form not to be suggested at all, or only very rarely.

!!Corpus-based weighting

You turn on frequency-based weighting by doing two things:

# Create a speller corpus
# Enable the use of the speller corpus

!Creating a speller corpus

This is very simple: just store a large amount of text in the file {{tools/spellcheckers/fstbased/desktop/weighting/spellercorpus.raw.txt}}. The content does not have to be sorted, split or cleaned in any way: basic cleaning and sorting is done automatically, and all incorrect words are filtered out automatically.

If you are using texts that are copyrighted, you can use the following Perl one-liner to shuffle the lines of the text, so that the original text cannot be reconstructed:

{{{
perl -MList::Util=shuffle -e 'print shuffle(<>);' < myfile.txt \
  > tools/spellcheckers/fstbased/desktop/weighting/spellercorpus.raw.txt
}}}

After this, the text is fine for inclusion in the corpus. Use a lot of text, so that the less frequent word forms are also covered; that will help a lot in improving the suggestion quality.

!Enabling the use of the speller corpus

Having a text corpus (which provides us with frequency data) is not enough; you also need to enable the use of it.
This is done by editing {{tools/spellcheckers/fstbased/desktop/Makefile.am}}, so that it contains the following line (the line should already be there, but with the value ''no''):

{{{
ENABLE_CORPUS_WEIGHTS=yes
}}}

You can temporarily disable the use of frequency data, e.g. for evaluation and development purposes, by changing ''yes'' to ''no''.

!!Both

It can also be quite helpful to combine the use of frequency (corpus) weights and tag-based (morphology) weights. You need to experiment and test a bit to arrive at the best configuration for a given language.

!!!Time-stamping the spellers

All spellers get an easter egg with build date and version info, but this information is not updated automatically. To ensure you have a correct timestamp in your easter egg, do:

{{{
cd tools/spellcheckers/
make clean
make
}}}

The reason you should {{cd}} into {{tools/spellcheckers/}} first is so that you don't have to rebuild everything, just the spellers and the easter egg.

!!!Easter egg trigger

The trigger string is ''nuvviDspeller''. Copy and paste this word into any speller we have made, or echo it into a speller on the command line, and the suggestions should contain the version information.

!!!Testing spellers

The speller may be tested on data from {{test/data/typos.txt}}. In order to do this, you need {{Text/Brew.pm}} (a Perl module; it should be installed if you follow the default setup procedure). To test, stand in the $LANG ({{langs/sme}}, etc.) directory and write:

{{{
sh devtools/test_voikkospell_suggestions.sh
open -a Safari devtools/speller_result_typos.vk.xml
}}}