This document describes the different parts of the error model used to create
suggestions for the spellers, how they interact, and how one can turn the
different parts on and off.

!!!Makefile configurations

The file {{tools/spellcheckers/Makefile.mod-desktop-hfst.am}} holds the
settings for desktop spellers (a corresponding file exists for mobile phone
spellers, so that the two can be configured differently). With the default
values given by the {{und/}} template, it looks like this:

{{{
# This is the default weight for all editing operations in the error model:
DEFAULT_WEIGHT=10

# Edit distance for the Levenshtein error model:
EDIT_DISTANCE=2

# Define whether we allow changes to the initial letter(s) in the error model,
# possible values are:
# * no    - no longer string edits = only the default, letter-based error model
# * txt   - use only the txt file as source
# * regex - use only the regex file as source
# * both  - use both the txt and regex files as sources
# NB!!! Setting this to anything but 'no' will greatly increase the size and
# search space of the error model, and thus make it much, much slower. Make sure
# you TEST the resulting error model properly and thoroughly, both for speed
# and suggestion quality.
INITIAL_EDITS=no

# Variable to define whether to enable edits of longer strings (as opposed to
# single letters). Possible values are:
# * no    - no longer string edits = only the default, letter-based error model
# * txt   - use only the txt file as source
# * regex - use only the regex file as source
# * both  - use both the txt and regex files as sources
STRING_EDITS=txt
# Variable to specify the edit distance for the regex
# version of the strings file. The total edit distance for those operations is
# this value multiplied with the value of the EDIT_DISTANCE variable.
STRING_REGEX_EDIT_DISTANCE=2

# Variable to define whether to enable edits of word-final strings (as opposed
# to single letters). Possible values are:
# * no    - no longer string edits = only the default, letter-based error model
# * txt   - use only the txt file as source
# * regex - use only the regex file as source
# * both  - use both the txt and regex files as sources
FINAL_STRING_EDITS=no

# Variable to define whether to enable whole-word replacements. Possible values:
# - yes
# - no
WORD_REPLACEMENTS=no
}}}

The different options are described above in the comments. In the following
discussion only the relevant options are listed. We'll start with a minimal
error model:

!!!A minimal error model

{{{
DEFAULT_WEIGHT=10
EDIT_DISTANCE=2
INITIAL_EDITS=no
STRING_EDITS=no
FINAL_STRING_EDITS=no
WORD_REPLACEMENTS=no
}}}

That is, the error model contains only a Levenshtein edit distance {{2}} error
model with no additional components. It can be illustrated like this (the
multiplication factor {{2}} is taken from the {{Makefile.am}} variable
{{EDIT_DISTANCE}}, and {{.#.}} marks the beginning and end of the word):

[../images/SimpleErrorModel.png]

(Strictly speaking, the error model could be even simpler, with an edit
distance of one. But that will in most cases produce a very bad speller, so we
stick to the default edit distance of {{2}}.)

The file used to specify the letters of the error model is:

{{{
tools/spellcheckers/editdist.default.txt
}}}

In that file you specify the whole alphabet used for the error model (that is,
all and only the letters you want to be used when generating correction
suggestions). The default weight for each modification of the input misspelling
is specified in the {{Makefile.am}} variable:

{{{
DEFAULT_WEIGHT=10
}}}

That is, every letter change is given a default weight of {{10}} (in addition
to whatever weight is already present, e.g. from the corpus).

One can change this default for individual letters in the alphabet in the
{{editdist.default.txt}} file (which will then change the weight for all pairs
involving that letter), or for specific transitions:

{{{
## Inclusions: this is the real alphabet definition:
a
á	5
b
c
č	6

## Transition pairs + weight - section separator:
@@

## Transition pair specifications:
a	á	4
á	a	4
}}}

In the above fragment, the letters {{a}}, {{b}} and {{c}} will have a default
weight of {{10}} for all changes involving these letters, whereas changes
involving {{á}} and {{č}} will have a non-default weight as specified. In
addition, the change from {{a}} to {{á}} (and the other way around) is given a
weight of {{4}}.
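
As a rough illustration of how the two sections of that fragment can be read, here is a minimal Python sketch. It is not the parser used by the build. The function name {{parse_editdist}} is hypothetical, and the sketch assumes that fields are separated by TAB, that lines starting with {{#}} are comments, and that {{@@}} separates the alphabet from the transition pairs.

```python
DEFAULT_WEIGHT = 10  # mirrors the DEFAULT_WEIGHT Makefile variable

# The fragment shown above, with TAB written as \t.
FRAGMENT = """\
## Inclusions: this is the real alphabet definition:
a
á\t5
b
c
č\t6

## Transition pairs + weight - section separator:
@@

## Transition pair specifications:
a\tá\t4
á\ta\t4
"""


def parse_editdist(text, default_weight=DEFAULT_WEIGHT):
    """Return (letter_weights, pair_weights) read from editdist-style text."""
    letters, pairs = {}, {}
    in_pairs = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: '#' starts a comment line
        if line == "@@":
            in_pairs = True  # everything after this is a transition pair
            continue
        fields = line.split("\t")
        if in_pairs:
            source, target, weight = fields
            pairs[(source, target)] = float(weight)
        elif len(fields) > 1:
            letters[fields[0]] = float(fields[1])  # letter with its own weight
        else:
            letters[fields[0]] = float(default_weight)  # falls back to default
    return letters, pairs


letters, pairs = parse_editdist(FRAGMENT)
print(letters)
print(pairs)
```

The sketch only separates per-letter weights from per-pair weights. How the real build combines them, for instance for a pair where both letters carry non-default weights, is not covered here.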

!!!Slightly more complex - adding STRING_EDITS

The {{STRING_EDITS}} variable governs whether stretches longer than single
characters (i.e. strings) can be changed in one editing operation. It has four
possible values:

;''no''    : no {{STRING_EDITS}} operations
;''txt''   : {{STRING_EDITS}} taken from a txt file
;''regex'' : {{STRING_EDITS}} taken from a regex file
;''both''  : {{STRING_EDITS}} taken from both a txt and a regex file

!!STRING_EDITS=txt

Using a txt file as the input file for {{STRING_EDITS}} operations, you edit
a very simple data structure:

{{{
gi:giija	-2
riikka:rihká	-2
rg:rgg	-2
rgg:rg	-2
}}}

The format is:

* input string
* colon
* output string to replace the input string
* TAB
* weight specification (numeric type ''real'')
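
The format can be read with a few lines of Python. This is only a sketch for checking such a file by hand, not part of the build. The names {{StringEdit}} and {{parse_string_edits}} are hypothetical, and the sketch assumes that neither string contains a colon or a TAB and that the file has no comment lines.

```python
from typing import NamedTuple


class StringEdit(NamedTuple):
    source: str    # the misspelled character sequence
    target: str    # the sequence that replaces it
    weight: float  # weight specification (a real number)


def parse_string_edits(text):
    """Parse lines of the form 'input:output<TAB>weight'."""
    edits = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        pair, weight = line.split("\t")
        source, target = pair.split(":")
        edits.append(StringEdit(source, target, float(weight)))
    return edits


# The four example lines from above, with TAB written as \t.
SAMPLE = "gi:giija\t-2\nriikka:rihká\t-2\nrg:rgg\t-2\nrgg:rg\t-2\n"

for edit in parse_string_edits(SAMPLE):
    print(edit)
```

A line with a missing TAB or a second colon raises a {{ValueError}} in this sketch, which makes it usable as a quick format check.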

The intended use is to replace sequences of characters that typically get
spelled wrongly with their correct counterpart, such that the expected
suggestions appear on top or among the top 5. This can be useful also in cases
where the actual editing distance between input and output is only one, e.g.
when the error is part of a regular but context-restricted pattern.

The filename for this file is: {{strings.default.txt}}. The {{default}} part
can be replaced with names for alternative writing systems or orthographies, to
be used in spellers for those writing systems or orthographies.

The string pairs in this file are compiled as an fst parallel to the
Levenshtein edit distance model, and the editing distance variable is applied to
both. That is, with the following setup:

{{{
EDIT_DISTANCE=2
STRING_EDITS=txt
}}}

we get an error model that can be illustrated as follows:

[../images/ErrorModelWithStrings.png]

{{EDIT_DISTANCE=2}} means that one can correct up to two errors in the input
word, each of which can be either a regular Levenshtein operation or a string
replacement operation.
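
The effect can be sketched with a brute-force enumeration in Python. This is an illustration only: the real error model is a compiled fst, not a loop like this. The function names are hypothetical, and the sketch assumes that the Levenshtein operations are insertion, deletion and substitution of single letters, and it ignores weights entirely.

```python
def one_step(word, alphabet, string_edits):
    """All strings reachable from word with exactly one operation."""
    results = set()
    for i in range(len(word) + 1):
        for ch in alphabet:
            results.add(word[:i] + ch + word[i:])              # insertion
        if i < len(word):
            results.add(word[:i] + word[i + 1:])               # deletion
            for ch in alphabet:
                results.add(word[:i] + ch + word[i + 1:])      # substitution
        for source, target in string_edits:
            if word.startswith(source, i):                     # string replacement
                results.add(word[:i] + target + word[i + len(source):])
    results.discard(word)  # an operation that changes nothing does not count
    return results


def candidates(word, alphabet, string_edits, edit_distance=2):
    """All strings reachable with at most edit_distance operations."""
    seen = {word}
    frontier = {word}
    for _ in range(edit_distance):
        frontier = {
            new
            for old in frontier
            for new in one_step(old, alphabet, string_edits)
        } - seen
        seen |= frontier
    return seen


alphabet = "agij"
plain = candidates("ga", alphabet, [], edit_distance=2)
extended = candidates("ga", alphabet, [("gi", "giija")], edit_distance=2)
print("giija" in plain, "giija" in extended)
```

With the string pair {{gi:giija}} included, {{giija}} becomes reachable from {{ga}} in two operations (one substitution, one string replacement), whereas the letter operations alone would need more than two. The same sketch also shows why the search space grows quickly: every extra operation multiplies the number of candidates.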

!!STRING_EDITS=regex

The file for the regex string editing model is: {{strings.default.regex}}. The
content of that file is a standard Xerox-style regular expression, with an
additional Hfst weight specification:

{{{
{øø} -> {öö}::0 ,
 ø   ->  {ö}::0 ;
}}}

With the Makefile.am variables set as follows:

{{{
EDIT_DISTANCE=2
STRING_EDITS=regex
STRING_REGEX_EDIT_DISTANCE=2
}}}

we get an error model that looks like:

[../images/ErrorModelWithRegex.png]

The variable {{STRING_REGEX_EDIT_DISTANCE}} regulates how many times the regex
file is applied - __on top of__ the EDIT_DISTANCE variable. With the values
specified above, you can have ''four'' changes applied to the input word, as
long as all changes are covered by the {{strings.default.regex}} error model.

!!STRING_EDITS=both

In this case both the {{txt}} and {{regex}} files are included. With the
following settings:

{{{
EDIT_DISTANCE=2
STRING_EDITS=both
STRING_REGEX_EDIT_DISTANCE=2
}}}

we get the following error model:

[../images/ErrorModelWithBoth.png]

Beware that when using both the txt and the regex strings extensions to the
Levenshtein model, there is a risk that the total error model becomes too large
and powerful. This will be noticeable through sluggish suggestion speed. To
avoid this issue, make sure you only include strings and string patterns that
are frequent and have a good effect on suggestion quality. Also have a look at
the error model file size.

!!!Increasing the complexity - adding FINAL_STRING_EDITS

This part of the error model is meant to cover errors in suffixes. It comes
''in addition to'' the previous Levenshtein + strings error model, which means
that with {{EDIT_DISTANCE=2}}, you get two edit operations (Levenshtein or
string) ''plus'' one suffix operation. This will normally not be a problem,
since the changes are restricted to the final part of the word, and thus the
search space for the error model does not increase very much.

The possible values for this variable are the same as for {{STRING_EDITS}}:

;''no''    : no {{FINAL_STRING_EDITS}} operations
;''txt''   : {{FINAL_STRING_EDITS}} taken from a txt file
;''regex'' : {{FINAL_STRING_EDITS}} taken from a regex file
;''both''  : {{FINAL_STRING_EDITS}} taken from both a txt and a regex file

Each of these values has the same meaning and consequence as for
{{STRING_EDITS}}. The files are named {{final_strings.default.*}}.

!!FINAL_STRING_EDITS=txt

{{{
EDIT_DISTANCE=2
STRING_EDITS=both
STRING_REGEX_EDIT_DISTANCE=2
FINAL_STRING_EDITS=txt
}}}

[../images/ErrorModelWithFinalStrings.png]

!!FINAL_STRING_EDITS=regex

{{{
EDIT_DISTANCE=2
STRING_EDITS=both
STRING_REGEX_EDIT_DISTANCE=2
FINAL_STRING_EDITS=regex
}}}

[../images/ErrorModelWithFinalRegex.png]

!!FINAL_STRING_EDITS=both

{{{
EDIT_DISTANCE=2
STRING_EDITS=both
STRING_REGEX_EDIT_DISTANCE=2
FINAL_STRING_EDITS=both
}}}

[../images/ErrorModelWithFinalBoth.png]

The same warning applies in this case as for {{STRING_EDITS}}: if you use
both the {{txt}} and the {{regex}} files, make sure to test for speed and size
issues.

!!!Maximum complexity - adding INITIAL_EDITS

__NB!__ This is an experimental feature, and is not guaranteed to work as
intended.

The purpose of this variable is to allow for special treatment of the initial
letter(s) of the misspellings. This has a huge price, though, in terms of search
space and thus speed of the speller. If enabled, consider reducing the editing
distance to one, and compensate with more targeted additions in the {{strings}}
and {{final_strings}} files.

Also, as seen below, these edit operations come ''in addition to'' the regular
Levenshtein model (and final_strings operations), which means that the effective
editing distance of an error model with {{INITIAL_EDITS}} on,
{{EDIT_DISTANCE=2}} and {{FINAL_STRING_EDITS}} enabled is __four__. That is a
very powerful model, and one that is likely to be way too slow. Reducing
{{EDIT_DISTANCE}} to {{1}} will substantially limit the error model, and thus
improve suggestion speed.

The possible values for the {{INITIAL_EDITS}} variable are:

;''no''    : no {{INITIAL_EDITS}} operations
;''txt''   : {{INITIAL_EDITS}} taken from a txt file
;''regex'' : {{INITIAL_EDITS}} taken from a regex file
;''both''  : {{INITIAL_EDITS}} taken from both a txt and a regex file

Each of these values has the same meaning and consequence as for
{{STRING_EDITS}}. The files to edit are {{initial_letters.default.*}}.

!!INITIAL_EDITS=txt

{{{
EDIT_DISTANCE=2
INITIAL_EDITS=txt
STRING_EDITS=both
STRING_REGEX_EDIT_DISTANCE=2
FINAL_STRING_EDITS=both
}}}

[../images/ErrorModelWithInitLtrs.png]

!!INITIAL_EDITS=regex

{{{
EDIT_DISTANCE=2
INITIAL_EDITS=regex
STRING_EDITS=both
STRING_REGEX_EDIT_DISTANCE=2
FINAL_STRING_EDITS=both
}}}

[../images/ErrorModelWithInitRegex.png]

!!INITIAL_EDITS=both

{{{
EDIT_DISTANCE=2
INITIAL_EDITS=both
STRING_EDITS=both
STRING_REGEX_EDIT_DISTANCE=2
FINAL_STRING_EDITS=both
}}}

[../images/ErrorModelWithInitBoth.png]

!!!Complete madness - adding WORD_REPLACEMENTS

Actually, that might not be a bad idea. Enabling {{WORD_REPLACEMENTS}} does not
really add to the complexity of the error model, but it allows targeted
promotion of individual words on the suggestion list, words with known and
frequent misspellings. To that end you can add misspelled words and their
corrections to the file {{words.default.txt}}, in the following format:

{{{
oahppiin:ohppiin	-10
váiloje:váilo	-10
maŋge:mange	-10
}}}

The format is:

* misspelled word
* colon
* correct word
* TAB
* weight

The possible values for the {{WORD_REPLACEMENTS}} variable are:

;''no''    : no {{WORD_REPLACEMENTS}} operations
;''yes''   : enable {{WORD_REPLACEMENTS}}

Expanding on the settings fragment used throughout, we get the following:

{{{
EDIT_DISTANCE=2
INITIAL_EDITS=both
STRING_EDITS=both
STRING_REGEX_EDIT_DISTANCE=2
FINAL_STRING_EDITS=both
WORD_REPLACEMENTS=yes
}}}

When enabled, the file is compiled into an fst that is applied outside the rest
of the error model:

[../images/ErrorModelWithWords.png]

As discussed next, the settings above are not a good idea. The maximum editing
distance is actually six ({{1 + (2*2) + 1 = 6}}: one initial edit,
{{EDIT_DISTANCE}} times {{STRING_REGEX_EDIT_DISTANCE}}, and one final-string
edit), which is way too much. But it serves to illustrate the use of the
settings in {{Makefile.am}}.
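
The arithmetic behind that figure can be written down as a small Python function. This is only bookkeeping derived from the descriptions on this page, not something the build computes. The function name is hypothetical, and it assumes that initial and final edits each add one operation, that the regex distance multiplies {{EDIT_DISTANCE}} whenever a regex strings file is in use, and that {{WORD_REPLACEMENTS}} does not count because it is applied outside the rest of the error model.

```python
def max_edit_operations(edit_distance,
                        string_edits="no",
                        string_regex_edit_distance=1,
                        initial_edits="no",
                        final_string_edits="no"):
    """Upper bound on the number of edit operations for a given configuration."""
    uses_regex = string_edits in ("regex", "both")
    # The regex file is applied on top of EDIT_DISTANCE, so the two multiply.
    core = edit_distance * (string_regex_edit_distance if uses_regex else 1)
    initial = 1 if initial_edits != "no" else 0
    final = 1 if final_string_edits != "no" else 0
    return initial + core + final


# The minimal error model: plain Levenshtein with EDIT_DISTANCE=2.
print(max_edit_operations(2))
# Everything enabled, as in the settings above.
print(max_edit_operations(2, "both", 2, "both", "both"))
# Same, but with EDIT_DISTANCE reduced to 1.
print(max_edit_operations(1, "both", 2, "both", "both"))
```

Running the function on a planned configuration before building gives a quick indication of whether {{EDIT_DISTANCE}} should be reduced.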

!!!Final words

DO NOT ENABLE EVERYTHING! That will very, very likely make the error model size
explode, and make the speller so slow that it can't be used. Exactly which files
and what features are needed will vary from language to language, and has to be
tested on a case by case basis.

The goal of a good speller is to always suggest the correct thing, or something
sensible and close to the correct thing. But do not overdo this: it is better
to suggest nothing than to take several seconds to produce a suggestion.