(''or: how to fix decomposed Sami letters'')

In Unicode, many letters can be represented either by a single
character (code point) or by a sequence of several. The letter á may thus be
either the one character á (U+00E1) or the two characters a (U+0061) and
combining acute accent (U+0301). Normalisation forms are used to standardise
the representation.

# NFKD = Normalization Form Compatibility Decomposition
# NFKC = Normalization Form Compatibility Composition

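The first, NFKD, __decomposes__ the characters (á as two characters),
whereas the second, NFKC, __composes__ them (á as one character).
Both forms also replace compatibility characters with their plain
equivalents (the ligature ﬁ becomes the two letters f and i).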
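
The difference between the two forms can be illustrated with a minimal Python sketch, using only the standard {{unicodedata}} module. The helper name {{codepoints}} is hypothetical and is introduced here only for illustration; it is not part of any of our tools.

```python
import unicodedata

composed = "\u00e1"      # á as a single code point
decomposed = "a\u0301"   # a followed by U+0301 COMBINING ACUTE ACCENT


def codepoints(text):
    """Return the code points of text as a list of 'U+XXXX' strings.

    Hypothetical helper, used here only to make the difference visible.
    """
    return [f"U+{ord(ch):04X}" for ch in text]


# NFKD splits the composed letter into base letter + combining mark.
nfkd = unicodedata.normalize("NFKD", composed)

# NFKC joins base letter + combining mark into the single composed letter.
nfkc = unicodedata.normalize("NFKC", decomposed)

print(codepoints(nfkd))
print(codepoints(nfkc))

# The two representations look identical on screen but do not compare equal
# until they are normalised to the same form.
print(composed == decomposed)
print(nfkc == composed)
```

The two strings are rendered the same way, which is why decomposed input is easy to overlook until an analyser fails to recognise a word.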

Our North Sami analysers use the __composed__ representation.

If you get text with decomposed letters (__UnicodeChecker__ will tell you that č is two characters), you must __compose__ them with the following command:

{{{
cat infile.txt \
| uconv -f utf8 -t utf8 -x Any-NFKC > outfile.txt
}}}

See also {{man uconv}}.

The uconv program should be installed on your machine as part of
the ICU installation.
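
If {{uconv}} is not available, the same composition can probably be done with Python's standard {{unicodedata}} module. The following is a sketch under that assumption, not a replacement for the command above; the function names {{is_composed}} and {{compose}} are hypothetical, and the sample word is only an illustration.

```python
import unicodedata


def is_composed(text):
    """Return True if text is already in NFKC form (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text) == text


def compose(text):
    """Return text converted to NFKC, i.e. with composed letters."""
    return unicodedata.normalize("NFKC", text)


# "čálli" written with decomposed letters:
# c + U+030C COMBINING CARON, a + U+0301 COMBINING ACUTE ACCENT.
sample = "c\u030ca\u0301lli"

if not is_composed(sample):
    fixed = compose(sample)
    # The composed string is shorter: two letter + mark pairs became two letters.
    print(len(sample), len(fixed))
```

To convert a whole file, read it with {{encoding="utf-8"}}, pass the contents through {{compose}}, and write the result back out with the same encoding.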


* [Unicode on normalization|http://unicode.org/reports/tr15/]
* [Example script where the command is used|https://github.com/redpony/cdec/blob/master/corpus/utf8-normalize.sh]