(''or: how to fix decomposed Sami letters'') In Unicode, many glyphs (letter symbols) may either be represented by one character, or by a sequence of many. The letter á may thus be either one character á or two characters a and combining ´ . Normalisation forms are used to standardise the representation. # NFKD = Normalization Form Compatibility Decomposition # NFKC = Normalization Form Compatibility Composition The first, NFKD, __decomposes__ the characters (á as two characters), whereas the second, NFKC, __composes it__ (á as one character). Our North Sami analysers use the __composed__ representation. If you get text with decomposed letters (__UnicodeChecker__ will tell you that č is two characters), you must __compose__ them with the following command {{{ cat infile.txt \ | uconv -f utf8 -t utf8 -x Any-NFKC > outfile.txt }}} See also {{man uconv}} The uconv program should be installed on your machine as part of the ICU installation. * [Unicode on normalization|http://unicode.org/reports/tr15/] * [Exmple script where the command is used|https://github.com/redpony/cdec/blob/master/corpus/utf8-normalize.sh]