Preprocessing the input
- Introduction
- Tokenizing
- The tokenizer file
- Handling abbreviations
- Spell relaxation of æ/ä, ø/ö
- Initial capitalization
- Capitalization of whole words
Within the Xerox framework, tokenizing is done with the
tokenize tool. The tokenizer itself is written as a set of regular
expressions, and the source file (tok.txt) is compiled by xfst.
The tok.txt file was copied from an earlier version of the Northern Sami
(sme) tok.txt file, and has not yet (030311) been updated to follow the
latest sme developments.
The starting point for the preprocessor was the sme tok.txt file.
Lule Sami abbreviations still need to be added.
Spell relaxation is a feature common to Lule and Southern Sami that is not
found in Northern Sami. The letter pairs æ/ä and ø/ö are used
interchangeably in Norway and Sweden, so the parser should accept either
variant. The xfst file that handles this is short; it consists of one line:
æ (->) ä, ø (->) ö ;
The line says that æ may optionally be replaced by ä, and that ø may
optionally be replaced by ö.
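To make the effect of the optional replacement concrete, the following is a minimal Python sketch of the spellings the rule relates to one another. It is not part of the xfst source: the function name relaxed_variants and the sample string are illustrative assumptions, and the sketch only enumerates variants, whereas the compiled transducer is applied during lookup.

```python
from itertools import product

# Optional replacements taken from the rule: æ (->) ä, ø (->) ö
OPTIONAL_REPLACEMENTS = {"æ": "ä", "ø": "ö"}


def relaxed_variants(word):
    """Return every spelling obtained by optionally replacing æ with ä and ø with ö.

    Hypothetical helper for illustration; each occurrence is replaced
    independently, which is what "optional" means in the rule.
    """
    # For each character, list the spellings it may surface as.
    choices = [
        (ch, OPTIONAL_REPLACEMENTS[ch]) if ch in OPTIONAL_REPLACEMENTS else (ch,)
        for ch in word
    ]
    # Combine the per-character choices into whole words.
    return {"".join(combo) for combo in product(*choices)}


if __name__ == "__main__":
    # Sample string only; prints the original spelling and the ä variant.
    print(sorted(relaxed_variants("gærja")))
```

Because every occurrence is optional, a word with n relevant letters yields 2^n spellings, including the unchanged original.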
Initial capitalization works as in the Northern Sami file; cf. the
documentation for Northern Sami initial capitalization. The content of the
file is slightly different, since the letter repertoire is different.
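As a rough illustration of why the letter repertoire matters here, the sketch below shows one way a capitalized token could be related to its lowercase-initial form. It is written in Python rather than xfst and does not reproduce the actual file: the letter list is an illustrative assumption, not the real Lule Sami repertoire used in the source, and the name initial_case_candidates is hypothetical.

```python
# Illustrative letter repertoire only (an assumption, not the file's content).
LOWERCASE_LETTERS = "abdefghijklmnoprstuváåäæŋ"

# Map each capital letter to its lowercase counterpart.
CAPITAL_TO_LOWER = {letter.upper(): letter for letter in LOWERCASE_LETTERS}


def initial_case_candidates(token):
    """Return the token and, if it starts with a known capital letter,
    the same token with a lowercase initial (hypothetical helper)."""
    candidates = [token]
    if token and token[0] in CAPITAL_TO_LOWER:
        candidates.append(CAPITAL_TO_LOWER[token[0]] + token[1:])
    return candidates


if __name__ == "__main__":
    print(initial_case_candidates("Áhttje"))
    print(initial_case_candidates("áhttje"))
```

A capital letter missing from the repertoire is simply left alone, which is why the list has to match the language being analysed.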
Capitalization of whole words has not yet been implemented.
Trond Trosterud