Preprocessing the input

Introduction
Tokenizing
1. The tokenizer file
2. Handling abbreviations
Spell relaxation of æ/ä, ø/ö
Initial capitalization
Capitalization of whole words

Introduction

Within the Xerox framework, preprocessing is done with the tokenize tool. The code itself is written as a set of regular expressions, and the source file (tok.txt) is compiled by xfst.

The tok.txt file was copied from an earlier version of the sme (Northern Sami) tok.txt file, and has not yet (030311) been updated to reflect the latest sme developments.

Tokenizing

The tokenizer file

The starting point for the preprocessor was the sme tok.txt file.

Spell relaxation of æ/ä, ø/ö

This is a feature common to Lule and Southern Sami that is not found in Northern Sami. The letters æ/ä and ø/ö are used interchangeably in Norway and Sweden, so the parser should accept either variant of each.

The xfst file that handles this is short; it consists of one line:

æ (->) ä, ø (->) ö ;

The line says that æ may optionally be replaced by ä, and that ø may optionally be replaced by ö. Because each replacement is optional, the unchanged spelling is always accepted as well.
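
To make the effect of the optional replacement concrete, here is a minimal Python sketch (not part of the xfst source) that enumerates every spelling the rule would allow for a given string. The function name relaxed_variants and the sample word are illustrative assumptions; the actual behaviour is defined by the compiled xfst rule, not by this code.

```python
from itertools import product

# Optional replacements taken from the rule: æ (->) ä, ø (->) ö
OPTIONAL_REPLACEMENTS = {"æ": "ä", "ø": "ö"}


def relaxed_variants(word):
    """Return the set of strings obtained by optionally replacing
    each æ with ä and each ø with ö (hypothetical helper)."""
    choices = []
    for ch in word:
        if ch in OPTIONAL_REPLACEMENTS:
            # Optional rule: keep the letter or replace it.
            choices.append((ch, OPTIONAL_REPLACEMENTS[ch]))
        else:
            # All other letters pass through unchanged.
            choices.append((ch,))
    return {"".join(combo) for combo in product(*choices)}


if __name__ == "__main__":
    # Sample word chosen only for illustration.
    for variant in sorted(relaxed_variants("gærja")):
        print(variant)
```

A string with n occurrences of æ or ø yields 2^n variants, since each occurrence is independently kept or replaced.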

Initial capitalization

This works as in the Northern Sami file; cf. the documentation for Northern Sami initial capitalization. The content of the file differs slightly, since the letter repertoire is different.

Capitalization of whole words

This has not yet been implemented.

Trond Trosterud