Preprocessor for Sámi language tools

Overview and intro

This document describes how the project splits text into separate words (preprocessing). It contains an overall description of the preprocessing task and some implementation details.

The preprocessor is a Perl script called preprocess. It takes text as input and produces a list of tokens separated by newlines. The preprocessor has two main functions: it cuts the text into sentences, and the sentences into tokens (words and other units such as numerical expressions and punctuation). Sentence delimiters and most punctuation marks are treated as separate tokens.

The output of the preprocessor is the input for the morphological parser. (See the flowchart of the parsing process.) The parser gives each token a morphological analysis, that is, a tag or a set of tags. For the analysis to be successful, the preprocessor must be fully compatible with the parser: it must produce tokens that the parser recognizes. This means, for example, that multiword expressions like "earret eará" have to be recognized as a single token (not as two separate words) by both the preprocessor and the parser. To achieve this, the lexicon entries of "special" tokens, such as punctuation, abbreviations and multiword expressions, are designed in parallel with the preprocessor (see the section Abbreviations and the lexicon files).

Tokens

The string that is given to the preprocessor is divided into tokens, which are the basic units used in the lexical analysis. Tokens are complete units such as words and punctuation marks. The default assumption is that a string surrounded by spaces is one token.

The tokens are of three types: words, numerical expressions and punctuation. Numerical expressions consist of numbers, dates, times, prices, etc. Words include abbreviations and multiword expressions. The punctuation that gets a morphological analysis forms its own class of tokens, including for example {}[]()?!;,'"

A string consisting only of alphabetical characters (and the number 1 for the digraphs d1, s1, etc.) and surrounded by spaces is always a single token, a word. If two words are accidentally typed without a separating space, the error is not corrected. Words may contain a hyphen, as in Davvi-Norgga.

The punctuation that occurs in the text is divided into two classes: 1. marks that are an inseparable part of a word or numerical expression and 2. marks that are independent. Independent punctuation may split the input string into two or more parts. For example, the string (gielddat/guovllut). is divided into six parts: both parentheses and the dot are tokens of their own, and so are the two words and the slash. The preprocessor output is thus:

(
gielddat
/
guovllut
)
.

The splitting depends on the surrounding characters. For example, a dot rarely splits a numerical expression in two (1.4.2004) but generally splits a string that consists only of alphabetical characters (Minä lähdin.En tule takaisin.). The treatment of punctuation is explained in detail in the section Punctuation below.
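
The splitting of (gielddat/guovllut). shown above can be illustrated with a minimal Perl sketch. The helper name split_word_string is hypothetical and is not taken from the preprocess script. The sketch also simplifies: it treats every dot as a separate token and ignores abbreviations and numerical expressions.

```perl
use strict;
use warnings;

# Hypothetical helper, not the actual preprocess code: split a word-like
# string on punctuation marks that form tokens of their own.
sub split_word_string {
    my ($string) = @_;
    my $punct = qr/([(){}\[\]\/.?!;,"'])/;
    # The capturing group makes split() return the delimiters as well.
    my @parts = split $punct, $string;
    # split() yields empty strings next to adjacent delimiters; drop them.
    return grep { length $_ } @parts;
}

print "$_\n" for split_word_string('(gielddat/guovllut).');
```

Run as written, the sketch prints the same six tokens as the example output above, one per line.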

Abbreviations contain a final period that does not necessarily end the sentence. The abbreviations are listed in a lexicon file, where they receive a +ABBR tag. However, the same abbreviations must be recognized already in the preprocessing phase, so that the period is not treated as a sentence delimiter. The list of abbreviations is extracted from the lexicon file and used in preprocessing (see the section Abbreviations).

The preprocess script

The preprocessor is a Perl script called preprocess; it reads its input from STDIN. It takes one command-line parameter: the file that contains the previously generated list of abbreviations and multiword expressions. The file is created by make, and its name is usually lang/bin/abbr.txt. Usage:

preprocess --abbr=sme/bin/abbr.txt

The output of the script is a list of words (tokens) and punctuation marks, one per line.
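
As an illustration of the command-line interface, the following sketch reads the --abbr option with the core module Getopt::Long. That the real script uses Getopt::Long is an assumption, and the helper name parse_abbr_option is hypothetical.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: read the --abbr option from an argument list.
sub parse_abbr_option {
    my @args = @_;
    my $abbr_file;
    GetOptionsFromArray(\@args, 'abbr=s' => \$abbr_file)
        or die "Usage: preprocess --abbr=FILE\n";
    die "Missing --abbr=FILE\n" unless defined $abbr_file;
    return $abbr_file;
}

print parse_abbr_option('--abbr=sme/bin/abbr.txt'), "\n";
```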

The text is handled one paragraph at a time (a paragraph ends when two consecutive newlines are encountered). If there are no paragraph breaks in the text, the whole input is read in one go. The newlines are replaced with spaces and the paragraph is treated as a single line. The line is split on whitespace (space, tab, newline) and the elements are stored in an array. Example:

First read the text. (Then split by space.)

First
read
the
text.
(Then
split
by
space.)
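
The paragraph handling described above can be sketched with Perl's paragraph mode. Whether preprocess sets the input record separator this way is an assumption; the helper names read_paragraphs and paragraph_to_elements are hypothetical.

```perl
use strict;
use warnings;

# Paragraph mode: with $/ set to the empty string, each read returns one
# paragraph, that is, the text up to the next run of blank lines. Input
# without blank lines comes back as a single record.
sub read_paragraphs {
    my ($fh) = @_;
    local $/ = '';
    my @paragraphs = <$fh>;
    return @paragraphs;
}

# Turn one paragraph into the array of whitespace-separated elements.
sub paragraph_to_elements {
    my ($paragraph) = @_;
    $paragraph =~ s/\n/ /g;          # newlines become spaces
    return split ' ', $paragraph;    # split on runs of whitespace
}

my $text = "First read the text.\n(Then split by space.)\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
print "$_\n" for map { paragraph_to_elements($_) } read_paragraphs($fh);
```

In the real script the filehandle would be STDIN; the in-memory file only keeps the sketch self-contained.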

The elements in the array are processed one at a time. An element may consist of two or more tokens; the tokens are stored in a separate tokens array. First, the punctuation that precedes the token and does not belong to it is removed and stored in the tokens array. This class of punctuation includes {}[]()?!;,'". If the remaining part starts with an alphabetical character, optionally preceded by one non-word character, it is considered to be a word token. Otherwise it is a numerical expression.
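
A minimal sketch of this step is given below. It reuses the value of $SEPARATE_PUNCT quoted later in this document; the helper name strip_and_classify and its return convention are hypothetical and not taken from the script.

```perl
use strict;
use warnings;
use utf8;

# Same value as $SEPARATE_PUNCT quoted in this document.
my $SEPARATE_PUNCT = quotemeta("|{}[]()«»?!;,'\"");

# Hypothetical helper: peel leading separate punctuation off an element and
# classify what remains. Returns (\@leading_tokens, $rest, $type).
sub strip_and_classify {
    my ($element) = @_;
    my @tokens;
    while ($element =~ s/^([$SEPARATE_PUNCT])//) {
        push @tokens, $1;
    }
    # A letter, optionally preceded by one non-word character, means a word.
    my $type = $element =~ /^\W?\p{L}/ ? 'word' : 'numeric';
    return (\@tokens, $element, $type);
}

my ($leading, $rest, $type) = strip_and_classify('(Then');
print "@$leading | $rest | $type\n";
```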

The word tokens are tested against the list of multiword expressions and, if needed, expanded to cover the whole expression. The punctuation at the end of the word that does not belong to the token is stored in the tokens array. The word may be divided into several tokens if it contains delimiters such as /.

If the word contains a dot, it is tested against the list of abbreviations to see whether the dot is a sentence delimiter or belongs to the token (see the section Abbreviations). Dots that are sentence delimiters are printed out as separate tokens. Otherwise the dot remains attached to the word. If the sentence ends with an abbreviation, an extra dot is added to mark the sentence boundary.

The treatment of numerical expressions differs in that most of the punctuation, such as /, is considered to be part of the token (see Punctuation for details). The dot that follows a number is considered to be a sentence delimiter if the following word starts with a capital letter. Otherwise the dot belongs to the token, which may be an ordinal number, a date, etc.
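
The rule for a dot that follows a number can be sketched as below. The helper name split_numeric_dot is hypothetical. The text above does not say what happens when no element follows the number, so the sketch simply leaves the dot attached in that case.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether the dot that ends a numerical
# expression is a sentence delimiter, given the element that follows it.
sub split_numeric_dot {
    my ($numeric, $next) = @_;
    # Capture the part before a final dot, if there is one.
    my ($stem) = $numeric =~ /^(.+)\.$/;
    if (defined $stem && defined $next && $next =~ /^\p{Lu}/) {
        return ($stem, '.');    # the dot becomes a token of its own
    }
    return ($numeric);          # the dot, if any, stays in the token
}

print join(' ', split_numeric_dot('2004.', 'Dan')), "\n";
print join(' ', split_numeric_dot('1.', 'beaivi')), "\n";
```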

Some constants affect the preprocessing. Their current values are:

my $MULTIWORD_SIZE = 3;
my $SEPARATE_PUNCT = quotemeta("|{}[]()«»?!;,'\"");
my $CONTAIN_PUNCT = 'ja\/dahje|http|:\/\/';

The first constant, $MULTIWORD_SIZE, defines the size of a multiword expression, i.e. how many words are included when testing whether an expression is a multiword. The constant $SEPARATE_PUNCT contains all the punctuation marks that are treated as individual tokens without further processing when they occur at the beginning or at the end of a token. Word tokens that contain punctuation belonging to the token are defined in the constant $CONTAIN_PUNCT. This constant is currently the only language-dependent part of the script.
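
One way $MULTIWORD_SIZE could bound the multiword test is sketched below. The helper name join_multiwords, the hash of known expressions, and the choice to try the longest candidate first are all assumptions; the document does not describe how the script orders its tests.

```perl
use strict;
use warnings;
use utf8;

my $MULTIWORD_SIZE = 3;

# Hypothetical helper: join elements that form a known multiword expression.
# %$multiwords maps each expression (words joined by one space) to a true value.
sub join_multiwords {
    my ($multiwords, @words) = @_;
    my @tokens;
    my $i = 0;
    while ($i < @words) {
        my $max = $MULTIWORD_SIZE;
        $max = @words - $i if $max > @words - $i;
        my $taken = 1;
        # Try the longest candidate first, then shorter ones.
        for (my $n = $max; $n > 1; $n--) {
            my $candidate = join ' ', @words[$i .. $i + $n - 1];
            if ($multiwords->{$candidate}) {
                $taken = $n;
                last;
            }
        }
        push @tokens, join ' ', @words[$i .. $i + $taken - 1];
        $i += $taken;
    }
    return @tokens;
}

binmode STDOUT, ':encoding(UTF-8)';
my %known = ('earret eará' => 1);
print "$_\n" for join_multiwords(\%known, qw(ja earret eará dan));
```

With this input the sketch prints three tokens: ja, earret eará and dan.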

Punctuation

As most of the punctuation in a numerical expression is an inseparable part of it, numerical expressions are treated as a separate class. Punctuation in other strings generally causes the string to be split, but not always, depending on the punctuation mark. The following table lists the punctuation marks:

| Punctuation | Numerical expressions (digits and non-word characters) | Words (word characters, no digits) |
|---|---|---|
| (){}[]" ' | Always their own tokens | Always their own tokens. Exception: (?) |
| : | Belongs to the expression if not followed by space (14:30, 10:s) | Belongs to the expression if not followed by space (Namdal:s) |
| / | Belongs to the expression (5/2004) | Always its own token (gielddat/guovllut). Exceptions: ja/dahje, which is treated as one token; web addresses and file names. |
| -, -- | Belongs to the expression, also when surrounded by spaces (1-3, 1 - 3) | Belongs to the expression, but not when separated by space on either side (Davvi-Norgga, dárogielas ja -kultuvrras). Hyphenated words are a problem. |
| !?; | Always its own token | Always its own token |
| . | Belongs to the expression unless it is a sentence delimiter | Its own token. Exception: belongs to an abbreviation and does not end the sentence. |
| +*= | Belongs to the expression | Always its own token |
| % | Belongs to the expression, also when separated by space (50%, 50 %) | Always its own token |

Abbreviations

The abbreviations are divided into three classes according to whether they are able to end the sentence or not. The first class contains transitive abbreviations (TRAB), which never end the sentence. The intransitive abbreviations (ITRAB) are considered to end the sentence whenever they are followed by a capital letter or a number. The third class (TRNUMAB) contains abbreviations that end the sentence when followed by a capital letter but not when followed by a number. In sum:

TRAB: never ends the sentence.
ITRAB: ends the sentence when followed by a capital letter or a number.
TRNUMAB: ends the sentence when followed by a capital letter, but not when followed by a number.
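
The three classes translate into a small decision function. The sketch below is only an illustration: the helper name abbr_ends_sentence is hypothetical, and an abbreviation with nothing after it is treated as not ending the sentence, which the real script may handle differently.

```perl
use strict;
use warnings;

# Hypothetical helper: does the dot after an abbreviation end the sentence?
# $class is 'TRAB', 'ITRAB' or 'TRNUMAB'; $next is the following element.
sub abbr_ends_sentence {
    my ($class, $next) = @_;
    die "Unknown abbreviation class: $class\n"
        unless $class =~ /^(?:TRAB|ITRAB|TRNUMAB)$/;
    return 0 if $class eq 'TRAB';           # never ends the sentence
    $next = '' unless defined $next;
    my $capital = $next =~ /^\p{Lu}/ ? 1 : 0;
    my $number  = $next =~ /^\d/     ? 1 : 0;
    return $capital if $class eq 'TRNUMAB'; # capital letter only
    return $capital || $number;             # ITRAB: capital letter or number
}

for my $class (qw(TRAB ITRAB TRNUMAB)) {
    for my $next (qw(Oslo 5 ja)) {
        printf "%-8s %-5s %d\n", $class, $next, abbr_ends_sentence($class, $next);
    }
}
```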

Abbreviations and the lexicon files

All the abbreviations are extracted from the lexicon files as part of the compilation process run by make. The relevant commands are stored in the Makefile in the gt/lang/src directory. The script that handles the extraction is called abbr-extract and is located in the gt/script directory; it is intended to be language-independent.

The main source is abbr-lang-lex.txt, where the real abbreviations are listed. There are also a number of multiword expressions spread over different lexicon files; for example, the file adv-sme-lex.txt contains multiword adverbials like earret eará. In addition, the proper noun lexicon contains (at the moment) over 500 multiword proper nouns.

To be able to treat the multiword expressions as single tokens already in the preprocessing phase, the multiword expressions are extracted from the lexicon files. The relevant lexicon file names are given to the script abbr-extract in lang/src/Makefile. There are no restrictions on how the multiword expressions are listed; they can be used in the lexicon files like other tokens. Only the structure of the file abbr-lang-lex.txt is restricted.

Usage of the script abbr-extract:

abbr-extract --output= --abbr_lex=
--lex=

The structure of abbr-lang-lex.txt

The structure of abbr-lang-lex.txt is otherwise free, but the part that is extracted must have the following syntax:

LEXICON ITRAB
abbr  TAG ;
ab.br  TAG ;
ab  TAG ;
abbr% abbr TAG ;

The same applies to the other abbreviation classes, LEXICON TRAB and LEXICON TRNUMAB. The abbreviation classes may appear in any order.

The abbreviations may contain one or more dots, but because the dot is an operator in xfst, the dot that ends the abbreviation must be removed. Pattern matching in the preprocessor is always done without the final dot. The dot can be added to the abbreviation again in the lex file.

The space in multiword abbreviations is marked in the lexicons with a % sign (the literal operator), which the abbr-extract script recognizes and removes.

The abbreviations in abbr-lang-lex.txt are in lower case unless only an upper-case version of the abbreviation exists. The abbr-extract script generates upper-case versions of all the abbreviations (they may begin a sentence).
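
The extraction rules in this section can be sketched as follows. This is not the real abbr-extract script: the helper name extract_abbreviations and the returned hash are hypothetical, and the sketch assumes that the upper-case version capitalizes only the first letter, which is not stated here.

```perl
use strict;
use warnings;

# Hypothetical sketch of the extraction step, not the real abbr-extract.
# Returns a hash: class name => reference to a list of abbreviations.
sub extract_abbreviations {
    my (@lines) = @_;
    my %classes = map { $_ => 1 } qw(TRAB ITRAB TRNUMAB);
    my (%abbrs, $current);
    for my $line (@lines) {
        if ($line =~ /^LEXICON\s+(\S+)/) {
            # Only the three abbreviation lexicons are extracted.
            $current = $classes{$1} ? $1 : undef;
            next;
        }
        next unless defined $current;
        # Entry, continuation class, semicolon; "%" escapes the next character.
        next unless $line =~ /^\s*((?:%.|[^\s%])+)\s+\S+\s*;/;
        (my $abbr = $1) =~ s/%(.)/$1/g;    # "abbr% abbr" becomes "abbr abbr"
        push @{ $abbrs{$current} }, $abbr;
        # Assumption: the upper-case version capitalizes the first letter.
        my $upper = ucfirst $abbr;
        push @{ $abbrs{$current} }, $upper if $upper ne $abbr;
    }
    return %abbrs;
}

my %found = extract_abbreviations(
    'LEXICON ITRAB', 'abbr  TAG ;', 'ab.br  TAG ;', 'abbr% abbr TAG ;',
);
print "$_\n" for @{ $found{ITRAB} };
```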

Possible flaws

Some of the transitive abbreviations may end the sentence.

Some of the abbreviations are homonyms: min. and min

Abbreviations may be followed by a capital letter even when they do not end the sentence:

Siellä oli Pekka ym. Mäkisiä.

Siellä
oli
Pekka
ym
.
Mäkisiä
.

Siellä oli Pekka ym. Mäkinen ei tullut.

ym
.
Mäkinen

Last modified: Fri Aug 13 11:14:43 2004