preprocess
The preprocessor is a Perl script called preprocess. It takes text as input and produces a list of tokens, one per line.
The preprocessor has two main functions: it cuts the text into sentences and the sentences into tokens (words and other units such as numerical expressions and punctuation). Sentence delimiters and most punctuation marks are treated as separate tokens.
The preprocessor output is the input for the morphological parser. (See the flowchart of the parsing process.) The parser gives each token a morphological analysis, that is, a tag or a set of tags. For the analysis to be successful, the preprocessor must be fully compatible with the parser: it must produce tokens that the parser recognizes. This means, for example, that multiword expressions like "earret eará" have to be recognized as a single token (not as two separate words) by both the preprocessor and the parser. To achieve this, the lexicon entries of "special" tokens, such as punctuation, abbreviations and multiword expressions, are designed in parallel with the preprocessor (see the section Abbreviations and the lexical files).
The string that is given to the preprocessor is divided into tokens, which are the basic units used in the lexical analysis. The tokens are complete units such as words, punctuation marks, etc. The default assumption is that a string surrounded by spaces is one token.
The tokens are of three types: words, numerical expressions and punctuation. Numerical expressions consist of numbers, dates, times, prices, etc. Words include abbreviations and multiword expressions. The punctuation that gets a morphological analysis forms its own class of tokens, including for example {}[]()?!;,'\"
A string consisting only of alphabetical characters (and the number 1 for the digraphs d1, s1, etc.) surrounded by spaces is always an instance of one token, a word. So if two words are accidentally typed without a separating space, it is an error that is not accounted for. Words may contain a hyphen, as in Davvi-Norgga.
The punctuation that occurs in the text is divided into two classes: 1. marks that are an inseparable part of a word or numerical expression and 2. marks that are independent. Independent punctuation may split the input string into two or more parts. For example, the string (gielddat/guovllut). will be divided into six parts: both parentheses are their own tokens, as is the dot. Both words and the slash are also separate tokens. The preprocessor output is thus:
(
gielddat
/
guovllut
)
.
The splitting depends on the surrounding characters. For example, a dot rarely splits a numerical expression into two but generally splits a string that consists only of alphabetical characters (1.4.2004, Minä lähdin.En tule takaisin.). The treatment of punctuation is explained in detail in the documentation below.
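The splitting of a word string by independent punctuation can be sketched in Perl as follows. This is only an illustration of the rule described above, not the code of the preprocess script: the function name split_word_string is hypothetical, and the sketch ignores abbreviations, numerical expressions and the exceptions listed later in this document.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from the preprocess script: split one
# whitespace-delimited word string into tokens, treating brackets, quotes,
# the slash and a final dot as independent punctuation.
sub split_word_string {
    my ($string) = @_;
    my (@head, @tail);

    # Peel independent punctuation off the front ...
    while ($string =~ s/^([(){}\[\]"'?!;,])//) {
        push @head, $1;
    }
    # ... and off the end, including a sentence-final dot.
    while ($string =~ s/([(){}\[\]"'?!;,.])$//) {
        unshift @tail, $1;
    }
    # In a word string the slash is its own token; the capturing group
    # makes split() return the slash as well as the words around it.
    my @middle = grep { length } split m{(/)}, $string;

    return (@head, @middle, @tail);
}

# Prints the six tokens of the example, one per line.
print join("\n", split_word_string('(gielddat/guovllut).')), "\n";
```

Peeling the punctuation from both ends first and splitting the remainder afterwards keeps the order of the tokens the same as in the input.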
The abbreviations contain a final period that does not necessarily end the sentence. The abbreviations are listed in a lexicon file where they receive a +ABBR tag. However, the same abbreviations must be recognized already in the preprocessing phase, so that the period is not treated as a sentence delimiter. The list of abbreviations is extracted from the lexicon file and used in preprocessing (see Handling abbreviations).
preprocess script
The preprocessor is a Perl script called preprocess; it reads its input from STDIN. The preprocessor is given one command-line parameter: the file from which to read the previously generated list of abbreviations and multiword expressions. The file is created by make and is usually named lang/bin/abbr.txt. Usage:
preprocess --abbr=sme/bin/abbr.txt
The output of the script is a list of words (tokens) and punctuation marks, one per line.
The text is handled one paragraph at a time (a paragraph ends when two consecutive newlines are encountered). If there are no paragraphs in the text, the whole input is read in one go. The newlines are replaced with spaces and the paragraph is treated as a single line. The line is divided using whitespace (space, tab, newline) as a separator and the elements are stored in an array. Example:
First read the text. (Then split by space.)
First
read
the
text.
(Then
split
by
space.)
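The paragraph-wise reading and the split into elements could look roughly like this in Perl. The sketch relies on Perl's paragraph mode (setting $/ to the empty string); whether the preprocess script uses the same mechanism is an assumption, and the names paragraph_elements and read_paragraphs are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one paragraph into the array of
# whitespace-delimited elements that later steps split into tokens.
sub paragraph_elements {
    my ($paragraph) = @_;
    $paragraph =~ s/\n/ /g;          # newlines become spaces
    return split ' ', $paragraph;    # split on runs of whitespace
}

# Hypothetical helper: read a filehandle paragraph by paragraph and
# return one array reference of elements per paragraph.
sub read_paragraphs {
    my ($fh) = @_;
    local $/ = "";                   # paragraph mode: records end at blank lines
    my @paragraphs;
    while (my $paragraph = <$fh>) {
        push @paragraphs, [ paragraph_elements($paragraph) ];
    }
    return @paragraphs;
}

# Usage with an in-memory file; the real script reads STDIN instead.
my $sample = "First read the text.\n(Then split by space.)\n\nSecond paragraph.\n";
open my $in, '<', \$sample or die "cannot open in-memory file: $!";
print join('|', @$_), "\n" for read_paragraphs($in);
```

With input that contains no blank lines, paragraph mode returns the whole text as a single record, which matches the behaviour described above.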
The elements in the array are processed one at a time. An element may consist of two or more tokens; the tokens are stored in a separate tokens array. First, the punctuation that precedes the token and does not belong to it is removed and stored in the tokens array. This class of punctuation includes {}[]()?!;,'\". If the remaining part starts with an alphabetical character, optionally preceded by one non-word character, it is considered to be a word token. Otherwise it is a numerical expression.
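The word/number decision described above fits in a single regular expression. The following Perl sketch is an illustration only; the function name element_type and the return values 'word' and 'number' are hypothetical, and the real script may test the element differently.

```perl
use strict;
use warnings;

# Hypothetical helper: classify an element whose leading separate
# punctuation has already been removed. An alphabetical character,
# optionally preceded by one non-word character, marks a word token;
# everything else is treated as a numerical expression.
sub element_type {
    my ($element) = @_;
    return $element =~ /^\W?[[:alpha:]]/ ? 'word' : 'number';
}

printf "%-12s %s\n", $_, element_type($_)
    for qw(gielddat -kultuvrras 14:30 5/2004 50%);
```

Note that for non-ASCII letters the input would have to be decoded to Perl character strings first; the sketch leaves that step out.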
The word tokens are tested against the list of multiword expressions and expanded if needed. The punctuation at the end of the word that does not belong to the token is stored in the tokens array. The word may be divided into several tokens if it contains delimiters such as /.
If the word contains a dot, it is tested against the list of abbreviations to see whether the dot is a sentence delimiter or belongs to the token (see Handling abbreviations). Dots that are sentence delimiters are printed out as separate tokens. Otherwise the dot remains attached to the word. If the sentence ends with an abbreviation, an extra dot is added to mark the sentence boundary.
The treatment of numerical expressions differs in that most of the punctuation, such as /, is considered to be part of the token (see Punctuation for details). The dot that follows a number is considered to be a sentence delimiter if the following word starts with a capital letter. Otherwise the dot belongs to the token, which may be an ordinal number, a date, etc.
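The rule for a dot after a number can be sketched as below. The helper names are hypothetical, and the behaviour at the very end of the input (no following word) is an assumption that the text above does not cover.

```perl
use strict;
use warnings;

# Hypothetical helper: does the dot after a number end the sentence?
# $next is the element that follows in the input (undef at the end).
sub number_dot_ends_sentence {
    my ($next) = @_;
    return 1 unless defined $next;    # assumption: the last dot closes the sentence
    return $next =~ /^\p{Lu}/ ? 1 : 0;
}

# Hypothetical helper: split the final dot off a numerical element
# only when it is a sentence delimiter.
sub split_number_element {
    my ($element, $next) = @_;
    my ($number) = $element =~ /^(\d.*)\.$/;
    if (defined $number && number_dot_ends_sentence($next)) {
        return ($number, '.');
    }
    return ($element);
}

print join(' ', split_number_element('2004.', 'Then')), "\n";   # dot split off
print join(' ', split_number_element('1.', 'mai')), "\n";       # dot kept
```

Dots inside the expression, as in 1.4.2004, are never touched; only the final dot is examined.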
There are some constants that affect the preprocessing. Their current values are:
my $MULTIWORD_SIZE = 3;
my $SEPARATE_PUNCT = quotemeta("|{}[]()«»?!;,'\"");
my $CONTAIN_PUNCT = 'ja\/dahje|http|:\/\/';
The first, $MULTIWORD_SIZE, defines the size of a multiword expression, i.e. how many words are included when testing whether an expression is a multiword. The constant $SEPARATE_PUNCT contains all the punctuation that is treated as an individual token without further processing if it occurs at the beginning or at the end of a token. Word tokens that contain punctuation belonging to the token are defined in the constant $CONTAIN_PUNCT. This constant is currently the only language-dependent part of the script.
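As an illustration of how a constant like $SEPARATE_PUNCT can be used, the following Perl sketch interpolates it into a character class and peels the punctuation off both ends of an element. The helper peel_punctuation is hypothetical, and the guillemets are left out of the constant here so that the sketch does not depend on the source encoding.

```perl
use strict;
use warnings;

# ASCII-only variant of the constant quoted above (an assumption made
# for this sketch; the original also contains the guillemets).
my $SEPARATE_PUNCT = quotemeta("|{}[]()?!;,'\"");

# Hypothetical helper: remove separate punctuation from the beginning
# and the end of an element and return the three parts.
sub peel_punctuation {
    my ($element) = @_;
    my (@before, @after);
    push    @before, $1 while $element =~ s/^([$SEPARATE_PUNCT])//;
    unshift @after,  $1 while $element =~ s/([$SEPARATE_PUNCT])$//;
    return (\@before, $element, \@after);
}

my ($lead, $rest, $trail) = peel_punctuation('("guovllut),');
print join(' ', @$lead, $rest, @$trail), "\n";
```

Because quotemeta escapes every character of the string, the constant can be placed inside the brackets of a character class without further care.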
As most of the punctuation in a numerical expression is an inseparable part of it, the numerical expressions are treated as a separate class. Punctuation in other strings generally causes the string to be split up, but not always, depending on the punctuation mark. The punctuation marks are listed in the following table:
Punctuation | Numerical expressions (digits and non-word characters) | Words (word characters, no digits) |
(){}[]" ' | Always their own tokens | Always their own tokens, exception: (?) |
: | Belongs to the expression if not followed by space (14:30, 10:s) | Belongs to the expression if not followed by space (Namdal:s) |
/ | Belongs to the expression (5/2004) | Always its own token (gielddat/guovllut). Exceptions: ja/dahje which is treated as one token; html-addresses and file names. |
-, -- | Belongs to the expression, also when surrounded by spaces (1-3, 1 - 3) | Belongs to the expression, but not when separated by space on either side (Davvi-Norgga, dárogielas ja -kultuvrras). Hyphenated words are a problem. |
!?; | Always its own token | Always its own token |
. | Belongs to the expression unless sentence delimiter | Its own token. Exception: belongs to an abbreviation and does not end the sentence. |
+*= | Belongs to the expression | Always its own token |
% | Belongs to the expression, also when separated by space (50%, 50 %) | Always its own token |
The abbreviations are divided into three classes according to whether they are able to end the sentence or not. The first class contains transitive abbreviations (TRAB), which never end the sentence. The intransitive abbreviations (ITRAB) are considered to end the sentence whenever they are followed by a capital letter or a number. The third class (TRNUMAB) contains abbreviations that end the sentence when followed by a capital letter but not when followed by a number.
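The three classes translate into a small decision function. The Perl sketch below is an illustration under stated assumptions: the table entries are placeholders rather than entries from a real abbr.txt, the function name abbr_ends_sentence is hypothetical, and returning 0 for an unknown abbreviation or a missing following word is a choice made for the sketch only.

```perl
use strict;
use warnings;

# Placeholder class table; the entries are invented for illustration.
my %ABBR_CLASS = (
    'prof' => 'TRAB',      # never ends the sentence
    'etc'  => 'ITRAB',     # ends it before a capital letter or a number
    'no'   => 'TRNUMAB',   # ends it before a capital letter, not before a number
);

# Hypothetical helper: does the dot of $abbr end the sentence, given
# the element $next that follows it?
sub abbr_ends_sentence {
    my ($abbr, $next) = @_;
    my $class = $ABBR_CLASS{$abbr};
    return 0 unless defined $class && defined $next;
    return 0 if $class eq 'TRAB';
    return 1 if $next =~ /^\p{Lu}/;
    return ($class eq 'ITRAB' && $next =~ /^\d/) ? 1 : 0;
}

for my $case (['prof', 'Smith'], ['etc', '5'], ['no', '5'], ['no', 'The']) {
    printf "%-5s %-6s %d\n", @$case, abbr_ends_sentence(@$case);
}
```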
All the abbreviations are extracted from the lexicon files. They are extracted as a part of the compilation process run by make. The relevant commands are stored in the Makefile in the gt/lang/src directory. The script that handles the extraction is called abbr-extract and is located in the gt/script directory; it is intended to be a language-independent script.
The main source is abbr-lang-lex.txt, where the real abbreviations are listed. There are also a number of multi-word expressions which are dispersed across different lexicon files; for example, the file adv-sme-lex.txt contains multi-word adverbials like earret eará. The proper noun lexicon, in turn, contains (at the moment) over 500 multi-word proper nouns.
In order to be able to treat the multi-word expressions as single tokens already at the preprocessing phase, the multi-word expressions are extracted from the lexicon files. The relevant lexicon file names are given to the script abbr-extract in lang/src/Makefile. There are no restrictions on how to list the multiword expressions; they can be used in lexicon files like other tokens. Only the structure of the file abbr-lang-lex.txt is restricted.
Usage of the script abbr-extract:
abbr-extract --output=
The structure of abbr-lang-lex.txt
The structure of abbr-lang-lex.txt is free in all other respects, but the extracted part must have the following syntax:
LEXICON ITRAB
abbr TAG ;
ab.br TAG ;
ab TAG ;
abbr% abbr TAG ;
The same applies to the other abbreviation classes, LEXICON TRAB and LEXICON TRNUMAB. The ordering of the abbreviation classes is free.
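Reading entries in this syntax can be sketched in Perl as follows. This is not the abbr-extract script itself, whose internals are not described here: the function parse_abbr_lexicons is hypothetical, and the sketch assumes one entry per line, with the last word before the semicolon being the continuation class.

```perl
use strict;
use warnings;

# Hypothetical parser for the extracted part of the lexicon file.
# Returns a hash that maps each abbreviation class to its entries.
sub parse_abbr_lexicons {
    my ($text) = @_;
    my (%entries, $class);
    for my $line (split /\n/, $text) {
        if ($line =~ /^LEXICON\s+(\S+)/) {
            my $name = $1;
            # Only the three abbreviation lexicons are of interest.
            $class = $name =~ /^(?:TRAB|ITRAB|TRNUMAB)$/ ? $name : undef;
        }
        elsif (defined $class && $line =~ /^\s*(.+?)\s+\S+\s*;\s*$/) {
            push @{ $entries{$class} }, $1;   # entry without its continuation class
        }
    }
    return %entries;
}

my %found = parse_abbr_lexicons(<<'END');
LEXICON ITRAB
abbr TAG ;
ab.br TAG ;
abbr% abbr TAG ;

LEXICON TRAB
ab TAG ;
END
print "$_: ", join(', ', @{ $found{$_} }), "\n" for sort keys %found;
```

Entries under any other LEXICON heading are skipped, so the rest of the file can stay free in form.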
The abbreviations may contain one or more dots, but because of xfst, where the dot is an operator, the dot that ends the abbreviation must be removed. The pattern matching in the preprocessor is always done without the final dot. It is possible to add the dot to the abbreviation again in the lex-file.
The space in multi-word abbreviations is marked in the lexicons by the %-sign (literal operator), which is recognized and removed by the abbr-extract script.
The abbreviations in abbr-lang-lex.txt are lower-cased unless only an upper-cased version of the abbreviation exists. The abbr-extract script generates upper-case versions of all the abbreviations (they may begin the sentence).
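The two normalisation steps can be sketched in Perl like this. The function abbr_variants is hypothetical, and the sketch assumes that "upper-case version" means a version with an initial capital, as needed at the beginning of a sentence; the real abbr-extract script may do this differently.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one lexicon entry into the forms that are
# written to the abbreviation list.
sub abbr_variants {
    my ($entry) = @_;
    # "%" is the literal operator of the lexicon: drop it and keep the
    # character it protects (typically the space).
    (my $abbr = $entry) =~ s/%(.)/$1/g;
    # Assumption: the generated variant has an initial capital.
    my $capitalised = ucfirst $abbr;
    return $capitalised eq $abbr ? ($abbr) : ($abbr, $capitalised);
}

print join(' | ', abbr_variants('abbr% abbr')), "\n";
print join(' | ', abbr_variants('NRK')), "\n";
```

An entry that already starts with a capital letter yields only one form, so the list does not contain duplicates.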
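min. min
Abbreviations may be followed by a capital letter even when they do not end the sentence:
Siellä oli Pekka ym. Mäkisiä. ("Pekka and other Mäkinens were there.")
Siellä oli Pekka ym . Mäkisiä .
Siellä oli Pekka ym. Mäkinen ei tullut. ("Pekka and others were there. Mäkinen did not come.")
ym . Mäkinen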