Preprocessing the input
- Introduction
- Tokenizing
- The tokeinizer file
- handling abbreviations
Within the Xerox framework, this is done with the
tokenizetool. The code itself is written as a set of regular
expressions, and the source file (tok.txt) is compiled by xfst.
In the present project, we have temporarily (spring 2004) abandoned
this preprocessor, and replaced it with a perl-based preprocessor (see
the documentation for the file preprocess. The main reason why we
abandoned the preprocessor that we document here is that its
compilation time ecxeeded half an hour on cochise, and several hours
on local machines. The reason why compilaton time exploded was our use
of the Replace operator @-> (see below), it looks for the longest
match, which takes time.
When the abbreviation list was part of the tok.txt file, it meant a
pause of more than half an hour for every added abbreviation or
multiword expression. Thus, we moved to a non-compiled version, perl,
during the developmental phase. In a stable, finished parser, tokenize
is probably faster than perl, and we should consider migrating
back. This documentation will be important if and when we migrate
back, but also in the meantime it contains program-independent
documentation on abbreviation handling which deserves to be browsed
through.
The starting point for the preprocessor was the tok1.fst preprocessor
file, written by Anne Schiller, and printed in the Karttunen/Beesley
book (cf. the first cvs versions of the tok.txt file). This file has
been revised several times. The leading idea behind the file is the
following:
The tokenizer has two purposes: It cuts text into sentences, and it
cuts sentences into words. Thus, symbols that are not letters or
numbers are separated from words and numbers. Sentence delimiter
symbols (.?!) are treated as separate tokens. In the morphological
parser itself, these symbols are given the tag '+CLB', for clause
boundary.
The file is an xfst source file. It defines sets, joins them together
as either words, symbols, abbreviations, initials, or numerals (all
being referred to by the variable 'Token'. Then, a newline (NL) is
introduced after each token (Token + NL is called TOK1), all spaces
are replaced by newlines (TOK2). The abbreviations get a separate
treatment, as described in the next sentence. At the end of the
tok.txt file, the different token types (TOK1, TOK1, and the different
classes of abbreviations are composed together into one regular
expression.
The challenge is to handle abbreviations, like e.g. this one. Even
though e.g. contains a final period, it shall not end a sentence. Then
there are other abbreviations, like "Ltd.", that may end
sentences. The preprocessor thus divides the abbreviations in 4
different groups, according to whether they take objects or not
(i.e. according to whether there is an obligatory word or numeral
following them or not):
- INTRANSABBR (abbr that do not take an object: 'Lloyds Ltd.')
- INTRANSNUMABBR (abbr that take CAP objects: '')
- INTRANSCAPABBR (abbr that take NUM objects: 'Downing Str. 10')
- TRANSABBR (abbr. that do take an object: 'Mr. Peters')
Here is the rule set that lies behind the treatment of abbreviations:
- WORD + dot + space -> sentence boundary
When there is a listed abbreviation + space, it is detected what
follows after the space. There are three possibilities: there can be
a small letter, an capital letter or a number:
- Any abbreviation + space + small letter ->
no sentence boundary
Thus, all groups: TRANSABBR + INTRANSNUMABBR + INTRANSCAPABBR +
INTRANSABBR. When a word starting with a small letter follows the
abbreviation, all listed abbreviations are treated as
transitive, i.e. no sentence boundary is inserted.
- Certain abbreviations + space + capital
letter -> no sentence boundary (TRANSABBR + INTRANSNUMABBR)
All other abbreviations (complement of "certain") + space + capital
letter -> sentence boundary (INTRANSCAPABBR + INTRANSABBR)
- Certain abbreviations + space + number -> no
sentence boundary (TRANSABBR + INTRANSCAPABBR)
All other abbreviations (complement of "certain") + space + number ->
sentence boundary
(from INTRANSNUMABBR + INTRANSABBR)
We thus have four groups:
- A list of abbreviations that never end a sentence (A set
called TRANSABBR)
- A list of abbreviations that do not end a sentence if followed
by any letter. These abbreviations do end a sentence if followed by a
number (A set called INTRANSNUMABBR)
- A list of abbreviations that do not end a sentence if followed by a
number or a small letter. These abbreviations do end a sentence
if followed by a capital letter. (A set called
INTRANSCAPABBR)
- A list of abbreviations end a sentence if followed by a capital
letter or a number. These do no end a sentence if followed by a small
case letter. (A set called INTRANSABBR)
In other words:
TRANSABBR / INTRANSNUMABBR / INTRANSCAPABBR / INTRANSABBR +
small -> no sentence boundary
TRANSABBR + capital -> no sentence boundary
TRANSABBR + number -> no sentence boundary
INTRANSNUMABBR + capital -> no sentence boundary
INTRANSNUMABBR + number -> a sentence boundary
INTRANSCAPABBR + capital -> a sentence boundary
INTRANSCAPABBR + number -> no sentence boundary
INTRANSABBR + capital -> a sentence boundary
INTRANSABBR + number -> a sentence boundary
It is better to have too few sentence boundaries than too many. Of the
4 sets listed above, the first invokes no sentence boundaries, and the
following ones invoke an increasing amount of them. Thus, when in
doubt, put the abbreviation in question in the sets as follows:
TRANSABBR is better than INTRANSNUMABBR, which is
better than INTRANSCAPABBR, which is better than
INTRANSABBR.
Examples
Jeg kjøpte epler. de var dyre. A sentence boundary, rule 1. A
sentence beginning with a small letter will be found in the
grammar checker.
Siv.ing. Pia Aho stakk innom. No sentence boundary, rule 3.
Siv.ing. og kunstner Pia Aho stakk innom. No sentence boundary,
rule 2.
Dette er Pia Aho, siv.ing. Hun kjøper ost. No sentence boundary,
rule 3. This is where there is a sentence boundary (in real life),
but
we have to trust probabilities: it is likely that the first type of
example is more common than this type of example. If this
hypothesis
turns out to be wrong, abbreviations can be moved into a more
proper
set. The point is that we need to have this kind of an option.
F.eks. i Europa kjøpes det mye ost. No sentence boundary, rule 3.
F.eks. kunstnerne kjøper mye ost. No sentence boundary, rule 2.
Kjøp f.eks. 12 epler. No sentence boundary, rule 4.
Jeg kjøper ost o.a. godt. No sentence boundary, rule 2.
Jeg kjøper ost o.a. Ta med litt melk også. A sentence boundary,
rule 3
Jeg kjøpte epler, pærer, osv. Jeg tok også med meg
melk. A sentence boundary, rule 3.
Kjøp epler, pærer, osv. ta med melk også. No
sentence boundary, rule 2.
Kjøp epler, pærer, osv. og ta også med melk. No sentence
boundary, rule 2.
Også § 9, 3. avsn. nevner denne saka. No sentence boundary, rule 2.
Les også § 9, 3. avsn. Der tar man opp denne saka. A sentence
boundary,
rule 3.
I Trosterudveien 6, leilighet nr. 7 bor det en mann. No sentence
boundary, rule 4.
Jeg trekker lodd i Lotto. I fjor trakk jeg 110 000 nr. Året før
trakk
jeg bare 18 000 nr. og det første året mitt 500 nr. Two cases: A
sentence
boundary, rule 3, and no sentence boundary, rule 2.
Trond Trosterud
Last modified: Mon Nov 1 21:34:08 2004