The list of foreign words
Incoming text contains many foreign words. Used in isolation, as
spontaneous loans, they should be dealt with by a POS guesser. Chunks
of text in foreign languages are noise, though, and a good corpus
should mark such chunks with XML tags
(<foreign></foreign>, etc.). Until that is in place, and
while we are developing our parser, we use a stoplist of foreign words. The
list was made in the following way:
- Large lists of Norwegian, Swedish, Danish, Finnish and English
words were sorted into one list, called old-foreign.txt. The list was then doubled by adding
a copy of every word with a capitalised initial letter (using case.regex made
compilation take too long).
- The list was run through sme.fst, and the overlapping words
(abonnere, adagio, Adam, addere, etc.) were removed.
- In addition, a file new-foreign.txt was added to
CVS, containing non-Sámi words from our corpus files.
- Each of these files was compiled into an fst file. The union of the two was then saved as one binary file, foreign.fst.
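The list preparation steps above can be sketched with standard text tools. This is only an illustration, not the procedure that was actually used: the function names are hypothetical, and the overlap with Sámi is read from a plain word file here, whereas the real list was checked against sme.fst.

```shell
# Sketch only: plain-text stand-in for the list preparation steps.
# All function names and input files are hypothetical examples.

# Merge several word lists (one word per line) into one sorted,
# duplicate-free list.
merge_wordlists() {
    LC_ALL=C sort -u "$@"
}

# Emit every word twice: as given, and with an upper-case first letter.
# Caveat: depending on the awk implementation and locale, toupper()
# may only change ASCII letters.
add_capitalised() {
    awk '{ print; print toupper(substr($0, 1, 1)) substr($0, 2) }' |
        LC_ALL=C sort -u
}

# Drop every word that also occurs in a file of known Sámi word forms.
# $1: file with one Sámi word form per line (assumed to exist already)
remove_overlap() {
    grep -vxF -f "$1"
}
```

A possible call, with made-up file names, would be `merge_wordlists nob.txt swe.txt | add_capitalised | remove_overlap sami-forms.txt > old-foreign.txt`.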
The compilation is included in the Makefile. The source files are in
the gt/script directory, whereas the binary files are in the gt/sme/bin
directory. Only foreign.fst should be used; the other two are intermediate files.
foreign.fst should be used as follows: When investigating Sámi
words that the parser cannot cope with, foreign words are just
noise. They can be removed with this command line:
cat text | preprocess ... | lookup -flags mbTT sme.fst | grep '\?' |
cut -f1 | lookup -flags mbTT foreign.fst | grep '\?' | cut -f1 | ...
The output now contains only the words that are neither recognised by the parser nor part of the stoplist.
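Each "grep, then cut" stage in the command line keeps the unanalysed words and throws away the rest. The sketch below shows one such stage in isolation. It assumes, as the command line implies, that the lookup output is tab-separated, has the input word in the first column, and carries a question mark on lines without an analysis. The function name and the sample lines are made up.

```shell
# Sketch: what one "grep | cut" stage of the pipeline does.
# Assumption: tab-separated lookup output, input word in column 1,
# and a question mark on the line when no analysis was found.

# Print the first column of every line that contains a literal "?".
# -F makes grep treat "?" as a plain character, not a regex operator.
unknown_words() {
    grep -F '?' | cut -f1
}

# Made-up sample input: two words, the second one unanalysed.
printf 'viessu\tviessu+N+Sg+Nom\nadagio\tadagio\t+?\n' | unknown_words
# prints: adagio
```

Running the same filter once after sme.fst and once after foreign.fst leaves the words that neither transducer knows.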
The list of foreign words was split in two because compilation time for
the whole list is very long. The intention with the split is that
old-foreign.txt should be left alone. All additional words
should be added to the shorter new-foreign.txt file. If this
file becomes too long, its contents may be moved over to old-foreign.txt.
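Adding a word to the stoplist could be done with a small helper along the following lines. This is a hypothetical convenience function, not part of the existing scripts. It appends to new-foreign.txt only when the word is missing from both source lists, and it never touches old-foreign.txt.

```shell
# Sketch: append a word to the short list unless it is already known.
# Hypothetical helper; the file paths are passed in by the caller.
# $1: word to add
# $2: path to old-foreign.txt (read only)
# $3: path to new-foreign.txt (appended to)
add_foreign_word() {
    afw_word=$1
    afw_old=$2
    afw_new=$3
    # -x: match whole lines only, -F: fixed string, -q: no output.
    if grep -qxF -e "$afw_word" "$afw_old" "$afw_new" 2>/dev/null; then
        return 0    # already listed, nothing to do
    fi
    printf '%s\n' "$afw_word" >> "$afw_new"
}
```

For example, `add_foreign_word weekend old-foreign.txt new-foreign.txt` adds "weekend" once and is a no-op on later calls. foreign.fst still has to be recompiled through the Makefile afterwards.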
Trond Trosterud trond.trosterud@hum.uit.no
Last modified: Tue Nov 9 22:27:35 2004