Parallel text corpora

The plan is to create a parallel text corpus from a set of translated texts. A web interface will allow searching for strings in either language and will return the matching pairs of aligned sentences.

Sentence alignment

Several sentence alignment tools were tested. The most promising was Uplug, a statistical tool that bases its alignment on sentence length and on a comparison of individual words in the texts. The tool is not yet properly installed (it is only in /tmp), so it has to be called with explicit path names.
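To make the sentence-length signal concrete, here is a minimal sketch. It is not Uplug's algorithm: it simply pairs line N of one file with line N of the other and prints the two lengths and their ratio, assuming both files already contain one sentence per line. The function name length_ratios and the file names in the usage comment are hypothetical. A real length-based aligner searches over possible pairings instead of assuming a fixed one.

```shell
# length_ratios SRC TRG
# Illustration only (not Uplug's method): pairs line N of SRC with line N of TRG
# and prints "src_length<TAB>trg_length<TAB>ratio" for each pair.
# Assumes one sentence per line and no tab characters inside the sentences.
# Note: whether awk's length() counts characters or bytes depends on the awk
# implementation and the locale; for 7-bit text the two are the same.
length_ratios() {
    paste "$1" "$2" | awk -F '\t' '{
        s = length($1); t = length($2)
        if (s > 0) printf "%d\t%d\t%.2f\n", s, t, t / s
        else       printf "%d\t%d\t-\n", s, t
    }'
}

# Hypothetical usage, once both texts have been split into sentences:
# length_ratios source_sentences.txt target_sentences.txt
```

Pairs whose ratio is far from the typical value for the language pair are candidates for a closer look, since they may point to a sentence that was split or merged in translation.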

The test results were acceptable, but there were some problems with preprocessing. The next step is to run the project's own preprocessor on both texts first and then align the texts once they have been split into sentences.

The test corpus was created from two translated texts:

Coahkkingirji_1_02.doc
plenum_1_02.doc

The files were first converted to plain text and then to the project-internal 7-bit encoding.

$ antiword -m UTF-8.txt Coahkkingirji_1_02.doc | utf8-7bit.pl > Coahkkingirji_1_02.txt 
$ antiword -m UTF-8.txt plenum_1_02.doc | utf8-7bit.pl > plenum_1_02.txt

The antiword output contained some strange characters, which were cleaned away.
.. some shell commands
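
The exact cleanup commands are not recorded here. As a rough sketch of what such a step could look like, assuming the strange characters were ASCII control characters (an assumption; the text does not say which characters were removed), a tr filter can delete them while keeping tabs and newlines. The function name strip_control_chars is hypothetical.

```shell
# strip_control_chars: hypothetical helper, not the original cleanup.
# Deletes ASCII control characters (octal 000-010, 013-037 and 177) from
# standard input. Tab (011) and newline (012) are kept; carriage returns
# and form feeds are removed. Bytes above 177 are left untouched.
strip_control_chars() {
    tr -d '\000-\010\013-\037\177'
}

# Possible use as an extra step in the conversion pipeline:
# antiword -m UTF-8.txt plenum_1_02.doc | strip_control_chars | utf8-7bit.pl > plenum_1_02.txt
```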

Then the corpus processing tool Uplug was used to convert the text to XML.

$ /tmp/uplug-0.1.2/uplug systems/pre/basic -ci 'iso-8859-1' -co 'iso-8859-1' -in Coahkkingirji_1_02.txt > Coahkkingirji_1_02.xml
$ /tmp/uplug-0.1.2/uplug systems/pre/basic -ci 'iso-8859-1' -co 'iso-8859-1' -in plenum_1_02.txt > plenum_1_02.xml

Finally, the texts were aligned using the Uplug sentence alignment tool.

$ /tmp/uplug-0.1.2/uplug systems/align/sent -src Coahkkingirji_1_02.xml -trg plenum_1_02.xml > aligned_1_02.xml

The result can be read with the command:

$ /tmp/uplug-0.1.2/tools/readalign aligned_1_02.xml | less

The UTF-8 version was created with the following commands:

$ antiword -m UTF-8.txt Coahkkingirji_1_02.doc > Coahkkingirji_1_02.txt 
$ antiword -m UTF-8.txt plenum_1_02.doc > plenum_1_02.txt
$ /tmp/uplug-0.1.2/uplug systems/pre/basic -ci 'utf-8' -in Coahkkingirji_1_02.txt > Coahkkingirji_1_02.xml
$ /tmp/uplug-0.1.2/uplug systems/pre/basic -ci 'utf-8' -in plenum_1_02.txt > plenum_1_02.xml
$ /tmp/uplug-0.1.2/uplug systems/align/sent -src Coahkkingirji_1_02.xml -trg plenum_1_02.xml > aligned_1_02.xml
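
Since every step uses the same explicit path, the UTF-8 steps above can be collected into one small script. The sketch below only prints the commands for one document pair instead of running them, so they can be checked before use. The function name print_align_commands and the variable UPLUG_HOME are assumptions made for illustration; the default path is the temporary install location mentioned earlier.

```shell
# UPLUG_HOME is an assumed variable name. It defaults to the temporary
# install location used in the commands above.
UPLUG_HOME=${UPLUG_HOME:-/tmp/uplug-0.1.2}

# print_align_commands SRC TRG OUT
# Prints (does not run) the conversion, preprocessing and alignment commands
# for one document pair. SRC and TRG are file names without the .doc
# extension; OUT is the name of the alignment file without .xml.
print_align_commands() {
    src=$1
    trg=$2
    out=$3
    for name in "$src" "$trg"; do
        echo "antiword -m UTF-8.txt $name.doc > $name.txt"
        echo "$UPLUG_HOME/uplug systems/pre/basic -ci 'utf-8' -in $name.txt > $name.xml"
    done
    echo "$UPLUG_HOME/uplug systems/align/sent -src $src.xml -trg $trg.xml > $out.xml"
}

# Example:
# print_align_commands Coahkkingirji_1_02 plenum_1_02 aligned_1_02
```

Piping the printed commands to sh would run them. Once the tool is properly installed, only UPLUG_HOME needs to change.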

Saara Huhmarniemi saara.huhmarniemi@helsinki.fi
Last modified: Mon Sep 27 14:37:32 2004