/tmp), so the tool has to be called using explicit path names.
The test results were acceptable, but preprocessing caused some problems. The next step is to run the project's own preprocessor on both texts first and then align the texts once they have been split into sentences.
The test corpus was created from a pair of translated texts:
Coahkkingirji_1_02.doc plenum_1_02.doc
The files were first converted to text and then to the project-internal 7-bit encoding.
$ antiword -m UTF-8.txt Coahkkingirji_1_02.doc | utf8-7bit.pl > Coahkkingirji_1_02.txt
$ antiword -m UTF-8.txt plenum_1_02.doc | utf8-7bit.pl > plenum_1_02.txt
The antiword output contained some strange characters, which were cleaned away.
.. some shell commands
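The two conversion commands above differ only in the file name, so they could be generated by a small shell function. The following is a minimal sketch, not part of the project tools: the function name convert_one and the DRY_RUN variable are hypothetical, and antiword and utf8-7bit.pl are assumed to be on the PATH as in the commands above. With DRY_RUN=1 the function only prints the pipeline it would run, which makes it easy to check the file names first.

```shell
#!/bin/sh
# Sketch only: convert_one and DRY_RUN are illustrative names.
# Assumes antiword and utf8-7bit.pl are available on the PATH.

# Convert one .doc file to 7-bit encoded text next to the original.
convert_one() {
    doc=$1
    txt="${doc%.doc}.txt"    # strip the .doc suffix and append .txt
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Dry run: show the pipeline instead of executing it.
        printf '%s\n' "antiword -m UTF-8.txt $doc | utf8-7bit.pl > $txt"
    else
        antiword -m UTF-8.txt "$doc" | utf8-7bit.pl > "$txt"
    fi
}

# Dry run over the two test files; unset DRY_RUN to really convert them.
DRY_RUN=1
for f in Coahkkingirji_1_02.doc plenum_1_02.doc; do
    convert_one "$f"
done
```

The character cleanup that followed the conversion is not covered by this sketch.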
Then the corpus processing tool uplug was used to convert the text to XML.
$ /tmp/uplug-0.1.2/uplug systems/pre/basic -ci 'iso-8859-1' -co 'iso-8859-1' -in Coahkkingirji_1_02.txt > Coahkkingirji_1_02.xml
$ /tmp/uplug-0.1.2/uplug systems/pre/basic -ci 'iso-8859-1' -co 'iso-8859-1' -in plenum_1_02.txt > plenum_1_02.xml
Finally, the texts were aligned using the uplug sentalign tool.
$ /tmp/uplug-0.1.2/uplug systems/align/sent -src Coahkkingirji_1_02.xml -trg plenum_1_02.xml > aligned_1_02.xml
The result can be read with the command:
$ /tmp/uplug-0.1.2/tools/readalign aligned_1_02.xml | less
The UTF-8 version was created with the commands:
$ antiword -m UTF-8.txt Coahkkingirji_1_02.doc > Coahkkingirji_1_02.txt
$ antiword -m UTF-8.txt plenum_1_02.doc > plenum_1_02.txt
$ /tmp/uplug-0.1.2/uplug systems/pre/basic -ci 'utf-8' -in Coahkkingirji_1_02.txt > Coahkkingirji_1_02.xml
$ /tmp/uplug-0.1.2/uplug systems/pre/basic -ci 'utf-8' -in plenum_1_02.txt > plenum_1_02.xml
$ /tmp/uplug-0.1.2/uplug systems/align/sent -src Coahkkingirji_1_02.xml -trg plenum_1_02.xml > aligned_1_02.xml
Saara Huhmarniemi
saara.huhmarniemi@helsinki.fi
Last modified: Mon Sep 27 14:37:32 2004