The conversion scripts are located in gt/script. They are of two types: perl scripts (*.pl) and xfst scripts. The xfst scripts are compiled: filename.regex is the source file, and filename.fst the compiled binary.
The scripts have different functions. Some scripts convert input text to the internal format used by the program, whereas other scripts convert the output of the program into a format suitable for output.
Note that the unix utility iconv contains ready-made conversion routines for many code tables. The syntax is as follows:
$ iconv --from-code=ISO-8859-1 --to-code=UTF-8 < old_file > new_file
A list of the supported code tables is printed by iconv --list. This of course does not help in converting text to our internal format, but in the future it may be used for conversion to utf-8.
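As a concrete illustration (the byte values here follow the standard ISO-8859-1 and UTF-8 tables, they are not taken from our corpus), the Latin 1 byte E1 ("á") comes out as the two-byte UTF-8 sequence C3 A1:

```shell
# convert a small Latin 1 string to UTF-8; 0xE1 is "á" in ISO-8859-1
printf 'a\341' | iconv --from-code=ISO-8859-1 --to-code=UTF-8
# the output is "aá" in UTF-8, i.e. the bytes 61 C3 A1
```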
The scripts are named "sourceform-targetform.scripttype". The perl script converting Latin 6 input to the internal 7-bit digraph system (á, c1, d1, n1, s1, t1, z1) is called latin6-7bit.pl.
There are at the moment scripts for converting from ws2, Latin 6 and mac. The mac encoding is here called "linmac", since mac files are translated to something else when they are moved to Linux; that "something else" (mac as observed on Linux) is taken as the starting point for the conversion script.
The perl scripts contain conversion lines of the format s/\273/t1/g. This line converts a t-stroke to t1. The code position (in the code table Latin 6, used among others by Statens Kartverk) is hexadecimal BB. The scripts use octal notation, and the octal value of BB is 273.
Note that there are two different scripts, utf8-7bit.pl and utf8.pl. The former converts from utf8 to 7bit; the latter is an all-in-one script that converts from different formats (mac saved as utf8, text written on Win9x saved as utf8, etc.) to 7-bit. Testing is needed to see whether this partition is relevant. In any case, utf8-7bit.pl works in cases where the input has not been corrupted, i.e. it takes real utf8 as input.
The <encoding>-7bit.regex files are files that convert from the given encoding to the internal format.
The 7bit-<encoding>.regex files are files that convert to the given encoding from the internal format.
To make use of the .regex files you may have to compile them to .fst files. Go to the script directory and have a look at the .regex and .fst files. If the .regex file is older than the .fst file with the same name, the .fst file is up to date, and you do not need to compile. If the .fst file is older or does not exist, you must compile it. Do that by typing the following command while in the script directory:
make all
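The timestamp comparison that make performs can be illustrated with the shell's -nt ("newer than") test; the file names below are made up for the illustration:

```shell
# simulate a source file that is newer than its compiled counterpart
touch dummy.fst
sleep 1
touch dummy.regex
if [ dummy.regex -nt dummy.fst ]; then
    echo "recompile needed"   # this branch is taken
fi
```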
In order to convert from encoding X to internal format, be in the script directory, and type the following command:
cat <<encoding>-filename> | lookup -flags mbTT -f <encoding>-7bit.fst
This will convert a file from the given encoding to the internal format.
In order to analyze a file in a given encoding, go to the gt/sme directory. To analyze a file in the ws2 (a.k.a. levi, WinSam2) encoding, type the command
cat <filename> | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -f ws2-sme | less
Upon executing this command, the input file is first tokenized, then converted to the internal format and analyzed; the output is delivered in the same encoding as the input file.
XXX But is this a good idea? We must evaluate this. It is hard to see how anyone would want his input back in ws2 on a Linux terminal. Tests on input are needed here.
The lookup script file ws2-sme has this content (the file format is documented in Beesley/Karttunen, p. 442):
sme sme.fst
fws2 ../script/ws2-7bit.fst
tws2 ../script/7bit-ws2.fst
fws2 sme tws2
This file converts the input from the ws2 encoding to the internal format. The input will then be analyzed with the sme.fst file and the result is converted back to the ws2 format.
The other <encoding>-sme files follow the same pattern.
The most important case-conversion scripts are case.regex (caseconv.fst). They differ from language to language, and are located in the language-specific directories. They form an integrated part of the Makefiles, and the resulting parsers are able to recognise words with initial capital letters.
There are also scripts to allow for words written in all caps, called allcaps.regex. With the help of such scripts, "Duodji" is accepted, as is "DUODJI", but "DuoDji" is not. These scripts are also located in the src directories (so far only for sme), and are integrated in the Makefile. However, the resulting allcaps.fst is not compiled together with sme.fst into a single transducer, as this would have resulted in too large a network. Instead, it is kept separate in the sme/bin directory, and when needed it may be invoked by the following command (assuming you stand in gt/sme):
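The idea can be sketched with an ordinary regular expression (this is only an illustration of the principle, restricted to ASCII; the real allcaps.regex is an xfst script composed with the lexicon):

```shell
# accept initial-capital or all-caps forms, reject mixed case
accepts() {
    printf '%s\n' "$1" | grep -Eq '^([A-Z][a-z]+|[A-Z]+)$' \
        && echo "accepted" || echo "rejected"
}
accepts Duodji   # accepted
accepts DUODJI   # accepted
accepts DuoDji   # rejected
```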
... | lookup -flags mbTT -f src/cap-sme | ...
Note that the lookup script file is located in sme/src, but the binary allcaps.fst that the cap-sme file refers to, is located in sme/bin.
Southern and Lule Sámi have scripts to allow for different practices for writing ï (as ï or i) and for the Norwegian/Swedish æ, ø vs. ä, ö mix. These are xfst scripts, integrated in the makefiles of sma and smj.
Børre?
or should this be documented on the webinterface page?
The script pdfto7bit.pl converts pdf files to 7bit. It is used like this:
pdfto7bit.pl [option] <filename>
The options allowed are:
To use it you must have the gt/script catalog in your path. Type this at the command prompt:
PATH="$HOME/[path to the gt directory]/gt/script:$PATH"
After this you can type "pdfto7bit.pl" at the command prompt to use it. Typical uses are shown below.
for pdffile in [directory of pdf files]/*.pdf
do
    pdfto7bit.pl $pdffile > [directory of text files]/`basename $pdffile .pdf`.txt
done
This command takes a batch of pdf files, converts them to text files and saves them in the given directory. The expression `basename $pdffile .pdf`.txt ensures that a pdf file named foo.pdf is saved as foo.txt.
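The effect of the basename call can be checked in isolation:

```shell
# basename strips the directory part and the given suffix
basename /some/dir/foo.pdf .pdf
# prints "foo"
```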
pdfto7bit.pl -e <name of offending file>
Last modified: Fri Aug 16 03:45:37 CEST 2002