The scripts have different functions. Some scripts convert input text to the internal format used by the program, whereas other scripts convert the output of the program into a format suitable for output.
Note that the Unix utility iconv contains ready-made conversion routines for many code tables. The syntax is as follows:
$ iconv --from-code=ISO-8859-1 --to-code=UTF-8 < old_file > new_file
The available code tables are listed with iconv --list. This of course does not help in converting text to our internal format, but in the future it may be used for conversion to UTF-8.
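As a small illustration of the iconv call above, the following shell sketch wraps it in a function and converts a one-line Latin-1 sample. The function name to_utf8 and the sample file names are made up for this example; only the iconv options are taken from the command shown above.

```shell
#!/bin/sh
# Hypothetical helper: convert a file from a named code table to UTF-8.
# Usage: to_utf8 <from-code> <input file> <output file>
to_utf8() {
    iconv --from-code="$1" --to-code=UTF-8 < "$2" > "$3"
}

# Build a small Latin-1 sample: byte 0xE5 (octal 345) is "å" in ISO-8859-1.
sample_dir=$(mktemp -d)
printf 'bl\345b\346r\n' > "$sample_dir/old_file"

to_utf8 ISO-8859-1 "$sample_dir/old_file" "$sample_dir/new_file"

# Dump the converted bytes; "å" is now the two-byte sequence c3 a5.
od -An -tx1 "$sample_dir/new_file"
```

Dumping the bytes with od avoids depending on the encoding of the terminal when checking the result.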
At the moment there are scripts for converting from ws2, Latin6 and mac. The mac encoding is here called "linmac", since mac files are translated to something else when they are moved to Linux. That "something else" (mac as observed on Linux) is what we call "linmac", and it is taken as the starting point for the conversion script.
Note that there are two different scripts, utf8-7bit.pl and utf8.pl. The former converts from utf8 to 7bit. The latter is an all-in-one script that converts from several formats (mac saved as utf8, text written on Win9x saved as utf8, etc.) to 7bit. Testing is needed to see whether this division is relevant. In any case, utf8-7bit.pl works when the input has not been corrupted, i.e. when it is given real utf8 as input.
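Since utf8-7bit.pl expects real utf8, it can help to check a file before converting it. The sketch below is one possible check, not part of the existing scripts: it relies on iconv exiting with a non-zero status when it meets an invalid byte sequence. The function name is_real_utf8 and the sample files are made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file is well-formed UTF-8.
# iconv exits with a non-zero status on an invalid byte sequence.
is_real_utf8() {
    iconv --from-code=UTF-8 --to-code=UTF-8 < "$1" > /dev/null 2>&1
}

check_dir=$(mktemp -d)
printf 'bl\303\245b\303\246r\n' > "$check_dir/good.txt"   # valid UTF-8
printf 'bl\345b\346r\n' > "$check_dir/bad.txt"            # Latin-1 bytes, not UTF-8

for f in "$check_dir/good.txt" "$check_dir/bad.txt"; do
    if is_real_utf8 "$f"; then
        echo "$f: well-formed UTF-8"
    else
        echo "$f: not well-formed UTF-8"
    fi
done
```

This only detects broken byte sequences. A file that was wrongly converted and then saved as utf8 (for instance mac saved as utf8) is still well-formed and will pass the check, so a passing result is necessary but not sufficient.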
The <encoding>-7bit.regex files define the conversion from the given encoding to the internal format.
The 7bit-<encoding>.regex files define the conversion from the internal format to the given encoding.
To make use of the .regex files you may have to compile them to .fst files. Go to the script directory and compare the .regex and .fst files. If the .regex file is older than the .fst file with the same name, you can use the .fst file right away and do not need to compile. If the .fst file is older or does not exist, you must compile it. To do that, type the following command while in the script directory:
make all
cat <<encoding>-filename> | lookup -flags mbTT -f <encoding>-7bit.fst
The lookup command converts a file from a given encoding to the internal format.
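The timestamp comparison between .regex and .fst files described above can also be done by the shell. The following is a minimal sketch, assuming each .fst file is built from the .regex file with the same base name; the function name needs_compile and the empty stand-in files are made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the .fst file must be (re)compiled,
# i.e. it is missing or older than the .regex file it is built from.
needs_compile() {
    regex_file=$1
    fst_file=$2
    [ ! -e "$fst_file" ] || [ "$regex_file" -nt "$fst_file" ]
}

# Demonstration with empty stand-in files and fixed timestamps.
demo_dir=$(mktemp -d)
touch -t 202001010000 "$demo_dir/ws2-7bit.regex"
touch -t 202101010000 "$demo_dir/ws2-7bit.fst"      # newer than its .regex
touch -t 202201010000 "$demo_dir/7bit-ws2.regex"    # no .fst yet

for regex in "$demo_dir"/*.regex; do
    fst="${regex%.regex}.fst"
    if needs_compile "$regex" "$fst"; then
        echo "$(basename "$fst"): compile (run make all)"
    else
        echo "$(basename "$fst"): up to date"
    fi
done
```

make performs the same kind of timestamp comparison itself when the Makefile declares the .regex file as a prerequisite of the .fst file, so this check is mainly useful for seeing in advance what make all would rebuild.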
In order to analyze a file in a given encoding, go to the gt/sme directory. To analyze a file in the ws2 encoding (also known as levi or WinSam2), type the command:
cat| preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -f ws2-sme | less
When this command is executed, the input is first tokenized, then converted to the internal format and analyzed. The output is in the same encoding as the input.
XXX But is this a good idea? We must evaluate this. It is hard to see why anyone would want their input back in ws2 on a Linux terminal. Tests on input are needed here. TT
The lookup script ws2-sme has this content (the file format is documented in Beesley/Karttunen, p. 442):
sme sme.fst
fws2 ../script/ws2-7bit.fst
tws2 ../script/7bit-ws2.fst
fws2 sme tws2
This script converts the input from the ws2 encoding to the internal format. The input is then analyzed with the sme.fst file, and the result is converted back to the ws2 encoding.
The other <encoding>-sme files follow the same pattern.
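The shared pattern can be made explicit with a short shell sketch that prints a lookup script for a given encoding name. This is only an illustration: the function name make_lookup_script is made up, and the layout with one name/file pair per line followed by the cascade line is an assumption about the file format, which should be checked against Beesley/Karttunen.

```shell
#!/bin/sh
# Hypothetical helper: print a lookup script for the given encoding,
# following the pattern of the ws2-sme file.
# Assumption: one "name file" pair per line, then the cascade line.
make_lookup_script() {
    enc=$1
    printf 'sme sme.fst\n'
    printf 'f%s ../script/%s-7bit.fst\n' "$enc" "$enc"
    printf 't%s ../script/7bit-%s.fst\n' "$enc" "$enc"
    printf 'f%s sme t%s\n' "$enc" "$enc"
}

# Reproduce the ws2 case on standard output.
make_lookup_script ws2
```

For any other encoding name, the corresponding <encoding>-7bit.fst and 7bit-<encoding>.fst files must of course exist in the script directory for the generated file to be usable.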
... | lookup -flags mbTT -f src/cap-sme | ...
Note that the lookup script file is located in sme/src, but the binary allcaps.fst that the cap-sme file refers to is located in sme/bin.
Or should this be documented on the web interface page?
The script pdfto7bit.pl converts pdf files to 7bit. It is used like this:
pdfto7bit.pl [option] <filename>
The options allowed are:
To use it, the gt/script directory must be in your path. Type this at the command prompt:
PATH="$HOME/[path to the gt directory]/gt/script:$PATH"
After this you can type "pdfto7bit.pl" at the command prompt to use it. Typical uses are shown below.
for pdffile in [directory of pdf files]/*.pdf
do
    pdfto7bit.pl "$pdffile" > [directory of text files]/`basename "$pdffile" .pdf`.txt
done
This command takes a batch of pdf files, converts them to text files and saves them in a given directory. The expression `basename $pdffile .pdf`.txt ensures that a pdf file named foo.pdf is saved as foo.txt.
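The naming step can be tried on its own, without any pdf files at hand. The following sketch isolates it in a function; the function name txt_path_for and the directory names are made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: map a pdf path to the matching .txt path
# in a target directory, using basename to strip the directory and suffix.
# Usage: txt_path_for <pdf file> <directory of text files>
txt_path_for() {
    printf '%s/%s.txt\n' "$2" "$(basename "$1" .pdf)"
}

# Prints /some/text/dir/foo.txt
txt_path_for /some/pdf/dir/foo.pdf /some/text/dir
```

basename only removes the suffix when it matches exactly, so a file ending in .PDF would keep its extension and be saved as foo.PDF.txt.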
pdfto7bit.pl -e <name of offending file>