Building Instructions
*********************

This document describes how to build the linguistic transducers, lists the
prerequisite software, and then gives a more elaborate walk-through of the
build process.

Basic building
==============

Briefly, the shell command:

make GTLANG=xxx

(where GTLANG signifies the language to build, and xxx is the three-letter ISO
639-2 code for the target language)

will produce the default set of transducers for the language xxx. The list of
supported languages equals the list of language sub-directories found in this
directory.
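
As a rough illustration, the supported languages could be listed with a small
shell helper like the one below. The function name list_languages and the rule
that a language directory has a three-letter lowercase name are assumptions
made for this sketch; they are not taken from the Makefile.

```shell
# list_languages [DIR]: print the names of the sub-directories of DIR
# (default: the current directory) that look like language codes.
# Assumption: language directories have three-letter lowercase names, so
# other directories such as common/ are skipped.
list_languages() {
    dir=${1:-.}
    for d in "$dir"/[a-z][a-z][a-z]/; do
        # If the pattern matched nothing it is left as-is; skip it.
        [ -d "$d" ] || continue
        basename "$d"
    done
}
```

Running list_languages in this directory prints one candidate value for
GTLANG per line.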

This build command can be further targeted with specific build targets as listed
below (for a full list, see the file Makefile).

General targets:
----------------

hfst - make the default set of transducers using the HFST tool set, producing
       transducers compatible with the HFST runtimes.

fst  - make only the regular analysing transducer, not the other ones
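
A target is given on the same command line as the language, for instance
make GTLANG=sme fst. The wrapper below is only a sketch of that usage: the name
build_lang is hypothetical, and the check that a language sub-directory exists
is an added convenience, not something the Makefile is known to do.

```shell
# build_lang LANG [TARGET...]: run make for one language, optionally with
# one or more of the targets listed above (hypothetical helper).
build_lang() {
    if [ "$#" -lt 1 ]; then
        echo "usage: build_lang LANG [TARGET...]" >&2
        return 2
    fi
    lang=$1
    shift
    # Assumption: every supported language has a sub-directory of its own.
    if [ ! -d "$lang" ]; then
        echo "no language directory '$lang' found" >&2
        return 1
    fi
    make GTLANG="$lang" "$@"
}

# Usage:
#   build_lang sme        # default set of transducers
#   build_lang sme fst    # only the regular analysing transducer
#   build_lang sme hfst   # build with the HFST tool set
```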

Prerequisite software:
======================

To build the default transducers, one needs the Xerox FST tools, available at
the site http://fsmbook.com/. The tools are closed-source, but available free of
charge at that site by accepting the proposed license.

To build using the alternate, open-source HFST tool set, you need to download
and install the HFST tools. More information can be found at http://hfst.sf.net/

By default, binary forms of the VISLCG3 grammars are compiled as well, although
these are not needed if all you want to do is morphological analysis. To be
able to disambiguate, or just to avoid error messages during the build, please
install VISLCG3 too. Instructions can be found at
http://beta.visl.sdu.dk/cg3.html.
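
Before building, it can help to check that the required commands are on the
PATH. The following is a minimal sketch; check_tools is a hypothetical helper,
and the tool names in the usage comment are the commands that appear in the
walk-through below, plus vislcg3, which is assumed here to be the name of the
VISLCG3 executable.

```shell
# check_tools NAME...: report every command that cannot be found on PATH.
# Returns 0 if all commands were found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_tools xfst lexc twolc m4 vislcg3 || echo "install the tools first"
```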

Build walk-through
==================

The following is a step-by-step walk-through of the build process, based on how
it is done using the new build strategy for the Xerox tools (see the NEW-*
targets in the Makefile). Following that is a short walk-through of the HFST
build process. At present the HFST build is a lot simpler than the Xerox one,
partly because it is not fully developed yet. Ideally, the two build processes
should mirror each other.

In the walk-through, all file names use sme as the language. In real usage, sme
should be replaced with the value of the $(GTLANG) variable.
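
The substitution can be sketched for the first step of the Xerox walk-through
below, with the language passed as a parameter. The function name
make_propernoun_lex is hypothetical; the file names follow the sme example
with the language code swapped for a variable.

```shell
# make_propernoun_lex LANG: concatenate the proper noun morphology and
# lexicon files of LANG into one temporary lexc file (hypothetical helper).
make_propernoun_lex() {
    lang=$1
    cat "$lang/src/propernoun-$lang-morph.txt" \
        "$lang/src/propernoun-$lang-lex.txt" \
        > "$lang/src/propernoun-$lang-lex-tmp.txt"
}

# Usage:
#   make_propernoun_lex sme
```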

Xerox
-----

# Build a sme-name lexc file (this is done because some of the names are shared
# between sme, smj and sma):
cat sme/src/propernoun-sme-morph.txt sme/src/propernoun-sme-lex.txt > sme/src/propernoun-sme-lex-tmp.txt

# cat all lexc source files, and compile them using xfst:
cat \
    sme/src/sme-lex.txt \
    sme/src/verb-sme-lex.txt \
    sme/src/pp-sme-lex.txt \
    sme/src/pronoun-sme-lex.txt \
    sme/src/interjection-sme-lex.txt \
    sme/src/conjunction-sme-lex.txt \
    sme/src/subjunction-sme-lex.txt \
    sme/src/particle-sme-lex.txt \
    sme/src/noun-sme-lex.txt \
    sme/src/numeral-sme-lex.txt \
    sme/src/adj-sme-lex.txt \
    sme/src/adv-sme-lex.txt \
    sme/src/punct-sme-lex.txt \
    sme/src/acro-sme-lex.txt \
    sme/src/abbr-sme-lex.txt \
    sme/src/propernoun-sme-lex-tmp.txt \
    > sme/int/all-sme-lex.txt
xfst -e "read lexc sme/int/all-sme-lex.txt" \
			-e "compose net" \
			-e "save stack sme/bin/NEW-sme-lexc.save " \
			-stop

# Build the normative twolc file using M4 and twolc, and remove the script file:
m4  sme/src/twol-sme.txt > tmp/twol-sme-tmp.txt
printf "read-grammar tmp/twol-sme-tmp.txt \n\
	compile \n\
	save-binary sme/bin/twol-sme.bin \n\
	quit \n" > tmp/twol-script-sme
twolc < tmp/twol-script-sme
rm -f tmp/twol-script-sme

# Using LEXC, we compose the compiled twolc rules with the lexc transducer:
# (script file needed - deleted afterwards)
printf "read-source sme/bin/NEW-sme-lexc.save \n\
	read-rules sme/bin/twol-sme.bin \n\
	compose-result \n\
	save-result sme/bin/NEW-sme-twolc.save \n\
	quit \n" > tmp/NEW-save-script-sme
lexc < tmp/NEW-save-script-sme
rm -f tmp/NEW-save-script-sme
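
The two script-file steps above write the commands to a temporary file only to
feed them to the tool on standard input. As a sketch, the same input can be
given with a here-document, which leaves nothing to clean up. The function name
run_twolc is hypothetical; the command lines are those of the twolc step above,
with the language as a parameter.

```shell
# run_twolc LANG: compile the twolc grammar of LANG, passing the twolc
# commands on standard input instead of through a script file.
run_twolc() {
    lang=$1
    twolc <<EOF
read-grammar tmp/twol-$lang-tmp.txt
compile
save-binary $lang/bin/twol-$lang.bin
quit
EOF
}

# Usage (after the m4 step has produced tmp/twol-sme-tmp.txt):
#   run_twolc sme
```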

echo
echo "*** Building NEW-sme.save ***" ;
echo
xfst -e "read regex @\"sme/bin/NEW-sme-twolc.save\" ; " \
			-e "save stack sme/bin/NEW-sme.save" \
			-stop

echo
echo "*** Generating sme-num.fst ***"
echo
xfst -e "read lexc sme/src/sme-num.txt" \
			-e "save stack sme/bin/sme-num.fst" \
			-stop
echo
echo "*** Building nosofthyphen.fst ***" ;
echo
xfst -e "read regex < common/src/nosofthyphen.regex " \
			-e "save stack common/bin/nosofthyphen.fst " \
			-stop
echo
echo "*** Building usage-tags-remove.fst ***" ;
echo
xfst -e "read regex < common/src/usage-tags-remove.regex " \
			-e "save stack common/bin/usage-tags-remove.fst " \
			-stop
echo
echo "*** Building inituppercase.fst ***" ;
echo
xfst -e "read regex < common/src/inituppercase.regex " \
			-e "save stack common/bin/inituppercase.fst " \
			-stop
echo
echo "*** Building spellrelax.fst ***" ;
echo
xfst -e "read regex < common/src/spellrelax.regex " \
			-e "save stack common/bin/spellrelax.fst " \
			-stop
echo
echo "*** Building downcase-derived-proper.fst ***" ;
echo
xfst -e "source common/src/downcase-derived-proper.xfst" \
			-e "save stack common/bin/downcase-derived-proper.fst" \
			-stop
echo
echo "*** Building webadr.fst ***" ;
echo
xfst -e "source common/src/webadr.xfst" \
			-e "save stack common/bin/webadr.fst" \
			-stop
echo
echo "*** Building NEW-sme.fst ***" ;
echo
xfst -e "read regex ( @\"common/bin/nosofthyphen.fst\" \
			.o. @\"sme/bin/NEW-sme.save\" \
			.o. @\"common/bin/inituppercase.fst\" \
			.o. @\"common/bin/downcase-derived-proper.fst\" \
			.o. @\"common/bin/spellrelax.fst\" ) \
			|   @\"common/bin/webadr.fst\" ; " \
			-e "save stack sme/bin/NEW-sme.fst" \
			-stop
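
The four regex-based transducers in common/ are all built with the same pair of
xfst commands, so the steps above could be folded into a loop. This is only a
sketch: build_common_regex is a hypothetical helper, not a target from the
Makefile, and it stops at the first failing xfst call.

```shell
# build_common_regex NAME...: compile common/src/NAME.regex into
# common/bin/NAME.fst for each NAME given (hypothetical helper).
build_common_regex() {
    for name in "$@"; do
        echo
        echo "*** Building $name.fst ***"
        echo
        xfst -e "read regex < common/src/$name.regex " \
             -e "save stack common/bin/$name.fst " \
             -stop || return 1
    done
}

# Usage (requires the Xerox tools):
#   build_common_regex nosofthyphen usage-tags-remove inituppercase spellrelax
```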


HFST
----

To be written.