Building Instructions
*********************

This document describes how to build the linguistic transducers, lists the
prerequisite software, and then gives a more elaborate walk-through of the
build process.

Basic building
==============

Briefly, the shell command:

    make GTLANG=xxx

(where GTLANG signifies the language to build, and xxx is the three-letter
ISO 639-2 code for the target language) will produce the default set of
transducers for the language xxx. The list of supported languages equals the
list of language sub-directories found in this directory.

This build command can be narrowed down with the specific build targets
listed below (for a full list, see the file Makefile).

General targets:
----------------

hfst - make the default set of transducers using the hfst tool set,
       producing transducers compatible with the hfst runtimes.
fst  - make only the regular analysing transducer, not the other ones

Prerequisite software:
======================

To build the default transducers, one needs the Xerox FST tools, available at
the site http://fsmbook.com/. The tools are closed-source, but available free
of charge at that site by accepting the proposed license.

To build using the alternative, open-source HFST tool set, you need to
download and install the HFST tools. More information can be found at
http://hfst.sf.net/

By default, binary forms of the VISLCG3 grammars are compiled as well,
although these are not needed if all you want to do is morphological
analysis. To be able to disambiguate, or just to avoid error messages during
the build, please install VISLCG3 as well. Instructions can be found at
http://beta.visl.sdu.dk/cg3.html.

Build walk-through
==================

The following is a step-by-step walk-through of the build process, based on
how it is done using the new build strategy for the Xerox tools (see the
NEW-* targets in the Makefile).
Following that is a short walk-through of the HFST build process.
At present the HFST build is a lot simpler than the Xerox one, partly because
it is not fully developed yet. Ideally, the two build processes should mirror
each other.

In the walk-through, all file names use sme as the language. This should be
replaced with the $(GTLANG) variable in real usage.

Xerox
-----

# Build the sme proper-noun lexc file (this is done because some of the names
# are shared between sme, smj and sma):

cat sme/src/propernoun-sme-morph.txt sme/src/propernoun-sme-lex.txt > sme/src/propernoun-sme-lex-tmp.txt

# cat all lexc source files, and compile them using xfst:

cat \
 sme/src/sme-lex.txt \
 sme/src/verb-sme-lex.txt \
 sme/src/pp-sme-lex.txt \
 sme/src/pronoun-sme-lex.txt \
 sme/src/interjection-sme-lex.txt \
 sme/src/conjunction-sme-lex.txt \
 sme/src/subjunction-sme-lex.txt \
 sme/src/particle-sme-lex.txt \
 sme/src/noun-sme-lex.txt \
 sme/src/numeral-sme-lex.txt \
 sme/src/adj-sme-lex.txt \
 sme/src/adv-sme-lex.txt \
 sme/src/punct-sme-lex.txt \
 sme/src/acro-sme-lex.txt \
 sme/src/abbr-sme-lex.txt \
 sme/src/propernoun-sme-lex-tmp.txt \
 > sme/int/all-sme-lex.txt

xfst -e "read lexc sme/int/all-sme-lex.txt" \
     -e "compose net" \
     -e "save stack sme/bin/NEW-sme-lexc.save " \
     -stop

# Build the normative twolc file using M4 and twolc, and remove the script
# file:

m4 sme/src/twol-sme.txt > tmp/twol-sme-tmp.txt

printf "read-grammar tmp/twol-sme-tmp.txt \n\
compile \n\
save-binary sme/bin/twol-sme.bin \n\
quit \n" > tmp/twol-script-sme
twolc < tmp/twol-script-sme
rm -f tmp/twol-script-sme

# Using LEXC, we compose the compiled twolc rules with the lexc transducer:
# (script file needed - deleted afterwards)

printf "read-source sme/bin/NEW-sme-lexc.save \n\
read-rules sme/bin/twol-sme.bin \n\
compose-result \n\
save-result sme/bin/NEW-sme-twolc.save \n\
quit \n" > tmp/NEW-save-script-sme
lexc < tmp/NEW-save-script-sme
rm -f tmp/NEW-save-script-sme

echo
echo "*** Building NEW-sme.save ***" ; echo
xfst -e "read regex @\"sme/bin/NEW-sme-twolc.save\" ; " \
     -e "save stack sme/bin/NEW-sme.save" \
     -stop

echo
echo "*** Generating sme-num.fst ***"
echo
xfst -e "read lexc sme/src/sme-num.txt" \
     -e "save stack sme/bin/sme-num.fst" \
     -stop

echo
echo "*** Building nosofthyphen.fst ***" ; echo
xfst -e "read regex < common/src/nosofthyphen.regex " \
     -e "save stack common/bin/nosofthyphen.fst " \
     -stop

echo
echo "*** Building usage-tags-remove.fst ***" ; echo
xfst -e "read regex < common/src/usage-tags-remove.regex " \
     -e "save stack common/bin/usage-tags-remove.fst " \
     -stop

echo
echo "*** Building inituppercase.fst ***" ; echo
xfst -e "read regex < common/src/inituppercase.regex " \
     -e "save stack common/bin/inituppercase.fst " \
     -stop

echo
echo "*** Building spellrelax.fst ***" ; echo
xfst -e "read regex < common/src/spellrelax.regex " \
     -e "save stack common/bin/spellrelax.fst " \
     -stop

echo
echo "*** Building downcase-derived-proper.fst ***" ; echo
xfst -e "source common/src/downcase-derived-proper.xfst" \
     -e "save stack common/bin/downcase-derived-proper.fst" \
     -stop

echo
echo "*** Building webadr.fst ***" ; echo
xfst -e "source common/src/webadr.xfst" \
     -e "save stack common/bin/webadr.fst" \
     -stop

echo
echo "*** Building NEW-sme.fst ***" ; echo
xfst -e "read regex ( @\"common/bin/nosofthyphen.fst\" \
 .o. @\"sme/bin/NEW-sme.save\" \
 .o. @\"common/bin/inituppercase.fst\" \
 .o. @\"common/bin/downcase-derived-proper.fst\" \
 .o. @\"common/bin/spellrelax.fst\" ) \
 | @\"common/bin/webadr.fst\" ; " \
     -e "save stack sme/bin/NEW-sme.fst" \
     -stop

HFST
----

TBW.
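
Whichever tool set is used, the walk-through hard-codes sme where a real
build would use the GTLANG variable. As a minimal sketch of that
substitution, the shell function below prints the list of lexc source files
for an arbitrary language code, in the order used in the Xerox walk-through.
The function name lexc_sources is hypothetical, and the sketch assumes that
every language directory follows the same file naming as sme; neither is
taken from the Makefile.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Makefile): print the lexc source
# files for one language, in the order used in the Xerox walk-through.
# Assumption: every language directory uses the same file naming as sme.
lexc_sources() {
    lang=$1
    src=$lang/src
    # The general lexicon file comes first.
    printf '%s\n' "$src/$lang-lex.txt"
    # Then one file per part of speech or word class.
    for part in verb pp pronoun interjection conjunction subjunction \
                particle noun numeral adj adv punct acro abbr; do
        printf '%s\n' "$src/$part-$lang-lex.txt"
    done
    # Last, the temporary proper-noun file built in the first step.
    printf '%s\n' "$src/propernoun-$lang-lex-tmp.txt"
}

# Dry run: list the files that would be concatenated for GTLANG
# (defaults to sme when GTLANG is not set).
GTLANG=${GTLANG:-sme}
lexc_sources "$GTLANG"
```

With such a helper, the concatenation step could be written as
cat $(lexc_sources "$GTLANG") > "$GTLANG/int/all-$GTLANG-lex.txt",
provided none of the paths contain whitespace.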