This file documents compilation and usage of the North Sami analyser.

Prerequisites:
==============

You must have a Unix system (Linux or Mac), and a terminal supporting UTF-8.
Mac users must make sure they have the standard developer tools installed,
from the developer tools on the Mac OS system CD.

In order to compile the analyser you need compilers from __one__ of these two
sites (you do not need both, they compile the same analysers):

* xerox tools: http://fsmbook.com
* hfst tools: http://sourceforge.projects.net/hfst/

The xerox tools are available as binary compilers, for non-commercial use.
You will need all of lexc, xfst, lookup and twolc. All the programs must be
installed in a folder in your path.

The hfst tools are open source. Follow the instructions in the downloaded
folders.

At present (2010), the xerox tools are better tested and supported, and easier
to install (no compilation needed). Hfst is open source and without
restrictions and, unlike xerox, supports weighted transducers.

For syntactic analysis you will need the Constraint Grammar compiler vislcg3.
It can be obtained from http://visl.sdu.dk/vislcg3/. Note that Mac users must
install ICU (available from MacPorts). If you do not need syntax you may skip
this compiler, and ignore the corresponding error message that appears
during compilation.

Compilation
===========

Open a terminal window. Stand in gt/ (this folder).
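
Before compiling, it can help to check which of the prerequisite tools are
actually in your path. The following is only a sketch, not part of the build
system: the helper names `have` and `detect_toolchain` are made up for this
example, and using hfst-optimized-lookup as the sign that hfst is installed is
an assumption (a full hfst installation contains more programs than that).

```shell
#!/bin/sh
# Sketch: report which compiler toolchain seems to be installed.
# The xerox tool names are the four listed under Prerequisites.
# Assumption: hfst-optimized-lookup being present means hfst is installed.

have() {
    # True if the named program is found in the path.
    command -v "$1" >/dev/null 2>&1
}

detect_toolchain() {
    # Print xerox, hfst or none, preferring xerox when both are present.
    if have lexc && have xfst && have lookup && have twolc; then
        echo xerox
    elif have hfst-optimized-lookup; then
        echo hfst
    else
        echo none
    fi
}

toolchain=$(detect_toolchain)
case $toolchain in
    xerox) echo "xerox tools found" ;;
    hfst)  echo "hfst tools found" ;;
    none)  echo "no compiler found in the path" ;;
esac

# vislcg3 is only needed for syntactic analysis.
have vislcg3 || echo "vislcg3 not found: syntactic analysis will not be available"
```

The script only inspects the path. It does not verify versions or that the
tools actually run.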
With xerox compilers, write the command:

    make GTLANG=sme

With hfst compilers, write the command:

    make GTLANG=sme hfst

After the compilation process, the analysers can be found in ./sme/bin/

The optional abbreviation file ./sme/bin/abbr.txt is compiled with the command:

    make GTLANG=sme abbr

Compiled files:
===============

Here are the compiled files:

Files for use:
--------------

From xerox compilation
* sme.fst = analyser
* isme.fst = generator

From hfst compilation
* sme-gen.hfst = generator
* sme.hfst = analyser
* sme.hfstol = optimised-lookup analyser

Syntax files
* sme-dep.bin
* sme-dis.bin

For a list of auxiliary files, see below *).

Usage notes:
============

(standing in sme/, one level up):

morphological analysis:
-----------------------

xerox:

    cat textfile | ../script/preprocess --abbr=bin/abbr.txt |\
    lookup bin/sme.fst

hfst:

    cat textfile | ../script/preprocess --abbr=bin/abbr.txt |\
    hfst-optimized-lookup bin/sme.hfstol

syntactic analysis:
-------------------

Pipe the output from morphology, and add this to the end of the pipeline:

    | lookup2cg | vislcg3 -g bin/sme-dis.bin
    | lookup2cg | vislcg3 -g bin/sme-dep.bin

sme-dis.bin gives syntax and sme-dep.bin gives dependency.

A better dependency analysis is given by using the common dep file:

    | lookup2cg | vislcg3 -g bin/sme-dis.bin \
    | lookup2cg | vislcg3 -g ../../gt/smi/src/smi-dep.rle

*) List of auxiliary files in the bin/ directory:
=================================================

Auxiliary files from xerox compilation
* sme.save = North Sami analyser, without initial capital letters
* twol-sme.bin = North Sami analyser

Auxiliary files from hfst compilation
* lexc-sme.hfst
* twol-sme.hfst

General files
* abbr.txt = list of abbreviations for use in gt/script/preprocess
* allcaps.fst
* inituppercase.fst
* tagfix.fst
* tok.fst
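
As a recap of the usage notes, the two morphological analysis pipelines differ
only in the final lookup program and analyser file. The sketch below prints
the pipeline for a given toolchain. The function name `analysis_command` is
hypothetical and not part of gt/; the paths are the ones given in the usage
notes, so it assumes you are standing in sme/.

```shell
#!/bin/sh
# Sketch: print the morphological analysis pipeline for one toolchain.
# analysis_command is a made-up helper name, not a script shipped in gt/.

analysis_command() {
    # $1 = toolchain (xerox or hfst), $2 = text file to analyse
    pre="../script/preprocess --abbr=bin/abbr.txt"
    case $1 in
        xerox) echo "cat $2 | $pre | lookup bin/sme.fst" ;;
        hfst)  echo "cat $2 | $pre | hfst-optimized-lookup bin/sme.hfstol" ;;
        *)     echo "unknown toolchain: $1" >&2; return 1 ;;
    esac
}

# Show both variants for a file called textfile.
analysis_command xerox textfile
analysis_command hfst textfile
```

The function only prints the command line. To run it, copy the printed line
into the terminal, optionally followed by the syntactic analysis steps.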