A flowchart of the parsing process

        Action taken..              ..by means of the command:
        **************            ****************************

    |--------------------|
    | take incoming text |        sme$ cat corp/filename.txt | 
    |--------------------|
             \/
 |--------------------------|
 | preprocessing it:        |
 | moving one word per line,|     preprocess --abbr=bin/abbr.txt |
 | finding sentence bound.  |
 |--------------------------|
             \/
|-----------------------------|
| morphological analysis:     |
| give each word all possible |   lookup -flags mbTT bin/sme.fst |
| analyses                    |
|-----------------------------|
             \/
|-----------------------------|
| processing the output into  |
| a format that fits the dis- |   ../script/lookup2cg |
| ambiguator, w/a perlscript  |
|-----------------------------|
             \/
|-----------------------------|
| disambiguating the analysis:|
| picking only the relevant   |   mdis --grammar src/sme-dis.rle
| morphological analyses.     |
|-----------------------------|
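
Read from top to bottom, the right-hand column of the flowchart forms one shell pipeline. A minimal sketch of it as a reusable function follows, assuming that preprocess, lookup and mdis are on the PATH and that the function is called from the sme directory. The names parse_sme and SME_DRY_RUN are illustrative only and are not part of the toolchain.

```shell
# Illustrative wrapper around the pipeline shown in the flowchart.
# parse_sme and SME_DRY_RUN are hypothetical names, not existing tools.
# Usage: parse_sme corp/filename.txt
parse_sme() {
    input=$1
    if [ -n "${SME_DRY_RUN:-}" ]; then
        # Dry run: print the pipeline instead of executing it.
        printf '%s\n' "cat $input | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT bin/sme.fst | ../script/lookup2cg | mdis --grammar src/sme-dis.rle"
        return 0
    fi
    cat "$input" |
        preprocess --abbr=bin/abbr.txt |   # one word per line, sentence boundaries
        lookup -flags mbTT bin/sme.fst |   # all possible analyses for each word
        ../script/lookup2cg |              # reformat for the disambiguator
        mdis --grammar src/sme-dis.rle     # keep only the relevant analyses
}
```

Setting SME_DRY_RUN to a non-empty value makes the function print the command line instead of running it, which is handy for checking the paths first.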

Future versions will hopefully also assign syntactic tags and disambiguate them, in the same way as described for other languages in the literature.

For the command to work, it must be run from the sme directory (or the corresponding directory for another language), since all the paths in it are relative. The files are kept in different directories, for the following reasons:

The text file is in the corp directory (any text can be used).
The binary files (that is, the compiled files) are in the bin directory.
The lookup2cg script is a Perl script common to all languages, and is hence in the ../script directory.
The .rle file is a source file that is not compiled, and is hence in the src directory.
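
Because every path in the command is relative, a quick check of the layout can catch a wrong working directory before the pipeline is run. The following is only a sketch for a POSIX shell. check_layout is a hypothetical helper, and the file names are the ones used in the flowchart above.

```shell
# Hypothetical helper: report which of the files named in the pipeline
# cannot be found from the current directory.
# Returns 0 if all of them are present, 1 otherwise.
check_layout() {
    missing=0
    for f in bin/abbr.txt bin/sme.fst ../script/lookup2cg src/sme-dis.rle; do
        if [ ! -e "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

The input text is not checked, since any text file can be used.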

Hmm, one could perhaps claim that this is somewhat confusing...

Trond Trosterud

Last modified: Thu Apr 29 09:09:00 2004