lookup2cg - script

Presentation

The script lookup2cg reformats the lookup output so that it can be interpreted as vislcg input. lookup2cg is a Perl script and, like all other scripts, it is located in the gt/script directory.

The implementation

The input to the script is the output of lookup, which basically has the following format:
Dán     dát+Pron+Dem+Sg+Acc
Dán     dát+Pron+Dem+Sg+Gen
The output will be the input to vislcg:
"<Dán>"
        "dát" Pron Dem Sg Acc
        "dát" Pron Dem Sg Gen
The script reads one cohort at a time, separates the original word form from the base+analysis strings, and stores the base+analysis strings in an array. The base forms and the analyses are then formatted to match the requirements of vislcg.
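The reformatting step can be illustrated with a minimal Python sketch. It is not the Perl implementation; the function name lookup_to_cg is hypothetical, and the sketch assumes that the word form and the analysis are separated by whitespace, that consecutive lines with the same word form belong to one cohort, and that readings are indented with a tab.

```python
from itertools import groupby

def lookup_to_cg(lines):
    """Turn lookup lines ('wordform base+Tag+Tag') into vislcg cohort lines."""
    # Split each line into word form and analysis; skip empty or malformed lines.
    pairs = [line.split(None, 1) for line in lines]
    pairs = [p for p in pairs if len(p) == 2]
    out = []
    # Assumption: consecutive lines with the same word form form one cohort.
    for wordform, group in groupby(pairs, key=lambda p: p[0]):
        out.append(f'"<{wordform}>"')
        for _, analysis in group:
            base, *tags = analysis.strip().split("+")
            out.append("\t" + " ".join([f'"{base}"', *tags]))
    return out

for line in lookup_to_cg(["Dán\tdát+Pron+Dem+Sg+Acc", "Dán\tdát+Pron+Dem+Sg+Gen"]):
    print(line)
```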

The compound analyses are rated: only the analyses that contain the fewest word boundaries are kept. For example, if one analysis contains two word boundaries and another contains only one, then the analysis with two word boundaries is removed. (Of course, the rating is redundant if the base form is generated from the analyses of the compound parts; see Compounds.)
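A minimal sketch of the rating, assuming the readings are already in the formatted shape shown under Compounds, where the word boundary is written as #. The function name and the base forms in the example are hypothetical.

```python
def keep_fewest_boundaries(readings, boundary="#"):
    """Keep only the readings with the smallest number of word boundaries."""
    if not readings:
        return []
    fewest = min(r.count(boundary) for r in readings)
    return [r for r in readings if r.count(boundary) == fewest]

readings = [
    '"aa" N Sg Nom # bb N Sg Nom # cc N Sg Nom',  # two boundaries: removed
    '"aabb" N Sg Nom # cc N Sg Nom',              # one boundary: kept
]
print(keep_fewest_boundaries(readings))
```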

Compound base forms are generated from the base forms of their parts. The analyses of compounds received by lookup2cg contain a separate analysis of each compound part. For disambiguation, only the base form of the whole compound and its analysis are needed. lookup2cg generates the base form for the compound and removes the unnecessary analyses of its parts.

The base form of a compound is formed by replacing the last word of the original compound with its base form. The compound boundary is found by comparing the first 3 letters of the base form with the original compound. If these 3 letters contain a digraph, then the first 4 letters are considered instead. Only the analysis of the last word is preserved.

If the first 3 letters do not match, usually because of consonant gradation, a weakened vowel, etc., the first 2 letters are used as a "last resort". See Compounds for more detail.
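The boundary search described above might look roughly as follows in Python. This is a sketch under stated assumptions, not the script's actual code: the digraph list (c1, d1, n1, s1, t1, z1) is a guess based on the transliteration used in the examples, the key is searched from the right because only the last word matters, and the function names are hypothetical.

```python
# Assumption: digraphs of the internal transliteration (e.g. c1, s1, d1).
DIGRAPHS = ("c1", "d1", "n1", "s1", "t1", "z1")

def find_boundary(wordform, last_base):
    """Return the index where the last compound part starts, or -1."""
    # Use 4 letters instead of 3 if the first 3 letters contain a digraph.
    length = 4 if any(d in last_base[:3] for d in DIGRAPHS) else 3
    for n in (length, 2):  # 2 letters are the "last resort"
        pos = wordform.rfind(last_base[:n])
        if pos > 0:
            return pos
    return -1

def compound_base_form(wordform, last_base):
    """Replace the last word of the compound by its base form."""
    pos = find_boundary(wordform, last_base)
    return wordform if pos < 0 else wordform[:pos] + last_base

print(compound_base_form("alimusrievttiduomuin", "duopmu"))
```

With the word form 'alimusrievttiduomuin' and the base form 'duopmu' from the Compounds section, the key 'duo' is found at the start of the last word, so the sketch yields 'alimusrievttiduopmu'.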

The formation of the compound's base form may produce identical analyses because the analyses of the other compound words are removed. Identical lines are removed before the reformatted output is printed.

The derivational tags are marked with an asterisk *. The lexicon contains fully derived forms of some words that are derived during the analysis as well. The derivational tags pose problems for disambiguation. See Derivations for details. At the moment, the implementation of the search for derivational tags is fairly simple. First, a part of speech tag is searched for from the right; this tag is preserved. If there are other part of speech tags to the left of this tag, they are marked with an asterisk as being part of the derivation. This is a temporary solution; more sophisticated methods are on their way.
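The simple search can be sketched as follows. The set of part of speech tags is only illustrative, and appending the asterisk to the tag (V becomes V*) is an assumption; the text above does not say where the asterisk is placed. The function name is hypothetical.

```python
# Assumption: a small, illustrative set of part of speech tags.
POS_TAGS = {"N", "V", "A", "Adv", "Pron", "Num"}

def mark_derivation(tags):
    """Keep the rightmost part of speech tag; mark earlier ones with '*'."""
    pos_positions = [i for i, t in enumerate(tags) if t in POS_TAGS]
    if not pos_positions:
        return list(tags)
    last = pos_positions[-1]
    return [
        t + "*" if i in pos_positions and i != last else t
        for i, t in enumerate(tags)
    ]

print(mark_derivation(["V", "Pass", "upmi", "N", "Sg", "Nom"]))
```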

Compounds

""
        "dállo/doallu" N Sg Nom # ekonomiija N Sg Nom
        "dállo/doallu" N Sg Nom # ekonomiija N Sg Gen
        "dállo/doallu" N Sg Nom # ekonomiija N Sg Acc

The target form is

""
        "dállodoalloekonomiija" N Sg Nom
        "dállodoalloekonomiija" N Sg Gen
        "dállodoalloekonomiija" N Sg Acc

This means:

  1. Identify the compound boundary in the input form
  2. Replace the string between the initial " and the # symbol with the string to the left of the compound boundary in the input form
  3. Then conflate the result.
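The three steps can be sketched in Python as follows. The sketch assumes that the word form of the cohort is available as a plain string and that the boundary can be found by looking for the first 3 letters of the last base form; the function name conflate_compound is hypothetical.

```python
def conflate_compound(wordform, readings):
    """Build whole-compound readings and remove duplicates (order preserved)."""
    result = []
    for reading in readings:
        # Only the analysis of the last compound part is kept.
        last_part = reading.rsplit("#", 1)[-1].split()
        if not last_part:
            continue
        base, tags = last_part[0].strip('"'), last_part[1:]
        # Step 1: identify the compound boundary in the input form.
        pos = wordform.rfind(base[:3])
        # Step 2: take the string to the left of the boundary as the first part.
        prefix = wordform[:pos] if pos > 0 else ""
        line = " ".join([f'"{prefix}{base}"', *tags])
        # Step 3: conflate identical results.
        if line not in result:
            result.append(line)
    return result

readings = [
    '"dállo/doallu" N Sg Nom # ekonomiija N Sg Nom',
    '"dállo/doallu" N Sg Nom # ekonomiija N Sg Gen',
    '"dállo/doallu" N Sg Nom # ekonomiija N Sg Acc',
]
for line in conflate_compound("dállodoalloekonomiija", readings):
    print(line)
```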

Thus, for the following input:

""
        "Lasse" N Prop Sg Nom # died1áhus N Sg Nom
        "Lasse" N Prop Sg Gen # died1áhus N Sg Nom
        "lassi" N Sg Nom # died1áhus N Sg Nom

the target form is

""
        "Lassedied1áhus" N Sg Nom

The problematic part here is identifying the compound boundary. Just taking the first part from the analysis will not do, as there may be changes of 3 kinds: the final vowel (á, i, u) may have been weakened to (a, e, o), as for dállodoall_o_ekonomiija above; there may be consonant gradation in the form (as when 'alimus/riekti # duopmu' becomes 'alimusrievttiduomuin') with a kt:vtt change; and the compound form may be shortened (and possibly changed), as when 'geahc1c1at + vuohki' becomes 'geahc1c1anvuogi'.

Then, as shown in the last example, the second part may be changed as well. There are two safe indications: the number of syllables, and the initial consonant after the # symbol. So searching for a match for the first 3 graphemes after the #, starting after the second syllable of the input form, should be safe.
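One way to sketch this search in Python is shown below. It is an approximation under several assumptions: the vowel and diphthong inventories are illustrative, the end of the second syllable is approximated by the end of the second vowel nucleus, digraphs are not treated as single graphemes, and the function names are hypothetical.

```python
# Assumptions: illustrative vowel and diphthong inventories.
VOWELS = set("aáeiouy")
DIPHTHONGS = ("ea", "ie", "oa", "uo")

def after_second_nucleus(wordform):
    """Index just past the second vowel nucleus (len(wordform) if fewer)."""
    i, nuclei = 0, 0
    while i < len(wordform):
        if wordform[i].lower() in VOWELS:
            # A diphthong counts as one nucleus of two letters.
            i += 2 if wordform[i:i + 2].lower() in DIPHTHONGS else 1
            nuclei += 1
            if nuclei == 2:
                return i
        else:
            i += 1
    return len(wordform)

def find_boundary(wordform, second_base):
    """Search for the first 3 letters of the second part, skipping 2 syllables."""
    return wordform.find(second_base[:3], after_second_nucleus(wordform))

pos = find_boundary("geahc1c1anvuogi", "vuohki")
print("geahc1c1anvuogi"[:pos], "geahc1c1anvuogi"[pos:])
```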

Fixing this should also make it possible to get rid of "ambiguities" like the following:

""
        "rámma" N Sg Nom # eaktu N Pl Ill
        "rámma" N Sg Gen # eaktu N Pl Ill

where the difference lies in the homonymy of nominative and genitive of the first part of the compound. If the input is "", then the ambiguity will never arise.

Derivations

Since the input to the parser is a human-readable dictionary, many derivations are already present in the dictionary. Due to the dynamic derivation component, they come out with a double or even multiple analysis, as the analysis with the derivational affix added in the parsing process is given as well. Thus, we have "ambiguities" like the following:

""
        "mearkkas1it" V Pass upmi N Sg Nom
        "mearkkas1upmi" N Sg Nom

""
        "seailut" V h eapmi N Sg Ill
        "seailluhit" V eapmi N Sg Ill

""
        "eallit" V h us N Sg Nom # heivet V h eapmi N Sg Gen
        "eallit" V h us N Sg Nom # heivehit V eapmi N Sg Gen
        "ealihit" V us N Sg Nom # heivet V h eapmi N Sg Gen
        "ealihit" V us N Sg Nom # heivehit V eapmi N Sg Gen
        "ealáhus" N Sg Nom # heivet V h eapmi N Sg Gen
        "ealáhus" N Sg Nom # heivehit V eapmi N Sg Gen

The derivational tags that should be searched for are the following set (their initial + signs are removed during the initial stage of lookup2cg):

 +adda +ahtti +alla +asti +d +eaddji +eamos1 +eapmi +g +geahtes +h
 +heapmi +hudda +huhtti +huvva +j +l +las1 
 +meahttun +mus1 +n +s1 +st +stuvva +us +vuohta +goahti +lágan
 +Dimin  +Pass

Thus, the following algorithm should do:

  1. Scan through the reading from the right until a derivation tag is encountered
  2. If the string to the left of the tag is found for another reading, then remove the reading with a derivation tag.
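Step 1 can be sketched in Python as follows; the tag set is copied from the list above, without the + signs. The sketch covers only the right-to-left scan and returns the parts of the reading around the tag. The comparison against other readings in step 2 is not included, and the function name is hypothetical.

```python
# The derivational tags listed above, without their initial + signs.
DERIVATION_TAGS = set(
    "adda ahtti alla asti d eaddji eamos1 eapmi g geahtes h "
    "heapmi hudda huhtti huvva j l las1 "
    "meahttun mus1 n s1 st stuvva us vuohta goahti lágan "
    "Dimin Pass".split()
)

def split_at_derivation(reading):
    """Scan from the right; return (left, tag, right) or None if no tag."""
    tokens = reading.split()
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i] in DERIVATION_TAGS:
            return tokens[:i], tokens[i], tokens[i + 1:]
    return None

print(split_at_derivation('"seailut" V h eapmi N Sg Ill'))
print(split_at_derivation('"mearkkas1upmi" N Sg Nom'))
```

The base form is still enclosed in quotes at this point, so it cannot be mistaken for a one-letter tag such as d or h.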

Considerations for building a preprocessor geared towards disambiguation

The goal is to feed only syntactically relevant information to the disambiguator. So, in the analysis of "bargiin", the correct analysis is that it is Sg Com of "bargi". Since this word is lexicalised, it is found as a noun in the lexicon.

"" S:1995
        "bargat" V N Actor Sg Com
        "bargi" N Sg Com
What we want is thus to treat all Actor nouns as if they were found in the lexicon in the first place. The problem is then to reverse the morphological process, and find the stem.

Actio

"" S:631, 631, 631
        "lohkat" V n N Actio Pl Nom
        "lohkan" N Pl Nom

Derivations

These ones do not induce consonant gradation in the stem:

alla
Remove the -it part from the basic form and insert "alla"
ahtti
Remove the -it part from the basic form and insert "ahtti"
eaddji
Remove the -it part from the basic form and insert "eaddji"
eapmi
Remove the -it part from the basic form and insert "eapmi"
l
Remove the -t part from the basic form and insert "l"
vuohta
Just add vuohta to the basic form, removing the intervening A tag. Problem: there is often a tag 'las1' to the left of 'vuohta', and this tag causes consonant gradation (CG). In these cases, vuohta cannot be added easily.
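The rules for the non-gradating suffixes can be written down as a small table. The following Python sketch applies them literally as stated above and does not check whether the result is an existing word; the names RULES and derive_base are hypothetical, and the 'las1' problem mentioned for vuohta is not handled.

```python
# suffix tag -> (ending to remove from the basic form, string to insert)
RULES = {
    "alla": ("it", "alla"),
    "ahtti": ("it", "ahtti"),
    "eaddji": ("it", "eaddji"),
    "eapmi": ("it", "eapmi"),
    "l": ("t", "l"),
    "vuohta": ("", "vuohta"),
}

def derive_base(base, tag):
    """Apply the rule for tag to base; return None if the rule does not fit."""
    if tag not in RULES:
        return None
    ending, suffix = RULES[tag]
    if not base.endswith(ending):
        return None
    # Remove the ending (if any) and insert the suffix.
    stem = base[:len(base) - len(ending)]
    return stem + suffix

print(derive_base("mearridit", "eaddji"))
print(derive_base("c1uovvut", "l"))
```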

These ones do:

heapmi
d
h

For the non-gradating verb-to-noun suffixes, remove the V label preceding the N.

"" S:1708
        "c1uovvut" V l eapmi N Sg Acc
        "c1uovvulit" V eapmi N Sg Acc

""
        "iskat" V d eapmi N Sg Acc
        "iskat" V d eapmi N Sg Gen

""
        "jorgalit" V ahtti n N Actio Sg Ill
        "jorgalahttit" V n N Actio Sg Ill

"" S:662
        "mearridit" V eaddji N Pl Ill

For the gradating suffixes, we should think more before doing anything.

"" S:636, 1479
        "lassi" N heapmi A Comp Sg Nom

Trond Trosterud
Last modified: Thu Apr 29 09:51:09 2004