lookup output so that it can be interpreted as vislcg input. lookup2cg is a Perl script and, like all other scripts, it is located in the gt/script directory.
The lookup output has basically the following format:
Dán dát+Pron+Dem+Sg+Acc
Dán dát+Pron+Dem+Sg+Gen
The output will be the input to vislcg:
"<Dán>"
    "dát" Pron Dem Sg Acc
    "dát" Pron Dem Sg Gen
The script reads one cohort at a time, separates the original word form from the base form and analysis, and stores each base form and analysis in an array. The base forms and the analyses are formatted to match the requirements of vislcg.
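The reformatting step can be illustrated with a minimal Python sketch. It is not the Perl code of lookup2cg; the function name lookup_to_cohort is hypothetical, and the sketch assumes that every input line holds one word form and one analysis separated by whitespace, and that all lines belong to the same word form.

```python
def lookup_to_cohort(lines):
    """Turn the lookup lines for one word form into a vislcg-style cohort.

    Assumption: each line is "<word form><whitespace><base>+<Tag>+<Tag>...".
    """
    wordform = None
    readings = []
    for line in lines:
        if not line.strip():
            continue  # skip empty lines between cohorts
        form, analysis = line.split(None, 1)
        wordform = form
        # The first "+"-separated field is the base form, the rest are tags.
        base, *tags = analysis.strip().split("+")
        readings.append('\t"{}" {}'.format(base, " ".join(tags)))
    return '"<{}>"\n'.format(wordform) + "\n".join(readings)


if __name__ == "__main__":
    print(lookup_to_cohort(["Dán\tdát+Pron+Dem+Sg+Acc",
                            "Dán\tdát+Pron+Dem+Sg+Gen"]))
```

Run on the two lookup lines for Dán shown above, the sketch prints the cohort header followed by one indented reading per analysis.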
The compound analyses are rated: only the analyses that contain the smallest number of word boundaries are taken into account. For example, if there is one analysis which contains two word boundaries and another which contains only one, then the analysis with two word boundaries is removed. (Of course, the rating is redundant if the base form is generated from the analyses of the compound parts; see compounds.)
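The rating can be sketched in a few lines of Python. This is only an illustration under the assumption that each reading is a string in which the word boundary appears as a separate # token, as in the compound examples in this document; the name keep_fewest_boundaries is hypothetical.

```python
def keep_fewest_boundaries(readings, boundary="#"):
    """Keep only the readings with the smallest number of word boundaries."""
    if not readings:
        return []
    # Count the boundary tokens in every reading.
    counts = [reading.split().count(boundary) for reading in readings]
    fewest = min(counts)
    return [r for r, c in zip(readings, counts) if c == fewest]
```

Readings that tie on the smallest count are all kept, in their original order.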
Compound base forms are generated from the base forms of their parts. The analyses of compounds received by lookup2cg contain a separate analysis of each compound part. For the disambiguation, only the base form of the whole compound and its analysis are needed. lookup2cg generates the base form for the compound and removes the unnecessary analyses of its parts.
The base form of a compound is formed by replacing the last word of the original compound by its base form. The compound boundary is found by comparing the first 3 letters of the base form to the original compound. If these 3 letters contain a digraph, then the first 4 letters are considered. Only the analysis of the last word is preserved.
If the first 3 letters do not match, usually because of consonant gradation, a weakened vowel, etc., 2 letters are used as a "last resort". See compounds for more detail.
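The boundary search can be sketched as follows. This is a hedged Python illustration, not the actual Perl implementation. It assumes that a digraph is written as a letter followed by the digit 1 (as in died1áhus), that the search runs from the right end of the word form, and that a match at the very start of the word does not count as a boundary. The name compound_base_form is hypothetical.

```python
def compound_base_form(wordform, last_base):
    """Replace the last part of a compound word form by its base form.

    Returns None when no boundary is found.
    """
    # Use 4 letters instead of 3 if the first 3 contain a digraph marker.
    width = 4 if "1" in last_base[:3] else 3
    # Try the normal prefix first, then 2 letters as a last resort.
    for n in (width, 2):
        pos = wordform.rfind(last_base[:n])
        if pos > 0:  # the first part of the compound must not be empty
            return wordform[:pos] + last_base
    return None
```

For the word form Lassedied1áhus and the last base form died1áhus, the prefix die is found after Lasse, so the sketch returns Lassedied1áhus.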
The formation of the compound's base form may produce identical analyses because the analyses of the other compound words are removed. Identical lines are removed before the reformatted output is printed.
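Removing identical lines while keeping the order of the remaining readings can be done as in this small Python sketch (the name unique_readings is hypothetical, and the readings are assumed to be plain strings):

```python
def unique_readings(readings):
    """Drop repeated readings, keeping the first occurrence and the order."""
    # dict keys are unique and keep insertion order (Python 3.7 and later).
    return list(dict.fromkeys(readings))
```

If, for instance, three analyses of a compound differ only in the analysis of the first part, they become the same line once that part is removed, and only one copy survives.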
The derivational tags are marked with an asterisk *. The lexicon contains fully derived forms of some words that are derived during the analysis as well. The derivational tags pose problems for disambiguation. See Derivation for details. At the moment, the implementation of the search for derivational tags is fairly simple. First, a part-of-speech tag is searched for from the right, and this tag is preserved. If there are other part-of-speech tags to the left of that tag, they are marked with an asterisk as being part of the derivation. This is a temporary solution; more sophisticated methods are on their way.
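The simple search described above might look like this in Python. The sketch is illustrative only: the set POS_TAGS is a hypothetical, incomplete list of part-of-speech tags, and appending the asterisk to the tag (V*) is an assumption about where the mark is placed.

```python
# Hypothetical, incomplete set of part-of-speech tags.
POS_TAGS = {"N", "V", "A", "Adv", "Pron"}


def mark_derivation(tags, pos_tags=POS_TAGS):
    """Keep the rightmost part-of-speech tag; mark earlier ones with '*'."""
    positions = [i for i, tag in enumerate(tags) if tag in pos_tags]
    if not positions:
        return list(tags)  # no part-of-speech tag, nothing to mark
    last = positions[-1]
    return [tag + "*" if i < last and tag in pos_tags else tag
            for i, tag in enumerate(tags)]
```

With a tag sequence such as V N Sg Com, the N is preserved as the part of speech of the whole form and the V to its left is marked as belonging to the derivation.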
""
    "dállo/doallu" N Sg Nom # ekonomiija N Sg Nom
    "dállo/doallu" N Sg Nom # ekonomiija N Sg Gen
    "dállo/doallu" N Sg Nom # ekonomiija N Sg Acc
The target form is
""
    "dállodoalloekonomiija" N Sg Nom
    "dállodoalloekonomiija" N Sg Gen
    "dállodoalloekonomiija" N Sg Acc
This means:
Thus, for the following input:
""
    "Lasse" N Prop Sg Nom # died1áhus N Sg Nom
    "Lasse" N Prop Sg Gen # died1áhus N Sg Nom
    "lassi" N Sg Nom # died1áhus N Sg Nom
the target form is
""
    "Lassedied1áhus" N Sg Nom
The problematic part here is identifying the compound boundary. Just taking the first part from the analysis will not do, as there may be changes of 3 kinds: The final vowel (á, i, u) may have been weakened to (a, e, o), as for dállodoall_o_ekonomiija above; there may be consonant gradation in the form (as when 'alimus/riekti # duopmu' becomes 'alimusrievttiduomuin') with a kt:vtt change; and the compound form may be shortened (and eventually changed), as when 'geahc1c1at + vuohki' becomes 'geahc1c1anvuogi'.
Then, as shown in the last example, the second part may be changed as well. There are two safe indications: the number of syllables, and the initial consonant after the # symbol. So searching for a match for the first 3 graphemes after the #, starting after the second syllable of the input form, should be safe.
Fixing this should also make it possible to get rid of "ambiguities" like the following:
""
    "rámma" N Sg Nom # eaktu N Pl Ill
    "rámma" N Sg Gen # eaktu N Pl Ill
where the difference is in the homonymy of nominative and genitive of the first part of the compound. If the input is "
The derivational tags that should be searched for are the following set (their initial + signs are removed during the initial stage of lookup2cg):
Thus, the following algorithm should do:
These ones do:
For the non-gradating verb-to-noun suffixes, remove the V label preceding the N.
For the gradating suffixes, we should think more before doing anything.
Derivations
Since the input to the parser is a human-readable dictionary, many derivations are already present in the dictionary. Because of the dynamic derivation component, they come out with a double or even multiple analysis, as the analysis with the derivational affix added in the parsing process is given as well. Thus, we have "ambiguities" like the following:
"
+adda +ahtti +alla +asti +d +eaddji +eamos1 +eapmi +g +geahtes +h
+heapmi +hudda +huhtti +huvva +j +l +las1
+meahttun +mus1 +n +s1 +st +stuvva +us +vuohta +goahti +lágan
+Dimin +Pass
Points to consider when building a preprocessor geared towards disambiguation
The goal is to feed only syntactically relevant information to the
disambiguator. So, in the analysis of "bargiin", the correct analysis
is that it is Sg Com of "bargi". Since this word is lexicalised, it is
found as a noun in the lexicon.
"
What we want is thus to treat all Actor nouns as if they were found in the lexicon in the first place. The problem is then to reverse the morphological process and find the stem.
Actio
"
Derivations
These ones do not induce consonant gradation in the stem:
"
"
Trond Trosterud
Last modified: Thu Apr 29 09:51:09 2004