The disambiguation file sme-dis.rle
File macrostructure
The disambiguation file sme-dis.rle consists of the following parts:
- Delimiter declarations
- Tag and tag list declarations
- Constraints
The file format is documented in Tapanainen 1996. Cf. also the general discussion on CG-2 usage.
The delimiters
Three sentence delimiters are declared, ".", "?" and "!".
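As an illustration of what the delimiter declaration means in practice, here is a minimal Python sketch that closes a sentence at each of the three delimiters. It is not how CG-2 is implemented; the function name and the token list are assumptions made for the example.

```python
# Illustrative sketch only: the three delimiters declared in the file.
DELIMITERS = {".", "?", "!"}

def split_sentences(tokens):
    """Group tokens into sentences; each delimiter closes the current one."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in DELIMITERS:
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no final delimiter
        sentences.append(current)
    return sentences
```

Called on a token list such as ["a", "b", ".", "c", "?"], the function returns one list per sentence, each ending in its delimiter.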
Tag and tag list declarations
This section consists of two parts. In the first part, every single
lexc tag has been given a list declaration of its own. There is probably
a better way of doing this (the reader is invited to find a better
way :-).
The second part contains the actual set and list declarations.
The constraints
The constraints are organised in cycles. CG-2 makes it possible to
arrange the rules in blocks. Each set of constraints is introduced
with the key word CONSTRAINTS, and the last section is closed off with
the word END.
First come disambiguation rules that each refer to one single cohort
(marked with "cycle 0"). Then comes local disambiguation, referring mainly
to words one cohort to the left or right (possibly with intervening
adverbials) ("cycle 1"). Then follows the main part of the disambiguation file
("cycle 2"), which contains rules with local and long-distance scope. Finally,
there is a cycle consisting of rules with global scope, along with some rules
that need to come as late as possible ("cycle 3").
The rules within some of the main cycles are organised in subcycles, set
apart by the key word CONSTRAINTS. This gives better control of the order
in which the rules apply.
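The effect of ordering rule blocks can be sketched in Python as follows. This is a simplified model, not the CG-2 algorithm: each block is applied once, in order, and the real parser may re-apply rules. The tags N, A and V, the two toy rules and all function names are assumptions made for the example and are not taken from sme-dis.rle.

```python
def make_remove(tag, condition):
    """Build a rule that drops readings containing `tag` where `condition` holds."""
    def rule(cohorts, i):
        readings = cohorts[i]["readings"]
        if not condition(cohorts, i):
            return readings
        return [r for r in readings if tag not in r]
    return rule

def apply_sections(sections, cohorts):
    """Apply each block of rules in order; later blocks see earlier results."""
    for section in sections:
        for rule in section:
            for i, cohort in enumerate(cohorts):
                kept = rule(cohorts, i)
                if kept:  # a rule never removes the last remaining reading
                    cohort["readings"] = kept
    return cohorts

def unambiguous(cohort, tag):
    """True if the cohort has exactly one reading and it carries `tag`."""
    return len(cohort["readings"]) == 1 and tag in cohort["readings"][0]

def toy_cohorts():
    """Two placeholder cohorts with toy readings (not real corpus data)."""
    return [
        {"word": "w1", "readings": [{"N"}, {"A"}]},
        {"word": "w2", "readings": [{"N"}, {"V"}]},
    ]

# Early block: a rule that looks only at the cohort itself.
early = make_remove("A", lambda cs, i: any("N" in r for r in cs[i]["readings"]))
# Later block: a rule that needs an already disambiguated left neighbour.
late = make_remove("V", lambda cs, i: i > 0 and unambiguous(cs[i - 1], "N"))

in_order = apply_sections([[early], [late]], toy_cohorts())
reversed_order = apply_sections([[late], [early]], toy_cohorts())
```

With the blocks in the intended order, the later rule finds the unambiguous context it needs and both toy cohorts end up with a single reading. With the order reversed, the context rule runs before its left neighbour has been disambiguated, and the second cohort stays ambiguous.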
Also note that rules are often organised such that a block of rules is
followed by a set of examples. In these cases, the first rule will go with
the first example, the second rule with the second example, and so on. The
idea is that this gives a better overview than if each rule is immediately
followed by an example.
Most of the examples show where a rule hits. In some cases, however, the
example shows where the rule does NOT hit. Such an example is there to
illustrate the need for some specific condition in the rule. The condition
in question is normally pointed out explicitly.
More on the cycles
Cycle 0
Much cohort-internal disambiguation is taken care of in the
preprocessor phase (in the script lookup2cg,
see also the documentation in the script).
The rules in this section are there to reject analyses where there is
a better analysis in the same cohort. Typically, this is the case
where a lexicalised derivation competes with a dynamically derived
form.
For example, adverbs derived from adjectives are
rejected if there is an alternative analysis with the lexicalised
adverb.
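The adverb case can be sketched in Python as a filter over the readings of one cohort. The sketch assumes that a dynamically derived reading carries a marker tag, here the hypothetical "Der", and that adverb readings carry "Adv"; the actual tag names and the actual rules in sme-dis.rle may differ.

```python
DERIVATION_MARK = "Der"  # hypothetical tag marking a dynamically derived reading

def prefer_lexicalised_adverb(readings):
    """Drop derived adverb readings when a lexicalised adverb reading exists."""
    has_lexicalised = any(
        "Adv" in r and DERIVATION_MARK not in r for r in readings
    )
    if not has_lexicalised:
        return readings  # nothing better in the cohort, so keep everything
    return [r for r in readings if not ("Adv" in r and DERIVATION_MARK in r)]
```

The filter looks at a single cohort only, which is what sets cycle 0 apart from the later cycles.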
Names are preferred to ordinary nouns if the word form has an initial
capital and occurs in the middle of a sentence.
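A minimal Python sketch of the name preference, under the assumption that name readings carry a tag "Prop" in addition to "N" and that the position of the word in the sentence is known. The function name, the tags and the placeholder word in the usage note are illustrative, not taken from the rule file.

```python
def prefer_name(word, position, readings):
    """Keep name readings over plain noun readings for a capitalised
    word that is not sentence-initial."""
    mid_sentence_capital = position > 0 and word[:1].isupper()
    has_name = any("Prop" in r for r in readings)  # "Prop" is an assumed tag
    if not (mid_sentence_capital and has_name):
        return readings
    # Drop noun readings that are not names; other parts of speech survive.
    return [r for r in readings if "Prop" in r or "N" not in r]
```

For a placeholder word "Xyz" at position 2 with one name reading and one plain noun reading, only the name reading is kept. At position 0 nothing is removed, since a sentence-initial capital says nothing about name status.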
Cycle 1
This cycle consists of two subcycles. In cycle 1a we deal with some personal
pronouns that have homonyms in other parts of speech, we pick out some Px
readings, and select/remove some verb readings. There are also two rules
related to "ahte", which have to come early so that only the CS reading of
"ahte" survives to cycle 2. In cycle 1b most of the remaining Px readings
are removed.
Cycle 2
This is the largest cycle in the disambiguation file. It consists of several
subcycles. The main organising principle is that rules that are relevant to one
and the same part of speech should come as one rule block. However, some
exceptions to this principle have turned out to be necessary. For example,
although the main verb rule block follows the main noun rule block, certain
verb rules precede the noun rules, thereby improving the effect of the latter.
In the following, the main subcycles of Cycle 2 are dealt with one by one.
Noun or not?
In this subcycle some relatively certain noun readings are picked out. In
most cases adjective readings are thereby removed, but also some verb readings.
The selected noun readings will serve as context for later rules.
Adjectives and adverbs
In this subcycle some relatively certain adverb and adjective readings are
picked out. The competing readings are verb or noun readings.
Certain singleton words
Some ambiguous words that have one particular reading that can be selected
with some certainty are dealt with here.
Disambiguating clitics
So far there is only one rule here. It removes the Qst reading from adverbs like "nugo".
Disambiguating adpositions
Two subcycles. The first one deals with adpositions of the GASKAL class, in
the cases where they combine with a coordination. The rules look OK, but
they cannot at this point be tested against the corpus (no potential hits).
The second subcycle consists of rules that select and remove Po and Pr
readings in a general fashion. The ordering of rules is crucial here.
The present ordering is not necessarily the best one, although it has been
worked on quite a lot.
Disambiguating subjunctions
Two subcycles. The first one contains two general CS rules and a number of
rules related to individual subjunctions with homonyms in other POS. The
second one contains rules that disambiguate those ambiguous instances that
survive the first subcycle.
Disambiguating adverbs
Consists of some general adverb rules (SELECT and REMOVE), some rules that
distinguish between adverbs and other specified POS, and a series of rules
related to individual adverbs. The latter set of rules deals with ambiguities
that are not resolved by the general adverb rules.
Disambiguating pronouns
Since personal pronouns are dealt with in cycle 1, this subcycle consists of rule
blocks for interrogative pronouns, reflexive pronouns, reciprocal pronouns,
indefinite pronouns, and demonstrative pronouns, in that order, so that the
rules for demonstratives, for example, can build on the output of all other
pronoun rules.
Note that there are particularly many rules for "dat", since "dat" can be a
demonstrative pronoun or a personal pronoun. We have chosen here to treat
"dat" as a personal pronoun whenever it stands alone.
Disambiguating adjectives
Disambiguating verbs - part 1
Disambiguating nouns
Disambiguating verbs - part 2
Residual cases
Cycle 3
Trond Trosterud/Marit Julien
Last modified: Sun Dec 19 16:37:14 2004