The disambiguation file sme-dis.rle

File macrostructure

The disambiguation file sme-dis.rle consists of the following parts: the delimiters, the tag and tag list declarations, and the constraints.

The file format is documented in Tapanainen 1996. Cf. also the general discussion on CG-2 usage.

The delimiters

Three sentence delimiters are declared, ".", "?" and "!".
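As a rough illustration of what the delimiter declaration achieves, the sketch below splits a token stream into sentences at the three delimiters. It is not the CG-2 implementation; the function name split_sentences and the token lists are hypothetical, and the tokens are assumed to be already separated.

```python
# Hypothetical illustration only: this is not how CG-2 is implemented.
# The three delimiters are the ones declared in sme-dis.rle.
DELIMITERS = {".", "?", "!"}

def split_sentences(tokens):
    """Group tokens into sentences, closing a sentence at each delimiter."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in DELIMITERS:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a final delimiter
        sentences.append(current)
    return sentences
```

Rules with sentence-wide scope can then only look as far as the nearest delimiter in either direction.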

Tag and tag list declarations

This section consists of two parts. In the first part, every single lexc tag has been given a list declaration of its own. There is probably a better way of doing this (the reader is invited to find one :-).

The second part contains the real set and list declarations.

The constraints

The constraints are organised in cycles. CG-2 makes it possible to arrange the rules in blocks. Each set of constraints is introduced with the key word CONSTRAINTS, and the last section is closed off with the word END.

First come disambiguation rules that each refer to one single cohort (marked with "cycle 0"). Then comes local disambiguation, referring mainly to words one cohort to the left or right, possibly with intervening adverbials ("cycle 1"). Then follows the main part of the disambiguation file ("cycle 2"), which contains rules with local and long-distance scope. Finally, there is a cycle consisting of rules with global scope, along with some rules that need to come as late as possible ("cycle 3").

The rules within some of the main cycles are organised in subcycles, set apart by the key word CONSTRAINTS. This gives better control of the order in which the rules apply.
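The effect of ordering rules in cycles can be sketched in a few lines of Python. This is a toy model under strong assumptions: a cohort is reduced to a set of part-of-speech tags, a rule is a function from a set of readings to a subset, and context conditions are left out entirely. The names remove and run_cycles are hypothetical. The one property taken from constraint grammar itself is that a rule never removes the last remaining reading of a cohort.

```python
def remove(tag):
    """Build a toy rule that discards one tag from a set of readings."""
    return lambda readings: readings - {tag}

def run_cycles(readings, cycles):
    """Apply the rules cycle by cycle, in the order the cycles are listed."""
    for _name, rules in cycles:
        for rule in rules:
            remaining = rule(readings)
            # A rule may not delete the last reading of a cohort.
            if remaining:
                readings = remaining
    return readings

# Hypothetical cycles; the real file has many context-sensitive rules in each.
cycles = [
    ("cycle 0", [remove("A")]),
    ("cycle 1", [remove("V")]),
    ("cycle 2", [remove("N")]),
]
result = run_cycles({"N", "V", "A"}, cycles)
```

Here result is {"N"}: by the time the cycle 2 rule applies, the earlier cycles have already left a single reading, so the rule has no effect. Rules in later cycles always see the output of the earlier ones, which is why the placement of a rule matters.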

Also note that rules are often organised such that a block of rules is followed by a set of examples. In these cases, the first rule will go with the first example, the second rule with the second example, and so on. The idea is that this gives a better overview than if each rule is immediately followed by an example.

The majority of the examples show where a rule hits. But in some cases, the example shows where the rule does NOT hit. Then the example is there to illustrate the need for some specific condition in the rule. The condition in question is normally pointed out explicitly.

More on the cycles

Cycle 0

Much cohort-internal disambiguation is taken care of in the preprocessor phase (in the script lookup2cg, see also the documentation in the script).

The rules in this section are there to reject analyses where there is a better analysis in the same cohort. Typically, this is the case where a lexicalised derivation competes with a dynamically derived form. For example, adverbs derived from adjectives are rejected if there is an alternative analysis with the lexicalised adverb.
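The preference for lexicalised readings could be sketched as follows. This is a simplified model, not the rule as written in sme-dis.rle: a reading is represented as a dictionary with a part of speech and a flag saying whether it was dynamically derived, and the field names pos and derived as well as the lemma values are hypothetical.

```python
def prefer_lexicalised(cohort):
    """Drop derived readings when a lexicalised reading with the same POS exists."""
    lexicalised_pos = {r["pos"] for r in cohort if not r["derived"]}
    return [
        r for r in cohort
        if not (r["derived"] and r["pos"] in lexicalised_pos)
    ]

# Hypothetical cohort: an adverb listed in the lexicon competes with an
# adverb derived dynamically from an adjective.
cohort = [
    {"lemma": "lexicalised-adverb", "pos": "Adv", "derived": False},
    {"lemma": "adjective-base", "pos": "Adv", "derived": True},
]
kept = prefer_lexicalised(cohort)
```

Only the lexicalised adverb survives in kept. A derived reading is dropped only when a lexicalised competitor is present, so a cohort is never left empty.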

Names are preferred to ordinary nouns if the word-form has an initial capital and occurs in the middle of a sentence.
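A minimal sketch of the name preference, under the assumption that each reading is a tuple of tags and that proper nouns carry a tag called Prop (the tag name and the function name prefer_name are illustrative, not taken from the file):

```python
def prefer_name(wordform, position, readings):
    """Keep only name readings for a capitalised word that is not sentence-initial."""
    mid_sentence = position > 0
    if mid_sentence and wordform[:1].isupper():
        names = [r for r in readings if "Prop" in r]
        if names:  # fall through if there is no name reading to prefer
            return names
    return readings

# Hypothetical readings for one capitalised word-form.
readings = [("N", "Prop", "Sg", "Nom"), ("N", "Sg", "Nom")]
mid = prefer_name("Word", 3, readings)
initial = prefer_name("Word", 0, readings)
```

In mid-sentence position only the name reading is kept, while in sentence-initial position both readings are left for later rules, since the capital letter is not informative there.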

Cycle 1

This cycle consists of two subcycles. In cycle 1a we deal with some personal pronouns that have homonyms in other parts of speech, we pick out some Px readings, and select/remove some verb readings. There are also two rules related to "ahte", which have to come early so that only the CS reading of "ahte" survives to cycle 2. In cycle 1b most of the remaining Px readings are removed.

Cycle 2

This is the largest cycle in the disambiguation file. It consists of several subcycles. The main organising principle is that rules that are relevant to one and the same part of speech should come as one rule block. However, some exceptions to this principle have turned out to be necessary. For example, although the main verb rule block follows the main noun rule block, certain verb rules precede the noun rules, thereby improving the effect of the latter.

In the following, the main subcycles of Cycle 2 are dealt with one by one.

Noun or not?

In this subcycle some relatively certain noun readings are picked out. In most cases this removes adjective readings, but some verb readings are removed as well. The selected noun readings will serve as context for later rules.

Adjectives and adverbs

In this subcycle some relatively certain adverb and adjective readings are picked out. The competing readings are verb or noun readings.

Certain singleton words

This subcycle deals with some ambiguous words for which one particular reading can be selected with some certainty.

Disambiguating clitics

So far there is only one rule here. It removes the Qst reading from adverbs like "nugo".

Disambiguating adpositions

Two subcycles. The first one deals with adpositions of the GASKAL class, in the cases where they combine with a coordination. The rules look OK, but they cannot at this point be tested against the corpus (no potential hits). The second subcycle consists of rules that select and remove Po and Pr readings in a general fashion. The ordering of rules is crucial here. The present ordering is not necessarily the best one, although it has been worked on quite a lot.

Disambiguating subjunctions

Two subcycles. The first one contains two general CS rules and a number of rules related to individual subjunctions with homonyms in other POS. The second one contains rules that disambiguate those ambiguous instances that survive the first subcycle.

Disambiguating adverbs

This subcycle consists of some general adverb rules (SELECT and REMOVE), some rules that distinguish between adverbs and other specified POS, and a series of rules related to individual adverbs. The latter set of rules deals with ambiguities that are not resolved by the general adverb rules.

Disambiguating pronouns

Since personal pronouns are dealt with in cycle 1, this subcycle consists of rule blocks for interrogative pronouns, reflexive pronouns, reciprocal pronouns, indefinite pronouns, and demonstrative pronouns, in that order, so that the rules for demonstratives, for example, can build on the output of all other pronoun rules.

Note that there are particularly many rules for "dat", since "dat" can be a demonstrative pronoun or a personal pronoun. We have chosen here to treat "dat" as a personal pronoun whenever it stands alone.

Disambiguating adjectives

Disambiguating verbs - part 1

Disambiguating nouns

Disambiguating verbs - part 2

Residual cases

Cycle 3


Trond Trosterud/Marit Julien
Last modified: Sun Dec 19 16:37:14 2004