Introduction

File structure

The file format is documented in the Xerox manuals, especially Karttunen 1993, Finite-State Lexicon Compiler; see also the forthcoming book by Beesley and Karttunen. The file consists of a Multichar_Symbols section followed by a large number of lexica, 183 by the present count (19.10.01).

The Multichar_Symbols section contains all grammatical tags, and all multicharacter members of the alphabet (the latter set is taken from the grammar file).

The Root lexicon points to the lexica of the parts of speech.

The lexica representing open parts of speech

NounRoot

NOTE. The additional noun lexicon subst-s-7b does not differ much from the lexicon already added to the save file. It should not be added yet.

NounRoot has the lexica BOAZU, FALIS, GADDI, GAHPIR, GISTTA, GOAHTI, JOHTOLAT, MALIS, SEAMU, STAHTA, VIVVA. The lexica represent the following inflectional types:

The sublexica, alphabetically ordered


BEANA = trisyllabic animate gradating 0-nouns
BOAZU = animate contracted 0-nouns
FALIS = contracted animate C-nouns
GADDI = bisyllabic V-nouns with comparative forms
GAHPIR = trisyllabic, non-gradating C-nouns
GISTTA =
GOAHTI = inanimate bisyllabic V-nouns
IIJA = bisyllabic, non-gradating a-nouns, with an a-illative
JOHTOLAT =
MALIS = trisyllabic inanimate gradating C-nouns
MATTAR = trisyllabic animate gradating C-nouns
OLLUVUOHTA = exceptional vuohta-nouns
SEAMU = trisyllabic inanimate gradating 0-nouns
STAHTA = bisyllabic, non-gradating a-nouns, with an a/i-illative
SUOLU = inanimate contracted 0-nouns
VIVVA = animate bisyllabic V-nouns

The sublexica, ordered by inflectional type


VIVVA = bisyllabic animate V-nouns
GOAHTI = bisyllabic inanimate V-nouns
GADDI = bisyllabic V-nouns with comparative forms
IIJA = bisyllabic, non-gradating a-nouns, with an a-illative
STAHTA = bisyllabic, non-gradating a-nouns, with an a/i-illative
BOAZU = contracted animate 0-nouns
SUOLU = contracted inanimate 0-nouns
FALIS = contracted animate C-nouns
BEANA = trisyllabic animate gradating 0-nouns
SEAMU = trisyllabic inanimate gradating 0-nouns
MATTAR = trisyllabic animate gradating C-nouns
MALIS = trisyllabic inanimate gradating C-nouns
GAHPIR = trisyllabic, non-gradating C-nouns
OLLUVUOHTA = exceptional vuohta-nouns
GISTTA = The Noun gistta, gist -
JOHTOLAT =

In the noun lexicon, the declension types are distributed as follows (21.10.01):

Bisyllabic
V-nouns
   1568 VIVVA		animate
   9048 GOAHTI		inanimate
      2 GADDI		w/comparative forms
a-nouns
      0 IIJA		non-gradating w/ a-ill
    113 STAHTA		non-gradating w/ a/i-ill

Contracted
0-nouns			
     35 BOAZU		animate
     38 SUOLU		inanimate
C-nouns
    188 FALIS		animate

Trisyllabic
0-nouns
     49 BEANA		animate gradating
    423 SEAMU		inanimate gradating
C-nouns
     99 MATTAR		animate gradating
   1065 MALIS		inanimate gradating
   2208 GAHPIR		non-gradating

Miscellanea
   1749 JOHTOLAT	
    239 DIMINC		diminutives

     94 LASIS		
     75 MUSH		
     40 MAGASH		(all marked as "(pl.r.)")
     27 EGEZHAGAT	4-syllabic hk:g
     11 GARGIA		loanwords, video, etc.
      3 SATTU		(inconsistently marked)

      3 OANADUS		(abbreviations, look into this group)
      1 EANU		eatnu
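
A distribution like the one above can be recomputed whenever the lexicon changes. The following is a minimal sketch, assuming that each entry sits on one line of the form "stem CONTINUATION ;" and that "!" only ever starts a comment; the function name and the sample entries are invented for illustration and are not taken from the actual noun lexicon.

```python
import re
from collections import Counter


def count_continuations(lexc_text):
    """Count how often each continuation lexicon is named by an entry."""
    counts = Counter()
    in_lexicon = False
    for raw in lexc_text.splitlines():
        line = raw.split("!", 1)[0]          # drop comments (assumes no escaped %!)
        line = re.sub(r'"[^"]*"', "", line)  # drop quoted glosses, if any
        line = line.strip()
        if line.startswith("LEXICON"):
            in_lexicon = True                # entries only occur after a LEXICON line
            continue
        if not in_lexicon or not line.endswith(";"):
            continue
        tokens = line[:-1].split()
        if tokens:
            counts[tokens[-1]] += 1          # last token before ";" is the continuation
    return counts


# Hypothetical sample, not copied from the real lexicon file.
sample = """\
Multichar_Symbols +N +Sg +Nom

LEXICON NounRoot
goahti GOAHTI ;   ! inanimate
viessu GOAHTI ;
vivva VIVVA ;
boazu BOAZU ;
"""

for name, n in count_continuations(sample).most_common():
    print(f"{n:7d} {name}")
```

The output uses the same "count NAME" layout as the table above, so the two can be compared directly.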

ProperNoun

ProperNoun has three sublexica:

NIILLAS = trisyllabic, non-gradating C-proper names.

PIERA = bisyllabic a-proper names without gradation; a/i-illative.

HEANDARAT

The adjective lexicon

The file adj-sme-lex.txt contains 1820 entries, whereas adj-7b contains 4943 entries. The file adj-7b differs too much in form from the lexicon already added to the save file. It should not be added yet.

AdjRoot

The adjective sublexica

In the lexicon file adj-sme-lex.txt, the sublexica are distributed in the following way (23.10.01):

    359 BOAKKAS
    353 JEAGOHEAPMI
    269 BEAKKAN
    146 GARAS
    124 GAPPUS
    114 LAIKI
    110 AKTIIVA
    106 NUORRA
     31 JUHKKIS
     26 EATTAS
     22 GUOHCA
     18 LODJI
     13 GEARGGUS
     13 DILDDAS
     12 SEARRA
      6 BIEKKUS
      5 NUOLUS
      5 NJUORAS
      4 HEAHKAS
      3 LIEKKUS
      3 ASEHIS
      1 GUOROS

VerbRoot

VerbRoot contains 12 sublexica: each of the three stem types is represented by 4 verb lexica.

The last type is not expanded in the lexicon.

Bisyllabic verbs:
ARVI arvit rain !Bisyllabic Impersonal Verbs
DIEHTI diehtit know !Bisyllabic Verbs with Personal Passive
BOAHTI boahtit come !Bisyllabic Verbs without Personal Passive
CUOHCIT c1uohcit hit

Contracted verbs:
BORGE borget make a snowstorm !Contracted Impersonal Verbs
DOHPPE dohppet grip !Contracted Verbs with Personal Passive
GILLE gillet be bothered !Contracted Verbs without Personal Passive
GEARRAA gearra1t

Trisyllabic verbs:
CUORPMAST c1uorpmastit hail !Trisyllabic Impersonal Verbs
MUITAL muitalit !Trisyllabic Verbs with Personal Passive
ALIST alistit !Trisyllabic Verbs without Personal Passive
BORGGIST borggistit

The following table gives an overview:

               even      odd        contracted
-------------------------------------------------
impers         ARVI      CUORPMAST  BORGE
pers +ppass    DIEHTI    MUITAL     DOHPPE
pers -ppass    BOAHTI    ALIST      GILLE

               even      odd        contracted
-------------------------------------------------
impers         RAIN      HAIL       MAKE-STORM
pers +ppass    KNOW      TELL       GRIP
pers -ppass    COME                 BE-BOTHERED
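
The overview table can also be kept as a small lookup structure, for instance when assigning sublexica to verbs by script. This is only a sketch: the sublexicon names are copied from the table, while the key labels ("impers", "pers+ppass", "pers-ppass", "even", "odd", "contracted") and the function name are chosen here for illustration.

```python
# Sublexicon names come from the overview table; the key labels are
# assumptions made for this sketch.
VERB_SUBLEXICA = {
    ("impers", "even"): "ARVI",
    ("impers", "odd"): "CUORPMAST",
    ("impers", "contracted"): "BORGE",
    ("pers+ppass", "even"): "DIEHTI",
    ("pers+ppass", "odd"): "MUITAL",
    ("pers+ppass", "contracted"): "DOHPPE",
    ("pers-ppass", "even"): "BOAHTI",
    ("pers-ppass", "odd"): "ALIST",
    ("pers-ppass", "contracted"): "GILLE",
}


def verb_sublexicon(verb_type, stem_type):
    """Return the sublexicon for a verb type and stem type (KeyError if unknown)."""
    return VERB_SUBLEXICA[(verb_type, stem_type)]


print(verb_sublexicon("pers+ppass", "odd"))
```

The fourth lexicon of each stem type (CUOHCIT, GEARRAA, BORGGIST) is deliberately left out, since the table does not place it.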

The stems are distributed numerically as follows (the -it class includes both even-syllable and odd-syllable verbs):

-at            2964
even-syll  -it  924
-ut             826
total          4714

3-syllabic -it 5426

-a1t            301
-et            1091
-ot             209
total          1601

Comments on the lexica

Within each of the main groups, there are three types: impersonal verbs, and verbs with and without personal passives. The difference between i/a/u and e/á/o verbs is handled in the rules file, not in the lexicon file.

The distinction between verbs with and without a personal passive shows up in one sublexicon: DOHPPE has PASSIVE where GILLE has SG3PASS. This is (probably) a transitivity difference, cf. also diehtit vs. boahtit, so the three types would correspond to valence 0, 1 and 2.

At present, the file verb-sme-lex.txt contains all the verbs. At the beginning of the file, all sublexica are exemplified. Then follows the bulk of the verbs: bisyllabic even, polysyllabic even, odd and contracted verbs. These verbs are all assigned the sublexica DIEHTI, MUITAL and DOHPPE, i.e. the transitive sublexicon, which gives the maximal paradigm.

TODO: Assign correct transitivity/sublexicon marking to the bulk of the verbs. Also, the undefined sublexica should be investigated.

Pekka gives the following comment on the Dec 01 files:

The file contains the headwords of the verb entries. In some cases there are also alternatives (x ~ y), and a variant carries a reference to the main headword (x gc1. y). Vowel characters with a length mark (macron) have been replaced by x2 combinations, dotted ones by x3 combinations. The third degree of quantity is marked with ', whose code in DOS is 173. Some headwords are accompanied by information on government in parentheses, e.g. liikot (+ lok.); this information is not yet systematic.
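
Letter-plus-digit combinations of this kind also appear throughout this document in the x1 form (c1uohcit, gearra1t, olmmos1). The sketch below converts x1 combinations to Unicode letters. The mapping table is an assumption based on the word forms cited in this document, not a specification of the files; x2 and x3 combinations are left untouched because their target characters are not stated here.

```python
import re

# Assumed values of the x1 combinations (inferred from forms such as
# c1uohcit, gearra1t, olmmos1, vuod1at, ren1ko); not an official table.
X1_COMBINATIONS = {
    "a1": "á",
    "c1": "č",
    "d1": "đ",
    "n1": "ŋ",
    "s1": "š",
    "t1": "ŧ",
    "z1": "ž",
}


def to_unicode(word):
    """Replace known lowercase x1 combinations; leave everything else as is."""
    return re.sub(
        r"[a-z]1",
        lambda m: X1_COMBINATIONS.get(m.group(0), m.group(0)),
        word,
    )


for form in ["c1uohcit", "gearra1t", "olmmos1"]:
    print(form, "->", to_unicode(form))
```

Unknown combinations pass through unchanged, so the function can be run on mixed input without losing information.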

Error report 011111

At the moment, the verb lexicon (VerbRoot; in sme-lex.txt) consists of a handful of verbs only. Some of these are not recognised by the analyser:

borggistit
c1uohcit
c1uorpmastit
borgistit
muitalit
gearra1t

The problem is that the respective sublexica are not defined. They should thus be defined.
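
Undefined sublexica of this kind can be listed mechanically. The following is a minimal sketch, assuming one entry per line of the form "stem CONTINUATION ;" and "!" as the comment character; the function name and the sample lexicon are invented for illustration and do not reproduce sme-lex.txt.

```python
import re


def undefined_lexica(lexc_text):
    """Return continuation lexica that are referenced but never declared."""
    defined, referenced = set(), set()
    for raw in lexc_text.splitlines():
        line = raw.split("!", 1)[0]                  # drop comments
        line = re.sub(r'"[^"]*"', "", line).strip()  # drop quoted glosses
        if line.startswith("LEXICON"):
            parts = line.split()
            if len(parts) > 1:
                defined.add(parts[1])
        elif defined and line.endswith(";"):
            tokens = line[:-1].split()
            # "#" marks the end of a word and is not a lexicon name
            if tokens and tokens[-1] != "#":
                referenced.add(tokens[-1])
    return sorted(referenced - defined)


# Hypothetical sample: MUITAL and GEARRAA are used but never declared.
sample = """\
LEXICON VerbRoot
diehtit DIEHTI ;
muitalit MUITAL ;
gearra1t GEARRAA ;

LEXICON DIEHTI
+V # ;
"""

print(undefined_lexica(sample))
```

The same check also catches misspelled names, since a misspelled continuation is by definition one that no LEXICON line declares.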

011611

Go through the file "feilmelsingar..." in notatar. Several of the sublexicon names are simply misspelled (SeAMU, etc.).

Verb types in the dictionary Sámi-suoma-sámi sátnegirji

  1. c1uohppat
  2. vuolgit
  3. doalvut
  4. muitalit
  5. vuoddja1t
  6. c1orget
  7. gul'lot

An i:j error in the verbal paradigm

Could this be a result of the abandonment of the <=> arrow in the i:j rule?

apply down> diehtit+V+Pass+Act
dihttoiuvvon
dihttojuvvon
apply down> diehtit+V+Pass+Ger
dihttoiuvvodettiin
dihttojuvvodettiin
Here is another, of the same kind:

apply down> boahtit+V+Pass+Inf
bohttoit
apply down> c1uorvut+V+Pass+Inf
c1urvoit

It is not identical, though, since here the i should be deleted, not turned into j.

A further one of the same kind:

apply down> bidjat+V+Ind+Prs+Sg1
bidian
bijan
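
Cases like these, where one analysis string generates more than one surface form, can be picked out of a saved session automatically. This is a sketch under the assumption that the session is stored as plain text with "apply down>" prompts and one output form per line, without interleaved prose; both function names are hypothetical.

```python
def parse_transcript(text, prompt="apply down>"):
    """Map each input after the prompt to the list of output lines that follow."""
    results = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(prompt):
            current = line[len(prompt):].strip() or None
            if current is not None:
                results.setdefault(current, [])
        elif line and current is not None:
            results[current].append(line)
    return results


def overgenerated(results):
    """Keep only the inputs that produced more than one form."""
    return {key: forms for key, forms in results.items() if len(forms) > 1}


# Session lines copied from the examples in this section.
session = """\
apply down> diehtit+V+Pass+Act
dihttoiuvvon
dihttojuvvon
apply down> boahtit+V+Pass+Inf
bohttoit
apply down> bidjat+V+Ind+Prs+Sg1
bidian
bijan
"""

for key, forms in overgenerated(parse_transcript(session)).items():
    print(key, forms)
```

Inputs with a single wrong form, such as boahtit+V+Pass+Inf above, are not flagged by this check and still need a list of expected forms to compare against.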

The grammatical continuation lexica

AGADJECT contains deadjectival adjectives with derivational suffix -ag/-og.

Bug reports, errors

Errors in the rule file

Gradation error for certain nouns

Weak grade is not recognised for ren1ko. Unclear what kind of error this is.

Weak grade is not recognised for ma1hli, duihmi, c1a1ihmi: these have -hl-, -hm-, -hn- in the weak grade as well.

The noun olmmos1 erroneously has strong grade in the nominative singular.

Unexpectedly, a ' symbol turns up in the underlying form:

apply up> gistta
gistta+N+Sg+Nom
gis'ta+N+Sg+Nom

Missing vowel alternation for 1st and 2nd Px

In the Px paradigm of even-syllabic words, the same epenthetic vowel is given as for 3rd person. The correct alternation should be i > á, u > o.

This error is now fixed!

Errors in the continuation lexica

MUSH
has defective Acc and Gen forms, and 'apply down' does not work

JOHTOLAT
has two Gen forms, one of them erroneous, and has -ai instead of -ii in the Illative. The other case endings work fine.

LASIS
is not found in the lexicon list at all. TODO: Write a lexicon for LASIS

Checking diary

All CG cases of series II E are checked. The ihx ones do not work (cf. above), but the other ones do.

The multiple genitive forms

At one stage, Acc/Gen forms were accompanied by several strange additional forms (Gen#vuoign1an/vuoigna1m). These are now commented out of the noun lexicon with a ! mark.

TODO: Check against the original lexicon, to ensure that nothing crucial has been lost in the conversion process.

The i:j alternation

The problem reported in this paragraph is now solved by adding a new lexical j:i rule, by turning the two j:i rules into <= rules, and by requiring the d:t rule to depend on underlying vowels.

The MUITAL problem is known. Here is another one: the program analyses aviissa, but not aviisa. Instead, we get a correct analysis of avijsa. Note also the Gen/Nom distinction here, which probably holds the solution to the riddle.

apply up> a1jggi
apply up> a1iggi
a1igi+N+Sg+Acc
a1igi+N+Sg+Gen
apply up> a1jgi
a1igi+N+Sg+Nom
Thus, Gen/Acc are ok, but not Nom.

Another one. This one is curious, as it treats the word as a compound, with "de" as the second component:

apply up> guvsside
guksi+N+Sg+Gen#de+N+Sg+Acc
guksi+N+Sg+Gen#de+N+Sg+Gen
guksi+N+Sg+Gen#de+N+Sg+Nom
apply up>
Words missing?

kursa
eiseva1lddit

Strange word:

apply up> olbmos1
olbmos1+Sg+Acc
olbmos1+Sg+Gen
olbmos1+Sg+Nom
olmmos1+N+Sg+Nom
olmmos1+N+s1+Sg+Acc
olmmos1+N+s1+Sg+Gen
olmmos1+N+s1+Sg+Nom
There is a word "de" in the lexicon (a noun). Thus there is an error somewhere.

Missing from olbmos1:

olbmos1 - olmmoz1in. All from olmma1i work.

Many adjectives are missing, approximately half of the ones listed by Nickel.

duhtavas1, lihkolas1, asehas1, oanehas1, vuollegas1, boaris, ra1hkis, ja1llu, uhcci, unna, bastil, ruoksat, alit, allat, gassat, govdat, lossat.

Comitative plural and Px

Correct:
apply down> giella+N+Pl+Com+PxSg3
gielaidisguin
apply down> giella+N+Pl+Com+PxPl3
gielaideasetguin

Erroneous:
apply down> beana+N+Pl+Com+PxPl3
beatnagiiddiset
apply down> beana+N+Pl+Com+PxSg3
beatnagiiddis
Also the contracted words luomi and gahpir behaved the same way as beana. It thus seems this is an error for all contracted nouns.

TODO: Go through the Px paradigm, and see if beana shows errors in other parts of the paradigm, and if there are other words that have problems in the Comitative Plural paradigm.

Compounds

The rule file requires the first part of a compound to be in the genitive. Min aigi still writes compounds with the initial part in the nominative, e.g. "vuoktac1almmi" rather than "vuovttac1almmi", but only the latter is recognised. Similarly: "bargguvigiid", but MA writes "bargovugiid". Thus, the compound form of the nominative is not recognised. (Neither is the genitive CG with a weak vowel: "barggovugiid" is rejected as well.)

PrfPtc + N is not recognised as a compound: "mujtalanvejolas2vuod1at" (check for "va1ikkuhanvejolas1vuod1at").


Trond Trosterud
Last modified: Wed Dec 19 22:52:46 CET 2001