Introduction

File structure

The file format is documented in the Xerox manuals, especially in Karttunen 1993, Finite-State Lexicon Compiler; see also the forthcoming Beesley and Karttunen book. The file consists of a section defining Multichar_Symbols, followed by a large number of lexica (183 according to the count of 19.10.01). The file sme-lex.txt contains, among other things, the continuation lexica for nouns, verbs and adjectives, whereas the bulk of the lexicon is divided into different files, as indicated below.

In the sme-lex.txt file, the Multichar_Symbols section contains all grammatical tags, and all multicharacter members of the alphabet (the latter set is taken from the grammar file).

The Root lexicon points to the lexica of the different parts of speech. For each sublexicon, a comment names the file that contains it:

LEXICON Root
 NounRoot ;       ! -> noun-sme-lex.txt
 ProperNoun ;     ! -> the file sme-lex.txt itself
 AdjectiveRoot ;  ! -> adj-sme-lex.txt
 VerbRoot ;       ! -> verb-sme-lex.txt
 Pronoun ;        ! -> closed-sme-lex.txt
 Adverb ;         ! -> adv-sme-lex.txt
 Particles ;      ! -> closed-sme-lex.txt
 Subjunction ;    ! -> closed-sme-lex.txt
 Conjunction ;    ! -> closed-sme-lex.txt
 Adposition ;     ! -> pp-sme-lex.txt
 Postposition ;   ! -> pp-sme-lex.txt
 Preposition ;    ! -> pp-sme-lex.txt
 Interjection ;   ! -> closed-sme-lex.txt
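
The pointers in the Root lexicon can also be read the other way round, from file to sublexica. The following Python sketch is only an illustration and not part of the lexicon sources: the mapping is copied from the comments above, with the file names spelled as in the later sections of this document (closed-sme-lex.txt, pp-sme-lex.txt), and the names ROOT_FILES and sublexica_by_file are made up for this example.

```python
# Hypothetical illustration: which Root sublexicon lives in which file.
# The pairs are copied from the comments in LEXICON Root.
ROOT_FILES = {
    "NounRoot": "noun-sme-lex.txt",
    "ProperNoun": "sme-lex.txt",
    "AdjectiveRoot": "adj-sme-lex.txt",
    "VerbRoot": "verb-sme-lex.txt",
    "Pronoun": "closed-sme-lex.txt",
    "Adverb": "adv-sme-lex.txt",
    "Particles": "closed-sme-lex.txt",
    "Subjunction": "closed-sme-lex.txt",
    "Conjunction": "closed-sme-lex.txt",
    "Adposition": "pp-sme-lex.txt",
    "Postposition": "pp-sme-lex.txt",
    "Preposition": "pp-sme-lex.txt",
    "Interjection": "closed-sme-lex.txt",
}


def sublexica_by_file(mapping):
    """Group the Root sublexica by the file that contains them."""
    grouped = {}
    for lexicon, filename in mapping.items():
        grouped.setdefault(filename, []).append(lexicon)
    return grouped


if __name__ == "__main__":
    for filename, lexica in sublexica_by_file(ROOT_FILES).items():
        print(filename, "<-", ", ".join(lexica))
```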

The part-of-speech lexica are documented here, in the order just given. Finally comes a section on bugs, etc. [This section will be moved elsewhere!]

Nouns

The NounRoot lexicon

NOTE. The additional noun lexicon subst-s-7b differs not too much from the lexicon already added to the save file. It should not be added yet.

NounRoot has the lexica BOAZU, FALIS, GADDI, GAHPIR, GISTTA, GOAHTI, JOHTOLAT, MALIS, SEAMU, STAHTA, VIVVA. The lexica represent the following inflectional types:

The sublexica, alphabetically ordered

BEANA = trisyllabic animate gradating 0-nouns
BOAZU = animate contracted 0-nouns
FALIS = contracted animate C-nouns.
GADDI = bisyllabic V-nouns with comparative forms
GAHPIR = trisyllabic, non-gradating C-nouns.
GISTTA =
GOAHTI = inanimate bisyllabic V-nouns.
IIJA = bisyllabic, non-gradating a-nouns, with an a-illative.
JOHTOLAT =
MALIS = trisyllabic inanimate gradating C-nouns.
MATTAR = trisyllabic animate gradating C-nouns.
OLLUVUOHTA = exceptional vuohta-nouns.
SEAMU = trisyllabic inanimate gradating 0-nouns.
STAHTA = bisyllabic, non-gradating a-nouns, with an a/i-illative.
SUOLU = inanimate contracted 0-nouns.
VIVVA = animate bisyllabic V-nouns.

The sublexica, ordered by inflectional type

VIVVA = bisyllabic animate V-nouns
GOAHTI = bisyllabic inanimate V-nouns
GADDI = bisyllabic V-nouns with comparative forms
IIJA = bisyllabic, non-gradating a-nouns, with an a-illative
STAHTA = bisyllabic, non-gradating a-nouns, with an a/i-illative
BOAZU = contracted animate 0-nouns
SUOLU = contracted inanimate 0-nouns
FALIS = contracted animate C-nouns
BEANA = trisyllabic animate gradating 0-nouns
SEAMU = trisyllabic inanimate gradating 0-nouns
MATTAR = trisyllabic animate gradating C-nouns
MALIS = trisyllabic inanimate gradating C-nouns
GAHPIR = trisyllabic, non-gradating C-nouns
OLLUVUOHTA = exceptional vuohta-nouns
GISTTA = The Noun gistta, gist -
JOHTOLAT =

In the noun lexicon, the declension types are distributed as follows (08.04.02):

Bisyllabic
V-nouns
   1577 VIVVA		animate
   9070 GOAHTI		inanimate
      2 GADDI		w/comparative forms
a-nouns
     54 IIJA		non-gradating w/ a-ill
     62 STAHTA		non-gradating w/ a/i-ill

Contracted
0-nouns			
     38 BOAZU		animate
     38 SUOLU		inanimate
C-nouns
    188 FALIS		animate

Trisyllabic
0-nouns
     49 BEANA		animate gradating
    423 SEAMU		inanimate gradating
C-nouns
     99 MATTAR		animate gradating
   1065 MALIS		inanimate gradating
   2208 GAHPIR		non-gradating

Miscellanea
   1749 JOHTOLAT	
    239 DIMINC		diminutives

     94 LASIS		
     75 MUSH		
     40 MAGASH		(all marked as "(pl.r.)")
     27 EGEZHAGAT	4-syllabic hk:g
     11 GARGIA		loanwords, video, etc.
      3 SATTU		(inconsistently marked)

      3 OANADUS		(abbreviations, look into this group)
      1 EANU		eatnu
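
To see how the entries are spread over the main groups, the counts in the table can be added up. The Python sketch below is only an illustration: the figures are copied from the table above (08.04.02), while the names NOUN_COUNTS and group_totals are invented for this example.

```python
# Counts copied from the declension table above (08.04.02).
NOUN_COUNTS = {
    "Bisyllabic": {"VIVVA": 1577, "GOAHTI": 9070, "GADDI": 2,
                   "IIJA": 54, "STAHTA": 62},
    "Contracted": {"BOAZU": 38, "SUOLU": 38, "FALIS": 188},
    "Trisyllabic": {"BEANA": 49, "SEAMU": 423, "MATTAR": 99,
                    "MALIS": 1065, "GAHPIR": 2208},
    "Miscellanea": {"JOHTOLAT": 1749, "DIMINC": 239, "LASIS": 94,
                    "MUSH": 75, "MAGASH": 40, "EGEZHAGAT": 27,
                    "GARGIA": 11, "SATTU": 3, "OANADUS": 3, "EANU": 1},
}


def group_totals(counts):
    """Sum the entry counts of the sublexica within each main group."""
    return {group: sum(lexica.values()) for group, lexica in counts.items()}


if __name__ == "__main__":
    totals = group_totals(NOUN_COUNTS)
    for group, total in totals.items():
        print(f"{total:7d} {group}")
    print(f"{sum(totals.values()):7d} total")
    # Bisyllabic 10765, Contracted 264, Trisyllabic 3844,
    # Miscellanea 2242, total 17115
```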

The ProperNoun lexicon

The proper nouns are stored in propernoun-sme-lex.txt.

The file structure

ProperNoun is divided into two sublexica, SamiProperNoun and GeneralProperNoun. (This division is for clarity, and can be dispensed with as soon as the structure of all the names is clear.) Here are the lexica:

ACCRA = foreign names ending in a stressless vowel.
NYSTØ = foreign names ending in a stressed vowel.
CNAME = foreign names ending in a consonant
C-FI-NEN = Finnish names of the "Itkonen" type, with Gen = Itkosa / Itkonena
LONDON ! Final foot structure (X.) and (X..) => Loc:is
BERN ! Final foot structure (X) => Loc:as

The class CNAME will eventually be divided in two: depending on foot structure, one part will go to the BERN lexicon and the other to the LONDON lexicon. At the moment, the CNAME entries accept both the -is and the -as suffix, and the suffix type is marked as such in the analysis.
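
The planned split can be stated as a small rule. The Python sketch below is only an illustration under a strong assumption: the number of syllables in the final foot is given as input, and no attempt is made to parse feet from the name itself. The function name locative_suffix is invented for this example.

```python
def locative_suffix(final_foot_syllables):
    """Return the locative suffix predicted by the final foot structure.

    (X)          -> "as"  (the BERN type)
    (X.), (X..)  -> "is"  (the LONDON type)

    Assumption: the caller already knows the syllable count of the
    final foot; deriving it from the spelling is not attempted here.
    """
    if final_foot_syllables == 1:
        return "as"
    if final_foot_syllables in (2, 3):
        return "is"
    raise ValueError("a final foot is expected to have 1, 2 or 3 syllables")


if __name__ == "__main__":
    print("BERN type:  ", locative_suffix(1))    # as
    print("LONDON type:", locative_suffix(2))    # is
```

Until the split is made, CNAME entries would correspond to accepting both return values.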

These are the lexica for Sámi names:

NIILLAS = trisyllabic, non-gradating C-proper names.

PIERA = bisyllabic a-proper names without gradation; a/i-illative.

HEANDARAT

MARJA = bisyllabic vowel-final names with gradation

Sámi geographical names

Contact person in Statens Kartverk: Johnny Andersen, 32118171.

Nominal sublexica

tbw.

Adjectives

The file adj-sme-lex.txt contains 1820 entries, whereas adj-7b contains 4943 entries. The file adj-7b differs formally too much from the lexicon already added to the save file, so it should not be added yet.

AdjRoot

The adjective sublexica

In the lexicon file adj-sme-lex.txt, the sublexica are distributed in the following way (23.10.01), ordered by frequency and then by declension type:

    359 BOAKKAS
    353 JEAGOHEAPMI
    269 BEAKKAN
    146 GARAS
    124 GAPPUS
    114 LAIKI
    110 AKTIIVA
    106 NUORRA
     31 JUHKKIS
     26 EATTAS
     22 GUOHCA
     18 LODJI
     13 GEARGGUS
     13 DILDDAS
     12 SEARRA
      6 BIEKKUS
      5 NUOLUS
      5 NJUORAS
      4 HEAHKAS
      3 LIEKKUS
      3 ASEHIS
      1 GUOROS

Making linguistic sense of the system (Sammallahti's codes aaa etc.):

    106 NUORRA		aaa
    269 BEAKKAN		aab
      2 ISSORAS		aad
        BUORRE		ab
      1 JOHTIL		babaa
      6 BIEKKUS		babba
      3 ASEHIS		baf
    146 GARAS		bbb
      1 FIINNA		bbe
    353 JEAGOHEAPMI	bae
    359 BOAKKAS		caa
        			cab
        			cb

a
 GUOHCA   !Trisyll. Gradating Adj., no sep. Attr.
 NUORRA   !Bisyll. V-Adj. without Separate Attr; no Adv.
 JUHKKIS  !Bisyll. V-Adj. without Separate Attr; no Adv.
 BEAKKAN  !Trisyll. Non-gradating C-Adj. without Separate Attr.
ba
 SEARRA   !Bisyll. V-Adj's with s-Attr.
 LAIKI    !Bisyll. V-Adj's with s-Attr. & Adv.
 JOHTIL   !Trisyll. Non-gradating C-Adj. with is-Attr.
bb
 GARAS    !Trisyll. Gradating C-Adj. with Bisyll. a-Attr.
 CIENAL   !Trisyll. Gradating C-Adj. with Strong Grade a-Attr.
 AGADJECT !Denominal Adj's with Deriv. -ag/og

Remaining lexica (no code given in the list above):

    124 GAPPUS
    114 LAIKI
    110 AKTIIVA
     31 JUHKKIS
     26 EATTAS
     22 GUOHCA
     18 LODJI
     13 GEARGGUS
     13 DILDDAS
     12 SEARRA
      5 NUOLUS
      5 NJUORAS
      4 HEAHKAS
      3 LIEKKUS
      1 GUOROS

Adjectival sublexica

AGADJECT contains deadjectival adjectives with derivational suffix -ag/-og.

Verbs

The VerbRoot lexicon

The lexicon is stored in the verb-sme-lex.txt file.

VerbRoot contains 12 sublexica: each of the three stem types is represented by 4 verb lexica:

a lexicon for impersonal verbs
a lexicon for verbs with personal passives
a lexicon for verbs without personal passives
a lexicon with ???

The last type is not expanded in the lexicon.

Bisyllabic verbs:
ARVI arvit 'rain' !Bisyllabic Impersonal Verbs
DIEHTI diehtit 'know' !Bisyllabic Verbs with Personal Passive
BOAHTI boahtit 'come' !Bisyllabic Verbs without Personal Passive
CUOHCIT c1uohcit 'hit'

Contracted verbs:
BORGE borget 'make a snowstorm' !Contracted Impersonal Verbs
DOHPPE dohppet 'grip' !Contracted Verbs with Personal Passive
GILLE gillet 'bother' !Contracted Verbs without Personal Passive
GEARRAA gearra1t

Trisyllabic verbs:
CUORPMAST c1uorpmastit 'hail' !Trisyllabic Impersonal Verbs
MUITAL muitalit !Trisyllabic Verbs with Personal Passive
ALIST alistit !Trisyllabic Verbs without Personal Passive
BORGGIST borggistit

The following table gives an overview:

               even      odd        contracted
-------------------------------------------------
impers         ARVI      CUORPMAST  BORGE
pers +ppass    DIEHTI    MUITAL     DOHPPE
pers -ppass    BOAHTI    ALIST      GILLE

               even      odd        contracted
-------------------------------------------------
impers         RAIN      HAIL       MAKE-STORM
pers +ppass    KNOW      TELL       GRIP
pers -ppass    COME                 BE-BOTHERED
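
The first table above is in effect a lookup from stem type and passive class to a sublexicon. The Python sketch below only restates that table; the key spellings ("even", "+ppass", etc.) and the function name verb_sublexicon are choices made for this example, not names used in the lexicon files.

```python
# The overview table as a lookup: (stem type, class) -> sublexicon.
VERB_SUBLEXICA = {
    ("even", "impers"): "ARVI",
    ("odd", "impers"): "CUORPMAST",
    ("contracted", "impers"): "BORGE",
    ("even", "+ppass"): "DIEHTI",
    ("odd", "+ppass"): "MUITAL",
    ("contracted", "+ppass"): "DOHPPE",
    ("even", "-ppass"): "BOAHTI",
    ("odd", "-ppass"): "ALIST",
    ("contracted", "-ppass"): "GILLE",
}


def verb_sublexicon(stem_type, verb_class="+ppass"):
    """Look up the sublexicon for a stem type and passive class.

    The default class "+ppass" selects the row with personal passives,
    i.e. DIEHTI, MUITAL and DOHPPE.
    """
    return VERB_SUBLEXICA[(stem_type, verb_class)]


if __name__ == "__main__":
    print(verb_sublexicon("even", "impers"))   # ARVI
    print(verb_sublexicon("odd"))              # MUITAL
    print(verb_sublexicon("contracted", "-ppass"))  # GILLE
```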

The stems are distributed numerically as follows (the -it class includes both even-syllable and odd-syllable verbs):

-at            2964
even-syll  -it  924
-ut             826
total          4714

3-syllabic -it 5426

-a1t            301
-et            1091
-ot             209
total          1601

Comments to the verb sublexica

Within each of the main groups there are three types: impersonal verbs, verbs with personal passives and verbs without personal passives. The difference between i/a/u and e/á/o verbs is handled in the rules file, not in the lexicon file.

The with/without Personal Passive distinction shows up in one sublexicon: DOHPPE has PASSIVE where GILLE has SG3PASS (cf. also diehtit vs. boahtit). The difference thus seems (probably) to be one of transitivity: 0, 1 and 2 valence.

At present, the file verb-sme-lex.txt contains all the verbs. At the beginning of the file, all sublexica are exemplified. Then follows the bulk of the verbs: two-syllabic even, many-syllabic even, odd and contracted verbs. These verbs are all assigned to the sublexica DIEHTI, MUITAL and DOHPPE, i.e. the transitive sublexica, which give the maximal paradigm.

TODO: Assign correct transitivity/sublexicon marking to the bulk of the verbs. Also, the undefined sublexica should be investigated.

Pekka gives the following comment on the Dec. 01 files (translated from Finnish):

The file contains the headwords of the verb entries. In some cases there are also alternatives (x ~ y), and the variant carries a reference to the main headword (x gc1. y). Vowel characters with a macron have been replaced by x2 combinations, those with a dot by x3 combinations. The third quantity grade is marked with ', whose code in DOS is 173. For some headwords, information on government is given in parentheses, e.g. liikot (+ loc.); this information is not yet systematic.

! The file verb-sme-lex.txt today contains
! the  complete set of
! two-syllabic even-syllable verbs
! All have the sublexicon DIEHTI
! This should of course be given
! appropriate sublexica according to 
! transitivity.
! 4-(and more)-syllabic verbs still
! to be added.

! This is the complete set of
! 4-and-more syllabic
! evensyllable verbs
! All have the sublexicon DIEHTI
! They should of course be given
! appropriate sublexica according to 
! transitivity

! This is the complete set of
! odd-syllable verbs
! All have the sublexicon MUITAL
! This should of course be given
! appropriate sublexica according to 
! transitivity

! This is the complete set of
! contracted verbs
! All have the sublexicon DOHPPE
! This should of course be given
! appropriate sublexica according to 
! transitivity

Verbal sublexica

tbw.

Verbal derivation

Only the even-syllable derivations are documented here; the others are similar. DIEHTI is transitive, BOAHTI is intransitive.

DIEHTI -> +V: DIEHTIStem ; +V: DeverbalVerbsDIEHTI ;
BOAHTI -> +V: BOAHTIStem ; +V: DeverbalVerbsBOAHTI ;
DIEHTIStem -> :Y7j PASSIVE ; BOAHTIINCH ;
BOAHTIStem -> SG3PASSV ; BOAHTIINCH ;
BOAHTIINCH -> DeverbalNounsV ; +goah0ti:goah'ti BOAHTICnj ; BOAHTICnj ;
BOAHTICnj -> +Ind+Prs: PrsV ; +Ind+Prt: PrtV ; +Pot+Prs:Q7z1 PrsC ;
      +Cond: CondV1 ; +Imprt: ImprtVA ; NominalFormsV ;
NominalFormsV -> :X1 NominalFormsV1 ; :X4 NominalFormsV2 ;
      :Q6 NominalFormsV3 ; :X2 NominalFormsV4 ; :Q3 NominalFormsV5 ;
      :Y1 NominalFormsV6 ;
PASSIVE ->  +Pass:uvvo DOHPPEINCH ; +Pass+meahttun+A:uvvomeahttum
      MEAHTTUN ; +Pass+PrfPrc:un K ; +Pass+eaddji+N+Actor:uvvojeaddji¤
      DEVNVCASE ; +Pass+upmi+N:upmi DEVNVCASE ;
DeverbalVerbsDIEHTI ->
 +st:X8st MUITALStem ;
 +st+alla:X6stalla DIEHTIStem ;
 +st+adda:X6stadda DIEHTIStem ;
 +l:l MUITALStem ;
 +l+adda:X2ladda DIEHTIStem ;
 +l+ahtti:lahtti DIEHTIStem ;
 +l+asti:las'ti DIEHTIStem ;
 +h:X4h MUITALStem ;
 +h+alla:X6halla DIEHTIStem ;
 +h+adda:X6hadda DIEHTIStem ;
 +h+asti:X4has'ti DIEHTIStem ;
 +stuvva:X8stuvva SG3PASSV ;
 +d:Q8d MUITALStem ;
DeverbalVerbsBOAHTI ->
 +st:X8st ALISTStem ;
 +st+alla:X6stalla BOAHTIStem ;
 +st+adda:X6stadda BOAHTIStem ;
 +l:l ALISTStem ;
 +l+adda:X2ladda BOAHTIStem ;
 +l+ahtti:lahtti BOAHTIStem ;
 +l+asti:las'ti BOAHTIStem ;
 +h:X4h MUITALStem ;
 +h+alla:X6halla DIEHTIStem ;
 +h+adda:X6hadda DIEHTIStem ;
 +h+asti:X4has'ti DIEHTIStem ;
 +stuvva:X8stuvva SG3PASSV ;
 +d:Q8d ALISTStem ;
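
The difference between the transitive and the intransitive stem lexicon can be made visible by following the continuation lexica. The Python sketch below is only an illustration: it copies part of the listing above into a dictionary (the DeverbalVerbs lexica are deliberately left out, and lexica not expanded above are treated as end points) and collects everything reachable from a given stem lexicon. The names CONTINUATIONS and reachable are invented for this example.

```python
# Continuation lexica copied from the listing above (stem level only;
# the DeverbalVerbs lexica are omitted in this sketch).
CONTINUATIONS = {
    "DIEHTIStem": ["PASSIVE", "BOAHTIINCH"],
    "BOAHTIStem": ["SG3PASSV", "BOAHTIINCH"],
    "BOAHTIINCH": ["DeverbalNounsV", "BOAHTICnj"],
    "BOAHTICnj": ["PrsV", "PrtV", "PrsC", "CondV1", "ImprtVA",
                  "NominalFormsV"],
    "NominalFormsV": ["NominalFormsV1", "NominalFormsV2", "NominalFormsV3",
                      "NominalFormsV4", "NominalFormsV5", "NominalFormsV6"],
    "PASSIVE": ["DOHPPEINCH", "MEAHTTUN", "K", "DEVNVCASE"],
}


def reachable(start, graph=CONTINUATIONS):
    """Return all lexica that can be reached from `start`."""
    seen = set()
    stack = [start]
    while stack:
        lexicon = stack.pop()
        for nxt in graph.get(lexicon, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


if __name__ == "__main__":
    transitive = reachable("DIEHTIStem")
    intransitive = reachable("BOAHTIStem")
    print("only via DIEHTIStem:", sorted(transitive - intransitive))
    print("only via BOAHTIStem:", sorted(intransitive - transitive))
```

In this reduced graph, the two stem lexica share the whole BOAHTIINCH branch and differ only in the passive: DIEHTIStem reaches PASSIVE and what follows it, BOAHTIStem reaches SG3PASSV.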

Pronouns

The tag system follows the outline in Nickel.

All Pronouns have the initial lexicon path Root -> Pronoun -> ...

Personal pronouns

Lexicon path:

Personal
 firstperspron
  firstperspronsg -> wordforms -> K
  firstpersprondu -> wordforms -> K
  perspronpl -> wordforms -> K
 nonfirstperspron
  nonfirstperspronsg -> wordforms -> K
  nonfirstpersprondu -> wordforms -> K
  perspronpl -> wordforms -> K

Note that the plural sublexicon (perspronpl) is the same for all three persons. Not all forms differ between sg and du, but the lexica were split for consistency.

Interrogative pronouns

So far, only gii and mii have been added. The sublexicon Interrogative contains one entry for Sg Nom of gii and mii, and points the rest to the case paradigm for the demonstrative pronouns.

Interrogative
 +Sg+Nom -> K (one entry for gii and one for mii)
 oblintercas (one entry for gii and one for mii)
  demcas

Demonstrative pronouns

The lexicon path:

Demonstrative
 demcas (one entry for each stem)
  demcassg
   nomdemcassg -> wordforms -> K
   obldemcassg -> wordforms -> K
  demcaspl
   nomdemcaspl -> wordforms -> K
   obldemcaspl -> wordforms -> K

Reflexive pronouns

The nominative forms are simply listed. The oblique forms are directed to the sublexicon reflobl, and from there via different case stems to the appropriate Px sublexica. These sublexica are the same as the ones for nouns and are found in the sme-lex.txt file. The only exceptions are some sublexica that are used only for plural forms; these are duplicated here from the sme-lex.txt file, in order not to revise the main lexicon.

Reciprocal pronouns

tbw. Reciprocal pronouns are not added yet.

Relative pronouns

tbw.

These are identical to the Interrogative ones. How should this be done?

Indefinite pronouns

tbw.

Numerals

Overview of the lexicon structure

The numeral lexica are formed as a generator, generating all possible numerals. The basic lexicon is Numeral, and it looks like this:

LEXICON Numeral
MILJON ; ! a noun of its own
UNDERDUHAT ; ! for generator under 1000
JUSTDUHAT ; ! going via 1000
OVERDUHAT ; ! for generator over 1000
OLD ; ! for "thirteen hundred", etc.
!num-basic ; ! replaced by the 5 lexica above
num-ordinal ; ! The basic ordinal numbers
!num-derived ; ! still unimplemented
num-imprecise ;! still almost unimplemented
ARABIC ; ! for the arabic numerals
ROMAN ; ! for the roman numerals

MILJON is a noun. OLD is the old way of counting. The num-ordinal entries act like adjectives; they are not finished yet. ARABIC and ROMAN contain number generators.

So, what is the reason for the three different lexica around 1000?

The path is OVERDUHAT -> JUSTDUHAT -> UNDERDUHAT. OVERDUHAT generates the part of the numeral that is over 1000, and all these lexica then point to JUSTDUHAT. That lexicon has an optional "(one) thousand" before it leads either to DUHAT and via the relevant case paradigm to K, or to UNDERDUHAT. UNDERDUHAT contains the numerals 1-999. UNDERDUHAT starts with the lexicon for one, and gives each group of numerals its own lexicon.
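
One way to picture the division of labour is to split a number into the part above 1000, the thousand itself, and the part below 1000. The Python sketch below is only one reading of the description above, not the generator itself: it assumes numbers between 1 and 999999 (MILJON is handled separately), it assumes that numerals under 1000 enter UNDERDUHAT directly from Numeral, and the function name numeral_path is invented for this example.

```python
def numeral_path(n):
    """Split n into the parts handled by the three lexica around 1000.

    Returns a list of (lexicon, value) pairs in the order in which the
    lexica are visited. This is a reading of the prose description,
    not output from the actual lexicon.
    """
    if not 1 <= n <= 999_999:
        raise ValueError("this sketch only covers 1..999999")
    over, under = divmod(n, 1000)
    path = []
    if over > 1:
        path.append(("OVERDUHAT", over))     # the multiplier of 1000
    if over >= 1:
        path.append(("JUSTDUHAT", 1000))     # "(one) thousand"
    if under:
        path.append(("UNDERDUHAT", under))   # the numerals 1-999
    return path


if __name__ == "__main__":
    for n in (345, 1000, 1345, 2000, 12345):
        print(n, numeral_path(n))
```

For example, 1345 gives [("JUSTDUHAT", 1000), ("UNDERDUHAT", 345)], while 12345 gives [("OVERDUHAT", 12), ("JUSTDUHAT", 1000), ("UNDERDUHAT", 345)].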

Case inflection of numerals

At the moment, the numerals are case-inflected only in their last part.

Indeclinable words

All the lexica for indeclinable words are made the same way:

LEXICON Root
Adverb ;
 LEXICON Adverb
 a1d1amusat adv ;
  LEXICON adv
  +Adv:0 K ;

The Root lexicon points to the POS lexica (Adverb etc.). Each POS lexicon lists its entries, with a pointer to an arbitrarily named sublexicon (here "adv"). This sublexicon contains the grammatical tag for the POS in question (the tag has no surface form, hence ":0") and, where relevant, a pointer to the clitic lexicon K. Adverbs can have clitics added, hence K, whereas subjunctions cannot, hence no K. [XXX At the moment particles are not directed to K, perhaps they should be.]
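
Since all indeclinable lexica follow the same pattern, the lexc text can be produced mechanically from a word list. The Python sketch below is only an illustration of the pattern described above and not a tool used in the project: the function name indeclinable_lexicon and its parameters are invented, and the use of # as the continuation for words without clitics is taken from the lexicon paths given in the sections below.

```python
def indeclinable_lexicon(pos_lexicon, sublexicon, tag, entries, clitics=True):
    """Build lexc text for one indeclinable POS lexicon.

    pos_lexicon -- name of the POS lexicon, e.g. "Adverb"
    sublexicon  -- arbitrarily named sublexicon carrying the tag, e.g. "adv"
    tag         -- grammatical tag, e.g. "+Adv" (no surface form, hence ":0")
    entries     -- the word forms to list
    clitics     -- True: continue to K; False: end the word with #
    """
    continuation = "K" if clitics else "#"
    lines = [f"LEXICON {pos_lexicon}"]
    lines += [f" {entry} {sublexicon} ;" for entry in entries]
    lines += ["", f"LEXICON {sublexicon}", f" {tag}:0 {continuation} ;"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(indeclinable_lexicon("Adverb", "adv", "+Adv", ["a1d1amusat"]))
```

Called as in the last line, the sketch prints the Adverb and adv lexica shown in the example above, apart from the indentation.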

Adverbs

They are explained in the intro above.

Particles

These are in the closed-sme-lex.txt file. Their tag is +Pcle and their lexicon path is:

Root -> Particle -> pcle -> #

Subjunctions

Subjunctions are ahte, juos, etc. These are in the closed-sme-lex.txt file. Their lexicon path is:

Root -> Subjunction -> -> #

Conjunctions

Conjunctions are ja, dahjege, etc. These are in the closed-sme-lex.txt file. Their tag is +CC and their lexicon path is:

Root -> Conjunction -> Cc -> #

P-positions

There are three different classes here: postpositions, occurring after their complement; prepositions, occurring before it; and adpositions, occurring both before and after. This could also have been done the Lingsoft way: having +Adp as a common tag for both, with +Prep and +Postp as optional subtags, so that no subtag would indicate both positions (or both subtags could be used). At the moment, they are left as 3 distinct groups. The classification is based upon Nickel; p-positions found only in Sammallahti's dictionary and not in Nickel were put in the Adposition group. Empirical studies will probably lead to a rearrangement of the present division; this should be looked into in connection with the morphological disambiguator (cg grammar).

Adpositions

Adpositions are bajil, birra, gaskal, etc. These are in the pp-sme-lex.txt file. Their tag is +Adp and their lexicon path is:

Root -> Adposition -> Pp -> #

Postpositions

Postpositions are bokte, lusa, etc. These are in the pp-sme-lex.txt file. Their tag is +Po and their lexicon path is:

Root -> Postposition -> Postp -> #

Prepositions

Prepositions are aisttan, earet, etc. These are in the pp-sme-lex.txt file. Their tag is +Pr and their lexicon path is:

Root -> Preposition -> Prep -> #

Interjections

Interjections are hoi, huh, kys1-kys1, etc. These are in the closed-sme-lex.txt file. Their tag is +Interj and their lexicon path is:

Root -> Interjection -> Ij -> #

Abbreviations

There is a file called abbr-sme-lex.txt. Work on abbreviations has not yet begun; the file contains just some dummy entries.

Trond Trosterud

Last modified: Sat Jun 22 11:21:25 CEST 2002