In the sme-lex.txt file, the Multichar_Symbols section contains all grammatical tags, and all multicharacter members of the alphabet (the latter set is taken from the grammar file).
The Root lexicon points to the lexica of the different parts of speech. For each sublexicon, a comment points to the file that contains it:

LEXICON Root
  NounRoot ;       ! -> noun-sme-lex.txt
  ProperNoun ;     ! -> the file sme-lex.txt itself
  AdjectiveRoot ;  ! -> adj-sme-lex.txt
  VerbRoot ;       ! -> verb-sme-lex.txt
  Pronoun ;        ! -> closed-sme.lex.txt
  Adverb ;         ! -> adv-sme-lex.txt
  Particles ;      ! -> closed-sme.lex.txt
  Subjunction ;    ! -> closed-sme.lex.txt
  Conjunction ;    ! -> closed-sme.lex.txt
  Adposition ;     ! -> pp-sme.lex.txt
  Postposition ;   ! -> pp-sme.lex.txt
  Preposition ;    ! -> pp-sme.lex.txt
  Interjection ;   ! -> closed-sme.lex.txt

The different part-of-speech lexica are documented here, in the order just given. Finally comes a section on bugs, etc. [This section will be moved elsewhere!].
NounRoot has the lexica BOAZU, FALIS, GADDI, GAHPIR, GISTTA, GOAHTI, JOHTOLAT, MALIS, SEAMU, STAHTA, VIVVA. The lexica represent the following inflectional types:
The sublexica, alphabetically ordered
BEANA = trisyllabic animate gradating 0-nouns
BOAZU = animate contracted 0-nouns
FALIS = contracted animate C-nouns.
GADDI = bisyllabic V-nouns with comparative forms
GAHPIR = trisyllabic, non-gradating C-nouns.
GISTTA =
GOAHTI = inanimate bisyllabic V-nouns.
IIJA = bisyllabic, non-gradating a-nouns, with an a-illative.
JOHTOLAT =
MALIS = trisyllabic inanimate gradating C-nouns.
MATTAR = trisyllabic animate gradating C-nouns.
OLLUVUOHTA = exceptional vuohta-nouns.
SEAMU = trisyllabic inanimate gradating 0-nouns.
SUOLU = inanimate contracted 0-nouns.
STAHTA = bisyllabic, non-gradating a-nouns, with an a/i-illative.
VIVVA = animate bisyllabic V-nouns.
The sublexica, ordered by inflectional type
VIVVA = bisyllabic animate V-nouns
GOAHTI = bisyllabic inanimate V-nouns
GADDI = bisyllabic V-nouns with comparative forms
IIJA = bisyllabic, non-gradating a-nouns, with an a-illative
STAHTA = bisyllabic, non-gradating a-nouns, with an a/i-illative
BOAZU = contracted animate 0-nouns
SUOLU = contracted inanimate 0-nouns
FALIS = contracted animate C-nouns
BEANA = trisyllabic animate gradating 0-nouns
SEAMU = trisyllabic inanimate gradating 0-nouns
MATTAR = trisyllabic animate gradating C-nouns
MALIS = trisyllabic inanimate gradating C-nouns
GAHPIR = trisyllabic, non-gradating C-nouns
OLLUVUOHTA = exceptional vuohta-nouns
GISTTA = The Noun gistta, gist -
JOHTOLAT =
In the noun lexicon, the declension types are distributed as follows (08.04.02):
Bisyllabic
  V-nouns
    1577 VIVVA      animate
    9070 GOAHTI     inanimate
       2 GADDI      w/comparative forms
  a-nouns
      54 IIJA       non-gradating w/ a-ill
      62 STAHTA     non-gradating w/ a/i-ill
Contracted
  0-nouns
      38 BOAZU      animate
      38 SUOLU      inanimate
  C-nouns
     188 FALIS      animate
Trisyllabic
  0-nouns
      49 BEANA      animate gradating
     423 SEAMU      inanimate gradating
  C-noun
      99 MATTAR     animate gradating
    1065 MALIS      inanimate gradating
    2208 GAHPIR     non-gradating
Miscellanea
    1749 JOHTOLAT
     239 DIMINC     diminutives
      94 LASIS
      75 MUSH
      40 MAGASH     (all marked as "(pl.r.)")
      27 EGEZHAGAT  4-syllabic hk:g
      11 GARGIA     loanwords, video, etc.
       3 SATTU      (inconsistently marked)
       3 OANADUS    (abbreviations, look into this group)
       1 EANU       eatnu
ACCRA = foreign names ending in a stressless vowel.
NYSTØ = foreign names ending in a stressed vowel.
CNAME = foreign names ending in a consonant.
C-FI-NEN = Finnish names of the "Itkonen" type, with Gen = Itkosa / Itkonena
LONDON ! Final foot structure (X.) and (X..) => Loc:is
BERN ! Final foot structure (X) => Loc:as

The class CNAME will eventually be divided in two: one part will go to the BERN lexicon and the other to the LONDON lexicon, depending upon foot structure. At the moment, the CNAME entries accept both -is and -as suffixes, and the different suffix types are marked as such in the analysis.
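The planned LONDON/BERN split can be sketched as a small selection rule. This is only an illustration in Python, not part of the lexicon files: the function name `locative_suffix` and the idea of passing in the syllable count of the final foot are assumptions made for the example. Only the mapping itself, (X) => -as and (X.) or (X..) => -is, comes from the comments above.

```python
def locative_suffix(final_foot_syllables: int) -> str:
    """Return the locative suffix for a consonant-final foreign name.

    Hypothetical helper: the caller is assumed to know how many
    syllables the final foot of the name contains.
    """
    if final_foot_syllables == 1:
        # (X): BERN type
        return "as"
    if final_foot_syllables in (2, 3):
        # (X.) and (X..): LONDON type
        return "is"
    raise ValueError("final foot must have 1, 2 or 3 syllables")


# Which lexicon a CNAME entry would move to under this rule.
TARGET_LEXICON = {"as": "BERN", "is": "LONDON"}
```

With such a rule, a CNAME entry would no longer need to accept both suffixes; it would be routed to one lexicon only.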
These are the lexica for Sámi names:
NIILLAS = trisyllabic, non-gradating C-proper names.
PIERA = bisyllabic a-proper names without gradation; a/i-illative.
HEANDARAT
MARJA = bisyllabic vowel-final names with gradation
In the lexicon file adj-sme-lex.txt, the sublexica are distributed in the following way (23.10.01), ordered first by frequency and then by declension type:
359 BOAKKAS
353 JEAGOHEAPMI
269 BEAKKAN
146 GARAS
124 GAPPUS
114 LAIKI
110 AKTIIVA
106 NUORRA
 31 JUHKKIS
 26 EATTAS
 22 GUOHCA
 18 LODJI
 13 GEARGGUS
 13 DILDDAS
 12 SEARRA
  6 BIEKKUS
  5 NUOLUS
  5 NJUORAS
  4 HEAHKAS
  3 LIEKKUS
  3 ASEHIS
  1 GUOROS

Making linguistic sense of the system (Sammallahti's codes aaa etc.):

106 NUORRA       aaa
269 BEAKKAN      aab
  2 ISSORAS      aad
    BUORRE       ab
  1 JOHTIL       babaa
  6 BIEKKUS      babba
  3 ASEHIS       baf
146 GARAS        bbb
  1 FIINNA       bbe
353 JEAGOHEAPMI  bae
359 BOAKKAS      caa cab cb

a
  GUOHCA    !Trisyll. Gradating Adj., no sep. Attr.
  NUORRA    !Bisyll. V-Adj. without Separate Attr; no Adv.
  JUHKKIS   !Bisyll. V-Adj. without Separate Attr; no Adv.
  BEAKKAN   !Trisyll. Non-gradating C-Adj. without Separate Attr.
ba
  SEARRA    !Bisyll. V-Adj's with s-Attr.
  LAIKI     !Bisyll. V-Adj's with s-Attr. & Adv.
  JOHTIL    !Trisyll. Non-gradating C-Adj. with is-Attr.
bb
  GARAS     !Trisyll. Gradating C-Adj. with Bisyll. a-Attr.
  CIENAL    !Trisyll. Gradating C-Adj. with Strong Grade a-Attr.
  AGADJECT  !Denominal Adj's with Deriv. -ag/og

124 GAPPUS
114 LAIKI
110 AKTIIVA
 31 JUHKKIS
 26 EATTAS
 22 GUOHCA
 18 LODJI
 13 GEARGGUS
 13 DILDDAS
 12 SEARRA
  5 NUOLUS
  5 NJUORAS
  4 HEAHKAS
  3 LIEKKUS
  1 GUOROS
AGADJECT contains deadjectival adjectives with derivational suffix -ag/-og.
VerbRoot contains 12 sublexica: each of the three stem types is represented by four verb lexica:
The last type is not expanded in the lexicon.
Bisyllabic verbs:
ARVI arvit "to rain" !Bisyllabic Impersonal Verbs
DIEHTI diehtit "to know" !Bisyllabic Verbs with Personal Passive
BOAHTI boahtit "to come" !Bisyllabic Verbs without Personal Passive
CUOHCIT c1uohcit "to hit"
Contracted verbs:
BORGE borget "to make a snowstorm" !Contracted Impersonal Verbs
DOHPPE dohppet "to grip" !Contracted Verbs with Personal Passive
GILLE gillet "to bother, to care to" !Contracted Verbs without Personal Passive
GEARRAA gearra1t
Trisyllabic verbs:
CUORPMAST c1uorpmastit "to hail" !Trisyllabic Impersonal Verbs
MUITAL muitalit !Trisyllabic Verbs with Personal Passive
ALIST alistit !Trisyllabic Verbs without Personal Passive
BORGGIST borggistit
The following table gives an overview:
              even      odd         contracted
-------------------------------------------------
impers        ARVI      CUORPMAST   BORGE
pers +ppass   DIEHTI    MUITAL      DOHPPE
pers -ppass   BOAHTI    ALIST       GILLE

              even      odd         contracted
-------------------------------------------------
impers        RAIN      HAIL        MAKE-STORM
pers +ppass   KNOW      TELL        GRIP
pers -ppass   COME                  BE-BOTHERED
The stems are distributed numerically as follows (the -it class includes both even-syllable and odd-syllable verbs):
even-syll     -at     2964
              -it      924
              -ut      826
              total   4714
3-syllabic    -it     5426
(contracted)  -a1t     301
              -et     1091
              -ot      209
              total   1601
The with / without Personal Passive distinction shows up in one sublexicon: DOHPPE has PASSIVE where GILLE has SG3PASS. This is (probably) a transitivity difference, cf. also diehtit vs. boahtit. The three rows of the table would thus correspond to valence 0, 1 and 2.
At present, the file verb-sme-lex.txt contains all the verbs. At the beginning of the file, all sublexica are exemplified. Then follows the bulk of the verbs: two-syllabic even, many-syllabic even, odd and contracted verbs. These verbs are all assigned to the sublexica DIEHTI, MUITAL and DOHPPE, i.e., they are given the transitive sublexicon, the maximal paradigm.
TODO: Assign correct transitivity/sublexicon marking to the bulk of the verbs. Also, the undefined sublexica should be investigated.
Pekka gives the following comment on the Dec. 01 files:
The file contains the headwords of the verb entries. In some cases there are also alternatives (x ~ y), and a variant carries a reference to the main headword (x gc1. y). Vowel characters with a length mark have been replaced by x2 combinations, dotted ones by x3 combinations. The marker for the third degree of quantity is ', whose code in DOS is 173. Some headwords are accompanied by information on government in parentheses, e.g. liikot (+ lok.), i.e. with the locative; this information is not yet systematic.
! The file verb-sme-lex.txt today contains
! the complete set of
! two-syllabic even-syllable verbs
! All have the sublexicon DIEHTI
! This should of course be given
! appropriate sublexica according to
! transitivity.
! 4-(and more)-syllabic verbs still
! to be added.

! This is the complete set of
! 4-and-more syllabic
! evensyllable verbs
! All have the sublexicon DIEHTI
! They should of course be given
! appropriate sublexica according to
! transitivity

! This is the complete set of
! odd-syllable verbs
! All have the sublexicon MUITAL
! This should of course be given
! appropriate sublexica according to
! transitivity

! This is the complete set of
! contracted verbs
! All have the sublexicon DOHPPE
! This should of course be given
! appropriate sublexica according to
! transitivity
DIEHTI ->
  +V: DIEHTIStem ;
  +V: DeverbalVerbsDIEHTI ;

BOAHTI ->
  +V: BOAHTIStem ;
  +V: DeverbalVerbsBOAHTI ;

DIEHTIStem ->
  :Y7j PASSIVE ;
  BOAHTIINCH ;

BOAHTIStem ->
  SG3PASSV ;
  BOAHTIINCH ;

BOAHTIINCH ->
  DeverbalNounsV ;
  +goah0ti:goah'ti BOAHTICnj ;
  BOAHTICnj ;

BOAHTICnj ->
  +Ind+Prs: PrsV ;
  +Ind+Prt: PrtV ;
  +Pot+Prs:Q7z1 PrsC ;
  +Cond: CondV1 ;
  +Imprt: ImprtVA ;
  NominalFormsV ;

NominalFormsV ->
  :X1 NominalFormsV1 ;
  :X4 NominalFormsV2 ;
  :Q6 NominalFormsV3 ;
  :X2 NominalFormsV4 ;
  :Q3 NominalFormsV5 ;
  :Y1 NominalFormsV6 ;

PASSIVE ->
  +Pass:uvvo DOHPPEINCH ;
  +Pass+meahttun+A:uvvomeahttum MEAHTTUN ;
  +Pass+PrfPrc:un K ;
  +Pass+eaddji+N+Actor:uvvojeaddji¤ DEVNVCASE ;
  +Pass+upmi+N:upmi DEVNVCASE ;

DeverbalVerbsDIEHTI ->
  +st:X8st MUITALStem ;
  +st+alla:X6stalla DIEHTIStem ;
  +st+adda:X6stadda DIEHTIStem ;
  +l:l MUITALStem ;
  +l+adda:X2ladda DIEHTIStem ;
  +l+ahtti:lahtti DIEHTIStem ;
  +l+asti:las'ti DIEHTIStem ;
  +h:X4h MUITALStem ;
  +h+alla:X6halla DIEHTIStem ;
  +h+adda:X6hadda DIEHTIStem ;
  +h+asti:X4has'ti DIEHTIStem ;
  +stuvva:X8stuvva SG3PASSV ;
  +d:Q8d MUITALStem ;

DeverbalVerbsBOAHTI ->
  +st:X8st ALISTStem ;
  +st+alla:X6stalla BOAHTIStem ;
  +st+adda:X6stadda BOAHTIStem ;
  +l:l ALISTStem ;
  +l+adda:X2ladda BOAHTIStem ;
  +l+ahtti:lahtti BOAHTIStem ;
  +l+asti:las'ti BOAHTIStem ;
  +h:X4h MUITALStem ;
  +h+alla:X6halla DIEHTIStem ;
  +h+adda:X6hadda DIEHTIStem ;
  +h+asti:X4has'ti DIEHTIStem ;
  +stuvva:X8stuvva SG3PASSV ;
  +d:Q8d ALISTStem ;
All Pronouns have the initial lexicon path Root -> Pronoun -> ...
Personal
  firstperspron
    firstperspronsg -> wordforms -> K
    firstpersprondu -> wordforms -> K
    perspronpl -> wordforms -> K
  nonfirstperspron
    nonfirsperspronsg -> wordforms -> K
    nonfirstpersrondu -> wordforms -> K
    perspronpl -> wordforms -> K
Note that the plural lexicon (perspronpl) is shared by all three persons. Not all forms differ between sg and du, but the lexica were split for consistency.
Interrogative
  +Sg+Nom -> K (one entry for gii and one for mii)
  oblintercas (one entry for gii and one for mii)
    demcas
Demonstrative
  demcas (one entry for each stem)
    demcassg
      nomdemcassg -> wordforms -> K
      obldemcassg -> wordforms -> K
    demcaspl
      nomdemcaspl -> wordforms -> K
      obldemcaspl -> wordforms -> K
These are identical to the Interrogative ones. How should this be done?
LEXICON Numeral
  MILJON ;         ! a noun of its own
  UNDERDUHAT ;     ! for generator under 1000
  JUSTDUHAT ;      ! going via 1000
  OVERDUHAT ;      ! for generator over 1000
  OLD ;            ! for "thirteen hundred", etc.
  !num-basic ;     ! replaced by the 5 lexica above
  num-ordinal ;    ! The basic ordinal numbers
  !num-derived ;   ! still unimplemented
  num-imprecise ;  ! still almost unimplemented
  ARABIC ;         ! for the arabic numerals
  ROMAN ;          ! for the roman numerals
MILJON is a noun. OLD is the old way of counting. The num-ordinal entries act like adjectives; they are not finished yet. ARABIC and ROMAN contain number generators.
So, what is the reason for the three different lexica around 1000?
The path is OVERDUHAT -> JUSTDUHAT -> UNDERDUHAT. OVERDUHAT generates the part of the numeral that is over 1000, and all these lexica then point to JUSTDUHAT. That lexicon has an optional "(one) thousand" before it leads either to DUHAT and via the relevant case paradigm to K, or to UNDERDUHAT. UNDERDUHAT contains the numerals 1-999. UNDERDUHAT starts with the lexicon for one, and gives each group of numerals its own lexicon.
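The routing just described can be traced with a small Python sketch. It is a reading of the paragraph above, not the lexicon itself: the function name `numeral_path`, the upper bound of 999999 (larger numerals are assumed to involve MILJON) and the choice to skip JUSTDUHAT for numerals below 1000 are assumptions made for the illustration.

```python
def numeral_path(n: int) -> list[str]:
    """List the numeral lexica a cardinal is assumed to pass through."""
    if not 1 <= n <= 999_999:
        raise ValueError("sketch only covers 1..999999")
    multiplier, rest = divmod(n, 1000)
    path = []
    if multiplier > 1:
        # The part of the numeral that is over 1000, e.g. the "two" in 2000.
        path.append("OVERDUHAT")
    if multiplier >= 1:
        # The (optional "one") thousand itself.
        path.append("JUSTDUHAT")
    if rest:
        # Numerals 1-999.
        path.append("UNDERDUHAT")
    else:
        # A round thousand ends in DUHAT and its case paradigm, then K.
        path.append("DUHAT")
    return path
```

Under these assumptions 2345 passes through all three lexica, 1000 through JUSTDUHAT and DUHAT only, and 345 through UNDERDUHAT only.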
LEXICON Root
  Adverb ;

LEXICON Adverb
  a1d1amusat adv ;

LEXICON adv
  +Adv:0 K ;
The Root lexicon points to the POS lexica (Adverb etc.). Each of the POS lexica lists the entries, with a pointer to an arbitrarily named sublexicon (here "adv"). This sublexicon contains the grammatical tag for the POS in question (the tag has no surface form, hence ":0"), and eventually a pointer towards the cliticon lexicon K. Adverbs can have clitics added, hence K, whereas subjunctions do not, hence no K. [XXX At the moment particles are not directed to K, perhaps they should be.]
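How the continuation lexica chain together into an analysis can be simulated in a few lines of Python. This is a toy model, not the lexc compiler: the `LEXICA` table and the `expand` function are made up for the illustration, and the content of K is an assumption (a single empty entry leading to the end symbol #), since the clitic lexicon is not listed here.

```python
# Each entry is (analysis side, surface side, continuation lexicon).
# "+Adv:0" in lexc means the tag +Adv has no surface form, so the
# surface side is the empty string here.
LEXICA = {
    "Root": [("", "", "Adverb")],
    "Adverb": [("a1d1amusat", "a1d1amusat", "adv")],
    "adv": [("+Adv", "", "K")],
    "K": [("", "", "#")],  # assumed: clitics left out, K ends the word
}


def expand(lexicon: str, upper: str = "", lower: str = ""):
    """Yield (analysis, surface) pairs reachable from the given lexicon."""
    if lexicon == "#":
        yield upper, lower
        return
    for entry_upper, entry_lower, continuation in LEXICA[lexicon]:
        yield from expand(continuation, upper + entry_upper, lower + entry_lower)


pairs = list(expand("Root"))
```

Walking Root -> Adverb -> adv -> K -> # in this model pairs the surface form a1d1amusat with the analysis a1d1amusat+Adv.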
Root -> Particle -> pcle -> #
Root -> Subjunction -> -> #
Root -> Conjunction -> Cc -> #
Root -> Adposition -> Pp -> #
Root -> Postposition -> Postp -> #
Root -> Preposition -> Prep -> #
Root -> Interjection -> Ij -> #