List of things to do
Intro
This is neither a complete list of the remaining work nor a
strategic plan. Rather, it is a list of things to do next. It exists
because more people are getting involved in this project, so we must
be more explicit about what needs doing. The point of this
document is thus not the issues themselves (they are described
elsewhere), but the lists. The list contains linguistic issues only. A
todo list for technical issues, including our web environment, is
found here.
Disambiguation
- The preprocessor must fix the derivation issue, so that the
embedded POS tags do not interfere with the disambiguation rules
- Make correct corpora
- Go systematically through the rules
Corpora
- Set up the IMS Corpus workbench interface on cochise
- Make an XML specification, based on TEI etc., for the metadata of the corpus texts
- Include a corpus text, as a test case
- Collect texts
The morphological parsers
sme
- Linguistic problems
- The whole morphological parser should be tested at
regular intervals (cf. the testing directory)
- Vowel shortening of compounds with short 1st syllable (bivdo-
pro bivdu-). The linguistic facts must be clarified.
- The continuation lexica for adjectives
- The transitivity distribution for verbs (the sublexicon distribution)
- 2nd part shortening of 3-part compounds (sámegieloahpahus)
- Generation vs. parsing: We should prepare for a smaller, more
restrictive generator isme.fst (e.g. excluding poetic short
genitives), and also for different eastern and western versions of the
generator
- The twol-sme.txt file should be rewritten, according to the
twol-smj.txt file (this has low priority, "if it isn't broken, don't
fix it").
- Place names
- Place names must be added to the propernoun file (Tomi)
- The names in Pekka's dictionary
- Procedure for adding Pekka's names:
- Obtain a copy of the list of person and place names in Pekka's 1989 dictionary (ask Pekka for a copy)
- Extract the names to a list, one name on each line
- Transform the names to the á, c1, d1, n1, s1,
t1, z1 format. There are several Perl scripts for this purpose in
the gt/script directory; if none of them suits Pekka's
encoding format, a new Perl script should be made on the basis of
e.g. ws2-7bit.pl.
- Run the command "cat namelist | lookup -flags
mbTT ~/gt/sme/bin/sme.fst | grep '\?' > unrecognised"
- Include the unrecognised ones in the
gt/sme/src/propernoun-sme-lex.txt file, modeled upon the names already
found there.
- The philosophy behind the lexica in the
propernoun file is the following: They should mirror the corresponding
lexica in the noun-sme-lex.txt file, but not have the same names, as it
should be possible to treat proper and common nouns differently in
future applications.
- Sami names in Norway: Statens kartverk (these are already fetched, but the process must be repeated (Trond))
- Sami names in Finland and Sweden: This issue is open.
- The missing adjectives must be added (cf. Trond.)
- Loan words should be created (the reparere > repareret process)
- Testing on running texts
- New test texts should be added to the test diary (from Davvi
Girji, NSI, Samediggi)
- The rule file should be read through
- The verbal sublexica should be reassigned (Biret?)
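The procedure for adding Pekka's names (under Place names above) could be sketched roughly as follows in Python. This is only an illustration, not one of the existing scripts in gt/script: the digraph table is an assumption inferred from the "á, c1, d1, n1, s1, t1, z1" list (the upper-case forms in particular are a guess), the function names are hypothetical, and the filter assumes that lookup prints tab-separated lines and marks unrecognised input with a question mark, which is what the grep step relies on.

```python
# Assumed mapping from Sami letters to the digit-digraph format.
# Inferred from the format named in the procedure; the real scripts may differ.
DIGRAPHS = {
    "č": "c1", "đ": "d1", "ŋ": "n1", "š": "s1", "ŧ": "t1", "ž": "z1",
    "Č": "C1", "Đ": "D1", "Ŋ": "N1", "Š": "S1", "Ŧ": "T1", "Ž": "Z1",
}


def to_digraphs(name):
    """Replace each mapped letter with its digraph; leave everything else (e.g. á) alone."""
    return "".join(DIGRAPHS.get(ch, ch) for ch in name)


def unrecognised(lookup_output):
    """Return the input words of all output lines containing a question mark.

    This mirrors the grep step of the procedure. The word is taken to be the
    first tab-separated field (an assumption about the lookup output layout).
    Duplicates are dropped, first occurrence wins.
    """
    words = []
    for line in lookup_output.splitlines():
        if "?" in line:
            word = line.split("\t", 1)[0]
            if word and word not in words:
                words.append(word)
    return words
```

The names returned by unrecognised() are the candidates for inclusion in gt/sme/src/propernoun-sme-lex.txt.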
smj
General work:
- The smj project must be coordinated with work at Árran.
- We need access to dictionaries, in order to complete the parser
- Place names must be added to the propernoun file
- Then we need test corpora, starting with the NT and novels
- The parser should be made part of pedagogical software projects
The parser:
- Vocabulary testing on texts
- Grammar testing, the paradigms
The other Sami languages
We will not work on languages other than North and Lule Sami during
this project period. When we start up with the other Sami languages
again, the following may be seen as a starting point.
sma
- Lauri Karttunen's comments must be addressed
- We need more corpora
- We need to cooperate with the dictionary projects
- Place names must be added to the propernoun file
- The next step should then be to introduce more parts of
speech in the system, perhaps in the following order:
- the closed classes (almost complete)
- verbs (there is a draft version in place)
- adjectives (their predicative declension should be pointed
to the nouns; the attributive forms need linguistic ground work)
- derivation and compounding
- South Sami place names
- loan words
smn
- Write documentation
- Get a grammatical description
- Write the basic twol file + the grammar file
- Get dictionary & complete the lexica
- Get texts to test against
- Find cooperation partners
sms
- Write documentation
- Get a grammatical description
- Write the basic twol file + the grammar file
- Get dictionary & complete the lexica
- Get texts to test against
- Find cooperation partners
Last modified: Thu Apr 22 15:44:17 2004