List of things to do

Intro

This is neither a complete list of the remaining work nor a strategic plan. Rather, it is a list of things to do next. It was made because more people are getting involved in this project, so we must be even more explicit. The point of this document is thus not the issues themselves (they are described elsewhere), but the lists. The list contains linguistic issues only. A todo-list for technical issues, including our web environment, is kept separately.

Disambiguation

The preprocessor must fix the derivation issue, so that the embedded POS tags do not interfere with the disambiguation rules
Make correct corpora
Go systematically through the rules

Corpora

Set up the IMS Corpus workbench interface on cochise
Make an XML specification, based on TEI etc. for the metadata of the corpus texts
Include a corpus text, as a test case
Collect texts

The morphological parsers

sme

Linguistic problems
- The whole morphological parser should be tested at regular intervals (cf. the testing directory)
- Vowel shortening of compounds with short 1st syllable (bivdo- pro bivdu-). The linguistic facts must be clarified.
- The continuation lexica for adjectives
- The transitivity distribution for verbs (the sublexicon distribution)
- 2nd part shortening of 3-part compounds (sámegieloahpahus)
- Generation vs. parsing: We should prepare for a smaller, more restrictive generator isme.fst (e.g. excluding poetic short genitives), and also for different eastern and western versions of the generator
- The twol-sme.txt file should be rewritten on the model of the twol-smj.txt file (this has low priority: "if it isn't broken, don't fix it").
Place names
- Place names must be added to the propernoun file (Tomi)
  - The names in Pekka's dictionary
    - Procedure for adding Pekka's names:
    1. Obtain a copy of the list of person and place names in Pekka's 1989 dictionary (ask Pekka for a copy)
    2. Extract the names to a list, one name on each line
    3. Transform the names to the á, c1, d1, n1, s1, t1, z1 format. There are several perl scripts for this purpose in the gt/script directory; if none of them suits Pekka's encoding format, a new perl script should be made on the basis of e.g. ws2-7bit.pl.
    4. Run the command "cat namelist | lookup -flags mbTT ~/gt/sme/bin/sme.fst | grep '\?' > unrecognised"
    5. Include the unrecognised ones in the gt/sme/src/propernoun-sme-lex.txt file, modeled upon the names already found there.
    6. The philosophy behind the lexica in the propernoun file is the following: they should mirror the corresponding lexica in the noun-sme-lex.txt file, but not have the same names, so that proper and common nouns can be treated differently in future applications.
  - Sami names in Norway: Statens kartverk (these have already been fetched, but the process must be repeated; Trond)
  - Sami names in Finland and Sweden: This issue is open.
- The missing adjectives must be added (cf. Trond)
- Loan words should be created (the reparere > repareret process)
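
Steps 3 and 4 of the procedure for adding Pekka's names can be sketched in Python as follows. This is only an illustration, not a replacement for the perl scripts in gt/script or for the lookup/grep command. Two things are assumptions: the letter mapping (č, đ, ŋ, š, ŧ, ž written as c1, d1, n1, s1, t1, z1, with upper-case counterparts), and the shape of the lookup output (one tab-separated line per analysis, with the input word in the first field and a "?" somewhere on the line when the word is not recognised). The function names are hypothetical.

```python
# Assumed mapping from Sami letters to the digit-based 7-bit notation
# (the "á, c1, d1, n1, s1, t1, z1 format"); á is left unchanged.
SAMI_TO_7BIT = {
    "č": "c1", "đ": "d1", "ŋ": "n1", "š": "s1", "ŧ": "t1", "ž": "z1",
    "Č": "C1", "Đ": "D1", "Ŋ": "N1", "Š": "S1", "Ŧ": "T1", "Ž": "Z1",
}


def to_7bit(name):
    """Rewrite one name in the digit-based notation (hypothetical helper)."""
    return "".join(SAMI_TO_7BIT.get(ch, ch) for ch in name)


def unrecognised_names(lookup_output):
    """Collect the input words of all lines that contain '?'.

    Mirrors the effect of grep '\\?' on the lookup output, but keeps only
    the first tab-separated field and drops duplicates, preserving order.
    The output layout is an assumption, see the lead-in.
    """
    names = []
    for line in lookup_output.splitlines():
        if "?" not in line:
            continue  # the word received at least one analysis
        word = line.split("\t", 1)[0].strip()
        if word and word not in names:
            names.append(word)
    return names
```

The names returned by unrecognised_names() are the candidates for step 5, i.e. for inclusion in gt/sme/src/propernoun-sme-lex.txt.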
Testing on running texts
New test texts should be added to the test diary (from Davvi Girji, NSI, Samediggi)
The rule file should be read through
The verbal sublexica should be reassigned (Biret?)

smj

General work:

The smj project must be coordinated with work at Árran.
We need access to dictionaries, in order to complete the parser
Place names must be added to the propernoun file
Then we need test corpora, starting with the NT and novels
The parser should be made part of pedagogical software projects

The parser:

Vocabulary testing on texts
Grammar testing, the paradigms

The other Sami languages

We will not work on languages other than North and Lule Sami during this project period. When we resume work on the other Sami languages, the following may be seen as a starting point.

sma

Lauri Karttunen's comments must be addressed
We need more corpora
We need to cooperate with the dictionary projects
Place names must be added to the propernoun file
The next step should then be to introduce more parts of speech in the system, perhaps in the following order:
1. the closed classes (almost complete)
2. verbs (there is a draft version in place)
3. adjectives (their predicative declension should be pointed to the nouns; the attributive forms need linguistic ground work)
4. derivation and compounding
5. South Sami place names
6. loan words

smn

Write documentation
Get a grammatical description
Write the basic twol file + the grammar file
Get dictionary & complete the lexica
Get texts to test against
Find cooperation partners

sms

Write documentation
Get a grammatical description
Write the basic twol file + the grammar file
Get dictionary & complete the lexica
Get texts to test against
Find cooperation partners

Last modified: Thu Apr 22 15:44:17 2004