Workplan

People, spring 04

The staff is roughly: Trond 100 %, Saara 50 %, Tomi 20 %, Lena 10 % (?), and Pekka.

Trond: Running the project; working mainly on disambiguation, and also on the morphological parser and on planning the corpus.
Saara: Maintaining the project's infrastructure; designing the pre- and postprocessors, the cgi-bin scripts and the corpus setup; handling issues such as localisation and a bug database.
Tomi: Evaluating the project and writing an evaluation report; maintaining the morphological parser.
Lena: Testing; working on the lexicon.
Pekka: Strategic planning; consultant on tricky linguistic questions.

Milestones

The official milestones

In our application we stated the following milestone list for the project:

                                         Start       Finish
 1 Language-independent preprocessing    2004  1     2004  1
 2 Infrastructure for disambiguation     2004  1     2004  2
 3 Corpus interface - prototype          2004  1     2004  4
 4 Groundwork for North Sámi             2004  1     2004  4
 5 North Sámi disambiguation - prototype 2004  1     2005  2
 6 Revise the morphological analysers    2004  1     2006  4
 7 Groundwork for Lule Sámi              2004  3     2005  4
 8 Lule Sámi disambiguation - prototype  2004  4     2005  4
 9 Parallel text corpora - prototype     2005  1     2005  2
10 Corpus interface - beta               2005  1     2005  4
11 North Sámi disambiguation - beta      2005  3     2005  4
12 Parallel text corpora - beta          2005  3     2006  1
13 Lule Sámi disambiguation - final      2005  4     2006  2
14 North Sámi disambiguation - final     2006  1     2006  4
15 Corpus interface - final              2006  1     2006  4
16 Parallel text corpora - final         2006  2     2006  4

Comments on the items that start in spring 2004

The language-independent preprocessor (Saara): This goal is fulfilled, as we have a revised language-independent preprocessor (preprocess) and a morphology-to-disambiguation processor (lookup2cg). There is still work to do on language-specific preprocessing (not mentioned in the list). This work will in practice run in parallel with other work.
Infrastructure for disambiguation (Trond): This was already in place in 2003.
Corpus interface (Saara, Trond): We are beginning this work now (scheduled to finish in 2004 4).
Disambiguation prototype for sme (Trond): The work is under way.

Other issues

Derivation in the lookup2cg preprocessor (Saara, Trond): Problem: The sme.fst output for derivation is not optimal for disambiguation (words get assigned a POS twice). Goal: Make an optimal version, either by reversing the morphophonological processes and building a new base form, or by introducing a special set of embedded POS tags. If we decide on the latter, it should be done before May 2004. The former may take more time.
Systematic testing of the morphological parser (Tomi, Lena, Trond): This should be done before autumn 2004, when we speed up the work on the disambiguator.
Gather corpus texts (Trond): We should have collected a large number of texts by 2005 1, when the corpus interface is finished. The work on gathering texts will continue after that.
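
Illustration of the embedded POS tag option for derivation mentioned above: a minimal sketch in Python, not the lookup2cg implementation. The "+"-separated analysis format, the tag inventory, the "*" marker and the function name embed_inner_pos are all assumptions made for this example; the real sme.fst output and the tag set we settle on may differ. The idea is only that every POS tag except the last one in an analysis is rewritten to an embedded variant, so that the disambiguator sees a single top-level POS.

```python
# Hypothetical POS inventory; the real sme.fst tag set may differ.
POS_TAGS = {"N", "V", "A", "Adv"}


def embed_inner_pos(analysis: str, marker: str = "*") -> str:
    """Mark every POS tag except the last one as embedded.

    `analysis` is assumed to be a lemma followed by "+"-separated tags.
    """
    lemma, *tags = analysis.split("+")
    # Positions of all POS tags in the tag sequence.
    pos_positions = [i for i, tag in enumerate(tags) if tag in POS_TAGS]
    # All but the last POS belong to the derivation base: mark them.
    for i in pos_positions[:-1]:
        tags[i] = tags[i] + marker
    return "+".join([lemma, *tags])


if __name__ == "__main__":
    # Made-up example analysis of a noun derived from a verb.
    print(embed_inner_pos("lohkat+V+TV+Der/NomAg+N+Sg+Nom"))
    # An analysis with a single POS is left unchanged.
    print(embed_inner_pos("lohkat+V+TV+Inf"))
```

With this approach the base form is left as it is, which is why it should be quicker to do than reversing the morphophonological processes.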

Trond Trosterud

Last modified: Wed Mar 31 11:01:05 2004