!!!Meeting setup

* Date: 4.12.2006
* Time: 09.30 Norw. time
* Place: Where we are
* Tools: SubEthaEdit, iChat

!!!Agenda

# Opening, agenda review
# Reviewing the task list from last week
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Infrastructure
# Linguistics
# name lexicon infrastructure
# Spellers
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 09:48.

Present: __Børre, Sjur, Thomas, Tomi, Trond__

Absent: __Maaren, Saara__

Agenda accepted as is.

!!!2. Updated task status since last meeting

!! Børre
* contact authors who have already received the corpus licensing contract
* consider a script for automatic testing of the spell checker in Word
** Began work. Some AppleScript done
* meeting Wednesday at 9.30 with __Sjur__ to plan testing
** Done
* {{sma}} discussions with SD (with __Sjur__, __Trond__)
** Not done
* add a simple password protection to risten.no in the G5
** Not done
* consider infra for testing feedback
** Some plans made during the meeting with __Sjur__
* get an Intel Mac for testing Windows spellers; get a WinXP license from SD
** Not done
* update all forrest installations, including local patches
** Not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** Done some work on bug 348


!! Maaren
* investigate the generated word form list sent to Polderland - use the command
  {{make wordlist TARGET=sme}} in ''victorio''
* investigate unrecognised word forms in the hyphenator


!! Saara
* finalize server of the Xerox tools.
* help Trond with some shell commands
* re-analyze parallel files
* consider implementing some new features to the corpus files
* add closed POSes to the paradigm gen, if needed.
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Sjur
* name lexicon:
** refactor SD-terms editor code
*** finally got multiuser editing safe against data corruption - after more than
    a year!
** implement missing propnouns editing functions
** implement improvements decided upon in Tromsø
* hire linguist and programmer
* investigate unrecognised word forms in the hyphenator
** nothing done, but this has become much easier with the debug output from
   Tomi's PLX conversion.
* decide how to specify compounding behaviour info in the lexicon
** meeting Tuesday 12.30
*** had a meeting, but didn't finish, and the second part disappeared in other
    issues
* {{sma}} discussions with SD (with __Børre__, __Trond__)
* check why some SUB-marked entries got included in the normative transducer
* consider a script for automatic testing of the spell checker in Word
** discussed with __Børre__, pseudocode in place
* get an Intel Mac for testing Windows spellers; get a WinXP license from SD
** make sure we get one/more licenses in Alta
* publish corpus contracts and project infra on NoDaLi-sta
** not yet
* ask SD/Sig-Britt Persson about some of the South Sámi bible texts
* check whether {{lookup}} can be used to generate paradigms for closed POSes
** done, doesn't look like it
* meeting Wednesday at 9.30 with __Børre__ to plan testing
** done, memo available soon (not checked in) in the {{gt/doc/proof/}} dir
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** done some for [risten.no|http://www.risten.no], gleaned through others


!! Thomas
* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt 
** not done
* redo the derived words work for {{smj}}
** done
* decide how to specify compounding behaviour info in the lexicon
** begun
** meeting Tuesday 12.30
*** done
* check why some SUB-marked entries got included in the normative transducer
** not done
* investigate unrecognised word forms in hyphenator
** not done
* test editing of the xml files
** not done
* clean terms-sme.xml such that all names have the correct tag for their use
  (e.g. @type=secondary)
** not done
* write paradigm grammar for the closed POSes
** done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** just added new bugs


!! Tomi
* make PLX lexicon for all open POSes
** not done - how much is done, what is left?
** there is something fishy in the server, Saara has got it working but I
   haven't, we have same compilation
* add derivations to the PLX generation
** not done
* add compound stems to the PLX generation
** added noun
* add closed POS and clitics to PLX generation
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Trond
* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt
* get more {{sma}} texts
** Still waiting
* investigate unrecognised word forms in the hyphenator
** Not done
* decide how to specify compounding behaviour info in the lexicon
** Not done
* meeting Tuesday 12.30
** Meeting kept, but more to discuss.
* {{sma}} discussions with SD (with __Børre__, __Sjur__)
** Not done.
* redo the derived words work for {{smj}}
** Not done.
 [fix bugs!|http://giellatekno.uit.no/bugzilla].


!!!3. Documentation

TODO:
* update all forrest installations, including local patches (__Børre__)
** not yet, will do it in Alta


!!!4. Corpus gathering

Nothing new during last week.

__TODO:__
* get {{sma}} Bible / NT texts (__Trond__)
* Discussions with the Sámi Parliament about {{sma}} (__Børre, Sjur, Trond__)
* ask SD/Sig-Britt Persson about some of the South Sámi bible texts (__Sjur__)


!!!5. Corpus infrastructure

!!More texts to the graphical corpus interface

__Trond__ has talked with __Lars__, who is writing documentation for the users.

TODO:
* Consider whether to implement the <p> only policy or not (__Saara__)
* add text to the server (__Lars__)
* Discuss with Lars (__Trond__)


!!Aligner

No more texts yet, Saara has included the aligner in the relevant perl script.

__TODO:__
* gather more parallel texts (__Trond, Børre__)
* re-analyze parallel files using the command-line version (__Saara__)


!!!6. Infrastructure

!!Xerox tools wrapped as servers

The server hasn't been working for __Tomi,__ but is now working again. The
paradigm generator now only generates 16-17 word forms - far too few. It seems
all possessives have disappeared:


{{{
sajin   NIR # !SUB
sa^jis  NIR
sa^jiin NIR
sa^ji   NIR
sad^jái NIR
sad^ji  NIR
sad^je  NIRL - this is done by client
sa^je   NIRL
sa^ji   NIRL
sa^je   NIRL
sajiis  NIR # !SUB
sa^jiin NIR
sa^jii^guin     NIR
sa^jiid NIR
sa^jii^de       NIR
sa^jit  NIR
sa^jiid NIRL

sajin   NIR #N+Sg+Loc
sa^jis  NIR #N+Sg+Loc
sa^jiin NIR #N+Sg+Com
sa^ji   NIR #N+Sg+Acc
sad^jái NIR #N+Sg+Ill
sad^ji  NIR
sad^je  NIRL #N+Sg+Nom
sa^je   NIRL #N+Sg+Gen
sa^ji   NIRL #N+Sg+Gen
sa^je   NIRL #N+Sg+Gen
sajiis  NIR #N+Pl+Loc
sa^jiin NIR #N+Pl+Loc
sa^jii^guin     NIR #N+Pl+Com
sa^jiid NIR #N+Pl+Acc
sa^jii^de       NIR #N+Pl+Ill
sa^jit  NIR #N+Pl+Nom
sa^jiid NIRL #N+Pl+Gen

Using fsts:
/opt/smi/sme/bin/isme.fst
/opt/smi/sme/bin/hyph-sme.fst
}}}

It should have been using: {{ifst-norm: inverse-norm.fst}}.
The file is available to the server, cf {{/opt/smi/sme/bin/}}:
{{{
-rwxrwxr-x  1 root  cvs    2257 jun 21 10:13 allcaps.fst
-rwxrwxr-x  1 root  cvs      92 jun 21 10:16 cap-sme
-rwxrwxr-x  1 root  cvs 6995574 des  4 00:38 hyph-sme.fst
-rwxrwxr-x  1 root  cvs 1206092 des  4 00:38 isme.fst
-rwxrwxr-x  1 root  cvs 3106957 des  4 00:38 isme-norm.fst
-rwxrwxr-x  1 root  cvs  674609 des  4 00:38 sme-dis.rle
-rwxrwxr-x  1 root  cvs 1251450 des  4 00:38 sme.fst
}}}

__TODO:__
* remove clitics from the paradigm generator (__Saara__)
** done
* investigate why possessives have disappeared from the paradigm generator
  (Number, also a facultative (?) category, has not disappeared)
  (__Saara, Tomi__)
* make sure the normative generator is used when generating paradigms (__Tomi__)


!!Hyphenator


TODO:
* investigate unrecognised word forms (__Maaren, Thomas, Trond, Sjur__)
** possibly the main cause was found in the meeting (non-normative forms
   included in the generated output - these are not recognised by the
   hyphenator, which is strictly normative)


!!!7. Linguistics

!!Names and multilinguality

TODO:
# finish first version of the editing (__Sjur__)
# test editing of the xml files. If ok, then: (__Sjur, Thomas, Trond__)
# make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
  well) (the morphological section should be kept intact, in e.g.
  propernoun-sme-morph.txt) (__Sjur, Saara__)
# convert propernoun-($lang)-lex.txt to a derived file from common xml files
  (__Sjur, Tomi, Saara__)
# start to use the xml file as source file
# clean terms-sme.xml such that all names have the correct tag for their use
  (e.g. @type=secondary) (__Thomas, Maaren, linguists__)
# merge placenames which are errouneously in different entries: e.g. Helsinki,
  Helsingfors, Helsset (__linguists__)
# publish the name lexicon on risten.no (__Sjur__)
# add missing parallel names for placenames (__linguists__)
# add informative links between first names like Niillas and Nils
  (__linguists__)


!!Derivation and spellers like Aspell

TODO:
* redo the work for {{smj}}, including discussion regarding ''Actio''
  (__Thomas, Sjur, Trond__)
** done by __Thomas__


!!North Sámi

The following words are included in the normative list despite being marked
with !SUB:

{{{
accompagnerejun -JUVVON
accompagnerejun accompagneret+V+TV+Der1+Der/j+Der/Pass+PrfPrc
ábuhuvvože -ETNE
ábuhuvvože      ábuhit+V+TV+Pass+Pot+Prs+Du1
áccohallagođežedne -ETNE
áccohallagođežedne      áccohallat+V+TV+Der3+Der/goahti+Pot+Prs+Du1

In sme-lex.txt:
 +Der2+Der/Pass:uvvo DOHPPEINCH ;
 +Der/Pass+PrfPrc:un K ;  !SUB
 +Du1:e K ;  !SUB
 +Du1:edne K ;   !SUB
 +Du1:etne K ;
}}}

These are generated by {{make wordlist TARGET=sme}}, which uses
{{nonrec-sme.fst}} ({{print lower}}).

The last version of the wordlist does not include the errouneous words anymore.
They seem to have disappeared as part of other changes.

TODO:
* investigate the generated word form list sent to Polderland - use the command
  {{make wordlist TARGET=sme}} in ''victorio'' (__Maaren__, __Thomas__)
* check why some SUB-marked entries got included in the normative transducer
  (__Thomas, Sjur__)
** done in the meeting


!!Lule Sámi

TODO:
* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt
  (__Thomas, Trond__)
* hire new linguist (__Sjur__)


!!!8. Name lexicon infrastructure

Decided in Tromsø:
* add logging facilities to the interface
* add option to download local copies of the lexicon files directly from the db
* batch editing (change all entries in the found set), should later be enhanced
  to allow selection of exceptions (the found set minus deselected items)
* tag for excluding/including a name from certain applications
* future epxansion: choose what info to display in the single language browser
* display existing language entries when adding a new language to a record
* add editor to change single, existing entries


Details can be found in [the meeting
memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]

TODO:
* develop the needed XQueries and UI (__Sjur, Tomi__)

Postponed:
* data synchronisation between [risten.no|http://www.risten.no] and the cvs repo
* new version of xml2lexc (based on ccat), should handle complex names correct:
  construct entries like we have now from the different parts of a complex name
  entry


!!!9. Spellers

!!Polderland data generation

All other open POSes now included in the paradigm generator, below is a verb
example:

{{{
galggaže        VIR #V+Pot+Prs+Du1
galggažedne     VIR #V+Pot+Prs+Du1
galg^ga^žet^ne  VIR #V+Pot+Prs+Du1
galg^ga^žeh^pet VIR #V+Pot+Prs+Pl2
galg^gaš        VIR #V+Pot+Prs+ConNeg
galg^ga^žat     VIR #V+Pot+Prs+Pl1
galg^ga^žit     VIR #V+Pot+Prs+Pl1
galg^ga^žan     VIR #V+Pot+Prs+Sg1
galg^ga^žea^ba  VIR #V+Pot+Prs+Du3
galg^ga^žat     VIR #V+Pot+Prs+Sg2
galg^ga^žeahp^pi        VIR #V+Pot+Prs+Du2
galg^ga^žit     VIR #V+Pot+Prs+Pl3
galg^ga^ža      VIR #V+Pot+Prs+Sg3
galg^ga^leim^me VIR #V+Cond+Prs+Du1
galg^ga^šeim^me VIR #V+Cond+Prs+Du1
galg^ga^leid^det        VIR #V+Cond+Prs+Pl2
galg^ga^šeid^det        VIR #V+Cond+Prs+Pl2
galg^ga^le      VIR #V+Cond+Prs+ConNeg
galg^ga^še      VIR #V+Cond+Prs+ConNeg
galg^ga^leim^met        VIR #V+Cond+Prs+Pl1
galg^ga^šeim^met        VIR #V+Cond+Prs+Pl1
...
}}}

__TODO:__
* write paradigm grammar for the closed POSes (__Thomas__)
** done
* check whether {{lookup}} can be used to generate paradigms for closed POSes
  (__Sjur__)
** couldn't find anything
* decide how to specify compounding behaviour info for the lexicon
  (__Thomas, Trond, Sjur__)
** meeting Tuesday at 12.30
*** done, needs to be followed up
* make inflection PLX lexicon for all open POSes (__Tomi__)
** done
* add closed POS and clitics to PLX generation (__Tomi__)
* add derivations to the PLX generation (__Tomi__)
* add compound stems to the PLX generation (__Tomi__)


!!Aspell

TODO when the major part of the PLX conversion is done:
* add Aspell/Hunspell data generation to the lexc2xspell (__Tomi__ - after the
  PLX data generation is finished)
* study Hunspell, perhaps also Soikko (__Børre, Sjur, Tomi__)


!!Testing

When the PLX-based speller is ready: use the generated word list as test input:
all should be accepted (coverage self-testing). Pick random 1% and randomly
change them with edit distance 1, run through speller = testing false positives

We need a meeting to plan testing. We'll do it shortly this week, and perhaps
a longer meeting in Alta.

__TODO:__
* meeting Wednesday at 9.30 (__Sjur, Børre__)
** done
* consider a script (shell+AppleScript?) for automatic testing of MS Word
  (__Sjur, Børre__)
** pseudocode written, some AppleScript
* get an Intel Mac for testing Windows spellers; get a WinXP license from SD
  (__Børre, Sjur__)


!!!10. Other

!!Corpus contracts

TODO:
* publish corpus contracts and project infra on NoDaLi-sta (__Sjur__)
** not yet


!!Bug fixing

__56__ open Divvun/Disamb bugs, and __23__ risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)


!!Task lists as iCal entries

TODO:
* update Maaren's Forrest installation (__Børre__)


!!!11. Next meeting, closing

The next meeting is 11.12.2006, 09:30 Norwegian time.

The meeting was closed at 11:22.

!!!Appendix - task lists for the next week

!! Boerre

* contact authors who have already received the corpus licensing contract
* continue work on script for automatic testing of the spell checker in Word
* {{sma}} discussions with SD (with __Sjur__, __Trond__)
* get an Intel Mac for testing Windows spellers; get a WinXP license from SD
* update all forrest installations, including local patches
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Maaren

* investigate the generated word form list sent to Polderland - use the command
  {{make wordlist TARGET=sme}} in ''victorio''


!! Saara

* finalize server of the Xerox tools.
* help Trond with some shell commands
* re-analyze parallel files
* consider implementing some new features to the corpus files
* add closed POSes to the paradigm gen, if needed.
* investigate why possessives have disappeared from the paradigm generator
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Sjur

* name lexicon:
** refactor SD-terms editor code
** implement missing propnouns editing functions
** implement improvements decided upon in Tromsø
* hire linguist and programmer
* decide how to specify compounding behaviour info in the lexicon
* {{sma}} discussions with SD (with __Børre__, __Trond__)
* get an Intel Mac for testing Windows spellers; get a WinXP license from SD
* publish corpus contracts and project infra on NoDaLi-sta
* ask SD/Sig-Britt Persson about some of the South Sámi bible texts
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Thomas

* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt 
* decide how to specify compounding behaviour info in the lexicon
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Tomi

* add closed POS and clitics to PLX generation
* add derivations to the PLX generation
* add compound stems to the PLX generation
* make sure the normative generator is used when generating paradigms
* investigate why possessives have disappeared from the paradigm generator
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Trond

* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt
* get more {{sma}} texts
* decide how to specify compounding behaviour info in the lexicon
* {{sma}} discussions with SD (with __Børre__, __Sjur__)
 [fix bugs!|http://giellatekno.uit.no/bugzilla].