!!!Meeting setup

* Date: 19.02.2007
* Time: 09.00 Norw. time
* Place: Internet
* Tools: SubEthaEdit, iChat

!!!Agenda

# Opening, agenda review
# Reviewing the task list from last week
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Infrastructure
# Linguistics
# name lexicon infrastructure
# Spellers
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 09:49.

Present: __Børre, Sjur, Thomas, Tomi, Trond__

Absent: __Maaren, Saara, Steinar__

Agenda accepted as is.

!!!2. Updated task status since last meeting

!! Børre
* write form to request corpus user account
** not done
* document how to apply for access to closed corpus, and details on the corpus
  and its use in general
** some done
* update and fix our documentation and infrastructure as __Steinar__ finds
  problem areas
** not done
* continue work on script for automatic testing of the spell checker in Word
** not done
* fix {{sme}} texts in corpus this month
** not done
* find missing {{nob}} parallel texts in corpus
** not done
* work on the Polderland data generation (PLX format conversion)
** not done
* go through other directories, fix parallellity information for other documents
** not done
* add {{sma}} texts to the corpus repository
** not done
* move the G5 to the basement
** done
* add info to front page (incl. download links)
** not done
* write separate page with detailed info (incl. download links)
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** not done


!! Maaren
* lexicalise actio compounds

!! Saara
* fix {{sme}} texts in corpus this month
* continue aligning the rest of the parallel files
* fix problems with xml2lexc if needed
* have some holiday first
* start improving the corpus interface for Sámi in Oslo.
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Sjur
* name lexicon:
** refactor the rest of the SD-terms editor code
** implement missing propnouns editing functions
** implement improvements decided upon in Tromsø
** add cvs synchronisation
*** worked on this one - settled on a standardised pretty-print format, made
    XML::Twig produce it (patch submitted); looked for ways to guarantee fixed
    sort order (Unicode character code collation) using XSL, didn't succeed yet
* hire linguist and programmer
* publish corpus contracts and project infra as open-source on NoDaLi-sta
* fix stuorra-oslolaš lower case {{o}}
* write form to request corpus user account
* document how to apply for access to closed corpus, and details on the corpus
  and its use in general
* get an Intel Mac for __Tomi__
* write press release for the beta
** first version done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
* other tasks done:
** helped with PLX generation of number
** received beta from Polderland, installed and tested
** other work with the beta release
** installed resources and tools for compiling PLX files into binary spellers
** wrote a start on a page for the beta release


!! Steinar
* test our infrastructure and documentation - follow the documentation exactly,
  and find problem areas - report problems to __Børre__. Start: At the front
  page.
* Complete the semantic sets in sme-dis.rle
* missing lists
* report conversion errors to __Saara__
* Look at the actio compound issue when adding from missing lists
* lexicalise actio compounds. Example: ''vuolggasadji''  vs. ''vuolginsadji''
* Go through the Num bugs
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Thomas
* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt 
* work with compounding
* Lack of lowering before hyphen: Twol rewrite.
* fix stuorra-oslolaš lower case {{o}}
* implement discontinous case inflection for {{sme}} numbers
* produce correct number base forms in the {{sme}} analyzer
* translate beta release docs to {{sme}} and {{smj}}
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Tomi
* improve numerals in the speller
* add prefixes to the PLX
* add derivations to the PLX generation
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Trond
* update the {{smj}} proper noun lexicon, and refine the morphological analysis,
  cf. the propernoun-smj-lex.txt
** Worked with Thomas on this, not finished yet.  
* fix {{sme}} texts in corpus this month
** Worked on the aligment of the texts.
* find missing {{nob}} parallel texts in corpus, go through Saara's list
** Found some, but it turned out we had them already! We must align what we 
   have, and also go through our aligner again.
* Go through the Num bugs
** Looked at some num paradigms, but no bug closed.
* [fix bugs!|http://giellatekno.uit.no/bugzilla].

!!!3. Documentation

TODO:
* write form to request corpus user account (__Børre, Sjur, Trond__)
* document how to apply for access to closed corpus, and details on the corpus
  and its use in general (__Børre, Sjur, Trond__)
* correct and imrove it based on feedback from __Steinar__ (__Børre__)


!!!4. Corpus gathering

TODO:
* {{sme}} texts: no new additions, fix corpus errors during this month
  (__Børre, Trond, Saara__)
* missing {{nob}} parallel texts should be added if such holes are found
  (__Børre, Trond__)
* Go through the list of missing or errouneous {{nob}} texts, based upon
  __Saara's__ perfect list (__Børre, Trond__)
* add {{sma}} texts to the corpus repository (__Børre__)


!!!5. Corpus infrastructure

!!Alignment

Main news: We have a working parallel corpus online.

Notes about the interface (or lack of documentation): the first search field in
the form needs to be filled; to get the parallell texts in the search result,
make sure to click ''add phrase'' and specify the language to be the other one.

__TODO:__
* go through other directories (nob dicrectories, sd directories), fix 
  parallellity information for other documents (2 hours)
  (__Børre__)


!!Conversion issues

__TODO:__
* report conversion errors to __Saara__ (__Trond, Steinar__)


!!!6. Infrastructure

__Børre__ and __Steinar__ have both started on the task of testing and
correcting the documentation.

__TODO:__
* test our infrastructure and documentation - follow the documentation exactly,
  and find problem areas - report problems to __Børre__. Start: At the front
  page. (__Steinar__)
* update and fix our documentation and infrastructure as __Steinar__ finds
  problem areas (__Børre__)


!!!7. Linguistics

!!Numbers:

__Thomas__ is almost finished with correcting the number part of the {{sme}}
analyzer.

TODO:
* discontinous case inflection in  {{sme}} (but only for maximally three-part
  compound numerals) ({{viđain/goalmmát/logiin}} and {{guvttiin/logiin/viđain}})
  (__Thomas__)
* produce correct number base forms in the {{sme}} analyzer (__Thomas__)
* Go through the {{sme}} Num bugs (__Thomas__)


!!North Sámi

TODO:
* lexicalise actio compounds. Example: ''vuolggasadji''  vs. ''vuolginsadji''
  (__Maaren__)
* fix stuorra-oslolaš lower case {{o}} (__Sjur, Thomas, Trond__)


!!Lule Sámi

TODO:
* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt
  (__Thomas, Trond__)


!!!8. Name lexicon infrastructure

Decisions made in Tromsø can be found in [the meeting
memo.|/doc/admin/physical_meetings/tromso-2006-08-propnoun.html]

Postponed:

TODO:
# finish first version of the editing (__Sjur__)
# test editing of the xml files. If ok, then: (__Sjur, Thomas, Trond__)
# make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
  well) (the morphological section should be kept intact, in e.g.
  propernoun-sme-morph.txt) (__Sjur, Saara__)
# convert propernoun-($lang)-lex.txt to a derived file from common xml files
  (__Sjur, Tomi, Saara__)
# implement data synchronisation between [risten.no|http://www.risten.no] and
   the cvs repo, and possibly other servers (ie the G5 as an alternative server
   to the public risten.no - it might be faster and better suited than the
   official one; also local installations could be treated the same way)
# start to use the xml file as source file
# clean terms-sme.xml such that all names have the correct tag for their use
  (e.g. @type=secondary) (__Thomas, Maaren, linguists__)
# merge placenames which are errouneously in different entries: e.g. Helsinki,
  Helsingfors, Helsset (__linguists__)
# publish the name lexicon on risten.no (__Sjur__)
# add missing parallel names for placenames (__linguists__)
# add informative links between first names like Niillas and Nils
  (__linguists__)


!!!9. Spellers

!!Polderland data generation

__TODO:__
# improve number conversion (__Børre, Tomi__)
# add prefixes to the PLX (__Børre, Tomi__)
# add derivations to the PLX generation (__Børre, Tomi__)
## next after numbers are fixed


!!OOo speller(s)

TODO after the MS Office Beta is delivered:
* add Aspell/Hunspell data generation to the lexc2xspell (__Tomi__ - after the
  PLX data generation is finished)
* study Hunspell, perhaps also Soikko (__Børre, Sjur, Tomi__)


!!Testing

__TODO:__
* get an Intel Mac for Tomi (__Sjur__)
** not yet


!!Localisation

We need to translate the info added to our front page (and a separate page)
regarding the beta release. Also the press release needs to be translated.

TODO:
* translate beta release docs to {{sme}} (__Thomas__)
* translate beta release docs to {{smj}} (__Thomas__)


!!Beta release

Tentative beta release: Thursday 15.2. - but it might be delayed till later in
February, since we still have no beta from Polderland.

In the beta, {{sme}} is now Catalan, whereas {{smj}} is Basque.

All beta packages (mklex tools, Win and Mac tools) can be copied from __Sjur's__
home dir on the G5:
{{{
/Users/sjur/mklex.zip
/Users/sjur/SamiProofingtools_beta-Mac.dmg
/Users/sjur/SamiProofingtools_beta-Win.zip
}}}

mklex -M256 -p sami_north_phon* revInputSamiNort.plx mssp3samiNorthern.lex

SamiNortAsCatalan2007-02sp

The PLX compilers (one each for sme, smj), which compiles the specified source
files into ready speller files (or almost ready in the Mac case), are now
installed on the G5. Follow the instructions given in the Word document found
inside the {{mklex.zip}} package above PLUS one addition when compiling the Mac
version:

As a last step after the lexicon is compiled, use the tool
{{/Developer/Tools/Rez}} to add a resource fork with the content in
{{gt/sm(e|j)/polderland/*Lex.rsrc.hex}} to the lexicon file and tool
{{/Developer/Tools/SetFile}} to add creator and type and set custom icon. The 
command sequence should be something like the following:

{{{
cd gt/
export RIncludes=/System/Library/Frameworks/Carbon.framework/Headers/
/Developer/Tools/Rez sme/polderland/CatalanLex.rsrc.hex -a -o $SpellerLexiconFil
/Developer/Tools/SetFile -a CI -c MSOF -t HMSD $SpellerLexiconFile
}}}

This step is necessary to make MS Office recognise the speller lexicon file as a
real such file. It will add an icon and a language ID to the file.

PLX files:
* adjectives
** AdjectiveRoot
* verbs
** VerbRoot
** Copula
** Negativeverb
* nouns
** NounRoot
* propernouns
** ProperNoun
* Adverb
* Conjunction
* Interjection
* Particles
* Adposition (pp)
* Pronoun
* Subjunction
* Numerals
* pp

All ok, except numerals.

DONE:
* delivered PLX data of {{sme}} and {{smj}} including compounding
* translated Windows installer to {{sme}} and {{smj}}
* installed PLX compiler in G5 at {{/usr/local/bin/mklex*}} (one version for
  {{sme}} and one for {{smj}})
* added resources needed for compiling PLX lexicons to our cvs repo
* tested the beta drop from Polderland - good we did, it is absolutely
  unacceptable (our responsibiliby - only linguistic errors (poor coverage)
  found so far)

TODO:
* write press release (__Sjur__)
** done first draft, see {{xtdoc/sd/.../xdocs/pr/}}
* add info to front page (incl. download links) (__Børre__)
* write separate page with detailed info (incl. download links) (__Børre__)
** __Sjur__ wrote a start
* test the beta release from Polderland thoroughly before it is released
  (__all__):
** download and installation
** documentation
** technical performance
** linguistic performance:
*** true positives (correctly recognised misspellings)
*** false positives (correct words errouneously marked as misspellings)
*** false negatives (misspellings not recognised by the speller)
*** true negatives (correctly spelled words recognised as such by the speller)
*** suggestions
** all tests on both Mac and Win - Windows only (__Børre, Sjur, Thomas__)
* compile new speller lexicons using the mklex* tools on the G5, following the
  instructions given in the documentation from Polderland (in mklex beta drop)
  (__Tomi, Børre__)
* add compilation of MS Office spellers part of the Makefile (__Tomi__)
* install Windows and MS Office; test tools on Windows (__Børre, Thomas__)
* collect a list of PR recipients, forward to
* questions for Polderland (__Børre__):
** version info in the speller?
** remaking/updating the installer packages with linguistic updates - who?


!!Compilation

Adjectives compile at 60 sec/adjective, i.e. (5000*60) / 3600 = 83 hrs
Nouns compile at 3 sec/noun,            i.e. (23600*3) / 3600 = 19 hrs


!!Testing

Different ways of testing:

# Impressionistic, functionality: try the program, try all the functions
# Impressionistic, coverage: try the program on different texts, look for false positives
# Systematic (in order of importance):
## Make a corpus of texts, from different genres (can be done before 0.2 release)
### For each text, detect precision
### For each text, detect recall
### For each text, detect accuracy

Before beta release: precision is important, but have a look at recall as well.


Recall and precision

* precision = tp / ( tp + fp )  = true redlines / all redlines
** can we trust that the redlines are actually errors?
** Task: check all hits 
** (test p, are they tp or fp?)
* recall = tp / ( tp + fn)      = true redlines / all errors in doc
** can we trust that all errors are actually found? 
** Task: check every single word 
** (test p, are they tp or fp, test n, are they tn or fn?)
* accuracy = tp + tn / tp + fp + fn + tn = overall performance

Definitions:

* true positives (correctly recognised misspellings)
* false positives (correct words errouneously marked as misspellings)
* false negatives (misspellings not recognised by the speller)


!!Timetable
# The next beta version (beta 0.2) is ready tuesday at xx h?
# Testing 0.2: Thomas, Steinar, Trond, Ilona, ...
# 0.3 compilation starts at thursday
** what to compile? only the improved *-sm{ej}-lex.txt files
# The next beta version (beta 0.3) is ready sunday
# Monday: Testing beta 0.3 for unpleasent surprises
# We release beta 0.3 on Tuesday, unless there are surprises
# If there are surprises, we must compile again, this time 0.4
# Deadline for documentation as already(?) stated

compile a: *sm(e|j)-lex.txt to *-plx.txt = 83 hrs?
sort: *-plx.txt
compile b: -plx.txt to .sp = ? hrs


two-phase sort:

now: 
* cat *plx.txt > sme-plx.txt
* cat sme-plx.txt | sort -r | uniq > all_except_nouns-sme-plx.txt

tomorrow:
* cat noun-sme-plx.txt all_except_nouns-sme-plx.txt | sort -r | uniq > sme-plx.txt
* sort complexity = N

one-phase sort:
* tomorrow:
* cat *plx.txt > sme-plx.txt
* cat sme-plx.txt | sort -r | uniq > all_except_nouns-sme-plx.txt


!!!10. Other

!!Corpus contracts

TODO:
* publish corpus contracts and project infra as open-source on NoDaLi-sta
  (__Sjur__)


!!Bug fixing

__57__ open Divvun/Disamb bugs, and __23__ risten.no bugs


!!Moving G5

TODO:
* move the G5 to the basement (__Børre__)
** moved, new IP 129.242.220.113

!!!11. Next meeting, closing

The next meeting is 26.2.2007, 09:30 Norwegian time.

The meeting was closed at 11:12.

!!!Appendix - task lists for the next week

!! Boerre
[iCal|/doc/admin/weekly/2007/Tasks_2007-02-19_Boerre.ics]
* write form to request corpus user account
* document how to apply for access to closed corpus, and details on the corpus
  and its use in general
* update and fix our documentation and infrastructure as __Steinar__ finds
  problem areas
* continue work on script for automatic testing of the spell checker in Word
* fix {{sme}} texts in corpus this month
* find missing {{nob}} parallel texts in corpus
* work on the Polderland data generation (PLX format conversion)
* go through other directories, fix parallellity information for other documents
* add {{sma}} texts to the corpus repository
* move the G5 to the basement
* add info to front page (incl. download links)
* write separate page with detailed info (incl. download links)
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Maaren
[iCal|/doc/admin/weekly/2007/Tasks_2007-02-19_Maaren.ics]
* lexicalise actio compounds

!! Saara
[iCal|/doc/admin/weekly/2007/Tasks_2007-02-19_Saara.ics]
* fix {{sme}} texts in corpus this month
* continue aligning the rest of the parallel files
* fix problems with xml2lexc if needed
* have some holiday first
* start improving the corpus interface for Sámi in Oslo.
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Sjur
[iCal|/doc/admin/weekly/2007/Tasks_2007-02-19_Sjur.ics]
* name lexicon:
** refactor the rest of the SD-terms editor code
** implement missing propnouns editing functions
** implement improvements decided upon in Tromsø
* hire linguist and programmer
* publish corpus contracts and project infra as open-source on NoDaLi-sta
* fix stuorra-oslolaš lower case {{o}}
* write form to request corpus user account
* document how to apply for access to closed corpus, and details on the corpus
  and its use in general
* get an Intel Mac for __Tomi__
* write press release for the beta
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Steinar
[iCal|/doc/admin/weekly/2007/Tasks_2007-02-19_Steinar.ics]
* test our infrastructure and documentation - follow the documentation exactly,
  and find problem areas - report problems to __Børre__. Start: At the front
  page.
* Complete the semantic sets in sme-dis.rle
* missing lists
* report conversion errors to __Saara__
* Look at the actio compound issue when adding from missing lists
* lexicalise actio compounds. Example: ''vuolggasadji''  vs. ''vuolginsadji''
* Go through the Num bugs
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Thomas
[iCal|/doc/admin/weekly/2007/Tasks_2007-02-19_Thomas.ics]
* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt 
* work with compounding
* Lack of lowering before hyphen: Twol rewrite.
* fix stuorra-oslolaš lower case {{o}}
* implement discontinous case inflection for {{sme}} numbers
* produce correct number base forms in the {{sme}} analyzer
* translate beta release docs to {{sme}} and {{smj}}
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Tomi
[iCal|/doc/admin/weekly/2007/Tasks_2007-02-19_Tomi.ics]
* improve numerals in the speller
* add prefixes to the PLX
* add derivations to the PLX generation
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Trond
[iCal|/doc/admin/weekly/2007/Tasks_2007-02-19_Trond.ics]
* update the {{smj}} proper noun lexicon, and refine the morphological analysis,
  cf. the propernoun-smj-lex.txt
* fix {{sme}} texts in corpus this month
* find missing {{nob}} parallel texts in corpus, go through Saara's list
* Go through the Num bugs
* [fix bugs!|http://giellatekno.uit.no/bugzilla].