!!!Meeting setup

* Date: 03.01.2007
* Time: 09.30 Norw. time
* Place: Where we are
* Tools: SubEthaEdit, iChat

!!!Agenda

# Opening, agenda review
# Reviewing the task list from last week
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Infrastructure
# Linguistics
# name lexicon infrastructure
# Spellers
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 10:42.

Present: __Børre, Saara, Sjur, Steinar, Thomas, Trond__

Absent: __Maaren, Tomi__

Agenda accepted as is.

!!!2. Updated task status since last meeting

!! Børre
* contact authors who have already received the corpus licensing contract
** not done
* continue work on script for automatic testing of the spell checker in Word
** not done
* update setup and installation instructions for new users/computers
** not done
* create new forrest tarball
** not done
* cvs up of the public server should be done for {{xtdoc/sd/documentation/}}
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** not done
* other:
** fixed link errors in gtuit/doc


!! Maaren
* investigate the generated word form list sent to Polderland - use the command
  {{make wordlist TARGET=sme}} in ''victorio''


!! Saara
* help Trond with some shell commands
* re-analyze parallel files
** done, aligner now works.
* Move to Bugzilla: vislcg server-friendly as feature request to the vislcg devs
** done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** done some


!! Sjur
* name lexicon:
** refactor SD-terms editor code
** implement missing propnouns editing functions
** implement improvements decided upon in Tromsø
*** started work on a refactor of some basic code, to allow greater flexibility,
    easier coding, and hopefully a better interface
* hire linguist and programmer
* decide how to specify compounding behaviour info in the lexicon
** combinatorial behaviour decided; specification of form still not done
* get an Intel Mac for testing Windows spellers
* publish corpus contracts and project infra on NoDaLi-sta
* fix forrest installations for Maaren, Disamb
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Thomas
* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt
** we've begun
* decide how to specify compounding behaviour info in the lexicon
** on our way
* go through the class of GAHPIR words, and try to generalise the compounding
  behaviour
** done
* change whatever is needed based on the above generalisation
** done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** worked with bugs


!! Tomi
* add closed POS and clitics to PLX generation
* add derivations to the PLX generation
* add compound stems to the PLX generation
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Trond
* update the {{smj}} proper noun lexicon, and refine the morphological analysis,
  cf. the propernoun-smj-lex.txt
** We have done some work on this.
* Go through the GAHPIR lexicon (with Thomas)
** Done
* get more {{sma}} texts
** No news.
* decide how to specify compounding behaviour info in the lexicon
** At least we worked on this, what was the final outcome again?
* [fix bugs!|http://giellatekno.uit.no/bugzilla].
** Done some work on bugs, closed a couple.


!!!3. Documentation

TODO:
* either fix forrest installations (__Sjur__), or create a new tarball
  (__Børre__)
* cvs up of the public server should be done for {{xtdoc/sd/documentation/}}
  - notice the directory! (__Børre__)
* update setup and installation instructions (__Børre__)


!!!4. Corpus gathering

We should probably concentrate upon mending what we have, up to Febrary 8th.

__Børre__ has been fixing the meta information for many files, adding info on parallel language versions.

TODO:
* {{sme}} texts: Fix what we have for this month (__Børre, Trond, Saara__)
* missing {{nob}} parallel texts could be an issue (__Børre, Trond__)


!!!5. Corpus infrastructure

!!Aligner

The command line aligner is in place, and most(?)/all(?) potential texts are 
aligned.

__TODO:__
* make an overview of status quo (__Trond, Børre, Saara__). __Børre__ has added
  parallellity status to all {{admin/sd/}} texts
* gather more parallel texts (__Trond, Børre__)
** not any more? (depends on what we are missing and how hard they are to get,
   but we will not put much time into it)
* re-analyze parallel files using the command-line version (__Saara__)
** progressing
* when aligned, send aligned, xml {{nob}} texts to __Lars__ (__Saara__)


!!Conversion issues

Analysis statistics:

{{{
	words	missing	% recognised
adm	1493746	28967	98,06
bib	241233	2651	98,90
fac	382353	11371	97,03
fic	70601	7847	88,89
law	284700	6269	97,80
new	4032848	17134	99,58
}}}

There are still a lot of conversion errors, and fixing them should be a priority
this month. Error types:

* pdf files (there will always be some errors here)
* string-string replacement during the xsl conversion
** Conversion errors (but we want to fix the conversion)
** typos (should be marked up using our error markup tags) 
** grammatical errors (as above)

So far, we only have had character replacement (even context-less character
replacement). We can use XSL string replacement to correct spelling errors to
constructs like, cf the {{corpus.dtd}} and [a previous meeting memo
|https://giellalt.uit.no/admin/weekly/2006/Meeting_2006-03-20.html#Correction+tags%3F]:

{{{
<error correct="corrected text">erroneous text</error>
}}}

Either we use the function __Trond__ found, or we install and use Saxon 8 and
XSL 2, which have good string manipulation tools.

Flow:
* orig
* xsl
** corr-of-textstring (not implemented yet) (we is > we are) (we keep the
   orig text, and the correction is added as additional info in an attribute,
   see above)
* xml
* preprocess
* corr-of-token=typos.txt eror>error
* analysis

Do we have good tools for localising conversion errors? Problem: I spot an error 
in the analysis. Which file is it in? 
{{ccat}} removes file source info, and we are thus in the blind. Conclusion: we
don't have a good tool for this, only grep.


__TODO:__
* add correction markup to the xml files (string-to-correction markup)
  (__Saara__)
* report conversion errors to __Saara__ (__Trond, Steinar__)


!!!6. Infrastructure

!!Xerox tools wrapped as servers

__TODO:__
* find a way of integrating {{vislcg}} as a server, or send a feature request
  to the {{vislcg}} developers (__Saara__)
** move this to Bugzilla
*** done


!!!7. Linguistics

!!Names and multilinguality

TODO:
# finish first version of the editing (__Sjur__)
# test editing of the xml files. If ok, then: (__Sjur, Thomas, Trond__)
# make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
  well) (the morphological section should be kept intact, in e.g.
  propernoun-sme-morph.txt) (__Sjur, Saara__)
# convert propernoun-($lang)-lex.txt to a derived file from common xml files
  (__Sjur, Tomi, Saara__)
# start to use the xml file as source file
# clean terms-sme.xml such that all names have the correct tag for their use
  (e.g. @type=secondary) (__Thomas, Maaren, linguists__)
# merge placenames which are errouneously in different entries: e.g. Helsinki,
  Helsingfors, Helsset (__linguists__)
# publish the name lexicon on risten.no (__Sjur__)
# add missing parallel names for placenames (__linguists__)
# add informative links between first names like Niillas and Nils
  (__linguists__)


!!North Sámi

TODO:
* go through the class of GAHPIR words, and try to generalise the compounding
  behaviour (__Thomas__)
** done
* change whatever is needed based on the above generalisation
  (__Thomas, Trond__)
** done


!!Lule Sámi

TODO:
* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt
  (__Thomas, Trond__)
** progressing - good!
* update proper noun lexicon with copy of {{sme}} lexicon  (__Trond__)
** done


!!!8. Name lexicon infrastructure

Decided in Tromsø:
* add logging facilities to the interface
* add option to download local copies of the lexicon files directly from the db
* batch editing (change all entries in the found set), should later be enhanced
  to allow selection of exceptions (the found set minus deselected items)
* tag for excluding/including a name from certain applications
* future epxansion: choose what info to display in the single language browser
* display existing language entries when adding a new language to a record
* add editor to change single, existing entries


Details can be found in [the meeting
memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]

TODO:
* develop the needed XQueries and UI (__Sjur, Tomi__)
* try to make a first version of xml2lexc in Perl for testing and preparation
  for the big jump (__Saara__)

Postponed:
* data synchronisation between [risten.no|http://www.risten.no] and the cvs repo
* new version of xml2lexc (based on ccat), should handle complex names correct:
  construct entries like we have now from the different parts of a complex name
  entry


!!!9. Spellers

!!Polderland data generation

__TODO:__
* decide how to specify compounding behaviour info for the lexicon
  (__Thomas, Trond, Sjur__)
** partially done - compound stem specification still open
* add closed POS and clitics to PLX generation (__Tomi__)
* add derivations to the PLX generation (__Tomi__)
* add compound stems to the PLX generation (__Tomi__)


!!Aspell

TODO when the major part of the PLX conversion is done:
* add Aspell/Hunspell data generation to the lexc2xspell (__Tomi__ - after the
  PLX data generation is finished)
* study Hunspell, perhaps also Soikko (__Børre, Sjur, Tomi__)


!!Testing

__TODO:__
* get an Intel Mac for testing Windows spellers (__Børre, Sjur__)

!!Localisation

The Windows installer should be localised. We have received the string file from
Polderland, now they should be translated.

The Mac installer they use is the standard MacOS X installer, and as {{sme}}
isn't generally turned on as a UI language in MacOS X, there is no big point in
translating it. It could be done, though:-)

TODO:
* send translation text to __Børre__ and __Thomas__ (__Sjur__)
* translate Windows installer to {{sme}} and {{smj}} (__Børre, Thomas__)


!!!10. Other

!!Corpus contracts

TODO:
* publish corpus contracts and project infra on NoDaLi-sta (__Sjur__)
** still not done


!!Bug fixing

__55__ open Divvun/Disamb bugs, and __23__ risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)


!!!11. Next meeting, closing

The next meeting is 8.1.2007, 09:00 Norwegian time.

The meeting was closed at 12:12.

!!!Appendix - task lists for the next week

!! Boerre

* contact authors who have already received the corpus licensing contract
* continue work on script for automatic testing of the spell checker in Word
* recreate our forrest tarball
* update setup and installation instructions for new users/computers
* create new forrest tarball
* cvs up of the public server should be done for {{xtdoc/sd/documentation/}}
* fix {{sme}} texts in corpus this month
* find missing {{nob}} parallel texts in corpus
* aligner status quo
* translate Windows installer to {{sme}} and {{smj}}
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Maaren

* investigate the generated word form list sent to Polderland - use the command
  {{make wordlist TARGET=sme}} in ''victorio''


!! Saara

* help Trond with some shell commands
* fix {{sme}} texts in corpus this month
* aligner status quo
* send aligned, xml {{nob}} texts to __Lars__
* add correction markup to the xml files (string-to-correction markup)
* first new version of xml2lexc in Perl
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Sjur

* name lexicon:
** refactor SD-terms editor code
** implement missing propnouns editing functions
** implement improvements decided upon in Tromsø
* hire linguist and programmer
* decide how to specify compounding behaviour info in the lexicon
* get an Intel Mac for testing Windows spellers
* publish corpus contracts and project infra on NoDaLi-sta
* fix forrest installations for Maaren, Disamb
* send Windows installer translation text to __Børre__ and __Thomas__
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Steinar

* conversion error screening 
* missing lists
* report conversion errors to __Saara__
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Thomas

* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt 
* decide how to specify compounding behaviour info in the lexicon
* translate Windows installer to {{sme}} and {{smj}}
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Tomi

* add closed POS and clitics to PLX generation
* add derivations to the PLX generation
* add compound stems to the PLX generation
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Trond

* update the {{smj}} proper noun lexicon, and refine the morphological analysis,
  cf. the propernoun-smj-lex.txt
* get more {{sma}} texts
* aligner status quo
* decide how to specify compounding behaviour info in the lexicon
* Set up work on missing and converision screening with Steinar.
* fix {{sme}} texts in corpus this month
* find missing {{nob}} parallel texts in corpus
* report conversion errors to __Saara__
* [fix bugs!|http://giellatekno.uit.no/bugzilla].