!!!Meeting setup

* Date: 08.01.2007
* Time: 09.00 Norw. time
* Place: Where we are
* Tools: SubEthaEdit, iChat

!!!Agenda

# Opening, agenda review
# Reviewing the task list from last week
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Infrastructure
# Linguistics
# name lexicon infrastructure
# Spellers
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 9:24.

Present: __Børre, Saara, Sjur, Steinar, Thomas, Trond__

Absent: __Maaren, Tomi__

Agenda accepted as is.

!!!2. Updated task status since last meeting

!! Børre
* contact authors who have already received the corpus licensing contract
** still not done
* continue work on script for automatic testing of the spell checker in Word
** nothing done
* recreate our forrest tarball
** done
* update setup and installation instructions for new users/computers
** done
* create new forrest tarball
** done
* cvs up of the public server should be done for {{xtdoc/sd/documentation/}}
** done
* fix {{sme}} texts in corpus this month
** nothing done
* find missing {{nob}} parallel texts in corpus
** nothing done
* aligner status quo
** works
* translate Windows installer to {{sme}} and {{smj}}
** translated a few strings
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** not done
* other
** had to setup (www.)divvun.no to serve static content directly via apache. 
   jetty runs out of memory if it has to serve huge files. Only for big/huge
   files, the rest of the site is served by Jetty as before.


!! Maaren
* investigate the generated word form list sent to Polderland - use the command
  {{make wordlist TARGET=sme}} in ''victorio''


!! Saara
* help Trond with some shell commands
** this is related to bug 355.
* fix {{sme}} texts in corpus this month
** done some
* aligner status quo
** aligner is working
* send aligned, xml {{nob}} texts to __Lars__
** not done
* add correction markup to the xml files (string-to-correction markup)
** faced some problems
* first new version of xml2lexc in Perl
** started, test version will be available today.
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Sjur
* name lexicon:
** refactor SD-terms editor code
** implement missing propnouns editing functions
** implement improvements decided upon in Tromsø
*** continued refactoring and redesign of the risten.no code: risten.no now has
    full i18n support, is using the latest Forrest to overcome several
    shortcomings of the previous version, and is soon ready for a more flexible
    editor
* hire linguist and programmer
* decide how to specify compounding behaviour info in the lexicon
* get an Intel Mac for testing Windows spellers
* publish corpus contracts and project infra on NoDaLi-sta
* fix forrest installations for Maaren, Disamb
* send Windows installer translation text to __Børre__ and __Thomas__
** done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Steinar
* conversion error screening 
** not done
* missing lists
** done some work  
* report conversion errors to __Saara__
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** not done


!! Thomas
* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt
** not done
* decide how to specify compounding behaviour info in the lexicon
** nothing this week
* translate Windows installer to {{sme}} and {{smj}}
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** worked hard with bugs


!! Tomi
* On sick leave.


!! Trond
* update the {{smj}} proper noun lexicon, and refine the morphological analysis,
  cf. the propernoun-smj-lex.txt
** No smj this week.  
* get more {{sma}} texts
** Also no news.
* aligner status quo
** done - working
* decide how to specify compounding behaviour info in the lexicon
* Set up work on missing and conversion screening with Steinar.
** Done (missing only).
* fix {{sme}} texts in corpus this month
* find missing {{nob}} parallel texts in corpus
** Not yet.
* report conversion errors to __Saara__
** Done. Progress on conversion last week.
* [fix bugs!|http://giellatekno.uit.no/bugzilla].
** Fixed some, actually, and found new ones


!!!3. Documentation

__Børre__ has done all(!) open issues - thanks!

TODO:
* either fix forrest installations (__Sjur__), or create a new tarball
  (__Børre__)
** done (new tarball)
* cvs up of the public server should be done for {{xtdoc/sd/documentation/}}
  - notice the directory! (__Børre__)
** done
* update setup and installation instructions (__Børre__)
** done


!!!4. Corpus gathering

TODO:
* {{sme}} texts: no new additions, fix corpus errors during this month
  (__Børre, Trond, Saara__)
* missing {{nob}} parallel texts should be added if such wholes are found
  (__Børre, Trond__)
* Go through the list of missing or errouneous nob texts, based upon Saaras
  perfect list (__Børre, Trond__)


!!!5. Corpus infrastructure

!!Aligner

__TODO:__
* make an overview of status quo (__Trond, Børre, Saara__).
** done
* go through other directories, fix parallelity information for other documents
  (__Børre__)
** the {{sme/admin/}} files are now double checked.
* gather more parallel texts (__Trond, Børre__)
** not any more? (depends on what we are missing and how hard they are to get,
   but we will not put much time into it)
* re-analyze parallel files using the command-line version (__Saara__)
** progressing
* when aligned, send aligned, xml {{nob}} texts to __Lars__ (__Saara__)


!!Conversion issues

Initial nbsp? Cf these:

ccat -l sme -r zcorp/bound/sme/admin/depts/ | preprocess --abbr=bin/abbr.txt --corr=src/typos.txt | lookup -flags mbTT -utf8 -f bin/missing | grep '\?' | cut -f1 | sort | uniq -c | sort -nr | l

{{{
 19  –Danne (EM SPACE, U+2003)
 84  –Lea
 62 Bueng
}}}

Use UnicodeChecker to check spurious characters that look inocent (nbsp, em
space, etc).

{{admin/depts}} still have many errors stemming from the problematic PDF
conversion:

{{{
     34 kapih
     33 ttalat
}}}

__TODO:__
* add correction markup to the xml files (string-to-correction markup)
  (__Saara__)
* report conversion errors to __Saara__ (__Trond, Steinar__)


!!!6. Infrastructure

Nothing

!!!7. Linguistics

!!North Sámi

Compounds with Actio as the first part. Earlier, we accepted such compounds, but 
they gave rise to so much disamb errors that we excluded them, and opted for 
lexicalisation. Now they turn up, but the solution is probably to lexicalise
them. 

{{{
bajásšaddaneavttuid
olgunastinlága
ávkkástallanvugiide
}}}

Another problem:

* we get {{heajus-Oslo}} instead of {{heajos-Oslo}}
* we get {{lojis-Oslo}} instead of {{lojes-Oslo}}
* we get {{álkis-Oslo}} instead of {{álkes-Oslo}}

The fix here is the same as for {{smj}} below: Hyphen must be included in the
right context triggering lowering.

Third issue: stuorra-oslolaš does not work (that is, initial lower case oslolaš
after compound with hyphen - there is always hyphen in front of name-based
words).

TODO:
* Actio compounds: The disamb crew is satisfied. Now it is up to the divvun
  folks to see whether it is too hard to lexicalise
  (__Thomas, Maaren, Steinar__)
* Lack of lowering before hyphen: Twol rewrite. (__Thomas, Trond__)
* fix stuorra-oslolaš lower case {{o}} (__Sjur, Thomas, Trond__)


!Numbers:

Needs to be included in the non-rec transducers, and their inflection needs to
be tested. Test specification:

* inflection (case, number, ordinality)
* generation of higher numerals (not inflected)


TODO:
* include numbers in the non-recursive transducers (i.e. split the recursive and 
  the non-recursive part of the numerals) (__Trond, Thomas__)
* Set up test bed for numerals, test and revise (__who?__)
* Make a test bed {{make num-paradigm}} (__Trond__)
* Go through the Num bugs (__Trond, Thomas, Steinar__)


!Hyphenation problem

Vowel combinations that are not defined as diphthongs behave like diphthongs
anyway, ie. they don't get hyphenated. In North sámi we have this problem when
there is {{a}} or {{u}} in second position:

{{{
laam
laam    laam (expected: la-am)
liam
liam    liam
luam
luam    luam
lyam
lyam    lyam
lium
lium    lium
luum
luum    luum
lyum
lyum    lyum
}}}

For Lule sámi ALL Vowel combos behave like diphtongs.

South Sámi needs supervising, too, there are reports indicating it has onset
maximising rather than coda maximising.

TODO:

* write diphthong hyphenation pseudocode (__Thomas__)
* rewrite hyphenation code (__Trond__)
* ask Ove Lorentz to report on our sma hyphenator (__Trond__)

!!Lule Sámi

We had a twol problem -- now fixed by __Thomas__. We still have a problem with
actor nouns of type {{juhttsat:juhttse}}: when hyphened we get
{{juhttsa-muorra}} instead of {{juhttse-muorra}}. There is a similar problem in
{{sme}} with adjectives (see above).

The fix is to include the hyphen in the right context triggering vowel fronting.

TODO:
* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt
  (__Thomas, Trond__)
* Lack of lowering/fronting before hyphen: Twol rewrite. (__Thomas, Trond__)
* include numbers in the non-recursive transducers
* Set up a test bed for numerals, test and revise (__who?__)


!!!8. Name lexicon infrastructure

__Sjur__ continued refactoring and redesign of the risten.no code: risten.no now 
has full i18n support, is using the latest Forrest to overcome several
shortcomings of the previous version, and is soon ready for a more flexible
editor.

Decided in Tromsø:
* add logging facilities to the interface
* add option to download local copies of the lexicon files directly from the db
* batch editing (change all entries in the found set), should later be enhanced
  to allow selection of exceptions (the found set minus deselected items)
* tag for excluding/including a name from certain applications
* future epxansion: choose what info to display in the single language browser
* display existing language entries when adding a new language to a record
* add editor to change single, existing entries


Details can be found in [the meeting
memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]

Postponed:
* data synchronisation between [risten.no|http://www.risten.no] and the cvs repo
* new version of xml2lexc (based on ccat), should handle complex names correct:
  construct entries like we have now from the different parts of a complex name
  entry

TODO:
# try to make a first version of xml2lexc in Perl for testing and preparation
  for the big jump (__Saara__)
## soon ready:-)
# finish first version of the editing (__Sjur__)
# test editing of the xml files. If ok, then: (__Sjur, Thomas, Trond__)
# make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
  well) (the morphological section should be kept intact, in e.g.
  propernoun-sme-morph.txt) (__Sjur, Saara__)
# convert propernoun-($lang)-lex.txt to a derived file from common xml files
  (__Sjur, Tomi, Saara__)
# start to use the xml file as source file
# clean terms-sme.xml such that all names have the correct tag for their use
  (e.g. @type=secondary) (__Thomas, Maaren, linguists__)
# merge placenames which are errouneously in different entries: e.g. Helsinki,
  Helsingfors, Helsset (__linguists__)
# publish the name lexicon on risten.no (__Sjur__)
# add missing parallel names for placenames (__linguists__)
# add informative links between first names like Niillas and Nils
  (__linguists__)


!!!9. Spellers

!!Polderland data generation

__TODO:__
# decide how to specify compounding behaviour info for the lexicon
  (__Thomas, Trond, Sjur__)
# add closed POS and clitics to PLX generation (__Børre, Tomi__)
# add compound stems to the PLX generation (__Børre, Tomi__)
# add derivations to the PLX generation (__Børre, Tomi__)
# Include numerals in the speller (__Børre, Tomi__)


!!Aspell

TODO when the major part of the PLX conversion is done:
* add Aspell/Hunspell data generation to the lexc2xspell (__Tomi__ - after the
  PLX data generation is finished)
* study Hunspell, perhaps also Soikko (__Børre, Sjur, Tomi__)


!!Testing

__TODO:__
* get an Intel Mac for testing Windows spellers (__Børre, Sjur__)

!!Localisation

TODO:
* send translation text to __Børre__ and __Thomas__ (__Sjur__)
** done
* translate Windows installer text to {{sme}} and {{smj}} (__Børre, Thomas__)
** __Børre__ has started, __Thomas__ (and perhaps __Steinar__ and __Maaren__)
   will continue


!!!10. Other

!!Corpus contracts

TODO:
* publish corpus contracts and project infra on NoDaLi-sta (__Sjur__)


!!Bug fixing

__57__ open Divvun/Disamb bugs, and __23__ risten.no bugs


!!!11. Next meeting, closing

The next meeting is 15.1.2007, 09:00 Norwegian time.

The meeting was closed at 10:50.

!!!Appendix - task lists for the next week

!! Boerre

* contact authors who have already received the corpus licensing contract
* continue work on script for automatic testing of the spell checker in Word
* fix {{sme}} texts in corpus this month
* find missing {{nob}} parallel texts in corpus
* translate Windows installer to {{sme}}
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
* work on the Polderland data generation (PLX format conversion)
* go through other directories, fix parallelity information for other documents


!! Maaren

* investigate the generated word form list sent to Polderland - use the command
  {{make wordlist TARGET=sme}} in ''victorio''


!! Saara

* fix {{sme}} texts in corpus this month
* send aligned, xml {{nob}} texts to __Lars__
* add correction markup to the xml files (string-to-correction markup)
* first new version of xml2lexc in Perl
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Sjur

* name lexicon:
** rewrite the integration with forrest, to get a more flexible integration
   with proper i18n, solving some problems with the previous solution, and
   make a foundation for better search and editing interfaces.
** refactor SD-terms editor code
** implement missing propnouns editing functions
** implement improvements decided upon in Tromsø
* hire linguist and programmer
* decide how to specify compounding behaviour info in the lexicon
* get an Intel Mac for testing Windows spellers
* publish corpus contracts and project infra on NoDaLi-sta
* fix stuorra-oslolaš lower case {{o}}
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Steinar

* conversion error screening 
* missing lists
* report conversion errors to __Saara__
* Go through the Num bugs
* Look at the actio compound issue when adding from missing lists
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Thomas

* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt 
* decide how to specify compounding behaviour info in the lexicon
* translate Windows installer to {{sme}} and {{smj}}
* Actio compounds: The disamb crew is satisfied. Now it is up to the divvun
  folks to see whether it is too hard to lexicalise
* Lack of lowering before hyphen: Twol rewrite.
* include numbers in the non-recursive transducers
* Go through the Num bugs
* Write diphthong hyphenation pseudocode
* fix stuorra-oslolaš lower case {{o}}
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Tomi

* add closed POS and clitics to PLX generation
* add derivations to the PLX generation
* add compound stems to the PLX generation
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Trond

* update the {{smj}} proper noun lexicon, and refine the morphological analysis,
  cf. the propernoun-smj-lex.txt
* decide how to specify compounding behaviour info in the lexicon
* Set up work on missing and conversion screening with Steinar and Ilona.
* fix {{sme}} texts in corpus this month
* find missing {{nob}} parallel texts in corpus, go through Saara's list
* report conversion errors to __Saara__
* Write twol rules for {{sme, smj}} on hyphen-triggered lowering with Thomas
* Go through the Num bugs
* Make numeral testbed
* Rewrite hyphenation-code (pseudocode from __Thomas__) {{sme, smj}}
* Get input on {{sma}} hyphenations
* fix stuorra-oslolaš lower case {{o}}
* include numbers in the non-recursive transducers for {{sme, smj}}
* [fix bugs!|http://giellatekno.uit.no/bugzilla].