!!!Meeting setup

* Date: 24.04.2006
* Time: 09.30 Norw. time
* Place: Where we are
* Tools: SubEthaEdit, iChat

!!!Agenda

# Opening, agenda review
# Reviewing the task list from two weeks ago
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Infrastructure
# Linguistics
# Name lexicon infrastructure
# Spellers
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 11:11.

Present: __Saara, Sjur, Thomas, Trond__

Absent: __Børre__, __Maaren__ (sick leave), __Tomi__

Main secretary: __Trond__

Agenda accepted as is.

!!!2. Reviewing the task list from the last meeting

!! Børre
* send out contracts with accompanying letter
* Gather public texts, preferably also parallel ones
* Continue converting text from input format to our xml
* convert nob and nno bible texts to be used as part of a parallel corpus
* review the paratext2xml converter
* convert smj NT to paratext
* Send out letters to the Iđut authors
* write documentation on how to apply for a corpus user account (forms,
  recipients, etc.)
* remove old corpus files from gt/sme/corp/ after __Trond__ has cleaned it
* integrate generated corpus repository summaries in the Forrest site
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren
* on sick leave throughout April

!! Saara
* Create a parallel corpus of the New Testaments.
* make free texts available
** set up a weekly cron script (but only if there are new files)
*** not done
* Implement links to parallel files in corpus header.
** not done
* Implement parallel corpus upload in web upload script
** in progress 
* Implement turning off the language recognition in the xsl-file (and corpus.dtd).
** almost ready
* Refine language detection
** almost ready
* add a possibility to upload whole documents for hyphenation (and also
  analysis, generation, etc)
** not done
* add a log of every word/text uploaded/hyphenated/analyzed etc.
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* on paternity leave

!! Thomas
* add all word boundaries
** working with lule-sámi word-boundaries for the moment
* work on compounding and derivation
** nothing this week
* smj G3 issue
** nothing this week
* sme G3 issue
** nothing this week

!! Tomi
* move aspell UTF-8 suffix bug to Bugzilla
* Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
  (it's almost done, but there are a couple of loose ends)
* new proper name lexicon
** implement data synchronisation of proper nouns between risten.no and CVS
** XQuery refactoring and code development for our proper noun editor
** new version of xml2lexc (based on ccat), should handle complex names
   correctly: construct entries like we have now from the different parts of a
   complex name entry
* read aligner docu, install, provide feedback
* install and test Gobby, install new version of SEE
* Set up the mechanism for the hash-mark transducer package
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond
* Make nob into nor in the anchor list (A is done)
* better smj NT text, get fin NT texts
* Discuss Gobby 0.4 with Sjur
* get a key for __Maaren__ in May
* install aligner, test it and give feedback
* correct hyphenation of word boundaries and exceptions with Thomas
* get/upgrade keys for __Børre's__ room for __Tomi__ and __Thomas__
* [fix bugs!|http://giellatekno.uit.no/bugzilla].

!!!3. Documentation

TODO:
* documentation on how to apply for a user account for the corpus repo
  (__Børre__)
**  The item will be moved to the TODO list, again.

!!!4. Corpus gathering

__Trond__ added __sme__ bureaucratic texts, roughly 0,4 mill words; the total
size is now approaching 1,5 mill words.

!! Trip to Sámi municipalities

__Børre__ is on his trip.

!!Collecting

See a [previous meeting memo|Meeting_2006-01-16] for what's to be done.

TODO: Send out the rest of the letters (__Børre__)

!!Odin

__Sæth__ replied by e-mail, hasn't had time to follow up, but will try to
include us in their plans.

* __Børre__ will mirror the URL mentioned in Sæth's e-mail weekly,
  and add missing files to the corpus.
  
!!Olavi Korhonen's Lule Sámi dictionary.

* No news this week

TODO: __Børre__ to contact Olavi Korhonen and Kuhmunen

!! KIO Grafisk and the Iđut books

__TODO__:
* send letters to the authors (__Børre__)
* wait for the discussions with Davvi Girji
**  A talk with Brita Kåven revealed that they would have a look at
    the contracts __after__ Easter.

!!Bible texts

We will get text from Finland, but haven't received any yet. We have received
the Swedish text from Sweden. As for the last html versions from Norway,
__Trond__ did not contact them last week.

Swedish html has arrived, no paratext. Norsk bibelselskap has not sent 
corrected New Testament versions for sme, and not paratext for nno/nob.

TODO:
* convert smj NT to paratext (__Børre__)
**  Not done
* get fin, swe, nob and nno NT and OT in __paratext__ format. (__Trond__)
**  Still no news

!!Min Áigi

__Børre__ has received texts and forwarded them to __Trond__. There are
problems with Unicode in the filenames: the non-ASCII characters appear as
unparsed strings containing the octal code of the character(s) in question.

The files (approx. 2000 files) have been added here:
/usr/local/share/corp/orig/sme/news/MinAigi/

We have problems with Unicode characters in filenames. All characters with
diacritics are stored decomposed on MacOS X, and when transferring the files to
Linux (cochise) via a tar file, the characters are not recomposed, making the
files accessible only by typing the combining diacritic - not nice. We also now
have the same problem on Mac, making it in practice impossible to access a set
of files like:

{{{
a84-231-8-254:~ sjur$ l a+TAB
áda  áde  ádo  åde  
a84-231-8-254:~ sjur$ l a
}}}

This was solved once before, and we need to look at this again. The old Bugzilla
issue should be reopened.

__TODO__:
* __Børre__ to contact __Per Christian Biti__ on technical issues (how to
  transfer texts).
* reopen Bugzilla issue, and study the previous discussion and solution
  regarding Unicode characters in filenames (__Saara__)
* add filename extensions to files not having one
* investigate whether the bullet has a meaning or could be removed
* Space in file names should be changed to underscore (and not to hyphen!).
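
The file name rules in the list above (spaces to underscores, an extension on
every file) could be sketched roughly as follows. This is only a minimal
sketch: the function name {{clean_filename}} and the fallback extension
{{.txt}} are assumptions made for illustration, not decisions from the meeting.

```python
import os

def clean_filename(name, default_ext=".txt"):
    """Return a cleaned-up file name.

    default_ext is a hypothetical fallback; the meeting did not decide
    which extension to give files that lack one.
    """
    # Spaces become underscores (and not hyphens).
    name = name.replace(" ", "_")
    # Add an extension only when the name has none.
    root, ext = os.path.splitext(name)
    if not ext:
        name = root + default_ext
    return name

print(clean_filename("Min Aigi 015-05"))
print(clean_filename("a b.doc"))
```

Names that contain a dot for other reasons (abbreviations, dates) would be
treated as already having an extension, so a real run would need a check
against a list of known extensions.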

Unicode table, in case we need to recompose manually:

{{{
á    0061 0301         
č    0063 030C          
đ    0064 0335          
ŋ    006E          
š    0073 030C          
ŧ    0074 0335          
ž    007A 030C          
æ    00        
ø    006F 0337          
å    0061 030A          
ö    006F 0308          
ä    0061 0308          
Á    0041 0301          
Č    0043 030C          
Đ    0044 0335          
Ŋ    004E           
Š    0053 030C          
Ŧ    0054 0335          
Ž    005A 030C         
Æ    00          
Ø    004F 0337          
Å    0041 030A          
Ö    004F 0308          
Ä    0041 0308  
}}}
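
Instead of recomposing by hand from the table, the standard Unicode
normalisation form NFC could probably do most of the work. A minimal sketch,
assuming Python is available on the machine in question; the function name
{{recompose}} is made up for the example:

```python
import unicodedata

def recompose(name):
    """Return name with base letter + combining mark sequences composed (NFC)."""
    return unicodedata.normalize("NFC", name)

decomposed = "a\u0301da"        # 'a' + COMBINING ACUTE ACCENT, then 'da'
composed = recompose(decomposed)
print(len(decomposed), len(composed))   # 4 3
```

Note that NFC only recomposes characters that have a canonical decomposition
(á, č, š, ž, å, ö, ä and their capitals). As far as we know, đ, ŧ, ø, ŋ and æ
have none, so sequences with the stroke overlays U+0335 or U+0337 would be left
as they are and would still need the manual table. This should be checked
against the actual file names.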

Min Áigi seems to have changed from text files to MS Word around issue
015-05.

!!Kåfjord

Promised to send us texts, but nothing has arrived yet.

__TODO__: __Børre__ to contact them.

!!Sámi Instituhtta

Audhild Schanche has signed the contract. We will have to contact them about
transferring the texts. 

__TODO__: __Børre__ to contact them.

!!!5. Corpus infrastructure

https://giellalt.uit.no/lang/corp/corpus-summary.html

TODO:

!!Changes and updates because of the Divvun public tender

User account admin and infra: see [previous
  memo|/admin/weekly/2006/Meeting_2006-03-06.html].

TODO: see above under ''Documentation.''

Automatic build of the content of our corpus repo: also see [previous
  memo|/admin/weekly/2006/Meeting_2006-03-06.html].

TODO:
* make free texts available
** done weekly by a cron script (but only if there are new files) (__Saara__)
* Make a link, easily available, to these texts.
** done as a downloadable tar package.
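
The "only if there are new files" condition for the weekly job could look
something like this. It is a sketch under assumptions: the function name, the
paths and the gzip format are invented for the example and do not describe the
actual cron script.

```python
import os
import tarfile

def rebuild_if_new(src_dir, tar_path):
    """Rebuild tar_path from src_dir if any file is newer than the tarball.

    Returns True when the archive was (re)built, False otherwise.
    tar_path is assumed to live outside src_dir.
    """
    tar_mtime = os.path.getmtime(tar_path) if os.path.exists(tar_path) else None
    newest = 0.0
    for dirpath, _dirs, files in os.walk(src_dir):
        for fname in files:
            newest = max(newest, os.path.getmtime(os.path.join(dirpath, fname)))
    if tar_mtime is not None and newest <= tar_mtime:
        return False  # nothing new since the last run
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir.rstrip(os.sep)))
    return True
```

Cron would then call the script once a week, and the script itself decides
whether there is anything to do. Deleted files are not detected by this check,
only new or changed ones.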

Name change again?
{{{
gt -> gtbound/
gtbound -> some nifty new letter... ?
gtfree -> some nifty new letter... ?
}}}
__Trond__ to come up with a new suggestion.

!!Free and non-free texts

More info in a [previous meeting
memo.|/admin/weekly/2006/Meeting_2006-03-13.html]

TODO:
* Check the status of the texts, again. (__Børre, Trond__)
* Rerun the conversion afterwards (__Saara__ is the one with the magic spell)
* check against bugs [273|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=273] and 
  [272|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=272]

!!Linking parallel files

DECISION:
We'll keep the original filename, and store linking info in the header (has to
be added manually).

TODO:
* develop the web interface for uploading to make it easier to add several
  documents in one go. (__Saara__)

{{{
lg-a, lg-b
<p id>
<s id>
</s>
</p>

key file
eq-link# lga-1 = lgb-1,2
eq-link# lga-2 = lgb-3
eq-link# lga-2 = lgb-4
}}}
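
To make the key file idea a bit more concrete, here is a small sketch of how
such link lines could be read. The format itself is only the draft from the
notes above; reading {{lgb-1,2}} as shorthand for {{lgb-1}} and {{lgb-2}} is an
assumption, and {{parse_links}} is a made-up name.

```python
def parse_links(lines):
    """Map each language-A sentence id to the list of linked language-B ids."""
    links = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("eq-link#"):
            continue  # skip anything that is not a link line
        left, right = line[len("eq-link#"):].split("=", 1)
        src = left.strip()                              # e.g. "lga-2"
        prefix, numbers = right.strip().rsplit("-", 1)  # "lgb", "1,2"
        targets = [f"{prefix}-{n.strip()}" for n in numbers.split(",")]
        # Several lines with the same left-hand id are merged.
        links.setdefault(src, []).extend(targets)
    return links

key_file = [
    "eq-link# lga-1 = lgb-1,2",
    "eq-link# lga-2 = lgb-3",
    "eq-link# lga-2 = lgb-4",
]
print(parse_links(key_file))
```

With the three example lines, sentence {{lga-1}} ends up linked to two
sentences and {{lga-2}} to two as well, since its two lines are merged.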

<s>...</s><.><s>...</s><?> <!>

sme-dis.rle: DELIMITERS 
The preprocess and the aligner should agree on what is a sentence. (.!?)

<s>
... this is what sme-dis.rle will analyse
<.>
</s>
<!>
<?>

(sme-tdis.rle: DELIMITERS
<¶>)
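
One way to make sure the preprocessor and the aligner agree on what a sentence
is would be to let both read the same delimiter list. A minimal sketch of that
idea follows; the names are invented, and the real delimiters live in the
DELIMITERS section of sme-dis.rle, which this code does not read.

```python
import re

# Sentence-final delimiters named in the notes above: . ! ?
DELIMITERS = ".!?"

def split_sentences(text):
    """Split text at whitespace that follows one of the delimiter characters."""
    pattern = r"(?<=[" + re.escape(DELIMITERS) + r"])\s+"
    return [s for s in re.split(pattern, text.strip()) if s]

print(split_sentences("A b. C d? E!"))
```

Abbreviations and numbers with a full stop are not handled here; that is the
job of the preprocessor, and the point is only that there should be a single
delimiter list.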

!!More texts to the graphical corpus interface:

TODO:
* We would like to have more than the NT in the graphical interface
 http://omilia.uio.no/CE/sami/ (__Saara__)
** We add the largest texts first.
* We would like to have grammatical searchability, not only POS. (__Saara__,
  __Trond__)
* This presupposes a discussion with Oslo. (__Trond__ and __Saara__ to continue
  this discussion)
* For Lule Sámi: We would like to have a parallel corpus interface with NT
  (text only). This presupposes better quality texts (__Trond, Børre__)
** Better Lule NT text still not made.
* The list of good candidates: The longest (admin) texts.
** We need a ccat version of the script for analysing text, still keeping xml
   tags (__Tomi__).

Top priorities:
# Trond and Saara to discuss with Lars.
# Lars to add text to the server.
# Tomi to prepare for the parallel corpus.

!!Language recognition

TODO:
* turn on language recognition, skipping Finnish (__Saara__)
** Done, it seems.
* create a short word list to help the trigram heuristics
** Trond has made such lists for all lgs except sme, smj and nob.
* send Saara sme, smj and nob files (sort, but not uniq -c) (__Trond__)
* Add some flag to write into the xsl file (__Saara__):
** method:do not run lg recognition
** method:Choose between these 2:nob, sme
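
The two flags could behave roughly as in the sketch below: one switch that
turns recognition off, and one that restricts the guess to a given set of
languages. This is not the recogniser we actually use; it is a toy character
trigram comparison, and the function names, the parameters and the three sample
sentences are all assumptions made for the illustration.

```python
from collections import Counter
from math import sqrt

def trigrams(text):
    """Count character trigrams in text, padded with a space at both ends."""
    padded = " " + text.lower() + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[tri] for tri, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def guess_language(text, profiles, candidates=None, enabled=True):
    """Return the best matching language code, or None if recognition is off."""
    if not enabled:                      # "do not run lg recognition"
        return None
    # "Choose between these": restrict the guess to the given candidates.
    langs = candidates if candidates is not None else sorted(profiles)
    sample = trigrams(text)
    return max(langs, key=lambda lang: cosine(sample, profiles[lang]))

# Toy profiles built from one invented sentence per language.
profiles = {
    "sme": trigrams("mun lean oahpaheaddji ja mun orun romssas"),
    "nob": trigrams("jeg er lærer og jeg bor i tromsø"),
    "fin": trigrams("minä olen opettaja ja asun tromssassa"),
}
print(guess_language("jeg bor i tromsø", profiles, candidates=["nob", "sme"]))
```

Real profiles would of course be built from the corpus, and the short word
lists mentioned above would come in addition to the trigram comparison.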

!!!6. Infrastructure

!!Aligner

Today, we have two anchor files in addition to the original one. 

TODO:
* Read documentation and try out, give feedback to Bergen. (__Trond__,
  __Saara__, __Tomi__)
* conflate nno with nob into 'nor' (__Trond__, letters a, b, c done)
**  Partly done, must go through and see.
* __Saara__ to install the aligner, everyone to read the documentation
** done, waiting for the test files from Bergen.
* __Trond__ and __Saara__ will continue this issue.

!!Hyphenator

Trond and Thomas have been updating the propernoun file with ^ tags. We need the
tag in front of compound parts beginning in a vowel or in two or more
consonants. Compound parts beginning with one consonant are handled correctly.
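
The rule can be written down as a small check. This is a sketch only: the vowel
inventory below is an assumption, the function name is invented, and letters
are treated one by one without any knowledge of the lexicon.

```python
# Assumed vowel inventory for the example; not taken from the lexicon files.
VOWELS = set("aáeiouyæøåäö")

def needs_boundary_tag(part):
    """True if a compound part should get ^ in front of it under the rule above."""
    part = part.lower()
    if not part:
        return False
    if part[0] in VOWELS:
        return True                     # begins with a vowel
    # Begins with two or more consonants.
    return len(part) > 1 and part[1] not in VOWELS

for part in ["áigi", "skuvla", "girji"]:
    print(part, needs_boundary_tag(part))
```

A check like this could be used to list candidate entries in the propernoun
file that still lack the tag, before going through them by hand.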

TODO:
* Remove the ^ signs from the UPPER level when generating (just like the TV and
  IV tags are removed today) (__Trond__)
* correct the treatment of hyphenation of word boundaries and exceptions (fst
  gymnastics) (__Sjur, Trond__)
* add a possibility to upload whole documents for hyphenation (and also
  analysis, generation, etc) (__Saara__)
* we should log each and every word/text uploaded/hyphenated/analyzed etc
** we'll do it, but it does not have first priority (__Saara__)
* add exceptions marks to the {{smj}} lexicon (boks^távva)

!!!7. Linguistics

!!General - hyphenation

See discussion, open questions and decisions in the [previous meeting
memo.|/admin/weekly/2006/Meeting_2006-04-03.html]

TODO:
* add all word boundaries (__Thomas__)
**  Done: All sme; smj nouns and adjectives
* Set up the mechanism for the hash transducer package - fst gymnastics, see
  above.

!!North Sámi

Nothing specific this week.

!!Lule Sámi

TODO:
* add the rest of the inc- words (__Thomas__)
**  everything added that is possible now, about 50 unknown words left+2 abbr.
    +moaddi etc (numerals)
    
!!!8. Name lexicon infrastructure

TODO:
# refactor and prepare risten.no for multiple collections:
##  refactor the code into more and more specific components according to our
    folder hierarchy (__Tomi, Sjur__)
### things are moving forward
# develop the needed XQueries and interface (__Sjur, Tomi__)
##  developing
# data synchronisation between risten.no and the cvs repo (__Tomi__)
##  nothing this week
# test and review when ready

Discussion postponed until Sjur is back.

!!!9. Spellers

Nothing until the new proper noun lexicon is in place. We don't have enough
people to do both.

!!!10. Public tender

2 offers received, from Polderland and Lingsoft.

TODO:
* read the offers (__Børre, Sjur, Trond, Tomi__)
* meet on Friday to sum up the findings (__Sjur, Tomi, Trond__)
* telephone meeting next Tuesday with Finnut (__Børre, Sjur__)

!!!11. Other

!!Bug fixing

__50__ open Divvun/Disamb bugs, and __25__ risten.no bugs

Please help Saara with bug 279. Not much help...

Saara will contact Roy on this issue.

After the corpus issues have been somewhat settled, we should do a bug
barnraising. ... and then a new one after the name lexicon is fixed.

!!!12. Summary, task list

!! Børre
* send out contracts with accompanying letter
* Gather public texts, preferably also parallel ones
* Continue converting text from input format to our xml
* convert nob and nno bible texts to be used as part of a parallel corpus
* review the paratext2xml converter
* convert smj NT to paratext
* Send out letters to the Iđut authors
* write documentation on how to apply for a corpus user account (forms,
  recipients, etc.)
* remove old corpus files from gt/sme/corp/ after __Trond__ has cleaned it
* integrate generated corpus repository summaries in the Forrest site
* mirror Odin URL (create cron task to do it automatically?)
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren
* on sick leave throughout April

!! Saara
* Create a parallel corpus of the New Testaments.
* add more texts to the graphical corpus interface
* grammatical searchability in the graphical corpus interface
* make free texts available
** set up a weekly cron script (but only if there are new files)
* Implement links to parallel files in corpus header.
* Implement parallel corpus upload in web upload script
* Implement turning off the language recognition in the xsl-file (and corpus.dtd).
* Refine language recognition
* add a possibility to upload whole documents for hyphenation (and also
  analysis, generation, etc)
* add a log of every word/text uploaded/hyphenated/analyzed etc.
* Investigate the problem with decomposed Unicode characters in file names.
* Convert Min Áigi file names.
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* read through and evaluate the public tender offers

!! Thomas
* add all word boundaries
* work on compounding and derivation
* smj G3 issue
* sme G3 issue

!! Tomi
* move aspell UTF-8 suffix bug to Bugzilla
* Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
  (it's almost done, but there are a couple of loose ends)
* new proper name lexicon
** implement data synchronisation of proper nouns between risten.no and CVS
** XQuery refactoring and code development for our proper noun editor
** new version of xml2lexc (based on ccat), should handle complex names
   correctly: construct entries like we have now from the different parts of a
   complex name entry
* read aligner docu, install, provide feedback
* install and test Gobby, install new version of SEE
* Set up the mechanism for the hash-mark transducer package
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond
* Make nob into nor in the anchor list (A, B, C are done)
* better smj NT text, get fin NT texts
* Discuss Gobby 0.4 with Sjur
* get a key for __Maaren__ in May
* install aligner, test it and give feedback
* correct hyphenation of word boundaries and exceptions with Thomas
* get/upgrade keys for __Børre's__ room for __Tomi__ and __Thomas__
* [fix bugs!|http://giellatekno.uit.no/bugzilla].

!!!13. Next meeting, closing

08.05.2006 09:30

Closed at 13:28