!!!Meeting setup

* Date: 06.02.2006
* Time: 09.30 Norw. time
* Place: Wherever we are :-)
* Tools: iChat, SubEthaEdit

!!! Agenda

# Opening, agenda review
# Reviewing the task list from two weeks ago
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Linguistics
# name lexicon infrastructure
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 09:38.

Present: __Børre, Saara, Sjur, Tomi, Trond__

Absent: __Maaren, Thomas__

Main secretary: __Trond__

Agenda accepted as is, we'll try to finish by 10.55, to allow  for joining
the celebration of the Sámi national day.

!!!2. Reviewing the task list from the last meeting

!! Børre
* send out contracts with accompanying letter
**  Sent to Iđut and Kåfjord municipality
* Gather public texts, preferrably also parallel ones
**  Not done
* Continue converting text from input format to our xml
**  Not done
* review code and documentation for corpus xsl files under version control
**  Not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
**  Not done
* Other
**  The server didn't get an IP-address using DHCP. It turned out that if the
    «Gateway Assistant» was used, then the network card connected to «the
    outside world» didn't get an IP-address using DHCP. Even if all the services
    that was started using the «Gateway Assistant» were turned off, this
    behaviour went on. The «solution» was to reinstall Mac OS X Server, and not
    use the «GA».

!! Maaren
* work with risten.no
* discuss with relevant people regarding seminar on proofing tools, normativity
  and SGL in February/March, including place.

!! Saara
* continue discussion on the new lexicon format
* Refine language detection for Finnish
* Finnish the review of the hyphenation detection.
* Review the handling of xsl-files in corpus infrastructure, including version
  control
** almost done, I'll need some help with the xsl-processing of the
   main text.
* Fix the preprocess script and optimize it by building an analyzator
  for the multi-part expressions.
** it seems that building a preprocessor-specific analyzator is not possible.
* finalize an improved working version of the CGI and command line scripts for
  corpus additions
** almost done.
* update conversion from lexc to xml (proper names) with the latest
  refinements
* Try to add numeral treatment as part of the analyzator.
** not done
* Change character coding detection to paragraph-based.
** done, use convert2xml.pl with option --multi-coding. This
   introduces some errors (due to the small size of some paragraphs), so
   the default is still one coding per file.
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* Follow up the lawyer treatment of the contracts
** not done
* Lule Sámi twol problems, with __Thomas__ and __Trond__
** not done
* follow up on voice group-chat not working to Sámediggi
** Test Marratech when the new Marratech server is in place
*** not done
* project planning with __Trond__, continued
** also look at the development processes - specification and  testing
*** not done
* Follow up on place names from Norge Digitalt
** write an e-mail to or call __Bjørn Olav Megard__
*** not done
* Evaluate SFST as speller (and analyzer) lexicon
** more thorough analysis than was possible in Guovdageaidnu
*** not done
* write a background document on the corpus contracts
** not done
* continue proper name lexicon work and discussion
** did a lot to upgrade the risten.no infrastructure to be multi-collection
   aware
** discussions in the newsgroup
** added the test lexicons __Saara__ created to my own instance of risten.no
* public tender:
** waited for and received a draft public tender document from __Finnut__
* smj G3 issue with __Thomas__ and __Trond__
** not done
* sme G3 issue with __Thomas__ and __Trond__
** not done
* call EDD/__Christian Emil Ore__ about national place name lexicon
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** closed [bug #217|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=217]

!! Thomas
On sick leave.

!! Tomi
* Aspell: Continue working on the affix file & aspell
** Contact aspell author (UTF-8 thing)
*** Not done
* corpus infrastructure:
** dtd location (both public and internal)
*** Not done
** cgi-admin script for adding xsl-files
*** Not done
* Document aspell and corpus infrastructure
* ccat: add a -v option - it should return the version of the tool
** Done
* new proper name lexicon
** remove last part of complex names not used as simplex names
*** Not done
** start looking at conversion of the name lexicon from present format to xml
** discuss the new lexicon format in the newsgroup
** Look into synchronisation of proper names with risten.no
*** Some progress
** new version of xml2lexc (based on catxml, now ccat)
*** Not done
*** xml2lexc update to handle complex names: construct entries like we have now
    from the different parts of a complex name entry
* comment review template made by __Saara__
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond
* Work on corpus texts with Børre.
** Done some progres wrt. processing of texts.
* 3-part compounds with  __Sjur__ and __Thomas__.
** Had a look at the rule set myself, but awaiting Thomas.
* smj G3 issue with __Sjur__ and __Thomas__.
** Not done.
* sme G3 issue with __Sjur__ and __Thomas__.
** Not done.
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** Not done.
* Worked mostly on disambiguation.

!!!3. Documentation

!!Reviews

Anything? Nothing.

!!Other docu

Anything? Documentation has been added for disambiguation.

!!!4. Corpus gathering

!!Collecting

See a [previous meeting memo|Meeting_2006-01-16] for what's to be done.

Sent letter to Iđut and Kåfjord.

TODO: Still a ''lot'' for __Børre!__

!!Odin

Waiting for __Sæth__ to discuss with colleagues about how to implement the
cooperation, and return to us.

Nothing heard.

!!Bible texts

TODO:
* write a paratext2xml converter
** __Tomi__ has already done it! Excellent!
** files requiring this converter should have the filename extension __.ptx__
*** Cf. the following nob Old Testament texts: 01GENNBST.u8.PTX
   19PSANBST.u8.PTX
** __Børre__ will review the converter as part of adding the Norwegian texts
   to our corpus
* convert smj NT to paratext. (__Børre__)
* ask to get fin and swe NT and OT in paratext format. (__Trond__)

!!!5. Corpus infrastructure

Task list:
# Include the xsl files under version control
## RCS version control is almost finished, but an issue with access control is
   still open. Discussed a bit in the meeting, but nothing conclusive. We'll
   continue the discussion in the newsgroup.
# Incorporate language detection as part of the corpus processing (__Saara__)
## Almost finished. Needs improved Finnish language model - presently it isn't
   able to distinguish Finnish from Sámi (proving the family bonds:-)
# we need to review whether only automatic hyphen detection is good enough, or
  whether manual post-processing in some form is needed. Delayed until we have
  some results to base the review on.
## Acceptable results: 90% of all real hyphens correctly tagged.
# CGI-admin script to add xsl-file to a corpus file that doesn't have one 
  (__Saara__)

Things are moving forward, but still more work to do. The list is left as is.

E-mail address in case of upload errors:

corpus@giellatekno.uit.no (-> Børre?) Also for reports about new uploads.

/www/opt/www/cgi-bin/smi/upload.cgi (no Forrest)
http://localhost:8888/upload/upload_corpus_file.html (Forrest)

One option is to ask the cochise team, that would be royd or steinar and the
address cc.uit.no.

*Problems with greek letter in Word documents. With font Sam Times Uni(versal)
**  (__Børre__) Can't we just manually change
    the letters and fonts in the few documents affected?

We forget about these texts for the time being, they'll be put in a dir. for
broken texts. Such texts can be looked upon later , if wanted/needed.

!!Suggestion for Script for text analysis. 

We would like a shadow catalogue ga/ (giella analysed) parallel to the gt/ catalogue, 
with one file for each of the five directories. A way of getting this is to ach night
(afternoon!): Make a crontab job, run the following command, for each
directory admin, bible, facta, ficti, laws, news: 

{{{
ccat -a -r /usr/local/share/corp/gt/sme | preprocess --abbr=bin/abbr.txt
 | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg 
 | vislcg --grammar src/sme-dis.rle > /usr/local/share/corp/ga/sme/dir.txt 

For example:

ccat -a -r /usr/local/share/corp/gt/sme | preprocess --abbr=bin/abbr.txt
 | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg | vislcg --grammar src/sme-dis.rle
  > /usr/local/share/corp/ga/sme/admin.txt 
}}}

* Today:    /usr/local/share/corp/gt/sme/DIR(/*)/*xml
* Addition: /usr/local/share/corp/ga/sme/dir.txt

TODO:
* Look at the suggestion from __Trond__ (__Saara__, discuss with __Trond__
  if unclear)
* ask for e-mail adress as specified above (__Trond__)

!!!6. Linguistics

Anything? Nothing.

!!!7. Name lexicon infrastructure

!!Complex names

TODO:
* make sure xml2lexc can handle complex names in ways compatible with our
  present tool chain (=reconstruct the lexc format we have now) (__Tomi__)
** the resulting file format should be identical to our present prop-name
   file (=lexc), that can then be converted to our new xml format using the same
   script as for the regular names (__Tomi__ or __Saara__, but only when the
   technical details are settled)
* __Saara__ has added the analyzer as part
  of the preprocess, but it is slow, and needs to be optimized.

!!XML format

TODO:
# update conversion from lexc to xml to reflect new xml format (__Saara__)
## mostly done, some open questions left
# testing of conversion
# eXist as editor:
## develop the needed XQueries and interface
## data synchronisation between risten.no and 
## test whether eXist as editor is actually working well

More TODO:
* read and comment in the news group (__all__)
* decide upon and set up infra for new projects and project ideas

Definitions/terminology:
* __synchronisation__ in our context is ''data'' synchronisation, that is, to
  bring the two repositories (CVS and risten.no) in synch regarding the shared
  lexicons (propnouns at least).
* __code refactoring__ is the process of reorganising the code by moving general
  functions and snippets out of specific functions and into general libraries
  for shared access, easier maintenance, and a better organisation.

!!!8. Other

!!SGL Seminar

* SGL/normativity seminar
**  all members = potentially/likely all languages
*** not all languages, only North Sámi
**  date? As early as possible, end of February/beginning of March
**  place? __Maaren__ will investigate

!!Technical issues

* The mac os / perl bug (at least __Trond__ and __Sjur__ has it, [Bugzilla
  #211|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=211]):
** utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
   82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl
   5.8.6). It is probably a perl - OS mismatch. (__Trond__, __Saara__,
   __Tomi__)
*** 10.4 introduced support for locales in the shell (10.3 and earlier didn't
    know about locales)
** Test: the result of the last line should indicate whether this is a problem
   in cat or a Perl/OS mismatch.
** Is this a problem with ccat?
*** It doesn't seem so (3 min and still counting)
*** In the end, the bug turned up with ccat as well. I gave the command:
*** zcorp/gt/sme/*/*xml | preprocess --abbr=bin/abbr.txt | lookup -flags
    mbTT -utf8 bin/sme.fst | lookup2cg | vislcg --grammar=src/sme-dis.rle
    --minimal | sort | less
*** and it (in the end) responded: 
{{{
1729 constraint rules
utf8 "\xA1" does not map to Unicode at /home/trond/gt/script/preprocess line 109, <> chunk 12.
}}}

To ccat's defence I must say that cat, in a similar situation, would have given far
more error messages (hold on, testing still under way).

{{{
   preprocess file_name.txt - OK
   cat file_name.txt | preprocess - bug!!
   catxml file_name.xml | preprocess - ??
   ccat filename | preprocess - bug !!
}}}

This bug isn't a high priority any more, because ccat behaves differently than
cat, and because there is the possibility of avoiding cat when working locally.

BUG: close as Won't fix. (__Børre__)

!!Bug fixing

__32__ open bugs (and 24 risten.no bugs)

* Add bug report for the Xerox backspace error (__Trond__)

!!!9. Summary, task list

!! Børre
* send out contracts with accompanying letter
* Gather public texts, preferrably also parallel ones
* Continue converting text from input format to our xml
* review code and documentation for corpus xsl files under version control
* convert nob and nno bible texts to be used as part of a parallel corpus, and
  review the paratext2xml converter as part of the conversion
* convert smj NT to paratext
* close bug 211 as WONTFIX
**  DONE :-)

* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren
* work with risten.no
* discuss with relevant people regarding seminar on proofing tools, normativity
  and SGL in February/March, including place.

!! Saara
* continue discussion on the new lexicon format
* Refine language detection for Finnish
* Finnish the review of the hyphenation detection.
* Review the handling of xsl-files in corpus infrastructure, including version
  control
* Fix the preprocess script and optimize it by building an analyzator
  for the multi-part expressions. 
* finalize an improved working version of the CGI and command line scripts for
  corpus additions
* update conversion from lexc to xml (proper names) with the latest refinements
* Try to add numeral treatment as part of the analyzator.
* Look at crontab ga/ directory issue with __Trond.__
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* Follow up the lawyer treatment of the contracts
* Lule Sámi twol problems, with __Thomas__ and __Trond__
* project planning with __Trond__, continued
* Follow up on place names from Norge Digitalt
* Evaluate SFST as speller (and analyzer) lexicon
* write a background document on the corpus contracts
* public tender:
** review draft tender document from Finnut
* smj G3 issue with __Thomas__ and __Trond__
* sme G3 issue with __Thomas__ and __Trond__
* call EDD/__Christian Emil Ore__ about national place name lexicon
* risten.no/proper noun lexicon development: fix bugs, continue development
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas
* work on North Sámi compounding and derivation
* review corpus usage documentation
* smj G3 issue with __Sjur__ and __Trond__
* sme G3 issue with __Sjur__ and __Trond__

!! Tomi
* Aspell: Continue working on the affix file & aspell
** Contact aspell author (UTF-8 thing)
* corpus infrastructure:
** dtd location (both public and internal)
* Document aspell and corpus infrastructure
* new proper name lexicon
** remove last part of complex names not used as simplex names
** discuss the new lexicon format and other issues in the newsgroup
** Look into data synchronisation of proper nouns between risten.no and CVS
** new version of xml2lexc (based on ccat), should handle complex names correct:
   construct entries like we have now from the different parts of a complex name
   entry
* comment review template made by __Saara__
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond
* Work on corpus texts with Børre.
* Contact the Finnish and Swedish Bible societies to get Bible texts.
* Look at ga/ directory issue with __Saara.__
* News group discussion followup.
* Do a bug report (if not done) on commandline bahaviour in the Xerox tools.
* Ask for e-mail adress for corpus upload script
* [fix bugs!|http://giellatekno.uit.no/bugzilla].

!!!10. Next meeting, closing

13.02.2006 09:30

Closed at 10:37