!!!Meeting setup

* Date: 30.01.2006
* Time: 09.30 Norw. time
* Place: Wherever we are :-)
* Tools: iChat, SubEthaEdit

!!! Agenda

# Opening, agenda review
# Reviewing the task list from two weeks ago
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Linguistics
# name lexicon infrastructure
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 09:48.

Present: __Børre, Sjur, Tomi, Trond__

Absent: __Maaren, Saara, Thomas__

Main secretary: __Børre__

Agenda accepted as is.

!!!2. Reviewing the task list from the last meeting

!! Børre
* send out contracts with accompanying letter
**  Not done
* Gather public texts, preferrably also parallel ones
**  Not done
* Contact Odin editor (__Ove Sæth__) to ask for source (and parallel) documents
**  Done
* Continue converting text from input format to our xml
**  Not done
* review code and documentation for corpus xsl files under version control
**  Not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
**  Not done

!! Maaren
* work with risten.no
* discuss with relevant people regarding seminar on proofing tools, normativity
  and SGL in February/March, including place.

!! Saara
* Convert the name lexicon from present format to xml for testing; final
  conversion pending the editing fascilities
** done
* Refine language detection for Finnish
** not done
* Finish the review of the hyphenation detection.
** not done
* Review the handling of xsl-files in corpus infrastructure, including version
  control
** in progress
* Do some testing for bug
  [#211|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=211]
** not done
* optimize the preprocess script
** not done
* Write/update user documentation for the corpus usage in preparation for the
  review in the project meetings next week.
** done
* finalize an improved working version of the CGI and command line scripts for
  corpus additions
** in progress
* xml2lexc update to handle complex names: construct entries like we have now
  from the different parts of a complex name entry
** tomi will do the xml2lexc-script(?)
* update conversion from lexc to xml (proper names) with the latest
refinements
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* Follow up the lawyer treatment of the contracts
** not done
* Project seminar
** plan and make schedule with __Trond__
*** done
** check which hotels SD has an agreement with
*** done
** plan XQuery/XSLT training session
*** done
*** the whole seminar is done
* Lule Sámi twol problems, with __Thomas__ and __Trond__
** not done
* follow up on voice group-chat not working to Sámediggi
** Test Marratech when the new Marratech server is in place
*** not done
* project planning with __Trond__, continued
** also look at the development processes - specification and  testing
*** not done
* Follow up on place names from Norge Digitalt
** write an e-mail to or call __Bjørn Olav Megard__
*** not done
* Evaluate SFST as speller (and analyzer) lexicon
** more thorough analysis than was possible in Guovdageaidnu
*** not done
* write a background document on the corpus contracts
** not done
* continue proper name lexicon work and discussion
** done a lot at the seminar
* public tender:
** review offer from [Finnut Consult AS|http://www.finnut.no/]
*** done, as well as asked for two other offers, then picked one (Finnut), and
    had a meeting with them, planning the whole PT process
* smj G3 issue with __Thomas__ and __Trond__
** not done
* sme G3 issue with __Thomas__ and __Trond__
** not done
* call EDD/__Christian Emil Ore__ about national place name lexicon
** not done
* risten.no/name lexicon development: fix bugs, continue development
** done some, backed up and restarted the server; the backup isn't completely
   valid, due to some strange bug/conflict between the version on the server
   and the version of the client. The XML files are ok though.
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas
On sick leave

!! Tomi
* Aspell: Continue working on the affix file & aspell
** Contact aspell author (UTF-8 thing)
*** Not done
* corpus infrastructure:
** dtd location (both public and internal)
*** Not done
** cgi-admin script for adding xsl-files
*** Not done
* Document aspell and corpus infrastructure
** Not done
* Specification for new catxml in C++
** install and announce new ccat tool
*** Done
* new proper name lexicon
** remove last part of complex names not used as simplex names
*** Not done
** start looking at conversion of the name lexicon from present format to xml
*** Not done
** discuss the new lexicon format in the newsgroup
*** Not done
** Look into synchronisation of proper names with risten.no
*** Not done
** meeting to arrive at final xml format
*** Participated the meeting
** new version of xml2lexc (based on catxml, now ccat)
*** Not done
* hyphenation in corpus docs
** Not done, but has __Saara__ done it? Sjur: yes.
* comment review template made by __Saara__
** Not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** Not done
* pick up backpacks after Xmas
** Done

!! Trond
* Contact Odin editor (__Ove Sæth__) immediately to reopen contacts
** Done.
* Project seminar
** plan and make schedule with __Sjur__
** Done.
** check with Linda and Ilona whether we can start on Monday after lunch
** Done.
* sign contract with Bibelselskapet for Norwegian parallel texts
** Done.
* document corpus infrastructure, your part
* review corpus usage documentation (ccat)
** Done.
* discuss the new lexicon format in the newsgroup
** Done, but not extensively.
* smj G3 issue with __Sjur__ and __Thomas__
** Not done
* sme G3 issue with __Sjur__ and __Thomas__
** Not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** I think we made some progress on some of the number bugs, but we haven't 
   looked at the bugzilla base yet.

!!!3. Documentation

!!Reviews

!ccat review

__Saara, Linda, Ilona, Trond__

Conducted the review as part of the seminar, although Thomas and Maaren weren't
there. Outcome:

It works mostly as documented, a few glitches were found and corrected. The
documentation is ''terse'' but concise (simple is beautiful, that is, in full
accordance to KISS). ccat -h gives:

{{{
        -a              Print all text elements.
        -p              Print plain paragraphs. (default)
        -T              Print paragraphs with title type.
        -L              Print paragraphs with list type.
        -t              Print paragraphs with table type.
        -r <dir>        Recursively process directory dir and subdirs enountered.
        -h              Print this help message.
}}}

* (Trond:) I thought that -T printed titles, but it gave the same output as -a.
* (Tomi:) No, it gave you all regular paragraphs + titles, but no tables or 
lists. The latest version will give you ''only'' titles, lists, etc according 
to the specified option.
* (Sjur:) I would like a -v option, so we are able to identify which version to 
send a bug report for :-) The version number should be autoincremented on each
CVS check-in/build, or manually set?

The [xml-based documentation|/doc/ling/catxml.html] was not completely
up-to-date with the latest changes, fixed in the meeting.

!!Other documentation

* __Børre__: Informed about the forrest documentation: the documentation tree
structure, how to add documents and get them into the menu.

!!!4. Corpus gathering

Discussed briefly how the formalities should be implemented: signature, Websak,
posting etc.

!!Collecting

See the [previous meeting memo|Meeting_2006-01-16] for what's to be done.

TODO: Still a ''lot'' for __Børre!__

!!Odin

DONE: __Trond__, and then __Børre__ to call __Ove Sæth__ to re-establish 
contact. 

__Sæth__ to discuss with colleagues about how to implement the cooperation.

!!Bible texts

{{{
ccat -t zcorp/gt/sme/bible/ot/1Mos_09-01.doc.xml | less
}}}

This gives everything.
What we want is to make a file-specific version of testament.xml, with these
properties:
* The first column should be suppressed
* The second column should be marked number or something
* The third column should be marked header if the typographic code in the fist
  column indicates it is a header (1 or 3) and it should be marked text if the
  typographic code indicates it is text (is 5).
{{{
sme$ls zcorp/orig/sme/bible/ot/
1Mos_09-01.doc  Salmmat-_garvasat_0203.doc
}}}
There are two new books in the paratext format waiting in the nob orig and nno
orig directories.

TODO:
* write a paratext2xml converter
* convert smj NT to paratext
* ask to get fin and swe NT and OT in paratext format

We already have an embryonic converter: gt/script/testament.xsl
Usage: xsltproc /path/to/testament.xsl bible-text.xml > converted-text.xml 
format.

!!!5. Corpus infrastructure

Task list:
# Include the xsl files under version control
## RCS version control is almost finished, but an issue with access control is
   still open. Discussed a bit in the meeting, but nothing conclusive. We'll
   continue the discussion in the newsgroup.
# Incorporate language detection as part of the corpus processing (__Saara__)
## Almost finished. Needs improved Finnish language model - presently it isn't
   able to distinguish Finnish from Sámi (proving the family bonds:-)
# we need to review whether only automatic hyphen detection is good enough, or
  whether manual post-processing in some form is needed. Delayed until we have
  some results to base the review on.
## Acceptable results: 90% of all real hyphens correctly tagged.
# CGI-admin script to add xsl-file to a corpus file that doesn't have one 
  (__Saara__)

Things are moving forward, but still more work to do. The list is left as is.

!!!6. Linguistics

Nothing today, our linguists are on sick leave or not participating. For the
tasks and their status, see the [previous meeting memo|Meeting_2006-01-16].

!!!7. Name lexicon infrastructure

!!Complex names

Task list for this issue:

* make sure xml2lexc can handle complex names in ways compatible with our
  present tool chain (=reconstruct the lexc format we have now) (__Tomi__)
** the resulting file format should be identical to our present prop-name
   file (=lexc), that can then be converted to our new xml format using the same
   script as for the regular names (__Tomi__ or __Saara__, but only when the
   technical details are settled)
* __Saara__ has added the analyzer as part
  of the preprocess, but it is slow, and needs to be optimized.

!!XML format

Tasks:
# make a test lexicon for evaluating the format, set up the editing, and test it
  (__Saara__)
## Done
# update conversion from lexc to xml to reflect new xml format (__Saara__)
## mostly done, some open questions left
# testing of conversion
# eXist as editor:
## develop the needed XQueries and interface
## synchronisation between risten.no and 
## test whether eXist as editor is actually working well

!!!8. Other

!!SGL Seminar

* SGL/normativity seminar
**  all members = potentially/likely all languages
*** not all languages, only North Sámi
**  date? As early as possible, end of February/beginning of March
**  place? __Maaren__ will investigate

!!Technical issues

* The mac os / perl bug (at least __Trond__ and __Sjur__ has it, [Bugzilla
  #211|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=211]):
** utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
   82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl
   5.8.6). It is probably a perl - OS mismatch. (__Trond__, __Saara__,
   __Tomi__)
*** 10.4 introduced support for locales in the shell (10.3 and earlier didn't
    know about locales)
** Test: the result of the last line should indicate whether this is a problem
   in cat or a Perl/OS mismatch.
** Is this a problem with ccat?
*** It doesn't seem so (3 min and still counting)
*** In the end, the bug turned up with ccat as well. I gave the command:
*** zcorp/gt/sme/*/*xml | preprocess --abbr=bin/abbr.txt | lookup -flags
    mbTT -utf8 bin/sme.fst | lookup2cg | vislcg --grammar=src/sme-dis.rle
    --minimal | sort | less
*** and it (in the end) responded: 
{{{
1729 constraint rules
utf8 "\xA1" does not map to Unicode at /home/trond/gt/script/preprocess line 109, <> chunk 12.
}}}

To ccat's defence I must say that cat, in a similar situation, would have given far
more error messages (hold on, testing still under way).

{{{
   preprocess file_name.txt - OK
   cat file_name.txt | preprocess - bug!!
   catxml file_name.xml | preprocess - ??
   ccat filename | preprocess - bug !!
}}}

This bug isn't a high priority any more, because ccat behaves differently than
cat, and because there is the possibility of avoiding cat when working locally.

BUG: close as Won't fix.

!!Bug fixing

__30__ open bugs (and 25 risten.no bugs)

!!Norwegian ispell press release

The i18n section of Skolelinux plans a press release including a paragraph
about our project. We will ask them to reformulate a couple of things, and
remove the links they've included. Text submitted to them:

* Dette er den eneste kilden til retteprogram for norsk uavhengig av Microsoft,
  bl.a. det som brukes av Apple.

* Et separat prosjekt ved Sametinget er i gang for å utvikle samiske
  retteprogram, bl a for Linux. Se http://www.divvun.no for mer info.

!!!9. Summary, task list

!! Børre
* send out contracts with accompanying letter
* Gather public texts, preferrably also parallel ones
* Continue converting text from input format to our xml
* review code and documentation for corpus xsl files under version control
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren
* work with risten.no
* discuss with relevant people regarding seminar on proofing tools, normativity
  and SGL in February/March, including place.

!! Saara
* continue discussion on the new lexicon format
* Refine language detection for Finnish
* Finnish the review of the hyphenation detection.
* Review the handling of xsl-files in corpus infrastructure, including version
  control
* Do some testing for bug
  [#211|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=211]
* Fix the preprocess script and optimize it by building an analyzator
  for the multi-part expressions. 
* finalize an improved working version of the CGI and command line scripts for
  corpus additions
* update conversion from lexc to xml (proper names) with the latest refinements
* Try to add numeral treatment as part of the analyzator.
* Change character coding detection to paragraph-based.
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* Follow up the lawyer treatment of the contracts
* Lule Sámi twol problems, with __Thomas__ and __Trond__
* follow up on voice group-chat not working to Sámediggi
** Test Marratech when the new Marratech server is in place
* project planning with __Trond__, continued
** also look at the development processes - specification and  testing
* Follow up on place names from Norge Digitalt
** write an e-mail to or call __Bjørn Olav Megard__
* Evaluate SFST as speller (and analyzer) lexicon
** more thorough analysis than was possible in Guovdageaidnu
* write a background document on the corpus contracts
* continue proper name lexicon work and discussion
* public tender:
** review offer from [Finnut Consult AS|http://www.finnut.no/]
* smj G3 issue with __Thomas__ and __Trond__
* sme G3 issue with __Thomas__ and __Trond__
* call EDD/__Christian Emil Ore__ about national place name lexicon
* risten.no/name lexicon development: fix bugs, continue development
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas
* work on North Sámi compounding and derivation
* review corpus usage documentation
* smj G3 issue with __Sjur__ and __Trond__
* sme G3 issue with __Sjur__ and __Trond__

!! Tomi
* Aspell: Continue working on the affix file & aspell
** Contact aspell author (UTF-8 thing)
* corpus infrastructure:
** dtd location (both public and internal)
** cgi-admin script for adding xsl-files
* Document aspell and corpus infrastructure
* ccat: add a -v option - it should return the version of the tool
* new proper name lexicon
** remove last part of complex names not used as simplex names
** start looking at conversion of the name lexicon from present format to xml
** discuss the new lexicon format in the newsgroup
** Look into synchronisation of proper names with risten.no
** new version of xml2lexc (based on catxml, now ccat)
*** xml2lexc update to handle complex names: construct entries like we have now
    from the different parts of a complex name entry
* comment review template made by __Saara__
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond
* Work on corpus texts with Børre.
* 3-part compounds with  __Sjur__ and __Thomas__.
* smj G3 issue with __Sjur__ and __Thomas__.
* sme G3 issue with __Sjur__ and __Thomas__.
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!!!10. Next meeting, closing

06.02.2006 09:30

Closed at 12:03