!!!Meeting setup

* Date: 09.01.2006
* Time: 09.30 Norw. time
* Place: Wherever we are :-)
* Tools: iChat, SubEthaEdit

!!! Agenda

# Opening, agenda review
# Reviewing the task list from two weeks ago
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Linguistics
# Name lexicon infrastructure
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 09:45.

Present: __Børre, Sjur, Trond__

Absent: __Maaren, Saara, Thomas, Tomi__ (sick leave until Jan. 17th)

Main secretary: __Børre__

Agenda accepted as is.

!!!2. Reviewing the task list from the last meeting

!! Børre
* Gather public texts, preferably also parallel ones
**  Not done
* Contact Odin editor to ask for source (and parallel) documents
**  Not done
* Continue converting text from input format to our xml
**  Not done
* Document the corpus infrastructure
**  Done
* review code and documentation for corpus xsl files under version control
**  Not done
* review the convert2xml.pl script, what works well and what doesn't.
**  Done
* comment review template made by __Saara__
**  Not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
**  Not done
* Misc.
**  Have moved some documentation from the gt/doc tree to xtdoc/giellatekno
    and some of the content to xtdoc/sd as it seemed to fit better there. Have
    worked on translating these documents.
**  Had a look at different formats for the New Testament. Trond and I found
    that the paratext format is the best for making parallel corpora.

!! Maaren
* work with risten.no
* review corpus usage documentation, and the usage of the corpus
* discuss with relevant people regarding seminar on proofing tools, normativity
  and SGL in February/March, including place.

!! Saara
* Look at the corpus infrastructure issue
* Convert the name lexicon from present format to xml and test it
** not done
* discuss the new lexicon format in the newsgroup
* Language detection for the corpus files
** Implemented with heuristics for Sámi. Not yet in use, since there
are still problems with Finnish: it is not properly distinguished from Sámi
languages. I have to look for larger text material for the language
model for Finnish.
* Review the hyphenation detection.
** In progress. There are not enough hyphenated texts in our corpus; I
will probably test with Finnish.
* Catxml review: look at the ccat tool
** Installed and documented, can be reviewed as well.
* Review the handling of xsl-files in corpus infrastructure
** Started major update for all the xsl-files. xsl-processing is not
yet included in the web upload script, and the xsl-files do not
correspond to the current corpus.dtd. The basic functionality is there
however, so finishing and testing the scripts takes two weeks
(part time work).
* Do some testing for bug
  [#211|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=211]
** not done
* add version control of the corpus xsl files to the (upload)
processing
** related to xsl-files, see above.
* xml2lexc update to handle complex names: construct entries like we have now
  from the different parts of a complex name entry
* update preprocess to handle inflected forms of complex names
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* Follow up the lawyer treatment of the contracts
** not done
* Lule Sámi twol problems, with __Thomas__ and __Trond__
** nothing
* follow up on voice group-chat not working to Sámediggi
** Test Marratech when the new Marratech server is in place
*** still waiting
* project planning with __Trond__, continued
** also look at the development processes - specification and testing
*** nothing this week
* Follow up on place names from Norge Digitalt
** waiting for __Bjørn Olav Megard__
** should call EDD at UiO: they're planning a national place name lexicon
* Evaluate SFST as speller (and analyzer) lexicon
** more thorough analysis than was possible in Guovdageaidnu
* write a background document on the corpus contracts
** nothing
* continue proper name lexicon work and discussion
** nothing
* public tender
** updated the selection criteria with the latest remarks
** finally found and contacted a firm that helps with public tenders and the
   legal and formal requirements associated with such - now waiting for a price
   quote from them
* smj G3 issue with __Thomas__ and __Trond__
** nothing
* sme G3 issue with __Thomas__ and __Trond__
** nothing
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** nothing
* other:
** risten.no had crashed again just before Xmas, but it wasn't noticed until now
   (it looked like it was still running until New Year). The last backup was
   dated 22 Dec., which means that some work has been lost. I have a set of logs
   that can be analysed. risten.no is back up now, but searches are only done on
   sma for unclear reasons. I'm still working on it.

!! Thomas
Sick leave.

!! Tomi
Sick leave.

!! Trond
* Follow up the lawyer treatment of the contracts
** Done, got feedback from the university lawyer, who gave the green light
   (with no comments)
* project planning with __Sjur__, continued
** Nothing with Sjur, done some on disamb.
* smj G3 issue with __Sjur__ and __Thomas__
** Not done.
* sme G3 issue with __Sjur__ and __Thomas__
** Not done.
* document corpus infrastructure, your part
** Not done.
* review corpus usage documentation
** Not done.
* discuss the new lexicon format in the newsgroup
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** Some postings, yes.


!!!3. Documentation

!!Reviews

__Sjur:__ What is the reasoning for moving the documents around?
__Børre:__ Some documents sitting in the gt/doc tree seemed to fit best in the
xtdoc/giellatekno tree.

Add documentation on our corpus infrastructure and our corpus work in general
(__Børre__, __Tomi__, __Trond__). For the basic corpora, we need 2
additional types of documentation, or doc for 2 target groups:

# For the __users/linguists:__ Which corpora exist, and how do I use them? (This
  info is now scattered.)
  (Part of the how-to-use information is documented in the catxml documentation.
  An overview of which documents are found where, and an overall documentation,
  have not been written, since the corpus is so sparsely populated.)
## catxml done, which is what is needed mostly. Do we need more?
## Review: __Thomas, Maaren, Linda, Ilona, Trond__

Saara has installed ccat. It is ready for a first round of reviews.

Findings so far:
Basic text presentation works fine, but the text-type options do not.

!!!4. Corpus gathering

!!Contracts

Next step:

# wait for comments from the lawyers - remind them of the task
## __Trond__ is done (no comments, ok by UiT), __Sjur__ still to do it
# possibly update contracts with remarks from lawyers
# start using them!

!!Collecting

Nothing new, now hampered by the lawyers checking the final version of the
contracts.

We want both parallel and erroneous (unproofed) text files. What we need is a
direct contact with Odin, in order to have as good coverage as possible.

TODO: __Trond__, and then __Børre__ to call __Ove Sæth__ to re-establish 
contact. 

!Bible texts

We have received Norwegian texts and a contract draft from Bibelselskapet, in 
essence requiring a separate contract for internet use of extracts of the text. 
The texts are in the paratext format, an international standard for the Bible in
electronic form. Most translations are available in this format.

TODO: __Trond__ will accept the contract as is, and then negotiate a separate
contract with them for use with the online, searchable parallel corpus when it
is ready.

!Example of the paratext format

The line break in v1 is ours; in the original it is all on one line.

{{{
\s Gud skaper verda
\p
\v 1 I opphavet skapte Gud himmelen og jorda. \f + \fr 1,1 \fk skapte \ft Det hebr. verbet som vert nytta her,
     står alltid med Gud som subjekt.\f* \x - \xo 1,1 \xt 2,4; Job 38; Sal 8; 33,6; 104; Joh 1,1.3; Kol 1,16\x* 
\v 2 Jorda var aud og tom, og mørker låg over havdjupet. Men Guds Ande sveiv over vatnet. 
\v 3 Då sa Gud: «Det verte ljos!» Så vart det ljos. \x - \xo 1,3 \xt Sal 33,9; 2 Kor 4,6\x* 
\v 4 Og Gud såg at ljoset var godt, og han skilde ljoset frå mørkret. 
\v 5 Gud kalla ljoset dag, og mørkret kalla han natt. Og det vart kveld, og det vart morgon, fyrste dagen.
\p
}}}
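
A minimal sketch of how verse text could be pulled out of such a file for verse-level alignment of parallel texts. It assumes only the markers visible in the example above: section headings (\s), paragraph breaks (\p), verses (\v) and the inline footnotes and cross-references (\f ... \f* and \x ... \x*). The names extract_verses and align_verses are hypothetical and not part of our tool chain, and real paratext files use more markers than this handles.

```python
import re

# Inline notes: footnotes (\f ... \f*) and cross-references (\x ... \x*).
# DOTALL lets a note span a line break, as in v1 of the example.
NOTE_RE = re.compile(r"\\(f|x)\s.*?\\\1\*", re.DOTALL)
# A verse line: \v, the verse number, then the verse text.
VERSE_RE = re.compile(r"^\\v\s+(\d+)\s+(.*)$")


def extract_verses(paratext):
    """Return {verse number: verse text} with notes removed.

    Assumption: each verse sits on one line once the notes are stripped.
    """
    verses = {}
    cleaned = NOTE_RE.sub("", paratext)
    for line in cleaned.splitlines():
        match = VERSE_RE.match(line.strip())
        if match:
            # Collapse the whitespace left behind by the removed notes.
            verses[int(match.group(1))] = " ".join(match.group(2).split())
    return verses


def align_verses(verses_a, verses_b):
    """Pair up the verses that exist in both texts, ordered by verse number."""
    common = sorted(set(verses_a) & set(verses_b))
    return [(number, verses_a[number], verses_b[number]) for number in common]
```

Because the verse number is the alignment key, two translations in this format can be paired without any sentence alignment step, which is the reason the format suits parallel corpora.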


!!!5. Corpus infrastructure

Task list:
# Include the xsl files under version control
## RCS version control is almost finished, but an issue with access control is
   still open. Discussed a bit in the meeting, but nothing conclusive. We'll
   continue the discussion in the newsgroup.
# Incorporate language detection as part of the corpus processing (__Saara__)
## Almost finished. Some heuristics regarding other Sámi languages in the same
   document to be added.
# we need a way to deal with hyphenated documents (documents with (manually)
  inserted hyphenation marks) in catxml/preprocess.
## done, needs review (__Saara__)
# we need to review whether only automatic hyphen detection is good enough, or
  whether manual post-processing in some form is needed. Delayed until we have
  some results to base the review on.
## Acceptable results: 90% of all real hyphens correctly tagged.
# CGI-admin script to add xsl-file to a corpus file that doesn't have one 
  (__Tomi__)
## __Saara__ will review the existing code, consult __Tomi__, and try to make
   a script to help __Børre__ utilize the template and the infra in place.
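
The acceptance criterion for hyphen detection above ("90% of all real hyphens correctly tagged") can be read as a recall threshold. A minimal sketch under that reading, assuming that real and detected hyphens are each given as a collection of positions in the text. The function names and the position representation are hypothetical, not taken from our scripts.

```python
def hyphen_recall(real_hyphens, tagged_hyphens):
    """Share of the real hyphens that were also tagged by the detector."""
    real = set(real_hyphens)
    if not real:
        raise ValueError("the test material contains no real hyphens")
    return len(real & set(tagged_hyphens)) / len(real)


def is_acceptable(real_hyphens, tagged_hyphens, threshold=0.90):
    """True if at least `threshold` of the real hyphens were tagged."""
    return hyphen_recall(real_hyphens, tagged_hyphens) >= threshold
```

Note that this reading says nothing about false positives (ordinary hyphens wrongly tagged as hyphenation marks); whether those should count as well is part of the review.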


!!!6. Linguistics

!!North Sámi

* three-part compounds issue still open
** Thomas is finished with three-part compounds for North Sámi. 
*** compilation time has increased to 10m for the sme parser
    as a whole. This is probably linked to the three-part issue, but we are not 
    sure about that.
*** it overgenerates wildly. This is bad for several of our applications.
** compound shortening issue is waiting for the SGL to make a normative decision
*** we have to wait until SGL's meeting minutes (???) are ready
** linguistic analysis/discussion to continue in the newsgroup
** we should include the members-to-be of the Sámi Giellalávdegoddi (SGL) in
   this discussion, but how? We need a common meeting / normativity seminar /
   seminar on proofing issues and on language technology in a broader sense with
   them. Then we also need to open the linguistic issues newsgroup we talked
   about earlier, and set up their computers and teach them to use the
   newsgroup. This seminar needs to include the Lule Sámi section of SGL (others
   are welcome too).
*** Laila has informed the members of the SGL about this. Opening the
    linguistic issues newsgroup for the members of SGL is fine with them, but
    we have to know how to do it. The seminar is also fine, but we have to wait
    until the new SGL is elected.
** Timetable:
*** The Sámediggiráđđi is going to appoint new members in January
    (__Johan Mikkel Saara__ is asking people)
*** Contact the SGL when it is elected, and ask them to arrange a
    normativity seminar, with invited key persons. Initially we planned for
    February, but that is too early, because SGL is not elected yet. Divvun
    wants the seminar as early as possible.
* number project still open (see below)
* diphthong simplification/G3 issue should be carried over from Lule Sámi.
** TODO: 
*** __Trond, Thomas and Sjur__ to discuss this this month.
*** Writing the rules (copy from smj, adjust) (__Sjur, Thomas__ and __Trond__)
*** Bug 77 - clarify whether ''háliid'' is acceptable - it is found, thus we
    need to be able to analyse it, but only as !SUB:

!!Lule Sámi

Open tasks:
* Derived G3 forms that look like G2 are still open.
** __Thomas, Trond and Sjur__ should meet again to finish the rest.

!!Numerals

The following North Sámi linguistic issues should be settled before going into
the numeral project:

# Three-part compounds
# Diphthong simplification
# Derivation

These issues were recently completed for Lule Sámi, and it is more efficient to
complete them in North Sámi directly thereafter instead of beginning a new topic.

Numeral treatment is at different levels in the existing sme and smj parsers,
but the issue itself is common to the two languages, and should therefore be
treated in parallel.

Numerals in North Sámi:
Inventory is listed elsewhere.

Numerals in Lule Sámi:
There are 70 lines of code setting up the structure for case inflection of basic
numerals.

!!!7. Name lexicon infrastructure

!!Complex names

Task list for this issue:

* make sure xml2lexc can handle complex names in ways compatible with our
  present tool chain (=reconstruct the lexc format we have now) (__Saara__)
** __Tomi__ is working on a C++ version of the xml2lexc tool. He will put it
   into CVS soon (it is actually catxml, can easily be adapted to the xml2lexc
   requirements)
* the resulting file format should be identical to our present prop-name
  file (=lexc), that can then be converted to our new xml format using the same
  script as for the regular names (__Tomi__ or __Saara__, but only when the
  technical details are settled)
* The core issue: The preprocess script can handle uninflected (A + B)
  compounds, but not (A + inflected B). __Saara__ will add the analyzer as part
  of the preprocess, to be able to handle inflected cases as well.

!!XML format

We had our meeting, and the result was pretty close to the structure of the
existing risten.no term base. Now the conversion script to xml needs to be
updated. The new format is documented in the newsgroup.

Tasks:
# make a test lexicon for evaluating the format, set up the editing, and test it
  (__Børre__)
# update conversion from lexc to xml to reflect new xml format
# testing of conversion
# continue the discussion of the name lexicon format
  (__Saara, Tomi, Sjur, Trond__)
# implement a prototype in eXist
# eXist as editor:
## develop the needed XQueries and interface
## synchronisation between risten.no and 
## test whether eXist as editor is actually working well

!!!8. Other

!!Seminars

* SGL/normativity seminar
** all members = potentially/likely all languages
** date? As early as possible, but not likely before the end of February
** place? __Maaren__ will investigate
* project meeting
** date: 23.-27. January.
** place: Tromsø/Helsinki?
** still too many open questions re place+date, to be determined in a separate
   meeting later this week, when all are available.

Suggested content for project meeting:

* Common
** Project updates
** Project milestones
** Project evaluation (?)

* Divvun
** Project milestones
** Evaluation/feedback

* Disamb
** Project milestones
** Tagsets (?)

* Linguists
** G3 smj/sme
** Compilation time and twol ruleset
** Numeral project

* Programmers
** Proper names:
*** xml
*** eXist/editing

Teaching sessions (list of free thoughts and personal frustrations)
* How to use the xml corpora
* emacs? (?)
* Use of Xerox tools (?)
* Twig
* XQuery
* xml webapp development, security and session integrity

This was as far as we got today, please add to it before or in the next meeting.

!!Technical issues

* The Mac OS / Perl bug (at least __Trond__ and __Sjur__ have it, [Bugzilla
  #211|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=211]):
** utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
   82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl
   5.8.6). It is probably a perl - OS mismatch. (__Trond__, __Saara__,
   __Tomi__)
*** 10.4 introduced support for locales in the shell (10.3 and earlier didn't
    know about locales)
** Test: the result of the last line should indicate whether this is a problem
   in cat or a Perl/OS mismatch.

{{{
   preprocess file_name.txt OK
   cat file_name.txt | preprocess !!
   catxml file_name.xml | preprocess ??
}}}

After some heavy investigation, we got no further. There is no difference
between bash versions (2.0x, 3.0x), nor between locales as long as they are
UTF-8, nor between storing the locale in LANG or in LC_ALL.

One new insight though: both cat and print give the same errors, which
indicates that the error is NOT strictly related to cat.
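
As background for further testing, a small sketch of what a lone byte 0xC4 can mean. This does not diagnose bug #211; it only illustrates two general facts about the encodings. In UTF-8, 0xC4 is the first byte of two-byte characters such as č (C4 8D) and đ (C4 91), so a lone 0xC4 is an incomplete sequence. In Latin-1, 0xC4 on its own is the letter Ä. Whether either case applies to our pipe problem is an assumption still to be tested, and the function names below are hypothetical.

```python
import codecs

# "č" (U+010D) is encoded in UTF-8 as the two bytes C4 8D.
DATA = "čáhci".encode("utf-8")


def decode_pieces_strictly(pieces):
    """Decode every piece on its own; fails if a character is split."""
    return "".join(piece.decode("utf-8") for piece in pieces)


def decode_pieces_incrementally(pieces):
    """Decode with a stateful decoder that buffers partial characters."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    text = "".join(decoder.decode(piece) for piece in pieces)
    return text + decoder.decode(b"", final=True)


if __name__ == "__main__":
    # Simulate input arriving in two pieces, split in the middle of "č".
    split = [DATA[:1], DATA[1:]]
    print(decode_pieces_incrementally(split))
    try:
        decode_pieces_strictly(split)
    except UnicodeDecodeError as err:
        print("strict decoding failed:", err.reason)
    # The same byte read as Latin-1 is a complete character.
    print(b"\xc4".decode("latin-1"))
```

If the failing input turns out to contain 0xC4 followed by a valid continuation byte, the data itself is fine and the problem lies in how it is read; if 0xC4 stands alone, the file is probably not UTF-8 at that point.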


!!Bug fixing

__29__ open bugs (and 25 risten.no bugs)

!!!9. Summary, task list

!! Børre
* Gather public texts, preferably also parallel ones
* Contact Odin editor (__Ove Sæth__) to ask for source (and parallel) documents
* Continue converting text from input format to our xml
* review code and documentation for corpus xsl files under version control
* make an XML test lexicon for our new name lexicon; format is based on the
  latest meeting memo
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren
* work with risten.no
* review corpus usage documentation, and the usage of the corpus
* discuss with relevant people regarding seminar on proofing tools, normativity
  and SGL in February/March, including place.

!! Saara
* Convert the name lexicon from present format to xml for testing; final
  conversion pending the editing facilities
* continue discussion on the new lexicon format
* Refine language detection for Finnish
* Finish the review of the hyphenation detection.
* Review the handling of xsl-files in corpus infrastructure, including version
  control
* Do some testing for bug
  [#211|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=211]
* xml2lexc update to handle complex names: construct entries like we have now
  from the different parts of a complex name entry
* update preprocess to handle inflected forms of complex names
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* Follow up the lawyer treatment of the contracts
* Plan the forthcoming seminar
* Lule Sámi twol problems, with __Thomas__ and __Trond__
* follow up on voice group-chat not working to Sámediggi
** Test Marratech when the new Marratech server is in place
* project planning with __Trond__, continued
** also look at the development processes - specification and testing
* Follow up on place names from Norge Digitalt
** write an e-mail to or call __Bjørn Olav Megard__
* Evaluate SFST as speller (and analyzer) lexicon
** more thorough analysis than was possible in Guovdageaidnu
* write a background document on the corpus contracts
* continue proper name lexicon work and discussion
* public tender:
** follow up the contacts with [Finnut Consult AS|http://www.finnut.no/]
* review corpus upload user interface
* smj G3 issue with __Thomas__ and __Trond__
* sme G3 issue with __Thomas__ and __Trond__
* call EDD/__Christian Emil Ore__ about national place name lexicon
* risten.no/name lexicon development: fix bugs, continue development
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas
* work on North Sámi compounding and derivation
* review corpus usage documentation
* smj G3 issue with __Sjur__ and __Trond__
* sme G3 issue with __Sjur__ and __Trond__

!! Tomi
* Aspell: Continue working on the affix file & aspell
** Contact aspell author (UTF-8 thing)
* corpus infrastructure:
** dtd location (both public and internal)
** cgi-admin script for adding xsl-files
* Document aspell and corpus infrastructure
* Specification for new catxml in C++
** install and announce new ccat tool
* new proper name lexicon
** remove last part of complex names not used as simplex names
** start looking at conversion of the name lexicon from present format to xml
** discuss the new lexicon format in the newsgroup
** Look into synchronisation of proper names with risten.no
** meeting to arrive at final xml format
** new version of xml2lexc (based on catxml, now ccat)
* hyphenation in corpus docs
* comment review template made by __Saara__
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
* pick up backpacks after Xmas

!! Trond
* Contact Odin editor (__Ove Sæth__) to reopen contacts
* Plan the forthcoming seminar
* sign contract with Bibelselskapet for Norwegian parallel texts
* project planning with __Sjur__, continued
* smj G3 issue with __Sjur__ and __Thomas__
* sme G3 issue with __Sjur__ and __Thomas__
* document corpus infrastructure, your part
* review corpus usage documentation (ccat)
* discuss the new lexicon format in the newsgroup
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!!!10. Next meeting, closing

16.01.2006 09:30

Closed at 12:31