!!!Meeting setup

* Date: 21.11.2005
* Time: 10.00 Norw. time
* Place: Wherever we are :-)
* Tools: iChat, SubEthaEdit, phone

!!! Agenda

# Opening, agenda review
# Reviewing the task list from two weeks ago
# The Árran journey
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Linguistics
# Speller infrastructure
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 10:12.

Present: __Børre, Saara, Sjur, Thomas, Trond__

Absent: __Maaren, Tomi__

Main secretary: __Sjur__

Agenda accepted with Árran as an additional point.

!!!2. Reviewing the task list from the last meeting

!! Børre
* Contact oahpahusossodat about texts
* Gather public texts
* Continue converting text from input format to our xml
**  Done some, will have to automate it.
* Document the corpus directory structure
**  Done some.
* Ask __Thor-Øivind__ to move bugzilla to our new webserver
** ... and update Bugzilla at the same time
* install new XXE and the new XXE Forrest config for all (or check that it is
  installed and working)
**  Not done
* mark-up names
**  Not done
* divvun.no and giellatekno.uit.no
**  Binary files download area
**  Make the conversion to static site, using our own script.
*** Not done
* hyphenation in corpus docs
**  Not done
* meet with __Anders Kintel__ in Árran
**  Done
* corpus xsl files under version control
**  Not done
* Other
**  Doctored Maarens computer on Tuesday. See, emacs and XXE should work as
    expected now.

!! Maaren
* shall work with Sámi place names only
* update the last issue in the North Sámi normativity issues document

!! Saara
* Look at the corpus infrastructure issue
* Look at the corpus interface issue with Lars
* look into efficient editing of the xml proper name lexicon (tools, modes, etc)
** xml template for namelex format in gt/common/src/proper-nouns.xml
* Convert the name lexicon from present format to xml
** waiting for the format
* document corpus infrastructure, your own parts
** done
* Look at the hyphenation issue
** not done
* Update the corpus.dtd
** done
* corpus xsl files under version control
** not done
* make preprocess and lookup2cg faster
** work in progress

!! Sjur
* Lule Sámi twol problems, look again at the sets definition with __Thomas__ and
  __Trond__, Wednesday
** No time last week
* risten.no bugs and fixes
* follow up on voice group-chat not working to Sámediggi
** Test Marratech
*** still no __working__ link
* project planning with __Trond__, continued
** also look at the development processes - specification and  testing
*** nothing
* Follow up on place names from Norge Digitalt
** write an e-mail to or call __Bjørn Olav Megard__
*** nothing
* Evaluate SFST as speller (and analyzer) lexicon
** more thorough analysis than was possible in Guovdageaidnu
*** nothing
* write a background document on the corpus contracts
** not yet
* discuss kvensk project support with __Trond__
* proper name integration with risten.no
* discuss risten.no work with __Tomi__
** some further discussions done
* write public tender documents
** updates to the deliverables doc
* buy:
** new computer (project server)?
* hyphenation in corpus docs
** Børre and I discussed this topic in the car, regarding some of the texts we
   have. We want to replace all hyphens with a tag like <hyph/> as part of the
   conversion of the corpus files. This way we can easily replace the tag with
   either zero or something else for testing purposes. The hyph->tag process
   needs some intelligence to avoid replacing in cases like "intra- og
   ekstranett"
* meet with __Anders Kintel__ in Árran
** Done, as well as with Bård Eriksen from a publisher housed in Árran.
* Other:
** ordered AppleCare extended warranty to all Divvun computers.
** presentation of the Divvun project at the Árran conference

!! Thomas
* work on North Sámi compounding and derivation
** worked on compounding and still working
* Look at Linguistic bugs with __Trond__
** not anything done this week
* Continue to meet with Sjur and Trond about and work with the definition of G1, G2, G3
** not met this week, written suggestion to the last problem in Unison newsgroup
* update the lule sámi normativity issues document
** done

!! Tomi
* Aspell: Continue working on the affix file & aspell
** Contact aspell author (UTF-8 thing)
* three-part compounding
* corpus infrastructure: dtd location (both public and internal)
* Document aspell and corpus infrastructure
* Specification for new catxml in C++
** this includes also placing the source and binary
* discuss about xml-processing with __Saara__
* look into efficient editing of the xml proper name lexicon (tools, modes, etc)
* start looking at conversion of the name lexicon from present format to xml
* discuss risten.no work with __Sjur__
* Look into synchronisation of proper names with risten.no
* hyphenation in corpus docs
* corpus xsl files under version control
* add automatic language detection to the corpus processing
* corpus processing problem (convert2xml.pl at line 91)

!! Trond
* Send the contract to the university lawyer
** Done, and discussed with her. Next step is to integrate the (minor) changes
   and then make our Tromsø version.
* Look into the document hyphenation issue
** Not done.
* Look at the three-part compound issue
** Had a look at Thomas' rule set, that's all.
* Work on the CG-related bugs on the bug list (7 open) (numeral related ones
  postponed.
** Had a long look at them. Except the notoriously wexy #77, all of them are
   linked to the forthcoming numeral project.  
* project planning with __Sjur__, continued
** Bought the program, at least, but at that point __Sjur__ was off to Drag.
** also look at the development processes - specification and  testing
* The name project
** Work on the name project, mark up names (exactly 100 names left)
** Most work on this issue done by others this week
** Extract complex names from version 1.126 and save them as a separate file in
   common.
** Checked in this moment...
* discuss kvensk project support with __Sjur__
** Sporadically mentioned.
* Work on the G3 bug issue with __Sjur__ and __Thomas__, carry it over to sme.
** Discussed a bit with Thomas, otherways awaiting Thomas' Lule Sámi cleanup.
* Worked mostly on disamb issues, including corpus issues.

!!!3. Árran trip

The fifth [Sámi Conference|http://www.arran.no/]

!!Killer Whale Safari

A wonderful (put in your favourite travel noun here)!

!!Meeting with Anders Kintel

He is using Filemaker Pro, with two fields in his database. Sámi word in one
field, and the rest of the lemma article in the other.

We have asked for the first field only, and we will put it in the corpus
repository as is, and use it as a regular corpus text (not for disambiguation,
though).

!!Meeting with Bård Eriksen, publisher from Báhko

* [http://www.arran.no/component/option,com_contact/task,view/contact_id,9/Itemid,3/lang,no/]

Very positive, __Børre__ will return to him around the middle of December.
Báhko is publishing everything from Árran (the center).

!!Presentation

__Sjur__ held a 15-20 min presentation of the Divvun project, and a short
speller demo. The demo didn't go that well, due to the test document.
__Conclusion:__ we need to have a pre-made test document, to be able to properly
test and demonstrate the speller.

!!!4. Documentation

Documentation tasks:

Add documentation on our corpus infrastructure and our corpus work in general
(__Børre__, __Tomi__, __Trond__, __Saara__):
* The directory structure is now settled (as of last meeting), and should be 
  documented.

For the basic corpora, we need 2 additional types of documentation, or doc for 2 target
groups:

# For the __users/linguists:__ What corpus are found, how do I use them (this
  info is now scattered) 
  (Part of the HOTWO USE is documented in the catxml docu
  The what documents are found where etc + an overall documentation is not
  written, since the corpus is so sparsely populated)
# For the __collectors:__ How do I add texts, where do I add them, how do I
  convert them (this is (partly?) done in the Corpus Conversion document)

test:

* add/update Aspell documentation (__Tomi__)
** Some documentation has been written, but there still is work to be done.
* as always: document what you're doing:-) (__all__)

!!Divvun.no down again

Tomcat is running out of memory in between. __Børre__ will look into changing
to Forrest generating static html pages (forrest site), and serve those off of
the standard Apache server. He will also look at utilizing Forrestbot as the
tool to update the site, instead of our homegrown script.

Update: Only one small change needed in our own script. Binary download section
should be included.

!!!5. Corpus gathering

Governmental documents (earlier in pdf, now in html)

Børre has gathered files from the Sámediggi
Will go on gathering files from [Odin|http://odin.dep.no/].

* [http://troms.kulturnett.no/bibliotek/samisk/samisk_materiale.htm]
* [Sámi legal text on the net|http://www.lagat.no]
** We need all these texts
** We need to survey the site in the future
** We need the Norwegian versions as well
* Lule Sámi: see the [Árran report|#3.+Árran+trip]

!!Contracts

Update: All SD versions now synchronised with the templates. Trond met with the
lawyer, and she commented the updated versions. Trond will soon update the
contracts.

!!!6. Corpus infrastructure

Updated task list:
# Include the xsl files under version control (__Børre, Tomi, Saara__)
# Incorporate language detection as part of the corpus processing (__Tomi__)
# we need a way to deal with hyphenated documents (documents with (manually) inserted
  hyphenation marks) in catxml/preprocess.
  (__Tomi, Børre,__ (discussion in the newsgroup:) __Sjur, Trond, Saara__)
## Discuss details in the newsgroup
## in normal cases hyphenation points should be removed
## when testing the robustness of our parsers, as well as when testing the
   hyphenator, the hyphenation points should be retained:
### This is true for examples like "eala-<CR>hus", they
    should be converted to something like: "eala<hyph/>hus", in order to both
    keep the hyphenation information when needed, and get it out of the way when
    not.
### In cases of truncated compounds like "ealahus-<CR> ja ...", we want the
    hyphen to stay untouched, and be part of the linguistic processing.
### There are sporadically text books with explicit hyphenation points, like:
    ea-la-hus. In these documents, all hyphens, without exception, should be
    converted to <hyph/>.

!!!7. Linguistics

!!Name lexicon

Summary: see the [newsgroup|news:di5mbi$26ad$1@news.uit.no]

The plan for this project was as follows: Two lines of work run in parallel:

# name markup
# testing of conversion
# eXist as editor:
## develop the needed XQueries and interface
## synchronisation between risten.no and 
## test whether eXist as editor is actually working well


I updated the file gt/common/src/proper-nouns.xml with different formats for printing 
the namelex in the xml. (The line wrap makes it difficult to present them here.) These 
were among the default ones in xmltwig package. Also other formats are possible, e.g. 
having the whole entry in one line, but I found it difficult to read. My favorite is 
the record_c -format, where each <form> is in its own line.


When these two tasks are done (at some point in the future), the conversion will
be done.

Status quo on the two lines of work:

The mark up of the remaining 400 entries until conversion starts (People
allocated look at the rest: __Maaren, Ilona, Trond, Børre__). This week's status
quo is as follows (exactly 100 names not assigned):

{{{
  31 BERN
  19 LONDON
  16 NIILLAS
  15 MARJA
  11 ACCRA
   4 HEANDARAT
   3 ANAR
   1 ALEUHTAT
}}}

The technical issues are specified in earlier memos. Conducted by:
__Tomi, Saara, Sjur__. __Sjur__ and __Tomi__ will tomorrow Tuesday report back
on a plan for using risten.no as editor for our name lexicon

A very short example is found at common/src/proper-nouns.xml.
Saara has made a conversion script which is ready to use. More discussions on
the layout of the resulting xml file is needed.

!!Complex names

Task list for this issue:

* find eventual unique second-parts (B-parts of names that do not exist in
  isolation)
* remove these B-parts from the ordinary name file (__Tomi__)
* the resulting file format should be identical to our present prop-name
  file (=lexc), that can then be converted to our new xml format using the same
  script as for the regular names
* make sure xml2lexc can handle complex names in ways compatible with our
  present tool chain (=reconstruct the lexc format we have now)

The file proper-complex.xml has been added to gt/common/src.
NOTE! It is not an xml file, but simply a lexc file taken from the 1.126 version
of propernouns-sme-lex.txt and converted to utf-8.

The details of the new XML format needs to be further discussed in the newsgroup
and integrated with the rest of the XML work and discussion.

!!North Sámi

* three-part compounds issue still open
** look at Lule Sámi, but apply it to second-parts only
** Thomas is working on it
** the exact rules for when shortening happens should be documented (__Maaren__
   will send the normativity decisions to __Thomas__)
** descriptive facts from our corpus (__Trond, Thomas__)
** linguistic analysis/discussion to continue in the newsgroup
* number project still open
* diphthong simplification/G3 issue should be carried over from Lule Sámi

!!Lule Sámi

 __Sjur, Thomas__ and __Trond__ will cont. Lule Sámi issues.

Tasks:
* update the normativity issues document:
** Px issue
* G3 open issues (S2; Sx = Spiik, consonant series)
** Great progress has been made on the G3 issue, just some minor points remain.
** __Thomas, Trond and Sjur__ meet later this week to solve the rest

!!Numerals

The issue awaits closure of the propernames project, and is postponed to next week.

!!!8. Speller infrastructure

Nothing this week either.

!!!9. Other

!!Technical issues

* The mac os / perl bug (at least __Trond__ and __Sjur__ has it):
** utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
   82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl
   5.8.6). It is probably a perl - OS mismatch. (__Trond__, __Thor Øivind__,
   __Tomi__)
*** Trond has filed a bug report on this (#211), and discussed with Thor-Øivind, there 
    is progress (Saara has commented), but the issue is complicated.   

!!XXE updates

Who has the latest XXE (3.0) and the latest forrest config?

* Børre - ok
* Trond - ok
* Maaren - ok
* Tomi - no
* Thomas - no
* Saara - ok
* Sjur - ok
* Ilona - ok?
* Linda - no

__Børre__ is updating the ones not yet up to speed.

!!Video conferencing across firewalls

The problem we've had with the SD firewall persists, and there doesn't seem to
be any resources available to help us. __Geir Kaaby__ instead suggested we look
at the [Marratech|http://www.marratech.com/] package, and try it out. So please
download the MacOS X client (or get it from me), and I'll send you the URL to
the meeting room as soon as I get it.

!!Bug fixing

__24__ open bugs (and 24 risten.no bugs)

!Bugzilla update
When Bugzilla is being moved, it should also be updated to the newest version,
and the UTF-8 bug should be resolved.

!!risten.no

* Organisation: could __Tomi__ be used, in exchange for more linguistic work by
  (old) GIO members? Yes, it is ok, but how much still needs to be evaluated
* it is ok to integrate "kvensk" placenames with risten.no
** this should be integrated with the general proper name work - we want all
   proper names integrated with risten.no, df above
** needs further development of risten.no to allow for multiple XML bases to
   be presented and maintained in parallel. This is to be further worked on by
   __Tomi__ and __Sjur__
* infrastructure for proper names in place by end of November, if everything
  goes well (or according to a tentative plan)

!! AppleCare extended warranty

All Divvun computers (PowerBook G4s) have received an extended warranty to the
end of the project period. The warranty product (AppleCare) needs to be
registered before it is effective. Please do that as soon as possible when you
receive the package, and __NO__ later than 24.11. You should also include your
wireless keyboard as part of the registration (you register all Apple hardware
covered by AppleCare, which is all equipment bought at the same time: computer
and keyboard).

!!!9. Summary, task list

!! Børre
* Contact oahpahusossodat about texts
* Gather public texts
* Continue converting text from input format to our xml
* Document the corpus directory structure
* Ask __Thor-Øivind__ to move bugzilla to our new webserver
** ... and update Bugzilla at the same time
* install new XXE and the new XXE Forrest config for all (or check that it is
  installed and working)
* mark-up names
* divvun.no and giellatekno.uit.no
**  Binary files download area
**  Make the conversion to static site, using our own script.
* hyphenation in corpus docs
* corpus xsl files under version control
* register AppleCare

!! Maaren
* shall work with Sámi place names only
* update the last issue in the North Sámi normativity issues document
* register AppleCare

!! Saara
* Look at the corpus infrastructure issue
* Look at the corpus interface issue with Lars
* look into efficient editing of the xml proper name lexicon (tools, modes, etc)
* Convert the name lexicon from present format to xml
* document corpus infrastructure, your own parts
* Look at the hyphenation issue
* Update the corpus.dtd
* corpus xsl files under version control

!! Sjur
* Lule Sámi twol problems, look again at the sets definition with __Thomas__ and
  __Trond__
* risten.no bugs and fixes
* follow up on voice group-chat not working to Sámediggi
** Test Marratech
* project planning with __Trond__, continued
** also look at the development processes - specification and  testing
* Follow up on place names from Norge Digitalt
** write an e-mail to or call __Bjørn Olav Megard__
* Evaluate SFST as speller (and analyzer) lexicon
** more thorough analysis than was possible in Guovdageaidnu
* write a background document on the corpus contracts
* discuss kvensk project support with __Trond__
* proper name integration with risten.no
* discuss risten.no work with __Tomi__
* write public tender documents
* hyphenation in corpus docs
* buy:
** new computer (project server)?
* register AppleCare

!! Thomas
* work on North sámi compounding and derivation
* Look at Linguistic bugs with __Trond__
* Continue to meet with Sjur and Trond about and work with the definition of G1, G2, G3
* update the lule sámi normativity issues document about incorporation of loan words
* register AppleCare

!! Tomi
* Aspell: Continue working on the affix file & aspell
** Contact aspell author (UTF-8 thing)
* three-part compounding
* corpus infrastructure: dtd location (both public and internal)
* Document aspell and corpus infrastructure
* Specification for new catxml in C++
** this includes also placing the source and binary
* discuss about xml-processing with __Saara__
* look into efficient editing of the xml proper name lexicon (tools, modes, etc)
* start looking at conversion of the name lexicon from present format to xml
* discuss risten.no work with __Sjur__
* Look into synchronisation of proper names with risten.no
* hyphenation in corpus docs
* corpus xsl files under version control
* add automatic language detection to the corpus processing
* register AppleCare

!! Trond
* update the contracts with changes from the university lawyer
* Look into the document hyphenation issue
* Look at the three-part compound issue
* Work on the CG-related bugs on the bug list (7 open) (numeral related ones
  postponed.
* project planning with __Sjur__, continued
** also look at the development processes - specification and  testing
* The name project
** Work on the name project, mark up names (100 names left)
* discuss kvensk project support with __Sjur__
* Work on the G3 bug issue with __Sjur__ and __Thomas__, carry it over to sme.

!!!10. Next meeting, closing

21.11.2005 09:30

Closed at 12:31