!!!Meeting setup

* Date: 16.10.2006
* Time: 09.30 Norw. time
* Place: Where we are
* Tools: SubEthaEdit, iChat

!!!Agenda

# Opening, agenda review
# Reviewing the task list from two weeks ago
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Infrastructure
# Linguistics
# name lexicon infrastructure
# Spellers
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 10:11.

Present: __Børre, Maaren, Sjur, Thomas, Tomi, Trond__

Absent: __Saara__

Agenda accepted as is.

!!!2. Updated task status since last meeting

!! Børre
* corpus collection:
** contact __Ája__ (Kåfjord)
*** Not done
** discuss access to older Min Áigi and Áššu files with __Richard Valkeapää__
*** I will provide him with a renaming script, and then we will get the Áššu and Min Áigi files.
* Move norwegian documents in Min Áigi from sme to nob
** Not done
* set up Bugzilla automatic reminders for open issues
** Not done
* finish Forrest i18n work (pdf)
** Not done
* set up Tomcat for use with eXist and the propnouns db on the G5
** Not done
* investigate use of proxies for __Maaren__, for her to be able to use iChat AV
** Done. SquidMan is a possible solution, will test as soon as possible
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** Not done
* Other
** Worked a lot with the aligner, tca2.


!! Maaren
* download and install latest Marratech
** with the help of SquidMan and http proxy, this is now obsolote - good bye,
   Marratech!


!! Saara
* add more texts to the graphical corpus interface
* finalize server of the Xerox tools.
* generate parallel corpus files manually (with __Trond__)
* Improve text_cat
* export corpus tools to location available to all (with cron), cf news disc.
* Improve hyph-filter.pl
* Implement M4-rules for sme-dis.rle
* improve {{smj}} LM based on the new, larger corpus
* help Trond with some shell commands
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Sjur
* name lexicon:
** refactor SD-terms editor code
*** started
** implement missing propnouns editing functions
*** nothing
** implement improvements decided upon in Tromsø
*** nothing
* move corpus user doc issue to Bugzilla
** nope
* hire linguist and programmer
** nothing last week
* finish i18n work of Forrest
** nothing
* install eXist and our local copy of risten.no and propnouns on the G5
** nope
* add normative M4 macros to the twol code; utilize it in the make file
** done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** fixed one with Thomas and Trond
* other things done:
** spent a lot of time improving hyphenation and speller generation for
   Polderland, partly together with Thomas and Trond. Much has improved, but
   there are still issues, and hyphenation for {{smj}} does not yet compile


!! Thomas
* work with Polderland phonetic rules for {{smj}}
** done
* refine smj proper noun lexica, cf. the propernoun-smj-lex.txt 
** nothing this week
* find and study all derived verbs in our corpus (depends on __Saara__)
** have not started yet
* suggest which derivations could be generated (depends on __Saara__)
** not yet started


!! Tomi
* continue implementation of the speller lexicon conversion
** continued
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** fixed some bug


!! Trond
* Discuss shell script wrapper teaksta.sh with Saara
** Not done
* write user account form, probably ask for copy of existing ones from the IT
  centre (with Sjur)
** Not done
* write documentation for our {{bound}} users, with pointers to the ordinary 
  documentation  
** Not done
* refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
** Not done
* Get more {{sma}} texts to improve language recognition
** Not done
* study paragraphs with mixed content
** Not done
* add corpus artice framework to Bugzilla
** Not done
* add corpus user accounts and access issues to Bugzilla
** Not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla].
* As can be seen, this week has been devoted to totally different things, first
  and foremost to getting our first analysed texts ready for corpus inclusion.
  (and to hyphenation...) I'll just have to move the tasks to next week.

!!!3. Documentation

TODO:
* finish i18n work (__Børre__ and __Sjur__)
** set up Tomcat at the faculty server, and install Forrest as war. Needs
   testing before being deployed on the server! Thus, install it on the G5 first
   and see that it works, then on the faculty server.
** i18n does not work in PDF ("Table of Content" won't translate)
*** still doesn't work
* Write both user and admin documentation (__Børre__, review: __Sjur, Thomas__)
** User documentation probably in several languages. This covers how to apply
   for an account, on what grounds one can apply, and pointers to documentation
   telling how to use the corpus.
** Admin documentation, telling how we set the permissions to the corpus files,
   and whatever other processes and tasks needed to set up a corpus account.
*** move these issues into Bugzilla (__Sjur__)


!!!4. Corpus gathering

__Børre__ talked to __Richard Valkeapää__, they have several computers, and have
the data available. But they have problems with their backup system and
interaction between file names on Windows and MacOS X. __Børre__ will help them,
and we will receive our data.

__TODO:__
* discuss corpus transfer with NSI (__Børre__)
* continue to contact authors (__Børre__)


!!!5. Corpus infrastructure

!!User accounts and access

The whole issue should be moved into Bugzilla as several tasks.

TODO:
* add the issue with subissues to Bugzilla (__Trond__)


!!More texts to the graphical corpus interface:

__Trond__, __Børre__ and __Saara__ aligned and sent the first parallell corpus
files to Oslo. One milestone further :-)

TODO:
* align texts, analyse, and send to Lars (__Trond, Saara__)
* add text to the server (__Lars__)


!!Aligner

__Børre__ has made some important improvements to the Bergen Aligner. The
Aligner compiles better and is now able to save intermediate result files (4 out
of 5 files, that is, so there is still some work to do), in order to control
memory usage. In this way, we are hopefully able to contribute to the Bergen
development.

__TODO:__
* gather parallel texts (__Trond__)
** __Trond__ is working on it
* try out NT alignment strategies (__Saara__)


!!Language recognition

TODO:
* get more {{sma}} texts, first the Bible / NT (__Trond__)
* improve the {{smj}} LM (__Saara__)
** done
* what about paragraphs with mixed content? Build a corpus of such paragraphs
  (__Trond__)


!!!6. Infrastructure

!!Xerox tools wrapped as servers

The Divvun project needs generation pretty soon, and __Tomi__ can add generation
to the server if he needs it. Possible conflicts in the code will have to be
resolved when __Saara__ is back.

__TODO:__
* improve and finish the present prototype (__Saara__)
* add generator (__Tomi__)


!!Hyphenator

__Sjur__ has hyphenated the pl-wordlist.txt file (the full-form word list
generated for Polderland), and now has about __10Gb__ of hyphenated data on his
computer! Forgot to use {{time}}, but it took somewhere between 36 and 48 hours!
There's a lot of noise and problematic issues in the data, but a cleaned version
will be sent off to PL for their perusal. The cleaned version is __651Mb,__
just slightly larger than the original input file, which was __627Mb.__

Hyphenation for {{smj}} still does not work, as the hyphenation rules file does
not compile. We have received valuable feedback from __Lauri Karttunen__,
though, as we reported a UTF-8 bug in the latest versions of the Xerox tools.

Overgeneration problem:
{{{
Abakalikilaččabuččaideamet      A^ba^ka^li^kilač^ča^buč^čai^dea^met
Abakalikilaččabuččaideamet      A^ba^ka^li^kilač^ča^buč^čai^dea^met
Abakalikilaččabuččaideamet      A^ba^ka^li^kilep^po^žiid^dá^met
Abakalikilaččabuččaideamet      A^ba^ka^li^kilep^po^žiid^dá^met
Abakalikilaččabuččaideamet      A^ba^ka^li^kilab^bo^žiid^dá^met
Abakalikilaččabuččaideamet      A^ba^ka^li^kilab^bo^žiid^dá^met
Abakalikilaččabuččaideamet      A^ba^ka^li^ki#lač^ča^buč^čai^dea^met
Abakalikilaččabuččaideamet      A^ba^ka^li^ki#lep^po^žiid^dá^met
Abakalikilaččabuččaideamet      A^ba^ka^li^ki#lep^po^žiid^dá^met
Abakalikilaččabuččaideamet      A^ba^ka^li^ki#lab^bo^žiid^dá^met
Abakalikilaččabuččaideamet      A^ba^ka^li^ki#lab^bo^žiid^dá^met
}}}

Unrecognised input:
{{{
abbalaččaboiid          abbalaččaboiid          +?
Abbalaččaboiidda        Abbalaččaboiidda        +?
abbalaččaboiidda        abbalaččaboiidda        +?
Abbalaččaboiiddáde      Abbalaččaboiiddáde      +?
abbalaččaboiiddáde      abbalaččaboiiddáde      +?
Abbalaččaboiiddádeguin  Abbalaččaboiiddádeguin  +?
}}}

What is the percentage of unrecognized input? (i.e. of word forms generated by
the generator but not accepted by the analyser)? Answer:

* Wordlist generated = 27 354 884
* unrecognised wordforms = 11 284 570
* percentage unrecognised = 41,3 %


This is way too high, and must be investigated. The input is generated from our
normative, non-circular version of the lexicon, and should thus only contain
recognised word forms already in the lexicon.

TODO:
* Update the {{sma}} hyphenator rule set with the insights gained from {{smj}}
  updates (__Trond__ during weekends)
* linguistic issues in the hyphenator: (__Sjur__ using __M4__)
** shortening in compounds (grossly overgenerates in the hyphenator output)
*** done - the shortening rules are now filtered out by __M4__ when compiling
    the twol for hyphenation and other normative uses
* consider case conversion problems (__Saara__)
* check #- when comparing input string with hyphenated string - the hyphen
  ''could'' be part of the input string, keep it if it is, discard the
  hyphenated string if not (__Saara__)
* investigate unrecognised input word forms (__Maaren, Thomas, Trond, Sjur__)


!!Automatic Bugzilla reminder for untouched bugs

Some perl-libraries needed by Bugzilla weren't in the path, causing it to not
work. Adding them should fix the issue.

TODO:
* fix the remaining issues (__Børre__)


!!M4

__TODO:__
* make speller make targets that utilise M4 to produce normative
  and hyphenation transducers - postponed a while (__Tomi__ or __Sjur__)
** done
* disamb variants for derivations (see next) (__Saara__)
** include make as part of a shell script, which will run the M4 processor on
   the disamb rules (__Saara__)
*** done - but the disambiguation REMOVE rules need to be added in again for
    the benefit of the disamb project. What the syntacticians really want is to
    have this done in the cg-3 interface.


!!!7. Linguistics

!!Names and multilinguality

We need a more principled approach to this.

Background: the name lexicon is getting attention from the SD name/terminology
sections, and they would like to use our name lexicon also for public searching.

Observations:

1) Multilinguality is always optional.

2) We can observe that "foreign" names in texts follows a domination pattern:
majority language forms can be found in minority language texts as real names
("Kautokeino produkter"), whereas minority language names ''almost always''
occur in majority language texts as citations. And citations should not be
considered a natural part of the text.

3) When looking at our name classification, multilinguality varies according to:

{{{
Ani - weak/none? (pet, myth anim.  names)
Fem - weak (informative)
Mal - weak (informative)
Obj - strong
Org - strong
Plc - strong for the national and country names, weak (informative) for foreign
       names
Sur - none
Tit - strong (titles)
}}}

Suggestion:

We need to reconsider the ''all names in all languages'' policy. That policy is
valid only for {{Fem, Mal,}} and {{Sur}} (and Ani and Tit?). For
{{Obj, Org, Plc}} the rule should be that if they have multilingual names, each
name should only be used in it's own language. Then we need a modification
saying that majority language names can be included in minority language
lexicons __if attested__ in our corpus. Also, the majority language varies
according to country (obviously), which means that in a speller context, we
might consider tailoring spellers for each country, leaving out noise relating
to majority language names from another country.

A further issue is whether we should reconsider our cohort policy. Today, Sur
and Plc are __different__ readings. An alternative would be to have them as
secondary tags, not in conflict with each other:

{{{
"<Trosterud>"
        "Trosterud" N Prop Sur Sg Nom <<< @HNOUN
        "Trosterud" N Prop Plc Sg Nom <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom <Sur> <Plc> <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom &Sur &Plc <<< @HNOUN
}}}


!!Derivation and spellers like Aspell

* enclose the CG rule that preferres lexicalised forms over derivations in M4
  (__Saara__)
** done
* find and study all derived verbs in our corpus (__Thomas__)
* suggest which derivations could be generated (__Thomas__)
* lexicalise the rest (__Thomas__)


!!North Sámi

We have a lot of overgeneration, mainly possible derivations that are spelled
out. We need to investigate what is going on, and evaluate whether it is a
problem.

TODO:
* investigate the generated word form list sent to Polderland - use the command
  {{make pl-wordlist TARGET=sme}} in ''victorio'' (__Maaren__, __Thomas__)


!!Lule Sámi

TODO:
* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt
  (__Thomas, Trond__)
* hire new linguist (__Sjur__)


!!!8. Name lexicon infrastructure

Decided in Tromsø:
* add logging facilities to the interface
* add option to download local copies of the lexicon files directly from the db
* batch editing (change all entries in the found set), should later be enhanced
  to allow selection of exceptions (the found set minus deselected items)
* tag for excluding/including a name from certain applications
* future epxansion: choose what info to display in the single language browser
* display existing language entries when adding a new language to a record
* add editor to change single, existing entries


Details can be found in [the meeting
memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]

TODO:
* develop the needed XQueries and UI (__Sjur, Tomi__)
* turn Tomcat on on the G5; send admin username and password to __Sjur__
  (__Børre__)
* add eXist and the proper noun interface to the G5 (__Sjur__)

Postponed:
* data synchronisation between risten.no and the cvs repo
* new version of xml2lexc (based on ccat), should handle complex names correct:
  construct entries like we have now from the different parts of a complex name
  entry


!!!9. Spellers

!!News

The Greenlandic speller was released today. It is available at the website of
the [greenlandic language council.|http://www.oqaasileriffik.gl/dk/] It is
available for free, and is based on the same technology as our own.

!!Speller data generation

It reads the lexc files, reads each word, and requests (or will request) the
generator server to expand each base form to its full paradigm.

Still missing: the generator isn't yet available, __Tomi__ should implement a
first version while __Saara__ is away. It should allow requests for:
* single word forms
* reduced paradigm
* full paradigm (all possessives, clitics etc, including compound forms)

The Java code needs to be checked in into cvs, suggested location:

{{{
gt/src/lexc2xspell
}}}

__TODO:__
* add code to cvs (__Tomi__)
* implement generator server based on Saara's code (__Tomi__)
* specify compounding behaviour info for the lexicon (__Thomas, Trond, Sjur__)
* add hyphenation points to the generated output (__Tomi__)


!!!10. Other

!!Bug fixing

__66__ open Divvun/Disamb bugs, and __24__ risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)


!!Meetings and the SD Firewall

__TODO:__
* investigate use of proxies for __Maaren__ (__Børre__)
** done and solved nicely!
* document SquidMan use at SD (__Børre__)


!!Task lists as iCal entries

TODO:
* update Maaren's installation to r430284 (__Børre__)


!!Words section

You all need to check out CVS/words, and link to the relevant place, cf how
gt/doc/ is linked.

!!!11. Next meeting, closing

Next meeting 23.10.2006 at 9:30.

__Sjur__ will be away on Thursday and Friday this week.
__Trond__ will be away next week.

Closed at 11:16.

!!!Appendix -task lists for the next week

!! Boerre

* corpus collection:
** contact __Ája__ (Kåfjord)
** discuss access to older Min Áigi and Áššu files with __Richard Valkeapää__
* Move norwegian documents in Min Áigi from sme to nob
* set up Bugzilla automatic reminders for open issues
* finish Forrest i18n work (pdf)
* set up Tomcat on the G5 for use with eXist and the propnouns db, as well as
  Forrest in its i18n form
* document SquidMan use at SD
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Maaren

* investigate the generated word form list sent to Polderland - use the command
  {{make pl-wordlist TARGET=sme}} in ''victorio''
* investigate unrecognised input word forms in the hyphenator


!! Saara

* add more texts to the graphical corpus interface
* finalize server of the Xerox tools.
* generate parallel corpus files manually (with __Trond__)
* Improve text_cat
* export corpus tools to location available to all (with cron), cf news disc.
* Improve hyph-filter.pl
* help Trond with some shell commands
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Sjur

* name lexicon:
** refactor SD-terms editor code
** implement missing propnouns editing functions
** implement improvements decided upon in Tromsø
* move corpus user doc issue to Bugzilla
* hire linguist and programmer
* finish i18n work of Forrest
* install eXist and our local copy of risten.no and propnouns on the G5
* investigate unrecognised input word forms in the hyphenator
* decide how to specify compounding behaviour info in the lexicon
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Thomas

* refine smj proper noun lexica, cf. the propernoun-smj-lex.txt 
* find and study all derived verbs in our corpus
* suggest which derivations could be generated
* investigate unrecognised input word forms in hyphenator
* investigate the generated word form list sent to Polderland - use the command
  {{make pl-wordlist TARGET=sme}} in ''victorio''
* decide how to specify compounding behaviour info in the lexicon


!! Tomi

* continue implementation of the speller lexicon conversion
* make generator as server, based on __Saara's__ code
* add lexc2xspell code to cvs
* add hyphenation points to the generated output
* [fix bugs!|http://giellatekno.uit.no/bugzilla]


!! Trond

* refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
* Get more {{sma}} texts to improve language recognition
* study paragraphs with mixed content
* add corpus user accounts and access issues to Bugzilla
* investigate unrecognised input word forms in the hyphenator
* decide how to specify compounding behaviour info in the lexicon
* [fix bugs!|http://giellatekno.uit.no/bugzilla].