!!!Meeting setup

* Date: 06.06.2006
* Time: 09.30 Norw. time
* Place: Where we are
* Tools: SubEthaEdit, iChat

!!!Agenda

# Opening, agenda review
# Reviewing the task list from two weeks ago
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Infrastructure
# Linguistics
# Name lexicon infrastructure
# Spellers
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 09:37.

Present: __Børre, Saara, Sjur, Thomas, Trond, Tomi__

Absent: __Maaren__. __Saara__ leaving at 10.30.

Agenda accepted as is, some additions under ''Other''.

!!!2. Updated task status since last meeting

!! Børre
* corpus collection:
** send out contracts with accompanying letter
*** Harald Gaski and Lill-Tove Fredriksen
** Gather public texts, preferably also parallel ones
*** Done
** Continue converting text from input format to our xml
*** Updating metadata in MinAigi, using Saara's script
** convert nob and nno bible texts to be used as part of a parallel corpus
*** Not done
** review the paratext2xml converter
*** Not done
** convert smj NT to paratext
*** Not done
** Send out letters to the rest of the Iđut authors
*** Not done
** call Brita Kåven again towards the end of the week
*** Not done
** contact Ája (Kåfjord)
*** Not done
** create weekly cron job to mirror Odin URL and detect new/updated pages
*** Done
** wait for R. Valkeapää, call him this week
*** He needed some renaming script, said I would send him some of ours.
* public tender:
* corpus access:
** meeting 23.5. t 9.30: discuss and decide upon the exact access policy we want
*** Not done
** set up the unix group structure for corpus users (also __Saara, Trond__)
*** Not done
* install Gobby for __Thomas and Maaren__, send {{/opt/local/}} tarball to
  __Saara__
** Done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren
* On sick leave

!! Saara
* Create a parallel corpus of the New Testaments.
* add more texts to the graphical corpus interface
* grammatical searchability in the graphical corpus interface
* Implement links to parallel files in corpus header.
** sent a note to the newsgroup.
* Implement parallel corpus upload in web upload script
** not done
* Install Gobby
** not done
* clean the {{free/}} dir and test copying of only free texts to it
** done
* Add some language recognition flags to write into the xsl file
** done
* set up the unix group structure for corpus users (also __Børre, Trond__)
* count and examine pairs of proofread and nonproofed MinAigi files
** done
* evaluate addition of a text-only corpus in the Oslo interface (with __Trond__)
** done
* improve add-hyph-tags.pl to handle XML-structures better.
** done
* Updated decode.pm to handle msword titles.
** almost ready
* convert nob and nno bible texts to be used as part of a parallel corpus
** We don't have parallel text markup in corpus.dtd.
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* public tender:
** waiting for answers from the board, then act
*** received answers from all, as well as a clarifying e-mail from Finnut; sent
    a reply and comment based on the Finnut letter, suggesting that we continue
    to ask for clarification before we pick one of the offers as received; also
    checked the monetary situation for the whole project to guide the tender
    process.
* fst gymnastics to add hyphenation and word boundary marks to hyphenation
  transducer
* name lexicon:
** implement editing functions
** finalise refactoring for multiple collections (regular search interface)
*** progress - all look-up functions are now multi-collection aware
* change corpus-summary processing to generate smaller pages
** not done
* send bug report to Apple re filename matching and accented characters in
  Terminal
** done
* meeting 23.5. t 9.30: discuss and decide upon the exact access policy we want
* give copies of the SEE JSPWiki mode to others
** checked in to CVS, it can be installed from there with a double-click
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas
* correct hyphenation of exceptions (sme)
** done
* work on compounding and derivation
** nothing done this week
* sme G3 issue
** nothing done this week

!! Tomi
* new proper name lexicon
** data synchronisation of proper nouns between risten.no and CVS
** XQuery refactoring and code development for our proper noun editor
** new version of xml2lexc (based on ccat), should handle complex names correctly:
   construct entries like we have now from the different parts of a complex name
   entry
* read aligner docu, install, provide feedback
* install and test Gobby
* Set up the mechanism for the hash-mark transducer package
* add ccat option to analyse text while keeping the xml tags and structure
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond
* better smj NT text
** Not done
* get fin, swe, nob and nno NT and OT in __paratext__ format
** Not done
* install aligner, test it and give feedback
** Not done
* fst gymnastics to add hyphenation and word boundary marks to hyphenation
  transducer
** Not done
* get/upgrade keys for __Børre's__ room for __Tomi__ and __Thomas__
** No response from key office, the key person has been away
* Rethink the doubletagging procedure for names, consider grammatically
  motivated semtag conversion routines ("Helsinki" from Plc to Obj to Org)
** The rethinking is done, implementation is needed.
* Work on the graphical corpus tag list
** Done some, but awaiting Tomi's script to get this issue going.
* send __Saara__ smj files for language recognition
** No new smj files.
* create a short smj word list to help the trigram heuristics
** This was done earlier.
* Call Árran again, and then transfer smj corpus discussions to __Børre__
** Worked on this issue
* Put __Saara__ and __Tero__ in contact with each other
** Awaiting the dust to be settled after his new employment.
* ask __Lars Nygård__ and __Tero Avellan__ to install Gobby 0.3
** Issue open.
* continue linguistic discussions while Maaren is in Tromsø
** Had one session.
* meeting 23.5. t 9.30: discuss and decide upon the exact access policy we want
** Not done
* set up the unix group structure for corpus users (also __Børre, Saara__)
** Not done
* evaluate addition of a text-only corpus in the Oslo interface (with __Saara__)
** We still go for the "everything first" strategy
* [fix bugs!|http://giellatekno.uit.no/bugzilla].
** Worked on bugs.

!!!3. Documentation

TODO:
* documentation on how to apply for a user account for the corpus repo
  (__Børre__)
** we will administer the corpus user accounts ourselves
** We first have to discuss and decide what we want before __Børre__ can write
   documentation (see further down)

!!!4. Corpus gathering

!!Collecting

See a [previous meeting memo|Meeting_2006-01-16] for what's to be done.

TODO:
* Send out the rest of the letters (__Børre__)

New contracts:
* none last 2 weeks

!!Olavi Korhonen's Lule Sámi dictionary.

TODO:
* set up user account/corpus access for Olavi (__Børre__)
** we need the infrastructure for corpus access in place first

!!KIO Grafisk and the Iđut books

__TODO__:
* send letters to the authors (__Børre__)
** talked to Kurt Tore Andersen, will get a letter from us (__Børre__)

!!Bible texts

We will get text from Finland, but still haven't received any. Swedish html has
arrived, but no paratext. Norsk bibelselskap has not sent corrected New
Testament versions for sme, nor paratext for nno/nob.

TODO:
* convert smj NT to paratext (__Børre__)
* get fin, swe, nob and nno NT and OT in __paratext__ format. (__Trond__)

!!Davvi Girji

Talked to Harald Gaski, he will send the contracts to the writers' organisation,
to ensure the contract is ok.

__TODO__:
* call Brita Kåven again towards the end of the week (__Børre__)
* call the authors (__Børre__)
** Harald Gaski

!!Min Áigi

__TODO__:
* send bug report to Apple (typing filenames in Terminal does not match, moving
  filenames across platforms via tarballs creates problems - the names are not
  the same!) (__Sjur__)
** done
* complete metainformation (__Børre__)

!!Kåfjord

Promised to send us texts. Some texts have arrived, but nothing from Ája.

__TODO__:
* contact Ája (__Børre__)
** Not done

!!Sámi Instituhtta

TODO:
* wait for R. Valkeapää, call him this week (__Børre__)

!!Čálliid Lágádus

__Børre__ talked to them, they are positive, and will give us print-ready PDFs.
[http://www.calliidlagadus.org/]

!!Árran

The negotiations are underway, they discuss it in a meeting today, and we have
an appointment for later today or tomorrow (both their internal texts, and
discussions on Báhko). All further negotiations will go through Bård Eriksen
/ Báhko.

TODO:
* call Bård Eriksen (__Børre__)

!!!5. Corpus infrastructure

!!User accounts and access

TODO:
* discuss and decide upon the exact access policy we want to give corpus users;
  meeting on Tuesday May 23, at 9.30 (__Børre, Sjur, Trond__)
** still not done, new date: Friday 9.6.2006, at 9.30
* set up the unix group structure to open for a new category of users:
  Linguists with access to the texts on the servers. Some of
  __Børre, Saara, Trond__.
* make a text-only corpus in the Oslo interface (dump the texts on omilia), 
  thereby making them accessible to simple lexical lookup. The cost of doing
  this should be evaluated against what it takes waiting for the analysed
  corpus to be in place. (__Saara, Trond__)
** Discussion on this issue after the previous meeting
*** Conclusion: skip the text-only version, and go for the analyzed one right
    away

!!More texts to the graphical corpus interface:

We need to get the infrastructure complete to be able to do this, then it
should be a piece of cake.

Main obstacle for further progress: a tool to analyse text while keeping the xml
structure. Here's some ''very'' simple pseudo-code:

# parse xml
# send each <p> node to lookup etc
# analyse it
# add </s><s> after each <.>, <!>, <?> mark
# wrap up the analysis output in xml
# put it in a <p> node again

The first implementation in Perl (__Saara__), later possibly a C implementation
(__Tomi__) for speed reasons.
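The pseudo-code above can be sketched roughly as follows. This is only an illustration under stated assumptions: it is written in Python rather than the planned Perl, the {{analyse}} function is a hypothetical stand-in for the real lookup tools, and the {{<s>}} and {{<w>}} element names and the {{analysis}} attribute are invented for the example, not taken from corpus.dtd.

```python
import re
import xml.etree.ElementTree as ET


def analyse(token):
    """Hypothetical stand-in for the real morphological lookup."""
    return token + "+?"


def split_sentences(text):
    """Split the text after each '.', '!' or '?' mark."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def analyse_document(xml_string):
    """Analyse the text of every <p> node and leave all other nodes untouched."""
    root = ET.fromstring(xml_string)
    for p in list(root.iter("p")):
        text = "".join(p.itertext())
        # clear() also drops attributes and tail, so save and restore them
        attributes, tail = dict(p.attrib), p.tail
        p.clear()
        p.attrib.update(attributes)
        p.tail = tail
        for sentence in split_sentences(text):
            s = ET.SubElement(p, "s")
            # words and punctuation marks become separate tokens
            for token in re.findall(r"\w+|[^\w\s]", sentence):
                w = ET.SubElement(s, "w", analysis=analyse(token))
                w.text = token
    return ET.tostring(root, encoding="unicode")
```

Everything outside the {{<p>}} nodes (header, metadata, section structure) passes through unchanged, which is the property the corpus interface needs. Inline markup inside a paragraph is flattened in this sketch; a real implementation would have to decide how to keep it.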

TODO:
# make an analyser that retains the xml structure (__Saara__)
# Finish the tag unification (korpustags.txt) (__Trond__)
## progressing, but open questions remain (awaiting text to see how things
   crash)
# add text to the server (__Lars__)

!!Aligner

__Trond__ and __Saara__ will continue this issue.

We need markup of parallelism in the corpus DTD, at least an indication of which
documents belong together. Discussion to continue in the newsgroup (__Saara__
has started it - please respond!).

!!Language recognition

TODO:
* create a short smj word list to help the trigram heuristics (__Trond__)
** done (but the smj.lm file suffers from the narrow corpus base. The smj.wm
   file will also improve with more texts, but not that much)
* Add some flag to write into the xsl file (__Saara__)
** done

!!Free and non-free texts

TODO:
* test the script(s) copying texts to the free/ dir which are explicitly
  marked as free (__Saara__)
** done

!!Corpus summary

TODO:
* trim generated corpus summary pages (__Sjur__)
** suggestion: lump together files with content less than X paragraphs (X < 5?)
** nothing done
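
The lumping suggestion could look something like the sketch below. It assumes that the summary generator already has a list of (filename, paragraph count) pairs; the function name and the default threshold of 5 are made up for the illustration, and the real value of X is still open.

```python
def group_for_summary(files, min_paragraphs=5):
    """Split files into those that get their own summary entry and those
    that are lumped together because they have fewer than min_paragraphs."""
    individual = [(name, count) for name, count in files
                  if count >= min_paragraphs]
    lumped = [(name, count) for name, count in files
              if count < min_paragraphs]
    return individual, lumped
```

The lumped files would then share one summary page instead of generating one page each.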

!!Proofed vs unproofed corpus files

TODO:
* count the number of parallel files of type unproofed/proofread (__Saara__)
** done, 35 such file pairs
** delayed for the time being, move to Bugzilla (__Sjur__)

!!!6. Infrastructure

!!Paradigm generation

Goal: Reuse Greenlandic code for paradigm generation.

TODO:
* Put __Saara__ and __Tero__ in contact with each other (__Trond__)
** still open, awaiting both Tero and the more crucial corpus tasks
* The paradigm generator should also have an xml-out option (for use in e.g.
  electronic dictionaries) - on hold till we know more about the generator
** move to Bugzilla (__Sjur__)

!!Hyphenator

__Thomas__ is finished with adding ^ tags to the __sme__ noun file.

__Trond__ and __Thomas__ have been working on the smj rule component, and have
improved both the treatment of weak grade consonant clusters (preconsonantal
geminates) and on some loan word patterns.

TODO:
* correct the treatment of hyphenation of word boundaries and exceptions (fst
  gymnastics) (__Sjur, Trond__)
** Still not done - postponed till next week (w 24)
* Update the sme and sma rule sets with the insights gained from smj updates.
* add exception marks to the sme lexicons (__Thomas__)
** done

!!Automatic Bugzilla reminder for untouched bugs

We need to get a summary by e-mail for all bugs not touched in more than 5(?)
weeks. Do we want it? Yes. Is it possible? __Børre__ will look around; if
nothing is found, he'll ask Thor-Øivind.

TODO:
* set up Bugzilla to send automatic reminders for bugs not touched in a given
  timeframe; ask Thor-Øivind for help if needed (__Børre__)
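
If Bugzilla itself turns out not to offer this, the selection logic is simple enough to script. The sketch below is an assumption-laden illustration: it does not talk to Bugzilla at all, and the bug records (dictionaries with {{id}}, {{summary}} and {{last_changed}} keys) are a hypothetical format that would have to be filled from whatever export we can get.

```python
from datetime import date, timedelta


def stale_bugs(bugs, today, max_age_weeks=5):
    """Return the bugs whose last change is older than max_age_weeks."""
    cutoff = today - timedelta(weeks=max_age_weeks)
    return [bug for bug in bugs if bug["last_changed"] < cutoff]


def reminder_text(bugs, today, max_age_weeks=5):
    """Build the body of a plain-text reminder e-mail, oldest bug first."""
    stale = sorted(stale_bugs(bugs, today, max_age_weeks),
                   key=lambda bug: bug["last_changed"])
    lines = ["Bugs not touched in more than %d weeks:" % max_age_weeks]
    for bug in stale:
        lines.append("  #%d  %s  (last changed %s)"
                     % (bug["id"], bug["summary"],
                        bug["last_changed"].isoformat()))
    return "\n".join(lines)
```

Run from a weekly cron job, the output of {{reminder_text}} could be mailed to the list; the 5-week limit is just the tentative figure mentioned above.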


!!!7. Linguistics

!!North Sámi

Topic: Actio+compound - how productive? How much does it destroy speller
performance by being close to real spelling errors with a high editing distance
to the correct form?

{{{
Spelling error from typos.txt:
4       vuolggahansádji             vuolggasadji

vuolggahan
vuolggahan      vuolgga+N+Sg+Nom+Foc
vuolggahan      vuolggahit+V+TV+PrfPrc
vuolggahan      vuolggahit+V+TV+Ind+Prs+Sg1
vuolggahan      vuolggahit+V+TV+Actio+Acc
vuolggahan      vuolggahit+V+TV+Actio+Gen
vuolggahan      vuolggahit+V+TV+Actio+Nom

vuolggahansadji
vuolggahansadji vuolgga+N+SgNomCmp#ho+N+SgNomCmp#atnu+N+SgNomCmp#sadji+N+Sg+Nom
vuolggahansadji vuolgga+N+SgNomCmp#ho+N+SgNomCmp#atnu+N+SgNomCmp#sadji+N+Sg+Nom
vuolggahansadji vuolgga+N+SgNomCmp#ho+N+SgNomCmp#atnu+N+SgNomCmp#sadji+N+Sg+Nom
vuolggahansadji vuolggahit+V+TV+Actio#sadji+N+Sg+Nom
vuolggahansadji vuolggahit+V+TV+Actio#sadji+N+Sg+Nom
vuolggahansadji vuolggahit+V+TV+Actio#sadji+N+Sg+Nom
}}}

TODO:
* Maaren is in Tromsø this week, we should continue our linguistic discussions
** not done, and Maaren is now back in Guovdageaidnu (?)

!!Lule Sámi

There are some open issues in the marginal area of the smj transducer:
* numerals: our poor treatment of number words becomes more visible
  as we get real texts (e.g. guhttatuvsánanielljatjuodegålmmålåkvihtta is not
  recognised).
* names => waiting for the new name lexicon
* compounds? Shortening here as well, but not in written language (some
  lexicalised exceptions, as in sme) =>  nothing to discuss/implement
* loanwords? We should consider importing the ^LOAN words from sme and 
  Lule-ify them. Otherwise, loan words should come from corpora.

TODO:
* 50 unknown words left+2 abbr. +moaddi etc (numerals) need more checks
* add proper numeral analysis/treatment
* add loanwords (e.g. Latin -ere verbs)

!!!8. Name lexicon infrastructure

TODO:
# finish refactoring for multiple collections in the search interface
  (__Sjur__)
## progressing, still major work to be done
# develop the needed XQueries and interface (__Sjur, Tomi__)
## progressing, done some, committed
   (adding new term, create-termc-entry.xq)
# data synchronisation between risten.no and the cvs repo (__Tomi__)
## discussion started on eXist-list, nothing useful came up. We need to
   reformulate the question from our perspective, and bring it up again
   (__Sjur__)

Timeline:

* One-time conversion lexc2xml (sme to common)
** done (must be redone at D-day, but the *script* is done)
* editing functions in risten.no - what about editing in emacs/see/other editors
** in the works
* Automatic conversion: xml2lexc (modulo language) (based on ccat)
  (ready to be executed) (__Tomi__)
** not done

!!!9. Public tender

TODO:
* act, based on the answers (__Sjur__)
** done: we are asking for more clarification, and will then pick the best
* write clarification letters to PL and LS (__Sjur__ with help from
  __Børre, Trond__)
** LS - done
** PL - has started, not finished

!!!10. Other

!!Summer vacation

|| Who  || When
| Børre  | August
| Linda  | ?
| Maaren | ?
| Saara  | July
| Sjur   | at least some in July, but still open
| Thomas | 3.7 - 7.8
| Trond  | 3.7 - 14.8 (last two weeks off at summer school)
| Tomi   | 8.7 - 16.7, 2 more weeks in July and/or August

!!Bug fixing

__43__ open Divvun/Disamb bugs (two down!), and __25__ risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Please help __Saara__ with
[bug 279|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=279]
(Perl locale). Not much help...
__Saara__ will contact __Roy__ on this issue.

After the corpus issues have been somewhat settled, we should do a bug
barnraising. ... and then a new one after the name lexicon is fixed.

!!Forrest: Unicode & JSPWiki

__Sjur__ has identified a way of making {{forrest run}} read the UTF-8 encoded
jspwiki files correctly:

{{{
forrest run -Dforrest.jvmargs="-Dfile.encoding=utf-8"
}}}

It is added to our Forrest documentation.

!!RAM on our G5

We got some defective RAM. Our business contact has received a whole batch of
defective RAM, and they are investigating this. After that is resolved we will
receive "good" RAM.

!!Gobby

0.3 is working fine on Mac, Linux and Windows. Should be installed on all
computers, cf.
[http://darcs.0x539.de/trac/obby/cgi-bin/trac.cgi/wiki/InstallationGuide]
(our preinstalled Xcode version is 2.0, must be 2.1):

* Børre - ok
* Maaren - ok
* Saara - todo
* Sjur - ok
* Thomas - ok
* Tomi - ok
* Trond - ok

__Trond__ has asked __Lars Nygård__ and __Tero Avellan__ to
install Gobby as well. __Per Langgård__ is already using it.

!!SEE 2.5 extensions

TODO:
* give the jspwiki syntax colouring mode to all (__Sjur__)
** JSPWiki.mode checked in to cvs in {{gt/src/}} - just double-click the mode!
* write a perl script to extract all TODO items and sort them according to
  responsible person; wrap the script in an AppleScript available from the menu
* other modes on the wish list:
** twol mode
** xfst mode
** lexc mode
** vislcg mode

!!!11. Next meeting, closing

12.06.2006 09:30

Closed at 11:08

!!!Appendix - task lists for the next week

!! Børre
* corpus collection:
** send out contracts with accompanying letter
** Gather public texts, preferably also parallel ones
** Continue converting text from input format to our xml
** convert nob and nno bible texts to be used as part of a parallel corpus
** review the paratext2xml converter
** convert smj NT to paratext
** Send out letters to the rest of the Iđut authors
** send contract to __Kurt Tore Andersen__
** call __Brita Kåven__ again
** call __Harald Gaski__
** contact __Ája__ (Kåfjord)
** Send renaming scripts to __R. Valkeapää__
** complete __Min Áigi__ metadata
** call __Bård Eriksen__
* corpus access:
** meeting 9.6. t 9.30: discuss and decide upon the exact access policy we want
** set up the unix group structure for corpus users (also __Saara, Trond__)
* set up Bugzilla automatic reminders for open issues; ask Thor-Øivind if needed
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren
* On sick leave

!! Saara
* Create a parallel corpus of the New Testaments
* add more texts to the graphical corpus interface
* grammatical searchability in the graphical corpus interface
* Implement links to parallel files in corpus header
* Implement parallel corpus upload in web upload script
* Install Gobby
* set up the unix group structure for corpus users (also __Børre, Trond__)
* Test the aligners once again
* make an analyser that retains the xml structure for Oslo
* discuss parallel text markup in newsgroup
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* public tender:
** write e-mail to PL and ask for more clarification
* fst gymnastics to add hyphenation and word boundary marks to hyphenation
  transducer
* name lexicon:
** implement editing functions
** finalise refactoring for multiple collections (regular search interface)
* change corpus-summary processing to generate smaller pages
* meeting 9.6. t 9.30: discuss and decide upon the exact access policy we want
* move to Bugzilla:
** proofed/unproofed parallel text issue
** xml output of paradigm generator
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas
* hyphenation-rule-set 
* work on compounding and derivation
* Lule Sámi incoming words
* sme G3 issue

!! Tomi
* new proper name lexicon
** data synchronisation of proper nouns between risten.no and CVS
** XQuery refactoring and code development for our proper noun editor
** new version of xml2lexc (based on ccat), should handle complex names correctly:
   construct entries like we have now from the different parts of a complex name
   entry
* read aligner docu, install, provide feedback
* Set up the mechanism for the hash-mark transducer package
* add ccat option to analyse text while keeping the xml tags and structure
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond
* better smj NT text
* get fin, swe, nob and nno NT and OT in __paratext__ format
* install aligner, test it and give feedback
* fst gymnastics to add hyphenation and word boundary marks to hyphenation
  transducer
* Put __Saara__ and __Tero__ in contact with each other
* meeting 9.6. t 9.30: discuss and decide upon the exact access policy we want
* set up the unix group structure for corpus users (also __Børre, Saara__)
* [fix bugs!|http://giellatekno.uit.no/bugzilla].