!!!Meeting setup

* Date: 22.05.2006
* Time: 09.30 Norw. time
* Place: Where we are
* Tools: SubEthaEdit, iChat

!!!Agenda

# Opening, agenda review
# Reviewing the task list from two weeks ago
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Infrastructure
# Linguistics
# name lexicon infrastructure
# Spellers
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 09:56.

Present: __Børre, Sjur, Thomas, Trond, Tomi__

Absent: __Maaren, Saara__

Main secretary: __Thomas__ (with help from others)

Agenda accepted as is.

!!!2. Reviewing the task list from the last meeting

!! Børre
* corpus work:
** send out contracts with accompanying letter
*** Sent to Elle Márjá Vars
** Gather public texts, preferrably also parallel ones
*** Gathered, some added to the corpus
** Continue converting text from input format to our xml
*** Done
** convert nob and nno bible texts to be used as part of a parallel corpus
*** Not done
** review the paratext2xml converter
*** Not done
** convert smj NT to paratext
*** Not done
** Send out letters to the rest of the Iđut authors
** call Brita Kåven
*** Done
** contact Kåfjord
*** Will have to contact Ája
** create weekly cron job to mirror Odin URL and detect new/updated pages
*** Not done
** Check the status & license of the corpus texts
*** Done some
** contact Korhonen & Kuhmunen
*** Done
* public tender:
** continue public tender offer evaluation
*** meeting with Sjur, Tomi and Trond on friday afternoon
** meeting on Thursday 18.5. at 10.00 AM with __Sjur, Tomi, Trond__
*** see above
** telephone meeting with Finnut
*** Not done
* install latest SEE
** Not done
* install Gobby using Darwin Ports (also for __Thomas and Maaren__)
** Done for me
* move to Bugzilla:
** write  docu for how to apply for a corpus user account
*** Not done, but talked to Roy Dragseth about this (more below).
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren
* On sick leave

!! Saara
* been on sick leave
* Create a parallel corpora of the new testaments.
* add more texts to the graphical corpus interface
* grammatical searchability in the graphical corpus interface
* Implement links to parallel files in corpus header.
** not done
* Implement parallel corpus upload in web upload script
** not done
* Check the status & license of the corpus texts and 
  change the default of the availability to "license" instead of "free"
** not done
* Rerun corpus conversion
** done on daily basis
* Install Gobby
** got problems during installation
* make xsl conversion routine for the typographic tags
** done
* update the corpus script(s) to only copy texts to the free/ dir which are
  explicitly marked as free
** not done
* Add some language recognition flags to write into the xsl file
** not done
* rename corpus dirs, and create symlinks
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* public tender:
** read & evaluate received offers
*** done
** meeting on Thursday 18.5. at 10.00 AM with __Børre, Tomi, Trond__
*** we had a meeting later (I had technical problems with my Mac at the time)
** telephone meeting Friday with Finnut
*** short call, will have the meeting on Monday 22 instead
* fst gymnastics to add hyphenation and word boundary marks to hyphenation
  transducer
** nothing
* name lexicon:
** implement editing functions
*** updated my local cvs sandbox to use the new server (victorio), and checked
    in my recent work
** finalise refactoring for multiple collections (regular search interface)
*** done some, but not finished
* update corpus-summary processing to adhere to the new structure
** done, but the generated page/file size for sme is way too large due to the
   large number of Min Áigi files. The corpus summary needs to be broken down in
   smaller parts, probably on source and year (will give 2-3000 documents/page)
* send bug report to Apple re filename matching and accented characters in
  Terminal
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas
* correct hyphenation of exceptions
** finished lule sámi, now working with north sámi
* correct hyphenation of smj -st-
** done
* work on compounding and derivation
** worked a little
* smj G3 issue
** finished
* sme G3 issue
** not done
* set up a linguistic workshop while Maaren is in Tromsø
** not done

!! Tomi
* __move to Bugzilla:__
** aspell UTF-8 suffix bug
*** done
** Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
   (it's almost done, but there are a couple of loose ends)
*** done
* new proper name lexicon
** implement data synchronisation of proper nouns between risten.no and CVS
*** not done
** XQuery refactoring and code development for our proper noun editor
*** some done
** new version of xml2lexc (based on ccat), should handle complex names correct:
   construct entries like we have now from the different parts of a complex name
   entry
*** not done
* read aligner docu, install, provide feedback
** not done
* install and test Gobby
** done, but not successfully
* Set up the mechanism for the hash-mark transducer package
** not done
* meeting on Thursday 18.5. at 10.00 AM with __Børre, Sjur, Trond__
** done
* add ccat option to analyse text while keeping the xml tags and structure
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond
* better smj NT text
** Not done
* get fin, swe, nob and nno NT and OT in __paratext__ format
** Not done
* install aligner, test it and give feedback
** Not done
* fst gymnastics to add hyphenation and word boundary marks to hyphenation
  transducer
** Not done
* get/upgrade keys for __Børre's__ room for __Tomi__ and __Thomas__
** Ordered upgrade.
* Rethink the doubletagging procedure for names, consider grammatically
  motivated semtag conversion routines ("Helsinki" from Plc to Obj to Org)
** Done some thinking
* Check the status & license of the corpus texts
** Not done
* Work on the graphical corpus tag list
** Done some.
* send __Saara__ smj files for language recognition
** Asked for text, discussing with Árran
* Put __Saara__ and __Tero__ in contact with each other
** Not done
* meeting on Thursday 18.5. at 10.00 AM with __Børre, Sjur, Tomi__
** Done
* ask __Lars Nygård, Per Langgård__ and __Tero Avellan__ to install Gobby 0.3
** Asked Per, not the others.
* set up a linguistic workshop while Maaren is in Tromsø
** Not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla].

!!!3. Documentation

TODO:
* documentation on how to apply for a user account for the corpus repo
  (__Børre__)
** we will administer the corpus user accounts ourselves
** We first have to discuss and decide what we want before __Børre__ can write
   documentation (see further down)

!!!4. Corpus gathering

!!Collecting

See a [previous meeting memo|Meeting_2006-01-16] for what's to be done.

TODO:
* Send out the rest of the letters (__Børre__)

New contracts:
* none last week

!!Olavi Korhonen's Lule Sámi dictionary.

Phoned Korhonen. He was willing to sign the contracts, and wanted some
kind of access to our corpus. He wants to collect words for his Northern
Sámi dictionary.

TODO:
* set up user account/corpus access for Olavi (__Børre__)

!!KIO Grafisk and the Iđut books

__TODO__:
* send letters to the authors (__Børre__)

!!Bible texts

We will get text from Finland, but still haven't received any. Swedish html has
arrived, no paratext. Norsk bibelselskap has not sent 
corrected New Testament versions for sme, and not paratext for nno/nob.

TODO:
* convert smj NT to paratext (__Børre__)
* get fin, swe, nob and nno NT and OT in __paratext__ format. (__Trond__)

!!Davvi Girji

Called her last week. She said Davvi Girji os would give us permission
to use texts by their authors. She wasn't sure if we could get the texts
directly from Davvi Girji, because of copyrights of pictures and other
artwork in the books. Said they (Davvi Girji workers) were going to have a 
meeting right after our conversation by phone, and take some kind of 
desicion on this case. Haven't heard anything from Kåven, though.

__TODO__:
* call Brita Kåven again towards the end of the week (__Børre__)

!!Min Áigi

The Min Áigi format should be dealt with: \@ingress etc should be dealt with for
the .txt, but business as usal for the .doc files. __Saara__  has done
the xsl conversion routine for the typographic tags. It still needs some fine
tuning, as there are some @ tags that were not included in our initial list.

__TODO__:
* send bug report to Apple (typing filenames in Terminal does not match, moving
  filenames across plattforms via tarballs creates problems - the names are not
  the same!) (__Sjur__)
** not done yet
* make xsl conversion routine for the typographic tags (__Saara__)
** Done, some adjustment needed.

!!Kåfjord

Promised to send us texts. Some texts have arrived, but nothing from Ája.

__TODO__:
* contact Ája (__Børre__)

!!Sámi Instituhtta

__Børre__ contacted __Richard Valkeapää,__ the IT-consult at NSI. He put it on
his todo list, as he would have to contact the person who has worked with the 
newspaper texts anyway. He said this would be done in the near future (within
a month).

TODO:
* wait for R. Valkeapää, call him next week (__Børre__)

!!!5. Corpus infrastructure

!!User accounts and access

Talked to Roy Dragseth last week about this. Turns out that
we should do the administration ourselves. We will have to discuss on what
level we would like to give access to users of our corpus.

This is from our contracts:

{{{
Contract 1 

3.7 Mottakar kan gje personleg bruksrett til tekstsamlinga til personar som har
skrive under på bruksrettskontrakten i Vedlegg 2. Mottakar skal ikkje gje
bruksrett til tekstsamlinga til personar som ein har grunn til å tru vil bryte
vilkåra i kontrakten. Mottakar forpliktar seg til å informere avgjevar med ein
gong han/ho får kjennskap til mogleg brot på desse vilkåra.

(Vedlegg 2 = Contract 3) 

Contract 3

4.1 Brukaren har berre rett til å bruke tekstane til forsking eller slike
kommersielle språkteknologiske eller andre liknande formål, som ikkje bryt med
Lov om opphavsrett til åndsverk. Brukaren kan bruke tekstane for å gjere seg
nytte av dei språktrekka (t.d. statistisk informasjon, grammatiske reglar og
semantiske skildringar) han/ho har funne gjennom forsking, og plukke ut kortare
sitat frå tekstane.

5.3 Brukaren får ikkje ta større delar av løpande tekst i tekstsamlinga enn
korte sitat bort frå den tenaren som tekstsamlinga er installert på. Det er lov
til å lagre temporære kopiar på sjølve tenaren på det vilkåret at brukaren tek
omsyn til datatryggleiken. Denne avgrensinga gjeld ikkje offentlege dokument
(t.d. NOU, stortingsmeldingar o.l.). Det går klårt fram av kvart dokument kva
for lisens som er knytta til det.
}}}

TODO: 
* before anything else is done: Zip an xml-stripped version of our free texts
  and send them to Olavi. In that way, he gets something to work on, and we 
  will get feedback (__Børre__)
* discuss and decide upon the exact access policy we want to give corpus users;
  meeting on Tuesday May 23, at 9.30 (__Børre, Sjur, Trond__)
* set upp the unix group structure to open for a new category of users:
  Linguists with access to the texts on the servers. Some of
  __Børre, Saara, Trond__.
* make a text-only corpus in the Oslo interface (dump the texts on omilia), 
  thereby making them accessible to simple lexical lookup. The cost of doing
  this should be evaluated against what it takes waiting for the analysed
  corpus to be in place. (__who?__)

!!Name change again?

TODO:
* rename corpus dirs, and create symlinks (__Saara__)

!!Free and non-free texts

More info in a [previous meeting
memo.|/admin/weekly/2006/Meeting_2006-03-13.html]

TODO:
* Check the status of the texts, again. (__Børre, Trond, Saara__)
** some is done, not complete yet
* Rerun the conversion afterwards (__Saara__ is the one with the magic spell)
** in progress
* update the script(s) to only copy texts to the free/ dir which are explicitly
  marked as free (__Saara__)
** not done
* update the processing of the corpus summary files (__Sjur__)
** done

!!More texts to the graphical corpus interface:

TODO:
* We would like to have more than the NT in 
  [the graphical interface|http://omilia.uio.no/CE/sami/] (__Saara__)
** We add the largest texts first.
* We would like to have grammatical searchability, not only POS. (__Saara__,
  __Trond__  This presupposes a discussion with Oslo. (__Trond__ and __Saara__ 
  to continue this discussion)
* For Lule Sámi: We would like to have a parallel corpus interface with NT
  (text only). This presupposes better quality texts (__Trond, Børre__)
** Better Lule NT text still not made.
* The list of good candicates: The longest (admin) texts.
** We need a new option in ccat for analysing text while still keeping the
   xml tags and structure (__Tomi__)
** xml texts number <p>, preprocess finds <.> <?> <!> and ccat numbers them as
   <s>.
** Then the aligner aligns...

Top-three priorities:
# discuss more with Lars on tag unification, and unify them (__Trond__)
# change ccat to be able to create the right input for the corpus analysis
  (xml- tagged output) (__Tomi__). Work estimate: a few (2-3?) days - nothing
  this week, we will re-evaluate (and schedule) next Monday)
# add text to the server (__Lars__)

!!Language recognition

TODO:
* refine language recognition (__Saara__)
** in progress, continue discussion in
    [249|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=249]
* create a short smj word list to help the trigram heuristics (__Trond__)
* send __Saara__ smj files (__Trond__)
** Trond has tried to get files, see task report above
* Add some flag to write into the xsl file (__Saara__):
** method:do not run lg recognition
** method:Choose between these 2: nob, sme, etc.

!!!6. Infrastructure

!!Paradigm generation

Goal: Reuse Greenlandic code for paradigm generation.

TODO:
* Put __Saara__ and __Tero__ in contact with each other (__Trond__)
** still open
* The paradigm generator should also have an xml-out option (for use in e.g.
  electronic dictionaries) - on hold till we know more about the generator

!!Aligner

__Trond__ and __Saara__ will continue this issue.

!!Hyphenator

__Thomas__ is finished with adding ^ tags to the __smj__ noun file, and has
continued working on the __sme__ noun file.

__Trond__ and __Tomi__ have been working on the smj rule component, and have
improved both the treatment of weak grade consonant clusters (preconsonantal
geminates) and on some loan word patterns.

TODO:
* correct the treatment of hyphenation of word boundaries and exceptions (fst
  gymnastics) (__Sjur, Trond__)
** Still not done.  
* Update the sme and sma rule sets with the insights gained from smj updates.

!!!7. Linguistics

!!General - hyphenation

See discussion, open questions and decission in a [previous meeting
memo.|/admin/weekly/2006/Meeting_2006-04-03.html]

TODO:
* Set up the mechanism for the hash transducer package - fst gymnastics, see
  above.
* add exceptions marks to the {{smj}} lexicon (boks^távva)
** done

!!North Sámi

TODO:
* set up a linguistic workshop while Maaren is in Tromsø (and remember
  that she is on sick leave all the time) (__Trond, Thomas__)

!!Lule Sámi

There are some open issues in the marginal area of the smj transducer:
* numerals, e.g. Our poor treatment of number words becomes more visible 
  as we get real texts (guhttatuvsánanielljatjuodegålmmålåkvihtta is not
  recognised, etc.).
* names => waiting for the new name lexicon
* compounds? Shortening here as well, but not in written language (some
  lexicalised exceptions, as in sme) =>  nothing to discuss/implement
* loanwords? We should consider importing the ^LOAN words from sme and 
  Lule-ify them. Otherwise, loan words should come from corpora.

TODO:
* 50 unknown words left+2 abbr. +moaddi etc (numerals) need more checks
* add proper numeral analysis/treatment
* add loanwords (e.g. latin -ere verbs)

!!!8. Name lexicon infrastructure

TODO:
# finish refactoring for multiple collections in the search interfarce
  (__Sjur__)
## improving, not finished
# develop the needed XQueries and interface (__Sjur, Tomi__)
## progressing, done some, haven't commited
   (adding new term, create-termc-entry.xq)
# data synchronisation between risten.no and the cvs repo (__Tomi__)
## discussion started on eXist-list, we'll wait a couple of days to see what's
   coming out of it, and if nothing useful to us, we'll add our use case with
   questions
# test and review when ready
# Rethink the doubletagging procedure for names, consider grammatically
  motivated semtag conversion routines ("Helsinki" from Plc to Obj to Org, 
  or the Lyndi England issue) (__Trond__)

!!!9. Spellers

Nothing until the new proper noun lexicon is in place. We don't have enough
people to do both. Here's our most important targets for spellers in the near
future:

* aspell
* hunspell

!!!10. Public tender

Finnut called, and here's their evaluation: if we think that the offers are
incomplete or otherwise not fully acceptable, we can enter negotiations with the
companies, effectively cancelling the current public tender. This can __only__
be done as long as we don't change the public tender document (that is, the
foundation for the public tender) - if it is changed, we have to announce the
whole competition again, with the usual 53 days minimum deadline for
applications.

__Sjur__ will send an e-mail to the project board, outlining the different
aspects of the two offers, and ask about their opinion on the following
questions:
* accept the offers as is, or enter negotiations?
* if accept, which one?

TODO:
* meeting on Thursday 18.5. at 10.00 AM (__Børre, Sjur, Tomi Trond__)
** done (on Friday)
* telephone meeting with Finnut (__Børre, Sjur__)
** short call on Friday, he called today (during the meeting, see above)
* write e-mail to the project board, ask for their opinion regarding the offers
  (__Sjur__)

!!!11. Other

!!Summer vacation

|| Who  || When
| Børre  | ?
| Linda  | ?
| Maaren | ?
| Saara  | July
| Sjur   | ?
| Thomas | 3.7 - 7.8
| Trond  | July
| Tomi   | 8.7 - 16.7, more?

!!Bug fixing

__45__ open Divvun/Disamb bugs, and __25__ risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Please help __Saara__ with [bug
279|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=279]
 (Perl locale). Not much help...
__Saara__ will contact __Roy__ on this issue.

After the corpus issues have been somewhat settled, we should do a bug
barnraising. ... and then a new one after the name lexicon is fixed.

!!Gobby

0.3 is working fine on Mac, Linux and Windows. Should be installed on all
computers c.f.
[http://darcs.0x539.de/trac/obby/cgi-bin/trac.cgi/wiki/InstallationGuide]
(our preinstalled Xcode veriosn is 2.0, must be 2.1):

* Børre - ok by copying {{/opt/local/}} from __Trond__
* Maaren - Børre to do it
* Saara - todo
* Sjur - ok
* Thomas - Børre to do it
* Tomi - todo
* Trond - ok

Easy way out when the standard Darwin Ports installation isn't working:
just get a copy of {{/opt/local/}} from __Børre__.

__Trond__ should ask __Lars Nygård__ and __Tero Avellan__ to
install Gobby as well; has asked __Per Langgård__.

!!SEE autosave AppleScript

Copy the following into a ScriptEditor window:
{{{
tell application "SubEthaEdit"
	repeat until false is true
		save documents
		delay 60
	end repeat
end tell
}}}

and click "run". All your SubEthaEdit documents will be automatically saved
every minute (the interval can be changed by specifying another value (in
seconds) for {{delay}})

!!!12. Summary, task list

!! Børre
* corpus collection:
** send out contracts with accompanying letter
** Gather public texts, preferrably also parallel ones
** Continue converting text from input format to our xml
** convert nob and nno bible texts to be used as part of a parallel corpus
** review the paratext2xml converter
** convert smj NT to paratext
** Send out letters to the rest of the Iđut authors
** call Brita Kåven again towards the end of the week
** contact Ája (Kåfjord)
** create weekly cron job to mirror Odin URL and detect new/updated pages
** Check the status & license of the corpus texts
** wait for R. Valkeapää, call him next week
* public tender:
** assist with letter to the project board
* corpus access:
** zip an xml-stripped version of our free texts for to Olavi
** meeting 23.5 t 9.30: discuss and decide upon the exact access policy we want
** set up user account/corpus access for Olavi
** set upp the unix group structure for corpus users (also __Saara, Trond__)
* install latest SEE
* install Gobby for __Thomas and Maaren__
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren
* On sick leave

!! Saara
* Create a parallel corpora of the new testaments.
* add more texts to the graphical corpus interface
* grammatical searchability in the graphical corpus interface
* Implement links to parallel files in corpus header.
* Implement parallel corpus upload in web upload script
* Check the status & license of the corpus texts and 
  change the default of the availability to "license" instead of "free"
* rerun the corpus conversion
* Install Gobby
* update the corpus script(s) to only copy texts to the free/ dir which are
  explicitly marked as free
* Add some language recognition flags to write into the xsl file
* rename corpus dirs, and create symlinks
* set upp the unix group structure for corpus users (also __Børre, Trond__)
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* public tender:
** write letter to the project board
* fst gymnastics to add hyphenation and word boundary marks to hyphenation
  transducer
* name lexicon:
** implement editing functions
** finalise refactoring for multiple collections (regular search interface)
* change corpus-summary processing to generate smaller pages
* send bug report to Apple re filename matching and accented characters in
  Terminal
* meeting 23.5 t 9.30: discuss and decide upon the exact access policy we want
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas
* correct hyphenation of exceptions (sme)
* work on compounding and derivation
* sme G3 issue
* set up a linguistic workshop while Maaren is in Tromsø

!! Tomi
* new proper name lexicon
** implement data synchronisation of proper nouns between risten.no and CVS
** XQuery refactoring and code development for our proper noun editor
** new version of xml2lexc (based on ccat), should handle complex names correct:
   construct entries like we have now from the different parts of a complex name
   entry
* read aligner docu, install, provide feedback
* install and test Gobby
* Set up the mechanism for the hash-mark transducer package
* add ccat option to analyse text while keeping the xml tags and structure
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond
* better smj NT text
* get fin, swe, nob and nno NT and OT in __paratext__ format
* install aligner, test it and give feedback
* fst gymnastics to add hyphenation and word boundary marks to hyphenation
  transducer
* get/upgrade keys for __Børre's__ room for __Tomi__ and __Thomas__
* Rethink the doubletagging procedure for names, consider grammatically
  motivated semtag conversion routines ("Helsinki" from Plc to Obj to Org)
* Check the status & license of the corpus texts
* Work on the graphical corpus tag list
* send __Saara__ smj files for language recognition
* create a short smj word list to help the trigram heuristics
* Put __Saara__ and __Tero__ in contact with each other
* ask __Lars Nygård__ and __Tero Avellan__ to install Gobby 0.3
* set up a linguistic workshop while Maaren is in Tromsø
* meeting 23.5 t 9.30: discuss and decide upon the exact access policy we want
* set upp the unix group structure for corpus users (also __Børre, Saara__)
* [fix bugs!|http://giellatekno.uit.no/bugzilla].

!!!13. Next meeting, closing

29.05.2006 09:30

Closed at 11:35