!!!Meeting setup

* Date: 06.06.2006
* Time: 09.30 Norw. time
* Place: Where we are
* Tools: SubEthaEdit, iChat

!!!Agenda

# Opening, agenda review
# Reviewing the task list from two weeks ago
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Infrastructure
# Linguistics
# name lexicon infrastructure
# Spellers
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 09:59.

Present: __Sjur, Thomas, Trond, Tomi__

Absent: __Børre, Maaren, Saara__

Agenda accepted as is.

!!!2. Updated task status since last meeting

!! Børre
* corpus collection:
** send out contracts with accompanying letter
*** Maaren has signed a contract, which is sent to Davvi Girji
** Gather public texts, preferrably also parallel ones
*** Not done
** Continue converting text from input format to our xml
** convert nob and nno bible texts to be used as part of a parallel corpus
*** Not done
** review the paratext2xml converter
*** Not done
** convert smj NT to paratext
*** Not possible before we have another paratext NT.
** Send out letters to the rest of the Iđut authors
*** Not done
** send contract to __Kurt Tore Andersen__
*** Not done
** call __Brita Kåven__ again
*** Haven't got in touch with her
** call __Harald Gaski__
*** Met with him. He has to send contracts to his writers union, to check
    if the contracts are ok with them.
** contact __Ája__ (Kåfjord)
*** Not done
** Send renaming scripts to __R. Valkeapää__
*** Working on them
** complete __Min Áigi__ metadata
*** Halfway done
** call __Bård Eriksen__
*** Not done
* corpus access:
** meeting 9.6. t 9.30: discuss and decide upon the exact access policy we want
*** Delayed once more
** set upp the unix group structure for corpus users (also __Saara, Trond__)
*** Waiting for outcome of above meeting
* set up Bugzilla automatic reminders for open issues; ask Thor-Øivind if needed
** Not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren
* On sick leave

!! Saara
* Create a parallel corpora of the new testaments
* add more texts to the graphical corpus interface
* grammatical searchability in the graphical corpus interface
* Implement links to parallel files in corpus header
** done
* Implement parallel corpus upload in web upload script
* Install Gobby
* set upp the unix group structure for corpus users (also __Børre, Trond__)
* Test the aligners once again
* make an analyser that retains the xml structure for Oslo
** in progress (with Tomi). The exact structure of the output still open
* discuss parallel text markup in newsgroup
** done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* public tender:
** write e-mail to PL and ask for more clarification
*** done
* fst gymnastics to add hyphenation and word boundary marks to hyphenation
  transducer
** not done
* name lexicon:
** implement editing functions
** finalise refactoring for multiple collections (regular search interface)
*** some work on researching new ways of doing all-collection searches, and at
    the same time allow for different xml structures
* change corpus-summary processing to generate smaller pages
* meeting 23.5. t 9.30: discuss and decide upon the exact access policy we want
** postponed
* move to Bugzilla:
** proofed/unproofed parallel text issue
*** done
** xml output of paradigm generator
*** done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas
* hyphenation-rule-set 
** a few bugs, otherwise finished
* work on compounding and derivation
** finished with derivation
* lule sámi incoming words
** added a few
* sme G3 issue
** nothing this week

!! Tomi
* new proper name lexicon
** data synchronisation of proper nouns between risten.no and CVS
*** not done
** XQuery refactoring and code development for our proper noun editor
*** not done
** new version of xml2lexc (based on ccat), should handle complex names correct:
   construct entries like we have now from the different parts of a complex name
   entry
*** not done
* read aligner docu, install, provide feedback
** not done
* Set up the mechanism for the hash-mark transducer package
** not done
* add ccat option to analyse text while keeping the xml tags and structure
** done some
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond
* better smj NT text
** Not done
* get fin, swe, nob and nno NT and OT in __paratext__ format
** Discussed with fin and swe societies, the Swedes are not quite satisfied with
   their paratext version, and prefer us to do the Word version. Børre and I
   should take a new round on this.
* install aligner, test it and give feedback
** Not done
* fst gymnastics to add hyphenation and word boundary marks to hyphenation
  transducer
** Not done
* Put __Saara__ and __Tero__ in contact with each other
** Per will send the PHP code. Details about Tero will be given at the meeting.
* meeting 9.6. t 9.30: discuss and decide upon the exact access policy we want
** Postponed.
* set upp the unix group structure for corpus users (also __Børre, Saara__)
** Postponed.
* [fix bugs!|http://giellatekno.uit.no/bugzilla].
** Worked on some bugs.

!!!3. Documentation

TODO:
* documentation on how to apply for a user account for the corpus repo
  (__Børre__)
** we will administer the corpus user accounts ourselves
** We first have to discuss and decide what we want before __Børre__ can write
   documentation (see further down)

!!!4. Corpus gathering

!!Collecting

See a [previous meeting memo|Meeting_2006-01-16] for what's to be done.

TODO:
* Send out the rest of the letters (__Børre__)

New contracts:
* none last 2 weeks

!!Olavi Korhonen's Lule Sámi dictionary.

TODO:
* set up user account/corpus access for Olavi (__Børre__)
** we need the infrastructure for corpus access in place first

!!KIO Grafisk and the Iđut books

__TODO__:
* send letter to Kurt Tore Andersen (__Børre__)
* send letters to the other authors (__Børre__)

!!Bible texts

The Swedish Bible Society is reluctant to give us the paratext version, as it is
linguistically inferiour. Instead they suggested a MS Word version of the text.

__Børre__ and __Trond__ suggest that we give the MS Word version a second try,
at least to be able to get hard facts about the problems with that format when
discussing with the swedes.

We have asked for ''any'' language version in paratext format from the Norwegian
Bible Society, to use as a reference when making the __smj__ paratext version.

We are still waiting for the Norwegian versions, but Finnish and Swedish are
within reach. Finnish is available in a newsgroup (text-only) format, and we 
have received MS Word of the  Swedish version. Børre has had some problems
converting the Word file, though.

TODO:
* get nob and nno NT and OT in __paratext__ format. (__Trond__)
* convert smj NT to paratext (__Børre__)
* convert fin, swe to paratext or directly to our XML (__Børre__)

!!Davvi Girji

Talked to Harald Gaski, he will send the contracts to the writers' organisation,
to ensure the contract is ok.

__TODO__:
* call Brita Kåven again towards the end of the week (__Børre__)
* call the authors (__Børre__)
** Harald Gaski
*** done

!!Min Áigi

__TODO__:
* complete metainformation (__Børre__)

!!Kåfjord

Promised to send us texts. Some texts have arrived, but nothing from Ája.

__TODO__:
* contact Ája (__Børre__)
** Not done

!!Sámi Instituhtta

TODO:
* send renaming scripts to R. Valkeapää (__Børre__)

!!Čálliid Lágádus

__Børre__ talked to them, they are positive, and will give us print-ready pdf's.
[http://www.calliidlagadus.org/]

!!Árran

The negotiations are underway, they discuss it in a meeting today, and we have
an appointment for later today or tomorrow (both their internal texts, and
discussions on Báhko). All further negotiations will go through Bård Eriksen
/ Báhko.

TODO:
* call Bård Eriksen (__Børre__)

!!!5. Corpus infrastructure

!!General

Errors in the Antiword conversions found when parsing the xml corpus. Main
problem is to skip headers and footers - now they are included as part of the
text, and are intermingled within the regular text, e.g. in the middle of a
sentence.

TODO (all of these in priority order, the third option is really a last resort):
# fine-tune the initial conversion in antiword (__Børre__ or __Saara__)
# make file-specific fixes in the file-speciflc xsl file (by having our local
  xsl expert __read__ our corpus, and spot obvious omissions in the conversion
  (__Saara__)
# Manually fix the resulting files before sending off to analysis (??, this
  option should be postponed to just before the final, official release, and
  evaluated at that point)

!!User accounts and access

TODO:
* discuss and decide upon the exact access policy we want to give corpus users;
  meeting on Tuesday May 23, at 9.30 (__Børre, Sjur, Trond__)
** still not done, new date: Wednesday 14.6.2006, at 9.30
* set upp the unix group structure to open for a new category of users:
  Linguists with access to the texts on the servers. Some of
  __Børre, Saara, Trond__.

!!More texts to the graphical corpus interface:

We need to get the infrastructure complete to be able to do this, then it
should be a piece of cake.

Saaara has made corpus-analyze.pl, a script to analyse text while keeping the
xml structure. It is working, but the output format needs some tweaking.

TODO:
# refine xml-tagged output (__Saara__ and __Tomi__)
# add text to the server (__Lars__)

!!Aligner

__Trond__ and __Saara__ will continue this issue.

We need markup of parallelism in the corpus DTD, at least an indication of which
documents belong together. Discussion to continue in the newsgroup (__Saara__
has started it - please respond!).

!!Language recognition

Still waiting for more __smj__ text to improve it.

!!Free and non-free texts

Anything? Final check with __Børre__ and __Saara__ - waiting for them to return.

!!Corpus summary

TODO:
* trim generated corpus summary pages (__Sjur__)
** suggestion: lump together files with content less than X paragraphs (X < 5?)

!!Proofed vs unproofed corpus files

TODO:
* count the number of parallel files of type unproofed/proofread (__Saara__)
** done, 35 such file pairs
** delayed for the time being, move to Bugzilla (__Sjur__)
*** done

!!!6. Infrastructure

!!Paradigm generation

Goal: Reuse Greenlandic code for paradigm generation.

We have now received the original PHP code from Tero/Per. It seems quite easy to
adapt, although it will of course require work on our part.

TODO:
* convert or adapt the received PHP to our needs (__Tomi__ or __Saara__)
* move xml-out option to Bugzilla (__Sjur__)
** done

!!Hyphenator

__Thomas__ is finished with adding ^ tags to the __sme__ noun file.

__Trond__ and __Thomas__ have been working on the smj rule component, and have
improved both the treatment of weak grade consonant clusters (preconsonantal
geminates) and on some loan word patterns.

TODO:
* correct the treatment of hyphenation of word boundaries and exceptions (fst
  gymnastics) (__Sjur, Trond__)
** Still not done - postponed till next week (w 24)
* Update the sme and sma rule sets with the insights gained from smj updates.

!!Automatic Bugzilla reminder for untouched bugs

We need to get a summary by e-mail for all bugs not touched in more than 5(?)
weeks. Do we want it? Yes. It is possible? __Børre__ will look around, if
nothing found, he'll ask Thor-Øivind.

TODO:
* set up Bugzilla to send automatic reminders for bugs not touched in a given
  timeframe; ask Thor-Øivind for help if needed (__Børre__)

!!!7. Linguistics

!!Name double-tagging

Conclusion, in a principled fashion: 

# hardcoded sem-tags win
# There is a sem-tag conversion procedure: according to a hierarchy of sem-tags:
  Any Plc can be interpreted as Sur, etc. (to be spelled out)

TODO:
* when adding new names, only use one sem-tag unless there are known objects
  which belong to different sem classes(?) (__all linguists__)
* write disamb rules to implement the system above (__Trond, Linda__)

!!North Sámi

Topic: Actio+compound - how productive? How much does it destroy speller
performeance by being close to real spelling errors with a high editing distance
to the correct form?

{{{
Spelling error from typos.txt:
4       vuolggahansádji             vuolggasadji

vuolggahan
vuolggahan      vuolgga+N+Sg+Nom+Foc
vuolggahan      vuolggahit+V+TV+PrfPrc
vuolggahan      vuolggahit+V+TV+Ind+Prs+Sg1
vuolggahan      vuolggahit+V+TV+Actio+Acc
vuolggahan      vuolggahit+V+TV+Actio+Gen
vuolggahan      vuolggahit+V+TV+Actio+Nom

vuolggahansadji
vuolggahansadji vuolgga+N+SgNomCmp#ho+N+SgNomCmp#atnu+N+SgNomCmp#sadji+N+Sg+Nom
vuolggahansadji vuolgga+N+SgNomCmp#ho+N+SgNomCmp#atnu+N+SgNomCmp#sadji+N+Sg+Nom
vuolggahansadji vuolgga+N+SgNomCmp#ho+N+SgNomCmp#atnu+N+SgNomCmp#sadji+N+Sg+Nom
vuolggahansadji vuolggahit+V+TV+Actio#sadji+N+Sg+Nom
vuolggahansadji vuolggahit+V+TV+Actio#sadji+N+Sg+Nom
vuolggahansadji vuolggahit+V+TV+Actio#sadji+N+Sg+Nom
}}}

TODO:
* investigate Actio compounding (__Thomas__)
* discuss findings with __Maaren__, and later all of us (__Thomas__)

!!Lule Sámi

There are some open issues in the marginal area of the smj transducer:
* numerals, e.g. Our poor treatment of number words becomes more visible 
  as we get real texts (guhttatuvsánanielljatjuodegålmmålåkvihtta is not
  recognised, etc.).
* names => waiting for the new name lexicon
* compounds? Shortening here as well, but not in written language (some
  lexicalised exceptions, as in sme) =>  nothing to discuss/implement
* loanwords? We should consider importing the ^LOAN words from sme and 
  Lule-ify them. Otherwise, loan words should come from corpora.

TODO:
* 50 unknown words left+2 abbr. +moaddi etc (numerals) need more checks
  (__Thomas__)
** progressing fine
* add proper numeral analysis/treatment (__Thomas__)
* add loanwords (e.g. latin -ere verbs) (__Thomas__)

!!!8. Name lexicon infrastructure

TODO:
# finish refactoring for multiple collections in the search interfarce
  (__Sjur__)
## progressing, investigating options
# develop the needed XQueries and interface (__Sjur, Tomi__)
## progressing, done some, commited
   (adding new term, create-termc-entry.xq)
# data synchronisation between risten.no and the cvs repo (__Tomi__)
## discussion started on eXist-list, nothing useful came up. We need to
   reformulate the question from our perspective, and bring it up again
   (__Sjur__)

!!!9. Public tender

We have received answers from LS, and are now wating for the PL answers. Their
deadline is the coming Thursday.

TODO:
* write clarification letters to PL (__Sjur__ with help from
  __Børre, Trond__)
** done
* evaluate finally the offers based on the answers to the clarification letters

!!!10. Other

!!Summer vacation

|| Who  || When
| Børre  | August
| Linda  | ?
| Maaren | ?
| Saara  | July
| Sjur   | at least 2 weeks in July, but still open
| Thomas | 3.7 - 7.8
| Trond  | 3.7 - 14.8 (last two weeks off at summer school)
| Tomi   | 8.7 - 16.7, 2 more weeks in July and/or August

!!Bug fixing

__43__ open Divvun/Disamb bugs (two down!), and __25__ risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Please help __Saara__ with [bug
279|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=279]
 (Perl locale). Not much help...
__Saara__ will contact __Roy__ on this issue.

After the corpus issues have been somewhat settled, we should do a bug
barnraising. ... and then a new one after the name lexicon is fixed.

!!Gobby

Installed to most computers (only __Saara__ missing), now we need to test it in
a multiuser scenario.

__Trond__ has asked __Lars Nygård__ and __Tero Avellan__ to
install Gobby as well. __Per Langgård__ is already using it.

TODO:
* install Gobby (__Saara__)
* test it (__Tomi, Trond, Sjur, Børre, Lars?__)
* if successful, document its use within our project (__Børre__)

!!SEE 2.5 extensions

TODO:
* give the jspwiki syntax colouring mode to all (__Sjur__)
** JSPWiki.mode checked in to cvs in {{gt/src/}} - just double-click the mode!
* write a perl script to extract all TODO items and sort them according to
  responsible person; wrap the script in a AppleScript available from the menu
* other modes on the whish list:
** twol mode
** xfst mode
** lexc mode
** vislcg mode

!!!11. Next meeting, closing

19.06.2006 09:30

Sjur is away on Friday June 16.

Closed at 11:21.

!!!Appendix - task lists for the next week

!! Børre
* corpus collection:
** send out contracts with accompanying letter
** Gather public texts, preferrably also parallel ones
** Send out letters to the rest of the Iđut authors
** send contract to __Kurt Tore Andersen__
** call __Brita Kåven__ again
** contact __Ája__ (Kåfjord)
** Send renaming scripts to __R. Valkeapää__
** call __Bård Eriksen__
* corpus conversion:
** Continue converting text from input format to our xml
** convert nob and nno bible texts to be used as part of a parallel corpus
** convert fin, swe to paratext or directly to our XML
** review the paratext2xml converter
** convert smj NT to paratext
** complete __Min Áigi__ metadata
* corpus access:
** meeting 9.6. t 9.30: discuss and decide upon the exact access policy we want
** set upp the unix group structure for corpus users (also __Saara, Trond__)
* set up Bugzilla automatic reminders for open issues; ask Thor-Øivind if needed
* test Gobby (with others)
* document use of Gobby within our project if above test is ok
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren
* On sick leave

!! Saara
* Create a parallel corpora of the new testaments
* add more texts to the graphical corpus interface
* grammatical searchability in the graphical corpus interface
* Implement parallel corpus upload in web upload script
* Install Gobby
* set upp the unix group structure for corpus users (also __Børre, Trond__)
* Test the aligners once again
* refine the xml output of the xml-tagged analyses
* convert or adapt the received PHP for paradigm generation to our needs
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* public tender:
** collect the last clarifying answers, and make a final proposal to the board
* fst gymnastics to add hyphenation and word boundary marks to hyphenation
  transducer
* name lexicon:
** implement editing functions
** finalise refactoring for multiple collections (regular search interface)
* change corpus-summary processing to generate smaller pages
* meeting 23.5. t 9.30: discuss and decide upon the exact access policy we want
* test Gobby (with others)
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas
* investigate Actio compounding, first sme, later smj
* discuss findings with __Maaren__, and later all of us
* add proper numeral analysis/treatment to smj
* add loanwords (e.g. latin -ere verbs) to smj
* work on compounding
* lule sámi incoming words
* bug-fixing
* sme G3 issue

!! Tomi
* new proper name lexicon
** data synchronisation of proper nouns between risten.no and CVS
** XQuery refactoring and code development for our proper noun editor
** new version of xml2lexc (based on ccat), should handle complex names correct:
   construct entries like we have now from the different parts of a complex name
   entry
* read aligner docu, install, provide feedback
* Set up the mechanism for the hash-mark transducer package
* refine the xml output of the xml-tagged analyses
* convert or adapt the received PHP for paradigm generation to our needs
* test Gobby (with others)
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond
* better smj NT text
* get fin, swe, nob and nno NT and OT in __paratext__ format
* install aligner, test it and give feedback
* fst gymnastics to add hyphenation and word boundary marks to hyphenation
  transducer
* meeting 9.6. t 9.30: discuss and decide upon the exact access policy we want
* set upp the unix group structure for corpus users (also __Børre, Saara__)
* test Gobby (with others)
* [fix bugs!|http://giellatekno.uit.no/bugzilla].