!!!Meeting setup

* Date: 28.08.2006
* Time: 09.30 Norw. time
* Place: Where we are
* Tools: SubEthaEdit, iChat

!!!Agenda

# Opening, agenda review
# Reviewing the task list from two weeks ago
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Infrastructure
# Linguistics
# name lexicon infrastructure
# Spellers
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 09:47.

Present: __Saara, Sjur, Thomas, Børre, Tomi, Trond__

Absent: __Maaren__

Agenda accepted as is.

!!!2. Updated task status since last meeting

!! Børre
* corpus collection:
** send out contracts with accompanying letter
** Gather public texts, preferrably also parallel ones
** Send out letters to the rest of the Iđut authors
** contact __Ája__ (Kåfjord)
** send contracts to __Čálliid Lágádus__
*** All the above not done
** Contact __Richard Valkepää__ at NSI about older Min Áigi and Áššu files.
*** Not heard anything from him
* corpus conversion:
** convert nob and nno bible texts to be used as part of a parallel corpus
** convert fin, swe to paratext or directly to our XML
** review the paratext2xml converter
*** All above not done
** Move norwegian documents in Min Áigi from sme to nob
*** Not done
* corpus access:
** possibly deploy the user account form as an HTML form
*** not done
** make a test user
** Write both user and admin documentation (__Børre__, review: __Sjur, Thomas__)
*** User documentation probably in several languages. This covers how to apply
    for an account, on what grounds one can apply, and pointers to documentation
    telling how to use the corpus.
*** Admin documentation, telling how we set the permissions to the corpus files,
   and whatever other processes and tasks needed to set up a corpus account.
*** Not done
* set up Bugzilla automatic reminders for open issues
** My attempt doesn't work
* create document & document entry for name double-tagging
** ?
* Update forrests to latest svn version
** Done before vacations
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren
* On sick leave

!! Saara
* Create a parallel corpora of the new testaments
* add more texts to the graphical corpus interface
* grammatical searchability in the graphical corpus interface
** tested
* Implement parallel corpus upload in web upload script
** not done, waiting for decision of the cgi-bin scripts
* Install Gobby
** done
* Test the aligners once again
** done
* refine the xml output of the xml-tagged analyses
** done
* convert or adapt the received PHP for paradigm generation to our needs
** discussed possible solutions in the news
* remove headers and footers from antiword documents, other improvements
** antiword done, pdf-documents are to be fixed next
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* public tender:
** review letters to tenderers, contract for subcontractor
*** contract finished, signed and countersigned. First common meeting held.
* fst gymnastics to add hyphenation and word boundary marks to hyphenation
  transducer
* name lexicon:
** implement editing functions
*** more done, plans for the rest is made.
** finalise refactoring for multiple collections (regular search interface)
*** mostly done, but a bug in eXist or Cocoon is blocking the final step
* review user and admin documentation for corpus access
** not done
* write user account form, probably ask for copy of existing ones from the IT
  centre (with Trond)
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas
* investigate productivity of even-syllable Actio compounding
** done
* investigate and identify under which conditions even-syllable Actio
  compounding is possible
** done
* discuss findings with the rest of us
** done
* add proper numeral analysis/treatment to smj
** done to the same extent as in sme
* add loanwords (e.g. latin -ere verbs) to smj
** done
* sme G3 issue
** started with a load of help from my friends
* review user documentation for corpus access
** not done
* create smj abbr file 
** copied the sme abbr file and lulefied it
* review the [document|http://divvun.no/doc/tools/gobby.html]
** done
* Redirected following three syllable verbs and prevent them from being
  first-part Actios in compounds:
** Reflexives on -dit
** Reciprocals on -dit, -(a)lit
** Momentatives on -dit, -(a)lit, -ádit, -ihit
** Frequentatives on -(a)lit, -(u)hit, -dit
** Continuatives on -dit, -(u)hit, -nit
** Inchoatives in -nit 
** Translatives on -dit
** Essives on -dit and -stit
** Causatives on -dit, -stit
*** done
* find and study all derived verbs in our corpus (depends on __Trond__)
** not done
* suggest which derivations could be generated (depends on __Trond__)
** not done

!! Tomi
* new proper name lexicon
** data synchronisation of proper nouns between risten.no and CVS
** XQuery refactoring and code development for our proper noun editor
** new version of xml2lexc (based on ccat), should handle complex names correct:
   construct entries like we have now from the different parts of a complex name
   entry
* read aligner docu, install, provide feedback
* Set up the mechanism for the hash-mark transducer package
* test the new xml output of the xml-tagged analyses
* export corpus tools to {{/opt}} (with cron)
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond
* better smj NT text
* get fin, swe, nob and nno NT and OT in __paratext__ format
** No bible work during summer, not received anything, will have to look into this again.
* install aligner, test it and give feedback
** Aligner installed, it is good, but slow an operates manually. Will call Bergen today.
* fst gymnastics to add hyphenation and word boundary marks to hyphenation
  transducer
** Not done.  
* make shell script wrappers for the most common commands for user friendlyness
** Some done, not others.
* write user account form, probably ask for copy of existing ones from the IT
  centre (with Sjur)
** not done  
* write documentation for our {{bound}} users, with pointers to the ordinary 
  documentation  
** not done  
* write documentation on double-tagging names
** not done  
* discuss web-only user access management with Oslo
** not done  
* change tagging of derived stems in the disamb output, to facilitate much
  easier extraction of non-lexicalised derivations
** prepared for the conversion, not done the actual change.
* [fix bugs!|http://giellatekno.uit.no/bugzilla].

!!!3. Documentation

TODO:
* Write both user and admin documentation (__Børre__, review: __Sjur, Thomas__)
** User documentation probably in several languages. This covers how to apply
   for an account, on what grounds one can apply, and pointers to documentation
   telling how to use the corpus.
** Admin documentation, telling how we set the permissions to the corpus files,
   and whatever other processes and tasks needed to set up a corpus account.

!!!4. Corpus gathering

Nothing has happened during the summer.

!!Olavi Korhonen's Lule Sámi dictionary.

Waiting for an answer.

!!Bible texts

Will have a second round with the Word versions.

__TODO__:
* get nob and nno NT and OT in __paratext__ format. (__Trond__)
* convert smj NT to paratext (__Børre__)
* convert fin, swe to paratext or directly to our XML (__Børre__)


!!Kåfjord

__TODO__:
* contact Ája (__Børre__)
* talk to Lene about Kåfjord (__Børre__)

!!Sámi Instituhtta

When will we get the corpus? We still don't know, __Børre__ will contact him
again.

TODO:
* contact NSI again (__Børre__)


!!Čálliid Lágádus

[http://www.calliidlagadus.org/]

TODO:
* send contracts (__Børre__)

!!Árran

TODO:
* contact Bård Eriksen again (__Børre__)


!!!5. Corpus infrastructure

!!General

What we would like: A make-type system that kept track of the file.xml and
file.xsl -pairs, always converting the former whenever the latter had a newer
date. Cf. Trond's letter "Makefile for xsl corpus conversion?" in the news.

At first sight, this sounds like a good idea.

__TODO:__
* remove headers and footers from antiword documents, other improvements
  as needed, including PDF conversion (__Saara__)
* give the make issue a second sight (__Tomi__)
* comment Trond's suggestion in the news (__all__)


!!User accounts and access

For details, see a [previous meeting memo|Meeting_2006-06-19], as well as the
memo from a [dedicated
meeting|http://divvun.no/doc/infra/corpus_policy.html].

!Shell access

TODO:
* export to {{/opt}} (with cron) tools that the project team members find in 
  their cvs tree (the bound users do not have a cvs tree, and therefore need 
  these tools in {{/opt}} in order to conduct linguistic analyses) (__Tomi__)
** ccat
* make shell script wrappers for the most common commands for user friendliness
  (we must think of what commands they are) (__Trond__)
** (first version of first script, teaksta.sh, was checked in, but it is still
   not working (the problem is a simple handling of input-output, some shell
   script literates should have a look at it)
* write user account form, probably ask for copy of existing ones from the IT
  centre (__Trond__ and __Sjur__)
* possibly deploy the user account form as an HTML form (__Børre__)
* write documentation for our {{bound}} users, with pointers to the ordinary 
  documentation (__Børre, Trond__)
* write documentation for how to apply for a user account (where's the form, to
  whom do I send the form, who needs it, etc.) (__Børre__)
* make our own guidelines for the user application processing (__Børre__)
* make a test user (__Børre__)
* test corpus access as test user (__Trond__)

!Web browser access

TODO:
* discuss with Oslo (__Trond__)
* delay other tasks until we are ready to go public?
* user management for access to bound texts
* short user guide needed before going public (either write one or take whatever has been made in Oslo (__Trond__)


!!More texts to the graphical corpus interface:

TODO:
* add text to the server (__Lars__)


!!Aligner

The aligner aligns fine, better than its competitors. Unfortunately it is slow, and dependent upon manual input. 

TODO:
* contact Bergen to discuss these issues, ask for a non-manual interface, etc.
  (__Trond__)


!!Language recognition

Still waiting for more __smj__ and __sma__ text to improve it. We need South
Sámi, since Reindriftsnytt/Boazodoallo-ođđasat is trilingual, nob, sme, sma. We
are presently unable to correctly identify __sma__.

!!Corpus summary

The time-based statistics is still missing.

TODO:
* add time-based display as a feature request to Bugzilla (__Sjur__)

!!!6. Infrastructure

!!Xerox tools wrapped as servers

To improve throughput and response time on heavy loads, it would really be nice
to have the Xerox tools wrapped up as servers.

__TODO:__
* decide the programming language to use (__Saara__)
* find some (almost-)ready-to-use code to build on (__Saara__)
* implement it (__Saara__)


!!Paradigm generation

Goal: Reuse Greenlandic code for paradigm generation.

__Saara__ has given a report on the PHP code in News. Please read.

__Conclusion:__
Only to be used as a source of inspiration. We'll wait with further work until
we have the server wrapper thing (see above) in place.


!!Hyphenator

TODO:
* correct the treatment of hyphenation of word boundaries and exceptions (fst
  gymnastics) (__Sjur, Trond__)
* Update the sma hyphenator rule set with the insights gained from smj updates
  (__Trond__ during summer vacation)


!!Automatic Bugzilla reminder for untouched bugs

TODO:
* give mail reminders a second try; ask Thor-Øivind for help if needed
  (__Børre__)


!!M4

Tomi and Saara did a lot in Tromsø. How far is it now? Probably finished today!

__TODO:__
* finish the work, and check it in (__Saara__)


!!!7. Linguistics

!!Derivation and spellers like Aspell

To make it easier to extract all derived stems, we should enhance the tags used
for derivations in {{sme}} to make them easier to grep. The most straightforward
solution is to make the tags follow the same pattern as for {{smj}},
{{+Der/NNN}}. Presently {{sme}} is only using the {{NNN}} part as a tag, where
{{NNN}} represents the derivational suffix. That is, there is no single pattern
to match against for {{sme}}.

__Problematic issue:__ the disamb output will presently give information only
about non-lexicalised derivations. This can potentially give false data on the
frequency of derivational affixes. To get (more) precise data, there should be
an option in {{lookup2cg}} that favours derived analyses over lexicalised ones,
everything else being equal. A similar option for compounds can also be useful
in certain contexts. That is, the present behaviour should be partly turned
upside-down ("select the analysis/-es with the fewest compounds and derivations
available"). The best alternative would probably be to select the next to least
complex analysis, that is, to allow only one compound border or one derivation.

TODO:
* change tagging of derived stems in the disamb output, to facilitate much
  easier extraction of these verbs (__Trond, Tomi__)
** Issues: Rewrite the {{sme-lex.txt, sme-dis.rle, sme-tdis.rle, twol-sme.txt}}
   files (and perhaps other source files), (40 tags and 5 files are
   involved (+ possibly 5-10 more, utf8-scripts, also the tag conversion
   scripts in {{sme/src}} and in {{gt/cwb/korpustags.txt}})). Pseudocode:
{{{
For file i, take taglist 33-37 and replace with 39-43, respectively
}}}
* find and study all derived verbs in our corpus (__Thomas__)
* suggest which derivations could be generated (__Thomas__)
** see source code above, but also consider overgeneration problems, as well as
   input from the statistical results in the previous point
* lexicalise the rest (__Thomas__)


!!Semantic double-tagging of names

The policy needs documentation. Thus:

TODO:
* Make a section under {{gt/doc/lang/smi/}}, add a chapter
  __Common linguistic resources__ under the __Languages__ section in
  {{site-frag.xml}}, and add the document below to that section (__Børre__)
* write guidelines for annotators wrt. to name tagging and put them under 
  {{gt/doc/lang/smi}}. A short version of the
  guidelines goes as follows: \\Systematic
  ambiguity (the __Trosterud__ (plc/sur) and __Aftenposten__ (org/obj) types)
  should not be doubletagged, but given primary tags (plc, org) only.
  Doubletagging should be confined to exceptional  cases (__Eira__
  (Fem/Sur/Plc), __Janne__ (Fem/Mal), such things. These guidelines should be
  added to our documentation file. (__Trond__)
* make sure all linguists is aware of the guidelines (__Trond, Sjur__)
* write disamb rules to implement the system above (__Trond, Linda__)


!!North Sámi

The following already derived verbs (verbs ending in -šit, -skit, etc.) are not
happy with further derivation. It seems that most of them do not appear as Actio 
forms in the first part of compounding either. The following holds for both 
''sme'' and ''smj'':

{{{
LEXICON MUITTASJ !Words ending -šit, -skit, -smit, -idit, -ldit, -git and 
5-syllables, formerly directed to 
MUITAL
 +V+TV: MUITALStem ;
!These derived verbs have now been redirected to MUITTASJ and similar lexica.
!Reflexives on -dit
!Reciprocals on -dit, -(a)lit
!Momentatives on -dit, -(a)lit, -ádit, -ihit
!Frequentatives on -(a)lit, -(u)hit, -dit
!Continuatives on -dit, -(u)hit, -nit
!Inchoatives in -nit 
!Translatives on -dit
!Essives on -dit and -stit
!Causatives on -dit, -stit
}}}

Examples:
* muittašit > *muittašallat


TODO:
* investigate and identify under which conditions Actio compounding is possible
  (__Thomas__)
** done
* discuss findings with all of us (__Thomas__)
** done


!!Lule Sámi

TODO:
* refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
  (__Thomas, Trond__)
* convert roughly 100 smj names from that file (lines 740-843) to XML
  (__Saara__)
* add inc abbr to a new abbr lexicon file (__Thomas__)
** copied the sme abbr file and lulefied it
* add proper numeral analysis/treatment (__Thomas__)
** done to the same extent as in sme
* add loanwords (e.g. latin -ere verbs) (__Thomas__)
** done

!!!8. Name lexicon infrastructure

Decided in Tromsø:
* add smj proper noun lexicon file to the output
* remove {{{^ # 0}} from the center ID
* replace spaces with underscores in all IDs
* remove occurence indicator from language IDs: Agalin_1 (the center/concept ID)
  => Agalin (the language ID), and thus the two Agalin's should become one
  language entry (but two different concept entries)
* store a redundant copy of the center-file semantic information in the
  language-specific files, for processing speed
* add logging facilities
* add option to download local copies of the lexicon files directly from the db
* batch editing (change all entries in the found set), should later be enhanced
  to allow selection of exceptions (the found set minus deselected items)
* all names in all languages by default
* tag for excluding/including a name from certain applications
* hide / display {{^}} during browsing
* future epxansion: choose what info to display in the single language browser
* search by (single) language
** done
* display existing language entries when adding a new language to a record
* make searches behave predictable (the hits should be the expected ones)
** done
* add editor to change single, existing entries

Details can be found in [the meeting
memo.|/doc/admin/physical_meetings/tromso-2006-08-propnoun.html]

Language entry example illustrating both the sem-tag on the sense elements, and
the removal of occurence indicator:
{{{
<entry id="Agalin">
 <infl lexc="BERN" />
 <senses>
  <sense sem="plc" ref="Agalin"/>
  <sense sem="sur" ref="Agalin_1"/>
 </senses>
</entry>
}}}

TODO:
* improve lexc2xml conversion (__Saara__)
** add default {{smj}} entries
** exclude {{^ # 0}} from the center ID
** add an empty <log/> element to all entries (center and lanuage files)
** add a {{last-update}} attribute to the root element of all files
* finish refactoring for multiple collections in the search interfarce
  (__Sjur__)
** waiting for a bug fix (__Tomi__ is investigating it)
* develop the needed XQueries and UI (__Sjur, Tomi__)
* data synchronisation between risten.no and the cvs repo (__Tomi__)
** discussion started on eXist-list, nothing useful came up. We need to
   reformulate the question from our perspective, and bring it up again
   (__Sjur__)


!!!9. Public tender

TODO:
* write a contract (mostly done by __Finnut__, review by __Sjur__)
** done
* get it signed (__Finnut, Lennart Mikkelsen__)
** done


!!!10. Tromsø meeting round-up

TODO:
* check in meeting memos (__Sjur__)
* Polderland questions. __Thomas__ did already send requested info.
* speller development - see the meeting memo. Separate follow-up next week.
* Lule Sámi linguist - Sjur has tried to call a possible candidate, but no
  answer so far. He will use e-mail instead. (__Sjur__)
* order AirPort Express to Tromsø (__Sjur__)


!!!11. Other


!!Bug fixing

__43__ open Divvun/Disamb bugs (two down!), and __25__ risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Please help __Saara__ with [bug
279|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=279]
 (Perl locale). Not much help...
__Saara__ will contact __Roy__ on this issue.

!!Gobby

TODO:
* install Gobby (__Trond, Sjur__)
* review the [document|http://divvun.no/doc/tools/gobby.html] (__Thomas__)
** done - document accepted:-)


!!Task lists as iCal entries

TODO:
* update all forrest installations to r430284 (__Børre__)


{{{
cd $FORREST_HOME
svn up -r430284
}}}

!!!11. Next meeting, closing

Next meeting 4.9.2006 at 9:30.

Closed at 11:25.

!!!Appendix - task lists for the next week

!! Børre
[iCal|/doc/admin/weekly/2006/Tasks_2006-08-28_Boerre.ics]
* corpus collection:
** send out contracts with accompanying letter
** Gather public texts, preferrably also parallel ones
** Send out letters to the rest of the Iđut authors
** contact __Ája__ (Kåfjord), talk to __Lene__
** send contracts to __Čálliid Lágádus__
** contact __Richard Valkepää__ at NSI about older Min Áigi and Áššu files
** contact __Bård Eriksen__ again
* corpus conversion:
** convert nob and nno bible texts to be used as part of a parallel corpus
** convert fin, swe to paratext or directly to our XML
** review the paratext2xml converter
** Move norwegian documents in Min Áigi from sme to nob
* corpus access:
** possibly deploy the user account form as an HTML form
** make a test user
** Write both user and admin documentation (__Børre__, review: __Sjur, Thomas__)
*** User documentation probably in several languages. This covers how to apply
    for an account, on what grounds one can apply, and pointers to documentation
    telling how to use the corpus.
*** Admin documentation, telling how we set the permissions to the corpus files,
   and whatever other processes and tasks needed to set up a corpus account.
* set up Bugzilla automatic reminders for open issues
* create document & document entry for semantic double-tagging of names (for
  __Trond__)
* Update forrests to svn version r430284
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren
* On sick leave

!! Saara
[iCal|/doc/admin/weekly/2006/Tasks_2006-08-28_Saara.ics]
* Create a parallel corpora of the new testaments
* add more texts to the graphical corpus interface
* grammatical searchability in the graphical corpus interface
* Implement parallel corpus upload in web upload script
* remove headers and footers from pdf documents
* Implement server of the analysis tools.
* Add more languages to the lexc2xml propernoun conversion.
* Refine the namelex output
* finish M4 work
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
[iCal|/doc/admin/weekly/2006/Tasks_2006-08-28_Sjur.ics]
* fst gymnastics to add hyphenation and word boundary marks to hyphenation
  transducer
* name lexicon:
** implement editing functions
** finalise refactoring for multiple collections
** implement improvements decided upon in Tromsø
* review user and admin documentation for corpus access
* write user account form, probably ask for copy of existing ones from the IT
  centre (with Trond)
* add time-based corpus summary as a feature request to Bugzilla
* check in meeting memos from Tromsø
* start hiring process of linguist and programmer
* order AirPort Express to the Tromsø gang
* install Gobby
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas
[iCal|/doc/admin/weekly/2006/Tasks_2006-08-28_Thomas.ics]
* sme G3 issue
* bug-fixing!
* refine smj proper noun lexica, cf. the propernoun-smj-lex.txt 
* review user documentation for corpus access
* find and study all derived verbs in our corpus (depends on __Trond__)
* suggest which derivations could be generated (depends on __Trond__)

!! Tomi
[iCal|/doc/admin/weekly/2006/Tasks_2006-08-28_Tomi.ics]
* new proper name lexicon
** data synchronisation of proper nouns between risten.no and CVS
** XQuery refactoring and code development for our proper noun editor
** new version of xml2lexc (based on ccat), should handle complex names correct:
   construct entries like we have now from the different parts of a complex name
   entry
** implement improvements decided upon in Tromsø
* read aligner docu, install, provide feedback
* Set up the mechanism for the hash-mark transducer package
* test the new xml output of the xml-tagged analyses
* export corpus tools to {{/opt}} (with cron)
* Do the sme Der/ change (with __Trond__)
* consider __Trond's__ suggestion of a makefile for corpus conversion
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond
[iCal|/doc/admin/weekly/2006/Tasks_2006-08-28_Trond.ics]
* better smj NT text
* get fin, swe, nob and nno NT and OT in __paratext__ format
* contact __Bergen__ about aligner issues
* fst gymnastics to add hyphenation and word boundary marks to hyphenation
  transducer
* make shell script wrappers for the most common commands for user friendlyness
* write user account form, probably ask for copy of existing ones from the IT
  centre (with Sjur)
* write documentation for our {{bound}} users, with pointers to the ordinary 
  documentation  
* write documentation on semantic double-tagging of names
* discuss web-only user access management with Oslo
* change tagging of derived stems in the disamb output, to facilitate much
  easier extraction of non-lexicalised derivations
* do the sme Der/ change (with __Tomi__)
* write short user guide for the corpus web interface
* install Gobby
* [fix bugs!|http://giellatekno.uit.no/bugzilla].