!!!Meeting setup

* Date: 06.03.2006
* Time: 09.30 Norw. time
* Place: Wherever we are :-)
* Tools: iChat, SubEthaEdit

!!! Agenda

# Opening, agenda review
# Reviewing the task list from two weeks ago
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Infrastructure
# Linguistics
# name lexicon infrastructure
# Spellers
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 10:04.

Present: __Maaren, Sjur, Thomas, Tomi__

Absent: __Børre, Saara, Trond__

Main secretary: __all__

Agenda accepted as is.

!!!2. Reviewing the task list from the last meeting

!! Børre

On Winter holiday.

!! Maaren
* work with risten.no
** not done. I have worked with the top-ten list lately. Almost finished.
* discuss with relevant people regarding seminar on proofing tools, normativity
  and SGL in February/March, including place.
** Laila wants us to send the list to the SGL-members (as we have done before).
 Seminar is not actual. Yet. 
* get the normativity decissions from the December SGL meeting
** I have send decissions to __Thomas__ and asked __Thomas__ to phone to Laila.

!! Saara
* continue discussion on the new lexicon format
* Fix the preprocess script and optimize it.
* Try to add numeral treatment as part of the analyzator.
* Move gt2ga.sh to G5 and implement copying of the gt-dir.
* Create a parallel corpora of the new testaments.
* Routine for adding new languages to the propernoun xml-structure.
* Move to Bugzilla: the analyzer needs to be optimized.
* Implement validation of xml corpus against the dtd.
* Create a group for corpus users.
* Finish corpus dtd documentation, dtd location and permlink reference
* Finish gt/doc/ling/corpus_conversion_tech.html and rename to .xml
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* Follow up the lawyer treatment of the contracts
** not done
* Lule Sámi twol problems, with __Thomas__ and __Trond__
** not done
* project planning with __Trond__, continued
** not done
* Follow up on place names from Norge Digitalt
** not done
* Evaluate SFST as speller (and analyzer) lexicon
** not done
* write a background document on the corpus contracts
** not done
* public tender:
** review draft tender document from Finnut
*** done and published
** the public tender would benefit a lot from anonymous cvs access (read-only)
   and also regular corpus access - started discussion last week with __Børre__
   and __Trond__ (see notes further down)
* smj G3 issue with __Thomas__ and __Trond__
** not done
* sme G3 issue with __Thomas__ and __Trond__
** not done
* call EDD/__Christian Emil Ore__ about national place name lexicon
** not done
* risten.no/proper noun lexicon development:
** refactor code
*** not done
** implement inheritance/collection overriding for xsl/css/xquery using sitemaps
*** tried to work out how to match parts of a request parameter, and continue
    based on the match - no success so far (but still studying)
** data synchronisation between risten.no and the cvs repository
*** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas
* work with Lule sámi 90% goal
**worked and reached goal 92%
* work on North Sámi compounding and derivation
**worked
* smj G3 issue with __Sjur__ and __Trond__
**not had time to
* sme G3 issue with __Sjur__ and __Trond__
**not had time to

!! Tomi
* move aspell UTF-8 suffix bug to Bugzilla
** not done
* corpus infrastructure:
** dtd location (both public and internal)
*** not done
* Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
  (it's almost done, but there are a couple of loose ends)
** not done
* new proper name lexicon
** discuss the new lexicon format and other issues in the newsgroup
*** not done
** Look into data synchronisation of proper nouns between risten.no and CVS
*** not done
** XQuery refactoring and code development for our proper noun editor
*** not done
** new version of xml2lexc (based on ccat), should handle complex names correct:
   construct entries like we have now from the different parts of a complex name
   entry
*** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond

On Winter holiday.

!!!3. Documentation

!!Changes and updates because of the Divvun public tender

* document anonymous, read-only access to our cvs repo
* probably a new main section (sub-tab?) on external access to all our resources
* documentation on how to apply for a user account for the corpus repo
* we need to finish the corpus dtd documentation

!!!4. Corpus gathering

!!Collecting

See a [previous meeting memo|Meeting_2006-01-16] for what's to be done.

TODO: Send out the rest of the letters (__Børre__)

!!Odin

Waiting for __Sæth__ to discuss with colleagues about how to implement the
cooperation, and return to us.

TODO:
* call Sæth (__Børre__)

!!Olavi Korhonen's Lule Sámi dictionary.

Korhonen and Oahpadusguovdásj have a shared copyright to the dictionary.

!! KIO Grafisk and the Iđut books

We need a test file in order to find out whether file conversion from Quark 
to InDesign works.

__TODO__:
* __Trond__ to ask for a Quark test file from his sister
* __Børre__ to ask KIO to send a more elaborate test file, representative for
  what we can expect, but without any copyrighted content (and please stress,
  that fonts will __never__ leave the computer, they are part of the OS, and are
  not included in the Quark file). The file should include text with Sámi and
  Norwegian letters, and preferably some dummy pictures as well. Actually,
  __Børre__ could create that file in Word and send it to KIO for them to move
  to Quark and send back.
* __Børre__ will send letters to the authors.

!!Bible texts

TODO:
* review paratext2xml converter (__Børre__)
* convert smj NT to paratext (__Børre__)
* ask to get fin and swe NT and OT in paratext format. (__Trond__)
**  Still not done. __Trond__ will contact Bibelselskapet for a new sme
    version, and for the NT nob and nno, and the fin and swe ones (after
    Wednesday)

!!!5. Corpus infrastructure

We need more "version control" in the corpus work - we don't know which version
of the XSL script was used (but roughly whether it was used or not). We need to
document within the XSL file which version of ''all'' tools were used, including
the XSL file (common section/template) itself.

Transferring the old gt/sme/corp files to the new corpus repo:
* for the biggest top ten (or so) the orig. should be located and copied to the
  new corpus repo
* then these files should be removed from gt/sme/corp/
* all small files could just be forgotten/ignored

TODO:
* Access control to corpus repo resolved through Unix groups: one group for
  corpus maintainers with write access, and another for corpus users, with
  read-only access. Still not done.
** __Saara__ will ask Roy to do what should be done.

Further discussion about corpus analysis and computer use:

The new G5 is tremendeously faster than cochise, thus we want to use it. But
cochise will continue to be our main corpus repo.

* the corpus/gt/ dir will be synchronised with the G5
** TODO: set up copying script (__Saara__)
* we need to develop strong enough security routines for the G5 to fulfill our
  obligations towards the text licensers
**  TODO: __Børre__ to move this to bugzilla
* we are still using only one processor when analysing - making some simple
  multiprocessor analysis script will increase the speed of the analysis a lot
**  script implemented, but not tested due to copying not in place yet (see
    above).

New tasks:

* corpus dtd documentation:
** structure, content/model and location of the dtd (location =
   permlink) http://giellatekno.uit.no/dtd/corpus.dtd
   TODO: __Saara__ to write and finish the docu, also check the dtd link
*** the link isn't working
* finalize gt/doc/ling/corpus_conversion_tech.html (and possibly convert it to
 xml) (__Saara__)
* add xml validation against our dtd to the corpus conversion process
  (__Saara__)

!!Changes and updates because of the Divvun public tender

* routines for setting up new users of our corpus
** create the final text corpus license: the regular end user computer license
   towards UiT/IT, to get an account and access to the corpus repository, be it
   through a web interface or the command line. We can probably as Roy Dragseth
   et co about a useful template.
** __Børre__ will be the recipient of the SD end user license (approving corpus
   users on behalf of SD). Who should be the recipient on behalf of UiT/IT?
** who will do the actual account setup? What type of account is needed?
* an automatic build of the content of our corpus repo:
** for each text:
*** license attached to each text
*** length (words, characters, sentences, paragraphs)
*** (source) language and other properties of the text
*** for freely available texts, a download link?
** an overview document with statistics for the whole corpus
** the automatic build should generate one (or several) Forrest XML document(s)
   containing the above info

!!!6. Infrastructure

We need to set up anonymous, read-only access to our cvs repo as outlined by our
friend in Skolelinux. This is necessary both for the Divvun public tender, as
well as for other cooperation with external bodies (e.g. Skolelinux). This setup
needs to be accompanied by documentation for access, as well as for building,
prerequisites, etc.

We might even consider seting up patching routines, such that people with
read-only access can send in patches for bugs and enhancements they would like
to contribute.

Our open-source policy now demands that we really take the step, and not only
''talk'' open-source:-)

Should we set up view-cvs (web interface to cvs)?

Problems:
* IP-problemacy (Do we have the complete rights to the content, or does the work
  have a license that allows us to publish the them)
**  Sámi place names - we need to ask our sources:
*** Statens kartverk (__Trond__) - materialet er fritt, jf Stortinget!
*** Finnish same (__Thomas__) - we don't have it, so this is no issue.
    ''Reminder:'' we need to get it!
**  Are there any parts in smj, sms, etc that shouldn't be publicly available?
*** What about the inc-*.txt files in smj/src/ ? Are they the Anders K source
    files? Not exactly, but they are derived from it. We will have to remove
    them from the CVS repository when we're finished with them.
*** do we have any corpus files in smX/corp/ with copyright/unchecked lisence?
    Yes, __Trond__ needs to check this this week.
**  No other problems? Pekka is fine with this.
*** = No other problems.
* Do we dare to show our work?
** yes

Howto/who:
* what do we need?
** web interface? maybe
** command line check-out? yes (__Roy Dragseth__/__Børre__)
** need to be able to restrict anon. cvs to only specific modules
   (__Roy Dragseth__/__Børre__)

!!!7. Linguistics

!!North Sámi

We want to get the decisions from the SGL meeting in December.
* Laila has sent those to __Thomas__ by email ...a long time a go...
* __Thomas__ will include the responses in our normativity document.

SGL do
not want us to send "unnecessary" things like rievan - rieban, politihkka -
bolitihkka, pievlat - bievlat. And I agree with Laila. We are wasting their
time. We have to write rieban-politihkka-bievlat. These are decided already year
1982-1988 1992-94. We have to read what Ole Henrik Magga has written long time a
go about the decissions they (Finland, Norway and Sweden) made 1982-1988, and
1992 (?).

We have to make a list and send the list to SGL. They
do have a meeting probably in May.

TODO:
* Find the book by O. H. Magga on normativity desicions made in the 80'ies, and
   documentation on all later decisions:
** Dieđut nr 2 1985: GIELLA, dutkan, dikšun ja oahpaheapmi.Sámi instituhtta. s.
   32 - 74 (davvisámi čállinvuohki - Ole Henrik Magga), s. 68-74 (Pekka
   Sammallahti)
** documentation on later decisions from __Laila__, __Maaren__ will get copies
   and send them to __Thomas__
* make sure we only send them issues that are really open
* new issues need probably to be better explained from the point of view of
  spelling correction and end user reactions - that is, why ''we'' think it is
  an issue in need for SGL clarification

TODO:
* get the decisions (__Maaren__)
** done

!!Lule Sámi

Goal for March 1st: Reach the 90 percent lexical recall limit.

__TODO__:
* __Trond and Thomas__: work on the 90 % limit
** we reached 92%!
* __Trond__: Write a report to NFR.
** done

!!!8. Name lexicon infrastructure

!!Complex names

* make sure xml2lexc can handle complex names in ways compatible with our
  present tool chain (=reconstruct the lexc format we have now) (__Tomi__)
** the resulting file format should be identical to our present prop-name
   file (=lexc), that can then be converted to our new xml format using the same
   script as for the regular names (__Tomi__ or __Saara__, but only when the
   technical details are settled)

TODO:
* Move this issue to bugzilla (__Børre__)

!!XML format

TODO on eXist as editor:
# refactor and prepare risten.no for multiple collections:
## develop the Cocoon sitemap to delegate requests to the proper folder level,
   such that the most specific code is always used
## refactor the code into more and more specific components according to our
   folder hierarchy
# develop the needed XQueries and interface (__Sjur, Tomi__)
# data synchronisation between risten.no and the cvs repo (__Tomi__)
# test whether eXist as editor is actually working well (__linguists__)

Data synchronisation task list/specification:

* the xml file needs to be stored/updated in cvs
* there should be no diffs on whitespace and sorting order (to ensure we get
  only substantial diffs, and no noise)
* the prop name update cycle should something like:
** dump the xml from eXist (in proper sorting order)
** check whether there are diffs against cvs; continue only if there are
** update from cvs
** error check: are there conflicts? if yes, send report to <somebody>
** are there still diffs? if yes, continue:
** check in/commit w. generated comment
** error check: is the document valid and conformant xml? if no, stop and send a
   report to <somebody>
** reimport the xml file into eXist
* question: do we need to lock the file in eXist through this update cycle?
* the update cycle should be a nightly cron job

!!!9. Spellers

Nothing until the new proper noun lexicon is in place.

!!!10. Other

!!SGL Seminar

SGL don't want a seminar - at least not yet.

!!Bug fixing

__35__ open Divvun/Disamb bugs, and __24__ risten.no bugs


!!!11. Summary, task list

!! Børre
* send out contracts with accompanying letter
* Gather public texts, preferrably also parallel ones
* Continue converting text from input format to our xml
* convert nob and nno bible texts to be used as part of a parallel corpus
* review the paratext2xml converter
* convert smj NT to paratext
* Call Ove Sæth
* Move complex name lexicon issue to bugzilla
* Ask KIO Grafisk to make a test Quark document based on a Word document from us
* Send out letters to the Iđut authors
* Add corpus security re G5 syncing as an issue to Bugzilla
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren
* work with the top-ten list
* send copies of normativity decisions from 1985 to 1992 (2003?) to __Thomas__

!! Saara
* continue discussion on the new lexicon format
* Fix the preprocess script and optimize it.
* Try to add numeral treatment as part of the analyzator.
* Move gt2ga.sh to G5 and implement copying of the gt-dir.
* Create a parallel corpora of the new testaments.
* Routine for adding new languages to the propernoun xml-structure.
* Move to Bugzilla: the analyzer needs to be optimized.
* Implement validation of xml corpus against the dtd.
* Create a group for corpus users.
* Finish corpus dtd documentation, dtd location and permlink reference
* Finish gt/doc/ling/corpus_conversion_tech.html and rename to .xml
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* Follow up the lawyer treatment of the contracts
* Lule Sámi twol problems, with __Thomas__ and __Trond__
* project planning with __Trond__, continued
* Follow up on place names from Norge Digitalt
* Evaluate SFST as speller (and analyzer) lexicon
* write a background document on the corpus contracts
* public tender:
** answer requests/questions
** set up anon. read-only cvs with __Børre__
** corpus repo access
* smj G3 issue with __Thomas__ and __Trond__
* sme G3 issue with __Thomas__ and __Trond__
* call EDD/__Christian Emil Ore__ about national place name lexicon
* risten.no/proper noun lexicon development:
** refactor code
** implement inheritance/collection overriding for xsl/css/xquery using sitemaps
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas
* add incoming Lule sámi words
* include the SGL:decisions in our normativity document
* include normativity desicions made by Magga and Sammalahti in our normativity
  document
* follow-up on the Sámi place names in Finland
* work on North Sámi compounding and derivation
* smj G3 issue with __Sjur__ and __Trond__
* sme G3 issue with __Sjur__ and __Trond__

!! Tomi
* move aspell UTF-8 suffix bug to Bugzilla
* corpus infrastructure:
** dtd location (both public and internal)
* Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
  (it's almost done, but there are a couple of loose ends)
* new proper name lexicon
** discuss the new lexicon format and other issues in the newsgroup
** Look into data synchronisation of proper nouns between risten.no and CVS
** XQuery refactoring and code development for our proper noun editor
** new version of xml2lexc (based on ccat), should handle complex names correct:
   construct entries like we have now from the different parts of a complex name
   entry
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond
* Bring Lule Sámi up to 90 %
* Write Lule Report to NFR
* Apply for strategy funds
* Work on corpus texts with Børre.
* Contact the Finnish and Swedish Bible societies to get Bible texts.
* Look at ga/ directory issue with __Saara__
* Ask for a Quark test file from his sister
* News group discussion followup.
* clean all texts from gt/smX/corp - they need to be in our corpus repo before
  we open our cvs for anonymous access
* [fix bugs!|http://giellatekno.uit.no/bugzilla].

!!!12. Next meeting, closing

13.03.2006 09:30

Closed at 12:07