!!!Meeting setup

* Date: 27.02.2006
* Time: 09.30 Norw. time
* Place: Wherever we are :-)
* Tools: iChat, SubEthaEdit

!!! Agenda

# Opening, agenda review
# Reviewing the task list from two weeks ago
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Linguistics
# name lexicon infrastructure
# Spellers
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 10:36.

Present: __Børre, Saara, Sjur, Thomas, Trond__

Absent: __Maaren, Tomi__

Main secretary: __Trond__

Agenda accepted as is.

!!!2. Reviewing the task list from the last meeting

!! Børre
* send out contracts with accompanying letter
* Gather public texts, preferrably also parallel ones
* Continue converting text from input format to our xml
**  Some new texts added to the corpus, from the Sámediggi
* convert nob and nno bible texts to be used as part of a parallel corpus
**  Not done
* review the paratext2xml converter
*   Not done
* convert smj NT to paratext
**  Not done
* Call Ove Sæth and Olavi Korhonen
**  Called Korhonen, he was very positive about sharing his Lule Sámi 
    dictionary. He asked if he could get access to our corpus as a return
    favour, because he is also making a Northern Sámi dictionary.
* Correct Forrest integration for new projects and project ideas
**  Done with Sjur
* Move complex name lexicon issue to bugzilla
**  Not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren
* work with risten.no
** Added words. Also added words from a frequency-based missing list.
* discuss with relevant people regarding seminar on proofing tools, normativity
  and SGL in February/March, including place.

!! Saara
* continue discussion on the new lexicon format
* Fix the preprocess script and optimize it.
** not done.
* update conversion from lexc to xml (proper names) with the latest refinements
* Try to add numeral treatment as part of the analyzator.
** not done
* Look at crontab ga/ directory issue with __Trond.__
** postponed.
* Create a parallel corpora of the new testaments.
** In progress.
* Routine for adding new languages to the propernoun xml-structure (Lule Sámi
  name lexicon and links to the termcenter.xml etc.).
** In progress.
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* Follow up the lawyer treatment of the contracts
* Lule Sámi twol problems, with __Thomas__ and __Trond__
* project planning with __Trond__, continued
* Follow up on place names from Norge Digitalt
* Evaluate SFST as speller (and analyzer) lexicon
* write a background document on the corpus contracts
* public tender:
** review draft tender document from Finnut
*** almost finished - will finish today.
* smj G3 issue with __Thomas__ and __Trond__
* sme G3 issue with __Thomas__ and __Trond__
* call EDD/__Christian Emil Ore__ about national place name lexicon
* risten.no/proper noun lexicon development: fix bugs, continue development
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
* other:
** was on Winter Holiday

!! Thomas
* work on North Sámi compounding and derivation
**nothing, been working with Lule sámi namelexicas and incoming words
* review corpus usage documentation
**nothing, been working with Lule sámi namelexicas and incoming words
* smj G3 issue with __Sjur__ and __Trond__
**nothing, been working with Lule sámi namelexicas and incoming words
* sme G3 issue with __Sjur__ and __Trond__
**nothing, been working with Lule sámi namelexicas and incoming words


!! Tomi

On sick leave.

!! Trond
* Mostly worked on Lule Sámi this week.
* Work on corpus texts with Børre.
** Discussed issues.
* Contact the Finnish and Swedish Bible societies to get Bible texts.
** Not done.
* Look at ga/ directory issue with __Saara.__
** Postponed. We discussed it, though...
* News group discussion followup.
* Do a bug report (if not done) on commandline (mis)behaviour in the Xerox tools
** Done.
* Ask IT guys for an e-mail adress for corpus upload script:
  corpus@giellatekno.uit.no (forwarded to Børre)
** Other people have looked into this.  
* [fix bugs!|http://giellatekno.uit.no/bugzilla].

!!!3. Documentation

TODO:
* Integrating the future project plans and ideas in our Forrest documentation
  (__Børre__)
** Done by __Sjur__ and __Børre__.

!!!4. Corpus gathering

!!Collecting

See a [previous meeting memo|Meeting_2006-01-16] for what's to be done.

TODO: Send out the rest of the letters (__Børre__)

!!Odin

Waiting for __Sæth__ to discuss with colleagues about how to implement the
cooperation, and return to us.

TODO:
* call Sæth (__Børre__)

!!Olavi Korhonen's Lule Sámi dictionary.

Korhonen and Oahpadusguovdásj have a shared copyright to the dictionary.

!! KIO Grafisk and the IÄ‘ut books

We need a test file in order to find out whether file conversion from Quark 
to InDesign works.

__TODO__:
* __Trond__ to ask for a Quark test file from his sister
* __Børre__ to ask KIO to send a more elaborate test file, representative for
  what we can expect, but without any copyrighted content (and please stress,
  that fonts will __never__ leave the computer, they are part of the OS, and are
  not included in the Quark file). The file should include text with Sámi and
  Norwegian letters, and preferably some dummy pictures as well. Actually,
  __Børre__ could create that file in Word and send it to KIO for them to move
  to Quark and send back.
* __Børre__ will send letters to the authors.

!!Bible texts

TODO:
* review paratext2xml converter (__Børre__)
* convert smj NT to paratext (__Børre__)
* ask to get fin and swe NT and OT in paratext format. (__Trond__)
**  Still not done. Trond will contact Bibelselskapet for a new sme version, and 
    for the NT nob and nno, and the fin and swe ones (after Wednesday)

!!!5. Corpus infrastructure

We need more "version control" in the corpus work - we don't know which version
of the XSL script was used (but roughly whether it was used or not). We need to
document within the XSL file which version of ''all'' tools were used, including
the XSL file (common section/template) itself.

Transferring the old gt/sme/corp files to the new corpus repo:
* for the biggest top ten (or so) the orig. should be located and copied to the
  new corpus repo
* then these files should be removed from gt/sme/corp/
* all small files could just be forgotten/ignored

TODO:
* Access control to corpus repo resolved through Unix groups: one group for corpus
  maintainers with write access, and another for corpus users, with read-only
  access. Still not done.
** __Saara__ will ask Roy to do what should be done.

Further discussion about corpus analysis and computer use:

The new G5 is tremendeously faster than cochise, thus we want to use it. But
cochise will continue to be our main corpus repo.

* the corpus/gt/ dir will be synchronised with the G5
** TODO: set up copying script (__Saara__)
* we need to develop strong enough security routines for the G5 to fulfill our
  obligations towards the text licensers
**  TODO: __Børre__ to move this to bugzilla
* we are still using only one processor when analysing - making some simple
  multiprocessor analysis script will increase the speed of the analysis a lot
**  script implemented, but not tested due to copying not in place yet (see
    above).

Secure copying:

{{{
To get cvs ssh working without password prompting:
ssh-keygen -t rsa
<just type enter to all questions>
chmod 0644 .ssh/id_rsa.pub
scp .ssh/id_rsa.pub <user>@cochise.uit.no:.ssh/authorized_keys2
}}}

New tasks:

* corpus dtd documentation:
** structure, content/model and location of the dtd (location =
   permlink) http://giellatekno.uit.no/dtd/corpus.dtd
   TODO: __Saara__ to write and finish the docu, also check the dtd link
*** the link isn't working
* finalize gt/doc/ling/corpus_conversion_tech.html (and possibly convert it to
 xml) (__Saara__)
* add xml validation against our dtd to the corpus conversion process (__Saara__)

!!!6. Linguistics

!!North Sámi

We want to get the decisions from the SGL meeting in December.

TODO:
* get the decisions (__Maaren__)

!!Lule Sámi

NFR wrote:
{{{
[Programstyret] bed (...) om at det umiddelbart vert laga ein alternativ plan 
for resten av prosjektperioden. Denne planen må gå ut frå den endra føresetnaden,
at ein ikkje har tilgang til den lulesamiske ordboka. Programstyret vil at ein
går over til den alternative planen frå 1. mars, dersom tilgangen ikkje er reelt
i orden på det tidspunktet. 
}}}

The criterion for continuing with Lule Sámi that we set up in our plan was the 
following, somewhat stronger:

{{{
kriteriet for om det blir satsing på lulesamisk er om vi har ein betaversjon
av den lulesamiske transdusaren, med integrert leksikon, i gang 1. mars.
}}}

Moments to our March 1st report:
* We have recieved the dictionary, and integrated most of it (27.2:
  78 % of the Kintel words are added)
* Our Lule Sámi disambiguator is up and running as an alpha version
* The analyser runs on 80,0 recall on token and 67,4 on type, on the New Testament
  (proper names excluded)
* The analyser has 17773 lexicon lines
* We have 3367 lines of non-allocated Kintel words
* The integration of proper nouns will be done in parallel with Northern Sámi, and
  when it suits the Northern Sámi conversion timetable.

Goal for March 1st: Reach the 90 percent lexical recall limit.


Conclusion: we have already met the basic requirements for continuing with Lule
Sámi.

__TODO__:
* __Trond and Thomas__: work on the 90 % limit
* __Trond__: Write a report to NFR.

!!!7. Name lexicon infrastructure

!!Complex names

* make sure xml2lexc can handle complex names in ways compatible with our
  present tool chain (=reconstruct the lexc format we have now) (__Tomi__)
** the resulting file format should be identical to our present prop-name
   file (=lexc), that can then be converted to our new xml format using the same
   script as for the regular names (__Tomi__ or __Saara__, but only when the
   technical details are settled)

TODO:
* Move this issue to bugzilla (__Børre__)

!!XML format

TODO:
# testing of conversion
# eXist as editor:
## develop the needed XQueries and interface (__Sjur, Tomi__)
## data synchronisation between risten.no and (__Sjur, Tomi__)
## test whether eXist as editor is actually working well (__Thomas, others__)

!!!8. Spellers

Nothing while Tomi is on sick leave, and until the new proper name lexicon is in
place.

!!!9. Other

!!Upcoming Strategy money application deadline at Samisk senter

Possible grant proposal themes include:

* Southern Sámi: 
** Grammar research
** corpus collection
** normativity seminar (future speller)
* Lule Sámi:
** Disambiguator
* Northern Sámi:
** Text-to-speech
** Semantic annotation
* Programming infrastructure:
** Setting up a structure for inflecting dictionaries


!!SGL Seminar

*  all members = potentially/likely all languages
** not all languages, only North Sámi
*  date:
** As early as possible, end of March?
*  place? __Maaren__ will investigate

!!Bug fixing

__33__ open bugs (and __24__ risten.no bugs)


!!!10. Summary, task list

!! Børre
* send out contracts with accompanying letter
* Gather public texts, preferrably also parallel ones
* Continue converting text from input format to our xml
* convert nob and nno bible texts to be used as part of a parallel corpus
* review the paratext2xml converter
* convert smj NT to paratext
* Call Ove Sæth
* Move complex name lexicon issue to bugzilla
* Ask KIO Grafisk to make a test Quark document based on a Word document from us
* Send out letters to the IÄ‘ut authors
* Add corpus security re G5 syncing as an issue to Bugzilla
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren
* work with risten.no
* discuss with relevant people regarding seminar on proofing tools, normativity
  and SGL in February/March, including place.
* get the normativity decissions from the December SGL meeting

!! Saara
* continue discussion on the new lexicon format
* Fix the preprocess script and optimize it.
* Try to add numeral treatment as part of the analyzator.
* Move gt2ga.sh to G5 and implement copying of the gt-dir.
* Create a parallel corpora of the new testaments.
* Routine for adding new languages to the propernoun xml-structure.
* Move to Bugzilla: the analyzer needs to be optimized.
* Implement validation of xml corpus against the dtd.
* Create a group for corpus users.
* Finish corpus dtd documentation, dtd location and permlink reference
* Finish gt/doc/ling/corpus_conversion_tech.html and rename to .xml
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* Follow up the lawyer treatment of the contracts
* Lule Sámi twol problems, with __Thomas__ and __Trond__
* project planning with __Trond__, continued
* Follow up on place names from Norge Digitalt
* Evaluate SFST as speller (and analyzer) lexicon
* write a background document on the corpus contracts
* public tender:
** review draft tender document from Finnut
* smj G3 issue with __Thomas__ and __Trond__
* sme G3 issue with __Thomas__ and __Trond__
* call EDD/__Christian Emil Ore__ about national place name lexicon
* risten.no/proper noun lexicon development:
** refactor code
** implement inheritance/collection overriding for xsl/css/xquery using sitemaps
** data synchronisation between risten.no and the cvs repository
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas
* work with Lule sámi 90% goal
* work on North Sámi compounding and derivation
* review corpus usage documentation
* smj G3 issue with __Sjur__ and __Trond__
* sme G3 issue with __Sjur__ and __Trond__

!! Tomi
* move aspell UTF-8 suffix bug to Bugzilla
* corpus infrastructure:
** dtd location (both public and internal)
* Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
  (it's almost done, but there are a couple of loose ends)
* new proper name lexicon
** discuss the new lexicon format and other issues in the newsgroup
** Look into data synchronisation of proper nouns between risten.no and CVS
** XQuery refactoring and code development for our proper noun editor
** new version of xml2lexc (based on ccat), should handle complex names correct:
   construct entries like we have now from the different parts of a complex name
   entry
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond
* Bring Lule Sámi up to 90 %
* Write Lule Report to NFR
* Apply for strategy funds
* Work on corpus texts with Børre.
* Contact the Finnish and Swedish Bible societies to get Bible texts.
* Look at ga/ directory issue with __Saara__
* Ask for a Quark test file from his sister
* News group discussion followup.
* [fix bugs!|http://giellatekno.uit.no/bugzilla].

!!!11. Next meeting, closing

06.03.2006 09:30

Børre, Saara, Trond will be away.

Closed at 12:00