!!!Meeting setup

* Date: 28.11.2005
* Time: 09.30 Norw. time
* Place: Wherever we are :-)
* Tools: iChat, SubEthaEdit, phone

!!! Agenda

# Opening, agenda review
# Reviewing the task list from two weeks ago
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Linguistics
# name lexicon infrastructure
# Other issues
## speech synthesis
### Things happening within SD
### the UiT NFR application
## Proofing article: deadline Dec. 5.
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 09:56.

Present: __Børre, Maaren, Saara, Sjur, Thomas, Tomi, Trond__

Absent: __none__

Main secretary: __Sjur__

Agenda accepted with some additions; the speller infrastructure item was
replaced with proper name lexicon infrastructure.

!!!2. Reviewing the task list from the last meeting

!! Børre
* Contact oahpahusossodat about texts
**  Not done
* Gather public texts
**  Gathered laws and some websites
* Continue converting text from input format to our xml
**  Done, found some bugs in convert2xml.pl
* Document the corpus directory structure
**  Some done
* Ask __Thor-Øivind__ to move bugzilla to our new webserver
**  He is on vacation until the 28th.
* install new XXE and the new XXE Forrest config for all (or check that it is
  installed and working)
**  Done, except Ilona.
* mark-up names
**  Not done
* divvun.no and giellatekno.uit.no
**  Binary files download area
*** Not done
**  Make the conversion to static site, using our own script.
*** Done some work on that, working on the script
* hyphenation in corpus docs
**  Not done
* corpus xsl files under version control
**  Not done
* register AppleCare
**  Done for Børre, Duommá and Tomi

!! Maaren
* shall work with Sámi place names only
** done
* update the last issue in the North Sámi normativity issues document
** done
* register AppleCare
** not done

!! Saara
* Look at the corpus infrastructure issue
* Look at the corpus interface issue with Lars
* look into efficient editing of the xml proper name lexicon (tools,
modes, etc)
** not done (the kvensk place name issue)
* Convert the name lexicon from present format to xml
** not done
* document corpus infrastructure, your own parts
** done
* Look at the hyphenation issue
** updating preprocess script according to the spec in previous meeting
* Update the corpus.dtd
** done
* corpus xsl files under version control
** started discussion in the newsgroups
* make preprocess and lookup2cg faster
** lookup2cg ready

!! Sjur
* Lule Sámi twol problems, look again at the sets definition with __Thomas__ and
  __Trond__
** nothing
* risten.no bugs and fixes
** risten.no has crashed. Took most of Thursday and Friday, still not back.
* follow up on voice group-chat not working to Sámediggi
** Test Marratech
*** internal URL is now working (got an updated URL), I'm waiting for the
    external URL. The first test with myself indicates that the video quality
    is far inferior compared to iChat. But most important is sound quality
    - we'll see.
* project planning with __Trond__, continued
** also look at the development processes - specification and  testing
*** nothing
* Follow up on place names from Norge Digitalt
** write an e-mail to or call __Bjørn Olav Megard__
*** wrote to him, nothing more due to the risten.no crash
* Evaluate SFST as speller (and analyzer) lexicon
** more thorough analysis than was possible in Guovdageaidnu
*** nothing
* write a background document on the corpus contracts
** nothing
* discuss kvensk project support with __Trond__
** not with __Trond__, but spoke with __Hallgeir Varsi__, and wrote to the
   newsgroup. The kvensk integration requires some rethinking of parts of the
   name lexicon, which was briefly started in the newsgroup.
* proper name integration with risten.no
** part of general work with risten.no
* discuss risten.no work with __Tomi__
** done
* write public tender documents
** nothing
* hyphenation in corpus docs
** nothing
* buy:
** new computer (project server)?
*** not done yet. If we do it, the invoice should be at SD before Dec. 15.
* register AppleCare
** done.
* other:
** bureaucrazy: travel invoices, other invoices.

!! Thomas
* work on North Sámi compounding and derivation
** finished with second-syllable deletion in compounding
* Look at Linguistic bugs with __Trond__
* Continue to meet with Sjur and Trond about and work with the definition of G1, G2, G3
** no meeting or bug-fixing this week
* update the Lule Sámi normativity issues document about incorporation of loan words
** done
* register AppleCare
** asked Børre to do it for me

!! Tomi
* Aspell: Continue working on the affix file & aspell
** Contact aspell author (UTF-8 thing)
*** Not done
* three-part compounding
** Not done. 
* corpus infrastructure: dtd location (both public and internal)
** Not done
* Document aspell and corpus infrastructure
** Not done this week
* Specification for new catxml in C++
** this includes also placing the source and binary
*** Not done
* start looking at conversion of the name lexicon from present format to xml
** Not done
* Look into synchronisation of proper names with risten.no
** Modified risten.no to accept multiple collections
*** so far, this is only done for searching, not editing, true? Editing is not
    checked in yet.
* hyphenation in corpus docs
** Not done
* corpus xsl files under version control
** Not done
* add automatic language detection to the corpus processing
** Not done
* register AppleCare
** Børre did? Jipps:-). Done.
* Other:
** Done some work with risten.no

!! Trond
* update the contracts with changes from the university lawyer
** Done
* Look into the document hyphenation issue
** Not done.
* Look at the three-part compound issue
** Worked on this issue with Thomas
* Work on the CG-related bugs on the bug list (7 open) (numeral related ones
  postponed).
** Still 7 open  
* project planning with __Sjur__, continued
** also look at the development processes - specification and  testing
* The name project
** Work on the name project, mark up names (100 names left)
*** All names now marked
* discuss kvensk project support with __Sjur__
** Not done.
* Work on the G3 bug issue with __Sjur__ and __Thomas__, carry it over to sme.
** Progress made at smj, still not sme.
* Other:
** Disamb with Linda
** Corpus conversion & format with Børre.

!!!3. Documentation

Documentation tasks:

Add documentation on our corpus infrastructure and our corpus work in general
(__Børre__, __Tomi__, __Trond__, __Saara__):
* The directory structure is now settled (as of last meeting), and should be 
  documented.

For the basic corpora, we need 2 additional types of documentation, or doc for 2 target
groups:

# For the __users/linguists:__ Which corpora exist, and how do I use them? (This
  info is now scattered. Part of the how-to-use information is documented in
  the catxml documentation. What documents are found where, plus an overall
  documentation, is not written yet, since the corpus is so sparsely populated.)
# For the __collectors:__ How do I add texts, where do I add them, how do I
  convert them (this is (partly?) done in the Corpus Conversion document)

test:

* add/update Aspell documentation (__Tomi__)
** Some documentation has been written, but there still is work to be done.
* as always: document what you're doing:-) (__all__)

Tomcat->static HTML progress

Currently, all pages are generated directly from XML by Forrest within Tomcat.
We will change this so that Forrest pre-generates the HTML (and PDF), and these
ready-made files are served directly.

!!!4. Corpus gathering

All the available law texts are downloaded from Odin, with (most of?) their
corresponding Norwegian originals.

All available texts from NSI's web site have been downloaded.

The texts are now in our corpus repository, but not converted to the corpus
format. Some bugs relating to this process have been posted to Bugzilla.

* [http://troms.kulturnett.no/bibliotek/samisk/samisk_materiale.htm]
* [Sámi legal text on the net|http://www.lagat.no]
** We need all these texts
*** done
** We need to survey the site in the future
*** new translations under way
** We need the Norwegian versions as well
*** done

!!Contracts

Update: All SD versions are now synchronised with the templates. Trond met with
the lawyer, and she commented on the updated versions. Trond has updated the
contracts.

Next step:

# Make __our__ versions of the updated Helsinki contracts, and make sure they
  are according to our intention. (__Sjur and Trond__)
# send them to the SD lawyer and to the University lawyer through formal
  channels. (__Sjur and Trond__)

Contract 1 should have the main priority.

!!!5. Corpus infrastructure

Updated task list:
# Include the xsl files under version control (__Børre, Tomi, Saara__)

## Saara has started a discussion in the newsgroup - please follow up!
# Incorporate language detection as part of the corpus processing (__Tomi__)
## __Tomi__ will start, but possibly hand it over to __Saara__ if it takes too
   much time (__Tomi__ will have to prioritize the name lexicon/risten.no)
# we need a way to deal with hyphenated documents (documents with (manually) inserted
  hyphenation marks) in catxml/preprocess.
  (__Tomi, Børre,__ (discussion in the newsgroup:) __Sjur, Trond, Saara__)
## Discuss details in the newsgroup
### What needs to be discussed now are the conditions that distinguish the
    third-last and second-last points below (line-break hyphenation versus
    truncated compounds).
## in normal cases hyphenation points should be removed
### Done by __Saara__ in preprocess. It is possible to turn it back to a regular
    hyphen with a flag.
## when testing the robustness of our parsers, as well as when testing the
   hyphenator, the hyphenation points should be retained:
### This is true for examples like "eala-<CR>hus", they
    should be converted to something like: "eala<hyph/>hus", in order to both
    keep the hyphenation information when needed, and get it out of the way when
    not.
### In cases of truncated compounds like "ealahus-<CR> ja ...", we want the
    hyphen to stay untouched, and be part of the linguistic processing.
### There are sporadically text books with explicit hyphenation points, like:
    ea-la-hus. In these documents, all hyphens, without exception, should be
    converted to <hyph/>.
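
The hyphenation cases listed above can be sketched roughly as follows. This is a minimal Python illustration, not the actual preprocess code: the function names are hypothetical, and the rule used to tell a line-break hyphenation from a truncated compound (a word character directly after the line break) is only a placeholder assumption, since the exact conditions are still to be discussed in the newsgroup.

```python
import re

HYPH = "<hyph/>"  # marker element named in the minutes


def mark_line_break_hyphens(text: str) -> str:
    """Join a word split across a line break and keep a marker in its place.

    Assumption: a hyphen is a hyphenation point only when a word character
    follows directly after the line break. Anything else (for example a
    truncated compound followed by a space) is left untouched.
    """
    return re.sub(r"(?<=\w)-\n(?=\w)", HYPH, text)


def mark_all_hyphens(text: str) -> str:
    """For text books with explicit hyphenation points (ea-la-hus):
    convert every hyphen, without exception, to the marker."""
    return text.replace("-\n", HYPH).replace("-", HYPH)


def remove_hyphenation_points(text: str) -> str:
    """Normal case: drop the markers so the word is analysed as a whole."""
    return text.replace(HYPH, "")


if __name__ == "__main__":
    marked = mark_line_break_hyphens("eala-\nhus")
    print(marked)                             # marker retained for testing
    print(remove_hyphenation_points(marked))  # marker removed for normal use
    print(mark_all_hyphens("ea-la-hus"))
```

Keeping the marker as a separate step from removing it means the same converted text can serve both normal analysis and hyphenator or robustness testing.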

!!!6. Linguistics

!!Name lexicon

Summary: see the [newsgroup|news:di5mbi$26ad$1@news.uit.no]

The plan for this project was as follows: Two lines of work run in parallel:

# name markup
## Done! There are errors in the markup, people are urged to correct them as they
   pop up.

!!Complex names

Task list for this issue:

* Restoring two-part names
** done, common/src/proper-complex.xml
* find any unique second-parts (B-parts of names that do not exist in
  isolation): Take the 1.126 version, remove the non-last parts, and compare the
  B-parts of proper-complex.xml with the simplex forms in 1.126. Non-hits should
  then be removed from the current propernoun-sme-lex.txt.
** Postponed till next week
* remove these B-parts from the ordinary name file (__Tomi__)
* make sure xml2lexc can handle complex names in ways compatible with our
  present tool chain (=reconstruct the lexc format we have now) (__Saara__)
* the resulting file format should be identical to our present prop-name
  file (=lexc), that can then be converted to our new xml format using the same
  script as for the regular names (__Tomi or Saara__, but only when the
  technical details are settled)
* The core issue: The preprocess script can handle (A + B) compounds, but not
  (A + inflected B).
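
The B-part comparison in the task list above amounts to a set difference. The following Python sketch shows the idea on plain lists; reading proper-complex.xml and the 1.126 name file is left out, since their exact formats are not described here, and the function names and sample names are hypothetical.

```python
def unique_b_parts(b_parts, simplex_names):
    """Return the B-parts that never occur as a name on their own.

    b_parts:        second parts taken from the complex names
    simplex_names:  names that exist in isolation in the older lexicon
    """
    return sorted(set(b_parts) - set(simplex_names))


def remove_entries(lexicon_entries, to_remove):
    """Drop the given names from the ordinary name list, keeping the order."""
    drop = set(to_remove)
    return [entry for entry in lexicon_entries if entry not in drop]


if __name__ == "__main__":
    # Placeholder data only, not taken from the real lexicon.
    b_parts = ["Bpart1", "Bpart2", "Bpart3"]
    simplex = ["Bpart2", "Other"]
    current = ["Bpart1", "Bpart2", "Bpart3", "Other"]

    non_hits = unique_b_parts(b_parts, simplex)
    print(non_hits)
    print(remove_entries(current, non_hits))
```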

!!North Sámi

* three-part compounds issue still open
** look at Lule Sámi, but apply it to second-parts only
*** Thomas is finished with three-part compounds for North Sámi. On the negative
    side, we note a compilation time of 10m for the sme parser as a whole.
** the exact rules for when shortening happens should be documented (__Maaren__
   will send the normativity decisions to __Thomas__)
** descriptive facts from our corpus (__Trond, Thomas__) added to the normative
   issues document
** linguistic analysis/discussion to continue in the newsgroup
** we should include the members-to-be of the Sámi Giellalávdegoddi (SGL) in
   this discussion, but how? We need a common meeting / normativity seminar /
   seminar on proofing issues and on language technology in a broader sense with
   them. Then we also need to open the linguistic issues newsgroup we talked
   about earlier, and set up their computers and teach them to use the
   newsgroup. This seminar needs to include the Lule Sámi section of SGL (others
   are welcome too).
   Timetable:
*** The Sámediggiráddi is going to appoint new members before Christmas
    (__Johan Mikkel Saara__ is asking people)
*** Contact the SGL when it is elected, in December, and ask them to arrange a
    normativity seminar, with invited keypersons.
*** seminar in February?
** TODO list:
*** __Maaren__ to discuss with relevant persons on this issue.
* number project still open (see below)
* diphthong simplification/G3 issue should be carried over from Lule Sámi.
** TODO: 
*** Settling the empirical facts for sme (__Thomas__)
*** Writing the rules (copy from smj, adjust)(__Sjur, Thomas__ and __Trond__,
    next week)

!!Lule Sámi

Great progress has been made on the G3 issue, just some minor points remain.
oa:å has been carried over to ä:e.

Open tasks:
* Derived G3 forms that look like G2 are still open.
** __Thomas, Trond and Sjur__ meet shortly after Dec. 5th to finish the rest.

Today's compilation time:

real    5m17.157s
user    3m26.827s
sys     0m5.070s

!!Numerals

The following North Sámi linguistic issues should be settled before going into
the numeral project:

# Three-part compounds
# Diphthong simplification
# Derivation

These issues were recently completed in Lule Sámi, and it is more efficient to
complete them in North Sámi directly afterwards than to begin a new topic.

Numeral treatment is at different levels in the existing sme and smj parsers,
but the issue itself is common to the two languages, and should therefore be
treated in parallel.

Numerals in North Sámi:
Inventory is listed elsewhere.

Numerals in Lule Sámi:
There are 70 lines of code setting up the structure for case inflection of basic
numerals.

!!!7. Name lexicon infrastructure

The kvensk place name lexicon has its own info needs:

{{{
Linguistic info
 1. oppslagsform, headword
 2. bøyningsform, inflection
 3. kode for bøyningsform
 4. evt. grenser mellom navneledd og –element, boundaries of the word component
    and name element(s)
15. kvensk(e) navnevariant(er)
16. samisk parallellnavn
17. norsk parallellnavn
18. beslekta navn, relating toponyms, vertailunimi(-nimet)
24. etymologi

Geographical/location info:
 5. type sted, type of place, paikanlaji
 6. kode for navntype, iflg. SSR
 7. gnr., bnr.
+
10. kommune
11. fylke
14. koordinat

Legal info:
 8. status etter Lov om stadnamn
 9. vedtaksmyndighet etter Lov om stadnamn

Source info:
19. informantforklaringer
20. informant(er)
21. innsamler(e), årstall
22. arkiv
23. litteratur, kilder

Unclassified:
12. kartprodukt
13. kartblad
25. pilhenvisning, nuoliviite, til annen artikkel
26. lydfil
27. bilde(r), illustrasjone(r)
28. andre kommentarer, “sekkepost”
}}}

Present proposal:
* name-oriented, single document:
** one name with many uses stored only once

Present risten.no:
* Concept-oriented center, contains:
** ID, links to each language entry
* each language as a separate document, with links to the concept/entity in the
  cross-lingual center

Possible new proposal 1: as risten.no

Possible new proposal 2: separate documents:
* Containing one common concept or name id field, plus Divvun/Disamb fields only
* Containing one common concept or name id field, plus kvensk fields only
* common document containing one common id field, plus fields common to several
  bases

Porsanger is both a person name and a place name.
Porsáŋgu is only a place name, not a person name.

With 5 languages, this gives 10 Trosterud entries and 5 Timbuktu entries. It
would be better to have 2 Trosterud entries and 1 Timbuktu entry, but with 15
continuation lexica (contlexica) for these three concepts.
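
The entry counts above can be reproduced with a small Python sketch. It rests on assumptions that are not spelled out in the minutes: that Trosterud counts as two concepts (person and place) and Timbuktu as one, and that each concept needs one continuation lexicon per language. The language codes and function names are placeholders.

```python
from collections import Counter

# Placeholder codes for the five languages.
LANGUAGES = ["lg1", "lg2", "lg3", "lg4", "lg5"]

# (name, kind) pairs: three concepts in total (assumed split).
CONCEPTS = [
    ("Trosterud", "person"),
    ("Trosterud", "place"),
    ("Timbuktu", "place"),
]


def per_language_entries(concepts, languages):
    """Every concept is stored once in every language document."""
    counts = Counter()
    for name, _kind in concepts:
        counts[name] += len(languages)
    return dict(counts)


def concept_oriented_entries(concepts, languages):
    """Every concept is stored once; each one carries one
    continuation lexicon per language."""
    entries = dict(Counter(name for name, _kind in concepts))
    contlex_slots = len(concepts) * len(languages)
    return entries, contlex_slots


if __name__ == "__main__":
    print(per_language_entries(CONCEPTS, LANGUAGES))
    print(concept_oriented_entries(CONCEPTS, LANGUAGES))
```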

Discussion to continue in the newsgroup.

Tasks:
# testing of conversion
# continue the discussion of the name lexicon format
  (__Saara, Tomi, Sjur, Trond__)
# implement a prototype in eXist
# eXist as editor:
## develop the needed XQueries and interface
## synchronisation between risten.no and 
## test whether eXist as editor is actually working well

!!!8. Other

!! speech synthesis

! Things happening within SD

* The Davvi girji plan: Have a person read in tapes of the whole dictionary
* The ODIN edt. wants to add speech synthesis to their web pages (for disabled
  people), and they are asking a Capella to find out what it takes for Sámi
  speech synthesis

! the UiT NFR application

Application for international cooperation to NFR, Deadline Dec. 1. 
Preproject deadline 

Both Divvun and Disamb milieus see this as an important development path, and we
will participate in the different planning processes. We aim at sending in an
application to NFR by Dec. 1.

!!Proofing article: deadline Dec. 5.

We go for it - 2 pages.

"The official versions should be prepared according to the technical
requirements of Springer LNCS series.  You are kindly asked to consult 
our detailed instructions at the location
[http://www.ling.helsinki.fi/events/FSMNLP2005/instructions-official.html]."

(__Trond, Sjur and Tomi__)

!!Technical issues

* The mac os / perl bug (at least __Trond__ and __Sjur__ have it):
** utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
   82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl
   5.8.6). It is probably a perl - OS mismatch. (__Trond__, __Thor Øivind__,
   __Tomi__)
*** Trond has filed a bug report on this (#211) and discussed it with
    Thor-Øivind. There is progress (Saara has commented), but the issue is
    complicated.
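
As background for the message above: in UTF-8 the byte 0xC4 is the first byte of a two-byte sequence (for example č is C4 8D), so it cannot be decoded on its own. The Python sketch below shows one way such an error can arise, namely when a byte stream is cut between the two bytes. It is only an illustration under that assumption, not a diagnosis of the Perl problem in bug #211, and the function names are hypothetical.

```python
import codecs


def decode_naive(chunks):
    """Decode each chunk separately; fails if a character is split."""
    return "".join(chunk.decode("utf-8") for chunk in chunks)


def decode_incremental(chunks):
    """Decode with an incremental decoder that buffers incomplete bytes."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))  # flush, raise if incomplete
    return "".join(parts)


if __name__ == "__main__":
    data = "\u010da".encode("utf-8")   # the letter č followed by 'a'
    chunks = [data[:1], data[1:]]      # split in the middle of č
    try:
        decode_naive(chunks)
    except UnicodeDecodeError as err:
        print("naive decoding failed:", err)
    print(decode_incremental(chunks))
```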

!!Video conferencing across firewalls

We're still waiting for a working URL (working from outside SD, that is).

!!Bug fixing

__24__ open bugs (and 24 risten.no bugs)

!!Move Bugzilla

Move Bugzilla to the same server as the other ones (or make it work at the
expected URL: http://giellatekno.uit.no/bugzilla/).

TODO, TODO. __Thor Øivind__.

!!risten.no

Risten.no crashed badly last week, with no traces of what happened. Tomi
and Sjur are working on restoring it, with all data and an updated eXist version.

Tomi will then continue the proper name work.

!! AppleCare extended warranty

Only Maaren has yet to register it, and will do so today.

!! Rucksacks

Not delivered yet! __Sjur__ will investigate.

!!!9. Summary, task list

!! Børre
* Gather public texts
* Continue converting text from input format to our xml
* Document the corpus directory structure
* Ask __Thor-Øivind__ to move bugzilla to our new webserver
** ... or make the URL [http://giellatekno.uit.no/bugzilla] point to the present
   installation
* install new XXE and the new XXE Forrest config for Ilona
* divvun.no and giellatekno.uit.no
**  Binary files download area
**  Continue the conversion to static site, using our own script.
* corpus xsl files under version control

!! Maaren
* working with risten.no
* register AppleCare
* find decisions/documents regarding syllable shortening in compounds in the
  norm; send them to Thomas

!! Saara
* Look at the corpus infrastructure issue
* Look at the corpus interface issue with Lars
* look into efficient editing of the xml proper name lexicon (tools, modes, etc)
* Convert the name lexicon from present format to xml
* document corpus infrastructure, your own parts
* Look at the hyphenation issue
* Update the corpus.dtd
* corpus xsl files under version control

!! Sjur
* Lule Sámi twol problems, look again at the sets definition with __Thomas__ and
  __Trond__
* risten.no bugs and fixes
* follow up on voice group-chat not working to Sámediggi
** Test Marratech
* project planning with __Trond__, continued
** also look at the development processes - specification and  testing
* Follow up on place names from Norge Digitalt
** write an e-mail to or call __Bjørn Olav Megard__
* Evaluate SFST as speller (and analyzer) lexicon
** more thorough analysis than was possible in Guovdageaidnu
* write a background document on the corpus contracts
* discuss kvensk project support with __Trond__
* proper name integration with risten.no
* discuss risten.no work with __Tomi__
* write public tender documents
* hyphenation in corpus docs
* buy:
** new computer (project server)?
* Work on the Speech Application (Dec. 1)
* Work on the proofing article (Dec. 5.)
* Investigate the never-arriving backpacks

!! Thomas
* work on North Sámi compounding and derivation
* Settle the empirical facts for sme diphthong simpl.
* add descriptive facts about shortened forms in compounding from our corpus 
  to the normativity issues document

!! Tomi
* Aspell: Continue working on the affix file & aspell
** Contact aspell author (UTF-8 thing)
* three-part compounding
* corpus infrastructure: dtd location (both public and internal)
* Document aspell and corpus infrastructure
* Specification for new catxml in C++
** this includes also placing the source and binary
* discuss xml-processing with __Saara__
* look into efficient editing of the xml proper name lexicon (tools, modes, etc)
* start looking at conversion of the name lexicon from present format to xml
* Look into synchronisation of proper names with risten.no
* hyphenation in corpus docs
* corpus xsl files under version control
* add automatic language detection to the corpus processing
* Work on the proofing article (Dec. 5.)
* remove last part of complex names not used as simplex names

!! Trond
* Look at the three-part compound issue
* Work on the CG-related bugs on the bug list (7 open) (numeral related ones
  postponed).
* project planning with __Sjur__, continued
* discuss kvensk project support with __Sjur__
* Work on the G3 bug issue with __Sjur__ and __Thomas__, carry it over to sme.
* Working on the Speech Application (Dec. 1)
* Working on the proofing article (Dec. 5.)


!!!10. Next meeting, closing

05.12.2005 09:30

Closed at 12:53