!!!Meeting setup

* Date: 12.09.2005
* Time: 10.00 Norw. time
* Place: Wherever we are :-)
* Tools: iChat, SubEthaEdit

!!! Agenda

# Opening, agenda review
# Reviewing the task list from two weeks ago
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Linguistics
# Speller infrastructure
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 10:12.

Present: __Maaren, Saara, Sjur, Thomas, Tomi, Trond__

Absent: __Børre__

Main secretary: __Trond__

Agenda accepted as is.

!!!2. Reviewing the task list from the last meeting

!! Børre
* Finish crontab specification for the cvs update/export script __Tomi__ made
** Worked on it, it progresses, but there's still an issue open.
* reopen the jspwiki + UTF-8 issue
** Added new issue to the Forrest issue tracker
* Add issue to forrest issue tracker about utf-8 ihtml documents.
** Done
* Contact Svenska bibelsällskapet
** The Lule Sámi NT is in Olavi Korhonen's computer. There was some uncertainty
   about the status of editorial changes, but Olavi should have the latest
* discuss with Anders Kintel about possible cooperation
** Not done
* Follow up on CVS mailing:
** set up __Maaren__
* Meet up with __Trond__ about directory structure
** Done, but needs more work - the notes from Helsinki did not contain
   our decision!
* Contact oahpahusossodat and the rest of the SD about texts
** Not done
* Fixing the machine for the new coworker
** Mostly done

!! Maaren
* The missing list, both the overall missing list from our xml corpus, and a
  file-for-file review, in order to get different terminology
** done a little bit
* Got mainly through the missing list from risten.no
* Start working on grammatical issues with __Thomas__ and __Trond__???
** Not done.


!! Saara
* Get aquainted with the project status quo
** almost done
* Look at the corpus infrastructure issue
** work started
* Look at the corpus interface issue with Lars
** not done

!! Sjur
* risten.no bugs and fixes
** Nothing this week
* complete the action summary after our half-year evaluation
** Not done
* follow up on:
** voice group-chat not working to Sámediggi
*** requires a new firewall, Geir Kaaby will ask for a cost
    specification
** __Maaren__ has problems with SubEthaEdit (can't connect)
*** It is now finally working again
* To the board:
** write proposal for permanent maintenance organisation
*** Done
** write draft specification for the outsourced tasks
*** continued, not finished
** write half-yearly project report with progress and bugdet status
*** started, not finished
** write agenda
*** Done
** Deadline for the board tasks: 3 weeks ahead of the meeting (the meeting is
   October 4th, deadline for the documents is September 13th)
* project planning with __Trond__
** Not done
* Other things done:
** Contacted Øystein Johannessen about synchronised Sámi - Norwegian
   place names (the Norge Digitalt case)
** wrote e-mail to the SFST author with suggestions about how to integrate
   speller functionality - got a long answer back with suggestions on how
   to do it. The question now is whether we ''should'' do it
** read through and commented an article by __Trond__ aobut our projects
** got a working phone and Internet connection:-)

!! Thomas
* work on Lule Sami compounding and derivation
**finished with deverbals, now working with denominals
* Look at Linguistic bugs with __Trond__.
**looked at some, solved some
* Work on the name agreement with "Norge digitalt" with __Trond__
**forwarded it to __Sjur__

!! Tomi
* Aspell: Continue working on the affix file
** Problems with affixfile encoding (Latin-6)
* three-part compounding
** Not done
* Add downcasing to makefile and CVS
** Done
* corpus infrastructure: dtd location (both public and internal)
* corpus infrastructure: file and dir organisation
* Removed circularity from non-recursive transducer

!! Trond
* Work on the bug list (Lule Sámi).
** Done some work.
* Work on compounds (three-part, with __Tomi__)
** Not done.
* Work on the corpus interface (with Lars)
** Had a short discussion with lars and __Saara__, more work needed.
* Corpus infrastructure: dtd location
** Not done. The issue is not only the dtd-s.
* Work on the name agreement with "Norge digitalt" with __Thomas__
** Not done. The problem was that we couldn't read the CD they had sent us,
   and we still haven't got around that obstacle.
* Look at the linguistic aspects of the speller clitics, with 
** Had a look at it, will discuss it with __Tomi__.
* Get the new version of the New Testament
** Have an unoficcial version, still not the off one.
* Check Hans-Ragnars names.
** Done. They are multilingual, and on my machine.
* New coworker
** Work in progress. Vesa Guttorm will be temporarily hired for the rest of this year.
* translate contract
** Done a first translation.
* check the new giellatekno site
** Had a look at it.
* project planning with __Sjur__
** Not done.
* Most of the week has been on making a presentation of our project for a conference in Bolzano.

!!!3. Documentation

Documentation tasks:
# Add documentation on our corpus infrastructure and our corpus work in general
  ("To be done by the ones making the corpora": __Børre__, __Tomi__, __Trond__, __Saara__).
# add/update Aspell documentation (__Tomi__)
# finish divvun2web script (__Børre__)
# as always: document what you're doing:-) (__all__)

!!!4. Corpus gathering

!!Since last meeting:
* __Børre__ has updated info on smj NT
* The Helsinki contract has been translated. 
  The resulting translation needs work, though.
** Someone else than __Trond__ (__Sjur__ or __Børre__) reads through the translation, and,
  if necessary, consults Kimmo for an interpretation. After that, we send the
  texts forward to the Sámediggi 
  and University lawyers.

!!From last meeting:

__Børre__, __Trond__ and __Sjur__ had their meeting, and the Helsinki contract is quite good
as it is. There are some points still to be discussed. We'll continue this week.

Paths forward: We have a contract suggestion. __Sjur__ and __Børre__ should start the
negotiation with the authors and/or publishers to get text. We should make a
priority list for authors, and we should invite ourselves to their meeting to
discuss. We should carry on the discussions both with Iđut and  with Davvi Girji. 

How to proceed:
# Get the contract suggestion ready 
## Translated part 1 ok, part 2 and 3 missing. Done this week
## Get part 4 from Kimmo, and translate it
## Contact our lawyers, at SD and UIT (today, tomorrow).
## When the Norwegian version of the contracts are ready, make
   sure they're corresponding to the Finnish ones in their legal
   interpretation (as far as possible); then publish both versions
   for others to reuse, with background documentation (in cooperation
   with Kimmo Koskenniemi)
# Approach the text owners (see ordered list below)

!!Independent of the contract work
# Bible: The new testament (__Trond__)
# Bureaucratic text: 
## Sámi Parliament (__Børre__)
## Sámi Oahpahusráđđi (__Børre__)
## KRD (__Børre__, check whether we miss texts (discuss with __Trond__))
## the Sámi municipalities (__Børre__)
# Textbooks
## To the extent that text can be got directly from SO.

!!After the contracts are ready

__Sjur__ and __Børre__ should probably take a Tour-de-Sápmi, and meet with the
most important persons and institutions. __Børre__ as the responsible for
collecting, __Sjur__ as responsible for the project, and representative of SD

The tour should be planned, not in this meeting, but before the contracts
are ready, i.e., it should be planned next week.

# Commercially published texts
## Author organisations' meetings
## Key authors one by one
### (list of author names) Kerttu Vuolab, Kirsi Paltto, 
## Iđut and key authors there (__Børre__)
## Davvi Girji and key authors there
# Newspaper text:
## Sámi Instituhtta's (for the old archive of Min Áigi and Áššu)
## Áššu has been making a CD since the end of may, there should be a pile
   there. __Børre__ suggests that they send us the CDs they have, so that we
   may look at them, and ensure that the routines work, and that we are
   able to utilize their format.
## Min Áigi

List of texts with lower priority (to be gathered when the above list is
more or less fixed)
* the Sámi municipalities, 
* Authors with smaller production
* Textbooks

!!!5. Corpus infrastructure

Do documentation.

!!Naming conventions and directory structure

We have a decision from Helsinki:
* have the same directory structure in all three levels, and we also decided
  upon the overall structure.
* Path forward: __Tomi__ and __Trond__ to implement the directory structure

We do not have any notes from our Helsinki meeting (they were left on the
whiteboard). The following is our reconstruction of our decision.

There are three directories, with the same substructure (a 6-way partition
according to genre). The 3 directories contain different versions of the files
as they are processed from original format, via intermediate, to final xml
version.

{{{
orig
 (substructure)
    filename.doc, filename.html, filename.pdf
int
 (substructure)
    filename.int.xml
    filename.xsl    
gt (and we want a new name for gt) 
 (substructure)
    filename.xml
}}}

There is a substructure division according to genre:

{{{
bible (NT, OT, perhaps other liturgical txts)
newspaper
    Min Áigi
    Áššu
    Other
fiction
administrative
    central   (Oslo, Stockholm, Helsinki)
    samediggi (Kárášjohka, Giron, Anár)
    municipalities 
factual (educational) 
legal
}}}

For the linguistic search interface, all texts will probably be published
uncorrected, with some portions manually corrected and published in parallel
to the uncorrected variants.

Things to do next, and persons to do it:

# Rewrite the corpus directory (__Børre__)
# Document the corpus directory (__Børre__)
# Continue the work on translating texts from orig/ via int/ to gt/.
  (__Børre__, __Saara__, with the help of __Tomi__)
# Make a sister catalogue for smj, but with a completely flat structure within
  each orig/int/gt.
## corp/sme/(orig/int/gt)
## corp/smj/(orig/int/gt)
# Document the xsl conversion and scripts (__Tomi__)
# Make conversion for html documents (__Tomi__)
# Start looking at conversion of pdf documents (__Saara__) 

!!!6. Linguistics

Note: The Bugzilla bug categories for lexica and morphophonology are now
split into sme and smj categories.

!!North Sámi

* three-part compounds issue still open
* Johnny Andersen has written a letter to us on the treatment of Sámi place
  names, we need a contract with "Norge digitalt", via UFD. We will have to
  get such an agreement. This is a task for __Thomas__ and __Trond__.
** __Sjur__ has written an e-mail to the UFD contact person, Øystein Johannessen
   who will look into it soon.
* New place names received, should be added to our lexicons
** The problem is that the CD from Statens Kartverk is unreadable.
** Other names should be added.
** Place names should keep their cross-lingual alignment in the lexicon
** Propernoun lexicon structure:
** We need to discuss our lexicon structure for the proper nouns. Should we have
   a semantic division in addition to the linguistically motivated division?
*** Geographical names
*** Personal names

Names are inherently multilingual as well as cross-lingual. Cf. Appendices B
 thorugh Č in Sammallahti 1989.

Examples of place names:

{{{
Karasjok Produkter
       deatnulačča Nils Porsangera (82) go máhtii eanet 
       deatnulačča Nils Porsangera (82) go máhtii eanet 
            juoigi Nils Porsangera go máhtii eanet Deanu
drosjeeaiggát (NAF avd. Hammerfest-Karasjok 1984: 21).
}}}

Example of person names:

{{{
Báđár  - Paadar
Guhtur - Guttorm
Dámmot - Blind
Bieská - Pieska
Bieskán/Bieski - Pieski
Dommá  - Tommi
Duomis - Thomas

Niilas - Nils
Duommá - Thomas
}}}

A first step towards an xml infrastructure for a language-independent 
name database is the following.

{{{
named_entity Porsáŋgu
    semantic class information
        place name
    sme: Porsáŋgu
        continuation lexica
            norw stem and norw gr info
            sme stem and sme gr info
            ...
    nob: Porsanger
        continuation lexica
            norw stem and norw gr info
            sme stem and sme gr info
            ...
    fin: Porsanki
        continuation lexica
            norw stem and norw gr info
            sme stem and sme gr info
            ...
    
named_entity Porsanger
    semantic class information
        person name
    name_lg1 / all
        continuation lexica
            norw stem and norw gr info
            sme stem and sme gr info
            ...
    name_lg2 (-)
}}}

__Conclusion__: We need a name project. 

Issues:
# What format do we want for our common base?
# What semantic information do we want to add to the names?

Planning:
# Who shall work on this?
## Name lexicon work group: __Sjur__, __Trond__, __Maaren__?
# What time plan shall it have?
## Kickoff at Oct 05, in Kautokeino
## Plans ready at some not too later point
## Do what we have to do during the winter
## Implement the name base as input for our parsers at some later point.

Classification:
# Preparatory work
## Talk to Kari Pitkänen in Tampere, who did a semantic classification for 
  the Finnish propernoun lexicon (__Trond__)
## Look into other projects (__Maaren__)
## Make a draft (each) of What We Ideally Want (__Maaren__, __Sjur__, __Trond__)
# Substantial work
## Make a semantic theory 
## Make a proposal with DTD and examples

Making the new base
# Identify status quo and a goal
# Write tools for semiautomatic transition
# Do a pilot test
# Move (parts of?) the name lexicon to the new format
## Part of the manual work could perhaps be given to part-time-workers

Incorporating the new base in the parser
# decide on a proper location for the name base
# conversion tool xml -> lexc

TODO: 
# A Kickoff meeting in Kautokeino.
# Before the kickoff meeting, __Sjur__, __Maaren__ and __Trond__ to do some preparatory work
  as sketched above.


!!Lule Sámi

!Lexicon work
We do not know when we get the lexicon from Anders Kintel. We need a meeting
with him and with Árran in order to coordinate the work. __Sjur__ will call Bitte
and get updated wrt the license issues with the lexicon material.

The goal is to establish a mode of work where people in Árran and in the
language technology projects all work on the same source code, for their
respective projects. In order to get that far, we will need to meet with
Anders and the other ones at Árran.

!Work on Lule Sámi in general

We also need input from the other persons working with Lule Sámi when it 
comes to corpus gathering, terminology, linguistic issues, etc. After we have
integrated the lexicon into the parser we will need a meeting with the persons
working in Árran to find a mode of concrete cooperation. The Lule Sámi
spellchecker needs as broad a basis as possible.

Status quo on the parser:

* All the major POSes have been covered
* closed POSes have been checked
* compounding and derivation
** __Thomas__ has finished with deverbals, now working with denominals 
* Suffix boundary symbol has not been added, we are not sure whether we should
  do that.


TODO:
# Continue the work on the lexicon (__Børre__, __Thomas__, __Sjur__)
# Plan a meeting between our Lule Sámi team and the people at Árran working 
  on linguistic issues (__Sjur__, __Trond__, __Thomas__)
# Carry on the linguistic work (__Thomas__, __Trond__)


!!Numerals

We need
# An empirical overview
## Numeral generation
## Numeral inflection
## Numerals as parts of compounds
# A clear concept of how we want to treat them
## Tagging
# A treatment

TODO:
# Make a documentation chapter on numerals, identifying the open linguistic issues
# Look at implementation

Action plan: __Trond__ and __Maaren__ look into it.


!!!7. Speller infrastructure

!!Aspell

Write documentation here as well.

The munch-list is working, and the affix file is improving. See [previous meeting
memo|/doc/admin/weekly/2005/Meeting_2005-08-15.html].

The problem with the affix file was that it did not accept UTF-8. It accepted Latin 4, 
but mixed đ and ð, since in Latin 6 ð=F0, đ=B9, and in Latin 4 đ=F0, and there is no ð.

There were problems with the latin 6 encoding of the suffix file. After updating to cvs,
the Latin 6 encoding is corrupted. 

* A possible workaround is to keep the affix list stored as UTF-8, but to 
  translate this particular list to Latin 6 during compilation.
* We should also contact the aspell developer and make them (him) fix this bug.


Issues:
# The phonetic file should be systematically looked into.
## Check that it works
## Add more correspondences on an impressionistic basis
# Start work on collecting systematic spelling errors:
## Our in-house file typos.txt
## The soon-to-arrive error texts from newspapers
# The holes in the affix list should be mended
## Adjectives still to be done
# The munching process gets killed at cochise today
## Persons to talk to are Roy Dragseth and Steinar Trædal-Henden. __Tomi__ contacts them.
# We should, at some point, evaluate whether this is The Correct Approach to
  aspell-type speller building
# Affix file UTF-8 problem should be checked and reported.
## Contact the Aspell author and ask for updates/fixes
# The clitics issue: Today we have a manually created affix file in order to
  meet the muncher. Strategy: Do the munching without the clitics, but then
  enrich the manually-created affix file via a genfisuffix -program in order
  to get an automatically created suffix + clitic file to add the compiled
  lexicon to. We will have genfisuff taking affix + clitic and makes it into
  affix'. Then we use affix to munch but affix' to spellcheck:
##  stems + affixlist + cliticlist
##  where all 11 clitics (found in the K lexicon) are mapped onto each and every affix.
# Today, substandard forms are marked as "!SUB". The speller should not include
  them by accepting them. The solution is to grep out the !SUB tagsduring speller 
  compilation.". What is useful is to use the !SUB forms for correction list
   (!SUB -> Correct). The proper place for doing this under aspell is as an addition 
   to the sme_phonetic file. This requires the new lexicon format, to make us
   able to tie variant spellings of the same word together. Presently we can't
   tell whether two lines are variants of the same word or not.
# Documentation
# We must create subcomponents under the Speller
## one for Aspell, another for MySpell, Hunspell, etc.
## __TODO__:
### Investigate
### Write procedures for doing so, for the .
## Directory structure:
{{{
(spell)
    (src)
    (bin)
    (dist)
}}}

The conversion from aspell to myspell will work trivially as soon as the myspell
list becomes smaller.

Issue left open.

!Hunspell

Hunspell is presently already working with OOo, and is a much better speller
engine, linguistically speaking (can handle compounds much better than Aspell,
as well as complex inflection and derivation).
For pointers, see the
[previous meeting memo|/doc/admin/weekly/2005/Meeting_2005-08-15.html].

Issue left open.

!!Other engines

__Børre__ and __Sjur__ had a long discussion with the author of the SFST library/tool
set. Next on his priority list is a feature for handling spelling errors in
running text analysis. This is principally the same as we want, thus there
should be a good opportunity for making SFST into __the__ spelling engine.
__Sjur__ has a suggestion on how to implement this feature that may be  mailed
to the author of SFST.

__Sjur__ repeated the suggestion from our SFST man that we could do this ourselves. The whole SFST
is written in C++. 

The question is: 
# Do we want to do that?
# Would __Tomi__ be able to do it alone, or do we need more resources for it?
# The issue has wide-reaching implications - basically that of replacing the
  Xerox tools with SFST counterparts.

!TODO:
The task deferred a month or so. __Tomi__/__Sjur__ to look into the issue then and
report back

!!!8. Other

!!Technical issues

* Fixing the machine for the new coworker (__Børre__)
** Mostly done
* The mac os / perl bug (at least __Trond__ and __Sjur__ has it):
** utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line 82.
    This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl 5.8.6).
    It is probably a perl - OS mismatch. (__Trond__, Thor Øivind, __Tomi__)
*** Not done, these issues are still open.
** __Sjur__ has a non-solved Backspace + UTF-8 issue
* 27 open bugs - 2 down from last week but still too much! Have a look at what you can fix.

!!!9. Summary, task list

!! Børre
* Finish crontab specification for the cvs update/export script __Tomi__ made
* reopen the jspwiki + UTF-8 issue
* Add issue to forrest issue tracker about utf-8 ihtml documents.
* Contact Svenska bibelsällskapet
* discuss with Anders Kintel about possible cooperation
* Follow up on CVS mailing:
** set up __Maaren__
* Meet up with __Trond__ about directory structure
* Contact oahpahusossodat and the rest of the SD about texts
* Fixing the machine for the new coworker
* Document the corpus infrastructure
* Read through the Helsinki contracts (new translations)
* Reorganise the directory structure
* Continue converting text from input format to our xml

!! Maaren
* The missing list, both the overall missing list from our xml corpus, and a
  file-for-file review, in order to get different terminology. do it next week
* shall get mainly through the missing list from risten.no this week
* Start working on grammatical issues with __Thomas__ and __Trond__
* Work on the name project with __Trond__ and __Maaren__
* Start looking at normativity issues
* Work on the numerals project with __Trond__

!! Saara
* Look at the corpus infrastructure issue
* Look at the corpus interface issue with Lars
* Convert texts from .doc to .xml, to get a grasp of our corpus format
* Have a look at the pdf-to-xml issue (known problem: Keep the Sámi 
  letters sound and safe)

!! Sjur
* risten.no bugs and fixes
* complete the action summary after our half-year evaluation
* follow up on:
** voice group-chat not working to Sámediggi
*** Now awaiting cost evaluation from the IT guys (Geir Kaaby et al)
* To the board:
** write draft specification for the outsourced tasks
** write half-yearly project report with progress and bugdet status
** Deadline for the board tasks: 3 weeks ahead of the meeting (the meeting is
   October 4th, deadline for the documents is September 13th)
* project planning with __Trond__
* Work on the name project with __Trond__ and __Maaren__
* Prepare for a Lule Sámi meeting with Árran
* Follow up on place names from Norge Digitalt
* Read through the Helsinki contracts (new translations)
* Talk to Bitte about the Lule Sámi lexicon
* Evaluate SFST as speller (and analyzer) lexicon

!! Thomas
* work on Lule Sami compounding and derivation
* Look at Linguistic bugs with __Trond__.
* Prepare for a Lule Sámi meeting with Árran

!! Tomi
* Aspell: Continue working on the affix file & aspell
** Contact aspell author (UTF-8 thing)
* three-part compounding
* corpus infrastructure: dtd location (both public and internal)
* corpus infrastructure: file and dir organisation
* Document aspell and corpus infrastructure
* Add html-to-xml conversion to corpus infra

!! Trond
* (__Trond__ will be absent at next week's meeting, or perhaps
  accessible from Kastrup Airport)
* Work on the bug list.
* Work on compounds (three-part, with __Tomi__)
* Work on the corpus interface (with __Lars__ and __Saara__)
* Work on the name agreement with "Norge digitalt" with __Thomas__
* Look at the linguistic aspects of the speller clitics, with 
* Get the new version of the New Testament
* Introduce the new coworker to the work routines
* project planning with __Sjur__
* Work on the name project with __Maaren__ and __Sjur__
* Prepare for a Lule Sámi meeting with Árran
* Work on the numerals project with __Maaren__
!!!10. Next meeting, closing

19.09.2005 10:00

Closed at 13:20