!!!Meeting setup

* Date: 18.10.2005
* Time: 10.00 Norw. time
* Place: Wherever we are :-)
* Tools: iChat, SubEthaEdit

!!! Agenda

# Opening, agenda review
# Reviewing the task list from two weeks ago
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Linguistics
# Speller infrastructure
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 10:07.

Present: __Børre, Maaren, Sjur, Thomas, Tomi, Trond__

Absent: __Saara__

Main secretary: __Tomi__

Agenda accepted as is.

!!!2. Reviewing the task list from the last meeting

!! Børre
* Contact oahpahusossodat and the rest of the SD about texts
**  Contacted the archiving department, where they told me how to search for
     relevant texts in the WebSak-system.
* Reorganise the directory structure
**  Not done
* Continue converting text from input format to our xml
**  Not done
* Have a look at the placenames files.
**  Not done
* Ask __Thor-Øivind__ to move bugzilla to our new webserver.
**  Not done
* Gather public texts
**  Some work done
* Work on the name lexicon
**  Not done
* Other work
**  Fixed the divvun2web script to skip the doc/admin/Projects directory
**  Discussed with Tomi on how to implement the new corpus structure

!! Maaren
* The missing list, both the overall missing list from our xml corpus, and a
  file-for-file review, in order to get different terminology.
  ** Not done
* continue working with the missing list from risten.no 
** done a little bit
* Start working on Sámi place names
** not done
* Start working at normativity issues (numeral issues with __Trond__?)
** not done   

!! Saara
* Look at the corpus infrastructure issue
* Look at the corpus interface issue with Lars
** Done some work on that issue.
* Convert texts from .doc to .xml, to get a grasp of our corpus format
** Got a grasp of it.
* Have a look at the pdf-to-xml issue
** Done, but not documented.
** use the priority list earlier in the memo for a guidance
*** Sjur: I assume this one can be closed?
* make an emacs mode for the name project (cf. specs in the memo above)
** Not done.
* prepare for a presentation of the pdf etc. conversion together with __Tomi__
  for the next meeting.
** Not done.


!! Sjur
* Lule Sámi twol problems, have a look at the sets definition
** Done, but more work is needed
* risten.no bugs and fixes
** Nothing
* follow up on:
** voice group-chat not working to Sámediggi
*** Now awaiting cost evaluation from the IT guys (__Geir Kaaby__ et al)
*** Nothing
* project planning with __Trond__, continued
** More evaluation of tools, due to some of the limitations of Merlin
* Prepare for a Lule Sámi meeting with __Anders Kintel__ 17th of November
** Berit Karen Paulsen will invite him; Børre, Sjur and Bitte will participate
   if the date suits him
* Follow up on place names from Norge Digitalt -> remind __Bjørn Olav Megard__
** Nothing
* Evaluate SFST as speller (and analyzer) lexicon
** more thorough analysis than was possible in Guovdageaidnu
*** Nothing
* write a background document on the corpus contracts
*** Nothing
* Other:
** went through the comments from the lawyer with Trond, re 1. contract
** checked all open Divvun bugs in Bugzilla, updated and closed some
** wrote a lengthy e-mail to the makers of Merlin, with several requests for
   features and clarifications; the future of Merlin looks nice, but the
   timespan is unclear, so the question is really whether Merlin is ok in its
   present state
** 2 days off in Trondheim

!! Thomas
* work on Lule Sami compounding and derivation
**worked and still working
* Meet with Sjur and Trond about the definition of G1, G2, G3 in Lule Sámi
**we had our meeting
* Look at Linguistic bugs with __Trond__
**looked at some
* Prepare for a Lule Sámi meeting with Árran
**not done
*** __Sjur__: This can be removed, it is now in the hands of __Sjur/Bitte/Berit Karen__
!! Tomi
* Aspell: Continue working on the affix file & aspell
** Contact aspell author (UTF-8 thing)
*** Not done.
* three-part compounding
** Not done
* corpus infrastructure: dtd location (both public and internal)
** Not done
* corpus infrastructure: file and dir organisation
** Still working
* Document aspell and corpus infrastructure
* Add html-to-xml conversion to corpus infra
** Done
* Cgi-script for uploading documents to corpus base
** Add URL uploading
*** Functionality not implemented yet
* Contact __Saara__ about xml conversion
** prepare for a presentation of the pdf etc. conversion together with __Saara__
  for the next meeting.
*** Not done
* Other tasks:
** Wrote new xml-processing tool in C++ to replace catxml-script.


!! Trond
* Work on the bug list (11 open).
** 7 open.
* Get the new version of the New Testament
** Still no answer in the second round.
* project planning with __Sjur__, continued
** Done some.
* Follow-up the University lawyers for comments on the contract
** Discussed with lawyer, gone through 1 of 3 with Sjur.
* Work on the name project: Clean up the lexicon file, discuss the emacs mode with
  __Saara__ and the work with __Maaren__ and __Børre__.
** Still awaiting Saara's emacs mode, while waiting I have converted some thousand names.  
* Add docu on the corpus infrastructure
** Don't remember this discussion, not done.
* Other:
** Meeting with Kvensk revitalisation project
** Discussions with Linda on work tasks, preparing things.
** Done work on the Lule Sámi rule set with Sjur and Thomas.


!!!3. Documentation

Documentation tasks:

# Add documentation on our corpus infrastructure and our corpus work in general
  ("To be done by the ones making the corpora": __Børre__, __Tomi__, __Trond__,
  __Saara__).

# Now  we have 4 documents:
## Correct corpus (disamb usage)
## Corpus plan (for the disamb corpus cwb)
## Corpus conversion, two versions, in infra and in ling. __Tomi__ and __Børre__
   have done parallell work ;-(
## catxml

For the basic corpora, we need 3 types of documentation, or doc for 3 target
groups:

# For the users/linguists:
## What corpus are found, how do I use them (this info is now scattered)
# For the collectors:
## How do I add texts, where do I add them, how do I convert them (this is the
   Corpus conversion doc)
# For the programmer
## What did I actually do? (this is partly the catxml doc)

For the work on the graphical user interface, we need documentation as well, in
principle along the same lines, except that the user is not the same linguist
as above.

# add/update Aspell documentation (__Tomi__)
## Some documentation has been written, but there still is work to be done.
# as always: document what you're doing:-) (__all__)

!!!4. Corpus gathering

* Governmental documents (earlier in pdf, now in html)

Tasks:
* move existing gov. documents (pdf) from gt/ to our corpus repository (Børre)
* Collect public (pdf and html) files (Børre)

!!Contracts

Tasks:
* Follow-up on the lawyers' comments (__Trond__ has started with the university)
** Still more work to be done.
* add a background document explaining the model (__Sjur__)

The most problematic issue:

Who has the copyright of extracted material, like single words, collections of
words, syntactic structure (potentially with some words filled in)? We need
this to be controlled by us, not by the authors. The exact borderline is hard
to define.

!!North Sámi New Testament

* If we don't hear anything from Bibelselskapet, we will have to use the version
  we already got.
** Still not anything. Trond will inform them that we will use what we have.

!!Lule Sámi Dictionary

We will invite __Anders Kintel__ to a meeting in Tysfjord on Nov 17th, where we
will discuss the possibilities and requirements for cooperation. The overall
meeting in Tysfjord is there to prepare the municipality for being included into
the legally defined Sámi-speaking area.

__Bitte__ and __Børre__ will participate, as well as __Sjur__, given that
__Anders Kintel__ is available.

* __Sjur__ will check whether __Berit Karen__ has contacted __Anders Kintel__

!!!5. Corpus infrastructure

* Wrote new xml-processing tool in C++ to replace catxml-script.
** where to put it:
*** binary:
*** source:
** specification:
*** command-line interface
*** what we have already
* the dir reorg disc. betw. Børre and Tomi
** Directory structure should be defined according to the xml-file metainformation
{{{
<!-- scheme="dewey" code="44444" -->
<!-- scheme should be dewey or uit or whatever -->
<!-- UiT: bible, news, fiction, facts, adm, ... -->
<!ELEMENT genre EMPTY >
<!ATTLIST genre
            scheme  #PCDATA         #REQUIRED
            code    #PCDATA         #REQUIRED
    >

<genre scheme="uit" code="news" />
<genre scheme="dewey" code="444" />
}}}

For reference: This is what we decided in Helsinki:

{{{
    admin/depts/  (governmental departments)
          guovda/ (Guovdageaidnu municipality)
          karas/  (Kárášjohka municipality)
          sd/     (Sámi parliament)
          others/ (everything else)
    bible/ot/
          nt/
    facta/
    ficti/
    laws/
    news/MinAigi
        /Assu
        /NRK
        /YLE
        /other
}}}

* Reprocess the old (from new dir.) corpus files

!!Naming conventions and directory structure

* The original file should be protected using file and directory permission.
* The meta information (i.e., the xsl translation files) should be under version
  control
* Given that our language detection works well, the intermediate file don't need
  to be under version control (the lg identification tool is under gt/script,
  and it needs to be made part of the coprus processing)

Tasks: 
# Make a system for file and directory permission (today: we all belong to the 
  cvs group), to only allow people with root user privileges write access to the
  corpus repository, at least regarding original files
# Include the xsl files under version control (cvs? rcs?)  
# Incorporate language detection as part of the corpus processing.
# the dir structure is:
## one dir for orig, containing also the meta-info and interm. files
## another dir for our ready-to-use xml files after conversion
# dir structure for web-posted corpus files:
## subdivision according to week or month, we start out with month till we see
   the amount of traffic (yyyy-mm)
### Done
# we need a way to deal with hyphenated documents in catxml/preprocess:
## in normal cases hyphenation points should be removed
## when testing the robustness of our parsers, as well as when testing the
   hyphenator, the hyphenation points should be retained

!!Corpus conversion

!Pdf to XML

__Saara__ has made a new conversion module, it is almost finished. We'll return 
to the issue, evaluation, etc. on the next meeting.

Task: __Saara__ to prepare for this presentation, and to make documentation.

perldoc gt/script/samiChar/Decode.pm

!(X)HTML to XML

Tomi has been looking at this, and is making an xsl script for it. The web form
developed by __Tomi__ should be augmented to allow posting of URL's as well as
documents from the local file system.

The URL posting need to check whether the same URL has been posted before, and
if so, whether the page has changed.

Task: __Tomi__ and __Saara__ to present status quo and suggest routines, merger,
etc. on the next meeting.

The documentation for corpus conversion should be added to the
gt/doc/ling/corpus-conversion.xml document.

!!!6. Linguistics

!!Name lexicon

Summary: see the [newsgroup|news:di5mbi$26ad$1@news.uit.no]

Motivation:

* __Divvun:__ We want to cross-link different versions of the same locations
 in different languages
* __Common:__ We do not want to enter the same names twice. We want a
 language-independent name lexicon
* __Disamb:__ Having a richer tag set makes it easier to disambiguate
* __Future:__ Richer analysis makes new applications possible, within
 information retrieval, grammar checking, machine translation etc.

Needed: A plan for this project:

a. do the main markup in the present propernoun file
b. make a script for converting it to xml (to be done one time)
c. make a script for xml2lexc (to be done by the makefile)
d. make the tags etc. in the parser

Conversion:

# This week
## clean up the present infl. lexicons (merge BLIND and BERN, VUOLAB and LONDON)
  - __Trond__
## Make an emacs mode for markup (__Saara__). Options: fem, mal, sur, plc, org,
   obj, none). Combinations: surplc
# (end of this week and) Next week:
## Mark up as much as possible within a week or so (__Maaren__ to do the Sámi
   names, and to split CNAME into BERN and LONDON, __Trond__ and __Børre__ to
   look at the rest)
## Then convert to xml
## Then mark up the rest with correct semantic tags
## This means we would need a seventh option, the unspecified name.
## Look into efficient editing of the XML lexicon
## Look into synchronisation issues with risten.no - we want the names there
   as well

Updated status quo:
* Entries:    20000
* Converted:  13500
* Time used:     10 h

Needed tools: An emacs mode doing this (__Saara__):
1.  Go to next " NAME ;" ( where NAME is a string of symbols "A-Z\-")
2.  Wait for input, one of these: m f s p o b
3.  Replace " NAME ;" with " NAME-mal ;", " NAME-fem ;" etc. and go to next 
    " NAME ;"

Possible refinement: Encode for combined options (both plc and sur, e.g.)
already in this phase.

Waiting for emacs mode.

!! Twol SETS definition issue

The definition of G1, G2, G3 in Lule Sámi is still open. and we would like to
have input on this issue.

Update: it is still not working, see [bug
193|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=193]

SUGGESTION (__Trond__): __Thomas__, __Trond__ and __Sjur__ have a new meeting on
this issue, on Wednesday.

!!North Sámi

* three-part compounds issue still open, as is the number project.
* The treatment of Sámi place names, we need a contract with "Norge digitalt",
  via UFD.
** __Sjur__ has written an e-mail to the UFD contact person,
   __Øystein Johannessen,__ who will look into it soon. He has not responded
   beyond saying he will return to it. __Sjur__ brought this up in the board
   meeting, and __Bjørn Olav Megard__ will remind __Øystein Johannessen__
   about this issue. __Sjur__ will follow up on this one.
* normativity issues:
** the Giellalávdegoddi meeting was last Friday, they will have a new meeting in
   December. They were not able to make any decisions, and there will be a new
   Giellalávdegoddi beginning next year who won't make decisions until late
   spring. This is a serious problem for the Divvun problem.
*** Actions: __Sjur__ will bring this to the Divvun board, write a new letter to
    the Giellalávdegoddi, emphasizing the needs and timetables of the project

!!Lule Sámi

 __Sjur__, __Thomas__ and __Trond__ will cont. Lule Sámi issues.

!!Numerals

# An empirical overview
## Numeral generation
## Numeral inflection
## Numerals as parts of compounds
# A clear concept of how we want to treat them
## Tagging
# A treatment

We will return to this issue after the name conversion.

!!!7. Speller infrastructure

Nothing this week either.

!!!8. Other

!!Technical issues

* The mac os / perl bug (at least __Trond__ and __Sjur__ has it):
** utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
   82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl
   5.8.6). It is probably a perl - OS mismatch. (__Trond__, __Thor Øivind__,
   __Tomi__)
*** Another __example__ of the same bug:
*** :"\x{00c3}" does not map to utf8 at ../script/preprocess line 113, <> chunk
    33.

!!Bug fixing

__10 open__ bugs (and 24 risten.no bugs)

!!Buying

* rucksacks for all

!!risten.no

* Organisation: could __Tomi__ be used, in exchange for more linguistic work by
  (old) GIO members? Yes, it is ok, but how much still needs to be evaluated
* it is ok to integrate "kvensk" placenames with risten.no
** this should be integrated with the general proper name work - we want all
   proper names integrated with risten.no, df above
** needs further development of risten.no to allow for multiple XML bases to
   be presented and maintained in parallel. This is to be further worked on by
   __Tomi__ and __Sjur__

!!Meeting with Kvensk revitalisation project

Grammar, dictionary, placename lexicon for Kvensk. They want similar
infrastructure as in Sámi language technology. __Trond__ and __Sjur__ will
discuss how we can help them without taking too much attention away from our
real jobs.

!!!9. Summary, task list

!! Børre
* Contact oahpahusossodat and the rest of the SD about texts
**  Doing some digging into WebSak
* Reorganise the directory structure
* Put all corpus texts into one place
* Continue converting text from input format to our xml
* Have a look at the placenames files.
* Ask __Thor-Øivind__ to move bugzilla to our new webserver.
* Gather public texts
* Work on the name lexicon


!! Maaren
* The missing list, both the overall missing list from our xml corpus, and a
  file-for-file review, in order to get different terminology.
* continue working with the missing list from risten.no
** working with the missing list from risten.no this week (today) 
* Start working on Sámi place names
* Start working at normativity issues (numeral issues with __Trond__?)

!! Saara
* Look at the corpus infrastructure issue
* Look at the corpus interface issue with Lars
* Convert texts from .doc to .xml, to get a grasp of our corpus format
* make an emacs mode for the name project (cf. specs in the memo above)
* prepare for a presentation of the pdf etc. conversion together with __Tomi__
  for the next meeting.

!! Sjur
* Lule Sámi twol problems, look again at the sets definition with __Thomas__ and
  __Trond__
* risten.no bugs and fixes
* follow up on voice group-chat not working to Sámediggi
** Now awaiting cost evaluation from the IT guys (__Geir Kaaby__ et al)
* project planning with __Trond__, continued
** also look at the development processes - specification and  testing
* Follow up on the meeting with __Anders Kintel__ 17th of November -> ask
  __Berit Karen Paulsen/Bitte__
* Follow up on place names from Norge Digitalt -> remind __Bjørn Olav Megard__
* Evaluate SFST as speller (and analyzer) lexicon
** more thorough analysis than was possible in Guovdageaidnu
* write a background document on the corpus contracts
* Discuss the contract issue with Trond, return the new version to the lawyer
* write to the board about the lack of progress with the Giellalávdegoddi, and
  the problem it causes for the project
* write to the Giellalávdegoddi once more, emphasizing timetable and response
  needs in the Divvun project
* discuss kvensk project support with __Trond__
* write public tender documents

!! Thomas
* work on Lule Sami compounding and derivation
* Look at Linguistic bugs with __Trond__
* Meet with Sjur and Trond about the definition of G1, G2, G3

!! Tomi
* Aspell: Continue working on the affix file & aspell
** Contact aspell author (UTF-8 thing)
* three-part compounding
* corpus infrastructure: dtd location (both public and internal)
* corpus infrastructure: file and dir organisation
* Document aspell and corpus infrastructure
* Cgi-script for uploading documents to corpus base
* Specification for new catxml in C++
** this includes also placing the source and binary
*** clean the script/ catalogue with __Trond__
* Common makefile issues

!! Trond
* Work on the bug list (7 open).
* Get the new version of the New Testament
* project planning with __Sjur__, continued
** also look at the development processes - specification and  testing
* Discuss the contract issue with Sjur, return the new version to the lawyer.
* Work on the name project: Clean up the lexicon file, discuss the emacs mode with
  __Saara__ and the work with __Maaren__ and __Børre__.
* Add docu on the corpus infrastructure
* clean the script/ dir
* discuss kvensk project support with __Sjur__

!!!10. Next meeting, closing

24.10.2005 10:00

Closed at 12:11