!!!Meeting setup

* Date: 24.10.2005
* Time: 10:00 Norwegian time
* Place: Wherever we are :-)
* Tools: iChat, SubEthaEdit

!!! Agenda

# Opening, agenda review
# Reviewing the task list from two weeks ago
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Linguistics
# Speller infrastructure
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 10:10.

Present: __Børre, Saara, Sjur, Tomi, Trond__

Absent: __Thomas, Maaren__

Main secretary: __Børre__

Agenda accepted as is.

!!!2. Reviewing the task list from the last meeting

!! Børre
* Contact oahpahusossodat and the rest of the SD about texts
**  Doing some digging into WebSak
*** Will contact the Tromsø sámediggi department to get help on this.
* Reorganise the directory structure
**  Done once; new decisions on Friday mean that all that work has to
    be redone
* Put all corpus texts into one place
**  Not done
* Continue converting text from input format to our xml
**  Not done
* Have a look at the placenames files.
**  Not done
* Ask __Thor-Øivind__ to move bugzilla to our new webserver.
**  Not done
* Gather public texts
**  Have done a test download of governmental html-texts
* Work on the name lexicon
**  Not done
* Other, not scheduled
**  Helping out Svenska Bibelsällskapet with making a current Lule Sámi
    translation of the New Testament, cleaning up the doc structure, etc.
    Took one and a half days.

!! Maaren
* The missing list, both the overall missing list from our xml corpus, and a
  file-for-file review, in order to get different terminology.
* continue working with the missing list from risten.no
** working with the missing list from risten.no this week (today) 
* Start working on Sámi place names
* Start working on normativity issues (numeral issues with __Trond__?)

!! Saara
* Look at the corpus infrastructure issue
* Look at the corpus interface issue with Lars
* Convert texts from .doc to .xml, to get a grasp of our corpus format
** done, can we remove this? Yes, indeed.
* make an emacs mode for the name project (cf. specs in the memo above)
** done
* prepare for a presentation of the pdf etc. conversion together with __Tomi__
  for the next meeting.
** done some

!! Sjur
* Lule Sámi twol problems, look again at the sets definition with __Thomas__ and
  __Trond__
** nothing done last week
* risten.no bugs and fixes
** nothing done, but I have received a lot of feedback and requests. This one
   needs some attention soon
* follow up on voice group-chat not working to Sámediggi
** Now awaiting cost evaluation from the IT guys (__Geir Kaaby__ et al)
*** Nothing done by the IT guys, they're too few and have too much to do.
    __Geir__ suggested to try out [Marratech|http://www.marratech.com/] video
    conferencing, which we will do. Sámediggi has a separate meeting room there.
    Marratech provides a cross-platform group video conferencing solution.
* project planning with __Trond__, continued
** also look at the development processes - specification and  testing
*** looked a bit more into project management tools, but still not finished
* Follow up on the meeting with __Anders Kintel__ 17th of November -> ask
  __Berit Karen Paulsen/Bitte__
** done
* Follow up on place names from Norge Digitalt -> remind __Bjørn Olav Megard__
** done
* Evaluate SFST as speller (and analyzer) lexicon
** more thorough analysis than was possible in Guovdageaidnu
*** nothing more yet
* write a background document on the corpus contracts
** nope
* Discuss the contract issue with Trond, return the new version to the lawyer
** done, the contracts are now off for comments from __Kimmo Koskenniemi__
* write to the board about the lack of progress with the Giellalávdegoddi, and
  the problem it causes for the project
** done
* write to the Giellalávdegoddi once more, emphasizing the timetable
  requirements for the Divvun project
** not done yet
* discuss kvensk project support with __Trond__
** nothing
* write public tender documents
** nothing done except adding this to my task list
* other:
** finally looked into several requests regarding Sámi speech synthesis,
   and tried to update the memo from our meeting in Helsinki in August
** continued to work on open bugs

!! Thomas
* work on Lule Sámi compounding and derivation
* Look at Linguistic bugs with __Trond__
* Meet with Sjur and Trond about the definition of G1, G2, G3

!! Tomi
* Aspell: Continue working on the affix file & aspell
** Contact aspell author (UTF-8 thing)
*** Not done
* three-part compounding
** Not done
* corpus infrastructure: dtd location (both public and internal)
** Not done
* corpus infrastructure: file and dir organisation
** Almost done, with __Børre__
* Document aspell and corpus infrastructure
** Documenting
* Cgi-script for uploading documents to corpus base
** Almost ready
* Specification for new catxml in C++
** this includes also placing the source and binary
*** clean the script/ directory with __Trond__
*** Not done
* Common makefile issues
** Done some

!! Trond
* Work on the bug list (7 open).
** Still 7 open bugs.
* Get the new version of the New Testament
** Not done.
* project planning with __Sjur__, continued
** also look at the development processes - specification and  testing
** Done some work on the issue, albeit not with Sjur.
* Discuss the contract issue with Sjur, return the new version to the lawyer.
** Made a new version with Sjur, it is now in Hki for comments.
* Work on the name project: Clean up the lexicon file, discuss the emacs mode 
  with __Saara__ and the work with __Maaren__ and __Børre__.
** Done substantial work here: CNAME gone, unclassified names down from 35k to 
   15k, of these half are DEATNU and probably -plc, so the task is now 
   manageable.
* Add docu on the corpus infrastructure
** Hmm, don't remember this one. Not done.
* clean the script/ dir
** Not done.
* discuss kvensk project support with __Sjur__
** Not done.

!!!3. Documentation

Documentation tasks:

# Add documentation on our corpus infrastructure and our corpus work in general
  ("To be done by the ones making the corpora": __Børre__, __Tomi__, __Trond__,
  __Saara__).
# Now  we have 4 documents:
## Correct corpus (disamb usage)
## Corpus plan (for the disamb corpus cwb)
## catxml

For the basic corpora, we need 3 types of documentation, or doc for 3 target
groups:

# For the __users/linguists:__ Which corpora exist, and how do I use them? (This
  info is now scattered.)
# For the __collectors:__ How do I add texts, where do I add them, how do I
  convert them (this is the Corpus conversion doc)
# For the __programmer:__ What did I actually do? (this is partly the catxml doc)

For the work on the graphical user interface, we need documentation as well, in
principle along the same lines, except that the user is not the same linguist
as above.

* add/update Aspell documentation (__Tomi__)
** Some documentation has been written, but there still is work to be done.
* as always: document what you're doing:-) (__all__)

!!!4. Corpus gathering

Governmental documents (earlier in pdf, now in html)

Tasks:
* move existing gov. documents (pdf) from gt/ to our corpus repository (Børre)
** There are appr. 10 non-broken pdf documents in gt/sme/corp/original/ 
   (the ones named stmelXXX.pdf contain only one page each)
* Collect public (pdf and html) files (Børre)
**  Done some test downloading, will have to look at tools to do this 
    automatically.

!!Contracts

Tasks:
* Follow-up on the lawyers' comments (__Trond__ has started with the university)
** __Trond__ and __Sjur__ finished the next revision of the contracts, and are
   waiting for comments from __Kimmo Koskenniemi__
* add a background document explaining the model (__Sjur__)

The most problematic issue:

Who has the copyright of extracted material, like single words, collections of
words, syntactic structure (potentially with some words filled in)? We need
this to be controlled by us, not by the authors. The exact borderline is hard
to define.

!!North Sámi New Testament

* If we don't hear anything from Bibelselskapet, we will have to use the version
  we already have.
** Still nothing. Trond will inform them that we will use what we have.

!!Lule Sámi New Testament

Svenska Bibelsällskapet is putting the finishing touches to the Lule Sámi
translation; we will have it soon.

!!Lule Sámi Dictionary

__Sjur__ will check whether __Berit Karen__ has contacted __Anders Kintel__. —
She has now sent the invitation.

!!!5. Corpus infrastructure

!!Naming conventions and directory structure

New suggestions were made last Friday, with a proposal from __Børre__ and __Tomi__.
We decided to put the original files in this structure:
{{{
orig/yyyy-mm/filename.doc
            /filename.doc.xsl
            /filename.doc.xml
            /samefilename.doc => samefilename.doc
            /samefilename.doc => samefilename-1.doc
            /This\ is\ a\ very\ cumbersome\ and\ long\ filename.doc =>
            /This_is_a_very_cumbersome_and_long_filename.doc
}}}
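
The renaming shown above (SPACE becomes underscore, a second file with the same name gets a -1 suffix) could be sketched roughly as follows. This is only an illustration: the function name store_name and the idea of keeping a set of names already in use are assumptions, not part of our scripts.

```python
import os


def store_name(original, taken):
    """Return the name to store a file under, and register it in `taken`.

    Hypothetical helper: spaces become underscores, and a name that is
    already in use gets -1, -2, ... inserted before the extension.
    """
    name = original.replace(" ", "_")
    stem, ext = os.path.splitext(name)
    candidate = name
    counter = 0
    while candidate in taken:
        counter += 1
        candidate = f"{stem}-{counter}{ext}"
    taken.add(candidate)
    return candidate


if __name__ == "__main__":
    used = set()
    for incoming in ["samefilename.doc", "samefilename.doc",
                     "This is a very cumbersome and long filename.doc"]:
        print(incoming, "=>", store_name(incoming, used))
```

Keeping the counter before the extension means the suffix of the file still tells the conversion which input format it is dealing with.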

Reasoning:
* What do we have to do manually, and what can be done automatically?
* If we name the docs manually, we need to document the original file name
  as well as decide upon naming conventions
**  The original file name is recorded in the xsl file, and can be found by
    searching for the title there.
* Principle: Everything done manually goes into the xsl file
* Principle: the gt directory is fully generated
* Principle: Use original file names in orig/, but replace SPACE with underscore
* Principle for naming .xml files:
** use orig file name if possible
** use the title when the orig filename is
   undescriptive, or identical to that of an existing file
** if none of the above leads to a unique filename, find a short and
   self-explanatory unique file name

If the input document is filename.(doc|pdf|html|txt|whatever), it has a title.
The output document is then title.xml.
sd-2001-1.txt

* What we want to know: when the doc arrived, parallel language docs, plus
  the usual (author, genre, translator, etc., which are already implemented)
**  Could be implemented as empty fields on the first conversion. The data
    mentioned above could be entered into the web form (which puts it into
    the xsl file), or we could add it manually to the .xsl file (but this
    is error-prone).
    
After a long discussion, we decided on the following:
{{{
orig/sme/news/thelongandstupidnameswegetasinputwithunderscore_for_space.doc
             /thelongandstupidnameswegetasinputwithunderscore_for_space.xsl
     sma
     smj
     nob
     fin
     swe
        /news/title2.xml
        /laws/title.xml
        /fict/title.xml ! oops same name as cousin in laws/
        /fact
        /bibl
        /admi
  gt/sme/news/thenewshortandsmartnameweinventedifneeded.xml
              (cf. the naming principles above for smartness directions)
     sma
     smj
     nob
     fin
     swe
        /news/title2.xml
        /laws/title.xml
        /fict/title.xml ! oops same name as cousin in laws/
        /fact
        /bibl
        /admi
parallel.xml
}}}
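
As a rough sketch of how a script could build these paths, assuming the six language codes and six genre directories listed above are the complete sets. The function name corpus_paths and the returned keys are made up for the illustration.

```python
from pathlib import PurePosixPath

# Language and genre directories as listed in the structure above.
LANGUAGES = {"sme", "sma", "smj", "nob", "fin", "swe"}
GENRES = {"news", "laws", "fict", "fact", "bibl", "admi"}


def corpus_paths(lang, genre, orig_filename, xml_name):
    """Return the three paths belonging to one corpus document.

    Hypothetical helper: `orig_filename` is the stored original
    (spaces already replaced), `xml_name` is the name chosen for the
    generated file, without the .xml suffix.
    """
    if lang not in LANGUAGES:
        raise ValueError(f"unknown language directory: {lang}")
    if genre not in GENRES:
        raise ValueError(f"unknown genre directory: {genre}")
    orig_dir = PurePosixPath("orig") / lang / genre
    stem = PurePosixPath(orig_filename).stem
    return {
        "original": orig_dir / orig_filename,
        "metadata": orig_dir / f"{stem}.xsl",
        "converted": PurePosixPath("gt") / lang / genre / f"{xml_name}.xml",
    }


if __name__ == "__main__":
    for role, path in corpus_paths("sme", "news", "some_input.doc", "title2").items():
        print(role, path)
```
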
What parallel.xml could look like:
{{{
<paradocs>
    <entry id="1">
        <file lang="sme" orig="yes">sme-file.xml</file>
        <file lang="nob">nob-file.xml</file>
    </entry>
    ...
    <entry id="1234">
        <file lang="sme" orig="yes">sme-OTHERfile.xml</file>
        <file lang="nob">nob-OTHERfile.xml</file>
    </entry>
    </entry>
</paradocs>
}}}
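
A minimal sketch of how such a file could be read, assuming the element and attribute names from the example (paradocs, entry, file, lang). Attribute values are quoted in the sample string, since an XML parser requires that. The function name parallel_index is hypothetical.

```python
import xml.etree.ElementTree as ET

# Small sample in the shape sketched above, with quoted attribute values.
SAMPLE = """<paradocs>
    <entry id="1">
        <file lang="sme" orig="yes">sme-file.xml</file>
        <file lang="nob">nob-file.xml</file>
    </entry>
</paradocs>"""


def parallel_index(xml_text):
    """Map each file name to its parallel files, keyed by language code."""
    index = {}
    root = ET.fromstring(xml_text)
    for entry in root.findall("entry"):
        files = [(f.get("lang"), (f.text or "").strip())
                 for f in entry.findall("file")]
        for _lang, name in files:
            # Every other file in the same entry is a parallel document.
            index[name] = {l: n for l, n in files if n != name}
    return index


if __name__ == "__main__":
    print(parallel_index(SAMPLE))
```

With an index like this, a tool can go from any converted file to its translations without knowing which one is the original.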

This decision is final!

Further discussion is directed to the news group.

The old task list is repeated for convenience:
# Make a system for file and directory permissions (today we all belong to the
  cvs group), so that only people with root user privileges have write access to
  the corpus repository, at least for the original files
# Include the xsl files under version control (cvs? rcs?)
# Incorporate language detection as part of the corpus processing.
# the dir structure is:
## one dir for orig, containing also the meta-info and interm. files
## another dir for our ready-to-use xml files after conversion
# dir structure for web-posted corpus files:
## subdivision according to week or month, we start out with month till we see
   the amount of traffic (yyyy-mm)
### Done
# we need a way to deal with hyphenated documents in catxml/preprocess:
## in normal cases hyphenation points should be removed
## when testing the robustness of our parsers, as well as when testing the
   hyphenator, the hyphenation points should be retained
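
The two modes could be handled with a simple switch, sketched below. The sketch assumes that hyphenation points are stored as the soft hyphen character (U+00AD); if our documents mark them differently, the marker has to be changed. The function name and its parameters are made up for the illustration and are not existing catxml or preprocess options.

```python
SOFT_HYPHEN = "\u00ad"  # assumption: hyphenation points are soft hyphens


def prepare_text(text, keep_hyphenation=False, marker=SOFT_HYPHEN):
    """Remove hyphenation points unless the caller asks to keep them.

    Normal case: the points are removed. For parser robustness tests
    and hyphenator tests, pass keep_hyphenation=True.
    """
    if keep_hyphenation:
        return text
    return text.replace(marker, "")


if __name__ == "__main__":
    word = "sáme\u00adgiella"
    print(len(prepare_text(word)), len(prepare_text(word, keep_hyphenation=True)))
```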

!!Corpus conversion

All conversions (doc, pdf, html) are now integrated into one script.

!Encoding conversion

The character decoding is documented in the module itself; to read it, run
perldoc gt/script/samiChar/Decode.pm

There is now one script for converting all the different input formats. The
xsl file is not yet taken properly into account.

{{{
gt/script/convert2xml.pl

--dir=dir_name  # The directory that is searched for input files
--use-decode    # Use the character decoding (for testing)
--xsl=file_name # The name of the xsl file. I am going to change this.
}}}

Tasks:
* testing
* add move to target directory

See the [documentation|/doc/ling/corpus_conversion.html].

!Pdf to XML

__Saara__ has made a new conversion module; it is almost finished.

Task: __Saara__ to prepare for this presentation, and to make documentation.

!(X)HTML to XML

This is implemented by Tomi, under gt/script/xhtml2corpus.xsl. Usage:
{{{
tidy --quote-nbsp no --add-xml-decl yes --enclose-block-text yes -asxml -utf8 \
    -language sme file.html |
    xsltproc "$HOME/gt/script/xhtml2corpus.xsl" - > file.xml
}}}

!Documentation
The documentation for corpus conversion should be added to
the [gt/doc/ling/corpus_conversion.xml|/doc/ling/corpus_conversion.html] document.

!!!6. Linguistics

!!Name lexicon

Summary: see the [newsgroup|news:di5mbi$26ad$1@news.uit.no]

Motivation:

* __Divvun:__ We want to cross-link different versions of the same locations
 in different languages
* __Common:__ We do not want to enter the same names twice. We want a
 language-independent name lexicon
* __Disamb:__ Having a richer tag set makes it easier to disambiguate
* __Future:__ Richer analysis makes new applications possible, within
 information retrieval, grammar checking, machine translation etc.

Needed: A plan for this project:

# do the main markup in the present propernoun file
# make a script for converting it to xml (to be done one time)
# make a script for xml2lexc (to be done by the makefile)
##  There is a sample file for the xml file format in gt/common/src/proper-nouns.xml
##  There is a working xml2lexc for Komi, written by Saara
# make the tags etc. in the parser

Conversion:

# This week
# (end of this week and) Next week:
## Then add the +Plc, +Mal, etc. tags in the parser
## Mark up as much as possible within a week or so (__Maaren__ to do the Sámi
   names, and to split CNAME into BERN and LONDON, __Ilona__ to look at C-FI-NEN
   and other Finnish names, __Trond__ and __Børre__ to look at the rest)
## Still to be done:

{{{
7985 DEATNU
3836 LONDON
1939 BERN
1388 C-FI-NEN
 692 ACCRA
 471 NYSTØ
 134 MARJA
 118 DUORTNUS
  59 NIILLAS
  45 ALEUHTAT
  43 ANAR
  29 SULLOT
  20 GIEDDI
  17 HEANDARAT
   8 GUOLBBA
   4 VARGGAT
   4 GEAVNNIS
   4 EATNAMAT
   1 ROMSA
}}}

# list continued:
## Then mark up the rest with correct semantic tags
## This means we would need a seventh option, the unspecified name.
## Then split propernoun-sme-lex.txt into two: one file with the Sámi names,
   generated by the xml2lexc script, and one manually written file containing
   the name sublexica (called propernoun-sme-morph.txt or whatever)
## Look into efficient editing of the XML lexicon
## Then convert to xml
## Look into efficient editing of the XML lexicon again
## Look into synchronisation issues with risten.no - we want the names there
   as well

Updated status quo:
* Converted:  19400
* Still left: 15000 (8000 of which are pretty straightforward, the DEATNU case)
* Time used:     20 h
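
As a back-of-the-envelope check of these figures, the sketch below computes the share done and what the remaining work would take if the rate so far simply continued. That constant rate is an assumption made for the illustration only; the 8000 straightforward DEATNU names will probably go faster and the rest slower.

```python
def conversion_estimate(converted, remaining, hours_used):
    """Extrapolate remaining hours from the conversion rate so far."""
    total = converted + remaining
    rate = converted / hours_used          # names per hour so far
    return {
        "total": total,
        "share_done": converted / total,
        "names_per_hour": rate,
        "hours_left": remaining / rate,    # assumes the rate stays the same
    }


if __name__ == "__main__":
    # Figures from the status list above.
    for key, value in conversion_estimate(19400, 15000, 20).items():
        print(key, round(value, 2))
```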

!! Twol SETS definition issue

The definition of G1, G2, G3 in Lule Sámi is still open, and we would like to
have input on this issue. We need a G3 definition for North Sámi also.

Update: it is still not working, see [bug
193|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=193]

SUGGESTION (__Trond__): __Thomas__, __Trond__ and __Sjur__ didn't meet last week
and should have a new meeting this Tuesday instead (tomorrow).

!!North Sámi

* three-part compounds issue still open
* number project still open
* The treatment of Sámi place names: we need a contract with "Norge digitalt",
  via UFD.
** __Sjur__ has written an e-mail to the UFD contact person,
   __Øystein Johannessen,__ who will look into it soon. He has not responded
   beyond saying he will return to it. __Sjur__ brought this up in the board
   meeting, and __Bjørn Olav Megard__ will remind __Øystein Johannessen__
   about this issue. __Sjur__ will follow up on this one.
* normativity issues:
** the Giellalávdegoddi meeting was last Friday, and they will have a new meeting
   in December. They were not able to make any decisions, and there will be a new
   Giellalávdegoddi beginning next year, which won't make decisions until late
   spring. This is a serious problem for the Divvun project.
*** Actions: __Sjur__ will bring this to the Divvun board, write a new letter to
    the Giellalávdegoddi, emphasizing the needs and timetables of the project
** The [document with the list of open
   issues|/doc/lang/sme/normativity-issues.html]
   needs updating, both regarding the status of each issue, and documentation of
   them. Also a better classification of the issues would be nice.

!!Lule Sámi

__Sjur__, __Thomas__ and __Trond__ will continue working on the Lule Sámi issues.

!!Numerals

* The issue is postponed to next week.

# An empirical overview
## Numeral generation
## Numeral inflection
## Numerals as parts of compounds
# A clear concept of how we want to treat them
## Tagging
# A treatment

We will return to this issue after the name conversion.

!!!7. Speller infrastructure

Nothing this week either.

!!!8. Other

!!Technical issues

* The Mac OS X / Perl bug (at least __Trond__ and __Sjur__ have it):
** utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
   82. This message did not show up in 10.3 (perl 5.8.1), but does so in 10.4
   (perl 5.8.6). It is probably a Perl/OS mismatch. (__Trond__, __Thor Øivind__,
   __Tomi__)
*** Another example of the same bug: "\x{00c3}" does not map to utf8 at
    ../script/preprocess line 113, <> chunk 33.
*** One way to "resolve" this is to redirect the error messages to /dev/null:
{{{
... | preprocess 2> /dev/null | lookup ...
}}}
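
Instead of hiding the messages, it may help to find the input lines that trigger them. The sketch below assumes that the first message is caused by bytes that are not valid UTF-8 (a lone 0xC4 is, for instance, the Latin-1 encoding of Ä); this is a guess about the cause, not a diagnosis. The function name is made up.

```python
def invalid_utf8_lines(data):
    """Return (line_number, raw_line) pairs for lines that are not valid UTF-8.

    `data` is the raw byte content of a file; line numbers start at 1.
    """
    bad = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad


if __name__ == "__main__":
    import sys
    # Usage: python check_utf8.py file.txt
    with open(sys.argv[1], "rb") as handle:
        for number, line in invalid_utf8_lines(handle.read()):
            print(number, line)
```

If lines are reported, the file probably needs encoding conversion before it is sent through preprocess.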

!!Video conferencing across firewalls

The problem we've had with the SD firewall persists, and there don't seem to
be any resources available to help us. __Geir Kaaby__ instead suggested that we
try out the [Marratech|http://www.marratech.com/] package. So please
download the MacOS X client (or get it from me), and I'll send you the URL to
the meeting room as soon as I get it.

!!Bug fixing

__17__ open bugs (and 24 risten.no bugs)

{{{
Bugzilla:
 37	nor	P2	Mac	thor.oivind.johansen@hum.ui...	ASSI	Bugzilla is not able to handle the Sámi characters.
197	nor	P2	Mac	boerre@skolelinux.no		NEW	Links to Bugzilla must be checked and corrected for new s...

UTF-8:
 61	nor	P2	Mac	boerre@skolelinux.no		ASSI	mpage barfs on utf-8 input
196	nor	P2	All	boerre.gaup@samediggi.no	NEW	UTF-8 encoded html gets garbled

Corpus:
160	nor	P2	Mac	tomi.pieski@hum.uit.no		NEW	Hyphen not recognised in Genesis
187	nor	P2	All	tomi.pieski@hum.uit.no		ASSI	catxml is undocumented
188	nor	P2	All	tomi.pieski@hum.uit.no		ASSI	catxml crashes if XML/Twig.pm is not installed
198	nor	P2	Mac	tomi.pieski@hum.uit.no		NEW	xsl script for Bible files does not single out chapter he...

Hard to solve:
 77	nor	P2	Mac	trond.trosterud@hum.uit.no	ASSI	consonantchange in the end of verbstem

háliidit: d > t in final position. -ijd is spelled -iid and should be spelled -iit.
We should have had ''in háliit'' but have ''in háliid''

Present situation:    
háliit  háliit  +?                      #wrong
háliid  háliidit+V+TV+Ind+Prs+ConNeg    #wrong
maid    maid+Interj                     #ok, but not if háliit is corrected
maid    maid+Adv                        #ok, but not if háliit is corrected
guliid  guolli+N+Pl+Gen                 #ok, but not if háliit is corrected
maid    mii+Pron+Interr+Pl+Acc          #ok, but not if háliit is corrected

G3 definition issue:
 50	nor	P2	Mac	Maren.Palismaa@Samediggi.no	NEW	LEXICON-GEARGGUS and others
 56	nor	P2	Mac	trond.trosterud@hum.uit.no	ASSI	-headdjiid and -heddjiid
186	nor	P2	Mac	trond.trosterud@hum.uit.no	ASSI	No dipht. simpl in actor nouns before uj
193	nor	P2	Mac	trond.trosterud@hum.uit.no	NEW	oa->å dipht. simpl. in actor nouns

Numeral project:
  6	nor	P2	All	tomi.pieski@hum.uit.no		NEW	Num tag is needed in compounds, but stripped in lookup2cg
158	nor	P2	Mac	trond.trosterud@hum.uit.no	ASSI	Num+Sg+Gen+logi
169	nor	P2	Mac	trond.trosterud@hum.uit.no	NEW	golbmalohkása
176	nor	P2	Mac	trond.trosterud@hum.uit.no	NEW	beal+Ord
}}}

!Bugzilla update
When Bugzilla is being moved, it should also be updated to the newest version,
and the UTF-8 bug should be resolved.

!!Buying

* rucksacks for the whole Divvun team

!!risten.no

* Organisation: could __Tomi__ be used, in exchange for more linguistic work by
  (old) GIO members? Yes, it is ok, but how much still needs to be evaluated
* it is ok to integrate "kvensk" placenames with risten.no
** this should be integrated with the general proper name work - we want all
   proper names integrated with risten.no, cf. above
** needs further development of risten.no to allow for multiple XML bases to
   be presented and maintained in parallel. This is to be further worked on by
   __Tomi__ and __Sjur__

!!Project planning and development processes

Trond is using his project as a test case for an IT guy, __Geir Tore Voktor__,
who is taking a course in project management. Be prepared to answer questions.

!!!9. Summary, task list

!! Børre
* Contact oahpahusossodat and the rest of the SD about texts
**  Get help from the Tromsø department of Sámediggi to dig in WebSak
* Gather public texts
* Reorganise the directory structure
** Put all corpus texts into one place
** Continue converting text from input format to our xml
* Ask __Thor-Øivind__ to move bugzilla to our new webserver.

!! Maaren
* The missing list, both the overall missing list from our xml corpus, and a
  file-for-file review, in order to get different terminology.
* continue working with the missing list from risten.no
** working with the missing list from risten.no this week (today) 
* Start working on Sámi place names
* Start working on normativity issues (numeral issues with __Trond__?)

!! Saara
* Look at the corpus infrastructure issue
* Look at the corpus interface issue with Lars
* Convert texts from .doc to .xml, to get a grasp of our corpus format
* make an emacs mode for the name project (cf. specs in the memo above)
* prepare for a presentation of the pdf etc. conversion together with __Tomi__
  for the next meeting.

!! Sjur
* Lule Sámi twol problems, look again at the sets definition with __Thomas__ and
  __Trond__
* risten.no bugs and fixes
* discuss risten.no work with __Tomi__
* follow up on voice group-chat not working to Sámediggi
** Test Marratech
* project planning with __Trond__, continued
** also look at the development processes - specification and  testing
* Follow up on place names from Norge Digitalt -> remind __Bjørn Olav Megard__
* Evaluate SFST as speller (and analyzer) lexicon
** more thorough analysis than was possible in Guovdageaidnu
* write a background document on the corpus contracts
* Discuss the contract issue with Trond, return the new version to the lawyer
** Call __Kimmo Koskenniemi__ for comments
* write to the Giellalávdegoddi once more, emphasizing timetable and response
  needs in the Divvun project
* discuss kvensk project support with __Trond__
* write public tender documents

!! Thomas
* work on Lule Sámi compounding and derivation
* Look at Linguistic bugs with __Trond__
* Meet with Sjur and Trond about the definition of G1, G2, G3

!! Tomi
* Aspell: Continue working on the affix file & aspell
** Contact aspell author (UTF-8 thing)
* three-part compounding
* corpus infrastructure: dtd location (both public and internal)
* corpus infrastructure: file and dir organisation
* Document aspell and corpus infrastructure
* Cgi-script for uploading documents to corpus base
* Specification for new catxml in C++
** this includes also placing the source and binary
*** clean the script/ directory with __Trond__
* Common makefile issues
* discuss risten.no work with __Sjur__

!! Trond
* Work on the bug list (7 open).
* project planning with __Sjur__, continued
** also look at the development processes - specification and  testing
* Work on the name project:
** Introduce the +Mal, +Fem, ... tags to the parser
  and discuss the work with __Maaren__ and __Børre__.
* clean the script/ dir
* discuss kvensk project support with __Sjur__

!!!10. Next meeting, closing

31.10.2005 10:00

Closed at 12:36