!!!Meeting setup
* Date: 20.11.2006
* Time: 09.30 Norw. time
* Place: Where we are
* Tools: SubEthaEdit, iChat
!!!Agenda
# Opening, agenda review
# Reviewing the task list from two weeks ago
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Infrastructure
# Linguistics
# Name lexicon infrastructure
# Spellers
# Other issues
# Summary, task lists
# Closing
!!!1. Opening, agenda review, participants
Opened at 09:45.
Present: __Sjur, Thomas, Tomi, Trond__
Absent: __Børre, Maaren, Saara__
Agenda accepted as is.
!!!2. Updated task status since last meeting
!! Børre
* contact writers who already have received contracts
* send file rename script to NSI; get their corpus
** Sent script, made it work on NSI.
* consider a script for automatic testing of the spell checker in Word
** Fetched documentation for AppleScript and Microsoft Word
* consider more testing routines
* update Maaren's Forrest installation to r430284
* {{sma}} discussions with SD (with __Sjur__, __Trond__)
* add a simple password protection to risten.no in the G5
* consider infra for testing feedback
* get an Intel Mac for testing Windows spellers; get a WinXP license from SD
* check corpus contract issue
* port all i18n work to the main branch (from the i18n branch)
** Done on a local machine; will wait with committing until next week,
when I'm back.
* update all forrest installations, including local patches
** Have uploaded a tarball available at
[http://divvun.no/static_files/forrest.som.funker-2006.11.07.tar.gz]
* revitalize the Aspell work
** Sent an email to the Skolelinux Sámi mailing list, with short instructions on
how to build an aspell package from our fullword list
** Have wrestled with our fullword list, to find out how to build an aspell
package as efficiently as possible. Conclusion: we will have to work on
the affix list.
* check corpus contract issue in a meeting Wednesday 9.30
** Done together with __Sjur__ and __Trond__, sent an answer to __Harald Gaski__
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
!! Maaren
* investigate the generated word form list sent to Polderland - use the command
{{make wordlist TARGET=sme}} in ''victorio''
** done some
* investigate unrecognised word forms in the hyphenator
** done some
!! Saara
* finalize server of the Xerox tools.
** waiting for paradigm grammar?
* help Trond with some shell commands
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** done some
* other things:
** continue Bible work: sme and smj NTs, nno and nob available OTs,
swe NT and available OT, and fin NT.
Bible documentation is almost ready as well.
!! Sjur
* name lexicon:
** refactor SD-terms editor code
** implement missing propnouns editing functions
** implement improvements decided upon in Tromsø
*** tried to finalise the SD-terms code refactoring - found problems in eXist
re file URL for imported modules
*** problems with the old risten.no installation (the present site) - fixed some
but there are still problems with editing; instead of trying to patch the
old code -> will try to make the present codebase stable enough to be taken
into use, and upload that one
* hire linguist and programmer
* make new list of unrecognised forms in the hyphenator
** tried, but found new problems with the Xerox tools
* investigate unrecognised word forms in the hyphenator
* decide how to specify compounding behaviour info in the lexicon
* {{sma}} discussions with SD (with __Børre__, __Trond__)
* check why some SUB-marked entries got included in the normative transducer
* consider a script for automatic testing of the spell checker in Word
* consider more testing routines
* consider infra for testing feedback
* get an Intel Mac for testing Windows spellers; get a WinXP license from SD
* ask Julie Eira about SD employee seminar
** done - we should attend
* check corpus contract issue
** done
* meeting to discuss names and multilinguality Tuesday 14.11 12.30
** done
* redo the derived words work for {{smj}}
* revitalize the Aspell work
* check corpus contract issue in a meeting Wednesday 9.30
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
* other things:
** bug hunting in the new Xerox tools
!! Thomas
* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt
** not done
* redo the derived words work for {{smj}}
** not done
* investigate unrecognised word forms in hyphenator
** not done
* decide how to specify compounding behaviour info in the lexicon
** not done
* check why some SUB-marked entries got included in the normative transducer
** not done
* create / check the paradigm grammar
** not done
* investigate the generated word form list sent to Polderland - use the command
{{make wordlist TARGET=sme}} in ''victorio''
** done
* meeting to discuss names and multilinguality Tuesday 14.11 12.30
** done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** not done
!! Tomi
* make first version of the PLX data generation in lexc2xspell
** first version is out
* add hyphenation points to the generated output
** they are being added
* revitalize the Aspell work
** helped Børre on this one
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** not done
!! Trond
* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt
* get more {{sma}} texts, first the Bible / NT
** Finally got hold of {{sma}} Bible texts, they are forthcoming.
* add corpus user accounts and access issues to Bugzilla
** Done.
* fix the corpus tag list in the {{cwb/}} directory
** Still awaiting disamb clarification
* investigate unrecognised word forms in the hyphenator
** Done some work with Sjur
* decide how to specify compounding behaviour info in the lexicon
** Don't remember this one
* {{sma}} discussions with SD (with __Børre__, __Sjur__)
** Now forthcoming
* check corpus contract issue
** Done
* meeting to discuss names and multilinguality Tuesday 14.11 12.30
** Meeting held, will continue
* redo the derived words work for {{smj}}
** Did we do this? No, not yet
* check corpus contract issue in a meeting Wednesday 9.30
** Done.
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
!!!3. Documentation
TODO:
* update all forrest installations, including local patches (__Børre__)
* port all i18n work to the main branch (from the i18n branch) (__Børre__)
!!!4. Corpus gathering
__Trond__ talked with __Bierna Bientie__ and got permission to use the texts, so
far as bound texts. We shall contact SD to get the texts they have there; he
will send whatever they don't have.
The texts at SD are available via __Sig-Britt Persson__; __Sjur__ will ask her
about them.
__TODO:__
* get {{sma}} Bible / NT texts (__Trond__)
** partly done
* Discussions with the Sámi Parliament about {{sma}} (__Børre, Sjur, Trond__)
* send the filename renaming script to NSI; get their corpus (__Børre__)
* ask SD/Sig-Britt Persson about some of the South Sámi bible texts (__Sjur__)
!!!5. Corpus infrastructure
Nothing new on any of the sub-issues.
!!User accounts and access
TODO:
* add the issue with subissues to Bugzilla (__Trond__)
** done
!!More texts to the graphical corpus interface:
No new texts available in the web interface yet. __Saara__ and __Trond__ have
got access to the Oslo computer, and could in principle add texts themselves.
Some instructions will be necessary, though.
TODO:
* add text to the server (__Lars__)
* Discuss with Lars (__Trond__)
!!Aligner
__Øystein Reigem__ has commented on __Børre's__ work, and will perhaps look at it.
__TODO:__
* gather more parallel texts (__Trond__)
* try out NT alignment strategies (__Saara__)
!!Language recognition
TODO:
* get more {{sma}} texts, first the Bible / NT (__Trond__)
** done, more forthcoming
!!!6. Infrastructure
!!Xerox tools wrapped as servers
Tomi has modified the server a bit, causing it to __NOT__ work with the perl
client at the moment. It should not be a big deal to fix it (three lines?).
__Saara__ will fix it.
The paradigm grammar needs to be written for Saara to be able to finish the
server. The format should follow the Noun sample below:
{{{
N+Number+Case+Possessive?+Clitic?
A
V
Adv
}}}
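One possible reading of the Noun line above, as a minimal Python sketch: each {{+}}-separated slot names a tag class, and a trailing {{?}} marks the slot as optional. The function name and the tag values in {{TAG_VALUES}} are assumptions chosen for illustration, not the project's agreed tag inventory or the server's actual input format.

```python
from itertools import product

# Hypothetical tag inventory, for illustration only; the real values
# would have to come from the lexicon sources.
TAG_VALUES = {
    "Number": ["Sg", "Pl"],
    "Case": ["Nom", "Gen", "Acc"],
    "Possessive": ["PxSg1", "PxSg2"],
    "Clitic": ["Foc/ge", "Qst"],
}


def expand_paradigm(line):
    """Expand e.g. 'N+Number+Case+Possessive?+Clitic?' into tag strings."""
    pos, *slots = line.strip().split("+")
    choices = []
    for slot in slots:
        optional = slot.endswith("?")
        values = list(TAG_VALUES[slot.rstrip("?")])
        if optional:
            values = [None] + values  # None means the slot is left out
        choices.append(values)
    result = []
    for combo in product(*choices):
        tags = [pos] + [tag for tag in combo if tag is not None]
        result.append("+".join(tags))
    return result


forms = expand_paradigm("N+Number+Case+Possessive?+Clitic?")
print(len(forms))   # number of tag combinations for the sample inventory
print(forms[0])     # first combination, with both optional slots left out
```

A bare line such as {{A}} or {{V}} expands to itself, which matches the unfinished state of the sample above.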
Can we do without clitics? We would like to just list them in a PLX format that
allows correct behaviour:
{{{
Is
InC lex
MeC lex
FiC clit
}}}
That is, we want {{mánás + go}} to be accepted, without letting {{mánás}} make
compounds freely. We don't know how to specify this in the PLX format. __Sjur__
will discuss clitics with Polderland, to solve this issue.
__TODO:__
* finalise the paradigm generator (requires paradigm grammar) (__Saara__)
* fix the corpus tag list in the {{cwb/}} directory (__Trond__)
* create / check the paradigm grammar as exemplified above (__Thomas__)
* update the generating transducer to only be normative, and without the
subclass tags (__Sjur__)
* discuss clitics in PLX format with Polderland (__Sjur__)
* update the corpus/cwb tags (__Trond__)
!!Hyphenator
TODO:
* make new list of unrecognised forms (__Sjur__)
** tried but failed because of problems with the Xerox tools
* investigate unrecognised word forms (__Maaren, Thomas, Trond, Sjur__)
!!!7. Linguistics
!!Names and multilinguality
{{{
In the terms-sme.xml file:
-- just an example! --
}}}
TODO:
* separate meeting for discussing this issue, Tuesday 14.11 after
lunch (12.30) (__Trond, Thomas, Sjur__)
** done; new tasks added below
TODO:
# finish first version of the editing (__Sjur, Tomi__)
# add @type=secondary and @excl=speller,hyph to all names marked with !SUB
(__Saara__)
# test editing of the xml files. If ok, then: (__Sjur, Thomas, Trond__)
# make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
well) (the morphological section should be kept intact, in e.g.
propernoun-sme-morph.txt) (__Sjur, Saara__)
# convert propernoun-($lang)-lex.txt to a derived file from common xml files
(__Sjur, Tomi, Saara__)
# start to use the xml file as source file
# clean terms-sme.xml such that all names have the correct tag for their use
(e.g. @type=secondary) (__Thomas, Maaren, linguists__)
# merge placenames which are erroneously in different entries: e.g. Helsinki,
Helsingfors, Helsset (__linguists__)
# publish the name lexicon on risten.no (__Sjur__)
# add missing parallel names for placenames (__linguists__)
# add informative links between first names like Niillas and Nils
(__linguists__)
!Cohort policy
A further issue is whether we should reconsider our cohort policy. Today, Sur
and Plc are __different__ readings. An alternative would be to have them as
secondary tags, not in conflict with each other:
{{{
""
"Trosterud" N Prop Sur Sg Nom <<< @HNOUN
"Trosterud" N Prop Plc Sg Nom <<< @HNOUN
""
"Trosterud" N Prop Sg Nom <<< @HNOUN
""
"Trosterud" N Prop Sg Nom &Sur &Plc <<< @HNOUN
}}}
This issue is dependent on CG3, and is delayed until we know more about the
possibilities with it.
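The alternative can be sketched as a plain string transformation in Python. This is only an illustration of the intended result: whether CG3 can treat {{&Sur}} and {{&Plc}} as non-conflicting secondary tags is exactly the open question, and the function name and the {{SECONDARY}} set are assumptions.

```python
SECONDARY = {"Sur", "Plc"}  # tags treated as secondary in this sketch


def merge_readings(readings):
    """Collapse readings that differ only in SECONDARY tags into one reading."""
    merged = {}  # core reading (tuple of tokens) -> secondary tags seen, in order
    for reading in readings:
        tokens = reading.split()
        core = tuple(t for t in tokens if t not in SECONDARY)
        seen = merged.setdefault(core, [])
        for tag in tokens:
            if tag in SECONDARY and tag not in seen:
                seen.append(tag)
    result = []
    for core, extra in merged.items():
        # Put the secondary tags before the '<<<' / syntactic tag part, if any.
        cut = next(
            (i for i, t in enumerate(core) if t == "<<<" or t.startswith("@")),
            len(core),
        )
        tokens = list(core[:cut]) + ["&" + t for t in extra] + list(core[cut:])
        result.append(" ".join(tokens))
    return result


cohort = [
    '"Trosterud" N Prop Sur Sg Nom <<< @HNOUN',
    '"Trosterud" N Prop Plc Sg Nom <<< @HNOUN',
]
print(merge_readings(cohort))
```

Readings that differ in anything other than the secondary tags stay separate, so ordinary ambiguity is not affected.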
!!Derivation and spellers like Aspell
TODO:
* redo the work for {{smj}}, including discussion regarding ''Actio''
(__Thomas, Sjur, Trond__)
!!North Sámi
The following words are included in the normative list despite being marked
with !SUB:
{{{
accompagnerejun
ábuhuvvože
ábuhuvvože ábuhit+V+TV+Pass+Pot+Prs+Du1
áccohallagođežedne
}}}
TODO:
* investigate the generated word form list sent to Polderland - use the command
{{make wordlist TARGET=sme}} in ''victorio'' (__Maaren__, __Thomas__)
* check why some SUB-marked entries got included in the normative transducer
(__Thomas, Sjur__)
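A first check could be scripted along the following lines. This is a rough Python sketch under two assumptions: that sub-standard entries carry a {{!SUB}} comment on the same line as the lexc entry, and that the generated list has the word form as the first field on each line. It only catches entries whose lemma appears verbatim in the list; inflected forms would still need the transducer. The sample entry marked as made up is a placeholder, not real lexicon data.

```python
def sub_marked_lemmas(lexc_lines):
    """Return lemmas of lexc entries whose line carries a '!SUB' comment."""
    lemmas = set()
    for line in lexc_lines:
        entry, sep, comment = line.partition("!")
        if not sep or not comment.strip().startswith("SUB"):
            continue
        fields = entry.split()
        if not fields:
            continue  # a fully commented-out line, nothing to collect
        # 'lemma:stem CONTLEX ;' -> keep the part before the colon
        lemmas.add(fields[0].split(":")[0])
    return lemmas


def leaked_forms(wordlist_lines, lemmas):
    """Word-list entries whose first field is a SUB-marked lemma."""
    leaked = []
    for line in wordlist_lines:
        fields = line.split()
        if fields and fields[0] in lemmas:
            leaked.append(fields[0])
    return leaked


lexc = [
    "njuvdin:njuvdim BOAHTIN ;",
    "exampleword:examplewor PLACEHOLDER ; !SUB",  # made-up entry
]
wordlist = ["exampleword", "njuvdin"]
print(leaked_forms(wordlist, sub_marked_lemmas(lexc)))
```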
!!Lule Sámi
TODO:
* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt
(__Thomas, Trond__)
* hire new linguist (__Sjur__)
!!!8. Name lexicon infrastructure
Decided in Tromsø:
* add logging facilities to the interface
* add option to download local copies of the lexicon files directly from the db
* batch editing (change all entries in the found set), should later be enhanced
to allow selection of exceptions (the found set minus deselected items)
* tag for excluding/including a name from certain applications
* future expansion: choose what info to display in the single language browser
* display existing language entries when adding a new language to a record
* add editor to change single, existing entries
Details can be found in [the meeting
memo.|/doc/admin/physical_meetings/tromso-2006-08-propnoun.html]
TODO:
* develop the needed XQueries and UI (__Sjur, Tomi__)
* add a simple password protection to risten.no in the G5 (__Børre__)
Postponed:
* data synchronisation between risten.no and the cvs repo
* new version of xml2lexc (based on ccat), should handle complex names correctly:
construct entries like we have now from the different parts of a complex name
entry
!!!9. Spellers
!!Polderland data generation
__Tomi__ has made the first version of the PLX data generation. It ran for 6.5
hours before it crashed, having generated all entries up to line {{13639}} in the
{{noun-sme-lex.txt}} file (''njuvdin''):
{{{
13639 njuvdin:njuvdim BOAHTIN ;
}}}
The generated PLX file contains {{1 439 273}} lines, roughly 32 MB.
__Tomi:__ The hyphenator produces multiple suggestions. __Sjur:__ It needs to be
used with the hyphenation filter.
__TODO:__
* decide how to specify compounding behaviour info for the lexicon
(__Thomas, Trond, Sjur__)
* make first version of the PLX data generation (__Tomi__)
** done
* add hyphenation points to the generated output (__Tomi__)
** done
* update the hyphenation filter to always return exactly one alternative - if
there are multiple alternatives after all the other filtering, pick the one
with the most hyphenation points (__Saara__)
* make PLX lexicon for all open POSes (__Tomi__)
* add derivations to the PLX generation (__Tomi__)
* add compound stems to the PLX generation (__Tomi__)
* add closed POS and clitics to PLX generation (__Tomi__)
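The selection rule for the hyphenation filter (keep the alternative with the most hyphenation points) is small enough to sketch. The sketch is in Python for illustration only; the function name, the {{-}} hyphenation mark, and the tie-breaking rule (keep the first alternative) are assumptions, not decisions from the meeting.

```python
def pick_hyphenation(alternatives, mark="-"):
    """Return the alternative with the most hyphenation points.

    Ties are broken by keeping the first alternative in input order.
    Returns None for an empty list.
    """
    if not alternatives:
        return None
    return max(alternatives, key=lambda alt: alt.count(mark))


# Placeholder strings, not real hyphenator output.
print(pick_hyphenation(["a-bc", "a-b-c", "ab-c"]))
```

Counting marks keeps the rule independent of how the earlier filtering steps work; it only needs the surviving alternatives as strings.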
!!Automatic testing of the Word spellchecker
It should be possible to write a script that runs texts through Word from the
command line, using a combination of shell script and AppleScript. MS Word has
the needed AppleScript commands to run the spell checker.
TODO:
* consider a script for automatic testing (__Sjur, Børre__)
* consider more testing routines (__Sjur, Børre__)
* consider infra for testing feedback (__Børre, Sjur__)
* get an Intel Mac for testing Windows spellers; get a WinXP license from SD
(__Børre, Sjur__)
!!Aspell
__Børre__ has worked on the Aspell code, mainly to be able to explain how it
works and help out __Petter Reinholdtsen__, who would like to include the
Aspell {{sme}} speller in Linux distributions. He has already contributed some
improvements to our code.
TODO:
* revitalize the Aspell work (__Børre, Sjur, Tomi__)
** Børre and Tomi did some
TODO when the major part of the PLX conversion is done:
* add Aspell/Hunspell data generation to the lexc2xspell (__Tomi__ - after the
PLX data generation is finished)
* study Hunspell, perhaps also Soikko (__Børre, Sjur, Tomi__)
!!!10. Other
!!Corpus contracts
TODO:
* check corpus contract issue in a meeting Wednesday 9.30
(__Børre, Sjur, Trond__)
** done
* publish corpus contracts and project infra on NoDaLi-sta (__Sjur__)
!!Bug fixing
__62__ open Divvun/Disamb bugs, and __24__ risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
!!Task lists as iCal entries
TODO:
* update Maaren's Forrest installation to r430284 (__Børre__)
!!Employee seminar in Alta
SD has an employee seminar in Alta 7.-8. December - should we go there? Yes.
TODO:
* ask Julie Eira about SD employee seminar (__Sjur__)
** done
* send in hotel room list (__Divvun__)
!!!11. Next meeting, closing
The next meeting is 27.11.2006, 09:30 Norwegian time.
The meeting was closed at 11:28.
!!!Appendix - task lists for the next week
!! Børre
[iCal|/doc/admin/weekly/2006/Tasks_2006-11-20_Boerre.ics]
* contact writers who already have received contracts
* consider a script for automatic testing of the spell checker in Word
* consider more testing routines
* update Maaren's Forrest installation to r430284
* {{sma}} discussions with SD (with __Sjur__, __Trond__)
* add a simple password protection to risten.no in the G5
* consider infra for testing feedback
* get an Intel Mac for testing Windows spellers; get a WinXP license from SD
* check corpus contract issue
* port all i18n work to the main branch (from the i18n branch)
* update all forrest installations, including local patches
* send in hotel room list for employee seminar
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
!! Maaren
[iCal|/doc/admin/weekly/2006/Tasks_2006-11-20_Maaren.ics]
* investigate the generated word form list sent to Polderland - use the command
{{make wordlist TARGET=sme}} in ''victorio''
* investigate unrecognised word forms in the hyphenator
* send in hotel room list for employee seminar
!! Saara
[iCal|/doc/admin/weekly/2006/Tasks_2006-11-20_Saara.ics]
* finalize server of the Xerox tools.
* help Trond with some shell commands
* update namelex2xml
* add {{nor}} output file to namelex2xml
* update hyph-filter.pl
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
!! Sjur
[iCal|/doc/admin/weekly/2006/Tasks_2006-11-20_Sjur.ics]
* name lexicon:
** refactor SD-terms editor code
** implement missing propnouns editing functions
** implement improvements decided upon in Tromsø
* hire linguist and programmer
* make new list of unrecognised forms in the hyphenator
* investigate unrecognised word forms in the hyphenator
* decide how to specify compounding behaviour info in the lexicon
* {{sma}} discussions with SD (with __Børre__, __Trond__)
* check why some SUB-marked entries got included in the normative transducer
* consider a script for automatic testing of the spell checker in Word
* consider more testing routines
* consider infra for testing feedback
* get an Intel Mac for testing Windows spellers; get a WinXP license from SD
* redo the derived words work for {{smj}}
* revitalize the Aspell work
* check corpus contract issue in a meeting Wednesday 9.30
* update the generating transducer to only be normative
* discuss clitics in PLX format with Polderland
* publish corpus contracts and project infra on NoDaLi-sta
* send in hotel room list for employee seminar
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
!! Thomas
[iCal|/doc/admin/weekly/2006/Tasks_2006-11-20_Thomas.ics]
* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt
* redo the derived words work for {{smj}}
* investigate unrecognised word forms in hyphenator
* decide how to specify compounding behaviour info in the lexicon
* check why some SUB-marked entries got included in the normative transducer
* create / check the paradigm grammar
* investigate the generated word form list sent to Polderland - use the command
{{make wordlist TARGET=sme}} in ''victorio''
* send in hotel room list for employee seminar
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
!! Tomi
[iCal|/doc/admin/weekly/2006/Tasks_2006-11-20_Tomi.ics]
* make PLX lexicon for all open POSes
* add derivations to the PLX generation
* add compound stems to the PLX generation
* add closed POS and clitics to PLX generation
* revitalize the Aspell work
* send in hotel room list for employee seminar
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
!! Trond
[iCal|/doc/admin/weekly/2006/Tasks_2006-11-20_Trond.ics]
* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt
* get more {{sma}} texts
* update the corpus tag list in the {{cwb/}} directory
* investigate unrecognised word forms in the hyphenator
* decide how to specify compounding behaviour info in the lexicon
* {{sma}} discussions with SD (with __Børre__, __Sjur__)
* redo the derived words work for {{smj}}
* update the corpus/cwb tags (__Trond__)
* [fix bugs!|http://giellatekno.uit.no/bugzilla]