!!!Meeting setup
* Date: 09.10.2006
* Time: 09.30 Norw. time
* Place: Where we are
* Tools: SubEthaEdit, iChat
!!!Agenda
# Opening, agenda review
# Reviewing the task list from two weeks ago
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Infrastructure
# Linguistics
# name lexicon infrastructure
# Spellers
# Other issues
# Summary, task lists
# Closing
!!!1. Opening, agenda review, participants
Opened at 09:51.
Present: __Børre, Saara, Sjur, Thomas, Tomi, Trond__
Absent: __Maaren__
Agenda accepted as is.
!!!2. Updated task status since last meeting
!! Børre
* corpus collection:
** contact __Ája__ (Kåfjord)
** contact __Richard Valkepää__ at NSI about older Min Áigi and Áššu files
*** Done
** remind __Bård Eriksen__ about the book catalogue
*** Done. Got it
* Move norwegian documents in Min Áigi from sme to nob
** Not done
* corpus access:
** possibly deploy the user account form as an HTML form
** Write both user and admin documentation (__Børre__, review: __Sjur, Thomas__)
*** Not done
* set up Bugzilla automatic reminders for open issues
** Not done
* finish Forrest i18n work (pdf)
** Not done
* Get more {{sma, smj}} texts to improve language recognition
** Nothing this week
* set up Tomcat for use with eXist and the propnouns db on the G5
** Not done
* add the new ''Words'' section to the site
** Not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** Not done
* Other:
** Gathered texts from the sami parliament computers
** Investigated on how to use the Sami parliament Marratech server. We have
to be connected to the sami parliament internal net ...
!! Maaren
* download and install latest Marratech
!! Saara
* add more texts to the graphical corpus interface
* remove headers and footers from pdf documents
** not finished.
* Implement server of the analysis tools.
** not finished.
* generate parallel corpus files manually (with __Trond__)
** in progress
* Improve text_cat
** not yet finished.
* export corpus tools to location available to all (with cron), cf news disc.
** not done
* Write a script for cleaning sme-hyph-output.
** done
* Implement M4-rules for sme-dis.rle
** waiting for some desicions.
* improve {{smj}} LM based on the new, larger corpus
** not done
* document recent changes made in corpus infrastructure
** done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
!! Sjur
* finish the hyphenator clean-up script
** __Saara__ did this for me - thanks!
* name lexicon:
** implement editing functions
*** some done in the summer (__Tomi__), the rest waiting for the next item to
finish
** finalise refactoring for multiple collections - search interface
*** finally DONE! See [http://88.114.127.52:8080/exist/risten/] for a demo,
search for e.g. 'Tte*' or 'вос*' (excluding the cit. marks) to see it in
action.
** implement improvements decided upon in Tromsø
*** delayed until the previous one is in place
** XQuery refactoring and code development for our proper noun editor
*** delayed until the previous one is in place
* review user and admin documentation for corpus access
** not done
* write user account form, probably ask for copy of existing ones from the IT
centre (with Trond)
** not done
* start hiring process of linguist and programmer
** I have prioritised the risten.no/propernoun work, as well as what is needed
for Polderland
* finish i18n work of Forrest
** nothing done. What we have works in a webapp setting, but not with the CLI.
This is a largish task, needs consideration.
* consider the problems of lexicalised derivations schewing analyses of
derivation patterns
** not done
* install eXist and our local copy of risten.no and propnouns on the G5
** not done
* speller follow-up from the Tromsø meeting
** done, discussed with __Tomi__
* get instructions on how to use Marratech, and test it
** still nothing
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** nothing last week
!! Thomas
* work with Polderland phonetic rules
** finished with sme, soon finished with smj
* refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
** finished with cont. lexica
* find and study all derived verbs in our corpus (depends on __Trond__)
** not yet
* suggest which derivations could be generated (depends on __Trond__)
** not yet
!! Tomi
* start to plan the implementation of the speller data conversion/generation
** started - written a first Java app for the purpose
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
!! Trond
* make shell script wrappers for the most common commands for user friendlyness
** Done (pseudocode)
* write user account form, probably ask for copy of existing ones from the IT
centre (with Sjur)
** Not done
* write documentation for our {{bound}} users, with pointers to the ordinary
documentation
** Not done
* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt
** Started work with Thomas, not much done.
* Get more {{sma}} texts to improve language recognition
** Tried to contact relevant people, still no result.
* study paragraphs with mixed content
** Not done.
* [fix bugs!|http://giellatekno.uit.no/bugzilla].
!!!3. Documentation
TODO:
* finish i18n work (__Børre__ and __Sjur__)
** done when running in 'forrest run' mode - it works perfect now
** does NOT work when running 'forrest site', instead it generates a heap of
errors
*** solution/workaround: only use it with 'forrest run', webapp on our server
** i18n does not work in PDF ("Table of Content" won't translate)
*** we need to ask on the Forrest list
* Write both user and admin documentation (__Børre__, review: __Sjur, Thomas__)
** User documentation probably in several languages. This covers how to apply
for an account, on what grounds one can apply, and pointers to documentation
telling how to use the corpus.
** Admin documentation, telling how we set the permissions to the corpus files,
and whatever other processes and tasks needed to set up a corpus account.
*** move these issues into Bugzilla (__Sjur__)
* add the new ''Words'' section to the site (__Børre__)
** done
!!!4. Corpus gathering
__NSI:__ the Mac has crashed during the summer, they need a new motherboard for
it, but they don't have money to buy it! This is a major setback for us - the
major part of Min Áigi and Áššu can't be delivered. Unless... __Børre__ is going
to Kautokeino on a course in the near future, and could probably get the hard
disk mounted on his computer. Discuss and plan this with the NSI guy.
__Bård Eriksen__ has sent the book list. It contains only about 10 titles.
__Lars Kintel__ has contacted us, and offered all his translations to {{smj}}
free of charge. We still need to contact the authours, to get their approval as
well.
__TODO:__
* contact NSI (__Børre__)
** done
* bug __Bård Eriksen__ about the book list (__Børre__)
** done
* continue to contact authors (__Børre__)
!!!5. Corpus infrastructure
!!General
__TODO:__
* consider an article on our corpus framework (__all__)
** put this into Bugzilla (__Trond__)
!!User accounts and access
The whole issue should be moved into Bugzilla as several tasks.
TODO:
* add the issue with subissues to Bugzilla (__Trond__)
!!More texts to the graphical corpus interface:
We are waiting for parallel texts, and aligning parallel text. SD Parliament
meeting memos are excellent for parallell testing, as they are accurately
translated in both directions (but some of the Parliament plenary meeting memos
do not have a nob counterpart, or we haven't got one yet).
__Børre__ has found a lot of old Parliament meeting protocols, and will probably
be able to add missing translations/parallell documents.
TODO:
* align texts, analyse, and send to Lars (__Trond, Saara__)
* add text to the server (__Lars__)
!!Aligner
Discuss NT parallel corpus
originals: Word, paratext, txt
our format: xml
It boils down to two issues:
# Which original to choose
# Whether to align according to verses or according to our own preprocessor
(verses) (send both options to the parsers and see which one we like)
Goal: Have the Bible texts in the same format as .
TODO:
* gather parallel texts (__Trond__)
* try out NT alignment strategies (__Saara__)
!!Language recognition
TODO:
* Get more text of the poorly covered languages: {{sma, smj}} (__Trond, Børre__)
** {{sma:}} get the Bible texts (__Trond__)
* improve the {{smj}} LM (__Saara__)
* what about paragraphs with mixed content? Build a corpus of such paragraphs
(__Trond__)
!!!6. Infrastructure
!!Xerox tools wrapped as servers
No restrictions on IP numbers or domains from which the server is accessed.
Modules to be includes / types of services:
* morphological analyzer
* morph. generator (norm/descr? If !SUB could be added as a regular tag, then we
can have both)
* hyphenator
* disambiguator
* speller (in the future) - this is not Xerox tools:-)
Supported input formats:
* pure text
* XML according to a fixed DTD - if XML in, always XML out
Output formats:
* text (UTF-8)
* XML (as option) - we need to decide upon the XML delivered (see below)
Access clients:
* Java client (__Tomi__)
* command-line client (__Saara__)
* http access through CGI-script (__Saara__)
XML output format:
{{{
Analyser and disambiguator:
..
}}}
Generator and hyphenator should follow the same pattern. __Saara__ will make
something:-)
__TODO:__
* improve and finish the present prototype (__Saara__)
!!Hyphenator
Tros^te^rud TROSTERUD
Čále dahje kopiere sániid lásii, ja deatti "Analysere". Čilgehus giellaoahpalaš
čá^le da^hje kopiere sá^niid lá^sii , ja deat^ti a^na^ly^se^re .
Čilgehus
čil^ge^heh^kos ≠
čil^ge^heh^kos ≠
čil^ge^hus =
čil^ge^hus =
Čil^ge^hus <==
TODO:
* improve the clean-up script to remove more-complex word forms (__Saara__)
** done
* Update the sma hyphenator rule set with the insights gained from smj updates
(__Trond__ during weekends)
* linguistic issues in the hyphenator: (__Sjur__ using M4)
** shortening in compounds (grossly overgenerates in the hyphenator output)
* consider case conversion problems (__Saara__)
* check #- when comparing input string with hyphenated string - the hyphen
''could'' be part of the input string, keep it if it is, discard the
hyphenated string if not (__Saara__)
!!Automatic Bugzilla reminder for untouched bugs
Some perl-libraries needed by Bugzilla weren't in the path, causing it to not
work. Adding them should fix the issue.
TODO:
* fix the remaining issues (__Børre__)
!!M4
__TODO:__
* make speller make targets that utilise M4 to produce normative
and hyphenation transducers - postponed a while (__Tomi__ or __Sjur__)
* disamb variants for derivations (see next) (__Saara__)
** include make as part of a shell script, which will run the M4 processor on
the disamb rules (__Saara__)
!!!7. Linguistics
!!Names and multilinguality
We need a more principled approach to this.
Background: the name lexicon is getting attention from the SD name/terminology
sections, and they would like to use our name lexicon also for public searching.
Observations:
1) Multilinguality is always optional.
2) We can observe that "foreign" names in texts follows a domination pattern:
majority language forms can be found in minority language texts as real names
("Kautokeino produkter"), whereas minority language names ''almost always''
occur in majority language texts as citations. And citations should not be
considered a natural part of the text.
3) When looking at our name classification, multilinguality varies according to:
{{{
Ani - weak/none? (pet, myth anim. names)
Fem - weak (informative)
Mal - weak (informative)
Obj - strong
Org - strong
Plc - strong for the national and country names, weak (informative) for foreign
names
Sur - none
Tit - strong (titles)
}}}
Suggestion:
We need to reconsider the ''all names in all languages'' policy. That policy is
valid only for {{Fem, Mal,}} and {{Sur}} (and Ani and Tit?). For
{{Obj, Org, Plc}} the rule should be that if they have multilingual names, each
name should only be used in it's own language. Then we need a modification
saying that majority language names can be included in minority language
lexicons __if attested__ in our corpus. Also, the majority language varies
according to country (obviously), which means that in a speller context, we
might consider tailoring spellers for each country, leaving out noise relating
to majority language names from another country.
A further issue is whether we should reconsider our cohort policy. Today, Sur
and Plc are __different__ readings. An alternative would be to have them as
secondary tags, not in conflict with each other:
{{{
""
"Trosterud" N Prop Sur Sg Nom <<< @HNOUN
"Trosterud" N Prop Plc Sg Nom <<< @HNOUN
""
"Trosterud" N Prop Sg Nom <<< @HNOUN
""
"Trosterud" N Prop Sg Nom &Sur &Plc <<< @HNOUN
}}}
!!Derivation and spellers like Aspell
* enclose the CG rule that preferres lexicalised forms over derivations in M4
(__Saara__)
* find and study all derived verbs in our corpus (__Thomas__)
* suggest which derivations could be generated (__Thomas__)
* lexicalise the rest (__Thomas__)
!!North Sámi
Nothing this week?
!!Lule Sámi
TODO:
* refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
(__Thomas, Trond__)
** done some more, much more left
* hire new linguist (__Sjur__)
** nothing real, but one new candidate has surfaced
!!Komi
It appears that the conversion form LexC to XML went wrong wrt to upper:lower
pairs, consider the following entry from the Attic (noun file):
{{{
вос:воск N ;
}}}
The corresponding XML is:
{{{
воск
Noun1
N
}}}
It should have been:
{{{
вос
воск
}}}
There seems to be an issue. Trond will look into it.
On Victorio:
{{{
/usr/local/cvs/repository/kt/kom/src/Attic/
~$ll /usr/local/cvs/repository/kt/kom/script/kom-utf.xml,v
}}}
TODO:
* check the dictionary and lexc source files
!!!8. Name lexicon infrastructure
Decided in Tromsø:
* add logging facilities to the interface
* add option to download local copies of the lexicon files directly from the db
* batch editing (change all entries in the found set), should later be enhanced
to allow selection of exceptions (the found set minus deselected items)
* tag for excluding/including a name from certain applications
* future epxansion: choose what info to display in the single language browser
* display existing language entries when adding a new language to a record
* add editor to change single, existing entries
Details can be found in [the meeting
memo.|/doc/admin/physical_meetings/tromso-2006-08-propnoun.html]
TODO:
* finish refactoring for multiple collections in the search interfarce
(__Sjur__)
** done - see [http://88.114.127.52:8080/exist/risten/]
Search for: Tte or во
* develop the needed XQueries and UI (__Sjur, Tomi__)
* turn Tomcat on on the G5; send admin username and password to __Sjur__
(__Børre__)
* add eXist and the proper noun interface to the G5 (__Sjur__)
Postponed:
* data synchronisation between risten.no and the cvs repo
* new version of xml2lexc (based on ccat), should handle complex names correct:
construct entries like we have now from the different parts of a complex name
entry
!!!9. Spellers
!!Speller data generation
__TODO:__
* start to plan the implementation of the speller data conversion/generation
(__Tomi__)
** done, progressing fine.
!!!10. Other
!!Meeting with the map authorities
__Sjur__ is going to a meeting in Oslo on Wednesday, meeting with the
Ministeries and the __Statens kartverk__.
Issues:
* open source vs money
!!Bug fixing
__64__ open Divvun/Disamb bugs, and __25__ risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
!!Meetings and Marratech
__Børre__ talked to __Leif Åge__ about the Marratech server. All users need to
be inside the SD firewall. Thus, Marratech is __not__ an option.
Quoting Marratech:
[Marratech on proxies and
firewalls|http://www.marratech.com/userman/manager/app_firewalls.html]
What about using proxies for __Maaren?__
iChat has a setup page for it in Preferences->Accounts. Børre will investigate
__TODO:__
* investigate use of proxies for __Maaren__ (__Børre__)
!!Task lists as iCal entries
TODO:
* update Maaren's installation to r430284 (__Børre__)
!!Words section
You all need to check out CVS/words, and link to the relevant place, cf how
gt/doc/ is linked.
!!!11. Next meeting, closing
Next meeting 16.10.2006 at 9:30.
__Saara__ will work offline next week, and won't be present in the meeting.
Closed at 12:26.
!!!Appendix -task lists for the next week
!! Boerre
[iCal|/doc/admin/weekly/2006/Tasks_2006-10-09_Boerre.ics]
* corpus collection:
** contact __Ája__ (Kåfjord)
** discuss access to older Min Áigi and Áššu files with __Richard Valkeapää__
* Move norwegian documents in Min Áigi from sme to nob
* set up Bugzilla automatic reminders for open issues
* finish Forrest i18n work (pdf)
* set up Tomcat for use with eXist and the propnouns db on the G5
* investigate use of proxies for __Maaren__, for her to be able to use iChat AV
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
!! Maaren
* download and install latest Marratech
!! Saara
[iCal|/doc/admin/weekly/2006/Tasks_2006-10-09_Saara.ics]
* add more texts to the graphical corpus interface
* finalize server of the Xerox tools.
* generate parallel corpus files manually (with __Trond__)
* Improve text_cat
* export corpus tools to location available to all (with cron), cf news disc.
* Improve hyph-filter.pl
* Implement M4-rules for sme-dis.rle
* improve {{smj}} LM based on the new, larger corpus
* help Trond with some shell commands
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
!! Sjur
[iCal|/doc/admin/weekly/2006/Tasks_2006-10-09_Sjur.ics]
* name lexicon:
** refactor SD-terms editor code
** implement missing propnouns editing functions
** implement improvements decided upon in Tromsø
* move corpus user doc issue to Bugzilla
* start hiring process of linguist and programmer
* finish i18n work of Forrest
* install eXist and our local copy of risten.no and propnouns on the G5
* add normative M4 macros to the twol code; utilize it in the make file
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
!! Thomas
[iCal|/doc/admin/weekly/2006/Tasks_2006-10-09_Thomas.ics]
* work with Polderland phonetic rules for {{smj}}
* refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
* find and study all derived verbs in our corpus (depends on __Saara__)
* suggest which derivations could be generated (depends on __Saara__)
!! Tomi
[iCal|/doc/admin/weekly/2006/Tasks_2006-10-09_Tomi.ics]
* continue implementation of the speller lexicon conversion
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
!! Trond
[iCal|/doc/admin/weekly/2006/Tasks_2006-10-09_Trond.ics]
* Discuss shell script wrapper teaksta.sh with Saara
* write user account form, probably ask for copy of existing ones from the IT
centre (with Sjur)
* write documentation for our {{bound}} users, with pointers to the ordinary
documentation
* refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
* Get more {{sma}} texts to improve language recognition
* study paragraphs with mixed content
* add corpus artice framework to Bugzilla
* add corpus user accounts and access issues to Bugzilla
* [fix bugs!|http://giellatekno.uit.no/bugzilla].