!!!Meeting setup * Date: 31.10.2005 * Time: 10.00 Norw. time * Place: Wherever we are :-) * Tools: iChat, SubEthaEdit, phone !!! Agenda # Opening, agenda review # Reviewing the task list from two weeks ago # Documentation - divvun.no # Corpus gathering # Corpus infrastructure # Linguistics # Speller infrastructure # Other issues # Summary, task lists # Closing !!!1. Opening, agenda review, participants Opened at 10:05. Present: __Børre, Maaren, Sjur, Thomas, Tomi, Trond__ Absent: __Saara__ Main secretary: __Tomi__ Agenda accepted as is. !!!2. Reviewing the task list from the last meeting !! Børre * Contact oahpahusossodat and the rest of the SD about texts ** Get help from the Tromsø department of Sámediggi to dig in WebSak *** Done, and picked up some texts * Gather public texts ** From the Sámediggi * Reorganise the directory structure ** Put all corpus texts into one place *** Done ** Continue converting text from input format to our xml *** Not done * Ask __Thor-Øivind__ to move bugzilla to our new webserver. ** He has been very busy, and since bugzilla seems to work ok, he has scheduled the update for next week. !! Maaren * The missing list, both the overall missing list from our xml corpus, and a file-for-file review, in order to get different terminology. ** Not done * continue working with the missing list from risten.no ** Not done * Start working on Sámi place names ** Not done * Start working at normativity issues (numeral issues with __Trond__?) ** Not done, sorry !! Saara * Look at the corpus infrastructure issue * Look at the corpus interface issue with Lars * make an emacs mode for the name project (cf. specs in the memo above) ** Done * Plan a conversion script for the name lexicon. ** Not done, discussed with Tomi about using his c++ code in xml2lexc script. !! Sjur * Lule Sámi twol problems, look again at the sets definition with __Thomas__ and __Trond__ ** Not done * risten.no bugs and fixes ** installed the latest eXist snapshot, and tested it ** corrected several conformity bugs that had been accepted by earlier snapshots * discuss risten.no work with __Tomi__ ** Not done * follow up on voice group-chat not working to Sámediggi ** Test Marratech * project planning with __Trond__, continued ** also look at the development processes - specification and testing * Follow up on place names from Norge Digitalt -> remind __Bjørn Olav Megard__ ** He is reminded, but no response so far. Needs more action. * Evaluate SFST as speller (and analyzer) lexicon ** more thorough analysis than was possible in Guovdageaidnu * write a background document on the corpus contracts ** Not done * Discuss the contract issue with Trond, return the new version to the lawyer ** Call __Kimmo Koskenniemi__ for comments *** no answer so far * write to the Giellalávdegoddi once more, emphasizing timetable and response needs in the Divvun project ** wrote draft letter and sent it to __Maaren__ for QA and translation * discuss kvensk project support with __Trond__ ** Not done * write public tender documents ** Nothing last week * other tasks: ** finally corrected the Forrest config for [XXE|http://www.xmlmind.com/xmleditor/] to be fully installable independent of the application itself. This will make it much easier to install upgrades of XXE, and also to upgrade the config itself, which is important to help the old GIO take control of the [risten.no|http://www.risten.no] development. ** The new config is available from me ** you should all update to XXE 3.0:-) * buy: ** rucksacks *** ordered: [Professional 17 Backpack Case by Brenthaven| http://store.apple.com/Apple/WebObjects/nostore.woa/90806/wo/OH510FVNDjpN2MoGdv12jFElcgI/1.0.17.1.0.8.12.1.14.1.17.0] ** new computer (project server)? ** project management software ** OmniOutline (upgrade) ** OmniGraffle (upgrade) ** ISDN card for Maaren (Maaren will order herself) !! Thomas * work on Lule Sami compounding and derivation **finished (this calls for at least a virtual celebration !!) * Look at Linguistic bugs with __Trond__ * Meet with Sjur and Trond about the definition of G1, G2, G3 **not done !! Tomi * Aspell: Continue working on the affix file & aspell ** Contact aspell author (UTF-8 thing) *** Not done * three-part compounding ** Not done * corpus infrastructure: dtd location (both public and internal) ** Not done * corpus infrastructure: file and dir organisation ** Done * Document aspell and corpus infrastructure ** Partially done * Cgi-script for uploading documents to corpus base ** Done, but needs modifications? * Specification for new catxml in C++ ** this includes also placing the source and binary *** Not done *** clean the script/ catalogue with __Trond__ * Common makefile issues ** Not done * discuss risten.no work with __Sjur__ ** Not done !! Trond * Work on the bug list (7 open). ** Still 7 open, as the number project hasn't started. ** Started looking at the G3 definition issue for sme, it is the key to some of the bugs, and responsible for very many of the missing words. * project planning with __Sjur__, continued * also look at the development processes - specification and testing ** Not done. * Work on the name project: ** Introduce the +Mal, +Fem, ... tags to the parser ** Done. Now the tags are there, and in the multichar list. We may thus start writing rules for them in the sme-dis.rle file. ** and discuss the work with __Maaren__ and __Børre__. ** Not done. Hopefully * clean the script/ dir ** Down from 76 to 51 entities. * discuss kvensk project support with __Sjur__ ** Hmm, did we discuss this issue? * Otherwise, the week was dominated by going to a conference on minority languages and language technology, in Bozen. !!!3. Documentation Documentation tasks: # Add documentation on our corpus infrastructure and our corpus work in general ("To be done by the ones making the corpora": __Børre__, __Tomi__, __Trond__, __Saara__). # Now we have 4 documents: ## Correct corpus (disamb usage) ## Corpus plan (for the disamb corpus cwb) ## catxml For the basic corpora, we need 3 types of documentation, or doc for 3 target groups: # For the __users/linguists:__ What corpus are found, how do I use them (this info is now scattered) # For the __collectors:__ How do I add texts, where do I add them, how do I convert them (this is the Corpus conversion doc) # For the __programmer:__ What did I actually do? (this is partly the catxml doc) For the work on the graphical user interface, we need documentation as well, in principle along the same lines, except that the user is not the same linguist as above. * add/update Aspell documentation (__Tomi__) ** Some documentation has been written, but there still is work to be done. * as always: document what you're doing:-) (__all__) !!!4. Corpus gathering Governmental documents (earlier in pdf, now in html) Tasks: * move existing gov. documents (pdf) from gt/ to our corpus repository (Børre) ** There are appr. 10 non-broken pdf documents in gt/sme/corp/original/ (the ones named stmelXXX.pdf contain only one page each) * Collect public (pdf and html) files (Børre) ** Done some test downloading, will have to look at tools to do this automatically. !!Contracts Tasks: * Follow-up on the lawyers' comments (__Trond__ has started with the university) ** __Trond__ and __Sjur__ finished the next revision of the contracts, and are waiting for comments from __Kimmo Koskenniemi__ *** Update: No comments from __Kimmo__ yet * add a background document explaining the model (__Sjur__) The most problematic issue: Who has the copyright of extracted material, like single words, collections of words, syntactic structure (potentially with some words filled in)? We need this to be controlled by us, not by the authors. The exact borderline is hard to define. !!North Sámi New Testament Our inhouse sme nt is as new as the one they have at Bibelselskapet, and we were told we could just use the version we have ourselves. !!Lule Sámi New Testament Svenska Bibelsällskapet is putting their finishing touches to the Lule Sámi translation, we will have it soon. * Haven´t heard anything from Olavi Korhonen. The only problem with our version is that there are some run together strings, and we could solve it ourselves. !!Lule Sámi Dictionary Nothing new about the meeting with __Anders Kintel__. !!!5. Corpus infrastructure Updated task list: # Make a system for file and directory permission (today: we all belong to the cvs group), to only allow people with root user privileges write access to the corpus repository, at least regarding original files # Include the xsl files under version control (cvs? rcs?) # Incorporate language detection as part of the corpus processing. # we need a way to deal with hyphenated documents in catxml/preprocess: ## in normal cases hyphenation points should be removed ## when testing the robustness of our parsers, as well as when testing the hyphenator, the hyphenation points should be retained !!!6. Linguistics !!Name lexicon Summary: see the [newsgroup|news:di5mbi$26ad$1@news.uit.no] {{{ Unclassified: 6090 entries 3203 LONDON 1788 BERN 468 NYSTØ 330 ACCRA 80 MARJA 59 NIILLAS 43 ANAR 43 ALEUHTAT 20 GIEDDI 17 HEANDARAT 16 DUORTNUS 7 SULLOT 4 VARGGAT 4 GEAVNNIS 3 EATNAMAT 2 NYOBL 2 GUOLBBA 1 PIERA }}} Motivation: * __Divvun:__ We want to cross-link different versions of the same locations in different languages * __Common:__ We do not want to enter the same names twice. We want a language-independent name lexicon * __Disamb:__ Having a richer tag set makes it easier to disambiguate * __Future:__ Richer analysis makes new applications possible, within information retrieval, grammar checking, machine translation etc. Needed: A plan for this project: # do the main markup in the present propernoun file # make a script for converting it to xml (to be done one time) # make a script for xml2lexc (to be done by the makefile) ## There is a sample file for the xml file format in gt/common/src/proper-nouns.xml ## There is a working xml2lexc for Komi, written by Saara # make the tags etc. in the parser Conversion: # Mark up the remaining 6090 entires until conversion starts (__Maaren__ to do the Sámi names, __Ilona__ to look at C-FI-NEN and other Finnish names, __Trond, Thomas__ and __Børre__ to look at the rest) # Entries still to be done: see above # This means we would need a seventh option, the unspecified name. # Then split propernoun-sme-lex.txt into two, one with the sami name being generated by the xml2lexc script, and one manually written file, containing the name sublexica (called propernoun-sme-morph.txt or whatever) # Look into efficient editing of the XML lexicon (__Tomi, Saara__) # Then convert to xml (__Tomi, Saara__) # Look into efficient editing of the XML lexicon again (__Tomi, Saara__) # Look into synchronisation issues with risten.no - we want the names there as well (__Tomi__) ## Consider automatic sorting on commit !! Twol SETS definition issue The definition of G1, G2, G3 in Lule Sámi is still open. and we would like to have input on this issue. We need a G3 definition for North Sámi also. Update: it is still not working, see [bug 193|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=193] SUGGESTION (__Trond__): __Thomas__, __Trond__ and __Sjur__ didn't meet last week either and should try again this Wednesday instead. !!North Sámi * three-part compounds issue still open * number project still open * The treatment of Sámi place names, we need a contract with "Norge digitalt", via UFD. ** __Sjur__ has written an e-mail to the UFD contact person, __Øystein Johannessen,__ who will look into it soon. He has not responded beyond saying he will return to it. __Sjur__ brought this up in the board meeting, and __Bjørn Olav Megard__ will remind __Øystein Johannessen__ about this issue. __Sjur__ will follow up on this one. * normativity issues: ** the Giellalávdegoddi meeting was last Friday, they will have a new meeting in December. They were not able to make any decisions, and there will be a new Giellalávdegoddi beginning next year who won't make decisions until late spring. This is a serious problem for the Divvun problem. *** Actions: __Sjur__ will bring this to the Divvun board, write a new letter to the Giellalávdegoddi, emphasizing the needs and timetables of the project ** The [document with the list of open issues|/lang/sme/normativity-issues.html] needs updating, both regarding the status of each issue, and documentation of them. Also a better classification of the issues would be nice. !!Lule Sámi __Sjur__, __Thomas__ and __Trond__ will cont. Lule Sámi issues. !!Numerals * The issue is postponed to next week. # An empirical overview ## Numeral generation ## Numeral inflection ## Numerals as parts of compounds # A clear concept of how we want to treat them ## Tagging # A treatment We will return to this issue after the name conversion. !!!7. Speller infrastructure Nothing this week either. !!!8. Other !!Technical issues * The mac os / perl bug (at least __Trond__ and __Sjur__ has it): ** utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line 82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl 5.8.6). It is probably a perl - OS mismatch. (__Trond__, __Thor Øivind__, __Tomi__) *** Another __example__ of the same bug: *** :"\x{00c3}" does not map to utf8 at ../script/preprocess line 113, <> chunk 33. *** One way to "resolve" this is to redirect the error messages to /dev/null: {{{ ... | preprocess 2> /dev/null | lookup ... }}} !!Video conferencing across firewalls The problem we've had with the SD firewall persists, and there doesn't seem to be any resources available to help us. __Geir Kaaby__ instead suggested we look at the [Marratech|http://www.marratech.com/] package, and try it out. So please download the MacOS X client (or get it from me), and I'll send you the URL to the meeting room as soon as I get it. !!Bug fixing __19__ open bugs (and 24 risten.no bugs) !Bugzilla update When Bugzilla is being moved, it should also be updated to the newest version, and the UTF-8 bug should be resolved. !!Buying * rucksacks for the whole Divvun team !!risten.no * Organisation: could __Tomi__ be used, in exchange for more linguistic work by (old) GIO members? Yes, it is ok, but how much still needs to be evaluated * it is ok to integrate "kvensk" placenames with risten.no ** this should be integrated with the general proper name work - we want all proper names integrated with risten.no, df above ** needs further development of risten.no to allow for multiple XML bases to be presented and maintained in parallel. This is to be further worked on by __Tomi__ and __Sjur__ !!Project planning and development processes Trond is using his project as a test case for an IT guy, __Geir Tore Voktor__, who is taking a course in project management. Be prepared to answer questions. !! Conference report from Trond * Relevant themes on the conference for us: ** Terminology work ** Dictionary work between minority languages ** Repositories for minority language resources ** Disambiguation work, for South African languages * Also, our work is relevant to other projects. This * There is welsh work on terminology: ** (__Dewi Jones__, __Delyth Prys__, U Wales, Bangor, I couldn't find). They use ISO standards for terminological work, the ones of us interested in [risten.no|http://www.risten.no] should have a look. !!!9. Summary, task list !! Børre * Contact oahpahusossodat about texts * Gather public texts * Continue converting text from input format to our xml * Ask __Thor-Øivind__ to move bugzilla to our new webserver ** ... and update Bugzilla at the same time * install Marratech client to Maaren's computer * install new XXE and the new XXE Forrest config for all (or check that it is installed and working) * mark-up names * move existing corpus docs from gt/ to new corpus repository !! Maaren * continue working with the missing list from risten.no * Start working on Sámi place names * Start working at normativity issues (numeral issues with __Trond__?) !! Saara * Look at the corpus infrastructure issue * Look at the corpus interface issue with Lars * make an emacs mode for the name project (cf. specs in the memo above) * look into efficient editing of the xml proper name lexicon (tools, modes, etc) * start looking at conversion of the name lexicon from present format to xml !! Sjur * Lule Sámi twol problems, look again at the sets definition with __Thomas__ and __Trond__, Wednesday * risten.no bugs and fixes * discuss risten.no work with __Tomi__ * follow up on voice group-chat not working to Sámediggi ** Test Marratech *** install Marratech client to Maaren's computer * project planning with __Trond__, continued ** also look at the development processes - specification and testing * Follow up on place names from Norge Digitalt ** write an e-mail to or call __Bjørn Olav Megard__ * Evaluate SFST as speller (and analyzer) lexicon ** more thorough analysis than was possible in Guovdageaidnu * write a background document on the corpus contracts * Discuss the contract issue with Trond, return the new version to the lawyer ** Call __Kimmo Koskenniemi__ for comments, perhaps arrange a meeting with him * Follow up on meeting with __Anders Kintel__ * discuss kvensk project support with __Trond__ * write public tender documents * buy: ** new computer (project server)? ** project management software ** OmniOutline (upgrade) ** OmniGraffle (upgrade) !! Thomas * do main markup in the present propernoun file * work on North sámi compounding and derivation * Look at Linguistic bugs with __Trond__ * Meet with Sjur and Trond about the definition of G1, G2, G3 !! Tomi * Aspell: Continue working on the affix file & aspell ** Contact aspell author (UTF-8 thing) * three-part compounding * corpus infrastructure: dtd location (both public and internal) * Document aspell and corpus infrastructure * Specification for new catxml in C++ ** this includes also placing the source and binary *** clean the script/ catalogue with __Trond__ * Common makefile issues * discuss risten.no work with __Sjur__ * discuss about xml-processing with __Saara__ * look into efficient editing of the xml proper name lexicon (tools, modes, etc) * start looking at conversion of the name lexicon from present format to xml * Look into synchronisation of proper names with risten.no !! Trond * Work on the CG-related bugs on the bug list (7 open) (numeral related ones postponed. * project planning with __Sjur__, continued ** also look at the development processes - specification and testing * Work on the name project: ** Discuss the conversion with __Maaren__ and __Børre__ ** mark up names * discuss kvensk project support with __Sjur__ * Work on the G3 bug issue with __Sjur__ and __Thomas__ !!!10. Next meeting, closing 7.11.2005 10:00 Closed at 11:06