!!!Meeting setup * Date: 05.12.2005 * Time: 09.30 Norw. time * Place: Wherever we are :-) * Tools: iChat, SubEthaEdit, phone !!! Agenda # Opening, agenda review # Reviewing the task list from two weeks ago # Documentation - divvun.no # Corpus gathering # Corpus infrastructure # Linguistics # name lexicon infrastructure # Other issues # Summary, task lists # Closing !!!1. Opening, agenda review, participants Opened at 09:48. Present: __Børre (after 15 min), Saara, Sjur, Thomas (only a few minutes), Tomi__ Absent: __Maaren, Trond__ Main secretary: __Sjur__ Agenda accepted as is. !!!2. Reviewing the task list from the last meeting !! Børre * Gather public texts * Continue converting text from input format to our xml * Document the corpus directory structure * Ask __Thor Øivind__ to move bugzilla to our new webserver ** ... or make the URL [http://giellatekno.uit.no/bugzilla] point to the present installation * install new XXE and the new XXE Forrest config for Ilona * divvun.no and giellatekno.uit.no ** Binary files download area ** Continue the conversion to static site, using our own script. * corpus xsl files under version control !! Maaren * working with risten.no * register AppleCare ** __Børre__ did it. * find decisisons/documents regarding syllable shortening in compounds in the norm; send them to __Thomas__ !! Saara * Look at the corpus infrastructure issue * Look at the corpus interface issue with Lars * look into efficient editing of the xml proper name lexicon (tools, modes, etc) * Convert the name lexicon from present format to xml ** not done * Look at the hyphenation issue ** implemented in preprocess, see documentation * corpus xsl files under version control ** discussion going on * Make preprocess faster ** preprocess is now as fast as possible with Perl implementation. Now I am considering also other tools. !! Sjur * Lule Sámi twol problems, look again at the sets definition with __Thomas__ and __Trond__ ** nothing * risten.no bugs and fixes ** recovered the lost data after the crash, prepared a new install, but were unable to install it due to network problems at SD * follow up on voice group-chat not working to Sámediggi ** Test Marratech *** no URL yet * project planning with __Trond__, continued ** also look at the development processes - specification and testing ** nothing * Follow up on place names from Norge Digitalt ** write an e-mail to or call __Bjørn Olav Megard__ *** he had forgotten about it. * Evaluate SFST as speller (and analyzer) lexicon ** more thorough analysis than was possible in Guovdageaidnu ** nothing * write a background document on the corpus contracts ** nothing * discuss kvensk project support with __Trond__ ** nothing * proper name integration with risten.no ** nothing * discuss risten.no work with __Tomi__ ** continued * write public tender documents ** nothing * hyphenation in corpus docs ** nothing * buy: ** new computer (project server)? ** nothing * Work on the Speech Application (Dec. 1) ** done * Work on the proofing article (Dec. 5.) ** done * Investigate the never-arriving backpacks ** done !! Thomas * work on North sámi compounding and derivation **worked and still working * Settle the empirical facts for sme diphthong simpl. **done * add descriptive facts about shortened forms in compounding from our corpus to the normativity issues document **done !! Tomi * Aspell: Continue working on the affix file & aspell ** Contact aspell author (UTF-8 thing) *** Not done * three-part compounding ** Not done -> transferred to __Thomas, Trond and Sjur__ * corpus infrastructure: dtd location (both public and internal) ** Not done * Document aspell and corpus infrastructure ** Nothing done * Specification for new catxml in C++ ** this includes also placing the source and binary *** Not done * discuss about xml-processing with __Saara__ * start looking at conversion of the name lexicon from present format to xml ** Looked a bit * Look into synchronisation of proper names with risten.no ** Not done * hyphenation in corpus docs ** Not done * corpus xsl files under version control ** Not done * add automatic language detection to the corpus processing ** Looked at the language detection documentation * Work on the proofing article (Dec. 5.) ** Done * remove last part of complex names not used as simplex names ** Not done !! Trond * Look at the three-part compound issue * Work on the CG-related bugs on the bug list (7 open) (numeral related ones postponed. * project planning with __Sjur__, continued * discuss kvensk project support with __Sjur__ * Work on the G3 bug issue with __Sjur__ and __Thomas__, carry it over to sme. * Working on the Speech Application (Dec. 1) * Working on the proofing article (Dec. 5.) !!!3. Documentation Documentation tasks: Add documentation on our corpus infrastructure and our corpus work in general (__Børre__, __Tomi__, __Trond__, __Saara__). For the basic corpora, we need 2 additional types of documentation, or doc for 2 target groups: # For the __users/linguists:__ What corpus are found, how do I use them (this info is now scattered) (Part of the HOWTO USE is documented in the catxml docu The what documents are found where etc + an overall documentation is not written, since the corpus is so sparsely populated) ## catxml done, which is what is needed mostly. Do we need more? ## we need a review: __Thomas, Maaren, Linda, Ilona, Trond__ # For the __collectors:__ How do I add texts, where do I add them, how do I convert them (this is (partly?) done in the Corpus Conversion document) ## we need a review of the web interface for corpus uploading - what is still missing? ### Review: __Sjur, Saara, Trond, Thomas__ Review setup and reporting to be posted to the newsgroup, possibly a summary in our documentation. __Saara__ to make a review template, to be posted in the newsgroup and commented before the actual review is started. __Deadline__ for comments and final template: by next meeting. test: * add/update Aspell documentation (__Tomi__) ** Some documentation has been written, but there still is work to be done. * as always: document what you're doing:-) (__all__) Tomcat->static HTML progress Now, all pages are generated directly from XML by Forrest within Tomcat. We'll change to let Forrest pre-generate the HTML (and pdf), and serve these ready-made files directly. __Deadline:__ Finished by this week. !!!4. Corpus gathering The Lule Sámi New Testament is ready for inclusion in our repository and will be added today. We then have our first Lule Sámi corpus text! !!Contracts Next step: # Make __our__ versions of the updated Helsinki contracts, and make sure they are according to our intention. (__Sjur and Trond__) # send them to the SD lawyer and to the University lawyer through formal channels. (__Sjur and Trond__) Contract 1 should have the main priority (contract 2 for __Trond__). !!!5. Corpus infrastructure Updated task list: # Include the xsl files under version control (__Børre, Tomi, Saara__) ## Saara has started a dicsussion in the newsgroup - please follow up! ### we can start using RCS right away, and we do so. The main users should be comfortable with two version control systems, and it is (relatively) easy to upgrade to CVS later, if we want that. # Incorporate language detection as part of the corpus processing (__Saara__) # we need a way to deal with hyphenated documents (documents with (manually) inserted hyphenation marks) in catxml/preprocess. ## What needs to be identified now is the conditions for the difference between "ealahus- ja ..." and "ea-la-hus-os-so-dat". The result should be: "ealahus- ja ..." and "ealahusossodat". Only hyphens as part of text should be tagged (that is, used to hyphenate a word at the end of a line, or to indicate syllabification for such hyphenation), hyphens in dates and other numeric expressions should be left as is. ### identify all cases of the first type, and replace all hyphens __NOT__ in this set with the tag (__Saara__). ### we need to review whether only automatic hyphen detection is good enough, or whether manual post-processing in some form is needed. Delayed until we have some results to base the review on. ### Acceptable results: 90% of all real hyphens correctly tagged. !!!6. Linguistics !!Name lexicon Summary: see the [newsgroup|news:di5mbi$26ad$1@news.uit.no] The plan for this project was as follows: Two lines of work run in parallel: # name markup ## Done! There are errors in the markup, people are urged to correct them as they pop up. !!Complex names Task list for this issue: * Restoring two-part names ** done, common/src/proper-complex.xml * find eventual unique second-parts (B-parts of names that do not exist in isolation): Take the 1.126 version, remove the non-last parts, and compare the B-parts of proper-complex.xml with the simplex forms in 1.126. Non-hits should then be removed from the current propernoun-sme-lex.txt. ** Postponed till next week * remove these B-parts from the ordinary name file (__Tomi__) * make sure xml2lexc can handle complex names in ways compatible with our present tool chain (=reconstruct the lexc format we have now) (__Saara__) * the resulting file format should be identical to our present prop-name file (=lexc), that can then be converted to our new xml format using the same script as for the regular names (__Tomi or Saara__, but only when the technical details are settled) * The core issue: The preprocess script can handle (A + B) compounds, but not (A + inflected B). !!North Sámi * three-part compounds issue still open ** look at Lule Sámi, but apply it to second-parts only *** Thomas is finished with three-part compounds for North Sámi. On the negative side, we note a compilation time of 10m for the sme parser as a whole. ** the exact rules for when shortening happens should be documented (__Maaren__ will send the normativity decisions to __Thomas__) ** linguistic analysis/discussion to continue in the newsgroup ** we should include the members-to-be of the Sámi Giellalávdegoddi (SGL) in this discussion, but how? We need a common meeting / normativity seminar / seminar on proofing issues and on language technology in a broader sense with them. Then we also need to open the linguistic issues newsgroup we talked about earlier, and set up their computers and teach them to use the newsgroup. This seminar needs to include the Lule Sámi section of SGL (others are welcome too). Timetable: *** The Sámediggiráddi is going to appoint new members before Christmas (__Johan Mikkel Saara__ is asking people) *** Contact the SGL when it is elected, in December, and ask them to arrange a normativity seminar, with invited keypersons. *** seminar in February? ** TODO list: *** __Maaren__ to discuss with relevant persons on this issue. * number project still open (see below) * diphthong simplification/G3 issue should be carried over from Lule Sámi. ** TODO: *** Writing the rules (copy from smj, adjust) (__Sjur, Thomas__ and __Trond__, next week) !!Lule Sámi Great progress has been made on the G3 issue, just some minor points remain. oa:å has been carried over to ä:e Open tasks: * Derived G3 that looks like G2 are still open. ** __Thomas, Trond and Sjur__ meet shortly after Dec. 5th to finish the rest. Today's compilation time: real 5m17.157s user 3m26.827s sys 0m5.070s !!Numerals The following North Sámi linguistic issues should be settled before going into the numeral project: # Three-part compounds # Diphthong simplification # Derivation These issues are recently done in Lule Sámi, and it is more efficient to complete them in North Sámi directly thereafter instead of beginning a new topic Numeral treatment is on different level in the existing sme and smj parsers, but the issue itself is common to the two langauges, and should therefore be treated in parallel. Numerals in North Sámi: Inventory is listed elsewhere. Numerals in Lule Sámi: There are 70 lines of code setting up the structure for case inflection of basic numerals. !!!7. Name lexicon infrastructure Present proposal: * name-oriented, single document: ** one name with many uses stored only once Present risten.no: * Concept-oriented center, contains: ** ID, links to each language entry * each language as a separate document, with links to the concept/entity in the cross-lingual center Possible new propsal 1: as risten.no Possible new proposal 2: separate documents: * Containing one common concept or name id field, plus Divvun/Disamb fields only * Containing one common concept or name id field, plus kvensk fields only * common document containing one common id field, plus fields common to several bases Porsanger both person and place Porsáŋgu only as place name, not as person name. 5 lgs give 10 Trosterud, 5 Timbuktu, it would be better to have 2 Trosterud and 1 Timbuktu, but 15 contlexica for these three concepts. Discussion to continue in the newsgroup. Tasks: # testing of conversion # continue the discussion of the name lexicon format (__Saara, Tomi, Sjur, Trond__) # implement a prototype in eXist # eXist as editor: ## develop the needed XQueries and interface ## synchronisation between risten.no and ## test whether eXist as editor is actually working well !!!8. Other !!Technical issues * The mac os / perl bug (at least __Trond__ and __Sjur__ has it, [Bugzilla #211|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=211]): ** utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line 82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl 5.8.6). It is probably a perl - OS mismatch. (__Trond__, __Saara__, __Tomi__) ** Test: the result of the last line should indicate whehter this is a problem in cat or a Perl/OS mismatch. {{{ preprocess file_name.txt OK cat file_name.txt | preprocess !! catxml file_name.xml | preprocess ?? }}} !!Video conferencing across firewalls We're still waiting for a working URL (working from outside SD, that is). !!Bug fixing __28__ open bugs (and 2 risten.no bugs) !!Move Bugzilla Move Bugzilla to the same server as the other ones (or make it work at the expected URL: http://giellatekno.uit.no/bugzilla/). TODO, TODO. __Thor Øivind__. !!risten.no The risten.no data has been rescued, and a new version of eXist is ready for installation. The installation has not been done due to the network problems at SD. Will be done this week, probably today. Also, an even newer version of eXist has been released (snapshot from last Saturday). Tomi will continue the proper name work. !! Rugsacks Were delivered on Nov 25. They have disappeared at UiTø, but that is now being investigated. Two were also properly delivered at SD/Guovdageaidnu. !!!9. Summary, task list !! Børre * Gather public texts * Continue converting text from input format to our xml * Document the corpus directory structure * Ask __Thor-Øivind__ to move bugzilla to our new webserver ** ... or make the URL [http://giellatekno.uit.no/bugzilla] point to the present installation * install new XXE and the new XXE Forrest config for Ilona * divvun.no and giellatekno.uit.no: ** Binary files download area ** Continue the conversion to static site, using our own script. * corpus xsl files under version control * review the convert2xml.pl script, what works well and what doesn´t. * comment review template made by __Saara__ * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Maaren * work with risten.no * find decisisons/documents regarding syllable shortening in compounds in the norm; send them to __Thomas__ * review corpus usage documentation * discuss with relevant people regarding seminar on proofing tools, normativity and SGL in February !! Saara * Look at the corpus infrastructure issue * Look at the corpus interface issue with Lars * Convert the name lexicon from present format to xml * discuss the new lexicon format in the newsgroup * document corpus infrastructure, your own parts * corpus infrastr. user docu review: make a template, and post it in the newsgroup * Language detection for the corpus files * Implement addition of tags to the converted corpus files * Do some testing for bug [#211|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=211] * update the convert2xml script according to the comments * review corpus upload user interface * corpus xsl files under version control * xml2lexc update to handle complex names: construct entries like we have now from the different parts of a complex name entry * update preprocess to handle inflected forms of complex names * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Sjur * Lule Sámi twol problems, look again at the sets definition with __Thomas__ and __Trond__ * follow up on voice group-chat not working to Sámediggi ** Test Marratech * project planning with __Trond__, continued ** also look at the development processes - specification and testing * Follow up on place names from Norge Digitalt ** write an e-mail to or call __Bjørn Olav Megard__ * Evaluate SFST as speller (and analyzer) lexicon ** more thorough analysis than was possible in Guovdageaidnu * write a background document on the corpus contracts * discuss kvensk project support * proper names: ** discuss lexicon format ** discuss implementation with Tomi * write public tender documents * hyphenation in corpus docs * buy: ** new computer (project server)? * Work on the Speech Application (Dec. 1) * Work on the proofing article (Dec. 5.) * Investigate the never-arriving backpacks * update the SD contract version * send SD corpus contract to lawyer * review corpus upload user interface * comment review template made by __Saara__ * update SD contracts * send SD contracts to SD lawyer * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Thomas * work on North sámi compounding and derivation * review corpus upload user interface * review corpus usage documentation * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Tomi * Aspell: Continue working on the affix file & aspell ** Contact aspell author (UTF-8 thing) * corpus infrastructure: dtd location (both public and internal) * Document aspell and corpus infrastructure * Specification for new catxml in C++ ** this includes also placing the source and binary * new proper name lexicon ** remove last part of complex names not used as simplex names ** start looking at conversion of the name lexicon from present format to xml ** discuss the new lexicon format in the newsgroup ** Look into synchronisation of proper names with risten.no * hyphenation in corpus docs * corpus xsl files under version control * comment review template made by __Saara__ * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Trond * Look at the three-part compound issue * Work on the CG-related bugs on the bug list (7 open) (numeral related ones postponed. * project planning with __Sjur__, continued * discuss kvensk project support with __Sjur__ * Work on the G3 bug issue with __Sjur__ and __Thomas__, carry it over to sme. * document corpus infrastructure, your part * update the SD contract version * send SD corpus contract to lawyer * review corpus upload user interface * review corpus usage documentation * comment review template made by __Saara__ * discuss the new lexicon format in the newsgroup * [fix bugs!|http://giellatekno.uit.no/bugzilla] !!!10. Next meeting, closing 12.12.2005 09:30 Closed at 11:12