!!!Meeting setup * Date: 02.01.2006 * Time: 09.30 Norw. time * Place: Wherever we are :-) * Tools: iChat, SubEthaEdit, phone !!! Agenda # Opening, agenda review # Reviewing the task list from two weeks ago # Documentation - divvun.no # Corpus gathering # Corpus infrastructure # Linguistics # name lexicon infrastructure # Other issues # Summary, task lists # Closing !!!1. Opening, agenda review, participants Opened at 09:55. Present: __Børre, Maaren, Saara, Sjur, Thomas, Trond__ Absent: __Tomi__ (sick leave until Jan. 17th) Main secretary: __Sjur__ Agenda accepted as is. !!!2. Reviewing the task list from the last meeting !! Børre * Gather public texts ** Not done * Continue converting text from input format to our xml ** Some done * Document the corpus infrastructure ** Not done * review code and documentation for corpus xsl files under version control ** Not done * review the convert2xml.pl script, what works well and what doesn´t. ** Done * comment review template made by __Saara__ ** Not done * proper names meeting to arrive at final xml format ** Done * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Maaren * work with risten.no ** not done, I have been very, very lazy * review corpus usage documentation, and the usage of the corpus ** not done * discuss with relevant people regarding seminar on proofing tools, normativity and SGL in February ** discussed with Laila (SGL) about seminar !! Saara * Look at the corpus infrastructure issue * Convert the name lexicon from present format to xml ** not done * discuss the new lexicon format in the newsgroup * Language detection for the corpus files ** almost done * Implement addition of tags to the converted corpus files ** done * Do some testing for bug [#211|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=211] ** not done * update the convert2xml script according to the comments ** some xsl-handling still needed * add version control of the corpus xsl files to the (upload) processing ** RCS access things have to be solved first. * xml2lexc update to handle complex names: construct entries like we have now from the different parts of a complex name entry ** not done * update preprocess to handle inflected forms of complex names ** not yet optimal * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Sjur * Lule Sámi twol problems, with __Thomas__ and __Trond__ ** nothing * follow up on voice group-chat not working to Sámediggi ** Test Marratech when the new Marratech server is in place *** waiting for the server * Follow up on place names from Norge Digitalt ** write an e-mail to or call __Bjørn Olav Megard__ *** last info: he has not forgotten it, but has not succeeded in bringing it up with __Øystein Johannessen__. * Evaluate SFST as speller (and analyzer) lexicon ** more thorough analysis than was possible in Guovdageaidnu * write a background document on the corpus contracts ** not done * proper names meeting to arrive at final xml format ** conducted the meeting, important progress made, and results posted to the newsgroup * write public tender documents ** Selection criterion document updated and sent to the project board - almost final * [fix bugs!|http://giellatekno.uit.no/bugzilla] * other: ** worked more with risten.no/eXist and the parallel editing bug - still no progress !! Thomas * work on North Sámi compounding and derivation **worked and still working * review corpus usage documentation **begun * smj G3 issue with __Sjur__ and __Trond__ **not anything this week !! Tomi On sick leave. !! Trond * project planning with __Sjur__, continued ** Not after last meeting * Work on the G3 bug issue with __Sjur__ and __Thomas__, carry it over to sme. ** Not done. * document corpus infrastructure, your part ** documented tags only. * review corpus usage documentation ** Read things * comment review template made by __Saara__ ** Not done * discuss the new lexicon format in the newsgroup ** Done. * proper names meeting to arrive at final xml format ** Participated. * [fix bugs!|http://giellatekno.uit.no/bugzilla] !!!3. Documentation !!Reviews Add documentation on our corpus infrastructure and our corpus work in general (__Børre__, __Tomi__, __Trond__, __Saara__). For the basic corpora, we need 2 additional types of documentation, or doc for 2 target groups: # For the __users/linguists:__ What corpus are found, how do I use them (this info is now scattered). (Part of the HOWTO USE is documented in the catxml docu, what documents are found where etc + an overall documentation is not written, since the corpus is so sparsely populated) ## catxml done, which is what is needed mostly. Do we need more? ## Review: __Thomas, Maaren, Linda, Ilona, Trond__ Review setup and reporting to be posted to the newsgroup, possibly a summary in our documentation. __Deadline__ for reviews: by next meeting. Catxml review: this should be based on the new tool made by __Tomi__ (C++ version), called ''ccat''. __Tomi__ will install it, and announce it in news when ready for review. Update: __Tomi__ is on sick leave, and __Saara__ will make the tool available in __Tomi's__ absence. !!!4. Corpus gathering !!Contracts Next step: # wait for comments from the lawyers - remind them of the task? (__Sjur, Trond__) # possibly update contracts with remarks from lawyers # start using them! !!Collecting Nothing new, now hampered by the lawyers checking the final version of the contracts. We want both parallell and errouneous (unproofed) text files. __Børre__ to contact the Odin guy. !!!5. Corpus infrastructure Updated task list: # Include the xsl files under version control ## RCS version control is almost finished, but an issue with access control is still open. Discussed a bit in the meeting, but nothing conclusive. We'll continue the discussion in the newsgroup. # Incorporate language detection as part of the corpus processing (__Saara__) ## Almost finished. Some heuristics regarding other Sámi languages in the same document to be added. # we need a way to deal with hyphenated documents (documents with (manually) inserted hyphenation marks) in catxml/preprocess. ## done, needs review (__Saara__) # we need to review whether only automatic hyphen detection is good enough, or whether manual post-processing in some form is needed. Delayed until we have some results to base the review on. ## Acceptable results: 90% of all real hyphens correctly tagged. # CGI-admin script to add xsl-file to a corpus file that doesn't have one (__Tomi__) ## __Saara__ will review the existing code, consult __Tomi__, and try to make a script to help __Børre__ utilize the template and the infra in place. !!!6. Linguistics !!North Sámi * three-part compounds issue still open ** Thomas is finished with three-part compounds for North Sámi. *** compilation time has increased to 10m for the sme parser as a whole. This is probably linked to the three-part issue, but we are not sure about that. *** it overgenerates wildly. This is bad for several of our applications. ** compound shortening issue is waiting for the SGL to make a normative decision *** we have to wait until SGL´s meeting protocol (???) is ready ** linguistic analysis/discussion to continue in the newsgroup ** we should include the members-to-be of the Sámi Giellalávdegoddi (SGL) in this discussion, but how? We need a common meeting / normativity seminar / seminar on proofing issues and on language technology in a broader sense with them. Then we also need to open the linguistic issues newsgroup we talked about earlier, and set up their computers and teach them to use the newsgroup. This seminar needs to include the Lule Sámi section of SGL (others are welcome too). *** Laila has informed the members of the SGL about this. This is very okei - open the linguistic issues newsgroup for the members of SGL, but we have to know how to do it. Seminar also very okei, but we have to wait until the new SGL is elected. ** Timetable: *** The Sámediggiráđđi is going to appoint new members in January (__Johan Mikkel Saara__ is asking people) *** Contact the SGL when it is elected, and ask them to arrange a normativity seminar, with invited keypersons. Have to wait until it is elected. Initially we planned for February, but it is too early, because SGL is not elected yet. Divvun wants the seminar as early as possible. * number project still open (see below) * diphthong simplification/G3 issue should be carried over from Lule Sámi. ** TODO: *** __Trond, Thomas and Sjur__ to discuss this this month. *** Writing the rules (copy from smj, adjust) (__Sjur, Thomas__ and __Trond__) *** Bug 77 - clearify whether ''háliid'' is acceptable - it is found, thus we need to be able to analyse it, but only as !SUB: !!Lule Sámi Open tasks: * Derived G3 that looks like G2 are still open. ** __Thomas, Trond and Sjur__ should meet again to finish the rest. !!Numerals The following North Sámi linguistic issues should be settled before going into the numeral project: # Three-part compounds # Diphthong simplification # Derivation These issues are recently done in Lule Sámi, and it is more efficient to complete them in North Sámi directly thereafter instead of beginning a new topic Numeral treatment is on different level in the existing sme and smj parsers, but the issue itself is common to the two langauges, and should therefore be treated in parallel. Numerals in North Sámi: Inventory is listed elsewhere. Numerals in Lule Sámi: There are 70 lines of code setting up the structure for case inflection of basic numerals. !!!7. Name lexicon infrastructure Summary: see the [newsgroup|news:di5mbi$26ad$1@news.uit.no] !!Complex names Task list for this issue: * find eventual unique second-parts (B-parts of names that do not exist in isolation): Take the 1.126 version, remove the non-last parts, and compare the B-parts of proper-complex.xml with the simplex forms in 1.126. Non-hits should then be removed from the current propernoun-sme-lex.txt. ** No unique B-parts found, nothing removed. Task closed. * make sure xml2lexc can handle complex names in ways compatible with our present tool chain (=reconstruct the lexc format we have now) (__Saara__) ** __Tomi__ is working on a C++ version of the xml2lexc tool. He will put it into CVS soon (it is actually catxml, can easily be adopted to the xml2lexc requirements) * the resulting file format should be identical to our present prop-name file (=lexc), that can then be converted to our new xml format using the same script as for the regular names (__Tomi__ or __Saara__, but only when the technical details are settled) * The core issue: The preprocess script can handle uninflected (A + B) compounds, but not (A + inflected B). __Saara__ will add the analyzer as part of the preprocess, to be able to handle inflected cases as well. !!XML format We had our meeting, and the result was pretty close to the structure of the existing riste.no term base. Now the conversion script to xml needs to be updated. The new format is documented in the newsgroup. Tasks: # update conversion from lexc to xml to reflect new xml format # testing of conversion # continue the discussion of the name lexicon format (__Saara, Tomi, Sjur, Trond__) # implement a prototype in eXist # eXist as editor: ## develop the needed XQueries and interface ## synchronisation between risten.no and ## test whether eXist as editor is actually working well !!!8. Other !!Seminars * SGL/normativity seminar ** all members = potentially/likely all languages ** date? As early as possible, but not likely before the end of February ** place? __Maaren__ will investigate * project meeting ** date: soon, (third)/last week of January? Last week best for __Maaren__ ** place: Tromsø? * we'll return to these next week, and make decissions about our own meeting then !!Technical issues * The mac os / perl bug (at least __Trond__ and __Sjur__ has it, [Bugzilla #211|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=211]): ** utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line 82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl 5.8.6). It is probably a perl - OS mismatch. (__Trond__, __Saara__, __Tomi__) ** Test: the result of the last line should indicate whehter this is a problem in cat or a Perl/OS mismatch. {{{ preprocess file_name.txt OK cat file_name.txt | preprocess !! catxml file_name.xml | preprocess ?? }}} !!Bug fixing __28__ open bugs (and 25 risten.no bugs) !!Move Bugzilla Bugzilla now works at the old URL [http://giellatekno.uit.no/bugzilla/]. !!!9. Summary, task list !! Børre * Gather public texts, preferrably also parallel ones * Contact Odin editor to ask for source (and parallel) documents * Continue converting text from input format to our xml * Document the corpus infrastructure * review code and documentation for corpus xsl files under version control * review the convert2xml.pl script, what works well and what doesn´t. * comment review template made by __Saara__ * proper names meeting to arrive at final xml format * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Maaren * work with risten.no * review corpus usage documentation, and the usage of the corpus * discuss with relevant people regarding seminar on proofing tools, normativity and SGL in February/March, including place. !! Saara * Look at the corpus infrastructure issue * Convert the name lexicon from present format to xml and test it * discuss the new lexicon format in the newsgroup * Language detection for the corpus files * Review the hyphenation detection. * Catxml review: look at the ccat tool * Review the handling of xsl-files in corpus infrastructure * Do some testing for bug [#211|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=211] * add version control of the corpus xsl files to the (upload) processing * xml2lexc update to handle complex names: construct entries like we have now from the different parts of a complex name entry * update preprocess to handle inflected forms of complex names * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Sjur * Follow up the lawyer treatment of the contracts * Lule Sámi twol problems, with __Thomas__ and __Trond__ * follow up on voice group-chat not working to Sámediggi ** Test Marratech when the new Marratech server is in place * project planning with __Trond__, continued ** also look at the development processes - specification and testing * Follow up on place names from Norge Digitalt ** write an e-mail to or call __Bjørn Olav Megard__ * Evaluate SFST as speller (and analyzer) lexicon ** more thorough analysis than was possible in Guovdageaidnu * write a background document on the corpus contracts * continue proper name lexicon work and discussion * write public tender documents * review corpus upload user interface * smj G3 issue with __Thomas__ and __Trond__ * sme G3 issue with __Thomas__ and __Trond__ * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Thomas * work on North Sámi compounding and derivation * review corpus usage documentation * smj G3 issue with __Sjur__ and __Trond__ * sme G3 issue with __Sjur__ and __Trond__ !! Tomi * Aspell: Continue working on the affix file & aspell ** Contact aspell author (UTF-8 thing) * corpus infrastructure: ** dtd location (both public and internal) ** cgi-admin script for adding xsl-files * Document aspell and corpus infrastructure * Specification for new catxml in C++ ** install and announce new ccat tool * new proper name lexicon ** remove last part of complex names not used as simplex names ** start looking at conversion of the name lexicon from present format to xml ** discuss the new lexicon format in the newsgroup ** Look into synchronisation of proper names with risten.no ** meeting to arrive at final xml format ** new version of xml2lexc (based on catxml, now ccat) * hyphenation in corpus docs * comment review template made by __Saara__ * [fix bugs!|http://giellatekno.uit.no/bugzilla] * pick up backpacks after Xmas !! Trond * Follow up the lawyer treatment of the contracts * project planning with __Sjur__, continued * smj G3 issue with __Sjur__ and __Thomas__ * sme G3 issue with __Sjur__ and __Thomas__ * document corpus infrastructure, your part * review corpus usage documentation * discuss the new lexicon format in the newsgroup * [fix bugs!|http://giellatekno.uit.no/bugzilla] !!!10. Next meeting, closing 09.01.2006 09:30 Closed at 11:43