!!!Meeting setup * Date: 16.01.2006 * Time: 09.30 Norw. time * Place: Wherever we are :-) * Tools: iChat, SubEthaEdit !!! Agenda # Opening, agenda review # Reviewing the task list from two weeks ago # Documentation - divvun.no # Corpus gathering # Corpus infrastructure # Linguistics # name lexicon infrastructure # Other issues # Summary, task lists # Closing !!!1. Opening, agenda review, participants Opened at 09:45. Present: __Børre, Maaren, Saara, Sjur, Trond__ Absent: __Thomas, Tomi__ (sick leave until Jan. 17th) Main secretary: __Trond__ Agenda accepted as is. !!!2. Reviewing the task list from the last meeting !! Børre * Gather public texts, preferrably also parallel ones * Contact Odin editor (__Ove Sæth__) to ask for source (and parallel) documents ** Waiting for Trond * Continue converting text from input format to our xml * review code and documentation for corpus xsl files under version control * make an XML test lexicon for our new name lexicon; format is based on the latest meeting memo ** Some work done * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Maaren * work with risten.no **done * review corpus usage documentation, and the usage of the corpus **not done, do not know where to find it. Can`t read my emails...sorry.... * discuss with relevant people regarding seminar on proofing tools, normativity and SGL in February/March, including place. **done. Possible in February or at the beg. of March !! Saara * Convert the name lexicon from present format to xml for testing; final conversion pending the editing fascilities ** Script is ready, the test conversion will be done today. * continue discussion on the new lexicon format * Refine language detection for Finnish ** not done * Finnish the review of the hyphenation detection. ** not done * Review the handling of xsl-files in corpus infrastructure, including version control ** work in progress. the updating xsl-files and cgi-infrastructure will be finished this week, the final testing during the meeting week. * Do some testing for bug [#211|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=211] ** not done * xml2lexc update to handle complex names: construct entries like we have now from the different parts of a complex name entry ** not done * update preprocess to handle inflected forms of complex names ** done but needs to be optimized. * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Sjur * Follow up the lawyer treatment of the contracts ** done * Plan the forthcoming seminar ** nothing but the dates and place * Lule Sámi twol problems, with __Thomas__ and __Trond__ * follow up on voice group-chat not working to Sámediggi ** Test Marratech when the new Marratech server is in place * project planning with __Trond__, continued ** also look at the development processes - specification and testing * Follow up on place names from Norge Digitalt ** write an e-mail to or call __Bjørn Olav Megard__ * Evaluate SFST as speller (and analyzer) lexicon ** more thorough analysis than was possible in Guovdageaidnu * write a background document on the corpus contracts * continue proper name lexicon work and discussion * public tender: ** received offer for handling the whole process from [Finnut Consult AS|http://www.finnut.no/] * follow up new server - it's not delivered yet ** it was delivered Friday afternoon * smj G3 issue with __Thomas__ and __Trond__ * sme G3 issue with __Thomas__ and __Trond__ * call EDD/__Christian Emil Ore__ about national place name lexicon ** tried calling him several times, but he has been ill, I've started on an e-mail to him instead * risten.no/name lexicon development: fix bugs, continue development * [fix bugs!|http://giellatekno.uit.no/bugzilla] * various: ** commented a letter from __Hallgeir Varsi__ to SD, regarding the amount of work needed on our side to support the Kvensk project ** wrote monthly reports for November, December !! Thomas Sick leave. !! Tomi Sick leave. !! Trond * Contact Odin editor (__Ove Sæth__) to reopen contacts ** Forgot it. * Plan the forthcoming seminar ** Private thoughts, yes, must now flesh them out. * sign contract with Bibelselskapet for Norwegian parallel texts ** Not done. * project planning with __Sjur__, continued ** Not done. * smj G3 issue with __Sjur__ and __Thomas__ ** Not done. * sme G3 issue with __Sjur__ and __Thomas__ ** Not done. * document corpus infrastructure, your part ** Not done. * review corpus usage documentation (ccat) ** Not done. * discuss the new lexicon format in the newsgroup ** Not done. * [fix bugs!|http://giellatekno.uit.no/bugzilla] ** Not done. * This revision was not too impressive, instead I have worked on disambiguation. !!!3. Documentation !!Reviews Add documentation on our corpus infrastructure and our corpus work in general (__Børre__, __Tomi__, __Trond__). For the basic corpora, we need 2 additional types of documentation, or doc for 2 target groups: # For the __users/linguists:__ What corpus are found, how do I use them (this info is now scattered) (Part of the HOWTO USE is documented in the catxml docu, what documents are found where etc + an overall documentation is not written, since the corpus is so sparsely populated) ## catxml done, which is what is needed mostly. Do we need more? ## Review: __Thomas, Maaren, Linda, Ilona, Trond__ __Saara__ will update the user documentation, and add new if necessary. We will do the review as part of the meeting next week. Findings so far: Basic text presentation works fine, but the text-type options does not. !!!4. Corpus gathering !!Contracts Ready - start using them! !!Collecting List of people/organisations/companies to contact to be found in an [old meeting memo|/admin/weekly/2005/Meeting_2005-09-05.html#5.+Corpus+gathering]. Based on those, here's an updated list: # Anders Kintel (__Børre__) # Newspaper text: ## Sámi Instituhtta's (for the old archive of Min Áigi and Áššu) (__Børre__) ## Áššu has been making a CD since the end of May, there should be a pile there. __Børre__ suggests that they send us the CDs they have, so that we may look at them, and ensure that the routines work, and that we are able to utilize their format. (__Børre__) ## Min Áigi (__Børre__) # Commercially published texts ## Iđut and key authors there (__Børre__) ## Davvi Girji and key authors there (__Børre__) ## Author organisations' meetings (__Børre__) ## Key authors one by one ### (list of author names) Kerttu Vuolab, Kirsi Paltto, ... List of texts with lower priority (to be gathered when the above list is more or less fixed) * the Sámi municipalities, * Authors with smaller production * Textbooks TODO: a ''lot'' for __Børre!__ !!Odin We want both parallell and errouneous (unproofed) text files. What we need is a direct contact with Odin, in order to have as good coverage as possible. TODO: __Trond__, and then __Børre__ to call __Ove Sæth__ to re-establish contact. !Bible texts We have received Norwegian texts and a contract draft from Bibelselskapet, in essence requiring a separate contract for internet use of extracts of the text. The texts are in the paratext format, an international standard for the bible in electronic form. Most translations are available in this format. TODO: __Trond__ will accept the contract as is, and then negotiate a separate contract with them for use with the online, searchable parallel corpus when it is ready. !!!5. Corpus infrastructure Task list: # Include the xsl files under version control ## RCS version control is almost finished, but an issue with access control is still open. Discussed a bit in the meeting, but nothing conclusive. We'll continue the discussion in the newsgroup. # Incorporate language detection as part of the corpus processing (__Saara__) ## Almost finished. Needs improved Finnish language model - presently it isn't able to distinguish Finnish from Sámi (proving the family bonds:-) # we need to review whether only automatic hyphen detection is good enough, or whether manual post-processing in some form is needed. Delayed until we have some results to base the review on. ## Acceptable results: 90% of all real hyphens correctly tagged. # CGI-admin script to add xsl-file to a corpus file that doesn't have one (__Tomi__) ## __Saara__ will review the existing code, consult __Tomi__, and try to make a script to help __Børre__ utilize the template and the infra in place. We will have a major review of all these things next week. !!!6. Linguistics !!North Sámi * three-part compounds issue still open ** Thomas is finished with three-part compounds for North Sámi. *** compilation time has increased to 10m for the sme parser as a whole. This is probably linked to the three-part issue, but we are not sure about that. *** it overgenerates wildly. This is bad for several of our applications. ** compound shortening issue is waiting for the SGL to make a normative decision *** we have to wait until SGL´s meeting protocol (???) is ready ** linguistic analysis/discussion to continue in the newsgroup ** we should include the members-to-be of the Sámi Giellalávdegoddi (SGL) in this discussion, but how? We need a common meeting / normativity seminar / seminar on proofing issues and on language technology in a broader sense with them. Then we also need to open the linguistic issues newsgroup we talked about earlier, and set up their computers and teach them to use the newsgroup. This seminar needs to include the Lule Sámi section of SGL (others are welcome too). *** Laila has informed the members of the SGL about this. This is very okei - open the linguistic issues newsgroup for the members of SGL, but we have to know how to do it. Seminar also very okei, but we have to wait until the new SGL is elected. ** Timetable: *** The Sámediggiráđđi is going to appoint new members in January (__Johan Mikkel Saara__ is asking people) *** not done yet, perhaps this week *** Contact the SGL when it is elected, and ask them to arrange a normativity seminar, with invited keypersons. Have to wait until it is elected. Initially we planned for February, but it is too early, because SGL is not elected yet. Divvun wants the seminar as early as possible. * number project still open (see below) * diphthong simplification/G3 issue should be carried over from Lule Sámi. ** TODO: *** __Trond, Thomas and Sjur__ to discuss this this month. *** Writing the rules (copy from smj, adjust) (__Sjur, Thomas__ and __Trond__) *** Bug 77 - clearify whether ''háliid'' is acceptable - it is found, thus we need to be able to analyse it, but only as !SUB: !!Lule Sámi Open tasks: * Derived G3 that looks like G2 are still open. ** __Thomas, Trond and Sjur__ should meet again to finish the rest. !!Numerals The following North Sámi linguistic issues should be settled before going into the numeral project: # Three-part compounds # Diphthong simplification # Derivation These issues are recently done in Lule Sámi, and it is more efficient to complete them in North Sámi directly thereafter instead of beginning a new topic Numeral treatment is on different level in the existing sme and smj parsers, but the issue itself is common to the two langauges, and should therefore be treated in parallel. Numerals in North Sámi: Inventory is listed elsewhere. Numerals in Lule Sámi: There are 70 lines of code setting up the structure for case inflection of basic numerals. !!!7. Name lexicon infrastructure !!Complex names Task list for this issue: * make sure xml2lexc can handle complex names in ways compatible with our present tool chain (=reconstruct the lexc format we have now) (__Saara__) ** __Tomi__ is working on a C++ version of the xml2lexc tool. He will put it into CVS soon (it is actually catxml, can easily be adopted to the xml2lexc requirements) * the resulting file format should be identical to our present prop-name file (=lexc), that can then be converted to our new xml format using the same script as for the regular names (__Tomi__ or __Saara__, but only when the technical details are settled) * The core issue: The preprocess script can handle uninflected (A + B) compounds, but not (A + inflected B). __Saara__ will add the analyzer as part of the preprocess, to be able to handle inflected cases as well. !!XML format We had our meeting, and the result was pretty close to the structure of the existing risten.no term base. Now the conversion script to xml needs to be updated. The new format is documented in the newsgroup. Tasks: # make a test lexicon for evaluating the format, set up the editing, and test it (__Saara__) # update conversion from lexc to xml to reflect new xml format (__Saara__) ## mostly done, some open questions left # testing of conversion # continue the discussion of the name lexicon format (__Saara, Tomi, Sjur, Trond__) # implement a prototype in eXist # eXist as editor: ## develop the needed XQueries and interface ## synchronisation between risten.no and ## test whether eXist as editor is actually working well !!!8. Other !!SGL Seminar * SGL/normativity seminar ** all members = potentially/likely all languages *** not all languages, only North Sámi ** date? As early as possible, end of February/beginning of March ** place? __Maaren__ will investigate !!Divvun/Disamb Seminar in Tromsø * project meeting ** date: 23. (after lunch) - 27. (Friday is travelling day) januar. ** place: Tromsø ** still too many open questions regarding place+date, to be determined in a separate meeting later this week, when all are available. Maaren is able to attend Monday morning and Tuesday (all day) Sjur will probably arrive in Tromsø at lunchtime Monday (7.25 from HKI, TOS at 11.00). __Trond__ will check with Linda and Ilona, trying to have a Meeting startup after lunch on Monday. Saara will stay one day shorter/less, remember to adjust the schedule accordingly Practical arrangements: * room(s): One big room for all, with internet access, and at least one additional room, with whiteboards or similar and projector. * lunch & coffee breaks, think of how to arrange this. * hotel = Grand Hotel/Polar Hotel, Grønnegata, Scandic is close to the University ** __Sjur__ will check which ones are acceptable to SD ** these need rooms at the hotel: Ilona, Saara, Linda, Sjur, Maaren Suggested content for project meeting: * Common ** Presentation, kick-off ** Project updates ** Project milestones ** Cooperation evaluation (practical/daily cooperation) *** project management *** programmer work *** linguistic work *** anything else? * Divvun ** Project milestones ** Evaluation/feedback ** public tender meeting * Disamb ** Project milestones ** Tagsets (Evaluating and fixing (?) the tagset for the disambiguator) * Linguists ** G3 smj/sme ** Compilation time and twol ruleset ** Numeral project ** spelling errors (esp. in risten.no) ** Routines for addition to lexicon, cooperation wrt. new corpus texts coming in * Programmers ** Proper names: *** xml *** eXist/editing ** Work distribution (in bug fixing, documentation, maintenance etc.) ** Parallel corpora and the corpus infrastructure ** xml2lexc implementation plan Teaching sessions (list of free thoughts and personal frustrations). There must be one teacher and at least two pupils before we run a course. * How to use the xml corpora (all? more like a group review session - 1h) * Use of Xerox tools (t: Trond; p: RTFW) * Twig (t: Saara; p: Trond) * XQuery/XSL (t: Sjur/Tomi(?); p: Trond, Børre, Saara) * xml webapp development, security and session integrity (group work: p: Sjur, Børre, Tomi, Saara) * file-specific xsl scripts for corpus conversions, the what, how and who issues (t: Saara/Tomi; p: Trond, Børre, Sjur) Final schedule to be worked out by __Trond__ and __Sjur__ (Tuesday 10 AM?). The above plan will be copied to a separate document which will become the starting point for further work on the seminar planning. !!Technical issues * The mac os / perl bug (at least __Trond__ and __Sjur__ has it, [Bugzilla #211|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=211]): ** utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line 82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl 5.8.6). It is probably a perl - OS mismatch. (__Trond__, __Saara__, __Tomi__) *** 10.4 introduced support for locales in the shell (10.3 and earlier didn't know about locales) ** Test: the result of the last line should indicate whether this is a problem in cat or a Perl/OS mismatch. {{{ preprocess file_name.txt OK cat file_name.txt | preprocess !! catxml file_name.xml | preprocess ?? }}} After some heavy investigation, we got no further. There is no difference between different bash versions (2.0x, 3.0x), neither between different locales as long as they are UTF-8, and whether it is stored in LANG or LC_ALL. One new insight though: both cat and print gives the same errors, thus indicating that the error is NOT strictly reltated to cat. Is it after all a conflict between the locale support in OS X and perl? !!Bug fixing __29__ open bugs (and 25 risten.no bugs) !!C implementation of preprocess.pl Do we want to have a C/C++ implementation for speed reasons? Is it going to be included in the spellers? !!!9. Summary, task list !! Børre * send out contracts with accompanying letter * Gather public texts, preferrably also parallel ones * Contact Odin editor (__Ove Sæth__) to ask for source (and parallel) documents * Continue converting text from input format to our xml * review code and documentation for corpus xsl files under version control * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Maaren * work with risten.no * discuss with relevant people regarding seminar on proofing tools, normativity and SGL in February/March, including place. !! Saara * Convert the name lexicon from present format to xml for testing; final conversion pending the editing fascilities * continue discussion on the new lexicon format * Refine language detection for Finnish * Finnish the review of the hyphenation detection. * Review the handling of xsl-files in corpus infrastructure, including version control * Do some testing for bug [#211|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=211] * optimize the preprocess script * Write/update user documentation for the corpus usage in preparation for the review in the project meetings next week. * finalize an improved working version of the CGI and command line scripts for corpus additions * xml2lexc update to handle complex names: construct entries like we have now from the different parts of a complex name entry * update conversion from lexc to xml (proper names) with the latest refinements * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Sjur * Follow up the lawyer treatment of the contracts * Project seminar ** plan and make schedule with __Trond__ ** check which hotels SD has an agreement with ** plan XQuery/XSLT training session * Lule Sámi twol problems, with __Thomas__ and __Trond__ * follow up on voice group-chat not working to Sámediggi ** Test Marratech when the new Marratech server is in place * project planning with __Trond__, continued ** also look at the development processes - specification and testing * Follow up on place names from Norge Digitalt ** write an e-mail to or call __Bjørn Olav Megard__ * Evaluate SFST as speller (and analyzer) lexicon ** more thorough analysis than was possible in Guovdageaidnu * write a background document on the corpus contracts * check for SD feedback on the last two contracts (2 & 3) * continue proper name lexicon work and discussion * public tender: ** review offer from [Finnut Consult AS|http://www.finnut.no/] * smj G3 issue with __Thomas__ and __Trond__ * sme G3 issue with __Thomas__ and __Trond__ * call EDD/__Christian Emil Ore__ about national place name lexicon * risten.no/name lexicon development: fix bugs, continue development * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Thomas * work on North Sámi compounding and derivation * review corpus usage documentation * smj G3 issue with __Sjur__ and __Trond__ * sme G3 issue with __Sjur__ and __Trond__ !! Tomi * Aspell: Continue working on the affix file & aspell ** Contact aspell author (UTF-8 thing) * corpus infrastructure: ** dtd location (both public and internal) ** cgi-admin script for adding xsl-files * Document aspell and corpus infrastructure * Specification for new catxml in C++ ** install and announce new ccat tool * new proper name lexicon ** remove last part of complex names not used as simplex names ** start looking at conversion of the name lexicon from present format to xml ** discuss the new lexicon format in the newsgroup ** Look into synchronisation of proper names with risten.no ** meeting to arrive at final xml format ** new version of xml2lexc (based on catxml, now ccat) * hyphenation in corpus docs * comment review template made by __Saara__ * [fix bugs!|http://giellatekno.uit.no/bugzilla] * pick up backpacks after Xmas !! Trond * Contact Odin editor (__Ove Sæth__) immediately to reopen contacts * Project seminar ** plan and make schedule with __Sjur__ ** check with Linda and Ilona whether we can start on Monday after lunch * sign contract with Bibelselskapet for Norwegian parallel texts * document corpus infrastructure, your part * review corpus usage documentation (ccat) * discuss the new lexicon format in the newsgroup * smj G3 issue with __Sjur__ and __Thomas__ * sme G3 issue with __Sjur__ and __Thomas__ * [fix bugs!|http://giellatekno.uit.no/bugzilla] !!!10. Next meeting, closing 30.01.2006 09:30 Closed at 11:43