!!!Meeting setup

* Date: 06.02.2006
* Time: 09.30 Norw. time
* Place: Wherever we are :-)
* Tools: iChat, SubEthaEdit

!!! Agenda

# Opening, agenda review
# Reviewing the task list from two weeks ago
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Linguistics
# name lexicon infrastructure
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 09:38.

Present: __Børre, Saara, Sjur, Tomi, Trond__

Absent: __Maaren, Thomas__

Main secretary: __Trond__

Agenda accepted as is, we'll try to finish by 10.55, to allow for joining the celebration of the Sámi national day.

!!!2. Reviewing the task list from the last meeting

!! Børre

* send out contracts with accompanying letter
** Sent to Iđut and Kåfjord municipality
* Gather public texts, preferably also parallel ones
** Not done
* Continue converting text from input format to our xml
** Not done
* review code and documentation for corpus xsl files under version control
** Not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** Not done
* Other
** The server didn't get an IP address via DHCP. It turned out that when the «Gateway Assistant» was used, the network card connected to «the outside world» didn't get an IP address via DHCP. Even after all the services started by the «Gateway Assistant» were turned off, the behaviour persisted. The «solution» was to reinstall Mac OS X Server and not use the «GA».

!! Maaren

* work with risten.no
* discuss with relevant people regarding seminar on proofing tools, normativity and SGL in February/March, including place.

!! Saara

* continue discussion on the new lexicon format
* Refine language detection for Finnish
* Finish the review of the hyphenation detection.
* Review the handling of xsl-files in corpus infrastructure, including version control
** almost done, I'll need some help with the xsl-processing of the main text.
* Fix the preprocess script and optimize it by building an analyzer for the multi-part expressions.
** it seems that building a preprocessor-specific analyzer is not possible.
* finalize an improved working version of the CGI and command line scripts for corpus additions
** almost done.
* update conversion from lexc to xml (proper names) with the latest refinements
* Try to add numeral treatment as part of the analyzer.
** not done
* Change character coding detection to paragraph-based.
** done, use convert2xml.pl with option --multi-coding. This introduces some errors (due to the small size of some paragraphs), so the default is still one coding per file.
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur

* Follow up the lawyer treatment of the contracts
** not done
* Lule Sámi twol problems, with __Thomas__ and __Trond__
** not done
* follow up on voice group-chat not working to Sámediggi
** Test Marratech when the new Marratech server is in place
*** not done
* project planning with __Trond__, continued
** also look at the development processes - specification and testing
*** not done
* Follow up on place names from Norge Digitalt
** write an e-mail to or call __Bjørn Olav Megard__
*** not done
* Evaluate SFST as speller (and analyzer) lexicon
** more thorough analysis than was possible in Guovdageaidnu
*** not done
* write a background document on the corpus contracts
** not done
* continue proper name lexicon work and discussion
** did a lot to upgrade the risten.no infrastructure to be multi-collection aware
** discussions in the newsgroup
** added the test lexicons __Saara__ created to my own instance of risten.no
* public tender:
** waited for and received a draft public tender document from __Finnut__
* smj G3 issue with __Thomas__ and __Trond__
** not done
* sme G3 issue with __Thomas__ and __Trond__
** not done
* call EDD/__Christian Emil Ore__ about national place name lexicon
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** closed [bug #217|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=217]

!! Thomas

On sick leave.

!! Tomi

* Aspell: Continue working on the affix file & aspell
** Contact aspell author (UTF-8 thing)
*** Not done
* corpus infrastructure:
** dtd location (both public and internal)
*** Not done
** cgi-admin script for adding xsl-files
*** Not done
* Document aspell and corpus infrastructure
* ccat: add a -v option - it should return the version of the tool
** Done
* new proper name lexicon
** remove last part of complex names not used as simplex names
*** Not done
** start looking at conversion of the name lexicon from present format to xml
** discuss the new lexicon format in the newsgroup
** Look into synchronisation of proper names with risten.no
*** Some progress
** new version of xml2lexc (based on catxml, now ccat)
*** Not done
*** xml2lexc update to handle complex names: construct entries like we have now from the different parts of a complex name entry
* comment review template made by __Saara__
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond

* Work on corpus texts with Børre.
** Made some progress with the processing of texts.
* 3-part compounds with __Sjur__ and __Thomas__.
** Had a look at the rule set myself, but awaiting Thomas.
* smj G3 issue with __Sjur__ and __Thomas__.
** Not done.
* sme G3 issue with __Sjur__ and __Thomas__.
** Not done.
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** Not done.
* Worked mostly on disambiguation.

!!!3. Documentation

!!Reviews

Anything? Nothing.

!!Other docu

Anything? Documentation has been added for disambiguation.

!!!4. Corpus gathering

!!Collecting

See a [previous meeting memo|Meeting_2006-01-16] for what's to be done.

Sent letter to Iđut and Kåfjord.

TODO: Still a ''lot'' for __Børre!__

!!Odin

Waiting for __Sæth__ to discuss with colleagues how to implement the cooperation, and to get back to us. Nothing heard.

!!Bible texts

TODO:

* write a paratext2xml converter
** __Tomi__ has already done it!
** Excellent!
** files requiring this converter should have the filename extension __.ptx__
*** Cf. the following nob Old Testament texts: 01GENNBST.u8.PTX, 19PSANBST.u8.PTX
** __Børre__ will review the converter as part of adding the Norwegian texts to our corpus
* convert smj NT to paratext. (__Børre__)
* ask to get fin and swe NT and OT in paratext format. (__Trond__)

!!!5. Corpus infrastructure

Task list:

# Include the xsl files under version control
## RCS version control is almost finished, but an issue with access control is still open. Discussed a bit in the meeting, but nothing conclusive. We'll continue the discussion in the newsgroup.
# Incorporate language detection as part of the corpus processing (__Saara__)
## Almost finished. Needs an improved Finnish language model - presently it isn't able to distinguish Finnish from Sámi (proving the family bonds :-)
# we need to review whether only automatic hyphen detection is good enough, or whether manual post-processing in some form is needed. Delayed until we have some results to base the review on.
## Acceptable results: 90% of all real hyphens correctly tagged.
# CGI-admin script to add xsl-file to a corpus file that doesn't have one (__Saara__)

Things are moving forward, but still more work to do. The list is left as is.

E-mail address in case of upload errors: corpus@giellatekno.uit.no (-> Børre?) Also for reports about new uploads.

* /www/opt/www/cgi-bin/smi/upload.cgi (no Forrest)
* http://localhost:8888/upload/upload_corpus_file.html (Forrest)

One option is to ask the cochise team, that would be royd or steinar and the address cc.uit.no.

* Problems with Greek letters in Word documents using the font Sam Times Uni(versal)
** (__Børre__) Can't we just manually change the letters and fonts in the few documents affected?

We forget about these texts for the time being; they'll be put in a dir. for broken texts. Such texts can be looked at later, if wanted/needed.

!!Suggestion for Script for text analysis.
We would like a shadow catalogue ga/ (giella analysed) parallel to the gt/ catalogue, with one file for each of the six directories.

A way of getting this is to make a crontab job that each night (afternoon!) runs the following command for each directory admin, bible, facta, ficti, laws, news:

{{{
ccat -a -r /usr/local/share/corp/gt/sme | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg | vislcg --grammar src/sme-dis.rle > /usr/local/share/corp/ga/sme/dir.txt

For example:

ccat -a -r /usr/local/share/corp/gt/sme | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg | vislcg --grammar src/sme-dis.rle > /usr/local/share/corp/ga/sme/admin.txt
}}}

* Today: /usr/local/share/corp/gt/sme/DIR(/*)/*xml
* Addition: /usr/local/share/corp/ga/sme/dir.txt

TODO:

* Look at the suggestion from __Trond__ (__Saara__, discuss with __Trond__ if unclear)
* ask for e-mail address as specified above (__Trond__)

!!!6. Linguistics

Anything? Nothing.

!!!7. Name lexicon infrastructure

!!Complex names

TODO:

* make sure xml2lexc can handle complex names in ways compatible with our present tool chain (=reconstruct the lexc format we have now) (__Tomi__)
** the resulting file format should be identical to our present prop-name file (=lexc), that can then be converted to our new xml format using the same script as for the regular names (__Tomi__ or __Saara__, but only when the technical details are settled)
* __Saara__ has added the analyzer as part of the preprocess, but it is slow, and needs to be optimized.
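Returning to the nightly ga/ job suggested under section 5: the per-directory runs could be wrapped in a small shell script, roughly as sketched below. This is only an illustration, not an agreed implementation. The tool names, options and the gt/ and ga/ paths are copied from the suggestion; the variables GT_ROOT, GA_ROOT and DRY_RUN and the functions ga_command and ga_all are made up for this sketch. It also assumes that each run should read only its own subdirectory gt/sme/DIR (as the "Today:" line hints), which the memo does not confirm. By default the script only prints the commands, so it can be inspected before anything is run.

```shell
#!/bin/sh
# Sketch of the nightly ga/ job. Tool names, options and default paths come
# from the memo; GT_ROOT, GA_ROOT, DRY_RUN, ga_command and ga_all are
# hypothetical names introduced for this example.

GT_ROOT="${GT_ROOT:-/usr/local/share/corp/gt/sme}"
GA_ROOT="${GA_ROOT:-/usr/local/share/corp/ga/sme}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# Print the analysis pipeline for one corpus directory.
# Assumption: each run reads only its own subdirectory of gt/sme.
ga_command() {
    dir="$1"
    printf '%s\n' "ccat -a -r $GT_ROOT/$dir | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg | vislcg --grammar src/sme-dis.rle > $GA_ROOT/$dir.txt"
}

# Print (or, with DRY_RUN=0, execute) the pipeline for every directory.
ga_all() {
    for dir in admin bible facta ficti laws news; do
        if [ "$DRY_RUN" = "1" ]; then
            ga_command "$dir"
        else
            sh -c "$(ga_command "$dir")"
        fi
    done
}

ga_all
```

Since bin/abbr.txt, bin/sme.fst and src/sme-dis.rle are relative paths, a crontab entry would have to change to the gt working directory first, for instance {{30 16 * * * cd $HOME/gt && DRY_RUN=0 sh ga-nightly.sh}}. The time, the directory and the script name in that entry are placeholders.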
!!XML format

TODO:

# update conversion from lexc to xml to reflect new xml format (__Saara__)
## mostly done, some open questions left
# testing of conversion
# eXist as editor:
## develop the needed XQueries and interface
## data synchronisation between risten.no and
## test whether eXist as editor is actually working well

More TODO:

* read and comment in the news group (__all__)
* decide upon and set up infra for new projects and project ideas

Definitions/terminology:

* __synchronisation__ in our context is ''data'' synchronisation, that is, to bring the two repositories (CVS and risten.no) in sync regarding the shared lexicons (propnouns at least).
* __code refactoring__ is the process of reorganising the code by moving general functions and snippets out of specific functions and into general libraries for shared access, easier maintenance, and a better organisation.

!!!8. Other

!!SGL Seminar

* SGL/normativity seminar
** all members = potentially/likely all languages
*** not all languages, only North Sámi
** date? As early as possible, end of February/beginning of March
** place? __Maaren__ will investigate

!!Technical issues

* The mac os / perl bug (at least __Trond__ and __Sjur__ have it, [Bugzilla #211|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=211]):
** utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line 82. This message did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl 5.8.6). It is probably a perl - OS mismatch. (__Trond__, __Saara__, __Tomi__)
*** 10.4 introduced support for locales in the shell (10.3 and earlier didn't know about locales)
** Test: the result of the last line should indicate whether this is a problem in cat or a Perl/OS mismatch.
** Is this a problem with ccat?
*** It doesn't seem so (3 min and still counting)
*** In the end, the bug turned up with ccat as well.
*** I gave the command:
*** zcorp/gt/sme/*/*xml | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg | vislcg --grammar=src/sme-dis.rle --minimal | sort | less
*** and it (in the end) responded:

{{{
1729 constraint rules
utf8 "\xA1" does not map to Unicode at /home/trond/gt/script/preprocess line 109, <> chunk 12.
}}}

To ccat's defence I must say that cat, in a similar situation, would have given far more error messages (hold on, testing still under way).

{{{
preprocess file_name.txt              - OK
cat file_name.txt | preprocess        - bug!!
catxml file_name.xml | preprocess     - ??
ccat filename | preprocess            - bug !!
}}}

This bug isn't a high priority any more, because ccat behaves differently from cat, and because it is possible to avoid cat when working locally.

BUG: close as Won't fix. (__Børre__)

!!Bug fixing

__32__ open bugs (and 24 risten.no bugs)

* Add bug report for the Xerox backspace error (__Trond__)

!!!9. Summary, task list

!! Børre

* send out contracts with accompanying letter
* Gather public texts, preferably also parallel ones
* Continue converting text from input format to our xml
* review code and documentation for corpus xsl files under version control
* convert nob and nno bible texts to be used as part of a parallel corpus, and review the paratext2xml converter as part of the conversion
* convert smj NT to paratext
* close bug 211 as WONTFIX
** DONE :-)
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren

* work with risten.no
* discuss with relevant people regarding seminar on proofing tools, normativity and SGL in February/March, including place.

!! Saara

* continue discussion on the new lexicon format
* Refine language detection for Finnish
* Finish the review of the hyphenation detection.
* Review the handling of xsl-files in corpus infrastructure, including version control
* Fix the preprocess script and optimize it by building an analyzer for the multi-part expressions.
* finalize an improved working version of the CGI and command line scripts for corpus additions
* update conversion from lexc to xml (proper names) with the latest refinements
* Try to add numeral treatment as part of the analyzer.
* Look at crontab ga/ directory issue with __Trond.__
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur

* Follow up the lawyer treatment of the contracts
* Lule Sámi twol problems, with __Thomas__ and __Trond__
* project planning with __Trond__, continued
* Follow up on place names from Norge Digitalt
* Evaluate SFST as speller (and analyzer) lexicon
* write a background document on the corpus contracts
* public tender:
** review draft tender document from Finnut
* smj G3 issue with __Thomas__ and __Trond__
* sme G3 issue with __Thomas__ and __Trond__
* call EDD/__Christian Emil Ore__ about national place name lexicon
* risten.no/proper noun lexicon development: fix bugs, continue development
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas

* work on North Sámi compounding and derivation
* review corpus usage documentation
* smj G3 issue with __Sjur__ and __Trond__
* sme G3 issue with __Sjur__ and __Trond__

!! Tomi

* Aspell: Continue working on the affix file & aspell
** Contact aspell author (UTF-8 thing)
* corpus infrastructure:
** dtd location (both public and internal)
* Document aspell and corpus infrastructure
* new proper name lexicon
** remove last part of complex names not used as simplex names
** discuss the new lexicon format and other issues in the newsgroup
** Look into data synchronisation of proper nouns between risten.no and CVS
** new version of xml2lexc (based on ccat), should handle complex names correctly: construct entries like we have now from the different parts of a complex name entry
* comment review template made by __Saara__
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond

* Work on corpus texts with Børre.
* Contact the Finnish and Swedish Bible societies to get Bible texts.
* Look at ga/ directory issue with __Saara.__
* News group discussion followup.
* Do a bug report (if not done) on command-line behaviour in the Xerox tools.
* Ask for e-mail address for corpus upload script
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!!!10. Next meeting, closing

13.02.2006 09:30

Closed at 10:37