!!!Meeting setup

* Date: 11.06.2007
* Time: 09:30 Norwegian time
* Place: Internet
* Tools: SubEthaEdit, iChat/Skype

!!!Agenda

# Opening, agenda review
# Reviewing the task list from last week
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Infrastructure
# Linguistics
# Name lexicon infrastructure
# Spellers
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 09:57.

Present: __Børre, Maaren, Per-Eric, Sjur, Steinar, Thomas, Tomi, Trond__

Absent: __Saara__

Agenda accepted as is.

!!!2. Updated task status since last meeting

!! Børre
* add {{sma}} texts to the corpus repository
** not done
* run all known spelling errors in the prooftest corpus through the speller
** not done
* add extraction of all known spelling errors in the regular corpus (not the {{prooftest}} corpus), and check that they are properly corrected
** not done
* update and fix our documentation and infrastructure as __Steinar__ finds problem areas - low priority
** began work again
* study the Hunspell formalism in detail
** nothing new
* contact ''Davvi Girji / Mikal Aase''
** not done
* install larger disks, new RAM on the G5 when they arrive
** arrived, will be installed as soon as possible
* move list of known bugs to Bugzilla
** not done
* update/check installed file list and paths for Windows
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Inga
* expand the smj typos list
** worked on, still in progress
* add missing smj words
** worked on, still in progress

!! Maaren
* lexicalise actio compounds
* manually mark speller test documents for typos

!! Per-Eric
* expand the smj typos list
** worked on, still in progress
* add missing smj words
** worked on, still in progress

!! Saara
* improve cgi-bin scripts
** done
* add new XSL/XML headers for proofing test docs
** will do this week
* try to add files with Lars to the corpus interface
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!!
Sjur
* run all known spelling errors in the corpus through the speller
** not done, depends on speller test bench improvements
* document the AppleScript testing tool
** not done
* integrate regression self tests with the make file
** not done
* improve speller test bench
** worked on it; problems with the speller test result processing (Perl script)
* integrate the ccat speller testing options in the make file
** worked on it; problems with the speller test result processing (Perl script)
* fix internet setup for __Per-Eric's__ satellite modem
** nothing new
* look over the Bugzilla status mails
** nothing new
* contact ''Davvi Girji / Mikal Aase''
** done
* ask Xerox for a commercial license for the xfst tools on the G5
** not done
* check with Sámi publishing houses whether support for CS2 is still needed
** checked Min Áigi, Áššu and Davvi Girji - CS2 not needed so far
* fix stuorra-oslolaš lower case {{o}}
** topic for the Drag meeting
* {{ö/ä}} vs {{ø/æ}} in speller
** topic for the Drag meeting
* study the Hunspell formalism in detail
** topic for the Drag meeting
* move list of known bugs to Bugzilla
** done
* resend the press release to some channels in Sweden, Finland and Norway
** not done
* publish corpus contracts and project infra as open-source on NoDaLi-sta
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** filed many new ones
* other:
** finished installation of Parallels Desktop, Windows XP, Office 2007 and our Windows proofing tools, for testing the Windows version of the spellers

!! Steinar
* Beta testing: align manually (shorter texts)
* manually mark speller test texts for typos (making them into gold standards), add the texts to the orig/catalogue
** added more texts
* complete the semantic sets in sme-dis.rle
** no work this week
* missing lists
** no work this week
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas
* work with compounding
** worked
* Lack of lowering before hyphen: Twol rewrite.
** not done
* {{smj}}: __öä__ not accepted, only __øæ__ (except for lexicalised names)
** not done
* fix stuorra-oslolaš lower case {{o}}
** not done
* investigate why actios of 3-syllable verbs are not accepted by the speller
** had some help with this, we will see
* investigate why some adverbs of 3-syllable adjectives are not accepted by the speller
** seems to work
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** have barely had time

!! Tomi
* add compounding restrictions to the PLX conversion
** added
* make PLX conversion test sample; add conversion testing to the make file
** not done
* improve prefix and middle-noun PLX conversion
** done
* integrate the {{ccat}} speller testing options in the Makefile
** not done
* first part of multiword expressions not accepted
** not done
* open up compounding for all actios
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** fixed

!! Trond
* work on the web corpus issues
** done some work
* update the {{smj}} proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
** fixed a fatal bug here (1/3 of names restored!), but did no further work on the morphological issue
* go through the Num bugs
** not done
* fix stuorra-oslolaš lower case {{o}}
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla].
** closed several, but opened more, I am afraid

!!!3. Documentation

TODO:
* write form to request corpus user account (__Børre, Sjur, Trond__)
* document how to apply for access to closed corpus, and details on the corpus and its use in general (__Børre, Sjur, Trond__)
* correct and improve it based on feedback from __Steinar__ (__Børre__)

!!!4. Corpus gathering

__Sjur__ spoke to Davvi Girji (DG). We will send them a list of the authors contacted, and DG will send a list of every work DG has ever published (not all of them are available in digital form, though).
TODO:
* {{sme}} texts: no new additions, fix corpus errors during this month (__Børre, Trond, Saara__)
* missing {{nob}} parallel texts should be added if such holes are found (__Børre, Trond__)
* go through the list of missing or erroneous {{nob}} texts, based upon __Saara's__ perfect list (__Børre, Trond__)
* add {{sma}} texts to the corpus repository (__Børre__)
* contact ''Davvi Girji / Mikal Aase'' (__Børre, Sjur__)
** done

!!!5. Corpus infrastructure

Nothing this week either.

!!!6. Infrastructure

__TODO:__
* update and fix our documentation and infrastructure as __Steinar__ finds problem areas (__Børre__)
** working on this one
* fix internet setup for __Per-Eric's__ satellite modem (__Sjur, Børre__)
** this affects iChat, SEE sharing, and ARD connections

!!!7. Linguistics

!!North Sámi

Actio compounds: __Maaren__ and __Duomma__ disagree about what is correct and what is not; this needs to be resolved. We need some more clarifications about the system, which we will work out in Drag.

TODO:
* lexicalise actio compounds. Example: ''vuolggasadji'' vs. ''vuolginsadji'' (__Maaren__)
** both ''vuolgin-'' and ''vuolgga-'' are OK, for example ''vuolggasadji'' and ''vuolgindássi''
** possibly turn on free compounding as part of the PLX conversions (i.e. free compounding in the spellers, but not in the analyzers/transducers)
* fix stuorra-oslolaš lower case {{o}} (__Sjur, Thomas, Trond__)
* open up compounding for all actios (__Tomi__)

!!Lule Sámi

TODO:
* refine {{smj}} proper noun lexica, cf.
the propernoun-smj-lex.txt (__Thomas, Trond__)
* {{ö/ä}} vs {{ø/æ}} in speller (__Thomas, Sjur__)
* lexicalise words from the Olavi missing list, but check against the pdf original where in doubt (__Inga__)
* add normativity issues to our normativity document (__Inga, Thomas__)
* investigate why actios of 3-syllable verbs are not accepted by the speller (__Thomas__)
** norm-lookup does not see these; ordinary lookup does
*** these were grepped out because they contained the string {{SUB}} as part of their lexicon names. The names have now been changed, so it should be fixed, but it needs to be tested in the new speller
* investigate why some adverbs of 3-syllable adjectives are not accepted by the speller (__Thomas__)
** norm-lookup sees some, but not all; ordinary lookup sees them
*** it seems to be fixed, but needs to be tested in the new speller

!!!8. Name lexicon infrastructure

Decisions made in Tromsø can be found in [this meeting memo|/admin/physical_meetings/tromso-2006-08-propnoun.html].

__TODO:__
# fix bugs in lexc2xml; add comments to the log element (__Saara__)
# finish first version of the editing (__Sjur__)
# test editing of the xml files. If ok, then: (__Sjur, Thomas, Trond__)
# generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (__Sjur, Saara__)
# make propernoun-($lang)-lex.txt a file derived from the common xml files (__Sjur, Tomi, Saara__)
# implement data synchronisation between [risten.no|http://www.risten.no] and the cvs repo, and possibly other servers (i.e. the G5 as an alternative server to the public risten.no - it might be faster and better suited than the official one; local installations could also be treated the same way)
# start to use the xml file as source file
# clean terms-sme.xml such that all names have the correct tag for their use (e.g.
@type=secondary) (__Thomas, Maaren, linguists__)
# merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (__linguists__)
# publish the name lexicon on risten.no (__Sjur__)
# add missing parallel names for placenames (__linguists__)
# add informative links between first names like Niillas and Nils (__linguists__)

!!!9. Spellers

!!OOo spellers

__Børre, Sjur, Tomi__ will have a session on this in Drag.

TODO:
* add Hunspell data generation to the lexc2xspell (__Tomi__ - after the PLX data generation is finished)
* study the Hunspell formalism in detail (__Børre, Sjur, Tomi__)

!!Testing

!Spelling Error Markup

Text in other languages should not be marked as spelling errors.

__TODO:__
* manually mark test texts for typos (making them into gold standards) (__Steinar__)
* set up ways of adding meta-information (source info, used in testing or not, added to lexicon or not) (__Saara__)

!Testing tools

__Sjur__ is trying to get the ccat typos option integrated in the test targets in the Makefile. Hopefully done soon.
__TODO:__
* document the AppleScript testing tool (__Sjur__)
* improve speller test bench (__Sjur__)
** integrate the ccat speller testing options in the Makefile (__Sjur, Tomi__)
*** working

!Regression tests

Nothing new.

__TODO:__
* add extraction of all known spelling errors in the corpus (not the {{prooftest}} corpus), and check that they are properly corrected (__Børre, Sjur__)
* test the {{typos.txt}} list, and check that all entries are properly corrected (__Børre, Sjur__)
* consider how to do a regression __self-test__, i.e. how to test the full wordlist (__Børre, Sjur__)
** extract all the base forms in the lexicon, and run them through the speller
** extract all SUB-marked entries, and run them through the lexicon
*** integrate these in the make file (__Sjur__)

!!Lexicon conversion to the PLX format

__TODO:__
* install larger disks, new RAM on the G5 when they arrive (__Børre__)
** received, will be installed soon
* ask for mklex for Linux (victorio) from Polderland (__Sjur__)
** waiting for the offer
* ask Xerox for a commercial license for the xfst tools on the G5 (__Sjur__)
* add compounding restrictions to the PLX conversion (__Tomi__)
** done, seems correct, but needs more testing when a new speller is ready

!Compounding restrictions

Compounding restrictions are now integrated in the PLX conversion, thanks to __Tomi__.

__TODO:__
# improve prefix conversion to PLX (__Tomi__)
## done
# improve middle noun conversion to PLX (__Tomi__)
## done
# improve noun + adjective PLX conversion: (__Tomi__)
## compounding stems - how do we generate them? Using the java client? {{+SgNomCmp+Cmpnd}} = {{sáme–}} should give the correct compounding stem, shouldn't it? We want to __optionally__ go from {{sáme- NLI}} to {{sáme NL}}: {{- NLI (->) NL}}, which means we should be able to extract correct compounding stems using xfst methods only.
### done
## compounding tags - we need to obey them when making the transducers. Suggestion - see above.
### done
# make conversion test sample; add conversion testing to the make file (__Tomi__)
## to regression test / QA the PLX conversion.
### not done

!!Public Beta follow-up

__TODO:__
* fix clitics (__Tomi__)
** done after the release, has to be tested
*** can be tested in the small speller - tested, all accepted: {{Sjurgo buorrege Trosterutge biilago}}
* file list in Windows not complete (__Børre, Sjur__)
* test smj on typos (__Børre__)
** tried, but got an error, thus skipped. Needs to be checked now.
*** error reported to __Saara__
* celebrate
** NOT done - will do in Drag:)
* resend the press release to some channels in Sweden, Finland and Norway (__Sjur__)
** __Per-Eric__ will follow up in Sweden and __Tomi__ in Finland, to make sure we get reasonable coverage, or at least enough users for the beta spellers. [Finnish press coverage|http://www.lapinkansa.fi/uutiset/ulkomaat/142.shtml]. Other Finnish institutions to contact could be:
*** Samiradio (__Tomi__) - they're planning to make a report
*** Sami parliament (__Tomi__)
*** Oulu - giellagas (__Tomi__)
*** Lapin yliopisto - Rantala (__Trond__)
*** Helsingin yliopisto - Seurujärvi-Kari (__Tomi__)
*** KOTUS (__Sjur__)
*** Citysaamit (__Tomi__)
*** Oulun saamelaiset (__Tomi__)
* move list of known errors to Bugzilla (__Børre, Sjur__)
** done

!!!10. Other

!!Summer vacation

When are we taking it? Please fill in the table below:

|| Name || Starting || Ending
| Børre | x | x
| Maaren | 9.7. | 10.8.
| Per-Eric | 9.7. | 20.7.
| Saara | 2.7. | 3.8.
| Sjur | x | x
| Steinar | x | x
| Thomas | 9.7. | 12.8.
| Tomi | 9.7. | 5.8.
| Trond | 2.7. | 12.8., but working at the end

Divvun people also need to send the dates to __Julie Eira__ or __Ellen Mienna Guttorm__.
!!Corpus contracts

TODO:
* publish corpus contracts and project infra as open-source on NoDaLi-sta (__Sjur__)

!!Bug fixing

When fixing bugs, record the version number containing the fix in the Bugzilla bug report, so that for each bug we know exactly when it should have been fixed, in which file(s) and in which version.

Current status: __56__ open Divvun/Disamb bugs (__21__ of these 56 are speller bugs, __35__ are general bugs), and __23__ risten.no bugs.

__TODO:__
* look over the Bugzilla status mails (__Børre__)

!!The meeting in Drag

The Sámi Parliament board has its meeting June 19-21. We should use Monday 18 June as our travel day, and return on Friday 22 June. Fly to Bodø, and go by rental car from there. It is also possible to go by car all the way from Tromsø, and it is even faster.

Those going to Bodø are (at least):
* Maaren (?)
* Sjur
* Tomi

Topics for Drag:
* two-level fixes (stuorra-oslolaš)
* OOo/Hunspell
* QA session
* Actio compounding clarifications
* smj work in general
* loan words in -áhta or -áhtta (example: ''advokáhtta'' or ''advokáhta'')

SD-ráddi presentation (1 hour):
* demo Divvun
* demo risten.no
* operation of Divvun
* operation of risten.no
* extension/new project (i.e. operation)
* South Sámi
* terminology development
* parallel corpus
* Nordic cooperation

__Sjur__ will order rooms for all (except Per-Eric) at Hamarøy Hotell, with a meeting room either at the hotel or at Árran. Beds are needed as follows:
* Monday: Sjur, Maaren, Tomi
* Tuesday: Sjur, Maaren, Thomas, Tomi, Trond, Børre
* Wednesday: Sjur, Maaren, Tomi, Børre (not at Hamarøy Hotell - it is full)
* Thursday: Sjur, Maaren, Tomi, Børre

__TODO:__
* order rooms (__Sjur__)
* order meeting room (__Sjur__)
* plan presentation (__Sjur__)

!!A commercial

An alternative to the Xerox compiler is coming up, in:
* [Haifa|http://cl.haifa.ac.il/projects/fstfsa/]
* [FSMNLP conference in September|http://www.ling.uni-potsdam.de/fsmnlp2007/index.php?show=1]

!!!11.
Next meeting, closing

The next meeting is 25.6.2007, 10:30 Norwegian time (or possibly in the afternoon).

The meeting was closed at 11:28.

!!!Appendix - task lists for the next week

!! Børre
* add {{sma}} texts to the corpus repository
* run all known spelling errors in the prooftest corpus through the speller
* add extraction of all known spelling errors in the regular corpus (not the {{prooftest}} corpus), and check that they are properly corrected
* update and fix our documentation and infrastructure as __Steinar__ finds problem areas - low priority
* study the Hunspell formalism in detail
* follow-up contact with ''Davvi Girji''
* install larger disks, new RAM on the G5
* update/check installed file list and paths for Windows
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren
* lexicalise actio compounds
* manually mark speller test documents for typos

!! Per-Eric
* expand the smj typos list
* add missing smj words
* contact media in Sweden about the beta release

!! Saara
* add new XSL/XML headers for proofing test docs
* try to add files with Lars to the corpus interface
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur
* run all known spelling errors in the corpus through the speller
* document the AppleScript testing tool
* integrate regression self tests with the make file
* improve speller test bench
* integrate the ccat speller testing options in the make file
* fix internet setup for __Per-Eric's__ satellite modem
* look over the Bugzilla status mails
* ask Xerox for a commercial license for the xfst tools on the G5
* check with Sámi publishing houses whether support for CS2 is still needed
* resend the press release to some channels in Sweden, Finland and Norway
* publish corpus contracts and project infra as open-source on NoDaLi-sta
* study the Hunspell formalism in detail
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!!
Steinar
* Beta testing: align manually (shorter texts)
* manually mark speller test texts for typos (making them into gold standards), add the texts to the orig/catalogue
* complete the semantic sets in sme-dis.rle
* missing lists
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas
* work with compounding
* Lack of lowering before hyphen: Twol rewrite.
* {{smj}}: __öä__ not accepted, only __øæ__ (except for lexicalised names)
* fix stuorra-oslolaš lower case {{o}}
* add normativity issues to our normativity document
* test the new speller for actios of 3-syllable verbs and adverbs of 3-syllable adjectives
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Tomi
* make PLX conversion test sample; add conversion testing to the make file
* integrate the {{ccat}} speller testing options in the Makefile
* first part of multiword expressions not accepted
* open up compounding for all actios
* contact Finnish institutions about the speller beta release
* study the Hunspell formalism in detail
* add Hunspell data generation/conversion
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond
* work on the web corpus issues
* update the {{smj}} proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
* [fix bugs!|http://giellatekno.uit.no/bugzilla].