!!!Meeting setup

* Date: 12.02.2007
* Time: 09.00 Norwegian time
* Place: Internet
* Tools: SubEthaEdit, iChat

!!!Agenda

# Opening, agenda review
# Reviewing the task list from last week
# Documentation - divvun.no
# Corpus gathering
# Corpus infrastructure
# Infrastructure
# Linguistics
# Name lexicon infrastructure
# Spellers
# Other issues
# Summary, task lists
# Closing

!!!1. Opening, agenda review, participants

Opened at 09:49.

Present: __Børre, Sjur, Steinar, Thomas, Tomi, Trond__

Absent: __Maaren, Saara__

Agenda accepted as is.

__Maaren__ is working Tuesday and Friday this week.

!!!2. Updated task status since last meeting

!! Børre

* write form to request corpus user account
** not done
* document how to apply for access to closed corpus, and details on the corpus and its use in general
** not done
* update and fix our documentation and infrastructure as __Steinar__ finds problem areas
** some done, received feedback from __Steinar__
* continue work on script for automatic testing of the spell checker in Word
** not done
* fix {{sme}} texts in corpus this month
** not done
* find missing {{nob}} parallel texts in corpus
** not done
* work on the Polderland data generation (PLX format conversion)
** Concentrate on compounding
*** done by __Tomi__
* go through other directories, fix parallelity information for other documents
** not done
* add {{sma}} texts to the corpus repository
** not done
* move the G5 to the basement (__Børre__)
** didn't work out last week, because we needed the machine
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren

* lexicalise actio compounds

!! Saara

* fix {{sme}} texts in corpus this month
** mostly done
* continue aligning the rest of the parallel files
** continued
* fix problems with xml2lexc if needed
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur

* name lexicon:
** refactor the rest of the SD-terms editor code
*** refocused to the propernoun things
** implement missing propnouns editing functions
*** started
** implement improvements decided upon in Tromsø
*** some done
* hire linguist and programmer
** not done
* publish corpus contracts and project infra as open-source on NoDaLi-sta
** not done
* fix stuorra-oslolaš lower case {{o}}
** not done
* write form to request corpus user account
** not done
* document how to apply for access to closed corpus, and details on the corpus and its use in general
** not done
* get an Intel Mac for __Tomi__
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** looked at some

!! Steinar

* test our infrastructure and documentation - follow the documentation exactly, and find problem areas - report problems to __Børre__. Start: At the front page.
** continued working, reported some problems
* Complete the semantic sets in sme-dis.rle
** no work this week
* missing lists
** no work this week
* report conversion errors to __Saara__
** not done
* Look at the actio compound issue when adding from missing lists
* lexicalise actio compounds. Example: ''vuolggasadji'' vs. ''vuolginsadji''
* Go through the Num bugs
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas

* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt
** nothing this week
* work with compounding
** nothing this week
* Lack of lowering before hyphen: Twol rewrite.
** nothing this week
* Go through the {{sme}} Num bugs
** done
* fix stuorra-oslolaš lower case {{o}}
** nothing this week
* implement discontinuous case inflection for {{sme}} numbers
** soon finished
* produce correct number base forms in the {{sme}} analyzer
** soon finished
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** worked

!! Tomi

* improve numerals in the speller
** not finished
* add prefixes to the PLX
* add derivations to the PLX generation
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond

* update the {{smj}} proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
** Not done.
* fix {{sme}} texts in corpus this month
** Worked on this; the final conference is over, and the long-term maintenance now starts.
* find missing {{nob}} parallel texts in corpus, go through Saara's list
** Will work more on this, now that we see how useful the tool is.
* report conversion errors to __Saara__
** Done.
* Go through the Num bugs
** Not done.
* implement discontinuous case inflection for numbers
** Not looked into, Thomas has done this
* produce correct number base forms in the analyzer
** Looked at some.
* Write project presentation
** Done.
* [fix bugs!|http://giellatekno.uit.no/bugzilla].

!!!3. Documentation

__Børre__ has fixed some errors in the documentation, otherwise nothing new.

TODO:
* write form to request corpus user account (__Børre, Sjur, Trond__)
* document how to apply for access to closed corpus, and details on the corpus and its use in general (__Børre, Sjur, Trond__)
* correct and improve it based on feedback from __Steinar__ (__Børre__)
** started

!!!4. Corpus gathering

TODO:
* {{sme}} texts: no new additions, fix corpus errors during this month (__Børre, Trond, Saara__)
* missing {{nob}} parallel texts should be added if such holes are found (__Børre, Trond__)
* Go through the list of missing or erroneous {{nob}} texts, based upon __Saara's__ perfect list (__Børre, Trond__)
* add {{sma}} texts to the corpus repository (__Børre__)

!!!5. Corpus infrastructure

!!Alignment

Main news: we have a working parallel corpus online. Notes about the interface (in lieu of documentation): the first search field in the form must be filled in; to get the parallel texts in the search result, click ''add phrase'' and set its language to the other language.

__TODO:__
* go through other directories (nob directories, sd directories), fix parallelity information for other documents (2 hours) (__Børre__)

!!Conversion issues

__TODO:__
* report conversion errors to __Saara__ (__Trond, Steinar__)

!!!6. Infrastructure

__Børre__ and __Steinar__ have both started on the task of testing and correcting the documentation.

__TODO:__
* test our infrastructure and documentation - follow the documentation exactly, and find problem areas - report problems to __Børre__. Start: At the front page. (__Steinar__)
* update and fix our documentation and infrastructure as __Steinar__ finds problem areas (__Børre__)

!!!7. Linguistics

!!Numbers

__Thomas__ is almost finished with correcting the number part of the {{sme}} analyzer.

TODO:
* discontinuous case inflection in {{sme}} (but only for maximally three-part compound numerals) ({{viđain/goalmmát/logiin}} and {{guvttiin/logiin/viđain}}) (__Thomas__)
** soon finished
* produce correct number base forms in the {{sme}} analyzer (__Thomas__)
** soon finished
* Go through the {{sme}} Num bugs (__Thomas__)

!!North Sámi

TODO:
* lexicalise actio compounds. Example: ''vuolggasadji'' vs. ''vuolginsadji'' (__Maaren__)
* fix stuorra-oslolaš lower case {{o}} (__Sjur, Thomas, Trond__)

!!Lule Sámi

TODO:
* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt (__Thomas, Trond__)
* Lack of lowering/fronting before hyphen: Twol rewrite. (__Thomas, Trond__)
** In Bugzilla? Yes. Which #? 350

!!!8. Name lexicon infrastructure

Decisions made in Tromsø can be found in [the meeting memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]

Postponed:

TODO:
# finish first version of the editing (__Sjur__)
## working on it
# test editing of the xml files. If ok, then: (__Sjur, Thomas, Trond__)
# make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (__Sjur, Saara__)
# convert propernoun-($lang)-lex.txt to a derived file from common xml files (__Sjur, Tomi, Saara__)
## wrote a prototype xml2lexc converter in XQuery, just to test the performance and the complexity of the task - the result is quite promising, and might be a viable alternative to {{Perl/XML::Twig}}
# implement data synchronisation between [risten.no|http://www.risten.no] and the cvs repo, and possibly other servers (i.e. the G5 as an alternative server to the public risten.no - it might be faster and better suited than the official one; also local installations could be treated the same way)
## __Sjur__ has a concrete suggestion for how to do this to ensure consistency between the different editors and servers, including emacs; basically it includes the following points, to be executed as a preflight on each commit:
### sort on entry ID using Unicode default (= sorting on character code)
### validate against DTD
### reformat using xmllint - that will ensure consistent whitespace
# start to use the xml file as source file
# clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (__Thomas, Maaren, linguists__)
# merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (__linguists__)
# publish the name lexicon on risten.no (__Sjur__)
# add missing parallel names for placenames (__linguists__)
# add informative links between first names like Niillas and Nils (__linguists__)

!!!9. Spellers

!!Polderland data generation

__TODO:__
# send {{smj}} PLX data to Polderland (__Børre, Tomi__)
## done
# decide how to specify nouns requiring genitive first parts (__Sjur, Thomas, Trond__)
## done
# improve number conversion (__Børre, Tomi__)
## working on it
# add prefixes to the PLX (__Børre, Tomi__)
## not yet
# add derivations to the PLX generation (__Børre, Tomi__)
## next after numbers are fixed
### not yet

!!OOo speller(s)

TODO after the MS Office Beta is delivered:
* add Aspell/Hunspell data generation to the lexc2xspell (__Tomi__ - after the PLX data generation is finished)
* study Hunspell, perhaps also Soikko (__Børre, Sjur, Tomi__)

!!Testing

__TODO:__
* get an Intel Mac for Tomi (__Sjur__)
** not yet

!!Localisation

We need to translate the info added to our front page (and a separate page) regarding the beta release. The press release also needs to be translated.

TODO:
* translate beta release docs to {{sme}} (__Thomas__)
* translate beta release docs to {{smj}} (__Thomas__)

!!Beta release

Tentative beta release: Thursday 15.2., but it might be delayed until later in February, since we still have no beta from Polderland. In the beta, {{sme}} is now Catalan, whereas {{smj}} is Basque.

DONE:
* delivered PLX data of {{sme}} and {{smj}} including compounding
* translated Windows installer to {{sme}} and {{smj}}

TODO:
* write press release (__Sjur__)
* add info to front page (incl. download links) (__Børre__)
* write separate page with detailed info (incl. download links) (__Børre__)
* test the beta release from Polderland thoroughly before it is released (__all__):
** download and installation
** documentation
** technical performance
** linguistic performance:
*** true positives
*** false positives
*** false negatives
*** suggestions
** all tests on both Mac and Win - Windows only (__Børre, Sjur, Thomas__)

!!!10. Other

!!Corpus contracts

TODO:
* publish corpus contracts and project infra as open-source on NoDaLi-sta (__Sjur__)

!!Bug fixing

__57__ open Divvun/Disamb bugs, and __23__ risten.no bugs

!!Moving G5

TODO:
* move the G5 to the basement (__Børre__)

!!The KUNSTI conference

__Thomas__ and __Trond__ were there. The first presentation (for politicians) got some response. The second was more for insiders, i.e. the language technologists, but got only one response, about financing. There was more unofficial feedback, both from Telenor and NTNU on text-to-speech, and from the text-based people on machine translation.

!!!11. Next meeting, closing

The next meeting is 19.2.2007, 09:30 Norwegian time. __Sjur__ will be away next week on winter holidays; __Trond__ or __Børre__ will head the next meeting.

The meeting was closed at 11:12.

!!!Appendix - task lists for the next week

!! Børre

* write form to request corpus user account
* document how to apply for access to closed corpus, and details on the corpus and its use in general
* update and fix our documentation and infrastructure as __Steinar__ finds problem areas
* continue work on script for automatic testing of the spell checker in Word
* fix {{sme}} texts in corpus this month
* find missing {{nob}} parallel texts in corpus
* work on the Polderland data generation (PLX format conversion)
* go through other directories, fix parallelity information for other documents
* add {{sma}} texts to the corpus repository
* move the G5 to the basement
* add info to front page (incl. download links)
* write separate page with detailed info (incl. download links)
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Maaren

* lexicalise actio compounds

!! Saara

* fix {{sme}} texts in corpus this month
* continue aligning the rest of the parallel files
* fix problems with xml2lexc if needed
* have some holiday first
* start improving the corpus interface for Sámi in Oslo.
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Sjur

* name lexicon:
** refactor the rest of the SD-terms editor code
** implement missing propnouns editing functions
** implement improvements decided upon in Tromsø
* hire linguist and programmer
* publish corpus contracts and project infra as open-source on NoDaLi-sta
* fix stuorra-oslolaš lower case {{o}}
* write form to request corpus user account
* document how to apply for access to closed corpus, and details on the corpus and its use in general
* get an Intel Mac for __Tomi__
* write press release for the beta
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Steinar

* test our infrastructure and documentation - follow the documentation exactly, and find problem areas - report problems to __Børre__. Start: At the front page.
* Complete the semantic sets in sme-dis.rle
* missing lists
* report conversion errors to __Saara__
* Look at the actio compound issue when adding from missing lists
* lexicalise actio compounds. Example: ''vuolggasadji'' vs. ''vuolginsadji''
* Go through the Num bugs
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Thomas

* refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt
* work with compounding
* Lack of lowering before hyphen: Twol rewrite.
* fix stuorra-oslolaš lower case {{o}}
* implement discontinuous case inflection for {{sme}} numbers
* produce correct number base forms in the {{sme}} analyzer
* translate beta release docs to {{sme}} and {{smj}}
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Tomi

* improve numerals in the speller
* add prefixes to the PLX
* add derivations to the PLX generation
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!! Trond

* update the {{smj}} proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
* fix {{sme}} texts in corpus this month
* find missing {{nob}} parallel texts in corpus, go through Saara's list
* Go through the Num bugs
* [fix bugs!|http://giellatekno.uit.no/bugzilla].