!!!Meeting setup * Date: 7.11.2006 * Time: 09.30 Norw. time * Place: Where we are * Tools: SubEthaEdit, iChat !!!Agenda # Opening, agenda review # Reviewing the task list from two weeks ago # Documentation - divvun.no # Corpus gathering # Corpus infrastructure # Infrastructure # Linguistics # name lexicon infrastructure # Spellers # Other issues # Summary, task lists # Closing !!!1. Opening, agenda review, participants The meeting was delayed one day. Opened at 10:56. Present: __Børre, Maaren, Saara, Sjur, Thomas, Tomi__ Absent: __Trond__ Agenda accepted as is. !!!2. Updated task status since last meeting !! Børre * contact writers who already have received contracts ** Not done * move norwegian documents in Min Áigi from {{sme}} to {{nob}} ** Mostly done * consider a script for automatic testing of the spell checker in Word ** Not done * consider more testing routines ** Not done * update Maaren's Forrest installation to r430284 * check potential raw HTML bug/problem ** Done * {{sma}} discussions with SD (with __Sjur__, __Trond__) * add as much {{smj}} texts as possible ** Done * report improvements in aligner back to __Øystein__ ** Not done * add a simple password protection to risten.no in the G5 ** Not done * consider infra for testing feedback ** Not done * get an Intel Mac for testing Windows spellers; get a WinXP license from SD * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Maaren * investigate the generated word form list sent to Polderland - use the command {{make wordlist TARGET=sme}} in ''victorio'' **done some * investigate unrecognised word forms in the hyphenator * create / check the paradigm grammar as exemplified above !! Saara * add more texts to the graphical corpus interface * finalize server of the Xerox tools. ** some performance testing done, other testing with Tomi. * improve text_cat with paragraphs of mixed content ** done * generate parallel corpus files manually (with __Trond__) * export corpus tools to location available to all (with cron), cf news disc. ** in progress * help Trond with some shell commands ** not done * plan the word form generator / data conversion script ** what was this? * add {{}} to the corpus processing, encapsulating identifiable sequences of foreign language within them ** done otherwise, the processing is not implemented in the preprocessor, nor ccat * fine-tuning of the character conversion ** done * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Sjur * name lexicon: ** refactor SD-terms editor code ** implement missing propnouns editing functions ** implement improvements decided upon in Tromsø * hire linguist and programmer ** one more candidate contacted * finish i18n work of Forrest with __Børre__ ** pdf finished - the {{forrest run}} version is ready now * investigate unrecognised word forms in the hyphenator ** done some updates to the compilation, and some corrections ** the generator generates non-existent word forms - needs to be investigated! * decide how to specify compounding behaviour info in the lexicon ** further discussions in news * {{sma}} discussions with SD (with __Børre__, __Trond__) * check why some SUB-marked entries got included in the normative transducer * remove comparation from ''-laš'' derivations ** done (by __Thomas__) * plan the word form generator / data conversion script ** nohting last week * consider a script for automatic testing of the spell checker in Word ** checked the AppleScript dictionary in Word - it looks both positive and negative * consider more testing routines * consider infra for testing feedback * get an Intel Mac for testing Windows spellers; get a WinXP license from SD * ask Julie Eira about SD employee seminar * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Thomas * refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt ** nothing this week * find and study all derived words in our corpus (with __Trond__) ** nothing this week * suggest which derivations could be generated ** nothing this week * investigate unrecognised word forms in hyphenator ** worked * decide how to specify compounding behaviour info in the lexicon ** worked * check why some SUB-marked entries got included in the normative transducer ** worked * remove comparation from ''-laš'' derivations ** done * [fix bugs!|http://giellatekno.uit.no/bugzilla] ** worked !! Tomi * continue implementation of the speller lexicon conversion ** worked, refactored the code * add hyphenation points to the generated output * plan the word form generator / data conversion script * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Trond * refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt ** Made a new dump from {{sme}}, gone through the lexica, more to be done. * get more {{sma}} texts, first the Bible / NT ** Got text, but not Bible yet. * add corpus user accounts and access issues to Bugzilla * fix the corpus tag list in the {{cwb/}} directory * investigate unrecognised word forms in the hyphenator * decide how to specify compounding behaviour info in the lexicon * {{sma}} discussions with SD (with __Børre__, __Sjur__) * find and study all derived words in our corpus (with __Thomas__) ** Done, helped Thomas with the setup of this * [fix bugs!|http://giellatekno.uit.no/bugzilla]. !!!3. Documentation The new documentation is up and running on the giellatekno server: {{http://giellatekno.uit.no:8888}} The main difference from our existing version: i18n and l10n - all menus, tabs etc are now translated (including the generated pdf files), and documents available in more than one language will appear with a language selection list at the top of the page. TODO: * finish i18n work (__Børre__ and __Sjur__) ** set up Tomcat at the faculty server, and install Forrest as war. Needs testing before being deployed on the server! Thus, install it on the G5 first and see that it works, then on the faculty server. ** finish i18n in PDF *** done * check potential raw HTML bug/problem (__Børre__) ** done and fixed (it was not a bug or a problem, it was our own sitemap:-( ) * port all i18n work to the main branch (from the i18n branch) (__Børre__) * update all forrest installations, including local patches (__Børre__) !!!4. Corpus gathering __Børre__ got in contact with an author __Erling Persen__, got a book. Also got some texts from __Lars Kintel__. Some of the texts need to get author clarification which __Børre__ is working on. Most of the unproblematic texts from __Lars Kintel__ has been added to the repository. Number of words in our corpus is now: * __sme__ - 4.38 mill words * __smj__ - 376 000 words __TODO:__ * continue to help NSI to get their corpus (__Børre__) * sma: ** get Bible / NT texts (__Trond__). *** nohting new since last week ** Discussions with the Sámi Parliament (__Børre, Sjur__) * add as much {{smj}} texts as possible (__Børre__) ** done !!!5. Corpus infrastructure !!User accounts and access TODO: * add the issue with subissues to Bugzilla (__Trond__) !!More texts to the graphical corpus interface: TODO: * add text to the server (__Lars__) !!Aligner __TODO:__ * report improvements in aligner back to __Øystein__ (__Børre__) * gather more parallel texts (__Trond__) * try out NT alignment strategies (__Saara__) !!Language recognition TODO: * get more {{sma}} texts, first the Bible / NT (__Trond__) * add {{}} to the corpus processing, encapsulating identifiable sequences of foreign language within them (__Saara__) ** done in text categorisation; all spans are dropped in {{ccat}} !!!6. Infrastructure !!Xerox tools wrapped as servers __TODO:__ * improve and finish the present prototype (__Saara__) ** it is now ready, but it is slow if the input is multiline - make sure that the input is always on one line! *** the multiline input option will go *** still some issues with the paradigm generation *** running the script and starting the demon needs to be done * fix the corpus tag list in the {{cwb/}} directory (__Trond__) * add the hyphenation filter to the hyphenator server (__Saara__) ** done * create / check the paradigm grammar as exemplified above (__Maaren__) !!Hyphenator TODO: * investigate unrecognised word forms (__Maaren, Thomas, Trond, Sjur__) !!!7. Linguistics !!Names and multilinguality We need a more principled approach to this. Background: the name lexicon is getting attention from the SD name/terminology sections, and they would like to use our name lexicon also for public searching. Observations: 1) Multilinguality is always optional. 2) We can observe that "foreign" names in texts follows a domination pattern: majority language forms can be found in minority language texts as real names ("Kautokeino produkter"), whereas minority language names ''almost always'' occur in majority language texts as citations. And citations should not be considered a natural part of the text. 3) When looking at our name classification, multilinguality varies according to: {{{ Ani - weak/none? (pet, myth anim. names) Fem - weak (informative) Mal - weak (informative) Obj - strong Org - strong Plc - strong for the national and country names, weak (informative) for foreign names Sur - none Tit - strong (titles) }}} Suggestion: We need to reconsider the ''all names in all languages'' policy. That policy is valid only for {{Fem, Mal,}} and {{Sur}} (and Ani and Tit?). For {{Obj, Org, Plc}} the rule should be that if they have multilingual names, each name should only be used in it's own language. Then we need a modification saying that majority language names can be included in minority language lexicons __if attested__ in our corpus. Also, the majority language varies according to country (obviously), which means that in a speller context, we might consider tailoring spellers for each country, leaving out noise relating to majority language names from another country. A further issue is whether we should reconsider our cohort policy. Today, Sur and Plc are __different__ readings. An alternative would be to have them as secondary tags, not in conflict with each other: {{{ "" "Trosterud" N Prop Sur Sg Nom <<< @HNOUN "Trosterud" N Prop Plc Sg Nom <<< @HNOUN "" "Trosterud" N Prop Sg Nom <<< @HNOUN "" "Trosterud" N Prop Sg Nom &Sur &Plc <<< @HNOUN }}} !!Derivation and spellers like Aspell TODO: * find and study all derived words in our corpus (__Thomas__ and __Trond__) * suggest which derivations could be generated (__Thomas__) * lexicalise the rest (__Thomas__) !!North Sámi The {{sme}} generator ({{make wordlist TARGET=sme}}) (possibly also {{smj}}) generates nonsense strings, at least in the sense that they are not recognised by the analysing transducers (e.g. by {{sme.fst}}). __This should not happen!__ Example of generated string that's not recognised by the analyser: {{{ šuorpmoskáidilašruoksatčeavžžatvuođaineattet }}} Two lessions learned: # make sure to always use identical versions of source files and compiled files in all testing! # there was a hidden circularity in the propernoun file TODO: * investigate the generated word form list sent to Polderland - use the command {{make pl-wordlist TARGET=sme}} in ''victorio'' (__Maaren__, __Thomas__) * alphabet letters need to be correctly inflected (colon as case separator) (__Trond__) this issue is already fixed and added to cvs. ** done * check why some SUB-marked entries got included in the normative transducer (__Thomas, Sjur__) * remove comparation from ''-laš'' derivations (__Thomas, Sjur__) ** done, but appearently not completely !!Lule Sámi TODO: * refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt (__Thomas, Trond__) * hire new linguist (__Sjur__) ** one more contacted !!!8. Name lexicon infrastructure Decided in Tromsø: * add logging facilities to the interface * add option to download local copies of the lexicon files directly from the db * batch editing (change all entries in the found set), should later be enhanced to allow selection of exceptions (the found set minus deselected items) * tag for excluding/including a name from certain applications * future epxansion: choose what info to display in the single language browser * display existing language entries when adding a new language to a record * add editor to change single, existing entries Details can be found in [the meeting memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html] TODO: * develop the needed XQueries and UI (__Sjur, Tomi__) * add a simple password protection to risten.no in the G5 (__Børre__) Postponed: * data synchronisation between risten.no and the cvs repo * new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry !!!9. Spellers !!Speller data generation __Tomi__ has done more work. We are now awaiting full specification of the PLX format to be able to generate correct output. There was discussions in the newsgroup last week, but no conclusion yet. We need to finish and conclude the discussion. __TODO:__ * decide how to specify compounding behaviour info for the lexicon (__Thomas, Trond, Sjur__) * add hyphenation points to the generated output (__Tomi__) !!Automatic testing of the Word spellchecker It should be possible to write a script that runs texts through Word from the command line, using a combination of shell script and AppleScript. MS Word has the needed AppleScript commands to run the spell checker. TODO: * consider a script for automatic testing (__Sjur, Børre__) ** checked some features of Word * consider more testing routines (__Sjur, Børre__) * consider infra for testing feedback (__Børre, Sjur__) * get an Intel Mac for testing Windows spellers; get a WinXP license from SD (__Børre, Sjur__) !!!10. Other !!Report from Gothenburg The seminar focused on actions to establish common infrastructure for language technology in the Nordic countries, promoting cooperation and sharing of resources. __Sjur__ would like our projects to take some first public steps, by announcing on the NoDaLi e-mail list that our corpus contracts and our infrastructure is available for everyone interested. Prerequisites: * the corpus contracts are checked for any outstanding issues * anonymous cvs is working and regularly updated ** done now TODO: * check corpus contract issue (__Børre, Sjur, Trond__) !!Bug fixing __64__ open Divvun/Disamb bugs, and __24__ risten.no bugs Guess: 1/3 of the bugs are fixed already (?) !!Task lists as iCal entries TODO: * update Maaren's Forrest installation to r430284 (__Børre__) !!Employee seminar in Alta SD has an employee seminar in Alta 7.-8. December - should we go there? __Sjur__ will ask __Julie Eira__ if we have to go there. TODO: * ask Julie Eira about SD employee seminar (__Sjur__) !!!11. Next meeting, closing Closed at 12:27. !!!Appendix - task lists for the next week !! Boerre * contact writers who already have received contracts * consider a script for automatic testing of the spell checker in Word * consider more testing routines * update Maaren's Forrest installation to r430284 * {{sma}} discussions with SD (with __Sjur__, __Trond__) * report improvements in aligner back to __Øystein__ * add a simple password protection to risten.no in the G5 * consider infra for testing feedback * get an Intel Mac for testing Windows spellers; get a WinXP license from SD * check corpus contract issue * port all i18n work to the main branch (from the i18n branch) * update all forrest installations, including local patches * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Maaren * investigate the generated word form list sent to Polderland - use the command {{make wordlist TARGET=sme}} in ''victorio'' * investigate unrecognised word forms in the hyphenator * create / check the paradigm grammar as exemplified above !! Saara * add more texts to the graphical corpus interface * finalize server of the Xerox tools. * generate parallel corpus files manually (with __Trond__) * export corpus tools to location available to all (with cron), cf news disc. * help Trond with some shell commands * plan the word form generator / data conversion script * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Sjur * name lexicon: ** refactor SD-terms editor code ** implement missing propnouns editing functions ** implement improvements decided upon in Tromsø * hire linguist and programmer * investigate unrecognised word forms in the hyphenator * decide how to specify compounding behaviour info in the lexicon * {{sma}} discussions with SD (with __Børre__, __Trond__) * check why some SUB-marked entries got included in the normative transducer * remove comparation from ''-laš'' derivations * plan the word form generator / data conversion script * consider a script for automatic testing of the spell checker in Word * consider more testing routines * consider infra for testing feedback * get an Intel Mac for testing Windows spellers; get a WinXP license from SD * ask Julie Eira about SD employee seminar * check corpus contract issue * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Thomas * refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt * find and study all derived words in our corpus (with __Trond__) * suggest which derivations could be generated * investigate unrecognised word forms in hyphenator * decide how to specify compounding behaviour info in the lexicon * check why some SUB-marked entries got included in the normative transducer * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Tomi * continue implementation of the speller lexicon conversion * make generator as server, based on __Saara's__ code * add lexc2xspell code to cvs * add hyphenation points to the generated output * plan the word form generator / data conversion script * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Trond * refine {{smj}} proper noun lexica, cf. the propernoun-smj-lex.txt * get more {{sma}} texts, first the Bible / NT * add corpus user accounts and access issues to Bugzilla * fix the corpus tag list in the {{cwb/}} directory * investigate unrecognised word forms in the hyphenator * decide how to specify compounding behaviour info in the lexicon * {{sma}} discussions with SD (with __Børre__, __Sjur__) * check corpus contract issue * [fix bugs!|http://giellatekno.uit.no/bugzilla].