!!!Meeting setup

* Date: 26.3.2008
* Time: 09.30 Norw. time
* Place: Internet
* Tools: SubEthaEdit, iChat/Skype

!!!Agenda

Cf. one of the following, depending on context:

* the upper bar of the SEE window (provided you use the JSPWiki syntax mode)
* the TOC in Forrest-rendered output, like HTML and PDF

!!!Opening, agenda review, participants

Opened at 10:16.

Present: __Børre, Lene, Sjur, Thomas, Trond__

Absent: __Maaren, Per-Eric, Tomi__

Agenda accepted as is.

!!!Updated task status since last meeting

!!Børre

* start to reorganise the documentation
* gather {{sma}} texts
* improve forrest stability with i18n, site look
* set up the Leopard Server features for collaborative support
* Hunspell lexicon conversion
* InDesign documentation
* investigate the NSIS installer
* give contract with blank fields to __Per-Eric__
** done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
* Other:
** Had a meeting with Davvi Girji and Čálliid lágádus. They agreed to add a paragraph to their standard contracts which lets the Sami Parliament use the texts they publish.
** Visited __Johan Jernsletten__, __Aage Solbakk__ and __Kari Meløy__ who signed contracts. Will also make contracts with __Yngve Engkvist__, __Harald Gaski__, __Siri Broch Johansen__ and __Roald E. Kristiansen__. All in all this will give us quite a few books to work with.

!!Lene

* Ped project - status:
** waiting for __Tino__ to do some changes of the synt.tags in the VISL-games
** Saara is programming the morph-drill and question/answering-drill - we need good (funny?) names for the drills. Does anybody have suggestions?
** Saara also makes an xml for the lexicon for the drills
** I am doing other work now waiting for this. When Saara has finished the first versions of the programming, then I will continue.
** The next big task after the drills is to continue the work with the dialogues

!!Maaren

* Put the list of possible {{sma}} corpus sources into a document
* update the ''Changes'' document

!!Per-Eric

* keep in contact with Kurt Tore's family about his texts.
** Nothing new
* try to find other authors who have smj texts in digital form, send contracts to them
** Nothing new
* Work with the missing list from the Bible texts.
** Not done
* Keep in contact with Ulf-Stefan Winka who has many more smj texts to add.
** Have some new texts from Sigga Tuolja Sandström and from Lars-Matto Tuolja; I have made missing lists for both
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** Nothing done

!!Saara

* add new XSL/XML headers for proofing test docs
* Set up ways of adding meta-information for proofing correct corpus docs (source info, used in testing or not, added to lexicon or not)
* discuss more parallel texts

!!Sjur

* start to reorganise the documentation
* gather {{sma}} texts
* improve forrest stability with i18n, site look
* set up the Leopard Server features for collaborative support
* name db/risten.no
* investigate the NSIS installer
* make a first {{sma}} project plan
* publish corpus contracts and project infra as open-source on NoDaLi-sta
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
* other things:
** hyphenator bug hunting and reporting

!!Thomas

* look at test cases still not behaving properly
** not much done here
* add remaining hyphenation bugs to Bugzilla
** done
* lexicalise ''europarádeministarjuogos''
** done
* try to fix 636
** did not succeed
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** worked some

!!Tomi

* Hunspell lexicon conversion
* document how compounding is controlled in the PLX conversion
* fix double hyphen bugs
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!!Trond

* Reorganise documentation (with Børre and Sjur)
** Not done
* Gather sma texts (with Børre and Sjur)
** Not done
* Name
lexicon project: Test editing xml files (when they are ready for it)
** Not done
* Work on {{sma}} analyser and visl integration
** Not so much here (but on smn, sms, sjd)
* [fix bugs!|http://giellatekno.uit.no/bugzilla].

!!!Pedagogical software online

__UiT/GT__ is developing their own language games & drills, in addition to the VISL games. See Lene's report above.

Links:

* [http://giellatekno.uit.no/oahpa/]
* [http://giellatekno.uit.no/ped/index.html]

Ped-speller? That is, a speller with restricted vocabulary and morphology (rare words, names and forms are removed, possibly even compounding, i.e. with only lexicalised compounds). Examples of problem pairs: ''boaris'' vs ''buoris'', ''vieljas'' vs ''vielljas''. With a smaller lexicon, it might be possible to increase the complexity of the suggestion ("phonetic") rules.

South Sámi ped-prog should be discussed with the {{sma}} groups we'll meet throughout the spring.

Note that there is a new deadline in the autumn for pedagogical and language-strengthening projects within Sámediggi.

__TODO:__

* get an easy-to-remember URL (__UiT/IT__)
* More thorough skin, layout, ... (__External person within the Ped team__, __Internal forrest expert__) This we will postpone until later
* Make a pedagogical speller (__Tomi__ when finished with his MA thesis)
** Add a flag !^P^ for forms to be excluded (__Thomas, Lene__)
** Turn off peripheral compounds (numbers, acros, perhaps names)
** Increase editing distance by one for suggestions?
Only possible with limited compounding

!!!Documentation

__TODO:__

* start to reorganise the documentation (__Børre, Sjur, Trond__)

!!!Corpus gathering

__TODO:__

* follow-up on the {{smj}} texts from __Kurt Tore__ (__Per-Eric__)
* get texts from __Sigga Tuolja Sandström__, possibly through __Olavi Korhonen__ (contract is ok now) (__Per-Eric__)
* other contacts: Nord-Salten avis, Børge Strandskog, Lena Davidsson (daughter of Lars-Matto Tuolja)
* gather {{sma}} texts (__Børre, Sjur, Trond, Joseph__)
* Put the list of possible corpus sources into a document {{gt/doc/lang/sma/sma-corpus-plan.jspwiki}} (__Maaren__)
* give contract with blank fields to __Per-Eric__ (__Børre__)

!!!Future plans, directions and ideas

* more speller engines supported (to different degrees)
* more hyphenators supported
* grammar checker
** what the society wants
** it is interesting both for the university and SD
** it is a very good cooperation project
* tailored proofing tools
* machine translation (further work to something useful)
* cooperation with groups teaching Sámi, starting at UiT
* speech
* searching and indexing
* automatic (bilingual) lexicon building, semantics
* more public visibility & delivery
* more open-source technology (sfst, other?)

See also a separate document in {{plan/strat/5year.jspwiki}}.

!!!Infrastructure

To accommodate future enhancements in different directions:

# migrate to svn
# merge gt, kt and st into one, probably after the svn move
# more modularised make / build infra (prepare for smn, sms, sjd, others)
# close certain parts of the code repository (requires svn)
# set up the Leopard Server features for collaborative support:
## permanent chat rooms
## stored (and indexed) chat transcripts of the chat rooms
## iCal server / group calendars
## wiki
# wiki? (is part of Leopard Server) or other web-based documentation
# improve Forrest stability and i18n support
# reorganise the documentation & look
# migrate to XML
# sfst?
Both as replacement for xfst and for hunspell/open-source proofing tools
# investigate the NSIS installer, potentially replacing the InstallShield package from Polderland
# corpus content moved to Max Planck repositories?

__TODO:__

* add Jabber account in iChat (__all__)
* prepare migration to svn (__Børre, Sjur, Trond__)

!!!Linguistics

!!North Sámi

(nothing new, see proofing bugs below)

!!Lule Sámi

(nothing new, see proofing bugs below)

__TODO:__

* {{sme->smj}} lexicon conversion to build bilingual lexicon resources, and increase {{smj}} coverage (__Trond, Svenne__).
* Add the words when all words are ready.

!!South Sámi

Nothing new.

!!!Name lexicon infrastructure

__TODO:__

# fix i18n bug in risten.no/G5 (so they will work without the proper locale request) (__Sjur__)
## it works ok locally, set-up / config needs to be checked on the G5; probably easy to fix
### it works the same both locally and on the G5, relates to i18n setup in forrest
# fix bugs in lexc2xml; add comments to the log element (__Saara__)
# finish first version of the editing (__Sjur__)
# test editing of the xml files. If ok, then: (__Sjur, Thomas, Trond__)
# make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (__Sjur, Saara__)
# convert propernoun-($lang)-lex.txt to a derived file from common xml files (__Sjur, Tomi, Saara__)
# implement data synchronisation between [risten.no|http://www.risten.no] and the cvs repo, and possibly other servers (i.e. the G5 as an alternative server to the public risten.no - it might be faster and better suited than the official one; also local installations could be treated the same way)
# start to use the xml file as source file
# clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (__Thomas, Maaren, linguists__)
# merge placenames which are erroneously in different entries: e.g.
Helsinki, Helsingfors, Helsset (__linguists__)
# publish the name lexicon on risten.no (__Sjur__)
# add missing parallel names for placenames (__linguists__)
# add informative links between first names like Niillas and Nils (__linguists__)

!!!Proofing tools

!!Hunspell

__TODO:__

# add {{smj}} to the soup, make sure it works roughly as well as {{sme}}
# fix the remaining conversion bugs for {{sme}}
# return to {{smj}}, and fix whatever is left to fix
# integrate the derivations as separate "continuation lexicons"

!!Testing

!Spelling Error Markup

__TODO:__

* Set up ways of adding meta-information (source info, used in testing or not, added to lexicon or not) (__Saara__)
* test new and nested error markup (__Sjur__)

!!Speller bugs

Open issues based on test results:

!sme

Version: __Davvisámi, version 1.0.1, 2008-02-17__

* 426 - comp words from Divvun.no - ''guoktedássásaš'' accepted - still open
* 435 - roman number - single letter numbers now recognised
** we should pregenerate all numbers once and for all, and store them in a separate lexicon file
* 595 - prefix+name without hyphen (''ovdaLot'' instead of ''ovda-Lot'')
* 600 - __REGRESSION:__ gen+hyph compound ''sámi-dáru''
* 603 - suomabealdi, norggabealdi accepted
* 606 - speller accepts VUOHTA compound
* 607 - acro + hyphen
** ''NRKGA'' is acro + clitic accepted without colon - what is correct?
* 611 - double hyphen sugg still accepted
* 613 - short gen.
as second compound part
* 619 - numerals and pronouns to NAMÁK and SASJ fails
* 627 - prefix + hyphen does not get accepted
* 629 - ''a'' taking part in compounding without hyphen
* 633 - double hyphens accepted in Word, not by cmdline speller
* 634 - PropGen+hyph+PropGen
* 641 - numeral+noun compounds
* 642 - noun/adj/proper + hyphen + ain
* 644 - cased numeral+numeral compound
* 646 - adverb + hyphen + noun
* 647 - numerals+NOUN
* 648 - unmotivated suggestions with numeral+noun
* 649 - name + adj compound without hyphen
* 654 - speller does not recognize ordinals on -nuppelogát
* 655 - pron + nai
* 658 - Suggestion saame
* 660 - abbr. not recognised

!smj

Version: __Julevsáme, version 1.0.1, 2008-02-14__

* 435 - roman number - single letter numbers now recognised
** we should pregenerate all numbers once and for all, and store them in a separate lexicon file
* 595 - prefix+name without hyphen (''tsåhkeLot'' instead of ''tsåhke-Lot'')
* 600 - __REGRESSION:__ gen+hyph compound ''sáme-dáro''
* 607 - acro + hyphen
** ''NRKGA'' is acro + clitic accepted without colon - what is correct?
* 616 - Bispadime-me-ráden - still __OPEN__, try to find an acro or abbr ''me''
* 619 - numerals and pronouns to NAMÁK and SASJ fails - still __OPEN__
* 629 - ''a'' taking part in compound - still __OPEN__
* 634 - Prop gen + hyphen + Prop gen - still __OPEN__
* 641 - numeral+noun compounds - still __OPEN__
* 644 - cased numeral+numeral compound
* 647 - numerals+NOUN
* 648 - unmotivated suggestions with numeral+noun
* 649 - name + adj compound without hyphen
* 650 - noun prefix+name compound without hyphen
* 658 - Suggestion saame

__TODO:__

* compile new speller lexicons (__Tomi__)
* document how compounding is controlled in the PLX conversion (__Tomi__)

!!Hyphenator bugs

Open issues based on test results:

!sme

* 468 - ''Márkomeanu'' -> Polderland - __FIXED__
* 548 - ''duostan'' -> Polderland - __FIXED__
* 549 - missing hyph at word boundary -> Polderland - __FIXED__
* 633 - extra hyphen inserted -> Divvun - __FIXED__

There are still some bugs found in the wordtypes test file. They should be added to Bugzilla.

__TODO:__

* add remaining hyphenation bugs to Bugzilla (__Thomas__)

!smj

* 549 - missing hyph at word boundary -> Polderland - __FIXED__
* 633 - extra hyphen inserted -> Polderland - __FIXED__
* 636 - hyphen before last char -> Divvun

Possible solution:

{{{
define saveclitic %# -> 0 || _ k .#. ;
}}}

The wordtypes test file does contain another problem, but that one belongs to Polderland, and is reported.

__TODO:__

* lexicalise ''europarádeministarjuogos'' (__Thomas__)
* try to fix 636 (__Thomas, Trond__)

!!InDesign tools

We're waiting for an update from Polderland.

!!Windows installer

This point is now moved to the section for future plans, and will be tackled as time permits.
!!Releases

__TODO:__

* update the ''Changes'' document (__Børre__)
* documentation (__Sjur__)
** Norwegian translation received from Davvi Girji

!!!Other

!!Corpus contracts + open source

__TODO:__

* publish corpus contracts and project infra as open-source on NoDaLi-sta (__Sjur__)

!!!Next meeting, closing

The next meeting is 31.3.2008. The meeting was closed at 13:16.

!!!Appendix - task lists for the next five days

!!Boerre

[iCal|/doc/admin/weekly/2008/Tasks_2008-03-26_Boerre.ics]

* gather {{sma}} texts
* Hunspell lexicon conversion
* InDesign documentation
* update the ''Changes'' document
* prepare migration to svn (with __Sjur, Trond__)
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!!Lene

* Ped project
* Add a flag !^P^ for forms to be excluded from ped. speller

!!Maaren

[iCal|/doc/admin/weekly/2008/Tasks_2008-03-26_Maaren.ics]

* Put the list of possible {{sma}} corpus sources into a document

!!Per-Eric

[iCal|/doc/admin/weekly/2008/Tasks_2008-03-26_Per-Eric.ics]

* keep in contact with Kurt Tore's family about his texts.
* try to find other authors who have smj texts in digital form, send contracts to them
* Work with missing list from Tjaktjalasta, Lars-Matto Tuolja.
* Work with missing list from texts written by Sigga Tuolja Sandström.
* Keep in contact with Ulf-Stefan Winka who has many more smj texts to add.
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!!Saara

[iCal|/doc/admin/weekly/2008/Tasks_2008-03-26_Saara.ics]

* add new XSL/XML headers for proofing test docs
* Set up ways of adding meta-information for proofing correct corpus docs (source info, used in testing or not, added to lexicon or not)
* discuss more parallel texts

!!Sjur

[iCal|/doc/admin/weekly/2008/Tasks_2008-03-26_Sjur.ics]

* gather {{sma}} texts
* name db/risten.no
* make an improved {{sma}} project plan
* publish corpus contracts and project infra as open-source on NoDaLi-sta
* prepare migration to svn (with __Børre, Trond__)
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!!Thomas

[iCal|/doc/admin/weekly/2008/Tasks_2008-03-26_Thomas.ics]

* look at test cases still not behaving properly
* try to fix 636
* Add a flag !^P^ for forms to be excluded from ped. speller
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!!Tomi

[iCal|/doc/admin/weekly/2008/Tasks_2008-03-26_Tomi.ics]

* Hunspell lexicon conversion
* document how compounding is controlled in the PLX conversion
* fix double hyphen bugs
* compile new speller lexicons
* Make a pedagogical speller (after MA thesis is delivered)
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!!Trond

[iCal|/doc/admin/weekly/2008/Tasks_2008-03-26_Trond.ics]

* Work on {{sma}} analyser and visl integration
* try to fix 636
* Prepare svn migration (with __Sjur, Børre__)
* [fix bugs!|http://giellatekno.uit.no/bugzilla].