!!!Meeting setup

* Date: 19.5.2008
* Time: 09.30 Norwegian time
* Place: Internet
* Tools: SubEthaEdit, iChat/Skype

!!!Agenda

Cf. one of the following, depending on context:
* the upper bar of the SEE window (provided you use the JSPWiki syntax mode)
* the TOC in Forrest-rendered output, like HTML and PDF

!!!Opening, agenda review, participants

Opened at 09:50.

Present: __Børre, Per-Eric, Sjur, Thomas, Tomi, Trond__

Absent: __Jovsset__

Agenda accepted as is.

!!!Updated task status since last meeting

!!Børre

* Hunspell lexicon conversion
** Progressing
* prepare migration to svn (with __Sjur, Trond__)
** nothing new
* release hunspell public beta during May (with __Sjur__)
** will probably make it
* try to repair G5 accounts for iCal Server
** not done
* make a test-all target that runs all tests we have
** not done
* define and document testing routines
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!!Jovsset

* follow up on {{sma}} corpus texts

!!Lene

* get the ped content ready
* Work on test routine with __Trond__ and __Sjur__

!!Maaren

* Put the list of possible {{sma}} corpus sources into a document

!!Per-Eric

* try to find other authors who have {{smj}} texts digitally.
** Nothing done
* Work with missing list from texts written by Sigga Tuolja Sandström.
** Nearly finished with her texts, have some small corrections left
* Work with missing list same_dutkama_pgr.txt
** Making a missing list just now
* Work with missing list sameriekta_tjoahkkagæsos.txt
** Nothing done
* Keep in contact with Ulf-Stefan Winka.
** Nothing done
* plan a {{smj}} PR tour for our tools
** Working on it, made the first draft
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!!Saara

* add new XSL/XML headers for proofing test docs
* Set up ways of adding meta-information for proofing correct corpus docs (source info, used in testing or not, added to lexicon or not)
* implement the ped UI and functionality

!!Sjur

* follow up on {{sma}} corpus texts
** nothing last week
* name db/risten.no
** nothing last week
* make an improved {{sma}} project plan
** nothing last week
* publish corpus contracts and project infra as open-source on NoDaLi-sta
** nothing last week
* prepare migration to svn (with __Børre, Trond__)
** nothing last week
* release hunspell public beta at the end of May (with __Børre__)
** nothing last week
* update the ''Changes'' document
** nothing last week
* follow-up on some Polderland-related bugs: 621, 630, 652, 656
** some work on 656
* InDesign documentation
** nothing last week
* add CG regression test with __Lene__ and __Trond__
** nothing last week
* make a test-all target that runs all tests we have
** nothing last week
* define and document testing routines
** nothing last week
* {{sma}} linguist
** nothing last week
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
* other tasks:
** presentation in Stockholm
** updated libraries and command line tools from Polderland
** released internal proofing tools update

!!Thomas

* look at test cases still not behaving properly
** nothing this week
* review hunspell lexicon branch with __Børre__
** worked all week
* plan a {{smj}} PR tour for our tools
** made first draft
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** nothing this week

!!Tomi

* Hunspell lexicon conversion
** not done
* document how compounding is controlled in the PLX conversion
** not done
* fix double hyphen bugs
** fixed some double hyphens
* Make a pedagogical speller (after MA thesis is delivered)
** not done
* [fix bugs!|http://giellatekno.uit.no/bugzilla]
** fixed some

!!Trond

* Help Jovsset with vislcg3 and sma
* Set up Jabber for Lene, Kimme, Saara
* Prepare svn migration (with __Sjur, Børre__)
* Work on test routine with __Lene__ and __Sjur__
* make a test-all target that runs all tests we have
* define and document testing routines
* Dictionaries
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!!!Pedagogical software online

Demos online, preliminary only:
* victorio.uit.no/oahpa/mgame/
* victorio.uit.no/oahpa/num/

Meeting memos can be found at [http://giellatekno.uit.no/ped/index.html#Meeting+memos]

__TODO:__
* get the content ready (__Lene__)
* implement the UI and functionality (__Saara__)
* get an easy-to-remember URL (__UiT/IT__)
* More thorough skin, layout, ... (__External person within the Ped team__, __Internal Forrest expert__)
** This will be postponed until later
* Make a pedagogical speller (__Tomi__ when finished with his MA thesis)
** Turn off peripheral compounds (numbers, acronyms, perhaps names)
** Increase edit distance by one for suggestions? Only possible with limited compounding

!!!Documentation

__TODO:__
* start to reorganise the documentation (__Børre, Sjur, Trond__)

!!!Corpus gathering

__P-E__ has talked again with __Børge Strandskog__, and he will sign the corpus contract and give us material from ''Nord-Salten avis''.

The {{smj}} Divvun tour planning is progressing.

__TODO:__
* follow up on {{sma}} corpus texts (__Sjur, Jovsset__)
* follow-up on the {{smj}} texts from __Kurt Tore__ (__Per-Eric__)
* get texts from __Sigga Tuolja Sandström__, possibly through __Ulf-Stefan Winka__ (contract is ok now) (__Per-Eric__)
* other contacts: Nord-Salten avis (__Børge Strandskog__), Lena Davidsson, daughter of Lars-Matto Tuolja
* Put the list of possible corpus sources into a document {{gt/doc/lang/sma/sma-corpus-plan.jspwiki}} (__Maaren__)
* give contract with blank fields to __Per-Eric__ (__Børre__)
* plan a {{smj}} PR tour for our tools (__Per-Eric, Thomas__)

!!!Future plans, directions and ideas

See a separate document in {{plan/strat/5year.jspwiki}}.

!!!Infrastructure

To accommodate future enhancements in different directions (in rough order of importance):

# test bench for all parts of our language technology efforts
# migrate to svn
# merge gt, kt and st into one, probably after the svn move
# more modularised make / build infra (prepare for smn, sms, sjd, others)
# close certain parts of the code repository (requires svn)
# set up the Leopard Server features for collaborative support:
## permanent chat rooms
## stored (and indexed) chat transcripts of the chat rooms
## iCal server / group calendars
## wiki
# wiki? (is part of Leopard Server) or other web-based documentation
# improve Forrest stability and i18n support
# reorganise the documentation content:
## differentiate between target groups
## get better grouping
## decide what to write in Forrest and what in wiki (cf. Apertium for a similar split)
## update/add missing parts
# migrate lexicons to XML, splitting the task
## Name lexica (the Name project)
## Dictionaries (already in XML, task is to integrate them)
## Open POSes (Komi as a test case)
# change the look of the documentation web
# sfst? Both as replacement for xfst and for hunspell/open-source proofing tools
# investigate the NSIS installer, potentially replacing the InstallShield package from Polderland
# corpus content moved to Max Planck repositories?

__TODO:__
* add CG regression test (__Lene, Sjur, Trond__)
** needs to be integrated in the make file
* make a test-all target that runs all tests we have (__Børre, Sjur, Trond__)
** not yet - depends on the previous point
* define and document testing routines (__Børre, Sjur, Trond__)
** not yet
* add Jabber account in iChat
** UiT: Lene, Kimme, Saara (__Trond__)
* prepare migration to svn (__Børre, Sjur, Trond__)
** https access is now working: {{svn checkout https://victorio.uit.no/repos}}
* try to repair G5 accounts for iCal Server (__Børre__)

!!!Linguistics

!!North Sámi

(nothing new, see proofing bugs below)

!!Lule Sámi

(nothing new, see proofing bugs below)

__TODO:__
* {{sme->smj}} lexicon conversion to build bilingual lexicon resources, and increase {{smj}} coverage (__Trond, Svenne__).
* Add the words when all words are ready.

!!South Sámi

Nothing new since last week.

!!!Name lexicon/risten.no infrastructure

__TODO:__
# fix i18n bug in risten.no/G5 (so they will work without the proper locale request) (__Sjur__)
# fix bugs in lexc2xml; add comments to the log element (__Saara__)
# finish first version of the editing (__Sjur__)
# test editing of the xml files. If ok, then: (__Sjur, Thomas, Trond__)
# make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (__Sjur, Saara__)
# convert propernoun-($lang)-lex.txt to a derived file from common xml files (__Sjur, Tomi, Saara__)
# implement data synchronisation between [risten.no|http://www.risten.no] and the cvs repo, and possibly other servers (i.e. the G5 as an alternative server to the public risten.no - it might be faster and better suited than the official one; also local installations could be treated the same way)
# start to use the xml file as source file
# clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (__Thomas, Maaren, linguists__)
# merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (__linguists__)
# publish the name lexicon on risten.no (__Sjur__)
# add missing parallel names for placenames (__linguists__)
# add informative links between first names like Niillas and Nils (__linguists__)

!!Dictionaries

Updated version now available at the Divvun download area [http://www.divvun.no/static_files/sme-nob-dict.zip].

What we have works, content-wise and technically, for Mac OS X 10.5. Technically, the challenge now is to port it to other operating systems, and to generalise the infrastructure to make it easy to add new dictionaries without duplicating the infrastructure components.

The linguistic challenges are:
* Compounds (must be lexicalised)
* Derivations (must be lexicalised)
* Better coverage (the {{words/dicts/smenob/src/inc-*}} files)
* Better dictionary articles

As for the last point, we are considering a dictionary workshop next winter. But we will then need lexicographers, and routines for using our tools and parallel corpora for finding dictionary examples.

__TODO:__
* clean up and generalise the make infrastructure
* make Linux and Windows versions
* make simple installer applications
* make a public release
** Make a homepage with instructions for dictionary use: {{xtdoc/gtuit/src/documentation/content/xdocs/dict.eng.xml}}
** Clarify the difference between local and online dictionaries:
*** Plugin for Firefox and Internet Explorer (online dictionaries)
*** Integrated dictionary for Mac OS X 10.5 (local ones), in the future also Windows and Linux

!!!Proofing tools

!!Hunspell

Latest test results now committed and linked to.

__Børre__ is working on the conversion.

{{{
sme.dic  12MB
sme.aff 873MB
}}}

These files are on 129.242.220.111 in {{/Users/boerre/gt/tmp}}

* Simple clitics may also be separate words
** ''go, ge, gen, ges, gis, nai, ba, ban, be, hal, han, bat, son''
* combinations of clitics are not separate words as such.
** ''naigo, goson''

__Børre__ will compile new lexicons today.

__TODO:__
# add {{smj}} to the soup, make sure it works roughly as well as {{sme}} (__Børre, Thomas, Per-Eric__)
## added to derivations, needs to be tested
# fix the remaining conversion bugs for {{sme}} (__Børre, Tomi__)
# return to {{smj}}, and fix whatever is left to fix (__Børre, Tomi__)
# make a proper Linux distro (__Børre, Tomi__)
# release a public beta at the end of May (__Børre, Sjur__)

!!Testing

!Spelling Error Markup

__TODO:__
* Set up ways of adding meta-information (source info, used in testing or not, added to lexicon or not) (__Saara__)
* test new and nested error markup (__Sjur__)

!!Speller bugs

List of bugs returned from Polderland: 621, 630, 652, 656, 676.

Open issues based on test results:

!sme

Version: __Davvisámi, version 1.0.1, 2008-04-01__

* 425 - other words from Divvun.no - __FIXED__
* 426 - comp words from Divvun.no - ''guoktedássásaš'' accepted - still __OPEN__
* 435 - roman numbers - inflection of single letter numbers rejected, as well as some complex numbers (but is ok in {{smj}}) - still __OPEN__
** we should pregenerate all numbers once and for all, and store them in a separate lexicon file
* 452 - several lexical bugs - ''oažžuin'' + ''ožžuin'' - __FIXED__
* 595 - prefix+name without hyphen (''ovdaLot'' instead of ''ovda-Lot'') - still __OPEN__
* 600 - gen+hyph compound ''sámi-dáru'' - still __OPEN__
* 603 - suomabealdi accepted - still __OPEN__
* 606 - speller accepts VUOHTA compound - still __OPEN__
* 607 - acro + hyphen - __FIXED__
** ''NRKGA'' is acro + clitic accepted without colon - what is correct?
* 611 - double hyphen sugg still accepted - still __OPEN__
* 613 - short gen. as second compound part - still __OPEN__
* 619 - numerals and pronouns to NAMÁK and SASJ fails - still __OPEN__
* 627 - prefix + hyphen does not get accepted - still __OPEN__
* 629 - ''a'' taking part in compounding without hyphen - still __OPEN__
* 634 - PropGen+hyph+PropGen - still __OPEN__
* 641 - numeral+noun compounds - still __OPEN__
* 642 - noun/adj/proper + hyphen + ain - still __OPEN__
* 644 - cased numeral+numeral compound - still __OPEN__
* 646 - adverb + hyphen + noun - still __OPEN__
* 647 - numerals+NOUN - still __OPEN__
* 648 - unmotivated suggestions with numeral+noun - still __OPEN__
* 649 - name + adj compound without hyphen - still __OPEN__
* 654 - speller does not recognize ordinals on -nuppelogát - still __OPEN__
* 655 - pron + nai - still __OPEN__
* 658 - Suggestion saame - still __OPEN__
* 665 - adverb superlatives; dieppimus, doppimus - __FIXED__
* 666 - guovtte- and njealje- - __NEW__
* 668 - caseforms, ordinals and collectives - __FIXED__
* 676 - triple-hyphen - __FIXED__

!smj

Version: __Julevsáme, version 1.0.1, 2008-04-01__

* 435 - roman number - single letter numbers now recognised
** we should pregenerate all numbers once and for all, and store them in a separate lexicon file
** please note that ''inflection'' of single letter numerals is __fine__ in {{smj}}, as opposed to {{sme}}
* 595 - prefix+name without hyphen (''tsåhkeLot'' instead of ''tsåhke-Lot'') - still __OPEN__
* 599 - __REGRESSION:__ numeral attr:s on lot
* 600 - gen+hyph compound ''sáme-dáro'' - still __OPEN__
* 607 - acro + hyphen - __FIXED__
** ''NRKGA'' is acro + clitic accepted without colon - what is correct?
* 616 - Bispadime-me-ráden - still __OPEN__, try to find an acro or abbr ''me''
* 619 - numerals and pronouns to NAMÁK and SASJ fails - still __OPEN__
* 629 - ''a'' taking part in compound - still __OPEN__
* 634 - Prop gen + hyphen + Prop gen - still __OPEN__
* 641 - numeral+noun compounds - still __OPEN__
* 644 - cased numeral+numeral compound - still __OPEN__
* 647 - numerals+NOUN - still __OPEN__
* 648 - unmotivated suggestions with numeral+noun - still __OPEN__
* 649 - name + adj compound without hyphen - still __OPEN__
* 650 - noun prefix+name compound without hyphen - still __OPEN__
* 658 - Suggestion saame - still __OPEN__
* 668 - caseforms, ordinals and collectives - __FIXED__

__TODO:__
* compile new speller lexicons (__Tomi__)
** done, please new compilations on Thursday afternoon
* document how compounding is controlled in the PLX conversion (__Tomi__)

!!Hyphenator bugs

Open issues based on test results:

!sme

Lexicon version: __Davvisámi, version 1.0.1, 2008-04-01__

* 468 - __REGRESSION:__ ''Márkomeanu''
* 547 - __REGRESSION:__ hyphen in front of vowel: ''Lotnolasealáhusas''
* 548 - __REGRESSION:__ mid syllable hyphenation: ''Háliidivččen''
* 549 - __REGRESSION:__ division without hyph: ''Váccedettiin''
* 673 - adj-derivations: ''guovttenuppelotčoarvvagiin'' (the word is not rec.)
* 677 - __NEW:__ Wrongly hyphenated ending -danidja - invalid

!smj

Lexicon version: __Julevsáme, version 1.0.1, 2008-04-01__

* 545 - __REGRESSION:__ bad hyphenation in compounds: ''åhpadusorganisásjåvnån'' (not recognised)
* 546 - __REGRESSION:__ obligatory hyph rules seem to work in facultative manner: ''organisásjåvnån'' (not recognised)
* 547 - __REGRESSION:__ hyphen in front of vowel: ''Jienastimnjuolgadusá'' and ''Orgánajs''

__TODO:__
* test latest hyphenator lexicons (__Sjur__)

!!InDesign tools

We have received the expected updates from Polderland.

!!Releases

__TODO:__
* update the ''Changes'' document (__Sjur__)
* InDesign documentation (__Sjur__)
** Norwegian translation received from Davvi Girji
* public hunspell beta - sometime in May
* public 1.1 update of the Polderland-based tools towards end of May/beginning of June

!!!Other

!!Corpus contracts + open source

We have decided to wait until we have changed from {{cvs}} to {{svn}}.

__TODO:__
* publish corpus contracts and project infra as open-source on NoDaLi-sta (__Sjur__)

!!Sjur in Stockholm

Last week Sjur gave a speech on ''Samiska i IT-samhället''. The meeting was useful. Sweden's W3C experience wanted info on Sámi problems with Unicode.

!!Summer vacations

|| Who || When
| Børre | 30/6-6/7, 21/7-3/8, 11/8-17/8
| Jovsset | ???
| Per-Eric | ???
| Sjur | ???
| Tomi | 16/6 - 4/8
| Thomas | 23/6 - 21 or 28/7
| Trond | 30/6 - 18/7, 28/7 - 1/8

!!!Next meeting, closing

The next meeting is 26.5.2008, 9.30 Norwegian time.

The meeting was closed at 11:10.

!!!Appendix - task lists for the next five days

!!Børre

[iCal|/doc/admin/weekly/2008/Tasks_2008-05-26_Boerre.ics]

* Hunspell lexicon conversion
* prepare migration to svn (with __Sjur, Trond__)
* release hunspell public beta during May (with __Sjur__)
* make a hunspell package that suits Linux distributions
* try to repair G5 accounts for iCal Server
* make a test-all target that runs all tests we have
* define and document testing routines
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!!Jovsset

* follow up on {{sma}} corpus texts

!!Lene

* get the ped content ready
* Work on test routine with __Trond__ and __Sjur__

!!Maaren

* Put the list of possible {{sma}} corpus sources into a document

!!Per-Eric

[iCal|/doc/admin/weekly/2008/Tasks_2008-05-19_Per-Eric.ics]

* try to find other authors who have {{smj}} texts digitally.
* Work with missing list from texts written by Sigga Tuolja Sandström.
* Work with missing list same_dutkama_pgr.txt
* Work with missing list sameriekta_tjoahkkagæsos.txt
* Plan a {{smj}} PR tour for our tools
* Call Julie about my vacation days (how many I have left)
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!!Saara

* add new XSL/XML headers for proofing test docs
* Set up ways of adding meta-information for proofing correct corpus docs (source info, used in testing or not, added to lexicon or not)
* implement the ped UI and functionality

!!Sjur

[iCal|/doc/admin/weekly/2008/Tasks_2008-05-19_Sjur.ics]

* follow up on {{sma}} corpus texts
* name db/risten.no
* make an improved {{sma}} project plan
* publish corpus contracts and project infra as open-source on NoDaLi-sta
* prepare migration to svn (with __Børre, Trond__)
* release hunspell public beta at the end of May (with __Børre__)
* update the ''Changes'' document
* follow-up on some Polderland-related bugs: 621, 630, 652, 656
* InDesign documentation
* add CG regression test with __Lene__ and __Trond__
* make a test-all target that runs all tests we have
* define and document testing routines
* {{sma}} linguist
* test latest hyphenator lexicons
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!!Thomas

[iCal|/doc/admin/weekly/2008/Tasks_2008-05-19_Thomas.ics]

* look at test cases still not behaving properly
* review hunspell lexicon branch with __Børre__
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!!Tomi

[iCal|/doc/admin/weekly/2008/Tasks_2008-05-19_Tomi.ics]

* Hunspell lexicon conversion
* make a hunspell package that suits Linux distributions
* document how compounding is controlled in the PLX conversion
* fix double hyphen bugs
* Make a pedagogical speller
* [fix bugs!|http://giellatekno.uit.no/bugzilla]

!!Trond

[iCal|/doc/admin/weekly/2008/Tasks_2008-05-19_Trond.ics]

* Help Jovsset with vislcg3 and sma
* Set up Jabber for Lene, Kimme, Saara
* Prepare svn migration (with __Sjur, Børre__)
* Work on test routine with __Lene__ and __Sjur__
* make a test-all target that runs all tests we have
* define and document testing routines
* Dictionaries
* [fix bugs!|http://giellatekno.uit.no/bugzilla]