!!!Meeting setup * Date: 06.06.2006 * Time: 09.30 Norw. time * Place: Where we are * Tools: SubEthaEdit, iChat !!!Agenda # Opening, agenda review # Reviewing the task list from two weeks ago # Documentation - divvun.no # Corpus gathering # Corpus infrastructure # Infrastructure # Linguistics # name lexicon infrastructure # Spellers # Other issues # Summary, task lists # Closing !!!1. Opening, agenda review, participants The meeting was delayed due to the project board having a telephone meeting at our regular meeting time. Opened at 13:17. Present: __Sjur, Thomas, Børre, Tomi__ Absent: __Maaren, Saara, Trond__ Agenda accepted as is. !!!2. Updated task status since last meeting !! Børre * corpus collection: ** send out contracts with accompanying letter ** Gather public texts, preferrably also parallel ones ** Send out letters to the rest of the Iđut authors ** call __Brita Kåven__ again *** Done ** contact __Ája__ (Kåfjord) *** Not done ** call __Bård Eriksen__ *** Done ** send contracts to Čálliid Lágádus *** Not done ** Contact __Olavi Korhonen__, to actually get the dictionary *** Done * corpus conversion: ** Continue converting text from input format to our xml ** convert nob and nno bible texts to be used as part of a parallel corpus ** convert fin, swe to paratext or directly to our XML ** review the paratext2xml converter ** convert smj NT to paratext *** None of the bible things are done ** complete __Min Áigi__ metadata *** Done * corpus access: ** make a group {{bound}} for our external corpus users *** Done ** possibly deploy the user account form as an HTML form *** Not done ** make a test user *** Not done ** Write both user and admin documentation *** Some done * set up Bugzilla automatic reminders for open issues; ask Thor-Øivind if needed ** Built-in feature of Bugzilla, tried to turn it on. * document use of Gobby within our project ** Done * create document & document entry for name double-tagging ** Not done * create iCal entries from our meeting memos (or __Sjur__) ** Not done * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Maaren * On sick leave !! Saara * Create a parallel corpora of the new testaments * add more texts to the graphical corpus interface * grammatical searchability in the graphical corpus interface * Implement parallel corpus upload in web upload script * Install Gobby * Test the aligners once again * refine the xml output of the xml-tagged analyses * convert or adapt the received PHP for paradigm generation to our needs * remove headers and footers from antiword documents, other improvements * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Sjur * public tender: ** collect the last clarifying answers, and make a final proposal to the board *** done. Decision by the board: Polderland * fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer ** nothing since last attempt * name lexicon: ** implement editing functions ** finalise refactoring for multiple collections (regular search interface) * change corpus-summary processing to generate smaller pages, see [bug report|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=309] ** done, now waiting for feedback from __Børre__ * move the SEE 2.5 extension list to Bugzilla ** done * create iCal entries from our meeting memos (or __Børre__) ** done * review user and admin documentation for corpus access * write user account form, probably ask for copy of existing ones from the IT centre (with Trond) * [fix bugs!|http://giellatekno.uit.no/bugzilla] * other tasks: ** administrative tasks !! Thomas * investigate productivity of Actio compounding in smj ** had a look at the three-syllables * investigate and identify under which conditions Actio compounding is possible ** have identified some things * discuss findings with the rest of us ** will do * add proper numeral analysis/treatment to smj ** not done * add loanwords (e.g. latin -ere verbs) to smj ** not done * work on compounding ** finished in practice, some things still that I need to discuss with Maaren * sme G3 issue ** not done * review user documentation for corpus access ** not done * create smj abbr file ** not done * [fix bugs!|http://giellatekno.uit.no/bugzilla] ** not done !! Tomi * new proper name lexicon ** data synchronisation of proper nouns between risten.no and CVS *** not done ** XQuery refactoring and code development for our proper noun editor *** doing ** new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry *** not done * read aligner docu, install, provide feedback ** not done * Set up the mechanism for the hash-mark transducer package ** not done * test the new xml output of the xml-tagged analyses ** not done * export corpus tools to {{/opt}} (with cron) ** not done * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Trond * better smj NT text * get fin, swe, nob and nno NT and OT in __paratext__ format * install aligner, test it and give feedback * fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer * make shell script wrappers for the most common commands for user friendlyness * write user account form, probably ask for copy of existing ones from the IT centre (with Sjur) * write documentation for our {{bound}} users, with pointers to the ordinary documentation * write documentation on double-tagging names * discuss web-only user access management with Oslo * [fix bugs!|http://giellatekno.uit.no/bugzilla]. !!!3. Documentation TODO: * Write both user and admin documentation (__Børre__, review: __Sjur, Thomas__) ** User documentation probably in several languages. This covers how to apply for an account, on what grounds one can apply, and pointers to documentation telling how to use the corpus. ** Admin documentation, telling how we set the permissions to the corpus files, and whatever other processes and tasks needed to set up a corpus account. *** not done yet !!!4. Corpus gathering !!Collecting See a [previous meeting memo|Meeting_2006-01-16] for what's to be done. TODO: * Send out the rest of the letters (__Børre__) New contracts: * none last 2 weeks !!Olavi Korhonen's Lule Sámi dictionary. Talked with him, sent him the contract. TODO: * Contact __Olavi Korhonen__, to actually get the dictionary ** done !!KIO Grafisk and the Iđut books __TODO__: * send letters to the other authors (__Børre__) !!Bible texts TODO: * get nob and nno NT and OT in __paratext__ format. (__Trond__) ** waiting for Oslo * convert smj NT to paratext (__Børre__) ** waiting for an NT in paratext format (whatever language will do) * convert fin, swe to paratext or directly to our XML (__Børre__) !!Davvi Girji Called __Brita Kåven__. She can't say anything about letting us have the texts untill she has talked with the employees, to get a work estimate on their part. Nothing was decided in their previous meeting. Davvi Girji seems generally very positive, but has problems finding the best way of actually delivering the texts to us. __TODO__: * call Brita Kåven again towards the end of the week (__Børre__) ** done * call the authors (__Børre__) ** not more !!Min Áigi __TODO__: * complete metainformation (__Børre__) ** done as far as possible: author and language info ** some __smj__ files moved to the {{smj}} tree ** still some Norwegian files to be checked !!Kåfjord __TODO__: * contact Ája (__Børre__) ** forgotten !!Sámi Instituhtta When will we get the corpus? We don't know, __Børre__ will contact him again. TODO: * contact NSI again (__Børre__) !!Čálliid Lágádus [http://www.calliidlagadus.org/] TODO: * send contracts (__Børre__) ** nothing !!Árran Talked to __Bård Eriksen__, he needs to discuss more with his coworkers. TODO: * continue discussion (__Børre__) ** done, not finished !!!5. Corpus infrastructure !!General TODO: * remove headers and footers from antiword documents, other improvements as needed, including PDF conversion (__Saara__) ** in the works, also improvements to the PDF conversion *** problems with running the tools, they need some updated libraries in Victorio !!User accounts and access For details, see the [previous meeting memo|Meeting_2006-06-19], as well as the memo from a [dedicated meeting|http://divvun.no/doc/infra/corpus_policy.html]. !Shell access TODO: * make a group {{bound}} for our external corpus users, which: (__Børre__) ** gives access to read our bound texts ** gives access to execute/run the tools in {{/opt}} *** done * export to {{/opt}} (with cron) tools that the project team members find in their cvs tree (the bound users do not have a cvs tree, and therefore need these tools in {{/opt}} in order to conduct linguistic analyses) (__Tomi__) ** ccat (and some perl scripts?) ** other tools? *** nothing yet * make shell script wrappers for the most common commands for user friendlyness (__Trond__) * write user account form, probably ask for copy of existing ones from the IT centre (__Trond__ and __Sjur__) ** short discussion with Trond about what we need * possibly deploy the user account form as an HTML form (__Børre__) ** waiting for the form to be written * write documentation for our {{bound}} users, with pointers to the ordinary documentation (__Børre, Trond__) ** nothing done * write documentation for how to apply for a user account (where's the form, to whom do I send the form, who needs it, etc.) (__Børre__) ** nothing * make our own guidelines for the user application processing (__Børre__) ** nothing * make a test user (__Børre__) ** nothing * test corpus access as test user (__Trond__) ** nothing yet !Web browser access TODO: * discuss with Oslo (__Trond__) * delay other tasks until we are ready to go public? * user management for access to bound texts !!More texts to the graphical corpus interface: TODO: # refine xml-tagged output (__Saara__ and __Tomi__) ## done, but still open if it is finished # add text to the server (__Lars__) !!Aligner More to be said about this? (certainly, but right now?) !!Language recognition Still waiting for more __smj__ text to improve it. !!Corpus summary Forrest goes into an endless loop when processing these files. It happens when converting the XML to Forrest format. More info on our [bugzilla|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=309] It is now implemented, but a test to skip this summary option for smaller collections did not work. Thus all genres in all languages are treated the same. __Improvement suggestion:__ instead of summarize files under a certain limit, one could also split __Min Áigi__ into years, and display it year-wise. That will require some more info in the generated source file {{corpus-content.xml}}. TODO: * trim generated corpus summary pages (__Sjur__) ** suggestion: lump together files with content less than X paragraphs (X < 5?) *** done - Forrest seems to work fine now * add improvement suggestion to Bugzilla (__Børre__) ** done !!!6. Infrastructure !!Paradigm generation Goal: Reuse Greenlandic code for paradigm generation. __Saara__ has given a report on the PHP code in News. Please read. TODO: * convert or adapt the received PHP to our needs (__Saara__) ** code evaluated, evaluation reported !!Hyphenator TODO: * correct the treatment of hyphenation of word boundaries and exceptions (fst gymnastics) (__Sjur, Trond__) ** nothing since last meeting * Update the sma hyphenator rule set with the insights gained from smj updates (__Trond__ during summer vacation) !!Automatic Bugzilla reminder for untouched bugs TODO: * set up Bugzilla to send automatic reminders for bugs not touched in a given timeframe; ask Thor-Øivind for help if needed (__Børre__) ** turned on the feature, but it doesn't seem to work - needs some more investigation !!JSPWiki update Here's the pattern to use: {{{ egrep -C 3 -R "^\*.*[$]*{1,16}\#" * }}} TODO: * grep for all occurences of ^* followed by a line ^## and vice versa (the patterns above) (__Sjur__) ** Tomi tried it, but nothing found * correct the documents found to have consistent lists (__Sjur__) ** not done !!!7. Linguistics !!Derivation and spellers like Aspell It is impossible to create a model that can dynamically generate new derivations of verbs, and then let them nominalise for compounding. What we need is to extract all derived verbs not in the lexicon, and lexicalise most of them. We could potentially let the 5(?) most productive verbal derivations be unlexicalised, and generate them from our lexicon directly when creating entries. A tentive, first guess on which derivational suffixes this could be is (for sme): {{{ +st, +l, +h, -goahtit, (o)juvvot (passive) }}} Later we might consider lexicalising all other derivations found in our corpus. To make it easier to extract all derived stems, we should enhance the tags used for derivations in {{sme}} to make them easier to grep. The most straightforward solution is to make the tags follow the same pattern as for {{smj}}, {{+Der/NNN}}. Presently {{sme}} is only using the {{NNN}} part as a tag, where {{NNN}} represents the derivational suffix. That is, there is no single pattern to match against for {{sme}}. Example: {{{ "" "laktit" V TV goahti Ind Prs Pl3 @+FMAINV }}} Only the tag {{goahti}} is identifying that the word form is a derivation. It should be extended to {{Der/goahti}} as in {{smj}}. Then it is easy to grep for the pattern {{Der/}} to get all derived verbs. TODO: * change tagging of derived stems in the disamb output, to facilitate much easier extraction of these verbs (__Trond__) ** not done, the other tasks depend on this one * find and study all derived verbs in our corpus (__Thomas__) * suggest which derivations could be generated (__Thomas__) ** see source code above, but also consider overgeneration problems, as well as input from the statistical results in the previous point * lexicalise the rest (__Thomas__) !!Name double-tagging TODO: * Make a section under {{gt/doc/lang/smi/}}, add a chapter __Common linguistic resources__ under the __Languages__ section in {{site-frag.xml}}, and add the document below to that section (__Børre__) * write guidelines for annotators wrt. to name tagging and put them under {{gt/doc/lang/smi}}. A short version of the guidelines goes as follows: \\Systematic ambiguity (the __Trosterud__ (plc/sur) and __Aftenposten__ (org/obj) types) should not be doubletagged, but given primary tags (plc, org) only. Doubletagging should be confined to exceptional cases (__Eira__ (Fem/Sur/Plc), __Janne__ (Fem/Mal), such things. These guidelines should be added to our documentation file. (__Trond__) * when adding new names, only use one sem-tag unless there are known objects which belong to different sem classes(?) (__all linguists__) * write disamb rules to implement the system above (__Trond, Linda__) !!North Sámi TODO: * discuss findings with all of us (__Thomas__) * investigate and identify under which conditions Actio compounding is possible (__Thomas__) Following already derived verbs are not happy with further derivation. It seems like the most of them do not appear as Actio forms in first part of compounding either. The following holds for both ''sme'' and ''smj'': {{{ LEXICON MUITTASJ !Words ending -šit, -skit, -smit, -idit, -ldit, -git and 5-syllables, formerly directed to MUITAL +V+TV: MUITALStem ; !SHOULD be directed here as well: !Reflexives on -dit !Reciprocals on -dit, -(a)lit !Momentatives on -dit, -(a)lit, -ádit, -ihit !Frequentatives on -(a)lit, -(u)hit, -dit !Continuatives on -dit, -(u)hit, -nit !Inchoatives in -nit !Translatives on -dit !Essives on -dit and -stit !Causatives on -dit, -stit }}} !!Lule Sámi TODO: * add inc abbr to a new abbr lexicon file (__Thomas__) * add proper numeral analysis/treatment (__Thomas__) * add loanwords (e.g. latin -ere verbs) (__Thomas__) ** nothing done to any of these !!!8. Name lexicon infrastructure TODO: * finish refactoring for multiple collections in the search interfarce (__Sjur__) ** still not implemented * develop the needed XQueries and interface (__Sjur, Tomi__) ** done the inflection menu, but it doesn't work... * data synchronisation between risten.no and the cvs repo (__Tomi__) ** discussion started on eXist-list, nothing useful came up. We need to reformulate the question from our perspective, and bring it up again (__Sjur__) *** nothing done !!!9. Public tender We finally recieved their answer as well. TODO: * final evaluation based on the answers to the clarification letters ** done * present the final report to the board, and bring their decision into action ** done. Decision: Polderland. * write a contract (mostly done by __Finnut__, review by __Sjur__) * get it signed (__Finnut, Lennart Mikkelsen__) !!!10. Other !!Summer vacation || Who || When | Børre | 24.7 - 20.8 | Linda | ? | Maaren | on sick leave | Saara | July | Sjur | 3.7 - 23.7 + single days at other times | Thomas | 3.7 - 7.8 | Trond | 3.7 - 14.8 (last two weeks off at summer school) | Tomi | 8.7 - 16.7, 2 more weeks in July and/or August !!Bug fixing __43__ open Divvun/Disamb bugs (two down!), and __25__ risten.no bugs Guess: 1/3 of the bugs are fixed already (?) Please help __Saara__ with [bug 279|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=279] (Perl locale). Not much help... __Saara__ will contact __Roy__ on this issue. !!Gobby TODO: * install Gobby (__Saara__) * document its use within our project (__Børre__) ** done * review the [document|http://divvun.no/doc/tools/gobby.html] (__Thomas__) !!SEE 2.5 extensions Future extensions and wish modes: * write a perl script to extract all TODO items and sort them according to responsible person; wrap the script in a AppleScript available from the menu * twol mode * xfst mode * lexc mode * vislcg mode TODO: * move the above list to Bugzilla (__Sjur__) ** done !!Task lists as iCal entries This feature requires that the patch Sjur sent in to Forrest regarding parsing of nested lists within wiki documents is applied. It was applied to Forrest last Friday, thus all who is interested in using this feature from their local Forrest should update their installation. It is easiest done using Subversion (svn) to update your local copy. TODO: * create iCal entries from our meeting memos (__Sjur__ or __Børre__) ** done (__Sjur__) * update all forrest installations to latest svn HEAD (__Børre__) !!Project meeting in Tromsø in august? The project board has decided upon a meeting in Tromsø in august. We'll discuss the date later this week. !!!11. Next meeting, closing Next meeting is undefined due to summer vacation. Closed at 14:36. !!!Appendix - task lists for the next week !! Boerre [iCal|/doc/admin/weekly/2006/Tasks_2006-06-26_Boerre.ics] * corpus collection: ** send out contracts with accompanying letter ** Gather public texts, preferrably also parallel ones ** Send out letters to the rest of the Iđut authors ** contact __Ája__ (Kåfjord) ** send contracts to __Čálliid Lágádus__ ** Contact __Richard Valkepää__ at NSI about older Min Áigi and Ássu files. * corpus conversion: ** convert nob and nno bible texts to be used as part of a parallel corpus ** convert fin, swe to paratext or directly to our XML ** review the paratext2xml converter ** Move norwegian documents in Min Áigi from sme to nob * corpus access: ** possibly deploy the user account form as an HTML form ** make a test user ** Write both user and admin documentation (__Børre__, review: __Sjur, Thomas__) *** User documentation probably in several languages. This covers how to apply for an account, on what grounds one can apply, and pointers to documentation telling how to use the corpus. *** Admin documentation, telling how we set the permissions to the corpus files, and whatever other processes and tasks needed to set up a corpus account. * set up Bugzilla automatic reminders for open issues * create document & document entry for name double-tagging * Update forrests to latest svn version * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Maaren * On sick leave !! Saara [iCal|/doc/admin/weekly/2006/Tasks_2006-06-26_Saara.ics] * Create a parallel corpora of the new testaments * add more texts to the graphical corpus interface * grammatical searchability in the graphical corpus interface * Implement parallel corpus upload in web upload script * Install Gobby * Test the aligners once again * refine the xml output of the xml-tagged analyses * convert or adapt the received PHP for paradigm generation to our needs * remove headers and footers from antiword documents, other improvements * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Sjur [iCal|/doc/admin/weekly/2006/Tasks_2006-06-26_Sjur.ics] * public tender: ** review letters to tenderers, contract for subcontractor * fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer * name lexicon: ** implement editing functions ** finalise refactoring for multiple collections (regular search interface) * review user and admin documentation for corpus access * write user account form, probably ask for copy of existing ones from the IT centre (with Trond) * correct jspwiki docs with mixed lists * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Thomas [iCal|/doc/admin/weekly/2006/Tasks_2006-06-26_Thomas.ics] * investigate productivity of even-syllable Actio compounding * investigate and identify under which conditions even-syllable Actio compounding is possible * discuss findings with the rest of us * add proper numeral analysis/treatment to smj * add loanwords (e.g. latin -ere verbs) to smj * sme G3 issue * review user documentation for corpus access * create smj abbr file * review the [document|http://divvun.no/doc/tools/gobby.html] * Redirected following three syllable verbs and prevent them from being first-part Actios in compounds: ** Reflexives on -dit ** Reciprocals on -dit, -(a)lit ** Momentatives on -dit, -(a)lit, -ádit, -ihit ** Frequentatives on -(a)lit, -(u)hit, -dit ** Continuatives on -dit, -(u)hit, -nit ** Inchoatives in -nit ** Translatives on -dit ** Essives on -dit and -stit ** Causatives on -dit, -stit * find and study all derived verbs in our corpus (depends on __Trond__) * suggest which derivations could be generated (depends on __Trond__) !! Tomi [iCal|/doc/admin/weekly/2006/Tasks_2006-06-26_Tomi.ics] * new proper name lexicon ** data synchronisation of proper nouns between risten.no and CVS ** XQuery refactoring and code development for our proper noun editor ** new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry * read aligner docu, install, provide feedback * Set up the mechanism for the hash-mark transducer package * test the new xml output of the xml-tagged analyses * export corpus tools to {{/opt}} (with cron) * [fix bugs!|http://giellatekno.uit.no/bugzilla] !! Trond [iCal|/doc/admin/weekly/2006/Tasks_2006-06-26_Trond.ics] * better smj NT text * get fin, swe, nob and nno NT and OT in __paratext__ format * install aligner, test it and give feedback * fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer * make shell script wrappers for the most common commands for user friendlyness * write user account form, probably ask for copy of existing ones from the IT centre (with Sjur) * write documentation for our {{bound}} users, with pointers to the ordinary documentation * write documentation on double-tagging names * discuss web-only user access management with Oslo * change tagging of derived stems in the disamb output, to facilitate much easier extraction of non-lexicalised derivations * [fix bugs!|http://giellatekno.uit.no/bugzilla].