This is a document for ideas and discussion about the new Oahpa code. workshop-week from 19. Sept !!!IDEAS Requirements, from different perspectives: * programmer * linguist * teacher * students !!Today: Special features, not used for all languages ! Leksa: * audio-files (crk-oahpa, vro-oahpa) ** for the time being: audio files are not functioning with Safari {{{ kohkôs}}} * "almost correct"-feedback to the student (tcomm) for sma-oahpa, sme-oahpa {{{ mors eldre søster eldre søster til mor tante }}} * different colours in teh interface for different animacy (crk-oahpa) ** crk-oahpa: green for N+AN, or animacy="AN" * Place names as an option (sme-oahpa, sma-oahpa) * Multiword expressions as an option in menu (sme-oahpa, sma-oahpa) ** the target language is multiword expression \\ It is important to distinguish this from the ordinary Leksa choices, so the user knows that Leksa is not aksing for "the Saami word" for so-and-so, but for something which is more than one word. ! Morfa * Dictionary (NDS) is integrated (crk-oahpa) * only-sg choice in menu (fkv-oahpa, rus-oahpa) * Transitive verbs (TI) are presented with object in Morfa-S (crk-oahpa) {{{wâpahtam}}} * Translation tasks in Morfa-C (crk-oahpa) ** [pronouns|http://oahpa.no/nehiyawetan/morfac/p/] ** [verbs|http://oahpa.no/nehiyawetan/morfac/v/] choose II verbs * What kind of agreement or other rule inducement is used or is needed ** rus-oahpa: agreement of subject and main verb in number for pres/past, in gender for past tense (in elaborated multiple sentence tasks may be actual, not grammatical gender for professions) ** rus-oahpa: agreement of modifier (adj or adj-l pron) and noun in gender and case in singular, MFN as gender (for modifier) and case in plural. Implies also change of gender for Adj Sg Nom -> Pl Nom task: {Msc, Fem, Neu} -> MFN ** rus-oahpa: verb passive forms may be congruent with reflexive verb forms, whereas not all reflexive verbs from dictionary are suitable to be used for questions-answers in active voice (passive is not curently used in tasks) ! FST with error-tags (Err) * only for sme, in old infra: {{{lookup -flags mbTT -utf8 /Users/lan000/errortag-gt/gt/sme/bin/ped-sme.fst}}} and is used for Vasta and Sahka. e.g. {{{viessui viessu+Build+N+DiphErr+Sg+Ill vissui viessu+Build+N+Sg+Ill lavka lávka+N+Sg+Nom+AErr lávka lávka+N+Sg+Nom lávka lávka+N+CGErr+Sg+Gen lávkka lávka+N+Sg+Gen }}} We want this for more languages, but implemented in another way than for sme. By concatenating FSTs instead of messing up the source files, which is a big work to maintain. That we have done for numeral transcriptor in crk (Err/ShLo means error marking short/long vowel) {{{peyak 1+Err/ShLo pêyak 1 }}} ! Numra * Numra: special feedback for tag {{+Err/ShLo}} (crk-oahpa) ** crk-oahpa: you can test with peyak for 1 ** rus-oahpa: tag {{+Err/ShLo}} is added if one or more stress marks are missing in a numeral string, accepted as a correct answer. The student's answer with incorrect stress marks is not accepted (although almost correct). !!Different ways of adding limitations for words or forms !Leksa * exclude="smenob" in e-element (smeOahpa, smaOahpa) ** for excluding nobsme: remove the nobsme entry * (crkOahpa) !Morfa completily * morfas="no" in l-element (smeOahpa) * gen_only="None" (crkOahpa) !Morfa partly * to be included in some Morfa tasks: gen_only="Pl,Sg" (crkOahpa) * to be included in one Morfa task: (crkOahpa) * to be included in Morfa tasks: (fkvOahpa) * to be included in Morfa tasks: (smeOahpa, smaOahpa) !!Interface design * all relevant menus in the top, also PoS choice (crk-oahpa) * instructions is given over the tasks, instead of in a box to the right (crk-oahpa) * morphological feedback appears to the right of the question it is referring to (crk-oahpa) * tool tip in instructions, eg. example of how to answer (crk-oahpa) * tool tip in morphological feedback (crk-oahpa) * special characters are given in a typer for all programs (crk-oahpa) ** it appears when you click in the blank * menu for grammar explanations, with links to a web grammar (sme-oahpa) * dialect choice in menu (sme-oahpa and sma-oahpa) ** only one dialect is presented in the tasks, but both dialects are accepted as input from the student ** the morphological feedback is according to the chosen dialect * all instruction is in target language, but the student gets a tool tip translation by clicking Alt (sme-oahpa) * short verbal explanation of the game topic under each game icon and name (vro-oahpa) !!New ideas (don't hesitate) ! General structure ideas Will need some discussion: Oahpa needs more separation within the source code, there is a lot of logic relating to rendering exercises spread across various parts. One way to simplify this could be separating Oahpa into two major pieces: (1) linguistic and exercise generation functions, which could be separated out into an internal-use-only JSON API, (2) frontend stuff, like presenting exercises to users, tracking user login, course management, etc. These two pieces could either exist as their own separate codebases, or be in one. Separating them should allow us to write unit tests for exercise generation with a fake language only for Oahpa testing, and ensure that as language needs change, we have less surprises. Each language project should have a central configuration setup for linguistic info and meta-info, and we should avoid copying code for each new project. Other: * Newest version of Django * Use django-rest-API for exercise generation pieces? * Word database structure is otherwise great, tagsets, semantics, etc., but it may be possible to extract wordforms from this (see below) ! Exercises * Defining Morfa-S and Leksa exercises should be as simple as editing YAML; some bits of this are coded already in the inner workings of morfa-s, but this should be the default for morfa-s, leksa, and numra-type exercises that require no syntax. More info below * Exercise code needs to produce very simple output, so that we can write automated tests with a sample project and maybe with a fake language project where we only add linguistic data we need to test * Morfa-C: we can write some generalized agreement process that allows a language maintainer to easily define syntax relationships. These have been hard to add in python, and require significant work, but it could be simplified... Ryan may have a sample checked in somewhere, when it comes to it ! Exercise Definitions For exercise definitions there are three parts: 1.) defining how menus appear, and what content goes in them 2.) defining how the questions appear 3.) defining connections between (1) and (2) For leksa, numra, and morfa-s, the general structure is shared: question is a word with a particular form, answer is a question with a particular form. Question form: * internal question ID, for making menus, etc.? * language iso * word semantics * word morphology * context Answer form: * language iso * word semantics (usually same) * word morphology (leksa: usually lemma, morfa-s: cases, tense, etc) * context Some of the ideas here exist in NDS (context-dependent question context) ! Database As far as django is concerned, there should be a separate lexicon/morphology database from user and login data. It should be possible to install wordforms as a separate part of the process: * Example: the word structure, semantics, and Morfa-C exercises all do not need to be updated, but wordforms have changed and need to be replaced * Example: wordforms are fine, but Morfa-S or Leksa question/answer pairs have changed * Example: wordforms are fine, but Morfa-C has a new question, or a changed question * Example: wordforms are fine, questions are fine, but semantic sets need to be changed * Example: maybe more ideas ... An option is using the FSTs "live", and only caching the last forms generated, unless the FST changes. Forms would be incrementally updated as users request them, and an install process could install only the word metadata structure, which should make the install process faster. Typically, unless the FST is very slow, these lookups can happen very quickly. We can rely on an updated lookup server daemon, which already exists, to speed up this process as well. (See main/ped/lookupserv/) ! Install definition The install process should be defined in some system other than a `bash file, perhaps also .yaml, alongside the systemwide configuration. ! User feedback A button "Report an error" (wrong word form, etc.). The error reports will be saved and the expert (teacher or linguist) decides if it really was an error. ! User activity logging Logging of the activity of all users, not only of those who are logged in. ! Would be very nice to have ideas ... but not immediately crucial * Morfa-C could use a word / wordform / semantics browser interface to simplify our troubleshooting processes, and help define exercises. Export XML? !!What to send to analyser !Today * Numra is sent to transkriptor-FST, with error-tags, * Vasta and Sahka is sent to FST with error-tags !New code * Send Morfa to FST (API) with error-tags * Send also Leksa to FST with error-tags? !!Modularity We should get the linguistic data out of the system files !Linguistic data Different kinds of data: * information in the source files * definitions of how to display content based on morphology, contained within python (ex. if noun, add verb pronoun context, if Essive, no Sg or Pl.) * coordinated with the information in the systemfiles ** for choices in menus ** for producing the tasks Contents of drop-down menus (semantic supersets, case a.o. grammatical category lists, source lists (books and chapters), etc.) should be in separate file(s), not inside the program code files. * Create a procedure for automatically generating these lists out of the Oahpa source files (lexicon, tags, paradigms, semantic sets). * In the teacher interface there could be an option to check the contents of the automatically generated menus and to suppress the unwanted items. Small FST for Oahpa? Containing the lexicon from ped/LANG/src/ This may be included in the standard newinfra build process: ./configure --enable-minimal-oahpa-fst !System files !!Audio files/Text-to-speech There are some audio files in Leksa in crk-oahpa and vro-oahpa. In vro we are also developing a new feature: reading aloud Morfa-C questions with the help of text-to-speech software. !!How to make Oahpa for new languages !!Maintainence * Linguist interface. The linguist should be able to install/remove new lemmas and tasks. * Teacher interface, e.g. (taken from oahpapres.pdf): Good morning, Dorothy! # Mark (in the list) the part of speech and the morphological inflection you want Morfa S task for # Mark (in the list) the lemmas you want to include in the task # Write the instruction for the students # – Wait, while the program is compiling – # Test the tasks. Click for the next step. !!Consistency in naming * Aim to introduce/keep names the same for the same things (and better different for substantial differences) * Tag collisions ** Different uses of +Pass tag in sme and rus-oahpa (affects also nob, smn, sms, fin, est, vro, fkv, fao, mhr, vep) ** rus-oahpa: some verb forms have secondary tags that are confused even with PoS tag (Adv) * Database fields ** Database tables have columns named with actually reserved words: string, case. This could be avoided. !!Konteaksta We have a version for sme and two versions for rus (Ruskonteaksta and RusVIEW). !!Student-log-in We have a demo-version !!Experiences from working with Oahpa from different languages ! Pain points # database update takes time, sometimes changes are small (i.e., subset of wordforms updated in FST, but still need to run whole install) # database update requires a lot of awareness of an individual oahpa installation # database: lots of places to configure data being installed, adding files is not transparent # exercises: Morfa-C: adding new types of agreement is tricky # exercises: Morfa-S: new types of exercises are also tricky # exercises: Numra relies on some external FSTs, which Morfa currently does not, need better error feedback for whoever is maintaining the processes that these are missing when the process starts # exercise: sahka / vasta lookupserv is tough to configure (new version exists: main/ped/lookupserv/) # general: errors are opaque, the backend could do a better job of explaining what is wrong Document the differences between the versions The full list (24 Oahpas), is, as always, [http://giellatekno.uit.no/ped/index.html] . URLs starting with testing.oahpa.no are on gtlab, URLs with oahpa.no are on gtoahpa. Most differences follow from the availability of basis resources. * Only sme has a full CG program, so only sme has Sahka * We are now working on Vasta-F for translation (with CG) for crk and est * Languages without CG but with an FST typically have 4 programs * Languages also an FST typically have 2 programs, Numra and Leksa !sme * [http://oahpa.no/davvi/] * 6 programs: Morfa-S, Morfa-C, Leksa, Numra, Vasta-S, Vasta-F, Sahka !sma * [http://oahpa.no/aarjel/] * 4 programs: Morfa-R (= Morfa-C), Morfa-B (= Morfa-S), Leksa, Numra !est * oahpa: [http://testing.oahpa.no/eesti/] * docu: [http://giellatekno.uit.no/ped/est-oahpa.html] * 5 programs * Morfa-C noun: "limited mix" exercises: a) choose the right case for the object (Sg Gen, Pl Nom, Sg Par, Pl Par) b) choose the right case for the location: Illative vs. Allative (into / onto), Inessive vs. Adessive (in / on) !vro * [http://testing.oahpa.no/voro/] * 4 programs * spell-relax issues because of different ways of marking palatalisation and glottal stop * some audio files in Leksa (the full lexicon coming soon) * TTS integration under development * Morfa-C noun: "limited mix" exercises !crk * [http://giellatekno.uit.no/ped/crkdoc/oahpa/crk-oahpa.html] * 4 programs * crk has some audio files. !liv * http://testing.oahpa.no/livokel/] * http://giellatekno.uit.no/ped/liv-oahpa.html] * 4 working programs * The code for this setup is the same as for the other * This is the language with the most complicated writing system, and it may therefore be a testing ground for different spellrelax issues. !rus * http://giellatekno.uit.no/ped/rus-oahpa.html] * http://testing.oahpa.no/rusoahpa/] * 4 programs * Russian has a mechanism for handling stress * rusoahpa should be moved from gtlab to gtoahpa !sms * [http://oahpa.no/nuorti/] * [https://giellalt.uit.no/lang/sms/j-sms.html] * 4 programs * Morfa-S nouns: diminutives, possessives and their combinations !!The other Oahpas (probably not so relevant for the initial part of the work) !izh * [http://testing.oahpa.no/izh_oahpa/ ] * [http://giellatekno.uit.no/ped/izhdoc/izh-oahpa.html] !mhr * [http://testing.oahpa.no/mhr_oahpa/] * 4 programs !mrj * [http://testing.oahpa.no/mrj_oahpa/ ] * [http://giellatekno.uit.no/ped/mrj-oahpa.html] 4 programs !myv * [http://oahpa.no/erzya/] * [http://giellatekno.uit.no/ped/myv-oahpa.html] * 4 programs !fkv * [http://giellatekno.uit.no/ped/fkv-oahpa.html] * [http://oahpa.no/kveeni/] * 4 programs !mdf * [http://giellatekno.uit.no/ped/mdfdoc/mdf-oahpa.html] * [http://testing.oahpa.no/mdf_oahpa/] 4 programs !bxr * [http://testing.oahpa.no/bxr_oahpa/] * 3 programs working: Leksa, Numra, Morfa-S !olo * [http://giellatekno.uit.no/ped/olo-oahpa.html] * [http://testing.oahpa.no/olo_oahpa/] * 4 programs !yrk * [http://giellatekno.uit.no/ped/yrk-oahpa.html] * [http://oahpa.no/yrkoahpa/] * 4 programs !!FST exists but it not really implemented !hdn [http://oahpa.no/hdn_oahpa/] !!The FST-less Oahpas !rup [http://oahpa.no/armaneshte/] http://giellatekno.uit.no/ped/rupdoc/rup-oahpa.html] !kpv http://oahpa.no/kpvoahpa/] http://giellatekno.uit.no/ped/kpv-oahpa.html] Leksa, Numra !sjd sjd has some audio files, and plans for extending them. Leksa, Numra !udm * [http://giellatekno.uit.no/ped/udmdoc/udm-oahpa.html] * 2 programs !smn * [http://oahpa.no/aanaar/] * Two programs: Leksa, Numra