This directory contains the files relevant to the smaoahpa application.

 - src: the source files with the lexicon smaX (i.e., smanob, smaswe, smaeng)
 - Xsma: the reverted files from smaX to Xsma

Caveat: the reverted files are already frozen, i.e., they are ready to be
extended with synonyms and the like. The exception is the swesma dir:
at the moment it is not worth reverting the smaswe files, because they
contain too few real translations for swe.

The following is the summary of the CLT meeting:

1. Topic: nobsme handling of MWEs and stat="pref"

 1.1 extract all sma-MWEs into a separate file - done
 1.2 add an ID, as in smenob, to entries that would otherwise be duplicates
     with respect to lemma string and pos string;
     Ex.: entry for "ungen" - done
 1.3 delete entries that got stat="pref" only on the basis of
     MWE entries - done
 1.4 according to the latest specifications by Lene, don't merge nob
     entries with stat="pref":
   1.4.1 add the dispreferred sma-translation to each created entry
         with stat="pref" - done
   1.4.2 for entries with the same nob lemma, add the preferred
         sma-translations to each other as acceptable answers - done

@cip: From my point of view, the merging process of the reverted nobsma
data is now finished.

Ex. 1 (only one entry with this lemma in the whole file)

   <== 1. every entry has an ID
       2. stat=pref flag from the smanob-data in the nob-entry
 rovdyrfritt
   <== structure simplification: no apps/apps, just sources
   <== semantics element on the mg-level, NOT on the tg-level anymore
   <== 'tg' means 'target language group' and is flagged with the
       language flag
 aales
   <== because of the new meaning of 'tg' there is no need for a
       lang-flag on the t-level; default flag for stat="pref" that
       can be changed manually as needed

Ex. 2 (several entries with the same lemma string):
A good example of this is "dårlig"!

 1. a number in initial position of the ID means that there is more than
    one entry with the same lemma string in the file
 2.
    in addition to the t-element from the reverted entry, there are the
    t-values of the parallel entries with the same nob lemma string as
    acceptable answers for the LEKSA game; each of them carries the info
    on semantic class and book, and a flag nob-stat meaning
    "I am a default t in a parallel nob entry"
       mådtan
    as well as all t-element values that don't have a stat="pref" flag
    in the smanob files, i.e., with only sem-cl and book info.
       nåekies

 dårlig geerve madtan mådtan nåekies
 dårlig madtan geerve mådtan nåekies
 dårlig nåekies geerve madtan mådtan

============
VERY IMPORTANT:
============
Due to the changed format, the following places have to be adapted
accordingly:
 1. for the work with XMLmind: dtd and css file
 2. for the db feeding: Ryan's Python scripts
=============================================

Observations when feature merging:
 O-1: books that come from different types of entries (pref vs. non-pref)
      have to be marked as such
 O-2: sma translations that stem from different types of entries
      (pref vs. non-pref) have to be marked as such wrt. semantics,
      because these features will be merged

Test BEFORE unification of mg in nobsma:
data_sma>grep -h '
 307 77 19 5 2 1

b. Difficult automatic processing (the rest):
 b.1: - When should two or more mgs stemming from a nob-translation
        with stat="pref" be unified?
      - Only when their sem-classes overlap totally, or also when they
        overlap partially?
      - What if their sem-classes are totally different? Shall they
        get separate entries with different IDs, as with the sme-oahpa
        data, or not?
 186 19 4 1

 b.2: same questions as in b.1, plus the additional question of which
      mg from the preferred ones shall get which translation from the
      dispreferred ones
 54 21 10 6 6 2 2 2 1 1 1 1 1

Another question is about the interplay between the scope of semantic
classes and that of the books after reverting the smanob to nobsma.
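The three cases raised under b.1 (total overlap, partial overlap, no
overlap of semantic classes) can be told apart mechanically before
deciding what to do with each of them. A minimal sketch, assuming the
semantic classes of each mg are available as a collection of strings;
the function name classify_overlap and the class labels are hypothetical
and not taken from the nobsma data:

```python
def classify_overlap(sem_a, sem_b):
    """Classify how the semantic classes of two mgs relate.

    Returns "total" if both mgs carry exactly the same classes,
    "partial" if they share at least one class but not all,
    and "none" if they share no class at all.
    Note: two empty collections compare as "total".
    """
    a, b = set(sem_a), set(sem_b)
    if a == b:
        return "total"
    if a & b:
        return "partial"
    return "none"


# Hypothetical class labels, for illustration only.
print(classify_overlap(["FOOD", "PLANT"], ["PLANT", "FOOD"]))
print(classify_overlap(["FOOD", "PLANT"], ["FOOD"]))
print(classify_overlap(["FOOD"], ["ANIMAL"]))
```

The sketch only sorts pairs of mgs into the three cases; whether
partially overlapping mgs are unified or kept apart remains the open
question above.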
New statistics after cleaning up the entries that are relevant only for
morfa, marked with an initial "m" in the semantic class:
data_sma>grep -h '
 292 186 77 53 19 19 19 10 6 5 5 4 2 2 2 2 1 1 1 1 1 1 1

2. Topic: level simplification in the dictionary from 3 to 2 levels in
   the meaning groups

 2.1 structurally there are still three levels:
   - mg: meaning groups
   - tg: target language group
     THIS is the difference: this group does NOT denote a slight
     difference in translation wrt. some shade of meaning; it only
     groups translations by target language.

   Ex. from the original Cip's dream files:
     láibi brød fladbrød leipä bread
   vs. not grouped based on target language
     láibi brød fladbrød leipä bread

   The CLT-group voted unanimously FOR the Cip's dream solution!

   A small note wrt. this solution: all sme mgs in the smaX files will
   now be part of the mgs containing nob and swe, which is in the very
   spirit of Cip's dream.

 2.2 tasks:
   2.2.1 unify meaning groups that have been separated ONLY because of
         the sme language: this can be done ONLY if the sme mgs and the
         non-sme mgs are parallel
   2.2.2 split (old) tg into different groups if the semantics are
         different: this is possible ONLY if there are semantic groups
         with ANY tg in the mg
   2.2.3 group (old) tg into the same meaning group if the semantics are
         the same: this is possible ONLY if there are semantic groups
         with ANY tg in the mg (see the pre-tests below)

================
Starting testing for level simplification (it is not that simple):
 - excluding files: names.xml propPl_smanob.xml

Test 1: checking the content of each e-element:
 Question: How many mg-elements are there? Should they be unified
 (because of the lang feature sme) or left as they are (because they
 represent genuinely different meanings)?
sma>grep -h '
 793 181 30 18 7 5 4 2 1 1

As agreed upon with Lene, we ignore the sme-mg for this task.

Test 2: checking the content of each mg:
sma>grep -h '
 202 90 32 6 1

The data seems to be ready for an automatic restructuring.
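The kind of count asked for in Test 1 and Test 2 (how many mg per e, how
many tg per mg) can also be computed with a small script instead of
grep. A minimal sketch, assuming the dictionary files are well-formed
XML in which e elements contain mg elements and mg elements contain tg
elements; the function name child_count_histogram, the root element and
the sample data are hypothetical and not taken from the real smaX files:

```python
from collections import Counter
import xml.etree.ElementTree as ET


def child_count_histogram(root, parent_tag, child_tag):
    """Map 'number of direct child_tag children' to
    'number of parent_tag elements having that many children'."""
    hist = Counter()
    for parent in root.iter(parent_tag):
        hist[len(parent.findall(child_tag))] += 1
    return hist


# Hypothetical sample: one entry with one mg, one entry with two mgs.
SAMPLE = """
<r>
  <e><mg><tg/></mg></e>
  <e><mg><tg/><tg/></mg><mg><tg/></mg></e>
</r>
"""

root = ET.fromstring(SAMPLE)
mg_per_e = child_count_histogram(root, "e", "mg")
tg_per_mg = child_count_histogram(root, "mg", "tg")

# Pairs of (number of children, number of parents with that many children).
print(sorted(mg_per_e.items()))
print(sorted(tg_per_mg.items()))
```

For real files, ET.parse(path).getroot() would replace ET.fromstring, and
the excluded files (names.xml, propPl_smanob.xml) would simply be left
out of the list of paths.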