Meeting on smn-dictionaries and cifu-talk Edmonton, Rotterdam 16.6. Present: Lene, Marja-Liisa, Trond * Status ** What did we promise * Our plan * What to do * Division of labour * time plan * Who are going * Next meeting !! Status !The new dictionary The new dictionary has arrived, in two files. It is added (in .xlsx and .csv format) in the folders * smnfin/inc/2015 * finsmn/inc/2015 It was made according to the following principles * many fin --> one smn * one smn --> one fin Lacunas, e.g. myettiđ: muáttá (to snow) ! What did we promise We will # present a preliminary finite-state transducer for Aanaar Saami, and # combine it with different Aanaar Saami dictionaries and word lists: ## A large Aanaar Saami - Finnish dictionary ## A North Saami - Aanaar Saami transfer lexicon For each of the dictionaries / word lists, we will show # what degree of coverage the combination of dictionary and transducer will give on relevant text types, including ## school textbooks, ## children's fiction, ## biblical and other religious texts, ## writings on language and ## blog/Facebook-type prose. # We will run the coverage tests both on analysers representing the standard language, and on analysers including a component tolerating a certain amount of orthographic variation. !! Our plan !! What to do Where to publish this? * SDÁ: 31.8.2015 - possibly not the best channel for this * Saami scientific article in Finland * CIFU proceedings * Other channels? A possible further article School dictionary adjusted to school children, on the basis of similar dictionary for sme. This will be more grounded in Saami linguistics, less in language technology, and has thus higher relevance for SDÁ. Example of problem: ''What is a student dictionary and what that means for Saami lexicography.'' The next step would then be to link this dictionary to the corpus. How can we selet the best sentences from the corpus? Cf. the literature on SketchEngine, and in Gothenburg. [Documentation about dictionary work|/dicts/dictionarywork.html] !! Division of labour * Francis MT, * Ciprian sme-fin/fin-smn * Ryan: NDS implementation * ML/E: ** translate smesmn missing list ** translate and adjust NDA tags and configure-files ** make lists: what smn-texts to use for the evaluation of the dictionary ** Improve on the smefin dictionary ** Evaluate the result of the smesmn parallellisation *** Add translations to words/dicts/smefin/inc *** Read and correct translations in words/dicts/smefin/src * LA: Analyse and write * TT: Analyse and write {{{ cd cd main/words/dicts/ ls All smefin etc catalogues contain folders inc = incoming bin = gbinary files from doing sh smefin.sh etc src = }}} !! time plan * Parallel first phase ** Ciprian to build first sme-smn candidate bidix, as soon as possible ** Ryan to implement ferst version of NDS, have something done before thursday * Parallel second phase ** ML, ES to evaluate and correct Cip's smesmn ** When phase 1 is done, there will remain residues *** sme from smefin not found in sme-smn *** fin from smefin not found in fin-smn *** smn from new finsmn not found in sme-smn ** ES, LA to translate residue lists ** LA, TT, ML to analyse smnfin / finsmn * Parallel second phase ** Evaluate result of Ciprians first run ** Evaluate Ryan's NDS * Third phase ** Start working on the issues mentioned in the promise list above Meeting in Tromsø late next week (thursday after lunch?) where we plan the summer. !! Who are going At least ML and Trond, Ryan? !! Next meeting Thursday 18.6. Topic: NDS implementation issues !! Notes {{{ f4 f8 aakkos järjestys => aakkosjärjestys aallon murtaja => aallonmurtaja -f4 alimmainen, alin eteläisempi, etelämpi aapa(suo) alkeis- armelias ~ armias ~ armollinen ellen, -et, -ei (jne.) ensi(-) hallussa, -sta -hiuksinen istuallaan, -een itsestään, -sään jämät (mon.) 120 commas, 20 parentheses, 21 tilde in column f4 cat finsmn/inc/2015/Suoma_saami_16062015.csv |grep 'aarto' -f8 4 8 -uutiset aamu -uutiset (arttukatos) aarto (arttukatos) -orvokki aho -orvokki (mänty) aihki (mänty) alus lakki, -huppu etu puolella, -puolelta joko - ta(h)i => joko - tahi, joko - tai kaari sulut (mon.) kalan perkeet (mon.) 18 commas, 13 parentheses, 3 tilde in column f8 cat finsmn/inc/2015/Suoma_saami_16062015.csv |cut -f4|grep '-' }}} !!Algorithm Algorithm for building lemmas (resolving formatting) in the columns 4, 8 in finsmn: __TODO__ list for dictionary processing, how to handle smn-fin: * comma: duplicate line ** eteläisempi, etelämpi => eteläisempi AND etelämpi * parentheses written as one word: Expand: ** aapa(suo) = aapa AND aapasuo ** joko - ta(h)i => joko - tahi, joko - tai * space parenthesis, hence written as two words: Duplicate ** aarto (arttukatos) => aarto AND arttukatos * TODO with tilde: Expand: ** armelias ~ armias ~ armollinen => armelias AND armias AND armollinen * komma space hyphen (sometimes multiple cases, sometimes compounds): * Expand for cases: Remove as many letters as you add (3 letters away and add the same 3): ** tasalla, -lta => tasalla AND tasalta * Expand for compounds: Here, the first part is in f4 and the rest in f8 ** aluslakki, -huppu => aluslakki AND alushuppu (aluslakki + aluhuppu * hyphen initially in f8: It belongs to the lemma ** aho -orvokki => aho-orvokki * The string '(mon.)' occurs 3 times in f8 and ome in f4, it means (Plural) and should be deleted (we recover it for fin later) ** sulut (mon.) ** perkeet (mon.) ** liinat (mon.) ** jämät (mon.) * (jne.) means "et cetera" and should be ignored The relevant columns are 4, 8. Ff. readme documentation, cf. also: A11=CONCATENATE(D11,H11," ",J11," ",K11,)