Goal: Make a multipurpose smenob.xml dictionary:
- For use on the net
- As a basis for machine translation
- For glossing of analysis

Work at the moment:

The dictionary, DTD and CSS are in
smenob.xml
smenob.dtd
smenob.css

The incoming words are in the following files:
 2778 inc-missing-adj
 3379 inc-missing-adv
10289 inc-missing-nouns
   80 inc-missing-pron
10900 inc-missing-verbs
27426 total

These should be translated and thereafter added, in the following way:

Preamble:
There are about 27000 untranslated words. We will thus have to set priorities as to what to translate first and what later. Here are the principles for what to prioritise:

a. Translate whatever can be done semi-automatically (all words in -logiija should be copied and translated to -logi, etc., for several classes of loan words; compounds with -láhka, -giella, etc. could get the last part of the compound translated automatically, with the first part then done manually)
b. Translate all the closed classes (all except noun, verb, adj) manually
c. Go relatively quickly through the lists and translate the easy ones
d. Check against frequency lists and translate the common ones

Conversion principles

Words should be POS tagged (Sámi) and POS and gender tagged (Norwegian). The POS tagging of the inc files is now like this:
smeword poscode
The task is then to add a nob translation:
smeword poscode nobtranslation
and thereafter add it to smenob.xml with Børre's script.

In order to do that:
1. Identify a part of one of the inc-missing files which can be translated (semi-)automatically
2. Cut it out of the inc-missing file, and paste it into both the inc-today-a AND inc-today-b files (SubEthaEdit is a nice editor for this)
3. Leave inc-today-a as is
4. Translate the Sámi of inc-today-b into Norwegian
5. Change the POS mark, if necessary
6. At the end of the day, run Børre's script for today-to-xml conversion (note that there must be exactly the same number of lines in the a- and b-documents!)
7. Empty the inc-today files
8. Call it a day, and go home.

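Principle (a) and the line-count requirement in step 6 can be sketched in Python as follows. This is only an illustration, not Børre's script: the tab separator, the POS code "N", and the helper names draft_translation and pair_lines are assumptions made for the example. The only suffix rule included is the -logiija to -logi rule named above, and its output is a draft that still has to be checked by hand, since the Norwegian spelling of a loan word may differ in more than the suffix.

```python
# Assumption: each inc line is "smeword<TAB>poscode"; the real separator may differ.
SUFFIX_RULES = [("logiija", "logi")]  # the only rule named in the notes


def draft_translation(sme_word):
    """Return a draft Norwegian form if a suffix rule applies, else None."""
    for sme_suffix, nob_suffix in SUFFIX_RULES:
        if sme_word.endswith(sme_suffix):
            return sme_word[: -len(sme_suffix)] + nob_suffix
    return None


def pair_lines(a_lines, b_lines):
    """Pair inc-today-a with inc-today-b line by line; refuse unequal lengths."""
    if len(a_lines) != len(b_lines):
        raise ValueError(
            f"line count mismatch: {len(a_lines)} vs {len(b_lines)}"
        )
    return list(zip(a_lines, b_lines))


# Hypothetical content of inc-today-a (POS code "N" is an assumption).
a_lines = ["biologiija\tN", "geologiija\tN"]

# Build a draft inc-today-b: apply the rule, keep the word unchanged otherwise.
b_lines = []
for line in a_lines:
    word, pos = line.split("\t")
    b_lines.append(f"{draft_translation(word) or word}\t{pos}")

pairs = pair_lines(a_lines, b_lines)
for sme, nob in pairs:
    print(sme, "->", nob)
```

Running a check like pair_lines before the conversion catches the case where a line was accidentally deleted or split while translating inc-today-b.
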
risten.no
=========

The words from risten.no were added according to the following procedure:
1. Extract sme-pos-nob-pos pairs
2. Add them to smenob.xml
3. Make a transducer smedic.fst of smenob.xml (extract lemma, xfst < read words
4. Run noun-sme-lex.txt etc. against this transducer, and make new, leaner inc-missing-POS files
5. Carry on the manual work (1-8 above)
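
The idea behind steps 3 and 4 is to drop every incoming word whose lemma is already in the dictionary. The notes do this with an xfst transducer. The sketch below swaps in a plain Python set lookup to show the same filtering, and is not the procedure actually used. The sample XML, the lemma element name "l", and the function names are assumptions for illustration; the real element names should be taken from smenob.dtd.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature dictionary; the real smenob.xml structure may differ.
SAMPLE_XML = """<r>
  <e><l pos="N">giella</l></e>
  <e><l pos="V">boahtit</l></e>
</r>"""


def known_lemmas(xml_text, lemma_tag="l"):
    """Collect the text of every lemma element into a set."""
    root = ET.fromstring(xml_text)
    return {el.text.strip() for el in root.iter(lemma_tag) if el.text}


def still_missing(candidates, lemmas):
    """Keep only the candidate words not yet in the dictionary, in order."""
    return [word for word in candidates if word not in lemmas]


lemmas = known_lemmas(SAMPLE_XML)
missing = still_missing(["giella", "viessu", "boahtit", "beana"], lemmas)
print(missing)  # ['viessu', 'beana']
```

A set lookup only matches identical strings, so it is adequate for a lemma list but does not replace the transducer if inflected forms need to be recognised.
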