!!!Names and multilinguality

Meeting between __Sjur, Thomas, Trond__ on Nov. 14, 2006.

1. First problem:

* The ''all names in all languages'' policy will likely be misunderstood if the material is published in risten.no.
* "Foreign" names can be as much noise as they are valuable, so they must be included with care.

We need a more principled approach to this.

Background: the name lexicon is getting attention from the SD name/terminology sections, and they would also like to use our name lexicon for public searching.

Observations:

1) Multilinguality is always optional.

2) "Foreign" names in texts follow a domination pattern: majority-language forms can be found in minority-language texts as real names ("Kautokeino produkter"), whereas minority-language names ''almost always'' occur in majority-language texts as citations. Citations should not be considered a natural part of the text.

3) In our name classification, multilinguality varies by class:

{{{
Ani - weak/none? (pet, myth anim. names)
Fem - weak (informative)
Mal - weak (informative)
Obj - strong
Org - strong
Plc - strong whenever parallel forms are available
Sur - none
Tit - strong (titles)
}}}

Suggestion:

We need to reconsider the ''all names in all languages'' policy. That policy is valid only for {{Fem, Mal}} and {{Sur}} (and Ani and Tit?). For {{Obj, Org, Plc}} the rule should be that if they have multilingual names, each name should only be used in its own language. We then need a modification saying that majority-language names can be included in minority-language lexicons __if attested__ in our corpus. Also, the majority language varies by country, which means that in a speller context we might consider tailoring spellers for each country, leaving out noise from the majority-language names of another country.
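The suggested rule can be sketched as a small inclusion test. This is only an illustration under stated assumptions: the {{NameForm}} record, the {{include_in_lexicon}} function and the {{MAJORITY_LANGUAGES}} table are hypothetical and not part of our lexicon tools, and the classes marked with a question mark above (Ani, Tit) are deliberately left undecided.

```python
from dataclasses import dataclass

# Classes for which "all names in all languages" stays valid (per the suggestion).
ALL_LANGUAGES_CLASSES = {"Fem", "Mal", "Sur"}

# Classes whose multilingual names should only be used in their own language.
OWN_LANGUAGE_CLASSES = {"Obj", "Org", "Plc"}

# ASSUMPTION: an illustrative table of majority languages per minority language.
# A country-specific speller could narrow each set to a single language.
MAJORITY_LANGUAGES = {"sme": {"nob", "swe", "fin"}, "smj": {"nob", "swe"}}


@dataclass(frozen=True)
class NameForm:
    """Hypothetical record for one name form in the lexicon."""
    form: str                             # the name as written
    lang: str                             # language the form belongs to
    name_class: str                       # Ani, Fem, Mal, Obj, Org, Plc, Sur or Tit
    attested_in: frozenset = frozenset()  # languages whose corpus contains the form


def include_in_lexicon(name: NameForm, target_lang: str) -> bool:
    """Decide whether a name form belongs in the lexicon for target_lang."""
    if name.name_class in ALL_LANGUAGES_CLASSES:
        return True
    if name.name_class in OWN_LANGUAGE_CLASSES:
        if name.lang == target_lang:
            return True
        # Exception: majority-language names are allowed in a minority-language
        # lexicon, but only if attested in that language's corpus.
        is_majority = name.lang in MAJORITY_LANGUAGES.get(target_lang, set())
        return is_majority and target_lang in name.attested_in
    # Ani and Tit are still open questions in the notes.
    raise ValueError(f"no policy decided for class {name.name_class!r}")


if __name__ == "__main__":
    helsinki = NameForm("Helsinki", "fin", "Plc", frozenset({"sme"}))
    print(include_in_lexicon(helsinki, "sme"))
```

With this reading, a place name such as Helsinki enters the sme lexicon only when it is attested in sme text, while Helsset never enters the fin lexicon, because sme is not a majority language for fin.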

__TODO:__

# finish first version of the editing (__Sjur, Tomi__)
# add @type=secondary and @excl=speller,hyph to all names marked with !SUB (__Saara__)
# test editing of the xml files. If ok, then: (__Sjur, Thomas, Trond__)
# make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological part shall remain intact in e.g. propernoun-sme-morph.txt) (__Sjur__)
# convert propernoun-($lang)-lex.txt to a derived file from common xml files (__Sjur, Tomi, Saara__)
# Clean up terms-sme.xml so that all names have the right tagging for different uses (@type=secondary) (__Thomas, Maaren, linguists__)
# Merge place names that are not in the same term entry: Helsinki, Helsingfors, Helsset (__linguists__)
# Make the name material searchable in risten.no (__Sjur__)
# Add any missing parallel names (place names) (__linguists__)
# Create links between Niillas and Nils (__linguists__)

{{{
======= termcenter.xml =========

Before merge: ORG ORG
After merge:  ORG plc plc plc plc plc plc mal mal

===== Procedure for creating terms-{$lang}.xml

inherit the primary form (i.e. it has default entry type):
  If no langentry: use entry id
  If langentry, but not your own: use as primary the following:
    smj > nob ~ swe > ...
    sme > nob > swe ~ fin > ...

Long version:

Export from common:
  ownlang.
  these additional langs

should the Bokmål program understand "Gothenburg" or not? No
should the Faroese analyser contain Sámi names or not?
should the Sámi analyser contain Asian names or not?

all other forms than the primary are secondary
(i.e., they have entry type=secondary)

Beahkká
Pekka
---
Peter
Sigurd
Sigur
---
Sjur
Cathrin
- - -
Cathrine
Katrine
---
Kari
Niillas      relevant in a bilingual
Nils         Norwegian-Sámi society
---
Nikolaus
Mattis
---
Mathias
Máhtte
Thomas
---
Duommá

|------------------|
| IR:              |
| important        |
|                  |
| spell checking:  |
| irrelevant       |
|------------------|

Working procedure:

0.
Convert the lexical part of propernoun-sme-lex.txt to xml, and test editing of the xml files
(the morphological part shall remain intact in e.g. propernoun-sme-morph.txt)

If ok, then:

1. make terms-sme.xml <=== automatically from propernoun-sme-lex.xml
   make terms-smj.xml <=== automatically from propernoun-sme-lex.xml + the smj shortlist
   make terms-sma.xml <=== automatically from propernoun-sme-lex.xml
   make terms-nob.xml <=== automatically from propernoun-sme-lex.xml (to be added)
   ---> turn propernoun-($lang)-lex.txt into a file derived from the common xml.

2. Clean up terms-sme.xml so that all names have the right tagging for different uses (@type=secondary)

3. Merge place names that are not in the same term entry: Helsinki, Helsingfors, Helsset
   ---> Make the name material searchable in risten.no

4. Add any missing parallel names (place names)

5. Create links between Niillas and Nils

======= terms-sme.xml =========
<== (today: NIILLAS-plc) (use only one?)
=> ref="Bb" after merge

======= terms-nob.xml =========
=> ref="Bb" after merge

=========================
}}}
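
The procedure for picking the primary form when creating terms-{$lang}.xml can be sketched as follows. This is one possible reading of the notes, not an agreed implementation: the function names {{pick_primary}} and {{split_forms}} and the dictionary layout are hypothetical, languages joined by "~" are treated as equal in rank with ties broken by the order they are written in, and because the tail of each priority list ("...") is not specified, the sketch simply falls back to the entry id at that point.

```python
# Fallback order per target language, transcribed from the notes.
# Languages joined by "~" in the notes share a rank (one inner tuple).
FALLBACK = {
    "smj": [("nob", "swe")],
    "sme": [("nob",), ("swe", "fin")],
}


def pick_primary(entry_id, forms, target_lang):
    """Return the primary form for target_lang.

    forms maps a language code to the name form in that language.
    """
    if not forms:
        # No langentry at all: use the entry id.
        return entry_id
    if target_lang in forms:
        # The entry has a form in our own language.
        return forms[target_lang]
    # Langentries exist, but none in our own language: walk the fallback ranks.
    for rank in FALLBACK.get(target_lang, []):
        for lang in rank:  # ASSUMPTION: ties within a rank resolved in written order
            if lang in forms:
                return forms[lang]
    # ASSUMPTION: the "..." tail is unspecified, so fall back to the entry id.
    return entry_id


def split_forms(entry_id, forms, target_lang):
    """Return (primary, secondary); every non-primary form is type=secondary."""
    primary = pick_primary(entry_id, forms, target_lang)
    secondary = [form for form in forms.values() if form != primary]
    return primary, secondary


if __name__ == "__main__":
    helsinki = {"fin": "Helsinki", "swe": "Helsingfors", "sme": "Helsset"}
    print(split_forms("helsinki", helsinki, "sme"))
    print(split_forms("helsinki", helsinki, "smj"))
```

For the merged Helsinki entry, terms-sme.xml would get Helsset as primary, while terms-smj.xml, lacking an smj form, would fall back to Helsingfors under the tie-breaking assumption above.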