!!!Agenda

* finalize the proper name xml structure
* prepare integration of the kvensk project, if SD accepts

Participants: __Børre, Linda, Sjur, Tomi, Trond__

Questions:
* What (the content):
** make an overview of all info we want to store
** ... and how to organise it
* How (the xml structure):
** one or two files? (two actually implies three)
** what info to split into common parts and project / language specific parts

Views:
* Iconic id better than arbitrary id.
* Single linking, or automatically made double linking
* pro links in common: lg-specific files are not cluttered
* pro links in lg files: that is where the info is

Work process:
* Timbuktu: Add iconic id and semantics once, to common.
** The machine MAKES the lg entries, based upon these assumptions:
*** one form, inherited from iconic id
*** one sem, inherited from common
*** and we will decide a default declension class for each lg
*** we may leave a tag in place saying "untouched by human hands"
* Helsinki: same case, but here we need heavy manual editing
** make a tag saying "now touched (by native speaker)"

Conclusions: Double linking, iconic id

Iconic id decided by the following principle:
* Place names: pick Norwegian, Swedish, Finnish, English names.
* Other names: pick the most common (the one which gives the most "identical" hits among our lgs: sme, smj, sma, nor/nob/nno, swe, fin, eng (sms, smn))

With the principle of inheritance (lemma inherited from common file):
* inherit right away / at creation time (= larger files, more duplicate info)

{{{
common         | swe        | fin        |
India_2        | India      | Intia      |
->lg=a         | ->India_2  | ->India_2  |
(->lg=b Intia) | ->India    |            |
sem plc
...
Timbuktu       | Timbuktu   | Timbuktu   |
->lg=a id      | ->Timbuktu | ->Timbuktu |
->lg=b id      |            |            |
sem plc        | Tmb.       | ...

               | sme:       | ...        | nor:   | fin | swe | eng
Tana           | Deatnu     | ...        | Tana   |     |     |
->lg=a id      | ->Tana     | ...        | ->Tana |     |     |
->lg=b id      |            | ...        |        |     |     |
sem plc        | ...        |            |        |
... ...
| | |
}}}

What do we store in the "common" file?
* the iconic id
* the semantics + info about the world (encyclopedic info)
* links to the lg specific files

What is stored in the lang-specific ones?

Linguistic info:
* inflection
* stem
* lemma
* derivation class?
* compounding?
* senses (pointers to concepts)
* orthographical variants (incl. (common) misspellings)
* acronym(s) and abbreviation(s):
** as separate entries or as part of the name entry?

{{{
NATO => OTAN
NRL => NBR, Ap => Bb

KRD
KRD KRD
KRD KRD+N+ACR+Sg+Acc
KRD KRD+N+ACR+Sg+Gen
KRD KRD+N+ACR+Sg+Nom

NATO
NATO NATO+N+ACR+Sg+Acc
NATO NATO+N+ACR+Sg+Gen
NATO NATO+N+ACR+Sg+Nom
NATO NATO+N+Prop+Org+Sg+Acc
NATO NATO+N+Prop+Org+Sg+Attr
NATO NATO+N+Prop+Org+Sg+Gen
NATO NATO+N+Prop+Org+Sg+Nom

"" S:1732, 1732, 1732, 1732, 5423, 5849, 5849, 9980
    "NATO" N Prop Org Sg Nom <<< S:1285 @HNOUN
}}}

Different aspects of abbreviations and acronyms:
* expansion (requires linking/common entry):
** abbrs need to be expanded for IR and text-to-speech
** translation systems want to translate them to other lg abbrs (possibly requiring (intermediate) expansion)
* linguistic analysis/properties:
** the preprocessor is concerned about abbrs' behaviour wrt. sentence delimitation (TRAB, ITRAB)
** speller programs want to correct them whenever wrongly spelled (possibly storing misspellings of abbrs)
** disambiguators want their underlying POS analysis (in addition to their ABBR tag)
** they have inflections of their own
*** St.dieđ. 10 / St. dieđáhus OR St. dieđáhusa... (implicit case)
*** NRK:as (explicit case, except for Acc/Gen, which may be left unexpressed)
** can take part in compounding, possibly derivation

Lexicon conclusion:
* store abbrs that come from names as separate entries? (we probably have no dotted abbrs for names)
* store acrs as separate entries in the name database, with type="acr"
* store alternative names as separate entries
* all linked together or to the same concept (open???
If to the concept, this forces us to allow more than one entry/language in the common file)

Transducer conclusions:
* Leave things at status quo for the abbreviations and the acr generator
* We will return to the issue of double abbrs if they turn up (They probably don't)
* Double acrs are already taken care of in the sme-dis.rle rule set (lexical acronyms are preferred over generated ones)

!!!xml example format:

!!Concept center (common file):
{{{
IN ... ...
}}}

!!Language file for, say, sme:
{{{
(example?)
}}}

!!Language file for fin:
(numbers refer to Irene's draft, see below)
{{{
(only if different from id/headword)
(example?)
}}}

!!Language file for kvensk:
(numbers refer to Irene's draft, see the [meeting memo from Nov. 28 |https://giellalt.uit.no/admin/weekly/2005/Meeting_2005-11-28.html#7.+Name+lexicon+infrastructure])
{{{
(only if different from id/headword)
In the case that stem = lemma, we have the entry:
}}}

These points from Irene's list are still open:
{{{
Print info - do they belong to the common or language-specific sections?:
12. kartprodukt (map product)
13. kartblad (map sheet)

Unclassified:
25. pilhenvisning, nuoliviite, til annen artikkel (arrow reference to another article) -> How is this different from 18.?

Multimedia - do they belong to the common or language-specific sections?:
26. lydfil (sound file)
27. bilde(r), illustrasjon(er) (picture(s), illustration(s))
}}}
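
Sketch of the work process and the double-linking conclusion above, as a rough illustration only. It assumes that a common entry carries the iconic id, the semantics and the links to the language files, and that each machine-made language entry inherits its lemma and sem from the common entry. All class, function and field names, and the declension class values, are hypothetical; the memo only says that a default declension class will be decided for each language.

```python
from dataclasses import dataclass, field

# Hypothetical placeholder values: the memo does not say what the
# per-language default declension classes will be.
DEFAULT_DECLENSION = {"sme": "default-sme", "nor": "default-nor"}


@dataclass
class CommonEntry:
    iconic_id: str                             # e.g. "Tana" or "Timbuktu"
    sem: str                                   # e.g. "plc"
    links: dict = field(default_factory=dict)  # lg -> lemmas in that lg file


@dataclass
class LangEntry:
    lemma: str
    lang: str
    sem: str
    declension: str
    concept: str               # link back to the iconic id in the common file
    status: str = "untouched"  # "untouched by human hands"


def generate_lang_entries(common, langs):
    """Timbuktu case: the machine makes one entry per language."""
    entries = []
    for lg in langs:
        entry = LangEntry(
            lemma=common.iconic_id,            # form inherited from iconic id
            lang=lg,
            sem=common.sem,                    # sem inherited from common
            declension=DEFAULT_DECLENSION[lg],
            concept=common.iconic_id,
        )
        # Double linking: the common entry also points at the lg entry.
        common.links.setdefault(lg, []).append(entry.lemma)
        entries.append(entry)
    return entries


def mark_touched(common, entry, lemma=None):
    """Helsinki case: manual editing; keep both link directions in sync."""
    if lemma is not None and lemma != entry.lemma:
        lemmas = common.links[entry.lang]
        lemmas[lemmas.index(entry.lemma)] = lemma
        entry.lemma = lemma
    entry.status = "touched"                   # "now touched (by native speaker)"


tana = CommonEntry("Tana", "plc")
sme, nor = generate_lang_entries(tana, ["sme", "nor"])
mark_touched(tana, sme, lemma="Deatnu")
print(tana.links)  # {'sme': ['Deatnu'], 'nor': ['Tana']}
```

The sme entry ends up with the lemma Deatnu and the "touched" status while still pointing at the concept Tana, as in the table above. The nor entry stays exactly as the machine made it.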