!!!Grammar checker meeting 14.6.2016

Present: Kevin, Sjur

Topic: ambiguous tokenisation

{{{
LEXICON Root
< {skuvla} 0:" " "@P.Pmatch.Backtrack@" {busse} "+N":0 > ENDLEX;
skuvla+N:skuvla ENDLEX;
busset+V:busse ENDLEX;
}}}

The lexicon above should be able to give this analysis:

{{{
""
	"skuvlabusse" N Err/SpaceCmp
	"busset" V
		"skuvla" N
}}}

Proposed pmatch filter:

{{{
define filter_flags(net) net .o. [?* flag:0 ?*]*;
! Would this happen online or during compilation?
!   Compilation of the pmatch rules
!   (Leads to overgeneration but can that be limited mechanically?)
!   (Probably scratch this idea, I forgot about the overgeneration problem)

! the tag LexCmp indicates a lexicalized compound
define lexicalized_compounds Lexicon .o. ?* LexCmp ?* ;
define allowed_prefixes Lexicon .o. ?* PrefixForms ?* ;

define multitoken_surfaces [filter_flags(lexicalized_compounds.i) .o. [[ allowed_prefixes 0:" " ]+ Lexicon ]].o;
define multitokens multitoken_surfaces .o. [ Lexicon " " ]+ ;
define lexicalized_compounds_with_erroneous_spaces multitoken_surfaces .o. [?* " ":0 ?*] .o. lexicalized_compounds ;
}}}

Comments from Kevin in the code:

{{{
! Issues: flag diacritics probably both break the morphosemantics and
! cause huge memory consumption
! Idea: if lexicalized_compounds could be made flag-free, that might suffice?
!   What about the flags in the forms ambiguous with lex.cmps?
!   We don't know which forms are ambig. with lex.cmps, that's why we intersect.
! Below: that still doesn't work
! Krister: have recent developments in restricting the compound correction perhaps made this possible and we
! should try again?
!! Tried heavily restricting, even to just simple nouns, still too much mem
! Another idea: since we really want to start with surface forms, could we just output a text file with
! a list of the lexicalized compound surface forms?
!! I.e. analyse a bunch of forms and then … script the lexc to add a tag to lemmas that are ambiguous?
! Maybe use e.g. ospell to generate them
! It's suggested that I (Sam) try to do this for omorfi where there are no flag diacritics just to validate the idea
! I did originally implement this as form-intersection, which worked just fine where there were no flags :) but unfortunately anything interesting in sme has flags …
! I don't see how this is less hacky than online backtracking :/
! Neither do I
! So what about this RC mentioned in the email thread?
! That's mainly for avoiding doing multitoken analysis even when there's no possibility of a misspelled compound
! But thinking again about the rules above ... Isn't it possible to filter out flags from the surfaces?
! Our meeting is running out of time... but I'll think some more about this possibility and write it up in an email
}}}

And further:

{{{
! Trying to remove flag diacritics with foma in order to intersect on forms only:
! runs out of RAM. Trying to grep them out manually: runs out of RAM during
! minimize or composition. Even grepping them out of only the parts of
! the lexicon we need runs out of RAM during later steps of compilation.

define TOP LC(Boundary) [ Lexicon | lexicalized_compounds_with_erroneous_spaces | RC(lexicalized_compounds_with_erroneous_spaces) multitokens ] RC(boundary);

> [ RC(verb) noun] meaning that the current match is first tested to be in the input
> set of "verb" and then processed as "noun". So you could test that you
> have an ambiguity and then trigger that sort of tokenization

! So something like:
define word lexicon RC([blank|#]) LC([blank|#]);
! All analyses that have the SpaceCmp tag:
define spacecmp lexicon .o. ?* Err/SpaceCmp ?*;
! If something had the SpaceCmp tag, try reanalysing that string as if it were two tokens with a space in the middle:
define token [ RC(spacecmp) word " " word ];
}}}

Thoughts on a solution:

{{{
skuvlabusse:skuvla#busse CONTLEX ;

.o.

# (->) " " ;
! And also add +Err/SpaceCmp
}}}

Problem statement:

How can we get the analyses:

{{{
skuvla+N busset+V
}}}

in addition to:

{{{
skuvla busse+N
}}}

without explicitly having {{{< {skuvla} "+N":0 "@P.Pmatch.Loc@" busset:busse "+V":0 >}}} in lexc (far too many combinations to handle manually), and without running the intersection A∩A" "A on forms (which does not work because of flags vs. RAM)? That is, in lexc we only want to say "from here on we want online backtracking". We have no need to say specifically "there, that is the backtrack point", only that "this analysis string requires that we include reanalyses as several tokens", where the reanalysis uses the usual tokenisation boundaries (a space in this case).

Abbreviations require the same treatment as compounds unless we want to manually specify paths into PUNCT for all forms ambiguous with punctuated abbreviations:

{{{
Abbreviations vs other POS + full stop
su. (sunnuntai or su + .)
}}}

We already have a solution for numerals (ordinals vs cardinals + full stop, as in {{1000.}}), since numbers have fairly unambiguous forms :)

Quote from the text chat during the meeting:

{{{
June 14, 2016
14:06 Kevin: su. is already covered by pmatch_input_mark
14:06 Sjur: but now we are trying to find a solution that doesn't use that mark, aren't we?
14:06 Kevin: no, that will be something else
14:06 Sjur: and I can't get su. to work
14:07 Sjur: ok
14:07 Kevin: we can look at it, but it shouldn't need backtracking in any case
14:07 Sjur: why not? we want su. = ABBR + su. = Pron & CLB
14:08 Sjur: so we need backtracking, as far as I can understand
14:09 Kevin: Oh, if we are to do it without specifying it fully in the lexicon, ok. I saw it as equivalent to "3.", where it is very easy to give both fully in the lexicon
14:10 Sjur: ok - I was picturing a general solution where we don't build the lexicon for the tokenisation
14:11 Sjur: but if it's easy we can of course do it there :)
14:13 Sjur: "and without running intersection on forms" - do you mean composition?
}}}
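
The behaviour the notes ask for (the same input yielding both the compound reading with Err/SpaceCmp and the ordinary multi-token reading) can be illustrated with a toy Python sketch. It is not pmatch or hfst code, and it sidesteps the flag diacritic and RAM problems entirely. The names {{LEXICON}}, {{LEXICALIZED_COMPOUNDS}} and {{readings}} are hypothetical, invented only for this illustration, and the entries are taken from the example lexicon in these notes.

```python
from itertools import product

# Toy lexicon: surface form -> list of (lemma, tags).
# Hypothetical data, modelled on the example entries in the notes.
LEXICON = {
    "skuvla": [("skuvla", "N")],
    "busse": [("busset", "V")],
    "skuvlabusse": [("skuvlabusse", "N")],
}

# Surface forms that are lexicalized compounds (assumption: known in advance).
LEXICALIZED_COMPOUNDS = {"skuvlabusse"}


def readings(text):
    """Return all readings of `text`.

    Each reading is a list of (lemma, tags) pairs, one pair per token.
    """
    result = []

    # Reading type 1: the spaces are errors inside a lexicalized compound.
    joined = text.replace(" ", "")
    if " " in text and joined in LEXICALIZED_COMPOUNDS:
        for lemma, tags in LEXICON[joined]:
            result.append([(lemma, tags + " Err/SpaceCmp")])

    # Reading type 2: ordinary tokenisation at the spaces.
    parts = text.split(" ")
    if all(part in LEXICON for part in parts):
        for combo in product(*(LEXICON[part] for part in parts)):
            result.append(list(combo))

    return result


if __name__ == "__main__":
    for reading in readings("skuvla busse"):
        print(reading)
```

The sketch simply keeps both reading types side by side instead of choosing one, which is the effect the "online backtracking" marker is meant to give without listing every combination in lexc.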