# Lecture Thu 29.11

Teachers: Jack Rueter & Sjur Moshagen

Topic: **building a grammar checker**

Reading material:

* [Linda Wiechetek's dissertation](https://munin.uit.no/handle/10037/12726) (Chapter 5 up to and including 5.2.2, but not 5.2.3 and onwards)
* Antti Arppe: [DEVELOPING A GRAMMAR CHECKER FOR SWEDISH](https://sites.ualberta.ca/~arppe/Publications/Nodalida-99.pdf)
* Eckhard Bick:
    * [DanProof: Pedagogical Spell and Grammar Checking for Danish (2015)](https://pdfs.semanticscholar.org/d10c/1d53ffb5e47a1f548cab6c10a8d83cef1e66.pdf)
    * [A Constraint Grammar Based Spellchecker for Danish with a Special Focus on Dyslexics](http://www.linguistics.fi/julkaisut/SKY2006_1/1.6.1.%20BICK.pdf)

---

# Lecture Overview

* very brief CG intro
* gramcheck architecture
* `hfst-tokenise` as the tokeniser-analyser
* *CG* as grammar checker formalism
* what can be targeted?
* what should one target?
* first priority: No False Positives!
* second priority: frequent errors first
* exercise

## Very brief CG intro

CG = Constraint Grammar (Karlsson et al. 1995)

Main idea: working bottom-up, we start with the highly ambiguous morphological analysis and remove readings based on context clues, until we arrive at one unambiguous reading of the whole sentence.

The two most important operations:

```
SELECT reading IF ( context ) ;
REMOVE reading IF ( context ) ;
```

`reading` is usually one or more tags (that is, remove or select a reading containing these tags). `context` is where the true power lies:

* left and right context
* restrictions and requirements
* both tags and lemmas
* everything is sets, and you can construct any set you want

Questions:

* semantic tags, +Human (Finnish Romani: adjectival gender inflection only in front of +Human)
* genitive -s in Lower Saxon: names and relatives (≈ the use of Px in Sámi?)
* case governed by pre- and postpositions and by verbs

## Grammar checker architecture

The various components, why and how, using the North Sámi setup as an example:

![Grammar checker flow chart](bilete/GramCheckFlow.png)

1. tokenisation + analysis
1. whitespace tagging
1. valency annotation
1. disambiguation of ambiguous tokenisation
1. reformatting of disambiguated tokens
1. spellchecking
1. disambiguation
1. first-round speller suggestion filtering
1. second-pass filtering using the regular disambiguation file
1. the real grammar checking (i.e. context-based error detection and correction)
1. generation of suggestions and diagnostic user messages

## `hfst-tokenise`

* takes a text input stream
* tokenises and analyses in one go, longest match by default
* recent additions by our group allow the lexicon to insert forced retokenisation or backtracking points, so that multiple tokenisations can be given
* multiple parallel tokenisations are printed as CG subreadings for further processing
* lexicon-based tokenisation and analysis allows MWEs to be correctly analysed and tokenised, also when inflected, which is quite important for Uralic languages

## *CG* as grammar checker formalism

* CG usually removes readings until you are left with the one true story
* problem: you rely on the sentence being correct, which is a false premise for a grammar checker!
* so you need to be careful, and only rely on the most robust clues (see the sketch after this list):
    * semantic tags
    * valency
* you also need some sort of bias: everything can't be wrong
* and you should pepper your lexicon with known misspellings (and tag them)
    * it is much better to correctly identify a misspelling than to have the speller guess it for you
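To make the contrast concrete, here is a minimal vislcg3 sketch. The set and rule names are invented for illustration and are not taken from any actual Giella grammar, but the syntax is the same as in the exercise below:

```
# Sets are built freely from tags and lemmas:
LIST Mood = Ind Imprt Cond Pot ;   # finite verb moods
SET FinVerb = (V) + Mood ;         # readings that are V AND finite

# Disambiguation style: trust the sentence and prune readings.
# Remove a finite verb reading right after a subordinating
# conjunction when a noun reading follows (an invented context):
REMOVE FinVerb IF (-1 (CS)) (1 (N)) ;

# Grammar checker style: do not trust the sentence. Instead of
# pruning, ADD an error tag when a robust clue fires, here a
# transitive verb immediately to the left (an invented rule):
ADD (&missing-object-case) (N Nom) IF (-1 (V TV)) ;
```

In the grammar checker setting, a later component turns such error tags into user messages and suggestions (cf. the last step of the pipeline above).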
## What can be targeted?

In terms of linguistic constructions and errors:

* in principle anything - CG is flexible and powerful
* in practice, though:
    * consider your target user group - L1 speakers make other errors than L2 speakers, 12-year-olds make different errors than 18-year-olds, who in turn make other errors than those aged 25-30+
    * errors differ - some are harder to detect than others
    * it also depends on the resources you have available - some errors require a richer markup, such as semantic tags and valency

## What should one target?

* start small and simple, expand over time
* some errors are (almost) automatically detectable:
    * spelling errors (by definition)
    * some punctuation and whitespace errors (simple regexes, some of which are readily provided by the template file)
* next step: disambiguate carefully, and target the filtering of speller suggestions
* that will at the same time build the foundation for the grammar checker - you need a partially disambiguated sentence to be able to detect the errors that are there

## First priority: No False Positives!

* users hate false alarms, so just don't flag correct text if at all possible
* this means a lot of testing
* and collecting sentences for testing purposes: both sentences that should be flagged, and sentences that are ok but could mistakenly be flagged (false-positive tests)

## Second priority: frequent errors first

But at the same time you need to consider feasibility, based on available resources and your own knowledge and experience:

* take the most frequent errors, and start with those you are actually able to implement
* constraint grammar rule writing takes a lot of practice

## Exercise

Formulate CG rules to remove irrelevant speller suggestions, using SMA (South Sámi) as the test case.

Input sentence:

```
Ååredæjja lea ruhtjehke goh sjovkolaade.
```

Tokenised, morphological analysis:

```
"<Ååredæjja>"
    "aeredh" Ex/V Ex/TV Der/PassS Ex/V IV Der/d Ex/V Der/NomAg N Sg Nom
    "aeredh" Ex/V Ex/TV Der/PassS Ex/V IV Der/d V PrsPrc
    "ååredæjja" N Sem/Hum Sg Nom
:
"<lea>"
    "lea" V IV Ind Prs Sg3
:
"<ruhtjehke>"
    "ruhtjehke" ?
:
"<goh>"
    "goh" Adv
    "goh" CS
    "goh" Pcle
:
"<sjovkolaade>"
    "sjovkolaade" ?
"<.>"
    "." CLB
:\n
```

As seen above, the sentence contains two unknown words, presumably misspelled. After running the speller the output looks like this, with the targeted corrections indicated by arrows:

```
"<Ååredæjja>"
    "aeredh" Ex/V Ex/TV Der/PassS Ex/V IV Der/d Ex/V Der/NomAg N Sg Nom
    "aeredh" Ex/V Ex/TV Der/PassS Ex/V IV Der/d V PrsPrc
    "ååredæjja" N Sem/Hum Sg Nom
:
"<lea>"
    "lea" V IV Ind Prs Sg3
:
"<ruhtjehke>"
    "ruhtjehke" ?
    "rahtjehke" A Attr ""
    "rahtjehke" A Sg Nom "<rahtjehke>"
    "rohtjehke" N Sem/Dummytag Sg Nom "<rohtjehke>"
    "ruhtjehtidh" V TV Ind Prs Sg3 ""
    "ruhtjie" N Sem/Dummytag Pl Nom Foc/ge ""
    "eeke" N Sem/Obj Sg Nom ""
        "ruhtjie" N Sem/Dummytag Cmp
    "ejke" N Sem/Dummytag Sg Nom ""
        "ruhtjie" N Sem/Dummytag Cmp
    "buhtehke" N Sem/Dummytag Sg Nom ""
        "ruhtjie" N Sem/Dummytag Sg Ine
    "runhtjehtidh" V TV Ind Prs Sg3 ""
    "rutjkes" A Sg Nom "<rutjkes>"                       <=============
:
"<goh>"
    "goh" Adv
    "goh" CS
    "goh" Pcle
:
"<sjovkolaade>"
    "sjovkolaade" ?
    "sjokolaade" N Sem/Dummytag Sg Nom "<sjokolaade>"    <=============
    "sjokolaade" N Sem/Dummytag Sg Acc ""
    "sjokolaade" N Sem/Dummytag Pl Nom ""
    "sjokolaade" N Sem/Dummytag Pl Gen ""
    "sjokolaade" N Sem/Dummytag Sg Gen ""
"<.>"
    "." CLB
:\n
```
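To read these streams: each cohort starts with a wordform line, and each following (indented) line is one reading of that form; readings added by the speller additionally carry the suggested surface form as a final tag. Schematically (the contents are placeholders, not real output):

```
"<wordform>"                      # the input token
    "lemma" TAG TAG               # one morphological reading per line
    "lemma" TAG TAG "<sugg>"      # a speller suggestion, with the
                                  # suggested surface form as a tag
```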
Task: write CG rules that will remove all but the correct suggestions.

In the first case this will lift the intended correction from position 10 (the last of 10 unique suggestions) up to the first position, because the rest have been removed. In the second case all irrelevant suggestions (= noise) are removed, which is still an improvement, although the correct suggestion is already in the first position.

### Solution

Rules:

```
# Remove the following two suggestions for the misspelled word form 'ruhtjehke':
#     "ruhtjehtidh" V TV Ind Prs Sg3 ""
#     "runhtjehtidh" V TV Ind Prs Sg3 ""
REMOVE:SpellerVerbSug (V Sg3) IF (0 ("<ruhtjehke>")) (-1* (V Sg3)) ; ## Ååredæjja lea ruhtjehke goh sjovkolaade.

# Remove Foc/ge suggestions if there are other alternatives:
#     "ruhtjie" N Sem/Dummytag Pl Nom Foc/ge ""
# NB! Does not work! Why?
REMOVE:SpellerSugFocGe (Foc/ge) IF (0 ("<ruhtjehke>")) ; ## Ååredæjja lea ruhtjehke goh sjovkolaade.

# Select only Adj readings if in a predicative position:
SELECT:AdjInPredPos (A Sg Nom) IF (0 ("<ruhtjehke>")) (-1 ("lea")) (-2 (N Sg Nom)) ;

SELECT:SameCaseAndNumAfterGoh (Sg Nom) IF (0 ("<sjovkolaade>")) (-1 ("goh")) (-2 (Sg Nom)) ;
```

Result:

```
"<Ååredæjja>"
    "ååredæjja" N Sem/Hum Sg Nom
    "aeredh" Ex/V Ex/TV Der/PassS Ex/V IV Der/d V PrsPrc
:
"<lea>"
    "lea" V IV Ind Prs Sg3 @+FMAINV
:
"<ruhtjehke>"
    "rahtjehke" A Sg Nom "<rahtjehke>" &SUGGESTWF &typo
    "rutjkes" A Sg Nom "<rutjkes>" &SUGGESTWF &typo
:
"<goh>"
    "goh" Adv
    "goh" CS @CNP
:
"<sjovkolaade>"
    "sjokolaade" N Sem/Dummytag Sg Nom "<sjokolaade>" &SUGGESTWF &typo
"<.>"
    "." CLB
:\n
```
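The `(0 ("<ruhtjehke>"))`-style anchors tie each rule to one hard-coded wordform, which is fine for this exercise but does not scale. A more general formulation would target any cohort carrying speller suggestions; a hedged sketch, assuming the suggestion readings are tagged `&SUGGESTWF` at this point, as they are in the result above:

```
# Target any cohort with speller suggestions instead of one wordform:
LIST SpellerSugg = &SUGGESTWF ;

# Generalised variant of the last rule above: keep only suggestions
# agreeing in case and number with the word before "goh":
SELECT:SugCaseNumAfterGoh (Sg Nom) IF
    (0 SpellerSugg)
    (-1 ("goh"))
    (-2 (Sg Nom)) ;
```

The same idea applies to the other rules: anchor them on robust contextual clues rather than on individual misspellings.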