# Lecture Tue 4.12

Teacher: Sjur Moshagen

Topic: **Applications**

* spell checkers
* grammar checkers
* hyphenation
* lemmatisation
* MT
* ...

## Morphological analysis

* using finite state transducers (fst's)
* take word forms in, return lemma + analysis

![Morphological analysis](bilete/morphtut3.png)

(Illustration from [M. Huldén](https://fomafst.github.io/morphtut.html))

* you build the fst by describing the lexicon, the morphology and the morphophonology in a form that can be compiled to an fst
* common tools: hfst, foma, xfst (Xerox), but there are others as well

## Text / corpus analysis

Combines morphological analysis with morphological disambiguation. On top of that one can build syntactic analysis, all the way to a full dependency tree. This is used in our [Korp interface](http://gtweb.uit.no/korp).

## Morphological generation

Exactly the same as analysis, except going in the other direction, from lemma + analysis string to word form.

Applications:

* paradigm or key form generation
* suggestion generation (e.g. in a grammar checker)
* MT output word form generation

An example of generating all forms of the North Sámi word [giella](http://gtweb.uit.no/cgi-bin/smi/smi.cgi?text=giella&pos=N&mode=full&action=paradigm&lang=sme&plang=sme).

## Electronic dictionaries

The electronic dictionaries in the Giella infrastructure are typically used with two fst's:

* a descriptive morphological analyser, so that one can go from an inflected form to the lemma (e.g. when looking up entries in a text)
* a normative generator, to produce either a full paradigm or some key forms, to help people understand how a word is inflected

Used in e.g. [NDS](http://sanit.oahpa.no/detail/sme/nob/giella.html?no_compounds=true&lemma_match=true)

## Hyphenation

* rules vs lexicon
* normative vs descriptive

This is how we build our fst-based hyphenator:

1. copy the lex file, change its format to weighted
2. remove irrelevant stuff from the lexicon
3. get all Err tags, add weight 1000, and cat them to a tag weight file
4. add the tag weights
5. project the surface side
6. compose-intersect with the phon rules
7. remove the hyph points from 6, invert
8. add hyph points from 6 with the hyph rules
9. copy, change format and reweight the hyph rules with a high weight
10. compose 7 and 8, and cat 8 and 9 to make the final hyphenation fst archive

The final fst archive then contains two fst's: one lexical, and one rule based. The rule-based one has much higher weights than the lexical one, and the lexical one has extra weights on the error-tagged forms. When applied, only pick the hyphenation pattern with the lowest weight.

## Lemmatisation (e.g. in indexing)

This is a special case of morphological analysis, in which we are only interested in the lemma. For morphologically complex languages, indexing the lemma is crucial: if you index only the word forms, most instances of the word you are looking for will be missed by a whole-word match. And in the case of complex morphophonology or prefixes, a wildcard search won't help much either:

```
$ echo gillii | hfst-lookup -q src/analyser-gt-norm.hfstol
gillii	giella+N+Sg+Ill
```

That is, only `gi` is a common part of the two word forms of the same word. For longer and more complex words there are additional issues (a naive baseline is sketched after the list below):

* what is the best lemma? The root lemma, or the longest derivation / compound lemma?
* what is most important when indexing a text: all components of a word, or just the stem?
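
To make the naive baseline concrete, here is a minimal sketch (not part of our tooling) of lemmatisation for indexing. It assumes `hfst-lookup` on the PATH and the same analyser file as in the example above; the helper name `lemmas_for_index` and the rule "lemma = everything before the first tag" are my own simplifications.

```python
#!/usr/bin/env python3
"""Minimal sketch of lemmatisation for indexing (not part of our tooling).

Uses the same analyser as the command line example above; the naive rule
"lemma = everything before the first tag" is an illustrative simplification.
"""
import subprocess

ANALYSER = "src/analyser-gt-norm.hfstol"

def lemmas_for_index(word: str) -> set[str]:
    """Return the set of candidate lemmas for one word form."""
    out = subprocess.run(["hfst-lookup", "-q", ANALYSER],
                         input=word + "\n", capture_output=True, text=True).stdout
    lemmas = set()
    for line in out.splitlines():
        fields = line.split("\t")
        if len(fields) < 2 or fields[1].endswith("+?"):
            continue                              # skip empty lines and unknown words
        lemmas.add(fields[1].split("+", 1)[0])    # keep only the part before the first tag
    return lemmas or {word}                       # fall back to the raw form

if __name__ == "__main__":
    print(lemmas_for_index("gillii"))             # expected: {'giella'}
```

For compounds and derivations this keeps only the first part of the analysis, which is exactly the kind of choice the questions above (and the weighted approach below) are about.
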
We have not yet worked actively with these questions, but my take on it would be a weighted approach:

* best match for all inflections of the longest lemma possible
* gradually higher weights for shorter lemmas/stems
* discard compound components from the beginning of the word, with higher weights for each resulting new stem

For even better results, one should combine the lemma index with a thesaurus or a wordnet, so that one would also get hits on different but related words. No such resource exists for any of the Sámi languages, though (at least not in an electronic form available to us).

## Spell checkers

![Flow chart](bilete/SpellerFlow.png)

* remove non-normative forms (a simple fst operation - all such forms are, or should be, tagged)
* remove irrelevant stuff (mostly punctuation, also easy since it is all tagged)
* error model:
    * corrects misspelled words using simple transformation rules
    * produces (tens of) thousands or more of candidates
    * these are all filtered out on the fly
    * the end result is a list of possible suggestions
* the lexicon is weighted according to frequency, if a corpus is available
* the lexicon is also weighted according to tags (always)
* the error model and the different transformations are weighted according to likelihood; this is manual work, and requires a lot of fine-tuning

The spell checker has two user interfaces:

* the red underline
* the suggestion list

Users' perception of the quality of the spell checker is typically formed by these two. Thus, when developing a speller, one should strive for high accuracy in both:

* few false alarms (most important), and as few missed misspellings as possible
* the suggestions should be few and be (or at least potentially be) relevant - this is surprisingly hard!

Basic insight: the speller has no knowledge of the world around it, so it is just guessing wildly. We can help the guessing process with weights, and by shaping the error model according to the language we are working on and the typical misspellings of its users.

The target: to always and only produce the one correct suggestion, to detect only real errors, and to detect all errors. Both goals are by definition impossible to reach. But we can come pretty close to the ideal by using the context knowledge and disambiguation power of a rule-based grammar checker, as discussed last time (and repeated briefly below).

## Grammar checkers

![Grammar checker flow chart](bilete/GramCheckFlow.png)

## MT

Apertium flow chart:

![Apertium](bilete/Apertium-structure.png)

(Illustration from [Johnson et al.](http://www.ep.liu.se/ecp/131/014/ecp17131014.pdf))

## TTS text processing

We have no finished setup for this yet, but plan to do it in the future (our present TTS product is based on a commercial and closed-source system).

What it would look like:

* it resembles the grammar checker pipeline a lot
* in fact, the plan is to base it on the same components
* the main change is that instead of error detection and correction, we do conversion to IPA or some other suitable phonetic representation (a rough sketch follows below)
* this will allow us great flexibility in disambiguating the input, and in assigning emphasis or other pronunciation hints as part of the processing
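
To give an idea of what such a pipeline could look like, here is a rough, purely hypothetical sketch: `text2ipa.hfst` does not exist, and picking the first reading below is only a placeholder for the planned constraint grammar disambiguation.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of TTS text processing on top of the Giella fst's.

text2ipa.hfst is a made-up reading-to-IPA transducer, and "pick the first
reading" is only a stand-in for the planned constraint grammar disambiguation.
"""
import subprocess

ANALYSER = "src/analyser-gt-norm.hfstol"   # same analyser as in the lookup example
TEXT2IPA = "text2ipa.hfst"                 # hypothetical, does not exist yet

def lookup(fst: str, form: str) -> list[str]:
    """Run hfst-lookup on one input string and return all output strings."""
    out = subprocess.run(["hfst-lookup", "-q", fst],
                         input=form + "\n", capture_output=True, text=True).stdout
    return [line.split("\t")[1] for line in out.splitlines()
            if "\t" in line and not line.split("\t")[1].endswith("+?")]

def to_phonetic(sentence: str) -> list[str]:
    """Analyse each token, pick one reading, and map it to a phonetic form."""
    phones = []
    for token in sentence.split():                     # naive tokenisation
        readings = lookup(ANALYSER, token)             # morphological analysis
        reading = readings[0] if readings else token   # stand-in for CG disambiguation
        ipa = lookup(TEXT2IPA, reading)                # reading -> IPA (hypothetical fst)
        phones.append(ipa[0] if ipa else token)        # fall back to the raw token
    return phones

if __name__ == "__main__":
    print(to_phonetic("Buorre beaivi"))
```

The point is only that the existing lookup machinery would carry over unchanged; the real work lies in the disambiguation rules and the phonetic transducer.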