# Contents

* Paradigm generation
* Paradigm wordform context

# Paradigm generation

Paradigms are managed by a file and directory structure.

## Paradigm folder structure

```
paradigms/sme/common_nouns.paradigm
paradigms/sme/proper_nouns.paradigm
paradigms/sme/paradigm_group/foo.paradigm
paradigms/sme/paradigm_group/bar.paradigm
```

Paradigms may be organized however you like within each directory, and may be grouped into subfolders for convenience. A language typically won't need many: usually there will be one base paradigm per part of speech, with additional paradigms applying to subsets of words in that part of speech.

Currently there is no explicit setting for ordering the rules; ordering is determined by the complexity of the rules that match a given word and entry. Thus, if one rule looks for `pos`, `valence` and `context`, and another looks only for `pos` and `valence`, the former will be applied if both match.

Symlinks are tolerated, so if multiple language variants need to use the same rule set, simply make a symlink between the directories.

## Paradigm file structure

Paradigm files are structured in the following way: one part is YAML, and the other part is data in Jinja form. Essentially, if the conditions in the first (YAML) part are matched, then the paradigm that follows is used.

```
name: "Proper noun paradigm."
description: |
  Generate the proper noun if the entry contains sem_type="Prop" or "prop"
morphology:
  pos: "N"
lexicon:
  XPATH:
    type: ".//l/@type"
  sem_type:
    - "Prop"
    - "prop"
--
{{ lemma }}+N+Prop+Sem/Plc+Sg+Gen
{{ lemma }}+N+Prop+Sem/Plc+Sg+Ill
{{ lemma }}+N+Prop+Sem/Plc+Sg+Loc
```

## Analyzer conditions

Conditions that are possible to match on are set up in a variety of ways. Analyzer conditions may be specified in the `analyzer` key (shown as `morphology` in the examples below), and each key under that may be a tagset and a value, or a whole tag:

```yaml
morphology:
  pos: "V"
  infinitive: true
```

... is the same as ...

```yaml
morphology:
  tag: "V+Inf"
```

... or ...

```yaml
morphology:
  tag:
    - "V+Inf1"
    - "V+Inf2"
```

Either a value may be specified, or the boolean `true`, which stands for "any member of the tagset is present". A list may also be specified, which is in effect a kind of locally defined tagset.

One other key that may be used for the analyzer is `lemma` (also available for the lexicon), for when you want the rule to apply to a specific lemma only:

```yaml
morphology:
  lemma: "diehtit"
```

## Lexicon conditions

The lexicon can also provide conditions for a particular paradigm. Some predefined keys are available, and it is also possible to use XPath statements to test against individual XML entries.

```yaml
lexicon:
  XPATH:
    sem_type: ".//l/@sem_type"
  sem_type: "Plc"
```

```yaml
lexicon:
  XPATH:
    sem_type: ".//l/@sem_type"
  sem_type:
    - "Plc"
    - "Something"
```

## Conditions together

Operating together, the conditions say: for any user-input wordform, if the analyzer rules find a matching analysis, and the lexicon rules find a matching lexicon entry, then the paradigm is used for the entries where these align. (An illustrative sketch of this matching follows at the end of this section.)

## Paradigm definition

The paradigm definition is mostly plain text, but since it is a template, it is possible to do all sorts of template operations.

```
{{ lemma }}+N+Sg+Nom
{{ lemma }}+N+Sg+Acc
```

Certain variables are available by default:

- `lemma`

## Things to think about

* Pregenerated paradigms could be accomplished by a template, but it would be fairly complex, and would require good access to `lxml` nodes without lots of complex template tags and custom filters.
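To make the matching and rendering described above more concrete, here is a minimal, hypothetical Python sketch of how a `.paradigm` file could be split into its YAML and Jinja parts and applied. It assumes PyYAML and Jinja2 are installed; the function names, the simplified condition check, and the sample values are illustrative only and do not reflect the project's actual code.

```python
# A minimal, hypothetical sketch of reading and applying a .paradigm file.
# The function names and the simplified condition check are illustrative
# only; they do not reflect the dictionary codebase's actual API.
import yaml
from jinja2 import Template

def read_paradigm(path):
    """Split a .paradigm file into its YAML header and Jinja template body."""
    with open(path, encoding="utf-8") as f:
        header, _, body = f.read().partition("\n--\n")
    return yaml.safe_load(header), body.strip()

def conditions_match(conditions, analysis_tags, lexicon_values):
    """Rough check: every morphology condition must be satisfied by the
    analysis tags, and every plain lexicon key must match the values already
    extracted from the XML entry (tagset lookup is glossed over here)."""
    for key, wanted in (conditions.get("morphology") or {}).items():
        if wanted is True:
            continue  # true: any member of the tagset counts, skipped here
        wanted = wanted if isinstance(wanted, list) else [wanted]
        if not any(w in analysis_tags for w in wanted):
            return False
    for key, wanted in (conditions.get("lexicon") or {}).items():
        if key == "XPATH":
            continue  # XPATH only defines where lexicon_values come from
        wanted = wanted if isinstance(wanted, list) else [wanted]
        if lexicon_values.get(key) not in wanted:
            return False
    return True

conditions, template = read_paradigm("paradigms/sme/proper_nouns.paradigm")
if conditions_match(conditions,
                    analysis_tags=["N", "Prop", "Sg", "Nom"],
                    lexicon_values={"sem_type": "Prop"}):
    print(Template(template).render(lemma="Oslo"))
```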
# Paradigm contexts

For now this isn't entirely in line with the way paradigm generation works, but it should be good enough for linguists to see the pattern and work accordingly.

`.context` files in each directory control what is displayed with the generated wordform. The filename may be anything, so long as the suffix is `.context`. For convenience, `sme` and `sma` match filenames between paradigms and contexts, but there is no need to do so, and one `.context` file could be used for everything.

## File structure

Context files are simply a YAML list, and each item is a dictionary with the following keys:

* `entry_context` - matches the `@context` attribute on each lexicon entry node. Set to a string, or None.
* `tag_context` - matches the tag used in generation. String. Must be set to something; leaving it unset would overgenerate.
* `template` - Jinja-format string, which accepts certain variables:
  + `word_form` - inserts the wordform
  + `context` - inserts the context (usually not necessary)

Some examples:

```yaml
- entry_context: "sii"
  tag_context: "V+Ind+Prs+Pl3"
  template: "(odne sii) {{ word_form }}"
```

The above would thus generate:

```
(odne sii) deaivvadit
```

Example without `entry_context`:

```yaml
- entry_context: None
  tag_context: "V+Ind+Prs+Sg1"
  template: "(daan biejjien manne) {{ word_form }}"
```

Note the lack of quotes around `None`. Otherwise, see the checked-in files for more examples. (An illustrative sketch of how these rules might be applied follows after the todos below.)

### Programmer todos

TODO: the function currently assumes the tag separator is `+`; use the tag separator defined in morphology.

TODO: maybe consider making this work similarly to paradigm generation, so that tagsets may be used if needed.
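As a rough illustration of the lookup described above (and not the project's actual function), the following sketch shows one way a `.context` rule could be selected and rendered with Jinja2. The file name, function name, and the assumption that an empty or `None` `entry_context` places no requirement on the entry are all hypothetical.

```python
# A minimal, hypothetical sketch of selecting and rendering a .context rule.
# File name, function name, and matching semantics are assumptions for
# illustration, not the project's actual implementation.
import yaml
from jinja2 import Template

def render_with_context(context_file, tag, word_form, entry_context=None):
    """Wrap word_form in its display context, if any rule matches."""
    with open(context_file, encoding="utf-8") as f:
        rules = yaml.safe_load(f) or []
    for rule in rules:
        if rule.get("tag_context") != tag:
            continue
        # Assumption: an entry_context of None (or the unquoted YAML token
        # `None`, which PyYAML loads as the string "None") places no
        # requirement on the entry's @context attribute.
        wanted = rule.get("entry_context")
        if wanted not in (None, "None") and wanted != entry_context:
            continue
        return Template(rule["template"]).render(word_form=word_form,
                                                 context=entry_context)
    return word_form  # no matching rule: show the bare wordform

# Example, assuming a hypothetical sme.context file containing the rules above:
#   render_with_context("paradigms/sme.context", "V+Ind+Prs+Pl3",
#                       "deaivvadit", entry_context="sii")
#   -> "(odne sii) deaivvadit"
```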