Documentation of transfer rules.

"Apertium is a more complicated and less user friendly version of sed."

Documentation:
 * [http://wiki.apertium.org/wiki/Apertium_New_Language_Pair_HOWTO ]
 * [http://wiki.apertium.org/wiki/A_long_introduction_to_transfer_rules]
 * [http://wiki.apertium.org/wiki/Chunking:_A_full_example ]

Task:
{{{
Num Sg (Nom|Acc) + N Sg Gen => Num Sg Nom + N Par
}}}
 * sme: Leat guokte guoli.
 * smn: Láá kyehti kyellid.

Input:
{{{
$ echo "Leat guokte guoli" | apertium -d . sme-smn-biltrans
^Leat<@+FMAINV>/Leđe<@+FMAINV>$ ^guokte<@←SUBJ>/kyehti<@←SUBJ>$ ^guolli<@Num←>/kyeli<@Num←>$
}}}

Output:
{{{
$ echo "Láá kyehti kyellid" | hfst-proc smn-sme.automorf.hfst
^Láá/Leđe$ ^kyehti/kyehti$ ^kyellid/kyeli$
}}}

__Rule 1: Ignore disambiguation errors!__
{{{
$ echo "Leat guokte guoli" | apertium -d . sme-smn
#Leđe #kyehti kyele
}}}

So what output do we currently have?
{{{
$ echo "Leat guokte guoli" | apertium -d . sme-smn-postchunk
^Leđe$ ^kyehti$ ^kyeli$^.$
}}}

So, what do we actually want to do?
{{{
--> || [ |
}}}

In Apertium we call the first part (before the two pipes) the "action", and the second part (the context after the two pipes) the "pattern". So:
{{{
Pattern = [ | ]
Action = --> --> "" -->
}}}

Patterns are defined by "def-cat" entries. The "cat" stands for category.
{{{
}}}

This is a set of two items, one containing nom and one containing acc. You can change the order of the "cat-items" (they are more or less a set). The tags themselves are not sets; they are sequences with wildcards. To express "or" in a category entry, you just add more cat-item lines.
{{{
}}}

So, to match the pattern "numeral singular in nominative or accusative followed by noun singular in genitive" we would do:
{{{
}}}

Here the order is important: this is a sequence. The "." is not a regular expression ".";
it stands for the tag boundary "><", so:
{{{
n.sg.gen.* = <n><sg><gen>(<*>)+
n.*.gen.*  = <n><*><gen>(<*>)+
}}}

Let's start to define our rule file:
{{{
-------------------------------------------------------

-------------------------------------------------------
}}}

Input:
{{{
^guokte<@←SUBJ>/kyehti<@←SUBJ>$ ^guolli<@Num←>/kyeli<@Num←>$
|__________________________| |__________________________|
    Source language (SL)         Target language (TL)
|_________________________________________________________|
                     Lexical unit (LU)
}}}

Now we look at the action. Actions are defined within the <action> element. The action may contain different instructions and, most importantly, determines the output string. The instructions can work on both the source and the target side of the input lexical unit.

Output:
{{{
^num-noun{^kyehti$ ^kyeli$}$
 |______|
   name
|____________________________________________________|
                       Chunk
}}}

We define this with:
{{{
}}}

This is essentially like writing
{{{
^num-noun{}$
}}}

Each chunk has a name, some tags and some contents. For example, to get the "noun group" (sintagma nominal):
{{{
}}}

This is essentially like writing
{{{
^num-noun<SN>{}$
}}}

Looking at this in the file context:
{{{
-------------------------------------------------------

-------------------------------------------------------
}}}

This matches the input pattern and outputs:
{{{
^num-noun{}$
}}}

What is missing here is the chunk contents (i.e. the lexical units that were matched by the pattern).
{{{
}}}
 * pos = position; the position is defined by the order in the pattern.
 * side = which side of the LU to output.
 * part = a substring within one of the sides of a lexical unit.
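The three attributes can be read off a single instruction. A minimal sketch, assuming the noun is the second item in the pattern and that we want its complete target-language side (the attribute values are chosen for illustration, not copied from the pair's rule file):

{{{
<!-- pos="2":      second item in the pattern (here assumed to be the noun)
     side="tl":    target-language side of that lexical unit
     part="whole": the lemma together with all of its tags -->
<lu>
  <clip pos="2" side="tl" part="whole"/>
</lu>
}}}

Swapping side="tl" for side="sl" would select the source-language string of the same lexical unit instead.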
{{{
 __side="sl" part="whole"___
|                          |
|_lem_                     |
|     |                    |
^guokte<@←SUBJ>/kyehti<@←SUBJ>$
|__________________________| |__________________________|
    Source language (sl)         Target language (tl)
}}}

For "part" we can define our own patterns of substrings, but there are also some built in:
 * whole = the whole string kyehti<@←SUBJ>
 * lem = the lemma kyehti
 * tags = the tags <@←SUBJ>
 * lit = literal
 * v = value
 * n = name

So, for the rule above, it will currently output:
{{{
^num-noun{^kyeli<@Num←>$}$
}}}

Now that we have some output, we can start with the interesting part: changing the output so that it will generate properly. We'll start with the easy way, which is just specifying directly what we want to output. The input is the output from sme-smn-biltrans. Then comes this:
{{{
}}}

The lit-tag instruction outputs strings encased in < and >. The output is what we get by calling sme-smn-chunker1:
{{{
^num-noun{^kyeli$}$
}}}

Now, how would we output both lexical units? The output we are looking for is:
{{{
^num-noun{^kyehti$ ^kyeli$}$
}}}

The rule:
{{{
}}}

This will give:
{{{
^num-noun{^kyehti<@←SUBJ>$ ^kyeli$}$
}}}

This is good, but we don't want the syntax tag <@←SUBJ>. How can we change the tags? We first need to define the patterns that we want to change. For example, we could define a pattern that matches all of the possible syntax tags. These patterns are defined in a separate section:
{{{
}}}

The "def-attr" stands for "define attribute". The procedure for changing something goes something like:
{{{
}}}

This replaces any substring that matches one of the patterns in def-attr n="function" with the empty string. Here "lit" means "literal" and the attribute "v" is the value. That is:
{{{
(@←SUBJ|@←OBJ|@←ADVL) --> 0
}}}

So now if we have
{{{
}}}

we will get: ^kyehti$

Note that all <let> statements must go outside of the <out> statement.

What will the whole rule file look like?
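As a rough orientation, a complete file for this one rule might be laid out as follows. This is a sketch under assumptions, not the pair's actual rule file: the tag names follow the notation of the task line, and the category names, the SN chunk tag and the placeholder variable are made up for illustration.

{{{
<?xml version="1.0" encoding="UTF-8"?>
<transfer default="chunk">

  <!-- Categories: what the pattern items can match (source side). -->
  <section-def-cats>
    <def-cat n="num-sg-nom-acc">            <!-- hypothetical name -->
      <cat-item tags="Num.Sg.Nom.*"/>       <!-- one cat-item per "or" branch -->
      <cat-item tags="Num.Sg.Acc.*"/>
    </def-cat>
    <def-cat n="n-sg-gen">                  <!-- hypothetical name -->
      <cat-item tags="N.Sg.Gen.*"/>
    </def-cat>
  </section-def-cats>

  <!-- Attributes: substrings that clip can read or replace. -->
  <section-def-attrs>
    <def-attr n="function">
      <attr-item tags="@←SUBJ"/>
      <attr-item tags="@←OBJ"/>
      <attr-item tags="@←ADVL"/>
    </def-attr>
  </section-def-attrs>

  <!-- Placeholder only; kept in case the validator expects this section. -->
  <section-def-vars>
    <def-var n="unused"/>
  </section-def-vars>

  <section-rules>
    <rule comment="Num Sg (Nom|Acc) + N Sg Gen => Num Sg Nom + N Par">
      <pattern>
        <pattern-item n="num-sg-nom-acc"/>  <!-- pos 1 -->
        <pattern-item n="n-sg-gen"/>        <!-- pos 2 -->
      </pattern>
      <action>
        <!-- Delete the syntactic function tag of the numeral.
             The let statement comes before out, never inside it. -->
        <let>
          <clip pos="1" side="tl" part="function"/>
          <lit v=""/>
        </let>
        <out>
          <chunk name="num-noun">
            <tags>
              <tag><lit-tag v="SN"/></tag>  <!-- assumed chunk tag -->
            </tags>
            <!-- Numeral: target side with the function tag removed.
                 An accusative numeral would still need its case set
                 to Nom; this sketch does not handle that. -->
            <lu><clip pos="1" side="tl" part="whole"/></lu>
            <b pos="1"/>
            <!-- Noun: target lemma plus the tags we want, written directly. -->
            <lu>
              <clip pos="2" side="tl" part="lem"/>
              <lit-tag v="N.Par"/>
            </lu>
          </chunk>
        </out>
      </action>
    </rule>
  </section-rules>
</transfer>
}}}

The "easy way" from above is used for the noun (lemma plus literal tags), while the numeral shows the def-attr route of deleting a single tag and then outputting the rest unchanged.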
{{{
A B
-------------------------------------------------------

-------------------------------------------------------

-------------------------------------------------------
}}}

Homework:

The data in: {{A-3lex_ordinals_uptoten_gt-norm.gen.yaml}}

Command: hfst-proc sme-smn.automorf.hfst

More phrases:
 * num adv adj-attr noun - Mun oasttán guokte hui varas guoli.
 * num n-gen adj-attr noun - Mun gávdnen guokte eatni boares girjji.
 * num adv adj noun -