# About configs

Because paths in the configs usually differ depending on where the
service is running, only the `.in` files are checked in. Make a copy,
change the necessary paths (to FSTs and so on), and run the service with
that copy. If you change paradigms and the like, be sure to check those in.

Configs are written in YAML and should be fairly self-explanatory. See
`sample.config.yaml.in` for explanations of the various options.
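
The copy step can also be scripted. The following is only a sketch: it assumes the working config is named by dropping the trailing `.in` from the template name, which this README does not state, and the helper names `local_config_name` and `copy_template` are hypothetical.

```python
import shutil


def local_config_name(template):
    """Derive the working config name by dropping the trailing `.in`.

    Assumption: the local copy is named after the template minus `.in`.
    """
    if not template.endswith(".in"):
        raise ValueError("expected a .in template: %s" % template)
    return template[:-len(".in")]


def copy_template(template):
    """Copy a checked-in template to its local, unversioned working config."""
    target = local_config_name(template)
    shutil.copyfile(template, target)
    return target
```

After copying, edit the paths in the new file by hand; the template itself stays untouched.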

## Adding a new language

The process is still somewhat complex, but once the basic structure is in
place, much of the work can be done by linguists. Each section below
notes which role it is best suited to, so that it is clear where work
can be shared.

The following process assumes that a service already exists and that a
new language pair is being added to it.

### 1.) Establish a build process for the FSTs and lexicon.

**Intended**: Programmers

#### FSTs

Assuming that the language uses the `langs/` infrastructure, adding
another language to a dictionary set's build process is easy. Find the
targets for the dictionary set, for example, `kyv` and `kyv-install`, and
add the language ISO to the variable `GT_COMPILE_LANGS` for these
targets.

    .PHONY: baakoeh-install
    baakoeh-install: GT_COMPILE_LANGS := sma nob
    baakoeh-install: install_langs_fsts 

    .PHONY: baakoeh
    baakoeh: GT_COMPILE_LANGS := sma nob
    baakoeh: baakoeh-lexica compile_langs_fsts
    [... snip ...]

The dependencies for these will then automatically build, using as much
of the `langs/` build infrastructure as possible.

These targets will build analysers as usual. The `*-install` targets
are a convenience for cases where overwriting the analysers in
`/opt/smi/` is allowed. **Be careful** with this, though: language sets
like `sánit` and `baakoeh` are in production now, so overwriting their
analysers may have unintended consequences.

In any case, the files that these targets write are
dictionary-specific, and will not overwrite analysers for other
projects.

    /opt/smi/LANG/bin/dict-LANG.fst
    /opt/smi/LANG/bin/dict-iLANG-norm.fst
    /opt/smi/LANG/bin/some-LANG.fst
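
To see which of these files a build has actually produced, the patterns above can be expanded per language. This is a minimal sketch that assumes the three file name patterns and the `/opt/smi` prefix shown above; the helper names `expected_fst_paths` and `missing_fsts` are hypothetical and not part of the service.

```python
import os

# Path patterns copied from the list above; LANG becomes the language ISO.
FST_PATTERNS = [
    "{root}/{lang}/bin/dict-{lang}.fst",
    "{root}/{lang}/bin/dict-i{lang}-norm.fst",
    "{root}/{lang}/bin/some-{lang}.fst",
]


def expected_fst_paths(lang, root="/opt/smi"):
    """Return the dictionary-specific analyser paths for one language ISO."""
    return [pattern.format(root=root, lang=lang) for pattern in FST_PATTERNS]


def missing_fsts(lang, root="/opt/smi"):
    """Return the expected paths that do not exist on this machine."""
    return [p for p in expected_fst_paths(lang, root) if not os.path.exists(p)]


if __name__ == "__main__":
    for path in missing_fsts("sma"):
        print("missing:", path)
```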

##### Troubleshooting

If you do not succeed in getting these make targets to work with a new
language, run the process manually. It might be that `make distclean`
needs to be run once within the language directory, and then things will
work.

#### Lexicon

Editing the Makefile is a little tricky. You will need to add a target
for the lexicon file or files. 

Lexica are compiled using a `Saxon` process, and the Makefile contains
some variables that can be used as shortcuts. For languages using
`langs/` infrastructure for the lexicon, the best option is the
following:

    ZZZ-all.xml: $(GTHOME)/langs/ZZZ/src/morphology/stems/*.xml
	    @echo "**************************"
	    @echo "** Building ZZZ lexicon **"
	    @echo "**************************"
	    @echo "** Backup made (.bak)   **"
	    @echo "**************************"
	    @echo ""
	    -@cp $@ $@.$(shell date +%s).bak
	    mkdir -p ZZZ
	    cp $^ ZZZ/
	    $(SAXON) inDir=$(CURDIR)/ZZZ/ > $@
	    rm -rf ZZZ/

The above makes a copy of the XML files, and then uses the Saxon process
to compile them all into one file, with no additional processing.

This process will be the same if the lexica are in `main/words/dicts/`.
However, some languages there have multiple subdirectories that need to
be copied before the Saxon process is run.

Make note of the filename that you intend to output this to, and add it
to the language installation's lexicon target, for example,
`kyv-lexica`, `muter-lexica`; and also the remove target
(`rm-kyv-lexica`).

### 2.) Edit the .yaml file for new FSTs and Dictionaries

**Intended**: Programmers, linguists

Realistically anyone can do this as long as the build process is
working, since most of this should be a cut-and-paste job.

Once you're done, save the file and attempt to restart the service.

If everything seems to be working, do not check in the config file
itself, but copy the values to `INSTANCE.config.yaml.in`, and check that
in. This is simply so that no incoming updates to config files will
destroy existing production configs.

#### `Morphology` section

This needs to have the paths to the new analysers, for each language
ISO. Follow one of the existing languages and adjust the values as
necessary. If any language variants (such as mobile spell-relax) need to
be included, a good idea is to use the language ISO as the key, but with
one letter appended, e.g., the mobile variant of `udm` would be `udmM`.

In any case, the morphology section should contain a new entry like the
following:

    YYY:
      tool: *LOOKUP
      file: [*OPT, '/YYY/bin/dict-YYY.fst']
      inverse_file: [*OPT, '/YYY/bin/dict-iYYY-norm.fst']
      format: 'xfst'
      options:
        compoundBoundary: "+Use/Circ#"
        derivationMarker: "+Der"
        tagsep: '+'
        inverse_tagsep: '+'

Here YYY is the language ISO as used in the path. Note the unusual way
that paths are formed with aliases in this YAML: values may be strings
or lists, and lists are automatically concatenated into strings. This
is necessary because YAML does not allow string concatenation with
aliases/variables.
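
The effect of the list form can be sketched in a few lines of Python. This is not the service's actual code: the function name `resolve_path` is hypothetical, and `/opt/smi` is only an assumed value for the `*OPT` alias.

```python
def resolve_path(value):
    """Join a list of path fragments into one string; pass strings through."""
    if isinstance(value, (list, tuple)):
        return "".join(value)
    return value


# Once the YAML is loaded, the alias *OPT has already been replaced by its
# value, so the `file` entry arrives as a plain list of strings.
OPT = "/opt/smi"  # assumed value of the anchor that *OPT refers to
entry = {"file": [OPT, "/YYY/bin/dict-YYY.fst"]}

print(resolve_path(entry["file"]))
```

Because the fragments are joined without a separator, the second fragment has to start with `/`, as it does in the example entry.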

#### `Languages` section

Add a new entry for the language ISO to this list.


#### `Dictionaries` section

Here, add a new item to the list of dictionaries. The `path` value is
relative to the `neahtta` path, e.g., `dicts/file-name.xml`.

    Dictionaries:

      # [... snip ...]

      - source: udm
        target: hun
        path: 'dicts/udm-all.xml'

If any language variants, such as mobile spell-relax, need to be
included, this is the place to define them. Note that for the `type`
setting, the values `standard` and `mobile` are special. Only use
`mobile` for mobile spell-relax. If the variant is something else, such
as one of several orthographies, use another value.

The variant marked with `mobile` will be the variant that is
automatically displayed if a user navigates to the page via mobile
browser.

`short_name` for each variant must be set to the same value as the key
of its FST in the `Morphology` section, e.g., `"sme"`, `"SoMe"`, or `"udmM"`.

`description` will be displayed to users.

      - source: sme
        target: nob
        path: 'dicts/sme-nob.all.xml'
        input_variants:
          - type: "standard"
            description: "Standárda (<em>áčđŋšŧž</em>)"
            short_name: "sme"
          - type: "mobile"
            description: "Sosiála media (maiddái <em>acdnstz</em>)"
            short_name: "SoMe"
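
How the special `type` values might be used to choose a default variant can be sketched as follows. This is an illustration only: the function `pick_variant` is hypothetical, and treating `standard` as the default for non-mobile browsers is an assumption, since the text above only describes the mobile case.

```python
# Variant definitions mirroring the `input_variants` example above
# (descriptions shortened for the sketch).
SME_VARIANTS = [
    {"type": "standard", "description": "Standárda", "short_name": "sme"},
    {"type": "mobile", "description": "Sosiála media", "short_name": "SoMe"},
]


def pick_variant(variants, is_mobile):
    """Return the variant to show by default for this kind of browser."""
    # Assumption: non-mobile browsers get the `standard` variant.
    wanted = "mobile" if is_mobile else "standard"
    for variant in variants:
        if variant["type"] == wanted:
            return variant
    # Fall back to the first variant if the wanted type is not defined.
    return variants[0]
```

The returned `short_name` is then the key to look up in the `Morphology` section.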


### 3.) Define language names and translation strings

**Intended**: Linguists

Open the file `configs/language_names.py`. Here you will need to add the
language ISO to several variables. Save when done, and be sure to check
it in to SVN.

#### NAMES

Here we define the name in English, so that it will be available for
translation to any interface languages.

    ('sme', _(u"North Sámi")),

The easiest way is to copy an existing line and replace the contents
of the strings. If you're unfamiliar with Python, be careful not to
remove the underscore or parentheses around the strings; edit only the
text inside the quotes.

The first value should be the language ISO, **or** the language variant
(`SoMe`, `udmM`, `kpvS`, etc.)

#### LOCALISATION_NAMES_BY_LANGUAGE

Here we have the ISO and the language's name in the language.

    ('sme', u"Davvisámegiella"),

Again, copy and paste a line, and only edit the strings.
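
After editing both variables, a quick consistency check can catch a forgotten entry. The sketch below assumes that both variables are lists of `(code, name)` tuples, as the example lines suggest, and uses a stand-in for the `_` translation marker; the function `codes_missing_localised_name` is hypothetical. Whether variant keys such as `SoMe` also need a localised name is not stated here, so treat hits for variants as a prompt to check rather than as errors.

```python
def _(text):
    """Stand-in for the translation marker used in language_names.py."""
    return text


# Assumed shape: lists of (code, name) tuples, as in the example lines.
NAMES = [
    ('sme', _(u"North Sámi")),
]

LOCALISATION_NAMES_BY_LANGUAGE = [
    ('sme', u"Davvisámegiella"),
]


def codes_missing_localised_name(names, localised):
    """Return codes in `names` that have no entry in `localised`, in order."""
    have = {code for code, _name in localised}
    return [code for code, _name in names if code not in have]


if __name__ == "__main__":
    for code in codes_missing_localised_name(NAMES, LOCALISATION_NAMES_BY_LANGUAGE):
        print("no localised name for:", code)
```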

#### ISO_TRANSFORMS

If the language has a two-character ISO as well as a three-character
ISO, we must have these defined here.

    ('se', 'sme'),
    ('no', 'nob'),
    ('fi', 'fin'),
    ('en', 'eng'),
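
These pairs are presumably used to normalise two-character codes to the three-character ISO used elsewhere in the configs. A minimal sketch of such a lookup, assuming the pairs are stored as a list of tuples; the names `TWO_TO_THREE` and `to_three_letter` are hypothetical:

```python
ISO_TRANSFORMS = [
    ('se', 'sme'),
    ('no', 'nob'),
    ('fi', 'fin'),
    ('en', 'eng'),
]

# Build a lookup table from the (two-character, three-character) pairs.
TWO_TO_THREE = dict(ISO_TRANSFORMS)


def to_three_letter(code):
    """Map a two-character ISO to its three-character form, if one is defined."""
    # Codes without a mapping (including three-character ones) pass through.
    return TWO_TO_THREE.get(code, code)
```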

### 4.) Define tagsets, paradigms, and user-friendly tag relabels

**Intended**: Linguists

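If you wish to have paradigms visible in the language, you will need
tagsets and `.paradigm` files. The related `.context` and `.relabel`
files are listed here as well, each with the README that documents it: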

 * `Tagsets` files: `configs/language_specific_rules/tagsets/README.md`
 * `.paradigm` files: `configs/language_specific_rules/paradigms/README.md`
 * `.context` files: `configs/language_specific_rules/paradigms/README.md`
 * `.relabel` files: `configs/language_specific_rules/user_friendly_tags/README.md`

The easiest means of course is to look at existing languages and copy
what they do.

When done with these steps, be sure to add the new files and directories
to SVN and check them in.

### 5.) Paradigm bonus material: wordform contexts

**Intended**: Linguists

Paradigm contexts give additional information to users about how
wordforms are intended to be used.  Information about these is also
maintained in the paradigms readme.

    configs/language_specific_rules/paradigms/README.md