# About configs

Because paths in the configs will likely need to change depending on where the service is running, only the `.in` files are checked in. Make a copy, change the necessary paths to FSTs and such, and run the service with that copy. If there are changes to paradigms and such, be sure to check those in.

Configs are written in YAML and should be fairly self-explanatory. See `sample.config.yaml.in` for explanations of the various options.

## Adding a new language

So far the process is a little complex, but there are things that can be done mostly by linguists once the basic structure is in place. In each of the following sections, I'll mark which role the step is best suited for, so that it's clearer where work can be shared.

The following process assumes that a service already exists to which a new language pair is being added.

### 1.) Establish a build process for the FSTs and lexicon.

**Intended**: Programmers

#### FSTs

Assuming that the language uses the `langs/` infrastructure, adding another to a dictionary set's build process is easy. Find the targets for the dictionary set, for example, `kyv` and `kyv-install`, and add the language ISO to the variable `GT_COMPILE_LANGS` for these targets.

    .PHONY: baakoeh-install
    baakoeh-install: GT_COMPILE_LANGS := sma nob
    baakoeh-install: install_langs_fsts

    .PHONY: baakoeh
    baakoeh: GT_COMPILE_LANGS := sma nob
    baakoeh: baakoeh-lexica compile_langs_fsts
        [... snip ...]

The dependencies for these will then build automatically, using as much of the `langs/` build infrastructure as possible.

These targets will build analysers as usual, but the `*-install` targets are there as a convenience for when overwriting the analysers in `/opt/smi/` is allowed. **Be careful** with this though, because with language sets like `sánit` and `baakoeh`, which are very much in production mode now, there may be some unintended consequences.
In any case, the targets that these will write to are dictionary-specific, and will not overwrite analysers for other projects.

    /opt/smi/LANG/bin/dict-LANG.fst
    /opt/smi/LANG/bin/dict-iLANG-norm.fst
    /opt/smi/LANG/bin/some-LANG.fst

##### Troubleshooting

If you do not succeed in getting these make targets to work with a new language, run the process manually. It might be that `make distclean` needs to be run once within the language directory, and then things will work.

#### Lexicon

Editing the Makefile is a little tricky. You will need to add a target for the lexicon file or files. Lexica are compiled using a `Saxon` process, and the Makefile contains some variables that can be used as shortcuts.

For languages using `langs/` infrastructure for the lexicon, the best option is the following:

    ZZZ-all.xml: $(GTHOME)/langs/ZZZ/src/morphology/stems/*.xml
        @echo "**************************"
        @echo "** Building ZZZ lexicon **"
        @echo "**************************"
        @echo "** Backup made (.bak) **"
        @echo "**************************"
        @echo ""
        -@cp $@ $@.$(shell date +%s).bak
        mkdir ZZZ
        cp $^ ZZZ/
        $(SAXON) inDir=$(pwd)/ZZZ/ > ZZZ-all.xml
        rm -rf ZZZ/

The above makes a copy of the XML files, and then uses the Saxon process to compile them all into one file, with no additional processing. This process will be the same if the lexica are in `main/words/dicts/`, however some languages there have multiple subdirectories that need to be copied before the Saxon process is run.

Make note of the filename that you intend to output this to, and add it to the language installation's lexicon target, for example, `kyv-lexica`, `muter-lexica`; and also the remove target (`rm-kyv-lexica`).

### 2.) Edit the .yaml file for new FSTs and Dictionaries

**Intended**: Programmers, linguists

Realistically anyone can do this as long as the build process is working, since most of this should be a cut-and-paste job. Once you're done, save the file and attempt to restart the service.
If everything seems to be working, do not check in the config file itself, but copy the values to `INSTANCE.config.yaml.in`, and check that in. This is simply so that no incoming updates to config files will destroy existing production configs.

#### `Morphology` section

This needs to have the paths to the new analysers, for each language ISO. Follow one of the existing languages and adjust the values as necessary. If any language variants (e.g. mobile spellrelax) need to be included, a good idea is to use the language ISO as the key, but with one letter appended, e.g., `udm` for mobile would be `udmM`.

In any case, the morphology section should contain a new entry like the following:

    YYY:
      tool: *LOOKUP
      file: [*OPT, '/YYY/bin/dict-YYY.fst']
      inverse_file: [*OPT, '/YYY/bin/dict-iYYY-norm.fst']
      format: 'xfst'
      options:
        compoundBoundary: "+Use/Circ#"
        derivationMarker: "+Der"
        tagsep: '+'
        inverse_tagsep: '+'

Where YYY is the language ISO path.

Note the unusual way that paths are formed with aliases here. Path values may be strings or lists, and if they are lists, they will be automatically concatenated into strings. This must be done because YAML does not allow string concatenation with aliases/variables.

#### `Languages` section

Add a new entry for the language ISO to this list.

#### `Dictionaries` section

Here, add a new item to the list of dictionaries, with a path relative to the `neahtta` path, e.g., `dicts/file-name.xml`.

    Dictionaries:
      # [... snip ...]
      - source: udm
        target: hun
        path: 'dicts/udm-all.xml'

If any language variants (e.g. mobile spellrelax) need to be included, this is the place to define them. Note that for the `type` setting, the values `standard` and `mobile` are special. Only use `mobile` for mobile spell-relax. If the type of variant is something else, like handling multiple orthographies, use another value. The variant marked with `mobile` will be the variant that is automatically displayed if a user navigates to the page via mobile browser.
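
As a rough illustration of how the special `type` values could drive the default choice, here is a minimal sketch. It assumes the `input_variants` entries are available as a list of dicts once the YAML is loaded; `pick_default_variant` and its fallback behaviour are hypothetical names and assumptions for illustration, not the service's actual code. The sample values mirror the `sme` entry in this section.

```python
# Hypothetical helper for illustration only; not part of the service's code.
def pick_default_variant(input_variants, is_mobile):
    """Return the variant to show by default for this visitor."""
    wanted = "mobile" if is_mobile else "standard"
    for variant in input_variants:
        if variant.get("type") == wanted:
            return variant
    # Assumed fallback: use the first variant, or None if none are defined.
    return input_variants[0] if input_variants else None


# Variant entries as they might look after the config has been loaded.
sme_variants = [
    {"type": "standard", "description": "Standárda (áčđŋšŧž)", "short_name": "sme"},
    {"type": "mobile", "description": "Sosiála media (maiddái acdnstz)", "short_name": "SoMe"},
]

print(pick_default_variant(sme_variants, is_mobile=True)["short_name"])
print(pick_default_variant(sme_variants, is_mobile=False)["short_name"])
```

A variant with any other `type` value (for example, one for an additional orthography) would never be picked automatically in this sketch; it would only be chosen explicitly.
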
`short_name` for each variant must be set to the same value as the FST, so, `"sme"`, or `"SoMe"`, or `udmM`. `description` will be displayed to users.

      - source: sme
        target: nob
        path: 'dicts/sme-nob.all.xml'
        input_variants:
          - type: "standard"
            description: "Standárda (áčđŋšŧž)"
            short_name: "sme"
          - type: "mobile"
            description: "Sosiála media (maiddái acdnstz)"
            short_name: "SoMe"

### 3.) Define language names and translation strings

**Intended**: Linguists

Open the file `configs/language_names.py`. Here you will need to add the language ISO to several variables. Save when done, and be sure to check it in to SVN.

#### NAMES

Here we define the name in English, so that it will be available for translation to any interface languages.

    ('sme', _(u"North Sámi")),

The easiest way is to copy one existing line, and replace the contents of the strings. If you're unfamiliar with Python, be careful not to remove any underscores around the strings, and only edit the contents.

The first value should be the language ISO, **or** the language variant (`SoMe`, `udmM`, `kpvS`, etc.)

#### LOCALISATION_NAMES_BY_LANGUAGE

Here we have the ISO and the language's name in the language.

    ('sme', u"Davvisámegiella"),

Again, copy and paste a line, and only edit the strings.

#### ISO_TRANSFORMS

If the language has a two-character ISO as well as a three-character ISO, we must have these defined here.

    ('se', 'sme'),
    ('no', 'nob'),
    ('fi', 'fin'),
    ('en', 'eng'),

### 4.) Define tagsets, and paradigms, user-friendly tag relabels

**Intended**: Linguists

If you wish to have paradigms visible in the language, you will need the following kinds of files. Each is documented in the README listed next to it:

* `Tagsets` files: `configs/language_specific_rules/tagsets/README.md`
* `.paradigm` files: `configs/language_specific_rules/paradigms/README.md`
* `.context` files: `configs/language_specific_rules/paradigms/README.md`
* `.relabel` files: `configs/language_specific_rules/user_friendly_tags/README.md`

The easiest approach, of course, is to look at existing languages and copy what they do.

When done with these steps, be sure to add the new files and directories to SVN and check them in.

### 5.) Paradigm bonus material: wordform contexts

**Intended**: Linguists

Paradigm contexts give additional information to users about how wordforms are intended to be used. Information about these is also maintained in the paradigms readme.

    configs/language_specific_rules/paradigms/README.md