!!!Edmonton presentation

University of Alberta, Edmonton, June 8th & 13th 2015

Sjur Moshagen, UiT The Arctic University of Norway

!!!Overview of the presentation

* background and goals
* bird's eye view
* closer view of selected parts:
** documentation
** testing
** from source to final tool

!!!Background and goals

* Background
* Goals

!!Background

* need for simpler maintenance
* scalability, in terms of languages, tools, and the number of linguists and other developers
* developing NLP resources is a lot of work, and languages are complex - we need a tool and an infrastructure to handle the complexity in a manageable way
* keep technical details out of the way
* make the daily work as simple as possible
* division of labour
* recognition: know the basic setup of one language - know the setup of them all

!!Goals

* easy support for many languages
* easy support for many tools
* keep language-independent and language-specific code apart
* easily upgradable
* the resources in our infrastructure should live on for decades or more

!!General principles

# Be explicit (use ''non-cryptic'' catalogue and file names)
# Be clear (files should be found in non-surprising locations)
# Be consistent (identical conventions in all languages as far as possible)
# Be modular
# Divide language-dependent and language-independent code
# Reuse resources
# Build all tools for all languages
# ... but only as much as you want (parametrised build process)

!!!Bird's Eye View and Down

* the house
* organisation - directory structure
* technologies (Xerox, HFST, Foma + CG)
* templated build structure and source files
* configuration of builds

!!The House

[../images/hus_eng_2015.png]

!!The House and the Infra

[../images/hus_eng_2015_with_infra.png]

* {{*Machine translation: fst's built by the infra, the rest handled by Apertium}}
* {{*Speech synthesis is not (yet) built by the infra; conversion to IPA is part of the infrastructure, though}}
* {{Supported: the fst's and syntactic parsers used are built by the infrastructure}}

!!$GTHOME - directory structure

Some less relevant dirs removed for clarity:

{{{
$GTHOME/             # root directory, can be named whatever you want
├── experiment-langs # language dirs used for experimentation
├── giella-core      # $GTCORE - core utilities
├── giella-shared    # shared linguistic resources
├── giella-templates # templates for maintaining the infrastructure
├── keyboards        # keyboard apps organised roughly as the language dirs
├── langs            # The languages being actively developed, such as:
│   ├─[...]          #
│   ├── crk          # Plains Cree
│   ├── est          # Estonian
│   ├── evn          # Evenki
│   ├── fao          # Faroese
│   ├── fin          # Finnish
│   ├── fkv          # Kven
│   ├── hdn          # Northern Haida
│   └─[...]          #
├── ped              # Oahpa etc.
├── prooftools       # Libraries and installers for spellers and the like
├── startup-langs    # Directory for languages in their start-up phase
├── techdoc          # technical documentation
├── words            # dictionary sources
└── xtdoc            # external (user) documentation & web pages
}}}

!!Organisation - Dir Structure

{{{
.
├── src = source files
│   ├── filters        = adjust fst's for special purposes
│   ├── hyphenation    = nikîpakwâtik > ni-kî-pa-kwâ-tik
│   ├── morphology     =
│   │   ├── affixes    = prefixes, suffixes
│   │   └── stems      = lexical entries
│   ├── orthography    = latin -> syllabics, spellrelax
│   ├── phonetics      = conversion to IPA
│   ├── phonology      = morphophonological rules
│   ├── syntax         = disambiguation, synt. functions, dependency
│   ├── tagsets        = get your tags as you want them
│   └── transcriptions = convert number expressions to text or vice versa
├── test
│   ├── data = test data
│   └── src  = tests for the fst's in the src/ dir
└── tools
    ├── grammarcheckers = prototype work, only SME for now
    ├── mt              = machine translation
    │   └── apertium    = ... for certain MT platforms
    ├── preprocess      = split text into sentences and words
    └── spellcheckers   = spell checkers are built here
}}}

!!Technologies

* All technologies are rule-based, as opposed to statistical and similar approaches.
* This allows us to write grammars that are precise descriptions of the languages - reference grammars, in a way
* Goal: the documentation of your grammar - with suitable examples etc. - could be the next published grammar for your language (we'll return to that shortly)

!Technology for morphological analysis

We presently use three different technologies:

* Xerox - closed source, not properly maintained, fast, no weights
* HFST - open source, actively maintained, used in our proofing tools
* Foma - open source, actively maintained, fast (newly added, not yet available for all fst's)

!Technology for syntactic parsing

* CG (VISL CG-3, from the University of Southern Denmark)
* used for syntactic parsing
* also for grammar checking
* Basic idea: remove unwanted readings, or select wanted ones, based on the morphosyntactic context (= the output of the morphological analysis)
* Example (a slightly fuller sketch follows below):

{{{
# We like finite verbs:
SELECT:Vfin VFIN ;
}}}
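To show what a contextual rule looks like, here is a minimal, hypothetical CG-3 fragment. It is a sketch only: the set definitions and the determiner-noun disambiguation pattern are invented for this presentation, not taken from an actual grammar in the infrastructure.

{{{
DELIMITERS = "<.>" "<!>" "<?>" ;

LIST DET  = Det ;
LIST VERB = V ;
LIST NOUN = N ;

# Discard verb readings when the word immediately to the left (-1)
# is unambiguously (C = careful mode) a determiner:
REMOVE VERB IF (-1C DET) ;

# In the same context, prefer noun readings over whatever is left:
SELECT NOUN IF (-1C DET) ;
}}}

Each rule is tried against every word (cohort) in the sentence and fires only when its context condition holds; the readings that survive all rules make up the final analysis.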
!!Templated Build Structure And Source Files

* Common resources in {{$GTHOME/giella-core/}}
* Template for new languages, including build instructions
* The template is merged (using svn merge) with each language when updated

[../images/newinfra.png]

!!Configurable builds

We support a lot of different tools and targets, but in most cases one wants only a handful of them. At the end of a {{./configure}} run, you get a summary of what is turned on and off:

{{{
$ ./configure --with-hfst
[...]
-- Building giella-crk 20110617:
-- Fst build tools: Xerox, Hfst or Foma - at least one must be installed
-- Xerox is default on, the others off unless they are the only one present --
  * build Xerox fst's: yes
  * build HFST fst's: yes
  * build Foma fst's: no
-- basic packages (on by default): --
  * analysers enabled: yes
  * generators enabled: yes
  * transcriptors enabled: yes
  * syntactic tools enabled: yes
  * yaml tests enabled: yes
  * generated documentation enabled: yes
-- proofing tools (off by default): --
  * spellers enabled: no
  * hfst speller fst's enabled: no
  * foma speller enabled: no
  * hunspell generation enabled: no
  * fst hyphenator enabled: no
  * grammar checker enabled: no
-- specialised fst's (off by default): --
  * phonetic/IPA conversion enabled: no
  * dictionary fst's enabled: no
  * Oahpa transducers enabled: no
  * L2 analyser: no
  * downcase error analyser: no
  * Apertium transducers enabled: no
  * Generate abbr.txt: no

For more ./configure options, run ./configure --help
}}}

!!The build - schematic

[../images/new_infra_build_overview.png]

!!!Closer View Of Selected Parts:

* Documentation
* Testing
* From Source To Final Tool:
** Relation Between Lexicon, Build And Speller

!!!Closer View: Documentation

* Background
* Implementation

!!Background

* Documentation is always out of date
* The further away the documentation lives from the thing it documents, the more out of date it tends to be - and vice versa
* How to improve: make it possible to write the documentation within the source code
* This is similar to JavaDoc, Doxygen and many other such systems
* Ultimate goal:
** document the source code so well that it can be published as the next reference grammar!

!!Implementation

* The infrastructure automatically extracts comments of a certain type, and converts them into html
* One can cite portions of the source code, as well as test data
* The syntax of the comments must follow the jspwiki syntax

Example cases:

* [https://giellalt.uit.no/lang/fin/root-morphology.html]
* [https://giellalt.uit.no/lang/smj/nouns-affixes.html]

Documentation:

* [https://giellalt.uit.no/infra/infraremake/In-sourceDocumentation.html]

!!!Closer View: Testing

* testing framework
* yaml tests
* in-source tests
* other tests

!!Testing Framework

All automated testing within the infrastructure is based on the testing facilities provided by Autotools. All tests are run with a single command:

{{{
make check
}}}

Autotools reports {{PASS}} or {{FAIL}} for each test as it finishes:

[../images/make_check_output.png]

!!Yaml Tests

These are the most used tests, and are named after the syntax of the test files. The core structure is:

* a header
* test sets:
** test name
** test data
* syntax requirements: indentation using spaces, multiple choices as lists within brackets, colons after everything except the word forms

{{{
Config:
  hfst:
    Gen: ../../../src/generator-gt-norm.hfst
    Morph: ../../../src/analyser-gt-norm.hfst
  xerox:
    Gen: ../../../src/generator-gt-norm.xfst
    Morph: ../../../src/analyser-gt-norm.xfst
    App: lookup

Tests:
  Noun - mihkw - ok : # -m inanimate noun, blood, Wolvengrey
    mihko+N+IN+Sg: mihko
    mihko+N+IN+Sg+Px1Sg: nimihkom
    mihko+N+IN+Sg+Px2Sg: kimihkom
    mihko+N+IN+Sg+Px1Pl: nimihkominân
    mihko+N+IN+Sg+Px12Pl: kimihkominaw
    mihko+N+IN+Sg+Px2Pl: kimihkomiwâw
    mihko+N+IN+Sg+Px3Sg: omihkom
    mihko+N+IN+Sg+Px3Pl: omihkomiwâw
    mihko+N+IN+Sg+Px4Pl: omihkomiyiw
}}}

!Yaml test output

[../images/make_check_output.png]

* each yaml test file gets its own line of output with PASS / FAIL / TOTAL counts
* at the end of each yaml test run (= all yaml files for the same fst) there is a summary of the total results for that run
* ... followed by the Automake PASS / FAIL message

!!In-Source Tests

* LexC tests
* Twolc tests

!LexC tests

As an alternative to the yaml tests, one can specify similar test data within the source files:

{{{
LEXICON MUORRA !!= @CODE@ Standard even stems with cg (note Q1). OBS: Nouns with invisible 3>2 cg (as bus'sa) go to this lexicon.
 +N: MUORRAInfl ;
 +N:%> MUORRACmp ;
!!€gt-norm: kárta # Even-syllable test
!!€ kártta: kártta+N+Sg+Nom
!!€ kártajn: kártta+N+Sg+Com
}}}

Such tests are very useful as checks of whether an inflectional lexicon behaves as it should. The syntax is slightly different from that of the yaml files:

* word form first
* multiple alternative word forms on separate lines

!Twolc tests

The twolc tests look like the following:

{{{
!!€ iemed9#
!!€ iemet#
!!€ gål'leX7tj#
!!€ gål0lå0sj#
}}}

The point is to ensure that the rules behave as they should.

!!Other Tests

You can write any test you want, using your favourite programming language - a minimal sketch follows below. There are already a number of shell scripts to test speller functionality, and more tests will be added as the infrastructure develops.
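As an illustration of such a hand-written test, here is a minimal shell sketch of the kind Automake can run from {{make check}}. It is hypothetical: the word form and expected analysis are borrowed from the yaml example above, and the path to the analyser is an assumption for this sketch, not something prescribed by the infrastructure.

{{{
#!/bin/sh
# Look up one known word form and compare with the expected analysis.
# hfst-lookup prints tab-separated lines (input, analysis, weight);
# we keep the first analysis only.
analysis=$(echo 'mihko' \
    | hfst-lookup -q ../../src/analyser-gt-norm.hfst \
    | cut -f2 | head -n1)

if test "$analysis" = 'mihko+N+IN+Sg'; then
    exit 0    # Automake reads exit code 0 as PASS
else
    echo "expected 'mihko+N+IN+Sg', got '$analysis'" >&2
    exit 1    # ... and any other code (except 77 = SKIP) as FAIL
fi
}}}

The only contract is the exit code, so the same pattern works in any language.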
!!!Closer View: From Source To Final Tool:

* Relation Between Lexicon, Build And Speller
* Fst's And Dictionaries

!!Relation Between Lexicon, Build And Speller

* tag conventions
* automatically generated filters
* spellers and different writing systems / alternative orthographies

!Tag Conventions

We use certain tag conventions in the infrastructure:

* {{+Err/...}} ({{+Err/Orth}}, {{+Err/Cmp}})
* {{+Sem/...}}
* and more...

!Automatically Generated Filters

* many of these clusters of tags serve specific purposes only, and are removed from all other fst's
* tags sharing a common prefix (like {{+Err/}} or {{+Sem/}}) automatically get filters for different purposes
* there are filters for:
** removing the tags themselves
** removing strings / words containing the tags
* by adhering to these conventions, you get a lot of functionality for free
* this system is used e.g. when dealing with descriptive vs normative grammars, as on the next slide

!Dealing with descriptive vs normative grammars

* the normative language is a subset of the descriptive one
* tag the non-normative forms using {{+Err/...}} tags
* write your grammar as a descriptive one
* remove the {{+Err/...}} strings (see the sketch below)
* => a normative fst!
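To make the last step concrete, here is a hedged sketch of how such a "remove strings" filter could be built and applied with the HFST command line tools. The file names are invented for this example, and in the infrastructure the corresponding filters are generated automatically by the build system from the tags found in the source files, rather than typed by hand.

{{{
# A filter accepting only strings that contain no +Err/ tag
# (Xerox-style regular expression; each quoted tag is a single symbol):
echo '~$[ "+Err/Orth" | "+Err/Cmp" ]' \
    | hfst-regexp2fst > remove-err-strings.hfst

# Compose the filter onto the analysis side of the descriptive fst;
# every string carrying an +Err/ tag is removed => a normative fst:
hfst-compose -1 remove-err-strings.hfst -2 analyser-desc.hfst \
    -o analyser-norm.hfst
}}}

Because the filters are derived from the tags themselves, adding a new {{+Err/}}-tagged form to the lexicon requires no extra work to keep the normative tools clean.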
!!!Summary

* scalability
* division of labour
* language independence
* ... but still flexible wrt the needs of each language

!!!Giitu

* Thank you!