# Lecture Tue 30.10

Teachers: Jack Rueter & Sjur Moshagen

Topics:

* Course intro (Jack)
* field overview (Jack)
* majority vs minority language technology (Sjur)

Reading material:

* [Sustainable LT Resources](http://ixa2.si.ehu.es/~jipsagak/Moshagen_slides.pdf) (Moshagen, 2012)
* [A restricted freedom of choice: Linguistic diversity in the digital landscape](https://septentrio.uit.no/index.php/nordlyd/article/view/2474/2297) (Trond Trosterud, University of Tromsø)

## Majority vs minority language technology

Of course there is no such thing as minority or majority language technology per se, but there is still some truth to the concept. First, a brief historical detour:

### A short history of LT

The history of haves and have-nots...

* writing (cuneiform, hieroglyphs, alphabets like runes)
    * many of these technologies were simple: you could use what you carried with you and found around you
    * ![Cuneiform](../2018_Creutz/images/cuneiform.jpg)
    * ![Runes](bilete/b190.jpg)
    * using Bible translations as a proxy, [roughly 1600 of the about 7000 languages of the world still don't have a writing system](https://en.wikipedia.org/wiki/List_of_Bible_translations_by_language) at all
* typing:
    * Gutenberg increased production speed tremendously, and similarly reduced costs
    * but the initial costs got much higher: now a pen, a knife or a stick was not enough anymore; you needed types, lots of types, and a big machine
    * ![Gutenberg](../2018_Creutz/images/printpress.jpg)
* digitalisation:
    * reducing the cost of text production dramatically once more
    * but the initial costs and barriers go up once again, as you need:
        * a computer to write on (and potentially all your readers need one as well)
        * a keyboard (soft or hard)
        * fonts
        * an encoding standard
        * standards for languages, writing systems, areas, etc.
        * operating systems that can render your text properly using all of the above (and this is still not the case for some writing systems: «[As of 2015 there are no fonts that successfully display all of Mongolian correctly when written in Unicode.](https://en.wikipedia.org/wiki/Mongolian_script#Font_issues)»; cf. also Kildin Sámi and any other language forced to use combining diacritics because Unicode does not allow any new precomposed letters; see the sketch below)
    * ![Internet](../2018_Creutz/images/internet.jpg)
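To make the combining-diacritics problem concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The particular letters are my own illustrative choices: 'á' and Cyrillic 'ӣ' have precomposed code points, while a Cyrillic vowel plus combining macron of the kind Kildin Sámi needs does not, and by Unicode policy never will, so its display is left entirely to fonts and renderers.

```python
import unicodedata

# 'a' + combining acute accent normalises (NFC) to the precomposed 'á' (U+00E1):
print(unicodedata.normalize("NFC", "a\u0301") == "\u00e1")        # True

# Cyrillic 'и' + combining macron also has a precomposed form, 'ӣ' (U+04E3):
print(len(unicodedata.normalize("NFC", "\u0438\u0304")))          # 1 code point

# But Cyrillic 'а' + combining macron (а̄, a long vowel in e.g. Kildin Sámi)
# has no precomposed code point, and Unicode will not add new ones, so the
# letter stays as two code points; correct display depends on font support:
print(len(unicodedata.normalize("NFC", "\u0430\u0304")))          # 2 code points
```

This asymmetry is exactly what keyboard layouts, fonts and spellers for such languages have to cope with.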
That is just the prerequisite costs. To actually get the LT tools you expect for e.g. Finnish, you would also need:

* a speller engine
* a language model for the speller
* linguistic descriptions suitable for a grammar checker
* hyphenation rules and patterns
* dictionaries
* machine translation resources
* corpora
* sound recordings and phonetic models for text-to-speech systems
* ... and the same for speech recognition
* and all of this integrated with and working in a multitude of operating systems
* ... and in a number of applications on each system
* ... and then be prepared to update your tools all the time, both linguistically and technically

The costs of working LT in our society today are truly high; we just don't usually see them.

What has happened with each major technology shift is:

* the initial costs have increased dramatically
* more languages have come on board
* the use of LT has increasingly turned into a requirement for being a modern society of its time, or a prerequisite for the modernisation of society

## Today

There is a hierarchy, or a scale, with English at one end, the roughly 1600 languages without any writing system at all at the other end, and the rest of the languages in between. In terms of LT support, most languages of the world fall towards the low end.

This picture is very coarse. Even high-end languages like German do not have everything that English has, or the quality is not as good as for English. And there are really big languages with tens of millions of speakers and no language technology support at all, as documented in the article linked in the reading material for this lecture.

![digital divide](../2018_Creutz/images/293-2067-1-PB.jpg)

### Linguistic differences

There is a tendency for large language communities to move towards a simplified grammar. The top three languages of the world (Mandarin, Spanish, English) all have very modest (or no) morphology. Minority languages, on the other hand, often have complex to very complex grammars. We have worked mostly with circumpolar languages, such as the Uralic languages, Greenlandic and native languages in Canada. All of these languages have complex to very complex morphology or morphophonology (or both!).

![International cooperation](../2018_Creutz/images/gtlangs_circumpolar_names.png)

### Consequences

Combine the typological differences with the differences in economy and technology, and the result is that the dominating language technologies are such that linguistic analysis doesn't really matter. Morphology is a problem rather than a feature, and the same goes for phonology: the technologies basically assume a linear string of (mostly) invariant words, and calculate statistical or other patterns from these strings. This makes it even harder to develop tools for the languages we care about: the mainstream technology is more or less useless, at least presently.

For the minority language communities this means that:

* young people want access to technology, and if they can't get it in their mother tongue they will use another language (minority language speakers are very often bi- or multilingual)
* lack of LT is becoming a strong force in language death: a language that is not being used is a dead language

## The alternatives

Fortunately, there are still nation-state languages with complex morphology, and universities and research groups working on alternative technologies, HU being one such place. Using technology developed here, it is possible to build tools and LT solutions that will work well for, in principle, any language in the world. This course is an introduction to these technologies, the tools that can be built with them, and the framework around them.

### Support for minority languages

Our work (Jack and me, plus the groups we represent) has focused on minority languages, starting out with the Sámi languages:

* rule-based technology
* writing support, support for building a written culture
* supporting an independent language community

This means making everything from keyboards to speech synthesis (and maybe speech recognition in the future).

![Plains Cree keyboard menu entry](../2018_Creutz/images/crk-Latn.png)

So: does it make sense to talk about minority and majority LT? Not in itself, but because of the material basis, the reality that LT must build on and that has been presented above, it does.

### Majority language technology (= "English" LT)

* statistical, and now also neural
* depends on huge corpora
* no linguists needed, only statisticians and software engineers

### Minority language technology

* rule-based, perhaps with neural methods in the future as a complement
* depends on there being:
    * mother tongue speakers
    * ... and good grammars and dictionaries
    * ... and linguists

As long as there are mother tongue speakers, it is always possible to build a working system.

## Reducing the costs

Due to the huge initial costs of developing working LT solutions (independent of technology), a lot of our work has focused on ways to reduce that initial cost for new language communities, or on concentrating the costs where the language communities can afford them. Much of this circles around the concepts of reusability and grammatical work.

### Reusable grammars

* writing rule-based descriptions of a language is labour-intensive, but a well-written description can be reused in most applications
* by covering *and tagging* both descriptive and normative features of the grammar, it can be used for both descriptive and normative tasks (see the sketch after this list)
* in general, by making all features of a grammar explicit in the linguistic analysis, one can easily modify the grammar for various purposes
* lack of corpus material: there just isn't enough text to build statistical or neural models, if there is text at all
* mirroring the internalised grammar of people requires just people, not huge corpora or training data
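As a small illustration of the tagging idea in the list above, here is a toy Python sketch of a hand-written description that covers both normative and substandard forms and tags them, so that the very same resource drives a descriptive analyser and a normative speller. The mini-lexicon, the suffixes and the tag names are invented for the example; real descriptions in the Giella infrastructure are written as finite-state transducers, not Python.

```python
# Toy illustration of one hand-written, tagged description serving two tasks.
# All stems, suffixes and tags below are invented for the example.

LEXICON = {
    # stem -> part-of-speech tag
    "talo": "+N",
    "kirja": "+N",
}

SUFFIXES = [
    # (surface suffix, inflection tags, part of the norm?)
    ("",  "+Sg",     True),   # singular: bare stem
    ("t", "+Pl",     True),   # normative plural ending
    ("d", "+Pl+Err", False),  # common substandard spelling of the plural
]

def analyse(word):
    """Descriptive task: return every analysis, normative or not."""
    analyses = []
    for stem, pos in LEXICON.items():
        for suffix, tags, normative in SUFFIXES:
            if word == stem + suffix:
                analyses.append((stem, pos + tags, normative))
    return analyses

def accept(word):
    """Normative task (speller): accept only forms with a normative analysis."""
    return any(normative for _, _, normative in analyse(word))

if __name__ == "__main__":
    for w in ["talot", "talod", "kirja", "kirjax"]:
        print(w, analyse(w), "OK" if accept(w) else "flagged")
```

The point is that the normative/descriptive distinction lives in the linguistic description itself, so the same resource can be reused across applications instead of being rewritten for each one.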
### Reusable LT tool components

The basic machinery of a speller is the same independent of language. At the same time, it is a lot of work to develop a decent speller engine and make it work with MS Office, LibreOffice, InDesign, etc. Factoring out those components and ensuring they are language independent makes it possible to reuse them for any language. The same goes for grammar checkers, hyphenators, machine translation, language learning tools, and so on. The work we have done, and are doing, for the Sámi languages will thus become available and usable for any other language building on our resources.

### Scalable infrastructure

All of the above is baked into the infrastructures we use for our LT development. Most of the work is done in the Giella infrastructure, and MT work is done in the Apertium infrastructure. Both share the same philosophy regarding reuse and support for minority languages.

The Giella infrastructure specifically is built to support scaling in two dimensions:

* adding new languages
* adding support for new tools and features

This makes it possible for a new language community to get a head start, saving both time and costs:

* all basic setup is done
* all integration work is done
* ⇒ you can start directly on the linguistic work, concentrate on that, and rest assured that the final tools will work in LibreOffice, MS Office, and on Windows, Linux, macOS, and so forth

## Summary

* given the realities of today, it makes some sense to talk about minority language LT, but the picture is more complicated than a simple minority vs majority dichotomy
* the up-front costs of LT are huge, leaving most languages without even basic LT support
* working rule-based and with reusability in mind helps reduce the costs
* building on what has already been done for other languages reduces the costs further
* this course will focus mostly on minority languages from the Uralic language family