!!!Keyboards and preprocessing

Plans for the summer and autumn:

* keyboard for iOS8 and Android (Lavangen and India? Public call)
* preprocessing
* work for Mike

!!!keyboard for iOS8 and Android (Lavangen and India? Public call)

Funding: the Divvun budget for extra initiatives
Schedule: finished by the public launch of iOS8 (rumour: September). We aim
for a beta on 15 September, and a finished version as soon as possible after that

Design goals:
* as similar to Apple's keyboard as possible
* completion and correction suggestions from hfst (but perhaps only list-based in
  the first version, to get it released)
* Norwegian and non-Sámi characters as a popup list (like the Apple keyboards)
* clear separation between language-independent and language-specific code
* keyboard layout as an xml file (or something similar)
* we build for North Sámi now, but it should be easy to build for all our languages
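
To make the split between language-independent code and language-specific data concrete, here is a minimal sketch of a loader that reads a layout from xml. Nothing in it is decided: the element and attribute names ({{keyboard}}, {{row}}, {{key}}, {{char}}, {{longpress}}), the function {{load_layout}} and the key selection are invented for illustration, and Python is used only for brevity, not as a proposal for the keyboard code itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout data for one language. The schema is an assumption
# made up for this sketch, not an agreed file format.
LAYOUT_XML = """
<keyboard lang="sme">
  <row>
    <key char="á"/>
    <key char="š"/>
    <key char="o" longpress="ö ø"/>
    <key char="a" longpress="å ä æ"/>
  </row>
</keyboard>
"""


def load_layout(xml_text):
    """Language-independent loader: returns (language code, rows of keys).

    Each key is a pair (character, list of popup characters).
    """
    root = ET.fromstring(xml_text.strip())
    rows = []
    for row in root.findall("row"):
        keys = []
        for key in row.findall("key"):
            # A missing longpress attribute gives an empty popup list.
            popup = key.get("longpress", "").split()
            keys.append((key.get("char"), popup))
        rows.append(keys)
    return root.get("lang"), rows


lang, rows = load_layout(LAYOUT_XML)
for char, popup in rows[0]:
    print(lang, char, popup)
```

With this shape, adding a language would mean adding a data file, while the loader stays untouched.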

Possible future variants:
* swipe-inspired?
* (more advanced) use of hfst technology for spell checking and word completion


!!!preprocessing
Based on:
* hfst-pmatch? https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch
* hfst-ataq
* something else?

Possible issues with hfst-pmatch:
* char-by-char processing? Just like any other fst: state-by-state
* processing of formatting? It can deal with any text - as long as the formatting is expressed as text (in-stream markup) it should be no problem
* speed? Not measured yet, but it should run at ordinary fst speed
* compilation speed? Unknown until we try

Tommi: With pmatch you cannot get a tokeniser that analyses as it goes and keeps ambiguous readings in the middle of the string; if "in order to" is matched as the left-to-right longest match (lrlm), the pmatch applicator will not also give you "in" "order" "to".

Sjur: Can this be changed in the pmatch code to collect all paths up until a common tokenisation point?

Tommi: Wouldn't it in the end be just as much work as rewriting from scratch and probably harder? Like, using pmatch for this with these specs is like having a hammer and trying very hard to use it on screws cause they kind of look like a nail.

See [http://www.stanford.edu/~laurik/publications/pmatch] for details on how to
use (hfst-)pmatch.

!!!work for Mike

Mike to try out hfst-pmatch for a month, then we evaluate the feasibility of hfst-pmatch as an analysing tokeniser.

!!! wishlist for tokeniser

* have whitespace in the middle of words, e.g. \n and softhyphen
* string: lettersequence - whitespacesequence - othersequence - ...
* LR longest match for token-sharing boundaries
* within the token, all the analyses
* input: 12345678901235463; possible tokenisations:
** 12 34 5678 90 12 35 463
** 123 45678 90 12 35 463
*** ^12345678/12+34+5678/123+45678$ ^90/90$ ^12/12$ ...
*** thus: get both tokenisations between 1 and 8. then analyse
* input: "the cat's mother, in order to", possible tokenisations:
** the cat 's mother, in order to
** the cat's mother, in order to
*** {{^the/the$ ^cat's/cat+'s/cat's$ ^mother/mother$^,/,$ ^in order to/in+order+to/in order to$}}

Two possible tokenisations:
{{{
"<in order to>"
    "in order to" pr

"<in>"
    "in" pr

"<order>"
    "order" vblex pres
    "order" n sg 

"<to>"
    "to" pr
}}}
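
The wishlist items "LR longest match for token-sharing boundaries" and "within the token, all the analyses" can be sketched as below. This is a toy illustration, not hfst-pmatch and not a proposed implementation: the lexicon is a plain set of strings, and the names {{LEXICON}}, {{segmentations}}, {{tokenise}} and {{render}} are made up for the example. The output uses the same ^surface/reading/reading$ notation as the wishlist.

```python
# Toy lexicon standing in for a real analyser (an assumption of this sketch).
LEXICON = {"the", "cat", "'s", "cat's", "mother", ",",
           "in", "order", "to", "in order to"}


def segmentations(text, lexicon):
    """Return every split of text into lexicon entries.

    Spaces between the inner tokens are skipped.
    """
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            rest = text[end:].lstrip(" ")
            for tail in segmentations(rest, lexicon):
                results.append([head] + tail)
    return results


def tokenise(text, lexicon):
    """Left-to-right longest match decides the outer units.

    Inside each unit every possible segmentation is kept.
    """
    units = []
    pos = 0
    while pos < len(text):
        if text[pos] == " ":
            pos += 1
            continue
        # End position of the longest lexicon entry starting at pos, if any.
        end = max((e for e in range(pos + 1, len(text) + 1)
                   if text[pos:e] in lexicon), default=None)
        if end is None:
            # Unknown character: emit it as a unit of its own.
            units.append((text[pos], [[text[pos]]]))
            pos += 1
            continue
        surface = text[pos:end]
        units.append((surface, segmentations(surface, lexicon)))
        pos = end
    return units


def render(units):
    """Format the units as ^surface/reading/reading$ ..."""
    return " ".join(
        "^" + surface + "/" + "/".join("+".join(seg) for seg in segs) + "$"
        for surface, segs in units)


print(render(tokenise("the cat's mother, in order to", LEXICON)))
```

The longest match fixes the boundaries that all tokenisations share, so "in order to" is one unit, while the shorter path in+order+to survives as an alternative reading inside it.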

* output an ambiguous lattice ?
* do backoff automata ? e.g. analyser -> regex -> unicode database
* Sane handling for Finnic(?) coordinated compounds with hanging hyphen:
** ”koira- ja kissajuttu” ?= koira+juttu ja kissa+juttu
** it'd be neat if hyphenated words were not in morph. analyser.. maybe
* Case mangling:
** "Thing" -> thing
** an tAerfort -> an t+aerfort

Re unicode regexes: "You can match a single character belonging to the "letter" category with \p{L}. You can match a single character not belonging to that category with \P{L}." See [http://www.regular-expressions.info/unicode.html] for details.

Which tools support Unicode regexes? pcre? Yes, I believe so. Any decent and recent programming language with proper ICU-based Unicode support :)
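
As far as we know, the {{re}} module in Python's standard library does not accept \p{L}, while the third-party {{regex}} module does. The sketch below therefore gets the same effect from the Unicode database directly and uses it to split a string into lettersequence - whitespacesequence - othersequence runs, as in the tokeniser wishlist. The function names are made up for the example, and soft hyphens and other special cases are ignored.

```python
import unicodedata
from itertools import groupby


def char_class(ch):
    """Classify one character, mirroring what \\p{L} would test."""
    if unicodedata.category(ch).startswith("L"):
        return "letter"
    if ch.isspace():
        return "whitespace"
    return "other"


def split_runs(text):
    """Split text into maximal runs of characters of the same class."""
    return [(cls, "".join(chars))
            for cls, chars in groupby(text, key=char_class)]


for cls, run in split_runs("\u010d\u00e1hci, 12"):
    print(cls, repr(run))
```

Because the letter test comes from the general category rather than from an ASCII range, Sámi letters such as č and á land in the letter runs without any language-specific list.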