!!!Keyboards and preprocessing

Plans for the summer and autumn:
* keyboards for iOS8 and Android (Lavangen and India? Public call)
* preprocessing
* work for Mike

!!!Keyboards for iOS8 and Android (Lavangen and India? Public call)

Funding: the Divvun fund for extra initiatives

Schedule: ready for the public launch of iOS8 (rumoured: September). We aim for a beta on 15 September, with the finished version as soon as possible after that.

Design goals:
* as similar to Apple's keyboard as possible
* completion and correction suggestions from hfst (but perhaps only list-based in the first version, to get it released)
* Norwegian and non-Sámi characters as a popup list (as on the Apple keyboards)
* clear separation between language-independent and language-specific code
* keyboard layout as an XML file (or something similar)
* we build for North Sámi now, but it should be easy to build for all our languages

Possible future variants:
* swipe-inspired?
* (more advanced) use of hfst technology for spell checking and word completion

!!!Preprocessing

Based on:
* hfst-pmatch? https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch
* hfst-ataq
* something else?

Possible issues with hfst-pmatch:
* char-by-char processing? Just like any other fst: state-by-state
* processing of formatting? It can deal with any text. As long as the formatting is expressed as text (in-stream markup), it should be no problem
* speed? We don't know yet, but it should be fst speed in any case
* compilation speed? Unknown until we try

Tommi: With pmatch you cannot tokenise while you analyse and still get ambiguous readings in the middle of the string; if "in order to" is the left-to-right longest match (lrlm), the pmatch applicator will not also give "in" "order" "to".

Sjur: Can this be changed in the pmatch code to collect all paths up until a common tokenisation point?

Tommi: Wouldn't it in the end be just as much work as rewriting from scratch, and probably harder? Using pmatch for this with these specs is like having a hammer and trying very hard to use it on screws because they kind of look like nails.
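The lrlm objection in the discussion above can be illustrated with a small Python sketch. It does not use hfst-pmatch at all; the toy lexicon and the function names {{lrlm_tokenise}} and {{all_tokenisations}} are hypothetical, chosen only to show why a pure longest-match pass hides the "in" "order" "to" reading, while collecting all paths up to a common tokenisation point keeps it.

```python
# Toy lexicon: an assumption for illustration, not taken from any real analyser.
LEXICON = {"in", "order", "to", "in order to"}


def skip_spaces(text, pos):
    """Return the first offset at or after pos that is not a space."""
    while pos < len(text) and text[pos] == " ":
        pos += 1
    return pos


def matches_at(text, start):
    """Return sorted end offsets of lexicon entries that start at `start`
    and end at a token boundary (end of string or before a space)."""
    ends = []
    for entry in LEXICON:
        end = start + len(entry)
        if text.startswith(entry, start) and (end == len(text) or text[end] == " "):
            ends.append(end)
    return sorted(ends)


def lrlm_tokenise(text):
    """Left-to-right longest match: always take the longest entry."""
    tokens = []
    pos = skip_spaces(text, 0)
    while pos < len(text):
        ends = matches_at(text, pos)
        if not ends:
            raise ValueError(f"no lexicon entry at offset {pos}")
        end = ends[-1]  # the longest match wins, shorter ones are discarded
        tokens.append(text[pos:end])
        pos = skip_spaces(text, end)
    return tokens


def all_tokenisations(text, pos=0):
    """Enumerate every way of covering the text with lexicon entries."""
    pos = skip_spaces(text, pos)
    if pos == len(text):
        return [[]]
    results = []
    for end in matches_at(text, pos):
        for rest in all_tokenisations(text, end):
            results.append([text[pos:end]] + rest)
    return results


if __name__ == "__main__":
    print(lrlm_tokenise("in order to"))
    print(all_tokenisations("in order to"))
```

With this lexicon, {{lrlm_tokenise}} returns only the multiword token, whereas {{all_tokenisations}} returns both the three-token and the one-token segmentation, which is the kind of ambiguity the analysing tokeniser is expected to preserve.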
See [http://www.stanford.edu/~laurik/publications/pmatch] for details on how to use (hfst-)pmatch.

!!!Work for Mike

Mike will try out hfst-pmatch for a month; after that we evaluate the feasibility of hfst-pmatch as an analysing tokeniser.

!!!Wishlist for the tokeniser

* allow whitespace in the middle of words, e.g. \n and soft hyphen
* string: lettersequence - whitespacesequence - othersequence - ...
* LR longest match for token-sharing boundaries
* within the token, all the analyses
* input: 12345678901235463; possible tokenisations:
** 12 34 5678 90 12 35 463
** 123 45678 90 12 35 463
*** ^12345678/12+34+5678/123+45678$ ^90/90$ ^12/12$ ...
*** thus: get both tokenisations between 1 and 8, then analyse
* input: "the cat's mother, in order to"; possible tokenisations:
** the cat 's mother, in order to
** the cat's mother, in order to
*** {{^the/the$ ^cat's/cat+'s/cat's$ ^mother/mother$^,/,$ ^in order to/in+order+to/in order to$}}

Two possible tokenisations:
{{{
"<in order to>"
    "in order to" pr
"<in>"
    "in" pr
"<order>"
    "order" vblex pres
    "order" n sg
"<to>"
    "to" pr
}}}

* output an ambiguous lattice?
* use backoff automata? E.g. analyser -> regex -> Unicode database
* Sane handling of Finnic(?) coordinated compounds with a hanging hyphen:
** "koira- ja kissajuttu" ?= koira+juttu ja kissa+juttu
** it would be neat if hyphenated words did not have to be in the morphological analyser... maybe
* Case mangling:
** "Thing" -> thing
** an tAerfort -> an t+aerfort

Re Unicode regexes: "You can match a single character belonging to the "letter" category with \p{L}. You can match a single character not belonging to that category with \P{L}." See [http://www.regular-expressions.info/unicode.html] for details.

Which tools support Unicode regexes? pcre? Yes, I believe so. Any decent and recent programming language with proper ICU-based Unicode support :)
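The "lettersequence - whitespacesequence - othersequence" item from the wishlist, combined with the Unicode "letter" category mentioned above, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard {{unicodedata}} module rather than a regex engine, treats any general category starting with {{L}} as a letter (the same set that \p{L} describes), and the names {{char_class}} and {{segment}} are hypothetical.

```python
import unicodedata
from itertools import groupby


def char_class(ch):
    """Classify one character as 'letter', 'space' or 'other'.

    A character counts as a letter when its Unicode general category
    starts with "L" (Lu, Ll, Lt, Lm, Lo).
    """
    if unicodedata.category(ch).startswith("L"):
        return "letter"
    if ch.isspace():
        return "space"
    return "other"


def segment(text):
    """Split text into maximal runs of characters of the same class."""
    return [(cls, "".join(run)) for cls, run in groupby(text, key=char_class)]


if __name__ == "__main__":
    # "\u010d\u00e1hci" is the North Sámi word for water, written with escapes
    # so that the precomposed letters are unambiguous.
    for cls, run in segment("\u010d\u00e1hci, 123\n"):
        print(cls, repr(run))
```

Note that in this sketch a soft hyphen (U+00AD, general category Cf) falls into the "other" class and therefore splits a word; a real tokeniser following the wishlist would have to treat it, and line breaks inside words, as part of the token.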