26.09.2013

present:
* Sjur
* Linda

grammar checker project plan

0 intro
* working definition: errors that cannot be resolved by the spellchecker
* Excluding real word errors by default

1 done so far:

* error type classification
** lexical errors (&lex-majuscule)
** morphosyntactic errors (&msyn-inf_not_actio)
** syntactic errors (&syn-case_congruence)
** real-word errors (&real-vuosttaš)
** correct tags (&corr-not-compound)

* additional error types
** punctuation errors
** number formatting errors
** capitalisation errors

** specific syntactic grammar for the grammar checker philosophy: sme-gramdis.rle, rules are marked (REMOVE:GramPo)
** grammarchecker grammar: sme-gramchk.rle
** publication: "Constraint Grammar based Correction of Grammatical Errors for North Sámi", LREC 2012


2 todo:

* practical things:
** move SME (and GC) from old to new infrastructure
** meetings with Francis

* maintenance: 
** add/change/update semantic/syntactic tags

* work on things started:
** Duommá's 250 word list (compounds that lead to real word errors) - excluding real word errors by default
** rules for valency example sentences collected in gramchkcorpus.txt

* errors:
** find out which types of errors are most frequent
** error corpus - size?? other sources??
*** $GTFREE/goldstandard/orig/sme (xserve)
*** main/gt/sme/src/gramchk/gramchkcorpus.txt
* possible classes?

* presentation:
** sponsor-demonstrations
** release early/often (Open Source principles)
** we cannot make a Microsoft Office grammar checker - prohibited by MS - users can protest by writing to them ;) (we can only deliver to LibreOffice)
** look at a graphical grammar checker (Voikko - Finnish)
** http://wiki.apertium.org/wiki/Spellchecking

* rules:
** for real word errors: which semantic tags can be combined? - dálkkádat + rap + poarta
** bigrams and statistics for compounds?
** fix/annotate grammatical errors (compounds) already in
  preprocessing/tokenization/morphological analysis (i.e. treat space as a
  compound border for the relevant POS) (other ideas - Eckhard?)
** hfst-proc probably needs to be updated to give all analyses of potential
  compounding errors
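
The bigram question in the rules list above could be explored with a rough sketch like the following. It is only an illustration under stated assumptions: the function name, the `min_ratio` threshold and the toy tokens are hypothetical, and a real experiment would run on the error corpus rather than on an in-memory list.

```python
from collections import Counter

def split_compound_candidates(tokens, min_ratio=2.0):
    """Return adjacent word pairs whose joined form is clearly more frequent.

    Hypothetical heuristic: if "left right" occurs in the corpus but the
    joined form "leftright" is at least min_ratio times as frequent, the
    pair is a candidate for a compound written apart.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    candidates = []
    for (left, right), pair_count in bigrams.items():
        joined_count = unigrams[left + right]  # 0 if the joined form never occurs
        if joined_count >= min_ratio * pair_count:
            candidates.append((left, right, pair_count, joined_count))
    return candidates

# Toy English tokens, used only to show the mechanics.
toy_tokens = ["football"] * 5 + ["foot", "ball"]
print(split_compound_candidates(toy_tokens))
```

Such counts only give candidates; the CG rules would still have to decide whether a given pair is really an error in context.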

Compounding errors - compounds written as separate words:

{{{
[N Nom]         [N ...] ===== case error (Gen not Nom) / compounding error
[N Nom/N Gen]   [N ...] ===== 
[N Gen]         [N ...] ===== 
[N Nom+VR]      [N ...] ===== with vowel reduction (VR) - always an error
[N Nom/N Gen+VR][N ...] ===== --"--
[N Gen+VR]      [N ...] ===== --"--
}}}

VR = vowel reduction
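
As a reading aid, the pattern table above can be written out as a small decision function. This is a minimal sketch, not the actual rules in sme-gramchk.rle: the function name, argument names and returned labels are assumptions made for illustration, and the two rows the table leaves empty are returned as "undecided" rather than filled in.

```python
def classify_split_compound(first_readings, has_vr, second_is_noun=True):
    """Label a noun that is followed by a space and another noun.

    first_readings: set of case readings of the first noun, e.g. {"Nom"},
                    {"Gen"} or {"Nom", "Gen"} for an ambiguous form.
    has_vr:         True if the first noun shows vowel reduction (VR).
    All names and labels here are hypothetical.
    """
    if not second_is_noun:
        return "not applicable"
    if has_vr:
        # Rows 4-6 of the table: with vowel reduction it is always an error.
        return "error (vowel reduction)"
    if first_readings == {"Nom"}:
        # Row 1: unambiguous Nom in front of another noun.
        return "case error (Gen not Nom) / compounding error"
    # Rows 2-3: the table gives no verdict, so none is invented here.
    return "undecided"

print(classify_split_compound({"Nom"}, has_vr=False))
print(classify_split_compound({"Nom", "Gen"}, has_vr=True))
print(classify_split_compound({"Gen"}, has_vr=False))
```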

what is one word?
* spellchecker - space before and after
* tokenizer: 
** space as a possible character inside a compound (in the case
  [[N Nom] [[N ...] the error tag can be annotated right away)
** CG needs to clean up - disambiguate


* tools to be used:
** dependencies
** valencies
** semantic roles
** semantic prototypes

** evaluation:
*** precision and recall
*** how much has been resolved
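
For the evaluation point, precision and recall could be computed from a manually marked-up error corpus roughly as follows. This is a sketch using the standard definitions; the function name is hypothetical and the counts in the usage lines are made up purely to show the arithmetic.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Standard precision and recall for an error detector.

    true_positives:  real errors that the grammar checker flagged
    false_positives: correct text that was flagged as an error
    false_negatives: real errors that were not flagged
    """
    flagged = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall

# Made-up counts, for illustration only.
p, r = precision_recall(true_positives=40, false_positives=10, false_negatives=40)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.80 recall=0.50
```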