Neural Network

11.01.2021

Linda, Tommi, Chiara, Mika


* Tommis script to split - already splitting better than before
* Mika has been waiting for more data
* Mika: how easy is it to do part-of-speech tagger in sme? - it may improve the results
* Add POS tags to json files:

{
            "correct": "vuosttašriegádanvuoigatvuođa",
            "error": [
                "vuosttaža",
                "riegádeapmi",
                "vuoigatvuođa"
            ],
            "pos" :  [
                "A",
                "N",
                "N"
            ]
},

Plan:
* Tommi runs the script on the whole corpus to put it into the json format - push it to github (possilby split into several files)
* Linda will search for bugs once more
* Mika will  produce the data for the NMT models from the JSON
* count on crashes (several ones) on CSC when training the data
* non-optimal results -> we would have to make some hand-produced 
* what we wanna do is filter out potential compound errors by means of the rule-based handwritten grammarchecker - filter out the following tag: &msyn-compound
* what corpus should we leave for evaluation? - 30% or 15% - we should have a count of the corpus
* select the folders to split out on Tuseday


Plan for the article:
* is there a sample article:
** Introduction
** related work
** datasection
** method section - two methods - rule-based approach vs. explain neural networks
** results and conclusion


Bidirectional LSTM Tagger for Latvian Grammatical Error Detection - https://doi.org/10.1007/978-3-030-27947-9_5

Grammar Error Correction in Morphologically Rich Languages: The Case of Russian
https://www.aclweb.org/anthology/Q19-1001.pdf