neural network for compound grammar checking
02.12.2020 15:00-16:

ToDo:
* what kind of corpus do we want to use for training and what for testing
* figure out the size of our corpus - Tommi/Linda
* make data available to Mika - Linda
* Tommi & Linda do the tutorial
* ask the CSC folks about access to the supercomputing servers
* train a model with what we have got - different types of sequences
* methods for the NN for grammar checking in articles - Mika
* Grammarly methods? - Mika
* figure out if there are Finnish corpora - Tommi
* introduce the idea of a workshop (CLARIN funding) - Linda

transfer learning - data in another language (maybe in Finnish), use that as the base model
* compounds
* detection and correction
* corpora
* calculate precision/recall/F-score

OCR - word segmentation errors
https://www.aclweb.org/anthology/L18-1113/

---

English word-level model -- less data, but cannot work with words it hasn't seen before
character-level model -- lots of data

what do you need?
* corpus in the form of incorrect line \n correct line or so
* generated / synthetic corpus data from the treebank - correct text turned into incorrect text
* every time you need context you need more data
* depends on how good the data is:
** no errors in the data
** very representative of the problem - the errors we are trying to correct need to be frequent in the corpus
** preparing the data is not fast, trying the neural network is fast

marked-up:
- "Mu árvalus {álgo heahttái}¥{álgoheahtái} livččii ahte skuvlla váldá okta Kárášjogagielddain jos doppe fidnešedje luoikkasin lašmmohallanlanja, dahje juo sáddet mánáid muhtomin Ohcejogagirkubáikke skuvlii, árvalastá Torikka."

* Russian/Latvian examples, have you seen them?

Two examples of neural network approaches are the systems for Latvian~\cite{Deksne2019} https://doi.org/10.1007/978-3-030-27947-9_5 and Russian~\cite{Rozovskaya2019}.
https://www.aclweb.org/anthology/Q19-1001.pdf
The evaluation of the Latvian neural network grammar checker shows good performance, with precision between 78\% and 98.5\%. However, judging from the regular expressions they use to insert artificial errors, most of their error types seem to be fairly local errors that can be resolved based on bigrams. The Russian system, on the other hand, focuses on more advanced error types including case and agreement. However, precision is significantly lower, between 22\% and 56\%; only gender agreement reaches 68\%. None of these approaches deals with the advanced syntactic constructions we resolve in our approach, which require an analysis of the whole sentence, valencies, semantic cues, etc.
%To our knowledge, there are not any neural network approaches to compound error detection either.

types of neural networks:
* recurrent neural network - sequence to sequence (sentence with errors > sentence with corrections)
* transformer - more advanced version

different types of sequences:
* sentence: I haveacat - I have a cat
* character-level: I _ h a v e _ a _ c a t
* bigram-chunk: I _ h a v e

are they time-consuming/fragile?
* faster on the GPU than the CPU
* run over the weekend
* use OpenNMT
* do you know other examples?

suggestion:
* based on our corpus
* what kind of corpus do you need
* where to publish
** ACL / EMNLP - abstract deadline (long & short papers): January 25, 2021

we would like to learn how to make/use these kinds of algorithms
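
A minimal sketch of the data preparation discussed above: a correct sentence is turned into a synthetic incorrect one by writing a compound apart, and both sides are encoded as character-level sequences with `_` marking spaces. All function names, the `compound_parts` lookup, and the English toy compound are assumptions made for illustration. In practice the compound boundaries would have to come from the treebank or an analyser, and punctuation attached to words is not handled here.

```python
SPACE = "_"  # visible space marker, as in "I _ h a v e _ a _ c a t"


def to_char_sequence(sentence):
    """Encode a sentence as space-separated character tokens.

    Assumes the text itself contains no literal "_" characters.
    """
    return " ".join(SPACE if ch == " " else ch for ch in sentence)


def from_char_sequence(tokens):
    """Decode a character-level sequence back into a plain sentence."""
    return "".join(" " if tok == SPACE else tok for tok in tokens.split(" "))


def split_compounds(sentence, compound_parts):
    """Write every known compound apart (correct text -> incorrect text).

    compound_parts is a hypothetical lookup: compound -> tuple of its parts.
    """
    words = sentence.split(" ")
    return " ".join(
        " ".join(compound_parts[w]) if w in compound_parts else w
        for w in words
    )


def make_training_pair(correct, compound_parts):
    """Return (source, target): the incorrect line and the correct line."""
    incorrect = split_compounds(correct, compound_parts)
    return to_char_sequence(incorrect), to_char_sequence(correct)


if __name__ == "__main__":
    # Toy English example; a real corpus would use sentences from the treebank.
    parts = {"toothbrush": ("tooth", "brush")}
    source, target = make_training_pair("I have a toothbrush", parts)
    # One pair per two lines: incorrect line \n correct line
    print(source)
    print(target)
```

Written out to two parallel files (one line per sentence, source and target), pairs like these are the kind of input a sequence-to-sequence toolkit expects. Switching to a word-level model would only change the encoding step.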