12.01.2021
13:15-14:55
Chiara, Tommi, Linda, (Sjur)

* the first thing to do is to split the data
* gtfree.bash: takes the corpus, converts it and puts it in 2 files
* the same happens to the boundcorpus with gtbund.bash
* next is text2chars.py: ensures that words are aligned, and adds a space between characters
* data_generator.py: Mika's original version for splitting compounds
* compoundsplitgt.py: same as Mika's version, but works with our corpus
* Der/... Sem/... Ex/... Gram/...

TODO:
* make these scripts output JSON
* POS analysis
* Chiara checks Børre's pipeline to make sure we can use it, and then take the POS from there
* Deadline for applying to Sigma2: 22.02.2021.
  https://www.sigma2.no/call-e-infrastructure-resources-20211
  https://www.sigma2.no/how-apply-resources

Which corpus should we use for what:
1. training
2. validation 15%
3. evaluation 15%

admin, facta and science are the biggest in freecorpus
news is the biggest in boundcorpus

In total 6.8 GB, of which 5 GB are boundcorpus

2.8M  freecorpus/orig/sme/Len
1.1G  freecorpus/orig/sme/admin - training
8.5M  freecorpus/orig/sme/bible - training
4.5M  freecorpus/orig/sme/blogs
437M  freecorpus/orig/sme/facta - training
5.0M  freecorpus/orig/sme/ficti - eval
480K  freecorpus/orig/sme/grammar-realword
14M   freecorpus/orig/sme/laws - eval
420K  freecorpus/orig/sme/news - eval
40K   freecorpus/orig/sme/odda_mahppa
275M  freecorpus/orig/sme/science - eval
440K  freecorpus/orig/sme/speccorp
2.1M  freecorpus/orig/sme/wikipedia
1.8G  total

~1.6G training

Suggestion:
* leave boundcorpus untouched
* use only freecorpus: