Dir for the copyright-free corpus files compiled at Giellatekno and Divvun.

Testing the validity of the corpus.

1. check whether each object file has exactly one meta file and vice-versa
   command: sh runObjMetaCheck.sh
   result in output file: obj-meta_check_DATE.txt

2. check whether the parallel file declared in a meta-file exists

3. ???

X. complex check after the XML-conversion:
   - parallelity
   - language flag
   - translation direction
 ==> todo: the scripts used for FAD-corpus testing have to be generalized
     for the whole corpus; for the parallelity check, this means for ALL
     possible language-pair combinations.
     Starting point: test_converted_corpus.xsl

========================
task: compile parallel corpora for sma and smj
========================

smj:
g -hr 'translated_from' .|awk '{print $3}'|tl
 222 select="'nob'"/>
     para_20131125>find toktmx/smj2nob -name "*.toktmx"|wc -l
     8
     para_20131125>find toktmx/nob2smj -name "*.toktmx"|wc -l
     47
 160 select="''"/>
   4 select="'swe'"/>
     para_20131125>find toktmx/smj2swe -name "*.toktmx"|wc -l
     1
     para_20131125>find toktmx/swe2smj -name "*.toktmx"|wc -l
     1
   3 select="'nno'"/>
     para_20131125>find toktmx/nno2smj -name "*.toktmx"|wc -l
     7
     para_20131125>find toktmx/smj2nno -name "*.toktmx"|wc -l
     4

sma:
  31 select="'nob'"/>
     nob2sma>find . -name "*.toktmx"|wc -l
     24
     para_20131125>find toktmx/sma2nob -name "*.toktmx"|wc -l
     4
  11 select="''"/>
   9 select="'nno'"/>
     para_20131125>find toktmx/sma2nno -name "*.toktmx"|wc -l
     2
     para_20131125>find toktmx/nno2sma -name "*.toktmx"|wc -l
     4
   2 select="'swe'"/>
     para_20131125>find toktmx/sma2swe -name "*.toktmx"|wc -l
     1
     para_20131125>find toktmx/swe2sma -name "*.toktmx"|wc -l
     1

==========================================================
ELAN and TEX files in the orig corpus

===== Intro =====

"spoken" is a new dir, created for (transcribed) spoken language data
stored at freecorpus by the Freiburg-based language documentation projects.
Naming this dir "spoken" is only a preliminary convention, because GT
prefers to make a genre distinction.
For more info see the readme file under ../sjd/spoken.
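
Note on validity check 1 (each object file has exactly one meta file and
vice-versa): the idea could be sketched roughly as below. This is not the
content of runObjMetaCheck.sh, which is not reproduced in this readme. The
sketch assumes, purely for illustration, that the meta file of an object file
NAME is called NAME.xsl and sits in the same directory; under that assumption
"exactly one" reduces to "the counterpart exists". The function name
check_obj_meta and the MISSING_META / ORPHAN_META labels are hypothetical.

```sh
#!/bin/sh
# Sketch of check 1: report object files without a meta file and
# meta files without an object file.
# ASSUMPTION (not stated in this readme): meta file = object file name + ".xsl".

check_obj_meta() {
    dir=$1
    # Walk all regular files below $dir in a stable order.
    find "$dir" -type f | sort | while IFS= read -r f; do
        case $f in
            *.xsl)
                # Meta file: the object file it describes must exist.
                [ -f "${f%.xsl}" ] || printf 'ORPHAN_META %s\n' "$f"
                ;;
            *)
                # Object file: its meta file must exist.
                [ -f "$f.xsl" ] || printf 'MISSING_META %s\n' "$f"
                ;;
        esac
    done
}
```

With the function loaded, calling check_obj_meta with a corpus directory as
its only argument prints one line per problem and nothing if the directory is
consistent; the output can be redirected to a file such as
obj-meta_check_DATE.txt. File names containing newlines are not handled.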