Notes on the corpus data after the two filtering steps:
- pseudo-words
- pseudo-sentences

The statistics in the file s_count_top_60.txt show the following:

- very many texts (= files) contain only a small number of sentences
  ==> the ten most frequent values are texts with 2-11 sentences:
   2894 v="5"
   2734 v="4"
   2674 v="6"
   2374 v="7"
   2229 v="3"
   2215 v="8"
   2044 v="9"
   1899 v="10"
   1757 v="11"
   1587 v="2"
- a lot of 1-sentence texts:
    953 v="1"
- and also quite a lot of 0-sentence texts (after filtering):
    106 v="0"

TODO:
- filter away the 0-sentence texts for the current update
  DONE AND TESTED
- also compare the number of tokens in the files with a low number of
  sentences, because a file with a single sentence could, in theory,
  contain 1000 tokens (see the sketch at the end of these notes)

Notes en passant:
- I found this pattern 450 times (only) in the min_aigi data:

sme_min_aigi_20140318.vrt:–<\q>Lea –<\q>Lea __UNDEF__ __UNDEF__ 1 X 0
sme_min_aigi_20140318.vrt:–<\q>Don –<\q>Don __UNDEF__ __UNDEF__ 1 X 0
sme_min_aigi_20140318.vrt:–<\q>Mii –<\q>Mii __UNDEF__ __UNDEF__ 1 X 0
sme_min_aigi_20140318.vrt:–<\q>Mus –<\q>Mus __UNDEF__ __UNDEF__ 1 X 0

korp_data_20140318>g '\–<\\q\>' *|cut -d ':' -f1|t
450 sme_min_aigi_20140318.vrt
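
Sketch for the token-count comparison mentioned in the TODO above. This is
only a minimal illustration, assuming the usual VRT layout where each
sentence opens with a <sentence ...> start tag on its own line and every
token occupies one non-tag line; the *.vrt glob and the output format are
illustrative, not taken from the existing scripts:

  # per-file sentence and token counts, assuming '<sentence' start tags
  # and one token per non-tag line
  for f in *.vrt; do
      s=$(grep -c '^<sentence' "$f")    # sentences = number of <sentence ...> tags
      t=$(grep -cv '^<' "$f")           # tokens = lines that are not structural tags
      printf '%s\t%s\t%s\n' "$s" "$t" "$f"
  done | sort -n                        # files with the fewest sentences first

Files showing a sentence count of 0 are the ones already filtered away in
the current update; the interesting cases are the low-sentence files whose
token count is unexpectedly high.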