Temporary directory for sharing information on data quality.
This is particularly relevant for Ciprian, Lene and Børre.

/////////////////////////
1. Issue: sentence length
/////////////////////////

Total sentence number:        1879892
Single-word sentence number:   354320
Two-word sentence number:      116424
=====================================
Adjusted sentence number:     1409148

==> 25% of the sentences in the current corpus are single- or two-word
    sentences, which are effectively useless!

How to use the data: grep, sort, uniq ==> the usual way!

Example: What are the one-word sentences?

grep -h '
13768
1191
892
609
209
169
72
61
38
36
25
25
24
24
24
22
22
20
20
18
18
17
---------------------------------------------------------------

/////////////////////////
2. Issue: word length
/////////////////////////

- words at least 50 characters long: 6206

000_new_step>awk '{if (length($1)>=50) print $1}' sme_corpus_20131122.vrt|c
6206

Examples:

http://www.plappi.fi/files/orig/4005_OHCEJOGA%20%20GIELDDA%20%20OAHPAHUSDOAIMMA.doc
0F242E3B3B20223B3C203F75203B23262E226D72292A3D276F2E73'));//--
186_609_186_789_186_478_186_479_186_179_186_613_186_482_186_910_186_626_186_483
186_327_186_427_186_306_186_709_186_857_186_859_186_860_186_205_186_862
186_702_186_700_186_707_186_703_186_740_186_739_186_742_186_701_186_050_186_743
186_222_186_878_186_876_186_752_186_474_186_877_186_879_186_880_186_881
186_615_186_137_186_616_186_677_186_170_186_170_186_174_186_211_186_950_186_235_186_712_186_912_186_891
186_785_186_017_186_049_186_018_186_002_186_644_186_814
www.samisk.no/dokumenter/Prošeaktagovus_aajege.doc
http://www.hivand.no/2008/11/25/gamvik-kommune-a-stjele-en-skole
='/upload/system/emailimages/0b3b99a82b59cf00290a4ed88378dd02.
F6A4B606A7F7F64667F78647B31647F67626A6629366D6E79632B6A37'));//--
http://www.regjeringen.no/nb/dep/lmd/dok/Horinger/Horingsdokumenter/2007/Horing--Ny-lov-om-dyrevelferd
15000_15001_15002_15003_15004_15005_15006_15007_15008_15009
15300_15301_15302_15303_15304_15305_15306_15307_15308
7_625_1_673_5_735_1_296_1_520_1_225_11_800_325_1_650
9_447_2_273_6_245_296_1_680_1_025_13_900_4_325_1_950
9_447_2_273_6_245_296_1_680_1_025_13_900_4_325_1_950
10_157_2_473_6_434_306_1_730_1_225_14_318_4_685_1_813

==> The top hit: a 'word' with 648 characters

0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000
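
To see how long the over-long 'words' from issue 2 actually are, the awk check can be extended to print each length and sort the hits. This is only a sketch: long_tokens is a hypothetical helper name, and it assumes one token per line with the word form in the first column. Depending on the awk implementation and locale, length() may count bytes rather than characters, which matters for non-ASCII word forms.

```shell
# Sketch: list tokens of at least MIN characters, longest first.
# Assumption: one token per line, word form in the first column.
# long_tokens is a hypothetical helper, not an existing tool.
long_tokens() {
    min=$1     # minimum token length to report
    file=$2    # corpus file to scan
    # Print "<length> <token>" for every first-column value that is long enough,
    # then sort by length (descending) and by token as a tie-breaker.
    awk -v min="$min" 'length($1) >= min { print length($1), $1 }' "$file" |
        sort -k1,1nr -k2,2
}
```

Possible usage: long_tokens 50 sme_corpus_20131122.vrt | head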
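
The sentence-length counts from issue 1 could be reproduced along the following lines. This is a sketch under assumptions, not the command that produced the numbers above: it assumes that sentences in the .vrt file are delimited by <s ...> and </s> lines, that every other line not starting with '<' is one token, and sentence_lengths is a hypothetical helper name.

```shell
# Sketch: histogram of sentence lengths (in tokens).
# Assumptions: sentences are wrapped in <s ...> / </s> lines and each
# remaining non-tag, non-empty line holds exactly one token.
# sentence_lengths is a hypothetical helper, not an existing tool.
sentence_lengths() {
    awk '
        BEGIN      { n = 0 }
        /^<s[ >]/  { n = 0; next }        # sentence start: reset the token counter
        /^<\/s>/   { count[n]++; next }   # sentence end: record its length
        /^</       { next }               # skip any other structural tag
        NF > 0     { n++ }                # every other non-empty line is a token
        END        { for (len in count) print len, count[len] }
    ' "$1" | sort -k1,1n                  # output: "<length> <number of sentences>"
}
```

Possible usage: sentence_lengths sme_corpus_20131122.vrt | head -n 2 shows the number of one-word and two-word sentences, if any exist.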