Temporary directory for sharing information on data quality.
This is particularly relevant for Ciprian, Lene and Børre.
/////////////////////////
1. Issue: sentence length
/////////////////////////
Total sentence number: 1879892
Single-word sentence number: 354320
Two-word sentence number: 116424
=====================================
Adjusted sentence number: 1409148
==> 25% of the current corpus consists of one- or two-word sentences,
which are practically useless!
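The adjusted total and the 25% figure follow from the three counts above. A minimal sketch of the arithmetic in shell (the variable names are only illustrative):

```shell
# Counts copied from the figures above.
total=1879892
one_word=354320
two_word=116424

# Sentences left after dropping one- and two-word sentences.
adjusted=$((total - one_word - two_word))

# Share of short sentences in percent; awk does the floating-point division.
share=$(awk -v t="$total" -v a="$one_word" -v b="$two_word" \
    'BEGIN { printf "%.1f", 100 * (a + b) / t }')

echo "adjusted: $adjusted"
echo "short sentences: $share%"
```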
How to use the data: grep, sort, uniq ==> the usual way!
Example: What are the one-word sentences?
grep -h '
13768
1191
892
609
209
169
72
61
38
36
25
25
24
24
24
22
22
20
20
18
18
17
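A minimal sketch of one way to get such a ranking of one-word sentences straight from a VRT file. It is not necessarily the command used above. It assumes one token per line with the word form in the first tab-separated column, and sentences wrapped in <s>...</s> or <sentence>...</sentence> tag lines. The function name one_word_sentences is hypothetical.

```shell
# Hypothetical helper: list one-word sentences, most frequent first.
# Assumptions: one token per line, word form in column 1 (tab-separated),
# sentences delimited by <s>/<sentence> and </s>/</sentence> lines.
one_word_sentences() {
    awk -F'\t' '
        /^<(s|sentence)[ >]/  { n = 0; next }                   # sentence start: reset token counter
        /^<\/(s|sentence)>/   { if (n == 1) print word; next }  # sentence end: emit if exactly one token
        /^</                  { next }                          # skip any other structural tag
                              { n++; word = $1 }                # token line: count it, remember the word form
    ' "$1" | sort | uniq -c | sort -rn
}
```

Usage, if the corpus file follows these assumptions: one_word_sentences sme_corpus_20131122.vrt | head -n 20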
---------------------------------------------------------------
/////////////////////////
2. Issue: word length
/////////////////////////
- words at least 50 characters long: 6206
000_new_step>awk '{if (length($1)>=50) print $1}' sme_corpus_20131122.vrt|c
6206
Examples:
http://www.plappi.fi/files/orig/4005_OHCEJOGA%20%20GIELDDA%20%20OAHPAHUSDOAIMMA.doc
0F242E3B3B20223B3C203F75203B23262E226D72292A3D276F2E73'));//--
186_609_186_789_186_478_186_479_186_179_186_613_186_482_186_910_186_626_186_483
186_327_186_427_186_306_186_709_186_857_186_859_186_860_186_205_186_862
186_702_186_700_186_707_186_703_186_740_186_739_186_742_186_701_186_050_186_743
186_222_186_878_186_876_186_752_186_474_186_877_186_879_186_880_186_881
186_615_186_137_186_616_186_677_186_170_186_170_186_174_186_211_186_950_186_235_186_712_186_912_186_891
186_785_186_017_186_049_186_018_186_002_186_644_186_814
www.samisk.no/dokumenter/Prošeaktagovus_aajege.doc
http://www.hivand.no/2008/11/25/gamvik-kommune-a-stjele-en-skole
='/upload/system/emailimages/0b3b99a82b59cf00290a4ed88378dd02.
F6A4B606A7F7F64667F78647B31647F67626A6629366D6E79632B6A37'));//--
http://www.regjeringen.no/nb/dep/lmd/dok/Horinger/Horingsdokumenter/2007/Horing--Ny-lov-om-dyrevelferd
15000_15001_15002_15003_15004_15005_15006_15007_15008_15009
15300_15301_15302_15303_15304_15305_15306_15307_15308
7_625_1_673_5_735_1_296_1_520_1_225_11_800_325_1_650
9_447_2_273_6_245_296_1_680_1_025_13_900_4_325_1_950
9_447_2_273_6_245_296_1_680_1_025_13_900_4_325_1_950
10_157_2_473_6_434_306_1_730_1_225_14_318_4_685_1_813
==> The top hit: a 'word' with 648 characters
0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000_0000000000000000000000000000000000000000000000000000000000
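
The count of long 'words' and the length of the longest one can be obtained in a single pass. A minimal sketch, using the same test on the first field as the awk command above. The function name long_token_stats and its optional threshold argument are hypothetical. Note that length() counts bytes rather than characters in some awk implementations or in the C locale, so results for non-ASCII words may differ.

```shell
# Hypothetical helper: count first-field tokens with at least min characters
# (default 50) and report the length of the longest token seen.
long_token_stats() {
    awk -v min="${2:-50}" '
        {
            len = length($1)            # length of the first whitespace-separated field
            if (len >= min) count++     # same condition as the command above
            if (len > max) max = len    # track the longest token so far
        }
        END { printf "%d tokens >= %d chars, longest: %d\n", count, min, max }
    ' "$1"
}
```

Usage: long_token_stats sme_corpus_20131122.vrt 50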