Texts from various domains should be tested; sme/corp/ mainly contains administrative texts (and the New Testament):
---------------------------------------------------
nisson_ovddasteapmi.txt
Test 6    Wftot  Wf-tkn  %-recall   Tytot  Wf-typ  %-recall
040903    38360   35704    93.0 %   20660   19102    92.4 %
---------------------------------------------------
hjh-nod1iid.txt
Test 5    Wftot  Wf-tkn  %-recall   Tytot  Wf-typ  %-recall
040903     1580    1532    96.7 %     683     636    93.1 %
---------------------------------------------------
sd-divas-2002-{1,2}.txt
Test 4    Wftot  Wf-tkn  %-recall   Tytot  Wf-typ  %-recall
041005    32835   31834    96.9 %    6664    6054    90.8 %
040913    32883   31255    95.0 %    6759    5856    86.6 %
---------------------------------------------------
sd-divas-2001-1.txt
Test 4    Wftot  Wf-tkn  %-recall   Tytot  Wf-typ  %-recall
040903    60522   58549    96.7 %    8610    7610    88.4 %
040329    62459   60159    95.3 %    8496    7406    87.2 %
---------------------------------------------------
handlingsplan_samisk.txt
Test 3    Wftot  Wf-tkn  %-recall   Tytot  Wf-typ  %-recall
031120     2148    2053    95.6 %    1044     984    94.3 %
040329     2461    2389    97.1 %     955     898    94.0 %  (new preprocessor)
---------------------------------------------------
Test 2        Wftot  Wf-tkn  %-recall   Tytot  Wf-typ  %-recall
Collection   225355                     32467                    (test closed)
020815               203080    90.1 %           22721    70.0 %
020918               204315    90.7 %           22956    70.7 %
030210       227062~214845     94.6 %   31474~24398      77.5 %
---------------------------------------------------
Test 1         Wf-tokens  %-recall  Wf-types  %-recall
New Testament     139681               14888            (test closed)
011110             36471    26.1 %      4983    33.5 %
011116             36980    26.5 %      5050    33.9 %
011214             37736    27.0 %      5177    34.8 %
011218             40741    29.2 %      5955    40.0 %  (closed classes added)
020129            126765    90.6 %     11676    78.4 %  (proper names added)
020205            128702    92.1 %     12340    82.9 %
020206            129857    92.9 %     12328    82.8 %  (nom+nom compound)
020207            131846    94.4 %     12500    84.0 %
020212            132394    94.8 %     12621    84.8 %
020213            132878    95.1 %     12652    85.0 %
020217            132993    95.2 %     12674    85.1 %
020306            133791    95.8 %     12850    86.4 %
020307            133821    95.8 %     12878    86.5 %
020318            134042    95.9 %     12914    86.7 %
020321            135446    97.0 %     13292    89.3 %
020323            136120    97.5 %     13373    89.8 %
020404            136621    97.8 %     13524    90.8 %
020410            136974    98.1 %     13609    91.4 %
020417            137435    98.4 %     13762    92.4 %
020418            137977    98.8 %     13875    93.2 %
020423            138101    98.9 %     13964    93.8 %
021104            138254    99.0 %     14003    94.1 %
---------------------------------------------------
Each text has its own section in the table. The sections are ordered chronologically, with the oldest test case (Test 1) at the bottom. The first line of each section gives the name of the file (note: the files of test cases 2 and 3 have changed so much that these two test cases are closed). Each following line represents one test run. The first column gives the test date (in the format yymmdd), the second (Wftot) the total number of words in the file in question, the third (Wf-tkn) the number of recognised word form tokens, and the fourth (%-recall) that number as a percentage of the total. The remaining columns give the same figures for word form types (cf. below for the commands used to calculate the numbers).
-------------------------------------------------------------------------
Wftot:
  cat filename | preprocess --abbr=bin/abbr.txt | wc -l

Non_recognised_wf:
  cat filename | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT bin/sme.fst | grep '\?' | grep -v CLB | wc -l

Wf-tkn   = Wftot - Non_recognised_wf
%-recall = Wf-tkn * 100 / Wftot
-------------------------------------------------------------------------
Tytot (Total number of wordform types):
  cat filename | preprocess --abbr=bin/abbr.txt | sort | uniq | wc -l

Non_recognised_wt (Number of non-analysed wordform types):
  cat filename | preprocess --abbr=bin/abbr.txt | sort | uniq | lookup -flags mbTT bin/sme.fst | grep '\?' | grep -v CLB | wc -l

Wf-typ (Number of recognised wordform types):
Wf-typ   = Tytot - Non_recognised_wt
%-recall = Wf-typ * 100 / Tytot
-------------------------------------------------------------------------
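The arithmetic in the two formulas above can be wrapped in a small helper. The following is only a sketch: the function name recall and its two arguments are hypothetical, and the counts are assumed to come from the wc -l pipelines listed above. It rounds to one decimal, so the last digit may occasionally differ from a figure in the table.

```shell
#!/bin/sh
# recall TOTAL UNRECOGNISED   (hypothetical helper, not part of the toolset)
# Prints the recall percentage with one decimal, following
#   Wf-tkn   = Wftot - Non_recognised_wf
#   %-recall = Wf-tkn * 100 / Wftot
# The same formula applies to types (Tytot and Non_recognised_wt).
recall() {
    awk -v total="$1" -v missed="$2" 'BEGIN {
        # Guard against an empty file (division by zero).
        if (total <= 0) { print "n/a"; exit }
        printf "%.1f\n", (total - missed) * 100 / total
    }'
}

# Example: 2461 word forms in total, 72 of them not recognised,
# i.e. 2389 recognised tokens.
recall 2461 72
```

With the counts from the 040329 run of handlingsplan_samisk.txt (2461 tokens, 2389 recognised) this prints 97.1, matching the table entry.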
The CG pattern xy:xyy (biila:biilla) has been systematically tested, and approximately 6 patterns do not work. Cf. the bug file referred to above. TODO: Test all CG patterns.