!!!Corpus meeting 7.4.2011

Present: Berit Merete, Børre, Ciprian, Tomi, Trond

!!!Agenda

* Algorithm for dealing with scanning errors
* Sentence alignment
* Word alignment


!!!Algorithm for dealing with scanning errors

!!Finding the files


Analyse the main-language text morphologically. Then, for each file:

{{{
converted/$lang/catalogue/file.xml -- analyse the $lang nodes
Count the words the analyser does not recognise -- … | usme | grep '\?'
Register the missing/total ratio
Finally, sort the files by ratio and pick the worst ones
}}}
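
The ranking step could be sketched roughly as follows. This is only an illustration: the helper name {{rank_by_error_ratio}} and its input format (one line per file with path, total word count and missing word count) are assumptions, not part of the existing tools.

```shell
#!/bin/sh
# Sketch only: rank files by their missing/total ratio, worst first.
# Input (stdin): one line per file, "path total missing" (assumed format,
# paths without spaces).
# Output: "ratio path", sorted by descending ratio.
rank_by_error_ratio() {
    # Skip files with zero total words to avoid division by zero.
    awk '$2 > 0 { printf "%.4f %s\n", $3 / $2, $1 }' |
        LC_ALL=C sort -k1,1nr
}
```

For example, {{printf 'a.xml 1535 87\nb.xml 1703 53\n' | rank_by_error_ratio}} lists the hypothetical file {{a.xml}} first, since 87/1535 is the higher ratio.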

__Priority__: converted/sme/admin/

Tomi has tried this with one file. Command:

{{{
linecount=`ccat -l sme "$1" | preprocess | wc -l`
errors=`ccat -l sme "$1" | preprocess | /opt/sami/xerox/c-fsm/ix86-linux2.6-gcc3.4/bin/lookup -flags mbTT -utf8 ~/langtech/gt/sme/bin/sme.fst | grep '?' | wc -l`

echo "lines: $linecount / $errors"

0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
lines: 1535 / 87
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
lines: 1703 / 53
}}}


{{{
for i in $ANALYSED_DIR/$SMILANG*.ccat.txt
do
    time cat $i | $PREPROCESS 2> /dev/null | lookup -q -flags mbTT $GTHOME/gt/$SMILANG/bin/$SMILANG.fst | …
done
}}}

Outcome: a list of files, ordered by missing/total ratio.

__TODO:__

* File list as described, due this week. __Tomi__.

!!Finding a way to improve the files

* What has caused the high error rates?
* How can it be fixed?

!!Status quo for boundcorpus and freecorpus

* 1700 out of 52000 files in __boundcorpus__ still cause problems for convert2xml.pl
* 84 files out of 9276 in __freecorpus__ still cause problems for convert2xml.pl.
  Of these, 18 are in $lang/admin.
* Also, look at errors in nob/admin
* Look at errors in files with parallel versions.

List of the files under admin that were not converted:

{{{
freecorpus$ grep "Couldn't convert" tmp/*.log | grep admin | cut -f5 -d" "
/home/apache_corpus/freecorpus/orig/sme/admin/sd/other_files/dc_3_99.doc
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/guided-tour.html_id=313861
/home/apache_corpus/freecorpus/orig/sme/admin/depts/regjeringen.no/samisk.html_id=454913
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/sitemap.html_id=256029
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313733
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313744
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313795
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313850
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313851
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313855
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313857
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313865
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313868
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313883
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=426594
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/state-secretary-karl-eirik-schjott-peder.html_id=439605
/home/apache_corpus/freecorpus/orig/nno/admin/depts/regjeringen.no/statsrad-karl-eirik-schjott-pedersen.html_id=439605
/home/apache_corpus/freecorpus/orig/nob/admin/depts/regjeringen.no/statssekretar-karl-eirik-schjott-pederse.html_id=439605
}}}

Command for counting the files in the list:

{{{
freecorpus$ grep "Couldn't convert" tmp/*.log | grep admin | cut -f5 -d" " | wc -l
}}}
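
To see how the unconverted files are distributed over languages, the list can be grouped by the directory that follows {{orig/}} in each path. A minimal sketch, assuming paths of the form shown above; the function name {{count_by_lang}} is made up for illustration.

```shell
#!/bin/sh
# Sketch only: count paths per language code.
# Input (stdin): one path per line, containing ".../orig/<lang>/...".
# Output: "count lang", most frequent language first.
count_by_lang() {
    awk -F/ '
        {
            # The language code is the path component right after "orig".
            for (i = 1; i < NF; i++)
                if ($i == "orig") { n[$(i + 1)]++; break }
        }
        END { for (lang in n) print n[lang], lang }
    ' | LC_ALL=C sort -k1,1nr -k2,2
}
```

It can be appended to the listing pipeline in place of {{wc -l}}. On the 18 paths listed above it would report 14 for eng, 2 for sme and 1 each for nno and nob.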


The error in the {{eng}} files is trivial. The focus is now on the nob-sme pairs under admin.

__TODO:__

* Fix the (very few) remaining unconverted files (__Børre__).


!!!Sentence alignment


!!TCA2 version update

TCA2 used to work. At some point, while nobody was touching the code,
it stopped working because of the Java upgrade to 1.6.
Børre tried to fix it this autumn.

* Oct 5th 2006 -- the GUI does not work for Børre, Trond
* Sep 29th 2006 -- the GUI works for Børre, Trond

__TODO:__

* Get a newer version (__Trond__).

!!TCA2 installing for the rest of us

Postponed until the version question has been clarified.

!!Anchor list

Trond made an {{sme-anchor.fst}}:

{{{
{biegga} ?* |
{biekka} ?* |
{lássa} ?* |
{lása} ?* |
{viidni} ?* |
{viinni} ?* |
{vuitti} |
}}}

Run the corpus through the anchor FST, spot the gaps in the list, and fill them.

Split the anchor list in two:
# a domain-independent one
# a domain-dependent one

__TODO:__

* __Trond and Berit Merete__ to look at this.

!!Sentence length parameter

We need the sentence length ratio between the two languages.

__TODO:__

* [Ciprian to look into it|/tools/TCA2_parameters.html]
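
One possible way to estimate such a ratio from data is to divide the character counts of two parallel texts. The sketch below assumes that the parameter is a character-length ratio and that plain-text versions of both files are available; {{length_ratio}} is a hypothetical helper, not part of TCA2.

```shell
#!/bin/sh
# Sketch only: character-length ratio between two parallel plain-text files.
# Usage: length_ratio FILE_A FILE_B   -> prints chars(A) / chars(B)
# Note: wc -m counts characters according to the current locale; use a
# UTF-8 locale so that letters such as á and č count as one character.
length_ratio() {
    a=$(wc -m < "$1") || return 1
    b=$(wc -m < "$2") || return 1
    awk -v a="$a" -v b="$b" 'BEGIN {
        if (b + 0 == 0) exit 1          # avoid division by zero
        printf "%.2f\n", a / b
    }'
}
```

Running it over all nob-sme pairs and averaging the results would give a first estimate to compare against whatever default TCA2 uses.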

!!Dice coefficient

* [Explanation here|http://en.wikipedia.org/wiki/Dice's_coefficient]
* [Implementations here|http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Dice's_coefficient]

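Wikipedia:
''The coefficient may be defined as twice the shared information (intersection) over the'' __combined set__ ''(union)''

Note that "combined set" here means the sum of the sizes of the two sets, not the size of their union: for sets X and Y the coefficient is twice the number of shared elements divided by (size of X + size of Y).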

Hofland and Johansson:

For English and Norwegian, a value of more than 0.7 or 0.8 gives reasonable results. 
For other languages, the acceptable value for the coefficient can be less. 
__The cognate parameter is also read by the program__.
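
As a concrete illustration, the coefficient for two words can be computed over their sets of character bigrams, a common choice for string comparison. This is a sketch under that assumption; how TCA2 itself compares cognate candidates is not stated here, and {{dice}} is a made-up function name.

```shell
#!/bin/sh
# Sketch only: Dice coefficient of two words over their character bigram sets,
#   2 * (shared bigrams) / (bigrams of a + bigrams of b)
# Note: substr() works on characters only if awk and the locale handle UTF-8;
# some awk implementations count bytes instead.
dice() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        # Collect the distinct bigrams of each word.
        for (i = 1; i < length(a); i++) { g = substr(a, i, 2); if (!(g in A)) { A[g] = 1; na++ } }
        for (i = 1; i < length(b); i++) { g = substr(b, i, 2); if (!(g in B)) { B[g] = 1; nb++ } }
        # Count the bigrams that occur in both words.
        for (g in A) if (g in B) shared++
        if (na + nb == 0) { print "0.00"; exit }   # both words shorter than 2
        printf "%.2f\n", 2 * shared / (na + nb)
    }'
}
```

{{dice night nacht}} prints {{0.25}}: the two words share one bigram ({{ht}}) out of four each.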

Question: Is there a parameter to be set here?

__TODO:__

* Discuss with Bergen (__Trond__)
* Follow-up (__all__)
** Find the sme:nob Dice coefficient threshold for cognates
** Find the minimum length for words to be considered

!!Preprocessing

Preprocessing is probably an important candidate for improving the alignment.

__TODO:__

* __Ciprian__ to look into it.

!!Alternatives to TCA2?

* Maligna
* Maca
* Others?

!!!Work ahead

* All: __Read the documentation__ and get a grasp of the overall picture.
* Tomi: Find the bad files and look into them + report (__this week__)
* Trond: Talk to Bergen (__this week__) + thereafter we install
* Berit, Trond: Anchor list
* Børre: Conversion
* Ciprian: various counting tasks

!!!Next (short) corpus meeting: 

* __Monday afternoon__, April 11th.