!!!Corpus infra rework

This document is an informal description of the development of
the corpus remake project. It should serve as the basis for the info update
of the web files on the gt/sd corpus.


!!!Building the nob2sme parallel corpus as done now (20120712)

# convert the original files into xml format:
 {{{convert2xml.pl $GTFREE/orig}}}

# transfer the files deemed "good enough" from converted to prestable:
 {{{find $GTFREE/converted/sme -name \*.xml -exec pick-parallel-docs.pl {} \;}}}

# build the actual parallel corpus
 {{{find $GTFREE/prestable/converted/nob -name \*.xml -exec corpus-parallel.py -p sme {} \;}}} 

 or, if working in a directory other than the standard {{{$GTFREE}}}, use:
{{{GTFREE=/ABSOLUT/PATH/TO/YOUR/WORKING/DIR  find prestable/converted/nob -name \*.xml -exec corpus-parallel.py -p sme {} \;}}}


!! Some general notes on the current corpus conversion:
* the correction of typos is done at three different points
 ** during the xml conversion by means of the xsl stylesheets
 ** during the preprocessing using the {{{preprocess}}} command 
    (see main/gt/script/langTools/parallelize.py line 297ff)
 ** during the conversion to tokxml (???) by means of 
    pointers to files in the converted corpus containing a list of
    typos + corrections
    (According to ___Børre___, this list was compiled using Hunspell and
     the corrections were added by ___Børre___ and ___Berit-Merete___)

* the occurrence of at least one pseudo-sentence in a tca2 input file blocks the sentence alignment of the whole file
 Ergo: pseudo-sentences have to be filtered out BEFORE sending the file pair to tca2 
       ==> DONE 

!! Comparing the svn data in prestable/converted with the freshly converted ones

* Note: only in the nob dir there is a huge difference
 ** only in the svn: 
>grep '/Users/cipriangerstenberger/freecorpus/prestable/converted/nob' compare_files_prestable_converted_nob.txt | wc -l
     227
 ** only locally generated
try_slange>grep 'Only in prestable/converted/nob' compare_files_prestable_converted_nob.txt | wc -l 
      54

Question: Why?
 ==> random test: 
     	    converted/nob/admin/sd/samediggi.no/samediggi-article-3653.html.xml
     	    converted/sme/admin/sd/samediggi.no/samediggi-article-3653.html.xml
_________
tmp/samediggi-article-3653.html.log

«/Users/cipriangerstenberger/my_cocon/orig/sme/admin/sd/samediggi.no/samediggi-article-3653.html»
                                   100%^M0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Conversion failed: More than 5% of the content isn't analyzable /Users/cipriangerstenberger/my_cocon/orig/sme/admin/sd/samediggi.no/samediggi-article-3653.html
_________

Question: Does that mean that the xml conversion pipeline has been getting worse lately?
 ==> started the same conversion on victorio to check whether there are any differences caused
     by the OS. (By the way, the conversion on XServe gives the same result as locally on my mac.)

 ==> on victorio, the file under discussion gets converted into xml!!!
 ==> lesson learned: don't convert files on macOS!

* Note:

We can forget 

orig/nob/admin/depts/other_files/Samiske_tall_forteller_3_NO.pdf
orig/sme/admin/depts/other_files/Samiske_tall_forteller_3_SAM.pdf

unless the text is corrected: the nob version contains (at least) 2-3 pages more than
the sme one; one of them holds a short description of the content in eng, nob, and sme.
This leads to a huge discrepancy between nob and sme, so the sentence alignment is useless,
not to mention the problems stemming from tables with numbers and statistics.

!!! Conversion to xml:

20120810:
1.
<parallel_files dir="nob2sme" ok="2358" ko="238" conversion_error="234" no_orig_file="4">
   <summary>
      <non_empty_files>
         <nob>2596</nob>
         <sme>2358</sme>
      </non_empty_files>
      <empty_files>
         <nob>0</nob>
         <sme>0</sme>
      </empty_files>
      <useful_file_pairs>2358</useful_file_pairs>
   </summary>
</parallel_files>

2.
<parallel_files dir="sme2nob" ok="2349" ko="71" conversion_error="71" no_orig_file="0">
   <summary>
      <non_empty_files>
         <sme>2420</sme>
         <nob>2349</nob>
      </non_empty_files>
      <empty_files>
         <sme>0</sme>
         <nob>0</nob>
      </empty_files>
      <useful_file_pairs>2349</useful_file_pairs>
   </summary>
</parallel_files>
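
The counts in such a report can be cross-checked mechanically. The following is a minimal sketch, assuming the report uses exactly the attribute and element names shown above. The function name {{check_report}} is hypothetical, and the three consistency rules are inferred from the numbers in the reports, not taken from the script that writes them.

```python
import xml.etree.ElementTree as ET

# The first report from above, copied verbatim.
REPORT = """\
<parallel_files dir="nob2sme" ok="2358" ko="238" conversion_error="234" no_orig_file="4">
   <summary>
      <non_empty_files>
         <nob>2596</nob>
         <sme>2358</sme>
      </non_empty_files>
      <empty_files>
         <nob>0</nob>
         <sme>0</sme>
      </empty_files>
      <useful_file_pairs>2358</useful_file_pairs>
   </summary>
</parallel_files>
"""


def check_report(xml_text):
    """Return a list of inconsistencies in one report (empty if none)."""
    root = ET.fromstring(xml_text)
    # "nob2sme" -> source language "nob"
    source_lang = root.get("dir").split("2")[0]
    ok = int(root.get("ok"))
    ko = int(root.get("ko"))
    errors = int(root.get("conversion_error"))
    no_orig = int(root.get("no_orig_file"))
    source_files = int(root.findtext(f"summary/non_empty_files/{source_lang}"))
    useful = int(root.findtext("summary/useful_file_pairs"))

    problems = []
    # Inferred rule 1: every source file is either ok or ko.
    if ok + ko != source_files:
        problems.append(f"ok + ko = {ok + ko}, but {source_files} {source_lang} files")
    # Inferred rule 2: ko splits into conversion errors and missing originals.
    if errors + no_orig != ko:
        problems.append(f"conversion_error + no_orig_file = {errors + no_orig}, but ko = {ko}")
    # Inferred rule 3: the useful pairs are the ok pairs.
    if useful != ok:
        problems.append(f"useful_file_pairs = {useful}, but ok = {ok}")
    return problems


print(check_report(REPORT))
```

For the report above the list is empty, i.e., the three inferred rules hold.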


___compare2former___
1. Non-converted files (20120712)
xml_conversion>report_nonconverted_files.sh ../xml_conversion | wc -l
     985
2. Of these, 583 files are relevant for the nob2sme corpus and 71 for the sme2nob corpus


!!!free parallel corpus nob2sme (state as of 20120712)

1. Result of the last conversion and check of the nob-sme parallel files: 

<parallel_files dir="nob2sme" ok="2011" ko="583" conversion_error="579" no_orig_file="4">

 - there are still 4 nob-files with a dangling pointer to some ghostly sme parallel files


!!!free parallel corpus sme2nob (state as of 20120712)

<parallel_files dir="sme2nob" ok="2002" ko="71" conversion_error="71" no_orig_file="0">

!!! Low-level fixes todo: private-use sign causing problems with sentence divider
grep -r ' ' converted
 ==> find the files (see the private-use_sign.log file) and use an xslt stylesheet to replace the sign with the empty string
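
The same search and replacement can also be sketched outside xslt. The following Python sketch is an alternative to the stylesheet mentioned above, not the stylesheet itself. It assumes that "private-use sign" means any code point in the Unicode Private Use Area of the Basic Multilingual Plane (U+E000 to U+F8FF); the function names and the sample text are hypothetical.

```python
import re

# Assumption: the problematic sign lies in the BMP Private Use Area.
PRIVATE_USE = re.compile("[\ue000-\uf8ff]")


def find_private_use(text):
    """Return (line number, code point) for every private-use character."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in PRIVATE_USE.finditer(line):
            hits.append((lineno, f"U+{ord(match.group()):04X}"))
    return hits


def strip_private_use(text):
    """Replace every private-use character with the empty string."""
    return PRIVATE_USE.sub("", text)


# Hypothetical sample with one private-use character on the first line.
sample = "S\u00e1mediggi\uf0b7 lea\norg\u00e1na."
print(find_private_use(sample))
print(strip_private_use(sample))
```

Reporting the line number and code point first makes it possible to log the affected files before anything is changed.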


!!! TODO: nob2sme and sme2nob -- correcting parallelity pointer
!! nob2sme: 4 nob-files with a dangling pointer to some ghostly sme parallel files
 1.
         <h_loc>converted/nob/admin/depts/regjeringen.no/finnmarksloven.html_id=515308.xml</h_loc>
         <t_loc>converted/sme/admin/depts/regjeringen.no/finnmarkkulahka.html_id=515308.xml</t_loc>

 2.
         <h_loc>converted/nob/admin/depts/regjeringen.no/spraktiltak.html_id=603613.xml</h_loc>
         <t_loc>converted/sme/admin/depts/regjeringen.no/185-miljovdna-doarjjan-aarjel--ja-julevsami-gielladoaimmaide.html_id=603613.xml</t_loc>

 3. 
         <h_loc>converted/nob/admin/depts/regjeringen.no/stotte-til-sor--og-lulesamisk-sprak.html_id=536716.xml</h_loc>
         <t_loc>converted/sme/admin/depts/regjeringen.no/doarjja-lulli--ja-julevsamegielaide.html_id=536716.xml</t_loc>

 4. 
         <h_loc>converted/nob/admin/sd/other_files/sp1991-1.pdf.xml</h_loc>
         <t_loc>converted/sme/admin/sd/other_files/dc1991-1.pdf.xml</t_loc>

!! all2all: overspecification about translation direction

   converted/nob/facta/skuvlahistorja1/trygve_n.html.xml
      <translated_from xml:lang="sme"/>

		      ___AND___

   converted/sme/facta/skuvlahistorja1/trygve_s.html.xml
      <translated_from xml:lang="nob"/>
  

!! xxx2yyy: underspecification, i.e., there are correct pointers to parallel files but
   no translation direction specified in either file.
 ==> see bug 1392
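
Both cases, overspecification and underspecification, can be detected with the same check on a file pair. This is a minimal sketch. It assumes that the {{translated_from}} element sits somewhere below the root element and carries an {{xml:lang}} attribute as in the examples above; the surrounding {{document}}/{{header}} structure in the sample strings and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def translated_from(xml_text):
    """Return the language named in <translated_from>, or None if absent."""
    element = ET.fromstring(xml_text).find(".//translated_from")
    return None if element is None else element.get(XML_LANG)


def direction_status(lang_a, from_a, lang_b, from_b):
    """Classify the translation direction information of a file pair."""
    a_from_b = from_a == lang_b
    b_from_a = from_b == lang_a
    if a_from_b and b_from_a:
        return "overspecified"   # each file claims to be a translation of the other
    if not a_from_b and not b_from_a:
        return "underspecified"  # neither file states a direction
    return "ok"


# Hypothetical minimal documents mirroring the trygve_n/trygve_s case.
nob_doc = '<document><header><translated_from xml:lang="sme"/></header></document>'
sme_doc = '<document><header><translated_from xml:lang="nob"/></header></document>'

status = direction_status("nob", translated_from(nob_doc),
                          "sme", translated_from(sme_doc))
print(status)
```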

!!! for sentence alignment (after conversion on victorio and filtering on mac: 20120717)

* Input 
prestable>find converted/nob -name \*.xml | wc -l
    1628
prestable>find converted/sme -name \*.xml | wc -l
    1628

* Output
prestable>find toktmx/nob2sme -name \*.toktmx | wc -l
    1605
prestable>find toktmx/sme2nob/ -name \*.toktmx | wc -l
      23

* Comparison with the svn prestable

freecorpus>find prestable/toktmx/nob2sme -name \*.toktmx | wc -l
    1664
freecorpus>find prestable/toktmx/sme2nob -name \*.toktmx | wc -l
      27

!!!Some old yet still relevant notes:
 
Parallelity="false" means that there is ungarbled sme metadata, but the sme file does not exist.
A file without metadata is understandable, but metadata without a file is fishy.

Problems with the file names?
Trond: yes: in my svn, all files with æøå names are listed twice marked as both unversioned and missing.

The files are in utf8, mac wants decomposed chars. I'll fix them, Sjur asked
me to do it yesterday (20100920)
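
The difference between composed and decomposed file names can be shown with Unicode normalization. A minimal sketch using Python's standard {{unicodedata}} module; the helper names are hypothetical, and the example file name is a shortened form of one occurring in the corpus. Note that of the letters æ, ø and å only å has a canonical decomposition (a + combining ring above), so only names containing å or other decomposable letters such as á, š, č or ž actually change.

```python
import unicodedata


def decomposed(name):
    """Return the NFD (decomposed) form, as the mac file system expects it."""
    return unicodedata.normalize("NFD", name)


def composed(name):
    """Return the NFC (composed) form."""
    return unicodedata.normalize("NFC", name)


name = "spr\u00e5ktiltak.html"   # composed: one code point for the letter with ring
nfd = decomposed(name)

print(len(name), len(nfd))       # the decomposed form is one code point longer
print(nfd == name)               # the two forms do not compare as equal strings
print(composed(nfd) == name)     # but they normalize to the same composed form
```

Since the two forms are different strings, a file name stored in one form and looked up in the other shows up as two files, one unversioned and one missing.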

2. Now, 20120712, there are no empty files in the rebuilt corpus, but we have to pay attention to
   the check before transferring the parallel files from converted to prestable.
   There are two reasons not to transfer a file pair from the converted to the prestable directory:
 
 2.1 Wrong ratio between the word counts in the file pair.
(victorio 20120712)
grep 'Wrong' pick_up_sme.log | wc -l
469

 2.2 Too low wordcount
(victorio 20120712)
xml_conversion>grep 'Too' pick_up_sme.log | wc -l
241

As stressed in our old (but very good) notes, we should stick to the FICE principles:
find, identify and correct errors!

The script pick-parallel-docs.pl finds and identifies, but does NOT correct the errors.

We have to check why there are files with so few words after conversion (conversion errors of
some document parts, irrelevant files for our corpus?), and why the word ratio between nob and sme
is wrong (again, conversion errors of some document parts, irrelevant file pair?).

Note: The log file with the files not picked up to the prestable dir is
      pick_up_sme.log
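
The two rejection criteria can be sketched as follows. This is an illustration only, not the logic of pick-parallel-docs.pl: both thresholds and the function name are hypothetical, and the ratio is treated symmetrically here, although the real check may expect nob and sme texts to differ in length in one direction.

```python
MIN_WORDS = 30    # hypothetical threshold for "Too low wordcount"
MAX_RATIO = 1.5   # hypothetical threshold for "Wrong ratio"


def check_pair(words_a, words_b, min_words=MIN_WORDS, max_ratio=MAX_RATIO):
    """Decide whether a file pair may be moved from converted to prestable."""
    smaller, larger = sorted((words_a, words_b))
    # Guard against empty files, which would make the ratio undefined.
    if smaller < max(min_words, 1):
        return "Too low wordcount"
    if larger / smaller > max_ratio:
        return "Wrong ratio"
    return "ok"


print(check_pair(1000, 900))
print(check_pair(1000, 10))
print(check_pair(1000, 400))
```

Such a check only finds and identifies problem pairs. Correcting them still requires looking at the files listed in the log.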


3. Are we sure all files in the old corpus repository have been moved to the new svn repository? Especially all goldstandard files, both bound and free.

!!Routines for new upload files

Overall question: is the content checked before the conversion step starts?
No. It is on the todo list.

Or rather, it is checked in some way: only doc, pdf, html and some other file types are accepted.
The upload script also checks the size of the file and whether
it already exists (using md5sum).
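
The three upload checks described above (file type, size, duplicate detection by md5 sum) can be sketched like this. It is not the actual upload script: the list of accepted endings, the size limit, the direction of the size check and all names are assumptions made for the illustration.

```python
import hashlib
from pathlib import Path

# Assumption: the notes name doc, pdf and html plus "some other filetypes".
ACCEPTED = {".doc", ".pdf", ".html", ".txt"}
MAX_BYTES = 10 * 1024 * 1024   # hypothetical upper size limit


def md5_of(data):
    """Return the md5 hex digest of the file content."""
    return hashlib.md5(data).hexdigest()


def check_upload(filename, data, known_md5s):
    """Return 'accepted' or the reason why the upload is rejected."""
    if Path(filename).suffix.lower() not in ACCEPTED:
        return "rejected: file type"
    if not data or len(data) > MAX_BYTES:
        return "rejected: size"
    if md5_of(data) in known_md5s:
        return "rejected: duplicate"
    return "accepted"


known = {md5_of(b"already uploaded text")}
print(check_upload("report.pdf", b"new text", known))
print(check_upload("report.pdf", b"already uploaded text", known))
print(check_upload("report.exe", b"new text", known))
```

None of these checks looks at the content itself, which is why the content check remains on the todo list.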


!!Status quo for upload directory as of today

{{{
~$ll /usr/local/share/corp/upload/ | egrep '(pdf|doc|txt)' | grep -v '\.x[sm]l'| wc -l
93

~$ll /usr/local/share/corp/upload/ | egrep '(pdf|doc|txt)' | grep -v '\.xsl'| wc -l
153

~$ll /usr/local/share/corp/upload/ | egrep '(pdf|doc|txt)' | grep '\.xsl'| wc -l
6

~$ll /usr/local/share/corp/upload/moved_already | egrep '(pdf|doc|txt)' | grep -v '\.x[sm]l'| wc -l
2

~$ll /usr/local/share/corp/upload/moved_already | egrep '(pdf|doc|txt)' | grep '\.x[sm]l'| wc -l
2
}}}

!!!Corpus synchronization between victorio and XServe

(based on Børre's email -- please correct if needed)

The corpus user has cron jobs running every night in which files in both free- and
boundcorpus are converted and synced to the xserve. I made a new mail alias,
corpus_fanatics (of which Ciprian and I are members), and the results of these
jobs are sent to us.

Cip has observed different numbers of files on the xserve and on his own machine. Why is
that? The files are copied / synchronised from victorio to the xserve. There is
probably no problem with this synchronisation.

!!Conversion routine

What about checking the content of the uploaded files
before starting the conversion step?

I listed all converted files in freecorpus this way:
{{{find converted/ -name \*.xml | sort > xmls.txt}}}
(the number of converted files is indeed 685)

Then listed all .xsl files in orig this way:
{{{find orig/ -name \*.xsl | sort > xsls.txt}}}
(the number of .xsl files is 1156)

After some search and replace of paths and file endings I made a diff to see
which files are not converted:
{{{diff xsls.txt xmls.txt > diff.txt}}}
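
The manual search-and-replace plus diff can be sketched in one step. The sketch assumes that a metadata file {{orig/<path>.xsl}} corresponds to a converted file {{converted/<path>.xml}}, which matches the path pattern seen elsewhere in these notes but is not confirmed for every file; the function names and the second sample path are hypothetical.

```python
def key(path, root, suffix):
    """Strip the leading directory and the file ending so both lists are comparable."""
    if not (path.startswith(root + "/") and path.endswith(suffix)):
        raise ValueError(f"unexpected path: {path}")
    return path[len(root) + 1:-len(suffix)]


def not_converted(xsl_paths, xml_paths):
    """Return the .xsl files in orig that have no matching .xml file in converted."""
    converted = {key(p, "converted", ".xml") for p in xml_paths}
    return sorted(p for p in xsl_paths if key(p, "orig", ".xsl") not in converted)


# In practice the two lists would be read from xsls.txt and xmls.txt.
xsls = [
    "orig/sme/admin/sd/samediggi.no/samediggi-article-3653.html.xsl",
    "orig/sme/admin/sd/other_files/example.pdf.xsl",   # hypothetical path
]
xmls = [
    "converted/sme/admin/sd/samediggi.no/samediggi-article-3653.html.xml",
]

missing = not_converted(xsls, xmls)
print(missing)
```

Comparing sets of normalized keys avoids the dependence on identical sort order that a line-based diff has.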

!!What about directory structure? 

* Cip&Trond's  minimal constraint is: No mixed types in a corpus directory,
  i.e., either only subdirs or only files!
* todo: create an "unclassified" or "other directory" for each dir that
  currently has mixed content and move the files into that.
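
Directories violating the minimal constraint can be listed automatically. A minimal sketch, assuming that hidden entries such as {{.svn}} should not count as content; the function name and the small test tree built in a temporary directory are hypothetical.

```python
import os
import tempfile


def mixed_directories(root):
    """Return all directories below root that contain both subdirs and files."""
    mixed = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Assumption: hidden entries (.svn, .DS_Store) are ignored and not descended into.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        visible_files = [f for f in filenames if not f.startswith(".")]
        if dirnames and visible_files:
            mixed.append(dirpath)
    return sorted(mixed)


# Build a tiny tree: admin/ holds a file and a subdir, facta/ holds only a file.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "admin", "sd"))
    os.makedirs(os.path.join(root, "facta"))
    for parts in (("admin", "loose.xml"), ("admin", "sd", "a.xml"), ("facta", "b.xml")):
        open(os.path.join(root, *parts), "w").close()
    found = [os.path.relpath(d, root) for d in mixed_directories(root)]

print(found)
```

Each directory reported this way is a candidate for an "unclassified" subdirectory as proposed in the todo item above.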

!!Grammatical analysis of the corpora - routines for error detecting and frequency

Possible errors: 
* Files not included in the analysis (sma-news-dep.txt)
* Analysis cut before end of file

!!!Main issues

* find, identify and correct errors in metadata
* find, identify and correct errors in the conversion process
* find, identify and correct errors in the original corpus files
* directory structure
* Vic/XServe synchronisation
* work load division

!!!Priorities

# Issues wrt. the texts we have
# Issues wrt. priorities for new texts

!!!Work load division and deadlines

__TODO list__
* Establish routines for future upload files (__Børre__)
* change file names from composed to decomposed (__Børre__)
* Old corpus directory check:
** Empty the upload directory (__Børre__)
** Check that all old corpus files have been moved to the new svn repository
   (__Ciprian, XXX__)
* Cronjob for converting gold standard files (__Børre__) (@cip: once again, gold standard files cannot simply be converted without a human check/manual correction! Otherwise we only get a Katzengold (fool's gold) corpus)
* add metadata consistency check to convert2xml.pl, report issues to Bugzilla
  (__Ciprian, thereafter all for evaluation__)
* clean all metadata (__all__)

__General principles:__
* Use Bugzilla
* Børre -> Ciprian: basically at different ends of the conversion process
* Communication via meetings and newsgroup discussions