!!!Corpus infra rework
This document is an informal description of the development of
the corpus remake project. It should serve as the basis for updating the information
in the web files on the gt/sd corpus.
!!!Building the nob2sme parallel corpus as done now (20120712)
# convert the original files into xml format:
{{{convert2xml.pl $GTFREE/orig}}}
# transfer the files deemed "good enough" from converted to prestable:
{{{find $GTFREE/converted/sme -name \*.xml -exec pick-parallel-docs.pl {} \;}}}
# build the actual parallel corpus
{{{find $GTFREE/prestable/converted/nob -name \*.xml -exec corpus-parallel.py -p sme {} \;}}}
or, if working in a directory other than the standard {{{$GTFREE}}}, use:
{{{GTFREE=/ABSOLUTE/PATH/TO/YOUR/WORKING/DIR find prestable/converted/nob -name \*.xml -exec corpus-parallel.py -p sme {} \;}}}
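The {{{find ... -exec}}} calls above run one script invocation per xml file. A minimal Python sketch of the same idea follows, in case the command list should be inspected or logged before anything is run. The helper name {{{build_commands}}} and the demo file names are made up for illustration; only {{{corpus-parallel.py -p sme}}} is taken from the notes above.

```python
import os
import tempfile

def build_commands(root, script, extra_args=()):
    """Return one command (as an argument list) per .xml file below root."""
    commands = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            if name.endswith(".xml"):
                commands.append([script, *extra_args, os.path.join(dirpath, name)])
    return commands

# Demo on a throw-away tree (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    nob = os.path.join(tmp, "prestable", "converted", "nob", "admin")
    os.makedirs(nob)
    for name in ("a.html.xml", "b.pdf.xml", "notes.txt"):
        open(os.path.join(nob, name), "w").close()
    demo_commands = build_commands(
        os.path.join(tmp, "prestable", "converted", "nob"),
        "corpus-parallel.py", ("-p", "sme"))
    demo_names = [os.path.basename(cmd[-1]) for cmd in demo_commands]
```

Each list in {{{demo_commands}}} could then be handed to {{{subprocess.run}}}, one file at a time, just as {{{find -exec ... \;}}} does.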
!! Some general notes on the current corpus conversion:
* typos are corrected at three different points
** during the xml conversion, by means of the xsl stylesheets
** during preprocessing, using the {{{preprocess}}} command
(see main/gt/script/langTools/parallelize.py, line 297ff)
** during the conversion to tokxml (???), by means of
pointers to files in the converted corpus containing a list of
typos + corrections
(According to __Børre__, this list was compiled using Hunspell, and
the corrections were added by __Børre__ and __Berit-Merete__)
* the occurrence of even one pseudo-sentence in a tca2 input file blocks the sentence alignment of the whole file
Ergo: pseudo-sentences have to be filtered out BEFORE sending the file pair to tca2
==> DONE
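The filtering step could look roughly like the sketch below. These notes do not define what counts as a pseudo-sentence, so the test used here (no alphabetic character at all) is purely an assumption for illustration, and the function names are hypothetical; the actual filter in the pipeline may work differently.

```python
def is_pseudo_sentence(sentence):
    """ASSUMED definition: a 'sentence' without a single alphabetic character."""
    return not any(ch.isalpha() for ch in sentence)

def filter_pseudo_sentences(sentences):
    """Keep only the sentences that would be passed on to the aligner."""
    return [s for s in sentences if not is_pseudo_sentence(s)]

# Made-up sample input: one real sentence and three candidates for removal.
sample = ["Dette er en setning.", "12 345 67,8", "---", "§ 3."]
kept = filter_pseudo_sentences(sample)
```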
!! Comparing the svn data in prestable/converted with the freshly converted data
* Note: there is a huge difference only in the nob dir
** only in the svn:
>grep '/Users/cipriangerstenberger/freecorpus/prestable/converted/nob' compare_files_prestable_converted_nob.txt | wc -l
227
** only locally generated
try_slange>grep 'Only in prestable/converted/nob' compare_files_prestable_converted_nob.txt | wc -l
54
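The two {{{grep | wc -l}}} counts can also be produced in one pass. The sketch below assumes that the comparison file holds {{{diff -r}}} style lines of the form "Only in DIR: NAME"; the root paths and file names in the demo are shortened, made-up stand-ins.

```python
from collections import Counter

def count_only_in(lines, roots):
    """Count 'Only in DIR: NAME' lines per root.

    A line is counted for the first root that equals DIR or is a parent of it.
    """
    counts = Counter()
    for line in lines:
        if not line.startswith("Only in "):
            continue
        directory = line[len("Only in "):].split(": ", 1)[0]
        for label, root in roots.items():
            if directory == root or directory.startswith(root + "/"):
                counts[label] += 1
                break
    return counts

# Made-up report lines in the assumed format.
report = [
    "Only in /svn/prestable/converted/nob/admin: a.xml",
    "Only in /svn/prestable/converted/nob/facta: b.xml",
    "Only in prestable/converted/nob/admin: c.xml",
    "Files /svn/x.xml and prestable/x.xml differ",
]
roots = {"svn": "/svn/prestable/converted/nob", "local": "prestable/converted/nob"}
only_in = count_only_in(report, roots)
```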
Question: Why?
==> random test:
converted/nob/admin/sd/samediggi.no/samediggi-article-3653.html.xml
converted/sme/admin/sd/samediggi.no/samediggi-article-3653.html.xml
_________
tmp/samediggi-article-3653.html.log
«/Users/cipriangerstenberger/my_cocon/orig/sme/admin/sd/samediggi.no/samediggi-article-3653.html»
100%^M0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Conversion failed: More than 5% of the content isn't analyzable /Users/cipriangerstenberger/my_cocon/orig/sme/admin/sd/samediggi.no/samediggi-article-3653.html
_________
Question: Does that mean that the xml conversion pipeline has gotten worse lately?
==> started the same conversion on victorio to check whether there are differences caused
by the OS. (By the way, the conversion on XServe gives the same result as locally on my mac.)
==> on victorio, the file under discussion gets converted into xml!!!
==> lesson learned: don't convert files on macOS!
* Note:
Unless the text is corrected, we can forget
orig/nob/admin/depts/other_files/Samiske_tall_forteller_3_NO.pdf
orig/sme/admin/depts/other_files/Samiske_tall_forteller_3_SAM.pdf
The nob version contains (at least) 2-3 pages more than
the sme one; one of them has a short description of the content in eng, nob, and sme.
This leads to a huge discrepancy between nob and sme, so the sentence alignment is useless, not
to mention the problems stemming from tables with numbers and statistics.
!!! Conversion to xml:
20120810:
1.
2596
2358
0
0
2358
2.
2420
2349
0
0
2349
___compare2former___
1. Non-converted files (20120712)
xml_conversion>report_nonconverted_files.sh ../xml_conversion | wc -l
985
2. Of these, 583 are relevant for the nob2sme corpus and 71 for the sme2nob corpus
!!!free parallel corpus nob2sme (state from 20120712)
1. Result of the last conversion and check of the nob-sme parallel files:
- there are still 4 nob-files with a dangling pointer to some ghostly sme parallel files
!!!free parallel corpus sme2nob (state from 20120712)
!!! Low-level fixes todo: private-use sign causing problems with sentence divider
grep -r ' ' converted
==> find the files (see the private-use_sign.log file) and use an xslt sheet to replace the sign with the empty string
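For a quick check outside the xslt sheets, private-use characters can also be stripped in Python. This is a swapped-in alternative, not the xslt approach named above: it drops every character whose Unicode general category is Co (private use). The function name and the sample string are made up.

```python
import unicodedata

def strip_private_use(text):
    """Drop every character in Unicode general category Co (private use)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Co")

# Made-up sample containing two private-use code points (U+F0B7 and U+E000).
sample = "Sámi\uf0b7 giella.\ue000"
cleaned = strip_private_use(sample)
```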
!!! TODO: nob2sme and sme2nob -- correcting parallelity pointer
!! nob2sme: 4 nob-files with a dangling pointer to some ghostly sme parallel files
1.
converted/nob/admin/depts/regjeringen.no/finnmarksloven.html_id=515308.xml
converted/sme/admin/depts/regjeringen.no/finnmarkkulahka.html_id=515308.xml
2.
converted/nob/admin/depts/regjeringen.no/spraktiltak.html_id=603613.xml
converted/sme/admin/depts/regjeringen.no/185-miljovdna-doarjjan-aarjel--ja-julevsami-gielladoaimmaide.html_id=603613.xml
3.
converted/nob/admin/depts/regjeringen.no/stotte-til-sor--og-lulesamisk-sprak.html_id=536716.xml
converted/sme/admin/depts/regjeringen.no/doarjja-lulli--ja-julevsamegielaide.html_id=536716.xml
4.
converted/nob/admin/sd/other_files/sp1991-1.pdf.xml
converted/sme/admin/sd/other_files/dc1991-1.pdf.xml
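A check for such dangling pointers can be sketched as follows. How the parallel pointer is stored in the metadata is not described in these notes, so the sketch simply assumes that the (source, target) pairs have already been extracted; the function name and the demo paths are hypothetical.

```python
import os
import tempfile

def dangling_pointers(pairs, root):
    """Return the (source, target) pairs whose target file is missing below root."""
    return [(src, tgt) for src, tgt in pairs
            if not os.path.isfile(os.path.join(root, tgt))]

# Demo on a throw-away tree: one pointer resolves, one does not.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "converted", "sme"))
    open(os.path.join(root, "converted", "sme", "present.xml"), "w").close()
    pairs = [
        ("converted/nob/present.xml", "converted/sme/present.xml"),
        ("converted/nob/orphan.xml", "converted/sme/ghost.xml"),
    ]
    dangling = dangling_pointers(pairs, root)
```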
!! all2all: overspecification about translation direction
converted/nob/facta/skuvlahistorja1/trygve_n.html.xml
___AND___
converted/sme/facta/skuvlahistorja1/trygve_s.html.xml
!! xxx2yyy: underspecification, i.e., there are correct pointers to parallel files but
no translation direction is specified in either file.
==> see bug 1392
!!! for sentence alignment (after conversion on victorio and filtering on mac: 20120717)
* Input
prestable>find converted/nob -name *.xml | wc -l
1628
prestable>find converted/sme -name *.xml | wc -l
1628
* Output
prestable>find toktmx/nob2sme -name *.toktmx | wc -l
1605
prestable>find toktmx/sme2nob/ -name *.toktmx | wc -l
23
* Comparison with the svn prestable
freecorpus>find prestable/toktmx/nob2sme -name *.toktmx | wc -l
1664
freecorpus>find prestable/toktmx/sme2nob -name *.toktmx | wc -l
27
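The counts above can be cross-checked: locally, 1605 + 23 = 1628 toktmx files, which matches the 1628 input files per language, while the svn prestable holds 1664 + 27 = 1691, i.e. 63 more. A small sketch for redoing such counts in Python follows; {{{count_files}}} is a hypothetical helper and the demo tree is made up.

```python
import tempfile
from pathlib import Path

def count_files(root, suffix):
    """Count regular files below root whose name ends with suffix."""
    return sum(1 for p in Path(root).rglob("*" + suffix) if p.is_file())

# Cross-check of the numbers reported in these notes.
local_total = 1605 + 23   # nob2sme + sme2nob, local run
svn_total = 1664 + 27     # nob2sme + sme2nob, svn prestable
input_pairs = 1628        # converted/nob and converted/sme hold 1628 files each

# Demo of count_files on a throw-away tree (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp, "toktmx", "nob2sme", "admin")
    out.mkdir(parents=True)
    for name in ("a.toktmx", "b.toktmx", "c.log"):
        (out / name).touch()
    demo_count = count_files(Path(tmp, "toktmx"), ".toktmx")
```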
!!!Some old yet still relevant notes:
Parallelity="false" means you have ungarbled sme metadata, but the sme file does not exist.
File without metadata is understandable, but metadata without file is fishy.
Problems with the file names?
Trond: yes: in my svn, all files with æøå names are listed twice marked as both unversioned and missing.
The files are in utf8, mac wants decomposed chars. I'll fix them, Sjur asked
me to do it yesterday (20100920)
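The composed/decomposed difference can be made visible with Python's {{{unicodedata}}} module. The sketch below uses the standard NFC and NFD forms; the normalization that the mac file system actually applies may deviate from plain NFD in details, so treat this as an approximation. The file name is made up.

```python
import unicodedata

def to_decomposed(name):
    """Return the canonically decomposed (NFD) form of a file name."""
    return unicodedata.normalize("NFD", name)

def to_composed(name):
    """Return the canonically composed (NFC) form of a file name."""
    return unicodedata.normalize("NFC", name)

composed = "s\u00e5mi.txt"            # 'å' as one code point
decomposed = to_decomposed(composed)  # 'a' followed by combining ring above
```

The two strings look identical on screen but compare as different, which is consistent with one file showing up under two names.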
2. Now, 20120712, there are no empty files in the remade corpus, but we have to pay attention to
the check before transferring the parallel files from converted to prestable.
There are two reasons not to transfer a file pair from the converted to the prestable directory:
2.1 Wrong ratio between the word counts in the file pair.
(victorio 20120712)
grep 'Wrong' pick_up_sme.log | wc -l
469
2.2 Too low wordcount
(victorio 20120712)
xml_conversion>grep 'Too' pick_up_sme.log | wc -l
241
As stressed in our old (but very good) notes, we should stick to the FICE principles:
find, identify and correct errors!
The script pick-parallel-docs.pl finds and identifies, but does NOT correct the errors.
We have to check why there are files with so few words after conversion (conversion errors of
some document parts, irrelevant files for our corpus?), and why the word ratio between nob and sme
is wrong (again, conversion errors of some document parts, irrelevant file pair?).
Note: The log file with the files not picked up to the prestable dir is
pick_up_sme.log
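The two rejection reasons can be illustrated with a small sketch. It is not a port of pick-parallel-docs.pl: the thresholds ({{{min_words}}}, {{{max_ratio}}}) are invented for illustration, and the returned strings merely echo the headings 2.1 and 2.2.

```python
def check_pair(words_a, words_b, min_words=30, max_ratio=3.0):
    """Return None if the pair may be transferred, else a reason string.

    min_words and max_ratio are HYPOTHETICAL values, not the ones used
    by pick-parallel-docs.pl.
    """
    smaller, larger = sorted((words_a, words_b))
    if smaller < max(min_words, 1):  # also guards the division below
        return "Too low wordcount"
    if larger / smaller > max_ratio:
        return "Wrong ratio"
    return None

ok = check_pair(1000, 900)
too_short = check_pair(1000, 10)
lopsided = check_pair(1000, 100)
```

A rejected pair still needs a human look: the check only finds and identifies, it does not correct.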
3. Are we sure all files in the old corpus repository have been moved to the new svn repository? Especially all goldstandard files, both bound and free.
!!Routines for new upload files
Overall question: is there a check of the content before starting the conversion step?
No. It is on the todo list.
Or rather, there is one to some extent: only doc, pdf, html and some other filetypes are accepted ...
The upload script also checks the size of the file and whether
it already exists (using md5sum)
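The checks just described (file type, size, duplicate via md5) could be sketched like this. The real upload script is not shown in these notes, so the size limit, the suffix list beyond doc, pdf and html, and all names below are assumptions.

```python
import hashlib

ACCEPTED_SUFFIXES = (".doc", ".pdf", ".html")  # "some other filetypes" left out
MAX_BYTES = 50 * 1024 * 1024                   # HYPOTHETICAL size limit

def check_upload(name, data, known_digests):
    """Return 'accepted' or a rejection reason; records the digest if accepted."""
    if not name.lower().endswith(ACCEPTED_SUFFIXES):
        return "rejected: file type"
    if not data or len(data) > MAX_BYTES:
        return "rejected: size"
    digest = hashlib.md5(data).hexdigest()
    if digest in known_digests:
        return "rejected: duplicate"
    known_digests.add(digest)
    return "accepted"

seen = set()
results = [
    check_upload("report.pdf", b"some content", seen),
    check_upload("copy-of-report.pdf", b"some content", seen),
    check_upload("movie.avi", b"other content", seen),
    check_upload("empty.doc", b"", seen),
]
```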
!!Status quo for upload directory as of today
{{{
~$ll /usr/local/share/corp/upload/ | egrep '(pdf|doc|txt)' | grep -v '\.x[sm]l'| wc -l
93
~$ll /usr/local/share/corp/upload/ | egrep '(pdf|doc|txt)' | grep -v '\.xsl'| wc -l
153
~$ll /usr/local/share/corp/upload/ | egrep '(pdf|doc|txt)' | grep '\.xsl'| wc -l
6
~$ll /usr/local/share/corp/upload/moved_already | egrep '(pdf|doc|txt)' | grep -v '\.x[sm]l'| wc -l
2
~$ll /usr/local/share/corp/upload/moved_already | egrep '(pdf|doc|txt)' | grep '\.x[sm]l'| wc -l
2
}}}
!!!Corpus synchronization between victorio and XServe
(based on Børre's email -- please correct if needed)
The corpus user has cron jobs going every night where files in both free- and
boundcorpus are converted and synced to the xserve. I made a new mailalias,
corpus_fanatics (where I and Ciprian are members), and the results of these
jobs are sent to us.
Cip has observed different numbers of files on the XServe and on his own machine. Why is
that? The files are copied / synchronized from victorio to the XServe. There is
probably no problem with this synchronization.
!!Conversion routine
What about checking the content of the uploaded files
before starting the conversion step?
I listed all converted files in freecorpus this way:
{{{find converted/ -name \*.xml | sort > xmls.txt}}}
(the number of converted files is indeed 685)
Then listed all .xsl files in orig this way:
{{{find orig/ -name \*.xsl | sort > xsls.txt}}} (the number of .xsl files is 1156)
After some search and replace of paths and file endings I made a diff to see
which files are not converted this way:
{{diff xsls.txt xmls.txt > diff.txt}}
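The "search and replace, then diff" step can also be done with a set lookup. The mapping used below (orig/X.xsl corresponds to converted/X.xml) is an assumption about the naming scheme, and the function name and demo paths are made up.

```python
def unconverted(xsl_paths, xml_paths):
    """Return the .xsl files that have no matching converted .xml file.

    ASSUMED mapping: orig/X.xsl corresponds to converted/X.xml.
    """
    converted = set(xml_paths)
    missing = []
    for xsl in xsl_paths:
        rel = xsl[len("orig/"):] if xsl.startswith("orig/") else xsl
        if rel.endswith(".xsl"):
            rel = rel[:-len(".xsl")]
        if "converted/" + rel + ".xml" not in converted:
            missing.append(xsl)
    return sorted(missing)

# Made-up lists standing in for xsls.txt and xmls.txt.
xsls = ["orig/sme/admin/a.html.xsl", "orig/sme/admin/b.pdf.xsl"]
xmls = ["converted/sme/admin/a.html.xml"]
not_converted = unconverted(xsls, xmls)
```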
!!What about directory structure?
* Cip&Trond's minimal constraint is: No mixed types in a corpus directory,
i.e., either only subdirs or only files!
* todo: create an "unclassified" or "other directory" for each dir that
currently has mixed content and move the files into that.
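Directories that violate the constraint can be listed with a short walk over the tree. A minimal sketch, with a hypothetical function name and a made-up demo tree; note that hidden entries such as {{{.svn}}} directories would count as subdirectories here and may need to be filtered out first.

```python
import os
import tempfile

def mixed_directories(root):
    """Return directories (relative to root) that contain both subdirs and files."""
    mixed = []
    for dirpath, dirnames, filenames in os.walk(root):
        if dirnames and filenames:
            mixed.append(os.path.relpath(dirpath, root))
    return sorted(mixed)

# Demo: 'admin' holds a file and a subdirectory, 'admin/sd' holds files only.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "admin", "sd"))
    open(os.path.join(root, "admin", "stray.pdf"), "w").close()
    open(os.path.join(root, "admin", "sd", "doc.pdf"), "w").close()
    mixed = mixed_directories(root)
```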
!!Grammatical analysis of the corpora - routines for error detecting and frequency
Possible errors:
* Files not included in the analysis (sma-news-dep.txt)
* Analysis cut before end of file
!!!Main issues
* find, identify and correct errors in metadata
* find, identify and correct errors in the conversion process
* find, identify and correct errors in the original corpus files
* directory structure
* Vic/XServe synchronisation
* work load division
!!!Priorities
# Issues wrt. the texts we have
# Issues wrt. priorities for new texts
!!!Work load division and deadlines
__TODO list__
* Establish routines for future upload files (__Børre__)
* change file names from composed to decomposed (__Børre__)
* Old corpus directory check:
** Empty the upload directory (__Børre__)
** Check that all old corpus files have been moved to the new svn repository
(__Ciprian, XXX__)
* Cronjob for converting gold standard files (__Børre__) (@cip: once again, gold standard files cannot simply be converted without a human check/manual correction! Otherwise we only get a fool's gold ("Katzengold") corpus)
* add metadata consistency check to convert2xml.pl, report issues to Bugzilla
(__Ciprian, thereafter all for evaluation__)
* clean all metadata (__all__)
__General principles:__
* Use Bugzilla
* Børre -> Ciprian: basically at different ends of the conversion process
* Communication via meetings and newsgroup discussions