!!!Corpus infra rework

This document is an informal description of the development of the corpus remake project. It should serve as the basis for updating the information in the web files on the gt/sd corpus.

!!!Building the nob2sme parallel corpus as done now (20120712)

# convert the original files into xml format: {{{convert2xml.pl $GTFREE/orig}}}
# transfer the files deemed "good enough" from converted to prestable: {{{find $GTFREE/converted/sme -name \*.xml -exec pick-parallel-docs.pl {} \;}}}
# build the actual parallel corpus: {{{find $GTFREE/prestable/converted/nob -name \*.xml -exec corpus-parallel.py -p sme {} \;}}} or, if working in a directory other than the standard {{{$GTFREE}}}, use: {{{GTFREE=/ABSOLUTE/PATH/TO/YOUR/WORKING/DIR find prestable/converted/nob -name \*.xml -exec corpus-parallel.py -p sme {} \;}}}

!! Some general notes on the current corpus conversion:

* the correction of typos is done at three different points
** during the xml conversion by means of the xsl stylesheets
** during the preprocessing using the {{{preprocess}}} command (see main/gt/script/langTools/parallelize.py line 297ff)
** during the conversion to tokxml (???) by means of pointers to files in the converted corpus containing a list of typos + corrections (According to ___Børre___, this list was compiled using Hunspell and the corrections were added by ___Børre___ and ___Berit-Merete___)
* the occurrence of at least one pseudo-sentence in a tca2 input file blocks the sentence alignment of the whole file. Ergo: pseudo-sentences have to be filtered out BEFORE sending the file pair to tca2 ==> DONE

!! Comparing the svn data in prestable/converted with the freshly converted ones

* Note: only in the nob dir is there a huge difference
** only in the svn: {{{>grep '/Users/cipriangerstenberger/freecorpus/prestable/converted/nob' compare_files_prestable_converted_nob.txt | wc -l}}} ==> 227
** only locally generated: {{{try_slange>grep 'Only in prestable/converted/nob' compare_files_prestable_converted_nob.txt | wc -l}}} ==> 54

Question: Why? ==> random test:

{{{
converted/nob/admin/sd/samediggi.no/samediggi-article-3653.html.xml
converted/sme/admin/sd/samediggi.no/samediggi-article-3653.html.xml
_________
tmp/samediggi-article-3653.html.log
«/Users/cipriangerstenberger/my_cocon/orig/sme/admin/sd/samediggi.no/samediggi-article-3653.html»
100%^M0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Conversion failed: More than 5% of the content isn't analyzable /Users/cipriangerstenberger/my_cocon/orig/sme/admin/sd/samediggi.no/samediggi-article-3653.html
_________
}}}

Question: Does that mean that the xml conversion pipeline has been getting worse lately?

==> started the same conversion on victorio to check whether there are any differences caused by the OS. (By the way, the conversion on XServe gives the same result as locally on my mac.)

==> on victorio, the file under discussion gets converted into xml!!!

==> lesson learned: don't convert files on macOS!

* Note: We can forget the following file pair unless the text is corrected:

{{{
orig/nob/admin/depts/other_files/Samiske_tall_forteller_3_NO.pdf
orig/sme/admin/depts/other_files/Samiske_tall_forteller_3_SAM.pdf
}}}

The nob version contains (at least) 2-3 pages more than the sme one; one of them holds a short description of the content in eng, nob, and sme. This leads to a huge discrepancy between nob and sme, so the sentence alignment is useless. Not to mention the problems stemming from tables with numbers and statistics.

!!! Conversion to xml:

20120810:

{{{
1. 2596 2358 0 0 2358
2. 2420 2349 0 0 2349
}}}

___compare2former___

# Non-converted files (20120712): {{{xml_conversion>report_nonconverted_files.sh ../xml_conversion | wc -l}}} ==> 985
# Thereof 583 files relevant for the nob2sme corpus and 71 files relevant for the sme2nob corpus

!!!free parallel corpus nob2sme (state from 20120712)

# Result of the last conversion and check of the nob-sme parallel files: there are still 4 nob files with a dangling pointer to some ghostly sme parallel files

!!!free parallel corpus sme2nob (state from 20120712)

!!! Low-level fixes

todo: a private-use sign is causing problems with the sentence divider. {{{grep -r ' ' converted}}} ==> find the files (see the private-use_sign.log file) and use an xslt sheet for replacing the sign with the empty string

!!! TODO: nob2sme and sme2nob -- correcting parallelity pointer

!! nob2sme: 4 nob files with a dangling pointer to some ghostly sme parallel files

{{{
1. converted/nob/admin/depts/regjeringen.no/finnmarksloven.html_id=515308.xml
   converted/sme/admin/depts/regjeringen.no/finnmarkkulahka.html_id=515308.xml
2. converted/nob/admin/depts/regjeringen.no/spraktiltak.html_id=603613.xml
   converted/sme/admin/depts/regjeringen.no/185-miljovdna-doarjjan-aarjel--ja-julevsami-gielladoaimmaide.html_id=603613.xml
3. converted/nob/admin/depts/regjeringen.no/stotte-til-sor--og-lulesamisk-sprak.html_id=536716.xml
   converted/sme/admin/depts/regjeringen.no/doarjja-lulli--ja-julevsamegielaide.html_id=536716.xml
4. converted/nob/admin/sd/other_files/sp1991-1.pdf.xml
   converted/sme/admin/sd/other_files/dc1991-1.pdf.xml
}}}

!! all2all: overspecification about translation direction

{{{
converted/nob/facta/skuvlahistorja1/trygve_n.html.xml
___AND___
converted/sme/facta/skuvlahistorja1/trygve_s.html.xml
}}}

!! xxx2yyy: underspecification, i.e., there are correct pointers to parallel files but no translation direction is specified in either file.

==> see bug 1392

!!! for sentence alignment (after conversion on victorio and filtering on mac: 20120717)

* Input

{{{
prestable>find converted/nob -name \*.xml | wc -l
1628
prestable>find converted/sme -name \*.xml | wc -l
1628
}}}

* Output

{{{
prestable>find toktmx/nob2sme -name \*.toktmx | wc -l
1605
prestable>find toktmx/sme2nob/ -name \*.toktmx | wc -l
23
}}}

* Comparison with the svn prestable

{{{
freecorpus>find prestable/toktmx/nob2sme -name \*.toktmx | wc -l
1664
freecorpus>find prestable/toktmx/sme2nob -name \*.toktmx | wc -l
27
}}}

!!!Some old yet still relevant notes:

Parallelity="false" means you have ungarbled sme metadata, but the sme file does not exist. A file without metadata is understandable, but metadata without a file is fishy. Problems with the file names?

Trond: yes: in my svn, all files with æøå names are listed twice, marked as both unversioned and missing. The files are in utf8, mac wants decomposed chars. I'll fix them, Sjur asked me to do it yesterday (20100920)

2. Now, 20120712, there are no empty files in the remade corpus, but we have to pay attention to the check before transferring the parallel files from converted to prestable. There are two reasons not to transfer a file pair from the converted to the prestable directory:

2.1 Wrong ratio between the word counts in the file pair (victorio 20120712): {{{grep 'Wrong' pick_up_sme.log | wc -l}}} ==> 469

2.2 Too low wordcount (victorio 20120712): {{{xml_conversion>grep 'Too' pick_up_sme.log | wc -l}}} ==> 241

As stressed in our old (but very good) notes, we should stick to the FICE principles: find, identify and correct errors! The script pick-parallel-docs.pl finds and identifies, but does NOT correct the errors. We have to check why there are files with so few words after conversion (conversion errors in some document parts, files irrelevant for our corpus?), and why the word ratio between nob and sme is wrong (again, conversion errors in some document parts, irrelevant file pair?).
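The two rejection reasons above (too low word count, wrong word ratio) could be sketched roughly as follows. This is not the logic of pick-parallel-docs.pl, which is not shown in these notes; the thresholds {{MIN_WORDS}}, {{MIN_RATIO}} and {{MAX_RATIO}} and both function names are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical thresholds: the values used by pick-parallel-docs.pl
# are not given in these notes.
MIN_WORDS = 30
MIN_RATIO = 0.5
MAX_RATIO = 2.0


def count_words(xml_text: str) -> int:
    """Count whitespace-separated tokens in all text nodes of an XML document."""
    root = ET.fromstring(xml_text)
    return len(" ".join(root.itertext()).split())


def check_pair(nob_words: int, sme_words: int) -> Optional[str]:
    """Return a rejection reason for a file pair, or None if the pair passes."""
    # The word count test runs first, so a zero count never reaches the division.
    if min(nob_words, sme_words) < MIN_WORDS:
        return "too few words"
    ratio = nob_words / sme_words
    if not MIN_RATIO <= ratio <= MAX_RATIO:
        return "wrong word ratio"
    return None
```

A check like this only finds and identifies; deciding whether a rejected pair is a conversion error or an irrelevant document still needs a look at the files themselves.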
Note: The log file listing the files not picked up to the prestable dir is pick_up_sme.log

3. Are we sure all files in the old corpus repository have been moved to the new svn repository? Especially all gold standard files, both bound and free.

!!Routines for new upload files

Overall question: is there a check of the content before starting the conversion step? No. It is on the todo list. Or rather, there is one in some way: only doc, pdf, html and some other file types are accepted ... The upload script also checks the size of the file and whether the file already exists (using md5sum).

!!Status quo for upload directory as of today

{{{
~$ll /usr/local/share/corp/upload/ | egrep '(pdf|doc|txt)' | grep -v '\.x[sm]l'| wc -l
93
~$ll /usr/local/share/corp/upload/ | egrep '(pdf|doc|txt)' | grep -v '\.xsl'| wc -l
153
~$ll /usr/local/share/corp/upload/ | egrep '(pdf|doc|txt)' | grep '\.xsl'| wc -l
6
~$ll /usr/local/share/corp/upload/moved_already | egrep '(pdf|doc|txt)' | grep -v '\.x[sm]l'| wc -l
2
~$ll /usr/local/share/corp/upload/moved_already | egrep '(pdf|doc|txt)' | grep '\.x[sm]l'| wc -l
2
}}}

!!!Corpus synchronization between victorio and XServe

(based on Børre's email -- please correct if needed)

The corpus user has cron jobs running every night in which files in both freecorpus and boundcorpus are converted and synced to the XServe. I made a new mail alias, corpus_fanatics (of which Ciprian and I are members), and the results of these jobs are sent to us.

Cip has observed different numbers of files on the XServe and on his own machine. Why is that? The files are copied/synchronised from victorio to the XServe. There is probably no problem with this synchronisation.

!!Conversion routine

What about the uploaded files when it comes to a check of the content before starting the conversion step?
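One possible shape for such a pre-conversion check, combining the tests the upload script is said to perform (file type, size, md5 duplicate), is sketched below. It is not the existing upload script: the suffix list, the size limit {{MAX_BYTES}} and the function names are assumptions, since the notes only say "doc, pdf, html and some other filetypes".

```python
import hashlib
from pathlib import Path
from typing import List, Set

# Assumptions: the notes do not list the full set of accepted types
# or the size limit used by the upload script.
ACCEPTED_SUFFIXES = {".doc", ".pdf", ".html", ".txt"}
MAX_BYTES = 50 * 1024 * 1024


def md5_of(path: Path) -> str:
    """Return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_upload(path: Path, known_md5: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means the file may be converted."""
    problems = []
    if path.suffix.lower() not in ACCEPTED_SUFFIXES:
        problems.append("unsupported file type")
    size = path.stat().st_size
    if size == 0:
        problems.append("empty file")
    elif size > MAX_BYTES:
        problems.append("file too large")
    if md5_of(path) in known_md5:
        problems.append("duplicate of an existing file")
    return problems
```

Here {{known_md5}} would hold the digests of the files already in the corpus; how that set is built and stored is left open.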
I listed all converted files in freecorpus this way: {{{find converted/ -name \*.xml | sort > xmls.txt}}} (the number of converted files is indeed 685)

Then I listed all .xsl files in orig this way: {{{find orig/ -name \*.xsl | sort > xsls.txt}}} (the number of .xsl files is 1156)

After some search and replace of paths and file endings I made a diff to see which files are not converted: {{{diff xsls.txt xmls.txt > diff.txt}}}

!!What about directory structure?

* Cip&Trond's minimal constraint is: No mixed types in a corpus directory, i.e., either only subdirs or only files!
* todo: create an "unclassified" or "other" directory for each dir that currently has mixed content and move the files into that.

!!Grammatical analysis of the corpora - routines for error detection and frequency

Possible errors:
* Files not included in the analysis (sma-news-dep.txt)
* Analysis cut before end of file

!!!Main issues

* find, identify and correct errors in metadata
* find, identify and correct errors in the conversion process
* find, identify and correct errors in the original corpus files
* directory structure
* Vic/XServe synchronisation
* work load division

!!!Priorities

# Issues wrt. the texts we have
# Issues wrt. priorities for new texts

!!!Work load division and deadlines

__TODO list__
* Establish routines for future upload files (__Børre__)
* change file names from composed to decomposed (__Børre__)
* Old corpus directory check:
** Empty the upload directory (__Børre__)
** Check that all old corpus files have been moved to the new svn repository (__Ciprian, XXX__)
* Cronjob for converting gold standard files (__Børre__) (@cip: once again, gold standard files cannot simply be converted without a human check/manual correction! Otherwise there is only a Katzengold (fool's gold) corpus)
* add metadata consistency check to convert2xml.pl, report issues to Bugzilla (__Ciprian, thereafter all for evaluation__)
* clean all metadata (__all__)

__General principles:__
* Use Bugzilla
* Børre -> Ciprian: basically at different ends of the conversion process
* Communication via meetings and newsgroup discussions
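For the metadata consistency check on the TODO list, one candidate test is the dangling parallelity pointer noted earlier for nob2sme. The sketch below is not part of convert2xml.pl. It assumes that each converted file names its parallel file in a {{parallel_text}} element with {{xml:lang}} and {{location}} attributes, and that the parallel file sits in the same relative directory under the other language with {{.xml}} appended; the element name, the attribute names and the function name are assumptions and should be checked against the real converted files.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List, Tuple

# ElementTree exposes the predefined xml: prefix under this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def dangling_pointers(converted: Path, lang: str, other: str) -> List[Tuple[Path, Path]]:
    """List (source file, missing parallel file) pairs for one language direction."""
    missing = []
    lang_root = converted / lang
    for xml_file in sorted(lang_root.rglob("*.xml")):
        root = ET.parse(xml_file).getroot()
        # Assumption: parallel files are declared as
        # <parallel_text xml:lang="..." location="..."/>.
        for element in root.iter("parallel_text"):
            location = element.get("location")
            if not location or element.get(XML_LANG, other) != other:
                continue
            # Assumption: same relative directory under the other language.
            rel_dir = xml_file.parent.relative_to(lang_root)
            target = converted / other / rel_dir / (location + ".xml")
            if not target.exists():
                missing.append((xml_file, target))
    return missing
```

Called as {{dangling_pointers(Path("converted"), "nob", "sme")}}, it would return the pairs to report to Bugzilla; it only finds and identifies, the correction of the pointers stays manual.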