TODO next:
1. Why are there only 94 files as a result of sentence alignment although the number of input file pairs is 116?

in freecorpus/orig
consistency check:
 - Does every orig file have an xsl file?
 - Does every xsl file have an orig file?

Observations:
1. rtf files are not taken into account by the conversion script: there are 9 rtf files without xsl meta files
2. there is no file in ptx and svg format in the free corpus
3. pdf-pdf.xsl: 202-203; html-html.xsl: 320-322; rtf-rtf.xsl: 9-0

from convert2xml.pl:
===========
    # Check the filename
    return unless ($file =~ m/\.(doc|pdf|htm|html|ptx|txt|svg|bible\.xml|correct\.xml|correct\.xml,v)$/);

    if ( $file =~ m/[\;\<\>\*\|\`\&\$\(\)\[\]\{\}\'\"\?]/ ) {
        print STDERR "$file: ERROR. Filename contains special characters that cannot be handled. STOP\n";
        return "ERROR";
    }
===========

simple test (20100923):
freecorpus>find orig -name "*.doc" | wc -l
424
freecorpus>find orig -name "*.doc.xsl" | wc -l
424
freecorpus>find orig -name "*.pdf" | wc -l
202
freecorpus>find orig -name "*.pdf.xsl" | wc -l
203
freecorpus>find orig -name "*.htm" | wc -l
213
freecorpus>find orig -name "*.htm.xsl" | wc -l
213
freecorpus>find orig -name "*.html" | wc -l
320
freecorpus>find orig -name "*.html.xsl" | wc -l
322
freecorpus>find orig -name "*.ptx" | wc -l
0
freecorpus>find orig -name "*.ptx.xsl" | wc -l
0
freecorpus>find orig -name "*.txt" | wc -l
5
freecorpus>find orig -name "*.txt.xsl" | wc -l
5
freecorpus>find orig -name "*.svg" | wc -l
0
freecorpus>find orig -name "*.svg.xsl" | wc -l
0
freecorpus>find orig -name "*.xml" | wc -l
1
freecorpus>find orig -name "*.xml.xsl" | wc -l
1
freecorpus>find orig -name "*.rtf" | wc -l
9
freecorpus>find orig -name "*.rtf.xsl" | wc -l
0

==================
Checking unconverted files:
==================
20100924:
extra_gtsvn>report_nonconverted_files.sh freecorpus | wc -l
34

absolutely no meta-data added in:
orig/sme/facta/Pressemelding\[1\].doc.xsl

==================
Checking parallel files:
==================
20100924:
checking the existence of parallel original files when no parallel xml file is found: i.e., is there an original file at all, or was there just an error during the conversion of the parallel file?
Ex.:
1. tFile_xml="false" AND tFile_orig="true" ==> conversion error
2. tFile_xml="false" AND tFile_orig="false" ==> no original file at all (or incorrect metadata)

20100925:
20100926:
96 79 34 34 79
last update:
96 79 34 34 79

Todo: correcting mainlang in the xsl documents
(generally: consistency check of the XSL file!)

diff -r converted report_co_20100927/converted | grep 'document id=' | l
< > < > < > < >

freecorpus>diff -r converted report_co_20100927/converted | grep 'document id=' | egrep '^<' | wc -l
72
freecorpus>diff -r converted report_co_20100927/converted | grep 'document id=' | egrep '^>' | wc -l
72

Apparently, 72 files lack the xml:lang info in the xsl meta-data.

20100928:
number of files not (yet) converted to xml:
extra_gtsvn>report_nonconverted_files.sh freecorpus | wc -l
28
number of converted but empty files:
sme-nob parallel files:
118 116 0 0 116
132 116 0 0 116

20100928:
extra_gtsvn>report_nonconverted_files.sh freecorpus | wc -l
38 (vs. 28) (vs. 7)
118 116 0 0 116
132 116 0 0 116

20100929:
extra_gtsvn>report_nonconverted_files.sh freecorpus | wc -l
17
report on parallel corpus AFTER the big meta-data correction wave:
168 160 0 0 160
182 160 0 0 160

20100930:
189 181 0 0 181
203 181 0 0 181

20101001:
extra_gtsvn>report_nonconverted_files.sh freecorpus | wc -l
18
190 182 0 0 182
204 182 0 0 182

20101002:
freecorpus>wc -l corpus_report/non-converted_files.xml
25 corpus_report/non-converted_files.xml
1042 965 0 0 965
1003 965 0 0 965

20101004:
converted files
non-converted files
freecorpus>wc -l corpus_report/non-converted-files.txt
172 corpus_report/non-converted-files.txt
parallel files sme<->nob
1110 1029 0 0 1029
1065 1029 0 0 1029

20101008:
xsl file number check
freecorpus>find orig -name "*.xsl" | wc -l
8770 vs.
42+8723=8765 (non_converted+converted) ==> 5 xsl files without an orig file
non-converted files
extra_gtsvn>report_nonconverted_files.sh freecorpus | wc -l
42
converted files

=========
20110112:
=========
-- first corpus check after three months --
freecorpus>find orig -name "*.xsl" | wc -l
9234

9 non-converted files:
extra_gtsvn>report_nonconverted_files.sh freecorpus
freecorpus/orig/eng/admin/depts/regjeringen.no/laws.html?id=438754
freecorpus/orig/nno/admin/depts/regjeringen.no/om-departementet.html?id=796
freecorpus/orig/nno/facta/skuvlahistorja1/algu-n.htm
freecorpus/orig/nob/admin/guovda/Sakspapirer_på_norsk_24.04.02.doc
freecorpus/orig/nob/admin/others/A_rsmelding_2000.doc
freecorpus/orig/sma/facta/ssh1_s.html
freecorpus/orig/sme/admin/sd/bl_04_2.doc
freecorpus/orig/sme/facta/skuvlahistorja1/uskav1_s.html
freecorpus/orig/sme/laws/jus.txt

9225+9=9234 (the same number as the number of xsl files)

Ergo: 9225-8494=731 non-empty files (Scary!)

The statistics on parallel files cannot be computed because of the paths with special characters:

current_abs_loc: file:/Users/cipriangerstenberger/extra_gtsvn/freecorpus/converted/sme/admin/depts/regjeringen.no/
current_file: 1-5584-miljon-ruvdno-speallaruhta-falastallanulbmiliidda-2010s.html%3Fid=603874.xml
current_location: converted/sme/admin/depts/regjeringen.no/
current_pfile: 1-5584-millioner-kroner-i-spillemidler-til-idrettsformal-for-2010.html?id=603874
Recoverable error on line 116 of parallel_corpus_info.xsl:
  FODC0002: java.io.FileNotFoundException: /Users/cipriangerstenberger/extra_gtsvn/freecorpus/converted/nob/admin/depts/regjeringen.no/1-5584-millioner-kroner-i-spillemidler-til-idrettsformal-for-2010.html (No such file or directory)
..........................................
orig/nob/admin/depts/regjeringen.no/1-5584-millioner-kroner-i-spillemidler-til-idrettsformal-for-2010.html?id=603874
..........................................
net.sf.saxon.trans.XPathException: Exception in extension function: java.lang.IllegalArgumentException: URI has a query component
    at net.sf.saxon.functions.ExtensionFunctionCall.call(ExtensionFunctionCall.java:307)
    at net.sf.saxon.functions.ExtensionFunctionCall.iterate(ExtensionFunctionCall.java:224)
    at net.sf.saxon.expr.Expression.evaluateItem(Expression.java:352)
    at net.sf.saxon.expr.ExpressionTool.evaluate(ExpressionTool.java:296)
    at net.sf.saxon.expr.ExpressionTool.lazyEvaluate(ExpressionTool.java:432)

@cip: I have to fix it!

==========================================================================================

=========
20110921:
=========
Corpus converted locally on @cip's machine:

Message output after conversion:
.......
Processing finished
9458 files processed, 20 errors among them
The errors were distributed like this:
    cant_handle 5% of errors
    checkxml_after_faulty 5% of errors
    convert2xml 10% of errors
    faulty_chars 10% of errors
    intermediate 60% of errors
    xsl 10% of errors
.......

Tests by means of stylesheets:
1. XML conversion:
java net.sf.saxon.Transform -it main ym_corpus_info.xsl inDir=converted

2. XML valid yet empty files:
java net.sf.saxon.Transform -it main get-empty-docs.xsl inFile=corpus_report/corpus_summary.xml
2.1 laws.html_id=438754.xml converted/eng/admin/depts/regjeringen.no/
2.2 om-departementet.html_id=796.xml converted/nno/admin/depts/regjeringen.no/

3. Parallelity check for nob->sme and sme->nob language pairs:
java -Xmx2048m net.sf.saxon.Transform -it main parallel_corpus_info.xsl inDir=converted
2569 2435 0 0 2435
2554 2413 0 0 2413

4. Parallelity check for nob->sma and sma->nob language pairs:
38 10 0 0 10
rallel_files dir="sma2nob" ok="9" ko="4" coversion_error="1" no_orig_file="3">
13 9 0 0 9

==========================================================================================

=========
20120902:
=========
Corpus converted on @cip's account on xserve:

Processing finished
7677 files processed, 103 errors among them
The errors were distributed like this:
    checkxml_after_faulty 7% of errors
    faulty_chars 1% of errors
    too_low_mainlang 91% of errors
    xsl 1% of errors

2585 2444 0 0 2444
2511 2435 0 0 2435

xxx2yyy>ls *
nob:
22-juli-kommisjonens-rapport-.html_id=697509.xml
gratulerer-med-verdens-urfolksdag.html_id=697230.xml
speech-at-the-ceremony-in-the-government.html_id=696940.xml
ssh1-n.htm.xml
takket-mood-for-innsatsen-i-syria.html_id=697002.xml
tale-ved-aufs-arrangement-22-juli-2012.html_id=696942.xml
tale-ved-minnekonsert.html_id=696944.xml
tale-ved-samling-for-etterlatte-parorend.html_id=696943.xml

sme:
giittii-mood-syria-barggu-ovddas.html_id=697002.xml
sardni-aufs-lagideamis-utoyas.html_id=696942.xml
sardni-kransabidjamis.html_id=696940.xml
sardni-oamehasaide-ja-eaktodahtolaaide.html_id=696943.xml
sardni-raeviessoilju-muitokonsearttas.html_id=696944.xml
savvat-buori-algoalbmotbeaivvi.html_id=697230.xml
ssh1-s.htm.xml
suoidnemanu-22beaivve-kommiuvnna-raporta.html_id=697509.xml

xxx2yyy>ls nob | wc -l
8
xxx2yyy>ls sme | wc -l
8

=========
20120904:
=========
Corpus converted on @cip's account on xserve:

Processing finished
7662 files processed, 66 errors among them
The errors were distributed like this:
    checkxml_after_faulty 11% of errors
    faulty_chars 2% of errors
    too_low_mainlang 88% of errors

2582 2463 0 0 2463
2530 2454 0 0 2454

parallel_corpus_tmp>ls
nob2sme sme2nob xxx2yyy
parallel_corpus_tmp>ls xxx2yyy/nob | wc -l
10
parallel_corpus_tmp>ls xxx2yyy/sme | wc -l
10
parallel_corpus_tmp>ls xxx2yyy/nob/*
xxx2yyy/nob/22-juli-kommisjonens-rapport-.html_id=697509.xml
xxx2yyy/nob/beredskapstiltak.html_id=698072.xml
xxx2yyy/nob/gratulerer-med-verdens-urfolksdag.html_id=697230.xml
xxx2yyy/nob/speech-at-the-ceremony-in-the-government.html_id=696940.xml
xxx2yyy/nob/ssh1-n.htm.xml
xxx2yyy/nob/takket-mood-for-innsatsen-i-syria.html_id=697002.xml
xxx2yyy/nob/tale-ved-aufs-arrangement-22-juli-2012.html_id=696942.xml
xxx2yyy/nob/tale-ved-minnekonsert.html_id=696944.xml
xxx2yyy/nob/tale-ved-samling-for-etterlatte-parorend.html_id=696943.xml
xxx2yyy/nob/utlysning---proveordning-med-tilskudd-ti.html_id=698048.xml
parallel_corpus_tmp>ls xxx2yyy/sme/*
xxx2yyy/sme/almmuhus--geahalanortnet-mas-dorjot-kult.html_id=698048.xml
xxx2yyy/sme/giittii-mood-syria-barggu-ovddas.html_id=697002.xml
xxx2yyy/sme/sardni-aufs-lagideamis-utoyas.html_id=696942.xml
xxx2yyy/sme/sardni-kransabidjamis.html_id=696940.xml
xxx2yyy/sme/sardni-oamehasaide-ja-eaktodahtolaaide.html_id=696943.xml
xxx2yyy/sme/sardni-raeviessoilju-muitokonsearttas.html_id=696944.xml
xxx2yyy/sme/savvat-buori-algoalbmotbeaivvi.html_id=697230.xml
xxx2yyy/sme/ssh1-s.htm.xml
xxx2yyy/sme/stahtaministtar-almmuha-oa-gearggusvuoad.html_id=698072.xml
xxx2yyy/sme/suoidnemanu-22beaivve-kommiuvnna-raporta.html_id=697509.xml

=========
20120904:
=========
Corpus converted on @cip's account on xserve:

Processing finished
7644 files processed, 16 errors among them
The errors were distributed like this:
    checkxml_after_faulty 44% of errors
    faulty_chars 6% of errors
    too_low_mainlang 50% of errors

2571 2494 0 0 2494
2561 2485 0 0 2485

=========
20120918:
=========
Corpus converted on @cip's account on xserve:

Processing finished
7649 files processed, 21 errors among them
The errors were distributed like this:
    checkxml_after_faulty 33% of errors
    convert2xml 10% of errors
    faulty_chars 5% of errors
    too_low_mainlang 52% of errors

2571 2494 0 0 2494
converted/nob/admin/depts/regjeringen.no/avdelingsdirektor-anne-brendemoen.html_id=314115.xml
converted/sme/admin/depts/regjeringen.no/avdelingsdirektor-anne-brendemoen.html_id=314115.xml
converted/nob/admin/depts/regjeringen.no/tale-pa-frigjorings--og-veterandagen-8-m.html_id=681132.xml
converted/sme/admin/depts/regjeringen.no/moarmesmano-8-biejve-avvudallam.html_id=681133.xml
converted/nob/admin/depts/regjeringen.no/yrkesskadereform.html_id=682373.xml
converted/sme/admin/depts/regjeringen.no/alkkep-oadtjot-bargodisvuodarudajt-ga-le.html_id=682364.xml
converted/nob/laws/other_files/Barnevernsloven_nb_hl-19920717-100.html.xml
converted/sme/laws/other_files/Barnevernloven samisk 10.doc.xml

2561 2485 0 0 2485
converted/sme/admin/depts/regjeringen.no/mis-galga-leat-buorre-suodjalus-heivehuv.html_id=685751.xml
converted/nob/admin/depts/regjeringen.no/--vi-skal-ha-et-godt-forsvar-for-var-tid.html_id=685751.xml
converted/sme/admin/depts/regjeringen.no/suoidnemanu-22beaivve-kommiuvdna---bargu.html_id=697508.xml
converted/nob/admin/depts/regjeringen.no/-en-jobb-ma-gjores.html_id=697508.xml
converted/sme/laws/other_files/boazodoallolahka-.html_id=475631.xml
converted/nob/laws/other_files/hl-20070615-040.html.xml
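
Sketch for the orig/xsl consistency check from the first entry (does every orig file have an xsl file, and does every xsl file have an orig file?). This is only a minimal Python sketch and not part of the corpus tools: the function names check_orig_xsl and count_by_extension are hypothetical, and it assumes that the meta file of orig/x/y.doc is always orig/x/y.doc.xsl (as the find tests suggest) and that hidden directories such as .svn can be skipped.

```python
import os
from collections import Counter


def check_orig_xsl(root):
    """Return (orig files without an xsl file, orig paths named by an xsl file but missing).

    Assumption: the meta file of <path> is <path>.xsl in the same directory.
    """
    orig, described = set(), set()
    for dirpath, dirs, files in os.walk(root):
        # Assumption: hidden directories (e.g. .svn) hold no corpus files.
        dirs[:] = [d for d in dirs if not d.startswith(".")]
        for name in files:
            path = os.path.join(dirpath, name)
            if name.endswith(".xsl"):
                # Strip the suffix to get the orig file this xsl describes.
                described.add(path[: -len(".xsl")])
            else:
                orig.add(path)
    return sorted(orig - described), sorted(described - orig)


def count_by_extension(paths):
    """Count paths per lower-cased file extension, e.g. '.rtf'."""
    return Counter(os.path.splitext(p)[1].lower() for p in paths)
```

Calling check_orig_xsl("orig") from within freecorpus would list both kinds of mismatch at once, and count_by_extension on the first list would show which formats lack meta files, without one find call per extension.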