Here is a status update on the Oslo corpus update work:

1. Decisions:
1.1 use only the files that are already in XML format (i.e., the old repository)
    (agreed with Trond; however, Lene also wants to have Riddu-Riddu and Skolehistorie, waiting for the tools' update)
1.2 don't include tables and lists in the processing:
    obviously, this was NOT the case with the last version of the data, which means that we shall keep the tables and lists as they are
1.3 wait for Lene & Thomas to update the automata
1.4 check the parallelism of files using the info from the header
1.5 test the pipeline (it was quite wet here in Tromsø, the pipeline got a bit rusty)

====================================================================================
2. Inventory of the current bound corpus repository in XML format

2.1 All files in XML format:
    corpus_christi>wc -l *invent*
      481 nob_b_inventory.txt
    23263 sme_b_inventory.txt

2.2 All files with parallel text and language info:
    corpus_christi>grep -r "parallel_text" bound/sme | grep "xml:lang=\"nob\"" | wc -l
    95
    corpus_christi>grep -r "parallel_text" bound/nob | grep "xml:lang=\"sme\"" | wc -l
    24

To me, this is quite fishy: I was expecting the same number on each side.

====================================================================================
3. Inventory of the Oslo data (thanks to Trond for getting it from Oslo)
--------------------------------
    1999_2000>ls nob | wc -l
    7
    1999_2000>ls sme | wc -l
    7
    1999_2000>ls parallel | wc -l
    7
    1999_2000>ls analyzed | wc -l
    7
--------------------------------
1 file of each type:
    NAC_2001_35>ls
    NAC_2001_35.pdf.analyzed.xml  NAC_2001_35.pdf.sent_NOU_2001_35.pdf.sent.xml
    NAC_2001_35.pdf.sent.xml      NOU_2001_35.pdf.sent.xml
--------------------------------
1 file of each type:
    STM_TS007SA>ls
    STM_TS007.pdf.sent.xml        STM_TS007SA.pdf.sent.xml
    STM_TS007SA.pdf.analyzed.xml  STM_TS007SA.pdf.sent_STM_TS007.pdf.sent.xml
--------------------------------
1 file of each type:
    bible>ls
    01GENNBST.bible.sent.xml       1Mos_09_01.bible.sent.xml
    1Mos_09_01.bible.analyzed.xml  1Mos_09_01.bible.sent_01GENNBST.bible.sent.xml
--------------------------------
1 file of each type:
    nac>ls
    NAC_1994_21.pdf.analyzed.xml  NAC_1994_21.pdf.sent_NOU_1994_21.pdf.sent.xml
    NAC_1994_21.pdf.sent.xml      NOU_1994_21.pdf.sent.xml
--------------------------------
    skolehistorie>ls nob | wc -l
    28
    skolehistorie>ls sme | wc -l
    28
    skolehistorie>ls parallel | wc -l
    28
    skolehistorie>ls analyzed | wc -l
    28
--------------------------------
    skolehistorie2>ls | wc -l
    6
    skolehistorie2>ls nob | wc -l
    33
    skolehistorie2>ls nob_comp | wc -l
    2
    skolehistorie2>ls sme | wc -l
    33
    skolehistorie2>ls sme_comp | wc -l
    33
    skolehistorie2>ls parallel | wc -l
    33
    skolehistorie2>ls analyze | wc -l
    33

In skolehistorie2, some files are doubled:
    skolehistorie2>ls sme
    aarseth2-s.html.sent.xml  inge-s.html.sent.xml    nordby-s.html.sent.xml
    algu2-s.html.sent.xml     ingunn-s.html.sent.xml  pave-s.html.sent.xml
vs.
    skolehistorie2>ls sme_comp
    aarseth2_s.html.sent.xml  inge_s.html.sent.xml    nordby_s.html.sent.xml
    algu2_s.html.sent.xml     ingunn_s.html.sent.xml  pave_s.html.sent.xml

However, this is the case only on the sme side. I checked the LANG_comp directories, and they should be ignored.
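
The doubled names above differ only in "-" versus "_". A minimal sketch of how such pairs could be detected automatically, assuming that the hyphen/underscore difference is the only variation (the function names are hypothetical, not part of the existing scripts):

```python
from collections import defaultdict


def normalize(name):
    """Map '-' to '_' so that both spellings share one key (assumption of this sketch)."""
    return name.replace("-", "_")


def find_doubled(names):
    """Group file names by normalized form and keep groups with more than one member."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


# File names taken from the skolehistorie2 listing above.
sme = ["aarseth2-s.html.sent.xml", "inge-s.html.sent.xml"]
sme_comp = ["aarseth2_s.html.sent.xml", "inge_s.html.sent.xml"]

doubled = find_doubled(sme + sme_comp)
for key, members in sorted(doubled.items()):
    print(key, "<-", members)
```

In practice the two lists would come from the directory listings (for instance via os.listdir) instead of being written out by hand.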
======================================
Conclusion of the Oslo data inventory:
======================================
There is NO sme data that lacks a counterpart on the nob side!

Question to Trond: What about our plans to send ALL sme data?

====================================================================================
4. Inventory of parallel files using the command/scripts on our site
   https://giellalt.uit.no/ling/corpus_analyze.html

Using the command
    corpus_christi>./../svnredone/gt/script/corpus-parallel.pl --list --lang=sme --dir=/usr/local/share/corp/bound/sme > sme-nob_parallel.txt
and assuming that the script works correctly for this option, we get the following result:
    corpus_christi>wc -l sme-nob_parallel.txt
    82 sme-nob_parallel.txt

Note that this is seen only from the sme side. However, grep detected 95 possible parallel nob files:
    corpus_christi>grep -r "parallel_text" bound/sme | grep "xml:lang=\"nob\"" | wc -l
    95

====================================================================================
5. Action points:

5.1 find all parallel texts
    - ongoing work (only) on the free corpus for sme and nob, assuming that the directory structures of converted/sme and converted/nob are exactly parallel; working with the version
        freecorpus>svn up
        At revision 845.
    - unreliable metadata, as follows:
      a. parallel file declared in the metadata, but the parallel file does not exist
         Ex.:
        less converted/sme/admin/depts/NAC_1994_21.pdf.xml NAC_1994_24.pdf
        find . -name "NAC_1994*"
        ./orig/sme/admin/depts/NAC_1994_21.pdf
        ./orig/sme/admin/depts/NAC_1994_21.pdf.xsl
        freecorpus>
        freecorpus>find . -name "NOU_1994*"
        freecorpus>
        freecorpus>ll converted/nob/admin/depts/
        total 848
        -rw-rw---- 1 cipriangerstenberger staff 205090 17 sep 08:15 HP_2009_samisk_sprak_norsk.pdf.xml
        -rw-rw---- 1 cipriangerstenberger staff 191930 17 sep 08:15 STM_TS007.pdf.xml
        -rw-rw---- 1 cipriangerstenberger staff 29129 17 sep 08:15 Tid_for_samtale_bm_nett.pdf.xml
      b. parallel file declared in one file but not in the other
         Ex.: declared in converted/sme/admin/depts/STM_TS007SA.pdf.xml
              but not in converted/nob/admin/depts/STM_TS007.pdf.xml
      c. parallel file declared in both XML files, but with errors in one or the other (corrected by Ciprian)
         Ex.: in converted/sme/admin/sd/Duoji_doaibmadoarjagiid_árvvoštallan_2005-2009.pdf.xml
              in converted/nob/admin/sd/Evaluering_av_driftstilskuddsordningen_for_duodji_2005-2009.pdf.xml
5.2 preprocess them
5.3 find the rest of the sme XML texts
5.4 preprocess them
5.5 check the analysis pipeline while waiting for the final version of the Riddu-Riddu data and of the tools

====================================================================================
6. Proposals for improvements after a (not that close) look at the corpus data

6.1 since both file content and file names in the bound (hence also in the free) directory are subject to change, we might as well name the files more systematically; the original file names can be stored in a header element. Names like these could then be avoided:
    Læremiddelbruk_i_tosprÃ¥klig_opplæring.pdf.xml
    file:⁄⁄⁄home⁄boerre⁄Dokumenter⁄corpus⁄per-eric-kuoljok-2009-05-19⁄OrdlistaFaktabladSOU.doc.xml

6.2 better organization and naming of the file structure:
    - no mixing of files AND directories in one and the same directory (have a look, for instance, at /usr/local/share/corp/bound/sme/facta)
      An even better example is MinAigi: the 1999 directory contains both a directory for each month and 732 "unsorted" files on the same level.
    1999>pwd
    /usr/local/share/corp/bound/sme/news/MinAigi/1999
    1999>ll | grep "^d" | wc -l
    12
    1999>ll | grep -v "^d" | wc -l
    732

We should agree on better naming conventions: here there is quite a lot of redundancy (see pwd above):
    MA --> we are in MinAigi already
    99 --> we are in 1999 already

    1999>ll | grep "^d"
    drwxrwx--- 2 root bound 4096 des 21 2006 MA01_99
    drwxrwx--- 2 root bound 4096 des 21 2006 MA02_99
    drwxrwx--- 2 root bound 4096 des 21 2006 MA03_99
    drwxrwx--- 2 root bound 4096 des 21 2006 MA04_99
    drwxrwx--- 2 root bound 4096 des 21 2006 MA05_99
    drwxrwx--- 2 root bound 4096 des 21 2006 MA06_99
    drwxrwx--- 2 root bound 4096 des 21 2006 MA07_99
    drwxrwx--- 2 root bound 4096 des 21 2006 MA08_99
    drwxrwx--- 2 root bound 4096 des 21 2006 MA09_99
    drwxrwx--- 2 root bound 4096 des 21 2006 MA10_99
    drwxrwx--- 2 root bound 4096 des 21 2006 MA11_99
    drwxrwx--- 2 root bound 4096 des 21 2006 MA12_99

Not to mention that there is another minaigi directory on the same level as MinAigi:
    news>ls
    Assu  MinAigi  minaigi.no  NRK  other  YLE
What is the difference? Was one received directly from the newspaper and the other taken from the net?

Answer from Børre: MinAigi was one of the directories that were there from the beginning. All directory names below it are the original ones we received from Ávvir (which inherited the files from Min Áigi and Áššu). I later added minaigi.no. Files inside that directory are fetched from the net.

    - better conceptualization: why is finnmarksloven in facta when there is a laws directory?
      -- (Børre again) At the time it seemed like a good idea; there were many files from that domain that belonged together.

6.3 proper check of the content of the collected data BEFORE XML transformation:
    this has been done sporadically, but there is a whole range of data whose content has not been checked; for instance,
    /usr/local/share/corp/bound/sme/facta/Læremiddelbruk_i_tosprÃ¥klig_opplæring.pdf.xml
    with the following content:
Of the 174 files in bound/sme/facta/finnmarksloven, only 14 actually have data content, and this check was quite a superficial one:
    finnmarksloven>ll
    totalt 1552
    -rw-rw---- 1 root bound 692 mar 22 07:54 arkiv16a8.html.xml
    -rw-rw---- 1 root bound 695 mar 22 07:54 artikkel00ed.html.xml
    -rw-rw---- 1 root bound 695 mar 22 07:57 artikkel015c.html.xml
    -rw-rw---- 1 root bound 695 mar 22 07:55 artikkel030b.html.xml
    -rw-rw---- 1 root bound 695 mar 22 07:59 artikkel0958.html.xml
    -rw-rw---- 1 root bound 695 mar 22 07:55 artikkel0caa.html.xml
    -rw-rw---- 1 root bound 695 mar 22 07:52 artikkel0d8f.html.xml
    -rw-rw---- 1 root bound 695 mar 22 07:56 artikkel0ff3.html.xml
    -rw-rw---- 1 root bound 695 mar 22 07:55 artikkel1053.html.xml
    -rw-rw---- 1 root bound 695 mar 22 07:54 artikkel1454.html.xml
    -rw-rw---- 1 root bound 695 mar 22 07:55 artikkel1501.html.xml
    -rw-rw---- 1 root bound 9086 mar 22 07:53 artikkel1627.html.xml
=> almost identical file sizes (692-695 bytes, against 9086 bytes for a file with content)!

The parallel nob directory seems to contain only header-only XML files!

NB: Some files in /usr/local/share/corp/broken seem to be much more useful content-wise than the header-only XML files in the bound directory; see, for instance, Salmmat-_garvasat_0203.doc.xml there.

Conclusion: Neither a successful XML transformation nor a failed one tells us anything about how useful the file's content is as part of a text corpus.

6.4 a quick language check (irrespective of what the language-model tools guessed) would ensure that each file is registered under the appropriate language directory.
    Example: In /usr/local/share/corp/bound/nob/laws the only file with corpus data (Lov_om_psykisk.sme.doc.xml) is a sme-only file.
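
The content check proposed in 6.3 could be automated instead of judging by file size. A minimal sketch, assuming (this is a guess, not verified against the corpus DTD) that the running text of a converted file sits inside an element named "body" and that the files use no XML namespace; both function names are hypothetical:

```python
import xml.etree.ElementTree as ET


def body_char_count(xml_source, body_tag="body"):
    """Count non-whitespace characters inside the body element.

    The tag name "body" is an assumption about the corpus format.
    xml_source may be a str or bytes object holding the whole document.
    """
    root = ET.fromstring(xml_source)
    body = root if root.tag == body_tag else root.find(f".//{body_tag}")
    if body is None:
        return 0
    return sum(len("".join(chunk.split())) for chunk in body.itertext())


def is_header_only(xml_source, min_chars=1):
    """True if the document has fewer than min_chars characters of body text."""
    return body_char_count(xml_source) < min_chars


# Invented miniature documents, only to illustrate the two cases.
HEADER_ONLY = "<document><header><title>x</title></header><body/></document>"
WITH_TEXT = "<document><header/><body><p>Some text</p></body></document>"

print(is_header_only(HEADER_ONLY))
print(is_header_only(WITH_TEXT))
```

To scan a directory such as bound/sme/facta/finnmarksloven, each file could be read with pathlib.Path.read_bytes() and passed to is_header_only; a higher min_chars would also catch files that contain only a stray title or menu text.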
6.5 with the new svn repository for the corpus data, files like
    /usr/local/share/corp/bound/sme/bible/nt/north_sami_html.html.xml
    shouldn't exist any longer (by the way, this information would have been better stored in a README file; I assume that XML corpus files store corpus data, not just metadata)

6.6 a proper check of texts for duplicates is required:
    Example: /usr/local/share/corp/bound/nob/news/MinAigi/2003/olsenbanden
    olsenbanden>ll
    totalt 136
    -rw-rw---- 1 root bound 59618 mar 17 23:01 01dialogmanus.DOC-1.doc.xml
    -rw-rw---- 1 root bound 59616 mar 17 23:01 01dialogmanus.DOC.doc.xml
    olsenbanden>diff 01dialogmanus.DOC.doc.xml 01dialogmanus.DOC-1.doc.xml
    3c3
    <
    ---
    >
    18c18
    < XSLtemplate 1.13 ; file-specific xsl $Revision: 1.4 $; common.xsl 1.25 ; convert2xml 1.119 ; add_hyph_tags 1.15 ; docbook2corpus2 1.19 ; xhtml2corpus 1.13 ;
    ---
    > XSLtemplate 1.13 ; file-specific xsl $Revision: 1.2 $; common.xsl 1.25 ; convert2xml 1.119 ; add_hyph_tags 1.15 ; docbook2corpus2 1.19 ; xhtml2corpus 1.13 ;

====================================================================================
7. Further notable issues, documented here:
    - aim: run a test and compare Oslo's data with the data generated now
    - randomly chosen file pair: hans_s.html.xml/hans_n.html.xml
    - observations:
7.1 while the Oslo files DO have content, only the sme file has content in our corpus
7.2 neither the bound nor the free version of the nob file has content in our corpus (I had assumed that at least one would be ok):
    /usr/local/share/corp/free/nob/facta/hans_n.html.xml
    /usr/local/share/corp/bound/nob/facta/hans_n.html.xml
7.3 a similarly named file, /usr/local/share/corp/free/nob/facta/hans-n.html.xml, HAS content!
7.4 there is no counterpart to this file in the bound directory!
7.5 for a detailed comparison of a randomly chosen file pair between Oslo (last version) and Tromsø (generated now), see the compare_oslo-tromsoe dir
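
File comparisons like those in 6.6 and 7.5 are easier to read when lines that differ only in tool version stamps are left out. A minimal sketch using Python's difflib, assuming that such lines can be recognized by the "$Revision:" marker seen in the diff in 6.6 (the function name and the sample strings are hypothetical):

```python
import difflib


def meaningful_diff(a_text, b_text, ignore_markers=("$Revision:",)):
    """Return unified-diff lines for two texts, skipping lines with an ignore marker."""

    def keep(line):
        return not any(marker in line for marker in ignore_markers)

    a_lines = [line for line in a_text.splitlines() if keep(line)]
    b_lines = [line for line in b_text.splitlines() if keep(line)]
    return list(difflib.unified_diff(a_lines, b_lines, lineterm="", n=0))


# Invented miniature documents; only the version stamp differs between the first two.
doc_a = "<document>\nfile-specific xsl $Revision: 1.4 $\n<p>text</p>\n</document>"
doc_b = "<document>\nfile-specific xsl $Revision: 1.2 $\n<p>text</p>\n</document>"
doc_c = "<document>\nfile-specific xsl $Revision: 1.2 $\n<p>other</p>\n</document>"

print(meaningful_diff(doc_a, doc_b))      # empty list: duplicates apart from the stamp
for line in meaningful_diff(doc_a, doc_c):
    print(line)
```

An empty result marks a pair as duplicates for corpus purposes; any remaining lines point to real differences in content.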