Here is a status report on the Oslo corpus update work:

1. Decisions:
 1.1 use only the files that are already in xml format (i.e., the old repository)
   (agreed with Trond; however, Lene also wants Riddu-Riddu and Skolehistorie, which must wait for the tools' update)
 1.2 don't include tables and lists in the processing:
   obviously, this was NOT the case with the last version of the data,
   which means that we shall keep the tables and lists as they are
 1.3 wait for Lene & Thomas for updating the automata
 1.4 check that the files are parallel, using the info from the header
 1.5 test the pipeline (it was quite wet here in Tromsø, the pipeline got a bit rusty)

====================================================================================

2. Inventory of the current bound corpus repository in XML format

2.1 All files in XML format:
corpus_christi>wc -l *invent*
    481 nob_b_inventory.txt
  23263 sme_b_inventory.txt

2.2 All files with parallel text and language info:
corpus_christi>grep -r "parallel_text" bound/sme | grep "xml:lang=\"nob\"" | wc -l
95
corpus_christi>grep -r "parallel_text" bound/nob | grep "xml:lang=\"sme\"" | wc -l
24

To me, this is quite fishy... I was expecting to have the same number on each side.
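
One possible contributor to the mismatch is that grep counts matching lines, not files, so a file with several parallel_text declarations is counted more than once. A minimal sketch for counting both, assuming the files are well-formed XML with un-namespaced <parallel_text location="..." xml:lang="..."/> elements; the function name is hypothetical and not part of our scripts:

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def count_parallel_declarations(root_dir, target_lang):
    """Return (files, declarations) for <parallel_text> pointing to target_lang."""
    files = 0
    declarations = 0
    for path in sorted(Path(root_dir).rglob("*.xml")):
        try:
            tree = ET.parse(str(path))
        except ET.ParseError:
            continue  # assumption: files that are not well-formed are skipped
        hits = [e for e in tree.iter("parallel_text")
                if e.get(XML_LANG) == target_lang]
        if hits:
            files += 1
            declarations += len(hits)
    return files, declarations
```

Calling it as count_parallel_declarations("bound/sme", "nob") and count_parallel_declarations("bound/nob", "sme") would show whether the two grep numbers count files or declarations.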

====================================================================================

3. Inventory of the Oslo data (thanks to Trond for getting it from Oslo)

--------------------------------
1999_2000>ls nob | wc -l 
       7
1999_2000>ls sme | wc -l 
       7
1999_2000>ls parallel | wc -l 
       7
1999_2000>ls analyzed | wc -l 
       7
--------------------------------
1 file of each type:

NAC_2001_35>ls
NAC_2001_35.pdf.analyzed.xml                  NAC_2001_35.pdf.sent_NOU_2001_35.pdf.sent.xml
NAC_2001_35.pdf.sent.xml                      NOU_2001_35.pdf.sent.xml
--------------------------------
1 file of each type:

STM_TS007SA>ls
STM_TS007.pdf.sent.xml                      STM_TS007SA.pdf.sent.xml
STM_TS007SA.pdf.analyzed.xml                STM_TS007SA.pdf.sent_STM_TS007.pdf.sent.xml
--------------------------------
1 file of each type:

bible>ls
01GENNBST.bible.sent.xml                       1Mos_09_01.bible.sent.xml
1Mos_09_01.bible.analyzed.xml                  1Mos_09_01.bible.sent_01GENNBST.bible.sent.xml
--------------------------------
1 file of each type:

nac>ls
NAC_1994_21.pdf.analyzed.xml                  NAC_1994_21.pdf.sent_NOU_1994_21.pdf.sent.xml
NAC_1994_21.pdf.sent.xml                      NOU_1994_21.pdf.sent.xml
--------------------------------
skolehistorie>ls nob | wc -l 
      28
skolehistorie>ls sme | wc -l 
      28
skolehistorie>ls parallel | wc -l 
      28
skolehistorie>ls analyzed | wc -l 
      28
--------------------------------
skolehistorie2>ls | wc -l 
       6
skolehistorie2>ls nob | wc -l 
      33
skolehistorie2>ls nob_comp | wc -l 
       2
skolehistorie2>ls sme | wc -l 
      33
skolehistorie2>ls sme_comp | wc -l 
      33
skolehistorie2>ls parallel | wc -l 
      33
skolehistorie2>ls analyze | wc -l 
      33

In skolehistorie2, some files are duplicated:

skolehistorie2>ls sme
aarseth2-s.html.sent.xml   inge-s.html.sent.xml       nordby-s.html.sent.xml
algu2-s.html.sent.xml      ingunn-s.html.sent.xml     pave-s.html.sent.xml

vs.

skolehistorie2>ls sme_comp 
aarseth2_s.html.sent.xml   inge_s.html.sent.xml       nordby_s.html.sent.xml
algu2_s.html.sent.xml      ingunn_s.html.sent.xml     pave_s.html.sent.xml

However, this happens only on the sme side. I checked the LANG_comp
directories; they should be ignored.
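
The doubled names differ only in "-" vs. "_", so they can be paired mechanically. A small sketch, assuming that this is the only systematic difference between the two directories (the function names are made up for illustration):

```python
def normalize(name):
    """Treat '-' and '_' as the same character when comparing file names."""
    return name.replace("-", "_")


def doubled_names(names_a, names_b):
    """Return (a, b) pairs whose names are equal after normalization."""
    index = {normalize(b): b for b in names_b}
    return [(a, index[normalize(a)]) for a in names_a if normalize(a) in index]
```

For the listing above, doubled_names would pair aarseth2-s.html.sent.xml with aarseth2_s.html.sent.xml, and so on.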

======================================
Conclusion of the Oslo data inventory:
======================================

There is NO sme data without a counterpart on the nob side!

Question to Trond: What about our plans to send ALL sme data?

====================================================================================

4. Inventory of parallel files using the command/scripts on our site

https://giellalt.uit.no/ling/corpus_analyze.html

Using the command 
corpus_christi>./../svnredone/gt/script/corpus-parallel.pl --list --lang=sme --dir=/usr/local/share/corp/bound/sme > sme-nob_parallel.txt

and assuming that the script works correctly with this option, we get the following result:

corpus_christi>wc -l sme-nob_parallel.txt 
82 sme-nob_parallel.txt

Note that this is seen only from the sme side. However, grep found 95 declared parallel nob files:

corpus_christi>grep -r "parallel_text" bound/sme | grep "xml:lang=\"nob\"" | wc -l
95

====================================================================================

5. Action points:

 5.1 find all parallel texts
  - ongoing work (only) on the free corpus for sme and nob,
    assuming that the directory structures of converted/sme and
    converted/nob are exactly parallel; working with the version
     freecorpus>svn up 
     At revision 845.

   - unreliable metadata as follows:
     a. parallel file declared in the metadata, but the parallel file does not exist
        Ex.:
less converted/sme/admin/depts/NAC_1994_21.pdf.xml
    <origFileName>NAC_1994_24.pdf</origFileName>
    <parallel_text location="NOU_1994_21.pdf" xml:lang="nob"/>
find . -name "NAC_1994*"
./orig/sme/admin/depts/NAC_1994_21.pdf
./orig/sme/admin/depts/NAC_1994_21.pdf.xsl
freecorpus>

freecorpus>find . -name "NOU_1994*"
freecorpus>

freecorpus>ll converted/nob/admin/depts/
total 848
-rw-rw----  1 cipriangerstenberger  staff  205090 17 sep 08:15 HP_2009_samisk_sprak_norsk.pdf.xml
-rw-rw----  1 cipriangerstenberger  staff  191930 17 sep 08:15 STM_TS007.pdf.xml
-rw-rw----  1 cipriangerstenberger  staff   29129 17 sep 08:15 Tid_for_samtale_bm_nett.pdf.xml
     
     b. parallel file declared in one file but not in the other
        Ex.:
        declared in
         converted/sme/admin/depts/STM_TS007SA.pdf.xml
        but not in
         converted/nob/admin/depts/STM_TS007.pdf.xml 

     c. parallel file declared in both xml files, but with errors in one of them (corrected by Ciprian)
       Ex.: 
       in converted/sme/admin/sd/Duoji_doaibmadoarjagiid_árvvoštallan_2005-2009.pdf.xml
    <parallel_text location="Evaluering_av_driftstilskuddsordningen_for_duodji_2005-2009.pdf" xml:lang="nob"/>
     in converted/nob/admin/sd/Evaluering_av_driftstilskuddsordningen_for_duodji_2005-2009.pdf.xml
    <parallel_text location="Duoji_doaibmadoarjagiid_árvvoštallan_2005-2009.pdf.xsl" xml:lang="sme"/>
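
The three cases a, b and c can be checked mechanically once the declarations are collected. A minimal sketch, assuming two dictionaries (one per language) that map each original file name to the location declared in its parallel_text element, or None if nothing is declared; building these dictionaries from the headers is left out, and the function name is hypothetical:

```python
def check_parallel_metadata(decl_a, decl_b):
    """Return (file, case) pairs for unreliable declarations made on side A.

    decl_a / decl_b: {original file name: declared parallel location or None}
    case "a": the declared parallel file does not exist on side B
    case "b": side B exists but declares no parallel file
    case "c": side B declares something other than the side-A file
    """
    problems = []
    for name, target in decl_a.items():
        if target is None:
            continue  # nothing declared, nothing to check
        if target not in decl_b:
            problems.append((name, "a"))
        elif decl_b[target] is None:
            problems.append((name, "b"))
        elif decl_b[target] != name:
            problems.append((name, "c"))
    return problems
```

The check is one-directional, so it would have to be run twice (sme against nob, then nob against sme). A wrong location such as the .xsl one above shows up as case c from one side and as case a from the other.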

 5.2 preprocess them
 5.3 find the rest of sme XML texts
 5.4 preprocess them
 5.5 check the analysis pipeline while waiting for the final version of Riddu-Riddu data and of the tools

====================================================================================

6. Proposals for improvements after a (not that close) look at the corpus data
 
 6.1 since both file content and file names in the bound (hence also in the free) directory 
     are subject to change, we might as well use a more systematic naming of the files;
     the original file names can be stored in a header element; then names like these
     could be avoided:

LÃ¦remiddelbruk_i_tosprÃ¥klig_opplÃ¦ring.pdf.xml
file:⁄⁄⁄home⁄boerre⁄Dokumenter⁄corpus⁄per-eric-kuoljok-2009-05-19⁄OrdlistaFaktabladSOU.doc.xml

 6.2 better organization and naming of the file structure: 
     - no mixing of files AND directories in one and the same directory 
       (have a look for instance at /usr/local/share/corp/bound/sme/facta)

A better example is actually MinAigi: in the 1999 directory, there are both directories for all months
and 732 "unsorted" files at the same level.
1999>pwd
/usr/local/share/corp/bound/sme/news/MinAigi/1999
1999>ll | grep "^d" | wc -l
12
1999>ll | grep -v "^d" | wc -l 
732
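
To find all directories that mix files and subdirectories, the same check can be run over the whole tree. A small sketch using only os.walk; the function name is made up for illustration. (Assuming ll is an alias for ls -l, the grep -v count above also includes the "total" line, so the number of plain files may be one lower than shown.)

```python
import os


def mixed_directories(root):
    """Return sorted (path, n_subdirs, n_files) for directories holding both."""
    mixed = []
    for dirpath, dirnames, filenames in os.walk(root):
        if dirnames and filenames:
            mixed.append((dirpath, len(dirnames), len(filenames)))
    return sorted(mixed)
```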

We should agree on better naming conventions: here there is quite a lot of redundancy (see pwd above)
 MA --> we are in MinAigi already
 99 --> we are in 1999 already

1999>ll | grep "^d" 
drwxrwx--- 2 root bound  4096 des 21  2006 MA01_99
drwxrwx--- 2 root bound  4096 des 21  2006 MA02_99
drwxrwx--- 2 root bound  4096 des 21  2006 MA03_99
drwxrwx--- 2 root bound  4096 des 21  2006 MA04_99
drwxrwx--- 2 root bound  4096 des 21  2006 MA05_99
drwxrwx--- 2 root bound  4096 des 21  2006 MA06_99
drwxrwx--- 2 root bound  4096 des 21  2006 MA07_99
drwxrwx--- 2 root bound  4096 des 21  2006 MA08_99
drwxrwx--- 2 root bound  4096 des 21  2006 MA09_99
drwxrwx--- 2 root bound  4096 des 21  2006 MA10_99
drwxrwx--- 2 root bound  4096 des 21  2006 MA11_99
drwxrwx--- 2 root bound  4096 des 21  2006 MA12_99

Not to mention that there is another minaigi directory at the same level as MinAigi:
news>ls
Assu  MinAigi  minaigi.no  NRK  other  YLE
What is the difference? Was one obtained directly from the newspaper and the other taken from the net?

Answer from Børre: MinAigi was one of the directories that were there in
the beginning. The directory names below that one are the original ones we
received from Ávvir (which inherited the files from Min Áigi and
Áššu). I later added minaigi.no. Files inside that directory are fetched
from the net.

     - better conceptualization: why is finnmarksloven in facta when there is a laws directory?
     -- (Børre again) At the time it seemed like a good idea, there were
        many files from that domain that belonged together.
 6.3 proper check of the content of the collected data BEFORE XML transformation:
     this has been done sporadically, but the content of a whole range of data is still unchecked:

     for instance,

     /usr/local/share/corp/bound/sme/facta/LÃ¦remiddelbruk_i_tosprÃ¥klig_opplÃ¦ring.pdf.xml

     with the following content:

<?xml version="1.0" encoding="UTF-8"?><document>
  <header/>
  <body/>
</document>

    of the 174 files in bound/sme/facta/finnmarksloven, only 14 actually contain data,
    and even this check was a rather superficial one:

finnmarksloven>ll 
totalt 1552
-rw-rw---- 1 root bound   692 mar 22 07:54 arkiv16a8.html.xml
-rw-rw---- 1 root bound   695 mar 22 07:54 artikkel00ed.html.xml
-rw-rw---- 1 root bound   695 mar 22 07:57 artikkel015c.html.xml
-rw-rw---- 1 root bound   695 mar 22 07:55 artikkel030b.html.xml
-rw-rw---- 1 root bound   695 mar 22 07:59 artikkel0958.html.xml
-rw-rw---- 1 root bound   695 mar 22 07:55 artikkel0caa.html.xml
-rw-rw---- 1 root bound   695 mar 22 07:52 artikkel0d8f.html.xml
-rw-rw---- 1 root bound   695 mar 22 07:56 artikkel0ff3.html.xml
-rw-rw---- 1 root bound   695 mar 22 07:55 artikkel1053.html.xml
-rw-rw---- 1 root bound   695 mar 22 07:54 artikkel1454.html.xml
-rw-rw---- 1 root bound   695 mar 22 07:55 artikkel1501.html.xml
-rw-rw---- 1 root bound  9086 mar 22 07:53 artikkel1627.html.xml

   => almost identical file sizes (about 695 bytes)!
   The parallel nob directory seems to contain only
   header-only xml files!

   NB: Some files in /usr/local/share/corp/broken seem to be much more
       useful contentwise than the header-only xml files in the bound directory,
       see, for instance, Salmmat-_garvasat_0203.doc.xml there.
       
  Conclusion: Neither a successful XML transformation nor a failed one tells us
              anything about how useful the file content is as part of a text corpus.
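
A less superficial check than looking at file sizes would be to test whether the body contains any text at all. A minimal sketch, assuming the corpus format shown above (a document root element with a direct body child); the function name is hypothetical:

```python
import xml.etree.ElementTree as ET


def has_text_content(path):
    """Return True if <body> contains any non-whitespace text."""
    try:
        body = ET.parse(str(path)).getroot().find("body")
    except ET.ParseError:
        return False  # assumption: unparsable files count as having no content
    if body is None:
        return False
    return any(chunk.strip() for chunk in body.itertext())
```

This would report False both for the <body/> case above and for a body that contains only empty elements.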

 6.4 a quick language check (irrespective of what the language model tools guessed)
     would ensure that each file is placed in the correct language directory.
     Example:
     In /usr/local/share/corp/bound/nob/laws the only file with corpus data
     (Lov_om_psykisk.sme.doc.xml) is a sme-only file.
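
A very crude version of such a check could simply count language-specific letters: á, č, đ, ŋ, š, ŧ, ž for sme and æ, ø, å for nob. This is only a heuristic sketch, not what the language model tools do, and it only separates these two languages; the function name is hypothetical:

```python
import unicodedata

SME_LETTERS = set("áčđŋšŧž")
NOB_LETTERS = set("æøå")


def guess_language(text):
    """Return "sme", "nob", or None when the letter counts give no decision."""
    # Normalize so that letters typed as base + combining accent are counted too.
    text = unicodedata.normalize("NFC", text).lower()
    sme = sum(1 for ch in text if ch in SME_LETTERS)
    nob = sum(1 for ch in text if ch in NOB_LETTERS)
    if sme > nob:
        return "sme"
    if nob > sme:
        return "nob"
    return None
```

Files where the guess disagrees with the directory name could then be listed for manual inspection.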
 
 6.5 with the new svn repository for the corpus data, files like

     /usr/local/share/corp/bound/sme/bible/nt/north_sami_html.html.xml

    shouldn't exist any longer (by the way, its content would have been better stored in a README file;
    I assume that XML corpus files store corpus data, not just metadata)

 6.6 a proper check for duplicate texts is required:
   Example:
/usr/local/share/corp/bound/nob/news/MinAigi/2003/olsenbanden
olsenbanden>ll
totalt 136
-rw-rw---- 1 root bound 59618 mar 17 23:01 01dialogmanus.DOC-1.doc.xml
-rw-rw---- 1 root bound 59616 mar 17 23:01 01dialogmanus.DOC.doc.xml
olsenbanden>diff 01dialogmanus.DOC.doc.xml 01dialogmanus.DOC-1.doc.xml 
3c3
< <document id="nob/news/MinAigi/2003/olsenbanden/01dialogmanus.DOC.doc" xml:lang="nob">
---
> <document id="nob/news/MinAigi/2003/olsenbanden/01dialogmanus.DOC-1.doc" xml:lang="nob">
18c18
<     <version>XSLtemplate  1.13 ; file-specific xsl  $Revision: 1.4 $; common.xsl   1.25 ; convert2xml   1.119 ; add_hyph_tags   1.15 ; docbook2corpus2   1.19 ; xhtml2corpus   1.13 ; </version>
---
>     <version>XSLtemplate  1.13 ; file-specific xsl  $Revision: 1.2 $; common.xsl   1.25 ; convert2xml   1.119 ; add_hyph_tags   1.15 ; docbook2corpus2   1.19 ; xhtml2corpus   1.13 ; </version>
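
Since the two files differ only in the document id and the version line, duplicates like these can be found by comparing the files with those lines left out. A minimal sketch, assuming that the <document ...> and <version> elements each start on their own line as in the diff above; the function names are made up for illustration:

```python
import hashlib
import re
from collections import defaultdict

# Lines that differ between otherwise identical conversions (see the diff above).
VOLATILE = re.compile(r"^\s*<(document\b|version>)")


def content_fingerprint(text):
    """Hash the file content, ignoring the <document ...> and <version> lines."""
    kept = [line for line in text.splitlines() if not VOLATILE.match(line)]
    return hashlib.sha1("\n".join(kept).encode("utf-8")).hexdigest()


def find_duplicates(texts):
    """texts: {file name: file content}; return groups of names with equal content."""
    groups = defaultdict(list)
    for name, text in texts.items():
        groups[content_fingerprint(text)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```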

7. Further notable issues:
 - I wanted to run a test comparing Oslo's data with the data generated now
 - randomly chosen file pair: hans_s.html.xml/hans_n.html.xml
 - observations:
   7.1 while the Oslo files DO have content, only the sme file has content in our corpus
   7.2 for the nob file, neither the bound nor the free version has content in our corpus (I assumed that at least one would be ok)

/usr/local/share/corp/free/nob/facta/hans_n.html.xml
/usr/local/share/corp/bound/nob/facta/hans_n.html.xml
  <body>
    <p type="title"/>
  </body>

  7.3 a similarly named file, /usr/local/share/corp/free/nob/facta/hans-n.html.xml, HAS content!
  7.4 there is no counterpart to this file in the bound directory!

  7.5 for a detailed comparison of a randomly chosen file pair between Oslo (last version) and
      Tromsø (generated now), see the compare_oslo-tromsoe dir