Here is a status update on the Oslo corpus update work:
1. Decisions:
1.1 use only the files that are already in XML format (i.e., the old repository)
(agreed with Trond; however, Lene also wants Riddu-Riddu and Skolehistorie included, pending the tools' update)
1.2 do not include tables and lists in the processing:
obviously, this was NOT the case in the last version of the data,
which means that we shall keep the tables and lists as they are
1.3 wait for Lene & Thomas to update the automata
1.4 check the parallelism of files using the info from the header
1.5 test the pipeline (it was quite wet here in Tromsø, the pipeline got a bit rusty)
====================================================================================
2. Inventory of the current bound corpus repository in XML format
2.1 All files in XML format:
corpus_christi>wc -l *invent*
481 nob_b_inventory.txt
23263 sme_b_inventory.txt
2.2 All files with parallel text and language info:
corpus_christi>grep -r "parallel_text" bound/sme | grep "xml:lang=\"nob\"" | wc -l
95
corpus_christi>grep -r "parallel_text" bound/nob | grep "xml:lang=\"sme\"" | wc -l
24
To me, this is quite fishy... I was expecting to have the same number on each side.
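Part of the mismatch may be a counting artefact: `grep ... | wc -l` counts matching lines, not files, so a file that declares several parallel texts is counted more than once. The following is a minimal sketch for checking this, assuming (as in the grep above) that a declaration is any line containing both `parallel_text` and `xml:lang="<lang>"`. The function name is hypothetical and not part of our scripts.

```python
import os
import re


def count_parallel_declarations(root, target_lang):
    """Return (matching_lines, matching_files) for all files under root.

    A line matches when it contains both 'parallel_text' and
    xml:lang="<target_lang>", mirroring the two chained greps.
    """
    lang_pattern = re.compile(r'xml:lang="%s"' % re.escape(target_lang))
    line_count = 0
    file_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            hits = 0
            with open(path, encoding="utf-8", errors="replace") as handle:
                for line in handle:
                    if "parallel_text" in line and lang_pattern.search(line):
                        hits += 1
            line_count += hits
            if hits:
                file_count += 1
    return line_count, file_count
```

Running it on bound/sme with "nob" and on bound/nob with "sme" and comparing the two file counts would show how much of the 95 vs. 24 difference remains at file level.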
====================================================================================
3. Inventory of the Oslo data (thanks to Trond for getting it from Oslo)
--------------------------------
1999_2000>ls nob | wc -l
7
1999_2000>ls sme | wc -l
7
1999_2000>ls parallel | wc -l
7
1999_2000>ls analyzed | wc -l
7
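The four counts above can also be compared automatically for every collection. This is a small sketch, assuming each collection directory contains the subdirectories nob, sme, parallel and analyzed as in 1999_2000; the function names are hypothetical.

```python
import os


def subdir_counts(collection_dir, subdirs=("nob", "sme", "parallel", "analyzed")):
    """Map each subdirectory to its number of visible entries (like ls | wc -l)."""
    counts = {}
    for sub in subdirs:
        entries = os.listdir(os.path.join(collection_dir, sub))
        # ls hides dot files by default, so they are skipped here as well
        counts[sub] = len([name for name in entries if not name.startswith(".")])
    return counts


def counts_agree(counts):
    """True when all subdirectories hold the same number of entries."""
    return len(set(counts.values())) <= 1
```

For skolehistorie2 the tuple of subdirectory names would have to be adapted, since that collection uses other names (for instance analyze instead of analyzed).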
--------------------------------
1 file of each type:
NAC_2001_35>ls
NAC_2001_35.pdf.analyzed.xml NAC_2001_35.pdf.sent_NOU_2001_35.pdf.sent.xml
NAC_2001_35.pdf.sent.xml NOU_2001_35.pdf.sent.xml
--------------------------------
1 file of each type:
STM_TS007SA>ls
STM_TS007.pdf.sent.xml STM_TS007SA.pdf.sent.xml
STM_TS007SA.pdf.analyzed.xml STM_TS007SA.pdf.sent_STM_TS007.pdf.sent.xml
--------------------------------
1 file of each type:
bible>ls
01GENNBST.bible.sent.xml 1Mos_09_01.bible.sent.xml
1Mos_09_01.bible.analyzed.xml 1Mos_09_01.bible.sent_01GENNBST.bible.sent.xml
--------------------------------
1 file of each type:
nac>ls
NAC_1994_21.pdf.analyzed.xml NAC_1994_21.pdf.sent_NOU_1994_21.pdf.sent.xml
NAC_1994_21.pdf.sent.xml NOU_1994_21.pdf.sent.xml
--------------------------------
skolehistorie>ls nob | wc -l
28
skolehistorie>ls sme | wc -l
28
skolehistorie>ls parallel | wc -l
28
skolehistorie>ls analyzed | wc -l
28
--------------------------------
skolehistorie2>ls | wc -l
6
skolehistorie2>ls nob | wc -l
33
skolehistorie2>ls nob_comp | wc -l
2
skolehistorie2>ls sme | wc -l
33
skolehistorie2>ls sme_comp | wc -l
33
skolehistorie2>ls parallel | wc -l
33
skolehistorie2>ls analyze | wc -l
33
In skolehistorie2, some files are duplicated:
skolehistorie2>ls sme
aarseth2-s.html.sent.xml inge-s.html.sent.xml nordby-s.html.sent.xml
algu2-s.html.sent.xml ingunn-s.html.sent.xml pave-s.html.sent.xml
vs.
skolehistorie2>ls sme_comp
aarseth2_s.html.sent.xml inge_s.html.sent.xml nordby_s.html.sent.xml
algu2_s.html.sent.xml ingunn_s.html.sent.xml pave_s.html.sent.xml
However, this is the case only on the sme side. I checked the LANG_comp
directories; they should be ignored.
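Name pairs of this kind, which differ only in "-" vs. "_", can be found mechanically. This is a minimal sketch that works on two lists of file names (for instance the output of `ls sme` and `ls sme_comp`); the function names are hypothetical.

```python
def normalize_name(filename):
    """Treat '-' and '_' as the same character for comparison."""
    return filename.replace("-", "_")


def find_name_variants(names_a, names_b):
    """Pair names from the two listings that differ only in '-' vs '_'."""
    index = {normalize_name(name): name for name in names_b}
    pairs = []
    for name in sorted(names_a):
        other = index.get(normalize_name(name))
        # identical names are not variants, so they are skipped
        if other is not None and other != name:
            pairs.append((name, other))
    return pairs
```

The sketch only compares names; whether the paired files also have the same content would still have to be checked separately.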
======================================
Conclusion of the Oslo data inventory:
======================================
There is NO sme data without a counterpart on the nob side!
Question to Trond: What about our plans to send ALL sme data?
====================================================================================
4. Inventory of parallel files using the command/scripts on our site
https://giellalt.uit.no/ling/corpus_analyze.html
Using the command
corpus_christi>./../svnredone/gt/script/corpus-parallel.pl --list --lang=sme --dir=/usr/local/share/corp/bound/sme > sme-nob_parallel.txt
and assuming that the script works correctly with this option, we get the following result:
corpus_christi>wc -l sme-nob_parallel.txt
82 sme-nob_parallel.txt
Note that this count is taken from the sme side only. However, the grep found 95 possibly parallel nob files:
corpus_christi>grep -r "parallel_text" bound/sme | grep "xml:lang=\"nob\"" | wc -l
95
====================================================================================
5. Action points:
5.1 find all parallel texts
- ongoing work (only) on the free corpus for sme and nob:
Assuming that the directory structures of converted/sme and converted/nob
are exactly parallel; working with the version
freecorpus>svn up
At revision 845.
- unreliable metadata as follows:
a. a parallel file is declared in the metadata but does not exist
Ex.:
less converted/sme/admin/depts/NAC_1994_21.pdf.xml
NAC_1994_24.pdf
find . -name "NAC_1994*"
./orig/sme/admin/depts/NAC_1994_21.pdf
./orig/sme/admin/depts/NAC_1994_21.pdf.xsl
freecorpus>
freecorpus>find . -name "NOU_1994*"
freecorpus>
freecorpus>ll converted/nob/admin/depts/
total 848
-rw-rw---- 1 cipriangerstenberger staff 205090 17 sep 08:15 HP_2009_samisk_sprak_norsk.pdf.xml
-rw-rw---- 1 cipriangerstenberger staff 191930 17 sep 08:15 STM_TS007.pdf.xml
-rw-rw---- 1 cipriangerstenberger staff 29129 17 sep 08:15 Tid_for_samtale_bm_nett.pdf.xml
b. a parallel file is declared in one file but not in the other
Ex.:
declared in
converted/sme/admin/depts/STM_TS007SA.pdf.xml
but not in
converted/nob/admin/depts/STM_TS007.pdf.xml
c. a parallel file is declared in both XML files, but with errors in one of them (corrected by Ciprian)
Ex.:
in converted/sme/admin/sd/Duoji_doaibmadoarjagiid_árvvoštallan_2005-2009.pdf.xml
in converted/nob/admin/sd/Evaluering_av_driftstilskuddsordningen_for_duodji_2005-2009.pdf.xml
5.2 preprocess them
5.3 find the rest of sme XML texts
5.4 preprocess them
5.5 check the analysis pipeline while waiting for the final version of Riddu-Riddu data and of the tools
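The kinds of unreliable metadata listed under 5.1 can be told apart mechanically once the declarations have been read from the headers. This is a minimal sketch that assumes the declarations are already available as a dictionary (file -> declared parallel file) and that the set of existing files is known. How the declaration is extracted from the header is deliberately left out, and all names are hypothetical.

```python
def classify_parallel_metadata(declared, existing):
    """Sort declarations into problem cases a and b, plus consistent pairs.

    declared: dict mapping a file to the parallel file named in its header
    existing: set of files actually present in the repository
    """
    missing = []     # case a: the declared counterpart does not exist
    one_sided = []   # case b: the counterpart exists but does not point back
    consistent = []  # both files declare each other
    for source, target in sorted(declared.items()):
        if target not in existing:
            missing.append((source, target))
        elif declared.get(target) != source:
            one_sided.append((source, target))
        else:
            consistent.append((source, target))
    return {"missing": missing, "one_sided": one_sided, "consistent": consistent}
```

A declaration containing an error (case c) would most likely show up as "missing" here, since the misspelled counterpart cannot be found.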
====================================================================================
6. Proposals for improvements after a (not that close) look at the corpus data
6.1 since both file contents and file names in the bound (and hence also in the free) directory
are subject to change, we might as well name the files more systematically;
the original file names can be stored in a header element. Names like these could then
be avoided:
Læremiddelbruk_i_tospråklig_opplæring.pdf.xml
file:⁄⁄⁄home⁄boerre⁄Dokumenter⁄corpus⁄per-eric-kuoljok-2009-05-19⁄OrdlistaFaktabladSOU.doc.xml
6.2 better organization and naming of the file structure:
- no mixing of files AND directories in one and the same directory
(have a look for instance at /usr/local/share/corp/bound/sme/facta)
A better example is actually MinAigi: in the 1999 dir, there are both directories for all months
and 732 "unsorted" files on the same level.
1999>pwd
/usr/local/share/corp/bound/sme/news/MinAigi/1999
1999>ll | grep "^d" | wc -l
12
1999>ll | grep -v "^d" | wc -l
732
We should agree on better naming conventions: there is quite a lot of redundancy here (see pwd above)
MA --> we are in MinAigi already
99 --> we are in 1999 already
1999>ll | grep "^d"
drwxrwx--- 2 root bound 4096 des 21 2006 MA01_99
drwxrwx--- 2 root bound 4096 des 21 2006 MA02_99
drwxrwx--- 2 root bound 4096 des 21 2006 MA03_99
drwxrwx--- 2 root bound 4096 des 21 2006 MA04_99
drwxrwx--- 2 root bound 4096 des 21 2006 MA05_99
drwxrwx--- 2 root bound 4096 des 21 2006 MA06_99
drwxrwx--- 2 root bound 4096 des 21 2006 MA07_99
drwxrwx--- 2 root bound 4096 des 21 2006 MA08_99
drwxrwx--- 2 root bound 4096 des 21 2006 MA09_99
drwxrwx--- 2 root bound 4096 des 21 2006 MA10_99
drwxrwx--- 2 root bound 4096 des 21 2006 MA11_99
drwxrwx--- 2 root bound 4096 des 21 2006 MA12_99
Not to mention that there is another minaigi dir on the same level as MinAigi
news>ls
Assu MinAigi minaigi.no NRK other YLE
What is the difference? Was one obtained directly from the newspaper and the other taken from the net?
Answer from Børre: MinAigi was one of the directories that were there in
the beginning. The directory names below that one are the original ones we
received from Ávvir (which inherited the files from Min Áigi and
Áššu). I later added minaigi.no. Files inside that directory are fetched
from the net.
- better conceptualization: why is finnmarksloven in facta when there is a laws directory?
-- (Børre again) At the time it seemed like a good idea, there were
many files from that domain that belonged together.
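Directories that mix files and subdirectories, as in the MinAigi/1999 case above, can be listed for the whole corpus tree. This is a minimal sketch and the function name is hypothetical. Note that if `ll` is an alias for `ls -l`, its `total` line is also counted by `grep -v "^d" | wc -l`, so the real number of files may be one lower than the count shown.

```python
import os


def find_mixed_directories(root):
    """Yield (path, n_dirs, n_files) for directories holding both kinds."""
    for dirpath, dirnames, filenames in os.walk(root):
        if dirnames and filenames:
            yield dirpath, len(dirnames), len(filenames)
```

Looping over `find_mixed_directories("/usr/local/share/corp/bound/sme")` and printing each tuple would give an overview of where the structure needs tidying.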
6.3 proper check of the content of the collected data BEFORE XML transformation:
this has been done sporadically, but a large part of the data has never been checked for content:
for instance,
/usr/local/share/corp/bound/sme/facta/Læremiddelbruk_i_tospråklig_opplæring.pdf.xml
with the following content:
of the 174 files in bound/sme/facta/finnmarksloven, only 14 actually contain data,
and this check was quite a superficial one:
finnmarksloven>ll
totalt 1552
-rw-rw---- 1 root bound 692 mar 22 07:54 arkiv16a8.html.xml
-rw-rw---- 1 root bound 695 mar 22 07:54 artikkel00ed.html.xml
-rw-rw---- 1 root bound 695 mar 22 07:57 artikkel015c.html.xml
-rw-rw---- 1 root bound 695 mar 22 07:55 artikkel030b.html.xml
-rw-rw---- 1 root bound 695 mar 22 07:59 artikkel0958.html.xml
-rw-rw---- 1 root bound 695 mar 22 07:55 artikkel0caa.html.xml
-rw-rw---- 1 root bound 695 mar 22 07:52 artikkel0d8f.html.xml
-rw-rw---- 1 root bound 695 mar 22 07:56 artikkel0ff3.html.xml
-rw-rw---- 1 root bound 695 mar 22 07:55 artikkel1053.html.xml
-rw-rw---- 1 root bound 695 mar 22 07:54 artikkel1454.html.xml
-rw-rw---- 1 root bound 695 mar 22 07:55 artikkel1501.html.xml
-rw-rw---- 1 root bound 9086 mar 22 07:53 artikkel1627.html.xml
=> almost identical file sizes (around 695 bytes)!
The nob files in the parallel directory seem to be populated only with
header-only xml files!
NB: Some files in /usr/local/share/corp/broken seem to be much more
useful contentwise than the header-only xml files in the bound directory,
see, for instance, Salmmat-_garvasat_0203.doc.xml there.
Conclusion: Neither a successful XML transformation nor a failed one tells us
anything about the file content's usefulness as a text corpus part.
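The superficial size check can be repeated for any directory. This is a minimal sketch, assuming that files below a certain size are header-only. The threshold of 1000 bytes is an assumption derived from the listing above (695 vs. 9086 bytes), and the function name is hypothetical.

```python
import os


def split_by_size(directory, min_bytes=1000):
    """Split the .xml files in directory into (suspect, plausible) name lists.

    Files smaller than min_bytes are suspected to be header-only. The
    threshold is an assumption; a larger file is not thereby proven useful.
    """
    suspect, plausible = [], []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not (name.endswith(".xml") and os.path.isfile(path)):
            continue
        if os.path.getsize(path) < min_bytes:
            suspect.append(name)
        else:
            plausible.append(name)
    return suspect, plausible
```

This only narrows down the candidates; the files in the "plausible" list still need a look at their actual content.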
6.4 a quick language check (irrespective of what the language model tools guessed)
would ensure that each file is registered in the appropriate language directory.
Example:
In /usr/local/share/corp/bound/nob/laws the only file with corpus data
(Lov_om_psykisk.sme.doc.xml) is a sme-only file.
6.5 with the new svn repository for the corpus data, files like
/usr/local/share/corp/bound/sme/bible/nt/north_sami_html.html.xml
shouldn't exist any longer (by the way, this information would have been better stored in a README file;
I assume that XML corpus files store corpus data, not just metadata)
6.6 a proper check of the texts for duplicates is required:
Example:
/usr/local/share/corp/bound/nob/news/MinAigi/2003/olsenbanden
olsenbanden>ll
totalt 136
-rw-rw---- 1 root bound 59618 mar 17 23:01 01dialogmanus.DOC-1.doc.xml
-rw-rw---- 1 root bound 59616 mar 17 23:01 01dialogmanus.DOC.doc.xml
olsenbanden>diff 01dialogmanus.DOC.doc.xml 01dialogmanus.DOC-1.doc.xml
3c3
<
---
>
18c18
< XSLtemplate 1.13 ; file-specific xsl $Revision: 1.4 $; common.xsl 1.25 ; convert2xml 1.119 ; add_hyph_tags 1.15 ; docbook2corpus2 1.19 ; xhtml2corpus 1.13 ;
---
> XSLtemplate 1.13 ; file-specific xsl $Revision: 1.2 $; common.xsl 1.25 ; convert2xml 1.119 ; add_hyph_tags 1.15 ; docbook2corpus2 1.19 ; xhtml2corpus 1.13 ;
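Near-duplicates such as the two olsenbanden files, which differ in only a couple of lines, can be flagged by comparing files line by line. This is a minimal sketch using Python's standard difflib module. The threshold of 0.95 and the function names are assumptions.

```python
import difflib


def line_similarity(path_a, path_b):
    """Return a ratio in [0, 1] for how similar two files are, line by line."""
    with open(path_a, encoding="utf-8", errors="replace") as handle:
        lines_a = handle.readlines()
    with open(path_b, encoding="utf-8", errors="replace") as handle:
        lines_b = handle.readlines()
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    return matcher.ratio()


def is_near_duplicate(path_a, path_b, threshold=0.95):
    """True when the two files share at least the given ratio of lines."""
    return line_similarity(path_a, path_b) >= threshold
```

Comparing every pair of files in a large directory this way is slow, so the check is best restricted to files of similar size or with similar names.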
====================================================================================
7. Further notable issues:
- goal: run a test comparing Oslo's data with the data generated now
- randomly chosen file pair: hans_s.html.xml/hans_n.html.xml
- observations:
7.1 while the Oslo files DO have content, only the sme file has content in our corpus
7.2 neither the bound nor the free version of the nob file has content in our corpus (I assumed that at least one would be ok)
/usr/local/share/corp/free/nob/facta/hans_n.html.xml
/usr/local/share/corp/bound/nob/facta/hans_n.html.xml
7.3 a similarly named file, /usr/local/share/corp/free/nob/facta/hans-n.html.xml, HAS content!
7.4 there is no counterpart to this file in the bound directory!
7.5 for a detailed comparison of a randomly chosen file pair between Oslo (last version) and
Tromsø (generated now) see the compare_oslo-tromsoe dir