!!!Corpus meeting 11.11.11. :-) !!Agenda * Status quo, * Evaluation * Next steps * More intelligent suggestions, anyone? !!Status quo * First conversion ready wednesday. * Second conversion underway. Improvements in sme and nob abbr files, and in anchor file. Made a style sheet to convert tmx files to html. Many errors in the tmx file are caused by bad abbr. handling in the nob files On time use: half the time goes to add sentences to the file, the other half to parallelise them Throughput on xserve yesterday: 25k bytes per minute Possible test: Check if the size of the anchor file affects the speed of tca2 cd prestable/tmx/smenob/ for file in *.tmx do xsltproc $GTHOME/gt/script/corpus/tmx2html.xsl $file > $file.html done Here it seems the texts are different: prestable/tmx/smenob/vuollasa-asahusat.html_id=115192.tmx.html !!Filenames and directory structure Root tmx dir: {{{ prestable/tmx/SOURCELANG2TARGETLANG/ }}} Below this point we follow the directory structure found elsewhere, ie {{GENRE/subdirs/file.tmx}}. This should give us: {{{ prestable/tmx/nob2sme/admin/depts/regjeringen.no/xxx.tmx }}} * Include bullet as sentence border??? __TODO__ * Check a corpus against different-size anchor files, and measure time use. (__Børre__) * Add more words to anchor.txt, and possibly modularise / remove some files (__Trond, Børre__) ** Common anchor file: {{anchor.txt}} ** Genre-specific files: {{anchor_admin.txt, anchor_bible.txt, …}} * Harmonise the nob and sme abbr.txt files. ** Testcase: STM55, * Improve encoding: Search for black question marks in the whole corpus (__Børre__) ** 13 files contain the black marks (see below). Black question mark files: {{{ $ grep -lr '�' * | grep -v '\.svn' nob/admin/depts/other_files/OTP200620070025000SE_12.html.xml nob/admin/depts/regjeringen.no/7-narmere-om-planbestemmelsene.html_id=571096.xml sme/admin/depts/other_files/STM_TS007SA.pdf.xml sme/admin/depts/regjeringen.no/10.html_id=458508.xml sme/admin/depts/regjeringen.no/2011--rievdadeami-aigi-afghanistanas.html_id=604390.xml sme/admin/depts/regjeringen.no/7.html_id=458471.xml sme/admin/depts/regjeringen.no/aigeguovdil.html_id=1150.xml sme/admin/depts/regjeringen.no/bismagodderait.html_id=449030.xml sme/admin/depts/regjeringen.no/historihkka.html_id=861.xml sme/admin/depts/regjeringen.no/horingsbrev.html_id=499754.xml sme/admin/depts/regjeringen.no/raehus-rahkadahtta-samegiela-doaibmaplan.html_id=514922.xml sme/admin/depts/regjeringen.no/sami.html_id=615757.xml smj/admin/depts/other_files/HP_2009_samisk_sprak_lulesam.pdf.xml }}} !!Evaluation !Conversion ŋ not converted: samediggi-article-3002.html.tmx.html: Son gii biddjo virgái ferte hálddašit davviriikkalaš giela , sámegiela ja e ? gelasgiela . Personen som blir ansatt må beherske skandinavisk språk , samisk og engelsk . mnd. is not sentence final: {{{ Forøvrig tilsettes arbeidstakere etter gjeldende lover , reglement og overenskomster , herunder lønn og pensjon , samt 6 mnd . prøvetid . }}} Capital letter in names divides sentence: (boerre: I think the sentence division comes from the .) {{{ Mun doaivvun strategiija maid mii plánet váikkuha ahte bargu ollislaš ja dássásaš bálvalusain sámi álbmoga váste šaddá álkit ja beaktileappot , dadjá várrepresideanta Ragnhild L . Nystad . Jeg håper strategien vi legger opp til vil bidra til at arbeidet med å oppnå helhetlige og likeverdige tjenester til det samiske folket vil bli lettere og mer effektivt , sier visepresident Ragnhild L . Nystad . }}} !The 1-0 issue There are two types of 1-0 cases: # The sentence is __missing__ in the other language # The 1-0 status reveals an alignment __error__ (the match is in the neighbour pair) Trond's impressionistic feeling: (1) is the overwhelmingly most common one. !!Next meeting Middle of next week.