Dir for the in-house word alignment as a pre-step to the extraction of candidates for the FAD dictionary.
Input data: prestable/toktmx/nob2sme

1. extract text
   xsl2 -it main extract_all_sent_pairs.xsl inDir=nob2sme
2. lower-case text
3. giza:
   giza-pp/GIZA++-v2/plain2snt.out in_fad_nob in_fad_sme
   giza-pp/mkcls-v2/mkcls -pin_fad_nob -Vin_fad_nob.vcb.classes
   giza-pp/mkcls-v2/mkcls -pin_fad_sme -Vin_fad_sme.vcb.classes
   giza-pp/GIZA++-v2/GIZA++ -S in_fad_nob.vcb -T in_fad_sme.vcb -C in_fad_nob_in_fad_sme.snt -p0 0.98 -o dictionary >& dictionary.log

20120621 running giza:
GIZA++ stops at the following sentence pair (?):
86600 *Skjermingsreglene
86600 mun jahki 2009 stáhtabušeahtta árvalit nupp ástus mii váidudit riikkaidgaskasaš ekonomiija hedjonit cealkit stáhtaministtar Jens Stoltenberg

Sent No: 78643 , No. Occurrences: 1 0
842 86 27 38702 11 213 1123 108 1905 644 400 183 134 611 38702 2 1459 1235 40 19 25141 1062 12 1128 115 11 330 580 31460 1707 9962 3
32272 12901 3 115 53507 7953 9927

WARNING: The following sentence pair has source/target sentence length ration more than the maximum allowed limit for a source word fertility
source length = 1 target length = 15 ratio 15 ferility limit : 9
Shortening sentence

Sent No: 86600 , No. Occurrences: 1 0
56626
37 14 143 590 490 1419 11 6446 158 845 10634 453 917 1388 1135

ERROR: Execution of:
/Users/ciprian/local/bin/GIZA++ -CoocurrenceFile ./giza.sme-nob/sme-nob.cooc -c ./corpus/sme-nob-int-train.snt -m1 5 -m2 0 -m3 3 -m4 3 -model1dumpfrequency 1 -model4smoothfactor 0.4 -nodumps 1 -nsmooth 4 -o ./giza.sme-nob/sme-nob -onlyaldumps 1 -p0 0.999 -s ./corpus/nob.vcb -t ./corpus/sme.vcb
died with signal 11, without coredump

===== Hint about the show stopper: =====
http://comments.gmane.org/gmane.comp.nlp.moses.user/6494
"GIZA++ is giving you a segmentation fault, which doesn't usually happen. Maybe there's some odd characters in your data, or it's not sentence aligned properly? I notice that you're using quite an old version of Moses (the training script is now called train-model.perl) so it would be worth running with up-to-date Moses and GIZA++"
=====
I don't use an old version, but it is clear that these sentences are not aligned properly.
Ergo: more filtering needed!

Todo:
1. compare the input data with the input of the previous run - it is still on the server:
   http://www.divvun.no/static_files/NOB.SME.gt-sd_free-20101003.tmx.gz
2. clean the data, especially the long sentences with ///≥//|||\\\ @ ..., etc.
   ==> done by filtering out all tu-elements that have at least one tuv-element containing only non-alphanumeric characters or only numeric characters (see the sketch after this list)
   Example for possible preprocessing improvement:
   • • Váldosivva orru leamen ahte almmolaš suorggi bargiin váilu sámegiela máhttu .
   Samarbeidet mellom kommuner og fylkeskommunen skal forankre utbyggingen av bredbånd og fremme nettbaserte løsninger både for privat og offentlig sektor 7 . 1 b ) Mearrádusain mat soahpameahttunvuođalága mearrádusaid mielde lea guoddalusaid áhtun , sáhttá seamma láhkai guoddalit Alimusrievtti guoddaluslávdegoddái go guoddalusat eai leat áiddastuvvon dán lága mielde .
   Example for just-forget-it:
   1
   1
3. lemmatize
4. get Moses to work
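A minimal sketch of the tu filter from todo item 2 above, assuming a plain toktmx/TMX layout (tmx/body/tu/tuv/seg); the file names are placeholders, not the actual pipeline paths:

import xml.etree.ElementTree as ET

def is_junk(text):
    # "junk" as described above: no alphanumeric characters at all,
    # or nothing but digits among the alphanumeric characters
    alnum = [c for c in (text or '') if c.isalnum()]
    return not alnum or all(c.isdigit() for c in alnum)

def filter_tmx(in_file, out_file):
    # drop every <tu> that has at least one junk <seg>
    tree = ET.parse(in_file)
    body = tree.getroot().find('body')
    dropped = 0
    for tu in list(body.findall('tu')):
        if any(is_junk(seg.text) for seg in tu.iter('seg')):
            body.remove(tu)
            dropped += 1
    tree.write(out_file, encoding='utf-8', xml_declaration=True)
    return dropped

if __name__ == '__main__':
    print(filter_tmx('in_fad.toktmx', 'in_fad.filtered.toktmx'), 'tu elements dropped')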
----------
Debugging:
----------
1. apertium pipeline
   (a) line count using the apertium pipeline
       ==> almost cleaned data for the apertium pipeline; the last (?) culprit is a ’ at the beginning of the line
   (b) apostrophe > ’
       Rievdadus dáhpáhuvvá go šattat danin mii don leat , ii ge go geahččalat šaddat danin mii don it leat ’ ( sitáhta : Beisser )
   (c) stray § at the end of a line / at the beginning of the following line
       < Finnmárkkuopmodat sáhttá addit earrásiidda go gieldda dahje fylkka ássiide lasi beassama ávkkástallat ođasmuvvi valljodagain nugo namahuvvo §§ 22 ja 23 . §
       ---
       > Finnmárkkuopmodat sáhttá addit earrásiidda go gieldda dahje fylkka ássiide lasi beassama ávkkástallat ođasmuvvi valljodagain nugo namahuvvo §§ 22 ja 23 .
   (d) Cyrillic text
2. gt pipeline
   line count using the gt pipeline:
   xprocx_gtt>wc -l data.*
   152760 data.gt_tagged.pre_cleaned
   152771 data.sme

To answer Trond's question: "Why use the apertium pipeline at all and not just the gt one?"
Both are buggy, but with different bugs for sme!

apertium:
- in: 152771 lines
- out: 152711 lines

Aha, I remember now why numbering the lines is of little use (actually, I did it long ago): the numbers themselves cause the line shifting/collapsing (I even remember a chat with Trond on that topic). Here you are:

Input:
1 Preassadieđáhus , 07.05.2010
2 Nr. : 29/10
3 1 558,4 miljon ruvdno speallaruhta falástallanulbmiliidda 2010´s
4 Ráđđehus lea mearridan juohkit 1 558,4 miljovnna ruvnno speallanruđa falástallanulbmiliidda jahkái 2010 .

Output:
^1<@X>$ ^preassadieđáhus<@SUBJ→>$ ^,$ ^07.05.2010 2<@SUBJ→>$[ ] ^nr<@Num←>$ ^:$ ^29<@ADVL→>$\/^10 3 1 558,4<@→Num>$[ ] ^miljovdna<@→N>$ ^ruvdno<@SUBJ→>$ ^spealat+ruhta<@ADVL→>$ ^falástallat+ulbmil<@ADVL→>$ ^2010<@OBJ→>$´^s<@SUBJ→>$[ ]^4<@N←>$ ^ráđđehus<@SUBJ→>$ ^leat<@+FAUXV>$ ^mearridit<@-FMAINV>$ ^juohkit<@←OBJ>$ ^1 558,4<@→Num>$ ^miljovdna<@→N>$ ^ruvdno<@Num←>$ ^speallat+ruhta<@-F←OBJ>$ ^falástallat+ulbmil<@←ADVL>$ ^jahki<@←ADVL>$ ^2010<@N←>$ ^.$[

Bug of the apertium pipeline for sme: if a line containing only numbers immediately follows a line ending with a number, it is interpreted as part of the number at the end of the previous line (a small check for such line pairs is sketched below).

Line 1: The house number is 234
Line 2: 56789
Line 3: is the post code.
==becomes==>
Line 1: The house number is 234 56789
Line 2: is the post code.
=============
gt:
- in: 152771 lines
- out: 152760 lines
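A small ad-hoc check (not part of either pipeline) for the line pairs that trigger the bug described above, i.e. a line ending in a digit followed by a numbers-only line; the file name is a placeholder:

import re
import sys

# a line consisting only of digits, whitespace and number punctuation
NUM_ONLY = re.compile(r"^[\d\s.,/´'-]+$")

def risky_pairs(path):
    # yield (line number, line, following line) for every pair at risk of
    # being collapsed by the apertium pipeline
    with open(path, encoding='utf-8') as f:
        lines = [l.rstrip('\n') for l in f]
    for i in range(len(lines) - 1):
        if lines[i].rstrip().endswith(tuple('0123456789')) and NUM_ONLY.match(lines[i + 1]):
            yield i + 1, lines[i], lines[i + 1]

if __name__ == '__main__':
    for lineno, a, b in risky_pairs(sys.argv[1] if len(sys.argv) > 1 else 'data.sme'):
        print('line %d: %r + %r' % (lineno, a, b))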
------------------------
Correction of the input:
------------------------
3. 1 Når statlige myndigheter eller Sametinget mener det er behov for utre dninger for å styrke faktagrunnlaget eller det formelle grunnlaget for vurderinger og beslutninger skal dette tilkjennegis så tidlig som mulig , og partene skal bring e spørsmål knyttet til mandat for eventuelle utredninger inne i konsultasjonsproses sen .
   1 Det skal avholdes faste halvårlige møter mellom Sametinget og det int erdepartementale samordningsutvalget for samiske saker .
5. Is that perhaps something for the preprocessing (Kap. vs. Dep .)? The abbr. of an abbr. is an abbr.:
   9 Kap. 328/Dep .
6. : PLAN OG BYGNINGSLOVEN Plan- og bygningsloven inneholder regler om arealplanlegging og byggesaksbehandling .
7. Av mandatet framgår det at følgende samfunnsområder er særlig aktuelle å behandle : Språk Oppvekst , utdanning og forskning Likestilling Helse og sosial , herunder befolkningsutvikling , demografi , inntekt Næringer , herunder sysselsetting , næringsstruktur , tradisjonelle næringer Miljø- og ressursforvaltning , endringer i det materielle kulturgrunnlaget , deltakelse og innflytelse Kulturarbeid og allmennkultur , herunder kunstuttrykk , media Sivile samiske samfunn , herunder organisasjons- og institusjonsutvikling Analysegruppa kan også behandle andre temaer .
   Av mandatet framgår det at følgende samfunnsområder er særlig aktuelle å behandle :
8. Den sier følgende om kunnskapsgrunnlaget : Den sier følgende om kunnskapsgrunnlaget : Kommunal- og regionaldepartementet og Sameti nget nedsetter i fellesskap en faglig analysegruppe som blant annet på bakgrunn av samisk statis tikk årlig avlegger en rapport om situasjon og utviklingstrekk i det samiske samfunn .
   Den sier også hvilke bestemmelser og retningslinjer som kan gis til den enkelte hensynssone for å sikre at hensynet blir ivaretatt .
9. Preprocessing? "m. m. ." instead of "m. m . ."
   Deretter gjennomgåes en øvingsrekke som omfatter de almindelige tresammenføyninger , formingsarb eid , m. m . .
10. This is weird: in the same run (i.e., with the same preprocessing/abbr file) there are two different results:
    - Det blir det slutt på fra 1 . juni .
    - Det blir et Norge før og et etter 22. juli .
11. Perhaps these patterns should be processed like other ordinals (e.g. 1. , 2. , etc.):
    IV . III . VI .

===================
convert2xml.pl step
===================
Taking notes along the processing pipeline.

1. Local conversion got 75% parallel_file_lacks.
   Question: Why stop the XML conversion of a file because a parallel file in some other language is missing?
2. Have the claimed conversion tests been tested properly? A random search on the tmp/*.log files revealed the following:
   tmp/avdelingssekretariatet.html_id=52007.log:Conversion failed: Some parallel file doesn't exist
   ==> tracking the parallel files that are claimed to be missing
   my_cocon>find . -name 'avdelingssekretariatet.html_id=52007'
   ./orig/nob/admin/depts/regjeringen.no/avdelingssekretariatet.html_id=52007
   ==> there are two parallel files declared in the meta file of the random test file
   my_cocon>find . -name 'ossodatallingoddi.html_id=52007'
   ./orig/sme/admin/depts/regjeringen.no/ossodatallingoddi.html_id=52007
   ==> the North Sami parallel file EXISTS!
   my_cocon>find . -name 'department-secretariat.html_id=52007'
   my_cocon>
   ==> the English parallel file is missing!
   Looking for converted files:
   my_cocon>ls converted/nob/admin/depts/regjeringen.no/avdeling*
   converted/nob/admin/depts/regjeringen.no/avdeling_fn.html_id=87006.xml
   converted/nob/admin/depts/regjeringen.no/avdeling_for_forskning_innovasjon_og_reg.html_id=1602.xml
   converted/nob/admin/depts/regjeringen.no/avdeling_for_matpolitikk.html_id=1600.xml
   converted/nob/admin/depts/regjeringen.no/avdeling_for_presse_kultur_og_informasjo.html_id=1529.xml
   converted/nob/admin/depts/regjeringen.no/avdeling_for_skog-_og_ressurspolitikk_.html_id=1599.xml
   converted/nob/admin/depts/regjeringen.no/avdelingsdirektor-bjorn-olav-megard.html_id=484727.xml
   my_cocon>
   ==> there is no conversion of avdelingssekretariatet.html_id=52007
   my_cocon>ls converted/sme/admin/depts/regjeringen.no/ossodatallingoddi.html_id*
   ls: converted/sme/admin/depts/regjeringen.no/ossodatallingoddi.html_id*: No such file or directory
   ==> there is no conversion of ossodatallingoddi.html_id=52007

Ergo: We would have had more parallel texts in the FAD corpus if the English parallel file (a language irrelevant for FAD) had been there.
==> The question again: Why stop the XML conversion of a file because a parallel file in some other language is missing? Is that perhaps only a feature of the locally converted files? To check the conversion on xserve and victorio!

Converting, testing, and parallelizing the good, old xslt way: much better!
2594 2010 0 0 2010
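A sketch of the conversion check I would prefer: only missing parallels in the languages we actually need (nob/sme) should block conversion, the rest should only be reported. The directory layout and the parallels mapping below are illustrative assumptions; in the real corpus the parallels are declared in each file's metadata:

import os

REQUIRED_LANGS = {'nob', 'sme'}

def should_convert(orig_root, parallels):
    """parallels: dict lang -> declared parallel file, relative to orig_root/<lang>."""
    missing = {lang: rel for lang, rel in parallels.items()
               if not os.path.exists(os.path.join(orig_root, lang, rel))}
    for lang, rel in missing.items():
        note = '' if lang in REQUIRED_LANGS else '  [non-blocking]'
        print('missing parallel (%s): %s%s' % (lang, rel, note))
    # convert unless a parallel in a required language is missing
    return not (set(missing) & REQUIRED_LANGS)

if __name__ == '__main__':
    ok = should_convert('orig', {
        'sme': 'admin/depts/regjeringen.no/ossodatallingoddi.html_id=52007',
        'eng': 'admin/depts/regjeringen.no/department-secretariat.html_id=52007'})
    print('convert avdelingssekretariatet.html_id=52007:', ok)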
=======
20121008
=======
Input files for sentence alignment:
nob2sme>find . -name "*.toktmx" | wc -l
1448

Input lines to analyse before sending to word alignment:
20121009_data>wc -l *
150781 data.nob
150781 data.sme

Output lines from the analysis process:
1_ape_n>wc -l data.*
150781 data.nob
150781 data.sme
150781 data.tagged.clean.nob
150781 data.tagged.clean.sme
150781 data.tagged.nob
150781 data.tagged.sme

==> word alignment process: DONE

Some numbers:
second_run>wc -l fad_nobsme_candidates.2012*
101195 fad_nobsme_candidates.20120704
110102 fad_nobsme_candidates.20120721
131446 fad_nobsme_candidates.20121009

But there are some parameters that have to be controlled in order to get comparable data sets:
- the generated abbr list for nob (and sme?): as one can see, there are differences in preprocessing that have repercussions on the sentence alignment step.

<<<===== Quotation start =======>>>
< Lassin dasa ahte julevsámi álbmot beassá lohkat ođđasiid iežas gillii , de lea dehálaš ahte sámegiella báikegottiid giellan oidno báikkálaš aviissas . Doaibmabidju 61 .
---
> Lassin dasa ahte julevsámi álbmot beassá lohkat ođđasiid iežas gillii , de lea dehálaš ahte sámegiella báikegottiid giellan oidno báikkálaš aviissas .
>
>
>
>
> YOUTUBE PÅ .
>
>
> Doaibmabidju 61 .
15023c14999
< YOUTUBE PÅ LULESAMISK er et prøveprosjekt der man gjennom videosendinger på internett og høy brukerdeltakelse ønsker å skape en uformell og interaktiv arena for det lulesamiske språket .
---
> LULESAMISK er et prøveprosjekt der man gjennom videosendinger på internett og høy brukerdeltakelse ønsker å skape en uformell og interaktiv arena for det lulesamiske språket .
16274c16250,16258
< E JØM RKE IL
---
> E JØM .
>
>
>
>
>
>
>
> RKE IL
<<<===== Quotation end =======>>>

Topics to discuss:
- documentation of all places in the pipeline where data is changed/corrected (with examples)
- harmonization of the analysis throughout the pipeline (gt vs. oslo tools for nob, abbr lists)
- grep "bransje" in the toktmx/tmx files to check for possible errors in the sentence alignment (as Lene found out)

TODO:
- check the URL and the content of the dead link http://divvun.no/static_files/nob2sme-2012-0%203-01.zip
  ==> this file should contain the last (and best) version of the parallel data

=======
20121009
=======
New xserve conversion of the data:
Processing finished
7735 files processed, 503 errors among them
The errors were distributed like this:
checkxml_after_checklang  0% of errors
checkxml_after_faulty     1% of errors
convert2xml               0% of errors
faulty_chars              0% of errors
rubbish_content          93% of errors
too_low_mainlang          5% of errors

nob2sme: 2607 2160 0 0 2160
sme2nob: 2227 2153 0 0 2153
============

=======
20121019
=======
New conversion of the nob-sme corpus:
Processing finished
7786 files processed, 509 errors among them
The errors were distributed like this:
checkxml_after_checklang  0% of errors
checkxml_after_faulty     1% of errors
convert2xml               0% of errors
faulty_chars              0% of errors
rubbish_content          92% of errors
too_low_mainlang          6% of errors

2623 2175 0 0 2175
2248 2168 0 0 2168
2632 2184 0 0 2184
2257 2177 0 0 2177
===========

Personal todo-list for the next meeting:
1. convert everything anew locally ==> done
2. check parallelity after Berit Merete's last corpus corrections ==> done
3. generate the list of the file pairs that are not passed over to the prestable/converted dir and send it to Berit Merete and Marja for a specific check ==> done
   To find out how many such pairs there are, just grep for them:
   second_run>grep 'Wrong' pick_sme_20121024.log | wc -l
   418
   second_run>grep 'Too' pick_sme_20121024.log | wc -l
   241
===========

20121060 statistics
Sentence alignment:
- input: 1515 file pairs
- output: 1515 toktmx files

nob2sme file pairs:
clean_tmx>find nob2sme -name \*.toktmx | wc -l
1491
clean_tmx>wc -l data.*
155336 data.nob
155336 data.sme
===========

I've debugged the word alignment pipeline for the gt/obt-analysed data and, yes, the show stopper is the format of the wikipedia data: nowaclemma.freq
Todo:
1. generate the nowaclemma.freq file anew with the obt pipeline
2. look into the last script (extract-candidate-terms.py) to see whether there is hard-coded stuff regarding the annotation format
===========

Cip's todo list for the next meeting:
1. check the new feature word count in the conversion pipeline ==> DONE
2. convert the whole FAD corpus anew ==> started the conversion process
   7808 files processed, 502 errors among them
   The errors were distributed like this:
   checkxml_after_faulty   2% of errors
   rubbish_content        91% of errors
   too_low_mainlang        6% of errors
   2633 2192 2192
   2266 2185 2185
   second_run>grep Wrong pick_parallel_nob2sme_20121031.log | wc -l
   421 (before 418)
   second_run>grep Too pick_parallel_nob2sme_20121031.log | wc -l
   244 (before 241)
3. implement a simple script to generate HTML with links between the parallel files out of the converted XML ==> DONE
4. analyse the freshly converted data with the gt/obt pipeline and generate fresh lists of non-analysed words
   input: 1494 toktmx files with a total of
   clean_tmx>wc -l data.*
   155538 data.nob
   155538 data.sme
   lines = sentences (previous run: 1491 files with 155336 lines)
   ==> ongoing
5. analyse also with English to remove the foreign material from the list ==> TODO
6. analyse the nob wikipedia anew with the obt pipeline, disregarding punctuation
   http://dumps.wikimedia.org/nowiki/20121101/
   http://dumps.wikimedia.org/nowiki/20121101/nowiki-20121101-pages-articles.xml.bz2
   extracted the text file with WP2TXT using the following config:
   - Data Conversion: To Text Format
   - Elements Extracted: Title, Heading, Paragraph, Quote, List
   - Output Encoding: UTF-8
   - Text Type Removed: ref (remove notes embedded in the text)
   URL: http://wp2txt.rubyforge.org
   ==> ongoing
   To filter from the wiki txt file:
   - REDIRECT, redirect, OMDIRIGERING, omdirigering, ===, ==, [[ and ]]
===========

From Márjá and BM:
Remove these nob lemmas from fad_nobsme_candidates_ap-pl.20121028:
god ha la ta bli burde kunne legge måtte ville være skulle
Threshold to be set to: 0.1
==> filtering for auxiliary & modal verbs and the threshold: DONE
137070 fad_nobsme_candidates_ap-pl.20121028
48150 fad_nobsme_candidates_ap-pl.20121028_filtered

Sort first by F1, then by F6, and then by F5.
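A sketch of the requested filtering and sorting, assuming the candidate-file layout seen in the examples below (five numeric columns F1-F5, then the nob lemma as F6 and the sme lemma as F7), that the 0.1 threshold applies to F5, and that F1 and F5 are sorted in descending order; all of these are my assumptions, not Márjá's and BM's specification:

import sys

STOP_LEMMAS = {'god', 'ha', 'la', 'ta', 'bli', 'burde', 'kunne', 'legge',
               'måtte', 'ville', 'være', 'skulle'}
THRESHOLD = 0.1

def read_candidates(path):
    # yield (F1, F2, F3, F4, F5, nob lemma, sme lemma) per line
    with open(path, encoding='utf-8') as f:
        for line in f:
            fields = line.split()
            if len(fields) < 7:
                continue
            f1, f2, f3, f4, f5 = map(float, fields[:5])
            yield f1, f2, f3, f4, f5, fields[5], fields[6]

def filter_and_sort(path):
    rows = [r for r in read_candidates(path)
            if r[5] not in STOP_LEMMAS and r[4] >= THRESHOLD]
    # sort by F1 (descending), then nob lemma, then F5 (descending)
    rows.sort(key=lambda r: (-r[0], r[5], -r[4]))
    return rows

if __name__ == '__main__':
    for row in filter_and_sort(sys.argv[1]):
        print('\t'.join(str(x) for x in row))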
Today it looks like this, e.g. line nos. 54640-54643:
134 0 -5.097 0.0 0.0022075 oppfatte oaidnit
134 0 -5.097 0.0 0.0021053 jobb bargosadji
134 0 -5.097 0.0 0.0020704 satsing+område suorgi
134 0 -5.097 0.0 0.0018519 oppfatte dovdat

From first_run, forvaltningsordbok.nob-sme:
7 97 -8.112 -6.48 0.5454545 adjektiv adjektiiva
7 97 -8.112 -6.48 0.3333333 adjektiv predikatiivahápmi
7 29 -8.112 -7.687 0.2857143 adkomst iskkadeapmi+guovlu
7 29 -8.112 -7.687 0.0588235 adkomst bággolotnun+lohpi

This may indicate that F1 is not tied to the nob lemma, since the nob lemmas adjektiv and adkomst have the same F1, namely 7. In first_run the columns F2 and F4 also carry a value (e.g. 97 and -6.48), in contrast to second_run, where F2 and F4 are constantly 0.
=============================

Started to check the differences in the unanalysed material, apertium vs. non-apertium pipelines:

apertium_pl:
1_ape_n>wc -l starred_*
27979 starred_nob_stuff.txt
33035 starred_sme_stuff.txt

non-apertium_pl:
2_gt_n>wc -l unknown_sme_stuff.txt
21103 unknown_sme_stuff.txt
3_gt_n_nob>wc -l ukjent_nob_stuff.txt
3287 ukjent_nob_stuff.txt

nob overlap:
nob_unknown>comm -12 unknown_nob_sorted_ap.txt unknown_nob_sorted_gt.txt | wc -l
1825
only in ap:
nob_unknown>comm -23 unknown_nob_sorted_ap.txt unknown_nob_sorted_gt.txt | wc -l
26152
only in gt:
nob_unknown>comm -13 unknown_nob_sorted_ap.txt unknown_nob_sorted_gt.txt | wc -l
1462

sme overlap:
sme_unknown>comm -12 unknown_sme_sorted_ap.txt unknown_sme_sorted_gt.txt | wc -l
16892
only in ap:
sme_unknown>comm -23 unknown_sme_sorted_ap.txt unknown_sme_sorted_gt.txt | wc -l
16143
only in gt:
sme_unknown>comm -13 unknown_sme_sorted_ap.txt unknown_sme_sorted_gt.txt | wc -l
4210
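For reference, the same overlap numbers can be cross-checked with plain set operations (one word per line assumed; comm counts duplicate lines while sets do not, so the numbers may differ slightly):

def words(path):
    # one unknown word per line
    with open(path, encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}

def report(ap_file, gt_file, label):
    ap, gt = words(ap_file), words(gt_file)
    print(label, 'overlap:   ', len(ap & gt))
    print(label, 'only in ap:', len(ap - gt))
    print(label, 'only in gt:', len(gt - ap))

if __name__ == '__main__':
    report('unknown_nob_sorted_ap.txt', 'unknown_nob_sorted_gt.txt', 'nob')
    report('unknown_sme_sorted_ap.txt', 'unknown_sme_sorted_gt.txt', 'sme')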