Dir for the in-house word alignment as a pre-step to the extraction of candidates for the FAD dictionary.
Input data: prestable/toktmx/nob2sme

1. extract text
   xsl2 -it main extract_all_sent_pairs.xsl inDir=nob2sme
2. lower-case text
3. giza:
   giza-pp/GIZA++-v2/plain2snt.out in_fad_nob in_fad_sme
   giza-pp/mkcls-v2/mkcls -pin_fad_nob -Vin_fad_nob.vcb.classes
   giza-pp/mkcls-v2/mkcls -pin_fad_sme -Vin_fad_sme.vcb.classes
   giza-pp/GIZA++-v2/GIZA++ -S in_fad_nob.vcb -T in_fad_sme.vcb -C in_fad_nob_in_fad_sme.snt -p0 0.98 -o dictionary >& dictionary.log

20120621 running giza:
GIZA++ stops at the following sentence pair (?):
86600 *Skjermingsreglene
86600 mun jahki 2009 stáhtabušeahtta árvalit nupp ástus mii váidudit riikkaidgaskasaš ekonomiija hedjonit cealkit stáhtaministtar Jens Stoltenberg

Sent No: 78643 , No. Occurrences: 1 0
842 86 27 38702 11 213 1123 108 1905 644 400 183 134 611 38702 2 1459 1235 40 19 25141 1062 12 1128 115 11 330 580 31460 1707 9962 3
32272 12901 3 115 53507 7953 9927

WARNING: The following sentence pair has source/target sentence length ration more than the maximum allowed limit for a source word fertility
source length = 1 target length = 15 ratio 15 ferility limit : 9
Shortening sentence

Sent No: 86600 , No. Occurrences: 1 0
56626
37 14 143 590 490 1419 11 6446 158 845 10634 453 917 1388 1135

ERROR: Execution of:
/Users/ciprian/local/bin/GIZA++ -CoocurrenceFile ./giza.sme-nob/sme-nob.cooc -c ./corpus/sme-nob-int-train.snt -m1 5 -m2 0 -m3 3 -m4 3 -model1dumpfrequency 1 -model4smoothfactor 0.4 -nodumps 1 -nsmooth 4 -o ./giza.sme-nob/sme-nob -onlyaldumps 1 -p0 0.999 -s ./corpus/nob.vcb -t ./corpus/sme.vcb
died with signal 11, without coredump

===== Hint about the show stopper: =====
http://comments.gmane.org/gmane.comp.nlp.moses.user/6494
"GIZA++ is giving you a segmentation fault, which doesn't usually happen. Maybe there's some odd characters in your data, or it's not sentence aligned properly? I notice that you're using quite an old version of Moses (the training script is now called train-model.perl) so it would be worth running with up-to-date Moses and GIZA++"
=====
I don't use an old version, but it is clear that these sentences are not aligned properly.
Ergo: more filtering needed!

Todo:
1. compare the input data with the input of the previous run - it is still on the server:
   http://www.divvun.no/static_files/NOB.SME.gt-sd_free-20101003.tmx.gz
2. clean the data, especially the long sentences with ///≥//|||\\\ @ ..., etc.
   ==> done by filtering out all tu-elements that have at least one tuv-element containing only non-alphanumeric characters or only numeric characters (see the sketch after this list)
   Example for possible preprocessing improvement:
   • • Váldosivva orru leamen ahte almmolaš suorggi bargiin váilu sámegiela máhttu .
   Samarbeidet mellom kommuner og fylkeskommunen skal forankre utbyggingen av bredbånd og fremme nettbaserte løsninger både for privat og offentlig sektor 7 . 1 b ) Mearrádusain mat soahpameahttunvuođalága mearrádusaid mielde lea guoddalusaid áhtun , sáhttá seamma láhkai guoddalit Alimusrievtti guoddaluslávdegoddái go guoddalusat eai leat áiddastuvvon dán lága mielde .
   Example for just-forget-it:
   1
   1
3. lemmatize
4. get Moses to work
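A minimal sketch of the tu filter from todo item 2 above, assuming a plain toktmx/TMX layout (tmx/body/tu/tuv/seg); the file names are placeholders, not the actual pipeline paths:

import xml.etree.ElementTree as ET

def is_junk(text):
    # "junk" as described above: no alphanumeric characters at all,
    # or nothing but digits among the alphanumeric characters
    alnum = [c for c in (text or '') if c.isalnum()]
    return not alnum or all(c.isdigit() for c in alnum)

def filter_tmx(in_file, out_file):
    # drop every <tu> that has at least one junk <seg>
    tree = ET.parse(in_file)
    body = tree.getroot().find('body')
    dropped = 0
    for tu in list(body.findall('tu')):
        if any(is_junk(seg.text) for seg in tu.iter('seg')):
            body.remove(tu)
            dropped += 1
    tree.write(out_file, encoding='utf-8', xml_declaration=True)
    return dropped

if __name__ == '__main__':
    print(filter_tmx('in_fad.toktmx', 'in_fad.filtered.toktmx'), 'tu elements dropped')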
----------
Debugging:
----------
1. apertium pipeline
   (a) line count using the apertium pipeline
       ==> almost cleaned data for the apertium pipeline; the last (?) culprit is a ’ at the beginning of the line
   (b) apostrophe > ’
       Rievdadus dáhpáhuvvá go šattat danin mii don leat , ii ge go geahččalat šaddat danin mii don it leat ’ ( sitáhta : Beisser )
   (c) stray § at the end of a line / at the beginning of the following line
       < Finnmárkkuopmodat sáhttá addit earrásiidda go gieldda dahje fylkka ássiide lasi beassama ávkkástallat ođasmuvvi valljodagain nugo namahuvvo §§ 22 ja 23 . §
       ---
       > Finnmárkkuopmodat sáhttá addit earrásiidda go gieldda dahje fylkka ássiide lasi beassama ávkkástallat ođasmuvvi valljodagain nugo namahuvvo §§ 22 ja 23 .
   (d) Cyrillic text
2. gt pipeline
   line count using the gt pipeline:
   xprocx_gtt>wc -l data.*
   152760 data.gt_tagged.pre_cleaned
   152771 data.sme

To answer Trond's question: "Why use the apertium pipeline at all and not just the gt one?"
Both are buggy, but with different bugs for sme!

apertium:
- in: 152771 lines
- out: 152711 lines

Aha, I remember now why numbering the lines is of little use (actually, I did it long ago): the numbers themselves cause the line shifting/collapsing (I even remember a chat with Trond on that topic). Here you are:

Input:
1 Preassadieđáhus , 07.05.2010
2 Nr. : 29/10
3 1 558,4 miljon ruvdno speallaruhta falástallanulbmiliidda 2010´s
4 Ráđđehus lea mearridan juohkit 1 558,4 miljovnna ruvnno speallanruđa falástallanulbmiliidda jahkái 2010 .

Output:
^1<@X>$ ^preassadieđáhus<@SUBJ→>$ ^,$ ^07.05.2010 2<@SUBJ→>$[ ] ^nr<@Num←>$ ^:$ ^29<@ADVL→>$\/^10 3 1 558,4<@→Num>$[ ] ^miljovdna<@→N>$ ^ruvdno<@SUBJ→>$ ^spealat+ruhta<@ADVL→>$ ^falástallat+ulbmil<@ADVL→>$ ^2010<@OBJ→>$´^s<@SUBJ→>$[ ]^4<@N←>$ ^ráđđehus<@SUBJ→>$ ^leat<@+FAUXV>$ ^mearridit<@-FMAINV>$ ^juohkit<@←OBJ>$ ^1 558,4<@→Num>$ ^miljovdna<@→N>$ ^ruvdno<@Num←>$ ^speallat+ruhta<@-F←OBJ>$ ^falástallat+ulbmil<@←ADVL>$ ^jahki<@←ADVL>$ ^2010<@N←>$ ^.$[

Bug of the apertium pipeline for sme: if a line containing only numbers immediately follows a line ending with a number, it is interpreted as part of the number at the end of the previous line (a small check for such line pairs is sketched below).

Line 1: The house number is 234
Line 2: 56789
Line 3: is the post code.
==becomes==>
Line 1: The house number is 234 56789
Line 2: is the post code.
=============
gt:
- in: 152771 lines
- out: 152760 lines
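A small ad-hoc check (not part of either pipeline) for the line pairs that trigger the bug described above, i.e. a line ending in a digit followed by a numbers-only line; the file name is a placeholder:

import re
import sys

# a line consisting only of digits, whitespace and number punctuation
NUM_ONLY = re.compile(r"^[\d\s.,/´'-]+$")

def risky_pairs(path):
    # yield (line number, line, following line) for every pair at risk of
    # being collapsed by the apertium pipeline
    with open(path, encoding='utf-8') as f:
        lines = [l.rstrip('\n') for l in f]
    for i in range(len(lines) - 1):
        if lines[i].rstrip().endswith(tuple('0123456789')) and NUM_ONLY.match(lines[i + 1]):
            yield i + 1, lines[i], lines[i + 1]

if __name__ == '__main__':
    for lineno, a, b in risky_pairs(sys.argv[1] if len(sys.argv) > 1 else 'data.sme'):
        print('line %d: %r + %r' % (lineno, a, b))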
------------------------
Correction of the input:
------------------------
3. 1 Når statlige myndigheter eller Sametinget mener det er behov for utre dninger for å styrke faktagrunnlaget eller det formelle grunnlaget for vurderinger og beslutninger skal dette tilkjennegis så tidlig som mulig , og partene skal bring e spørsmål knyttet til mandat for eventuelle utredninger inne i konsultasjonsproses sen .
   1 Det skal avholdes faste halvårlige møter mellom Sametinget og det int erdepartementale samordningsutvalget for samiske saker .
5. Is that perhaps something for the preprocessing (Kap. vs. Dep .)? The abbr. of an abbr. is an abbr.:
   9 Kap. 328/Dep .
6. : PLAN OG BYGNINGSLOVEN Plan- og bygningsloven inneholder regler om arealplanlegging og byggesaksbehandling .
7. Av mandatet framgår det at følgende samfunnsområder er særlig aktuelle å behandle : Språk Oppvekst , utdanning og forskning Likestilling Helse og sosial , herunder befolkningsutvikling , demografi , inntekt Næringer , herunder sysselsetting , næringsstruktur , tradisjonelle næringer Miljø- og ressursforvaltning , endringer i det materielle kulturgrunnlaget , deltakelse og innflytelse Kulturarbeid og allmennkultur , herunder kunstuttrykk , media Sivile samiske samfunn , herunder organisasjons- og institusjonsutvikling Analysegruppa kan også behandle andre temaer .
   Av mandatet framgår det at følgende samfunnsområder er særlig aktuelle å behandle :
8. Den sier følgende om kunnskapsgrunnlaget : Den sier følgende om kunnskapsgrunnlaget : Kommunal- og regionaldepartementet og Sameti nget nedsetter i fellesskap en faglig analysegruppe som blant annet på bakgrunn av samisk statis tikk årlig avlegger en rapport om situasjon og utviklingstrekk i det samiske samfunn .
   Den sier også hvilke bestemmelser og retningslinjer som kan gis til den enkelte hensynssone for å sikre at hensynet blir ivaretatt .
9. Preprocessing? "m. m. ." instead of "m. m . ."
   Deretter gjennomgåes en øvingsrekke som omfatter de almindelige tresammenføyninger , formingsarb eid , m. m . .
10. This is weird: in the same run (i.e., with the same preprocessing/abbr file) there are two different results:
    - Det blir det slutt på fra 1 . juni .
    - Det blir et Norge før og et etter 22. juli .
11. Perhaps these patterns should be processed like other ordinals (e.g. 1. , 2. , etc.):
    IV . III . VI .

===================
convert2xml.pl step
===================
Taking notes along the processing pipeline.

1. Local conversion got 75% parallel_file_lacks.
   Question: Why stop the XML conversion of a file because a parallel file in some other language is missing?
2. Have the claimed conversion tests been tested properly? A random search on the tmp/*.log files revealed the following:
   tmp/avdelingssekretariatet.html_id=52007.log:Conversion failed: Some parallel file doesn't exist
   ==> tracking the parallel files that are claimed to be missing
   my_cocon>find . -name 'avdelingssekretariatet.html_id=52007'
   ./orig/nob/admin/depts/regjeringen.no/avdelingssekretariatet.html_id=52007
   ==> there are two parallel files declared in the meta file of the random test file
   my_cocon>find . -name 'ossodatallingoddi.html_id=52007'
   ./orig/sme/admin/depts/regjeringen.no/ossodatallingoddi.html_id=52007
   ==> the North Sami parallel file EXISTS!
   my_cocon>find . -name 'department-secretariat.html_id=52007'
   my_cocon>
   ==> the English parallel file is missing!
   Looking for converted files:
   my_cocon>ls converted/nob/admin/depts/regjeringen.no/avdeling*
   converted/nob/admin/depts/regjeringen.no/avdeling_fn.html_id=87006.xml
   converted/nob/admin/depts/regjeringen.no/avdeling_for_forskning_innovasjon_og_reg.html_id=1602.xml
   converted/nob/admin/depts/regjeringen.no/avdeling_for_matpolitikk.html_id=1600.xml
   converted/nob/admin/depts/regjeringen.no/avdeling_for_presse_kultur_og_informasjo.html_id=1529.xml
   converted/nob/admin/depts/regjeringen.no/avdeling_for_skog-_og_ressurspolitikk_.html_id=1599.xml
   converted/nob/admin/depts/regjeringen.no/avdelingsdirektor-bjorn-olav-megard.html_id=484727.xml
   my_cocon>
   ==> there is no conversion of avdelingssekretariatet.html_id=52007
   my_cocon>ls converted/sme/admin/depts/regjeringen.no/ossodatallingoddi.html_id*
   ls: converted/sme/admin/depts/regjeringen.no/ossodatallingoddi.html_id*: No such file or directory
   ==> there is no conversion of ossodatallingoddi.html_id=52007

Ergo: We would have had more parallel texts in the FAD corpus if the English parallel file (a language irrelevant for FAD) had been there.
==> The question again: Why stop the XML conversion of a file because a parallel file in some other language is missing? Is that perhaps only a feature of the locally converted files? To check the conversion on xserve and victorio!

Converting, testing, and parallelizing the good, old xslt way: much better!
2594 2010 0 0 2010
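A sketch of the conversion check I would prefer: only missing parallels in the languages we actually need (nob/sme) should block conversion, the rest should only be reported. The directory layout and the parallels mapping below are illustrative assumptions; in the real corpus the parallels are declared in each file's metadata:

import os

REQUIRED_LANGS = {'nob', 'sme'}

def should_convert(orig_root, parallels):
    """parallels: dict lang -> declared parallel file, relative to orig_root/<lang>."""
    missing = {lang: rel for lang, rel in parallels.items()
               if not os.path.exists(os.path.join(orig_root, lang, rel))}
    for lang, rel in missing.items():
        note = '' if lang in REQUIRED_LANGS else '  [non-blocking]'
        print('missing parallel (%s): %s%s' % (lang, rel, note))
    # convert unless a parallel in a required language is missing
    return not (set(missing) & REQUIRED_LANGS)

if __name__ == '__main__':
    ok = should_convert('orig', {
        'sme': 'admin/depts/regjeringen.no/ossodatallingoddi.html_id=52007',
        'eng': 'admin/depts/regjeringen.no/department-secretariat.html_id=52007'})
    print('convert avdelingssekretariatet.html_id=52007:', ok)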
=======
20121008
=======
Input files for sentence alignment:
nob2sme>find . -name "*.toktmx" | wc -l
1448

Input lines to analyse before sending to word alignment:
20121009_data>wc -l *
150781 data.nob
150781 data.sme

Output lines from the analysis process:
1_ape_n>wc -l data.*
150781 data.nob
150781 data.sme
150781 data.tagged.clean.nob
150781 data.tagged.clean.sme
150781 data.tagged.nob
150781 data.tagged.sme

==> word alignment process: DONE

Some numbers:
second_run>wc -l fad_nobsme_candidates.2012*
101195 fad_nobsme_candidates.20120704
110102 fad_nobsme_candidates.20120721
131446 fad_nobsme_candidates.20121009

But there are some parameters that have to be controlled in order to get comparable data sets:
- the generated abbr list for nob (and sme?): as one can see, there are differences in preprocessing that have repercussions on the sentence alignment step.

<<<===== Quotation start =======>>>
< Lassin dasa ahte julevsámi álbmot beassá lohkat ođđasiid iežas gillii , de lea dehálaš ahte sámegiella báikegottiid giellan oidno báikkálaš aviissas . Doaibmabidju 61 .
---
> Lassin dasa ahte julevsámi álbmot beassá lohkat ođđasiid iežas gillii , de lea dehálaš ahte sámegiella báikegottiid giellan oidno báikkálaš aviissas .
>
>
>
>
> YOUTUBE PÅ .
>
>
> Doaibmabidju 61 .
15023c14999
< YOUTUBE PÅ LULESAMISK er et prøveprosjekt der man gjennom videosendinger på internett og høy brukerdeltakelse ønsker å skape en uformell og interaktiv arena for det lulesamiske språket .
---
> LULESAMISK er et prøveprosjekt der man gjennom videosendinger på internett og høy brukerdeltakelse ønsker å skape en uformell og interaktiv arena for det lulesamiske språket .
16274c16250,16258
< E JØM RKE IL
---
> E JØM .
>
>
>
>
>
>
>
> RKE IL
<<<===== Quotation end =======>>>

Topics to discuss:
- documentation of all places in the pipeline where data is changed/corrected (with examples)
- harmonization of the analysis throughout the pipeline (gt vs. oslo tools for nob, abbr lists)
- grep "bransje" in the toktmx/tmx files to check for possible errors in the sentence alignment (as Lene found out)

TODO:
- check the URL and the content of the dead link http://divvun.no/static_files/nob2sme-2012-0%203-01.zip
  ==> this file should contain the last (and best) version of the parallel data

=======
20121009
=======
New xserve conversion of the data:
Processing finished
7735 files processed, 503 errors among them
The errors were distributed like this:
checkxml_after_checklang  0% of errors
checkxml_after_faulty     1% of errors
convert2xml               0% of errors
faulty_chars              0% of errors
rubbish_content          93% of errors
too_low_mainlang          5% of errors

nob2sme: 2607 2160 0 0 2160
sme2nob: 2227 2153 0 0 2153
============

=======
20121019
=======
New conversion of the nob-sme corpus:
Processing finished
7786 files processed, 509 errors among them
The errors were distributed like this:
checkxml_after_checklang  0% of errors
checkxml_after_faulty     1% of errors
convert2xml               0% of errors
faulty_chars              0% of errors
rubbish_content          92% of errors
too_low_mainlang          6% of errors

2623 2175 0 0 2175
2248 2168 0 0 2168
2632 2184 0 0 2184
2257 2177 0 0 2177
===========

Personal todo-list for the next meeting:
1. convert everything anew locally ==> done
2. check parallelity after Berit Merete's last corpus corrections ==> done
3. generate the list of the file pairs that are not passed over to the prestable/converted dir and send it to Berit Merete and Marja for a specific check ==> done
   To find out how many such pairs there are, just grep for them:
   second_run>grep 'Wrong' pick_sme_20121024.log | wc -l
   418
   second_run>grep 'Too' pick_sme_20121024.log | wc -l
   241
===========

20121060 statistics
Sentence alignment:
- input: 1515 file pairs
- output: 1515 toktmx files

nob2sme file pairs:
clean_tmx>find nob2sme -name \*.toktmx | wc -l
1491
clean_tmx>wc -l data.*
155336 data.nob
155336 data.sme
===========

I've debugged the word alignment pipeline for the gt/obt-analysed data and, yes, the show stopper is the format of the wikipedia data: nowaclemma.freq
Todo:
1. generate the nowaclemma.freq file anew with the obt pipeline
2. look into the last script (extract-candidate-terms.py) to see whether there is hard-coded stuff regarding the annotation format
===========

Cip's todo list for the next meeting:
1. check the new feature word count in the conversion pipeline ==> DONE
2. convert the whole FAD corpus anew ==> started the conversion process
   7808 files processed, 502 errors among them
   The errors were distributed like this:
   checkxml_after_faulty   2% of errors
   rubbish_content        91% of errors
   too_low_mainlang        6% of errors
   2633 2192 2192
   2266 2185 2185
   second_run>grep Wrong pick_parallel_nob2sme_20121031.log | wc -l
   421 (before 418)
   second_run>grep Too pick_parallel_nob2sme_20121031.log | wc -l
   244 (before 241)
3. implement a simple script to generate HTML with links between the parallel files out of the converted XML ==> DONE
4. analyse the freshly converted data with the gt/obt pipeline and generate fresh lists of non-analysed words
   input: 1494 toktmx files with a total of
   clean_tmx>wc -l data.*
   155538 data.nob
   155538 data.sme
   lines = sentences (previous run: 1491 files with 155336 lines)
   ==> ongoing
5. analyse also with English to remove the foreign material from the list ==> TODO
6. analyse the nob wikipedia anew with the obt pipeline, disregarding punctuation
   http://dumps.wikimedia.org/nowiki/20121101/
   http://dumps.wikimedia.org/nowiki/20121101/nowiki-20121101-pages-articles.xml.bz2
   extracted the text file with WP2TXT using the following config:
   - Data Conversion: To Text Format
   - Elements Extracted: Title, Heading, Paragraph, Quote, List
   - Output Encoding: UTF-8
   - Text Type Removed: ref (remove notes embedded in the text)
   URL: http://wp2txt.rubyforge.org
   ==> ongoing
   To filter from the wiki txt file:
   - REDIRECT, redirect, OMDIRIGERING, omdirigering, ===, ==, [[ and ]]
===========

From Márjá and BM:
Remove these nob lemmas from fad_nobsme_candidates_ap-pl.20121028:
god ha la ta bli burde kunne legge måtte ville være skulle
Threshold to be set to: 0.1
==> filtering for auxiliary & modal verbs and the threshold: DONE
137070 fad_nobsme_candidates_ap-pl.20121028
48150 fad_nobsme_candidates_ap-pl.20121028_filtered

Sort first by F1, then by F6, and then by F5.
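A sketch of the requested filtering and sorting, assuming the candidate-file layout seen in the examples below (five numeric columns F1-F5, then the nob lemma as F6 and the sme lemma as F7), that the 0.1 threshold applies to F5, and that F1 and F5 are sorted in descending order; all of these are my assumptions, not Márjá's and BM's specification:

import sys

STOP_LEMMAS = {'god', 'ha', 'la', 'ta', 'bli', 'burde', 'kunne', 'legge',
               'måtte', 'ville', 'være', 'skulle'}
THRESHOLD = 0.1

def read_candidates(path):
    # yield (F1, F2, F3, F4, F5, nob lemma, sme lemma) per line
    with open(path, encoding='utf-8') as f:
        for line in f:
            fields = line.split()
            if len(fields) < 7:
                continue
            f1, f2, f3, f4, f5 = map(float, fields[:5])
            yield f1, f2, f3, f4, f5, fields[5], fields[6]

def filter_and_sort(path):
    rows = [r for r in read_candidates(path)
            if r[5] not in STOP_LEMMAS and r[4] >= THRESHOLD]
    # sort by F1 (descending), then nob lemma, then F5 (descending)
    rows.sort(key=lambda r: (-r[0], r[5], -r[4]))
    return rows

if __name__ == '__main__':
    for row in filter_and_sort(sys.argv[1]):
        print('\t'.join(str(x) for x in row))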
Today it looks like this, e.g. line nos. 54640-54643:
134 0 -5.097 0.0 0.0022075 oppfatte oaidnit
134 0 -5.097 0.0 0.0021053 jobb bargosadji
134 0 -5.097 0.0 0.0020704 satsing+område suorgi
134 0 -5.097 0.0 0.0018519 oppfatte dovdat

From first_run, forvaltningsordbok.nob-sme:
7 97 -8.112 -6.48 0.5454545 adjektiv adjektiiva
7 97 -8.112 -6.48 0.3333333 adjektiv predikatiivahápmi
7 29 -8.112 -7.687 0.2857143 adkomst iskkadeapmi+guovlu
7 29 -8.112 -7.687 0.0588235 adkomst bággolotnun+lohpi

This may indicate that F1 is not tied to the nob lemma, since the nob lemmas adjektiv and adkomst have the same F1, namely 7. In first_run the columns F2 and F4 also carry a value (e.g. 97 and -6.48), in contrast to second_run, where F2 and F4 are constantly 0.
=============================

Started to check the differences in the unanalysed material, apertium vs. non-apertium pipelines:

apertium_pl:
1_ape_n>wc -l starred_*
27979 starred_nob_stuff.txt
33035 starred_sme_stuff.txt

non-apertium_pl:
2_gt_n>wc -l unknown_sme_stuff.txt
21103 unknown_sme_stuff.txt
3_gt_n_nob>wc -l ukjent_nob_stuff.txt
3287 ukjent_nob_stuff.txt

nob overlap:
nob_unknown>comm -12 unknown_nob_sorted_ap.txt unknown_nob_sorted_gt.txt | wc -l
1825
only in ap:
nob_unknown>comm -23 unknown_nob_sorted_ap.txt unknown_nob_sorted_gt.txt | wc -l
26152
only in gt:
nob_unknown>comm -13 unknown_nob_sorted_ap.txt unknown_nob_sorted_gt.txt | wc -l
1462

sme overlap:
sme_unknown>comm -12 unknown_sme_sorted_ap.txt unknown_sme_sorted_gt.txt | wc -l
16892
only in ap:
sme_unknown>comm -23 unknown_sme_sorted_ap.txt unknown_sme_sorted_gt.txt | wc -l
16143
only in gt:
sme_unknown>comm -13 unknown_sme_sorted_ap.txt unknown_sme_sorted_gt.txt | wc -l
4210
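For reference, the same overlap numbers can be cross-checked with plain set operations (one word per line assumed; comm counts duplicate lines while sets do not, so the numbers may differ slightly):

def words(path):
    # one unknown word per line
    with open(path, encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}

def report(ap_file, gt_file, label):
    ap, gt = words(ap_file), words(gt_file)
    print(label, 'overlap:   ', len(ap & gt))
    print(label, 'only in ap:', len(ap - gt))
    print(label, 'only in gt:', len(gt - ap))

if __name__ == '__main__':
    report('unknown_nob_sorted_ap.txt', 'unknown_nob_sorted_gt.txt', 'nob')
    report('unknown_sme_sorted_ap.txt', 'unknown_sme_sorted_gt.txt', 'sme')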