!!!Corpus meeting 11.4.2011

Present: Berit Merete, Børre, Ciprian, Tomi, Trond

!!!Agenda

* Algorithm for dealing with scanning errors
* Sentence alignment
* Analysed corpora on xserve

__Goal: Functioning corpus__

!!!Algorithm for dealing with scanning errors

The process has not been run, so we have no new results.

Run the same routine for nob.

Missing in nob:
* vŽre, and all æøå: {{converted/nob/admin/guovda/1.doc.xml}}
* Note: The document is marked xml:lang="kal"

{{{
/home/apache_corpus/freecorpus/converted/sme/admin/depts/other_files
8.9000 26196 2334 STM200420050011000SE_PDFS.pdf.xml
Rá Rá +?
ehusa ehusa +?
jahkedie jahkedie +?
áhusáššiid áhusáššiid +?
8.4300 30893 2605 STM200420050044000SE_PDFS.pdf.xml
jahkedie jahkedie +?
áhus áhus +?
Rá Rá +?
ádallamat ádallamat +?
8.3300 7320 610 Reindrift_Omraadeprotokoll_til_konvensjon_mellom_Norge_Sverige_Nordsamisk.pdf.xml
7.1500 14438 1033 273777-raportti_saami.pdf.xml
6.0100 57637 3464 OTP200620070025000SE_PDFS.pdf.xml
5.6600 1535 87 faktablad_nordsamiska_wordversion.doc.xml
4.5900 8931 410 260965-h-2179s_2.pdf.xml
4.4800 3325 149 sami_rapporter_bruk_samisk_flagg_SA.pdf.xml
4.4700 18766 840 203210-q-1066_samisk_lav.pdf.xml
4.4600 3874 173 sami_rapport_sametinget_vedlegg4_SA.pdf.xml
/home/apache_corpus/freecorpus/converted/sme/admin/depts/regjeringen.no
30.4900 341 104 130-000-ruvnnu-kvena-proeavttaide.html_id=573764.xml
Rejeerinki Rejeerinki +?
anttaa anttaa +?
rahhaa rahhaa +?
Porsangin Porsangin +?
kolmekieliselle kolmekieliselle +?
laulukirjale laulukirjale +?
26.6600 30 8 plakater-til-valgdagen.html_id=575739.xml
26.6600 15 4 neahttakarta-.html_id=313865.xml
25.0000 12 3 neahttakarta.html_id=223274.xml
24.2400 33 8 nytt-og-nytting.html_id=544857.xml
23.5200 17 4 neahttakarta-.html_id=313868.xml
23.2500 43 10 neahttakarta-.html_id=313744.xml
22.8500 35 8 gulaskuddannotahtta.html_id=588787.xml
22.8500 35 8 adreassalistu.html_id=588788.xml
22.2200 18 4 ohcanveahkki-.html_id=446705.xml
21.8700 32 7 forskrifter.html_id=623.xml
21.4200 42 9 julebesok-til-oslo-fengsel.html_id=629537.xml
/home/apache_corpus/freecorpus/converted/sme/admin/guovda
50.0000 10 5 GUOVDAGEAINNU_NUORAIDSKUVLLA_OAHPAHEDDJIID_PLÁKÁHTTA.doc.xml
33.3300 12 4 GUOVDAGEAINNU_NUORAIDSKUVLLA_OHPPIID_PLÁKÁHTTA.doc.xml
29.9500 227 68 KS_áššelistu_24.06.2004.doc.xml
13.8400 65 9 Gártnetluohkka_ÁRVVOŠTALLANSKOVVI_22.04.03.doc.xml
12.3100 138 17 Bajasdoallansiehtadus_FKB-data_Guovdageainnu_suohkanis_05.05.05.doc.xml
10.3800 10409 1081 1_2.doc.xml
8.3500 431 36 vinterskole.doc.xml
8.3100 493 41 Sakspapirer_på_samisk_31.10.03.doc.xml
8.1300 209 17 MEAHCCESKUVLA.doc.xml
7.9600 427 34 Mearraskuvla.doc.xml
/home/apache_corpus/freecorpus/converted/sme/admin/others
15.6800 1326 208 uito-ohpenplana.txt.xml
15.1500 66 10 Reglement_Djupvik_havn.doc.xml
13.0800 107 14 VÁLGADIKKI.doc.xml
10.6700 637 68 skuterløyer_2006.doc.xml
9.4300 53 5 valgalistut_almmuhus.doc.xml
8.9200 112 10 SKJEMA___AMBULLERENDE.doc.xml
8.8800 45 4 Oversetting,_følgebrev.doc.xml
7.3600 95 7 UTBETALINGSANMODNING.doc.xml
7.1700 237 17 RETN.LINJER___KULTUR.doc.xml
7.0500 85 6 Reguleringsplan.doc.xml
/home/apache_corpus/freecorpus/converted/sme/admin/sd/other_files
38.1800 6270 2394 dc1990-4.pdf.xml
26.0400 14338 3734 satnelistu.doc.xml
25.3600 138 35 stedsnavn4.doc.xml
20.7600 6592 1369 dc1991-2.pdf.xml
15.0600 9294 1400 dč1994-2.pdf.xml
14.7100 9357 1377 dc1990-3.pdf.xml
14.2600 1311 187 dc1993-3.pdf.xml
13.9800 12240 1712 dc1990-1.pdf.xml
13.3600 11341 1516 dč1994-1.pdf.xml
13.1200 160 21 64547_1_P.doc.xml
/home/apache_corpus/freecorpus/converted/sme/admin/sd/samediggi.no
40.0000 5 2 samediggi-article-788.html.xml
40.0000 5 2 samediggi-article-315.html.xml
40.0000 5 2 samediggi-article-227.html.xml
40.0000 5 2 samediggi-article-225.html.xml
27.5500 196 54 samediggi-article-2933.html.xml
25.0000 8 2 samediggi-article-3179.html.xml
21.7300 23 5 samediggi-article-3114.html.xml
20.0000 35 7 samediggi-article-3217.html.xml
18.5100 27 5 samediggi-article-3451.html.xml
17.5700 165 29 samediggi-article-3683.html.xml
17.0800 158 27 samediggi-article-2738.html.xml
16.3900 61 10 samediggi-article-505.html.xml
16.0000 25 4 samediggi-article-2485.html.xml
}}}

Tomi will look into this and discuss unclear points with Børre.

__TODO__
* Fix __large parts__ of this problem. (__Tomi__)
** Challenge: How to fix.
* Write a report late this week (__Tomi__)

__Ultimate goal:__
* fix the file conversion, or
* move the file elsewhere, e.g. to the gold corpus for scanning errors, or
* remove the file altogether

{{{
[dstroke]
[dstr juoga oke]
}}}

!!Error reports

!Scanning errors

{{{
I found this error yesterday:
Sámediggi gávnnaha 1unddo1ažžan ahte fy1kagielda váldá oasi giellanjuolggadusaid ovttastahttimii ja di1álašvuodaid 1 áhčimiidda gielddain mat gu11et doaibmaguv1ui Finnmárkku fylkkas .

And this:
Dan lassin lea bálkkašumi vuoiti vuođđudan alccesis duodjefitnodaga , ja lea máhtolašvuođainis ja hutkái ¬vuođainis ožžon alla árvvu duodjeealáhusas .

And this - đ is missing:
daid ektui , ja ahte gielddat ieža oidnet dárbbu doallat aktiivvalaš oktavuo a Sámedikkiin go galgá bargat kulturhistorjjá sihkkarastimiin , duo aštemiin , dutkamiin ja gaskkustemiin .

Same error - đ is missing:
Orru ahte dán gealdagasas dat lea sámi kultuvra vuoittahallan ja ahte eiseválddiid dáiddaáŋgiruššamat vuo uduvvojit minoritehtakultuvrra siskilkeahtesvuhtii .
}}}

Son !! boahtán (!!
pro ii)

!!!Corpus conversion

Status quo:
* Works on Linux
* Mac:
** Has problems with perl version xyz

!!WARNING - NO MATCH

This message shows up when converting orig, and the issue is still open.

!!!Sentence alignment

!!New program

Trond has talked to Knut Hofland. We will get a new TCA2 version.

__TODO__
* Put the new version of TCA2 in svn (?, make it accessible) and document the installation (__Børre + friends__)
* Update our general TCA2 documentation if the old one is obsolete (__all__)

!!Anchor list

Trond had made an anchor.fst, which unfortunately was flawed. A new one is finished and OK, but has not been tested or checked in. The question now is whether to take nob or sme as the starting point.

__TODO__
* Make a new nob-based anchor list. (__Trond__)
* Thereafter, translate it to sme (__Biret Merete__)
* Divide the anchor list in two: a. general, b. topic-specific. (__Trond, Berit Merete__)

!!!Analysed corpora on xserve

Has anyone checked the output? No. The cronjob did this.

__TODO__
* Make sure we have a fresh version on Thursday. (__Børre__)

Error report, have a look:

{{{
tmp/STM200720080028000DDDPDFS.pdf.log:Conversion failed: Couldn't convert /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/STM200720080028000DDDPDFS.pdf to intermediate xml format
tmp/STM200820090039000DDDPDFS.pdf.log:Conversion failed: Couldn't convert /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/STM200820090039000DDDPDFS.pdf to intermediate xml format
tmp/STM200820090043000DDDPDFS.pdf.log:Conversion failed: Wasn't able to categorize the language(s) inside the text /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/STM200820090043000DDDPDFS.pdf
tmp/Samiske_tall_forteller_3_NO.pdf.log:Conversion failed: Wasn't able to categorize the language(s) inside the text /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/Samiske_tall_forteller_3_NO.pdf
tmp/Samiske_tall_forteller_II_Norsk.pdf.log:Conversion failed: Wasn't able to categorize the language(s) inside the text /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/Samiske_tall_forteller_II_Norsk.pdf
tmp/retningslinjerforverneplanarbeid_sametinget.pdf.log:Conversion failed: Wasn't able to categorize the language(s) inside the text /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/retningslinjerforverneplanarbeid_sametinget.pdf

drwxr-xr-x 4 cipriangerstenberger staff 136 7 apr 22:54 orig
drwxr-xr-x 201 cipriangerstenberger staff 6834 11 apr 13:29 tmp
}}}
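
A possible way to get an overview of such a report is to count the failures per reason. The sketch below is only an illustration: it assumes every failure line has the shape seen above (log file name, then "Conversion failed:", then a message containing an absolute path). The function name summarize_failures and the "<file>" placeholder are made up for this example and are not part of the conversion tools.

```python
import re
from collections import Counter

# Assumed line shape, taken from the report above:
#   <log file>:Conversion failed: <message containing an absolute path>
FAILURE = re.compile(r"^(?P<log>\S+\.log):Conversion failed: (?P<msg>.+)$")
ABS_PATH = re.compile(r"/\S+")


def summarize_failures(lines):
    """Count conversion failures per reason, ignoring the file path."""
    reasons = Counter()
    for line in lines:
        match = FAILURE.match(line.strip())
        if match is None:
            continue  # skip anything else, e.g. directory listings
        # Replace the absolute path so identical reasons group together.
        reason = ABS_PATH.sub("<file>", match.group("msg")).strip()
        reasons[reason] += 1
    return reasons


# Three lines copied from the report: two failures and one directory entry.
sample = [
    "tmp/STM200720080028000DDDPDFS.pdf.log:Conversion failed: Couldn't convert "
    "/Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/"
    "other_files/STM200720080028000DDDPDFS.pdf to intermediate xml format",
    "tmp/Samiske_tall_forteller_3_NO.pdf.log:Conversion failed: Wasn't able to "
    "categorize the language(s) inside the text "
    "/Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/"
    "other_files/Samiske_tall_forteller_3_NO.pdf",
    "drwxr-xr-x 4 cipriangerstenberger staff 136 7 apr 22:54 orig",
]

if __name__ == "__main__":
    for reason, count in summarize_failures(sample).most_common():
        print(count, reason)
```

Grouping by reason separates the files that fail in the conversion to the intermediate xml format from those where the language could not be recognised, which probably need different fixes.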