!!!Meeting setup * Date: 6.6.2011 * Time: 10.45 Norw. time * Place: Internet * Tools: SubEthaEdit, iChat !!!Saksliste * Divvun 2.2 & Polderland * Korpus * sma-normering + fst !!!Opening, agenda review, participants * Opened at 11.00. * Present: __Børre, Sjur, Tomi, Thomas__ (Divvun only) * Absent: __none__ * Secretary: __Sjur; Børre__ !!!Divvun 2.2 & Polderland Numerals taking a long time to compile, break the compilation. The solution would be to compile them once and add the result to svn. sme - the following do not work: {{{ iige #382 - verb + clitic Máretgo #415 - nominals with clitic not accepted gulloban #622 - Words with clitic -ban not accepted Oslobiila Oslo-biila #397 - name compounding Finncomm #400 - multiword names: Finncomm Airlines Seskaröcd Seskarö-cd #805 - Nouns+acronyms Brønnøysund-registarii #930 - proper+nouns ovda--Lott- ovdda ovdda- #666 - guovtte- and njealje- njealjilogát njealljelogát #804 - guovttilogát, njealjilogát goalmmát #962 - some ordinals not recognized Lule sámi NRK-Finnmarku NRK-Finnmárko #805 - Nouns+acronyms Suggestions: ENØK-Finnmárko name+name NRK-finnmárkok acro+noun where: ENØK is a name finnmárkok is a noun finnmárkok Finnmárkko+N+Prop+Plc+Sg+Gen+Der1+Der/k+N+Sg+Nom }}} SMJ is looking very good, only problem is name and acro compounding (see above). name+acro is fine Polderland: * drop that fixes Windows+Office issues * ask about hyphenated suggestions __TODO:__ # change build process to exclude closed POS's and closed sets # look into ACRO compounding bug # compile new spellers # test !!!Korpus __Børre__ has: {{{ Freecorpus: 9428 files processed, 81 errors among them The errors were distributed like this: checkxml_after_faulty 0 0% of errors error_markup 0 0% of errors convert2xml 1 1% of errors xsl 8 10% of errors character_encoding 0 0% of errors intermediate 16 20% of errors faulty_chars 44 54% of errors text_categorization 0 0% of errors checkxml_after_checklang 12 15% of errors checklang 0 0% of errors To find which files caused the errors, do the command grep "Conversion failed" tmp/*.log Boundcorpus: 51760 files processed, 1407 errors among them The errors were distributed like this: checkxml_after_faulty 0 0% of errors error_markup 0 0% of errors convert2xml 13 1% of errors xsl 2 0% of errors character_encoding 0 0% of errors intermediate 11 1% of errors faulty_chars 1379 98% of errors text_categorization 0 0% of errors checkxml_after_checklang 2 0% of errors checklang 0 0% of errors To find which files caused the errors, do the command grep "Conversion failed" tmp/*.log }}} __Sjur__ has: {{{ Freecorpus: Processing finished 9430 files processed, 442 errors among them The errors were distributed like this: checkxml_after_faulty 1 0% of errors error_markup 0 0% of errors convert2xml 28 6% of errors xsl 8 2% of errors character_encoding 0 0% of errors intermediate 338 76% of errors faulty_chars 55 12% of errors text_categorization 0 0% of errors checkxml_after_checklang 12 3% of errors checklang 0 0% of errors To find which files caused the errors, do the command grep "Conversion failed" tmp/*.log }}} The main problem here is not so much the quality of the conversions as such, but that the conversion results varies so much between runs and computers This is also what we have seen earlier with Trond. This problem will be less intruding with the stable corpus repository (see below), but needs to be followed up. !!Stable corpus conversion * goal: a repository of converted files that are guaranteed to meet certain quality measures * each file is added manually after testing the conversion result Conversion & test procedure: * convert2xml without errors * ccat > analyser = less than 3 % unknown words * ccat > grep all paragraphs not ending in a sentence delimiter = less than 1 % * parallel files must have valid parallel pointers Most important: the stable repository must not contain useless data! __TODO:__ * create {{stable/converted/}} & {{stable/goldstandard/converted}} branch of the corpus repositories (__Børre__) * add ''tested'' and verified converted files to the stable branch (__Tomi__) ** one sme-nob pair from {{$GTFREE/orig/sme/admin/}} today * fill up the stable branch of GTFREE with as much parallel data as required by __Francis__ for the Forv.ordbok project (__Tomi; Børre__) !!!Sma-normering + fst Noun file updated with the SGM decisions that are clear to us, verbs and adjectives left. Not too much work. __TODO:__ * missing: 1-2 days * sma-normering: 1-2 days * work for sma-oahpa fst's