!!Errors addressed so far (May 2011):

* bad original files - these yield invalid XML
** these are caught today
* encoding errors - these yield valid XML, but meaningless characters
** utf-as-macroman
** utf-as-latin1
** utf-as-html-hex
** utf-as-html-entity
* scanning/OCR errors - these yield meaningful characters, but meaningless text
** đ-as-ó, etc.
* bad sentence delimitation: one real sentence is one fragment in one language but 3 fragments in the other, so the alignment breaks down
* files in {{freecorpus/converted/sme/admin/others/}}
** {{STM200420050011000SE_PDFS.pdf.xml}} and {{STM200420050044000SE_PDFS.pdf.xml}} have encoding errors: đ is represented by another character, and the documents are full of it, so these files should be deleted
** {{OTP200620070025000SE_PDFS.pdf.xml}} has paragraphs whose content is '--------', so it should be deleted
** {{STM200320040010000SE_PDFA.pdf.xml}} has so many errors that it should be rescanned
** {{uito-ohpenplana.txt.xml}}: the original file is corrupted
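
The encoding errors listed above (UTF-8 text wrongly decoded as MacRoman or Latin-1, or left as HTML entities) can be illustrated with a minimal Python sketch. The function names {{garble}}, {{repair}} and {{unescape_entities}} are hypothetical and not part of the corpus tools. The sketch also assumes that "html-hex" refers to numeric character references such as {{&#x111;}}, which the list itself does not confirm.

```python
import html
from typing import Optional


def garble(text: str, wrong_encoding: str) -> str:
    """Simulate the error: UTF-8 bytes decoded with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(text: str, wrong_encoding: str) -> Optional[str]:
    """Try to undo the error by reversing the wrong decoding step.

    Returns None when the text cannot be mapped back to valid UTF-8,
    which usually means it was not garbled in this particular way.
    """
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


def unescape_entities(text: str) -> str:
    """Replace HTML named and numeric character references with characters."""
    return html.unescape(text)


if __name__ == "__main__":
    # Show what each wrong codec does to one character, and the repair.
    for codec in ("mac_roman", "latin-1"):
        broken = garble("đ", codec)
        print(codec, repr(broken), repr(repair(broken, codec)))
    print(repr(unescape_entities("&#x111; &aacute;")))
```

A {{None}} result from {{repair}} only means that this particular reversal does not apply. Plain ASCII text passes through unchanged, so a check like this cannot confirm on its own that a file is clean.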