corp: nob

wikipedia (DATE STAMP/URL)
    extracted with WikiExtractor.py
    http://giellatekno.uit.no/doc/doc/WikipediaAsCorpus.html
    Version: either documented in the svn log or the version preceding the one dated by the relevant check-in.

    http://www.tekstlab.uio.no/nowac/
    http://www.hf.uio.no/iln/om/organisasjon/tekstlab/
    http://www.hf.uio.no/iln/om/organisasjon/tekstlab/prosjekter/nowac/index.html
    http://www.hf.uio.no/iln/om/organisasjon/tekstlab/tjenester/nowac-frequency.html

    repair = remove non-UTF-8 documents.

    http://www.aclweb.org/anthology-new/W/W10/W10-1501.pdf

ordbanken: info on the database format of the nob data stemming from the ordbanken project

========= all spraakbanken data will be moved into a separate repository =========

spraakbanken: text resources from the Språkbanken page
    http://www.nb.no/Tilbud/Forske/Spraakbanken/Tilgjengelege-ressursar/Tekstressursar

    nbsb_nob_gold_corpus_20121120.txt is the sb gold corpus with date stamp:
    - one sentence per line
    - format: lemma_1 lemma_2 ... lemma_N

nob_avis: Norwegian newspaper material (nob)
    - for more info see the file spraakbanken/nob_avis/00_lesmeg.txt

Issues:
1. transform all avis corpus data into real XML format ==> TODO (partly done; see the XML sketch at the end of this file)
2. add an index to each sentence ==> TODO
3. check the character encoding and, if needed, transform it into UTF-8 ==> TODO (partly done; see the encoding sketch at the end of this file)
4. clean the corpus of "sentences" such as: ! " ( 1 . ==> see the cleanup sketch at the end of this file
5. correct occurrences of dirty preprocessing, such as deleted dashes

Ultimate goal:
- analyse and merge all nob text resources (avis, gold corpus, wikipedia, etc.) to create the nob language model for tasks such as FAD (e.g., general vs. specific use of a word)
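
XML sketch (issues 1 and 2): a minimal sketch, assuming the avis data is already
available one sentence per line; the element names (<corpus>, <s> with an "id"
attribute) and the file handling are illustrative assumptions, not a schema used
by this repository.

#!/usr/bin/env python3
# Wrap a one-sentence-per-line corpus file in a simple XML document and
# give every sentence a running "id" attribute (hypothetical format).
import sys
import xml.etree.ElementTree as ET

def lines_to_xml(in_path, out_path):
    root = ET.Element('corpus')
    with open(in_path, encoding='utf-8') as fin:
        for idx, line in enumerate(fin, start=1):
            sentence = line.strip()
            if not sentence:
                continue
            s = ET.SubElement(root, 's', id=str(idx))
            s.text = sentence
    ET.ElementTree(root).write(out_path, encoding='utf-8',
                               xml_declaration=True)

if __name__ == '__main__':
    lines_to_xml(sys.argv[1], sys.argv[2])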
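
Encoding sketch (issue 3, and the "repair = remove non-UTF-8 documents" step above):
a minimal sketch that reports whether a file is valid UTF-8 and, assuming the legacy
encoding is Latin-1 (an assumption; the actual source encoding is not documented
here), writes a re-encoded copy next to the original.

#!/usr/bin/env python3
# Detect non-UTF-8 files and re-encode them, assuming Latin-1 as the
# legacy encoding (assumption for illustration only).
import sys

def is_utf8(path):
    """Return True if the whole file decodes as UTF-8."""
    with open(path, 'rb') as f:
        data = f.read()
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

def reencode_latin1_to_utf8(src, dst):
    """Re-encode a file from (assumed) Latin-1 to UTF-8."""
    with open(src, encoding='latin-1') as fin, \
         open(dst, 'w', encoding='utf-8') as fout:
        fout.write(fin.read())

if __name__ == '__main__':
    for path in sys.argv[1:]:
        if is_utf8(path):
            print(path, 'OK (UTF-8)')
        else:
            print(path, 'not UTF-8, re-encoding')
            reencode_latin1_to_utf8(path, path + '.utf8')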
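
Cleanup sketch (issue 4): a minimal sketch, assuming one sentence per line; the
junk test (no alphabetic character at all, which catches lines such as ! " ( 1 .)
is an illustrative heuristic, not the project's actual cleanup rule.

#!/usr/bin/env python3
# Drop "sentences" that contain no alphabetic character from a
# one-sentence-per-line corpus file (file names are examples only).
import sys

def is_junk(sentence):
    # str.isalpha() also covers the Norwegian letters æ, ø and å.
    return not any(ch.isalpha() for ch in sentence)

def clean(in_path, out_path):
    with open(in_path, encoding='utf-8') as fin, \
         open(out_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            sentence = line.strip()
            if sentence and not is_junk(sentence):
                fout.write(sentence + '\n')

if __name__ == '__main__':
    clean(sys.argv[1], sys.argv[2])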