corp: nob

wikipedia (DATE STAMP/URL)
    extracted with WikiExtractor.py
    http://giellatekno.uit.no/doc/doc/WikipediaAsCorpus.html
    Version: either documented in the svn log or the version preceding the one dated by the relevant check-in.

    http://www.tekstlab.uio.no/nowac/
    http://www.hf.uio.no/iln/om/organisasjon/tekstlab/
    http://www.hf.uio.no/iln/om/organisasjon/tekstlab/prosjekter/nowac/index.html
    http://www.hf.uio.no/iln/om/organisasjon/tekstlab/tjenester/nowac-frequency.html

    repair = remove non-UTF-8 documents.

    http://www.aclweb.org/anthology-new/W/W10/W10-1501.pdf

ordbanken: info on the database format of the nob data stemming from the ordbanken project

========= all spraakbanken data will be moved into a separate repository =========

spraakbanken: text resources from the Språkbanken page
    http://www.nb.no/Tilbud/Forske/Spraakbanken/Tilgjengelege-ressursar/Tekstressursar

    nbsb_nob_gold_corpus_20121120.txt is the sb gold corpus with date stamp:
    - one sentence per line
    - format: lemma_1 lemma_2 ... lemma_N

nob_avis: Norwegian newspaper material (nob)
    - for more info see the file spraakbanken/nob_avis/00_lesmeg.txt

Issues:
1. transform all avis corpus data into real XML format ==> TODO (partly done; see the XML sketch at the end of this file)
2. add an index to each sentence ==> TODO
3. check the character encoding and, if needed, transform it into UTF-8 ==> TODO (partly done; see the encoding sketch at the end of this file)
4. clean the corpus of "sentences" such as: ! " ( 1 . ==> see the cleanup sketch at the end of this file
5. correct occurrences of dirty preprocessing, such as deleted dashes

Ultimate goal:
- analyse and merge all nob text resources (avis, gold corpus, wikipedia, etc.) to create the nob language model for tasks such as FAD (e.g., general vs. specific use of a word)
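
XML sketch (issues 1 and 2): a minimal sketch, assuming the avis data is already
available one sentence per line; the element names (<corpus>, <s> with an "id"
attribute) and the file handling are illustrative assumptions, not a schema used
by this repository.

#!/usr/bin/env python3
# Wrap a one-sentence-per-line corpus file in a simple XML document and
# give every sentence a running "id" attribute (hypothetical format).
import sys
import xml.etree.ElementTree as ET

def lines_to_xml(in_path, out_path):
    root = ET.Element('corpus')
    with open(in_path, encoding='utf-8') as fin:
        for idx, line in enumerate(fin, start=1):
            sentence = line.strip()
            if not sentence:
                continue
            s = ET.SubElement(root, 's', id=str(idx))
            s.text = sentence
    ET.ElementTree(root).write(out_path, encoding='utf-8',
                               xml_declaration=True)

if __name__ == '__main__':
    lines_to_xml(sys.argv[1], sys.argv[2])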
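
Encoding sketch (issue 3, and the "repair = remove non-UTF-8 documents" step above):
a minimal sketch that reports whether a file is valid UTF-8 and, assuming the legacy
encoding is Latin-1 (an assumption; the actual source encoding is not documented
here), writes a re-encoded copy next to the original.

#!/usr/bin/env python3
# Detect non-UTF-8 files and re-encode them, assuming Latin-1 as the
# legacy encoding (assumption for illustration only).
import sys

def is_utf8(path):
    """Return True if the whole file decodes as UTF-8."""
    with open(path, 'rb') as f:
        data = f.read()
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

def reencode_latin1_to_utf8(src, dst):
    """Re-encode a file from (assumed) Latin-1 to UTF-8."""
    with open(src, encoding='latin-1') as fin, \
         open(dst, 'w', encoding='utf-8') as fout:
        fout.write(fin.read())

if __name__ == '__main__':
    for path in sys.argv[1:]:
        if is_utf8(path):
            print(path, 'OK (UTF-8)')
        else:
            print(path, 'not UTF-8, re-encoding')
            reencode_latin1_to_utf8(path, path + '.utf8')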
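
Cleanup sketch (issue 4): a minimal sketch, assuming one sentence per line; the
junk test (no alphabetic character at all, which catches lines such as ! " ( 1 .)
is an illustrative heuristic, not the project's actual cleanup rule.

#!/usr/bin/env python3
# Drop "sentences" that contain no alphabetic character from a
# one-sentence-per-line corpus file (file names are examples only).
import sys

def is_junk(sentence):
    # str.isalpha() also covers the Norwegian letters æ, ø and å.
    return not any(ch.isalpha() for ch in sentence)

def clean(in_path, out_path):
    with open(in_path, encoding='utf-8') as fin, \
         open(out_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            sentence = line.strip()
            if sentence and not is_junk(sentence):
                fout.write(sentence + '\n')

if __name__ == '__main__':
    clean(sys.argv[1], sys.argv[2])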