!!!Background document

!!Project sketch

Parallel text between North, Lule and South Sámi and possibly other languages. In practice this will primarily concern texts between Norwegian and the three Sámi languages.

Work tasks: two person-months + overhead to UiT

# Process parallel texts from the state administration in the corpus (programmer)
# Align the texts at sentence and word level (computational linguist)
# Parallel sentences and words as part of computer-assisted translation in a translation tool (programmer, computational linguist)

The result of tasks 1-3 will be a descriptive database of the ministry's texts, and an interface the translators can use to compare their translations with earlier ones. After that, many more person-months are needed to work the material up into an administrative dictionary:

# Lexicographic work on the parallel lists (philologist x 3 languages)
# Extending the terminological basis to more languages

A rough estimate would be about 6 person-months per language.

!!!Project plan

# Collect files, for each smX with parallel texts in nob (nno, eng, swe, smX?) (__Børre__)
## sme: XXX words
### [Governmental whitepapers|../ling/corpus_norwegianwhitepapers.html]
### Governmental web page documents, {{freecorpus/converted/sme/admin/depts/regjeringen.no/}}
### Saami parliament files: {{freecorpus/converted/sme/admin/sd/}}
## smj: YYY words
### Governmental pdf files, {{freecorpus/converted/smj/admin/depts/}}
### Governmental web page documents, {{freecorpus/converted/smj/admin/depts/regjeringen.no/}}
## sma: ZZZ words
### Governmental pdf files, {{freecorpus/converted/sma/admin/depts/}}
### Governmental web page documents, {{freecorpus/converted/sma/admin/depts/regjeringen.no/}}
# Sentence align (__Ciprian, Børre?__)
# Word align (__Francis__)
## Make parallel wordlists
## Check for relevant vocabulary (nob frequency deviant from normal, i.e. nob words with a higher frequency in the material than in a big reference corpus).
What we would expect is: (freq in big ref corpus / wordcount of ref corpus) x wordcount of the material.
# Manual lexicographic work (__Lexicographers__)
## Go through the word pair lists and evaluate them
## The goal here is not a normative evaluation, but a descriptive one:
### Remove erroneous alignments and keep the good ones
## A normative term collection (''these are the term pairs we want'') is outside the scope of this phase of the project.
# Integrate the resulting list into Autshumato (__Ciprian, etc.__)

!!!Old monthly reports

!!March

The nob-sme files are in the folder {{$BIGGIES/gt/sme/corp/forvaltningsordbok/}}.

!!February

* [First 2000 words (sorted by confidence), have a look|2000.html]
* [First 10000 words (sorted by nob), have a look|10000.html]

!!December

# Collect files, for each smX with parallel texts in nob (nno, eng, swe, smX?) (__Børre__)
## sme:
### [Governmental whitepapers|../ling/corpus_norwegianwhitepapers.html] - 16 documents, 948384 words (in the pdfs mentioned in the above doc)
### Governmental web page documents, {{freecorpus/converted/sme/admin/depts/regjeringen.no/}} - 1384 documents, 615852 words
### Saami parliament files: {{freecorpus/converted/sme/admin/sd/}} - 929 documents, 220377 words
## smj: YYY words
### Governmental pdf files, {{freecorpus/converted/smj/admin/depts/}} - XXX documents, YYY words
### Governmental web page documents, {{freecorpus/converted/smj/admin/depts/regjeringen.no/}} - XXX documents, YYY words
## sma: ZZZ words
### Governmental pdf files, {{freecorpus/converted/sma/admin/depts/}} - XXX documents, YYY words
### Governmental web page documents, {{freecorpus/converted/sma/admin/depts/regjeringen.no/}} - XXX documents, YYY words
# Sentence align (__Ciprian, Børre?__)
# Word align (__Francis__)
## Make parallel wordlists
## Check for relevant vocabulary (nob frequency deviant from normal, i.e. nob words with a higher frequency in the material than in a big reference corpus).
What we would expect is: (freq in big ref corpus / wordcount of ref corpus) x wordcount of the material.
# Manual lexicographic work (__Lexicographers__)
## Go through the word pair lists and evaluate them
## The goal here is not a normative evaluation, but a descriptive one:
### Remove erroneous alignments and keep the good ones
## A normative term collection (''these are the term pairs we want'') is outside the scope of this phase of the project.
# Integrate the resulting list into Autshumato (__Ciprian, etc.__)

!!Original deadlines

# Collect files
## nob-sme: december
## nob-smj: january
## nob-sma: january
# Sentence align
## nob-sme: january
## nob-smj: january
## nob-sma: january
# Word align
## nob-sme: january
## nob-smj: january
## nob-sma: january
# Term extraction
## nob-sme: january
## nob-smj: january
## nob-sma: january
# Term evaluation
## nob-sme: february
## nob-smj: february
## nob-sma: february
# Autshumato integration
## nob-sme: february
## nob-smj: february
## nob-sma: february
# Evaluation, report
## nob-sme: march
## nob-smj: march
## nob-sma: march
# March 31st: final report due.

!!Obsolete documentation?

!How to convert files to xml

{{{
Inside $GTFREE:

find orig -type f | grep -v .svn | grep -v .xsl | grep -v .DS_Store | xargs convert2xml2.pl

The output is «thanks, you gave me $numArgs files to process», and then a . or |
for each file that is processed: . means success, | means failure to convert the file.
For much more verbose output to the terminal, use the --debug option.

After the conversion, get a summary of the converted files this way:

java -Xmx2048m net.sf.saxon.Transform -it main $GTHOME/gt/script/corpus/ym_corpus_info.xsl inDir=$GTFREE/converted

This results in the file corpus_report/corpus_summary.xml.

To find out which and how many files have no content, use this command:

java -Xmx2048m net.sf.saxon.Transform -it main ../corpus/get-empty-docs.xsl inFile=`pwd`/corpus_report/corpus_summary.xml

This results in the file out_emptyFiles/correp_emptyFiles.xml. The second line tells how many empty files there are.
}}}
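The progress marks from the conversion step (one «.» or «|» per file) can be kept for later inspection. This is a sketch, not part of the original workflow: it assumes convert2xml2.pl writes its progress marks to stdout, and the log file name {{conversion.log}} is ours, not part of the project setup.

```shell
# Sketch, assuming convert2xml2.pl prints one "." (success) or "|" (failure)
# per file to stdout. "conversion.log" is an illustrative name.
find orig -type f | grep -v .svn | grep -v .xsl | grep -v .DS_Store \
    | xargs convert2xml2.pl | tee conversion.log

# Count the failed conversions afterwards: each "|" in the log is one failed file.
grep -o '|' conversion.log | wc -l
```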
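The term-candidate check in the project plan above compares a nob word's observed frequency in the material with its expected frequency, (freq in big ref corpus / wordcount of ref corpus) x wordcount of the material. A minimal sketch of that arithmetic, with made-up numbers (none of them taken from the project corpora):

```shell
# Illustrative numbers only; not taken from the project data.
ref_freq=1200        # occurrences of the nob word in the reference corpus
ref_size=100000000   # word count of the reference corpus
mat_size=948384      # word count of the nob material
obs_freq=85          # occurrences of the same word in the material

# expected = (ref_freq / ref_size) x mat_size
expected=$(awk -v f="$ref_freq" -v r="$ref_size" -v m="$mat_size" \
    'BEGIN { printf "%.2f", (f / r) * m }')

# A word is a term candidate when it occurs clearly more often than expected.
echo "expected $expected, observed $obs_freq"
```

Words whose observed frequency is far above the expected value are the ones passed on to the lexicographers.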