A short description of a possible way to acquire new lemmata from the
existing resources, based on the information found at
http://divvun.no/doc/ling/common.html

For instance, for sma:

1. resource location

   /usr/local/share/corp/gtbound/sma
   /usr/local/share/corp/gtfree/sma

2. tools: ccat

   http://divvun.no/doc/ling/catxml.html

   ~/gtsvn/gt/script/samiXMLParser> ./ccat -h
   Usage: ccat [FileName]
   Print the contents of a corpus file in XML format.
   The default is to print paragraphs with no type (=text type).
   The possible options include:
     -l <lang>  Process elements in language <lang>.
     -a         Print all text elements.
     -p         Print plain paragraphs. (default)
     -T         Print paragraphs with title type.
     -L         Print paragraphs with list type.
     -t         Print paragraphs with table type.
     -C         Print corrected xml-files with corrections.
     -ort       Print corrected xml-files with orthographical corrections.
     -synt      Print corrected xml-files with syntactical corrections.
     -lex       Print corrected xml-files with lexical corrections.
     -typos     Print corrections with tab-separated output.
     -S         Print the whole text one word per line. Errors are tab separated.
     -r <dir>   Recursively process directory dir and subdirs encountered.
     -h         Print this help message.

Ex. 1
Task: extract (only) sma text from a single file

   ccat -l sma /usr/local/share/corp/bound/sma/ficti/karijuse.txt.xml

Ex. 2
Task: extract (only) sma text from a single file in one-word-per-line format

   ccat -l sma -S /usr/local/share/corp/bound/sma/ficti/karijuse.txt.xml

--------------

Possible word acquisition steps:

1. extract the words (one word per line) as described above
2. filter and preprocess the output based on patterns
3. compare the new word/lemma list with the list of already existing
   words/lemmata (for this task, Ciprian will check in a simple XSLT
   stylesheet)
4. done!
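
Steps 2 and 3 of the acquisition procedure could be sketched in Python as follows. This is only an illustration and not the XSLT stylesheet mentioned in step 3. The function names, the word pattern, and the punctuation set are hypothetical choices. The sketch also assumes that in the -S output the word itself is the first tab-separated field, and it lowercases everything, which would need revisiting for proper nouns.

```python
import re
import unicodedata

# Assumption: a word candidate consists only of letters, optionally joined
# by internal hyphens. Numbers, URLs and similar tokens are discarded.
WORD_RE = re.compile(r"^[^\W\d_]+(?:-[^\W\d_]+)*$")

# Assumption: punctuation that may cling to a token in the corpus output.
PUNCT = ".,;:!?\"'()[]«»"


def candidate_words(lines):
    """Step 2: reduce one-word-per-line output to unique word candidates."""
    seen = set()
    for line in lines:
        # Keep only the first tab-separated field (assumed to be the word).
        token = line.split("\t", 1)[0].strip()
        # Normalise Unicode so that composed and decomposed letters compare equal.
        token = unicodedata.normalize("NFC", token).strip(PUNCT).lower()
        if token and WORD_RE.match(token) and token not in seen:
            seen.add(token)
            yield token


def new_words(candidates, known):
    """Step 3: keep the candidates that are not in the list of known words."""
    known_set = {unicodedata.normalize("NFC", w).lower() for w in known}
    return [w for w in candidates if w not in known_set]
```

The lines passed to candidate_words could be the output of the ccat -l sma -S call from Ex. 2, read from standard input or from a file. Note that the corpus yields inflected word forms, so a plain comparison against a lemma list will also report forms of lemmata that already exist; the result is a list of candidates to review, not a finished list of new lemmata.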