Meeting with Text Laboratory, Oslo 10.5.2004

Present: Trond, Saara, Lars Nygaard, Kristin Hagen and Janne Bondi Johannessen

The discussion was mostly about the technical details of the corpus interface. Also some other issues such as treebanks and cg were mentioned.

Treebanks: It is easy to create treebanks with VISL Phrase Structure Grammar Compiler, vislpsg. There are pedagocical programs, where the vislpsg is used: Word of VISL, Visual Interactive Syntax Learning and for Norwegian: GREI.

The Text Laboratory is using cg1 instead of cg2 and vislcg as we are, so there is not much to share.

It was agreed that the tools developed for corpus interface in Textlab are available for the Sami project. Sami project offers help in testing and developing the tools. We decided to share the access to each others' projects.

The conversion of the plain corpus text to web-searchable format involves three steps:

text ----> xml -----> ims -----> web

Text to XML: Conversion of vislcg output to xml-notation

There was a lenghty discussion about the different corpus encoding standards. The usage of some standard encoding such as TEI or CES was generally preferred. However, TEI was considered to be tangled and too extensive for our needs. CES is simpler but it has some drawbacks: the DTDs are not working properly and we found it dubious that the documentation was last updated in 1996. We thought it is important to have working DTDs; they should not be bypassed. Also there are not many tools available for eithe standard, so they should be written anyway.

The conclusion was to avoid both standards, and use our own notation, which is however kept TEI-compatible. This means that the notation and the structure is kept so simple that it is possible convert the format to TEI if necessary in the future.

The conversion from text to XML involves appending the header information to the file. Or should the header information be stored to another file, in order the xml version to be more easily regenerated? There is a conversion script written by Lars, which can be modified for converting vislcg output to xml, or the script can easily be rewritten.

We have to decide whether the xml format is the base format for annotated corpus, involving analyses and lemmas as xml-entities. The other option would be to stick with the vislcg output, that is the only format of the analysed corpuses we have for Sami. However, with suitable conversion tools, the xml-format is as flexible as the vislcg output. And perhaps more suitable for other possible needs in the future?

XML to IMS

The conversion from xml to ims Corpus Workbench involves converting the xml format to tabular format which is the input to corpus encoding tools provided by ims. The tabular format consists of the word, lemma, and the analysis either each tag in its own column or all the tags in one string (which is the way Textlab has implemented it so far). The advantage of having all the analysis in one string is that then the ambiguous words having more than one analysis can be stored and searched through the interface. The ims does not offer any means to cope with ambiguity.

The disadvantage of having all the analyses in one column is that all the searches are based on regular expressions, which slows down the search, but not unthinkably much. The other possibility suggested was to store the unambiguous words in tag-per-column format and the ambiguous in string format and having one attribute indicating the ambiguity. This would reduce the amount of time-consuming searches but complicate the queries (anyway hidden from the basic user).

Another thing which ims does not support is the searches to header information. Ims offers "structural attributes" which indicate e.g. clause boundaries, but does not offer searches to them. The meta-information of the corpus text is however important especially in the Sami project where we have diverse sources of texts e.g. produced by Sami speaking people in different countries.

One way to solve the problem would be to extract the header information from the xml-file and store it as a string in one attribute in the ims-format. Another solution is to have an attribute for each piece of header information, when the searches can be done straight to the attributes and not by regexps.

Lars has tool for xml-to-ims conversion which can be modified for our needs. There is a perl-module xmltwig for easy parsing xml.

IMS to web

Lars is developing a new interface for corpus searches, which is adaptable for different needs. The interface is based on xml-definitions, so it can be recreated for different applications.

The IMS provides perl modules which should be used in the queries. The interface looked good, but is not yet finished, so we will take part in testing and perhaps developing.

Other

IMS provides tools for sentence-alignment, the tools are based on the lenght of the sentences, and can (and must) be used without dictionary. For the Sami-Norwegian alignment, the Textlab's tools are available for analysing Norwegian texts. We have to ensure the compatibility of the preprocessors, so that they agree on sentence boundaries.

It would be interesting to have a cg2-to-IMS query filter to find the sentences, where a cg2-rule matches in the corpus. The perl-module RecDescent for parsing could be useful.

For testing the cq2-rules against correct corpus, it would be useful to have an emacs tool that finds the wrong disambiguation and moves to the rule that caused it in the other window. As in Lingoft. This would require knowledge of Emacs-Lisp.

For the creation of correct corpora, Saara will provide a tool in Emacs that makes it easier to mark the correct readings.

Saara Huhmarniemi saara.huhmarniemi@helsinki.fi
Last modified: Wed May 12 11:47:52 2004