The available texts will be moved to a corpus database, which is accessed by the users through a web interface.
(Figure: the overall architecture of the system.)
Where the material is available in several languages, such as the New Testament, a parallel corpus is created.
This document describes the plans for implementing the corpus database and the query system.
All the material concerning the corpus project is currently stored in
the directory gt/cwb/
under CVS.
The CWB toolbox is installed on cochise; the usage of the tools is briefly introduced in the section IMS Corpus Workbench: demo.
The XML-format of the corpus texts is documented in section XML-format of the corpus files.
In addition to the actual texts, the corpus database contains other textual information such as author, date, genre and region, which can be exploited for example in studies of regional or historical variation. This other textual information is stored in separate header files, documented in the section Meta information.
The workflow of converting the available text material to the corpus database includes the following steps:
The next step is to split the text into sentences and word tokens. The preprocessor is documented in preprocessor.html. The preprocessor tool may have to be adapted to the corpus project, for example if the text contains some XML formatting. These modifications are not yet implemented nor planned.
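The splitting into sentences and word tokens can be sketched roughly as follows. This is a minimal Python illustration, not the actual preprocessor, which handles abbreviations, numbers and punctuation far more carefully:

```python
import re

def preprocess(text):
    """Split raw text into sentences and word tokens.

    A deliberately naive sketch: sentences end at ., ! or ?,
    and words and punctuation marks become separate tokens.
    """
    # Split on whitespace that follows sentence-final punctuation.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Tokenize each sentence into word and punctuation tokens.
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences if s]

print(preprocess("The flies fly. Birds sing!"))
```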
Step three is implemented by analysis and disambiguation tools.
As long as there are problems with either preprocessing or analysis and disambiguation, step three, the manual check, is hard work. Once the other tools are reliable, only spot checks are needed.
The meta information is described in section Meta information.
The conversion to XML-format is described in XML-format of the corpus files
The conversion to IMS-format is not yet implemented, nor fully planned. In this part, we rely on the help of Textlaboratoriet at the University of Oslo.
There are a couple of tools installed for cleaning the texts: antiword and wvWare. Antiword does simple Word-to-text and Word-to-HTML conversion; wvWare supports more formats and conversion options.
The documentation of antiword is in antiword.man. Example usage, converting a UTF-8 encoded MS Word document to the 7-bit project-internal format:
$ antiword -m UTF-8.txt file.doc | utf8-7bit.pl > file.txt
Information on wvWare can be found in the package's man page:
$ man wvWare
<text>
  <sentence>
    <token form="The" lemma="the" POS="DET" />
    <token form="flies" lemma="fly" POS="N" />
  </sentence>
</text>

Optionally, one can encode an ambiguous token with several readings:

<token form="flies">
  <reading lemma="fly" POS="N" />
  <reading lemma="fly" POS="V" />
</token>
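The token format above can be processed with any standard XML parser; a minimal sketch using Python's built-in ElementTree, with the element and attribute names taken from the examples above:

```python
import xml.etree.ElementTree as ET

doc = """<text><sentence>
  <token form="The" lemma="the" POS="DET" />
  <token form="flies">
    <reading lemma="fly" POS="N" />
    <reading lemma="fly" POS="V" />
  </token>
</sentence></text>"""

root = ET.fromstring(doc)
for token in root.iter("token"):
    # A token either carries its single analysis as attributes,
    # or lists several ambiguous <reading> elements.
    readings = token.findall("reading") or [token]
    for r in readings:
        print(token.get("form"), r.get("lemma"), r.get("POS"))
```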
There is a first version of the DTD, corpus.dtd, for the format. In addition, there is a file sme_tagset.ent which contains the names of the tag classes. This is supposed to make the DTD more flexible, since the tag classes may differ between languages.
The conversion from CG2 output to XML is handled by the script convert2xml; the script requires the tag file korpustags.txt to get the tagsets right.
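The mapping performed by convert2xml can be illustrated with a simplified Python sketch. This is an illustration under assumptions, not the actual script: it assumes standard CG2 cohort output (a '"&lt;form&gt;"' line followed by indented '"lemma" TAG TAG ...' reading lines) and simply treats the first tag of each reading as POS, whereas the real script classifies tags via korpustags.txt:

```python
def cg2_to_xml(lines):
    """Sketch of the CG2-cohort -> token-XML mapping."""
    out, form, readings = [], None, []

    def flush():
        # Emit the previous cohort, if any, in the token-XML format.
        if form is None:
            return
        if len(readings) == 1:
            lemma, pos = readings[0]
            out.append(f'<token form="{form}" lemma="{lemma}" POS="{pos}" />')
        else:
            out.append(f'<token form="{form}">')
            out.extend(f'  <reading lemma="{l}" POS="{p}" />' for l, p in readings)
            out.append('</token>')

    for line in lines:
        if line.startswith('"<'):          # word-form line: "<flies>"
            flush()
            form, readings = line.strip()[2:-2], []
        elif line.strip().startswith('"'):  # reading line: "fly" N Sg
            lemma, tags = line.strip()[1:].split('"', 1)
            readings.append((lemma, tags.split()[0]))
    flush()
    return out

cg2 = ['"<flies>"', '\t"fly" N Sg', '\t"fly" V Pres']
print("\n".join(cg2_to_xml(cg2)))
```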
In the applications, Perl modules such as XML::Twig are used for parsing XML. Emacs is a fairly good tool for editing XML, but it might be a good idea to install a separate XML processor as well. Apache's Xerces seems to be a good and widely used tool for XML parsing and generation.
The structural information is encoded in XML-format, following for example the CES standard. There would then be three categories of information for each corpus: global information about the text and its content (author, character set, etc.), corresponding to the TEI header; primary data, which includes structural units of the text, abbreviations and so on; and linguistic annotation, including morphological and syntactic information, alignment, etc. Queries to the documents would then be made with tools designed for processing XML.
However, the query system offered by IMS Corpus Workbench does not support SGML to its full extent; the structural information offered by the IMS tools is rather restricted. The query engine CQP uses regular expressions in corpus queries, which is a desired feature. The structural information cannot be queried by CQP at all; it is only available in the results.
The global information can be transferred to a CQP-searchable format, for example by transferring the header information to attributes in IMS. The header information may also be stored as a string in a single attribute.
The exact format of the corpus header files is not yet planned.
The "Corpus administrator's Manual" describes in detail how the text corpus is transformed to the internal representation used by the IMS toolbox. As we have decided to use XML as the basic format of the corpora, suitable conversion tools from XML to the format required by IMS have to be developed.
There will be conversion scripts from XML-format to TEI and IMS corpus workbench, provided by Textlaboratory.
The corpus files themselves will be placed in
/usr/local/share/corp/
for now. The subdirectory
doc
contains the original texts in their original
formatting. Later, there should be a separate directory for all the
corpus files.
The location of the corpora has to be planned with Roy; the files can
be quite big and need not be backed up daily (weekly or monthly
will be OK). Perhaps some globally accessible, separate filesystem
should be used, for example the directory /corpora
.
At the moment the corpus files are stored in CVS. The corpus files are modified all the time for testing purposes, so CVS is fine. Also, the size of the corpus is fairly small, about 34M altogether. The tagged corpus is obviously much bigger, but during the development phase this will not cause any problems.
However, using CVS for storing large corpora becomes impossible if the files get much bigger. This is because every user has his own copy of all the files, and the modifications between versions that are stored in the repository may also grow. The size of an IMS-format corpus can be some 10-50 times bigger than the original raw text, depending on the amount of tags (the number is just a hasty estimate); with the current 34M of raw text, that would mean roughly 340M-1.7G.
The version of the software is 3.0 and the installed archive name was
cwb-2.2.b72-i386-linux.tar.gz
. The up-to-date information
was available at
ftp://ftp.ims.uni-stuttgart.de/pub/outgoing/cwb-beta/index.html.
The software is installed to directory
/usr/local/cwb
. The environment variable PATH has to
be updated:
export PATH=$PATH:/usr/local/cwb/bin
There are specific corpus registry files which contain information on
the corpus, such as where the data is stored. The registry files should
be kept in one place, perhaps in the same place as the
corpora, in the directory /corpora/registry
. The environment
variable CORPUS_REGISTRY
has to be set.
The corpus contains tokens (words) and other positional attributes such as part-of-speech tags. Each attribute is arranged in its own column, and the columns are separated by tabs.
word POS ETC.
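Producing these tab-separated lines is straightforward; a minimal Python sketch (the example tokens and columns are illustrative only):

```python
# One token per line; the positional attributes (word form, lemma,
# POS, ...) are tab-separated columns.
tokens = [
    ("The",   "the", "DET"),
    ("flies", "fly", "N"),
]
vrt = "\n".join("\t".join(cols) for cols in tokens)
print(vrt)
```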
There are the following tag categories:
It is possible to mark for example the beginning and end of a sentence in the corpus file by using SGML-like markers. Whether we should use them or not depends on what benefits they may give us, seen from the IMS framework point of view. Changing the tag CLB etc. to SGML-like markers is not a problem, but it is unclear to what extent it helps either parsing or corpus processing.
Large units of discourse information are:

We have to find out what kind of information it is possible to extract from different types of documents, and how much of the structural information can be extracted automatically.
In Microsoft Word format, the information is in the underlying representation. A priori, it should be possible to write an MSW macro to turn this into textual information prior to the "save as enriched text" command that we use to convert MSW documents to our internal format. Seen from a disambiguation point of view, information on paragraphs and bulletpoint lists is clearly a valuable resource, if we can write rules that rely on such information (demand finite verbs from sentences, not from titles, parenthesis fragments or bulletpoint items).
IMS Corpus Workbench is now installed on cochise and can be tested with two
demo corpora. There is an English demo corpus consisting of Charles
Dickens novels and a German demo corpus of law texts. The corpora are
accessed using the corpus query processor CQP. To get CQP working, add
these lines to your .bashrc
:
export PATH=$PATH:/usr/local/cwb/bin
export CORPUS_REGISTRY=/usr/local/cwb/registry
Start CQP by typing
cqp
at the shell prompt. Leave the program by typing
exit;
or
Ctrl-D
. I recommend turning off the highlighting with
set Highlighting no;
The command
show;
shows the installed corpora. To select
the Dickens corpus, type
DICKENS;
To make a query, follow the instructions in the CQP Tutorial (path:
/usr/local/cwb/doc/CQP-Tutorial.2up.pdf
).
To select the corpus STME1029, type
STME1029;
at the cqp prompt.
The tags used in the corpus are listed in the tag list. Commented lines are marked with '%'; a line that starts with a hash (#) marks the tag class, e.g. POS. Below it is the list of tags belonging to that class; in the case of POS: N, Adj, V, etc.
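The tag list format described above could be read, for instance, like this (a small Python sketch; the sample lines are hypothetical):

```python
def parse_taglist(lines):
    """Parse the tag list: '%' lines are comments, a line starting
    with '#' names a tag class (e.g. POS), and the following lines
    list the tags that belong to that class."""
    classes, current = {}, None
    for line in lines:
        line = line.strip()
        if not line or line.startswith('%'):
            continue                      # skip comments and blank lines
        if line.startswith('#'):
            current = line[1:].strip()    # start a new tag class
            classes[current] = []
        elif current is not None:
            classes[current].append(line) # tag belonging to the class
    return classes

sample = ["% comment", "#POS", "N", "Adj", "V", "#Case", "Nom", "Gen"]
print(parse_taglist(sample))
```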
The corpus file is first converted to a format where each word is on
its own line, followed by its base form and the tags associated with
it. Tags are separated by TAB. See the file stme1029.vrt (in the
directory /usr/local/cwb/demo/stme1029
) for an example. The
conversion from CG2 output to the word-list format is done
automatically by the script convert2cwb
.
cesHeader starts the header. It has the following sections:
<fileDesc> </fileDesc>
<encodingDesc> </encodingDesc>
<profileDesc> </profileDesc>
<revisionDesc> </revisionDesc>

fileDesc is for the bibliographic description of the corpus. I describe only the content elements.