Capitalisation

Here, we document two processes, one accepting initial capitalisation ("Dat" as well as "dat"), and the other accepting full capitalisation ("DAT" but not "DAt").

  1. Initial capitalization
  2. Capitalization of whole words

Initial capitalization

This is what the book says:

xfst[]: define initcap a (->) A, b (->) B, c (->) C,
d (->) D, e (->) E, f (->) F, g (->) G, h (->) H, i (->) I,
j (->) J, k (->) K, l (->) L, m (->) M, n (->) N, o (->) O,
p (->) P, q (->) Q, r (->) R, s (->) S, t (->) T, u (->) U,
v (->) V, w (->) W, x (->) X, y (->) Y, z (->) Z || .#. _ ;
This rule has been put in the file case.regex, and compiled to caseconv.fst in xfst. As a result, all initial capitals are downcased in analysis, but in generation every word also gets an alternative form with an initial capital letter. This is not what we want.
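
To make the generation side of this concrete, here is a minimal Python sketch of what the optional rule does to a single word. It is only a model of the behaviour described above, not the transducer itself, and the function name initcap_variants is hypothetical (it does not exist in xfst or in the project files).

```python
def initcap_variants(word):
    """Model the optional rule 'a (->) A ... || .#. _' for one word.

    Assumption: only the 26 letters a-z are covered, as in the rule above.
    """
    variants = {word}  # the rule is optional, so the unchanged form always survives
    if word and "a" <= word[0] <= "z":
        # Only the word-initial position (.#. _) may be capitalised.
        variants.add(word[0].upper() + word[1:])
    return variants
```

Under this model every lower case word comes out in two forms, for instance both "dat" and "Dat", which is the unwanted doubling mentioned above.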

Capitalization of whole words

The key file is allcaps.regex. It is modelled after the book, and works in the following way:

First, 'upper' is defined as the set of all capital letters, including the northern Sámi digraphs C1, D1 etc. Then, allcaps is defined as the set of relations 'a (->) A' etc. for all small/capital pairs that occur in the context '.#. upper* _ upper* .#.', i.e. between strings of upper case letters only.
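
The intended effect can be sketched in Python as follows. This models only the acceptance behaviour stated at the top of this page ("DAT" but not "DAt"), not the formal semantics of the xfst rule, and the real transducer may behave differently in detail. The name allcaps_candidates is hypothetical, and the sketch assumes plain ASCII capitals, leaving out the Sámi digraphs that the real 'upper' set contains.

```python
import string

# Assumption: ASCII capitals only; the real 'upper' set also holds C1, D1 etc.
UPPER = set(string.ascii_uppercase)


def allcaps_candidates(word):
    """Return the forms that would be handed on to the analyser."""
    candidates = {word}  # the word is always tried as written
    if word and all(ch in UPPER for ch in word):
        # Only a word written entirely in capitals gets a lower case reading.
        candidates.add(word.lower())
    return candidates
```

With this model "DAT" is tried both as "DAT" and as "dat", whereas the mixed form "DAt" is passed on unchanged.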

The resulting binary file allcaps.fst is compiled by the Makefile. In principle, the parser sme.fst could have been composed with allcaps.fst into a single transducer (sme.fst .o. allcaps.fst), but this is not done, since the resulting transducer would have been very large indeed (cf. the discussion of this issue in the book). Instead, the issue is handled in a lookup script file. At present, this file looks as follows (cf. the discussion of lookup script files in the book):

analyzer        /home/trond/gt/sme/bin/sme.fst
allcaps         /home/trond/gt/sme/bin/allcaps.fst

allcaps analyzer

The lookup script should be used as follows (when standing in sme/):

.. | lookup -flags mbTT -f src/cap-sme | ...

Note that the files are referenced with absolute, not relative, paths (the relative paths would here have been ../bin/sme.fst etc.). For a user other than trond to get this to work, the user name trond in the paths must be replaced, e.g. with /home/lena/gt/sme/bin/sme.fst etc. For this reason, the cap-sme file is not included in the cvs repository yet (this is why the link to it does not work). Xerox has been notified, and has answered, cf.

(quote)
Date: Wednesday, 12 February 2003 18:39:06 +0100
paths in lookup scripts.
Reply-To: tamas.gaal@xrce.xerox.com

Trond,

There is some possibility of using Unix environment variables in lookup, see

http://www.xrce.xerox.com/competencies/content-analysis/fssoft/docs/lookup-97/lookup97.html

It may not solve your problem - but please read it first: towards the end, there is reference to environment variables like

setenv LOOKUP_SCRIPT_BASE ...

If it is not of enough then the interface should be improved. While it is not a complicated matter, we are short of able people now so you may have to use the full pathnames in your scripts until it gets improved.
(end of quote)
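
As long as full path names are needed, one possible workaround is to generate the lookup script for each user rather than keeping one hand-edited copy. The Python sketch below is only an illustration of that idea: the function name render_lookup_script is hypothetical, and it assumes the same two-entry layout and file names as the script shown above.

```python
from pathlib import PurePosixPath


def render_lookup_script(sme_dir):
    """Build the text of a lookup script with absolute paths under sme_dir.

    Assumption: the transducers live in <sme_dir>/bin, as in the script above.
    """
    bin_dir = PurePosixPath(sme_dir) / "bin"
    lines = [
        # Name column padded to 16 characters, as in the hand-written file.
        f"{'analyzer':<16}{bin_dir / 'sme.fst'}",
        f"{'allcaps':<16}{bin_dir / 'allcaps.fst'}",
        "",
        # Strategy line: apply allcaps first, then the analyzer.
        "allcaps analyzer",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example for the user lena; the output could be redirected to src/cap-sme.
    print(render_lookup_script("/home/lena/gt/sme"), end="")
```

Each user would run this with the path to their own sme/ directory, so the generated file would not need to be checked in.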

Trond Trosterud trond.trosterud@hum.uit.no
Last modified: Mon Nov 1 21:34:10 2004