Note: This document is a direct quote from the documentation directory of vislcg.
The parser works on cohorts, readings, and tags.
"<der>" "der" <rel> INDP nG nN NOM "der" <aloc> ADV "<var>" "være" <x+PCP2> <vK> <v-med> <til^vp> V IMPF AKT "var" ADJ nG S IDF NOM "var" <deadj> ADV "vare" <event> <vt+TID> <va> <vi-ved> V IMP "<engang>" "engang" <atemp> ADV "engang" KS "<en>" "en" ART UTR S IDF "en" NUM UTR S "én" PERS 3S ACC "<bog>" "bog-1" <semio> <+om> N UTR S IDF NOM "bog-2" N UTR S IDF NOM "bog-2" N UTR P IDF NOM "boge" V IMP "<$.>"
The CORRECTIONS section is a vislcg-specific extension of the CG formalism. See sections 2.1.5a. and 2.5a. for details.
The minimal set of sections in a rule file is:
    DELIMITERS
    CONSTRAINTS
or
    DELIMITERS
    MAPPINGS
or
    DELIMITERS
    CORRECTIONS
SUBSTITUTE
A SUBSTITUTE operation replaces tags with other tags in a reading.
General form:
"<wordform>" SUBSTITUTE (tag1 tag2 ...) (tag3 tag4 ...) TARGET (tag5 tag6 ...) IF (context1) (context2) ... ;
The first list of tags (tag1 tag2 ...) is the list of tags to remove from the targeted reading. The second tag list (tag3 tag4 ...) is the list of tags to insert into the reading.
If the contextual tests hold, any tag in the removal list that appears in the reading is deleted from the reading. Note that the tags in the removal list must be in the same order as those in the reading.
The list of insertions is then inserted into the reading in place of the last removed tag. Note that the insertion will take place even if just one of the tags in the removal list appeared in the reading.
Often, the tags in the removal list should also appear in the target of the rule.
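The removal and insertion steps described above can be sketched in Python. This is only an illustration under stated assumptions, not vislcg's implementation: the function name `substitute` is hypothetical, the rule's target and contextual tests are assumed to have matched already, and the note about the order of the removal tags is not modelled.

```python
def substitute(reading, removal, insertion):
    """Sketch of SUBSTITUTE on one reading (a list of tags).

    Assumption: the rule's target and contextual tests already hold.
    The ordering requirement on the removal list is NOT modelled here.
    """
    kept = []
    insert_at = None  # output position where the last removed tag stood
    for tag in reading:
        if tag in removal:
            insert_at = len(kept)
        else:
            kept.append(tag)
    if insert_at is None:
        # No removal tag appeared in the reading, so nothing is inserted.
        return list(reading)
    return kept[:insert_at] + list(insertion) + kept[insert_at:]


# Both removal tags present: C takes the place of B, the last removed tag.
print(substitute(["x", "A", "y", "B", "z"], ("A", "B"), ("C",)))
# Only one removal tag present: the insertion still takes place.
print(substitute(["x", "A", "y"], ("A", "B"), ("C",)))
```

Tracking the output position of the last removed tag is one way to read "in place of the last removed tag"; other readings of that sentence are possible.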
APPEND
An APPEND operation adds an entire reading (a new line), not just a sequence of tags as ADD and MAP operations do. No TARGET is used, because APPEND addresses an entire cohort (of readings) rather than individual readings.
General form:
"<wordform>" APPEND ("baseform" tag1 tag2) IF (context1) (context2) ... ;
Examples
    # Remove the tags A and B from the target reading and insert the tag C.
    "<something>" SUBSTITUTE (A B) (C) TARGET (D) ;

    # Append the reading "another" A B C to cohorts with the wordform
    # "<another>" in the given context
    "<another>" APPEND ("another" A B C) IF (1 (D)) ;
Syntax
The + operation is subtly different from the way CG-2 behaves. In vislcg, for a set I = S1 + S2, the + operator does not make the Cartesian product of the two operand sets, but instead asserts that to be a member of the set I, a reading must be a member of both S1 and S2. This means that the + operation in vislcg is properly the intersection operation, not the Cartesian product or concatenation.
Sets constructed using the + operator in rules written for CG-2 should behave identically when used in vislcg and CG-2.
However, there may be subtle differences.
E.g.: In vislcg, the following two sets are equivalent:
SET I1 = (A B) + (C D) ;
SET I2 = (C D) + (A B) ;
Because the + operation is the intersection operation in vislcg, the following readings will all be members of both I1 and I2:
    A B C D
    C D A B
    A C D B
[ Note the reading A C D B. It is a member because it matches both (A B) and (C D). A D C B wouldn't be a member; it matches (A B) but not (C D). ]
In CG-2, because the + operation is the concatenation operation, the two sets are not equivalent. Only the reading A B C D is a member of I1 and only C D A B is a member of I2.
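The difference between the two interpretations of + can be sketched in Python. The helper names below are hypothetical. Two assumptions are made: a tag list matches a reading when its tags occur in the reading in the same order, possibly with other tags in between (inferred from the note on A C D B and A D C B above), and CG-2 concatenation is modelled as matching the concatenated tag list.

```python
def matches(tag_list, reading):
    """True if the tags occur in the reading in the given order (gaps allowed)."""
    remaining = iter(reading)
    # 'tag in remaining' consumes the iterator up to the first hit,
    # so each later tag must be found after the previous one.
    return all(tag in remaining for tag in tag_list)


def in_vislcg_sum(s1, s2, reading):
    """vislcg: S1 + S2 is an intersection, so both operands must match."""
    return matches(s1, reading) and matches(s2, reading)


def in_cg2_sum(s1, s2, reading):
    """CG-2 (assumed model): S1 + S2 is the concatenated tag list."""
    return matches(tuple(s1) + tuple(s2), reading)


AB, CD = ("A", "B"), ("C", "D")
for text in ["A B C D", "C D A B", "A C D B", "A D C B"]:
    r = text.split()
    print(text,
          "| vislcg I1:", in_vislcg_sum(AB, CD, r),
          "I2:", in_vislcg_sum(CD, AB, r),
          "| CG-2 I1:", in_cg2_sum(AB, CD, r),
          "I2:", in_cg2_sum(CD, AB, r))
```

Under these assumptions the vislcg columns do not depend on operand order, while the CG-2 columns do.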
Testing wordforms:
Contextual tests are tested against the wordforms of cohorts, too. Here, the wordform is interpreted as a reading with one tag: the wordform. E.g.: The test (1 ("<$.>")) will match a cohort which is a full stop.
In a negated context such as
    (NOT context0 LINK context1 LINK context2)
the negation is applied last and the rule is interpreted as
    ! (context0 && (context1 && context2)) /* C or Perl-like syntax */
[Tapanainen 1996; 2.4.5. page 33]: "Here, the negation is applied last".
In negated LINKed contexts, such as
    (context0 LINK NOT context1 LINK context2)
the negation is applied only to context1, not to "context1 LINK context2". I.e.: the LINK to context2 will only be tested if context0 matches /and/ the linked context1 does not. The above context test is therefore interpreted as
    context0 && ( (!context1) && context2)
Combining the two above cases, the contextual test
    (NOT context0 LINK NOT context1 LINK context2)
is interpreted as
    ! (context0 && ( (!context1) && context2))
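The three interpretations above can be written as plain boolean functions and compared over all truth values. This is a minimal Python sketch; the function names are hypothetical, and each argument stands for whether the corresponding context matched.

```python
from itertools import product


def negated_first(c0, c1, c2):
    # (NOT context0 LINK context1 LINK context2): negation applied last
    return not (c0 and (c1 and c2))


def negated_link(c0, c1, c2):
    # (context0 LINK NOT context1 LINK context2): only context1 is negated
    return c0 and ((not c1) and c2)


def negated_both(c0, c1, c2):
    # (NOT context0 LINK NOT context1 LINK context2): both cases combined
    return not (c0 and ((not c1) and c2))


# Print the full truth table for the three forms.
for c0, c1, c2 in product([False, True], repeat=3):
    print(c0, c1, c2,
          negated_first(c0, c1, c2),
          negated_link(c0, c1, c2),
          negated_both(c0, c1, c2))
```

The table makes it visible that the combined form is simply the negation of the negated-LINK form.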
    (*1 VFIN LINK 0C P)
The next cohort to the right which has a reading belonging to the set VFIN is unambiguously P.
The above test is NOT equivalent to either
    (*1C VFIN + P)
To the right, there is a cohort which is unambiguously both VFIN and P.
or
    (*1C VFIN LINK 0 P)
The next cohort to the right which is unambiguously VFIN also has a reading which is P.
or even (also with a careful link)
    (*1C VFIN LINK 0C P)
The next occurrence of unambiguous VFIN to the right is also unambiguously P.
LINKs may be both careful and negated. E.g.:
    (*1 A LINK NOT 1C B)
The next occurrence of A to the right is immediately followed by a cohort which is not unambiguously B.
2.4.5.2. Continuous search
Continuous search is subtly different from CG-2. E.g.:
    (**1C A LINK 1 B)
There is a cohort to the right which is unambiguously A and followed by a cohort with a reading that is B.
This seems to be inconsistent with CG-2 [Tapanainen 1996, 2.4.5. p.33]: "In careful mode, scanning stops at the first occurrence of A where the linked tests hold, i.e. the rule means that the next occurrence of A followed by B is unambiguously A."
In vislcg, scanning will not stop at the first occurrence of A followed by B. In continuous search, the LINK will never be tested unless the preceding test (1C A) holds, even in careful mode.
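The two scanning behaviours can be contrasted with a small Python sketch. It is an illustration only: the function names are hypothetical, the CG-2 behaviour is modelled solely from the sentence quoted above, and cohorts are represented as lists of readings, each reading being a set of tags. The data mirrors the five-cohort input used as the example in this section.

```python
def has_reading(cohort, tag):
    """Some reading in the cohort carries the tag."""
    return any(tag in reading for reading in cohort)


def unambiguous(cohort, tag):
    """Every reading in the cohort carries the tag (careful mode)."""
    return all(tag in reading for reading in cohort)


def vislcg_test(cohorts, start):
    """(**1C A LINK 1 B), vislcg reading: the LINK is only tried at
    cohorts that are unambiguously A; scanning continues otherwise."""
    for i in range(start + 1, len(cohorts) - 1):
        if unambiguous(cohorts[i], "A") and has_reading(cohorts[i + 1], "B"):
            return True
    return False


def cg2_test(cohorts, start):
    """(**1C A LINK 1 B), CG-2 reading as quoted: stop at the first A
    followed by B, then require that this A is unambiguous."""
    for i in range(start + 1, len(cohorts) - 1):
        if has_reading(cohorts[i], "A") and has_reading(cohorts[i + 1], "B"):
            return unambiguous(cohorts[i], "A")
    return False


cohorts = [
    [{"X"}],          # "<0>" (the target)
    [{"A"}, {"X"}],   # "<1>" ambiguous between A and X
    [{"B"}],          # "<2>"
    [{"A"}, {"A"}],   # "<3>" unambiguously A
    [{"B"}],          # "<4>"
]
print("vislcg:", vislcg_test(cohorts, 0))
print("CG-2:  ", cg2_test(cohorts, 0))
```

In this model the CG-2 scan stops at "<1>", which is not unambiguously A, while the vislcg scan skips it and succeeds at "<3>".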
E.g.: In vislcg, but not in CG-2, the following input should satisfy the above contextual test (the target of the rule being the reading of "<0>"):
"<0>" X "<1>" A X "<2>" B "<3>" A A "<4>" B2.5a. Corrections
A correction rule modifies the information in the readings. Most often, this will be used to recover from lexical errors.
2.5a.1. Correction Operations
There are two operations for correction rules.
- The SUBSTITUTE operation removes specific tags from a reading and inserts new ones. A schematic SUBSTITUTE rule is
    "<WORDFORM>" SUBSTITUTE (REMOVAL TAGS) (INSERTION TAGS) TARGET (TARGET) IF (TEST1) (TEST2) ... ;
  The removal and insertion parts of a substitute rule are lists of tags. If the target reading has one or more of the removal tags, these will be removed from the reading and replaced by the insertion tags.
- The APPEND operation appends a new reading to a cohort. A schematic APPEND rule is
    "<WORDFORM>" APPEND (INSERTION TAGS) IF (TEST1) (TEST2) ... ;
  The APPEND operation does not take a target because it operates on cohorts, not readings.
2.6. Rule order
The rule, target, and application ordering is not the same as for CG-2.
The --no-reordering flag may be set, forcing the parser to always apply rules in the order of appearance in the rule file.
This ordering may change arbitrarily in future versions.
Currently, reordering is done using the following priority list:
- SELECT before REMOVE.
- SELECT rules targeting more preferred tags before rules targeting less preferred targets.
- REMOVE rules targeting less preferred tags before rules targeting more preferred targets.
- By order of appearance in rule file.
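One way to picture the priority list is as a sort key, sketched here in Python. This is not how vislcg is implemented. The `Rule` class and the numeric `preference` field are hypothetical, since the text does not say how tag preference is measured, and the list is assumed to be lexicographic, with each item breaking ties left by the one before it.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    kind: str        # "SELECT" or "REMOVE"
    preference: int  # hypothetical score: higher = more preferred target tag
    position: int    # order of appearance in the rule file


def priority(rule):
    """Sort key: a smaller tuple means the rule is applied earlier."""
    if rule.kind == "SELECT":
        # SELECT first; more preferred targets first among SELECT rules.
        return (0, -rule.preference, rule.position)
    # REMOVE afterwards; less preferred targets first among REMOVE rules.
    return (1, rule.preference, rule.position)


rules = [
    Rule("REMOVE", 5, 0),
    Rule("SELECT", 1, 1),
    Rule("REMOVE", 2, 2),
    Rule("SELECT", 3, 3),
]
ordered = sorted(rules, key=priority)
for rule in ordered:
    print(rule.kind, rule.preference, rule.position)
```

With --no-reordering the equivalent key would simply be the position in the rule file.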
[ A possible future rule ordering is:
- Wordform rules before tag/set targets.
- SELECT before REMOVE.
- Negated contexts (NOT) first
- "Simple" rules before "complex" rules.
- Local positions before searches.
- Careful rules first. ]
2.6.1. Section order
2.6.2. Target order
2.6.3. Order in the rule file
2.6.4. Application order of cohorts
3. Debugging
3.2. Debug Mode
The debug mode of the vislcg parser is similar to that of CG-2: in debug mode, vislcg will issue a warning for every reading carrying a <Correct!> tag that the rule file would have removed if not run in debug mode.
To debug a rule file, invoke the vislcg command with the --debug option together with the --grammar option. The benchmark corpus has initially been tagged by a parser and subsequently evaluated by human annotators, who have added the critical tag <Correct!> to each reading that was judged to be accurately tagged. The tagged benchmark corpus is fed into the parser using the new rule file as a test grammar. The command line therefore names the rule file and redirects the benchmark corpus into vislcg as input, as indicated by the less-than sign:
vislcg --grammar rulefilename --debug < benchmarkcorpusname
This corresponds to piping the corpus into vislcg by using the "cat" command:
cat benchmarkcorpusname | vislcg --grammar rulefilename --debug
You can debug a rule file (e.g. sandbox.txt) with IT Centre's benchmark corpus by typing:
vislcg --grammar /home/cg-group/sandbox.txt --debug < /home/cg-group/bs-benchmark
or
cat /home/cg-group/bs-benchmark | vislcg --grammar /home/cg-group/sandbox.txt --debug
Bibliography:
[Tapanainen, 1996]: Pasi Tapanainen. The Constraint Grammar parser CG-2. Publications of the Department of General Linguistics, University of Helsinki, no. 27. 1996. ISBN 951-45-7331-5
http://visl.sdc.dk Last modified: Fri Oct 10 11:25:22 EEST 2003