DTDGenerator

version 7.0

A tool to generate XML DTDs


Purpose

DTDGenerator is a program that takes an XML document as input and produces a Document Type Definition (DTD) as output.

The aim of the program is to give you a quick start in writing a DTD. The generated DTD is only one of the many possible DTDs to which the input document conforms, so you will typically want to examine it and edit it to describe your intended documents more precisely.

The program was formerly issued as part of the Saxon product. It is now completely independent of Saxon, though it can still be downloaded from the Saxon site.

DTDGenerator runs as a SAX ContentHandler, and can be used with any XML parser that implements the JAXP 1.1 interface: examples are Crimson and Xerces from the Apache project, and the AElfred parser that is embedded in Saxon.


Usage

Web service

You can use DTDGenerator without installing the software by submitting an XML file to the online service provided by Paul Tchistopolskii at http://www.pault.com/Xmltube/dtdgen.html.

If you use this service, ensure that the XML file you upload contains no references to other local files such as a DTD or an external entity.

Note: this service is currently using an older version of DTDGenerator, so it may produce a slightly different result from the current version.

Software required

DTDGenerator requires the following software to be installed:

The DTDGenerator code itself is issued as a JAR file, dtdgen.jar.

Ensure that all these components are on your classpath. The JAXP 1.1 mechanism ensures that DTDGenerator will pick up any suitable XML parser automatically from the classpath; to select a particular parser, set the system property javax.xml.parsers.SAXParserFactory to the name of that parser's factory class.
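To check which parser JAXP will pick up, a small stand-alone program along the following lines may help. It is only a sketch and is not part of the DTDGenerator distribution: the class name ParserSelection and its method are invented for illustration. The Xerces factory class named in the comment applies only if Xerces is on your classpath.

```java
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper, not part of DTDGenerator: reports which SAX parser
// factory the JAXP lookup selects, optionally after forcing a choice.
class ParserSelection {

    /** JAXP system property that names the SAXParserFactory implementation to load. */
    static final String FACTORY_PROPERTY = "javax.xml.parsers.SAXParserFactory";

    /** Returns the class name of the factory that JAXP selects at this moment. */
    static String selectedFactoryClassName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // Optional first argument: the factory class to force, for example
        // org.apache.xerces.jaxp.SAXParserFactoryImpl when Xerces is on the classpath.
        // Naming a class that cannot be loaded makes newInstance() fail with a
        // FactoryConfigurationError.
        if (args.length > 0) {
            System.setProperty(FACTORY_PROPERTY, args[0]);
        }
        System.out.println("SAX parser factory in use: " + selectedFactoryClassName());
    }
}
```

The same property can be set without writing any code by using the standard -D option of the java command, with the factory class of your chosen parser:

java -Djavax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl DTDGenerator inputfile >outputfile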

From the command line, enter:

java DTDGenerator inputfile >outputfile

The input file must be an XML document; typically it will have no DTD. If it does have a DTD, the DTD may be used by the XML parser but it will be ignored by the DTDGenerator utility.

The output file will be an XML external document type definition.

The input file is not modified; if you want to edit it to refer to the generated DTD, you must do this yourself.


What DTDGenerator does

The program makes an internal list of all the elements and attributes that appear in your document, noting how they are nested, and noting which elements contain character data.
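The kind of bookkeeping described above can be sketched as a SAX handler. This is an illustration only, not the DTDGenerator source: the names StructureCollector and ElementInfo are invented here, and ignoring whitespace-only text is an assumption of the sketch. It records only which children and attributes were seen, not the counts or ordering that a real generator would also need.

```java
import java.io.File;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collects, per element name, the child elements and
// attributes seen and whether any character data appeared.
class StructureCollector extends DefaultHandler {

    static final class ElementInfo {
        final Set<String> children = new LinkedHashSet<>();
        final Set<String> attributes = new LinkedHashSet<>();
        boolean hasCharacterData;
    }

    /** Everything learned so far, keyed by the element name as written in the document. */
    final Map<String, ElementInfo> elements = new TreeMap<>();

    /** Names of the currently open elements, innermost first. */
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        ElementInfo info = elements.computeIfAbsent(qName, k -> new ElementInfo());
        for (int i = 0; i < atts.getLength(); i++) {
            info.attributes.add(atts.getQName(i));
        }
        if (!open.isEmpty()) {
            elements.get(open.peek()).children.add(qName); // note the nesting
        }
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Assumption of this sketch: whitespace-only text does not count as character data.
        if (!open.isEmpty() && !new String(ch, start, length).trim().isEmpty()) {
            elements.get(open.peek()).hasCharacterData = true;
        }
    }

    /** Parses the source with whatever parser JAXP selects and returns the filled collector. */
    static StructureCollector collect(InputSource source) throws Exception {
        StructureCollector collector = new StructureCollector();
        SAXParserFactory.newInstance().newSAXParser().parse(source, collector);
        return collector;
    }

    public static void main(String[] args) throws Exception {
        StructureCollector c = collect(new InputSource(new File(args[0]).toURI().toString()));
        for (Map.Entry<String, ElementInfo> e : c.elements.entrySet()) {
            ElementInfo info = e.getValue();
            System.out.println(e.getKey() + " children=" + info.children
                    + " attributes=" + info.attributes + " text=" + info.hasCharacterData);
        }
    }
}
```

Because the default JAXP factory is not namespace-aware, the handler sees element and attribute names exactly as they are written in the document. Like DTDGenerator, the sketch keeps no copy of the document content, so its memory use depends on the number of distinct names and the nesting depth rather than on the size of the input.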

When the document has been completely processed, the DTD is generated according to the following rules:

The numeric constants used by these rules are defined in the source code; if you want to change them, you can either edit and recompile the source code, or write a subclass that uses different values.

What DTDGenerator doesn't do

The program makes no attempt to determine whether different elements or attributes (that is, elements or attributes with different names) have the same structure. The output DTD will never contain parameter entities to define such common structures. This means that for documents that allow flexible rules on nesting (like the XHTML structure, where any inline element can contain any other inline element), the DTD will contain a great deal of redundancy.

DTDGenerator makes no attempt to recognize IDREF or IDREFS attributes, nor other specialized attribute types such as ENTITY or ENTITIES.

DTDGenerator is not namespace-aware. The names of elements and attributes used in the DTD are identical to the QNames used in the source document, and different QNames are assumed to relate to different elements. Namespace declarations appearing in the source document are treated as ordinary attributes, so they will be included in the output DTD.

DTDGenerator does not produce XML Schemas. There are various tools that allow a DTD to be converted to a schema.

DTDGenerator works from a single input XML document. It makes no attempt to compare the structure of multiple input documents. A technique that has been used successfully is to concatenate multiple documents within a single wrapper element. This can be done using XML entities, or by means of a SAX filter, or more simply by file concatenation at the textual level (provided that the documents contain no prolog). On completion, simply delete the declaration of the wrapper element from the generated DTD.
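Concatenation at the textual level could be done along the following lines. This is a minimal sketch, not part of DTDGenerator: the class WrapDocuments and the wrapper element name "wrapper" are invented for illustration, and the files are assumed to be UTF-8. The sketch refuses any document that appears to have a prolog rather than trying to strip it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: joins several prolog-free XML documents inside one wrapper element.
class WrapDocuments {

    /** Returns the documents, one per line, enclosed in a single wrapper element. */
    static String wrap(String wrapperName, List<String> documents) {
        StringBuilder out = new StringBuilder();
        out.append('<').append(wrapperName).append(">\n");
        for (String doc : documents) {
            String body = doc.trim();
            // Textual concatenation only works when there is no XML declaration or DOCTYPE.
            if (body.startsWith("<?xml") || body.contains("<!DOCTYPE")) {
                throw new IllegalArgumentException("document has a prolog; remove it before concatenating");
            }
            out.append(body).append('\n');
        }
        out.append("</").append(wrapperName).append(">\n");
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Each argument is the path of one input document; the result goes to standard output.
        List<String> documents = new ArrayList<>();
        for (String path : args) {
            documents.add(new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8));
        }
        System.out.print(wrap("wrapper", documents));
    }
}
```

The combined output can then be given to DTDGenerator as its input file. Choose a wrapper name that does not occur in your documents, so that its declaration is easy to find and delete in the generated DTD.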

DTDGenerator does not produce any entity or notation declarations in the output DTD.

Fine-tuning the resulting DTD

The resulting DTD will often contain rules that are either too restrictive or too liberal. The DTD may be too restrictive if it prohibits constructs that do not appear in this document, but might legitimately appear in others. It may be too liberal if it fails to detect patterns that are inherent to the structure: for example, the order of elements within a parent element. These limitations are inherent in any attempt to infer general rules from a particular example document.

In general, therefore, you will need to iterate the process. You have a choice:

In a few unusual cases DTDGenerator will create a DTD which is invalid, or one to which the document does not conform. You will then have to edit the DTD before you can use it. The known cases are:


Performance

Because DTDGenerator is a pure SAX application, it should run almost as fast as the underlying XML parser.

The memory requirements are very small. Most of the data structures increase linearly either with the size of the DTD or with the depth of nesting of elements in the source document. The only structure that grows linearly with the size of the source document is the list of unique values encountered for an attribute, which is used to decide whether the attribute is an ID, and this is capped at a maximum of 100,000 values.

In an exercise carried out at Software AG, where the DTD was required as input to the design study for a Tamino XML database, a 500 MB source XML file was processed in a little over 6 minutes.


Conditions of Use

DTDGenerator may be freely used, distributed, or modified, under the terms of the Mozilla Public License Version 1.0.

DTDGenerator was originally developed (as part of Saxon) by Michael Kay working as an employee of International Computers Limited, a member of the Fujitsu group. The current version (which is independent of Saxon) was developed by Michael Kay under the terms of the original ICL license, with the sponsorship of his current employers, Software AG.


Michael H. Kay
4 September 2001