Numerical Algorithms Group: NAG

xmltex: A non validating (and not 100% conforming) namespace aware XML parser implemented in TeX


Date: 2000-02-02

David Carlisle

Introduction

xmltex implements a non validating parser for documents matching the W3C XML Namespaces Recommendation. The system may just be used to parse the file (expanding entity references and normalising namespace declarations) in which case it records a trace of the parse on the terminal. Normally however the information from the parse is used to trigger TeX typesetting code. Declarations (in TeX syntax) are provided as part of xmltex to associate TeX code with the start and end of each XML element, attributes, processing instructions, and with unicode character data.

Installation

The xmltex parser itself does not require LaTeX. It may be loaded into initex to produce a format capable of parsing XML files. However such a format would have no convenient commands for typesetting, and so normally xmltex will be used on top of an existing format, normally LaTeX. In this section we assume that the document to be processed is called document.xml.

Using xmltex as an input to the LaTeX command

LaTeX requires a document in TeX syntax, not XML. To process document.xml, first produce a two line file called document.tex of the following form:

\def\xmlfile{document.xml}
\input xmltex.tex

Do not put any other commands in this file!

You may then process the document with either of the commands: latex document or latex document.tex or the equivalent procedure in your TeX environment.

Using xmltex as a TeX format built on LaTeX

You may prefer to set up xmltex as a format in its own right. This may speed things up slightly (as xmltex.tex does not have to be read each time) but more importantly perhaps it allows the XML file to be processed directly without needing to make the .tex wrapper.

To make a format you will need a command such as the following, depending on your TeX system.

initex &latex xmltex
initex \&latex xmltex
tex -ini &latex xmltex
tex -ini \&latex xmltex

This will produce a format file xmltex.fmt. You should then be able to make an xmltex command by copying the way the latex command is defined in terms of latex.fmt. Depending on the TeX system, this might be a symbolic link, a shell script, a batch file, or a configuration option in a setup menu.

Making an xmltex format `from scratch'

Whilst it may be convenient to build an xmltex format as above, starting from the LaTeX format, you may prefer instead to work with an initex with no existing format file. Even if you wish to use a standard LaTeX it may be preferable to make a TeX input file that first inputs latex.ltx then xmltex.tex. In particular this will allow you to have a different hyphenation and language customisation for xmltex than for LaTeX. Many of the features of the language support in LaTeX are related to modifying the input syntax to be more convenient. Such changes are not needed in xmltex as the input syntax is always XML. Some language files may change the meaning of such characters as <, which would break the xmltex parser. Also, rather than using latex.ltx you could in principle use a modified docstrip install file and produce a `cut down' latex that omits features that are not going to be used in xmltex.

Unfortunately the support for this method of building xmltex (and access to non English hyphenation generally) is not fully designed and is totally undocumented.

Using xmltex

xmltex by default `knows' nothing about any particular type of XML file, and so needs to load external files containing specific information. This section describes how the information in the XML file determines which files will be loaded.

  1. If the file begins with a Byte Order Mark, the default encoding is set to utf-16. Otherwise the default encoding is utf-8.
  2. If (after an optional BOM) the document begins with an XML declaration that specifies an encoding, this encoding will be used, otherwise the default encoding will be used. A file with name of the form encoding.xmt will be loaded that maps the requested encoding to Unicode positions. (It is an error if this file does not exist for the requested encoding.)
  3. If the document has a DOCTYPE declaration that includes a local subset then this will be parsed. If any external DTD entity is referenced (by declaring and then referencing a parameter entity) then the SYSTEM and PUBLIC identifiers of this entity will be looked up in a catalogue (to be described below). If either identifier is known in the catalogue the corresponding xmltex package (often with .xmt extension) will be loaded.
  4. After any local subset has been processed, if the DOCTYPE specifies an external entity, the PUBLIC and/or SYSTEM identifiers of the external DTD file will be similarly looked up, and a corresponding xmltex file loaded if known.
  5. As each element is processed, it may be `known' to xmltex by virtue of one of the packages loaded, or it may be unknown. If it is unknown then if it is in a declared namespace, the namespace URI (not the prefix) is looked up in the xmltex catalogue. If the catalogue specifies an xmltex package for this namespace it will be loaded. If the element is not in a namespace, then the element name will be looked up in the catalogue.
  6. If after all these steps the element is still unknown then, depending on the configuration setting, either a warning or an error will be displayed. (Currently only the warning is implemented.)

The xmltex Catalogue

As discussed above, xmltex requires a mapping between PUBLIC and SYSTEM identifiers, namespace URI, and element names, to files of TeX code. This mapping is implemented by the following commands:

\NAMESPACE{URI}{xmt-file}
\PUBLIC{FPI}{file}
\SYSTEM{URI}{file}
\NAME{element-name}{xmt-file}
\XMLNS{element-name}{URI}

As described above, if the first argument of one of these commands matches the string specified in the XML source file, the corresponding TeX commands in the file specified in the second argument are loaded. The PUBLIC and SYSTEM catalogue entries may also be used to control which XML files should be input in response to external entity references. The \XMLNS command is rather different: if an element in the null namespace does not have any definition attached to it, this declaration forces the default namespace to the given URI. The catalogue lookup is then repeated. This allows, for example, documents beginning <html> to be coerced into the xhtml namespace.

These commands may be placed in a configuration file: either xmltex.cfg, in which case they apply to all documents, or a configuration file `\jobname.cfg' (e.g. document.cfg for the document.xml example in the Installation section), in which case the commands apply only to the specified document.
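
As an illustration, a per-document configuration file for the document.xml example might contain catalogue entries such as the following. This is only a sketch. The command syntax is the one listed above and the XHTML namespace URI is the standard one, but the FPI, the DTD name, the element name report and all the .xmt file names are hypothetical and would need to match files available on your system.

```tex
% document.cfg (sketch; identifiers and file names below are made up)
% Load xhtml.xmt when an element in the XHTML namespace is first met.
\NAMESPACE{http://www.w3.org/1999/xhtml}{xhtml.xmt}
% Load report.xmt for a DTD known by either of its identifiers.
\PUBLIC{-//Example//DTD Report 1.0//EN}{report.xmt}
\SYSTEM{report.dtd}{report.xmt}
% Load report.xmt when an unknown report element in no namespace is met.
\NAME{report}{report.xmt}
% Coerce a bare html element into the XHTML namespace, then repeat the lookup.
\XMLNS{html}{http://www.w3.org/1999/xhtml}
```

Placing the same entries in xmltex.cfg instead would make the mapping apply to every document.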

Configuring xmltex

In addition to the `catalogue' commands described earlier there are other commands that may be placed in the configuration files.

If a format is being made, there are essentially two copies of xmltex.cfg that may play a role. The configuration file input when the format is made will control catalogue entries and packages built into the format. A possibly different xmltex.cfg may be placed in the input path of `normal' TeX; this will then be used for additional information loaded on each run.

In either case, a separate configuration file specific to the given XML document may also be used (which is loaded immediately after xmltex.cfg).

Stopping xmltex

xmltex should stop after the end of the document element has been processed. If things go wrong and you end up at the interactive * prompt you might want to exit with <?xmltex \stop?>.

xmltex package files

xmltex package files are the link between the XML markup and TeX typesetting code. They are written in TeX (rather than XML) syntax and may load directly or indirectly other files, including LaTeX class and package files. For example a file loaded for a particular document type may directly execute \LoadClass{article}, or alternatively it may cause some XML element in the document to execute \documentclass{article}. In either case the document will suffer the dubious benefit of being formatted based on the style implemented in article.cls. Beware though that the package files may be loaded at strange times (for example, the first time a given namespace is declared in a document), and so the code should be written to work if loaded inside a local group.

Characters in xmltex package files have their normal LaTeX meanings except that line endings are ignored so that you do not need to add a % to the end of lines in macro code. Unlike fd file conventions, other white space is not ignored.
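
A minimal sketch of such a package file follows. It assumes the four-argument form \XMLelement{name}{attribute declarations}{begin code}{end code}, which is not spelled out in this section, so check it against your copy of xmltex. The element names report and para and the file name report.xmt are hypothetical.

```tex
% report.xmt (hypothetical package file for a report document type)
% Assumed form: \XMLelement{name}{attributes}{begin code}{end code}
% The document element starts and ends the LaTeX document.
\XMLelement{report}
  {}
  {\documentclass{article}\begin{document}}
  {\end{document}}
% Each para element becomes a paragraph of its own.
\XMLelement{para}
  {}
  {\par}
  {\par}
```

No % is needed at the ends of these lines, since line endings are ignored in package files.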

The available commands are:

XML processing

xmltex tries as far as possible to be a fully conforming non validating parser. It fails in the following respects.

Accessing TeX

In theory you should be able to control the document just by suitable code specified by \XMLelement and friends, but sometimes it may be necessary to `tweak' the output by placing commands directly in the source.

Two mechanisms are available to do this.

Bugs

None, of course.

Don't Read Past This Point

This section discusses some of the more experimental features of xmltex that may get a cleaner syntax (or be removed, as a bad idea) in later releases, and also describes some of the internal interfaces (which are also subject to change).

Input Encodings and States

At any point while processing a document, xmltex is in one of two states: tex or xml.

States

In the xml state, < and & are the only two characters that trigger special markup codes. Other characters, such as !, >, =, … may be used in certain XML constructs as markup, but unless some code has been triggered by < they are treated simply as character data. All characters above 127 are `active' to TeX and are used to translate the input encoding to UTF-8. All internal character handling is based on UTF-8, as described below. Some characters in the ASCII range (below 127) are also active by default, mainly punctuation characters used in XML constructs, such as the ones listed above. Some or all of the others may be activated using the \ActivateASCII command, which allows special typesetting rules to be activated for the characters, at some cost in processing speed.

In the tex state, characters in the ASCII range have their usual TeX meanings, so letters are `catcode 11' and may be used in TeX control sequences, \ is the escape character, & the table cell separator, etc. Characters above 127 have the meanings defined by the current encoding, just as in the xml state; this probably means that they are unusable in TeX code, except for the special case of referring to XML element names in the first argument to \XMLelement and related commands.

Encodings

Whenever a new (XML or TeX) file is input by the xmltex system the encoding is first switched to UTF-8. At the end of the input the encoding is returned to whatever was current before the file was read. The encoding current while the file is read is determined by the encoding pseudo-attribute on the XML or text declaration in the case of XML files, or by the \FileEncoding command for TeX files. Note that the encoding mechanism is only triggered by xmltex file includes. Once an xmltex package file is loaded it may include other TeX files by \input or \includepackage; these input commands will be transparent to the xmltex encoding system. The vast majority of TeX macro packages only use ASCII characters so this should not be a problem.

Note that if the \includepackage occurs directly in the xmltex package file, the TeX code will be included with a known encoding: the one specified in the xmltex package, or UTF-8. If however the \includepackage is included in code specified by \XMLelement, then it will be executed with whatever encoding is current in the document at the point that element is reached. Before xmltex executes the code for that element it will switch to the tex state, thus normalising the ASCII characters, but characters above 127 will not have predefined definitions in this case.

Internally everything is stored as UTF-8. So `aux' and `toc' files will be in UTF-8 even if the document (or parts of the document) used different encodings.

To specify a new encoding, if it is an 8 bit encoding that matches ASCII in the printable ASCII range, one just needs to produce a file with name encoding.xmt (in lowercase, on case sensitive systems). This should consist of a series of \InputCharacter commands, giving the input character slot and the equivalent Unicode. If an encoding is specified in this manner, character data will be converted to UTF-8 by expansion and so ligatures and inter letter kerns will be preserved. (Conversely, if characters are accessed by character references, &#1234;, then TeX arithmetic is used to decode the information and ligature information will be lost.) For some large character sets, especially for Asian languages, these mechanisms will probably not prove to be sufficient. Some alternatives are being investigated, but in the short term it may be necessary to always use UTF-8 if the input encoding is not strictly a one byte extension of the ASCII code page.
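
For instance, the first few entries of a mapping file for ISO-8859-2 might look like the following sketch. It assumes that \InputCharacter takes two braced decimal arguments, the input slot followed by the Unicode code point; that argument form is an assumption, not something stated above. The slot and code point values themselves are those of ISO-8859-2.

```tex
% iso-8859-2.xmt (sketch; assumed argument form \InputCharacter{slot}{unicode})
% Slot 161 (0xA1) is LATIN CAPITAL LETTER A WITH OGONEK, U+0104 (260)
\InputCharacter{161}{260}
% Slot 162 (0xA2) is BREVE, U+02D8 (728)
\InputCharacter{162}{728}
% Slot 163 (0xA3) is LATIN CAPITAL LETTER L WITH STROKE, U+0141 (321)
\InputCharacter{163}{321}
```

A complete file would continue in the same way for every slot above 127 that the encoding defines.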

xmltex Package Commands

You can use arbitrary TeX commands in an xmltex package, although you should be aware that the file may be input into a local group, at the point in a document that a particular namespace is first used, for example. There are however some specific commands designed to be used in the begin or end code of \XMLelement.

Character Data Internals

int.ext. xml ext. mixedcsn typeout
dxabcxabc xabc (12)xabc (12)xabc (12)
cxabxab xab (12)xab (12)xab (12)
bxaxa xa (12)xa (12)xa (12)
axxx xxx (12) (!)
ayxx x&#123;x (12) (e)
azx\az x &#123;&#123;x (12) (&lt;)
<<< <<< (12) (<)


Last updated: 2000-02-02.
Copyright 2000 David Carlisle, NAG