Extension:ConTeXtXML

An extension for editing `/Command` subpages

The ConTeXtXML extension is a new wiki feature specifically designed to edit the ConTeXt command reference pages (the ones that live under the /Command/ URL).

It does this by intercepting the creation of new wiki pages below /Command/, and using a ContentHandler extension to maintain those pages. The text model of those pages is contextxml, which is a special XML format developed for documenting ConTeXt commands that is based on the interface XML files by Wolfgang Schuster.

For details of the XML format and the subtree structure below /Command/, see the pages Command and Help:Command. This page documents some features of the wiki extension itself.

Building on `wikitext`

The core of the extension is made up of two connected php classes:

ConTeXtXMLContentHandler, which extends WikitextContentHandler
ConTeXtXMLContent, which extends WikitextContent

Together they ensure that even though the declared page format is CONTENT_FORMAT_XML (which expands to the mime-type text/xml) and the page model is contextxml instead of wikitext, it remains possible to use wiki code. The extension achieves this by using a preprocessor on the XML data that converts it to wiki code that can then be used by the normal mediawiki page viewing and parser code. At the same time, it keeps the XML format available for edits to take place on, so that any documentation text that is added by the user(s) can be easily extracted and exported for other uses outside of the wiki.

It turns out that for this to work, some tweaks have to be made. Either I have not understood the mediawiki documentation well, or there are issues with extending WikitextContent, or perhaps even both. Anyway:

The content_models mediawiki SQL database table needed an extra row with the values 4,contextxml. Added manually as I could not figure out how to do this automatically at extension registration time.
WikitextContent does not like being subclassed, so some functions from the parent:: needed to be copied wholesale instead of just wrapping a bit of code around the parent implementation.

The ConTeXtXMLContent also runs various checks on the XML before allowing the user to save the page. It uses a hand-written XML parser because it not only verifies the XML well-formedness, it also performs various checks on the textual content, as well as making sure that only documentation is added, and that nothing is removed from the main XML database information.

Building on `Article`

When a user loads an existing page, that page is normally of class Article (except for some special cases). The Article class sets the page content model to wikitext, which makes it use the standard WikitextContent and WikitextContentHandler. Because that would not work for the ConTeXtXML pages, there is a third class:

CmdPage, which extends Article

This class is really small. It only exists as a coatrack for using ConTeXtXMLContent. It may not even be really needed in the current implementation, but it could prove useful for future extensions.

And then the `Hooks`

The extension uses a set of hooks to link into the mediawiki processing:

`ArticleFromTitle`

Creates a CmdPage if the wiki page title starts with /Command.

`ContentHandlerDefaultModelFor`

Sets the content model to contextxml if the wiki page title starts with /Command.

`ArticleAfterFetchContentObject`

On load, this checks whether the page content on the designated harddisk location has changed. If yes, it will replace the text of the mediawiki revision with the content of the file on the harddisk. This is so that there is an easy interface for integrating updated versions of the interface xml files from Wolfgang.

`EditFormPreloadText`

This fills the edit area for newly created /Command pages from the file on the harddisk.

Generating the wikitext code for page views and previews

All the above is written in php. But since I am a complete noob in php and rather at home in Lua, I decided to write the conversion from XML to wiki format as an mtxrun script. The script is called mtx-wikipage.lua, and it converts our special XML format into wiki code.

I call it an mtxrun script, but really it only takes an option for the input file (in the case of section editing, this is a temporary file generated by php) and an optional output file (useful for debugging). Besides that, it is almost pure lua. It uses an integrated tiny expat-style XML parser and a few handwritten xml tree processing functions to create wiki code.

The penalty for calling an external program from php is relatively small, because 1) the processed page content is fairly small, 2) the internal caching of mediawiki means that the code is only called when a page is actively being edited, and 3) starting luametatex as a Lua interpreter only is surprisingly fast.

The only unsolved problem with this subsystem is that it needs an intermediate two-line shell script that does nothing except adjust the PATH environment variable, just so that mtxrun can run without complaints about unfound paths and configuration files. (Otherwise, mtxrun's warning messages would have to be stripped off the mtx-wikipage.lua output.)

Implementation notes

Command disk files

The extension has three types of data files on the filesystem:

XML files for command definitions
Verification tables for command definitions
Wiki text files for instance pages

Generally, the file names follow the logic of the wiki page title, except with the prefix cmd- instead of /Command.

The file extension for the XML files is .xml, for the verification table lua dump it is -test.lua, and for instance pages (redirects) it is .wiki.

However, in order to appease case-preserving and case-sensitive file systems, all uppercase letters in the filename are prefixed with a ^ character. A simple example: Command/WEEKDAY is stored on disk as cmd-^W^E^E^K^D^A^Y.xml, and its verification table is stored in cmd-^W^E^E^K^D^A^Y-test.lua.
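The naming rule can be sketched in a few lines. The following is a Python illustration only, not the extension's own code (which is php and Lua); the function name `command_filename` and the `kind` keys are hypothetical, and it assumes that 'uppercase' means the ASCII letters A to Z.

```python
def command_filename(title: str, kind: str = "xml") -> str:
    """Map a wiki title such as 'Command/WEEKDAY' to its on-disk file name."""
    # Suffixes as listed in the text; the dictionary keys are made up here.
    suffixes = {"xml": ".xml", "test": "-test.lua", "wiki": ".wiki"}
    prefix = "Command/"
    title = title.lstrip("/")  # accept both 'Command/...' and '/Command/...'
    if not title.startswith(prefix):
        raise ValueError(f"not a Command page: {title!r}")
    name = title[len(prefix):]
    # Assumption: only ASCII uppercase letters get the '^' prefix.
    escaped = "".join("^" + ch if "A" <= ch <= "Z" else ch for ch in name)
    return "cmd-" + escaped + suffixes[kind]


print(command_filename("Command/WEEKDAY"))          # cmd-^W^E^E^K^D^A^Y.xml
print(command_filename("Command/WEEKDAY", "test"))  # cmd-^W^E^E^K^D^A^Y-test.lua
```

Because the escape is applied per character, two titles that differ only in case always map to different file names, even on a file system that ignores case.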

XML parser

The extension uses a handwritten simple XML parser in pure Lua. The parser is expat-style and the implementation is based on string.find() and string.sub(). The advantage of this approach is that it can handle bad XML input by throwing an appropriate (and understandable) error. Neither the Lpeg-based Lua parser from the 13th ConTeXt meeting nor the ConTeXt built-in parser allows for that. Both those parsers assume well-formed XML as input.
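To give an idea of what 'based on find and substring' looks like in practice, here is a minimal sketch in Python (not the actual Lua parser). It only checks tag nesting and reports mismatches with line numbers; the names `XMLError` and `check_nesting` are hypothetical, and it deliberately ignores complications such as a '>' inside a comment or an attribute value.

```python
class XMLError(Exception):
    """Raised with a human-readable message when the input is not well nested."""


def check_nesting(text: str) -> bool:
    stack = []  # (tag name, line number) for every element still open
    pos = 0
    while True:
        lt = text.find("<", pos)
        if lt == -1:
            break
        line = text.count("\n", 0, lt) + 1
        gt = text.find(">", lt)
        if gt == -1:
            raise XMLError(f"line {line}: '<' is never closed by '>'")
        tag = text[lt + 1:gt].strip()
        pos = gt + 1
        if not tag:
            raise XMLError(f"line {line}: empty tag '<>'")
        if tag[0] in "?!" or tag.endswith("/"):
            continue  # declaration, comment, or self-closing element
        if tag.startswith("/"):
            name = tag[1:].strip()
            if not stack:
                raise XMLError(f"line {line}: </{name}> has no opening tag")
            open_name, open_line = stack.pop()
            if open_name != name:
                raise XMLError(
                    f"line {line}: </{name}> closes <{open_name}> "
                    f"opened on line {open_line}"
                )
        else:
            stack.append((tag.split()[0], line))  # drop any attributes
    if stack:
        open_name, open_line = stack[-1]
        raise XMLError(f"line {open_line}: <{open_name}> is never closed")
    return True
```

The point of the exercise is the error text: a wiki user who forgot a closing tag gets told which tag and on which line, instead of a generic parse failure.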

A tailored parser also allowed for easy extension to deal with the CDATA issue mentioned below.

But the main motivation for a private dedicated parser written in Lua is that we want to be able to not only check the well-formedness of the XML, but also its adherence to a set of extra rules:

The documentation should not modify the argument structure of the command’s formal specification, only add explanations to it. Theoretically, each of the 3900+ formal specifications has its own private XML Schema.
The documentation should be easily parseable by an external system, meaning that use of wiki code and HTML tags need to be governed.

These additional rules made using the DOM-based parser in php unwieldy, for me. I am sure a good php programmer could implement these extra checks, but not me. At least not in a reasonable amount of time. But I knew how to tackle both requirements using Lua, and could write an implementation quite quickly and effortlessly.

The first point is handled like this:

When a fresh set of ‘virgin’ XML files is created from context-en.xml, each separate file is parsed using a set of functions that create a lua table representing the ‘virginal’ parse tree of the XML file. This Lua table is dumped to disk and distributed along with the XML file.

When a wiki user presses the ‘Save’ button in the page editor, their edited XML is parsed using a slightly different set of functions from the ones for viewing. The functions in this set skip all documentation content while building the parse tree. The two lua tables representing the parse trees are then compared. They should be identical. If not, an error is raised and the save action is aborted with a user-visible error message.
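The comparison idea can be illustrated with a small Python sketch. This is not the extension's code: the real check is written in Lua and compares against a table dumped to disk, whereas this sketch parses both versions on the spot with the standard library. The element names (`command`, `arguments`, `keywords`, `doc`) and the `DOC_TAGS` set are hypothetical stand-ins for the actual contextxml vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical: tags that hold user-written documentation and are skipped.
DOC_TAGS = {"doc"}


def skeleton(elem):
    """Reduce an element to (tag, attributes, child skeletons), minus documentation."""
    children = [skeleton(child) for child in elem if child.tag not in DOC_TAGS]
    return (elem.tag, sorted(elem.attrib.items()), children)


def structure_unchanged(pristine_xml: str, edited_xml: str) -> bool:
    """True if the edit only added documentation and left the formal structure alone."""
    return skeleton(ET.fromstring(pristine_xml)) == skeleton(ET.fromstring(edited_xml))


pristine = '<command name="WEEKDAY"><arguments><keywords/></arguments></command>'
edited = ('<command name="WEEKDAY"><doc>Prints the day.</doc>'
          '<arguments><keywords/></arguments></command>')
tampered = '<command name="WEEKDAY"><arguments/></command>'

print(structure_unchanged(pristine, edited))    # True
print(structure_unchanged(pristine, tampered))  # False
```

Note that this sketch ignores text content outside the documentation elements; whether the real verification tables record such text is not covered here.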

The second point is taken care of during that same XML parse step of the user page revision. It uses a combination of a tag lookup table and string text matching to make sure the user followed the rules (as explained in Help:Command).

About those extension tags

The special tags <texcode>, <xmlcode>, and <context> on our wiki are handled by an extension (context) written a long time ago by Patrick Gundlach. That extension converts the parsed XML output from mediawiki into HTML code that looks 'right'. In normal wiki pages this works, because the mediawiki parser is quite forgiving (more like an HTML browser than an XML parser) and does some recovery attempts itself when a user types in something that is not quite well-formed HTML/XML.

For example, in a normal wiki page you do not need to properly quote the attributes of <context>. And the structure within <xmlcode> does not have to be properly nested.

But it also sometimes backfires. If you use an XML tag name inside a <context source="yes"> call or within <texcode>, it will not be displayed in the verbatim display section of the page (but it will be seen by ConTeXt while processing the <context>).

To settle this question of 'is it data?' versus 'is it markup?' in a standalone XML file, you would wrap a CDATA section around things like the content of <xmlcode>. But unfortunately that is something that the mediawiki parser, the context extension, or the HTML browser does not understand (I don't know which one is the actual problem).

For now, within the ConTeXtXML XML parser, I decided to treat the content of <texcode>, <xmlcode>, and <context> 'as if' they are SGML elements with data model CDATA. That means that the generated XML files on disk that make use of this feature are not actually well-formed. For example, this content of <xmlcode>:

<xmlcode>
<document>
This <highlight detail="important">you</highlight> need to know.
</document>
</xmlcode>

should actually be this:

<xmlcode><![CDATA[
<document>
This <highlight detail="important">you</highlight> need to know.
</document>
]]></xmlcode>

but then it could not be displayed on the wiki properly, or (with some internal patching by ConTeXtXML) there would be a constant difference between the XML version on disk and the wiki database version of a page (resulting in endless 'This revision is outdated' messages). So, I think this is the best solution for now.
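A minimal Python sketch of the 'as if CDATA' treatment, for illustration only (the real parser is in Lua, and `RAW_OPEN`, `read_raw_content` and `raw_sections` are hypothetical names): once the opening tag of one of the three elements is seen, everything up to the literal closing tag is taken as plain data, so any tags inside are never interpreted as markup.

```python
import re

# Opening tag of one of the three elements, with optional attributes.
RAW_OPEN = re.compile(r"<(texcode|xmlcode|context)\b[^>]*>")


def read_raw_content(text: str, name: str, start: int):
    """Return (content, next_pos) for element `name` whose opening tag ends at `start`."""
    closer = "</" + name + ">"
    end = text.find(closer, start)
    if end == -1:
        raise ValueError(f"<{name}> is never closed")
    return text[start:end], end + len(closer)


def raw_sections(text: str):
    """Collect (tag name, verbatim content) pairs without parsing the content."""
    found = []
    pos = 0
    while True:
        match = RAW_OPEN.search(text, pos)
        if match is None:
            return found
        if match.group(0).endswith("/>"):
            pos = match.end()  # self-closing element has no content
            continue
        content, pos = read_raw_content(text, match.group(1), match.end())
        found.append((match.group(1), content))


sample = (
    "<xmlcode>\n"
    "<document>\n"
    'This <highlight detail="important">you</highlight> need to know.\n'
    "</document>\n"
    "</xmlcode>"
)
for name, content in raw_sections(sample):
    print(name, repr(content))
```

The inner <document> and <highlight> tags come back untouched as part of the content string, which is the same result a real CDATA section would give, without changing the text stored in the wiki.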
