The only unsolved problem with this subsystem is that it needs an intermediate two-line shell script that does nothing except adjust the PATH environment variable, just so that `mtxrun` can run without complaints about unfound paths and configuration files. (Otherwise, mtxrun's warning messages would have to be stripped from the `mtx-wikipage.lua` output.)
 
== Implementation notes ==
 
=== XML parser ===
 
The extension uses a handwritten, simple XML parser implemented in pure Lua. The parser is expat-style and based on string.find() and string.sub(). The advantage of this approach is that it can handle bad XML input by throwing an appropriate (and understandable) error. Neither the Lpeg-based Lua parser from the 13th ConTeXt meeting nor the ConTeXt built-in parser allows for that; both assume well-formed XML as input.
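
The loop below is a minimal sketch of that approach (the function name, the callback table, and the error messages are illustrative, not the actual extension code): scan for the next <code><</code> with string.find(), slice the tag out with string.sub(), dispatch to expat-style callbacks, and raise a readable error as soon as the input stops looking like XML.

<pre>
-- Minimal sketch of an expat-style parser loop built on string.find()
-- and string.sub(); names and messages are illustrative only, and
-- attribute parsing, comments and processing instructions are omitted.
local function parse(xml, callbacks)
  local pos = 1
  while true do
    local lt = string.find(xml, "<", pos, true)
    if not lt then
      -- the remainder is plain character data
      if pos <= #xml then callbacks.text(string.sub(xml, pos)) end
      break
    end
    if lt > pos then callbacks.text(string.sub(xml, pos, lt - 1)) end
    local gt = string.find(xml, ">", lt, true)
    if not gt then
      error(string.format("unclosed tag at byte %d", lt))
    end
    local tag = string.sub(xml, lt + 1, gt - 1)
    if string.sub(tag, 1, 1) == "/" then
      callbacks.endelement(string.sub(tag, 2))
    else
      local name = string.match(tag, "^[%w:%-]+")
      if not name then
        error(string.format("malformed tag '<%s>' at byte %d", tag, lt))
      end
      local selfclosing = string.sub(tag, -1) == "/"
      callbacks.startelement(name, string.sub(tag, #name + 1, selfclosing and -2 or -1))
      if selfclosing then callbacks.endelement(name) end
    end
    pos = gt + 1
  end
end
</pre>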
 
A tailored parser also allowed for easy extension for the CDATA issue mentioned below.
 
But the main motivation for a private, dedicated parser written in Lua is that we want to check not only the well-formedness of the XML, but also its adherence to a set of extra rules:
# The documentation should not modify the argument structure of the command’s formal specification, only add explanations to it. Theoretically, each of the 3900+ formal specifications has its own private XML Schema.
# The documentation should be easily parseable by an external system, meaning that use of wiki code and HTML tags need to be governed.
These additional rules made using the DOM-based parser in PHP unwieldy, for me. I am sure a good PHP programmer could implement these extra checks, but not me, at least not in a reasonable amount of time. But I knew how to tackle both requirements in Lua, and could write an implementation quite quickly and effortlessly.
 
The first point is handled like this:
 
*While a fresh set of ‘virgin’ XML files is created from <code>context-en.xml</code>, each separate file is parsed using a set of expat callback functions that create a Lua table representing the ‘virginal’ parse tree of the XML file. This Lua table is dumped to disk and distributed along with the XML file.
 
*When a wiki user presses the ‘Save’ button in the page editor, their edited XML is parsed using a slightly different set of expat callback functions from the ones used for viewing. These altered callbacks skip all documentation content while building the parse tree. The two Lua tables representing the parse trees are then compared; they should be identical. If they are not, an error is raised and the save action is aborted with a user-visible error message (a sketch of such a comparison follows below).
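
As a sketch of that comparison, assuming the parse tree is a nested Lua table with <code>name</code>, <code>attributes</code> and <code>children</code> fields (that layout, and the names <code>virgintree</code>/<code>editedtree</code>, are assumptions for illustration): the two trees are walked recursively and any structural difference aborts the save.

<pre>
-- Hypothetical sketch: recursively compare two parse trees stored as
-- nested Lua tables (element name, attributes, ordered children).
local function sametree(a, b)
  if a.name ~= b.name then return false end
  -- attributes must match in both directions
  for k, v in pairs(a.attributes or {}) do
    if (b.attributes or {})[k] ~= v then return false end
  end
  for k, v in pairs(b.attributes or {}) do
    if (a.attributes or {})[k] ~= v then return false end
  end
  -- same number of child elements, in the same order
  local ac, bc = a.children or {}, b.children or {}
  if #ac ~= #bc then return false end
  for i = 1, #ac do
    if not sametree(ac[i], bc[i]) then return false end
  end
  return true
end

-- at save time: compare the stored 'virgin' tree with the tree built
-- from the user's edit (documentation content already skipped)
if not sametree(virgintree, editedtree) then
  error("the command specification was modified; only documentation may be edited")
end
</pre>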
 
The second point is taken care of during that same XML parse step of the user page revision. It uses a combination of a tag lookup table and string text matching to make sure the user followed the rules (as explained in [[Help:Command]]).
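
The check itself can be as simple as a whitelist lookup; the sketch below is illustrative only (the table entries and the error text are invented, the real rules are the ones documented in [[Help:Command]]).

<pre>
-- Hypothetical whitelist of tags allowed inside documentation content
local allowed = {
  texcode = true, xmlcode = true, context = true, -- illustrative entries only
}

local function checktag(name, position)
  if not allowed[name] then
    error(string.format(
      "tag <%s> at byte %d is not allowed in documentation content",
      name, position))
  end
end
</pre>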
 
=== About those extension tags ===
 
The special tags <code><nowiki><texcode></nowiki></code>, <code><nowiki><xmlcode></nowiki></code>, and <code><nowiki><context></nowiki></code> on our wiki are handled by an extension (<code>context</code>) written a long time ago by Patrick Gundlach. That extension converts the parsed XML output from mediawiki into HTML code that looks 'right'. In normal wiki pages this works, because the mediawiki parser is quite forgiving (more like an HTML browser than an XML parser) and makes some recovery attempts itself when a user types in something that is not quite well-formed HTML/XML.
 
For example, in a normal wiki page you do not need to properly quote the attributes of <code><nowiki><context></nowiki></code>. And the structure within <code><nowiki><xmlcode></nowiki></code> does not have to be properly nested.
 
But it also sometimes backfires. If you use an XML tag name inside a <code><nowiki><context source="yes"></nowiki></code> call or within <code><nowiki><texcode></nowiki></code>, it will not be displayed in the verbatim display section of the page (but it will be seen by ConTeXt while processing the <code><nowiki><context></nowiki></code>).
 
To resolve the question 'is it data?' or 'is it markup?' in a standalone XML file, you would wrap a CDATA section around things like the content of <code><nowiki><xmlcode></nowiki></code>. But unfortunately that is something that either the mediawiki parser, the <code>context</code> extension, or the HTML browser does not understand (I don't know which one is the exact problem).
 
For now, within the ConTeXtXML XML parser, I decided to treat the content of <code><nowiki><texcode></nowiki></code>, <code><nowiki><xmlcode></nowiki></code>, and <code><nowiki><context></nowiki></code> 'as if' they were SGML elements with a CDATA content model. That means that the generated XML files on disk that make use of this feature are not actually well-formed. For example, this content of <code><nowiki><xmlcode></nowiki></code>:
 
<pre>
<xmlcode>
<document>
This <highlight detail="important">you</highlight> need to know.
</document>
</xmlcode>
</pre>
 
should actually be this:
 
<pre>
<xmlcode><![CDATA[
<document>
This <highlight detail="important">you</highlight> need to know.
</document>
]]></xmlcode>
</pre>
 
but then it could not be displayed properly on the wiki, or (with some internal patching by ConTeXtXML) there would be a constant difference between the XML version on disk and the wiki database version of a page (resulting in endless 'This revision is outdated' messages). So I think this is the best solution for now.
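
The 'as if CDATA' treatment is straightforward to add to a handwritten parser: when one of the three start tags is seen, the parser can scan directly for the matching end tag and pass everything in between through untouched. A hypothetical sketch (function and variable names are assumptions, not the actual extension code):

<pre>
-- Hypothetical sketch: treat <texcode>, <xmlcode> and <context> content
-- as raw character data, the way an SGML CDATA element would be.
local verbatim = { texcode = true, xmlcode = true, context = true }

-- called after the start tag of 'name' has ended at byte 'pos';
-- returns the raw content and the position just past the end tag
local function readverbatim(xml, name, pos)
  if not verbatim[name] then return nil, pos end
  local endtag = "</" .. name .. ">"
  local s, e = string.find(xml, endtag, pos, true)
  if not s then
    error(string.format("missing %s end tag", endtag))
  end
  return string.sub(xml, pos, s - 1), e + 1
end
</pre>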
