XML

Revision as of 10:09, 16 June 2024 by Ousia (talk | contribs) (→‎Format: more verbose titile)

Introduction

Handling XML in ConTeXt has improved dramatically with the advent of MkIV.

The new Lua-based infrastructure makes typesetting, manipulating, filtering, and reusing XML much easier than before.

Unfortunately, this means that most of the existing documentation is now obsolete.

In general, old MkII code includes the uppercase XML string in its commands (as in \getXMLcode[name]), while new MkIV code uses lowercase xml (as in \xmlflush{#1}).

Before You Start

It might be obvious, but there are two basic requirements to typeset XML sources with ConTeXt:

  1. Familiarity with XML. You don’t have to type XML directly, but ConTeXt is only able to compile well-formed XML.[1]
  2. At least some knowledge of ConTeXt commands, since otherwise formatting what you select from the XML source would be impossible.

XML is far more than a source format for typesetting with ConTeXt, and the two are completely independent of each other. It is important to deal with XML first, without seeing it through ConTeXt lenses.

As an alternative to typing XML sources directly, there are lightweight tagging (or markup) languages, such as AsciiDoc or Markdown.[2] There are tools (Pandoc being just one of them) that generate XML from these lightweight markup formats. In some cases these tools might generate malformed XML (due to bugs in them). In that case, you will have to find out what is wrong with your XML source.[3]

Knowing ConTeXt is required too, because typesetting XML may be explained as having two parts:

  • Selecting what you want from the XML file(s).
  • Defining how your selections should look in the final PDF document.

It is better to start learning standard ConTeXt first (if required) and then acquire some experience with XML files.

Sample XML Source

An XML sample borrowed and adapted from the net reads:

<TEI xml:lang="en">
  <teiHeader>
    <!-- stuff omitted here -->

  </teiHeader>
  <text>
    <body>
      <div type="essay">
        <head>An Essay on Summer</head>
        <p>Summer school in <date when="1990">MCMXC</date> was never easy; 
        it went by too quickly and left us wanting more.</p>
        <p>But, as my friend <name type="person">Peter</name> said with his 
        inimitable <foreign xml:lang="fr">je ne sais quoi</foreign>, 
        <said>It never pays to think too hard</said>. Or, as I would rather 
        put it, <quote xml:lang="it">Que sera, sera</quote>.</p>
      </div>
      <div type="essay">
        <head>An Essay on Winter</head>
        <p xml:lang="es">¡Hasta la vista…!</p>
      </div>
    </body>
  </text>
</TEI>

Only XML Required

The previous sample is written using TEI markup. It is correct XML and valid (TEI) XML.

You might think of XML correctness[4] as the set of orthographic rules common to all European languages. Some of these rules may be:[5]

  • All words are separated by at least one blank space.
  • Single dots mark the end of sentences.
  • Blank vertical space separates paragraphs (when available).

XML rules describe how tags (the strings inside the characters <…>) are to be used. Among these rules:

  • Markup is defined by the string inside the characters < >.
  • Any blank space separates attributes (<element attribute="value" attribute1="value1">).
  • The name is the only required part for the <…> tag.
  • Elements have an opening tag and a matching closing tag (<…> and </…>); otherwise the tag must be self-closing (<…/>).[6]
  • The name must come first in the tag (before the first space, if any attribute is given).
  • Attributes have their values assigned with the equals sign (blank space around the sign is allowed by XML, but customarily omitted).
  • Attributes have their values enclosed in quotes.

Validity is relative to a document type. Strictly speaking, it is the document that is valid, not XML as such.

A document type (such as XHTML or TEI) defines a limited set of elements (of element names). Each element may carry one or more attributes with different values.

This specification is called the document type definition. You may think of it as the set of grammar rules of a particular European language.

For example, <whatever> is a correct XML element, but it is not a valid XHTML or TEI element.

An even more extreme sample of correct XML would read:

<τεχτ>
  <βοδυ>
    <διβ type="essay">
      <ἡαδ>An Essay on Summer</ἡαδ>
      <π>Summer school in <δατη when="1990">MCMXC</δατη> was never easy;
      it went by too quickly and left us wanting more.</π>
      <π>But, as my friend <ναμη type="person">Peter</ναμη> said with his
      inimitable <ξένον xml:lang="fr">je ne sais quoi</ξένον>,
      <ἔφα>It never pays to think too hard</ἔφα>. Or, as I would rather
      put it, <λεγόμενον xml:lang="it">Que sera, sera</λεγόμενον>.</π>
    </διβ>
    <διβ type="essay">
      <ἡαδ>An Essay on Winter</ἡαδ>
      <π xml:lang="es">¡Hasta la vista…!</π>
    </διβ>
  </βοδυ>
</τεχτ>

This is invalid TEI, but ConTeXt only requires correct XML sources (which it calls valid) in order to compile them.
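As a minimal sketch of this point, the following self-contained file typesets XML whose element names belong to no document type. The element names <whatever> and <thing>, the buffer name demo and the identifier main are made up for illustration, and the XML is kept in a buffer only so that the example fits in a single file.

```tex
% Hypothetical element names: <whatever> and <thing> belong to no document type.
\startbuffer[demo]
<whatever>
  <thing>First thing.</thing>
  <thing>Second thing.</thing>
</whatever>
\stopbuffer

% Map every element to a setup named after it (xml:whatever, xml:thing).
\startxmlsetups xml:presets:all
  \xmlsetsetup{#1}{*}{xml:*}
\stopxmlsetups

\xmlregistersetup{xml:presets:all}

\startxmlsetups xml:whatever
  \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:thing
  \startparagraph
    \xmlflush{#1}
  \stopparagraph
\stopxmlsetups

\starttext
  % Process the buffer as an XML document; "main" is an arbitrary identifier.
  \xmlprocessbuffer{main}{demo}{}
\stoptext
```

Compiling this file with context should give two paragraphs, even though no document type knows these elements.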

Sample Environment (or Configuration File)

A minimal configuration file or environment to typeset the previous sample may read:

\startxmlsetups xml:presets:all
  \xmlsetsetup {#1} {*} {xml:*}
\stopxmlsetups

\xmlregistersetup{xml:presets:all}

\startxmlsetups xml:TEI
  \mainlanguage[\xmlatt{#1}{xml:lang}]
  \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:body
  \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:date
  \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:div
  \startchapter[title=\xmltext{#1}{head}]
    \xmlflush{#1}
  \stopchapter
\stopxmlsetups

\startxmlsetups xml:foreign
  \bgroup\language[\xmlatt{#1}{xml:lang}]\em\xmlflush{#1}\egroup
\stopxmlsetups

\startxmlsetups xml:name
  \bgroup\sc\xmlflush{#1}\egroup
\stopxmlsetups

\startxmlsetups xml:p
  \startparagraph
    \xmlflush{#1}
  \stopparagraph
\stopxmlsetups

\startxmlsetups xml:p:date
  \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:quote
  \bgroup\language[\xmlatt{#1}{xml:lang}]\quotation{\xmlflush{#1}}\egroup
\stopxmlsetups

\startxmlsetups xml:said
  \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:teiHeader
  \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:text
  \xmlflush{#1}
\stopxmlsetups

A proper explanation follows in XML Typesetting.

Invoking ConTeXt

The XML source may be saved as source.xml and the environment or configuration file could be saved as environment.tex.[7]

context --environment=environment.tex source.xml

This invocation will generate an output file named source.pdf.
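Another common arrangement, sketched here under the assumption that both files sit in the same directory, is a small ConTeXt file that loads the environment and processes the XML source itself. The file name main.tex and the identifier main are hypothetical.

```tex
% main.tex (hypothetical file name)
\environment environment   % loads environment.tex, which holds the setups

\starttext
  % "main" is an arbitrary identifier for the loaded XML document.
  \xmlprocessfile{main}{source.xml}{}
\stoptext
```

With this variant the invocation would be context main.tex, and the output file would be named after the ConTeXt file rather than after the XML source.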

XML Typesetting

Formatting XML sources with ConTeXt (or properly typesetting them) requires:

  • Selecting which parts you want to be typeset. At a minimum, these selections cover elements by their name.
  • Assigning these parts to individual configuration commands (otherwise everything would be displayed the same way).

In practice, the ConTeXt configuration for XML (or environment file) contains:

  1. A set of XML (node) selections mapped or assigned to ConTeXt setups (or configurations).
  2. The registration of this mapping (or set of assignments).
  3. The configuration of each setup.

A basic skeleton showing the three tasks would read:

\startxmlsetups xml:whatever
  \xmlsetsetup {#1} {*} {xml:*}
\stopxmlsetups

\xmlregistersetup{xml:whatever}

\startxmlsetups xml:body
  \xmlflush{#1}
\stopxmlsetups
% and as many definitions as there are XML selections

The two blank lines separate the three parts listed above.

Mapping Selections

The first thing to define is a list of selections from the XML source linked to individual ConTeXt configurations.

This minimal sample contains it:

\startxmlsetups xml:whatever
  \xmlsetsetup {#1} {*} {xml:*}
\stopxmlsetups

  1. The first line, \startxmlsetups, creates the list (named xml:whatever).
    1. The same identifier will be required to register the list.
    2. It is customary to use xml: as the namespace, but any character string (such as οὑδέν:) would do.
    3. Both parts of the name are free, but the identifier must match exactly in the registration.
  2. The third line, \stopxmlsetups, closes \startxmlsetups (as customary in ConTeXt).
  3. The second line, \xmlsetsetup, maps individual XML selections to ConTeXt setups.
    1. In \xmlsetsetup, the second pair of braces defines the individual XML selection; the third pair of braces defines the ConTeXt setup.
    2. The content of the first pair of braces (\xmlsetsetup{#1}) is required in all cases.
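
The mapping does not have to use the catch-all selection. As a sketch, the list below names the elements explicitly, separated by a vertical bar. The element names (doc, title, para), the list name xml:demo:map and the buffer name are hypothetical.

```tex
\startbuffer[demo]
<doc>
  <title>A title</title>
  <para>A paragraph.</para>
</doc>
\stopbuffer

\startxmlsetups xml:demo:map
  % Only the listed elements get a setup; the * in xml:* is replaced
  % by the element name (xml:doc, xml:title, xml:para).
  \xmlsetsetup{#1}{doc|title|para}{xml:*}
\stopxmlsetups

% The identifier must be the one given to \startxmlsetups above.
\xmlregistersetup{xml:demo:map}

\startxmlsetups xml:doc
  \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:title
  \startparagraph
    {\bf \xmlflush{#1}}
  \stopparagraph
\stopxmlsetups

\startxmlsetups xml:para
  \startparagraph
    \xmlflush{#1}
  \stopparagraph
\stopxmlsetups

\starttext
  \xmlprocessbuffer{main}{demo}{}
\stoptext
```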

XML Paths (or LPaths)

You define what you want from the XML sources using XML Paths, known as XPaths. Since ConTeXt accesses these paths using Lua, they are LPaths.

We are handling the contents of the second pair of braces from the command:

\xmlsetsetup{#1}{*}{xml:*}

The most basic path is the one used in the sample, {*}, which stands for any XML element.

Other path types may be:

  • {element[@attribute]}, selects <element attribute="…"> (<element> with attribute set, regardless of its value).
  • {element[@attribute='value']} selects <element attribute="value">, but not <element attribute="value1"> (or even <element attribute="value another-value">).
  • {container/element} selects all <element> children (or direct descendants) of <container>.
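
The three path types can be combined in one mapping. The sketch below assumes that a later \xmlsetsetup line takes precedence over an earlier one when both select the same element, so the catch-all line comes first. The element names and the setup names xml:note:typed, xml:note:warning and xml:p:date are chosen for illustration only.

```tex
\startbuffer[demo]
<doc>
  <note type="warning">Mind the gap.</note>
  <note type="hint">Try again.</note>
  <note>Plain note.</note>
  <p>Written in <date>MCMXC</date>.</p>
</doc>
\stopbuffer

\startxmlsetups xml:demo:map
  % Default: every element gets the setup named after it.
  \xmlsetsetup{#1}{*}{xml:*}
  % <note> with a type attribute, whatever its value.
  \xmlsetsetup{#1}{note[@type]}{xml:note:typed}
  % <note type="warning"> only (assumed to override the previous line).
  \xmlsetsetup{#1}{note[@type='warning']}{xml:note:warning}
  % <date> as a child of <p>.
  \xmlsetsetup{#1}{p/date}{xml:p:date}
\stopxmlsetups

\xmlregistersetup{xml:demo:map}

\startxmlsetups xml:doc
  \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:note
  \startparagraph \xmlflush{#1} \stopparagraph
\stopxmlsetups

\startxmlsetups xml:note:typed
  \startparagraph {\it \xmlflush{#1}} \stopparagraph
\stopxmlsetups

\startxmlsetups xml:note:warning
  \startparagraph {\bf \xmlflush{#1}} \stopparagraph
\stopxmlsetups

\startxmlsetups xml:p
  \startparagraph \xmlflush{#1} \stopparagraph
\stopxmlsetups

\startxmlsetups xml:p:date
  {\em \xmlflush{#1}}
\stopxmlsetups

\starttext
  \xmlprocessbuffer{main}{demo}{}
\stoptext
```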

There are a bunch of other possibilities. A separate page on LPaths would make more sense.

Defining ConTeXt Setups

The third and last pair of braces from \xmlsetsetup{#1}{*}{xml:*} defines the matching setup for the given element.

If you use the wildcard (*) in the setup name, it is replaced by the name of the element matched by the path.

It is up to you which namespace you use to name ConTeXt setups,[8] but the resulting names must match those of the individual formatting setups.

One way of discarding content (which would otherwise be typeset) is to map a path to a setup that does not exist.[9]
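
A more explicit variant of the same idea, sketched here with hypothetical element names, is to define the setup but leave it empty. Since nothing is flushed, the element leaves no trace in the output, and the intent stays visible in the environment file.

```tex
\startbuffer[demo]
<doc>
  <meta>Internal remark, not for print.</meta>
  <p>Visible text.</p>
</doc>
\stopbuffer

\startxmlsetups xml:presets:all
  \xmlsetsetup{#1}{*}{xml:*}
\stopxmlsetups

\xmlregistersetup{xml:presets:all}

\startxmlsetups xml:doc
  \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:meta
  % Intentionally empty: no \xmlflush, so <meta> is not typeset.
\stopxmlsetups

\startxmlsetups xml:p
  \startparagraph
    \xmlflush{#1}
  \stopparagraph
\stopxmlsetups

\starttext
  \xmlprocessbuffer{main}{demo}{}
\stoptext
```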

Map Registration

After defining the list of XML setups (XML paths matched with ConTeXt setups), it must be registered. The registration command reads:

\xmlregistersetup{xml:whatever}

The only requirement is that the identifier (xml:whatever in the sample) is exactly the same as the one defined in \startxmlsetups.

Format (Setups Configuration)

Last (but not least, as they say) comes the format of XML selections. Without this step, the selections will be lost in the transition to the output document.

As already explained in Defining ConTeXt Setups, these names (contained in the last pair of braces of \xmlsetsetup) should match each individual setup configuration.

For a setup named in the selection mapping {xml:body}, its configuration may read:

\startxmlsetups xml:body
  \xmlflush{#1}
\stopxmlsetups

Flushing the contents of the element (the node) is the most basic operation.

It is required for the child elements of the element to be processed.

Flushing only adds the content of the element; formatting requires standard ConTeXt commands.

Compare the previous setup to these other ones:

\startxmlsetups xml:p
  \startparagraph
    \xmlflush{#1}
  \stopparagraph
\stopxmlsetups

\startxmlsetups xml:name
  \bgroup\sc\xmlflush{#1}\egroup
\stopxmlsetups

The xml:p setup adds the required commands so that <p> elements are handled as paragraphs.

For xml:name, small caps are added. \bgroup … \egroup is similar to enclosing its contents in braces (but more explicit and readable).

As mentioned, \xmlflush{#1} flushes the current selection (or node).

This is the most basic operation, but there are other commands as well.

\xmltext adds the text from a path, such as in:

\xmltext{#1}{head}

\xmltext{#1}{.}

The first command from the sample gets the text from a child <head> element.

The second command gets the text from the current element ({.} is the path for it).
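
Both uses of \xmltext can be tried in a small self-contained file. This is only a sketch: the buffer name and the identifier main are hypothetical, and xml:head is defined as an empty setup so that the heading is not typeset a second time when <div> is flushed.

```tex
\startbuffer[demo]
<div>
  <head>An Essay on Summer</head>
  <p>It went by too quickly.</p>
</div>
\stopbuffer

\startxmlsetups xml:presets:all
  \xmlsetsetup{#1}{*}{xml:*}
\stopxmlsetups

\xmlregistersetup{xml:presets:all}

\startxmlsetups xml:div
  % Text of the child <head> becomes the chapter title.
  \startchapter[title={\xmltext{#1}{head}}]
    \xmlflush{#1}
  \stopchapter
\stopxmlsetups

\startxmlsetups xml:head
  % Empty on purpose: the heading is already used as the title above.
\stopxmlsetups

\startxmlsetups xml:p
  \startparagraph
    % Text of the current element, selected with the path {.}.
    \xmltext{#1}{.}
  \stopparagraph
\stopxmlsetups

\starttext
  \xmlprocessbuffer{main}{demo}{}
\stoptext
```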

Attributes may be accessed with:

  • \xmlatt{#1}{name} gets the value for the attribute {name} from the current element.
  • \xmlattribute{#1}{path}{name} gets the value for the attribute {name} from the selected {path}.
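
A short sketch of both commands, using the <date> element and its when attribute from the sample source. The buffer name, the identifier main and the wording of the second paragraph are made up for illustration.

```tex
\startbuffer[demo]
<p>Summer school in <date when="1990">MCMXC</date> was never easy.</p>
\stopbuffer

\startxmlsetups xml:presets:all
  \xmlsetsetup{#1}{*}{xml:*}
\stopxmlsetups

\xmlregistersetup{xml:presets:all}

\startxmlsetups xml:p
  \startparagraph
    \xmlflush{#1}
  \stopparagraph
  \startparagraph
    % Attribute "when" of the <date> child, read from the <p> setup.
    Year mentioned: \xmlattribute{#1}{date}{when}
  \stopparagraph
\stopxmlsetups

\startxmlsetups xml:date
  % Attribute "when" of the current element, added after its text.
  \xmlflush{#1}\space(\xmlatt{#1}{when})
\stopxmlsetups

\starttext
  \xmlprocessbuffer{main}{demo}{}
\stoptext
```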

A more detailed list (with sample explanations) deserves a separate page, XML Setup Commands.

Notes

  1. If this is all Greek to you, consider it as incorrect XML.
  2. For a detailed list, see a feature comparison list in Wikipedia.
  3. ConTeXt will complain with a message in the PDF document starting with “invalid xml file”.
  4. I’m aware that the technical term is well-formedness, but I could not help looking for a more expressive replacement. Correctness seems to be a suitable candidate.
  5. This is no more than an illustrative example, in no way an exhaustive description (or list).
  6. With or without space before the slash.
  7. Of course, file names will differ in your own documents. Although it is not mandatory (as far as I can recall), it is a good idea to keep a different file extension for each file format: .xml for XML files and .tex for ConTeXt files.
  8. The part of the identifier with the form xml:, which may contain any string of letters (no digits).
  9. This is exactly what happens with the <head> element in the sample. There is no definition such as
    \startxmlsetups xml:head
      \xmlflush{#1}
    \stopxmlsetups
    It would be redundant (the heading would appear twice in the output document), since its text is already included by xml:div with \xmltext{#1}{head}.