Difference between revisions of "XML"

m
(add document titles)
 
(10 intermediate revisions by 4 users not shown)
Line 1: Line 1:
< [[DocBook]] | [[MathML]] | [[Formatting Objects]] >
+
__TOC__
  
Handling XML in ConTeXt has improved dramatically with the advent of MkIV. A new infrastructure, based on Lua, makes typesetting, manipulating, filtering, and reusing XML much much easier than before. Unfortunately, this means that most of the existing documentation is now obsolete. In general, the "old" MkII code uses upper-case <tt>XML</tt> in its commands, the new MkIV code uses lower-case <tt>xml</tt>. 
+
=Introduction=
  
==Documents about XML in MkIV==
+
Handling XML in ConTeXt has improved dramatically with the advent of MkIV.
  
===General Information===
+
The new Lua-based infrastructure makes typesetting, manipulating, filtering, and reusing XML much easier than before.
  
* [[manual:xml-mkiv.pdf|XML in MkIV]]
+
Unfortunately, this means that most of the existing documentation is now obsolete.
* [[manual:math-mkiv.pdf|MathML in MkIV]]
+
 
* [[TEI_xml| TEI xml]]: typesetting editions encoded in TEI xml
+
In general, old MkII code includes the uppercase <tt>XML</tt> string in its commands (as in {{cmd|getXMLcode|[name]}}), while new MkIV code uses lowercase <tt>xml</tt> (as in {{cmd|xmlflush|{#1}}}).
* [[Verbatim_XML | Verbatim/VIM in XML]]
+
 
* [[xtables#XML | Processing XML tables as Extreme Tables]]
+
==Before You Start==
* [http://dl.contextgarden.net/myway/tas/xhtml.pdf XHTML in MkIV]
+
 
 +
It might be obvious, but there are two basic requirements to typeset XML sources with ConTeXt:
 +
 
 +
# Familiarity with XML. You don’t have to type XML directly, but ConTeXt isn’t able to compile XML that is not well-formed.<ref>If this is all Greek to you, consider it as incorrect XML.</ref>
 +
# At least, some knowledge of ConTeXt commands, since otherwise formatting what you select from the XML source would be impossible.
 +
 
 +
XML is far more than a source format for typesetting with ConTeXt, and the two are completely independent of each other. It is important to get to grips with XML first, without looking at it through a ConTeXt lens.
 +
 
 +
You do not have to type XML sources directly: there are lightweight markup languages, such as AsciiDoc or Markdown.<ref>For a detailed list, see [https://en.wikipedia.org/wiki/Lightweight_markup_language#Comparison_of_language_features a feature comparison list in Wikipedia].</ref> There are tools ([https://pandoc.org Pandoc] being just one of them) that generate XML from these lightweight markup formats.
 +
In some cases these tools may generate malformed XML (because of bugs in them). In that case, you will have to find out what is wrong with your XML source.<ref>ConTeXt will complain with a message in the PDF document starting with “invalid xml file”.</ref>
 +
 
 +
Knowing ConTeXt is required too, because typesetting XML may be explained as having two parts:
 +
 
 +
* Selecting what you want from the XML file(s).
 +
* Defining how you want your selections in the final PDF document.
 +
 
 +
It is better to start learning standard ConTeXt first (if required) and then acquire some experience with XML files. 
 +
 
 +
=First Example=
 +
 
 +
==Sample XML Source==
 +
 
 +
An XML sample borrowed and adapted from the net reads:
 +
 
 +
<xmlcode>
 +
<TEI xml:lang="en">
 +
  <teiHeader>
 +
    <!-- stuff omitted here -->
 +
  </teiHeader>
 +
  <text>
 +
    <body>
 +
      <div type="essay">
 +
        <head>An Essay on Summer</head>
 +
        <p>Summer school in <date when="1990">MCMXC</date> was never easy;
 +
        it went by too quickly and left us wanting more.</p>
 +
        <p>But, as my friend <name type="person">Peter</name> said with his
 +
        inimitable <foreign xml:lang="fr">je ne sais quoi</foreign>,
 +
        <said>It never pays to think too hard</said>. Or, as I would rather
 +
        put it, <quote xml:lang="it">Que sera, sera</quote>.</p>
 +
      </div>
 +
      <div type="essay">
 +
        <head>An Essay on Winter</head>
 +
        <p xml:lang="es">¡Hasta la vista…!</p>
 +
      </div>
 +
    </body>
 +
  </text>
 +
</TEI>
 +
</xmlcode>
 +
 
 +
===Only XML Required===
 +
 
 +
This previous sample is written using the TEI markup. It is correct XML and valid (TEI) XML.
 +
 
 +
You might think of XML correctness<ref>I am aware that the technical term is well-formedness, but I could not help looking for a more expressive replacement; correctness seems a suitable candidate.</ref> as the set of orthographic rules common to all European languages. Some of these rules might be:<ref>This is no more than a fanciful example, in no way an exhaustive description (or list).</ref>
 +
 
 +
* All words are separated using at least a blank space.
 +
* Single dots mark different sentences.
 +
* Blank vertical space separates paragraphs (when available).
 +
 
 +
XML rules describe how the tags inside the characters <code>&lt;…&gt;</code> are to be used. To these rules belong:
 +
 
 +
* Markup is defined by the string inside the characters <code>&lt; &gt;</code>.
 +
* Attributes are separated by blank space (<code>&lt;element attribute="value" attribute1="value1"&gt;</code>).
 +
* The name is the only required part for the <code>&lt;…&gt;</code> tag.
 +
* Elements have an opening tag and a matching closing tag (<code>&lt;…&gt;</code> and <code>&lt;/…&gt;</code>); otherwise the opening tag must be self-closing (<code>&lt;…/&gt;</code>).<ref>With or without space before the slash.</ref>
 +
* The name must come first in the tag (before the first space, if any attribute is given).
 +
* Attributes have their values assigned with the equal sign.
 +
* Attributes have their values enclosed in quotes.
 +
 
 +
Validity is related to a document type. XML validity is properly the document validity.
 +
 
 +
A document type (such as XHTML or TEI) defines a limited set of elements (of element names). Each element may contain one or more attributes with different values.
 +
 
 +
This specification of XML is called the document type definition. You may consider it as the set of grammar rules of each European language.
 +
 
 +
For example, <code>&lt;whatever&gt;</code> is a correct XML element name, but it is not a valid XHTML or TEI element.
 +
 
 +
An even more extreme sample of correct XML would read:
 +
 
 +
<xmlcode>
 +
<τεχτ>
 +
  <βοδυ>
 +
    <διβ type="essay">
 +
      <ἡαδ>An Essay on Summer</ἡαδ>
 +
      <π>Summer school in <δατη when="1990">MCMXC</δατη> was never easy;
 +
      it went by too quickly and left us wanting more.</π>
 +
      <π>But, as my friend <ναμη type="person">Peter</ναμη> said with his
 +
      inimitable <ξένον xml:lang="fr">je ne sais quoi</ξένον>,
 +
      <ἔφα>It never pays to think too hard</ἔφα>. Or, as I would rather
 +
      put it, <λεγόμενον xml:lang="it">Que sera, sera</λεγόμενον>.</π>
 +
    </διβ>
 +
    <διβ type="essay">
 +
      <ἡαδ>An Essay on Winter</ἡαδ>
 +
      <π xml:lang="es">¡Hasta la vista…!</π>
 +
    </διβ>
 +
  </βοδυ>
 +
</τεχτ>
 +
</xmlcode>
 +
 
 +
This is invalid TEI, but ConTeXt only requires correct XML (which its own messages call “valid”) to compile a source.
 +
 
 +
==XML Typesetting==
 +
 
 +
Formatting XML sources with ConTeXt (or properly typesetting them) requires:
 +
 
 +
* Selecting which parts you want to be typeset. At least, these selections will cover elements by their name.
 +
* Assigning these parts to single configuration commands (otherwise all will be displayed the same).
 +
 
 +
In practice, the ConTeXt configuration for XML (or environment file) contains:
 +
 
 +
# A set of XML (node) selections mapped or assigned to ConTeXt setups (or configurations).
 +
# The registration of this mapping (or assignation set).
 +
# The configuration of each setup.
 +
 
 +
The basic skeleton reads:
 +
 
 +
<texcode>
 +
\startxmlsetups xml:whatever
 +
  \xmlsetsetup {#1} {*} {xml:*}
 +
\stopxmlsetups
 +
 
 +
\xmlregistersetup{xml:whatever}
 +
 
 +
\startxmlsetups xml:body
 +
  \xmlflush{#1}
 +
\stopxmlsetups
 +
% and as many definitions as there are XML selections
 +
</texcode>
 +
 
 +
The two blank lines separate the three parts listed above.
 +
 
 +
=Documents about XML in MkIV=
 +
 
 +
==General Information==
 +
 
 +
* [[manual:xml-mkiv.pdf|''Dealing with XML in ConTEXt MkIV'']]: the official manual that explains everything. Too hard to be a good starting point (unless you are confident [or at least familiar] with XPath).
 +
* [[manual:math-mkiv.pdf|''MathML in MkIV'']]: also an official document, on typesetting math from XML sources.
 +
* [[TEI_xml| TEI XML]]: example of a TEI-encoded source typeset with ConTeXt.
 +
* [[DocBook]]: example of how to typeset DocBook sources with help of a ConTeXt module.
 +
* [[Formatting Objects|Formatting XML Objects]]: not (yet?) available for MkIV.
 +
* [[Verbatim_XML | Verbatim in XML]]: how to typeset XML sources verbatim in final text.
 +
* [[xtables#XML | Processing XML tables as Extreme Tables]]: example about XML tables as ConTeXt tables (extreme or natural).
 +
* [https://wiki.contextgarden.net/images/8/8c/xhtml.pdf XHTML in MkIV]: ''Getting Web Content and pdf-Output from One Source'', by Thomas Schmitz.
 
* [[Ctx| Processing of Ctx XML files]]  
 
* [[Ctx| Processing of Ctx XML files]]  
  
===Processing XML with lua===
+
==Processing XML with lua==
* [[XML_Lua| XML in Lua]] (manipulating xml in Lua)
+
* [[XML_Lua| XML in Lua]] (manipulating XML in Lua)
  
===XHTML in MKIV===
+
==XHTML in MKIV==
* [http://dl.contextgarden.net/myway/tas/xhtml.pdf Thomas' MyWay on processing XHTML with MKIV]
+
* [https://wiki.contextgarden.net/images/8/8c/xhtml.pdf Thomas’ ''My Way'' on processing XHTML with MKIV]: ''Getting Web Content and pdf-Output from One Source'' (already mentioned).
  
 +
=Documents about XML in MkII (obsolete)=
  
==Documents about XML in MkII (obsolete)==
+
==XML/ConTeXt in general==
 +
* [[manual:example.pdf|''XML in ConTeXt'']] by Hans Hagen and Ton Otten (2001)
 +
* [https://tug.org/TUGboat/tb24-3/pepping.pdf ''Docbook In ConTEXt, a ConTEXt XML Mapping for DocBook Documents''] by Simon Pepping
 +
* [http://getfo.sourceforge.net/context_xml/index.html ConTeXt–XML] by Paul Tremblay
 +
* [https://www.pragma-ade.nl/general/magazines/mag-0008.pdf ''Dealing with XML''] by Hans Hagen (about XML, XSLT and typesetting without TeX code)
 +
* XML Basics: [[Mixing_XML_and_ConTeXt|Mixing XML and ConTeXt]] using the pre-defined ContML vocabulary
  
===XML/ConTeXt in general===
+
==Additions and Details of XML/ConTeXt==
* [[manual:example.pdf|XML in ConTeXt]] by Pragma (2001)
 
* [http://www.leverkruid.eu/context/index.html XML DocBook in ConTeXt] by Simon Pepping
 
* [http://getfo.sourceforge.net/context_xml/index.html XML ConTeXt] by Paul Tremblay
 
* [http://www.pragma-ade.com/show-mag-9.htm Dealing with XML] by Pragma (about XML, XSLT and typesetting without TeX code)
 
* XML Basics: [[Mixing_XML_and_ConTeXt]] using the pre-defined ContML vocabulary
 
 
 
===Additions and Details of XML/ConTeXt===
 
 
* [[manual:xfigures-p.pdf|Figures (XML image databases)]] ([[manual:xfigures-s.pdf|screen]]) by Pragma (2001); see [[Image Database]]
 
* [[manual:xfigures-p.pdf|Figures (XML image databases)]] ([[manual:xfigures-s.pdf|screen]]) by Pragma (2001); see [[Image Database]]
 
* [[Two pass tag processing example]] (float and figure tags)
 
* [[Two pass tag processing example]] (float and figure tags)
Line 41: Line 182:
 
* [[manual:xcorresp.pdf|Serial Letters]] (using a XML database) by Pragma (2003)
 
* [[manual:xcorresp.pdf|Serial Letters]] (using a XML database) by Pragma (2003)
  
===eXaMpLe framework===  
+
==eXaMpLe framework==  
 
(batch processing)
 
(batch processing)
 
* [[manual:ex-ample.pdf|Example Interface]] (empty)
 
* [[manual:ex-ample.pdf|Example Interface]] (empty)
Line 47: Line 188:
 
* [[manual:ex-imple.pdf|Eximple Toolkit]] (simple subset of Example)
 
* [[manual:ex-imple.pdf|Eximple Toolkit]] (simple subset of Example)
  
===MathML===
+
==MathML==
 
* [[manual:pre-mml.pdf|MathML Intro presentation]] by Pragma
 
* [[manual:pre-mml.pdf|MathML Intro presentation]] by Pragma
 
* [[manual:mmlprime.pdf|MathML manual]] by Pragma (2001)
 
* [[manual:mmlprime.pdf|MathML manual]] by Pragma (2001)
Line 55: Line 196:
 
* [[manual:xphysml-p.pdf|PhysML (MathML extension for physics)]] ([[manual:xphysml-s.pdf|screen]]) by Pragma
 
* [[manual:xphysml-p.pdf|PhysML (MathML extension for physics)]] ([[manual:xphysml-s.pdf|screen]]) by Pragma
  
===XSL/FO===
+
==XSL/FO==
 
* XSL/FO: [[Formatting Objects]]
 
* XSL/FO: [[Formatting Objects]]
 
* [[ConTeXt FO and XML]] is a tutorial with a view to presenting ConTeXt from the XSL-FO mindset.
 
* [[ConTeXt FO and XML]] is a tutorial with a view to presenting ConTeXt from the XSL-FO mindset.
 +
 +
=Notes=
  
 
[[Category:XML]]
 
[[Category:XML]]

Latest revision as of 17:59, 8 June 2024

Introduction

Handling XML in ConTeXt has improved dramatically with the advent of MkIV.

The new Lua-based infrastructure makes typesetting, manipulating, filtering, and reusing XML much easier than before.

Unfortunately, this means that most of the existing documentation is now obsolete.

In general, old MkII code includes the uppercase XML string in its commands (as in \getXMLcode[name]), while new MkIV code uses lowercase xml (as in \xmlflush{#1}).

Before You Start

It might be obvious, but there are two basic requirements to typeset XML sources with ConTeXt:

  1. Familiarity with XML. You don’t have to type XML directly, but ConTeXt isn’t able to compile XML that is not well-formed.[1]
  2. At least some knowledge of ConTeXt commands, since otherwise you could not format what you select from the XML source.

XML is far more than a source format for typesetting with ConTeXt, and the two are completely independent of each other. It is important to get to grips with XML first, without looking at it through a ConTeXt lens.

You do not have to type XML sources directly: there are lightweight markup languages, such as AsciiDoc or Markdown.[2] There are tools (Pandoc being just one of them) that generate XML from these lightweight markup formats. In some cases these tools may generate malformed XML (because of bugs in them). In that case, you will have to find out what is wrong with your XML source.[3]

Knowing ConTeXt is required too, because typesetting XML may be explained as having two parts:

  • Selecting what you want from the XML file(s).
  • Defining how you want your selections in the final PDF document.

It is better to start learning standard ConTeXt first (if required) and then acquire some experience with XML files.

First Example

Sample XML Source

An XML sample borrowed and adapted from the net reads:

<TEI xml:lang="en">
  <teiHeader>
    <!-- stuff omitted here -->

  </teiHeader>
  <text>
    <body>
      <div type="essay">
        <head>An Essay on Summer</head>
        <p>Summer school in <date when="1990">MCMXC</date> was never easy; 
        it went by too quickly and left us wanting more.</p>
        <p>But, as my friend <name type="person">Peter</name> said with his 
        inimitable <foreign xml:lang="fr">je ne sais quoi</foreign>, 
        <said>It never pays to think too hard</said>. Or, as I would rather 
        put it, <quote xml:lang="it">Que sera, sera</quote>.</p>
      </div>
      <div type="essay">
        <head>An Essay on Winter</head>
        <p xml:lang="es">¡Hasta la vista…!</p>
      </div>
    </body>
  </text>
</TEI>

Only XML Required

This previous sample is written using the TEI markup. It is correct XML and valid (TEI) XML.

You might think of XML correctness[4] as the set of orthographic rules common to all European languages. Some of these rules might be:[5]

  • All words are separated using at least a blank space.
  • Single dots mark different sentences.
  • Blank vertical space separates paragraphs (when available).

XML rules describe how the tags delimited by the characters <…> are to be used. These rules include:

  • Markup is defined by the string inside the characters < >.
  • Attributes are separated by blank space (<element attribute="value" attribute1="value1">).
  • The name is the only required part for the <…> tag.
  • Elements have an opening tag and a matching closing tag (<…> and </…>); otherwise the opening tag must be self-closing (<…/>).[6]
  • The name must come first in the tag (before the first space, if any attribute is given).
  • Attributes have their values assigned with the equal sign.
  • Attributes have their values enclosed in quotes.
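
The rules above can be seen together in a small fragment. The following is only a sketch: the element names (note, person, separator), the buffer name rules and the setup names xml:rules:* are made up for illustration and belong to no document type. It assumes a ConTeXt MkIV (or later) installation.

```tex
% A correct (well-formed) fragment: the name comes first in each tag,
% attributes are separated by spaces and have quoted values, every opening
% tag has a matching closing tag, and the empty element is self-closing.
\startbuffer[rules]
<note kind="reminder" priority="high">
  Call <person id="p1">Peter</person> before noon.
  <separator/>
  Then rest.
</note>
\stopbuffer

% Map the three (hypothetical) element names to setups of the same name.
\startxmlsetups xml:rules:base
  \xmlsetsetup{#1}{note|person|separator}{xml:rules:*}
\stopxmlsetups

\xmlregistersetup{xml:rules:base}

% Print the value of the kind attribute, then the content of the element.
\startxmlsetups xml:rules:note
  {\bf \xmlatt{#1}{kind}:} \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:rules:person
  {\it \xmlflush{#1}}
\stopxmlsetups

% The empty element carries no content; here it only forces a new paragraph.
\startxmlsetups xml:rules:separator
  \par
\stopxmlsetups

\starttext
  \xmlprocessbuffer{main}{rules}{}
\stoptext
```

Breaking any of the rules in the buffer (for instance, removing the quotes around an attribute value or dropping a closing tag) makes the fragment incorrect XML.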

Validity is relative to a document type. Strictly speaking, it is a property of the document, not of XML itself.

A document type (such as XHTML or TEI) defines a limited set of elements (of element names). Each element may contain one or more attributes with different values.

This specification is called the document type definition. You may think of it as the grammar rules of one particular European language.

For example, <whatever> is a correct XML element name, but it is not a valid XHTML or TEI element.

An even more extreme sample of correct XML would read:

<τεχτ>
  <βοδυ>
    <διβ type="essay">
      <ἡαδ>An Essay on Summer</ἡαδ>
      <π>Summer school in <δατη when="1990">MCMXC</δατη> was never easy;
      it went by too quickly and left us wanting more.</π>
      <π>But, as my friend <ναμη type="person">Peter</ναμη> said with his
      inimitable <ξένον xml:lang="fr">je ne sais quoi</ξένον>,
      <ἔφα>It never pays to think too hard</ἔφα>. Or, as I would rather
      put it, <λεγόμενον xml:lang="it">Que sera, sera</λεγόμενον>.</π>
    </διβ>
    <διβ type="essay">
      <ἡαδ>An Essay on Winter</ἡαδ>
      <π xml:lang="es">¡Hasta la vista…!</π>
    </διβ>
  </βοδυ>
</τεχτ>

This is invalid TEI, but ConTeXt only requires correct XML (which its own messages call “valid”) to compile a source.
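
As a minimal sketch of this point, the fragment below typesets an element whose name belongs to no document type. The buffer name any and the setup names xml:any:* are arbitrary choices for this illustration, and a ConTeXt MkIV (or later) installation is assumed.

```tex
% <whatever> is defined by neither XHTML nor TEI, but the source is correct XML.
\startbuffer[any]
<whatever kind="demo">Any correct element name will do.</whatever>
\stopbuffer

\startxmlsetups xml:any:base
  \xmlsetsetup{#1}{whatever}{xml:any:whatever}
\stopxmlsetups

\xmlregistersetup{xml:any:base}

% Typeset the content, followed by the attribute value in parentheses.
\startxmlsetups xml:any:whatever
  \xmlflush{#1} (\xmlatt{#1}{kind})
\stopxmlsetups

\starttext
  \xmlprocessbuffer{main}{any}{}
\stoptext
```

ConTeXt never consults a document type definition here; it only needs to be able to parse the source.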

XML Typesetting

Formatting XML sources with ConTeXt (or properly typesetting them) requires:

  • Selecting which parts you want to be typeset. At least, these selections will cover elements by their name.
  • Assigning these parts to single configuration commands (otherwise all will be displayed the same).

In practice, the ConTeXt configuration for XML (or environment file) contains:

  1. A set of XML (node) selections mapped or assigned to ConTeXt setups (or configurations).
  2. The registration of this mapping (or set of assignments).
  3. The configuration of each setup.

The basic skeleton reads:

\startxmlsetups xml:whatever
  \xmlsetsetup {#1} {*} {xml:*}
\stopxmlsetups

\xmlregistersetup{xml:whatever}

\startxmlsetups xml:body
  \xmlflush{#1}
\stopxmlsetups
% and as many definitions as there are XML selections

The two blank lines separate the three parts listed above.
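
Filled in for the TEI sample of this page, the skeleton might look like the sketch below. It assumes that the sample has been saved as essays.xml (a hypothetical file name) next to the ConTeXt source. The setup names xml:tei:* and the formatting chosen for each element are illustrative choices only, not a recommended TEI style.

```tex
% Part 1: map XML selections to setups.
\startxmlsetups xml:tei:base
  % elements that only pass their content on
  \xmlsetsetup{#1}{TEI|text|body|div|date|name}{xml:tei:flush}
  % the header is not typeset at all
  \xmlsetsetup{#1}{teiHeader}{xml:tei:skip}
  % elements with their own setup, named after the element
  \xmlsetsetup{#1}{head|p|foreign|said|quote}{xml:tei:*}
\stopxmlsetups

% Part 2: register the mapping.
\xmlregistersetup{xml:tei:base}

% Part 3: configure each setup.
\startxmlsetups xml:tei:flush
  \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:tei:skip
  % intentionally empty: nothing is typeset
\stopxmlsetups

\startxmlsetups xml:tei:head
  \subject{\xmlflush{#1}}
\stopxmlsetups

\startxmlsetups xml:tei:p
  \xmlflush{#1}\par
\stopxmlsetups

\startxmlsetups xml:tei:foreign
  {\it \xmlflush{#1}}
\stopxmlsetups

\startxmlsetups xml:tei:said
  \quotation{\xmlflush{#1}}
\stopxmlsetups

\startxmlsetups xml:tei:quote
  \quote{\xmlflush{#1}}
\stopxmlsetups

\starttext
  % "main" is an arbitrary name for the loaded document tree
  \xmlprocessfile{main}{essays.xml}{}
\stoptext
```

Mapping several element names to one shared setup (xml:tei:flush) keeps the third part short; elements that need distinct formatting get a setup named after the element through the xml:tei:* pattern.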

Documents about XML in MkIV

General Information

Processing XML with lua

XHTML in MKIV

Documents about XML in MkII (obsolete)

XML/ConTeXt in general

Additions and Details of XML/ConTeXt

eXaMpLe framework

(batch processing)

MathML

XSL/FO

Notes

  1. If this is all Greek to you, consider it as incorrect XML.
  2. For a detailed list, see a feature comparison list in Wikipedia.
  3. ConTeXt will complain with a message in the PDF document starting with “invalid xml file”.
  4. I am aware that the technical term is well-formedness, but I could not help looking for a more expressive replacement; correctness seems a suitable candidate.
  5. This is no more than a fanciful example, in no way an exhaustive description (or list).
  6. With or without space before the slash.