
User:Luigi.scarso/testpage

Revision as of 09:02, 16 January 2014


=The database=

The BibTeX format is rather popular in the TeX community and even with its shortcomings it will stay around for a while. Many publication websites can export to it and many tools are available to work with this database format. It is rather simple and looks a bit like Lua tables. Unfortunately the content can be polluted with non-standardized TeX commands, which complicates pre- or postprocessing outside TeX. In that sense a BibTeX database is often not coded neutrally. Some limitations, like the use of commands to encode accented characters, are rooted in the ASCII world and can be bypassed by using UTF instead (as handled somewhat in LaTeX through extensions such as <tt>bibtex8</tt>).

The normal way to deal with a bibliography is to refer to entries using a unique tag or key. When a list of entries is typeset, this reference can be used for linking purposes. The typeset list can be processed and sorted using the <tt>bibtex</tt> program that converts the database into something more TeX-friendly (a <tt>.bbl</tt> file). I never used the program myself (nor bibliographies) so I will not go into too much detail here, if only because all I say can be wrong.

In ConTeXt we no longer use the <tt>bibtex</tt> program: we just use database files and deal with the necessary manipulations directly in ConTeXt. One or more such databases can be used and combined with additional entries defined within the document. We can have several such datasets active at the same time.

A BibTeX file looks like this:

<pre>
@Article{sometag,
</pre>

Normally a value is given between quotes (or curly brackets) but single words are also OK (there is no real benefit in not using quotes, so we advise always using them). There can be many more fields and instead of strings one can use predefined shortcuts. The title, for example, quite often contains TeX macros. Some fields, like <tt>pages</tt>, have funny characters such as the en dash (typically as <tt>--</tt>), so we have a mixture of data and typesetting directives. If you are covering non-English references, you often need characters that are not in the ASCII subset, but ConTeXt is quite happy with UTF. If your database file uses old-fashioned TeX accent commands then these will be internally converted automatically to UTF. Commands (macros) are converted to an indirect call, which is quite robust.
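The accent conversion mentioned above can be illustrated with a minimal plain-Lua sketch. It is not the converter that ConTeXt uses: the function name <tt>toutf</tt> and the small <tt>accents</tt> table are made up for this example, only four accent commands and a handful of letters are covered, and outer grouping braces as in <tt>{\"o}</tt> are left alone.

```lua
-- Hypothetical lookup table: a few old-fashioned TeX accent commands and
-- their UTF-8 equivalents. A real converter needs a much larger table.
local accents = {
  ['"'] = { a = "ä", o = "ö", u = "ü" },
  ["'"] = { a = "á", e = "é", o = "ó" },
  ["`"] = { a = "à", e = "è" },
  ["^"] = { a = "â", e = "ê", o = "ô" },
}

-- Replace \"{o} and \"o style accent commands by UTF-8 characters.
-- When a combination is not in the table the callback returns nil,
-- which makes gsub keep the original text.
local function toutf(str)
  -- braced form first: \"{o}
  str = str:gsub("\\([\"'`^]){(%a)}", function(accent, letter)
    return accents[accent][letter]
  end)
  -- then the unbraced form: \"o
  str = str:gsub("\\([\"'`^])(%a)", function(accent, letter)
    return accents[accent][letter]
  end)
  return str
end

print(toutf([[Schr\"{o}dinger]]))  -- Schrödinger
print(toutf([[caf\'e]]))           -- café
```

Unknown combinations pass through untouched, so a later stage can still report them.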

The BibTeX files are loaded in memory as Lua tables but can be converted to XML so that we can access them in a more flexible way, but that is a subject for specialists.
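To give an idea of what "loaded as Lua tables" can mean, here is a small plain-Lua sketch that turns one BibTeX entry into a table. It is only an illustration and not the ConTeXt loader: the function name <tt>parseentry</tt>, the sample entry and the resulting table layout are assumptions, and the sketch ignores string shortcuts, nested braces and concatenation with <tt>#</tt>.

```lua
-- Minimal sketch: parse a single BibTeX entry into a Lua table.
-- Handles quoted values, single-level braced values and bare words only.
local function parseentry(str)
  local category, tag, body = str:match("@(%a+)%s*{%s*([^,%s]+)%s*,(.*)}")
  if not category then
    return nil
  end
  local entry = { category = category:lower(), tag = tag }
  -- key = "value"
  for key, value in body:gmatch('(%a+)%s*=%s*"([^"]*)"') do
    entry[key:lower()] = value
  end
  -- key = {value}
  for key, value in body:gmatch("(%a+)%s*=%s*{([^{}]*)}") do
    entry[key:lower()] = value
  end
  -- key = word (bare words stay strings, also when they are numbers)
  for key, value in body:gmatch("(%a+)%s*=%s*(%w+)") do
    entry[key:lower()] = value
  end
  return entry
end

-- Made-up sample entry, for illustration only.
local sample = [[
@Article{demo-tag,
    author  = "An Author and Another One",
    title   = {A hopefully meaningful title},
    journal = maps,
    year    = 2013,
}
]]

local entry = parseentry(sample)
print(entry.category, entry.tag, entry.year)  -- article  demo-tag  2013
```

Once an entry is a plain table like this, sorting, filtering and merging become ordinary Lua operations.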

In the old MkII setup we have two kinds of entries: the ones that come from the BibTeX run and user-supplied ones. We no longer rely on BibTeX output but we do still support the user-supplied definitions. These were in fact prepared in a way that suits the processing of BibTeX-generated entries. The next variant reflects the ConTeXt recoding of the old BibTeX output.

<pre>
\startpublication[k=Hagen:Second,t=article,a={Hans Hagen},y=2013,s=HH01]
</pre>

The split <tt>\artauthor</tt> fields are collapsed into a single <tt>author</tt> field as we deal with the splitting later when it gets parsed in Lua. The <tt>\artauthor</tt> syntax is only kept around for backward compatibility with the previous use of BibTeX.

In the new setup we support these variants as well:

</pre>

Because internally the entries are Lua tables, we also support loading of Lua-based definitions:

<pre>
return {
</pre>

Files set up like this can be loaded too. The following XML input is rather close to this, and is also accepted as input.

<?xml version="1.0" standalone="yes" ?>

=Commands in entries=

One unfortunate aspect commonly found in BibTeX files is that they often contain TeX commands. Even worse is that there is no standard on what these commands can be and what they mean, at least not formally, as BibTeX is a program intended to be used with many variants of TeX style: plain, LaTeX, and others. This means that we need to define our use of these typesetting commands. However, in most cases, they are just abbreviations or font switches and these are often known. Therefore, ConTeXt will try to resolve them before reporting an issue. In the log file there is a list of commands that have been seen in the loaded databases. For instance, loading <tt>tugboat.bib</tt> gives a long list of commands of which we show a small set here:

<pre>
publications > start used btx commands
</pre>
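How such a list of used commands could be gathered is sketched below in plain Lua. This is an assumption about the general idea, not the actual ConTeXt code: the function name <tt>usedcommands</tt> and the shape of the input (a table of entries, each a table of string fields) are chosen for the example.

```lua
-- Count control sequences (a backslash followed by letters) that occur
-- in the string fields of all entries.
local function usedcommands(luadata)
  local counts = { }
  for _, entry in pairs(luadata) do
    for _, value in pairs(entry) do
      if type(value) == "string" then
        for command in value:gmatch("\\(%a+)") do
          counts[command] = (counts[command] or 0) + 1
        end
      end
    end
  end
  return counts
end

-- Made-up data, for illustration only.
local data = {
  one = { title = [[The \TeX\ book and \LaTeX]] },
  two = { title = [[\TeX nical notes]], year = "2013" },
}

local counts = usedcommands(data)
print(counts.TeX, counts.LaTeX)  -- 2  1
```

A report like the one in the log can then be produced by looping over the counts in sorted order.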

These three suffixes are understood by the loader. Here the dataset has the name <tt>standard</tt> and the three database files are merged, where later entries having the same tag overload previous ones. Definitions in the document source (coded in TeX speak) are also added, and they are saved for successive runs. This means that if you load and define entries, they will already be known at the start of the next run, so that references to them are independent of when loading and definitions take place.
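The overloading rule (later entries with the same tag win) is easy to express in plain Lua. The following is a sketch under the assumption that each database has already been turned into a table indexed by tag; <tt>mergedatabases</tt> is a hypothetical name, not a ConTeXt function.

```lua
-- Merge already-parsed databases (tag -> entry) into one dataset.
-- Sources are processed in order, so a later entry with the same tag
-- replaces an earlier one.
local function mergedatabases(...)
  local dataset = { }
  for _, database in ipairs { ... } do
    for tag, entry in pairs(database) do
      dataset[tag] = entry
    end
  end
  return dataset
end

-- Made-up data, for illustration only.
local first  = { a = { title = "old A" }, b = { title = "B" } }
local second = { a = { title = "new A" }, c = { title = "C" } }

local merged = mergedatabases(first, second)
print(merged.a.title, merged.b.title, merged.c.title)  -- new A  B  C
```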

Because we are dealing with database input and because we generally need to manipulate entries, much of the work is delegated to Lua. This makes it easier to maintain and extend the code. Of course TeX still does the rendering. The typographic details are controlled by parameters but not all are used in all variants. As with most ConTeXt commands, it starts out with a general setup command:

</pre>

You can overload such setups if needed, but that only makes sense when you cannot configure the rendering with parameters. The <tt>\btxcitevariant</tt> command is one of the built-in accessors and it calls out to Lua where more complex manipulation takes place if needed. If no manipulation is known, the field with the same name (if found) will be flushed. A command like <tt>\btxcitevariant</tt> assumes that a dataset and specific tag have been set. This is normally done in the wrapper macros, like <tt>\cite</tt>. For special purposes you can use these commands:

\setbtxdataset[example]

=The Lua view=

Because we manage data at the Lua end it is tempting to access it there for other purposes. This is fine as long as you keep in mind that aspects of the implementation may change over time, although this is unlikely once the modules become stable.

The entries are collected in datasets and each set has a unique name. In this document we have the set named <tt>example</tt>. A dataset table has several fields, and probably the one of most interest is the <tt>luadata</tt> field. Each entry in this table describes a publication:

These details are accessed as <tt>publications.datasets.example.details["demo-001"]</tt> and by using a separate table we can overload fields in the original entry without losing the original.
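The idea of a separate table that overloads fields without touching the original can be sketched in plain Lua as follows. The tables here are stand-ins built for the example, and the helper <tt>getfield</tt> is hypothetical; how the ConTeXt code itself resolves fields is not shown on this page.

```lua
-- Look up a field: a value in the details table takes precedence,
-- otherwise the value from the original entry is used.
local function getfield(luadata, details, tag, field)
  local detail = details[tag]
  local value  = detail and detail[field]
  if value == nil then
    local entry = luadata[tag]
    value = entry and entry[field]
  end
  return value
end

-- Made-up stand-ins for the luadata and details tables of a dataset.
local luadata = { ["demo-001"] = { author = "Hans Hagen", year = "2013" } }
local details = { ["demo-001"] = { author = "H. Hagen" } }

print(getfield(luadata, details, "demo-001", "author"))  -- H. Hagen
print(getfield(luadata, details, "demo-001", "year"))    -- 2013
print(luadata["demo-001"].author)                        -- Hans Hagen
```

The original entry stays as loaded, so the overloading can be undone by clearing the corresponding detail.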

You can loop over the entries using regular Lua code combined with MkIV helpers:

local dataset = publications.datasets.example


BibTeX, the ConTeXt way


=The XML view=

The <tt>luadata</tt> table can be converted into an XML representation. This is a follow-up on earlier experiments with an XML-only approach. I decided in the end to stick to a Lua approach and provide some simple XML support in addition.

Once a dataset is accessible as an XML tree, you can use the regular <tt>\xml...</tt> commands. We start with loading a dataset, in this case from just one file.

<pre>
\usebtxdataset[tugboat][tugboat.bib]
</pre>

The dataset has to be converted to XML:

\convertbtxdatasettoxml[tugboat]

A more extensive example is the following. Of course this assumes that you know what XML support mechanisms and macros are available.

\startxmlsetups btx:getkeys

The original data is stored in a Lua table, hashed by tag. Starting with Lua 5.2, each run of Lua gets a different ordering of such a hash. In older versions, when you looped over a hash, the order was undefined, but it stayed the same as long as you used the same binary. This had the advantage that successive runs, something we often have in document processing, gave consistent results. In today’s Lua we need to do much more sorting of hashes before we loop, especially when we save multi-pass data. It is for this reason that the XML tree is sorted by hash key by default. That way lookups (especially the first of a set) give consistent outcomes.
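Sorting before looping can be done with a few lines of plain Lua. ConTeXt ships its own table helpers for this; the sketch below deliberately does not rely on them, and the names <tt>sortedkeys</tt> and <tt>sortedpairs</tt> are defined locally for the example. It assumes that all keys are strings, as tags are.

```lua
-- Collect the keys of a hash and sort them.
local function sortedkeys(t)
  local keys = { }
  for key in pairs(t) do
    keys[#keys+1] = key
  end
  table.sort(keys)
  return keys
end

-- Iterator that walks a hash in sorted key order, so that the order is
-- the same in every run.
local function sortedpairs(t)
  local keys, i = sortedkeys(t), 0
  return function()
    i = i + 1
    local key = keys[i]
    if key ~= nil then
      return key, t[key]
    end
  end
end

-- Made-up data, for illustration only.
local data = {
  ["demo-003"] = { year = "2013" },
  ["demo-001"] = { year = "2011" },
  ["demo-002"] = { year = "2012" },
}

for tag, entry in sortedpairs(data) do
  print(tag, entry.year)  -- demo-001, demo-002, demo-003, in that order
end
```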

=Standards=

The rendering of bibliographic entries is often standardized and prescribed by the publisher. If you submit an article to a journal, normally it will be reformatted (or even re-keyed) and the rendering will happen at the publisher’s end. In that case it may not matter how entries were rendered when writing the publication, because the publisher will do it his or her way. This means that most users probably will stick to the standard APA rules and for them we provide some configuration. Because we use setups it is easy to overload specifics. If you really want to tweak, best look in the files that deal with it.

Many standards exist and support for other renderings may be added to the core. Interested users are invited to develop and to test alternate standard renderings according to their needs.

=Cleaning up=

Although the BibTeX format is reasonably well defined, in practice there are many ways to organize the data. For instance, one can use predefined string constants that get used (whether or not combined with other strings) later on. A string can be enclosed in curly braces or double quotes. The strings can contain TeX commands but these are not standardized. The databases often have somewhat complex ways to deal with special characters and the use of braces in their definition is also not normalized.

The most complex fields to deal with are those that contain names of people. At some point it might be needed to split a combination of names into individual ones that then get split into title, first name, optional inbetweens, surname(s) and additional: <tt>Prof. Dr. Alfred B. C. von Kwik Kwak Jr. II and P. Q. Olet</tt> is just one example of this. The convention seems to be not to use commas but <tt>and</tt> to separate names (often each name will be specified as lastname, firstname).
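A rough plain-Lua sketch of such a split is given below. It is a simplification and not the ConTeXt implementation: the names <tt>splitname</tt> and <tt>splitauthors</tt> and the result fields are assumptions, titles end up in the first names, additions such as <tt>Jr.</tt> stay with the surnames, only the first comma is considered, and lowercase detection works for ASCII letters only.

```lua
-- Split one name into firstnames, vons (lowercase inbetweens) and surnames.
local function splitname(name)
  -- "lastname, firstname" form
  local surnames, firstnames = name:match("^(.-)%s*,%s*(.*)$")
  if surnames then
    return { firstnames = firstnames, vons = "", surnames = surnames }
  end
  local words = { }
  for word in name:gmatch("%S+") do
    words[#words+1] = word
  end
  -- the first run of lowercase words (never the last word) is the von part
  local vonstart, vonstop
  for i = 1, #words - 1 do
    if words[i]:find("^%l") then
      vonstart = vonstart or i
      vonstop  = i
    elseif vonstart then
      break
    end
  end
  if vonstart then
    return {
      firstnames = table.concat(words, " ", 1, vonstart - 1),
      vons       = table.concat(words, " ", vonstart, vonstop),
      surnames   = table.concat(words, " ", vonstop + 1, #words),
    }
  end
  return {
    firstnames = table.concat(words, " ", 1, #words - 1),
    vons       = "",
    surnames   = words[#words] or "",
  }
end

-- Split a list of names separated by the word "and".
local function splitauthors(str)
  local authors = { }
  for name in (str .. " and "):gmatch("(.-)%s+and%s+") do
    local clean = name:match("^%s*(.-)%s*$")
    if clean ~= "" then
      authors[#authors+1] = splitname(clean)
    end
  end
  return authors
end

local authors = splitauthors("Prof. Dr. Alfred B. C. von Kwik Kwak Jr. II and P. Q. Olet")
print(#authors, authors[1].vons, authors[1].surnames, authors[2].surnames)
```

For the example from the text this gives two authors, with <tt>von</tt> as the inbetween part of the first one.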

</pre>

For MkIV the modules were partly rewritten and ended up in the core, so the two commands are not needed there. One advantage of explicitly loading a module is that a job that doesn’t need references to publications doesn’t suffer from the associated overhead. Nowadays this overhead can be neglected. The first setup command in this example is needed to bootstrap the process: it tells what database has to be processed by BibTeX between runs. The second setup command is optional. Each citation (tagged with <tt>\cite</tt>) ends up in the list of publications.

In the new approach the code is again in the ConTeXt kernel, so no modules need to be loaded. But, as we no longer use BibTeX, we don’t need to set up BibTeX. Instead we define dataset(s). We also no longer set up publications with one command, but have split that up into rendering, list, and cite variants. The basic <tt>\cite</tt> command remains.

<pre>
\definebtxdataset
</pre>

But keep in mind that, compared to the old MkII-derived method, we have moved some of the setup options to setting up the list and cite variants.

Another difference is the use of lists. When you define a rendering, you also define a list. However, all entries are collected in a common list tagged <tt>btx</tt>. Although you will normally configure a rendering you can still set some properties of lists, but in that case you need to prefix the list identifier. In the case of the above example this is <tt>btx:document</tt>.

=MlBibTeX=

Todo: how to plug in MlBibTeX for sorting and other advanced operations.

=Extensions=

As TeX and Lua are both open and accessible in ConTeXt, it is possible to extend the functionality of the bibliography-related code. For instance, you can add extra loaders.

function publications.loaders.myformat(dataset,filename)
