Encodings and Regimes in XeTeX

Revision as of 21:50, 19 May 2006 by Mojca Miklavec (a few notes about (old) encodings in XeTeX before I forget them)

This is a somewhat temporary and incomplete page for those interested in details of how XeTeX deals with input encodings and font encodings. (I've dropped a couple of notes here before I forget them.)

Ordinary users should not need to bother with this: ConTeXt should eventually do everything automatically, using the same macro commands as for pdfTeX. So please do not treat this page as part of the official documentation or as a recipe for how things should be done. This information may become obsolete.

Input Encodings (Regimes)

As opposed to most other TeX engines, XeTeX interprets the input as UTF-8 by default. In pdfTeX you have to say

\enableregime[utf-8]

which is no longer necessary in XeTeX (to be more precise, it is even ignored).

However, there is a subtle difference:

  • The macros for pdfTeX convert part of Unicode (the most frequently used characters, those that are easy to name, those present in common TeX encodings, and those that can easily be faked) into named glyphs. For example, 'č' becomes \ccaron. By default this expands into a caron accent placed above 'c', unless the glyph exists in the font encoding; with EC-encoded fonts, the proper ccaron glyph is used.
  • If you use an OpenType font in XeTeX, XeTeX automatically finds the glyph in the font that corresponds to the letter you type; no macro trickery is needed in between. However, if you are using Type1 fonts, or if the font has no suitable glyph, you are on your own: no glyphs are faked in the background, and you get empty output instead of the characters you typed. (This might/should be modified/fixed.)
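
As a rough illustration of the second point, the following plain XeTeX sketch loads an OpenType font by file name and typesets UTF-8 input directly. The font file lmroman10-regular.otf is an assumption (any installed OpenType font with the needed glyphs would do), and the sample sentence is made up.

```tex
% Plain XeTeX sketch: compile with `xetex`, file saved as UTF-8.
% The font file name is an assumption; substitute any OpenType font you have.
\font\body="[lmroman10-regular.otf]" at 11pt
\body
Čas je za čaj. % č is looked up in the font directly, no \ccaron involved
\bye
```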

Using another input encoding (regime)

The macro

\enableregime[name]

should be adapted for use in XeTeX.

(Written by Jonathan Kew on the XeTeX mailing list:)

You can say

\XeTeXinputencoding "cp1250"

and the input will then be interpreted as codepage 1250 (and mapped to the corresponding Unicode character codes for processing within XeTeX).

To be precise, \XeTeXinputencoding will change the interpretation of the input bytes beginning at the *next* line of the input file.
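
A minimal sketch of that timing, assuming a file whose first lines are plain ASCII and whose remainder was saved in codepage 1250:

```tex
% Everything up to and including the next line is read with the old encoding.
\XeTeXinputencoding "cp1250"
% From this line on, input bytes are interpreted as codepage 1250,
% so byte 0xE8 arrives in XeTeX as U+010D (č).
```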

There is also

\XeTeXdefaultencoding "..."

which is similar, but it does not affect the reading of the *current* file; instead, it changes the initial encoding for any files that are *subsequently* opened. Therefore, you can use this to get XeTeX to read an existing file in a legacy encoding, without having to edit that file itself -- just set the default encoding from a "driver" file before doing \input.
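
Such a driver file might look roughly like this. The file names driver.tex and oldchapter.tex are hypothetical, and resetting the default to "utf8" afterwards is a precaution rather than a requirement:

```tex
% driver.tex (hypothetical name), itself saved as UTF-8 or plain ASCII
\XeTeXdefaultencoding "cp1250" % applies to files opened from now on
\input oldchapter              % hypothetical legacy file in codepage 1250
\XeTeXdefaultencoding "utf8"   % later files are read as UTF-8 again
```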

One caution about setting the default encoding: regardless of the \XeTeX...encoding settings, any files that XeTeX writes (with \write) will *always* be written as UTF-8. So if you're writing and then reading auxiliary files from within the job, these will be UTF-8 even if your main input text is a legacy codepage, and you may have to take care to switch the default encoding back to utf8 before opening such a file.
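
The following sketch shows one way to handle that round trip. It assumes plain TeX's \newwrite is available; the stream name \notesfile and the .notes extension are made up for the example:

```tex
\XeTeXdefaultencoding "cp1250"   % legacy sources are read as cp1250
\newwrite\notesfile              % hypothetical output stream
\immediate\openout\notesfile=\jobname.notes
\immediate\write\notesfile{some text collected during the run}
\immediate\closeout\notesfile

% The file just written is UTF-8, whatever the input encoding was.
\XeTeXdefaultencoding "utf8"
\input \jobname.notes
\XeTeXdefaultencoding "cp1250"   % back to the legacy default
```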

There's another possibility, too: if you use

\XeTeXinputencoding "bytes"

then XeTeX will simply read the input text as byte values 0..255, with no attempt to map them to Unicode according to any specific codepage. (In practice, this is probably the same as using encoding "8859-1" or "latin1".) If you do this, then ConTeXt's active-character scheme for handling encodings within TeX macros should still work, as you'll be getting the same character codes as standard TeX would see.

Font Encodings

You are advised to use OpenType fonts anyway, but the following applies in case you still want to use the old Type1 fonts with 256 glyphs.

XeTeX currently doesn't know which glyphs are present in such fonts. A temporary workaround is to define active characters for the letters that you need, like this:

\catcode`č=\active \def č{^^a3} % ccaron is in slot 163 (0xA3) of the EC encoding

Or, since ConTeXt is clever enough to take care of the rest:

\catcode`č=\active \def č{\ccaron}

However, this should be fixed/automated somehow.
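
Until that happens, the repetition can be reduced with a small helper. This is only a sketch: \mapactive is a hypothetical name, not an existing ConTeXt macro, and it relies on the usual \lowercase trick and on ~ being an active character, as it is in plain TeX and ConTeXt.

```tex
% \mapactive <char>{<replacement>} makes <char> active and defines it.
\def\mapactive#1#2{%
  \catcode`#1=\active
  % Turn ~ into an active token with the character code of #1,
  % then define that token.
  \begingroup \lccode`\~=`#1 \lowercase{\endgroup \def~}{#2}}

\mapactive č{\ccaron}
\mapactive š{\scaron}
```

The \lowercase detour is needed because the argument #1 has already been read as an ordinary letter by the time its catcode is changed, so a plain \def#1 would not define the active character.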

Misc.

Testing for the presence of a glyph

Macros in ConTeXt for faking the glyphs could be adapted a bit, so that if a certain glyph was missing, it would be faked.

Example:

\ifnum\XeTeXcharglyph"018E>0
    \char"018E
\else
    % code for faking it
\fi
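
The same test could be wrapped in a reusable macro. The name \charorfake is hypothetical, and the sketch assumes the current font is one that XeTeX loads natively (for example an OpenType font), so that \XeTeXcharglyph can query it:

```tex
% \charorfake{<code>}{<fallback>}: typeset the character if the current
% font has a glyph for it, otherwise typeset the fallback material.
\def\charorfake#1#2{%
  \ifnum\XeTeXcharglyph#1>0
    \char#1\relax
  \else
    #2%
  \fi}

\charorfake{"018E}{[U+018E]} % fallback here is just a visible placeholder
```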