Encodings and Regimes in XeTeX

From ConTeXt wiki

This is a somewhat temporary & incomplete page for those interested in details about how XeTeX deals with input encodings and font encodings. (I've dropped a couple of notes here before I forget them.)

Plain users should not bother about this - ConTeXt should eventually do everything automatically using the same macro commands as for pdfTeX. So please do not consider this page part of the official documentation, or a recipe for how things should be done. This information might become obsolete.


Input Encodings (Regimes)

As opposed to most other TeX engines, XeTeX interprets the input as UTF-8 by default. In pdfTeX you have to say

\enableregime[utf]

which is no longer necessary in XeTeX (to be more precise: it is even ignored).

However, there is a subtle difference when you want to use another input encoding:

Using another input encoding (regime)

The macro

\enableregime

should be adapted for use in XeTeX.

(Written by Jonathan Kew on the XeTeX mailing list:)

You can say

\XeTeXinputencoding "cp1250"

and input will then be interpreted as codepage 1250 (and mapped to the corresponding Unicode character codes for processing within XeTeX).

To be precise, \XeTeXinputencoding will change the interpretation of the input bytes beginning at the *next* line of the input file.
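A sketch of how such a switch takes effect (the codepage here is only illustrative):

    % this line is still read as UTF-8
    \XeTeXinputencoding "cp1250"
    % beginning with this line, input bytes are interpreted as cp1250
    ... text stored in codepage 1250 ...
    \XeTeXinputencoding "utf8"
    % and starting with this line the file is read as UTF-8 again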

There is also

\XeTeXdefaultencoding "..."

which is similar, but it does not affect the reading of the *current* file; instead, it changes the initial encoding for any files that are *subsequently* opened. Therefore, you can use this to get XeTeX to read an existing file in a legacy encoding, without having to edit that file itself -- just set the default encoding from a "driver" file before doing \input.
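A minimal sketch of such a "driver" file (the filenames are hypothetical):

    % driver.tex -- this file itself is written in UTF-8
    \XeTeXdefaultencoding "cp1250"  % files opened from now on are read as cp1250
    \input oldchapter               % a legacy file in codepage 1250, unmodified
    \XeTeXdefaultencoding "utf8"    % restore the default for further files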

One caution about setting the default encoding: regardless of the \XeTeX...encoding settings, any files that XeTeX writes (with \write) will *always* be written as UTF-8. So if you're writing and then reading auxiliary files from within the job, these will be UTF-8 even if your main input text is a legacy codepage, and you may have to take care to switch the default encoding back to utf8 before opening such a file.
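A sketch of that situation (the auxiliary file name is hypothetical):

    \XeTeXdefaultencoding "cp1250"
    \input legacytext                 % read as codepage 1250
    \newwrite\myaux
    \immediate\openout\myaux=myjob.tmp
    \immediate\write\myaux{some data} % written as UTF-8 regardless
    \immediate\closeout\myaux
    \XeTeXdefaultencoding "utf8"      % switch back before re-reading it
    \input myjob.tmp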

There's another possibility, too: if you use

\XeTeXinputencoding "bytes"

then XeTeX will simply read the input text as byte values 0..255, with no attempt to map them to Unicode according to any specific codepage. (In practice, this is probably the same as using encoding "8859-1" or "latin1".) If you do this, then ConTeXt's active-character scheme for handling encodings within TeX macros should still work, as you'll be getting the same character codes as standard TeX would see.

Font Encodings

You are advised to use OpenType fonts anyway, but you may still want to use the old Type 1 fonts with 256-glyph encodings.

XeTeX currently doesn't know which glyphs are present in such fonts. A temporary way out is to define active characters for the letters that you need, like this:

\catcode`č=\active \defč{^^a3} % ccaron is at slot 163 (0xA3) in the EC encoding

Or, since ConTeXt is clever enough to handle the rest of the story properly:

\catcode`č=\active \defč{\ccaron}

However, this should be fixed/automated somehow.


Testing for the presence of a glyph

The macros in ConTeXt for faking glyphs could be adapted a bit, so that a glyph is faked only if it is actually missing from the font.


    % code for faking it
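One conceivable way to perform such a test (just a sketch, not the actual ConTeXt code: it relies on the e-TeX \iffontchar primitive, and the character is only an example):

    % use the real glyph if the current font has U+010D, otherwise fake it
    \iffontchar\font"010D
      \char"010D
    \else
      \v{c}% build ccaron from letter plus accent (plain TeX \v accent)
    \fi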

Named glyphs

If you know the name of a glyph, but not its exact location (for example, it could be a glyph in the Private Use Area of Unicode, which is not standardized), \XeTeXglyphindex can help you.

\XeTeXglyphindex "glyphname"

returns the glyph ID of the named glyph in the current font.

You can use it like this:

    \XeTeXglyph\XeTeXglyphindex "ubreveinvertedlow"

(Works with XeTeX >= 0.993 and Latin Modern >= 1.0.)

