
6,178 bytes added , 13:17, 9 August 2020


Text replacement - "<cmd>" to "{{cmd|"


==Introduction==

The [http://unicode.org/ Unicode] effort clearly shows that 256 characters cannot possibly contain the world's languages. However, traditional TeX is an old system, and will only deal with 256 characters per font. Similarly, "legacy" file encodings on current operating systems (e.g. Latin-1, Mac Roman, Windows 1252, ISO-8859-#) attempt to shoehorn a set of characters into eight bits.

As a result, working with [[pdfTeX]] (MkII) you need to choose which input encoding ('''regime''') and which font/output encoding ('''encoding''') you use.

Modern TeX variants, from [[Omega]] over [[XeTeX]] to [[LuaTeX]], dropped that limitation and work with full Unicode character sets, in fonts as well as in your source documents.

==Font Encodings==

LaTeX users will probably know them under the name '''fontenc''' (<code>\usepackage[T1]{fontenc}</code> for example). As TeX can only handle 256 characters at once, it is important to choose the encoding which covers all the characters of your language; otherwise hyphenation won’t work for words with composite characters, and most probably you won’t be able to simply extract text from the resulting PDFs. To enable ec encoding in [[Latin Modern]] for example, you can type:

\usetypescript[modern][ec]

Some good choices for encodings are:

=== in pdfTeX (MkII) ===

* '''texnansi''' for Western European languages with only a small subset of additional accented characters (includes many other important glyphs)

* '''ec''' for European languages with many accented characters

* '''qx''' as a compromise between the two above, supposed to cover most Central European languages (more accented characters than texnansi and more additional glyphs in comparison to ec)

* '''t5''' for [[Vietnamese]]

http://fun.contextgarden.net/encodingtable/enctable.rb?ec,texnansi,8r,8a

=== in XeTeX (MkII) ===

* '''uc''' standing for Unicode (the only font encoding supported by [[XeTeX]])
* ('''texnansi''' as the very last resort in [[XeTeX]] - where there are no proper fonts available apart from the old ones)

=== in LuaTeX (MkIV) ===

You can normally forget about font encodings.

=== A note about the ec encoding ===

The ec encoding is also known under the names '''cork''' or '''T1''' (<code>\usepackage[T1]{fontenc}</code> in LaTeX). Its old version was '''dc''' (should not be used any more). Some of the glyph names in ec are old and deprecated; '''tex256''' uses the same set of glyphs, but the glyph names are compatible with Adobe, see also [ftp://tug.ctan.org/pub/tex-archive/info/fontname/tex256.enc tex256.enc] and [http://partners.adobe.com/public/developer/en/opentype/aglfn13.txt Adobe Glyph List].

=== Searching for non-ASCII characters in Adobe Reader ===

Some characters (<code>\ccaron</code> - 'č' being one of them, for example) are not properly recognized by Adobe (Acrobat) Reader (especially by older versions) when searching or copying text from PDF documents. In order to help Acrobat recognize the glyphs and treat them properly, add this piece of code to your source:

<texcode>
\input enco-pfr
\startencoding [ec] \usepdffontresource ec \stopencoding
</texcode>

At the time of writing this article, only '''il2''' and '''ec''' were supported, but support for other encodings can be added.

See also:
* [http://archive.contextgarden.net/message/20050727.073731.41799d96.html mailing list thread]
* Release notes [[Context 2005.07.27]]
* [http://source.contextgarden.net/tex/context/base/enco-pfr.tex enco-pfr.tex] and all pdfr-*.tex files

== Input Regimes ==

(Also known as "input encodings", but in ConTeXt "encoding" refers to fonts, while input is handled by a "regime".)

If you write ConTeXt source documents and use more than 7-bit ASCII, you must decide on the encoding of your file. That’s a matter of your text editor. The best choice is normally UTF-8, but if you insist on using an outdated editor that can’t handle Unicode properly, or if you’re forced to use legacy code, you have to choose the proper 8-bit encoding, see below.

=== Testing for UTF-8-aware TeX ===

To test for [[LuaTeX]], one may test if <code>\directlua</code> is defined. The next weird macro definition should work for testing XeTeX/LuaTeX, because only XeTeX and LuaTeX accept 5- and 6-byte caret notation (hex 22 == double quote):

<texcode>
\def\"{0}
\expandafter\def\csname^^^^^00022\endcsname{1}
\ifnum\"=0 \message{tex82}\else\message{newstuff}\fi
</texcode>

But that is not quite the same as testing for native UTF-8. Better is a trick like this:

<texcode>
\def\test#1#2!{\def\secondarg{#2}}
\test χ!\relax % That's Chi, a 2-byte utf-8 sequence
\ifx\secondarg\empty \message{newstuff}\else \message{tex82}\fi
</texcode>

ConTeXt offers <code>\beginNEWTEX ... \endNEWTEX</code> to process code conditional on using LuaTeX or XeTeX.

===Available Regimes===


<table>
<tr style="background-color:#DDDDDD"><th>ConTeXt name(s)</th><th>Official name(s)</th><th>Remarks</th></tr>
<tr><td>[[source:regi-cp1250.tex|cp1250]] = windows-1250</td><td>Windows CP 1250</td><td>East European, see also iso-8859-2</td></tr>
<tr style="background-color:#EEEEEE"><td>[[source:regi-cp1251.tex|cp1251]] = windows-1251</td><td>Windows CP 1251</td><td>[[Russian|Cyrillic]]</td></tr>
<tr><td>[[source:regi-cp1252.tex|cp1252]] = windows-1252 (= win)</td><td>Windows CP 1252</td><td>West European, see also iso-8859-1, iso-8859-15</td></tr>
<tr style="background-color:#EEEEEE"><td>[[source:regi-cp1253.tex|cp1253]] = windows-1253</td><td>Windows CP 1253</td><td>[[Greek]]</td></tr>
<tr><td>[[source:regi-cp1254.tex|cp1254]] = windows-1254</td><td>Windows CP 1254</td><td>Turkish</td></tr>
<tr style="background-color:#EEEEEE"><td>[[source:regi-cp1257.tex|cp1257]] = windows-1257</td><td>Windows CP 1257</td><td>Windows Baltic</td></tr>
<tr><td>[[source:regi-iso-8859-1|iso-8859-1]] = latin1 = il1</td><td>ISO-8859-1, ISO Latin 1</td><td>Western European</td></tr>
<tr style="background-color:#EEEEEE"><td>[[source:regi-iso-8859-2|iso-8859-2]] = latin2 = il2</td><td>ISO-8859-2, ISO Latin 2</td><td>East European, see also cp1250</td></tr>
<tr><td>[[source:regi-iso-8859-7|iso-8859-7]] = grk</td><td>ISO-8859-7</td><td>[[Greek]]</td></tr>
<tr style="background-color:#EEEEEE"><td>[[source:regi-iso-8859-15|iso-8859-15]] = latin9 = il9</td><td>[[ISO-8859-15]], ISO Latin 9</td><td>ISO Latin-1 + Euro</td></tr>
<tr><td>mac</td><td>Mac Roman</td><td>Western European languages</td></tr>
<tr style="background-color:#EEEEEE"><td>ibm</td><td>IBM PC DOS</td><td>Western European languages</td></tr>
<tr><td>utf</td><td>UTF-8</td><td>[[Unicode]], see below</td></tr>
<tr style="background-color:#EEEEEE"><td>vis = viscii</td><td>VISCII</td><td>[[Vietnamese]]</td></tr>

<tr><td>cp866, cp866nav</td><td>DOS CP 866</td><td>[[Russian|cyrillic]]</td></tr>

<tr style="background-color:#EEEEEE"><td>koi8-r, koi8-u, koi8-ru</td><td>KOI8</td><td>[[Russian|cyrillic]] (russian, ukrainian, mixed)</td></tr>

<tr style="background-color:#EEEEEE"><td>cp855, cp866av, cp866mav, cp866tat, ctt, dbk, iso88595, isoir111, mik, mls, mnk, mos, ncc</td><td>(several)</td><td>rare cyrillic encodings, see [http://source.contextgarden.net/regi-cyp.tex regi-cyp.tex]</td></tr>

</table>

Other regimes can be provided on request.

A list of available language codes is in [http://source.contextgarden.net/mult-sys.tex mult-sys.tex].

You find output/font encodings in <tt>enco-*.tex</tt> files.

* See [http://czyborra.com/charsets/iso8859.html ISO 8859] for ISO standards.
* See [http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/] for Windows code pages.

You enable such a regime with {{cmd|enableregime}}<code>[some]</code>, preferably in your [[Project_structure|environment file]].

===Typesetting in UTF-8===

Use {{cmd|enableregime}}<code>[utf]</code> in order to be able to typeset in Unicode under ConTeXt MkII. (This is '''not''' necessary in MkIV, as it is enabled by default using LuaTeX.)

Unfortunately you must save your UTF-8 encoded files '''without BOM''' (byte order mark), because ConTeXt (or pdfTeX) doesn't ignore it but typesets the characters. This is correct behaviour, since the Unicode standard neither requires nor recommends a BOM for UTF-8, even if its use is widespread.

===Using non-ASCII characters===

As a TeX/LaTeX user you were probably told to use the accents in the following way (the example is taken from the TeXbook, page 24):

Once upon a time, in a distant

named R.~J. Drofnats.

</texcode>

The galaxy name will be shown as <context>\"O\"o\c c</context>.

In ConTeXt, please '''try to avoid''' using this backslashed character composition, if possible (there are several good reasons for it: hyphenation, etc.).

You have two alternatives:

====Type the characters as you do in any other text editor====

\enableregime[utf] % or any other supported regime

galaxy called Ööç

</texcode>

Once you figure out what regime you need, you can simply type the characters as you do in any text editor.

(See above for the list of available regimes. If you don't find the one you would like to use, please ask on the mailing list.)

With the [[LuaTeX]] engine ([[Mark IV]]) all UTF-8 characters from your font may be used directly. Typing some characters may require some keyboard setting or may not be possible at all. In this case try to copy/paste them from this [[Symbols/utf8|list of selected utf8 characters]] or use your OS’s character table or a program like [http://live.gnome.org/Gucharmap Gucharmap].

====Use glyph names====

If you don't have the letter on your keyboard, if you are too lazy to look it up in a table, if your editor font doesn’t show it, or if you want some strange letters not supported by the regime/font/editor/OS you use, you can access the glyphs by their names:

Once upon a time, in a distant

</texcode>

====How do I know which glyph name to use?====

* Under MkII, use {{cmd|showcharacters}}
* Consult the [http://partners.adobe.com/public/developer/en/opentype/aglfn13.txt Adobe glyph list]
* Browse [[source:enco-acc.tex|enco-acc.tex]], {{src|enco-acc.mkii}}, or {{src|char-def.lua}}. (Warning: The Lua file is 3.6 MB large and contains nearly 180,000 lines.)

<bcontext mode=mkii source="yes">\showcharacters</bcontext>

==How does it work?==

'''Robert Ermers''' and '''[[User:adam|Adam]]''' provided a helpful explanation of how characters are constructed in LaTeX and ConTeXt (in some discussion on the mailing list):

You know that all characters in a font have a number. If you type <code>a</code>, the font mechanism makes sure that you see an <context>a</context>. In reality the font shows you the character that is put on the numerical position of <code>a</code>. In the font Dingbats for example, the character on that position is not an <context>a</context>, but a symbol.

===In LaTeX===

The combination <code>\"{a}</code> can mean two things:

* in most fonts: show the character on a given numerical position, which means that there is one character <context>\"{a}</context>.
* in some other fonts <code>\"{a}</code> means: combine <code>¨</code> with <code>a</code> and make an <context>\"{a}</context>. This means that <code>¨</code> is combined with the character on the numerical position of <code>a</code>. TeX does this very well and thus constructs very acceptable diacritical signs like <code>\"{q}</code>, <code>\d{o}</code>, <code>\v{o}</code>, which do not exist in regular fonts.

If you have a font which contains <context>\"{q}</context> (<code>\"{q}</code>), <context>\d{o}</context> (<code>\d{o}</code>) or some other special characters, you may instruct TeX not to create the character, but rather to show the contents of a given numerical position in that font. That's what the <code>.enc</code> and <code>.fd</code> files under LaTeX are for.

That’s also the reason why there are, or used to be, special fonts for Polish and Czech and other languages: they contain predefined characters in one single numerical position, e.g. <code>\v{s}</code> and <code>\v{c}</code>, that TeX does not have to create anew from two glyphs.
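The difference between a precomposed character in a single slot and a character built from a base letter plus an accent can be illustrated outside TeX. The following is only an analogy in Python using Unicode normalization, not a model of TeX's font mechanism; the helper name <code>compose</code> is made up for this sketch.

```python
import unicodedata

DIAERESIS = "\u0308"  # COMBINING DIAERESIS
CARON = "\u030c"      # COMBINING CARON


def compose(base: str, combining_mark: str) -> str:
    """Return the NFC form of base + mark.

    If Unicode has a precomposed character for the pair, NFC yields that
    single code point; otherwise the two code points stay separate.
    """
    return unicodedata.normalize("NFC", base + combining_mark)


for base, mark in [("a", DIAERESIS), ("s", CARON), ("q", DIAERESIS)]:
    result = compose(base, mark)
    kind = "one precomposed slot" if len(result) == 1 else "built from two pieces"
    print(f"{base} + U+{ord(mark):04X}: {kind}")
```

Here <code>a</code> and <code>s</code> end up as one precomposed character each, while <code>q</code> with a diaeresis has no slot of its own and has to be assembled, which is the situation the text describes for <code>\"{q}</code>.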

===In ConTeXt===

The combination <code>\"{a}</code> means one thing: <code>\adiaeresis</code> (see [[source:enco-acc.tex|enco-acc]]). This <code>\adiaeresis</code> can mean one of two things, depending on the font encoding:

* Numerical position, or

* The fallback case (defined in [[source:enco-def.tex|enco-def]]), where a diaeresis/umlaut is placed atop an <context>a</context> glyph. Hyphenation implications as described.

The interesting/helpful thing about ConTeXt is that internally, that glyph is given a consistent name, no matter how it is input or output. So, if you type <code>ä</code> in your given input regime, and that encoding is properly set, that numerical <code>ä</code> (e.g., character <code>#228</code> in the windows regime) is mapped to <code>\adiaeresis</code>.
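The idea that different input bytes end up under one consistent internal name can be sketched in Python. This is an illustration only, not ConTeXt code: the table <code>REGIME_TO_CODEC</code> and the function <code>internal_name</code> are hypothetical, Python codec names stand in for ConTeXt regimes (treating <code>ibm</code> as code page 437 is an assumption), and the Unicode character name plays the role of <code>\adiaeresis</code>.

```python
import unicodedata

# Hypothetical mapping from ConTeXt regime names to Python codec names.
# "ibm" -> cp437 is an assumption made for this sketch.
REGIME_TO_CODEC = {
    "win": "cp1252",
    "mac": "mac_roman",
    "ibm": "cp437",
}


def internal_name(raw: bytes, regime: str) -> str:
    """Decode one byte under the given regime and return a stable name for it."""
    char = raw.decode(REGIME_TO_CODEC[regime])
    return unicodedata.name(char)


# The same letter sits at a different byte position in each 8-bit encoding.
for regime, raw in [("win", b"\xe4"), ("mac", b"\x8a"), ("ibm", b"\x84")]:
    print(regime, raw.hex(), internal_name(raw, regime))
```

All three inputs resolve to the same name, which is why the rest of the processing does not need to care which regime the file was saved in.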

Wanna know what happens in '''UTF-8'''? Here’s a 'simplified' explanation:

In a UTF-8 bytestream, that character <context>\"{a}</context> is signified by two bytes: <code>0xC3</code>, <code>0xA4</code>. That first byte triggers a conversion of both bytes into two different bytes, the actual Unicode number, <code>0x00 0xE4</code> (or: <code>0, 228</code>). ConTeXt then looks into internal hashes set up (in this case, the [[source:unic-000.tex|unic-000]] vector), looks at the 228th element, and sees that it's <code>\adiaeresis</code>. Things then proceed as normal. <tt>:)</tt>

(It’s also interesting to note that for PostScript and TrueType fonts, that number -> name -> number (glyph) mapping happens yet again in the driver. But all that is outside of TeX proper, so to say any more would be confusing.)

==Conversion between encodings==

It is possible to convert a string from the current encoding to another using Lua (originally discussed at [http://www.ntg.nl/pipermail/ntg-context/2012/065115.html] and [http://www.ntg.nl/pipermail/ntg-context/2012/065256.html]):

<texcode>
\startluacode
    -- Usage:
    -- regimes.toregime(<target-encoding>, <text>, <character-on-failure>)
    regimes.toregime("cp1250", "abcč\192\200žý", "?")
    -- Returns "abcč??žý" (one byte per character)
\stopluacode
</texcode>
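Both steps, the two-byte UTF-8 decoding from "How does it work?" and the conversion to an 8-bit encoding with a fallback character, can be sketched in Python. This is not how ConTeXt implements either; <code>decode_two_byte_utf8</code> and <code>to_regime</code> are hypothetical helpers written for this illustration, and the sample text uses two CJK characters (instead of the raw bytes in the Lua example) as input that cp1250 cannot represent.

```python
def decode_two_byte_utf8(first: int, second: int) -> int:
    """Turn a two-byte UTF-8 sequence into its Unicode number."""
    if first & 0xE0 != 0xC0 or second & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Five payload bits from the lead byte, six from the continuation byte.
    return ((first & 0x1F) << 6) | (second & 0x3F)


def to_regime(text: str, target: str, on_failure: str = "?") -> bytes:
    """Encode text into an 8-bit encoding, one character at a time.

    Characters the target encoding cannot represent are replaced
    by on_failure.
    """
    out = bytearray()
    for char in text:
        try:
            out += char.encode(target)
        except UnicodeEncodeError:
            out += on_failure.encode(target)
    return bytes(out)


code_point = decode_two_byte_utf8(0xC3, 0xA4)
print(code_point, hex(code_point))  # 228 0xe4

converted = to_regime("abcč\u4e2d\u6587žý", "cp1250")
print(len(converted), converted)
```

The converted result has one byte per input character, with <code>?</code> standing in for the two characters that have no place in cp1250.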

==External links==

* [http://en.wikipedia.org/wiki/Alphabets_derived_from_the_Latin Alphabets derived from the Latin] (to be moved to a better place/another page)
* [http://www.eki.ee/letter/ Letter database]: languages, character sets, names etc.
* [http://www.czyborra.com/charsets/iso8859.html Roman Czyborra's ISO 8859 Alphabet Soup]: huge amount of data about the ISO 8859 encodings (and others), character sets, history, etc.
* [http://www.jw-stumpel.nl/stestu.html Multilingual text on Linux]: A good guide on how to configure and use UTF-8 support on Linux.

[[Category:Old Content]]

Taco


Encodings and Regimes - Old Content

Revision as of 13:17, 9 August 2020
