Unicode blocks in ConTeXt

Revision as of 08:18, 25 October 2017 by Nyraghu

A Unicode block is an interval of code points that represent semantically related characters. For example, there is a Unicode block for the characters of the Devanagari script, which is used by several Indian languages. Another Unicode block contains the characters that denote mathematical operators, such as those for the union and the intersection of sets.

ConTeXt has special names for all Unicode blocks. These names can be used to specify ranges of code points in the setups of several commands.

Unicode blocks

A Unicode block is an organisational unit of the Unicode code space. The Unicode code space is the set of all code points, that is, the set of all integers from 0 to the integer whose hexadecimal representation is 10FFFF. The official list of the blocks is available at the Unicode Web site.

Every block is an interval of code points. Different blocks are disjoint from each other, so a code point belongs to at most one block. Code points that lie outside every named block are given the block value No_Block in the Unicode Character Database; with that convention, the blocks form a partition of the set of all Unicode code points. The number of code points in a block varies. Some blocks have just 16 code points, and others have thousands.
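
Because the blocks are disjoint intervals, finding the block of a code point amounts to a search over sorted start points. The following is a minimal Python sketch of that idea, not code taken from ConTeXt. The table SAMPLE_BLOCKS and the function block_of are hypothetical names, and the table holds only four real blocks for illustration; a code point not covered by the sample is reported with the Unicode Character Database default value No_Block.

```python
from bisect import bisect_right

# A small illustrative sample of (first, last, name) entries, sorted by the
# first code point. The full list is in the official Unicode block list.
SAMPLE_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0900, 0x097F, "Devanagari"),
    (0x2200, 0x22FF, "Mathematical Operators"),
    (0xAA60, 0xAA7F, "Myanmar Extended-A"),
]
STARTS = [first for first, _, _ in SAMPLE_BLOCKS]


def block_of(code_point):
    """Return the name of the sample block containing code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Index of the last block whose first code point is <= code_point.
    i = bisect_right(STARTS, code_point) - 1
    if i >= 0:
        first, last, name = SAMPLE_BLOCKS[i]
        if code_point <= last:
            return name
    # Not covered by any block in the (sample) table.
    return "No_Block"
```

For instance, block_of(0x222A) looks up the union sign, and block_of(0x0915) looks up the Devanagari letter KA. Since the blocks do not overlap, at most one interval can match, so a single binary search suffices.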

A block starts at a code point that is a multiple of 16. The number of code points in each block is also a multiple of 16. Thus, the hexadecimal representation of the first code point in a block ends in the digit 0, and that of the last code point in it ends in the digit F. For example, the Devanagari block runs from 0900 to 097F.
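
The two multiple-of-16 rules can be checked mechanically. The sketch below is an illustration only; the function name is_aligned_block is hypothetical, and the ranges used are those of the Devanagari block and of the last block of the code space.

```python
def is_aligned_block(first, last):
    """Return True if the interval [first, last] obeys both rules:
    it starts at a multiple of 16 and its size is a multiple of 16."""
    size = last - first + 1
    return size > 0 and first % 16 == 0 and size % 16 == 0


# Devanagari, and the final block of the code space (100000..10FFFF).
for first, last in [(0x0900, 0x097F), (0x100000, 0x10FFFF)]:
    # Print each range in hexadecimal with the result of the check.
    print(f"{first:04X}..{last:04X}", is_aligned_block(first, last))
```

When both rules hold, the last code point is one less than a multiple of 16, which is why its hexadecimal representation ends in F.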

The Unicode standard gives every block a unique name that describes the common semantic nature of its code points. These names are case-insensitive, and the hyphens, spaces, and underscores in them are insignificant. For example, one can refer to the block whose Unicode name is Myanmar Extended-A as myanmarextendeda, MyanmarExtendedA, or myanmar_extended_a. ConTeXt chooses the first of these alternative styles for the names of blocks, as described below.
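
The loose comparison of block names described above can be sketched in a few lines of Python. The function name loose_block_name is hypothetical, and the code is an illustration of the matching rule rather than the routine ConTeXt itself uses.

```python
def loose_block_name(name):
    """Fold a block name to lower case and drop the characters that are
    insignificant in block names: spaces, hyphens, and underscores."""
    return "".join(ch for ch in name.lower() if ch not in " -_")


variants = ["Myanmar Extended-A", "myanmarextendeda",
            "MyanmarExtendedA", "myanmar_extended_a"]
# All four spellings fold to the same key.
keys = {loose_block_name(v) for v in variants}
```

Here keys contains the single string myanmarextendeda, which is the style of name that ConTeXt uses for this block.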

ConTeXt names of Unicode blocks

ConTeXt has its own names for all the Unicode blocks. These names are defined in the source file char-ini.lua. Most of them are obtained by converting the Unicode name of the block to lower case and removing the hyphens and spaces in the name.

The list of blocks

See the article List of Unicode blocks for a table of Unicode blocks, their ConTeXt names, and links to more information about them.

Usage of the blocks in ConTeXt