Code point

Not to be confused with point code.

In character encoding terminology, a code point or code position is any of the numerical values that make up the code space.[1] Many code points represent single characters, but they can also have other meanings, such as formatting.

For example, the character encoding scheme ASCII comprises 128 code points in the range 0 to 7F (hexadecimal), Extended ASCII comprises 256 code points in the range 0 to FF, and Unicode comprises 1,114,112 code points in the range 0 to 10FFFF. The Unicode code space is divided into seventeen planes (the Basic Multilingual Plane and 16 supplementary planes), each with 65,536 (= 2¹⁶) code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112.
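The plane arithmetic above can be checked with a short Python sketch. The names PLANE_SIZE, PLANE_COUNT, CODE_SPACE_SIZE and plane_of are illustrative choices, not part of any standard API; the sketch only assumes that a plane number is the code point divided by 65,536.

```python
import sys

PLANE_SIZE = 1 << 16    # 65,536 code points per plane
PLANE_COUNT = 17        # Basic Multilingual Plane plus 16 supplementary planes
CODE_SPACE_SIZE = PLANE_COUNT * PLANE_SIZE


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains the given code point."""
    if not 0 <= code_point < CODE_SPACE_SIZE:
        raise ValueError(f"{code_point:#x} is outside the Unicode code space")
    # Each plane spans 2**16 values, so the plane is the value above bit 15.
    return code_point >> 16


# Python exposes the largest code point it supports as sys.maxunicode.
largest = sys.maxunicode
```

Under these assumptions, plane_of(0x41) gives 0 (the Basic Multilingual Plane) and plane_of(0x10FFFF) gives 16, the last supplementary plane.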

Definition

The notion of a code point is an abstraction that separates two pairs of concepts:

the number from its encoding as a sequence of bits, and
the abstract character from any particular graphical representation (glyph).

These distinctions matter because one may wish to:

encode a particular code space in different ways, or
display a character via different glyphs.
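The first distinction, between the number and its encoding as bits, can be illustrated with a minimal Python sketch. It takes one code point (U+20AC, the euro sign, chosen here only as an example) and encodes it under three Unicode encodings; the variable names are illustrative.

```python
# One abstract character, identified by a single code point.
ch = "\u20ac"            # EURO SIGN
code_point = ord(ch)     # the number, independent of any byte layout

# The same code point serialized in three different ways.
encoded = {
    name: ch.encode(name)
    for name in ("utf-8", "utf-16-be", "utf-32-be")
}

for name, data in encoded.items():
    print(f"U+{code_point:04X} in {name}: {data.hex(' ')}")
```

The code point stays 0x20AC throughout; only the byte sequences differ (three bytes in UTF-8, two in UTF-16, four in UTF-32). The second distinction, between character and glyph, is a matter of fonts and rendering and is not visible at this level.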

The concept of a code point is part of Unicode's solution to a difficult conundrum faced by character encoding developers in the 1980s. If they added more bits per character to accommodate larger character sets, that design decision would constitute an unacceptable waste of then-scarce computing resources for Latin script users (who constituted the vast majority of computer users at the time), since those extra bits would always be zero for such users. The code point avoids this problem by breaking the old idea of a direct one-to-one correspondence between characters and particular sequences of bits.

For Unicode, a code point is encoded as a sequence of one or more code units, the fixed-size groups of bits used by a given encoding. In the UCS-4 encoding, every code point is encoded as a single 4-byte (octet) binary number, while in the UTF-8 encoding, different code points are encoded as sequences of one to four bytes, forming a self-synchronizing code. See comparison of Unicode encodings for details. Code points are normally assigned to abstract characters. An abstract character is not a graphical glyph but a unit of textual data. However, code points may also be left reserved for future assignment (most of the Unicode code space is unassigned), or given other designated functions.
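As a rough illustration of the variable-length and self-synchronizing properties of UTF-8, the following Python sketch encodes four sample characters and then locates the start of a character from an arbitrary byte offset. The helper char_start is a hypothetical name introduced here; it relies only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```python
# Sample code points that need 1, 2, 3 and 4 bytes in UTF-8.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
utf8_lengths = [len(s.encode("utf-8")) for s in samples]


def char_start(data: bytes, index: int) -> int:
    """Return the index of the lead byte of the UTF-8 sequence covering data[index]."""
    # Continuation bytes look like 10xxxxxx; step back until a lead byte is found.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


data = "A\u20ac\U0001F600".encode("utf-8")   # 1 + 3 + 4 = 8 bytes
start = char_start(data, 6)                  # offset 6 lies inside the last character
```

Because lead bytes and continuation bytes are distinguishable by their top bits, a decoder that starts in the middle of the stream can find the next character boundary without rereading from the beginning.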

The distinction between a code point and the corresponding abstract character is not pronounced in Unicode, but is evident for many other encoding schemes, where numerous code pages may exist for a single code space.
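The code page situation can be sketched in Python by decoding the same 8-bit value under several legacy code pages. The three codecs used here (latin-1, cp437 and iso8859_5) are picked only as examples from Python's standard codec set; the byte value E9 is likewise an arbitrary choice.

```python
# A single value in the 8-bit code space 0 to FF.
raw = bytes([0xE9])

# The same value names a different abstract character on each code page.
decoded = {
    name: raw.decode(name)
    for name in ("latin-1", "cp437", "iso8859_5")
}

for name, ch in decoded.items():
    # Show which Unicode code point each code page assigns to the value E9.
    print(f"0xE9 under {name}: U+{ord(ch):04X}")
```

Here the value E9 corresponds to a Latin, a Greek and a Cyrillic letter respectively, which is why the code point and the character must be kept apart in such schemes.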

Notes

[1] Glossary of Unicode Terms


This article is issued from Wikipedia - version of the 10/26/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.
