Homoglyph

In orthography and typography, a homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar. The designation is also applied to sequences of characters sharing these properties.

Synoglyphs are glyphs that look different but mean the same thing. They are also known informally as display variants.

The term homograph is sometimes used synonymously with homoglyph, but in the usual linguistic sense homographs are words that are spelled the same but have different meanings, which is a property of words, not characters.

In 2008, the Unicode Consortium published its Technical Report #36^[1] on a range of issues arising from the visual similarity of characters, both within a single script and between different scripts.

A historical example of homoglyphic confusion results from the use of a 'y' to represent a 'þ' when setting older English texts in typefaces that do not contain the latter character. In modern times this has led to such phenomena as Ye olde shoppe, implying incorrectly that the word the was formerly written ye /jiː/. For further discussion, see thorn.

Typefaces containing homoglyphs are considered unsuitable for writing formulas, URLs, source code, IDs and other text where characters cannot always be differentiated without context.

0 and O; 1, l and I

Two common and important sets of homoglyphs in use today are the digit zero and the capital letter O (i.e. 0 & O); and the digit one, the lowercase letter L and the uppercase i (i.e. 1, l & I). In the days of mechanical typewriters there was very little or no visual difference between these glyphs and typists treated them interchangeably as keyboarding shortcuts. In fact, most keyboards did not even have a key for the digit "1", requiring users to type the letter "l" instead, and some also omitted 0. As these same typists transitioned in the 1970s and 1980s to being computer keyboard operators, their old keyboarding habits continued with them in their new profession, and became a source of great confusion.

Most current type designs carefully distinguish between these homoglyphs, usually by drawing the digit zero narrower and by drawing the digit one with prominent serifs. Early computer print-outs went even further and marked the zero with a slash or dot, which led to a new conflict involving the Scandinavian letter "Ø" and the Greek letter Φ (phi). The redesign of typefaces to differentiate these homoglyphs, together with the dwindling number of keyboard operators trained on mechanical typewriters, has led to a decline in these particular homoglyph errors.
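Where identifiers still have to be compared despite these look-alikes, one possible approach is to fold each set to a single representative before comparing. The following Python sketch illustrates the idea. The table and the function names fold_ambiguous and looks_alike are hypothetical, and only the two sets described above are covered.

```python
# Hypothetical folding table: map each glyph to one representative of its
# look-alike set (capital O -> 0; lowercase l and capital I -> 1).
AMBIGUOUS = str.maketrans({"O": "0", "l": "1", "I": "1"})


def fold_ambiguous(text: str) -> str:
    """Return text with O, l and I replaced by the digits they resemble."""
    return text.translate(AMBIGUOUS)


def looks_alike(a: str, b: str) -> bool:
    """True if a and b are equal once the 0/O and 1/l/I sets are folded."""
    return fold_ambiguous(a) == fold_ambiguous(b)


print(fold_ambiguous("A10-Il"))          # A10-11
print(looks_alike("A10-Il", "AlO-11"))   # True
```

Folding in this way treats the members of each set as interchangeable, so it is only appropriate where that loss of distinction is acceptable.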

Multi-letter homoglyphs

Some other combinations of letters look similar, for instance rn looks similar to m, cl looks similar to d, and vv looks similar to w.

In certain narrow-spaced fonts (such as Tahoma), placing the letter c next to a letter such as j, l or i will create a homoglyph: cj, cl and ci can resemble g, d and a respectively.
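As a rough illustration of how such sequences might be checked programmatically, the Python sketch below collapses a few of the pairs mentioned above into the single letter they resemble and then compares the results. The table LOOKALIKE_SEQUENCES and both function names are assumptions made for this example, and whether a given pair is actually confusable depends on the font.

```python
# Illustrative pairs taken from the text: a letter sequence and the single
# letter it can be mistaken for. This table is an assumption, not a standard.
LOOKALIKE_SEQUENCES = {"rn": "m", "vv": "w", "cl": "d"}


def collapse_sequences(text: str) -> str:
    """Replace each look-alike sequence with the letter it resembles."""
    for sequence, single in LOOKALIKE_SEQUENCES.items():
        text = text.replace(sequence, single)
    return text


def may_be_confused(a: str, b: str) -> bool:
    """True if two different strings collapse to the same form."""
    return a != b and collapse_sequences(a) == collapse_sequences(b)


print(collapse_sequences("modern"))        # modem
print(may_be_confused("modern", "modem"))  # True
```

A check like this is deliberately coarse and will also flag ordinary words that merely contain one of the sequences.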

When some characters are placed next to each other, seen together at a glance they give the visual impression of another, unrelated character. A more precise way of saying this is that some typographic ligatures can look similar to standalone glyphs. For example, the fi ligature (ﬁ) can look similar to A in some typefaces or fonts. This potential for confusion is sometimes an argument made against the use of ligatures.
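Ligatures that a font forms at rendering time are not visible in the underlying text, but Unicode also encodes some ligatures as single characters, such as U+FB01 for fi. A minimal Python sketch of how these encoded ligatures could be expanded before comparison is shown below, using the standard unicodedata module. The function name expand_ligatures is hypothetical, and NFKC normalization also changes other compatibility characters, not only ligatures.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Apply NFKC, which replaces compatibility characters such as
    U+FB01 (LATIN SMALL LIGATURE FI) with their plain equivalents."""
    return unicodedata.normalize("NFKC", text)


word = "\ufb01nd"  # three code points; the first is the fi ligature
expanded = expand_ligatures(word)

print(len(word), len(expanded))  # 3 4
print(expanded == "find")        # True
```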

Unicode homoglyphs

The Unicode character set contains many strongly homoglyphic characters. These present security risks in a variety of situations (addressed in UTR#36) and have recently been called to particular attention in regard to internationalized domain names. An attacker can deliberately spoof a domain name by replacing one character with its homoglyph, thus creating a second domain name, not readily distinguishable from the first, that can be exploited in phishing (see main article IDN homograph attack).

In many fonts the Greek letter 'Α', the Cyrillic letter 'А' and the Latin letter 'A' are visually identical, as are the Latin letter 'a' and the Cyrillic letter 'а' (the same applies to the Latin letters "aeopcTxy" and the Cyrillic letters "аеорсТху"). A domain name can be spoofed simply by substituting one of these forms for another in a separately registered name. There are also many examples of near-homoglyphs within the same script, such as 'í' (i with an acute accent) and 'i'; É (E-acute), Ė (E with dot above) and È (E-grave); and Í (capital I with an acute accent) and ĺ (lowercase L with acute).

When discussing this specific security issue, any two sequences of similar characters may be assessed in terms of their potential to be taken as a 'homoglyph pair' or, if the sequences clearly appear to be words, as 'pseudo-homographs' (noting again that these terms may themselves cause confusion in other contexts). In the Chinese language, many simplified Chinese characters are homoglyphs of the corresponding traditional Chinese characters.
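The Latin and Cyrillic letter sets quoted above can be used to sketch how a spoofed name might be detected: map each Cyrillic look-alike to its Latin counterpart and compare the results. The Python example below is only an illustration. The names skeleton and is_spoof_of and the domain example.test are hypothetical, and the table covers just the eight letters listed in this section, whereas a real check would need far more complete confusables data.

```python
# Cyrillic letters from the text, written as escapes so that they cannot be
# mistaken for the Latin letters they resemble.
CYRILLIC = "\u0430\u0435\u043e\u0440\u0441\u0422\u0445\u0443"
LATIN = "aeopcTxy"
TO_LATIN = str.maketrans(CYRILLIC, LATIN)


def skeleton(name: str) -> str:
    """Replace each listed Cyrillic letter with its Latin look-alike."""
    return name.translate(TO_LATIN)


def is_spoof_of(candidate: str, target: str) -> bool:
    """True if candidate differs from target only by listed homoglyphs."""
    return candidate != target and skeleton(candidate) == skeleton(target)


spoof = "ex\u0430mple.test"  # the third character is Cyrillic U+0430
print(spoof == "example.test")             # False
print(is_spoof_of(spoof, "example.test"))  # True
```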

TLD registries and Web browser designers are working to minimize the risks of homoglyphic confusion. Commonly, this is done by prohibiting names that mix characters from multiple scripts (toys-Я-us.org would be invalid, but wíkipedia.org and wikipedia.org still exist as different websites); Canada's .ca registry goes one step further by requiring names which differ only in diacritics to have the same owner and same registrar.^[2] The handling of Chinese characters varies: in .org and .info, registration of one variant renders the other unavailable to anyone, while in .biz the traditional and simplified versions of the same name are delivered as a two-domain bundle in which both point to the same domain name server.
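The two kinds of rule described above can be approximated in a few lines of Python. This is a sketch under stated assumptions and not any registry's actual procedure. The function names are hypothetical, and because the standard unicodedata module does not expose the Unicode Script property, the first word of each character's name is used here as a rough stand-in for its script.

```python
import unicodedata


def scripts_used(label: str) -> set:
    """Approximate the scripts in a label from Unicode character names,
    e.g. "LATIN SMALL LETTER A" or "CYRILLIC CAPITAL LETTER YA"."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    """True if letters from more than one script appear in the label."""
    return len(scripts_used(label)) > 1


def strip_diacritics(label: str) -> str:
    """Decompose with NFD and drop combining marks, so that names which
    differ only in diacritics compare equal."""
    decomposed = unicodedata.normalize("NFD", label)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(sorted(scripts_used("toys-\u042f-us")))              # ['CYRILLIC', 'LATIN']
print(is_mixed_script("w\u00edkipedia"))                   # False
print(strip_diacritics("w\u00edkipedia") == "wikipedia")   # True
```

The second and third results match the situation in the text: wíkipedia.org passes a mixed-script test because it is entirely Latin, and is only caught by a diacritic-insensitive comparison of the kind the .ca rule implies.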

Relevant documentation can be found both on the developers' Web sites and on an IDN Forum^[3] provided by ICANN.

References

External links

homoglyphs.net – reference table on Unicode homoglyphs to Latin characters and online tool for generating homographs from these.

This article is issued from Wikipedia - version of the 11/2/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.