Link to original content: http://unicode.org/faq/casemap_charprop.html
FAQ - Character Properties, Case Mappings and Names
Unicode Frequently Asked Questions

Character Properties, Case Mappings & Names FAQ

Character Properties

Q: Where are Unicode character properties defined?

The short answer is: in the Unicode Character Database (UCD).

Several Unicode Technical Standards and Unicode Technical Reports also define their own properties, which are listed separately. There is also the large collection of data specifically for Unified ideographs, called the “Unihan” Database, which forms a separate subset of the Unicode character properties. Its structure and contents are significantly different, so it isn't generally included when talking about the “UCD”. [AF]

Q: What is the Unicode Character Database?

The Unicode Character Database (UCD) is a collection of plain text files, updated for every release of the Unicode Standard. Those plain text files contain information about the properties of every Unicode character. All files for the most up-to-date version of the UCD are always located at https://www.unicode.org/Public/UCD/latest/ on the Unicode website. This location also includes the Unihan database.

Q: Where can I find documentation for the Unicode Character Database?

The details can be found in UAX #44, Unicode Character Database. (The Unihan database is documented in UAX #38, The Unicode Han Database (Unihan).) See also the FAQ page.

Q: Is there a database query interface for the UCD on the Unicode website?

No. The UCD consists of multiple plain text files containing raw property data. Those files are suitable for conversion to database formats, import to spreadsheets, conversion to tables, or whatever format may be appropriate for a particular implementer's needs. They are not stored in an RDBMS, and the Unicode website does not support a front end for an arbitrary database query.

There are other websites which do present simple front ends for database queries for some Unicode character properties. See, for example: https://www.fileformat.info.

The subset of character properties related to Chinese characters (CJK) is a special case. The Unicode website does have a web database query interface for those character properties. See the Unihan Database.

Q: Are there other Unicode tools for working with character properties?

Yes. The Unicode Utilities subsite also implements a front end with a number of useful utilities for querying characters. One tool allows the input of arbitrary sets of characters using the UnicodeSet format, and shows the resulting explicit list as output. See, for example: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp.

Q: Are any unassigned characters or reserved characters given default properties?

Default values are defined for all character properties. For a discussion of how this works and details about particular default values for properties, see UAX #44, Unicode Character Database.

Q: Unicode now treats the SOFT HYPHEN as format control (Cf) character when formerly it was a punctuation character (Pd). Doesn't this break ISO 8859-1 compatibility?

No. The ISO 8859-1 standard defines the SOFT HYPHEN as "[a] graphic character that is imaged by a graphic symbol identical with, or similar to, that representing hyphen" (section 6.3.3), but does not specify details of how or when it is to be displayed, nor other details of its semantics. The soft hyphen has had a long history of legacy implementation in two or more incompatible ways.

Unicode clarifies the semantics of this character for Unicode implementations, but this does not affect its usage in ISO 8859-1 implementations. Processes that convert back and forth may need to pay attention to semantic differences between the standards, just as for any other character.

In a terminal emulation environment, particularly in ISO-8859-1 contexts, one could display the SOFT HYPHEN as a hyphen in all circumstances. The change in semantics of the Unicode character does not require that implementations of terminal emulators in other environments, such as ISO 8859-1, make any change in their current behavior.

Q: Where can I find the numerical values of characters with the hexadecimal digit (Hex_Digit) property?

The Unicode Standard provides the Hex_Digit property, which specifies which characters are hexadecimal digits: 0-9, A-F, a-f, and their fullwidth equivalents. (The ASCII_Hex_Digit property specifies the intersection of the Hex_Digit property and the Basic Latin block.) There is no table in the UCD mapping the hexadecimal digit characters to their values, analogous to the Numeric_Value property. The table linked here removes this real, if trivial, gap. [JC]
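
Such a table is small enough to build directly. The following is a minimal Python sketch, not the table the FAQ links to. The names `HEX_VALUE` and `parse_hex` are made up for illustration, and the six code point ranges are those of the Hex_Digit property as described above (ASCII and fullwidth forms).

```python
# Map every Hex_Digit character to its numeric value (0-15).
# Ranges: 0-9, A-F, a-f in ASCII, plus their fullwidth counterparts.
HEX_VALUE = {}
for digit_0, upper_a, lower_a in ((0x0030, 0x0041, 0x0061),   # ASCII
                                  (0xFF10, 0xFF21, 0xFF41)):  # fullwidth
    for i in range(10):
        HEX_VALUE[chr(digit_0 + i)] = i
    for i in range(6):
        HEX_VALUE[chr(upper_a + i)] = 10 + i
        HEX_VALUE[chr(lower_a + i)] = 10 + i


def parse_hex(text):
    """Return the integer value of a string of Hex_Digit characters."""
    value = 0
    for ch in text:
        value = value * 16 + HEX_VALUE[ch]  # raises KeyError on non-hex input
    return value


print(len(HEX_VALUE))              # 44
print(parse_hex("7DA"))            # 2010
print(parse_hex("\uFF17\uFF24a"))  # fullwidth 7 and D, ASCII a: 2010
```

An explicit table is used here because the fullwidth letters are not ASCII, so a parser written only for ASCII hexadecimal input would not cover them.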

Q: How does Unicode cope with hexadecimal digits?

The hexadecimal number system, used in computing, is not that special: you can base a number system on any natural number except the number 1. The most widely used base is 10, but 2, 8, and 12 have also seen extensive use as number bases, whether in computing or archaic mathematics. Hence, it is not wise to define a particular set of digits for every number system somebody might wish to apply.

Rather, the Unicode character encoding, much like its predecessors, assumes that hexadecimal numbers are written with the ordinary (decimal) digits (representing zero through nine) and the letters A through F (representing ten through fifteen). Only context makes clear whether a string of digits is meant as a number and, if so, in which number system.

Most applications define particular syntax rules to help distinguish decimal, octal, or hexadecimal numbers from other input tokens. For example, in some programming languages, “2010” is a decimal number, “0x7DA” is a hexadecimal number, and “thisYear” is an identifier. In the absence of such syntactic hints, you could use the Hex_Digit property from the Unicode Character Database to identify hexadecimal numbers; however, a string of Hex_Digit characters, such as “bed”, is not necessarily meant to be read as a hexadecimal number.

Whenever it is important that hexadecimal numbers in a table align vertically, you should choose a fixed-pitch font for them by means of a higher-level protocol. Some fonts will also show the uppercase hexadecimal digits at the same height as the digits. Such a font is used in the Unicode code charts to give 4- and 5-digit hexadecimal numbers a nice rectangular appearance.

Q: Where are private-use characters used, and how should they be handled?

This is the topic of the Private-Use Characters FAQ, which answers many questions about the handling of private-use characters.

Case Mapping

Q: Where can I find the Unicode case mapping information?

The UnicodeData.txt file includes all of the one-to-one case mappings. Since many parsers were built with the expectation that UnicodeData.txt would have at most a single character in each case mapping field, the file SpecialCasing.txt was added to provide information on exceptional one-to-many mappings, such as the one needed for uppercasing ß (U+00DF LATIN SMALL LETTER SHARP S). In addition, CaseFolding.txt contains additional mappings used in case folding and caseless matching. For more information, see Section 5.18, Case Mappings in The Unicode Standard.

Q: What is the difference between case mapping and case folding?

Case mapping or case conversion is a process whereby strings are converted to a particular form — uppercase, lowercase, or titlecase — possibly for display to the user. Case folding is mostly used for caseless comparison of text, such as identifiers in a computer program, rather than actual text transformation. Case folding in Unicode is primarily based on the lowercase mapping, but includes additional changes to the source text to help make it language-insensitive and consistent. As a result, case-folded text should be used solely for internal processing and generally should not be stored or displayed to the end user.
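
The distinction can be seen with Python's built-in string methods, which apply the default (untailored) Unicode mappings: `upper()` and `lower()` are case mappings, while `casefold()` is case folding. This is only an illustrative sketch; the helper `caseless_equal` is a made-up name, and it ignores normalization, which a full caseless match would also need to consider.

```python
word = "Stra\u00DFe"  # "Straße"

print(word.upper())     # STRASSE
print(word.lower())     # straße
print(word.casefold())  # strasse


def caseless_equal(a, b):
    """Compare two strings by their case-folded forms (internal use only)."""
    return a.casefold() == b.casefold()


# Folding matches the two spellings; lowercasing alone does not.
print(caseless_equal("STRASSE", word))     # True
print("STRASSE".lower() == word.lower())   # False
```

The folded string "strasse" is useful as a comparison key, but it is not a form one would normally store or show to a user.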

Q: Which scripts have an uppercase and a lowercase?

The most widely used modern scripts with case are Latin, Greek, Armenian and Cyrillic. In addition there are a few historic or archaic scripts that have case. The vast majority of scripts, modern or archaic, do not have case distinctions.

Q: What is titlecase? How is it different from uppercase?

Titlecase takes its name from the case format used when forming a title, in which the initial letter in a word is capitalized and the rest are not. Titlecase is also used in forming a sentence by capitalizing the first word, and for forming proper names. The titlecase mapping in the Unicode Standard is the mapping applied to the initial character in a word.

The titlecase mapping in Unicode differs from the uppercase mapping in that a number of characters require special handling. These are chiefly ligatures and digraphs such as 'fl', 'dz', and 'lj', plus a number of polytonic Greek characters. For example, U+01C7 (LJ) maps to U+01C8 (Lj) rather than to U+01C9 (lj).
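
As a small illustration, Python's `str.title()` and `str.upper()` use the default titlecase and uppercase mappings, so the difference shows up for the digraph and ligature characters. The sample word below is arbitrary and chosen only to have a digraph in initial position.

```python
lj_small = "\u01C9"  # LATIN SMALL LETTER LJ
fl_lig = "\uFB02"    # LATIN SMALL LIGATURE FL

# Uppercase and titlecase of the digraph are two different characters.
print(lj_small.upper() == "\u01C7")  # True: LATIN CAPITAL LETTER LJ
print(lj_small.title() == "\u01C8")  # True: CAPITAL L WITH SMALL LETTER J

# The ligature expands to two letters, cased differently in each mapping.
print(fl_lig.upper())  # FL
print(fl_lig.title())  # Fl

# In a word, only the initial character receives the titlecase mapping.
word = lj_small + "ubav"
print(word.title() == "\u01C8ubav")  # True
print(word.upper() == "\u01C7UBAV")  # True
```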

Q: Does the default case mapping work for every language? What about the default case folding?

The Unicode Standard defines the default case mapping for each individual character, with each character considered in isolation. This mapping does not provide for the context in which the character appears, nor for the language-specific rules that must be applied when working in natural language text.

By contrast, case folding, which is primarily based on the lowercase mapping, is intended to be language-neutral. Since the case folding rules do not vary by language or context, this makes them unsuitable as the basis for displaying or transforming text for human consumption.

To make case mapping language sensitive, the Unicode Standard specifically allows implementations to tailor the mappings for each language, but does not provide the necessary data. The file SpecialCasing.txt is included in the Standard as a guide to a few of the more important individual character mappings needed for specific languages, notably the Greek script and the Turkic languages. However, for most language-specific mappings and tailoring, users should refer to CLDR and other resources.

Q: What is 'tailoring' and how might it affect case mapping?

Tailoring is the modification of the case mapping rules to meet the specific needs of a given language, culture, or orthography. For example, while the default uppercase mapping of “a” is “A” and the default mapping of “à” is “À”, the uppercase conversion of “je vais à Paris” in some forms of French might be “JE VAIS A PARIS”. Notice how the “à” is uppercased as "A" in this case.

Similarly, in English, one of Proust's novels is rendered in titlecase as “In Search of Lost Time”. Notice that the 'o' in 'of' is not capitalized, although the remainder of the words follow the Unicode Standard's definition of titlecase: this is an English-specific tailoring of titlecase. The original French title of this work is rendered in titlecase as “À la recherche du temps perdu”. Here, only the first word is in the default titlecase; the others follow rules specific to a particular French convention.

Q: Why isn't there an “Ij” character encoded to serve as the titlecase for U+0132 LATIN CAPITAL LIGATURE IJ and U+0133 LATIN SMALL LIGATURE IJ?

The Unicode Standard encodes these two compatibility characters to provide support for roundtrip conversion of the Dutch letter 'ij' in certain very rare legacy (non-Unicode) character encodings. It is strongly preferred (and far more common) to use the two character ASCII sequence 'ij' to represent this letter instead.

In Dutch, the letter 'ij' behaves like the other single letters, so the correct titlecase mapping of U+0133 (ij) is U+0132 (so a word such as “ijsje” titlecases as “IJsje”). That is, the titlecase mapping for both of these characters is U+0132 and no additional character is needed.
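
A short Python sketch of both cases follows. The compatibility character titlecases to U+0132 under the default mapping, while the preferred two-letter ASCII sequence needs a Dutch tailoring, which the default mapping does not supply. The function `dutch_title` is a hypothetical, deliberately simplistic tailoring written for this example only.

```python
ij_small = "\u0133"  # LATIN SMALL LIGATURE IJ

# Default titlecase of the compatibility character is U+0132.
print((ij_small + "sje").title() == "\u0132sje")  # True

# The ASCII sequence "ij" gets no special treatment by default.
print("ijsje".title())  # Ijsje


def dutch_title(word):
    """Titlecase one word, treating an initial 'ij' as a single letter."""
    if word[:2].lower() == "ij":
        return "IJ" + word[2:].lower()
    return word[:1].upper() + word[1:].lower()


print(dutch_title("ijsje"))  # IJsje
print(dutch_title("huis"))   # Huis
```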

Q: Are case mappings for words or text runs reversible?

Case mapping loses information and thus does not allow for a round trip. For example, when the string “Mark” is lowercased, the original form cannot be recovered; it might have been “mark” or “MARK” instead. Some strings contain contextual case distinctions that are not preserved by case mapping. Consider the English word “anglo-American”, the Italian word “vederLa”, or the German words “haben” and “Haben”. Once you uppercase, lowercase or titlecase these strings, you can't recover the original just by performing the reverse operation.

Q: Are case mappings for individual characters reversible?

Many of the individual character case mappings cannot be reversed. For example:
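
As a hedged illustration using Python's default mappings (these samples are chosen for this sketch and are not the FAQ's own list), each character below is mapped to the opposite case and back, and none returns to its starting point. The helper name `round_trip` is made up.

```python
def round_trip(ch):
    """Map a character to the opposite case and back again."""
    if ch.islower():
        return ch.upper().lower()
    return ch.lower().upper()


samples = {
    "LATIN SMALL LETTER LONG S": "\u017F",        # -> S  -> s
    "LATIN SMALL LETTER SHARP S": "\u00DF",       # -> SS -> ss
    "LATIN SMALL LETTER DOTLESS I": "\u0131",     # -> I  -> i
    "LATIN CAPITAL LETTER SHARP S": "\u1E9E",     # -> sharp s -> SS
    "KELVIN SIGN": "\u212A",                      # -> k  -> K (U+004B)
}

for name, ch in samples.items():
    back = round_trip(ch)
    print(f"{name}: {ch!r} -> {back!r}, same: {back == ch}")
```

Every line printed ends with `same: False`.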

Q: Does uppercasing of a string eliminate all of the lowercase letters in it?

No. Some letters (notably those in the IPA block) have no matching case equivalent. As a result, uppercasing a string may not eliminate all of the lowercase letters in it.
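
A quick way to see this is to uppercase a string and list whatever is still lowercase. The sketch below uses two letters that have no uppercase mapping in the Unicode data shipped with current Python versions (U+0278 LATIN SMALL LETTER PHI from the IPA Extensions block and U+0138 LATIN SMALL LETTER KRA). That is an assumption about the Unicode version in use, since an uppercase partner could be encoded later. The helper name `lowercase_survivors` is made up.

```python
def lowercase_survivors(text):
    """Return the characters that are still lowercase after uppercasing."""
    return [ch for ch in text.upper() if ch.islower()]


sample = "a\u0278b\u0138"
print(sample.upper())               # A and B are uppercased, the rest is not
print(lowercase_survivors(sample))  # ['ɸ', 'ĸ']
```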

Q: Why is there no unique uppercase character for ſ — U+017F LATIN SMALL LETTER LONG S (and about one hundred other characters)?

There are over 100 lowercase letters in the Unicode Standard that have no direct uppercase equivalent. For example, the uppercase form for long s is an ordinary capital S. Another example would be U+0237 LATIN SMALL LETTER DOTLESS J: the capital J is already dotless, so an extra letter isn't needed as an uppercase mapping. Some of the other characters with no uppercase equivalent are compatibility characters. Many of these, such as 'fl' (U+FB02 LATIN SMALL LIGATURE FL), decompose to two or more characters when casing is applied. Finally, others are characters that are only used in lowercase, such as many characters used for IPA and other phonetic systems. Text in IPA, like that in many other phonetic systems, should never be case converted, even those IPA characters that do have an uppercase equivalent.

Q: Why aren't there extra characters encoded to support locale-independent casing for Turkish?

The Turkish language, like other Turkic languages, distinguishes a dotted letter 'i' from a dotless letter 'ı' (U+0131 LATIN SMALL LETTER DOTLESS I). In these languages, each has an equivalent uppercase mapping: U+0131 maps to the ordinary letter 'I', while 'i' maps to U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE).

Historically, users generally did not distinguish between the ASCII letters and their Turkish equivalents, so legacy character encodings, such as ISO 8859-9, which support the Turkic languages, did not separately encode characters to serve as the basis for locale-independent casing. These character encodings are often used for both Turkish and non-Turkish text. Transcoding this data to Unicode would be intolerably difficult if users had to somehow identify which 0x49 characters (for example) were ordinary “I” and which were LATIN CAPITAL LETTER DOTLESS I. In addition, because users are not used to making the distinction, it is unlikely that they would input the “correct” additional letters, even if they existed.
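
Because the same code points serve both conventions, Turkic casing has to be supplied as a tailoring. The Python sketch below contrasts the default mappings with a hand-written tailoring. `turkish_upper` and `turkish_lower` are hypothetical helpers for this example, not a library API, and they ignore details such as decomposed sequences (an "I" followed by U+0307 COMBINING DOT ABOVE).

```python
def turkish_upper(text):
    """Uppercase with the Turkic i/I pairs: i -> U+0130, U+0131 -> I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


def turkish_lower(text):
    """Lowercase with the Turkic i/I pairs: U+0130 -> i, I -> U+0131."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


city = "diyarbak\u0131r"

# Default mapping: dotted and dotless i both become plain I.
print(city.upper())  # DIYARBAKIR

# Tailored mapping keeps the two letters distinct in both directions.
upper = turkish_upper(city)
print(upper == "D\u0130YARBAKIR")    # True
print(turkish_lower(upper) == city)  # True

# Default lowercasing of U+0130 yields "i" plus a combining dot above.
print("\u0130".lower() == "i\u0307")  # True
```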

Q: Why does ß (U+00DF LATIN SMALL LETTER SHARP S) not uppercase to U+1E9E LATIN CAPITAL LETTER SHARP S by default?

In standard German orthography, the sharp s (“ß”) used to be exclusively uppercased to a sequence of two capital S characters. This longstanding practice is reflected in the default case mappings in Unicode. A capital form of ß is sometimes preferred for typographic reasons or to avoid ambiguity, such as in uppercase names as found in passports. It is encoded in the Unicode Standard as U+1E9E. While this character is not widely used, it is now recognized in the official orthography as an optional uppercase form of ß in addition to “SS”. Because it is only an optional alternative, the original mapping to “SS” is retained in the Unicode character properties.

Q: Why does the Greek letter sigma require special handling?

Near the end of SpecialCasing.txt, there are two commented-out lines pertaining to the Greek letter sigma. At first glance, they may look a bit odd:

# 03C3; 03C2; 03A3; 03A3; FINAL; # GREEK SMALL LETTER SIGMA
# 03C2; 03C3; 03A3; 03A3; NON_FINAL; # GREEK SMALL LETTER FINAL SIGMA

Both of these lines refer to conditional case mappings (column 5). In normal Greek text, a U+03C3 (non-final sigma) should be written as U+03C2 (final sigma) if it is at the end of a word, and a U+03C2 (final sigma) should be written as a U+03C3 (non-final sigma) if it is not at the end of a word. This is what these two lines would mean if they were uncommented. The reason that they are commented out is that the SpecialCasing file is not intended to normalize the appearance of a lowercase sigma.
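
The practical effect can be observed in an implementation that applies the final-sigma condition when lowercasing. CPython's `str.lower()` is one such implementation, which is an observation about that interpreter rather than something the FAQ states. In this sketch a capital sigma is lowercased contextually, while an existing lowercase sigma of either form is left exactly as written.

```python
SIGMA = "\u03A3"        # GREEK CAPITAL LETTER SIGMA
small_sigma = "\u03C3"  # GREEK SMALL LETTER SIGMA
final_sigma = "\u03C2"  # GREEK SMALL LETTER FINAL SIGMA

# A capital sigma at the end of a word lowercases to the final form.
word = "\u039F\u0394\u039F" + SIGMA
print(word.lower() == "\u03BF\u03B4\u03BF" + final_sigma)  # True

# A capital sigma standing alone lowercases to the non-final form.
print(SIGMA.lower() == small_sigma)  # True

# Lowercase sigmas are not normalized, even in the "wrong" position.
odd_end = "\u03B1" + small_sigma    # non-final form at the end of a word
odd_start = final_sigma + "\u03B1"  # final form at the start of a word
print(odd_end.lower() == odd_end)      # True
print(odd_start.lower() == odd_start)  # True

# Both lowercase forms uppercase to the same capital sigma.
print(small_sigma.upper() == final_sigma.upper() == SIGMA)  # True
```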

Q: Is case folding stable between Unicode versions?

Any string that is case-folded according to the rules in Version 5.0 or later is guaranteed to still be case-folded according to the rules for any subsequent version of the Unicode Standard. For the formal statement of that stability guarantee, see the Case Folding Stability Policy.

Q: Does case folding stability prevent the encoding of new case pairs?

For a newly encoded bicameral (cased) script or for completely new case pairs, there are no restrictions that result from case folding stability. Because such scripts or characters had not yet been encoded in earlier versions of the standard, they also had no case folding yet defined for them. New scripts or completely new case pairs can be added freely in future versions.

Q: Can a case pair be added if one of the pair is already encoded?

The usual situation is to add a new uppercase letter intended to have a case mapping to an existing lowercase letter that had no case pair before. Because case folding is primarily based on the lowercase mapping, adding a new uppercase letter like this is fine—the case folding will be specified as mapping to the existing lowercase letter, and case folding stays stable.

Q: What happens if the uppercase letter is the one that is already encoded?

That situation is more complicated. When the existing encoded letter is an uppercase letter and the proposal is to encode a new lowercase letter case pair for it, that is normally disallowed. The case folding for the existing uppercase letter would change, and that is blocked by the requirement for case folding stability. In exceptional situations, if a lowercase letter must be added, it would need to be case-folded to the existing uppercase letter, rather than changing the case folding for that existing letter. Such an exceptional situation did, in fact, apply for the addition of Cherokee lowercase syllables in Version 8.0. Cherokee case folding rules were specified to map to the old uppercase syllables, to preserve case folding stability for them.
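
The Cherokee exception is visible in any implementation whose data is at Unicode 8.0 or later. The following Python sketch assumes such a version (Python 3.5 or newer) and relies on `str.casefold()` implementing the default case folding. It shows that Cherokee folds toward the older uppercase syllables, the opposite direction from Latin.

```python
cherokee_a = "\u13A0"        # CHEROKEE LETTER A (the original, now uppercase)
cherokee_small_a = "\uAB70"  # CHEROKEE SMALL LETTER A (added in Version 8.0)

# The ordinary case mapping pairs the two letters.
print(cherokee_a.lower() == cherokee_small_a)  # True
print(cherokee_small_a.upper() == cherokee_a)  # True

# Case folding maps the new lowercase letter to the existing uppercase
# one, so the folding of the old letter did not have to change.
print(cherokee_small_a.casefold() == cherokee_a)  # True
print(cherokee_a.casefold() == cherokee_a)        # True

# For comparison, Latin folds toward lowercase.
print("A".casefold() == "a")  # True
```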

Q: What about the situation where both characters are already encoded, but should be case-folded together?

Changing an existing character to case-fold to a different character is prohibited, for stability, so this cannot be done.

Q: Why are U+2126 OHM SIGN and U+00B5 MICRO SIGN cased like omega and mu?

When the text "Resistance is 950μΩ" is subjected to some of the CSS text transforms, it is displayed as:

On the face of it, that seems undesirable, since it changes the meaning of the text. This raises the question why U+2126 OHM SIGN and U+00B5 MICRO SIGN were not classified as "Symbol, Other", and not assigned case equivalents.

The Unicode Standard does not guarantee that transforming text (with the exception of normalization) will not affect its meaning. ASCII letters used for SI units are not exempt from casing, and also change meaning with case: 1ms = 1 millisecond, whereas 1MS = 1 megasiemens. More generally, applying case mappings to technical text rather than “normal language” is a mistake, and cannot be fixed in the encoding nor via properties. Further, case mappings are lossy even on normal text (lowercasing iPod or McGowan; or any noun in German; uppercasing Irish).

Not all letters used for SI units have duplicates, which is why the few that were encoded separately have been made canonically equivalent to standard letters. In particular, μ and Ω normalize to their standard Greek counterparts, which means that treating them differently is not possible. This way, all SI units are treated the same.
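
Both effects can be reproduced with Python's default case mappings and the standard `unicodedata` module. This is a sketch of the default Unicode behavior, not of any particular CSS implementation. One detail visible in the data: OHM SIGN has a canonical decomposition and so changes under NFC, whereas MICRO SIGN has a compatibility decomposition and changes only under NFKC (or NFKD).

```python
import unicodedata

MICRO = "\u00B5"  # MICRO SIGN
OHM = "\u2126"    # OHM SIGN
text = "Resistance is 950" + MICRO + OHM

# Uppercasing turns the micro sign into GREEK CAPITAL LETTER MU.
print(text.upper() == "RESISTANCE IS 950\u039C" + OHM)  # True

# Lowercasing turns the ohm sign into GREEK SMALL LETTER OMEGA.
print(text.lower() == "resistance is 950" + MICRO + "\u03C9")  # True

# Normalization replaces the signs with the ordinary Greek letters.
print(unicodedata.normalize("NFC", OHM) == "\u03A9")     # True
print(unicodedata.normalize("NFC", MICRO) == MICRO)      # True (unchanged)
print(unicodedata.normalize("NFKC", MICRO) == "\u03BC")  # True
```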

Character Names

Q: What are character names for and why are some characters named in unusual ways?

Character names are defined so that a mnemonic string can be used to uniquely identify a character, rather than representing it with just a hexadecimal code. Characters can have multiple uses or multiple common names, so a single identifier cannot provide a natural name for all users and all purposes. Sometimes, names are deliberately chosen to describe the appearance of a character, rather than its meaning or function, because the character is used in many competing contexts. Such use of descriptive names is particularly common for symbols.

Because character names are identifiers, there are some additional restrictions and conventions, which govern the way they are assigned and provide some uniformity in naming. In many instances, descriptive comments and informative aliases are added to the listing of the character names in the code charts to make it easier for users to select the right character for the right purpose.

Q: Can the name of a character be changed to better reflect the way it is used?

Once a character name has been given, it cannot be changed. Because names are identifiers, for which stability is very important, the Unicode Character Encoding Stability Policy explicitly prevents character names from being changed. Character names, however, can be annotated in the code charts. For example, U+0674 ARABIC LETTER HIGH HAMZA is annotated as being used for Kazakh, not Arabic. For outright erroneous names, a formal alias may be provided (in NameAliases.txt), that gives a corrected alias for the character. This alias can be used anywhere a character name can be used, but it does not replace the actual name. In limited cases, a widely used alternate name or a common abbreviation may likewise be given as an alias.
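
Python's standard `unicodedata` module illustrates the relationship between a name and a formal alias: `name()` returns the immutable character name, while `lookup()` also accepts aliases from NameAliases.txt (in Python 3.3 and later). The example character U+01A2 is chosen for this sketch and is not mentioned in the FAQ. Its name says OI, and its formal alias corrects that to GHA.

```python
import unicodedata

ch = "\u01A2"

# The original name is permanent, even though it is considered a misnomer.
print(unicodedata.name(ch))  # LATIN CAPITAL LETTER OI

# The corrected alias identifies the same character but is not its name.
print(unicodedata.lookup("LATIN CAPITAL LETTER GHA") == ch)  # True
print(unicodedata.lookup("LATIN CAPITAL LETTER OI") == ch)   # True

# A name kept as-is and clarified only by chart annotations.
print(unicodedata.name("\u0674"))  # ARABIC LETTER HIGH HAMZA
```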

Q: Should I be concerned if the name of a script, block or character doesn't reflect the way it is used?

Script, block, and character names are used by Unicode solely as identifiers; that is, their purpose is to distinguish entities and not to describe them. Changes to these names create extensive interoperability and backward compatibility issues. There is usually a relationship between the name of a block, the name of the script that uses characters in that block, and the names of the characters themselves in order to ease identification.

The use of a particular name as an identifier for a script in the Unicode Standard does not imply an endorsement of that name as the best alternative for general use. The Unicode Consortium does not make recommendations on how to refer to scripts in other contexts.

Q: How are script and block names related to character names?

Many character names contain a script designator. For example, many characters in the Telugu script contain the word "TELUGU" in the first part of their names. This script designator is based on the name of the script, in this case "Telugu". For consistency, the script name is also reflected in block names, whenever blocks contain characters primarily of one script.

Q: What are the script names in the Unicode Standard based on?

In nearly all cases, the script names are based on common English usage. When there are important alternative names for scripts, they are often provided as annotations in the code charts and documentation. For example, the New Tai Lue script is referred to in China as Xishuang Banna Dai, which is listed as an alternative name in the code charts. The local name for a script may differ from English usage. Translated versions of the character names list would use translations of the script names and designators and follow local usage.

Q: Can I determine the script of a character by the character or block name?

No, not at all. The character names and block names are not reliable indicators of the script of a character. The Script property should be used instead to determine the script of any particular character. For example, as of Unicode 6.0 there were the following mismatches between Script property value and character or block names for Latin and Greek:

For more information, see UAX #24: Unicode Script Property.

Q: Are there any tools available to convert character values to character names, or to tell me the script of a character?

Yes, there are several such tools listed on the Online Tools page of the Unicode site. Here are a few you might like to try.

Web based lookup:

Downloadable application code:

A simple standard Perl program may be what you want, for example to view the name of character U+1234:

      $ perl -e 'use charnames (); print charnames::viacode(0x1234), "\n"'

See the Online Tools page for more links.

Q: The character name alias for the control character U+0082 is BREAK PERMITTED HERE. Does that mean I have to interpret that control character in that way?

The Unicode Standard does not define U+0082 to mean “BREAK PERMITTED HERE”. Formally this character is simply one of 65 control codes, one which in ISO 6429 has the name and meaning of “BREAK PERMITTED HERE”. Implementers of the Unicode Standard are not required to interpret the character U+0082 in accordance with ISO 6429 (or to interpret it at all).

The standard does assign particular properties and semantics for certain controls commonly used in text files including tab, carriage return, line feed, form feed, and next line. However, it does not give the majority of control codes any semantics at all; that is left to a higher-level protocol.

The character names for control characters are actually undefined; however, name aliases such as “BREAK PERMITTED HERE” have been defined. These aliases are based on ISO 6429, and can be used to identify specific controls, for example in regular expressions. For other control characters see https://www.unicode.org/charts/PDF/U0080.pdf.
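
As a brief sketch with Python's standard `unicodedata` module: a control character has no name to return, yet its alias can still be used to look it up. The fallback string passed to `name()` is arbitrary. The final line counts the characters with General_Category Cc across the whole code space.

```python
import unicodedata

bph = "\u0082"

# Control characters have General_Category Cc and no character name.
print(unicodedata.category(bph))           # Cc
print(unicodedata.name(bph, "<no name>"))  # <no name>

# The alias still identifies the character; it implies no behavior.
print(unicodedata.lookup("BREAK PERMITTED HERE") == bph)  # True

# Count all control codes.
controls = sum(unicodedata.category(chr(cp)) == "Cc"
               for cp in range(0x110000))
print(controls)  # 65
```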

Q: Where can I find formal definitions of the terms used in character names? In particular designations like “turned”, “inverse”, “inverted”, “reversed”, “rotated”.

These terms are basically typographical rather than Unicode-specific.

A turned character is one that has been rotated 180 degrees around its center. A turned “e” winds up with the opening in the upper left portion. U+0259 LATIN SMALL LETTER SCHWA is a turned “e”.

An inverted character has been flipped along the horizontal axis. An inverted “e” winds up with the opening in the upper right portion. There is no Unicode character representing an inverted “e”.

A reversed character has been flipped along the vertical axis. A reversed “e” winds up with the opening in the lower left portion. U+0258 LATIN SMALL LETTER REVERSED E is a reversed “e”.

A rotated character has been rotated 90 degrees, but one cannot tell which way without looking at the glyph. U+213A ROTATED CAPITAL Q is a “Q” that has been rotated counterclockwise.

Inverse means that the white parts of the glyph are made black, and vice versa. An inverse “e” looks like a normal “e” but is white on a black background. There is no Unicode character representing an inverse “e”. [JC]

Q: Why is the hacek accent called “caron” in Unicode?

Nobody knows.

Legend has it that the term was first spotted in one of the 'giant books' from the 1930s at Mergenthaler Linotype Company in Brooklyn, NY, but no one has been able to confirm that.

More accurate reports trace the term back to the mid 1980s where we do have documented sightings of “caron” in publications such as:

Unicode and the ISO 8859 series of standards just carried the tradition along.

In an article published in 2001, “Orthographic diacritics and multilingual computing”, J.C. Wells, a linguist at University College London, writes:

“The term ‘caron’, however, is wrapped in mystery. Incredibly, it seems to appear in no current dictionary of English, not even the OED.”

Whoever the originators were, we suspect that they have probably taken their secrets to the grave by now.