I18N/CanonicalNormalization

From W3C Wiki


Statement of Problem

Various computer languages including Web languages such as HTML, CSS and JavaScript use author-chosen identifiers that computers check for equality and that have a textual interpretation for the benefit of human authors.

Traditionally, the allowed character repertoire in such identifiers has been the repertoire of US-ASCII which works when the author uses English words as identifiers.

Principle 1: Today, as a matter of principle, technology shouldn't discriminate against any natural language, so words from natural languages other than English should be usable as author-chosen computer language identifiers at least to the extent English words can be used. A necessary step for achieving this is allowing arbitrary Unicode strings (excluding token delimiters) as identifiers.

Example of limitation of scope 1: Usually, English compound nouns cannot be used in their correct written form when space is treated as a token separator. This is usually worked around by replacing spaces with hyphens or underscores or by using camelCase. Since this issue exists with English, it is not discriminatory to disallow spaces in e.g. French compound nouns as well. However, not all languages need to be inconvenienced just because English is, so it's OK for German compound nouns to work in their normal written form.

Example of limitation of scope 2: Changing a letter in a word from lower to upper case in English is generally not considered to make the word a different word. Yet, identifiers are often case-sensitive. When case-insensitive equivalence is denied to English, it is not discriminatory to deny it for all languages.

People who are literate in English can usually use a keyboard to re-type an identifier based on an English word when they see one on screen (possibly previously typed by another person on a different system) so that the re-typed identifier is considered equal to the original identifier by the computer. That is, it is safe to re-type a mnemonic identifier--copying and pasting existing text is not required.

Principle 2: Similarly, it should be possible to re-type an identifier based on a word from another natural language and have it match a previously typed (possibly by another person on a different system) instance of the same identifier as far as the computer is concerned.

Example of limitation of scope 3: Scientific units are not part of the natural vocabulary of any human language per se. Therefore, it is not discriminatory against any natural language if Principle 2 does not extend to the ohm sign or the ångström sign.

Example of limitation of scope 4: Some Apple keyboard layouts make it possible to type presentation forms for fi and fl ligatures. Doing so would foil easy re-typing from a visual prompt even for English. Since the problem of presentation forms theoretically exists for English, it is not discriminatory to expect languages to avoid presentation forms in order to avoid problems with Principle 2.

Example of limitation of scope 5: Due to the nature of identifiers as convenient mnemonic expressions, it is reasonable to expect that whatever language the identifiers are composed in they will typically involve keyboard (or keyboard-like) input as opposed to more difficult and tedious character palette input methods. While some canonically equivalent combining characters can be entered in different ways through character palettes, keyboards limit the possible characters to a manageable subset and that subset can potentially (but perhaps not always) avoid non-normalized input (if normative criteria exist for such keyboard management).

Caveat: However, since web authoring is inherently a collaborative and international process, it is possible that two different authors might have different keyboards to enter identifiers. Therefore one author may use identifiers composed entirely of characters entered from that author's keyboard, while a second author will need to turn to a character palette to produce matching identifiers. Such problems can be easy to solve once they are recognized, but the esoteric knowledge needed to diagnose them can lead to countless lost hours, repeated in one situation after another.
 Caveat for caveat: If a foreign-language identifier is not easily typeable with the keyboard-based input method a person is using, it is usually easier to copy and paste an existing digital instance of the identifier than to hunt for the characters in the character palette, so the character palette case isn't central to the issue.

In some languages, the written form of words includes multiple combining characters (e.g. diacritical marks) stacked onto one base character, and the combining marks don't have any intrinsic or visually distinct order of input. (EXAMPLE NEEDED.)

Other times there may be an established order of input but different systems differ in the conventions of what code points are used. For example, LATIN SMALL LETTER A WITH BREVE AND GRAVE ằ (used in Vietnamese) could be represented as one code point (U+1EB1; used on Mac OS X) or as multiple code points detaching one or both diacritical marks as distinct code points. For example, a plausible representation used traditionally on Windows is LATIN SMALL LETTER A WITH BREVE (U+0103) followed by COMBINING GRAVE ACCENT (U+0300).

(Note about normalization forms: In Unicode, the most pre-composed form, Latin small letter a with breve and grave as one character in the example above, is called Normalization Form C or NFC. The most decomposed form with the combining marks in a particular order is called Normalization Form D or NFD. In this case, the NFD form would be LATIN SMALL LETTER A (U+0061), followed by COMBINING BREVE (U+0306), followed by COMBINING GRAVE ACCENT (U+0300). LATIN SMALL LETTER A WITH BREVE (U+0103) followed by COMBINING GRAVE ACCENT (U+0300) is not in a Unicode normalization form.)
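
The three representations in the note above can be inspected with a short Python sketch. It relies only on the standard unicodedata module. The helper name code_points is an illustrative assumption, not something defined by this page or by Unicode.

```python
import unicodedata

def code_points(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

precomposed = "\u1EB1"        # a with breve and grave as one code point
mixed = "\u0103\u0300"        # a with breve, then combining grave accent
decomposed = "a\u0306\u0300"  # a, then combining breve, then combining grave accent

for label, s in [("precomposed", precomposed),
                 ("mixed", mixed),
                 ("decomposed", decomposed)]:
    nfc = unicodedata.normalize("NFC", s)
    nfd = unicodedata.normalize("NFD", s)
    # Every input should yield the same NFC and the same NFD result.
    print(label, code_points(s), "NFC:", code_points(nfc), "NFD:", code_points(nfd))
```

All three inputs are canonically equivalent, so they share one NFC form and one NFD form. The mixed sequence is the only one that is in neither form as typed.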

Problem: If the sequence of Unicode code points depends on the order of keystrokes (when a visual prompt suggests no single obvious keying order) or differs between systems, comparing identifiers thus typed code point for code point fails. (CITATION NEEDED for the extent of existence of input methods that leak the keying order into produced combining marks.)
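
A minimal Python sketch of the failure, assuming a hypothetical table of declared identifiers and the Vietnamese word "bằng" as the identifier. Both the table and the choice of word are illustrative assumptions.

```python
# Hypothetical table of identifiers, e.g. class names declared in a style sheet.
# The key was typed on a system that produces the precomposed letter (U+1EB1).
declared = {"b\u1EB1ng": "rule declared by the first author"}

# The same visible word re-typed on a system that produces
# a with breve (U+0103) followed by a combining grave accent (U+0300).
retyped = "b\u0103\u0300ng"

# Plain code point comparison: the lookup misses although both
# strings look identical on screen.
found = retyped in declared
print("match by code points:", found)
print("lengths:", len("b\u1EB1ng"), len(retyped))
```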

(For a Unicode normalization-oriented problem statement, see Normalization Issues).

Possible solutions:

Require text input methods to produce normalized text

State that authors producing content for the Web should use text input methods that always use one normalization. This solution has already been deployed in European, American (South and North) and Korean contexts in which cases the normalization form is NFC.

Advantages:

  • It does not require changes to Web standards or software that consumes Web documents. (CITATION NEEDED)
  • Infrastructure already in place due to Korean and European requirements. (CITATION NEEDED)
  • Systems will behave the same (i.e., failing to match canonically equivalent strings) independently of their level of Unicode support.
  • A Web authoring team can address the problem by upgrading text input methods within the team and does not need to wait for all consuming software to be upgraded.
  • The solution also solves the same problem for non-Web computer languages.
  • Only input methods that don't already normalize need to be upgraded.

Disadvantages:

  • Requires input method developers for various languages to agree to take advantage of normalization features of system-provided input method infrastructure.
  • Requires deployment of upgraded input methods.
  • Text input methods may be seen as being outside the normative reach of the W3C: requirement by the Unicode Consortium needed.

Require authoring applications to normalize text

State that authors producing content for the Web should use authoring applications that always use one normalization. The established normalization form in European, American (South and North) and Korean contexts is NFC, although it is produced by the text input method--not by the editing application--in those contexts currently.

Advantages:

  • It does not require changes to Web standards or software that consumes Web documents. (CITATION NEEDED)
  • Systems will behave the same (i.e., failing to match canonically equivalent strings) independently of their level of Unicode support.
  • A Web authoring team can address the problem by upgrading authoring applications within the team and does not need to wait for all consuming software to be upgraded.
  • When the authoring applications are general-purpose text editors, the solution also solves the same problem for non-Web computer languages edited with the same editors.

Disadvantages:

  • Requires the developers of all the authoring applications to act.
  • Requires deployment of upgraded editing applications.
  • Could change parts of a text file that the author thought (s)he wasn't editing.
  • Text editors may be seen as being outside the normative reach of the W3C: requirement from the Unicode Consortium needed.

Require consuming software to normalize the input stream before parsing

Require a normalization pass between byte-to-character decoding and parsing (i.e. before the expansion of numeric character escapes). The preferred normalization would need to be defined (NFC is already widely used for European, American and Korean content and scripts are likely to depend on this).
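
The pipeline order can be sketched in Python as follows. This is a rough illustration only: decode_and_normalize is a hypothetical name, UTF-8 is assumed as the source encoding, and the standard library's html.unescape stands in for the escape expansion that a real parser would perform.

```python
import html
import unicodedata

def decode_and_normalize(raw: bytes) -> str:
    """Decode bytes as UTF-8, then normalize to NFC before any parsing."""
    return unicodedata.normalize("NFC", raw.decode("utf-8"))

# 'a' followed by a literal combining breve and a literal combining grave accent.
literal = "a\u0306\u0300".encode("utf-8")
# 'a' followed by the same two marks written as numeric character references.
escaped = b"a&#x306;&#x300;"

# Escapes are expanded only after the normalization pass has run.
parsed_literal = html.unescape(decode_and_normalize(literal))
parsed_escaped = html.unescape(decode_and_normalize(escaped))

print(parsed_literal == "\u1EB1")        # literal marks were composed
print(parsed_escaped == "a\u0306\u0300")  # escaped marks were left as written
print(parsed_literal == parsed_escaped)
```

Because normalization runs before escape expansion, the literal and the escaped spelling of the same text end up as different code point sequences. That is the divergence listed among the disadvantages of this approach.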

Advantages:

  • Requires changes to software and specifications at a very small number of places. (CITATION NEEDED)

Disadvantages:

  • Requires all consuming software to be upgraded.
  • Things that are currently treated the same might suddenly be treated differently, such as characters occurring literally versus as escape sequences, without the author being aware of or having any influence over it (if escapes were not also normalized).
  • Client-side script expecting data to appear as supplied by author might fail.
  • Streaming normalizers not readily available off the shelf (CITATION NEEDED).
  • Performance penalty of normalization.
  • Doesn't solve the problem for non-Web computer languages.

Specify normalizing character encodings and have authors use them

For example, a character encoding "utf-8-nfc" could be defined to be identical to utf-8 except that utf-8-nfc decoders produce NFC text. The effect would be the same as if clients normalized between decoding and parsing (the previous solution).
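
As a sketch of how a consumer might map such a label, Python's codec registry can be used to register a decoder under the name "utf-8-nfc". The label comes from the text above, but the functions _nfc_decode and _search are illustrative assumptions. The sketch handles whole buffers only and makes no attempt at streaming normalization.

```python
import codecs
import unicodedata

def _nfc_decode(data, errors="strict"):
    """Decode UTF-8 bytes, then normalize the decoded text to NFC."""
    text, consumed = codecs.utf_8_decode(data, errors, True)
    return unicodedata.normalize("NFC", text), consumed

def _search(name):
    # Python passes the label lower-cased; newer versions also turn
    # hyphens into underscores, so both spellings are accepted here.
    if name.replace("-", "_") == "utf_8_nfc":
        return codecs.CodecInfo(
            name="utf-8-nfc",
            encode=codecs.utf_8_encode,  # encoding is plain UTF-8
            decode=_nfc_decode,
        )
    return None

codecs.register(_search)

raw = "a\u0306\u0300".encode("utf-8")  # decomposed input bytes
text = raw.decode("utf-8-nfc")         # decoded and composed in one step
print([f"U+{ord(ch):04X}" for ch in text])
```

Content labeled plain "utf-8" is unaffected by this registration, which matches the opt-in nature of the proposal.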

Advantages:

  • No changes to existing specifications are required. Software only has to map the label "utf-8-nfc" to UTF-8 decoding and NFC normalization. Authors would be aware of and have control over the process. Authors who do not wish normalization at this level to occur are unaffected.

Disadvantages:

  • Requires all consuming software to be upgraded.
  • Legacy clients may reject content in an unsupported encoding, and a BOM may be required so that legacy clients that don't reject it properly recognize the document as UTF-8 encoded. (Though to address this disadvantage a web server output filter could be used to turn "utf-8-nfc" into normalized utf-8 if "utf-8-nfc" is unsupported).
  • Dilutes the meaning of the BOM as an unambiguous encoding signature (when CESU-8 is not supported as it isn't on the Web platform).
  • Treats normalization of combining characters as if it isn't an integral part of the Unicode Standard, but only applies to newly introduced transform formats (Why is this a disadvantage?)
  • Streaming normalizers not readily available off the shelf (CITATION NEEDED).
  • Performance penalty of normalization every time content opting in is decoded as opposed to doing it once upon content creation.
  • Doesn't solve the problem for non-Web computer languages that don't use IANA encoding infrastructure.

Require all text in a consistent normalization on the level of document conformance

Require that all data structures representing Web content be in a consistent normalization on the level of document conformance. I.e. flag non-normalized text in a validator but don't change other consuming software.
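
A validator check of this kind could look roughly like the following Python sketch. It assumes NFC as the required form and Python 3.8 or later for unicodedata.is_normalized. The function name find_non_nfc_lines and the sample markup are hypothetical.

```python
import unicodedata

def find_non_nfc_lines(text):
    """Return (line_number, line) pairs for lines that are not in NFC."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if not unicodedata.is_normalized("NFC", line)
    ]

# Hypothetical document: line 1 uses the precomposed letter,
# line 2 uses a with breve followed by a combining grave accent.
document = (
    '<p class="b\u1EB1ng">ok</p>\n'
    '<p class="b\u0103\u0300ng">flagged</p>\n'
)

for number, line in find_non_nfc_lines(document):
    # The check only reports; it does not rewrite the document.
    print(f"line {number}: text is not in NFC")
```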

Advantages:

  • Requires changes to specifications in about as few points as the solution of normalizing between byte decoding and parsing.
  • Does not require changes to consuming software other than validators.
  • A Web authoring team can address the problem by starting to use a validator within the team and does not need to wait for all consuming software to be upgraded.

Disadvantages:

  • Requires authors to use validators if they want to avoid the Problem.
  • Only helps in detecting the problem, not in avoiding it.
  • Doesn't solve the problem for non-Web computer languages.

String matching-time late normalization

Require that all string comparisons done by implementations of Web technology report that strings that normalize to the same thing compare as equal. A preferred normalization would not need to be defined.
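
A comparison of this kind can be sketched in Python as below. The function name canonically_equal is an assumption, and the use of NFD inside the comparison is an arbitrary internal choice: any one canonical normalization form applied to both sides would give the same answer.

```python
import unicodedata

def canonically_equal(a, b):
    """Report whether two strings are canonically equivalent.

    Both inputs are left unmodified; normalization happens only
    on temporary copies used for the comparison.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

typed_on_system_a = "b\u1EB1ng"        # precomposed letter
typed_on_system_b = "b\u0103\u0300ng"  # a with breve + combining grave accent

print(typed_on_system_a == typed_on_system_b)                   # code point comparison
print(canonically_equal(typed_on_system_a, typed_on_system_b))  # late normalization
```

Each call normalizes both operands again, which is the comparison cost listed among the disadvantages.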

Advantages:

  • Allows whatever normalization (or non-normalized representation) the author preferred to produce the text in to persist without modification. (CITATION NEEDED)

Disadvantages:

  • Requires all consuming software to be upgraded.
  • Performance of comparisons. (CITATION NEEDED)
  • Prevents string interning unless both original data and interned pointer stored.
  • Requires changes to specifications and software at many points: wherever string comparisons are made.
  • Doesn't solve the problem for non-Web computer languages.

Interning-time normalization of identifiers

Require that implementations intern identifiers early and perform normalization at interning time.
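
A minimal Python sketch of the idea, assuming NFC as the interning form. The class NormalizingInternTable is hypothetical and uses a plain dictionary rather than the specialized hash table a high-performance implementation would likely need.

```python
import unicodedata

class NormalizingInternTable:
    """Intern identifiers under their NFC form (hypothetical helper)."""

    def __init__(self):
        self._table = {}

    def intern(self, identifier):
        key = unicodedata.normalize("NFC", identifier)
        # setdefault returns the already stored string when one exists,
        # so equivalent identifiers share a single interned object.
        return self._table.setdefault(key, key)

table = NormalizingInternTable()
first = table.intern("b\u1EB1ng")         # precomposed spelling
second = table.intern("b\u0103\u0300ng")  # a with breve + combining grave accent

print(first is second)               # identity comparison now suffices
print(second == "b\u0103\u0300ng")   # the author-supplied spelling is not what comes back
```

After interning, identifiers can be compared by identity, but a script that reads the interned value back receives the normalized form rather than the original spelling.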

Advantages:

  • Allows whatever normalization (or non-normalized representation) the author preferred to produce the text in to persist without modification except when retrieved by client side script. (CITATION NEEDED)

Disadvantages:

  • Requires all consuming software to be upgraded.
  • Performance of interning. (CITATION NEEDED)
  • High-performance interning likely to require working normalization into the interning hash table hash function making general-purpose normalization or hashing implementations unusable.
  • If a client side script retrieves an interned string, the normalized form would be returned instead of the author-supplied original.
  • Requires changes to specifications and software at many points: wherever string comparisons are made.
  • Doesn't solve the problem for non-Web computer languages.

Relevant Threads/Messages

See also