Character size unit and length semantics

When programming an application for a Western language such as English, a single-byte character set can be used, and the logical size, storage size and print width of a character are the same. For example, in ISO-8859-1, the ê character takes one logical position, has a storage size of one byte and a print width of one.

When programming an international application using multiple languages and a multibyte character set encoding, you must distinguish three size units:

  1. The size in character units, used to count or position logical characters in a string. For example, the strings abc and åôë both have a length of 3 in character units.
  2. The size in byte units, which is the number of bytes used to encode a character in a given character set. For example, the Latin character ê (e with circumflex) uses a single byte in the ISO-8859-1 character set, but needs two bytes in UTF-8.
  3. The size in width units, used in formatting and alignment. The width is the number of columns occupied by the glyph of a character, especially in a fixed-width font. For example, a Latin character takes one width unit, while a Chinese character takes two width units.
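
The three units can be illustrated with a short Python sketch. It is not part of any product API: the helper names char_length, byte_length and display_width are hypothetical, and the width rule (two columns for East Asian wide or fullwidth characters, zero for combining marks, one otherwise) is a simplifying assumption rather than a complete terminal-width algorithm.

```python
import unicodedata

def char_length(s: str) -> int:
    """Size in character units: number of code points in the string."""
    return len(s)

def byte_length(s: str, encoding: str) -> int:
    """Size in byte units: number of bytes once encoded in `encoding`."""
    return len(s.encode(encoding))

def display_width(s: str) -> int:
    """Approximate size in width units for a fixed-width font.

    Assumption: East Asian wide/fullwidth characters take 2 columns,
    combining marks take 0, and everything else takes 1.
    """
    width = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

e_circumflex = "\u00ea"   # ê
cjk = "\u4e2d"            # a Chinese character

print(char_length("abc"), char_length("\u00e5\u00f4\u00eb"))  # character units
print(byte_length(e_circumflex, "iso-8859-1"))                # byte units, single-byte set
print(byte_length(e_circumflex, "utf-8"))                     # byte units, UTF-8
print(display_width("abc"), display_width(cjk))               # width units
```

Note that the character count here is a count of Unicode code points; a string that uses combining sequences may contain more code points than visible characters.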

Working with byte units in a multibyte character set can be difficult: you need to calculate sizes, lengths and substring offsets in bytes, when the natural way is to count in characters.
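
As a rough illustration of that difficulty, the following Python sketch shows a byte offset that falls in the middle of a multibyte character. The helper byte_prefix_is_valid is a hypothetical name introduced only for this example, and UTF-8 is assumed as the encoding.

```python
text = "\u00e5\u00f4\u00eb"      # åôë: 3 characters
encoded = text.encode("utf-8")   # 6 bytes, 2 per character in UTF-8

def byte_prefix_is_valid(data: bytes, nbytes: int) -> bool:
    """Return True if the first `nbytes` bytes end on a character boundary."""
    try:
        data[:nbytes].decode("utf-8")
        return True
    except UnicodeDecodeError:
        # The cut landed inside a multibyte sequence.
        return False

print(len(text), len(encoded))
print(byte_prefix_is_valid(encoded, 2))  # cut after the first character
print(byte_prefix_is_valid(encoded, 3))  # cut in the middle of the second character
print(text[:2])                          # character offsets never split a character
```

With character offsets, a substring always contains whole characters; with byte offsets, the program must make sure each offset lands on a character boundary.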

Length semantics define the unit used for character data type definitions, character string lengths and positions.

With Byte Length Semantics, a length is expressed in bytes, while Character Length Semantics counts in characters.
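
The practical difference can be sketched in Python as a check of whether a value fits in a character field declared with a length of 3. The function fits and its parameters are hypothetical and only model the idea; they do not reproduce how any particular runtime or database applies length semantics, and UTF-8 is assumed as the encoding.

```python
def fits(value: str, limit: int, semantics: str, encoding: str = "utf-8") -> bool:
    """Check whether `value` fits in a field of size `limit`.

    semantics="byte": the limit is a number of bytes in `encoding`.
    semantics="char": the limit is a number of characters.
    """
    if semantics == "char":
        return len(value) <= limit
    if semantics == "byte":
        return len(value.encode(encoding)) <= limit
    raise ValueError("semantics must be 'byte' or 'char'")

accented = "\u00e5\u00f4\u00eb"  # åôë

print(fits("abc", 3, "byte"))     # 3 bytes in UTF-8
print(fits(accented, 3, "byte"))  # 6 bytes in UTF-8
print(fits(accented, 3, "char"))  # 3 characters
```

Under this model, a field of length 3 holds abc in either mode, but holds åôë only when the length counts characters.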