Daily C++11: String literals

As part of the heritage of the C++ programming language, ordinary string literals were arrays of narrow (char) characters, usually thought of as ANSI characters. For wide characters, C++03 defined the L prefix: L"abcd" represents all the characters, including the string terminator, as elements of the ambiguous type wchar_t. Unfortunately, the standard never fixed the size of this type; as a consequence, different implementations use different sizes, with 16 and 32 bits being the natural choices.
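A minimal sketch of how that ambiguity shows up in practice. The names wchar_bits and wide_bytes are made up for this illustration; the values they take depend on the compiler and platform, which is exactly the problem.

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t in bits on whatever platform compiles this file.
// Commonly 16 or 32, but the language does not pin it down.
constexpr std::size_t wchar_bits = sizeof(wchar_t) * CHAR_BIT;

// The same wide literal therefore occupies a platform-dependent number
// of bytes: four characters plus the terminator, times sizeof(wchar_t).
constexpr std::size_t wide_bytes = sizeof(L"abcd");
```

Printing wchar_bits on two different toolchains is a quick way to see whether wide strings can be exchanged between them as raw bytes.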

C++11 rectifies this by introducing new, unambiguous character types: char16_t and char32_t. C++11 therefore supports three Unicode encodings: UTF-8 through the type char, UTF-16 through char16_t, and UTF-32 through char32_t.

For these encodings, three string prefixes were introduced:

  • u8"ășțîâ" defines a string in UTF-8 encoding, as a const char[]
  • u"ășțîâ" defines a string in UTF-16 encoding, as a const char16_t[]
  • U"ășțîâ" defines a string in UTF-32 encoding, as a const char32_t[]
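The three prefixes can be compared side by side with a small sketch. It assumes the source file is saved as UTF-8 and that the compiler reads it as UTF-8 (the default for GCC and Clang; other compilers may need a switch). The helper code_units is a hypothetical name introduced here, not a standard function.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// including the terminating null. Works for any character type.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N; }

// The prefix selects both the encoding and the element type.
const char16_t utf16[] = u"ășțîâ";   // UTF-16, elements are char16_t
const char32_t utf32[] = U"ășțîâ";   // UTF-32, elements are char32_t

// The UTF-8 literal is passed directly to the helper so that the sketch
// does not depend on the exact element type of u8 literals.
constexpr std::size_t utf8_units  = code_units(u8"ășțîâ");
constexpr std::size_t utf16_units = code_units(utf16);
constexpr std::size_t utf32_units = code_units(utf32);
```

Each of these letters needs more than one byte in UTF-8 but fits in a single UTF-16 or UTF-32 code unit, so utf8_units comes out larger than the other two counts.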

To insert Unicode characters into such a string, use the \u escape sequence followed by the code point of the character, written as exactly four hexadecimal digits (code points above U+FFFF need \U with eight digits). Let's take the character U+1D79 (Latin small letter insular G, ᵹ): it is written as \u1D79.
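As an illustration, here is U+1D79 written with escapes in each of the three encodings. The variable names are invented for this sketch; the static_assert lines only spell out what the compiler produces for this particular code point.

```cpp
// \u takes exactly four hex digits, \U takes exactly eight.
constexpr char16_t g16[] = u"\u1D79";       // one UTF-16 code unit + null
constexpr char32_t g32[] = U"\U00001D79";   // one UTF-32 code unit + null

// In UTF-16 and UTF-32 the code unit equals the code point itself.
static_assert(g16[0] == 0x1D79, "UTF-16 code unit is the code point");
static_assert(g32[0] == 0x1D79, "UTF-32 code unit is the code point");

// In UTF-8 the same code point takes three bytes (E1 B5 B9) plus the null.
static_assert(sizeof(u8"\u1D79") == 4, "three bytes plus terminator");
static_assert(static_cast<unsigned char>(u8"\u1D79"[0]) == 0xE1,
              "first byte of the UTF-8 sequence");
```

Because the escape names the code point rather than its bytes, the same \u1D79 works unchanged under all three prefixes, and the source file itself stays plain ASCII.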

The new data types are defined in the language as basic primitive types – they need no header file to be included.
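A tiny sketch of that point: the following compiles as C++11 with no #include at all. The names unit16 and unit32 are placeholders chosen for the example.

```cpp
// Deliberately no #include: char16_t and char32_t are keywords in C++11.
constexpr char16_t unit16 = u'a';   // UTF-16 character literal
constexpr char32_t unit32 = U'a';   // UTF-32 character literal

// For an ASCII letter both values are simply its code point, 0x61.
static_assert(unit16 == 0x61, "UTF-16 value of 'a'");
static_assert(unit32 == 0x61, "UTF-32 value of 'a'");
```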

To be continued…
