5 Lexical conventions [lex]

5.13 Literals [lex.literal]

5.13.3 Character literals [lex.ccon]

character-literal:
	encoding-prefix(opt) ' c-char-sequence '
encoding-prefix: one of
	u8  u  U  L
c-char-sequence:
	c-char
	c-char-sequence c-char
c-char:
	any member of the basic source character set except the single-quote ', backslash \, or new-line character
	escape-sequence
	universal-character-name
escape-sequence:
	simple-escape-sequence
	octal-escape-sequence
	hexadecimal-escape-sequence
simple-escape-sequence: one of
	\'  \"  \?  \\
	\a  \b  \f  \n  \r  \t  \v
octal-escape-sequence:
	\ octal-digit
	\ octal-digit octal-digit
	\ octal-digit octal-digit octal-digit
hexadecimal-escape-sequence:
	\x hexadecimal-digit
	hexadecimal-escape-sequence hexadecimal-digit
A character-literal that does not begin with u8, u, U, or L is an ordinary character literal.
An ordinary character literal that contains a single c-char representable in the execution character set has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set.
An ordinary character literal that contains more than one c-char is a multicharacter literal.
A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.
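As a non-normative illustration, the type and value rules for ordinary character literals can be checked at compile time. The sketch below assumes an ASCII-compatible execution character set for the value 0x61 and a C++17 or later compiler; the names `ordinary_a` and `ordinary_is_char` are hypothetical and exist only for this example.

```cpp
#include <type_traits>

// Names below are hypothetical, introduced only for this illustration.
constexpr auto ordinary_a = 'a';

// An ordinary character literal with one representable c-char has type char.
constexpr bool ordinary_is_char = std::is_same_v<decltype('a'), char>;
static_assert(ordinary_is_char);

// Assumption: ASCII-compatible execution character set, so 'a' encodes as 0x61.
static_assert(ordinary_a == 0x61);

// A multicharacter literal such as 'ab' is conditionally-supported. Where it
// is accepted, decltype('ab') is int and the value is implementation-defined,
// so it is deliberately left out of the compiled checks.
```

The multicharacter case stays in a comment because implementations that accept it commonly diagnose it with a warning.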
A character-literal that begins with u8, such as u8'w', is a character-literal of type char8_t, known as a UTF-8 character literal.
The value of a UTF-8 character literal is equal to its ISO/IEC 10646 code point value, provided that the code point value can be encoded as a single UTF-8 code unit.
[ Note: That is, provided the code point value is in the range 0x0-0x7F (hexadecimal). — end note ]
If the value is not representable with a single UTF-8 code unit, the program is ill-formed.
A UTF-8 character literal containing multiple c-chars is ill-formed.
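A minimal, non-normative sketch of the UTF-8 rules follows. It assumes a compiler that accepts u8 character literals; the char8_t type check is guarded by the `__cpp_char8_t` feature-test macro because some compilers need a C++20 mode to provide that type. The name `utf8_w` is hypothetical.

```cpp
#include <type_traits>

// Hypothetical name, introduced only for this illustration.
constexpr auto utf8_w = u8'w';

// U+0077 fits in a single UTF-8 code unit, so the value is the code point.
static_assert(utf8_w == 0x77);
static_assert(sizeof(utf8_w) == 1);

#ifdef __cpp_char8_t
// Checked only where the compiler provides char8_t.
static_assert(std::is_same_v<decltype(u8'w'), char8_t>);
#endif

// Ill-formed, so left commented out: U+00E9 needs two UTF-8 code units.
// constexpr auto bad = u8'\u00E9';
```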
A character-literal that begins with the letter u, such as u'x', is a character-literal of type char16_t, known as a UTF-16 character literal.
The value of a UTF-16 character literal is equal to its ISO/IEC 10646 code point value, provided that the code point value is representable with a single 16-bit code unit.
[ Note: That is, provided the code point value is in the range 0x0-0xFFFF (hexadecimal). — end note ]
If the value is not representable with a single 16-bit code unit, the program is ill-formed.
A UTF-16 character literal containing multiple c-chars is ill-formed.
A character-literal that begins with the letter U, such as U'y', is a character-literal of type char32_t, known as a UTF-32 character literal.
The value of a UTF-32 character literal containing a single c-char is equal to its ISO/IEC 10646 code point value.
A UTF-32 character literal containing multiple c-chars is ill-formed.
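The UTF-16 and UTF-32 rules can be illustrated in the same non-normative way. The sketch assumes a C++17 or later compiler; all variable names are hypothetical. The values do not depend on the execution character set, since they are ISO/IEC 10646 code point values.

```cpp
#include <type_traits>

// Hypothetical names, introduced only for this illustration.
constexpr auto utf16_x = u'x';                    // U+0078
constexpr auto utf16_omega = u'\u03A9';           // U+03A9, one 16-bit code unit
constexpr auto utf32_y = U'y';                    // U+0079
constexpr auto utf32_beyond_bmp = U'\U00010000';  // needs more than 16 bits

static_assert(std::is_same_v<decltype(u'x'), char16_t>);
static_assert(std::is_same_v<decltype(U'y'), char32_t>);

// Each value equals the code point of the single c-char.
static_assert(utf16_x == 0x78);
static_assert(utf16_omega == 0x3A9);
static_assert(utf32_y == 0x79);
static_assert(utf32_beyond_bmp == 0x10000);

// Ill-formed, so left commented out: U+10000 needs two UTF-16 code units.
// constexpr auto bad = u'\U00010000';
```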
A character-literal that begins with the letter L, such as L'z', is a wide-character literal.
A wide-character literal has type wchar_t (see footnote 17).
A wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char has no representation in the execution wide-character set, in which case the value is implementation-defined.
[ Note: The type wchar_t is able to represent all members of the execution wide-character set (see [basic.fundamental]). — end note ]
The value of a wide-character literal containing multiple c-chars is implementation-defined.
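A short, non-normative sketch of the wide-character rules is given below. The value check assumes an execution wide-character set that encodes z as 0x7A, which holds for the UTF-16 and UTF-32 wide sets used by mainstream platforms but is not guaranteed by the text above. The name `wide_z` is hypothetical.

```cpp
#include <type_traits>

// Hypothetical name, introduced only for this illustration.
constexpr auto wide_z = L'z';

// A wide-character literal has type wchar_t.
static_assert(std::is_same_v<decltype(L'z'), wchar_t>);

// Assumption: the execution wide-character set encodes 'z' as 0x7A.
static_assert(wide_z == 0x7A);

// A wide-character literal with several c-chars, such as L'ab', has an
// implementation-defined value, so it is left out of the compiled checks.
```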
Certain non-graphic characters, the single quote ', the double quote ", the question mark ? (see footnote 18), and the backslash \, can be represented according to Table 9.
The double quote " and the question mark ? can be represented as themselves or by the escape sequences \" and \? respectively, but the single quote ' and the backslash \ shall be represented by the escape sequences \' and \\ respectively.
Escape sequences in which the character following the backslash is not listed in Table 9 are conditionally-supported, with implementation-defined semantics.
An escape sequence specifies a single character.
Table 9: Escape sequences   [tab:lex.ccon.esc]
new-line          NL(LF)   \n
horizontal tab    HT       \t
vertical tab      VT       \v
backspace         BS       \b
carriage return   CR       \r
form feed         FF       \f
alert             BEL      \a
backslash         \        \\
question mark     ?        \?
single quote      '        \'
double quote      "        \"
octal number      ooo      \ooo
hex number        hhh      \xhhh
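The simple escape sequences in Table 9 can be spot-checked as follows. This is a non-normative sketch that assumes an ASCII-compatible execution character set, so the numeric values are those of the ASCII control characters named in the table; the names `escaped_backslash` and `escaped_single_quote` are hypothetical.

```cpp
// Hypothetical names, introduced only for this illustration.
constexpr char escaped_backslash = '\\';
constexpr char escaped_single_quote = '\'';

// Assumption: ASCII-compatible execution character set.
static_assert('\n' == 0x0A);  // new-line (LF)
static_assert('\t' == 0x09);  // horizontal tab
static_assert('\v' == 0x0B);  // vertical tab
static_assert('\b' == 0x08);  // backspace
static_assert('\r' == 0x0D);  // carriage return
static_assert('\f' == 0x0C);  // form feed
static_assert('\a' == 0x07);  // alert
static_assert(escaped_backslash == 0x5C);
static_assert(escaped_single_quote == 0x27);

// The double quote and question mark may be written either way.
static_assert('\"' == '"');
static_assert('\?' == '?');
```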
The escape \ooo consists of the backslash followed by one, two, or three octal digits that are taken to specify the value of the desired character.
The escape \xhhh consists of the backslash followed by x followed by one or more hexadecimal digits that are taken to specify the value of the desired character.
There is no limit to the number of digits in a hexadecimal sequence.
A sequence of octal or hexadecimal digits is terminated by the first character that is not an octal digit or a hexadecimal digit, respectively.
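The digit-count and termination rules for numeric escapes are easiest to see in a sketch. The example below is non-normative and uses string literals, which share the same escape sequences, so that the character following an escape can be observed. All names are hypothetical.

```cpp
// Hypothetical names, introduced only for this illustration.
constexpr char octal_a = '\101';  // octal 101 is decimal 65
constexpr char hex_a = '\x41';    // hexadecimal 41 is decimal 65
static_assert(octal_a == 65);
static_assert(hex_a == 65);

// An octal escape takes at most three digits: \101 followed by '1'.
constexpr char octal_then_digit[] = "\1011";
static_assert(sizeof(octal_then_digit) == 3);  // 65, '1', terminator
static_assert(octal_then_digit[0] == 65 && octal_then_digit[1] == '1');

// An octal escape ends at the first non-octal digit: \1 followed by '8'.
constexpr char octal_then_eight[] = "\18";
static_assert(sizeof(octal_then_eight) == 3);
static_assert(octal_then_eight[0] == 1 && octal_then_eight[1] == '8');

// A hexadecimal escape has no digit limit, so "\x41b" would be read as the
// single escape \x41b. Splitting the literal ends the escape after \x41.
constexpr char hex_then_b[] = "\x41" "b";
static_assert(sizeof(hex_then_b) == 3);
static_assert(hex_then_b[0] == 65 && hex_then_b[1] == 'b');
```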
The value of a character-literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for character-literals with no prefix) or wchar_t (for character-literals prefixed by L).
[ Note: If the value of a character-literal prefixed by u, u8, or U is outside the range defined for its type, the program is ill-formed. — end note ]
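The following non-normative sketch shows one escape whose value can fall outside the range of char. It assumes 8-bit bytes. Where char is signed, the value of '\xFF' is implementation-defined; the check relies on the common behavior of keeping the bit pattern, which is an assumption about the implementation rather than something the text above requires. The name `high_byte` is hypothetical.

```cpp
#include <climits>

// Stated assumption for this illustration: 8-bit bytes.
static_assert(CHAR_BIT == 8);

// Hypothetical name, introduced only for this illustration.
constexpr char high_byte = '\xFF';

// Where char is unsigned, 0xFF is in range. Where char is signed, the value
// is implementation-defined; implementations commonly keep the bit pattern,
// so the value converts back to 0xFF as an unsigned char.
static_assert(static_cast<unsigned char>(high_byte) == 0xFF);

// Ill-formed, so left commented out: 0x10000 is outside the char16_t range.
// constexpr auto bad = u'\x10000';
```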
A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named.
If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding.
[ Note: In translation phase 1, a universal-character-name is introduced whenever an actual extended character is encountered in the source text. Therefore, all extended characters are described in terms of universal-character-names. However, the actual compiler implementation may use its own native character set, so long as the same results are obtained. — end note ]
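A universal-character-name inside a character literal can be sketched as below. The example is non-normative and uses only UTF-16 and UTF-32 literals, whose values are code points and therefore do not depend on the execution character set. The names `e_acute` and `euro_sign` are hypothetical.

```cpp
// Hypothetical names, introduced only for this illustration.
constexpr char32_t e_acute = U'\u00E9';    // LATIN SMALL LETTER E WITH ACUTE
constexpr char16_t euro_sign = u'\u20AC';  // EURO SIGN

// The universal-character-name is translated to the encoding of the
// character it names, which here is the code point value itself.
static_assert(e_acute == 0xE9);
static_assert(euro_sign == 0x20AC);
```

Writing the extended character directly in the source is meant to give the same result, but that depends on the compiler decoding the source file correctly, so the sketch uses the escaped spelling.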
17) Wide-character literals are intended for character sets where a character does not fit into a single byte.
18) Using an escape sequence for a question mark is supported for compatibility with ISO C++ 2014 and ISO C.