| utf8Conversion {base} | R Documentation |
Conversion of UTF-8 encoded character vectors to and from integer vectors representing a UTF-32 encoding.
utf8ToInt(x)
intToUtf8(x, multiple = FALSE)
| x | object to be converted. |
| multiple | logical: should the conversion be to a single character string or multiple individual characters? |
These functions work in any locale, including on platforms that do not otherwise support multi-byte character sets.
Unicode defines a name and a number for each of the glyphs it
encompasses. The numbers are called code points: since RFC3629
they run from 0 to 0x10FFFF (with about 12% being
assigned by version 10.0 of the Unicode standard).
intToUtf8 does not handle surrogate pairs (which should not
occur in UTF-32): inputs in the surrogate ranges are mapped to
NA.
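The surrogate behaviour described above can be illustrated with a short sketch. The variable names (`high`, `low`, `emoji`) are illustrative only, and the sketch assumes an R version that behaves as this page describes.

```r
# Code points in the surrogate ranges (0xD800 to 0xDFFF) are reserved for
# UTF-16 and are not valid on their own in UTF-32.
high <- 0xD83D   # a high (leading) surrogate
low  <- 0xDE00   # a low (trailing) surrogate

is.na(intToUtf8(high))                            # surrogate input gives NA
is.na(intToUtf8(c(high, low), multiple = TRUE))   # each element gives NA

# The supplementary-plane code point itself converts normally and
# round-trips through utf8ToInt().
emoji <- intToUtf8(0x1F600)
utf8ToInt(emoji) == 0x1F600
```

In other words, a character outside the Basic Multilingual Plane should be supplied as its single code point, not as the UTF-16 pair that would encode it.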
utf8ToInt converts a length-one character string encoded in
UTF-8 to an integer vector of Unicode code points. It checks validity
of the input. (Currently it accepts UTF-8 encodings of code points
greater than 0x10FFFF: these are no longer regarded as valid by
the UTF-8 RFC and will in future be mapped to NA. Following
‘Corrigendum 9’, the UTF-8 encodings of the
‘noncharacters’ 0xFFFE and 0xFFFF are regarded as
valid as of R 3.4.3.)
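As a minimal sketch of how utf8ToInt might be used in practice, the following converts a string to code points and formats them in the conventional U+ notation. The names `s`, `cp` and `words` are illustrative. Because the function expects a length-one string, the sketch uses lapply() for a longer vector.

```r
s <- "bi\u00dfchen"
cp <- utf8ToInt(s)

cp                                      # one integer per character, not per byte
length(cp) == nchar(s, type = "chars")  # TRUE: 7 characters, 7 code points
sprintf("U+%04X", cp)                   # e.g. "U+00DF" for the sharp s

# For a character vector with several elements, convert each one separately.
words <- c("a", "\u00df")
lapply(words, utf8ToInt)

# A missing string gives a missing result.
utf8ToInt(NA_character_)
```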
intToUtf8 converts a numeric vector of Unicode code points
either (default) to a single character string or a character vector of
single characters. Non-integral numeric values are truncated to
integers: values above the maximum are mapped to NA. For a
single character string 0 is silently omitted: otherwise
0 is mapped to "". The Encoding of a
non-NA return value is declared as "UTF-8".
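The handling of 0, of non-integral values and of the declared encoding can be checked with a brief sketch. The input vector `cp` is an arbitrary illustration, not part of the original examples.

```r
cp <- c(72, 0, 105)                     # "H", the zero code point, "i"

one  <- intToUtf8(cp)                   # single string: the 0 is dropped
many <- intToUtf8(cp, multiple = TRUE)  # one element per code point: 0 becomes ""
one
many

# Non-integral values are truncated, so these two calls agree.
trunc_res <- intToUtf8(c(945.9, 946.2))
int_res   <- intToUtf8(c(945L, 946L))
identical(trunc_res, int_res)

# A non-ASCII result is declared as UTF-8.
Encoding(intToUtf8(0x03B2))
```

The single-string form is convenient for rebuilding text, while `multiple = TRUE` keeps a one-to-one correspondence between input code points and output elements.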
Invalid and NA inputs are mapped to NA output.
https://tools.ietf.org/html/rfc3629, the current standard for UTF-8.
http://www.unicode.org/versions/corrigendum9.html for non-characters.
## will only display in some locales and fonts
intToUtf8(0x03B2L) # Greek beta
utf8ToInt("bi\u00dfchen")          # code points of a string containing U+00DF
utf8ToInt("\xfa\xb4\xbf\xbf\x9f")  # 5-byte sequence, beyond the RFC 3629 range