convert among UTF-8, UTF-16, and UTF-32 More...
#include "cgul_common.h"
#include "cgul_exception.h"
#include "cgul_hstring.h"
#include "cgul_string.h"
#include "cgul_wstring.h"
This file contains Unicode utility functions for converting among UTF-8, UTF-16, and UTF-32.
CGUL_BEGIN_C CGUL_EXPORT void cgul_unicode__mbstowcs | ( | cgul_exception_t * | cex, |
const char * | utf8, | ||
size_t | utf8_length, | ||
int | skip_bom, | ||
cgul_wstring_t | utf32 | ||
) |
UTF-8 to UTF-32 string conversion.
This functions converts the UTF-8 multi-byte string utf8
to a UTF-32 wide character string and appends the result to utf32
. If a leading byte-order mark (BOM) is present and skip_bom
is set, the BOM is automatically removed.
Note that BOM characters can be embedded inside of UTF-8 strings. It is only the leading BOM that gets removed if skip_bom
is set. This is necessary because any other BOM is supposed to be interpreted as a "zero-width no-break space" that's only purpose is to prevent the formation of ligatures. So, if you are converting an entire file by calling this function more than once, you should set skip_bom
the first time this function is called, but clear skip_bom
all the other times this function is called.
To guard against reading off the end of utf8
, its length must be given in utf8_length
. The value of utf8_length
is the total number of bytes in the utf8
string.
If an error occurs, an exception is thrown. Errors can occur because utf8
is not a valid UTF-8 encoding or because memory could not be allocated for the string that holds the UTF-32 conversion. If an error occurs, the value of utf32
is undefined.
[in,out] | cex | c-style exception |
[in] | utf8 | UTF-8 string |
[in] | utf8_length | length of utf8 in bytes |
[in] | skip_bom | whether the leading byte-order mark should be skipped |
[out] | utf32 | conversion of utf8 to UTF-32 |
Referenced by cgul_unicode_cxx::mbstowcs().
CGUL_EXPORT void cgul_unicode__wcstombs | ( | cgul_exception_t * | cex, |
const cgul_wchar_t * | utf32, | ||
size_t | utf32_length, | ||
int | skip_bom, | ||
cgul_string_t | utf8 | ||
) |
UTF-32 to UTF-8 string conversion.
This functions converts the wide-character string utf32
from UTF-32 to UTF-8 and appends the result to the multi-byte string utf8
.
If an error occurs, an exception is thrown. Errors can occur because utf32
is not a valid UTF-32 encoding or because memory could not be allocated for the string that holds the UTF-8 conversion. If an error occurs, the value of utf8
is undefined.
[in,out] | cex | c-style exception |
[in] | utf32 | UTF-32 string |
[in] | utf32_length | length of utf32 in wide characters |
[in] | skip_bom | whether the leading byte-order mark should be skipped |
[out] | utf8 | conversion of utf32 to UTF-8 |
Referenced by cgul_unicode_cxx::wcstombs().
CGUL_EXPORT void cgul_unicode__mbstohcs | ( | cgul_exception_t * | cex, |
const char * | utf8, | ||
size_t | utf8_length, | ||
int | skip_bom, | ||
cgul_hstring_t | utf16 | ||
) |
UTF-8 to UTF-16 string conversion.
This functions converts the UTF-8 multi-byte string utf8
to a UTF-16 wide character string and appends the result to utf16
. If a leading byte-order mark (BOM) is present and skip_bom
is set, the BOM is automatically removed.
Note that BOM characters can be embedded inside of UTF-8 strings. It is only the leading BOM that gets removed if skip_bom
is set. This is necessary because any other BOM is supposed to be interpreted as a "zero-width no-break space" that's only purpose is to prevent the formation of ligatures. So, if you are converting an entire file by calling this function more than once, you should set skip_bom
the first time this function is called, but clear skip_bom
all the other times this function is called.
To guard against reading off the end of utf8
, its length must be given in utf8_length
. The value of utf8_length
is the total number of bytes in the utf8
string.
If an error occurs, an exception is thrown. Errors can occur because utf8
is not a valid UTF-8 encoding or because memory could not be allocated for the string that holds the UTF-16 conversion. If an error occurs, the value of utf16
is undefined.
[in,out] | cex | c-style exception |
[in] | utf8 | UTF-8 string |
[in] | utf8_length | length of utf8 in bytes |
[in] | skip_bom | whether the leading byte-order mark should be skipped |
[out] | utf16 | conversion of utf8 to UTF-16 |
Referenced by cgul_unicode_cxx::mbstohcs().
CGUL_EXPORT void cgul_unicode__hcstombs | ( | cgul_exception_t * | cex, |
const cgul_hchar_t * | utf16, | ||
size_t | utf16_length, | ||
int | skip_bom, | ||
cgul_string_t | utf8 | ||
) |
UTF-16 to UTF-8 string conversion.
This functions converts the UTF-16 wide-character string utf16
to a UTF-8 multi-byte string and appends the result to utf8
. If a leading byte-order mark (BOM) is present and skip_bom
is set, the BOM is automatically removed.
Note that BOM characters can be embedded inside of UTF-16 strings. It is only the leading BOM that gets removed if skip_bom
is set. This is necessary because any other BOM is supposed to be interpreted as a "zero-width no-break space" that's only purpose is to prevent the formation of ligatures. So, if you are converting an entire file by calling this function more than once, you should set skip_bom
the first time this function is called, but clear skip_bom
all the other times this function is called.
To guard against reading off the end of utf16
, its length must be given in utf16_length
. The value of utf16_length
is the total number of cgul_hchar_t
elements in the utf16
string; this value can be obtained by calling cgul_hchar__hcslen()
or cgul_hstring__get_length()
. Basically, surrogate pairs count as two; everything else counts as one.
If an error occurs, an exception is thrown. Errors can occur because utf16
is not a valid UTF-16 encoding or because memory could not be allocated for the string that holds the UTF-8 conversion. If an error occurs, the value of utf8
is undefined.
[in,out] | cex | c-style exception |
[in] | utf16 | UTF-16 string |
[in] | utf16_length | length of utf16 in cgul_hchar_t elements |
[in] | skip_bom | whether the leading byte-order mark should be skipped |
[out] | utf8 | conversion of utf16 to UTF-8 |
Referenced by cgul_unicode_cxx::hcstombs().
CGUL_EXPORT void cgul_unicode__hcstowcs | ( | cgul_exception_t * | cex, |
const cgul_hchar_t * | utf16, | ||
size_t | utf16_length, | ||
int | skip_bom, | ||
cgul_wstring_t | utf32 | ||
) |
UTF-16 to UTF-32 string conversion.
This functions converts the UTF-16 wide-character string utf16
to a UTF-32 wide-character string and appends the result to utf32
. If a leading byte-order mark (BOM) is present and skip_bom
is set, the BOM is automatically removed.
Note that BOM characters can be embedded inside of UTF-16 strings. It is only the leading BOM that gets removed if skip_bom
is set. This is necessary because any other BOM is supposed to be interpreted as a "zero-width no-break space" that's only purpose is to prevent the formation of ligatures. So, if you are converting an entire file by calling this function more than once, you should set skip_bom
the first time this function is called, but clear skip_bom
all the other times this function is called.
To guard against reading off the end of utf16
, its length must be given in utf16_length
. The value of utf16_length
is the total number of cgul_hchar_t
elements in the utf16
string; this value can be obtained by calling cgul_hchar__hcslen()
or cgul_hstring__get_length()
. Basically, surrogate pairs count as two; everything else counts as one.
If an error occurs, an exception is thrown. Errors can occur because utf16
is not a valid UTF-16 encoding or because memory could not be allocated for the string that holds the UTF-32 conversion. If an error occurs, the value of utf32
is undefined.
[in,out] | cex | c-style exception |
[in] | utf16 | UTF-16 string |
[in] | utf16_length | length of utf16 in cgul_hchar_t elements |
[in] | skip_bom | whether the leading byte-order mark should be skipped |
[out] | utf32 | conversion of utf16 to UTF-32 |
Referenced by cgul_unicode_cxx::hcstowcs().
CGUL_EXPORT void cgul_unicode__wcstohcs | ( | cgul_exception_t * | cex, |
const cgul_wchar_t * | utf32, | ||
size_t | utf32_length, | ||
int | skip_bom, | ||
cgul_hstring_t | utf16 | ||
) |
UTF-32 to UTF-16 string conversion.
This functions converts the wide-character string utf32
from UTF-32 to UTF-16 and appends the result to the wide-character string utf16
.
If an error occurs, an exception is thrown. Errors can occur because utf32
is not a valid UTF-32 encoding or because memory could not be allocated for the string that holds the UTF-16 conversion. If an error occurs, the value of utf16
is undefined.
[in,out] | cex | c-style exception |
[in] | utf32 | UTF-32 string |
[in] | utf32_length | length of utf32 in wide characters |
[in] | skip_bom | whether the leading byte-order mark should be skipped |
[out] | utf16 | conversion of utf32 to UTF-16 |
Referenced by cgul_unicode_cxx::wcstohcs().
CGUL_EXPORT cgul_wchar_t cgul_unicode__mbtowc | ( | cgul_exception_t * | cex, |
const char * | utf8, | ||
size_t | utf8_length, | ||
size_t * | index | ||
) |
UTF-8 to UTF-32 character conversion.
This method converts exactly one UTF-8 multi-byte character to a UTF-32 wide character and returns it. The multiple bytes that comprise the UTF-8 character start at utf8[*index]
.
In order to make iterating over an entire UTF-8 string easy, when this method returns, the value of *index
is updated to point to the beginning of the next character.
To guard against reading off the end of utf8
, its length must be given in utf8_length
. The value of utf8_length
is the total number of bytes in the utf8
string. It is not the number of bytes remaining in the string.
NOTE: This method throws an exception if utf8_length
is 0. Thus, the caller must verify that utf8
is not an empty string before calling this method!
If an error occurs, CGUL_WCHAR__NUL
is returned, and an exception is thrown.
[in] | cex | c-style exception |
[in] | utf8 | UTF-8 string |
[in] | utf8_length | length of utf8 in bytes |
[in,out] | index | index into utf8 |
Referenced by cgul_unicode_cxx::mbtowc().
CGUL_EXPORT size_t cgul_unicode__wctomb | ( | cgul_exception_t * | cex, |
const cgul_wchar_t | utf32, | ||
char * | buf | ||
) |
UTF-32 to UTF-8 character conversion.
This method converts exactly one UTF-32 wide character to a UTF-8 multi-byte character sequence and returns it in buf
as a C-style string that is NUL
terminated. The UTF-32 character that gets converted is passed in as wc
. As an added convenience, this method returns the number of bytes written to buf
excluding the NUL
terminator.
The caller is responsible for making sure that buf
points to a buffer that can hold at least CGUL_WCHAR__MB_LEN_MAX + 1
bytes.
If an error occurs, zero is returned to the caller, an empty string is returned in buf
, and an exception is thrown. Errors can occur, for example, if wc
is larger than 0x10ffff which is the largest valid Unicode code point.
[in] | cex | c-style exception |
[in] | utf32 | UTF-32 wide character |
[out] | buf | buffer that holds the resulting UTF-8 multi-byte sequence |
buf
Referenced by cgul_unicode_cxx::wctomb().
CGUL_EXPORT size_t cgul_unicode__mbtohc | ( | cgul_exception_t * | cex, |
const char * | utf8, | ||
size_t | utf8_length, | ||
size_t * | index, | ||
cgul_hchar_t * | buf | ||
) |
UTF-8 to UTF-16 character conversion.
This method converts exactly one UTF-8 multi-byte character to a UTF-16 wide character sequence and returns it in buf
as a cgul_hchar_t
string that is NUL
terminated. As an added convenience, this method returns the number of cgul_hchar_t
elements written to buf
excluding the NUL
terminator.
The multiple bytes that comprise the UTF-8 character start at utf8[*index]
. In order to make iterating over an entire UTF-8 string easy, when this method returns, the value of *index
is updated to point to the beginning of the next character.
To guard against reading off the end of utf8
, its length must be given in utf8_length
. The value of utf8_length
is the total number of bytes in the utf8
string. It is not the number of bytes remaining in the string.
The caller is responsible for making sure that buf
points to a buffer that can hold at least three cgul_hchar_t
characters. This accounts for the worst case which requires a surrogate pair and a NUL
terminator.
If an error occurs, zero is returned to the caller, an empty string is returned in buf
, and an exception is thrown.
NOTE: This method throws an exception if utf8_length
is 0. Thus, the caller must verify that utf8
is not an empty string before calling this method!
[in] | cex | c-style exception |
[in] | utf8 | UTF-8 string |
[in] | utf8_length | length of utf8 in bytes |
[in,out] | index | index into utf8 |
[out] | buf | buffer that holds the resulting UTF-16 multi-byte sequence |
cgul_hchar_t
elements written to buf
Referenced by cgul_unicode_cxx::mbtohc().
CGUL_EXPORT size_t cgul_unicode__hctomb | ( | cgul_exception_t * | cex, |
const cgul_hchar_t * | utf16, | ||
size_t | utf16_length, | ||
size_t * | index, | ||
char * | buf | ||
) |
UTF-16 to UTF-8 character conversion.
This method converts exactly one UTF-16 wide character sequence to a UTF-8 multi-byte character sequence and returns it in buf
as a C-style string that is NUL
terminated. As an added convenience, this method returns the number of bytes written to buf
excluding the NUL
terminator.
The multiple cgul_hcahr_t
elements that comprise the UTF-16 wide character sequence start at utf16[*index]
. In order to make iterating over an entire UTF-16 string easy, when this method returns, the value of *index
is updated to point to the beginning of the next character.
To guard against reading off the end of utf16
, its length must be given in utf16_length
. The value of utf16_length
is the total number of cgul_hchar_t
elements in the utf16
string; this value can be obtained by calling cgul_hchar__hcslen()
or cgul_hstring__get_length()
. Basically, surrogate pairs count as two; everything else counts as one. Note that the value is not the number of characters remaining in the string; it's the total number of characters in the string.
The caller is responsible for making sure that buf
points to a buffer that can hold at least CGUL_HCHAR__MB_LEN_MAX + 1
bytes.
If an error occurs, zero is returned to the caller, an empty string is returned in buf
, and an exception is thrown.
NOTE: This method throws an exception if utf16_length
is 0. Thus, the caller must verify that utf16
is not an empty string before calling this method!
[in] | cex | c-style exception |
[in] | utf16 | UTF-16 string |
[in] | utf16_length | length of utf16 in cgul_hchar_t elements |
[in,out] | index | index into utf16 |
[out] | buf | buffer that holds the resulting UTF-8 multi-byte sequence |
buf
Referenced by cgul_unicode_cxx::hctomb().
CGUL_EXPORT cgul_wchar_t cgul_unicode__hctowc | ( | cgul_exception_t * | cex, |
const cgul_hchar_t * | utf16, | ||
size_t | utf16_length, | ||
size_t * | index | ||
) |
UTF-16 to UTF-32 character conversion.
This method converts exactly one UTF-16 multi-byte character to a UTF-32 wide character and returns it. The multiple bytes that comprise the UTF-16 character start at utf16[*index]
.
In order to make iterating over an entire UTF-16 string easy, when this method returns, the value of *index
is updated to point to the beginning of the next character.
To guard against reading off the end of utf16
, its length must be given in utf16_length
. The value of utf16_length
is the total number of cgul_hchar_t
elements in the utf16
string; this value can be obtained by calling cgul_hchar__hcslen()
or cgul_hstring__get_length()
. Basically, surrogate pairs count as two; everything else counts as one. Note that the value is not the number of characters remaining in the string; it's the total number of characters in the string.
NOTE: This method throws an exception if utf16_length
is 0. Thus, the caller must verify that utf16
is not an empty string before calling this method!
If an error occurs, CGUL_HCHAR__NUL
is returned, and an exception is thrown.
[in] | cex | c-style exception |
[in] | utf16 | UTF-16 string |
[in] | utf16_length | length of utf16 in cgul_hchar_t elements |
[in,out] | index | index into utf16 |
Referenced by cgul_unicode_cxx::hctowc().
CGUL_EXPORT size_t cgul_unicode__wctohc | ( | cgul_exception_t * | cex, |
const cgul_wchar_t | utf32, | ||
cgul_hchar_t * | buf | ||
) |
UTF-32 to UTF-16 character conversion.
This method converts exactly one UTF-32 wide character to a UTF-16 multi-byte character sequence and returns it in buf
as a cgul_hchar_t
string that is NUL
terminated. The UTF-32 character that gets converted is passed in as wc
. As an added convenience, this method returns the number of bytes written to buf
excluding the NUL
terminator.
The caller is responsible for making sure that buf
points to a buffer that can hold at least three cgul_hchar_t
characters. This accounts for the worst case which requires a surrogate pair and a NUL
terminator.
If an error occurs, zero is returned to the caller, an empty string is returned in buf
, and an exception is thrown. Errors can occur, for example, if wc
is larger than 0x10ffff which is the largest valid Unicode code point.
[in] | cex | c-style exception |
[in] | utf32 | UTF-32 wide character |
[out] | buf | buffer that holds the resulting UTF-16 multi-byte sequence |
buf
Referenced by cgul_unicode_cxx::wctohc().
CGUL_EXPORT size_t cgul_unicode__get_char_count | ( | cgul_exception_t * | cex, |
const char * | utf8, | ||
size_t | utf8_length, | ||
int | skip_bom | ||
) |
This method counts the number of Unicode characters (not bytes) in the UTF-8 string utf8
and returns the result. If an error occurs decoding utf8
, 0
is returned, and an exception is thrown.
If a leading byte-order mark (BOM) is present and skip_bom
is set, the leading BOM will not be included in the character count.
To guard against reading off the end of utf8
, its length must be given in utf8_length
. The value of utf8_length
is the total number of bytes in the utf8
string.
[in] | cex | c-style exception |
[in] | utf8 | UTF-8 string |
[in] | utf8_length | length of utf8 in bytes |
[in] | skip_bom | whether to skip the byte-order mark |
utf8
Referenced by cgul_unicode_cxx::get_char_count().
CGUL_EXPORT size_t cgul_unicode__get_hchar_count | ( | cgul_exception_t * | cex, |
const cgul_hchar_t * | utf16, | ||
size_t | utf16_length, | ||
int | skip_bom | ||
) |
This method counts the number of Unicode characters (not bytes and not cgul_hchar_t
elements) in the UTF-16 string utf16
and returns the result. If an error occurs decoding utf16
, 0
is returned, and an exception is thrown.
If a leading byte-order mark (BOM) is present and skip_bom
is set, the leading BOM will not be included in the character count.
To guard against reading off the end of utf16
, its length must be given in utf16_length
. The value of utf16_length
is the total number of cgul_hchar_t
elements in the utf16
string; this value can be obtained by calling cgul_hchar__hcslen()
or cgul_hstring__get_length()
. Basically, surrogate pairs count as two; everything else counts as one.
[in] | cex | c-style exception |
[in] | utf16 | UTF-16 string |
[in] | utf16_length | length of utf16 in cgul_hchar_t elements |
[in] | skip_bom | whether to skip the byte-order mark |
utf16
Referenced by cgul_unicode_cxx::get_hchar_count().
CGUL_EXPORT size_t cgul_unicode__get_wchar_count | ( | cgul_exception_t * | cex, |
const cgul_wchar_t * | utf32, | ||
int | skip_bom | ||
) |
This method counts the number of Unicode characters (not bytes) in the UTF-32 string utf32
and returns the result. If an error occurs decoding utf32
, 0
is returned, and an exception is thrown.
If a leading byte-order mark (BOM) is present and skip_bom
is set, the leading BOM will not be included in the character count.
[in] | cex | c-style exception |
[in] | utf32 | UTF-32 string |
[in] | skip_bom | whether to skip the byte-order mark |
utf32
Referenced by cgul_unicode_cxx::get_wchar_count().