C++ bindings for cgul_unicode

#include <cgul_unicode_cxx.h>

Static Public Member Functions
static void	mbstowcs (const char *utf8, size_t utf8_length, int skip_bom, cgul_wstring_cxx &utf32)

static void	wcstombs (const cgul_wchar_t *utf32, size_t utf32_length, int skip_bom, cgul_string_cxx &utf8)

static void	mbstohcs (const char *utf8, size_t utf8_length, int skip_bom, cgul_hstring_cxx &utf16)

static void	hcstombs (const cgul_hchar_t *utf16, size_t utf16_length, int skip_bom, cgul_string_cxx &utf8)

static void	hcstowcs (const cgul_hchar_t *utf16, size_t utf16_length, int skip_bom, cgul_wstring_cxx &utf32)

static void	wcstohcs (const cgul_wchar_t *utf32, size_t utf32_length, int skip_bom, cgul_hstring_cxx &utf16)

static cgul_wchar_t	mbtowc (const char *utf8, size_t utf8_length, size_t *index)

static size_t	wctomb (const cgul_wchar_t utf32, char *buf)

static size_t	mbtohc (const char *utf8, size_t utf8_length, size_t *index, cgul_hchar_t *buf)

static size_t	hctomb (const cgul_hchar_t *utf16, size_t utf16_length, size_t *index, char *buf)

static cgul_wchar_t	hctowc (const cgul_hchar_t *utf16, size_t utf16_length, size_t *index)

static size_t	wctohc (const cgul_wchar_t utf32, cgul_hchar_t *buf)

static size_t	get_char_count (const char *utf8, size_t utf8_length, int skip_bom)

static size_t	get_hchar_count (const cgul_hchar_t *utf16, size_t utf16_length, int skip_bom)

static size_t	get_wchar_count (const cgul_wchar_t *utf32, int skip_bom)

Detailed Description

This class provides the C++ bindings for C cgul_unicode external functions. The main purpose of this class is to convert the C-style function calls and exception handling in cgul_unicode into C++-style function calls and exception handling.

Member Function Documentation

§ mbstowcs()

static void cgul_unicode_cxx::mbstowcs	(	const char *	utf8,
		size_t	utf8_length,
		int	skip_bom,
		cgul_wstring_cxx &	utf32
	)

inline static

UTF-8 to UTF-32 string conversion.

This function converts the UTF-8 multi-byte string utf8 to a UTF-32 wide-character string and appends the result to utf32. If a leading byte-order mark (BOM) is present and skip_bom is set, the BOM is automatically removed.

Note that BOM characters can be embedded inside UTF-8 strings. Only the leading BOM is removed if skip_bom is set. This is necessary because any other BOM is supposed to be interpreted as a "zero-width no-break space" whose only purpose is to prevent the formation of ligatures. So, if you are converting an entire file by calling this function more than once, set skip_bom on the first call and clear it on every subsequent call.

To guard against reading off the end of utf8, its length must be given in utf8_length. The value of utf8_length is the total number of bytes in the utf8 string.

If an error occurs, an exception is thrown. Errors can occur because utf8 is not a valid UTF-8 encoding or because memory could not be allocated for the string that holds the UTF-32 conversion. If an error occurs, the value of utf32 is undefined.

Parameters

[in]	utf8	UTF-8 string
[in]	utf8_length	length of `utf8` in bytes
[in]	skip_bom	whether the leading byte-order mark should be skipped
[out]	utf32	conversion of `utf8` to UTF-32

References cgul_unicode__mbstowcs(), and cgul_wstring_cxx::get_obj().
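
The chunked-conversion advice above (set skip_bom on the first call only) can be sketched without the library. The helper names `append_chunk` and `convert_chunks` are hypothetical stand-ins, not part of cgul; the stand-in merely copies bytes instead of converting to UTF-32, so only the BOM policy is illustrated.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a converter that honours skip_bom: it drops a
// leading UTF-8 BOM (EF BB BF) only when skip_bom is non-zero and appends
// the remaining bytes to out. This is NOT the cgul implementation.
static void append_chunk(const char *utf8, std::size_t utf8_length,
                         int skip_bom, std::string &out)
{
    std::size_t start = 0;
    if (skip_bom && utf8_length >= 3 &&
        static_cast<unsigned char>(utf8[0]) == 0xEF &&
        static_cast<unsigned char>(utf8[1]) == 0xBB &&
        static_cast<unsigned char>(utf8[2]) == 0xBF) {
        start = 3;
    }
    out.append(utf8 + start, utf8_length - start);
}

// Convert a file delivered as several chunks: skip_bom is set for the
// first call and cleared for all later calls.
static std::string convert_chunks(const std::vector<std::string> &chunks)
{
    std::string out;
    int skip_bom = 1;
    for (const std::string &chunk : chunks) {
        append_chunk(chunk.data(), chunk.size(), skip_bom, out);
        skip_bom = 0; // any later BOM is a zero-width no-break space; keep it
    }
    return out;
}
```

With the real class, the same loop shape would presumably pass each chunk to cgul_unicode_cxx::mbstowcs() with the same skip_bom toggle, provided chunk boundaries do not split a multi-byte character.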

§ wcstombs()

static void cgul_unicode_cxx::wcstombs	(	const cgul_wchar_t *	utf32,
		size_t	utf32_length,
		int	skip_bom,
		cgul_string_cxx &	utf8
	)

inline static

UTF-32 to UTF-8 string conversion.

This function converts the wide-character string utf32 from UTF-32 to UTF-8 and appends the result to the multi-byte string utf8.

If an error occurs, an exception is thrown. Errors can occur because utf32 is not a valid UTF-32 encoding or because memory could not be allocated for the string that holds the UTF-8 conversion. If an error occurs, the value of utf8 is undefined.

Parameters

[in]	utf32	UTF-32 string
[in]	utf32_length	length of `utf32` in wide characters
[in]	skip_bom	whether the leading byte-order mark should be skipped
[out]	utf8	conversion of `utf32` to UTF-8

References cgul_unicode__wcstombs(), and cgul_string_cxx::get_obj().

§ mbstohcs()

static void cgul_unicode_cxx::mbstohcs	(	const char *	utf8,
		size_t	utf8_length,
		int	skip_bom,
		cgul_hstring_cxx &	utf16
	)

inline static

UTF-8 to UTF-16 string conversion.

This function converts the UTF-8 multi-byte string utf8 to a UTF-16 wide-character string and appends the result to utf16. If a leading byte-order mark (BOM) is present and skip_bom is set, the BOM is automatically removed.

Note that BOM characters can be embedded inside UTF-8 strings. Only the leading BOM is removed if skip_bom is set. This is necessary because any other BOM is supposed to be interpreted as a "zero-width no-break space" whose only purpose is to prevent the formation of ligatures. So, if you are converting an entire file by calling this function more than once, set skip_bom on the first call and clear it on every subsequent call.

To guard against reading off the end of utf8, its length must be given in utf8_length. The value of utf8_length is the total number of bytes in the utf8 string.

If an error occurs, an exception is thrown. Errors can occur because utf8 is not a valid UTF-8 encoding or because memory could not be allocated for the string that holds the UTF-16 conversion. If an error occurs, the value of utf16 is undefined.

Parameters

[in]	utf8	UTF-8 string
[in]	utf8_length	length of `utf8` in bytes
[in]	skip_bom	whether the leading byte-order mark should be skipped
[out]	utf16	conversion of `utf8` to UTF-16

References cgul_unicode__mbstohcs(), and cgul_hstring_cxx::get_obj().

§ hcstombs()

static void cgul_unicode_cxx::hcstombs	(	const cgul_hchar_t *	utf16,
		size_t	utf16_length,
		int	skip_bom,
		cgul_string_cxx &	utf8
	)

inline static

UTF-16 to UTF-8 string conversion.

This function converts the UTF-16 wide-character string utf16 to a UTF-8 multi-byte string and appends the result to utf8. If a leading byte-order mark (BOM) is present and skip_bom is set, the BOM is automatically removed.

Note that BOM characters can be embedded inside UTF-16 strings. Only the leading BOM is removed if skip_bom is set. This is necessary because any other BOM is supposed to be interpreted as a "zero-width no-break space" whose only purpose is to prevent the formation of ligatures. So, if you are converting an entire file by calling this function more than once, set skip_bom on the first call and clear it on every subsequent call.

To guard against reading off the end of utf16, its length must be given in utf16_length. The value of utf16_length is the total number of cgul_hchar_t elements in the utf16 string; this value can be obtained by calling cgul_hchar_cxx::hcslen() or cgul_hstring_cxx::get_length(). Basically, surrogate pairs count as two; everything else counts as one.

If an error occurs, an exception is thrown. Errors can occur because utf16 is not a valid UTF-16 encoding or because memory could not be allocated for the string that holds the UTF-8 conversion. If an error occurs, the value of utf8 is undefined.

Parameters

[in]	utf16	UTF-16 string
[in]	utf16_length	length of `utf16` in `cgul_hchar_t` elements
[in]	skip_bom	whether the leading byte-order mark should be skipped
[out]	utf8	conversion of `utf16` to UTF-8

References cgul_unicode__hcstombs(), and cgul_string_cxx::get_obj().

§ hcstowcs()

static void cgul_unicode_cxx::hcstowcs	(	const cgul_hchar_t *	utf16,
		size_t	utf16_length,
		int	skip_bom,
		cgul_wstring_cxx &	utf32
	)

inline static

UTF-16 to UTF-32 string conversion.

This function converts the UTF-16 wide-character string utf16 to a UTF-32 wide-character string and appends the result to utf32. If a leading byte-order mark (BOM) is present and skip_bom is set, the BOM is automatically removed.

Note that BOM characters can be embedded inside UTF-16 strings. Only the leading BOM is removed if skip_bom is set. This is necessary because any other BOM is supposed to be interpreted as a "zero-width no-break space" whose only purpose is to prevent the formation of ligatures. So, if you are converting an entire file by calling this function more than once, set skip_bom on the first call and clear it on every subsequent call.

To guard against reading off the end of utf16, its length must be given in utf16_length. The value of utf16_length is the total number of cgul_hchar_t elements in the utf16 string; this value can be obtained by calling cgul_hchar_cxx::hcslen() or cgul_hstring_cxx::get_length(). Basically, surrogate pairs count as two; everything else counts as one.

If an error occurs, an exception is thrown. Errors can occur because utf16 is not a valid UTF-16 encoding or because memory could not be allocated for the string that holds the UTF-32 conversion. If an error occurs, the value of utf32 is undefined.

Parameters

[in]	utf16	UTF-16 string
[in]	utf16_length	length of `utf16` in `cgul_hchar_t` elements
[in]	skip_bom	whether the leading byte-order mark should be skipped
[out]	utf32	conversion of `utf16` to UTF-32

References cgul_unicode__hcstowcs(), and cgul_wstring_cxx::get_obj().

§ wcstohcs()

static void cgul_unicode_cxx::wcstohcs	(	const cgul_wchar_t *	utf32,
		size_t	utf32_length,
		int	skip_bom,
		cgul_hstring_cxx &	utf16
	)

inline static

UTF-32 to UTF-16 string conversion.

This function converts the wide-character string utf32 from UTF-32 to UTF-16 and appends the result to the wide-character string utf16.

If an error occurs, an exception is thrown. Errors can occur because utf32 is not a valid UTF-32 encoding or because memory could not be allocated for the string that holds the UTF-16 conversion. If an error occurs, the value of utf16 is undefined.

Parameters

[in]	utf32	UTF-32 string
[in]	utf32_length	length of `utf32` in wide characters
[in]	skip_bom	whether the leading byte-order mark should be skipped
[out]	utf16	conversion of `utf32` to UTF-16

References cgul_unicode__wcstohcs(), cgul_wchar_t, and cgul_hstring_cxx::get_obj().

§ mbtowc()

static cgul_wchar_t cgul_unicode_cxx::mbtowc	(	const char *	utf8,
		size_t	utf8_length,
		size_t *	index
	)

inline static

UTF-8 to UTF-32 character conversion.

This method converts exactly one UTF-8 multi-byte character to a UTF-32 wide character and returns it. The multiple bytes that comprise the UTF-8 character start at utf8[*index].

In order to make iterating over an entire UTF-8 string easy, when this method returns, the value of *index is updated to point to the beginning of the next character.

To guard against reading off the end of utf8, its length must be given in utf8_length. The value of utf8_length is the total number of bytes in the utf8 string. It is not the number of bytes remaining in the string.

NOTE: This method throws an exception if utf8_length is 0. Thus, the caller must verify that utf8 is not an empty string before calling this method!

If an error occurs, CGUL_WCHAR__NUL is returned, and an exception is thrown.

Parameters

[in]	utf8	UTF-8 string
[in]	utf8_length	length of `utf8` in bytes
[in,out]	index	index into `utf8`

Returns: next UTF-32 character

References cgul_unicode__mbtowc(), CGUL_WCHAR__NUL, and cgul_wchar_t.
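
The index-based iteration described above can be sketched with a self-contained decoder. `decode_one` and `decode_all` are hypothetical stand-ins that mimic the mbtowc() calling convention (total length plus an in/out index); they are not the cgul code, and this simplified decoder does not reject overlong encodings or encoded surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in with the same calling convention as mbtowc():
// decodes the character that starts at utf8[*index], advances *index to the
// start of the next character, and throws on malformed input.
static std::uint32_t decode_one(const char *utf8, std::size_t utf8_length,
                                std::size_t *index)
{
    if (*index >= utf8_length)
        throw std::runtime_error("index past end of string");
    const unsigned char lead = static_cast<unsigned char>(utf8[*index]);
    std::size_t extra = 0;   // number of continuation bytes expected
    std::uint32_t cp = 0;
    if (lead < 0x80)                { extra = 0; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else throw std::runtime_error("invalid lead byte");
    // utf8_length is the TOTAL length, so the bound is computed from *index.
    if (extra > utf8_length - *index - 1)
        throw std::runtime_error("truncated sequence");
    for (std::size_t i = 1; i <= extra; ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[*index + i]);
        if ((c & 0xC0) != 0x80)
            throw std::runtime_error("invalid continuation byte");
        cp = (cp << 6) | (c & 0x3F);
    }
    *index += extra + 1;
    return cp;
}

// Walk the whole string. The loop condition also covers the empty-string
// case, so decode_one is never called with a length of zero.
static std::vector<std::uint32_t> decode_all(const char *utf8,
                                             std::size_t utf8_length)
{
    std::vector<std::uint32_t> out;
    std::size_t index = 0;
    while (index < utf8_length)
        out.push_back(decode_one(utf8, utf8_length, &index));
    return out;
}
```

The same loop should carry over to cgul_unicode_cxx::mbtowc(): keep utf8_length fixed at the total byte count and let the method advance the index.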

§ wctomb()

static size_t cgul_unicode_cxx::wctomb	(	const cgul_wchar_t	utf32,
		char *	buf
	)

inline static

UTF-32 to UTF-8 character conversion.

This method converts exactly one UTF-32 wide character to a UTF-8 multi-byte character sequence and returns it in buf as a NUL-terminated C-style string. The UTF-32 character to convert is passed in as utf32. As an added convenience, this method returns the number of bytes written to buf, excluding the NUL terminator.

The caller is responsible for making sure that buf points to a buffer that can hold at least CGUL_WCHAR__MB_LEN_MAX + 1 bytes.

If an error occurs, zero is returned to the caller, an empty string is returned in buf, and an exception is thrown. Errors can occur, for example, if utf32 is larger than 0x10ffff, which is the largest valid Unicode code point.

Parameters

[in]	utf32	UTF-32 wide character
[out]	buf	buffer that holds the resulting UTF-8 multi-byte sequence

Returns: number of bytes written to buf

References cgul_unicode__wctomb().
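
The buffer contract above (room for the longest sequence plus a NUL, byte count returned without the NUL) can be sketched as follows. `encode_one`, `to_utf8`, and `kMaxUtf8Bytes` are hypothetical names; the bound of four bytes is the UTF-8 maximum for code points up to 0x10ffff and is an assumption here, not the documented value of CGUL_WCHAR__MB_LEN_MAX. Rejecting surrogates is also a choice of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Assumed bound for this sketch only: UTF-8 needs at most 4 bytes per code
// point. The real CGUL_WCHAR__MB_LEN_MAX value is not taken from the header.
static const std::size_t kMaxUtf8Bytes = 4;

// Hypothetical stand-in for wctomb(): writes the UTF-8 form of cp into buf,
// NUL-terminates it, and returns the byte count excluding the NUL.
static std::size_t encode_one(std::uint32_t cp, char *buf)
{
    buf[0] = '\0'; // leave an empty string behind if we throw
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::range_error("not a Unicode scalar value");
    std::size_t n = 0;
    if (cp < 0x80) {
        buf[n++] = static_cast<char>(cp);
    } else if (cp < 0x800) {
        buf[n++] = static_cast<char>(0xC0 | (cp >> 6));
        buf[n++] = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        buf[n++] = static_cast<char>(0xE0 | (cp >> 12));
        buf[n++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        buf[n++] = static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        buf[n++] = static_cast<char>(0xF0 | (cp >> 18));
        buf[n++] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        buf[n++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        buf[n++] = static_cast<char>(0x80 | (cp & 0x3F));
    }
    buf[n] = '\0';
    return n;
}

// Caller side: the buffer is sized for the worst case plus the terminator.
static std::string to_utf8(std::uint32_t cp)
{
    char buf[kMaxUtf8Bytes + 1];
    const std::size_t n = encode_one(cp, buf);
    return std::string(buf, n);
}
```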

§ mbtohc()

static size_t cgul_unicode_cxx::mbtohc	(	const char *	utf8,
		size_t	utf8_length,
		size_t *	index,
		cgul_hchar_t *	buf
	)

inline static

UTF-8 to UTF-16 character conversion.

This method converts exactly one UTF-8 multi-byte character to a UTF-16 wide character sequence and returns it in buf as a cgul_hchar_t string that is NUL terminated. As an added convenience, this method returns the number of cgul_hchar_t elements written to buf excluding the NUL terminator.

The multiple bytes that comprise the UTF-8 character start at utf8[*index]. In order to make iterating over an entire UTF-8 string easy, when this method returns, the value of *index is updated to point to the beginning of the next character.

To guard against reading off the end of utf8, its length must be given in utf8_length. The value of utf8_length is the total number of bytes in the utf8 string. It is not the number of bytes remaining in the string.

The caller is responsible for making sure that buf points to a buffer that can hold at least three cgul_hchar_t characters. This accounts for the worst case which requires a surrogate pair and a NUL terminator.

If an error occurs, zero is returned to the caller, an empty string is returned in buf, and an exception is thrown.

NOTE: This method throws an exception if utf8_length is 0. Thus, the caller must verify that utf8 is not an empty string before calling this method!

Parameters

[in]	utf8	UTF-8 string
[in]	utf8_length	length of `utf8` in bytes
[in,out]	index	index into `utf8`
[out]	buf	buffer that holds the resulting UTF-16 wide character sequence

Returns: number of cgul_hchar_t elements written to buf

References cgul_unicode__mbtohc().

§ hctomb()

static size_t cgul_unicode_cxx::hctomb	(	const cgul_hchar_t *	utf16,
		size_t	utf16_length,
		size_t *	index,
		char *	buf
	)

inline static

UTF-16 to UTF-8 character conversion.

This method converts exactly one UTF-16 wide character sequence to a UTF-8 multi-byte character sequence and returns it in buf as a C-style string that is NUL terminated. As an added convenience, this method returns the number of bytes written to buf excluding the NUL terminator.

The cgul_hchar_t elements that comprise the UTF-16 wide character sequence start at utf16[*index]. In order to make iterating over an entire UTF-16 string easy, when this method returns, the value of *index is updated to point to the beginning of the next character.

To guard against reading off the end of utf16, its length must be given in utf16_length. The value of utf16_length is the total number of cgul_hchar_t elements in the utf16 string; this value can be obtained by calling cgul_hchar_cxx::hcslen() or cgul_hstring_cxx::get_length(). Basically, surrogate pairs count as two; everything else counts as one. Note that the value is not the number of elements remaining in the string; it is the total number of elements in the string.

The caller is responsible for making sure that buf points to a buffer that can hold at least CGUL_WCHAR__MB_LEN_MAX + 1 bytes.

If an error occurs, zero is returned to the caller, an empty string is returned in buf, and an exception is thrown.

NOTE: This method throws an exception if utf16_length is 0. Thus, the caller must verify that utf16 is not an empty string before calling this method!

Parameters

[in]	utf16	UTF-16 string
[in]	utf16_length	length of `utf16` in `cgul_hchar_t` elements
[in,out]	index	index into `utf16`
[out]	buf	buffer that holds the resulting UTF-8 multi-byte sequence

Returns: number of bytes written to buf

References cgul_unicode__hctomb(), and cgul_wchar_t.

§ hctowc()

static cgul_wchar_t cgul_unicode_cxx::hctowc	(	const cgul_hchar_t *	utf16,
		size_t	utf16_length,
		size_t *	index
	)

inline static

UTF-16 to UTF-32 character conversion.

This method converts exactly one UTF-16 wide character sequence to a UTF-32 wide character and returns it. The cgul_hchar_t elements that comprise the UTF-16 character start at utf16[*index].

In order to make iterating over an entire UTF-16 string easy, when this method returns, the value of *index is updated to point to the beginning of the next character.

To guard against reading off the end of utf16, its length must be given in utf16_length. The value of utf16_length is the total number of cgul_hchar_t elements in the utf16 string; this value can be obtained by calling cgul_hchar_cxx::hcslen() or cgul_hstring_cxx::get_length(). Basically, surrogate pairs count as two; everything else counts as one. Note that the value is not the number of elements remaining in the string; it is the total number of elements in the string.

NOTE: This method throws an exception if utf16_length is 0. Thus, the caller must verify that utf16 is not an empty string before calling this method!

If an error occurs, CGUL_WCHAR__NUL is returned, and an exception is thrown.

Parameters

[in]	utf16	UTF-16 string
[in]	utf16_length	length of `utf16` in `cgul_hchar_t` elements
[in,out]	index	index into `utf16`

Returns: next UTF-32 character

References cgul_unicode__hctowc(), CGUL_WCHAR__NUL, and cgul_wchar_t.

§ wctohc()

static size_t cgul_unicode_cxx::wctohc	(	const cgul_wchar_t	utf32,
		cgul_hchar_t *	buf
	)

inline static

UTF-32 to UTF-16 character conversion.

This method converts exactly one UTF-32 wide character to a UTF-16 wide character sequence and returns it in buf as a NUL-terminated cgul_hchar_t string. The UTF-32 character to convert is passed in as utf32. As an added convenience, this method returns the number of cgul_hchar_t elements written to buf, excluding the NUL terminator.

The caller is responsible for making sure that buf points to a buffer that can hold at least three cgul_hchar_t characters. This accounts for the worst case, which requires a surrogate pair and a NUL terminator.

If an error occurs, zero is returned to the caller, an empty string is returned in buf, and an exception is thrown. Errors can occur, for example, if utf32 is larger than 0x10ffff, which is the largest valid Unicode code point.

Parameters

[in]	utf32	UTF-32 wide character
[out]	buf	buffer that holds the resulting UTF-16 wide character sequence

Returns: number of cgul_hchar_t elements written to buf

References cgul_unicode__wctohc().
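
The three-element worst case mentioned above (surrogate pair plus NUL) follows directly from how UTF-16 encodes code points above 0xffff. A minimal sketch, assuming a 16-bit unsigned code unit; `sketch_hchar_t` and `encode_utf16` are hypothetical names, and the actual cgul_hchar_t typedef is not shown in this document.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Assumption for this sketch: a UTF-16 code unit is a 16-bit unsigned
// integer. This is a stand-in for cgul_hchar_t, not its real definition.
typedef std::uint16_t sketch_hchar_t;

// Hypothetical stand-in for wctohc(): writes one or two code units plus a
// NUL terminator into buf and returns the count excluding the terminator.
// buf must have room for three elements.
static std::size_t encode_utf16(std::uint32_t cp, sketch_hchar_t *buf)
{
    buf[0] = 0; // leave an empty string behind if we throw
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::range_error("not a Unicode scalar value");
    std::size_t n = 0;
    if (cp < 0x10000) {
        buf[n++] = static_cast<sketch_hchar_t>(cp);
    } else {
        // Split the 20 bits above 0x10000 into a high and a low surrogate.
        const std::uint32_t v = cp - 0x10000;
        buf[n++] = static_cast<sketch_hchar_t>(0xD800 + (v >> 10));
        buf[n++] = static_cast<sketch_hchar_t>(0xDC00 + (v & 0x3FF));
    }
    buf[n] = 0;
    return n;
}
```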

§ get_char_count()

static size_t cgul_unicode_cxx::get_char_count	(	const char *	utf8,
		size_t	utf8_length,
		int	skip_bom
	)

inline static

This method counts the number of Unicode characters (not bytes) in the UTF-8 string utf8 and returns the result. If an error occurs decoding utf8, 0 is returned, and an exception is thrown.

If a leading byte-order mark (BOM) is present and skip_bom is set, the leading BOM will not be included in the character count.

To guard against reading off the end of utf8, its length must be given in utf8_length. The value of utf8_length is the total number of bytes in the utf8 string.

Parameters

[in]	utf8	UTF-8 string
[in]	utf8_length	length of `utf8` in bytes
[in]	skip_bom	whether to skip the byte-order mark

Returns: number of Unicode characters in utf8

References cgul_unicode__get_char_count().

§ get_hchar_count()

static size_t cgul_unicode_cxx::get_hchar_count	(	const cgul_hchar_t *	utf16,
		size_t	utf16_length,
		int	skip_bom
	)

inline static

This method counts the number of Unicode characters (not bytes and not cgul_hchar_t elements) in the UTF-16 string utf16 and returns the result. If an error occurs decoding utf16, 0 is returned, and an exception is thrown.

If a leading byte-order mark (BOM) is present and skip_bom is set, the leading BOM will not be included in the character count.

To guard against reading off the end of utf16, its length must be given in utf16_length. The value of utf16_length is the total number of cgul_hchar_t elements in the utf16 string; this value can be obtained by calling cgul_hchar_cxx::hcslen() or cgul_hstring_cxx::get_length(). Basically, surrogate pairs count as two; everything else counts as one.

Parameters

[in]	utf16	UTF-16 string
[in]	utf16_length	length of `utf16` in `cgul_hchar_t` elements
[in]	skip_bom	whether to skip the byte-order mark

Returns: number of Unicode characters in utf16

References cgul_unicode__get_hchar_count().
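
The difference between the element count passed in utf16_length and the character count returned can be sketched as follows. `count_utf16_chars` is a hypothetical stand-in that assumes 16-bit code units in native byte order; it is not the cgul implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for get_hchar_count(): counts Unicode characters in
// a UTF-16 buffer of utf16_length code units. A surrogate pair occupies two
// elements but counts as one character. A leading BOM (0xFEFF) is excluded
// from the count only when skip_bom is non-zero.
static std::size_t count_utf16_chars(const std::uint16_t *utf16,
                                     std::size_t utf16_length, int skip_bom)
{
    std::size_t i = 0;
    std::size_t count = 0;
    if (skip_bom && utf16_length > 0 && utf16[0] == 0xFEFF)
        i = 1;
    while (i < utf16_length) {
        const std::uint16_t u = utf16[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // High surrogate: a low surrogate must follow within bounds.
            if (i + 1 >= utf16_length ||
                utf16[i + 1] < 0xDC00 || utf16[i + 1] > 0xDFFF)
                throw std::runtime_error("unpaired high surrogate");
            i += 2;
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            throw std::runtime_error("unpaired low surrogate");
        } else {
            i += 1;
        }
        ++count;
    }
    return count;
}
```

For a buffer holding a BOM, the letter A, and one surrogate pair, utf16_length is 4 while the character count is 2 with skip_bom set and 3 with it cleared.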

§ get_wchar_count()

static size_t cgul_unicode_cxx::get_wchar_count	(	const cgul_wchar_t *	utf32,
		int	skip_bom
	)

inline static

This method counts the number of Unicode characters (not bytes) in the UTF-32 string utf32 and returns the result. If an error occurs decoding utf32, 0 is returned, and an exception is thrown.

If a leading byte-order mark (BOM) is present and skip_bom is set, the leading BOM will not be included in the character count.

Parameters

[in]	utf32	UTF-32 string
[in]	skip_bom	whether to skip the byte-order mark

Returns: number of Unicode characters in utf32

References cgul_unicode__get_wchar_count().

The documentation for this class was generated from the following file:

cgul_unicode_cxx.h
