UTF-8 String Handling API

API Reference


This module implements safe, easy-to-use string handling functions for null-terminated strings with UTF-8 encoding.

UTF-8 is a variable-length character encoding that supports every character in the Unicode character set. UTF-8 has become the dominant character encoding because it is self-synchronizing, compatible with ASCII, and free of the byte-order (endianness) issues that affect other encodings.

UTF-8 Encoding

UTF-8 uses between one and four bytes to encode a character as illustrated in the following table.

Byte 1     Byte 2     Byte 3     Byte 4
0xxxxxxx
110xxxxx   10xxxxxx
1110xxxx   10xxxxxx   10xxxxxx
11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

Single-byte codes are used only for the ASCII values 0 through 127. In this range, UTF-8 has the same binary value as ASCII, so every ASCII string is also a valid UTF-8 string.

Character codes larger than 127 have a multi-byte encoding consisting of a leading byte and one or more continuation bytes.

The leading byte has two or more high-order 1 bits followed by a 0. The number of 1 bits gives the number of bytes in the character, so the length can be determined without examining the continuation bytes.

The continuation bytes have '10' in the high-order position.

Single bytes, leading bytes and continuation bytes never share the same values. This makes UTF-8 strings self-synchronizing: the start of a character can be found by backing up at most three bytes.
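The self-synchronizing property can be illustrated with a short sketch. The helper names below (sketch_IsContinuation, sketch_FindCharStart) are hypothetical and are not part of the le_utf8 API; the code assumes the input is well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

// Returns non-zero if the byte has the form 10xxxxxx (a continuation byte).
int sketch_IsContinuation(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

// Hypothetical helper (not part of the le_utf8 API): given a byte offset into a
// well-formed UTF-8 string, return the offset of the first byte of the
// character that contains that offset.
size_t sketch_FindCharStart(const char* str, size_t offset)
{
    assert(str != NULL);

    // For well-formed input this loop runs at most three times, because a
    // character has at most three continuation bytes.
    while ((offset > 0) && sketch_IsContinuation((unsigned char)str[offset]))
    {
        offset--;
    }

    return offset;
}
```

For example, in the string "a€b" the euro sign occupies byte offsets 1 through 3, so calling the helper with offset 2 or 3 yields 1.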

le_utf8_EncodeUnicodeCodePoint() encodes any Unicode code point into the sequence of bytes that represents its UTF-8 encoding. le_utf8_DecodeUnicodeCodePoint() performs the inverse operation: it converts a UTF-8 encoded character into the corresponding Unicode code point.
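To make the bit layout in the table concrete, here is a minimal standalone sketch of encoding and decoding a single code point. It is not the Legato implementation, and the names sketch_Encode and sketch_Decode, their signatures and their return conventions are assumptions made for illustration only. The decoder checks byte structure but does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: encode one code point into out (no null terminator).
// Returns the number of bytes written, or 0 if the code point is a UTF-16
// surrogate or is above U+10FFFF.
size_t sketch_Encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80)
    {
        out[0] = (unsigned char)cp;                             // 0xxxxxxx
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));             // 110xxxxx
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));           // 10xxxxxx
        return 2;
    }
    if (cp < 0x10000)
    {
        if ((cp >= 0xD800) && (cp <= 0xDFFF))
        {
            return 0;   // Surrogates are not encodable in UTF-8.
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));            // 1110xxxx
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF)
    {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));            // 11110xxx
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

// Hypothetical helper: decode the character at the start of a null-terminated
// byte string. Returns the number of bytes consumed, or 0 if the leading byte
// or a continuation byte is malformed.
size_t sketch_Decode(const unsigned char* in, uint32_t* cpPtr)
{
    assert((in != NULL) && (cpPtr != NULL));

    size_t len;
    uint32_t cp;

    if (in[0] < 0x80)                { len = 1; cp = in[0]; }
    else if ((in[0] & 0xE0) == 0xC0) { len = 2; cp = in[0] & 0x1F; }
    else if ((in[0] & 0xF0) == 0xE0) { len = 3; cp = in[0] & 0x0F; }
    else if ((in[0] & 0xF8) == 0xF0) { len = 4; cp = in[0] & 0x07; }
    else                             { return 0; }

    for (size_t i = 1; i < len; i++)
    {
        // A null terminator fails this test, so the loop never reads past it.
        if ((in[i] & 0xC0) != 0x80)
        {
            return 0;
        }
        cp = (cp << 6) | (in[i] & 0x3F);
    }

    *cpPtr = cp;
    return len;
}
```

With this sketch, U+20AC (the euro sign) encodes to the three bytes 0xE2 0x82 0xAC, and decoding those bytes gives back 0x20AC.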

Copy and Append

le_utf8_Copy() copies a string to a specified buffer location.

le_utf8_Append() appends a string to the end of another string by copying the source string to the destination string's buffer starting at the null-terminator of the destination string.

The le_utf8_CopyUpToSubStr() function is like le_utf8_Copy() except it copies only up to, but not including, a specified substring.
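The idea behind copying up to a substring can be sketched with the standard C library. This is a simplified illustration, not the Legato implementation: the name sketch_CopyUpToSubStr, its parameter order and its integer return codes are assumptions, and unlike the real functions it refuses to copy anything when the result does not fit instead of truncating. A plain byte search is sufficient because a match of one valid UTF-8 string inside another always begins on a character boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: copy src into dest up to, but not including, the first
// occurrence of subStr. If subStr does not occur, all of src is copied.
// Returns 0 on success, or -1 (leaving dest empty) if the result plus its
// null terminator would not fit in destSize bytes.
int sketch_CopyUpToSubStr(char* dest, const char* src, size_t destSize, const char* subStr)
{
    assert((dest != NULL) && (src != NULL) && (subStr != NULL) && (destSize > 0));

    const char* match = strstr(src, subStr);
    size_t numBytes = (match != NULL) ? (size_t)(match - src) : strlen(src);

    if (numBytes >= destSize)
    {
        dest[0] = '\0';
        return -1;
    }

    memcpy(dest, src, numBytes);
    dest[numBytes] = '\0';
    return 0;
}
```

For instance, copying "key=value" up to "=" leaves "key" in the destination buffer.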

Truncation

Because UTF-8 is a variable-length encoding, the number of characters in a string is not necessarily the same as the number of bytes in the string. When using functions like le_utf8_Copy() and le_utf8_Append(), the size of the destination buffer, in bytes, must be provided to avoid buffer overruns.

If the copied string is truncated because of limited space in the destination buffer, the destination buffer may not be completely filled. This can occur when the last character to copy is more than one byte long and does not fit within the remaining space in the buffer.

The character is not copied and a null-terminator is added. Even though we have not filled the destination buffer, we have truncated the copied string. Essentially, functions like le_utf8_Copy() and le_utf8_Append() only copy complete characters, not partial characters.
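The truncation rule described above can be sketched as follows. This is an illustration only, not the Legato implementation; sketch_Copy and its return convention (1 for truncated, 0 otherwise) are assumptions. When the source does not fit, the sketch backs the cut point up past any continuation bytes so that only complete characters are copied.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: copy src into dest, copying only complete characters.
// Returns 1 if the string was truncated, 0 otherwise. If numBytesPtr is not
// NULL, it receives the number of bytes copied (excluding the terminator).
int sketch_Copy(char* dest, const char* src, size_t destSize, size_t* numBytesPtr)
{
    assert((dest != NULL) && (src != NULL) && (destSize > 0));

    size_t numBytes = strlen(src);
    int truncated = 0;

    if (numBytes >= destSize)
    {
        truncated = 1;
        numBytes = destSize - 1;    // Leave room for the null terminator.

        // If the first byte that would be dropped is a continuation byte
        // (10xxxxxx), the cut falls inside a character. Back up to that
        // character's leading byte so the partial character is dropped.
        while ((numBytes > 0) && (((unsigned char)src[numBytes] & 0xC0) == 0x80))
        {
            numBytes--;
        }
    }

    memcpy(dest, src, numBytes);
    dest[numBytes] = '\0';

    if (numBytesPtr != NULL)
    {
        *numBytesPtr = numBytes;
    }

    return truncated;
}
```

Copying the five-byte string "ab€" into a five-byte buffer with this sketch stores only "ab" and a null terminator: the three-byte euro sign does not fit alongside the terminator, so two bytes of the buffer stay unused.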

For le_utf8_Copy(), the number of bytes actually copied is returned in the numBytesPtr parameter. This parameter can be set to NULL if the number of bytes copied is not needed. le_utf8_Append() and le_utf8_CopyUpToSubStr() work similarly.

// In this code sample, we need the number of bytes actually copied.
// destStr is assumed to be a char array, so sizeof(destStr) is its size in bytes.
size_t numBytes;

if (le_utf8_Copy(destStr, srcStr, sizeof(destStr), &numBytes) == LE_OVERFLOW)
{
    LE_WARN("'%s' was truncated when copied. Only %zu bytes were copied.", srcStr, numBytes);
}

// In this code sample, we don't care about the number of bytes copied:
LE_ASSERT(le_utf8_Copy(destStr, srcStr, sizeof(destStr), NULL) != LE_OVERFLOW);

String Lengths

String length may mean either the number of characters in the string or the number of bytes in the string. These two meanings are often used interchangeably because in ASCII-only encodings the number of characters in a string is equal to the number of bytes in a string. But this is not necessarily true with variable length encodings such as UTF-8. Legato provides both a le_utf8_NumChars() function and a le_utf8_NumBytes() function.

le_utf8_NumBytes() must be used when determining the memory size of a string. le_utf8_NumChars() is useful for counting the number of characters in a string (e.g., for display purposes).
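The difference between the two lengths can be shown with a small standalone sketch. The helper sketch_NumChars is hypothetical and is not the Legato implementation; it assumes well-formed UTF-8 and counts every byte that is not a continuation byte, while strlen() gives the byte count.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: count the characters in a well-formed UTF-8 string.
// Each character contributes exactly one byte that is not a continuation
// byte (10xxxxxx), so counting those bytes counts the characters.
size_t sketch_NumChars(const char* str)
{
    assert(str != NULL);

    size_t count = 0;

    for (const unsigned char* p = (const unsigned char*)str; *p != '\0'; p++)
    {
        if ((*p & 0xC0) != 0x80)
        {
            count++;
        }
    }

    return count;
}
```

For the string "héllo", where "é" is encoded as the two bytes 0xC3 0xA9, strlen() reports 6 bytes while the sketch reports 5 characters.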

Character Lengths

The function le_utf8_NumBytesInChar() can be used to determine the number of bytes in a character by looking at its first byte. This is handy when reading a UTF-8 string from an input stream. When the first byte is read, it can be passed to le_utf8_NumBytesInChar() to determine how many more bytes need to be read to get the rest of the character.
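The stream-reading pattern described above might look like the following sketch. Both helpers are hypothetical stand-ins written for this illustration (they are not the Legato functions), and returning 0 for an invalid first byte is an assumption. The reader does not validate the continuation bytes it reads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

// Hypothetical helper: number of bytes in the character that starts with
// firstByte, or 0 if firstByte cannot start a character.
size_t sketch_NumBytesInChar(unsigned char firstByte)
{
    if ((firstByte & 0x80) == 0x00) { return 1; }   // 0xxxxxxx
    if ((firstByte & 0xE0) == 0xC0) { return 2; }   // 110xxxxx
    if ((firstByte & 0xF0) == 0xE0) { return 3; }   // 1110xxxx
    if ((firstByte & 0xF8) == 0xF0) { return 4; }   // 11110xxx
    return 0;                                       // continuation or invalid byte
}

// Hypothetical helper: read one character from stream into buf (at least
// 5 bytes) and null-terminate it. Returns the number of bytes read, or 0 on
// end of file, read error, or an invalid first byte.
size_t sketch_ReadChar(FILE* stream, char buf[5])
{
    assert((stream != NULL) && (buf != NULL));

    int first = fgetc(stream);
    if (first == EOF)
    {
        return 0;
    }

    // The first byte alone tells us how many more bytes to read.
    size_t len = sketch_NumBytesInChar((unsigned char)first);
    if (len == 0)
    {
        return 0;
    }

    buf[0] = (char)first;
    if (fread(buf + 1, 1, len - 1, stream) != (len - 1))
    {
        return 0;
    }

    buf[len] = '\0';
    return len;
}
```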

Checking UTF-8 Format

As can be seen in the UTF-8 Encoding section, UTF-8 strings have a specific byte structure. The le_utf8_IsFormatCorrect() function can be used to check if a string conforms to UTF-8 encoding. Not all valid UTF-8 characters are valid for a given character set; le_utf8_IsFormatCorrect() does not check for this.
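A structural check of this kind can be sketched as shown below. This is a hypothetical illustration and may not match what le_utf8_IsFormatCorrect() actually verifies: it only confirms that each leading byte is followed by the right number of continuation bytes, and it does not reject overlong encodings or surrogate code points.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical helper: return true if every character in the null-terminated
// string has a valid leading byte followed by the expected number of
// continuation bytes.
bool sketch_IsFormatCorrect(const char* str)
{
    assert(str != NULL);

    const unsigned char* p = (const unsigned char*)str;

    while (*p != '\0')
    {
        size_t len;

        if ((*p & 0x80) == 0x00)      { len = 1; }
        else if ((*p & 0xE0) == 0xC0) { len = 2; }
        else if ((*p & 0xF0) == 0xE0) { len = 3; }
        else if ((*p & 0xF8) == 0xF0) { len = 4; }
        else                          { return false; }  // stray continuation or invalid byte

        for (size_t i = 1; i < len; i++)
        {
            // This also catches a string that ends in the middle of a
            // character, because the null terminator is not 10xxxxxx.
            if ((p[i] & 0xC0) != 0x80)
            {
                return false;
            }
        }

        p += len;
    }

    return true;
}
```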

String Parsing

To assist with converting integer values from UTF-8 strings to binary numerical values, le_utf8_ParseInt() is provided.
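As a rough illustration of the checks such a parser needs, here is a standalone sketch built on the standard strtol() function. It is not the Legato implementation: the name sketch_ParseInt, its signature and its integer return codes are assumptions. Because it passes base 0 to strtol(), it also accepts leading whitespace and the 0x and 0 prefixes for hexadecimal and octal.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

// Hypothetical helper: parse a whole string as an int.
// Returns 0 on success, -1 if the string is empty or contains characters
// that are not part of the number, and -2 if the value does not fit in an int.
int sketch_ParseInt(int* valuePtr, const char* str)
{
    assert((valuePtr != NULL) && (str != NULL));

    char* end = NULL;

    errno = 0;
    long value = strtol(str, &end, 0);

    // No digits consumed, or trailing characters after the number.
    if ((end == str) || (*end != '\0'))
    {
        return -1;
    }

    // Out of range for long (ERANGE) or for int.
    if ((errno == ERANGE) || (value < INT_MIN) || (value > INT_MAX))
    {
        return -2;
    }

    *valuePtr = (int)value;
    return 0;
}
```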

More parsing functions may be added as required in the future.