UTF-8 String Handling API
This module implements safe, easy-to-use string handling functions for null-terminated strings with UTF-8 encoding.
UTF-8 is a variable-length character encoding that supports every character in the Unicode character set. UTF-8 has become the dominant character encoding because it is self-synchronizing, compatible with ASCII, and avoids the endian issues that other encodings face.
UTF-8 Encoding
UTF-8 uses between one and four bytes to encode a character as illustrated in the following table.
Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|
0xxxxxxx | | | |
110xxxxx | 10xxxxxx | | |
1110xxxx | 10xxxxxx | 10xxxxxx | |
11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
Single-byte codes are used only for the ASCII values 0 through 127. In this case, UTF-8 has the same binary value as ASCII, so every ASCII string is also a valid UTF-8 string.
Character codes larger than 127 have a multi-byte encoding consisting of a leading byte and one or more continuation bytes.
The leading byte has two or more high-order 1's followed by a 0. The count of those 1's gives the number of bytes in the character, so the length can be determined without examining the continuation bytes.
The continuation bytes have '10' in the high-order position.
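The byte layouts described above can be sketched as a small encoder. This is an illustration only, not part of this API: the name `utf8_encode` is hypothetical, and the rejection of surrogates (U+D800 to U+DFFF) and of values above U+10FFFF follows the Unicode rules rather than anything stated in this page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Legato function): encodes one Unicode code
 * point into out[] using the 1- to 4-byte layouts from the table.
 * Returns the number of bytes written, or 0 if the code point is a
 * surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp <= 0x7F) {
        out[0] = (unsigned char)cp;                           /* 0xxxxxxx */
        return 1;
    }
    if (cp <= 0x7FF) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));           /* 110xxxxx */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));         /* 10xxxxxx */
        return 2;
    }
    if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;                                         /* surrogate */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));          /* 1110xxxx */
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));  /* 10xxxxxx */
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));         /* 10xxxxxx */
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));          /* 11110xxx */
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F)); /* 10xxxxxx */
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));  /* 10xxxxxx */
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));         /* 10xxxxxx */
        return 4;
    }
    return 0;                                                 /* out of range */
}
```

For example, U+00E9 encodes to the two bytes 0xC3 0xA9, and U+20AC to the three bytes 0xE2 0x82 0xAC.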
Single bytes, leading bytes and continuation bytes can't have the same values. This means that UTF-8 strings are self-synchronized, allowing the start of a character to be found by backing up at most three bytes.
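Self-synchronization can be illustrated with a short sketch that backs up from an arbitrary byte to the start of the character containing it. The helper name `utf8_char_start` is hypothetical and is not part of this API; the sketch assumes the string is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Legato function): returns the index of the
 * first byte of the character that contains str[index]. Continuation
 * bytes have the form 10xxxxxx, so we step back while we see one, at
 * most three times. Assumes str is valid UTF-8. */
size_t utf8_char_start(const char *str, size_t index)
{
    assert(str != NULL);

    size_t steps = 0;
    while (index > 0 && steps < 3 &&
           ((unsigned char)str[index] & 0xC0) == 0x80) {
        index--;
        steps++;
    }
    return index;
}
```

Because the test only looks at the top two bits of each byte, no decoding of the surrounding characters is needed.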
Copy and Append
le_utf8_Copy() copies a string to a specified buffer location.
le_utf8_Append() appends a string to the end of another string by copying the source string to the destination string's buffer, starting at the null-terminator of the destination string.
le_utf8_CopyUpToSubStr() is like le_utf8_Copy() except that it copies only up to, but not including, a specified substring.
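The idea of copying "up to, but not including" a substring can be sketched with the standard `strstr()` function. The helper below is hypothetical and only computes how many bytes precede the substring; it is not the library's implementation. A plain byte-wise search is safe here because a valid UTF-8 substring can only match on character boundaries in a valid UTF-8 string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a Legato function): returns the number of
 * bytes in src that come before the first occurrence of sub. If sub
 * does not occur, the whole length of src is returned. */
size_t bytes_before_substr(const char *src, const char *sub)
{
    assert(src != NULL && sub != NULL);

    const char *hit = strstr(src, sub);
    return (hit != NULL) ? (size_t)(hit - src) : strlen(src);
}
```

A caller could then copy that many bytes, subject to the truncation rules of the destination buffer.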
Truncation
Because UTF-8 is a variable-length encoding, the number of characters in a string is not necessarily the same as the number of bytes in the string. When using functions like le_utf8_Copy() and le_utf8_Append(), the size of the destination buffer, in bytes, must be provided to avoid buffer overruns.
The copied string may be truncated because of limited space in the destination buffer, and yet the destination buffer may not be completely filled. This can occur during the copy process if the last character to copy is more than one byte long and will not fit within the buffer.
In that case, the character is not copied and a null-terminator is added. Even though the destination buffer has not been filled, the copied string has been truncated. Essentially, functions like le_utf8_Copy() and le_utf8_Append() copy only complete characters, never partial characters.
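The whole-character truncation rule can be sketched as follows. This is a minimal illustration, not the library's implementation: `copy_whole_chars` and `char_len` are hypothetical names, the boolean return value stands in for whatever result code the real function uses, and the source is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Length in bytes of the character starting at p, taken from its leading
 * byte. Unrecognized leading bytes count as one byte, and the length is
 * clipped so it never runs past the null-terminator. */
static size_t char_len(const char *p)
{
    unsigned char lead = (unsigned char)p[0];
    size_t len = 1;

    if ((lead & 0xE0) == 0xC0)      len = 2;
    else if ((lead & 0xF0) == 0xE0) len = 3;
    else if ((lead & 0xF8) == 0xF0) len = 4;

    for (size_t k = 1; k < len; k++) {
        if (p[k] == '\0') return k;
    }
    return len;
}

/* Hypothetical helper (not a Legato function): copies src into dest one
 * whole character at a time. Returns true if all of src was copied and
 * false if it was truncated. The byte count, excluding the terminator,
 * is stored in *numBytesPtr when that pointer is not NULL. */
bool copy_whole_chars(char *dest, const char *src, size_t destSize,
                      size_t *numBytesPtr)
{
    assert(dest != NULL && src != NULL && destSize > 0);

    size_t copied = 0;
    bool complete = true;

    while (src[copied] != '\0') {
        size_t len = char_len(&src[copied]);
        if (copied + len > destSize - 1) {   /* keep one byte for '\0' */
            complete = false;
            break;
        }
        for (size_t k = 0; k < len; k++) {
            dest[copied + k] = src[copied + k];
        }
        copied += len;
    }
    dest[copied] = '\0';
    if (numBytesPtr != NULL) {
        *numBytesPtr = copied;
    }
    return complete;
}
```

Copying the four-byte string "a€" into a four-byte buffer with this sketch stores only "a": the three-byte "€" does not fit alongside the terminator, so two bytes of the buffer stay unused.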
For le_utf8_Copy(), the number of bytes actually copied is returned in the numBytesPtr parameter. This parameter can be set to NULL if the number of bytes copied is not needed. le_utf8_Append() and le_utf8_CopyUpToSubStr() work similarly.
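    // In this code sample, we need the number of bytes actually copied
    // (destStr is assumed to be a char array declared by the caller):
    size_t numBytes;

    if (le_utf8_Copy(destStr, srcStr, sizeof(destStr), &numBytes) == LE_OVERFLOW)
    {
        LE_WARN("'%s' was truncated when copied. Only %zu bytes were copied.", srcStr, numBytes);
    }

    // In this code sample, we don't care about the number of bytes copied:
    le_utf8_Copy(destStr, srcStr, sizeof(destStr), NULL);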
String Lengths
String length may mean either the number of characters in the string or the number of bytes in the string. These two meanings are often used interchangeably because in ASCII-only encodings the number of characters in a string is equal to the number of bytes in a string. But this is not necessarily true with variable length encodings such as UTF-8. Legato provides both a le_utf8_NumChars() function and a le_utf8_NumBytes() function.
le_utf8_NumBytes() must be used when determining the memory size of a string. le_utf8_NumChars() is useful for counting the number of characters in a string (e.g., for display purposes).
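The difference between the two lengths can be shown with a small sketch. The helpers `count_bytes` and `count_chars` are hypothetical stand-ins, not the library functions, and they assume valid UTF-8 input. Characters are counted by skipping continuation bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes before the null-terminator. */
size_t count_bytes(const char *str)
{
    assert(str != NULL);
    return strlen(str);
}

/* Hypothetical helper: number of characters, assuming valid UTF-8.
 * Every character has exactly one byte that is not a continuation
 * byte (10xxxxxx), so counting those bytes counts characters. */
size_t count_chars(const char *str)
{
    assert(str != NULL);

    size_t chars = 0;
    for (const unsigned char *p = (const unsigned char *)str; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80) {
            chars++;
        }
    }
    return chars;
}
```

For the string "café", where "é" is encoded as 0xC3 0xA9, the byte count is 5 and the character count is 4.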
Character Lengths
The function le_utf8_NumBytesInChar() can be used to determine the number of bytes in a character by looking at its first byte. This is handy when reading a UTF-8 string from an input stream. When the first byte is read, it can be passed to le_utf8_NumBytesInChar() to determine how many more bytes need to be read to get the rest of the character.
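The stream-reading pattern described above might look like the following sketch. Both function names are hypothetical: `num_bytes_in_char` is a stand-in written for this example, not the library function, and its choice of returning 0 for an invalid first byte is an assumption. The sketch does not check that the following bytes are proper continuation bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in: number of bytes in a character, judged from its
 * first byte alone. Returns 0 if the byte cannot start a character. */
size_t num_bytes_in_char(unsigned char firstByte)
{
    if (firstByte < 0x80)           return 1;   /* 0xxxxxxx */
    if ((firstByte & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((firstByte & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((firstByte & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;                                   /* continuation or invalid */
}

/* Hypothetical helper: takes one complete character from the front of
 * in[0..inLen) and stores it, null-terminated, in out. Returns the
 * number of bytes consumed, or 0 if the first byte is invalid or the
 * input does not yet hold the whole character. */
size_t read_one_char(const char *in, size_t inLen, char out[5])
{
    assert(in != NULL && out != NULL);

    if (inLen == 0) {
        return 0;
    }
    size_t need = num_bytes_in_char((unsigned char)in[0]);
    if (need == 0 || need > inLen) {
        return 0;
    }
    memcpy(out, in, need);
    out[need] = '\0';
    return need;
}
```

A reader that receives 0 for a partially buffered character can simply wait for more input and try again.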
Checking UTF-8 Format
As can be seen in the UTF-8 Encoding section, UTF-8 strings have a specific byte sequence. The le_utf8_IsFormatCorrect() function can be used to check if a string conforms to UTF-8 encoding. Not all valid UTF-8 characters are valid for a given character set; le_utf8_IsFormatCorrect() does not check for this.
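A structural check of this kind can be sketched as below. The function `utf8_is_well_formed` is a hypothetical illustration and makes no claim about how le_utf8_IsFormatCorrect() is implemented. It verifies only the byte pattern (each leading byte followed by the right number of continuation bytes) and, as an acknowledged limitation of the sketch, does not reject overlong encodings or surrogate code points.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not a Legato function): returns true if every
 * leading byte in str is followed by the number of continuation bytes
 * that its bit pattern announces. */
bool utf8_is_well_formed(const char *str)
{
    assert(str != NULL);

    const unsigned char *p = (const unsigned char *)str;
    while (*p != '\0') {
        size_t extra;

        if (*p < 0x80)                extra = 0;   /* 0xxxxxxx */
        else if ((*p & 0xE0) == 0xC0) extra = 1;   /* 110xxxxx */
        else if ((*p & 0xF0) == 0xE0) extra = 2;   /* 1110xxxx */
        else if ((*p & 0xF8) == 0xF0) extra = 3;   /* 11110xxx */
        else return false;          /* stray continuation or invalid byte */

        p++;
        for (; extra > 0; extra--, p++) {
            /* A null-terminator fails this test too, so the loop
             * never reads past the end of the string. */
            if ((*p & 0xC0) != 0x80) {
                return false;
            }
        }
    }
    return true;
}
```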
String Parsing
To assist with converting integer values from UTF-8 strings to binary numerical values, le_utf8_ParseInt() is provided.
More parsing functions may be added as required in the future.
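As a rough picture of what integer parsing involves, the sketch below converts a decimal string with the standard `strtol()` function and rejects trailing garbage and out-of-range values. The name `parse_int`, the boolean return value, and the choice of base 10 are assumptions made for this example; they do not describe the signature or behavior of le_utf8_ParseInt(). Digits and signs are ASCII characters, so byte-wise parsing works on UTF-8 text.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper (not a Legato function): parses a decimal integer
 * from str into *valuePtr. Returns false if str is empty, contains
 * characters after the number, or the value does not fit in an int.
 * Like strtol(), it accepts leading whitespace and an optional sign. */
bool parse_int(int *valuePtr, const char *str)
{
    assert(valuePtr != NULL && str != NULL);

    char *end = NULL;
    errno = 0;
    long value = strtol(str, &end, 10);

    if (end == str || *end != '\0') {
        return false;                       /* no digits, or trailing text */
    }
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX) {
        return false;                       /* out of range for int */
    }
    *valuePtr = (int)value;
    return true;
}
```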
Copyright (C) Sierra Wireless Inc. Use of this work is subject to license.