UTF-8 String Handling API

API Reference


This module implements safe, easy-to-use string handling functions for null-terminated strings with UTF-8 encoding.

UTF-8 is a variable-length character encoding that supports every character in the Unicode character set. UTF-8 has become the dominant character encoding because it is self-synchronizing, compatible with ASCII, and avoids the endianness issues that other encodings face.

UTF-8 Encoding

UTF-8 uses between one and four bytes to encode a character as illustrated in the following table.

Byte 1     Byte 2     Byte 3     Byte 4
0xxxxxxx
110xxxxx   10xxxxxx
1110xxxx   10xxxxxx   10xxxxxx
11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

Single-byte codes are used only for the ASCII values 0 through 127. In this case, UTF-8 has the same binary value as ASCII, so any ASCII string is also a valid UTF-8 string.

Character codes larger than 127 have a multi-byte encoding consisting of a leading byte and one or more continuation bytes.

The leading byte has two or more high-order 1 bits followed by a 0 bit. The number of high-order 1 bits equals the number of bytes in the character, so the length can be determined without examining the continuation bytes.

The continuation bytes have '10' in the two high-order bit positions.

Single bytes, leading bytes and continuation bytes can't have the same values. This means that UTF-8 strings are self-synchronized, allowing the start of a character to be found by backing up at most three bytes.
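
The two properties above can be sketched in C as follows. This is a minimal illustration, not code from this module: the names CharLengthFromLeadByte() and FindCharStart() are hypothetical, and FindCharStart() assumes the string is well-formed UTF-8.

```c
#include <stddef.h>

// Hypothetical helper: returns the number of bytes in the character that
// starts with leadByte, or 0 if leadByte cannot start a character.
size_t CharLengthFromLeadByte(unsigned char leadByte)
{
    if ((leadByte & 0x80) == 0x00) { return 1; }   // 0xxxxxxx
    if ((leadByte & 0xE0) == 0xC0) { return 2; }   // 110xxxxx
    if ((leadByte & 0xF0) == 0xE0) { return 3; }   // 1110xxxx
    if ((leadByte & 0xF8) == 0xF0) { return 4; }   // 11110xxx
    return 0;                                      // continuation or invalid byte
}

// Hypothetical helper: returns the index of the first byte of the character
// that contains str[index]. Backs up over at most three continuation bytes.
size_t FindCharStart(const char* str, size_t index)
{
    size_t backedUp = 0;

    while (   (index > 0)
           && (backedUp < 3)
           && (((unsigned char)str[index] & 0xC0) == 0x80) )
    {
        index--;
        backedUp++;
    }

    return index;
}
```

For example, in the string "h\xC3\xA9llo" the byte at index 2 is a continuation byte, so FindCharStart() returns 1, the index of its leading byte.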

Copy and Append

le_utf8_Copy() copies a string to a specified buffer location.

le_utf8_Append() appends a string to the end of another string by copying the source string to the destination string's buffer starting at the null-terminator of the destination string.

The le_utf8_CopyUpToSubStr() function is like le_utf8_Copy() except it copies only up to, but not including, a specified string.

Truncation

Because UTF-8 is a variable-length encoding, the number of characters in a string is not necessarily the same as the number of bytes in the string. When using functions like le_utf8_Copy() and le_utf8_Append(), the size of the destination buffer, in bytes, must be provided to avoid buffer overruns.

The copied string may be truncated because of limited space in the destination buffer, even though the destination buffer may not be completely filled. This can occur during the copy process if the last character to copy is more than one byte long and will not fit within the space remaining in the buffer.

In that case, the character is not copied and a null-terminator is added. Even though the destination buffer has not been filled, the copied string has been truncated. Essentially, functions like le_utf8_Copy() and le_utf8_Append() only copy complete characters, not partial characters.
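
The whole-character rule can be illustrated with the following sketch. It is not the implementation of le_utf8_Copy(): CopyWholeChars() is a hypothetical stand-in that returns a bool instead of a Legato result code, and it assumes the source string is well-formed UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper, not part of the le_utf8 API.
// Copies src into dest, which is destSize bytes long including the
// null-terminator. Only complete characters are copied.
// Returns true if all of src was copied, false if it was truncated.
bool CopyWholeChars(char* dest, const char* src, size_t destSize, size_t* numBytesPtr)
{
    size_t numCopied = 0;
    bool isComplete = true;

    if (destSize == 0)
    {
        if (numBytesPtr != NULL) { *numBytesPtr = 0; }
        return false;
    }

    while (src[numCopied] != '\0')
    {
        // Work out the character length from its leading byte.
        unsigned char lead = (unsigned char)src[numCopied];
        size_t charLen = 1;
        if ((lead & 0xE0) == 0xC0)      { charLen = 2; }
        else if ((lead & 0xF0) == 0xE0) { charLen = 3; }
        else if ((lead & 0xF8) == 0xF0) { charLen = 4; }

        // Stop if the whole character plus the null-terminator will not fit.
        if (numCopied + charLen > destSize - 1)
        {
            isComplete = false;
            break;
        }

        memcpy(dest + numCopied, src + numCopied, charLen);
        numCopied += charLen;
    }

    dest[numCopied] = '\0';
    if (numBytesPtr != NULL) { *numBytesPtr = numCopied; }
    return isComplete;
}
```

Copying "ab\xE2\x82\xAC" (five bytes) into a four-byte buffer copies only "ab": the three-byte character does not fit in the remaining space, so one byte of the buffer stays unused.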

For le_utf8_Copy(), the number of bytes actually copied is returned in the numBytesPtr parameter. This parameter can be set to NULL if the number of bytes copied is not needed. le_utf8_Append() and le_utf8_CopyUpToAsciiChar() work similarly.

// In this code sample, we need the number of bytes actually copied:
size_t numBytes;
if (le_utf8_Copy(destStr, srcStr, sizeof(destStr), &numBytes) == LE_OVERFLOW)
{
    // numBytes is a size_t, so it is printed with %zu.
    LE_WARN("'%s' was truncated when copied. Only %zu bytes were copied.", srcStr, numBytes);
}

// In this code sample, we don't care about the number of bytes copied:
LE_ASSERT(le_utf8_Copy(destStr, srcStr, sizeof(destStr), NULL) != LE_OVERFLOW);

String Lengths

String length may mean either the number of characters in the string or the number of bytes in the string. These two meanings are often used interchangeably because in ASCII-only encodings the number of characters in a string is equal to the number of bytes in the string. But this is not necessarily true with variable-length encodings such as UTF-8. Legato provides both a le_utf8_NumChars() function and a le_utf8_NumBytes() function.

le_utf8_NumBytes() must be used when determining the memory size of a string. le_utf8_NumChars() is useful for counting the number of characters in a string (e.g., for display purposes).
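
The difference between the two lengths can be seen in a small sketch. CountChars() below is a hypothetical helper, not le_utf8_NumChars(): it assumes well-formed input and simply counts every byte that is not a continuation byte, while strlen() gives the byte count.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper: counts characters in a well-formed UTF-8 string by
// counting the bytes that are not continuation bytes (10xxxxxx).
size_t CountChars(const char* str)
{
    size_t numChars = 0;

    for (; *str != '\0'; str++)
    {
        if (((unsigned char)*str & 0xC0) != 0x80)
        {
            numChars++;
        }
    }

    return numChars;
}
```

For the string "h\xC3\xA9llo", strlen() reports 6 bytes and CountChars() reports 5 characters.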

Checking UTF-8 Format

As can be seen in the UTF-8 Encoding section, UTF-8 strings follow a specific byte pattern. The le_utf8_IsFormatCorrect() function can be used to check if a string conforms to UTF-8 encoding. Not all valid UTF-8 characters are valid for a given character set; le_utf8_IsFormatCorrect() does not check for this.
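
A structural check of this kind might look like the sketch below. IsStructurallyUtf8() is a hypothetical name, and this is not the code behind le_utf8_IsFormatCorrect(): it verifies only the leading-byte and continuation-byte pattern from the table in the UTF-8 Encoding section, and it does not reject overlong encodings or out-of-range code points.

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical helper: returns true if every character in str consists of a
// valid leading byte followed by the right number of continuation bytes.
bool IsStructurallyUtf8(const char* str)
{
    const unsigned char* p = (const unsigned char*)str;

    while (*p != '\0')
    {
        size_t numContinuation;

        if ((*p & 0x80) == 0x00)      { numContinuation = 0; }
        else if ((*p & 0xE0) == 0xC0) { numContinuation = 1; }
        else if ((*p & 0xF0) == 0xE0) { numContinuation = 2; }
        else if ((*p & 0xF8) == 0xF0) { numContinuation = 3; }
        else                          { return false; }  // not a leading byte

        p++;

        for (size_t i = 0; i < numContinuation; i++)
        {
            // A null-terminator fails this test too, so a character cut
            // short by the end of the string is rejected.
            if ((*p & 0xC0) != 0x80)
            {
                return false;
            }
            p++;
        }
    }

    return true;
}
```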

String Parsing

To assist with converting integer values from UTF-8 strings to binary numerical values, le_utf8_ParseInt() is provided.
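
The signature of le_utf8_ParseInt() is not given in this section, so the sketch below does not call it. Instead it shows, using the standard strtol() function, the kind of checks such a conversion typically needs: the whole string must be consumed and the value must fit in an int. ParseIntSketch() is a hypothetical name, and its return type is an assumption made for illustration.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical helper, not le_utf8_ParseInt(): converts a decimal string to
// an int. Returns false if the string is empty, contains trailing characters
// that are not part of the number, or is out of range for an int.
bool ParseIntSketch(const char* str, int* valuePtr)
{
    char* endPtr;

    errno = 0;
    long value = strtol(str, &endPtr, 10);

    if ((endPtr == str) || (*endPtr != '\0'))
    {
        return false;   // no digits found, or trailing characters
    }

    if ((errno == ERANGE) || (value < INT_MIN) || (value > INT_MAX))
    {
        return false;   // does not fit in an int
    }

    *valuePtr = (int)value;
    return true;
}
```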

More parsing functions may be added as required in the future.

Monotonic Strings

Occasionally, when creating identifiers for a set of objects, it is useful to be able to generate a set of mutually unique strings. The identifiers may not have any meaning themselves; what is important is that each one uniquely identifies an object. Car license plate numbers are a good example of this.

The function le_utf8_GetMonotonicString() in this module can be used to generate a series of mutually unique strings. The strings generated by this function differ from our license plate example in that the generated strings are variable length and are ordered. Nevertheless, the important property of these strings is that they are mutually unique and can be used as identifiers for a set of objects.

Passing an empty string to the le_utf8_GetMonotonicString() function will generate the first string in the series. Passing the first string back into le_utf8_GetMonotonicString() will generate the next string in the series. Continuing to pass the previously generated string to le_utf8_GetMonotonicString() will produce a series of unique strings.

For example, the following function creates a number of files with unique names.

#include "legato.h"
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>

static void CreateFiles(size_t numOfFiles)
{
    char fileName[100] = "";
    char prevFileName[100] = "";
    size_t i;   // Same type as numOfFiles to avoid a signed/unsigned comparison.

    for (i = 0; i < numOfFiles; i++)
    {
        // Generate the fileName.
        LE_ASSERT(le_utf8_GetMonotonicString(prevFileName, fileName, sizeof(fileName)) == LE_OK);

        // Create the file, retrying if the call is interrupted by a signal.
        int fd;
        do
        {
            fd = open(fileName, O_RDWR | O_CREAT, S_IRWXU);
        }
        while ( (fd == -1) && (errno == EINTR) );

        LE_FATAL_IF(fd == -1, "Could not create file %s. %m.", fileName);

        // Save the file name to generate the next file name.
        LE_ASSERT(le_utf8_Copy(prevFileName, fileName, sizeof(prevFileName), NULL) == LE_OK);

        // Close the file.
        fd_Close(fd);
    }
}

Copyright (C) Sierra Wireless Inc. Use of this work is subject to license.