How to process UTF-8 text in a C program

  • 2020-06-01 10:32:03
  • OfStack

UTF-8

UTF-8 is one of the most widely used encodings of Unicode on the Internet. Other encodings include UTF-16 and UTF-32, but they are rarely used on the Internet.

To restate the relationship: UTF-8 is one of the ways in which Unicode is encoded.

One of the most important features of UTF-8 is that it is a variable-length encoding. The original design used 1 to 6 bytes per symbol; the current standard (RFC 3629) limits this to 1 to 4 bytes, which is enough to cover all of Unicode. The byte length varies with the symbol.

Encoding rules for UTF-8

The encoding rules for UTF-8 are simple, and there are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0, and the remaining seven bits hold the Unicode code point of the symbol. Therefore, for English letters, the UTF-8 encoding is the same as the ASCII code.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of every following byte are set to 10. All the remaining bits hold the Unicode code point of the symbol.
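
The two rules can be turned into a few lines of C. The sketch below is only meant to make the bit patterns concrete. The names utf8_seq_len and utf8_encode are made up for this illustration and do not come from any library, and the sketch assumes the 4-byte limit of current UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the sequence that starts with lead byte b.
 * Returns 0 if b matches no lead-byte pattern (for example, a
 * continuation byte of the form 10xxxxxx). */
size_t utf8_seq_len(unsigned char b) {
  if ((b & 0x80) == 0x00) return 1; /* 0xxxxxxx: rule 1 */
  if ((b & 0xE0) == 0xC0) return 2; /* 110xxxxx: rule 2, n = 2 */
  if ((b & 0xF0) == 0xE0) return 3; /* 1110xxxx: rule 2, n = 3 */
  if ((b & 0xF8) == 0xF0) return 4; /* 11110xxx: rule 2, n = 4 */
  return 0;
}

/* Encode code point cp into out, which must have room for 4 bytes.
 * Returns the number of bytes written, or 0 if cp cannot be encoded
 * (a surrogate or a value above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char *out) {
  assert(out != NULL);
  if (cp < 0x80) {
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp < 0x800) {
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;
}
```

For example, the code point U+4E2D is encoded as the three bytes E4 B8 AD, while '@' (U+0040) stays the single byte 40.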

If you are not familiar with the UTF-8 encoding, do not attempt to process UTF-8 text by hand in a C program. If you know UTF-8 well, you do not need to do it by hand either. Find a cross-platform C library that provides UTF-8 text processing and let it do the work!

GLib is such a library.

Start with the question

The following text is UTF-8 encoded (I am sure of this because I am using Linux, where the default text encoding is UTF-8):


 my  C81  It's in my pocket every day 
   @

I need to read this text in a C program. When I reach the '@' character, I need to determine whether everything to the left of '@' on the same line is white space.

For simplicity, I skip the file reading and represent the above text as a C string:


gchar *demo_text =
 " my  C81  It's in my pocket every day \n"
 "   @";

Note: in GLib, gchar is simply char (typedef char gchar;).

Below, "the demo_text string" means the strlen(demo_text) + 1 bytes of memory starting at the address stored in the demo_text pointer, which is basic knowledge about C strings.

UTF-8 text length and character positioning

To simulate the moment when the program reads the '@' character, I need to position the '@' character in the demo_text string with a pointer of type char *.

The '@' character is at the end of demo_text. I need an offset, namely the length of the demo_text string in UTF-8 characters, with which I can jump from the base address of the demo_text string to the base address of the '@' character.

GLib provides the g_utf8_strlen function to calculate the UTF-8 string length, so I can get the offset distance from the base address of the demo_text string to the base address of the '@' character:


glong offset = g_utf8_strlen(demo_text, -1);

The result is 38, which is the length of the demo_text string in UTF-8 characters (not counting the terminating null character, '\0').
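
To see why this length can differ from what strlen reports, here is a rough sketch of the counting idea in plain C. It is not GLib's actual implementation: it simply counts every byte that is not a continuation byte of the form 10xxxxxx, because each such byte starts a new character. The name utf8_count_chars is hypothetical, and the sketch assumes valid, null-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Count the UTF-8 characters in a null-terminated string.
 * Every byte whose top two bits are not "10" begins a character. */
size_t utf8_count_chars(const char *s) {
  assert(s != NULL);
  size_t count = 0;
  for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
    if ((*p & 0xC0) != 0x80) count++;
  }
  return count;
}
```

For the 5-byte string "a\xE4\xB8\xAD@" (the letter a, one 3-byte character, and '@'), strlen returns 5 while this function returns 3.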

The prototype of g_utf8_strlen is as follows:


glong g_utf8_strlen(const gchar *p, gssize max);

Note: glong is long, and gssize is the signed counterpart of gsize (GLib's equivalent of size_t).

The second parameter of g_utf8_strlen, max, follows these rules:

If it is negative, the string is assumed to be null-terminated (as usual for C strings) and all of its UTF-8 characters are counted. If it is zero, the string is not examined at all, so this value has no practical use. If it is positive, it is a number of bytes: g_utf8_strlen examines at most that many bytes from the start of the string and counts the UTF-8 characters they contain.

With the offset distance, the '@' character can be positioned in demo_text, that is:


gchar *tail = g_utf8_offset_to_pointer(demo_text, offset - 1);

The value of tail is the base address of the '@' character.
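
The offset passed in is offset - 1 because offset is a character count, so the last character sits at index offset - 1. Conceptually, turning a character offset into a pointer means skipping that many whole characters. The following plain C sketch shows the idea under the assumption of valid UTF-8. The name utf8_advance is hypothetical, and this is not how GLib implements g_utf8_offset_to_pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Return a pointer to the character that lies n characters after s.
 * Stops early at the terminating '\0' if the string is shorter. */
const char *utf8_advance(const char *s, size_t n) {
  assert(s != NULL);
  const unsigned char *p = (const unsigned char *)s;
  while (n > 0 && *p != '\0') {
    p++;                             /* step over the lead byte */
    while ((*p & 0xC0) == 0x80) p++; /* and its continuation bytes */
    n--;
  }
  return (const char *)p;
}
```

With s pointing at a string made of one 3-byte character, two spaces and '@', utf8_advance(s, 3) returns s + 5, the address of '@'.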

Swim in the UTF-8 text

Now that the '@' position is known, the next step is to traverse the rest of the demo_text string to the left (in reverse order) from that position. GLib provides two functions for this purpose:


gchar * g_utf8_prev_char(const gchar *p);
gchar * g_utf8_find_prev_char(const gchar *str, const gchar *p);

g_utf8_prev_char returns the base address of the UTF-8 character just before p (p is the base address of the current UTF-8 character). It does no bounds checking, so the caller must make sure that p is not already at the start of the string. g_utf8_find_prev_char does the same job but also takes the base address of the string, str: if p is the same as str, that is, no character precedes p, it returns NULL.

For the problem to be solved in this article, you can use this function to write the reverse traversal of all the UTF-8 characters before '@' starting from the position of the '@' character in demo_text:


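glong offset = g_utf8_strlen(demo_text, -1);
gchar *viewer = g_utf8_offset_to_pointer(demo_text, offset - 1);
/* Stop once viewer reaches the start of the string, because
   g_utf8_prev_char does no bounds checking of its own. */
while (viewer != demo_text) {
  viewer = g_utf8_prev_char(viewer);
  /* do something with the character at viewer here */
}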

GLib also provides g_utf8_next_char (a macro), which returns the base address of the UTF-8 character that follows the current position.
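
Stepping backwards works because continuation bytes are easy to recognise: move back one byte at a time until the byte is no longer of the form 10xxxxxx. The sketch below shows this in plain C with a bounds check against the start of the string. The name utf8_prev is hypothetical, valid UTF-8 is assumed, and the sketch only illustrates the idea rather than reproducing GLib's code.

```c
#include <assert.h>
#include <stddef.h>

/* Return the start of the character before p, or NULL if p is
 * already at (or before) the start of the string. */
const char *utf8_prev(const char *start, const char *p) {
  assert(start != NULL && p != NULL);
  if (p <= start) return NULL;
  do {
    p--; /* back up one byte */
  } while (p > start && ((unsigned char)*p & 0xC0) == 0x80);
  return p;
}
```

Returning NULL at the start of the string lets a loop of the form while ((p = utf8_prev(start, p)) != NULL) visit every character, including the first one.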

Extract the UTF-8 characters

While g_utf8_prev_char and g_utf8_next_char can move a pointer through the UTF-8 text, all they do is position the pointer at the base address of a UTF-8 character.

For example,


viewer = g_utf8_prev_char(viewer);

At this point, although viewer has moved back by the width of one UTF-8 character and now holds the base address of a new UTF-8 character, I cannot print just that character as follows:


g_print("%s", viewer);

Note: the g_print function is basically equivalent to the printf function in the C standard library, except that g_print can use the g_set_print_handler function to "redirect" the output.

This is because g_print treats viewer as an ordinary C string and prints everything up to the next '\0'. It would print a single UTF-8 character only if that character were immediately followed by '\0', and that is true only for the last character of the demo_text string.

One way to solve this problem is to extract the bytes of the UTF-8 character pointed to by viewer, copy them into a character array or into space allocated on the heap, and then print that copy. For example:


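/* requires <stdlib.h> (malloc, free) and <string.h> (memcpy) */
gchar *new_viewer = g_utf8_next_char(viewer);

size_t n = new_viewer - viewer; /* width of the character in bytes */
gchar *utf8_char = malloc(n + 1);
memcpy(utf8_char, viewer, n);
utf8_char[n] = '\0';
g_print("%s", utf8_char);
free(utf8_char);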
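
The character-array variant mentioned above avoids the heap, since a single UTF-8 character only needs a few bytes. Here is a minimal sketch in plain C. The name utf8_copy_char and the buffer size macro are made up for this illustration; 7 bytes are reserved so that even the longest 6-byte form plus the terminator fits. The input is assumed to be valid UTF-8 with p on a character boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define UTF8_CHAR_BUF 7 /* up to 6 bytes of data plus '\0' */

/* Copy the single UTF-8 character at p into out and terminate it.
 * Returns the number of bytes copied (0 for an empty string). */
size_t utf8_copy_char(const char *p, char out[UTF8_CHAR_BUF]) {
  assert(p != NULL && out != NULL);
  size_t n = 0;
  if (*p != '\0') {
    n = 1; /* the lead byte */
    while (n < UTF8_CHAR_BUF - 1 && ((unsigned char)p[n] & 0xC0) == 0x80) {
      n++; /* each continuation byte belongs to the same character */
    }
  }
  memcpy(out, p, n);
  out[n] = '\0';
  return n;
}
```

After char buf[UTF8_CHAR_BUF]; utf8_copy_char(viewer, buf); the array holds exactly one character, so printing buf with "%s" outputs that character and nothing more.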

This is obviously tedious, which means we should write a function that does exactly that. The function can be called get_utf8_char and is defined as follows:


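static gchar * get_utf8_char(const gchar *base) {
  gchar *new_base = g_utf8_next_char(base);
  gsize n = new_base - base;
  gchar *utf8_char = g_memdup(base, (n + 1));
  utf8_char[n] = '\0';
  return utf8_char;
}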

With this function, it is possible to print all UTF-8 characters before '@' in reverse order, starting from the '@' position of demo_text:


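glong offset = g_utf8_strlen(demo_text, -1);
gchar *viewer = g_utf8_offset_to_pointer(demo_text, offset - 1);
while (viewer != demo_text) {
  viewer = g_utf8_prev_char(viewer);
  gchar *utf8_char = get_utf8_char(viewer);
  g_print("%s", utf8_char);
  g_free(utf8_char);
}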

Note: g_memdup is equivalent to malloc + memcpy in the C standard library, while g_free is equivalent to free in the C standard library. Since GLib 2.68, g_memdup is deprecated in favor of g_memdup2.

White space character comparison

Now, given a UTF-8 character x, how do you tell whether it is equal to some other UTF-8 character?

Don't forget that a so-called UTF-8 character is essentially just a segment of memory referenced by a pointer of type char *. Once the character has been extracted into its own null-terminated string, the strcmp function provided by the C standard library can be used to compare it with another UTF-8 character.

Next, I define the function is_space, which determines whether a UTF-8 character is a blank character.


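static gboolean is_space(const gchar *s) {
  gboolean ret = FALSE;
  /* ASCII space, tab, full-width space (U+3000) */
  char *space_chars_set[] = {" ", "\t", "\xE3\x80\x80"};
  size_t n = sizeof(space_chars_set) / sizeof(space_chars_set[0]);
  for (size_t i = 0; i < n; i++) {
    if (!strcmp(s, space_chars_set[i])) {
      ret = TRUE;
      break;
    }
  }
  return ret;
}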

Note: gboolean is a Boolean type defined by GLib and its value is either TRUE or FALSE.

In the is_space function, I only check three kinds of white space: the ASCII space, the Chinese full-width space, and the tab.

Although carriage returns and newlines are also white space characters, to solve the problem raised at the beginning of this article, I need to define a separate judgment function for newlines:


static gboolean is_line_break(const gchar *s) {
  return (!strcmp(s, "\n") ? TRUE : FALSE);
}
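
A comparison does not strictly need an extracted copy. Because the lead byte of a UTF-8 character fixes its length, a character inside a longer string can be compared in place by checking whether the bytes at that position begin with the bytes of the candidate character. The following plain C sketch shows this idea. The names utf8_char_is and utf8_is_blank are hypothetical, and the pointer is assumed to sit on a character boundary of valid UTF-8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if the character starting at p is exactly the single
 * UTF-8 character stored in the string ch. */
bool utf8_char_is(const char *p, const char *ch) {
  assert(p != NULL && ch != NULL);
  size_t n = strlen(ch);
  return n > 0 && strncmp(p, ch, n) == 0;
}

/* True if the character at p is a space, a tab, or a
 * full-width space (U+3000, bytes E3 80 80). */
bool utf8_is_blank(const char *p) {
  static const char *const blanks[] = {" ", "\t", "\xE3\x80\x80"};
  for (size_t i = 0; i < sizeof blanks / sizeof blanks[0]; i++) {
    if (utf8_char_is(p, blanks[i])) return true;
  }
  return false;
}
```

The trade-off is that the caller must guarantee that p really is the start of a character, which the extraction approach gets for free from g_utf8_prev_char and g_utf8_next_char.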

To solve the problem

Everything is now in place, so it is time to get down to business. If you have forgotten what the problem is by this point, go back to the section "Start with the question".

Although the following code looks ugly, it solves the problem.


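gboolean is_right_at_sign = TRUE;
glong offset = g_utf8_strlen(demo_text, -1);
gchar *viewer = g_utf8_offset_to_pointer(demo_text, offset - 1);
while (1) {
  if (viewer == demo_text) break;
  viewer = g_utf8_prev_char(viewer);
  gchar *utf8_char = get_utf8_char(viewer);
  if (is_space(utf8_char)) {
    g_free(utf8_char);
  } else {
    if (!is_line_break(utf8_char)) is_right_at_sign = FALSE;
    g_free(utf8_char);
    break;
  }
}
if (is_right_at_sign) g_print("Right @ !\n");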

A slight simplification of the above code can be obtained as follows:


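gboolean is_right_at_sign = TRUE;
glong offset = g_utf8_strlen(demo_text, -1);
gchar *viewer = g_utf8_offset_to_pointer(demo_text, offset - 1);
while (viewer != demo_text) {
  viewer = g_utf8_prev_char(viewer);
  gchar *utf8_char = get_utf8_char(viewer);
  gboolean space = is_space(utf8_char);
  gboolean line_break = is_line_break(utf8_char);
  g_free(utf8_char);
  if (!space) {
    if (!line_break) is_right_at_sign = FALSE;
    break;
  }
}
if (is_right_at_sign) g_print("Right @ !\n");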

In fact, if you move the extraction and freeing of the UTF-8 character into the is_space and is_line_break functions, namely:


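static gboolean is_space(const gchar *c) {
  gboolean ret = FALSE;
  gchar *utf8_char = get_utf8_char(c);
  /* ASCII space, tab, full-width space (U+3000) */
  char *space_chars_set[] = {" ", "\t", "\xE3\x80\x80"};
  size_t n = sizeof(space_chars_set) / sizeof(space_chars_set[0]);
  for (size_t i = 0; i < n; i++) {
    if (!strcmp(utf8_char, space_chars_set[i])) {
      ret = TRUE;
      break;
    }
  }
  g_free(utf8_char);
  return ret;
}

static gboolean is_line_break(const gchar *c) {
  gboolean ret = FALSE;
  gchar *utf8_char = get_utf8_char(c);
  if (!strcmp(utf8_char, "\n")) ret = TRUE;
  g_free(utf8_char);
  return ret;
}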

then the loop can be simplified one step further:


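gboolean is_right_at_sign = TRUE;
glong offset = g_utf8_strlen(demo_text, -1);
gchar *viewer = g_utf8_offset_to_pointer(demo_text, offset - 1);
while (viewer != demo_text) {
  viewer = g_utf8_prev_char(viewer);
  if (!is_space(viewer)) {
    if (!is_line_break(viewer)) is_right_at_sign = FALSE;
    break;
  }
}
if (is_right_at_sign) g_print("Right @ !\n");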
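
For comparison, the same check can be sketched without any library, which shows that nothing magical is going on underneath. This is only an illustration under the assumption of valid UTF-8: the name only_blank_before and the helper starts_with are hypothetical, and the GLib version remains the approach this article recommends.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if the bytes at p begin with the bytes of ch. */
static bool starts_with(const char *p, const char *ch) {
  return strncmp(p, ch, strlen(ch)) == 0;
}

/* True if only spaces, tabs and full-width spaces (U+3000) lie
 * between `at` and the start of its line (or of the whole text). */
bool only_blank_before(const char *text, const char *at) {
  assert(text != NULL && at != NULL && at >= text);
  const char *p = at;
  while (p > text) {
    /* Step back to the start of the previous character. */
    do {
      p--;
    } while (p > text && ((unsigned char)*p & 0xC0) == 0x80);
    if (starts_with(p, "\n")) return true; /* reached the line start */
    if (!starts_with(p, " ") && !starts_with(p, "\t") &&
        !starts_with(p, "\xE3\x80\x80")) {
      return false; /* a non-blank character sits left of `at` */
    }
  }
  return true; /* reached the start of the text */
}
```

A caller would locate '@' first, for example with strchr(text, '@'), and pass that pointer as the second argument.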

Attachment: complete code


#include <string.h>
#include <glib.h>

gchar *demo_text =
  " my  C81  It's in my pocket every day \n"
  "      @";

static gchar * get_utf8_char(const gchar *base) {
  gchar *new_base = g_utf8_next_char(base);
  gsize n = new_base - base;
  gchar *utf8_char = g_memdup(base, (n + 1));
  utf8_char[n] = '\0';
  return utf8_char;
}

static gboolean is_space(const gchar *c) {
  gboolean ret = FALSE;
  gchar *utf8_char = get_utf8_char(c);
  /* ASCII space, tab, full-width space (U+3000) */
  char *space_chars_set[] = {" ", "\t", "\xE3\x80\x80"};
  size_t n = sizeof(space_chars_set) / sizeof(space_chars_set[0]);
  for (size_t i = 0; i < n; i++) {
    if (!strcmp(utf8_char, space_chars_set[i])) {
      ret = TRUE;
      break;
    }
  }
  g_free(utf8_char);
  return ret;
}

static gboolean is_line_break(const gchar *c) {
  gboolean ret = FALSE;
  gchar *utf8_char = get_utf8_char(c);
  if (!strcmp(utf8_char, "\n")) ret = TRUE;
  g_free(utf8_char);
  return ret;
}

int main(void) {
  gboolean is_right_at_sign = TRUE;
  glong offset = g_utf8_strlen(demo_text, -1);
  gchar *viewer = g_utf8_offset_to_pointer(demo_text, offset - 1);
  while (viewer != demo_text) {
    viewer = g_utf8_prev_char(viewer);
    if (!is_space(viewer)) {
      if (!is_line_break(viewer)) is_right_at_sign = FALSE;
      break;
    }
  }
  if (is_right_at_sign) g_print("Right @ !\n");

  return 0;
}

If you save this code as, say, utf8-demo.c (the file name is only an example), you can compile it with gcc in Bash using the following command:


gcc utf8-demo.c $(pkg-config --cflags --libs glib-2.0) -o utf8-demo
