Regular expressions in C, explained in detail with examples

  • 2020-06-19 11:12:55
  • OfStack

A regular expression (often abbreviated to regex, regexp, or RE in code) is a single string that describes, and is used to match, a set of strings conforming to a given syntactic rule.

In C, regular expressions are handled with regcomp, regexec, regfree, and regerror. Processing a regular expression takes three steps:

Compile the regular expression with regcomp; match against it with regexec; release it with regfree.

Function prototypes


/*
 Description: regcomp compiles the regular expression string regex into a regex_t, which regexec then uses for matching.
 Parameters:
  preg: pointer to a regex_t structure.
  regex: the regular expression string.
  cflags: zero, or the bitwise OR (|) of one or more of the following four values.
    REG_EXTENDED: interpret regex using POSIX extended regular expression syntax. If not set, POSIX basic regular expression syntax is used.
    REG_ICASE: ignore case.
    REG_NOSUB: do not store the positions of matches.
    REG_NEWLINE: treat newline characters specially, as explained later.
 Return value:
  0: compilation succeeded;
  non-zero: compilation failed; use regerror to get the failure message.
*/
int regcomp(regex_t *preg, const char *regex, int cflags);
/*
 Description: regexec matches a string against a compiled regular expression.
 Parameters:
  preg: pointer to a regex_t compiled by regcomp.
  string: the string to match against.
  nmatch: the number of elements in the pmatch array.
  pmatch: an array of regmatch_t structures that receives the positions of the matched substrings.
  regmatch_t is defined as follows:
    typedef struct {
      regoff_t rm_so;
      regoff_t rm_eo;
    } regmatch_t;
    rm_so: if not -1, the offset in string where the matched substring starts; rm_eo is the offset of the first character after the end of the matched substring.
  eflags: zero, REG_NOTBOL, REG_NOTEOL, or the bitwise OR (|) of the two, as shown later.
 Return value:
  0: the match succeeded;
  non-zero: REG_NOMATCH if nothing matched, otherwise an error; use regerror to get the message.
*/
int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags);
/*
 Description: regfree releases the memory that regcomp allocated for a compiled regular expression.
 Parameters:
  preg: pointer to a regex_t compiled by regcomp.
*/
void regfree(regex_t *preg);
/*
 Description: when regcomp or regexec fails, it returns a non-zero error code; regerror turns that code into an error message.
 Parameters:
  errcode: the error code returned by regcomp or regexec.
  preg: pointer to the regex_t that was passed to regcomp or regexec.
  errbuf: buffer that receives the error message.
  errbuf_size: size of errbuf in bytes.
*/
size_t regerror(int errcode, const regex_t *preg, char *errbuf, size_t errbuf_size);
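The examples in this article use a fixed 256-byte buffer for error messages. As a small sketch of an alternative: POSIX specifies that regerror returns the buffer size needed for the full message, and that with errbuf_size set to 0 it ignores errbuf, so the buffer can be sized exactly. The helper name regex_error_string is made up for this illustration.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper (not part of regex.h): returns a malloc'd copy of the
 * error message for errcode, or NULL if allocation fails. The caller frees it. */
char *regex_error_string(int errcode, const regex_t *preg)
{
    /* With a zero-size buffer, regerror only reports how many bytes the
     * message needs, including the terminating NUL. */
    size_t needed = regerror(errcode, preg, NULL, 0);
    char *msg = malloc(needed);

    if (msg != NULL)
        regerror(errcode, preg, msg, needed);
    return msg;
}
```

A caller would pass the non-zero value returned by regcomp or regexec, print the string, and then free it.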

Example 1


#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>
int main (void)
{
  char ebuff[256];
  int ret;
  int cflags;
  regex_t reg;
  cflags = REG_EXTENDED | REG_ICASE | REG_NOSUB;
  char *test_str = "Hello World";
  char *reg_str = "H.*";
  ret = regcomp(&reg, reg_str, cflags);
  if (ret)
  {  
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "%s\n", ebuff);
    goto end;
  }  
  ret = regexec(&reg, test_str, 0, NULL, 0);
  if (ret)
  {
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "%s\n", ebuff);
    goto end;
  }  
  regerror(ret, &reg, ebuff, 256);
  fprintf(stderr, "result is:\n%s\n", ebuff);
end:
  regfree(&reg);
  return 0;
}

Compiling and running it produces the following output:

[root@zxy regex]# ./test
result is:
Success

The match succeeded.

Example 2

What if I want to keep the matched text? Then the regmatch_t structure is needed. Rewrite the code above without the REG_NOSUB option, as follows:


#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>
int main (void)
{
  int i;
  char ebuff[256];
  int ret;
  int cflags;
  regex_t reg;
  regmatch_t rm[5];
  char *part_str = NULL;
  cflags = REG_EXTENDED | REG_ICASE;
  char *test_str = "Hello World";
  char *reg_str = "e(.*)o";
  ret = regcomp(&reg, reg_str, cflags);
  if (ret)
  {  
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "%s\n", ebuff);
    goto end;
  }  
  ret = regexec(&reg, test_str, 5, rm, 0); 
  if (ret)
  {  
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "%s\n", ebuff);
    goto end;
  }
  regerror(ret, &reg, ebuff, 256);
  fprintf(stderr, "result is:\n%s\n\n", ebuff);
  for (i=0; i<5; i++)
  {
    if (rm[i].rm_so > -1)
    {
      part_str = strndup(test_str+rm[i].rm_so, rm[i].rm_eo-rm[i].rm_so);
      fprintf(stderr, "%s\n", part_str);
      free(part_str);
      part_str = NULL;
    }
  }
end:
  regfree(&reg);
  return 0;
}

Compiling and running it produces the following output:

[root@zxy regex]# ./test
result is:
Success

ello Wo
llo W

Huh? Why are two matches printed when I only wanted one?
It turns out that the first element of the regmatch_t array is special: it holds the start and end offsets of the substring matched by the entire regular expression. The parenthesized subexpressions are stored from the second element onward. So when sizing the regmatch_t array, remember that it needs one more element than the number of subexpressions.
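Rather than guessing an array size such as 5, the "+1" rule can be applied directly: after a successful regcomp, the re_nsub member of regex_t holds the number of parenthesized subexpressions. The following is only a sketch of that idea; the helper name print_groups and its return convention are invented for this illustration.

```c
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: prints the whole match (slot 0) and every group that
 * took part in it. Returns the number of slots printed, or -1 if the pattern
 * does not compile, allocation fails, or nothing matches. */
int print_groups(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t *m;
    size_t n, i;
    int printed = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* One slot per subexpression, plus slot 0 for the whole match. */
    n = re.re_nsub + 1;
    m = malloc(n * sizeof *m);
    if (m != NULL && regexec(&re, text, n, m, 0) == 0) {
        printed = 0;
        for (i = 0; i < n; i++) {
            if (m[i].rm_so == -1)
                continue; /* this group did not take part in the match */
            printf("%zu: %.*s\n", i,
                   (int)(m[i].rm_eo - m[i].rm_so), text + m[i].rm_so);
            printed++;
        }
    }
    free(m);
    regfree(&re);
    return printed;
}
```

Called as print_groups("e(.*)o", "Hello World"), this prints "0: ello Wo" and "1: llo W", the same two substrings as Example 2.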

REG_NEWLINE, REG_NOTBOL and REG_NOTEOL

That covers basic usage, so let's move on to REG_NEWLINE, REG_NOTBOL, and REG_NOTEOL. Many people are confused by these three flags. So was I: yesterday someone asked a question, I answered with my own mistaken understanding, and was promptly looked down on by an expert. I had always thought that using the ^ and $ anchors required the REG_NEWLINE flag, but that is not the case.

REG_NEWLINE

First, look at the man page's description of REG_NEWLINE:


REG_NEWLINE
  Match-any-character operators don't match a newline.
  A non-matching list ([^...]) not containing a newline does not match a newline.
  Match-beginning-of-line operator (^) matches the empty string immediately after a newline, regardless of whether eflags, the execution flags of regexec(), contains REG_NOTBOL.
  Match-end-of-line operator ($) matches the empty string immediately before a newline, regardless of whether eflags contains REG_NOTEOL.

My English is not great, so here is my translation:

REG_NEWLINE

1. Operators that match any character (for example .) do not match a newline ('\n');
2. A non-matching list ([^...]) that does not contain a newline does not match a newline;
3. The match-beginning-of-line operator (^) matches the empty string immediately after a newline, whether or not REG_NOTBOL is set in eflags when regexec() is called;
4. The match-end-of-line operator ($) matches the empty string immediately before a newline, whether or not REG_NOTEOL is set in eflags when regexec() is called.

If that is still hard to follow, let's test it with programs.

Question number one

The code is as follows:


#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>
int main (void)
{
  int i;
  char ebuff[256];
  int ret;
  int cflags;
  regex_t reg;
  cflags = REG_EXTENDED | REG_ICASE | REG_NOSUB;
  char *test_str = "Hello World\n";
  char *reg_str = "Hello World.";
  ret = regcomp(&reg, reg_str, cflags);
  if (ret)
  {  
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "1. %s\n", ebuff);
    goto end;
  }  
  ret = regexec(&reg, test_str, 0, NULL, 0); 
  regerror(ret, &reg, ebuff, 256);
  fprintf(stderr, "2. %s\n", ebuff);
  regfree(&reg); /* release the first compilation before compiling again */
  cflags |= REG_NEWLINE;
  ret = regcomp(&reg, reg_str, cflags);
  if (ret)
  {
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "3. %s\n", ebuff);
    goto end;
  }
  ret = regexec(&reg, test_str, 0, NULL, 0);
  regerror(ret, &reg, ebuff, 256);
  fprintf(stderr, "4. %s\n", ebuff);
end:
  regfree(&reg);
  return 0;
}

Compiling and running it gives the following results:

[root@zxy regex]# ./test
2. Success
4. No match

The result is clear: without REG_NEWLINE the match succeeds, and with REG_NEWLINE it fails. In other words, without REG_NEWLINE the any-character operator (.) matches '\n'; with REG_NEWLINE it does not.

Question number two

The code is as follows:


#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>
int main (void)
{
  int i;
  char ebuff[256];
  int ret;
  int cflags;
  regex_t reg;
  cflags = REG_EXTENDED | REG_ICASE | REG_NOSUB;
  char *test_str = "Hello\nWorld";
  char *reg_str = "Hello[^ ]";
  ret = regcomp(&reg, reg_str, cflags);
  if (ret)
  {  
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "1. %s\n", ebuff);
    goto end;
  }  
  ret = regexec(&reg, test_str, 0, NULL, 0); 
  regerror(ret, &reg, ebuff, 256);
  fprintf(stderr, "2. %s\n", ebuff);
  regfree(&reg); /* release the first compilation before compiling again */
  cflags |= REG_NEWLINE;
  ret = regcomp(&reg, reg_str, cflags);
  if (ret)
  {
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "3. %s\n", ebuff);
    goto end;
  }
  ret = regexec(&reg, test_str, 0, NULL, 0);
  regerror(ret, &reg, ebuff, 256);
  fprintf(stderr, "4. %s\n", ebuff);
end:
  regfree(&reg);
  return 0;
}

Compiling and running it gives the following results:

[root@zxy regex]# ./test
2. Success
4. No match

The results show that without REG_NEWLINE, a non-matching list that does not contain '\n' (here [^ ]) matches '\n'; with REG_NEWLINE, it does not.

Question number three

The code is as follows:


#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>
int main (void)
{
  int i;
  char ebuff[256];
  int ret;
  int cflags;
  regex_t reg;
  cflags = REG_EXTENDED | REG_ICASE | REG_NOSUB;
  char *test_str = "\nHello World";
  char *reg_str = "^Hello";
  ret = regcomp(&reg, reg_str, cflags);
  if (ret)
  {  
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "1. %s\n", ebuff);
    goto end;
  }  
  ret = regexec(&reg, test_str, 0, NULL, 0); 
  regerror(ret, &reg, ebuff, 256);
  fprintf(stderr, "2. %s\n", ebuff);
  regfree(&reg); /* release the first compilation before compiling again */
  cflags |= REG_NEWLINE;
  ret = regcomp(&reg, reg_str, cflags);
  if (ret)
  {
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "3. %s\n", ebuff);
    goto end;
  }
  ret = regexec(&reg, test_str, 0, NULL, 0);
  regerror(ret, &reg, ebuff, 256);
  fprintf(stderr, "4. %s\n", ebuff);
end:
  regfree(&reg);
  return 0;
}

Compiling and running it gives the following results:

[root@zxy regex]# ./test
2. No match
4. Success

The results show that without REG_NEWLINE, '^' matches only at the very start of the string, so a string that begins with '\n' is not matched by "^Hello". With REG_NEWLINE, '^' also matches immediately after a '\n', so the match succeeds.

Question number four

The code is as follows:


#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>
int main (void)
{
  int i;
  char ebuff[256];
  int ret;
  int cflags;
  regex_t reg;
  cflags = REG_EXTENDED | REG_ICASE | REG_NOSUB;
  char *test_str = "Hello World\n";
  char *reg_str = "d$";
  ret = regcomp(&reg, reg_str, cflags);
  if (ret)
  {  
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "1. %s\n", ebuff);
    goto end;
  }  
  ret = regexec(&reg, test_str, 0, NULL, 0); 
  regerror(ret, &reg, ebuff, 256);
  fprintf(stderr, "2. %s\n", ebuff);
  regfree(&reg); /* release the first compilation before compiling again */
  cflags |= REG_NEWLINE;
  ret = regcomp(&reg, reg_str, cflags);
  if (ret)
  {
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "3. %s\n", ebuff);
    goto end;
  }
  ret = regexec(&reg, test_str, 0, NULL, 0);
  regerror(ret, &reg, ebuff, 256);
  fprintf(stderr, "4. %s\n", ebuff);
end:
  regfree(&reg);
  return 0;
}

Compiling and running it gives the following results:

[root@zxy regex]# ./test
2. No match
4. Success

The results show that without REG_NEWLINE, '$' matches only at the very end of the string, so a string that ends with '\n' is not matched by "d$". With REG_NEWLINE, '$' also matches immediately before a '\n', so the match succeeds.

REG_NEWLINE summary

OK, that is the end of the REG_NEWLINE tests. To summarize:

With the REG_NEWLINE option: 1. the any-character operator (.) does not match '\n'; 2. a non-matching list that does not contain '\n' does not match '\n'; 3. '^' also matches immediately after a '\n' and '$' also matches immediately before a '\n', so both anchors work line by line within the string.
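The four test programs differ only in the pattern, the string, and whether REG_NEWLINE is set, so the summary can be rechecked with one small function. This is just a sketch; the name matches_with is made up here, and REG_EXTENDED and REG_NOSUB are always added for brevity.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if pattern matches text when compiled with
 * the given extra cflags, 0 if it does not, and -1 if compilation fails. */
int matches_with(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, cflags | REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* The four questions, as calls (result without / with REG_NEWLINE):
 *   matches_with("Hello World.", "Hello World\n", flag)   1 / 0
 *   matches_with("Hello[^ ]",    "Hello\nWorld",  flag)   1 / 0
 *   matches_with("^Hello",       "\nHello World", flag)   0 / 1
 *   matches_with("d$",           "Hello World\n", flag)   0 / 1
 * where flag is 0 or REG_NEWLINE. */
```

Compiling inside the helper on every call is wasteful for real use, but it keeps each check independent, which is all these tests need.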

REG_NOTBOL and REG_NOTEOL

Now for REG_NOTBOL and REG_NOTEOL. Here is the man page's description of the two flags:


REG_NOTBOL
  The match-beginning-of-line operator always fails to match (but see the compilation flag REG_NEWLINE above). This flag may be used when different portions of a string are passed to regexec() and the beginning of the string should not be interpreted as the beginning of the line.
REG_NOTEOL
  The match-end-of-line operator always fails to match (but see the compilation flag REG_NEWLINE above).

After some more googling, here is my translation:

REG_NOTBOL
The match-beginning-of-line operator (^) always fails to match (but see REG_NEWLINE). This flag is for the case where different portions of a string are passed to regexec() and the start of the portion should not be treated as the start of a line.
REG_NOTEOL
The match-end-of-line operator ($) always fails to match (but see REG_NEWLINE). This flag is for the case where different portions of a string are passed to regexec() and the end of the portion should not be treated as the end of a line.

OK, let's go ahead and test. The code for the first question (REG_NOTBOL) is as follows:


#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>
int main (void)
{
  int i;
  char ebuff[256];
  int ret;
  int cflags;
  regex_t reg;
  cflags = REG_EXTENDED | REG_ICASE | REG_NOSUB;
  char *test_str = "Hello World\n";
  char *reg_str = "^e";
  ret = regcomp(&reg, reg_str, cflags);
  if (ret)
  {  
    regerror(ret, &reg, ebuff, 256);
    fprintf(stderr, "1. %s\n", ebuff);
    goto end;
  }  
  ret = regexec(&reg, test_str+1, 0, NULL, 0); 
  regerror(ret, &reg, ebuff, 256);
  fprintf(stderr, "2. %s\n", ebuff);
  ret = regexec(&reg, test_str+1, 0, NULL, REG_NOTBOL);
  regerror(ret, &reg, ebuff, 256);
  fprintf(stderr, "4. %s\n", ebuff);
end:
  regfree(&reg);
  return 0;
}

Compiling and running it gives the following results:

[root@zxy regex]# ./test
2. Success
4. No match

Result: without REG_NOTBOL, '^' matches at the start of whatever portion of the string is passed to regexec(); with REG_NOTBOL, it does not match there.
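A typical place where this matters is a loop that finds every match by calling regexec() again on the rest of the string. From the second call on, the pointer no longer sits at the real start of the line, so REG_NOTBOL is passed. The function below is a rough sketch of that loop; the name count_matches and the way empty matches are stepped over are choices made for this illustration.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: counts non-overlapping matches of pattern in text.
 * Returns -1 if the pattern does not compile. */
int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    const char *p = text;
    int eflags = 0;
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (regexec(&re, p, 1, &m, eflags) == 0) {
        count++;
        if (m.rm_eo == m.rm_so) {
            /* Empty match: step over one character so the loop advances. */
            if (p[m.rm_eo] == '\0')
                break;
            p += m.rm_eo + 1;
        } else {
            p += m.rm_eo;
        }
        /* p now points into the middle of the line, so '^' must not match. */
        eflags = REG_NOTBOL;
    }
    regfree(&re);
    return count;
}
```

With this, count_matches("^a", "aaa") returns 1. If eflags stayed 0, every remaining "a" would look like the start of a line and the result would be 3.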

As for the second flag, REG_NOTEOL, I really could not figure it out, and the descriptions found online are all unverified...
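One way to check REG_NOTEOL, going by the man page text quoted earlier, is to treat the string passed to regexec() as a chunk of a longer line and say explicitly whether the chunk touches the line's start and end. This is only a sketch under that reading; the function match_chunk and its two boolean parameters are invented for this illustration.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: matches pattern against chunk, a piece of a longer
 * line. at_line_start and at_line_end say whether the chunk really begins or
 * ends the line. Returns 1 on match, 0 on no match, -1 on compile failure. */
int match_chunk(const char *pattern, const char *chunk,
                int at_line_start, int at_line_end)
{
    regex_t re;
    int eflags = 0;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    if (!at_line_start)
        eflags |= REG_NOTBOL; /* '^' must not match at the chunk start */
    if (!at_line_end)
        eflags |= REG_NOTEOL; /* '$' must not match at the chunk end */

    rc = regexec(&re, chunk, 0, NULL, eflags);
    regfree(&re);
    return rc == 0;
}
```

For example, match_chunk("d$", "Hello World", 1, 1) returns 1, while match_chunk("d$", "Hello World", 1, 0) returns 0: with REG_NOTEOL set, the end of the chunk is not accepted as the end of the line. A pattern without anchors, such as "World", matches either way.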
