Regular expressions in C, explained with examples
- 2020-06-19 11:12:55
- OfStack
A regular expression (often abbreviated to regex, regexp, or RE in code) is a single string that describes, and is used to match, a set of strings that follow a given syntactic rule.
In C, regular expressions are handled with the POSIX functions regcomp, regexec, regfree, and regerror. Processing a regular expression takes three steps:
Compile the regular expression with regcomp; match it against a string with regexec; release it with regfree.
The function prototypes
/*
Description: regcomp compiles the regular expression string regex into a regex_t, which regexec then uses for searching.
Parameters:
preg: pointer to a regex_t structure.
regex: the regular expression string.
cflags: 0, or the bitwise OR (|) of one or more of the following four values.
REG_EXTENDED: use POSIX extended regular expression syntax. If it is not set, POSIX basic regular expression syntax is used.
REG_ICASE: ignore the case of letters.
REG_NOSUB: do not report the positions of matches.
REG_NEWLINE: treat newline characters specially, as explained later.
Return value:
0: compilation succeeded;
non-zero: compilation failed; use regerror to get the error message.
*/
int regcomp(regex_t *preg, const char *regex, int cflags);
/*
Description: regexec matches a string against a compiled regular expression.
Parameters:
preg: pointer to the regex_t compiled by regcomp.
string: the string to match against.
nmatch: the number of elements in the pmatch array.
pmatch: an array of regmatch_t structures that receives the positions of the matched substrings.
regmatch_t is defined as follows:
typedef struct {
    regoff_t rm_so;
    regoff_t rm_eo;
} regmatch_t;
If rm_so is not -1, it is the offset in string where the matched substring starts, and rm_eo is the offset of the first character after the end of the matched substring.
eflags: 0, or the bitwise OR (|) of REG_NOTBOL and REG_NOTEOL, which are explained later.
Return value:
0: the string matched;
non-zero: no match (REG_NOMATCH) or an error; use regerror to get the message.
*/
int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags);
/*
Description: regfree frees the memory that regcomp allocated for the compiled regular expression.
Parameters:
preg: pointer to the regex_t compiled by regcomp.
*/
void regfree(regex_t *preg);
/*
Description: when regcomp or regexec fails, it returns a non-zero error code; regerror turns that code into an error message.
Parameters:
errcode: the error code returned by regcomp or regexec.
preg: pointer to the regex_t that was passed to regcomp.
errbuf: buffer that receives the error message.
errbuf_size: size of errbuf in bytes.
Return value:
the size of the buffer needed to hold the whole error message.
*/
size_t regerror(int errcode, const regex_t *preg, char *errbuf, size_t errbuf_size);
Example 1
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>
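int main(void)
{
    char ebuff[256];
    int ret;
    int cflags = REG_EXTENDED | REG_ICASE | REG_NOSUB;
    regex_t reg;
    const char *test_str = "Hello World";
    const char *reg_str = "H.*";

    /* Step 1: compile the pattern. */
    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, sizeof ebuff);
        fprintf(stderr, "%s\n", ebuff);
        return 1;   /* compilation failed, nothing to free */
    }
    /* Step 2: match. REG_NOSUB is set, so no match array is passed. */
    ret = regexec(&reg, test_str, 0, NULL, 0);
    if (ret)
    {
        regerror(ret, &reg, ebuff, sizeof ebuff);
        fprintf(stderr, "%s\n", ebuff);
        goto end;
    }
    /* ret is 0 here, so regerror yields the "no error" message. */
    regerror(ret, &reg, ebuff, sizeof ebuff);
    fprintf(stderr, "result is:\n%s\n", ebuff);
end:
    /* Step 3: release the compiled pattern. */
    regfree(&reg);
    return 0;
}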
Compiling and running it gives the following output:
[root@zxy regex]# ./test
result is:
Success
Match successful.
Example 2
What if I want to keep the matched text? Then you have to use the regmatch_t structure. Rewrite the code above without the REG_NOSUB option, as follows:
#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>
int main(void)
{
    int i;
    char ebuff[256];
    int ret;
    int cflags = REG_EXTENDED | REG_ICASE;
    regex_t reg;
    regmatch_t rm[5];
    char *part_str = NULL;
    const char *test_str = "Hello World";
    const char *reg_str = "e(.*)o";

    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, sizeof ebuff);
        fprintf(stderr, "%s\n", ebuff);
        return 1;   /* compilation failed, nothing to free */
    }
    ret = regexec(&reg, test_str, 5, rm, 0);
    if (ret)
    {
        regerror(ret, &reg, ebuff, sizeof ebuff);
        fprintf(stderr, "%s\n", ebuff);
        goto end;
    }
    regerror(ret, &reg, ebuff, sizeof ebuff);
    fprintf(stderr, "result is:\n%s\n\n", ebuff);
    /* Unused slots have rm_so == -1 and are skipped. */
    for (i = 0; i < 5; i++)
    {
        if (rm[i].rm_so > -1)
        {
            part_str = strndup(test_str + rm[i].rm_so, rm[i].rm_eo - rm[i].rm_so);
            if (part_str == NULL)
                break;
            fprintf(stderr, "%s\n", part_str);
            free(part_str);
            part_str = NULL;
        }
    }
end:
    regfree(&reg);
    return 0;
}
Compiling and running it gives the following output:
[root@zxy regex]# ./test
result is:
Success
ello Wo
llo W
Huh? Why are two matches printed when I only wanted one?
It turns out that the first element of the regmatch_t array is special: it holds the start and end offsets of the substring matched by the entire regular expression. The parenthesized subexpressions follow from index 1. So when sizing the regmatch_t array, remember that it needs one more element than the number of subexpressions you want to capture.
REG_NEWLINE, REG_NOTBOL and REG_NOTEOL
That covers the basics, so let's move on to REG_NEWLINE, REG_NOTBOL, and REG_NOTEOL. Many people are confused by these three flags. So was I: yesterday someone asked me about them, I passed on my own wrong understanding, and an expert promptly set me straight. I had always thought that using the ^ and $ anchors required the REG_NEWLINE flag, but it does not.
REG_NEWLINE
First, take a look at the man page's description of REG_NEWLINE:
REG_NEWLINE
Match-any-character operators don't match a newline.
A non-matching list ([^...]) not containing a newline does not match a newline.
Match-beginning-of-line operator (^) matches the empty string immediately after a newline, regardless of whether eflags, the execution flags of regexec(), contains REG_NOTBOL.
Match-end-of-line operator ($) matches the empty string immediately before a newline, regardless of whether eflags contains REG_NOTEOL.
My English is not good, so here is my reading of it:
REG_NEWLINE
1. Operators that match any character (for example '.') do not match a newline ('\n');
2. A non-matching list ([^...]) that does not itself contain a newline does not match a newline;
3. The match-beginning-of-line operator (^) matches the empty string immediately after a newline, whether or not REG_NOTBOL is set in eflags when regexec() is called;
4. The match-end-of-line operator ($) matches the empty string immediately before a newline, whether or not REG_NOTEOL is set in eflags when regexec() is called.
That is still hard to picture, so let's test each point with a program.
Question number one
The code is as follows:
#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>
int main(void)
{
    char ebuff[256];
    int ret;
    int cflags = REG_EXTENDED | REG_ICASE | REG_NOSUB;
    regex_t reg;
    const char *test_str = "Hello World\n";
    const char *reg_str = "Hello World.";

    /* First pass: without REG_NEWLINE. */
    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, sizeof ebuff);
        fprintf(stderr, "1. %s\n", ebuff);
        return 1;
    }
    ret = regexec(&reg, test_str, 0, NULL, 0);
    regerror(ret, &reg, ebuff, sizeof ebuff);
    fprintf(stderr, "2. %s\n", ebuff);
    regfree(&reg);  /* release before compiling again */

    /* Second pass: same pattern with REG_NEWLINE. */
    cflags |= REG_NEWLINE;
    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, sizeof ebuff);
        fprintf(stderr, "3. %s\n", ebuff);
        return 1;
    }
    ret = regexec(&reg, test_str, 0, NULL, 0);
    regerror(ret, &reg, ebuff, sizeof ebuff);
    fprintf(stderr, "4. %s\n", ebuff);

    regfree(&reg);
    return 0;
}
Compiling and running it gives the following output:
[root@zxy regex]# ./test
2. Success
4. No match
The result is clear: without REG_NEWLINE the match succeeds, and with REG_NEWLINE it fails. In other words, without REG_NEWLINE the any-character operator (.) matches '\n'; with REG_NEWLINE it does not.
Question number two
The code is as follows:
#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>
int main(void)
{
    char ebuff[256];
    int ret;
    int cflags = REG_EXTENDED | REG_ICASE | REG_NOSUB;
    regex_t reg;
    const char *test_str = "Hello\nWorld";
    const char *reg_str = "Hello[^ ]";

    /* First pass: without REG_NEWLINE. */
    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, sizeof ebuff);
        fprintf(stderr, "1. %s\n", ebuff);
        return 1;
    }
    ret = regexec(&reg, test_str, 0, NULL, 0);
    regerror(ret, &reg, ebuff, sizeof ebuff);
    fprintf(stderr, "2. %s\n", ebuff);
    regfree(&reg);  /* release before compiling again */

    /* Second pass: same pattern with REG_NEWLINE. */
    cflags |= REG_NEWLINE;
    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, sizeof ebuff);
        fprintf(stderr, "3. %s\n", ebuff);
        return 1;
    }
    ret = regexec(&reg, test_str, 0, NULL, 0);
    regerror(ret, &reg, ebuff, sizeof ebuff);
    fprintf(stderr, "4. %s\n", ebuff);

    regfree(&reg);
    return 0;
}
Compiling and running it gives the following output:
[root@zxy regex]# ./test
2. Success
4. No match
The results show that without REG_NEWLINE, a non-matching list that does not contain '\n' (here [^ ]) matches '\n'; with REG_NEWLINE, it does not.
Question number three
The code is as follows:
#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>
int main(void)
{
    char ebuff[256];
    int ret;
    int cflags = REG_EXTENDED | REG_ICASE | REG_NOSUB;
    regex_t reg;
    const char *test_str = "\nHello World";
    const char *reg_str = "^Hello";

    /* First pass: without REG_NEWLINE. */
    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, sizeof ebuff);
        fprintf(stderr, "1. %s\n", ebuff);
        return 1;
    }
    ret = regexec(&reg, test_str, 0, NULL, 0);
    regerror(ret, &reg, ebuff, sizeof ebuff);
    fprintf(stderr, "2. %s\n", ebuff);
    regfree(&reg);  /* release before compiling again */

    /* Second pass: same pattern with REG_NEWLINE. */
    cflags |= REG_NEWLINE;
    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, sizeof ebuff);
        fprintf(stderr, "3. %s\n", ebuff);
        return 1;
    }
    ret = regexec(&reg, test_str, 0, NULL, 0);
    regerror(ret, &reg, ebuff, sizeof ebuff);
    fprintf(stderr, "4. %s\n", ebuff);

    regfree(&reg);
    return 0;
}
Compiling and running it gives the following output:
[root@zxy regex]# ./test
2. No match
4. Success
The results show that without REG_NEWLINE, '^' matches only at the very start of the string, so "^Hello" cannot match a string that begins with '\n'. With REG_NEWLINE, '^' also matches immediately after a '\n', so the match succeeds.
Question number four
The code is as follows:
#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>
int main(void)
{
    char ebuff[256];
    int ret;
    int cflags = REG_EXTENDED | REG_ICASE | REG_NOSUB;
    regex_t reg;
    const char *test_str = "Hello World\n";
    const char *reg_str = "d$";

    /* First pass: without REG_NEWLINE. */
    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, sizeof ebuff);
        fprintf(stderr, "1. %s\n", ebuff);
        return 1;
    }
    ret = regexec(&reg, test_str, 0, NULL, 0);
    regerror(ret, &reg, ebuff, sizeof ebuff);
    fprintf(stderr, "2. %s\n", ebuff);
    regfree(&reg);  /* release before compiling again */

    /* Second pass: same pattern with REG_NEWLINE. */
    cflags |= REG_NEWLINE;
    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, sizeof ebuff);
        fprintf(stderr, "3. %s\n", ebuff);
        return 1;
    }
    ret = regexec(&reg, test_str, 0, NULL, 0);
    regerror(ret, &reg, ebuff, sizeof ebuff);
    fprintf(stderr, "4. %s\n", ebuff);

    regfree(&reg);
    return 0;
}
Compiling and running it gives the following output:
[root@zxy regex]# ./test
2. No match
4. Success
The results show that without REG_NEWLINE, '$' matches only at the very end of the string, so "d$" cannot match a string that ends with '\n'. With REG_NEWLINE, '$' also matches immediately before a '\n', so the match succeeds.
REG_NEWLINE summary
OK, that's the end of the REG_NEWLINE tests. Summary:
With the REG_NEWLINE option: 1. The any-character operator (.) does not match '\n'; 2. A non-matching list that does not contain '\n' does not match '\n'; 3. '^' and '$' also match immediately after and immediately before a '\n', so they work on every line of a multi-line string rather than only at the start and end of the whole string.
REG_NOTBOL and REG_NOTEOL
Now for REG_NOTBOL and REG_NOTEOL. First, the man page's description of the two options:
REG_NOTBOL
The match-beginning-of-line operator always fails to match (but see the compilation flag REG_NEWLINE above). This flag may be used when different portions of a string are passed to regexec() and the beginning of the string should not be interpreted as the beginning of the line.
REG_NOTEOL
The match-end-of-line operator always fails to match (but see the compilation flag REG_NEWLINE above).
After some more googling, my reading of it is:
REG_NOTBOL
The match-beginning-of-line operator (^) always fails to match (but see REG_NEWLINE). This flag is for when different portions of a string are passed to regexec() and the beginning of the passed-in portion should not be treated as the beginning of a line.
REG_NOTEOL
The match-end-of-line operator ($) always fails to match (but see REG_NEWLINE). This flag is for when different portions of a string are passed to regexec() and the end of the passed-in portion should not be treated as the end of a line.
OK, let's go ahead and test. The code for the first question, REG_NOTBOL, is as follows:
#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>
int main(void)
{
    char ebuff[256];
    int ret;
    int cflags = REG_EXTENDED | REG_ICASE | REG_NOSUB;
    regex_t reg;
    const char *test_str = "Hello World\n";
    const char *reg_str = "^e";

    ret = regcomp(&reg, reg_str, cflags);
    if (ret)
    {
        regerror(ret, &reg, ebuff, sizeof ebuff);
        fprintf(stderr, "1. %s\n", ebuff);
        return 1;
    }
    /* Start matching at the second character, without REG_NOTBOL. */
    ret = regexec(&reg, test_str + 1, 0, NULL, 0);
    regerror(ret, &reg, ebuff, sizeof ebuff);
    fprintf(stderr, "2. %s\n", ebuff);
    /* Same call, this time with REG_NOTBOL. */
    ret = regexec(&reg, test_str + 1, 0, NULL, REG_NOTBOL);
    regerror(ret, &reg, ebuff, sizeof ebuff);
    fprintf(stderr, "4. %s\n", ebuff);

    regfree(&reg);
    return 0;
}
Compiling and running it gives the following output:
[root@zxy regex]# ./test
2. Success
4. No match
Result: without REG_NOTBOL, '^' matches at the start of whatever pointer is passed to regexec(), even if that pointer is in the middle of a larger string; with REG_NOTBOL, '^' does not match there.
As for the second question, REG_NOTEOL, I really could not work it out; the explanations I found online were all unverified.
conclusion