How do you find all matches in regexes with C?
Mastering Regex: Finding All Matches in C with POSIX

Learn how to effectively use POSIX regular expressions in C to find all occurrences of a pattern within a string, including setup, execution, and result extraction.
Regular expressions are a powerful tool for pattern matching in text. In C, the POSIX regex library provides a standard way to work with regular expressions. While finding the first match is straightforward, extracting all non-overlapping matches requires a bit more effort. This article walks through using regcomp(), regexec(), and regfree() to find every match of a pattern in a string.
Understanding POSIX Regex in C
The POSIX regex API in C consists of a few key functions:
regcomp(): Compiles a regular expression into an internal representation.
regexec(): Executes a compiled regular expression against a string.
regfree(): Frees the memory allocated for the compiled regular expression.
To find all matches, we typically need to call regexec() repeatedly on the remaining portion of the string after each successful match. This involves careful management of string pointers and match offsets.
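Before looping, it may help to see the three functions in one compile, execute, free cycle. The following is a minimal sketch, not a definitive implementation: the helper name has_match is hypothetical, and the sketch assumes a plain yes/no answer is enough, which is why it passes REG_NOSUB and no match array.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern` matches anywhere in `text`,
 * 0 if it does not, and -1 if the pattern fails to compile. */
int has_match(const char *text, const char *pattern) {
    regex_t regex;

    /* REG_NOSUB: only a yes/no result is needed, so no offsets are recorded. */
    if (regcomp(&regex, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
        return -1; /* compilation failed; regex is not usable */
    }

    /* nmatch = 0 and pmatch = NULL because no offsets are requested. */
    int rc = regexec(&regex, text, 0, NULL, 0);

    /* Always release the compiled pattern once it is no longer needed. */
    regfree(&regex);

    return rc == 0 ? 1 : 0;
}
```

Calling has_match("apple banana", "ban+a") should return 1, while a pattern with no occurrence in the text returns 0.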
Figure: Workflow for finding all regex matches using POSIX C functions. Compile the regex with regcomp(), initialize the search position, then call regexec() in a loop: on a match, extract the match and submatches and advance the search position; when no match remains, free the regex with regfree().
Setting Up for Multiple Matches
When searching for multiple matches, it's crucial to understand how regexec() reports its findings. It returns the start and end offsets of the first match, measured from the beginning of the string you pass in, not from the beginning of the original text. To find subsequent matches, move your search starting point to immediately after the end of the previous match. This ensures you don't re-match the same substring and correctly identify non-overlapping occurrences.
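The relative-offset behavior can be sketched in isolation. The helper below is hypothetical (find_from is not part of the POSIX API): it searches from a given byte position and converts the relative rm_so back to an absolute offset. It assumes `from` does not exceed the length of the text, and it passes REG_NOTBOL when the search does not start at the beginning, so that ^ is not treated as matching at the cursor.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: absolute offset of the first match of `pattern`
 * in `text` at or after byte `from`, or -1 if there is none (or if the
 * pattern does not compile). Assumes from <= strlen(text). */
long find_from(const char *text, const char *pattern, size_t from) {
    regex_t regex;
    regmatch_t m;
    long result = -1;

    if (regcomp(&regex, pattern, REG_EXTENDED) != 0) {
        return -1;
    }

    /* text + from is not the real start of the text, so '^' must not
     * match there. */
    int eflags = from > 0 ? REG_NOTBOL : 0;

    if (regexec(&regex, text + from, 1, &m, eflags) == 0) {
        /* rm_so is relative to text + from; add `from` back. */
        result = (long)from + (long)m.rm_so;
    }

    regfree(&regex);
    return result;
}
```

With the text "apple banana cherry apple banana" and the pattern "apple", searching from 0 yields 0, and searching from 5 (just past the first match) yields 20.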
#include <stdio.h>
#include <stdlib.h>
#include <regex.h>
#include <string.h>

#define MAX_GROUPS 10 // Full match plus up to 9 capturing groups

void find_all_matches(const char *text, const char *pattern) {
    regex_t regex;
    regmatch_t pmatch[MAX_GROUPS];
    int reti;
    const char *cursor = text; // Current search position
    int offset = 0;            // Absolute position of cursor within text
    int match_count = 0;

    // Compile the regular expression
    reti = regcomp(&regex, pattern, REG_EXTENDED);
    if (reti) {
        fprintf(stderr, "Could not compile regex\n");
        return;
    }

    printf("Searching for pattern '%s' in text: '%s'\n", pattern, text);

    // Loop to find all matches
    while (1) {
        // Past the first search, cursor is not the start of the text,
        // so tell regexec() that '^' must not match there
        reti = regexec(&regex, cursor, MAX_GROUPS, pmatch,
                       offset > 0 ? REG_NOTBOL : 0);
        if (!reti) { // Match found
            match_count++;
            printf("Match %d (offset %d):\n", match_count, offset + (int)pmatch[0].rm_so);

            // Print the full match (offsets are relative to cursor)
            int start = (int)pmatch[0].rm_so;
            int end = (int)pmatch[0].rm_eo;
            printf(" Full match: '%.*s'\n", (end - start), cursor + start);

            // Print capturing groups (if any); unused entries are set to -1
            for (int i = 1; i < MAX_GROUPS; i++) {
                if (pmatch[i].rm_so != -1 && pmatch[i].rm_eo != -1) {
                    start = (int)pmatch[i].rm_so;
                    end = (int)pmatch[i].rm_eo;
                    printf(" Group %d: '%.*s'\n", i, (end - start), cursor + start);
                }
            }

            // Advance cursor past the current match for the next search
            cursor += pmatch[0].rm_eo;
            offset += (int)pmatch[0].rm_eo;

            // An empty match would never advance the cursor: step one
            // character forward, or stop at the end of the text
            if (pmatch[0].rm_so == pmatch[0].rm_eo) {
                if (*cursor == '\0') {
                    break;
                }
                cursor++;
                offset++;
            }
        } else if (reti == REG_NOMATCH) { // No more matches
            printf("No more matches found.\n");
            break;
        } else { // An error occurred
            char errbuf[100];
            regerror(reti, &regex, errbuf, sizeof(errbuf));
            fprintf(stderr, "Regex match failed: %s\n", errbuf);
            break;
        }
    }

    // Free the compiled regular expression
    regfree(&regex);
}

int main(void) {
    const char *text1 = "apple banana cherry apple banana";
    const char *pattern1 = "apple";
    find_all_matches(text1, pattern1);
    printf("\n");

    const char *text2 = "The quick brown fox jumps over the lazy dog.";
    const char *pattern2 = "(quick|lazy) (brown|dog)"; // Example with capturing groups
    find_all_matches(text2, pattern2);
    printf("\n");

    const char *text3 = "123-456-7890, 555-123-4567";
    const char *pattern3 = "([0-9]{3})-([0-9]{3})-([0-9]{4})"; // Phone numbers
    find_all_matches(text3, pattern3);
    printf("\n");

    return 0;
}
C code to find and print all non-overlapping regex matches, including capturing groups.
The REG_EXTENDED flag in regcomp() enables extended regular expression (ERE) syntax, in which +, ?, |, and grouping parentheses work without backslash escapes. This is generally more convenient than basic regular expressions. Consider using it unless strict POSIX basic regex compliance is required.
Handling Overlapping Matches (Advanced)
The provided example finds non-overlapping matches. If you need all matches, including those that overlap (e.g., every occurrence of aba in abababa), the approach needs modification. Instead of advancing the cursor to the end of the match (pmatch[0].rm_eo), advance it to one character past the start of the match (pmatch[0].rm_so + 1). The next regexec() call then re-evaluates the pattern from the following character and can report a match that overlaps the previous one. Two caveats apply. First, regexec() reports only one match per starting position (the leftmost, longest one), so shorter alternatives that begin at the same position are not enumerated. Second, POSIX regular expressions have no lookahead syntax, so the lookahead idiom used for overlapping matches in other regex flavors is not available here.
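The start-plus-one strategy can be sketched as follows. This is an illustrative sketch under stated assumptions, not the only way to do it: count_overlapping is a hypothetical helper name, it only counts matches rather than extracting them, and it stops when a match begins at the terminating null byte so that patterns able to match the empty string cannot loop forever.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: counts matches of `pattern` in `text`, allowing
 * overlaps, by restarting each search one character after the start of
 * the previous match. Returns -1 if the pattern does not compile. */
int count_overlapping(const char *text, const char *pattern) {
    regex_t regex;
    regmatch_t m;
    size_t base = 0; /* absolute offset of the current search position */
    int count = 0;

    if (regcomp(&regex, pattern, REG_EXTENDED) != 0) {
        return -1;
    }

    for (;;) {
        /* Past the first search, '^' must not match at the cursor. */
        int eflags = base > 0 ? REG_NOTBOL : 0;
        if (regexec(&regex, text + base, 1, &m, eflags) != 0) {
            break; /* REG_NOMATCH (or an error): stop searching */
        }
        count++;

        /* A match starting at the end of the text leaves nothing to scan. */
        if (text[base + (size_t)m.rm_so] == '\0') {
            break;
        }

        /* Advance to one character past the START of this match. */
        base += (size_t)m.rm_so + 1;
    }

    regfree(&regex);
    return count;
}
```

For the text abababa and the pattern aba, this counts three matches (starting at offsets 0, 2, and 4), whereas the non-overlapping loop finds only two.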
Repeated regexec() calls can be computationally expensive on long inputs. For extremely high-performance scenarios, consider specialized string searching algorithms or highly optimized regex libraries if POSIX regex proves too slow.
1. Include Headers
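Start by including the necessary headers: <regex.h> for the regex API and <stdio.h> for output. The example also includes <stdlib.h> and <string.h>, which are commonly needed once you start copying or storing matches.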
2. Compile the Regex
Use regcomp(&regex, pattern, REG_EXTENDED) to compile your regular expression. Always check the return value for errors.
3. Initialize Search Cursor
Create a cursor pointer that starts at the beginning of your text. This cursor will advance after each match.
4. Loop for Matches
Enter a while(1) loop. Inside, call regexec(&regex, cursor, MAX_GROUPS, pmatch, 0).
5. Process Match
If regexec() returns 0 (success), extract the match and any capturing groups using pmatch[i].rm_so and pmatch[i].rm_eo. Remember that these offsets are relative to the cursor, not to the start of the original text.
6. Advance Cursor
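Crucially, update cursor += pmatch[0].rm_eo; to move the search starting point past the current match. Also update a running offset if you need absolute positions. If the match is empty (rm_so equals rm_eo), advance by one extra character so the loop cannot stall on the same position.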
7. Handle No Match/Errors
If regexec() returns REG_NOMATCH, break the loop. If it returns any other value, an error occurred; use regerror() to get a descriptive message.
8. Free Resources
After the loop, call regfree(&regex) to release the memory allocated for the compiled regular expression.
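The error-checking advice in steps 2 and 7 can be made more informative by asking regerror() for the message text. The sketch below assumes a hypothetical helper named compile_or_report. It relies on the documented POSIX behavior that regerror() called with a buffer size of 0 returns the size needed for the message, so the buffer does not have to be a fixed length.

```c
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: compiles `pattern` with REG_EXTENDED. On failure it
 * prints the regerror() message to stderr. Returns 0 on success, or the
 * nonzero regcomp() error code otherwise. */
int compile_or_report(regex_t *regex, const char *pattern) {
    int rc = regcomp(regex, pattern, REG_EXTENDED);
    if (rc != 0) {
        /* With a zero-size buffer, regerror() returns the size required
         * to hold the message, including the terminating null byte. */
        size_t needed = regerror(rc, regex, NULL, 0);
        char *msg = malloc(needed);
        if (msg != NULL) {
            regerror(rc, regex, msg, needed);
            fprintf(stderr, "regcomp failed for '%s': %s\n", pattern, msg);
            free(msg);
        }
    }
    return rc;
}
```

On success the caller owns the compiled regex and is still responsible for calling regfree() on it after the matching loop.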