C: STRTOK exception
Understanding and Avoiding strtok Exceptions in C
Explore common pitfalls and robust alternatives to strtok in C for safe and efficient string tokenization.
The strtok function in C is a common utility for breaking a string into a series of tokens using a set of delimiter characters. While seemingly straightforward, strtok has several design quirks and limitations that can lead to unexpected behavior, including crashes, infinite loops, and data corruption. C has no exception mechanism, so these failures are 'exceptions' only in a broader sense. This article delves into these issues and provides safer, more predictable alternatives.
How strtok Works (and Where It Fails)
The strtok function modifies the input string by overwriting the delimiter that ends each token with a null terminator (\0). It also maintains an internal static pointer to keep track of its position in the string across successive calls. This design choice is the root cause of many problems.
flowchart TD
    A["Call strtok(str, delim)"] --> B{"Is str NULL?"}
    B -->|No| C["Store str internally"]
    B -->|Yes| D["Use internal pointer"]
    C --> E["Find first delimiter in str"]
    D --> E
    E --> F{"Delimiter found?"}
    F -->|Yes| G["Replace delimiter with '\0'"]
    G --> H["Return pointer to token"]
    H --> I["Update internal pointer to next char"]
    F -->|No| J["Return pointer to remaining string"]
    J --> K["Set internal pointer to NULL"]
    I --> L["Next call: strtok(NULL, delim)"]
    K --> L
Simplified strtok internal logic flow
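To make the in-place modification visible, here is a minimal sketch. The helper names tokenize_in_place and dump_bytes are invented for this illustration and are not part of any library; the sketch assumes a comma delimiter and a writable buffer.

```c
#include <stdio.h>
#include <string.h>

/* Tokenizes buf in place on ',' and returns the number of tokens found.
 * buf must be a writable, null-terminated array. */
size_t tokenize_in_place(char *buf)
{
    size_t count = 0;
    for (char *tok = strtok(buf, ","); tok != NULL; tok = strtok(NULL, ",")) {
        printf("token: %s\n", tok);
        count++;
    }
    return count;
}

/* Prints every byte of the array, showing each null byte as the two
 * characters \0 so the terminators written by strtok become visible. */
void dump_bytes(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\0')
            printf("\\0");
        else
            putchar(buf[i]);
    }
    putchar('\n');
}
```

Calling tokenize_in_place on an array initialized with "a,b,c" and then dump_bytes on all six bytes of that array shows that both commas have been overwritten: after tokenization the array no longer holds the original string, only its first token.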
A key caveat of strtok is that it modifies the original string. If you pass a string literal (which typically resides in read-only memory) to strtok, the behavior is undefined and your program will likely crash with a segmentation fault. Always ensure the input string is a mutable character array.
Common strtok Pitfalls
Understanding the specific scenarios where strtok can cause issues is crucial for writing robust C code. Here are the most common 'exceptions' you might encounter:
1. Modifying String Literals
Passing a string literal to strtok, either directly or through a pointer whose const qualifier has been cast away, results in undefined behavior, typically a segmentation fault, because strtok attempts to write to memory that is usually read-only.
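One way to sidestep the problem is to tokenize a private, writable copy of the input. The sketch below assumes the caller only needs a token count; count_tokens_readonly is a hypothetical helper name, and the copy is made with malloc and memcpy so that nothing beyond standard C is required.

```c
#include <stdlib.h>
#include <string.h>

/* Counts the tokens in a read-only string by tokenizing a heap copy.
 * Returns -1 if the copy cannot be allocated. */
int count_tokens_readonly(const char *input, const char *delims)
{
    size_t len = strlen(input) + 1;      /* include the terminator */
    char *copy = malloc(len);
    if (copy == NULL)
        return -1;
    memcpy(copy, input, len);

    int count = 0;
    for (char *tok = strtok(copy, delims); tok != NULL; tok = strtok(NULL, delims))
        count++;

    free(copy);                          /* the tokens pointed into copy */
    return count;
}
```

Because the tokens point into the copy, any token that must outlive the function has to be duplicated before free is called.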
2. Non-Reentrancy and Thread Safety
Due to its internal static pointer, strtok is not reentrant. If one tokenization is still in progress when other code calls strtok on a different string (for example a helper function called inside the loop, or a signal handler), the internal state is overwritten, leading to incorrect tokenization or crashes. For the same reason, strtok is unsuitable for multi-threaded environments, where concurrent calls can race on that shared state.
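The following deliberately broken sketch shows the state corruption in a single thread. The function name count_records_nested_strtok and the "key=value;key=value" input format are assumptions made for the illustration.

```c
#include <string.h>

/* BROKEN ON PURPOSE: the inner strtok loop overwrites the hidden state
 * that the outer loop depends on.
 * Returns how many ';'-separated records the outer loop actually visits. */
size_t count_records_nested_strtok(char *buf)
{
    size_t records = 0;
    for (char *rec = strtok(buf, ";"); rec != NULL; rec = strtok(NULL, ";")) {
        records++;
        /* Splitting "key=value" repoints strtok's internal pointer at rec. */
        for (char *part = strtok(rec, "="); part != NULL; part = strtok(NULL, "="))
            ; /* a real program would use the key and value here */
    }
    return records;
}
```

For a buffer holding "a=1;b=2;c=3" the function returns 1 rather than 3. Once the inner loop has exhausted the first record, the next outer call resumes from the inner string's end and finds nothing left.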
3. Handling Empty Tokens
strtok treats multiple consecutive delimiters as a single delimiter. For example, tokenizing a mutable buffer that holds "a,,b" with "," as the delimiter returns "a" and then "b", effectively skipping the empty token between the two commas. Leading and trailing delimiters are skipped in the same way. If you need to preserve empty tokens, strtok is not the right tool.
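The mismatch is easy to quantify. In this sketch, count_strtok_tokens and count_fields are hypothetical helpers: the first counts what strtok reports, the second counts what a CSV-style reader would expect (one more than the number of delimiter characters).

```c
#include <string.h>

/* Number of tokens strtok reports for buf (buf is modified in place). */
size_t count_strtok_tokens(char *buf, const char *delims)
{
    size_t count = 0;
    for (char *tok = strtok(buf, delims); tok != NULL; tok = strtok(NULL, delims))
        count++;
    return count;
}

/* Number of fields when empty fields count too: delimiters + 1. */
size_t count_fields(const char *s, char delim)
{
    size_t fields = 1;
    for (; *s != '\0'; s++) {
        if (*s == delim)
            fields++;
    }
    return fields;
}
```

For "a,,b" the first function yields 2 while the second yields 3. That difference is the silently dropped empty field.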
4. Buffer Overflows (Indirectly)
While strtok itself doesn't directly cause buffer overflows, its use often leads to situations where subsequent operations on the extracted tokens (e.g., copying them to fixed-size buffers) can cause overflows if not handled carefully.
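A bounded copy avoids this class of bug. The sketch below uses snprintf, which never writes more than the given size and always null-terminates when the size is non-zero; copy_token is a hypothetical helper name.

```c
#include <stdio.h>
#include <string.h>

/* Copies tok into dst (capacity dst_size bytes) and null-terminates it.
 * Returns 1 if the whole token fit, 0 if it was truncated or dst_size is 0. */
int copy_token(char *dst, size_t dst_size, const char *tok)
{
    if (dst_size == 0)
        return 0;
    /* snprintf returns the length the full output would have had. */
    int needed = snprintf(dst, dst_size, "%s", tok);
    return needed >= 0 && (size_t)needed < dst_size;
}
```

Checking the return value lets the caller reject or report over-long tokens instead of silently working with a truncated one.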
Safer Alternatives to strtok
Given the limitations of strtok, it's generally recommended to use a safer alternative. The most common choices are strtok_r (the reentrant version, specified by POSIX rather than by the C standard) and manual parsing.
#define _POSIX_C_SOURCE 200809L // Request the POSIX declaration of strtok_r
#include <stdio.h>
#include <string.h>

int main(void) {
    char str[] = "apple,banana,orange,grape";
    char *token;
    char *save = NULL; // Parsing state, owned by the caller

    printf("Using strtok_r:\n");
    // Pass the string on the first call and NULL on every later call
    for (token = strtok_r(str, ",", &save); token != NULL;
         token = strtok_r(NULL, ",", &save)) {
        printf("Token: %s\n", token);
    }

    // Two interleaved tokenizations, each with its own save pointer
    char str2[] = "one-two-three";
    char *token2;
    char *save2 = NULL;
    printf("\nDemonstrating reentrancy with strtok_r:\n");
    token2 = strtok_r(str2, "-", &save2); // First token
    printf("Outer Token: %s\n", token2);

    // Simulate an inner function call that also uses strtok_r
    char inner_str[] = "A B C";
    char *inner_token;
    char *inner_save = NULL;
    for (inner_token = strtok_r(inner_str, " ", &inner_save); inner_token != NULL;
         inner_token = strtok_r(NULL, " ", &inner_save)) {
        printf(" Inner Token: %s\n", inner_token);
    }

    token2 = strtok_r(NULL, "-", &save2); // Continue outer tokenization
    printf("Outer Token: %s\n", token2);
    return 0;
}
Example of strtok_r for reentrant and safer tokenization.
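The listing above interleaves two tokenizations in straight-line code. A more typical case is a nested loop, sketched below under the assumption of a POSIX system and a hypothetical "key=value;key=value" input format; count_pairs is an invented helper name.

```c
#define _POSIX_C_SOURCE 200809L /* asks strict compilers to declare strtok_r */
#include <string.h>

/* Counts well-formed "key=value" pairs in a ';'-separated buffer.
 * buf is modified in place, exactly as with strtok. */
size_t count_pairs(char *buf)
{
    size_t pairs = 0;
    char *outer_save = NULL;             /* state of the record loop */
    for (char *rec = strtok_r(buf, ";", &outer_save); rec != NULL;
         rec = strtok_r(NULL, ";", &outer_save)) {
        char *inner_save = NULL;         /* separate state for this record */
        char *key = strtok_r(rec, "=", &inner_save);
        char *value = strtok_r(NULL, "=", &inner_save);
        if (key != NULL && value != NULL)
            pairs++;
    }
    return pairs;
}
```

Because the outer and inner loops each have their own save pointer, splitting a record does not disturb the position of the record loop.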
strtok_r takes an additional argument, char **saveptr, in which the caller holds the parsing state between calls. This makes it reentrant and thread-safe, provided each tokenization uses its own saveptr and its own buffer. Note that strtok_r still modifies the input string and still skips empty tokens. Always prefer strtok_r over strtok when available.
Manual Parsing for Full Control
For scenarios requiring precise control over delimiters, handling of empty tokens, or avoiding string modification, manual parsing using functions like strchr or strstr is the most flexible approach.
#include <stdio.h>
#include <string.h>

int main(void) {
    char str[] = "value1,,value2,value3,";
    char *current_pos = str;
    char *delimiter_pos;
    const char *delimiter = ",";
    int token_count = 0;

    printf("Using manual parsing (strchr):\n");
    while (*current_pos != '\0') {
        delimiter_pos = strchr(current_pos, *delimiter);
        if (delimiter_pos == NULL) {
            // No more delimiters, the rest is the last token
            printf("Token %d: '%s'\n", ++token_count, current_pos);
            break;
        } else {
            // Delimiter found, extract token
            *delimiter_pos = '\0'; // Temporarily null-terminate the token
            printf("Token %d: '%s'\n", ++token_count, current_pos);
            *delimiter_pos = *delimiter; // Restore the delimiter if needed later
            current_pos = delimiter_pos + 1; // Move past the delimiter
        }
    }

    // Handle trailing empty token if the string ends with a delimiter
    if (current_pos > str && *(current_pos - 1) == *delimiter && *current_pos == '\0') {
        printf("Token %d: '' (empty)\n", ++token_count);
    }
    return 0;
}
Manual string tokenization using strchr to handle empty tokens.
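The listing above still writes to the buffer temporarily. When the input must stay untouched (for instance a string literal or a const buffer), one possible variation reports each field as a pointer plus a length instead. This is a sketch, not a drop-in replacement: the names split_fields, field_fn, and print_field are invented for the illustration, and strcspn is used so that several delimiter characters can be given.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical callback type: receives each field as pointer + length. */
typedef void (*field_fn)(const char *start, size_t len, void *ctx);

/* Splits s on any character in delims without ever writing to s.
 * Empty fields are reported with len == 0. Returns the number of fields. */
size_t split_fields(const char *s, const char *delims, field_fn fn, void *ctx)
{
    size_t count = 0;
    for (;;) {
        size_t len = strcspn(s, delims); /* length up to next delimiter or end */
        if (fn != NULL)
            fn(s, len, ctx);
        count++;
        if (s[len] == '\0')              /* reached the end of the input */
            break;
        s += len + 1;                    /* step past the delimiter */
    }
    return count;
}

/* Example callback: prints one field per line, in single quotes. */
void print_field(const char *start, size_t len, void *ctx)
{
    (void)ctx;                           /* unused in this example */
    printf("'%.*s'\n", (int)len, start);
}
```

Calling split_fields("value1,,value2,value3,", ",", print_field, NULL) reports five fields, including the empty one in the middle and the empty one after the trailing comma. Unlike the listing above, this sketch treats an empty input as a single empty field; whether that is desirable depends on the data format.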