Wildcard string matching

Learn wildcard string matching with practical examples, diagrams, and best practices. Covers algorithm development techniques with visual explanations.

Mastering Wildcard String Matching: Algorithms and Implementations

A magnifying glass hovering over a string of text with asterisks and question marks, symbolizing wildcard matching.

Explore the fundamental algorithms and practical implementations for wildcard string matching, covering classic approaches and their complexities.

Wildcard string matching is a common problem in computer science, essential for tasks like file searching, text processing, and regular expression engines. It involves determining if a given text string matches a pattern containing special 'wildcard' characters. The most common wildcards are * (which matches zero or more characters) and ? (which matches exactly one character). This article delves into the algorithms used to solve this problem, focusing on their logic, complexity, and practical application.

Understanding the Problem: Wildcards Explained

Before diving into algorithms, it's crucial to understand the behavior of the two primary wildcard characters:

? (Question Mark): This wildcard matches any single character. For example, the pattern a?c would match abc, adc, but not ac or abcc.
* (Asterisk): This wildcard matches zero or more occurrences of any character. This is where the complexity often arises. For instance, the pattern a*c would match ac, abc, abcc, axbyc, and so on.

The challenge lies in efficiently handling the * wildcard, as it can represent a variable number of characters, leading to multiple potential matching paths.

flowchart TD
    A[Start Matching] --> B{Current Pattern Char?}
    B --> |'?'| C[Match Any Single Char]
    B --> |'*'| D[Match Zero or More Chars]
    B --> |Literal Char| E{Text Char Matches?}
    C --> F[Advance Both Ptr]
    E --> |Yes| F
    E --> |No| G[Mismatch]
    D --> H[Try Matching Zero Chars]
    D --> I[Try Matching One or More Chars]
    H --> F
    I --> F
    F --> B
    G --> J[End Mismatch]
    F --> K{End of Pattern & Text?}
    K --> |Yes| L[Match Found]
    K --> |No| G

Basic Logic Flow for Wildcard Matching

Recursive Approach with Backtracking

A natural way to think about wildcard matching is through recursion with backtracking. The core idea is to consider different possibilities when a wildcard character is encountered. This approach can be quite intuitive but may suffer from performance issues due to redundant computations if not optimized.

Let's define a function isMatch(text, pattern) that returns true if text matches pattern.

Base Cases:

If both text and pattern are exhausted, it's a match.
If pattern is exhausted but text is not, it's a mismatch (unless the remaining text can be consumed by a preceding *).
If text is exhausted but pattern is not, it's a mismatch (unless the remaining pattern consists only of * characters).

Recursive Steps:

If pattern[j] is ? or pattern[j] matches text[i]: Advance both i and j.
If pattern[j] is *: This is the tricky part. The * can match zero characters, or one character, or two, and so on. We explore two possibilities:
1. * matches zero characters: Try matching text[i] with pattern[j+1] (effectively skipping *).
2. * matches one or more characters: Try matching text[i+1] with pattern[j] (effectively consuming text[i] with * and keeping * active for subsequent characters).

This recursive solution can be optimized using memoization (dynamic programming) to store results of subproblems and avoid recomputing them.

def isMatchRecursive(text, pattern):
    memo = {}

    def dp(i, j):
        if (i, j) in memo:
            return memo[(i, j)]

        # Base cases
        if j == len(pattern):
            return i == len(text)

        if i == len(text):
            # If text is exhausted, pattern must be all '*' to match
            for k in range(j, len(pattern)):
                if pattern[k] != '*':
                    return False
            return True

        # Recursive steps
        if pattern[j] == '?' or pattern[j] == text[i]:
            result = dp(i + 1, j + 1)
        elif pattern[j] == '*':
            # '*' matches zero characters (skip '*') OR
            # '*' matches one or more characters (consume text[i] with '*' and keep '*' active)
            result = dp(i, j + 1) or dp(i + 1, j)
        else:
            result = False
        
        memo[(i, j)] = result
        return result

    return dp(0, 0)

# Example Usage:
# print(isMatchRecursive("adceb", "*a*b")) # True
# print(isMatchRecursive("acdcb", "a*c?b")) # False
# print(isMatchRecursive("aa", "a")) # False

Python implementation of recursive wildcard matching with memoization.

💡

The recursive approach with memoization effectively transforms the problem into a dynamic programming one, significantly improving performance by avoiding redundant calculations for overlapping subproblems.

Iterative Dynamic Programming Approach

An iterative dynamic programming (DP) approach provides a more structured way to solve the wildcard matching problem, often with better space complexity than a purely recursive solution (if memoization is not carefully managed). We typically use a 2D boolean array, dp[i][j], where dp[i][j] is true if the first i characters of the text match the first j characters of the pattern.

Initialization:

dp[0][0] = true (empty text matches empty pattern).
dp[0][j] = true if pattern[0...j-1] consists only of * characters (an empty string can be matched by *, **, etc.).

Transitions:

For dp[i][j]:

If pattern[j-1] is ? or pattern[j-1] matches text[i-1]: dp[i][j] = dp[i-1][j-1] (match the current characters and rely on the previous subproblem).
If pattern[j-1] is *: dp[i][j] can be true if either:
- * matches zero characters: dp[i][j-1] (effectively skipping * in the pattern).
- * matches one or more characters: dp[i-1][j] (effectively consuming text[i-1] with * and keeping * active). So, dp[i][j] = dp[i][j-1] || dp[i-1][j].
Otherwise (literal mismatch): dp[i][j] = false.

class Solution {
    public boolean isMatch(String s, String p) {
        int m = s.length();
        int n = p.length();

        boolean[][] dp = new boolean[m + 1][n + 1];

        // Base case: empty string matches empty pattern
        dp[0][0] = true;

        // Initialize for patterns starting with '*' (matching empty string)
        for (int j = 1; j <= n; j++) {
            if (p.charAt(j - 1) == '*') {
                dp[0][j] = dp[0][j - 1];
            }
        }

        // Fill the DP table
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                if (p.charAt(j - 1) == '?' || p.charAt(j - 1) == s.charAt(i - 1)) {
                    dp[i][j] = dp[i - 1][j - 1];
                } else if (p.charAt(j - 1) == '*') {
                    // '*' matches zero characters (dp[i][j-1])
                    // OR '*' matches one or more characters (dp[i-1][j])
                    dp[i][j] = dp[i][j - 1] || dp[i - 1][j];
                } else {
                    dp[i][j] = false;
                }
            }
        }

        return dp[m][n];
    }
}

// Example Usage:
// Solution sol = new Solution();
// System.out.println(sol.isMatch("adceb", "*a*b")); // True
// System.out.println(sol.isMatch("acdcb", "a*c?b")); // False
// System.out.println(sol.isMatch("aa", "a")); // False

Java implementation of iterative dynamic programming for wildcard matching.

⚠️

Be mindful of the edge cases, especially when the pattern or text is empty, or when the pattern starts with *. Correct initialization of the DP table is crucial for accurate results.

Greedy Approach (Two Pointers)

For certain types of wildcard matching problems (specifically, when * can match any sequence of characters, but not when it's tied to a specific character like in regular expressions), a greedy approach using two pointers can be more efficient. This method avoids the overhead of DP table construction but requires careful handling of backtracking for *.

We maintain pointers for the current position in text (i) and pattern (j). We also need to store the last encountered * position (star_idx) and the corresponding text position (match_idx) where * started matching.

Algorithm Steps:

Iterate through text with pointer i.
If pattern[j] is ? or pattern[j] matches text[i]: Advance both i and j.
If pattern[j] is *: Store star_idx = j and match_idx = i. Advance j (assume * matches zero characters for now).
If pattern[j] does not match text[i] and a * was previously encountered (star_idx != -1):
- Backtrack: Reset j to star_idx + 1.
- Advance match_idx by one (meaning the * now consumes one more character from text).
- Reset i to match_idx.
If no * was encountered and there's a mismatch: Return false.
After iterating through text, check if the remaining pattern consists only of * characters. If so, it's a match; otherwise, it's a mismatch.

function isMatchGreedy(s, p) {
    let sPtr = 0;
    let pPtr = 0;
    let starIdx = -1;
    let matchIdx = 0;

    while (sPtr < s.length) {
        // Case 1: Current characters match or pattern has '?'
        if (pPtr < p.length && (p.charAt(pPtr) === '?' || p.charAt(pPtr) === s.charAt(sPtr))) {
            sPtr++;
            pPtr++;
        }
        // Case 2: Pattern has '*'
        else if (pPtr < p.length && p.charAt(pPtr) === '*') {
            starIdx = pPtr; // Store '*' position
            matchIdx = sPtr; // Store text position where '*' started matching
            pPtr++; // Advance pattern pointer, assuming '*' matches 0 chars for now
        }
        // Case 3: Mismatch, but a '*' was seen previously
        else if (starIdx !== -1) {
            pPtr = starIdx + 1; // Backtrack pattern pointer to after the last '*'
            matchIdx++; // Advance text pointer, making '*' consume one more char
            sPtr = matchIdx; // Reset text pointer to new matchIdx
        }
        // Case 4: Mismatch and no '*' to backtrack to
        else {
            return false;
        }
    }

    // Consume any remaining '*' in the pattern
    while (pPtr < p.length && p.charAt(pPtr) === '*') {
        pPtr++;
    }

    return pPtr === p.length;
}

// Example Usage:
// console.log(isMatchGreedy("adceb", "*a*b")); // True
// console.log(isMatchGreedy("acdcb", "a*c?b")); // False
// console.log(isMatchGreedy("aa", "a")); // False

JavaScript implementation of the greedy (two-pointer) wildcard matching algorithm.

ℹ️

The greedy approach is generally more efficient in terms of space complexity (O(1) auxiliary space) compared to DP (O(mn) space), but its time complexity can still be O(mn) in worst-case scenarios due to backtracking.

Wildcard string matching

Tags:

Categories:

Mastering Wildcard String Matching: Algorithms and Implementations

Understanding the Problem: Wildcards Explained

Recursive Approach with Backtracking

Iterative Dynamic Programming Approach

Greedy Approach (Two Pointers)