Wildcard string matching

Learn wildcard string matching with practical examples, diagrams, and best practices. Covers algorithm development techniques with visual explanations.

Mastering Wildcard String Matching: Algorithms and Implementations

Hero image for Wildcard string matching

Explore the fundamental algorithms and practical implementations for wildcard string matching, covering classic approaches and their complexities.

Wildcard string matching is a common problem in computer science, essential for tasks like file searching, text processing, and regular expression engines. It involves determining if a given text string matches a pattern containing special 'wildcard' characters. The most common wildcards are * (which matches zero or more characters) and ? (which matches exactly one character). This article delves into the algorithms used to solve this problem, focusing on their logic, complexity, and practical application.

Understanding the Problem: Wildcards Explained

Before diving into algorithms, it's crucial to understand the behavior of the two primary wildcard characters:

  • ? (Question Mark): This wildcard matches any single character. For example, the pattern a?c would match abc, adc, but not ac or abcc.
  • * (Asterisk): This wildcard matches zero or more occurrences of any character. This is where the complexity often arises. For instance, the pattern a*c would match ac, abc, abcc, axbyc, and so on.

The challenge lies in efficiently handling the * wildcard, as it can represent a variable number of characters, leading to multiple potential matching paths.

flowchart TD
    A[Start Matching] --> B{Current Pattern Char?}
    B --> |'?'| C[Match Any Single Char]
    B --> |'*'| D[Match Zero or More Chars]
    B --> |Literal Char| E{Text Char Matches?}
    C --> F[Advance Both Ptr]
    E --> |Yes| F
    E --> |No| G[Mismatch]
    D --> H[Try Matching Zero Chars]
    D --> I[Try Matching One or More Chars]
    H --> F
    I --> F
    F --> B
    G --> J[End Mismatch]
    F --> K{End of Pattern & Text?}
    K --> |Yes| L[Match Found]
    K --> |No| G

Basic Logic Flow for Wildcard Matching

Recursive Approach with Backtracking

A natural way to think about wildcard matching is through recursion with backtracking. The core idea is to consider different possibilities when a wildcard character is encountered. This approach can be quite intuitive but may suffer from performance issues due to redundant computations if not optimized.

Let's define a function isMatch(text, pattern) that returns true if text matches pattern.

Base Cases:

  1. If both text and pattern are exhausted, it's a match.
  2. If pattern is exhausted but text is not, it's a mismatch (unless the remaining text can be consumed by a preceding *).
  3. If text is exhausted but pattern is not, it's a mismatch (unless the remaining pattern consists only of * characters).

Recursive Steps:

  • If pattern[j] is ? or pattern[j] matches text[i]: Advance both i and j.
  • If pattern[j] is *: This is the tricky part. The * can match zero characters, or one character, or two, and so on. We explore two possibilities:
    1. * matches zero characters: Try matching text[i] with pattern[j+1] (effectively skipping *).
    2. * matches one or more characters: Try matching text[i+1] with pattern[j] (effectively consuming text[i] with * and keeping * active for subsequent characters).

This recursive solution can be optimized using memoization (dynamic programming) to store results of subproblems and avoid recomputing them.

def isMatchRecursive(text, pattern):
    memo = {}

    def dp(i, j):
        if (i, j) in memo:
            return memo[(i, j)]

        # Base cases
        if j == len(pattern):
            return i == len(text)

        if i == len(text):
            # If text is exhausted, pattern must be all '*' to match
            for k in range(j, len(pattern)):
                if pattern[k] != '*':
                    return False
            return True

        # Recursive steps
        if pattern[j] == '?' or pattern[j] == text[i]:
            result = dp(i + 1, j + 1)
        elif pattern[j] == '*':
            # '*' matches zero characters (skip '*') OR
            # '*' matches one or more characters (consume text[i] with '*' and keep '*' active)
            result = dp(i, j + 1) or dp(i + 1, j)
        else:
            result = False
        
        memo[(i, j)] = result
        return result

    return dp(0, 0)

# Example Usage:
# print(isMatchRecursive("adceb", "*a*b")) # True
# print(isMatchRecursive("acdcb", "a*c?b")) # False
# print(isMatchRecursive("aa", "a")) # False

Python implementation of recursive wildcard matching with memoization.

Iterative Dynamic Programming Approach

An iterative dynamic programming (DP) approach provides a more structured way to solve the wildcard matching problem, often with better space complexity than a purely recursive solution (if memoization is not carefully managed). We typically use a 2D boolean array, dp[i][j], where dp[i][j] is true if the first i characters of the text match the first j characters of the pattern.

Initialization:

  • dp[0][0] = true (empty text matches empty pattern).
  • dp[0][j] = true if pattern[0...j-1] consists only of * characters (an empty string can be matched by *, **, etc.).

Transitions:

For dp[i][j]:

  1. If pattern[j-1] is ? or pattern[j-1] matches text[i-1]: dp[i][j] = dp[i-1][j-1] (match the current characters and rely on the previous subproblem).
  2. If pattern[j-1] is *: dp[i][j] can be true if either:
    • * matches zero characters: dp[i][j-1] (effectively skipping * in the pattern).
    • * matches one or more characters: dp[i-1][j] (effectively consuming text[i-1] with * and keeping * active). So, dp[i][j] = dp[i][j-1] || dp[i-1][j].
  3. Otherwise (literal mismatch): dp[i][j] = false.
class Solution {
    public boolean isMatch(String s, String p) {
        int m = s.length();
        int n = p.length();

        boolean[][] dp = new boolean[m + 1][n + 1];

        // Base case: empty string matches empty pattern
        dp[0][0] = true;

        // Initialize for patterns starting with '*' (matching empty string)
        for (int j = 1; j <= n; j++) {
            if (p.charAt(j - 1) == '*') {
                dp[0][j] = dp[0][j - 1];
            }
        }

        // Fill the DP table
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                if (p.charAt(j - 1) == '?' || p.charAt(j - 1) == s.charAt(i - 1)) {
                    dp[i][j] = dp[i - 1][j - 1];
                } else if (p.charAt(j - 1) == '*') {
                    // '*' matches zero characters (dp[i][j-1])
                    // OR '*' matches one or more characters (dp[i-1][j])
                    dp[i][j] = dp[i][j - 1] || dp[i - 1][j];
                } else {
                    dp[i][j] = false;
                }
            }
        }

        return dp[m][n];
    }
}

// Example Usage:
// Solution sol = new Solution();
// System.out.println(sol.isMatch("adceb", "*a*b")); // True
// System.out.println(sol.isMatch("acdcb", "a*c?b")); // False
// System.out.println(sol.isMatch("aa", "a")); // False

Java implementation of iterative dynamic programming for wildcard matching.

Greedy Approach (Two Pointers)

For certain types of wildcard matching problems (specifically, when * can match any sequence of characters, but not when it's tied to a specific character like in regular expressions), a greedy approach using two pointers can be more efficient. This method avoids the overhead of DP table construction but requires careful handling of backtracking for *.

We maintain pointers for the current position in text (i) and pattern (j). We also need to store the last encountered * position (star_idx) and the corresponding text position (match_idx) where * started matching.

Algorithm Steps:

  1. Iterate through text with pointer i.
  2. If pattern[j] is ? or pattern[j] matches text[i]: Advance both i and j.
  3. If pattern[j] is *: Store star_idx = j and match_idx = i. Advance j (assume * matches zero characters for now).
  4. If pattern[j] does not match text[i] and a * was previously encountered (star_idx != -1):
    • Backtrack: Reset j to star_idx + 1.
    • Advance match_idx by one (meaning the * now consumes one more character from text).
    • Reset i to match_idx.
  5. If no * was encountered and there's a mismatch: Return false.
  6. After iterating through text, check if the remaining pattern consists only of * characters. If so, it's a match; otherwise, it's a mismatch.
function isMatchGreedy(s, p) {
    let sPtr = 0;
    let pPtr = 0;
    let starIdx = -1;
    let matchIdx = 0;

    while (sPtr < s.length) {
        // Case 1: Current characters match or pattern has '?'
        if (pPtr < p.length && (p.charAt(pPtr) === '?' || p.charAt(pPtr) === s.charAt(sPtr))) {
            sPtr++;
            pPtr++;
        }
        // Case 2: Pattern has '*'
        else if (pPtr < p.length && p.charAt(pPtr) === '*') {
            starIdx = pPtr; // Store '*' position
            matchIdx = sPtr; // Store text position where '*' started matching
            pPtr++; // Advance pattern pointer, assuming '*' matches 0 chars for now
        }
        // Case 3: Mismatch, but a '*' was seen previously
        else if (starIdx !== -1) {
            pPtr = starIdx + 1; // Backtrack pattern pointer to after the last '*'
            matchIdx++; // Advance text pointer, making '*' consume one more char
            sPtr = matchIdx; // Reset text pointer to new matchIdx
        }
        // Case 4: Mismatch and no '*' to backtrack to
        else {
            return false;
        }
    }

    // Consume any remaining '*' in the pattern
    while (pPtr < p.length && p.charAt(pPtr) === '*') {
        pPtr++;
    }

    return pPtr === p.length;
}

// Example Usage:
// console.log(isMatchGreedy("adceb", "*a*b")); // True
// console.log(isMatchGreedy("acdcb", "a*c?b")); // False
// console.log(isMatchGreedy("aa", "a")); // False

JavaScript implementation of the greedy (two-pointer) wildcard matching algorithm.