Wildcard matching in Java
I'm writing a simple debugging program that takes as input simple strings that can contain stars to indicate a match-any wildcard:
*.wav // matches <anything>.wav
(*, a) // matches (<anything>, a)
I thought I would simply take that pattern, escape any regular expression special characters in it, then replace any \\* back to .*, and then use a regular expression matcher.
But I can't find any Java function to escape a regular expression. The best match I could find is Pattern.quote, which however just puts \Q and \E at the beginning and end of the string.
Is there anything in Java that allows you to simply do that wildcard matching without you having to implement the algorithm from scratch?
Solution 1:[1]
Using A Simple Regex
One of this method's benefits is that we can easily add tokens besides * (see Adding Tokens at the bottom).
Search: [^*]+|(\*)
- The left side of the | matches any chars that are not a star
- The right side captures all stars to Group 1
- If Group 1 is empty: replace with \Q + Match + \E
- If Group 1 is set: replace with .*
Here is some working code.
Input: audio*2012*.wav
Output: \Qaudio\E.*\Q2012\E.*\Q.wav\E
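import java.util.regex.Matcher;
import java.util.regex.Pattern;

String subject = "audio*2012*.wav";
Pattern regex = Pattern.compile("[^*]+|(\\*)");
Matcher m = regex.matcher(subject);
StringBuffer b = new StringBuffer();
while (m.find()) {
    // Group 1 is set only when the token is a star.
    if (m.group(1) != null) m.appendReplacement(b, ".*");
    // quoteReplacement keeps '\' and '$' in the literal from being read as replacement syntax.
    else m.appendReplacement(b, Matcher.quoteReplacement("\\Q" + m.group(0) + "\\E"));
}
m.appendTail(b);
String replaced = b.toString();
System.out.println(replaced);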
Adding Tokens
Suppose we also want to convert the wildcard ?, which stands for a single character, to a dot. We just add a capture group to the regex, and exclude it from the match-all on the left:
Search: [^*?]+|(\*)|(\?)
In the replace function we then add something like:
else if(m.group(2) != null) m.appendReplacement(b, ".");
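Putting the search pattern and both replacement rules together, a complete conversion might look like the sketch below. The class name WildcardRegex and the method name toRegex are illustrative and not part of the answer. Quoting each literal run with Pattern.quote and passing it through Matcher.quoteReplacement is an assumption made here, so that a `$` or `\` in the input is not interpreted by appendReplacement.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WildcardRegex {
    // Splits the input into literal runs, '*' tokens (group 1) and '?' tokens (group 2).
    private static final Pattern TOKENS = Pattern.compile("[^*?]+|(\\*)|(\\?)");

    // Hypothetical helper: converts a wildcard string into a regex string.
    public static String toRegex(String wildcard) {
        Matcher m = TOKENS.matcher(wildcard);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            if (m.group(1) != null) {
                m.appendReplacement(sb, ".*");   // '*' -> any run of characters
            } else if (m.group(2) != null) {
                m.appendReplacement(sb, ".");    // '?' -> exactly one character
            } else {
                // Literal run: quote it for the regex, then protect '\' and '$'
                // from being read as replacement syntax.
                m.appendReplacement(sb, Matcher.quoteReplacement(Pattern.quote(m.group(0))));
            }
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String regex = toRegex("audio??2012*.wav");
        System.out.println(regex);                                  // \Qaudio\E..\Q2012\E.*\Q.wav\E
        System.out.println("audio012012_final.wav".matches(regex)); // true
        System.out.println("audio2012.wav".matches(regex));         // false: the two '?' need two characters
    }
}
```

Every character of the input is either a star, a question mark, or part of a literal run, so the loop consumes the whole string and appendTail has nothing left to add.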
Solution 2:[2]
Just escape everything - no harm will come of it.
String input = "*.wav";
String regex = ("\\Q" + input + "\\E").replace("*", "\\E.*\\Q");
System.out.println(regex); // \Q\E.*\Q.wav\E
System.out.println("abcd.wav".matches(regex)); // true
Or you can use character classes:
String input = "*.wav";
String regex = input.replaceAll(".", "[$0]").replace("[*]", ".*");
System.out.println(regex); // .*[.][w][a][v]
System.out.println("abcd.wav".matches(regex)); // true
It's easier to "escape" the characters by putting them in a character class, as almost all characters lose any special meaning when in a character class. This will work unless the input contains characters that stay special inside a character class, such as ^, \ or [.
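One way to sidestep the character class edge cases is to split the input at each star and quote every literal segment with Pattern.quote. This is a variation on the answer rather than the answer's own code, and the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class QuoteSegments {
    // Hypothetical helper: quotes every literal segment and joins the segments with ".*".
    public static String toRegex(String wildcard) {
        // The limit -1 keeps trailing empty segments, so a trailing '*' is not lost.
        return Arrays.stream(wildcard.split("\\*", -1))
                .map(Pattern::quote)
                .collect(Collectors.joining(".*"));
    }

    public static void main(String[] args) {
        System.out.println(toRegex("*.wav"));                            // \Q\E.*\Q.wav\E
        System.out.println("abcd.wav".matches(toRegex("*.wav")));        // true
        System.out.println("a^b[1].txt".matches(toRegex("a^b[*].txt"))); // true
    }
}
```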
Solution 3:[3]
There is a small utility method in the Apache Commons IO library, org.apache.commons.io.FilenameUtils#wildcardMatch(), which you can use without the intricacies of regular expressions.
The API documentation can be found at: https://commons.apache.org/proper/commons-io/javadocs/api-2.5/org/apache/commons/io/FilenameUtils.html#wildcardMatch(java.lang.String,%20java.lang.String)
Solution 4:[4]
You can also use the quotation escape sequences \\Q and \\E: everything between them is treated as literal and not considered to be part of the regex to be evaluated. Thus this code should work:
String input = "*.wav";
String regex = "\\Q" + input.replace("*", "\\E.*?\\Q") + "\\E";
// regex = "\\Q\\E.*?\\Q.wav\\E"
Note that, depending on how you want your wildcard to behave, * might be better matched only against word characters, using \w.
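To make the difference concrete, the sketch below builds both variants from the same input. The class StarVariants and its toRegex helper, including the starRegex parameter, are hypothetical names introduced only for this illustration.

```java
public class StarVariants {
    // Hypothetical helper: starRegex is the regex fragment substituted for each '*'.
    public static String toRegex(String wildcard, String starRegex) {
        return "\\Q" + wildcard.replace("*", "\\E" + starRegex + "\\Q") + "\\E";
    }

    public static void main(String[] args) {
        String anyChars = toRegex("*.wav", ".*?");  // '*' matches any characters
        String wordOnly = toRegex("*.wav", "\\w*"); // '*' matches word characters only

        System.out.println("my song.wav".matches(anyChars)); // true
        System.out.println("my song.wav".matches(wordOnly)); // false: the space is not a word character
        System.out.println("song.wav".matches(wordOnly));    // true
    }
}
```

Since String.matches requires the whole input to match, the reluctant .*? and the greedy .* accept the same strings here; the choice only matters when the pattern is used with find() or with capture groups.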
Solution 5:[5]
Regex While Accommodating A DOS/Windows Path
Implementing the quotation escape sequences \Q and \E is probably the best approach. However, since a backslash is typically used as a DOS/Windows file separator, a "\E" sequence within the path could affect the pairing of \Q and \E. While accounting for the * and ? wildcard tokens, this situation of the backslash could be addressed in this manner:
Search: [^*?\\]+|(\*)|(\?)|(\\)
Two new lines would be added in the replace function of the "Using A Simple Regex" example to accommodate the new search pattern. The code would still be "Linux-friendly". As a method, it could be written like this:
public String wildcardToRegex(String wildcardStr) {
    // Tokens: literal run | '*' (group 1) | '?' (group 2) | '\' (group 3)
    Pattern regex = Pattern.compile("[^*?\\\\]+|(\\*)|(\\?)|(\\\\)");
    Matcher m = regex.matcher(wildcardStr);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
        if (m.group(1) != null) m.appendReplacement(sb, ".*");
        else if (m.group(2) != null) m.appendReplacement(sb, ".");
        // A backslash becomes an escaped backslash in the regex.
        else if (m.group(3) != null) m.appendReplacement(sb, "\\\\\\\\");
        // quoteReplacement keeps '$' in the literal from being read as a group reference.
        else m.appendReplacement(sb, Matcher.quoteReplacement("\\Q" + m.group(0) + "\\E"));
    }
    m.appendTail(sb);
    return sb.toString();
}
Code to demonstrate the implementation of this method could be written like this:
String s = "C:\\Temp\\Extra\\audio??2012*.wav";
System.out.println("Input: "+s);
System.out.println("Output: "+wildcardToRegex(s));
This would be the generated result:
Input: C:\Temp\Extra\audio??2012*.wav
Output: \QC:\E\\\QTemp\E\\\QExtra\E\\\Qaudio\E..\Q2012\E.*\Q.wav\E
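The pairing problem can be reproduced with the sample path itself, since "\Extra" contains "\E". The sketch below is an illustration rather than part of the answer: it compares naive \Q...\E wrapping with Pattern.quote, which is documented to return a literal pattern for any string. The class name and the safeMatches helper are hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class QuoteBackslashDemo {
    // Returns false both when the regex does not match and when it does not compile.
    public static boolean safeMatches(String regex, String input) {
        try {
            return Pattern.matches(regex, input);
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String path = "C:\\Temp\\Extra";      // the text C:\Temp\Extra
        String naive = "\\Q" + path + "\\E";  // the \E in "\Extra" ends the quote early
        String quoted = Pattern.quote(path);  // copes with an embedded \E

        System.out.println(safeMatches(naive, path));  // false
        System.out.println(safeMatches(quoted, path)); // true
    }
}
```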
Solution 6:[6]
Lucene has classes that provide this capability, with additional support for backslash as an escape character. ? matches a single character, * matches 0 or more characters, \ escapes the following character. It supports Unicode code points. It is supposed to be fast, but I haven't tested that.
CharacterRunAutomaton characterRunAutomaton;
boolean matches;
characterRunAutomaton = new CharacterRunAutomaton(WildcardQuery.toAutomaton(new Term("", "Walmart")));
matches = characterRunAutomaton.run("Walmart"); // true
matches = characterRunAutomaton.run("Wal*mart"); // false
matches = characterRunAutomaton.run("Wal\\*mart"); // false
matches = characterRunAutomaton.run("Waldomart"); // false
characterRunAutomaton = new CharacterRunAutomaton(WildcardQuery.toAutomaton(new Term("", "Wal*mart")));
matches = characterRunAutomaton.run("Walmart"); // true
matches = characterRunAutomaton.run("Wal*mart"); // true
matches = characterRunAutomaton.run("Wal\\*mart"); // true
matches = characterRunAutomaton.run("Waldomart"); // true
characterRunAutomaton = new CharacterRunAutomaton(WildcardQuery.toAutomaton(new Term("", "Wal\\*mart")));
matches = characterRunAutomaton.run("Walmart"); // false
matches = characterRunAutomaton.run("Wal*mart"); // true
matches = characterRunAutomaton.run("Wal\\*mart"); // false
matches = characterRunAutomaton.run("Waldomart"); // false
Solution 7:[7]
// The main function that checks if two given strings match. The pattern string may contain
// wildcard characters: '*' matches any run of characters, '?' matches exactly one.
static boolean matchPattern(String pattern, String str) {
    // If we reach the end of both strings, we are done
    if (pattern.length() == 0 && str.length() == 0) return true;
    // Make sure that the characters after '*' are present in str. This function assumes that
    // the pattern string will not contain two consecutive '*'
    if (pattern.length() > 1 && pattern.charAt(0) == '*' && str.length() == 0) return false;
    // If the pattern has '?' here, or the current characters of both strings match.
    // '*' is excluded so that it is always handled as a wildcard below.
    if (pattern.length() != 0 && str.length() != 0 && pattern.charAt(0) != '*'
            && (pattern.charAt(0) == '?' || pattern.charAt(0) == str.charAt(0)))
        return matchPattern(pattern.substring(1), str.substring(1));
    // If there is *, then there are two possibilities
    // a: '*' matches nothing, so we move past it in the pattern
    // b: '*' absorbs the current character of str
    if (pattern.length() > 0 && pattern.charAt(0) == '*')
        return matchPattern(pattern.substring(1), str) || matchPattern(pattern, str.substring(1));
    return false;
}

// Assumed helper (not shown in the original): prints "Yes" or "No" for a pattern/string pair.
static void test(String pattern, String str) {
    System.out.println(matchPattern(pattern, str) ? "Yes" : "No");
}

public static void main(String[] args) {
    test("w*ks", "weeks"); // Yes
    test("we?k*", "weekend"); // Yes
    test("g*k", "gee"); // No because 'k' is not in second
    test("*pqrs", "pqrst"); // No because 't' is not in first
    test("abc*bcd", "abcdhghgbcd"); // Yes
    test("abc*c?d", "abcd"); // No because second must have 2 instances of 'c'
    test("*c*d", "abcd"); // Yes
    test("*?c*d", "abcd"); // Yes
}
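Because each * in the recursive version tries two branches, its running time can grow quickly for patterns with several stars. A common alternative is an iterative matcher that remembers the most recent * and backtracks to it on a mismatch. The following is a sketch of that different technique, not the answer's method; the class and method names are illustrative.

```java
public class IterativeWildcard {
    // Returns true if str matches pattern, where '*' matches any run of characters
    // (including none) and '?' matches exactly one character.
    public static boolean matches(String pattern, String str) {
        int p = 0, s = 0;   // current positions in pattern and str
        int starP = -1;     // position of the most recent '*' in pattern, or -1
        int starS = -1;     // position in str where that '*' started matching
        while (s < str.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                // Record the star and first try matching it against nothing.
                starP = p++;
                starS = s;
            } else if (p < pattern.length()
                    && (pattern.charAt(p) == '?' || pattern.charAt(p) == str.charAt(s))) {
                p++;
                s++;
            } else if (starP >= 0) {
                // Mismatch: let the last star absorb one more character and retry.
                p = starP + 1;
                s = ++starS;
            } else {
                return false;
            }
        }
        // Only trailing stars may remain in the pattern.
        while (p < pattern.length() && pattern.charAt(p) == '*') p++;
        return p == pattern.length();
    }

    public static void main(String[] args) {
        System.out.println(matches("w*ks", "weeks"));     // true
        System.out.println(matches("abc*c?d", "abcd"));   // false
        System.out.println(matches("*?c*d", "abcd"));     // true
    }
}
```

Unlike the recursive version, this sketch makes no assumption about consecutive stars, since a later star simply replaces the remembered one.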
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |
Solution 3 | Marek Gregor |
Solution 4 | Matt Coubrough |
Solution 5 | J. Hanney |
Solution 6 | Paul Jackson |
Solution 7 | Keshavram Kuduwa |