Regular expression with variable number of groups?
Is it possible to create a regular expression with a variable number of groups?
After running this for instance...
Pattern p = Pattern.compile("ab([cd])*ef");
Matcher m = p.matcher("abcddcef");
m.matches();
... I would like to have something like
m.group(1) = "c"
m.group(2) = "d"
m.group(3) = "d"
m.group(4) = "c"
(Background: I'm parsing some lines of data, and one of the "fields" is repeating. I would like to avoid a matcher.find loop for these fields.)
As pointed out by @Tim Pietzcker in the comments, Perl 6 and .NET have this feature.
Solution 1:[1]
According to the documentation, Java regular expressions can't do this:
The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.
(emphasis added)
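To see the quoted behaviour in practice, here is a minimal sketch that runs the pattern from the question and the documentation's own "aba" example. The class name LastCaptureDemo and the helper groupAfterMatch are made up for illustration; only java.util.regex is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastCaptureDemo {
    // Returns the given group after a full match, or null if the input
    // does not match or the group did not take part in the match.
    static String groupAfterMatch(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        // Pattern from the question: four iterations, one surviving capture.
        System.out.println(groupAfterMatch("ab([cd])*ef", "abcddcef", 1)); // c
        // Example from the documentation quote: group two keeps "b".
        System.out.println(groupAfterMatch("(a(b)?)+", "aba", 2));         // b
    }
}
```

So group 1 only ever holds the last iteration, no matter how many times the quantifier ran.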
Solution 2:[2]
You can use split to get the fields you need into an array and loop through that.
http://download.oracle.com/javase/1,5.0/docs/api/java/lang/String.html#split(java.lang.String)
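A rough sketch of what that could look like for the question's input: capture the whole repeating run in one group, then split it. This assumes the fields are single characters, so the run is split on the empty string (Java 8 or later, where a zero-width match at the start produces no leading empty element). SplitFieldDemo and repeatedField are hypothetical names.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitFieldDemo {
    // The * sits inside the group, so group(1) is the whole run of fields.
    private static final Pattern LINE = Pattern.compile("ab([cd]*)ef");

    static String[] repeatedField(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("line does not match: " + line);
        }
        String run = m.group(1);
        // "".split("") yields one empty string, so handle the empty run explicitly.
        return run.isEmpty() ? new String[0] : run.split("");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(repeatedField("abcddcef"))); // [c, d, d, c]
    }
}
```

If the real data uses a delimiter between the repeating fields, the same idea applies with that delimiter as the split argument.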
Solution 3:[3]
I have not used Java regex, but for many languages the answer is: No.
Capturing groups seem to be created when the regex is parsed, and filled when it matches the string. The expression (a)|(b)(c) has three capturing groups, but only one or two of them can be filled by any given match. (a)* has just one group; the engine leaves the last iteration's match in that group after matching.
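For Java specifically, this can be checked with a small sketch: groupCount() depends only on the pattern, and groups that took no part in the match come back as null. GroupCountDemo and its helper are illustrative names, not anything from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // The group count is a property of the pattern, not of the input.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(a)|(b)(c)")); // 3
        System.out.println(groupCount("(a)*"));       // 1

        Matcher m = Pattern.compile("(a)|(b)(c)").matcher("a");
        if (m.matches()) {
            System.out.println(m.group(1)); // a
            System.out.println(m.group(2)); // null, this alternative was not taken
            System.out.println(m.group(3)); // null
        }
    }
}
```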
Solution 4:[4]
Pattern p = Pattern.compile("ab(?:(c)|(d))*ef");
Matcher m = p.matcher("abcdef");
m.matches();
should do what you want.
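For reference, a quick sketch of what this pattern yields on the question's longer input. As far as I can tell it gives one slot per alternative, not one per repetition. AlternationGroupsDemo and captures are hypothetical names.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationGroupsDemo {
    // Returns groups 1..n after a full match, or null if the input does not match.
    static String[] captures(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int g = 1; g <= m.groupCount(); g++) {
            out[g - 1] = m.group(g);
        }
        return out;
    }

    public static void main(String[] args) {
        String regex = "ab(?:(c)|(d))*ef";
        System.out.println(Arrays.toString(captures(regex, "abcdef")));   // [c, d]
        System.out.println(Arrays.toString(captures(regex, "abcddcef"))); // [c, d]
    }
}
```

Both inputs give two groups, so the four repetitions in "abcddcef" are not individually recoverable this way.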
EDIT:
@aioobe, I understand now. You want to be able to do something like the grammar
A ::== <Foo> <Bars> <Baz>
Foo ::== "foo"
Baz ::== "baz"
Bars ::== <Bar> <Bars>
| ε
Bar ::== "A"
| "B"
and pull out all the individual matches of Bar.
No, there is no way to do that using java.util.regex. You can recurse and use a regex on the match of Bars, or use a parser generator like ANTLR and attach a side-effect to Bar.
Solution 5:[5]
I would think that backtracking rules this behavior out. Consider what a pattern like /([\S\s])/ would do if its group accumulated every iteration over something as long as the Bible. Even if it could be done, the output would be hard to interpret, because the groups would lose their positional meaning. It's better to run a separate regex for each kind of item as a global match and have the results deposited into an array.
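In Java, a "global" match collected into a list might look like the sketch below. It assumes Java 9 or later, where Matcher.results() exists, and it would be applied to the already-extracted run of fields (here "cddc"). GlobalMatchDemo and allMatches are made-up names.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class GlobalMatchDemo {
    // Collects every match of regex in input, in order of appearance.
    static List<String> allMatches(String regex, String input) {
        return Pattern.compile(regex)
                .matcher(input)
                .results()                 // Stream<MatchResult>, Java 9+
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(allMatches("[cd]", "cddc")); // [c, d, d, c]
    }
}
```

This still iterates over matches internally, but it keeps the explicit find loop out of the calling code.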
Solution 6:[6]
I have just had a very similar problem, and managed to get a "variable number of groups" with a combination of a while loop and resetting the matcher.
int i = 0;
String m1 = null, m2 = null;
while (matcher.find(i) && (m1 = matcher.group(1)) != null && (m2 = matcher.group(2)) != null)
{
    // do work on two found groups
    i = matcher.end();
}
But this is for my problem (with two repeating groups). For the pattern in the question:
Pattern pattern = Pattern.compile("(?<=^ab[cd]{0,100})[cd](?=[cd]{0,100}ef$)");
Matcher matcher = pattern.matcher("abcddcef");
int i = 0;
String res = null;
// find(i) resets the matcher and searches from index i
while (matcher.find(i) && (res = matcher.group()) != null)
{
    System.out.println(res);
    i = matcher.end();
}
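You lose the ability to specify an arbitrary length of repetition with * or + inside the look-behind, because a Java look-behind must have a bounded maximum length; hence the {0,100}. (Look-ahead has no such restriction, so [cd]* would also work there.)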
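If the upper bound is a problem, one alternative (my own sketch, not the approach of the answer above) is to drop the look-behind and use the \G anchor, which matches where the previous match ended. Each field then has to follow either the leading "ab" or the previous field. ContiguousMatchDemo and fields are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContiguousMatchDemo {
    // First match must start with ^ab; later matches must start exactly where
    // the previous one ended (\G). (?!^) stops \G from matching at index 0.
    // The look-ahead checks that only more fields and the final "ef" follow.
    private static final Pattern FIELD =
            Pattern.compile("(?:^ab|(?!^)\\G)([cd])(?=[cd]*ef$)");

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("abcddcef")); // [c, d, d, c]
        System.out.println(fields("abcdxef"));  // [] (the line as a whole is invalid)
    }
}
```

This still uses a find loop, but there is no {0,100} limit on the number of fields.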
Solution 7:[7]
If there is a reasonable max number of matching groups you would encounter:
"ab([cd])?([cd])?([cd])?([cd])?([cd])?([cd])?([cd])?([cd])?ef"
This example will work for 0 - 8 matches. I admit this is ugly and not humanly readable.
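To make this a bit less painful, the pattern can be generated and the unfilled groups skipped when reading the result. A minimal sketch, assuming a caller-chosen maximum; OptionalGroupsDemo, upTo and fields are names invented here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupsDemo {
    // Builds "ab([cd])?([cd])?...ef" with max optional groups.
    static Pattern upTo(int max) {
        StringBuilder sb = new StringBuilder("ab");
        for (int i = 0; i < max; i++) {
            sb.append("([cd])?");
        }
        return Pattern.compile(sb.append("ef").toString());
    }

    static List<String> fields(String line, int max) {
        Matcher m = upTo(max).matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match (more than " + max + " fields?): " + line);
        }
        List<String> out = new ArrayList<>();
        for (int g = 1; g <= m.groupCount(); g++) {
            String value = m.group(g); // null for groups that matched nothing
            if (value != null) {
                out.add(value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("abcddcef", 8)); // [c, d, d, c]
    }
}
```

Inputs with more fields than the chosen maximum simply fail to match, which is the main limitation of this approach.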
Solution 8:[8]
I would like to avoid a matcher.find loop for these fields.
As stated in other answers, that cannot be avoided. For completeness, here is how to do it using a second Pattern to go over the individual matches. Note the position of the * being inside the round brackets rather than after.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern subPattern = Pattern.compile("[cd]");
// Reuses subPattern (DRY); there are probably safer ways to compose the two in case subPattern needs to change.
Pattern pattern = Pattern.compile("ab(" + subPattern.pattern() + "*)ef");
Matcher matcher = pattern.matcher("abccdcddef is great and all, but have you heard about abef and abddcef?");
List<String> letterSequence = new ArrayList<>();
while (matcher.find()) {
    // group(1) is the whole run of letters between "ab" and "ef"
    String letters = matcher.group(1);
    Matcher subMatcher = subPattern.matcher(letters);
    while (subMatcher.find()) {
        String letter = subMatcher.group();
        letterSequence.add(letter);
    }
}
System.out.println(letterSequence);
Output:
[c, c, d, c, d, d, d, d, c]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Thirtyate |
Solution 3 | |
Solution 4 | |
Solution 5 | |
Solution 6 | v010dya |
Solution 7 | kashiraja |
Solution 8 | |