Regex returning a value in IE, 'undefined' in Firefox and Safari/Chrome

I have a regex:

.*?
(rule1|rule2)
(?:(rule1|rule2)|[^}])*

(It's designed to parse CSS files, and the 'rules' are generated by JS.)

When I try this in IE, all works as it should. Ditto when I try it in RegexBuddy or The Regex Coach.

But when I try it in Firefox or Chrome, the results are missing values.
Can anyone please explain what the real browsers are thinking, or how I can achieve results similar to IE's?

To see this in action, load up a page that gives you interactive testing, such as the W3Schools try-it-out editor.

Here's the source that can be pasted in: http://www.w3schools.com/jsref/tryit.asp?filename=tryjsref_regexp_exec

<html>
<body>

<script type="text/javascript">

var str="#rot { rule1; rule2; }";

var patt=/.*?(rule1|rule2)(?:(rule1|rule2)|[^}])*/i;

var result=patt.exec(str);
for(var i = 0; i < 3; i++) document.write(i+": " + result[i]+"<br>"); 

</script>
</body>
</html>

Here is the output in IE:

0: #rot { rule1; rule2; 
1: rule1
2: rule2

Here is the output in Firefox and Chrome:

0: #rot { rule1; rule2; 
1: rule1
2: undefined

When I try the same using string.match, I get back an array of undefined in all browsers, including IE.

var str="#rot { rule2; rule1; rule2; }";
var patt=/.*?(rule1|rule2)(?:(rule1|rule2)|[^}])*/gi;
var result=str.match(patt);
for(var i = 0; i < 5; i++) document.write(i+": "+result[i]+"<br>"); 

As far as I can tell, the issue is the last non-capturing parenthesis.
When I remove them, the results are consistent cross browser - and match() gets results.

However, it does capture from the last parenthesis, in all browsers, in the following example:

<script>
var str="#rot { rule1; rule2 }";
var patt=/.*?(rule1|rule2)(?:(rule1 |rule2 )|[^}])*/gi;
var result=patt.exec(str);
for(var i =0; i < 3; i++) document.write(i+": "+result[i]+"<br>"); 
</script>

Notice that I've added a space to the patterns in the second regex.
The same applies if I add any negative character to the strings in the second regex:

var patt=/.*?(rule1|rule2)(?:(rule1[^1]|rule2[^1])|[^}])*/gi;

What the expletive is going on?!
All other strings that I've tried result in the first set of non-catches. Any help is greatly appreciated!

EDIT: The code has been shortened, and many hours of research put in, on Mathew's advice.
The title has been changed to make the thread easier to find.

I have marked Mathew's answer as correct, as it is well researched and described.
My answer below (written before Mathew revised his) states the logic in simpler and more direct terms.



Solution 1:[1]

IE is wrong. In ECMAScript, exactly one alternative can result in a string. All the others have to be undefined (not "" or anything else).

So for your alternatives, including (transform[^-][^;}]+)|(transform-origin[^;}]+), Firefox and Chrome are correct in setting the failed capture to undefined.

There's an example in the ECMAScript 5 standard (§15.10.2.3) specifically about this:

NOTE The | regular expression operator separates two alternatives. The pattern first tries to match the left Alternative (followed by the sequel of the regular expression); if it fails, it tries to match the right Disjunction (followed by the sequel of the regular expression). If the left Alternative, the right Disjunction, and the sequel all have choice points, all choices in the sequel are tried before moving on to the next choice in the left Alternative. If choices in the left Alternative are exhausted, the right Disjunction is tried instead of the left Alternative. Any capturing parentheses inside a portion of the pattern skipped by | produce undefined values instead of Strings.

Thus, for example, /a|ab/.exec("abc") returns the result "a" and not "ab". Moreover, /((a)|(ab))((c)|(bc))/.exec("abc") returns the array ["abc", "a", "a", undefined, "bc", undefined, "bc"] and not ["abc", "ab", undefined, "ab", "c", "c", undefined]
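
The two examples from that note can be checked directly. This is just a small sketch for a console or Node; the show helper is my own addition (not part of the spec or the question) and only makes undefined captures visible when printing.

```javascript
// Hypothetical helper: prints a match array with undefined captures spelled out.
function show(match) {
  var parts = [];
  for (var i = 0; i < match.length; i++) {
    parts.push(match[i] === undefined ? "undefined" : '"' + match[i] + '"');
  }
  return "[" + parts.join(", ") + "]";
}

// The left alternative wins as soon as it matches, so "a" and not "ab".
var first = /a|ab/.exec("abc");

// Captures inside the skipped alternatives (groups 3 and 5) come back undefined.
var second = /((a)|(ab))((c)|(bc))/.exec("abc");

console.log(show(first));  // ["a"]
console.log(show(second)); // ["abc", "a", "a", undefined, "bc", undefined, "bc"]
```

An engine that follows the spec should print exactly the array quoted in the note.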

EDIT: I figured the last part out. This applies to the original as well as the simplified version. In both cases, rule1 and rule2 can't match the ; (in the original because ; is in the negated character class [^;}]). Thus, when a ; is hit between declarations, the alternation chooses [^}], and the last two captures must be set to undefined.

For the * to be fully greedy, the final ; and space in the input must also be matched. For the last two * repetitions (';' and ' '), the alternation again chooses [^}], so the captures should be set undefined at the end too.

IE fails to do this in both cases, so they stay equal to "rule1" and "rule2".

Finally, the reason that the second example behaves differently is that (transform-origin[^;}]+) matches on the very last * repetition, since there's no ; before the end.

EDIT 2: I'll walk through what should be happening in both current examples. match is the match array.

var str="#rot { rule1; rule2; }";
var patt=/.*?(rule1|rule2)(?:(rule1|rule2)|[^}])*/i;

.*? - "#rot { "

(rule1|rule2) - "rule1"
match[1] = "rule1"

Star 1

[^}] - ";"
match[2] = undefined 

Star 2

[^}] - " "
match[2] = undefined 

Star 3

(rule1|rule2) - "rule2"
match[2] = "rule2"

Star 4

[^}] - ";"
match[2] = undefined 

Star 5

[^}] - " "
match[2] = undefined 

Again, IE isn't setting match[2] to undefined.
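
The walkthrough above can be replayed step by step. The sketch below is an approximation, not how the engine really works internally: it matches the prefix, then applies the repeated atom one iteration at a time and records what the capture holds after each "Star". It ignores backtracking, which happens not to matter for this input. The names prefix, atom and steps are mine.

```javascript
var str = "#rot { rule1; rule2; }";

// Prefix: lazy .*? followed by the first capturing group.
var prefix = /.*?(rule1|rule2)/i.exec(str);
var pos = prefix[0].length;

// The atom inside (?: ... )*, applied once per loop iteration.
var atom = /(rule1|rule2)|[^}]/gi;
atom.lastIndex = pos;

var steps = [];
var m;
// Only accept a match that starts exactly where the previous one ended.
while ((m = atom.exec(str)) !== null && m.index === pos) {
  steps.push({ text: m[0], capture: m[1] });
  pos = atom.lastIndex;
}

for (var i = 0; i < steps.length; i++) {
  console.log("Star " + (i + 1) + ': "' + steps[i].text + '" -> ' + steps[i].capture);
}

// The real pattern, for comparison: match[2] ends up undefined.
var full = /.*?(rule1|rule2)(?:(rule1|rule2)|[^}])*/i.exec(str);
console.log(full[2]); // undefined in Firefox/Chrome/Node
```

The five recorded steps are ";", " ", "rule2", ";" and " ", and only the third one carries a capture, which is why a spec-compliant engine ends with match[2] undefined.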

For the str.match example, you're using the global flag. That means it returns an array of matches, without captures. This applies to any use of String.match. If you use g, you have to use exec to get captures.
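
To make the difference concrete, here is a rough sketch using the str and patt from the question. The variable names whole and captures are mine; output goes to the console instead of document.write.

```javascript
var str = "#rot { rule2; rule1; rule2; }";
var patt = /.*?(rule1|rule2)(?:(rule1|rule2)|[^}])*/gi;

// With the g flag, match() returns whole matches only, no capture groups.
var whole = str.match(patt);
console.log(whole.length); // 1
console.log(whole[0]);     // "#rot { rule2; rule1; rule2; "
console.log(whole[1]);     // undefined, because there is no second match

// To get captures with g, call exec() repeatedly until it returns null.
patt.lastIndex = 0; // exec() with g is stateful, so start from the beginning
var captures = [];
var m;
while ((m = patt.exec(str)) !== null) {
  captures.push([m[1], m[2]]);
}
console.log(captures.length); // 1
console.log(captures[0][0]);  // "rule2"
```

That is why the loop in the question prints undefined for indexes 1 to 4: the array returned by match() holds one whole match and nothing else.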

var str="#rot { rule1; rule2 }";
var patt=/.*?(rule1|rule2)(?:(rule1 |rule2 )|[^}])*/gi;

.*? - "#rot { "
(rule1|rule2) - "rule1"
match[1] = "rule1"

Star 1

[^}] - ";"
match[2] = undefined 

Star 2

[^}] - " "
match[2] = undefined 

Star 3

(rule1 |rule2 ) - "rule2 "
match[2] = "rule2 "

Since this is the last *, the capture never gets set to undefined.
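
A quick way to see that only the last iteration matters: keep the pattern and change what follows rule2 in the input. This is a sketch with my own test strings; I dropped the g flag so the two exec calls don't share lastIndex state.

```javascript
var patt = /.*?(rule1|rule2)(?:(rule1 |rule2 )|[^}])*/i;

// Nothing but "}" follows "rule2 ", so the capturing alternative is the final iteration.
var kept = patt.exec("#rot { rule1; rule2 }");
console.log(kept[2]); // "rule2 "

// One more character matched by [^}] ("x" here) means one more iteration,
// and that iteration clears the capture again.
var cleared = patt.exec("#rot { rule1; rule2 x}");
console.log(cleared[2]); // undefined
```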

Solution 2:[2]

There is a disagreement about how to handle repeated capturing brackets.

Firefox and Webkit both make the following assumptions, IE makes only the first:

  1. If a capturing parenthesis is repeated, capturing something new each time, only the last result is stored.
  2. If the parentheses are inside a larger non-capturing repeated group, and do not capture anything on the last loop, they should capture nothing.

For example:

var str = 'abcdef';
var pat = /([a-f])+/;

pat.exec will catch an 'a', then replace it with a 'b' etc, until it returns an 'f'.
In all browsers.

var str = 'abcdefg';
var pat = /(?:([a-f])|g)+/;

pat.exec will first fill in the capturing parenthesis with an 'a', 'b', through 'f'.
But the non-capturing parent will then continue and match the 'g'. On that loop there is nothing to go into the capturing parenthesis, so it is emptied.
And the regex will return undefined for that capture.

IE considers the capturing parenthesis to have caught nothing in the last loop through, and therefore sticks with the last valid response of 'f'.

Which is useful, but not logical.

Being illogically useful is more destructive than useful. (We all hate quirksmode.)
Advantage Firefox/Chrome.
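
For anyone who wants to check the two examples above, a small sketch (results are what a spec-following engine gives; plain and nested are just names I picked):

```javascript
// Repeated capturing group: only the last iteration's value is kept.
var plain = /([a-f])+/.exec("abcdef");
console.log(plain[1]); // "f"

// Capturing group inside a repeated non-capturing group:
// the final iteration matches "g", so the capture is emptied.
var nested = /(?:([a-f])|g)+/.exec("abcdefg");
console.log(nested[0]); // "abcdefg"
console.log(nested[1]); // undefined (old IE would report "f" here)
```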

Solution 3:[3]

The test case can be simplified, e.g.:

/^(?:(Foo)|Bar)(?:(Foo)|Bar)/.exec("FooBar") // => [ 'FooBar', 'Foo' ]
/^(?:(Foo)|Bar){2}/.exec("FooBar")           // => [ 'FooBar', undefined ]

The only difference here is that the (?:(Foo)|Bar) atom is repeated (by a quantifier) in the second case, which results in its captures being cleared.

This behavior is stipulated by the ECMAScript spec:

Step 4 of the RepeatMatcher clears Atom's captures each time Atom is repeated.

IE's deviation from this spec is also documented:

ES3 states that "Step 4 of the RepeatMatcher clears Atom's captures each time Atom is repeated."

JScript does not clear the Atom's matches each time the Atom is repeated.


It's worth noting that the ES spec is at odds with the behavior of other Perl-flavored regex engines, which typically behave like IE:

Chrome, Firefox

"FooBar".match(/^(?:(Foo)|Bar)*/)[1] // => undefined

Perl

("FooBar" =~ m/^(?:(Foo)|Bar)*/)[0] # => "Foo"

Python

re.match("^(?:(Foo)|Bar)*", "FooBar").group(1) # => "Foo"

Ruby

"FooBar"[/^(?:(Foo)|Bar)*/, 1] # => "Foo"
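
If you actually want the IE/Perl-style result in a spec-compliant browser, one possible workaround is to walk the repeated atom yourself and remember the last capture that wasn't undefined. This is only a sketch: lastDefinedCapture is a made-up helper, it handles a single capture group, and it does no backtracking, so it will not agree with a real engine on every pattern.

```javascript
// Hypothetical helper: applies the atom repeatedly from `start`, requiring each
// match to begin where the previous one ended, and returns the last capture
// from group 1 that was not undefined.
function lastDefinedCapture(str, atomSource, start) {
  var atom = new RegExp(atomSource, "g");
  var pos = start;
  var last; // stays undefined if the group never participates
  var m;
  atom.lastIndex = pos;
  while ((m = atom.exec(str)) !== null && m.index === pos && m[0].length > 0) {
    if (m[1] !== undefined) last = m[1];
    pos = atom.lastIndex;
  }
  return last;
}

// Native result: the capture is cleared on the "Bar" iteration.
var nativeCapture = /^(?:(Foo)|Bar)*/.exec("FooBar")[1];
console.log(nativeCapture); // undefined

// Stepping by hand keeps "Foo", like IE, Perl, Python and Ruby above.
var emulated = lastDefinedCapture("FooBar", "(Foo)|Bar", 0);
console.log(emulated); // "Foo"
```

The m.index === pos check stands in for anchoring each iteration, and the length check guards against an endless loop if the atom can match the empty string.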

Solution 4:[4]

Try removing the ?: at the front of lines 4 and 5 in your regex above. I haven't tested it, but it really looks like they don't belong there.

(?:^|})
([^{]+)
[^}]+?-moz-
((transform[^-][^;}]+)|(transform-origin[^;}]+))
(-moz-(?:(transform[^-][^;}]+)|(transform-origin[^;}]+))|[^}])*

Solution 5:[5]

Your 4th and 5th patterns are competing. Ultimately it is up to the implementation of the browser's regex engine to determine the matches. This wouldn't be the first difference between IE and others.

(?:(transform[^-][^;}]+)|(transform-origin[^;}]+))
(?:-moz-(?:(transform[^-][^;}]+)|(transform-origin[^;}]+))|[^}])*

Both of these are prefixed by transform and suffixed by origin. You need to condense these into a more concise expression. Something like the following is an example:

((?:-moz-)?(?:transform-origin[^;}]+))

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 SamGoody
Solution 3
Solution 4 Ben Lee
Solution 5