Vim Regex Negative Lookarounds and Capture Groups

Say you have the following text

foobar
bar

And you want the following as your desired output

foobar
foobar

You could use the following regex

s/\v(foo)@<!(bar)/foo\2/g

The mistake I made before was thinking that the back-reference for bar was \1 and not \2; I didn't think the group used in the lookaround counted as a capture group. What intrigues me now is what happens if you use \1 instead. The output you get is the following

foobar
foo

Using the logic stated above, if \1 refers to the first capture group, (foo), then I expect that the output would be

foobar
foofoo

After thinking about it for a bit, I suspect the answer is that since it's a negative lookbehind, the match only succeeds when the text foo is not present. As such, the stored capture group holds nothing, just an empty string. This would result in foo being the output if \1 is the back-reference. Am I correct in my deduction?

What makes me fairly certain about this is what happens if I change the regex to use a positive lookbehind instead, with a reference to the first capture group, as follows

s/\v(foo)@<=(bar)/foo\1/g

The output would then become

foofoo
bar

Meaning that since it's a positive lookbehind, the capture group (foo) matches when foo is present, thus the stored capture group would have to be foo.

The source of this confusion is that in Perl regex, lookarounds are not counted as capture groups. If I am correct in what I have stated above, I'm curious as to why there is this difference between vim regex and Perl regex.



Solution 1:[1]

I'm curious as to why there is this difference between vim regex and Perl regex.

Because they're two different regex engines. If they worked in the exact same way, there wouldn't be a Vim regex engine and a Perl regex engine, they'd both be the Perl regex engine.

At some point™, Vim made a regex engine and decided on certain things. One of those, evidently, is to count groups used in lookarounds as capturing groups. If you wanna talk further divergence from Perl, @<= allows non-fixed-width patterns in Vim, but not in Perl (and several other engines). It's just how it was designed. The "why" is something only the people who made it can answer definitively, so I won't answer that.
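For comparison, here is a rough sketch of the Perl-style side of this, using Python's re module as a stand-in for Perl (that stand-in is my assumption; the two share the (?<!...) syntax but are not the same engine). It suggests that in that syntax the parentheses of (?<!foo) belong to the lookbehind itself, whereas an explicit group placed inside the lookbehind is counted, stays unset when the negative lookbehind succeeds, and expands to nothing in the replacement. That lines up with the output reported in the question, but it says nothing definitive about Vim's internals.

```python
import re

# Perl-style syntax: the parentheses in (?<!foo) are part of the lookbehind
# construct, so (bar) is the only capture group here.
plain = re.compile(r"(?<!foo)(bar)")

# An explicit group inside the lookbehind is closer in spirit to Vim's
# (foo)@<!, where (foo) is an ordinary group followed by @<!.
grouped = re.compile(r"(?<!(foo))(bar)")

text = "foobar\nbar"

# Number of capture groups each pattern defines.
group_counts = (plain.groups, grouped.groups)

# The only match is the second "bar"; group 1 never took part in it.
m = grouped.search(text)
captured = (m.group(1), m.group(2))

# An unset group expands to an empty string in the replacement.
with_group_1 = grouped.sub(r"foo\1", text)
with_group_2 = grouped.sub(r"foo\2", text)

print(group_counts)
print(captured)
print(with_group_1.splitlines())
print(with_group_2.splitlines())
```

So the numbering difference looks like a matter of syntax: once the lookbehind contains a real group, a Perl-style engine counts it too.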


If you absolutely wanna exclude the group from the group counting, you can prefix a %, as per :h /\%(\) to make it a non-capturing group (i.e. s/\v%(foo)@<!(bar)/foo\1/g). Note that non-capturing groups still act like normal, but you cannot refer to them when substituting.

While I'm already writing an answer though, let me introduce you to \zs and \ze, by far one of the best additions to the Vim regex engine (in my biased opinion):

\zs defines where the actual match starts. It won't affect groups, but it has several useful side effects. In your case specifically, it lets you completely drop the positive lookbehind. It won't let you drop the negative lookbehind (because regex), but it'll let you simplify your regex a little. Equivalently, \ze determines where the match ends.

Your second example can be simplified to:

s/\vfoo\zs(bar)/\1

\zs tells the engine to start the match just before (bar). If it helps, you can think of every regex as being prefixed with \zs and postfixed with \ze - explicitly defining it just changes those bounds. This doesn't affect number grouping and \<n>-saving.

What this means is that only the space selected by bar is considered a match, and that bit is replaced - the other bits are left intact.
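If it helps to see the "only bar is the match" idea outside Vim, here is a loose analogue in Python's re module. Python has no \zs, so this sketch uses a positive lookbehind instead; treating the two as equivalent is my assumption, and it only holds for simple fixed-width prefixes like this one.

```python
import re

text = "foobar"

# (?<=foo) requires "foo" before the match without consuming it,
# which is roughly the effect of foo\zs in the Vim pattern.
m = re.search(r"(?<=foo)(bar)", text)

match_span = m.span()      # start and end of the match itself
matched_text = m.group(0)  # the text that counts as "the match"

# Only the matched part is replaced; the leading "foo" is left intact.
replaced = re.sub(r"(?<=foo)(bar)", "BAZ", text)

print(match_span, matched_text, replaced)
```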

Your first regex with a negative lookbehind doesn't simplify as well (because regex overall feels intended for forward operations, so anything operating backwards tends to be messy), but for longer regexes, it can still shorten the regex dramatically. Here's what that substitution looks like:

s/\v(foo)@<!\zebar/foo

Expanded:

s/\v
  | (foo)@<!
  | |       \ze
  | |       |   bar
  | |       |   |  /foo
  ^ Very magic  |  |
    ^ not prefixed with foo. Can be made non-capturing, but it has no actual relevance for this regex specifically
            ^ End the match
                ^ bar
                   ^ substitute the "area" selected by "not prefixed with foo" with foo

('scuse the terrible diagram, I've never made one of these before and I don't remember how they're generally made)

This one uses \ze because your goal is, indirectly, to replace the empty space checked by the negative lookbehind with the lookbehind's own content (foo). Unfortunately, Vim only stores values that were actually matched, meaning \1 can't be used to insert foo, because it's not there yet. This is probably something all engines do, because you can't guess the content of (?<=ab.d), for instance.
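The same "replace an empty spot in front of bar" idea can be sketched in Python's re module. There is no \ze there, so a lookahead stands in for \zebar; that mapping is my assumption, not a statement about how Vim implements it. The match is empty, which is why the replacement text has to be spelled out rather than pulled from a group.

```python
import re

text = "foobar\nbar"

# (?<!foo) : the position must not be preceded by "foo"
# (?=bar)  : "bar" must follow, but it is not part of the match
pattern = re.compile(r"(?<!foo)(?=bar)")

# The match is zero-width: it marks a position, not any text.
m = pattern.search(text)
match_start = m.start()
matched_text = m.group(0)

# Substituting the empty match inserts "foo" at that position.
result = pattern.sub("foo", text)

print(match_start, repr(matched_text))
print(result.splitlines())
```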


That being said, if you just want to avoid confusion with group numbering, non-capturing groups are the way to go for now. \zs and \ze, while fantastic, are mildly confusing at first and might not be the best idea to throw on top of learning everything else in Vim for the time being.

And finally, an unexpected plugin recommendation: haya14busa/incsearch.vim (no affiliation, just a user), which previews your substitutions and searches so you can tell what's going to happen before you go ahead with a substitution or a search. Might not help with your confusion around group numbering, but you'll at least be able to see when you're using the wrong group number before you substitute.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Yosher