Regex matched text between tags is too greedy

I am trying to extract text from a string, and have trouble with laziness/greediness.

In the example below, I want to match only <b>I only want this piece</b>, so my regex is a non-greedy match of anything between <b> and </b>, as long as it contains 'piece'.

The problem with my regex is that the matched text also includes <b>first</b>.

var text = "<b>first</b> <b>I only want this piece</b>";
var regX = /<b>.*?piece.*?<\/b>/;
var matches = text.match(regX);

Matched text

"<b>first</b> <b>I only want this piece</b>"

Desired match

"<b>I only want this piece</b>"


Solution 1:[1]

Use a negated char class instead of the first .*?.

var regX = /<b>[^<>]*?piece.*?<\/b>/;

Why?

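Because <b>.*?piece starts matching at the first <b> and keeps expanding until it finds the text piece, regardless of what lies in between, including other tags. A lazy quantifier only prefers the shortest match from a given start position; it does not move the start position forward. [^<>]*? lazily matches zero or more characters other than < or >, so it cannot cross a tag boundary, and the match has to start at the <b> that directly contains piece.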
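To see the difference concretely, here is a minimal sketch that runs both patterns against the string from the question. The variable names (lazyDot, negatedClass, and so on) are illustrative and not part of the original answer.

```javascript
// Sample input from the question.
const text = "<b>first</b> <b>I only want this piece</b>";

// Original pattern: the lazy .*? is still allowed to run across </b> and the next <b>.
const lazyDot = /<b>.*?piece.*?<\/b>/;

// Negated class: [^<>] cannot consume "<" or ">", so the text before "piece"
// must sit inside a single element.
const negatedClass = /<b>[^<>]*?piece.*?<\/b>/;

const lazyMatch = text.match(lazyDot)[0];
const negatedMatch = text.match(negatedClass)[0];

console.log(lazyMatch);    // "<b>first</b> <b>I only want this piece</b>"
console.log(negatedMatch); // "<b>I only want this piece</b>"
```

Note that the trailing .*? is left unchanged here, as in the answer; it is only the part before piece that needs to be kept from crossing a tag.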

Solution 2:[2]

This excludes any HTML tag from the text before piece, and might be a little more robust, depending on how predictable your string is:

var regX = /<b>(?:(?!<[^>]*>).)*piece.*?<\/b>/
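One case where the two approaches may differ is element text that contains a literal > character. The sketch below uses a made-up input string (an assumption, not from the question) to compare the negated character class from Solution 1 with the lookahead pattern above.

```javascript
// Hypothetical input: the wanted element contains a literal ">" character.
const sample = "<b>first</b> <b>2 > 1 in this piece</b>";

// Solution 1: [^<>] refuses every "<" and ">", even one that is not part of a tag.
const negated = /<b>[^<>]*?piece.*?<\/b>/;

// Solution 2: the lookahead only rejects positions where a complete <...> tag starts.
const tempered = /<b>(?:(?!<[^>]*>).)*piece.*?<\/b>/;

const negatedResult = sample.match(negated);
const temperedResult = sample.match(tempered);

console.log(negatedResult);     // null
console.log(temperedResult[0]); // "<b>2 > 1 in this piece</b>"
```

A lone < followed later by any > would still look like a tag to the lookahead, so this is only a partial improvement; for arbitrary HTML a real parser is usually the safer choice.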

If you also want to match across newline characters, replace the dot (.) with [\s\S], which matches any character including line breaks:

var regX = /<b>(?:(?!<[^>]*>)[\s\S])*piece[\s\S]*?<\/b>/
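As a rough check of the newline behaviour, the following sketch applies a dot-based pattern and a [\s\S]-based pattern to a hypothetical two-line input. The input string and variable names are assumptions made for illustration.

```javascript
// Hypothetical input: the wanted element spans two lines.
const multiline = "<b>first</b> <b>I only want\nthis piece</b>";

// Dot-based version: "." does not match "\n" unless the s (dotAll) flag is set.
const dotVersion = /<b>(?:(?!<[^>]*>).)*piece.*?<\/b>/;

// [\s\S] matches any character, including line breaks.
const anyCharVersion = /<b>(?:(?!<[^>]*>)[\s\S])*piece[\s\S]*?<\/b>/;

const dotResult = multiline.match(dotVersion);
const anyCharResult = multiline.match(anyCharVersion);

console.log(dotResult);        // null
console.log(anyCharResult[0]); // "<b>I only want\nthis piece</b>"
```

In engines that support the s flag (ES2018 and later), writing the dot-based pattern with /s should behave the same way as the [\s\S] version.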

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Avinash Raj
Solution 2