Regex matched text between tags is too greedy
I am trying to extract text from a string and am having trouble with laziness/greediness.
In the example I want to match <b>I only want this piece</b>, so my regex lazily matches anything between <b> and </b>, as long as it contains 'piece'.
The problem with my regex is that the matched text also includes <b>first</b>.
var text = "<b>first</b> <b>I only want this piece</b>";
var regX = /<b>.*?piece.*?<\/b>/;
var matches = text.match(regX);
Matched text
"<b>first</b> <b>I only want this piece</b>"
Desired match
"<b>I only want this piece</b>"
Solution 1:[1]
Use a negated char class instead of the first .*?.
var regX = /<b>[^<>]*?piece.*?<\/b>/;
Why?
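Because a regex match starts at the leftmost position where it can succeed. <b>.*?piece matches the first <b>, and the lazy .*? then expands one character at a time until it reaches the text piece; it does not care that it crosses </b> and <b> on the way. Laziness only keeps the match short from a given starting point; it does not move the starting point forward. [^<>]*? lazily matches zero or more characters other than < and >, so it cannot cross a tag boundary, and the match has to start at the <b> that directly contains piece.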
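To see the difference side by side, here is a minimal sketch that runs both patterns against the string from the question. The variable names (lazyPattern, negatedPattern, and so on) are chosen for illustration only.

```javascript
// Sample input from the question.
const text = "<b>first</b> <b>I only want this piece</b>";

// Original pattern: the lazy .*? can still cross </b> and <b> to reach "piece".
const lazyPattern = /<b>.*?piece.*?<\/b>/;

// Negated character class: the text before "piece" may not contain < or >.
const negatedPattern = /<b>[^<>]*?piece.*?<\/b>/;

const lazyMatch = text.match(lazyPattern)[0];
const negatedMatch = text.match(negatedPattern)[0];

console.log(lazyMatch);    // <b>first</b> <b>I only want this piece</b>
console.log(negatedMatch); // <b>I only want this piece</b>
```

The negated class assumes there is no markup and no stray < or > between <b> and piece.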
Solution 2:[2]
This rejects any HTML tag between <b> and piece, and might be a little more robust, depending on how predictable your string is:
var regX = /<b>(?:(?!<[^>]*>).)*piece.*?<\/b>/;
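As a rough illustration of where this pattern and the negated character class can differ, the sketch below uses a made-up input containing a stray > that is not part of a tag. The input string and variable names are assumptions for the example, not part of the original question.

```javascript
// Hypothetical input: the wanted element contains a lone ">" character.
const strayText = "<b>first</b> <b>x > 0: this piece</b>";

// Solution 1: [^<>] refuses every < or >, including the stray one.
const negatedClass = /<b>[^<>]*?piece.*?<\/b>/;

// Solution 2: the lookahead only refuses a position where a full <...> tag starts.
const temperedDot = /<b>(?:(?!<[^>]*>).)*piece.*?<\/b>/;

const negatedResult = strayText.match(negatedClass);
const temperedResult = strayText.match(temperedDot);

console.log(negatedResult);     // null
console.log(temperedResult[0]); // <b>x > 0: this piece</b>
```

On the string from the question both patterns return the same match; which one fits better depends on what the real input can contain.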
If you also want to match newline characters, replace the dot (.) with [\s\S]. Inside a character class the dot is just a literal ., so [.\s\S] is equivalent to [\s\S]:
var regX = /<b>(?:(?!<[^>]*>)[\s\S])*piece[\s\S]*?<\/b>/;
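The following sketch shows the effect of that change on an input where the wanted element spans two lines. The multi-line string and the variable names are assumptions made for the example.

```javascript
// Hypothetical input: the wanted <b> element contains a newline.
const multilineText = "<b>first</b> <b>I only want\nthis piece</b>";

// Dot-based version: . does not match \n, so nothing is found here.
const dotPattern = /<b>(?:(?!<[^>]*>).)*piece.*?<\/b>/;

// [\s\S] matches any character, newlines included.
const anyCharPattern = /<b>(?:(?!<[^>]*>)[\s\S])*piece[\s\S]*?<\/b>/;

const dotResult = multilineText.match(dotPattern);
const anyCharResult = multilineText.match(anyCharPattern);

console.log(dotResult);        // null
console.log(anyCharResult[0]); // the second <b> element, newline included
```

In engines that support the ES2018 s (dotAll) flag, keeping the dots and writing the first pattern with /s should behave the same way.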
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Avinash Raj |
| Solution 2 | |
