Regex: start capturing at \b or at the end of (www\.)
I am trying to capture the first occurrence of anything that looks like a domain name in a string, for example my.domain.home.com from 'dfasdf https://www.my.domain.home.com fadsfas'. I am using a \b assertion together with a non-capturing group (?:www\.) to mark the start of my capturing group, but instead I get www.my.domain.home.com, i.e. the www. is not stripped out.
This is my full regex:
\b(?:www\.)((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b
This is the part that I am unsure of:
\b(?:www\.)
How can I make my capture start at the beginning of the word OR at the end of 'www.'?
[CLARIFICATION] If there is no 'www.' it should capture at the beginning of the word. If there is 'www.' it should start capturing after the dot in 'www.' at the beginning of the possible domain string.
I have also tested it at https://www.regex101.com/r/NjR11m/1/tests, but my target is the Teradata 15.10 regex engine, which is said to be compatible with the Perl dialect, so an answer in the Perl context should work for me.
SELECT 'dfasdf https://www.my.domain.home.com fadsfas' AS string,
REGEXP_SUBSTR(string,
'\b(?:www\.)((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b'
) AS url_to_match;
For 'dfasdf https://my.domain.home.com fadsfas' it should return my.domain.home.com as well.
Additional examples of strings that should also return my.domain.home.com:
'dfasdf my.domain.home.com fadsfas'
'dfasdf ,my.domain.home.com-- fadsfas'
'dfasdf www.my.domain.home.com#fadsfas'
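The behaviour described above can be reproduced outside Teradata. The sketch below uses Python's re module as a stand-in, on the assumption that it treats \b, (?:...) and (?=...) the same way for this pattern; it has not been run against Teradata 15.10. The names ORIGINAL and whole_match are illustrative only. It shows two things: a non-capturing group is still consumed, so it is part of the whole match (which is what REGEXP_SUBSTR appears to return, judging by the output reported above), and because the group is mandatory, a string without 'www.' does not match at all.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = (
    r"\b(?:www\.)"
    r"((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+"
    r"[a-z]{2,63}\b"
)


def whole_match(pattern, text):
    """Return the whole match (group 0), or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# (?:www\.) does not create a capture group, but it still consumes "www.",
# so the prefix stays in the whole match.
with_www = whole_match(ORIGINAL, "dfasdf https://www.my.domain.home.com fadsfas")

# The group is not optional, so a string without "www." cannot match.
without_www = whole_match(ORIGINAL, "dfasdf https://my.domain.home.com fadsfas")

print(with_www)
print(without_www)
```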
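[SOLUTION] Replace the non-capturing group (?:www\.) with the negative lookahead (?!www\.), and lowercase the input, since the pattern only matches a-z: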
REGEXP_SUBSTR(LOWER(string),
'\b(?!www\.)((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b'
)
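Why this works: (?!www\.) is a zero-width negative lookahead, so it consumes nothing and only forbids a match from starting at 'www.'. The engine then moves on, and the next word boundary where the rest of the pattern can succeed is the one right after the dot of 'www.'. When there is no 'www.', the lookahead passes at the start of the word. The sketch below checks the solution pattern against the sample strings from the question, again using Python's re module as an assumed stand-in for the Teradata engine; SOLUTION, SAMPLES and first_domain are illustrative names, and text.lower() plays the role of LOWER(string).

```python
import re

# The pattern from the solution: (?!www\.) replaces (?:www\.).
SOLUTION = (
    r"\b(?!www\.)"
    r"((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+"
    r"[a-z]{2,63}\b"
)

# Sample strings taken from the question.
SAMPLES = [
    "dfasdf https://www.my.domain.home.com fadsfas",
    "dfasdf https://my.domain.home.com fadsfas",
    "dfasdf my.domain.home.com fadsfas",
    "dfasdf ,my.domain.home.com-- fadsfas",
    "dfasdf www.my.domain.home.com#fadsfas",
]


def first_domain(text):
    """Return the first domain-like substring, lowercased, or None."""
    m = re.search(SOLUTION, text.lower())
    return m.group(0) if m else None


results = [first_domain(s) for s in SAMPLES]
for sample, found in zip(SAMPLES, results):
    print(f"{sample!r} -> {found}")
```

The lookahead is only tested at the position where the match starts, so a 'www.' label that appears later in a host name (for example after another label) is not stripped.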
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow