Extract all URLs in a free text block using RegEx [duplicate]
I'm attempting to detect all URLs listed in a free text block. I'm using .NET's Regex.Matches
call with the following regex: (http|https)://[^\s "']{4,}
Now, I've put in the following text:
here is a link http://somelink.com
here is a link that I didn't space withhttp://nospacelink.com/something?something=&39358235
http://nospacelink.com/something?something=&12233454
here is a link I already handled.
Here is some secret t&cs you're not allowed to know https://somethingbad.com
Just to be a little annoying I've put in a new address thingy capture type of 'http://somethinginspeechmarks.com' and what are you going to do now?
here is a link http://postTextLink.com at then some post text
Here is a link with a full stop http://alinkwithafullstoplink.com. And then some more.
and I get the following output:
http://somelink.com
http://nospacelink.com?something=&39358235
http://nospacelink.com?something=&12233454
http://alreadyhandledlink.com
https://somethingbad.com
http://somethinginspeechmarks.com
http://postTextLink.com
http://alinkwithafullstoplink.com.
Please notice the full stop on the last entry. How can I update my regex to say "If there is a full stop at the end, please ignore it?"
Also, please note that "Getting parts of a URL (Regex)" has nothing to do with my question, as that question is about how to break down a particular URL. I want to extract multiple, complete URLs. Please see my input and current output for clarification! I already have a regex that does most of what I want, but it isn't quite right. Could you please explain how my approach might be improved?
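For anyone who wants to reproduce the trailing full stop, here is a minimal sketch. It uses Python's re module as a stand-in for .NET's Regex.Matches (an assumption on my part; the constructs in this pattern should behave the same in both engines), and the sample text is a shortened version of the input above.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r"""(http|https)://[^\s "']{4,}""")

text = """here is a link http://somelink.com
Here is a link with a full stop http://alinkwithafullstoplink.com. And then some more.
a quoted one 'http://somethinginspeechmarks.com' here"""

# Use finditer + group(0) to get the whole match; in Python, findall would
# return only the captured scheme because the pattern contains a group.
urls = [m.group(0) for m in PATTERN.finditer(text)]

for url in urls:
    print(url)
```

The second URL keeps its trailing full stop because `.` is allowed by the character class `[^\s "']`, so the greedy quantifier runs up to the next space.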
Solution 1:[1]
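I would add something like [^\.] to the end of the pattern.
It says that the last character can't be a full stop.
So (http|https)://[^\s "']{4,}[^\.]
will try to match all addresses that do not end with a full stop.
Edit:
As suggested in the comments, this class should be better: [^.\s"']
Unlike [^\.], it also refuses whitespace and quotes as the final character, so the match cannot end on the space that follows the full stop. The full pattern becomes (http|https)://[^\s "']{4,}[^.\s"']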
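A small sketch comparing the two final character classes from this answer. Python's re module is used here in place of .NET's Regex (an assumption; the syntax involved is common to both), and the names NAIVE and REFINED are only illustrative.

```python
import re

# First suggestion: the last character may be anything except a full stop.
NAIVE = re.compile(r"""(http|https)://[^\s "']{4,}[^\.]""")
# Edited suggestion: the last character may not be '.', whitespace or a quote.
REFINED = re.compile(r"""(http|https)://[^\s "']{4,}[^.\s"']""")

text = (
    "Here is a link with a full stop http://alinkwithafullstoplink.com. And then some more.\n"
    "here is a link http://postTextLink.com at then some post text"
)

naive_urls = [m.group(0) for m in NAIVE.finditer(text)]
refined_urls = [m.group(0) for m in REFINED.finditer(text)]

print(naive_urls)    # matches end on the space after each URL
print(refined_urls)  # matches end on the last URL character
```

With the refined class the engine backtracks off the trailing full stop, because neither the space nor the full stop can serve as the final character.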
Solution 2:[2]
Updated:
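Consider the following minor change to your pattern:
(http|https)://[^\s "']{4,}(?=\.)
The lookahead (?=\.) requires a full stop immediately after the match without including it in the result, which drops the trailing full stop. Be aware that it applies to every URL: one that is not followed by a full stop is cut back to its last internal dot, so http://somelink.com is matched as http://somelink.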
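The sketch below runs this lookahead pattern next to a negative lookbehind variant, (?<!\.). The lookbehind variant is not part of this answer; it is an alternative offered under the assumption that the goal is only to forbid a match ending in a full stop. Python's re module stands in for .NET's Regex, and the names LOOKAHEAD and LOOKBEHIND are illustrative.

```python
import re

# Pattern from this answer: a full stop must follow the match.
LOOKAHEAD = re.compile(r"""(http|https)://[^\s "']{4,}(?=\.)""")
# Hypothetical alternative: the match must not end with a full stop.
LOOKBEHIND = re.compile(r"""(http|https)://[^\s "']{4,}(?<!\.)""")

text = "see http://alinkwithafullstoplink.com. And also http://somelink.com here"

lookahead_urls = [m.group(0) for m in LOOKAHEAD.finditer(text)]
lookbehind_urls = [m.group(0) for m in LOOKBEHIND.finditer(text)]

print(lookahead_urls)   # second URL is truncated before ".com"
print(lookbehind_urls)  # both URLs are complete, without the trailing full stop
```

The lookbehind only rejects an ending, so URLs that are followed by a space or a line break are left intact.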
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | halfer |