Python regex remove dots from dot separated letters
I would like to remove the dots within a word, such that a.b.c.d becomes abcd, but only under some conditions:
- There should be at least 2 dots within the word. For example, a.b remains a.b, but a.b.c is a match.
- Each dot-separated part should be 1 or 2 letters only. For example, a.bb.c is a match (because a, bb and c are 1 or 2 letters each), but aaa.b.cc is not a match (because aaa consists of 3 letters).
Here is what I've tried so far:
import re
texts = [
    'a.b.c',  # Should be: 'abc'
    'ab.c.dd.ee',  # Should be: 'abcddee'
    'a.b'  # Should remain: 'a.b'
]
for text in texts:
    text = re.sub(r'((\.)(?P<word>[a-zA-Z]{1,2})){2,}', r'\g<word>', text)
    print(text)
This selects "any dot followed by 1 or 2 letters", repeated 2 or more times. The matching works fine, but replacing with the group keeps only the last repetition; the earlier repetitions are dropped.
So, it prints:
ac
abee
a.b
Which is not what I want. I would appreciate any help, thanks.
Solution 1:[1]
Starting the match with a dot (\.) does not ensure that there is a character a-zA-Z before it.
If you use the named group word in the replacement, it will contain only the value of the last iteration, because a capture group inside a repeated group is overwritten on each repetition.
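A minimal sketch of that behaviour, using the pattern from the question on one of its own test strings:

```python
import re

# Pattern from the question: a capture group nested inside a repeated group.
pattern = re.compile(r'((\.)(?P<word>[a-zA-Z]{1,2})){2,}')

m = pattern.search('ab.c.dd.ee')
print(m.group(0))       # whole match: .c.dd.ee
print(m.group('word'))  # only the last iteration: ee
```

The whole match covers all three repetitions, but the named group only remembers the final one, which is why the replacement yields abee.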
Instead, you can match a whole word that contains 2 or more dots, where each dot-separated part is 1 or 2 characters a-zA-Z, and replace the dots with an empty string whenever there is a match.
To prevent aaa.b.cc from matching, you could make use of word boundaries \b:
\b[a-zA-Z]{1,2}(?:\.[a-zA-Z]{1,2}){2,}\b
The pattern matches:
- \b A word boundary to prevent the word being part of a larger word
- [a-zA-Z]{1,2} Match 1 or 2 times a char a-zA-Z
- (?: Non capture group
  - \.[a-zA-Z]{1,2} Match a dot and 1 or 2 times a char a-zA-Z
- ){2,} Close non capture group and repeat 2 or more times to match at least 2 dots
- \b A word boundary
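The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE. This is only a sketch; the name verbose_pattern and the sample sentence are made up for illustration:

```python
import re

# Same pattern as above, written with re.VERBOSE so each part carries a comment.
verbose_pattern = re.compile(r"""
    \b                    # word boundary
    [a-zA-Z]{1,2}         # first part: 1 or 2 letters
    (?:                   # non-capturing group
        \.[a-zA-Z]{1,2}   # a dot followed by 1 or 2 letters
    ){2,}                 # repeated 2 or more times, so at least 2 dots
    \b                    # word boundary
""", re.VERBOSE)

print(verbose_pattern.findall("see a.b.c and ab.c.dd.ee but not a.b"))
# ['a.b.c', 'ab.c.dd.ee']
```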
import re
pattern = r"\b[a-zA-Z]{1,2}(?:\.[a-zA-Z]{1,2}){2,}\b"
texts = [
    'a.b.c',
    'ab.c.dd.ee',
    'a.b',
    'aaa.b.cc'
]
for s in texts:
    # Remove the dots only inside the part of the string that matched
    print(re.sub(pattern, lambda x: x.group().replace(".", ""), s))
Output
abc
abcddee
a.b
aaa.b.cc
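One caveat: a word boundary also sits between a dot and a letter, so in an input such as aaa.b.c.d the pattern can still start matching at b. If that matters, a possible variant swaps the boundaries for lookarounds that refuse a neighbouring word character or dot. This is an untested-in-production sketch; the names boundary, strict and remove_dots are hypothetical and not part of the original answer:

```python
import re

# Pattern from the answer above.
boundary = re.compile(r"\b[a-zA-Z]{1,2}(?:\.[a-zA-Z]{1,2}){2,}\b")

# Hypothetical stricter variant: the match may not be preceded or followed
# by a word character or a dot.
strict = re.compile(r"(?<![\w.])[a-zA-Z]{1,2}(?:\.[a-zA-Z]{1,2}){2,}(?![\w.])")

def remove_dots(pattern, text):
    """Strip the dots from every match of pattern in text."""
    return pattern.sub(lambda m: m.group().replace(".", ""), text)

for s in ["a.b.c", "ab.c.dd.ee", "a.b", "aaa.b.c.d"]:
    print(s, remove_dots(boundary, s), remove_dots(strict, s))
```

For aaa.b.c.d the boundary version gives aaa.bcd, while the stricter one leaves the string unchanged. The trade-off is that the trailing lookahead also rejects a word immediately followed by a dot, for example at the end of a sentence.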
Solution 2:[2]
^(?=(?:.*?\.){2,}.*$)[a-z]{1,2}(?:\.[a-z]{1,2})+$
You can use this to check whether the whole string matches. If it's a match, you can just remove the dots using any simple method, such as str.replace.
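A minimal sketch of how this could be used, assuming each input is a single word rather than a longer text (the pattern is anchored with ^ and $ and only allows lowercase a-z). The names whole_string and strip_dots are hypothetical:

```python
import re

# Pattern from this answer: the lookahead requires at least 2 dots,
# the rest requires every dot-separated part to be 1 or 2 lowercase letters.
whole_string = re.compile(r'^(?=(?:.*?\.){2,}.*$)[a-z]{1,2}(?:\.[a-z]{1,2})+$')

def strip_dots(text):
    """Return text without dots if the whole string matches, else unchanged."""
    if whole_string.match(text):
        return text.replace('.', '')
    return text

for s in ['a.b.c', 'ab.c.dd.ee', 'a.b', 'aaa.b.cc']:
    print(strip_dots(s))
```

On these four inputs it prints abc, abcddee, a.b and aaa.b.cc. Unlike Solution 1, it will not find such words inside a longer sentence.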
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | vks |