Python regex remove dots from dot separated letters

I would like to remove the dots within a word, so that a.b.c.d becomes abcd, but only under some conditions:

  • There should be at least 2 dots within the word. For example, a.b remains a.b, but a.b.c is a match.
  • Each dot-separated part should be 1 or 2 letters only. For example, a.bb.c is a match (because a, bb and c are 1 or 2 letters each), but aaa.b.cc is not a match (because aaa consists of 3 letters).

Here is what I've tried so far:

import re
texts = [
    'a.b.c', # Should be: 'abc'
    'ab.c.dd.ee', # Should be: 'abcddee'
    'a.b' # Should remain: 'a.b'
]
for text in texts:
    text = re.sub(r'((\.)(?P<word>[a-zA-Z]{1,2})){2,}', r'\g<word>', text)
    print(text)

This matches "a dot followed by 1 or 2 letters", repeated 2 or more times. The match itself works fine, but replacing with the named group keeps only the last repetition, and the earlier repetitions are dropped.

So, it prints:

ac
abee
a.b

Which is not what I want. I would appreciate any help, thanks.



Solution 1:[1]

Starting the match with a dot does not ensure that there is a char a-zA-Z before it.

If you use the named group word in the replacement, it will contain only the value of the last iteration, because the named group itself sits inside a repeated group.
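
To make the "last iteration" behaviour concrete, here is a small sketch that runs the pattern from the question against one of its sample strings. The variable names whole_match and last_word are only illustrative.

```python
import re

# The pattern from the question: a dot plus 1-2 letters, repeated 2+ times.
pattern = re.compile(r'((\.)(?P<word>[a-zA-Z]{1,2})){2,}')

m = pattern.search('ab.c.dd.ee')

# group(0) is everything the repeated group consumed.
whole_match = m.group(0)
# The named group only remembers what it captured on the final repetition.
last_word = m.group('word')

print(whole_match)  # .c.dd.ee
print(last_word)    # ee
```

Replacing the whole match with \g<word> therefore turns ".c.dd.ee" into "ee", which is why the question sees "abee".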


Instead, you can match the whole word (parts of 1 or 2 chars a-zA-Z joined by 2 or more dots) and, when there is a match, replace the dots in it with an empty string.

To prevent aaa.b.cc from matching, you could make use of word boundaries \b:

\b[a-zA-Z]{1,2}(?:\.[a-zA-Z]{1,2}){2,}\b

The pattern matches:

  • \b A word boundary to prevent the word being part of a larger word
  • [a-zA-Z]{1,2} Match 1 or 2 times a char a-zA-Z
  • (?: Non-capturing group
    • \.[a-zA-Z]{1,2} Match a dot and 1 or 2 times a char a-zA-Z
  • ){2,} Close the non-capturing group and repeat it 2 or more times to match at least 2 dots
  • \b A word boundary
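
As a sketch, the same pattern can be written with re.VERBOSE so that the breakdown above lives next to the regex itself. The names DOTTED_WORD and is_dotted_word are hypothetical helpers for this illustration, not part of the original answer.

```python
import re

# Same pattern as above; re.VERBOSE ignores the layout whitespace and
# treats everything after '#' as a comment.
DOTTED_WORD = re.compile(r"""
    \b                        # word boundary: not in the middle of a longer word
    [a-zA-Z]{1,2}             # first part: 1 or 2 letters
    (?:\.[a-zA-Z]{1,2}){2,}   # a dot plus 1 or 2 letters, at least twice
    \b                        # word boundary: the last part must end here
""", re.VERBOSE)


def is_dotted_word(text):
    """Return True if text contains a match for DOTTED_WORD."""
    return DOTTED_WORD.search(text) is not None


for sample in ['a.b.c', 'ab.c.dd.ee', 'a.b', 'aaa.b.cc']:
    print(sample, is_dotted_word(sample))
```

The first two samples report True and the last two False: a.b has only one dot, and in aaa.b.cc no run of 1 or 2 letters followed by two dotted parts starts at a word boundary.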


import re

pattern = r"\b[a-zA-Z]{1,2}(?:\.[a-zA-Z]{1,2}){2,}\b"
texts = [
    'a.b.c',
    'ab.c.dd.ee',
    'a.b',
    'aaa.b.cc'
]

for s in texts:
    print(re.sub(pattern, lambda x: x.group().replace(".", ""), s))

Output

abc
abcddee
a.b
aaa.b.cc

Solution 2:[2]

^(?=(?:.*?\.){2,}.*$)[a-z]{1,2}(?:\.[a-z]{1,2})+$

You can use this to match the whole string. If it's a match, you can just remove the dots using any simple method, such as str.replace.

See demo.

https://regex101.com/r/BrNBtk/1
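
A minimal sketch of how this second pattern might be used. Note that it is anchored with ^ and $ and only allows lowercase a-z, so it assumes the entire string is the dotted word. The names WHOLE_STRING and strip_dots_if_match are hypothetical.

```python
import re

# Pattern from Solution 2: the lookahead requires at least 2 dots, and the
# rest requires 1-2 lowercase letters between dots over the whole string.
WHOLE_STRING = re.compile(r'^(?=(?:.*?\.){2,}.*$)[a-z]{1,2}(?:\.[a-z]{1,2})+$')


def strip_dots_if_match(text):
    """Remove all dots if the whole string matches, else return it unchanged."""
    if WHOLE_STRING.match(text):
        return text.replace('.', '')
    return text


for sample in ['a.b.c', 'ab.c.dd.ee', 'a.b', 'aaa.b.cc']:
    print(strip_dots_if_match(sample))
```

Unlike Solution 1, this approach leaves a string such as "see a.b.c" untouched, because the dotted word is only part of the string.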

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 vks