Removing a sentence from a text in dataframe column

I want to format a text column in a dataframe in the following way:

In entries where the last character of the string is a colon ":", I want to delete the last sentence of the text, i.e. the substring that starts right after the last ".", "?" or "!" and ends at that colon.

Example df:

index    text
1        Trump met with Putin. Learn more here:
2        New movie by Christopher Nolan! Watch here:
3        Campers: Get ready to stop COVID-19 in its tracks!
4        London was building a bigger rival to the Eiffel Tower. Then it all went wrong.

After formatting, it should look like this:

index    text
1        Trump met with Putin.
2        New movie by Christopher Nolan!
3        Campers: Get ready to stop COVID-19 in its tracks!
4        London was building a bigger rival to the Eiffel Tower. Then it all went wrong.

Solution 1:^[1]

Use sent_tokenize from the NLTK tokenize API, which IMO is the idiomatic way of tokenizing sentences:

from nltk.tokenize import sent_tokenize

# Split each text into sentences, then drop the sentences that end with ':'
(df['text'].map(sent_tokenize)
           .map(lambda sent: ' '.join([s for s in sent if not s.endswith(':')])))

index
1                                Trump met with Putin.
2                      New movie by Christopher Nolan!
3    Campers: Get ready to stop COVID-19 in its tra...
4    London was building a bigger rival to the Eiff...
Name: text, dtype: object

You might have to handle NaNs appropriately with a preceding fillna('') call if your column contains those.

In list form the output looks like this:

['Trump met with Putin.',
 'New movie by Christopher Nolan!',
 'Campers: Get ready to stop COVID-19 in its tracks!',
 'London was building a bigger rival to the Eiffel Tower. Then it all went wrong.']

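Note that NLTK needs to be installed first (pip install nltk), and sent_tokenize also needs the Punkt tokenizer data, which can be fetched with nltk.download('punkt') (the exact resource name may depend on the NLTK version).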
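
If NLTK is not an option, the rule from the question can also be sketched with a plain regular expression. This is a minimal sketch, not the method from the answer above: drop_trailing_colon_sentence is a hypothetical helper name, it assumes that any ".", "?" or "!" ends a sentence (so abbreviations such as "U.S." would be split too), and it assumes that a text consisting of a single colon-terminated sentence should become an empty string. Non-string values such as NaN or None are passed through unchanged.

```python
import re

# Matches the trailing sentence: everything after the last '.', '?' or '!'
# up to and including a colon at the very end of the string.
TRAILING_COLON_SENTENCE = re.compile(r"[^.?!]*:\s*$")


def drop_trailing_colon_sentence(text):
    """Remove the last sentence if the text ends with a colon (hypothetical helper)."""
    if not isinstance(text, str):
        return text  # leave NaN / None untouched
    return TRAILING_COLON_SENTENCE.sub("", text).rstrip()


texts = [
    "Trump met with Putin. Learn more here:",
    "New movie by Christopher Nolan! Watch here:",
    "Campers: Get ready to stop COVID-19 in its tracks!",
    "London was building a bigger rival to the Eiffel Tower. Then it all went wrong.",
]

cleaned = [drop_trailing_colon_sentence(t) for t in texts]
for line in cleaned:
    print(line)
```

On the dataframe this could be applied as df['text'].map(drop_trailing_colon_sentence). The regular expression knows nothing about abbreviations, which is where a trained tokenizer such as sent_tokenize tends to do better.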

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
