Removing a sentence from a text in a dataframe column
I want to format a text column in a dataframe in the following way:
In entries where the last character of the string is a colon ":", I want to delete the last sentence of the text, i.e. the substring starting at the character after the last ".", "?" or "!" and ending at that colon.
Example df:
index text
1 Trump met with Putin. Learn more here:
2 New movie by Christopher Nolan! Watch here:
3 Campers: Get ready to stop COVID-19 in its tracks!
4 London was building a bigger rival to the Eiffel Tower. Then it all went wrong.
After formatting, it should look like this:
index text
1 Trump met with Putin.
2 New movie by Christopher Nolan!
3 Campers: Get ready to stop COVID-19 in its tracks!
4 London was building a bigger rival to the Eiffel Tower. Then it all went wrong.
Solution 1:[1]
Using sent_tokenize from the NLTK tokenize API, which IMO is the idiomatic way of tokenizing sentences:
from nltk.tokenize import sent_tokenize

# Split each text into sentences, then rejoin those that do not end with a colon.
(df['text'].map(sent_tokenize)
           .map(lambda sent: ' '.join([s for s in sent if not s.endswith(':')])))
index
1 Trump met with Putin.
2 New movie by Christopher Nolan!
3 Campers: Get ready to stop COVID-19 in its tra...
4 London was building a bigger rival to the Eiff...
Name: text, dtype: object
You might have to handle NaNs appropriately with a preceding fillna('') call if your column contains those.
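As a rough illustration of that fillna('') step, here is a minimal sketch. The sample Series is hypothetical (it is not the dataframe from the question), and a plain string check stands in for the tokenizing step so that the snippet runs without NLTK:

```python
import pandas as pd

# Hypothetical column with one missing entry (not taken from the question).
s = pd.Series(['Trump met with Putin. Learn more here:', None])

# Replace missing values with an empty string so that every entry is a str.
# Without this, the mapped function below would receive None and raise.
filled = s.fillna('')

# Stand-in for the tokenizing step: any function that expects a str works here.
ends_with_colon = filled.map(lambda t: t.endswith(':'))
```

The same pattern should carry over to the answer's code by calling fillna('') on df['text'] before the first map.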
In list form the output looks like this:
['Trump met with Putin.',
'New movie by Christopher Nolan!',
'Campers: Get ready to stop COVID-19 in its tracks!',
'London was building a bigger rival to the Eiffel Tower. Then it all went wrong.']
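Note that NLTK needs to be pip-installed, and sent_tokenize also needs the Punkt tokenizer data, which is fetched with nltk.download('punkt') (newer NLTK releases ask for 'punkt_tab' instead).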
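If installing NLTK is not an option, the rule stated in the question can also be approximated with the standard-library re module. The following is a minimal sketch and not part of the original answer; the name drop_trailing_colon_sentence is made up for illustration. It treats every ".", "?" or "!" as a sentence end, so it may cut at abbreviations such as "e.g." that a trained tokenizer could handle.

```python
import re

# Matches a final sentence that ends in a colon: everything after the last
# '.', '?' or '!' (or from the start of the string) up to the trailing ':'.
_TRAILING_COLON_SENTENCE = re.compile(r'[^.?!]*:\s*$')


def drop_trailing_colon_sentence(text):
    """Remove the last sentence of `text` if the text ends with a colon."""
    # rstrip() drops the space left in front of the removed sentence
    # (and any other trailing whitespace).
    return _TRAILING_COLON_SENTENCE.sub('', text).rstrip()


texts = [
    'Trump met with Putin. Learn more here:',
    'New movie by Christopher Nolan! Watch here:',
    'Campers: Get ready to stop COVID-19 in its tracks!',
    'London was building a bigger rival to the Eiffel Tower. Then it all went wrong.',
]
cleaned = [drop_trailing_colon_sentence(t) for t in texts]
```

On a dataframe, this would be applied as df['text'].map(drop_trailing_colon_sentence).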
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |