Removing `\n` using bs4 get_text()

from bs4 import BeautifulSoup


# current output as below
"""
'DOMINGUEZ, JONATHAN D. VS. RAMOS,\n
                                           SILVIA M'
"""

# desired one is

#  DOMINGUEZ, JONATHAN D. VS. RAMOS, SILVIA M

x = """<td width="350px" valign="top"
   style="padding:.5rem;">
   DOMINGUEZ, JONATHAN D. VS. RAMOS,
   SILVIA M
</td>"""

soup = BeautifulSoup(x, 'lxml')
print(soup.select_one('td').get_text(strip=True, separator='\n'))

I checked the docs and I believe that get_text() can do that, but I'm not sure how!



Solution 1:[1]

You might need a regular expression; this also gets rid of the extra spaces:

from bs4 import BeautifulSoup
import re

x = """<td width="350px" valign="top"
   style="padding:.5rem;">
   DOMINGUEZ, JONATHAN D. VS. RAMOS,
   SILVIA M
</td>"""

soup = BeautifulSoup(x, 'lxml')
# Collapse every whitespace run (newlines, repeated spaces) into a single space
text = re.sub(r'\s+', ' ', soup.select_one('td').get_text(strip=True))
print(text)

Giving:

DOMINGUEZ, JONATHAN D. VS. RAMOS, SILVIA M
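A quick standalone sketch of what the `re.sub(r'\s+', ' ', ...)` step does on its own (the sample string is made up to mirror the extracted text, with mixed tabs, newlines and repeated spaces):

```python
import re

# Raw text with mixed whitespace: runs of spaces, a tab and newlines,
# similar to what get_text(strip=True) returns for the <td> above.
raw = "DOMINGUEZ,   JONATHAN D.\n\tVS. RAMOS,\n   SILVIA M"

# \s+ matches any run of whitespace (spaces, tabs, newlines)
# and replaces the whole run with a single space.
clean = re.sub(r'\s+', ' ', raw)
print(clean)  # → DOMINGUEZ, JONATHAN D. VS. RAMOS, SILVIA M
```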

Solution 2:[2]

Change `separator='\n'` to `separator=' '`.
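A minimal sketch of when this works (using hypothetical markup and the stdlib `html.parser` instead of lxml): the separator is only inserted *between* text nodes, so the name has to be split across tags, e.g. by a `<br>`, as Solution 3 explains in more detail.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet where the two halves of the name are separate
# text nodes because a <br> sits between them.
html = "<td>DOMINGUEZ, JONATHAN D. VS. RAMOS,<br>SILVIA M</td>"
soup = BeautifulSoup(html, 'html.parser')

# strip=True trims each text node; separator=' ' joins the nodes.
joined = soup.td.get_text(strip=True, separator=' ')
print(joined)  # → DOMINGUEZ, JONATHAN D. VS. RAMOS, SILVIA M
```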

Solution 3:[3]

Just to give some context: in this case get_text(strip=True) per se won't work, because the text in your <td> is not "separated" by tags, and more precisely not by <br>, so it is recognized as one "multiline" string containing newline characters.

Resulting from this circumstance, the parameter strip=True only strips characters from the left and right of the whole string.
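A quick plain-Python illustration of that point, since str.strip() is what get_text(strip=True) applies to each text node (the sample string mirrors the single text node inside the <td>):

```python
# The <td> contains one text node with interior newlines and indentation.
s = "\n   DOMINGUEZ, JONATHAN D. VS. RAMOS,\n   SILVIA M\n"

# strip() only removes leading and trailing whitespace;
# the interior "\n   " survives untouched.
stripped = s.strip()
print(repr(stripped))  # → 'DOMINGUEZ, JONATHAN D. VS. RAMOS,\n   SILVIA M'
```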


Based on the following HTML, your approach with a whitespace separator would work:

x = """<td width="350px" valign="top"
style="padding:.5rem;">
DOMINGUEZ, JONATHAN D. VS. RAMOS,<br> 
SILVIA M

</td>"""

soup = BeautifulSoup(x, 'lxml')
print(soup.td.get_text(' ', strip=True))

But without any tag included in the string to act as a possible separator, I would recommend @Martin Evans' answer using a regex, which also handles the multiple whitespaces very well.

re.sub(r'\s+', ' ', soup.select_one('td').get_text(strip=True))

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Martin Evans
Solution 2 Omid
Solution 3 HedgeHog