'removing `\n` using bs4 get_text()
from bs4 import BeautifulSoup
# current output as below
"""
'DOMINGUEZ, JONATHAN D. VS. RAMOS,\n
SILVIA M'
"""
# desired one is
# DOMINGUEZ, JONATHAN D. VS. RAMOS, SILVIA M
x = """<td width="350px" valign="top"
style="padding:.5rem;">
DOMINGUEZ, JONATHAN D. VS. RAMOS,
SILVIA M
</td>"""
soup = BeautifulSoup(x, 'lxml')
print(soup.select_one('td').get_text(strip=True, separator='\n'))
I checked the docs and I believe that get_text()
can do that but am not sure how!
Solution 1:[1]
You might need a regular expression, this could also get rid of extra spaces:
from bs4 import BeautifulSoup
import re
x = """<td width="350px" valign="top"
style="padding:.5rem;">
DOMINGUEZ, JONATHAN D. VS. RAMOS,
SILVIA M
</td>"""
soup = BeautifulSoup(x, 'lxml')
text = re.sub(r'\s+', ' ', soup.select_one('td').get_text(strip=True))
print(text)
Giving:
DOMINGUEZ, JONATHAN D. VS. RAMOS, SILVIA M
Solution 2:[2]
change separator='\n'
to separator=' '
Solution 3:[3]
Just to give some context - In this case get_text(strip=True)
per se wont work cause, text in your <td>
is not "separated" by tags and more precise not by <br>
, so it would be recognized as one "multiline" string containing new line characters.
Resulting from this circumstance the parameter strip=True
only strips characters left and right from the whole string.
Based on following HTML it would work your way with whitespace as seperator:
x = """<td width="350px" valign="top"
style="padding:.5rem;">
DOMINGUEZ, JONATHAN D. VS. RAMOS,<br>
SILVIA M
</td>"""
soup = BeautifulSoup(x)
soup.td.get_text(' ',strip=True)
But without any tag included in the string as possible seperator I would recommend to use @Martin Evans answer using regex
that also handles the multiple whitespaces very well.
re.sub(r'\s+', ' ', soup.select_one('td').get_text(strip=True))
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Martin Evans |
Solution 2 | Omid |
Solution 3 | HedgeHog |