Extract everything inside tag, but not tag itself
I'm using BeautifulSoup to scrape text from a website, but I only want the <p> tags, to keep the text organized. However, I can't use text.findAll('p'), because there are other <p> tags that I don't want.
The text I want is all wrapped inside one tag (let's say body), but when I parse it, the result also includes that tag itself.
link = requests.get('link')
text = bs4.BeautifulSoup(link.text, 'html.parser').find('body')
How would I remove the body tag?
Solution 1:[1]
text = bs4.BeautifulSoup(link.text, 'html.parser').find('body').text
This concatenates all the text inside the body tag into a single string, with the markup removed.
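As a rough sketch of what that looks like in practice (the sample markup below is made up for illustration), .text keeps the newlines that sit between the tags, while get_text() with a separator and strip=True gives a tidier string:

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for the downloaded page.
html = "<body>\n<p>Hello <b>World</b></p>\n<p>Hello again</p>\n</body>"
body = BeautifulSoup(html, "html.parser").find("body")

# .text joins every string in the tag with no separator.
raw_text = body.text

# get_text() can strip each string and join the pieces with a separator.
clean_text = body.get_text(" ", strip=True)

print(repr(raw_text))
print(repr(clean_text))
```

Either way the <p> boundaries are lost, so this fits best when only the plain text matters.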
Solution 2:[2]
If you want everything in the tag (including HTML), but not the tag itself, you can use the decode_contents method of the Tag class. This renders the contents of the tag as a Unicode string:
>>> html = """
<body>
<p>Hello <b>World</b></p>
<p>Hello again</p>
</body>
"""
>>> body = bs4.BeautifulSoup(html, 'html.parser').find('body')
>>> body.decode_contents()
'\n<p>Hello <b>World</b></p>\n<p>Hello again</p>\n'
The question is a little ambiguous, so I'm not sure that is exactly what you are asking for. Here are some similar options that you or others may be looking for:
>>> body.text
'\nHello World\nHello again\n'
>>> str(body)
'<body>\n<p>Hello <b>World</b></p>\n<p>Hello again</p>\n</body>'
>>> body.contents
['\n', <p>Hello <b>World</b></p>, '\n', <p>Hello again</p>, '\n']
>>> [p.text for p in body.find_all('p')]
['Hello World', 'Hello again']
>>> list(body.strings)
['\n', 'Hello ', 'World', '\n', 'Hello again', '\n']
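Applied to the situation in the question, where unwanted <p> tags live outside the wrapper, a minimal sketch might look like the following. The div ids and the sample markup are assumptions made up for illustration; the idea is to find the wrapper first and then search only inside it:

```python
from bs4 import BeautifulSoup

# Hypothetical page: the wanted paragraphs sit inside one wrapper,
# and an unwanted <p> sits elsewhere on the page.
html = """
<div id="content">
<p>Hello <b>World</b></p>
<p>Hello again</p>
</div>
<div id="footer"><p>Unwanted</p></div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", id="content")

# Inner HTML of the wrapper, without the wrapper tag itself.
inner_html = container.decode_contents()

# Only the <p> tags that live inside the wrapper.
paragraphs = [p.get_text() for p in container.find_all("p")]

print(inner_html)
print(paragraphs)
```

Searching from the wrapper rather than from the whole document is what keeps the footer paragraph out of the result.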
Solution 3:[3]
This may help you:
>>> txt = """\
<p>Rahul</p>
<p><i>White</i></p>
<p>City <b>Beston</b></p>
"""
>>> soup = BeautifulSoup(txt, "html.parser")
>>> print("".join(soup.strings))
Rahul
White
City Beston
Or you can do this:
soup = BeautifulSoup(html, "html.parser")
bodyTag = soup.find('body')
# bodyTag is already a parsed Tag, so it does not need to be parsed again
print("".join(bodyTag.strings))
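If the goal is to drop the wrapper from the parsed tree itself rather than to produce a string, Tag.unwrap() is another option that none of the answers above use. A small sketch, with a made-up section wrapper standing in for the tag to remove:

```python
from bs4 import BeautifulSoup

# Made-up markup: <section> plays the role of the wrapper to remove.
html = "<div><section><p>Hello <b>World</b></p><p>Hello again</p></section></div>"
soup = BeautifulSoup(html, "html.parser")

# unwrap() replaces the tag with its own children, modifying the tree in place.
soup.section.unwrap()

result = str(soup)
print(result)
```

Unlike decode_contents(), this changes the soup in place, so the remaining tags can still be searched afterwards.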
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | å®æ°æŽ |
Solution 2 | Nala Nkadi |
Solution 3 | Piyush S. Wanare |