How to get the opening and closing tag in Beautiful Soup from an HTML string?
I am writing a Python script using Beautiful Soup, in which I need to get the opening tag from a string containing some HTML code.
Here is my string:
string = "<p>...</p>"
I want to get <p> in a variable called opening_tag and </p> in a variable called closing_tag. I have searched the documentation but cannot find a solution. Can anyone advise me on this?
Solution 1:[1]
There is no direct way to get the opening and closing parts of a tag in BeautifulSoup, but you can at least get its name:
>>> from bs4 import BeautifulSoup
>>>
>>> html_content = """
... <body>
... <p>test</p>
... </body>
... """
>>> soup = BeautifulSoup(html_content, "lxml")
>>> p = soup.p
>>> print(p.name)
p
With Python's built-in html.parser module, though, you can listen for "start" and "end" tag events.
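The event-based approach mentioned above can be sketched with the standard library's HTMLParser class. This is a minimal illustration, not part of the original answer: the class name TagRecorder and the sample markup are assumptions made up for the example. The start tag is read back verbatim with get_starttag_text(), while the end tag is rebuilt from the tag name, because the end-tag handler only receives the (lowercased) name.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Hypothetical helper: records start tags verbatim and rebuilt end tags."""

    def __init__(self):
        super().__init__()
        self.opening_tags = []
        self.closing_tags = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the most recently opened start tag
        # exactly as it appeared in the input.
        self.opening_tags.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Only the lowercased tag name is passed in, so the end tag is rebuilt.
        self.closing_tags.append(f"</{tag}>")


parser = TagRecorder()
parser.feed('<p class="intro">...</p>')  # sample markup, assumed for illustration
parser.close()

opening_tag = parser.opening_tags[0]
closing_tag = parser.closing_tags[0]
print(opening_tag)
print(closing_tag)
```

Because the start tag is copied from the input rather than regenerated, attribute order, quoting, and spacing are preserved as written.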
Solution 2:[2]
There is a way to do this with BeautifulSoup and a simple regular expression:
1. Put the paragraph in a BeautifulSoup object, e.g., soupParagraph.
2. Move the contents between the opening (<p>) and closing (</p>) tags to another BeautifulSoup object, e.g., soupInnerParagraph. Moving the contents does not delete them; afterwards, soupParagraph holds only the opening and closing tags.
3. Convert soupParagraph to HTML text and store it in a string variable.
4. To get the opening tag, use a regular expression to remove the closing tag from that string.
In general, parsing HTML with a regular expression is problematic and usually best avoided. However, it may be reasonable here, because a closing tag is simple: it has no attributes, and a comment is not allowed inside it. See the related Stack Overflow questions "Can I have attributes on closing tags?" and "HTML Comments inside Opening Tag of the Element".
This code gets the opening tag from a <body ...> ... </body> section. The code has been tested.
import re
from bs4 import BeautifulSoup

# The variable "body" is a BeautifulSoup object that contains a <body> section.
bodyInnerHtml = BeautifulSoup("", 'html.parser')
bodyContentsList = body.contents
for _ in range(len(bodyContentsList)):
    # .append moves the HTML element from body to bodyInnerHtml,
    # so index 0 always holds the next element still left in body.
    bodyInnerHtml.append(bodyContentsList[0])
# Convert the <body> opening and closing tags to HTML text format
bodyTags = body.decode(formatter='html')
# Extract the opening tag, by removing the closing tag
regex = r"(\s*<\/body\s*>\s*$)\Z"
substitution = ""
bodyOpeningTag, substitutionCount = re.subn(regex, substitution, bodyTags, 0, re.M)
if substitutionCount != 1:
    print("")
    print("ERROR. The expected HTML </body> tag was not found.")
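A regex-free variation on the same idea is sketched below, in case it is useful. Instead of moving the contents out of the original tag, it empties a detached copy and splits the rendered text on the closing tag. The helper name split_tag and the sample markup are assumptions for illustration; copy.copy() and Tag.clear() are existing Beautiful Soup features.

```python
import copy

from bs4 import BeautifulSoup


def split_tag(tag):
    """Hypothetical helper: return (opening, closing) strings for a bs4 Tag."""
    shell = copy.copy(tag)   # detached copy, so the parsed tree is left untouched
    shell.clear()            # drop any children so only the tag itself remains
    rendered = str(shell)    # e.g. '<p class="intro"></p>'
    closing = f"</{tag.name}>"
    opening, found, _ = rendered.rpartition(closing)
    if not found:
        # Void elements such as <br/> are rendered without a closing tag.
        return rendered, ""
    return opening, closing


# Sample markup, assumed for illustration.
soup = BeautifulSoup('<p class="intro">Hello <b>world</b></p>', "html.parser")
opening_tag, closing_tag = split_tag(soup.p)
print(opening_tag)
print(closing_tag)
```

Note that the opening tag is re-serialized by Beautiful Soup, so its quoting and spacing may differ from the original source text.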
Solution 3:[3]
As far as I know, there is no built-in method in the BeautifulSoup API that returns the opening tag as is, but we can write a small function for that.
from bs4 import BeautifulSoup
from bs4.element import Tag

# here's your function
def get_opening_tag(element: Tag) -> str:
    """Return the opening tag of the given element, rebuilt from its name and attributes."""
    # Multi-valued attributes such as class are stored as lists, so join them first.
    raw_attrs = {k: v if not isinstance(v, list) else ' '.join(v) for k, v in element.attrs.items()}
    attrs = ' '.join(f'{k}="{v}"' for k, v in raw_attrs.items())
    # Avoid a trailing space ("<p >") when the element has no attributes.
    return f"<{element.name} {attrs}>" if attrs else f"<{element.name}>"

def test():
    markup = """
    <html>
    <body>
    <div id="root" class="class--name">
    ...
    </div>
    </body>
    </html>
    """
    # if you're interested in the div tag
    element = BeautifulSoup(markup, 'lxml').select_one("#root")
    print(get_opening_tag(element))

if __name__ == '__main__':
    test()
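The question also asks for the closing tag, and the function above does not escape attribute values. A possible extension is sketched here under those assumptions. The name build_tags is hypothetical, and the function takes a plain name and attribute mapping so that it runs without Beautiful Soup; with a parsed element it would be called as build_tags(element.name, element.attrs).

```python
from html import escape


def build_tags(name, attrs):
    """Hypothetical helper: rebuild (opening, closing) tag strings."""
    parts = [name]
    for key, value in attrs.items():
        # BeautifulSoup stores multi-valued attributes such as class as lists.
        if isinstance(value, list):
            value = " ".join(value)
        # escape() turns &, <, > and quotes into character references.
        parts.append(f'{key}="{escape(str(value))}"')
    return f"<{' '.join(parts)}>", f"</{name}>"


# Plain values stand in for element.name and element.attrs here.
opening_tag, closing_tag = build_tags("div", {"id": "root", "class": ["a", "b"]})
bare_opening, bare_closing = build_tags("p", {})
print(opening_tag, closing_tag)
print(bare_opening, bare_closing)
```

The tags are regenerated rather than copied, so they are equivalent to the source markup but not necessarily character-for-character identical to it.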
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | alecxe |
Solution 2 | |
Solution 3 | Adnan MARSO |