How to get the opening and closing tag in Beautiful Soup from an HTML string?

I am writing a Python script using Beautiful Soup, where I have to get the opening tag from a string containing some HTML code.

Here is my string:

string = "<p>...</p>"

I want to get <p> in a variable called opening_tag and </p> in a variable called closing_tag. I have searched the documentation but can't seem to find a solution. Can anyone advise me on this?



Solution 1:[1]

There is no direct way to get the opening and closing parts of a tag in BeautifulSoup, but you can at least get its name:

>>> from bs4 import BeautifulSoup
>>> 
>>> html_content = """
... <body>
...     <p>test</p>
... </body>
...  """
>>> soup = BeautifulSoup(html_content, "lxml")
>>> p = soup.p
>>> print(p.name)
p

With Python's standard-library html.parser module, though, you can handle "start tag" and "end tag" events as the markup is parsed.
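As a minimal sketch of that event-based approach, the snippet below subclasses HTMLParser from the standard library and records each start tag and end tag it sees. The class name TagRecorder, its two list attributes, and the class="intro" attribute on the sample markup are assumptions made for illustration; they do not come from the original answer. The start tag is taken verbatim from the input via get_starttag_text(), while the end tag is rebuilt from the tag name.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Hypothetical helper that records start tags and end tags as they are parsed."""

    def __init__(self):
        super().__init__()
        self.opening_tags = []
        self.closing_tags = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the most recently opened start tag
        # exactly as it was written in the input.
        self.opening_tags.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser only reports the (lowercased) tag name for end tags,
        # so the closing tag is rebuilt rather than copied from the input.
        self.closing_tags.append(f"</{tag}>")


recorder = TagRecorder()
recorder.feed('<p class="intro">...</p>')
recorder.close()

opening_tag = recorder.opening_tags[0]
closing_tag = recorder.closing_tags[0]
print(opening_tag)
print(closing_tag)
```

This route does not use BeautifulSoup at all, so it fits best when only the tag text is needed and not a navigable tree.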

Solution 2:[2]

There is a way to do this with BeautifulSoup and a simple regular expression:

  • Put the paragraph in a BeautifulSoup object, e.g., soupParagraph.

  • Move the contents between the opening (<p>) and closing (</p>) tags to another BeautifulSoup object, e.g., soupInnerParagraph. (Moving the contents does not delete them.)

  • Then, soupParagraph will just have the opening and closing tags.

  • Convert soupParagraph to HTML text format and store that in a string variable.

  • To get the opening tag, use a regular expression to remove the closing tag from the string variable.
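The steps above can be sketched for the <p> case from the question as follows. This is a rough illustration under stated assumptions: the variable names, the sample markup, and its class="intro" attribute are all made up here, and the code relies on the fact that append() moves an element out of its previous parent.

```python
import re

from bs4 import BeautifulSoup

# Sample markup (an assumption for this sketch, not from the question).
soup = BeautifulSoup('<p class="intro">Hello <b>world</b></p>', "html.parser")
paragraph = soup.p

# Move every child of <p> into a separate, initially empty soup object.
# append() detaches the child from <p>, so paragraph.contents shrinks each time.
inner = BeautifulSoup("", "html.parser")
while paragraph.contents:
    inner.append(paragraph.contents[0])

# <p> now holds only its own opening and closing tags.
tags_only = paragraph.decode()

# Strip the closing tag from the end of the string to leave the opening tag.
pattern = rf"\s*</{re.escape(paragraph.name)}\s*>\s*\Z"
opening_tag, count = re.subn(pattern, "", tags_only)
closing_tag = f"</{paragraph.name}>"

print(opening_tag)
print(closing_tag)
print(inner.decode())
```

Note that the opening tag is re-serialized by BeautifulSoup, so quoting and whitespace may differ from the original source text.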

In general, parsing HTML with a regular expression is problematic and usually best avoided. However, it may be reasonable here.

A closing tag is simple. It cannot have attributes, and a comment is not allowed within it. See the related questions "Can I have attributes on closing tags?" and "HTML Comments inside Opening Tag of the Element".

This code gets the opening tag from a <body...> ... </body> section. The code has been tested.

import re
from bs4 import BeautifulSoup

# The variable "body" is a BeautifulSoup Tag that contains a <body> section.
bodyInnerHtml = BeautifulSoup("", 'html.parser')
# .append moves each HTML element from body to bodyInnerHtml,
# so body.contents shrinks until it is empty.
while body.contents:
    bodyInnerHtml.append(body.contents[0])

# Convert the <body> opening and closing tags to HTML text format
bodyTags = body.decode(formatter='html')
# Extract the opening tag by removing the closing tag from the end of the string
regex = r"\s*</body\s*>\s*\Z"
bodyOpeningTag, substitutionCount = re.subn(regex, "", bodyTags)
if substitutionCount != 1:
    print("")
    print("ERROR.  The expected HTML </body> tag was not found.")

Solution 3:[3]

As far as I know, there is no built-in method in the BeautifulSoup API that returns the opening tag as is, but we can write a small function for that.

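from html import escape

from bs4 import BeautifulSoup
from bs4.element import Tag


# here's your function
def get_opening_tag(element: Tag) -> str:
    """Return the opening tag of the given element, rebuilt from its name and attributes."""
    parts = [element.name]
    for key, value in element.attrs.items():
        # multi-valued attributes such as class are stored as lists
        if isinstance(value, list):
            value = ' '.join(value)
        # escape quotes and ampersands so the rebuilt tag stays valid HTML
        parts.append(f'{key}="{escape(str(value))}"')
    # joining the parts avoids a trailing space when there are no attributes
    return f"<{' '.join(parts)}>"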


def test():

    markup = """
    <html>
        <body>
            <div id="root" class="class--name">
                ...
            </div>
        </body>
    </html>
    """

    # if you're interested in the div tag
    element = BeautifulSoup(markup, 'lxml').select_one("#root")

    print(get_opening_tag(element))


if __name__ == '__main__':
    test()
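
The question also asks for the closing tag, which the function above does not cover. A possible companion is sketched below; the name get_closing_tag and the sample markup are assumptions for illustration. It rebuilds the closing tag from the element's name and returns an empty string for void elements such as <br/>, which BeautifulSoup serializes without a closing tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def get_closing_tag(element: Tag) -> str:
    """Hypothetical helper: return the closing tag, or "" for a void element."""
    # is_empty_element is True for tags like <br/> that are written
    # without a separate closing tag.
    if element.is_empty_element:
        return ""
    return f"</{element.name}>"


soup = BeautifulSoup('<div id="root"><p>...</p><br/></div>', "html.parser")

print(get_closing_tag(soup.p))
print(get_closing_tag(soup.div))
print(repr(get_closing_tag(soup.br)))
```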

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 alecxe
Solution 2
Solution 3 Adnan MARSO