Pandoc - HTML to Markdown - remove all attributes

This would seem like a simple thing to do, but I've been unable to find an answer. I'm converting from HTML to Markdown using Pandoc and I would like to strip all attributes from the HTML such as "class" and "id".

Is there an option in Pandoc to do this?



Solution 1:[1]

Consider input.html:

<h1 class="test">Hi!</h1>
<p><strong id="another">This is a test.</strong></p>

Then, pandoc input.html -t markdown_github-raw_html -o output.md

produces output.md:

Hi!
===

**This is a test.**

Without -t markdown_github-raw_html, you would get:

Hi! {#hi .test}
===

**This is a test.**

This question is actually similar to this one. I don't think pandoc ever preserves id attributes.
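
If the conversion is driven from a script, the command above can be wrapped in a few lines of Python. This is only a minimal sketch and not part of the original answer: build_pandoc_command and convert are hypothetical helper names, the writer string is simply the one used above, and pandoc is assumed to be installed and on PATH. Newer pandoc releases list markdown_github as deprecated in favor of gfm, so the writer is kept as a parameter rather than hard-coded.

```python
import shutil
import subprocess


def build_pandoc_command(input_path, output_path, writer="markdown_github-raw_html"):
    """Return the argument list for the pandoc call shown above.

    The default writer is the one from the answer; pass another
    writer string if your pandoc version needs it.
    """
    return ["pandoc", input_path, "-t", writer, "-o", output_path]


def convert(input_path, output_path, writer="markdown_github-raw_html"):
    """Run pandoc, failing early with a clear error if it is not installed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(build_pandoc_command(input_path, output_path, writer), check=True)
```

Calling convert("input.html", "output.md") would then run the same command as above.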

Solution 2:[2]

You can use a Lua filter to remove all attributes and classes. Save the following to a file remove-attr.lua and call pandoc with --lua-filter=remove-attr.lua.

function remove_attr (x)
  if x.attr then
    x.attr = pandoc.Attr()
    return x
  end
end

return {{Inline = remove_attr, Block = remove_attr}}
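
For readers who would rather not write Lua, roughly the same idea can be sketched as a JSON filter in Python using only the standard library. This is an illustration under assumptions, not the answer's method: strip_attrs and main are hypothetical names, and the code assumes pandoc's JSON layout in which an attribute is the triple [id, [classes], [[key, value], ...]], stored as the first item of "c" for Code, CodeBlock, Div, Span, Link and Image, and as the second item for Header. Tables, figures and their cells also carry attributes and are not handled here.

```python
import json
import sys

# Element types assumed to store their attribute triple as the first item of "c".
ATTR_FIRST = {"Code", "CodeBlock", "Div", "Span", "Link", "Image"}


def strip_attrs(node):
    """Recursively reset attributes in a pandoc JSON AST fragment (in place)."""
    if isinstance(node, list):
        for child in node:
            strip_attrs(child)
    elif isinstance(node, dict):
        tag = node.get("t")
        content = node.get("c")
        # Guard the types so that metadata maps with unusual keys are left alone.
        if isinstance(tag, str) and isinstance(content, list):
            if tag in ATTR_FIRST:
                content[0] = ["", [], []]  # empty id, no classes, no key-values
            elif tag == "Header":
                content[1] = ["", [], []]  # "c" is [level, attr, inlines]
        for value in node.values():
            strip_attrs(value)
    return node


def main():
    """Read a pandoc JSON document from stdin and write the stripped one to stdout."""
    doc = json.load(sys.stdin)
    json.dump(strip_attrs(doc), sys.stdout)

# When saving this as a filter script, end the file with a call to main().
```

Saved as, say, strip_attrs.py (a hypothetical filename) with the main() call added at the end, it would be passed to pandoc with --filter strip_attrs.py instead of --lua-filter. The Lua filter above is the shorter option and needs no external interpreter.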

Solution 3:[3]

I was also surprised that a web search for this seemingly simple operation turned up nothing. I ended up writing the following, based on the BeautifulSoup docs and example usage from other SO answers.

The code below also removes the script and style HTML tags, and it preserves any src and href attributes. Both behaviors are easy to adapt to your needs; once adjusted, use pandoc to convert the returned HTML to Markdown.

# https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree
from bs4 import BeautifulSoup, NavigableString

def unstyle_html(html):
    soup = BeautifulSoup(html, features="html.parser")

    # remove all attributes except for `src` and `href`
    for tag in soup.descendants:
        keys = []
        if not isinstance(tag, NavigableString):
            for k in tag.attrs.keys():
                if k not in ["src", "href"]:
                    keys.append(k)
            for k in keys:
                del tag[k]

    # remove all script and style tags
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()

    # return html text
    return soup.prettify()
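
If installing BeautifulSoup is not an option, a similar clean-up can be sketched with the standard library's html.parser. This swaps in a different technique from the code above and is only a rough sketch: AttributeStripper and strip_attributes are hypothetical names, the output is re-serialized by hand rather than pretty-printed, and comments and doctype declarations are silently dropped.

```python
from html import escape
from html.parser import HTMLParser

KEEP_ATTRS = {"src", "href"}      # attributes to preserve
DROP_TAGS = {"script", "style"}   # elements removed together with their content


class AttributeStripper(HTMLParser):
    def __init__(self):
        # Keep entities such as &amp; untouched so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.skip_depth = 0

    def _open_tag(self, tag, attrs, closing=">"):
        # Attribute values arrive unescaped, so escape them again on output.
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
            if name in KEEP_ATTRS
        )
        return "<{}{}{}".format(tag, kept, closing)

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth:
            self.parts.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <img ... />
        if tag not in DROP_TAGS and not self.skip_depth:
            self.parts.append(self._open_tag(tag, attrs, closing=" />"))

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.parts.append("&{};".format(name))

    def handle_charref(self, name):
        if not self.skip_depth:
            self.parts.append("&#{};".format(name))


def strip_attributes(html):
    """Return html with all attributes removed except those in KEEP_ATTRS."""
    parser = AttributeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

The returned string can be written to a file or piped to pandoc in the same way as the output of unstyle_html.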

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Clément
Solution 2 tarleb
Solution 3