Pandoc - HTML to Markdown - remove all attributes
This would seem like a simple thing to do, but I've been unable to find an answer. I'm converting from HTML to Markdown using Pandoc and I would like to strip all attributes from the HTML such as "class" and "id".
Is there an option in Pandoc to do this?
Solution 1:[1]
Consider input.html:
<h1 class="test">Hi!</h1>
<p><strong id="another">This is a test.</strong></p>
Then, pandoc input.html -t markdown_github-raw_html -o output.md produces output.md:
Hi!
===
**This is a test.**
Without the -t markdown_github-raw_html, you would get
Hi! {#hi .test}
===
**This is a test.**
This question is actually similar to this one. I don't think pandoc ever preserves id attributes.
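If the conversion needs to be scripted, the command above could be wrapped roughly as follows. This is a minimal sketch, not part of the original answer: the function names build_pandoc_command and convert are hypothetical, and it assumes pandoc is on the PATH. The output format is left as a parameter because recent pandoc releases list markdown_github as deprecated in favor of gfm.

```python
import subprocess


def build_pandoc_command(input_path, output_path,
                         to_format="markdown_github-raw_html"):
    """Return the pandoc argument list used in the answer above.

    Disabling the raw_html extension ("-raw_html") stops pandoc from
    falling back to raw HTML when an element cannot be written as
    plain Markdown.
    """
    return ["pandoc", input_path, "-t", to_format, "-o", output_path]


def convert(input_path, output_path, to_format="markdown_github-raw_html"):
    """Run pandoc; raises CalledProcessError if pandoc exits non-zero."""
    cmd = build_pandoc_command(input_path, output_path, to_format)
    subprocess.run(cmd, check=True)
```

Calling convert("input.html", "output.md") would then run the same command as shown in the answer.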
Solution 2:[2]
You can use a Lua filter to remove all attributes and classes. Save the following to a file remove-attr.lua and call pandoc with --lua-filter=remove-attr.lua.
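-- Replace the attributes of any element that carries them with an empty Attr
function remove_attr (x)
  if x.attr then
    x.attr = pandoc.Attr()
    return x
  end
end

-- Apply the function to every inline and block element
return {{Inline = remove_attr, Block = remove_attr}}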
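The same idea can be sketched without Lua by working on pandoc's JSON representation of the document (pandoc -t json). The version below is an illustration only, not the answerer's method. The names strip_attrs, ATTR_INDEX and main are hypothetical, and the index table reflects my understanding of where the attribute triple [id, classes, key-value pairs] sits for common element types. Tables, which nest attributes more deeply, are not handled.

```python
import json
import sys

# Position of the Attr triple inside each element's "c" list (assumed layout).
ATTR_INDEX = {
    "Header": 1,
    "Div": 0,
    "Span": 0,
    "Code": 0,
    "CodeBlock": 0,
    "Link": 0,
    "Image": 0,
}


def strip_attrs(node):
    """Recursively reset attributes on known element types, in place."""
    if isinstance(node, list):
        for item in node:
            strip_attrs(item)
    elif isinstance(node, dict):
        tag = node.get("t")
        idx = ATTR_INDEX.get(tag) if isinstance(tag, str) else None
        if idx is not None and isinstance(node.get("c"), list):
            node["c"][idx] = ["", [], []]  # empty id, classes, key-values
        for value in node.values():
            strip_attrs(value)
    return node


def main():
    """Read a pandoc JSON document from stdin and write the stripped one."""
    json.dump(strip_attrs(json.load(sys.stdin)), sys.stdout)
```

To use it as a filter, the script would need to call main() at the end and be placed in a pipeline between pandoc -t json and pandoc -f json. The Lua filter remains the simpler option, since it needs no knowledge of the JSON layout.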
Solution 3:[3]
I was also surprised that a web search for this seemingly simple operation didn't turn up anything. I ended up writing the following by referring to the BeautifulSoup documentation and example usage from other SO answers.
The code below also removes the script and style HTML tags. On top of that, it preserves any src and href attributes. These two behaviors should give you enough flexibility to fit your needs (i.e. adapt the function as needed, then use pandoc to convert the returned HTML to Markdown).
# https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree
from bs4 import BeautifulSoup, NavigableString


def unstyle_html(html):
    soup = BeautifulSoup(html, features="html.parser")

    # remove all attributes except for `src` and `href`
    for tag in soup.descendants:
        keys = []
        if not isinstance(tag, NavigableString):
            # collect the keys first so the dict is not modified while iterating
            for k in tag.attrs.keys():
                if k not in ["src", "href"]:
                    keys.append(k)
            for k in keys:
                del tag[k]

    # remove all script and style tags
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()

    # return html text
    return soup.prettify()
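If installing BeautifulSoup is not an option, roughly the same cleanup can be sketched with the standard library's html.parser. This is a swapped-in technique, not the code above: the names AttributeStripper and strip_attributes are hypothetical, and unlike BeautifulSoup it does not repair malformed markup. It also drops comments and doctype declarations and does not pretty-print the result.

```python
from html import escape
from html.parser import HTMLParser

KEEP_ATTRS = {"src", "href"}      # attributes to preserve
DROP_TAGS = {"script", "style"}   # elements removed along with their content


class AttributeStripper(HTMLParser):
    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def _open_tag(self, tag, attrs, closing=""):
        kept = ""
        for key, value in attrs:
            if key in KEEP_ATTRS:
                # The parser unescapes attribute values, so escape them again.
                kept += ' {}="{}"'.format(key, escape(value or "", quote=True))
        return "<{}{}{}>".format(tag, kept, closing)

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth:
            self.out.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        if tag not in DROP_TAGS and not self.skip_depth:
            self.out.append(self._open_tag(tag, attrs, " /"))

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append("&{};".format(name))

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append("&#{};".format(name))


def strip_attributes(html):
    """Return html with only src/href attributes and no script/style."""
    parser = AttributeStripper()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

The returned string can then be passed to pandoc in the same way as the output of unstyle_html.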
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Clément |
Solution 2 | tarleb |
Solution 3 |