Parse quoted-printable encoded content from a .mht file
I am trying to get all the images from a .mht file using the Nokogiri gem. But because the .mht file uses quoted-printable encoding, all the image tags I get back contain stray characters:
<img alt='3D"AFC-Logo' src="3D%22https://upload.=" width='3D"75"' height='3D"75"'>
<img src="3D%22https://en.wikipedia.org/static/images/footer/wikimedia-butto=" width='3D"88"' height='3D"31"' alt='3D"Wikimedia'>
<img src="3D%22https://en.wikipedia.org/static/images/footer/poweredby_mediawiki_8=" alt='3D"Powered' width='3D"88"' height='3D"31"'>
This is the link to that .mht file: https://drive.google.com/file/d/1DtbgrFyCEcggAk1nqpZSluNhRt-k3t95/view?usp=sharing
And below is the code that I am using to get all the images from the .mht file:
require 'nokogiri'

def get_image_links(html)
  html_doc = Nokogiri::HTML(html)
  nodes = html_doc.xpath("//img[@src]")
  raise "No <img .../> tags!" if nodes.empty?

  # Collect the unique, whitespace-stripped src values.
  nodes.inject([]) do |uris, node|
    puts node.to_s
    uris << node.attr('src').strip
  end.uniq
end

# Call the method only after it has been defined.
html = File.read("1646037951.mht")
image_links = get_image_links(html)
I have tried decoding it with .unpack('M').first, but it still doesn't work: it returns the same result as above.
Or does Rails have something built in for this?
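One possible direction, offered as a sketch rather than a confirmed fix: a .mht file is a MIME multipart document, so the quoted-printable encoding applies to individual parts, not to the file as a whole. Decoding the entire file (MIME headers and boundary lines included) or decoding attribute values after Nokogiri has already parsed the encoded markup may be why `unpack('M')` appears to change nothing. The helper below, `extract_html_from_mht`, is a hypothetical name introduced here. It assumes the file declares a `boundary=` parameter and contains a `text/html` part, and it uses only the Ruby standard library to decode that part before any HTML parsing happens.

```ruby
# Hypothetical helper (not from the original post): return the decoded body
# of the first text/html part of an MHT (MIME multipart) document, or nil.
def extract_html_from_mht(mht)
  # Normalise CRLF line endings so headers and bodies split predictably.
  text = mht.gsub("\r\n", "\n")

  # The boundary is declared in the top-level Content-Type header.
  boundary = text[/boundary="?([^";\n]+)"?/i, 1]
  parts = boundary ? text.split("--#{boundary}") : [text]

  parts.each do |part|
    # Each part is a header block, a blank line, then the body.
    headers, body = part.sub(/\A\n+/, "").split("\n\n", 2)
    next unless body && headers =~ /^Content-Type:\s*text\/html/i

    # Decode only this part's body, according to its own transfer encoding.
    case headers[/^Content-Transfer-Encoding:\s*(\S+)/i, 1].to_s.downcase
    when "quoted-printable" then return body.unpack1("M")
    when "base64"           then return body.unpack1("m")
    else                         return body
    end
  end
  nil
end
```

With this in place, the decoded markup could be passed to the existing method, for example `get_image_links(extract_html_from_mht(File.binread("1646037951.mht")))`. The returned string is binary-encoded, so it may need `force_encoding` with whatever charset the part declares. A MIME-aware library would be a more robust choice than this hand-rolled split for files with nested multipart sections.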
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow