'Convert html source code to json object
I am fetching html source code of many pages from one website, I need to convert it into json object and combine with other elements in json doc. . I have seen many questions on same topic but non of them were helpful.
My code:
url = "https://totalhash.cymru.com/analysis/?1ce201cf28c6dd738fd4e65da55242822111bd9f"
htmlContent = requests.get(url, verify=False)
data = htmlContent.text
print("data",data)
jsonD = json.dumps(htmlContent.text)
jsonL = json.loads(jsonD)
ContentUrl='{ \"url\" : \"'+str(urls)+'\" ,'+"\n"+' \"uid\" : \"'+str(uniqueID)+'\" ,\n\"page_content\" : \"'+jsonL+'\" , \n\"date\" : \"'+finalDate+'\"}'
above code gives me unicode type, however, when I put that output in jsonLint it gives me invalid json error. Can somebody help me understand how can I convert the complete html into a json objet?
Solution 1:[1]
jsonD = json.dumps(htmlContent.text)
converts the raw HTML content into a JSON string representation.
jsonL = json.loads(jsonD)
parses the JSON string back into a regular string/unicode object. This results in a no-op, as any escaping done by dumps()
is reverted by loads()
. jsonL
contains the same data as htmlContent.text
.
Try to use json.dumps
to generate your final JSON instead of building the JSON by hand:
ContentUrl = json.dumps({
'url': str(urls),
'uid': str(uniqueID),
'page_content': htmlContent.text,
'date': finalDate
})
Solution 2:[2]
The correct way to convert HTML source code to a JSON file on the local system is as follows:
import json
import codecs
# Load the JSON file by specifying the location and filename
with codecs.open(filename="json_file.json", mode="r", encoding="utf-8") as jsonf:
json_file = json.loads(jsonf.read())
# Load the HTML file by specifying the location and filename
with codecs.open(filename="html_file.html", mode='r', encoding="utf-8") as htmlf:
html_file = htmlf.read()
# Chose the key name where the HTML source code will live as a string
json_file['Key1']['Key2'] = html_file
# Dump the dictionary to JSON object and save it in a specific location
json_object = json.dumps(json_file, indent=4)
with codecs.open(filename="final_json_file.json", mode="w", encoding="utf-8") as ojsonf:
ojsonf.write(json_object)
- Next, open the JSON file in your editor.
- Press CTRL + H, and replace \n or \t characters by '' (nothing!).
- Now you can parse your HTML file with codecs.open() function and do the operations.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | cg909 |
Solution 2 |