Convert HTML source code to a JSON object

I am fetching the HTML source code of many pages from one website. I need to convert it into a JSON object and combine it with other elements in a JSON document. I have seen many questions on the same topic, but none of them were helpful.

My code:

url = "https://totalhash.cymru.com/analysis/?1ce201cf28c6dd738fd4e65da55242822111bd9f"
htmlContent = requests.get(url, verify=False)
data = htmlContent.text
print("data",data)
jsonD = json.dumps(htmlContent.text)
jsonL = json.loads(jsonD)

ContentUrl='{ \"url\" : \"'+str(urls)+'\" ,'+"\n"+' \"uid\" : \"'+str(uniqueID)+'\" ,\n\"page_content\" : \"'+jsonL+'\" , \n\"date\" : \"'+finalDate+'\"}'

The above code gives me a unicode object; however, when I paste that output into JSONLint, it reports invalid JSON. Can somebody help me understand how I can convert the complete HTML into a JSON object?



Solution 1:[1]

jsonD = json.dumps(htmlContent.text) converts the raw HTML content into a JSON string representation. jsonL = json.loads(jsonD) parses the JSON string back into a regular string/unicode object. This results in a no-op, as any escaping done by dumps() is reverted by loads(). jsonL contains the same data as htmlContent.text.
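To see the round trip in isolation, here is a minimal sketch. The html string is a made-up stand-in for htmlContent.text, not the real page content:

```python
import json

# Hypothetical stand-in for htmlContent.text
html = '<p class="note">Hello\n"world"</p>'

# dumps() produces a JSON string literal: quotes and the newline are escaped
encoded = json.dumps(html)

# loads() parses that literal and undoes the escaping again
decoded = json.loads(encoded)

print(encoded)           # the escaped JSON form
print(decoded == html)   # the round trip gives back the original string
```

So the escaped form only exists in the intermediate value (encoded here, jsonD in the question), and that value is never used when the final string is built.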

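The hand-built string is invalid because the quotes, backslashes, and newlines in the HTML end up inside it unescaped. Use json.dumps to generate your final JSON instead of building the JSON by hand: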

ContentUrl = json.dumps({
    'url': str(urls),
    'uid': str(uniqueID),
    'page_content': htmlContent.text,
    'date': finalDate
})
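A self-contained way to check the difference, without hitting the network, might look like the sketch below. All values (url, unique_id, final_date, page_html) are hypothetical placeholders for what the scraper would supply:

```python
import json

# Hypothetical placeholder values; in the question these come from the scraper
url = "https://example.com/page"
unique_id = 42
final_date = "2018-01-01"
page_html = '<div id="main">\n  <a href="/x">link</a>\n</div>'

# Hand-built JSON as in the question: the HTML is pasted in without escaping
hand_built = '{ "url" : "' + url + '", "page_content" : "' + page_html + '" }'
try:
    json.loads(hand_built)
    hand_built_valid = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    hand_built_valid = False

# Serializing a dict lets json.dumps take care of the escaping
document = json.dumps({
    "url": url,
    "uid": str(unique_id),
    "page_content": page_html,
    "date": final_date,
})
round_tripped = json.loads(document)

print(hand_built_valid)
print(round_tripped["page_content"] == page_html)
```

The hand-built version breaks at the first double quote inside the HTML, while the json.dumps version parses and gives back the HTML unchanged.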

Solution 2:[2]

The correct way to convert HTML source code to a JSON file on the local system is as follows:

import json
import codecs

# Load the JSON file by specifying the location and filename
with codecs.open(filename="json_file.json", mode="r", encoding="utf-8") as jsonf:
    json_file = json.loads(jsonf.read())

# Load the HTML file by specifying the location and filename
with codecs.open(filename="html_file.html", mode='r', encoding="utf-8") as htmlf:
    html_file = htmlf.read()

# Choose the key name where the HTML source code will live as a string
json_file['Key1']['Key2'] = html_file

# Serialize the dictionary to a JSON string and save it in a specific location
json_object = json.dumps(json_file, indent=4)
with codecs.open(filename="final_json_file.json", mode="w", encoding="utf-8") as ojsonf:
    ojsonf.write(json_object)
  • Next, open the JSON file in your editor.
  • Press CTRL + H and replace the \n and \t escape sequences with '' (nothing).
  • Now you can open the file with the codecs.open() function and carry out your operations.
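
If the goal is only to read the HTML back in Python, the manual search-and-replace may not be needed: the \n and \t in the saved file are JSON escape sequences, and parsing the file turns them back into real characters. A small sketch, assuming a made-up html_source and a temporary file path rather than the files named above:

```python
import json
import os
import tempfile

# Hypothetical HTML standing in for the contents of html_file.html
html_source = "<html>\n\t<body>Hello</body>\n</html>"

# Temporary location, used here only so the sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), "final_json_file.json")

# Write the document; the newline and tab are stored as the escapes \n and \t
with open(path, "w", encoding="utf-8") as f:
    json.dump({"Key1": {"Key2": html_source}}, f, indent=4)

with open(path, encoding="utf-8") as f:
    raw_text = f.read()

# Parsing the file converts the escapes back into real characters
restored = json.loads(raw_text)["Key1"]["Key2"]

print("\\n" in raw_text)        # the escape sequence is present in the file text
print(restored == html_source)  # the parsed value matches the original HTML
```

Removing the escapes by hand changes the stored HTML (line breaks and indentation are lost), so it is only appropriate when that whitespace does not matter.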

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 cg909
Solution 2