Issues with selecting #text inside an HTML div enclosed in double quotes using XPath/lxml in Python
I'm trying to extract the Fund Summary text on the following Yahoo Finance page using python:
Thus far, XPath with text() has worked well. However, it seems unable to select this particular text and always returns an empty list [].
I've tried the following xpaths:
tree.xpath("//*[@id='Col2-4-QuoteModule-Proxy']/div/div/div/text()")
tree.xpath('//*[@data-yaft-module="tdv2-applet-fundSummary"]/div/div/text()')
Is there something about the #text node that needs to be targeted differently? The first XPath above was copied directly from Inspect Element, so I'm not sure how else to select it.
Solution 1:[1]
You could try something like this, which reads the JSON that the page embeds in a script tag instead of querying the rendered DOM:
import json
import requests
from bs4 import BeautifulSoup
# get the page
r = requests.get('https://ca.finance.yahoo.com/quote/VUN.TO?p=VUN.TO')
soup = BeautifulSoup(r.text, 'lxml')
# find the script we want
js_delimiter = 'App.main = '
for s in soup.find_all('script'):
    if js_delimiter in s.text:
        myS = s.text
# load it as a json
js_variable = json.loads(myS.split(js_delimiter)[-1].split(';\n')[0])
# now js_variable contains all the content of the page
print(js_variable['context']['dispatcher']['stores']['StreamDataStore']['quoteData']['VUN.TO']['regularMarketPreviousClose']['fmt'])
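As a rough, offline illustration of the same idea, here is a sketch that uses only the standard library. The HTML string is a made-up stand-in for the fetched page (the real page's structure may differ and can change), and `extract_embedded_json` and `dig` are hypothetical helper names, not part of any library. It uses `json.JSONDecoder.raw_decode` instead of splitting on `';\n'`, so parsing stops at the end of the JSON value whatever follows it.

```python
import json

# Made-up stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
<div id="app"></div>
<script>
root.App = {};
root.App.main = {"context": {"dispatcher": {"stores": {"StreamDataStore": {"quoteData": {"VUN.TO": {"regularMarketPreviousClose": {"raw": 51.2, "fmt": "51.20"}}}}}}}};
</script>
</body></html>
"""


def extract_embedded_json(page_text, delimiter="App.main = "):
    """Return the JSON value that follows `delimiter`, or None if it is absent."""
    start = page_text.find(delimiter)
    if start == -1:
        return None
    start += len(delimiter)
    # raw_decode parses one JSON value starting at `start` and ignores the rest
    obj, _end = json.JSONDecoder().raw_decode(page_text, start)
    return obj


def dig(data, *keys):
    """Walk nested dicts, returning None as soon as a key is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return None
        data = data[key]
    return data


data = extract_embedded_json(SAMPLE_HTML)
previous_close = dig(
    data, "context", "dispatcher", "stores", "StreamDataStore",
    "quoteData", "VUN.TO", "regularMarketPreviousClose", "fmt",
)
print(previous_close)  # prints 51.20 for the sample above
```

Returning None for a missing delimiter or key makes it easier to notice when the page layout has changed, rather than failing with a NameError or KeyError partway through.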
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Nacho R |