Issues with selecting #text inside an HTML div enclosed in double quotes using XPath/lxml in Python

I'm trying to extract the Fund Summary text from the following Yahoo Finance page using Python:

So far, XPath expressions ending in text() have worked well for me. However, they seem unable to select this particular text and always return an empty list [].

I've tried the following XPaths:

  • tree.xpath("//*[@id='Col2-4-QuoteModule-Proxy']/div/div/div/text()")
  • tree.xpath('//*[@data-yaft-module="tdv2-applet-fundSummary"]/div/div/text()')

Is there something about the #text node that needs to be targeted differently? The first XPath above was copied directly from the browser's Inspect Element panel, so I'm not sure how else to select it.



Solution 1:[1]

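The text is probably not in the HTML that requests downloads; the page appears to build it in the browser from data embedded in a script tag, which would explain the empty result. You could try reading that data directly, something like this: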

import json
import requests
from bs4 import BeautifulSoup

# get the page
r = requests.get('https://ca.finance.yahoo.com/quote/VUN.TO?p=VUN.TO')
soup = BeautifulSoup(r.text, 'lxml')

# find the script that assigns the page state to App.main
js_delimiter = 'App.main = '
myS = None
for s in soup.find_all('script'):
    if js_delimiter in s.text:
        myS = s.text
        break
if myS is None:
    raise ValueError('no script containing "App.main = " was found')

# parse the object literal that follows the delimiter as JSON
js_variable = json.loads(myS.split(js_delimiter)[-1].split(';\n')[0])

# js_variable now holds the data the page is rendered from
print(js_variable['context']['dispatcher']['stores']['StreamDataStore']['quoteData']['VUN.TO']['regularMarketPreviousClose']['fmt'])
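
To try the same idea without hitting the network, here is a minimal sketch that runs against a small inline page. The sample HTML, the "summary" key, and the names ScriptCollector and extract_app_state are all hypothetical and only mimic the shape described above; the real Yahoo page is far larger and its layout may differ or change. It uses the standard library's html.parser in place of BeautifulSoup so that it has no dependencies. Note that the target div in the sample is empty, which is the situation in which a text() XPath has nothing to return.

```python
import json
from html.parser import HTMLParser

# Hypothetical stand-in for the served HTML: the target div is empty because
# the browser would fill it in later from the JSON assigned to App.main.
SAMPLE_PAGE = """<html><body>
<div id="Col2-4-QuoteModule-Proxy"><div><div><div></div></div></div></div>
<script>
(function (root) {
root.App.main = {"context": {"summary": "Example fund summary text."}};
}(this));
</script>
</body></html>"""

JS_DELIMITER = 'App.main = '


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self._in_script = True
            self.scripts.append('')

    def handle_endtag(self, tag):
        if tag == 'script':
            self._in_script = False

    def handle_data(self, data):
        # Script bodies can arrive in several chunks, so accumulate them.
        if self._in_script:
            self.scripts[-1] += data


def extract_app_state(page_source):
    """Return the JSON object assigned to App.main, or None if it is absent."""
    collector = ScriptCollector()
    collector.feed(page_source)
    collector.close()
    for script_text in collector.scripts:
        if JS_DELIMITER in script_text:
            # Keep what follows the delimiter, up to the statement's ';\n'.
            payload = script_text.split(JS_DELIMITER)[-1].split(';\n')[0]
            return json.loads(payload)
    return None


state = extract_app_state(SAMPLE_PAGE)
print(state['context']['summary'])
```

Returning None when no matching script exists lets the caller decide how to handle a page whose layout has changed, instead of failing on an undefined variable.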

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Nacho R