Page scraping to get prices from Google Finance

I am trying to get stock prices by scraping Google Finance pages. I am doing this in Python, using the urllib2 package to fetch the page and a regex to extract the price data.

When I leave my Python script running, it works for a few minutes and then starts throwing an exception: [HTTP Error 503: Service Unavailable]

I guess this happens because the web server treats the frequent page requests as coming from a robot and starts returning this error after a while.

Is there a way around this, e.g. deleting or creating some cookie?

Or, even better, does Google provide an API? I want to do this in Python because the complete app is in Python, but if nothing is available in Python I can consider alternatives. This is the Python method that I call in a loop to get the data (with a few seconds of sleep between calls):

 def getPriceFromGOOGLE(self, symbol):
    """ 
    gets last traded price from google for given security
    """         
    toReturn = 0.0
    try:
        base_url = 'http://google.com/finance?q='
        req = urllib2.Request(base_url + symbol)
        content = urllib2.urlopen(req).read()
        namestr = 'name:\"' + symbol + '\",cp:(.*),p:(.*),cid(.*)}'
        m = re.search(namestr, content)
        if m:
            data = str(m.group(2).strip().strip('"'))
            price = data.replace(',','')
            toReturn = float(price)
        else:
            print 'ERROR ' + str(symbol) + ' --- ' + str(content)      
    except Exception, exc:
        print 'Exc: ' + str(exc)       
    finally: 
        return toReturn


Solution 1:[1]

There is a Google Finance API:

http://code.google.com/apis/finance/docs/2.0/developers_guide_protocol.html

And there is a Python client library for it:

http://code.google.com/p/gdata-python-client/

Solution 2:[2]

The question is quite old, and the selected answer is no longer valid: the API has been deprecated.

There is an open-source project that scrapes all companies from Google Finance and matches them with their current prices: http://scrape-google-finance.compunect.com/
The project solves most of these issues: it includes caching and IP management, and it runs stably without getting blocked.
It uses the internal finance company-matching API to scrape companies and the chart API to get prices. It is PHP code, not Python, but you can still learn how it solves these tasks and adapt the approach.

Solution 3:[3]

To get around much of the rate limiting or bot detection used by the likes of Google, Wikipedia or Yahoo, spoof your User-Agent.

This will make your script's requests appear to come from a Google Chrome browser (the string below identifies Chrome 11, so substitute a current one).

headers = {'User-Agent' : "Mozilla/5.0 (Windows NT 6.0; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.16 Safari/534.24"}
req = urllib2.Request(url,None,headers)
content = urllib2.urlopen(req).read()
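
The snippet above is Python 2 (urllib2). As a rough sketch of the same idea in Python 3, assuming only the standard library: the helper names `build_request` and `fetch` and the particular User-Agent string are illustrative choices, not part of the original answer.

```python
import urllib.request

# Example browser-like User-Agent string; any current browser string could be used.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36"
)


def build_request(url):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url, timeout=30):
    """Download the page body as text (performs a real network request)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Keeping the request construction separate from the download makes it easy to check which headers will be sent without touching the network.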

Solution 4:[4]

Yahoo Finance is also a good place to get financial information; it covers more countries and stocks.

For Python 2, you can use ystockquote. For Python 3, you can use yfq, which I rewrote from ystockquote.

To get current quotes for the symbols GOOG and INTL:

>>> import yfq
>>> yfq.get_price('GOOG+INTL')
{'GOOG': '600.25', 'INTL': '22.25'}

To get historical quotes for Yahoo from March 1, 2012 to March 3, 2012:

>>> yfq.get_historical_prices('YHOO','20120301','20120303')
[['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'], ['2012-03-02', '14.89', '14.92', '14.66', '14.72', '9164900', '14.72'], ['2012-03-01', '14.89', '14.96', '14.79', '14.93', '12283300', '14.93']]
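
The result is a header row followed by rows of strings, so it usually needs a little reshaping before use. A minimal sketch, using the sample output above as input (the helper name `rows_to_records` is made up for illustration and is not part of yfq):

```python
def rows_to_records(table):
    """Turn [header, row, row, ...] into a list of dicts keyed by column name."""
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


# Sample data copied from the output shown above.
table = [
    ["Date", "Open", "High", "Low", "Close", "Volume", "Adj Close"],
    ["2012-03-02", "14.89", "14.92", "14.66", "14.72", "9164900", "14.72"],
    ["2012-03-01", "14.89", "14.96", "14.79", "14.93", "12283300", "14.93"],
]

records = rows_to_records(table)
# The values are strings, so convert the ones you need.
closes = {r["Date"]: float(r["Close"]) for r in records}
print(closes)  # {'2012-03-02': 14.72, '2012-03-01': 14.93}
```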

Solution 5:[5]

You can also pull both current and historical data from Google Finance directly into Google Sheets via the GOOGLEFINANCE() function:

GOOGLEFINANCE("NASDAQ:GOOGL", "price", DATE(2014,1,1), DATE(2014,12,31), "DAILY")

Another way is to use Yahoo Finance instead, either via the yfinance package or with a query like this one, which returns JSON:

https://query1.finance.yahoo.com/v8/finance/chart/MSFT
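
A rough sketch of reading the price out of that JSON with the standard library. The response layout used here (chart, then result[0], then meta, then regularMarketPrice) is an assumption about an undocumented endpoint and may change, and the function names are illustrative:

```python
import json
import urllib.request

CHART_URL = "https://query1.finance.yahoo.com/v8/finance/chart/{symbol}"


def extract_price(payload):
    """Pull the latest price out of a parsed chart response.

    Assumes the layout chart -> result[0] -> meta -> regularMarketPrice;
    returns None if any of those pieces is missing.
    """
    try:
        return payload["chart"]["result"][0]["meta"]["regularMarketPrice"]
    except (KeyError, IndexError, TypeError):
        return None


def fetch_price(symbol, timeout=30):
    """Request the chart endpoint and return the price (network call)."""
    req = urllib.request.Request(
        CHART_URL.format(symbol=symbol),
        headers={"User-Agent": "Mozilla/5.0"},  # generic browser-like value
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return extract_price(json.load(resp))
```

Returning None instead of raising when a field is missing lets a polling loop skip a bad response and try again later.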

As for Google Finance, make sure you send a user-agent header so that the request looks like it comes from a real browser rather than from a bot or script. Websites may block a request whose user-agent is something like python-requests, which is the default in the requests library.

Check what your user-agent is and make sure it identifies a recent browser version, since an outdated user-agent can also get a request blocked.
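
If the server still answers with HTTP 503, as in the question, spacing out the retries may help. This is only a sketch of exponential backoff under that assumption: `retry_with_backoff` and its parameters are hypothetical names, and it expects a `fetch` callable that raises `urllib.error.HTTPError` on failure.

```python
import time
import urllib.error


def retry_with_backoff(fetch, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each HTTP 503."""
    for attempt in range(attempts):
        try:
            return fetch()
        except urllib.error.HTTPError as exc:
            # Only retry on 503; re-raise anything else, and give up on the last attempt.
            if exc.code != 503 or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

With the defaults, the waits are 2, 4, 8 and 16 seconds before the final attempt. The `sleep` parameter exists only so the delay can be replaced in tests.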


Full scraping example:

from bs4 import BeautifulSoup
import requests, lxml, json
from itertools import zip_longest


def scrape_google_finance(ticker: str):
    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        "hl": "en"
        }

    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    # https://www.whatismybrowser.com/detect/what-is-my-user-agent
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
        }

    html = requests.get(f"https://www.google.com/finance/quote/{ticker}", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")
    
    ticker_data = {
        "ticker_data": {},
        "about_panel": {}
    }
    
    ticker_data["ticker_data"]["current_price"] = soup.select_one(".AHmHk .fxKbKc").text
    ticker_data["ticker_data"]["quote"] = soup.select_one(".PdOqHc").text.replace(" • ",":")
    ticker_data["ticker_data"]["title"] = soup.select_one(".zzDege").text
    
    right_panel_keys = soup.select(".gyFHrc .mfs7Fc")
    right_panel_values = soup.select(".gyFHrc .P6K39c")
    
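    for key, value in zip_longest(right_panel_keys, right_panel_values):
        # zip_longest pads the shorter list with None, so skip incomplete pairs
        if key is None or value is None:
            continue
        key_value = key.text.lower().replace(" ", "_")

        ticker_data["about_panel"][key_value] = value.text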
    
    return ticker_data
    

data = scrape_google_finance(ticker="GOOGL:NASDAQ")

print(json.dumps(data, indent=2))
print(data["ticker_data"].get("current_price"))

JSON output:

{
  "ticker_data": {
    "current_price": "$2,534.60",
    "quote": "GOOGL:NASDAQ",
    "title": "Alphabet Inc Class A"
  },
  "about_panel": {
    "previous_close": "$2,597.88",
    "day_range": "$2,532.02 - $2,609.59",
    "year_range": "$2,193.62 - $3,030.93",
    "market_cap": "1.68T USD",
    "volume": "1.56M",
    "p/e_ratio": "22.59",
    "dividend_yield": "-",
    "primary_exchange": "NASDAQ",
    "ceo": "Sundar Pichai",
    "founded": "Oct 2, 2015",
    "headquarters": "Mountain View, CaliforniaUnited States",
    "website": "abc.xyz",
    "employees": "156,500"
  }
}

data["ticker_data"].get("current_price"):

$2,534.60
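
The scraped price is a display string, so it needs converting before any arithmetic. A small sketch, assuming a US-style format with a leading "$" and comma thousands separators as in the output above (the name `parse_price` is illustrative):

```python
def parse_price(text):
    """Convert a display string such as "$2,534.60" to a float."""
    cleaned = text.strip().lstrip("$").replace(",", "")
    return float(cleaned)


print(parse_price("$2,534.60"))  # 2534.6
```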

If you need to parse the whole Google Finance ticker page, I wrote a line-by-line blog post about it at SerpApi, "Scrape Google Finance Ticker Quote Data in Python". I mention it here because a full walkthrough is outside the scope of your question.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 AJ.
Solution 2
Solution 3 Aphex
Solution 4 angelo
Solution 5