I'm trying to create a service that scrapes websites by using Google cached pages.



The response I get is the HTML from the Google cache, which is an older version of the Nike site.

It works fine as long as I run it locally on my computer, but when I deploy to Google Cloud Platform, where I use a proxy server, I get a 403 error saying that I cannot access the information through a proxy server.

Example of the response from the proxy server:

403. That’s an error.

Your client does not have permission to get URL /search?q=cache:http://nike.com from this server. (Client IP address: XX.XXX.XX.XXX)
Article that talks about the problem: https://proxyserver.com/web-scraping-crawling/scraping-websites-via-google-cached-pages/

How can I solve this problem and run requests from the cloud as well without being blocked? Should I add parameters to the request?

Solution 1:[1]

I guess you should add a property to the header of your HTTP request,
for example:

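import java.net.*; // URI, URL, URLConnection
import java.net.http.HttpRequest;

// Option 1: set the header on a URLConnection
URL u = new URL("https://www.google.com/search?q=c");
URLConnection c = u.openConnection();
c.setRequestProperty("User-Agent", "MSIE 7.0");


// Option 2: set the header on a java.net.http.HttpRequest (Java 11+); change the URI to the one you need
HttpRequest request = HttpRequest.newBuilder(new URI("https://www.google.com/search?q=c"))
        .header("User-Agent", "MSIE 7.0").GET().build();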

These two examples are in Java, but I guess the same concept applies in all environments.
Hope that was helpful.
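
To make the idea easier to try, here is a minimal sketch that builds the cache lookup request with an explicit User-Agent header and lets you check the status code that comes back. The class name `CachedPageRequest`, its two methods, and the "Mozilla/5.0" value used below are only illustrative assumptions, not part of any library or of the original question. I can't confirm that changing the User-Agent alone will stop the 403 when the request comes from a cloud or proxy IP; this just shows how to set the header and observe the result.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only (requires Java 11+).
final class CachedPageRequest {

    // Builds a GET request for the Google cache lookup of targetUrl
    // and sets the given User-Agent header on it.
    static HttpRequest build(String targetUrl, String userAgent) {
        // Percent-encode the "cache:<url>" query value so it is safe in a URI.
        String query = URLEncoder.encode("cache:" + targetUrl, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create("https://www.google.com/search?q=" + query))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    // Sends the request and returns the HTTP status code (for example 200 or 403).
    static int fetchStatus(HttpRequest request) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```

For example, `CachedPageRequest.build("http://nike.com", "Mozilla/5.0")` gives a request that you can pass to `fetchStatus` to see whether the server answers with 200 or still with 403 from your deployed environment.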


