Scraping websites via Google Cached Pages has been blocked

I'm trying to create a service that scrapes websites using Google Cached Pages.

Example

https://webcache.googleusercontent.com/search?q=cache:nike.com

The response I get is the HTML from the Google cache, which is an older version of the Nike site.
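For reference, here is a minimal sketch of that request in Java using the built-in `java.net.http` client (the `buildCacheUrl` helper name is my own, not part of any API):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CacheFetch {
    // Build the Google cache URL for a given site (helper name is hypothetical)
    static String buildCacheUrl(String site) {
        return "https://webcache.googleusercontent.com/search?q=cache:" + site;
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest
                .newBuilder(URI.create(buildCacheUrl("nike.com")))
                .GET()
                .build();
        // When run locally this returns the cached HTML; from a
        // data-center/proxy IP, Google may answer with a 403 instead.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```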

It works fine as long as I run it locally on my computer, but when I deploy it to Google Cloud Platform, where the requests go through a proxy server, I get a 403 error saying that I cannot access the information through a proxy server.

Example of the response from the proxy server:

403. That’s an error.

Your client does not have permission to get URL /search?q=cache:http://nike.com from this server. (Client IP address: XX.XXX.XX.XXX)
Please see Google's Terms of Service posted at https://policies.google.com/terms

If you believe that you have received this response in error, please report your problem. However, please make sure to take a look at our Terms of Service (http://www.google.com/terms_of_service.html). In your email, please send us the entire code displayed below. Please also send us any information you may know about how you are performing your Google searches-- for example, "I'm using the Opera browser on Linux to do searches from home. My Internet access is through a dial-up account I have with the FooCorp ISP." or "I'm using the Konqueror browser on Linux to search from my job at myFoo.com. My machine's IP address is 10.20.30.40, but all of myFoo's web traffic goes through some kind of proxy server whose IP address is 10.11.12.13." (If you don't know any information like this, that's OK. But this kind of information can help us track down problems, so please tell us what you can.)

We will use all this information to diagnose the problem, and we'll hopefully have you back up and searching with Google again quickly!

Please note that although we read all the email we receive, we are not always able to send a personal response to each and every email. So don't despair if you don't hear back from us!

Also note that if you do not send us the entire code below, we will not be able to help you.

Best wishes,
The Google

An article that discusses the problem: https://proxyserver.com/web-scraping-crawling/scraping-websites-via-google-cached-pages/

How can I solve this problem and make these requests from the cloud without being blocked? Do I need to add parameters?

Thanks :)



Solution 1:[1]

I guess you should add a property to the header of your HTTP request, for example:

URL u = new URL("https://www.google.com/search?q=c");
URLConnection c = u.openConnection();
c.setRequestProperty("User-Agent", "MSIE 7.0");

or

HttpRequest request = HttpRequest.newBuilder(new URI("https://www.google.com/search?q=c"))
        .header("User-Agent", "MSIE 7.0")
        .GET()
        .build();
// note: change the URI to your target

These two examples are in Java, but I guess the same concept applies in any environment.
Hope that was helpful!
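Putting the two ideas together, here is a self-contained sketch that attaches the User-Agent header to a request for the cached page before sending it (whether this particular header value actually unblocks Google's cache from a cloud IP is not guaranteed; the `cachedPageRequest` helper name is my own):

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class HeaderExample {
    // Build a request for the Google-cached page of a site,
    // with a browser-like User-Agent attached.
    static HttpRequest cachedPageRequest(String site, String userAgent) {
        return HttpRequest.newBuilder(
                        URI.create("https://webcache.googleusercontent.com/search?q=cache:" + site))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = cachedPageRequest("nike.com", "MSIE 7.0");
        // Confirm the header is on the request before it is sent.
        System.out.println(req.headers().firstValue("User-Agent").orElse("(none)"));
    }
}
```

You can then pass the built request to an `HttpClient` as in the question; the point is only that the header must be set on the request builder before `build()`.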

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Abdelwahab