How to extract the HTTP response body in Python from a page that doesn't open?

I am trying to web crawl a page that doesn't open.

To access that page, I have to go through two pages.

So I tried this code.

from selenium import webdriver
from selenium.webdriver.ie.options import Options

Option = Options()
Option.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko")
browser = webdriver.Ie(r".\IEDriverServer.exe", ie_options=Option)

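# Open the login page
browser.get(LoginURL)

# Open the intermediate page
browser.get(PageURL)

# Open the target page
browser.get(TargetURL)

# page_source returns the HTML of the current page (WebDriver has no .text attribute)
print(browser.page_source)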

However, the last page is downloaded as a file instead of being displayed, because of the following response header:

content-disposition: attachment; filename="download.xls"
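To make clear what that header tells the browser, here is a minimal sketch that splits it into its disposition and filename using only the standard library. The function name `parse_content_disposition` is my own, not part of any library.

```python
from email.message import Message


def parse_content_disposition(header_value):
    """Split a Content-Disposition header into (disposition, filename).

    Hypothetical helper for illustration; it relies on the header parsing
    that the standard email package already provides.
    """
    msg = Message()
    msg["Content-Disposition"] = header_value
    # "attachment" asks the browser to save the body instead of rendering it.
    return msg.get_content_disposition(), msg.get_filename()


if __name__ == "__main__":
    print(parse_content_disposition('attachment; filename="download.xls"'))
```

With the header from my response, the disposition is "attachment", which is why the browser saves the body to disk instead of showing it.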

If I ignore the download and get the HTML source of the current page, I get the source of the previous page (PageURL), not the source of the final page I want.

The downloaded file has an .xls extension, but when I open it in Notepad, it actually contains HTML code.
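Since the file is really HTML, it could in principle be read as text and parsed after the download finishes. The sketch below assumes the HTML contains a table and that the file is UTF-8 encoded; neither is confirmed, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def read_rows_from_html_xls(path, encoding="utf-8"):
    """Read an HTML file saved with an .xls extension and return its table rows.

    The encoding is an assumption; adjust it to whatever the server sends.
    """
    with open(path, encoding=encoding, errors="replace") as html_file:
        parser = TableTextParser()
        parser.feed(html_file.read())
        parser.close()
    return parser.rows
```

Something like `read_rows_from_html_xls("download.xls")` would then return a list of rows, each a list of cell strings.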

When I press F12 in IE and look at the network analysis, the HTML source I want is in the response body. I can also right-click > Copy Response Payload.

In Chrome (Edge), the Response tab of the Network panel shows no preview. But when I save the traffic as a HAR file with content and open it as JSON, the data I want is there.

The data I want is at the following location in the saved HAR file: log > entries > 2 > response > content > text (that is, log.entries[2].response.content.text).
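For reference, reading that location out of a manually saved HAR file can be sketched as follows. The function name and the file name in the usage note are hypothetical; the base64 branch covers the case where the exporter stores the body encoded and marks it with an "encoding" field, which is allowed by the HAR format.

```python
import base64
import json


def response_body_from_har(har_path, entry_index):
    """Return the response body stored for one entry of a HAR file."""
    # utf-8-sig also accepts files that start with a byte order mark.
    with open(har_path, encoding="utf-8-sig") as har_file:
        har = json.load(har_file)

    content = har["log"]["entries"][entry_index]["response"]["content"]
    text = content.get("text", "")

    # Some bodies are stored base64-encoded and flagged via "encoding".
    if content.get("encoding") == "base64":
        return base64.b64decode(text).decode("utf-8", errors="replace")
    return text
```

For my file this would be called as `response_body_from_har("saved.har", 2)`. This only works on a HAR that was exported by hand, so it does not by itself automate the crawl.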

In this case, is there any way I can crawl that source?

Due to security policy, the page cannot be accessed externally, so you won't be able to open the pages yourself.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow