Why is the Python Requests Module not returning links?
So I created a Python web scraper for my college capstone project that crawled the web and followed links chosen at random from each page. I used Python's requests module to return links from a GET request. I had it working flawlessly, along with a graphing program that showed it working in real time. Then I fired it up to show my professor, and now .links returns an empty dictionary for every single website.
Originally I had added a skip for any site that returned no links, but now all of them come back empty. I've reinstalled Python, reinstalled the requests module, and tried feeding the program websites manually, and I cannot find a reason for the change.
For reference, I have been using Portswigger.net as a baseline to test whether .links returns anything. It worked before, and now it does not.
Here is the GET request sample:

import requests

Url = "https://portswigger.net"

def GetRequest(url):
    # Use the Response as a context manager so the connection is released
    with requests.get(url=url) as response:
        try:
            links = response.links
            if links:
                return links
            else:
                return False
        except Exception as error:
            return error

print(GetRequest(Url))
UPDATE: Of the 200 sites I tested this morning, the only one to return links was kite.com. It returned the links no problem, and my program was able to follow them and collect the data. Literally a week ago the whole program would run fine and return page links from almost every single website.
Solution 1:[1]
requests.Response.links doesn't work like that [1]. It parses the Link header of the HTTP response, not the link elements (<a href="...">) in the response body.
What you want is to extract link elements from the response body, so I would recommend an HTML parser such as lxml or beautifulsoup.
Seeing as this is fairly common and straightforward, and this is a school project, I'll leave that task up to the reader.
[1] - https://docs.python-requests.org/en/latest/api/#requests.Response.links
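To make the distinction concrete, here is a minimal sketch (not part of the original answer) of where response.links does return data. The GitHub REST API is one public service that sends a Link header on paginated responses; the octocat account and per_page value are only illustrative, and the output assumes that account still has more than one public repo:

import requests

# Paginated GitHub API responses carry a Link header, which requests
# parses into a dict keyed by the rel attribute, e.g.
# {'next': {'url': '...', 'rel': 'next'}, 'last': {...}}.
response = requests.get(
    "https://api.github.com/users/octocat/repos",
    params={"per_page": 1},
)
print(response.links)

# A plain HTML page such as https://portswigger.net sends no Link header,
# so the same attribute is just an empty dict there:
print(requests.get("https://portswigger.net").links)  # {}

That empty dict on ordinary HTML pages matches exactly the behaviour described in the question.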
Solution 2:[2]
Parsing links with beautifulsoup4 is a possible solution:
import requests
from bs4 import BeautifulSoup

def get_links(url: str) -> list[str]:
    with requests.get(url) as response:
        soup = BeautifulSoup(response.text, features='html.parser')
        links = []
        for link in soup.find_all('a'):
            target = link.get('href')
            # link.get('href') returns None for anchors without an href
            # attribute, so guard before calling startswith
            if target and target.startswith('http'):
                links.append(target)
        return links

links = get_links('https://portswigger.com')
print(*links, sep='\n')
# https://forum.portswigger.net/
# https://portswigger.net/web-security
# https://portswigger.net/research
# https://portswigger.net/daily-swig
# ...
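One caveat with get_links above: the startswith('http') check silently drops relative links such as /web-security, which a crawler like the one in the question would normally want to follow. A small variant (a sketch, not part of the original answer; get_all_links is a hypothetical name) resolves them with urllib.parse.urljoin:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_all_links(url: str) -> list[str]:
    # Same approach as get_links above, except relative hrefs are
    # resolved against the final response URL instead of being dropped.
    with requests.get(url) as response:
        soup = BeautifulSoup(response.text, features='html.parser')
        links = []
        for anchor in soup.find_all('a'):
            target = anchor.get('href')
            if target:  # skip <a> tags that have no href attribute
                links.append(urljoin(response.url, target))
        return links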
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | plunker
Solution 2 | Stefan B