'Some websites dont fully load/render in selenium headless mode
So I have a problem that I have been noticing with selenium when I run it headless where some pages don't totally load/render some elements. I don't exactly know what's happening not to load 100%; maybe JS
not running?
My code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from decouple import config
from time import sleep
DEBUG = config('DEBUG')
class DiscordME(object):
def __init__(self):
self.LINUX = config('LINUX', cast=bool)
self.DRIVER_VERSION = config('DRIVER_VERSION')
self.HEADLESS = True
options = Options()
options.add_argument('--no-sandbox')
options.add_argument('--disable-gpu')
options.add_argument('--ignore-certificate-errors')
options.add_argument('--disable-extensions')
options.add_argument('--disable-dev-shm-usage')
if self.HEADLESS:
options.add_argument('--headless')
options.add_argument('--window-size=1920,1200')
if self.LINUX:
self.browser = webdriver.Chrome(executable_path=f'./drivers/chromedriver-{self.DRIVER_VERSION}', options=options)
else:
self.browser = webdriver.Chrome(executable_path=f'.\drivers\chromedriver-{self.DRIVER_VERSION}.exe', options=options)
def get_website(self):
self.browser.get('https://discord.me/login')
WebDriverWait(self.browser, 10).until(
EC.url_changes('https://discord.me/login')
)
print(self.browser.current_url)
print(self.browser.page_source)
#print(self.browser.find_element_by_xpath('//*[@id="app-mount"]/div[2]/div/div[2]/div/div/form/div/div/div[1]/div[3]/div[1]/div/div[2]/input'))
DiscordME().get_website()
In this script, it doesn't load the login inputs when it accesses the discord API login page.
As I can see in the page_source
I noticed that the page is not being mounted so that could be the problem.
Solution 1:[1]
from selenium import webdriver
from time import sleep
options = webdriver.ChromeOptions()
options.add_argument("--window-size=1920,1080")
options.add_argument("--headless")
options.add_argument("--disable-gpu")
options.add_argument(
"user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36")
browser = webdriver.Chrome(options=options)
some websites uses user-agent to detect whether the browser is in headless mode or not as headless browser uses a different user-agent than normal browser. So explicitly set user agent.
Solution 2:[2]
Another thing to to consider if you are having trouble loading a website with selenium is the processing power.
I was using a Micro AWS instance
with a single CPU which worked for many websites, but when I came to a more complex one it kept intermittently getting 0 elements
when conducting a search like find_elements_by_xpath('//a[@href]')
while sometimes it would work successfully and find the hyperlinks. I upgraded the instance to one with more CPUs (4, but 2 would probably have been sufficient) and that allowed me to fully load the site and scrape the elements.
I would definitely try the other two solutions posted here first (chrome options or firefox browser), but processing power could be the problem as well.
Solution 3:[3]
I Just would like to share my experience on this as solving the issue consumed much of my time trying many options and settings for Chrome webdriver.
The user-agent
setting solved the problem for some websites I scraped. but, for some other websites the only solution worked with me was to use FireFox webdriver instead of Chrome as per following :
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
fireFoxOptions = Options()
fireFoxOptions.add_argument("--headless")
fireFoxOptions.add_argument("--window-size=1920,1080")
fireFoxOptions.add_argument('--start-maximized')
fireFoxOptions.add_argument('--disable-gpu')
fireFoxOptions.add_argument('--no-sandbox')
driver = webdriver.Firefox(options=fireFoxOptions,
executable_path=r'C:\[your path to firefox webdriver exe file]\geckodriver.exe')
driver.get('https://discord.me/login')
Use the link here to download latest geckodriver for FireFox, and make sure FireFox browser is already installed in you machine.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | PDHide |
Solution 2 | |
Solution 3 | Rola |