Problem scraping Bet365 with headless mode
A few days ago I started developing a bot to capture data/results from virtual sports (specifically football) on Bet365 (note: I know this is not allowed by the site's terms of use, but my purpose is just personal study).
Techniques and alternatives for web scraping are not hard to find on the internet. The limitation (and I only discovered this recently) is the security of the site from which you intend to obtain the data. Getting straight to the point, I developed the following script/algorithm using python/selenium:
- Access the URL: https://www.game-365.com/#/AVR/B146/R%5E1/;
- Click on one of the championship tabs (Euro Cup, Premiership, Superleague, World Cup);
- Click on the "Results" tab below;
- Read the HTML and extract the information from the two results that appear;
- Repeat steps 2 to 4 for the other tabs.
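The extraction step of the loop above can be sketched with the standard library alone. A minimal sketch, assuming a hypothetical RESULT_CLASS marker class — the real class names on the Bet365 page have to be read off the developer tools and may change, so this is only an illustration of the approach, not the markup the site actually uses:

```python
from html.parser import HTMLParser

# "vr-Result" is a placeholder class name; the real Bet365 markup must be
# inspected in the browser's developer tools and this constant adjusted.
RESULT_CLASS = "vr-Result"

class ResultExtractor(HTMLParser):
    """Collect the visible text of every element carrying RESULT_CLASS."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # > 0 while inside a result element
        self._parts = []     # text fragments of the current result
        self.results = []    # finished result strings

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1             # nested tag inside a result
        elif RESULT_CLASS in classes:
            self._depth = 1              # entering a new result element
            self._parts = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:         # leaving the result element
                self.results.append(" ".join(self._parts))

    def handle_data(self, data):
        if self._depth and data.strip():
            self._parts.append(data.strip())

def extract_results(html: str) -> list[str]:
    parser = ResultExtractor()
    parser.feed(html)
    return parser.results
```

With selenium, `driver.page_source` after clicking each tab would be fed to `extract_results`.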
Very simple, and it already works. But I wouldn't be here if everything were fine. If I run the application using webdriver.Chrome without passing the --headless argument, the information is retrieved successfully. I run the scan every 3 minutes and can verify that the results come in correctly as the site updates.
However, it is essential that I can run this script in headless mode, because the goal is not to leave my personal computer on 24 hours a day, but to deploy the application to a server, which will have no graphical interface.
With that in mind, I proceeded with tests using the --headless argument, and what I notice is that the page content is no longer updated. I can leave the script running for hours and the games obtained from the "Results" tab are always the same, purely because I used headless mode.
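One frequently cited difference is that headless Chrome identifies itself as "HeadlessChrome" in the User-Agent header and defaults to an 800x600 window, both of which anti-bot layers can key on. Below is a sketch of the flag set I would try, assuming a recent Chrome that supports the new headless mode; the user-agent string is just an example value, and none of this is a guaranteed fix against Bet365:

```python
# Sketch: Chrome flags that reduce the most obvious headless fingerprints.
# The user-agent below is an example desktop UA, not a magic value.
DESKTOP_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/120.0.0.0 Safari/537.36")

def build_chrome_args(headless: bool = True) -> list[str]:
    args = [
        "--disable-blink-features=AutomationControlled",  # hide the webdriver flag
        f"--user-agent={DESKTOP_UA}",                     # mask "HeadlessChrome"
        "--window-size=1920,1080",                        # headless default is 800x600
    ]
    if headless:
        # The "new" headless mode shares the real browser's code path,
        # so it behaves closer to a visible Chrome than the old one.
        args.append("--headless=new")
    return args
```

With selenium these would be applied via `options.add_argument(...)` for each entry before constructing `webdriver.Chrome(options=options)`.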
Googling around, I found the undetected_chromedriver alternative. Unfortunately, it didn't fix the problem either.
I don't have much else to add. This question is more about how the Bet365 site works than about the use of selenium itself, so I know the answers are limited to a select group of people interested in this subject.
Below I will leave the link to the repository with the project code, as well as some of my other attempts:
Repository link: https://gitlab.com/noleto-web-scraping/bet365_scrap.git
Solution via API
Monitoring the developer console, I could see that when clicking on the Results tab, the Bet365 website makes the following request: https://www.bet365.com/SportsBook.API/web?lid=33&zid=0&pd=%23AVA%23B146%23C20700663%23R%5E1%23&cid=28&cgid=1&ctid=28
The result of this request is text in a very particular format, but with a little effort you can extract the same information that fills the screen. By copying the request as cURL and importing it into Postman, it is possible to obtain the information. In addition to the query parameters, the request carries a set of headers that I imagine control its validation/security, including the much-discussed X-Net-Sync-Term.
- Copying request as cURL: https://i.imgur.com/VZui1no.png
- Importing at Postman: https://i.imgur.com/Dqk9PsE.png
After waiting a few minutes, I do the same test again: click on the "Results" tab, check the developer panel, copy the request as cURL and replay it in Postman. Now the most curious thing happens. While on the site the result appears updated, in Postman (even with all headers imported, including a different X-Net-Sync-Term value) the response is never updated.
There is something I haven't been able to identify that dictates to the server which data to return: the most up-to-date, or an "IP-based cache" (my guess).
Solution via Puppeteer
Researching web scraping, I decided to change my approach: instead of python/selenium, I used node/puppeteer to obtain the same information by running the same algorithm mentioned above.
Unlike selenium, which at least displays updated information when not in headless mode, with puppeteer the captured information is always the same, regardless of headless mode.
As with selenium, I went after some solutions and found the puppeteer-extra-plugin-stealth plugin. Also in vain.
Repository for the script built with node/puppeteer: https://gitlab.com/noleto-web-scraping/bet365_puppeteer
Conclusion
I've been researching for days and only find old discussions from last year. I found out that Bet365 is a more protected site than I imagined, but I know there are alternatives to obtain this information, because there are APIs, bots, etc. being sold that use it.
Besides, however long an answer may take, I'm here because there is not much else to resort to.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow