twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed
I am getting this error when I run a crawl process multiple times. I am using Scrapy 2.6. This is my code:
from scrapy.crawler import CrawlerProcess
from football.spiders.laliga import LaligaSpider
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(settings=get_project_settings())
for i in range(1, 29):
    process.crawl(LaligaSpider, **{'week': i})
process.start()
Thanks in advance.
Solution 1:[1]
This solution avoids CrawlerProcess entirely, as suggested in the docs: https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
There’s another Scrapy utility that provides more control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper that encapsulates some simple helpers to run multiple crawlers, but it won’t start or interfere with existing reactors in any way.
It’s recommended you use CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run Scrapy in the same reactor.
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging
from football.spiders.laliga import LaligaSpider
# Enable logging for CrawlerRunner
configure_logging()
runner = CrawlerRunner(settings=get_project_settings())
for i in range(1, 29):
    runner.crawl(LaligaSpider, week=i)
deferred = runner.join()
deferred.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
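Note that this loop schedules all 28 crawls at once, so they run concurrently inside the same reactor. If the weeks need to run one after another, the same docs page also shows how to chain crawls sequentially with Twisted's inlineCallbacks; here is a minimal sketch adapted to the question's spider (the week parameter comes from the question, the rest follows the docs example):

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

from football.spiders.laliga import LaligaSpider

configure_logging()
runner = CrawlerRunner(settings=get_project_settings())

@defer.inlineCallbacks
def crawl_sequentially():
    # Each yield waits for the previous crawl to finish before starting the next.
    for i in range(1, 29):
        yield runner.crawl(LaligaSpider, week=i)
    reactor.stop()

crawl_sequentially()
reactor.run()  # blocks here until the last crawl stops the reactor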
Solution 2:[2]
This worked for me; I put it before creating the CrawlerProcess:
import sys

if "twisted.internet.reactor" in sys.modules:
    del sys.modules["twisted.internet.reactor"]
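For context, a minimal sketch of where that snippet would sit in the question's script (the imports and the loop are taken from the question; the comment is my reading of why it helps):

import sys

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from football.spiders.laliga import LaligaSpider

# Drop the already-imported reactor module so CrawlerProcess can install
# its own reactor instead of raising ReactorAlreadyInstalledError.
if "twisted.internet.reactor" in sys.modules:
    del sys.modules["twisted.internet.reactor"]

process = CrawlerProcess(settings=get_project_settings())
for i in range(1, 29):
    process.crawl(LaligaSpider, week=i)
process.start()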
Solution 3:[3]
I've just run into this issue as well. The docs at https://docs.scrapy.org/en/latest/topics/practices.html appear to be incorrect in stating that CrawlerProcess can run multiple crawlers, because each new crawler attempts to install a new reactor instance when you give it a spider. I was able to get my code working by switching to CrawlerRunner, as also detailed on the same page.
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...
configure_logging()
settings = get_project_settings()  # settings not required if running
runner = CrawlerRunner(settings)   # from a script; defaults are provided
runner.crawl(MySpider1) # your loop would go here
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
Solution 4:[4]
Delete the already-installed reactor before process.start().

Solution 2 ("For me this worked, I put it before the CrawlerProcess") suggested putting the sys.modules snippet before creating the CrawlerProcess. Mine only worked when I put it right before process.start():
import sys

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("articles")  # crawl by spider name

if "twisted.internet.reactor" in sys.modules:
    del sys.modules["twisted.internet.reactor"]

process.start()
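As far as I can tell, Twisted raises ReactorAlreadyInstalledError exactly when twisted.internet.reactor is already present in sys.modules, so deleting that entry clears the check and lets Scrapy install its preferred reactor, wherever the snippet is placed before process.start().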
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Mersad |
| Solution 3 | z3r0fox |
| Solution 4 | Atehe |