Why does scraping a Persian website with a non-English URL generate errors?
I am attempting to scrape a Persian website with the following code:
import urlparse, urllib
parts = urlparse.urlsplit(u'http://fa.wikipedia.org/wiki/صفحهٔ_اصلی')
parts = parts._replace(path=urllib.quote(parts.path.encode('utf8')))
encoded_url = parts.geturl().encode('ascii')
The expected result is:
'https://fa.wikipedia.org/wiki/%D8%B5%D9%81%D8%AD%D9%87%D9%94_%D8%A7%D8%B5%D9%84%DB%8C'
I get this error message in the prompt when I run my crawler:
ModuleNotFoundError: No module named urlparse
And in VS Code there are three underlined words. When I click on them, the following error messages are displayed:
- Unable to import 'scrapy'
- Unable to import 'urlparse'
- Module 'urllib' has no quote member
What is wrong with my code?
Solution 1:[1]
import urllib.parse
parts = urllib.parse.urlsplit(u'http://fa.wikipedia.org/wiki/صفحهٔ_اصلی')
parts = parts._replace(path=urllib.parse.quote(parts.path.encode('utf8')))
encoded_url = parts.geturl()
print(encoded_url)
# http://fa.wikipedia.org/wiki/%D8%B5%D9%81%D8%AD%D9%87%D9%94_%D8%A7%D8%B5%D9%84%DB%8C
This code runs in a Python 3.x environment, where the urlparse module was replaced by urllib.parse.
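As an illustration (the helper name below is not part of the original answer), the same approach can be wrapped in a small function, and `unquote` can be used to confirm the percent-encoding round-trips back to the original Persian text:

```python
from urllib.parse import urlsplit, quote, unquote

def encode_url(url: str) -> str:
    # Percent-encode only the path component; quote() encodes
    # non-ASCII text as UTF-8 and leaves '/' intact by default.
    parts = urlsplit(url)
    return parts._replace(path=quote(parts.path)).geturl()

encoded = encode_url('https://fa.wikipedia.org/wiki/صفحهٔ_اصلی')
print(encoded)
# https://fa.wikipedia.org/wiki/%D8%B5%D9%81%D8%AD%D9%87%D9%94_%D8%A7%D8%B5%D9%84%DB%8C

# unquote() reverses the encoding, so the round trip is lossless.
assert unquote(encoded) == 'https://fa.wikipedia.org/wiki/صفحهٔ_اصلی'
```

Note that in Python 3, `quote()` accepts a `str` directly, so the explicit `.encode('utf8')` from the Python 2 version is no longer needed.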
Solution 2:[2]
The error messages indicate that these modules are not available in your environment. Go to each project's site and follow its installation instructions.
1 Note that urlparse was renamed: it is now urllib.parse, not urlparse.
2 Scrapy
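For example, Scrapy can typically be installed with pip (this assumes a working Python 3 environment on your PATH; use the interpreter your crawler actually runs under):

```shell
# Install Scrapy into the current Python environment
pip install scrapy

# Verify both the Scrapy install and the renamed stdlib module
python -c "import scrapy, urllib.parse; print(scrapy.__version__)"
```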
Solution 3:[3]
You only need to add this line to your settings.py file:
FEED_EXPORT_ENCODING = 'UTF-8'
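For context, a minimal settings.py fragment might look like this (the BOT_NAME value is illustrative, not from the original question):

```python
# settings.py — Scrapy project settings (config fragment)
BOT_NAME = "mycrawler"  # illustrative project name

# Without this setting, Scrapy's JSON feed exports escape non-ASCII
# characters (e.g. \u0635...) instead of writing readable UTF-8 text,
# which matters for Persian page content.
FEED_EXPORT_ENCODING = "UTF-8"
```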
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | SahilDesai |
| Solution 2 | Gabriel Domene |
| Solution 3 | Hassan Ebrahimi |