Why does feedparser exit without any error message when pulling an RSS channel in Python 3?
I am using feedparser 6.0.2 to parse some RSS resources in Python 3.10. When I use feedparser to fetch the response on CentOS 7.x, the process just exits without any exception message. Why does this happen? This is what my Python code looks like:
```python
import ast
import socket
import ssl
from http.client import RemoteDisconnected
from typing import Any
from urllib.error import URLError

import feedparser
import requests

def feeder_parse(self, source: Any, level: str, task_id: str):
    try:
        logger.info(str(level) + " level feeder parse source:" + source.sub_url + ",task_id:" + task_id)
        if hasattr(ssl, '_create_unverified_context'):
            ssl._create_default_https_context = ssl._create_unverified_context
        #
        # When fetching the RSS subscription content we send the ETag and
        # Last-Modified info, so the server can tell us if the source has
        # not been modified. That saves network traffic and makes updating
        # a massive number of RSS sources feasible.
        #
        logger.info(str(level) + " level set ssl:" + source.etag + ",last modified:" + source.last_modified + ",sub url:" + source.sub_url)
        feed = feedparser.parse(source.sub_url,
                                etag=source.etag if source.etag is None else ast.literal_eval(source.etag),
                                modified=source.last_modified)
        logger.info(str(level) + " level get feeder:")
        if not hasattr(feed, 'status'):
            logger.error("do not contain status:" + source.sub_url)
            return
        if feed.status == 522:
            RssSource.unsubscribe(source, -1)
            return
        if feed.status == 403:
            RssSource.unsubscribe(source, -2)
            return
        if feed.status == 304:
            self.sub_source_update_dynamic_interval(source, 5)
            return
        etag = ''
        last_modified = ''
        if hasattr(feed, 'etag'):
            etag = feed.etag
        if hasattr(feed, 'updated'):
            last_modified = feed.updated
        if feed.entries is None or len(feed.entries) == 0:
            logger.warning(str(level) + " level get null entry:" + source.sub_url + ",task_id:" + task_id)
            return
        for entry in feed.entries:
            logger.info(str(level) + " level get entry:" + source.sub_url + ",task_id:" + task_id)
            source.etag = etag
            source.last_modified = last_modified
            RssParser.parse_single(entry, source, level)
        self.sub_source_meta_compare(source, feed)
        self.parse_fav_icon(source)
    except requests.ReadTimeout:
        logger.error(str(level) + " level read timeout:" + source.sub_url + ",task_id:" + task_id)
        return
    except socket.timeout:
        logger.error(str(level) + " level socket timeout:" + source.sub_url + ",task_id:" + task_id)
        return
    except RemoteDisconnected:
        logger.error(str(level) + " level remote disconnected:" + source.sub_url + ",task_id:" + task_id)
        return
    except URLError:
        logger.error(str(level) + " level read url error:" + source.sub_url + ",task_id:" + task_id)
        return
    except Exception:
        # Note: logger.error("msg", e) with no %s placeholder in the message
        # makes the logging machinery itself fail when formatting the record;
        # logger.exception records the message plus the traceback correctly.
        logger.exception("feed parser error, url:" + source.sub_url)
```
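The ETag/Last-Modified handshake that the comments in the code describe can be sketched on its own. The helper below is hypothetical, not part of the project; feedparser builds equivalent headers internally from its `etag=` and `modified=` arguments:

```python
# Hypothetical sketch: build the conditional-request headers that let a
# server answer "304 Not Modified" instead of resending the whole feed.
def build_conditional_headers(etag=None, last_modified=None):
    headers = {}
    if etag:
        # The server compares this against the feed's current ETag.
        headers["If-None-Match"] = etag
    if last_modified:
        # RFC 1123 date string, e.g. "Mon, 04 Apr 2022 23:09:39 GMT".
        headers["If-Modified-Since"] = last_modified
    return headers

print(build_conditional_headers(etag='"d5df98785beddd62763cc8ea64ac659f"'))
# → {'If-None-Match': '"d5df98785beddd62763cc8ea64ac659f"'}
```

When the server answers 304, the body is empty, which is why the code above treats `feed.status == 304` as "nothing new, just reschedule".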
Part of the log output looks like this:
```
2022-05-13 18:21:00,599 - RssParser.py:25 - cruise-task-executor - 2 level feeder parse source:https://incidentdatabase.ai/rss.xml,task_id:eed2adee-9f92-4d5c-8d59-fe672371ab9f
2022-05-13 18:21:00,599 - RssParser.py:43 - cruise-task-executor - 2 level set ssl:ecc8bd172ed1d9a9f17d999074b15c30-ssl-df,last modified:,sub url:https://incidentdatabase.ai/rss.xml
2022-05-13 18:21:00,638 - RssParser.py:25 - cruise-task-executor - 2 level feeder parse source:https://blog.crunchydata.com/blog/rss.xml,task_id:ec57d2ba-8a38-45c0-9b8b-46fc77e036ee
2022-05-13 18:21:00,638 - RssParser.py:43 - cruise-task-executor - 2 level set ssl:,last modified:,sub url:https://blog.crunchydata.com/blog/rss.xml
2022-05-13 18:21:00,727 - RssParser.py:25 - cruise-task-executor - 2 level feeder parse source:http://www.jenlawrence.org/feed,task_id:8ce67905-9a53-463c-aec7-f8c5ca4012e4
2022-05-13 18:21:00,728 - RssParser.py:43 - cruise-task-executor - 2 level set ssl:b627eec4f37d61083b462f430e8a46bf,last modified:Mon, 04 Apr 2022 23:09:39 GMT,sub url:http://www.jenlawrence.org/feed
2022-05-13 18:21:00,778 - RssParser.py:25 - cruise-task-executor - 2 level feeder parse source:https://www.cicoding.cn/atom.xml,task_id:43007f6d-483c-46f0-9d0a-43742cc317af
2022-05-13 18:21:00,778 - RssParser.py:43 - cruise-task-executor - 2 level set ssl:,last modified:Thu, 11 Mar 2021 05:53:55 GMT,sub url:https://www.cicoding.cn/atom.xml
2022-05-13 18:21:00,860 - RssParser.py:25 - cruise-task-executor - 2 level feeder parse source:https://tlanyan.me/feed,task_id:418b70a4-a7a0-468a-a0a7-e29ef264fff7
2022-05-13 18:21:00,860 - RssParser.py:43 - cruise-task-executor - 2 level set ssl:"d5df98785beddd62763cc8ea64ac659f",last modified:,sub url:https://tlanyan.me/feed
2022-05-13 18:21:13,948 - RssParser.py:87 - cruise-task-executor - 2 level read url error:https://tlanyan.me/feed,task_id:418b70a4-a7a0-468a-a0a7-e29ef264fff7
2022-05-13 18:24:00,155 - RssParser.py:25 - cruise-task-executor - 3 level feeder parse source:https://www.cnbeta.com/backend.php,task_id:87c75c99-89e0-4260-a54d-7c98575ad5dc
2022-05-13 18:24:00,155 - RssParser.py:43 - cruise-task-executor - 3 level set ssl:,last modified:,sub url:https://www.cnbeta.com/backend.php
```
You can see from the log that it never printed the line `logger.info(str(level) + " level get feeder:")`. It seems the task stopped without an error while scraping the RSS feed. Why does this happen, and what should I do to fix it? I have tried printing the variables and calling the function from a unit test, where it works fine. After reading the function again and again I still cannot figure out where it goes wrong. Can someone give me a hand finding the problem? The scheduled job runs as a Celery task.
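One general way to find out where a process dies with no Python traceback (a crash in a C extension, an OOM kill by the OS, or a hang) is the standard library's faulthandler module. A minimal sketch, assuming it is enabled early in the worker process:

```python
import faulthandler
import sys

# Dump a Python traceback to stderr on hard crashes (SIGSEGV, SIGFPE,
# SIGABRT, SIGBUS) that would otherwise kill the process silently.
faulthandler.enable(file=sys.stderr)

# If the worker hangs instead of crashing, dump every thread's stack
# after 30 seconds without exiting the process.
faulthandler.dump_traceback_later(30, exit=False)
```

If the worker's stderr is redirected, the dump ends up in the worker log file rather than the terminal, so check there; for an OOM kill, `dmesg` on the CentOS host would show the killed process.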
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow