Help with an error when crawling with Scrapy

I'm using Scrapy to crawl the 阳光政务 (Sunshine Government Affairs) site. The data comes out, but the log shows errors. How do I fix these two errors? The output is below, with the scraped item at the very end:

[scrapy.robotstxt] WARNING: Failure while parsing robots.txt. File either contains garbage or is in an encoding other than UTF-8, treating it as an empty file.
Traceback (most recent call last):
  File "/home/python/.virtualenvs/webspider/lib/python3.5/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/python/.virtualenvs/webspider/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
  File "/home/python/.virtualenvs/webspider/lib/python3.5/site-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://www.sun0769.com/error/404.htm>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/python/.virtualenvs/webspider/lib/python3.5/site-packages/scrapy/robotstxt.py", line 15, in decode_robotstxt
    robotstxt_body = robotstxt_body.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 327: invalid start byte

{'content': '东莞南城周溪东径北街 6 号天台严重违建,现在还出租了,没有跟进后续情况',
 'content_img': [],
 'href': 'http://wz.sun0769.com/html/question/201911/436799.shtml',
 'publish_date': '2019-11-25 11:58:44',
 'title': '东莞南城周溪东径北街 6 号天台严重违建现在还出租了,相关部门没有跟进后续情况'}


3 replies • 2019-11-25 14:13:02 +08:00

zdnyp

2019-11-25 14:04:09 +08:00

Failure while parsing robots.txt. File either contains garbage or is in an encoding other than UTF-8, treating it as an empty file.
You can set ROBOTSTXT_OBEY to False in your project's settings.py.
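A minimal sketch of that change, assuming a standard Scrapy project layout in which settings.py sits inside the project package. ROBOTSTXT_OBEY is the real Scrapy setting name; nothing else about the poster's project is known. Note that with this setting off, Scrapy no longer fetches or honours robots.txt at all.

```python
# settings.py (fragment)

# When False, Scrapy's robots.txt middleware is disabled, so robots.txt is
# never downloaded or parsed and the parsing warning cannot occur.
ROBOTSTXT_OBEY = False
```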

yifengs

2019-11-25 14:08:44 +08:00

Thanks, the error is gone. Was my Scrapy not installed properly? Why did parsing robots.txt fail?

yifengs

2019-11-25 14:13:02 +08:00

Oh, I see it now: the robots protocol doesn't allow it. Thanks!
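On the "why did parsing fail" question, the traceback in the original post shows the robots.txt request ending at http://www.sun0769.com/error/404.htm, and the UnicodeDecodeError says that body was not valid UTF-8. The sketch below reproduces that kind of failure in plain Python. It is only an illustration: `decode_like_warning` is a hypothetical helper, not Scrapy's code, and GBK is an assumed encoding, since the thread never states what the site actually serves.

```python
def decode_like_warning(body: bytes) -> str:
    """Hypothetical helper: treat a body that is not UTF-8 as an empty file,
    which is the fallback the Scrapy warning describes."""
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        return ""


# Assumption: a Chinese page encoded as GBK rather than UTF-8.
gbk_body = "东莞阳光政务".encode("gbk")

# A small, well-formed robots.txt body for comparison.
utf8_body = "User-agent: *\nDisallow: /private/".encode("utf-8")

print(repr(decode_like_warning(gbk_body)))   # undecodable, so treated as empty
print(repr(decode_like_warning(utf8_body)))  # decodes normally
```

So the warning points to an encoding mismatch in the response body rather than to a broken Scrapy installation.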