Python beginner's crawler question, please help! - V2EX

This topic was created 2458 days ago; the information in it may have since evolved or changed.

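I just wanted to scrape a few images. At first, requesting the image src for download returned a 403 response. Then, after some fiddling, the UnicodeEncodeError below appeared instead. I searched for ages without finding a fix. Basically the whole main body of the code is here (it probably isn't much use anyway). Hoping an expert can explain!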

import os
import urllib.request

from lxml import etree

def handle_request(self, url):
	"""Build the request object and return it."""
	headers = {
		'User-Agent':' Mozilla / 5.0 （ Windows; U; Windows NT 6.1; en-us ） AppleWebKit / 534.50 （ KHTML，类似 Gecko ）版本 / 5.1 Safari / 534.50',
		}
	return urllib.request.Request(url=url, headers=headers)

def send_request(self, request):
	"""Send the request and return the decoded content."""
	return urllib.request.urlopen(request).read().decode('gbk')

def down_picture(self, response):
	# Build a tree object from the content
	tree = etree.HTML(response)
	# XPath expressions for the data we want
	pic_href = tree.xpath('//div[@class="main"]/dl/dd/a/img/@src')  # image links
	pic_text = tree.xpath('//div[@class="main"]/dl/dd/a[@target="_blank"]/text()')  # image captions

	#pic_src://div[@class="main"]/dl/dd/a/img/@src
	#pic_text()://div[@class="main"]/dl/dd/a[@target="_blank"]/text()

	# try:
	for img in zip(pic_text, pic_href):
		# request = self.handle_request(img[1])  # build a request object for the image
		# response = self.send_request(request)  # send it and return the response: error 403
		dirname = './tupian'
		filename = img[0] + '.jpg'
		filepath = os.path.join(dirname, filename)
		if not os.path.exists(dirname):
			os.mkdir(dirname)
		# with open(filepath, 'wb') as fp:
			# fp.write(response.read())
		urllib.request.urlretrieve(img[1], filepath)  # download the image

	# except Exception as e:
		# print(e)

UnicodeEncodeError: 'latin-1' codec can't encode character '\uff08' in position 14: ordinal not in range(256)

If you have time, please also help with the HTTPError 403. I'm a beginner and couldn't find a way to fix that either.
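A note on the traceback: '\uff08' is the full-width left parenthesis '（', and Python's http.client encodes header values as latin-1 before sending them. The most likely culprit is therefore the User-Agent string in handle_request, which looks as if it went through a translator (full-width brackets, '类似' where 'like' would be, '版本' where 'Version' would be). The sketch below shows the check and a plain-ASCII replacement. The helper names is_latin1 and build_request are hypothetical, the bad string is shortened, and the ASCII User-Agent is only a reconstruction of what the original Safari string probably looked like.

```python
import urllib.request

# Shortened version of the posted User-Agent; '（' is U+FF08, a full-width bracket.
BAD_UA = "Mozilla/5.0 （Windows; U; Windows NT 6.1; en-us）"

# Assumed ASCII original of the same Safari 5.1 User-Agent.
GOOD_UA = (
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) "
    "AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"
)


def is_latin1(value):
    """Return True if the string can be encoded as latin-1, as HTTP headers must be."""
    try:
        value.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def build_request(url, user_agent=GOOD_UA):
    """Build a Request, refusing header values that would fail when sent."""
    if not is_latin1(user_agent):
        raise ValueError("User-Agent contains characters outside latin-1")
    return urllib.request.Request(url=url, headers={"User-Agent": user_agent})
```

With the ASCII string in place, the page request should get past header encoding. Whether the site then answers 200 or 403 is a separate question.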

6 replies • 2018-07-30 10:10:38 +08:00

1

matrix273

2018-07-29 00:40:42 +08:00 via Android

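Check whether what handle_request returns is of type b'' (bytes). If so, you probably need to decode() it first. It's been a while since I last tinkered with crawlers.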

2

Sylv

2018-07-29 01:09:49 +08:00

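A 403 generally means the site has detected that the access is illegitimate (for example, a crawler) and refused it. Your crawler isn't disguised well enough.

Which line exactly raises the UnicodeEncodeError?

A suggestion:
Life is short, I use requests.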

3

zhangpeter

2018-07-29 06:47:50 +08:00

Use the requests library on Python 3. It's convenient and raises fewer errors.

4

ericls

2018-07-29 09:39:33 +08:00 via iPhone

Everyone above has recommended requests, so I'll add a recommendation for requests-html.
The API is nice.

5

rwecho

2018-07-30 08:33:09 +08:00

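For the 403, check whether your request headers differ from what the browser sends. If they are the same, try slowing down the crawl; you may have been banned for going too fast.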
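As a rough illustration of both suggestions with the standard library: the posted code downloads images with urllib.request.urlretrieve, which takes a plain URL and so sends Python's default User-Agent rather than the custom one. The sketch below builds each image request with browser-like headers and pauses between downloads. The names BROWSER_HEADERS, build_image_request and download_all are hypothetical, and the idea that the site checks the Referer header is an assumption, not something the thread confirms.

```python
import time
import urllib.request

# Assumed browser-like headers; copy the real ones from the browser's dev tools.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) "
        "AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"
    ),
}


def fetch_bytes(request):
    """Send the request and return the raw response body."""
    with urllib.request.urlopen(request) as response:
        return response.read()


def build_image_request(img_url, page_url):
    """Build an image request that carries the same headers a browser would."""
    headers = dict(BROWSER_HEADERS)
    headers["Referer"] = page_url  # assumption: the site may check the referring page
    return urllib.request.Request(img_url, headers=headers)


def download_all(img_urls, page_url, fetch=fetch_bytes, delay=1.0, sleep=time.sleep):
    """Fetch every image in turn, sleeping `delay` seconds between requests."""
    results = []
    for index, img_url in enumerate(img_urls):
        if index > 0:
            sleep(delay)  # throttle so the crawl does not hammer the server
        results.append(fetch(build_image_request(img_url, page_url)))
    return results
```

The fetch and sleep parameters are injectable only so the loop can be exercised without touching the network; in normal use the defaults are enough.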

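A UnicodeEncodeError is generally a transcoding problem. What encoding does your terminal use, what encoding is the log saved in, and what encoding does the HTML come back in?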
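One way to answer those three questions in code, as a minimal sketch: print the terminal encoding, and read the page charset from the Content-Type header instead of hard-coding 'gbk'. The helpers charset_from_content_type and decode_body are hypothetical names, and the fallback encoding is an assumption; a page may also declare its charset only in a meta tag, which this sketch does not inspect.

```python
import sys
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset() or default


def decode_body(body, content_type, default="utf-8"):
    """Decode response bytes using the declared charset, replacing bad bytes."""
    charset = charset_from_content_type(content_type, default)
    return body.decode(charset, errors="replace")


# The encoding the terminal expects when text is printed.
print("terminal encoding:", sys.stdout.encoding)
```

With urllib, the header value is available as response.headers.get("Content-Type", "") on the object returned by urlopen.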

6

engineer9

2018-07-30 10:10:38 +08:00

Just remove .decode('gbk') and it'll work, bro.
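To spell out why dropping the decode can help: lxml's etree.HTML accepts raw bytes, so the page does not need decoding by hand, and image data must never be decoded as text at all. The sketch below shows that JPEG bytes are not valid GBK and writes the raw bytes to disk instead. looks_like_text and save_bytes are hypothetical helper names, and JPEG_MAGIC is just the first few bytes of a typical JPEG file used as stand-in data.

```python
import os

# First bytes of a typical JPEG file, standing in for a downloaded image.
JPEG_MAGIC = b"\xff\xd8\xff\xe0"


def looks_like_text(data, encoding="gbk"):
    """Return True if the bytes decode cleanly in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def save_bytes(data, dirname, filename):
    """Write raw bytes to dirname/filename, creating the directory if needed."""
    os.makedirs(dirname, exist_ok=True)
    filepath = os.path.join(dirname, filename)
    with open(filepath, "wb") as fp:  # binary mode: no text encoding involved
        fp.write(data)
    return filepath
```

In the posted code this would mean having send_request return urlopen(request).read() unchanged and passing the image bytes straight to a helper like save_bytes.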
