Error when parsing a web page with PyQuery - V2EX

This topic was created 4375 days ago; the information in it may have since evolved or changed.

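I'm using PyQuery to parse a site that contains Japanese text, and it fails with: ValueError: Unicode strings with encoding declaration are not supported.
From http://lxml.de/parsing.html I gathered that the document must not already be utf-8 before it is parsed.
Doesn't requests automatically convert every page it fetches to utf-8? This never caused problems before, so why does it break now? I tried converting to other encodings with decode/encode, but that didn't help either.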

7 replies

1

cloverfisher

OP

2013-03-14 05:11:41 +08:00

The page I'm fetching is cn.shindanmaker.com

2

timonwong

2013-03-14 08:13:06 +08:00

The problem is this line in the page:
<?xml version="1.0" encoding="UTF-8"?>

3

cloverfisher

OP

2013-03-14 13:14:44 +08:00

@timonwong Is that line really such a serious problem? Can lxml really not parse XML that is already utf-8? And how do I fix it?

4

for4

2013-03-14 13:29:12 +08:00

r = requests.get('http://cn.shindanmaker.com')
Use r.content, not r.text.
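
A minimal sketch of this suggestion, parsing with lxml directly (the library PyQuery is built on). The `parse_response` helper, the sample markup, and the `SimpleNamespace` stand-in for a requests response are assumptions made so the snippet runs offline; they are not taken from the real site.

```python
from types import SimpleNamespace

import lxml.html


def parse_response(resp):
    """Parse a requests-style response from its raw bytes (resp.content)."""
    # resp.content is bytes, so the parser works out the encoding itself.
    return lxml.html.document_fromstring(resp.content)


# Hypothetical sample page; it starts with the same XML declaration
# that the thread identifies as the trigger.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>'
    '<title>shindan demo</title></head>'
    '<body><p>診断メーカー</p></body></html>'
)

# Stand-in for requests.get(url): .content is bytes, .text is the decoded str.
fake_resp = SimpleNamespace(content=SAMPLE.encode("utf-8"), text=SAMPLE)

root = parse_response(fake_resp)
title = root.findtext(".//title")
print(title)
```

With the real library the call would look like `parse_response(requests.get('http://cn.shindanmaker.com'))`. The returned element can then be wrapped with `PyQuery(root)`, since PyQuery accepts lxml elements.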

5

timonwong

2013-03-14 14:14:24 +08:00

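@cloverfisher
Because the string is already of unicode type (it has already been decoded). When lxml finds the encoding declaration it would try to decode to unicode once more, which naturally fails. Parsers like this should always be given the raw byte string.

So use r.content, as @for4 suggested.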
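
The behaviour described here can be reproduced without any network access. This is a small sketch, assuming a made-up two-element document and a hypothetical `try_parse` helper; only `etree.fromstring` is lxml's own API.

```python
from lxml import etree

# Hypothetical minimal document carrying the same kind of declaration.
TEXT = '<?xml version="1.0" encoding="UTF-8"?><root><p>診断</p></root>'
RAW = TEXT.encode("utf-8")


def try_parse(data):
    """Return the text of <p>, or the error message if lxml refuses the input."""
    try:
        return etree.fromstring(data).findtext("p")
    except ValueError as exc:
        return "ValueError: %s" % exc


from_bytes = try_parse(RAW)   # bytes: lxml decodes using the declaration
from_text = try_parse(TEXT)   # str that still carries a declaration: rejected
print(from_bytes)
print(from_text)
```

If only the decoded text is at hand, encoding it back with `.encode("utf-8")` before parsing gives lxml the bytes it expects, provided the declaration really names UTF-8.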

6

cloverfisher

OP

2013-03-14 14:25:59 +08:00

@for4 Thanks~

7

cloverfisher

OP

2013-03-14 14:26:20 +08:00

@timonwong Thanks :)
