The statement I am using is:
python
page = requests.get( url , headers = self.header, timeout = 10 , verify = flag )
The variables have the following values:
python
url = 'http://www.sbacn.org'
flag = False
self.header = {
'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:40.0) Gecko/20100101 Firefox/40.0',
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language' : 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
'Accept-Encoding' : 'gzip, deflate',
}
The error is:
python
Traceback (most recent call last):
File "bing.py", line 237, in <module>
bing.titleGet(urls)
File "bing.py", line 195, in titleGet
page = self.dataRequest(url)
File "bing.py", line 86, in dataRequest
page = requests.get( url , headers = self.header, timeout = 10 , verify = flag )
File "/usr/lib/python2.7/site-packages/requests/api.py", line 69, in get
return request('get', url, params=params, **kwargs)
File "/usr/lib/python2.7/site-packages/requests/api.py", line 50, in request
response = session.request(method=method, url=url, **kwargs)
File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 468, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 608, in send
r.content
File "/usr/lib/python2.7/site-packages/requests/models.py", line 734, in content
self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
File "/usr/lib/python2.7/site-packages/requests/models.py", line 657, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "/usr/lib/python2.7/site-packages/requests/packages/urllib3/response.py", line 326, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/usr/lib/python2.7/site-packages/requests/packages/urllib3/response.py", line 282, in read
data = self._fp.read(amt)
File "/usr/lib64/python2.7/httplib.py", line 567, in read
s = self.fp.read(amt)
File "/usr/lib64/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
socket.error: [Errno 104] Connection reset by peer
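What puzzles me is that this runs fine on my Mac but fails on the Ubuntu server. Why?
Also, I am actually scraping the URLs from 10 pages of Bing search results. The error only appears when I visit them one after another; if I take this URL out and request it on its own, there is no problem. Why is that?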
The problem is solved, although the fix feels silly and I only stumbled on it by accident.
My function was originally written like this:
try:
    page = requests.get(url, headers=self.header, timeout=10, verify=flag)
except requests.exceptions.ConnectionError:
    print 'ConnectionError'
    if flag == True:
        flag = False
        count += 1
        continue
    if count > 1:
        return None
    else:
        count += 1
        continue
except requests.exceptions.ConnectTimeout:
    print 'ConnectTimeout'
    if count > 1:
        return None
    else:
        count += 1
        continue
Adding one more exception handler before except requests.exceptions.ConnectTimeout:
except requests.exceptions.Timeout:  # this is important
    print 'Timeout'
    return None
fixed it, although I still do not know why.
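One possible explanation, offered as a guess rather than a diagnosis: Python evaluates the expression in each except clause lazily and in order, so a handler that names a missing attribute only fails when an exception actually reaches it. The sketch below uses a stand-in namespace (hypothetical, not the real requests module) that has ConnectionError and Timeout but no ConnectTimeout, which is what the AttributeError reported further down the thread suggests about the server's requests. It is written for Python 3. If the exceptions being raised were timeouts, an earlier Timeout handler would catch them before the missing name is ever looked up.

```python
import types


# Stand-ins for exception classes in an older library version.
# These names are hypothetical and only illustrate the mechanism.
class FakeConnectionError(Exception):
    pass


class FakeTimeout(Exception):
    pass


# A namespace that, like the old module assumed here, has no ConnectTimeout.
old_exceptions = types.SimpleNamespace(
    ConnectionError=FakeConnectionError,
    Timeout=FakeTimeout,
)


def original_handlers(exc):
    """Handler order from the original function."""
    try:
        raise exc
    except old_exceptions.ConnectionError:
        return 'ConnectionError'
    except old_exceptions.ConnectTimeout:  # looked up only if reached
        return 'ConnectTimeout'


def patched_handlers(exc):
    """Same, with a Timeout handler placed before ConnectTimeout."""
    try:
        raise exc
    except old_exceptions.ConnectionError:
        return 'ConnectionError'
    except old_exceptions.Timeout:
        return 'Timeout'
    except old_exceptions.ConnectTimeout:
        return 'ConnectTimeout'


def outcome(handler, exc):
    """Run a handler chain and report what came out of it."""
    try:
        return handler(exc)
    except AttributeError:
        return 'AttributeError'


print(outcome(original_handlers, FakeTimeout()))
print(outcome(patched_handlers, FakeTimeout()))
```

Note that in this sketch an exception matching none of the earlier handlers (a raw socket error, for example) still reaches the missing name and ends in AttributeError, so the extra handler would hide the problem rather than remove it.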
1
dawncold 2015-11-10 09:51:14 +08:00
|
2
kslr 2015-11-10 10:17:00 +08:00
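A connection reset usually means one of two things: you are being blocked by the Great Firewall, or the remote side dropped you.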
|
3
fei051466 2015-11-10 10:31:28 +08:00
This looks like the Great Firewall to me.
|
4
leisurelylicht OP @dawncold Accessing it directly works fine. I am in Beijing too.
@kslr But if I make the call on its own like this, page = requests.get('http://www.sbacn.org/', headers = self.header, timeout = 10 , verify = flag ), there is no problem.
|
5
Sylv 2015-11-10 10:34:57 +08:00 via iPhone
Your request rate is too high, so the server rejected you. Lower the rate or use a proxy.
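Lowering the rate could look roughly like the sketch below: remember when each host was last requested and sleep before hitting the same host again too soon. The class name HostThrottle, the 2-second default, and the injectable clock and sleep parameters are all assumptions made for illustration (Python 3); none of them come from the original script.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the logic can be tested
        self._sleep = sleep
        self._last_hit = {}      # hostname -> time of the last request

    def wait(self, url):
        """Sleep if this URL's host was requested too recently.

        Returns the number of seconds slept.
        """
        host = urlparse(url).hostname
        now = self._clock()
        last = self._last_hit.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when the request will actually go out.
        self._last_hit[host] = now + delay
        return delay
```

Calling throttle.wait(url) right before each requests.get(url, ...) would leave requests to different hosts unaffected and only slow down repeated hits on the same one.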
|
6
leisurelylicht OP @fei051466 Probably not. Calling it directly like this returns the correct result:
bing = Bing_Search()
a = bing.dataRequest('http://www.sbacn.org/')
print a.title
Output:
http://www.sbacn.org/
<title> 上海市银行同业公会 </title>
[Finished in 1.2s]
|
7
leisurelylicht OP @Sylv But I am not hitting a single site repeatedly. I am visiting the search results one by one, so in theory each site is only requested once.
|
8
Sylv 2015-11-10 10:53:43 +08:00 via iPhone
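@leisurelylicht Could the search results contain several entries from the same site, so that you had already hit it a few times before the error? This error means the server rejected your request. Usually that is because the request rate exceeded a threshold, or because the disguise failed and you were rejected as a crawler.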
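One quick way to check this is to group the collected URLs by hostname and see which hosts appear more than once. A minimal Python 3 sketch; the function name is made up for illustration, and urls stands for whatever list the scraper collected:

```python
from collections import Counter
from urllib.parse import urlparse


def hosts_hit_more_than_once(urls):
    """Return {hostname: count} for hosts that appear in more than one URL."""
    counts = Counter(urlparse(u).hostname for u in urls)
    return {host: n for host, n in counts.items() if n > 1}
```

An empty result would mean every host in the list is requested only once, which would make a per-site rate limit a less likely cause.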
|
9
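leisurelylicht OP @Sylv I checked the search results, and this domain really does appear only once. If the disguise had failed, crawling this site on its own should be rejected as well, but it is not. Actually, in my current attempts the error I see more often is AttributeError: 'module' object has no attribute 'ConnectTimeout'.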
|
10
krizex 2015-11-10 11:38:21 +08:00
The requests version in Ubuntu's default package source is fairly old. Try upgrading requests first.
|
11
leisurelylicht OP @krizex Updated. With the latest version there is no problem.
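For what it is worth, the retry logic in the original function could also be written so that it does not depend on exception names that differ between requests versions. The sketch below is an assumption-laden illustration in Python 3, not the script's actual code: fetch_with_retries, its parameters, and the backoff values are all invented here, and the fetch callable is passed in so the helper itself does not import requests.

```python
import time


def fetch_with_retries(fetch, url, retry_on=(Exception,), attempts=3,
                       backoff=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on the given exception types.

    Waits backoff, 2*backoff, 4*backoff, ... seconds between attempts and
    returns None once all attempts have failed.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except retry_on:
            # Do not sleep after the final failed attempt.
            if attempt < attempts - 1:
                sleep(backoff * (2 ** attempt))
    return None
```

With requests, fetch could be a small function wrapping requests.get(url, headers=..., timeout=10), and retry_on could be requests.exceptions.RequestException, the base class of the exceptions requests defines itself.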
|