Python: extracting data from the Response returned by a POST

<table id="response" border="0" cellpadding="0" cellspacing="0">
<tr><td class="shorturl">http://test/shorturl</td><td class="longurl"><a href="http://test.long.url/" target="_blank">http://test.long.url/</a></td></tr>	</table>

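I read the two posts below and spent ages on this but still can't get it working, so I'm asking for help.

Thanks in advance, everyone~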

Python

response

post

extract

21 replies • 2016-04-10 22:22:19 +08:00

eoo

2016-04-10 08:28:45 +08:00 via Android

What's the address you're POSTing to?

haomni

2016-04-10 08:50:38 +08:00

@eoo Thanks for the reply. Pulling the URL out of what the POST returns shouldn't need the original POST address, should it...

virusdefender

2016-04-10 08:54:00 +08:00

# coding=utf-8
import re

html = """
<table id="response" border="0" cellpadding="0" cellspacing="0">
<tr><td class="shorturl">http://test/shorturl</td><td class="longurl"><a href="http://test.long.url/"
"""

# Non-greedy capture of everything inside the shorturl cell.
print(re.compile(r'<td class="shorturl">([\s\S]*?)</td>').findall(html)[0])

haomni

2016-04-10 09:00:24 +08:00

@virusdefender Thanks, but this one finds the shorturl.
After I changed it to longurl, the result is:
<a href="http://test.long.url/" target="_blank">http://test.long.url/</a>

Still not what I'm after...

uyhyygyug1234

2016-04-10 09:05:32 +08:00

uyhyygyug1234

2016-04-10 09:06:25 +08:00

This works, but it really is too ugly. You should use bs4, pyquery, or something like that.

haomni

2016-04-10 09:14:28 +08:00

@uyhyygyug1234 The result doesn't look quite right, though that may be because I didn't paste the full Response.

>>> print re.compile('href="(.*)"').findall(req.content)[0]
/screen.css

class="longurl" is unique in the whole Response. What I need is the link that the anchor inside it points to.
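
The /screen.css result is expected: href="(.*)" matches the first href anywhere in the page, not the one in the longurl cell, and the greedy .* can run past the closing quote. A minimal sketch that anchors the pattern on the class="longurl" cell instead. The variable page is a stand-in for req.content, the stylesheet line is made up to mimic a full page, and the pattern assumes the cell looks like the snippet in the question:

```python
import re

# Stand-in for req.content: an unrelated href comes first, as in a full page.
page = '''<link rel="stylesheet" href="/screen.css">
<table id="response" border="0" cellpadding="0" cellspacing="0">
<tr><td class="shorturl">http://test/shorturl</td><td class="longurl"><a href="http://test.long.url/" target="_blank">http://test.long.url/</a></td></tr> </table>'''

# Anchor on the longurl cell, then capture only up to the next double quote.
pattern = re.compile(r'<td class="longurl">\s*<a\s[^>]*?href="([^"]*)"')

match = pattern.search(page)
long_url = match.group(1) if match else None
print(long_url)
```

If req is a requests response and you are on Python 3, req.content is bytes, so feed req.text to a str pattern like this one.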

sh4n3

2016-04-10 09:31:05 +08:00

Just use a CSS selector like .longurl a and you're done.
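
One way to apply that selector from Python is bs4's select_one. This is only a sketch and assumes the beautifulsoup4 package is installed; it uses the built-in html.parser backend so nothing else is needed:

```python
from bs4 import BeautifulSoup

html = '''<table id="response" border="0" cellpadding="0" cellspacing="0">
<tr><td class="shorturl">http://test/shorturl</td><td class="longurl"><a href="http://test.long.url/" target="_blank">http://test.long.url/</a></td></tr> </table>'''

# "html.parser" is the stdlib backend, so lxml does not have to be installed.
soup = BeautifulSoup(html, "html.parser")

# ".longurl a" selects an <a> that is a descendant of an element with class "longurl".
link = soup.select_one(".longurl a")
long_url = link["href"] if link is not None else None
print(long_url)
```

select_one returns None when nothing matches, which is why the href lookup is guarded.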

ericls

2016-04-10 09:32:04 +08:00

Just use pyquery for this.

eoo

2016-04-10 09:56:07 +08:00

@haomni It's easy in PHP

<?php

$str='<table id="response" border="0" cellpadding="0" cellspacing="0">
<tr><td class="shorturl">http://test/shorturl</td><td class="longurl"><a href="http://test.long.url/" target="_blank">http://test.long.url/</a></td></tr> </table>';

$zz='#<td class="longurl"><a href="(.*?)" target="_blank">.*?</a></td>#';

preg_match($zz, $str, $matchs);

print_r($matchs);

haomni

2016-04-10 10:03:31 +08:00

@uyhyygyug1234
@ericls
I tried PyQuery, but I'm probably not using it right:
print doc1('class:contains("longurl")')

@eoo I'm not going to switch to PHP at this point; everything else is already written.

eoo

2016-04-10 10:08:59 +08:00

@haomni Fair enough. What are you building?

seki

2016-04-10 10:09:31 +08:00

Why not use beautifulsoup or lxml?

seki

2016-04-10 10:10:09 +08:00

Hmm, for example, what does the bs4 code that failed to extract it look like?

haomni

2016-04-10 10:16:12 +08:00

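Thanks everyone. I don't really know how to use PyQuery.
Building on @uyhyygyug1234's approach, I ran one more regex over the result and that did it.

@seki Sigh, I'd like to use it, but I don't know how...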

longchisihai

2016-04-10 10:35:12 +08:00

from bs4 import BeautifulSoup

html = '''<table id="response" border="0" cellpadding="0" cellspacing="0">
<tr><td class="shorturl">http://test/shorturl</td><td class="longurl"><a href="http://test.long.url/" target="_blank">http://test.long.url/</a></td></tr> </table>'''

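# The 'lxml' parser requires the lxml package; 'html.parser' is the built-in alternative.
soup = BeautifulSoup(html, 'lxml')

# Find the <td class="longurl"> cell.
longurl_tag = soup.find('td', class_='longurl')

# Its first child is the <a> tag; read that tag's href attribute.
print(longurl_tag.contents[0].get('href'))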
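
For anyone who can't install bs4 or lxml, roughly the same lookup can be done with the standard library's html.parser. A minimal sketch only: the class name LongUrlParser is made up here, and it assumes the longurl cell contains no nested td:

```python
from html.parser import HTMLParser


class LongUrlParser(HTMLParser):
    """Remember the href of the first <a> inside <td class="longurl">."""

    def __init__(self):
        super().__init__()
        self.in_longurl = False  # True while inside the target <td>
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td" and "longurl" in (attrs.get("class") or "").split():
            self.in_longurl = True
        elif tag == "a" and self.in_longurl and self.href is None:
            self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_longurl = False


html = '''<table id="response" border="0" cellpadding="0" cellspacing="0">
<tr><td class="shorturl">http://test/shorturl</td><td class="longurl"><a href="http://test.long.url/" target="_blank">http://test.long.url/</a></td></tr> </table>'''

parser = LongUrlParser()
parser.feed(html)
print(parser.href)
```

parser.href stays None if the page has no such cell, so check for that before using it.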

haomni

2016-04-10 10:48:42 +08:00

@longchisihai Absolutely perfect, thank you!

haomni

2016-04-10 10:54:05 +08:00

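It's roughly taking shape now.
I've been at it all night, so I'm going to sleep for a bit and test more when I wake up. Here's a screenshot in the meantime.
Thanks again to all the experts for helping out. I'll upload the project to GitHub as open source later.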