I want to use the following code to remove the empty tag nodes from a page's HTML:
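import lxml.etree
import lxml.html

page = lxml.etree.HTML('<span lang="en-us"><p></p>223</span>')
# Remove every element that has no child nodes at all
for empty in page.xpath('//*[not(node())]'):
    empty.getparent().remove(empty)
print(lxml.html.tostring(page, encoding='unicode'))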
The output is:
<html><body><span lang="en-us"></span></body></html>
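The text that follows the empty node was removed along with it. How can I keep the "223" from the original markup and still remove the empty node?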
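For context, a minimal sketch of why the "223" disappears. lxml follows the ElementTree data model, in which text that comes after an element's closing tag is stored on that element as its `tail`. The example below uses the standard library's `xml.etree.ElementTree` (an assumption made only so the sketch runs without lxml installed; `text` and `tail` behave the same way in lxml), and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

span = ET.fromstring('<span lang="en-us"><p></p>223</span>')
p = span[0]

# "223" is not stored on <span>; it is the tail of <p>.
text_before = span.text   # None: nothing precedes <p> inside <span>
tail_of_p = p.tail        # '223'
print(repr(text_before), repr(tail_of_p))

# Removing <p> therefore removes its tail from the tree as well.
span.remove(p)
remaining = ''.join(span.itertext())
print(repr(remaining))
```

So any fix has to copy `tail` somewhere else before the element is removed.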
1
cute 2015-02-25 11:22:49 +08:00
```
from lxml import html

# text_content() returns only the text of the element, without any tags
print(html.fromstring('<span lang="en-us"><p></p>223</span>').text_content())
```
2
gogogen OP
3
cute 2015-02-26 10:13:34 +08:00
```
from lxml import html

doc = html.fromstring('<span lang="en-us">11<p></p>223</span>')
for elem in doc.xpath('//*[not(node())]'):
    parent = elem.getparent()
    # Move the text that follows the empty element onto the parent
    # before removing it, otherwise it is lost together with the element.
    if elem.tail:
        if not parent.text:
            parent.text = elem.tail
        else:
            parent.text = parent.text + elem.tail
    parent.remove(elem)
print(html.tostring(doc, encoding='unicode'))
```
4
cute 2015-02-26 10:53:10 +08:00
Posting it again, shorter this time.
from lxml import html

doc = html.fromstring('<span lang="en-us">sss<p></p>223</span>')
for elem in doc.xpath('//*[not(node())]'):
    parent = elem.getparent()
    if elem.tail:
        # Keep the text that follows the empty element
        parent.text = (parent.text or '') + elem.tail
    parent.remove(elem)
print(html.tostring(doc, encoding='unicode'))