Newbie learning Python scraping with BS4: so far I can only scrape a specific URL. How do I scrape the whole page, or just the entries for a given date?

This topic was created 978 days ago; the information in it may have since developed or changed.

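Target site: https://www.beiei.com/navisample.php
Target data: company names

As shown in the screenshot: scraping a single page works. I download it with requests, parse it with bs4, and filter on li and li0 to get the names. But the URLs on the outer listing page don't seem to follow any pattern...

As shown in the screenshot: say I want the entries for 4.18. I can't work out how to reach that block of HTML with bs4. How do I restrict bs4 to just that section?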


6 replies • 2022-04-25 15:55:55 +08:00

janda

2022-04-19 13:35:05 +08:00

xpath

colatea

2022-04-19 13:39:11 +08:00

I use xpath, and the idea is much the same: find the div whose text is 2022.04.18, go up to its parent node, then down to the table.
html.xpath("//div[text()='2022.04.18']/../table/tbody")
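
For anyone who wants to try the XPath route, here is a minimal sketch using lxml. The sample HTML, the `dateBlock` wrapper class, and the function name `companies_by_xpath` are assumptions made up for illustration; the real page structure may differ. Note that browsers add `tbody` when displaying a table in dev tools, so the raw HTML that lxml sees may not contain it; the sketch uses `table//td` so it works either way.

```python
from lxml import html

# Hypothetical page structure, invented for this example.
SAMPLE_HTML = """<html><body>
<div class="dateBlock">
  <div class="dateDiv">2022.04.18</div>
  <table><tr><td>Company A</td></tr><tr><td>Company B</td></tr></table>
</div>
<div class="dateBlock">
  <div class="dateDiv">2022.04.19</div>
  <table><tr><td>Company C</td></tr></table>
</div>
</body></html>"""


def companies_by_xpath(page_source, date_text):
    """Return the cell texts of the table that sits next to the given date div."""
    tree = html.fromstring(page_source)
    # Find the div with the date, step up to its parent, then down into the table.
    # $d is an XPath variable filled in from the keyword argument.
    cells = tree.xpath(
        "//div[normalize-space(text())=$d]/../table//td/text()",
        d=date_text,
    )
    return [cell.strip() for cell in cells if cell.strip()]


if __name__ == "__main__":
    print(companies_by_xpath(SAMPLE_HTML, "2022.04.18"))
```

Passing the date as a variable rather than formatting it into the query string avoids quoting problems if the value ever contains a quote character.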

Ritter

2022-04-19 13:40:16 +08:00

Search Baidu for the bs4 docs.

NotFoundEgg

2022-04-19 13:50:58 +08:00

# soup is the BeautifulSoup object for the downloaded page
divs = soup.find_all('div', attrs={'class': 'dateDiv'})
for div in divs:
    if '2022.04.18' in div.get_text():
        table = div.find_next('table')
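
A self-contained version of the same idea, as a rough sketch. The `dateDiv` class comes from the reply above; the sample HTML and the function name `companies_by_date` are assumptions for illustration, and the built-in `html.parser` is used so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical page structure, invented for this example.
SAMPLE_HTML = """
<div class="dateDiv">2022.04.18</div>
<table><tr><td>Company A</td></tr><tr><td>Company B</td></tr></table>
<div class="dateDiv">2022.04.19</div>
<table><tr><td>Company C</td></tr></table>
"""


def companies_by_date(page_source, date_text):
    """Return the cell texts of the first table that follows the matching date div."""
    soup = BeautifulSoup(page_source, "html.parser")
    for div in soup.find_all("div", class_="dateDiv"):
        if date_text in div.get_text():
            # find_next searches forward in document order from this div.
            table = div.find_next("table")
            if table is None:
                return []
            return [td.get_text(strip=True) for td in table.find_all("td")]
    return []


if __name__ == "__main__":
    print(companies_by_date(SAMPLE_HTML, "2022.04.18"))
```

Because `find_next` walks forward through the document, this works whether the table is a sibling of the date div or nested somewhere after it.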

Joshuam

2022-04-19 18:58:14 +08:00 via Android

I recommend a Chrome extension: SelectorGadget.
Just click on the data you want and it gives you a CSS selector, which you can then feed to bs4.
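
Once you have a selector, bs4's `select` and `select_one` accept it directly. The sketch below collects every date on the page in one pass, which also covers the "whole page" part of the question. The selectors `div.dateBlock`, `div.dateDiv`, and `table td`, the sample HTML, and the function name are assumptions for illustration; substitute whatever SelectorGadget reports for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical page structure, invented for this example.
SAMPLE_HTML = """
<div class="dateBlock">
  <div class="dateDiv">2022.04.18</div>
  <table><tr><td>Company A</td></tr><tr><td>Company B</td></tr></table>
</div>
<div class="dateBlock">
  <div class="dateDiv">2022.04.19</div>
  <table><tr><td>Company C</td></tr></table>
</div>
"""


def companies_grouped_by_date(page_source):
    """Map each date on the page to the list of company names under it."""
    soup = BeautifulSoup(page_source, "html.parser")
    grouped = {}
    for block in soup.select("div.dateBlock"):
        date_div = block.select_one("div.dateDiv")
        if date_div is None:
            continue  # skip blocks that carry no date label
        date = date_div.get_text(strip=True)
        grouped[date] = [td.get_text(strip=True) for td in block.select("table td")]
    return grouped


if __name__ == "__main__":
    print(companies_grouped_by_date(SAMPLE_HTML))
```

Looking up a single date is then just `companies_grouped_by_date(page)["2022.04.18"]`.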

AmberJiang

2022-04-25 15:55:55 +08:00

I'd suggest working through the official BS4 documentation.