Python Scrapy experts, please take a look

  websitevisor · 2016-05-01 17:03:03 +08:00 · 2374 views

    I have an initial URL, and the page content it returns is:

    http://a.com/q=boy&alias=aps

    ["boy",["boys clothes","boys shoes","boys toys","boys socks","boyfriend gifts","boys shorts","boys underwear","boys sandals","boys","boys baseball pants"],[{"nodes":[{"name":"Boys' Clothing","alias":"fashion-boys-clothing"},{"name":"Amazon Fashion","alias":"fashion-brands"},{"name":"Baby","alias":"baby-products"},{"name":"Baby Boys' Clothing & Shoes","alias":"fashion-baby-boys"}]},{},{},{},{},{},{},{},{},{}],[]]

    The part highlighted in red is what I need to scrape: the array of suggestion strings ("boys clothes", "boys shoes", and so on). Each of these strings is also the parameter that has to be carried into the next query, which returns a response of the same shape, and what I want to do is extract all of these highlighted parts.

    ["boy",["boys clothes","boys shoes","boys toys","boys socks","boyfriend gifts","boys shorts","boys underwear","boys sandals","boys","boys baseball pants"],[{"nodes":[{"name":"Boys' Clothing","alias":"fashion-boys-clothing"},{"name":"Amazon Fashion","alias":"fashion-brands"},{"name":"Baby","alias":"baby-products"},{"name":"Baby Boys' Clothing & Shoes","alias":"fashion-baby-boys"}]},{},{},{},{},{},{},{},{},{}],[]]

    My approach is as follows:

    import json

    import scrapy
    from scrapy.spiders import CrawlSpider

    class MYItem(scrapy.Item):
        Keyword = scrapy.Field()
        Nodes = scrapy.Field()

    class Spider(CrawlSpider):
        name = 'mySpider'
        allowed_domains = ['a.com']
        start_urls = ['http://a.com/q=boy&alias=aps']

        def parse(self, response):
            # The body is the JSON array shown above; suggestvalueArr is the list of
            # suggestion strings ("boys clothes", "boys shoes", ..., "boys baseball pants").
            data = json.loads(response.text)
            suggestvalueArr = data[1]
            nodes = data[2]

            # One item per suggestion keyword.
            for sel in suggestvalueArr:
                item = MYItem()
                item['Keyword'] = sel
                item['Nodes'] = nodes
                yield item

            # Feed each keyword back in as the next query.
            for sel in suggestvalueArr:
                tmpurl = "http://a.com&q=%s&search-alias=aps" % sel
                yield scrapy.Request(tmpurl, callback=self.parse)

    Why does it seem like my crawl finishes before all of the results have been scraped? Has anyone spotted where the problem is? Thanks.
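    A minimal sketch of one way to see which follow-up requests actually get scheduled, using the logger Scrapy attaches to every spider (the stripped-down spider below only mirrors the relevant part of the code above; it is not the full crawler):

    import json

    import scrapy

    class DebugSpider(scrapy.Spider):
        name = 'mySpiderDebug'  # illustrative name
        allowed_domains = ['a.com']
        start_urls = ['http://a.com/q=boy&alias=aps']

        def parse(self, response):
            suggestions = json.loads(response.text)[1]
            for sel in suggestions:
                tmpurl = "http://a.com&q=%s&search-alias=aps" % sel
                # Logging each URL handed to the scheduler makes it visible in the
                # crawl log whether some requests are later dropped, e.g. as duplicates.
                self.logger.info("scheduling follow-up: %s", tmpurl)
                yield scrapy.Request(tmpurl, callback=self.parse)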

    No replies yet