Launching multiple Scrapy spiders - V2EX


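I have quite a few Scrapy spiders, but because the number of proxies is limited, I can't start them all at once. So I'm looking for something queue-like: whichever spider is taken off the queue is the one that gets crawled next. So far I have tried a process pool and Scrapy's own API, and neither works well.

Problem with the process pool:

Each process shuts down after consuming its first spider.

Problem with the Scrapy API:

The spiders have to be crawled in order. Even after the other processes have finished, the remaining spiders still have to run one by one in sequence, which wastes time.

Here is the process-pool code: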

from scrapy.crawler import CrawlerProcess
from multiprocessing import Pool

def _crawl_main_program(spider, settings):
    # spider: the spider to run
    process = CrawlerProcess(settings)
    process.crawl(spider)
    process.start()


def _crawl_running(crawl_map: dict, settings: dict, max_processes=5):
    # crawl_map: mapping of domains to Scrapy spiders
    if not crawl_map:
        raise CrawlError()
    executor = Pool(processes=max_processes)
    for domain, spider in crawl_map.items():
        executor.apply_async(_crawl_main_program, (spider, settings))
    executor.close()
    executor.join()


def core_website_crawl():
    _crawl_running(crawl_map=core_spider_domain_map,
                   settings=core_website_crawl_settings)


if __name__ == '__main__':
    core_website_crawl()

I'd like to find an approach that works well for this.
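A possible explanation for the process-pool behaviour, offered as a guess rather than a confirmed diagnosis: `CrawlerProcess.start()` runs the Twisted reactor, and a reactor cannot be started a second time in the same process. A pool worker that is reused for a second spider would then fail, and because the `apply_async` results are never read, the error stays hidden. The sketch below assumes that this is the cause. It asks the pool to replace each worker after a single task (`maxtasksperchild=1`) and reads every result so that failures surface. The names `crawl_one` and `crawl_all`, and the `worker` parameter, are hypothetical and not from the original post.

```python
from multiprocessing import Pool


def crawl_one(spider, settings):
    # Imported here so Scrapy and Twisted are only loaded inside the worker process.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings)
    process.crawl(spider)
    process.start()  # blocks until this spider has finished


def crawl_all(spiders, settings, max_processes=5, worker=crawl_one):
    # maxtasksperchild=1: a worker exits after one task and the pool starts a
    # fresh process for the next one, so no process runs the reactor twice.
    with Pool(processes=max_processes, maxtasksperchild=1) as pool:
        pending = [pool.apply_async(worker, (spider, settings)) for spider in spiders]
        # .get() waits for the task and re-raises any exception from the worker.
        return [result.get() for result in pending]
```

With this shape, at most `max_processes` spiders run at a time, and a waiting spider starts as soon as any running one finishes. On platforms that start processes by spawning, the call to `crawl_all` still belongs under an `if __name__ == '__main__':` guard, as in the original code.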

4 replies • 2021-01-12 20:50:49 +08:00

1

QuinceyWu

2021-01-11 19:36:13 +08:00

There's a distributed crawler management platform on GitHub called Crawlab. I'm using it right now, and it can cover all of your crawler needs.

2

tuoov

2021-01-12 10:25:06 +08:00

subprocess.Popen(['python','runSpider.py'])
Put process.start() inside runSpider, and let a command-line argument decide which spider gets launched.
That might solve your problem.
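One way this suggestion could be turned into a queue-like launcher is sketched below. It assumes that `runSpider.py` takes the spider name as its only command-line argument, which the reply does not spell out. The function names (`build_command`, `run_one`, `run_all`) and the `command_for` parameter are hypothetical. A small thread pool caps how many child processes run at once, and each thread picks up the next spider name as soon as its current child exits.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_command(spider_name):
    # Assumption: runSpider.py accepts the spider name as its only argument.
    return [sys.executable, "runSpider.py", spider_name]


def run_one(spider_name, command_for=build_command):
    # Each spider runs in its own Python process, so every crawl gets a
    # fresh interpreter and a fresh Twisted reactor.
    completed = subprocess.run(command_for(spider_name))
    return spider_name, completed.returncode


def run_all(spider_names, max_parallel=5, command_for=build_command):
    # The threads only wait on child processes; max_workers limits how many
    # spiders are running at the same time.
    with ThreadPoolExecutor(max_workers=max_parallel) as executor:
        results = executor.map(lambda name: run_one(name, command_for), spider_names)
        return dict(results)
```

`run_all` returns a dict that maps each spider name to the exit code of its process, so a non-zero value marks a spider that needs another look. Setting `max_parallel` to the number of available proxies would keep the launcher within the limit described in the question.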

3

Luzaiv7

OP

2021-01-12 20:50:41 +08:00

@QuinceyWu This looks like a good fit for me, thanks.

4

Luzaiv7

OP

2021-01-12 20:50:49 +08:00

@tuoov It's solved now, thanks.
