ScrapydWeb v1.2.0: possibly the best tool for scheduled crawling?!

    my8100 · 2019-03-12 20:40:28 +08:00 · 2619 views
    9 replies    2019-03-16 22:58:07 +08:00
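    oIMOo · #1 · 2019-03-12 20:50:28 +08:00
    Hello,

    I haven't written many Python requests scripts for pages that rely on JS.
    Could you take a look at how to detect whether the enroll button returned by that JS shows the course as available?
    (Enrollment is currently closed.)
    Thanks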
    jenlors · #2 · 2019-03-13 11:14:14 +08:00
    Showing my support.
    my8100 (OP) · #4 · 2019-03-13 14:38:28 +08:00
    @tonywangcn I just confirmed that both links open fine. Please first check that your network can reach https://medium.com/
    tonywangcn · #5 · 2019-03-13 17:25:18 +08:00
    $ https_proxy=localhost:6152 curl -vvv https://-medium.com/@my8100/https-medium-com-my8100-how-to-efficiently-manage-your-distributed-web-scraping-projects-55ab13309820

    * Trying ::1...
    * TCP_NODELAY set
    * Connection failed
    * connect to ::1 port 6152 failed: Connection refused
    * Trying 127.0.0.1...
    * TCP_NODELAY set
    * Connected to localhost (127.0.0.1) port 6152 (#0)
    * Establish HTTP proxy tunnel to medium.com:443
    > CONNECT medium.com:443 HTTP/1.1
    > Host: medium.com:443
    > User-Agent: curl/7.54.0
    > Proxy-Connection: Keep-Alive
    >
    < HTTP/1.1 200 Connection established
    <
    * Proxy replied OK to CONNECT request
    * ALPN, offering h2
    * ALPN, offering http/1.1
    * Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
    * successfully set certificate verify locations:
    * CAfile: /etc/ssl/cert.pem
    CApath: none
    * TLSv1.2 (OUT), TLS handshake, Client hello (1):
    * TLSv1.2 (IN), TLS handshake, Server hello (2):
    * TLSv1.2 (IN), TLS handshake, Certificate (11):
    * TLSv1.2 (IN), TLS handshake, Server key exchange (12):
    * TLSv1.2 (IN), TLS handshake, Server finished (14):
    * TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
    * TLSv1.2 (OUT), TLS change cipher, Client hello (1):
    * TLSv1.2 (OUT), TLS handshake, Finished (20):
    * TLSv1.2 (IN), TLS change cipher, Client hello (1):
    * TLSv1.2 (IN), TLS handshake, Finished (20):
    * SSL connection using TLSv1.2 / ECDHE-RSA-CHACHA20-POLY1305
    * ALPN, server accepted to use h2
    * Server certificate:
    * subject: businessCategory=Private Organization; jurisdictionCountryName=US; jurisdictionStateOrProvinceName=Delaware; serialNumber=5010624; street=760 Market Street; postalCode=94102; C=US; ST=California; L=San Francisco; O=A Medium Corporation; CN=medium.com
    * start date: Jun 1 00:00:00 2017 GMT
    * expire date: Aug 30 12:00:00 2019 GMT
    * subjectAltName: host "medium.com" matched cert's "medium.com"
    * issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=DigiCert SHA2 Extended Validation Server CA
    * SSL certificate verify ok.
    * Using HTTP2, server supports multi-use
    * Connection state changed (HTTP/2 confirmed)
    * Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
    * Using Stream ID: 1 (easy handle 0x7fa7b5806600)
    > GET /@my8100/https-medium-com-my8100-how-to-efficiently-manage-your-distributed-web-scraping-projects-55ab13309820 HTTP/2
    > Host: medium.com
    > User-Agent: curl/7.54.0
    > Accept: */*
    >
    * Connection state changed (MAX_CONCURRENT_STREAMS updated)!
    < HTTP/2 302
    < date: Wed, 13 Mar 2019 09:23:05 GMT
    < content-type: application/octet-stream
    < set-cookie: __cfduid=d800d3f4d7ffa024ead64e91a29e1ebb41552468985; expires=Thu, 12-Mar-20 09:23:05 GMT; path=/; domain=.medium.com; HttpOnly
    < set-cookie: uid=lo_rj0lT6mjVKUE; Expires=Thu, 12-Mar-20 09:23:05 GMT; Domain=.medium.com; Path=/; Secure; HttpOnly
    < content-security-policy: default-src 'self'; connect-src https://localhost https://*.instapaper.com https://*.stripe.com https://glyph.medium.com https://*.paypal.com https://getpocket.com https://medium.com:443 https://*.medium.com:443 https://*.medium.com https://medium.com https://*.medium.com https://*.algolia.net https://cdn-static-1.medium.com https://dnqgz544uhbo8.cloudfront.net https://cdn-videos-1.medium.com https://cdn-audio-1.medium.com https://*.lightstep.com https://*.branch.io https://app.zencoder.com 'self'; font-src data: https://*.amazonaws.com https://*.medium.com https://glyph.medium.com https://medium.com https://*.gstatic.com https://dnqgz544uhbo8.cloudfront.net https://use.typekit.net https://cdn-static-1.medium.com 'self'; frame-src chromenull: https: webviewprogressproxy: medium: 'self'; img-src blob: data: https: 'self'; media-src https://*.cdn.vine.co https://d1fcbxp97j4nb2.cloudfront.net https://d262ilb51hltx0.cloudfront.net https://*.medium.com https://gomiro.medium.com https://miro.medium.com https://pbs.twimg.com 'self' blob:; object-src 'self'; script-src 'unsafe-eval' 'unsafe-inline' about: https: 'self'; style-src 'unsafe-inline' data: https: 'self'; report-uri https://csp.medium.com
    < x-frame-options: sameorigin
    < x-content-type-options: nosniff
    < x-xss-protection: 1; mode=block
    < x-ua-compatible: IE=edge, Chrome=1
    < x-powered-by: Medium
    < x-obvious-tid: 1552468985229:d47a5d7da221
    < x-obvious-info: 36855-3d9334e,3d9334ed6db
    < link: <https://medium.com/humans.txt>; rel="humans"
    < cache-control: no-cache, no-store, max-age=0, must-revalidate
    < expires: Thu, 09 Sep 1999 09:09:09 GMT
    < pragma: no-cache
    < set-cookie: sid=1:1Yj8mG1saeQMx1r5h/kFLMw3J77PPMa784rb2HRk3z9J8bnuZYy18oGwoDGakmHV; path=/; expires=Thu, 12 Mar 2020 09:23:05 GMT; domain=.medium.com; secure; httponly
    < tk: T
    < location: /suspended
    < strict-transport-security: max-age=15552000; includeSubDomains; preload
    < expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
    < server: cloudflare
    < cf-ray: 4b6cf274ef5132fb-HKG
    <
    * Connection #0 to host localhost left intact


    Look here:

    location: /suspended
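
The check above (a 302 response whose `location` header points to `/suspended`) can also be scripted. Below is a minimal sketch using only the Python standard library; the function names `is_suspended_redirect` and `check_url` are made up for illustration and are not part of ScrapydWeb or of any Medium API. `http.client` never follows redirects on its own, so the first response is what gets inspected. Note that, unlike the curl command, this sketch does not read the `https_proxy` environment variable.

```python
import http.client
from urllib.parse import urlsplit

# Status codes that carry a Location header pointing somewhere else.
REDIRECT_CODES = (301, 302, 303, 307, 308)


def is_suspended_redirect(status, location):
    """Return True if the response is a redirect whose target path is /suspended.

    Hypothetical helper: it only mirrors what the curl log shows
    (HTTP 302 plus "location: /suspended").
    """
    if status not in REDIRECT_CODES or not location:
        return False
    # Location may be relative ("/suspended") or absolute; compare paths only.
    return urlsplit(location).path.rstrip("/") == "/suspended"


def check_url(url, timeout=10):
    """Send one GET to an https URL without following redirects.

    Returns (status, location); location is None when the header is absent.
    """
    parts = urlsplit(url)
    conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

To try it against a live page, call `status, location = check_url(url)` and pass both values to `is_suspended_redirect`. The result depends on the network you are on and on the current state of the account.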
    my8100 (OP) · #6 · 2019-03-13 17:54:57 +08:00
    @tonywangcn All I can say is that this is not very "scientific". I suggest reading the Chinese version, which is hosted inside China: https://juejin.im/post/5bebc5fd6fb9a04a053f3a0e
    my8100 (OP) · #7 · 2019-03-13 18:17:59 +08:00
    @tonywangcn I should find the time to read through your articles properly: https://medium.com/@tonywangcn
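    tonywangcn · #8 · 2019-03-13 20:10:28 +08:00
    @my8100 Hahaha, mine fall far short of yours. I've recently been planning to integrate scrapy into k8s, and a control panel like this is exactly what I need. If it's convenient, we could connect on WeChat and exchange notes: NTMyNDcyODQx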
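    my8100 (OP) · #9 · 2019-03-16 22:58:07 +08:00
    @tonywangcn Today I found that, after logging out of my Medium account, I no longer show up in search: the account has been suspended for no stated reason. So I've simply moved the article to https://github.com/my8100/files/blob/master/scrapydweb/README.md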