
How do I crawl an AngularJS site?

    fy · 2015-03-08 23:18:46 +08:00 · 4168 views
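    As the title says: I want to crawl an entire wiki site to keep as offline reference material, but it is built on this damned AngularJS...

    The first difficulty is that pages can only be fetched by clicking links, so the crawler has to simulate a user's browsing session, going forward and back, to cover the whole site. As soon as I navigate directly to a URL, the whole page takes 5-20 seconds to reload.

    The second difficulty is that there is no reliable way to tell when AngularJS has finished loading. At first I naively wrote a small Chrome extension to confirm when loading was done. After painstakingly getting past the JS sandbox that Google set up, I found that it does work, but it is not compatible with selenium..... so I am stuck again.

    Has anyone here done this before? Please point me in the right direction!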
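One possible way to approach the "when has it finished loading" problem without a browser extension is to poll from the automation side. The sketch below is only an illustration under stated assumptions: the site is an AngularJS 1.x app bootstrapped on `<body>`, and "finished" is taken to mean that `$http` has had no pending requests for several consecutive checks. `ANGULAR_IDLE_JS` and `wait_until` are hypothetical names, not part of selenium or AngularJS.

```python
import time
from typing import Callable

# Assumption: AngularJS 1.x bootstrapped on <body>. This JavaScript evaluates
# to true when the $http service has no requests in flight.
ANGULAR_IDLE_JS = (
    "return window.angular !== undefined && "
    "angular.element(document.body).injector() !== undefined && "
    "angular.element(document.body).injector().get('$http')"
    ".pendingRequests.length === 0;"
)


def wait_until(predicate: Callable[[], bool], timeout: float = 30.0,
               interval: float = 0.25, stable_checks: int = 3) -> bool:
    """Poll predicate until it is true stable_checks times in a row.

    Returns True on success, or False if timeout seconds pass first.
    Requiring several consecutive hits guards against the brief idle gaps
    between one batch of requests and the next.
    """
    deadline = time.monotonic() + timeout
    streak = 0
    while time.monotonic() < deadline:
        streak = streak + 1 if predicate() else 0
        if streak >= stable_checks:
            return True
        time.sleep(interval)
    return False
```

With a selenium WebDriver instance named `driver`, this might be called as `wait_until(lambda: driver.execute_script(ANGULAR_IDLE_JS))` after each simulated click. Whether an idle `$http` queue really means the page is fully rendered on this particular site is untested.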
    15 replies    2015-03-09 10:11:00 +08:00
    AWSAM
        1
    AWSAM  
       2015-03-08 23:26:46 +08:00
    phantomjs
    14
        2
    14  
       2015-03-08 23:28:34 +08:00 via Android
    Isn't its data fetched through AJAX requests?
    fy
        3
    fy  
    OP
       2015-03-08 23:30:45 +08:00
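    @AWSAM Do you have more details? How do I set it up?

    @14 Yes, but a single page fires hundreds of requests. Even if I could piece them all together, it would still be hard to know whether loading has finished!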
    14
        4
    14  
       2015-03-08 23:38:02 +08:00 via Android
    @fy Could you share the URL...
    kmvan
        5
    kmvan  
       2015-03-08 23:57:09 +08:00
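    @fy "A single page fires hundreds of requests. Even if I could piece them all together, it would still be hard to know whether loading has finished!"
    With that many requests, doesn't the server fall over under high concurrency?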
    crazyxin1988
        6
    crazyxin1988  
       2015-03-09 00:04:09 +08:00
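    Whatever the front end is, you should look at the network traffic and analyze the HTTP requests and responses.
    I once simulated a single sign-on flow with all kinds of redirects. Once the network requests were untangled, everything became clear.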
    frankzeng
        7
    frankzeng  
       2015-03-09 00:10:46 +08:00
    For a page that is loaded by JS, look at the HTTP requests it makes and parse the responses to those requests directly.
    evlos
        8
    evlos  
       2015-03-09 00:18:27 +08:00
    Even if a page makes a large number of requests, only a few of them can be hitting the API. Find its API endpoints and the rest becomes very simple.
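As a rough illustration of calling the API directly, here is a minimal sketch using only the Python standard library. The host, path prefix, and headers are copied from the request captured later in this thread. `build_unit_request` and `fetch_json` are hypothetical helper names, and it is an assumption that the server accepts requests without a `_nonce`, or with one reused from a browser session, since nothing in the thread says how that value is generated.

```python
import json
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

# Host and path prefix as seen in the captured request in this thread.
API_BASE = "http://attila-db.totalwar.com/twe_at_en/"


def build_unit_request(unit_id: str, nonce: Optional[str] = None) -> Request:
    """Build a GET request for one document, mirroring the captured headers."""
    url = API_BASE + quote(unit_id, safe="")
    if nonce is not None:
        # Assumption: the nonce is optional or can be reused as-is.
        url += "?" + urlencode({"_nonce": nonce})
    return Request(url, headers={
        "Accept": "application/json",
        "Referer": "http://atenc.totalwar.com/",
    })


def fetch_json(request: Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `fetch_json(build_unit_request("att_rom_levis_armaturae"))` would attempt to retrieve the document shown further down, provided the endpoint still responds the same way.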
    fy
        9
    fy  
    OP
       2015-03-09 00:29:27 +08:00
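    @14 Sure: http://atenc.totalwar.com/#

    @kmvan Maybe there aren't quite that many, but it is still not the kind of thing that can be parsed easily...

    @evlos Unfortunately... every one of them is an API request. The defining trait of an Angular site is that the front end and back end are separated.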
    randyzhao
        10
    randyzhao  
       2015-03-09 00:41:54 +08:00
    phantomjs casperjs
    fy
        11
    fy  
    OP
       2015-03-09 01:14:34 +08:00
    @randyzhao That's not really any different from selenium, is it...
    13k
        12
    13k  
       2015-03-09 01:24:39 +08:00
    Keywords: phantomjs or selenium
    fy
        13
    fy  
    OP
       2015-03-09 08:06:37 +08:00
    @13k Come on, you need to actually read the post.... Never mind, though. I have an idea and I'll see whether it works.
    jason52
        14
    jason52  
       2015-03-09 08:43:42 +08:00 via Android
    My videos haven't covered this part yet...
    azuginnen
        15
    azuginnen  
       2015-03-09 10:11:00 +08:00
    The response is JSON:
    _______________________________________________________

    GET /twe_at_en/att_rom_levis_armaturae?_nonce=jrZO5C6sdYKfyDcU HTTP/1.1
    Host: attila-db.totalwar.com
    Proxy-Connection: keep-alive
    Accept: application/json
    Origin: http://atenc.totalwar.com
    User-Agent: Safari/536.36
    Content-Type: application/json
    Referer: http://atenc.totalwar.com/
    Accept-Encoding: gzip, deflate, sdch
    Accept-Language: zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4,ja;q=0.2

    _______________________________________________________

    {
    "_id": "att_rom_levis_armaturae",
    "_rev": "1-392fb8fd27191a085ac7d3c5a2d95493",
    "index": 421,
    "campaign": "main_attila",
    "additional_picture": "",
    "name": "Levis Armaturae",
    "next_unit": "att_rom_matiarii",
    "picture": "att_rom_levis_armaturae.png",
    "prev_unit": "att_rom_funditores",
    "requires_region": [
    "",
    "",
    "",
    "",
    "",
    "",
    // delete some
    "",

    ""
    ],
    "ability_block": [
    "",
    "enc_text_manual_battle_conflict_attributes_fatigue",
    "enc_text_manual_battle_conflict_attributes_scrub",
    "",
    "",
    "",
    "",
    "",
    "",
    "",
    ""
    ],
    "ability_link": [
    "",
    "0086_enc_page_battle_play_phase_conflict_attributes",
    "0086_enc_page_battle_play_phase_conflict_attributes",
    "",
    "",
    "",
    "",
    "",
    "",
    "",
    ""
    ],
    "ability_text": [
    "As these men are inexperienced sailors. Caught up in sea combat, they suffer various penalties.",
    "Fatigue has less of an effect on this unit.",
    "This unit can hide in forests until enemy units get too close.",
    "",
    "",
    "",
    "",
    "",
    "",
    "",
    "",
    "",
    "",
    ""
    ],
    "ability_title": [
    "Sea Sickness",
    "Resistant to Fatigue",
    "Hide (forest)",
    "",
    "",
    "",
    "",
    "",
    "",
    "",
    "",
    "",
    "",
    ""
    ],
    "class": "Missile Infantry",
    "class_description": "Long range units who provide support for melee units, but are themselves very weak in close combat.",
    "faction": [
    "att_fact_illyricum",
    "att_fact_macedonia",
    "att_fact_aegyptus",
    "att_fact_hispania",
    "att_fact_gallia",
    "att_fact_italia",
    "att_fact_britannia",
    "att_fact_dacia",
    "att_fact_septem_provinciem",
    "att_fact_africa",
    "att_fact_eastern_roman_empire",
    "att_fact_ostrogothi",
    "att_fact_western_roman_empire",
    "att_fact_oriens",
    "att_fact_pontus",
    "att_fact_asia",
    "",
    "",
    "",
    "",
    "",
    "",
    ""
    ],
    "description": [
    "Levis armaturae' literally translates as 'lightly armoured'. These flexible, sparsely-clad infantry formed a large part of late Roman Legions, acting primarily as a skirmishing force. They were used to harass the enemy and hurled slingshot, javelins and plumbatae - deadly lead darts - before falling back through the maniples of heavy infantry as they advanced. They could then continue skirmishing on the enemy's flanks. In this way, levis armaturae never gave the enemy pause to regroup or breathe; successful skirmishers maintained a defensive screen as an army manoeuvred, able to keep pressure on the enemy and an eye on their Legion's flanks.",
    ""
    ],
    "next_key": "att_rom_matiarii",
    "prev_key": "att_rom_funditores",
    "requires_building_faction_id": [
    "att_fact_eastern_roman_empire",
    "att_fact_western_roman_empire",
    "",
    "",
    "",
    "",
    "",
    ""
    ],
    "requires_building_id": {
    "1": {
    "1": "att_bld_roman_east_military_1att_cult_romanatt_sub_cult_roman_east",
    "2": "",
    "3": "",
    "4": ""
    },
    "2": {
    "1": "att_bld_roman_west_military_1att_cult_romanatt_sub_cult_roman_west",
    "2": "",
    "3": "",
    "4": ""
    },
    "3": {
    "1": "",
    "2": "",
    "3": "",
    "4": ""
    },
    "4": {
    "1": "",
    "2": "",
    "3": "",
    "4": ""
    },
    "5": {
    "1": "",
    "2": "",
    "3": "",
    "4": ""
    },
    "6": {
    "1": "",
    "2": "",
    "3": "",
    "4": ""
    },
    "7": {
    "1": "",
    "2": "",
    "3": "",
    "4": ""
    },
    "8": {
    "1": "",
    "2": "",
    "3": "",
    "4": ""
    }
    },
    "strengths_and_weaknesses_title": "Strengths & Weaknesses",
    "strengths_and_weaknesses": [
    "Excellent Rate of Fire",
    "Very Poor Armour",
    "Low Ammunition"
    ],
    "stat_label": [
    "Recruitment Cost",
    "Upkeep Cost",
    "Melee Attack",
    "Melee Damage",
    "Charge Bonus",
    "Melee Defence",
    "Armour",
    "Health",
    "Morale",
    "Speed",
    "Missile Damage",
    "Ammunition",
    "Capture Power",
    "Missile Block Chance",
    "Rate of Fire",
    "Spotting",
    "Range",
    "Hiding"
    ],
    "stat_percentage": [
    "",
    "",
    "4.16667",
    "8",
    "0.333333",
    "25.8333",
    "6.66667",
    "22.2857",
    "22.6667",
    "33.3333",
    "90",
    "16",
    "40",
    "20",
    "30.5",
    "50",
    "16",
    "50"
    ],
    "stat_value": [
    "300",
    "150",
    "5",
    "6",
    "1",
    "31",
    "8",
    "78",
    "34",
    "40",
    "90",
    "8",
    "10",
    "20",
    "61",
    "500",
    "80",
    "1"
    ],
    "game": "at_lb",
    "tag": "654012",
    "typeof": "Units",
    "collection": "Units",
    "modifiedBy": "martin.haynes",
    "date_created": "2015-03-03T08:50:46+00:00",
    "operation": "updated",
    "data_updated": "2015-03-03T08:50:46+00:00"
    }
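Given a response shaped like the one above, a crawler might not need to click through pages at all. The following is a minimal sketch, assuming that following `next_unit` from one document eventually reaches every unit (the thread does not confirm this). `crawl_units`, `unit_stats`, and `non_empty` are hypothetical helper names, and `fetch` stands in for whatever function retrieves one document by its `_id`.

```python
from typing import Callable, Dict, List


def non_empty(values: List[str]) -> List[str]:
    """Drop the empty-string padding entries from a standalone list."""
    return [v for v in values if v != ""]


def unit_stats(doc: dict) -> Dict[str, str]:
    """Pair each stat_label with the stat_value at the same position."""
    return dict(zip(doc["stat_label"], doc["stat_value"]))


def crawl_units(start_id: str, fetch: Callable[[str], dict]) -> Dict[str, dict]:
    """Follow next_unit links from start_id until a link is empty or repeats."""
    seen: Dict[str, dict] = {}
    current = start_id
    while current and current not in seen:
        doc = fetch(current)
        seen[current] = doc
        # Assumption: an empty or missing next_unit marks the end of the chain.
        current = doc.get("next_unit", "")
    return seen
```

Note that `non_empty` is only suitable for independent lists such as `faction`. The `ability_block`, `ability_link`, `ability_text`, and `ability_title` arrays appear to be aligned by position, so filtering them separately would break that alignment.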