[Crawler] PHP via ajax, file_get_contents, or Snoopy cannot fetch the dynamic page of 壹心理 FM

    bosshida · 2014-12-22 23:05:46 +08:00 · 6649 views
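    I'm trying to scrape information from 壹心理 FM, for example http://fm.xinli001.com/#4916186. Firebug shows that after the page loads it sends a request to
    http://fm.xinli001.com/broadcast/?pk=4916186&t=1419258885474, which returns the basic information for the current FM broadcast. When I request that URL with ajax, I get 403 FORBIDDEN.
    Firefox also reports: "Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://fm.xinli001.com/broadcast/?pk=12139723&t=1419253731488. This can be fixed by moving the resource to the same domain or enabling CORS."
    JS code: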
    $.ajax({
        type: "get",
        dataType: "json",
        url: "http://fm.xinli001.com/broadcast/?pk=4916186&t=1419258885474",
        timeout: 30000,
        success: function (data) { alert('ok'); },
        error: function (XMLHttpRequest, textStatus, errorThrown) {
            alert('error');
        }
    });

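    After some Googling I found a suggestion to add jQuery.support.cors = true; but that did not work either.
    Adding the related request headers did not help either. I'm out of ideas.
    Is there any way to get the FM's basic information?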
    13 replies · 2014-12-24 11:02:21 +08:00
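    #1 · Jat001 · 2014-12-22 23:18:32 +08:00
    Send these headers:
    X-Requested-With: XMLHttpRequest
    Referer: http://fm.xinli001.com/
    Writing a crawler means imitating the browser. Look at which headers the browser sends, then remove them one at a time until the request fails. That tells you which headers are required.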
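For anyone following along, the remove-one-header-at-a-time approach described above could be sketched roughly like this in Node.js. The names `minimizeHeaders` and `probe` are hypothetical, not from the thread: `probe` is assumed to be a callback you supply that sends the request with exactly the given headers and resolves to `true` when the server answers normally.

```javascript
// Hypothetical helper: shrink a header set to the ones the server still requires.
// `probe(headers)` is an assumed async callback that performs the request with
// exactly those headers and resolves to true when the response is acceptable.
async function minimizeHeaders(headers, probe) {
  const kept = { ...headers };
  for (const name of Object.keys(headers)) {
    // Try the request without this one header.
    const { [name]: _removed, ...trial } = kept;
    if (await probe(trial)) {
      delete kept[name]; // still works without it, so it is not required
    }
  }
  return kept;
}
```

Start from the full header set copied out of Firebug. Whatever survives is a candidate minimal set, though a server could also check combinations that a single pass like this would miss.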
    #2 · fising · 2014-12-22 23:38:37 +08:00 via iPad
    The ajax request is cross-origin, so the browser blocked it.
    #3 · bosshida (OP) · 2014-12-23 00:13:00 +08:00 via Android
    @fising Is there any way around that?
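    #4 · Jat001 · 2014-12-23 00:37:07 +08:00
    @bosshida Either their server sets the Access-Control-Allow-Origin header, which you certainly don't have permission to do, or you use something along the lines of userscripts.
    Personally, I think this kind of request is best handled on the server side.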
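One way to read "handle it on the server side" is a small relay on your own server: your page calls the relay, and the relay calls fm.xinli001.com with the headers mentioned earlier in the thread. Below is a minimal Node.js sketch under several assumptions: `handle` and `upstreamUrl` are made-up names, `t` is assumed to be a millisecond timestamp used for cache busting, and the wildcard `Access-Control-Allow-Origin` is only there for the case where your page and the relay live on different origins.

```javascript
const http = require('node:http');

// Headers the thread reports as sufficient for the upstream request.
const UPSTREAM_HEADERS = {
  'X-Requested-With': 'XMLHttpRequest',
  'Referer': 'http://fm.xinli001.com/',
};

// Build the upstream URL for a broadcast id; anything but digits is rejected.
// Assumption: `t` is just the current time in milliseconds.
function upstreamUrl(pk) {
  if (!/^\d+$/.test(String(pk))) throw new Error('pk must be numeric');
  return `http://fm.xinli001.com/broadcast/?pk=${pk}&t=${Date.now()}`;
}

// Relay handler for requests such as GET /?pk=4916186 on your own server.
function handle(req, res) {
  let target;
  try {
    const pk = new URL(req.url, 'http://localhost').searchParams.get('pk');
    target = upstreamUrl(pk);
  } catch (err) {
    res.writeHead(400, { 'Content-Type': 'text/plain' });
    res.end(err.message);
    return;
  }
  http.get(target, { headers: UPSTREAM_HEADERS }, (upstream) => {
    res.writeHead(upstream.statusCode, {
      'Content-Type': upstream.headers['content-type'] || 'application/json',
      // Only needed when the calling page is on a different origin than the relay.
      'Access-Control-Allow-Origin': '*',
    });
    upstream.pipe(res); // stream the upstream body straight through
  }).on('error', (err) => {
    res.writeHead(502, { 'Content-Type': 'text/plain' });
    res.end(err.message);
  });
}

// To run it: http.createServer(handle).listen(8080);
```

The browser then only ever talks to the relay, so the same-origin policy no longer applies to the request that actually reaches fm.xinli001.com.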
    #5 · esile · 2014-12-23 01:52:08 +08:00
    Setting Referer and X-Requested-With is enough to fetch it successfully.

    Here is the response from my test:
    {"code": 0, "data": {"favnum": 398, "commentnum": 120, "speaker_id": 108, "is_home": true, "background": "http://image.xinli001.com/20141220/18083879570a3ec9b9a360.jpg", "speak_url": "http://www.xinli001.com/user/742450/", "duration": 1283, "tags": [], "weight": 397, "title1": "", "_cache_key": "data_fm_broadcast_4916186", "article": null, "specials": [], "_id": "54954aea4f670ade3e8b4a1b", "range": 20535196, "word": "\u6625\u6653", "speakers_id": [], "lizhi_url": "", "created": "2014-12-20 18:01", "word_url": "http://www.xinli001.com/user/article/3866918/", "speak": "\u5cf0_\u5c0f\u5cf0", "id": 4916186, "is_teacher": false, "message_url": "", "cover": "http://image.xinli001.com/20141220/18094254011b53336c1227.jpg", "title": "\u6211\u548c\u90b5\u6bdb\u6bdb\u7684\u65e5\u4e0e\u591c", "url": "http://image.kaolafm.net/mz/audios/201412/a59b5e60-e515-4804-88f5-64f167aa957e.mp3", "absolute_url": "http://fm.xinli001.com/4916186/", "content": "\u4e0d\u8bba\u751f\u6d3b\u5728\u54ea\u91cc\uff0c\u53ea\u8981\u5728\u4e00\u8d77\u5c31\u597d\u4e86\u3002\u6211\u4eec\u5728\u83dc\u5e02\u573a\u4e70\u83dc\uff0c\u5728\u623f\u95f4\u91cc\u505a\u996d\uff0c\u996d\u540e\u6cbf\u7740\u8857\u8fb9\u6563\u6b65\uff0c\u4e00\u8d77\u770b\u592a\u9633\u5347\u8d77\uff0c\u592a\u9633\u843d\u4e0b\uff0c\u8fd9\u6837\u5c31\u8db3\u591f\u4e86\u3002", "url1": ""}}
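Given a response shaped like the sample above, pulling out the interesting fields might look like the following sketch. `summarizeBroadcast` is a made-up name. Treating `code: 0` as success and `duration` as seconds are assumptions read off this one sample, not documented behavior.

```javascript
// Pick a few fields out of a /broadcast/ response shaped like the sample above.
// Assumptions: `code` 0 means success, and `duration` is in seconds.
function summarizeBroadcast(responseText) {
  const payload = JSON.parse(responseText); // \uXXXX escapes are decoded here
  if (payload.code !== 0 || !payload.data) {
    throw new Error(`unexpected response code: ${payload.code}`);
  }
  const { id, title, speak, duration, url, cover } = payload.data;
  return {
    id,
    title,
    speaker: speak,
    durationSeconds: duration,
    audioUrl: url,
    cover,
  };
}
```

In the sample, `url` points at the MP3 file and `absolute_url` at the broadcast page, so `audioUrl` here is the field to use for downloading the audio.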
    #6 · bosshida (OP) · 2014-12-23 10:17:52 +08:00
    @Jat001 I've added every header I can, but nothing works. Going by the headers Firefox sends, I added the parameters one by one and still get 403 FORBIDDEN.

    <!DOCTYPE html>
    <html>
    <head>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
    <script src="./jquery-2.0.0.min.js"></script>

    <script type="text/javascript">
    function test() {
        $.ajax({
            type: "get",
            url: "http://fm.xinli001.com/broadcast/",
            dataType: "json",
            data: "pk=97701348&t=1419296643104",
            headers: {
                "Referer": "http://fm.xinli001.com/",
                "X-Requested-With": "XMLHttpRequest",
                "Accept": "*/*",
                "Accept-Encoding": "gzip, deflate",
                "Accept-Language": "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3",
                "Connection": "keep-alive",
                "Host": "fm.xinli001.com",
                "User-Agent": "Mozilla/5.0 (Windows NT 5.1; rv:34.0) Gecko/20100101 Firefox/34.0"
            },
            success: function (json) {
                alert('ok');
            },
            error: function () {
                alert('fail');
            }
        });
    }
    </script>

    <title>parseFm</title>
    </head>
    <body>
    <input type="button" value="test" onclick="test();">
    </body>
    </html>
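A likely reason the extra headers in the page above change nothing: browsers do not let page scripts set certain request headers at all, and Referer, Host, Connection, and Accept-Encoding are among them. The sketch below splits a header object into the part a browser would accept and the part it would refuse. `splitHeaders` is a made-up name, and the set is a partial copy of the forbidden request-header names in the Fetch standard, not the full list. User-Agent is left out because its handling has varied between browsers and spec versions.

```javascript
// Partial list of header names that browsers refuse to let page scripts set
// (from the forbidden request-header list in the Fetch standard; not exhaustive).
const FORBIDDEN = new Set([
  'accept-charset', 'accept-encoding', 'connection', 'content-length',
  'cookie', 'date', 'host', 'origin', 'referer', 'te', 'upgrade', 'via',
]);

// Split headers into those a browser script may set and those it may not.
function splitHeaders(headers) {
  const settable = {};
  const ignored = [];
  for (const [name, value] of Object.entries(headers)) {
    const lower = name.toLowerCase();
    if (FORBIDDEN.has(lower) || lower.startsWith('proxy-') || lower.startsWith('sec-')) {
      ignored.push(name);
    } else {
      settable[name] = value;
    }
  }
  return { settable, ignored };
}
```

Run against the headers in the page above, this puts Referer, Accept-Encoding, Connection, and Host in `ignored`. X-Requested-With can be set, but on a cross-origin request a custom header like that makes the browser send a preflight first, which the remote server would also have to allow.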
    #7 · yrdr · 2014-12-23 10:18:23 +08:00
    First, the request is cross-origin, so use JSONP.
    Second, you didn't set the HTTP headers, so the server probably rejected the request.
    #8 · bosshida (OP) · 2014-12-23 10:18:27 +08:00
    @esile How did you get it to work? Could you post your test code?
    #9 · zhangwei727 · 2014-12-23 12:10:54 +08:00
    @esile I'd like the test source too. Thanks!
    #10 · nilennoct · 2014-12-23 13:41:04 +08:00
    @bosshida Don't try to do this kind of thing in the browser. Just use Node ==
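A rough idea of what the Node route could look like, using only the built-in `http` module and the two headers named earlier in the thread. `broadcastRequest` and `fetchBroadcast` are made-up names, and setting `t` to the current time in milliseconds is an assumption based on how the sample URLs look.

```javascript
const http = require('node:http');

// Request options for one broadcast; only the two headers the thread names are sent.
// Assumption: `t` is a millisecond timestamp.
function broadcastRequest(pk) {
  return {
    host: 'fm.xinli001.com',
    path: `/broadcast/?pk=${encodeURIComponent(pk)}&t=${Date.now()}`,
    headers: {
      'X-Requested-With': 'XMLHttpRequest',
      'Referer': 'http://fm.xinli001.com/',
    },
  };
}

// Fetch and parse the JSON body; rejects on a non-200 status or invalid JSON.
function fetchBroadcast(pk) {
  return new Promise((resolve, reject) => {
    http.get(broadcastRequest(pk), (res) => {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body so the socket is released
        reject(new Error(`HTTP ${res.statusCode}`));
        return;
      }
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => {
        try { resolve(JSON.parse(body)); } catch (err) { reject(err); }
      });
    }).on('error', reject);
  });
}
```

Usage would be along the lines of `fetchBroadcast(4916186).then(console.log, console.error)`. Outside the browser there is no same-origin policy and no restriction on setting Referer, which is why this route avoids the problems above.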
    #11 · bosshida (OP) · 2014-12-23 20:55:48 +08:00
    @yrdr I tried JSONP and it still doesn't work. Both the jQuery version and the plain-JS JSONP version return 403 Forbidden.
    jQuery:
    <script type="text/javascript">
    function haha() {
        $.ajax({
            type: "get",
            async: false,
            url: "http://fm.xinli001.com/broadcast/",
            data: "pk=97701348&t=1419336731430",
            dataType: "jsonp",
            jsonpCallback: "fmHandler",
            headers: {
                "Referer": "http://fm.xinli001.com/",
                "X-Requested-With": "XMLHttpRequest",
                "Accept": "*/*",
                "Accept-Encoding": "gzip, deflate",
                "Accept-Language": "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3",
                "Connection": "keep-alive",
                "Host": "fm.xinli001.com",
                "User-Agent": "Mozilla/5.0 (Windows NT 5.1; rv:34.0) Gecko/20100101 Firefox/34.0"
            },
            success: function (json) {
                console.log(json);
                alert('ok');
            },
            error: function () {
                alert('fail');
            }
        });
    }
    </script>

    Plain JS:
    <script type="text/javascript">
    var myFmHandler = function(data){
    alert('ok');
    };
    var url = "http://fm.xinli001.com/broadcast/?pk=97701348&t=1419336731430&callback=myFmHandler";
    var script = document.createElement('script');
    script.setAttribute('src', url);
    document.getElementsByTagName('head')[0].appendChild(script);
    </script>

    I've never used the Node.js mentioned above, so I'll learn it as I go...
    #12 · esile · 2014-12-24 11:01:38 +08:00
    @bosshida @zhangwei727 Does it really need to be this complicated? [original: 负责]
    <?php
    function fetchpage($url, $referer)
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        // The two headers this thread found necessary: X-Requested-With and Referer.
        curl_setopt($ch, CURLOPT_HTTPHEADER, array('X-Requested-With: XMLHttpRequest'));
        curl_setopt($ch, CURLOPT_REFERER, $referer);
        curl_setopt($ch, CURLOPT_HEADER, false);        // body only, no response headers
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6; .NET CLR 2.0.50727; CIBA)");
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);   // connection timeout in seconds
        $temp = curl_exec($ch);                         // false on failure
        curl_close($ch);
        return $temp;
    }

    var_dump(fetchpage('http://fm.xinli001.com/broadcast/?pk=4916186&t=1419258885474', 'http://fm.xinli001.com/'));
    #13 · esile · 2014-12-24 11:02:21 +08:00
    负责 should read 复杂 ("complicated"). o(︶︿︶)o Sigh, pinyin input strikes again.