Writing a Web Crawler with Node.js

Introduction

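A while ago I was a bit bored and wanted to scrape some pictures of girls to look at. I happened to know a little Node, so I looked into writing a crawler with it.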

Crawling 唯美图库 (umei.cc)

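The rough idea: use the request module together with streams to save the images, and use the Crawler module to extract the src of each jpg from the page. (The code below is inefficient, because at the time I did not understand how the request module actually works. I only got a better grasp of it later, while writing the Baidu Images crawler. The 唯美图库 server was so slow that I never went back to fix this version.)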

// 唯美图库 crawler

var Crawler = require('crawler');
var fs = require('fs');
var request = require('request');

var jsq = 1;       // number of images saved so far
var newhref = '';  // random fallback link taken from the current page
var p;
// File names (without the directory) already present in the output folder.
var resarr = fs.readdirSync('./result3');

var c = new Crawler({
    maxConnections: 1,
    callback: function (error, res, done) {
        if (error) {
            console.log(error);
        } else {
            var $ = res.$;
            var node = $('.ImageBody a')[0];
            if (node !== undefined) {
                var img = node.children[0].attribs;
                var src = img.src;
                // Pick one of the first nine links at random as the fallback page.
                p = Math.floor(Math.random() * 9);
                newhref = $('.ajax_ul a')[p].attribs.href;
                console.log(p, $('.ajax_ul a')[p].attribs);
                var name = img.alt + '.jpg';
                var jpg = './result3/' + name;
                // Compare bare file names: readdirSync returns names without the directory.
                if (resarr.indexOf(name) === -1) {
                    resarr.push(name);
                    request(src)
                        .on('error', function (err) {
                            // Only log here. A download that fails before any data arrives
                            // leaves an empty file, which the size check below retries.
                            console.log(err);
                        })
                        .pipe(fs.createWriteStream(jpg));
                    // Crude completion check: look at the file size one second later.
                    setTimeout(function () {
                        fs.stat(jpg, function (err, stats) {
                            if (err) {
                                console.log(err);
                            } else if (stats.size === 0) {
                                console.log('size0');
                                setTimeout(function () {
                                    c.queue(newhref);
                                }, 4000);
                            } else {
                                console.log(jsq, src, jpg);
                                c.queue($('.BlueF a')[1].attribs.href);
                                jsq += 1;
                            }
                        });
                    }, 1000);
                } else {
                    // Already downloaded: move on to the fallback page.
                    c.queue(newhref);
                }
            } else {
                console.log('end');
                c.queue(newhref);
            }
        }
        done();
    }
});

c.queue('http://www.umei.cc/bizhitupian/meinvbizhi/190600.htm');
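
One likely source of the inefficiency mentioned earlier is the fixed one-second timer followed by a file-size check. As a rough sketch (not the code the post actually used), the write stream's finish event can signal completion instead. The helper name saveStream is hypothetical, and an in-memory stream stands in for request(src) so the example runs without network access.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const { Readable } = require('stream');

// Pipe `source` into a file and resolve only once the file is fully written.
// `saveStream` is a hypothetical helper name, not part of any library.
function saveStream(source, filePath) {
    return new Promise(function (resolve, reject) {
        const out = fs.createWriteStream(filePath);
        source.on('error', reject);   // read or network failure
        out.on('error', reject);      // disk failure
        out.on('finish', function () {
            resolve(filePath);        // all data has been flushed
        });
        source.pipe(out);
    });
}

// Demo: an in-memory stream stands in for request(src).
const demoPath = path.join(os.tmpdir(), 'savestream-demo.txt');
saveStream(Readable.from(['hello ', 'world']), demoPath).then(function (saved) {
    console.log('saved', saved, fs.statSync(saved).size, 'bytes');
});
```

With a helper like this, the next page could be queued from the then callback as soon as the image is on disk, rather than after a guessed delay.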

Crawling Baidu Images

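The 唯美图库 server is painfully slow, and I was too embarrassed to crawl that kind of racy site in the library, so I switched to Baidu Images. Baidu Images loads new pictures via ajax as you scroll down, so the previous approach does not work. I inspected Baidu's ajax requests and worked out how the URL is put together.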

https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&is=&fp=result&cl=2&lm=-1&ie=utf-8&oe=utf-8&word='图片'&cg=girl&pn=120&rn=60

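In the URL above, word is the search keyword, pn is the index of the first item to return, and rn is the number of images in the JSON returned by this ajax request (the maximum is 60; asking for more than 60 still returns 60 items).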
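
To make the parameter roles concrete, here is a small sketch that assembles such a URL and lists the pn offsets needed to page through a result set. The helper names buildBaiduUrl and pageOffsets are hypothetical. The fixed parameters are copied from the URL above, and two things are assumptions: that the keyword is passed without the surrounding quotes, and that pn starts at 0.

```javascript
// Build the acjson request URL for one page of results.
// `buildBaiduUrl` is a hypothetical helper; fixed values are copied from the URL above.
function buildBaiduUrl(word, pn, rn) {
    const params = new URLSearchParams({
        tn: 'resultjson_com',
        ipn: 'rj',
        is: '',
        fp: 'result',
        cl: '2',
        lm: '-1',
        ie: 'utf-8',
        oe: 'utf-8',
        word: word,                       // search keyword (percent-encoded automatically)
        cg: 'girl',
        pn: String(pn),                   // index of the first item
        rn: String(Math.min(rn, 60))      // the server caps this at 60 anyway
    });
    return 'https://image.baidu.com/search/acjson?' + params.toString();
}

// List the pn values needed to cover `total` images, `rn` at a time.
// Assumes pn is zero-based.
function pageOffsets(total, rn) {
    const offsets = [];
    for (let pn = 0; pn < total; pn += rn) {
        offsets.push(pn);
    }
    return offsets;
}

console.log(buildBaiduUrl('图片', 120, 60));
console.log(pageOffsets(180, 60));
```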

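The earlier crawler was effectively synchronous: it downloaded one image before starting the next. The Baidu Images crawler is asynchronous. By my rough estimate, 1,000 images take about 10 to 15 seconds, and around 95% of them download completely. That depends, of course, on your network speed and on the speed of the servers hosting each jpg's src.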
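
The difference between the two styles, and the way a completion rate like the 95% figure can be measured, might look roughly like this. It is only an illustrative sketch, not the original Baidu crawler: downloadAll and fakeDownload are hypothetical names, and the fake downloader stands in for a real function that fetches one image and returns a Promise.

```javascript
// Start every download at once and report how many succeeded.
// `downloadOne` is a hypothetical function: url -> Promise.
async function downloadAll(urls, downloadOne) {
    const results = await Promise.allSettled(urls.map(function (url) {
        return downloadOne(url);
    }));
    const ok = results.filter(function (r) {
        return r.status === 'fulfilled';
    }).length;
    return {
        total: urls.length,
        ok: ok,
        rate: urls.length ? ok / urls.length : 0   // completion rate
    };
}

// Fake downloader for the demo: fails for any URL ending in "bad.jpg".
function fakeDownload(url) {
    return url.endsWith('bad.jpg')
        ? Promise.reject(new Error('failed: ' + url))
        : Promise.resolve(url);
}

downloadAll(['a.jpg', 'b.jpg', 'bad.jpg', 'c.jpg'], fakeDownload).then(function (summary) {
    console.log(summary);
});
```

Promise.allSettled waits for every download to either succeed or fail, so one broken image does not abort the rest.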

Summary

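Between a Node crawler and a Python crawler, Python probably comes out ahead; it is concise and efficient, after all. I used Node this time purely because I wanted to play with it. Exams are coming up, so it is time to study hard and make progress every day.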
