When I scraped Lianjia second-hand housing listings earlier, I mentioned that the next-page link is loaded dynamically by JavaScript, so a plain request never sees it. This is exactly the kind of page Splash is for: it renders the JavaScript and hands the finished HTML back to the spider.

First install Docker and start Splash (docker pull scrapinghub/splash, then docker run -p 8050:8050 scrapinghub/splash). Open http://localhost:8050/, enter the target URL in the web console, and inspect the rendered source: the page links are now present, so Splash can drive the pagination for us.
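
You can run the same check from Python against Splash's standard render.html endpoint; a quick sanity-check sketch (it assumes the requests package is installed):

import requests

# Ask Splash to render the listing page and return the final HTML.
resp = requests.get("http://localhost:8050/render.html",
                    params={"url": "https://yt.lianjia.com/ershoufang/",
                            "wait": 3})
# The "下一页" (next page) link only exists after JS rendering.
print("下一页" in resp.text)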

First install scrapy-splash (pip install scrapy-splash) and update the project's settings.py:

SPLASH_URL = "http://localhost:8050/"   # address of the running Splash instance
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    # 'lianjia.middlewares.LianjiaSpiderMiddleware': 543,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 728,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 823,
    # 'lianjia.middlewares.LianjiaDownloaderMiddleware': 543,
}

With those settings in place, the spider (e.g. spiders/lianjia_spider.py) looks like this:
from scrapy.spiders import Spider
from scrapy_splash import SplashRequest

from lianjia.items import LianjiaItem

lua_script = """
function main(splash, args)
    -- load the page, give the JS time to render the pagination,
    -- then return the final HTML to Scrapy
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    return {
        html = splash:html()
    }
end
"""

class LianjiaSpider(Spider):
    name = "lianjia"
    url = 'https://yt.lianjia.com/ershoufang/'

    def start_requests(self):
        # Route the request through Splash's execute endpoint, running
        # lua_script; 'images' and 'wait' are exposed to the script as args.*.
        yield SplashRequest(self.url,
                            callback=self.parse,
                            endpoint='execute',
                            args={
                                'lua_source': lua_script,
                                'images': 0,
                                'wait': 3
                            },
                            cache_args=['lua_source'])   # don't resend the script each time

    def parse(self, response):
        # Each listing is an <li> under the sellListContent list.
        list_selector = response.xpath("//ul[@class='sellListContent']/li")
        for one_selector in list_selector:
            try:
                name = one_selector.xpath("div[@class='info clear']/div[@class='title']/a/text()").extract()[0]
                price = one_selector.xpath("div[@class='info clear']/div[@class='priceInfo']/div[1]/span/text()").extract()[0]
            except IndexError:
                # skip ad slots and listings missing a title or price
                continue
            item = LianjiaItem()   # fresh item per listing
            item['name'] = name
            item['price'] = price
            yield item
        # The last <a> in the page box reads "下一页" (next page) until
        # we reach the final page.
        next_text = response.xpath("//div[@class='page-box house-lst-page-box']/a[last()]/text()").extract_first()
        if next_text == "下一页":
            next_url = response.xpath("//div[@class='page-box house-lst-page-box']/a[last()]/@href").extract_first()
            next_url = "https://yt.lianjia.com" + next_url
            yield SplashRequest(next_url,
                                callback=self.parse,
                                endpoint='execute',
                                args={
                                    'lua_source': lua_script,
                                    'images': 0,
                                    'wait': 3
                                },
                                cache_args=['lua_source'])
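
The spider imports LianjiaItem from lianjia.items; that file isn't shown here, but a minimal version consistent with the two fields used above would be:

import scrapy

class LianjiaItem(scrapy.Item):
    name = scrapy.Field()    # listing title
    price = scrapy.Field()   # total price shown in the list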

Every request that should go through Splash must be a SplashRequest rather than a plain scrapy.Request; parse() then extracts the listings from the rendered page and keeps following the "下一页" link, so the spider pages through the whole listing by itself. Run it as usual with scrapy crawl lianjia.
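
As an aside, endpoint='execute' with a Lua script is only needed when you want this much control over rendering. scrapy-splash defaults to the render.html endpoint, so inside start_requests a minimal request without a custom script could be simply (a sketch; the wait value is a guess to tune per page):

yield SplashRequest(self.url,
                    callback=self.parse,
                    args={'wait': 3})   # default endpoint 'render.html'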
