This is another practice case: scraping second-hand housing listings from Lianjia. The fields to collect are: name, area, floor plan, total price, unit price, renovation status, orientation, and whether the building has an elevator.

One problem remains unsolved: I couldn't extract the "next page" link from the listing page. It seems to be rendered via JSON/JavaScript, so a plain XPath query doesn't find it. The workaround here is simply to request each list page by its page number. Due to time constraints I didn't take it further, so the number of pages crawled is a fixed value that you can change yourself before running.
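The spider file below imports LianjiaErhomeItem from lianjia_erhome.items. That items.py isn't shown in this post; based on the fields filled in later, it presumably looks something like this (a sketch of the assumed file, not the original):

# items.py: sketch of the assumed item definition, one Field per scraped value
import scrapy


class LianjiaErhomeItem(scrapy.Item):
    name = scrapy.Field()         # listing title
    type = scrapy.Field()         # floor plan
    area = scrapy.Field()         # built area
    orientation = scrapy.Field()  # orientation
    renovation = scrapy.Field()   # renovation status
    elevator = scrapy.Field()     # elevator or not
    total = scrapy.Field()        # total price
    unit = scrapy.Field()         # unit price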

# coding: utf-8

from scrapy import Request
from scrapy.spiders import Spider
from lianjia_erhome.items import LianjiaErhomeItem


class ErhomeSpider(Spider):
    name = "erhome"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.url_num = 1  # current list-page number

    def start_requests(self):
        # Start from the first list page; the response goes to parse() by default
        url = "https://su.lianjia.com/ershoufang/pg1/"
        yield Request(url=url)


    def parse(self, response):
        # Each listing title on the list page links to a detail page
        info_url = response.xpath("//div[@class='title']")
        for one_url in info_url:
            try:
                link = one_url.xpath("a/@href").extract()[0]
                yield Request(url=link, callback=self.info_parse)
            except Exception:
                # Some title divs carry no link; skip them and keep going
                continue

        # Request the next list page up to a fixed count (change 10 to crawl more).
        # Throttling is better done with DOWNLOAD_DELAY in settings.py than with
        # time.sleep(), which would block Scrapy's event loop.
        self.url_num += 1
        if self.url_num <= 10:
            next_url = "https://su.lianjia.com/ershoufang/pg%i/" % self.url_num
            yield Request(url=next_url)
        # info_page = response.xpath("//div[@class='page-box fr']/div/a[last()]/text()").extract()[0]
        # print(info_page)




    def info_parse(self, response):
        # A detail page may lack some fields; if any XPath comes back empty,
        # .extract()[0] raises and the whole listing is skipped.
        try:
            name = response.xpath("//div[@class='title']/h1/text()").extract()[0]
            total = response.xpath("//span[@class='total']/text()").extract()[0]
            unit = response.xpath("//span[@class='unitPriceValue']/text()").extract()[0]
            type = response.xpath("//div[@class='content']/ul/li[1]/text()").extract()[0]  # floor plan
            area = response.xpath("//div[@class='content']/ul/li[3]/text()").extract()[0]  # built area
            orientation = response.xpath("//div[@class='content']/ul/li[7]/text()").extract()[0]  # orientation
            renovation = response.xpath("//div[@class='content']/ul/li[9]/text()").extract()[0]  # renovation status
            elevator = response.xpath("//div[@class='content']/ul/li[11]/text()").extract()[0]  # elevator or not

            home_info = LianjiaErhomeItem()
            home_info['name'] = name
            home_info['type'] = type
            home_info['area'] = area
            home_info['orientation'] = orientation
            home_info['renovation'] = renovation
            home_info['elevator'] = elevator
            home_info['total'] = total
            home_info['unit'] = unit
            yield home_info
        except Exception:
            pass

The code above is the spider file. The key point is the try block: occasionally a single field can't be extracted for one reason or another, which raises an exception, so wrapping the extraction in try/except lets the spider skip that listing and keep going.
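As for the pagination problem mentioned at the top: the commented-out XPath in parse() was probing the page-box element. If the total page count is embedded there as a JSON string in a page-data attribute (an assumption about Lianjia's markup, not verified in this post), the fixed page count could be replaced with something along these lines:

# Sketch of an alternative parse() for ErhomeSpider, assuming the pagination
# div exposes the page count as a JSON string in a page-data attribute, e.g.
# {"totalPage":100,"curPage":1}. The XPath below is an assumption; adjust it
# to the real markup.
import json
from scrapy import Request

def parse(self, response):
    # Detail-page requests, same as before
    for one_url in response.xpath("//div[@class='title']"):
        link = one_url.xpath("a/@href").extract_first()
        if link:
            yield Request(url=link, callback=self.info_parse)

    # Read the total page count from the pagination div's page-data attribute
    page_data = response.xpath(
        "//div[contains(@class, 'page-box')]/@page-data").extract_first()
    # Only schedule the remaining pages once, from the first list page
    if page_data and response.url.rstrip("/").endswith("pg1"):
        total_page = json.loads(page_data).get("totalPage", 1)
        for page in range(2, total_page + 1):
            yield Request("https://su.lianjia.com/ershoufang/pg%d/" % page)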

from scrapy.exceptions import DropItem


class LianjiaErhomePipeline:  # basic data cleaning
    def process_item(self, item, spider):
        if item["elevator"] == "暂无数据":
            # Drop items that lack elevator data
            raise DropItem("No elevator data, dropping item: %s" % item)
        return item


class CSVPipeline(object):
    index = 0    # tracks whether the header row has been written
    file = None  # file object

    # Open the output file when the spider starts
    def open_spider(self, spider):
        # Open in append mode
        self.file = open("home.csv", "a", encoding="utf-8")

    # Write each item as one CSV row
    def process_item(self, item, spider):
        # Write the column names as the first row
        if self.index == 0:
            column_name = "name,type,area,orientation,renovation,elevator,total,unit\n"
            self.file.write(column_name)
            self.index = 1
        # Join the item fields with commas; the backslashes continue the
        # string across lines, and the row must end with a newline
        home_str = item['name'] + "," + \
                   item["type"] + "," + \
                   item["area"] + "," + \
                   item["orientation"] + "," + \
                   item["renovation"] + "," + \
                   item["elevator"] + "," + \
                   item["total"] + "," + \
                   item["unit"] + "\n"
        # Write the row to the file
        self.file.write(home_str)
        return item

    # Close the file when the spider finishes
    def close_spider(self, spider):
        # Close the file
        self.file.close()
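Joining the fields by hand works for this data, but it produces no quoting, so any field that happens to contain a comma would shift the columns. A sketch of the same pipeline written with Python's standard csv module (same output file and column order) would look like this:

# Alternative sketch of CSVPipeline using the csv module, which handles
# quoting and escaping automatically.
import csv


class CSVPipeline(object):
    def open_spider(self, spider):
        # newline="" prevents blank lines between rows on Windows
        self.file = open("home.csv", "a", encoding="utf-8", newline="")
        self.writer = csv.writer(self.file)
        self.writer.writerow(
            ["name", "type", "area", "orientation", "renovation",
             "elevator", "total", "unit"])

    def process_item(self, item, spider):
        self.writer.writerow(
            [item["name"], item["type"], item["area"], item["orientation"],
             item["renovation"], item["elevator"], item["total"], item["unit"]])
        return item

    def close_spider(self, spider):
        self.file.close()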

Next is the pipelines file, which processes the scraped data: the LianjiaErhomePipeline class filters out incomplete items, and CSVPipeline writes what remains to a file.
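For either pipeline to run, it has to be enabled in settings.py, which this post doesn't show. A sketch of the relevant entries (the priority numbers are just one reasonable ordering: drop incomplete items before the CSV writer sees them):

# settings.py (sketch): lower numbers run first, so items missing elevator
# data are dropped before anything is written to the CSV.
ITEM_PIPELINES = {
    "lianjia_erhome.pipelines.LianjiaErhomePipeline": 300,
    "lianjia_erhome.pipelines.CSVPipeline": 400,
}
# Throttle requests here instead of calling time.sleep() in the spider
DOWNLOAD_DELAY = 1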

Finally, create a start.py file in the project root (next to scrapy.cfg) to launch the crawler, so it can be run with a right-click Run in the IDE instead of opening a cmd window.

from scrapy import cmdline

cmdline.execute('scrapy crawl erhome'.split())