Scrapy: Hello World


Creating the Project

scrapy startproject riguz

The generated project looks like this:

E:.
│  scrapy.cfg
│
└─riguz
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    └─spiders
            __init__.py
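
Of the generated files, items.py is where structured data containers can be declared for later steps in the pipeline. A minimal sketch of such an Item, assuming we eventually want to capture quote text and author (these field names are illustrative assumptions, not something the generated stub defines):

import scrapy

# Illustrative only: a container with two fields for scraped data.
class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()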

Creating the Spider

Create a new file under the spiders folder, for example hello_spider.py:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "helloworld"

    def start_requests(self):
        # Issue one request per start URL; each response is passed to parse().
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Save each downloaded page as quotes-<page>.html.
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
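
For a simple case like this, start_requests can be omitted entirely: Scrapy's default implementation builds requests from a start_urls class attribute and sends each response to parse. An equivalent, shorter version of the same spider:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "helloworld"
    # With start_urls defined, the default start_requests()
    # generates a Request for each URL and calls parse() on the response.
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)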

Running the Spider

cd riguz
scrapy crawl helloworld -s LOG_FILE=scrapy.log

Running it produces two downloaded pages:

│  quotes-1.html
│  quotes-2.html
│  scrapy.cfg
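
Besides the scrapy CLI, the spider can also be started from a plain Python script via CrawlerProcess. A minimal sketch, assuming the script sits in the project root and the spider lives at riguz/spiders/hello_spider.py as created above:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from riguz.spiders.hello_spider import QuotesSpider

# Load the project's settings.py and run the spider in-process;
# start() blocks until the crawl finishes.
process = CrawlerProcess(get_project_settings())
process.crawl(QuotesSpider)
process.start()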