Scrapy is an application framework for crawling web sites and extracting structured data. It can be used for a wide range of applications, such as data mining, information processing, or historical archiving. Say you need to extract some information from a website, but the site doesn't provide an API or any other mechanism to access that data programmatically. Scrapy can help you extract it.
To write a web crawler, basic knowledge of XPath and Python is all we need!
Installation
Install using pip
pip install Scrapy
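You can check that the install worked by printing the installed version:

scrapy version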
Create project
scrapy startproject name
This will create a name directory with the following files:
name/
    scrapy.cfg
    name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
Defining the Items you will extract
In items.py, the item class that holds our data is declared by subclassing scrapy.Item and defining its attributes as scrapy.Field objects, much like you would in an ORM.
Here’s my Item definition:
import scrapy

class CrawlItem(scrapy.Item):
    des = scrapy.Field()         # description text scraped from the page
    image_urls = scrapy.Field()  # URLs of the images to download
    images = scrapy.Field()      # filled in by the images pipeline after download
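Once defined, an item behaves like a dict (the values below are just placeholders):

item = CrawlItem()
item['des'] = 'a sample description'
item['image_urls'] = ['http://example.com/sample.jpg']

The field names image_urls and images follow the convention used by Scrapy's built-in images pipeline, which downloads everything listed in image_urls and records the results in images.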
Writing a spider to crawl a site and extract Items
import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor
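Here is a minimal sketch of how such a CrawlSpider might look. The spider name, the example.com URLs, the allow patterns, and the XPath expressions are placeholders I've assumed for illustration; adapt them to your target site.

from name.items import CrawlItem

class ImageSpider(CrawlSpider):
    name = 'images'
    allowed_domains = ['example.com']   # placeholder domain
    start_urls = ['http://example.com/']

    rules = (
        # follow pagination links without extracting anything
        Rule(LinkExtractor(allow=r'/page/\d+'), follow=True),
        # send detail pages to parse_item for extraction
        Rule(LinkExtractor(allow=r'/photo/\d+'), callback='parse_item'),
    )

    def parse_item(self, response):
        # pull the description and image URLs with XPath selectors;
        # urljoin makes relative src attributes absolute for the pipeline
        item = CrawlItem()
        item['des'] = response.xpath('//h1/text()').get()
        item['image_urls'] = [response.urljoin(u) for u in response.xpath('//img/@src').getall()]
        return item

Run the spider from the project directory with scrapy crawl images. To actually download the images, enable Scrapy's images pipeline in settings.py:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/store/images'  # local directory for downloaded images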