Web scraping with Scrapy

Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of useful applications, like data mining, information processing, or historical archival.
So you need to extract some information from a website, but the website doesn’t provide any API or mechanism to access that info programmatically. Scrapy can help you extract that information.

  • To write a web crawler, we need only basic knowledge of XPath and Python. That's enough!

Installation

Install using pip

pip install Scrapy

Create a project

scrapy startproject name

This will create a name directory with the following files:

name/
    scrapy.cfg
    name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

Defining the Items you will extract

In items.py, the Item class that holds our scraped data is declared by subclassing scrapy.Item and defining its attributes as scrapy.Field objects, much like you would in an ORM.

Here’s my Item definition:

import scrapy

class CrawlItem(scrapy.Item):
    des = scrapy.Field()         # image descriptions (taken from the title attribute)
    image_urls = scrapy.Field()  # image URLs for the ImagesPipeline to download
    images = scrapy.Field()      # filled in by the ImagesPipeline with download results
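
Items behave like dictionaries, so a quick sanity check in a Python shell might look like this (a minimal sketch; the values are just placeholders):

from crawl.items import CrawlItem

item = CrawlItem()
item['des'] = ["a summer look"]                        # placeholder description
item['image_urls'] = ["http://example.com/look.jpg"]   # placeholder URL
print(dict(item))  # prints the item as a plain dict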

Writing a spider to crawl a site and extract Items

from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from crawl.items import CrawlItem


class CrawlSpider(CrawlSpider):
    name = "crawl"
    allowed_domains = ["polyvore.com"]
    start_urls = ["http://www.polyvore.com/?filter=fashion"]
    # Follow every link within the allowed domain and hand the response to parse_crawl
    rules = [Rule(LinkExtractor(allow=['/.*']), callback='parse_crawl')]

    def parse_crawl(self, response):
        item = CrawlItem()
        # Image titles become the descriptions
        item['des'] = response.xpath("//img[@class='img_size_l']/@title").extract()
        # Image src attributes are the URLs the ImagesPipeline will download
        item['image_urls'] = response.xpath("//img[@class='img_size_l']/@src").extract()
        return item
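
(Note: in recent Scrapy versions the scrapy.contrib imports above have moved; CrawlSpider and Rule live in scrapy.spiders and LinkExtractor in scrapy.linkextractors.) Before hard-coding XPath expressions in the spider, it is handy to try them out interactively with scrapy shell. A quick session might look like this, using the same selectors as above:

scrapy shell "http://www.polyvore.com/?filter=fashion"
>>> response.xpath("//img[@class='img_size_l']/@title").extract()
>>> response.xpath("//img[@class='img_size_l']/@src").extract()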

Start crawling…

scrapy crawl crawl

Writing an Item Pipeline to store the extracted Items

If we just want to download the images locally, add these two lines to settings.py:

ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
IMAGES_STORE = "img"

Result: all images are stored under the name/img/full directory ;)
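
For reference, the ImagesPipeline needs Pillow installed; it downloads every URL listed in image_urls and records the results in the images field, where each entry holds the source url, the path relative to IMAGES_STORE (hence the full/ subdirectory) and a checksum. A rough sketch of what one scraped item ends up holding (values are placeholders):

item = {
    'des': ['a summer look'],                        # placeholder
    'image_urls': ['http://example.com/look.jpg'],   # placeholder
    'images': [{'url': 'http://example.com/look.jpg',
                'path': 'full/<sha1-of-url>.jpg',
                'checksum': '<md5-of-image>'}],
}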

If we need to write items to a JSON file, we can use the Feed exports:

scrapy crawl crawl -o items.json
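
The same feed can also be configured once in settings.py instead of on the command line (older-style feed settings shown here; recent Scrapy versions use a FEEDS dict instead):

FEED_FORMAT = "json"
FEED_URI = "items.json"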

If a database backend is required:

import MySQLdb


class MySQLStorePipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect(host='host', user='user', passwd='passwd',
                                    db='dbname', charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            # des and image_urls are lists of strings, so join them before storing
            self.cursor.execute("""INSERT INTO test (des, image_urls)
                                   VALUES (%s, %s)""",
                                (", ".join(item['des']).encode('utf-8'),
                                 ", ".join(item['image_urls']).encode('utf-8')))
            self.conn.commit()
        except MySQLdb.Error as e:
            print("Error %d: %s" % (e.args[0], e.args[1]))
        return item
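
To actually run this pipeline, it has to be registered in settings.py as well. This assumes it lives in the project's pipelines.py (i.e. the crawl.pipelines module) and that the test table with des and image_urls text columns already exists in the database:

ITEM_PIPELINES = {
    'crawl.pipelines.MySQLStorePipeline': 300,
    # keep the ImagesPipeline too if the images should still be downloaded:
    # 'scrapy.contrib.pipeline.images.ImagesPipeline': 1,
}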

Recently falling for a bit of METAL…ヾ(○´▽`○)ノ