Scrapy is an application framework for crawling web sites and extracting structured data. It can be used for a wide range of applications, such as data mining, information processing, or historical archiving. Say you need to extract some information from a website, but the site doesn't provide an API or any other mechanism to access that data programmatically. Scrapy can help you extract it.
To write a web crawler, basic knowledge of XPath and Python is all we need!
Installation
Install using pip
pip install Scrapy
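You can check that the install worked by printing the installed version:

scrapy version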
Create project
scrapy startproject name
This will create a name directory with the following files:
name/
    scrapy.cfg
    name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
Defining the Items you will extract
In items.py, the item class that holds our data is declared by subclassing scrapy.Item and defining its attributes as scrapy.Field objects, much like you would in an ORM.
Here’s my Item definition:
import scrapy

class CrawlItem(scrapy.Item):
    des = scrapy.Field()         # description text scraped from the page
    image_urls = scrapy.Field()  # URLs of the images to download
    images = scrapy.Field()      # filled in by the images pipeline after download
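Once defined, an item behaves like a dict (the values below are just placeholders):

item = CrawlItem()
item['des'] = 'a sample description'
item['image_urls'] = ['http://example.com/sample.jpg']

The field names image_urls and images follow the convention used by Scrapy's built-in images pipeline, which downloads everything listed in image_urls and records the results in images.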
Writing a spider to crawl a site and extract Items
import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor
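Here is a minimal sketch of how such a CrawlSpider might look. The spider name, the example.com URLs, the allow patterns, and the XPath expressions are placeholders I've assumed for illustration; adapt them to your target site.

from name.items import CrawlItem

class ImageSpider(CrawlSpider):
    name = 'images'
    allowed_domains = ['example.com']   # placeholder domain
    start_urls = ['http://example.com/']

    rules = (
        # follow pagination links without extracting anything
        Rule(LinkExtractor(allow=r'/page/\d+'), follow=True),
        # send detail pages to parse_item for extraction
        Rule(LinkExtractor(allow=r'/photo/\d+'), callback='parse_item'),
    )

    def parse_item(self, response):
        # pull the description and image URLs with XPath selectors;
        # urljoin makes relative src attributes absolute for the pipeline
        item = CrawlItem()
        item['des'] = response.xpath('//h1/text()').get()
        item['image_urls'] = [response.urljoin(u) for u in response.xpath('//img/@src').getall()]
        return item

Run the spider from the project directory with scrapy crawl images. To actually download the images, enable Scrapy's images pipeline in settings.py:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/store/images'  # local directory for downloaded images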