The steps to build a simple project are well described in the Scrapy tutorial; here I am going to expand on what's explained there to include submitting forms, Django integration and testing.
If you worked through the tutorial project, you already have an understanding of the three key concepts you need to get started:
- Spiders: This is where you navigate the pages and look for the information you want to acquire. You will need some basic knowledge of CSS selectors and/or XPath to get to the information you want. There are easy ways to submit forms (for login, search, etc.), follow links and so on. In the end, when you get to the data you want to keep, you store it in Items.
- Items: Items can be serialized in many formats, saved directly to a database, linked to a Django model to store data via the Django ORM, etc. Before that, they can be sent through one or more pipelines for processing.
- Pipelines: Here is where you do all the validation, data clean-up, etc.
Scraping a complex page
Let's say we want to scrape the Rapipago page at http://www.rapipago.com.ar/rapipagoWeb/index.htm.
It lists the locations of service and tax payment offices around my country.
You can search either by keyword or by province and city using the search form on the right of the page. The form comes with the provinces loaded by default, but you cannot select a city until you have selected a province. As we cannot execute JavaScript with Scrapy, we need to split the process into four steps inside the spider:
1. Go to the main page http://www.rapipago.com.ar/rapipagoWeb/index.htm and parse the response looking for the list of provinces.
2. For each province in the select element, submit the form simulating the selection.
3. Parse the response to each request from step 2 to find the list of cities associated with each province, and submit the search form again for each (province, city) pair.
4. Parse each response from step 3 to obtain the items. In this case, the information we want is the name, address, city and province of each location. Yield each item for further processing through the pipeline.
Django integration
You will have to build both a Scrapy project and a Django project. The out-of-the-box integration uses DjangoItem to store data via the Django ORM.
Let's say you are scraping an ATM directory to later build a Django application that displays store locations on a map.
- BloggerWorkspace
  - storedirectoryscraper (Scrapy project)
  - mappingsite (Django project)
    - mappingsite
    - storemapapp
In the example above, you can see the workspace structure with the Scrapy project and the Django project.
This was easily achieved by doing:
$ mkvirtualenv BloggerWorkspace
$ mkproject BloggerWorkspace
When creating the environments, bear in mind that Scrapy currently does not support Python 3, so you'll need to use the latest 2.7 version. You could use different environments and Python versions for the Scrapy and Django projects; I am using the same one here for simplicity.
Update: Since May 2016, Scrapy 1.1 supports Python 3 on non-Windows environments with some limitations; see the release notes.
$ pip install django
$ django-admin.py startproject mappingsite
$ cd mappingsite
$ django-admin.py startapp storemapapp
$ cd ..
$ pip install Scrapy
$ scrapy startproject storedirectoryscraper
Doing the actual work
So, now that we have our projects set up, let's see what the code would look like.
Following the Scrapy tutorial, we need to create our item. This is going to be a DjangoItem, so we will first go to our Django application and add a model to models.py inside our brand new storemapapp app. We also need to add the app to INSTALLED_APPS in our settings.py module.
mappingsite/storemapapp/models.py
from django.db import models


class Office(models.Model):
    city = models.CharField(max_length=100)
    province = models.CharField(max_length=100)
    address = models.CharField(max_length=100)
    name = models.CharField(max_length=100)
mappingsite/mappingsite/settings.py
....
INSTALLED_APPS = (
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'storemapapp',
)
....
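One step worth calling out here: before the pipeline shown later can call item.save(), the table for Office has to exist in the database. On Django 1.7+ that means running the migrations from the Django project directory; the commands below are a minimal sketch of that step (older Django versions used syncdb instead):

$ python manage.py makemigrations storemapapp
$ python manage.py migrate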
We can then go back to our scraper application and create our Scrapy item:
from scrapy.contrib.djangoitem import DjangoItem
from storemapapp.models import Office


class OfficeItem(DjangoItem):
    django_model = Office
Update: As of Scrapy 1.0.0, DjangoItem has been relocated to its own package. You need to pip install scrapy-djangoitem and import DjangoItem from scrapy_djangoitem instead.
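With that package installed, the item definition only changes in its import line; a minimal sketch for Scrapy 1.0+ looks like this:

from scrapy_djangoitem import DjangoItem
from storemapapp.models import Office


class OfficeItem(DjangoItem):
    django_model = Office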
We will also need to tell Scrapy how to find our Django application so that it can import our Office model. At the top of our scraper's settings.py file, add these lines to make our Django app available:
storedirectoryscraper/storedirectoryscraper/settings.py
import sys
import os
sys.path.append('<abs path to BloggerWorkspace/mappingsite>')
os.environ['DJANGO_SETTINGS_MODULE'] = 'mappingsite.settings'
Update: As of Django 1.7, app loading changed and you need to explicitly call the setup method. You can do this by also adding the following (thanks to Romerito Campos):
import django
django.setup()
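Putting both pieces together, on recent Django versions the top of the scraper's settings.py would look roughly like this (keep your own absolute path in place of the placeholder):

import os
import sys

sys.path.append('<abs path to BloggerWorkspace/mappingsite>')
os.environ['DJANGO_SETTINGS_MODULE'] = 'mappingsite.settings'

import django
django.setup()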
With those settings in place we should now be able to start the Scrapy shell and import our Django model with
from storemapapp.models import Office
to check it works.
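For instance, a quick sanity check from the Scrapy project directory could look like this (the ORM query is just an illustration and assumes the database has already been migrated):

$ scrapy shell
>>> from storemapapp.models import Office
>>> Office.objects.count()   # any simple ORM call will do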
Once we have our Item set up, we can continue to create our first spider. Create a file rapipago.py under the spiders directory.
storedirectoryscraper/storedirectoryscraper/spiders/rapipago.py
import scrapy

from storedirectoryscraper.items import OfficeItem


class RapiPagoSpider(scrapy.Spider):
    name = "rapipago"
    allowed_domains = ["rapipago.com.ar"]
    start_urls = [
        "http://www.rapipago.com.ar/rapipagoWeb/index.htm",  # (1)
    ]

    def parse(self, response):
        # Find the provinces in the search form and submit it once per province;
        # the callback parses the resulting pages.
        for idx, province in enumerate(response.xpath("//*[@id='provinciaSuc']/option")):  # (2)
            if idx > 0:  # avoid the select prompt
                code = province.xpath('@value').extract()[0]
                request = scrapy.FormRequest(
                    "http://www.rapipago.com.ar/rapipagoWeb/suc-buscar.htm",
                    formdata={'palabraSuc': 'Por palabra', 'provinciaSuc': code},
                    callback=self.parse_province)  # (3)
                request.meta['province'] = province.xpath('text()').extract()[0]  # (4)
                request.meta['province_code'] = code
                yield request  # (5)

    def parse_province(self, response):
        for idx, city in enumerate(response.xpath("//*[@id='ciudadSuc']/option")):
            if idx > 0:
                code = city.xpath('@value').extract()[0]
                request = scrapy.FormRequest(
                    "http://www.rapipago.com.ar/rapipagoWeb/suc-buscar.htm",
                    formdata={'palabraSuc': 'Por palabra',
                              'provinciaSuc': response.meta['province_code'],
                              'ciudadSuc': code},
                    callback=self.parse_city)
                request.meta['province'] = response.meta['province']
                request.meta['province_code'] = response.meta['province_code']
                request.meta['city'] = city.xpath('text()').extract()[0]
                request.meta['city_code'] = code
                yield request

    def parse_city(self, response):
        for link in response.xpath("//a[contains(@href,'index?pageNum')]/@href").extract():
            request = scrapy.FormRequest(
                'http://www.rapipago.com.ar/rapipagoWeb/suc-buscar.htm?' + link.split('?')[1],
                formdata={'palabraSuc': 'Por palabra',
                          'provinciaSuc': response.meta['province_code'],
                          'ciudadSuc': response.meta['city_code']},
                callback=self.parse_city_data)
            request.meta['province'] = response.meta['province']
            request.meta['city'] = response.meta['city']
            yield request

    def parse_city_data(self, response):
        # TODO: follow page links (7)
        for office in response.xpath("//*[@class='resultadosNumeroSuc']"):  # (6)
            officeItem = OfficeItem()
            officeItem['province'] = response.meta['province']
            officeItem['city'] = response.meta['city']
            officeItem['name'] = office.xpath("../*[@class='resultadosTextWhite']/text()").extract()[0]
            officeItem['address'] = office.xpath("../..//*[@class='resultadosText']/text()").extract()[0]
            yield officeItem
That is a lot of code for our scraper and it deserves some explanation.
In (1) we are telling the scraper where to start navigating. In this example, I've decided to start from the index page, where the search form first appears.
Our spider extends Scrapy's Spider class. As such, it will fetch the initial page and call the parse method, passing in the response.
In (2) we are fulfilling the first step described in Scraping a complex page. We find the select element that lists the provinces and iterate over its values. For each value listed, we create a FormRequest, telling Scrapy how to populate the form with the value obtained, and we pass in a callback to process the response of the form submission.
If we look at our OfficeItem, we see we want to store the province, city and address of each location. For that, we need to pass that data all the way through to the item creation in the last callback, parse_city_data. The way to accomplish this with Scrapy is to add the data to the meta dictionary of the request object at each step, as shown in (4).
In (5) we yield the request, which will be processed, and the callback will be called when the request completes. We have now spawned a request for each province, and each one will call parse_province when it completes. In that method we repeat the same procedure, but this time filling in both province and city in the FormRequest and passing parse_city as the callback. We also write the province and city values into the meta dictionary again to make them available to the next callback. parse_city in turn follows the pagination links for the search results, yielding a request for each page with parse_city_data as the callback.
Finally, in (6) we have the response with the locations for a particular province and city, so we can parse it and create our OfficeItem objects.
As per the comment at (7), we would need to repeat this call for each pagination link we find; there is more than one way to accomplish this in Scrapy. One way to implement this requirement would be to add an intermediate callback before the one extracting the data, which iterates over the pagination links and yields a new request for each of them; a sketch of this idea follows.
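As a rough sketch only (parse_results_pages is a name I am making up, and it assumes the previous callback stored the province, city and their codes in the request meta, as parse_province already does), such an intermediate callback added to RapiPagoSpider could look like this:

    def parse_results_pages(self, response):
        # Hypothetical intermediate callback: extract the items from the first
        # results page, then yield one request per pagination link found on it.
        for item in self.parse_city_data(response):
            yield item
        for link in response.xpath("//a[contains(@href,'index?pageNum')]/@href").extract():
            request = scrapy.FormRequest(
                'http://www.rapipago.com.ar/rapipagoWeb/suc-buscar.htm?' + link.split('?')[1],
                formdata={'palabraSuc': 'Por palabra',
                          'provinciaSuc': response.meta['province_code'],
                          'ciudadSuc': response.meta['city_code']},
                callback=self.parse_city_data)
            request.meta['province'] = response.meta['province']
            request.meta['city'] = response.meta['city']
            yield request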
Saving the items
Up to now we have been able to create our Django items, but we have not written them anywhere, neither to a database nor to a file.
The most suitable place to do this is the pipelines module. We can define as many pipelines as we want and simply register them in our settings.py module so that Scrapy executes them.
For this example, I have written a short pipeline which performs a basic clean-up on the address we extracted from the HTML and saves the item to the database.
Here is the code for our pipelines.py module.
# -*- coding: utf-8 -*-
import re


class ScrapRapiPagoPipeline(object):

    def process_item(self, item, spider):
        item['address'] = self.cleanup_address(item['address'])
        item.save()
        return item

    def cleanup_address(self, address):
        # Some results repeat the street number (e.g. "... 123 123");
        # keep everything up to the end of its first occurrence.
        m = re.search(r'(?P<numb>(\d+))\s(?P=numb)', address)
        if m:
            return address[0:m.end(1)]
        return address
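To illustrate what the clean-up does (the address below is made up), the regular expression looks for a street number repeated twice in a row and cuts the string right after its first occurrence:

>>> ScrapRapiPagoPipeline().cleanup_address('Av. Siempreviva 742 742 Springfield')
'Av. Siempreviva 742'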
We need to tell Scrapy which pipelines to run. To do that, open the settings.py file in our Scrapy project and add these lines:
ITEM_PIPELINES = {
    'storedirectoryscraper.pipelines.ScrapRapiPagoPipeline': 300,
}
Running the spider
So now that we have built all the pieces, you can try running our spider from the command line (if you haven't been doing so already).
Just go into the top folder of the Scrapy project and type
$ scrapy crawl rapipago
Summary
In this post I've examined how to scrape a site that requires multiple form submissions, passing data from request to request, and some basic data validation. I've shown a way to save to the database using Scrapy's Django integration, though you may prefer to save directly to the database yourself, or dump the data to a file instead. More information about each piece can be found in the Scrapy docs.
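For example, skipping the Django pipeline entirely, Scrapy's built-in feed exports can dump the scraped items straight to a file:

$ scrapy crawl rapipago -o offices.json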
In later posts I'll cover the steps to unit test this scraper in the usual way, and also explore the newer Scrapy alternative, contracts.