Up Around The Bend

Tuesday, March 10, 2015

Unit Testing Scrapy - part 1 - Integrating with DjangoItem

In my previous post I showed how to scrape a page that requires multiple form submissions, along with one way to save the scraped data to the database: the Django integration with DjangoItem.

In this post I want to show how we can unit test scrapers using just the usual Python unit test framework, and how we need to configure our testing environment when referencing a Django model from our Items.

Basic unit testing

Continuing with the example in my previous post, let's recall our project layout

├── mappingsite
│   ├── mappingsite
│   └── storemapapp
└── storedirectoryscraper
    └── storedirectoryscraper
        └──spiders

We built a scraper in the storedirectoryscraper project, but we haven't written any unit or integration tests for it yet (you may try out a little TDD afterwards instead of testing last, but it certainly helps to have an idea of where we are heading when learning a new tool).

So go ahead and create a tests.py file inside the storedirectoryscraper package (the inner folder, next to the spiders folder), and add the following code to it

import unittest


class TestSpider(unittest.TestCase):

    def test_1(self):
        pass


if __name__ == '__main__':
    unittest.main()

running

python -m unittest storedirectoryscraper.tests

from the top level folder will display the python unittest success message.
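As a rough illustration of what that command does, the sketch below (written for Python 3) loads the same placeholder test case and runs it programmatically. The run_placeholder_suite helper is a name made up for this example; it is not part of the project.

```python
import io
import unittest


class TestSpider(unittest.TestCase):

    def test_1(self):
        # Placeholder test, same as in tests.py.
        pass


def run_placeholder_suite():
    # Hypothetical helper: collect the tests of TestSpider and run them,
    # capturing the runner's report in memory instead of printing it.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSpider)
    report = io.StringIO()
    result = unittest.TextTestRunner(stream=report).run(suite)
    return result, report.getvalue()


result, report = run_placeholder_suite()
print(report)
```

The report ends with the same OK line that the command line run prints when every test passes.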

Now, let's see what happens when we try to import our spider to test it. Add the following line to the top of the tests.py file

from storedirectoryscraper.spiders import rapipago

and run the test again. You should see an error message like the one below.

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/unittest/__main__.py", line 12, in <module>
    main(module=None)
  File "/usr/lib/python2.7/unittest/main.py", line 94, in __init__
    self.parseArgs(argv)
  File "/usr/lib/python2.7/unittest/main.py", line 149, in parseArgs
    self.createTests()
  File "/usr/lib/python2.7/unittest/main.py", line 158, in createTests
    self.module)
  File "/usr/lib/python2.7/unittest/loader.py", line 130, in loadTestsFromNames
    suites = [self.loadTestsFromName(name, module) for name in names]
  File "/usr/lib/python2.7/unittest/loader.py", line 100, in loadTestsFromName
    parent, obj = obj, getattr(obj, part)
AttributeError: 'module' object has no attribute 'tests'

What happened here is that the unittest framework is not aware of our Scrapy project configuration: it is not Scrapy running our tests, it is Python directly, so the configuration in our settings file never takes effect. The AttributeError itself is misleading. On Python 2.7 the unittest loader hides the error raised while importing the tests module, and the underlying problem is most likely that the Django application our items depend on cannot be imported.
One way to solve this is to simply add the Django application to our Python path so the test runner can find it when invoked. We need this for every test, but we don't want to add it to our path permanently, only while testing the scraper, so we can create a test package and alter the path in its __init__.py file.

So let's do that. Create a tests folder at the same level as our tests.py file, add an __init__.py file to it, and move the tests.py file into that folder. After the changes, the project should look like this

storedirectoryscraper
    ├── scrapy.cfg
    └── storedirectoryscraper
        ├── __init__.py
        ├── items.py
        ├── pipelines.py
        ├── settings.py
        ├── spiders
        │   ├── __init__.py
        │   └── rapipago.py
        └── tests
            ├── __init__.py
            └── tests.py

Now add the following lines to the __init__.py file you just created

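import sys
import os

# Inner storedirectoryscraper package (the parent of this tests package).
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
# Make the sibling Django project importable and point Django at its settings.
sys.path.append(os.path.join(BASE_DIR, '../../mappingsite'))
os.environ['DJANGO_SETTINGS_MODULE'] = 'mappingsite.settings'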
Now if you run

python -m unittest storedirectoryscraper.tests.tests

or just

python -m unittest discover

from the top level scrapy folder, you should get a success message again.

Setting up the test database

Now that we have made our test work with the Django model, we need to be careful about which database we run our tests against: we wouldn't like our tests to modify our production database.

Looking at how we integrated our Django app into the testing environment, it turns out to be very easy to configure a testing database, separate from our development one. We just need to create different settings for the dev, test and prod environments in our Django application. Let's create a settings file for testing for now.
Inside our mappingsite module, create a folder called settings. Add an empty __init__.py file to tell Python this is a package. Move our settings file inside that folder, and rename it to base.py.

├── manage.py
├── mappingsite
│   ├── __init__.py
│   ├── settings
│   │   ├── base.py
│   │   └── __init__.py
│   ├── urls.py
│   └── wsgi.py

Now create two new modules, dev.py and test.py, and cut and paste the DATABASES declaration into those two files. Change the database name to something that makes sense for each environment (you can also change the engine if desired) so that they won't collide.

Now you can take several approaches to resolve the correct environment. For this case, we will just add

from .base import *

at the top of each file and replace our settings module in wsgi.py, the scraper's settings.py and tests/__init__.py with the correct one. This is not the recommended solution, though, as we would need to change those settings when deploying to a different environment (let's say production, or a staging server). You can read more on this in the Django docs.
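One common alternative, sketched here under assumptions, is to pick the settings module from an environment variable so the code does not change between deployments. MAPPINGSITE_ENV and settings_module are names invented for this sketch, and prod.py is a settings file this post does not create.

```python
import os

KNOWN_ENVIRONMENTS = ("dev", "test", "prod")


def settings_module(environ=None, default="dev"):
    # MAPPINGSITE_ENV is a made-up variable name for this sketch.
    environ = os.environ if environ is None else environ
    name = environ.get("MAPPINGSITE_ENV", default)
    if name not in KNOWN_ENVIRONMENTS:
        raise ValueError("unknown environment: %r" % name)
    return "mappingsite.settings.%s" % name


# wsgi.py, the scraper's settings.py and tests/__init__.py could then share:
# os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module())
print(settings_module({"MAPPINGSITE_ENV": "test"}))
```

Failing loudly on an unknown name is a deliberate choice here: a typo in the variable should not silently fall back to the development database.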

Summary

In this post we have seen how to set up our environment for unit testing when using Django models. With these changes, we can now start unit testing our scraper and even add some integration tests to check that we are actually able to populate our database. In future posts we will go deeper into how to unit test our scraper, and later on we will look into Scrapy's own alternative, contracts.

Thursday, March 5, 2015

Scraping a website using Scrapy and Django

I've been playing around with Scrapy lately and I found it extremely easy to use.

The steps to build a simple project are well described in the Scrapy tutorial. Here I am going to expand on what's explained there to include submitting forms, Django integration and testing.
If you worked through the tutorial project, you already have an understanding of the three key concepts you need to get started

Spiders: This is where we navigate the pages and look for the information we want to acquire. You will need some basic knowledge of CSS selectors and/or XPath to get to the information you want. There are easy ways to submit forms (for login, search etc.), follow links and so on. In the end, when you get to the data you want to keep, you store it in Items.

Items: Items can be serialized in many formats, saved directly to a database, linked to a Django model to store via the Django ORM, etc. Prior to that, they can be sent through one or more pipelines for processing.

Pipelines: Here is where you would do all the validation, data clean up etc.

Scraping a complex page

Let's say we want to scrape the Rapipago site, whose main page is http://www.rapipago.com.ar/rapipagoWeb/index.htm

It lists the locations of service and tax payment offices around my country.
You need to search either by keyword or by province and city in the search form at the right of the page. The search form has the provinces loaded by default, but you cannot select a city until you have selected a province. As we cannot execute JavaScript with Scrapy, we need to split the process into 4 steps inside the spider:

1. Go to the main page http://www.rapipago.com.ar/rapipagoWeb/index.htm and parse the response looking for the list of provinces.
2. For each province in the select element, submit the form simulating the selection.
3. Parse the response to each request in 2. to find the list of cities associated with each province, and submit the search form again for each pair (province, city).
4. Parse each response in 3. to obtain the items. In this case, the information we want is the name, address, city and province of each location. Yield each item for further processing through the pipeline.
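The four steps can be pictured without Scrapy at all. The sketch below replaces the website with an in-memory structure and chains three generators the way the spider callbacks are chained, carrying the province and city along at each hop. All names and data in it are invented for illustration.

```python
# Stand-in for the website: province -> city -> office names (invented data).
FAKE_SITE = [
    ("Province A", [("City 1", ["Office X", "Office Y"])]),
    ("Province B", [("City 2", ["Office Z"])]),
]


def list_provinces(site):
    # Steps 1 and 2: one "request" per province found on the main page.
    for province, cities in site:
        yield {"province": province}, cities


def list_cities(meta, cities):
    # Step 3: one "request" per (province, city) pair.
    for city, offices in cities:
        yield dict(meta, city=city), offices


def list_offices(meta, offices):
    # Step 4: one item per office, carrying the province and city along.
    for name in offices:
        yield dict(meta, name=name)


def crawl(site):
    for meta, cities in list_provinces(site):
        for city_meta, offices in list_cities(meta, cities):
            for item in list_offices(city_meta, offices):
                yield item


items = list(crawl(FAKE_SITE))
print(items)
```

In the real spider the nesting is not written out like this: each step yields requests and Scrapy calls the next callback when a response arrives.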

Django integration

You will have to build both a Scrapy project and a Django project. The out of the box integration uses DjangoItem to store data using Django ORM.

Let's say you are scraping an ATM directory to later build a Django application that displays store locations on a map.

- BloggerWorkspace
       - storedirectoryscraper (Scrapy project)
       - mappingsite (Django Project)
             - mappingsite
             - storemapapp

In the example, you can see the workspace structure with the scrapy project and the Django project.

This was easily achieved by doing:

$ mkvirtualenv BloggerWorkspace
$ mkproject BloggerWorkspace

When creating the environments, bear in mind that Scrapy currently does not support Python 3, so you'll need to use the latest 2.7 version. You can probably use different environments and Python versions for the Scrapy and Django projects; I am using the same one here for simplicity.

Update: Since May 2016, Scrapy 1.1 supports Python 3 on non-Windows environments with some limitations; see the release notes.

$ pip install django
$ django-admin.py startproject mappingsite
$ cd mappingsite
$ django-admin.py startapp storemapapp
$ cd ..
$ pip install Scrapy
$ scrapy startproject storedirectoryscraper

Doing the actual work

So, now that we have our projects set up, let's see what the code would look like.

Following the Scrapy tutorial, we need to create our item. This is going to be a DjangoItem, so we first go to our Django application and add a model to models.py inside our brand new app, storemapapp. We also need to add the app to INSTALLED_APPS in our settings.py module.

mappingsite/storemapapp/models.py

from django.db import models

class Office(models.Model):
    city = models.CharField(max_length=100)
    province = models.CharField(max_length=100)
    address = models.CharField(max_length=100)
    name = models.CharField(max_length=100)

mappingsite/mappingsite/settings.py

....

INSTALLED_APPS = (
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'storemapapp',
)

....

We can then go back to our scraper application and create our Scrapy item in items.py

from scrapy.contrib.djangoitem import DjangoItem
from storemapapp.models import Office

class OfficeItem(DjangoItem):
    django_model = Office

Update: As of Scrapy 1.0.0, DjangoItem has been relocated to its own package. You need to pip install scrapy-djangoitem and import DjangoItem from scrapy_djangoitem instead.

We will also need to tell Scrapy how to find our Django application to be able to import our Office model. At the top of our scraper's settings.py file, add these lines to make our Django app available:

storedirectoryscraper/storedirectoryscraper/settings.py

import sys
import os


sys.path.append('<abs path to BloggerWorkspace/mappingsite>')
os.environ['DJANGO_SETTINGS_MODULE'] = 'mappingsite.settings'

Update: As of Django 1.7, app loading changed and you need to call the setup method explicitly. You can do this by also adding the following (thanks to Romerito Campos)


import django
django.setup()

With those settings we should now be able to start the Scrapy shell and import our Django model with

from storemapapp.models import Office

to check it works.

Once we have our Item set up, we can continue to create our first spider. Create a file rapipago.py under the spiders directory.

storedirectoryscraper/storedirectoryscraper/spiders/rapipago.py


import scrapy
from storedirectoryscraper.items import OfficeItem


class RapiPagoSpider(scrapy.Spider):
    name = "rapipago"
    allowed_domains = ["rapipago.com.ar"]
    start_urls = [
        "http://www.rapipago.com.ar/rapipagoWeb/index.htm",  # (1)
    ]

    def parse(self, response):
        # find form and fill in
        # call inner parse to parse real results.
        for idx, province in enumerate(response.xpath("//*[@id='provinciaSuc']/option")):  # (2)
            if idx > 0:  # skip the select prompt
                code = province.xpath('@value').extract()[0]
                request = scrapy.FormRequest("http://www.rapipago.com.ar/rapipagoWeb/suc-buscar.htm",
                                             formdata={'palabraSuc': 'Por palabra', 'provinciaSuc': code},
                                             callback=self.parse_province)  # (3)

                request.meta['province'] = province.xpath('text()').extract()[0]  # (4)
                request.meta['province_code'] = code
                yield request  # (5)

    def parse_province(self, response):
        for idx, city in enumerate(response.xpath("//*[@id='ciudadSuc']/option")):
            if idx > 0: 
                code = city.xpath('@value').extract()[0]

                request = scrapy.FormRequest("http://www.rapipago.com.ar/rapipagoWeb/suc-buscar.htm",
                                             formdata={'palabraSuc': 'Por palabra',
                                                       'provinciaSuc': response.meta['province_code'],
                                                       'ciudadSuc': code},
                                             callback=self.parse_city)

                request.meta['province'] = response.meta['province']
                request.meta['province_code'] = response.meta['province_code']
                request.meta['city'] = city.xpath('text()').extract()[0]
                request.meta['city_code'] = code
                yield request

    def parse_city(self, response):
        for link in response.xpath("//a[contains(@href,'index?pageNum')]/@href").extract():
            request = scrapy.FormRequest('http://www.rapipago.com.ar/rapipagoWeb/suc-buscar.htm?' + link.split('?')[1],
                                         formdata={'palabraSuc': 'Por palabra',
                                                   'provinciaSuc': response.meta['province_code'],
                                                   'ciudadSuc': response.meta['city_code']},
                                         callback=self.parse_city_data)

            request.meta['province'] = response.meta['province']
            request.meta['city'] = response.meta['city']

            yield request

    def parse_city_data(self, response):
        # TODO: follow page links (7)
        for office in response.xpath("//*[@class='resultadosNumeroSuc']"):  # (6)
            officeItem = OfficeItem()
            officeItem['province'] = response.meta['province']
            officeItem['city'] = response.meta['city']
            officeItem['name'] = office.xpath("../*[@class='resultadosTextWhite']/text()").extract()[0]
            officeItem['address'] = office.xpath("../..//*[@class='resultadosText']/text()").extract()[0]
            yield officeItem

That is a lot of code for our scraper and it deserves some explanation.

In (1) we are telling the spider where to start navigating. In this example, I've decided to start from the index page, where the search form first appears.

Our spider extends from the Spider class in scrapy. As such, it will go to the initial page and call the method parse, passing in the response.

In (2) we are fulfilling the first step described in Scraping a complex page. We find the select element that lists the provinces and we iterate over the values. For each value listed, we create a FormRequest, telling scrapy how to populate the form using the value obtained, and passing in a callback to process the response of the form submission.

If we look at our OfficeItem, we see we want to store the name, province, city and address of each location. For that, we need to pass that data all the way through to the item creation in the last method, parse_city_data. The way to accomplish this with Scrapy is to add the data to the meta dictionary of the request object in each call, as shown in (4).

In (5) we yield the request, which Scrapy will process, invoking the callback once the response arrives. We have now spawned a request for each province, and each one is going to call parse_province when it completes. In that method we repeat the same procedure, this time filling in both province and city in the FormRequest and passing in parse_city as the callback. parse_city, in turn, requests each pagination link it finds, with parse_city_data as the callback. Along the way we copy the province and city values into the meta dictionary again to make them available to the next callback.

Finally, in (6) we have the response with locations for a particular province and city, so we can proceed to parse the response and create our OfficeItem objects.

As per the comment in (7), we would need to repeat this call for each pagination link we find, and there is more than one way to accomplish this in Scrapy. One option is to add an intermediate callback before the one extracting the data, which iterates over the pagination links and yields a new request for each of them. parse_city above is a first step in that direction.
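The URL handling for those pagination links can be tried out on its own. The sketch below mirrors the link.split('?')[1] logic from parse_city in a plain function; pagination_urls is a hypothetical helper and the hrefs are invented, shaped like the ones the XPath in parse_city looks for.

```python
SEARCH_URL = "http://www.rapipago.com.ar/rapipagoWeb/suc-buscar.htm"


def pagination_urls(hrefs):
    # Hypothetical helper: keep the query string of each pagination link
    # (as parse_city does with link.split('?')[1]) and skip duplicates.
    urls = []
    for href in hrefs:
        if "?" not in href:
            continue
        url = SEARCH_URL + "?" + href.split("?", 1)[1]
        if url not in urls:
            urls.append(url)
    return urls


# Invented hrefs, used only for illustration.
print(pagination_urls(["index?pageNum=1", "index?pageNum=2", "index?pageNum=1"]))
```

Keeping this logic in a function that takes plain strings also makes it easy to unit test without building a Scrapy response.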

Saving the items

Up to now we have been able to create our Django item, but we have not written it anywhere, neither to a database nor to a file.

The most suitable place to do this would be the pipelines module. We can define all the pipelines we want and just set them up in our settings.py module so that scrapy would execute them.

For this example, I have just written a short pipeline which performs a basic clean up on the address we extracted from the html, and saves it to the database.

Here is the code for our pipelines.py module.

# -*- coding: utf-8 -*-
import re


class ScrapRapiPagoPipeline(object):

    def process_item(self, item, spider):
        item['address'] = self.cleanup_address(item['address'])
        item.save()
        return item

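    def cleanup_address(self, address):
        # Drop everything after a street number that appears twice in a row
        # (e.g. "1234 1234 ..."), keeping the first occurrence.
        m = re.search(r'(?P<numb>(\d+))\s(?P=numb)', address)
        if m:
            return address[0:m.end(1)]
        return address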
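To make the clean up concrete, here is the same regular expression as a standalone function applied to two invented addresses. The REPEATED_NUMBER constant and the sample strings are assumptions made for this sketch; the real data on the site may look different.

```python
import re

# A number followed by whitespace and the same number again, e.g. "1234 1234".
REPEATED_NUMBER = re.compile(r'(?P<numb>\d+)\s(?P=numb)')


def cleanup_address(address):
    m = REPEATED_NUMBER.search(address)
    if m:
        # Keep everything up to the end of the first occurrence.
        return address[:m.end('numb')]
    return address


print(cleanup_address("Av. Siempreviva 1234 1234 Local 2"))
print(cleanup_address("Av. Siempreviva 742"))
```

The first address is cut right after the first 1234, while the second one has no repeated number and comes back unchanged.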

We need to tell Scrapy which pipelines to run. For that, open the settings.py file in our Scrapy project and add these lines:

ITEM_PIPELINES = {
    'storedirectoryscraper.pipelines.ScrapRapiPagoPipeline': 300,
}

Running the spider

So now that we have built all the pieces, you can try to run our spider from the command line (that is, if you haven't yet been trying)

Just go into the top folder of the scrapy project and type

scrapy crawl rapipago

Summary

In this post I've examined how to scrape a site which requires multiple form submissions, passing data from request to request, and some basic data clean up. I've shown a way to save to the database using Scrapy's Django integration, though you may prefer to save directly to the database, or dump to a file instead. More information about each piece can be found in the Scrapy docs.

In later posts I'll cover the steps to unit test this scraper the usual way, and also explore the newer Scrapy alternative, contracts.

Tuesday, January 27, 2015

Django Migrations - Moving Django model to another application and renaming Postgresql table and sequence

I started my Django project with just one application. The application grew bigger and bigger, and the day came when I noticed there were cohesive groups of model classes and functionality that could be split off into a second app. Moving the models and logic was tedious but quite simple. Happy to have the job done, I went ahead and typed

python manage.py makemigrations

only to find out a bunch of errors showed up.

Django does not automatically detect that the models were moved and write the proper migration for us (it is not smart enough to read our minds), so here is what I did.

1. First option:

The first workaround I found was to add


    class Meta:
        db_table = 'old tablename'
        app_label = 'old app'

to the models I moved. Although it works fine, it's an awful hack: the code lives in one place while the models still belong to the old app.

2. Stack Overflow blessing

A deeper search led me to this reply in a Stack Overflow post

which describes in detail the process needed to move a model to another app. It makes use of state_operations and database_operations separately in a migration inside the old application, as shown below

from django.db import migrations


class Migration(migrations.Migration):
    dependencies = []

    # Rename the table in the database only...
    database_operations = [
        migrations.AlterModelTable('TheModel', 'newapp_themodel')
    ]

    # ...and drop the model from this app's migration state only.
    state_operations = [
        migrations.DeleteModel('TheModel')
    ]

    operations = [
        migrations.SeparateDatabaseAndState(
            database_operations=database_operations,
            state_operations=state_operations)
    ]

I am not including the full code here, as you can take a look at it on Stack Overflow directly. It also involves a second migration in the destination application, only to create the model with a state operation, using SeparateDatabaseAndState again (the table is already in the database, so trying to create it again would cause an error).

The process is simple and it does not require any data migration, which makes it delightful (thanks to ozan)

In my case I am using Postgresql, so I went and took a look into the database to see what was going on

As the picture shows, I had the table ui_environments and the sequence ui_environments_id_seq.
The details for those tables show how ui_environments references the sequence, as well as the sequence numbers just prior to the migration.

Running the above migration schema (both migrations mentioned) to rename ui_environments to spatial_geom (as it's now part of the 'spatial' application, and I chose to take the opportunity to also rename the model class to Geom as it made more sense at this point), led to the following tables

We can see spatial_geom is there, ui_environments was deleted, but ui_environments_id_seq remained the same. Inspecting both tables, the table was correctly renamed, and it still points to the same sequence. This all looks fine, but I wanted to clean up the database and rename also the sequence to avoid confusion.

Of course, I tried to do this with Django migrations, to learn a bit more about it. So here is how I did it.

Migrations' RunSQL to run custom SQL to alter your database

Digging into the documentation, there's no operation to rename a database object which does not belong to a model, so RunSQL seemed the best fit. As shown in the documentation, we need to pass in the SQL we want to run and, optionally, the SQL to revert the operation and the operations to reflect it in the migrations' state, if needed.

As renaming the sequence did not require any migration state change, I just passed in the first two parameters.

I ran

python manage.py makemigrations spatial --empty

to create an empty migration in my application, and then modified it with the custom statement.


# -*- coding: utf-8 -*-
from __future__ import unicode_literals

from django.db import models, migrations


class Migration(migrations.Migration):

    dependencies = [
        ('spatial', '0005_environmentgeom'),
    ]

    operations = [
        migrations.RunSQL(
            sql="ALTER SEQUENCE ui_environments_id_seq RENAME TO spatial_geom_id_seq",
            reverse_sql="ALTER SEQUENCE spatial_geom_id_seq RENAME TO ui_environments_id_seq"),
    ]

then ran

python manage.py migrate spatial

and took a look inside the db

The sequence name had changed and it was correctly referenced by the spatial_geom table. Sequence numbers remained unaltered, as required. The only strange thing I noticed was that the sequence_name field in the sequence still had the old value. Digging a bit into this, I found this issue with Postgresql sequences, which states that the field is not used anywhere, so it does not really matter that the ALTER SEQUENCE statement does not change it. We can happily ignore it.
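The forward and reverse statements passed to RunSQL mirror each other, so a small helper can generate both from one pair of names. This is only a sketch: rename_sql is a made-up function, it assumes trusted names since nothing is quoted or escaped, and the INDEX variant relies on PostgreSQL's ALTER INDEX ... RENAME TO.

```python
def rename_sql(kind, old_name, new_name):
    # Returns (sql, reverse_sql) for a PostgreSQL "ALTER ... RENAME TO".
    # Names are inserted as-is, so only pass names you control.
    if kind not in ("SEQUENCE", "INDEX"):
        raise ValueError("unsupported object kind: %r" % kind)
    template = "ALTER %s %s RENAME TO %s"
    return (template % (kind, old_name, new_name),
            template % (kind, new_name, old_name))


sql, reverse_sql = rename_sql("SEQUENCE", "ui_environments_id_seq", "spatial_geom_id_seq")
print(sql)
print(reverse_sql)
```

The two strings could then be passed as the sql and reverse_sql arguments of migrations.RunSQL, as in the migration above.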

Summary

Django migrations do an excellent job inside one application, but some changes across applications may require writing custom migrations. In this post I looked into moving a model between applications and cleaning up the database afterwards, for the changes that Django alone did not perform. Looking at the final result, the indexes also kept their old names, and a similar approach should be taken if we want to tidy up further.

Friday, April 18, 2014

Creating a Python environment from scratch

Managing Python environments is straightforward with the right tools. We have two flavors of Python: Python 2.7 (which will be supported until 2020, as Guido has stated recently) and the new recommended version 3.x. We will also most probably need to use multiple versions of a library at the same time for different projects; for example, we may have a project using Python 2.7 and Django 1.5 which is in maintenance, while at the same time we are starting a new one using Python 3 and Django 1.7.

Without extra tools, managing the combinations of environments can be quite tricky. Below I describe a basic configuration that solves these problems and more!

First of all, install Python

On Ubuntu:

$ sudo apt-get install python
$ sudo apt-get install python3

You can install both of them, they won't collide and you will be able to choose which one to use on a project by project basis.

Install pip

You can follow the instructions here to install the newest pip version or, if you don't mind being a step or two behind, you can use the package manager of your OS instead. In my case, I use Ubuntu, so here is the apt-get line.

First you can get the package info to see how out-dated the package is

$ sudo apt show python-pip

Package: python-pip
Priority: optional
Section: universe/python
Installed-Size: 479 kB
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Original-Maintainer: Debian Python Modules Team <python-modules-team@lists.alioth.debian.org>
Version: 1.5.4-1
Depends: python (>= 2.7), python (<< 2.8), python:any (>= 2.7.1-0ubuntu2), ca-certificates, python-colorama, python-distlib, python-html5lib, python-pkg-resources, python-setuptools (>= 0.6c1), python-six, python-requests
Recommends: build-essential, python-dev-all (>= 2.6)
Download-Size: 97,7 kB
Homepage: http://www.pip-installer.org/
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Origin: Ubuntu
APT-Sources: http://ar.archive.ubuntu.com/ubuntu/ trusty/universe amd64 Packages
Description: alternative Python package installer
pip is a replacement for easy_install, and is intended to be an improved
Python package installer. It integrates with virtualenv, doesn't do partial
installs, can save package state for replaying, can install from non-egg
sources, and can install from version control repositories.

The current version according to the site is 1.5.4, published on 2014-02-21, so the package is not really behind at this time.
Let's install it

$sudo apt-get install python-pip

Pip will help us install the next set of tools we need, as well as Python packages for our own applications.

Managing virtual environments

Virtualenv is a tool which allows us to create virtual environments where we can install whatever packages and versions we want in isolation, without impacting the rest of the system, including other virtual python environments.

$sudo pip install virtualenv

Virtualenv consists of only a handful of commands which you can use to manage your environments. You can create, activate and deactivate environments like this

create ENV:

$virtualenv [--python=path-to-python-exec] ENV

start working on ENV:

$cd ENV
$source bin/activate

do some stuff
...

exit ENV:
$deactivate

You can create bootstrap scripts to automate environment set up and other more advanced stuff

Virtualenv is simple, but it gets better if you combine it with virtualenvwrapper, an extension of virtualenv which adds wrappers for creating and deleting environments and assists in your everyday development workflow.
Again, pip will help with the installation.

$sudo pip install virtualenvwrapper

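Once virtualenvwrapper.sh is sourced from your shell startup file (the virtualenvwrapper installation docs describe this step, along with the WORKON_HOME and PROJECT_HOME variables), you are ready to go!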

If you are starting a new project you can create both a new virtual env and the project with a simple command

$mkproject projname

Alternatively, if you want to start using virtual environments with your existing project, you can just create your env with

$mkvirtualenv mynewenv

which will be automatically activated.

and bind your existing project using

$setvirtualenvproject [virtualenv_path project_path]

To exit the environment, just type

$deactivate

And to start working on another environment:

$workon otherenv

also, workon will cd you into the project directory automagically.

list all your environments issuing:

$lsvirtualenv

or just

$workon

without any params

remove an obsolete environment with

$rmvirtualenv obsoleteenv

This is just a glimpse of what virtualenvwrapper can do for you. I encourage you to go through the command reference here to get a better taste of it.

Saturday, April 5, 2014

Installing Google Earth on Ubuntu 13.10 64bit

After several attempts to install Google Earth on Ubuntu 13.10 64 bit, I finally stumbled upon the solution here.

The solution consists of modifying the Debian control file in the downloaded package to remove the offending dependencies on lsb-core and ia32-libs. Here are the detailed instructions.

Checking dependencies

Make sure you have installed the packages libc6:i386 and lsb-core. You can do this e.g., by typing the following on the command line:

sudo apt-get install libc6:i386 lsb-core

Building the package

Download the Google Earth x64 .deb package and extract it (yes, extract it instead of installing it).
Go to the expanded folder and cd into DEBIAN. Open the control file with your favorite editor and remove the following line

Depends: lsb-core (>= 3.2), ia32-libs

Delete the downloaded .deb file and rebuild the package from the expanded folder. To do so, cd into the parent folder of the extracted directory from the terminal and issue:

dpkg -b google-earth-stable_current_amd64
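The edit to the control file described above could also be scripted. The sketch below works on the file's text and drops the Depends line; drop_depends_line is a hypothetical helper, and the sample control file is invented and shortened, with only the Depends line taken from this post.

```python
def drop_depends_line(control_text, unwanted=("lsb-core", "ia32-libs")):
    # Remove any "Depends:" line that mentions one of the unwanted packages.
    kept = []
    for line in control_text.splitlines():
        if line.startswith("Depends:") and any(pkg in line for pkg in unwanted):
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Invented, shortened control file; only the Depends line comes from the post.
sample = (
    "Package: google-earth-stable\n"
    "Architecture: amd64\n"
    "Depends: lsb-core (>= 3.2), ia32-libs\n"
    "Description: example entry\n"
)
print(drop_depends_line(sample))
```

Reading the real file, passing its contents through the function and writing the result back before running dpkg -b would have the same effect as the manual edit.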

Installing the package

To install the modified .deb package run the following command from the console.

sudo dpkg -i google-earth-stable_current_amd64.deb

You can also double click on the .deb package to open it with Software Center, but that didn't work for me: Software Center simply refused to install it.

Enjoy!