Sunday, August 17, 2014

Final Summary

Summer of Code has finally come to an end. I’ve managed to develop all the ideas in my original proposal, although their implementation drifted from what was planned at the beginning. The API remained essentially the same, but details of the execution were adjusted in light of previously unconsidered issues and new ideas that came along to improve the consistency, simplicity and user-friendliness of the interface.

One of these considerations was dropping backward compatibility for rarely used features when keeping them would clutter and bloat the codebase. The most important implication of this decision was that it freed the API design from several constraints, which improved the clarity and straightforwardness of the implementation. The list of dropped features can be seen in the description of the clean-up pull request.

I think the major highlight of the cleanup process (and what the other changes actually revolve around) is the update of the Crawler class. We modified its dependencies so that it takes a required spider class at initialization time, effectively linking each crawler to a single spider definition. Its functionality is now unified in two steps: its creation (given the aforementioned spider class and the settings for its execution), which initializes all the components needed to start crawling, and the crawl method, which instantiates a spider object from the crawler’s spider class and sets the crawling engine in motion.
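
To make this concrete, here is a rough sketch of how the updated class can be used; the spider class and the domain keyword argument are just illustrative placeholders, not part of the changes themselves:

from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.spider import Spider

class MySpider(Spider):
    name = 'myspider'

# The spider class and its execution settings are bound at creation time.
crawler = Crawler(MySpider, Settings())

# crawl() instantiates MySpider with the given arguments and returns a
# deferred that fires once the crawl finishes (the Twisted reactor must be
# running for the crawl to actually make progress).
d = crawler.crawl(domain='scrapinghub.com')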

Since even this distinction is usually not necessary, a new class, CrawlerRunner, was introduced to deal with configuring and starting crawlers without user intervention. This class handles multiple crawler jobs and provides convenient helpers to control them. This functionality was moved out of an existing helper, CrawlerProcess, which was left in charge of Scrapy’s execution details, such as configuring Twisted’s reactor and hooking system signals.
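
For the common case where you don’t want to deal with the reactor at all, CrawlerProcess keeps handling it for you. A minimal sketch inside a project might look like this (reusing the 'followall' spider name from the examples further below):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'followall' is resolved to a spider class defined in the project.
process.crawl('followall', domain='scrapinghub.com')
process.start()  # starts the Twisted reactor and blocks until crawling is finished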

Finally, the per-spider settings feature didn’t diverge significantly from the proposal. Each spider has a custom_settings class attribute whose values are applied to the settings by the update_settings method. The latter is exposed so users can override the default population behavior (by default, these settings are applied with a dedicated ‘spider’ priority).
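
A brief sketch of what this looks like in practice (the spider name and the particular setting values are just illustrative):

from scrapy.spider import Spider

class MySpider(Spider):
    name = 'myspider'

    # Applied on top of the project settings with 'spider' priority
    # when the crawler for this spider is created.
    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
    }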

A full list of implemented features can be seen in the description of the API clean-up pull request, along with those of the per-spider settings and the first half’s settings clean-up pull requests.

I had to learn a bit of Twisted’s basics to understand how concurrency is dealt with within Scrapy, and I found the concept of deferreds (a core element of Twisted’s programming model, inspired by the futures concept) quite intuitive and a great alternative for achieving concurrent execution in Python.
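
As a toy illustration unrelated to Scrapy itself: a deferred is essentially a placeholder for a result that isn’t available yet, and callbacks attached to it run once the result is delivered:

from twisted.internet import defer

def report(result):
    print('got: %s' % result)

d = defer.Deferred()
d.addCallback(report)                  # runs report() once a result is available
d.callback('a value computed later')   # delivers the result, firing the callback chain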

By exposing the deferreds returned by delayed routines and providing convenient single-purpose helpers, we came up with a flexible interface.

For instance, a single spider inside a project can be run this way:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

# 'followall' is the name of one of the spiders of the project.
d = runner.crawl('followall', domain='scrapinghub.com')
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished

Now it's possible to run spiders outside projects too (which allows Scrapy to be used as a library instead of a framework, one of the goals of this GSoC), in a similar manner:

from twisted.internet import reactor
from scrapy.spider import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings

class MySpider(Spider):
    # Your spider definition
    ...

settings = Settings({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
runner = CrawlerRunner(settings)

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished

We can run multiple spiders sequentially as before, but with a much simpler approach:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    for domain in ['scrapinghub.com', 'insophia.com']:
        yield runner.crawl('followall', domain=domain)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished

Or we can run them simultaneously (something that previously wasn’t viable) by following the same deferred-based interface:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())
dfs = set()
for domain in ['scrapinghub.com', 'insophia.com']:
    d = runner.crawl('followall', domain=domain)
    dfs.add(d)

defer.DeferredList(dfs).addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished

These usage examples and the interface details are carefully documented in the referenced pull requests.

My work is currently under evaluation by the Scrapy developers, and I’m fixing issues raised by concerns brought up during review. No critical problems with the implementation have surfaced, but I plan to address further suggestions so the changes can be merged into the main repository.

There are some ideas I’d love to work on, given the time, to improve other aspects of the Scrapy API, but I can gladly say that I’ve delivered every point of my proposal.

Having been through this process, I’d like to stress the importance of writing documentation as a way to carefully think through design decisions. The good practices and high standards of this open source project were great guidelines that ensure the quality of the work delivered. Code is extensively tested, documented and backported (within reasonable expectations), and even keeping a clean commit history is taken seriously.

I got to know GitHub features that were really useful for discussing decisions with the community. The Travis continuous integration setup for this project made it easier to submit patches that meet the minimum expectation of passing the test suite. Finally, being reviewed by such skilled developers was definitely the experience I enjoyed and valued the most, and I’m grateful to them for helping me through this process.

I want to wrap this post up by saying that I really loved this program. In fact, I wish I had heard about it sooner so I could have applied in earlier years. It was great to get involved in the open source world and get to know these developers, and I recommend this experience to any student looking to take part in a real project, face an interesting challenge and contribute to open source along the way.