Sunday, August 17, 2014

Final Summary

Summer of Code has finally come to an end. I’ve managed to develop all the ideas in my original proposal, although their implementation drifted from what was planned at the beginning. The API remained essentially the same, but details of the execution were adjusted to account for previously unconsidered issues and for new ideas that came along to improve the consistency, simplicity and user-friendliness of the interface.

One of these considerations was dropping backward compatibility for rarely used features whenever keeping them working would clutter and bloat the codebase. The most important implication of this decision is that it freed the API design from several constraints, which improved the clarity and straightforwardness of the implementation. The list of dropped features can be seen in the description of the clean-up pull request.

I think the major highlight of the cleanup process (and what the other changes actually revolve around) is the update of the Crawler class. We changed its dependencies so that it takes a required spider class at initialization time, effectively linking each crawler to a single spider definition. Its functionality is now unified in two steps: its creation (with the already mentioned spider class and the configuration of its execution), which initializes all the components needed to start crawling, and the crawl method, which instantiates a spider object from the crawler’s spider class and sets the crawling engine in motion.
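
As a rough sketch of the shape this takes (an illustration of the behavior described above rather than a verbatim copy of the final code; the domain argument is just an example spider parameter):

from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.spider import Spider

class MySpider(Spider):
    name = 'myspider'

# The spider class and the execution settings are fixed at creation time...
crawler = Crawler(MySpider, Settings())
# ...and crawl() builds a MySpider instance and sets the engine in motion,
# returning a deferred that fires when the crawl finishes.
d = crawler.crawl(domain='scrapinghub.com')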

Since even this distinction is usually unnecessary, a new class, CrawlerRunner, was introduced to deal with configuring and starting crawlers without user intervention. This class keeps track of multiple crawl jobs and provides convenient helpers to control them. This functionality was carved out of an existing helper, CrawlerProcess, which was left in charge of Scrapy's execution details, such as configuring Twisted’s reactor and hooking system signals.

Finally, the per-spider settings work didn’t diverge significantly from the proposal. Each spider has a custom_settings class attribute whose settings are applied by the update_settings method. The latter was made available so users can override the default population behavior (by default, the values are set with a dedicated ‘spider’ priority).
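
A minimal sketch of what this looks like on a spider (the DOWNLOAD_DELAY value is just an example, and the update_settings body shown here is the default behavior described above):

from scrapy.spider import Spider

class MySpider(Spider):
    name = 'myspider'

    # These values take precedence over the project settings for this spider only.
    custom_settings = {
        'DOWNLOAD_DELAY': 2,
    }

    @classmethod
    def update_settings(cls, settings):
        # Default population behavior: apply custom_settings with 'spider' priority.
        # Override this method to customize how the spider settings are applied.
        settings.setdict(cls.custom_settings or {}, priority='spider')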

A full list of implemented features can be seen in the description of the API clean-up pull request, along with those of the per-spider settings and the first half’s settings clean-up pull requests.

I had to learn a bit of Twisted’s basics to understand how concurrency is handled within Scrapy, and I found the concept of deferreds (a core element of Twisted’s application model, inspired by the futures concept) quite intuitive and a great alternative for achieving concurrent execution in Python.
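
In a nutshell, a deferred is a placeholder for a result that isn’t available yet, to which callbacks are attached; here is a tiny standalone illustration (the callback and value are made up):

from twisted.internet import defer

def on_result(result):
    print('got %r' % result)

d = defer.Deferred()      # a placeholder for a value that isn't available yet
d.addCallback(on_result)  # register what to do once the value arrives
d.callback(42)            # when the value finally arrives, the callback fires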

By exposing the deferreds returned by delayed routines and providing convenient single-purpose helpers, we came up with a flexible interface.

For instance, a single spider inside a project can be run this way:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

# 'followall' is the name of one of the spiders of the project.
d = runner.crawl('followall', domain='scrapinghub.com')
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished

Now it's possible to run spiders outside projects too (which allows Scrapy to be used as a library instead of a framework, one of the goals of this GSoC), in a similar manner:

from twisted.internet import reactor
from scrapy.spider import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings

class MySpider(Spider):
    # Your spider definition
    ...

settings = Settings({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
runner = CrawlerRunner(settings)

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished

We can run multiple spiders sequentially as before, but with a much simpler approach:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    for domain in ['scrapinghub.com', 'insophia.com']:
        yield runner.crawl('followall', domain=domain)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished

Or we can run them simultaneously (something that previously wasn’t viable) by taking advantage of the deferred-based interface:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())
dfs = set()
for domain in ['scrapinghub.com', 'insophia.com']:
    d = runner.crawl('followall', domain=domain)
    dfs.add(d)

defer.DeferredList(dfs).addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished

All of these usage examples and the interface details are carefully documented in the referenced pull requests.

My work is currently under evaluation by the Scrapy developers, and I’m fixing the issues raised during review. No critical problems with the implementation have surfaced so far, but I plan to address further suggestions so the changes can be merged into the main repository.

There are some ideas that I’d love to work on, given more time, to improve other aspects of the Scrapy API, but I can gladly say that I've delivered every point of my proposal.

After being through this process, I’d like to highlight the importance of writing documentation as a way to carefully think through design decisions. The good practices and high standards of this open source project were great guidelines that ensure the quality of the work. Code is extensively tested, documented and backported (within reasonable expectations). Even keeping a clean commit history is a sensitive matter.

I got to know GitHub features that were really useful for discussing decisions with the community. The Travis continuous integration system hooked into the project made it easier to submit patches that meet the minimum expectation of passing the test suite. Finally, being reviewed by such skilled developers was definitely the experience that I enjoyed and valued the most, and I’m grateful to them for helping me through this process.

I want to wrap this post up by saying that I really loved this program. In fact, I wish I had heard about it sooner so I could have applied earlier. It was great to get involved in the open source world and get to know these developers, and I recommend this experience to any student who is looking to participate in a real project, face an interesting challenge and contribute to open source while doing so.

Monday, July 28, 2014

Four weeks into 2nd half

I've opened Scrapy pull request #816 with the progress I've made on the API refactoring. I'm keeping a list in the PR description with the current status of the tasks assigned to me, along with the backward-incompatible changes that were introduced.

I've updated the Spider class's methods, the Crawler initialization and the SpiderManager interface, following the design explained in my proposal. Upon reviewing the complexity added to support deprecated features, we agreed to relax the backward compatibility requirements in order to prioritize code simplicity.

Over the next weeks I'll wrap up the changes, adding the missing tests and documentation and getting the PR to a stable, mergeable state. I also plan to address further API issues if I have spare time before the final deadline.

Tuesday, July 15, 2014

Two weeks into second half

The work I've done over the last month and a half has been merged into the master branch of Scrapy's main repository, and it's part of the new changes in the latest 0.24 release. I'm happy that I was able to finish my contribution in time and get it to a stable stage, wrapping up the first half of my assignment.

I've started the API changes with the Spider modifications. Since these changes are mainly additions to the Spider base class, it was a reasonable task to start with. A new from_crawler class method was introduced to hand the current execution context (the crawler internals) to each spider at initialization time. This was implemented as non-intrusively as possible, since maintaining backward compatibility in this key component of every Scrapy project is a major concern.
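
The idea boils down to something like the following simplified sketch of the base class additions (attribute names here follow the patch but are illustrative):

class Spider(object):

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Build the spider as usual, then hand it the running crawler so it
        # can reach the settings, signals, stats and other internals.
        spider = cls(*args, **kwargs)
        spider._set_crawler(crawler)
        return spider

    def _set_crawler(self, crawler):
        self.crawler = crawler
        self.settings = crawler.settings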

After that, I decided to focus on some of the changes in the Crawler class. I began adjusting the init and crawl methods to the new parameters defined in my proposal. Changing the signatures of public functions is a sensitive matter, since it could easily break backward compatibility. Python doesn't support function overloading, so we have to work around it to accept different inputs in our routines.

Taking this into account, and keeping in mind that its usage has to be as simple as possible to actually be helpful, I wrote the following decorator to allow different positional arguments for the same function:

import inspect
import warnings
from functools import wraps

import six

from scrapy.exceptions import ScrapyDeprecationWarning


def deprecated_signature(old_implementation, types=None,
                         warning_msg="Calling {fname} with the given "
                         "arguments is no longer supported, please refer "
                         "to the documentation for its new usage."):
    """Decorator factory that defines a deprecated implementation with
    different signature for a given function.

    An additional dictionary of argument names and their types can be provided
    to differentiate functions with the same argument count but different
    expected types for them.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                call_bindings = inspect.getcallargs(old_implementation, *args, **kwargs)
                if types is not None:
                    _check_types(call_bindings, types)

            except TypeError:
                f = func

            else:
                warnings.warn(warning_msg.format(fname=func.__name__),
                              ScrapyDeprecationWarning, stacklevel=2)
                f = old_implementation

            return f(*args, **kwargs)

        wrapper.__name__ = func.__name__
        wrapper.__doc__ = func.__doc__

        return wrapper

    return decorator

def _check_types(bindings, types):
    for arg, value in six.iteritems(bindings):
        if arg in types and not isinstance(value, types[arg]):
            raise TypeError
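
As a quick illustration of how it's meant to be used (the fetch functions below are made up for the example and are not part of Scrapy):

def _legacy_fetch(url, timeout):
    """Old signature: took an explicit timeout as a second argument."""
    return 'fetching %s with timeout %s' % (url, timeout)

@deprecated_signature(_legacy_fetch)
def fetch(url):
    """New signature: the timeout argument was dropped."""
    return 'fetching %s' % url

fetch('http://example.com')       # binds to the new signature, no warning
fetch('http://example.com', 60)   # binds to the old one, emits a deprecation warning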

An issue with this approach is that we need a reference to the old_implementation function object, which isn't reachable if we are decorating a method and its old implementation is another method of the same class, since methods are only created after the whole class has been defined. A more flexible and convenient implementation allows dynamically loading both old_implementation and the types of the arguments:

import inspect
import warnings
from functools import wraps
from importlib import import_module

import six

from scrapy.exceptions import ScrapyDeprecationWarning


def deprecated_signature(old_implementation, types=None, module='',
                         warning_msg="Calling {fname} with the given "
                         "arguments is no longer supported, please refer "
                         "to the documentation for its new usage."):
    """Decorator factory that defines a deprecated implementation with
    different signature for a given function.

    An additional dictionary of argument names and their types can be provided
    to differentiate functions with the same argument count but different
    expected types for them.

    If `old_implementation` is a string, that reference will be dynamically
    loaded from the `module` provided. If that last argument is an empty string
    (the default), it will be expanded to the module of the decorated routine.
    If it's None, the module will be extracted from the `old_implementation`
    string itself.

    Values from the `types` dictionary can also be dynamically loaded, but it
    will always be assumed that each string contains the full path for the
    object.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                old_call = _load_or_return_object(old_implementation,
                                                  module or func.__module__)
                call_bindings = inspect.getcallargs(old_call, *args, **kwargs)
                if types is not None:
                    _check_types(call_bindings, types)

            except TypeError:
                f = func

            else:
                warnings.warn(warning_msg.format(fname=func.__name__),
                              ScrapyDeprecationWarning, stacklevel=2)
                f = old_call

            return f(*args, **kwargs)

        wrapper.__name__ = func.__name__
        wrapper.__doc__ = func.__doc__

        return wrapper

    return decorator


def _check_types(bindings, types):
    types = {k: _load_or_return_object(v) for k, v in six.iteritems(types)}
    for arg, value in six.iteritems(bindings):
        if arg in types and not isinstance(value, types[arg]):
            raise TypeError


def _load_or_return_object(path_or_object, module=None):
    if isinstance(path_or_object, six.string_types):
        if module is None:
            module, path_or_object = path_or_object.rsplit('.', 1)
        return _load_object(path_or_object, module)
    return path_or_object


def _load_object(path, module):
    parent = import_module(module)
    for children in path.split('.'):
        parent = getattr(parent, children)
    return parent
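
This covers the motivating case: the old and the new implementations can now live in the same class, with the old one referenced by a dotted string that is only resolved at call time, once the class exists. A hypothetical usage sketch (Downloader and its methods are made up for the example; with the default module='' the string is looked up in the decorated method's own module):

class Downloader(object):

    def _fetch_with_timeout(self, url, timeout):
        """Deprecated signature, kept only for backward compatibility."""
        return 'fetching %s with timeout %s' % (url, timeout)

    @deprecated_signature('Downloader._fetch_with_timeout')
    def fetch(self, url):
        """New signature: the timeout argument was dropped."""
        return 'fetching %s' % url

downloader = Downloader()
downloader.fetch('http://example.com')       # new implementation
downloader.fetch('http://example.com', 60)   # warns and calls the old method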

The next step is to merge CrawlerProcess into Crawler, but I judged it best to leave that for later, since it would probably break a lot of tests and I won't be able to verify subsequent changes until I have fixed them.

I then proceeded to work on the scrapy.cfg config file. I've deprecated two settings, SPIDER_MODULES and SPIDER_MANAGER_CLASS, in favor of options in this config file. Currently I'm working on populating them from the settings file for backward compatibility, while adding a simple check on required options and an expansion of defaults.

I haven't created a pull request yet since I'm constantly rewriting the commit history, but my progress can be tracked in my own repository, in the api-cleanup branch.

Monday, June 23, 2014

Mid-term summary

The last week has been quiet, as I was finishing my final assignment in college, which kept me busier than planned. I’ve been working on top of the pull request that I mentioned in my previous post, extending the functionality of the new settings API, but I’m waiting on its review before opening new pull requests.

These two months of researching and coding have been great. The initially overwhelming feeling about the size and complexity of the problem, back when I was working on my GSoC application, has disappeared as I’ve become familiar with Scrapy's implementation details and code base. It’s definitely easier now to come up with new feature ideas and integrate them consistently with the existing work.

Working on an open source project has surely been a test of my coding skills. Knowing that my work can be looked at by anyone, and that it will be reviewed by skilled developers in particular, has raised my standards of what makes code “good enough” for me. I’m more aware of the importance of maintaining a well-tested, documented and supported code base, as these good practices make it much easier to contribute to and adopt open source projects.

Although I made some changes to the schedule that I detailed in March, I’m satisfied with the pace of the work I've done, and I’m even more confident that the entire job will be finished by August, now that my college classes are ending. The most difficult task so far has been keeping a balance between college assignments and GSoC work, so I’m relieved to know that I can concentrate on one thing from now on.

The settings API is practically finished (there are two simple features still missing), which is roughly half the work in my proposal, and I’ll start with the backend clean-up soon.

Sunday, June 15, 2014

Fourth week into coding

It's been a month since the start of GSoC coding, and I've finished the changes that I described in my previous post. The pull request is currently under evaluation and can be seen at this URL: https://github.com/scrapy/scrapy/pull/737.

The integration of the new settings interface with the rest of the code was made easier by Scrapy's extensive set of unit tests, which ensures that the main use cases are covered and remain error-free. I had to adjust some of the tests and their auxiliary routines so that the settings configuration reflects its present usage.

The old crawler settings class was deprecated, and every occurrence of it was replaced by the new one, along with its methods and attributes. The documentation was updated to explain the new settings API, and some internal details were added for developers who write Scrapy extensions.


I'll implement the spider settings on top of these changes, since it's now trivial to add new configuration sources, but we still have to test them individually and extend the documentation.

The last feature for the settings API will be freeze support, that is, making settings immutable after they've been populated. We will issue a warning when an update is attempted after that point, so that users know that further changes have no effect.
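
A minimal sketch of the idea, assuming a dict-backed store (this is my own illustration, not the implementation that will land in Scrapy):

import warnings


class FreezableSettings(object):
    """Toy illustration of freeze support."""

    def __init__(self):
        self.attributes = {}
        self.frozen = False

    def freeze(self):
        # After this call the stored values are considered immutable.
        self.frozen = True

    def set(self, name, value):
        if self.frozen:
            warnings.warn("Settings are frozen: updating %r has no effect" % name)
            return
        self.attributes[name] = value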

I'll work on this next week, and since I'm finishing my college semester on Thursday, I hope I can speed things up.

Sunday, June 1, 2014

Second week into GSoC coding

I've decided to start with the settings API changes described in my proposal. This is one of the most time-consuming tasks, so I wanted to have it ready early and base all further changes on it. It's also a pretty self-contained assignment (although a large one) that can be merged independently before the whole project is finished.

In Scrapy there are multiple sources from which to configure your project. A detailed list of these sources, ranging from global defaults to command-line overrides, can be seen in the settings topic of the official documentation. These inputs for populating settings take different levels of precedence, which establish priorities between the configurations.


Stripped of helper functions, these are the classes that currently store settings and offer an interface to access them:

from . import default_settings


class Settings(object):

    def __init__(self, values=None):
        self.values = values.copy() if values else {}
        self.global_defaults = default_settings

    def __getitem__(self, opt_name):
        if opt_name in self.values:
            return self.values[opt_name]
        return getattr(self.global_defaults, opt_name, None)

    def get(self, name, default=None):
        return self[name] if self[name] is not None else default


class CrawlerSettings(Settings):

    def __init__(self, settings_module=None, **kw):
        super(CrawlerSettings, self).__init__(**kw)
        self.settings_module = settings_module
        self.overrides = {}
        self.defaults = {}

    def __getitem__(self, opt_name):
        if opt_name in self.overrides:
            return self.overrides[opt_name]
        if self.settings_module and hasattr(self.settings_module, opt_name):
            return getattr(self.settings_module, opt_name)
        if opt_name in self.defaults:
            return self.defaults[opt_name]
        return super(CrawlerSettings, self).__getitem__(opt_name)

We have a multipurpose Settings class where the global default values are stored. Additional key/value pairs can be provided when instantiating the class, and they will be stored in the internal values dictionary. There are two priority levels, and the precedence between them is handled by the __getitem__ method.

CrawlerSettings is a subclass of Settings, used for crawler configuration. This object has five different precedence levels, each given by a different internal dictionary or module. Values in the overrides dictionary have greater priority than values in settings_module, followed in order by values in defaults, values, and global_defaults.
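
For instance, assuming no settings_module is given (the values here are made up for illustration):

settings = CrawlerSettings(values={'DOWNLOAD_DELAY': 3})
settings.defaults['DOWNLOAD_DELAY'] = 5    # beats the values passed to the constructor
settings.overrides['DOWNLOAD_DELAY'] = 1   # beats every other source
settings['DOWNLOAD_DELAY']                 # -> 1 (5 without the override, 3 without both)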

By inspecting the code we can see that there are no actual constraints on the usage of those internal attributes and that, except for settings_module and global_defaults, they have no clear meaning.

Another issue with this implementation is that it's not easily extensible: we can't just add a new settings source or priority level without adding new conditions to the getter methods, or without relying on the order in which we update the values. This was the main concern behind the settings API changes: one of the goals of this GSoC is to introduce a way to override settings locally on spiders, so we need to add a new priority level.

The first step taken to improve this API was settling on a common set of default settings entries that we can use in our projects, and assigning each of them a priority. Priorities are represented by integers, since that's a simple and appropriate implementation that adjusts to our needs. Entries with higher priorities take precedence over entries with lower ones.

A globally defined SETTINGS_PRIORITIES dictionary on the settings module can do this job:

SETTINGS_PRIORITIES = {
    'default': 0,
    'command': 10,
    'project': 20,
    'spider': 30,
    'cmdline': 40,
}

Each level has a codename for easier identification and an associated priority. We can think of each key merely as an alias for a settings entry's priority. Scrapy's code will use this mapping instead of explicit integers to allow future changes to its definition.

The next step was to get rid of all the internal dictionaries used for storage in the settings classes. This can be done with a single dictionary, storing the priority along with the value when we set a new attribute.

Those two things can be kept in a new object for each settings attribute:

class SettingsAttribute(object):

    """Class for storing data related to settings attributes.

    This class is intended for internal usage, you should try Settings class
    for settings configuration, not this one.
    """

    def __init__(self, value, priority):
        self.value = value
        self.priority = priority

    def set(self, value, priority):
        """Sets value if priority is higher or equal than current priority."""
        if priority >= self.priority:
            self.value = value
            self.priority = priority

This class won't be interacted with directly (its usage is wrapped by the settings classes), so the priority is stored as a plain integer. There's no point in keeping settings that were overridden, so this object only stores the highest-priority value for a given attribute. The logic for updating or discarding a value depending on its priority is implemented in the set method.
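
For instance (priorities here are the plain integers from SETTINGS_PRIORITIES):

attr = SettingsAttribute('default value', 0)   # 'default' priority
attr.set('project value', 20)                  # 'project' beats 'default', value replaced
attr.set('ignored value', 10)                  # 'command' is lower, so it's discarded
attr.value                                     # -> 'project value'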

The Settings class was updated to reflect these changes:

import six


class Settings(object):

    def __init__(self, values=None, priority='project'):
        self.attributes = {}
        for name, defvalue in iter_default_settings():
            self.set(name, defvalue, 'default')
        if values is not None:
            self.setdict(values, priority)

    def __getitem__(self, opt_name):
        value = None
        if opt_name in self.attributes:
            value = self.attributes[opt_name].value
        return value

    def get(self, name, default=None):
        return self[name] if self[name] is not None else default

    def set(self, name, value, priority='project'):
        if isinstance(priority, six.string_types):
            priority = SETTINGS_PRIORITIES[priority]
        if name not in self.attributes:
            self.attributes[name] = SettingsAttribute(value, priority)
        else:
            self.attributes[name].set(value, priority)

    def setdict(self, values, priority='project'):
        for name, value in six.iteritems(values):
            self.set(name, value, priority)

First of all, backward compatibility is a great concern, so the interface of the methods is roughly the same. Because of that, a default priority is needed for the settings loaded at instantiation. Project priority seemed a good fit, since most users' scripts need this precedence level, which simplifies the API for them. Scrapy's code will use that argument explicitly, and it will be carefully documented for users digging into the internals.

A new set helper method was added so we don't have to manually handle the attributes dictionary when setting values after initialization. The priority argument can be a string or an integer. Although the former is the preferred type for this argument, the latter was added for extra flexibility.
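
For example (DOWNLOAD_DELAY is just an arbitrary setting here):

settings = Settings({'DOWNLOAD_DELAY': 1})              # stored with the default 'project' priority
settings.set('DOWNLOAD_DELAY', 0.5, priority='spider')  # higher priority, value replaced
settings.set('DOWNLOAD_DELAY', 2, priority='default')   # lower priority, discarded
settings.get('DOWNLOAD_DELAY')                          # -> 0.5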

Lastly, this new class became an abstraction over the previous ones. The behaviour of CrawlerSettings can be replicated with this class, so there is no further use for that object. It was deprecated with a Scrapy helper for this kind of situation:

from scrapy.utils.deprecate import create_deprecated_class


CrawlerSettings = create_deprecated_class(
    'CrawlerSettings', CrawlerSettings,
    new_class_path='scrapy.settings.Settings')

Progress on these changes, their documentation and tests, the adaptation of the new API across the whole code base and further discussions can be followed in the #737 pull request on Scrapy's GitHub repository.

Tuesday, March 18, 2014

Hi!

I'm Julia Medina, a software developer and a computer science student who is about to graduate. In the following blog posts I will write about my progress in this year's Google Summer of Code application.