Monday, June 23, 2014

Mid-term summary

Last week has been quiet as I was doing my last assignment in college, and it has kept me busier than planned. I’ve been working on top of the pull request that I mentioned in my previous post, extending the functionality of the new settings API, but I’m waiting on its review to initiate new pull requests.

These two months of researching and coding have been great. The initial overwhelming feeling over the size and complexity of the problem when I was dealing with my GSoC application has disappeared as I got familiar with Scrapy's implementation details and code base. It’s definitely easier now to come up with new feature ideas and integrate them consistently to existing work.

Working on an open source project has surely been a test for my coding skills. Knowing that my work can be looked at by anyone and it will be especially reviewed by skilled developers has raised my standards of what makes code “good enough” for me. I’m more aware of the importance of maintaining a well tested, documented and supported base, as these good practices greatly aid to contribute to and adopt open source projects.

Although I made some changes to the schedule that I detailed in March, I’m satisfied with the pace of the work that I've done, and I’m even more confident that the entire job will be finished by August, provided that I end my college classes. The most difficult task so far has been to keep balance between college assignments and GSoC work, so I’m relieved to know that I can concentrate on one thing from now on.

The settings API is practically finished (There are two simple features still missing), which is roughly half the work on my proposal, and I’ll start with the backend clean up soon.

Sunday, June 15, 2014

Fourth week into coding

It's been a month since the start of GSoC coding, and I've finished the changes that I described in my previous post. The pull request is currently under evaluation, and can be seen in this url:

The integration of the new settings interface with the rest of the code was made easier by the extensive set of unit tests on Scrapy that ensures that the main use cases are covered and remain errorless. I had to adjust some of the tests and their auxiliary routines so the settings configuration is updated with its present usage.

The old crawler settings class was deprecated, and every occurrence of this class was replaced by the new one, along with its methods and attributes. The documentation was fixed to explain the new settings API, and some of the internal details were added for developers that write Scrapy extensions.

I'll implement the spider settings on top of these changes, since now it's trivial to add new configuration sources, but we still have to test them individually and extend the documentation.

The last feature for the settings API will be freeze support, that is, to make settings immutable after they've been populated. We will issue a warning when trying to update settings after this point, so that users know that further changes have no effect.

I'll work on this next week, and as I'm ending my college semester on Thursday I hope that I can speed things up.

Sunday, June 1, 2014

Second week into GSoC coding

I've decided to start with the settings API changes that were described in my proposal. It is one of the most time consuming tasks, so I wanted to have it ready early and base all further changes on it. It's also a pretty self contained assignment (although a large one) that can be merged independently before the whole project is finished.

In Scrapy you have multiple entry sources to configure your project. A detailed list of these sources, ranging from global defaults to command line overrides, can be seen on the settings topic in the official documentation. These inputs for populating settings take different levels of precedence that establish priorities between those configurations.

Stripped of helper functions, these are the classes that currently store settings configurations and offer an interface to access them:

from . import default_settings

class Settings(object):

    def __init__(self, values=None):
        self.values = values.copy() if values else {}
        self.global_defaults = default_settings

    def __getitem__(self, opt_name):
        if opt_name in self.values:
            return self.values[opt_name]
        return getattr(self.global_defaults, opt_name, None)

    def get(self, name, default=None):
        return self[name] if self[name] is not None else default

class CrawlerSettings(Settings):

    def __init__(self, settings_module=None, **kw):
        super(CrawlerSettings, self).__init__(**kw)
        self.settings_module = settings_module
        self.overrides = {}
        self.defaults = {}

    def __getitem__(self, opt_name):
        if opt_name in self.overrides:
            return self.overrides[opt_name]
        if self.settings_module and hasattr(self.settings_module, opt_name):
            return getattr(self.settings_module, opt_name)
        if opt_name in self.defaults:
            return self.defaults[opt_name]
        return super(CrawlerSettings, self).__getitem__(opt_name)

We have a multipurpose Settings class where global default values are stored. Additional key/value pairs can be provided while instantiating the class, and will be stored on the internal values dictionary. There are two priority levels, and the precedence between the two is taken care of by the __getitem__ method.

CrawlerSettings is an inheritance of the Settings class, used for crawler configuration. This object has five different precedence levels, each given by being in a different internal dictionary or module. Values in the overrides dictionary have greater priority than values in settings_module, followed in order by values in defaults, values, and global_defaults.

By inspecting the code we can see that there are no actual constraints in the usage of those internal attributes and, except for settings_module and global_defaults, there isn't a clear meaning for them.

Another issue with this implementation is that it's not easily extensible, we can't just add a new settings source or priority level without considering new conditions in the getter methods, or relying in the order that we update the values. This was the main concern behind the settings api changes: one of the goals of this GSoC is to introduce a way to override local settings on spiders, so we need to add a new priority level.

The first step that was taken for improving this api was settling a common ground for the default settings entries that we can use in our projects, and assigning them a priority. Priorities were measured by integers, since it's a simple and appropriate implementation that adjust to our needs. Entries with higher priorities will take more precedence over lower ones.

A globally defined SETTINGS_PRIORITIES dictionary on the settings module can do this job:

    'default': 0,
    'command': 10,
    'project': 20,
    'spider': 30,
    'cmdline': 40,

Each level has a codename for easier identification and an associated priority. We can think of each key merely as an alias to a settings entry's priority. Scrapy's code will use this mapping instead of explicit integers to allow future changes on its definition.

The next thing was trying to get rid of all the internal dictionaries used for storage in the settings classes. This can be done using a single dictionary and storing the priority along with the value when we set a new attribute.

Those two things can be kept on a new object for a settings attribute:

class SettingsAttribute(object):

    """Class for storing data related to settings attributes.

    This class is intended for internal usage, you should try Settings class
    for settings configuration, not this one.

    def __init__(self, value, priority):
        self.value = value
        self.priority = priority

    def set(self, value, priority):
        """Sets value if priority is higher or equal than current priority."""
        if priority >= self.priority:
            self.value = value
            self.priority = priority

This class won't be interacted with directly (we can wrap its usage with the settings classes), so priority is stored by its plain integer value. There's no point in keeping settings that were overridden, so this object will only store the highest priority value for a given attribute. The logic regarding the update or discard of a value depending on its priority is implemented on the set method.

The Settings class was updated to show these changes:

import six

class Settings(object):

    def __init__(self, values=None, priority='project'):
        self.attributes = {}
        for name, defvalue in iter_default_settings():
            self.set(name, defvalue, 'default')
        if values is not None:
            self.setdict(values, priority)

    def __getitem__(self, opt_name):
        value = None
        if opt_name in self.attributes:
            value = self.attributes[opt_name].value
        return value

    def get(self, name, default=None):
        return self[name] if self[name] is not None else default

    def set(self, name, value, priority='project'):
        if isinstance(priority, six.string_types):
            priority = SETTINGS_PRIORITIES[priority]
        if name not in self.attributes:
            self.attributes[name] = SettingsAttribute(value, priority)
            self.attributes[name].set(value, priority)

    def setdict(self, values, priority='project'):
        for name, value in six.iteritems(values):
            self.set(name, value, priority)

First of all, backward compatibility is a great concern, so the interface of the methods is roughly the same. A default priority is needed for the settings loaded on instantiation because of that. Project priority seemed a good fit since most users' scripts need this precedence level, so this will simplify the api for them. Scrapy's code will explicitly use that argument and will be carefully documented for users digging internals.

A new set helper method was added so we don't have to manually handle the attributes dictionary for setting new values after initialization. The priority argument can be a string or an integer. Although the former is the preferred type for this arg, the latter was added for further flexibility.

Lastly, this new class became an abstraction for the previous ones. The behaviour of CrawlerSettings is replicable with this class, so there is no more usage for that object. It was deprecated with a Scrapy helper for these kind of situations:

from scrapy.utils.deprecate import create_deprecated_class

CrawlerSettings = create_deprecated_class(
    'CrawlerSettings', CrawlerSettings,

Process on these changes, their documentation and tests, the adaptation of this new api across all the code and further discussions can be followed on the #737 pull request in Scrapy's github repository.