Monday, July 28, 2014

Four weeks into 2nd half

I've opened pull request #816 on Scrapy with the progress I've made on the API refactoring. I'm keeping a list in the PR description with the current status of the tasks assigned to me, along with the backward-incompatible changes that were introduced.

I've updated the Spider class' methods, Crawler's initialization and SpiderManager's interface following the design explained in my proposal. After reviewing the complexity added to support deprecated features, we agreed to relax the backward compatibility requirements in order to prioritize code simplicity.

Over the next weeks I'll wrap up the changes, adding the missing tests and documentation and getting the PR to a stable and mergeable state. If I have spare time before the final deadline, I also plan to address further API-related issues.

Tuesday, July 15, 2014

Two weeks into second half

The work I've done over the last month and a half has been merged into the master branch of Scrapy's main repository, and it's part of the new 0.24 release. I'm happy that I was able to finish my contribution in time and get it to a stable state, wrapping up the first half of my assignment.

I've started the API changes with the Spider modifications. Since these changes are mainly additions to the Spider base class, it was a reasonable task to start with. A new from_crawler class method was introduced to hand the current execution context (the Crawler internals) to each spider at initialization time. This was implemented as non-intrusively as possible, since maintaining backward compatibility for this key component of every Scrapy project is a major concern.
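
To illustrate the idea, here is a minimal sketch of my own of the pattern (not the exact code that was merged): the class method builds the spider as usual and then attaches the crawler context to it, so existing __init__ overrides keep working.

class Spider(object):

    def __init__(self, name=None, **kwargs):
        self.name = name
        self.__dict__.update(kwargs)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Build the spider exactly as before, then attach the execution
        # context so the spider can reach the crawler and its settings.
        spider = cls(*args, **kwargs)
        spider.crawler = crawler
        spider.settings = crawler.settings
        return spider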

After that, I decided to focus on some of the changes in the Crawler class. I began adjusting the __init__ and crawl methods to the new parameters previously defined in my proposal. Changing the signature of a public function is a sensitive matter, since it can easily break backward compatibility, and Python doesn't support function overloading, so we have to work around accepting different inputs for the same routine.

Taking this into account, and keeping in mind that its usage has to be as simple as possible to actually be helpful, I wrote the following decorator to allow different positional arguments for the same function:

import inspect
import warnings
from functools import wraps

import six

from scrapy.exceptions import ScrapyDeprecationWarning


def deprecated_signature(old_implementation, types=None,
                         warning_msg="Calling {fname} with the given "
                         "arguments is no longer supported, please refer "
                         "to the documentation for its new usage."):
    """Decorator factory that defines a deprecated implementation with a
    different signature for a given function.

    An additional dictionary of argument names and their types can be provided
    to differentiate functions with the same argument count but different
    expected types for them.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                # If the arguments fit the old signature (and types), the
                # call is assumed to target the deprecated implementation.
                call_bindings = inspect.getcallargs(old_implementation,
                                                    *args, **kwargs)
                if types is not None:
                    _check_types(call_bindings, types)

            except TypeError:
                # Arguments don't match the old signature: use the new one.
                f = func

            else:
                warnings.warn(warning_msg.format(fname=func.__name__),
                              ScrapyDeprecationWarning, stacklevel=2)
                f = old_implementation

            return f(*args, **kwargs)

        wrapper.__name__ = func.__name__
        wrapper.__doc__ = func.__doc__

        return wrapper

    return decorator


def _check_types(bindings, types):
    for arg, value in six.iteritems(bindings):
        if arg in types and not isinstance(value, types[arg]):
            raise TypeError
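
To make the intended usage concrete, here is a small example of my own using the decorator above (the names are made up and not part of Scrapy's API): the deprecated form delegates to the new one, and calls that match its signature trigger the warning.

def _add_from_string(text):
    """Deprecated form: add("1+2")."""
    a, b = (int(part) for part in text.split('+'))
    return add(a, b)

@deprecated_signature(_add_from_string, types={'text': str})
def add(a, b):
    """New form: add(1, 2)."""
    return a + b

add(1, 2)    # Matches the new signature, no warning.
add("1+2")   # Matches the old one: warns and delegates to _add_from_string.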

An issue with this approach is that we need a reference to the old_implementation function object, which isn't reachable if we are decorating a method and its old implementation is another method of the same class, since methods only become accessible as attributes once the whole class is defined. A more flexible and convenient implementation allows dynamically loading both old_implementation and the types of the arguments:

# In addition to the imports above, this version needs import_module.
from importlib import import_module


def deprecated_signature(old_implementation, types=None, module='',
                         warning_msg="Calling {fname} with the given "
                         "arguments is no longer supported, please refer "
                         "to the documentation for its new usage."):
    """Decorator factory that defines a deprecated implementation with a
    different signature for a given function.

    An additional dictionary of argument names and their types can be provided
    to differentiate functions with the same argument count but different
    expected types for them.

    If `old_implementation` is a string, that reference will be dynamically
    loaded from the `module` provided. If that last argument is an empty string
    (the default) it will be expanded to the module of the decorated routine.
    If it's None, the module will be extracted from the `old_implementation`
    string.

    Values from the `types` dictionary can also be dynamically loaded, but it
    will always be assumed that each string contains the full path for the
    object.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                # The old implementation is resolved at call time, so it can
                # be a sibling method that didn't exist at decoration time.
                old_call = _load_or_return_object(old_implementation,
                                                  module or func.__module__)
                call_bindings = inspect.getcallargs(old_call, *args, **kwargs)
                if types is not None:
                    _check_types(call_bindings, types)

            except TypeError:
                # Arguments don't match the old signature: use the new one.
                f = func

            else:
                warnings.warn(warning_msg.format(fname=func.__name__),
                              ScrapyDeprecationWarning, stacklevel=2)
                f = old_call

            return f(*args, **kwargs)

        wrapper.__name__ = func.__name__
        wrapper.__doc__ = func.__doc__

        return wrapper

    return decorator


def _check_types(bindings, types):
    # Type references given as strings are resolved to the actual objects.
    types = {k: _load_or_return_object(v) for k, v in six.iteritems(types)}
    for arg, value in six.iteritems(bindings):
        if arg in types and not isinstance(value, types[arg]):
            raise TypeError


def _load_or_return_object(path_or_object, module=None):
    if isinstance(path_or_object, six.string_types):
        if module is None:
            module, path_or_object = path_or_object.rsplit('.', 1)
        return _load_object(path_or_object, module)
    return path_or_object


def _load_object(path, module):
    # Walk the dotted path starting from the imported module.
    parent = import_module(module)
    for child in path.split('.'):
        parent = getattr(parent, child)
    return parent
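
As a quick illustration of the dynamic form (again with made-up names, not Scrapy's API), the old implementation can now be referenced by a string from inside the class body, since it only gets resolved when the wrapper is actually called:

class Calculator(object):

    def _add_from_string(self, text):
        """Deprecated form: add("1+2")."""
        a, b = (int(part) for part in text.split('+'))
        return self.add(a, b)

    @deprecated_signature('Calculator._add_from_string',
                          types={'text': 'six.string_types'})
    def add(self, a, b):
        """New form: add(1, 2)."""
        return a + b

calc = Calculator()
calc.add(1, 2)    # New signature, no warning.
calc.add("1+2")   # Old signature: warns and delegates.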

The next step is to merge CrawlerProcess into Crawler, but I judged it best to leave that for later, since it will probably break a lot of tests and I won't be able to verify subsequent changes until I've fixed them.

I then proceeded to work on the scrapy.cfg config file. I've deprecated two settings, SPIDER_MODULES and SPIDER_MANAGER_CLASS, in favor of options in this config file. Currently I'm working on populating those options from the settings file for backward compatibility, while adding a simple check for required options and expansion of defaults.
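
The backward compatibility layer boils down to something like the following sketch (the option name, section and helper are placeholders, not the final ones):

import warnings

from scrapy.exceptions import ScrapyDeprecationWarning
from scrapy.utils.conf import get_config  # Parses the scrapy.cfg INI file.


def get_spider_modules(settings):
    cfg = get_config()
    if cfg.has_option('settings', 'spider_modules'):
        return cfg.get('settings', 'spider_modules').split(',')
    # Fall back to the deprecated setting, warning about its usage.
    if settings.getlist('SPIDER_MODULES'):
        warnings.warn("The SPIDER_MODULES setting is deprecated, use the "
                      "spider_modules option in scrapy.cfg instead.",
                      ScrapyDeprecationWarning)
        return settings.getlist('SPIDER_MODULES')
    return []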

I haven't created a pull request yet since I'm constantly rewriting the commit history, but my progress can be tracked on my local repository, in the api-cleanup branch.