Monday, July 28, 2014

I've opened pull request #816 on Scrapy's repository with the progress I've made on the API refactoring. I'm keeping a list in the PR description with the current status of the tasks assigned to me, along with the backward-incompatible changes that have been introduced.
I've updated the Spider class's methods, the Crawler's initialization and the SpiderManager's interface, following the design explained in my proposal. After reviewing the complexity added to support deprecated features, we agreed to relax the backward compatibility requirements in order to prioritize code simplicity.
Over the next weeks I'll wrap up the changes, adding the missing tests and documentation and getting the PR to a stable, mergeable state. I also plan to address further API-related issues if I have spare time before the final deadline.
Tuesday, July 15, 2014
Two weeks into second half
The work I've done over the last month and a half has been merged into the master branch of Scrapy's main repository, and it's part of the changes in the recent 0.24 release. I'm happy that I was able to finish my contribution in time and get it to a stable state, wrapping up the first half of my assignment.
I've started the changes to the API with the Spider modifications. Since these changes are mainly additions to the Spider base class, it was a reasonable task to start with. A new from_crawler class method was introduced to handle passing the current execution context (the Crawler internals) to each spider at initialization time. This was implemented as non-intrusively as possible, since maintaining backward compatibility in a component that is key to every Scrapy project is a major concern.
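The general shape of the idea is sketched below. This is a simplified illustration rather than the exact code in the branch: the class method builds the spider as usual and attaches a reference to the running Crawler, so plain Spider(...) instantiation keeps working for existing projects.

class Spider(object):

    def __init__(self, name=None, **kwargs):
        self.name = name
        self.__dict__.update(kwargs)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Build the spider the usual way, then expose the execution
        # context (settings, signals, stats) through the crawler reference.
        spider = cls(*args, **kwargs)
        spider.crawler = crawler
        spider.settings = crawler.settings
        return spider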
After that, I decided to focus on some of the changes in the Crawler class. I began adjusting the __init__ and crawl methods to the new parameters previously defined in my proposal. Changing the signature of public functions is a sensitive matter, since it can easily break backward compatibility, and Python doesn't support function overloading, so we have to work around accepting different inputs in the same routine.
Taking this into account, and keeping in mind that its usage has to be as simple as possible to actually be helpful, I wrote the following decorator to allow different positional arguments in the same function:
import inspect
import warnings
from functools import wraps

import six

from scrapy.exceptions import ScrapyDeprecationWarning


def deprecated_signature(old_implementation, types=None,
                         warning_msg="Calling {fname} with the given "
                                     "arguments is no longer supported, please refer "
                                     "to the documentation for its new usage."):
    """Decorator factory that defines a deprecated implementation with
    different signature for a given function.

    An additional dictionary of argument names and their types can be
    provided to differentiate functions with the same argument count but
    different expected types for them.
    """
    def decorator(func):

        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                # Do the given arguments fit the old signature?
                call_bindings = inspect.getcallargs(old_implementation,
                                                    *args, **kwargs)
                if types is not None:
                    _check_types(call_bindings, types)
            except TypeError:
                # They don't: dispatch to the new implementation.
                f = func
            else:
                # They do: warn and dispatch to the deprecated one.
                warnings.warn(warning_msg.format(fname=func.__name__),
                              ScrapyDeprecationWarning, stacklevel=2)
                f = old_implementation
            return f(*args, **kwargs)

        wrapper.__name__ = func.__name__
        wrapper.__doc__ = func.__doc__
        return wrapper

    return decorator


def _check_types(bindings, types):
    for arg, value in six.iteritems(bindings):
        if arg in types and not isinstance(value, types[arg]):
            raise TypeError
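For instance, the decorator could be applied along these lines. This is a hypothetical usage example, not code from the branch: fetch is a made-up function whose old version took a Request object while the new one takes a plain URL, and calls matching the old signature emit a ScrapyDeprecationWarning and are forwarded to the deprecated implementation.

from scrapy.http import Request


def _old_fetch(request):
    # Deprecated entry point: delegate to the new implementation.
    return fetch(request.url)


@deprecated_signature(_old_fetch, types={'request': Request})
def fetch(url):
    # New implementation, working on a plain URL string.
    ...

Since both versions take a single positional argument, the types dictionary is what tells the two signatures apart.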
An issue with this approach is that we need a reference to the old_implementation function object, which is not reachable if we are decorating a method and its old implementation is another method of the same class, since methods only become attributes of the class once the whole class body has been executed. A more flexible and convenient implementation allows dynamically loading both old_implementation and the argument types:
import inspect
import warnings
from functools import wraps
from importlib import import_module

import six

from scrapy.exceptions import ScrapyDeprecationWarning


def deprecated_signature(old_implementation, types=None, module='',
                         warning_msg="Calling {fname} with the given "
                                     "arguments is no longer supported, please refer "
                                     "to the documentation for its new usage."):
    """Decorator factory that defines a deprecated implementation with
    different signature for a given function.

    An additional dictionary of argument names and their types can be
    provided to differentiate functions with the same argument count but
    different expected types for them.

    If `old_implementation` is a string, that reference will be dynamically
    loaded from the `module` provided. If that last argument is an empty
    string (default), it will be expanded to the module of the decorated
    routine. If it's None, the module will be stripped out of the
    `old_implementation` string.

    Values from the `types` dictionary can also be dynamically loaded, but
    it will always be assumed that each string contains the full path of
    the object.
    """
    def decorator(func):

        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                # Resolve the deprecated implementation lazily, at call time.
                # An empty module defaults to the decorated function's module;
                # None is preserved so it can be parsed out of the dotted path.
                old_call = _load_or_return_object(
                    old_implementation,
                    func.__module__ if module == '' else module)
                call_bindings = inspect.getcallargs(old_call, *args, **kwargs)
                if types is not None:
                    _check_types(call_bindings, types)
            except TypeError:
                f = func
            else:
                warnings.warn(warning_msg.format(fname=func.__name__),
                              ScrapyDeprecationWarning, stacklevel=2)
                f = old_call
            return f(*args, **kwargs)

        wrapper.__name__ = func.__name__
        wrapper.__doc__ = func.__doc__
        return wrapper

    return decorator


def _check_types(bindings, types):
    types = {k: _load_or_return_object(v) for k, v in six.iteritems(types)}
    for arg, value in six.iteritems(bindings):
        if arg in types and not isinstance(value, types[arg]):
            raise TypeError


def _load_or_return_object(path_or_object, module=None):
    if isinstance(path_or_object, six.string_types):
        if module is None:
            module, path_or_object = path_or_object.rsplit('.', 1)
        return _load_object(path_or_object, module)
    return path_or_object


def _load_object(path, module):
    parent = import_module(module)
    for children in path.split('.'):
        parent = getattr(parent, children)
    return parent
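With this variant the deprecated implementation can be referenced by its dotted name, so it may live in the same class as the decorated method and is only resolved when the call is made. A hypothetical example with made-up names (not actual Scrapy code), assuming the old entry point received a scrapy.settings.Settings object:

class MyComponent(object):

    @deprecated_signature('MyComponent.old_process',
                          types={'settings': 'scrapy.settings.Settings'})
    def process(self, crawler):
        # New entry point, taking the Crawler object.
        ...

    def old_process(self, settings):
        # Deprecated entry point, kept for backward compatibility.
        ...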
The next step is to merge CrawlerProcess into Crawler, but I judged it best to leave that for later, since it will probably break a lot of tests and I wouldn't be able to verify subsequent changes until those are fixed.
I then proceeded to work on the scrapy.cfg config file. I've deprecated two settings, SPIDER_MODULES and SPIDER_MANAGER_CLASS, in favor of options in this config file. Currently I'm working on populating those options from the settings file for backward compatibility, while adding a simple check for required options and expansion of defaults.
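A rough sketch of the backward-compatibility idea follows. The section and option names are illustrative only, since the final ones are still being worked out; the point is that values come from scrapy.cfg when present and fall back to the deprecated settings otherwise.

from six.moves import configparser


def spider_manager_options(settings):
    # Deprecated settings act as fallback defaults.
    options = {
        'spider_modules': settings.get('SPIDER_MODULES'),
        'spider_manager_class': settings.get('SPIDER_MANAGER_CLASS'),
    }
    cfg = configparser.SafeConfigParser()
    cfg.read(['scrapy.cfg'])
    if cfg.has_section('spiders'):  # hypothetical section name
        options.update(cfg.items('spiders'))
    return options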
I haven't created a pull request yet, since I'm constantly rewriting the commit history, but my progress can be tracked in the api-cleanup branch of my repository.