Configuration settings in the ASPIRE package

As with any substantial application package, the ASPIRE project needed a convenient way to specify configuration settings pertaining to different parts of the computational pipeline.

What follows below are some outlines from our attempts to tackle this configuration issue. Where a supplementary (and hopefully useful) nugget is provided, or a caveat discussed, I shall append a linked numeral, like so: (n)

A brief background of ASPIRE

ASPIRE is a Python (3.6) package under development, which ingests Micrographs, the output of Cryo-Electron Microscopy (images that closely resemble television static), and comes up with a 3D reconstruction of the molecule. Read the excellent writeup on the ASPIRE page for a more comprehensive review of the package.

One of the early stages of processing in ASPIRE is APPLE Picking, a process that identifies regions in a micrograph which are very likely to be particles. See the APPLE Picker paper for full details of that algorithm.

Fig. 1: (a) A Micrograph; (b) Identified Particles

One of the key operations performed by APPLE-Picker is binary erosion of segments of the input image. This is an operation that takes in segments likely to contain particles (white) and a kernel, which is essentially a pixel template that determines the general shape/size of a candidate particle, and erodes away segments that do not conform to this shape.

For example, here is an instance of a segmented image, a kernel, and the corresponding eroded image.

Fig. 2: (a) Segmented areas of micrograph; (b) Kernel used for binary erosion; (c) Eroded segments
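
To make the erosion step concrete, here is a minimal pure-Python sketch of binary erosion on a tiny grid. It is not APPLE-Picker's implementation: the names binary_erosion, image and kernel are illustrative, and I assume pixels outside the image count as background. It only shows how a segment smaller than the kernel disappears while a larger one shrinks.

```python
def binary_erosion(image, kernel):
    """Keep a pixel only if every 'on' kernel cell, centred there, covers an 'on' pixel."""
    rows, cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    r_off, c_off = k_rows // 2, k_cols // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            keep = True
            for kr in range(k_rows):
                for kc in range(k_cols):
                    if not kernel[kr][kc]:
                        continue
                    rr, cc = r + kr - r_off, c + kc - c_off
                    # Assumption: anything outside the image is background (0).
                    outside = not (0 <= rr < rows and 0 <= cc < cols)
                    if outside or not image[rr][cc]:
                        keep = False
            out[r][c] = 1 if keep else 0
    return out


image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 1],   # the lone pixel on the right is smaller than the kernel
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

eroded = binary_erosion(image, kernel)
for row in eroded:
    print(row)
```

With a 3x3 kernel, only the centre of the 3x3 block survives, and the isolated pixel is eroded away entirely.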

The size of the kernel guides the size of suspected artifacts that will be discarded, and gets passed on as the min_size and max_size arguments of the Picker class.

class Picker:
  def __init__(self, max_size=156, min_size=19, ...):

From the point of view of the Picker class (which is only ever used internally by a more public-facing Apple class), it is absolutely okay (and indeed desirable) to have sensible defaults like above. The class can be unit-tested in isolation with minimal fuss and without a lot of setup.

However, we would like to enable the users of ASPIRE to configure this min_size value among others in a declarative way, without anyone ever having to touch code. (1)

To start then, at the conclusion…

What do we want our final code to look like?

We would like to support code like the following, where a single import allows us access to a config object, from which we access attributes as are relevant to our situation:

from aspire import config

class Apple:
  def __init__(self, output_dir=None):
    self.min_size = config.apple.min_particle_size
    ...
    optimization_fn(atol=config.apple.tol.abs)
    ...
    self.picker = Picker(max_size=self.max_size, min_size=self.min_size, ...)

Notice in the fragment above that we would like to support arbitrarily-namespaced attributes (config.apple.min_particle_size or config.apple.tol.abs), namespaces that we can (if the need arises) maintain independently.

The application will drive itself by these package-level settings (the config). However, we would also like to be able to override these settings, in scripts or notebooks for example. If we supported an apple.picker.gpu_enabled (boolean) configuration directive (default False), we might want to do:

from aspire import config
...
# run code with GPU disabled
apple.pick()
...
with config({'apple.picker.gpu_enabled': True}):
  # run code with GPU enabled
  apple.pick()

Notice above that we specify a dictionary as the override argument. This is because while we could decide to support something like:

with config(gpu_enabled=True, precision='single'):
  ...

The above syntax would not allow us to override apple.picker.gpu_enabled, since the intervening . operator(s) would mess things up syntactically.

with config(apple.picker.gpu_enabled=True)
           ^
SyntaxError: keyword can't be an expression

How do we get there?

Arbitrarily-nested structures should immediately bring JSON to mind. We could drive things off a config.json that looks like this:

...
"apple": {
	"max_particle_size": 156,
	"min_particle_size": 19,
	"minimum_overlap_amount": 7,
	"tol": {
		"abs": 1e-5,
		"rel": 1e-2
	}
	...
}
...

This config.json resides right inside the package directory.

	.
	└── src
	    ├── aspire
	    │   ├── __init__.py
	    │   └── config.json

The aspire/__init__.py reads in this config.json and initializes a module-level config object (an instance of the Config class): (2)

from importlib_resources import read_text
import aspire
from aspire.utils.config import Config

config = Config(read_text(aspire, 'config.json'))

The importlib_resources library leverages the powerful import mechanism already available in Python to load any resource (a file located in any importable package) – in our case, the resource config.json.

An aside on importlib_resources

When we're developing a distributable (and pip-installable) package, we can no longer assume that we will be able to locate the config.json file directly using facilities from the os.path module. Our whole package might be zipped up, or be available through a hitherto-uninvented package-loading mechanism. (3)

We thus rely on this excellent library to get a handle on the JSON string, using the read_text function in importlib_resources. The fact that importlib_resources is an integral part of Python 3.7 and above (as importlib.resources) should give us confidence in its longevity.

The read_text(aspire, 'config.json') line looks for config.json directly wherever the aspire package is found, which is indeed the case in our setup.
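
On Python 3.9 and later, the standard library also offers importlib.resources.files, which can do the same lookup. The sketch below is an illustration under assumptions, not ASPIRE's code: it builds a throwaway package called demo_cfg_pkg (a made-up name) in a temporary directory, so that the lookup can be tried out without ASPIRE installed. The helper names read_package_text and make_demo_package are hypothetical too.

```python
import importlib
import json
import sys
import tempfile
from importlib.resources import files  # standard library, Python 3.9+
from pathlib import Path


def read_package_text(package, resource):
    """Return the text of a resource that lives directly inside a package."""
    return files(package).joinpath(resource).read_text(encoding="utf-8")


def make_demo_package(root, name="demo_cfg_pkg"):
    """Create a tiny importable package holding a config.json (demo only)."""
    pkg_dir = Path(root) / name
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text("")
    (pkg_dir / "config.json").write_text('{"apple": {"min_particle_size": 19}}')
    sys.path.insert(0, str(root))      # make the temporary package importable
    importlib.invalidate_caches()
    return name


with tempfile.TemporaryDirectory() as root:
    package_name = make_demo_package(root)
    settings = json.loads(read_package_text(package_name, "config.json"))
    print(settings["apple"]["min_particle_size"])
```

The call takes a package and a resource name, just like read_text(aspire, 'config.json'), so swapping one for the other should be a small change if the minimum supported Python version allows it.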

importlib_resources locates our resources, but does not package them. Make sure that config.json is included in your distributable package by adding the appropriate package_data directive in setup.py. Notice also the install_requires entry needed to make things work:

setup(
    name='aspire',
    version='0.3.0',
    ...
    install_requires=[
      'importlib_resources>=1.0.2'
    ],
    package_data={'aspire': ['config.json']}
)

The Config class

The key to providing arbitrarily-nested attributes in our Config class is to utilize the types.SimpleNamespace class (available since Python 3.3). As the name suggests, this is a simple class that provides attribute access, with the added ability to initialize attributes while constructing the object:

from types import SimpleNamespace
o = SimpleNamespace(x=1, y=2)
print(o.x)

Another lesser-known feature that we use is in the json.loads function – the object_hook argument. Here is what the docstring says about it:

``object_hook`` is an optional function that will be called with the
result of any object literal decode (a ``dict``). The return value of
``object_hook`` will be used instead of the ``dict``.

With SimpleNamespace and object_hook, we have the tools to do what we want. The aspire.utils.config.Config class looks like the following:

from types import SimpleNamespace
import json
...

class Config:
  def __init__(self, json_string):
    self.namespace = json.loads(
      json_string,
      object_hook=lambda d: SimpleNamespace(**d)
    )
    ...

The object_hook is triggered any time a dict has been decoded during JSON parsing (either the top-level dict or an inner dict). We simply supply an object_hook that returns a SimpleNamespace instead of the dict, unpacking the dict using the ** operator.

The namespace attribute of Config is now a SimpleNamespace (which in turn can contain values of other types, including further SimpleNamespace objects!)
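
To see the hook at work, the small sketch below records every dict it is handed. The names to_namespace and calls are only for illustration. It suggests that inner objects are converted before the objects that contain them, which is why nested attribute access works all the way down.

```python
import json
from types import SimpleNamespace

calls = []  # the keys of each dict passed to the hook, in call order


def to_namespace(d):
    calls.append(sorted(d))
    return SimpleNamespace(**d)


text = '{"apple": {"min_particle_size": 19, "tol": {"abs": 1e-5, "rel": 1e-2}}}'
ns = json.loads(text, object_hook=to_namespace)

print(ns.apple.tol.abs)   # 1e-05
print(calls)              # [['abs', 'rel'], ['min_particle_size', 'tol'], ['apple']]
```

One consequence worth keeping in mind: a JSON key that is not a valid Python identifier cannot be reached with dot syntax, so it pays to keep the key names in config.json identifier-friendly.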

One small tweak to the class (relaying all attribute access to the namespace attribute) and we’re ready to go:

class Config:
  def __init__(self, json_string):
    self.namespace = json.loads(
      json_string,
      object_hook=lambda d: SimpleNamespace(**d)
    )
    ...

  def __getattr__(self, item):
    return getattr(self.namespace, item)
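
One caveat with relaying every attribute lookup: __getattr__ also runs on instances whose __init__ never ran (which is how the copy and pickle modules rebuild objects), and at that point self.namespace does not exist yet, so the lookup can recurse into itself. The variant below is a sketch of one common guard, not ASPIRE's actual class; GuardedConfig is a hypothetical name.

```python
import copy
import json
from types import SimpleNamespace


class GuardedConfig:
    def __init__(self, json_string):
        self.namespace = json.loads(
            json_string, object_hook=lambda d: SimpleNamespace(**d)
        )

    def __getattr__(self, item):
        # __getattr__ only runs when normal lookup fails. If 'namespace'
        # itself is missing, stop here instead of looking it up again.
        if item == "namespace":
            raise AttributeError(item)
        return getattr(self.namespace, item)


config = GuardedConfig('{"apple": {"min_particle_size": 19}}')
clone = copy.deepcopy(config)
print(config.apple.min_particle_size, clone.apple.min_particle_size)
```

The code in this post only ever deep-copies config.namespace, never the Config object itself, so the guard matters mainly if the object ends up being copied or pickled elsewhere.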

Overriding configuration values

To support the overriding mechanism we discussed earlier:

with config({'apple.picker.gpu_enabled': True}):
  apple.pick()

we return a context-manager in the __call__ method of the Config class:

def __call__(self, override_dict):
  return self.ConfigContext(self, override_dict)

This ConfigContext class can be implemented as an inner class, because nothing other than the Config class needs to know about it:

from copy import deepcopy
...
class Config:
  class ConfigContext:
    def __init__(self, config, d):
      self._original_namespace = deepcopy(config.namespace)
      self.config = config
      for k, v in d.items():
        # TODO: Set attributes in config.namespace, but recursively.
        pass

    def __enter__(self):
      return self.config

    def __exit__(self, exc_type, exc_val, exc_tb):
      self.config.namespace = self._original_namespace

The ConfigContext class gets a handle on the Config object it was created from, keeps a deepcopy of the original namespace attribute, modifies the live namespace (the TODO which we haven't yet implemented), and re-attaches the deep-copied original on __exit__.
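
For illustration, here is one way the TODO could be filled in, with a plain loop standing in for a recursive setter. This is a sketch under assumptions rather than ASPIRE's implementation: the helper _set_dotted is a hypothetical name, Config is trimmed to the essentials, and the overrides are applied in __enter__ (not in __init__ as in the outline above) so that nothing changes until the with block is actually entered.

```python
import json
from copy import deepcopy
from types import SimpleNamespace


def _set_dotted(namespace, dotted_name, value):
    """Walk 'a.b.c' down to the 'a.b' namespace, then set 'c' on it."""
    *parents, leaf = dotted_name.split(".")
    target = namespace
    for name in parents:
        target = getattr(target, name)
    setattr(target, leaf, value)


class Config:
    class ConfigContext:
        def __init__(self, config, overrides):
            self.config = config
            self.overrides = overrides
            self._original_namespace = None

        def __enter__(self):
            # Keep an untouched copy, then apply the overrides in place.
            self._original_namespace = deepcopy(self.config.namespace)
            for dotted_name, value in self.overrides.items():
                _set_dotted(self.config.namespace, dotted_name, value)
            return self.config

        def __exit__(self, exc_type, exc_val, exc_tb):
            # Restore the saved copy, even if the block raised.
            self.config.namespace = self._original_namespace

    def __init__(self, json_string):
        self.namespace = json.loads(
            json_string, object_hook=lambda d: SimpleNamespace(**d)
        )

    def __getattr__(self, item):
        return getattr(self.namespace, item)

    def __call__(self, override_dict):
        return self.ConfigContext(self, override_dict)


config = Config('{"apple": {"picker": {"gpu_enabled": false}}}')
with config({"apple.picker.gpu_enabled": True}):
    print(config.apple.picker.gpu_enabled)  # True inside the block
print(config.apple.picker.gpu_enabled)      # False again afterwards
```

Because the whole namespace is swapped back on exit, code that grabbed a reference to a sub-namespace before the block (say, picker_cfg = config.apple.picker) would keep pointing at the overridden copy; reading values through config each time avoids that surprise.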

A recursive setattr

What we now need as the final piece is a recursive version of setattr: something that can take {'apple.picker.gpu_enabled': True} and set config.apple.picker.gpu_enabled to True. We could roll our own implementation, but a very compact one is available on StackOverflow:

import functools
	
# https://stackoverflow.com/questions/31174295/getattr-and-setattr-on-nested-subobjects-chained-properties
def rsetattr(obj, attr, val):
  pre, _, post = attr.rpartition('.')
  return setattr(rgetattr(obj, pre) if pre else obj, post, val)

def rgetattr(obj, attr, *args):
  def _getattr(obj, attr):
    return getattr(obj, attr, *args)
  return functools.reduce(_getattr, [obj] + attr.split('.'))

These functions are a bit tricky to understand. Staring at the above code, we realize that the magic is happening in the line:

setattr(rgetattr(obj, pre) if pre else obj, post, val)

Spelled out: split the multi-dotted attribute name at its last dot into a left part (pre) and a final name (post). If pre is non-empty, fetch the object it refers to using rgetattr and set post on that object; otherwise set post directly on obj, as one would normally do. This has the effect of walking the nested attributes from the outside in before setting the innermost one.

rgetattr delegates its heavy lifting to functools.reduce, which applies getattr once per dotted component and feeds each result into the next lookup, so it is essentially a recursive approach to getattr.
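
A short trace may help. The sketch below repeats the two helpers so that it runs on its own, and exercises them on a hand-built SimpleNamespace (the root object and the verbose attribute are made up for the demonstration).

```python
import functools
from types import SimpleNamespace


def rgetattr(obj, attr, *args):
    def _getattr(obj, attr):
        return getattr(obj, attr, *args)
    return functools.reduce(_getattr, [obj] + attr.split('.'))


def rsetattr(obj, attr, val):
    pre, _, post = attr.rpartition('.')
    return setattr(rgetattr(obj, pre) if pre else obj, post, val)


root = SimpleNamespace(
    apple=SimpleNamespace(picker=SimpleNamespace(gpu_enabled=False))
)

# rpartition splits at the LAST dot: everything before it is the path to walk.
print('apple.picker.gpu_enabled'.rpartition('.'))
# ('apple.picker', '.', 'gpu_enabled')

rsetattr(root, 'apple.picker.gpu_enabled', True)
print(root.apple.picker.gpu_enabled)                  # True

# With no dot in the name, pre is '' and this is a plain setattr on root.
rsetattr(root, 'verbose', True)
print(root.verbose)                                   # True

# Extra positional arguments are passed to getattr as its default value.
print(rgetattr(root, 'apple.missing', 'fallback'))    # fallback
```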

Bonus features

1. Configure logging in config.json

Given that the purpose of config.json is to configure our package, why not configure logging using it too?

We add a new logging key to our config.json, something like:

{
"logging": {
	"version": 1,
	"formatters": {
		"simple_formatter": {
			"format": "%(asctime)s %(message)s",
			"datefmt": "%Y/%m/%d %H:%M:%S"
		}
	},
	"handlers": {
		"console": {
			"class": "logging.StreamHandler",
			"formatter": "simple_formatter",
			"level": "DEBUG",
			"stream": "ext://sys.stdout"
		}
	},
	"loggers": {
		"aspire": {
			"level": "DEBUG",
			"handlers": ["console"]
		}
	}
},
...
}

The Python logging module is capable of being configured using a dict, but not directly using a SimpleNamespace. No matter – we handle that as a special case in our Config class:

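import logging
import logging.config  # dictConfig lives in this submodule, which "import logging" alone does not load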
...

class Config:
  ...
  def __init__(self, json_string):
    d = json.loads(json_string)

    # The logging module supports configuration from a dictionary
    # using dictConfig, but not a SimpleNamespace,
    # so take care of that first
    if 'logging' in d:
      logging.config.dictConfig(d['logging'])
    else:
      logging.basicConfig(level=logging.INFO)

    # Now that logging is configured, reload the json, but now
    # with an object hook so we have cleaner access to keys
    # by way of (recursive) attributes
    self.namespace = json.loads(
      json_string,
      object_hook=lambda d: SimpleNamespace(**d)
    )
    ...
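
If parsing the same string twice feels wasteful, an alternative is to parse once and convert the resulting dict afterwards. The sketch below is one hypothetical way to do that, not what ASPIRE does: to_namespace and the aspire_demo logger name are made up for the example, and disable_existing_loggers is set to false so that the demo leaves other loggers alone.

```python
import json
import logging
import logging.config
from types import SimpleNamespace


def to_namespace(value):
    """Recursively turn dicts into SimpleNamespace objects; lists are walked too."""
    if isinstance(value, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in value.items()})
    if isinstance(value, list):
        return [to_namespace(v) for v in value]
    return value


json_string = """
{
  "logging": {
    "version": 1,
    "disable_existing_loggers": false,
    "handlers": {
      "console": {"class": "logging.StreamHandler", "level": "DEBUG",
                  "stream": "ext://sys.stdout"}
    },
    "loggers": {
      "aspire_demo": {"level": "DEBUG", "handlers": ["console"]}
    }
  },
  "apple": {"min_particle_size": 19}
}
"""

d = json.loads(json_string)               # parse once, as plain dicts
namespace = to_namespace(d)               # attribute access for our own code
logging.config.dictConfig(d["logging"])   # dictConfig wants the dict form

logging.getLogger("aspire_demo").debug(
    "min particle size: %s", namespace.apple.min_particle_size
)
```

The two-pass version in the post is shorter and leans entirely on json.loads; this variant trades a few extra lines for a single parse.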

2. Making config values available as arguments in scripts

It would be capital to allow our config.* values to be overridable in top-level scripts as well, so we could do:

python -m aspire apple experiment009.star --config.apple.min_particle_size 33

This is absolutely possible, and something that we have previously supported in ASPIRE. The key is to subclass argparse.ArgumentParser and add a new argument group in its constructor:

from argparse import ArgumentParser

class ConfigArgumentParser(ArgumentParser):
    def __init__(self, *args, **kwargs):
      ...
      self.add_argument_group('config')
      ...

However, the details can get somewhat tedious, so I’ll save that for another post.
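
Until then, here is a rough sketch of the general idea, which should not be read as the parser ASPIRE shipped. Everything beyond argparse itself is an assumption made for illustration: the config_namespace keyword and the helpers iter_leaves, parse_bool and collect_overrides are hypothetical names. The parser registers one --config.<dotted name> option per leaf value, and the parsed values are collected into the kind of override dictionary that the context manager accepts.

```python
import argparse
from types import SimpleNamespace


def iter_leaves(namespace, prefix=""):
    """Yield ('apple.tol.abs', value) pairs for every non-namespace value."""
    for key, value in vars(namespace).items():
        dotted = prefix + key
        if isinstance(value, SimpleNamespace):
            yield from iter_leaves(value, dotted + ".")
        else:
            yield dotted, value


def parse_bool(text):
    """type=bool would treat any non-empty string as True, so parse by hand."""
    lowered = text.lower()
    if lowered in ("1", "true", "yes"):
        return True
    if lowered in ("0", "false", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % text)


class ConfigArgumentParser(argparse.ArgumentParser):
    def __init__(self, *args, config_namespace=None, **kwargs):
        super().__init__(*args, **kwargs)
        group = self.add_argument_group("config")
        for dotted, default in iter_leaves(config_namespace or SimpleNamespace()):
            if not isinstance(default, (bool, int, float, str)):
                continue  # lists, None and the like are skipped in this sketch
            arg_type = parse_bool if isinstance(default, bool) else type(default)
            group.add_argument("--config." + dotted, type=arg_type, default=None)


def collect_overrides(args):
    """Return {'apple.min_particle_size': 33, ...} for options that were given."""
    prefix = "config."
    return {
        name[len(prefix):]: value
        for name, value in vars(args).items()
        if name.startswith(prefix) and value is not None
    }


defaults = SimpleNamespace(
    apple=SimpleNamespace(min_particle_size=19, tol=SimpleNamespace(abs=1e-5))
)
parser = ConfigArgumentParser(prog="aspire", config_namespace=defaults)
parser.add_argument("starfile")
args = parser.parse_args(
    ["experiment009.star", "--config.apple.min_particle_size", "33"]
)
print(collect_overrides(args))  # {'apple.min_particle_size': 33}
```

The resulting dictionary could then be handed to with config(...): around the command being run.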


1: We use git describe inside a somewhat elaborate get_full_version() function when generating error logs, which, in addition to providing the environment within which the error occurred, alerts us if the user has been messing with the code. But that’s a whole other discussion.

2: Some Python cognoscenti might arch their eyebrows on anything (beyond setting a __version__) being done in __init__.py, but an agreement on etiquette is something different from a taboo, and we feel this minimal setup is justified in our case.

3: There are ways to get around this; see the zip_safe flag in setuptools.