[Dev] Upgrading Chandler

Phillip J. Eby pje at telecommunity.com
Mon Oct 10 10:00:59 PDT 2005


Overview
========

With the advent of usable calendaring in 0.6, we have a new and scary thing 
to think about: needing to support actual users.  :)  Or more specifically, 
being able to upgrade a Chandler installation without recreating all its data.

There are four kinds of things that we need to be able to upgrade:

1. Python code
2. Parcel-defined items, including UI items
3. Parcel-defined repository schema
4. User data that may need to be changed to reflect a schema change

Few - if any - of these items can currently be upgraded without recreating 
your repository.  Many are largely unexplored problems.

Luckily, we don't need to solve all of these upgrade problems for 0.6, 
although now is a good time to start thinking about them, to make sure that 
we have at least some basis for doing them in the future.

In this proposal, I'll be focusing first on how we can make it possible to 
make code and UI changes without needing even to *restart* Chandler, so 
that developers can make and test changes more quickly.  But I'll also be 
exploring what we can do to detect schema changes or parcel version 
changes, so that we're in a better position to support future upgrades.


Reloading Code
==============

In general, reloading Python code is a hard problem to completely 
solve.  This is because a module that imports another module may use 
imported objects during its initialization - for example to subclass an 
imported class.  This means that if the imported module is reloaded, the 
importing module can become out-of-date.

However, for most simple development use cases - which mainly involve 
changes to functions or to methods of existing classes - it should be 
possible to work around this issue.  I propose to add a metaclass to the 
schema API that will allow classes to be redefined during a reload() 
operation in such a way that the original class is modified in-place, 
instead of being replaced with a new class.  This will allow a simple 
reload() operation on a module to update the methods of a class.  And, by 
default, Item classes will have this ability.  Non-Item classes will need 
to make explicit use of the metaclass.

There are, however, some side effects.  The metaclass will have no way to 
know whether a reload is taking place, except by whether there is already a 
symbol of the same name as the class in the module.  When a module is 
reloaded, the existing version of the class will still be in the module's 
dictionary when the new version is being defined.  So, the metaclass will 
check for the existing class, and then update that existing class instead 
of replacing it.
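
A minimal sketch of that check, written for Python 3 (``metaclass=`` keyword) rather than the ``__metaclass__`` spelling used elsewhere in this proposal.  The attribute-copying strategy and the ``__classcell__`` handling are my assumptions for illustration, not the schema API's actual implementation, and the ``demo_parcel`` module is made up:

```python
import sys
import types


class ReloadableClass(type):
    """Illustrative metaclass: update an existing class in place on reload."""

    def __new__(meta, name, bases, namespace):
        module_name = namespace.get("__module__")
        module = sys.modules.get(module_name)
        existing = getattr(module, name, None)
        if isinstance(existing, meta) and existing.__module__ == module_name:
            # A class of this name is already in the module's dictionary,
            # so treat this definition as a reload and patch the old class.
            for key, value in namespace.items():
                if key != "__classcell__":
                    setattr(existing, key, value)
            cell = namespace.get("__classcell__")
            if cell is not None:
                cell.cell_contents = existing  # keep zero-argument super() working
            return existing
        return super().__new__(meta, name, bases, namespace)


# Simulate loading and then reloading a module named "demo_parcel".
demo = types.ModuleType("demo_parcel")
demo.ReloadableClass = ReloadableClass
sys.modules["demo_parcel"] = demo

SOURCE = (
    "class Greeter(metaclass=ReloadableClass):\n"
    "    def greet(self):\n"
    "        return %r\n"
)
exec(SOURCE % "hello", demo.__dict__)
original_class = demo.Greeter
instance = original_class()  # created before the "reload"

exec(SOURCE % "hi again", demo.__dict__)  # the "reload"
```

After the second ``exec``, ``demo.Greeter`` is still the original class object, and the instance created beforehand picks up the new ``greet()``.  Methods removed from the source are not removed from the class in this sketch.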


Name Collisions
---------------

The downside to this approach is that it can be fooled into thinking a 
reload is taking place, if an object of the same name already exists in the 
module at that point in time.  For example, this is perfectly legal Python 
code, but will not work the same way once the metaclass is used::

     from somewhere import SomeItemClass

     class SomeItemClass(SomeItemClass):
         def foo(self):
             return self.bar

Without the metaclass, this does exactly what it looks like it does - it 
creates a ``SomeItemClass`` subclass of ``somewhere.SomeItemClass``.  But 
*with* the metaclass, this will *overwrite* ``somewhere.SomeItemClass`` 
with the contents of the new class, because the metaclass will think you 
are reloading the module.

Actually, in this simple example, the metaclass could check the 
``__module__`` of the class in question, and give you an error message.  The 
error would occur even at initial import, and you'd quickly change your code 
to something like this::

     from somewhere import SomeItemClass as _SomeItemClass

     class SomeItemClass(_SomeItemClass):
         def foo(self):
             return self.bar

which would immediately fix the problem.  However, if you do something like 
this::

     class SomeItemClass(schema.Item):
         pass

     class SomeItemClass(SomeItemClass):
         pass

there is no way to detect the problem, at least if we also allow changing a 
class' inheritance tree when code is reloaded.  If we require a class' base 
classes to remain the same across reloads, then we could detect this error 
by virtue of the different inheritance, and we could again give you an 
error message so you'd change your code.

This is probably the best option, although it prevents you changing a 
class' bases without restarting Chandler.  I would expect base class 
changes to be rare, however, so this is probably an acceptable convenience 
vs. safety tradeoff.  I propose the error message for any of the above 
collisions to read something like::

    NameError: SomeItemClass already defined in module blah.blah; please
    rename either the existing class or the new class

And it would occur as soon as the name collision exists, not just at reload 
time.  However, if you only introduce the collision between reloads, then 
of course it will occur when you reload.
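
The two checks described above (same ``__module__``, same base classes) could be sketched roughly as follows.  ``check_redefinition`` is a hypothetical helper name, and the ``Base`` class stands in for ``schema.Item``; a real metaclass would call something like this before deciding to update a class in place:

```python
def check_redefinition(existing, name, module_name, new_bases):
    """Raise NameError unless `existing` looks like an earlier load of this class.

    `existing` is whatever the module currently binds to `name`, or None.
    """
    if existing is None:
        return  # first definition, nothing to collide with
    same_module = getattr(existing, "__module__", None) == module_name
    same_bases = getattr(existing, "__bases__", None) == tuple(new_bases)
    if not (same_module and same_bases):
        raise NameError(
            "%s already defined in module %s; please rename either "
            "the existing class or the new class" % (name, module_name)
        )


class Base:
    pass


class SomeItemClass(Base):
    pass


here = SomeItemClass.__module__

# A reload redefines the class with the same module and bases: accepted.
check_redefinition(SomeItemClass, "SomeItemClass", here, (Base,))

# "class SomeItemClass(SomeItemClass)" changes the bases: rejected.
message = None
try:
    check_redefinition(SomeItemClass, "SomeItemClass", here, (SomeItemClass,))
except NameError as error:
    message = str(error)
```

The imported-name case is caught by the module comparison, since the existing class reports the module it was originally defined in.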

The metaclass would be called ``schema.ReloadableClass``, so if you need to 
use it in a non-Item class, you would do something like::

    class MyArbitraryNonItemClass(SomeBase):
        __metaclass__ = schema.ReloadableClass

And the same name collision rules would apply as for item classes.


Reloading Functions
-------------------

To support reloading of module-level functions, there will be a 
``schema.reloadable`` decorator, used as follows::

     @schema.reloadable
     def some_function(some_arg, other_arg, ...):
         # whatever

The purpose of this decorator is to allow a function to be updated 
in-place, even if another module has already imported it.  The only time 
you would use this is if you are changing the function and want to reload 
it.  In other words, the function would normally look like this::

     def some_function(some_arg, other_arg, ...):
         # whatever

If you need to change the function while Chandler is running, then you 
would add the ``@schema.reloadable`` line, make the change, and reload the 
module.  But, before you check your changes back in to Subversion, you 
should remove the decorator, just as you would remove debugging 
prints.  It's strictly a development tool, needed only for top-level 
functions, and only ones that you're editing while Chandler is running.

There are some rather strict limitations on what this decorator can do, by 
the way.  It must be the "outermost" (first) decorator for a given 
function, and any nested decorators must preserve the function name in any 
transform.  You won't be able to add new required arguments, or rename the 
previous arguments.  However, these kinds of changes are unlikely to be the 
sort you could make without restarting Chandler anyway.
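
For illustration, here is one way such a decorator could work in Python 3: it looks for an existing function of the same name in the module's globals and moves the new code onto it.  This is an assumption about the mechanism, not necessarily what ``schema.reloadable`` would do, and the ``demo_module`` namespace is made up:

```python
import types


def reloadable(func):
    """Illustrative decorator: update a module-level function in place on reload."""
    existing = func.__globals__.get(func.__name__)
    if (
        isinstance(existing, types.FunctionType)
        and existing is not func
        and existing.__module__ == func.__module__
    ):
        # The module already binds this name to a function, so assume a
        # reload and move the new behavior onto the old function object.
        # Assigning __code__ fails if the two functions close over a
        # different number of variables, so this only suits plain functions.
        existing.__code__ = func.__code__
        existing.__defaults__ = func.__defaults__
        existing.__kwdefaults__ = func.__kwdefaults__
        existing.__doc__ = func.__doc__
        existing.__dict__.update(func.__dict__)
        return existing
    return func


# Simulate a module being loaded and then reloaded after an edit.
module_globals = {"__name__": "demo_module", "reloadable": reloadable}
exec("@reloadable\ndef area(w, h):\n    return w + h  # bug\n", module_globals)
imported_area = module_globals["area"]  # like "from demo_module import area"

exec("@reloadable\ndef area(w, h):\n    return w * h  # fixed\n", module_globals)
```

Because the lookup is by name, the sketch shares the restriction that any decorator applied underneath must preserve the function's name.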

The most likely place where you'd need to use this decorator right now is 
on ``installParcel()`` functions that are defined in one module, but *used* 
in another via importing.  This would also apply to utility routines 
defined in one module, but imported in another module for use by an 
``installParcel()`` function.  For example, if you have a parcel that does 
this::

     from some.where import createMenus

     def installParcel(parcel, oldVersion=None):
         createMenus(parcel)

You would need to add the ``@schema.reloadable`` decorator to the 
``createMenus()`` function definition in ``some.where`` if you wanted to 
change ``createMenus()`` without restarting Chandler.  (Of course, you 
would then also need to reload the parcels that are using the 
``createMenus()`` function, which is the subject of the next section of 
this proposal.)


Updating Parcel-Defined Items and UI
====================================

Merely reloading a Python module doesn't affect what items are in the 
repository, even if you've edited the ``installParcel()`` function or a 
utility function it calls.  So, there needs to be a way to reload a parcel 
and update the items it contains.

Luckily, the mechanisms normally used in ``installParcel()`` should update 
existing items in-place, so really the only special thing that needs to be 
done to allow updating on-the-fly is providing a way to re-invoke 
``installParcel()``.

My current thought is that the way to expose this API would be to add a 
``reload()`` method to ``schema.ns()``, e.g.::

     pim = schema.ns('osaf.pim', view)
     pim.reload()  # reload the osaf.pim parcel (but not subparcels!)

This would perform a reload of the module (and the package, if the parcel 
is a package), and then reinvoke the ``installParcel()`` for the parcel, to 
reload the items.  Since this would also take care of reloading code, this 
would probably be the thing to run to update a changed parcel.  Someone 
could perhaps provide a test-menu option to do this, that would ask for the 
parcel name.  Of course, it could also be done by just dropping into a 
PyShell.  Users of the 'headless' utility, or those running Chandler under 
a debugger, could also invoke the operation directly.
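
The core of such a ``reload()`` could be as small as the sketch below, which uses Python 3's ``importlib.reload``.  ``reload_parcel`` is a hypothetical stand-in for the proposed method; it takes the module and parcel directly instead of looking them up from a view, and it does not handle the package case:

```python
import importlib


def reload_parcel(module, parcel, old_version=None):
    """Reload `module`, then re-run its installParcel() hook if it has one."""
    module = importlib.reload(module)
    install = getattr(module, "installParcel", None)
    if install is not None:
        install(parcel, old_version)
    return module
```

Re-running ``installParcel()`` only helps to the extent that the function updates existing items rather than creating duplicates.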

This feature will *not*, however, handle general updates to the repository 
schema.  In fact, only one kind of schema change will be supported: adding 
new classes.  If you add a class to a parcel and reload it -- assuming 
you've done the import in ``__init__.py``, if needed -- then the new kind will 
become available.  Changes to existing classes will be ignored, unless you 
recreate the repository.  Which is why the next section will talk about...


Updating Chandler Schema
========================

     "Do you, Programmer, take this Object to be part of the persistent 
     state of your application, to have and to hold, through maintenance and
     iterations, for past and future versions, as long as the application shall
     live?"

     "Erm, can I get back to you on that?"

     -- from "Making a class serializable",
        http://www.erights.org/e/StateSerialization.html


In general, schema evolution is a hard problem.  So what I'd like to do 
here is first lay out some background to show just *how* hard, and then 
backpedal a bit to what more specific goals I think are achievable with 
what we're doing in 0.6 and 0.7.


Schema Additions
----------------

But first, something simple.  Additive changes to the schema are relatively 
easy compared to other kinds of change, since they can sometimes be done 
without changing existing items.  In fact, adding new kinds can be done 
without even restarting Chandler, as we saw in the previous section.  This 
is especially nice in that it means we'll be able to download and install 
new parcels while Chandler is running - but upgrading an already-installed 
parcel will require a restart for stability.

Adding new attributes to existing kinds is a little trickier, because right 
now the schema API doesn't scan a kind's attributes if the kind already 
exists in the repository.  But we could add something that would check a 
parcel's version and do a thorough re-scan of every kind defined by the 
parcel, whenever the parcel version changed.  This would be part of an 
at-startup check of parcel versions.
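
As a rough sketch of that startup check, assuming the repository records a version string per parcel (the function name, parcel names, and version strings here are all hypothetical):

```python
def parcels_needing_rescan(stored_versions, current_versions):
    """Return names of installed parcels whose version has changed."""
    return sorted(
        name
        for name, version in current_versions.items()
        if name in stored_versions and stored_versions[name] != version
    )


stored = {"example.calendar": "1.0", "example.mail": "1.0"}
current = {"example.calendar": "1.1", "example.mail": "1.0", "example.new": "0.1"}
changed = parcels_needing_rescan(stored, current)
```

Here only ``example.calendar`` would get the thorough re-scan; ``example.new`` is not in the repository yet, so it would go through a normal install instead.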

The major complication introduced by adding attributes is attributes that 
should have a value for existing items in the repository.  In terms of 
repository stability, this is not a big deal, as the repository doesn't 
care that the attributes are missing unless they are marked ``required``, 
and you run ``check()``.

However, for application functionality, it means that new versions of 
parcels must either:

1. Never assume an attribute exists, unless it was supplied and initialized 
by the first public release of the parcel, and *every release since*.  Or,

2. Use ``defaultValue``, so the attribute always appears to have a value

The downside of option 1 is that you have to keep track of what you 
released, and when you changed it, "through maintenance and iterations, for 
past and future versions, as long as the application shall live."  The 
downside of option 2 is that the attribute can never *not* have a value, 
and there may be other limitations associated with ``defaultValue``, which 
we have mostly not been using for some time.

Note that ``defaultValue`` is different from ``initialValue``.  An 
``initialValue`` is set when an item is created.  If you later delete the 
attribute, the ``initialValue`` does not come back.  Similarly, if you add 
a new attribute with an ``initialValue``, or change the ``initialValue`` of 
an existing attribute definition, this does not affect already-created 
items, even if they don't have a value for that attribute.
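
The difference is easy to see in a toy model.  The ``Attribute`` and ``Item`` classes below are stand-ins written for this example in Python 3, not the real schema API; they mimic only the two behaviors described above:

```python
_MISSING = object()


class Attribute:
    """Toy attribute definition with optional initial and default values."""

    def __init__(self, initialValue=_MISSING, defaultValue=_MISSING):
        self.initialValue = initialValue
        self.defaultValue = defaultValue

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, item, owner=None):
        # Only reached when the item has no stored value of its own.
        if item is None:
            return self
        if self.defaultValue is _MISSING:
            raise AttributeError(self.name)
        return self.defaultValue


class Item:
    def __init__(self):
        # initialValue is copied onto the item once, at creation time.
        # (Only attributes defined directly on the class are considered.)
        for name, attr in vars(type(self)).items():
            if isinstance(attr, Attribute) and attr.initialValue is not _MISSING:
                setattr(self, name, attr.initialValue)


class Note(Item):
    title = Attribute(initialValue="untitled")
    color = Attribute(defaultValue="white")


note = Note()
title_at_creation = note.title  # set from initialValue
del note.title
title_survives_delete = hasattr(note, "title")  # the initial value is gone

note.color = "red"
del note.color
color_after_delete = note.color  # falls back to defaultValue
```

Deleting ``title`` leaves the item without a value, while deleting ``color`` just uncovers the default again.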

Incidentally, this is somewhat related to the issue that we've sometimes 
had with ``initialValue``, in that we would often like an attribute's 
initial value to be computed, rather than a constant.  For example, 
creation and modification dates should default to ``datetime.now()`` at 
the time the item is created.  It may also be that we would like to be 
able to have some code run for existing items and set a computed initial 
value, when a new attribute is added.

One way we can accomplish this is to have the relevant ``installParcel()`` 
function include a block of code like::

     def installParcel(parcel, oldVersion=None):

         # ...

         # Backfill the new attribute on items created by older versions
         for item in SomeChangedClass.iterItems(parcel.itsView):
             if not hasattr(item, "newattr"):
                 item.newattr = some_calculation(item)

Since ``installParcel()`` is only invoked when a parcel is installed, 
upgraded, or explicitly reloaded, this operation would be reasonable in 
many cases, especially since it will not do any work when the parcel is 
first installed (because there will be no items of the changed class yet).

However, for upgrades and reloads, it could possibly be quite slow, and 
might need some way to display or update a progress meter.  But the 
mechanism for this needs to somehow be decoupled from the schema API and 
the standard Chandler UI, because it also needs to work when run under 
``headless``, and of course unit tests need to work too.
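
One way to keep the upgrade loop independent of any UI is to have it report through an optional callback.  This is only a sketch under that assumption; ``backfill_attribute`` and its ``progress`` parameter are hypothetical names, and plain objects stand in for repository items:

```python
from types import SimpleNamespace


def backfill_attribute(items, name, compute, progress=None):
    """Set `name` on every item lacking it; report progress through a callback."""
    items = list(items)
    total = len(items)
    updated = 0
    for done, item in enumerate(items, 1):
        if not hasattr(item, name):
            setattr(item, name, compute(item))
            updated += 1
        if progress is not None:
            progress(done, total)  # the caller decides how to display this
    return updated


items = [SimpleNamespace(title="a"), SimpleNamespace(title="bb", size=99)]
reports = []
count = backfill_attribute(
    items,
    "size",
    lambda item: len(item.title),
    progress=lambda done, total: reports.append((done, total)),
)
```

A GUI could pass a callback that updates a progress dialog, while ``headless`` and unit tests could pass nothing at all.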

Oh, and don't forget - you can't ever remove that upgrade code from 
``installParcel()``, unless of course you stop using the attribute.

Ah, if only all schema evolution issues were as simple as additions!  :)


Moves and Renames
-----------------

Additions, alas, are not the only kind of schema changes we're likely to 
have in future versions.  It's extremely likely that in 0.7 we'll be doing 
a lot of moves and renames to finish our parcel/package flattening and the 
move to a standardized layout for API packages.

But, we currently use the names and locations of modules, classes, and 
attributes to synchronize our schema definition with the schema stored in 
the repository.  This means that if we move a class around, or rename it, 
it no longer has an identifier matching that of the Kind in the 
repository.  So, even if we made no *actual* change to the schema, we can 
completely trash someone's existing data just by moving or renaming things 
in the normal course of refactoring.

Indeed, the repository stores in each Kind a reference to the class that 
implements it, so even if we grabbed the existing Kind and tried to move or 
rename it in our ``installParcel()`` routines, we would get an error.  Even 
though the old class doesn't exist any more, the repository would try to 
load it because it's still referred to by the existing Kind.  Andi says 
this particular issue can probably be worked around in the repository, but 
it's only the tip of the iceberg here.

The real issue here is *identification*.  How do we uniquely identify a 
class or attribute or parcel, once it has moved?  One possibility is to 
give each such object a list of all the paths it lived at in earlier 
versions, so that it could check all those locations and then move 
the relevant item to the new location.  Of course, such a list could grow 
longer over time, and can never be removed.
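
A sketch of the path-list lookup, with a plain dictionary standing in for the repository (the function name and the paths are made up for the example):

```python
def find_kind(repository, current_path, previous_paths=()):
    """Look up a kind by its current path, relocating it from an older path if needed."""
    if current_path in repository:
        return repository[current_path]
    for old_path in previous_paths:
        if old_path in repository:
            # Found under an earlier name: move it to where the code now lives.
            repository[current_path] = repository.pop(old_path)
            return repository[current_path]
    return None


repository = {"//parcels/old/place/Note": "the Note kind"}
kind = find_kind(
    repository,
    "//parcels/new/place/Note",
    previous_paths=["//parcels/older/place/Note", "//parcels/old/place/Note"],
)
```

Every path the item ever shipped under has to stay in ``previous_paths``, because a user may upgrade from any earlier release.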

Another possibility is to assign fixed UUIDs to the items.  Before moving a 
class, attribute, or parcel, you would first need to find out its current 
UUID, and add that information to the source code.  Then, when you move or 
rename the item, its UUID would move with it, and remain in sync.

How would we initially assign UUIDs?  Well, there is a kind of UUID called 
a "namespace UUID" which would be useful for this purpose.  A namespace 
UUID is generated by hashing a name string with a base UUID, to create a 
new UUID.  This would allow us to automatically generate fixed UUIDs for 
items based on their name, so that it won't be necessary to manually assign 
individual UUIDs for every parcel, class, and attribute.  But when you move 
or rename something, you would need to find out its assigned UUID, and add 
it to the code, or else you'll be creating a new item and abandoning the 
old one.

Error checking, of course, is a problem with this scheme, since just 
renaming or moving something isn't going to move the UUID.  Also, if in one 
upgrade you rename class X to "Y", and then in a later upgrade you create a 
new "X", the new "X" will be assigned the same UUID that the original X 
was, and now you have to detect the collision.
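
Python's standard ``uuid`` module can generate namespace UUIDs with ``uuid.uuid5()``, so the scheme and its collision problem can be sketched as below.  The base UUID, the ``pinned`` override table, and the ``parcel.X`` names are all hypothetical:

```python
import uuid

# Hypothetical base UUID for the schema; any fixed UUID would serve.
SCHEMA_BASE = uuid.uuid5(uuid.NAMESPACE_DNS, "schema.example.invalid")


def auto_uuid(path):
    """Derive a stable UUID from a dotted name."""
    return uuid.uuid5(SCHEMA_BASE, path)


def assign_uuids(paths, pinned=None):
    """Map each path to its UUID, using `pinned` overrides for moved or renamed items."""
    pinned = pinned or {}
    assigned = {}
    owners = {}
    for path in paths:
        ident = pinned.get(path, auto_uuid(path))
        if ident in owners:
            raise ValueError("%s and %s share UUID %s" % (owners[ident], path, ident))
        owners[ident] = path
        assigned[path] = ident
    return assigned


release_1 = assign_uuids(["parcel.X"])

# Release 2 renames X to Y and pins the old UUID in the source code.
pins = {"parcel.Y": release_1["parcel.X"]}
release_2 = assign_uuids(["parcel.Y"], pinned=pins)

# Release 3 adds a brand-new X, whose automatic UUID collides with Y's.
try:
    assign_uuids(["parcel.Y", "parcel.X"], pinned=pins)
    collision = None
except ValueError as error:
    collision = str(error)
```

The collision is only detected here because both names are checked together; forgetting to pin the UUID on a rename would go unnoticed.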

Alternately, we could require that parcels and classes be manually assigned 
UUIDs, but we could allow attribute UUIDs to be automatically 
generated.  It's easier to crosscheck a single class' attributes for naming 
collisions in the "rename X to Y and create a new X" case, and for 
everything else it guarantees that the UUID will move with the parcel or 
class, and ensure continuity.

At Chandler's current size, that's a few hundred UUIDs we would need to 
manually assign.  The process would simply require running a tool like 
``uuidgen`` to generate the UUIDs, and then adding them to the 
``kindInfo()`` for each class, and a ``__parcel_id__ = "..."`` assignment in 
the parcel's main module.  It also adds a step to creating a new kind or 
parcel, but not a 
particularly difficult one.


Changes and Deletions
---------------------

The next type of schema change that can occur is changes to metadata.  For 
classes, clouds, and parcels, metadata changes are fairly harmless, as they 
don't usually affect the user's data in any way.  For attributes, however, 
changes to metadata like type or cardinality could require changing all 
existing values of that attribute.  Such changes would actually require 
creating a new attribute and copying the old values over to it, then 
deleting the old attribute.  And it's not immediately obvious how we'd go 
about doing that.

Deletions from the schema are likely to be rare, at least when viewed in 
the "upgrade" direction.  But we may also need to consider the "downgrade" 
case, where someone wants to revert to a previous version of a package, and 
therefore needs to undo schema changes.  Implicitly this can involve 
removing some part of the schema that was added, although in practice there 
is no actual need to remove it, since it will be inaccessible.  It does 
mean, however, that the upgrade mechanism will eventually need to be robust 
in the face of repeated upgrades.


Analysis
========

Providing robust support for schema changes is an "interesting" 
problem.  Once a schema is officially released, it appears that constant 
vigilance will be required, to ensure that changes always provide a 
migration path for users' data.  We do not currently have any ways to 
validate changes made between a particular pair of schema versions, nor to 
track the changes made (other than indirectly, via source code changes).

With sufficient care and infrastructure support, we can relatively easily 
support manual schema upgrades, in the sense of having installParcel() make 
the changes, if we entirely forbid certain classes of schema change that 
could not be implemented in this way.  However, the amount of developer 
care required currently appears prohibitive, in the sense that it's going 
to seriously impede our flexibility to refactor.

In order to remove these impediments, we would have to have some way of 
automatically tracking changes to the schema, as a kind of revision log 
that would be kept alongside the code in Subversion.  When changing between 
versions of a parcel, it should be possible for the system to automatically 
apply the relevant changes to both the schema, and the corresponding 
data.  This logging system would then also be able to prohibit (via error 
messages) any changes that could not be supported without extending the 
revision system.  The actual tracking mechanism will probably need to use 
some of the techniques described here:

http://citeseer.ist.psu.edu/staudtlerner96model.html

There would also need to be some tools to manipulate the log.  For example, 
to mark a point in the log as corresponding to a particular release 
version, to simplify applying changes.  Or to list the changes between 
releases, etc.  I'm not going to attempt to fully specify the system at 
this time, since I don't think we can reasonably include something like it 
in 0.6.  We may be forced to simply require that 0.6 -> 0.7 upgrades go 
through an export-and-import process.
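
Purely to make the idea concrete, a revision log with release markers might look something like the toy below.  Nothing here is a design: the log format, the two operations, the revision numbers attached to each release, and the use of dictionaries as items are all assumptions for the example:

```python
def rename_attribute(item, old, new):
    if old in item:
        item[new] = item.pop(old)


def add_attribute(item, name, value):
    item.setdefault(name, value)


# Only changes with a known data migration are allowed into the log.
OPERATIONS = {"rename": rename_attribute, "add": add_attribute}

# Hypothetical log: (revision number, operation, arguments), oldest first.
CHANGE_LOG = [
    (1, "add", ("priority", 0)),
    (2, "rename", ("priority", "importance")),
    (3, "add", ("tags", ())),
]

# Hypothetical markers tying releases to points in the log.
RELEASES = {"0.6": 1, "0.7": 3}


def upgrade(items, from_release, to_release):
    """Apply every logged change made after `from_release`, up to `to_release`."""
    start, end = RELEASES[from_release], RELEASES[to_release]
    for revision, operation, args in CHANGE_LOG:
        if start < revision <= end:
            if operation not in OPERATIONS:
                raise ValueError("unsupported schema change: %s" % operation)
            for item in items:
                OPERATIONS[operation](item, *args)


items = [{"title": "x", "priority": 2}]
upgrade(items, "0.6", "0.7")
```

The interesting part is what the toy leaves out: recording the changes automatically, and rejecting a change at development time rather than at upgrade time.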

My original intent with this proposal was to try to provide some minimal 
schema versioning support in 0.6, but as the analysis progressed it has 
become apparent that, given the complexity and the lateness of the date, 
it's going to be simpler to just introduce parcel versioning in 0.7 
alongside the introduction of Python Eggs, since eggs include version 
metadata already, and they provide a natural boundary for the schema 
revision tracking system described above.

In short, I think the only parts of these proposals that can reasonably be 
implemented for 0.6 are those that support reloading code and 
parcel-defined items, which should be helpful to developers working on code 
changes that currently require a Chandler restart before testing.


