[Chandler-dev] Dump and reload sketch
vajda at osafoundation.org
Fri May 12 09:28:41 PDT 2006
On Thu, 11 May 2006, Phillip J. Eby wrote:
> Background and Requirements Overview
> We want to be able to dump user data from a Chandler repository into a stable
> external format, and be able to reload the resulting backup into the same or
> newer versions of Chandler. The intended uses are for backups and to support
> evolution of both the repository implementation and our domain models without
> requiring users to discard their data.
> This overall strategy was chosen after a discussion during the PyCon sprints;
> a writeup of the analysis can be found here:
> By "stable external format", I mean a format that does not change
> significantly from one Chandler release to the next, and which allows for
> version detection of the format itself, as well as providing version and
> schema information for the parcels whose data is contained in the format.
The format needs to be only as stable as required by the code processing it.
In other words, by providing a dump/restore-based schema evolution scheme,
we're creating a little parallel universe of code capable of extracting data
from one form of repository into another. I say 'little' because it is
typically intended to be used just for that, hopefully just once for a given
repository, and can be amply hardcoded for a given uni-directional conversion.
> As I see it, there are several forces to be resolved in the design, most of
> which compete with all of the others:
> Ability to perform upgrades flexibly
> In the common case, schema upgrades will consist mostly of simple
> additions or renames. But in uncommon cases, entire families of objects may
> be restructured, or data might be moved from one Kind to another.
> Implementation Complexity
> We don't have the resources to do original research in schema evolution
> techniques, nor can we afford to essentially duplicate the existing
> repository implementation in some other form. If possible, the actual data
> format shouldn't reinvent wheels either.
I'd rather say 'reliably' instead of 'flexibly'. The body of code performing
a transition may be hand-tuned to provide all the necessary conversions for
that one schema transition, and may not be flexible enough to be easily
reused in later ones, yet it ensures that the user's data is semantically
correct with regards to both schemas in the transition (or at least the
destination schema).
Still, having a framework of code that is flexible enough to be reused over
and over again to implement such hand-tuned one-time-use code for a given
transition would make things considerably easier for each and every developer
responsible for writing the transition routines for their parcel(s).
> API Complexity
> In principle, we could have a simple API by just asking each parcel
> to write its data to a provided file. However, this is just asking each
> parcel author to invent their own format while revisiting the same tradeoffs.
> Ideally, our API should allow a parcel developer to write a minimum of code
> to support simple upgrades, and it should make complex upgrades possible.
> Effective Support for Parcel Modularity
> Chandler isn't a monolithic application, and parcels can be upgraded
> independently of one another. Effective dump-and-reload thus requires that
> parcels' data be managed in a way that allows multiple upgrades to occur
> during the same load operation, and that dumps occur in a way that allows
> each parcel's schema or version information to be recorded separately.
> Ideally, this modularity would extend to the data as well as the schema, so
> that a single parcel's data could be backed up or restored, subject to the
> parcel's dependencies. (Meaning that we could perhaps at some point allow
> upgrading a single parcel's schema without requiring a complete dump and
> reload.)
There may be cases, though, where parcel developers will need to collaborate
on the transition code because of dependencies between their parcels. Does
the egg framework support declaring dependencies among parcels that encode
the schema version(s) they're known to work with?
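To make the question concrete, here is one invented shape such a declaration
could take -- a plain version-range check, since I don't know whether
setuptools' install_requires can be made to carry schema (as opposed to code)
versions. All parcel names and the convention itself are made up:

```python
# Sketch: a parcel could declare the schema version range of another parcel
# that its transition code is known to work with. The parcel names and the
# schema-version convention here are invented for illustration.

def parse(v):
    """Turn '0.7.1' into a comparable tuple (0, 7, 1)."""
    return tuple(int(p) for p in v.split("."))

# foo declares: "I can read bar's schema from 0.7 (inclusive) to 0.8 (exclusive)."
SCHEMA_DEPENDS = {"osaf.parcel.bar": (parse("0.7"), parse("0.8"))}

def schema_compatible(parcel, version):
    low, high = SCHEMA_DEPENDS[parcel]
    return low <= parse(version) < high

print(schema_compatible("osaf.parcel.bar", "0.7.1"))  # True
print(schema_compatible("osaf.parcel.bar", "0.8.0"))  # False: needs new transition code
```

Whether this lives in setup.py metadata or in the parcel itself matters less
than having the range recorded somewhere queryable.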
> Dumping should be fast. Reloading shouldn't be slow for simple common
> cases, but it's okay to be slow if a complex schema change occurs.
Yes, and reloading should involve as little user intervention as possible,
especially if it is slow - let the user go to sleep while the process is
running overnight :).
> Design Implications
> There are several places where these forces resolve to fairly straightforward
> preliminary conclusions about how the system should work:
> * The external format should deal in relatively simple data structures
> composed of elementary values (of a relatively small number of types)
> arranged in records. Blobs should also be supported. (Forces: reduce
> implementation complexity, increase dump performance)
Dunno. Yesterday we discussed how this API should use high level repository
APIs, be high level enough to understand the schema conversions at work, be
high level enough to implement the semantic transformations needed by the
transition. This may force it to also use high level data structures such as
item clouds instead of just low-level ones such as uuids and bi-refs.
> * The format shouldn't use nesting structures that would require temporary
> storage of extremely large sequences. (Forces: reduce implementation
> complexity, increase load performance)
It may be simpler for developers to effect a transition in multiple steps,
and it may be simpler for them to use a repository between each step. Given
the number of schema changes already made at all levels between 0.6 and 0.7,
it would be rather hard to convert all of them in one transition. How do we
decide/coordinate what makes one transition versus multiple transitions?
This reminds me that we should have a way of logging schema changes that
later lets us query a parcel for them when the time has come to implement
conversions. Depending on where we are in the development cycle, it may be
too onerous to ask developers to implement the transition code right away
instead of just logging the change. This may be as simple as having a log
file carried around by the egg, or even just a simple comment, scrupulously
maintained.
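A sketch of what such a log could look like -- the entry format and field
names below are invented, not an existing convention:

```python
# Hypothetical schema-change log a parcel egg could carry around. The point
# is that transition authors can later query what changed between two schema
# versions instead of reconstructing it from memory or revision history.
# (Lexicographic version comparison is only safe for single-digit components;
# a real log would compare parsed version tuples.)
SCHEMA_LOG = [
    {"version": "0.7.2", "kind": "Note", "change": "add", "attr": "reminder"},
    {"version": "0.7.3", "kind": "Note", "change": "rename",
     "attr": "body", "to": "bodyText"},
    {"version": "0.7.5", "kind": "Task", "change": "remove", "attr": "done"},
]

def changes_between(log, since, until):
    """All changes recorded after `since`, up to and including `until`."""
    return [e for e in log if since < e["version"] <= until]

for entry in changes_between(SCHEMA_LOG, "0.7.2", "0.7.5"):
    print(entry["change"], entry["kind"], entry["attr"])
```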
> * Notifications of all kinds (including onValueChanged) must be disabled
> during a load operation -- and that includes refresh() notifications in other
> views; load operations should be globally exclusive, with no other activity
> taking place. Which also means no repository-based UI components being
> active. In other words, "Chandler as we know it" *can't be running during a
> load operation*.
This is dicey because some attribute values are computed based on these
notifications. These computed values are still persisted. We might need a way
to encode in the schema which attributes need to be dumped and restored and
which should not be.
The lower-level the API, the less the need for notifications, monitors,
onValueChanged(), etc. The higher-level the API, the greater the need for
them. To make things worse, though, the repository uses them at the very low
level as well. For example, monitors are used in maintaining sorted indexes,
and notifications are used when maintaining collection indexes. When
restoring a repository, sorted indexes need to be rebuilt, and it would be
better that they be rebuilt along the way instead of with one giant sort at
the end.
...(more comments below, read on)...
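As a toy model (not the repository's actual index machinery), rebuilding a
sorted index along the way amounts to an ordered insert per restored record;
the record shape and key are invented:

```python
# Toy model of rebuilding a sorted index incrementally while records stream
# in, instead of sorting everything once at the end. This is not Chandler's
# index implementation; it only illustrates the trade-off discussed above.
import bisect

class SortedIndex:
    def __init__(self, key):
        self.key = key
        self._keys = []   # sorted key values
        self._uuids = []  # uuids kept in key order

    def insert(self, uuid, record):
        k = self.key(record)
        i = bisect.bisect_right(self._keys, k)
        self._keys.insert(i, k)
        self._uuids.insert(i, uuid)

index = SortedIndex(key=lambda rec: rec["displayName"])
for uuid, rec in [("u1", {"displayName": "b"}),
                  ("u2", {"displayName": "a"}),
                  ("u3", {"displayName": "c"})]:
    index.insert(uuid, rec)

print(index._uuids)  # uuids ordered by displayName: ['u2', 'u1', 'u3']
```

A real implementation would use a B-tree or skip list rather than
list.insert(), which is quadratic, but the shape of the trade-off is the
same: many cheap ordered inserts versus one giant sort at the end.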
> * To allow for simple upgrades, parcels should be allowed access to the
> stream of records being loaded, so as to transform them directly into records
> that match the parcel's current schema.
> * To allow for complex upgrades, parcels should be given "before" and "after"
> hooks. This allows them to use the repository itself to store working data
> during a complex schema change. Without this feature, it would be necessary
> to provide query facilities in the dump/reload system itself, increasing
> implementation complexity. But with this feature, simple upgrade reloads and
> non-upgrade reloads can be fast, and only very complex upgrades pay a penalty
> in performance and implementation complexity.
> But the list of things that *aren't* so neatly resolved is bigger.
> For example, what dumping order should be used? The modularity requirement
> argues in favor of dumping on a per-parcel basis, but this means that
> annotated kinds will be scanned multiple times, once for each parcel that
> annotates the kinds in question. So dumping performance seems to argue that
> it would be better to just walk the entire repository and write data as you
> find it. (At least, if you're going to be dumping most of the repository
> contents, anyway.)
> Reloading performance, on the other hand, seems to argue that the data should
> be broken up by record types, in order to avoid repeated dispatching
> overhead. That is, if you iterate over a sequence of identical records, you
> only have to look up the filtering code once, and you can even write it as a
> generator to avoid extra call overhead. (Of course, these performance issues
> could easily be dominated by the cost of writing the data to the repository.)
> Hm. Actually, I've just thought of a way to get the advantages of both
> approaches, by writing interleaved records at dump time and reading them in
> an interleaved way at reload time. Making it work for iterators would be a
> pain, but doable. Okay, so scratch that issue off the "unresolved" list. :)
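The interleaving idea can be sketched as a round-robin merge of per-type
record iterators at dump time, with the reader dispatching each record by
type at load time; the record shapes below are invented:

```python
# Sketch of interleaved dump/load. Record types are invented; the point is
# that the writer can interleave heterogeneous record streams while the
# reader still routes each record to its type's handler.

def interleave(*streams):
    """Round-robin merge of several record iterators, dropping exhausted ones."""
    streams = [iter(s) for s in streams]
    while streams:
        alive = []
        for s in streams:
            try:
                yield next(s)
            except StopIteration:
                continue  # this stream is exhausted; don't keep it
            alive.append(s)
        streams = alive

notes = ({"type": "Note", "uuid": u} for u in ("n1", "n2"))
tasks = ({"type": "Task", "uuid": u} for u in ("t1",))

loaded = {"Note": [], "Task": []}
for record in interleave(notes, tasks):
    loaded[record["type"]].append(record["uuid"])  # dispatch by record type

print(loaded)  # {'Note': ['n1', 'n2'], 'Task': ['t1']}
```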
> Using iterators does have another big advantage, though. If a parcel can
> provide an iterator to do record transformation (simple schema upgrades), the
> iterator can also run code at the beginning and end of the process -- which
> means it can also do complex upgrades that need setup and cleanup steps.
> And writing these steps at the beginning and end of a loop that processes
> loaded records is a simple and "obvious" way to do it, without having to
> learn multiple APIs for upgrading the schema. The only tricky bit in the API
> is that we'd have to guarantee relative ordering for these transformation
> functions, so that the parcel author knows what order the iterators will be
> called in. But that's mostly an implementation detail.
> So, if we could reduce every item to records, then each Item subclass would
> just need a 'load_records()' classmethod that iterated over an input and
> yielded transformed (or untransformed) records. The default implementation
> of this classmethod would simply yield untransformed records as long as their
> schema matched the current schema, and raise an error otherwise.
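Roughly, with an invented record shape (a dict carrying a 'schema' marker),
the default classmethod and an overriding simple-upgrade generator could look
like this -- a sketch of the proposal, not existing Chandler code:

```python
# Sketch of the proposed load_records() classmethod. The record structure
# and schema markers are invented for illustration.
class Item:
    CURRENT_SCHEMA = "v2"

    @classmethod
    def load_records(cls, records):
        """Default: pass records through untouched if the schema matches."""
        for record in records:
            if record["schema"] != cls.CURRENT_SCHEMA:
                raise ValueError("schema mismatch: %r" % record["schema"])
            yield record

class Note(Item):
    """A simple upgrade: code before the loop is the 'before' hook, code
    after it is the 'after' hook -- both fall out of the generator for free."""

    @classmethod
    def load_records(cls, records):
        seen = 0                                  # setup step
        for record in records:
            if record["schema"] == "v1":          # simple rename upgrade
                record = dict(record, schema="v2",
                              bodyText=record.pop("body"))
            seen += 1
            yield record
        print("upgraded %d record(s)" % seen)     # cleanup step could go here

out = list(Note.load_records([{"schema": "v1", "body": "hi"}]))
print(out[0]["bodyText"])  # "hi", now under the renamed attribute
```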
> This concept appears like it would work for an overall repository dump and
> reload of items by themselves. It does not address certain issues regarding
> many-to-many and ordered relationships (which I'll get to later), nor does it
> handle the problem of upgrading parcel-specific data structures (e.g. block
> trees and copies thereof).
> The issue with parcel-specific data structures is that these are not things
> described by the parcel's own schema. For example, if you have a tree of
> blocks, they're going to be described by block framework schemas. So, there
> seems to be a need for a post-reload hook that gets called in a parcel to
> allow fixup of such structures. Perhaps the simplest way to support that
> would be to invoke installParcel() again, with oldVersion set to the reloaded
> version.
> These installParcel() recalls would have to happen in an order that ensures
> it has been already called for dependent parcels. That is, if parcel foo
> depends on parcel bar, then bar's installParcel() must be called before
> foo's.
> (Note that this means we *must* eliminate any remaining dependency cycles
> between parcels in order to implement a viable dump/reload system.)
> Storing Relationships
> In order to "flatten" items into simple data records, references to other
> objects have to be reduced to elementary types or records thereof. This
> means, for example, that inter-item references might need to be reduced to a
> UUID.
> In the simplest case of one-to-one relationships, this is straightforward
> since it's just representing the attribute as a UUID. Even simple
> one-to-many relationships are easily handled by representing only the "one"
> side of the relationship in the records, and allowing the "many" side to be
> rebuilt automatically via the biref machinery when the records are reloaded.
> There is a problem with this approach, however, because all birefs are
> currently *ordered* collections, even in cases where the order is
> meaningless. I had hoped during the schema API implementation last year that
> we could migrate to using ``schema.Many()`` in places where an ordered
> collection was unnecessary, but I didn't realize at the time that "set"
> cardinality in the repository could not be part of a bidirectional reference,
> so ``schema.Many()`` has mostly gone unused.
> Why is this important? Because external representation of data as a stream
> of records requires additional fields or records to reconstruct the sequence
> in a relationship. Either an additional field is required to store a key
> that indicates the order, or there has to be a sequence of records whose sole
> purpose is to represent the ordering. (This is especially complex in the
> case of order-preserving many-to-many relationships!)
> So, a key step in getting our schema ready for dump-and-reload support is
> going to be examining our current use of schema.Sequence to see whether these
> use cases actually require order-preservation. In many cases, we are
> actually using indexes based on other attribute values, so the unindexed
> order isn't necessary and it would be redundant to write it out in a dump.
> We currently have about 120 Sequence attributes that would need to be
> reviewed.
> Since we have only around 3 uses of ``schema.Many()``, I would suggest we
> create a new descriptor to use in these cases, to free up the use of
> ``schema.Many`` for ``Sequence`` attributes that don't really need to be
> sequences. That would leave the following as possible relationship types,
> (ignoring mirror images):
> * One - Many
> * One - One
> * Many - Many
> * One - Sequence
> * Many - Sequence
> * Sequence - Sequence
> One-to-many and one-to-one relationships can be represented in external form
> without introducing any new fields or records; records on the "one" side
> would just record the identity of the referenced object.
> Most of the other relationships would require an extra series of records to
> be written out, each record listing the identity of the objects on either
> side of the link. In the One-Sequence and Many-Sequence cases, the order of
> the records would indicate the order of the links on the "Sequence" side of
> the relationship.
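With invented record layouts, the two representations above might look like
this: the "one" side carries its target UUID inline, while sequence-valued
sides get a separate stream of link records whose record order encodes the
sequence order:

```python
# Sketch of flattening relationships into records. All layouts are invented.
# One-to-one / one-to-many: the "one" side carries the target UUID inline;
# the "many" side is rebuilt by the biref machinery on reload.
note_record = {"uuid": "n1", "kind": "Note", "parent": "f1"}

# Many-Sequence: a separate stream of link records; *record order* encodes
# the order on the Sequence side, so no explicit index field is needed.
link_records = [
    {"rel": "Collection.items", "collection": "c1", "item": "n2"},
    {"rel": "Collection.items", "collection": "c1", "item": "n1"},
    {"rel": "Collection.items", "collection": "c2", "item": "n1"},
]

def reload_links(records):
    """Rebuild each collection's ordered item list from link record order."""
    collections = {}
    for rec in records:
        collections.setdefault(rec["collection"], []).append(rec["item"])
    return collections

print(reload_links(link_records))  # {'c1': ['n2', 'n1'], 'c2': ['n1']}
```

The Sequence-Sequence case is exactly where this breaks down: record order
can only encode the sequence on one side at a time.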
> Sequence-Sequence relationships, however, are unstable, in that there is no
> simple way to read and write them without including data to indicate the
> sequence on both sides, and having a cleanup pass to fixup the loaded
> sequences. (Or by having special support from the repository to allow each
> side to be loaded independently without the normal automatic linking.)
> Thus, if possible, I'd like to abolish Sequence-to-Sequence relationships,
> allowing only the other five kinds of bidirectional relationships to exist.
> A sequence-to-sequence relationship is inherently unstable given the nature
> of birefs, anyway. Even if you're adding things to one side in a particular
> controlled order, the other side likely won't be in a meaningful order of its
> own.
Both sides of a Sequence-to-Sequence relationship can be controlled by using
the placeItem() API on a ref collection. Another way to control order is to
use sorted or numeric indexes; these are independent of the intrinsic order
of the ref collection itself, which defaults to insertion order. From what
I've seen in Chandler, where ref collection order has mattered, it has been
implemented with indexes, which have the advantage of giving numeric access
into a ref collection (for instance in the UI).
> That we have these relationships in the schema now is largely due to
> "many" and "sequence" being spelled the same way (i.e. ``schema.Sequence``)
> because the repository's current implementation doesn't offer a
> non-order-preserving cardinality that works as half of a biref.
In a way it does. If one doesn't care about order, one shouldn't care that
order is actually preserved either. What we need is a way for the schema to
record that fact. Whether the repository then still maintains order becomes
an implementation detail that does not need to be carried over during dump
and reload.
> This would require a schema review, but I think it's going to be important to
> make this distinction in the schema anyway, as it ultimately gives more
> implementation flexibility to the repository for future performance
> improvements, if we only have to implement sequences when sequences are
> really required. (Today's repository will not care, of course, so the
> distinction will be purely for documentation of intent and for possible
> simplification of the dump/reload process.)
> Note that I've been making an assumption in all of the above that records
> would not contain any sequences or complex data structures themselves; but
> this assumption needn't apply to small "many" or "sequence" attributes. It
> could be an optimization to write such links inline, but it would require
> hint information of some kind in the schema to suggest which side is better
> to write this information out on. I'm somewhat inclined against this
> approach, though, because we will also have to support large sequences (e.g.
> from UI-level item collections) and I'd just as soon not have two different
> formats.
> But that's more of an API distinction; the API can to some extent represent
> things as independent records, even if the underlying storage format
> interleaves the data to some extent. And conversely, the API could transform
> a non-interleaved format to an interleaved one, albeit at the cost of
> additional memory.
> Wrap-up -- for now
> At this point I haven't covered much actual API detail, or anything at all
> about the actual external format. I don't actually care much about the
> external format, since it's not a requirement that it be processed by other
> programs, and parcel writers will never see it directly. The API will only
> expose streams of records of elementary types, and provide a way for parcel
> writers to transform individual records as the streams go by, and to do pre-
> and post-processing on the repository contents.
> There's still a lot of work to be done to take these high-level thoughts and
> turn them into a concrete API proposal, but at this point I'd like input and
> feedback on the points I've presented so far, in order to make sure that the
> next steps are going in the right direction.
It would be prudent to start with simple transitions first, to see whether
that has a chance to scale up to harder, more encompassing conversions.
(That's just me finding this topic a little scary.)