[Chandler-dev] Dump and reload sketch

Phillip J. Eby pje at telecommunity.com
Thu May 11 17:17:25 PDT 2006


This email is more of a preliminary brain dump than it is a proposal.  It's 
just to work out some of the requirements and design tradeoffs that I've 
been mulling over in the last few weeks, trying to get to a point where 
detailed API design can begin.  Please feel free to jump in with questions 
or comments about any of it.  Thanks.


Background and Requirements Overview
------------------------------------

We want to be able to dump user data from a Chandler repository into a 
stable external format, and be able to reload the resulting backup into the 
same or newer versions of Chandler.  The intended uses are for backups and 
to support evolution of both the repository implementation and our domain 
models without requiring users to discard their data.

This overall strategy was chosen after a discussion during the PyCon 
sprints; a writeup of the analysis can be found here:

http://lists.osafoundation.org/pipermail/chandler-dev/2006-February/005301.html


By "stable external format", I mean a format that does not change 
significantly from one Chandler release to the next, and which allows for 
version detection of the format itself, as well as providing version and 
schema information for the parcels whose data is contained in the format.

As I see it, there are several forces to be resolved in the design, most of 
which compete with all of the others:

Ability to perform upgrades flexibly
     In the common case, schema upgrades will consist mostly of simple 
additions or renames.  But in uncommon cases, entire families of objects 
may be restructured, or data might be moved from one Kind to another.

Implementation Complexity
     We don't have the resources to do original research in schema 
evolution techniques, nor can we afford to essentially duplicate the 
existing repository implementation in some other form.  If possible, the 
actual data format shouldn't reinvent wheels either.

API Complexity
     In principle, we could have have a simple API by just asking each 
parcel to write its data to a provided file.  However, this is just asking 
each parcel author to invent their own format while revisiting the same 
tradeoffs.  Ideally, our API should allow a parcel developer to write a 
minimum of code to support simple upgrades, and it should make complex 
upgrades possible.

Effective Support for Parcel Modularity
     Chandler isn't a monolithic application, and parcels can be upgraded 
independently of one another.  Effective dump-and-reload thus requires that 
parcels' data be managed in a way that allows multiple upgrades to occur 
during the same load operation, and that dumps occur in a way that allows 
each parcel's schema or version information to be recorded 
separately.  Ideally, this modularity would extend to the data as well as 
the schema, so that a single parcel's data could be backed up or restored, 
subject to the parcel's dependencies.  (Meaning that we could perhaps at 
some point allow upgrading a single parcel's schema without requiring a 
complete dump and reload.)

Performance
     Dumping should be fast.  Reloading shouldn't be slow for simple common 
cases, but it's okay to be slow if a complex schema change occurs.


Design Implications
-------------------

There are several places where these forces resolve to fairly 
straightforward preliminary conclusions about how the system should work:

* The external format should deal in relatively simple data structures 
composed of elementary values (of a relatively small number of types) 
arranged in records.  Blobs should also be supported.  (Forces: reduce 
implementation complexity, increase dump performance)

* The format shouldn't use nesting structures that would require temporary 
storage of extremely large sequences.  (Forces: reduce implementation 
complexity, increase load performance)

* Notifications of all kinds (including onValueChanged) must be disabled 
during a load operation -- and that includes refresh() notifications in 
other views; load operations should be globally exclusive, with no other 
activity taking place.  Which also means no repository-based UI components 
being active.  In other words, "Chandler as we know it" *can't be running 
during a load operation*.

* To allow for simple upgrades, parcels should be allowed access to the 
stream of records being loaded, so as to transform them directly into 
records that match the parcel's current schema.

* To allow for complex upgrades, parcels should be given "before" and 
"after" hooks.  This allows them to use the repository itself to store 
working data during a complex schema change.  Without this feature, it 
would be necessary to provide query facilities in the dump/reload system 
itself, increasing implementation complexity.  But with this feature, 
simple upgrade reloads and non-upgrade reloads can be fast, and only very 
complex upgrades pay a penalty in performance and implementation complexity.

But the list of things that *aren't* so neatly resolved is bigger.

For example, what dumping order should be used?  The modularity requirement 
argues in favor of dumping on a per-parcel basis, but this means that 
annotated kinds will be scanned multiple times, once for each parcel that 
annotates the kinds in question.  So dumping performance seems to argue 
that it would be better to just walk the entire repository and write data 
as you find it.  (At least, if you're going to be dumping most of the 
repository contents, anyway.)

Reloading performance, on the other hand, seems to argue that the data 
should be broken up by record types, in order to avoid repeated dispatching 
overhead.  That is, if you iterate over a sequence of identical records, 
you only have to look up the filtering code once, and you can even write it 
as a generator to avoid extra call overhead.  (Of course, these performance 
issues could easily be dominated by the cost of writing the data to the 
repository.)

Hm.  Actually, I've just thought of a way to get the advantages of both 
approaches, by writing interleaved records at dump time and reading them in 
an interleaved way at reload time.  Making it work for iterators would be a 
pain, but doable.  Okay, so scratch that issue off the "unresolved" list.  :)

Using iterators does have another big advantage, though.  If a parcel can 
provide an iterator to do record transformation (simple schema upgrades), 
the iterator can also run code at the beginning and end of the process -- 
which means it can also do complex upgrades that need setup and cleanup steps.

And writing these steps at the beginning and end of a loop that processes 
loaded records is a simple and "obvious" way to do it, without having to 
learn multiple APIs for upgrading the schema.  The only tricky bit in the 
API is that we'd have to guarantee relative ordering for these 
transformation functions, so that the parcel author knows what order the 
iterators will be called in.  But that's mostly an implementation detail.

So, if we could reduce every item to records, then each Item subclass would 
just need a 'load_records()' classmethod that iterated over an input and 
yielded transformed (or untransformed) records.  The default implementation 
of this classmethod would simply yield untransformed records as long as 
their schema matched the current schema, and raise an error otherwise.

This concept appears like it would work for an overall repository dump and 
reload of items by themselves.  It does not address certain issues 
regarding many-to-many and ordered relationships (which I'll get to later), 
nor does it handle the problem of upgrading parcel-specific data structures 
(e.g. block trees and copies thereof).

The issue with parcel-specific data structures is that these are not things 
described by the parcel's own schema.  For example, if you have a tree of 
blocks, they're going to be described by block framework schemas.  So, 
there seems to be a need for a post-reload hook that gets called in a 
parcel to allow fixup of such structures.  Perhaps the simplest way to 
support that would be to invoke installParcel() again, with oldVersion set 
to the reloaded version.

These installParcel() recalls would have to happen in an order that ensures 
it has been already called for dependent parcels.  That is, if parcel foo 
depends on parcel bar, then bar's installParcel() must be called before foo's.

(Note that this means we *must* eliminate any remaining dependency cycles 
between parcels in order to implement a viable dump/reload system.)


Storing Relationships
---------------------

In order to "flatten" items into simple data records, references to other 
objects have to be reduced to elementary types or records thereof.  This 
means, for example, that inter-item references might need to be reduced to 
a UUID.

In the simplest case of one-to-one relationships, this is straightforward 
since it's just representing the attribute as a UUID.  Even simple 
one-to-many relationships are easily handled by representing only the "one" 
side of the relationship in the records, and allowing the "many" side to be 
rebuilt automatically via the biref machinery when the records are reloaded.

There is a problem with this approach, however, because all birefs are 
currently *ordered* collections, even in cases where the order is 
meaningless.  I had hoped during the schema API implementation last year 
that we could migrate to using ``schema.Many()`` in places where an ordered 
collection was unnecessary, but I didn't realize at the time that "set" 
cardinality in the repository could not be part on a bidirectional 
reference, so ``schema.Many()`` has mostly gone unused.

Why is this important?  Because external representation of data as a stream 
of records requires additional fields or records to reconstruct the 
sequence in a relationship.  Either an additional field is required to 
store a key that indicates the order, or there has to be a sequence of 
records whose sole purpose is to represent the ordering.  (This is 
especially complex in the case of order-preserving many-to-many relationships!)

So, a key step in getting our schema ready for dump-and-reload support is 
going to be examining our current use of schema.Sequence to see whether 
these use cases actually require order-preservation.  In many cases, we are 
actually using indexes based on other attribute values, so the unindexed 
order isn't necessary and it would be redundant to write it out in a 
dump.  We currently have about 120 Sequence attributes that would need to 
be reviewed.

Since we have only around 3 uses of ``schema.Many()``, I would suggest we 
create a new descriptor to use in these cases, to free up the use of 
``schema.Many`` for ``Sequence`` attributes that don't really need to be 
sequences.  That would leave the following as possible relationship types, 
(ignoring mirror images):

* One - Many
* One - One
* Many - Many
* One - Sequence
* Many - Sequence
* Sequence - Sequence

One-to-many and one-to-one relationships can be represented in external 
form without introducing any new fields or records; records on the "one" 
side would just record the identity of the referened object.

Most of the other relationships would require an extra series of records to 
be written out, each record listing the identity of the objects on either 
side of the link.  In the One-Sequence and Many-Sequence cases, the order 
of the records would indicate the order of the links on the "Sequence" side 
of the relationship.

Sequence-Sequence relationships, however, are unstable, in that there is no 
simple way to read and write them without including data to indicate the 
sequence on both sides, and having a cleanup pass to fixup the loaded 
sequences.  (Or by having special support from the repository to allow each 
side to be loaded indepdently without the normal automatic linking.)

Thus, if possible, I'd like to abolish Sequence-to-Sequence relationships, 
allowing only the other five kinds of bidirectional relationships to 
exist.  A sequence-to-sequence relationship is inherently unstable given 
the nature of birefs, anyway.  Even if you're adding things to one side in 
a particular controlled order, the other side likely won't be in a 
meaningful order of its own.  That we have these relationships in the 
schema now is largely due to "many" and "sequence" being spelled the same 
way (i.e. ``schema.Sequence``) because the repository's current 
implementation doesn't offer a non-order-preserving cardinality that works 
as half of a biref.

This would require a schema review, but I think it's going to be important 
to make this distinction in the schema anyway, as it ultimately gives more 
implementation flexibility to the repository for future performance 
improvements, if we only have to implement sequences when sequences are 
really required.  (Today's repository will not care, of course, so the 
distinction will be purely for documentation of intent and for possible 
simplification of the dump/reload process.)

Note that I've been making an assumption in all of the above that records 
would not contain any sequences or complex data structures themselves; but 
this assumption needn't apply to small "many" or "sequence" attributes.  It 
could be an optimization to write such links inline, but it would require 
hint information of some kind in the schema to suggest which side is better 
to write this information out on.  I'm somewhat inclined against this 
approach, though, because we will also have to support large sequences 
(e.g. from UI-level item collections) and I'd just as soon not have two 
different implementations.

But that's more of an API distinction; the API can to some extent represent 
things as indepdendent records, even if the underlying storage format 
interleaves the data to some extent.  And conversely, the API could 
transform a non-interleaved format to an interleaved one, albeit at the 
cost of additional memory.


Wrap-up -- for now
------------------

At this point I haven't covered much actual API detail, or anything at all 
about the actual external format.  I don't actually care much about the 
external format, since it's not a requirement that it be processed by other 
programs, and parcel writers will never see it directly.  The API will only 
expose streams of records of elementary types, and provide a way for parcel 
writers to transform individual records as the streams go by, and to do 
pre- and post-processing on the repository contents.

There's still a lot of work to be done to take these high-level thoughts 
and turn them into a concrete API proposal, but at this point I'd like 
input and feedback on the points I've presented so far, in order to make 
sure that the next steps are going in the right direction.



More information about the chandler-dev mailing list