[Chandler-dev] Re: sharing format / dump and reload question

Phillip J. Eby pje at telecommunity.com
Fri Jul 7 10:44:53 PDT 2006

At 09:55 AM 7/7/2006 -0700, Morgen Sagen wrote:

>On Jul 6, 2006, at 2:47 PM, Phillip J. Eby wrote:
>>The key here is that the information model is
>>representation-independent.  Whether we use binary, XML, YAML, or even Python
>>pickles to physically express the information model, there is still
>>a schema that defines the scope of what you can "say" in that
>>model.  Is this making any more sense?
>I think so.  For example, RDF* is an information model (more
>specifically a graph data model) which can be represented in various
>ways, be it RDF-XML or N-triples syntax.

Right.  Part of the idea behind having an explicitly-specified
information model is that other systems can guarantee what they will
support storing, querying, and retrieving, in a way that still allows us
to change representation formats when the need arises.

The stack, then, is:

* Application/domain model
* Information model
* Representation format

Parcel developers have to define a two-way mapping between the application
model and the information model, and the sharing system in turn maps that
to the representation format.  The information model needs to be
super-stable; it can evolve only by adding features, because for all
practical purposes features can never be removed.  The application model
needs to be able to evolve over time, and multiple representation formats
need to be possible.

Actually, I guess I sort of left out a layer from the stack; really, it's:

* Application/domain model
* Versioned mapping to information model
* Information model
* Representation format

So the application developer must create the top two layers, and must
create additional mappings (or modify existing ones in a
backward-compatible way) when the application schema changes.
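
To make the "versioned mapping" layer concrete, here is a minimal sketch
in Python.  Everything in it is hypothetical (the Note class, the mapping
functions, the MAPPINGS registry and the version numbers); it is not
Chandler's schema or API, just one way the idea could look.  The
application class gains a field over time, while each mapping version
keeps producing and consuming a fixed tuple layout.

```python
from dataclasses import dataclass


# Hypothetical application-model class (not Chandler's real schema).
@dataclass
class Note:
    uuid: str
    title: str
    body: str
    tags: tuple = ()  # field added by a later application schema


# Mapping version 1: the record layout is (uuid, title, body).
def v1_to_record(note):
    return (note.uuid, note.title, note.body)


def v1_from_record(record):
    uuid, title, body = record
    return Note(uuid, title, body)


# Mapping version 2: adds the tags as one comma-separated string.
def v2_to_record(note):
    return (note.uuid, note.title, note.body, ",".join(note.tags))


def v2_from_record(record):
    uuid, title, body, tags = record
    return Note(uuid, title, body, tuple(t for t in tags.split(",") if t))


# Old mappings stay registered so that old data remains readable.
MAPPINGS = {
    1: (v1_to_record, v1_from_record),
    2: (v2_to_record, v2_from_record),
}


def dump(note, version):
    """Return (mapping version, information-model record)."""
    to_record, _ = MAPPINGS[version]
    return (version, to_record(note))


def load(versioned_record):
    """Rebuild an application object using the mapping named in the data."""
    version, record = versioned_record
    _, from_record = MAPPINGS[version]
    return from_record(record)
```

Data written under mapping 1 still loads after the application schema has
grown, because the record carries its mapping version with it and the old
mapping is never removed.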

The reason that the information model should be as elementary as possible 
is that it's not the job of the information model to implement 
application-level features, *and* it is almost impossible to remove 
something from the information model later.  You can't go back and say, 
"er, RDF isn't triples any more, it's just singles".  :)  So, the 
information model is basically the part you're nailing down as being 
essentially fixed, so that the other parts can vary, to the extent that 
they can still be mapped to or from the information model.

So, a hypothetical information model might be something like, "you can 
store tuples of elementary types (int, string, float, datetime, unicode, 
etc.), and each tuple is associated with a universally unique identifier of 
some kind".  This is a very simple model, but you can of course build 
anything you like on top of it.  It is also very easy to represent in an 
SQL datastore, XML, flat files, pickles, you name it.
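
As a rough illustration of that hypothetical model, the sketch below
builds one record (a unique identifier plus a tuple of elementary values)
and round-trips it through two different representations, JSON and
pickle.  The set of allowed types, the function names and the JSON layout
are all assumptions made for this example; a real spec would have to pin
them down.

```python
import json
import pickle
import uuid
from datetime import datetime

# Assumed set of elementary types for this sketch only.
ELEMENTARY = (int, float, str, datetime)


def make_record(values, record_id=None):
    """Return (unique id, tuple of elementary values)."""
    for value in values:
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, ELEMENTARY):
            raise TypeError("not an elementary type: %r" % (value,))
    return (record_id or str(uuid.uuid4()), tuple(values))


def _encode(value):
    # Tag each value with its type so JSON can carry datetimes too.
    if isinstance(value, datetime):
        return ["datetime", value.isoformat()]
    return [type(value).__name__, value]


def _decode(pair):
    kind, raw = pair
    if kind == "datetime":
        return datetime.fromisoformat(raw)
    return {"int": int, "float": float, "str": str}[kind](raw)


def to_json(record):
    record_id, values = record
    return json.dumps({"id": record_id, "values": [_encode(v) for v in values]})


def from_json(text):
    doc = json.loads(text)
    return (doc["id"], tuple(_decode(p) for p in doc["values"]))


def to_pickle(record):
    return pickle.dumps(record)


def from_pickle(data):
    return pickle.loads(data)
```

Code that works with records never needs to know which of the two
formats was used to store them; only the to_*/from_* pairs do.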

Indeed, you have many choices for representation within e.g. XML.  For 
example, assuming you have key relationships between these virtual "tables" 
of tuples, you can use XML namespaces to reference the unique identifier, 
and then glom all the related data from several "tables" into one XML 
element's attributes.  Different representations would have different 
performance characteristics, of course, but the API would still be in terms 
of the information model, rather than the concrete representation.

This is just an example of a possible information model, specified very 
imprecisely.  To be a real spec, we would need to define what we mean by 
"int" and "string" and so on, and especially what the "etc." is.  :)  This 
is not a big deal to nail down, but it's "standards"-type work and needs to 
involve stakeholders from all the projects that want to use it.

The more interesting part is defining an API around the information model 
(and to a certain extent, vice versa).  Most of my thinking on this to date 
has been about "dump and reload", meaning a mass dumping and reloading of 
most objects in the repository.  But if I understand correctly, the sharing 
system is much more about incremental modification to items, so the API 
demands are different.

An example: sharing needs to know if data has "changed", but whether it has
"changed" may depend on its meaning.  The information model in essence
defines what a "change" looks like; it's not really a question of whether
the application representation has changed.  If you upgrade Chandler and
the schema changes, how do you know if an object has changed?  By whether
its information-model representation (based on a particular,
uniquely-identified mapping) has changed.  This isn't really important for
dump and reload, but it might be for sharing.
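
One possible way to express that test is sketched below.  It assumes
(these are illustration-only choices, not anything the sharing system
actually does) that a record is a tuple of elementary values, that a
mapping is named by a string such as "note-v1", and that a SHA-256 digest
of the record's repr() is an acceptable fingerprint.

```python
import hashlib

MAPPING_ID = "note-v1"  # hypothetical identifier for one fixed mapping


def fingerprint(mapping_id, record):
    """Digest of the information-model record under a named mapping."""
    payload = repr((mapping_id, record)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def has_changed(item, mapping_id, to_record, last_fingerprint):
    """Compare the item's current record against a stored fingerprint."""
    return fingerprint(mapping_id, to_record(item)) != last_fingerprint


# The same mapping, implemented against two application schemas.
# Before the upgrade the title lives under "title" ...
def record_from_old_schema(item):
    return (item["uuid"], item["title"])


# ... and after the upgrade it lives under "displayName".
def record_from_new_schema(item):
    return (item["uuid"], item["displayName"])
```

Under these assumptions, an item whose application-level attributes were
renamed or extended by an upgrade still yields the same fingerprint, so
it does not count as changed; editing the title does change the
fingerprint.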
