[Dev] DRAFT: Python Schema API proposal

Phillip J. Eby pje at telecommunity.com
Fri Apr 15 14:49:15 PDT 2005


-------------------------------------
Defining Chandler Schemas with Python
-------------------------------------


Introduction
============

As many of you may know, I've been promoting for some time now the idea of 
replacing parcel XML with Python code for defining item schemas, and I 
created a proof-of-concept for this in the "Spike" project, found under 
'internal' in the Chandler CVS.

Since the PyCon sprints, it's my understanding that there's now a broad and 
actionable consensus at OSAF that it is indeed desirable to move to using 
Python syntax in place of XML for parcels' schema definition.  So, after 
working with Andi and Grant to get the necessary infrastructure in place 
within Chandler, I'd like to present my proposal for what the Python schema 
definitions will look like, how migration might take place, and what new 
possibilities for Chandler development these changes will enable.

If you haven't had a chance to look at Spike yet, you may find it helpful 
to read at least the "Introduction" section of this document:

http://cvs.osafoundation.org/viewcvs.cgi/internal/Spike/src/spike/schema.txt?rev=HEAD&content-type=text/vnd.viewcvs-markup

which presents a simple Python syntax for defining schemas.  The actual 
syntax used in Chandler will be different, but the above document gives a 
good introduction to the concept, with lots of working examples.  (In fact, 
the document is designed for use with Python's "doctest" module and is 
literally a part of Spike's unit tests.  As much as is practical, I'll be 
using this approach for the changes to Chandler, so that the API will be 
documented and tested at the same time as it's developed.)

You'll notice, by the way, that the documentation doesn't talk much about 
Kinds, or names, paths, repository views, and parents.  That's because in 
Spike's API, you don't need any of these things in order to create an 
Item.  You just create the item, and until you take some action to store 
it, it's simply an ordinary Python object.


How It Will Work
================

Here's a snippet of XML from the parcel.xml of the osaf.contentmodel package::

     <Kind itsName="ContentItem">
         <superKinds itemref="Item"/>
         <classes key="python">osaf.contentmodel.ContentModel.ContentItem</classes>
         <description>Content Item is the abstract super-kind for things
             like Contacts, Calendar Events, Tasks, Mail Messages, and Notes.
             Content Items are user-level items, which a user might file,
             categorize, share, and delete.</description>
         <Attribute itsName="body">
             <displayName>Body</displayName>
             <type itemref="Lob"/>
             <description>All Content Items may have a body to contain
                 notes.  It's not decided yet whether this body would instead
                 contain the payload for resource items such as presentations
                 or spreadsheets -- resource items haven't been nailed down
                 yet -- but the payload may be different from the notes
                 because payload needs to know MIME type, etc.</description>
         </Attribute>

Here's the corresponding code in the proposed schema API::

     from application import schema    # not sure if this is where it will go
     from repository.schema import Types

     class ContentItem(schema.Item):
         """Base class for content items

         A content item (such as a contact, note, photo, etc.)  Content
         objects are user-level items that a user might file, categorize,
         share, and delete.
         """

         body = schema.One(Types.Lob,
             displayName = "Body",
             doc = """\
             All Content Items may have a body to contain notes.  It's not
             decided yet whether this body would instead contain the payload
             for resource items such as presentations or spreadsheets --
             resource items haven't been nailed down yet -- but the payload
             may be different from the notes because payload needs to know
             MIME type, etc."""
         )

The fundamental idea here is that Python class definitions replace Kind 
elements, and Python property definitions replace Attribute 
elements.  Superkinds are defined by inheritance.  Parcels are Python 
packages.  Standard Python "import" statements replace XML namespace 
definitions.
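
To make the mapping a little more concrete, here is a rough, self-contained sketch of how a class statement can carry schema metadata.  The ``One`` and ``Item`` classes below are toy stand-ins written for this illustration (Python 3), not the proposed Chandler API, and the ``schema_attributes`` helper is purely hypothetical:

```python
# Toy stand-ins for illustration only; this is NOT the Chandler schema API.

class One:
    """A single-valued attribute that remembers its schema metadata."""

    def __init__(self, attr_type=None, displayName=None, doc=None):
        self.attr_type = attr_type
        self.displayName = displayName
        self.__doc__ = doc
        self.name = None            # filled in by __set_name__ below

    def __set_name__(self, owner, name):
        # Python calls this when the enclosing class statement finishes.
        self.name = name

    def __get__(self, obj, cls=None):
        if obj is None:
            return self             # class-level access yields the descriptor
        try:
            return obj.__dict__[self.name]
        except KeyError:
            raise AttributeError(self.name)

    def __set__(self, obj, value):
        obj.__dict__[self.name] = value


class Item:
    @classmethod
    def schema_attributes(cls):
        """Hypothetical helper: collect attribute descriptors, base first."""
        found = {}
        for klass in reversed(cls.__mro__):
            for name, value in vars(klass).items():
                if isinstance(value, One):
                    found[name] = value
        return found


class ContentItem(Item):            # plays the role of a Kind
    body = One(str, displayName="Body", doc="Notes attached to the item")


class Note(ContentItem):            # inheritance plays the role of superKinds
    pass


note = Note()
note.body = "remember the milk"
```

In this sketch ``Note`` picks up ``body`` through ordinary inheritance, and the metadata remains available from the class via ``Note.schema_attributes()``.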

This has several useful consequences.  First, it makes item classes 
independent of parcel loading, which means they're easy to unit test.  You 
can simply create instances of items in order to run tests on 
them.  Second, it means that content classes don't need getKind() methods 
and other chicanery to get access to a Kind object, just to be able to 
create instances.  Indeed, in all the ways that matter, items will just be 
normal Python objects until/unless you link them with items that are 
already stored in the repository (at which time they will become persistent).
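
The "ordinary object until linked" behavior can be pictured with a toy model.  Everything here (``View``, ``Item``, ``link``, ``move_to``) is a hypothetical stand-in invented for the illustration; the real repository may well work differently:

```python
# Toy model only; the names below are assumptions, not repository APIs.

class View:
    """Stands in for a repository view that holds persistent items."""

    def __init__(self, name):
        self.name = name
        self.items = []


class Item:
    def __init__(self):
        self.view = None            # None means "not stored anywhere yet"
        self.links = {}

    def link(self, name, other):
        """Reference another item, spreading persistence across the link."""
        self.links[name] = other
        if self.view is not None and other.view is None:
            other.move_to(self.view)
        elif other.view is not None and self.view is None:
            self.move_to(other.view)

    def move_to(self, view):
        self.view = view
        view.items.append(self)
        # Carry along anything this item already references.
        for other in self.links.values():
            if other.view is None:
                other.move_to(view)


view = View("main")
folder = Item()
folder.move_to(view)                # an item that is already stored

note = Item()                       # created without any view at all
was_transient = note.view is None
folder.link("latest", note)         # linking is what makes it persistent
```

Until the ``link`` call, ``note`` is just an object in memory, which is what makes this style convenient for unit tests.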

This means routines that create new items will no longer need to know what 
repository view the item is intended for.  Instead, such routines can 
simply create an instance of the appropriate class and return it without 
further ado.  As soon as the caller links the new item to a persisted item 
(e.g. by setting an attribute), the new item will be persisted as 
well.  (This functionality will be made possible by the "null view" and 
"view migration" features that Andi has added to the repository.)


Code vs. Data
-------------

Sometimes when I describe the preceding, people wonder if this use of 
Python means that we are giving up on being "data driven", or if we will 
still be able to allow users to create kinds and attributes.  No, we are 
not giving up on being data-driven, and we will be just as dynamic as before.

If you're not familiar with Python's ultra-dynamic nature, it might seem at 
first that writing code must be less flexible or less dynamic than writing 
XML, but this is not at all the case.  The Python code for a schema 
definition is just a script that creates data objects.  These data objects 
are no different than the data objects you would create by reading 
XML.  The only technical difference is that the Python code doesn't have to 
parse the XML first!  (Of course, there are aesthetic differences, too.)

Note also that just because some schema is defined by writing Python 
classes, it doesn't stop Chandler from allowing users to create attributes 
or kinds.  Again, if you're used to more static languages like Java or C++, 
it's natural to think of a class as something fixed.  But Python allows you 
to trivially create new classes on the fly.  For example::

     def create_a_class(docstring, base_class=object):
         class aNewClass(base_class):
             __doc__ = docstring
         return aNewClass

This function returns a new, distinct class object each time it's 
called.  Every one of those classes will be named "aNewClass", although 
you could change a class's name by setting its ``__name__`` attribute, if 
you wanted to.
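
A short session along these lines shows the behavior; the docstrings and the new name are arbitrary values chosen for the example:

```python
def create_a_class(docstring, base_class=object):
    class aNewClass(base_class):
        __doc__ = docstring
    return aNewClass


first = create_a_class("The first class")
second = create_a_class("The second class", base_class=first)

# Both classes start out with the same name but are separate objects.
same_name = first.__name__ == second.__name__
distinct = first is not second

# Renaming one of them leaves the other untouched.
second.__name__ = "Second"
```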

If methods were defined in this "nested class" statement, they would have 
access to any parameters that were passed to ``create_a_class``, which 
would allow the methods to be customized for each new class created.  In 
effect, Python is its own macro language at this level.  Also note that 
there's no speed disadvantage here; the statements are compiled only once 
(when the module is compiled), no matter how many times you call the 
function and create new classes.  They are not compiled on the fly; the 
statements are just the same as any other Python statements, and there is 
absolutely no observable distinction between the dynamically created 
classes and "normal" classes, because *all* Python classes are dynamically 
generated in exactly the same way!
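
As a small sketch of both points, the hypothetical factory below defines a method that uses a parameter of the enclosing function.  The names ``create_greeter_class`` and ``Greeter`` are made up for this example:

```python
def create_greeter_class(greeting):
    class Greeter(object):
        def greet(self, name):
            # "greeting" is looked up in the enclosing function call
            return "%s, %s!" % (greeting, name)
    return Greeter


Hello = create_greeter_class("Hello")
Howdy = create_greeter_class("Howdy")

hello_text = Hello().greet("Chandler")
howdy_text = Howdy().greet("Chandler")

# Each call makes new class and function objects, but they all share the
# single code object that was compiled along with the module.
shares_code = Hello.greet.__code__ is Howdy.greet.__code__
```

The two classes behave differently, yet ``shares_code`` is true: the method body was compiled once and is simply reused for every class the factory produces.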

So as you can see, Python is an extremely *fluid* language, and the 
assumption that "code" is harder to change than data doesn't really carry 
over from other languages.  "Hard coding" *isn't*, in other words.  So, 
it's trivial to define fresh classes and descriptors to represent 
user-defined kinds and attributes, and in fact the repository already does 
this kind of class generation today to support multiple inheritance of kinds.
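
For instance, a class can be built directly from plain data with the built-in three-argument ``type()`` call.  The ``define_kind`` helper and the sample data here are assumptions made for the sketch, and say nothing about how the repository actually generates its classes:

```python
def define_kind(name, attributes, base=object):
    """Hypothetical helper: build a class from a name and default values."""
    namespace = dict(attributes)
    namespace["__doc__"] = "User-defined kind %r" % name
    # type(name, bases, namespace) is what a class statement does under
    # the hood.
    return type(name, (base,), namespace)


# Data such as this could just as easily come from a dialog box or a file.
user_input = {"name": "Recipe", "attributes": {"title": "", "servings": 4}}

Recipe = define_kind(user_input["name"], user_input["attributes"])
GuestRecipe = define_kind("GuestRecipe", {"guest": None}, base=Recipe)

dinner = GuestRecipe()
dinner.guest = "Andi"
```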

What do we gain from this?  Well, it won't be necessary to keep track of or 
look up Kinds in order to create items: just create an instance of the 
class.  And if there's a class for every Kind that needs to be referenced 
"statically" in code, then you won't need to also keep track of repository 
paths in order to get access to a kind; just import the class and ask for 
its kind.


Parcel Loading
--------------

There are no plans to change the current parcel loading arrangements; 
parcel.xml will remain a valid way to define schemas and instances.  The 
only change likely to be made to parcel loading is to ensure that a 
parcel's Python modules are imported before trying to process instances 
defined in the parcel.xml.  This is to ensure that the kinds are present in 
the repository before the instances are created.  Apart from this change, 
however, the parcel.xml format should not be impacted.

Existing parcels will be changed to use the new schema definition mechanism 
on an "inside out" basis.  That is, superkinds will be changed before 
subkinds.  This is because kinds defined in a parcel.xml can refer to kinds 
defined in a Python module, but not the other way around.  So, likely the 
contentmodel parcel will be changed first.

There is, however, a new step that will have to be done when new kinds or 
attribute definitions are added to a parcel defined using Python.  Each 
kind or attribute needs a permanent UUID assigned to it, as this UUID will 
be used to synchronize the Python module with the repository, and in the 
future it may be used to help support schema evolution.  Spike has a tool 
that will automatically assign UUIDs for you, so that you don't have to do 
it by hand:

http://cvs.osafoundation.org/viewcvs.cgi/internal/Spike/src/spike/uuidgen.txt?rev=HEAD&content-type=text/vnd.viewcvs-markup

(Of course, it will have to be ported to work with the new Chandler schema 
API, because Spike doesn't currently integrate with the repository.)

If you forget to run the tool over a module whose schema has changed, and 
you didn't set up the UUIDs by hand, an exception will be raised when you 
try to create instances of the new or changed classes.  There should be a 
reminder in the error message telling you to run the UUID generation tool 
to resolve the error.
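
The check could look something like the following sketch.  How UUIDs will actually be stored is not settled here, so the ``schema_uuid`` class attribute, the ``MissingUUIDError`` exception, and the sample UUID value are all assumptions made for illustration:

```python
import uuid


class MissingUUIDError(Exception):
    """Hypothetical error for a kind that has no UUID assigned yet."""


class Item:
    schema_uuid = None      # assumed to be filled in by the UUID tool

    def __init__(self):
        # Look only at this exact class, so that a subclass cannot
        # silently reuse the UUID assigned to its base class.
        if vars(type(self)).get("schema_uuid") is None:
            raise MissingUUIDError(
                "%s has no UUID; run the UUID generation tool on module %s"
                % (type(self).__name__, type(self).__module__)
            )


class Note(Item):
    schema_uuid = uuid.UUID("12345678-1234-5678-1234-567812345678")


class Task(Item):
    pass                    # newly added kind; the tool has not been run


note = Note()               # fine: Note has its own UUID

try:
    Task()
except MissingUUIDError as exc:
    message = str(exc)
else:
    message = None
```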


API "Quick Reference"
---------------------

It is currently an open issue where the API will live.  But it's going to 
be a module called ``schema``, such that you'll do ``from somewhere import 
schema``; it's just not clear yet what ``somewhere`` will be.  Here are the 
main features of interest:

``schema.Item``
     The base class for persistent items; inherit from it or a
     subclass.  Note that your Python inheritance relationship will determine
     the superkind hierarchy of your newly defined kinds, so you will want to
     be sure that you subclass the appropriate base kind class, rather than
     subclassing everything directly from ``schema.Item``.

``schema.One``
     Define an attribute of "single" cardinality, optionally specifying any
     attribute aspects like its type and display name.

``schema.Many``
     Define an attribute of "set" cardinality (once this is available in
     the repository), optionally specifying any attribute aspects like its
     type and display name.

``schema.Sequence``
     Define an attribute of "list" cardinality, optionally specifying any
     attribute aspects like its type and display name.

``schema.Mapping``
     Define an attribute of "dict" cardinality, optionally specifying any
     attribute aspects like its type and display name.

``schema.Cloud``
     Define a cloud attribute.  (This isn't entirely worked out yet; Spike
     was using a different approach to the cloud concept, so I may need some
     assistance from someone wise in the ways of clouds before getting a
     concrete API defined for this.)

In order to reference types (as opposed to kinds), you'll import them from 
``repository.schema.Types``.  For example, ``Types.String`` to define a 
string attribute.  For attributes that reference other kinds, you'll just 
import the corresponding class directly from the appropriate module.

Attribute aspects will mostly be keyword arguments to the attribute 
definitions.  Inverse attributes for bidirectional relationships will be 
specified with an ``inverse`` keyword, and as in Spike they will refer to 
an attribute of the other class.  For example::

     class ContentItem(schema.Item):
         ...
         creator = schema.One(
             displayName = "Created By",
             doc = "Link to the contact who created the item",
         )

     class Contact(ContentItem):
         itemsCreated = schema.Many(
             ContentItem,    # set of ContentItem
             inverse = ContentItem.creator,
             ...
         )

Notice that the inverse need only be specified on *one* side of the 
bidirectional relationship -- whichever side is defined last.
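
One way such one-sided wiring could work is sketched below with toy descriptors (Python 3).  These ``One`` and ``Many`` classes are simplified stand-ins written for the example, not the proposed API; among other things, they do not handle reassigning or clearing a value:

```python
# Toy descriptors for illustration; not the Chandler schema API.

class Attribute:
    def __init__(self, inverse=None):
        self.inverse = inverse
        if inverse is not None:
            inverse.inverse = self      # back-fill the side defined earlier

    def __set_name__(self, owner, name):
        self.name = name


class Many(Attribute):
    def __get__(self, obj, cls=None):
        if obj is None:
            return self
        # Create the set on first use and keep it on the instance.
        return obj.__dict__.setdefault(self.name, set())


class One(Attribute):
    def __get__(self, obj, cls=None):
        if obj is None:
            return self
        return obj.__dict__.get(self.name)

    def __set__(self, obj, value):
        obj.__dict__[self.name] = value
        if self.inverse is not None:
            # Keep the other end of the relationship in step.
            self.inverse.__get__(value).add(obj)


class ContentItem:
    creator = One()                     # no inverse given on this side


class Contact(ContentItem):
    itemsCreated = Many(inverse=ContentItem.creator)


alice = Contact()
note = ContentItem()
note.creator = alice                    # also updates alice.itemsCreated
```

Because ``Many`` fills in ``ContentItem.creator.inverse`` when it is constructed, both descriptors end up pointing at each other even though only the later definition mentions the relationship.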


Implementation Tasks
====================

1. Update Spike's code generator tests to use the repository's new "null
   view" instead of a memory repository.  (DONE; this yielded a 40% speed
   improvement for the tests, dropping pack load time from roughly 1.3
   seconds to about 0.8 seconds.)

2. Add Spike tests to prototype programmatic creation of repository Kinds
   and Attributes, and setting their UUIDs at construction time.

3. Test subclassing the repository's new C-based descriptor types and
   adding Spike-style metadata to them.

4. Implement the actual schema API and doctests in the main Chandler
   codebase for Kinds and Attributes.  (This is pending a decision of where
   the API should live in the Chandler package namespace; maybe that
   decision can be wrapped up next week while I'm in SFO.)

5. Define and implement a cloud-definition API (probably needs some input
   from persons Wise in the Ways of Clouds).

6. Port Spike's UUID generation tool (and docs) to work with modules using
   the Chandler schema API.

7. Attempt a port of the ``contentmodel`` parcel using the API, possibly
   with participation by others.  (Note: Andi would need to have completed
   the repository auto-import feature before this would actually be usable
   in the Chandler application.)

8. Modify the parcel loading facilities to ensure that modules defining
   kinds are imported before loading parcel.xml files that define instances
   of those kinds.  (This might need to be done by someone other than me; it
   might also require some minor changes to existing parcels or to the rules
   for how parcel loading is sequenced.)

9. Investigate possible synergy between the descriptor-level aspect caching
   that Andi wants to do for performance reasons, and the aspect setting
   that the schema API needs to do for schema definition reasons.  (This
   will probably actually happen while I'm in SFO next week; it's only at
   the bottom of this list because it's optional in the general scheme of
   things.)

10. Investigate the feasibility of implementing Spike's
    ``schema.Relationship`` concept for Chandler, to allow creation of
    global attributes that don't appear in a class' static API, allowing
    parcels to expand/extend existing parcels.


In Conclusion
=============

* Python class definitions offer a compact and convenient way to specify
  Chandler schemas that will be easier and less error-prone to use than
  parcel.xml, without losing any of Chandler's current or planned
  flexibility.

* parcel.xml isn't going away, and during the transition any schema
  components defined in parcel.xml should be able to co-exist with those
  defined using Python (barring any inter-dependency issues and assuming no
  other issues arise).

* Using Python-defined schema means that content items can be unit tested
  in isolation, without parcel loading overhead, making fast unit tests
  possible, enabling a test-driven approach to development of the non-UI
  portions of Chandler.  It also reduces coupling between routines that
  currently have to ferry repository views or items around in order to be
  able to find kinds and set parents on newly created items.

I hope that this was informative and helpful.  I will be in OSAF's San 
Francisco offices next Monday through Thursday (April 18th-21st), so if 
you'd like to spend some time talking about any aspect of this proposal 
during those days, please let me know.  Thanks! 


