[Dev] DRAFT: Python Schema API proposal
Phillip J. Eby
pje at telecommunity.com
Fri Apr 15 14:49:15 PDT 2005
-------------------------------------
Defining Chandler Schemas with Python
-------------------------------------
Introduction
============
As many of you may know, I have for some time been promoting the idea of
replacing parcel XML with Python code for defining item schemas, and I
created a proof of concept for this in the "Spike" project, found under
'internal' in the Chandler CVS.
Since the PyCon sprints, it's my understanding that there's now a broad and
actionable consensus at OSAF that it is indeed desirable to move to using
Python syntax in place of XML for parcels' schema definition. So, after
working with Andi and Grant to get the necessary infrastructure in place
within Chandler, I'd like to present my proposal for what the Python schema
definitions will look like, how migration might take place, and what new
possibilities for Chandler development these changes will enable.
If you haven't had a chance to look at Spike yet, you may find it helpful
to read at least the "Introduction" section of this document:
http://cvs.osafoundation.org/viewcvs.cgi/internal/Spike/src/spike/schema.txt?rev=HEAD&content-type=text/vnd.viewcvs-markup
which presents a simple Python syntax for defining schemas. The actual
syntax used in Chandler will be different, but the above document gives a
good introduction to the concept, with lots of working examples. (In fact,
the document is designed for use with Python's "doctest" module and is
literally a part of Spike's unit tests. As much as is practical, I'll be
using this approach for the changes to Chandler, so that the API will be
documented and tested at the same time as it's developed.)
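
As a rough illustration of the doctest approach (not taken from Spike itself), the sketch below runs the examples embedded in a piece of documentation text. The names ``DOC_TEXT`` and ``run_doc`` are made up for this example; only the ``doctest`` calls are standard library API.

```python
import doctest

# A tiny stand-in for a documentation file such as schema.txt: prose with
# interactive examples that double as tests.
DOC_TEXT = """
Defining a class gives you an ordinary Python object to work with:

    >>> class Note(object):
    ...     title = "untitled"
    >>> Note().title
    'untitled'
"""

def run_doc(text, name="example.txt"):
    """Run every example found in `text`; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(text, {}, name, name, 0)
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted

failed, attempted = run_doc(DOC_TEXT)
```

If an example's actual output ever drifts from what the text says, ``failed`` becomes non-zero, which is what keeps documentation and code in step.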
You'll notice, by the way, that the documentation doesn't talk much about
Kinds, or names, paths, repository views, and parents. That's because in
Spike's API, you don't need any of these things in order to create an
Item. You just create the item, and until you take some action to store
it, it's simply an ordinary Python object.
How It Will Work
================
Here's a snippet of XML from the parcel.xml of the osaf.contentmodel package::
    <Kind itsName="ContentItem">
      <superKinds itemref="Item"/>
      <classes key="python">osaf.contentmodel.ContentModel.ContentItem</classes>
      <description>Content Item is the abstract super-kind for things
      like Contacts, Calendar Events, Tasks, Mail Messages, and Notes. Content
      Items are user-level items, which a user might file, categorize, share,
      and delete.</description>
      <Attribute itsName="body">
        <displayName>Body</displayName>
        <type itemref="Lob"/>
        <description>All Content Items may have a body to contain
        notes. It's not decided yet whether this body would instead contain
        the payload for resource items such as presentations or spreadsheets
        -- resource items haven't been nailed down yet -- but the payload may
        be different from the notes because payload needs to know MIME type,
        etc.</description>
      </Attribute>
Here's the corresponding code in the proposed schema API::
    from application import schema   # not sure if this is where it will go
    from repository.schema import Types

    class ContentItem(schema.Item):
        """Base class for content items

        A content item (such as a contact, note, photo, etc.)  Content objects
        are user-level items that a user might file, categorize, share, and
        delete.
        """

        body = schema.One(Types.Lob,
            displayName = "Body",
            doc = """\
            All Content Items may have a body to contain notes.  It's not
            decided yet whether this body would instead contain the payload
            for resource items such as presentations or spreadsheets --
            resource items haven't been nailed down yet -- but the payload may
            be different from the notes because payload needs to know MIME
            type, etc."""
        )
The fundamental idea here is that Python class definitions replace Kind
elements, and Python property definitions replace Attribute
elements. Superkinds are defined by inheritance. Parcels are Python
packages. Standard Python "import" statements replace XML namespace
definitions.
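
To make that mapping a little more concrete, here is a minimal pure-Python sketch of how a class body can carry schema metadata. It is emphatically *not* the proposed API: ``One``, ``Item`` and ``attributes()`` below are hypothetical stand-ins with no repository behind them, and the sketch relies on ``__set_name__``, which only exists in modern Python.

```python
class One(object):
    """Hypothetical stand-in for a single-valued attribute declaration."""

    def __init__(self, type=None, **aspects):
        self.type = type
        self.aspects = aspects      # e.g. displayName, doc
        self.name = None            # filled in by __set_name__

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self             # class access returns the declaration
        try:
            return obj.__dict__[self.name]
        except KeyError:
            raise AttributeError(self.name)

    def __set__(self, obj, value):
        obj.__dict__[self.name] = value


class Item(object):
    """Hypothetical base class; each subclass plays the role of a Kind."""

    @classmethod
    def attributes(cls):
        """Collect declarations from this class and its base classes."""
        found = {}
        for klass in reversed(cls.__mro__):
            for name, value in vars(klass).items():
                if isinstance(value, One):
                    found[name] = value
        return found


class ContentItem(Item):
    body = One(str, displayName="Body")


class Note(ContentItem):            # inheritance stands in for superKinds
    title = One(str, displayName="Title")


note = Note()
note.body = "remember the milk"
```

Here ``Note.attributes()`` finds both ``body`` and ``title``, which is the sense in which inheritance does the job of the ``superKinds`` element.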
This has several useful consequences. First, it makes item classes
independent of parcel loading, which means they're easy to unit test. You
can simply create instances of items in order to run tests on
them. Second, it means that content classes don't need getKind() methods
and other chicanery to get access to a Kind object, just to be able to
create instances. Indeed, in all the ways that matter, items will just be
normal Python objects until/unless you link them with items that are
already stored in the repository (at which time they will become persistent).
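
The "persistent once linked" behaviour can be modelled with a toy like the one below. This is only a sketch under stated assumptions: ``View``, the ``view`` attribute and ``_store`` are invented for the illustration, and the real repository mechanism is certainly more involved.

```python
class View(object):
    """Hypothetical stand-in for a repository view: just a set of items."""

    def __init__(self, name):
        self.name = name
        self.items = set()


class Item(object):
    view = None     # None plays the role of "not stored anywhere yet"

    def __setattr__(self, name, value):
        object.__setattr__(self, name, value)
        if isinstance(value, Item):
            # linking pulls whichever side is unstored into the other's view
            if self.view is not None and value.view is None:
                value._store(self.view)
            elif value.view is not None and self.view is None:
                self._store(value.view)

    def _store(self, view):
        object.__setattr__(self, "view", view)
        view.items.add(self)
        # carry along anything this item already links to
        for linked in list(vars(self).values()):
            if isinstance(linked, Item) and linked.view is None:
                linked._store(view)


view = View("main")
folder = Item()
folder._store(view)             # an item that is already "in the repository"

note = Item()                   # created with no view at all
note.text = "hello"
was_transient = note.view is None

folder.newest = note            # linking is what makes the note persistent
```

Until the last line runs, ``note`` is an ordinary object that a unit test could create and inspect without any repository setup.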
This means routines that create new items will no longer need to know what
repository view the item is intended for. Instead, such routines can
simply create an instance of the appropriate class and return it without
further ado. As soon as the caller links the new item to a persisted item
(e.g. by setting an attribute), the new item will be persisted as
well. (This functionality will be made possible by the "null view" and
"view migration" features that Andi has added to the repository.)
Code vs. Data
-------------
Sometimes when I describe the preceding, people wonder if this use of
Python means that we are giving up on being "data driven", or if we will
still be able to allow users to create kinds and attributes. No, we are
not giving up on being data-driven, and we will be just as dynamic as before.
If you're not familiar with Python's ultra-dynamic nature, it might seem at
first that writing code must be less flexible or less dynamic than writing
XML, but this is not at all the case. The Python code for a schema
definition is just a script that creates data objects. These data objects
are no different from the data objects you would create by reading
XML. The only technical difference is that the Python code doesn't have to
parse the XML first! (Of course, there are aesthetic differences, too.)
Note also that just because some schema is defined by writing Python
classes, it doesn't stop Chandler from allowing users to create attributes
or kinds. Again, if you're used to more static languages like Java or C++,
it's natural to think of a class as something fixed. But Python allows you
to trivially create new classes on the fly. For example::
    def create_a_class(docstring, base_class=object):
        class aNewClass(base_class):
            __doc__ = docstring
        return aNewClass
This function returns a new, distinct class object each time it's
called. Each returned class will have the name "aNewClass", but it will be
a distinct class object. (And you could change its name by setting its
``__name__`` attribute, if you wanted to.)
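
A short usage sketch of that function shows these points in practice (the variable names ``first`` and ``second`` are just for illustration):

```python
def create_a_class(docstring, base_class=object):
    class aNewClass(base_class):
        __doc__ = docstring
    return aNewClass

first = create_a_class("First generated class")
second = create_a_class("Second generated class", base_class=first)

# Same default name, but two separate class objects
same_name = first.__name__ == second.__name__ == "aNewClass"
distinct = first is not second

# The name is just an attribute and can be changed after the fact
second.__name__ = "Renamed"
```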
If methods were defined in this "nested class" statement, they would have
access to any parameters that were passed to ``create_a_class``, which
would allow the methods to be customized for each new class created. In
effect, Python is its own macro language at this level. Also note that
there's no speed disadvantage here; the statements are compiled only once
(when the module is compiled), no matter how many times you call the
function and create new classes. They are not compiled on the fly; the
statements are just the same as any other Python statements, and there is
absolutely no observable distinction between the dynamically created
classes and "normal" classes, because *all* Python classes are dynamically
generated in exactly the same way!
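
Here is a small sketch of both claims: methods picking up the factory's parameters, and the body being compiled only once. ``make_greeter_class`` is a made-up example, and the final check on ``__code__`` reflects how CPython behaves rather than a language guarantee.

```python
def make_greeter_class(greeting):
    class Greeter(object):
        def greet(self, name):
            # `greeting` comes from the enclosing call, so each generated
            # class gets its own behaviour
            return "%s, %s!" % (greeting, name)
    return Greeter

Hello = make_greeter_class("Hello")
Howdy = make_greeter_class("Howdy")

hello_text = Hello().greet("Ann")
howdy_text = Howdy().greet("Ann")

# Two classes, two functions, but one compiled code object shared by both
shares_code = Hello.greet.__code__ is Howdy.greet.__code__
```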
So as you can see, Python is an extremely *fluid* language, and the
assumption that "code" is harder to change than data doesn't really carry
over from other languages. "Hard coding" *isn't*, in other words. So,
it's trivial to define fresh classes and descriptors to represent
user-defined kinds and attributes, and in fact the repository already does
this kind of class generation today to support multiple inheritance of kinds.
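
As one possible illustration (not how the repository actually does it), classes and descriptors for user-defined kinds can be built purely from data using the built-in ``type()`` and ``property()``. The helper names ``make_attribute`` and ``make_kind`` are hypothetical.

```python
def make_attribute(name, default=None):
    """Build a property that stores its value in the instance dict."""
    def getter(self):
        return self.__dict__.get(name, default)
    def setter(self, value):
        self.__dict__[name] = value
    return property(getter, setter)

def make_kind(kind_name, attribute_names, bases=(object,)):
    """Create a class at runtime from plain data, e.g. from user input."""
    namespace = dict((attr, make_attribute(attr)) for attr in attribute_names)
    return type(kind_name, bases, namespace)

Recipe = make_kind("Recipe", ["title", "servings"])
VeganRecipe = make_kind("VeganRecipe", ["substitutions"], bases=(Recipe,))

dish = VeganRecipe()
dish.title = "Chili"
```

Nothing here required a ``class`` statement, yet ``VeganRecipe`` is a perfectly ordinary class with ordinary inheritance.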
What do we gain from this? Well, it won't be necessary to keep track of or
look up Kinds in order to create items: just create an instance of the
class. And if there's a class for every Kind that needs to be referenced
"statically" in code, then you won't need to also keep track of repository
paths in order to get access to a kind; just import the class and ask for
its kind.
Parcel Loading
--------------
There are no plans to change the current parcel loading arrangements;
parcel.xml will remain a valid way to define schemas and instances. The
only change likely to be made to parcel loading is to ensure that a
parcel's Python modules are imported before trying to process instances
defined in the parcel.xml. This is to ensure that the kinds are present in
the repository before the instances are created. Apart from this change,
however, the parcel.xml format should not be impacted.
Existing parcels will be changed to use the new schema definition mechanism
on an "inside out" basis. That is, superkinds will be changed before
subkinds. This is because kinds defined in a parcel.xml can refer to kinds
defined in a Python module, but not the other way around. So, likely the
contentmodel parcel will be changed first.
There is, however, a new step that will have to be done when new kinds or
attribute definitions are added to a parcel defined using Python. Each
kind or attribute needs a permanent UUID assigned to it, as this UUID will
be used to synchronize the Python module with the repository, and in the
future it may be used to help support schema evolution. Spike has a tool
that will automatically assign UUIDs for you, so that you don't have to do
it by hand::
http://cvs.osafoundation.org/viewcvs.cgi/internal/Spike/src/spike/uuidgen.txt?rev=HEAD&content-type=text/vnd.viewcvs-markup
(Of course, it will have to be ported to work with the new Chandler schema
API, because Spike doesn't currently integrate with the repository.)
If you forget to run the tool over a module whose schema has changed, and
you didn't set up the UUIDs by hand, an exception will be raised when you
try to create instances of the new or changed classes. There should be a
reminder in the error message telling you to run the UUID generation tool
to resolve the error.
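
One way such a check could work is sketched below. This is purely an assumption about the shape of the mechanism: the ``schema_uuid`` class attribute and ``MissingUUIDError`` are invented names, and the UUID value is arbitrary.

```python
import uuid

class MissingUUIDError(TypeError):
    """Hypothetical error raised for classes lacking an assigned UUID."""

class Item(object):
    schema_uuid = None      # hypothetical: each class must assign its own

    def __init__(self):
        cls = type(self)
        # Look only at the class's own namespace, so that a subclass cannot
        # silently reuse the UUID of its base class.
        if not isinstance(vars(cls).get("schema_uuid"), uuid.UUID):
            raise MissingUUIDError(
                "%s has no UUID assigned; run the UUID generation tool "
                "on its module" % cls.__name__)

class Task(Item):
    schema_uuid = uuid.UUID("12345678-1234-5678-1234-567812345678")

class Unregistered(Task):   # new kind, tool not yet run
    pass

task = Task()               # fine: Task has its own UUID
try:
    Unregistered()
except MissingUUIDError as exc:
    message = str(exc)
```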
API "Quick Reference"
---------------------
It is currently an open issue where the API will live. But it's going to
be a module called ``schema``, such that you'll do ``from somewhere import
schema``; it's just not clear yet what ``somewhere`` will be. Here are the
main features of interest:
``schema.Item``
    The base class for persistent items; inherit from it or a subclass.
    Note that your Python inheritance relationship will determine the
    superkind hierarchy of your newly defined kinds, so you will want to be
    sure that you subclass the appropriate base kind class, rather than
    subclassing everything directly from ``schema.Item``.
``schema.One``
    Define an attribute of "single" cardinality, optionally specifying any
    attribute aspects like its type and display name.
``schema.Many``
    Define an attribute of "set" cardinality (once this is available in
    the repository), optionally specifying any attribute aspects like its
    type and display name.
``schema.Sequence``
    Define an attribute of "list" cardinality, optionally specifying any
    attribute aspects like its type and display name.
``schema.Mapping``
    Define an attribute of "dict" cardinality, optionally specifying any
    attribute aspects like its type and display name.
``schema.Cloud``
    Define a cloud attribute. (This isn't entirely worked out yet; Spike
    was using a different approach to the cloud concept, so I may need some
    assistance from someone wise in the ways of clouds before getting a
    concrete API defined for this.)
In order to reference types (as opposed to kinds), you'll import them from
``repository.schema.Types``. For example, ``Types.String`` to define a
string attribute. For attributes that reference other kinds, you'll just
import the corresponding class directly from the appropriate module.
Attribute aspects will mostly be keyword arguments to the attribute
definitions. Inverse attributes for bidirectional relationships will be
specified with an ``inverse`` keyword, and as in Spike they will refer to
an attribute of the other class. For example::
    class ContentItem(schema.Item):
        ...
        creator = schema.One(
            displayName = "Created By",
            doc = "Link to the contact who created the item",
        )

    class Contact(ContentItem):
        itemsCreated = schema.Many(
            ContentItem,    # set of ContentItem
            inverse = ContentItem.creator,
            ...
        )
Notice that the inverse need only be specified on *one* side of the
bidirectional relationship -- whichever side is defined last.
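
To show why declaring the inverse on one side is enough, here is a toy model in plain Python. It is only a sketch: ``Link``, ``One`` and ``Many`` are hypothetical stand-ins that handle just the one-to-many case, take no type argument, and know nothing about the repository.

```python
class Link(object):
    """Hypothetical base for declarations that can be paired as inverses."""
    inverse = None

    def __init__(self, inverse=None):
        if inverse is not None:
            # declaring the inverse on one side wires up both sides
            self.inverse = inverse
            inverse.inverse = self

    def __set_name__(self, owner, name):
        self.name = name


class Many(Link):
    def collection(self, obj):
        # stored under a private key so that __get__ is always consulted
        return obj.__dict__.setdefault("_" + self.name, set())

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return frozenset(self.collection(obj))  # read-only in this sketch


class One(Link):
    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.__dict__.get(self.name)

    def __set__(self, obj, value):
        old = obj.__dict__.get(self.name)
        if old is not None and self.inverse is not None:
            self.inverse.collection(old).discard(obj)
        obj.__dict__[self.name] = value
        if value is not None and self.inverse is not None:
            self.inverse.collection(value).add(obj)


class ContentItem(object):
    creator = One()

class Contact(ContentItem):
    itemsCreated = Many(inverse=ContentItem.creator)


alice = Contact()
bob = Contact()
note = ContentItem()

note.creator = alice
after_link = alice.itemsCreated     # alice now "owns" the note

note.creator = bob                  # reassigning updates both contacts
```

Because ``Contact`` is defined after ``ContentItem``, only its declaration can name the other side, which matches the "whichever side is defined last" rule.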
Implementation Tasks
====================
1. Update Spike's code generator tests to use the repository's new "null
view" instead of a memory repository. (DONE; this yielded a 40% speed
improvement for the tests, dropping pack load time from roughly 1.3 seconds
to about 0.8 seconds.)
2. Add Spike tests to prototype programmatic creation of repository Kinds
and Attributes, and setting their UUIDs at construction time.
3. Test subclassing the repository's new C-based descriptor types and
adding Spike-style metadata to them.
4. Implement the actual schema API and doctests in the main Chandler
codebase for Kinds and Attributes. (This is pending a decision on where
the API should live in the Chandler package namespace; maybe that decision
can be wrapped up next week while I'm in SFO.)
5. Define and implement a cloud-definition API (probably needs some input
from persons Wise in the Ways of Clouds)
6. Port Spike's UUID generation tool (and docs) to work with modules using
the Chandler schema API
7. Attempt a port of the ``contentmodel`` parcel using the API, possibly
with participation by others. (Note: Andi would need to have completed the
repository auto-import feature before this would actually be usable in the
Chandler application.)
8. Modify the parcel loading facilities to ensure that modules defining
kinds are imported before loading parcel.xml files that define instances of
those kinds. (This might need to be done by someone other than me; it
might also require some minor changes to existing parcels or to the rules
for how parcel loading is sequenced.)
9. Investigate possible synergy between the descriptor-level aspect caching
that Andi wants to do for performance reasons, and the aspect setting that
the schema API needs to do for schema definition reasons. (This will
probably actually happen while I'm in SFO next week; it's only at the
bottom of this list because it's optional in the general scheme of things.)
10. Investigate the feasibility of implementing Spike's
``schema.Relationship`` concept for Chandler, to allow creation of global
attributes that don't appear in a class' static API, allowing parcels to
expand/extend existing parcels.
In Conclusion
=============
* Python class definitions offer a compact and convenient way to specify
Chandler schemas that will be easier and less error-prone to use than
parcel.xml, without losing any of Chandler's current or planned flexibility.
* parcel.xml isn't going away, and during the transition any schema
components defined in parcel.xml should be able to co-exist with those
defined using Python (barring any inter-dependency issues and assuming no
other issues arise).
* Using Python-defined schema means that content items can be unit tested
in isolation, without parcel loading overhead, making fast unit tests
possible, enabling a test-driven approach to development of the non-UI
portions of Chandler. It also reduces coupling between routines that
currently have to ferry repository views or items around in order to be
able to find kinds and set parents on newly created items.
I hope that this was informative and helpful. I will be in OSAF's San
Francisco offices next Monday through Thursday (April 18th-21st), so if
you'd like to spend some time talking about any aspect of this proposal
during those days, please let me know. Thanks!