[Chandler-dev] [Sum] The Great Architecture Discussion of 2007

Andi Vajda vajda at osafoundation.org
Tue Oct 9 17:12:42 PDT 2007


On Tue, 9 Oct 2007, Phillip J. Eby wrote:

> At 04:12 PM 10/9/2007 -0700, Andi Vajda wrote:
>
>> On Tue, 9 Oct 2007, Phillip J. Eby wrote:
>> 
>>> 1. application-level code meddling in storage-level details
>> 
>> Could you give some examples ?
>
> Any place where the application is creating collections or working with 
> indexes in order to achieve performance compared to "naive" iteration or 
> queries.

I see. Creating a collection is like creating a query. Are you proposing 
that, in the relational world, the app not write queries ?

On indexes I see your point, I think. An index is a query's cache. Not 
something one would want to expose to the app. Funnily enough, indexes only 
_became_ this later: they started as a way to access collections by row 
number for displaying in the UI. Then they became a query's cache (when we 
implemented abstract sets and collections), and after that they were used as 
a way of persisting sort order.

In any case, I don't see how this is different in a relational model.
Once you work on extracting performance from a relational app, you end up 
writing hardcoded queries that embed very specific app knowledge.

But maybe this thread isn't about relational vs object - as I'm afraid it is - 
but perhaps about better app layering ?

>>> 2. lack of sufficient domain-specific query APIs
>> 
>> Again, please give an example of what you'd like ?
>
> This isn't a repository problem - it's a domain-layer problem.  If the places 
> where we're doing #1 were at least consolidated to single points of 
> reference, #1 wouldn't be so bad.

I think the app has done a pretty good job of moving a lot of the index 
maintenance code to a specific area. I'm thinking of the dashboard indexes 
here.

>>> 3. no indirection between the application's logical schema and its 
>>> physical storage schema
>> 
>> Seems incorrect. I can change the physical storage schema (core schema or 
>> even repo format) without affecting app code. Or am I misunderstanding 
>> something ?
>
> Sorry, I am using the relational meaning of logical and physical.  A logical 
> schema does not include indexes or views, while a physical schema does.  I'm 
> also extending this to refer to the lack of distinction between our preferred 
> form of data as encapsulated objects, versus the best divisions of data from 
> a performance point of view.

In Chandler we've had for a long time the distinction between capital 'I' 
Items and lowercase 'i' items. This distinction has materialized most clearly 
in the dump/reload/eim work, which is a way to export 'I' Items. The 
repository, on the other hand, deals with 'i' items. Isn't this equivalent to 
what you're talking about ?

As for indexes, yes you're correct. They're not part of the logical schema. 
They're performance implementation details chosen by the app, just as a 
relational app ultimately has to know about table layout, keys and indexes, 
and put kludges into stored procedures, to make its queries efficient.

> The core schema and repo format aren't a factor in this, as they're at an 
> even lower level than the "physical" schema I'm talking about.  In the 
> repository today, the "physical" schema consists of whatever sets/collections 
> and indexes you create, which is rather analogous to creating indexes or 
> materialized views in an RDBMS, only without the same transparency.  In an 
> RDBMS, if you add an index or a materialized view, it doesn't change how you 
> retrieve your data: it just goes faster.  So you can do application specific 
> tuning without changing your application.

Same with the repository. It just goes faster. You don't have to change the 
way you access your data once you've created indexes. The exception is random 
row number-based access, for which I didn't dare write the iterating APIs. But 
if you look in the collection code, it takes the fast route if it can find an 
index and the slow route if it can't, for iteration, membership tests, etc. No 
need to change the access code at the app level to take advantage of the 
indexes. A repository index is a materialized view of a collection in 
relational terms.
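
To make the "fast route / slow route" idea concrete, here is a minimal sketch 
in Python. It is not the repository's collection code: the names Collection, 
add_index and sorted_by are hypothetical, and the toy index is not kept up to 
date when items change. It only shows that the caller uses the same access 
call whether or not an index exists.

```python
from operator import itemgetter


class Collection:
    """Toy collection of dict items (hypothetical, for illustration only)."""

    def __init__(self, items):
        self._items = list(items)
        self._indexes = {}       # attribute name -> items pre-sorted by it
        self.used_index = False  # records which route the last call took

    def add_index(self, attr):
        # Materialize the sort order once so later reads can reuse it.
        self._indexes[attr] = sorted(self._items, key=itemgetter(attr))

    def sorted_by(self, attr):
        index = self._indexes.get(attr)
        if index is not None:
            self.used_index = True
            return list(index)                            # fast route
        self.used_index = False
        return sorted(self._items, key=itemgetter(attr))  # slow route


items = [{"title": "b", "rank": 2}, {"title": "a", "rank": 1}]
coll = Collection(items)
before = coll.sorted_by("rank")  # no index yet: sorts on every call
coll.add_index("rank")
after = coll.sorted_by("rank")   # same call, now served from the index
```

The access code (sorted_by) is identical before and after add_index; only 
the cost changes.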

>>> 4. implementing a generic database inside another generic database
>> 
>> That was the goal, originally.
>
> Not quite; having a generic database was the goal, not that it be implemented 
> *inside* another generic database.  It is one thing to have a BerkeleyDB 
> persistence layer driven by the application's dynamic schema, and another one 
> altogether to implement a database on top of a fixed BerkeleyDB schema.
>
> For comparison purposes, consider OpenLDAP: it is a generic, hierarchical, 
> networked database implemented atop BerkeleyDB.  However, instead of having a 
> fixed schema for storing values, items, etc., in BerkeleyDB, it is 
> dynamically extended as attribute types and indexes are added.  So the 
> database is *represented* in BerkeleyDB, rather than being implemented 
> *inside* BerkeleyDB.

I think we disagree or misunderstand each other here. Or maybe I'm simply not 
following you. While it's not relational, the Chandler repository has to jump 
through the same hoops as OpenLDAP or MySQL to store anything in Berkeley DB. 
Berkeley DB can only store key/value pairs of byte strings in b-trees, hashes, 
queues, and a fourth structure whose name escapes me at the moment.
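
As an illustration of those hoops, here is a minimal sketch of flattening an 
item into byte-string key/value pairs. Everything in it is an assumption made 
for the example: the key layout (item UUID bytes, "/", attribute name), the 
JSON value encoding, and the plain dict standing in for a Berkeley DB b-tree. 
It is not the repository's actual on-disk format.

```python
import json
import uuid


def encode_item(item_id, attributes):
    """Flatten one item into (key, value) pairs of byte strings."""
    pairs = []
    for name, value in sorted(attributes.items()):
        # Key: 16 UUID bytes, a separator, then the attribute name.
        key = item_id.bytes + b"/" + name.encode("utf-8")
        pairs.append((key, json.dumps(value).encode("utf-8")))
    return pairs


def decode_item(store, item_id):
    """Rebuild an item's attributes from every key sharing its prefix."""
    prefix = item_id.bytes + b"/"
    return {
        key[len(prefix):].decode("utf-8"): json.loads(value.decode("utf-8"))
        for key, value in store.items()
        if key.startswith(prefix)
    }


store = {}  # stand-in for a bytes -> bytes b-tree
item_id = uuid.UUID(int=1)
store.update(encode_item(item_id, {"title": "Lunch", "duration": 60}))
restored = decode_item(store, item_id)
```

Whatever the data model on top, the store itself only ever sees opaque byte 
strings, so any schema has to be encoded into keys and values like this.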

> So, when I say it is implemented "inside" another database, I mean it in the 
> sense that the schema of the repository is not reflected in the schema of its 
> back-end storage, and thus cannot fully utilize the back-end's features to 
> maximum performance.

Can you give a specific example that would help me understand what you mean ?

> I'm not sure what you mean by "hard compiled".  Nothing stops us from having 
> a relational schema that's extensible by parcels, or from doing so 
> dynamically.  In truth, the schemas we use with the repository today are no 
> less "hard compiled".  If we at some future time allow user-defined fields, 
> there are still ways to represent them within such a relatively-static 
> schema, or to simply modify the schema at runtime.

Once you've worked hard at extracting performance from your static schema, so 
that queries and joins are not too massive, every extension calls the whole 
effort into question again. Any plugin developer will have to understand 
this. This was the main reason why we didn't choose this route five years 
ago. Maybe now we don't care about this aspect as much anymore.

For example, in conversations I've had with Grant, he compared Chandler with 
Mail.app and iCal.app, which have such static schemas and can perform much 
better in their specific domains than the more generic Chandler.

If that's the route we'd like to take Chandler down, fine. That should be 
clearly stated. I'm not exactly against it either, just a lot less excited 
about it.

It'd be a different product, albeit with a lot of the same visible 0.7/1.0 
features of today but a dead-end nonetheless. Chandler would only ever do what 
it's hardcoded to do (from a schema standpoint).

The last five years of work would be pretty much wasted, except for their 
"what not to do" aspect :)

>>> 5. implementing generic indexes inside of generic indexes
>> 
>> How so ? What are you thinking about ?
>
> The skip list system is the main one I have in mind, but if I correctly 
> understand how versions and values are stored, then those would be included 
> too.

Yes, a skiplist implements the structure behind repository indexes. What are 
the 'generic indexes' you're talking about, in which skiplists are implemented ?
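
For readers unfamiliar with the structure, here is a minimal textbook skip 
list in Python. It is a generic sketch, not the repository's implementation: 
the class names, MAX_LEVEL and the 0.5 promotion probability are assumptions, 
and it leaves out positional (row number) access and persistence.

```python
import random


class _Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level  # next node at each level


class SkipList:
    MAX_LEVEL = 8

    def __init__(self, seed=0):
        self._head = _Node(None, self.MAX_LEVEL)
        self._level = 1
        self._rng = random.Random(seed)

    def _random_level(self):
        # Promote a node one level up with probability 1/2 each time.
        level = 1
        while level < self.MAX_LEVEL and self._rng.random() < 0.5:
            level += 1
        return level

    def insert(self, key):
        # update[i] is the last node before the insertion point at level i.
        update = [self._head] * self.MAX_LEVEL
        node = self._head
        for i in reversed(range(self._level)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        level = self._random_level()
        self._level = max(self._level, level)
        new = _Node(key, level)
        for i in range(level):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def __contains__(self, key):
        # Descend from the top level, moving right while keys are smaller.
        node = self._head
        for i in reversed(range(self._level)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key

    def __iter__(self):
        # Level 0 links every node in sorted order.
        node = self._head.forward[0]
        while node is not None:
            yield node.key
            node = node.forward[0]


sl = SkipList()
for k in [5, 1, 4, 2, 3]:
    sl.insert(k)
```

Iterating sl yields the keys in sorted order, and membership tests skip over 
runs of nodes by using the higher levels first.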

Andi..
