[Dev] star schemas (Re: RDF triples and Chandler data)
david at treedragon.com
Fri Nov 22 12:32:44 PST 2002
(I normally try to reply only to the list to avoid duplicate emails
for the other person on the list, but I see you get the digest and
replying directly to you will give you faster turnaround.)
patrickdlogan at attbi.com wrote:
> OK. I used to be an AI guy so that makes sense to me.
> Given that contacts and messages are made up largely by volumes of
> fairly regular many-to-many relationships I wonder if a star schema
> model would simplify some things?
I was not familiar with the phrase "star schema" before, so I searched
enough to get a handle on the meaning. Helpful pages were these:
My unfamiliarity with the phrase "star schema" illustrates my distance
from industrial scale database technology and communities. This can be
a useful gauge: if I don't already know about some database term, I
don't expect an average Chandler developer to know it either. I won't
accidentally think everyone knows a lot of arcane database concepts.
I'll try to keep the descriptions of Chandler storage at a level that's
approachable by everyone who knows how to develop code. So I'll resist
veering off toward discussions that focus on database technologies that
require some study and familiarity to follow at all. (This is only a
reassurance to folks reading, and not a challenge to folks who would rather
discuss relational databases at greater length.)
You probably know a fair amount more about large scale databases than
I do if you're the Patrick Logan I associate with Smalltalk and Gemstone.
I know more about the first-principles problems that come up when
using low end ad hoc databases, since issues crop up there which are
avoided entirely by the bigger, formal database approaches.
Sorry for all the caveats and context setting. My goal is to relate
potential database options to folks on the list without talking over
> I am not suggesting it would have to be in the form of a relational
> database, but the model itself seems to apply. I've been infatuated by
> them for the last few years.
Okay, so the star schema model seems appropriate in contexts with lots
of multidimensional data, and this is right when RDF is being used to
model lots of interrelated content with open-ended new attributes.
The star schema seems named for the way tables revolve around the
heart of the star, occupied by a central fact table which contains
the bulk of content describing events of some kind -- an email fits
the bill for "event" in this context. The bulk of database data is
expected to live in the fact tables. In Chandler, for email apps,
the bulk of stored content will be the emails themselves.
Satellite tables around the central fact table describe "dimensions"
and I gather this is just metainformation about the facts.
An RDF predicate is a URI for a first class subject node in its own
right, so each arc in an RDF graph is potentially very heavily
annotated with the meaning, purpose, format, etc. of that particular
attribute of an object when a subject uses that predicate. This
amounts to metainfo about subject predicates. Maybe this corresponds
to info kept in satellite dimension tables in a star schema.
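To make the "predicate as a subject node" idea concrete, here is a rough
Python sketch. The "ex:" URIs, the attribute names, and both helper
functions are made up for illustration; none of this is Chandler or RDF
library code. It only shows that the same identifier can appear as an arc
on one triple and as the subject of other triples that describe it.

```python
# Hypothetical triples: (subject, predicate, object). All names are invented.
triples = [
    # An email (subject) uses the predicates ex:sender and ex:subject.
    ("ex:email/1", "ex:sender", "ex:person/patrick"),
    ("ex:email/1", "ex:subject", "star schemas"),
    # The predicate ex:sender is itself a subject, carrying metainfo.
    ("ex:sender", "ex:label", "Sender"),
    ("ex:sender", "ex:format", "contact reference"),
]


def describe(node, triples):
    """Return {predicate: object} for every triple whose subject is node."""
    return {p: o for (s, p, o) in triples if s == node}


def predicate_metainfo(subject, triples):
    """For each predicate a subject uses, collect what is said about it."""
    return {p: describe(p, triples) for (s, p, _o) in triples if s == subject}


info = predicate_metainfo("ex:email/1", triples)
```

Here info maps each predicate used by the email to whatever annotations
exist for that predicate, which is roughly the role I am guessing a
dimension table might play.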
The star schema modeling approach seems characterized by an attempt
to resist the fetishistic pursuit of database normalization by ER
modeling practitioners, which causes so much data factoring (to reduce
redundancy) that performance is killed by bad locality of reference
when simply duplicating the data wouldn't have much space cost.
In other words, a star schema favors flat tables which duplicate
data, so joins (which are just inferences about related info) happen
faster without having to jump through a long graph of data. When the
flat satellite dimension tables contain little more than metainfo,
they aren't big and it doesn't matter whether space is optimized.
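The tradeoff above can be sketched with plain Python dicts standing in for
tables. Everything here (table names, fields, sample values) is an assumed
toy example, not a real schema. The normalized layout stores each fact once
but needs three lookups to answer a question; the flat layout repeats
strings per sender and answers in one.

```python
# Normalized: every fact stored once, reached through a chain of references.
people = {1: {"name": "Pat", "domain_id": 10}}
domains = {10: {"domain": "example.com", "org_id": 100}}
orgs = {100: {"org": "Example Org"}}

# Flat dimension table: domain and org strings are duplicated per sender.
senders_flat = {1: {"name": "Pat", "domain": "example.com", "org": "Example Org"}}

# The central "fact" rows: two emails from the same sender.
emails = [{"id": 1, "sender_id": 1}, {"id": 2, "sender_id": 1}]


def sender_org_normalized(email):
    """Follow three references to find the sender's organization."""
    person = people[email["sender_id"]]      # hop 1
    domain = domains[person["domain_id"]]    # hop 2
    return orgs[domain["org_id"]]["org"]     # hop 3


def sender_org_flat(email):
    """One lookup, at the cost of duplicated data in the dimension table."""
    return senders_flat[email["sender_id"]]["org"]


results = [(sender_org_normalized(e), sender_org_flat(e)) for e in emails]
```

Both functions give the same answer; the difference is only how many
separate places must be touched to get it.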
What does this mean to folks reading this list who are thinking about
providing database support for Chandler?
If you want to use relational database technology, then you might
put the bulk of app traffic, like email messages, in a giant central
table with a flat structure, and then have a few ancillary tables to
pick up the slack in a few other places. This would be more like
a star schema model, as I understand the term.
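For anyone who wants to see the shape of that, here is a small sketch using
Python's standard sqlite3 module. The table and column names (fact_email,
dim_sender, dim_folder) and the sample rows are assumptions made up for
this example; they are not a proposed Chandler schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Small satellite "dimension" tables.
CREATE TABLE dim_sender (sender_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE dim_folder (folder_id INTEGER PRIMARY KEY, folder_name TEXT);
-- The big central "fact" table: one flat row per email.
CREATE TABLE fact_email (
    email_id  INTEGER PRIMARY KEY,
    sender_id INTEGER REFERENCES dim_sender(sender_id),
    folder_id INTEGER REFERENCES dim_folder(folder_id),
    subject   TEXT,
    body      TEXT
);
""")
conn.executemany(
    "INSERT INTO dim_sender VALUES (?, ?, ?)",
    [(1, "Alice", "alice@example.com"), (2, "Bob", "bob@example.com")],
)
conn.executemany(
    "INSERT INTO dim_folder VALUES (?, ?)", [(1, "Inbox"), (2, "Archive")]
)
conn.executemany(
    "INSERT INTO fact_email VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 1, "hello", "first message"),
        (2, 1, 2, "re: hello", "a reply"),
        (3, 2, 1, "lunch?", "noon works"),
    ],
)

# Every join goes straight from the fact table to a dimension: one hop each.
rows = conn.execute("""
    SELECT s.name, COUNT(*)
    FROM fact_email e
    JOIN dim_sender s ON s.sender_id = e.sender_id
    JOIN dim_folder f ON f.folder_id = e.folder_id
    WHERE f.folder_name = 'Inbox'
    GROUP BY s.name
    ORDER BY s.name
""").fetchall()
```

The query counts Inbox emails per sender. Note that no dimension table
refers to another dimension table, which is what keeps the joins shallow.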
The Chandler interfaces for storage won't be designed specifically
to accommodate relational backends, but I hope folks won't have a
problem plugging in alternative backend systems that use an RDB.
We're going to focus on object databases, but in principle any db
can be used to model another one, modulo performance effects. The
object database approach is likely to involve less complexity for
developers, if only because the whole basis of terminology fits
what normally happens in programming in oo languages.
People unfamiliar with relational technology find it hard to wrap their
heads around what sounds like gibberish without a substantial time
investment in a complex area. And end users without a db admin
find it hard to manage their databases.
(Isn't it really irritating how close the RDF and RDB acronyms are
to each other? I wonder if everyone keeps them separate. I just now
went back and replaced "RDB" with "relational" in a couple places.)
I want folks using relational dbs to succeed with Chandler plugins
for persistent storage in the backend, but basically I don't want
to talk about relational dbs. I'd rather just sit on the sidelines
and nod my head while other folks discuss them. I think it's a great
thing for folks to pursue, and I hope it's not awkward when I don't
seem to pay much attention.
I'd hate for folks to design and implement Chandler so users with
low end databases cannot have either scaling performance or robust
data safety because this was reserved only for using expensive
commercial database technology. I want to make the low end as
good as it can be, and folks who want to buy high end databases
instead are welcome to do so.