[Dev] Re: star schemas

patrickdlogan at attbi.com
Fri Nov 22 14:08:13 PST 2002


> You probably know a fair amount more about large scale databases...

I want to emphasize that applying star schemas in this situation has
more to do with simplicity than volume. Relational database query
planners are not all that smart, so with large volumes it turns out
that simple schemas with predictable query plans give better
performance.

These simple schemas also make sense to developers with little
relational or SQL experience. A simple query language is sufficient.

Moreover, these schemas closely match the groupings and relationships
anyone with some domain understanding would expect. That is, if you
squint just a little, they might be mistaken for "object oriented"
database schemas.

> The star schema seems named for the way tables revolve around the
> heart of the star, occupied by a central fact table which contains
> the bulk of content describing events of some kind -- an email fits
> the bill for "event" in this context.

Yes, you're a quick study.

> An RDF predicate is a URI for a first class subject node in its own
> right, so each arc in an RDF graph is potentially very heavily
> annotated with the meaning, purpose, format, etc about that
> particular attribute of an object when a subject uses that
> predicate.  This amounts to metainfo about subject predicates.
> Maybe this corresponds to info kept in satellite dimension tables in
> a star schema.

It looks like an RDF triple is used to represent all relationships at
an "atomic" level. In a star schema some of these relationships would
be implicit in a dimension, e.g. a contact's name and email address
would be together in a "contacts" dimension. Events that bring
several dimensions together in a many-to-many way would be represented
as a star. The fact table in the center would hold the many-to-many
relationships among the things (dimensions) being related.
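
To make that concrete, here is a rough sketch in Python (purely
illustrative names and data, not anything from Chandler):

    # In RDF, each relationship is an atomic triple:
    triples = [
        ("contact:42", "name",  "Jane Doe"),
        ("contact:42", "email", "jane@example.com"),
    ]

    # In a star schema those attributes live together in one row of a
    # "contacts" dimension, keyed by a surrogate id:
    contacts = {42: {"name": "Jane Doe", "email": "jane@example.com"},
                17: {"name": "John Roe", "email": "john@example.com"}}
    folders  = {7:  {"name": "inbox"}}

    # The fact table in the center holds only the dimension keys, i.e.
    # the many-to-many relationships brought together by one event:
    facts = [{"sender_id": 42, "recipient_id": 17, "folder_id": 7}]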

For email the fact would probably also include the time stamp (since
it's not desirable to represent every instant in time as a row in a
dimension). The subject and other semi-structured information would be
scrubbed, then added to or matched against dimensions. The body of the
message is unique to the specific fact and directly associated with it
1:1. All the
text would be available for text search as opposed to a more
"structured" query.

> The star schema modeling approach seems characterized by an attempt
> to resist the fetishistic pursuit of database normalization by ER
> modeling practitioners, which causes so much data factoring (to reduce
> redundancy) that performance is killed by bad locality of reference
> when simply duplicating the data wouldn't have much space cost.

Yes, having the dimensions denormalized simplifies all kinds of things
for people and systems.
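
For example (hypothetical tables, just to show the trade-off):

    # Normalized: the organization is factored into its own table, so
    # reading a contact's company name costs an extra lookup (a join).
    organizations = {1: {"name": "Example Co", "domain": "example.com"}}
    contacts_normalized = {42: {"name": "Jane Doe", "org_id": 1}}

    # Denormalized dimension: the same values are simply repeated on
    # each contact row, so one row answers the whole question.
    contacts_denormalized = {
        42: {"name": "Jane Doe", "org": "Example Co",
             "org_domain": "example.com"},
    }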

> What does this mean to folks reading this list who are thinking
> about providing database support for Chandler?

My experience with all kinds of databases (relational, OODB, and
variations) tells me a few things:

* Typical relational database architecture originated when a megabyte
  was a lot of space... on disk!
* SQL is limited. Most information, including messages, is based on
  time series and common dimensions.
* OODBs tend to be overkill, or underfunded, or both. Connecting large
  graphs transparently on disk is a great idea but of little practical
  benefit. Most information is much more regular than that. Most
  information that is not so regular can be stored in memory and
  serialized.
* Distributed data with concurrent users is a pain in the... well,
  it's a pain. Most of the time these problems can be boiled down to
  simple coordination mechanisms. All the multi-user locking,
  versioning, etc. in a relational database or OODB is overkill and a
  headache for developers. Simple coordination mechanisms, e.g. tuple
  spaces (or XML-document spaces like Ruple), provide very simple
  building blocks for all kinds of higher-order patterns of
  coordination. And they're easy to build.
* As I wrote above, star schemas do not have to be implemented in a
  full-blown relational database. I've implemented them as in-memory
  collections with a simple query language. Saving to disk and
  transaction logging can be very easy. Make the changes to data as
  tuples in a space, then add them to the star schema as history
  that's (almost) never updated. There's a rough sketch of this after
  the list.
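
Here is a rough sketch of what I mean, in Python (hypothetical names;
the "query language" is just equality matching over facts):

    import json
    from threading import Lock

    class TupleSpace:
        """Minimal coordination mechanism: writers put change tuples,
        a consumer drains them and folds them into the star schema."""
        def __init__(self):
            self._tuples, self._lock = [], Lock()

        def put(self, tup):
            with self._lock:
                self._tuples.append(tup)

        def drain(self):
            with self._lock:
                drained, self._tuples = self._tuples, []
                return drained

    class Star:
        """In-memory star schema: dimensions keyed by id, facts kept
        as append-only history."""
        def __init__(self):
            self.dimensions = {"contacts": {}}
            self.facts = []

        def load(self, space, log):
            # Fold pending changes into the star and append them to a
            # simple transaction log; facts are history, never updated.
            for kind, payload in space.drain():
                if kind == "contact":
                    self.dimensions["contacts"][payload["id"]] = payload
                elif kind == "email":
                    self.facts.append(payload)
                log.write(json.dumps([kind, payload]) + "\n")

        def query(self, **criteria):
            # The "simple query language": equality match on named fields.
            return [f for f in self.facts
                    if all(f.get(k) == v for k, v in criteria.items())]

    # Usage: changes arrive as tuples, get folded in, then get queried.
    space, star = TupleSpace(), Star()
    space.put(("contact", {"id": 42, "name": "Jane"}))
    space.put(("email", {"sender_id": 42, "folder": "inbox"}))
    with open("changes.log", "a") as log:
        star.load(space, log)
    print(star.query(folder="inbox"))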

Anyone who can implement enough of a database for Chandler could
implement these kinds of mechanisms.

> If you want to use relational database technology...

Fortunately I don't, because that technology is overrated and a bad
habit!

> We're going to focus on object databases... if only because the
> whole basis of terminology fits what normally happens in programming
> in oo languages.

An interface between the application and the "data store" will help
the system evolve. At the least, I would recommend keeping the
application independent of any "transparent persistence" mechanism.
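
Something like this minimal sketch is what I have in mind (hypothetical
names, not an actual Chandler API):

    class DataStore:
        """The application codes against this small interface and never
        sees how persistence actually happens."""
        def add_event(self, event):
            raise NotImplementedError

        def find_events(self, **criteria):
            raise NotImplementedError

    class InMemoryStore(DataStore):
        """One possible implementation; a file- or database-backed store
        could be swapped in later without touching application code."""
        def __init__(self):
            self._events = []

        def add_event(self, event):
            self._events.append(dict(event))

        def find_events(self, **criteria):
            return [e for e in self._events
                    if all(e.get(k) == v for k, v in criteria.items())]

    # The application depends only on the DataStore interface.
    store = InMemoryStore()
    store.add_event({"kind": "email", "folder": "inbox"})
    print(store.find_events(folder="inbox"))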

> And end users without a db admin find it hard to manage their
> databases.

"Zero maintenance" should definitely be high on the list. Simple
mechanisms for storage and coordination are keys to this.

> I'd hate for folks to design and implement Chandler so users with
> low end databases cannot have either scaling performance or robust
> data safety...

I think you'd be surprised how well simple things can scale when you
throw away old notions about how relational databases used to be
designed. People have had to work awfully hard to get relational
technology to scale. Now that it (kind of) does, modern hardware and
software obviate the need.


