[Dev] Data outside the box

David McCusker david at osafoundation.org
Sat Nov 23 15:27:23 PST 2002

I still intend to respond to your star schema email from yesterday
which was very good.  The short version is that I'd like to make star
schemas work -- that sounds like a nice target and I don't see how it
would add too many constraints.

patrickdlogan at attbi.com wrote:
>>...an attribute which has no index over a million emails.
> Some questions for "out of the box" thinking...
> What if a database was so efficient that one never had to request
>  for indexes to be built?

Efficient in what sense?  I tend to focus on response latency from the
user's point of view.  We want to allow for the possibility that a
user's content is stored locally on disk, as opposed to assuming it
is always remote and potentially all in memory in some big honking db
that makes all the content available with only network latency, plus
whatever search in memory will cost.

So we're thinking a user might fill some majority of their disk space
with Chandler content, and we want the performance not to really suck
compared to using a small fraction of available secondary storage.

If we take today's disk sizes as a randomly chosen target, we want the
performance to be acceptable with 40 gigabytes of data on disk.

The assumption is that a user's machine has RAM of some size that is
a small fraction of this, which is so small compared the total disk
size, that we needn't distinguish much between machines with only
256MB of RAM and those with 1GB or 2GB of RAM.  They'll both have the
same problems trying to page in content from disk.  More memory helps,
but algorithmically they're similar problems.

Anyway, one of our assumptions is that Chandler content will scale much
larger than available RAM.  If we don't assume that, then we'll design
something that doesn't scale worth a darn, or will only work with a
remote server situation.  Right now both those sould like failure, but
we could always change our priorities.

I like out of the box thinking.  Maybe you have an idea about how to
make low RAM to content ratios efficient without using a scheme that
accesses content in O(1) time (hash tables), or O(logN) time (btrees).

I was assuming btree indexes (ie. tables) were needed to sort content
so a slice of matching collection members could be identified in logN
time we had reasont to perform a string match, such as finding all the
matches for some prefix string typed by the user for autocompletion.

Ah, I get it.  You put emphasis on _requesting_ them to be built, and
are not actually questioning where there will be indexes at all.  You're
assuming there are indexes, but that a user doesn't ask for them.

Yes, that's what we had in mind.  We want a database so flexible that a
user never need pay attention to what indexes are in use.  This means a
db must tolerate indexes (in the form of btree tables) coming and going
dynamically while a user is actively using the application.  Among other
things, this also simplifies recovery from corruption, since we can
throw away and rebuild an index we don't like for some reason.

> What if a database was so flexible that reorganizing it was routine
> and done all the time?

Yes, this is what we want to happen.  We don't want the number and type
of btree indexes to be fixed and unchanging.  We want to be able to
add and remove them without stressing the database.

Of course, indexing a really big data set the first time is necessarily
rather slow, so I'm not sure how effective it is to be really dynamic
with very big user content collections.

Maybe the UI would need to let a user indicate their interest in some
kind of search query well in advance of actually performing a query, to
allow an index to be build when it's necessary.

> How does tradition table implementation affect these issues?

I tend to substitute the word btree for table, since I don't think about
the relationships folks try to represent with static tables.  Maybe the
traditional approach assumes a schema is more static than we want.

> How should an application's data interface be aware of these issues?

An application might be able to indicate which object attributes it would
like to be searchable.  It seems like the API would involve a lot of
negotiation, with an app asking what attributes can be made searchable,
and with how much latency, and with what memory pressure policies to
discard old indexes when total space used becomes too large.

--David McCusker

More information about the Dev mailing list