[Dev] Re: Data outside the box
patrickdlogan at attbi.com
patrickdlogan at attbi.com
Sat Nov 23 20:15:26 PST 2002
> I still intend to respond to your star schema email from yesterday
>(http://lists.osafoundation.org/pipermail/dev/2002-November/000317.html),
>which was very good. The short version is that I'd like to make star
>schemas work -- that sounds like a nice target and I don't see how it
>would add too many constraints.
I'd like to see how this might work out. I'm working on related things
that could potentially plug into Chandler or at least contribute your
ideas, so we could flesh these out in both settings and see what
works.
>> What if a database was so efficient that one never had to request
>> for indexes to be built?
>
>Efficient in what sense?
Efficient in any sense. *Ideally* a "zero maintenance" database would
plug into a small application for a single user as well as for a group
of co-workers and up to "enterprise" scale. That's a tall order but I
think it is possible if older notions of "database architecture" are
ignored.
>I like out of the box thinking. Maybe you have an idea about how to
>make low RAM to content ratios efficient without using a scheme that
>accesses content in O(1) time (hash tables), or O(logN) time
>(btrees).
Here's what I'm thinking: what's the difference between a table and an
index? Well, and index is just for one column typically. So what's the
difference between a table and a set of indexes, one per column?
One answer is: tables are not conducive to working on columns (most
common), so you need the indexes anyway. So if you've got the data in
indexes, what do you need the table for?
>I was assuming btree indexes...
Unfortunately btrees need reorganizing and they have measurable
overhead per datum. Also their code gets fairly complex when you
include deletes. They're also awkward to work with in chunks because
of their tree shape.
Fortunately there is a very clean structure that solves these problems
without any significant bad effects: skip lists. They are list shaped
instead of tree shaped. They're organized using random numbers rather
than by the nature of the data itself. And since they are sequences
rather than trees, they can by "chunked" pretty easily. In fact all
the code for working with skip lists is much easier to comprehend than
for trees.
ftp://ftp.cs.umd.edu/pub/skipLists/skiplists.pdf
>Ah, I get it. You put emphasis on _requesting_ them to be built, and
>are not actually questioning where there will be indexes at all.
>You're assuming there are indexes, but that a user doesn't ask for
>them.
I'm asserting there are indexes but not tables! Columns (indexes) are
coming and going, or portions of them. Tables can and should be wide,
but you don't need to bring an entire row into memory, just the
columns you need. That's essentially what you do with indexes, so why
store data in row format at all when most of the time you want to work
with columns.
What if data were organized by columns rather than rows? They would in
essence *be* indexes.
>Among other things, this also simplifies recovery from corruption,
>since we can throw away and rebuild an index we don't like for some
>reason.
Detecting the corruption is the hard problem. These kinds of problems
are usually silent. I hated shipping products knowing these problems
were in real customer databases. But Oracle did it too, and their QA
was bigger than our entire engineering organization. Another reason
why the code, data, and concurrency models have to be exceedingly
simple for "zero maintenance".
>> What if a database was so flexible that reorganizing it was routine
>> and done all the time?
>
>Yes, this is what we want to happen. We don't want the number and type
>of btree indexes to be fixed and unchanging. We want to be able to
>add and remove them without stressing the database.
What if you could add and remove columns in the model without
affecting the organization of the rest of the table?
>Of course, indexing a really big data set the first time is necessarily
>rather slow, so I'm not sure how effective it is to be really dynamic
>with very big user content collections.
If you've read this far you are realizing that this is a problem only
if you are storing your data in row order *and* column order. If your
column *is* an index, the index is already built and always
maintained.
>An application might be able to indicate which object attributes it would
>like to be searchable. It seems like the API would involve a lot of
>negotiation, with an app asking what attributes can be made searchable,
>and with how much latency, and with what memory pressure policies to
>discard old indexes when total space used becomes too large.
My aim would be every structured element is searchable in a structured
way, and *every* element or meta-element whether structured or not is
searchable in an "unstructured" way.
-Patrick
More information about the Dev
mailing list