[Dev] index requirements (chandler query lang)

David McCusker david at osafoundation.org
Sat Dec 21 02:39:51 PST 2002


I had a phone conversation with Mitch Kapor this evening about query
language requirements -- or more specifically, what must be searchable,
and what this must be fast and what things can be slow.

This note is a rough summary of what we covered.  I've been stewing
over the ramifications, and trying to gauge what kind of space footprint
and code complexity will come with a suitable implementation.  I'll
start with some introductory material that wasn't in our conversation.

Keep in mind that we want a Chandler respository to scale up to fill an
entire typical hard disk today, of around (say) 80GB.  So remarks and
ideas folks have about memory only databases are not applicable.  80GB
is our fixed target for now, and we won't track ongoing disk sizes.

Finding answers to search queries rapidly on very large data sets will
normally require a btree index for each sorted collection supporting
one kind of search.  So we're characterizing our btree index plans.

(Side note on skip lists: I understand they are often faster than btrees
when a tree fits in memory.  But I'm under the impression that btrees
minimize disk access when persistent and _large_ compared to skip lists.)

The space overhead for btree indexes is likely to be very large when
many things are indexed, and many indexes are used.  In fact, indexes
are likely to be larger in total size than actual respository content.
An 80GB repository might be 15GB data, 50GB indexes, and 15GB lost to
fragmentation.  These are rough guesses.  Maybe it will be better.

Okay, now on to requirements as described by Mitch.

The most significant one is a desire to search on everything, so
queries can associate anything with anything else.  This requirement
has a large effect.  I expect it's necessary to index all objects in
several btree indexes to support odd queries out of the blue.  If we
did not, then a search over many gigabytes of data would have quite
a large time latency.

This requirement suggests we might need to support the RDF triples
style of indexes I had been resisting. RDF triples have the following
form: (subject, predicate, object).

In this context, "object" means "direct object in an English sentence"
and not "instance in an object-oriented programming language."  To
avoid this ambiguity, maybe I'll temporarily rename RDF triples using
graph terminology, where source and destination nodes connect with an
arc, so we might write (source, arc, destination) instead.

A subject (or source) is a "thing" with a bunch of attributes, where
each attribute is a predicate (or arc) which points at an associated
value object (or destination).  The destination might be another
fully general subject (source), or just a terminal literal value.

If you like matrix terminology of tables instead, a subject thing is
a row composed of a set of columns (predicates or arcs) with valeus
in each cell (object or destination).

All data can be represented this way, so rows and RDF triples can both
universally encode everything (just not most efficently, of course).

You can see RDF triples as a giant relational table composed of rows
with three columns: (subject, predicate, object).  If you use a btree
for each of these three columns, then all triples are indexed by all
three of the parts of each triple.  Thus you can make any part fixed
search the other parts as variable.  It's very flexible.

However, if _every_ triple goes into each of the three global indexes,
then each index will be really, really big, even if you put nothing
but ID references into a btree entry in an attempt to minizet total
size of a triple btree index. (I'll try to estimate sizes later.)

We might want to require that every database used as a Chandler plugin
support full RDF triple indexing, even if this has an onerous cost.
It's the most fully general way to support searching queries on anything
without needing to linearly scan all data in many gigabytes of data.

However, in addition to these three triple indexes, as an optimization
a database could support additional indexes of subsets of the data,
so that searches would be much faster for certain data sets.  But the
indexes would be redundant in space cost and in insertion time cost
in order to optimize query latency time.

Chandler would hint what collections of objects would normally be
searched, and which properties would be searched often, so a database
could provide btree indexes for these, with a smaller hozizon than all
data in the repository, with attendant speed advantages.

Some collections might be so small that searching them might involve
on-the-fly sorting in memory without ever building a btree index.  We
expect this will be normal for collections under some size.

Chandler will try to identify horizons smaller than all the repository
for some data sets so searching them can be optimized.  For exmample,
email and tasks might only be fully searched the most speedy way for
more recent entries added.  Older entries would still be searchable,
but with slower performance in big respositories.

Documents and (people) contacts, however, might never age with respect
to searching since age is less relevant to onging meaning.

Hmm, I'd better finish this tomorrow because it's getting rather late.
I'll continue this later, and put a contiguous version on my weblong.

-- David McCusker




More information about the Dev mailing list