[Dev] Chandler Query Proposal 0.1
twl at osafoundation.org
Mon Feb 23 22:54:33 PST 2004
On Feb 23, 2004, at 9:10 AM, John Anderson wrote:
> Hi Ted:
> Here are some comments I had after reading your Chandler Query System
> Thinking about how I'd use queries, it seems like most of them would
> be specified in parcel XML, and either be associated with a result set
> or used directly with an iterator. The actual content spec itself
> would be managed by a GUI query widget. So it might be convenient to
> associate an optional result with the query item, rather than being
> just another item. The same might be true of the iterator.
In the examples, I assumed that the result of the QueryItem.execute()
method would be a Python generator, which would produce the query
results in a streaming fashion. Associating that generator/iterator
and the content spec via parcel XML should be doable -- the parcel XML
stuff will need some API to call. That API is the one in the proposal
(or one that's been adapted so that it's suitable for this).
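
A minimal sketch of the streaming behavior I have in mind. The QueryItem
class here is a hypothetical stand-in, and treating the content spec as a
plain Python predicate is an assumption made only for illustration; the
real spec format is still open.

```python
class QueryItem:
    """Hypothetical stand-in for the proposal's query item."""

    def __init__(self, content_spec):
        # Assumption: the content spec is a callable, item -> bool.
        self.content_spec = content_spec

    def execute(self, items):
        # Generator: each match is yielded as soon as it is found,
        # so the caller never waits for the full result set.
        for item in items:
            if self.content_spec(item):
                yield item


notes = [
    {"kind": "note", "title": "a"},
    {"kind": "mail", "title": "b"},
    {"kind": "note", "title": "c"},
]
query = QueryItem(lambda item: item["kind"] == "note")
results = query.execute(notes)
first = next(results)  # scans only as far as the first match
```

Whatever parcel XML ends up calling would hold on to `results` and pull
items from it as needed.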
> It would be fun to brainstorm with you about how the GUI query widget
> would generate the content spec.
I think that this is worth doing because it will probably show some
areas where the syntax/API is inconvenient for creating queries this way.
> I'm a little concerned about the security consequences of executing
> arbitrary Python code in the content spec.
I think that I worded this a little too carelessly. What I really
meant was that you'd like to be able to invoke the methods of the class
for an item as part of the query.
> What is the sort order of the items in the results and who's
> responsible for maintaining it?
The default sort order is undefined at the moment, so we can make it
whatever we want. Beyond that, we'll want a way to specify how the
results are to be sorted. Initially, this will be done by the query
retrieval. Do we want this to be maintained even as items enter/exit
the query set as items are updated in the repository?
> What does attribute traversal mean?
Accessing the attributes of an item as part of the query. For example,
the "first name of x's spouse", where spouse is an attribute of item x
and first name is an attribute of x's spouse.
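
In plain Python terms, traversal amounts to chained attribute access. A
rough sketch, where the dotted-path notation, the traverse() helper and
the attribute names (spouse, firstName) are all made up for illustration
and are not part of the proposal's syntax:

```python
from functools import reduce
from types import SimpleNamespace


def traverse(item, path):
    """Follow a dotted attribute path such as 'spouse.firstName'."""
    return reduce(getattr, path.split("."), item)


# Stand-in items; real Chandler items would come from the repository.
spouse = SimpleNamespace(firstName="Pat")
x = SimpleNamespace(firstName="Lee", spouse=spouse)

name = traverse(x, "spouse.firstName")
```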
> I'm concerned about efficiency. I suspect in real life 95% of actual
> data in Chandler will end up being e-mail (typically all the e-mail I
> ever receive except for my spam) I also expect this might end up being
> tens or hundreds of gigabytes over the life of a successful Chandler
> product. If all of these e-mails are in a single ref collection,
> queries might not be fast enough.
This is a possibility, and for a situation where we can only search the
collection that contains all e-mails, I'm not sure what we can do. The
best algorithms/data structures for indexing are O(log2 n) so that's
the theoretical limit for how well we can do. Even then we'll have to
index those hundreds of gigabytes.
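
To make the logarithmic lookup concrete, here is a sketch of a sorted
in-memory index searched by bisection. SortedIndex is a hypothetical
class, not the repository's actual index structure, and it ignores the
cost of building and storing the index in the first place.

```python
import bisect


class SortedIndex:
    """Hypothetical index: (key, item_id) pairs kept in sorted order."""

    def __init__(self, pairs):
        self._entries = sorted(pairs)
        self._keys = [key for key, _ in self._entries]

    def lookup(self, key):
        # Two binary searches locate the run of equal keys,
        # so finding it takes O(log n) comparisons.
        lo = bisect.bisect_left(self._keys, key)
        hi = bisect.bisect_right(self._keys, key)
        return [item_id for _, item_id in self._entries[lo:hi]]


index = SortedIndex([("ted", 3), ("john", 1), ("ted", 2)])
matches = index.lookup("ted")
```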
> Also, is it possible to do queries across several ref collections
Yes. The cost of a query across several collections is the sum of the
costs for each collection. Each of those in turn depends on whether
the data in that particular collection is indexed or not.
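
A sketch of why the costs add up: each collection is searched on its own
and the per-collection result streams are concatenated. The helper names
are hypothetical, and every collection here is scanned without an index.

```python
from itertools import chain


def query_collection(collection, predicate):
    # Unindexed case: a linear scan of one collection.
    for item in collection:
        if predicate(item):
            yield item


def query_all(collections, predicate):
    # Total work is the sum of the work done per collection.
    return chain.from_iterable(
        query_collection(c, predicate) for c in collections
    )


inbox = [1, 8, 3]
archive = [10, 2]
hits = list(query_all([inbox, archive], lambda n: n > 5))
```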
> Is there a way for the program using queries to give hints about what
> indexes should be built?
Right now the model is that indexing is done independently of querying
and that the query system will know which indexes exist at any given
moment and be clever enough to use them when they are present. It may
also turn out to be beneficial to build indices based on repeated runs
of particular queries. If you really feel that you need to control
which indices are to be used for a particular query, then we could
probably add an API that allows you to hint that a particular access
method (index) be used against a particular reference collection.
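
A sketch of the "use an index when one exists" behavior, assuming
indexes are exposed to the query code as a mapping from attribute name
to a value lookup table. That layout and the run_query() name are
assumptions for illustration only; a hint API is not modeled here.

```python
def run_query(collection, attribute, value, indexes):
    """Use an index on `attribute` if present, otherwise scan."""
    index = indexes.get(attribute)
    if index is not None:
        # Indexed path: direct lookup, no scan of the collection.
        return list(index.get(value, []))
    # Fallback: linear scan over every item.
    return [item for item in collection if item.get(attribute) == value]


mail = [{"sender": "john", "id": 1}, {"sender": "ted", "id": 2}]

# Build a by-sender index independently of any query.
by_sender = {}
for message in mail:
    by_sender.setdefault(message["sender"], []).append(message)

indexed = run_query(mail, "sender", "ted", {"sender": by_sender})
scanned = run_query(mail, "sender", "ted", {})
```

Both calls return the same items; only the access path differs.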
> What is this syntax for doing Lucene fulltext indexing, where extra
> arguments like distance between words, or maybe language needs to be
That syntax as well as the syntax for comparison, arithmetic, dates,
etc was not spelled out. It will need to be in order to build the
system for real. Most of these are not difficult and should be
unsurprising. The text indexing is the exception -- that's probably
something for the next round of additions.
Ted Leung Open Source Applications Foundation (OSAF)
PGP Fingerprint: 1003 7870 251F FA71 A59A CEE3 BEBA 2B87 F5FC 4B42