[Dev] Chandler Query Proposal 0.1

Ted Leung twl at osafoundation.org
Mon Feb 23 22:54:33 PST 2004


On Feb 23, 2004, at 9:10 AM, John Anderson wrote:

> Hi Ted:
>
> Here are some comments I had after reading your Chandler Query System 
> document:
>
> Thinking about how I'd use queries, it seems like most of them would 
> be specified in parcel XML, and either be associated with a result set 
> or used directly with an iterator.  The actual content spec itself 
> would be managed by a GUI query widget.  So it might be convenient to 
> associate an optional result with the query item, rather than being 
> just another item.  The same might be true of the iterator.
>

In the examples, I assumed that the result of the QueryItem.execute() 
method would be a Python generator, which would produce the query 
results in a streaming fashion.  Associating that generator/iterator 
and the content spec via parcel XML should be doable -- the parcel XML 
stuff will need some API to call.  That API is the one in the proposal 
(or one that's been adapted so that it's suitable for this).
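
To make the streaming idea concrete, here is a minimal sketch.  Only the 
QueryItem.execute() name comes from the proposal; the constructor and the 
use of a plain callable as the content spec are assumptions made up for 
this illustration.

```python
# Hypothetical sketch: the constructor and the predicate-based content spec
# are assumptions for illustration, not the proposal's actual API.
class QueryItem:
    def __init__(self, items, predicate):
        self.items = items          # any iterable of candidate items
        self.predicate = predicate  # callable standing in for the content spec

    def execute(self):
        """Yield matching items one at a time instead of building a list."""
        for item in self.items:
            if self.predicate(item):
                yield item


query = QueryItem(range(10), lambda n: n % 2 == 0)
results = query.execute()   # a generator; nothing has been evaluated yet
first = next(results)       # pulls only as far as the first match
rest = list(results)        # drains the remaining matches
```

Because execute() is a generator, a caller that only wants the first few 
results never pays for scanning the rest of the collection.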

> It would be fun to brainstorm with you about how the GUI query widget 
> would generate the content spec.

I think that this is worth doing because it will probably show some 
areas where the syntax/API is inconvenient for creating queries this 
way.

>
> I'm a little concerned about the security consequences of executing 
> arbitrary Python code in the content spec.

I think that I worded this a little too carelessly.  What I really 
meant was that you'd like to be able to invoke the methods of the class 
for an item as part of the query.

> What is the sort order of the items in the results and who's 
> responsible for maintaining it?

The default sort order is undefined at the moment, so we can make it 
whatever we want.  Beyond that, we'll want a way to specify how the 
results are to be sorted.  Initially, the sorting will be done when the 
query results are retrieved.  Do we want the order to be maintained as 
items enter and leave the result set when the repository is updated?

>
> What does attribute traversal mean?

Accessing the attributes of an item as part of the query.  For example, 
"the first name of x's spouse", where spouse is an attribute of item x 
and first name is an attribute of x's spouse.
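
A rough sketch of the spouse example in Python.  The Item class, the 
traverse() helper, the dotted-path notation, and the attribute name 
firstName are all hypothetical, chosen only to illustrate the idea; the 
query syntax itself has not been spelled out.

```python
from functools import reduce


class Item:
    """Stand-in for a repository item; attributes are set from keywords."""
    def __init__(self, **attrs):
        self.__dict__.update(attrs)


def traverse(item, path):
    """Follow a dotted attribute path such as 'spouse.firstName'."""
    return reduce(getattr, path.split("."), item)


pat = Item(firstName="Pat")
x = Item(firstName="Chris", spouse=pat)
spouse_first_name = traverse(x, "spouse.firstName")
```

Each step in the path is one attribute access, so a query that traverses 
"spouse.firstName" touches two items: x and x's spouse.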

>
> I'm concerned about efficiency.  I suspect in real life 95% of actual 
> data in Chandler will end up being e-mail (typically all the e-mail I 
> ever receive except for my spam) I also expect this might end up being 
> tens or hundreds of gigabytes over the life of a successful Chandler 
> product.  If all of these e-mails are in a single ref collection, 
> queries might not be fast enough.

This is a possibility, and for a situation where we can only search the 
collection that contains all e-mails, I'm not sure what we can do.  The 
best algorithms/data structures for indexing are O(log n) per lookup, so 
that's the theoretical limit for how well we can do.  Even then we'll 
have to index those hundreds of gigabytes.
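
To illustrate the trade-off, here is a toy in-memory index built on 
binary search.  It is only a sketch: SortedIndex and its methods are 
hypothetical names, and a real repository index would live on disk 
rather than in a Python list.

```python
import bisect


class SortedIndex:
    """Toy index: keys are kept sorted so lookups can use binary search."""

    def __init__(self, items, key):
        ordered = sorted(items, key=key)  # O(n log n), paid once at build time
        self._keys = [key(item) for item in ordered]
        self._items = ordered

    def lookup(self, value):
        """Return all items whose key equals value."""
        lo = bisect.bisect_left(self._keys, value)   # O(log n)
        hi = bisect.bisect_right(self._keys, value)  # O(log n)
        return self._items[lo:hi]


emails = [
    {"sender": "john", "subject": "query comments"},
    {"sender": "ted", "subject": "re: query comments"},
    {"sender": "john", "subject": "indexes"},
]
by_sender = SortedIndex(emails, key=lambda e: e["sender"])
from_john = by_sender.lookup("john")
```

The lookup is logarithmic, but the build step still has to read every 
item once, which is the cost of indexing the whole collection.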

>
> Also, is it possible to do queries across several ref collections 
> efficiently?

The cost of a query across several collections is the sum of the costs 
of querying each collection.  That cost in turn depends on whether the 
data in a particular collection is indexed or not.
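
For the unindexed case, a query over several collections amounts to 
scanning them one after another.  A minimal sketch, with a made-up 
query_collections() helper and plain lists standing in for ref 
collections:

```python
from itertools import chain


def query_collections(collections, predicate):
    """Lazily apply one predicate to several collections in sequence."""
    return (item for item in chain.from_iterable(collections)
            if predicate(item))


inbox = ["spam offer", "meeting notes", "query proposal"]
archive = ["old query draft", "receipts"]
matches = list(query_collections([inbox, archive], lambda s: "query" in s))
```

Every item in every collection is visited once, so the total work is the 
sum of the per-collection scans.  An index on any one collection would 
reduce only that collection's share of the total.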

>
> Is there a way for the program using queries to give hints about what 
> indexes should be built?

Right now the model is that indexing is done independently of querying 
and that the query system will know which indexes exist at any given 
moment and be clever enough to use them when they are present.   It may 
also turn out to be beneficial to build indices based on repeated runs 
of particular queries.   If you really feel that you need to control 
which indices are to be used for a particular query, then we could 
probably add an API that allows you to hint that a particular access 
method (index) be used against a particular reference collection.

>
> What is this syntax for doing Lucene fulltext indexing, where extra 
> arguments like distance between words, or maybe language needs to be 
> specified?
>

That syntax, as well as the syntax for comparison, arithmetic, dates, 
etc., was not spelled out.  It will need to be in order to build the 
system for real.  Most of these are not difficult and should be 
unsurprising.  The text indexing is the exception -- that's probably 
something for the next round of additions.

> John
----
Ted Leung                 Open Source Applications Foundation (OSAF)
PGP Fingerprint: 1003 7870 251F FA71 A59A  CEE3 BEBA 2B87 F5FC 4B42