[Dev] Chandler Query Proposal 0.1
Ted Leung
twl at osafoundation.org
Mon Feb 23 23:22:58 PST 2004
On Feb 23, 2004, at 11:18 AM, Katie Capps Parlante wrote:
> Hi Ted,
>
> Sorry it's taking so long to get you feedback on your proposal. :)
>
> I'll start with feedback on the requirements...
>
>> 1. Queries must operate on all items in the repository and on
>> subsets
>> of items in the repository (including other query results)
> My understanding is that the direction we are going in is to try to
> run queries on subsets of items most of the time, instead of running
> across all items in the repository. My understanding is that a query
> over *all* items in the repository may be slow. I think we can live
> with this and try to model the "subsets" in most cases.
Okay, this is what I thought, but John's comments about searching all
e-mail in the repository were making me nervous.
>
>> 3. Queries must be Items and stored in the repository
> As Chao asked about this in his feedback, I'll give a bit of
> explanation as to why this is a requirement. A CPIA block or view is
> an item that describes both how data is viewed as well as what data is
> viewed. The block or view is an item, and persists in the repository.
> Views in particular (as items) will be a common unit of sharing. To
> store everything we need to know about a block or view, we need to
> store layout information as well as information about what data is in
> the view: a query (or ItemCollection, which has a query as an
> attribute). We've called this description of what data shows up in the
> view a "content-spec". Clarifying the relationship between
> ItemCollections, content-specs and queries is one of our main design
> goals in designing queries.
Great, that actually helps me understand the rationale better.
>
>> 8. It must be possible to determine the result set size of the
>> previous execution of a query (this really means query over a
>> particular collection(s))
> This requirement seems like one of the more contentious requirements.
> In a perfect world, the application knows the size of the result set
> and can make use of it in the api. In a perfect world, the
> implementation doesn't have to know the size of the result set so that
> it can be efficient. Perhaps what we're looking for is some way of
> getting a hint to the ui about the size of the result set if possible.
I totally appreciate the desire for sizing information as a way to
improve the UI. Here are the two tradeoffs that I see.
1. We can design the system to start streaming items in the result set
as soon as they are available. Since we don't know how many results
there are (like how many people live in Oakland, CA) until the query
finishes, we don't have the size information as the client starts
consuming items from the generator. The advantage of this approach is
that items are available for display immediately. We can also give you
a rough hint of size by telling you something about the size of the
root collection (e.g. there are 400,000 contacts, so the result will be
smaller than that. In the case where we have computed selectivity
statistics (something for down the line) we might be able to do even
better -- the selectivity of this query predicate is 40%, so 40% of
400,000 is 160,000, which would be the estimate of the result size).
2. We can design the system to wait until the entire query result set
has been computed, in which case we can give the precise size of the
result. The disadvantage is that you have to wait till the entire
query result is computed. In the case of large e-mail boxes, you might
be waiting a while.
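
To make the tradeoff concrete, here is a minimal Python sketch of the
two options plus the size hint. The function names (stream_results,
complete_results, estimate_size) and the in-memory list of contacts are
illustrative assumptions, not part of the repository API.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def stream_results(items: Iterable[dict],
                   predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Option 1: yield each match as soon as it is found; size unknown up front."""
    for item in items:
        if predicate(item):
            yield item


def complete_results(items: Iterable[dict],
                     predicate: Callable[[dict], bool]) -> Tuple[List[dict], int]:
    """Option 2: compute the whole result first, then report its exact size."""
    results = [item for item in items if predicate(item)]
    return results, len(results)


def estimate_size(root_size: int, selectivity: float = 1.0) -> int:
    """Size hint: root collection size scaled by an (assumed) selectivity."""
    return round(root_size * selectivity)


# Tiny made-up data set: 4 of the 10 contacts are in Oakland.
contacts = [{"name": "c%d" % n, "city": "Oakland" if n % 5 < 2 else "Elsewhere"}
            for n in range(10)]


def in_oakland(contact: dict) -> bool:
    return contact["city"] == "Oakland"


first = next(stream_results(contacts, in_oakland))       # available right away
results, exact = complete_results(contacts, in_oakland)  # available only at the end
hint = estimate_size(400000, 0.40)                        # the 40% example above
```

With no selectivity statistics, estimate_size(400000) just returns the
root collection size, which is the upper bound mentioned in option 1.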
>
>> 11. Queries must allow execution of Python code as part of
>> content/itemSpec
> I'm curious about the background of this requirement. I would think
> this is kind of a no-no from a security point of view. I don't think
> the application will need this functionality.
As I said in my reply to John, this is probably too broadly worded. I
do think that it would be useful to be able to invoke methods on an
Item's class from a query, though.
>
>> 15. It should be easy to compose queries programmatically to
>> facilitate implementation of gui or natural language query
>> builders
> Andy Hertzfeld had a prototype of how this might be done in Vista.
> While Chandler might end up being very different, it's perhaps worth
> checking it out.
Andy gave me a demo of Vista. It's reminiscent of what is called Query
By Example in the database research literature.
>
>> One open question is whether or not Reference Collections should be
>> stand alone items with their own kind. Doing this would shorten some
>> of the code involving queries and present a data model more
>> consistent with the philosophy that everything is an Item.
>
> I'm not sure if you are proposing that we add the kinds described
> below, if this is still an open choice that we have to make, or if you
> are describing this as an alternative that we might do but proposing
> something else. If you see this as still an open question, perhaps it
> would be useful to map out the two options, what the tradeoffs are,
> what examples look like for each option. Or perhaps I'm just
> confused...
I was proposing that we add the kinds.
>
>> Here is a short list of open issues regarding queries
>
> Off the cuff answers, but I haven't really thought these through...
>
>> 1. Order preservation -- esp since all our "sets" are really lists
> Yes, assuming sorting is also one of the features.
Sorting needs to be one of the features.
>
>> 2. Do we need a Join like operation?
> Perhaps, but I don't have a use case at the moment.
Joining is done a lot in relational settings to assemble/connect
relationships. Since we have bi-directional relationships, some number
of use cases will be covered by bi-directional refs. Join is usually
a part of most complete query algebras, so I'm a little reluctant to
exclude it, but perhaps it goes way down on the list of implementation
priorities/schedule.
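
As a rough illustration of why bi-directional refs cover many join use
cases, here is a small Python sketch. The Project and Message classes
and the project_id key are hypothetical stand-ins, not the Chandler
content model.

```python
# Relational style: two flat collections connected by a shared key.
projects = [{"name": "Foo", "id": 1}, {"name": "Bar", "id": 2}]
messages = [
    {"subject": "status", "project_id": 1},
    {"subject": "lunch", "project_id": None},
    {"subject": "design", "project_id": 2},
]

# A join assembles the relationship at query time.
joined = [
    (m["subject"], p["name"])
    for m in messages
    for p in projects
    if m["project_id"] == p["id"]
]


class Project:
    def __init__(self, name):
        self.name = name
        self.messages = []      # one end of the bi-directional ref


class Message:
    def __init__(self, subject, project=None):
        self.subject = subject
        self.project = project  # other end of the bi-directional ref
        if project is not None:
            project.messages.append(self)


foo = Project("Foo")
Message("status", foo)
Message("lunch")

# With bi-directional refs there is nothing to join: follow the reference.
foo_subjects = [m.subject for m in foo.messages]
```

A join would still be needed when two collections are related only by
matching values and no ref has been modeled between them.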
>
>> 3. Do queries operate on literal collections?
> No, don't think this is a must have feature.
>
>> 4. How do queries fit in as attributes -- do we want "query valued
>> attributes"?
> Assuming that we have a Kind (Query), then it should work just like
> any other attribute that might have a Kind as a value.
Ok.
>
>> Queries
>> This section gives examples of how a Chandler developer would
>> interact with the query system. I assume that the variable |rep| is
>> an instance of the repository. As an example, we use the request "all
>> e-mail messages that have been marked up as being in the Foo
>> project". I've also assumed that we are not introducing a Collection
>> Kind, which implies that we need a Kind for persistent query results,
>> which I'm calling ResultSetItem. This is so that I can show the
>> amount of code required under this approach.
>
> As mentioned above, I'd be interested to see what it might look like
> if we did introduce a Collection Kind.
>
>> First we need to set up a query item:
>> | inbox = rep.find("//userdata/roots/email/inbox")
>> query = QueryItem()
>> query.input = inbox.contents
>> query.itemSpec = 'FOR i IN $1 WHERE "Foo" in i.Projects' |
>
> This first example has an assumption in it, that we are reflecting
> things like "email/inbox" as a container in the repository. We agreed
> earlier that we don't actually want to use containers as having any
> sort of semantics like this, as it will cause us problems down the
> road. For example, an email message might be in several collections,
> and an item can be in only one container. I'm still unclear on if/when
> we should use repository containers as "subsets" to query over, and
> what this means for the organization of the repository (from the
> perspective of the application). Right now, "userdata/contentitems" is
> meant to be one big pool of items (calendar events, contacts, etc.), a
> data-soup if you will. I assumed some sort of ItemCollection would be
> the "subset" that the query operated against.
Actually, inbox (//userdata/roots/email/inbox) is an item that has a ref
collection attribute called contents. It's that ref collection that is
the query input (the line is query.input = inbox.contents). So all that
is happening here is using find to find the item that has the
collection (I think).
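
A rough sketch of the distinction, using stand-in classes rather than
the real repository: Item, Repository, and this QueryItem are
hypothetical, and the query is written as a Python predicate instead of
an itemSpec string.

```python
class Item:
    """Stand-in for a repository item; attributes come from keyword arguments."""
    def __init__(self, **attrs):
        self.__dict__.update(attrs)


class Repository:
    """Stand-in repository that maps paths to items."""
    def __init__(self):
        self._items = {}

    def add(self, path, item):
        self._items[path] = item

    def find(self, path):
        return self._items.get(path)


class QueryItem(Item):
    """Stand-in query: filters whatever collection is assigned to `input`."""
    def execute(self):
        return [i for i in self.input if self.predicate(i)]


rep = Repository()
msg1 = Item(subject="status", Projects=["Foo"])
msg2 = Item(subject="lunch", Projects=[])

# The inbox is an ordinary item; its `contents` attribute is the ref collection.
rep.add("//userdata/roots/email/inbox", Item(contents=[msg1, msg2]))

inbox = rep.find("//userdata/roots/email/inbox")
query = QueryItem(predicate=lambda i: "Foo" in i.Projects)
query.input = inbox.contents   # the collection, not the path, is the input
matches = query.execute()
```

Nothing in the query depends on where the inbox item lives; any item
with a suitable ref collection could supply the input.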
>
>> All inbox email items that were sent in the last month/day/week
>> |FOR i in inbox.contents WHERE i.dateSent < 1 month/1 day/1 week|
>> note easy notation for date literals - may be wishful thinking
>
> Dates deserve more careful thought all around, we've been putting off
> the heavy lifting so far. :) If a shared calendar is a goal for 0.4,
> then we need to dig into the date issues soonish.
Yep.
>
>> All inbox email items that were sent by a particular person
>> |FOR i in inbox.contents WHERE i.fromAddress == <Joe'sContactItem>???|
>> May need fuzzy match or alias based equality to do a good job here.
>> mail schema is buggy because reply-address != From:/Sender
>
> This case is an interesting one. The user might just want all email
> items that Joe sent. Joe may have several email addresses. Joe might
> also have email addresses that are no longer valid, but have been used
> in the past (we are planning on tracking "old" email addresses). We'd
> like to be able to write a query that finds email from any of these
> email addresses. (I'll take a look at the buggy mail schema -- not
> surprising as we have done nothing to exercise it other than generate
> documentation).
I think it's more a question of how we're going to model all those
e-mail addresses in the content model. I sort of thought that Aliases
might be a partial solution. I suppose you could have a ref collection
of aliases called oldEmailAddresses or something like that.
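
For illustration, a minimal Python sketch of matching mail against every
address a contact has used. The Contact class, the dictionary-based
messages, and the case-insensitive comparison are assumptions made for
the example; only the oldEmailAddresses name comes from the idea above.

```python
class Contact:
    """Hypothetical contact with a current address and a list of old ones."""
    def __init__(self, name, emailAddress, oldEmailAddresses=()):
        self.name = name
        self.emailAddress = emailAddress
        self.oldEmailAddresses = list(oldEmailAddresses)

    def all_addresses(self):
        # Compare case-insensitively; real address matching may need more care.
        return {a.lower() for a in [self.emailAddress] + self.oldEmailAddresses}


def sent_by(messages, contact):
    """Return messages whose fromAddress matches any address of the contact."""
    addresses = contact.all_addresses()
    return [m for m in messages if m["fromAddress"].lower() in addresses]


joe = Contact("Joe", "joe@example.com", ["joe@old.example.org"])
messages = [
    {"subject": "new", "fromAddress": "joe@example.com"},
    {"subject": "old", "fromAddress": "Joe@old.example.org"},
    {"subject": "other", "fromAddress": "ann@example.com"},
]
from_joe = sent_by(messages, joe)
```

If the content model exposes the old addresses this way, the query only
needs equality against a set rather than a fuzzy match.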
>
>> All inbox email items with the word "Bar" in the body
>> |FOR i in inbox.contents WHERE "Bar" in i.messageBody|
>> Note inference of full text indexing!
>
> Cool. :)
We need to define more of the full text indexing syntax. My intent is
to borrow wholesale from Lucene (which will make it easy to pass
through to pylucene) unless lots of people think the Lucene search
language is unusable (I need to go look at this more).
>
>> All inbox email items with "Bar" in either the subject, the body, or
>> any of the to or from fields (or in the names of the Contacts
>> associated with the email addresses in any of the to or from fields)
>> |FOR i in inbox.contents WHERE "Bar" in i.subject OR "Bar" in
>> i.messageBody OR "Bar" in i.toAddress OR "Bar" in i.replyAddress OR
>> "Bar" in i.toAddress.emailAddress|
>> not clear enough how to specify in schema.
>> Not clear when to use full text (or if able due to schema)
>
> We can adjust the schema as needed, it is very far from "baked".
Ok.
>
>> All content items that have some relationship to "today"
>> What does relationship to "today" mean? -- any date in any attribute?
>> ||
>
> Assume that the application knows what "today" is. One way of handling
> this is that all date attributes might be subattributes of an
> attribute "Date". We might look for a match with today in all such
> subattributes. I'd think it's probably possible to find some way to
> know the list of attributes we care about, not any date in any
> attribute.
Ok, this sounds like a content modeling issue then, not an issue of query
language expressiveness.
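
A small Python sketch of that reading, assuming the content model can
hand the query a known list of date attributes. The DATE_ATTRIBUTES
tuple, the attribute names in it, and the dictionary-based items are
made up for the example.

```python
import datetime

# Assumed list of date-valued attributes published by the content model.
DATE_ATTRIBUTES = ("dateSent", "startTime", "dueDate")


def related_to_day(items, day):
    """Return items having any listed date attribute that falls on `day`."""
    matches = []
    for item in items:
        for name in DATE_ATTRIBUTES:
            value = item.get(name)
            if isinstance(value, datetime.datetime):
                value = value.date()   # ignore the time of day
            if value == day:
                matches.append(item)
                break                  # one matching attribute is enough
    return matches


today = datetime.date(2004, 2, 23)
items = [
    {"title": "mail", "dateSent": datetime.datetime(2004, 2, 23, 11, 18)},
    {"title": "event", "startTime": datetime.datetime(2004, 2, 24, 9, 0)},
    {"title": "task", "dueDate": datetime.date(2004, 2, 23)},
    # "modified" is not in DATE_ATTRIBUTES, so this item is not matched.
    {"title": "note", "modified": datetime.date(2004, 2, 23)},
]
todays_items = related_to_day(items, today)
```

The query itself stays simple; the work is in deciding which attributes
belong in the list.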
>
>> Any content item with "Foo" in any attribute-value
>> | items = rep.find('//userdata/contentitems')
>> FOR i in items.contents WHERE ???? |
>> Okay, this is a tough one
>
> Yup, we should talk with the design team about whether or not this is
> really necessary.
>
> A couple of next steps that we might do:
> + Revisit the content model with this proposal in mind, and put some
> energy into the examples based on the content model. Basically, adjust
> the content model to do-the-right-thing, and see if that works out
> nicely.
> + Revisit the blocks model with this proposal in mind, to see if it
> works out nicely. Would be best if done in conjunction with revisiting
> the content model.
> + Go figure out how we're going to handle dates, in particular for the
> calendar content model.
> + Explore what it would mean to have a Collection Kind.
> + A proposal for the python api for building a query
> + Prototype a query builder (perhaps just on paper) to really flesh
> out those issues.
> + Add sorting to the requirements?
All of these sound very reasonable. Let's plan to discuss these in
tomorrow's meeting.
----
Ted Leung Open Source Applications Foundation (OSAF)
PGP Fingerprint: 1003 7870 251F FA71 A59A CEE3 BEBA 2B87 F5FC 4B42