[Dev] Chandler Query Proposal 0.1

John Anderson john at osafoundation.org
Mon Feb 23 09:10:11 PST 2004


Hi Ted:

Here are some comments I had after reading your Chandler Query System 
document:

Thinking about how I'd use queries, it seems like most of them would be 
specified in parcel XML, and either be associated with a result set or 
used directly with an iterator.  The actual content spec itself would be 
managed by a GUI query widget.  So it might be convenient to associate 
an optional result with the query item, rather than being just another 
item.  The same might be true of the iterator.

It would be fun to brainstorm with you about how the GUI query widget 
would generate the content spec.

I'm a little concerned about the security consequences of executing 
arbitrary Python code in the content spec.

What is the sort order of the items in the results and who's responsible 
for maintaining it?

What does attribute traversal mean?

I'm concerned about efficiency.  I suspect in real life 95% of actual 
data in Chandler will end up being e-mail (typically all the e-mail I 
ever receive except for my spam) I also expect this might end up being 
tens or hundreds of gigabytes over the life of a successful Chandler 
product.  If all of these e-mails are in a single ref collection, 
queries might not be fast enough.

Also, is it possible to do queries across several ref collections 
efficiently?

Is there a way for the program using queries to give hints about what 
indexes should be built?

What is this syntax for doing Lucene fulltext indexing, where extra 
arguments like distance between words, or maybe language needs to be 
specified?

John

Ted Leung wrote:

> Hi John,
>
> Do you think you'll have time to give me some feedback this week?  Or  
> are you still grappling with 0.3 issues?
>
> Thanks,
>
> Ted
>
> On Feb 2, 2004, at 12:23 PM, John Anderson wrote:
>
>> Hi Ted:
>>
>> I've only had time to skim your proposal, since I've been racing to  
>> get to a deadline, so I don't have very many comments yet. But I was  
>> curious to know if in your "inbox" example below, it was a reference  
>> collection? If so, Mitch has described his idea of an inbox a little  
>> differently in the past, as that subset of all email (which maybe is  
>> the reference collection) that has the attribute read=False. I 
>> realise  you could probably organize your query this way instead, but 
>> if all  email ends up being a reference collection, does it scale to 
>> all the  email I would ever get in my lifetime?
>>
>> John
>>
>> P.S.
>>
>> Ted Leung wrote:
>>
>>> Hi Folks, This message contains a rough first cut at a proposal for  
>>> the Chandler Query system. I wanted to get some ideas and feedback  
>>> before going much further with this. The focus is mostly on how the  
>>> query system integrates with Chandler's notions of Items and  
>>> reference collections. I've described the text string based version  
>>> of the query language, leaving an API based version for a 
>>> subsequent  revision. I've also posted the document to the Chandler 
>>> wiki:  http://wiki.osafoundation.org/twiki/bin/view/Chandler/  
>>> ChandlerQuerySystem for those who are more comfortable making  
>>> comments on a Wiki page. Please feel free to use whichever 
>>> mechanism  is more comfortable for you Ted
>>>
>>>
>>>   Chandler Query Proposal 0.1
>>>
>>>
>>>     Goal
>>>
>>> Produce a subsystem that allows developers to request a set of 
>>> items  that match a criteria. This request is called a query. The 
>>> query is  made against a set of items.
>>>
>>>
>>>     Requirements
>>>
>>>    1. Queries must operate on all items in the repository and on
>>>       subsets of items in the repository (including other query  
>>> results)
>>>    2. Queries must operation on Kinds and other metadata as well as
>>>       "regular" items
>>>    3. Queries must be Items and stored in the repository
>>>    4. Queries produce sets of items satisfying a boolean condition
>>>    5. Queries specify what the result is, not how to produce it
>>>    6. Query language must be closed - results of queries must be  
>>> queryable
>>>    7. Queries support set theoretic operations like Union,
>>>       Intersection, Difference
>>>    8. It must be possible to determine the result set size of the
>>>       previous exeuction of a query (this really means query over a
>>>       particular collection(s)
>>>    9. Queries must allow attribute traversal
>>>   10. There should be a way to express an extensional (explicitly
>>>       defined - without a rule) set as a query
>>>   11. Queries must allow execution of Python code as part of
>>>       content/itemSpec
>>>   12. Query results should be produced incrementally - see Python
>>>       itertools
>>>        
>>> <http://www.python.org/doc/2.3.3/lib/module-%20%0Aitertools.html>
>>>   13. App can be notified when items enter/exit a result set
>>>   14. There should be string based and python based API methods for
>>>       constructing queries
>>>   15. It should be easy to compose queries programmatically to
>>>       facilitate implementation of gui or natural language query  
>>> builders
>>>   16. Query results can be either transient or persistent
>>>   17. Query capability should not be tied to User Level
>>>       Collections/ItemCollections/Playlists
>>>   18. Query Results should know which input and query generated the  
>>> result
>>>
>>>
>>>     Context
>>>
>>> The Chandler repository / data model supplies the items that a 
>>> query  operates on. The obvious source of items is reference 
>>> collections.  This means that the input to a query is supplied by 
>>> one or more  reference collections, and that the output of the 
>>> collection is also  a reference collection. (More specifically, this 
>>> "interface" could be  narrowed to an item iterator. The only 
>>> difficulty with this approach  is that reference collections only 
>>> exist as attributes of items. This  means that all access to 
>>> reference collections is via an item. As  we'll see in the examples, 
>>> this leads to code that is more verbose  than you might expect.
>>>
>>> The Chandler design team also has a notion of a set of items which  
>>> will be visible to the user. This notion has been called  
>>> ItemCollections  
>>> <http://wiki.osafoundation.org/twiki/bin/view/Chandler/ 
>>> %20%0AItemCollections>, or more recently Playlists.
>>>
>>> Requirements for this notion include:
>>>
>>>    1. can have a bunch of items in it
>>>    2. can have a name
>>>    3. examples are "my e-mail inbox", and "my san jose sharks  
>>> calendar"
>>>    4. i can share them with other users
>>>    5. i can give different people different levels of access
>>>       permissions to different playlists
>>>    6. i can see who has subscribed to different playlists
>>>
>>> ItemCollections/Playlists's searching querying functionality will 
>>> be  implemented using the functionality of the query system.
>>>
>>> One open question is whether or not Reference Collections should be  
>>> stand alone items with their own kind. Doing this would shorten 
>>> some  of the code involving queries and present a data model more  
>>> consistent with the philosophy that everything is an Item.
>>>
>>> This could also be accomplished by creating a new Kind, Collection,  
>>> which exists solely to "wrap" a reference collection. It is likely  
>>> that something like this will need to be created anyway in order to  
>>> represent things such as mailboxes, calenders, etc. Whether this is  
>>> actually Playlists/ItemCollections or an abstraction used to  
>>> construct Playlist/ItemCollections is unclear.
>>>
>>> If we chose to do this, the new kinds we would add to the system are
>>>
>>>
>>>       New Kinds
>>>
>>> Collection
>>>     name
>>>     values (refCollection)
>>>
>>> Query
>>>     name - name of the query
>>>     input - one collection or a list of collections
>>>     itemSpec - the query expression
>>>     lastResultSize - size of the last execution (persistent?)
>>>
>>> UserCollection/Playlist/ItemCollection(Collection)
>>>     strictness
>>>     + application specific attributes
>>>
>>> There would be subkinds of either Collection or  
>>> UserCollection/Playlist/ItemCollection for representing mailboxes,  
>>> calendars, contacts, and so forth
>>>
>>>
>>>     Issues
>>>
>>> Here is a short list of open issues regarding queries
>>>
>>>    1. Order preservation -- esp since all our "sets" are really lists
>>>    2. Do we need a Join like operation?
>>>    3. Do queries operate on literal collections?
>>>    4. How do queries fit in as attributes -- do we want "query valued
>>>       attributes"?
>>>
>>>
>>>     Queries
>>>
>>> This section gives examples of how a Chandler developer would  
>>> interact with the query system. I assume that the variable |rep| is  
>>> an instance of the repository. As an example, we use the request 
>>> "all  e-mail messages that have been marked up as being in the Foo  
>>> project". I've also assumed that we are not introducing a 
>>> Collection  Kind, which implies that we need a Kind for persistent 
>>> query results,  which I'm calling ResultSetItem. This is so that I 
>>> can show the  amount of code required under this approach.
>>>
>>> First we need to set up a query item:
>>>
>>> | inbox = rep.find("//userdata/roots/email/inbox")
>>> query = QueryItem()
>>> query.input = inbox.contents
>>> query.itemSpec = 'FOR i IN $1 WHERE "Foo" in i.Projects' |
>>>
>>> To execute the query we need to do something like this, to get a  
>>> transient query result:
>>>
>>> | result = query.execute(QueryItem.TRANSIENT)
>>> |
>>>
>>> To get persistent query result it would look like this:
>>>
>>> | result = ResultSetItem()
>>> result.contents = query.execute(QueryItem.PERSISTENT)
>>> |
>>>
>>> We can shorten this by using the QueryItem constructor and defaults  
>>> on the execute method. The shortened version is:
>>>
>>> | query = QueryItem('FOR i IN $1 WHERE "Foo" in i.Projects',  
>>> inbox.contents)
>>> result = query.execute() # transient default
>>> |
>>>
>>> The shortened persistent version is:
>>>
>>> | query = QueryItem('FOR i IN $1 WHERE "Foo" in i.Projects',  
>>> inbox.contents)
>>> result = ResultSetItem()
>>> result.contents = query.execute(QueryItem.PERSISTENT)
>>> |
>>>
>>> What happens when we want to do a query over the result of a query?  
>>> The code looks like this for the transient case:
>>>
>>> | query1 = QueryItem()
>>> query1.input = result
>>> query1.itemSpec 'FOR i in $1 WHERE "urgent" in i.subject'
>>> result1 = query.execute(QueryItem.TRANSIENT)
>>> |
>>>
>>> or this for the persistent case:
>>>
>>> | query1 = QueryItem()
>>> query1.setInput(result.contents)
>>> query1.setItemSpec('FOR i in $1 WHERE "urgent" in i.subject)
>>> result1 = ResultSetItem()
>>> result1.content = query1.execute(QueryItem.PERSISTENT)
>>> |
>>>
>>> Again, we can shorten the transient case to
>>>
>>> | query1 = QueryItem('FOR i in $1 WHERE "urgent" in i.subject',  
>>> result)
>>> result1 = query1.execute()
>>> |
>>>
>>> and the persistent case to:
>>>
>>> | query1 = QueryItem('FOR i in $1 WHERE "urgent" in i.subject',  
>>> result.contents)
>>> result1 = ResultSetItem()
>>> result1.contents = query1.execute(QueryItem.PERSISTENT)
>>> | I think that we need to do something to make the persistent and  
>>> transient cases more alike.
>>>
>>>
>>>       Item specs for Example Queries
>>>        <http://wiki.osafoundation.org/twiki/bin/view/Journal/ 
>>> %20%0AQueryKickoffMtg20040115>
>>>
>>> This section shows how to express these queries using the system  
>>> being proposed.
>>>
>>> Here is the key for reading this section, the example query,  
>>> questions about the query, and issues raised by the query
>>>
>>> These have the path exprs worked out, except for . vs / as step  
>>> delimiter.
>>>
>>> Instead of showing the full code for each query, which contains a 
>>> lot  of boilerplate Python code, only the query's Input and ItemSpec 
>>> are  shown. The Input has been plugged directly into the ItemSpec 
>>> for  readability. The Python code to retrieve a particular input is 
>>> only  shown once per group of queries.
>>>
>>> All inbox email items that have been marked up as being in the Foo  
>>> project
>>> |inbox.contents = rep.find("//userdata/roots/email/inbox")
>>> FOR i IN inbox WHERE "Foo" in i.Projects|
>>>
>>> All inbox email items that were sent in the last month/day/week
>>> |FOR i in inbox.contents WHERE i.dateSent < 1 month/1 day/1 week|
>>> note easy notation for date literals - may be wishful thinking
>>>
>>> All inbox email items that were sent by a particular person
>>> |FOR i in inbox.contents WHERE i.fromAddress == <Joe'sContactItem>???|
>>> May need fuzzy match or alias based equality to do a good job here.
>>> mail schema is buggy because reply-address != From:/Sender
>>>
>>> All inbox email items with the word "Bar" in the body
>>> |FOR i in inbox.contents WHERE "Bar" in i.messageBody|
>>> Note inference of full text indexing!
>>>
>>> All inbox email items with "Bar" in either the subject, the body, 
>>> or  any of the to or from fields (or in the names of the Contacts  
>>> associated with the email addresses in any of the to or from fields)
>>> |FOR i in inbox.contents WHERE "Bar" in i.subject OR "Bar" in  
>>> i.messageBody" OR "Bar" in i.toAddress or "Bar" in i.replyAdddress 
>>> OR  "Bar" in i.toAddress.emailAddress|
>>> not clear enough how to specify in schema.
>>> Not clear when to use full text (or if able due to schema)
>>>
>>> All "recent" mail to or from a particular person
>>> Is this across all mailboxes?
>>> | mail = rep.find('//userdata/roots/email')
>>> FOR i in mail.contents WHERE i.returnAddress == "John" AND 
>>> i.dateSent  < 1 week|
>>> is this the correct nested semantics?
>>> |mail = rep.find('//userdata/roots/email')| or a query on kinds
>>>
>>> A user's home calendar: all calendar events, tasks, notes, etc. in  
>>> the "home" calendar
>>> |home = rep.find('//userdata/roots/calendars/home')
>>> FOR i in home.contents WHERE i.kind in (calendarKind, taskKind,  
>>> notesKind, ...)
>>> | either a big union, or any item in the home calendar -- attached  
>>> notes may make it harder
>>>
>>> All calendar items with durations overlapping a given 
>>> day/week/month  (startTime and endTime denote the duration)
>>> | cals = rep.find('//userdata/roots/calendars')
>>> FOR i in cals.contents WHERE  
>>> interval(i.startTime,i.endTime).contains("4/28/2004") |
>>> Assumes we create interval convenience classes for duration  
>>> overlapping -- may be part of calender or egenix-MX - need to  research
>>>
>>> All calendar items with the word "Bar" in the headline or body
>>> |FOR i in cals.contents WHERE "Bar" in i.heading OR "Bar" in i.body|
>>>
>>> All calendar items that are in the Foo project
>>> |FOR i IN cal.contents WHERE "Foo" in i.Projects|
>>>
>>> All content items that have some relationship to "today"
>>> What does relationship to "today mean"? -- any date in any 
>>> attribute?  ||
>>>
>>> Any Contact item where the related ContactName has the value "Lee"  
>>> for any of the attributes (firstName, middleName, lastName)
>>> |contacts = rep.find('//userdata/roots/contacts')
>>> FOR i IN contacts.contents WHERE i.firstName == "Lee" OR 
>>> i.middleName  == "Lee" OR i.lastName == "Lee"|
>>>
>>> Any content item with "Foo" in any attribute-value
>>> | items = rep.find('//userdata/contentitems')
>>> FOR i in items.contents WHERE ???? |
>>> Okay, this is a tough one
>>>
>>> All content items with a "project" attribute
>>> |FOR i in items.contents WHERE i.hasAttribute("project")|
>>> Currently there is no hasAttribute() on items An alternate  
>>> formulation is a query over kind.items and kind.subKinds.items
>>>
>>> A set of content items that the user has selected and dragged to 
>>> the  "MyStuff" view
>>> ||
>>> Couldn't find a view attribute -- this means a UserCollection, 
>>> which  should be queryable -- but the "query" probably just returns.
>>>
>>> Some/any combination of these use cases, unions and intersections
>>> |UNION(refCollection1, refCollection2)
>>> INTERSECTION(refCollection1, refCollection2)
>>> DIFFERENCE(refCollection1, refCollection2)|
>>>
>>>
>>>     Implementation issues
>>>
>>> Here are a few issues that we'll need to take into account in the  
>>> implementation
>>>
>>> Streaming results (implementation via threads / generators) - also  
>>> see python itertools
>>>
>>> Indices - values and path indices
>>>
>>> Clustering - may be difficult given current implementation
>>>
>>> Order Statistics
>>>
>>> Notification / constraint enforcement
>>>
>>> Nested Reference Collections
>>>
>>> ---- Ted Leung Open Source Applications Foundation (OSAF) PGP  
>>> Fingerprint: 1003 7870 251F FA71 A59A CEE3 BEBA 2B87 F5FC 4B42
>>>
>>> ---------------------------------------------------------------------- 
>>> -- 
>>>
>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>>
>>> Open Source Applications Foundation "Dev" mailing list
>>> http://lists.osafoundation.org/mailman/listinfo/dev
>>>
>>>
> ----
> Ted Leung                 Open Source Applications Foundation (OSAF)
> PGP Fingerprint: 1003 7870 251F FA71 A59A  CEE3 BEBA 2B87 F5FC 4B42
>
>



More information about the Dev mailing list