[Dev] HOWTO: fast index-based item lookup

John Anderson john at osafoundation.org
Tue Jan 24 16:23:13 PST 2006


I also replaced the sequential search in flickr with an index on a 
collection, so if you want to see some simple sample code that 
implements an index and looks up items quickly check out flickr in SVN.

John

Andi Vajda wrote:
>
> The text below is a cross-post to this list from the a recent 
> chandlerdb blog post 
> http://blogs.osafoundation.org/chandlerdb/2006/01/using_collectio.html
> as per Ted's request.
>
> Andi..
> --------------------------------------------------------------------------- 
>
> January 24, 2006
> Using collection indexes to find items
>
> During last week's sprints I was asked to see how easy it was to 
> import mail
> into Chandler from a local mailbox. Thanks to Python's mailbox and email
> packages, the mailbox parsing was trivial. Similarly, Chandler's domain
> model can represent email items and has a number of APIs that make 
> creating
> such items from a raw email string very easy.
>
> Unexpectedly, importing mail messages was very slow however, and was 
> getting
> worse as more messages were added to the repository. The first cause of
> slowness was easily resolved by turning off 'bz2' compression of all the
> text LOBs being created.
> The second cause of slowness was more surprising though. It had to do 
> with
> looking up existing email address items from the email addresses 
> occurring
> in the headers of the messages being imported. The EmailAddress item 
> class
> defined here has a class method called getEmailAddress() that will 
> linearly
> iterate all EmailAddresses instances every time such an item is 
> sought. The
> mailbox I was trying to import contains about 6,500 mails with 800 unique
> email addresses. Linearly searching them won't scale.
>
> In the early days of the repository we had planned on having a query 
> system
> and Ted even implemented a query language not unlike one you'd 
> encounter in
> an SQL database. Last year, while working on Chandler 0.6, it occurred 
> to us
> that a formal query language in a Python-based system was somewhat 
> redundant
> and that we should be using computed collections instead of queries
> altogether to accomplish the same goals . Hence abstract sets and 
> their item
> wrappers were implemented.
>
> By combining abstract sets, collections and indexes - something I had 
> added
> to bi-directional reference collections during the Chandler 0.5 
> release - we
> realized we could get very decent item look-up performance as well as 
> cache
> the computed aspect of abstract sets making membership tests and 
> iteration
> also considerably faster.
>
> This technique can be illustrated with the example below taken from the
> EmailAddress index I added during last week's sprints.
>
>     * First, a collection needs to be setup. This can be as simple as a
>       bi-directional reference collection, an abstract collection of all
>       items of a given kind or a more complicated computed collection
>       combining or filtering other collections. In the EmailAddress 
> example,
>       I just created a KindCollection instance based on the EmailAddress
>       kind. In the app parcel I added yet another collection called
>       emailAddressCollection.
>
>       emailAddressCollection = \
>           KindCollection.update(parcel, 'emailAddressCollection',
>                                 kind=pim.mail.EmailAddress.getKind(view),
>                                 recursive=True)
>
>     * Then, a sorted index needs be added to the collection. I wanted 
> to be
>       able to find existing email addresses such that no two email 
> address
>       items in the collection have the same lowercase internet email 
> address
>       string. For this purpose, I added the _compareAddr() method to the
>       EmailAddress class in the mail parcel:
>
>        def _compareAddr(self, other):
>            return cmp(self.emailAddress.lower(), 
> other.emailAddress.lower())
>
>       and the index creation call after the code creating the collection:
>
>       emailAddressCollection.rep.addIndex('emailAddress', 'compare',
>                                           compare='_compareAddr')
>
>       which creates a 'compare' index called 'emailAddress', an index
>       calling a method called '_compareAddr' on the items it is 
> comparing.
>
>     * And finally to use this collection and the index to find items I 
> added
>       a findEmailAddress() class method to the EmailAddress class:
>
>           @classmethod
>           def findEmailAddress(cls, view, emailAddress):
>
>               collection = schema.ns("osaf.app", 
> view).emailAddressCollection.rep
>               emailAddress = emailAddress.lower()
>
>               def compareAddr(key):
>                   return cmp(emailAddress, 
> view[key].emailAddress.lower())
>
>               uuid = collection.findInIndex('emailAddress', 'exact', 
> compareAddr)
>               if uuid is None:
>                   return None
>
>               return view[uuid]
>
>       The findInIndex() call looks for an exact match as per the 
> compareAddr
>       local function which compares lowercase versions of internet 
> address
>       strings.
>
> The technique above replaces the linear search with a binary search 
> yielding
> a very noticeable performance improvement in the overall importing of 
> email
> messages into the repository.
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>
> Open Source Applications Foundation "Dev" mailing list
> http://lists.osafoundation.org/mailman/listinfo/dev


More information about the Dev mailing list