[Dev] HOWTO: fast index-based item lookup
John Anderson
john at osafoundation.org
Tue Jan 24 16:23:13 PST 2006
I also replaced the sequential search in flickr with an index on a
collection, so if you want to see some simple sample code that
implements an index and looks up items quickly check out flickr in SVN.
John
Andi Vajda wrote:
>
> The text below is a cross-post to this list from the a recent
> chandlerdb blog post
> http://blogs.osafoundation.org/chandlerdb/2006/01/using_collectio.html
> as per Ted's request.
>
> Andi..
> ---------------------------------------------------------------------------
>
> January 24, 2006
> Using collection indexes to find items
>
> During last week's sprints I was asked to see how easy it was to
> import mail
> into Chandler from a local mailbox. Thanks to Python's mailbox and email
> packages, the mailbox parsing was trivial. Similarly, Chandler's domain
> model can represent email items and has a number of APIs that make
> creating
> such items from a raw email string very easy.
>
> Unexpectedly, importing mail messages was very slow however, and was
> getting
> worse as more messages were added to the repository. The first cause of
> slowness was easily resolved by turning off 'bz2' compression of all the
> text LOBs being created.
> The second cause of slowness was more surprising though. It had to do
> with
> looking up existing email address items from the email addresses
> occurring
> in the headers of the messages being imported. The EmailAddress item
> class
> defined here has a class method called getEmailAddress() that will
> linearly
> iterate all EmailAddresses instances every time such an item is
> sought. The
> mailbox I was trying to import contains about 6,500 mails with 800 unique
> email addresses. Linearly searching them won't scale.
>
> In the early days of the repository we had planned on having a query
> system
> and Ted even implemented a query language not unlike one you'd
> encounter in
> an SQL database. Last year, while working on Chandler 0.6, it occurred
> to us
> that a formal query language in a Python-based system was somewhat
> redundant
> and that we should be using computed collections instead of queries
> altogether to accomplish the same goals . Hence abstract sets and
> their item
> wrappers were implemented.
>
> By combining abstract sets, collections and indexes - something I had
> added
> to bi-directional reference collections during the Chandler 0.5
> release - we
> realized we could get very decent item look-up performance as well as
> cache
> the computed aspect of abstract sets making membership tests and
> iteration
> also considerably faster.
>
> This technique can be illustrated with the example below taken from the
> EmailAddress index I added during last week's sprints.
>
> * First, a collection needs to be setup. This can be as simple as a
> bi-directional reference collection, an abstract collection of all
> items of a given kind or a more complicated computed collection
> combining or filtering other collections. In the EmailAddress
> example,
> I just created a KindCollection instance based on the EmailAddress
> kind. In the app parcel I added yet another collection called
> emailAddressCollection.
>
> emailAddressCollection = \
> KindCollection.update(parcel, 'emailAddressCollection',
> kind=pim.mail.EmailAddress.getKind(view),
> recursive=True)
>
> * Then, a sorted index needs be added to the collection. I wanted
> to be
> able to find existing email addresses such that no two email
> address
> items in the collection have the same lowercase internet email
> address
> string. For this purpose, I added the _compareAddr() method to the
> EmailAddress class in the mail parcel:
>
> def _compareAddr(self, other):
> return cmp(self.emailAddress.lower(),
> other.emailAddress.lower())
>
> and the index creation call after the code creating the collection:
>
> emailAddressCollection.rep.addIndex('emailAddress', 'compare',
> compare='_compareAddr')
>
> which creates a 'compare' index called 'emailAddress', an index
> calling a method called '_compareAddr' on the items it is
> comparing.
>
> * And finally to use this collection and the index to find items I
> added
> a findEmailAddress() class method to the EmailAddress class:
>
> @classmethod
> def findEmailAddress(cls, view, emailAddress):
>
> collection = schema.ns("osaf.app",
> view).emailAddressCollection.rep
> emailAddress = emailAddress.lower()
>
> def compareAddr(key):
> return cmp(emailAddress,
> view[key].emailAddress.lower())
>
> uuid = collection.findInIndex('emailAddress', 'exact',
> compareAddr)
> if uuid is None:
> return None
>
> return view[uuid]
>
> The findInIndex() call looks for an exact match as per the
> compareAddr
> local function which compares lowercase versions of internet
> address
> strings.
>
> The technique above replaces the linear search with a binary search
> yielding
> a very noticeable performance improvement in the overall importing of
> email
> messages into the repository.
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>
> Open Source Applications Foundation "Dev" mailing list
> http://lists.osafoundation.org/mailman/listinfo/dev
More information about the Dev
mailing list