[Chandler-dev] [Sum] The Great Architecture Discussion of 2007

Phillip J. Eby pje at telecommunity.com
Wed Oct 10 10:07:33 PDT 2007


At 08:11 AM 10/10/2007 -0700, Andi Vajda wrote:

>On Wed, 10 Oct 2007, Phillip J. Eby wrote:
>
>>>>I mean that by implementing a skiplist *inside* of BerkeleyDB 
>>>>rather than using a native BerkeleyDB structure, we're adding an 
>>>>"interpretation" layer there.
>>>What is the native Berkeley DB structure that corresponds to a skiplist ?
>>
>>A BTree.  And the native Python equivalent is a list.  You may 
>>recall that at one time, I compared the performance of the skiplist 
>>implementation with a simple Python list object, and found that for 
>>lists of up to 50,000 items, Python lists were faster -- often 
>>significantly faster.  This is another good example of where we are 
>>reinventing wheels in Python code that somebody already wrote a 
>>mature implementation of in C.
>
>I don't see how a Berkeley DB B-Tree gives me positional access. Nor 
>do I see how I can sparsely load and persist a Python list.

Berkeley lets you access portions of a record, does it not?  You can 
read portions of a BTree into a positional list or an emulation of 
one, can you not?  As far as I am aware, there is nothing in our 
requirements that requires truly *random* positional access within an index.

We use positional access for displaying lists of things, and such 
display is essentially an indexed sequential access, with positional 
access being needed within a particular window of offsets.  In other 
words, we don't randomly choose to access item #1 now, and then #1000 
-- we have strong locality or clustering of access.  A simple 
sliding-window cache over a BTree could perhaps be sufficient.

This is a good example of the sort of thing that happens when an 
architecture is created before the practical requirements are well-understood.

To be clear: I am not saying you made a bad choice.  I am saying the 
requirements that were given to you at the time were either 
incomplete or misleading.

I am also not proposing that we necessarily go back and revisit this 
choice in the near future, but if we do (e.g., changing the 
repository to "compile" dynamic schema to BDB instead of using an 
"interpreted" approach), it needs to be in the context of an 
application layer that is also designed for performance.

As you have quite correctly pointed out, the repository's performance 
issues are dwarfed at the moment by other issues, so fixing the 
"interpreted database" aspect is a micro-optimization by comparison.

However, as the bigger-picture issues are fixed, these lower-level 
factors will eventually become a new bottleneck or "hot spot", 
performance-wise.


>  This is why I implemented skiplists. Python ? I re-wrote the 
> skiplist implementation in C years ago.

So, we reinvented that wheel twice then.... and a skiplist still must 
use more memory per item than a Python list does or a sliding window 
over a BTree would.

I'm not sure I see the point to this increasingly-detailed drill 
down, though, since even if the issues regarding skiplists vs. BTrees 
and lists were completely resolved, it wouldn't affect any of the 
other points under discussion.



More information about the chandler-dev mailing list