[Chandler-dev] [Sum] The Great Architecture Discussion of 2007
Phillip J. Eby
pje at telecommunity.com
Wed Oct 10 10:07:33 PDT 2007
At 08:11 AM 10/10/2007 -0700, Andi Vajda wrote:
>On Wed, 10 Oct 2007, Phillip J. Eby wrote:
>>>>I mean that by implementing a skiplist *inside* of BerkeleyDB
>>>>rather than using a native BerkeleyDB structure, we're adding an
>>>>"interpretation" layer there.
>>>What is the native Berkeley DB structure that corresponds to a skiplist ?
>>A BTree. And the native Python equivalent is a list. You may
>>recall that at one time, I compared the performance of the skiplist
>>implementation with a simple Python list object, and found that for
>>lists of up to 50,000 items, Python lists were faster -- often
>>significantly faster. This is another good example of where we are
>>reinventing wheels in Python code that somebody already wrote a
>>mature implementation of in C.
>I don't see how a Berkeley DB B-Tree gives me positional access. Nor
>do I see how I can sparsely load and persist a Python list.
Berkeley lets you access portions of a record, does it not? You can
read portions of a BTree into a positional list or an emulation of
one, can you not? As far as I am aware, there is nothing in our
requirements that requires truly *random* positional access within an index.
We use positional access for displaying lists of things, and such
display is essentially an indexed sequential access, with positional
access being needed within a particular window of offsets. In other
words, we don't randomly choose to access item #1 now, and then #1000
-- we have strong locality or clustering of access. A simple
sliding-window cache over a BTree could perhaps be sufficient.
This is a good example of the sort of thing that happens when an
architecture is created before the practical requirements are well-understood.
To be clear: I am not saying you made a bad choice. I am saying the
requirements that were given to you at the time were either
incomplete or misleading.
I am also not proposing that we necessarily go back and revisit this
choice in the near future, but if we do (e.g., changing the
repository to "compile" dynamic schema to BDB instead of using an
"interpreted" approach), it needs to be in the context of an
application layer that is also designed for performance.
As you have quite correctly pointed out, the repository's performance
issues are dwarfed at the moment by other issues, so fixing the
"interpreted database" aspect is a micro-optimization by comparison.
However, as the bigger-picture issues are fixed, these lower-level
factors will eventually become a new bottleneck or "hot spot",
> This is why I implemented skiplists. Python ? I re-wrote the
> skiplist implementation in C years ago.
So, we reinvented that wheel twice then.... and a skiplist still must
use more memory per item than a Python list does or a sliding window
over a BTree would.
I'm not sure I see the point to this increasingly-detailed drill
down, though, since even if the issues regarding skiplists vs. BTrees
and lists were completely resolved, it wouldn't affect any of the
other points under discussion.
More information about the chandler-dev