[Dev] RDF tax

David McCusker david at osafoundation.org
Fri Nov 15 18:02:31 PST 2002


David Hyatt wrote:
> One lesson I learned while working on Mozilla is that it can be tricky 
> to make a scalable data source using RDF, even if the underlying format 
> (beneath the RDF representation) is scalable.  

When I'm asked about RDF and scaling, I tell folks it seems to need more
work on planning how to perform operations efficiently, as opposed to
iterating over every member of large populations.

But that assumes that "using RDF" also means some executable code at
runtime that actually does something to the content, as opposed to merely
imposing a descriptive framework.

I'm not yet aware of specific RDF related code that must be called at
runtime that will impose a performance tax.  I would hope all operations
that must be fast will access data using a lower level interface, which
might be advertised by the RDF interface itself. (Triples might show
the existence of an efficient indexing interface.)
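
A minimal sketch of that parenthetical, under stated assumptions: the
predicate name `ex:hasIndex`, the subject `ex:store`, and the `TripleStore`
class are all hypothetical and are not part of RDF or any real library.
The store describes its own index with an ordinary triple, and a caller
uses the lower level lookup when one is advertised, falling back to a
full scan otherwise.

```python
# Hypothetical sketch: none of these names come from RDF or any real library.
HAS_INDEX = "ex:hasIndex"  # assumed predicate that advertises a fast path
STORE = "ex:store"         # assumed subject naming the store itself


class TripleStore:
    def __init__(self, triples, indexed_predicates=()):
        self.triples = list(triples)
        self.scans = 0  # number of triples visited by slow-path scans
        self._indexes = {}
        for pred in indexed_predicates:
            index = {}
            for s, p, o in self.triples:
                if p == pred:
                    index.setdefault(o, []).append(s)
            self._indexes[pred] = index
            # Advertise the index as a triple about the store itself.
            self.triples.append((STORE, HAS_INDEX, pred))

    def match(self, pred, obj):
        """Generic triple-level access: walk every triple."""
        found = []
        for s, p, o in self.triples:
            self.scans += 1
            if p == pred and o == obj:
                found.append(s)
        return found

    def index_for(self, pred):
        """Lower level interface: a dict from object to subjects, or None."""
        return self._indexes.get(pred)


def advertised_indexes(store):
    """Read the capability triples once, up front."""
    return {o for s, p, o in store.triples if p == HAS_INDEX}


def subjects_with(store, advertised, pred, obj):
    # Take the fast path only when the store said it has one.
    if pred in advertised:
        return list(store.index_for(pred).get(obj, []))
    return store.match(pred, obj)


store = TripleStore(
    [("msg1", "ex:folder", "inbox"),
     ("msg2", "ex:folder", "sent"),
     ("msg3", "ex:folder", "inbox")],
    indexed_predicates=["ex:folder"],
)
advertised = advertised_indexes(store)
inbox = subjects_with(store, advertised, "ex:folder", "inbox")
print(inbox, store.scans)  # the indexed lookup never touches match()
```

The design choice here is that the description stays in triples while the
fast operation bypasses them; whether a real schema would express this the
same way is an open question.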

All of the content can be represented as RDF triples, with schemas that
are based on RDF-compatible technology.  I'm sure we'll be using RDF
in this sense.  But I don't see why we must perform operations slowly
using only information from an RDF perspective.

> As an example, consider a
> large mailbox with 50,000 messages.  If you force the UI to display 
> information through communication with your RDF layer, then you can end 
> up with a real challenge on your hands as far as avoiding walking too 
> much of the RDF data (especially when hierarchies are involved).  

I'm thinking in terms of mailboxes containing more messages than that.
When I want to scale, I aim for a solution that might work when using
all available storage resources.  But we might make a choice here and
there which precludes scaling this large in early releases; just the
same, I'll avoid throttling other parts unnecessarily for later scaling.

We'd only be forced to use RDF as an interface if our actual app model
defined all the content as RDF based, as I think Mozilla was attempting.
If one throttled what the app could do to the least common denominator
of what all RDF sources can express, then there would be performance
limitations due to RDF's lack of expression about indexing mechanisms.

I'd hope we instead interpreted all our data as having an RDF meaning
when it is needed, but also allow ourselves parallel paths to access the
data with higher performance.
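
One way to picture the parallel paths is sketched below. It is only an
illustration under assumptions: the `Message` and `Mailbox` classes and the
`ex:` predicate names are hypothetical, not Chandler or Mozilla code. The
data lives in ordinary Python objects with a native index, and the RDF
reading is generated on demand.

```python
from dataclasses import dataclass


@dataclass
class Message:
    # Hypothetical message record; the field names are assumptions.
    msg_id: str
    sender: str
    subject: str


class Mailbox:
    def __init__(self):
        self._by_id = {}
        self._by_sender = {}  # native index, invisible to the RDF view

    def add(self, msg):
        self._by_id[msg.msg_id] = msg
        self._by_sender.setdefault(msg.sender, []).append(msg)

    def from_sender(self, sender):
        """Fast path: one dictionary lookup, no triple walking."""
        return list(self._by_sender.get(sender, []))

    def triples(self):
        """RDF path: triples are produced lazily, only when requested."""
        for msg in self._by_id.values():
            yield (msg.msg_id, "ex:sender", msg.sender)
            yield (msg.msg_id, "ex:subject", msg.subject)


box = Mailbox()
box.add(Message("m1", "katie", "query language"))
box.add(Message("m2", "hyatt", "RDF tax"))
box.add(Message("m3", "katie", "schemas"))

fast = [m.msg_id for m in box.from_sender("katie")]
rdf_view = list(box.triples())
```

Both paths read the same objects, so the RDF meaning is available when it
is needed without forcing every operation through it.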

I think the natural model in Chandler might be something in terms of
ordinary Python objects at first, if the persistence is based on some
transparent object persistence scheme.

Katie said we're still thinking about what kind of query language we might
use, and this seems relevant to this performance problem.  If we can only
express view queries in terms of RDF conventions, such queries might be
hard to execute efficiently if no optimized path exists for the queries.

> This problem is compounded by aggregation, where even simple things like
> counting how many items are in a subfolder become challenging (since all 
> data sources have to get involved).  Trying to do projection in a 
> message display becomes very difficult, especially with aggregation.

Actually, this is one of the ridiculous aspects of RDF that made working
on Mozilla not much fun, and prevented me from enjoying work at Netscape,
since nothing I did in the backend could prevent a high level RDF system
from spending arbitrary cycles doing things inefficiently.

(I apologize for using the word ridiculous, which implies something a bit
negative about RDF, since it's not my place to judge RDF to the extent of
praising or condemning it.  However, I'll try to keep folks from making
choices that require that Chandler scale poorly.)

Maybe we can just invent accelerators for RDF related mechanisms and
try to push them into use throughout the RDF community.  The neighborly
thing to do is fix deficiencies in shared tools.  To encourage adoption,
we could work on explaining the benefits in simple language.
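
As a rough sketch of what such an accelerator could look like for the
subfolder counting case mentioned above, an aggregate can ask each member
source for an optional fast count and enumerate only the sources that lack
one. Everything here is hypothetical: the class names and the
`count_children` method are assumptions, not an existing RDF or Mozilla
interface, and the sketch assumes the sources hold disjoint items so the
counts can simply be added.

```python
class ListSource:
    """Plain source: it can only enumerate its children."""

    def __init__(self, folders):
        self._folders = folders  # dict mapping folder name -> list of items
        self.walked = 0          # items visited through enumeration

    def children(self, folder):
        for item in self._folders.get(folder, []):
            self.walked += 1
            yield item


class CountingSource(ListSource):
    """Source offering the assumed optional accelerator method."""

    def count_children(self, folder):
        return len(self._folders.get(folder, []))


class Aggregate:
    def __init__(self, sources):
        self.sources = sources

    def count_children(self, folder):
        total = 0
        for src in self.sources:
            fast = getattr(src, "count_children", None)
            if fast is not None:
                total += fast(folder)  # accelerator available
            else:
                # Fallback: walk every child of this source.
                total += sum(1 for _ in src.children(folder))
        return total


slow = ListSource({"inbox": ["a", "b"]})
quick = CountingSource({"inbox": ["c", "d", "e"]})
total = Aggregate([slow, quick]).count_children("inbox")
```

Sources that never learn the accelerator keep working, which is what would
let an extension like this spread gradually.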

> In Mozilla we ended up using RDF for mailboxes, but we dumped it for 
> messages (and sacrificed the aggregation capabilities of RDF in the 
> process).   This is an advantage of using Mozilla's tree view, since it 
> works with RDF for smaller data sources (e.g., mailboxes and bookmarks) 
> but you can still plug in your own non-RDF back end to the tree view if 
> you need something more scalable.

I think I agree that engines built on top of RDF primitives alone can't
scale without extending the RDF toolset to include accelerators for
large object populations.  I don't know how to do that yet, but
I'll work on it when/if it's deemed a priority by the Chandler team.

--David McCusker
