[Cosmo] Alternative data model try to keep JCR nodes/properties

Cyrus Daboo cyrus+lists.cosmo at daboo.name
Sat Nov 12 10:27:55 PST 2005


Hi,
OK, so here is an attempt to use Jackrabbit nodes/properties for iCal data 
while minimising the overhead of those.

Just to summarise the current layout (this is the one prior to Brian's 
recent changes):

node                  - iCal data resource (this is the .ics file node)
|
 - 1*node             - iCal component (could be many of these)
   |
    - 1*node          - iCal property (will be many of these)
      |
       - *property    - iCal parameter (usually not that many)

The first thing to recognise is that nodes probably have a higher overhead 
than properties; e.g. take a look at 
org.apache.jackrabbit.core.state.xml.XMLPersistenceManager, which uses an 
XML DOM to store node and property data. Given that, it makes sense to 
make all the iCalendar data properties of a single node. That can be done 
using the 'flat' iCal data format I suggested in my previous approach to 
this. So the following data layout could be used:

node                 - iCal data resource
|
 - 1*property        - iCal flat data item
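
As an illustration of the flat layout, here is a minimal Python sketch (not 
Jackrabbit or Cosmo code) that turns raw iCalendar text into flat items. It 
assumes the naming pattern 'VCALENDAR-VEVENT_SUMMARY': enclosing component 
names joined by '-', then '_' and the property name. The function name 
flatten_ical is made up for this example, and iCal parameters are simply 
dropped because their flat form is not described here.

```python
from collections import defaultdict


def flatten_ical(text):
    """Map raw iCalendar text to {flat_name: [values]}.

    Assumed flat name pattern: component names joined by '-', then '_' and
    the property name, e.g. 'VCALENDAR-VEVENT_SUMMARY'.
    """
    # Undo iCalendar line folding: a line break followed by a single space
    # or tab continues the previous line.
    unfolded = text.replace("\r\n", "\n").replace("\n ", "").replace("\n\t", "")

    stack = []                  # names of the components we are currently inside
    flat = defaultdict(list)    # a property name may occur more than once
    for line in unfolded.split("\n"):
        if not line.strip():
            continue
        # Simplification: split at the first ':'; a quoted parameter value
        # containing ':' would need a real parser.
        head, _, value = line.partition(":")
        name = head.split(";", 1)[0].upper()   # assumption: parameters dropped
        if name == "BEGIN":
            stack.append(value.strip().upper())
        elif name == "END":
            stack.pop()
        else:
            flat["-".join(stack) + "_" + name].append(value)
    return dict(flat)
```

Each key of the result would become one property on the single iCal data 
resource node.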

The next thing to do is to try and optimise the storage of the properties. 
Ideally all we want to store for each property is its uuid, the name and 
the (single) value. In addition it would be nice to have all properties of 
the same node in a single file to avoid the overhead of having lots of small 
files on disk. Of course that does affect performance, in that a single 
file needs to be read in and parsed to store/access a single property. In 
the ideal world Jackrabbit would have a way to atomically load/store all 
properties of a node in one go, rather than one at a time - but let's ignore 
that for now.

So, one way to minimise the data stored in each property is to write a 
custom PersistenceManager class. What I propose is a slightly modified 
layout from that above:

node                    - iCal data resource
|
 - 1*1 node             - node with a type to indicate special property
                          handling
       |
        - 1*property    - iCal flat data item

In the PersistenceManager, each property is identified by an ID that 
includes the value of the parent node's uuid. The default stores keep each 
property as a separate file in a directory at the location of the node 
data.

I think the following will work: in the PersistenceManager class, whenever 
a property is accessed, the manager looks at the type of the parent node. 
If that type is our special 'compact-properties' type, then a special form 
of property store is used, otherwise the default property store is used. 
The special store will store all properties in a minimal format in a single 
XML/binary file located in the node directory. The only data stored for 
each property is uuid, name, value. The other items usually stored will be 
ignored (and appropriately defaulted when read in).
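
To make the dispatch concrete, here is a rough Python sketch of the idea. It 
is not the Jackrabbit PersistenceManager API: the class names, the 
store_for function and the file names ('properties.xml', 'p-NNNNNN.xml') are 
all invented for illustration. Only the 'compact-properties' type name and 
the uuid/name/value triple come from the description above.

```python
import os
import xml.etree.ElementTree as ET

COMPACT_TYPE = "compact-properties"   # node type name from the proposal


class DefaultStore:
    """Stand-in for the default behaviour: one small file per property."""

    def save(self, node_dir, props):
        for i, (uuid, name, value) in enumerate(props):
            elem = ET.Element("property", uuid=uuid, name=name)
            elem.text = value
            path = os.path.join(node_dir, "p-%06d.xml" % i)   # hypothetical name
            ET.ElementTree(elem).write(path, encoding="utf-8")

    def load(self, node_dir):
        props = []
        for filename in sorted(os.listdir(node_dir)):
            if filename.startswith("p-"):
                elem = ET.parse(os.path.join(node_dir, filename)).getroot()
                props.append((elem.get("uuid"), elem.get("name"), elem.text or ""))
        return props


class CompactStore:
    """All properties of a node in one file, storing only uuid, name, value."""

    FILE = "properties.xml"   # hypothetical file name

    def save(self, node_dir, props):
        root = ET.Element("properties")
        for uuid, name, value in props:
            ET.SubElement(root, "p", uuid=uuid, name=name).text = value
        ET.ElementTree(root).write(os.path.join(node_dir, self.FILE), encoding="utf-8")

    def load(self, node_dir):
        root = ET.parse(os.path.join(node_dir, self.FILE)).getroot()
        return [(p.get("uuid"), p.get("name"), p.text or "") for p in root]


def store_for(parent_node_type):
    """Pick the property store from the type of the parent node."""
    return CompactStore() if parent_node_type == COMPACT_TYPE else DefaultStore()
```

Calling store_for("compact-properties").save(node_dir, props) leaves a 
single file in the node directory, whereas any other node type falls back 
to one file per property.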

So what will be the overhead with this?

First off there is the special node - but that's one per iCal resource.

Then we have the flat data format. Right now that expands each property 
name (e.g. 'SUMMARY') to 'VCALENDAR-VEVENT_SUMMARY'. That is a significant 
extra amount of text being stored and indexed. Well, we can optimise that by 
using numeric values to identify standard component/property/parameter 
names and using full names only for X- items. So the example would become 
'0-1_10', say. That should actually be more compact than the original 
property name, but it is less readable and does require lookup tables 
to map between the numeric and name representations. That is something we 
could do after a proof of concept using the full flat names.
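
A small Python sketch of the lookup-table mapping follows. The numeric codes 
are placeholders chosen only so that the example above comes out as 
'0-1_10'; the real numbering would have to be agreed on, and the table and 
function names are invented here. Names not in the tables (X- items) pass 
through unchanged. One limitation of this sketch: an X- name with an 
all-digit segment (e.g. 'X-FOO-1') in the component path would be 
misread on decode.

```python
# Hypothetical code tables; only 0, 1 and 10 mirror the example in the text.
COMPONENTS = {"VCALENDAR": 0, "VEVENT": 1, "VTODO": 2}
PROPERTIES = {"SUMMARY": 10, "DTSTART": 11, "DTEND": 12}
COMPONENT_NAMES = {code: name for name, code in COMPONENTS.items()}
PROPERTY_NAMES = {code: name for name, code in PROPERTIES.items()}


def _encode_token(name, table):
    # Standard names become numbers; X- (or otherwise unknown) names stay as text.
    return str(table[name]) if name in table else name


def _decode_token(token, names):
    # All-digit tokens are looked up; anything else is already a full name.
    return names[int(token)] if token.isdigit() else token


def encode(flat_name):
    """'VCALENDAR-VEVENT_SUMMARY' -> '0-1_10' with the tables above."""
    path, _, prop = flat_name.partition("_")
    comps = [_encode_token(c, COMPONENTS) for c in path.split("-")]
    return "-".join(comps) + "_" + _encode_token(prop, PROPERTIES)


def decode(compact_name):
    """Inverse of encode for names built from the tables above."""
    path, _, prop = compact_name.partition("_")
    comps = [_decode_token(c, COMPONENT_NAMES) for c in path.split("-")]
    return "-".join(comps) + "_" + _decode_token(prop, PROPERTY_NAMES)
```

With these tables, an X- property such as 'VCALENDAR-VEVENT_X-COLOR' encodes 
to '0-1_X-COLOR', so only the standard parts of the name are shortened.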

The overhead of each property (over and above iCal raw data) then becomes 
just the uuid, and any XML overhead if XML is used. I would propose using 
the XMLPersistenceManager as the initial implementation of this for easier 
debugging, and then perhaps switching to the ObjectPersistenceManager later 
once the proof of concept is shown to work.

I would hope that at worst we are talking about a repository disk overhead 
of 2.5 - 3 times that of the raw calendar data (excluding any overhead for 
indexing etc., which we will always have).

Note that this does not address the issue of memory overhead. I would hope 
that there is a way to deal with that by noting that the iCal data 
properties are only ever used for queries once they are actually created. 
i.e. there is no need to load those properties in when doing a query as the 
query engine indexer will be used, and the actual value of the properties 
is irrelevant for the CalDAV query, since we get the iCal data back from 
the raw iCal data stream. So what we would need is a way to flush the 
property objects out of memory after they are initially created and indexed 
when the iCal resource is PUT. That even begs the question of whether we 
need to have a real disk representation of that data once it's been indexed 
- so we may be able to get away with even less disk usage after indexing 
has been done.
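
The index-then-flush idea can be sketched in Python roughly as follows. This 
is a toy stand-in, not Jackrabbit's query engine: the CalendarStore class 
and its exact-match index are invented for illustration. On PUT the flat 
properties are indexed and then dropped; a query consults only the index 
and returns the raw iCal data.

```python
from collections import defaultdict


class CalendarStore:
    """Toy model: keep the raw iCal stream and an index, not the properties."""

    def __init__(self):
        self.raw = {}                    # resource path -> raw iCal text
        self.index = defaultdict(set)    # (flat name, value) -> resource paths

    def put(self, path, raw_ical, flat_props):
        # Drop index entries left over from an earlier PUT of the same path.
        for paths in self.index.values():
            paths.discard(path)
        self.raw[path] = raw_ical
        for name, values in flat_props.items():
            for value in values:
                self.index[(name, value)].add(path)
        # flat_props is not kept: after indexing, the property objects can be
        # released, since queries never read their values.

    def query(self, name, value):
        # The answer comes from the index; the data returned is the raw stream.
        matches = self.index.get((name, value), ())
        return [self.raw[p] for p in sorted(matches)]
```

Nothing in query() touches the per-property data, which is the reason it 
may not need to stay in memory (or perhaps even on disk) once indexed.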

Let me know what you think about this and whether you want me to try a 
proof of concept. I will postpone work on the query report until we've 
decided how we want to approach this. In the meantime I will work on the 
limit/expand recurrence task and the free-busy report task.

-- 
Cyrus Daboo



