[Cosmo] Alternative data model try to keep JCR nodes/properties
Cyrus Daboo
cyrus+lists.cosmo at daboo.name
Sat Nov 12 10:27:55 PST 2005
Hi,
OK, so here is an attempt to use Jackrabbit nodes/properties for iCal
data while minimising their overhead.
Just to summarise the current layout (this is the one prior to Brian's
recent changes):
node - iCal data resource (this is the .ics file node)
 |
 - 1*node - iCal component (could be many of these)
    |
    - 1*node - iCal property (will be many of these)
       |
       - *property - iCal parameter (usually not that many)
The first thing to recognise is that nodes probably have a higher overhead
than properties. For example, take a look at
org.apache.jackrabbit.core.state.xml.XMLPersistenceManager, which uses an
XML DOM to store node and property data. Given that, it makes sense to
make all the iCalendar data properties of a single node. That can be done
using the 'flat' iCal data format I suggested in my previous approach to
this. So the following data layout could be used:
node - iCal data resource
 |
 - 1*property - iCal flat data item
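
As a rough illustration of what "flat data items" means here, the sketch
below turns iCalendar text into (name, value) pairs using the
'VCALENDAR-VEVENT_SUMMARY' style of name mentioned later in this message.
The function name is made up for the example, the input is assumed to be
already unfolded, and parameters are simply dropped; how the real flat
format treats parameters and repeated properties is not covered.

```python
def flatten_ical(text):
    """Return (flat_name, value) pairs for each property line.

    Assumes unfolded iCalendar text. Parameters are ignored and the value
    is taken to be everything after the first ':' (a simplification).
    """
    stack = []   # names of the components currently open
    items = []   # a list, so repeated property names are all kept
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        head, _, value = line.partition(":")
        name = head.split(";", 1)[0].upper()   # strip any parameters
        if name == "BEGIN":
            stack.append(value.upper())
        elif name == "END":
            stack.pop()
        else:
            # component path joined with '-', then '_' and the property name
            items.append(("-".join(stack) + "_" + name, value))
    return items
```

Each pair returned would become one property on the single iCal data
resource node.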
The next thing to do is to try and optimise the storage of the properties.
Ideally all we want to store for each property is its uuid, the name and
the (single) value. In addition it would be nice to have all properties of
the same node in a single file to avoid the overhead of having lots of small
files on disk. Of course that does affect performance, in that a single
file needs to be read in and parsed to store/access a single property. In
the ideal world Jackrabbit would have a way to atomically load/store all
properties of a node in one go, rather than one at a time - but let's ignore
that for now.
So, one way to minimise the data stored in each property is to write a
custom PersistenceManager class. What I propose is a slightly modified
layout from that above:
node - iCal data resource
 |
 - 1*1 node - node with a type to indicate special property handling
    |
    - 1*property - iCal flat data item
In the PersistenceManager, each property is identified by an ID that
includes the value of the parent node's uuid. The default persistence
managers store each property as a separate file in a directory located
alongside the node data.
I think the following will work: in the PersistenceManager class, whenever
a property is accessed, the manager looks at the type of the parent node.
If that type is our special 'compact-properties' type, then a special form
of property store is used, otherwise the default property store is used.
The special store will store all properties in a minimal format in a single
XML/binary file located in the node directory. The only data stored for
each property is uuid, name, value. The other items usually stored will be
ignored (and appropriately defaulted when read in).
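
To make the dispatch idea concrete, here is a minimal sketch. It is not
Jackrabbit code: the class names, the store_for() helper and the
'properties.xml' file name are all invented for the example, and a real
implementation would live inside a custom PersistenceManager. It only shows
choosing a store by parent node type and keeping uuid, name and value for
every property of a node in one XML file.

```python
import os
import xml.etree.ElementTree as ET

COMPACT_TYPE = "compact-properties"   # the special node type described above


class DefaultPropertyStore:
    """Stand-in for the default behaviour: one file per property."""

    def store(self, node_dir, uuid, name, value):
        os.makedirs(node_dir, exist_ok=True)
        path = os.path.join(node_dir, name + ".txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(uuid + "\n" + value)


class CompactPropertyStore:
    """All properties of a node in a single XML file (uuid, name, value)."""

    FILE_NAME = "properties.xml"   # hypothetical file name

    def _path(self, node_dir):
        return os.path.join(node_dir, self.FILE_NAME)

    def load(self, node_dir):
        """Return {name: (uuid, value)} for every property of the node."""
        path = self._path(node_dir)
        if not os.path.exists(path):
            return {}
        root = ET.parse(path).getroot()
        return {p.get("name"): (p.get("uuid"), p.text or "") for p in root}

    def store(self, node_dir, uuid, name, value):
        # The whole file is read and rewritten to change one property,
        # which is the performance cost noted earlier.
        props = self.load(node_dir)
        props[name] = (uuid, value)
        root = ET.Element("properties")
        for prop_name, (prop_uuid, prop_value) in props.items():
            elem = ET.SubElement(root, "property",
                                 name=prop_name, uuid=prop_uuid)
            elem.text = prop_value
        os.makedirs(node_dir, exist_ok=True)
        ET.ElementTree(root).write(self._path(node_dir), encoding="utf-8")


def store_for(parent_node_type):
    """Pick the property store based on the parent node's type."""
    if parent_node_type == COMPACT_TYPE:
        return CompactPropertyStore()
    return DefaultPropertyStore()
```

Any item not written by the compact store would have to be given a default
value when the property is read back in.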
So what will be the overhead with this?
First off there is the special node - but that's one per iCal resource.
Then we have the flat data format. Right now that expands each property
name (e.g. 'SUMMARY') to 'VCALENDAR-VEVENT_SUMMARY'. That is a significant
extra amount of text being stored and indexed. We can optimise that by
using numeric values to identify standard component/property/parameter
names and using full names only for X- items. So the example would become
'0-1_10', say. That should actually be more compact than the original
property name, but it is less readable and does require lookup tables
to map between the numeric and name representations. That is something we
could do after a proof of concept using the full flat names.
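
A small sketch of the lookup-table idea follows. The numeric codes are
placeholders picked only so that the example matches '0-1_10'; no real
table has been defined, and the function names are made up. Names missing
from the tables (such as X- items) are passed through unchanged.

```python
# Hypothetical code tables; a real one would cover all standard names.
COMPONENT_CODES = {"VCALENDAR": 0, "VEVENT": 1}
PROPERTY_CODES = {"SUMMARY": 10}


def encode_token(name, table):
    """Return the numeric code as text, or the name itself if unknown."""
    code = table.get(name.upper())
    return name if code is None else str(code)


def compact_name(components, prop):
    """Build the compact flat name from a component path and a property."""
    path = "-".join(encode_token(c, COMPONENT_CODES) for c in components)
    return path + "_" + encode_token(prop, PROPERTY_CODES)
```

With these tables, compact_name(["VCALENDAR", "VEVENT"], "SUMMARY") gives
'0-1_10', while an X- property keeps its full name after the underscore.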
The overhead of each property (over and above iCal raw data) then becomes
just the uuid, and any XML overhead if XML is used. I would propose using
the XMLPersistenceManager as the initial implementation of this for easier
debugging, and then perhaps switching to the ObjectPersistenceManager later
once the proof of concept is shown to work.
I would hope that at worst we are talking about a repository disk overhead
of 2.5 - 3 times that of the raw calendar data (excluding any overhead for
indexing etc., which we will always have).
Note that this does not address the issue of memory overhead. I would hope
that there is a way to deal with that by noting that the iCal data
properties are only ever used for queries once they are actually created.
i.e. there is no need to load those properties in when doing a query as the
query engine indexer will be used, and the actual value of the properties
is irrelevant for the CalDAV query, since we get the iCal data back from
the raw iCal data stream. So what we would need is a way to flush the
property objects out of memory after they are initially created and indexed
when the iCal resource is PUT. That even raises the question of whether we
need to have a real disk representation of that data once it's been indexed
- so we may be able to get away with even less disk usage after indexing
has been done.
Let me know what you think about this and whether you want me to try a
proof of concept. I will postpone work on the query report until we've
decided how we want to approach this. In the meantime I will work on the
limit/expand recurrence task and the free-busy report task.
--
Cyrus Daboo