[Cosmo] Backing up a Cosmo filesystem

Jared Rhine jared at wordzoo.com
Tue Jan 10 15:06:37 PST 2006

Backing up cosmo-demo and/or foxcloud's instances would be swell, a
natural part of going from "prototype" to "production".  We have RAID-1
in case we lose a drive, but we can't currently go back in time to
recover in the case of malicious actions or corrupted databases.

Cosmo's data store is opaque to me as an admin, so when I think of "back
up Cosmo" I think "make a tarball of the $ROOT/data directory".  For
Foxcloud, this goes pretty quickly (2-3 minutes) since there aren't a lot
of files to back up.  For cosmo-demo, a backup takes up to, say, 15
minutes for just a few dozen active calendar users.
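
As a minimal sketch of the "tarball of the data directory" approach, the
Python below writes a timestamped, compressed archive.  The function name
`backup_data_dir`, the archive naming scheme, and the destination
directory are illustrative assumptions, not anything Cosmo provides; the
destination is assumed to live outside the data directory.

```python
import tarfile
import time
from pathlib import Path


def backup_data_dir(data_dir, dest_dir):
    """Write a timestamped, gzip-compressed tarball of data_dir into dest_dir.

    Hypothetical helper: the name and archive naming scheme are assumptions.
    dest_dir is assumed to be outside data_dir.
    """
    data_dir = Path(data_dir)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"cosmo-data-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the parent so the archive unpacks as "data/...".
        tar.add(data_dir, arcname=data_dir.name)
    return archive
```

This copies files one after another while the server keeps running, so
it has exactly the timing problem described next.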

During that 15 minutes, there will likely be some changes to the "data"
directory.  This means that some files in the tarball will be from the
"start" of the backup and some from the "end".

Backups taken this way will be corrupted, for some definition of
corrupted.  On a plain filesystem, admins generally accept this,
because you probably just corrupted 1 or 2 files out of thousands and
you can probably find those couple of files in another backup.  But Cosmo
has indices, blobs, etc., so I'm concerned that an instance restored from
a backup taken this way may not even start, leading to total data loss.
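
One way to at least detect that the data directory changed while the
tarball was being written is to record each file's size and modification
time before and after the backup and compare the two.  This is only a
sketch: `snapshot` and `changed_between` are hypothetical helper names,
and an empty result says nothing about whether Cosmo's indices and blobs
are mutually consistent.

```python
import os


def snapshot(data_dir):
    """Map each file's path (relative to data_dir) to (size, mtime in ns)."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(data_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except FileNotFoundError:
                continue  # file vanished between listing and stat
            state[os.path.relpath(path, data_dir)] = (st.st_size, st.st_mtime_ns)
    return state


def changed_between(before, after):
    """Return sorted paths added, removed, or modified between two snapshots."""
    paths = before.keys() | after.keys()
    return sorted(p for p in paths if before.get(p) != after.get(p))
```

An admin could take a snapshot, write the tarball, take a second
snapshot, and retry or flag the backup if `changed_between` returns
anything.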

So, what's an admin to do?  A couple of thoughts:

- The "enterprise-level" solution is to update one's server to be able
to support "snapshots" which freeze the filesystem in time so you can do
a consistent backup.  This could work for us as we have some time and
resources, but it's a major undertaking.  In 5 years, this may be
standard and available almost everywhere, but today it's a bit beyond
most small installations of Cosmo.

- It'd be great to have a 0.3 RAS (reliability, availability,
serviceability) feature to run some sort of "consistency check" across
all objects stored in the data directory.  This would let me take a
backup and then at least check if it's a "good" one.  There will be
corruption situations uncaught by any such procedure, though, and this
only applies if Cosmo can actually start with a given backup.

- If one can do a full backup and restore via HTTP, then backing up
that way would generally get a non-corrupted backup (though possibly not
fully consistent; "most" objects would be usable if not all frozen at
the same point in time).

- An admin could take down a server to run backups, but nobody likes
doing that.  If I were running an "active/failover" configuration, I
could take down one of the boxes and back up just that one.  But we're a
ways away from having a dedicated failover box, and that'll be beyond
most small installations of Cosmo as well.
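
The "take the server down" option from the list above can be sketched as
a stop / tar / start sequence.  The stop and start commands are passed in
by the caller because the right commands depend on how a given Cosmo
instance is run; `cold_backup` and its parameters are hypothetical names,
not part of Cosmo.

```python
import subprocess
import tarfile


def cold_backup(data_dir, archive_path, stop_cmd, start_cmd):
    """Stop the server, tar data_dir, and always try to start the server again.

    stop_cmd and start_cmd are argument lists supplied by the admin
    (assumptions; no particular Cosmo commands are implied).
    """
    subprocess.run(stop_cmd, check=True)  # raises if the stop command fails
    try:
        with tarfile.open(archive_path, "w:gz") as tar:
            tar.add(data_dir, arcname="data")
    finally:
        # Restart even if the backup failed, to keep downtime short.
        subprocess.run(start_cmd, check=True)
    return archive_path
```

The downtime is roughly the time the tarball takes to write, which is
the cost that makes this option unpopular.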

I'm going to open IT tickets for us to back up the production instances
of Cosmo on our demo box (cosmo-demo and foxcloud), but I worry that
such backups may not allow us to recover if we ever need them.
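
Short of a Cosmo-level consistency check, one cheap test is to confirm
that the tarball itself can be read end to end.  The sketch below (with
the hypothetical name `verify_tarball`) reads every file in the archive
and lets any exception from a truncated or damaged archive propagate.  It
only shows that the archive is readable, not that Cosmo could start from
its contents.

```python
import tarfile


def verify_tarball(archive_path):
    """Read every member of the archive; return the number of regular files read."""
    count = 0
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar:
            if member.isfile():
                f = tar.extractfile(member)
                # Read in 1 MiB chunks so large blobs don't need to fit in memory.
                while f.read(1024 * 1024):
                    pass
                count += 1
    return count
```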

I'm seeking suggestions for non-snapshot ways to back up a Cosmo instance
reliably.  If there's no good way to do that, I suggest we determine
some features that will help us do that and put those on a roadmap.
Failing good feature suggestions, perhaps this email and any
subsequent responses might become a canonical "how to back up Cosmo"
reference.

Thanks for considering the issue.
