[Cosmo-dev] Metrics logging

Jared Rhine jared at wordzoo.com
Thu Mar 22 10:59:22 PST 2007


Summary
-------
HTTP access logs are no longer good enough as a foundation for metrics; 
we need a new system.

Background
----------
There's a consensus that Chandler Hub metrics is an important feature 
for Preview.  Having good usage statistics will help us quantify how 
many people are using Chandler Project software, what kinds of 
operations they are doing frequently, and how fast we're growing.

For a long time, the plan has been to go with a simple system: analysis 
of HTTP access logs (in [Extended] Common Log Format).  There's 
lots of useful information in there and it's pretty granular.

But there's been a sticking point: because event updates through the web 
UI use POST to JSON-RPC, it's not possible via the HTTP access logs to 
count these updates.  That's essentially the only metric I can't 
calculate via access logs, but it's important enough for Preview that an 
updated plan is needed.

Options
-------
I can see two ways forward:

1) Extend common log format to append the JSON-RPC method to the end of 
each line as a new field.

2) Create an entirely new user-transaction log.

Are there other options?

A new user-transaction log would record user actions in a simple format, 
with some simple context.  There would probably still be a series of 
scripts which summarize, analyze, and boil down the raw data into 
useful metrics.

If we go with a new user-transaction log, there are two likely implementations:

2a) a new log4j channel, outputting to a new datestamped-rotated file

2b) a new database table, managed either with a log4j channel or a 
hibernate-backed singleton class

Either way, the base data will probably include a few 
common fields: timestamp, principal (username or ticket or 
bookmarkable-URL representation), operation (DAV GET/PUT, ATOM 
read/write, MC op, JSON-RPC method, etc, etc), user-agent, 
operation-specific payload summary (collection or resource operated 
upon, etc).  I know how to slice-and-dice HTTP access logs any number of 
ways; with a new log format, we'd likely need to iterate a few times 
before we've got all the needed bits in place.
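
As a rough sketch of what one record could look like in a file-based 
log, here is a small Python model of the fields listed above.  The class 
name, field names, and tab-separated layout are all assumptions for 
illustration, not an agreed format:

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TransactionRecord:
    """One user action. All names here are hypothetical, not Cosmo's."""
    timestamp: datetime
    principal: str    # username, ticket, or bookmarkable-URL id
    operation: str    # e.g. "DAV PUT", "ATOM read", a JSON-RPC method
    user_agent: str
    payload: str      # collection or resource operated upon

    def to_line(self) -> str:
        """Serialize as one tab-separated line.

        Tabs and newlines inside values are replaced with spaces so that
        every record stays on exactly one line with five fields.
        """
        fields = [self.timestamp.isoformat(), self.principal,
                  self.operation, self.user_agent, self.payload]
        return "\t".join(
            f.replace("\t", " ").replace("\n", " ") for f in fields
        )

    @classmethod
    def from_line(cls, line: str) -> "TransactionRecord":
        """Parse a line produced by to_line()."""
        ts, principal, operation, user_agent, payload = (
            line.rstrip("\n").split("\t")
        )
        return cls(datetime.fromisoformat(ts), principal, operation,
                   user_agent, payload)
```

One line per record keeps the file friendly to the usual line-oriented 
tools, and the same five fields would map directly onto table columns if 
we go the database route.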

I'm definitely focused on trends of transactions right now; nothing in 
this email addresses runtime behavior (number of people logged in "right 
now", garbage collection behavior, etc).

Pros and cons
-------------

Right now, I'm really only missing the JSON-RPC method from the access 
logs; it'd be pretty simple to tack on the end of the existing file; 
then I could proceed, and we could implement some new logging framework 
post-Preview.
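
For illustration, this is roughly how an analysis script could read such 
a line, assuming the method is appended as one more quoted field after 
the user-agent in combined log format.  The regular expression, the 
field position, and the use of "-" for "no method" are assumptions, not 
what Cosmo emits today:

```python
import re
from collections import Counter

# Combined log format plus one optional trailing quoted field.
# Assumption: the JSON-RPC method is the last field on the line, and
# quoted values contain no embedded double quotes.
LINE_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
    r'(?: "(?P<rpc_method>[^"]*)")?\s*$'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    # Treat a missing field or a "-" placeholder as "no JSON-RPC call".
    if fields["rpc_method"] in (None, "-"):
        fields["rpc_method"] = None
    return fields


def count_rpc_methods(lines):
    """Count how often each JSON-RPC method appears in the given lines."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if fields and fields["rpc_method"]:
            counts[fields["rpc_method"]] += 1
    return counts
```

Because the new field is optional in the pattern, old log files without 
it would still parse, which matters for any metric that spans the 
changeover.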

To support a new log, we'll have to go through the code and drop logging 
calls in a wide variety of places.  The locations, contents, and log 
file format will all probably require a few rounds of iteration.

Database tables are cool, but a minor amount of extra infrastructure 
will be needed to prune these, move them offsite, and reconstruct them. 
(CPU-intensive analysis should be done somewhere besides the production 
box.)

Files are generally nicely buffered (no commit after every record) and 
have a somewhat lower impact.

Database storage lets us calculate some metrics directly from SQL 
instead of writing a script for each.
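
A small sketch of what "directly from SQL" could mean, using an 
in-memory SQLite database purely as a stand-in.  The table name, column 
names, and the choice of which operations count as "updates" are all 
hypothetical; the real schema and database would be whatever Cosmo 
ends up with:

```python
import sqlite3


def demo_connection():
    """Build an in-memory table with a few made-up transaction rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_transaction ("
        " ts TEXT, principal TEXT, operation TEXT,"
        " user_agent TEXT, payload TEXT)"
    )
    conn.executemany(
        "INSERT INTO user_transaction VALUES (?, ?, ?, ?, ?)",
        [
            ("2007-03-21 09:00:00", "alice", "DAV PUT", "ua", "/c1"),
            ("2007-03-21 09:05:00", "alice", "DAV GET", "ua", "/c1"),
            ("2007-03-22 10:00:00", "bob", "ATOM write", "ua", "/c2"),
            ("2007-03-22 11:00:00", "alice", "DAV PUT", "ua", "/c1"),
        ],
    )
    return conn


def updates_per_day(conn):
    """Count write operations per calendar day with a single query.

    Which operations count as "updates" is an assumption made here.
    """
    return conn.execute(
        "SELECT date(ts) AS day, COUNT(*)"
        " FROM user_transaction"
        " WHERE operation IN ('DAV PUT', 'ATOM write')"
        " GROUP BY day ORDER BY day"
    ).fetchall()
```

With the sample rows above, updates_per_day(demo_connection()) yields 
one update on 2007-03-21 and two on 2007-03-22; the read (DAV GET) is 
filtered out by the WHERE clause.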

We will probably need to futz with any brand-new log format over a few 
iterations.  We'll also need to drop logging calls in a wide variety of 
places in the code.

I can probably contrib an access-log hack or a new log4j file channel 
myself since they use existing patterns.  A database-backed channel, 
unless it's just a log4j destination, will require Cosmo dev time.

If we develop a new system, I can't really dive into metrics rewriting 
until we have that new code in production.

My leaning
----------
I like the solve-the-immediate-problem hackishness of adding the 
JSON-RPC method to the access logs.  But reading HTTP access logs is 
getting harder and harder and at some point will be such a pain that a 
custom-designed, Cosmo-maintained user-transaction log will be a large 
benefit.  I like files, but putting this log in a database is probably 
more modular, though pruning tools will of course be needed.

Overall, I guess we should dive into a database-table backed, 
custom-designed transaction log and litter the code with transactional 
notes about what's happening inside Cosmo.  And do it before Preview.

Example metrics needed
----------------------

- Total number of unique visitors per week
- Unique visitors per week by use-case category:
   - Chandler Desktop users (id by HTTP user-agent)
   - Casual collaborators (id by web UI use via bookmarkable URL)
   - Consultative users (id by web UI use via account)
   - Standalone users (id by web UI authenticated usage who didn't also 
     access via Chandler user-agent)
   - Interop users (id by user-agent not being the web UI or Chandler 
     Desktop)
- Retention (number of users still using the system 30/60/90/180/360 
  days after initial signup)
- Number of signups this week
- Average synchronization frequency
- Average number of updates (by MC, by web UI, and by DAV)
- Breakdown of usage by user-agent
- Number of 500 errors encountered
- Total number of signups per day
- Number of failed signups per day
- Number of web UI sessions per day
- Total HTTP transactions per day
(A "per week" metric should be calculated over a rolling 7-day window.)
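
To make the rolling-window idea concrete, here is a minimal sketch of 
"unique visitors per week" computed over the 7 days ending on a given 
day.  The function name and the (date, principal) input shape are 
assumptions; the real input would come from whichever log we settle on:

```python
from datetime import date, timedelta


def rolling_unique_visitors(visits, end_day, window_days=7):
    """Count distinct principals seen in the window ending on end_day.

    visits is an iterable of (date, principal) pairs.  The window is
    inclusive on both ends, so the default covers end_day and the six
    days before it.
    """
    start_day = end_day - timedelta(days=window_days - 1)
    return len({who for day, who in visits if start_day <= day <= end_day})
```

Sliding end_day forward one day at a time gives a daily series of 
"per week" numbers, instead of one number per calendar week.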

