Open Source Applications Foundation

[Design] HCI, Multi-channel interfaces (e.g. voice dictation), library idea...

Mike C. Fletcher Sat, 26 Oct 2002 21:14:05 -0400


  One of the most useful features of the Agenda application was the 
ability to parse free-form text into "commands" within the application 
(for instance, adding a "date" tag to an object).  I would suggest that 
this particular feature (or an extension thereof, described below) be 
adapted to the task of making multichannel human-computer interfaces 
more robust and intuitive.  The resulting library would be a generally 
useful tool for designing interfaces.  Hopefully it would be released 
under either the LGPL or a BSD license to allow other projects to freely 
incorporate it.

Basically, the stream of text from the voice dictation software can be 
seen as either content or command.  Standard voice recognition software 
tends to have two distinct modes (dictation and command), and processes 
only a very small subset of explicit commands within any given 
application.  The Agenda recognition approach (or similar) could be 
applied to the stream of text to split it into content and commands 
according to the current context.

    Define for a moment a "Context" object which represents the current 
    context of a particular application.  It would include support for 
    managing:
        focus ("put that there"),
        collections ("all of those"),
        relevant/statistically likely commands/actions [implies storage 
        of statistics],
        relevant/statistically important/present nouns/objects [implies 
        storage of statistics].
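
To make the Context idea a little more concrete, here is a minimal Python 
sketch.  Every name in it (Context, record, command_likelihood and the 
field names) is my own assumption for illustration; nothing here is an 
existing API, and the "statistics" are just simple counts.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Context:
    """Hypothetical sketch of the Context object described above."""
    name: str
    focus: Optional[Any] = None                      # target of "put that there"
    collection: list = field(default_factory=list)   # target of "all of those"
    command_counts: Counter = field(default_factory=Counter)  # command statistics
    noun_counts: Counter = field(default_factory=Counter)     # noun/object statistics

    def record(self, command: str, noun: str) -> None:
        """Update the stored statistics after a confirmed recognition."""
        self.command_counts[command] += 1
        self.noun_counts[noun] += 1

    def command_likelihood(self, command: str) -> float:
        """Relative frequency of a command in this context (0.0 if unseen)."""
        total = sum(self.command_counts.values())
        return self.command_counts[command] / total if total else 0.0


# Example: a "reading" context that has seen "next" twice and "delete" once.
reading = Context(name="reading")
reading.record("next", "message")
reading.record("next", "message")
reading.record("delete", "message")
```

A real library would presumably persist these counts between sessions and 
use something better than raw relative frequency.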

As far as I understand the workings of the original Agenda application, 
there was a particular single context defined, which had simple 
(possibly hard coded) triggers for recognizing particular features in 
the input text.  This allowed it to recognize names (obviously not 
hard-coded), dates (possibly hard-coded), and simple nouns (meeting, 
lunch, call, etc.).

With the considerable advances in computer hardware, it's quite likely 
that we can create a robust and configurable reasoning engine library 
which, given a stream of text, can make callbacks to registered 
functions when particular "events" are discovered.  For instance, text 
in the stream that appears to invoke one of the context's commands on a 
particular object in the context would trigger a call to the command's 
implementation with the object (and likely the context).  Simple textual 
dictation (not recognized as a likely command) would be passed to the 
current focus of the context (which might very well be a placeholder 
that is presented only if text arrives when nothing has focus).
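
The callback arrangement above could be sketched roughly as follows.  This 
is only an illustration under heavy assumptions: the Engine class and its 
methods are hypothetical, and matching on the first word of an utterance 
is a crude stand-in for the reasoning engine, which would weigh 
probabilities from the context instead.

```python
from typing import Callable, Dict, List, Optional


class Engine:
    """Hypothetical dispatcher: routes text to command callbacks or to the focus."""

    def __init__(self) -> None:
        self.commands: Dict[str, Callable[[str, "Engine"], None]] = {}
        self.focus: Optional[List[str]] = None   # current dictation target
        self.placeholder: List[str] = []         # used only when nothing has focus

    def register(self, verb: str, callback: Callable[[str, "Engine"], None]) -> None:
        """Register a callback for a command verb."""
        self.commands[verb.lower()] = callback

    def feed(self, utterance: str) -> str:
        """Dispatch one utterance; return "command" or "content"."""
        words = utterance.split()
        verb = words[0].lower() if words else ""
        if verb in self.commands:
            # Call the command's implementation with the object and the engine.
            self.commands[verb](" ".join(words[1:]), self)
            return "command"
        # Plain dictation goes to the focus, or to the placeholder if none.
        target = self.focus if self.focus is not None else self.placeholder
        target.append(utterance)
        return "content"


deleted = []
engine = Engine()
engine.register("delete", lambda obj, eng: deleted.append(obj))
engine.feed("delete the draft")        # handled as a command
engine.feed("Dear project team")       # falls through to the placeholder
```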

The rationale behind a Context object is the need to simplify the 
recognition process and dramatically increase its accuracy.  As was 
noted by Ilan Volow earlier today, voice dictation tends to be 
cumbersome because the grammar/commands tend to have far too large a 
granularity.  In situations where "reading" is going on, saying "next" 
has a high probability of meaning "show me the next message", while in 
the same application's dictation mode, the word "next" would have almost 
no probability of meaning "show me the next message".  Allowing the 
application to define large numbers of active contexts and switch 
between them should allow for fairly robust alternative-channel human 
interfaces.
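
The "next" example can be sketched as a per-context lookup.  The 
probabilities and the threshold below are invented purely for 
illustration, and the table layout is an assumption of mine rather than a 
proposed design.

```python
# Hypothetical per-context command probabilities; the numbers are invented
# for illustration and are not measurements.
COMMAND_PROBABILITY = {
    "reading":   {"next": 0.95, "delete": 0.60},
    "dictation": {"next": 0.02, "delete": 0.05},
}

THRESHOLD = 0.5  # assumed cut-off between "command" and "content"


def classify(word: str, context: str) -> str:
    """Return "command" or "content" for a word in the named context."""
    probability = COMMAND_PROBABILITY.get(context, {}).get(word.lower(), 0.0)
    return "command" if probability >= THRESHOLD else "content"
```

With these numbers, classify("next", "reading") is treated as a command 
while classify("next", "dictation") is treated as content, which is the 
whole point of switching contexts.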

This library implicitly needs a way to correct command misrecognitions 
quickly and reliably.  In other words, a command for correcting a 
recognition (including undoing actions which were not intended) becomes 
important, as it lets the library adjust its statistical models and 
keeps the barrier to trusting the system low.

 For instance:
    delete James <pause>
    no, just the email from James

You would want the HCI module to parse the first recognition and report 
the results, e.g. "Command: delete user, Target: James Jones, 
confirmation-required".  (This assumes the context somehow has both 
James and a message from James, with the probability models suggesting 
that "James" refers to the person rather than the e-mail for this 
command.)  It would then call the command's action (with the 
application marking the user for potential deletion, adding the command 
to a list of "confirmation required" commands, etc.).  Preferably, at 
this point, the application would pass back a token for the command 
which would allow, for instance, undoing the command (if that is still 
possible).  When the correction command is recognized, the HCI module 
would attempt to undo the previous command, and re-enter the command 
based on the revised parse.
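
A rough sketch of the undo-token idea, using the "delete James" exchange.  
Mailbox and History are hypothetical names, the token is simply a 
callable that reverses the action, and the confirmation-required step is 
left out to keep the example short.

```python
from typing import Callable, List, Set

UndoToken = Callable[[], None]


class Mailbox:
    """Toy application state used only for this illustration."""

    def __init__(self) -> None:
        self.users: Set[str] = {"James Jones"}
        self.messages: Set[str] = {"email from James"}

    def delete_user(self, name: str) -> UndoToken:
        self.users.discard(name)
        return lambda: self.users.add(name)        # token that undoes the delete

    def delete_message(self, title: str) -> UndoToken:
        self.messages.discard(title)
        return lambda: self.messages.add(title)    # token that undoes the delete


class History:
    """Keeps undo tokens so a correction can roll back the last command."""

    def __init__(self) -> None:
        self.tokens: List[UndoToken] = []

    def run(self, action: Callable[[], UndoToken]) -> None:
        """Execute a command and remember the token it passes back."""
        self.tokens.append(action())

    def correct(self, revised: Callable[[], UndoToken]) -> None:
        """Undo the previous command, then re-enter the revised one."""
        if self.tokens:
            self.tokens.pop()()
        self.run(revised)


box, history = Mailbox(), History()
history.run(lambda: box.delete_user("James Jones"))               # "delete James"
history.correct(lambda: box.delete_message("email from James"))   # "no, just the email..."
```

After the correction the user is restored and only the message is gone.  
A real implementation would also feed the correction back into the 
statistical models.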

A particular class of mis-recognitions should also be kept in mind: the 
emergent command, where what appears to be simple text (or smaller 
commands) becomes, with the addition of more text, a different command 
or text body.  For instance:

    computer, take a note <pause>
    <dictates a few lines>
    send to everyone on the project except Emily<pause>

Although this particular example is rather uninspiring, it should serve 
to illustrate the emergent character of the commands: what originally 
appeared to be a "note" (assuming that is different from an e-mail, 
though that seems unlikely under Chandler) becomes, with the recognition 
of later text, a communication to everyone on the project.
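
One way to picture the emergent command is to keep re-evaluating what an 
item "is" as each utterance arrives.  The sketch below is hypothetical 
throughout (Item, interpret, and the fixed phrase matching are my 
assumptions), and the dictated line is made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    kind: str = "note"                               # initial interpretation
    body: List[str] = field(default_factory=list)
    recipients: str = ""


def interpret(utterances: List[str]) -> Item:
    """Re-evaluate the item's kind as each new utterance arrives."""
    item = Item()
    for text in utterances:
        lowered = text.lower()
        if lowered.startswith("computer, take a note"):
            continue                                 # opening command, no body text
        if lowered.startswith("send to "):
            item.kind = "message"                    # the note becomes a message
            item.recipients = text[len("send to "):]
        else:
            item.body.append(text)                   # plain dictation
    return item


result = interpret([
    "computer, take a note",
    "The build is failing on the release branch.",
    "send to everyone on the project except Emily",
])
```

Here result starts life as a note and ends up as a message addressed to 
"everyone on the project except Emily", with the dictated text as its body.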

I would think we can leave the lower-level mis-recognitions 
(mis-spellings or voice mis-recognitions) to either the spelling or 
voice-dictation system.

Internationalization note: I'm not sure if Agenda was ever ported to 
multiple languages, but the principles should be the same regardless of 
natural language.  Basically, each command needs to have a translation 
created, and each object type (e-mail, note, appointment, meeting, etc.) 
also needs to be translated.  To be robust, we would likely want to 
include pared-down thesaurus entries for each word in the application 
for each language.
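
As a sketch of what the per-language tables might look like: each spoken 
word or thesaurus synonym maps to a canonical identifier that the rest of 
the library works with.  The table layout, the identifiers and the word 
lists are all assumptions for illustration.

```python
from typing import Optional

# Hypothetical per-language vocabulary: each word or synonym maps to a
# canonical command or object-type identifier. The entries are illustrative.
VOCABULARY = {
    "en": {"delete": "DELETE", "remove": "DELETE", "erase": "DELETE",
           "e-mail": "EMAIL", "message": "EMAIL", "meeting": "MEETING"},
    "fr": {"supprimer": "DELETE", "effacer": "DELETE",
           "courriel": "EMAIL", "message": "EMAIL", "réunion": "MEETING"},
}


def canonical(word: str, language: str) -> Optional[str]:
    """Return the canonical identifier for a word, or None if unknown."""
    return VOCABULARY.get(language, {}).get(word.lower())
```

So canonical("remove", "en") and canonical("effacer", "fr") both resolve 
to the same DELETE command, and the recognition logic never needs to know 
which language was spoken.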

Enjoy all,
Mike

_______________________________________
  Mike C. Fletcher
  Designer, VR Plumber, Coder
  http://members.rogers.com/mcfletch/