[Design] HCI, Multi-channel interfaces (e.g. voice dictation), library idea...
Mike C. Fletcher
Sat, 26 Oct 2002 21:14:05 -0400
One of the most useful features of the Agenda application was the
ability to parse free-form text into "commands" within the application
(for instance, adding a "date" tag to an object). I would suggest that
this particular feature (or an extension thereof, described below) be
adapted to the task of making multichannel human-computer interfaces
more robust and intuitive. This library would be a generally useful
tool for designing interfaces. Hopefully it would be released under
either the LGPL or a BSD license to allow other projects to freely incorporate it.
Basically, the stream of text from the voice dictation software can be
seen as either content or command. Standard voice recognition software
tends to have two distinct modes, and processes a very small subset of
explicit commands within any given application. The Agenda recognition
approach (or similar) could be applied to the stream of text to
multiplex the datastream according to the current context.
Define for a moment a "Context" object which represents the current
context of a particular application. It would include support for managing:
    focus (put that there),
    collections (all of those),
    relevant/statistically likely commands/actions [implies storage
    of statistics],
    relevant/statistically important/present nouns/objects [implies
    storage of statistics].
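
To make the idea concrete, here is a minimal sketch of what such a Context object might look like. Every name in it (Context, record, command_likelihood) is hypothetical, and the "statistics" are nothing more than usage counts; a real library would want a proper probability model.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Context:
    """Hypothetical sketch of the Context object described above."""
    name: str
    focus: Optional[Any] = None                       # "put that there"
    collections: dict = field(default_factory=dict)   # "all of those"
    command_counts: Counter = field(default_factory=Counter)
    noun_counts: Counter = field(default_factory=Counter)

    def record(self, command: str, noun: str) -> None:
        """Store usage statistics for a recognized command/noun pair."""
        self.command_counts[command] += 1
        self.noun_counts[noun] += 1

    def command_likelihood(self, command: str) -> float:
        """Relative frequency of a command in this context (0.0 if unseen)."""
        total = sum(self.command_counts.values())
        return self.command_counts[command] / total if total else 0.0


# Example: a "reading" context that has mostly seen the command "next".
reading = Context(name="reading")
reading.record("next", "message")
reading.record("next", "message")
reading.record("delete", "message")
```
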
As far as I understand the workings of the original Agenda application,
there was a particular single context defined, which had simple
(possibly hard coded) triggers for recognizing particular features in
the input text. This allowed it to recognize names (obviously not
hard-coded), dates (possibly hard-coded), and simple nouns (meeting,
lunch, call, etc.).
With the considerable advances in computer hardware, it's quite likely
that we can create a robust and configurable reasoning engine library
which, given a stream of text, can make callbacks to registered
functions when particular "events" are discovered. For instance, text
in the stream which would seem to suggest a command from the context on
a particular object in the context would call the command's
implementation with the object (and likely the context). Simple textual
dictation (not recognized as a likely command) would be passed to the
current focus of the context (which might very well be a placeholder
which is only presented if the text is received without a current focus).
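
The callback arrangement described above could be sketched roughly as follows. This is an illustration only: the Engine class and its methods are invented names, and matching on the first word of an utterance is a crude stand-in for the real recognition step.

```python
class Engine:
    """Hypothetical dispatcher: routes each utterance to a registered
    command callback, or to the current focus as plain dictation."""

    def __init__(self):
        self.commands = {}     # trigger word -> callback(target, context)
        self.focus_text = []   # stand-in for the focused object's text

    def register(self, trigger, callback):
        self.commands[trigger] = callback

    def feed(self, utterance, context=None):
        words = utterance.split()
        if words and words[0].lower() in self.commands:
            # Looks like a command: call its implementation with the
            # target object (here just the remaining text) and the context.
            target = " ".join(words[1:])
            return self.commands[words[0].lower()](target, context)
        # Not recognized as a command: pass the text to the current focus.
        self.focus_text.append(utterance)
        return None


engine = Engine()
deleted = []
engine.register("delete", lambda target, ctx: deleted.append(target))
engine.feed("delete the draft")
engine.feed("Dear Emily, thanks for the notes.")
```

After these two calls, the first utterance has reached the "delete" callback and the second has been appended to the focus as dictated text.
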
The rationale behind a Context object is the need to simplify the
recognition process and dramatically increase its accuracy. As was
noted by Ilan Volow earlier today, voice dictation tends to be
cumbersome because the grammar/commands tend to have far too large a
granularity. In situations where "reading" is going on, saying "next"
has a high probability of meaning "show me the next message", while in
the same application's dictation mode, the word "next" would have almost
no probability of meaning "show me the next message". Allowing the
application to define large numbers of active contexts and switch
between them should allow for fairly robust alternative-channel human
interfaces.
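
As a rough illustration of the "next" example, the same word can be looked up against per-context probabilities. The numbers and the threshold below are invented for the sketch, not measured values.

```python
# Hypothetical per-context probabilities that a spoken word is a command.
COMMAND_PROBABILITY = {
    "reading":   {"next": 0.95, "delete": 0.90},
    "dictation": {"next": 0.02, "delete": 0.05},
}
THRESHOLD = 0.5  # assumed cut-off between "command" and "content"


def is_command(word, context_name):
    """Decide whether a word should be treated as a command in a context."""
    probability = COMMAND_PROBABILITY.get(context_name, {}).get(word, 0.0)
    return probability >= THRESHOLD
```

With these tables, "next" is treated as a command in the "reading" context and as ordinary text in the "dictation" context.
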
Implicit in the design of this library is the need to correct command
misrecognitions quickly and reliably. In other words, a
command which allows for correcting a recognition (including undoing
actions which were not intended) becomes important as it allows the
library to adjust statistical models and keep the barrier to trust of
the system low.
For instance:
delete James <pause>
no, just the email from James
You would want the HCI module to parse the first recognition and report
the results, e.g. "Command: delete user, Target: James Jones,
confirmation-required". (This assumes the context somehow has both James
and a message from James, with the probability models suggesting that
"James" refers to the person rather than the e-mail for this command.)
It would then call the command's action (with the
application marking the user for potential deletion, adding the command
to a list of "confirmation required" commands etc.). Preferably, at
this point, the application would pass back a token for the command
which would allow, for instance, undoing the command (if that is still
possible). When the correction command is recognized, the HCI module
would attempt to undo the previous command, and re-enter the command
based on the revised parse.
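
A minimal sketch of the token idea, assuming the application hands back an object that knows how to undo its own command. UndoToken, delete and correct are hypothetical names, and "marking for deletion" is reduced to membership in a set.

```python
class UndoToken:
    """Hypothetical token the application returns for a performed command."""

    def __init__(self, description, undo):
        self.description = description
        self._undo = undo
        self.undone = False

    def undo(self):
        # Undo at most once; a real application might refuse if too late.
        if not self.undone:
            self._undo()
            self.undone = True


marked = set()  # things the application has marked for potential deletion


def delete(target):
    """Mark a target for deletion and return a token that can undo it."""
    marked.add(target)
    return UndoToken("delete " + target, lambda: marked.discard(target))


def correct(token, revised_target):
    """Undo the previous command, then re-enter it with the revised parse."""
    token.undo()
    return delete(revised_target)


# "delete James <pause>" followed by "no, just the email from James"
token = delete("user: James Jones")
token = correct(token, "email from James")
```
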
A particular class of mis-recognitions should also be kept in mind: the
emergent command, where what appears to be simple text (or smaller
commands) becomes, with the addition of more text, a different command
or text body. For instance:
computer, take a note <pause>
<dictates a few lines>
send to everyone on the project except Emily <pause>
Although this particular example is rather uninspiring, it should serve
to illustrate the emergent character of the commands: what originally
appeared to be a "note" (assuming that is different from an e-mail,
though that seems unlikely under Chandler) becomes, with the recognition
of later text, a communication to everyone on the project.
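
One way to picture the emergent case is an object whose kind is only provisional until later text arrives. The sketch below is an assumption-laden toy: the Session class is invented, and literal string matching stands in for real recognition.

```python
class Session:
    """Hypothetical holder for an item whose type may change as text arrives."""

    def __init__(self):
        self.kind = None        # provisional interpretation of the item
        self.body = []          # dictated lines
        self.recipients = None  # only set once the item becomes a message

    def feed(self, utterance):
        text = utterance.lower()
        if "take a note" in text:
            self.kind = "note"
        elif text.startswith("send to ") and self.kind == "note":
            # Later text reinterprets the earlier "note" as a message.
            self.kind = "message"
            self.recipients = utterance[len("send to "):]
        else:
            self.body.append(utterance)


session = Session()
session.feed("computer, take a note")
session.feed("Remember to book the room.")
session.feed("send to everyone on the project except Emily")
```
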
I would think we can leave the lower-level mis-recognitions
(mis-spellings or voice-mis-recognitions) to either the spelling or
voice-dictation system.
Internationalization note: I'm not sure if Agenda was ever ported to
multiple languages, but the principles should be the same regardless of
natural language. Basically, each command needs to have a translation
created, and each object type (e-mail, note, appointment, meeting, etc.)
also needs to be translated. To be robust, we would likely want to
include pared-down thesaurus entries for each word in the application
for each language.
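
The per-language tables might be as simple as the following sketch. The table layout, the function name and the particular synonyms are all assumptions made for illustration; they map spoken words in each language onto one canonical command name.

```python
# Hypothetical trigger tables: language -> canonical command -> synonyms
# (the "pared-down thesaurus entries" for each word).
TRIGGERS = {
    "en": {"delete": ["delete", "remove", "erase"], "send": ["send", "mail"]},
    "fr": {"delete": ["supprimer", "effacer"], "send": ["envoyer"]},
}


def canonical_command(word, language):
    """Map a spoken word to its canonical command, or None if unknown."""
    for command, synonyms in TRIGGERS.get(language, {}).items():
        if word.lower() in synonyms:
            return command
    return None
```

The rest of the engine would then work only with canonical names, so adding a language means adding a table rather than touching the commands themselves.
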
Enjoy all,
Mike
_______________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://members.rogers.com/mcfletch/