Monday, September 1, 2008

Specifying Gestures by Example

-Dean Rubine

COMMENTS

1. Comment on Andrew's blog

SUMMARY

This paper describes GRANDMA (Gesture Recognizers Automated in a Novel Direct Manipulation Architecture), a toolkit for rapidly adding gestures to direct-manipulation interfaces, together with a trainable single-stroke gesture recognizer.
The paper begins by describing the historical efforts towards gesture recognition and current relevant research. A common feature among most of these systems is that the gesture recognizer is hand-coded, making these systems difficult to create, maintain and modify. GRANDMA is different in the sense that it allows designers to create gesture recognizers automatically from example gestures. These recognizers can be rapidly trained from a small number of examples of each gesture.
Next, GDP, a gesture-based drawing program built using GRANDMA, is described. The author gives a step-by-step example of how users can enter gestures to draw and manipulate shapes in the GDP interface. Each GDP gesture corresponds to a high-level operation: the class of the gesture determines the operation, while attributes of the gesture determine its operands (scope) as well as additional parameters. It is stressed that GRANDMA allows designers to create recognizers for single-stroke gestures only, a deliberate limitation.
The author describes how a click-and-drag interface built using GRANDMA can be used by a gesture designer to modify the way input is handled. The gesture designer must determine which of the view classes are to have associated gestures and design a set of intuitive gestures for them. The two GDP view classes are described: a GdpTopView object refers to the window in which GDP runs, while a GraphicObjectView object is either a line, a rectangle, an ellipse, text, or a set of these. GRANDMA is an MVC-like system in which a single event handler is associated with a view class. The designer can add gestures by creating a new gesture handler and associating it with the GraphicObjectView class, and can then train the handler by providing it with example gestures; it is claimed that 15 examples per gesture class are adequate. The Semantics button can then be used to initiate editing of the semantics of each gesture in the handler's set. The designer enters an expression for each of the semantic components: RECOG (evaluated when the gesture is recognized), MANIP (evaluated on each subsequent mouse point) and DONE (evaluated when the mouse button is released).
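To make the handler model concrete, here is a minimal sketch in Python of how the three semantic callbacks could be wired up. All class and method names here are my own hypothetical illustrations of the idea, not GRANDMA's actual API (GRANDMA itself is not written in Python).

    # Hypothetical sketch of a GRANDMA-style gesture handler.
    # Names are illustrative, not GRANDMA's actual interface.
    class GestureHandler:
        def __init__(self):
            self.semantics = {}  # gesture class name -> (recog, manip, done)

        def add_gesture(self, name, recog, manip, done):
            # Register the three semantic expressions for a gesture class.
            self.semantics[name] = (recog, manip, done)

        def on_recognized(self, name, gesture):
            recog, _, _ = self.semantics[name]
            return recog(gesture)      # RECOG: evaluated once, on recognition

        def on_mouse_move(self, name, point):
            _, manip, _ = self.semantics[name]
            return manip(point)        # MANIP: evaluated on each later mouse point

        def on_mouse_up(self, name):
            _, _, done = self.semantics[name]
            return done()              # DONE: evaluated when the button is released

    # Example: a hypothetical "delete" gesture whose scope is the view
    # at the gesture's starting point.
    handler = GestureHandler()
    handler.add_gesture(
        "delete",
        recog=lambda g: print("delete view at", g[0]),
        manip=lambda p: None,          # no manipulation phase for delete
        done=lambda: print("delete committed"),
    )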
The next section discusses the low-level recognition of two-dimensional single-stroke gestures, which consists of classifying an input gesture g into one of a set of known gesture classes. Each gesture is an array of P time-stamped sample points. Statistical gesture recognition consists of two steps: first a vector of features is extracted, and then the feature vector is classified into one of the gesture classes using a linear machine. Features are chosen according to the following criteria: each feature should be incrementally computable in constant time per input point; a small change in the input should result in a small change in the feature; each feature should be meaningful; and there should be enough features to differentiate between all the gestures, but for efficiency reasons not too many. In practice GRANDMA uses 13 features, such as the cosine and sine of the initial angle and the length and angle of the bounding-box diagonal. This feature set was determined empirically by the author to work well; in cases where these features fail to discriminate between gestures, additional features can be added.
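For concreteness, here is a small Python sketch of how a few such features can be computed from the sampled points. Only a handful of the 13 features are shown, the function name and structure are my own, and the paper's incremental constant-time-per-point formulation is simplified into a single pass.

    import math

    def features(points):
        # points: list of (x, y) samples. Computes a subset of the
        # paper's 13 features (a sketch, not the full set).
        # The initial angle is taken between the first and third points,
        # which makes it less sensitive to noise at the stroke's start.
        (x0, y0), (x2, y2) = points[0], points[2]
        d = math.hypot(x2 - x0, y2 - y0) or 1e-9
        f1 = (x2 - x0) / d                  # cosine of the initial angle
        f2 = (y2 - y0) / d                  # sine of the initial angle

        xs, ys = zip(*points)
        w, h = max(xs) - min(xs), max(ys) - min(ys)
        f3 = math.hypot(w, h)               # bounding-box diagonal length
        f4 = math.atan2(h, w)               # bounding-box diagonal angle

        f5 = sum(math.hypot(points[i + 1][0] - points[i][0],
                            points[i + 1][1] - points[i][1])
                 for i in range(len(points) - 1))   # total path length
        return [f1, f2, f3, f4, f5]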
Next the mathematics of gesture classification is discussed. Simply put, each gesture class c is associated with a linear evaluation function v_c = w_c0 + Σ_i w_ci·f_i, where the f_i are the features and the w_ci are weights (different for different classes). The classification of a gesture g is the class c for which v_c is maximized. The training problem is to determine these weights from the example gestures; a closed-form solution is preferred over iterative methods for efficiency reasons. A linear classifier will always assign a gesture to one of the C known classes, so the gesture is rejected if the estimated probability that it was classified correctly is less than 0.95. Despite its simplicity, recognizers trained using this algorithm perform quite well.
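As a sketch, classification then reduces to an argmax over the per-class linear evaluations. The weights are assumed to be already trained (the closed-form training step is omitted), and the rejection probability is estimated with a softmax over the per-class scores, which to my understanding matches the paper's rejection approach:

    import math

    def classify(f, weights, reject_threshold=0.95):
        # f: feature vector [f1, ..., fF]
        # weights: dict mapping class name -> [w0, w1, ..., wF]
        # Per-class linear evaluation: v_c = w_c0 + sum_i w_ci * f_i
        v = {c: w[0] + sum(wi * fi for wi, fi in zip(w[1:], f))
             for c, w in weights.items()}
        best = max(v, key=v.get)
        # Estimated probability that `best` is correct:
        # p = 1 / sum_j exp(v_j - v_best); reject ambiguous gestures.
        p = 1.0 / sum(math.exp(vj - v[best]) for vj in v.values())
        return best if p >= reject_threshold else None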
Finally, some possible extensions to this algorithm are discussed. Eager recognition refers to recognizing a gesture as soon as it becomes unambiguous, without waiting for the user to complete it. Multi-finger gesture recognition is another area that could be explored. In the end the author encourages the integration of GRANDMA into other recognition systems.

DISCUSSION

This is the first paper in the series with more or less complete implementation details. It is a good read, and it has given us a chance to do some hands-on work, which I am excited about. Another thing that I liked about the idea presented in this paper is its simplicity: the training algorithm takes just a single pass through all the example gestures to determine the weights, and even then it is very accurate. Something like a neural network might need several iterations to accomplish the same task.

The author claims that one of the criteria in selecting a feature is that it should be meaningful. Since features are only used by the recognizer and are not exposed to the user, I do not think this is a good criterion. A better idea might be a placeholder feature that is computed dynamically at design time, is not necessarily meaningful, and creates maximum variance among the gestures input by the designer. The gestures would thus be more spread out in the feature hyperspace. It is just an idea right now (one possible realization is sketched below), but I am sure a way can be found to implement it. I think this would increase recognition rates, and hence it would have been my future direction of work.
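One way such a placeholder feature could be realized (purely my own sketch, not something from the paper) is to treat the raw per-gesture measurements as candidate features and take the direction of maximum variance across the training examples, i.e. the first principal component:

    import numpy as np

    def placeholder_feature(training_vectors):
        # training_vectors: (N, D) array with one row of raw candidate
        # features per example gesture. Returns a function mapping a raw
        # vector to the automatically derived, variance-maximizing feature.
        X = np.asarray(training_vectors, dtype=float)
        mean = X.mean(axis=0)
        # The first right-singular vector of the centered data is the
        # direction of maximum variance.
        _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
        direction = vt[0]
        return lambda raw: float((np.asarray(raw) - mean) @ direction)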

2 comments:

Anonymous said...

If you did have such a "placeholder feature", it could then be meaningful to the user as well. This one metric could be used as feedback to help the user understand the choice the system made, e.g. a percentage value of its confidence in the classification it made.

However, I'm not sure I grasp how one placeholder or abstraction of the other features would help in creating a more diverse feature space and with recognition under the hood. We'll have to discuss it in class.

Akshay Bhat said...

In response to the above comment:

A feature is essentially a function of the data points of the gesture. Out of the set of all possible functions of the data points (there are infinitely many), a function which attempts to maximize the variance in the given gesture set, i.e. one whose values are most spread out, is arbitrary and need not be meaningful. I am talking about a feature that gets created automatically, not something that is chosen by the designer.