The list of things that computers can understand about humans gets longer every day.
For example, there’s a company called Emotient that offers an API that will scan video, identify the location of one or more faces, then analyze and report on detected emotions in real time. In partnership with iMotions, this ability has been merged with eye tracking. You can, I'm sure, imagine the potential for these technologies to improve how we interact with all sorts of devices (as well as the obvious, and perhaps less exciting, applications in law enforcement and security).
Yet another intriguing “human decoding” technology has been developed by Hamed Pirsiavash, a postdoc at MIT, and his former thesis advisor, Deva Ramanan of the University of California at Irvine. At the forthcoming Conference on Computer Vision and Pattern Recognition, Pirsiavash and Ramanan will present a new visual activity-recognition algorithm that can identify people doing “things.”
What kind of things? Potentially anything that’s a sequence of movements such as exercising, swinging a golf club, or assembling a piece of furniture.
An important attribute of the algorithm is that its processing time scales linearly with the duration of the input video: double the length of the video and you roughly double the time needed to analyze it, so performance is predictable. Another important feature is that the algorithm can make “guesses” about partially completed sequences, assigning a probability that an incomplete sequence matches one it has previously seen. In addition, the algorithm requires only a fixed amount of memory, even though the input video stream or file can be of any length.
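To make those properties concrete, here's a minimal, hypothetical sketch of an online decoder over a fixed set of sub-action states. This is not the authors' code; the two sub-actions and all the probabilities below are invented for illustration. The point is structural: each frame costs a constant amount of work and memory (one score per state), so total time grows linearly with video length, and at any moment the running scores can be converted into a probability over which sub-action an incomplete sequence is currently in.

```python
# Hypothetical sketch: streaming decoding over a fixed set of
# sub-action states. Memory is constant (one score per state) no
# matter how long the video runs, and each frame takes the same
# amount of work, so total time is linear in video length.
import math

def step(scores, frame_loglik, trans_loglik):
    """Advance per-state scores by one frame (constant memory)."""
    n = len(scores)
    new = []
    for j in range(n):
        best = max(scores[i] + trans_loglik[i][j] for i in range(n))
        new.append(best + frame_loglik[j])
    return new

# Toy model: 2 sub-actions ("bend", "jump"); transitions enforce order.
trans = [[math.log(0.8), math.log(0.2)],   # bend -> bend or jump
         [math.log(1e-9), math.log(1.0)]]  # jump can't go back to bend
scores = [0.0, math.log(1e-9)]             # must start in "bend"

# Stream of per-frame log-likelihoods, one pair per frame.
stream = [[-0.1, -2.0], [-0.2, -1.5], [-1.8, -0.1], [-2.0, -0.05]]
for ll in stream:
    scores = step(scores, ll, trans)

# Partial-sequence "guess": a probability over which sub-action the
# video is in right now, given only the frames seen so far.
z = math.log(sum(math.exp(s) for s in scores))
posterior = [math.exp(s - z) for s in scores]
print(posterior)  # late frames favor "jump"
```

Note that the loop never stores past frames; that's where the fixed-memory guarantee comes from in this sketch.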
According to the MIT News Office:
Enabling all of these advances is the appropriation of a type of algorithm used in natural language processing, the computer science discipline that seeks techniques for interpreting sentences written in natural language.
“One of the challenging problems they try to solve is, if you have a sentence, you want to basically parse the sentence, saying what is the subject, what is the verb, what is the adverb,” Pirsiavash says. “We see an analogy here, which is, if you have a complex action — like making tea or making coffee — that has some subactions, we can basically stitch together these subactions and look at each one as something like verb, adjective, and adverb.”
On that analogy, the rules defining relationships between subactions are like rules of grammar. When you make tea, for instance, it doesn’t matter whether you first put the teabag in the cup or put the kettle on the stove. But it’s essential that you put the kettle on the stove before pouring the water into the cup. Similarly, in a given language, it could be the case that nouns can either precede or follow verbs, but that adjectives must always precede nouns.
For any given action, Pirsiavash and Ramanan’s algorithm must thus learn a new “grammar.” And the mechanism that it uses is the one that many natural-language-processing systems rely on: machine learning. Pirsiavash and Ramanan feed their algorithm training examples of videos depicting a particular action, and specify the number of subactions that the algorithm should look for. But they don’t give it any information about what those subactions are, or what the transitions between them look like.
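The tea-making example above can be sketched as grammar-like precedence rules in a few lines. This is purely illustrative, not the authors' algorithm: the sub-action names and the single ordering rule are invented, and in the real system the grammar is learned from training videos rather than written by hand.

```python
# Toy illustration of a sub-action "grammar": ordering constraints
# expressed as must-precede rules. The names are made up.
RULES = [("kettle_on", "pour_water")]  # kettle must precede pouring

def valid(sequence):
    """Check a sequence of sub-actions against the precedence rules."""
    for before, after in RULES:
        if before in sequence and after in sequence:
            if sequence.index(before) > sequence.index(after):
                return False
    return True

print(valid(["teabag_in_cup", "kettle_on", "pour_water"]))  # True
print(valid(["kettle_on", "teabag_in_cup", "pour_water"]))  # True: teabag order is free
print(valid(["pour_water", "kettle_on", "teabag_in_cup"]))  # False: poured before boiling
```

The two valid orderings mirror the point in the quote: some sub-actions can happen in either order, while others are strictly constrained, just like word order in a language.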
What could this algorithm be used for practically? Well, imagine verifying that a person has correctly assembled a piece of complex equipment, or monitoring whether people take their medication. How about analyzing whether an exercise or athletic routine is correctly performed?
Pirsiavash is particularly interested in possible medical applications of action detection. The proper execution of physical-therapy exercises, for instance, could have a grammar that’s distinct from improper execution; similarly, the return of motor function in patients with neurological damage could be identified by its unique grammar. Action-detection algorithms could also help determine whether, for instance, elderly patients remembered to take their medication — and issue alerts if they didn’t.
Potentially, it could look for anomalies in any human activity and flag non-conformance, down to even minor changes in behavior. Whether you think that's a good or a bad thing is another matter.
Here's a video of someone diving that has been annotated by the algorithm as certain actions are detected.
Pirsiavash and Ramanan: "A test video containing 'diving' actions, where ground-truth action labels are shown in gray. Any pair of red and green bars corresponds to a detected action using our algorithm. We latently infer 2 sub-actions loosely corresponding to 'initial bending' (red) and 'jumping' (green). Some misalignment errors are due to ambiguities in the ground-truth labeling of action boundaries."