Gesture Recognition - Engineering Seminar

Gesture recognition
Gesture recognition is a topic in computer science and language technology with the goal of interpreting human gestures via mathematical algorithms. Gestures can originate from any bodily motion or state but commonly originate from the face or hand. Current focuses in the field include emotion recognition from the face and hand gesture recognition. Many approaches have been made using cameras and computer vision algorithms to interpret sign language. However, the identification and recognition of posture, gait, proxemics, and human behaviors is also the subject of gesture recognition techniques.
Gesture recognition can be seen as a way for computers to begin to understand human body language, thus building a richer bridge between machines and humans than primitive text user interfaces or even GUIs (graphical user interfaces), which still limit the majority of input to keyboard and mouse.
Gesture recognition enables humans to interface with the machine (HMI) and interact naturally without any mechanical devices. Using the concept of gesture recognition, it is possible to point a finger at the computer screen so that the cursor will move accordingly. This could potentially make conventional input devices such as mouse, keyboards and even touch-screens redundant.

Gesture types
In computer interfaces, two types of gestures are distinguished. We consider online gestures, which can also be regarded as direct manipulations like scaling and rotating. In contrast, offline gestures are usually processed after the interaction is finished; e. g. a circle is drawn to activate a context menu.
In computer interfaces, two types of gestures are distinguished:We consider online gestures, which can also be regarded as direct manipulations like scaling and rotating. In contrast, offline gestures are usually processed after the interaction is finished; e. g. a circle is drawn to activate a context menu.
Online gestures: Direct manipulation gestures. They are used to scale or rotate a tangible object.
Offline gestures: Those gestures that are processed after the user interaction with the object. An example is the gesture to activate a menu.

Humans naturally use gesture to communicate. It has been demonstrated that young children can readily learn to communicate with gesture before they learn to talk. A gesture is non-verbal communication made with a part of the body. We use gesture instead of or in combination with verbal communication. 
Using this process, human can interface with the machine without any mechanical devices. Human movements are typically analyzed by segmenting them into shorter and understandable format. The movements vary person to person.

Applications range from tasks such as industrial machine vision systems which, say, inspect bottles speeding by on a production line, to research into artificial intelligence and computers or robots that can comprehend the world around them. The computer vision and machine vision fields have significant overlap. Computer vision covers the core technology of automated image analysis which is used in many fields. Machine vision usually refers to a process of combining automated image analysis with other methods and technologies to provide automated inspection and robot guidance in industrial applications.
As a scientific discipline, computer vision is concerned with the theory behind artificial systems that extract information from images. The image data can take many forms, such as video sequences, views from multiple cameras, or multi-dimensional data from a medical scanner.
As a technological discipline, computer vision seeks to apply its theories and models to the construction of computer vision systems. Examples of applications of computer vision include systems for:

  • Controlling processes, e.g., an industrial robot;
  • Navigation, e.g., by an autonomous vehicle or mobile robot;
  • Detecting events, e.g., for visual surveillance or people counting;
  • Organizing information, e.g., for indexing databases of images and image sequences;
  • Modeling objects or environments, e.g., medical image analysis or topographical modeling;
  • Interaction, e.g., as the input to a device for computer-human interaction, and
  • Automatic inspection, e.g., in manufacturing applications.

Sub-domains of computer vision include scene reconstruction, event detection, video tracking, object recognition, learning, indexing, motion estimation, and image restoration.
In most practical computer vision applications, the computers are pre-programmed to solve a particular task, but methods based on learning are now becoming increasingly common and image processing.

 Image Processing
An image defined in the “real world” is considered to be a function of two real variables, for example, a(x,y) with a as the amplitude (e.g. brightness) of the image at the real coordinate position (x,y).
In a sophisticated image processing system it should be possible to apply specific image processing operations to selected regions. Thus one part of an image (region) might be processed to suppress motion blur while another part might be processed to improve color rendition.
Modern digital technology has made it possible to manipulate multi-dimensional signals with systems that range from simple digital circuits to advanced parallel computers. The goal of this manipulation can be divided into three categories: * Image Processing image in -> image out * Image Analysis image in -> measurements out * Image Understanding image in -> high-level description out Image processing is referred to processing of a 2D picture by a computer. Basic definitions:
An image defined in the “real world” is considered to be a function of two real variables, for example, a(x,y) with a as the amplitude (e.g. brightness) of the image at the real coordinate position (x,y).
An image may be considered to contain sub-images sometimes referred to as regions-of-interest, ROIs, or simply regions. This concept reflects the fact that images frequently contain collections of objects each of which can be the basis for a region. In a sophisticated image processing system it should be possible to apply specific image processing operations to selected regions. Thus one part of an image (region) might be processed to suppress motion blur while another part might be processed to improve color rendition. Sequence of image processing:
The most requirements for image processing of images is that the images be available in digitized form, that is, arrays of finite length binary words. For digitization, the given Image is sampled on a discrete grid and each sample or pixel is quantized using a finite number of bits. The digitized image is processed by a computer. To display a digital image, it is first converted into analog signal, which is scanned onto a display.

Closely related to image processing are computer graphics and computer vision. In computer graphics, images are manually made from physical models of objects, environments, and lighting, instead of being acquired (via imaging devices such as cameras) from natural scenes, as in most animated movies. Computer vision, on the other hand, is often considered high-level image processing out of which a machine/computer/software intends to decipher the physical contents of an image or a sequence of images (e.g., videos or 3D full-body magnetic resonance scans).

In modern sciences and technologies, images also gain much broader scopes due to the ever growing importance of scientific visualization (of often large-scale complex scientific/experimental data). Examples include microarray data in genetic research, or real-time multi-asset portfolio trading in finance.

Before going to processing an image, it is converted into a digital form. Digitization includes sampling of image and quantization of sampled values. After converting the image into bit information, processing is performed. This processing technique may be, Image enhancement, Image reconstruction, and Image compression

Biological and Sociological Definition and Classification of Gestures
From a biological and sociological perspective, gestures are loosely defined, thus, researchers are free to visualize and classify gestures as they see fit. Speech and handwriting recognition research provide methods for designing recognition systems and useful measures for classifying such systems. Gesture recognition systems which are used to control memory and display, devices in a local environment, and devices in a remote environment are examined for the same reason.
People frequently use gestures to communicate. Gestures are used for everything from pointing at a person to get their attention to conveying information about space and temporal characteristics.Evidence indicates that gesturing does not simply embellish spoken language, but is part of the language generation process .
Biologists define "gesture" broadly, stating, "the notion of gesture is to embrace all kinds of instances where an individual engages in movements whose communicative intent is paramount, manifest, and openly acknowledged" . Gestures associated with speech are referred to as gesticulation. Gestures which function independently of speech are referred to as autonomous. Autonomous gestures can be organized into their own communicative language, such as American Sign Language (ASL). Autonomous gestures can also represent motion commands. In the following subsections, some various ways in which biologists and sociologists define gestures are examined to discover if there are gestures ideal for use in communication and device control.

 It can be used as a command to control different devices of daily activities, mobility etc. So our natural or intuitive body movements or gestures can be used as command or interface to operate machines, communicate with intelligent environments to control home appliances, smart home, telecare systems etc. 
Gesture recognition is useful for processing information from humans which is not conveyed through speech or type. As well, there are various types of gestures which can be identified by computers.

Sign language recognition.
Just as speech recognition can transcribe speech to text, certain types of gesture recognition software can transcribe the symbols represented through 
sign language into text.
Sign language

A sign language (also signed language) is a language which, instead of acoustically conveyed sound patterns, uses manual communication and body language to convey meaning. This can involve simultaneously combining hand shapes, orientation and movement of the hands, arms or body, and facial expressions to fluidly express a speaker's thoughts.
Wherever communities of deaf people exist, sign languages develop. Signing is also done by persons who can hear, but cannot physically speak. While they utilize space for grammar in a way that spoken languages do not, sign languages exhibit the same linguistic properties and use the same language faculty as do spoken languages.[1][2] Hundreds of sign languages are in use around the world and are at the cores of local deaf cultures. Some sign languages have obtained some form of legal recognition, while others have no status at all.

For socially assistive robotics. 
By using proper sensors (accelerometers and gyros) worn on the body of a patient and by reading the values from those sensors, robots can assist in patient rehabilitation. The best example can be stroke rehabilitation.
Directional indication through pointing. Pointing has a very specific purpose in our society, to reference an object or location based on its position relative to ourselves. The use of gesture recognition to determine where a person is pointing is useful for identifying the context of statements or instructions. This application is of particular interest in the field of

The Shadow robot hand system
Robotics is the branch of technology that deals with the design, construction, operation and application of robots  and computer systems for their control, sensory feedback, and information processing. These technologies deal with automated machines that can take the place of humans, in hazardous or manufacturing processes, or simply just resemble humans. Many of today's robots are inspired by nature contributing to the field of bio-inspired robotics.
The concept in creation of machines that could operate autonomously dates back to classical times, but research into the functionality and potential uses of robots did not grow substantially until the 20th century.Throughout history, robotics has been often seen to mimic human behavior, and often manage tasks in a similar fashion. Today, robotics is a rapidly growing field, as we continue to research, design, and build new robots that serve various practical purposes, whether domestically, commercially, or militarily. Many robots do jobs that are hazardous to people such as defusing bombs, exploring shipwrecks, and mines.

Control through facial gestures. 
 Controlling a computer through facial gestures is a useful application of gesture recognition for users who may not physically be able to use a mouse or keyboard. Eye tracking in particular may be of use for controlling cursor motion or focusing on elements of a display.
Eye tracking
Eye tracking is the process of measuring either the point of gaze (where one is looking) or the motion of an eye relative to the head. An eye tracker is a device for measuring eye positions and eye movement. Eye trackers are used in research on the visual system, in psychology, in cognitive linguistics and in product design. There are a number of methods for measuring eye movement. The most popular variant uses video images from which the eye position is extracted. Other methods use search coils or are based on the electrooculogram

Alternative computer interfaces. 
Foregoing the traditional keyboard and mouse setup to interact with a computer, strong gesture recognition could allow users to accomplish frequent or common tasks using hand or face gestures to a camera.
Immersive game technology. 
Gestures can be used to control interactions within video games to try and make the game player's experience more interactive or immersive.

Virtual controllers.
 For systems where the act of finding or acquiring a physical controller could require too much time, gestures can be used as an alternative control mechanism. Controlling secondary devices in a car, or controlling a television set are examples of such usage.
 Affective computing. 
In affective computing

Affective computing is the study and development of systems and devices that can recognize, interpret, process, and simulate human affects. It is an interdisciplinary field spanning computer sciences, psychology, and cognitive science. While the origins of the field may be traced as far back as to early philosophical enquiries into emotion, the more modern branch of computer science originated with Rosalind Picard's 1995 paper on affective computing. A motivation for the research is the ability to simulate empathy. The machine should interpret the emotional state of humans and adapt its behaviour to them, giving an appropriate response for those emotions. gesture recognition is used in the process of identifying emotional expression through computer systems.

Remote control. 
Through the use of gesture recognition, "remote control with the wave of a hand" of various devices is possible. The signal must not only indicate the desired response, but also which device to be controlled.. 

Input devices
Wired gloves. 
These can provide input to the computer about the position and rotation of the hands using magnetic or inertial tracking devices. Furthermore, some gloves can detect finger bending with a high degree of accuracy (5-10 degrees), or even provide haptic feedback to the user, which is a simulation of the sense of touch. The first commercially available hand-tracking glove-type device was the DataGlove, a glove-type device which could detect hand position, movement and finger bending. This uses fiber optic cables running down the back of the hand. Light pulses are created and when the fingers are bent, light leaks through small cracks and the loss is registered, giving an approximation of the hand pose.

Depth-aware cameras.
 Using specialized cameras such as structured light or time-of-flight cameras, one can generate a depth map of what is being seen through the camera at a short range, and use this data to approximate a 3d representation of what is being seen. These can be effective for detection of hand gestures due to their short range capabilities. 

Stereo cameras.
 Using two cameras whose relations to one another are known, a 3d representation can be approximated by the output of the cameras. To get the cameras' relations, one can use a positioning reference such as a lexian-stripe or infrared emitters. In combination with direct motion measurement (6D-Vision) gestures can directly be detected.

Controller-based gestures.
 These controllers act as an extension of the body so that when gestures are performed, some of their motion can be conveniently captured by software. Mouse gestures are one such example, where the motion of the mouse is correlated to a symbol being drawn by a person's hand, as is the Wii Remote

Wii Remote
The Wii Remote (Wiiリモコン Uī Rimokon?), also known colloquially as the Wiimote, is the primary controller for Nintendo's Wii console. A main feature of the Wii Remote is its motion sensing capability, which allows the user to interact with and manipulate items on screen via gesture recognition and pointing through the use of accelerometer and optical sensor technology. Another feature is its expandability through the use of attachments. The attachment bundled with the Wii console is the Nunchuk, which complements the Wii Remote by providing functions similar to those in gamepad controllers. Some other attachments include the Classic Controller, Wii Zapper, and the Wii Wheel, originally used for Mario Kart.
The controller was revealed at the Tokyo Game Show on September 14, 2005, with the name "Wii Remote" announced April 27, 2006. It has since received much attention due to its unique features and the contrast between it and typical gaming controllers.
During Nintendo's presentation of Wii's successor console, the Wii U, it was announced that it will have support for the Wii Remote and its peripherals in games where use of its touchscreen-built-in primary controller is not imperative.which can study changes in acceleration over time to represent gestures. Devices such as the LG Electronics Magic Wand, the Loop and the Scoop use Hillcrest Labs' Freespace technology, which uses MEMS accelerometers, gyroscopes and other sensors to translate gestures into cursor movement. The software also compensates for human tremor and inadvertent movement. AudioCubes are another example. The sensors of these smart light emitting cubes can be used to sense hands and fingers as well as other objects nearby, and can be used to process data. Most applications are in music and sound synthesis, but can be applied to other fields.
Single camera.
 A normal camera can be used for gesture recognition where the resources/environment would not be convenient for other forms of image-based recognition. Earlier it was thought that single camera may not be as effective as stereo or depth aware cameras, but a start-up based out of Palo Alto named Flutter is challenging this theory. It has released an app that could be downloaded to by any windows/mac computer with built-in webcam, thus, allowing an accessibility to a wider audience. 

Different ways of tracking and analyzing gestures exist, and some basic layout is given is in the diagram above. For example, volumetric models convey the necessary information required for an elaborate analysis, however they prove to be very intensive in terms of computational power and require further technological developments in order to be implemented for real-time analysis. On the other hand, appearance-based models are easier to process but usually lack the generality required for Human-Computer Interaction.
Depending on the type of the input data, the approach for interpreting a gesture could be done in different ways. However, most of the techniques rely on key pointers represented in a 3D coordinate system. Based on the relative motion of these, the gesture can be detected with a high accuracy, depending of the quality of the input and the algorithm’s approach.
In order to interpret movements of the body, one has to classify them according to common properties and the message the movements may express. For example, in sign language each gesture represents a word or phrase. The taxonomy that seems very appropriate for Human-Computer Interaction has been proposed by Quek in "Toward a Vision-Based Hand Gesture Interface". He presents several interactive gesture systems in order to capture the whole space of the gestures: 1. Manipulative; 2. Semaphoric; 3. Conversational.

A read hand (left) is interpreted as a collection of vertices and lines in the 3D mesh version (right), and the software uses their relative position and interaction in order to infer the gesture.

[5a] 3D model-based algorithms
The 3D model approach can use volumetric or skeletal models, or even a combination of the two. Volumetric approaches have been heavily used in computer animation industry and for computer vision purposes. The models are generally created of complicated 3D surfaces, like NURBS or polygon meshes.
The drawback of this method is that is very computational intensive, and systems for live analysis are still to be developed. For the moment, a more interesting approach would be to map simple primitive objects to the person’s most important body parts ( for example cylinders for the arms and neck, sphere for the head) and analyse the way these interact with each other. Furthermore, some abstract structures like super-quadrics 

Some superquadrics.
In mathematics, the superquadrics or super-quadrics (also superquadratics) are a family of geometric shapes defined by formulas that resemble those of elipsoids and other quadrics, except that the squaring operations are replaced by arbitrary powers. They can be seen as the three-dimensional relatives of the Lamé curves ("Superellipses").
The superquadrics include many shapes that resemble cubes, octahedra, cylinders, lozenges and spindles, with rounded or sharp corners. Because of their flexibility and relative simplicity, they are popular geometric modeling tools, especially in computer graphics.
Some authors, such as Alan Barr, define "superquadrics" as including both the superellipsoids and the supertoroids.[1][2] However, the (proper) supertoroids are not superquadrics as defined above; and, while some superquadrics are superellipsoids, neither family is contained in the other.

and generalised cylinders may be even more suitable for approximating the body parts. The very exciting about this approach is that the parameters for these objects are quite simple. In order to better model the relation between these, we make use of constraints and hierarchies between our objects.

The skeletal version (right) is effectively modelling the hand (left). This has less parameters than the volumetric version and it's easier to compute, making it suitable for real-time gesture analysis systems.

[5b] Skeletal-based algorithms
Instead of using intensive processing of the 3D models and dealing with a lot of parameters, one can just use a simplified version of joint angle parameters along with segment lengths. This is known as a skeletal representation of the body, where a virtual skeleton of the person is computed and parts of the body are mapped to certain segments. The analysis here is done using the position and orientation of these segments and the relation between each one of them( for example the angle between the joints and the relative position or orientation)

These binary silhouette(left) or contour(right) images represent typical input for appearance-based algorithms. They are compared with different hand templates and if they match, the correspondent gesture is inferred.

[5c] Appearance-based models
These models don’t use a spatial representation of the body anymore, because they derive the parameters directly from the images or videos using a template database. Some are based on the deformable 2D templates of the human parts of the body, particularly hands. Deformable templates are sets of points on the outline of an object, used as interpolation nodes for the object’s outline approximation. One of the simplest interpolation function is linear, which performs an average shape from point sets, point variability parameters and external deformators. These template-based models are mostly used for hand-tracking, but could also be of use for simple gesture classification.
A second approach in gesture detecting using appearance-based models uses image sequences as gesture templates. Parameters for this method are either the images themselves, or certain features derived from these. Most of the time, only one ( monoscopic) or two ( stereoscopic ) views are used.

Researches show that gesture based applications can be used for many different things, entertainment, controlling home appliance, tele-care, tele-health, elderly or disable care. 
The scope of the application shows us the importance of more researches in a gesture controlled system.
 Most applications are to replace traditional input devices like keyboard and mouse, accessible application for elderly-disable like accelerometer.
 Using digital camera rather than sensor has provided new dimension to develop gesture based user interface development. Now people can interact with any media using gesture to control wide range of applications.

There are many challenges associated with the accuracy and usefulness of gesture recognition software. For image-based gesture recognition there are limitations on the equipment used and image noise. 

Image noise
Noise clearly visible in an image from a digital camera
Image noise is random (not present in the object imaged) variation of brightness or color information in images, and is usually an aspect of electronic noise. It can be produced by the sensor and circuitry of a scanner or digital camera. Image noise can also originate in film grain and in the unavoidable shot noise of an ideal photon detector. Image noise is an undesirable by-product of image capture that adds spurious and extraneous information.
The original meaning of "noise" was and remains "unwanted sound"; unwanted electrical fluctuations in signals received by AM radios caused audible acoustic noise ("static"). By analogy unwanted electrical fluctuations themselves came to be known as "noise".[1] Image noise is, of course, inaudible.
The magnitude of image noise can range from almost imperceptible specks on a digital photograph taken in good light, to optical and radioastronomical images that are almost entirely noise, from which a small amount of information can be derived by sophisticated processing (a noise level that would be totally unacceptable in a photograph since it would be impossible to determine even what the subject was).

Images or video may not be under consistent lighting, or in the same location. Items in the background or distinct features of the users may make recognition more difficult.
The variety of implementations for image-based gesture recognition may also cause issue for viability of the technology to general usage. For example, an algorithm calibrated for one camera may not work for a different camera. The amount of background noise also causes tracking and recognition difficulties, especially when occlusions (partial and full) occur. Furthermore, the distance from the camera, and the camera's resolution and quality, also cause variations in recognition accuracy.
In order to capture human gestures by visual sensors, robust computer vision methods are also required, for example for hand tracking and hand posture recognition or for capturing movements of the head, facial expressions or gaze direction.

No comments:

Post a Comment

leave your opinion