Smart Video Surveillance - Seminar Report

Smart video surveillance systems are capable of enhancing situational awareness across multiple scales of space and time. At present, however, the component technologies are evolving in isolation; for example, face recognition technology addresses the identity-tracking challenge while constraining the subject to be in front of the camera, and intelligent video surveillance technologies provide activity detection on video streams while ignoring the identity-tracking challenge.
To provide comprehensive, nonintrusive situation awareness, it is imperative to address the challenge of multiscale spatiotemporal tracking. This article explores the concept of multiscale spatiotemporal tracking through the use of real-time video analysis, active cameras, multiple object models, and long-term pattern analysis to provide comprehensive situation awareness. From the perspective of real-time threat detection, it is well known that human visual attention drops below acceptable levels even when trained personnel are assigned to the task of visual monitoring. Multiscale tracking technologies are the next step in applying automatic video analysis to surveillance systems. In this paper, we begin with a discussion of the state of the art in video analysis technologies as applied to surveillance and the key technical challenges.
We present the concepts that underlie several of the key techniques, including detection of moving objects in video, tracking in two and three dimensions, object classification, and object structure analysis.

Ensuring high levels of security at public access facilities like airports and seaports is an extremely complex challenge. A number of technologies can be applied to various aspects of the security challenge, including screening systems for people and objects (bags, vehicles, etc.), database systems for tracking "trusted people," biometric systems to verify identity, and video surveillance systems to monitor activity. Today's video surveillance systems act as large-scale video recorders, either analog or digital. Their primary focus is the application of video compression technology to efficiently multiplex and store images from a large number of cameras onto mass storage devices (either video tapes or disks). These systems serve two key purposes: providing a human operator with images to detect and react to potential threats and recording evidence for investigative purposes. While these are the first steps in using video surveillance to enhance security, they are inadequate for supporting both real-time threat detection and forensic investigation.
From the perspective of forensic investigation, the challenge of sifting through large collections of surveillance video tapes is even more tedious and error-prone for a human investigator. Automatic video analysis technologies can be applied to develop "smart surveillance systems" that aid the human operator in both real-time threat detection and forensic investigatory tasks.

Video analysis and video surveillance are active areas of research. The key areas are video-based detection and tracking, video-based person identification, and large-scale surveillance systems. A significant percentage of the basic technologies for video-based detection and tracking emerged from a government-funded research program that examined several fundamental issues in detection, tracking, autocalibration, and multicamera systems. There has also been research on real-world surveillance systems in several leading universities and research labs. The next generation of research in surveillance addresses not only issues in detection and tracking but also event detection and automatic system calibration.
The second key challenge of surveillance, namely video-based person identification, has also been a subject of intense research. Face recognition has been a leading modality, with both ongoing research and industrial systems. A recent U.S. government research program called Human ID at a Distance addressed the challenge of identifying humans at a distance using techniques like face-at-a-distance and gait-based recognition.

One of the key requirements for effective situation awareness is the acquisition of information at multiple scales. A security analyst who is monitoring a lobby observes not only where people are in the space and what they are doing but also the expressions on people's faces. The analyst uses these visual observations in conjunction with other knowledge to assess threats. While existing research has addressed several issues in the analysis of surveillance video, very little work has been done in the area of better information acquisition based on real-time automatic video analysis, such as automatic acquisition of high-resolution face images.

While detecting and tracking objects is a critical capability for smart surveillance, the most critical challenge in video-based surveillance (from the perspective of a human intelligence analyst) is interpreting the automatic analysis data to detect events of interest and identify trends. Current systems have just begun to look into automatic event detection. The area of context-based interpretation of the events in a monitored space is yet to be explored. Challenges here include: using knowledge of time and deployment conditions to improve video analysis, using geometric models of the environment and other object and activity models to interpret events, and using learning techniques to improve system performance and detect unusual events.

The basic techniques for interpreting video and extracting information from it have received a significant amount of attention. The next set of challenges deals with how to use these techniques to build large-scale deployable systems. Several challenges of deployment include minimizing the cost of wiring, meeting the need for low-power hardware for battery-operated camera installations, meeting the need for automatic calibration of cameras and automatic fault detection, and developing system management tools.

Since the multiscale challenge incorporates the widest range of technical challenges, we present the generic architecture of a multiscale tracking system. The goal of a multiscale tracking system is to acquire information about objects in the monitored space at several scales in a unified framework. The architecture presented here provides a view of the interactions between the various components of such a system. In Figure 1, the static cameras cover the complete scene of interest and provide a global view, while the pan-tilt-zoom (PTZ) cameras obtain close-up views of the objects detected by the static cameras. The information from the PTZ cameras is then used to perform fine-scale analysis.

Object detection is the first stage in most tracking systems and serves as a means of focusing attention. There are two approaches to object detection. The first approach, called background subtraction, assumes a stationary background and treats all changes in the scene as objects of interest. The second approach, called salient motion detection, assumes that a scene will have many different types of motion, of which some types are of interest from a surveillance perspective. The following sections offer a short discussion of both approaches.

The background subtraction module combines evidence from differences in color, texture, and motion. Figure 2 shows the key stages in background subtraction. The use of multiple modalities improves the detection of objects in cluttered environments. The resulting saliency map is smoothed using morphological operators, and small holes and blobs are then eliminated to generate a clean foreground mask. The background subtraction module has a number of mechanisms to handle changing ambient conditions and scene composition. First, it continually updates its overall red/green/blue (RGB) channel noise parameters to compensate for changing light levels. Second, it estimates and corrects for automatic gain control (AGC) and automatic white balance (AWB) shifts induced by the camera. Third, it maintains a map of high-activity regions and slowly updates its background model only in areas deemed relatively quiescent. Finally, it automatically eliminates occasional spurious foreground objects based on their motion patterns.
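The selective-update strategy above can be sketched in a few lines. The following is a minimal, illustrative background-subtraction model, not the multimodal implementation described in the text: frames are flat lists of grayscale intensities, and the learning rate and threshold are arbitrary assumptions; only the slow, quiescent-area update is modeled.

```python
# Minimal background-subtraction sketch (illustrative only): a
# running-average background model per pixel, with a foreground
# threshold and a slow update rate applied only where the scene
# is quiescent, mirroring the selective-update strategy above.

def subtract_background(frames, alpha=0.05, threshold=30):
    """Yield a foreground mask (list of booleans) for each frame.

    alpha     -- background learning rate (assumed value)
    threshold -- intensity difference treated as salient change
    """
    background = list(frames[0])  # bootstrap the model from the first frame
    for frame in frames[1:]:
        mask = [abs(p - b) > threshold for p, b in zip(frame, background)]
        # Update the background only in quiescent (non-foreground) areas.
        background = [b if m else (1 - alpha) * b + alpha * p
                      for p, b, m in zip(frame, background, mask)]
        yield mask

# Example: a static background of 100s with one "object" pixel at 200.
frames = [[100, 100, 100], [100, 200, 100], [100, 200, 100]]
masks = list(subtract_background(frames))
print(masks[0])  # [False, True, False] -- the object pixel is detected
```

A production system would of course operate on 2-D images and add the morphological cleanup described above; the sketch shows only the per-pixel model logic.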

Salient motion detection is a complementary approach to background subtraction. Here, we approach the problem from a motion-filtering perspective. Figure 3(a) shows a scene in which a person is walking in front of a bush that is waving in the wind. Figure 3(b) shows the output of a traditional background subtraction algorithm, which (per its design) correctly classifies the entire bush as a moving object. In this situation, however, we are interested in detecting the person rather than the moving bush. Our approach uses optical flow as the basis for detecting salient motion. We use a temporal window of N frames (typically 10 to 15) to assess the coherence of optical flow at each pixel over the entire window. Pixels with coherent optical flow are labeled as candidates. The candidates from the motion filtering are then subjected to a region-growing process to obtain the final detection.
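The coherence test at the heart of this approach can be illustrated with a simplified sketch. Here a pixel's optical-flow vectors over the temporal window are assumed to be given; a score near 1 means the flow consistently points one way (a walking person), while oscillating flow (a waving bush) scores near 0. The score definition below is an illustrative stand-in, not the system's actual coherence measure.

```python
import math

# Illustrative coherence score for salient-motion detection: a pixel's
# flow vectors over a temporal window are "coherent" when they mostly
# point the same way, so the mean vector is nearly as long as the mean
# of the individual vector magnitudes.

def flow_coherence(flow_window):
    """flow_window: list of (dx, dy) optical-flow vectors at one pixel."""
    n = len(flow_window)
    mean_dx = sum(dx for dx, _ in flow_window) / n
    mean_dy = sum(dy for _, dy in flow_window) / n
    mean_mag = sum(math.hypot(dx, dy) for dx, dy in flow_window) / n
    if mean_mag == 0:
        return 0.0
    return math.hypot(mean_dx, mean_dy) / mean_mag  # 1.0 = fully coherent

# A walking person: flow points steadily to the right over 10 frames.
person = [(1.0, 0.0)] * 10
# A waving bush: flow oscillates back and forth, cancelling out.
bush = [(1.0, 0.0), (-1.0, 0.0)] * 5

print(flow_coherence(person))  # 1.0 -> salient
print(flow_coherence(bush))    # 0.0 -> ignored
```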
Background subtraction and salient motion detection are complementary approaches, each with its own strengths and weaknesses. Background subtraction is more suited to indoor environments where lighting is fairly stable and distracting motions are limited; salient motion detection is well suited to detecting coherent motion in challenging environments with distracting motion.

Multi-object tracking aims to develop object trajectories over time by using a combination of the objects' appearance and movement characteristics. A new appearance model is created when an object enters the scene. In every new frame, each of the existing tracks is used to try to explain the foreground pixels. The fitting mechanism used is correlation, implemented as minimization of the sum of absolute pixel differences between the detected foreground area and an existing appearance model. During occlusions, foreground pixels may represent the appearance of multiple overlapping objects; the tracker resolves such pixels among the competing tracks while maintaining an accurate model of the shape and color of the objects. Figure 4 illustrates an example of occlusion handling.
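The correlation-based fitting described above, minimization of the sum of absolute differences (SAD), can be sketched as follows. The one-dimensional search and the toy intensity values are illustrative assumptions; a real tracker searches a 2-D neighborhood around the predicted object position.

```python
# Sketch of correlation-based track fitting: an existing appearance
# model (a small template of pixel intensities) is matched against the
# detected foreground by minimizing the sum of absolute differences.

def sad(template, patch):
    """Sum of absolute pixel differences between template and patch."""
    return sum(abs(t - p) for t, p in zip(template, patch))

def best_match(template, foreground_row):
    """Slide the template across a row of foreground pixels and return
    the offset with the lowest SAD score."""
    width = len(template)
    scores = [(sad(template, foreground_row[i:i + width]), i)
              for i in range(len(foreground_row) - width + 1)]
    return min(scores)[1]  # offset of the lowest-SAD position

row = [0, 0, 50, 60, 50, 0, 0]   # detected foreground intensities
model = [50, 60, 50]             # appearance model of a tracked object
print(best_match(model, row))    # 2 -- the object is found at offset 2
```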

In several surveillance applications, it becomes necessary to determine the position of an object in the scene with reference to a three-dimensional (3-D) world coordinate system. This can be achieved by using two overlapping views of the scene and locating the same scene point in the two views. This approach to 3-D measurement is called stereo. There are two types of stereo:

1) narrow-baseline stereo, where the two cameras are placed close (a few inches) to each other, resulting in dense depth measurements at a limited distance from the cameras, and

2) wide-baseline stereo, where the two cameras are far apart (a few feet), resulting in a limited number of high-accuracy depth measurements where correspondences are available. In wide-area surveillance applications, wide-baseline stereo provides position information at large distances from the cameras, which is not possible with traditional stereo. Hence, we explore wide-baseline tracking. Figure 5 shows a block diagram of a 3-D tracker that uses wide-baseline stereo to derive the 3-D positions of objects. Video from each of the cameras is processed independently using the 2-D tracker described earlier, which detects objects and tracks them in the 2-D image. The next step involves computing a correspondence between objects in the two cameras. The correspondence process is accomplished by using a combination of object appearance matching and camera geometry information. At every frame, we measure the color distance between all possible pairings of tracks from the two views, using the Bhattacharyya distance between the normalized color histograms of the tracks. For each pair, we also measure the triangulation error, which is defined as the shortest 3-D distance between the rays passing through the centroids of the appearance models in the two views. This process can potentially cause multiple tracks from one view to be assigned to the same track in the other view; we use the triangulation error to eliminate such multiple assignments. The triangulation error for the final correspondence is thresholded to eliminate spurious matches that can occur when objects are just visible in one of the two views. Once a correspondence is available at a given frame, we need to establish a match between the existing set of 3-D tracks and the 3-D objects present in the current frame. We use the component 2-D track identifiers of a 3-D track and match them against the component 2-D track identifiers of the current set of objects to establish the correspondence.
The system also enables partial matches, thus ensuring a continuous 3-D track even when one of the 2-D tracks fails. Thus, the 3-D tracker is capable of generating 3-D position tracks of the centroid of each moving object in the scene. It also has access to the 2-D shape and color models from the two views that make up the track.
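The appearance-matching step of the correspondence process can be illustrated with the Bhattacharyya distance on normalized color histograms. The three-bin histograms below are toy values; a real system would use much finer binning.

```python
import math

# Bhattacharyya distance between two color histograms, as used in the
# inter-view appearance matching described above. The histograms are
# normalized to probability distributions first; a distance of 0 means
# the distributions are identical.

def normalize(hist):
    total = sum(hist)
    return [h / total for h in hist]

def bhattacharyya_distance(h1, h2):
    h1, h2 = normalize(h1), normalize(h2)
    coeff = sum(math.sqrt(a * b) for a, b in zip(h1, h2))
    return math.sqrt(max(0.0, 1.0 - coeff))

track_view1 = [10, 40, 50]   # color histogram of a track in camera 1
track_view2 = [12, 38, 50]   # similar histogram of the match in camera 2
unrelated   = [90, 5, 5]     # a differently colored object

# The same physical object scores a much smaller distance across views.
print(bhattacharyya_distance(track_view1, track_view2) <
      bhattacharyya_distance(track_view1, unrelated))   # True
```

In the full system this color distance is combined with the triangulation error to resolve ambiguous pairings, as described above.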

In several surveillance applications, determining the type of object is critical. For example, detecting an animal at a fence line bordering the woods may not be an alarm condition, whereas spotting a person there will definitely require an alarm. There are two approaches to object classification: an image-based approach and a video-tracking-based approach. Presented below is a video-tracking approach to object classification, which assumes that the objects of interest have been detected and tracked by an object tracking system. Image-based systems, such as face, pedestrian, or vehicle detection, find objects of a certain type without prior knowledge of the image location or scale. These systems tend to be slower than video-tracking-based systems, which leverage current tracking information to locate and segment the object of interest. Video-tracking-based systems use statistics about the appearance, shape, and motion of moving objects to quickly distinguish people, vehicles, carts, animals, doors opening/closing, trees moving in the breeze, etc. Our system (see Figure 6) classifies objects into vehicles, individuals, and groups of people based on shape features such as compactness, bounding-ellipse parameters, and motion features.
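A toy version of such a shape-feature classifier is sketched below. The two features (bounding-box aspect ratio and fill ratio) and all threshold values are illustrative assumptions, not the actual features or values used by the system described above.

```python
# Toy classifier in the spirit of the shape-feature approach: separate
# object classes using two cheap shape features computed from a tracked
# object's bounding box. All thresholds are illustrative assumptions.

def classify(width, height, area):
    aspect = width / height          # vehicles tend to be wide
    fill = area / (width * height)   # and fill their bounding box densely
    if aspect > 1.2 and fill > 0.6:
        return "vehicle"
    if aspect < 0.8:                 # upright, tall silhouette
        return "person"
    return "group"

print(classify(width=80, height=40, area=2600))   # vehicle
print(classify(width=20, height=60, area=700))    # person
```

A deployed classifier would add motion features (speed, periodicity of gait) and temporal smoothing over the track, as the text suggests.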

Often, knowing that an object is a person is not sufficient; the locations of the body parts are also needed. The system hypothesizes candidate locations of five body parts: the head, two feet, and two hands. Determination of the head among the candidate locations is currently based on a number of heuristics founded on the relative positions of the candidate locations and the curvature of the contour at those locations.

The level of security at a facility is directly related to how well the facility can answer the question "who is where?" The "who" part of this question is typically addressed through the use of face images for recognition, either by an individual or by a computer face recognition system. The "where" part can be addressed through 3-D position tracking. The "who is where" problem is inherently multiscale; wide-angle views are needed for location estimation, and high-resolution face images are required for identification. An effective system to answer the question "who is where?" must acquire face images without constraining the users and must associate the face images with the 3-D path of the individual. The face cataloger uses computer-controlled PTZ cameras driven by a 3-D wide-baseline stereo tracking system. The PTZ cameras automatically acquire zoomed-in views of a person's head without constraining the subject to be at a particular location. The face cataloger has applications in a variety of scenarios where one would like to detect both the presence and identity of people in a certain space, such as loading docks, retail store warehouses, shopping areas, and airports. Figure 8 shows the deployment of a face cataloger at the interface between a public access area and a secure area in a building. The cameras are calibrated using images of a calibration pattern (in which feature points corresponding to known object geometry are manually selected) in conjunction with a few parameters supplied by the camera manufacturer. The following is a step-by-step description of the operation of the face cataloger system:
. Step 1: 2-D Object Detection-This step detects objects of interest as they move about the scene. The object detection process is applied independently to all the static cameras present in the scene.
. Step 2: 2-D Object Tracking- The objects detected in Step 1 are tracked within each camera field of view based on object appearance models.
. Step 3:  3-D Object Tracking-The 2-D object tracks are combined to
locate and track objects in a 3-D world coordinate system. This step uses the 3-D wide-baseline stereo tracking discussed previously. The result of the 3-D tracking is an association between the same object as seen in two overlapping camera views of
the scene.
. Step 4: 3-D Head Detection-To locate the position of the head in 3-D, we use the head detection technique described earlier. Given a 3-D track, the head is first detected in the corresponding 2-D views. The centroids of the head in the two views are used to triangulate the 3-D position of the head.
. Step 5: Active Camera Assignment-This step determines which of the available active cameras will be used for which task. Consider the example of a scene with three objects and a face cataloger system with two available active cameras. This step employs an algorithm that uses an application-dependent policy to decide the camera assignment.
. Step 6: 3-D Position-Based Camera Control-Given the 3-D position of the head and a PTZ camera that has been assigned to the object, the system automatically steers the selected active camera to foveate on the measured location of the head. There are several ways of controlling the pan, tilt, and zoom parameters. For example, the zoom could be proportional to the distance of the object from the camera and inversely proportional to the speed at which the object is moving.

. Step 7: Face Detection-Once the 3-D position-based zoom has been triggered, the system starts applying face detection [12] to the images from the PTZ camera. As soon as a face is detected in the image, the control switches from 3-D position-based control to 2-D control based on face position.

. Step 8: Face Detection-Based Camera Control-Once the (frontal) face image is detected, the camera is centered on the face and the zoom is increased. The pan and tilt of the camera are controlled based on the relative displacement of the center of the face with respect to the center of the image. To avoid potential instabilities in the feedback control strategy, we use a damping factor in the process. Figure 10 shows images selected from the zoom sequence of the face cataloger. Figure 10(a)-(c) shows the initial 3-D position-based control. Figure 10(d) shows a box around the person's face, indicating that a face has been detected (Step 7). Figure 10(e)-(g) shows the final stages of the zoom, which uses the position and size of the face to control the PTZ of the camera. In a typical zoom sequence, the size of the face image goes from roughly ten pixels across the face to 145 pixels across, a resulting area zoom of roughly 200.
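The damped face-centering control of Step 8 can be sketched as a simple proportional controller. The gain (damping factor), the pixel-equivalent pan/tilt units, and the simulated convergence loop are all illustrative assumptions, not the actual control law of the face cataloger.

```python
# Sketch of damped face-based camera control: pan/tilt are nudged by
# the displacement of the face center from the image center, scaled by
# a damping factor to avoid feedback instability.

def center_face(face_center, image_center, pan, tilt, damping=0.5):
    """One control update: move pan/tilt a damped fraction of the error."""
    dx = face_center[0] - image_center[0]
    dy = face_center[1] - image_center[1]
    return pan + damping * dx, tilt + damping * dy

# Simulate a few control iterations on a face at (200, 140) with the
# image center at (160, 120); pan/tilt here are in pixel-equivalent
# units, so moving the camera shifts the face by the same amount.
pan, tilt = 0.0, 0.0
face = (200.0, 140.0)
for _ in range(6):
    new_pan, new_tilt = center_face(face, (160.0, 120.0), pan, tilt)
    face = (face[0] - (new_pan - pan), face[1] - (new_tilt - tilt))
    pan, tilt = new_pan, new_tilt

print(face)  # the face has converged close to the image center (160, 120)
```

Because each update corrects only half the remaining error, the error shrinks geometrically instead of oscillating, which is the point of the damping factor mentioned above.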

Consider the challenge of monitoring the activity near the entrance of a building. Figure 11 shows the plan view of an IBM facility with a parking lot attached to the building. A security analyst would be interested in several types of activities, including finding cars speeding through the parking lot, finding cars that have been parked in loading zones, etc. Figure 12 shows the architecture of a system that can enable such queries through the use of smart surveillance technology. The camera, which is mounted on the roof of the building, is wired to a central server. The video from the camera is analyzed by the smart surveillance engine to produce the viewable video index, which can be used for browsing purposes.

The internal structure of the smart surveillance engine is shown in Figure 13. It uses some of the component technologies described earlier. The following analysis steps are performed on the video:
. Step 1: The first step is to detect objects of interest in the surveillance video. This step uses the object detection techniques described previously.
. Step 2: All the objects that are detected in Step 1 are tracked by the object tracking module, using the techniques described earlier. The key idea behind the tracking is to use the color information in conjunction with velocity-based prediction to maintain accurate color and shape models of the tracked objects.

. Step 3: Object classification is applied to all objects that are consistently tracked by the tracking module. The object classification currently generates three class labels, namely vehicles, multiperson group, and single person.
. Step 4: Real-time alerts-these are based on criteria set up by the user; examples include motion detection in specified areas, directional motion detection, abandoned object detection, object removal detection, and camera tampering detection.
. Step 5: Viewable video index (VVI)-this is a representation that includes time-stamped information about the trajectories of objects in the scene.
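A minimal sketch of what such a viewable video index might look like follows. The record layout and the browsing query are assumptions for illustration; the actual VVI representation is not specified in detail here.

```python
# Illustrative viewable video index (VVI): time-stamped object
# observations that can be queried for browsing, in the spirit of
# Step 5 above. The record layout and query are assumed.

from collections import namedtuple

Observation = namedtuple("Observation", "track_id timestamp x y label")

vvi = [
    Observation(1, 10.0, 5.0, 2.0, "vehicle"),
    Observation(1, 11.0, 9.0, 2.1, "vehicle"),
    Observation(2, 10.5, 1.0, 7.0, "single person"),
]

def objects_in_window(index, t_start, t_end, label=None):
    """Return the distinct track ids observed in a time window,
    optionally filtered by class label."""
    return sorted({o.track_id for o in index
                   if t_start <= o.timestamp <= t_end
                   and (label is None or o.label == label)})

print(objects_in_window(vvi, 10.0, 10.6))              # [1, 2]
print(objects_in_window(vvi, 10.0, 12.0, "vehicle"))   # [1]
```

Queries like "vehicles present in the loading zone between 10:00 and 10:30" reduce to exactly this kind of filter over the indexed trajectories.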

Measuring the performance of smart surveillance systems is a very challenging task due to the high degree of effort involved in gathering and annotating the ground truth as well as the challenges involved in defining metrics for performance measurement. Like any pattern recognition system, surveillance systems have two types of errors:

. False Positives: These are errors that occur when the system falsely detects or recognizes a pattern that does not exist in the scene. For example, a system that is monitoring a secure area may detect motion in the area when there is no physical object but rather a change in the lighting.

. False Negatives: These are errors that occur when the system does not detect or recognize a pattern that it is designed to detect. For example, a system monitoring a secure area may fail to detect a person wearing clothes similar to the scene background. In this section, we present the various steps in evaluating an application like the face cataloger system. The ultimate goal of the face cataloger is to obtain good close-up head shots of people walking through the monitored space. The quality of the close-up face clips is a function of the accuracy of a number of underlying components. The following are potential sources of errors in the system:

. Object Detection and Tracking Errors

- Frame-Level Object Detection Errors: These are errors that occur when the detection system fails to detect or falsely detects an object.
- Scene-Level Object Detection Errors: These are errors that occur when an object is completely missed throughout its life in the scene or a completely nonexistent object is created by the system.
- 2-D Track Breakage: These errors occur when the tracker prematurely terminates a track and creates a new track for the same object.
- 2-D Track Swap: This error occurs when the objects being represented by a track get interchanged, typically after an occlusion.

- 3-D Track Swap: This error can occur due to errors in the inter-view correspondence process.
. 2-D Head Detection Errors: These are errors in the position and size of the head detected in each of the 2-D views.
. True Head Center Error: Since we are detecting the head in two widely different views, the centers of the two head bounding boxes do not correspond to a single physical point and hence will lead to errors in the 3-D position.
. 3-D Head Position Errors: These are errors in the 3-D position of the head due to inaccuracy in the camera calibration data.
 . Active Camera Control Errors: These are errors that arise due to the active camera control policies. For example, the zoom factor of the camera is dependent on the velocity of the person; thus, any error in velocity estimation will lead to errors in the zoom control.
. Active Camera Delays: The delay in the control and physical motion of the camera will cause the close-up view of the head to be incorrect.

. Test Data Collection and Characterization: This involves collecting test sequences from one or more target environments. For example, the challenges in monitoring a waterfront at a port are very different from those of monitoring a crowded subway platform in New York City. Once data is collected from an environment, it is manually grouped into different categories based on environmental conditions (i.e., sunny day, windy day, etc.) for ease of interpreting the results of performance analysis.

. Ground Truth Generation: This is typically a very labor-intensive process and involves a human using a ground-truth marking tool to manually identify various activities that occur in the scene. Our performance evaluation system uses a bounding box marked on each object every 30th frame while assigning a track identifier to each unique object in the scene.
. Automatic Performance Evaluation: The object detection and tracking systems are used to process the test data set and generate results. An automatic performance evaluation algorithm takes the test results and the ground-truth data, compares them, and generates both frame-level and scene-level object detection false positive and false negative results.
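The frame-level comparison can be sketched as a greedy matching between ground-truth and detected objects. Representing objects by center points and matching within a fixed distance tolerance are simplifying assumptions; a real evaluator would compare bounding boxes.

```python
import math

# Sketch of frame-level performance evaluation: ground-truth and
# detected objects (here just center points) are greedily matched
# within a distance tolerance; unmatched detections count as false
# positives and unmatched ground-truth objects as false negatives.

def evaluate_frame(ground_truth, detections, tolerance=10.0):
    unmatched_gt = list(ground_truth)
    false_positives = 0
    for det in detections:
        match = next((gt for gt in unmatched_gt
                      if math.dist(det, gt) <= tolerance), None)
        if match is not None:
            unmatched_gt.remove(match)   # this ground-truth object is found
        else:
            false_positives += 1         # detection with no real object
    return false_positives, len(unmatched_gt)  # (FP, FN)

gt   = [(10, 10), (50, 50)]          # two annotated objects
dets = [(12, 9), (90, 90)]           # one good match, one spurious blob
print(evaluate_frame(gt, dets))      # (1, 1): one FP and one missed object
```

Aggregating these per-frame counts over the annotated test set yields the frame-level false positive and false negative rates described above.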

Smart surveillance systems significantly contribute to situation awareness. Such systems transform video surveillance from a data acquisition tool to information and intelligence acquisition systems. Real-time video analysis provides smart surveillance systems with the ability to react to an activity in real-time, thus acquiring relevant information at much higher resolution. The long-term operation of such systems provides the ability to analyze information in a spatiotemporal context. As such systems evolve, they will be integrated both with inputs from other types of sensing devices and also with information about the space in which the system is operating, thus providing a very rich mechanism for maintaining situation awareness.
