ABSTRACT
Smart video
surveillance systems are capable of enhancing situational awareness across
multiple scales of space and time. However, at the present time, the component
technologies are evolving in isolation; for example, face recognition
technology addresses the identity tracking challenge while constraining the
subject to be in front of the camera, and intelligent video surveillance
technologies provide activity detection capabilities on video streams while
ignoring the identity tracking challenge.
To provide
comprehensive, nonintrusive situation awareness, it is imperative
to address the challenge of multiscale, spatiotemporal tracking. This article
explores the concepts of multiscale spatiotemporal tracking through the use of
real-time video analysis, active cameras, multiple object models, and long-term
pattern analysis to provide comprehensive situation awareness. From the
perspective of real-time threat detection, it is well known that human
visual attention drops below acceptable levels when personnel, even trained personnel, are
assigned to the prolonged task of visual monitoring.
Specifically, multiscale tracking technologies are the next step in applying
automatic video analysis to surveillance systems. In this paper we begin with a
discussion of the state of the art in video analysis technologies as applied
to surveillance and the key technical
challenges.
We present
the concepts that underlie several of the key techniques, including detection
of moving objects in video, tracking in two and three dimensions, object classification,
and object structure analysis.
INTRODUCTION
Ensuring high levels of security
at public access facilities like airports and seaports is an extremely complex
challenge. A number of technologies can be applied to various aspects of the
security challenge, including screening systems for people and objects (bags,
vehicles, etc.), database systems for tracking "trusted people,"
biometric systems to verify identity, and video surveillance systems to monitor activity. Today's video surveillance systems
act as large-scale video recorders, either analog or digital. Their primary
focus is the application of video compression technology to efficiently
multiplex and store images from a large number of cameras onto mass storage
devices (either video tapes or disks). These systems serve two key purposes:
providing a human operator with images to detect and react to potential threats
and recording evidence for investigative purposes. While these are the first
steps in using video surveillance to enhance security, they are inadequate for
supporting both real-time threat detection and forensic investigation.
From the
perspective of forensic investigation, the challenge of sifting through large
collections of surveillance video tapes is even more tedious and error-prone
for a human investigator. Automatic video analysis technologies can be applied
to develop “smart surveillance systems” that can aid the human operator
in both real-time threat detection and forensic investigatory tasks.
STATE-OF-THE-ART
IN VIDEO ANALYSIS FOR SURVEILLANCE
Video
analysis and video surveillance are active areas of research. The key areas are
video-based detection and tracking, video-based person identification, and
large-scale surveillance systems. A significant percentage of the basic
technologies for video-based detection and tracking emerged from a major government research program. This program looked at
several fundamental issues in detection, tracking, autocalibration, and
multicamera systems. There has also been research on real-world surveillance
systems in several leading universities and research labs. The next generation
of research in surveillance is addressing not only issues in detection and
tracking but also issues of event detection and automatic system calibration.
The second
key challenge of surveillance, namely video-based person identification, has
also been a subject of intense research. Face recognition has been a leading
modality, with both ongoing research and industrial systems. A recent U.S. government
research program called Human ID at a Distance addressed the challenge of
identifying humans at a distance using techniques like face at a distance and
gait-based recognition.
THE
MULTISCALE CHALLENGE
One of the
key requirements for effective situation awareness is the acquisition of
information at multiple scales. A security analyst who is monitoring a lobby
observes not only where people are in the space and what they are doing but
also pays attention to the expression on people's faces. The analyst uses these
visual observations in conjunction with other knowledge to make an assessment
of threats. While existing research has addressed several issues in the
analysis of surveillance video, very little work has been done in the area of
better information acquisition based on real-time automatic video analysis,
like automatic acquisition of high-resolution face images.
THE
CONTEXTUAL EVENT DETECTION CHALLENGE
While
detecting and tracking objects is a critical capability for smart surveillance,
the most critical challenge in video-based surveillance (from the perspective
of a human intelligence analyst) is interpreting the automatic analysis data to
detect events of interest and identify trends. Current systems have just begun
to look into automatic event detection. The area of context-based
interpretation of the events in a monitored space is yet to be explored.
Challenges here include: using knowledge of time and deployment conditions to
improve video analysis, using geometric models of the environment and other
object and activity models to interpret events, and using learning techniques
to improve system performance and detect unusual events.
THE LARGE
SYSTEM DEPLOYMENT CHALLENGE
The basic
techniques for interpreting video and extracting information from it have
received a significant amount of attention. The next set of challenges deals with
how to use these techniques to build large-scale deployable systems. Several
challenges of deployment include minimizing the cost of wiring, meeting the
need for low-power hardware for battery-operated camera installations, meeting
the need for automatic calibration of cameras and automatic fault detection,
and developing system management tools.
COMPONENT
TECHNOLOGIES FOR SMART SURVEILLANCE
Since the
multiscale challenge incorporates the widest range of technical challenges, we
present the generic architecture of a multiscale tracking system. The goal of
a multiscale tracking system is to acquire information about objects in the
monitored space at several scales in a unified framework. The architecture
presented here provides a view of the interactions between the various
components of such a system. In Figure 1, the static cameras cover the complete
scene of interest and provide a global view, while the pan-tilt-zoom (PTZ) cameras are steered to acquire close-up views of selected objects. The information from the PTZ
cameras is then used to perform fine-scale analysis.
OBJECT
DETECTION
Object detection
is the first stage in most tracking systems and serves as a means of focusing
attention. There are two approaches to object detection. The first approach,
called background subtraction, assumes a stationary background and treats all
changes in the scene as objects of
interest. The second approach, called salient motion detection, assumes
that a scene will have many different types of motion, of which some types are
of interest from a surveillance perspective. The following sections offer a
short discussion of both approaches.
ADAPTIVE
BACKGROUND SUBTRACTION WITH HEALING
The background
subtraction module combines evidence from differences in color, texture, and motion.
Figure 2 shows the key stages in background subtraction. The use of multiple
modalities improves the detection of objects in cluttered environments. The
resulting saliency map is smoothed using morphological operators, and then
small holes and blobs are eliminated to generate a clean foreground mask. The
background subtraction module has a number of mechanisms to handle changing
ambient conditions and scene composition. First, it continually updates its
overall red/green/blue (RGB) channel noise parameters to compensate for changing
light levels. Second, it estimates and corrects for automatic gain control
(AGC) and automatic white balance (AWB) shifts induced by the camera. Third, it
maintains a map of high-activity regions and slowly updates its background
model only in areas deemed as relatively quiescent. Finally, it automatically
eliminates occasional spurious foreground objects based on their
motion
patterns.
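To make the stages above concrete, the following is a minimal sketch of adaptive background subtraction with morphological clean-up, written in Python with OpenCV. The MOG2 model is used here only as a stand-in for the multi-cue (color, texture, motion) subtraction described above; the input file name, learning rate, and blob-size threshold are illustrative assumptions.

```python
# Minimal sketch: adaptive background subtraction plus morphological clean-up.
# MOG2 stands in for the multi-cue model; all parameter values are illustrative.
import cv2
import numpy as np

cap = cv2.VideoCapture("lobby.avi")          # hypothetical input clip
subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500, varThreshold=16, detectShadows=True)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Raw saliency map: nonzero pixels differ from the adaptive background model.
    mask = subtractor.apply(frame, learningRate=0.005)
    mask[mask == 127] = 0                     # drop shadow pixels
    # Morphological smoothing: close small holes, then remove small specks.
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    clean = np.zeros_like(mask)
    for i in range(1, num):                   # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] > 200:  # keep only sizeable foreground blobs
            clean[labels == i] = 255
    cv2.imshow("foreground", clean)
    if cv2.waitKey(1) == 27:                  # Esc to quit
        break
cap.release()
```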
SALIENT
MOTION DETECTION
This is a
complementary approach to background subtraction. Here, we approach the problem
from a motion filtering perspective. Figure 3(a) shows a scene where a person
is walking in front of a bush that is waving in the wind. Figure 3(b) shows the
output of a traditional background subtraction algorithm which (per its design)
correctly classifies the entire bush as a moving object. However, in this
situation, we are interested in detecting the
person as opposed to the moving bush. Our approach uses optical flow as the
basis for detecting salient motion. We use a temporal window of N frames
(typically ten to 15) to assess the coherence of optic flow at each pixel over
the entire temporal window. Pixels with coherent optical flow are labeled as
candidates. The candidates from the motion filtering are then subjected to a
region-growing process to obtain the final detection.
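The sketch below illustrates the idea of flow-coherence filtering over a temporal window of N frames. It uses OpenCV's Farneback dense optical flow as an illustrative substitute for the motion filter described above; the window length, coherence threshold, and input file are assumptions.

```python
# Minimal sketch: salient motion via optical-flow coherence over N frames.
import cv2
import numpy as np
from collections import deque

N = 12                                        # temporal window (10-15 frames)
cap = cv2.VideoCapture("bush.avi")            # hypothetical input clip
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
window = deque(maxlen=N)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    window.append(flow)
    prev_gray = gray
    if len(window) < N:
        continue
    # Coherent motion: per-frame flow vectors at a pixel point in a consistent
    # direction, so their sum is nearly as large as the sum of their magnitudes.
    total = np.sum(window, axis=0)                           # H x W x 2
    mean_mag = np.mean([np.linalg.norm(f, axis=2) for f in window], axis=0)
    coherence = np.linalg.norm(total, axis=2) / (N * mean_mag + 1e-6)
    salient = ((coherence > 0.7) & (mean_mag > 0.5)).astype(np.uint8) * 255
    # Region growing is approximated here by a connected-component pass.
    num, labels = cv2.connectedComponents(salient)
    cv2.imshow("salient motion", salient)
    if cv2.waitKey(1) == 27:
        break
cap.release()
```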
Background
subtraction and salient motion detection are complementary approaches, each
with its own strengths and weaknesses. Background subtraction is more suited
for indoor environments where lighting is fairly stable and distracting motions
are limited; salient motion detection is better suited to detecting coherent motion
in challenging environments with distracting background motion, such as outdoor scenes with waving vegetation.
TWO-DIMENSIONAL
OBJECT TRACKING
Multi-object tracking aims to develop object trajectories over time by
using a combination of the objects' appearance and movement characteristics.
New appearance models are created when an object enters a scene. In every new
frame, each of the existing tracks is used to try to explain the foreground
pixels. The fitting mechanism used is correlation, implemented as minimization
of the sum of absolute pixel differences between the detected foreground area
and an existing appearance model. During occlusions, foreground pixels may
represent the appearance of overlapping objects; the tracker must then assign these pixels to the correct track while maintaining an accurate model of the
shape and color of the objects. Figure 4 illustrates an example of occlusion
handling.
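As an illustration of the correlation-based fitting step, the sketch below matches a stored appearance template against a new frame by minimizing the sum of absolute pixel differences over a small search window. The function name, search radius, and the occlusion comment are assumptions, not the system's actual implementation.

```python
# Minimal sketch: track fitting by sum-of-absolute-differences (SAD) correlation.
import numpy as np

def fit_track(frame, template, prev_xy, search=20):
    """Return the (x, y) in `frame` that best explains `template`,
    searching +/- `search` pixels around the previous position."""
    th, tw = template.shape[:2]
    px, py = prev_xy
    best_cost, best_xy = np.inf, prev_xy
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = px + dx, py + dy
            if x < 0 or y < 0 or y + th > frame.shape[0] or x + tw > frame.shape[1]:
                continue
            patch = frame[y:y + th, x:x + tw].astype(np.int32)
            cost = np.abs(patch - template.astype(np.int32)).sum()  # SAD
            if cost < best_cost:
                best_cost, best_xy = cost, (x, y)
    return best_xy, best_cost

# During occlusions, one would update only the unoccluded template pixels, e.g.
# template[visible] = alpha * patch[visible] + (1 - alpha) * template[visible].
```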
THREE-DIMENSIONAL
WIDE-BASELINE STEREO OBJECT TRACKING
In several
surveillance applications, it becomes necessary to determine the position of an
object in the scene with reference to a three-dimensional (3-D) world
coordinate system. This can be achieved by using two overlapping views of the
scene and locating the same scene point in the two views. This approach to 3-D
measurement is called stereo. There are two types of stereo:
1) narrow-baseline stereo, where the two cameras are placed close (a
few inches) to each other, resulting in dense depth measurements at limited
distance from the cameras, and
2) wide-baseline
stereo, where the two cameras are far apart (a few feet), resulting in a limited
number of high-accuracy depth measurements where correspondences are available.
In wide-area surveillance applications, wide-baseline stereo provides position
information at large distances from the cameras, which is not possible with
traditional stereo. Hence, we explore wide-baseline tracking. Figure 5 shows a
block diagram of a 3-D tracker that uses wide-baseline stereo to derive the 3-D
positions of objects. Video from each of the cameras is processed independently
using the 2-D tracker described earlier, which detects objects and tracks them
in the 2-D image. The next step involves computing a correspondence between
objects in the two cameras. The correspondence process is accomplished by using
a combination of object appearance matching and camera geometry information. At
every frame, we measure the color distance between all possible pairings of tracks
from the two views. We use the Bhattacharyya distance between the normalized
color histograms of the tracks. For each pair, we also measure the
triangulation error, which is defined as the shortest 3-D distance between the
rays passing through the centroids of the appearance models in the two
views. This process can potentially
cause multiple tracks from one view to be assigned to the same track in the
other view. We use the triangulation error to eliminate such multiple
assignments. The triangulation error for the final correspondence is
thresholded to eliminate spurious matches that can occur when objects are just
visible in one of the two views. Once a correspondence is available at a given
frame, we need to establish a match between the existing set of 3-D tracks and
3-D objects present in the current frame. We use the component 2-D track
identifiers of a 3-D track and match them against the component 2-D track
identifiers of the current set of objects to establish the correspondence. The
system also enables partial matches, thus ensuring a continuous 3-D track even
when one of the 2-D tracks fails. Thus, the 3-D tracker is capable of
generating 3-D position tracks of the centroid of each moving object in the
scene. It also has access to the 2-D shape and color models from the two views
that make up the track.
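The following sketch illustrates the two cues used for inter-view correspondence: the Bhattacharyya distance between normalized color histograms and the triangulation error between the 3-D rays through the track centroids. The ray representation, cue weights, and error threshold are assumptions; in practice they would come from camera calibration and tuning.

```python
# Minimal sketch: color + geometry cues for wide-baseline track correspondence.
import numpy as np

def bhattacharyya_distance(h1, h2):
    """Distance between two normalized histograms (each sums to 1)."""
    bc = np.sum(np.sqrt(h1 * h2))             # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - bc))

def triangulation_error(c1, d1, c2, d2):
    """Shortest distance between ray c1 + t*d1 and ray c2 + s*d2 in 3-D."""
    d1, d2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
    n = np.cross(d1, d2)
    if np.linalg.norm(n) < 1e-9:               # nearly parallel rays
        return np.linalg.norm(np.cross(c2 - c1, d1))
    return abs(np.dot(c2 - c1, n)) / np.linalg.norm(n)

def match_tracks(pairs, color_w=1.0, geom_w=0.1, max_error=0.5):
    """Score every candidate (track_A, track_B) pairing; reject pairs whose
    triangulation error is too large (object visible in only one view)."""
    scored = []
    for a, b, hist_a, hist_b, ray_a, ray_b in pairs:
        err = triangulation_error(*ray_a, *ray_b)
        if err > max_error:
            continue
        cost = color_w * bhattacharyya_distance(hist_a, hist_b) + geom_w * err
        scored.append((cost, a, b))
    return sorted(scored)                      # greedy: take cheapest pairs first
```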
OBJECT
CLASSIFICATION
In several
surveillance applications, determining the type of object is critical. For
example, detecting an animal at a fence line bordering the woods may not be an
alarm condition, whereas spotting a
person there will definitely require an alarm. There are two approaches to
object classification: an image-based approach and a video tracking-based
approach. Presented below is a video tracking approach to object
classification. This assumes that the objects of interest have been detected
and tracked by an object tracking system. Image-based systems, such as face,
pedestrian, or vehicle detection, find objects of a certain type without prior
knowledge of the image location or scale. These systems tend to be slower than
video tracking-based systems, which leverage current tracking information to locate and segment
the object of interest. Video tracking-based systems use statistics about the
appearance, shape, and motion of moving objects to quickly distinguish people,
vehicles, carts, animals, doors opening/closing, trees moving in the breeze,
etc. Our system (see Figure 6)
classifies objects into vehicles, individuals, and groups of people
based on shape features such as compactness, bounding ellipse parameters, and motion features.
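The sketch below illustrates how a handful of shape and motion statistics could separate the three classes; the particular features and thresholds shown are illustrative assumptions rather than the classifier actually used.

```python
# Minimal sketch: rule-based classification from shape and motion statistics.
import numpy as np

def classify(contour_area, perimeter, ellipse_axes, speed_px_per_s):
    """Return a coarse class label from simple track statistics."""
    compactness = 4 * np.pi * contour_area / (perimeter ** 2 + 1e-6)
    major, minor = max(ellipse_axes), min(ellipse_axes)
    aspect = major / (minor + 1e-6)
    if aspect < 1.5 and speed_px_per_s > 40:      # wide, fast-moving blob
        return "vehicle"
    if aspect >= 2.0 and compactness < 0.5:       # tall, elongated blob
        return "single person"
    return "multi-person group"
```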
OBJECT
STRUCTURE ANALYSIS: HEAD DETECTION
Often, knowing an object's overall position and class is not sufficient; a close-up view of a specific body part, such as the head, is needed. The contour of a tracked person is analyzed to hypothesize
candidate locations of five body parts: the head, two feet, and two hands. Determination of the head among the candidate
locations is currently based on a number of heuristics
founded on the
relative positions of the candidate locations and the curvatures of the contour
at the candidate locations.
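As a rough illustration, the sketch below hypothesizes a head location from the topmost point of a tracked silhouette and a simple curvature check at that point; the neighborhood size and the curvature test are assumptions, not the exact heuristics used by the system.

```python
# Minimal sketch: contour-based head hypothesis from a tracked silhouette.
import numpy as np

def head_candidate(contour):
    """contour: N x 2 array of (x, y) silhouette points; image y grows downward."""
    top_idx = int(np.argmin(contour[:, 1]))       # topmost contour point
    n = len(contour)
    prev_pt = contour[(top_idx - 5) % n]
    next_pt = contour[(top_idx + 5) % n]
    v1 = prev_pt - contour[top_idx]
    v2 = next_pt - contour[top_idx]
    cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-6)
    # Reject candidates on a nearly flat top edge (vectors almost opposite);
    # a head produces a visibly curved contour at its topmost point.
    return contour[top_idx] if cos_angle > -0.8 else None
```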
FACE
CATALOGER APPLICATION
The level of
security at a facility is directly related to how well the facility can answer
the question "who is where?" The "who" part of this
question is typically addressed through the use of face images for recognition,
either by an individual or a computer
face recognition
system. The "where" part of this question can be addressed through
3-D position tracking. The "who is where" problem is inherently
multiscale; wide angle views are needed for location estimation and
high-resolution face images are required for identification. An effective
system to answer the question "who is where?" must acquire face
images without constraining the users and must associate the face images with
the 3-D path of the individual. The face cataloger uses computer-controlled PTZ
cameras driven by a 3-D wide-baseline
stereo tracking system. The PTZ cameras automatically acquire
zoomed-in views
of a person's head without constraining the subject to be at a particular
location. The face cataloger has applications in a variety of scenarios where
one would like to detect both the presence and identity of people in a certain
space, such as loading docks, retail store warehouses, shopping areas, and
airports. Figure 8 shows the deployment of a face cataloger at the interface
between a public access area and a secure area in a building. The cameras are calibrated using
images of a calibration pattern (in which feature points corresponding to
known object geometry are manually selected) in conjunction with a few
parameters supplied by the camera manufacturer. The following is a step-by-step
description of the operation of the face cataloger system:
. Step 1: 2-D Object Detection- This step detects objects of interest as they move
about the scene. The object detection process is independently
applied to all
the static cameras present in the scene.
. Step 2:
2-D Object Tracking- The objects
detected in Step 1 are tracked within each camera field of view based on object
appearance models.
. Step 3: 3-D Object Tracking-The 2-D object tracks are combined to
locate and track
objects in a 3-D world coordinate system. This step uses the 3-D wide-baseline
stereo tracking discussed previously. The result of the 3-D tracking is an
association between the same object as seen in two overlapping camera views of
the scene.
. Step 4: 3-D Head Detection-To locate the position of the head in 3-D, we use the
head detection technique described earlier. Given a 3-D track, the head is
first detected in the corresponding 2-D views. The centroids of the head in the
two views are used to triangulate the 3-D position of the head.
. Step 5:
Active Camera Assignment-This step determines which of the available active cameras will be used for
which task. Let us consider the example of a scene with three objects and a
face cataloger system with two available active cameras. This step will employ
an algorithm that uses an application-dependent policy to decide the camera
assignment.
. Step 6: 3-D Position-Based
Camera Control-Given the 3-D position of the head and a PTZ
camera that has been assigned to the object, the system automatically steers
the
selected active
camera to foveate on the measured location of the head. There are several
ways of controlling the pan-tilt and zoom parameters. For example, the zoom
could be proportional
to the distance of the object from the camera and
inversely proportional to the speed at which the object is moving.
. Step 7:
Face Detection-Once the 3-D position-based zoom has been
triggered, the system starts applying face detection [12] to the images from
the PTZ camera. As soon as a face is
detected in the image, the control switches from 3-D position-based control to
2-D control based on face position.
. Step 8:
Face Detection-Based Camera Control-Once the (frontal) face
image is detected, the camera is centered on the face and the zoom is
increased. The pan and tilt of the camera are controlled based on the relative
displacement of the center of the face with respect to the center of the image.
To avoid any potential instabilities in the feedback control strategy, we use a
damping factor in the process; a sketch of this control logic follows the step list. Figure 10 shows images selected from the zoom
sequence of the face cataloger. Figure 10(a)-(c) show the initial 3-D position-based control. Figure 10(d) shows a box around the person's face,
indicating that a face has been detected (Step 7). Figure 10(e)-(g) show the
final stages of the zoom, which uses the position and size of the face to
control the PTZ of the camera. In a typical zoom sequence, the size of the face
image will go from roughly ten pixels to 145 pixels across the face, with
a resulting area zoom of about 200.
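The following sketch illustrates the two camera-control modes from Steps 6-8: a 3-D position-based zoom that grows with the target's distance and shrinks with its speed, and a damped face-centering correction for pan and tilt. All gains, the field-of-view scaling, and the function names are assumptions; a real controller would also translate these values into the PTZ camera's command protocol.

```python
# Minimal sketch: PTZ control policies for the face cataloger (illustrative gains).

def zoom_from_3d(distance_m, speed_m_per_s, k_dist=2.0, k_speed=1.5,
                 z_min=1.0, z_max=20.0):
    """Zoom proportional to distance and inversely related to target speed."""
    zoom = k_dist * distance_m / (1.0 + k_speed * speed_m_per_s)
    return max(z_min, min(z_max, zoom))

def pan_tilt_from_face(face_cx, face_cy, img_w, img_h,
                       pan, tilt, fov_x_deg, fov_y_deg, damping=0.3):
    """Nudge pan/tilt so the detected face center drifts toward the image center.
    The damping factor keeps the feedback loop from oscillating."""
    err_x = (face_cx - img_w / 2.0) / img_w        # normalized horizontal offset
    err_y = (face_cy - img_h / 2.0) / img_h        # normalized vertical offset
    pan += damping * err_x * fov_x_deg             # damped proportional correction
    tilt -= damping * err_y * fov_y_deg
    return pan, tilt
```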
LONG-TERM
MONITORING AND MOVEMENT PATTERN ANALYSIS
Consider the
challenge of monitoring the activity near the entrance of a building. Figure 11
shows the plan view of an IBM facility with a parking lot attached to the
building. A security analyst would be interested in several types of
activities, including finding cars speeding through the parking lot, finding
cars that have been parked in loading zones, etc. Figure 12 shows the
architecture of a system that can enable such queries through the use of smart
surveillance technology. The camera, which is mounted on the roof of the
building, is wired to a central server. The video from the camera is analyzed
by the smart surveillance engine to produce the viewable video index (VVI), which can be
used for browsing and query purposes.
The internal
structure of the smart surveillance engine is shown in Figure 13. It uses some
of the component technologies described earlier. The following analysis steps
are performed on the video:
. Step 1:
The first step is to
detect objects of interest in the surveillance video. This step uses the object
detection techniques described previously.
. Step 2: All the objects that are detected in Step
1 are tracked by the object tracking module, using the techniques described
earlier. The key idea behind the tracking is to use the color information in
conjunction with velocity-based prediction to maintain accurate color and shape
models of the tracked objects.
. Step 3:
Object classification is
applied to all objects that are consistently tracked by the tracking module.
The object classification currently generates three class labels, namely
vehicles, multiperson group, and single person.
. Step 4:
Real-time alerts-these are
based on criteria set up by the user; examples include motion detection in
specified areas, directional motion detection, abandoned object detection,
object removal detection, and camera tampering detection.
. Step 5:
Viewable video index
(VVI)-this is a representation that includes time-stamped information about the
trajectories of objects in the scene; a sketch of one possible record structure follows this list.
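As an illustration of what a VVI record might contain, the sketch below stores a time-stamped ground-plane trajectory per track together with a class label and derived alerts such as speeding. The field names and speed threshold are assumptions about one possible schema, not the system's actual index format.

```python
# Minimal sketch: one record of a viewable video index (VVI), illustrative schema.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class VVIRecord:
    track_id: int
    label: str                                   # e.g. "vehicle", "single person"
    trajectory: List[Tuple[float, float, float]] = field(default_factory=list)
    # each sample is (timestamp_s, x, y) in ground-plane coordinates
    alerts: List[str] = field(default_factory=list)

    def speeding(self, limit=8.0):
        """Flag the track if any pair of consecutive samples exceeds `limit` units/s."""
        for (t0, x0, y0), (t1, x1, y1) in zip(self.trajectory, self.trajectory[1:]):
            if t1 > t0 and ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 / (t1 - t0) > limit:
                return True
        return False
```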
PERFORMANCE
EVALUATION
Measuring the
performance of smart surveillance systems is a very challenging task due to the
high degree of effort involved in gathering and annotating the ground truth as
well as the challenges involved in defining metrics for performance
measurement. Like any pattern recognition system, surveillance systems have two
types of errors:
. False Positives: These are errors that occur when the system
falsely detects or recognizes a pattern that does not exist in the scene. For
example, a system that is monitoring a secure area may detect motion in the
area when there is no physical object but rather a change in the lighting.
. False Negatives: These are
errors that occur when the system does not detect or recognize a pattern that
it is designed to detect. For example, a system monitoring a secure area may
fail to detect a person wearing clothes similar to the scene background.
In this section, we present the various steps in evaluating an application like
the face cataloger system. The ultimate goal of the face cataloger is to
obtain good close-up head shots of people walking through the monitored space.
The quality of the close-up face clips is a function of the accuracy of a
number of underlying components. The following are potential sources of errors
in the system:
. Object Detection and Tracking Errors
- Frame-Level Object Detection Errors: These are errors that occur when the detection system fails to detect or falsely detects an object.
- Scene-Level Object Detection Errors: These are errors that occur when an object is completely missed throughout its life in the scene or a completely nonexistent object is created by the system.
- 2-D Track Breakage: These errors occur when the tracker prematurely terminates a track and creates a new track for the same object.
- 2-D Track Swap: This error occurs when the objects being represented by a track get interchanged, typically after an occlusion.
- 3-D Track Swap: This error can occur due to errors in the inter-view correspondence process.
. 2-D Head Detection Errors: These are errors in the position and size of the head detected in each of the 2-D views.
. True Head Center Error: Since we are detecting the head in two widely different views, the centers of the two head bounding boxes do not correspond to a single physical point and hence will lead to errors in the 3-D position.
. 3-D Head
Position Errors: These
are errors in the 3-D position of the head due to inaccuracy in the camera
calibration data.
. Active Camera Control Errors: These
are errors that arise due to the active camera control policies. For example,
the zoom factor of the camera is dependent on the velocity of the person; thus,
any error in velocity estimation will lead to errors in the zoom control.
. Active
Camera Delays: The
delay in the control and physical motion of the camera will cause the close-up
view of the head to be incorrect.
. Test Data Collection and Characterization:
This involves collecting test sequences from one or more target environments.
For example, the challenges in monitoring a waterfront at a port are very
different from those of monitoring a crowded subway platform in New York City.
Once data is
collected from an environment, it is manually grouped into different categories
based on environmental conditions (i.e., sunny day, windy day, etc.) for ease
of interpreting the results of performance analysis.
. Ground Truth Generation: This is typically a very labor-intensive process and
involves a human using a ground truth marking tool to manually identify various
activities that occur in the scene. Our performance evaluation system uses a
bounding box marked on each object every 30th frame while assigning
a track identifier to each unique object in the scene.
. Automatic Performance Evaluation: The
object detection and tracking systems are used to process the test data set and
generate results. An automatic performance evaluation algorithm takes the test
results and the ground truth data, compares them, and generates both frame-level
and scene-level object detection false positive and false negative results; a sketch of such a frame-level comparison follows.
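In the sketch below, detections and ground-truth boxes are matched by intersection-over-union (IoU); unmatched detections count as false positives and unmatched ground-truth boxes count as false negatives. The 0.5 overlap threshold and the greedy matching are common conventions assumed here, not necessarily those of the evaluation algorithm described above.

```python
# Minimal sketch: frame-level false positive / false negative counting via IoU.

def iou(a, b):
    """Boxes are (x1, y1, x2, y2) in pixel coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter) if inter else 0.0

def frame_errors(detections, ground_truth, thresh=0.5):
    """Return (false_positives, false_negatives) for one annotated frame."""
    unmatched_gt = list(ground_truth)
    fp = 0
    for det in detections:
        best = max(unmatched_gt, key=lambda g: iou(det, g), default=None)
        if best is not None and iou(det, best) >= thresh:
            unmatched_gt.remove(best)          # matched detection: not an error
        else:
            fp += 1                            # detection with no ground truth
    return fp, len(unmatched_gt)               # leftover ground truth = misses
```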
CONCLUSION
Smart
surveillance systems significantly contribute to situation awareness. Such
systems transform video surveillance from a data acquisition tool into an
information and intelligence acquisition system. Real-time video analysis provides
smart surveillance systems with the ability to react to an activity in
real-time, thus acquiring relevant information at much higher resolution. The
long-term operation of such systems provides the ability to analyze information
in a spatiotemporal context. As such systems evolve, they will be integrated
both with inputs from other types of sensing devices and also with information
about the space in which the system is operating, thus providing a very rich
mechanism for maintaining situation awareness.