Abstract
Video abstraction is a nascent technology concerned with generating a summary of a video. It is particularly relevant to archiving and retrieving huge volumes of video files. Various techniques are evolving to create ideal abstracts of videos that reflect the nature and content of the original. Among all possible research areas, video abstraction is one of the most important topics, since it enables quick browsing of a large collection of video data as well as efficient content access and representation.
A new technique is dynamic video synopsis,
where most of the activity in the video is condensed by simultaneously showing
several actions, even when they originally occurred at different times. For
example, we can create a “stroboscopic movie”, where multiple dynamic instances
of a moving object are played simultaneously. This is an extension of the still
stroboscopic picture. Previous approaches to video abstraction addressed mostly temporal redundancy by selecting representative key-frames or time intervals. In dynamic video synopsis, the activity is shifted into a significantly shorter period, in which it is much denser.
Introduction
Digital video
is an emerging force in today’s computer and telecommunication industries. The
rapid growth of the Internet, in terms of both bandwidth and the number of
users, has pushed all multimedia technology forward including video streaming.
Continuous hardware developments have reached the point where personal
computers are powerful enough to handle the high storage and computational
demands of digital video applications. DVD, which delivers high quality digital
video to consumers, is rapidly penetrating the market. Moreover, the advances
in digital cameras and camcorders have made it quite easy to capture a video
and then load it into a computer in digital form. Many companies, universities and even ordinary families already have large repositories of videos in both analog and digital formats, such as broadcast news, training and education videos, advertising and commercials, monitoring and surveillance footage, and home videos. All of these trends indicate a promising future for the world of digital video.
The fast evolution of digital video has brought many new applications; consequently, research and development of new technologies is greatly needed to lower the costs of video archiving, cataloging and indexing, and to improve the efficiency, usability and accessibility of stored videos. Among all possible research areas, one important topic is how to enable quick browsing of a large collection of video data and how to achieve efficient content access and representation. To address these issues, video abstraction techniques have emerged and have been attracting more research interest in recent years.
Video abstraction, as the name implies, is a short
summary of the content of a longer video document. Specifically, a video abstract
is a sequence of still or moving images representing the content of a video in
such a way that the target party is rapidly provided with concise information
about the content while the essential message of the original is well
preserved.
Theoretically, a video abstract can be generated both manually and automatically, but due to the huge volumes of video data and limited manpower, it is becoming increasingly important to develop fully automated video analysis and processing tools that reduce human involvement in the video abstraction process.
There are two fundamentally different kinds of abstracts: still-image and moving-image abstracts. The still-image abstract, also known as a static storyboard, is a small collection of salient images extracted or generated from the underlying video source. This type of abstract is called a video summary. The moving-image abstract, also known as a moving storyboard or multimedia summary, consists of a collection of image sequences, as well as the corresponding audio, extracted from the original sequence; it is thus itself a video clip, but of considerably shorter length. This type of abstract is called video skimming.
There are some significant differences between a video summary and video skimming. A video summary can be built much faster, since generally only visual information is utilized and no handling of audio and textual information is needed. Therefore, once composed, it is displayed more easily, since there are no timing or synchronization issues. Besides, the extracted representative frames, though temporally ordered in the source, can be arranged spatially so that users are able to grasp the video content more quickly. Finally, all extracted stills can be printed out very easily when needed.
Characteristics of a Video
Video is the
technology of electronically capturing, recording, processing, storing,
transmitting, and reconstructing a sequence of still images representing scenes
in motion. All videos are characterized by the following.
Number of frames per
second
Also called frame rate, the number of still pictures per unit of time of video ranges from six or eight frames per second (frame/s) for old mechanical cameras to 120 or more frames per second for new professional cameras. The PAL (Europe, Asia, Australia, etc.) and SECAM (France, Russia, parts of Africa, etc.) standards specify 25 frame/s, while NTSC (USA, Canada, Japan, etc.) specifies 29.97 frame/s. Film is shot at the slower frame rate of 24 frame/s, which slightly complicates the process of transferring a cinematic motion picture to video. The minimum frame rate needed to achieve the illusion of a moving image is about fifteen frames per second.
Interlacing
Video can be
interlaced or progressive. Interlacing was invented as a way to achieve good
visual quality within the limitations of a narrow bandwidth. The horizontal
scan lines of each interlaced frame are numbered consecutively and partitioned
into two fields: the odd field (upper field) consisting of the odd-numbered
lines and the even field (lower field) consisting of the even-numbered lines.
NTSC, PAL and SECAM are interlaced formats. Abbreviated video resolution
specifications often include an “i” to indicate interlacing. For example, PAL
video format is often specified as 576i50, where 576 stands for the vertical
line resolution, “i” indicates interlacing, and 50 stands for 50 fields (half-frames)
per second.
In progressive
scan systems, each refresh period updates all of the scan lines. The result is
a higher perceived resolution and a lack of various artifacts that can make
parts of a stationary picture appear to be moving or flashing.
A procedure
known as de-interlacing can be used for converting an interlaced stream, such
as analog, DVD, or satellite, to be processed by progressive scan devices, such
as TFT TV-sets, projectors, and plasma panels. De-interlacing cannot, however,
produce a video quality that is equivalent to true progressive scan source
material.
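As an illustration, a very simple intra-frame de-interlacing scheme can be sketched in a few lines of Python (a minimal sketch assuming the frame is available as a NumPy array; real de-interlacers use far more sophisticated motion-adaptive methods):

import numpy as np

def deinterlace_linear(frame: np.ndarray) -> np.ndarray:
    # Keep the even (top-field) lines and rebuild each odd line by
    # averaging its two even neighbours -- a crude line-doubling scheme.
    out = frame.astype(np.float32).copy()
    for y in range(1, frame.shape[0] - 1, 2):   # odd scan lines
        out[y] = 0.5 * (out[y - 1] + out[y + 1])
    return out.astype(frame.dtype)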
Display resolution
The size of a video image is measured in pixels for digital video, or in horizontal scan lines and vertical lines of resolution for analog video. In the digital domain (e.g. DVD), standard-definition television (SDTV) is specified as 720/704/640×480i60 for NTSC and 768/720×576i50 for PAL or SECAM. In the analog domain, however, the number of visible scan lines remains constant (486 NTSC/576 PAL) while the horizontal measurement varies with the quality of the signal: approximately 320 pixels per scan line for VCR quality, 400 pixels for TV broadcasts, and 720 pixels for DVD sources. Aspect ratio is preserved because of non-square pixels.
New high-definition televisions (HDTV) are capable of resolutions up to 1920×1080p60, i.e. 1920 pixels per scan line by 1080 scan lines, progressive, at 60 frames per second. Video resolution for 3D video is measured in voxels (volume picture elements, each representing a value in three-dimensional space). For example, a resolution of 512×512×512 voxels, now used for simple 3D video, can be displayed even on some PDAs.
Aspect ratio
Aspect ratio
describes the dimensions of video screens and video picture elements. All
popular video formats are rectilinear, and so can be described by a ratio
between width and height. The screen aspect ratio of a traditional television
screen is 4:3, or about 1.33:1. High definition televisions use an aspect ratio
of 16:9, or about 1.78:1. The aspect ratio of a full 35 mm film frame with
soundtrack (also known as the Academy ratio) is 1.375:1.
Pixels on computer monitors are usually square, but pixels used in digital video often have non-square aspect ratios, such as those used in the PAL and NTSC variants of the CCIR 601 digital video standard and the corresponding anamorphic widescreen formats. Therefore, an NTSC DV image of 720×480 pixels is displayed with a 4:3 aspect ratio (the traditional television standard) if the pixels are narrow, and with a 16:9 aspect ratio (the anamorphic widescreen format) if the pixels are wide.
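The relationship between storage resolution, pixel aspect ratio and display aspect ratio can be verified with a small calculation (using the commonly quoted BT.601-derived pixel aspect ratios of 10/11 and 40/33 for the 704-pixel clean aperture; exact values used in practice vary):

from fractions import Fraction

def display_aspect(width_px, height_px, pixel_aspect):
    # Display aspect ratio = storage aspect ratio x pixel aspect ratio
    return Fraction(width_px, height_px) * pixel_aspect

print(display_aspect(704, 480, Fraction(10, 11)))  # 4/3  -> traditional 4:3
print(display_aspect(704, 480, Fraction(40, 33)))  # 16/9 -> anamorphic widescreen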
Color space and bits per
pixel
The color model describes how color is represented in the video. The number of distinct colors that can be represented by a pixel depends on the number of bits per pixel (bpp). A common way to reduce the number of bits per pixel in digital video is chroma subsampling (e.g. 4:4:4, 4:2:2, 4:2:0, 4:1:1).
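A toy illustration of 4:2:0 subsampling in Python (a sketch only; it assumes a YCbCr image with even dimensions stored as a NumPy array, and ignores the chroma-siting conventions of real codecs):

import numpy as np

def subsample_420(ycbcr: np.ndarray):
    # Keep luma (Y) at full resolution; average each 2x2 block of Cb and Cr,
    # halving the chroma resolution in both directions (about 12 bpp vs 24 bpp).
    y = ycbcr[:, :, 0]
    h, w = y.shape
    cb = ycbcr[:, :, 1].astype(np.float32).reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    cr = ycbcr[:, :, 2].astype(np.float32).reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return y, cb, cr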
Video
Summary
As mentioned in the Introduction, a video summary is a set of salient images, called keyframes, which are selected or reconstructed from an original video sequence. Therefore, selecting salient images (keyframes) from all the frames of an original video is the key step in producing a video summary. The different methods used for making video summaries will be discussed in the following subsections.
Video Summary techniques can be broadly classified into Shot-based and Segment-based Video Summary techniques, which are the two major approaches to extracting the keyframes that constitute a video summary.
Shot-based Keyframe
Extraction
Since a shot is defined as a video segment within a continuous capture period, a natural and straightforward way of keyframe extraction is to use the first frame of each shot as its keyframe. However, while sufficient for stationary shots, one keyframe per shot does not provide an acceptable representation of dynamic visual content; multiple keyframes therefore need to be extracted by adapting to the underlying semantic content. Since computer vision remains a very difficult research challenge, most existing work interprets the content using low-level visual features such as color and motion instead of attempting full semantic understanding. In this report, based on the features these works employ, we categorize them into the following three classes: color-based approaches, motion-based approaches and others.
Color-based approach
The keyframes are extracted in a sequential fashion for each shot. In particular, the first frame within the shot is always chosen as the first keyframe, and then the color-histogram difference between each subsequent frame and the latest keyframe is computed. Once the difference exceeds a certain threshold, a new keyframe is declared.
One possible problem with the above extraction method is that the first frame may be part of a transition effect at the shot boundary, strongly reducing its representative quality. As an alternative, keyframes can be extracted using an unsupervised clustering scheme. Basically, all video frames within a shot are first clustered into a certain number of clusters based on color-histogram similarity, with a predefined threshold controlling the density of each cluster. Next, all clusters that are big enough are considered key clusters, and a representative frame closest to the cluster centroid is extracted from each of them. Because the color histogram is invariant to image orientation and robust to background noise, color-based keyframe extraction algorithms have been widely used. However, most of these works are heavily threshold-dependent and cannot capture the underlying dynamics well when there is a lot of camera or object motion.
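The sequential thresholding scheme described above can be sketched as follows (an illustrative Python/OpenCV sketch; the histogram parameters and threshold value are assumptions, not values from any particular paper):

import cv2

def extract_keyframes(video_path, threshold=0.4):
    # First frame is always a keyframe; a new keyframe is declared whenever
    # the histogram distance to the latest keyframe exceeds the threshold.
    cap = cv2.VideoCapture(video_path)
    keyframes, last_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        cv2.normalize(hist, hist)
        if last_hist is None or cv2.compareHist(
                last_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > threshold:
            keyframes.append(idx)
            last_hist = hist
        idx += 1
    cap.release()
    return keyframes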
Motion-based approach
Motion-based approaches are relatively better suited to controlling the number of frames based on the temporal dynamics in the scene. In general, pixel-based image differences or optical-flow computations are commonly used in this approach. In one approach, the optical flow for each frame is first computed, and then a simple motion metric is derived. Finally, by analyzing the metric as a function of time, the frames at the local minima of motion are selected as keyframes.
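A minimal sketch of this idea, assuming the shot is already available as a list of grayscale frames (the optical-flow parameters are illustrative defaults):

import cv2
import numpy as np

def motion_keyframes(gray_frames):
    # Sum of dense optical-flow magnitudes as a simple per-frame motion metric;
    # frames at local minima of the metric (moments of least motion) are kept.
    metric = []
    for prev, nxt in zip(gray_frames, gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        metric.append(np.linalg.norm(flow, axis=2).sum())
    return [i for i in range(1, len(metric) - 1)
            if metric[i] < metric[i - 1] and metric[i] < metric[i + 1]]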
A domain-specific keyframe selection method has been proposed in which a summary is generated for video-taped presentations; sophisticated global motion and gesture analysis algorithms are developed for this purpose. Other works employ three different operation levels based on the available machine resources: at the lowest level, pixel-based frame differences are computed to generate a “temporal activity curve”, since this requires minimal resources; at level 2, color-histogram-based frame differences are computed to extract “color activity segments”; and at level 3, sophisticated camera motion analysis is carried out to estimate the camera parameters and detect “motion activity segments”. Keyframes are then selected from each segment, and any necessary elimination is applied to obtain the final result.
Video
Skimming
Video skimming consists of a collection of image sequences, along with the related audio, from an original video. It conveys a higher level of semantic meaning of the original video than a video summary does. We will discuss video skimming in the following two subsections according to its classification: highlight and summary sequence.
Highlight
A highlight contains the most interesting parts of a video. It is similar to a movie trailer, showing the most attractive scenes without revealing the ending of the film. Thus, highlights are frequently used in the film domain. A general method for producing highlights is discussed here.
The basic idea of producing a highlight is to extract the most interesting and exciting scenes that contain important people, sounds, and actions, and then concatenate them. Pfeiffer et al. (1996) used visual features to produce a highlight of a feature film and stated that a good cinema trailer must have the following five features: (1) important objects/people, (2) action, (3) mood, (4) dialog, and (5) a disguised ending. These features mean that a highlight should include important objects and people appearing in the original film, plenty of action to attract viewers, the basic mood of the movie, and dialogs containing important information. Finally, the highlight needs to hide the ending of the movie.
In the VAbstract system, a scene is considered the basic entity for a highlight. Therefore, scene boundary detection is performed first using existing techniques. The system then finds high-contrast scenes to fulfill trailer Feature 1, high-motion scenes to fulfill Feature 2, scenes with a basic color composition similar to the average color composition of the whole movie to fulfill Feature 3, scenes containing dialog to fulfill Feature 4, and excludes any scene from the last part of the original video to fulfill Feature 5. Finally, all the selected scenes are concatenated in temporal order to form the movie trailer. The figure below shows the VAbstract system algorithm.
We will now discuss the main steps in the VAbstract system: scene boundary detection, extraction of dialog scenes, extraction of high-motion scenes, and extraction of average-color scenes.
1. Scene Boundary Detection: Scene changes can be determined by a combination of video- and audio-cut detection. Video-cut detection finds sharp transitions, namely cuts, between frames; the results of video-cut detection are shots. To group related shots into a scene, audio-cut detection is used. A video cut can be detected using the color histogram: if the color-histogram difference between two consecutive frames exceeds a threshold, a cut is declared.
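A hedged sketch of such a histogram-based cut detector (the bin count and threshold are illustrative, not the values used by VAbstract):

import cv2
import numpy as np

def detect_cuts(video_path, threshold=0.5):
    # Declare a cut between frames idx-1 and idx when the normalized
    # grayscale-histogram difference of consecutive frames exceeds threshold.
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256]).ravel()
        hist /= hist.sum()
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            cuts.append(idx)
        prev_hist = hist
        idx += 1
    cap.release()
    return cuts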
2. Extraction of Dialog Scenes: A heuristic method is used to detect dialog scenes. It is based on the finding that a dialog is characterized by the existence of two “a”s with significantly different fundamental frequencies, which indicates that those two “a”s are spoken by two different people. Therefore, the audio track is first transformed into a short-term frequency spectrum and then normalized for comparison with the spectrum of a spoken “a.” Because “a” is spoken as a long sound and occurs frequently in most conversations, this heuristic method is easy to implement and effective in practice.
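The spectral-comparison step might look roughly like the following sketch (the reference spectrum, window length and similarity threshold are all assumptions; the original heuristic additionally compares the fundamental frequencies of the detected vowels to distinguish two speakers):

import numpy as np

def vowel_like_frames(audio, ref_spectrum, frame_len=2048, sim_thresh=0.8):
    # Compare the normalized short-term magnitude spectrum of each audio frame
    # with a reference spectrum of a spoken "a" using cosine similarity.
    # ref_spectrum is assumed to have length frame_len // 2 + 1.
    ref = ref_spectrum / (np.linalg.norm(ref_spectrum) + 1e-9)
    hits = []
    for i, start in enumerate(range(0, len(audio) - frame_len + 1, frame_len)):
        spec = np.abs(np.fft.rfft(audio[start:start + frame_len]))
        spec /= (np.linalg.norm(spec) + 1e-9)
        if float(spec @ ref) > sim_thresh:
            hits.append(i)
    return hits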
3. Extraction of
High-Motion Scene: Motion in a scene
often includes camera motion, object motion, or both. A scene with a high
degree of motion will be included in the highlight.
4. Extraction of Average-Color Scenes: A video’s mood is embodied by the colors of its frames, so the scenes in the highlight should have color compositions similar to that of the entire video. Here, color composition refers to physical color properties such as luminance, hue, and saturation. The system computes the average color composition of the entire video and finds scenes whose color compositions are similar to this average.
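One way to sketch this step, assuming the video has already been segmented into scenes and each scene is given as a list of HSV frames (a simplification of the color-composition statistics used by VAbstract):

import numpy as np

def scenes_near_average_color(scene_frames, top_k=3):
    # Mean HSV color per scene; pick the scenes closest to the overall mean.
    scene_means = np.array([np.mean([f.reshape(-1, 3).mean(axis=0) for f in scene],
                                    axis=0)
                            for scene in scene_frames])
    video_mean = scene_means.mean(axis=0)
    distances = np.linalg.norm(scene_means - video_mean, axis=1)
    return list(np.argsort(distances)[:top_k])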
Video Synopsis
Video synopsis (or abstraction) is a temporally compact representation that aims to enable video browsing and retrieval. We present an approach to video synopsis which optimally reduces the spatio-temporal redundancy in video. As an example, consider the schematic video clip represented as a space-time volume in Fig. 1. The video begins with a person walking on the ground, and after a period of inactivity a bird flies in the sky. The inactive frames are omitted in most video abstraction methods. Video synopsis is substantially more compact: the person and the bird are played simultaneously. This makes optimal use of image regions by shifting events from their original time interval to another time interval at which no other activity takes place at the same spatial location. Such manipulations relax the chronological consistency of events.
Approaches
There are two main approaches to video synopsis (or video abstraction). In one approach, a set of salient images (key frames) is selected from the original video sequence; the key frames selected are the ones that best represent the video. In another approach, a collection of short video sequences is selected. The second approach is less compact, but gives a better impression of the scene dynamics. These approaches (and others) are described in comprehensive surveys on video abstraction. In both approaches, entire frames are used as the fundamental building blocks. A different methodology uses mosaic images together with some meta-data for video indexing; in this case the static synopsis image includes objects from different times.
Activity
Detection
This work assumes that every input pixel has been labeled with its level of “activity”. Evaluating the activity level is out of the scope of our work, and can be done using one of various methods for detecting irregularities [4, 17], moving-object detection, or object tracking. For our experiments we selected a simple and commonly used activity indicator, where an input pixel I(x, y, t) is labeled as “active” if its color difference from the temporal median at location (x, y) is larger than a given threshold.
Active pixels are defined by the characteristic function χ(p) = 1 if p is active, and χ(p) = 0 otherwise.
To clean the activity indicator from noise, a median filter is applied to
χ before continuing with the synopsis process.
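The activity indicator and its cleanup can be sketched as follows (assuming the clip fits in memory as a (T, H, W) grayscale array; the threshold is an assumption):

import cv2
import numpy as np

def activity_mask(frames, threshold=30.0):
    # A pixel is "active" when it differs from the temporal median at the
    # same location by more than the threshold; a 3x3 median filter then
    # removes isolated noise pixels from the binary indicator chi.
    median_bg = np.median(frames, axis=0)
    chi = (np.abs(frames.astype(np.float32) - median_bg) > threshold).astype(np.uint8)
    return np.stack([cv2.medianBlur(m * 255, 3) // 255 for m in chi])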
Lossless Video Synopsis
For some applications, such as video surveillance, we may prefer a longer synopsis video in which all activities are guaranteed to appear. In this case, the objective is not to select a subset of object segments, but rather to find a compact temporal rearrangement of the object segments.
Again, we use Simulated Annealing to minimize the energy. In this case, a state corresponds to a set of time shifts for all segments, and two states are defined as neighbors if their time shifts differ for only a single segment. There are two issues that should be noted in this case (a sketch of the annealing step follows these notes):
• Object segments that appear in the first or last frames should remain so in the synopsis video (otherwise they may suddenly appear or disappear). We ensure that each state satisfies this constraint by fixing the temporal shifts of all these objects accordingly.
• The temporal arrangement of the input video is commonly a local minimum of the energy function, and is therefore not a preferable choice for initializing the annealing process. We initialized our Simulated Annealing with a shorter video in which all objects overlap.
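A generic sketch of this annealing step (the energy function, move size and cooling schedule are placeholders; the constraint on segments touching the first or last frame is omitted for brevity):

import math
import random

def anneal_time_shifts(num_segments, energy, iters=10000, t0=1.0, cooling=0.999):
    # A state is one time shift per segment; a neighboring state changes the
    # shift of a single segment.  Worse states are accepted with a probability
    # that decreases as the temperature cools.
    shifts = [0] * num_segments
    best, best_e = list(shifts), energy(shifts)
    cur_e, temp = best_e, t0
    for _ in range(iters):
        cand = list(shifts)
        i = random.randrange(num_segments)
        cand[i] += random.choice([-1, 1])
        e = energy(cand)
        if e < cur_e or random.random() < math.exp((cur_e - e) / max(temp, 1e-9)):
            shifts, cur_e = cand, e
            if e < best_e:
                best, best_e = list(cand), e
        temp *= cooling
    return best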
Panoramic Video Synopsis
When a video camera is scanning a scene, much redundancy can be eliminated by using a panoramic mosaic. Yet existing methods construct a single panoramic image, in which the scene dynamics are lost. Limited dynamics can be represented by a stroboscopic image, where moving objects are displayed at several locations along their paths. A panoramic synopsis video can be created by simultaneously displaying actions that took place at different times in different regions of the scene. A substantial condensation may be obtained, since the duration of activity for each object is limited to the time it is being viewed by the camera. A special case is when the camera tracks an object (such as the running lioness shown in Fig. 7); in this case, a short video synopsis can be obtained only by allowing the stroboscopic effect. Constructing the panoramic video synopsis is done in a similar manner to the regular video synopsis, with a preliminary stage of aligning all the frames to some reference frame.
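The preliminary alignment stage might be sketched with standard feature matching and a homography (an illustrative OpenCV sketch, not the alignment method used in the original work):

import cv2
import numpy as np

def align_to_reference(ref_gray, frame_gray):
    # Match ORB features between the frame and the reference, estimate a
    # homography with RANSAC, and warp the frame into the reference plane.
    orb = cv2.ORB_create(1000)
    k1, d1 = orb.detectAndCompute(ref_gray, None)
    k2, d2 = orb.detectAndCompute(frame_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d2, d1), key=lambda m: m.distance)[:200]
    src = np.float32([k2[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k1[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return cv2.warpPerspective(frame_gray, H, (ref_gray.shape[1], ref_gray.shape[0]))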
Video Indexing Through
Video Synopsis
Video synopsis can be used for video indexing, providing the user with efficient and intuitive links for accessing actions in videos. This can be done by associating with every synopsis pixel a pointer to the appearance of the corresponding object in the original video. In video synopsis, the information of the video is projected into the “space of activities”, in which only activities matter, regardless of their temporal context (although we still preserve the spatial context). As activities are concentrated in a short period, specific activities in the video can be accessed with ease.
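The pointer association can be sketched as a simple lookup table (a sketch under the assumption that each synopsis pixel already carries the id of the object rendered there):

import numpy as np

def build_index_map(synopsis_labels, object_start_frames):
    # synopsis_labels: (T, H, W) array of object ids per synopsis pixel (0 = background).
    # object_start_frames: maps object id -> frame where that object first appears
    # in the original video.  Clicking a synopsis pixel can then jump to that frame.
    index_map = np.zeros_like(synopsis_labels, dtype=np.int32)
    for obj_id, start in object_start_frames.items():
        index_map[synopsis_labels == obj_id] = start
    return index_map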
Conclusion
Two video synopsis approaches were presented. One approach uses low-level graph optimization, where each pixel in the synopsis video is a node in the graph. This approach has the benefit of obtaining the synopsis video directly from the input video, but the complexity of the solution may be very high. An alternative approach is to first detect moving objects and perform the optimization on the detected objects. While a preliminary step of motion segmentation is needed in the second approach, it is much faster, and object-based constraints are possible.
The activity in the resulting video synopsis is much more condensed than the activity in any ordinary video, and viewing such a synopsis may seem awkward to the inexperienced viewer. Special attention should be given to the possibility of obtaining dynamic stroboscopy. While allowing a further reduction in the length of the video synopsis, dynamic stroboscopy may need further adaptation from the user: it does take some training to realize that multiple spatial occurrences of a single object indicate a longer activity time. While we have detailed a specific implementation for dynamic video synopsis, many extensions are straightforward.