Video Abstraction - Seminar Report


Video Abstraction
Abstract              
Video abstraction is an emerging technology concerned with generating a concise summary of a video. It is particularly relevant to archiving and retrieving huge volumes of video files. Various techniques are evolving to create ideal abstracts that reflect the nature and content of the original. Video abstraction is one of the most important research topics in this area, since it enables quick browsing of large collections of video data and efficient content access and representation.
A newer technique is dynamic video synopsis, in which most of the activity in the video is condensed by simultaneously showing several actions, even when they originally occurred at different times. For example, we can create a “stroboscopic movie”, in which multiple dynamic instances of a moving object are played simultaneously; this is an extension of the still stroboscopic picture. Previous approaches to video abstraction addressed mostly temporal redundancy by selecting representative key-frames or time intervals. In dynamic video synopsis, the activity is shifted into a significantly shorter period, in which the activity is much denser.

Introduction
Digital video is an emerging force in today’s computer and telecommunication industries. The rapid growth of the Internet, in terms of both bandwidth and the number of users, has pushed all multimedia technology forward including video streaming. Continuous hardware developments have reached the point where personal computers are powerful enough to handle the high storage and computational demands of digital video applications. DVD, which delivers high quality digital video to consumers, is rapidly penetrating the market. Moreover, the advances in digital cameras and camcorders have made it quite easy to capture a video and then load it into a computer in digital form. Many companies, universities and even ordinary families already have large repositories of videos both in analog and digital formats, such as the broadcast news, training and education videos, advertising and commercials, monitoring, surveying and home videos. All of these trends are indicating a promising future for the world of digital video.
The fast evolution of digital video has brought many new applications. Consequently, research and development of new technologies that lower the costs of video archiving, cataloging and indexing, and that improve the efficiency, usability and accessibility of stored videos, are greatly needed. Among all possible research areas, one important topic is how to enable quick browsing of a large collection of video data and how to achieve efficient content access and representation. To address these issues, video abstraction techniques have emerged and have attracted growing research interest in recent years.
Video abstraction, as the name implies, is a short summary of the content of a longer video document. Specifically, a video abstract is a sequence of still or moving images representing the content of a video in such a way that the target party is rapidly provided with concise information about the content while the essential message of the original is well preserved.
Theoretically a video abstract can be generated both manually and automatically, but due to the huge volumes of video data and limited manpower, it is increasingly important to develop fully automated video analysis and processing tools that reduce human involvement in the video abstraction process.
There are two fundamentally different kinds of abstracts: still-image and moving-image abstracts. The still-image abstract, also known as a static storyboard, is a small collection of salient images extracted or generated from the underlying video source. This type of abstract is called a video summary. The moving-image abstract, also known as a moving storyboard or multimedia summary, consists of a collection of image sequences, as well as the corresponding audio abstract extracted from the original sequence, and is thus itself a video clip of considerably shorter length. This type of abstract is called a video skimming.
There are some significant differences between video summary and video skimming. A video summary can be built much faster, since generally only visual information is utilized and no handling of audio or textual information is needed. Once composed, it is also displayed more easily, since there are no timing or synchronization issues. In addition, the temporal order of the extracted representative frames can be laid out spatially, so that users are able to grasp the video content more quickly. Finally, all extracted stills can easily be printed out when needed.
      
Characteristics of a Video
          Video is the technology of electronically capturing, recording, processing, storing, transmitting, and reconstructing a sequence of still images representing scenes in motion. All videos are characterized by the following.

Number of frames per second
Also called frame rate, the number of still pictures per unit of time of video ranges from six or eight frames per second (frame/s) for old mechanical cameras to 120 or more frames per second for new professional cameras. PAL (Europe, Asia, Australia, etc.) and SECAM (France, Russia, parts of Africa, etc.) standards specify 25 frame/s, while NTSC (USA, Canada, Japan, etc.) specifies 29.97 frame/s. Film is shot at the slower frame rate of 24 frame/s, which slightly complicates the process of transferring a cinematic motion picture to video. The minimum frame rate to achieve the illusion of a moving image is about fifteen frames per second.

Interlacing
Video can be interlaced or progressive. Interlacing was invented as a way to achieve good visual quality within the limitations of a narrow bandwidth. The horizontal scan lines of each interlaced frame are numbered consecutively and partitioned into two fields: the odd field (upper field) consisting of the odd-numbered lines and the even field (lower field) consisting of the even-numbered lines. NTSC, PAL and SECAM are interlaced formats. Abbreviated video resolution specifications often include an “i” to indicate interlacing. For example, PAL video format is often specified as 576i50, where 576 stands for the vertical line resolution, “i” indicates interlacing, and 50 stands for 50 fields (half-frames) per second.
In progressive scan systems, each refresh period updates all of the scan lines. The result is a higher perceived resolution and a lack of various artifacts that can make parts of a stationary picture appear to be moving or flashing.
A procedure known as de-interlacing can be used for converting an interlaced stream, such as analog, DVD, or satellite, to be processed by progressive scan devices, such as TFT TV-sets, projectors, and plasma panels. De-interlacing cannot, however, produce a video quality that is equivalent to true progressive scan source material.

Display resolution
The size of a video image is measured in pixels for digital video, or horizontal scan lines and vertical lines of resolution for analog video. In the digital domain (e.g. DVD), standard-definition television (SDTV) is specified as 720/704/640×480i60 for NTSC and 768/720×576i50 for PAL or SECAM resolution. However, in the analog domain, the number of visible scanlines remains constant (486 NTSC/576 PAL) while the horizontal measurement varies with the quality of the signal: approximately 320 pixels per scanline for VCR quality, 400 pixels for TV broadcasts, and 720 pixels for DVD sources. Aspect ratio is preserved because of non-square "pixels".
New high-definition televisions (HDTV) are capable of resolutions up to 1920×1080p60, i.e. 1920 pixels per scan line by 1080 scan lines, progressive, at 60 frames per second. Video resolution for 3D-video is measured in voxels (volume picture element, representing a value in the three dimensional space). For example 512×512×512 voxels resolution, now used for simple 3D-video, can be displayed even on some PDAs.

Aspect ratio
Aspect ratio describes the dimensions of video screens and video picture elements. All popular video formats are rectilinear, and so can be described by a ratio between width and height. The screen aspect ratio of a traditional television screen is 4:3, or about 1.33:1. High definition televisions use an aspect ratio of 16:9, or about 1.78:1. The aspect ratio of a full 35 mm film frame with soundtrack (also known as the Academy ratio) is 1.375:1.
Pixels on computer monitors are usually square, but pixels used in digital video often have non-square aspect ratios, such as those used in the PAL and NTSC variants of the CCIR 601 digital video standard, and the corresponding anamorphic widescreen formats. Therefore, an NTSC DV image which is 720 pixels by 480 pixels is displayed with the aspect ratio of 4:3 (which is the traditional television standard) if the pixels are thin and displayed with the aspect ratio of 16:9 (which is the anamorphic widescreen format) if the pixels are fat.
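As a hedged illustration of how non-square pixels determine the displayed picture shape, the short Python sketch below converts the NTSC DV storage resolution into square-pixel display widths; the pixel aspect ratios and the 704-pixel active width are the commonly cited CCIR 601 figures, used here only as assumptions for illustration.

# The 720 stored pixels per line include some horizontal blanking;
# the active picture is commonly taken to be 704 pixels wide.
ACTIVE_WIDTH, HEIGHT = 704, 480
PAR_4_3 = 10 / 11      # "thin" pixels -> 4:3 display
PAR_16_9 = 40 / 33     # "fat" pixels  -> 16:9 anamorphic display

for name, par in (("4:3", PAR_4_3), ("16:9", PAR_16_9)):
    display_width = ACTIVE_WIDTH * par                  # width in square-pixel units
    print(f"{name}: {display_width:.0f} x {HEIGHT} -> ratio {display_width / HEIGHT:.3f}")
# prints "4:3: 640 x 480 -> ratio 1.333" and "16:9: 853 x 480 -> ratio 1.778"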

Color space and bits per pixel
The color model describes how a video represents color. The number of distinct colors that can be represented by a pixel depends on the number of bits per pixel (bpp). A common way to reduce the number of bits per pixel in digital video is chroma subsampling (e.g. 4:4:4, 4:2:2, 4:2:0, 4:1:1).
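To make the saving concrete, the sketch below counts the storage of a single 4:2:0 frame, assuming 8 bits per sample; the frame size is arbitrary and the calculation is an illustration only.

import numpy as np

h, w = 480, 720
y  = np.zeros((h, w), dtype=np.uint8)               # full-resolution luma plane
cb = np.zeros((h // 2, w // 2), dtype=np.uint8)     # chroma plane, halved in both directions
cr = np.zeros((h // 2, w // 2), dtype=np.uint8)

bits = 8 * (y.size + cb.size + cr.size)
print(bits / (h * w))    # 12.0 bits per pixel, versus 24 for unsubsampled 4:4:4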

Video Summary
As mentioned in the Introduction, a video summary is a set of salient images called keyframes which are selected or reconstructed from an original video sequence. Selecting these salient keyframes from among all the frames of the original video is therefore the central step in producing a video summary. The different methods used for making video summaries are discussed in the following subsections.
Video Summary techniques can be broadly classified into Shot-based Video Summary techniques and Segment-based Video Summary techniques, which are the two major methods of extracting the keyframes which constitute a video summary.

Shot-based Keyframe Extraction
       
Since a shot is defined as a video segment captured within a continuous capture period, a natural and straightforward way of keyframe extraction is to use the first frame of each shot as its keyframe. However, while this is sufficient for stationary shots, one keyframe per shot does not provide an acceptable representation of dynamic visual content, so multiple keyframes need to be extracted by adapting to the underlying semantic content. Because semantic understanding of video remains a very difficult research challenge, most existing work interprets the content using low-level visual features such as color and motion instead of attempting full semantic understanding. In this report, based on the features these works employ, we categorize them into the following three classes: color-based approaches, motion-based approaches, and others.

Color-based approach
The keyframes are extracted in a sequential fashion for each shot. Particularly, the first frame within the shot is always chosen as the first keyframe, and then the color-histogram difference between the subsequent frames and the latest keyframe is computed. Once the difference exceeds a certain threshold, a new keyframe will be declared.
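As a minimal sketch of this sequential selection (not the method of any particular paper), the Python snippet below compares OpenCV color histograms of successive frames against the latest keyframe; the bin count, the Bhattacharyya distance measure, and the threshold value are illustrative assumptions.

import cv2

def extract_keyframes(shot_frames, threshold=0.4, bins=32):
    """Return indices of keyframes within a single shot (sequential selection)."""
    def hist(frame):
        # joint 3-channel color histogram, normalized for comparison
        h = cv2.calcHist([frame], [0, 1, 2], None, [bins] * 3, [0, 256] * 3)
        return cv2.normalize(h, h).flatten()

    keyframes = [0]                         # the first frame is always a keyframe
    last_key_hist = hist(shot_frames[0])
    for i, frame in enumerate(shot_frames[1:], start=1):
        h = hist(frame)
        diff = cv2.compareHist(last_key_hist, h, cv2.HISTCMP_BHATTACHARYYA)
        if diff > threshold:                # difference exceeds threshold: new keyframe
            keyframes.append(i)
            last_key_hist = h
    return keyframes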

One possible problem with the above extraction method is that the first frame may be part of a transition effect at the shot boundary, which strongly reduces its representative quality. As an alternative, keyframes can be extracted using an unsupervised clustering scheme. Basically, all video frames within a shot are first clustered into a certain number of clusters based on color-histogram similarity, where a predefined threshold controls the density of each cluster. Next, all clusters that are big enough are considered key clusters, and a representative frame closest to the cluster centroid is extracted from each of them. Because the color histogram is invariant to image orientation and robust to background noise, color-based keyframe extraction algorithms have been widely used. However, most of these works are heavily threshold-dependent and cannot capture the underlying dynamics well when there is a lot of camera or object motion.
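A hedged sketch of the clustering alternative is given below; k-means with a fixed cluster count and a minimum-size cutoff is a simplification chosen for illustration, whereas the scheme described above uses a threshold-controlled unsupervised clustering.

import numpy as np
from sklearn.cluster import KMeans

def cluster_keyframes(histograms, n_clusters=5, min_cluster_size=10):
    """histograms: (n_frames, n_bins) array of per-frame color histograms."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(histograms)
    keyframes = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        if len(members) < min_cluster_size:         # ignore clusters that are too small
            continue
        # keep the member frame whose histogram is closest to the cluster centroid
        dists = np.linalg.norm(histograms[members] - km.cluster_centers_[c], axis=1)
        keyframes.append(int(members[np.argmin(dists)]))
    return sorted(keyframes)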

Motion-based approach
Motion-based approaches are relatively better suited for controlling the number of frames based on temporal dynamics in the scene. In general, pixel-based image differences or optical flow computation are commonly used in this approach. In one approach, the optical flow for each frame is first computed, and then a simple motion metric is computed. Finally by analyzing the metric as a function of time, the frames at the local minima of motion are selected as the keyframes.
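The following sketch illustrates the idea with dense Farneback optical flow from OpenCV; the particular flow estimator, the magnitude-sum metric, and the local-minimum rule are assumptions made for illustration only.

import cv2
import numpy as np

def motion_metric(gray_frames):
    """Sum of dense optical-flow magnitudes between consecutive grayscale frames."""
    metric = []
    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        metric.append(np.abs(flow).sum())
    return np.array(metric)

def local_minima(metric):
    """Frame indices where the motion metric is lower than both neighbours."""
    return [i for i in range(1, len(metric) - 1)
            if metric[i] < metric[i - 1] and metric[i] < metric[i + 1]]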

A domain-specific keyframe selection method has been proposed for summarizing video-taped presentations, in which sophisticated global motion and gesture analysis algorithms are developed. Other works employ three different operation levels depending on the available machine resources: at the lowest level, pixel-based frame differences are computed to generate a “temporal activity curve”, since this requires minimal resources; at level 2, color-histogram-based frame differences are computed to extract “color activity segments”; and at level 3, sophisticated camera motion analysis is carried out to estimate the camera parameters and detect “motion activity segments”. Keyframes are then selected from each segment, and any necessary elimination is applied to obtain the final result.
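As a hedged illustration of the lowest operation level, the snippet below builds a temporal activity curve from plain pixel-wise frame differences; representing frames in grayscale and using the mean absolute difference are assumptions made for simplicity.

import numpy as np

def temporal_activity_curve(gray_frames):
    """gray_frames: (T, H, W) array; returns a length T-1 curve of frame-to-frame activity."""
    frames = np.asarray(gray_frames, dtype=np.float32)
    # mean absolute pixel difference between each pair of consecutive frames
    return np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))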


Video Skimming

Video skimming consists of a collection of image sequences, along with the related audio, taken from an original video. It conveys a higher level of semantic meaning of the original video than a video summary does. We discuss video skimming in the following two subsections according to its classification: highlight and summary sequence.

Highlight
A highlight contains the most interesting parts of a video. It is similar to a movie trailer, showing the most attractive scenes without revealing the ending of the film; highlights are therefore frequently used in the film domain. A general method for producing highlights is discussed here. The basic idea is to extract the most interesting and exciting scenes, those containing important people, sounds, and actions, and then concatenate them. Pfeiffer et al. (1996) used visual features to produce a highlight of a feature film and stated that a good cinema trailer must have the following five features: (1) important objects/people, (2) action, (3) mood, (4) dialog, and (5) a disguised ending. These features mean that a highlight should include important objects and people appearing in the original film, many actions to attract viewers, the basic mood of the movie, and dialogs containing important information. Finally, the highlight needs to hide the ending of the movie.
In the VAbstract system, a scene is considered the basic entity of a highlight. Therefore, scene boundary detection is performed first using existing techniques. The system then finds high-contrast scenes to fulfill trailer Feature 1, high-motion scenes to fulfill Feature 2, scenes whose basic color composition is similar to the average color composition of the whole movie to fulfill Feature 3, scenes containing dialog to fulfill Feature 4, and excludes any scene from the last part of the original video to fulfill Feature 5. Finally, all selected scenes are concatenated in temporal order to form a movie trailer.

We will now discuss the main steps of the VAbstract system: scene boundary detection, extraction of dialog scenes, extraction of high-motion scenes, and extraction of average-color scenes.

1.     Scene Boundary Detection:  Scene changes can be determined by combining video-cut and audio-cut detection. Video-cut detection finds sharp transitions, namely cuts between frames; the results of video-cut detection are shots. To group related shots into a scene, audio-cut detection is used. A video cut can be detected using a color histogram: if the color-histogram difference between two consecutive frames exceeds a threshold, a cut is declared.

2.     Extraction of Dialog Scene:  A heuristic method is used to detect dialog scenes. It is based on the finding that a dialog is characterized by the existence of two “a”s with significantly different fundamental frequencies, which indicates that those two “a”s are spoken by two different people. Therefore, the audio track is first transformed to a short-term frequency spectrum and then normalized to compare with the spectrum of a spoken “a.” Because “a” is spoken as a long sound and occurs frequently in most conversations, this heuristic method is easy to implement and effective in practice.

3.     Extraction of High-Motion Scene: Motion in a scene often includes camera motion, object motion, or both. A scene with a high degree of motion will be included in the highlight.

4.     Extraction of Average Color Scene:  A video's mood is embodied in the colors of its frames. The scenes in the highlight should have color compositions similar to that of the entire video, where color composition covers physical color properties such as luminance, hue, and saturation. The system computes the average color composition of the entire video and finds scenes whose color compositions are similar to the average, as sketched below.
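The selection of average-color scenes in step 4 can be sketched as follows; representing each scene by a mean luminance/hue/saturation vector and using Euclidean distance to the video-wide average are illustrative assumptions, not the exact measure of the VAbstract system.

import numpy as np

def select_average_color_scenes(scene_color_vectors, n_select=3):
    """scene_color_vectors: (n_scenes, 3) mean (luminance, hue, saturation) per scene."""
    vectors = np.asarray(scene_color_vectors, dtype=float)
    video_average = vectors.mean(axis=0)                    # average color composition
    distances = np.linalg.norm(vectors - video_average, axis=1)
    return np.argsort(distances)[:n_select]                 # scenes closest to the average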

Video Synopsis                         
Video synopsis (or abstraction) is a temporally compact representation that aims to enable video browsing and retrieval. We present an approach to video synopsis that optimally reduces the spatio-temporal redundancy in video. As an example, consider the schematic video clip represented as a space-time volume in Fig. 1. The video begins with a person walking on the ground, and after a period of inactivity a bird flies in the sky. Most video abstraction methods simply omit the inactive frames. Video synopsis is substantially more compact: it plays the person and the bird simultaneously, making optimal use of image regions by shifting events from their original time interval to another time interval where no other activity takes place at that spatial location. Such manipulations relax the chronological consistency of events.

Approaches
There are two main approaches to video synopsis (or video abstraction). In one approach, a set of salient images (key frames) is selected from the original video sequence; the key frames selected are the ones that best represent the video. In another approach, a collection of short video sequences is selected. The second approach is less compact, but gives a better impression of the scene dynamics. These approaches (and others) are described in comprehensive surveys on video abstraction. In both approaches, entire frames are used as the fundamental building blocks. A different methodology uses mosaic images together with some meta-data for video indexing; in this case the static synopsis image includes objects from different times.

 Activity Detection
This work assumes that every input pixel has been labeled with its level of “activity”. Evaluation of the activity level is outside the scope of our work and can be done using one of various methods for detecting irregularities [4, 17], moving object detection, or object tracking. For our experiments we selected a simple and commonly used activity indicator, where an input pixel I(x, y, t) is labeled as “active” if its color difference from the temporal median at location (x, y) is larger than a given threshold.

Active pixels are defined by the characteristic function χ(p), where χ(p) = 1 if pixel p is active and χ(p) = 0 otherwise.
To clean the activity indicator from noise, a median filter is applied to χ before continuing with the synopsis process.
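A minimal sketch of this activity indicator is shown below, assuming the frames are available as a NumPy array; the threshold value and the median-filter size are illustrative choices, not values prescribed by the method.

import numpy as np
from scipy.ndimage import median_filter

def activity_mask(frames, threshold=30, filter_size=3):
    """frames: (T, H, W, 3) uint8 array; returns a (T, H, W) boolean activity mask."""
    frames = np.asarray(frames, dtype=np.float32)
    background = np.median(frames, axis=0)                # temporal median at each (x, y)
    diff = np.linalg.norm(frames - background, axis=-1)   # color distance from the median
    chi = diff > threshold                                # characteristic function χ
    # a median filter over space and time removes isolated noisy detections
    return median_filter(chi.astype(np.uint8), size=filter_size).astype(bool)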

Lossless Video Synopsis
For some applications, such as video surveillance, we may prefer a longer synopsis video in which all activities are guaranteed to appear. In this case, the objective is not to select a set of object segments as was done in the previous section, but rather to find a compact temporal rearrangement of the object segments.
Again, we use Simulated Annealing to minimize the energy. In this case, a state corresponds to a set of time shifts for all segments, and two states are defined as neighbors if their time shifts differ for only a single segment. Two issues should be noted in this case:

1.     Object segments that appear in the first or last frames should remain so in the synopsis video (otherwise they may suddenly appear or disappear). We ensure that each state satisfies this constraint by fixing the temporal shifts of all such objects accordingly.

2.     The temporal arrangement of the input video is commonly a local minimum of the energy function, and is therefore not a preferable choice for initializing the annealing process. We initialized our Simulated Annealing with a shorter video, where all objects overlap.
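A hedged sketch of such an annealing loop is given below; the energy function is deliberately left abstract (any measure combining synopsis length and collisions between shifted segments could be plugged in), and the cooling schedule and neighbour move are illustrative choices rather than the parameters of the original work.

import math
import random

def anneal(initial_shifts, energy, max_shift, steps=10000, t0=1.0, cooling=0.999):
    """Minimize energy(shifts) over per-segment time shifts by Simulated Annealing."""
    shifts = list(initial_shifts)
    best, best_e = list(shifts), energy(shifts)
    e, t = best_e, t0
    for _ in range(steps):
        # neighbouring state: change the time shift of one randomly chosen segment
        candidate = list(shifts)
        i = random.randrange(len(candidate))
        candidate[i] = random.randint(0, max_shift)
        ce = energy(candidate)
        # accept improvements always, worse states with a temperature-dependent probability
        if ce < e or random.random() < math.exp((e - ce) / max(t, 1e-9)):
            shifts, e = candidate, ce
            if e < best_e:
                best, best_e = list(shifts), e
        t *= cooling
    return best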

Panoramic Video Synopsis
When a video camera is scanning a scene, much redundancy can be eliminated by using a panoramic mosaic. Yet existing methods construct a single panoramic image in which the scene dynamics are lost. Limited dynamics can be represented by a stroboscopic image where moving objects are displayed at several locations along their paths. A panoramic synopsis video can be created by simultaneously displaying actions that took place at different times in different regions of the scene. A substantial condensation may be obtained, since the duration of activity for each object is limited to the time it is being viewed by the camera. A special case is when the camera tracks an object (such as the running lioness shown in Fig. 7); in this case, a short video synopsis can be obtained only by allowing the stroboscopic effect. Constructing the panoramic video synopsis is done in a similar manner to the regular video synopsis, with a preliminary stage of aligning all the frames to some reference frame.

Video Indexing Through Video Synopsis
Video synopsis can be used for video indexing, providing the user with efficient and intuitive links for accessing actions in videos. This can be done by associating with every synopsis pixel a pointer to the appearance of the corresponding object in the original video. In video synopsis, the information of the video is projected into the “space of activities”, in which only activities matter, regardless of their temporal context (although we still preserve the spatial context). As activities are concentrated in a short period, specific activities in the video can be accessed with ease.
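A hedged sketch of such an index follows; the dictionary keyed by synopsis pixel coordinates is purely an illustrative data structure, not the representation used by any particular system.

def build_index():
    """Map a synopsis pixel (x, y, t_synopsis) to the original frame it came from."""
    return {}

def register(index, x, y, t_synopsis, t_original):
    index[(x, y, t_synopsis)] = t_original

def lookup(index, x, y, t_synopsis):
    """Return the original frame number behind a synopsis pixel, or None if background."""
    return index.get((x, y, t_synopsis))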
Conclusion                      
Two video synopsis approaches were presented. One approach uses low-level graph optimization, where each pixel in the synopsis video is a node in the graph. This approach has the benefit of obtaining the synopsis video directly from the input video, but the complexity of the solution may be very high. An alternative approach is to first detect moving objects and perform the optimization on the detected objects. While a preliminary motion segmentation step is needed in the second approach, it is much faster, and object-based constraints become possible.
The activity in the resulting video synopsis is much more condensed than the activity in any ordinary video, and viewing such a synopsis may seem awkward to the inexperienced viewer. Special attention should be given to the possibility of obtaining dynamic stroboscopy. While allowing a further reduction in the length of the video synopsis, dynamic stroboscopy may require further adaptation by the user: it takes some training to realize that multiple spatial occurrences of a single object indicate a longer activity time. While we have detailed a specific implementation for dynamic video synopsis, many extensions are straightforward.
