Near-duplicate video detection featuring coupled temporal and perceptual visual structures and logical inference based matching

Belkhatir, M., & Tahayna, B. (2011). Near-duplicate video detection featuring coupled temporal and perceptual visual structures and logical inference based matching. Information Processing and Management. doi:10.1016/j.ipm.2011.03.003
Mohammed Belkhatir — Faculty of Computer Science, University of Lyon, Campus de la Doua, 69622 Villeurbanne Cedex, France
Bashar Tahayna — Faculty of Information Technology, Monash University, Sunway Campus, 46150, Malaysia
Keywords: near-duplicate video detection, perceptual visual indexing, logical inference, lattice-based processing, empirical evaluation.
We propose in this paper an architecture for near-duplicate video detection based on: (i) index and query signature-based structures integrating temporal and perceptual visual features, and (ii) a matching framework computing the logical inference between index and query documents. As far as indexing is concerned, instead of concatenating low-level visual features in high-dimensional spaces, which results in curse-of-dimensionality and redundancy issues, we adopt a perceptual symbolic representation based on color and texture concepts. For matching, we instantiate a retrieval model based on logical inference through the coupling of an N-gram sliding window process and theoretically-sound lattice-based structures. The techniques we cover are robust and insensitive to general video editing and/or degradation, making them well suited for re-broadcast video search. Experiments are carried out on large quantities of video data collected from the TRECVID 02, 03 and 04 collections and on real-world video broadcasts recorded from two German TV stations. An empirical comparison with two state-of-the-art dynamic programming techniques is encouraging and demonstrates the advantage and feasibility of our method. © 2011 Published by Elsevier Ltd.
1. Introduction

Near-duplicate video (NDV) detection in large multimedia collections is very important for digital rights management as well as for video retrieval applications. One crucial step for such a task is to define a matching/mismatching measure between two video sequences. Extensive research has been carried out to identify NDVs in video collections (Bertini, Bimbo, & Nunziati, 2006; Hoad & Zobel, 2003, 2006; Joly, Frélicot, & Buisson, 2003; Joly, Buisson, & Frélicot, 2007; Vidal, Marzal, & Aibar, 1995; Zhou & Zhang, 2005). However, existing methods have substantial limitations: they are sensitive to video degradations, expensive to compute, and mostly limited to the comparison of whole video clips. Moreover, much video content is distributed in a continuous stream that cannot easily be segmented for comparison, making these methods unsuitable for applications such as the continuous broadcast-stream monitoring carried out by regulation authorities.

During video editing, inappropriate shots may be deleted and commercial breaks inserted. From the perspective of human perception, however, the initial and edited videos are still regarded as similar (Zhou & Zhang, 2005). Thus, in order to identify duplicates of a specific video, an efficient video matching and scoring framework is required for detecting similar or quasi-similar content. Many existing matching models are not suitable for such a task since they either ignore the temporal dimension or oversimplify the query model. NDV detection requires models for video sequence-to-sequence matching that incorporate the temporal order inherent in video data. For sequence matching to be meaningful, corresponding video contents must be identified in a fixed chronological order while ignoring the in-between mismatching shots that are often artificially introduced in edited videos. To achieve this, many solutions view a sequence of video frames as a string and directly compare the feature sequences of the query and index videos. This method is, however, computationally expensive and sensitive to the changes that occur during video editing. In order to reduce the computational cost, an alternative approach consists in computing a shot-based index structure viewed as a string and then applying string matching algorithms to solve the shot alignment problem. Dynamic programming, the main paradigm used in the literature for computing sequence alignments, nevertheless has limitations: its computational load grows with the number of shots and their duration, and, being an edit distance, it measures how mismatching two videos are rather than how similar they are.
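To make the dynamic programming baseline concrete, the following is a minimal sketch of edit-distance alignment over per-shot label strings. The shot labels, the function name and the uniform unit costs are illustrative assumptions introduced for this example, not taken from the paper; the sketch only illustrates the string-matching approach whose limitations are discussed above.

```python
# Minimal sketch of the dynamic-programming baseline discussed above:
# each video is reduced to a string of per-shot symbolic labels, and
# the edit distance between two such strings measures how MISmatching
# the videos are. Labels, costs and names are illustrative assumptions.

def shot_edit_distance(index_shots, query_shots):
    """Classic Levenshtein distance over shot-label sequences.

    Runs in O(len(index_shots) * len(query_shots)) time, which is why
    the cost grows with the number of shots, as noted in the text.
    """
    m, n = len(index_shots), len(query_shots)
    # dp[i][j] = edit distance between the first i index shots
    # and the first j query shots.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all remaining index shots
    for j in range(n + 1):
        dp[0][j] = j          # insert all remaining query shots
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if index_shots[i - 1] == query_shots[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

# An edited copy with one inserted commercial shot ("X") yields a low
# distance, but the measure quantifies mismatch, not similarity.
print(shot_edit_distance(list("ABCDE"), list("ABXCDE")))  # -> 1
```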
In this paper, we propose a near-duplicate video detection framework based on signature-based index structures featuring perceptual visual attributes, and a matching and scoring framework relying on logical inference. As far as indexing is concerned, concatenating low-level visual features (color, texture, etc.) in high-dimensional spaces traditionally results in curse-of-dimensionality and redundancy issues. Moreover, it usually requires normalization, which may cause an undesirable distortion of the feature space. Indeed, since low-level visual features (color and texture) are of high dimensionality (typically on the order of 10^2–10^3) and data in high-dimensional spaces are sparse, enough observations must be gathered to make sure that the estimation is viable. It is therefore crucial to reduce the dimensionality of the visual feature representation spaces. Contrary to state-of-the-art approaches for dimensionality reduction (such as principal component analysis, multidimensional scaling and singular value decomposition), which are opaque (i.e. they reduce the dimensionality of input spaces without making it possible to understand the meaning of elements in the reduced feature space), our framework is based on a transparent, readable characterization: we reduce the dimensionality of signal features through a perceptual symbolic representation of the visual features based on color and texture concepts. A matching framework relying on the logical inference between index and query documents is instantiated through an N-gram sliding window technique coupled with fast lattice-based processing. Near-duplicate videos are here defined as sets of pair-wise matched sequences, subject to constraints induced by frame-rate conversions and editing operations, which abundantly exist in real-world applications.
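As an illustration of the two ideas above, the following sketch maps each shot to a symbolic color/texture signature and scans a query stream with an N-gram sliding window. The concept vocabulary, the overlap score and all identifiers are assumptions introduced for this example; the paper's lattice-based logical inference is not reproduced here.

```python
# Simplified sketch: (i) shots are mapped to symbolic color/texture
# concepts instead of high-dimensional feature vectors, and (ii) a
# query stream is scanned with an N-gram sliding window. Vocabulary,
# scoring and names are illustrative assumptions only.

COLOR_CONCEPTS = ["red", "green", "blue", "skin", "dark", "light"]  # assumed
TEXTURE_CONCEPTS = ["smooth", "coarse", "oriented"]                 # assumed

def shot_signature(color_concept: str, texture_concept: str) -> str:
    """Symbolic per-shot signature, e.g. 'blue|smooth': two readable
    symbols instead of a 10^2-10^3 dimensional feature vector."""
    return f"{color_concept}|{texture_concept}"

def ngrams(tokens, n):
    """All contiguous N-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sliding_window_score(index_tokens, query_stream, n=3):
    """Slide a window of len(index_tokens) over the query stream and
    score each position by the fraction of shared N-grams; a small n
    keeps some overlap even when editing inserts shots into the clip."""
    index_grams = set(ngrams(index_tokens, n))
    w = len(index_tokens)
    scores = []
    for start in range(len(query_stream) - w + 1):
        window = query_stream[start:start + w]
        shared = index_grams & set(ngrams(window, n))
        scores.append((start, len(shared) / max(len(index_grams), 1)))
    return scores

# Example: the indexed clip reappears inside a longer broadcast stream
# surrounded by unrelated shots; the highest-scoring window flags it.
clip = [shot_signature(c, t) for c, t in
        [("blue", "smooth"), ("skin", "smooth"),
         ("dark", "coarse"), ("blue", "oriented")]]
stream = ["red|coarse", "green|smooth"] + clip + ["light|smooth"]
best = max(sliding_window_score(clip, stream, n=2), key=lambda s: s[1])
print(best)  # -> (2, 1.0): clip found at offset 2 with full N-gram overlap
```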
Experimentally, we implement our theoretical proposition, detail the automatic characterization of the visual (color and texture) concepts and evaluate the prototype on 286 videos from the TRECVID 02, 03 and 04 corpora against two dynamic programming frameworks.
The remainder of this paper is organized as follows: Section 2 introduces the related work on NDV detection. Section 3 gives an overview of the proposed system architecture. Temporal video segmentation is detailed in Section 4. Signature-based indexing with duration, color and texture feature extraction is detailed in Section 5. Then, in Section 6, we discuss the N-gram matching and scoring framework. Experimental results are reported in Section 7.
