Motion Driven Approaches to Shot Boundary Detection, Low-Level Feature Extraction and BBC Rushes Characterization at TRECVID 2005
Chong-Wah Ngo, Zailiang Pan, Xiaoyong Wei, Xiao Wu, Hung-Khoon Tan, Wanlei Zhao
Department of Computer Science
City University of Hong Kong
Email:{cwngo,zerin,xiaoyong,wuxiao,hktan,wzhao2}@cs.cityu.edu.hk
Abstract
This paper describes our experimental results on shot boundary detection (SB), low-level feature extraction (LLF), and BBC rushes exploration (BR) at TRECVID 2005. The approaches presented in this paper are mostly based on our previous works [1, 2, 3], grounded on motion analysis with spatio-temporal slices, optical flows and tensor representation. This year, our aim is to explore and investigate the role of motion in various fundamental tasks, including video structuring and characterization, for both edited (in SB and LLF) and unedited (in BR) videos.
In SB (system C), we exploit the coherence and patterns of motion texture in spatio-temporal slices for boundary detection and classification. The cut and wipe detectors are based on our work in [1], which performs color-texture segmentation on three slices extracted from videos to determine boundaries. The dissolve detector is based on our work in [3], which is composed of two steps: multi-resolution cut detection and binary classification with Gabor features. We submit 10 runs, which differ in the size of training data, flashlight detection capability, and additional statistical features (in addition to Gabor) for classification. Overall, the runs with additional features get better results. Increasing the training size will sometimes deteriorate the precision of detection.
In LLF (system A), a global 6-parameter affine model is estimated at each frame with LMedS and tensor representation for camera motion annotation. To characterize the changes of motion parameters over frames, we use hysteresis thresholding and Kalman polyline estimation developed in [2] to segment and determine the types of motion in shots. We submit 7 runs for LLF, which differ in several empirical parameters. Overall, there is no significant difference in terms of recall and precision across the runs.
In BR (system A), we study two problems: how to structure and how to characterize BBC rushes. We define three types of segments based on motion: intentional motion (IM), intermediate motion (IMM), and shaking artifacts (SA) for structuring. Our aim is to partition-and-classify (or classify-and-partition) the videos into segments corresponding to their motion characteristics.
We employ and experiment with three approaches: finite state machine (FSM), support vector machine (SVM), and hidden Markov model (HMM). FSM is unsupervised, while SVM and HMM are supervised. We randomly select and annotate 60 videos (about 337K frames) from the development set for training and testing. The results show that the performances of all tested approaches are quite close, with SVM being better for structuring and HMM being slightly better for rushes characterization. Overall, HMM can achieve over 90% recall and precision (in terms of frame numbers) in extracting intentional motion. For structuring, SVM achieves approximately 70% recall and 30% precision (with sub-shots as units), compared to 0.05% recall and 35% precision with a shot boundary (cut only) detector.
1 Introduction
This is the first time we participate in TRECVID. We take part in three tasks, submitting 10 runs for shot boundary detection and 7 runs for low-level feature (camera motion) extraction. In addition, we examine two issues: structuring and characterization of BBC rushes. Our aim at TRECVID 2005 is to investigate the use of motion patterns and features for both edited (news) and unedited (rushes) videos. All works presented in this paper are mostly based on our early works in [1, 2, 3]. Several enhancements, nevertheless, have also been introduced and shown to give improvements over our previous approaches.
2 Shot Boundary Detection
Our approach is based on the segmentation and classification of motion texture patterns in DC-based spatio-temporal (ST) slices [1, 3]. ST slices are 2D images extracted from videos with one dimension in space and the other in time. Figure 1 shows three types of boundaries on ST slices: cuts, wipes and dissolves. We make use of the slice coherence for cut and wipe detection, and the slice pattern for dissolve and non-dissolve classification. Because fade-in and fade-out are special cases of dissolve, we do not consider them separately.
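As a minimal sketch of how ST slices can be formed, the snippet below stacks the center horizontal, vertical and diagonal lines of each frame over time. It assumes OpenCV is available and approximates the DC image by 8x8 block averaging (as in MPEG DC sequences); it is illustrative, not our exact implementation.

```python
# Sketch: extract three spatio-temporal (ST) slices from a video.
import cv2
import numpy as np

def extract_st_slices(video_path):
    """Return horizontal, vertical and diagonal ST slices (time x space)."""
    cap = cv2.VideoCapture(video_path)
    h_rows, v_cols, d_rows = [], [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Approximate the DC image by 8x8 block averaging.
        dc = cv2.resize(frame, None, fx=1/8, fy=1/8,
                        interpolation=cv2.INTER_AREA)
        h, w = dc.shape[:2]
        h_rows.append(dc[h // 2, :])    # center horizontal line
        v_cols.append(dc[:, w // 2])    # center vertical line
        n = min(h, w)
        idx = np.arange(n)
        d_rows.append(dc[idx * h // n, idx * w // n])  # diagonal line
    cap.release()
    # Stacking the per-frame lines over time yields the 2D slice images.
    return np.array(h_rows), np.array(v_cols), np.array(d_rows)
```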
For the cut and wipe detectors, we use three slices (center horizontal, vertical and diagonal) and perform color-texture segmentation to locate the boundaries [1]. For the dissolve detector, a pyramid of ST slices at different temporal resolutions is generated for cut detection. Figure 2 shows the evolution of dissolves to cuts when the resolution of ST slices is temporally reduced. The cuts at low-resolution slices are located with our cut detector and then projected back to the original scale for dissolve verification [3]. We use Gabor features (a 48-dimensional feature vector) to depict the motion-texture patterns of potential dissolves, and then apply a support vector machine for binary classification. In brief, the cut and wipe detectors are unsupervised, while the dissolve detector is supervised.
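The following hedged sketch shows the multi-resolution idea: the ST slice is repeatedly halved along the time axis so that a slow dissolve eventually collapses into a cut-like discontinuity. `detect_cuts` is a stand-in for the color-texture cut detector described above, not its actual implementation.

```python
# Sketch: temporal pyramid over an ST slice and back-projection of cuts.
import numpy as np

def temporal_pyramid(st_slice, levels=4):
    """Build a pyramid by halving the temporal (row) resolution."""
    pyramid = [st_slice.astype(np.float32)]
    for _ in range(levels - 1):
        s = pyramid[-1]
        t = (s.shape[0] // 2) * 2
        # Average adjacent frames (rows) to halve temporal resolution.
        pyramid.append((s[0:t:2] + s[1:t:2]) / 2.0)
    return pyramid

def candidate_dissolves(st_slice, detect_cuts, levels=4):
    """Project coarse-level cuts back to frame spans at the original scale."""
    candidates = []
    for level, s in enumerate(temporal_pyramid(st_slice, levels)):
        scale = 2 ** level
        for t in detect_cuts(s):
            # A cut at row t of level L spans roughly
            # [t * 2^L, (t + 1) * 2^L - 1] at the original temporal scale;
            # this span is then verified as a dissolve (e.g. by the SVM).
            candidates.append((t * scale, (t + 1) * scale - 1))
    return candidates
```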
On top of our works in [1, 3], we make two improvements: (i) flashlight detection and (ii) additional features for dissolves. The aim of (i) is to prune false cuts due to sharp lighting changes. We inspect four scans (at t−2, t−1, t+1 and t+2) before and after a potential cut at time t, and then extract the standard deviation of the four scans as the feature for decision making.
Figure 1: Samples of spatio-temporal slices. (a) Three shots connected by two cuts; (b) two shots connected by a wipe; (c) two shots connected by a dissolve.
Figure 2: Evolution of dissolves to cuts in a bottom-up manner along multiple scales of the pyramid representation.
If a flashlight happens, the value of this feature often approaches zero, since there is no obvious change between the frames before and after the false cut. The computation is also very efficient, since it involves only a few scans in the ST slices. The aim of (ii) is to improve the precision of dissolve detection, in view that our approach in [3] is not effective enough in discriminating static sequences from dissolves with slight motion. We introduce 9 extra features in addition to the Gabor features, adding up to a 57-dimensional feature vector in total. The features are extracted by computing the standard deviation of the horizontal, vertical and diagonal slices in 3D color space (YCbCr).
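A minimal sketch of the two enhancements is given below, under the assumption that ST slices are stored as (time x space x 3) arrays in YCbCr; thresholds and array layouts are illustrative, not the values used in our runs.

```python
# Sketch: flashlight feature and the 9 extra statistical GT features.
import numpy as np

def flashlight_feature(st_slice, t):
    """Std. dev. of the four scans around a candidate cut at time t."""
    scans = st_slice[[t - 2, t - 1, t + 1, t + 2]].astype(np.float32)
    # For a flashlight, the scans before/after the false cut are nearly
    # identical, so their standard deviation approaches zero.
    return float(scans.std(axis=0).mean())

def extra_gt_features(h_slice, v_slice, d_slice):
    """9 features: std of each of the 3 slices in each YCbCr channel."""
    feats = []
    for s in (h_slice, v_slice, d_slice):
        for c in range(3):  # Y, Cb, Cr
            feats.append(float(s[..., c].std()))
    # Concatenated with the 48-D Gabor vector, this yields 57 dimensions.
    return feats
```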
2.1 Experiments
We submitted 10 runs according to three aspects: (1) whether flashlight detection is used for cut detection, (2) the size of the training set, and (3) different features for gradual transition (GT) detection. Table 1 summarizes the characteristics of the different runs, and Table 2 shows the results of each run. The small, medium and large data sets contain 864, 1180 and 1662 dissolves respectively. The training data is collected from our videos in [1, 3] (so our system belongs to type C). The data sets are nested, i.e., the large data set includes all data from the small and medium data sets, while the medium data set includes all data from the small data set. As shown in Table 2, run 10 gives the best results, with 0.870 recall and 0.796 precision, compared to one of the best performances (recall = 0.927 and precision = 0.845) in this year's TRECVID. Overall, the performance of our cut detector is quite competitive, but the GT detector is not as good, probably because we limit the length of a dissolve to at least 15 frames during training.
Table 1: Different runs for SB

Run ID   Flashlight Detection for Cut   Training Size   Additional Features for GT
1        ×                              Medium          ×
2        √                              Medium          ×
3        ×                              Medium          √
4        √                              Medium          √
5        ×                              Small           ×
6        √                              Small           ×
7        ×                              Large           ×
8        ×                              Large           √
9        √                              Large           ×
10       √                              Large           √

For cut detection, precision is improved by flashlight detection, which indicates that this strategy successfully prunes some false cuts due to sharp lighting changes. However, a few real cuts are also removed, which downgrades the recall. The missed cuts generally have similar brightness in the preframe and postframe, and their contents are very similar.
Table 2: Experimental results of 10 runs in SB

         ALL             CUT             GT              GT (Frame)
Run ID   Recall  Prec.   Recall  Prec.   Recall  Prec.   Recall  Prec.
1        0.860   0.733   0.952   0.818   0.591   0.493   0.724   0.589
2        0.860   0.749   0.951   0.842   0.591   0.493   0.724   0.589
3        0.873   0.766   0.952   0.818   0.643   0.600   0.762   0.614
4        0.873   0.783   0.951   0.842   0.643   0.600   0.762   0.614
5        0.856   0.770   0.952   0.818   0.574   0.598   0.701   0.628
6        0.855   0.787   0.951   0.842   0.574   0.598   0.701   0.628
7        0.858   0.750   0.952   0.818   0.584   0.537   0.736   0.591
8        0.870   0.778   0.952   0.818   0.632   0.640   0.761   0.612
9        0.858   0.767   0.951   0.842   0.584   0.537   0.736   0.591
10       0.870   0.796   0.951   0.842   0.632   0.640   0.761   0.612
In our approach, false cuts happen when captions in large size come onto the screen at fast speed, and when there is a fade-in or fade-out that happens in less than five frames. The best result we get for cut detection is 0.951 recall and 0.842 precision, compared to one of the best runs (recall = 0.941 and precision = 0.928) in TRECVID.
For GT detection, the additional statistical features, in addition to Gabor, improve both recall and precision. The size of the training data has no obvious impact, and it actually deteriorates performance sometimes as its size increases. We are still investigating the possible reasons behind this. Perhaps the training samples themselves are noisy, or the features we use are not discriminative enough, which makes the decision boundary even more confusing when more training data comes in. In the experiments, false GTs are basically caused by two cases. The first case happens when the scene brightness gradually changes, which generates motion patterns resembling the ones caused by fade-out. The second case is due to fast camera motion (e.g., zoom-in/out), which results in motion blur and generates patterns similar to dissolve. There are also two causes for the unsatisfactory recall of our GT detection. First, we assume each GT has a length of at least 15 frames during training. As a result, we miss a lot of short dissolves. Secondly, some wipe GTs are missed simply because the wipe patterns are too complicated to be detected by our current color-texture image segmentation approach.
In terms of speed, our cut and wipe detectors operate in real time. Together, they can run as fast as 90 frames/sec on a Pentium-4 machine. The dissolve detector is not real-time, since a significant amount of time is spent in extracting Gabor features.
3 Low-Level Feature Extraction
Our work on LLF is mainly based on our previous work in [2]. Basically, we describe a shot as sequences of motion trajectories. Our task in LLF is to characterize the trajectories, with hysteresis thresholding and Kalman polyline estimation, as either pan/track, zoom/dolly or tilt/boom; a sketch of the thresholding step is given below.
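The snippet below is a simplified, forward-only sketch of hysteresis thresholding on a 1-D motion parameter sequence (e.g. v_x for pan/track): a high threshold triggers a motion segment and a low threshold sustains it. Both threshold values are placeholders, not the parameters of our runs.

```python
# Sketch: hysteresis thresholding for segmenting a motion sequence.
import numpy as np

def hysteresis_segments(signal, t_high, t_low):
    """Return (start, end) index pairs where |signal| forms a motion run."""
    segments, start = [], None
    for i, v in enumerate(np.abs(np.asarray(signal))):
        if start is None and v >= t_high:
            start = i                        # strong evidence starts a segment
        elif start is not None and v < t_low:
            segments.append((start, i - 1))  # weak evidence ends it
            start = None
    if start is not None:
        segments.append((start, len(signal) - 1))
    return segments
```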
3.1 Motion Feature Extraction
The motion features are extracted from every two adjacent frames. The Harris corner detector is applied to extract the image feature points x_t of a frame t. The corresponding points x_{t+1} at frame t+1 are estimated by the singular value decomposition of the 3D structural tensor [2]. The matched point pairs in each two frames are assumed to be consistent with a single camera motion model. Since pan and track (and likewise zoom and dolly, tilt and boom) belong to the same feature category, a 2D camera motion model is sufficient to represent the three motion categories. To seek a balance between model effectiveness and complexity, we use the 2D 6-parameter affine model described as

x_{t+1} = A x_t + v,

where v = [v_x, v_y]^T is the translation and A is a 2×2 matrix. A and v are estimated from the matched points in two consecutive frames using the robust estimator LMedS [4]. RANSAC is not used due to its requirement of an inlier threshold, which cannot be easily set. A can be further represented by other motion features, namely rotation θ, skew φ and zoom (dolly) [z_x, z_y]^T, as follows:
A = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} z_x & z_x\tan\phi \\ 0 & z_y\cos\phi \end{pmatrix}.
The parameters v_x and v_y characterize pan (track) and tilt (boom) respectively, while z_x and z_y can be used for zoom (dolly) detection. Therefore, we extract a 3-dimensional motion feature vector f = [v_x, v_y, z = (z_x − 1) × (z_y − 1)] for each two adjacent frames. Carrying on this procedure along the temporal dimension, we get a sequence of motion feature vectors {f} for a shot. Grounded on {f}, we develop techniques to detect the patterns of camera motion, as described in the following section; a sketch of the feature computation is given below.
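The following is a hedged sketch of the per-frame motion feature. The LMedS loop is a minimal stand-in for the estimator in [4], not our exact implementation, and the recovery of z_x and z_y assumes the decomposition of A given above.

```python
# Sketch: robust affine estimation and the 3-D motion feature f.
import numpy as np

def fit_affine(src, dst):
    """Least-squares 6-parameter affine fit: dst ~ A @ src + v."""
    X = np.hstack([src, np.ones((len(src), 1))])   # (n, 3)
    P, *_ = np.linalg.lstsq(X, dst, rcond=None)     # (3, 2)
    return P[:2].T, P[2]                            # A (2x2), v (2,)

def lmeds_affine(src, dst, n_trials=200, seed=0):
    """Keep the affine model minimizing the median squared residual."""
    rng = np.random.default_rng(seed)
    best, best_med = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(len(src), size=3, replace=False)  # minimal sample
        A, v = fit_affine(src[idx], dst[idx])
        res = ((src @ A.T + v - dst) ** 2).sum(axis=1)
        med = np.median(res)
        if med < best_med:
            best, best_med = (A, v), med
    return best

def motion_feature(A, v):
    """3-D feature f = [v_x, v_y, z] with z = (z_x - 1) * (z_y - 1)."""
    # z_x and z_y follow from the rotation-zoom-skew decomposition of A;
    # with zero skew they reduce to the column norms of A.
    z_x = np.linalg.norm(A[:, 0])
    z_y = np.linalg.norm(A[:, 1])
    return np.array([v[0], v[1], (z_x - 1.0) * (z_y - 1.0)])
```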
3.2 Camera Motion Detection
There are three camera motion categories considered in TRECVID 2005: pan (track), tilt (boom) and zoom (dolly). The scenes of pan/track and tilt/boom can be approximately regarded as moving parallel to the image plane, whereas zoom (dolly) moves along the depth direction. This results in different patterns in the feature sequences of zoom (dolly) and the other two categories. Two examples are given in Figure 3 to illustrate the idea. Based on the sequence patterns, we develop two separate approaches: one to detect zoom (dolly) and the other for pan (track) and tilt (boom) detection; a simplified sketch follows.
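As an illustration of the split between the two detectors, the per-frame rule below labels a sustained positive z as zoom/dolly and a dominant |v_x| or |v_y| as pan/track or tilt/boom. The thresholds and the frame-wise decision rule are placeholders, not our actual detectors.

```python
# Sketch: frame-wise motion labeling from the feature sequence {f}.
def classify_motion(features, z_thresh=0.001, v_thresh=1.0):
    """Label each frame pair as 'zoom', 'pan', 'tilt' or 'static'."""
    labels = []
    for v_x, v_y, z in features:
        if z > z_thresh:
            labels.append('zoom')       # depth-direction motion
        elif abs(v_x) >= abs(v_y) and abs(v_x) > v_thresh:
            labels.append('pan')        # dominant horizontal translation
        elif abs(v_y) > v_thresh:
            labels.append('tilt')       # dominant vertical translation
        else:
            labels.append('static')
    return labels
```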
3.2.1 Zoom (Dolly) Detection

Zoom (dolly) detection is relatively challenging for several reasons. First, the geometric relation between the image pixel motion and the zoom (dolly) camera motion is nonlinear. Secondly, this kind of motion, especially zoom, is often un-smooth. These characteristics lead to diverse patterns of zoom (dolly) when inspecting its motion feature sequence z. To make the pattern