Motion Driven Approaches to Shot Boundary Detection, Low-Level Feature Extraction and BBC Rushes Characterization at TRECVID 2005
Chong-Wah Ngo, Zailiang Pan, Xiaoyong Wei, Xiao Wu, Hung-Khoon Tan, Wanlei Zhao
Department of Computer Science
City University of Hong Kong
Email:{cwngo,zerin,xiaoyong,wuxiao,hktan,wzhao2}@cs.cityu.edu.hk
Abstract
This paper describes our experimental results on shot boundary detection (SB), low-level feature extraction (LLF), and BBC rushes exploration (BR) at TRECVID 2005. The approaches presented in this paper are mostly based on our previous works [1, 2, 3], grounded on motion analysis with spatio-temporal slices, optical flows and tensor representation. This year, our aim is to explore and investigate the role of motion in various fundamental tasks, including video structuring and characterization, for both edited (in SB and LLF) and unedited (in BR) videos.
In SB (system C), we exploit the coherence and patterns of motion texture in spatio-temporal slices for boundary detection and classification. The cut and wipe detectors are based on our work in [1], which performs color-texture segmentation on three slices extracted from videos to determine boundaries. The dissolve detector is based on our work in [3], which is composed of two steps: multi-resolution cut detection and binary classification with Gabor features. We submit 10 runs, which differ in the size of training data, flashlight detection capability, and additional statistical features (in addition to Gabor) for classification. Overall, the runs with additional features get better results. Increasing the training size will sometimes deteriorate the precision of detection.
In LLF (system A), a global 6-parameter affine model is estimated at each frame with LMedS and tensor representation for camera motion annotation. To characterize the changes of motion parameters over frames, we use hysteresis thresholding and Kalman polyline estimation developed in [2] to segment and determine the types of motion in shots. We submit 7 runs for LLF, which differ in several empirical parameters. Overall, there is no significant difference in terms of recall and precision across the runs.
In BR (system A), we study two problems: how to structure and how to characterize BBC rushes. We define three types of segments based on motion: intentional motion (IM), intermediate motion (IMM), and shaking artifacts (SA) for structuring. Our aim is to partition-and-classify (or classify-and-partition) the videos into segments corresponding to their motion characteristics.
We employ and experiment with three approaches: finite state machine (FSM), support vector machine (SVM), and hidden Markov model (HMM). FSM is unsupervised, while SVM and HMM are supervised. We randomly select and annotate 60 videos (about 337K frames) from the development set for training and testing. The results show that the performances of all tested approaches are quite close, with SVM being better for structuring and HMM being slightly better for rushes characterization. Overall, HMM can achieve over 90% recall and precision (in terms of frame numbers) in extracting intentional motion. For structuring, SVM achieves approximately 70% recall and 30% precision (with sub-shots as units), compared to 0.05% recall and 35% precision with a shot boundary (cut only) detector.
1 Introduction
This is the first time we participate in TRECVID. We take part in three tasks, submitting 10 runs for shot boundary detection and 7 runs for low-level feature (camera motion) extraction. In addition, we examine two issues: structuring and characterization of BBC rushes. Our aim at TRECVID 2005 is to investigate the use of motion patterns and features for both edited (news) and unedited (rushes) videos. All works presented in this paper are mostly based on our early works in [1, 2, 3]. Several enhancements, nevertheless, have also been introduced and shown to give improvements over our previous approaches.
2 Shot Boundary Detection
Our approach is based on the segmentation and classification of motion texture patterns in DC-based spatio-temporal (ST) slices [1, 3]. ST slices are 2D images extracted from videos with one dimension in space and the other in time. Figure 1 shows three types of boundaries on ST slices: cuts, wipes and dissolves. We make use of the slice coherence for cut and wipe detection, and the slice pattern for dissolve and non-dissolve classification. Because fade-in and fade-out are special cases of dissolve, we do not consider them separately.
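As a minimal sketch of how ST slices can be formed, the snippet below stacks the center horizontal, vertical and diagonal lines of each frame over time. It assumes OpenCV is available and approximates the DC image by 8x8 block averaging (as in MPEG DC sequences); it is illustrative, not our exact implementation.

```python
# Sketch: extract three spatio-temporal (ST) slices from a video.
import cv2
import numpy as np

def extract_st_slices(video_path):
    """Return horizontal, vertical and diagonal ST slices (time x space)."""
    cap = cv2.VideoCapture(video_path)
    h_rows, v_cols, d_rows = [], [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Approximate the DC image by 8x8 block averaging.
        dc = cv2.resize(frame, None, fx=1/8, fy=1/8,
                        interpolation=cv2.INTER_AREA)
        h, w = dc.shape[:2]
        h_rows.append(dc[h // 2, :])    # center horizontal line
        v_cols.append(dc[:, w // 2])    # center vertical line
        n = min(h, w)
        idx = np.arange(n)
        d_rows.append(dc[idx * h // n, idx * w // n])  # diagonal line
    cap.release()
    # Stacking the per-frame lines over time yields the 2D slice images.
    return np.array(h_rows), np.array(v_cols), np.array(d_rows)
```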
For the cut and wipe detectors, we use three slices (center horizontal, vertical and diagonal) and perform color-texture segmentation to locate the boundaries [1]. For the dissolve detector, a pyramid of ST slices at different temporal resolutions is generated for cut detection. Figure 2 shows the evolution of dissolves to cuts when the resolution of ST slices is temporally reduced. The cuts at low-resolution slices are located with our cut detector and then projected back to the original scale for dissolve verification [3]. We use Gabor features (a 48-dimensional feature vector) to depict the motion-texture patterns of potential dissolves, and then apply a support vector machine for binary classification. In brief, the cut and wipe detectors are unsupervised, while the dissolve detector is supervised.
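The following hedged sketch shows the multi-resolution idea: the ST slice is repeatedly halved along the time axis so that a slow dissolve eventually collapses into a cut-like discontinuity. `detect_cuts` is a stand-in for the color-texture cut detector described above, not its actual implementation.

```python
# Sketch: temporal pyramid over an ST slice and back-projection of cuts.
import numpy as np

def temporal_pyramid(st_slice, levels=4):
    """Build a pyramid by halving the temporal (row) resolution."""
    pyramid = [st_slice.astype(np.float32)]
    for _ in range(levels - 1):
        s = pyramid[-1]
        t = (s.shape[0] // 2) * 2
        # Average adjacent frames (rows) to halve temporal resolution.
        pyramid.append((s[0:t:2] + s[1:t:2]) / 2.0)
    return pyramid

def candidate_dissolves(st_slice, detect_cuts, levels=4):
    """Project coarse-level cuts back to frame spans at the original scale."""
    candidates = []
    for level, s in enumerate(temporal_pyramid(st_slice, levels)):
        scale = 2 ** level
        for t in detect_cuts(s):
            # A cut at row t of level L spans roughly
            # [t * 2^L, (t + 1) * 2^L - 1] at the original temporal scale;
            # this span is then verified as a dissolve (e.g. by the SVM).
            candidates.append((t * scale, (t + 1) * scale - 1))
    return candidates
```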
On top of our works in [1, 3], we make two improvements: (i) flashlight detection and (ii) additional features for dissolves. The aim of (i) is to prune false cuts due to sharp lighting changes. We inspect four scans (at t−2, t−1, t+1 and t+2) before and after a potential cut at time t, and then extract the standard deviation of the four scans as the feature for decision making.
Figure 1: Samples of spatio-temporal slices. (a) Three shots connected by two cuts; (b) two shots connected by a wipe; (c) two shots connected by a dissolve.
Figure 2: Evolution of dissolves to cuts in a bottom-up manner along multiple scales of the pyramid representation.
If a flashlight happens, the value of this feature often approaches zero, since there is no obvious change between the frames before and after the false cut. The computation is also very efficient, since it involves only a few scans in the ST slices. The aim of (ii) is to improve the precision of dissolve detection, in view that our approach in [3] is not effective enough in discriminating static sequences from dissolves with slight motion. We introduce 9 extra features in addition to the Gabor features, adding up to a 57-dimensional feature vector in total. The features are extracted by computing the standard deviation of the horizontal, vertical and diagonal slices in 3D color space (YCbCr).
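A minimal sketch of the two enhancements is given below, under the assumption that ST slices are stored as (time x space x 3) arrays in YCbCr; thresholds and array layouts are illustrative, not the values used in our runs.

```python
# Sketch: flashlight feature and the 9 extra statistical GT features.
import numpy as np

def flashlight_feature(st_slice, t):
    """Std. dev. of the four scans around a candidate cut at time t."""
    scans = st_slice[[t - 2, t - 1, t + 1, t + 2]].astype(np.float32)
    # For a flashlight, the scans before/after the false cut are nearly
    # identical, so their standard deviation approaches zero.
    return float(scans.std(axis=0).mean())

def extra_gt_features(h_slice, v_slice, d_slice):
    """9 features: std of each of the 3 slices in each YCbCr channel."""
    feats = []
    for s in (h_slice, v_slice, d_slice):
        for c in range(3):  # Y, Cb, Cr
            feats.append(float(s[..., c].std()))
    # Concatenated with the 48-D Gabor vector, this yields 57 dimensions.
    return feats
```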
2.1 Experiments
We submitted 10 runs according to three aspects: (1) whether flashlight detection is used for cut detection, (2) the size of the training set, and (3) different features for gradual transition (GT) detection. Table 1 summarizes the characteristics of the different runs, and Table 2 shows the results of each run. The small, medium and large data sets contain 864, 1180 and 1662 dissolves respectively. The training data is collected from our videos in [1, 3] (so our system belongs to type C). The data sets are nested, i.e., the large data set includes all data from the small and medium data sets, while the medium data set includes all data from the small data set. As shown in Table 2, run 10 gives the best results, with 0.870 recall and 0.796 precision, compared to one of the best performances (recall = 0.927 and precision = 0.845) in this year's TRECVID. Overall, the performance of our cut detector is quite competitive, but the GT detector is not as good, probably because we limit the length of a dissolve to at least 15 frames during training.
Table 1: Different runs for SB

Run ID   Flashlight Detection for Cut   Training Size   Additional Features for GT
1        ×                              Medium          ×
2        √                              Medium          ×
3        ×                              Medium          √
4        √                              Medium          √
5        ×                              Small           ×
6        √                              Small           ×
7        ×                              Large           ×
8        ×                              Large           √
9        √                              Large           ×
10       √                              Large           √

For cut detection, precision is improved by flashlight detection, which indicates that this strategy successfully prunes some false cuts due to sharp lighting changes. However, a few real cuts are also removed, which downgrades the recall. The missed cuts generally have similar brightness in the preframe and postframe, and their contents are very similar.
Table 2: Experimental results of 10 runs in SB

         ALL             CUT             GT              GT (Frame)
Run ID   Recall  Prec.   Recall  Prec.   Recall  Prec.   Recall  Prec.
1        0.860   0.733   0.952   0.818   0.591   0.493   0.724   0.589
2        0.860   0.749   0.951   0.842   0.591   0.493   0.724   0.589
3        0.873   0.766   0.952   0.818   0.643   0.600   0.762   0.614
4        0.873   0.783   0.951   0.842   0.643   0.600   0.762   0.614
5        0.856   0.770   0.952   0.818   0.574   0.598   0.701   0.628
6        0.855   0.787   0.951   0.842   0.574   0.598   0.701   0.628
7        0.858   0.750   0.952   0.818   0.584   0.537   0.736   0.591
8        0.870   0.778   0.952   0.818   0.632   0.640   0.761   0.612
9        0.858   0.767   0.951   0.842   0.584   0.537   0.736   0.591
10       0.870   0.796   0.951   0.842   0.632   0.640   0.761   0.612
In our approach, false cuts happen when captions in large size come onto the screen at fast speed, and when there is a fade-in or fade-out that happens in less than five frames. The best result we get for cut detection is 0.951 recall and 0.842 precision, compared to one of the best runs (recall = 0.941 and precision = 0.928) in TRECVID.
For GT detection, the additional statistical features, in addition to Gabor, improve both recall and precision. The size of the training data has no obvious impact, and it actually deteriorates performance sometimes as its size increases. We are still investigating the possible reasons behind this. Perhaps the training samples themselves are noisy, or the features we use are not discriminative enough, which makes the decision boundary even more confusing when more training data comes in. In the experiments, false GTs are basically caused by two cases. The first case happens when the scene brightness gradually changes, which generates motion patterns resembling the ones caused by fade-out. The second case is due to fast camera motion (e.g., zoom-in/out), which results in motion blur and generates patterns similar to dissolve. There are also two causes for the unsatisfactory recall of our GT detection. First, we assume each GT has a length of at least 15 frames during training. As a result, we miss a lot of short dissolves. Secondly, some wipe GTs are missed simply because the wipe patterns are too complicated to be detected by our current color-texture image segmentation approach.
In terms of speed, our cut and wipe detectors operate in real time. Together, they can run as fast as 90 frames/sec on a Pentium-4 machine. The dissolve detector is not real-time, since a significant amount of time is spent in extracting Gabor features.
3 Low-Level Feature Extraction
Our work on LLF is mainly based on our previous work in [2]. Basically, we describe a shot as sequences of motion trajectories. Our task in LLF is to characterize the trajectories, with hysteresis thresholding and Kalman polyline estimation, as either pan/track, zoom/dolly or tilt/boom; a sketch of the thresholding step is given below.
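The snippet below is a simplified, forward-only sketch of hysteresis thresholding on a 1-D motion parameter sequence (e.g. v_x for pan/track): a high threshold triggers a motion segment and a low threshold sustains it. Both threshold values are placeholders, not the parameters of our runs.

```python
# Sketch: hysteresis thresholding for segmenting a motion sequence.
import numpy as np

def hysteresis_segments(signal, t_high, t_low):
    """Return (start, end) index pairs where |signal| forms a motion run."""
    segments, start = [], None
    for i, v in enumerate(np.abs(np.asarray(signal))):
        if start is None and v >= t_high:
            start = i                        # strong evidence starts a segment
        elif start is not None and v < t_low:
            segments.append((start, i - 1))  # weak evidence ends it
            start = None
    if start is not None:
        segments.append((start, len(signal) - 1))
    return segments
```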
3.1 Motion Feature Extraction
The motion features are extracted from every two adjacent frames. The Harris corner detector is applied to extract the image feature points x_t of a frame t. The corresponding points x_{t+1} at frame t+1 are estimated by the singular value decomposition of the 3D structural tensor [2]. The matched point pairs in each two frames are assumed to be consistent with a single camera motion model. Since pan and track (and likewise zoom and dolly, tilt and boom) belong to the same feature category, a 2D camera motion model is sufficient to represent the three motion categories. To seek a balance between model effectiveness and complexity, we use the 2D 6-parameter affine model described as

x_{t+1} = A x_t + v,

where v = [v_x, v_y]^T is the translation and A is a 2×2 matrix. A and v are estimated from the matched points in two consecutive frames using the robust estimator LMedS [4]. RANSAC is not used due to its requirement of an inlier threshold, which cannot be easily set. A can be further represented by other motion features, namely rotation θ, skew φ and zoom (dolly) [z_x, z_y]^T, as follows:
A = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} z_x & z_x\tan\phi \\ 0 & z_y\cos\phi \end{pmatrix}.
The parameters v_x and v_y characterize pan (track) and tilt (boom) respectively, while z_x and z_y can be used for zoom (dolly) detection. Therefore, we extract a 3-dimensional motion feature vector f = [v_x, v_y, z = (z_x − 1) × (z_y − 1)] for each two adjacent frames. Carrying on this procedure along the temporal dimension, we get a sequence of motion feature vectors {f} for a shot. Grounded on {f}, we develop techniques to detect the patterns of camera motion, as described in the following section; a sketch of the feature computation is given below.
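The following is a hedged sketch of the per-frame motion feature. The LMedS loop is a minimal stand-in for the estimator in [4], not our exact implementation, and the recovery of z_x and z_y assumes the decomposition of A given above.

```python
# Sketch: robust affine estimation and the 3-D motion feature f.
import numpy as np

def fit_affine(src, dst):
    """Least-squares 6-parameter affine fit: dst ~ A @ src + v."""
    X = np.hstack([src, np.ones((len(src), 1))])   # (n, 3)
    P, *_ = np.linalg.lstsq(X, dst, rcond=None)     # (3, 2)
    return P[:2].T, P[2]                            # A (2x2), v (2,)

def lmeds_affine(src, dst, n_trials=200, seed=0):
    """Keep the affine model minimizing the median squared residual."""
    rng = np.random.default_rng(seed)
    best, best_med = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(len(src), size=3, replace=False)  # minimal sample
        A, v = fit_affine(src[idx], dst[idx])
        res = ((src @ A.T + v - dst) ** 2).sum(axis=1)
        med = np.median(res)
        if med < best_med:
            best, best_med = (A, v), med
    return best

def motion_feature(A, v):
    """3-D feature f = [v_x, v_y, z] with z = (z_x - 1) * (z_y - 1)."""
    # z_x and z_y follow from the rotation-zoom-skew decomposition of A;
    # with zero skew they reduce to the column norms of A.
    z_x = np.linalg.norm(A[:, 0])
    z_y = np.linalg.norm(A[:, 1])
    return np.array([v[0], v[1], (z_x - 1.0) * (z_y - 1.0)])
```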
3.2 Camera Motion Detection
There are three camera motion categories considered in TRECVID 2005: pan (track), tilt (boom) and zoom (dolly). The scenes of pan/track and tilt/boom can be approximately regarded as moving parallel to the image plane, whereas zoom (dolly) moves along the depth direction. This results in different patterns in the feature sequences of zoom (dolly) and the other two categories. Two examples are given in Figure 3 to illustrate the idea. Based on the sequence patterns, we develop two separate approaches: one to detect zoom (dolly) and the other for pan (track) and tilt (boom) detection; a simplified sketch follows.
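As an illustration of the split between the two detectors, the per-frame rule below labels a sustained positive z as zoom/dolly and a dominant |v_x| or |v_y| as pan/track or tilt/boom. The thresholds and the frame-wise decision rule are placeholders, not our actual detectors.

```python
# Sketch: frame-wise motion labeling from the feature sequence {f}.
def classify_motion(features, z_thresh=0.001, v_thresh=1.0):
    """Label each frame pair as 'zoom', 'pan', 'tilt' or 'static'."""
    labels = []
    for v_x, v_y, z in features:
        if z > z_thresh:
            labels.append('zoom')       # depth-direction motion
        elif abs(v_x) >= abs(v_y) and abs(v_x) > v_thresh:
            labels.append('pan')        # dominant horizontal translation
        elif abs(v_y) > v_thresh:
            labels.append('tilt')       # dominant vertical translation
        else:
            labels.append('static')
    return labels
```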
3.2.1 Zoom (Dolly) Detection

Zoom (dolly) detection is relatively challenging for several reasons. First, the geometric relation between the image pixel motion and the zoom (dolly) camera motion is nonlinear. Secondly, this kind of motion, especially zoom, is often un-smooth. These characteristics lead to diverse patterns of zoom (dolly) when inspecting its motion feature sequence z. To make the pattern