Abnormal Crowd Behavior Detection using Size-Adapted Spatio-Temporal Features


Bo Wang, Mao Ye, Xue Li and Fengjuan Zhao
Abstract: Abnormal crowd behavior detection is an important research issue in computer vision. However, complex real-life situations (e.g., severe occlusion, over-crowding, etc.) still challenge the effectiveness of previous algorithms. Recently, methods based on spatio-temporal cuboids have become popular in video analysis. To our knowledge, in existing methods the spatio-temporal cuboids are always extracted randomly from a video sequence, and the size of each cuboid and the total number of cuboids are determined empirically. The extracted features therefore either contain redundant information or lose a lot of important information, which severely affects the accuracy. In this paper, we propose an improved method in which the spatio-temporal cuboids are determined not arbitrarily, but by the information contained in the video sequence: each cuboid is extracted with an adaptive size, and the total number of cuboids and the extraction positions are determined automatically. Moreover, to compute the similarity between two spatio-temporal cuboids with different sizes, we design a novel codebook data structure constructed as a set of two-level trees. The experimental results show that the false positive and false negative rates are significantly reduced.

Keywords: Spatio-temporal cuboid, Latent Dirichlet Allocation (LDA), Social force model, Codebook

This work was supported in part by the National Natural Science Foundation of China (60702071), the Program for New Century Excellent Talents in University (NCET-06-0811), the 973 National Basic Research Program of China (2010CB-732501), the Foundation of Sichuan Excellent Young Talents (09ZQ026-035), and the Open Project of the State Key Lab. for Novel Software Technology of Nanjing University. Bo Wang, Mao Ye and Fengjuan Zhao are with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, P. R. China, and the State Key Lab. for Novel Software Technology, Nanjing University, P. R. China. Xue Li is with the School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, Queensland 4072, Australia (e-mail: xueli@itee.uq.edu.au).
1. INTRODUCTION
Abnormal crowd behavior detection is an important research field in computer vision [1–3]. The analysis of behaviors such as abnormal ones presents a challenge for its effectiveness [4]. Recent research developments and methods in this area are reviewed in [5, 6]. As reported in [5], research on group behaviors can be divided mainly into three categories. The first consists of the traditional object-based approaches, which consider the group as a collection of individuals [7–10]. These methods analyze crowd behaviors through individuals, and in simple situations with a small number of moving objects they can achieve good results. However, in complex scenes there exist severe occlusions, which make object segmentation, tracking and behavior recognition almost impossible; the computational cost is also greatly influenced by the number of objects (e.g., people). The second category of methods focuses on analyzing the entire video frame and extracting subject-specific information [11–13]. A classic application of this approach is to use optical flow to characterize the motion features [14, 15]. Unfortunately, reliable optical flow can hardly be obtained in extremely crowded scenes or in scenes with severe lighting changes. The third category combines the two research frameworks. For example, the method proposed in [13] not only analyzes human activities from the whole-video viewpoint, but also tracks pedestrians. However, such methods carry a large computational load.

Recently, the second framework has become popular in abnormal crowd behavior detection. A method based on a social force model was creatively proposed in [17, 18]; it analyzes human behaviors through the intentions of their movements. A force flow is formed for every pixel [18], and Latent Dirichlet Allocation (LDA) [19, 20] is then used to build a model of the normal scenes. Traditionally, LDA is used as a text modeling method: it considers a whole document to be composed of a large number of randomly arranged words. In computer vision, LDA has been widely used for recognizing and learning object categories [21].
For abnormal crowd behavior detection, a video sequence can be regarded as a document, and a spatio-temporal cuboid as a visual word [17, 18]. Like the words in a document, it is impossible for all visual words to have the same length. However, the existing algorithms extract size-fixed spatio-temporal cuboids randomly from a sequence of force flow [18]; the number of cuboids is also fixed and assigned artificially. Previous methods will thus cut down long words or pad short words with redundant information. Our point is that extracting cuboids randomly and fixing the size of cuboids in a video sequence are inappropriate in practical situations: doing so causes the cuboid set to contain much useless information or to lose useful information, and the effectiveness of detection is harmed significantly. Our experiments confirm this claim.

In this paper, a rigorous definition of abnormal crowd behavior can be stated as follows. Suppose there exists a mathematical model describing the normal behaviors in the crowd videos; a crowd behavior is considered abnormal when its value computed from the model is discriminative from the normal values. Based on the social force model, we propose an efficient method to extract size-adapted cuboids, in which the number of cuboids extracted from a video sequence is determined by the video sequence itself. Then, an LDA model is used to detect the abnormal crowd behavior. Our main contributions can be summarized as follows.
1) An extraction method for size-adapted spatio-temporal cuboids is proposed; the positions and total number of the cuboids are determined automatically.
2) The codebook is constructed as a set of two-level trees, which makes the similarity computation between two spatio-temporal cuboids with different sizes more reasonable.
3) Compared with the previous methods, the detection accuracy is significantly improved.

The rest of the paper is organized as follows. The method of extracting the size-adapted cuboids is proposed in Section 2. In Section 3, we describe the method of detecting abnormal crowd behavior. Finally, in Section 4, we demonstrate the feasibility and effectiveness of our proposed method for detecting abnormal crowd behaviors on a public test dataset.
2. EXTRACTING SIZE-ADAPTED SPATIO-TEMPORAL CUBOIDS
Firstly, we briefly describe the method of estimating the interaction forces in a crowd; details are in [18]. Since people are dense in a crowd video, their movements are restricted and they can be considered as granular particles. Thus, the crowd is treated as a collection of interacting particles. Initially, a grid of particles is placed over the image [12]. For each particle, an average optical flow is computed [18]. Then, the particle moves along with this average optical flow. The interaction force of particle i can be simply estimated as

    F_{int} = \frac{dv_i}{dt} - \frac{1}{\tau}\,(v_i^q - v_i),    (1)

where v_i is the spatio-temporal average of the optical flow in the neighborhood of particle i, \tau is the relaxation parameter, and v_i^q is the desired velocity of particle i,

    v_i^q = (1 - p_i)\, O(x_i, y_i) + p_i\, O_{ave}(x_i, y_i),    (2)

where O(x_i, y_i) is the optical flow of particle i in the (x, y) plane, O_{ave}(x_i, y_i) is the average optical flow in the neighborhood of particle i in the (x, y) plane, and p_i is the panic weight parameter.
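To make the force estimation concrete, the following minimal Python sketch evaluates Eqs. (1) and (2) on dense optical-flow fields. The use of OpenCV's Farneback flow, the finite-difference stand-in for dv_i/dt, and the parameter values tau, p and k are our own illustrative assumptions; the paper defers these details to [18].

import numpy as np
import cv2

def dense_flow(f0, f1):
    # Dense optical flow between two consecutive grayscale frames (Farneback).
    return cv2.calcOpticalFlowFarneback(f0, f1, None, 0.5, 3, 15, 3, 5, 1.2, 0)

def interaction_force(flow_prev, flow_curr, tau=0.5, p=0.1, k=5):
    # Sketch of Eqs. (1)-(2); tau, p (panic weight) and the averaging
    # window k are illustrative values, not prescribed by the paper.
    # v_i: spatio-temporal average of the flow around each particle,
    # approximated here by spatial blurring plus averaging two flow fields.
    v = cv2.blur(0.5 * (flow_prev + flow_curr), (k, k))
    # Eq. (2): desired velocity mixes the raw flow O and the averaged O_ave.
    o_ave = cv2.blur(flow_curr, (k, k))
    v_q = (1.0 - p) * flow_curr + p * o_ave
    # dv_i/dt approximated by a finite difference of consecutive flow fields.
    dv_dt = flow_curr - flow_prev
    # Eq. (1): F_int = dv_i/dt - (1/tau) * (v_i^q - v_i).
    f_int = dv_dt - (v_q - v) / tau
    return np.linalg.norm(f_int, axis=2)  # force magnitude per pixel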
In the following, we describe the process of extracting size-adapted spatio-temporal cuboids (Fig. 1). The formal notations used in this paper are introduced below.

Fig 1: The process of extracting the size-adapted spatio-temporal cuboid in a video sequence.

Definition 1: Local maximum point: an interaction force value that is maximal at its position compared with its neighborhood.

Definition 2: Gaussian distribution blob: a blob in which the interaction force values are approximated by a Gaussian distribution; the maximum value appears at the center of the blob.

Definition 3: Gaussian distribution area: a projected two-dimensional area in which the interaction force values are approximated by a Gaussian distribution; the maximum value appears at the center of the area.

After computing the interaction force in each frame, the video sequence is partitioned into blocks of T frames. Each block of T frames is considered as a clip C, which can be viewed as a video document. Many discrete particles move along with the average optical flow in this clip, and the pixels corresponding to each particle in the clip are assigned the interaction force values.
In reality, the center of an abnormal region has the local maximum value of the interaction forces, and the values gradually decrease as the position moves away from the center. For the three-dimensional clip, a Gaussian distribution blob can precisely describe such an abnormal region; thus, the interaction forces contained in a Gaussian distribution blob characterize a complete visual word, and a video document is composed of a large number of visual words. The key task for us is to extract the Gaussian distribution blobs. In three-dimensional space, it is not easy to extract a Gaussian distribution blob directly. An alternative is to obtain the approximate region of a Gaussian distribution blob. Obviously, the Gaussian distribution blob must satisfy the following conditions: i) only one local extreme point exists in the area; ii) its size should be large enough to cover the possible pixels. Our method is therefore carried out in two steps based on these conditions: the first step is to find the local maximum points in the spatio-temporal clip, and then the size-adapted cuboid around each local maximum point is extracted. In the following, we explain our method in detail, and the experimental results show that this approximate method achieves excellent performance.

2.1. Choosing the Local Maximum Points in the Spatio-Temporal Clip

It can be observed that the centers of all Gaussian distribution blobs achieve the local maximum values in space and time simultaneously.
Fig 2: Projecting the spatio-temporal clip onto the XOY, XOT and YOT planes.

As shown in Fig. 2, for a spatio-temporal clip we establish a three-dimensional coordinate system (X, Y, T). After computing the interaction forces in the clip, all discrete particles are projected onto the XOY, XOT and YOT planes separately. Without loss of generality, assume that the number of particles in a projection plane is N, and that the plane has width W and height H. For each particle in each plane, a two-dimensional Gaussian model is used to distribute its force value to all pixels in the plane [22]. The interaction force at an arbitrary pixel x_i in a plane is computed as

    F(x_i) = \sum_{j=1}^{N} F_{int_j}\, \aleph(x_i; a_j, B),  \quad i = 1, 2, \dots, W * H,    (3)
where x_i denotes the two-dimensional coordinates of a pixel in the plane and F_{int_j} is the interaction force of particle j. The mean a_j is the coordinates of the j-th particle, and the Gaussian model \aleph for each particle has the same covariance matrix B. The interaction forces for each plane are shown in Fig. 3.

Fig 3: The results of distributing each particle's force value to all points in each plane; the local maximum points are marked with '*'.
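As a sketch of Eq. (3), the following Python fragment spreads each particle's force over a projection plane with an isotropic Gaussian. The shared covariance B = cov * I and its value are assumptions made for illustration; the paper does not specify B.

import numpy as np

def distribute_forces(particles, forces, width, height, cov=4.0):
    # Eq. (3): F(x_i) = sum_j F_int_j * N(x_i; a_j, B), with B = cov * I
    # (the isotropic covariance is an illustrative assumption).
    ys, xs = np.mgrid[0:height, 0:width]          # pixel coordinates
    plane = np.zeros((height, width))
    norm = 1.0 / (2.0 * np.pi * cov)              # 2-D Gaussian normalizer
    for (ax, ay), f in zip(particles, forces):
        d2 = (xs - ax) ** 2 + (ys - ay) ** 2      # squared distance to a_j
        plane += f * norm * np.exp(-d2 / (2.0 * cov))
    return plane                                  # F(x_i) for every pixel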
T he interaction force F
*
3 2.5 2 1.5 1 0.5
Local maximum points
0 200 150
Y 100
工业胶水
50 0 0 XOY
50
100 X
150
nba成员200
250
6 5 4 3 2
1 0 20 15
薪酬绩效
T 10
T
5 0 0
怎么管理员工XOT
50
100 X X
150
200
250
The interaction force F
8 7 6 5 4 3 2 1 0 200 150 100 20
Y
50 0 0
5
10 T
15
YOT
The local maximum point in a spatio-temporal clip should attain the local maximum values in the XOY, XOT and YOT planes simultaneously. In each plane, we choose a point as a local maximum point when its interaction force is maximal compared with its eight adjacent points. Thus we obtain a local maximum point set U_1 = {(x_1, y_1), (x_2, y_2), ...} in the XOY plane. Similarly, we obtain the sets U_2 = {(x_1, t_1), (x_2, t_2), ...} and U_3 = {(y_1, t_1), (y_2, t_2), ...} for the XOT and YOT planes, respectively. Finally, we choose a point v_i = (x_i, y_i, t_i) as a local maximum point of the spatio-temporal clip when it satisfies the conditions

    (x_i, y_i) \in U_1, \quad (x_i, t_i) \in U_2, \quad (y_i, t_i) \in U_3.    (4)
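The selection rule of Eq. (4) can be sketched as follows. The 8-neighborhood comparison is done with SciPy's maximum_filter, and the [row, column] indexing of the three projected planes is our assumed convention.

import numpy as np
from scipy.ndimage import maximum_filter

def local_maxima_2d(plane):
    # Points no smaller than any of their 8 neighbors; the plane > 0
    # test drops the flat zero background.
    peaks = (plane == maximum_filter(plane, size=3)) & (plane > 0)
    return set(map(tuple, np.argwhere(peaks)))

def clip_local_maxima(fxy, fxt, fyt):
    # Eq. (4): keep (x, y, t) whose projections are local maxima in all
    # three planes; fxy, fxt, fyt are indexed [y, x], [t, x], [t, y].
    u1 = {(x, y) for y, x in local_maxima_2d(fxy)}  # XOY plane
    u2 = {(x, t) for t, x in local_maxima_2d(fxt)}  # XOT plane
    u3 = {(y, t) for t, y in local_maxima_2d(fyt)}  # YOT plane
    return [(x, y, t)
            for (x, y) in u1
            for t in range(fxt.shape[0])
            if (x, t) in u2 and (y, t) in u3]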
2.2. Extracting the Size-Adapted Cuboid around Each Local Maximum Point

The positions of the visual words (cuboids) have been identified approximately once we find the local maximum points {v_1, v_2, ..., v_I}, where I is the number of local maximum points. As discussed in Section 2, a visual word should contain only one extreme point and be large enough to cover the possible points. The Gaussian distribution blob satisfies these conditions; thus, approximately, the blob's projection onto each plane of the spatio-temporal clip should also satisfy the two conditions. An alternative approach is proposed here: we find the approximate Gaussian distribution area in each plane, which satisfies the conditions of a Gaussian distribution blob, and the approximate Gaussian distribution blob is then identified as the intersection of the cuboids produced by the Gaussian distribution areas. Only a pixel contained in a Gaussian distribution area is partitioned into a local region of the projection plane. The algorithm for obtaining the Gaussian distribution areas in the XOY plane is presented below.

Algorithm: Obtaining the Gaussian distribution areas in the XOY plane.
Input: The local maximum points {v_1, v_2, ..., v_I} in a spatio-temporal clip and the XOY plane.
For each local maximum point v_i = (x_i, y_i, t_i):
    Let x_i^l = x_i; x_i^h = x_i; y_i^l = y_i; y_i^h = y_i
    while F((x_i^l - 1, y_i)) < F((x_i^l, y_i)):  x_i^l = x_i^l - 1
    while F((x_i^h + 1, y_i)) < F((x_i^h, y_i)):  x_i^h = x_i^h + 1
    while F((x_i, y_i^l - 1)) < F((x_i, y_i^l)):  y_i^l = y_i^l - 1
    while F((x_i, y_i^h + 1)) < F((x_i, y_i^h)):  y_i^h = y_i^h + 1
    Return the approximate Gaussian distribution area g_i^x = {(x, y) | x_i^l <= x <= x_i^h, y_i^l <= y <= y_i^h}.
Output: The set of approximate Gaussian distribution areas {g_1^x, g_2^x, ..., g_I^x} in the XOY plane.
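A direct Python transcription of the algorithm above, with the boundary checks that the pseudocode leaves implicit; F is assumed to be the projected force plane indexed as F[y, x].

def gaussian_area(F, x0, y0):
    # Grow an interval around the local maximum (x0, y0) along each axis
    # while the force keeps decreasing (the tail of the Gaussian blob).
    h, w = F.shape
    xl = xh = x0
    yl = yh = y0
    while xl > 0 and F[y0, xl - 1] < F[y0, xl]:      # expand left
        xl -= 1
    while xh < w - 1 and F[y0, xh + 1] < F[y0, xh]:  # expand right
        xh += 1
    while yl > 0 and F[yl - 1, x0] < F[yl, x0]:      # expand to smaller y
        yl -= 1
    while yh < h - 1 and F[yh + 1, x0] < F[yh, x0]:  # expand to larger y
        yh += 1
    return (xl, xh), (yl, yh)   # g_i^x = [xl, xh] x [yl, yh]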
Each Gaussian distribution area in the XOY plane can be extended over the whole spatio-temporal clip, that is, g_i^{x'} = \{(x, y, t) \mid x_i^l \le x \le x_i^h,\ y_i^l \le y \le y_i^h,\ 0 \le t \le T\}, which is drawn in Fig. 4. The same is done for the other two planes, so we obtain {g_1^{t'}, g_2^{t'}, ..., g_I^{t'}} for the XOT plane and {g_1^{y'}, g_2^{y'}, ..., g_I^{y'}} for the YOT plane. A spatio-temporal cuboid that satisfies the following condition is chosen as a visual word w:

    w = \{(x, y, t) \mid (x, y, t) \in g_i^{x'} \cap g_j^{y'} \cap g_k^{t'},\ g_i^{x'} \cap g_j^{y'} \cap g_k^{t'} \neq \emptyset\},    (5)

where i, j and k are arbitrary values in the set {1, 2, ..., I}. The result is shown in Fig. 5. The number of visual words in a clip is not assigned by the operator, but is determined by the number of Gaussian distribution blobs that exist in the video sequence, and each word's size is determined by the size of its Gaussian distribution blob. We extract all the visual words, which are then used to represent the video sequence.
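A brute-force sketch of Eq. (5): every non-empty triple-wise intersection of one extended area per plane becomes a visual word. The interval encoding of gx, gy and gt is our assumption, and the O(I^3) loop is kept simple for clarity.

from itertools import product

def visual_words(gx, gy, gt):
    # gx[i] = ((xl, xh), (yl, yh)): area from the XOY plane (t is free),
    # gy[j] = ((yl, yh), (tl, th)): area from the YOT plane (x is free),
    # gt[k] = ((xl, xh), (tl, th)): area from the XOT plane (y is free).
    words = []
    for i, j, k in product(range(len(gx)), range(len(gy)), range(len(gt))):
        (ax0, ax1), (ay0, ay1) = gx[i]
        (by0, by1), (bt0, bt1) = gy[j]
        (cx0, cx1), (ct0, ct1) = gt[k]
        # Intersect the three extended cuboids axis by axis.
        xl, xh = max(ax0, cx0), min(ax1, cx1)
        yl, yh = max(ay0, by0), min(ay1, by1)
        tl, th = max(bt0, ct0), min(bt1, ct1)
        if xl <= xh and yl <= yh and tl <= th:   # non-empty intersection
            words.append(((xl, xh), (yl, yh), (tl, th)))
    return words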
Fig 4: The result of extending the area from the two-dimensional Gaussian distribution to the whole spatio-temporal clip.

Fig 5: The process of getting the size-adapted spatio-temporal cuboid.

3. ABNORMAL EVENT DETECTION
Now we have a large number of size-adapted spatio-temporal cuboids. To compare any two visual words, one would traditionally compute the Euclidean distance between them [23]. However, since the sizes of the visual words are not equal, this comparison method cannot be used here directly. In this section, we introduce an effective comparison method between two size-adapted cuboids; the Latent Dirichlet Allocation (LDA) model can then be used for anomaly detection.

To train the LDA model, we first cluster a codebook for the monitored scenario. We extract a large number of cuboids from clips that include both normal and abnormal scenarios. Each cuboid is flattened sequentially into a vector H = (F_{int_1}, F_{int_2}, ..., F_{int_m}), where m is the number of pixel points in the cuboid. We thus obtain a representation of the cuboids {H_1, H_2, ..., H_L}, where L is the number of cuboids. Because the cuboids have different sizes, each H_i has a different length, so we cannot use K-means to obtain a codebook directly [20].
The first step of our method is to group the cuboids {H_1, H_2, ..., H_L} into categories by their lengths: cuboids of the same length are grouped into the same category. The K-means clustering approach is then used to cluster the cuboids within each category. Thus, in our method, the codebook is not composed of a large number of words directly, but is constructed as a set of two-level trees: the first level is the category based on the visual word's length, and the second level holds the codewords of that category. The codebook we obtain is large enough that, for a new H, there will usually be a category in the first level with the same length as H. When there is no such category, we choose the category i (i = 1, 2, ...) whose length is closest to that of H, and the length of H is adjusted to the length of category i using a traditional interpolation method (in this paper, we adopt bi-cubic interpolation). Finally, the Euclidean distance is used to determine which codeword H belongs to. For each clip, after extracting the size-adapted spatio-temporal cuboids, we use the bag-of-words method in combination with the codebook to represent it [18, 21]. Fig. 6 illustrates this process.

Fig 6: The process of using the bag-of-words method to describe a clip.
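A minimal sketch of the two-level codebook, assuming scikit-learn's K-means; the per-category codeword count k and the 1-D linear interpolation (a stand-in for the paper's bi-cubic resampling) are simplifications for illustration.

import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

def build_codebook(vectors, k=50):
    # First level: group the flattened cuboids H by length; second level:
    # K-means codewords within each length category (k is illustrative).
    by_len = defaultdict(list)
    for h in vectors:
        by_len[len(h)].append(h)
    return {length: KMeans(n_clusters=min(k, len(group)), n_init=10)
                    .fit(np.array(group)).cluster_centers_
            for length, group in by_len.items()}

def assign_codeword(h, codebook):
    # Pick the category whose length is closest to len(h), resample h to
    # that length (linear here; the paper uses bi-cubic interpolation),
    # then return the nearest codeword by Euclidean distance.
    length = min(codebook, key=lambda L: abs(L - len(h)))
    h = np.asarray(h, dtype=float)
    if length != len(h):
        h = np.interp(np.linspace(0, len(h) - 1, length),
                      np.arange(len(h)), h)
    centers = codebook[length]
    return length, int(np.argmin(np.linalg.norm(centers - h, axis=1)))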
We then construct a corpus C = {c_1, c_2, ..., c_d} from a set of normal samples of a given scene, where d is the number of clips. By maximizing the likelihood of the corpus as in Eq. (6), we approximate the model parameters \alpha, \beta of the normal scene [19]:

    \ell(\alpha, \beta) = \sum_{i=1}^{d} \log p(c_i \mid \alpha, \beta),    (6)

where \alpha is the Dirichlet parameter for topics and \beta is the topic-dependent Dirichlet parameter for the visual word indices [24]. For a new clip, we also extract the size-adapted spatio-temporal cuboids and represent the clip with the bag-of-words method in combination with the codebook. Using the trained LDA model, we estimate the likelihood \rho = \log p(c \mid \alpha, \beta) for each new clip c and obtain the final result from a given threshold \eta: when \rho < \eta, the clip c is labeled as abnormal. The threshold \eta is determined empirically through experiments.
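The training and detection steps can be sketched with scikit-learn's LDA. This is an implementation assumption on our part: the paper does not name a library, score() returns only a variational approximation of log p(c | alpha, beta), and its scale differs from whatever produced the eta = -250 used in Section 4, so the threshold would have to be recalibrated. X_train below is a random stand-in for real bag-of-words counts.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in corpus: d = 20 clips over a Z = 1000-word codebook, matching
# the sizes reported in Section 4 (real counts would come from Fig. 6).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 5, size=(20, 1000))

# E = 4 latent topics, as in the experiments.
lda = LatentDirichletAllocation(n_components=4, random_state=0)
lda.fit(X_train)

def is_abnormal(x_new, eta=-250.0):
    # rho: approximate log-likelihood of the new clip under the normal
    # model; the clip is labeled abnormal when rho < eta. eta's scale
    # depends on the likelihood approximation and must be tuned.
    rho = lda.score(np.asarray(x_new).reshape(1, -1))
    return rho < eta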
4. EXPERIMENTS AND DISCUSSIONS
Our method described in the previous sections is tested on a publicly available dataset of normal and abnormal crowd videos and on our own dataset. The detailed process is described as follows. We use a scene in which many people scatter in a panic to test our anomaly detection method; Fig. 7 shows sample frames of this scene.

Fig 7: The scene of many people scattering in panic. The left frame shows the normal scene; the middle and right frames show the process of people scattering in panic until they disappear.

(1) Training. We use the normal scene in Fig. 7 for LDA training. The size-adapted spatio-temporal cuboids are extracted from blocks of T = 30 frames of the video sequences, and their number is determined automatically. The final size of the codebook is Z = 1000, and the final corpus contains d = 20 clips. The LDA model is used to learn E = 4 latent topics.

(2) Test. To evaluate this approach, three video sequences are recorded repeatedly according to similar activities in the same scene; each video describes the same activities as in Fig. 7. Ten normal activity samples and ten panic activity samples are extracted from each of the three video sequences. The computation results of the LDA are shown in Table 1. We choose \eta = -250 as the threshold; the detection process is described at the end of Section 3 and is well illustrated in Table 1. Among the 60 test samples, the number of wrong judgements is 3, so the average successful detection rate is over 94%. The experimental results show that our method achieves an excellent result. In the comparative experiments, we compare our approach with the method that uses size-fixed spatio-temporal cuboids in terms of the successful detection rate.
