Abnormal Crowd Behavior Detection using Size-Adapted Spatio-Temporal Features


Bo Wang, Mao Ye, Xue Li and Fengjuan Zhao
Abstract: Abnormal crowd behavior detection is an important research issue in computer vision. However, complex real-life situations (e.g., severe occlusion, over-crowding, etc.) still challenge the effectiveness of previous algorithms. Recently, methods based on spatio-temporal cuboids have become popular in video analysis. To our knowledge, in existing methods the spatio-temporal cuboids are always extracted randomly from a video sequence, and the size of each cuboid and the total number of cuboids are determined empirically. The extracted features therefore either contain redundant information or lose a lot of important information, which severely affects the accuracy. In this paper, we propose an improved method in which the spatio-temporal cuboids are determined not arbitrarily, but by the information contained in the video sequence: each cuboid is extracted with an adaptive size, and the total number of cuboids and the extraction positions are determined automatically. Moreover, to compute the similarity between two spatio-temporal cuboids with different sizes, we design a novel codebook data structure constructed as a set of two-level trees. The experimental results show that the false positive and false negative rates are significantly reduced.

Keywords: Spatio-temporal cuboid, Latent Dirichlet Allocation (LDA), Social force model, Codebook

This work was supported in part by the National Natural Science Foundation of China (60702071), the Program for New Century Excellent Talents in University (NCET-06-0811), the 973 National Basic Research Program of China (2010CB-732501), the Foundation of Sichuan Excellent Young Talents (09ZQ026-035), and the Open Project of the State Key Lab. for Novel Software Technology of Nanjing University. Bo Wang, Mao Ye and Fengjuan Zhao are with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, P. R. China, and the State Key Lab. for Novel Software Technology, Nanjing University, P. R. China. Xue Li is with the School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, Queensland 4072, Australia (e-mail: xueli@itee.uq.edu.au).
1. INTRODUCTION
Abnormal crowd behavior detection is an important research field in computer vision [1–3]. The analysis of behaviors such as abnormal ones presents a challenge for its effectiveness [4]. Recent research developments and methods in this area are reviewed in [5, 6]. As reported in [5], research on group behaviors can be divided mainly into three categories. The first consists of the traditional object-based approaches, which consider the group as a collection of individuals [7–10]. These methods analyze crowd behaviors through individuals, and in simple situations with a small number of moving objects they can achieve good results. However, in complex scenes there exist severe occlusions, which make object segmentation, tracking and behavior recognition almost impossible; the computational cost is also greatly influenced by the number of objects (e.g., people). The second category of methods focuses on analyzing the entire video frame and extracting subject-specific information [11–13]. A classic application of this approach is to use optical flow to characterize the motion features [14, 15]. Unfortunately, reliable optical flow can hardly be obtained in extremely crowded scenes or in scenes with severe lighting changes. The third category combines the two research frameworks. For example, the method proposed in [13] not only analyzes human activities from the whole-video viewpoint, but also tracks pedestrians. However, such methods carry a large computational load.

Recently, the second framework has become popular in abnormal crowd behavior detection. A method based on a social force model was creatively proposed in [17, 18]; it analyzes human behaviors through the intentions of their movements. A force flow is formed for every pixel [18], and Latent Dirichlet Allocation (LDA) [19, 20] is then used to build a model of the normal scenes. Traditionally, LDA is used as a text modeling method: it considers a whole document to be composed of a large number of randomly arranged words. In computer vision, LDA has been widely used for recognizing and learning object categories [21].
For abnormal crowd behavior detection, a video sequence can be regarded as a document, and a spatio-temporal cuboid as a visual word [17, 18]. Like the words in a document, it is impossible for all visual words to have the same length. However, the existing algorithms extract size-fixed spatio-temporal cuboids randomly from a sequence of force flow [18]; the number of cuboids is also fixed and assigned artificially. Previous methods will thus cut down long words or pad short words with redundant information. Our point is that extracting cuboids randomly and fixing the size of cuboids in a video sequence are inappropriate in practical situations: doing so causes the cuboid set to contain much useless information or to lose useful information, and the effectiveness of detection is harmed significantly. Our experiments confirm this claim.

In this paper, a rigorous definition of abnormal crowd behavior can be stated as follows. Suppose there exists a mathematical model describing the normal behaviors in the crowd videos; a crowd behavior is considered abnormal when its value computed from the model is discriminative from the normal values. Based on the social force model, we propose an efficient method to extract size-adapted cuboids, in which the number of cuboids extracted from a video sequence is determined by the video sequence itself. Then, an LDA model is used to detect the abnormal crowd behavior. Our main contributions can be summarized as follows.
1) An extraction method for size-adapted spatio-temporal cuboids is proposed; the positions and total number of the cuboids are determined automatically.
2) The codebook is constructed as a set of two-level trees, which makes the similarity computation between two spatio-temporal cuboids with different sizes more reasonable.
3) Compared with the previous methods, the detection accuracy is significantly improved.

The rest of the paper is organized as follows. The method of extracting the size-adapted cuboids is proposed in Section 2. In Section 3, we describe the method of detecting abnormal crowd behavior. Finally, in Section 4, we demonstrate the feasibility and effectiveness of our proposed method for detecting abnormal crowd behaviors on a public test dataset.
2. EXTRACTING SIZE-ADAPTED SPATIO-TEMPORAL CUBOIDS
Firstly, we briefly describe the method of estimating the interaction forces in a crowd; details are in [18]. Since people are dense in a crowd video, their movements are restricted and they can be considered as granular particles. Thus, the crowd is treated as a collection of interacting particles. Initially, a grid of particles is placed over the image [12]. For each particle, an average optical flow is computed [18]. Then, the particle moves along with this average optical flow. The interaction force of particle i can be simply estimated as

    F_{int} = \frac{dv_i}{dt} - \frac{1}{\tau}\,(v_i^q - v_i),    (1)

where v_i is the spatio-temporal average of the optical flow in the neighborhood of particle i, \tau is the relaxation parameter, and v_i^q is the desired velocity of particle i,

    v_i^q = (1 - p_i)\, O(x_i, y_i) + p_i\, O_{ave}(x_i, y_i),    (2)

where O(x_i, y_i) is the optical flow of particle i in the (x, y) plane, O_{ave}(x_i, y_i) is the average optical flow in the neighborhood of particle i in the (x, y) plane, and p_i is the panic weight parameter.
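To make the force estimation concrete, the following minimal Python sketch evaluates Eqs. (1) and (2) on dense optical-flow fields. The use of OpenCV's Farneback flow, the finite-difference stand-in for dv_i/dt, and the parameter values tau, p and k are our own illustrative assumptions; the paper defers these details to [18].

import numpy as np
import cv2

def dense_flow(f0, f1):
    # Dense optical flow between two consecutive grayscale frames (Farneback).
    return cv2.calcOpticalFlowFarneback(f0, f1, None, 0.5, 3, 15, 3, 5, 1.2, 0)

def interaction_force(flow_prev, flow_curr, tau=0.5, p=0.1, k=5):
    # Sketch of Eqs. (1)-(2); tau, p (panic weight) and the averaging
    # window k are illustrative values, not prescribed by the paper.
    # v_i: spatio-temporal average of the flow around each particle,
    # approximated here by spatial blurring plus averaging two flow fields.
    v = cv2.blur(0.5 * (flow_prev + flow_curr), (k, k))
    # Eq. (2): desired velocity mixes the raw flow O and the averaged O_ave.
    o_ave = cv2.blur(flow_curr, (k, k))
    v_q = (1.0 - p) * flow_curr + p * o_ave
    # dv_i/dt approximated by a finite difference of consecutive flow fields.
    dv_dt = flow_curr - flow_prev
    # Eq. (1): F_int = dv_i/dt - (1/tau) * (v_i^q - v_i).
    f_int = dv_dt - (v_q - v) / tau
    return np.linalg.norm(f_int, axis=2)  # force magnitude per pixel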
In the following, we describe the process of extracting size-adapted spatio-temporal cuboids (Fig. 1). The formal notations used in this paper are introduced below.

Fig 1: The process of extracting the size-adapted spatio-temporal cuboid in a video sequence.

Definition 1: Local maximum point: an interaction force value that is maximal at its position compared with its neighborhood.

Definition 2: Gaussian distribution blob: a blob in which the interaction force values are approximated by a Gaussian distribution; the maximum value appears at the center of the blob.

Definition 3: Gaussian distribution area: a projected two-dimensional area in which the interaction force values are approximated by a Gaussian distribution; the maximum value appears at the center of the area.

After computing the interaction force in each frame, the video sequence is partitioned into blocks of T frames. Each block of T frames is considered as a clip C, which can be viewed as a video document. Many discrete particles move along with the average optical flow in this clip, and the pixels corresponding to each particle in the clip are assigned the interaction force values.
In reality, the center of an abnormal region has the local maximum value of the interaction forces, and the values gradually decrease as the position moves away from the center. For the three-dimensional clip, a Gaussian distribution blob can precisely describe such an abnormal region; thus, the interaction forces contained in a Gaussian distribution blob characterize a complete visual word, and a video document is composed of a large number of visual words. The key task for us is to extract the Gaussian distribution blobs. In three-dimensional space, it is not easy to extract a Gaussian distribution blob directly. An alternative is to obtain the approximate region of a Gaussian distribution blob. Obviously, the Gaussian distribution blob must satisfy the following conditions: i) only one local extreme point exists in the area; ii) its size should be large enough to cover the possible pixels. Our method is therefore carried out in two steps based on these conditions: the first step is to find the local maximum points in the spatio-temporal clip, and then the size-adapted cuboid around each local maximum point is extracted. In the following, we explain our method in detail, and the experimental results show that this approximate method achieves excellent performance.

2.1. Choosing the Local Maximum Points in the Spatio-Temporal Clip

It can be observed that the centers of all Gaussian distribution blobs achieve the local maximum values in space and time simultaneously.
Fig 2: Projecting the spatio-temporal clip onto the XOY, XOT and YOT planes.

As shown in Fig. 2, for a spatio-temporal clip we establish a three-dimensional coordinate system (X, Y, T). After computing the interaction forces in the clip, all discrete particles are projected onto the XOY, XOT and YOT planes separately. Without loss of generality, assume that the number of particles in a projection plane is N, and that the plane has width W and height H. For each particle in each plane, a two-dimensional Gaussian model is used to distribute its force value to all pixels in the plane [22]. The interaction force at an arbitrary pixel x_i in a plane is computed as

    F(x_i) = \sum_{j=1}^{N} F_{int_j}\, \aleph(x_i; a_j, B),  \quad i = 1, 2, \dots, W * H,    (3)
where x_i denotes the two-dimensional coordinates of a pixel in the plane and F_{int_j} is the interaction force of particle j. The mean a_j is the coordinates of the j-th particle, and the Gaussian model \aleph for each particle has the same covariance matrix B. The interaction forces for each plane are shown in Fig. 3.

Fig 3: The results of distributing each particle's force value to all points in each plane; the local maximum points are marked with '*'.
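As a sketch of Eq. (3), the following Python fragment spreads each particle's force over a projection plane with an isotropic Gaussian. The shared covariance B = cov * I and its value are assumptions made for illustration; the paper does not specify B.

import numpy as np

def distribute_forces(particles, forces, width, height, cov=4.0):
    # Eq. (3): F(x_i) = sum_j F_int_j * N(x_i; a_j, B), with B = cov * I
    # (the isotropic covariance is an illustrative assumption).
    ys, xs = np.mgrid[0:height, 0:width]          # pixel coordinates
    plane = np.zeros((height, width))
    norm = 1.0 / (2.0 * np.pi * cov)              # 2-D Gaussian normalizer
    for (ax, ay), f in zip(particles, forces):
        d2 = (xs - ax) ** 2 + (ys - ay) ** 2      # squared distance to a_j
        plane += f * norm * np.exp(-d2 / (2.0 * cov))
    return plane                                  # F(x_i) for every pixel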
T he interaction force F
*
3 2.5 2 1.5 1 0.5
Local maximum points
0 200 150
Y 100
工业胶水
50 0 0 XOY
50
100 X
150
nba成员200
250
6 5 4 3 2
1 0 20 15
薪酬绩效
T 10
T
5 0 0
怎么管理员工XOT
50
100 X X
150
200
250
The interaction force F
8 7 6 5 4 3 2 1 0 200 150 100 20
Y
50 0 0
5
10 T
15
YOT
The local maximum point in a spatio-temporal clip should attain the local maximum values in the XOY, XOT and YOT planes simultaneously. In each plane, we choose a point as a local maximum point when its interaction force is maximal compared with its eight adjacent points. Thus we obtain a local maximum point set U_1 = {(x_1, y_1), (x_2, y_2), ...} in the XOY plane. Similarly, we obtain the sets U_2 = {(x_1, t_1), (x_2, t_2), ...} and U_3 = {(y_1, t_1), (y_2, t_2), ...} for the XOT and YOT planes, respectively. Finally, we choose a point v_i = (x_i, y_i, t_i) as a local maximum point of the spatio-temporal clip when it satisfies the conditions

    (x_i, y_i) \in U_1, \quad (x_i, t_i) \in U_2, \quad (y_i, t_i) \in U_3.    (4)
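The selection rule of Eq. (4) can be sketched as follows. The 8-neighborhood comparison is done with SciPy's maximum_filter, and the [row, column] indexing of the three projected planes is our assumed convention.

import numpy as np
from scipy.ndimage import maximum_filter

def local_maxima_2d(plane):
    # Points no smaller than any of their 8 neighbors; the plane > 0
    # test drops the flat zero background.
    peaks = (plane == maximum_filter(plane, size=3)) & (plane > 0)
    return set(map(tuple, np.argwhere(peaks)))

def clip_local_maxima(fxy, fxt, fyt):
    # Eq. (4): keep (x, y, t) whose projections are local maxima in all
    # three planes; fxy, fxt, fyt are indexed [y, x], [t, x], [t, y].
    u1 = {(x, y) for y, x in local_maxima_2d(fxy)}  # XOY plane
    u2 = {(x, t) for t, x in local_maxima_2d(fxt)}  # XOT plane
    u3 = {(y, t) for t, y in local_maxima_2d(fyt)}  # YOT plane
    return [(x, y, t)
            for (x, y) in u1
            for t in range(fxt.shape[0])
            if (x, t) in u2 and (y, t) in u3]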
2.2. Extracting the Size-Adapted Cuboid around Each Local Maximum Point

The positions of the visual words (cuboids) have been identified approximately once we find the local maximum points {v_1, v_2, ..., v_I}, where I is the number of local maximum points. As discussed in Section 2, a visual word should contain only one extreme point and be large enough to cover the possible points. The Gaussian distribution blob satisfies these conditions; thus, approximately, the blob's projection onto each plane of the spatio-temporal clip should also satisfy the two conditions. An alternative approach is proposed here: we find the approximate Gaussian distribution area in each plane, which satisfies the conditions of a Gaussian distribution blob, and the approximate Gaussian distribution blob is then identified as the intersection of the cuboids produced by the Gaussian distribution areas. Only a pixel contained in a Gaussian distribution area is partitioned into a local region of the projection plane. The algorithm for obtaining the Gaussian distribution areas in the XOY plane is presented below.

Algorithm: Obtaining the Gaussian distribution areas in the XOY plane.
Input: The local maximum points {v_1, v_2, ..., v_I} in a spatio-temporal clip and the XOY plane.
For each local maximum point v_i = (x_i, y_i, t_i):
    Let x_i^l = x_i; x_i^h = x_i; y_i^l = y_i; y_i^h = y_i
    while F((x_i^l - 1, y_i)) < F((x_i^l, y_i)):  x_i^l = x_i^l - 1
    while F((x_i^h + 1, y_i)) < F((x_i^h, y_i)):  x_i^h = x_i^h + 1
    while F((x_i, y_i^l - 1)) < F((x_i, y_i^l)):  y_i^l = y_i^l - 1
    while F((x_i, y_i^h + 1)) < F((x_i, y_i^h)):  y_i^h = y_i^h + 1
    Return the approximate Gaussian distribution area g_i^x = {(x, y) | x_i^l <= x <= x_i^h, y_i^l <= y <= y_i^h}.
Output: The set of approximate Gaussian distribution areas {g_1^x, g_2^x, ..., g_I^x} in the XOY plane.
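A direct Python transcription of the algorithm above, with the boundary checks that the pseudocode leaves implicit; F is assumed to be the projected force plane indexed as F[y, x].

def gaussian_area(F, x0, y0):
    # Grow an interval around the local maximum (x0, y0) along each axis
    # while the force keeps decreasing (the tail of the Gaussian blob).
    h, w = F.shape
    xl = xh = x0
    yl = yh = y0
    while xl > 0 and F[y0, xl - 1] < F[y0, xl]:      # expand left
        xl -= 1
    while xh < w - 1 and F[y0, xh + 1] < F[y0, xh]:  # expand right
        xh += 1
    while yl > 0 and F[yl - 1, x0] < F[yl, x0]:      # expand to smaller y
        yl -= 1
    while yh < h - 1 and F[yh + 1, x0] < F[yh, x0]:  # expand to larger y
        yh += 1
    return (xl, xh), (yl, yh)   # g_i^x = [xl, xh] x [yl, yh]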
Each Gaussian distribution area in the XOY plane can be extended over the whole spatio-temporal clip, that is, g_i^{x'} = \{(x, y, t) \mid x_i^l \le x \le x_i^h,\ y_i^l \le y \le y_i^h,\ 0 \le t \le T\}, which is drawn in Fig. 4. The same is done for the other two planes, so we obtain {g_1^{t'}, g_2^{t'}, ..., g_I^{t'}} for the XOT plane and {g_1^{y'}, g_2^{y'}, ..., g_I^{y'}} for the YOT plane. A spatio-temporal cuboid that satisfies the following condition is chosen as a visual word w:

    w = \{(x, y, t) \mid (x, y, t) \in g_i^{x'} \cap g_j^{y'} \cap g_k^{t'},\ g_i^{x'} \cap g_j^{y'} \cap g_k^{t'} \neq \emptyset\},    (5)

where i, j and k are arbitrary values in the set {1, 2, ..., I}. The result is shown in Fig. 5. The number of visual words in a clip is not assigned by the operator, but is determined by the number of Gaussian distribution blobs that exist in the video sequence, and each word's size is determined by the size of its Gaussian distribution blob. We extract all the visual words, which are then used to represent the video sequence.
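A brute-force sketch of Eq. (5): every non-empty triple-wise intersection of one extended area per plane becomes a visual word. The interval encoding of gx, gy and gt is our assumption, and the O(I^3) loop is kept simple for clarity.

from itertools import product

def visual_words(gx, gy, gt):
    # gx[i] = ((xl, xh), (yl, yh)): area from the XOY plane (t is free),
    # gy[j] = ((yl, yh), (tl, th)): area from the YOT plane (x is free),
    # gt[k] = ((xl, xh), (tl, th)): area from the XOT plane (y is free).
    words = []
    for i, j, k in product(range(len(gx)), range(len(gy)), range(len(gt))):
        (ax0, ax1), (ay0, ay1) = gx[i]
        (by0, by1), (bt0, bt1) = gy[j]
        (cx0, cx1), (ct0, ct1) = gt[k]
        # Intersect the three extended cuboids axis by axis.
        xl, xh = max(ax0, cx0), min(ax1, cx1)
        yl, yh = max(ay0, by0), min(ay1, by1)
        tl, th = max(bt0, ct0), min(bt1, ct1)
        if xl <= xh and yl <= yh and tl <= th:   # non-empty intersection
            words.append(((xl, xh), (yl, yh), (tl, th)))
    return words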
Fig 4: The result of extending the area from the two-dimensional Gaussian distribution to the whole spatio-temporal clip.

Fig 5: The process of getting the size-adapted spatio-temporal cuboid.

3. ABNORMAL EVENT DETECTION
Now we have a large number of size-adapted spatio-temporal cuboids. To compare any two visual words, one would traditionally compute the Euclidean distance between them [23]. However, since the sizes of the visual words are not equal, this comparison method cannot be used here directly. In this section, we introduce an effective comparison method between two size-adapted cuboids; the Latent Dirichlet Allocation (LDA) model can then be used for anomaly detection.

To train the LDA model, we first cluster a codebook for the monitored scenario. We extract a large number of cuboids from clips that include both normal and abnormal scenarios. Each cuboid is flattened sequentially into a vector H = (F_{int_1}, F_{int_2}, ..., F_{int_m}), where m is the number of pixel points in the cuboid. We thus obtain a representation of the cuboids {H_1, H_2, ..., H_L}, where L is the number of cuboids. Because the cuboids have different sizes, each H_i has a different length, so we cannot use K-means to obtain a codebook directly [20].
The first step of our method is to group the cuboids {H_1, H_2, ..., H_L} into categories by their lengths: cuboids of the same length are grouped into the same category. The K-means clustering approach is then used to cluster the cuboids within each category. Thus, in our method, the codebook is not composed of a large number of words directly, but is constructed as a set of two-level trees: the first level is the category based on the visual word's length, and the second level holds the codewords of that category. The codebook we obtain is large enough that, for a new H, there will usually be a category in the first level with the same length as H. When there is no such category, we choose the category i (i = 1, 2, ...) whose length is closest to that of H, and the length of H is adjusted to the length of category i using a traditional interpolation method (in this paper, we adopt bi-cubic interpolation). Finally, the Euclidean distance is used to determine which codeword H belongs to. For each clip, after extracting the size-adapted spatio-temporal cuboids, we use the bag-of-words method in combination with the codebook to represent it [18, 21]. Fig. 6 illustrates this process.

Fig 6: The process of using the bag-of-words method to describe a clip.
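A minimal sketch of the two-level codebook, assuming scikit-learn's K-means; the per-category codeword count k and the 1-D linear interpolation (a stand-in for the paper's bi-cubic resampling) are simplifications for illustration.

import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

def build_codebook(vectors, k=50):
    # First level: group the flattened cuboids H by length; second level:
    # K-means codewords within each length category (k is illustrative).
    by_len = defaultdict(list)
    for h in vectors:
        by_len[len(h)].append(h)
    return {length: KMeans(n_clusters=min(k, len(group)), n_init=10)
                    .fit(np.array(group)).cluster_centers_
            for length, group in by_len.items()}

def assign_codeword(h, codebook):
    # Pick the category whose length is closest to len(h), resample h to
    # that length (linear here; the paper uses bi-cubic interpolation),
    # then return the nearest codeword by Euclidean distance.
    length = min(codebook, key=lambda L: abs(L - len(h)))
    h = np.asarray(h, dtype=float)
    if length != len(h):
        h = np.interp(np.linspace(0, len(h) - 1, length),
                      np.arange(len(h)), h)
    centers = codebook[length]
    return length, int(np.argmin(np.linalg.norm(centers - h, axis=1)))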
We then construct a corpus C = {c_1, c_2, ..., c_d} from a set of normal samples of a given scene, where d is the number of clips. By maximizing the likelihood of the corpus as in Eq. (6), we approximate the model parameters \alpha, \beta of the normal scene [19]:

    \ell(\alpha, \beta) = \sum_{i=1}^{d} \log p(c_i \mid \alpha, \beta),    (6)

where \alpha is the Dirichlet parameter for topics and \beta is the topic-dependent Dirichlet parameter for the visual word indices [24]. For a new clip, we also extract the size-adapted spatio-temporal cuboids and represent the clip with the bag-of-words method in combination with the codebook. Using the trained LDA model, we estimate the likelihood \rho = \log p(c \mid \alpha, \beta) for each new clip c and obtain the final result from a given threshold \eta: when \rho < \eta, the clip c is labeled as abnormal. The threshold \eta is determined empirically through experiments.
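The training and detection steps can be sketched with scikit-learn's LDA. This is an implementation assumption on our part: the paper does not name a library, score() returns only a variational approximation of log p(c | alpha, beta), and its scale differs from whatever produced the eta = -250 used in Section 4, so the threshold would have to be recalibrated. X_train below is a random stand-in for real bag-of-words counts.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in corpus: d = 20 clips over a Z = 1000-word codebook, matching
# the sizes reported in Section 4 (real counts would come from Fig. 6).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 5, size=(20, 1000))

# E = 4 latent topics, as in the experiments.
lda = LatentDirichletAllocation(n_components=4, random_state=0)
lda.fit(X_train)

def is_abnormal(x_new, eta=-250.0):
    # rho: approximate log-likelihood of the new clip under the normal
    # model; the clip is labeled abnormal when rho < eta. eta's scale
    # depends on the likelihood approximation and must be tuned.
    rho = lda.score(np.asarray(x_new).reshape(1, -1))
    return rho < eta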
4. EXPERIMENTS AND DISCUSSIONS
Our method described in the previous sections is tested on a publicly available dataset of normal and abnormal crowd videos and on our own dataset. The detailed process is described as follows. We use a scene in which many people scatter in a panic to test our anomaly detection method; Fig. 7 shows sample frames of this scene.

Fig 7: The scene of many people scattering in panic. The left frame shows the normal scene; the middle and right frames show the process of people scattering in panic until they disappear.

(1) Training. We use the normal scene in Fig. 7 for LDA training. The size-adapted spatio-temporal cuboids are extracted from blocks of T = 30 frames of the video sequences, and their number is determined automatically. The final size of the codebook is Z = 1000, and the final corpus contains d = 20 clips. The LDA model is used to learn E = 4 latent topics.

(2) Test. To evaluate this approach, three video sequences are recorded repeatedly according to similar activities in the same scene; each video describes the same activities as in Fig. 7. Ten normal activity samples and ten panic activity samples are extracted from each of the three video sequences. The computation results of the LDA are shown in Table 1. We choose \eta = -250 as the threshold; the detection process is described at the end of Section 3 and is well illustrated in Table 1. Among the 60 test samples, the number of wrong judgements is 3, so the average successful detection rate is over 94%. The experimental results show that our method achieves an excellent result. In the comparative experiments, we compare our approach with the method that uses size-fixed spatio-temporal cuboids in terms of the successful detection rate.
