Segmenting Motion Capture Data into Distinct Behaviors Jernej Barbiˇc Alla Safonova Jia-Yu Pan Christos Faloutsos
Jessica K.Hodgins Nancy S.Pollard
Computer Science Department
Carnegie Mellon University
Abstract
Much of the motion capture data ud in animations, commercials,and video games is carefully gmented into distinct motions either at the time of capture or by hand after the capture ssion.As we move toward col-lecting more and longer motion quences,however,au-tomatic gmentation techniques will become important for processing the results in a reasonable time frame. We have found that straightforward,easy to imple-ment gmentation techniques can be very effective for gmenting motion quences into distinct behaviors.In this paper,we prent three approaches for automatic g-mentation.Thefirst two approaches are online,meaning that the algorithm travers the motion from beginning to end,creating the gmentation as it proceeds.Thefirst assigns a cut when the
intrinsic dimensionality of a lo-cal model of the motion suddenly increas.The cond places a cut when the distribution of pos is obrved to change.The third approach is a batch process and g-ments the quence where concutive frames belong to different elements of a Gaussian mixture model.We as-ss the three methods on fourteen motion quences and compare the performance of the automatic methods to that of transitions lected manually.
Key words:human motion,motion capture,motion g-mentation,PCA
1Introduction
chowder
Motion capture is frequently ud in movies and video games becau of the naturalness and subtle detail con-tained in the motion.Currently,motion data are often stored in small clips to allow for easy hand quencing and arches bad on keywords describing the behavior. Extended quences of motion have many advantages over small motion clips.Longer shots may be more com-fortable for the actors and will contain natural transitions from one behavior to the next.Collecting long quences of motion is also the only way to capture natural behavior over extended periods of ,in an office tting). However,gmentation of the quences for indexing, retrieval,and other processing can be tedious.
The prent work suggests that automatic gmenta-tion of human motion data bad on statistical properties of the motion can be an efficient and quite robust alterna-tive to hand gmentation.Our goal is to gment motion into distinct high-level ,walking,running, punching).The problem falls into the category of un-supervid learning in the n that no prior or training models are available—we want to be able to create a g-mentation even when the behaviors have not been en before.We focus on efficient techniques that are easy to implement and scale well with the size of the input mo-tion data.
Given the goals,we cho three alternative gmen-tation techniques to examine.The techniques treat the motion as an ordered quence of pos assumed by the ,an ordered quence of motion frames) and gment the motion where there is a local change in the distribution of pos.Thefirst approach choos gments using an indication of intrinsic dimensionality from Principal Component Analysis(PCA),the cond approach creates gments using a probabilistic model of motion obtained from Probabilistic PCA,and the third approach generates gments bad on a Gaussian mix-ture model reprentation.We have found that very good performance can be obtained from the simple and fast techniques.The best of our methods(Probabilistic PCA) achieves over90%precision for95%recall,that is,very few fal alarms and fal dismissals.
2Related Work
In the vision community,model-bad approaches to recognition,tracking,and gmentation of high-level be-haviors are widely ,[9],[21]).In graphics, Arikan,Forsyth,and O’Brien[3]describe a model-bad approach to motion annotation that could also be em-ployed for gmentation.The approaches,however, rely on the prence of hand-annotated training data.In contrast,our goal is to gment motion bad only on in-formation available in the current motion quence(with-out prior models),and we focus on related work that shares this goal.
Proceedings of Graphics Interface (GI 2004), 2004. London, Ontario, Canada, May 17-19, 2004
A number of rearchers have found that low-level mo-tion gmentation can be achieved in a straightforward manner.Fod,Mataric,and Jenkins[10]gment motion data by detecting zero crossings of angular velocities.Li, Wang,and Shum[19]explore a more sophisticated tech-nique where low-level gments(textons)are reprented as the output of linear dynamic systems.A gmentation of motion is also implicit in a state machine or motion graph reprentation[5,16,17,2].We are interested in gmenting motion into higher-level behaviors,such as walking,running,and sitting.Many low-level behavior components would be obrved within any o
ne of the behaviors,and so a different approach is needed. Segmentation of video quences has been extensively studied.Unsupervid approaches to modeling and g-mentation include entropy minimization to construct Hid-den Markov Models(HMMs),with high-level behaviors mapped to states of the HMM[6],clustering bad on distance metrics developed over a variety of temporal scales[28],and online gmentation bad on consid-erations of information loss[23].Motion capture data is much simpler than video data.We hypothesize that fast online techniques for gmenting video ,as in[23])would work very well for gmentation of mo-tion into higher-level behaviors,and we suggest that the power of HMMs for example,may not be required for good performance.
From the data mining community subspace ,[1])is of particular interest.This approach is designed to identify low-dimensional clusters in high-dimensional data.A straightforward implementation of this technique,however,would require us to consider mo-tion as an unordered t of pos.Our experience with Gaussian Mixture Models(Section3.3)leads us to be-lieve that clustering on unordered pos would not work well,becau it is often easier to locally capture a tran-sition between two behaviors than it is to tea apart a large number of behaviors after they have been mapped into the same space.
Clustering has been ud to improve the efficiency of arching through motion ,[17]),but to our knowledge it has not been explored for the purpo of high-level gmentation of human motion capture data. We evaluate a similar clustering technique for the purpo of motion gmentation(Section3.3)and describe two online techniques that achieve better performance on our datat.
Finally,we note that many rearchers in computer graphics,robotics,computer vision,machine learning and biomechanics have explored the u of Principal Component Analysis and other dimensionality reduction techniques to aid in clustering,modeling,and other pro-cessing of motion(,[14,7,25]).
3Propod Methods
The goal of our algorithm is to gment long motion
quences automatically into distinct behaviors.Infor-
mally,we are looking for high-level behaviors,which
would be described with distinct verbs,possibly with ob-
,walk,run,jump,or wash the window).Mo-
tion quences which can be described with the same verb
overasbut with changes in coordinate ,walk straight,
curve left,then walk straight some more)should not re-
高一英语必修一
sult in distinct gments.Using the same verb with dif-
ferent objects,on the other ,clean the window,
clean thefloor)should result in distinct gments.
To state the problem mathematically,given a long mo-
tion quence M,we wish to gment that quence into
distinct behaviors M1,...,M S,where the boundaries of
the behaviors and the number of behaviors S are to be
determined.Each motion quence M is reprented as
a quence of frames,with120frames per cond.Each
frame is reprented by specifying the rotations(relative
卓意to the parent in the body hierarchy)for all the joints in
the body at that particular time.Rotations are specified
by quaternions.We omit the absolute body position and
body orientation information,so that the approach will be
independent of the specific body position and orientation
in world coordinates.
3.1PCA Approach to Segmentation
The PCA approach is bad on the obrvation that sim-
ple motions exhibit lower dimensionality than more com-
plex motions.Each frame x i(i=1,2,...,n)is repre-
nted as a point in56-dimensional space(denoted by
R56),becau there are14joints in our body hierarchy, and one quaternion is specified per joint.We treat quater-
nions as vectors of length4.The motion quence corre-
sponds to the trajectory in R56and the center of motion
can be defined as
x=
1
n
n
i=1
x i.(1)
For a simple motion,the frames form a cluster,that is spread around the center of motion.Frames lie mostly in some low-dimensional hyperplane containing x,since the 56dimensions are highly correlated.For example,when the right hip moves the right leg forward during walking, the left hip is usually moving backwards.Similar corre-lations exist for most simple motions.Thus,for a given dimensionality r,we can approximate frames as
x i=x+αi1v1+αi2v2+...+αir v r,(2) where v1,v2,...,v r are unit orthogonal vectors forming the basis of the linear subspace corresponding to the hy-
perplane,andαi1,αi2,...,αir are coefficients determin-ing the specific frame.The less correlated the motion, the higher the dimensionality that will be required to ac-curately reprent the motion.
For any hyperplane containing x,we can orthogonally project the frames x i on the hyperplane and call the re-sulting projections x i.The projection introduces the error
住宅区 英语
e=
n
i=1
||x i−x i||2,(3)
where||x||denotes the standard Euclidean norm in R56. Given the t of frames x i and dimensionality r,a r-dimensional hyperplane exists for which the projec-tion error is minimized.This hyperplane can be found by applying the widely ud statistical technique of PCA[15].To perform PCA,the singular value decom-position(SVD)is ud.The center of motion isfirst sub-tracted from the frames,and the centered frames are then organized in a n×56matrix D,where n 56equals the number of frames.The SVD decomposition yields matrices U,V,andΣ,such that
D=UΣV T.(4) Columns of U and V are orthogonal unit vectors,and the matrixΣis a56×56diagonal square matrix,with non-negative decreasing singular valuesσi on its diagonal. Thefirst r columns of V give the basis v1,v2,...v r of the optimal hyperplane of dimension r.Projecting the frames on the optimal hyperplane is equivalent to discarding all singular values except the largest r,and the projection error is
e=
n
i=1
||x i−x i||2=
vegetables是什么意思
56
j=r+1
σ2j.(5)
The ratio
E r= r
j=1
σ2j
56
考研英语一
j=1
σ2
j
(6)
is an indicator of how much information is retained by projecting the frames on the optimal r-dimensional hy-perplane.The dimensionality r can be determined by -lecting the smallest r such that E r>τ,whereτ<1is some specified parameter.Such determination of dimen-sionality r is typical for PCA analysis[11].Our exper-iments show that choosingτ=0.9introduces very lit-tle change to the appearance of motion and prerves the features that a human might likely u to tell one motion apart from another.Note that dimensionality reduction is performed only on the joints of the body,while the body position and orientation remain unaltered.A number of more sophisticated approaches for computing intrinsic di-mensionality of the data are available[20,12,4],but this simple method worked well in our experiments. According to our experiments,whenτ=0.9,we have r≤6for85%of simple motions from a databa of203 simple motions,obtained by manually gmenting a part of a larger motion capture databa.If a motion consists of two or more simple mo
tions,more dimensions are nec-essary to achieve the same projection error,becau the data exhibits more variety.Put another way,for afixed r,the projection error will be greater.This obrvation forms the basis of our algorithm:transition from one be-havior to another will coincide with the point along the motion where,for afixed r,the projection error begins to increa rapidly.
First,we u afixed number of frames k to determine the number of dimensions r to sufficiently reprent the first k frames of motion,using the rule E r>τ.Param-eter k is t to240(2cs)for our experiments.For every value of i,where i is steadily increasing by one, we compute the SVD decomposition of frames1through i,and evaluate the total error e i using Equation5.For a simple motion,the error e i ris at an approximately constant slope,becau the hyperplane does not change much within a single simple motion,and new frames will on average introduce about the same error.When tran-sitioning into a new motion,however,e i will ri much more quickly.To detect the transition,we obrve the dis-crete derivative d i=e i−e i− ,where parameter must be large enough to avoid noi in the data.We t it to =60.The initial value of i is k.To avoid the initial os-cillation of the average due to the small number of frames and data points d j,we begin testing for a cut only after i has become larger than some specified parameter i0.In our experiment,we u i0=k+ =300.Note that the choices of parameters imply that any simple motion must last at least2.5con
ds for the algorithm to detect it.Al-most all of simple motions in our motion databa satisfy this criterion.
For a simple motion,the derivative d i is more or less constant,with minor oscillations around the average af-ter an initial period of noi due to the small number of frames.When a motion transition occurs,the derivative d i ris sharply above the constant value(Figure1).At any given value of i,we can check for a possible cut by computing the average and standard deviation of all the previous data points d j,where j<i.If the current data point d i is more than kσ=3standard deviations from the average,we assign a motion cut to that frame.To process the rest of the motion,we restart the algorithm at the cut frame.
Figure1:Transition from walking to running.The verti-cal line at the frame861corresponds to the cut detected by the algorithm.The solid line shows the derivative d i.The central dotted line reprents the average of the derivative.The two outer dotted lines show the range corresponding to one standard deviation from the aver-age.Derivative average and standard deviation at frame i are computed over all previous data points d ,over all j<i.
3.2Probabilistic PCA Approach to Segmentation
powersupplyIn the cond approach,we u Probabilistic PCA (PPCA)to estimate the distribution of the
motion data. PPCA is an extension of the traditional PCA[24,26] and defines a proper probability model for PCA.
In PCA the directions“outside”the subspace are sim-ply discarded,whereas in PPCA they are modeled with noi.First,the average square of discarded singular values can be defined as
σ2=
1
56−r
56
i=r+1
σ2i.(7)
The entire data t can then be modeled by a Gaussian distribution,where the mean equals the center of motion x and the covariance matrix C is determined by the equa-tions:
W=V r(Σ2r−σ2I)1/2(8)
C=
1
n−1
(W W T+σ2I)=
1
n−1
V˜Σ2V T.(9)美国人的饮食习惯
Here,we follow the notation of Equation4,and introduce V r to denote thefirst r columns of matrix V,andΣr to denote the upper-left r×r block of matrixΣ.Matrix˜Σis a56×56diagonal matrix,obtained fromΣby replacing all discarded singular values byσ.
Figure2:Plot of H as K is repeatedly incread by∆.
We u PPCA to model thefirst K frames of the mo-tion as a Gaussian distribution,reprented by mean x and covariance C,where C is defined bad on the the estimated intrinsic dimensionality of the motion.We es-timate intrinsic dimensionality using the technique spec-ified in the PCA approach.We tτ=0.95.This ap-proach provides an accurate model of a particular behav-ior becau it captures the correlation in the motion of different joint angles as well as the variance of all joint angles.After we compute x and C,we estimate how likely are motion frames K+1through K+T to belong to the Gaussian distribution defined by x and C.We do this by computing an average Mahalanobis distance[8], H for frames K+1through K+T:
H=
1
T
K+T
i=K+1
(x i−x)T C−1(x i−x).(10)
full test
An advantage of using the Mahalanobis measurement for discrimination is that distances are calculated in units of standard deviation from the mean and are therefore data independent.
We next increa K by a small number of frames,∆, and repeat the estimation of distribution for thefirst K frames(K:=K+∆),and of the average Mahalanobis distance,H,for frames K+1:K+T with respect to the new distribution.For our experiments,we ud T=150frames,∆=10frames,and the initial value of K=T.A reasonable estimate of T is half of the anticipated number of frames in the smallest behavior in the databa.
Figure2shows a characteristic pattern of H:it initially decreas,then forms a valley,increas and decreas again,forming a peak.Thefirst decrea in H happens
when frames1:K and K+1:K+T both belong to the same behavior.As we increa K,the algorithm es-timates progressively more accurate distributions of this behavior,making frames K+1:K+T more likely. When the model converges,the valley is reached.The in-crea in H occurs when the new behavior enters frames K+1:K+T.The subquent decrea in H begins when the frames of the new behavior start appearing in frames1:K,and as a result the distribution begins to ac-commodate the new behavior.Thus,a reasonable choice for the gmentation takes place when H forms a peak. The
n,frames1:K contain the old motion and frames K+1:K+T contain thefirst T frames of the new mo-tion.The algorithm declares a cut when a valley in H is followed by a peak,and the difference between the two is at least some threshold R.For our experiments,we t R to15.In general,increasing R results in fewer gments that correspond to more distinct behaviors,whereas de-creasing R results in afiner gmentation.The next cut is found by repeating the algorithm on the rest of the mo-tion.
In rare cas thefirst behavior is characterized by a “wide”distribution,whereas the following behavior is characterized by a narrow distribution,that lies almost completely within thefirst distribution.In this ca, frames from the cond behavior are considered likely to have come from thefirst distribution and the algorithm will fail to detect the gmentation.However,if we do a backward pass of the algorithm,the gmentation will be detected becau frames from thefirst distribution are unlikely to come from the cond distribution.Therefore, we run our algorithm veral times,alternating forward and backward direction,and adding new cuts at each pass until convergence(no new cuts are found).In practice,a single forward pass detects almost all behaviors and at most one backward pass is required.
3.3GMM Approach to Segmentation
In the third approach we employ a Gaussian Mixture Model(GMM)to model the entire quence and then gment the quence whenever two concutive ts of frames belong to different Gaussian distributions.The underlying assumption is that the frames from different simple motions form parate clusters,and each cluster can be
described reasonably well by a Gaussian distribu-tion.Figure3shows the distributions of frames from two motion quences,each of which consisted of three sim-ple motions concatenated together.Each quence has been projected onto thefirst two principal components computed for the full quence.The frames of the simple motions form clusters that can be modeled by Gaussian distributions,suggesting that GMM is a reasonable model for gmentation.Figure3:Two-dimensional
projection of groups of sim-ple motions:(a)Walking,running,and drinking;(b) walking,sitting down,and standing idle.Frames of each motion form their own clusters.The projection is done on thefirst two principal components for each quence. In(a),the natural continuous transitions among different clusters are not shown.
We u the Expectation Maximization(EM)algorithm to estimate the Gaussian Mixture Model of the data.In a pre-processing step,we u PCA to project the frames onto a lower dimensional subspace.This dimensionality reduction is meant to speed up the EM algorithm.The number of principal components is chon so that90%of the variance of the original data distribution is prerved. The number of principal components here is about32,as the complete quences are not simple.In the ideal ca, each cluster is reprented by a single Gaussian distribu-tion.Thus,a collection of k clusters can be reprented by a mixture of k Gaussian distributions.The EM al-gorithm is ud to estimate the parameters of the GMM such as mean,m j,covariance matrix,Σj,and prior,πj, for each of the Gaussians in the mixture.Note that the clusters are not of equal size.For example,longer mo-tion gments have more points and thus form larger clus-ters.To capture the size differences,EM also estimates the probabilityπj of a point belonging to a cluster C j. The EM algorithm for GMM is one of the classical tech-niques in machine learning,and we refer the reader to[
8] for algorithm details.After the GMM parameters are es-timated,we can compute a most likely cluster for each frame of the motion.A change of a behavior is then de-tected at time i if frames x i and x i+1belong to different clusters.
In practice,if a gment is shorter than1cond(120 frames),it is merged with its neighboring , the gment is split into two halves and its cluster labels are changed to tho of the gments before and after. This heuristic worked well and it often removed small gments that occur on the transitions between behaviors. Figure4shows the gmentation of a motion cap-ture data quence compod of7simple motions in the following order:walking,jumping forward,walk-
Figure4:The clusters for a motion capture quence of 7simple motions(walking,jumping forward,walking, punching,walking,kicking,punching).GMM is ud to find a mixture of k=4clusters.GMM provides the cor-rect gmentation and also identifies similar simple mo-tions.
ing,punching,walking,kicking,and punching.The ver-tical lines are the ground truth of the gmentations.Cut points between two motions are given as intervals cov-ering the time period of the transition.Each cut point is prented by three vertical lines,indicating the start,mid-dle,and end of the
motion transition as determined by a human obrver.Figure4also shows the cluster labels (1,2,3,4)assigned to the frames along the vertical axis of the graph.In this ca wefit a GMM model of k=4 clusters.The gmentation assignment of cluster labels also successfully identified similar simple actions.Here, the three walking motions are assigned the cluster label 2,and the two punching motions are assigned the cluster label1.
Unfortunately,we have to specify the number of clus-ters k for each execution of the GMM,and we usually do not know the number of clusters in a data t.As an alternative,GMM is often estimated for various values of k,and the mixture model that maximizes some crite-rion,such as the Bayesian Information Criterion(BIC) [22]is chon.This approach,however,did not em to be uful for our data becau it often returned a smaller than optimal number of clusters.Instead,fixing k at a small value such as4emed to produce reasonable re-sults without additional complexity.
We initialize the GMM clusters with random frames. Repeated experiments show that after removing short (noisy)gments,major gmentation points could al-ways be identified,independently of the initialization of the clusters.This innsitivity to the initial conditions may be due to the characteristics of our databa(e Figure3).Alternatively,we also performed experiments where we partitioned the entire motion quence into K concutive parts,and created the initial guess by picking
Figure5:The error matrix for the PCA algorithm.
Figure6:The error matrix for the PPCA algorithm. one random frame from every part.In this ca,the ini-tial points are less likely to fall into the samefinal cluster, which helped the convergence rate and in our ca pro-duced a slightly better result.However,the improvement over fully random initialization was small.
4Experiments
In this ction,we prent the results of two experiments conducted on a motion capture databa containing14-quences,each consisting of approximately8000frames. Each quence is a ries of about10simple motions. Typical human activities are reprented,such as walk-ing,running,sitting,standing idle,exercising,climbing, performing martial arts,washing a window,and sweep-ing afloor.
Thefirst experiment illustrates the fundamental as-sumptions behind our algorithms.The PCA approach is bad on the idea that the intrinsic dimensionality of a motion quence containing a single behavior should be smaller than the intrinsic dimensionality of a motion -quence containing multiple behaviors.To illustrate this idea,we lected7distinct simple behaviors:walking, running,sitting down,forward jumping,climbing,arm stretching and punching.For each behavior,we extracted two distinct occurrences from the databa.The length of